ALSE - Motif Finding Tool
The University of Hong Kong
Department of Computer Science, The University of Hong Kong
Useful Links
 

ALSE Algorithm - An EM-like approach

   

Finding Common Patterns

 

Motif Finding Problem

The motif finding problem is to find common patterns of the binding sites for an unknown transcription factor from a set of sequences.
There are different models to the motif-finding problem. However, they all share a common weakness. Only strong-signal sequencs are used as input, i.e. they cannot confirm that the discovered patterns may also occur in other background and un-related sequences, therefore, should not be considered motifs.

 

Our Algorithm

 

Our approach

  1. Classic motif finding algorithms only consider sequences that are known (or believed) to contain the common patterns (call it the set T). This may lead to error in finding patterns that appear repeatedly in all the sequences, therefore leading to poor or inaccurate results. Our approach overcomes this problem by introducing an extra input to the problem -- the background sequences (call it the set set F). Our algorithm search for patterns that frequently appears in the set T but rarely appear in the set F.
  2. Sometimes, the pattern that we are searching for may appear more than once in a DNA sequence. Our algorithm consider that the sequences with repeated patterns.

Problem Definition

Given two sets of DNA sequences T and F, where T contain sequences that are believed to contain the motifs and F as a set of background sequences. Find those short patterns (motif) occur frequently in set T but rarely occur in set F.

To be precise, given T and F as defined previously, we want to find a probability matrix M (presenting a motif pattern) and a threshold α, such that the p-value is minimized.

The EM-like Approach

ALSE finds out those probability matrices and thresholds which result in the lowest p-value by applying an EM-like iterations:

  1. Find a set of patterns (represented as a probability matrix) with the lowest p-values, known as the set Seeds (say 100).
  2. For each M0 (with corresponding threshold α0) in the set Seeds, refine it to a matrix M1 (with α1) such that M1 yieids a lower probabilty p-value(M1,α1)
  3. Continue Step 2 until there is no improvement in the p-value, or the maximum number of iterations is reached.
  4. Output the motifs with smallest p-values (say 50).
 

Introduction Software Download Online Services Algorithm Used Useful Links Email: alse@cs.hku.hk