Methods Inf Med 2004; 43(01): 9-12
DOI: 10.1055/s-0038-1633414
Original Article
Schattauer GmbH

A Sequential Method for Discovering Probabilistic Motifs in Proteins

K. Blekas¹, D. I. Fotiadis¹, A. Likas¹
¹ Department of Computer Science, University of Ioannina, and Biomedical Research Institute, Foundation for Research and Technology – Hellas, Ioannina, Greece

Publication History

Publication Date: 07 February 2018 (online)

Summary

Objectives: This paper proposes a greedy algorithm that learns a mixture-of-motifs model by likelihood maximization in order to discover common substrings, known as motifs, in a given collection of related biosequences.

Methods: The approach adds motif components to a mixture model one at a time, using a combined global and local search to initialize the parameters of each new component. A hierarchical clustering scheme is applied first; it identifies candidate motif models and speeds up the global search.
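The sequential scheme described in Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name and parameter in it (`greedy_motifs`, `partial_em`, `strength`, `pseudo`, the starting weight of 0.1) is hypothetical. Two assumptions are made: the global search simply tries every distinct length-w window as a seed, in place of the hierarchical clustering step, and the local search is a few EM iterations that update only the new component and its mixing weight while the existing mixture stays fixed.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def windows(seqs, w):
    """All length-w substrings of the input sequences."""
    return [s[i:i + w] for s in seqs for i in range(len(s) - w + 1)]

def background(seqs):
    """Position-independent letter frequencies with add-one smoothing."""
    counts = {c: 1.0 for c in ALPHABET}
    for s in seqs:
        for ch in s:
            counts[ch] += 1.0
    total = sum(counts.values())
    return {c: v / total for c, v in counts.items()}

def pwm_from_seed(seed, strength=0.8):
    """Position weight matrix peaked on the seed letters (assumed init)."""
    rest = (1.0 - strength) / (len(ALPHABET) - 1)
    return [{c: (strength if c == ch else rest) for c in ALPHABET} for ch in seed]

def pwm_prob(pwm, x):
    """Probability of window x under a position weight matrix."""
    return math.prod(col[ch] for col, ch in zip(pwm, x))

def partial_em(xs, f_old, seed, iters=10, pseudo=0.1):
    """Local search: fit only the new component and its weight a."""
    pwm, a = pwm_from_seed(seed), 0.1
    for _ in range(iters):
        # E-step: responsibility of the new component for each window
        phi = [pwm_prob(pwm, x) for x in xs]
        r = [a * p / ((1 - a) * f + a * p) for p, f in zip(phi, f_old)]
        # M-step: re-estimate the mixing weight and each PWM column
        a = min(max(sum(r) / len(xs), 1e-6), 1 - 1e-6)
        pwm = []
        for k in range(len(seed)):
            col = {c: pseudo for c in ALPHABET}
            for x, rk in zip(xs, r):
                col[x[k]] += rk
            total = sum(col.values())
            pwm.append({c: v / total for c, v in col.items()})
    phi = [pwm_prob(pwm, x) for x in xs]
    f_new = [(1 - a) * f + a * p for p, f in zip(phi, f_old)]
    loglik = sum(math.log(f) for f in f_new)
    return loglik, pwm, a, f_new

def greedy_motifs(seqs, w, n_motifs):
    """Add motif components one at a time, keeping the best-scoring seed."""
    xs = windows(seqs, w)
    bg = background(seqs)
    f = [math.prod(bg[ch] for ch in x) for x in xs]  # background-only model
    motifs = []
    for _ in range(n_motifs):
        # Global search (assumed exhaustive): try each distinct window as seed
        fits = (partial_em(xs, f, seed) for seed in sorted(set(xs)))
        loglik, pwm, a, f = max(fits, key=lambda t: t[0])
        motifs.append((pwm, a, loglik))
    return motifs

def consensus(pwm):
    """Most probable letter at each motif position."""
    return "".join(max(col, key=col.get) for col in pwm)
```

Calling `greedy_motifs(seqs, w, n)` on a list of protein strings returns `n` tuples of (position weight matrix, mixing weight, log-likelihood after the addition), and `consensus` gives a quick readable summary of each matrix. Because previously found components remain in the mixture, no window is erased from the data when a further motif is added.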

Results: The algorithm was evaluated on both artificial and real biological datasets. Compared with the well-known MEME approach, it identifies motifs with significant conservation and produces larger protein fingerprints.

Conclusion: The proposed greedy algorithm is a promising approach for discovering multiple probabilistic motifs in biological sequences. Its incremental mixture modeling strategy overcomes a limitation of the MEME scheme, which erases motif occurrences each time a new motif is discovered.
