Proc. Natl. Acad. Sci. USA
Vol. 91, pp. 1059-1063, February 1994
Biochemistry

Hidden Markov models of biological primary sequence information
(multiple sequence alignments/protein modeling/adaptive algorithms/sequence classification)

PIERRE BALDI*†, YVES CHAUVIN‡§, TIM HUNKAPILLER*¶, AND MARCELLA A. MCCLURE‖**

*Division of Biology, California Institute of Technology, Pasadena, CA 91125; ‡NetID, Inc., San Francisco, CA 94107; ¶Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195; ‖Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92717; †Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109; and §Department of Psychology, Stanford University, Stanford, CA 94025

Communicated by Leroy Hood, October 12, 1993 (received for review January 14, 1993)

ABSTRACT Hidden Markov model (HMM) techniques are used to model families of biological sequences. A smooth and convergent algorithm is introduced to iteratively adapt the transition and emission parameters of the models from the examples in a given family. The HMM approach is applied to three protein families: globins, immunoglobulins, and kinases. In all cases, the models derived capture the important statistical characteristics of the family and can be used for a number of tasks, including multiple alignments, motif detection, and classification. For K sequences of average length N, this approach yields an effective multiple-alignment algorithm which requires O(KN^2) operations, linear in the number of sequences.

Comparative analysis of primary sequence information is a major tool in the elucidation of the molecular mechanisms of replication and evolution of organisms and the structure and function of proteins. For the simple case of pairwise sequence comparison, good algorithms exist (see refs. 1 and 2 for recent reviews) that can align two sequences of length N in roughly O(N^2) steps.
Most of these algorithms are based on dynamic programming (3), with location-independent substitution and gap penalties. Unfortunately, when dynamic programming is applied to a family of K sequences its behavior scales like O(N^K), exponentially in the number of sequences (4). A number of algorithms have been devised to try to tackle the multiple alignment problem (see refs. 5-7 for some of the most recent ones). Most protein sequence relationships exhibiting >50% identical residues can be aligned by several of these algorithms. Many of the most interesting protein families, however, exhibit conservation far below 50% identity. To date, alignment methods have not been developed that can correctly identify all the motifs that define each protein family (2). Here, we apply a different approach, based on hidden Markov models (HMMs), to the problem of modeling and aligning a family by using primary structure information only.
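
As an illustration of the pairwise dynamic-programming baseline described above, the following is a minimal Python sketch of a Needleman-Wunsch-style global alignment score. It is not taken from the paper: the function name `global_alignment_score` and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration, and the penalties are location-independent, as in the classical algorithms the text refers to. The nested loops fill one table cell per pair of positions, which is where the roughly O(N^2) cost for two sequences of length N comes from.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of strings a and b.

    Illustrative scoring scheme (an assumption, not from the paper):
    a fixed match/mismatch score and a fixed linear gap penalty,
    both independent of position in the sequence.
    """
    n, m = len(a), len(b)
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] aligned to a gap
                curr[j - 1] + gap,  # b[j-1] aligned to a gap
            )
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Extending the same table to K sequences requires one dimension per sequence, hence on the order of N^K cells. For N = 100 and K = 5, ignoring constant factors, that is 10^10 cells, against K·N^2 = 5 × 10^4 for a method that scales as O(KN^2).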