whatIsHMM_seanEddy_nbt04 - _computational BIOLOGY PRIMER...

Often, biological sequence analysis is just a matter of putting the right label on each residue. In gene identification, we want to label nucleotides as exons, introns, or intergenic sequence. In sequence alignment, we want to associate residues in a query sequence with homologous residues in a target database sequence. We can always write an ad hoc program for any given problem, but the same frustrating issues will always recur. One is that we want to incorporate heterogeneous sources of information. A genefinder, for instance, ought to combine splice-site consensus, codon bias, exon/intron length preferences and open reading frame analysis into one scoring system. How should these parameters be set? How should different kinds of information be weighted? A second issue is to interpret results probabilistically. Finding a best scoring answer is one thing, but what does the score mean, and how confident are we that the best scoring answer is correct? A third issue is extensibility. The moment we perfect our ad hoc genefinder, we wish we had also modeled translational initiation consensus, alternative splicing and a polyadenylation signal. Too often, piling more reality onto a fragile ad hoc program makes it collapse under its own weight.

Hidden Markov models (HMMs) are a formal foundation for making probabilistic models of linear sequence ‘labeling’ problems (refs. 1,2). They provide a conceptual toolkit for building complex models just by drawing an intuitive picture. They are at the heart of a diverse range of programs, including genefinding, profile searches, multiple sequence alignment and regulatory site identification. HMMs are the Legos of computational sequence analysis.

A toy HMM: 5′ splice site recognition

As a simple example, imagine the following caricature of a 5′ splice-site recognition problem. Assume we are given a DNA sequence that begins in an exon, contains one 5′ splice site and ends in an intron. The problem is to identify where the switch from exon to intron occurred, that is, where the 5′ splice site (5′ SS) is.

For us to guess intelligently, the sequences of exons, splice sites and introns must have different statistical properties. Let’s imagine some simple differences: say that exons have a uniform base composition on average (25% each base), introns are A/T rich (say, 40% each for A/T, 10% each for C/G), and the 5′ SS consensus nucleotide is almost always a G (say, 95% G and 5% A).
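To make these statistical differences concrete, here is a minimal Python sketch that scores every candidate 5′ SS position using only the base compositions stated above. It is not the full HMM: the text so far has given no state-transition probabilities, so this sketch ignores them and compares emission log-likelihoods alone. The names `EMISSIONS`, `emission_log_likelihood` and `best_splice_site`, and the requirement of at least one exon base and one intron base, are illustrative assumptions.

```python
import math

# Emission probabilities taken from the text: uniform exons,
# A/T-rich introns, and a 5' SS that is 95% G and 5% A.
EMISSIONS = {
    "exon":   {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5ss":    {"A": 0.05, "C": 0.00, "G": 0.95, "T": 0.00},
    "intron": {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40},
}


def log_prob(p):
    """Natural log of a probability, with log(0) treated as -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def emission_log_likelihood(seq, ss_pos):
    """Log-probability of seq if the 5' SS sits at index ss_pos (0-based).

    Bases before ss_pos are labeled exon, the base at ss_pos is the
    5' SS, and bases after it are labeled intron. Transition
    probabilities are deliberately left out (assumption of this sketch).
    """
    total = 0.0
    for i, base in enumerate(seq):
        if i < ss_pos:
            state = "exon"
        elif i == ss_pos:
            state = "5ss"
        else:
            state = "intron"
        total += log_prob(EMISSIONS[state][base])
    return total


def best_splice_site(seq):
    """Return (best index, all scores) over the candidate 5' SS positions.

    Candidates leave at least one exon base before and one intron base
    after the site, matching the setup in the text.
    """
    scores = {pos: emission_log_likelihood(seq, pos)
              for pos in range(1, len(seq) - 1)}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = best_splice_site("CCCGATTT")
    print("best 5' SS index:", best)
    for pos, score in scores.items():
        print(pos, score)
```

Working in log space avoids numerical underflow on long sequences, and a position whose base is C or T gets a score of negative infinity because the 5′ SS state never emits those bases under the stated composition.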