note05-1x2 - Bioinformatics and Computational Biology LCS...

Bioinformatics and Computational Biology

LCS is closely related to a group of central problems in Bioinformatics and Computational Biology.

A DNA string is a string over Σ = {A, T, C, G}. The human genome consists of DNA strings with a total length of about 3 × 10^9 letters.

A substring is a contiguous section of the string. (In contrast, a subsequence need not be contiguous.)

A gene is a substring of a DNA string that encodes a protein. Typical genes are 500 to 10,000 letters long.

Current estimate: there are 50K-100K genes in the human genome. Genes account for only about 5-10% of the total length of the DNA. The functions of the other portions of the DNA are unknown.

From a biology point of view, these notions might be oversimplified. But from a CS point of view, this is all we need.

© Xin He (University at Buffalo), CSE 431/531 Algorithm Analysis and Design

DNA Matching
Given two DNA strings S and T, how similar are they?

Finding Gene
Given a DNA string S and a gene T (T is just a much shorter DNA string), does S contain T?

The primary structure of a protein is a string consisting of 20 amino acids (the basic building blocks of proteins). So a protein is just a string over an alphabet Σ consisting of 20 symbols.

Three letters in a DNA string encode one amino acid. These three-letter sequences are called codons in molecular biology.

    CAC  (His/H) Histidine      AGC  (Ser/S) Serine
    CAA  (Gln/Q) Glutamine      AGA  (Arg/R) Arginine
    CAG  (Gln/Q) Glutamine      AGG  (Arg/R) Arginine
    ...

The correspondence between codons and amino acids is called the genetic code. It was deciphered in the 1960s by Holley, Khorana, and Nirenberg, who shared the Nobel Prize in 1968.
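As a rough illustration of the codon lookup and the "Finding Gene" question, here is a minimal Python sketch. It makes several assumptions that are not in the slides: the table holds only the six codons listed above (the full genetic code has 64), the names CODON_TABLE, translate, and contains_gene are hypothetical, unknown codons are rendered as "?", and "contains" is taken to mean an exact substring match.

```python
# Partial codon table: ONLY the six codons listed in the slides.
# The full genetic code has 64 codons; this is an illustrative subset.
CODON_TABLE = {
    "CAC": "H",  # Histidine
    "CAA": "Q",  # Glutamine
    "CAG": "Q",  # Glutamine
    "AGC": "S",  # Serine
    "AGA": "R",  # Arginine
    "AGG": "R",  # Arginine
}


def translate(dna, table=CODON_TABLE, unknown="?"):
    """Translate a DNA string three letters at a time.

    Codons missing from the table map to `unknown`; a trailing
    fragment of one or two letters is ignored.
    """
    usable = len(dna) - len(dna) % 3  # drop an incomplete final codon
    return "".join(
        table.get(dna[i:i + 3], unknown) for i in range(0, usable, 3)
    )


def contains_gene(s, t):
    """Exact-match version of 'does S contain T?' (T as a substring of S)."""
    return t in s


if __name__ == "__main__":
    print(translate("CACAGCCAA"))               # three codons: CAC, AGC, CAA
    print(contains_gene("TTCACAGCGG", "CACAGC"))
```

Exact substring matching is only a starting point: a gene that differs from T in a few letters would be reported as absent, which is why an approximate notion of matching is needed.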
String Alignment Problem

A protein molecule is a linear structure, but it folds into a complex 3D structure, called the secondary structure.

The functionality of a protein is related more to the secondary structure and less to the primary linear structure. Nevertheless, the secondary structure is determined by the primary structure.

In contrast, the functionality of a DNA string or gene is directly related to its linear structure.

We can ask similar questions about protein structures: given two proteins, how similar are they? ...

From a CS point of view, these are the same computational problems; the only difference is whether |Σ| = 4 or |Σ| = 20.

In biology, there are rarely perfect matches. A match of 90-95% is considered very good.

We have to define what we mean when we say that two DNA strings S and T are similar, or that S contains T ...
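The preview ends before the definition of similarity is given. Since the notes are titled after LCS (longest common subsequence), one plausible way to score similarity is sketched below in Python. This is an assumption, not the definition used in the course: the function names lcs_length and similarity are hypothetical, and dividing by the longer string's length is just one of several possible normalizations.

```python
def lcs_length(s, t):
    """Length of a longest common subsequence of s and t.

    Standard dynamic program, keeping only the previous row,
    so it uses O(len(s) * len(t)) time and O(len(t)) space.
    """
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                # Extend the best match of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last letter of one prefix or the other.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(s, t):
    """LCS length divided by the longer length (assumed normalization)."""
    longest = max(len(s), len(t))
    return lcs_length(s, t) / longest if longest else 1.0


if __name__ == "__main__":
    print(lcs_length("ACGT", "AGT"))
    print(similarity("ACGT", "AGT"))
```

The same code works unchanged for proteins, since nothing in it depends on whether the alphabet has 4 or 20 symbols. The quadratic running time is the practical obstacle for genome-scale strings.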

This document was uploaded on 02/02/2010.

