LecturesPart01 - Computational Biology, Part 1 Introduction...


Computational Biology, Part 1
Introduction

Robert F. Murphy
Copyright © 1996, 2000-2009. All rights reserved.

Course Introduction
- What these courses are about
- What I expect
- What you can expect

Course numbers
- 03-510 = 42-434: undergraduate course
- 03-710 = 42-734: graduate course
  - Difference is an additional research paper for the graduate course

What these courses are about
- overview of ways in which computers are used to solve problems in biology
- supervised learning of illustrative or frequently-used algorithms and programs
- supervised learning of programming techniques and algorithms selected from these uses

I expect
- students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry) and willingness to learn more
- students have some programming experience and willingness to work to improve
- heterogeneous class - I plan to include refreshers on each new topic
- students will ask questions in class and via email

You can expect
- Two major course sections
  - Computational Molecular Biology (Sequence & Structure Analysis)
  - Computational Cell Biology (Modeling and Image Analysis)
- Class sessions: lectures/demonstrations
- Recitations: reviews/quizzes/help
- Quizzes on assigned reading/previous lectures (5% of grade)
- Homework assignments (50% for 03-510, 40% for 03-710)
- Midterm March 5 (20% of grade)
- Final (25% of grade)
- Research Paper (10% for 03-710)
- Grades determined by weighted average of components
- Communication on class matters via email list

Textbooks for first half of course
- For all students
  - Required textbook: An Introduction to Bioinformatics Algorithms
- Recommended additional textbook
  - Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids by Durbin et al. (ISBN: 0-521-62971-3)

Web page
- http://www.cmu.edu/bio/education/courses/03510 or 03710
  - Lecture Notes (as PowerPoint files)
  - Homework Assignments (as PDF files)
  - Additional materials as needed

Class schedule
- Tuesdays and Thursdays
  - 3:00 to 4:20 lecture
- Fridays
  - 1:30 to 2:20 recitation

Information flow
- A major task in computational molecular biology is to "decipher" information contained in biological sequences
- Since the nucleotide sequence of a genome contains all information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers

Review of basic biochemistry
- Central Dogma: DNA makes RNA makes protein
- Sequence determines structure determines function

Structure
- macromolecular structure divided into
  - primary structure (1D sequence)
  - secondary structure (local 2D & 3D)
  - tertiary structure (global 3D)
- DNA composed of four nucleotides or "bases": A,C,G,T
- RNA composed of four also: A,C,G,U (T transcribed as U)
- proteins are composed of amino acids

DNA properties - base composition
- Some properties of long, naturally occurring DNA molecules can be predicted accurately given only the base composition
- Since double-stranded DNA should have the same number of As as Ts, DNA base composition is usually expressed as %GC (the percent of all base pairs that are G:C) or χGC (fraction of all base pairs that are G:C)

DNA properties - melting temperature
- Example of zero order sequence properties
  - Tm, the melting temperature, defined as the temperature at which half of the DNA is single-stranded and half is double-stranded

[Figure: fraction of separate strands (dashed line); fraction of double-stranded base pairs]

DNA properties - melting temperature
http://www.nordita.dk/~metz/dnadenaturation.html

    Tm (°C) = 69.3 + 41 χGC    (for 0.15 M NaCl)

DNA structure - restriction maps
- Restriction enzymes cut DNA at specific sequences.
- A restriction map is a graphical description of the order and lengths of fragments that would be produced by the digestion of a DNA molecule with one or more restriction enzymes

Restriction map for circular plasmid
http://www.fermentas.com/techinfo/nucleicacids/mapfx174.htm

http://www.cbs.dtu.dk/staff/dave/DNA_CenDog.html

Transcription
- transcription is accomplished by RNA polymerase
- RNA polymerase binds to promoters
- promoters have distinct regions "-35" and "-10"
- efficiency of transcription controlled by binding and progression rates
- transcription start and stop affected by tertiary structure
- regulatory sequences can be positive or negative

RNA processing
- eukaryotic genes are interrupted by introns
- these are "spliced" out to yield mRNA
- splicing done by spliceosome
- splicing sites are quite degenerate but not all are used
- same transcript can be spliced in multiple ways ("alternative splicing")

RNA splicing
http://genes.mit.edu/chris/

Translation
- conversion from RNA to protein is by codon: 3 bases = 1 amino acid
- translation done by ribosome
- translation efficiency controlled by mRNA copy number (turnover) and ribosome binding efficiency
- translation affected by mRNA tertiary structure

Translation
http://www.biotopics.co.uk/genes/trans.html

Protein localization
- leader sequences can specify cellular location (e.g., insert across membranes)
- leader sequences usually removed by proteolytic cleavage

Protein localization
http://fig.cox.miami.edu/~cmallery/150/cells/organelle.htm

Posttranslational processing
- peptides fold after translation - may be assisted or unassisted
- processing enzymes recognize specific sites (amino acid sequences)
- protein signals can involve secondary and tertiary structure, not just primary structure


Representing and Retrieving Sequences

Definition
- A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids
  - DNA composed of four nucleotides or "bases": A,C,G,T
  - RNA composed of four also: A,C,G,U (T transcribed as U)
  - proteins are composed of amino acids (20)

Representation of Sequences
- characters
  - simplest
  - easy to read, edit, etc.
- bit-coding
  - more compact, both on disk and in memory
  - comparisons more efficient
  - more to come on this

Character representation of sequences
- DNA or RNA
  - use 1-letter codes (e.g., A,C,G,T)
- protein
  - use 1-letter codes
    - can convert to/from 3-letter codes (e.g., A = Ala = Alanine, C = Cys = Cysteine)

Representing uncertainty in nucleotide sequences
- It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is "possible" at a given position
  - to express ambiguity during sequencing
  - to express variation at a position in a gene during evolution
  - to express ability of an enzyme to tolerate more than one base at a given position of a recognition site

Representing uncertainty in nucleotide sequences
- To do this for nucleotides, we use a set of single character codes that represent all possible combinations of bases
- This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code

The I.U.B. Code
- A, C, G, T, U
- R = A, G (puRine)
- Y = C, T (pYrimidine)
- S = G, C (Strong hydrogen bonds)
- W = A, T (Weak hydrogen bonds)
- M = A, C (aMino group)
- K = G, T (Keto group)
- B = C, G, T (not A)
- D = A, G, T (not C)
- H = A, C, T (not G)
- V = A, C, G (not T/U)
- N = A, C, G, T/U (iNdeterminate); X or - are sometimes used

Representing uncertainty in protein sequences
- Given the size of the amino acid "alphabet", it is not practical to design a set of codes for ambiguity in protein sequences
- Fortunately, ambiguity is less common in protein sequences than in nucleic acid sequences
- Could use bit-coding as for nucleic acids but rarely done


Sequence File Formats

Sequence file formats
- Two characteristics of file formats
  - text or binary
  - minimal or annotated
- Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
- Binary files are usually readable only by the program that created them (e.g., MacVector)
- Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)

Examples of ASCII sequence file formats
- Fasta

    >gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
    CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
    CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
    ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
    ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
    GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
    CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
    TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
    GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC

Examples of ASCII sequence file formats
- GCG

    LOCUS       RATOBESE.G    539 BP    SS-RNA    ENTERED 09/23/95
    DEFINITION  Rat mRNA for obese.
    ACCESSION
    KEYWORDS
    SOURCE      Rattus norvegicus; Norway rat
      ORGANISM  Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata;
                Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi;
                Myomorpha; Muridae; Murinae; Rattus
    REFERENCE   [1]
      AUTHORS   Murakami, T. & Shima, K.
      TITLE     Cloning of rat obese cDNA and its expression in obese rats.
      JOURNAL   Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
    COMMENT     Database Reference: DDBJ RATOBESE
                Accession: D49653
                -----------
                Submitted (10-Mar-1995) to DDBJ by:
                Takashi Murakami
                Department of Laboratory Medicine
                School of Medicine
                University of Tokushima
                Kuramotocho 3-chome
                Tokushima 770
                Japan
                Phone: +81-886-33-7184
                Fax: +81-886-31-9495
    [continued]

Examples of ASCII sequence file formats
- GCG [continued]

    FEATURES    From   To/Span   Description
      pept        30       533   obese
      ????         1       539   source; /organism=Rattus norvegicus;
                                 /strain=OLETF, LETO and Zucker;
                                 /dev_stage=differentiated; /sequenced_mol=cDNA
                                 to mRNA; /tissue_type=adipose
    BASE COUNT  121 A    167 C    133 G    118 T    0 OTHER
    ORIGIN      ?

    RATOBESE.G  Length: 539  Jan 30, 1996 - 05:32 PM  Check: 5797 ..

        1  CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
       61  GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
      121  CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
      181  TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
      241  GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
      301  CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
      361  TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
      421  TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
      481  AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
    //


Entrez

Entrez Databases
http://www.ncbi.nlm.nih.gov/
- PubMed: The biomedical literature
  - PUBMED database contains Medline abstracts as well as links to full text articles on sites maintained by journal publishers
- PubMed Central: free, full text journal articles
- Books: online books
- OMIM: Online Mendelian Inheritance in Man
- Nucleotide sequence database (Genbank)
- Protein sequence database
- Genome: complete genome assemblies
- Structure: three-dimensional macromolecular structures

Entrez Databases
- Taxonomy: organisms in GenBank
- SNP: single nucleotide polymorphism
- PopSet: population study data sets
- And many more...

Entrez essentials
- Semi-automated entry of information into databases
- Critical to usefulness are the links between databases

Entrez literature searching
- can find papers on a given subject
- can find papers on a specific gene
- can find papers related to a given paper
- can switch between literature and sequence databases
- PubMed has links to publishers' websites to view full text of articles
- PubMed Central has free full text copies

Entrez sequence searching
- can find sequences for a given gene or protein
- can download copy of sequence

Example Entrez Session
- Goal: Find literature and sequences for cystic fibrosis genes
  - Use OMIM with Keyword searching.
  - Switch to Nucleotide database to see sequence.
  - Switch to Protein database to see sequence.
  - Change to GenPept format to save sequence.
  - Use links to find related literature in PubMed.
  - Use Related Articles to find similar articles.
  - Search the Nucleotide database by gene name.
  - Set Limits to narrow down the search.

[Screenshot slides]
- Example Entrez Session: home of Entrez
- Example Entrez Session: search OMIM for 'cystic fibrosis'
- Example Entrez Session: first hit is CFTR
- Example Entrez Session: after clicking Links -> Nucleotide
- Example Entrez Session: after clicking Links -> Protein
- Example Entrez Session: Protein sequence from original cDNA
- Example Entrez Session: change 'Send to' to 'File'
- Example Entrez Session: Links -> PubMed
- Example Entrez Session: paper in PubMed that is related
- Example Entrez Session: Related Articles

Computation of related articles
- Similarity between documents is measured by the words they have in common:
  - Which words are considered?
  - What is the weight of each word?
  - How do we calculate a similarity score of two articles?

Computation of related articles: words considered
- Remove stopwords: uninformative
- Stem words
- Words from the abstract are "text words"
- Words from the title are put in twice
- Words from the MeSH terms
  - U.S. National Library of Medicine
  - Vocabulary used for indexing articles
  - Consistent way to retrieve information

View the MeSH terms: change 'Display' to 'Citation'

Computation of related articles: weight of each word
- Global weight:
  - Greater, if the word is less frequent in the whole database
- Local weight:
  - Greater, if the word is more frequent in the document
  - Longer document is not favored

Computation of related articles: similarity score of two articles
- Weight of one pair of common words: local wt1 * local wt2 * global wt
- Similarity of two articles: sum of weights of all common words
- The higher the score the closer the two articles
- Similarity scores are pre-computed

[Screenshot slides]
- Example Entrez Session: search Nucleotide for cftr
- Example Entrez Session: 1249 hits related to cftr
- Example Entrez Session: set limits as title and mRNA
- Example Entrez Session: 46 hits with limits
- Example Entrez Session: further narrow it down to human

Block Diagram for Entrez Literature Searching
[Diagram boxes: Results of Previous Search; Additional Search Criterion; Displayed Item Selection; Desired Output Format; Entrez Search Engine; Results of Search (List); Item Display]

Reading for next class
- Read Chapter 1
- Depending on background, read Chapter 2 and/or Chapter 3
- Read Chapter 4 through section 4.6
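
Worked example: base composition and melting temperature

The "DNA properties" slides give the relation Tm (°C) = 69.3 + 41 χGC for 0.15 M NaCl. The sketch below only illustrates that arithmetic. The function names are made up for this example, and the short test sequence is a toy input: the slides state the relation for long, naturally occurring DNA, so the number printed here should not be read as a real prediction for a six-base fragment.

```python
def gc_fraction(seq: str) -> float:
    """Return chi_GC, the fraction of bases that are G or C.

    For double-stranded DNA, the G+C fraction of one strand equals
    the fraction of base pairs that are G:C.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def melting_temp_c(chi_gc: float) -> float:
    """Tm in degrees C at 0.15 M NaCl, using Tm = 69.3 + 41 * chi_GC."""
    return 69.3 + 41.0 * chi_gc


if __name__ == "__main__":
    chi = gc_fraction("ATGCGC")  # toy sequence, for the arithmetic only
    print(f"chi_GC = {chi:.3f}, Tm = {melting_temp_c(chi):.1f} C")
```

With χGC = 0.5 the formula gives 69.3 + 20.5 = 89.8 °C.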
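
Worked example: bit-coding the I.U.B. code

The "Representation of Sequences" slide mentions bit-coding without giving a scheme. One common choice, assumed here and not taken from the slides, is one bit per base (A=1, C=2, G=4, T=8), so that each I.U.B. ambiguity code becomes the bitwise OR of the bases it allows. A concrete base is then compatible with a code when its bit is contained in the code's bits. The helper names and the pattern "GTYRAC" are illustrative only.

```python
# Assumed encoding (not from the slides): one bit per base.
A, C, G, T = 1, 2, 4, 8

# I.U.B. codes from the slide, expressed as unions of base bits.
IUB_BITS = {
    "A": A, "C": C, "G": G, "T": T, "U": T,
    "R": A | G, "Y": C | T, "S": G | C, "W": A | T,
    "M": A | C, "K": G | T,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}


def encode(seq):
    """Convert a string of I.U.B. characters to a list of 4-bit codes."""
    return [IUB_BITS[ch] for ch in seq.upper()]


def matches(site, seq):
    """True if every position of seq is allowed by the code at that position of site."""
    if len(site) != len(seq):
        return False
    return all((s & b) == b for s, b in zip(encode(site), encode(seq)))


def find_sites(site, seq):
    """Return 0-based start positions where site (with ambiguity codes) matches seq."""
    n = len(site)
    return [i for i in range(len(seq) - n + 1) if matches(site, seq[i:i + n])]


if __name__ == "__main__":
    print(find_sites("GTYRAC", "AAGTTAACGG"))
```

This is the situation the slides describe for recognition sites that tolerate more than one base at a position: the pattern is written once with ambiguity codes, and each comparison is a single AND.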
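
Worked example: reading a Fasta file

The Fasta example above has a header line starting with ">" followed by sequence lines. A minimal parser for that layout might look like the sketch below. The function name is hypothetical, the input is only the header and the first two sequence lines of the slide's record (so it is a truncated sequence), and real files can carry conventions this sketch ignores.

```python
def parse_fasta(text):
    """Parse Fasta-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Header and first two sequence lines of the slide's example (truncated).
LINE1 = "CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC"
LINE2 = "CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC"
EXAMPLE = ">gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.\n" + LINE1 + "\n" + LINE2 + "\n"

if __name__ == "__main__":
    for name, seq in parse_fasta(EXAMPLE):
        print(name, len(seq))
```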
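
Worked example: translation by codon

The "Translation" slide states that 3 bases = 1 amino acid. The sketch below applies the standard genetic code to a DNA coding sequence, reading codons from a given start until a stop codon. The function name is hypothetical, the table is built from the usual compact TCAG ordering, and the input is the first 18 bases of the "pept" feature of the rat obese record (which starts at base 30 in the GCG listing).

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(seq, start=0):
    """Translate seq from 0-based position start until a stop codon or the end.

    RNA input is accepted (U is treated as T). Codons containing
    ambiguity codes are reported as 'X'.
    """
    seq = seq.upper().replace("U", "T")
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[pos:pos + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    # Bases 30-47 of the rat obese mRNA example.
    print(translate("ATGTGCTGGAGACCCCTG"))
```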
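
Worked example: similarity score of two articles

The "Computation of related articles" slides define the score as the sum, over all words two articles share, of local wt1 * local wt2 * global wt. They do not give the weight functions themselves, so the ones below are stand-ins chosen only to satisfy the stated properties: the local weight grows with a word's frequency in the document and is normalized by document length, and the global weight grows as the word becomes rarer in the collection. Stopword removal, stemming, title doubling, and MeSH terms are left out. All names are hypothetical, and this is not PubMed's actual formula.

```python
import math
from collections import Counter


def local_weights(words):
    """Assumed local weight: relative frequency, so longer documents are not favored."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}


def global_weights(docs):
    """Assumed global weight: larger for words found in fewer documents."""
    n = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n / d) + 1.0 for w, d in doc_freq.items()}


def similarity(doc1, doc2, gw):
    """Sum of local wt1 * local wt2 * global wt over the words the documents share."""
    lw1, lw2 = local_weights(doc1), local_weights(doc2)
    common = set(lw1) & set(lw2)
    return sum(lw1[w] * lw2[w] * gw[w] for w in common)


if __name__ == "__main__":
    docs = [
        ["cftr", "gene", "mutation"],
        ["cftr", "chloride", "channel"],
        ["leptin", "obese", "gene"],
    ]
    gw = global_weights(docs)
    print(similarity(docs[0], docs[1], gw))
    print(similarity(docs[1], docs[2], gw))
```

Because the scores depend only on the stored documents, they can be computed ahead of time for every pair, which is consistent with the slide's note that similarity scores are pre-computed.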

This note was uploaded on 12/03/2011 for the course BIO 118 taught by Professor Staff during the Fall '08 term at Rutgers.
