LecturesPart01 - Computational Biology, Part 1...


Computational Biology, Part 1
Introduction/Representing and Retrieving Sequences
Robert F. Murphy
Copyright © 1996, 2000-2006. All rights reserved.

Course Introduction
- What these courses are about
- What I expect
- What you can expect

Course numbers
- 03-311 (first half of 03-310)
- 03-310 & 42-334: no programming req’d
- 03-510 & 42-534: above plus programs
- 03-710 & 42-734: above plus paper

What these courses are about
- overview of ways in which computers are used to solve problems in biology
- supervised learning of illustrative or frequently-used algorithms and programs
- (03-510/710 & 42-534/42-734) supervised learning of programming techniques and algorithms selected from these uses

I expect
- students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry) and willingness to learn more
- students will have basic familiarity with use of computers (e.g., at the level of Computing Skills Workshop) and eagerness to gain new skills
- (03-510/710 & 42-534/734) students have some programming experience and willingness to work to improve
- heterogeneous class - I plan to include refreshers on each new topic
- students will ask questions in class and via email

You can expect
- Two major course sections
  - Computational Molecular Biology (Sequence & Structure Analysis)
  - Computational Cell Biology (modeling and image analysis)
- Class sessions: lectures/demonstrations/quizzes
- Pop quizzes on assigned reading and previous lectures
- Homework assignments
  - 60% of grade for 03-311
  - 60% of grade for 03-310
  - 50% of grade for 03-510
  - 50% of grade for 03-710
- Midterm March 7 (40% for 03-311, 20% for 03-510, 15% for others)
- Final (30% of grade for 03-310, 03-710, 25% for 03-510)
- Grades totally determined by points system
- Communication on class
  matters via email list

Textbooks for first half of course
- For all students
  - Required textbook: Bioinformatics: Sequence and Genome Analysis by David W. Mount
- For 03-510/710 students
  - Recommended additional textbook: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids by Durbin et al. (ISBN: 0-521-62971-3)
- Additional suggested book for non-Bio majors
  - Chap. 1 of Computational Molecular Biology, Peter Clote & Rolf Backofen (ISBN 0-471-87252-0) is an excellent introduction to molecular biology for non-biology majors

Web resources for CMU computational biology classes
- Web page (http://www.cmu.edu/bio/education/courses/03310 or 03311 or 03510 or 03710)
  - Lecture Notes (as PowerPoint files)
  - Homework Assignments (as Word files)
  - Additional materials as needed

Class schedule
- Tuesdays and Thursdays
  - 3:00 to 4:20, all
- Mondays
  - 11:30 to 12:20, 03-310/311 recitation
- Fridays
  - 1:30 to 2:20, 03-510/710 recitation

Information flow
- A major task in computational molecular biology is to “decipher” information contained in biological sequences
- Since the nucleotide sequence of a genome contains all information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers

Review of basic biochemistry
- Central Dogma: DNA makes RNA makes protein
- Sequence determines structure determines function

Structure
- macromolecular structure divided into
  - primary structure (1D sequence)
  - secondary structure (local 2D & 3D)
  - tertiary structure (global 3D)
- DNA composed of four nucleotides or "bases": A,C,G,T
- RNA also composed of four: A,C,G,U (T transcribed as U)
- proteins are composed of amino acids

DNA properties - base composition
- Some properties of long, naturally-occurring
  DNA molecules can be predicted accurately given only the base composition, usually expressed as either
  - %GC (the percent of all base pairs that are G:C), or
  - χGC (the mole fraction of all bases that are either G or C)
  - %GC = 100 * χGC

DNA properties - melting temperature
- Example of zero order sequence properties
  - Tm, the melting temperature, defined as the temperature at which half of the DNA is single-stranded and half is double-stranded
  - Tm (°C) = 69.3 + 41 * χGC (for 0.15 M NaCl)

DNA structure - restriction maps
- Restriction enzymes cut DNA at specific sequences.
- A restriction map is a graphical description of the order and lengths of fragments that would be produced by the digestion of a DNA molecule with one or more restriction enzymes

Restriction map of a circular plasmid with one enzyme
[Figure: circular map of plasmid pGEM4 with its AccII sites marked]

Restriction map of all enzymes that cut only once
[Figure: circular map of pGEM4 labeled with SspBI, BsrGI, Bsp1407I, AcsI, ApoI, EcoRI, Ecl136II, EcoICRI, SacI, SstI, Acc65I, Asp718I, AvaI, NheI, NaeI, NgoMI, NgoAIV, SgrAI, Eco47III, Aor51HI, DsaI, BsmFI, EcoNI, AflIII, AlwNI, AatII, SspI, XmnI, Asp700I, ScaI, Eco255I, XorII, PvuI, BspCI, AhdI, AspEI, Eam1105I, EclHKI, BpmI, GsuI, BglI, AviII, FspI]

Transcription
- transcription is accomplished by RNA polymerase
- RNA polymerase binds to promoters
- promoters have distinct regions "-35" and "-10"
- efficiency of transcription controlled by binding and progression rates
- transcription start and stop affected by tertiary structure
- regulatory sequences can be positive or negative

RNA processing
- eukaryotic genes are interrupted by introns
- these are "spliced" out to yield mRNA
- splicing done by spliceosome
- splicing sites are quite degenerate but not all are used

Translation
- conversion from RNA to protein is by codon: 3
  bases = 1 amino acid
- translation done by ribosome
- translation efficiency controlled by mRNA copy number (turnover) and ribosome binding efficiency
- translation affected by mRNA tertiary structure

Protein localization
- leader sequences can specify cellular location (e.g., insert across membranes)
- leader sequences usually removed by proteolytic cleavage

Posttranslational processing
- peptides fold after translation - may be assisted or unassisted
- processing enzymes recognize specific sites (amino acid sequences)
- protein signals can involve secondary and tertiary structure, not just primary structure

Representing and Retrieving Sequences

Definition
- A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids
  - DNA composed of four nucleotides or "bases": A,C,G,T
  - RNA also composed of four: A,C,G,U (T transcribed as U)
  - proteins are composed of amino acids (20)

Representation of Sequences
- characters
  - simplest
  - easy to read, edit, etc.
- bit-coding
  - more compact, both on disk and in memory
  - comparisons more efficient
  - more to come on this

Character representation of sequences
- DNA or RNA
  - use 1-letter codes (e.g., A,C,G,T)
- protein
  - use 1-letter codes
  - can convert to/from 3-letter codes (e.g., A = Ala = Alanine, C = Cys = Cysteine)

Representing uncertainty in nucleotide sequences
- It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is “possible” at a given position
  - to express ambiguity during sequencing
  - to express variation at a position in a gene during evolution
  - to express ability of an enzyme to tolerate more than one base at a given position of a recognition site

Representing uncertainty in nucleotide sequences
- To do this for nucleotides, we use a set of single character codes that represent all possible combinations of bases
- This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code

The I.U.B.
Code
- A, C, G, T, U
- R = A, G (puRine)
- Y = C, T (pYrimidine)
- S = G, C (Strong hydrogen bonds)
- W = A, T (Weak hydrogen bonds)
- M = A, C (aMino group)
- K = G, T (Keto group)
- B = C, G, T (not A)
- D = A, G, T (not C)
- H = A, C, T (not G)
- V = A, C, G (not T/U)
- N = A, C, G, T/U (iNdeterminate); X or - are sometimes used

Representing uncertainty in protein sequences
- Given the size of the amino acid “alphabet”, it is not practical to design a set of codes for ambiguity in protein sequences
- Fortunately, ambiguity is less common in protein sequences than in nucleic acid sequences
- Could use bit-coding as for nucleic acids but rarely done

Sequence File Formats
- Two characteristics of file formats
  - text or binary
  - minimal or annotated
- Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
- Binary files are usually readable only by the program that created them (e.g., MacVector)
- Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)
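
The I.U.B. table and the bit-coding idea mentioned earlier fit together naturally, so here is a minimal sketch of one way to combine them. It assumes a layout of one bit per base (A=1, C=2, G=4, T/U=8); that layout, the function names `encode`, `compatible` and `find_site`, and the pattern `GANTC` are illustrative choices, not something taken from the lecture. Under this assumption an ambiguity code is the bitwise OR of its bases, and two codes are compatible when their AND is non-zero.

```python
# One bit per base (an assumed layout, chosen for illustration only).
A, C, G, T = 1, 2, 4, 8

# I.U.B. codes expressed as bit masks; U is treated the same as T.
IUB_BITS = {
    "A": A, "C": C, "G": G, "T": T, "U": T,
    "R": A | G, "Y": C | T, "S": G | C, "W": A | T,
    "M": A | C, "K": G | T,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}


def encode(seq):
    """Turn a sequence string into a list of 4-bit masks."""
    return [IUB_BITS[ch] for ch in seq.upper()]


def compatible(code1, code2):
    """Two codes are compatible if they share at least one possible base."""
    return (IUB_BITS[code1.upper()] & IUB_BITS[code2.upper()]) != 0


def find_site(pattern, seq):
    """Return 0-based start positions where every pattern position
    shares at least one base with the corresponding sequence position."""
    p, s = encode(pattern), encode(seq)
    return [
        i
        for i in range(len(s) - len(p) + 1)
        if all(pb & sb for pb, sb in zip(p, s[i:i + len(p)]))
    ]


print(compatible("R", "A"))                  # True: R includes A
print(compatible("R", "Y"))                  # False: purines vs pyrimidines
print(find_site("GANTC", "AAGATTCGGAGTCC"))  # [2, 8]
```

Each comparison is a single AND on small integers, which is one reading of the "comparisons more efficient" point in the bit-coding slide. A real program would likely pack two codes per byte to get the compactness benefit as well.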
Examples of ASCII sequence file formats
- Line (MacVector), Plain Text (AssemblyLIGN)

CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC

Examples of ASCII sequence file formats
- Fasta (Entrez)

>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC

Examples of ASCII sequence file formats
- GCG (MacVector, GCG)

LOCUS       RATOBESE.G   539 BP   SS-RNA   ENTERED 09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION
KEYWORDS
SOURCE      Rattus norvegicus; Norway rat
  ORGANISM  Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia;
            Sciurognathi; Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
  AUTHORS   Murakami, T. & Shima, K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats.
  JOURNAL   Biochem. Biophys. Res.
            Commun., 209, 3, 944-952, (1995)
COMMENT     Database Reference: DDBJ RATOBESE
            DDBJ Accession: D49653
            -----------
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495
FEATURES    From  To/Span  Description
  pept        30      533  obese
  ????         1      539  source; /organism=Rattus norvegicus;
                           /strain=OLETF, LETO and Zucker;
                           /dev_stage=differentiated; /sequenced_mol=cDNA
                           to mRNA; /tissue_type=adipose
BASE COUNT  121 A  167 C  133 G  118 T  0 OTHER
ORIGIN      ?
RATOBESE.G  Length: 539  Jan 30, 1996 - 05:32 PM  Check: 5797 ..
        1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
       61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
      121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
      181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
      241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
      301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
      361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
      421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
      481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//

Sequence file format tips
- When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
- When retrieving from a database or exchanging between programs, use an annotated text format such as GCG
- When using a sequence again with the same program, use that program’s annotated binary format (or annotated text if binary not available)

Entrez
- a client-server system for retrieval of information related to molecular biology
- can be used
  - via web page
  - via "embedded" client in other software (e.g., via MacVector)
- provided by National Center for Biotechnology Information, part of the National Library of Medicine (NIH)

Entrez Databases
http://www.ncbi.nlm.nih.gov/
- PubMed: The biomedical literature
  - PubMed database contains Medline abstracts as well as links to full text articles on sites maintained by journal publishers
- PubMed Central: free, full text journal articles
- Books: online books
- OMIM: Online Mendelian Inheritance in Man
- Nucleotide sequence database (GenBank)
- Protein sequence database
- Genome: complete genome assemblies
- Structure: three-dimensional macromolecular structures

Entrez Databases
- Taxonomy: organisms in GenBank
- SNP: single nucleotide polymorphism
- PopSet: population study data sets
- And many more…

Entrez essentials
- Semi-automated entry of information into databases
- Critical to usefulness are the links between databases

Entrez literature searching
- can find papers on a given subject
- can find papers on a specific gene
- can find papers related to a given paper
- can
  switch between literature and sequence databases
- PubMed has links to publishers’ websites to view full text of articles
- PubMed Central has free full text copies

Entrez sequence searching
- can find sequences for a given gene or protein
- can download copy of sequence

Example Entrez Session
- Goal: Find literature and sequences for cystic fibrosis genes
  - Use OMIM with Keyword searching.
  - Switch to Nucleotide database to see sequence.
  - Switch to Protein database to see sequence.
  - Change to GenPept format to save sequence.
  - Use links to find related literature in PubMed.
  - Use Related Articles to find similar articles.
  - Search the Nucleotide database by gene name.
  - Set Limits to narrow down the search.

Example Entrez Session (screenshot slides, in order)
- home of Entrez
- search OMIM for ‘cystic fibrosis’
- first hit is CFTR
- after clicking Links -> Nucleotide
- after clicking Links -> Protein
- Protein sequence from original cDNA
- click send to save it
- Links -> PubMed
- paper in PubMed that is related
- Related Articles
- search Nucleotide for cftr
- 1012 hits related to cftr
- set limits as title and mRNA
- 141 hits with limits
- further narrow it down to human

Block Diagram for Entrez Literature Searching
[Figure: block diagram with boxes labeled Results of Previous Search, Additional Search Criterion, Displayed Item Selection, Desired Output Format, Entrez Search Engine, Results of Search (List), Item Display]
...
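
Once a sequence has been saved from Entrez in FASTA format, the base-composition relations from earlier in the lecture (%GC = 100 * χGC and Tm = 69.3 + 41 * χGC at 0.15 M NaCl) are easy to apply to it. The following is a rough sketch under stated assumptions: the function names `read_fasta`, `gc_fraction` and `melting_temp_c` are hypothetical, the parser only handles the plain ">header, then sequence lines" layout shown in the Fasta example, and the toy record is made up.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gc_fraction(seq):
    """Mole fraction of bases that are G or C (chi_GC in the lecture)."""
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def melting_temp_c(seq):
    """Tm in degrees C from Tm = 69.3 + 41 * chi_GC (for 0.15 M NaCl)."""
    return 69.3 + 41.0 * gc_fraction(seq)


# Made-up record for illustration; not a real database entry.
header, seq = read_fasta(">toy example record\nGGCC\nAATT\n")[0]
print(header, gc_fraction(seq), round(melting_temp_c(seq), 1))
# prints: toy example record 0.5 89.8
```

As a hand check against the RATOBESE record, its BASE COUNT line lists 167 C and 133 G out of 539 bases, so χGC is about 0.557 and the formula gives roughly 92.1 °C. The lecture states the relation for long, naturally occurring DNA molecules, so applying it to a short cDNA sequence is only an arithmetic illustration.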