CMSC297_Gasarch_201110 - Why Biology and Medicine needed,...


Why Biology and Medicine needed, needs and will continue to need computational thinking (and their tools)
Héctor Corrada Bravo
Dept. of Computer Science, Center for Bioinformatics and Computational Biology, University of Maryland
University of Maryland, 10/18/2011
Thursday, November 3, 2011

Bioinformatics and Computational Biology
A. Cancer Genomics: a setting where computational tools have made significant impact in Biology and Medicine
B. Personalized Medicine: a setting where computational tools will continue to make significant impact
C. Where do we fit in the future of genomics

A super-quick introduction to Genome Biology

Why are my children such pigs?

Chromosomes
These are actually human, for a Down syndrome patient

DNA
DNAs (deoxyribonucleic acids) are molecules that store the genetic information of a living organism.
DNA consists of two polymers made from four types of nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T).
Purines: A, G; Pyrimidines: C, T
The two polymers are complementary to each other and form a double-helix structure:
5'-ACCGTTCGACGGTAA-3'
   |||||||||||||||
3'-TGGCAAGCTGCCATT-5'
Watson and Crick 1953

What is Genomics?
• Each cell contains a complete copy of an organism's genome, or blueprint for all cellular structures and activities.
• The genome is distributed along chromosomes, which are made of compressed and entwined DNA.
• Cells are of many different types (e.g. blood, skin, nerve cells), but all can be traced back to a single cell, the fertilized egg.
• Genomics is the study of molecular information to understand natural human variation and disease.
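The base-pairing rule on the DNA slide (A pairs with T, C pairs with G) can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` are hypothetical and do not come from the slides, and the sketch assumes the input contains only the four letters A, C, G and T.

```python
# Watson-Crick pairing: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the paired strand, base by base (read in the opposite direction)."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the paired strand rewritten in the conventional 5'->3' direction."""
    return complement(strand)[::-1]


top = "ACCGTTCGACGGTAA"            # the 5'->3' strand shown on the slide
print(complement(top))             # the 3'->5' strand shown underneath it
print(reverse_complement(top))     # the same strand written 5'->3'
```

Applied to the slide's top strand, `complement` reproduces the bottom strand as drawn (TGGCAAGCTGCCATT), which is a quick check that the two strands really are complementary.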
Measurement
• For a small enough piece, we can measure the sequence of bases, referred to as sequencing
• Human Genome Project

Genome
TCAGTTGGAGCTGCTCCCCCACGGCCTCTCCTCACATTCCACGTCCTGTAGCTCTATGACCTCCACCTTTGAGTCCCTCCTCTCACACCTGAC
ATGAAAAGGCACATGAGGATCCTCAAATACCCCGTGATCAGTCTCAGGGTAGCTCTCATAGCCTGGACAGGGCCCCCCTCGGGGGTTGCGCCC
AGGTCCAGGCGGGGGATGCACAGCAACAGTCACCGAAGCAGAAGCCGTCACAGTGGTGATGGGCTGGCAGTAGCTGGGCACAGAGCTGCCCAT
GGCGGTGGACGTTGGGTTCCGAGGGTTGTGAGAACGGGCCCCACGGGGCCCTGAGCGGTCCCTATTGCTAGGGCCAGAATGCCCTTCAGTAGA
AATTTCAAAAGCGTCTCTGCGCGGTCTGTAGGGGGGTGGCCGCAAGCCTTCTCTAGGGGGATCCCTTCGAGGCTGCTGGCCTTGCCGTCCAGG
GGACAAGGAGCCAGAGTCCAGGTGGGGCTGTTGCCGAGGGGTCAAGGGAGGCTGATGTCTGGAGTCCGGATGGACCACCTGCAGAGGAGAGAC
ATAGGTCAACACAGGGAGGTAGGATGGTGGTGATGTTCCACCCACAAAAGAAAACCTATTCCTTTAGAAACCTCCAGGATGTGAATCCTGCCT
GCACCTGCACAGCTGGCTGGAGGCATATAGCCACTGCCCATAGATCTCAACTTACCCTCACAACCAACTGCCCCCAGGCCTAAGTTCTCTGCC
TCAAAACTGCCAAGGCCTGGATAGCCAAGAGCCTGGGTGTCTTGGAAATATGCAACCATAAATAGTAGCTTTTAGAAGTATAAGGCTCCTGTT
TCTGGGTCATATTAGTGTTGTTTTCACCTGTCCCCAGCCCTAAGCCAGGTGTGGCCAGAAGCAAATGTACTGTAAGAGCAGAGCAAAAACTTC
CACACAGATAGTTCTGTTAGGCAATACATCTCTGCCTGACTATTAGGAATCTGGTTTCTGGGTCCTCTGTACAAAGCTCGGAGCAACACAGTG
GCCACATCAATCAAAAGGACCGTGACCAACTTCAAAGTCGGTGAGCTTGTACCTATTTTTAGGCTCCTGCTGAACAGAACCAGATTCACACTA
CAGCTCAGCAGGGCATCGTCACGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGGGGGGGGGGGTGGACAGAGGACGGGGAC
ACAATTCACTGGCCAGCCCTTCTCTCCTTCAAGGAAGGCTGCTCTAGCCTGGGACTGGAATACACATTTCCTGTAAACATGGTGGGGGCCTCA
GGCAAGCCAGAGTTTTGGAGCCTTCCTTAACTCTTCAAGGTGAGCATCTTGACTTGGAGGGTGGGGGTGCGGGTAAGGAAGGAACCTGTGGAC
TCCTCCCTACAAGACAGAAAAGGAATAAGCCACGAAGACAATAACGATTTTTGTATCAAGCGTCCTCTCCCATTTCAGCTTACCTGACAATGA
AATCAAATTCGGACCCTGCAAGCATCAGTACACCCAGCAGAGTGGACACAGCACCGTCCAGAACGGGAGCAAACATGTGCTCCAGAGCGAGCA
TAGCCCTGTGGTTCTTGTCCCCAATGGCTGTCAGAAAGGCCTGAACAAAGGAGAAAATTGACACGGTCACATTCTGGGTGTGGTAAAGTGCTC
AGCTGTGTCTATACTTGGGTTTTGTAT…
Total amount of DNA in human genome: $3 \times 10^{9}$ base pairs (bp)
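To get a feel for the size quoted on the Genome slide, here is a rough back-of-the-envelope sketch in Python. It is not from the slides: it assumes one copy of the genome, an alphabet of exactly four bases (so 2 bits per base, ignoring ambiguity codes such as N), and the helper name `packed_size_bytes` is hypothetical.

```python
GENOME_BP = 3 * 10**9  # base pairs, the figure quoted on the slide


def packed_size_bytes(n_bases: int, bits_per_base: int = 2) -> int:
    """Bytes needed to store n_bases when each base uses bits_per_base bits."""
    return n_bases * bits_per_base // 8


as_text = GENOME_BP                      # one 8-bit character per base
as_packed = packed_size_bytes(GENOME_BP)  # four possible bases fit in 2 bits

print(f"plain text: {as_text / 10**9:.2f} GB")
print(f"2-bit packed: {as_packed / 10**6:.0f} MB")
```

Under these assumptions a single genome occupies about 3 GB as plain text and about 750 MB when packed, which is why storing one genome is easy while measuring and comparing many of them is the harder computational problem.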
Replication
[Figure sequence: the complementary strands TTCGATTACGA / AAGCTAATGCT, free bases labeled "Bases available in our cells", and finally two identical copies of the double strand.]

Why are these two different?
Differences explained by 1-10% difference in genome
Similarities explained by similar genes

Genes
[Figure: regions labeled "Gene" along the sequence]

Central Dogma

Measurement: Microarrays
• More on this later

What makes them different?
Much human variation is due to difference in ~6 million base pairs (0.1% of genome) referred to as SNPs

How many base-pair differences?

Epigenetics
http://nihroadmap.nih.gov/EPIGENOMICS/images/epigeneticmechanisms.jpg
[Figure: Liver and Brain, each shown with the same sequence TTCGATTACGA / AAGCTAATGCT]

Cancer Genomics

What Do They Measure?
Microarrays
Gene Expression Arrays
Exon Arrays

Nucleic Acid
A T G C C G T T G C A
T A C G G C A A C G T

Hybridization
[Figure: the complementary strands ATGCCGTTGCA and TACGGCAACGT pair up; the strands ACCTTACGCTA and CCCTATCGCAT, which are not complementary to each other, are shown alongside.]

Before Hybridization: One Channel
[Figure: Sample 1 with Array 1; Sample 2 with Array 2]

After Hybridization
[Figure: Array 1; Array 2]

Affymetrix GeneChip® Arrays
[Figure labels: GeneChip Probe Array; Hybridized Probe Cell; single-stranded, labeled RNA target; oligonucleotide probe; 8 µm; 1.28 cm; millions of copies of a specific oligonucleotide probe; >1,000,000 different complementary probes; image of Hybridized Probe Array]

Measurements
[Figure: DATA MATRIX with rows = probes (genes) 1, 2, …, G (~50K) and columns = samples (individuals) 1, 2, …, N]

Computational Thinking
• Computational Challenge: Group samples (individuals) that show similar gene expression profiles

Computational Thinking
• Entities: expression profiles for individuals
• Relationship: similarity between profiles

Clustering
• Entities are points in Euclidean space, Relationship is distance.
• $\text{Sample}_1 = (E_{11}, E_{21}, \ldots, E_{G1})^\top$
• $\text{Sample}_2 = (E_{12}, E_{22}, \ldots, E_{G2})^\top$
• $E_{gi}$ = expression of gene $g$ in sample $i$

Most Famous Distance
• Euclidean distance
  – Example: distance between gene 1 and gene 2:
  – $d = \sqrt{\sum_{i=1}^{N} (E_{1i} - E_{2i})^2}$
• When N is 2, this is distance as we know it:
[Figure: map showing the distance between Baltimore and DC; axes: Latitude, Longitude]
When N is 20,000 you have to think abstractly

K-means Algorithm (Iteration = 0)
• We start with some data
• Interpretation:
  – We are showing expression for two genes for 14 samples

K-means Algorithm (Iteration = 0)
• Choose K (3) centroids
• These are starting values that the user picks.

K-means Algorithm (Iteration = 1)
• Make first partition by finding the closest centroid for each point
• This is where distance is used

K-means Algorithm (Iteration = 2)
• Now re-compute the centroids by taking the middle of each cluster

K-means Algorithm (Iteration = 3)
• Repeat until the centroids stop moving or until you get tired of waiting

[Slide shows the first page of the following article]
MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia
Scott A. Armstrong, Jane E. Staunton, Lewis B. Silverman, Rob Pieters, Monique L. den Boer, Mark D. Minden, Stephen E. Sallan, Eric S. Lander, Todd R. Golub* & Stanley J. Korsmeyer*
*These authors contributed equally to this work.
Published online: 3 December 2001, DOI: 10.1038/ng765
© 2002 Nature Publishing Group http://genetics.nature.com

Acute lymphoblastic leukemias carrying a chromosomal translocation involving the mixed-lineage leukemia gene (MLL, ALL1, HRX) have a particularly poor prognosis.
Here we show that they have a characteristic, highly distinct gene expression profile that is consistent with an early hematopoietic progenitor expressing select multilineage markers and individual HOX genes. Clustering algorithms reveal that lymphoblastic leukemias with MLL translocations can clearly be separated from conventional acute lymphoblastic and acute myelogenous leukemias. We propose that they constitute a distinct disease, denoted here as MLL, and show that the differences in gene expression are robust enough to classify leukemias correctly as MLL, acute lymphoblastic leukemia or acute myelogenous leukemia. Establishing that MLL is a unique entity is critical, as it mandates the examination of selectively expressed genes for urgently needed molecular targets.

A subset of human acute leukemias with a decidedly unfavorable prognosis possess a chromosomal translocation involving the mixed-lineage leukemia gene (MLL, HRX, ALL1) on chromosome segment 11q23 (refs 1–4). The leukemic cells, which typically have a lymphoblastic morphology, have been classified as […] the carboxy-terminal portion of 1 of more than 20 fusion partners (ref. 7). This has led to models of leukemogenesis in which the MLL fusion protein either may confer gain of function or neomorphic properties or may interfere with normal MLL function (with the MLL translocation representing a dominant-negative

Cancer Biomarkers
• My lab works on high-throughput methods for finding genomic biomarkers
• In particular: gene expression and DNA methylation
• Major constraint: accurate, sensitive and interpretable statistical models
[Figure labels: cancer, healthy]

A universal cancer signature?
[Figure: two principal component plots (PC 1 vs. PC 2) of Normal and Cancer samples; tissue labels: adrenal cortex, colon, breast lobular cells, cervix, head and neck epithelial cells, breast, rectum mucosa, sigmoid colon mucosa, tongue squamous cells]

Modeling
• Perhaps there is no universal cancer profile
• Tumors are not outliers
  • We don't have mostly homogeneous, except rare outliers
  • Heterogeneity is defining feature

antiProfile score
[Figure: expression (y-axis, 0 to 20) for the genes SLCO1B3, CXCL11, CLDN18, MAGEA6, INHBA, MMP10, IL24, S100A12, IL6, PAH, MAGEA12, ART3, MAGEA6, CXCL5, TNFAIP6]

antiProfile score
[Figure: antiProfile scores for individual Normal and Cancer samples, labeled by sample index]

antiProfile score
[Figure: ROC curves; axes: False positive rate, True positive rate]
adrenal_cortex: 0.99 AUC
colon: 0.95 AUC
endometrium: 0.99 AUC
kidney: 0.91 AUC
skin: 0.99 AUC
stomach: 0.85 AUC
vulva: 1.00 AUC
universal: 0.94 AUC

Personal Genomics
Personalized Medicine

Personal Genomics

Sequence Once Read Often
Read what?
- genome
- variants
- methylation
- expression
- other genome features
- medical literature
- risk models
- population information
- ...

Personal Genomics
• We need to produce reliable genome measurements, but on much bigger scale (Algorithmics, Systems)
• Multiple genome features, decide which are relevant and significant (Information Retrieval, Data Management)
• Population-based science, interpreted individually (Machine Learning/Statistics, Privacy)

NHGRI strategic plan
• What does the NIH think genomics should be for the next 10 years? [Nature, Feb. 2011]

The Essence of Genomics
• Comprehensiveness – Genomics aims to generate complete data sets. Although relatively easy to define and measure for a genome sequence, attaining comprehensiveness can be more challenging for other targets (for example, functional genomic elements or the 'proteome').
• Scale – Generation of comprehensive data sets requires large-scale efforts, demanding attention to: (1) organization; (2) robust data standards, to ensure high-quality data and broad utility; and (3) computational intensity.
• Technology development – Genomics demands high-throughput, low-cost data production, and requires that resources be devoted to technology development.
• Rapid data release – Large data catalogues and analytical tools are community resources.
• Social and ethical implications – Genomics research and the many ways in which genomic data are used have numerous societal implications that demand careful attention.

Schematic representation of accomplishments across five domains of genomics research
E. D. Green et al. Nature 470, 204-213 (2011) doi:10.1038/nature09764

Imperatives for genomic medicine
• Opportunities for genomic medicine will come from simultaneously acquiring foundational knowledge of genome function, insights into disease biology and powerful genomic tools. The following imperatives will capitalize on these opportunities in the coming decade.
• Making genomics-based diagnostics routine – Genomic technology development so far has been driven by the research market. In the next decade, technology advances could enable a clinician to acquire a complete genomic diagnostic panel (including genomic, epigenomic, transcriptomic and microbiomic analyses) as routinely as a blood chemistry panel.
• Defining the genetic components of disease – All diseases involve a genetic component. Genome sequencing could be used to determine the genetic variation underlying the full spectrum of diseases, from rare Mendelian to common complex disorders, through the study of upwards of a million patients; efforts should begin now to organize the necessary sample collections.
• Comprehensive characterization of cancer genomes – A comprehensive genomic view of all cancers will reveal molecular taxonomies and altered pathways for each cancer subtype. Such information should lead to more robust diagnostic and therapeutic strategies and a roadmap for developing new treatments (refs 74, 75).
• Practical systems for clinical genomic informatics – Thousands of genomic variants associated with disease risk and treatment response are known, and many more will be discovered. New models for capturing and displaying these variants and their phenotypic consequences should be developed and incorporated into practical systems that make information available to patients and their healthcare providers, so that they can interpret and reinterpret the data as knowledge evolves.
• The role of the human microbiome in health and disease – Many diseases are influenced by the microbial communities that inhabit our bodies (the microbiome)

One example: TCGA
• The Cancer Genome Atlas

Cancer Genomics
• Complexity in disease: TCGA data
• Complexity in measurement: Epigenetics

Where do we fit in?
• The major bottleneck in genome sequencing is no longer data generation: the computational challenges around data analysis, display and integration are now rate limiting. New approaches and methods are required to meet these challenges.
• Data analysis – Computational tools are quickly becoming inadequate for analysing the amount of genomic data that can now be generated, and this mismatch will worsen. Innovative approaches to analysis, involving close coupling with data production, are essential.
• Data integration – Genomics projects increasingly produce disparate data types (for example, molecular, phenotypic, environmental and clinical), so computational approaches must not only keep pace with the volume of genomic data, but also their complexity. New integrative methods for analysis and for building predictive models are needed.
• Visualization – In the past, visualizing genomic data involved indexing to the one-dimensional representation of a genome. New visualization tools will need to accommodate the multidimensional data from studies of molecular phenotypes in different cells and tissues, physiological states and developmental time. Such tools must also incorporate non-molecular data, such as phenotypes and environmental exposures. The new tools will need to accommodate the scale of the data to deliver information rapidly and efficiently.
• Computational tools and infrastructure – Generally applicable tools are needed in the form of robust, well-engineered software that meets the distinct needs of genomic and non-genomic scientists. Adequate computational infrastructure is also needed, including sufficient storage and processing capacity to accommodate and analyse large, complex data sets (including metadata) deposited in stable and accessible repositories, and to provide consolidated views of many data types, all within a framework that addresses privacy concerns. Ideally, multiple solutions should be developed (ref. 105).

Where do we fit in?
• Meeting the computational challenges for genomics requires scientists with expertise in biology as well as in informatics, computer science, mathematics, statistics and/or engineering. A new generation of investigators who are proficient in two or more of these fields must be trained and supported.

Thank you!
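Supplementary sketch for the clustering slides: the K-means steps described earlier (pick K starting centroids, assign each point to its closest centroid using Euclidean distance, re-compute each centroid as the middle of its cluster, repeat until the centroids stop moving) can be written out in plain Python roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the lecture: the function names, the `max_iter` cap (standing in for "until you get tired of waiting") and the toy data are all hypothetical, and an empty cluster simply keeps its previous centroid.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
        for p in points
    ]


def recompute(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            updated.append(old)  # assumption: an empty cluster keeps its centroid
        else:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return updated


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        updated = recompute(points, labels, centroids)
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
        labels = assign(points, centroids)
    return labels, centroids


# Toy data: expression of two genes (the two coordinates) in four samples.
samples = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centers = kmeans(samples, [(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centers)
```

With real expression data each sample would be a point with one coordinate per gene (on the order of 20,000 coordinates) rather than two, but the same distance function and loop apply unchanged. The result depends on the starting centroids, which is why the slides note that the user picks them.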

This note was uploaded on 01/13/2012 for the course CMSC 297 taught by Professor Staff during the Fall '11 term at Maryland.
