Topics Related to Sequencing

A history of genome sequencing
– Human Genome Project
– The competition between HGP and Celera Genomics
DNA sequencing theories
– Coverage theory
Sequence assembly

A History of Genome Sequencing

Haiyan Huang
Feb 12, 2009
PB HLTH C240D/STAT C245D/STAT 246

Major Landmarks in de novo DNA Sequencing before HGP Started (I)
1. 1953 – Discovery of the structure of the DNA double helix.
2. 1972 – Isolation of defined fragments of DNA.
3. 1977 – A. Maxam and W. Gilbert published a method for DNA sequencing by chemical degradation.
4. 1977 – F. Sanger published "DNA sequencing with chain-terminating inhibitors" (the Sanger method).

Frederick Sanger
Born on August 13, 1918. Bachelor's degree from St John's College, Cambridge, in 1939. PhD in 1943.
Twice a Nobel laureate in chemistry. The fourth (and only living) person awarded two Nobel Prizes.
First triumph: determined the complete amino acid sequence of the two polypeptide chains of insulin in 1955; crucial for developing ideas of how DNA codes for proteins. First Nobel Prize in Chemistry in 1958.
Second triumph: developed the chain-termination method for sequencing DNA (Sanger method) in 1975. Used his technique to successfully sequence the genome of the phage Φ-X174 in 1977, the first fully sequenced DNA-based genome. Second Nobel Prize in Chemistry in 1980.

Major Landmarks in de novo DNA Sequencing before HGP Started (II)
5. 1979 – Shotgun sequencing, a faster but more complex process using random fragments.

The chain-termination method can only be used for fairly short strands (100 to 1000 base pairs).
R. Staden introduced the computer program for shotgun sequencing:
– DNA is broken up randomly into numerous small segments;
– segments are sequenced to obtain reads;
– computer programs then use the overlapping ends of different reads to assemble them into a contiguous sequence.

An Example on Shotgun Sequencing
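The three steps listed for shotgun sequencing (fragment, read, assemble by overlapping ends) can be sketched as a toy greedy assembler. This is a minimal illustration under stated assumptions, not Staden's program: the function names `overlap` and `greedy_assemble`, the minimum overlap of 3 bases, and the four error-free reads are all hypothetical choices made for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored (assumed threshold).
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest end overlap."""
    # Remove exact duplicates, then reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    contigs = [r for r in reads if not any(r != s and r in s for s in reads)]

    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


# Hypothetical reads from two rounds of fragmenting the same 26-base sequence.
round_one = ["AGCATGCTGCAGTCATGCT", "TAGGCTA"]
round_two = ["AGCATG", "CTGCAGTCATGCTTAGGCTA"]
print(greedy_assemble(round_one + round_two))
```

Real reads contain base-calling errors and genomes contain repeats, so practical assemblers use approximate matching and far more careful bookkeeping than this greedy merge.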
Consider two rounds of shotgun reads:

In the above example, none of the four reads covers the full length of the original sequence.
The reads can be assembled into the original sequence by using the overlap of their ends to align and order them.

Whole genome shotgun sequencing
Subsequently widely adopted between 1995 and 2005.
Watch video …

Major Landmarks in de novo DNA Sequencing before HGP Started (III)
6. 1987 – Applied Biosystems marketed the first automated sequencing machine, the model ABI 370.
7. 1990 – Human Genome Project started (co-sponsored by NIH and DOE).

Sequencing Activities before HGP Started
1. In 1977, bacteriophage ΦX174 was the first genome to be sequenced, a viral genome with only 5,386 base pairs (by Sanger, using the Sanger method).
2. Bacteriophage λ (48,502 bp) was the first to be sequenced by the "shotgun" sequencing method.
3. In 1989, Andre Goffeau set up a European consortium (about 100 labs involved) to sequence the genome of the budding yeast S. cerevisiae (12.5 Mb).
   1. S. cerevisiae is about 60 times larger than any sequence previously attempted.
   2. Sequencing the human genome seemed out of reach of the technology due to its size of 3,000 Mb.

International Genome Project Consortium
Consisting mainly of US HGP researchers and British researchers from the Sanger Centre, as well as German, French, Japanese, and Chinese researchers and researchers from about 19 other countries.

Sequencing Activities before HGP Started
4. The following year saw the initiation of the Human Genome Project (1990). In the wake of this announcement came the start of three projects aimed at elucidating the sequences of smaller model organisms similar to S. cerevisiae in their academic utility: E. coli, M. capricolum, and C. elegans. It was hoped that these projects would increase the efficiency of sequencing, but unfortunately they fell short of this task.

Human Genome Project
Started in 1990, co-sponsored by DOE and NIH
Given 15 years and 3 billion dollars
Ultimate goal:
– find all of the estimated 20-30 thousand human genes
– determine the sequence of the 3 billion DNA building blocks which underlie the diversity of the human race
– accuracy: one error in 10,000 bases

Three Five Year Plans

The HGP staff broke their 15 years into three five-year plans:
– The original plan, covering 1990-1995, was finished in 1993.
– The second five-year plan was realigned to 1993-1998.
– (Early on, the scientists concentrated on inventing the resources necessary for efficient DNA sequencing, including biological, instrumental, and computing resources.)
– The third plan runs from 1998 to the proposed completion year of 2003. This plan was updated in March 1999, only 5 months after the release of the original plan in Science magazine. The new completion date is two years before the original ending date of the project.

Goals during the third five-year plan
"Although we have as our primary goal the finished 'book of life' by the end of 2003, we also want the working draft to be as useful as possible," said Dr. Ari Patrinos, Department of Energy associate director.
Studies into gene expression and control, the creation of mutations causing loss or alteration of function in model organisms, and the development of experimental and computational methods for protein analysis are other functional genomics goals.

The Research Goals during 1998-2003: 8 subsections
1. DNA Sequence: Finish the complete human genome sequence by the end of 2003. Finish at least one-third of the sequence by 2001.
2. Technology: Continue to increase the throughput and decrease the cost of sequencing technology.
3. Sequence Variation: Develop technology for quick mass identification and/or scoring of single nucleotide polymorphisms and other DNA sequence variants.
4. Comparative Genomics: Complete the fruit fly (Drosophila) sequence by 2002; complete the mouse sequence by 2008; identify other potentially useful model organisms.
5. Ethical, Legal and Social Implications: Explore how new genomic and genetic information may interact with different philosophical, theological, and ethical viewpoints; research how racial, ethnic, and socioeconomic differences may affect the use, knowledge, and understanding of genetic information and services, and the development of policy and/or laws on such matters.
6. Functional Genomics Technology: Create technology for complete analysis of gene expression; improve pre-existing methods for genome-spanning mutagenesis; create technology for large-scale protein analysis; support studies into methods for researching the functions of non-protein-coding DNA sequences.
7. Bioinformatics and Computational Biology:
   1. create better ways to facilitate data generation, capture, and annotation;
   2. create new and improve pre-existing facilities and databases for complete functional genetic and genomic research;
   3. improve on pre-existing and generate new tools for representing and examining similarities and changes in sequences;
   4. develop mechanisms to support effective, exportable, durable software for data sharing.
8. Training and Manpower: Increase the number of scholars trained in ethics, law, and social sciences who also have knowledge of genomic and genetic studies.

The top priority of the HGP remains to obtain and make publicly available a comprehensive and accurate gene/DNA reference sequence, though the written goals are many and diverse.

Important Dates of HGP
Most of the significant history created by HGP researchers was made during the third five-year plan. For example:

In April 1998, the project passed its midpoint, half of its time already having been dedicated to groundbreaking genetic research.
In March 1999, the completion date for a working draft of the human genome was accelerated to spring 2000 (the actual completion date was June 26). This was the first advancement to the 1998-2003 plan (a year ahead of schedule).

Important Dates of HGP (continued)
In December 1999, the first human chromosome (chromosome 22) was sequenced.
– Chromosome 22 is the location of defects which can cause DiGeorge syndrome, chronic myeloid leukemia, and neurofibromatosis. It is also the final autosome (non-sex chromosome) in the human sequence.
In March 2000, the Drosophila genome was completely sequenced.
In April 2000, the sequencing of chromosomes 5, 16, and 19 was announced.
In June 2000, the first working draft sequence of the human genome was released.

The Race to Decode Human Genome

Venter Team Entered the Scene
Recall: With the initiation of HGP, three projects were started to elucidate the sequences of smaller model organisms similar to S. cerevisiae in their academic utility: E. coli, M. capricolum, and C. elegans.
The first winner was an outsider: a team headed by J. Craig Venter (from The Institute for Genomic Research, TIGR) and Nobel laureate Hamilton Smith (from Johns Hopkins University) sequenced the 1.8 Mb bacterium H. influenzae in 1995, the first genome of a free-living organism to be sequenced.

Comparing the Conventional and New Methods
In conventional sequencing, the genome is broken down laboriously into ordered, overlapping segments, each containing up to 40 kb of DNA. These segments are "shotgunned" into smaller pieces and then sequenced to reconstruct the genome.
Venter's team used a more comprehensive approach by "shotgunning" the entire 1.8 Mb H. influenzae genome. Previously such an approach would have failed because the software did not exist to assemble such a massive amount of information accurately. TIGR Assembler was up to the task.

Craig Venter
B.S. in biochemistry in 1972; Ph.D. in physiology and pharmacology in 1975 from the University of California, San Diego; joined the National Institutes of Health in 1984.
In 1991, Venter's H. influenzae project failed to win funding from NIH. After that Venter left NIH.
In 1992, Venter founded The Institute for Genomic Research (TIGR), a non-profit genomics research institute.
In 1995, Venter and Smith succeeded in sequencing the genome in 13 months at a cost of 50 cents per base, which was half the cost of conventional sequencing and drastically faster.
Craig Venter, born in 1946

More Background Story Before the Race Started
In 1997, TIGR's dramatic leadership role in the field of genome sequencing was paralleled by the final completion of two of the largest genomic sequences:
– the bacterium E. coli K-12 and the yeast S. cerevisiae
– over seven years of intensive work by the international Genome Sequencing Consortium.
At the close of 1997, HGP researchers had sequenced less than 1.5% of the 3,000 Mb human genome.

Celera Genomics Started the Race
In May 1998, Celera Genomics was established, with Venter from TIGR as its first president.
Celera was formed for the purpose of generating and commercializing genomic information to accelerate the understanding of biological processes.
Celera received their validation in early 2000 when they successfully sequenced the Drosophila genome.

The Celera Genome Project
Celera accomplished their goal in less than three years, as compared to 13 years by HGP.
The Celera genome project ended up costing around $300 million, a 90% price reduction compared to the $3 billion for the HGP.
But we should note the following facts:
– A significant portion of the human genome had already been sequenced when Celera entered the field.
– Celera did not incur any costs in obtaining the existing data.
– Celera opened its data to non-commercial users later.

Tough Time for the HGP Researchers
In 1997, the estimated finish of the human genome by the year 2005 appeared quite optimistic:
– the world's large-scale sequencing capacity was approximately 100 Mb per year;
– to complete the genome, the average production had to increase to 400 Mb per year.
Celera's use of the whole genome shotgun strategy spurred the public HGP to change its own strategy, leading to a rapid acceleration of the public effort.
Currently, the most modern Sanger sequencers are able to sequence approximately 2.8 million base pairs per 24-hour period. At this rate, it takes almost three years to sequence the human genome.
The Human Genome Project could never have become a reality without modern computer facilities.

Competition Result
On June 26, 2000, in a press conference attended by President Bill Clinton and British Prime Minister Tony Blair, the International Human Genome Sequencing Consortium and Celera Genomics declared joint victory. This also put an end to their public criticisms of each other's methods.
The two groups each created an initial sequence of the human genome.

An Exciting Day: June 26, 2000 - some meeting details
Dr. Lane, science advisor to the President, called this an "extremely exciting day" and said it was a "forward looking time, because of the enormous opportunities for the use of this scientific information to benefit all people of the world."
Dr. Collins, the director of the National Human Genome Research Institute, said that it was "a happy day for science, and I think, for the public, both here and around the world."
President Clinton made a promise to the country. He pledged to continue and accelerate US involvement in translating this genetic "blueprint" into novel health care strategies and therapies.

Sequencer Developments
In 1997, the Sanger method sequenced at a rate of 100 Mb per year.
In 1998, Celera Genomics sequenced at a rate of about 1000 Mb per year.
In 2005, the modern Sanger method sequences at a rate of 2.8 Mb per day, or about 1000 Mb per year.
The dye-terminator sequencing method, along with automated high-throughput DNA sequence analyzers, is now being used for the vast majority of sequencing projects.

New Sequencing Methods (since 2005)
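As a baseline for the newer methods, the rates quoted on the "Sequencer Developments" slide can be converted into the time needed to read 3,000 Mb once. The sketch below is simple arithmetic under an assumption not made in the slides: a single pass over the genome with no redundant coverage, so real projects would take longer. The helper name `days_for_one_pass` is hypothetical.

```python
GENOME_MB = 3000  # human genome size quoted in the lecture, in megabases

# Throughput in Mb per day, taken from the rates quoted in the slides.
rates_mb_per_day = {
    "1997 Sanger capacity (100 Mb/year)": 100 / 365,
    "1998 Celera (~1000 Mb/year)": 1000 / 365,
    "2005 Sanger sequencer (2.8 Mb/day)": 2.8,
}


def days_for_one_pass(mb_per_day, genome_mb=GENOME_MB):
    """Days needed to read `genome_mb` megabases once at the given rate."""
    return genome_mb / mb_per_day


for label, rate in rates_mb_per_day.items():
    days = days_for_one_pass(rate)
    print(f"{label}: {days:,.0f} days (about {days / 365:.1f} years)")
```

At 2.8 Mb per day the single-pass figure comes out just under three years, which matches the "almost three years" estimate given earlier in the lecture.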
High-throughput sequencing technologies that parallelize the sequencing process can produce thousands or millions of sequences at once.
– In 2006, the Genome Sequencer FLX from 454 Life Sciences can sequence 300 Mb per day. At this rate, the human genome can be sequenced in about TEN DAYS!
Sequencing by hybridization is a method that uses a DNA microarray.
It is reasonable to expect that the so-far elusive $1000 human genome sequencing is only a few years away, which will finally yield individual genomes.

More about Venter and TIGR and Celera
Venter was criticized by the public project scientists, led by Francis Collins, for planning to profit from the data.
– The aim of the Celera project was to create a database of genomic data that users could subscribe to for a fee.
Most geneticists objected that Venter's method would not be accurate enough for a genome as complicated as the human one.
There were also concerns that Venter might shatter what was supposed to be an "international" face on a landmark event in history.
Venter was fired by Celera in early 2002; Venter had resisted efforts by the company board to change the strategic direction of the company.
In 2002, Venter left Celera and founded the J. Craig Venter Science Foundation.
In June 2005, he co-founded Synthetic Genomics, a firm dedicated to using modified microorganisms to produce ethanol and hydrogen as alternative fuels. He also helped assess genetic diversity in marine microbial communities.
In late 2006, TIGR became a division of the J. Craig Venter Institute (JCVI).
In March/April 2007, the divisions were dissolved and TIGR was absorbed under the JCVI name.

Bioinformatics Emerged and Was Quickly Maturing as a Field
Bioinformatics focuses on the acquisition, storage, access, analysis, modeling, and distribution of the many types of information embedded in DNA sequences and other types of genomic data.
This field will be challenged by the demands that growing volumes of information place on the algorithms currently used for sequence manipulation.

DNA Sequencing Theories
Coverage is the average number of reads representing a given nucleotide in the reconstructed sequence. It can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) as NL / G. This parameter also enables one to estimate other quantities, such as the percentage of the genome covered by reads (sometimes also called coverage).
A high coverage in shotgun sequencing is desired because it can overcome errors in base calling and assembly.
The subject of DNA sequencing theory addresses the relationships of such quantities.

Sequence Assembly
Sequence assembly refers to aligning and merging many fragments of a much longer DNA sequence in order to reconstruct the original sequence.
The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing a copy of the book back together from only shredded pieces. The book may have many repeated paragraphs, and some shreds may be modified to have typos. Excerpts from another book may be added in, and some shreds may be completely unrecognizable.
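To connect the coverage formula NL / G with the shredding picture, the sketch below "shreds" a genome into random fixed-length reads and measures how much of it ends up covered. Two things here are assumptions not stated in the slides: reads start uniformly at random, and the expected covered fraction is approximated by 1 - e^(-c), a standard Poisson-style estimate in sequencing theory. All function names and the sizes G = 100,000, L = 500, N = 1,000 are hypothetical.

```python
import math
import random


def coverage(n_reads, read_len, genome_len):
    """Average coverage c = N * L / G, as defined in the lecture."""
    return n_reads * read_len / genome_len


def expected_fraction_covered(c):
    """Poisson approximation (assumed): a base is missed with prob. e^(-c)."""
    return 1.0 - math.exp(-c)


def simulate_fraction_covered(n_reads, read_len, genome_len, seed=0):
    """Place reads uniformly at random and count the bases hit at least once."""
    rng = random.Random(seed)
    covered = bytearray(genome_len)  # 0 = not covered, 1 = covered
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        covered[start:start + read_len] = b"\x01" * read_len
    return sum(covered) / genome_len


G, L, N = 100_000, 500, 1_000  # illustrative sizes only
c = coverage(N, L, G)
print(f"coverage c = {c:.1f}")
print(f"expected fraction covered  = {expected_fraction_covered(c):.4f}")
print(f"simulated fraction covered = {simulate_fraction_covered(N, L, G):.4f}")
```

Even at fivefold coverage a small fraction of bases is expected to be missed, which is one reason shotgun projects aim for high coverage.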
This note was uploaded on 03/14/2010 for the course C 240D, C24 taught by Professor S.dudoit during the Spring '09 term at University of California, Berkeley.