Lecture_12_genome_sequencing

Illumina data
–  LOCAS: resequencing genomes.
–  HapAssembler: for sequencing highly polymorphic genomes.

Problems!
Unfortunately, the overlap-layout-consensus approach will not work for NGS data or for significantly large genomes:
–  There is too much data. Calculating the overlap for each pair of reads would take far too much time, since the number of read pairs grows quadratically with the number of reads.
–  There has to be a new method for fragment assembly.

De Bruijn Graph Approach to Assembly

De Bruijn Graph for Assembly
•  Introduced in 1989. Pevzner. J Biomol Struct Dyn (1989) 7:63-73. Idury & Waterman. J Comput Biol (1995) 2:291-306.
•  Adapted for next generation sequencing data. Euler-SR: Chaisson & Pevzner. Genome Res. (2008) 18:324-30. Velvet: Zerbino & Birney. Genome [...]

[...] been sequenced?
•  We've sequenced a number of genomes, but several genomes remain difficult.
•  Plant genomes are very hard because they are extremely long, contain huge repeat regions, and are polyploid.
•  Note: we do not distinguish between genotypes; that is a separate problem.

Organism                   Type            Genome Size   No. of predicted genes
Homo sapiens               Human           3.2 Gb        20,251
Takifugu rubripes          Puffer fish     390 Mb        22,000-29,000
Oryza sativa               Rice            420 Mb        32,000-50,000
Anopheles gambiae          Mosquito        278 Mb        13,700
Saccharomyces cerevisiae   Baker's yeast   12.1 Mb       6,200
Cucumis sativus            Cucumber        367 Mb        27,000

•  kb (= kbp) = kilo base pairs = 1,000 bp
•  Mb = mega base pairs = 1,000,000 bp
•  Gb = giga base pairs = 1,000,000,000 bp

Assembly Evaluation
•  How can we tell the difference between a good assembly and a bad assembly?
–  Answer: the N50 statistic, a measure of the length of a set of sequences that gives greater weight to longer sequences.
–  Given a set of sequences of varying lengths, the N50 length is defined as the length N for which half of all bases in the sequences are in a sequence of length L < N.
–  The definitions of the N50 value in use contradict one another; a common alternative instead counts the bases in sequences of length L ≥ N.
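To make the N50 statistic concrete, here is a minimal Python sketch. Because the definitions in circulation disagree, it assumes one common convention: N50 is the largest length N such that sequences of length at least N together contain half or more of all bases. The function name `n50` and the contig lengths are made up for illustration and do not come from the lecture.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed convention: the largest length N such that sequences of
    length >= N together contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases until
    # the running sum reaches half of the total.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (total = 54 bases, so half = 27).
contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(contig_lengths))  # 10 + 9 + 8 = 27, so this prints 8
```

Under the other wording (sequences of length L < N), the same data could give a different value, which is one reason to state the convention whenever N50 values are compared.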
Other Evaluations
•  Number of insertion, deletion, and substitution errors in an assembly
•  Misassembly of contigs (chimeric indels) >= 500 bp
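As a rough illustration of the de Bruijn graph approach named earlier, the Python sketch below builds such a graph from a few toy reads. It assumes the usual construction, in which every k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name `de_bruijn_graph`, the reads, and the choice k = 3 are all hypothetical, and real assemblers such as Velvet or Euler-SR add error correction and graph simplification that are not shown here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it.

    Every k-mer found in a read contributes one edge, so a repeated
    k-mer appears as a repeated entry (its multiplicity).
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Toy reads, invented for illustration only.
reads = ["ACGTC", "GTCA"]
graph = de_bruijn_graph(reads, 3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Each read is scanned once, so the work grows with the total length of the reads rather than with the number of read pairs, which is the bottleneck of the overlap-layout-consensus approach.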