Lect14 Whole Genome Assembly

An Introduction to Bioinformatics Algorithms (Computational Molecular Biology)

Whole Genome Assembly
Microarray analysis

Mate Pairs
• Mate-pairs allow you to merge islands (contigs) into super-contigs.

Super-contigs are quite large
• Make clones of truly predictable length. Example: three sets can be used: 2 Kb, 10 Kb, and 50 Kb. The variance in these lengths should be small.
• Use the mate-pairs to order and orient the contigs, and make super-contigs.

Problem 3: Repeats

Repeats & Chimerisms
• 40-50% of the human genome is made up of repetitive elements.
• Repeats can cause great problems in the assembly!
• Chimerism causes a clone to come from two different parts of the genome. This can again give a completely wrong assembly.

Repeat detection
• Lander-Waterman strikes again!
• The expected number of clones in a repeat-containing island is MUCH larger than in a non-repeat-containing island (contig).
• Thus, every contig can be marked as unique or non-unique. In the first step, throw away the non-unique islands.

Detecting Repeat Contigs 1: Read Density
• Compute the log-odds ratio of two hypotheses:
• H1: The contig is from a unique region of the genome.
• H2: The contig is from a region that is repeated at least twice.

Detecting Chimeric reads
• Chimeric reads: reads that contain sequence from two genomic locations.
• Good overlaps: G(a,b) if a and b overlap with a high score.
• Transitive overlap: T(a,c) if G(a,b) and G(b,c).
• Find a point x across which only transitive overlaps occur. Then x is a point of chimerism.

Contig assembly
• Reads are merged into contigs up to repeat boundaries.
• If (a,b) and (a,c) overlap, then (b,c) should overlap as well. Also,
– shift(a,c) = shift(a,b) + shift(b,c)
• Most of the contigs are unique pieces of the genome, and end at some repeat boundary.
• Some contigs might be entirely within repeats. These must be detected.

Creating Super Contigs

Supercontig assembly
• Supercontigs are built incrementally.
• Initially, each contig is a supercontig.
• In each round, a pair of supercontigs is merged, until no more merges can be performed.
• Create a priority queue Q with a score for every pair of 'mergeable supercontigs'.
– The score has two terms:
• A reward for multiple mate-pair links.
• A penalty for distance between the links.

Supercontig merging
• Remove the top-scoring pair (S1, S2) from the priority queue.
• Merge (S1, S2) to form supercontig T.
• Remove all pairs in Q containing S1 or S2.
• Find all supercontigs W that share mate-pair links with T and insert (T, W) into the priority queue.
...
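
The greedy merging loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the lecture's implementation: the score formula (one point per mate-pair link, minus a small penalty per base of estimated gap), the min_links threshold, and the names pair_score and merge_supercontigs are all hypothetical. Stale pairs are skipped when popped instead of being deleted from Q eagerly, and ordering and orientation of contigs inside a supercontig are not modeled.

```python
import heapq
from collections import defaultdict

def pair_score(gaps, gap_penalty=0.001):
    # Hypothetical score: reward per mate-pair link, penalty per base of gap.
    return len(gaps) - gap_penalty * sum(gaps)

def merge_supercontigs(contigs, links, min_links=2, gap_penalty=0.001):
    """contigs: iterable of string ids.
    links: dict mapping (a, b) -> list of estimated gap sizes, one per link.
    Returns a list of supercontigs, each a list of contig ids."""
    members = {c: [c] for c in contigs}      # initially each contig is a supercontig
    adj = defaultdict(dict)                  # supercontig -> {neighbour: gaps}
    for (a, b), gaps in links.items():
        adj[a][b] = list(gaps)
        adj[b][a] = list(gaps)

    heap = []                                # max-heap via negated scores

    def push(a, b):
        gaps = adj[a][b]
        if len(gaps) >= min_links:           # only 'mergeable' pairs enter Q
            heapq.heappush(heap, (-pair_score(gaps, gap_penalty), a, b))

    for a in list(adj):
        for b in adj[a]:
            if a < b:
                push(a, b)

    next_id = 0
    while heap:
        _, s1, s2 = heapq.heappop(heap)      # top-scoring pair
        if s1 not in members or s2 not in members:
            continue                         # pair mentions an already-merged supercontig
        t = f"T{next_id}"                    # assumed not to clash with contig ids
        next_id += 1
        members[t] = members.pop(s1) + members.pop(s2)
        # Transfer the mate-pair links of S1 and S2 to T.
        for s in (s1, s2):
            for w, gaps in adj.pop(s, {}).items():
                if w in (s1, s2):
                    continue
                adj[w].pop(s, None)
                adj[t].setdefault(w, []).extend(gaps)
        # Insert (T, W) for every W sharing links with T.
        for w, gaps in adj[t].items():
            adj[w][t] = gaps
            push(t, w)
    return list(members.values())

example_links = {("A", "B"): [100, 120, 110], ("B", "C"): [500, 600], ("C", "D"): [50]}
result = merge_supercontigs(["A", "B", "C", "D"], example_links)
```

In this toy input, A-B and B-C each have at least two links and are merged, while the single C-D link falls below the assumed min_links threshold, so D stays a separate supercontig.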

This note was uploaded on 02/14/2008 for the course CSE 182 taught by Professor Bafna during the Fall '06 term at UCSD.
