Genome Assembly Paradigms CMSC 423 Carl Kingsford

Shortest Common Superstring
Def. Given strings s_1, ..., s_n, find the shortest string T such that each s_i is a substring of T.
NP-hard (contrast with the case where each s_i need only be a subsequence of T).
Approximation algorithms exist with factors: 4, 3, 2.89, 2.75, 2.67, 2.596, 2.5, ...
Basic greedy method: find the pair of strings that overlap the best, merge them, repeat (4-approximation).
Given match, mismatch, and gap costs, how can we compute the score of the best overlap?
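A minimal sketch of the basic greedy method, under simplifying assumptions: overlaps are exact suffix/prefix matches (no mismatches or gaps), and ties are broken by the first pair found. The names `overlap_len` and `greedy_scs` are hypothetical, chosen for illustration.

```python
def overlap_len(a, b):
    """Length of the longest suffix of a that equals a prefix of b (exact match)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_scs(strings):
    """Greedy superstring: repeatedly merge the pair with the largest overlap."""
    # Drop duplicates and strings already contained in another string.
    s = list(dict.fromkeys(strings))
    s = [t for t in s if not any(t != u and t in u for u in s)]
    while len(s) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left string, index of right string)
        for i, a in enumerate(s):
            for j, b in enumerate(s):
                if i != j:
                    k = overlap_len(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = s[i] + s[j][k:]  # glue the two strings, sharing the overlap once
        s = [t for idx, t in enumerate(s) if idx not in (i, j)] + [merged]
    return s[0] if s else ""
```

For example, `greedy_scs(["ABCD", "CDEF", "EFGH"])` merges the three strings into `"ABCDEFGH"`. Scanning all pairs on every round is quadratic per merge; this is meant to show the idea, not to be efficient.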
Overlap Alignment
[Figure: dynamic-programming matrix with x along the columns (0 to 12) and y along the rows (0 to 9). The first column is all 0s and row 0 holds 1g, 2g, ..., 12g. Sequence letters on the axes: C A G T T G C A A A A G G T A T G A A T C.]
Score of an optimal alignment between a suffix of Y and a prefix of X.
Initialize first column to 0s.
Answer is the maximum score in the top row (traceback starts from there and continues until it falls off the left side).
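The overlap alignment on this slide can be sketched as a small variation on global alignment. This is an illustrative version only: the scoring values (match = +1, mismatch = -1, gap = -1, with scores maximized) and the function name `overlap_score` are assumptions, not values given on the slide.

```python
def overlap_score(x, y, match=1, mismatch=-1, gap=-1):
    """Best score of aligning a suffix of y with a prefix of x.

    Rows index prefixes of y, columns index prefixes of x.
    Scoring parameters are assumed values; gap is a (negative) score per gap.
    """
    m = len(x)
    # Row 0: a prefix of x aligned against nothing costs one gap per letter.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, len(y) + 1):
        # First column stays 0: skipping a prefix of y is free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if y[i - 1] == x[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # Last row: all of y is consumed, so the alignment ends at the end of y
    # and may stop at any prefix of x.
    return max(prev)
```

With these assumed scores, `overlap_score("CGTTTT", "AAACGT")` is 3, from aligning the suffix `CGT` of y with the prefix `CGT` of x. Only two rows are kept because this sketch returns the score alone; recovering the alignment by traceback would need the full matrix.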

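K-mer Hashing
Only compute overlap alignment between reads that share a k-mer.
[Figure: table with columns "kmer" and "read", listing k-mers (AAAA, AAAT, AAAG, AAAC, AATA, AATT, AATG, AATC, AAGA, AAGT) alongside read identifiers (r1, r2, r10, r11, r2, r3).]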
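One way to sketch the k-mer filter: build a hash table from each k-mer to the reads that contain it, then report only pairs of reads that land in the same bucket. The function name `candidate_pairs` and the use of integer read indices are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs (i, j), i < j, of read indices that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> set of read indices containing it
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(rid)
    pairs = set()
    for rids in index.values():
        # Every pair of reads in the same bucket shares this k-mer.
        pairs.update(combinations(sorted(rids), 2))
    return pairs
```

For instance, with `reads = ["AAAATG", "AATGCC", "GGGGGG"]` and `k = 4`, only the pair `(0, 1)` is returned, because those two reads share `AATG`. The overlap alignment then needs to be computed for that pair only.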

The problem with Shortest Common Superstring (SCS): Repeats
Truth: AAAAAAAAAAAAAAAAAAA
Reads: AAAAA (many identical copies)
SCS: AAAAA (the shortest superstring collapses the repeat)
More complex example: ACCGCCT ACCGCCT ACCGCCT (2 or 3 copies?)
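A small sketch of why the copy number is ambiguous, assuming the copies of ACCGCCT are adjacent (tandem) and the reads are error-free substrings of length 5. Both the read length and the helper name `read_set` are assumptions for illustration.

```python
def read_set(genome, length):
    """All distinct substrings of the given length (idealized, error-free reads)."""
    return {genome[i:i + length] for i in range(len(genome) - length + 1)}


unit = "ACCGCCT"
two_copies = read_set(unit * 2, 5)
three_copies = read_set(unit * 3, 5)

# The two genomes yield exactly the same distinct reads, so any method that
# looks only at the set of reads cannot tell 2 copies from 3.
same_reads = two_copies == three_copies
```

Here `same_reads` is `True`. Since the shortest superstring depends only on the set of reads, it would favor the shorter reconstruction regardless of the true copy number.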
Overlap Graph
[Figure: seven reads, numbered 1 to 7, and the corresponding overlap graph on nodes 1 to 7.]
Overlap graph: Nodes = reads, Edges = overlaps.
Given the overlap graph, how can we find a good candidate assembly?
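A minimal sketch of building an overlap graph, assuming exact suffix/prefix overlaps and a minimum overlap length of at least 1. The function name `build_overlap_graph`, the `min_overlap` threshold, and the dictionary representation are illustrative choices, not part of the slide.

```python
def build_overlap_graph(reads, min_overlap):
    """Directed overlap graph as a dict: graph[i][j] = overlap length.

    An edge i -> j exists when a suffix of reads[i] equals a prefix of reads[j]
    and that overlap has length >= min_overlap (assumed to be >= 1).
    """
    graph = {i: {} for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            # Try the longest possible overlap first and keep the first hit.
            for k in range(min(len(a), len(b)), min_overlap - 1, -1):
                if a.endswith(b[:k]):
                    graph[i][j] = k
                    break
    return graph
```

With `reads = ["ACGTAC", "GTACGG", "CGGTTT"]` and `min_overlap = 3`, the result is `{0: {1: 4}, 1: {2: 3}, 2: {}}`: read 0 overlaps read 1 by `GTAC`, and read 1 overlaps read 2 by `CGG`. In practice the exact-match test could be replaced by an overlap alignment score, and the all-pairs loop restricted to reads that share a k-mer.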

