CS 170, Fall 2014, Soln 9

We see that the expected running time for line 2 of the algorithm is Θ(1). There are at most k iterations of the loop, so the total expected running time for Overlap is Θ(k).

Finding all pairs with large overlap. Some of our algorithms will seek to find all pairs with an overlap of at least some threshold, say α. There is a straightforward algorithm with O(n²k²) worst-case running time and O(n²k) expected running time: just try all pairs, as shown below.

PairsWithLargeOverlap():
1. For each s, t ∈ S such that s ≠ t:
2.     If Overlap(s, t) ≥ α, output the pair (s, t) and continue.

Optimization: hash tables. We can speed up the process of finding all pairs with large overlap using a hash table. In particular, we can build a hash table that stores each short read s, keyed on s[1..α] (the first α letters of s). This hash table can be built in Θ(nα) time. Once we have the hash table, then given a short read s we can quickly find all short reads t that have a large overlap with s as follows:

Successors(s):
1. For each i := 1, 2, ..., length(s) − α + 1:
2.     Look up s[i..i+α−1] in the hash table, output all t ∈ S such that t[1..α] = s[i..i+α−1], and continue.

PairsWithLargeOverlap():
1. For each s ∈ S:
2.     Output Successors(s) and continue.

The expected running time of this procedure will be O(nkα), if α is large enough. Why? Well, we can construct the hash table in Θ(nα) time. Each call to Successors(s) takes Θ(αk + n_s) time, where n_s is the number of strings it outputs: there are at most k lookups, each hashing an α-letter key. Therefore, the overall running time of PairsWithLargeOverlap is Θ(nkα + N), where N is the total number of pairs with overlap ≥ α.

We'll argue that if α is large enough, then N is not too large. Assuming no short read is duplicated, there will be at most nk pairs of short reads that are chosen from overlapping positions in the original string. Also, if we have two short reads s, t that aren't chosen from overlapping positions in the original string, the probability that their overlap is ≥ α will be at most k/4^α (use a union bound over the at most k starting locations, and note that the probability of an overlap of α letters starting at any fixed location is 1/4^α). There are at most n² such pairs, so they contribute at most n²k/4^α to the expectation. Therefore, if we choose α ≥ log₄ n, we have 4^α ≥ n and this contribution is at most nk, so the expected number of pairs with overlap ≥ α will be at most nk + nk = 2nk, i.e., E[N] ≤ 2nk. Therefore, the expected running time of PairsWithLargeOverlap will be O(nkα), in total.

This also shows how to efficiently find all strings t ∈ S such that Overlap(s, t) ≥ α, for a given s: we simply call Successors(s) above. The average running time of Successors(s) is O(αk + n_s), where n_s is the number of strings it outputs; on average, n_s is about 2k, so the average running time is O(αk).

It is possible to further improve the running times to eliminate the factor of α, by using a rolling hash.
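The hash-table procedure can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function names, the use of list indices to identify reads, and the rule that a read is never paired with itself are all choices made for this sketch. Like the pseudocode, it reports every read whose first α letters match some α-letter window of s and does not re-check the rest of the suffix.

```python
from collections import defaultdict


def build_prefix_table(reads, alpha):
    """Map each alpha-letter prefix to the indices of the reads starting with it."""
    table = defaultdict(list)
    for idx, t in enumerate(reads):
        if len(t) >= alpha:
            table[t[:alpha]].append(idx)
    return table


def successors(s_idx, reads, table, alpha):
    """Slide an alpha-letter window over reads[s_idx] and look each window up."""
    s = reads[s_idx]
    found = []
    for i in range(len(s) - alpha + 1):
        for t_idx in table.get(s[i:i + alpha], ()):
            if t_idx != s_idx:  # assumption of this sketch: skip pairing a read with itself
                found.append(t_idx)
    return found


def pairs_with_large_overlap(reads, alpha):
    """Collect (s, t) pairs by calling successors once per read."""
    table = build_prefix_table(reads, alpha)
    return [
        (reads[s_idx], reads[t_idx])
        for s_idx in range(len(reads))
        for t_idx in successors(s_idx, reads, table, alpha)
    ]


# Hypothetical reads of length 6 cut from the string "AACGTTGCATCG".
reads = ["AACGTT", "GTTGCA", "GCATCG"]
print(pairs_with_large_overlap(reads, 3))
```

With α = 3, this prints the two pairs ('AACGTT', 'GTTGCA') and ('GTTGCA', 'GCATCG'). Each slice s[i:i + alpha] costs O(α) to build and hash, which is where the αk term in the running time of Successors comes from.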