, we see that the expected running time for line 2 of the algorithm is $\Theta(1)$. There are at most $k$ iterations of the loop, so the total expected running time for Overlap is $\Theta(k)$.

Finding all pairs with large overlap. Some of our algorithms will seek to find all pairs with an overlap of at least some threshold, say $\alpha$. There is a straightforward algorithm with $O(n^2 k^2)$ worst-case running time and $O(n^2 k)$ expected running time: just try all pairs, as shown below.

PairsWithLargeOverlap():
1. For each $s, t \in S$ such that $s \ne t$:
2.     If Overlap$(s,t) \ge \alpha$, output the pair $(s,t)$ and continue.

Optimization: hash tables. We can speed up the process of finding all pairs with large overlap using a hash table. In particular, we can build a hash table that stores each short read $s$, keyed on $s[1..\alpha]$ (the first $\alpha$ letters of $s$). This hash table can be built in $\Theta(n\alpha)$ time. Once we have the hash table, then given a short read $s$ we can quickly find all short reads $t$ that have a large overlap with $s$ as follows:

Successors(s):
1. For each $i := 1, 2, \dots, \text{length}(s) - \alpha + 1$:
2.     Look up $s[i..i+\alpha-1]$ in the hash table, output all $t \in S$ such that $t[1..\alpha] = s[i..i+\alpha-1]$, and continue.

PairsWithLargeOverlap():
1. For each $s \in S$:
2.     Output Successors$(s)$ and continue.

CS 170, Fall 2014, Soln 9

The expected running time of this procedure will be $O(nk\alpha)$, if $\alpha$ is large enough.

Why? Well, we can construct the hash table in $\Theta(n\alpha)$ time. Each call to Successors$(s)$ takes $\Theta(\alpha k + n_s)$ time, where $n_s$ is the number of strings it outputs: there are at most $k$ lookups, each on a key of $\alpha$ letters. Therefore, the overall running time of PairsWithLargeOverlap is $\Theta(nk\alpha + N)$, where $N$ is the total number of pairs with overlap $\ge \alpha$.

We'll argue that if $\alpha$ is large enough, then $N$ is not too large. Assuming no short read is duplicated, there will be $\le nk$ pairs of short reads that are chosen from overlapping positions in the original string. Also, if we have two short reads $s, t$ that aren't chosen from overlapping positions in the original string, the probability that their overlap is $\ge \alpha$ will be $\le k/4^{\alpha}$ (use a union bound, and note that the probability of an overlap of $\ge \alpha$ letters starting at any fixed location is $1/4^{\alpha}$). If we choose $\alpha \ge \log_4 n$, then $k/4^{\alpha} \le k/n$, and since there are at most $n^2$ such pairs, the expected number of them with overlap $\ge \alpha$ is at most $n^2 \cdot k/n = nk$. Adding the $\le nk$ pairs from overlapping positions, the expected number of pairs with overlap $\ge \alpha$ will be at most $2nk$, i.e., $E[N] \le 2nk$. Therefore, the expected running time of PairsWithLargeOverlap will be $O(nk\alpha)$, in total.

This also shows how to efficiently find all strings $t \in S$ such that Overlap$(s,t) \ge \alpha$, for a given $s$: we simply call Successors$(s)$ above. The average running time of Successors$(s)$ is $O(\alpha k + n_s)$, where $n_s$ is the number of strings it outputs; on average, $n_s$ is at most about $2k$, so the average running time is $O(\alpha k)$.

It is possible to further improve the running times to eliminate the factor of $\alpha$, by using a rolling hash.
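The hash-table version of Successors and PairsWithLargeOverlap above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the course's reference code: the function names and the 0-based indexing are illustrative, and skipping t = s is an assumption that mirrors the s ≠ t condition of the brute-force version. Like the pseudocode, it reports t as soon as α letters match and does not re-check the remainder of the suffix.

```python
from collections import defaultdict


def build_prefix_index(reads, alpha):
    """Map each length-alpha prefix to the list of reads that start with it."""
    index = defaultdict(list)
    for t in reads:
        if len(t) >= alpha:
            index[t[:alpha]].append(t)
    return index


def successors(s, index, alpha):
    """Return every read t != s whose first alpha letters equal a window of s."""
    found = []
    # Windows s[i:i+alpha] correspond to i = 1, ..., length(s) - alpha + 1
    # in the 1-based pseudocode.
    for i in range(len(s) - alpha + 1):
        for t in index.get(s[i:i + alpha], []):
            if t != s:  # assumption: a read is not paired with itself
                found.append(t)
    return found


def pairs_with_large_overlap(reads, alpha):
    """Return all (s, t) pairs reported by successors, over every read s."""
    index = build_prefix_index(reads, alpha)
    return [(s, t) for s in reads for t in successors(s, index, alpha)]
```

For instance, with the reads AACCGG, CGGTTA, TTAGAC and α = 3, this returns the pairs (AACCGG, CGGTTA) and (CGGTTA, TTAGAC). A read whose prefix matches several windows of s is reported once per matching window, as in the pseudocode.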
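To illustrate the rolling-hash remark, here is one possible sketch of how the hashes of all length-α windows of a read can be computed in time linear in the read's length, instead of Θ(α) per window. The text does not say which rolling hash is meant; the polynomial (Rabin-Karp style) hash, the base 257, and the modulus 2^61 - 1 are assumptions chosen for illustration, as are the function names.

```python
def poly_hash(w, base=257, mod=(1 << 61) - 1):
    """Polynomial hash of the string w, computed from scratch in O(len(w))."""
    h = 0
    for ch in w:
        h = (h * base + ord(ch)) % mod
    return h


def window_hashes(s, alpha, base=257, mod=(1 << 61) - 1):
    """Yield (i, hash of s[i:i+alpha]) for every window, in O(len(s)) total."""
    if alpha <= 0 or len(s) < alpha:
        return
    high = pow(base, alpha - 1, mod)  # weight of the letter leaving the window
    h = poly_hash(s[:alpha], base, mod)
    yield 0, h
    for i in range(1, len(s) - alpha + 1):
        h = (h - ord(s[i - 1]) * high) % mod          # drop the outgoing letter
        h = (h * base + ord(s[i + alpha - 1])) % mod  # shift, add the incoming letter
        yield i, h
```

If the table is keyed on poly_hash(t[:alpha]) rather than on the prefix itself, each lookup in Successors costs O(1) after the first window. Distinct windows can collide under such a hash, so a careful implementation would still compare the α letters directly whenever two hashes agree.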