, we see that the expected running time for line 2 of the algorithm is $\Theta(1)$. There are at most $k$ iterations of the loop, so the total expected running time for Overlap is $\Theta(k)$.
Finding all pairs with large overlap.
Some of our algorithms will seek to find all pairs with an overlap of at least some threshold, say $\alpha$. There is a straightforward algorithm with $O(n^2 k^2)$ worst-case running time and $O(n^2 k)$ expected running time: just try all pairs, as shown below.
PairsWithLargeOverlap():
1. For each $s, t \in S$ such that $s \ne t$:
2.     If Overlap$(s, t) \ge \alpha$, output the pair $(s, t)$ and continue.
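The brute-force procedure above can be sketched in Python as follows. This is only an illustration: it assumes that Overlap$(s, t)$ means the length of the longest suffix of $s$ that is also a prefix of $t$, and the function names `overlap` and `pairs_with_large_overlap` are hypothetical. The `overlap` shown here is a simple stand-in, not necessarily the Overlap routine analyzed earlier.

```python
from itertools import permutations


def overlap(s: str, t: str) -> int:
    """Length of the longest suffix of s that is also a prefix of t.

    Assumed definition of Overlap; a simple stand-in for the routine in the text.
    """
    for length in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:length]):
            return length
    return 0


def pairs_with_large_overlap(reads: list[str], alpha: int) -> list[tuple[str, str]]:
    """Try every ordered pair of distinct reads and keep those with overlap >= alpha."""
    return [(s, t) for s, t in permutations(reads, 2) if overlap(s, t) >= alpha]
```

Because every ordered pair is examined, the number of Overlap calls grows as $n^2$, which is what the hash-table optimization avoids.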
Optimization: hash tables.
We can speed up the process of finding all pairs with large overlap using a hash table. In particular, we can build a hash table that stores each short read $s$, keyed on $s[1..\alpha]$ (the first $\alpha$ letters of $s$). This hash table can be built in $\Theta(n\alpha)$ time. Once we have the hash table, then given a short read $s$ we can quickly find all short reads $t$ that have a large overlap with $s$ as follows:
Successors($s$):
1. For each $i := 1, 2, \ldots, \text{length}(s) - \alpha + 1$:
2.     Look up $s[i..i+\alpha-1]$ in the hash table, output all $t \in S$ such that $t[1..\alpha] = s[i..i+\alpha-1]$, and continue.
PairsWithLargeOverlap():
1. For each $s \in S$:
2.     Output Successors$(s)$ and continue.
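A minimal Python sketch of the hash-table approach, using a built-in dictionary as the hash table. The names `build_prefix_index` and `successors` are illustrative, and slices are 0-indexed rather than 1-indexed. As in the pseudocode, a read $t$ is reported whenever its first $\alpha$ letters appear somewhere in $s$. Skipping $t = s$ is an added assumption, since every read trivially matches its own prefix at $i = 1$.

```python
from collections import defaultdict


def build_prefix_index(reads: list[str], alpha: int) -> dict[str, list[str]]:
    """Map each alpha-letter prefix to the reads that start with it."""
    index = defaultdict(list)
    for t in reads:
        if len(t) >= alpha:
            index[t[:alpha]].append(t)
    return index


def successors(s: str, index: dict[str, list[str]], alpha: int) -> list[str]:
    """Reads whose alpha-letter prefix occurs as a window of s."""
    found = []
    for i in range(len(s) - alpha + 1):
        for t in index.get(s[i:i + alpha], []):
            if t != s:  # assumption: a read is not its own successor
                found.append(t)
    return found


def pairs_with_large_overlap(reads: list[str], alpha: int) -> list[tuple[str, str]]:
    """Build the index once, then collect the successors of every read."""
    index = build_prefix_index(reads, alpha)
    return [(s, t) for s in reads for t in successors(s, index, alpha)]
```

Each window lookup slices and hashes $\alpha$ letters, which is where the factor of $\alpha$ in the running time comes from.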
The expected running time of this procedure will be $O(nk\alpha)$, if $\alpha$ is large enough.
Why? Well, we can construct the hash table in $\Theta(n\alpha)$ time. Each call to Successors$(s)$ takes $\Theta(\alpha k + n_s)$ time, where $n_s$ is the number of strings it outputs. Therefore, the overall running time of PairsWithLargeOverlap is $\Theta(nk\alpha + N)$, where $N$ is the total number of pairs with overlap $\ge \alpha$. We'll argue that if $\alpha$ is large enough, then $N$ is not too large. Assuming no short read is duplicated, there will be $\le nk$ pairs of short reads that are chosen from overlapping positions in the original string. Also, if we have two short reads $s, t$ that aren't chosen from overlapping positions in the original string, the probability that their overlap is $\ge \alpha$ will be $\le k/4^{\alpha}$ (use a union bound, and note that the probability of an overlap of $\ge \alpha$ letters starting at any fixed location is $1/4^{\alpha}$). Therefore, if we choose $\alpha \ge \log_4 n$, the expected number of pairs with overlaps $\ge \alpha$ will be at most $2nk$, i.e., $E[N] \le 2nk$. Therefore, the expected running time of PairsWithLargeOverlap will be $O(nk\alpha)$ in total.
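To spell out the step behind $E[N] \le 2nk$: there are at most $n^2$ pairs of reads not drawn from overlapping positions, each contributing with probability at most $k/4^{\alpha}$, and at most $nk$ pairs drawn from overlapping positions. By linearity of expectation (a sketch that uses only the bounds stated above):

$$
E[N] \;\le\; nk + n^2 \cdot \frac{k}{4^{\alpha}} \;\le\; nk + n^2 \cdot \frac{k}{n} \;=\; 2nk \qquad \text{whenever } \alpha \ge \log_4 n, \text{ i.e. } 4^{\alpha} \ge n .
$$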
This also shows how to efficiently find all strings $t \in S$ such that Overlap$(s, t) \ge \alpha$, for a given $s$: we simply call Successors$(s)$ above. The average running time of Successors$(s)$ is $O(\alpha k + n_s)$, where $n_s$ is the number of strings it outputs; on average, $n_s$ is at most about $2k$ (since $E[N] \le 2nk$ is spread over $n$ reads), so the average running time is $O(\alpha k)$.
It is possible to further improve the running times to eliminate the factor of $\alpha$, by using a rolling hash.
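One common way to realize a rolling hash is a polynomial (Rabin-Karp style) hash, sketched below. The notes do not specify which rolling hash is intended, so the function name `rolling_hashes`, the base 257, and the modulus $2^{61} - 1$ are all assumptions made for illustration. After the first window, each subsequent window's hash is obtained with a constant number of arithmetic operations instead of rehashing $\alpha$ letters.

```python
def rolling_hashes(s: str, alpha: int, base: int = 257,
                   mod: int = (1 << 61) - 1) -> list[int]:
    """Polynomial hashes of every length-alpha window of s, left to right."""
    if alpha <= 0 or len(s) < alpha:
        return []
    high = pow(base, alpha - 1, mod)  # weight of the window's leading character
    h = 0
    for ch in s[:alpha]:  # hash the first window directly
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(alpha, len(s)):
        h = (h - ord(s[i - alpha]) * high) % mod  # drop the outgoing character
        h = (h * base + ord(s[i])) % mod          # shift and append the incoming one
        hashes.append(h)
    return hashes
```

Equal windows always receive equal hashes, but distinct windows can collide, so a lookup keyed on these values would still need to confirm a match by comparing the actual letters.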