4.4. SOME OTHER PROBLEMS AND APPROACHES
63
The computational advantage of double-barreled DNA sequencing is that it is
unlikely that both reads of the insert will lie in a large-scale DNA repeat (Roach et
al., 1995 [286]). Thus the read in a unique portio

52
CHAPTER 3. MAP ASSEMBLY
Suppose that $A_1 < A_2$ and $A_2 < A_3$. Then there exist edges $(a_1, a_2')$ and
$(a_2, a_3)$ in $F$ with $a_1 \in A_1$, $a_2, a_2' \in A_2$, and $a_3 \in A_3$ (Figure 3.5 (right)). If
either $(a_2', a_3) \notin E$ or $(a_1, a_2) \notin E$, then $(a_1, a_3) \in F$ and $A_1 < A_3$. Therefore,
assume t

54
CHAPTER 3. MAP ASSEMBLY
where each $S_i$ is a multiset representing the fingerprint of the $i$-th clone in the
clone ordering by the left endpoints. A labeled multistage graph $G$ (called a clone-fragment graph) consists of $n$ stages, with the $i$-th stage contai

CHAPTER 5. DNA ARRAYS
90
each pair of adjacent $l$-tuples (horizontally or vertically) differs in exactly one position. Such a Gray code can be generated from the one-digit two-dimensional Gray
code
$$G_1 = \begin{pmatrix} A & T \\ G & C \end{pmatrix}$$
as follows. For an $l$-digit two-dimensional Gray c
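The defining property stated above (adjacent tuples differ in exactly one position) can be checked mechanically. The following is a minimal sketch, assuming the code is given as a rectangular grid of equal-length strings; the function name `is_2d_gray_code` is hypothetical, and the sketch only verifies the property, it does not reproduce the construction that the excerpt cuts off.

```python
def is_2d_gray_code(grid):
    """Check that horizontally and vertically adjacent tuples differ in exactly one position.

    Assumption: `grid` is a non-empty rectangular list of lists of equal-length strings.
    """
    def hamming(u, v):
        # Number of positions where the two tuples differ.
        return sum(a != b for a, b in zip(u, v))

    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            # Right neighbour.
            if c + 1 < cols and hamming(grid[r][c], grid[r][c + 1]) != 1:
                return False
            # Lower neighbour.
            if r + 1 < rows and hamming(grid[r][c], grid[r + 1][c]) != 1:
                return False
    return True


# The one-digit two-dimensional code from the text: A T over G C.
G1 = [["A", "T"], ["G", "C"]]
print(is_2d_gray_code(G1))
```

For one-digit tuples any two distinct letters differ in exactly one position, so the one-digit grid passes the check.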

60
CHAPTER 4. SEQUENCING
assemble" approach. It took Holley and collaborators at Cornell University seven
years to determine the sequence of 77 nucleotides in tRNA. For many years afterward DNA sequencing was done by first transcribing DNA to RNA and then

86
CHAPTER 5. DNA ARRAYS
Therefore, $p(C(l), n) = 1 - P\{VY_1, VY_2, VY_3 \notin F_{C(l)}\}$. Assume that the probability of finding each of the $4^l$ $l$-tuples at a given position of $F$ is equal to $\frac{1}{4^l}$. The
probability that the spectrum of $F$ does not contain $VY_1$ can be r

5.12. MANUFACTURE OF DNA ARRAYS
87
the ambiguity arises when the spectrum $F_{C_{gap}(l)}$ contains both a $VY_1$ $l$-multiprobe
and a $V\underbrace{X \ldots X}_{l-1}Y_1$ $(2l-1)$-multiprobe (here, $Y_1 \neq X_{m+1}$). Assume that the
probability of finding each multiprobe from $C_{gap}(l)$

68
CHAPTER 5. DNA ARRAYS
target DNA and determine its $l$-tuple composition. The simplest DNA array, $C(l)$,
contains all probes of length $l$ and works as follows (Figure 5.2):
Attach all possible probes of length $l$ ($l = 8$ in the first SBH papers) to the
surface,

3.6. SOME OTHER PROBLEMS AND APPROACHES
57
the problem is to find a string (ABCD) that contains these triples as subsequences.
The problem is complicated by the presence of incorrectly ordered triples and the
unknown orientation of triples (the marker ord

42
CHAPTER 3. MAP ASSEMBLY
As a result, biologists obtain a clone library consisting of thousands of clones
(each representing a short DNA fragment) from the same DNA molecule. Clones
from the clone library may overlap (overlapping can be achieved by usin

3.1. INTRODUCTION
43
C if there exists a substring of S containing exactly the same set of probes as C
(the order and multiplicities of probes in the substring are ignored). The string in
Figure 1.1 covers each of nine clones corresponding to the hybridiz
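The covering condition can be tested directly. The sketch below reads the definition as: a string S covers a clone C if some substring of S contains exactly the same set of probes as C, ignoring order and multiplicity. The start of the definition is cut off in this excerpt, so that reading is an assumption, as are the function name `covers` and the one-character-per-probe encoding.

```python
def covers(s, clone):
    """Return True if some substring of `s` contains exactly the probe set `clone`.

    Assumptions: each probe is a single character of `s`; `clone` is an
    iterable of probes. Order and multiplicity inside the substring are ignored.
    """
    target = set(clone)
    n = len(s)
    for i in range(n):
        seen = set()
        for j in range(i, n):
            if s[j] not in target:
                # Extending further would include a probe outside the clone.
                break
            seen.add(s[j])
            if seen == target:
                return True
    return False


# Hypothetical example string, not taken from the book.
print(covers("EBCABCD", {"A", "B", "C"}))
print(covers("ABCD", {"A", "C"}))
```

In the second call every substring containing both A and C also contains B, so the clone {A, C} is not covered.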

CHAPTER 5. DNA ARRAYS
84
[Figure: Binary arrays. R = {A, G}, Y = {T, C}, W = {A, T}, S = {G, C}. Every string of length l in the {W, S} or {R, Y} alphabet is a pool of 2^l strings in the {A, T, G, C} alphabet. Panels: WS sub-array, RY sub-array.]
Figure 5.1

Chapter 4
Sequencing
4.1 Introduction
At a time when the Human Genome Project is nearing completion, few people remember that before DNA sequencing even started, scientists routinely sequenced
proteins. Frederick Sanger was awarded his first Nobel Prize f

5.13. SOME OTHER PROBLEMS AND APPROACHES
91
shuffle operation (an arbitrary fixed shuffling like that of two decks of cards). The
simplest shuffling is the concatenation of $G_1(i)$ and $G_2(j)$.
For uniform arrays, Gray-code masks have minimal overall border le

CHAPTER 3. MAP ASSEMBLY
48
solution of the problem (see Figure 3.3 for an example). Note that the shortest
covering string in this case has a clone with a double occurrence of the probe D.
[Figure: probe orderings A B C D X F G H drawn over a set of clones]

44
CHAPTER 3. MAP ASSEMBLY
Early clone libraries were based on bacteriophage λ vectors and accommodated up
to 25 Kb of DNA. Cosmids represent another vector that combines DNA sequences
from plasmids and a region of the λ genome. With cosmids, the largest size

61
5.2. SEQUENCING BY HYBRIDIZATION
array for the entire human mitochondrial genome (16,569 base pairs) and were able
to successfully detect three disease-causing mutations in a mtDNA sample from a
patient with Leber's hereditary optic neuropathy. Pr

95
6.1. INTRODUCTION
Computing similarity s(V

4.2. OVERLAP, LAYOUT, AND CONSENSUS
61
4.2 Overlap, Layout, and Consensus
After short DNA fragments (reads) are sequenced, we want to assemble them together and reconstruct the entire DNA sequence of the clone (fragment assembly
problem). The Shortest Sup

CHAPTER 3. MAP ASSEMBLY
56
human cells to radiation, which breaks each chromosome into random fragments.
These fragments are then "rescued" by fusing the human cells with hamster cells
that incorporate a random subset of the human DNA fragments into their

88
CHAPTER 5. DNA ARRAYS
appropriate sequence, it is possible to grow a complete set of $l$-length probes in as
few as $4l$ steps. The light-directed synthesis allows random access to all positions
of the array and can be used to make arrays with any probes

5.6. STRING REARRANGEMENTS
75
$l$ corresponding to pairs of positions in the DNA fragment of length $n$. Solving
$\binom{n}{2} p^l = 1$ yields a rough estimate $l = \log_{1/p} \binom{n}{2}$ for the minimal probe length
needed to reliably reconstruct an $n$-letter sequence from its $l$-tupl
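The estimate above is easy to evaluate numerically. The sketch below solves C(n, 2) · p^l = 1 for l. The definition of p is not visible in this excerpt; the default p = 1/4 (four equiprobable nucleotides) and the function name `min_probe_length` are assumptions made for illustration only.

```python
import math


def min_probe_length(n, p=0.25):
    """Solve C(n, 2) * p**l = 1 for l, i.e. l = log_{1/p} C(n, 2).

    Assumptions: n >= 2 and 0 < p < 1. The default p = 1/4 is an
    illustrative choice for a uniform four-letter alphabet.
    """
    pairs = math.comb(n, 2)  # number of pairs of positions in the fragment
    return math.log(pairs) / math.log(1 / p)


# Rough estimate for a 200-letter fragment under the assumed p = 1/4.
l_est = min_probe_length(200)
print(l_est)
# Substituting back should give approximately 1.
print(math.comb(200, 2) * 0.25 ** l_est)
```

The result is a real number; in practice one would round up to the next integer probe length.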

82
CHAPTER 5. DNA ARRAYS
reconstruction less ambiguous, polynomial algorithms for PSBH sequence reconstruction are unknown. PSBH can be reduced to computing Eulerian path with an
additional restriction that the position of any edge in the computed Euleria

66
CHAPTER 5. DNA ARRAYS
contained in this fragment ($l$-tuple composition of the fragment) but does not provide information about the positions of these strings. Combinatorial algorithms are
then used to reconstruct the sequence of the fragment from its $l$-
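To make the notion of l-tuple composition concrete, the following minimal sketch collects every length-l substring of a fragment into a set, discarding positions and multiplicities as the text describes. The function name `l_tuple_composition` and the example fragment are hypothetical.

```python
def l_tuple_composition(fragment, l):
    """Return the set of all length-l substrings of `fragment`.

    Positions and multiplicities are deliberately discarded, which is why
    reconstructing the fragment from this set is a combinatorial problem.
    """
    return {fragment[i:i + l] for i in range(len(fragment) - l + 1)}


# Hypothetical fragment; ATG occurs twice but appears once in the set.
print(sorted(l_tuple_composition("ATGCATG", 3)))
```

Because the repeated tuple is recorded only once, two different fragments can share the same composition, which is the source of the reconstruction ambiguity.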

5.4. SBH AND THE EULERIAN PATH PROBLEM
73
there is an edge from vertex $i$ into vertex $j$ in $G$, and $a_{ij} = 0$ otherwise (Figure 5.6).
Define a matrix $M$ by replacing the $i$-th diagonal entry of $-A$ by $indegree(i)$ for
all $i$. An $i$-cofactor of a matrix $M$ is the deter
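The construction of M can be sketched in a few lines. The excerpt breaks off before the definition of the i-cofactor is complete, so the code assumes the standard one: the determinant of M with row i and column i deleted. The helper names (`build_m`, `determinant`, `cofactor`) are hypothetical, and vertices are assumed to be numbered from 0.

```python
from fractions import Fraction


def build_m(adj):
    """Build M from adjacency matrix `adj` (adj[i][j] = 1 if there is an edge i -> j).

    M equals -A with the i-th diagonal entry replaced by indegree(i).
    """
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    m = [[-adj[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = indeg[i]
    return m


def determinant(matrix):
    """Exact determinant by Gaussian elimination over rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def cofactor(m, i):
    """Assumed definition: determinant of m with row i and column i removed."""
    minor = [row[:i] + row[i + 1:] for k, row in enumerate(m) if k != i]
    return determinant(minor)


# Directed 3-cycle 0 -> 1 -> 2 -> 0 (hypothetical example graph).
cycle3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(cofactor(build_m(cycle3), 0))
```

Exact rational arithmetic is used so that the cofactor comes out as an exact integer value rather than a floating-point approximation.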

62
CHAPTER 4. SEQUENCING
of fragments (contigs), rather than individual fragments, are merged. The difficulty
with the layout step is deciding whether two fragments with a good overlap really
overlap (i.e., their differences are caused by sequencing errors)

CHAPTER 5. DNA ARRAYS
72
This procedure ends when all edges incident to a vertex in G are used in the trail.
Since every vertex in G is balanced, every such trail starting at vertex v will end at
v. With some luck the trail will be Eulerian, but this need
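The trail-growing idea can be turned into a complete procedure. The sketch below uses the standard stack-based formulation (usually attributed to Hierholzer), which keeps extending a trail and backtracks to splice in unused edges when it gets stuck; it is offered as one common way to finish the job, not necessarily the procedure the book goes on to describe. It assumes a directed graph given as an edge list in which every vertex is balanced and all edges are reachable from `start`; the function name is hypothetical.

```python
def eulerian_cycle(edges, start):
    """Return an Eulerian cycle as a list of vertices, beginning and ending at `start`.

    Assumptions: `edges` is a list of directed pairs (u, v), every vertex is
    balanced (indegree == outdegree), the edges form one connected component,
    and `start` is an endpoint of at least one edge.
    """
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
        out.setdefault(v, [])

    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if out[v]:
            # Extend the current trail along an unused outgoing edge.
            stack.append(out[v].pop())
        else:
            # All edges at v are used: v is final in the cycle.
            cycle.append(stack.pop())
    cycle.reverse()
    return cycle


# Hypothetical balanced graph: two cycles sharing vertex 0.
edges = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 0)]
tour = eulerian_cycle(edges, 0)
print(tour)
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.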

3.3. MAPPING WITH UNIQUE PROBES
49
Connectedness. For every partition of the set of probes into two non-empty
sets $A$ and $B$, there exist probes $i \in A$ and $j \in B$ such that $C_i \cap C_j$ is not
empty.
Distinguishability. $C_i \neq C_j$ for $i \neq j$.
There is no essential
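Both conditions can be checked for a concrete data set. The sketch below assumes that C_i denotes the set of clones containing probe i (the excerpt does not show the definition) and represents the data as a dictionary from probe to clone set; the function names and the example data are hypothetical. Connectedness is tested through an equivalent formulation: the graph joining probes i and j whenever C_i and C_j intersect must be connected.

```python
def is_distinguishable(clone_sets):
    """True if C_i != C_j for all distinct probes i, j."""
    values = [frozenset(c) for c in clone_sets.values()]
    return len(set(values)) == len(values)


def is_connected(clone_sets):
    """True if no partition of the probes separates them into two parts
    whose clone sets never intersect across the cut.

    Equivalent check: the probe graph with an edge (i, j) whenever
    C_i and C_j share a clone is connected.
    """
    probes = list(clone_sets)
    if not probes:
        return True
    seen = {probes[0]}
    frontier = [probes[0]]
    while frontier:
        i = frontier.pop()
        for j in probes:
            if j not in seen and clone_sets[i] & clone_sets[j]:
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(probes)


# Hypothetical data: probe -> set of clone identifiers.
data = {"p1": {1, 2}, "p2": {2, 3}, "p3": {3}}
print(is_connected(data), is_distinguishable(data))
```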

50
CHAPTER 3. MAP ASSEMBLY
The lemma motivates the following algorithm, which finds the true ordering
of probes (Alizadeh et al., 1995 [4]). Throughout the algorithm the variable $\pi =
\pi_{first} \ldots \pi_{last}$ denotes a sequence of consecutive probes in the true or

3.4. INTERVAL GRAPHS
51
Figure 3.4: (i) The house graph is not an interval graph because it is not triangulated. (ii) The star
graph is not an interval graph because its complement is not a comparability graph. (iii) Transitive
orientation of a graph "A."

CHAPTER 3. MAP ASSEMBLY
46
$C_1, C_2, C_3, C_4$ that hybridize with the following probes $A, B, C, D, E$:
$C_1 = \{B, C, E\}$, $C_2 = \{A, B, C, D\}$, $C_3 = \{A, B, C\}$, $C_4 = \{B, C, D\}$.
Let $C(\pi)$ be the length of the shortest string covering $\pi$. Figure 1.1 presents