Unformatted text preview: Measuring transcriptomes with RNASeq
BMI/CS 576
November 17, 2011
Colin Dewey
[email protected] 1 Overview
• Motivation  Axolotl
• The transcriptome
• RNASeq technology
• The RNASeq quantiﬁcation problem
• Handling alternative splicing: probabilistic splice graphs 2 What I want you to get from this lecture
• What is RNASeq?
• How is RNASeq used to measure the abundances of RNAs within cells?
• What probabilistic models and algorithms are used for analyzing RNASeq? 3 Collaborators Axolotl James Thomson
Ron Stewart Regenerative Biology Laboratory, Morgridge Institute for Research, Madison, WI 4 Axolotl background • Ambystoma mexicanum • Neotenous • Natural habitats • Regenerative abilities • Lake Xochimilco (canals) • Limbs • Lake Chalco (drained) • Portions of Heart • Endangered • Portions of Brain • Commonly sold as pets • Tail and spinal cord
5 Axolotl limb regeneration David Gardiner  HHMIUCI
6 Goals
• What are the axolotl genes that are responsible for this remarkable
regenerative ability?
• Can this knowledge improve our medical treatments of severe wounds and
tissue regeneration? 7 Challenges with genomic studies of Axolotl
• No genome sequence
available
• genome estimated to be
10x larger than that of
human 340 mya • Distantly related to other
model organisms 280 mya Villiard et al. 2007
8 The Central Dogma 9 Transcription • RNA that is transcribed from a gene is called messenger RNA (mRNA)
• RNA polymerase is the enzyme that builds an RNA molecule from a gene 10 Transcription: DNA→RNA 11 Measuring transcription: Microarrays
• Each spot has “probes” for a certain
gene
• Probe: a DNA sequence
complementary to a certain gene
• Relies on complementary
hybridization
• Intensity/color of light from each
spot is measurement of the number
of transcripts for a certain gene in a
sample
• Requires knowledge of gene
sequences
12 Prior gene expression studies in Axolotl
• Microarrays
• Exist, but not very complete
• Limited amount of mRNA sequence data from Axolotl
• No genome, so can’t use predicted gene sequences 13 RNASeq technology
• Leverages rapidly advancing sequencing technology (e.g., Illumina, SOLiD)
• Transcriptome analog to whole genome shotgun sequencing
• Two key differences:
1. Transcripts sequenced at different levels of coverage  expression levels
2. Sequences already known (in many cases)  coverage is measurement 14 Advantages of RNASeq over microarrays
• No reference sequence needed
• With microarrays, limited to the probes on the chip
• Low background noise
• Large dynamic range
• 105 compared to 102 for microarrays
• High technical reproducibility 15 RNASeq protocol
Ampliﬁed
cDNA Sample
RNA
reverse
transcription
+ PCR cDNA
fragments reads
CCTTCNCACTTCGTTTCCCAC fragmentation sequencing
machine TTTTTNCAGAGTTTTTTCTTG
GAACANTCCAACGCTTGGTGA
GGAAANAAGACCCTGTTGAGC
CCCGGNGATCCGCTGGGACAA
GCAGCATATTGATAGATAACT
CTAGCTACGCGTACGCGATCG
CATCTAGCATCGCGTTGCGTT
CCCGCGCGCTTAGGCTACTCG
TCACACATCTCTAGCTAGCAT
CATGCTAGCTATGCCTATCTA
CACCCCGGGGATATATAGGAT
16 RNASeq data
@HWUSIEAS1789_0001:3:2:1708:1305#0/1
CCTTCNCACTTCGTTTCCCACTTAGCGATAATTTG
+HWUSIEAS1789_0001:3:2:1708:1305#0/1
VVULVBVYVYZZXZZ\ee[a^b`[a\a[\\a^^^\
@HWUSIEAS1789_0001:3:2:2062:1304#0/1
TTTTTNCAGAGTTTTTTCTTGAACTGGAAATTTTT
+HWUSIEAS1789_0001:3:2:2062:1304#0/1
a__[\Bbbb`edeeefd`cc`b]bffff`ffffff
@HWUSIEAS1789_0001:3:2:3194:1303#0/1
GAACANTCCAACGCTTGGTGAATTCTGCTTCACAA
+HWUSIEAS1789_0001:3:2:3194:1303#0/1
ZZ[[VBZZY][TWQQZ\ZS\[ZZXV__\OX`a[ZZ
@HWUSIEAS1789_0001:3:2:3716:1304#0/1
GGAAANAAGACCCTGTTGAGCTTGACTCTAGTCTG
+HWUSIEAS1789_0001:3:2:3716:1304#0/1
aaXWYBZVTXZX_]Xdccdfbb_\`a\aY_^]LZ^
@HWUSIEAS1789_0001:3:2:5000:1304#0/1
CCCGGNGATCCGCTGGGACAAGCAGCATATTGATA
+HWUSIEAS1789_0001:3:2:5000:1304#0/1
aaaaaBeeeeffffehhhhhhggdhhhhahhhadh name
sequence
qualities read pairedend reads
read1
read2 1 Illumina (GAIIX) lane
~20 million reads
17 Tasks with RNASeq data
• Assembly:
Given: RNASeq reads (and possibly a genome sequence)
• Do: reconstruct fulllength transcript sequences from the reads
• Quantiﬁcation:
• Given: RNASeq reads and transcript sequences
• Do: Estimate the relative abundances of transcripts (“gene expression”)
• Differential expression:
• Given: RNASeq reads from two different samples and transcript sequences
• Do: Predict which transcripts have different abundances between the two samples 18 The basics of quantiﬁcation with RNASeq data
• For simplicity, suppose reads are of length one (typically they are > 35 bases) transcripts
200 1
2
3 60
80 reads
100 A
60 C
40 G • What relative abundances would you estimate for these genes? 19 Length dependence
• probability of a read coming from a transcript ∝!relative abundance × length transcripts
200 1
2
3 ˆ
f1 / 60
80 100
200 1
=
200
400
60
200 1
ˆ
f2 /
=
60
200
40
200 1
ˆ
f3 /
=
80
400 reads
100 A
60 C
40 G ˆ
f1 = 0.25
ˆ
f2 = 0.5
ˆ
f3 = 0.25
20 What if reads do not uniquely map to transcripts?
• “multiread”: a read that could have been derived from multiple transcripts transcripts
200 1
2
3 60
80 reads
90 A
40 C
40 G
30 T • How would you estimate the relative abundances for these transcripts? 21 Some options for handling multireads
• Discard all multireads, estimate based on uniquely mapping reads only
• Discard multireads, but use “unique length” of each transcript in calculations
• “Rescue” multireads by allocating (fractions of) them to the transcripts
• Three step algorithm
1. Estimate abundances based on uniquely mapping reads only
2. For each multiread, divide it between the transcripts to which it maps,
proportionally to their abundances estimated in the ﬁrst step
3.Recompute abundances based on updated counts for each transcript 22 Rescue method example  Step 1
transcripts 3 90 A 200 1
2 reads 60 40 C 80 40 G
30 T
Step 1 ˆunique =
f1 90
200
90
+ 40
200
60 + 40
80 = 0.278 ˆunique = 0.412
f2
ˆunique = 0.309
f3
23 Rescue method example  Step 2
transcripts 3 90 A 200 1
2 reads 60 40 C 80 40 G
30 T
Step 2 rescue
c1
rescue
c2
rescue
c3 0.278
= 90 + 30 ⇥
= 102.1
0.278 + 0.412
0.412
= 40 + 30 ⇥
= 57.9
0.278 + 0.412
= 40 + 0 = 40
24 Rescue method example  Step 3
transcripts 3 90 A 200 1
2 reads 60 40 C 80 40 G
30 T Step 3 ˆrescue =
f1
ˆrescue =
f2
ˆrescue =
f3 102.1
200
102.1
200
102.1
200 102.1
200
+ 57.9
60
57.9
60
+ 57.9
60
40
80
+ 57.9
60 + 40
80 + 40
80 + 40
80 = 0.258
= 0.488
= 0.253
25 Are multireads really a problem?
Data set % unmapped % unique % multireads % ﬁltered Mouse liver
(Mortazavi et
al. 2008) 46.2 44.4 9.2 0.2 Maize
simulation 47.5 25.0 27.1 0.4 25 base reads, 2 mismatches allowed
• Still an issue with longer and paired reads
• mouse 75 base reads: 10% multireads (singleend), 8% (pairedend)
• Multireads arise due to homology, not chance similarity
26 An observation about the rescue method
• Note that at the end of the rescue algorithm, we have an updated set of
abundance estimates
• These new estimates could be used to reallocate the multireads
• And then we could update our abundance estimates once again
• And repeat!
• This is the intuition behind the statistical approach to this problem 27 Our solution  a generative probabilistic model
transcript probabilities (expression levels) ✓ N number of reads Gn
transcript
fragment length
start position Fn L1
n read length
orientation Sn L2
n
paired read Q1
n quality scores Q2
n On read sequence
1
Rn P (g, f , s, o, ⇥, q, r ) = N
Y n=1 2
Rn P ( g n  ) P ( f n  g n ) P ( s n  f n , g n ) P ( o n  g n ) P ( q n ) P ( ⇥ n  f n ) P ( r n  g n , f n , s n , o n , ⇥ n , qn )
28 Quantiﬁcation as maximum likelihood inference
• Observed data likelihood
P ( r, ⇥ , q  ) = NM
YX n=1 i=0 i Li Li 1
XXX
j =0 k=0 o=0 P (Rn = rn , Ln = ⇥n , Qn = qn , Sn = j, Fn = k, On = oGn = i) • Likelihood function is concave w.r.t. θ
• Has a global maximum (or global maxima)
• ExpectationMaximization for optimization “RNASeq gene expression estimation with read mapping uncertainty”
Li, B., Ruotti, V., Stewart, R., Thomson, J., Dewey, C.
Bioinformatics, 2010
29 Approximate inference with read alignments
P ( r, ⇥ , q  ) = NM
YX i n=1 i=0 Li Li 1
XXX
j =0 k=0 o=0 P (Rn = rn , Ln = ⇥n , Qn = qn , Sn = j, Fn = k, On = oGn = i) • Full likelihood computation requires O(NML2) time
• N (number of reads) ~ 107
• M (number of transcripts) ~ 104
• L (average transcript length) ~ 103
• Approximate by alignment P ( r, ⇥ , q  ) = N
Y X n=1 (i,j,k,o)2 i P ( Rn
x
n = rn , Ln = ⇥n , Qn = qn , Znijko = 1Gn = i) all local alignments of read n with at most x mismatches
30 HMM Interpretation
transcript 1 start ✓1
✓2
✓3 transcript 2 transcript 3 ...
hidden: read start positions
observed: read sequences ... ✓M transcript M Learning parameters: BaumWelch Algorithm (EM for HMMs)
Approximation: Only consider a subset of paths for each read
31 EM Algorithm
• ExpectationMaximization for RNASeq
• Estep: Compute expected read counts given current expression levels
• Mstep: Compute expression values maximizing likelihood given expected
read counts
• Rescue algorithm ≈ 1 iteration of EM 32 predicted expression level Improved accuracy over unique and rescue true expression level Genelevel expression estimation
33 Probabilisticallyweighted alignments
Scale
chr1:
10 _ 133546350 133546400 100 bases
133546450
133546500
133546550
SRR065546 133546600 133546650 133546700 SRR065546 0_ env
Ctse SRR065546_bam UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics 34 Expected read count visualization Scale
chr9:
8000 _ 78327000 1 kb
78327500 78328000
SRR065546 78328500 78329000 78329500 SRR065546 0_
8000 _ SRR065546_uniq SRR065546_uniq 0_
Eef1a1 UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics 35 Guiding experimental design and technology
development
• RNASeq is still relatively expensive and time consuming
• How should we guide experimental design and technology to achieve the best
quantiﬁcation accuracy?
• Simple economic model: sequencing cost proportional to number of bases
sequenced
• How long should reads be to obtain the highest accuracies?
• Tradeoff
• Large number of short reads: uncertainty due to mapping ambiguity
• Small number of long reads: uncertainty due to smaller sample size
36 Finding the optimal read length
9% 5
5 τ MPE 8%
7%
6%
5%
4% 5 5 1 1 2
4
3 3%
20 5 5
1 1 1
1 2 2 4
3 4
3
25 2
4
3 30 2
4
3 35 2
4
3 40 Read length (bases)
37 Axolotl experimental setup
Samples
Stylopod (upper arm) (3)
Zeugopod (lower arm) (3)
Autopod (hand) (3)
Digits (3)
30 day blastema (5) Three Distinct Phases of RegenerationSpeciﬁc
Gene Expression in the Axolotl Blastema
R. Stewart, J. Nie, S. Tian, C. Alexander
Rascón, M. Probasco, J. Bolin, V. Agarwal, B.
Habermann, S. Sengupta, M. Volkmer, S.
Badylak, E. Tanaka, J. Thomson, C. Dewey
Submitted
38 Humanbased analysis of axolotl transcription
Bowtie/
RSEM Axolotl RNASeq reads
Axolotl
transcript
contigs BLAST
Human
transcripts
Human
genes
39 Distinct phases of axolotl limb regeneration % of 2x Up Regulated Genes 1. 0hr1d: Oncogenes  Dedifferentiation, chromatin remodeling
2. 6hr10d: Extracellular Matrix remodeling
2. 5d14d: Blastemal Genes  Patterning
3. 21d: Patterning done? Growth? Keratins/Collagens ECM
Oncogenes Blastema Ker/Coll Time from amputation
40 Regeneration as controlled cancer P Tsonis, Limb Regeneration, 1996, Cambridge University Press Limb Regeneration  Oncogenes and tumor suppressors
“Controlled Cancer” > development and differentiation
Salamanders very resistant to tumorigenesis by carcinogens
41 Alternative splicing 42 Forms of alternative splicing 43 Alternative splicing analysis with RNASeq • RNASeq: powerful for analyzing
alternative splicing • Analysis challenges
• Genes with many isoforms • Discovery of novel splice junctions
• Nonidentiﬁability of abundances
• Precise quantiﬁcation of splice
events: low background, large
dynamic range • Difﬁculty in de novo assembly of
fulllength isoforms 44 Combinatorial explosion of distinct isoforms
• Combinatorial explosion of the number of possible isoforms for each gene
• Insufﬁcient data to accurately estimate abundances of thousands of isoforms
Scale
chr2R: 3210000
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam 3215000 3220000 3225000 20 kb
3230000 3235000
3240000
3245000
FlyBase ProteinCoding Genes 3250000 3255000 3260000 3265000 Drosophila Dscam: more than 38,000 possible isoforms
(Schmucker et al., 2000)
45 Nonidentiﬁability of fulllength isoform models
(A) 6 (B) 4 10 4
6 4 (C) 4 6 4 2 Lacroix et al. 2008; Hiller et al. 2009
46 of false nodes, resulting in a massive graph with millions from the kmer dictionary; (iii) selects the most frequent kmer in the
lbeit mostly implausible) paths.
dictionary to seed a contig assembly, excluding both lowcomplexity
resent Trinity, a method for the
robust de novo reconstruction of
es, consisting of three software
a
c
b
hworm, Chrysalis and Butterfly,
ntially to process large volumes
reads. We evaluated Trinity on
o wellannotated species—one
m (fission yeast) and one mamOverlap linear
as well as an insect (the whitefly
ATTCG
CTTCG
sequences by
i), whose genome has not yet been
TTCGC
overlaps of k – 1
each case, Trinity recovers most
to build graph
TCGCA
De Bruijn
Read set
components
ce (annotated) expressed trangraph (k = 5)
CGCAA
llength sequences, and resolves
GCAAT
forms and duplicated genes, perk–1
CAATC
CAATG
r than other available transcripAATCA
AATGA
Extend in kmer
assembly tools, and similarly to
space and
ing on genome alignments.
ATCAT
ATGAT De novo transcriptome assembly
... ... TGATC TCATC G ATCG CATCG • RNASeq reads/fragments are
relatively short
... • Often insufﬁcient to reconstruct
break ties
fulllength isoforms in the
thod for de novo
presence of alternative splicing
TGCA
e assembly
CA k–1 TC TGCA TA • Transcriptome assemblies
!
TG
A
T
perhaps best left Cin “graph” form TCGGA k–1 TC T* C CGGAT Compacting k–1 ... G Graph constructed
by the “Butterﬂy”
module of Trinity
(Grabherr et al. 2011) ATCGG ... de novo assembly of a genome,
rge connected sequence graphs
t connectivities among reads
chromosomes, in assembling
e data we expect to encounter
dividual disconnected graphs,
nting the transcriptional comnoverlapping loci. Accordingly,
ions the sequence data into these
ual graphs, and then processes
independently to extract fullms and tease apart transcripts
paralogous genes.
st step in Trinity, Inchworm
ds into the unique sequences of
nchworm (Fig. 1a) uses a greedy
approach for fast and efficient
embly, recovering only a single
entative for a set of alternative
share kmers (owing to alterna k–1 A C TTCGCAA...T
G C Compact
graph ...
...
...
... ATCGGAT... • De Bruijn graph Finding paths >a121:len = 5,845 ... ... >a122:len = 2,560 A C >a123:len = 4,443 • String graphs >a124:len = 48
>a126:len = 66 Linear sequences TTCGCAA...T
G Compact graph
with reads C ATCGGAT... Extracting sequences
...CTTCGCAA...TGATCGGAT...
...ATTCGCAA...TCATCGGAT... Transcripts
47 Our solution: Probabilistic Splice Graphs
• Splice Graphs (Heber et al. 2002)
• Compact representation of possible isoforms for a gene (A) • Statistical models with splice graphs (Jenkins et al. 2006)
• Modeling of EST data (B) (A)
(B) (A)
(B)0.8
(C) 0.2 0.4
0.6 0.32
0.48
0.08
0.12 (C)
L. Legault and C. Dewey. Inference of alternative splicing from
RNASeq data with probabilistic splice graphs. Submitted.
(C) 48 Probabilistic Splice Graph Complexity
130869000 (A) 130868500 130868000 130867500 130867000 130866500 :chr2
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4 known
isoforms (B) “line graph” (C) “exon graph” (D) “higherorder
exon graph” (E) “unfactorized
graph”
49 Advantages of PSGs
• Compact description of the possible isoforms of a gene
• Models the frequencies of potentially exponentially many isoforms with a
polynomial number of parameters
• Models dependence or independence of splice events
• The parameters of a PSG are more often identiﬁable than a model that has a
parameter for every possible isoform
• Splice graphs are naturally produced structures from transcriptome
assemblers 50 (A) The PSG parameter inference problem
(B)
• Given: RNASeq reads and a PSG structure
CCTTCNCACTTCGTTTCCCAC
TTTTTNCAGAGTTTTTTCTTG
(A)
GAACANTCCAACGCTTGGTGA
GGAAANAAGACCCTGTTGAGC
CCCGGNGATCCGCTGGGACAA
GCAGCATATTGATAGATAACT
(B)
CTAGCTACGCGTACGCGATCG
CATCTAGCATCGCGTTGCGTT (C) • Do: Estimate the (maximum likelihood) parameters for the model ? ? (C) ? ? 51 3.3 Identiﬁability of the PSG RNASeq model
)
f An important aspect of the transcript quantiﬁcation task is the identiﬁability
Identiﬁability of inference (Hiller etRNASeq data , 2008). A
of the model used for PSGs with al., 2009; Lacroix et al.
statistical model M with parameters ✓ is identiﬁable if
• Identiﬁability: P (D M, ✓ ) = P (D M, ✓ 0 ), 8D , ✓ = ✓ 0 . at
ll
el
is
h
.
o
s
n In words, for an identiﬁable model, different parameter values give rise to
• Proposition: If for all edges (u, v), there the data. read that is uniquely derived
different probability distributions over exists a Identiﬁability of isoform
from that edge, or v has a concern for technologies such read that is uniquely
quantiﬁcation models is indegree 1 and there exists a as microarrays and
derived from v, then the PSG is identiﬁable.
RNASeq because isoforms often share a large fraction of their sequence and
these technologies only probe short segments of them at a time. For example,
(A)
the frequencies of the isoforms shown in Figure 2B are not identiﬁable
given short singleend RNASeq data. In an encouraging result, Hiller
(B)
et al. (2009) found that the isoform frequencies for 97% of a subset of
alternatively spliced human genes are identiﬁable using singleend RNAnot data. However, this result was obtained using the RefSeq human gene
Seq identiﬁable
set (Pruitt et al., 2009), which is conservative and thus has a small number
of alternative isoforms for each gene. With gene sets that contain a greater
number of alternative isoforms, the percentage of genes with identiﬁable
(C)
identiﬁable
isoform frequencies is expected to decrease signiﬁcantly (Hiller et al., 2009).
52 si ,si+1 merging vertices.
i=1
Each vertex, vi , of a PSG is associated with a sequence, which we den
A transcript (isoform) is represented by a path t, with t1 = 0 and tt = M .
by i . The sequences of the start and end vertices are the empty stri
The relative abundance or probability of a transcript t is deﬁned as the weight
Each edge, (vi , vj ), in the graph has a weight ↵ij 2 [0, 1], and we requ
P
of its path, w(t). ↵ = 1. The weight, w(s), of a subpath, s, through G is
that 8i, j ij
There are a numberweights of conditional traverses: that can be computed
product of the of useful the edges it quantities
from a PSG. First, we can compute the conditional probability that vertex vj
s 1
is included in a transcript given graphsisY the transcript. We denote this
that vi in
• RSEM model extended to probabilistic splice
w (s ) =
↵si ,si+1
quantity by f (i, j ) and compute it with the recurrence
i=1
(
X
A transcript (isoform) is represented 1
• Efﬁcient inference of parameters (splice eventw(s) = by a path t, with t1 i = j tt =
frequencies) with EM = 0 and
f (i, j ) =
P
The relative abundance or probability of a transcript k) deﬁned as the wei
t is i 6= j
k ↵kj f (i,
s:s =i,s =j
of its path, w(1 ). s
t
• Dynamic programming algorithmsnumber of useful conditional quantities that can be compu
There are a → polynomial time transcripts or genes
Other useful quantities involve the lengths ofinference for subpaths. We
with an exponential number PSG.length we can compute theassociated with vertexthat vertex
denote from i the isoforms the sequence conditional probability i, i.e.,
by ` a of First, of
`i =  is included in a transcript given that vi is in the transcript. lengths of
i . The length of a subpath s is simply the sum of the We denote
P
quantity by f (i, j and its vertices: l( the recurrence . We deﬁne the
the sequences associated) with compute it with s) =
i `si
(
Probability of including preﬁx length dp (i) for vertex vi to be the expected length= jthe
expected
X
1
i of
v ( analogously, the
sufﬁx
vertex j given that subpath beginning at)v= and ending at wi ;s) = Pk ↵kj f (expected6= j
vertex i f (i, j 0
i, k) i
sis1 =i,ss =j
:s the expected length of the subpath beginning at
length dq (i) for vertex vi
was in transcript ending at v . These quantities can be calculated via the recurrences:
vi and Other useful quantities involve the lengths of transcripts or subpaths.
M
X
denote by `i the length of 1
the sequence associated with vertex i,
= `i
(0, j )↵ji dp (j
(1)
Expected preﬁx length `i = dpi(i.)The length+ fa(0, i) sfis simply theP) of the lengths

of subpath
sum
j
the sequences associated with its vertices: l(s) =
i `si . We deﬁne
X
expected (preﬁx= ` +p (i) for vertex) vi to be the expected length of
length d
dq i)
↵ij dq
(2)
Expected sufﬁx length subpath beginning at iv and ending (j v ; analogously, the expected su
at i
0
j
length dq (i) for vertex vi is the expected length of the subpath beginnin
The expected length of transcript of this gene is the expected sufﬁx length of
53 A model of RNASeq from PSGs that contains edge the iexpected number of reads that are ↵ . from a Estep of the
1
where zij is (v , vj ), given parameters derived The transcript
8
algorithm that containsthe computation of the zij↵(values, Estep of the
involves edge (vi , vj ), given parameters t) . The and the Mstep
algorithm involves the computation of the zij values, j ) 2 s
>f (0, s1 )w s
(i, and the Mstep
>
involves maximizing Equation(5.)
>
involves maximizing Equation 5.
> <f (0, i)↵ f (j, s )w(s)
ij
3.4.1 Estep Estep In the Estep, we calculate zij =if ↵E) [Ztij from vj to
calculate zij
EM jforP3.4.1 Pparameter estimation=E9(tpath][,Zwhere where s1
PSG In the Estep, we 1
ij ],
↵( )
g (s, i, ) =
Z Znij and sand Znij is f (indicatorrandom variable takes takes value
> Znij 1 ) is s) an ss , )↵ij if 9 path value
Zij = nij = fn(0,Znijw (an indicator irandom variable that thatfrom ss to vi
>
>
one when the transcript from which read nis derived includes edge j ). , v ).
>
one when the transcript from which read n is derived includes edge (vi , v (vi j
:0
The expected value of Znij is computed as
otherwise
The expected value of Z
is computed as • Estep: compute the expectation ofP number of times edge (i,j) is used
the
nij
(b,s
P P )2⇡(r) g(s, i, j )
E [Znij ]to a small number)of positions within the
=
(6)
Assuming that each read aligns (b,s)(2⇡)(r⇡)(rg (s, )i, j
b,s 2
) g (s
E only =
(6)
PSG, the Estep requires[Znij ] O(NP ) time as (all of the f (i, j ) values can
E
where
(b,s)2⇡ (r ) g s) be precomputed atgthe beginningwof) the Estep using dynamic programming.
(s) = f (0, s ) (s 1
possible where
isoform
8
for the gene is
> (0 s1
(i, ) 2 s
> expected
Given, thefw(,s) )w(s) Zij values jfrom the expectation step,
>
>
assume that theg (s) = f (0 s1 )
<f (0, i)↵ f (j, s )w(s)
m
if 9 path from vj to s1
1
ij
a readmodel parameters =
identiﬁes
g (s, i, j )8 must now be adjusted to reﬂect them. With the
the
>
from
P f (0, sf1(0w(sw(s)f (ss , i)↵ij if(9 path 2 s ss to vi
is four possible
> ) , s1 ) )
he
>
>
i, j )
>
:
constraint that 8i, > ↵ij 0 = 1, it can be otherwise that Equation 5 is
shown
>
>
or the gene.
e
j
<f (0, i)↵ f (j, s )w(s)
path from v to s
ij
maximizedi,maximize the completelyobservednumberifof9positions withinjedge1counts
j,
s • Mstep: when, 8i, that each read aligns to1a small likelihood given the the
g (s, j ) Assuming
=
>f (0, s1 ) only O N s ) time as of the f (i, from s s
,
>
PSG, the Estep requiresw (s)f((sEz i)↵ij all if 9 path j ) valuescan to vi
le
>
>
ij
plicitly modeling
:0
be precomputed at the beginning of the Estep using dynamic programming.
otherwise
(dp (i)+dq (j ))
events that allow
↵ Given P
(7)
3.4.2 Mstep ij = the expected Zij zik from the expectation step,
values
te some general
Assuming the model parameters must to a(d (i)+d to k)) positions within the
that each read aligns now small number of them. With the
k be adjusted q ( reﬂect
ble. A convenient
p
P
constraint that 8i, j ij = 1, it can as all of the Equation 5 is
ecessary, PSG, the Estep requires only↵O (N E ) time be shown that f (i, j ) values can
for the
g
maximized when, 8i, j ,
n.
Thus, be precomputed at the beginning of the Estep usingdirectly proportional to the
the maximum likelihood estimate for ↵ is dynamic programming. 3.4.2 w
al Mstep onical form G = times
number of ij zij
(dp (i)+dq (j )) the edge is used, andZ values from the expectation step,
inversely proportional to the average
3.4.2 Mstep Given the expected ij
↵ij = P
(7) 54 Efﬁcient inference for highlyspliced genes
Scale
chr2R: 3210000 • DSCAM running time test
• 23,976 isoforms
• Simulated 10 reads Method
Running time 3215000 3220000 3225000 20 kb
3230000 Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam
Dscam 3235000
3240000
3245000
FlyBase ProteinCoding Genes 3250000 3255000 3260000 3265000 RSEM Cufﬂinks Line graph PSG Not possible > 15 hours
(> 50 GB RAM) < 1 second 55 Next steps for modeling RNASeq with PSGs
• Graph construction
130869000 • Exon discovery
• Splice junction
discovery (A) 130869000 130868500 130868500 130868000 130868000 130867500 130867500 (A)
(B) 130867000 130867000 130866500 130866500 :chr2
Gfra4
Gfra4
Gfra4
Gfra4
:chr2
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4
Gfra4 (B) • Model selection (C)
(C) • Learning
dependencies
between splice
events (D) or (D) (E)
(E) 56 Summary
• RNASeq is likely the future of transcriptome analysis
• The major challenge in analyzing RNASeq data: the reads are much shorter
than the transcripts from which they are derived
• Tasks with RNASeq data thus require handling hidden information: which
gene/isoform gave rise to a given read
• The ExpectationMaximization algorithm is extremely powerful in these
situations
• Alternative splicing complicates matters further
• Probabilistic splice graphs are compact and efﬁcient models for RNASeq
data with alternatively spliced genes (dynamic programming!)
57 Acknowledgements Bo Li Nate Fillmore Laura LeGault James Thomson & Lab Thomson Lab Bioinformatics Team
(in particular, Ron Stewart & Victor Ruotti) NIH/NHGRI
R01HG005232
58 ...
View
Full Document
 Fall '11
 Staff
 DNA, PSG, Dscam

Click to edit the document details