rnaseq - Measuring transcriptomes with RNA-Seq BMI/CS 576...

Info iconThis preview shows page 1. Sign up to view the full content.

View Full Document Right Arrow Icon
This is the end of the preview. Sign up to access the rest of the document.

Unformatted text preview: Measuring transcriptomes with RNA-Seq BMI/CS 576 November 17, 2011 Colin Dewey [email protected] 1 Overview • Motivation - Axolotl • The transcriptome • RNA-Seq technology • The RNA-Seq quantification problem • Handling alternative splicing: probabilistic splice graphs 2 What I want you to get from this lecture • What is RNA-Seq? • How is RNA-Seq used to measure the abundances of RNAs within cells? • What probabilistic models and algorithms are used for analyzing RNA-Seq? 3 Collaborators Axolotl James Thomson Ron Stewart Regenerative Biology Laboratory, Morgridge Institute for Research, Madison, WI 4 Axolotl background • Ambystoma mexicanum • Neotenous • Natural habitats • Regenerative abilities • Lake Xochimilco (canals) • Limbs • Lake Chalco (drained) • Portions of Heart • Endangered • Portions of Brain • Commonly sold as pets • Tail and spinal cord 5 Axolotl limb regeneration David Gardiner - HHMI-UCI 6 Goals • What are the axolotl genes that are responsible for this remarkable regenerative ability? • Can this knowledge improve our medical treatments of severe wounds and tissue regeneration? 7 Challenges with genomic studies of Axolotl • No genome sequence available • genome estimated to be 10x larger than that of human 340 mya • Distantly related to other model organisms 280 mya Villiard et al. 2007 8 The Central Dogma 9 Transcription • RNA that is transcribed from a gene is called messenger RNA (mRNA) • RNA polymerase is the enzyme that builds an RNA molecule from a gene 10 Transcription: DNA→RNA 11 Measuring transcription: Microarrays • Each spot has “probes” for a certain gene • Probe: a DNA sequence complementary to a certain gene • Relies on complementary hybridization • Intensity/color of light from each spot is measurement of the number of transcripts for a certain gene in a sample • Requires knowledge of gene sequences 12 Prior gene expression studies in Axolotl • Microarrays • Exist, but not very complete • Limited amount of mRNA sequence data from Axolotl • No genome, so can’t use predicted gene sequences 13 RNA-Seq technology • Leverages rapidly advancing sequencing technology (e.g., Illumina, SOLiD) • Transcriptome analog to whole genome shotgun sequencing • Two key differences: 1. Transcripts sequenced at different levels of coverage - expression levels 2. Sequences already known (in many cases) - coverage is measurement 14 Advantages of RNA-Seq over microarrays • No reference sequence needed • With microarrays, limited to the probes on the chip • Low background noise • Large dynamic range • 105 compared to 102 for microarrays • High technical reproducibility 15 RNA-Seq protocol Amplified cDNA Sample RNA reverse transcription + PCR cDNA fragments reads CCTTCNCACTTCGTTTCCCAC fragmentation sequencing machine TTTTTNCAGAGTTTTTTCTTG GAACANTCCAACGCTTGGTGA GGAAANAAGACCCTGTTGAGC CCCGGNGATCCGCTGGGACAA GCAGCATATTGATAGATAACT CTAGCTACGCGTACGCGATCG CATCTAGCATCGCGTTGCGTT CCCGCGCGCTTAGGCTACTCG TCACACATCTCTAGCTAGCAT CATGCTAGCTATGCCTATCTA CACCCCGGGGATATATAGGAT 16 RNA-Seq data @HWUSI-EAS1789_0001:3:2:1708:1305#0/1 CCTTCNCACTTCGTTTCCCACTTAGCGATAATTTG +HWUSI-EAS1789_0001:3:2:1708:1305#0/1 VVULVBVYVYZZXZZ\ee[a^b`[a\a[\\a^^^\ @HWUSI-EAS1789_0001:3:2:2062:1304#0/1 TTTTTNCAGAGTTTTTTCTTGAACTGGAAATTTTT +HWUSI-EAS1789_0001:3:2:2062:1304#0/1 a__[\Bbbb`edeeefd`cc`b]bffff`ffffff @HWUSI-EAS1789_0001:3:2:3194:1303#0/1 GAACANTCCAACGCTTGGTGAATTCTGCTTCACAA +HWUSI-EAS1789_0001:3:2:3194:1303#0/1 ZZ[[VBZZY][TWQQZ\ZS\[ZZXV__\OX`a[ZZ @HWUSI-EAS1789_0001:3:2:3716:1304#0/1 GGAAANAAGACCCTGTTGAGCTTGACTCTAGTCTG +HWUSI-EAS1789_0001:3:2:3716:1304#0/1 aaXWYBZVTXZX_]Xdccdfbb_\`a\aY_^]LZ^ @HWUSI-EAS1789_0001:3:2:5000:1304#0/1 CCCGGNGATCCGCTGGGACAAGCAGCATATTGATA +HWUSI-EAS1789_0001:3:2:5000:1304#0/1 aaaaaBeeeeffffehhhhhhggdhhhhahhhadh name sequence qualities read paired-end reads read1 read2 1 Illumina (GAIIX) lane ~20 million reads 17 Tasks with RNA-Seq data • Assembly: Given: RNA-Seq reads (and possibly a genome sequence) • Do: reconstruct full-length transcript sequences from the reads • Quantification: • Given: RNA-Seq reads and transcript sequences • Do: Estimate the relative abundances of transcripts (“gene expression”) • Differential expression: • Given: RNA-Seq reads from two different samples and transcript sequences • Do: Predict which transcripts have different abundances between the two samples 18 The basics of quantification with RNA-Seq data • For simplicity, suppose reads are of length one (typically they are > 35 bases) transcripts 200 1 2 3 60 80 reads 100 A 60 C 40 G • What relative abundances would you estimate for these genes? 19 Length dependence • probability of a read coming from a transcript ∝!relative abundance × length transcripts 200 1 2 3 ˆ f1 / 60 80 100 200 1 = 200 400 60 200 1 ˆ f2 / = 60 200 40 200 1 ˆ f3 / = 80 400 reads 100 A 60 C 40 G ˆ f1 = 0.25 ˆ f2 = 0.5 ˆ f3 = 0.25 20 What if reads do not uniquely map to transcripts? • “multiread”: a read that could have been derived from multiple transcripts transcripts 200 1 2 3 60 80 reads 90 A 40 C 40 G 30 T • How would you estimate the relative abundances for these transcripts? 21 Some options for handling multireads • Discard all multireads, estimate based on uniquely mapping reads only • Discard multireads, but use “unique length” of each transcript in calculations • “Rescue” multireads by allocating (fractions of) them to the transcripts • Three step algorithm 1. Estimate abundances based on uniquely mapping reads only 2. For each multiread, divide it between the transcripts to which it maps, proportionally to their abundances estimated in the first step 3.Recompute abundances based on updated counts for each transcript 22 Rescue method example - Step 1 transcripts 3 90 A 200 1 2 reads 60 40 C 80 40 G 30 T Step 1 ˆunique = f1 90 200 90 + 40 200 60 + 40 80 = 0.278 ˆunique = 0.412 f2 ˆunique = 0.309 f3 23 Rescue method example - Step 2 transcripts 3 90 A 200 1 2 reads 60 40 C 80 40 G 30 T Step 2 rescue c1 rescue c2 rescue c3 0.278 = 90 + 30 ⇥ = 102.1 0.278 + 0.412 0.412 = 40 + 30 ⇥ = 57.9 0.278 + 0.412 = 40 + 0 = 40 24 Rescue method example - Step 3 transcripts 3 90 A 200 1 2 reads 60 40 C 80 40 G 30 T Step 3 ˆrescue = f1 ˆrescue = f2 ˆrescue = f3 102.1 200 102.1 200 102.1 200 102.1 200 + 57.9 60 57.9 60 + 57.9 60 40 80 + 57.9 60 + 40 80 + 40 80 + 40 80 = 0.258 = 0.488 = 0.253 25 Are multireads really a problem? Data set % unmapped % unique % multireads % filtered Mouse liver (Mortazavi et al. 2008) 46.2 44.4 9.2 0.2 Maize simulation 47.5 25.0 27.1 0.4 25 base reads, 2 mismatches allowed • Still an issue with longer and paired reads • mouse 75 base reads: 10% multireads (single-end), 8% (paired-end) • Multireads arise due to homology, not chance similarity 26 An observation about the rescue method • Note that at the end of the rescue algorithm, we have an updated set of abundance estimates • These new estimates could be used to reallocate the multireads • And then we could update our abundance estimates once again • And repeat! • This is the intuition behind the statistical approach to this problem 27 Our solution - a generative probabilistic model transcript probabilities (expression levels) ✓ N number of reads Gn transcript fragment length start position Fn L1 n read length orientation Sn L2 n paired read Q1 n quality scores Q2 n On read sequence 1 Rn P (g, f , s, o, ⇥, q, r| ) = N Y n=1 2 Rn P ( g n | ) P ( f n | g n ) P ( s n | f n , g n ) P ( o n | g n ) P ( q n ) P ( ⇥ n | f n ) P ( r n | g n , f n , s n , o n , ⇥ n , qn ) 28 Quantification as maximum likelihood inference • Observed data likelihood P ( r, ⇥ , q | ) = NM YX n=1 i=0 i Li Li 1 XXX j =0 k=0 o=0 P (Rn = rn , Ln = ⇥n , Qn = qn , Sn = j, Fn = k, On = o|Gn = i) • Likelihood function is concave w.r.t. θ • Has a global maximum (or global maxima) • Expectation-Maximization for optimization “RNA-Seq gene expression estimation with read mapping uncertainty” Li, B., Ruotti, V., Stewart, R., Thomson, J., Dewey, C. Bioinformatics, 2010 29 Approximate inference with read alignments P ( r, ⇥ , q | ) = NM YX i n=1 i=0 Li Li 1 XXX j =0 k=0 o=0 P (Rn = rn , Ln = ⇥n , Qn = qn , Sn = j, Fn = k, On = o|Gn = i) • Full likelihood computation requires O(NML2) time • N (number of reads) ~ 107 • M (number of transcripts) ~ 104 • L (average transcript length) ~ 103 • Approximate by alignment P ( r, ⇥ , q | ) = N Y X n=1 (i,j,k,o)2 i P ( Rn x n = rn , Ln = ⇥n , Qn = qn , Znijko = 1|Gn = i) all local alignments of read n with at most x mismatches 30 HMM Interpretation transcript 1 start ✓1 ✓2 ✓3 transcript 2 transcript 3 ... hidden: read start positions observed: read sequences ... ✓M transcript M Learning parameters: Baum-Welch Algorithm (EM for HMMs) Approximation: Only consider a subset of paths for each read 31 EM Algorithm • Expectation-Maximization for RNA-Seq • E-step: Compute expected read counts given current expression levels • M-step: Compute expression values maximizing likelihood given expected read counts • Rescue algorithm ≈ 1 iteration of EM 32 predicted expression level Improved accuracy over unique and rescue true expression level Gene-level expression estimation 33 Probabilistically-weighted alignments Scale chr1: 10 _ 133546350 133546400 100 bases 133546450 133546500 133546550 SRR065546 133546600 133546650 133546700 SRR065546 0_ env Ctse SRR065546_bam UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics 34 Expected read count visualization Scale chr9: 8000 _ 78327000 1 kb 78327500 78328000 SRR065546 78328500 78329000 78329500 SRR065546 0_ 8000 _ SRR065546_uniq SRR065546_uniq 0_ Eef1a1 UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics 35 Guiding experimental design and technology development • RNA-Seq is still relatively expensive and time consuming • How should we guide experimental design and technology to achieve the best quantification accuracy? • Simple economic model: sequencing cost proportional to number of bases sequenced • How long should reads be to obtain the highest accuracies? • Tradeoff • Large number of short reads: uncertainty due to mapping ambiguity • Small number of long reads: uncertainty due to smaller sample size 36 Finding the optimal read length 9% 5 5 τ MPE 8% 7% 6% 5% 4% 5 5 1 1 2 4 3 3% 20 5 5 1 1 1 1 2 2 4 3 4 3 25 2 4 3 30 2 4 3 35 2 4 3 40 Read length (bases) 37 Axolotl experimental setup Samples Stylopod (upper arm) (3) Zeugopod (lower arm) (3) Autopod (hand) (3) Digits (3) 30 day blastema (5) Three Distinct Phases of Regeneration-Specific Gene Expression in the Axolotl Blastema R. Stewart, J. Nie, S. Tian, C. Alexander Rascón, M. Probasco, J. Bolin, V. Agarwal, B. Habermann, S. Sengupta, M. Volkmer, S. Badylak, E. Tanaka, J. Thomson, C. Dewey Submitted 38 Human-based analysis of axolotl transcription Bowtie/ RSEM Axolotl RNASeq reads Axolotl transcript contigs BLAST Human transcripts Human genes 39 Distinct phases of axolotl limb regeneration % of 2x Up Regulated Genes 1. 0hr-1d: Oncogenes -- De-differentiation, chromatin remodeling 2. 6hr-10d: Extracellular Matrix remodeling 2. 5d-14d: Blastemal Genes -- Patterning 3. 21d: Patterning done? Growth? Keratins/Collagens ECM Oncogenes Blastema Ker/Coll Time from amputation 40 Regeneration as controlled cancer P Tsonis, Limb Regeneration, 1996, Cambridge University Press Limb Regeneration -- Oncogenes and tumor suppressors “Controlled Cancer” --> development and differentiation Salamanders very resistant to tumorigenesis by carcinogens 41 Alternative splicing 42 Forms of alternative splicing 43 Alternative splicing analysis with RNA-Seq • RNA-Seq: powerful for analyzing alternative splicing • Analysis challenges • Genes with many isoforms • Discovery of novel splice junctions • Non-identifiability of abundances • Precise quantification of splice events: low background, large dynamic range • Difficulty in de novo assembly of full-length isoforms 44 Combinatorial explosion of distinct isoforms • Combinatorial explosion of the number of possible isoforms for each gene • Insufficient data to accurately estimate abundances of thousands of isoforms Scale chr2R: 3210000 Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam 3215000 3220000 3225000 20 kb 3230000 3235000 3240000 3245000 FlyBase Protein-Coding Genes 3250000 3255000 3260000 3265000 Drosophila Dscam: more than 38,000 possible isoforms (Schmucker et al., 2000) 45 Non-identifiability of full-length isoform models (A) 6 (B) 4 10 4 6 4 (C) 4 6 4 2 Lacroix et al. 2008; Hiller et al. 2009 46 of false nodes, resulting in a massive graph with millions from the k-mer dictionary; (iii) selects the most frequent k-mer in the lbeit mostly implausible) paths. dictionary to seed a contig assembly, excluding both low-complexity resent Trinity, a method for the robust de novo reconstruction of es, consisting of three software a c b hworm, Chrysalis and Butterfly, ntially to process large volumes reads. We evaluated Trinity on o well-annotated species—one m (fission yeast) and one mamOverlap linear as well as an insect (the whitefly ATTCG CTTCG sequences by i), whose genome has not yet been TTCGC overlaps of k – 1 each case, Trinity recovers most to build graph TCGCA De Bruijn Read set components ce (annotated) expressed trangraph (k = 5) CGCAA l-length sequences, and resolves GCAAT forms and duplicated genes, perk–1 CAATC CAATG r than other available transcripAATCA AATGA Extend in k-mer assembly tools, and similarly to space and ing on genome alignments. ATCAT ATGAT De novo transcriptome assembly ... ... TGATC TCATC G ATCG CATCG • RNA-Seq reads/fragments are relatively short ... • Often insufficient to reconstruct break ties full-length isoforms in the thod for de novo presence of alternative splicing TGCA e assembly CA k–1 TC TGCA TA • Transcriptome assemblies ! TG A T perhaps best left Cin “graph” form TCGGA k–1 TC T* C CGGAT Compacting k–1 ... G Graph constructed by the “Butterfly” module of Trinity (Grabherr et al. 2011) ATCGG ... de novo assembly of a genome, rge connected sequence graphs t connectivities among reads chromosomes, in assembling e data we expect to encounter dividual disconnected graphs, nting the transcriptional comnoverlapping loci. Accordingly, ions the sequence data into these ual graphs, and then processes independently to extract fullms and tease apart transcripts paralogous genes. st step in Trinity, Inchworm ds into the unique sequences of nchworm (Fig. 1a) uses a greedy approach for fast and efficient embly, recovering only a single entative for a set of alternative share k-mers (owing to alterna- k–1 A C TTCGCAA...T G C Compact graph ... ... ... ... ATCGGAT... • De Bruijn graph Finding paths >a121:len = 5,845 ... ... >a122:len = 2,560 A C >a123:len = 4,443 • String graphs >a124:len = 48 >a126:len = 66 Linear sequences TTCGCAA...T G Compact graph with reads C ATCGGAT... Extracting sequences ...CTTCGCAA...TGATCGGAT... ...ATTCGCAA...TCATCGGAT... Transcripts 47 Our solution: Probabilistic Splice Graphs • Splice Graphs (Heber et al. 2002) • Compact representation of possible isoforms for a gene (A) • Statistical models with splice graphs (Jenkins et al. 2006) • Modeling of EST data (B) (A) (B) (A) (B)0.8 (C) 0.2 0.4 0.6 0.32 0.48 0.08 0.12 (C) L. Legault and C. Dewey. Inference of alternative splicing from RNA-Seq data with probabilistic splice graphs. Submitted. (C) 48 Probabilistic Splice Graph Complexity 130869000 (A) 130868500 130868000 130867500 130867000 130866500 :chr2 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 known isoforms (B) “line graph” (C) “exon graph” (D) “higher-order exon graph” (E) “unfactorized graph” 49 Advantages of PSGs • Compact description of the possible isoforms of a gene • Models the frequencies of potentially exponentially many isoforms with a polynomial number of parameters • Models dependence or independence of splice events • The parameters of a PSG are more often identifiable than a model that has a parameter for every possible isoform • Splice graphs are naturally produced structures from transcriptome assemblers 50 (A) The PSG parameter inference problem (B) • Given: RNA-Seq reads and a PSG structure CCTTCNCACTTCGTTTCCCAC TTTTTNCAGAGTTTTTTCTTG (A) GAACANTCCAACGCTTGGTGA GGAAANAAGACCCTGTTGAGC CCCGGNGATCCGCTGGGACAA GCAGCATATTGATAGATAACT (B) CTAGCTACGCGTACGCGATCG CATCTAGCATCGCGTTGCGTT (C) • Do: Estimate the (maximum likelihood) parameters for the model ? ? (C) ? ? 51 3.3 Identifiability of the PSG RNA-Seq model ) f An important aspect of the transcript quantification task is the identifiability Identifiability of inference (Hiller etRNA-Seq data , 2008). A of the model used for PSGs with al., 2009; Lacroix et al. statistical model M with parameters ✓ is identifiable if • Identifiability: P (D |M, ✓ ) = P (D |M, ✓ 0 ), 8D , ✓ = ✓ 0 . at ll el is h . o s n In words, for an identifiable model, different parameter values give rise to • Proposition: If for all edges (u, v), there the data. read that is uniquely derived different probability distributions over exists a Identifiability of isoform from that edge, or v has a concern for technologies such read that is uniquely quantification models is indegree 1 and there exists a as microarrays and derived from v, then the PSG is identifiable. RNA-Seq because isoforms often share a large fraction of their sequence and these technologies only probe short segments of them at a time. For example, (A) the frequencies of the isoforms shown in Figure 2B are not identifiable given short single-end RNA-Seq data. In an encouraging result, Hiller (B) et al. (2009) found that the isoform frequencies for 97% of a subset of alternatively spliced human genes are identifiable using single-end RNAnot data. However, this result was obtained using the RefSeq human gene Seq identifiable set (Pruitt et al., 2009), which is conservative and thus has a small number of alternative isoforms for each gene. With gene sets that contain a greater number of alternative isoforms, the percentage of genes with identifiable (C) identifiable isoform frequencies is expected to decrease significantly (Hiller et al., 2009). 52 si ,si+1 merging vertices. i=1 Each vertex, vi , of a PSG is associated with a sequence, which we den A transcript (isoform) is represented by a path t, with t1 = 0 and t|t| = M . by i . The sequences of the start and end vertices are the empty stri The relative abundance or probability of a transcript t is defined as the weight Each edge, (vi , vj ), in the graph has a weight ↵ij 2 [0, 1], and we requ P of its path, w(t). ↵ = 1. The weight, w(s), of a subpath, s, through G is that 8i, j ij There are a numberweights of conditional traverses: that can be computed product of the of useful the edges it quantities from a PSG. First, we can compute the conditional probability that vertex vj |s| 1 is included in a transcript given graphsisY the transcript. We denote this that vi in • RSEM model extended to probabilistic splice w (s ) = ↵si ,si+1 quantity by f (i, j ) and compute it with the recurrence i=1 ( X A transcript (isoform) is represented 1 • Efficient inference of parameters (splice eventw(s) = by a path t, with t1 i = j t|t| = frequencies) with EM = 0 and f (i, j ) = P The relative abundance or probability of a transcript k) defined as the wei t is i 6= j k ↵kj f (i, s:s =i,s =j of its path, w(1 ). |s| t • Dynamic programming algorithmsnumber of useful conditional quantities that can be compu There are a → polynomial time transcripts or genes Other useful quantities involve the lengths ofinference for subpaths. We with an exponential number PSG.length we can compute theassociated with vertexthat vertex denote from i the isoforms the sequence conditional probability i, i.e., by ` a of First, of `i = | is |included in a transcript given that vi is in the transcript. lengths of i . The length of a subpath s is simply the sum of the We denote P quantity by f (i, j and its vertices: l( the recurrence . We define the the sequences associated) with compute it with s) = i `si ( Probability of including prefix length dp (i) for vertex vi to be the expected length= jthe expected X 1 i of v ( analogously, the suffix vertex j given that subpath beginning at)v= and ending at wi ;s) = Pk ↵kj f (expected6= j vertex i f (i, j 0 i, k) i sis1 =i,s|s| =j :s the expected length of the subpath beginning at length dq (i) for vertex vi was in transcript ending at v . These quantities can be calculated via the recurrences: vi and Other useful quantities involve the lengths of transcripts or subpaths. M X denote by `i the length of 1 the sequence associated with vertex i, = `i (0, j )↵ji dp (j (1) Expected prefix length `i = dpi(|i.)The length+ fa(0, i) sfis simply theP) of the lengths | of subpath sum j the sequences associated with its vertices: l(s) = i `si . We define X expected (prefix= ` +p (i) for vertex) vi to be the expected length of length d dq i) ↵ij dq (2) Expected suffix length subpath beginning at iv and ending (j v ; analogously, the expected su at i 0 j length dq (i) for vertex vi is the expected length of the subpath beginnin The expected length of transcript of this gene is the expected suffix length of 53 A model of RNA-Seq from PSGs that contains edge the iexpected number of reads that are ↵ . from a E-step of the 1 where zij is (v , vj ), given parameters derived The transcript 8 algorithm that containsthe computation of the zij↵(values, E-step of the involves edge (vi , vj ), given parameters t) . The and the M-step algorithm involves the computation of the zij values, j ) 2 s >f (0, s1 )w s (i, and the M-step > involves maximizing Equation(5.) > involves maximizing Equation 5. > <f (0, i)↵ f (j, s )w(s) ij 3.4.1 E-step E-step In the E-step, we calculate zij =if ↵E) [Ztij from vj to calculate zij EM jforP3.4.1 Pparameter estimation=E9(tpath][,Zwhere where s1 PSG In the E-step, we 1 ij ], ↵( ) g (s, i, ) = Z Znij and sand Znij is f (indicatorrandom variable takes takes value > Znij 1 ) is s) an s|s| , )↵ij if 9 path value Zij = nij = fn(0,Znijw (an indicator irandom variable that thatfrom s|s| to vi > > one when the transcript from which read nis derived includes edge j ). , v ). > one when the transcript from which read n is derived includes edge (vi , v (vi j :0 The expected value of Znij is computed as otherwise The expected value of Z is computed as • E-step: compute the expectation ofP number of times edge (i,j) is used the nij (b,s P P )2⇡(r) g(s, i, j ) E [Znij ]to a small number)of positions within the = (6) Assuming that each read aligns (b,s)(2⇡)(r⇡)(rg (s, )i, j b,s 2 ) g (s E only = (6) PSG, the E-step requires[Znij ] O(NP |) time as (all of the f (i, j ) values can |E where (b,s)2⇡ (r ) g s) be precomputed atgthe beginningwof) the E-step using dynamic programming. (s) = f (0, s ) (s 1 possible where isoform 8 for the gene is > (0 s1 (i, ) 2 s > expected Given, thefw(,s) )w(s) Zij values jfrom the expectation step, > > assume that theg (s) = f (0 s1 ) <f (0, i)↵ f (j, s )w(s) m if 9 path from vj to s1 1 ij a readmodel parameters = identifies g (s, i, j )8 must now be adjusted to reflect them. With the the > from P f (0, sf1(0w(sw(s)f (s|s| , i)↵ij if(9 path 2 s s|s| to vi is four possible > ) , s1 ) ) he > > i, j ) > : constraint that 8i, > ↵ij 0 = 1, it can be otherwise that Equation 5 is shown > > or the gene. e j <f (0, i)↵ f (j, s )w(s) path from v to s ij maximizedi,maximize the completely-observednumberifof9positions withinjedge1counts j, s • M-step: when, 8i, that each read aligns to1a small likelihood given the the g (s, j ) Assuming = >f (0, s1 ) only O N |s |) time as of the f (i, from s s| , > PSG, the E-step requiresw (s)f((s|E|z i)↵ij all if 9 path j ) values|can to vi le > > ij plicitly modeling :0 be precomputed at the beginning of the E-step using dynamic programming. otherwise (dp (i)+dq (j )) events that allow ↵ Given P (7) 3.4.2 M-step ij = the expected Zij zik from the expectation step, values te some general Assuming the model parameters must to a(d (i)+d to k)) positions within the that each read aligns now small number of them. With the k be adjusted q ( reflect ble. A convenient p P constraint that 8i, j ij = 1, it can as all of the Equation 5 is ecessary, PSG, the E-step requires only↵O (N |E |) time be shown that f (i, j ) values can for the g maximized when, 8i, j , n. Thus, be precomputed at the beginning of the E-step usingdirectly proportional to the the maximum likelihood estimate for ↵ is dynamic programming. 3.4.2 w al M-step onical form G = times number of ij zij (dp (i)+dq (j )) the edge is used, andZ values from the expectation step, inversely proportional to the average 3.4.2 M-step Given the expected ij ↵ij = P (7) 54 Efficient inference for highly-spliced genes Scale chr2R: 3210000 • DSCAM running time test • 23,976 isoforms • Simulated 10 reads Method Running time 3215000 3220000 3225000 20 kb 3230000 Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam Dscam 3235000 3240000 3245000 FlyBase Protein-Coding Genes 3250000 3255000 3260000 3265000 RSEM Cufflinks Line graph PSG Not possible > 15 hours (> 50 GB RAM) < 1 second 55 Next steps for modeling RNA-Seq with PSGs • Graph construction 130869000 • Exon discovery • Splice junction discovery (A) 130869000 130868500 130868500 130868000 130868000 130867500 130867500 (A) (B) 130867000 130867000 130866500 130866500 :chr2 Gfra4 Gfra4 Gfra4 Gfra4 :chr2 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 Gfra4 (B) • Model selection (C) (C) • Learning dependencies between splice events (D) or (D) (E) (E) 56 Summary • RNA-Seq is likely the future of transcriptome analysis • The major challenge in analyzing RNA-Seq data: the reads are much shorter than the transcripts from which they are derived • Tasks with RNA-Seq data thus require handling hidden information: which gene/isoform gave rise to a given read • The Expectation-Maximization algorithm is extremely powerful in these situations • Alternative splicing complicates matters further • Probabilistic splice graphs are compact and efficient models for RNA-Seq data with alternatively spliced genes (dynamic programming!) 57 Acknowledgements Bo Li Nate Fillmore Laura LeGault James Thomson & Lab Thomson Lab Bioinformatics Team (in particular, Ron Stewart & Victor Ruotti) NIH/NHGRI R01HG005232 58 ...
View Full Document

{[ snackBarMessage ]}

Ask a homework question - tutors are online