MIT6_047f08_lec04_slide04

MIT OpenCourseWare, http://ocw.mit.edu
6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution (Fall 2008)
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Clustering
Lecture 3, September 16, 2008

Structure in High-Dimensional Data
(Figure ©2007 IEEE. Used with permission. Gyulassy, Attila, et al. "Topologically Clean Distance Fields." IEEE Transactions on Visualization and Computer Graphics 13, no. 6 (2007): 1432-1439.)
• Structure can be used to reduce the dimensionality of data
• Structure can tell us something useful about the underlying phenomena
• Structure can be used to make inferences about new data

Clustering vs. Classification
Objects are characterized by one or more features (e.g. expression in Experiment 1 vs. expression in Experiment 2).
• Classification
 – We have labels for some points
 – We want a "rule" that will accurately assign labels to new points
 – Supervised learning
• Clustering
 – No labels
 – Group points into clusters based on how "near" they are to one another
 – Identify structure in the data
 – Unsupervised learning

Today
• Microarray data
• K-means clustering
• Expectation Maximization
• Hierarchical clustering

Central Dogma
DNA → mRNA → protein → phenotype
We can measure the amount of mRNA for every gene in a cell.

Expression Microarrays
• A way to measure the level of mRNA for every gene
• Two basic types
 – Affymetrix gene chips
 – Spotted oligonucleotides
• Both work on the same principle
 – Put a DNA probe on a slide
 – Rely on complementary hybridization
• To measure the levels of mRNA messages in a cell: extract the RNA, reverse-transcribe (RT) it into cDNA, hybridize the cDNA to the gene probes on the slide, and measure the signal at each spot

Expression Microarray Data Matrix
• Genes are typically given as rows (m genes)
• Experiments are given as columns (n experiments)

Clustering and Classification in Genomics
• Classification
 – Microarray data: classify cell state (e.g.
AML vs ALL) using expression data
 – Protein/gene sequences: predict function, localization, etc.
• Clustering
 – Microarray data: groups of genes that share similar function have similar expression patterns; use this to identify regulons
 – Protein sequences: group related proteins to infer function
 – EST data: collapse redundant sequences

Clustering Expression Data
• Cluster genes: group genes (rows) by similar expression across different conditions
• Cluster experiments: group experiments (columns) by similar expression profiles

Why Cluster Genes by Expression?
• Data exploration
 – Summarize the data
 – Explore without getting lost in individual data points
 – Enhance visualization
• Co-regulated genes
 – Common expression may imply common regulation (e.g. GCN4 and the amino-acid genes His2, His3)
 – Predict cis-regulatory promoter sequences
• Functional annotation
 – Infer similar function from similar expression

Clustering Algorithms
• Partitioning: divide objects into non-overlapping clusters, such that each data object is in exactly one subset
• Agglomerative: build a set of nested clusters organized as a hierarchy

K-Means Clustering: The Basic Idea
• Assume a fixed number of clusters, K
• Goal: create "compact" clusters

More formally:
1. Initialize K centers μ_k.
2. Assign each x_i the label of the nearest center, where the distance between x_i and μ_k is d_{i,k} = (x_i − μ_k)².
3. Move each μ_k to the centroid of the points with that label:
   μ_k(n+1) = (1/N_k) ∑_{x_i with label k} x_i,  where N_k = #{x_i with label k}.
4. Repeat steps 2-3 for each iteration n until convergence.

Cost Criterion
We can think of K-means as trying to create clusters that minimize a cost criterion associated with the size of each cluster:
   COST(x_1, x_2, x_3, …, x_n) = ∑_k ∑_{x_i with label k} (x_i − μ_k)²
Minimizing this means minimizing each cluster term separately:
   ∑_{x_i with label k} (x_i − μ_k)² = ∑ x_i² − 2μ_k ∑ x_i + N_k μ_k²
The optimum is μ_k = (1/N_k) ∑_{x_i with label k} x_i, the centroid.

Fuzzy K-Means
• Initialize K centers μ_k
• For each point, calculate the probability of membership in each category: P(label k | x_i, μ_k)
• Move each μ_k to the weighted centroid:
   μ_k(n+1) = ∑_i x_i P(μ_k | x_i)^b / ∑_i P(μ_k | x_i)^b
• Iterate
Of course, K-means is just the special case with hard memberships:
   P(label k | x_i, μ_k) = 1 if x_i is closest to μ_k, 0 otherwise.

K-Means as a Generative Model
A model of P(X, labels): samples are drawn from equally likely normal distributions with unit variance, i.e. a Gaussian mixture model:
   P(x_i | μ_j) = (1/√(2π)) exp{−(x_i − μ_j)² / 2}

Unsupervised Learning
Given only the samples x_i, can we learn both the centers μ_k and the labels?

If We Have Labeled Points
We need to estimate the unknown Gaussian centers from the data. In general, how could we do this? How could we estimate the "best" μ_k? Choose the μ_k that maximize the probability of the model (maximum likelihood).
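The assign-then-recenter loop of K-means described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the course; the toy data points, the random-data-point initialization, and the convergence test (stop when no label changes) are all assumptions of this sketch.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate (2) nearest-center labeling and
    (3) centroid updates until the labels stop changing."""
    rng = np.random.default_rng(seed)
    # 1. Initialize K centers by picking K distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # 2. Assign each x_i the label of the nearest center
        #    (squared Euclidean distance d_{i,k} = (x_i - mu_k)^2).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no label changed
        labels = new_labels
        # 3. Move each center to the centroid of its points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Two well-separated groups of 2-D points (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans(X, k=2)
```

On well-separated data like this, the loop converges in a couple of iterations; with poor initialization or non-compact clusters it can still get stuck in a local minimum of the cost criterion above.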
Given a set of x_i, all with label k, we can find the maximum-likelihood μ_k from
   argmax_μ { log ∏_i P(x_i | μ) } = argmax_μ ∑_i { −(x_i − μ)²/2 + log(1/√(2π)) } = argmin_μ ∑_i (x_i − μ)²
The solution is the centroid of the x_i.

If We Know Cluster Centers
We need to estimate the labels for the data. The most likely label for a point x_i is
   argmax_k P(x_i | μ_k) = argmax_k (1/√(2π)) exp{−(x_i − μ_k)² / 2} = argmin_k (x_i − μ_k)²
which is exactly the distance measure used by K-means.

What If We Have Neither?
An idea:
1. Start with some initial centers μ_k⁰.
2. Calculate the most likely labels for the x_i given these centers: labels_i⁰ = argmax_k P(x_i | μ_k⁰).
3. Use these labels to choose new centers: μ¹ = argmax_μ { log ∏_i P(x_i | μ, labels_i⁰) }.
4. Iterate to convergence.

Expectation Maximization (EM)
1. Initialize the parameters.
2. E step: estimate the probability of the hidden labels, Q, given the current parameters and the data:
   Q = P(labels | x, μ^(t−1))
3. M step: choose new parameters to maximize the expected log likelihood under Q:
   μ^t = argmax_μ E_Q[ log P(x, labels | μ) ]
4. Iterate.
P(x | model) is guaranteed not to decrease at each iteration.

Remember the basic idea:
1. Use the model to estimate (a distribution over) the missing data.
2. Use the estimate to update the model.
3. Repeat until convergence.
Here the model is the set of Gaussian distributions, and the missing data are the data-point labels.

Revisiting K-Means: Generative Model Perspective
K-means is EM with hard assignments:
1. Initialize K centers μ_k (initialize parameters).
2. Assign each x_i the label of the nearest center, using d_{i,k} = (x_i − μ_k)² (E step: estimate the most likely missing label given the previous parameters).
3. Move each μ_k to the centroid of the points with that label (M step: choose the maximum-likelihood parameters given the estimated labels).
4. Iterate.
This is analogous to Viterbi learning for HMMs, where Viterbi is used to find the most likely missing path labels (see the Durbin book).

Revisiting Fuzzy K-Means
Recall that instead of assigning each point x_i to a single label k, we calculate the probability of each label for that point (fuzzy membership), P(label k | x_i, μ_k), and update each center with the weighted centroid (looking at the case b = 1):
   μ_k(n+1) = ∑_i x_i P(μ_k | x_i)^b / ∑_i P(μ_k | x_i)^b
It can be shown that this update rule follows from assuming the Gaussian mixture generative model and performing Expectation Maximization. This is analogous to Baum-Welch for HMMs.
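To make the E/M alternation concrete, here is a small sketch of EM for the setting these slides assume: a mixture of equally likely, unit-variance, one-dimensional Gaussians in which only the means are unknown. The data values and starting means below are invented for the example.

```python
import numpy as np

def em_means(x, mu_init, n_iter=50):
    """EM for K equally weighted, unit-variance 1-D Gaussians
    (only the means mu_k are unknown)."""
    mu = np.asarray(mu_init, dtype=float).copy()
    for _ in range(n_iter):
        # E step: responsibility Q_ik = P(label k | x_i, mu),
        # proportional to exp(-(x_i - mu_k)^2 / 2).
        logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
        q = np.exp(logp - logp.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # M step: move each mean to the responsibility-weighted centroid.
        mu = (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
    return mu

# Points lying near -3 and +3; EM recovers the two means
# even from a poor starting guess.
x = np.array([-3.2, -3.0, -2.8, 2.8, 3.0, 3.2])
mu = em_means(x, mu_init=[-1.0, 1.0])
```

Replacing the soft responsibilities q with a hard argmax over labels turns this loop into exactly the K-means iteration, which is the point of the generative-model perspective above.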
Fuzzy K-means as EM, step by step:
1. Initialize K centers μ_k (initialize parameters).
2. For each point, calculate the probability of membership in each category, P(label k | x_i, μ_k) (E step: estimate a probability distribution over the missing labels given the previous parameters).
3. Move each μ_k to the weighted centroid (M step: choose new parameters to maximize the expected likelihood given the estimated label distribution):
   μ_k(n+1) = ∑_i x_i P(μ_k | x_i)^b / ∑_i P(μ_k | x_i)^b
4. Iterate.

K-Means, Viterbi Learning & EM
K-means and fuzzy K-means are two related methods that can be seen as performing unsupervised learning on a Gaussian mixture model. This view reveals the assumptions being made about the underlying data model, and we can relax those assumptions by relaxing constraints on the model:
• including an explicit covariance matrix
• relaxing the assumption that all Gaussians are equally likely

Implications: Non-Globular Clusters
When the actual clusters are not globular, K-means (e.g. with K = 2) can cut across the true clusters, because it implicitly assumes compact, spherical groups.

But How Many Clusters?
• How do we select K?
 – We can always make clusters "more compact" by increasing K
 – e.g. what happens if K = the number of data points?
 – What counts as a meaningful improvement?
• Hierarchical clustering side-steps this issue

Hierarchical Clustering
The most widely used algorithm for expression data.
• Start with each point in a separate cluster
• At each step:
 – Choose the pair of closest clusters
 – Merge them
The result can be visualized as a tree, as in UPGMA phylogeny.
• Avoids the need to select the number of clusters
• Produces clusters at all levels
• We can always select a "cut level" to create disjoint clusters
But how do we define distances between clusters?
(Slide credits: M. Kellis)

Distance Between Clusters
• Single-link method: CD(X, Y) = min_{x∈X, y∈Y} D(x, y)
• Complete-link method: CD(X, Y) = max_{x∈X, y∈Y} D(x, y)
• Average-link method: CD(X, Y) = avg_{x∈X, y∈Y} D(x, y)
• Centroid method: CD(X, Y) = D(avg(X), avg(Y))

(Dis)Similarity Measures
[Image removed due to copyright restrictions: Table 1, gene expression similarity measures. D'haeseleer, Patrik. "How Does Gene Expression Clustering Work?" Nature Biotechnology 23 (2005): 1499-1501.]

Evaluating Cluster Performance
In general, this depends on your goals in clustering.
• Robustness
 – Select random samples from the data set and cluster them
 – Repeat
 – Robust clusters show up in all of the clusterings
• Category enrichment
 – Look for categories of genes "over-represented" in particular clusters
 – The same idea is also used in motif discovery

Evaluating Clusters: The Hypergeometric Distribution
Suppose there are N objects in total, p labeled + and (N − p) labeled −, and a computed cluster contains k elements, m of them labeled +. The p-value of a single cluster of k elements containing at least r positives is
   P(pos ≥ r) = ∑_{m ≥ r} C(p, m) · C(N − p, k − m) / C(N, k)
i.e. the probability that a randomly chosen set of k elements would contain m positives and k − m negatives, summed over m ≥ r.

Similar Genes Can Cluster
Eisen et al. clustered 8,600 human genes using an expression time course in fibroblasts; the clusters corresponded to (A) cholesterol biosynthesis, (B) cell cycle, (C) immediate early response, (D) signalling and angiogenesis, and (E) wound healing.
(Eisen, Michael, et al. "Cluster Analysis and Display of Genome-wide Expression Patterns." PNAS 95, no. 25 (1998): 14863-14868. Copyright (1998) National Academy of Sciences, U.S.A.)

Clusters and Motif Discovery
[Figure: expression (in S.D. from the mean) over 15 time points spanning two yeast cell cycles (G1, S, G2, M) for three clusters: ribosome (1), methionine & sulphur metabolism, and RNA metabolism & translation (3). Tavazoie & Church (1999). Figure by MIT OpenCourseWare.]

Next Lecture
The other side of the coin: classification.
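Returning to the cluster-evaluation section above: the hypergeometric tail probability P(pos ≥ r) can be computed directly with binomial coefficients. This is a small self-contained sketch using Python's math.comb; the example numbers (100 genes, 10 positives, a cluster of 10 with at least 5 positives) are invented for illustration.

```python
from math import comb

def enrichment_pvalue(N, p, k, r):
    """P(pos >= r): chance that a random set of k elements drawn from
    N objects (p of them labeled +) contains at least r positives."""
    total = comb(N, k)
    # Sum C(p, m) * C(N - p, k - m) / C(N, k) over m = r .. min(p, k);
    # comb() returns 0 for impossible terms (k - m > N - p).
    return sum(comb(p, m) * comb(N - p, k - m)
               for m in range(r, min(p, k) + 1)) / total

# Example: 100 genes, 10 labeled +; a cluster of 10 genes holds 5 positives.
pval = enrichment_pvalue(N=100, p=10, k=10, r=5)
# A small p-value suggests the cluster is enriched for the + category.
```

Because the sum is exact (integer arithmetic until the final division), this avoids the numerical issues of approximating the tail for small counts.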

This note was uploaded on 09/24/2010 for the course EECS 6.047 / 6., taught by Professor Manolis Kellis during the Fall '08 term at MIT.
