MIT OpenCourseWare
http://ocw.mit.edu

6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution
Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Computational Biology: Genomes, Networks, Evolution
Clustering
Lecture 3, September 16, 2008

Structure in High-Dimensional Data

©2007 IEEE. Used with permission.
Gyulassy, Attila, et al. "Topologically Clean Distance Fields." IEEE Transactions on Visualization and Computer Graphics 13, no. 6 (2007): 1432-1439.

• Structure can be used to reduce the dimensionality of data
• Structure can tell us something useful about the underlying phenomena
• Structure can be used to make inferences about new data

Clustering vs. Classification
• Objects are characterized by one or more features (e.g. expression in experiment 1 vs. experiment 2)
• Classification
  – Have labels for some points
  – Want a "rule" that will accurately assign labels to new points
  – Supervised learning
• Clustering
  – No labels
  – Group points into clusters based on how "near" they are to one another
  – Identify structure in the data
  – Unsupervised learning

Today
• Microarray data
• K-means clustering
• Expectation Maximization
• Hierarchical clustering

Central Dogma

DNA → mRNA → protein → phenotype

We can measure the amounts of mRNA for every gene in a cell.

Expression Microarrays
• A way to measure the levels of mRNA for every gene
• Two basic types
  – Affymetrix gene chips
  – Spotted oligonucleotides
• Both work on the same principle
  – Put a DNA probe on a slide
  – Rely on complementary hybridization

Expression Microarrays
• Measure the level of mRNA messages in a cell
• Pipeline: extract the mRNA, reverse transcribe (RT) it to cDNA, hybridize to the array, and measure the signal
[Figure: genes 1–6, with the RNA of each expressed gene reverse-transcribed to cDNA that hybridizes to the matching probe spot]

Expression Microarray Data Matrix
• An m × n matrix: m genes by n experiments
• Experiments are given as columns
• Genes are typically given as rows

Clustering and Classification in Genomics
• Classification
  – Microarray data: classify cell state (e.g. AML vs. ALL) using expression data
  – Protein/gene sequences: predict function, localization, etc.
• Clustering
  – Microarray data: groups of genes that share similar function have similar expression patterns, so clustering can identify regulons
  – Protein sequences: group related proteins to infer function
  – EST data: collapse redundant sequences

Clustering Expression Data
• Cluster genes
  – Group genes with similar expression profiles across experiments
• Cluster experiments
  – Group experiments showing similar expression across many genes

Why Cluster Genes by Expression?
• Data exploration
  – Summarize the data
  – Explore without getting lost in each data point
  – Enhance visualization
• Co-regulated genes
  – Common expression may imply common regulation (e.g. GCN4 targets)
  – Predict cis-regulatory promoter sequences
• Functional annotation
  – Infer similar function from similar expression (e.g. if His2 and His3, both amino acid genes, cluster with an unknown gene, the unknown gene may also act in amino acid metabolism)

Clustering Algorithms
• Partitioning
  – Divides objects into non-overlapping clusters such that each data object is in exactly one subset
• Agglomerative
  – Produces a set of nested clusters organized as a hierarchy

K-Means Clustering
The basic idea:
• Assume a fixed number of clusters, K
• Goal: create "compact" clusters

More Formally
1. Initialize K centers μk
Then for each iteration n, until convergence:
2. Assign each xi the label of the nearest center, where the distance between xi and μk is
   d(i,k) = (xi − μk)²
3. Move the position of each μk to the centroid of the points with that label:
   μk(n+1) = ( Σ_{xi with label k} xi ) / nk,   where nk = #{xi with label k}

Cost Criterion
We can think of K-means as trying to create clusters that minimize a cost criterion associated with the size of each cluster:

   COST(x1, x2, ..., xn) = Σk Σ_{xi with label k} (xi − μk)²

Minimizing this means minimizing each cluster's term separately. Expanding one term:

   Σ_{xi with label k} (xi − μk)² = Σ_{xi with label k} xi² − 2μk Σ_{xi with label k} xi + nk μk²

The optimum is μk = ( Σ_{xi with label k} xi ) / nk, the centroid.

Fuzzy K-Means
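Before softening the assignments, the hard K-means loop derived above can be sketched in plain Python. This is a minimal illustration, not the lecture's code; the 1-D data points and the two initial centers are made-up examples.

```python
# Minimal hard K-means sketch (illustration only; made-up 1-D data).
def kmeans(points, centers, iters=20):
    """Alternate the assignment and centroid-update steps."""
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: give each point the label of its nearest center
        labels = [min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
                  for x in points]
        # Update step: move each center to the centroid of its points
        for k in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster goes empty
                centers[k] = sum(members) / len(members)
    return centers, labels

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, labels = kmeans(points, centers=[0.0, 8.0])
```

With these well-separated groups the loop converges after a single pass: the labels split the points into the two obvious clusters and the centers land on the two centroids.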
• Initialize K centers μk
• For each point, calculate the probability of membership in each cluster:
   P(label k | xi, μk)
• Move the position of each μk to the weighted centroid:
   μk(n+1) = ( Σi xi P(μk | xi)^b ) / ( Σi P(μk | xi)^b )
• Iterate

Of course, K-means is just the special case where

   P(label k | xi, μk) = 1 if xi is closest to μk, 0 otherwise

K-Means as a Generative Model
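The fuzzy weighted-centroid update above can be sketched the same way. This is a minimal 1-D illustration; the unit-variance Gaussian memberships (my own choice, anticipating the generative model developed next), the fuzziness exponent b = 2, and the data are made-up examples.

```python
# Minimal fuzzy K-means sketch (illustration only; made-up 1-D data).
import math

def fuzzy_update(points, centers, b=2):
    """One weighted-centroid update from soft memberships."""
    new_centers = []
    for k in range(len(centers)):
        num = den = 0.0
        for x in points:
            # soft membership P(mu_k | x): normalized unit-variance likelihoods
            w = [math.exp(-0.5 * (x - mu) ** 2) for mu in centers]
            p = w[k] / sum(w)
            num += x * p ** b
            den += p ** b
        new_centers.append(num / den)
    return new_centers

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers = [0.0, 8.0]
for _ in range(15):
    centers = fuzzy_update(points, centers)
```

Because the two groups are well separated, the soft memberships are nearly hard and the centers converge close to the same centroids that hard K-means finds.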
[Figure: a model of P(X, labels); points xi drawn around two centers μ1 and μ2]

Samples are drawn from two equally likely normal distributions with unit variance, i.e. a Gaussian mixture model:

   P(xi | μj) = (1/√(2π)) exp( −(xi − μj)²/2 )

Unsupervised Learning
Can we learn the centers μ1 and μ2 from unlabeled samples xi drawn from the same unit-variance Gaussian mixture model,

   P(xi | μj) = (1/√(2π)) exp( −(xi − μj)²/2 ) ?

If We Have Labeled Points
We need to estimate the unknown Gaussian centers from the data. In general, how could we do this? How could we "estimate" the "best" μk?

Idea: choose the μk that maximize the probability of the data under the model.

If We Have Labeled Points
Given a set of xi, all with label k, we can find the maximum likelihood μk from

   argmax_μ log Π_i P(xi | μ) = argmax_μ Σ_i [ −(xi − μ)²/2 + log(1/√(2π)) ]
                              = argmin_μ Σ_i (xi − μ)²

The solution is the centroid of the xi.

If We Know Cluster Centers
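The claim just derived, that the maximum likelihood center for a set of identically labeled points is their centroid, can be checked numerically (the data here are made up):

```python
# Numerical check: the sum of squared distances is minimized at the centroid.
points = [2.0, 3.5, 4.1, 2.9]
centroid = sum(points) / len(points)

def sq_cost(u):
    """Cost of placing the cluster center at u."""
    return sum((x - u) ** 2 for x in points)

# nudging the center in either direction only increases the cost
assert sq_cost(centroid) < sq_cost(centroid + 0.1)
assert sq_cost(centroid) < sq_cost(centroid - 0.1)
```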
Now we need to estimate the labels for the data. The most likely label for a point xi is

   argmax_k P(xi | μk) = argmax_k (1/√(2π)) exp( −(xi − μk)²/2 ) = argmin_k (xi − μk)²

which is exactly the distance measure used by K-means.

What If We Have Neither?
An idea:
1. Imagine we start with some initial centers μk(0)
2. We could calculate the most likely labels for the xi given these μk(0):
   labels_i(0) = argmax_k P(xi | μk(0))
3. We could then use these labels to choose new centers μk(1):
   μ(1) = argmax_μ log Π_i P(xi, labels_i(0) | μ)
4. And iterate (to convergence)

Expectation Maximization (EM)
1. Initialize parameters
2. E step: estimate the probability distribution Q over the hidden labels, given the current parameters and the data:
   Q = P(labels | x, μ(t−1))
3. M step: choose new parameters to maximize the expected log likelihood under Q:
   μk(t) = argmax_μ E_Q[ log P(labels, x | μ) ]
4. Iterate

P(x | Model) is guaranteed to increase (or stay the same) on each iteration.

Expectation Maximization (EM)
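For the two-center, unit-variance, equal-prior Gaussian mixture used throughout, one EM iteration can be sketched as below. This is an illustrative sketch with made-up 1-D data; it also tracks the log likelihood so the monotonicity guarantee above can be observed.

```python
# Sketch of EM for a 2-component, unit-variance, equal-prior Gaussian
# mixture (illustration only; made-up 1-D data).
import math

def em_step(points, mus):
    """One EM iteration: soft E step, weighted-centroid M step."""
    # E step: responsibilities Q = P(label | x, mu) for every point
    resp = []
    for x in points:
        w = [math.exp(-0.5 * (x - mu) ** 2) for mu in mus]
        s = sum(w)
        resp.append([wk / s for wk in w])
    # M step: each mean becomes the responsibility-weighted centroid
    return [sum(r[k] * x for r, x in zip(resp, points)) /
            sum(r[k] for r in resp)
            for k in range(len(mus))]

def log_likelihood(points, mus):
    """log P(x | model) under the equal-prior unit-variance mixture."""
    return sum(math.log(sum(0.5 * math.exp(-0.5 * (x - mu) ** 2) /
                            math.sqrt(2 * math.pi) for mu in mus))
               for x in points)

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
mus = [0.0, 8.0]
lls = []
for _ in range(25):
    mus = em_step(points, mus)
    lls.append(log_likelihood(points, mus))
```

The recorded log likelihoods never decrease from one iteration to the next, matching the EM guarantee, and the means converge near the two group centroids.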
Remember the basic idea:
1. Use the model to estimate the (distribution of the) missing data
2. Use the estimate to update the model
3. Repeat until convergence

Here the model is the set of Gaussian distributions, and the missing data are the data point labels.

Revisiting K-Means
The generative model perspective:
1. Initialize K centers μk
2. Assign each xi the label of the nearest center, d(i,k) = (xi − μk)² [the most likely label k for the point xi]
3. Move the position of each μk to the centroid of the points with that label [the maximum likelihood μk given the most likely labels]
4. Iterate

Revisiting K-Means
Lining the two views up step by step:
1. Initialize K centers μk [initialize parameters]
2. Assign each xi the label of the nearest center [E step: estimate the most likely missing label given the previous parameters]
3. Move each μk to the centroid of the points with that label [M step: choose new parameters to maximize the likelihood given the estimated labels]
4. Iterate

Revisiting K-Means
This is analogous to Viterbi learning for HMMs: the analogue of the E step uses Viterbi to find the single most likely missing path, rather than a distribution over paths (see the Durbin book).

Revisiting Fuzzy K-Means
Recall that instead of assigning each point xi to a label k, we calculate the probability of each label for that point (fuzzy membership):

   P(label k | xi, μk)

and that we select a new μk with the update

   μk(n+1) = ( Σi xi P(μk | xi)^b ) / ( Σi P(μk | xi)^b )

Looking at the case b = 1, it can be shown that this update rule follows from assuming the Gaussian mixture generative model and performing Expectation Maximization.

Revisiting Fuzzy K-Means
This is analogous to Baum-Welch training for HMMs. From the generative model perspective:
1. Initialize K centers μk [initialize parameters]
2. For each point, calculate the probability of membership in each cluster, P(label k | xi, μk) [E step: estimate the probability distribution over the missing labels given the previous parameters]
3. Move each μk to the weighted centroid [M step: choose new parameters to maximize the expected likelihood given the estimated labels]
4. Iterate

K-Means, Viterbi Learning & EM
K-means and fuzzy K-means are two related methods that can be seen as performing unsupervised learning on a Gaussian mixture model. This view reveals the assumptions being made about the underlying data model, and we can relax those assumptions by relaxing constraints on the model:
• Including an explicit covariance matrix
• Relaxing the assumption that all Gaussians are equally likely

One implication: on non-globular clusters, K-means (K = 2) can produce a clustering very different from the actual one.

But How Many Clusters?
• How do we select K?
  – We can always make clusters "more compact" by increasing K
  – e.g. what happens if K = the number of data points?
  – What is a meaningful improvement?
• Hierarchical clustering sidesteps this issue
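A quick numerical illustration of the first point above, using made-up 1-D data: the K-means cost only goes down as K grows, and reaches zero once every point is its own center.

```python
# Cost always falls as K grows; at K = n (one center per point) it is zero.
points = [1.0, 1.2, 0.8, 5.0]

def kmeans_cost(points, centers):
    # each point is charged its squared distance to the nearest center
    return sum(min((x - c) ** 2 for c in centers) for x in points)

cost_k1 = kmeans_cost(points, [sum(points) / len(points)])  # best K = 1
cost_k2 = kmeans_cost(points, [1.0, 5.0])                   # a K = 2 clustering
cost_kn = kmeans_cost(points, points)                       # K = n
```

Here cost_k1 > cost_k2 > cost_kn = 0, yet the K = n "clustering" is useless, which is why raw compactness cannot pick K.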
Hierarchical clustering

The most widely used algorithm for expression data:
• Start with each point in a separate cluster
• At each step:
  – Choose the pair of closest clusters
  – Merge them

The result is a tree of merges, which can be visualized like a phylogeny (as in UPGMA).

Hierarchical clustering
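The merge loop above can be sketched in code. As a concrete choice of cluster distance this sketch uses the single-link rule (distance of the closest pair, one of the options discussed below); the 1-D points are made up.

```python
# Minimal agglomerative clustering sketch with single-link distances
# (illustration only; made-up 1-D points, merged down to two clusters).
def single_link(a, b):
    """Cluster distance = distance of the closest pair of members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, n_clusters):
    clusters = [[x] for x in points]   # start: every point in its own cluster
    while len(clusters) > n_clusters:
        # choose the pair of closest clusters ...
        _, i, j = min((single_link(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        # ... and merge them
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

clusters = agglomerate([1.0, 1.2, 5.0, 5.3, 0.8], n_clusters=2)
```

Stopping at a given number of clusters corresponds to picking a "cut level" in the merge tree.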
• Avoids needing to select the number of clusters
• Produces clusters at all levels
• We can always select a "cut level" in the tree to create disjoint clusters
• But how do we define distances between clusters?

(slide credits: M. Kellis)

Distance between clusters
• Single-link method: CD(X,Y) = min over x∈X, y∈Y of D(x,y)
• Complete-link method: CD(X,Y) = max over x∈X, y∈Y of D(x,y)
• Average-link method: CD(X,Y) = avg over x∈X, y∈Y of D(x,y)
• Centroid method: CD(X,Y) = D( avg(X), avg(Y) )

(Dis)Similarity Measures

Image removed due to copyright restrictions: Table 1, gene expression similarity measures, from D'haeseleer, Patrik. "How Does Gene Expression Clustering Work?" Nature Biotechnology 23 (2005): 1499-1501.

Evaluating Cluster Performance
In general, it depends on your goals in clustering.

• Robustness
  – Select random samples from the data set and cluster them
  – Repeat
  – Robust clusters show up in all the clusterings
• Category enrichment
  – Look for categories of genes "overrepresented" in particular clusters
  – Also used in motif discovery

Evaluating Clusters – Hypergeometric Distribution

Suppose there are N experiments, of which p are labeled + and (N − p) are labeled −, and a cluster contains k elements, of which m are labeled +. The p-value of a single cluster containing k elements of which at least r are + is

   P(pos ≥ r) = Σ_{m ≥ r} C(p, m) C(N − p, k − m) / C(N, k)

i.e. the probability that a randomly chosen set of k experiments would contain at least r positives. This gives a p-value for the uniformity of a computed cluster.

Similar Genes Can Cluster
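The tail sum above can be computed directly with Python's math.comb. The numbers in the example (N = 100 experiments, p = 20 positives, a cluster of k = 10 containing r = 5 positives) are made up for illustration.

```python
# Hypergeometric enrichment p-value: probability that a cluster of k
# elements contains at least r of the p positives among N total.
from math import comb

def enrichment_pvalue(N, p, k, r):
    total = comb(N, k)
    return sum(comb(p, m) * comb(N - p, k - m)
               for m in range(r, min(p, k) + 1)) / total

pval = enrichment_pvalue(N=100, p=20, k=10, r=5)
```

Summing from r = 0 recovers probability 1 (a quick sanity check on the formula), and a small pval, as here, flags the cluster as enriched for the + category.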
Eisen et al. clustered 8,600 human genes using an expression time course in fibroblasts. Clusters corresponded to:
(A) Cholesterol biosynthesis
(B) Cell cycle
(C) Immediate early response
(D) Signalling and angiogenesis
(E) Wound healing

Eisen, Michael, et al. "Cluster Analysis and Display of Genome-wide Expression Patterns." PNAS 95, no. 25 (1998): 14863-14868. Copyright (1998) National Academy of Sciences, U.S.A.

Clusters and Motif Discovery
[Figure: expression (S.D. from mean) at 15 time points across the yeast cell cycle (G1, S, G2, M over two cycles) for three clusters: (1) ribosome, (2) methionine & sulphur metabolism, (3) RNA metabolism & translation. Tavazoie & Church (1999). Figure by MIT OpenCourseWare.]

Next Lecture

The other side of the coin: Classification
This material is from course 6.047 / 6.878, taught by Professor Manolis Kellis during the Fall 2008 term at MIT.