ECS 124 Theory and Practice of Bioinformatics
Lecture 8: Clustering
Instructor: Ilias Tagkopoulos, itagkopoulos@ucdavis.edu
Office: Kemper 3063 and GBSF 5313
UC Davis, 4/23/2010

Textbook: An Introduction to Bioinformatics Algorithms
http://mitpress.mit.edu/catalog/item/default.asp?tid=10337&ttype=2

LAST TIME: Clustering vs. Classification

Classification:
- Labels are given; assign labels to new points.
- Supervised learning.
Clustering:
- No labels; group points into clusters based on how "near" they are to one another.
- Identify structure in data.
- Unsupervised learning.
Objects are characterized by one or more features (e.g., expression in experiment 1 vs. expression in experiment 2).

Where can data come from?

Any data source of high dimensionality: images, mass spectrometry (MS) recordings, microarrays, etc.
Microarrays measure mRNA concentrations over conditions/treatments/time points:
- Normalized light-intensity images show the expression level ratio (target mRNA / control mRNA).
- Generally, green, yellow, and red indicate low, medium, and high target mRNA expression with respect to the control.
- Black means neither target nor control was expressed.

Gene/Protein expression arrays

Intensity matrix -> expression matrix.
Goal: find correlations between genes OR between conditions.
Solution: clustering.

            Time X   Time Y   Time Z
  Gene 1      10       8        10
  Gene 2      10       0         9
  Gene 3       4       8.6       3
  Gene 4       7       8         3
  Gene 5       1       2         3

Clustering example

- Assume that you have a microarray with 10 genes and values over 3 time points.
- Calculate a distance (Euclidean, covariance) for all gene pairs, as in the sketch below.
- Cluster together the genes that have the smallest distance.

[Figure: clustering of gene expression profiles; source: www.bioalgorithms.info]
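To make the distance step concrete, here is a minimal Python sketch (NumPy assumed; not part of the lecture) that computes the Euclidean distance for every gene pair in the expression matrix above. The closest pairs are the candidates to be clustered together first.

```python
import numpy as np

# Expression matrix from the slide: rows are genes, columns are time points X, Y, Z.
expression = np.array([
    [10, 8.0, 10],  # Gene 1
    [10, 0.0,  9],  # Gene 2
    [ 4, 8.6,  3],  # Gene 3
    [ 7, 8.0,  3],  # Gene 4
    [ 1, 2.0,  3],  # Gene 5
])

n = expression.shape[0]
# Euclidean distance for all gene pairs.
for i in range(n):
    for j in range(i + 1, n):
        d = np.linalg.norm(expression[i] - expression[j])
        print(f"Gene {i + 1} - Gene {j + 1}: {d:.2f}")
```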
Clustering Principles

- Intracluster homogeneity: members of the same cluster should be "close" (similar) to each other.
- Intercluster separation: members of different clusters should be "far" (dissimilar) from each other.

K-means Clustering

Assume you have a set of observations/points (x_1, x_2, ..., x_n). Each point is d-dimensional (here d = 2).
K-means: put the points into K groups/clusters (S_1, S_2, ..., S_K) so that the within-group distance is minimized:

    \arg\min_{S} \sum_{k=1}^{K} \sum_{x_i \in S_k} \lVert x_i - \mu_k \rVert^2

Pattern recognition and matching: K-Means Clustering

- Assume a fixed number of clusters, K.
- Goal: create "compact" clusters (minimize intracluster distance).

Pattern recognition and matching: K-Means Algorithm

1. Randomly initialize clusters.
2. Assign data points to the nearest clusters.
3. Recalculate centroids.
4. Repeat the last two steps.

Pattern recognition and matching: K-means (Lloyd's algorithm)

1. Initialize K centers \mu_1, ..., \mu_K.
For each iteration, until convergence:
2. Assign each x_i to the cluster with the nearest center, where the distance between x_i and cluster k is d_{i,k} = \lVert x_i - \mu_k \rVert^2, with \mu_k the center of cluster k.
3. Recalculate the center of each cluster to be the centroid of its members: \mu_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_i.
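A minimal Python sketch of Lloyd's algorithm as described above (NumPy assumed; the random initialization, tolerance check, and iteration cap are illustrative choices, not prescribed by the slide):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Lloyd's algorithm: X is an (n, d) array of points, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # 1. Initialize K centers by picking K distinct data points at random.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the nearest center
        #    (squared Euclidean distance d_{i,k} = ||x_i - mu_k||^2).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # 3. Recalculate each center as the centroid of its members
        #    (keep the old center if a cluster ends up empty).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return labels, centers
```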
Variations of k-means clustering

- Moving elements may have no consequence -> move only when it matters: "greedy" K-means.
- Hard or soft clustering? How can we allow membership in multiple clusters (soft clustering)? Fuzzy K-means.

Greedy K-means

Move one point at a time, ONLY if it improves the overall clustering cost.
Greedy algorithm:
  Select an arbitrary partition of X into K clusters
  For every cluster
    For every element not in the cluster
      If moving the element reduces the clustering cost: MOVE it
  If no move was made:
    Return the K clusters
  (otherwise repeat)

Pattern recognition and matching: Fuzzy K-Means

- Initialize K clusters.
- For each point x_i, calculate the probability of membership P(\mu_k \mid x_i) in each cluster k.
- Move the center of each cluster to the weighted centroid (with fuzziness exponent b):

    \mu_k = \frac{\sum_i P(\mu_k \mid x_i)^b \, x_i}{\sum_i P(\mu_k \mid x_i)^b}

- Iterate.
- K-means is the special case where each membership P(\mu_k \mid x_i) is 0 or 1 (every point belongs to exactly one cluster).

Problems regarding K-means clustering

- Convergence: how do we know when to stop? Does anything guarantee convergence?
- Assumptions: what is the structure of the data? Are all Gaussians equal?
- Overfitting: what is the optimal number of clusters?
- Similarity metric: which similarity metric should we use?
And as with any clustering/classification algorithm, we should be concerned about irrelevant features, the curse of dimensionality, small sample sizes, etc.

Convergence & Structure

Convergence criteria: clusters are compact enough; number of iterations; CPU time; wall time.
- In general, K-means is NP-hard (special case: K and D fixed).
- In general, k-means performance can be arbitrarily bad; e.g., a rectangle example where k-means converges at its starting point.
- Assumptions regarding structure matter.

Structure...

[Figure: the "actual" clusters vs. the clusters K-means finds]

Number of clusters

How do we decide the number of clusters?
- Minimum description length (MDL)
- Bayesian information criterion (BIC):

    \mathrm{BIC} = k \ln n - 2 \ln \hat{L}

  where x is the observed data, k the number of parameters of the model, \hat{L} the maximized likelihood (fit), and n the number of points. The lower the value of BIC, the better.
- Akaike's information criterion (AIC)
- Hierarchical clustering

Hierarchical Clustering

Goal: build a hierarchy of clusters based on similarity/distance.
Types:
- Agglomerative: each point/observation starts as its own cluster; merge clusters based on similarity; stop when only one cluster is left. Bottom-up approach.
- Divisive: all points/observations start in ONE cluster; split clusters recursively until every point is its own cluster. Top-down approach.

Hierarchical Clustering (bottom-up approach)

- Start with each point in a separate cluster.
- At each iteration: choose the pair of closest clusters and merge that pair into one cluster.
[Figure: points a-f merged step by step into a dendrogram]
UPGMA (Unweighted Pair Group Method with Arithmetic mean) is used in phylogenetic trees.

Hierarchical Clustering: Distance metric

We may define as the "distance" between two clusters:
- the minimum distance between elements of each cluster (single-linkage clustering);
- the maximum distance between elements of each cluster (complete-linkage clustering);
- the mean distance between all elements in each cluster (average-linkage clustering);
- the distance between the centroids of each cluster (centroid-linkage clustering).
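A minimal sketch of bottom-up (agglomerative) clustering using SciPy (the library choice and the toy data are assumptions, not part of the lecture). The method argument selects among the four linkage definitions listed above: "single", "complete", "average", or "centroid".

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points standing in for gene expression profiles.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [9.0, 1.0]])

# Build the merge tree bottom-up, using average linkage
# (mean distance between all elements of each cluster).
Z = linkage(X, method="average", metric="euclidean")

# Cut the resulting dendrogram into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Swapping method="average" for "single" or "complete" reproduces the other linkage behaviors on the same data.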