Model Learning and Clustering (CPS170, Ron Parr). Material from: Lise Getoor, Andrew Moore, Tom Mitchell, Sebastian Thrun, Rich Maclin

Unsupervised Learning
•  Supervised learning: data <x1, x2, ..., xn, y>
•  Unsupervised learning: data <x1, x2, ..., xn>
•  So, what's the big deal? Isn't y just another feature?
•  There is no explicit performance objective
  –  Bad news: the problem is not necessarily well defined without further assumptions
  –  Good news: the results can be useful for more than predicting y

Model Learning
•  Produce a global summary of the data, not an exact copy
•  Consider a space of models M and a dataset D
•  One approach: maximize P(M|D)
•  How to do this? Bayes rule:
   P(M|D) = P(D|M) P(M) / P(D)

Example: Modeling Coin Flips
•  Suppose we have observed D = HTTHT
•  Which is a better model, P(H) = 0.4 or P(H) = 0.5?
  –  P(D | P(H) = 0.5) = 0.5^5 = 0.03125
  –  P(D | P(H) = 0.4) = 0.4^2 * 0.6^3 = 0.03456
•  What about P(D) and P(M)?

Model Learning With Bayes Rule
•  P(M|D) = P(D|M) P(M) / P(D)
•  We call P(D|M) the likelihood
•  We can ignore P(D) when comparing models. Why? It is the same for every model.
•  What about P(M)?
  –  Call this our prior probability on models
  –  If P(M) is uniform (all models equally likely), then maximizing P(D|M) is equivalent to maximizing P(M|D) (call this the maximum likelihood approach)

Using Priors
•  Suppose we have good reason to expect that the coin is fair
•  Should we really conclude P(H) = 0.4?
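The coin-flip likelihoods above can be checked with a short sketch. The data (D = HTTHT) and the 2:1 prior come from the slides; the helper function name is our own:

```python
# Sketch of the coin-flip example: D = HTTHT, two candidate models.

def likelihood(p_heads, flips):
    """P(D | M) for independent coin flips under a model with P(H) = p_heads."""
    prob = 1.0
    for f in flips:
        prob *= p_heads if f == "H" else (1.0 - p_heads)
    return prob

D = "HTTHT"
like_05 = likelihood(0.5, D)   # 0.5**5          = 0.03125
like_04 = likelihood(0.4, D)   # 0.4**2 * 0.6**3 = 0.03456

# With a uniform prior, maximizing the likelihood picks P(H) = 0.4:
assert like_04 > like_05

# With the prior P(P(H)=0.5) = 2 * P(P(H)=0.4), the posterior is
# proportional to likelihood * prior, and the fair coin wins because
# like_04 is not 2x larger than like_05:
post_05 = like_05 * 2.0
post_04 = like_04 * 1.0
assert post_05 > post_04
```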
•  Suppose we think P(P(H) = 0.5) = 2 × P(P(H) = 0.4)
•  This means P(D | P(H) = 0.4) must be more than 2× larger than P(D | P(H) = 0.5) to compensate, if P(H) = 0.4 is to maximize the posterior probability P(M|D) = P(D|M) P(M) / P(D)

Data Can Overwhelm a Prior

Specifying Priors
•  In our coin example, we considered just two models, P(H) = 0.4 and P(H) = 0.5
•  In general, we might want to specify a distribution over all possible coin probabilities
•  This introduces complications:
  –  P(M) is now a distribution over a continuous parameter
  –  We need to use calculus to find the maximizer of P(D|M) P(M)

Conjugate Priors
•  A likelihood and prior are said to be conjugate if their product has the same parametric form as the prior
•  (This is outside the scope of the class, but we provide one nice example.)
•  The beta distribution is conjugate to the binomial distribution
  –  We can think of the beta distribution as specifying a number of "imagined" heads and tails
  –  The maximum of the posterior adds the observed heads and tails to the imagined heads and tails
  –  Examples:
    •  A prior of 100 imagined heads and 100 imagined tails is a strong prior towards fairness
    •  A prior of 1 imagined head and 1 imagined tail is a weak prior towards fairness

Clustering as Modeling
•  Clustering assigns points in a space to clusters
•  Example: by examining x-rays of cancer tumors, one might identify different subtypes of cancer based upon growth patterns
•  Each cluster has its own probabilistic model describing how items of that cluster's type behave

Examples of Clustering Applications
•  Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
•  Land use: identification of areas of similar land use in an earth observation database
•  Insurance: identifying groups of motor insurance policy holders with similar claim costs
•  City planning: identifying groups of houses according to their house type, value, and geographical location
•  Earthquake studies: observed earthquake epicenters should be clustered along continental faults

Example of Subtleties in Clustering
•  Household dataset: location, income, number of children, rent/own, crime rate, number of cars
•  The appropriate clustering may depend on the intended use:
  –  Goal of minimizing delivery time ⇒ cluster by location
  –  Others?
  –  Clustering work often suffers from a mismatch between the clustering objective function and the performance criterion

Clustering Desiderata
•  Decomposition or partition of the data into groups so that
  –  Points in one group are similar to each other
  –  And are as different as possible from the points in other groups
•  A measure of distance is fundamental
•  Explicit representation:
  –  Store D(x(i), x(j)) for each pair of points
  –  Only feasible for small domains
•  Implicit representation by measurement:
  –  Distance computed from features
  –  Implemented as a function

Families of Clustering Algorithms
•  Partition-based methods, e.g., k-means
•  Hierarchical clustering, e.g., hierarchical agglomerative clustering
•  Probabilistic model-based clustering, e.g., mixture models
•  Graph-based methods, e.g., spectral methods

K-means
•  Start with randomly chosen cluster centers
•  Assign each point to the closest center
•  Recompute the cluster centers
•  Reassign points
•  Repeat until no changes

[Figures: two worked k-means examples on eight points X(1) through X(8) with three centers c1, c2, c3, showing assignments and center updates over successive iterations]

Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Complexity
•  Does the algorithm terminate?
Yes.
•  Does the algorithm converge to the optimal clustering? We can only guarantee a local optimum.
•  Time complexity of one iteration? O(nk) for n points and k clusters.

Understanding k-Means
•  k-means implicitly models the data as coming from a Gaussian distribution centered at each cluster center
•  log P(data) ~ negative sum of squared distances:
   P(x_i ∈ c_j) ∝ exp(−(x_i − c_j)^2)
   P(data) = ∏_i P(x_i ∈ c_clustering(i))
   log P(data) = α ∑_i (x_i − c_clustering(i))^2, where α < 0

Understanding k-Means II
•  Each step of k-means never decreases P(data):
  –  Reassigning points moves them to the clusters under which their coordinates have the highest probability
  –  Recomputing the means moves each cluster center to increase the average probability of the points in its cluster
•  A finite number of possible assignments plus a monotonic score implies convergence

Understanding k-Means III
•  P(M|D) = P(D|M) P(M) / P(D)
•  We can view k-means as a maximum likelihood method with a twist:
  –  Unlike the coin toss example, there is a hidden variable attached to each datum: its cluster membership
  –  k-means iteratively improves its guesses about these hidden pieces of information
•  k-means can be interpreted as an instance of a general approach to dealing with hidden variables called Expectation Maximization (EM)

But How Do We Pick k?
•  Sometimes there is an obvious choice given background knowledge or the intended use of the clustering output
•  What if we just iterated over k?
  –  Picking k = n will always maximize P(D|M)
  –  We could introduce a prior over models using P(M) in Bayes rule
•  Compare a prior over models with regularization:
  –  Regularization in regression penalizes overly complex solutions
  –  We can assign models with a large number of clusters low probability to achieve a similar effect
  –  (In general, the use of priors subsumes regularization.)

Is Clustering Well Defined?
•  Clustering is not clearly axiomatized
•  Can we define an "optimal" clustering without specifying an a priori preference (a prior) on the cluster sizes or making additional assumptions?
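The k-means loop described above can be sketched in a few lines. The toy dataset, k = 3, and the random initialization are our own illustration; tracking the sum-of-squared-distances objective (the negative log-likelihood up to a constant, per the "Understanding k-Means" slides) shows the monotonic score that guarantees convergence:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: returns (centers, assignment, per-iteration objectives)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random distinct initial centers
    objectives = []
    assign = None
    for _ in range(iters):
        # Assignment step: each point goes to its closest center.
        new_assign = [min(range(k), key=lambda j: dist2(p, centers[j]))
                      for p in points]
        # Objective: sum of squared distances to the assigned centers.
        objectives.append(sum(dist2(p, centers[j])
                              for p, j in zip(points, new_assign)))
        if new_assign == assign:             # no change: converged
            break
        assign = new_assign
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, assign, objectives

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (9, 0), (9, 1)]
centers, assign, objectives = kmeans(points, k=3)
# The objective never increases between iterations, so the algorithm
# must terminate (finite assignments + monotonic score).
assert all(a >= b for a, b in zip(objectives, objectives[1:]))
```

Note that, as the slides say, this only guarantees a local optimum: a different seed can converge to a different (worse) clustering.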
•  Kleinberg: clustering is impossible under some plausible assumptions (in other words, the union of the unstated assumptions made by clustering algorithms is logically inconsistent)
•  Recent efforts make progress towards putting clustering on more solid ground

Model Learning Conclusion
•  We often seek the most likely model given the data
•  This can be viewed as maximizing the posterior P(M|D) using Bayes rule
•  Model learning can be applied to:
  –  Coin flips
  –  Clustering
  –  Learning the parameters of Bayes nets or HMMs
  –  etc.
•  Some care must go into the formulation of the modeling assumptions to avoid degenerate solutions, e.g., assigning every point to its own cluster
•  Priors can help avoid degenerate solutions
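The conjugate-prior slides earlier can be made concrete. A Beta(a, b) prior behaves like a−1 imagined heads and b−1 imagined tails; after observing h heads and t tails, the posterior is Beta(a+h, b+t), whose maximum (the MAP estimate) is (a+h−1)/(a+b+h+t−2). A sketch, with the specific prior counts taken from the slides' fairness examples:

```python
def map_estimate_heads(h, t, a=1.0, b=1.0):
    """MAP estimate of P(H) under a Beta(a, b) prior after h heads and t tails.

    Beta(a, b) acts like a-1 imagined heads and b-1 imagined tails; the
    posterior Beta(a+h, b+t) peaks at (a+h-1) / (a+b+h+t-2).
    """
    return (a + h - 1.0) / (a + b + h + t - 2.0)

# D = HTTHT: 2 heads, 3 tails.
# Uniform prior Beta(1, 1): MAP equals the maximum likelihood estimate 0.4.
assert map_estimate_heads(2, 3) == 0.4

# Strong prior towards fairness (100 imagined heads and tails, Beta(101, 101)):
# five observations barely move the estimate away from 0.5.
strong = map_estimate_heads(2, 3, a=101.0, b=101.0)
assert abs(strong - 0.5) < 0.01

# Weak prior towards fairness (1 imagined head and tail, Beta(2, 2)):
# easily pulled towards the data, landing between 0.4 and 0.5.
weak = map_estimate_heads(2, 3, a=2.0, b=2.0)
assert 0.4 < weak < 0.5
```

This also illustrates "data can overwhelm a prior": as h + t grows, the imagined counts become negligible and the MAP estimate approaches the observed frequency.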

This note was uploaded on 02/17/2012 for the course COMPSCI 170 taught by Professor Parr during the Spring '11 term at Duke.
