Model Learning and Clustering
CPS170
Ron Parr
material from: Lise Getoor, Andrew Moore, Tom Mitchell, Sebastian Thrun, Rich Maclin

Unsupervised Learning
• Supervised learning: Data <x1, x2, ..., xn, y>
• Unsupervised learning: Data <x1, x2, ..., xn>
• So, what's the big deal? Isn't y just another feature?
• No explicit performance objective
– Bad news: Problem not necessarily well defined without further assumptions
– Good news: Results can be useful for more than predicting y

Model Learning
• Produce a global summary of the data
• Not an exact copy
• Consider a space of models M and a dataset D
• One approach: Maximize P(M|D)
• How to do this? Bayes rule:

  P(M|D) = P(D|M) P(M) / P(D)

Example: Modeling Coin Flips
• Suppose we have observed: D = HTTHT
• Which is a better model?
– P(H) = 0.4
– P(H) = 0.5

  P(M|D) = P(D|M) P(M) / P(D)

  P(D | P(H) = 0.5) = 0.5^5 = 0.03125
  P(D | P(H) = 0.4) = 0.4^2 * 0.6^3 = 0.03456
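As a quick illustrative check (added here, not part of the original slides), a few lines of Python can score both candidate models on D = HTTHT; the helper function name is my own.

```python
# Minimal sketch: likelihood of D = HTTHT under two candidate coin models.
def likelihood(data, p_heads):
    """P(data | model), assuming independent flips."""
    prob = 1.0
    for flip in data:
        prob *= p_heads if flip == "H" else (1.0 - p_heads)
    return prob

data = "HTTHT"
for p in (0.5, 0.4):
    print(f"P(D | P(H)={p}) = {likelihood(data, p):.5f}")
# P(D | P(H)=0.5) = 0.03125
# P(D | P(H)=0.4) = 0.03456  -> the maximum likelihood choice of the two
```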
• What about P(D) and P(M)?

Model Learning With Bayes Rule

  P(M|D) = P(D|M) P(M) / P(D)

• We call P(D|M) the likelihood
• We can ignore P(D)... Why? (It is the same for every model, so it does not change which model maximizes the posterior.)
• What about P(M)?
– Call this our prior probability on models
– If P(M) is uniform (all models equally likely), then maximizing P(D|M) is equivalent to maximizing P(M|D) (call this the maximum likelihood approach)

Using Priors
• Suppose we have good reason to expect that the coin is fair
• Should we really conclude P(H) = 0.4?
• Suppose we think P(P(H) = 0.5) = 2 x P(P(H) = 0.4)
• This means P(D | P(H) = 0.4) must be 2x larger than P(D | P(H) = 0.5) to compensate if P(H) = 0.4 is to maximize the posterior probability

  P(M|D) = P(D|M) P(M) / P(D)
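To make the trade-off concrete, here is a small illustrative computation (assumed code, not from the slides): with the 2:1 prior favoring the fair coin, the fair model wins the posterior comparison on D = HTTHT, because its likelihood is less than 2x smaller.

```python
# Minimal sketch: unnormalized posteriors P(D|M)P(M) for the two coin models,
# using the relative prior weights 2 and 1 from the slide (P(D) cancels out).
models = {0.5: 2.0, 0.4: 1.0}   # p_heads -> relative prior weight

def likelihood(data, p_heads):
    prob = 1.0
    for flip in data:
        prob *= p_heads if flip == "H" else (1.0 - p_heads)
    return prob

data = "HTTHT"
for p, prior in models.items():
    print(f"P(H)={p}: likelihood * prior = {likelihood(data, p) * prior:.5f}")
# P(H)=0.5: 0.03125 * 2 = 0.06250  -> the fair coin maximizes the posterior
# P(H)=0.4: 0.03456 * 1 = 0.03456
```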
Data Can Overwhelm a Prior
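This slide title refers to a figure; as an illustrative stand-in (the repeated 40%-heads datasets below are assumed, not from the slides), growing the data makes the likelihood ratio swamp the fixed 2:1 prior.

```python
from math import log

# Minimal sketch: with an assumed dataset that stays 40% heads, more data
# makes the likelihood ratio overwhelm the fixed 2:1 prior favoring fairness.
def log_likelihood(n_heads, n_tails, p_heads):
    return n_heads * log(p_heads) + n_tails * log(1.0 - p_heads)

for n in (5, 50, 500):                     # total number of flips
    h, t = int(0.4 * n), n - int(0.4 * n)  # 40% heads observed
    log_post_fair   = log_likelihood(h, t, 0.5) + log(2.0)  # prior weight 2
    log_post_biased = log_likelihood(h, t, 0.4) + log(1.0)  # prior weight 1
    winner = 0.5 if log_post_fair > log_post_biased else 0.4
    print(f"n = {n}: MAP model is P(H) = {winner}")
# n = 5:   MAP model is P(H) = 0.5  (the prior dominates)
# n = 50:  MAP model is P(H) = 0.4  (the data overwhelms the prior)
# n = 500: MAP model is P(H) = 0.4
```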
Specifying Priors
• In our coin example, we considered just two models, P(H) = 0.4 and P(H) = 0.5
• In general, we might want to specify a distribution over all possible coin probabilities
• This introduces complications:
– P(M) is now a distribution over a continuous parameter
– Need to use calculus to find the maximizer of P(D|M)P(M) (see the worked example below)
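As a worked illustration of that calculus (added here, not on the original slides): maximizing the likelihood of D = HTTHT over a continuous heads probability θ recovers the 0.4 estimate.

```latex
% Worked example: maximum likelihood for D = HTTHT (2 heads, 3 tails)
% over a continuous heads probability \theta.
\[
  P(D \mid \theta) = \theta^{2}(1-\theta)^{3}
\]
\[
  \frac{d}{d\theta}\Bigl[2\ln\theta + 3\ln(1-\theta)\Bigr]
    = \frac{2}{\theta} - \frac{3}{1-\theta} = 0
  \quad\Longrightarrow\quad \theta = \tfrac{2}{5} = 0.4
\]
% With a uniform prior P(\theta), the maximizer of P(D|\theta)P(\theta) is the same point.
```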
Conjugate Priors
• A likelihood and prior are said to be conjugate if their product has the same parametric form as the prior
• (This is outside the scope of the class, but we provide one nice example.)
• The beta distribution is conjugate to the binomial distribution
– Can think of the beta distribution as specifying a number of "imagined" heads and tails
– Maximum of the posterior adds together observed heads and tails with imagined heads and tails
– Examples:
• Prior of 100 heads and 100 tails is a strong prior towards fairness
• Prior of 1 head and 1 tail is a weak prior towards fairness
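A small sketch of the "observed plus imagined flips" view described above; it follows the slides' simplified add-the-counts description, and the pseudo-count values are illustrative, not from the slides.

```python
# Minimal sketch: estimate of P(H) when a beta-style prior is treated as
# "imagined" heads/tails added to the observed counts (per the slides'
# simplified description; the pseudo-counts below are illustrative).
def map_estimate(obs_heads, obs_tails, imagined_heads, imagined_tails):
    heads = obs_heads + imagined_heads
    tails = obs_tails + imagined_tails
    return heads / (heads + tails)

obs_h, obs_t = 2, 3                           # D = HTTHT
print(map_estimate(obs_h, obs_t, 0, 0))       # 0.4    no prior: pure max likelihood
print(map_estimate(obs_h, obs_t, 1, 1))       # ~0.429 weak prior towards fairness
print(map_estimate(obs_h, obs_t, 100, 100))   # ~0.498 strong prior towards fairness
```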
Clustering as Modeling
• Clustering assigns points in a space to clusters
• Example: By examining X-rays of cancer tumors, one might identify different subtypes of cancer based upon growth patterns
• Each cluster has its own probabilistic model describing how items of that cluster's type behave

Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with similar claim cost
• City planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

Example of Subtleties in Clustering
• Household dataset: location, income, number of children, rent/own, crime rate, number of cars
• Appropriate clustering may depend on use:
– Goal to minimize delivery time ⇒ cluster by location
– Others?
– Clustering work often suffers from a mismatch between the clustering objective function and the performance criterion

Clustering Desiderata
• Decomposition or partition of data into groups so that
– Points in one group are similar to each other
– Are as different as possible from the points in other groups
• Measure of distance is fundamental
• Explicit representation:
– D(x(i), x(j)) for each pair of points
– Only feasible for small domains
• Implicit representation by measurement:
– Distance computed from features
– Implement this as a function (see the sketch below)
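A tiny sketch of an implicit, feature-based distance function; the feature choices below are hypothetical, and real uses would scale or weight features.

```python
# Minimal sketch: an implicit distance measure computed from feature vectors,
# here plain Euclidean distance over whatever features describe each item.
def distance(x_i, x_j):
    return sum((a - b) ** 2 for a, b in zip(x_i, x_j)) ** 0.5

# Hypothetical household records: (latitude, longitude, income in $10k)
print(distance((36.0, -78.9, 5.2), (36.1, -78.8, 7.5)))
```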
Families of Clustering Algorithms
• Partition-based methods
– e.g., K-means
• Hierarchical clustering
– e.g., hierarchical agglomerative clustering
• Probabilistic model-based clustering
– e.g., mixture models
• Graph-based methods
– e.g., spectral methods

K-means
• Start with randomly chosen cluster centers
• Assign points to closest cluster
• Recompute cluster centers
• Reassign points
• Repeat until no changes
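The steps above are short enough to sketch directly; the following is an illustrative Python implementation (variable names, the empty-cluster handling, and the toy data are my own, not from the slides).

```python
import random

# Minimal K-means sketch following the steps on the slide.
def kmeans(points, k, max_iters=100):
    centers = random.sample(points, k)           # start with random cluster centers
    clusters = []
    for _ in range(max_iters):
        # Assign each point to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((pi - ci) ** 2
                                                for pi, ci in zip(p, centers[c])))
            clusters[j].append(p)
        # Recompute each center as the mean of its assigned points.
        new_centers = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(dim) / len(cluster)
                                         for dim in zip(*cluster)))
            else:
                new_centers.append(centers[j])    # keep an empty cluster's old center
        if new_centers == centers:                # repeat until no changes
            break
        centers = new_centers
    return centers, clusters

# Toy 2-D data, loosely mimicking the X(1)..X(8) points in the example figures.
pts = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9)]
centers, clusters = kmeans(pts, k=3)
print(centers)
```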
K-means examples (figure slides)
• Two example runs on points X(1)–X(8) with centers c1, c2, c3, stepping through point assignments and center updates on each iteration (example #1 and example #2 start from different center placements)
Demo: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Complexity
• Does the algorithm terminate? Yes
• Does the algorithm converge to the optimal clustering? Can only guarantee a local optimum
• Time complexity of one iteration? O(nk)
Understanding k-Means
• Implicitly models data as coming from a Gaussian distribution centered at the cluster centers
• log P(data) ~ sum of squared distances

  P(x_i ∈ c_j) ∝ e^(-(x_i - c_j)^2)
  P(data) = ∏_i P(x_i ∈ c_clustering(i))
  log P(data) = α ∑_i (x_i - c_clustering(i))^2   (where α < 0 collects the constants)
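A small illustrative check (the toy data and α = -1 are assumptions) that the quantity k-means minimizes, the within-cluster sum of squared distances, is this log-likelihood up to the constant α.

```python
# Minimal sketch: the k-means objective (within-cluster sum of squared distances)
# is, up to the constant alpha, the log-likelihood under the Gaussian-like model above.
def sum_squared_distances(points, centers, assignment):
    total = 0.0
    for p, j in zip(points, assignment):
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, centers[j]))
    return total

pts = [(1, 1), (2, 1), (8, 8), (9, 8)]          # toy data (assumed)
centers = [(1.5, 1.0), (8.5, 8.0)]
assignment = [0, 0, 1, 1]                       # clustering(i) for each point
ssd = sum_squared_distances(pts, centers, assignment)
alpha = -1.0                                    # assumed constant; depends on the variance
print("sum of squared distances:", ssd)         # 1.0
print("log P(data) up to constants:", alpha * ssd)
```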
Understanding k-Means II
• Each step of k-means increases P(data)
– Reassigning points moves points to clusters for which their coordinates have higher probability
– Recomputing means moves cluster centers to increase the average probability of points in the cluster
• A fixed number of possible assignments and a monotonic score implies convergence
Understanding k-Means III

  P(M|D) = P(D|M) P(M) / P(D)

• Can view k-means as a maximum likelihood method with a twist
– Unlike the coin toss example, there is a hidden variable with each datum: the cluster membership
– k-means iteratively improves its guesses about these hidden pieces of information
• k-means can be interpreted as an instance of a general approach to dealing with hidden variables called Expectation Maximization (EM)

But How Do We Pick k?
• Sometimes there will be an obvious choice given background knowledge or the intended use of the clustering output
• What if we just iterated over k?
– Picking k = n will always maximize P(D|M)
– We could introduce a prior over models using P(M) in Bayes rule
• Compare a prior over models with regularization:
– Regularization in regression penalized overly complex solutions
– We can assign models with a high number of clusters low probability to achieve a similar effect
– (In general, use of priors subsumes regularization.)

Is Clustering Well Defined?
• Clustering is not clearly axiomatized
• Can we define an "optimal" clustering without specifying an a priori preference (prior) on the cluster sizes or making additional assumptions?
• Kleinberg: Clustering is impossible under some plausible assumptions (in other words, the union of unstated assumptions made by clustering algorithms is logically inconsistent)
• Recent efforts make progress putting clustering on more solid ground

Model Learning Conclusion
• Often seek to find the most likely model given the data
• Can be viewed as maximizing the posterior P(M|D) using Bayes rule
• Model learning can be applied to:
– Coin flips
– Clustering
– Learning parameters of Bayes nets or HMMs
– etc.
• Some care must go into the formulation of modeling assumptions to avoid degenerate solutions, e.g., assigning every point to its own cluster
• Priors can help avoid degenerate solutions