Hofmann 2001 formulates a variant that is able to

. . . , $q_{ik}$) is a probability vector with $\sum_{c=1}^{k} q_{ic} = 1$.

The underlying statistical assumption is that a document was created in two stages: first we pick a cluster $P_c$, $c \in \{1, \ldots, k\}$, with fixed probability $q_c$; then we generate the words $t$ of the document according to a cluster-specific probability distribution $p(t \mid P_c)$. This corresponds to a mixture model where the probability of an observed document $(t_1, \ldots, t_{n_i})$ is

$$p(t_1, \ldots, t_{n_i}) = \sum_{c=1}^{k} q_c \, p(t_1, \ldots, t_{n_i} \mid P_c) \qquad (17)$$

Each cluster $P_c$ is a mixture component. The mixture probabilities $q_c$ describe an unobservable "cluster variable" $z$, which may take values from $\{1, \ldots, k\}$. A well-established method for estimating models involving unobserved variables is the EM-algorithm (Hastie et al. 2001), which replaces the unknown value with its current probability estimate and then proceeds as if it had been observed. Clustering methods for documents based on mixture models have been proposed by Cheeseman & Stutz (1996) and y...
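The two-stage generative assumption and the EM idea described above can be illustrated with a minimal sketch. It is not the method of any of the cited authors. It assumes, beyond what the text states, that the words of a document are drawn independently given the cluster, so that $p(t_1, \ldots, t_{n_i} \mid P_c)$ factorises into a product of $p(t \mid P_c)$. The function name `em_multinomial_mixture`, the `smoothing` constant, and the random initialisation are all hypothetical choices made for illustration.

```python
import math
import random


def em_multinomial_mixture(docs, k, vocab_size, n_iter=50, seed=0, smoothing=1e-2):
    """EM sketch for a k-component mixture over documents.

    docs: list of documents, each a list of word ids in range(vocab_size).
    Assumption (not from the source): words are independent given the cluster.
    Returns (q, p, resp): mixture weights q_c, word distributions p(t | P_c),
    and per-document cluster probabilities for the hidden variable z.
    """
    rng = random.Random(seed)
    q = [1.0 / k] * k  # mixture probabilities q_c, uniform start
    p = []             # p[c][t] approximates p(t | P_c), random start
    for _ in range(k):
        row = [rng.random() + 0.5 for _ in range(vocab_size)]
        total = sum(row)
        p.append([v / total for v in row])

    resp = []
    for _ in range(n_iter):
        # E-step: replace the unobserved cluster of each document by its
        # current probability estimate, computed in log space for stability.
        resp = []
        for doc in docs:
            log_joint = [
                math.log(q[c]) + sum(math.log(p[c][t]) for t in doc)
                for c in range(k)
            ]
            m = max(log_joint)
            weights = [math.exp(v - m) for v in log_joint]
            s = sum(weights)
            resp.append([w / s for w in weights])

        # M-step: re-estimate q_c and p(t | P_c) as if the soft assignments
        # had been observed. Additive smoothing (an assumption) keeps every
        # probability strictly positive so the logarithms stay defined.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            q[c] = (mass + smoothing) / (len(docs) + k * smoothing)
            counts = [smoothing] * vocab_size
            for r, doc in zip(resp, docs):
                for t in doc:
                    counts[t] += r[c]
            total = sum(counts)
            p[c] = [v / total for v in counts]
    return q, p, resp
```

A call such as `em_multinomial_mixture([[0, 0, 1], [0, 1, 0], [2, 3, 3], [3, 2, 2]], k=2, vocab_size=4)` returns the estimated mixture probabilities, the cluster-specific word distributions, and one probability vector per document. Like any EM procedure, the result depends on the initialisation and is only a local optimum of the likelihood.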