10.1.1.153.6679

# Hofmann 2001 formulates a variant that is able to


. . . , $q_{ik}$) is a probability vector with $\sum_{c=1}^{k} q_{ic} = 1$.

The underlying statistical assumption is that a document was created in two stages: First we pick a cluster $P_c$ from $\{1, \ldots, k\}$ with fixed probability $q_c$; then we generate the words $t$ of the document according to a cluster-specific probability distribution $p(t \mid P_c)$. This corresponds to a mixture model where the probability of an observed document $(t_1, \ldots, t_{n_i})$ is

$$p(t_1, \ldots, t_{n_i}) = \sum_{c=1}^{k} q_c \, p(t_1, \ldots, t_{n_i} \mid P_c) \qquad (17)$$

Each cluster $P_c$ is a mixture component. The mixture probabilities $q_c$ describe an unobservable "cluster variable" $z$ which may take the values from $\{1, \ldots, k\}$. A well-established method for estimating models involving unobserved variables is the EM-algorithm (Hastie et al. 2001), which basically replaces the unknown value with its current probability estimate and then proceeds as if it has been observed. Clustering methods for documents based on mixture models have been proposed by Cheeseman & Stutz (1996) and y...
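
The two-stage generative assumption and the EM idea can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the method of any of the cited authors: the excerpt does not say how $p(t_1, \ldots, t_{n_i} \mid P_c)$ factorizes, so the code assumes the words are independent given the cluster (a bag-of-words multinomial). The function name `em_multinomial_mixture`, the smoothing constant `alpha`, and the random initialization are all hypothetical choices made for this example.

```python
import math
import random


def em_multinomial_mixture(docs, k, vocab_size, n_iter=50, seed=0, alpha=1e-2):
    """EM for a k-component mixture over documents given as lists of word ids.

    Assumption (not from the source text): words are independent given the
    cluster, so p(t_1..t_n | P_c) is a product of per-word probabilities.
    Returns (q, p, resp): weights q[c], word distributions p[c][t], and
    soft assignments resp[i][c] of document i to cluster c.
    """
    rng = random.Random(seed)
    n = len(docs)

    # Start from random soft assignments; each row is normalized to sum to 1.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    q, p = [], []
    for _ in range(n_iter):
        # M-step: treat the soft assignments as if they were observed counts.
        # alpha is additive smoothing (an assumption) to keep all values > 0.
        q = [(sum(resp[i][c] for i in range(n)) + alpha) / (n + k * alpha)
             for c in range(k)]
        p = []
        for c in range(k):
            counts = [alpha] * vocab_size
            for i, doc in enumerate(docs):
                for t in doc:
                    counts[t] += resp[i][c]
            total = sum(counts)
            p.append([x / total for x in counts])

        # E-step: posterior of the cluster variable z for each document,
        # proportional to q_c * prod_t p(t | P_c), computed in log space.
        for i, doc in enumerate(docs):
            log_w = [math.log(q[c]) + sum(math.log(p[c][t]) for t in doc)
                     for c in range(k)]
            m = max(log_w)
            w = [math.exp(x - m) for x in log_w]
            total = sum(w)
            resp[i] = [x / total for x in w]
    return q, p, resp


if __name__ == "__main__":
    # Toy corpus: two documents use words {0, 1}, two use words {2, 3}.
    toy_docs = [[0, 0, 1, 0, 1], [1, 0, 0, 1], [2, 3, 3, 2], [3, 3, 2, 2, 3]]
    q, p, resp = em_multinomial_mixture(toy_docs, k=2, vocab_size=4)
    for i, row in enumerate(resp):
        print(i, [round(x, 3) for x in row])
```

Each row of `resp` plays the role of the probability vector over clusters for one document, and `q` holds the estimated mixture probabilities. Working in log space in the E-step is a design choice to avoid numerical underflow when documents are long.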