News | Science | DOI:10.1145/1859204.1859210
Gary Anthes
Topic Models Vs. Unstructured Data
With topic modeling, scientists can explore

Communications of the ACM | December 2010 | Vol. 53 | No. 12 | p. 16

Topic modeling, an amalgam of ideas drawn from computer science, mathematics, and cognitive science, is evolving rapidly to help users understand and navigate huge stores of unstructured data. Topic models use Bayesian statistics and machine learning to discover the thematic content of unlabeled documents, provide application-specific roadmaps through them, and predict the nature of future documents in a collection. Most often used with text documents, topic models can also be applied to collections of images, music, DNA sequences, and other types of information.

Because topic models can discover the latent, or hidden, structure in documents and establish links between documents, they offer a powerful new way to explore and understand information that might otherwise seem chaotic and unnavigable.

The base on which most probabilistic topic models are built today is latent Dirichlet allocation (LDA). Applied to a collection of text documents, LDA discovers “topics,” which are probability distributions over words that co-occur frequently. For example, “software,” “algorithm,” and “kernel” might be found likely to occur in articles about computer science. LDA also discovers the probability distribution of topics in a document. For example, by examining the word patterns and probabilities, one article might be tagged as 100% about computer science while another might be tagged as 10% computer science and 90% neuroscience.

LDA algorithms are built on assumptions of how a “generative” process might create a collection of documents from these probability distributions. The process does that by first assigning to each document a probability distribution across a small number of topics from among, say, 100 possible topics in the collection. Then, for each of these hypothetical documents, a topic is chosen at random (but weighted by its probability distribution), and a word is generated at random from that topic’s probability distribution across the words. This hypothetical process is repeated over and over, each word in a document occurring in proportion to the distribution of topics in the document and the distribution of words in a topic, until all the documents have been generated.

LDA takes that definition of how the documents to be analyzed might have been created, “inverts” the process, and works backward to explain the observed data. This process, called “posterior probabilistic inference,” essentially says, “Given these observed data, and given the model for document-creation posited in the generative process, what conditional distribution of words over topics and of topics over documents resulted in the data I see?” It both defines the topics in a collection and explains the proportions of these topics in each document, and in so doing it discovers the underlying semantic structure of the documents.

LDA and its derivatives are examples
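The hypothetical generative process described above can be sketched in a few lines of Python. This is only an illustration of the forward direction (producing documents from distributions), not of the posterior inference that LDA actually performs. The function names, the two toy topics, the neuroscience words, all probabilities, and the `alpha` value are assumptions made for the example. Drawing each document's topic distribution from a Dirichlet prior follows the standard LDA formulation.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(topics, n_docs, doc_len, alpha=0.5, seed=0):
    """Generate documents by following the hypothetical generative process.

    topics: list of dicts mapping word -> probability (one dict per topic).
    Returns a list of (topic_distribution, words) pairs, one per document.
    """
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_docs):
        # Step 1: give the document its own distribution over topics.
        # A small alpha tends to concentrate the mass on few topics.
        theta = sample_dirichlet([alpha] * len(topics), rng)
        words = []
        for _ in range(doc_len):
            # Step 2: choose a topic at random, weighted by theta.
            k = rng.choices(range(len(topics)), weights=theta)[0]
            # Step 3: choose a word at random from that topic's word distribution.
            vocab = list(topics[k])
            weights = [topics[k][w] for w in vocab]
            words.append(rng.choices(vocab, weights=weights)[0])
        corpus.append((theta, words))
    return corpus

# Toy topics (assumed values): computer science and neuroscience.
TOPICS = [
    {"software": 0.4, "algorithm": 0.4, "kernel": 0.2},
    {"neuron": 0.5, "synapse": 0.3, "cortex": 0.2},
]

corpus = generate_corpus(TOPICS, n_docs=3, doc_len=8)
for theta, words in corpus:
    print([round(p, 2) for p in theta], " ".join(words))
```

Each printed line shows a document's topic proportions followed by its generated words. Inference runs the other way: given only the words, it estimates both the topics and the per-document proportions.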

