lect14-maxent.ppt - Maximum Entropy, Lecture #13

Slide 1: Maximum Entropy
Lecture #13, Introduction to Natural Language Processing
CMPSCI 585, Fall 2007, University of Massachusetts Amherst
Andrew McCallum (slides from Jason Eisner and Dan Klein)

Slide 2: Probability is Useful
• We love probability distributions!
• We've learned how to define & use p(…) functions.
• Pick best output text T from a set of candidates
  • speech recognition; machine translation; OCR; spell correction …
  • maximize p₁(T) for some appropriate distribution p₁
• Pick best annotation T for a fixed input I
  • text categorization; parsing; part-of-speech tagging …
  • maximize p(T | I); equivalently maximize joint probability p(I, T)
  • often define p(I, T) by the noisy channel: p(I, T) = p(T) * p(I | T) (a toy decoding sketch follows the slide excerpts)
  • speech recognition & the other tasks above are cases of this too:
    • we're maximizing an appropriate p₁(T) defined by p(T | I)
• Pick best probability distribution (a meta-problem!)
  • really, pick best parameters θ: train HMM, PCFG, n-grams, clusters …
  • maximum likelihood; smoothing
  • Smoothing: max p(θ | data) = max p(θ, data) = max p(θ) p(data | θ) (spelled out after the slide excerpts)
[summary of half of the course (statistics)]

Slide 3: Probability is Flexible
• We love probability distributions!
• We've learned how to define & use p(…) functions.
• We want p(…) to define the probability of linguistic objects
  • Sequences of words, tags, morphemes, phonemes (n-grams, FSMs, FSTs; Viterbi, collocations)
  • Vectors (naïve Bayes; clustering word senses)
  • Trees of (non)terminals (PCFGs; CKY, Earley)
• We've also seen some not-so-probabilistic stuff
  • Syntactic features, morphology. Could be stochasticized?
  • Methods can be quantitative & data-driven but not fully probabilistic: clustering, collocations, …
• But probabilities have wormed their way into most things
• p(…) has to capture our intuitions about the linguistic data
[summary of the other half of the course (linguistics)]

Slide 4: An Alternative Tradition
• Old AI hacking technique:
  • Possible parses (or whatever) have scores.
  • Pick the one with the best score.
• How do you define the score?
  • Completely ad hoc!
  • Throw anything you want into the stew
  • Add a bonus for this, a penalty for that, etc. (a toy scorer in this style follows the slide excerpts)
• "Learns" over time – as you adjust bonuses and penalties by hand to improve performance.
• Total kludge, but totally flexible too …
  • Can throw in any intuitions you might have
[really so alternative?]

Slide 5: An Alternative Tradition
• Old AI hacking technique:
  • Possible parses (or whatever) have scores.
  • Pick the one with the best score.
• How do you define the score?
  • Completely ad hoc!
  • Throw anything you want into the stew
  • Add a bonus for this, a penalty for that, etc. …
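The smoothing bullet on slide 2 compresses an argmax chain. Written out, it is just Bayes' rule plus the observation that p(data) does not depend on θ, nothing beyond what the slide states:

  \arg\max_{\theta} \, p(\theta \mid \text{data})
    = \arg\max_{\theta} \, \frac{p(\theta, \text{data})}{p(\text{data})}
    = \arg\max_{\theta} \, p(\theta)\, p(\text{data} \mid \theta)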
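The noisy-channel bullet on slide 2 relies on the same kind of step: for a fixed input I, argmax_T p(T | I) = argmax_T p(I, T) = argmax_T p(T) * p(I | T). Below is a minimal Python sketch of that selection step for one of the tasks the slide lists (spell correction). The candidate strings and the two probability functions (language_model_prob for p(T), channel_prob for p(I | T)) are invented stand-ins for illustration, not models from the lecture.

import math

# Hypothetical toy tables standing in for the two model components on slide 2:
# p(T) from a language model and p(I | T) from a channel model.
def language_model_prob(t):
    table = {"their house": 0.004, "there house": 0.001, "they're house": 0.0005}
    return table.get(t, 1e-9)   # tiny floor so log() is always defined

def channel_prob(i, t):
    # p(I | T): how likely the observed (misspelled) input I is, given intended text T
    table = {("thier house", "their house"): 0.30,
             ("thier house", "there house"): 0.10,
             ("thier house", "they're house"): 0.05}
    return table.get((i, t), 1e-9)

def best_candidate(i, candidates):
    """Pick argmax_T p(T) * p(I | T), i.e. argmax_T p(T | I) for the fixed input I.
    Summing logs avoids underflow when the probabilities get small."""
    return max(candidates,
               key=lambda t: math.log(language_model_prob(t)) + math.log(channel_prob(i, t)))

print(best_candidate("thier house", ["their house", "there house", "they're house"]))
# prints: their house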
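Slide 4's "add a bonus for this, a penalty for that" recipe can also be made concrete. The sketch below is purely illustrative: the feature names, weights, and parse representation are invented, and hand-adjusting the weights is the "learning" the slide puts in scare quotes. The "really so alternative?" margin note hints at where the lecture is headed: a score that is a weighted sum of features is exactly the shape of model that maximum entropy training makes probabilistic.

# Hypothetical hand-tuned scorer in the "old AI hacking" style of slides 4-5.
# Feature names and weights are made up; in this tradition they were adjusted
# by hand until the system's output looked better.
BONUSES_AND_PENALTIES = {
    "attaches_pp_to_verb": +2.0,   # bonus for a preferred PP attachment
    "uses_rare_rule":      -1.5,   # penalty for an unusual grammar rule
    "crosses_punctuation": -3.0,   # penalty for a constituent spanning a comma
}

def score(features):
    # Completely ad hoc: just add up whatever bonuses and penalties fire.
    return sum(BONUSES_AND_PENALTIES.get(f, 0.0) for f in features)

def pick_best(parses):
    # Possible parses have scores; pick the one with the best score.
    return max(parses, key=lambda p: score(p["features"]))

candidates = [
    {"tree": "parse A", "features": ["attaches_pp_to_verb"]},
    {"tree": "parse B", "features": ["uses_rare_rule", "crosses_punctuation"]},
]
print(pick_best(candidates)["tree"])   # prints: parse A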