Lecture6-WordSegmentation2


Psych 215L: Language Acquisition
Lecture 6: Word Segmentation

Computational problem
Divide spoken speech into words:
húwzəfréjdəvðəbɪ́gbQ́dwə́lf
→ húwz əfréjd əv ðə bɪ́g bQ́d wə́lf
→ "who's afraid of the big bad wolf"

Word boundaries or lexicon items?

Identify word boundaries:
Gambell & Yang (2006): identify boundaries with the USC (Unique Stress Constraint) + transitional probabilities, or with the USC + algebraic learning (though the algebraic learner also identifies lexical items).
Fleck (2008): identify boundaries with phonotactic constraints.
Hewlett & Cohen (2009): identify boundaries with phonotactic constraints.

Identify/optimize lexical items:
Goldwater et al. (2009): bias for shorter & fewer lexicon items (ideal learner).
Johnson & Goldwater (2009): bias for shorter & fewer lexicon items + phonotactic constraints (ideal learner).
Pearl et al. (2011): bias for shorter & fewer lexicon items (constrained learner).
Blanchard et al. (2010): bias for lexicon items obeying phonotactic constraints (constrained learner).
McInnes & Goldwater (2011): extract lexical items from acoustic data (constrained learner).

Looking for lexicons?
Frank et al. (2010, Cognition): examined the predictions of several word segmentation models against human experimental data. The Bayesian model (which explicitly optimizes a lexicon) was usually the better fit. The exception: all models failed to predict human difficulty when there were more lexical items, suggesting that memory limitations are important to include.
Frank et al. (2010, CogSci proceedings): more support that (adult) human learners look to optimize lexicons.

Language acquisition computation as induction (Johnson 2004):
Input (specific linguistic observations) → abstract internal representation/generalization → output (specific linguistic productions).

Modeling learnability vs. modeling acquirability
Modeling learnability: "ideal", "rational", or "computational-level" learners, asking "Can it be learned at all by a simulated learner?" (what is possible to learn).
Modeling acquirability: more "realistic" or "cognitively inspired" learners, asking "Can it be learned by a simulated learner that is constrained in the ways humans are constrained?" (what is possible to learn if you're human).

Probabilistic models for induction
• Typically, an ideal observer approach asks what the optimal solution to the induction problem is, given particular assumptions about knowledge representation and available information.
• Constrained learners implement ideal learners in more cognitively plausible ways.
– How might limitations on memory and processing affect learning?

Word segmentation
• One of the first problems infants must solve when learning language.
• Infants make use of many different cues: phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences.
• Most of these cues are language-dependent, so statistics may provide the initial bootstrapping: statistical regularities are used very early (Thiessen & Saffran, 2003) and are language-independent, so they don't require children to already know some words.

Bayesian inference: model goals
• The Bayesian learner seeks to identify an explanatory linguistic hypothesis that
– accounts for the observed data.
– conforms to prior expectations.
• Ideal learner: the focus is on the goal of the computation, not the procedure (algorithm) used to achieve the goal.
• Constrained learner: uses the same probabilistic model, but the algorithm reflects how humans might implement the computation.
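Before turning to the Bayesian segmentation model, here is a minimal sketch (mine, not from the lecture; the function name segmentations is made up) of why this induction problem is hard: an unsegmented utterance of n phonemes has 2^(n-1) candidate segmentations, since each position between two phonemes may or may not be a word boundary, so the learners discussed below need probabilistic biases and efficient search rather than brute-force enumeration.

```python
# A toy illustration (not from the lecture): enumerate every segmentation of a short
# unsegmented string. Each of the n-1 positions between phonemes either is or is not
# a word boundary, giving 2**(n-1) hypotheses, far too many to enumerate for real corpora.
from itertools import combinations

def segmentations(utterance):
    """Yield every way of splitting `utterance` into contiguous chunks ("words")."""
    n = len(utterance)
    for k in range(n):                               # how many internal boundaries to use
        for cuts in combinations(range(1, n), k):    # where to place them
            edges = (0,) + cuts + (n,)
            yield [utterance[a:b] for a, b in zip(edges, edges[1:])]

hypotheses = list(segmentations("thedog"))
print(len(hypotheses))   # 32 = 2**5 candidate segmentations for a 6-character string
print(hypotheses[:3])    # [['thedog'], ['t', 'hedog'], ['th', 'edog']]
```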
Bayesian segmentation
• In the domain of segmentation, we have:
– Data: an unsegmented corpus (transcriptions).
– Hypotheses: sequences of word tokens.
• The likelihood P(d|h) = 1 if concatenating the hypothesized words forms the corpus, and 0 otherwise.
Corpus: "lookatthedoggie"
P(d|h) = 1: "loo k atth ed oggie", "lookat thedoggie", "look at the doggie"
P(d|h) = 0: "i like penguins", "look at thekitty", "abc"
• The prior P(h) encodes the assumptions or biases of the learner.
• The optimal solution is the segmentation with the highest probability.

An ideal Bayesian learner for word segmentation
• The model considers the hypothesis space of segmentations, preferring those where
– the lexicon is relatively small.
– words are relatively short.
• The learner has a perfect memory for the data: the entire corpus is available in memory, and only the counts of lexicon items are required to compute the highest-probability segmentation.
• Assumption: phonemes are the relevant unit of representation.

Investigating learner assumptions
• If a learner assumes that words are independent units, what is learned from realistic data? [unigram model]
• What if the learner assumes that words are units that help predict other units? [bigram model]
Note: this is the approach of Goldwater, Griffiths, & Johnson (2007, 2009): use a Bayesian ideal observer to examine the consequences of making these different assumptions.

Generative process: unigram model
• Choose the next word in the corpus using a Dirichlet Process (DP) with concentration parameter α and base distribution P_0:
P(w_i = w | w_1 … w_{i-1}) = (n_w + α·P_0(w)) / (i - 1 + α)
where n_w is the number of times w has occurred among the previous i - 1 words.
• The base distribution P_0 is the probability of generating a new word from its phonemes:
P_0(w_i = x_1 … x_m) = ∏_{j=1}^{m} P(x_j)

Walkthrough: unigram model
Word w_i is generated as follows:
1. Is w_i a novel lexical item?
P(yes) = α / (n + α), P(no) = n / (n + α), where n is the number of words generated so far.
(Fewer word types = higher probability.)
2. If novel, generate its phonemic form x_1 … x_m:
P_0(w_i = x_1 … x_m) = ∏_{j=1}^{m} P(x_j)
(Shorter words = higher probability.)
Otherwise, choose the lexical identity of w_i from the previously occurring words:
P(w_i = w) = n_w / n
(Power law: higher probability for more frequent words.)

Generative process and walkthrough: bigram model
• The bigram model is a hierarchical Dirichlet process (Teh et al., 2005). Word w_i is generated as follows:
1. Is (w_{i-1}, w_i) a novel bigram?
P(yes) = β / (n_{w_{i-1}} + β), P(no) = n_{w_{i-1}} / (n_{w_{i-1}} + β)
2. If novel, generate w_i using (almost) the unigram model, choosing the word based on the previous word's identity and all previous words (base distribution P_1, with a second concentration parameter γ):
P_1(w_i = w | w_1 … w_{i-1}) = (b_w + γ·P_0(w)) / (b + γ)
where b_w is the number of bigram types ending in w and b is the total number of bigram types.
Otherwise, choose the lexical identity of w_i from the words that previously occurred after w_{i-1}:
P(w_i = w | w_{i-1} = w′) = n_{(w′,w)} / n_{w′}
Putting the two cases together, the predictive probability is
P(w_i = w | w_{i-1} = w′, w_1 … w_{i-2}) = (n_{(w′,w)} + β·P_1(w)) / (n_{w′} + β)
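A minimal sketch (my own code, not the authors'; the helper names and the toy concentration value are assumptions) of the unigram predictive probability defined above. It makes the two biases concrete: words already in the lexicon are cheap to reuse in proportion to their frequency, and novel words are paid for through P_0, so shorter novel words are preferred.

```python
# Sketch of the unigram DP predictive probability from the slides (names are mine):
#   P(w_i = w | w_1..w_{i-1}) = (n_w + alpha * P0(w)) / (i - 1 + alpha)
#   P0(w = x_1..x_m)          = product over phonemes of P(x_j)
from collections import Counter

def p0(word, phoneme_probs):
    """Base distribution: probability of building `word` phoneme by phoneme."""
    prob = 1.0
    for ph in word:
        prob *= phoneme_probs[ph]
    return prob

def unigram_predictive(word, counts, n_tokens, phoneme_probs, alpha=20.0):
    """DP predictive probability of `word`, given counts over the n_tokens words so far."""
    return (counts[word] + alpha * p0(word, phoneme_probs)) / (n_tokens + alpha)

# Toy setup: a uniform phoneme distribution and a small lexicon of counts.
phoneme_probs = {ph: 1.0 / 26 for ph in "abcdefghijklmnopqrstuvwxyz"}
counts = Counter({"the": 5, "doggie": 2, "look": 2, "at": 1})
n = sum(counts.values())
print(unigram_predictive("the", counts, n, phoneme_probs))       # frequent word: high
print(unigram_predictive("kitty", counts, n, phoneme_probs))     # novel word: low, via P0
print(unigram_predictive("kittycat", counts, n, phoneme_probs))  # longer novel word: lower still
```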
Search through the hypothesis space of segmentations
• The model defines a distribution over hypotheses, P(h|d). Gibbs sampling can be used to find a good hypothesis.
• This iterative procedure produces samples from the posterior distribution over hypotheses.

Gibbs sampling
• Compare pairs of hypotheses that differ by a single word boundary:
whats.that the.doggie yeah wheres.the.doggie …
whats.that the.dog.gie yeah wheres.the.doggie …
• Calculate the probabilities of the words that differ, given the current analysis of all the other words in the corpus.
• Sample a hypothesis according to the ratio of the probabilities.

Corpus: child-directed speech samples
• Bernstein-Ratner corpus:
– 9790 utterances of phonemically transcribed child-directed speech (children aged 19-23 months): 33,399 word tokens and 1,321 unique word types.
– Average utterance length: 3.4 words; average word length: 2.9 phonemes.
• Example input:
yuwanttusiD6bUk (≈ youwanttoseethebook)
lUkD*z6b7wIThIzh&t (≈ looktheresaboywithhishat)
&nd6dOgi (≈ andadoggie)
yuwanttulUk&tDIs (≈ youwanttolookatthis)
...

Results: ideal learner (standard MCMC)
Precision: #correct / #found ("How many of what I found are right?")
Recall: #correct / #true ("How many did I find that I should have found?")

Worked example:
Correct segmentation: "look at the doggie. look at the kitty."
Best guess of the learner: "lookat the doggie. lookat thekitty."
Word token Prec = 2/5 (0.4), word token Rec = 2/8 (0.25)
Boundary Prec = 3/3 (1.0), boundary Rec = 3/6 (0.5)
Lexicon Prec = 2/4 (0.5), lexicon Rec = 2/5 (0.4)

                  Word Tokens     Boundaries      Lexicon
                  Prec   Rec      Prec   Rec      Prec   Rec
Ideal (unigram)   61.7   47.1     92.7   61.6     55.1   66.0
Ideal (bigram)    74.6   68.4     90.4   79.8     63.3   62.6

• The assumption that words predict other words is good: the bigram model generally has superior performance.
• Both models tend to undersegment, though the bigram model does so less (boundary precision > boundary recall).
• Note: the training set was used as the test set.
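A small sketch (helper names are mine) that reproduces the worked example above, scoring a guessed segmentation against the correct one with token, boundary, and lexicon precision and recall. A token counts as correct only when it spans exactly one word of the correct segmentation; utterance-edge boundaries are not scored.

```python
# Evaluation sketch for the worked example: token, boundary, and lexicon precision/recall.

def internal_boundaries(words):
    """Word-internal boundary positions; utterance-edge boundaries are not scored."""
    positions, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        positions.add(pos)
    return positions

def token_spans(words):
    """(start, end) character offsets of each word token."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def score(gold_utts, guess_utts):
    tok_c = tok_f = tok_t = bnd_c = bnd_f = bnd_t = 0
    gold_lex, guess_lex = set(), set()
    for gold, guess in zip(gold_utts, guess_utts):
        gb, sb = internal_boundaries(gold), internal_boundaries(guess)
        bnd_c += len(gb & sb)
        bnd_f += len(sb)
        bnd_t += len(gb)
        tok_c += len(token_spans(gold) & token_spans(guess))  # correct = exact span match
        tok_f += len(guess)
        tok_t += len(gold)
        gold_lex |= set(gold)
        guess_lex |= set(guess)
    lex_c = len(gold_lex & guess_lex)
    return {"tokens":     (tok_c / tok_f, tok_c / tok_t),
            "boundaries": (bnd_c / bnd_f, bnd_c / bnd_t),
            "lexicon":    (lex_c / len(guess_lex), lex_c / len(gold_lex))}

gold  = [["look", "at", "the", "doggie"], ["look", "at", "the", "kitty"]]
guess = [["lookat", "the", "doggie"], ["lookat", "thekitty"]]
print(score(gold, guess))
# {'tokens': (0.4, 0.25), 'boundaries': (1.0, 0.5), 'lexicon': (0.5, 0.4)}
```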
Results: ideal learner sample segmentations
Unigram model:
youwant to see thebook
look theres aboy with his hat
and adoggie
you wantto lookatthis
lookatthis
havea drink
okay now whatsthis
whatsthat
whatisit
look canyou take itout
...
Bigram model:
you want to see the book
look theres a boy with his hat
and a doggie
you want to lookat this
lookat this
have a drink
okay now whats this
whats that
whatis it
look canyou take it out
...

How about constrained learners?
Considering human limitations: what if the only limitation is that the learner must process utterances one at a time? The constrained learners use the same probabilistic model, but process the data incrementally (one utterance at a time) rather than all at once:
– Dynamic Programming with Maximization (DPM)
– Dynamic Programming with Sampling (DPS)
– Decayed Markov Chain Monte Carlo (DMCMC)

Dynamic Programming: Maximization (DPM)
For each utterance:
• Use dynamic programming to compute the highest-probability segmentation.
• Add the counts of the segmented words to the lexicon.
Example: "you want to see the book"
0.33 yu want tusi D6bUk
0.21 yu wanttusi D6bUk
0.15 yuwant t usi D6 bUk
…
This is the algorithm used by Brent (1999), with a different model.

Considering human limitations: what if humans don't always choose the most probable hypothesis, but instead sample among the different hypotheses available?

Dynamic Programming: Sampling (DPS)
For each utterance:
• Use dynamic programming to compute the probabilities of all segmentations, given the current lexicon.
• Sample a segmentation.
• Add the counts of the segmented words to the lexicon.

Considering human limitations: what if humans are more likely to pay attention to potential word boundaries that they have heard more recently (decaying memory = a recency effect)?

Decayed Markov Chain Monte Carlo (DMCMC)
For each utterance:
• Probabilistically sample s boundaries from all utterances encountered so far, where
Prob(sample b) ∝ b_a^(-d)
and b_a is the number of potential boundary locations between b and the end of the current utterance, and d is the decay rate (Marthi et al. 2002).
• Update the lexicon after every sample.
(Figure: the probability of sampling each potential boundary, shown first over utterance 1 "you want to see the book" = yuwanttusiD6bUk, then over utterances 1 and 2 after "what's this" = wAtsDIs arrives, with higher probability on more recent boundaries.)

Decay rates tested: 2, 1.5, 1, 0.75, 0.5, 0.25, 0.125.
Probability of sampling within the current utterance:
d = 2: .942
d = 1.5: .772
d = 1: .323
d = 0.75: .125
d = 0.5: .036
d = 0.25: .009
d = 0.125: .004
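A minimal sketch (my own code, not the authors' implementation) of the decayed sampling distribution just described: a potential boundary b is sampled with probability proportional to b_a^(-d). One detail is an assumption on my part: the boundary closest to the end of the current utterance is given b_a = 1, so the weight is never divided by zero.

```python
# Decayed-MCMC boundary sampling weights: Prob(sample b) proportional to b_a ** (-d).

def boundary_weights(utterance_lengths, d):
    """(utterance index, position, weight) for every potential internal boundary.

    `utterance_lengths` = phonemes per utterance seen so far, most recent last;
    an utterance of L phonemes has L - 1 potential internal boundaries."""
    weights, b_a = [], 1   # closest boundary gets b_a = 1 (assumption, avoids 0 ** -d)
    for u in reversed(range(len(utterance_lengths))):
        for pos in reversed(range(1, utterance_lengths[u])):
            weights.append((u, pos, b_a ** (-d)))
            b_a += 1
    return weights

def prob_within_current(utterance_lengths, d):
    """Total probability mass the sampler puts on the most recent utterance."""
    ws = boundary_weights(utterance_lengths, d)
    total = sum(w for _, _, w in ws)
    current = len(utterance_lengths) - 1
    return sum(w for u, _, w in ws if u == current) / total

# Two utterances roughly matching the slide's example ("yuwanttusiD6bUk", "wAtsDIs"):
lengths = [15, 7]
for d in (2, 1.5, 1, 0.75, 0.5, 0.25, 0.125):
    print(d, round(prob_within_current(lengths, d), 3))
# The mass on the current utterance drops sharply as d decreases; the exact numbers
# depend on how many earlier utterances are in memory, so they will not match the
# slide's table exactly.
```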
Results: unigrams vs. bigrams
F = 2·Prec·Rec / (Prec + Rec), with Precision = #correct / #found and Recall = #correct / #true.
Results are averaged over 5 randomly generated test sets (~900 utterances each) that were separate from the training sets (~8800 utterances), all generated from the Bernstein-Ratner corpus.
DMCMC unigram: d = 1, s = 20,000; DMCMC bigram: d = 0.25, s = 20,000. (Note: s = 20,000 means the DMCMC learner samples 89% less often than the Ideal learner.)
• Like the Ideal learner, the DPM and DMCMC bigram learners perform better than their unigram counterparts, though the improvement is not as great as for the Ideal learner. The bigram assumption is helpful.
• However, the DPS bigram learner performs worse than the DPS unigram learner. There, the bigram assumption is not helpful.
• Unigram comparison: DPM, DMCMC > Ideal, DPS. Interesting: constrained learners outperform the unconstrained learner when words are believed to be independent units.
• Bigram comparison: Ideal, DMCMC > DPM > DPS. Interesting: a constrained learner performs equivalently to the unconstrained learner when words are believed to be predictive units.

Results: unigrams vs. bigrams for the lexicon
The lexicon matters because it is a seed pool of words that children can use to figure out language-dependent word segmentation strategies.
• Like the Ideal learner, the DPM bigram learner yields a more reliable lexicon than its unigram counterpart. However, the DPS and DMCMC bigram learners yield less reliable lexicons than their unigram counterparts.
• Unigram comparison: DMCMC > Ideal > DPM > DPS. Interesting: a constrained learner outperforms the unconstrained learner when words are believed to be independent units.
• Bigram comparison: Ideal > DPM > DMCMC > DPS. More expected: the unconstrained learner outperforms the constrained learners when words are believed to be predictive units (though not by much).

Results: under- vs. oversegmentation
Undersegmentation: boundary precision > boundary recall. Oversegmentation: boundary precision < boundary recall.
• The DMCMC unigram learner, like the Ideal learner, tends to undersegment. Based on Peters (1983), children may have a tendency to undersegment, too.
• All the other learners, however, tend to oversegment.

Results: exploring different performance measures
Some positions in the utterance are more easily segmented by infants, such as the first and last words of the utterance (Seidl & Johnson 2006). If the models are reasonable reflections of human behavior, their performance on the first and last words should be better than their performance over the entire utterance. Moreover, they should perform equally well on the first and last words in order to match infant behavior.

Results: first/last word vs. whole utterance
• On one set of measures, the DPM and DMCMC learners show the desired behavior. The Ideal learner improves for both, but improves more for last words. The DPS learner only improves for first words.
• On another set of measures, the DPM and DPS learners show the desired behavior. The Ideal and DMCMC learners only improve for the first word.
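The results above repeatedly reduce precision and recall to a single F-score and use the boundary precision/recall comparison to diagnose under- vs. oversegmentation, so here is a tiny sketch of both (function names are mine), applied to the earlier worked example.

```python
# F = 2 * Prec * Rec / (Prec + Rec); boundary precision vs. recall diagnoses
# undersegmentation (missed boundaries) vs. oversegmentation (spurious boundaries).

def f_score(prec, rec):
    return 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0

def segmentation_tendency(boundary_prec, boundary_rec):
    if boundary_prec > boundary_rec:
        return "undersegmentation"
    if boundary_prec < boundary_rec:
        return "oversegmentation"
    return "balanced"

# With the worked example from earlier ("lookat the doggie. lookat thekitty."):
print(f_score(0.4, 0.25))               # token F is about 0.31
print(segmentation_tendency(1.0, 0.5))  # boundary prec > rec, so undersegmentation
```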
Results: main points

A better set of cognitively inspired statistical learners: while no constrained learner outperforms the best ideal learner on all measures, all of them perform better on realistic child-directed speech data than a transitional-probability learner, and they outperform other unsupervised word segmentation models. Implication: learners that optimize a lexicon may work better than learners that only look for word boundaries.

Ideal learner behavior doesn't always transfer: while assuming that words are predictive units (the bigram model) significantly helped the ideal learner, this assumption may not be as useful to a constrained learner (depending on how the cognitive limitations are implemented). Speculation: some of the constrained learners are unable to successfully search the larger hypothesis space that exists for the bigram model.

Constraints on processing are not always harmful:
• The decayed MCMC learner can perform well even with more than 99.9% less processing than the unconstrained ideal learner.
• The DMCMC unigram learner outperforms the Ideal learner when both sample the same number of times, which suggests there is something special about the way the DMCMC learner approximates its inference process. (This is not true for the bigram learner, though.)
• Constrained unigram learners can sometimes outperform the unconstrained unigram learner ("Less is More" hypothesis: Newport 1990). This behavior persists when tested on a larger corpus of English child-directed speech (Pearl-Brent), suggesting it's not just a fluke of the Bernstein-Ratner corpus.
• Why might the unigram DMCMC learner fare better? The answer has to do with the Ideal learner's superior memory capacity and processing abilities. Because the Ideal learner can see everything all the time and update anything at any point, it can notice that certain short items (e.g., actual words like it's and a) appear together very frequently. The only way for a unigram learner to represent this dependency is as a single lexicon item, so the Ideal learner "fixes" the earlier analyses in which it treated these as two separate lexical items. The DMCMC learner does not have the memory and processing power to make this same mistake, and the result is that the Ideal learner makes many more errors on frequent lexical items than the DMCMC learner does.
• This is related to Newport (1990)'s "Less is More" hypothesis: limited processing abilities can be advantageous for acquisition.

"… the more limited inference process of the DMCMC learner focuses its attention only on the current frequency information and does not allow it to view the frequency of the corpus as a whole. Coupled with this learner's more limited ability to correct its initial hypotheses about lexicon items, this leads to superior segmentation performance. We note, however, that this superior performance is mainly due to the unigram learner's inability to capture word sequence predictiveness; when it sees items appearing together, it has no way to capture this behavior except by assuming these items are actually one word. Thus, the ideal unigram learner's additional knowledge causes it to commit more undersegmentation errors. The bigram learner, on the other hand, does not have this problem – and indeed we do not see the DMCMC bigram learner out-performing the ideal bigram learner." - Pearl et al. 2011

Where to go from here: exploring acquirability
Explore the robustness of constrained-learner performance across different corpora and different languages: is it just for this language that we see these effects?
In progress: Spanish child-directed speech to children a year old or younger (a portion of the Jackson-Thal corpus (Jackson-Thal 1994) containing ~3600 utterances).

Investigate other implementations of constrained learners:
• Imperfect memory: assume lexicon precision decays over time; assume the calculation of probabilities is noisy.
• Knowledge representation (in progress): assume syllables are the relevant unit of representation (Jusczyk et al. 1999); assume stressed and unstressed syllables are tracked separately (Curtin et al. 2005, Pelucchi et al. 2009); assume infants have certain phonotactic knowledge beforehand and/or are acquiring it at the same time segmentation happens (Blanchard et al. 2010); assume acoustic-level information is the right level of granularity (McInnes & Goldwater 2011).

Results: main points
About infants' tendency to segment words at utterance edges better:
"Seidl and Johnson (2006) review a number of proposed explanations of why utterance edges are easier, including perceptual/prosodic salience, cognitive biases to attend more to edges (including recency effects), or the pauses at utterance boundaries. In our results, we find that all of the models find utterance-initial words easier to segment, and most of them also find utterance-final words easier. Since none of the algorithms include models of perceptual salience, our results suggest that this explanation is probably unnecessary to account for the edge effect, especially for utterance-initial words. Rather, it seems simpler to assume that the pauses at utterance boundaries make segmentation easier by eliminating the ambiguity of one of the two boundaries of the word." - Pearl et al. 2011