Psych 215L: Language Acquisition
Lecture 6
Word Segmentation

Computational Problem
Divide spoken speech into words
húwzəfréjdəvðəbɪ́gbǽdwə́lf
húwz əfréjd əv ðə bɪ́g bǽd wə́lf
who's afraid of the big bad wolf

Word Boundaries or Lexicon Items?
Identify word boundaries
• Gambell & Yang (2006): Identify boundaries with USC + TrProb, identify boundaries with USC + Algebraic learning (though also identify lexical items with algebraic learning)
• Fleck (2008): Identify boundaries with phonotactic constraints
• Hewlett & Cohen (2009): Identify boundaries with phonotactic constraints
Identify/optimize lexical items
• Goldwater et al. (2009): bias for shorter & fewer lexicon items (ideal learner)
• Johnson & Goldwater (2009): bias for shorter & fewer lexicon items + phonotactic constraints (ideal learner)
• Pearl et al. (2011): bias for shorter & fewer lexicon items (constrained learner)
• Blanchard et al. (2010): bias for lexicon items obeying phonotactic constraints (constrained learner)
• McInnes & Goldwater (2011): extract from acoustic data (constrained learner)

Looking for lexicons?
Frank et al. (2010 Cognition): examined the predictions of several word segmentation models on human experimental data. The Bayesian model (which explicitly optimized a lexicon) usually was a better fit.
The exception: All models failed to predict human difficulty when there were more lexical items, suggesting that memory limitations are important to include.
Frank et al. (2010 CogSci proceedings): more support that (adult) human learners look to optimize lexicons

Modeling learnability vs. modeling acquirability
Modeling learnability: what is possible to learn
• "ideal", "rational", or "computational-level" learners
• "Can it be learned at all by a simulated learner?"
Modeling acquirability (Johnson 2004): what is possible to learn if you're human
• more "realistic" or "cognitively inspired" learners
• "Can it be learned by a simulated learner that is constrained in the ways humans are constrained?"

Language acquisition computation as induction
Input (specific linguistic observations) → Abstract internal representation/generalization → Output (specific linguistic productions)

Probabilistic models for induction
• Typically an ideal observer approach asks what the
optimal solution to the induction problem is, given
particular assumptions about knowledge representation
and available information.
• Constrained learners implement ideal learners in more
cognitively plausible ways.
– How might limitations on memory and processing affect
learning?

Word segmentation
• One of the first problems infants must solve when learning language.
• Infants make use of many different cues.
– Phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation (language-dependent), and statistical regularities in syllable sequences.
• Statistics may provide initial bootstrapping.
– Used very early (Thiessen & Saffran, 2003)
– Language-independent, so doesn't require children to know some words already

Bayesian inference: model goals
• The Bayesian learner seeks to identify an explanatory linguistic hypothesis that
– accounts for the observed data.
– conforms to prior expectations.
• Ideal learner: Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.
• Constrained learner: Use same probabilistic model, but algorithm reflects how humans might implement the computation.

Bayesian segmentation
• In the domain of segmentation, we have:
– Data: unsegmented corpus (transcriptions)
– Hypotheses: sequences of word tokens
• Posterior: P(h|d) ∝ P(d|h) P(h)
– Likelihood P(d|h) = 1 if concatenating words forms corpus, = 0 otherwise.
– Prior P(h) encodes assumptions or biases in the learner.
• Optimal solution is the segmentation with highest probability.
Corpus: "lookatthedoggie"
P(d|h) = 1:
  loo k atth ed oggie
  lookat thedoggie
  look at the doggie
P(d|h) = 0:
  i like penguins
  look at thekitty
  abc

An ideal Bayesian learner for word segmentation
Model considers hypothesis space of segmentations, preferring those where
• The lexicon is relatively small.
• Words are relatively short.
The learner has a perfect memory for the data
• The entire corpus is available in memory.
Note: only counts of lexicon items are required to compute highest probability segmentation.
Assumption: phonemes are relevant unit of representation
Goldwater, Griffiths, and Johnson (2007, 2009)

Investigating learner assumptions
• If a learner assumes that words are independent units, what is learned from realistic data? [unigram model]
• What if the learner assumes that words are units that help predict other units? [bigram model]
Approach of Goldwater, Griffiths, & Johnson (2007, 2009): use a Bayesian ideal observer to examine the consequences of making these different assumptions.

Generative process: Unigram model
• Choose next word in corpus using a Dirichlet Process (DP) with concentration parameter α and base distribution P0:
\[ P(w_i = w \mid w_1 \ldots w_{i-1}) = \frac{n_w + \alpha P_0(w)}{i - 1 + \alpha} \]
• Base distribution P0 is the probability of generating a new word:
\[ P_0(w_i = x_1 \ldots x_m) = \prod_{j=1}^{m} P(x_j) \]

Walkthrough: Unigram model
Assume word w_i is generated as follows:
1. Is w_i a novel lexical item?
\[ P(\text{yes}) = \frac{\alpha}{n + \alpha} \qquad P(\text{no}) = \frac{n}{n + \alpha} \]
   Fewer word types = Higher probability
2. If novel, generate phonemic form x1…xm:
\[ P_0(w_i = x_1 \ldots x_m) = \prod_{j=1}^{m} P(x_j) \]
   Shorter words = Higher probability
   Otherwise, choose lexical identity of w_i from previously occurring words:
\[ P(w_i = w) = \frac{n_w}{n} \]
   Power law = Higher probability for more frequent words

Generative process: Bigram model
• Bigram model is a hierarchical Dirichlet process (Teh et al., 2005):
\[ P(w_i = w \mid w_{i-1} = w', w_1 \ldots w_{i-2}) = \frac{n_{(w',w)} + \beta P_1(w)}{n_{w'} + \beta} \]
\[ P_1(w_i = w \mid w_1 \ldots w_{i-1}) = \frac{b_w + \gamma P_0(w)}{b + \gamma} \]
• Choose word based on previous word's identity and all previous words (base distribution P1, concentration parameter β).
• P1 is the base distribution for generating novel bigrams. (γ stands in for P1's concentration parameter, whose symbol is not legible in the source.)

Walkthrough: Bigram model
Assume word w_i is generated as follows:
1. Is (w_{i-1}, w_i) a novel bigram?
\[ P(\text{yes}) = \frac{\beta}{n_{w_{i-1}} + \beta} \qquad P(\text{no}) = \frac{n_{w_{i-1}}}{n_{w_{i-1}} + \beta} \]
2. If novel, generate w_i using unigram model (almost).
   Otherwise, choose lexical identity of w_i from words previously occurring after w_{i-1}:
\[ P(w_i = w \mid w_{i-1} = w') = \frac{n_{(w',w)}}{n_{w'}} \]

Search through hypothesis space of segmentations
Model defines a distribution over hypotheses. Can use Gibbs sampling to find a good hypothesis.
• Iterative procedure produces samples from the posterior distribution of hypotheses.
[Figure: posterior P(h|d) plotted over hypotheses h]
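
As a rough illustration of the unigram generative process above, the sketch below computes the predictive probability (n_w + α P0(w)) / (i − 1 + α) for a candidate next word. The function names, the use of letters as stand-ins for phonemes, and the uniform phoneme distribution are assumptions made for illustration; none of them come from the original model code.

```python
from collections import Counter


def base_prob(word, phoneme_probs):
    """P0(w): product of the probabilities of the word's phonemes.

    Longer words multiply in more factors below 1, so shorter words score higher.
    """
    p = 1.0
    for phoneme in word:
        p *= phoneme_probs[phoneme]
    return p


def unigram_predictive(word, previous_words, alpha, phoneme_probs):
    """P(w_i = w | w_1..w_{i-1}) = (n_w + alpha * P0(w)) / (i - 1 + alpha)."""
    counts = Counter(previous_words)
    n_w = counts[word]              # times this word has been generated so far
    n_prev = len(previous_words)    # this is i - 1
    return (n_w + alpha * base_prob(word, phoneme_probs)) / (n_prev + alpha)


# Hypothetical toy setup: letters stand in for phonemes, each with probability 0.1.
phoneme_probs = {ch: 0.1 for ch in "thedogca"}
history = ["the", "dog", "the"]

p_seen = unigram_predictive("the", history, 1.0, phoneme_probs)   # frequent word
p_novel = unigram_predictive("cat", history, 1.0, phoneme_probs)  # unseen word
```

With this toy history, a word already in the lexicon receives far more probability than a novel one, and among novel words the shorter one is preferred, which mirrors the "fewer word types" and "shorter words" biases noted in the walkthrough.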
Gibbs sampling
• Compares pairs of hypotheses differing by a single word boundary:
  whats.that             whats.that
  the.doggie             the.dog.gie
  yeah                   yeah
  wheres.the.doggie      wheres.the.doggie
  …                      …
• Calculate the probabilities of the words that differ, given current analysis of all other words in the corpus.
• Sample a hypothesis according to the ratio of probabilities.

Corpus: child-directed speech samples
• Bernstein-Ratner corpus:
– 9790 utterances of phonemically transcribed child-directed speech (19-23 months), 33399 tokens and 1321 unique types.
– Average utterance length: 3.4 words
– Average word length: 2.9 phonemes
• Example input:
  yuwanttusiD6bUk         ≈ youwanttoseethebook
  lUkD*z6b7wIThIzh&t      ≈ looktheresaboywithhishat
  &nd6dOgi                ≈ andadoggie
  yuwanttulUk&tDIs        ≈ youwanttolookatthis
  ...

Results: Ideal learner (Standard MCMC)
Precision: #correct / #found, "How many of what I found are right?"
Recall: #correct / #true, "How many did I find that I should have found?"

                   Word Tokens     Boundaries      Lexicon
                   Prec    Rec     Prec    Rec     Prec    Rec
Ideal (unigram)    61.7    47.1    92.7    61.6    55.1    66.0
Ideal (bigram)     74.6    68.4    90.4    79.8    63.3    62.6

Correct segmentation: "look at the doggie. look at the kitty."
Best guess of learner: "lookat the doggie. lookat thekitty."
Word Token Prec = 2/5 (0.4), Word Token Rec = 2/8 (0.25)
Boundary Prec = 3/3 (1.0), Boundary Rec = 3/6 (0.5)
Lexicon Prec = 2/4 (0.5), Lexicon Rec = 2/5 (0.4)

The assumption that words predict other words is good: bigram model generally has superior performance
Note: Training set was used as test set
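
The worked example above ("lookat the doggie. lookat thekitty.") can be checked mechanically. The sketch below is one possible way to compute token, boundary, and lexicon precision and recall, plus the F-score that the later result slides define as 2 * Prec * Rec / (Prec + Rec). The function names are hypothetical, letters stand in for phonemes, and it assumes that the gold and guessed utterances contain the same characters once spaces are removed.

```python
def word_spans(utterance):
    """Return the set of (start, end) character spans of the words, plus total length."""
    spans, pos = set(), 0
    for word in utterance.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans, pos


def prf(correct, found, true):
    """Precision (#correct / #found), recall (#correct / #true), and F-score."""
    prec = correct / found if found else 0.0
    rec = correct / true if true else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f


def evaluate(gold_utts, guess_utts):
    tok = [0, 0, 0]  # correct, found, true
    bnd = [0, 0, 0]
    for gold, guess in zip(gold_utts, guess_utts):
        gold_spans, length = word_spans(gold)
        guess_spans, _ = word_spans(guess)
        # A token is correct only if both of its edges match the gold token.
        tok[0] += len(gold_spans & guess_spans)
        tok[1] += len(guess_spans)
        tok[2] += len(gold_spans)
        # Only utterance-internal boundaries count; utterance edges are given.
        gold_b = {end for _, end in gold_spans if end < length}
        guess_b = {end for _, end in guess_spans if end < length}
        bnd[0] += len(gold_b & guess_b)
        bnd[1] += len(guess_b)
        bnd[2] += len(gold_b)
    gold_lex = {w for u in gold_utts for w in u.split()}
    guess_lex = {w for u in guess_utts for w in u.split()}
    lex = (len(gold_lex & guess_lex), len(guess_lex), len(gold_lex))
    return {"tokens": prf(*tok), "boundaries": prf(*bnd), "lexicon": prf(*lex)}


gold = ["look at the doggie", "look at the kitty"]
guess = ["lookat the doggie", "lookat thekitty"]
scores = evaluate(gold, guess)
```

Here `scores["tokens"]`, `scores["boundaries"]`, and `scores["lexicon"]` each hold a (precision, recall, F) triple, and the precision and recall values match the fractions listed in the example.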
Both models tend to undersegment, though the bigram model does so less (boundary precision > boundary recall)

Results: Ideal learner sample segmentations
Unigram model
  youwant to see thebook
  look theres aboy with his hat
  and adoggie
  you wantto lookatthis
  lookatthis
  havea drink
  okay now
  whatsthis
  whatsthat
  whatisit
  look canyou take itout
  ...
Bigram model
  you want to see the book
  look theres a boy with his hat
  and a doggie
  you want to lookat this
  lookat this
  have a drink
  okay now
  whats this
  whats that
  whatis it
  look canyou take it out
  ...

How about constrained learners?
The constrained learners use the same probabilistic model, but process the data incrementally (one utterance at a time), rather than all at once.
• Dynamic Programming with Maximization (DPM)
• Dynamic Programming with Sampling (DPS)
• Decayed Markov Chain Monte Carlo (DMCMC)

Considering human limitations
What if the only limitation is that the learner must process utterances one at a time?

Dynamic Programming: Maximization
For each utterance:
• Use dynamic programming to compute highest probability segmentation.
• Add counts of segmented words to lexicon.
Example: you want to see the book
  0.33  yu want tusi D6bUk
  0.21  yu wanttusi D6bUk
  0.15  yuwant tusi D6 bUk
  …
Algorithm used by Brent (1999), with different model.

Considering human limitations
What if humans don't always choose the most probable hypothesis, but instead sample among the different hypotheses available?

Dynamic Programming: Sampling
For each utterance:
• Use dynamic programming to compute probabilities of all segmentations, given the current lexicon.
• Sample a segmentation.
• Add counts of segmented words to lexicon.
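
The dynamic-programming step shared by the two learners above can be sketched as follows for the maximization case (DPM): find the highest-probability segmentation of one utterance under the current lexicon, then add the resulting words to the counts. This is a minimal sketch under stated assumptions, not the authors' implementation: the scoring function is a simplified unigram score with a uniform base distribution over 26 symbols, letters stand in for phonemes, and all function names are hypothetical. The sampling variant (DPS) would draw a segmentation in proportion to its probability instead of taking the maximum.

```python
import math
from collections import Counter


def make_word_prob(counts, alpha=1.0, n_symbols=26):
    """Hypothetical unigram score: (n_w + alpha * P0(w)) / (n + alpha)."""
    total = sum(counts.values())

    def word_prob(word):
        p0 = (1.0 / n_symbols) ** len(word)  # uniform base distribution (assumption)
        return (counts[word] + alpha * p0) / (total + alpha)

    return word_prob


def best_segmentation(utterance, word_prob):
    """Highest-probability segmentation of an unsegmented string (Viterbi-style DP)."""
    n = len(utterance)
    best_score = [-math.inf] * (n + 1)  # best log probability of each prefix
    back = [0] * (n + 1)                # start index of the last word in that prefix
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            score = best_score[start] + math.log(word_prob(utterance[start:end]))
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start
    # Follow the back pointers from the end of the utterance.
    words, end = [], n
    while end > 0:
        words.append(utterance[back[end]:end])
        end = back[end]
    return words[::-1]


def dpm_step(utterance, counts):
    """Process one utterance: segment it, then add the words to the lexicon counts."""
    words = best_segmentation(utterance, make_word_prob(counts))
    counts.update(words)
    return words


# Hypothetical lexicon counts accumulated from earlier utterances.
counts = Counter({"look": 5, "at": 5, "the": 10, "doggie": 3})
segmented = dpm_step("lookatthedoggie", counts)
```

Because each utterance is segmented once and never revisited, an early mis-segmentation stays in the lexicon counts, which is one way the incremental constraint differs from the ideal learner.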
Considering human limitations
What if humans are more likely to pay attention to potential word boundaries that they have heard more recently (decaying memory = recency effect)?

Decayed Markov Chain Monte Carlo
For each utterance:
• Probabilistically sample s boundaries from all utterances encountered so far.
• Prob(sample b) ∝ b_a^{-d}, where b_a is the number of potential boundary locations between b and the end of the current utterance, and d is the decay rate (Marthi et al. 2002).
• Update lexicon after every sample.
[Figure: probability of sampling each boundary (s samples) across the boundaries of Utterance 1, "you want to see the book" (yuwant tusi D6 bUk)]
[Figure: probability of sampling each boundary (s samples) across the boundaries of Utterance 1 and Utterance 2, "you want to see the book" / "what's this" (yuwant tu si D6 bUk / wAtsDIs)]

Decayed Markov Chain Monte Carlo
Decay rates tested: 2, 1.5, 1, 0.75, 0.5, 0.25, 0.125
Probability of sampling within current utterance:
  d = 2       .942
  d = 1.5     .772
  d = 1       .323
  d = 0.75    .125
  d = 0.5     .036
  d = 0.25    .009
  d = 0.125   .004
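
To make the decay function concrete, the sketch below turns Prob(sample b) ∝ b_a^{-d} into a normalized distribution over potential boundaries and draws s samples from it. It assumes that b_a starts at 1 for the most recent potential boundary (so the weight is always finite); the function names are hypothetical, and the lexicon update that follows each sample in the actual learner is left out.

```python
import random


def boundary_sampling_probs(num_boundaries, d):
    """Normalized sampling probabilities; index 0 is the most recent potential boundary.

    Assumption: b_a = 1 for the most recent boundary, 2 for the one before it, etc.
    """
    weights = [(k + 1) ** (-d) for k in range(num_boundaries)]
    total = sum(weights)
    return [w / total for w in weights]


def prob_within_current(num_current, num_total, d):
    """Probability mass on the boundaries that lie inside the current utterance."""
    return sum(boundary_sampling_probs(num_total, d)[:num_current])


def sample_boundaries(num_boundaries, d, s, rng=random):
    """Draw s boundary indices (with replacement) according to the decay function."""
    probs = boundary_sampling_probs(num_boundaries, d)
    return rng.choices(range(num_boundaries), weights=probs, k=s)


probs = boundary_sampling_probs(10, 1.0)
samples = sample_boundaries(10, 1.0, 50, random.Random(0))
```

Larger values of d concentrate sampling on the current utterance and smaller values spread it over earlier utterances, which is the qualitative pattern in the table of decay rates above. The exact figures in that table depend on the corpus and are not reproduced by this toy setup.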
Results: unigrams vs. bigrams
F = 2 * Prec * Rec / (Prec + Rec)
Precision: #correct / #found
Recall: #correct / #true
Results averaged over 5 randomly generated test sets (~900 utterances) that were separate from the training sets (~8800 utterances), all generated from the Bernstein-Ratner corpus
DMCMC Unigram: d=1, s=20000
DMCMC Bigram: d=0.25, s=20000
Note: s=20000 means DMCMC learner samples 89% less often than the Ideal learner.

Like the Ideal learner, the DPM & DMCMC bigram learners perform better than the unigram learner, though improvement is not as great as in the Ideal learner. The bigram assumption is helpful.
However, the DPS bigram learner performs worse than the unigram learner. For DPS, the bigram assumption is not helpful.
Unigram comparison: DPM, DMCMC > Ideal, DPS performance
Interesting: Constrained learners outperforming unconstrained learner when words are believed to be independent units.
Bigram comparison: Ideal, DMCMC > DPM > DPS performance
Interesting: Constrained learner performing equivalently to unconstrained learner when words are believed to be predictive units.

Results: unigrams vs. bigrams for the lexicon
Lexicon = a seed pool of words for children to use to figure out language-dependent word segmentation strategies.
Like the Ideal learner, the DPM bigram learner yields a more reliable lexicon than the unigram learner.
However, the DPS and DMCMC bigram learners yield less reliable lexicons than the unigram learners.
Unigram comparison: DMCMC > Ideal > DPM > DPS performance
Interesting: Constrained learner outperforming unconstrained learner when words are believed to be independent units.
Bigram comparison: Ideal > DPM > DMCMC > DPS performance
More expected: Unconstrained learner outperforming constrained learners when words are believed to be predictive units (though not by a lot).

Results: under vs. oversegmentation
Undersegmentation: boundary precision > boundary recall
Oversegmentation: boundary precision < boundary recall
The DMCMC unigram learner, like the Ideal learner, tends to undersegment. Based on Peters (1983), children may have a tendency to undersegment, too.
All other learners, however, tend to oversegment.

Results: Exploring different performance measures
Some positions in the utterance are more easily segmented by infants, such as the first and last word of the utterance (Seidl & Johnson 2006).
If models are reasonable reflections of human behavior, their performance on the first and last words should be better than their performance over the entire utterance. Moreover, they should perform equally on the first and last words in order to match infant behavior.

Results: first/last vs. whole utterance
DPM and DMCMC learners have the desired behavior. The Ideal learner improves for both, but improves more for last words. The DPS learner only improves for first words.

Results: first/last vs. whole utterance
DPM and DPS have the desired behavior. The Ideal and DMCMC
learners only improve for the first word.

Results: main points

A better set of cognitively inspired statistical learners
While no constrained learners outperform the best ideal learner on all measures, all perform better on realistic child-directed speech data than a transitional probability learner and outperform other unsupervised word segmentation models.
Implication: Learners that optimize a lexicon may work better than learners that only look for word boundaries.

Ideal learner behavior doesn't always transfer
While assuming words are predictive units (bigram model) significantly helped the ideal learner, this assumption may not be as useful to a constrained learner (depending on how cognitive limitations are implemented).
Speculation: Some of the constrained learners are unable to successfully search the larger hypothesis space that exists for the bigram model.

Constraints on processing are not always harmful
• Decayed MCMC learner can perform well even with more than 99.9% less processing than the unconstrained ideal learner.
• Decayed MCMC unigram learner outperforms Ideal learner when both sample the same number of times, which suggests something special about the way DMCMC approximates its inference process. (This is not true for the bigram learner, though.)
• Constrained unigram learners can sometimes outperform the unconstrained unigram learner ("Less is More" Hypothesis: Newport 1990). This behavior persists when tested on a larger corpus of English child-directed speech (Pearl-Brent), suggesting it's not just a fluke of the Bernstein-Ratner corpus.

The issue turns out to be that the Ideal learner makes many more errors on frequent lexical items than the DMCMC learner.
The reason why the unigram DMCMC learner might fare better has to do with the Ideal learner's superior memory capacity and processing abilities.
The Ideal learner (because it can see everything all the time and update anything at any point) can notice that certain short items (e.g., actual words like it's and a) appear very frequently together. The only way for a unigram learner to represent this dependency is as a single lexicon item. The Ideal learner can fix its previous "errors", made earlier during learning when it thought these were two separate lexical items. The DMCMC does not have the memory and processing power to make this same mistake.

Related to Newport (1990)'s "Less is More" hypothesis: limited
processing abilities are advantageous for acquisition
"… the more limited inference process of the DMCMC learner focuses its attention only on the current frequency information and does not allow it to view the frequency of the corpus as a whole. Coupled with this learner's more limited ability to correct its initial hypotheses about lexicon items, this leads to superior segmentation performance. We note, however, that this superior performance is mainly due to the unigram learner's inability to capture word sequence predictiveness; when it sees items appearing together, it has no way to capture this behavior except by assuming these items are actually one word. Thus, the ideal unigram learner's additional knowledge causes it to commit more undersegmentation errors. The bigram learner, on the other hand, does not have this problem – and indeed we do not see the DMCMC bigram learner outperforming the ideal bigram learner." (Pearl et al. 2011)

About infants' tendencies to segment edge-words better
"Seidl and Johnson (2006) review a number of proposed explanations of why utterance edges are easier, including perceptual/prosodic salience, cognitive biases to attend more to edges (including recency effects), or the pauses at utterance boundaries. In our results, we find that all of the models find utterance-initial words easier to segment, and most of them also find utterance-final words easier. Since none of the algorithms include models of perceptual salience, our results suggest that this explanation is probably unnecessary to account for the edge effect, especially for utterance-initial words. Rather, it seems simpler to assume that the pauses at utterance boundaries make segmentation easier by eliminating the ambiguity of one of the two boundaries of the word." (Pearl et al. 2011)

Where to go from here: exploring acquirability
Explore robustness of constrained learner performance across different corpora and different languages
• Is it just for this language that we see these effects?
• In progress: Spanish to children a year or younger (portion of Jackson-Thal corpus (Jackson-Thal 1994) containing ~3600 utterances)
Investigate other implementations of constrained learners
• Imperfect memory: Assume lexicon precision decays over time, assume calculation of probabilities is noisy
• Knowledge representation (in progress): assume syllables are a relevant unit of representation (Jusczyk et al. 1999), assume stressed and unstressed syllables are tracked separately (Curtin et al. 2005, Pelucchi et al. 2009), assume infants have certain phonotactic knowledge beforehand and/or are acquiring it at the same time segmentation happens (Blanchard et al. 2010), assume acoustic level information is the right level of granularity (McInnes & Goldwater 2011)
...
This note was uploaded on 12/12/2011 for the course PSYCH 215l taught by Professor Pearl during the Fall '11 term at UC Irvine.