Psych 215L: Language Acquisition
Lecture 7
Word-Meaning Mapping

Xu & Tenenbaum (2007)
“I love my daxes.” Dax = that specific toy, teddy bear, stuffed animal, toy, object, …?

Previous approaches to word-learning:
Hypothesis elimination: a hypothesis space of potential concepts for a word exists, and the learner eliminates incorrect hypotheses based on input (Pinker 1984, 1989; Berwick 1986; Siskind 1996)
Associative learning: connectionist networks (Colunga & Smith, 2005; Gasser & Smith, 1998; Regier, 1996, 2005; L. B. Smith, 2000) or similarity matching to examples (Landau, Smith, & Jones, 1988; Roy & Pentland, 2004) – no explicit hypothesis space, per se

5 things a word-learning model should do:
(1) Word meanings are learned from very few examples
(2) Word meanings are inferred from only positive examples
(3) The target of word-learning is a system of overlapping concepts
(4) Inferences about word meaning based on examples should be graded, rather than absolute
(5) Inferences about word meanings can be strongly affected by reasoning about how the observed examples were generated

Xu & Tenenbaum (2007): Ruling out unnatural extensions
dog = dog parts, front half of dog, dog spots, all spotted things, all running things, all dogs + one cat?
Approach to word learning based on rational statistical inference (ideal learner): hypotheses about word meanings are evaluated by Bayesian probability theory.
Claim: “The interaction of Bayesian inference principles with appropriately
structured hypothesis spaces can explain the core phenomena listed
above. Learners can rationally infer the meanings of words that label
multiple overlapping concepts, from just a few positive examples.
Inferences from more ambiguous patterns of data lead to more graded and
uncertain patterns of generalization. Pragmatic inferences based on
communicative context affect generalizations about word meanings by
changing the learner’s probabilistic models.”

The issue of overlapping hypotheses
Traditional solutions:
Whole Object constraint: First guess is that a label refers to a whole object, rather than part of the object (dog parts, front half of dog) or an attribute of the object (dog spots).
Taxonomic constraint (Markman 1989): First guess about an unknown label is that it applies to the taxonomic class (ex: dog, instead of all running things or all dogs + one cat).

The issue of overlapping hypotheses
Object-kind labels: dog vs. dalmatian vs. animal. Multiple properties potentially relevant: shape vs. material.
Issue: clearly overlapping labels – a dalmatian is a dog and an animal, but
not all animals are dogs, and not all dogs are dalmatians. Which level
does each label apply to? Issue: clearly overlapping labels – which aspect is being labeled?

Traditional solutions
For object-kind labeling:
Markman (1989): learners prefer the “basic” level of categorization (dog over dalmatian or animal).
Remaining issue: How do learners figure out non-basic-level labels? That is, how do they overcome this bias? Since concepts are overlapping, it’s not enough to learn that dog can label a dog. Learners must somehow figure out that animal is also a fine label for a dog (and a cat and a bird).

A Bayesian solution
Suspicious Coincidences:
Situation: several examples in view, each labeled “fep” (fep, fep, fep, fep)
Suspicious: Why is no other animal or other kind of dog a fep, if fep can really label any animal or any kind of dog?
Bayesian reasoning: We would expect to see other animals (or dogs) labeled as fep if fep really could mean those things. If it continues not to be used this way, this is growing support that fep cannot mean those things.

The Bayesian Framework
Task: Learn the concept C associated with a word.
X = {x1, x2, …, xn} = n observed examples of C
Hypothesis h = a pointer to some subset of entities in the domain that could be the extension of C
Standard Bayesian inference: the posterior probability of hypothesis h given examples X is related to the likelihood of h producing X (p(X | h)) and the prior probability of h (p(h)), normalized against the cumulative likelihood of producing X under any hypothesis (p(X)):

p(h | X) = p(X | h) p(h) / p(X)
Size Principle: Suspicious coincidences
This has to do with the expectation of the data points that should be encountered in the input.

If the more-general generalization (dog) is correct, the learner should encounter some data that can only be accounted for by the more-general generalization (like beagles or poodles). This data would be incompatible with the less-general generalization (dalmatian).

If the learner keeps not encountering data compatible only with the more-general generalization, the less-general generalization becomes more and more likely to be the generalization responsible for the language data encountered.

Another way to think about it: the probability of generating a data point. The likelihood that a given data point (like dalmatian1) was generated by the subset is, by definition, higher than the likelihood that it would be generated by the superset. So the subset has a higher probability of having produced this data point, and it gets favored (gains some probability) each time such a data point is encountered.

Xu & Tenenbaum (2007): Expt 1
Subjects: Adults
Task, part 1: novel word learning (“This is a blicket”)
Task, part 2: generalization (rate how much like a “blicket” some new object is)
Results: Clear differentiation between being shown 1 example vs. being shown 3 examples. When shown 3 examples, adults restricted their hypotheses to the most conservative meaning consistent with the examples. The 1-example scenario was most like being shown 3 basic-level examples, but with more uncertainty.

Xu & Tenenbaum (2007): Expt 2
Subjects: 3- and 4-year-old children
Task, part 1: novel word learning (“This is a blick/fep/dax”)
Task, part 2: generalization (asked to help Mr. Frog identify only things that are “blicks”/“feps”/“daxes” from a set of new objects)
Results: Qualitatively similar behavior to adults, though distinctions are not quite as sharp (possibly due to experimental design factors: dalmatians too interesting, peppers too boring). 1-example learning shows graded judgment, but not much of a basic-level bias.

Xu & Tenenbaum (2007): Expt 3
Subjects: 3- and 4-year-old children
Task, part 1: novel word learning (“This is a blick/fep/dax”) with objects of
more equal salience to children (dalmatians replaced by terriers, and
peppers replaced by chili peppers). Also, single example labeled 3 times
so the same number of labeling events occurred across conditions.
Task, part 2: generalization (asked to tell Mr. Frog whether something was a “blick”/“fep”/“dax” from a set of new objects)

Xu & Tenenbaum (2007): Expt 3
Results: Again, qualitatively similar behavior to adults, though distinctions are not quite as sharp – but still much sharper than before. 1-example learning shows graded judgment, but still not much of a basic-level bias. Surprising tendency not to completely allow superordinate labeling when given superordinate examples.

Xu & Tenenbaum (2007): Main findings
Number of examples matters:
“We found that word learning displays the characteristics of a
statistical inference, with both adult and child learners becoming
more accurate and more confident in their generalizations as the
number of examples increased…Both adult and child learners
appear to be sensitive to suspicious coincidences in how the
examples given for a novel word appear to cluster in a
taxonomy of candidate categories to be named.”

Learning non-basic-level labels may not be so hard:
“When given multiple examples, preschool children are able to
learn words that refer to different levels of the taxonomic
hierarchy, at least within the superordinate categories of animal,
vehicle, and vegetable. Special linguistic cues or negative
examples are not necessary for learning these words.”

It’s not just about the number of labeling events:
“We found evidence that preschool children keep track of the
number of instances labeled and not simply the number of co-occurrences between object percepts and labels. Word learning
appears to be fundamentally a statistical inference, but unlike
standard associative models, the statistics are computed over
an ontology of objects and classes, rather than over surface
perceptual features.”

Some Caveats
This assumes children’s hypothesis space is the same as adults’ (nested labels at the superordinate, basic, and subordinate levels).

The basic-level bias isn’t something children seem to have:
“Adults showed much greater basic-level generalization than did children…a basic-level bias may not be part of the foundations
for word learning. Rather, such a bias may develop as children
learn more about general patterns of word meanings and how
words tend to be used. Further research using a broader range
of categories in the same experimental paradigm developed
here will be necessary to establish a good case for this
developmental proposal.”

Looking at children’s less-than-perfect ability to pick out the superordinate label when given superordinate examples:
(1) Children have a different tree-structured hypothesis space than
adults – superordinate space not quite the same.
(2) Children’s hypothesis space may vary more from child to child.
(3) Children might also need to acquire deeper theoretical knowledge
about superordinate categories (e.g., biologically relevant facts, such
as all animals breathe and grow—part of the intension of the word)
before these categories can become stable hypotheses for
generalizing word meanings.

Basic-level bias in children’s vocabularies?
“Several factors may be important in explaining the time lag between acquiring basic-level labels and acquiring subordinate- and superordinate-level labels. First, subordinate- and superordinate-level labels may require multiple examples. If each example is labeled on different occasions and spread out in time, children may forget the examples over time. Second, subordinate- and superordinate-level category labels are used much less frequently in adult speech, and so the relevant examples are harder to come by. Middle-class American parents tend to point to objects and label them with basic-level terms. Last, superordinates are often used to
refer to collections (Markman, 1989), and so children may be misled
by the input in interpreting these words. In our studies, we have
presented children with a simplified learning situation in order to
uncover the underlying inferential competence that guides them in—
but is not exclusively responsible for—real-world performance.”

Bayesian model fit to experimental data
Hypothesis space: constructed from adult similarity judgments in experiment 1.
Likelihood: Approximate p(X | h) by the height of the cluster, which represents object distinctiveness (how dissimilar the objects are to the neighboring cluster).
Prior: Approximate p(h) as a preference for cluster distinctiveness (prefer the R cluster over P [which is not too much lower than R] or Z [which is not too much lower than AA]) – based on cluster branch height.

Balance between prior
and likelihood
(conceptually natural
hypotheses win out):
prefer R (~dog) over P
(~dog with white fur color)
because R is more
distinctive even though P
is smaller (and so has
higher probability of
generating a data point).

Distinctiveness may be high for basic-level
categories (like R), but
also for superordinate
categories (like JJ, HH,
and EE).

Model with no special basic-level bias vs. model with a special basic-level bias (the prior for basic-level hypotheses is replaced with β times its value in the original model).
The no-bias version is a better match for children’s subordinate data and 1-example data. The bias version is a better match for adults’ subordinate data and 1-example data.

Other models’ fit to experimental data
Weak Bayes: No size principle (the prior is the only thing that matters).
Likelihood is 1 if examples X are consistent with h, 0 otherwise. Poor fit – no benefit from 3 examples vs. 1 example; too much basic-level generalization.
MAP Bayes: Base probability on the single most probable hypothesis (rather than averaging over all consistent hypotheses). Poor fit – no benefit from 3 examples vs. 1 example; too much basic-level generalization; no gradedness.
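The contrast between full Bayesian averaging and the MAP alternative can be sketched as follows. This is a toy illustration, not the paper’s fitted model: the extension sizes and priors are invented, the likelihood assumes the size principle, and the β basic-level bias is folded in as an optional prior multiplier.

```python
# Toy nested hypotheses (invented sizes) for a word taught with
# subordinate (dalmatian) examples.
SIZES = {"dalmatian": 10, "dog": 100, "animal": 1000}

def posterior(n, beta=1.0):
    """Posterior after n subordinate examples, with the size-principle
    likelihood (1/|h|)**n. beta > 1 multiplies the basic-level ("dog")
    prior, as in the basic-level-bias variant of the model."""
    prior = {h: (beta if h == "dog" else 1.0) for h in SIZES}
    z0 = sum(prior.values())
    scores = {h: (prior[h] / z0) * (1.0 / s) ** n for h, s in SIZES.items()}
    z = sum(scores.values())
    return {h: v / z for h, v in scores.items()}

def p_generalize(levels, n, beta=1.0):
    """Full Bayes: graded generalization to a test object contained in
    the extensions named by `levels`, averaging over hypotheses."""
    post = posterior(n, beta)
    return sum(post[h] for h in levels)

def map_generalize(levels, n, beta=1.0):
    """MAP Bayes: commit to the single most probable hypothesis --
    generalization becomes all-or-nothing, with no gradedness."""
    post = posterior(n, beta)
    best = max(post, key=post.get)
    return 1.0 if best in levels else 0.0

# A non-dalmatian dog falls only under the "dog" and "animal" extensions:
graded = p_generalize({"dog", "animal"}, n=1)   # strictly between 0 and 1
hard = map_generalize({"dog", "animal"}, n=1)   # all-or-nothing
```

Averaging over hypotheses yields the graded judgments seen in the 1-example condition; committing to the single best hypothesis cannot.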
Other models with poor fit.

Extending the Bayesian Framework
• Learning in the case when the hypothesis space doesn’t necessarily
form a nested hierarchy: object vs. material concepts
• Prasada, Ferenz, and Haskell (2002):
(1) Given a single regularly shaped entity, people tended to choose
an object category.
(2) Given a single irregularly shaped entity, people tended to choose
the substance interpretation.
(3) When people were shown multiple essentially identical entities,
each with the same complex irregular shape and novel material, their
preference for labeling an entity in this set switched from a substance
interpretation to an object interpretation.
• Using negative evidence: “That’s a dalmatian. It’s a kind of dog.”
dalmatian = subset of dog
A Bayesian learner can treat this as conclusive evidence that dalmatian is a subset of dog, and give 0 likelihood to any hypothesis where dalmatian is not contained within the set of dogs.
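A hedged sketch of that constraint follows; the hypothesis names and toy extensions are invented for illustration.

```python
# "That's a dalmatian. It's a kind of dog." treated as hard evidence:
# any candidate extension for "dalmatian" that is not contained within
# the extension of "dog" gets zero likelihood and drops out. Toy sets.
DOG_EXTENSION = {"dalmatian1", "beagle1", "poodle1"}

candidates = {
    "subset-of-dog": {"dalmatian1"},                # respects the cue
    "dalmatians-and-cats": {"dalmatian1", "cat1"},  # violates the cue
}

def apply_subset_cue(hypotheses, superset):
    """Keep only hypotheses whose extension lies within `superset`;
    the rest are assigned zero probability (removed)."""
    return {name: ext for name, ext in hypotheses.items()
            if ext <= superset}  # <= is the subset test for sets

surviving = apply_subset_cue(candidates, DOG_EXTENSION)
```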
• Bayesian explanation of Prasada, Ferenz, and Haskell (2002):
(1) Each object category is organized around a prototypical shape.
(2) Prior for object categories with regular shapes > prior for substance categories > prior for object categories with irregular shapes.
(3) There are more conceptually distinct shapes that support object categories than there are material properties that support substance categories. Object-kind hypotheses are smaller than substance hypotheses, so object-kinds have higher likelihood than substance-kinds. Seeing several examples of novel objects with the same name is a greater suspicious coincidence.
• Being sensitive to the source of the input:
Random sampling (“teacher-driven”) vs. non-random sampling (“learner-driven”) of subordinates
Xu & Tenenbaum (2007, Developmental Science): Adults and preschoolers in the teacher-driven condition made the subordinate generalization, while those in the learner-driven condition allowed basic-level generalization.
Bayesian model explanation: The likelihood no longer reflects the size principle (it is not a suspicious coincidence if you had control of which examples were picked), just similarity – equivalent to the Weak Bayes model.

Extending the Bayesian Framework
• Incorporating the effects of previously learned words:
Lexical contrast: the meanings of all words must somehow differ.
Bayesian model implementation: the prior of hypotheses that overlap with known extensions is lower. [Figure: hypotheses overlapping a known concept receive a lower prior; non-overlapping hypotheses a higher prior.]

Open questions
Early word-learning appears to be slow & laborious – it shouldn’t be, in a Bayesian learner.
Potential explanations:
(1) Bayesian inference capacity isn’t online in early word-learners.
(2) Hypothesis spaces of young children may not be sufficiently constrained to make strong inferences.
(3) Bayesian inference can’t yet apply to word-learning properly (though it could apply to other learning).
(4) Children’s ability to remember words and/or their referents isn’t stable (input is effectively unreliable).

Open questions
What about concept formation (when the concept isn’t available yet, a Bayesian learner can’t map a label to it)?
Ex: number words like “four”
Potential explanation:
“…perhaps the observation of new words that cannot be mapped easily onto the current hypothesis space of candidate concepts somehow triggers the formation of new concepts, more suitable as hypotheses for the meanings of these words.”
This note was uploaded on 12/12/2011 for the course PSYCH 215l taught by Professor Pearl during the Fall '11 term at UC Irvine.