Lecture8-WordMeaningMapping2

Psych 215L: Language Acquisition
Lecture 8: Word-Meaning Mapping

Computational Problem
“Look! There’s a goblin!”
Goblin = ????

Smith & Yu (2008)
Learning in cases of referential ambiguity: Why?
“…not all opportunities for word learning are as uncluttered as the experimental settings in which fast-mapping has been demonstrated. In everyday contexts, there are typically many words, many potential referents, limited cues as to which words go with which referents, and rapid attentional shifts among the many entities in the scene.”
Also, “…the evidence indicates that 9-, 10-, and certainly 12-month-old infants are accumulating considerable receptive lexical knowledge …Yet many studies find that children even as old as 18 months have difficulty in making the right inferences about the intended referents of novel words…infants as young as 13 or 14 months…can link a name to an object given repeated unambiguous pairings in a single session. Overall, however, these effects are fragile with small experimental variations often leading to no learning.”

New approach: infants accrue statistical evidence across multiple trials that are individually ambiguous but can be disambiguated when the information from the trials is aggregated.

A more complicated example:
Trial 1: A = a (.5), b (.5)? B = a (.5), b (.5)?
Trial 2: C = c (.5), d (.5)? D = c (.5), d (.5)?
Trial 3: E = e (.5), f (.5)? F = e (.5), f (.5)?
Trial 4: A = g (.3), a (.3), b (.3)? G = g (.5), a (.5)?
(but wait! b isn’t present, so A = b has prob 0) A = g (.5), a (.5)?
(but wait! G wasn’t present in trial 1, A = g has prob 0) A = a, G = g

Requirements:
(1) Learner notices absence of b in Trial 4
(2) Learner remembers absence of g in Trial 1
(3) Learner registers occurrences & nonoccurrences
(4) Learner calculates correct statistics based off this information

Yu & Smith (2007): Adults seem able to accomplish this.
Smith & Yu ask: Can 12- and 14-month-old infants do this? (Relevant age for beginning word-learning.)

Smith & Yu (2008): Experiment
Six novel words obeying phonotactic probabilities of English: bosa, gasser, manu, colat, kaki, regli
Six brightly colored shapes (sadly greyscale in the paper)
Training: 30 slides with 2 objects named with two words (total time: 4 min). Example: “manu colat”
Testing: 12 trials with one word repeated 4 times and 2 objects (correct one and distracter) present. Example: “manu manu manu manu”
Results: Infants preferentially look at target over distracter, and 14-month-olds looked longer than 12-month-olds.

Smith & Yu (2008)
Interesting point: More ambiguity within trials may lead to better learning overall
“Yu and Smith (2007; Yu et al., 2007), using a task much like the infant task used here, showed that adults actually learned more word-referent pairs when the set contained 18 words and referents than when it contained only 9. This is because more words and referents mean better evidence against spurious correlations. Although much remains to be discovered about the relevant mechanisms, they clearly should help children learn from the regularities that accrue across the many ambiguous word-scene pairings that occur in everyday communication.”

Smith & Yu (2008)
This kind of statistical learning vs. transitional probability learning
“The statistical regularities to which infants must attend to learn word-referent pairings are different from those underlying the segmentation of a sequential stream in that word-referent pairings require computing co-occurrence frequencies across two streams of events (words and referents) simultaneously for many words and referents. Nonetheless, the present findings, like the earlier ones showing statistical learning of sequential probabilities, suggest that solutions to fundamental problems in learning language may be found by studying the statistical patterns in the learning environment and the statistical learning mechanisms in the learner (Newport & Aslin, 2004; Saffran et al., 1996)”

Also, Ramscar et al. (2011)
Kids vs. adults: word-meaning mapping in cases of ambiguity
“These findings…are consistent with other cross-situational approaches to word learning (Yu & Smith, 2007; Smith & Yu, 2008), which have established that in word learning tasks, both children and adults can “rapidly learn multiple word-referent pairs by accruing statistical evidence across multiple and individually ambiguous word-scene pairings”…. However, in this experiment, we explicitly tested for children’s sensitivity to the information provided by cues, rather than their co-occurrence rates…pattern of children’s responses indicates that they can and do use informativity in learning to use words…what a child learns about any given word is dependent on the information it provides about the environment, in relation to other words…it is quite clear that the adults we tested did not place the same value on informativity in their learning that the children did…”

However…
See Medina, Snedeker, Trueswell, & Gleitman (2011) for evidence against learners having multiple meaning hypotheses and cross-tabulating them via statistical procedures. (One issue: the sheer number of items in real world situations, and the different perceptual instances of the items in question.)
Instead, learners “appear to use a one-trial ‘fast-mapping’ procedure, even under conditions of referential uncertainty.”

Frank, Goodman, & Tenenbaum (2009)
Problems for learning based on the cross-situational idea that referents are present:
“…speakers often talk about objects that are not visible and about actions that are not in progress at the moment of speech (Gleitman, 1990), adding noise to the correlations between words and objects.”
Solution: appeal to external social/communication cues
“…cross-situational and associative theories often appeal to external social cues, such as eye gaze (Smith, 2000; Yu & Ballard, 2007), but these are used as markers of salience (the ‘warm glow’ of attention), rather than as evidence about internal states of the speaker, as in social theories.”

Frank, Goodman, & Tenenbaum (2009)
Redefining the problem (it’s harder): Not just about learning a stable lexicon of word-meaning mappings, but also about the intention of the speaker at the moment.
“Social theories suggest that learners rely on a rich understanding of the goals and intentions of speakers…once the child understands what is being talked about, the mappings between words and referents are relatively easy to learn (St. Augustine, 397/1963; Baldwin, 1993; Bloom, 2002; Tomasello, 2003). These theories must assume some mechanism for making mappings, but this mechanism is often taken to be deterministic, and its details are rarely specified. In contrast, cross-situational accounts of word learning take advantage of the fact that words often refer to the immediate environment of the speaker, which allows learners to build a lexicon based on consistent associations between words and their referents (Locke, 1690/1964; Siskind, 1996; Smith, 2000; Yu & Smith, 2007).”
[How different are these accounts, really?]
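
To make the cross-situational idea concrete, the four-trial example from Smith & Yu (2008) can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure from the paper: the function names are hypothetical, the first step keeps just the referents that were present every time a word was heard, and the second step (dropping a referent once another word has claimed it) is an added elimination assumption used here to reach G = g.

```python
# Each trial pairs the words heard with the objects in view (from the example above).
trials = [
    ({"A", "B"}, {"a", "b"}),
    ({"C", "D"}, {"c", "d"}),
    ({"E", "F"}, {"e", "f"}),
    ({"A", "G"}, {"a", "g"}),
]


def cross_situational_candidates(trials):
    """For each word, keep only the referents present on every trial where it occurred."""
    candidates = {}
    for words, objects in trials:
        for word in words:
            if word in candidates:
                # A referent absent on this trial can no longer be the meaning.
                candidates[word] = candidates[word] & objects
            else:
                candidates[word] = set(objects)
    return candidates


def resolve_by_elimination(candidates):
    """Assumed extra step: once a word has a single candidate, remove it from other words."""
    resolved = {word: set(objs) for word, objs in candidates.items()}
    changed = True
    while changed:
        changed = False
        settled = {next(iter(objs)) for objs in resolved.values() if len(objs) == 1}
        for word, objs in list(resolved.items()):
            if len(objs) > 1 and objs & settled:
                remaining = objs - settled
                if remaining:
                    resolved[word] = remaining
                    changed = True
    return resolved


candidates = cross_situational_candidates(trials)
resolved = resolve_by_elimination(candidates)
print(candidates["A"])  # A is narrowed to {'a'} because b is absent in Trial 4
print(resolved["G"])    # G is narrowed to {'g'} once a is claimed by A
```

Words seen on only one trial (C, D, E, F) stay ambiguous in this sketch, which matches the point that disambiguation needs evidence aggregated across trials.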
Frank, Goodman, & Tenenbaum (2009)
Task: Identify lexicon items for object nouns

Assumption: What people intend to say (I) is a function of the world around them (specifically, the objects O present).
Assumption: The words people say (W) are a function of what people intend to say (I = objects intended) and how those intentions can be translated with the language they speak (using lexicon items L).

Model
The model learns a probability distribution over unobserved lexicons L (one L = set of lexicon items), given an observed corpus C of situations.

Prior P(L) favors parsimony (fewer lexical items): each additional lexical item is penalized exponentially, using constant α:
P(L) ∝ e^(-α|L|)

Likelihood P(C|L) is the product, over all situations s in C, of the probability of the words, objects, and intentions given the lexicon L:
P(C|L) = Π_{s ∈ C} P(W_s, O_s, I_s | L)

W & O are conditionally independent given I, so P(W_s, O_s, I_s | L) can be rewritten as the product of the words given the speaker’s intended objects and lexicon, P(W_s | I_s, L), times the probability of the speaker’s intended objects (I) given the objects present, P(I_s | O_s):
P(W_s | I_s, L) * P(I_s | O_s)

Since we can’t observe the speaker’s intended referent directly, we sum over all possible values of the intended referent I, assuming the object is present (I_s ⊆ O_s):
Σ_{I_s ⊆ O_s} P(W_s | I_s, L) * P(I_s | O_s)
Note that I_s can be empty if the speaker is not referring to an object that is present.
Simplicity assumption: P(I_s | O_s) ∝ 1 (all intentions equally likely)
Remaining term: P(W_s | I_s, L)

Assumption: words are generated as a bag of words (no order or dependencies, so can multiply them together)
Assumption: words are generated because (1) they are referential to some item present [P_R] or (2) they are non-referential [P_NR]

γ = probability a word is used referentially, given context
(1 – γ) = probability word is not used referentially (specifically, not referring to objects: function words, adjectives, verbs)

P_R(w|o, L) = probability of word used referentially for an object = probability of word being chosen, given the object and the lexicon
Uniform over words linked to object in the lexicon. If a word is not linked to an object, its referential probability is 0 for that object. Averaged over all possible intended referents (I_s).
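
The pieces described above can be collected in one place. The block below is a reconstruction from the verbal definitions on the slides, not a copy of the slides' own equations (which did not survive as text). It writes W_s, O_s, I_s for the words, objects, and intended objects of situation s, and the last line, which combines the bag-of-words assumption, γ, and the averaging over intended referents, should be read as an assumed form for a non-empty I_s.

```latex
\begin{align*}
P(L) &\propto e^{-\alpha |L|} \\
P(C \mid L) &\propto \prod_{s \in C} \; \sum_{I_s \subseteq O_s} P(W_s \mid I_s, L)\, P(I_s \mid O_s),
  \qquad P(I_s \mid O_s) \propto 1 \\
P(W_s \mid I_s, L) &= \prod_{w \in W_s} \Big[ \gamma \cdot \frac{1}{|I_s|} \sum_{o \in I_s} P_R(w \mid o, L)
  \;+\; (1 - \gamma)\, P_{NR}(w \mid L) \Big]
\end{align*}
```

Multiplying the first two lines gives the unnormalized posterior over lexicons, P(L | C) ∝ P(C | L) P(L), which is the distribution the model is said to learn.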
P_NR(w|L) = probability of word used non-referentially w.r.t. objects = probability of word being chosen, given lexicon.
If word not in lexicon already, probability of choosing word ∝ 1.
If word in lexicon already, probability of choosing word ∝ κ.
When κ < 1, words in the lexicon are less likely to be uttered non-referentially than words not in the lexicon.

Testing the Model: Corpus Evaluation
Input Corpus: Rollins videos of parents interacting with preverbal infants. Annotated with all mid-size objects judged to be visible to the infant.
Other word-learning models were evaluated on the same data, and all models were judged on the accuracy of the lexicons learned and inferences on speaker intentions.
Lexicons: Each model produced association probability between word & object. Chose lexicon that maximized F-score (harmonic mean of precision & recall).
Speaker Intentions: Intentional model = intention with highest posterior probability given lexicon. Other models = objects for which matching words in best lexicon had been uttered.
Note: Intentional model with “one parameter” is when α is the only free parameter.

Testing the Model: Corpus Evaluation
Best lexicon found by intentional model

Testing the Model: Corpus Evaluation
Why did the intentional model work so well?
“The high precision of the lexicon found by our model was likely due to two factors. First, the distinction between referential and nonreferential words allowed our model to exclude from the lexicon words that were used without a consistent referent. Second, the ability of the model to infer an empty intention allowed it to discount utterances that did not contain references to any object in the immediate context.”

Using the model to explain experimental results

Cross-situational word-learning (Yu & Smith 2007, Smith & Yu 2008)
All models (even the non-intentional ones) successfully learned the word-meaning mappings, given those experimental stimuli. This doesn’t help to differentiate the models; it just shows that all these models can use statistical information like this.

Mutual Exclusivity
“Can you give me the dax?” (“bird” = BIRD already known)
Children give the novel object, presumably assuming the bird can’t also be called “dax”.
Intentional model has a soft preference for one-to-one mappings already, since having multiple words for an object reduces consistency of word use with that object. (Though note that some of the other comparison models can also show this behavior, such as the conditional probability models.)
Intentional model scoring for four potential word-referent mappings: mapping to the novel object is the best.
Note also that this is a case of one-trial learning (Carey 1978, Markson & Bloom 1997).
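
The mutual exclusivity scoring can be illustrated with a small Python sketch of the intentional model's unnormalized posterior. Everything specific here is an assumption made for illustration: the parameter values (GAMMA, KAPPA, ALPHA), the ten-word vocabulary, the toy corpus of three prior “bird” situations plus one “dax” situation, the treatment of an empty intention as purely non-referential, and all function names. It is not the authors' implementation, and the scores are not the ones shown on the slide.

```python
from itertools import combinations
from math import exp

GAMMA = 0.5   # assumed probability that a word is used referentially
KAPPA = 0.05  # assumed non-referential weight for words already in the lexicon
ALPHA = 0.5   # assumed prior penalty per lexical item
VOCAB = ["bird", "dax"] + [f"filler{i}" for i in range(8)]  # hypothetical vocabulary


def p_ref(word, obj, lexicon):
    """P_R(w | o, L): uniform over the words linked to obj in the lexicon, else 0."""
    linked = [w for (w, o) in lexicon if o == obj]
    return 1.0 / len(linked) if word in linked else 0.0


def p_nonref(word, lexicon):
    """P_NR(w | L): weight KAPPA for lexicon words, 1 otherwise, normalized over VOCAB."""
    in_lexicon = {w for (w, _) in lexicon}
    weights = {w: (KAPPA if w in in_lexicon else 1.0) for w in VOCAB}
    return weights[word] / sum(weights.values())


def p_words(words, intention, lexicon):
    """P(W_s | I_s, L) under the bag-of-words assumption."""
    prob = 1.0
    for w in words:
        if intention:
            ref = sum(p_ref(w, o, lexicon) for o in intention) / len(intention)
            prob *= GAMMA * ref + (1 - GAMMA) * p_nonref(w, lexicon)
        else:
            # Assumption: with an empty intention, words are non-referential only.
            prob *= p_nonref(w, lexicon)
    return prob


def intentions(objects):
    """All subsets of the objects present, including the empty intention."""
    objs = sorted(objects)
    return [c for r in range(len(objs) + 1) for c in combinations(objs, r)]


def score(lexicon, corpus):
    """Unnormalized posterior: prior times the likelihood summed over intentions."""
    total = exp(-ALPHA * len(lexicon))
    for words, objects in corpus:
        options = intentions(objects)
        total *= sum(p_words(words, i, lexicon) for i in options) / len(options)
    return total


# Toy corpus: "bird" heard three times with BIRD, then "dax" with BIRD and a novel object.
corpus = [(["bird"], {"BIRD"})] * 3 + [(["dax"], {"BIRD", "NOVEL"})]
known = {("bird", "BIRD")}
candidate_lexicons = {
    "dax->nothing": known,
    "dax->BIRD": known | {("dax", "BIRD")},
    "dax->NOVEL": known | {("dax", "NOVEL")},
    "dax->both": known | {("dax", "BIRD"), ("dax", "NOVEL")},
}
scores = {name: score(lex, corpus) for name, lex in candidate_lexicons.items()}
best = max(scores, key=scores.get)
print(best)
```

Under these assumed settings the mapping to the novel object scores highest. Linking “dax” to BIRD makes P_R(bird | BIRD) drop from 1 to 1/2, which is the “soft preference for one-to-one mappings” described above.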
Object Individuation
Xu 2002: Infants use words to individuate objects
Habituation: toys coming out from behind screens (figure shows two-word habituation, where words are “duck” and “ball”; alternative is one-word habituation, where both objects would be labeled “toy”)
Habituation: “Look, a duck!” “Look, a ball!”
Test: screen removed to reveal…
Infant reaction: Infants didn’t look as long. (not surprised)
vs.
Habituation: “Look, a toy!” “Look, a toy!”
Infant reaction: Infants looked longer. (surprised to see two objects)
Interpretation: Infants expect words to be used referentially. One object = one label, two objects = two labels.
Intentional model: Simulate looking time with surprisal (negative log probability) and get equivalent results.

Intention Reading
Baldwin 1993: Children sensitive to intentional labeling, not just timing of labeling.
Children were told the name of a toy that was unseen and given a second toy to play with. Children learned to label the first toy with the name.
Easy to simulate in intentional model: Instead of intended objects being unknown, intended objects are known.
Note: Perceptual salience models cannot capture this.

Frank, Goodman, & Tenenbaum (2009)
“Our model operates at the ‘computational theory’ level of explanation (Marr, 1982). It describes explicitly the structure of a learner’s assumptions in terms of relationships between observed and unobserved variables. Thus, in defining our model, we have made no claims about the nature of the mechanisms that might instantiate these relationships in the human brain.”
“The success of our model supports the hypothesis that specialized principles may not be necessary to explain many of the smart inferences that young children are able to make in learning words. Instead, in some cases, a representation of speakers’ intentions may suffice.” ...

This note was uploaded on 12/12/2011 for the course PSYCH 215l taught by Professor Pearl during the Fall '11 term at UC Irvine.
