Psych 215L: Language Acquisition
Lecture 9: Grammatical Categorization

Computational Problem

Identify classes of words that behave similarly (are used in similar syntactic environments).

“This is a DAX.”
DAX = noun
Other nouns = bear, toy, teddy, stuffed animal, really great toy that I love so much,…

Mintz 2003

“…it is not fully known how child language learners initially categorize words. There has been recent interest in the idea that distributional information carried by the cooccurrence patterns of words in sentences could provide a great deal of information relevant to grammatical categories.”

Mintz 2003, on Theorists

And what theorists initially thought…

“Pinker (1987) argued that, given sentences in (2a,b), a distributional learner would incorrectly categorize fish and rabbits together, and, hearing (2c), would incorrectly assume that (2d) is also permissible.”

(2a) John ate fish.
(2b) John ate rabbits.
(2c) John can fish.
(2d) *John can rabbits.

“The crux of the problem…is that a given word form…can belong to multiple categories and thus occur in different syntactic contexts…potentially providing misleading category information…argued that the resulting erroneous generalizations would be common, and would render a distributional approach to categorization untenable.”

Mintz 2003, Another Problem

“The fundamental issue is that lexical adjacency patterns are variable…another question is how the learner is to know which environments are important and which should be ignored. Distributional analyses that consider all the possible relations among words in a corpus of sentences would be computationally unmanageable at best, and impossible at worst.”

One idea: local contexts

“…by showing that local contexts are informative, these findings suggested a solution to the problem of there being too many possible environments to keep track of: focusing on local contexts might be sufficient.”

Frequent Frames

Frequent frame: X___Y, where X and Y are words that frame another word and appear frequently in the child’s linguistic environment.

Frame: the__is
“the king is…”, “the goblin is…”, “the girl is…”

Idea: Children may be attending to other kinds of distributional information available in the linguistic environment.
- There is evidence that children can track information that is nonadjacent in the speech stream (Santelmann & Jusczyk 1998, Gómez 2002): “he is running”
- Also, the frequency of lexical frames is something children are sensitive to (Childers & Tomasello 2001: children more easily acquire novel verb meanings when the verbs occur in lexical frames that occur frequently in the input).

Frequent Frames vs. Bigrams

Idea: What categorization information is available if children track frequent frames?

Example frame: can___him
“can trick him…”, “can help him…”, “can hug him…”
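To make the frame-versus-bigram contrast concrete, here is a minimal sketch (in Python) of how frame contexts could be tabulated over utterances. Everything here is illustrative: the toy utterances and the tabulate_contexts helper are invented for this handout, and Mintz (2003) ran his analysis over full CHILDES corpora rather than a handful of sentences.

    from collections import defaultdict

    def tabulate_contexts(utterances):
        """Count each frame (X, Y) and record which words fill it; also record
        bigram contexts ('after X', 'before Y') independently, for contrast."""
        frame_counts = defaultdict(int)    # (X, Y) -> frequency
        frame_fillers = defaultdict(set)   # (X, Y) -> words seen in the slot
        bigram_fillers = defaultdict(set)  # ('after', X) / ('before', Y) -> words
        for utterance in utterances:
            words = utterance.lower().split()
            for x, w, y in zip(words, words[1:], words[2:]):
                frame_counts[(x, y)] += 1
                frame_fillers[(x, y)].add(w)           # joint co-occurrence: X _ Y
                bigram_fillers[('after', x)].add(w)    # independent co-occurrence
                bigram_fillers[('before', y)].add(w)
        return frame_counts, frame_fillers, bigram_fillers

    # Toy stand-in for child-directed speech (invented sentences):
    utterances = ["the king is happy", "the goblin is here", "the girl is tall",
                  "can trick him", "can help him", "can hug him"]
    counts, fillers, _ = tabulate_contexts(utterances)

    # The two most frequent frames group their fillers into proto-categories
    # (set order may vary when printed):
    for frame in sorted(counts, key=counts.get, reverse=True)[:2]:
        print(frame, fillers[frame])
    # ('the', 'is') {'king', 'goblin', 'girl'}   <- noun-like category
    # ('can', 'him') {'trick', 'help', 'hug'}    <- verb-like category

The point of keeping frame_fillers separate from bigram_fillers is the one the next quote makes: a frame stores the filler as jointly following X and preceding Y, whereas the two bigram records would also match words that occurred after X and before Y on independent occasions.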
Experimental Evidence

“In the present approach the word ‘W’ in the environment ‘…X W Y…’ is stored as ‘jointly following X and preceding Y’, but such would not be the case if W occurred after X and before Y on independent occasions…bigram contexts…record only independent cooccurrence patterns (e.g. ‘following X’, ‘preceding Y’)….property of joint co-occurrence in the frame contexts involves an additional relationship...”

Experimental Support

“Another important difference…adults will categorize words in an artificial language based on their occurrence within frames…whereas bigram regularity alone has failed to produce categorization in artificial grammar experiments, without additional cues…”
- Also, Mintz (2006) shows that 12-month-olds are sensitive to frequent frames in an experimental setup.

Goals

“The goal of the work described here…what assumptions would be reasonable to build into [a model of grammatical categorization by learners]. Specifically, the goal was to formulate a unit to which there is some evidence that children and adults attend, and with which adults have been shown to categorize, and examine how predictive it is of category membership.”

What is a “frequent” frame?

Definition of “frequent” for frequent frames: frames appearing a certain number of times in a given corpus.

“The principles guiding inclusion in the set of frequent frames were that frames should occur frequently enough to be noticeable, and that they should also occur enough to include a variety of intervening words to be categorized together. While these criteria were not operationalized in the present experiment, a pilot analysis with a randomly chosen corpus, Peter, determined that the 45 most frequent frames satisfied these goals and provided good categorization.”

Data

Data representing the child’s linguistic environment: 6 corpora of child-directed speech from the CHILDES database.

How Frequent Frames Work

Trying out frequent frames on a corpus of child-directed speech:

Frame: the ___ is
“the radio is in the way…but the doll is…and the teddy is…”
radio, doll, teddy = Category 1 (similar to Noun)

Frame: you ___ it
“you draw it so that he can see it… you dropped it on purpose!…so he hit you with it…”
draw, dropped, with = Category 2 (similar-ish to Verb)

Metrics for Success

Determining success with frequent frames:

Precision = (# of words identified correctly as Category within frame) / (# of words identified as Category within frame)   (Accuracy)

Recall = (# of words identified correctly as Category within frame) / (# of words that should have been identified as Category)   (Completeness)

Frame: you ___ it, with draw, dropped, with = Category 2 (similar-ish to Verb):
- # of words correctly identified as Verb = 2; # of words identified as Verb = 3
  Precision = 2/3
- # of words correctly identified as Verb = 2; # of words that should be identified as Verb = many (all verbs in the corpus)
  Recall = 2/many = small number
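The two metrics can be stated directly in code. Below is a small sketch that reproduces the you ___ it arithmetic from the slide; the precision_recall helper is mine, and the set standing in for “all verbs in the corpus” is made up (the real corpora contain many more verbs, which is exactly why recall comes out low).

    def precision_recall(identified, correct_in_frame, should_have_found):
        """Precision: of what the frame grouped together, how much is right?
        Recall: of everything in the category, how much did the frame find?"""
        n_correct = len(identified & correct_in_frame)
        precision = n_correct / len(identified)
        recall = n_correct / len(should_have_found)
        return precision, recall

    # Frame "you ___ it" grouped these words (Category 2, similar-ish to Verb):
    category2 = {"draw", "dropped", "with"}
    true_verbs_in_frame = {"draw", "dropped"}
    # Stand-in for "all verbs in the corpus"; in reality this set is large:
    all_verbs = {"draw", "dropped", "see", "hit", "read", "go", "eat", "run"}

    p, r = precision_recall(category2, true_verbs_in_frame, all_verbs)
    print(f"precision = {p:.2f}")  # 2/3 = 0.67 (accuracy)
    print(f"recall    = {r:.2f}")  # 2/8 = 0.25 here; 2/'many' in a real corpus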
Some Frequent Frame Results

Another Look at Frequent Frame Coverage

“Frequent frames can thus focus a learner on a relatively small number of contexts that can have broad impact on how words in the input are categorized….be very useful to young language learners, who have limited memory and processing resources.”

The Robustness of Frequent Frames

“…on average 45% of the frequent frames of a given corpus were frequent frames for at least three other corpora, indicating that many informative distributional contexts are shared from corpus to corpus.”

Precision results

Precision generally quite high. Interpretation: when a frequent frame clustered words together into a category, those words often did belong together (nouns together, verbs together, etc.).

Recall results

Recall generally quite low.

The magic number of frequency…

“It would be desirable to analyze the corpora using a frequency threshold for each corpus that is based on a relativized frequency criterion, as the salience of frequent frames to human learners is more likely to be a factor of relative frequency than absolute number.”

Experiment 2

“The set of frequent frames was…selected to include all frames whose frequency in proportion to the total number of frames in the corpus surpassed a predetermined threshold of 0.13%…this specific threshold was determined based on the frequent frames for each corpus in Experiment 1….frequent frame selection method for Experiment 2 provided a kind of normalization of the method used in Experiment 1.”

“…there were often several noun categories and several verb categories (all very accurate), rather than one category of all the nouns, one of all the verbs, etc.”

Relativized Frequent Frame Coverage: similar coverage to non-relativized frequent frames.

Relativized Frequent Frame Precision / Relativized Frequent Frame Recall (results shown graphically in the original slides).

Getting Better Scores

Getting better precision (which was already high):
“…one way to circumvent the erroneous classifications…would be to filter out extremely low frequency targets.”

Getting better recall (which was pretty low):
“It is a prevalent characteristic of these frame-based categories that there is considerable overlap in the words they contain….two framebased categories could be unified if they surpass a threshold of lexical overlap. This possibility was tested on the results from one of the corpora, Peter, using a criterion of 20% overlap. The outcome was that 17 different verb categories were joined to form one category of 261 word types, 99.3% of which were verbs.”

Unification Overlap in Action

Many frames overlap in the words they identify.

“Accuracy was not adversely affected by the unification of categories, remaining at 0.90 or above…indicating that the unification procedure did not join together frame-based categories containing words from different grammatical categories. Furthermore, type completeness reached 0.91…indicating that, as expected, the distributional categories that had been fragments of grammatical categories were merged by the unification procedure…it appears that a very simple conglomeration procedure based on lexical overlap could be used to join accurate smaller categories together into a more complete category.”

the__is:   dog, cat, king, girl
the__was:  dog, cat, king, teddy
a___is:    dog, goblin, king, girl
that___is: …, cat, goblin, king, teddy
→ the/a/that__is/was: dog, teddy, cat, goblin, king, girl
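Here is a rough sketch of that conglomeration procedure: frame-based categories are merged whenever their lexical overlap passes a threshold (20% in Mintz’s test on the Peter corpus). Note that the overlap measure below (shared word types relative to the smaller category) is an assumption on my part; the slides only say “a criterion of 20% overlap” and do not give the exact formula.

    def overlap(cat_a, cat_b):
        # Assumed overlap measure: shared word types relative to the
        # smaller category. The exact definition is not given on the slides.
        return len(cat_a & cat_b) / min(len(cat_a), len(cat_b))

    def unify(categories, threshold=0.20):
        """Greedily merge any two categories whose overlap exceeds the
        threshold, repeating until no more merges are possible."""
        merged = [set(c) for c in categories]
        done = False
        while not done:
            done = True
            for i in range(len(merged)):
                for j in range(i + 1, len(merged)):
                    if overlap(merged[i], merged[j]) > threshold:
                        merged[i] |= merged.pop(j)
                        done = False
                        break
                if not done:
                    break
        return merged

    # The frame-based categories from the example above:
    categories = [{"dog", "cat", "king", "girl"},     # the__is
                  {"dog", "cat", "king", "teddy"},    # the__was
                  {"dog", "goblin", "king", "girl"},  # a___is
                  {"cat", "goblin", "king", "teddy"}] # that___is
    print(unify(categories))
    # -> one unified noun-like category: {dog, teddy, cat, goblin, king, girl}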
Some thoughts on why FFs work

Wang & Mintz (2010): “…frequent frames are accurate categorizers because they identify linear sequences that are syntactically highly constrained…. a target and its context in a FF are more syntactically closely related to each other than in bigrams…provides converging evidence that frequent frames select syntactically constrained word sequences…limiting distributional generalizations to structurally similar contexts is possible without requiring a prior structural analysis…frequent frames can be viewed as a proxy for structural information, and it is perhaps for this reason, in part, that it is such a robust cue to lexical categories.”

Cross-linguistic Application?

“The fundamental notion is that a relatively local context defined by frequently co-occurring units can reveal a target word’s category…[here] the units were words and the frame contexts were defined by words that frequently co-occur. In other languages, a failure to find frequent word frames could trigger an analysis of co-occurrence patterns at a different level of granularity, for example, at the level of sub-lexical morphemes. The frequently co-occurring units in these languages are likely to be the inflectional morphemes which are limited in number and extremely frequent.” – Mintz 2003

Example: Western Greenlandic

Some work has been done for French (Chemla et al. 2009), Spanish (Weisleder & Waxman 2010), Chinese (Cai 2006; Xiao, Cai, & Lee 2006), Dutch (Erkelens 2009), German (Wang et al. 2010), and Turkish (Wang et al. 2010).

Very similar results: high accuracy, low completeness (before aggregation).
- However, for Turkish, it’s better to have FFs at the morpheme (rather than whole-word) level.

Liebbrandt & Powers 2010: Maybe not always so effective in Dutch…
- Why? Is one word before and after too short a context? No – using full utterances as the “context” actually yielded worse performance.
- Is there an issue with the frequency of the words filling the frames? There seems to be – using only frames where the filler was an infrequent word (and so rarely a function word) yielded better performance.

Corollaries from Chemla et al. (2009), Wang & Mintz (2010), and Wang et al. (2010): these reiterate the importance of the frame over the bigram or trigram.
- Chemla et al. (2009): it’s important that frames consist of individual lexical items rather than categories made up of multiple words.

Wang & Mintz (2008): Dynamic FFs

“…the frequent frame analysis procedure proposed by Mintz (2003) was not intended as a model of acquisition, but rather as a demonstration of the information contained in frequent frames in child-directed speech…Mintz (2003) did not address the question of whether an actual learner could detect and use frequent frames to categorize words…”

“This paper addresses this question with the investigation of a computational model of frequent frame detection that incorporates more psychologically plausible assumptions about the memor[y] resources of learners. In addition, it implements learning as a dynamic process that takes place utterance by utterance as a corpus is processed, rather than ‘in a batch’ over an entire corpus.”

Considering Children’s Limitations

Memory Considerations:
(1) Children possess limited memory and cognitive capacity and cannot track all the occurrences of all the frames in a corpus.
(2) Memory retention is not perfect: infrequent frames may be forgotten.
The Model’s Operation

(1) Only 150 frame types (and their frequencies) are held in memory.
(2) Forgetting function: frames that have not been encountered recently are less likely to stay in memory than frames that have been recently encountered.

Dynamic Procedure

(1) The child encounters an utterance (e.g., “You read the story to mommy.”).

(2) The child segments the utterance into frames:
    (1) you read the → frame: you X the
    (2) read the story → frame: read X story
    (3) the story to → frame: the X to
    (4) story to mommy → frame: story X mommy

(3) If memory is not full, a newly-encountered frame is added to the memory and its initial activation is set to 1. The forgetting function is simulated by the activation for each frame in memory decreasing by 0.0075 at each processing step.

    Processing Step 1 (frame: you X the)
    Memory: you X the (1.0)

    Processing Step 2 (frame: read X story)
    Memory: you X the (0.9925); read X story (1.0)

(4) If the frame already exists in memory, its activation is increased by 1.

    Processing Step 27 (frame: you X the)
    Before: I X it (3.885); you X the (0.8945); read X story (0.8805); the X to (0.8735); story X mommy (0.8625); …
    After:  I X it (3.885); you X the (1.8945); read X story (0.8805); the X to (0.8735); story X mommy (0.8625); …

(5) Since the memory buffer only stores 150 frames, it becomes full very quickly (after ~50 utterances). When memory is full, a newly-encountered frame replaces the least active frame with activation less than 1.

    Processing Step 101 (new frame: with X by)
    Before: I X it (8.75); you X the (6.995); read X story (5.65); the X to (5.45); story X mommy (5.35); …; you X it (0.9925); with X and (0.7965)
    After:  with X and (least active, below 1) is replaced: I X it (8.75); you X the (6.995); read X story (5.65); the X to (5.45); story X mommy (5.35); …; with X by (1.0); you X it (0.9925)

(6) If all activations are greater than 1, no change is made other than the forgetting function (activation − 0.0075).

    Processing Step 101 (new frame: with X by; here all stored frames exceed activation 1, so with X by is not stored)
    Before: I X it (8.75); you X the (6.995); read X story (5.65); the X to (5.45); story X mommy (5.35); …; you X it (1.9925); with X and (1.7965)
    After:  I X it (8.7425); you X the (6.9875); read X story (5.6425); the X to (5.4425); story X mommy (5.3425); …; you X it (1.9850); with X and (1.7890)

(A code sketch of this procedure appears after the next section.)

Input & Performance Gauge

Using the same corpora for input as Mintz (2003) (6 from CHILDES). The model’s performance was evaluated every 100 frames. Metric used: accuracy/precision (not recall).

- How many of the overall most frequent frames were in the model’s top 45? (Evaluated on the Eve and Peter corpora; results shown graphically in the original slides.)
- What about the ones that weren’t frequent frames? Are they still good categorizers?
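The dynamic procedure can be summarized in a short sketch. The buffer size (150), the +1 activation increment, the 0.0075 decay, and the below-1 replacement rule all come from the slides; details such as exactly when the decay is applied and how ties are broken are my assumptions, since the slides’ worked examples simplify those points.

    MEMORY_SIZE = 150   # only 150 frame types are held in memory
    DECAY = 0.0075      # forgetting: activation lost at each processing step

    def process_corpus(utterances):
        """Track frame activations utterance by utterance (in the style of
        Wang & Mintz 2008). Maps frames (X, Y) to their activations."""
        memory = {}
        for utterance in utterances:
            words = utterance.lower().split()
            # (2) segment the utterance into frames
            for x, _, y in zip(words, words[1:], words[2:]):
                frame = (x, y)
                # forgetting function: every stored frame decays a little
                # (assumed here to apply before the current frame is handled)
                for f in memory:
                    memory[f] -= DECAY
                if frame in memory:
                    memory[frame] += 1.0       # (4) re-encounter: activation += 1
                elif len(memory) < MEMORY_SIZE:
                    memory[frame] = 1.0        # (3) room left: store at activation 1
                else:
                    weakest = min(memory, key=memory.get)
                    if memory[weakest] < 1.0:  # (5) replace the least active frame...
                        del memory[weakest]
                        memory[frame] = 1.0
                    # (6) ...but if every activation exceeds 1, the new frame
                    #     is simply not stored
        return memory

    # After a pass through a corpus, the model's candidate frequent frames
    # are the most active entries, e.g.:
    #   sorted(memory, key=memory.get, reverse=True)[:45]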
Wang & Mintz (2008) Conclusions

“…our model demonstrates very effective categorization of words. Even with limited and imperfect memory, the learning algorithm can identify highly informative contexts after processing a relatively small number of utterances, thus yield[ing] a high accuracy of word categorization. It also provides evidence that frames are a robust cue for categorizing words.”