Lecture 12 - Poverty of the Stimulus: Structure Dependence
Psych 215L: Language Acquisition


Reminder: Poverty of the Stimulus
The Logic of Poverty of the Stimulus (The Logical Problem of Language Acquisition):
1) Suppose there are some data.
2) Suppose there is an incorrect hypothesis compatible with the data.
3) Suppose children behave as if they never entertain the incorrect hypothesis.
   Addendum (interpretation): Or children converge on the correct hypothesis much earlier than expected (Legate & Yang 2002).
Conclusion: Children possess innate knowledge ruling out the incorrect hypothesis from the hypothesis space considered.
   Addendum (interpretation): The initial hypothesis space does not include all hypotheses. Specifically, the incorrect ones of a particular kind are not in the child's hypothesis space.

Legate & Yang (2002): Poverty of the Stimulus Lives -- Child Input
Very frequent: "Is Hoggle t_is running away from Jareth?"
Very infrequent, if ever: "Can someone who can solve the Labyrinth t_can show someone who can't how?"
(The t_is / t_can marks indicate the position the fronted auxiliary moved from.)

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Some Issues
- Unclear how much evidence is "enough". Forms do occur, even if they do so rarely.
- Moreover, it may be better to consider forms not in isolation, but in a larger context.

"Our findings suggest that it is vital to consider the learnability of entire candidate grammars holistically. While crucial data that would independently support any one generalization (such as the auxiliary-fronting rule) may be very sparse or even nonexistent, there may be extensive data supporting other, related generalizations; this can bias a rational learner towards making the correct inferences about the cases for which the data is very sparse…. The need to acquire a whole system of linguistic rules together imposes constraints among the rules, so that an a priori unbiased learner may acquire constraints that are based on the other linguistic rules it must learn at the same time."

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Some Issues
It's possible to have both domain-general learning abilities and structured representations.

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Some Issues
Previous statistical accounts haven't connected with the argument that preferring hierarchical structures must be innate.

"PoS arguments begin with the assumption - taken by most linguists as self-evident - that language does have explicit hierarchical phrase structure, and that linguistic knowledge must at some level be based on representations of syntactic categories and phrases that are hierarchically organized within sentences. The PoS arguments are about whether and to what extent children's knowledge about this structure is learned via domain-general mechanisms, or is innate in some language-specific system. Critiques based on the premise that this explicit structure is not represented as such in the minds of language users do not really address this argument..."

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Some Issues
Previous statistical accounts are also somewhat difficult to interpret.

"For instance, the networks used by Reali and Christiansen (2005) and Lewis and Elman (2001) measure success by whether they predict the next word in a sequence or by comparing the prediction error for grammatical and ungrammatical sentences. These networks lack not only a grammar-like representation; they lack any kind of explicitly articulated representation of the knowledge they have learned. It is thus difficult to say what exactly they have learned about linguistic structure - despite their interesting linguistic behavior once trained."
Perfors, Tenenbaum, & Regier (2011): Or does it? -- Some Issues
They work within an ideal learner framework to show that the inference is possible from the data in principle. It remains to be seen whether it's possible for children, given their memory and processing limitations.

"We are not proposing a comprehensive or mechanistic account of how children actually acquire language...setting this challenge aside allows us to focus with more clarity on those aspects of learnability that classic PoS arguments address: claims about what data might be sufficient for learning, or what language-specific prior knowledge must be assumed in order to make learning possible…If we can show that such learning is in principle possible, then it becomes meaningful to ask the algorithmic-level question of how a system might successfully and in reasonable time search the space of possible grammars to discover the best-scoring grammar."

Perfors, Tenenbaum, & Regier (2011): Or does it? -- A depiction of the Poverty of the Stimulus
"…many versions of the PoS argument assume that the T is language-specific: in particular, that T is the knowledge that linguistic rules are defined over hierarchical phrase structures."

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Bayesian Model Selection

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Bayesian learning: Tradeoffs
A Bayesian learner finds a balance between fit to the data (likelihood) and simplicity of the explanation (hypothesis prior). (Of the candidate grammars depicted in the original figure, the learner would prefer the middle one, which best balances the two.)

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Posterior probability of G and T, given D
First, pick a type of grammar T (ex: linear, regular, hierarchical). Then, pick an instance of that type, a grammar G, from which the data D are generated (ex: T = hierarchical, G = a particular hierarchical grammar, D = "Is the dwarf who is being teased grumpy?").
The posterior probability of G and T, given D, is proportional to the probability of generating the data from G [p(D | G)], multiplied by the probability of G, given the type of grammar T chosen [p(G | T)].
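Putting the pieces of this slide sequence together in a single expression (a reconstruction: the remaining factor p(T), the learner's prior over grammar types, is supplied by Bayes' rule, and it is where any a priori preference for hierarchical grammars would naturally live):

    % posterior over a specific grammar G of type T, given the corpus D
    P(G, T \mid D) \;\propto\; P(D \mid G) \times P(G \mid T) \times P(T)

Here P(D | G) rewards fit to the data, P(G | T) is the hypothesis prior that favors simpler grammars within the chosen type, and P(T) is the prior over grammar types (linear, regular, hierarchical).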
Perfors, Tenenbaum, & Regier (2011): Or does it? -- The Corpus, slightly simplified
Adam corpus (American English), each word (mostly) replaced with its syntactic category (available at http://www.psychology.adelaide.edu.au/personalpages/staff/amyperfors/research/cognitionpos/index.html):
- determiners (det) [ex: the, a, an]
- nouns (n) [ex: cat, penguin, dream]
- adjectives (adj) [ex: adorable, stinky]
- comments (c) [ex: mmhm]
- prepositions (prep) [ex: to, from, of]
- pronouns (pro) [ex: he, she, it, one]
- proper nouns (prop) [ex: Jareth, Sarah, Hoggle]
- infinitives (to) [ex: "to" in "I want to go"]
- participles (part) [ex: "She would have gone", "I'm going"]
- infinitive verbs (vinf) [ex: "I want to go"]
- conjugated verbs (v) [ex: "he went"]
- auxiliary verbs (aux) [ex: "he can go"]
- complementizers (comp) [ex: "I thought that I should go."]
- wh-question words (wh) [ex: "what are you doing"]
Adverbs (ex: too, very) and negations (ex: not) were removed from all sentences.

Ungrammatical sentences and the most complex grammatical sentences were also removed:
- topicalized sentences, ex: "Here he is."
- (some) sentences with subordinate clauses, ex: "if you want to."
- (some) sentential complements, ex: "He thought that she ought to watch the movie."
- conjunctions (ex: and, or, but)
- serial verb constructions, ex: "You should go play outside."
Note: This biases the model against the more complex hierarchical grammars.

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Test corpora
Separate by frequency (idea: less complex sentences occur more frequently):
- Level 1 (500+ times) = 8 unique types
- Level 2 (300+ times) = 13 types
- Level 3 (100+ times) = 37 types
- Level 4 (50+ times) = 67 types
- Level 5 (10+ times) = 268 types
- Level 6 (complete corpus) = 2338 unique types, including interrogatives, wh-questions, relative clauses, prepositional and adverbial phrases, command forms, and auxiliary as well as non-auxiliary verbs.
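A minimal sketch of how such frequency-based levels could be built from category-tagged sentence strings; the thresholds are the ones on the slide, but the function and representation are illustrative rather than the authors' actual code.

from collections import Counter

def corpus_levels(category_sentences):
    """Group unique sentence types (category strings) by token frequency.

    category_sentences: list of strings like "aux pro part prep prop"
    (each word already replaced by its syntactic category) -- an assumed,
    illustrative representation.
    """
    counts = Counter(category_sentences)                  # token count per type
    thresholds = {1: 500, 2: 300, 3: 100, 4: 50, 5: 10, 6: 1}
    return {level: {t for t, c in counts.items() if c >= cutoff}
            for level, cutoff in thresholds.items()}

# For the Adam corpus described above, len(levels[1]) would come out to 8
# and len(levels[6]) to 2338 (the complete set of sentence types).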
Perfors, Tenenbaum, & Regier (2011): Or does it? -- The grammars
Structure-dependent, hierarchical grammars (a smaller one, CFG-S, and a larger one, CFG-L): represented with context-free phrase structure rules.
Flat grammars: one production per observed sentence (good fit, maximum complexity).
Regular grammars: hierarchical branching in only one direction (rules of the form A -> a or A -> a B), with varying levels of complexity & fit.
(Figure: the relationship between the candidate grammars in terms of complexity.)

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Likelihoods for the grammars
Two-component adaptor grammar model of Goldwater et al. (2006) and Johnson et al. (2007):
(1) [grammar] Assign a probability distribution over the infinite set of syntactic forms accepted in the language.
(2) [adaptor] Generate the finite observed corpus from that probability distribution (using power-law generation, so a few syntactic types are very frequent while most are infrequent).

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Results based on data types
(Log probability: a smaller negative number means more probable.)
Hierarchical grammars are preferred once more complex structures (levels 4, 5, and 6) are included in the data to be accounted for.

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Why the transition to the hierarchical grammars occurs
"What kind of input is responsible for the transition from linear grammars to grammars with hierarchical phrase structure? The smallest three corpora contain very few elements generated from recursive productions (e.g., nested prepositional phrases or relative clauses) or sentences using the same kind of phrase in different positions (e.g., a prepositional phrase modifying an NP subject, an NP object, a verb, or an adjective phrase). While a regular grammar must often add an entire new subset of productions to account for these elements, a context-free grammar need add fewer (especially CFG-S). As a consequence, the flat and regular grammars have poorer generalization ability and must add proportionally more productions in order to parse a novel sentence."

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Why the larger hierarchical grammar is preferred at the very last level
"The larger context-free grammar CFG-L outperforms CFG-S on the full corpus, probably because it includes non-recursive counterparts to some of its recursive productions. This results in a significantly higher likelihood since less of the probability mass is invested in recursive productions that are used much less frequently than the non-recursive ones. Thus, although both grammars have similar expressive power, the CFG-L is favored on larger corpora because the likelihood advantage overwhelms the disadvantage in the prior."

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Results using data tokens, rather than data types
Different results: linear grammars are always preferred, no matter how complex the data. Why?
"The corpus of sentence tokens contains almost ten times as much data, but no concomitant increase in the variety of sentences (as would occur if there were simply more types, corresponding to a larger dataset of tokens). Thus the likelihood is weighted relatively more strongly relative to the prior (which does not change); this works against the context-free grammars, which overgeneralize more."
Implications: Children need a bias to evaluate grammars based on data types, rather than data tokens. (An innate bias, but domain-specific or domain-general?)
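A toy sketch of the type- versus token-based scoring difference just described: the prior term is the same either way, but counting every token multiplies the likelihood term, which is what lets it swamp the prior. The grammar objects and their attributes here are hypothetical stand-ins, not the model's actual implementation.

from collections import Counter

def log_posterior(grammar, sentences, by_types=True):
    """Score a candidate grammar on a corpus of category-string sentences.

    grammar is assumed (hypothetically) to expose:
      grammar.log_prior          -- simplicity-based prior, log p(G | T) + log p(T)
      grammar.log_likelihood(s)  -- log probability of generating sentence type s
    """
    counts = Counter(sentences)
    score = grammar.log_prior                       # fixed complexity penalty
    for sentence_type, n_tokens in counts.items():
        weight = 1 if by_types else n_tokens        # types: each counted once
        score += weight * grammar.log_likelihood(sentence_type)
    return score

# A rational comparison then simply picks the highest-scoring grammar:
# best = max(candidate_grammars, key=lambda g: log_posterior(g, corpus, by_types=True))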
Perfors, Tenenbaum, & Regier (2011): Or does it? -- Results using automatically generated grammars, rather than hand-crafted ones
(Log probability: a smaller negative number means more probable.)
Same results: hierarchical grammars are preferred once more complex data is in the input.

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Results using data types, but with the data divided by age of input (rather than by level of complexity)
Hierarchical grammars are preferred at all ages (even the earliest ages have sufficient complexity in the input).

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Accounting for data
Note that the hierarchical grammars can account for much of the most complex data, even when they are only trying to account for less complex data. Also, the CFG-L grammar fit to the Adam data can account for between 87 and 94% of sentences from a completely different data set (Sarah).

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Making the right generalizations
The hierarchical grammars are the only grammars that generalize correctly - they have rules to parse the grammatical utterances (even ones not in the input) and no rules able to parse the ungrammatical utterances.

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Generating the right representations
The hierarchical grammars (especially CFG-L) generate the most accurate structural representations for a novel data set (though the regular grammars aren't far behind).

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Implications about what children need
"In general, one must assume either a powerful domain-general learning mechanism with only a few general innate biases that guide the search, or a weaker learning mechanism with stronger innate biases, or some compromise position. Our results do not suggest that any of these possibilities is more likely than the others. Our core argument concerns only the specific need for a bias to a priori prefer analyses of syntax that incorporate hierarchical phrase structure. We are arguing that a rational learner may not require such a bias, not that other biases are also unnecessary."

Perfors, Tenenbaum, & Regier (2011): Or does it? -- About the necessity of the type-based analysis
"Our work suggests that if human learners, like our model, are capable of evaluating whether type-based or token-based analyses are themselves more appropriate for a given problem, they might rationally decide to favor a more type-based analysis when deciding among grammars (not necessarily for other aspects of language acquisition)… Would a disposition to evaluate grammars within a two-component adaptor-grammar-like framework, or based on type data only, constitute a language-specific or domain-general disposition? It is difficult to say, but the conceptual underpinnings of the adaptor grammar framework are consistent with a domain-general interpretation, emerging due to memory constraints or other cognitive factors."

Berwick, Pietroski, Yankama, & Chomsky (2011): Response -- Berwick et al. say this doesn't address the PoS problem
Basic argument: Having a hierarchical analysis is the first step to being able to posit structure-dependent rules - but it doesn't mean that you do posit those rules rather than structure-independent rules. You still have to know to use that structure when hypothesizing your rules.
"But even if a Bayesian learner can acquire grammars that generate structured expressions… crucially, however, it does not follow that such learners will acquire grammars in which rules are structure dependent. On the contrary…the acquired grammars may still operate structure-independently… PTR seem to assume that if a grammar generates expressions that exhibit hierarchy, then the rules defined over these expressions/structures must be structure dependent…Structured expressions can be (trans)formed by a structure-independent rule… for example, fronting the first auxiliary."
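A toy illustration of the distinction Berwick et al. are drawing: both rules below apply to the same sentence, but only the second consults structure. The example sentence is the one used earlier in these slides; the "parse" is hand-coded here as an index for the main-clause auxiliary, and everything else is purely illustrative.

def front_first_auxiliary(words, aux_positions):
    """Structure-independent rule: front the linearly first auxiliary."""
    i = min(aux_positions)
    return [words[i]] + words[:i] + words[i + 1:]

def front_main_clause_auxiliary(words, main_aux_position):
    """Structure-dependent rule: front the auxiliary of the main clause."""
    i = main_aux_position
    return [words[i]] + words[:i] + words[i + 1:]

declarative = "the dwarf who is being teased is grumpy".split()
aux_positions = [3, 6]    # both occurrences of "is"
main_aux_position = 6     # the main-clause "is", standing in for a real parse

print(" ".join(front_first_auxiliary(declarative, aux_positions)))
# -> "is the dwarf who being teased is grumpy"   (ungrammatical)
print(" ".join(front_main_clause_auxiliary(declarative, main_aux_position)))
# -> "is the dwarf who is being teased grumpy"   (the attested question)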
Perfors, Tenenbaum, & Regier (2011): Or does it? -- The importance of learning a system rather than learning a construction
"Our analysis makes a general point that has sometimes been overlooked in considering stimulus poverty arguments, namely that children learn grammatical rules as a part of a system of knowledge…. We have suggested here that even when the data does not appear to explain an isolated inference, there may be enough evidence to learn a larger system of linguistic knowledge - a whole grammar - of which the isolated inference is a part. A similar intuition underlies other arguments about the important role that indirect evidence might play in language acquisition…This point is also broadly consistent with the generative tradition in linguistics…one of whose original goals was to unify apparently disparate aspects of syntax…"

Perfors, Tenenbaum, & Regier (2011): Or does it? -- Learning higher-order generalizations first
"One implication of our work is that it may be possible to learn a higher-order abstraction T even before identifying all of the correct lower-level generalizations G that T supports. Therefore, it may be possible for T to operate to constrain G even if T itself is learned… If an abstract generalization can be acquired very early and can function as a constraint on later development of specific rules of grammar, it may function effectively as if it were an innate domain-specific constraint, even if it is in fact not innate and instead is acquired by domain-general induction from data."

Perfors, Tenenbaum, & Regier (2011): Or does it? -- How this happens
"While there are infinitely many possible specific grammars G, there are only a small number of possible grammar types T. It may thus require less evidence to identify the correct T than to identify the correct G. More deeply, because the higher level of T affects the grammar of the language as a whole while any component of G affects only a small subset of the language produced, there is in a sense much more data available about T than there is about any particular component of G…Higher-order generalizations may thus be learned faster simply because there is much more evidence relevant to them."

