CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 16: Machine Translation: Intro and Classical Models

Outline for MT Week
- Intro and a little history
- Language Similarities and Divergences
- Three classic MT Approaches
  - Transfer
  - Interlingua
  - Direct
- Modern Statistical MT
- Evaluation

What is MT?
- Translating a text from one language to another automatically.

Google Translate
- The translation
  http://translate.google.com/translate?hl=en&sl=es&tl=en&u=http%3A%2F%2Fwww.cocinadominicana.com%2Facompanamientos-ensaladas-pastelones%2F1907-tostones.html
- The original recipe for tostones
  http://www.cocinadominicana.com/acompanamientos-ensaladas-pastelones/1907-tostones.html

Google Translate
- http://translate.google.com/translate_t
- http://translate.google.com/translate?hl=en&sl=fr&u=http://www.tarte-tatin.info/recette-tarte-tatin.html&ei=BduiSYK3C4KOsQObvLm_CQ&sa=X&oi=translate&resnum=4&ct=result&prev=/search%3Fq%3Dtarte%2Btatin%2Brecettes%26num%3D100%26hl%3Den%26lr%3D%26client%3Dsa

Machine Translation
- The Story of the Stone ("The Dream of the Red Chamber")
- Cao Xueqin 1792
- Chinese gloss: Dai-yu alone on bed top think-of-with-gratitude Bao-chai again listen to window outside bamboo tip plantain leaf of on-top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come
- Hawkes translation: As she lay there alone, Dai-yu's thoughts turned to Bao-chai… Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.

Machine Translation
- Issues:
  - Sentence segmentation: 4 English sentences to 1 Chinese
  - Grammatical differences
    - Chinese rarely marks tense: As, turned to, had begun; tou -> penetrated
    - No pronouns or articles in Chinese
  - Stylistic and cultural differences
    - Bamboo tip plantain leaf -> bamboos and plantains
    - Ma 'curtain' -> curtains of her bed
    - Rain sound sigh drop -> insistent rustle of the rain

Alignment in Machine Translation

Not just literature
- Hansards: Canadian parliamentary proceedings

What is MT already good enough for?
- Tasks for which a rough translation is fine
  - Extracting information (finding recipes!)
  - Web pages
  - Email
- Tasks for which MT can be post-edited
  - MT as first pass
  - "Computer-aided human translation"
- Tasks in sublanguage domains where high-quality MT is possible
  - FAHQT (fully automatic high-quality translation)

What is MT not yet good enough for?
- Really hard stuff
  - Literature
  - Natural spoken speech (meetings, court reporting)
- Really important stuff
  - Medical translation in hospitals
  - 911 calls

MT History
- 1946 Booth and Weaver discuss MT at Rockefeller Foundation in New York
- 1947-48 idea of dictionary-based direct translation
- 1949 Weaver memorandum popularized idea
- 1952 all 18 MT researchers in world meet at MIT
- 1954 IBM/Georgetown Demo Russian-English MT
- 1955-65 lots of labs take up MT

Warren Weaver memo
- http://www.stanford.edu/class/linguist289/weaver001.pdf
- "There are certain invariant properties which are… to some statistically useful degree, common to all languages."
- On March 4, 1947, "having considerable exposure to computer design problems during the war, and being aware of the speed, capacity, and logical flexibility possible in modern electronic computers", Weaver suggested that computers be used for translation

History of MT: Pessimism
- 1959/1960: Bar-Hillel "Report on the state of MT in US and GB"
  - Argued FAHQT too hard (semantic ambiguity, etc.)
  - Should work on semi-automatic instead of automatic
  - His argument:
    - Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
    - Only human knowledge lets us know that 'playpens' are bigger than boxes, but 'writing pens' are smaller
  - His claim: we would have to encode all of human knowledge

History of MT: Pessimism
- The ALPAC report
  - Headed by John R. Pierce of Bell Labs
  - Conclusions:
    - Supply of human translators exceeds demand
    - All the Soviet literature is already being translated
    - MT has been a failure: all current MT work had to be post-edited
    - Sponsored evaluations which showed that intelligibility and informativeness were worse than human translations
  - Results:
    - MT research suffered
    - Funding loss
    - Number of research labs declined
    - Association for Machine Translation and Computational Linguistics dropped MT from its name

History of MT
- 1976 Meteo, weather forecasts from English to French
- Systran (Babelfish) been used for 40 years
- 1970's:
  - European focus in MT; mainly ignored in US
- 1980's:
  - Ideas of using early AI techniques in MT (KBMT, CMU)
  - Focus on "interlingua" systems, especially in Japan
- 1990's:
  - Commercial MT systems
  - Statistical MT
  - Speech-to-speech translation
- 2000's:
  - Statistical MT takes off
  - Google Translate

Language Similarities and Divergences
- Some aspects of human language are universal or near-universal, others diverge greatly.
- Typology: the study of systematic cross-linguistic similarities and differences
- What are the dimensions along which human languages vary?

Morphology
- Morpheme
  - Minimal meaningful unit of language
  - Word = Morpheme+Morpheme+Morpheme+…
- Stems: also called lemma, base form, root, lexeme
  - hope+ing -> hoping
  - hop+ing -> hopping
- Affixes
  - Prefixes: Antidisestablishmentarianism
  - Suffixes: Antidisestablishmentarianism
  - Infixes: hingi (borrow) – humingi (borrower) in Tagalog
  - Circumfixes: sagen (say) – gesagt (said) in German

Morphological Variation
- Isolating languages
  - Cantonese, Vietnamese: each word generally has one morpheme
- Vs. Polysynthetic languages
  - Siberian Yupik ('Eskimo'): single word may have very many morphemes
- Agglutinative languages
  - Turkish: morphemes have clean boundaries
- Vs. Fusional languages
  - Russian: single affix may have many morphemes

A Turkish word
- uygarlaştıramadıklarımızdanmışsınızcasına
- uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
- Behaving as if you are among those whom we could not cause to become civilized

Index of synthesis (slide from Holger Diessel)
- Scale from isolating to synthetic, with Vietnamese, English, Russian, and Oneida placed along it

Isolating language (slide from Holger Diessel)
(1) Vietnamese (Comrie 1981: 43)
    Khi   tôi  đến   nhà    bạn,
    when  I    come  house  friend
    'When I came to my friend's house,
    chúng  tôi  bắt đầu  làm  bài.
    PL     I    begin    do   lesson
    'we began to do lessons.'

Isolating language
Cantonese
    keui  wa   chyuhn  gwok     jeui  daaih  gaan      nguk   haih  li    gaan
    he    say  entire  country  most  big    building  house  is    this  building

Synthetic language (slide from Holger Diessel)
(2) Kirundi (Whaley 1997: 20)
    Y-a-bi-gur-i-ye                abâna
    CL1-PST-CL8.them-buy-APPL-ASP  CL2.children
    'He bought them for the children.'

Polysynthetic language (slide from Holger Diessel)
Noun-incorporation (cf. fox-hunting, bird-watching)
(3) Mohawk (Mithun 1984: 868)
    a. r-ukwe't-í:yo
       he-person-nice
       'He is a nice person.'
    b. wa-hi-'sereth-óhare-'se
       PST-he/me-car-wash-for
       'He car-wash for me.' (= 'He washed my car')
    c. kvtsyu  v-kuwa-nya't-ó:'ase
       fish    FUT-they/her-throat-slit
       'They will throat-slit a fish.'

Index of fusion (slide from Holger Diessel)
- Scale from agglutinative to fusional, with Swahili, Russian, and Oneida placed along it

Agglutinative language (slide from Holger Diessel)
(1) Turkish (Comrie 1981: 44)
                  SG        PL
    Nominative    adam      adam-lar
    Accusative    adam-ı    adam-lar-ı
    Genitive      adam-ın   adam-lar-ın
    Dative        adam-a    adam-lar-a
    Locative      adam-da   adam-lar-da
    Ablative      adam-dan  adam-lar-dan

Fusional language (slide from Holger Diessel)
(2) Russian
                   SG       PL        SG      PL
    Nominative     stol     stol-y    lip-a   lip-y
    Accusative     stol     stol-y    lip-u   lip-y
    Genitive       stol-a   stol-ov   lip-y   lip
    Dative         stol-u   stol-am   lip-e   lip-am
    Instrumental   stol-om  stol-ami  lip-oj  lip-ami
    Prepositional  stol-e   stol-ax   lip-e   lip-ax

Syntactic Variation
- SVO (Subject-Verb-Object) languages
  - English, German, French, Mandarin
- SOV languages
  - Japanese, Hindi
- VSO languages
  - Irish, Classical Arabic
- SVO lgs generally have prepositions: to Yuriko
- SOV lgs generally have postpositions: Yuriko ni

Segmentation Variation
- Not every writing system has word boundaries marked
  - Chinese, Japanese, Thai, Vietnamese
- Some languages tend to have sentences that are quite long, closer to English paragraphs than sentences:
  - Modern Standard Arabic, Chinese

Inferential Load: cold vs. hot lgs
- Some 'cold' languages require the hearer to do more "figuring out" of who the various actors in the various events are:
  - Japanese, Chinese
- Other 'hot' languages are pretty explicit about saying who did what to whom.
  - English

Inferential Load (2)
- All noun phrases in blue do not appear in Chinese text … But they are needed for a good translation

Lexical Divergences
- Word to phrases:
  - English "computer science" = French "informatique"
- POS divergences
  - Eng. 'she likes/VERB to sing'
  - Ger. 'Sie singt gerne/ADV'
  - Eng. 'I'm hungry/ADJ'
  - Sp. 'tengo hambre/NOUN'

Lexical Divergences: Specificity
- Grammatical constraints
  - English has gender on pronouns, Mandarin not.
  - So translating "3rd person" from Chinese to English, need to figure out gender of the person!
  - Similarly from English "they" to French "ils/elles"
- Semantic constraints
  - English 'brother': Mandarin 'gege' (older) versus 'didi' (younger)
  - English 'wall': German 'Wand' (inside) versus 'Mauer' (outside)
  - German 'Berg': English 'hill' or 'mountain'

Lexical Divergence: many-to-many

Lexical Divergence: lexical gaps
- Japanese: no word for privacy
- English: no word for Cantonese 'haauseun' or Japanese 'oyakoko' (something like 'filial piety')
- English 'cow' vs. 'beef', Cantonese 'ngau'
- English "fish", Spanish "pez" vs. "pescado"

Event-to-argument divergences
- English
  - The bottle floated out.
- Spanish
  - La botella salió flotando.
  - The bottle exited floating
- Verb-framed lg: mark direction of motion on verb
  - Spanish, French, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan, Bantu families
- Satellite-framed lg: mark direction of motion on satellite
  - Crawl out, float off, jump down, walk over to, run after
  - Rest of Indo-European, Hungarian, Finnish, Chinese

Structural divergences
- G: Wir treffen uns am Mittwoch
- E: We'll meet on Wednesday

Head Swapping
- E: X swim across Y
- S: X cruzar Y nadando
- E: I like to eat
- G: Ich esse gern
- E: I'd prefer vanilla
- G: Mir wäre Vanille lieber

Thematic divergence
- S: Y me gusta
- E: I like Y
- G: Mir fällt der Termin ein
- E: I remember the date

Divergence counts from Bonnie Dorr
- 32% of sentences in UN Spanish/English Corpus (5K)

    Type           Spanish              English          %
    Categorial     X tener hambre       Y have hunger    98%
    Conflational   X dar puñaladas a Z  X stab Z         83%
    Structural     X entrar en Y        X enter Y        35%
    Head Swapping  X cruzar Y nadando   X swim across Y  8%
    Thematic       X gustar a Y         Y likes X        6%

3 "Classical" methods for MT
- Direct
- Transfer
- Interlingua

Three MT Approaches: Direct, Transfer, Interlingual

Direct Translation
- Proceed word-by-word through text
- Translating each word
- No intermediate structures except morphology
- Knowledge is in the form of
  - Huge bilingual dictionary
  - Word-to-word translation information
- After word translation, can do simple reordering
  - Adjective ordering English -> French/Spanish

Direct MT Dictionary entry

Direct MT

Problems with direct MT
- German
- Chinese

The Transfer Model
- Idea: apply contrastive knowledge, i.e., knowledge about the difference between two languages
- Steps:
  - Analysis: Syntactically parse Source language
  - Transfer: Rules to turn this parse into parse for Target language
  - Generation: Generate Target sentence from parse tree

English to French
- Generally
  - English: Adjective Noun
  - French: Noun Adjective
  - Note: not always true
    - Route mauvaise 'bad road, badly-paved road'
    - Mauvaise route 'wrong road'
  - But is a reasonable first approximation
  - Rule:

Transfer rules

Lexical transfer
- Transfer-based systems also need lexical transfer rules
- Bilingual dictionary (like for direct MT)
- English home:
- German
  - nach Hause (going home)
  - Heim (home game)
  - Heimat (homeland, home country)
  - zu Hause (at home)
- Can list "at home <-> zu Hause"
- Or do Word Sense Disambiguation

Systran: combining direct and transfer
- Analysis
  - Morphological analysis, POS tagging
  - Chunking of NPs, PPs, phrases
  - Shallow dependency parsing
- Transfer
  - Translation of idioms
  - Word sense disambiguation
  - Assigning prepositions based on governing verbs
- Synthesis
  - Apply rich bilingual dictionary
  - Deal with reordering
  - Morphological generation

Transfer: some problems
- N^2 sets of transfer rules!
- Grammar and lexicon full of language-specific stuff
- Hard to build, hard to maintain

Interlingua
- Intuition: Instead of lg-lg knowledge rules, use the meaning of the sentence to help
- Steps:
  - 1) translate source sentence into meaning representation
  - 2) generate target sentence from meaning.
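
The two interlingua steps above can be sketched roughly as follows. This is a toy illustration, not a real system: the concept names BOOKVOLUME and RESERVE come from the lecture's own example, while the tiny lexicons, the part-of-speech sense choice, and all function names are assumptions made up for the sketch. It also counts modules to show why the approach sidesteps the N^2 sets of transfer rules.

```python
# Toy interlingua pipeline. BOOKVOLUME and RESERVE follow the lecture's example;
# the lexicons and the POS-based sense choice are illustrative assumptions.

# Step 1 resources: one analysis lexicon per language,
# mapping (word, part of speech) -> language-neutral concept.
ANALYSIS = {
    "en": {("book", "NOUN"): "BOOKVOLUME", ("book", "VERB"): "RESERVE"},
    "es": {("libro", "NOUN"): "BOOKVOLUME", ("reservar", "VERB"): "RESERVE"},
}

# Step 2 resources: one generation lexicon per language, mapping concept -> word.
GENERATION = {
    "en": {"BOOKVOLUME": "book", "RESERVE": "book"},
    "es": {"BOOKVOLUME": "libro", "RESERVE": "reservar"},
}


def to_meaning(word, pos, source_lang):
    """Step 1: translate a source word into its meaning representation."""
    return ANALYSIS[source_lang][(word, pos)]


def from_meaning(concept, target_lang):
    """Step 2: generate a target word from the meaning representation."""
    return GENERATION[target_lang][concept]


def translate(word, pos, source_lang, target_lang):
    """Compose the two steps; no language-pair-specific rules are involved."""
    return from_meaning(to_meaning(word, pos, source_lang), target_lang)


def transfer_rule_sets(n_languages):
    """Transfer needs one rule set per ordered language pair."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """Interlingua needs one analyzer and one generator per language."""
    return 2 * n_languages


print(translate("book", "NOUN", "en", "es"))  # libro
print(translate("book", "VERB", "en", "es"))  # reservar
print(transfer_rule_sets(10), interlingua_modules(10))  # 90 20
```

Disambiguating "book" happens once, inside the English analyzer, and every target language reuses the result. The hard part that the sketch hides is choosing the right concept in context, which here is reduced to a part-of-speech lookup.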

Interlingua for Mary did not slap the green witch

Interlingua
- Idea is that some of the MT work that we need to do is part of other NLP tasks
- E.g., disambiguating E:book S:'libro' from E:book S:'reservar'
- So we could have concepts like BOOKVOLUME and RESERVE and solve this problem once for each language

Direct MT: pros and cons (Bonnie Dorr)
- Pros
  - Fast
  - Simple
  - Cheap
  - No translation rules hidden in lexicon
- Cons
  - Unreliable
  - Not powerful
  - Rule proliferation
  - Requires lots of context
  - Major restructuring after lexical substitution

Interlingual MT: pros and cons (B. Dorr)
- Pros
  - Avoids the N^2 problem
  - Easier to write rules
- Cons
  - Semantics is HARD
  - Useful information lost (paraphrase)

Moving toward Statistical MT!

Warren Weaver (1947) (Kevin Knight slide)
When I look at an article in Russian, I say to myself: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.

Rosetta Stone (Kevin Knight slide)
- Carved in 196 BC
- Found in 1799
- Decoded in 1822
- Egyptian hieroglyphs, Egyptian Demotic, Greek

Centauri/Arcturan [Knight, 1997] (slide from Kevin Knight)
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
    1a. ok-voon ororok sprok .
    1b. at-voon bichat dat .
    2a. ok-drubel ok-voon anok plok sprok .
    2b. at-drubel at-voon pippat rrat dat .
    3a. erok sprok izok hihok ghirok .
    3b. totat dat arrat vat hilat .
    4a. ok-voon anok drok brok jok .
    4b. at-voon krat pippat sat lat .
    5a. wiwok farok izok stok .
    5b. totat jjat quat cat .
    6a. lalok sprok izok jok stok .
    6b. wat dat krat quat cat .
    7a. lalok farok ororok lalok sprok izok enemok .
    7b. wat jjat bichat wat dat vat eneat .
    8a. lalok brok anok plok nok .
    8b. iat lat pippat rrat nnat .
    9a. wiwok nok izok kantok ok-yurp .
    9b. totat nnat quat oloat at-yurp .
    10a. lalok mok nok yorok ghirok clok .
    10b. wat nnat gat mat bat hilat .
    11a. lalok nok crrrok hihok yorok zanzanok .
    11b. wat nnat arrat mat zanzanat .
    12a. lalok rarok nok izok hihok mok .
    12b. wat nnat forat arrat vat gat .
The slide repeats as the translation is worked out word by word; the annotations on the repeats are:
- ???
- process of elimination
- cognate?

Centauri/Arcturan [Knight, 1997] (Kevin Knight slide)
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
(same corpus as above)
- zero fertility

It's Really Spanish/English (slide from Kevin Knight)
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
    1a. Garcia and associates .
    1b. Garcia y asociados .
    2a. Carlos Garcia has three associates .
    2b. Carlos Garcia tiene tres asociados .
    3a. his associates are not strong .
    3b. sus asociados no son fuertes .
    4a. Garcia has a company also .
    4b. Garcia tambien tiene una empresa .
    5a. its clients are angry .
    5b. sus clientes estan enfadados .
    6a. the associates are also angry .
    6b. los asociados tambien estan enfadados .
    7a. the clients and the associates are enemies .
    7b. los clientes y los asociados son enemigos .
    8a. the company has three groups .
    8b. la empresa tiene tres grupos .
    9a. its groups are in Europe .
    9b. sus grupos estan en Europa .
    10a. the modern groups sell strong pharmaceuticals .
    10b. los grupos modernos venden medicinas fuertes .
    11a. the groups do not sell zenzanine .
    11b. los grupos no venden zanzanina .
    12a. the small groups are not modern .
    12b. los grupos pequenos no son modernos .

How does statistical MT work?
- Surprising: Intuition comes from the impossibility of translation!
- Consider Hebrew "adonai roi" ("the lord is my shepherd") for a culture without sheep or shepherds!
- Something fluent and understandable, but not faithful:
  - "The Lord will look after me"
- Something faithful, but not fluent and natural:
  - "The Lord is for me like somebody who looks after animals with cotton-like hair"

What makes a good translation
- Translators often talk about two factors we want to maximize:
- Faithfulness or fidelity
  - How close is the meaning of the translation to the meaning of the original
  - (Even better: does the translation cause the reader to draw the same inferences as the original would have)
- Fluency or naturalness
  - How natural the translation is, just considering its fluency in the target language

Statistical MT: Faithfulness and Fluency formalized!
- Best translation of a source sentence S:
  $\hat{T} = \arg\max_T \mathrm{fluency}(T)\,\mathrm{faithfulness}(T, S)$
- Developed by researchers who were originally in speech recognition at IBM
- Called the IBM model

The IBM model
- Hmm, those two factors might look familiar…
  $\hat{T} = \arg\max_T \mathrm{fluency}(T)\,\mathrm{faithfulness}(T, S)$
- Yup, it's Bayes rule:
  $\hat{T} = \arg\max_T P(T)\,P(S \mid T)$

More formally
- Assume we are translating from a foreign language sentence F to an English sentence E:
  - $F = f_1, f_2, f_3, \ldots, f_m$
- We want to find the best English sentence
  - $\hat{E} = e_1, e_2, e_3, \ldots, e_n$
  - $\hat{E} = \arg\max_E P(E \mid F)$
  - $= \arg\max_E P(F \mid E)\,P(E) / P(F)$
  - $= \arg\max_E P(F \mid E)\,P(E)$
- $P(F \mid E)$: Translation Model
- $P(E)$: Language Model

The noisy channel model for MT

Fluency: P(T)
- How to measure that this sentence
  - That car was almost crash onto me
- is less fluent than this one:
  - That car almost hit me.
- Answer: language models (N-grams!)
  - For example P(hit|almost) > P(was|almost)
  - But can use any other more sophisticated model of grammar
  - Advantage: this is monolingual knowledge!

Faithfulness: P(S|T)
- French: ça me plait [that me pleases]
- English:
  - that pleases me - most fluent
  - I like it
  - I'll take that one
- How to quantify this?
- Intuition: degree to which words in one sentence are plausible translations of words in other sentence
- Product of probabilities that each word in target sentence would generate each word in source sentence.

Faithfulness P(S|T)
- Need to know, for every target language word, probability of it mapping to every source language word.
- How do we learn these probabilities?
- Parallel texts!
  - Lots of times we have two texts that are translations of each other
  - If we knew which word in Source Text mapped to each word in Target Text, we could just count!

Faithfulness P(S|T)
- Sentence alignment:
  - Figuring out which source language sentence maps to which target language sentence
- Word alignment:
  - Figuring out which source language word maps to which target language word

Big Point about Faithfulness and Fluency
- Job of the faithfulness model P(S|T) is just to model "bag of words"; which words come from say English to Spanish.
- P(S|T) doesn't have to worry about internal facts about Spanish word order: that's the job of P(T)
- P(T) can do bag generation: put the following words in order (from Kevin Knight)
  - have programming a seen never I language better
  - actual the hashing is since not collision-free usually the is less perfectly the of somewhat capacity table

P(T) and bag generation: the answer
- "Usually the actual capacity of the table is somewhat less, since the hashing is not collision-free"
- How about:
  - loves Mary John

Summary
- Intro and a little history
- Language Similarities and Divergences
- Three classic MT Approaches
  - Transfer
  - Interlingua
  - Direct
- Intuition of Modern Statistical MT ...
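
The bag-generation idea from the P(T) slides can be sketched with a small bigram language model. This is only an illustration under loud assumptions: the four-sentence training corpus, the add-one smoothing, and the function names are all made up here, and brute-force search over permutations is used purely because the bag is tiny. None of it is the IBM model's actual machinery.

```python
import math
from collections import Counter
from itertools import permutations

# Tiny toy corpus standing in for monolingual training text (an assumption).
CORPUS = [
    "that car almost hit me",
    "that car hit me",
    "the car almost hit the wall",
    "me and that car",
]


def train_bigrams(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return contexts, bigrams, len(vocab)


def log_prob(words, model):
    """Add-one smoothed bigram log-probability of a word sequence: log P(T)."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def bag_generate(bag, model):
    """Return the ordering of the bag that the language model scores highest."""
    return max(permutations(bag), key=lambda order: log_prob(order, model))


MODEL = train_bigrams(CORPUS)
best = bag_generate(["hit", "me", "car", "almost", "that"], MODEL)
print(" ".join(best))  # that car almost hit me
```

Under this toy model the fluent ordering wins because it is the only one built entirely from bigrams seen in training. The "loves Mary John" bag presumably marks the limit of the idea: a language model alone has no way to tell "John loves Mary" from "Mary loves John", which is where the faithfulness model P(S|T) has to contribute.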

This document was uploaded on 06/01/2011.

Ask a homework question - tutors are online