A decoder state is q = (e, b, α), where:
•  e is the partial English sentence
•  b is a bit vector recording which source words have been translated
•  α is the score of the translation so far

Decoder Pseudocode
Initialization: set beam Q = {q0}, where q0 is the initial state with no words translated.
For i = 0 … n-1 [where n is the input sentence length]:
•  For each state q ∈ beam(Q) and phrase p ∈ ph(q):
   1.  q' = next(q, p)  [compute the new state]
   2.  Add(Q, q', q, p)  [add the new state to the beam]

Possible State Representations
•  Full: q = (e, b, α), e.g. ("Joe did not give", 11000000, 0.092)
•  Compact: q = (e1, e2, b, r, α), e.g. ("not", "give", 11000000, 4, 0.092)
   •  e1 and e2 are the last two words of the partial translation
   •  r is the length of the partial translation
•  The compact representation is more efficient, but requires back pointers to recover the final translation

Pruning
•  Problem: easy partial analyses are cheaper, so comparing hypotheses by the score of the partial translation alone is not sufficient
•  Solution 1: separate beams, e.g. one priority queue for each number of foreign words covered
•  Solution 2: estimate forward costs (A*-like) and discard weak hypotheses early

Hypothesis Queues
[Figure: hypothesis queues indexed by the number of foreign words covered and English words produced]
•  Organization of hypotheses into queues
   –  here: based on the number of foreign words translated
   –  during translation, all hypotheses from one stack are expanded
   –  expanded hypotheses are placed into the queues
•  Pruning limits the number of hypotheses in each queue (see the pruning sketch after these slides):
   –  histogram pruning: keep the top n hypotheses in each queue (e.g., n = 100)
   –  threshold pruning: keep hypotheses that are at most α times the cost of the best hypothesis in the queue (e.g., α = 0.001)

Decoder Pseudocode (Multibeam)
Initialization: set Q0 = {q0} and Qi = {} for i = 1 … n [n is the input sentence length].
For i = 0 … n-1:
•  For each state q ∈ beam(Qi) and phrase p ∈ ph(q):
   1.  q' = next(q, p)
   2.  Add(Qj, q', q, p), where j = len(q')
Notes:
•  Qi is a beam of all partial translations in which i input words have been translated
•  len(q) is the number of bits equal to one in q's bit vector (the number of source words that have been translated)
•  A runnable Python sketch of this multibeam loop is given after these slides.

Tons of Data?
•  Discussed for LMs, but we can now understand the full model!

Tuning for MT
•  Features encapsulate lots of information
•  Basic MT systems have around 6 features: P(e|f), P(f|e), lexical weighting, language model
•  How to tune feature weights?
•  Idea 1: use your favorite classifier

Why Tuning is Hard
•  Problem 1: there are latent variables
   •  Alignments and segmentations
   •  Possibility: forced decoding (but it can go badly)

Why Tuning is Hard
•  Problem 2: there are many right answers
   •  The reference or references are just a few options
   •  No good characterization of the whole class
   •  BLEU isn't perfect, but even if you trust it, it's a corpus-level metric, not a sentence-level one

Linear Models: Perceptron
•  The perceptron algorithm
   •  Iteratively processes the training set, reacting to training errors
   •  Can be thought of as trying to drive down training error
•  The (online) perceptron algorithm (a code sketch follows these slides):
   •  Start with zero weights
   •  Visit training instances (xi, yi) one by one
   •  Make a prediction: y* = argmax_y w · φ(xi, y)
   •  If correct (y* == yi): no change, go to the next example
   •  If wrong: adjust the weights: w = w + φ(xi, yi) − φ(xi, y*)

Why Tuning is Hard
•  Problem 3: computational constraints
   •  Discriminative training involves repeated decoding: very slow!
   •  So people tune on sets much smaller than those used to build phrase tables
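Returning to the decoding slides above: the following is a minimal, runnable sketch of the multibeam decoder loop under simplifying assumptions. It uses the "full" state representation (so no back pointers are needed), scores hypotheses by phrase scores alone (no language model or distortion cost), and prunes each queue with histogram pruning. The names `phrase_table`, `applicable_phrases`, `next_state`, `decode`, `BEAM_SIZE`, and the toy data are illustrative inventions, not part of the lecture or any library.

```python
from collections import namedtuple

# State = (partial English, coverage bit vector, score so far).
State = namedtuple("State", ["english", "bits", "score"])
BEAM_SIZE = 100  # histogram pruning: keep top n hypotheses per queue


def applicable_phrases(state, phrase_table):
    """ph(q): phrase options whose source span is still untranslated."""
    for (i, j), options in phrase_table.items():
        if all(b == 0 for b in state.bits[i:j]):
            for english, score in options:
                yield (i, j), english, score


def next_state(state, span, english, phrase_score):
    """next(q, p): extend the hypothesis with one phrase."""
    i, j = span
    bits = state.bits[:i] + (1,) * (j - i) + state.bits[j:]
    text = state.english + " " + english if state.english else english
    return State(text, bits, state.score + phrase_score)


def decode(source_len, phrase_table):
    # Q[i] holds partial translations covering exactly i source words.
    Q = [dict() for _ in range(source_len + 1)]
    q0 = State("", (0,) * source_len, 0.0)
    Q[0][q0.bits + ("",)] = q0

    for i in range(source_len):                      # for i = 0 ... n-1
        beam = sorted(Q[i].values(), key=lambda s: -s.score)[:BEAM_SIZE]
        for q in beam:                               # for q in beam(Q_i)
            for span, english, score in applicable_phrases(q, phrase_table):
                q_new = next_state(q, span, english, score)
                j = sum(q_new.bits)                  # len(q') = covered words
                key = q_new.bits + (q_new.english,)  # crude recombination key
                best = Q[j].get(key)
                if best is None or q_new.score > best.score:
                    Q[j][key] = q_new

    # Best hypothesis that covers all source words.
    return max(Q[source_len].values(), key=lambda s: s.score, default=None)


# Toy usage: a 3-word source sentence with a tiny (log-score) phrase table.
toy_table = {
    (0, 1): [("Joe", -0.1)],
    (1, 2): [("did not", -0.4), ("not", -0.6)],
    (2, 3): [("give", -0.2)],
    (1, 3): [("did not give", -0.5)],
}
print(decode(3, toy_table))
```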
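The two pruning rules from the Hypothesis Queues slide can also be written down directly. This sketch assumes hypotheses carry log-probability scores (higher is better); the `Hypothesis` type and the parameter values are illustrative, not from the lecture.

```python
import math
from collections import namedtuple

Hypothesis = namedtuple("Hypothesis", ["english", "score"])


def prune(queue, n=100, alpha=0.001):
    # Histogram pruning: keep only the top n hypotheses in the queue.
    queue = sorted(queue, key=lambda h: h.score, reverse=True)[:n]
    if not queue:
        return queue
    # Threshold pruning: keep hypotheses whose probability is at least
    # alpha times the best one, i.e. log p >= log p_best + log(alpha).
    cutoff = queue[0].score + math.log(alpha)
    return [h for h in queue if h.score >= cutoff]


print(prune([Hypothesis("Joe did not", -1.2),
             Hypothesis("Joe not", -9.5),
             Hypothesis("did Joe not", -2.0)], n=2))
```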
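The perceptron update on the Linear Models slide is easy to state in code. Below is a generic structured-perceptron sketch under the assumption that candidate outputs for each input can be enumerated and featurized; `candidates(x)` and `features(x, y)` are hypothetical stand-ins (in MT, candidates would come from the decoder's k-best list and features from the translation and language models).

```python
from collections import defaultdict


def dot(w, phi):
    # Sparse dot product between a weight map and a feature map.
    return sum(w[f] * v for f, v in phi.items())


def perceptron(train, candidates, features, epochs=5):
    w = defaultdict(float)                      # start with zero weights
    for _ in range(epochs):
        for x, y_gold in train:                 # visit instances one by one
            # Prediction: y* = argmax_y w . phi(x, y)
            y_hat = max(candidates(x), key=lambda y: dot(w, features(x, y)))
            if y_hat == y_gold:
                continue                        # correct: no change
            # Wrong: w = w + phi(x, y_gold) - phi(x, y*)
            for f, v in features(x, y_gold).items():
                w[f] += v
            for f, v in features(x, y_hat).items():
                w[f] -= v
    return w


# Toy usage: learn to pick the candidate that appears in the input string.
data = [("the cat", "cat"), ("a dog", "dog")]
cands = lambda x: ["cat", "dog"]
feats = lambda x, y: {("overlap", y): 1.0 if y in x else 0.0, ("bias", y): 1.0}
print(dict(perceptron(data, cands, feats)))
```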
Minimum Error Rate Training
•  Standard method: minimize BLEU directly (Och 03)
•  The MERT objective is discontinuous
•  Only works for at most ~10 features, but works very well then
•  Here: k-best lists, but forest methods exist (Macherey et al., 08)
•  A simplified sketch of one MERT line search follows.
[Figure: model score and BLEU score plotted as functions of a weight θ]
MERT: Convex Upper Bound of BLEU
...
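The sketch below shows the shape of MERT over a fixed k-best list: pick a search direction in weight space, find the step θ along that direction that minimizes the error of the induced 1-best translations, and repeat. Real MERT (Och 03) solves each line search exactly via the piecewise-linear envelope of model scores and optimizes corpus BLEU, which does not decompose over sentences; for brevity this sketch uses a grid search over θ and an illustrative per-sentence error, so everything here (`kbest` format, `corpus_error`, `mert_line_search`) is a hypothetical simplification.

```python
import numpy as np

# kbest[s] is a list of (feature_vector, sentence_error) pairs for source
# sentence s, taken from a fixed k-best list produced by the decoder.


def corpus_error(weights, kbest):
    total = 0.0
    for entries in kbest:
        feats, errs = zip(*entries)
        scores = np.dot(np.stack(feats), weights)
        total += errs[int(np.argmax(scores))]   # error of the 1-best under w
    return total


def mert_line_search(weights, direction, kbest, grid=np.linspace(-5, 5, 201)):
    # Grid-search stand-in for Och's exact line search along `direction`.
    best_theta = min(grid, key=lambda t: corpus_error(weights + t * direction, kbest))
    return weights + best_theta * direction
```

In practice this line search would be run repeatedly over different directions (e.g., one coordinate at a time), with the system re-decoded between rounds to refresh and merge the k-best lists.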