A decoder state is q = (e, b, α), where:
• e is the partial English sentence
• b is a bit vector recording which source words are translated
• α is the score of the translation so far

Decoder Pseudocode
Initialization: Set beam Q = {q0}, where q0 is the initial state with no words translated
For i = 0 … n−1 [where n is the input sentence length]
• For each state q ∈ beam(Q) and phrase p ∈ ph(q)
  1. q' = next(q, p) [compute the new state]
  2. Add(Q, q', q, p) [add the new state to the beam]
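As a concrete illustration, here is a minimal Python sketch of this loop. Everything in it (State, Phrase, applicable, scoring by phrase log probability alone) is an illustrative assumption rather than any particular decoder's API; a real system would also add language-model and distortion scores in next_state.

    from collections import namedtuple

    # Full state: partial English sentence e, coverage bit vector b,
    # score alpha so far, and a backpointer for recovering the output.
    State = namedtuple("State", "e b alpha back")

    # A phrase option: source span [i, j), English text, log probability.
    Phrase = namedtuple("Phrase", "i j english logprob")

    def applicable(q, phrase_table):
        """ph(q): phrases whose source span is still untranslated in q."""
        for p in phrase_table:
            if all(q.b[k] == 0 for k in range(p.i, p.j)):
                yield p

    def next_state(q, p):
        """next(q, p): extend q with phrase p, flipping its coverage bits."""
        b = list(q.b)
        for k in range(p.i, p.j):
            b[k] = 1
        return State((q.e + " " + p.english).strip(), tuple(b),
                     q.alpha + p.logprob, q)

    def decode(n, phrase_table, beam_size=100):
        """n: input sentence length; phrase_table: list of Phrase options."""
        q0 = State("", (0,) * n, 0.0, None)   # initial state: nothing translated
        Q = [q0]
        for _ in range(n):                    # for i = 0 ... n-1
            expanded = []
            for q in Q:                       # for each q in beam(Q), p in ph(q)
                exts = [next_state(q, p) for p in applicable(q, phrase_table)]
                expanded.extend(exts or [q])  # finished states survive unchanged
            # Add(Q, q', q, p): the beam keeps only the best few states
            Q = sorted(expanded, key=lambda s: s.alpha, reverse=True)[:beam_size]
        return max(Q, key=lambda s: s.alpha).e

With the compact representation described below, e would be replaced by its last two words plus the length r, and the backpointer chain would be followed to reassemble the final translation.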
Possible State Representations:
• Full: q = (e, b, α), e.g. ("Joe did not give," 11000000, 0.092)
• Compact: q = (e1, e2, b, r, α), e.g. ("not," "give," 11000000, 4, 0.092)
  • e1 and e2 are the last two words of the partial translation
  • r is the length of the partial translation
• Compact representation is more efficient, but requires back pointers to get the final translation

Pruning
§ Recombination is not sufficient
§ Problem: easy partial analyses are cheaper

Hypothesis Queues
§ Solution 1: separate beam for each number of foreign words covered
  – hypotheses are kept in priority queues, e.g. indexed by the number of foreign words covered
§ Solution 2: estimate forward costs (A*-like)
[Figure: hypothesis queues indexed by number of foreign words covered, with English words produced along the other axis]
• Organization of hypotheses into queues
  – here: based on number of foreign words translated
  – during translation all hypotheses from one stack are expanded
  – expanded hypotheses are placed into queues
• Compare hypotheses in a queue, discard bad ones
  – discard weak hypotheses early
  – histogram pruning: keep the top n hypotheses in each queue (e.g., n = 100)
  – threshold pruning: keep hypotheses that are at most α times the cost of the best hypothesis in the queue (e.g., α = 0.001)
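A sketch of the two pruning strategies, assuming hypotheses carry a log-probability score alpha as in the decoder sketch above; the function name prune and its defaults are illustrative.

    import math

    def prune(queue, n=100, alpha=0.001):
        """Prune one hypothesis queue.

        Histogram pruning: keep at most the n best hypotheses.
        Threshold pruning: keep only hypotheses whose probability is at
        least alpha times that of the best in the queue, i.e. whose
        log score is within log(alpha) of the best log score.
        """
        queue = sorted(queue, key=lambda h: h.alpha, reverse=True)[:n]
        if queue:
            cutoff = queue[0].alpha + math.log(alpha)
            queue = [h for h in queue if h.alpha >= cutoff]
        return queue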
Decoder Pseudocode (Multibeam)
Initialization:
• Set Q0 = {q0}, Qi = {} for i = 1 … n [n is the input sentence length]
For i = 0 … n−1
• For each state q ∈ beam(Qi) and phrase p ∈ ph(q)
  1. q' = next(q, p)
  2. Add(Qj, q', q, p), where j = len(q')
Notes:
• Qi is a beam of all partial translations where i input words have been translated
• len(q) is the number of bits equal to one in q (the number of words that have been translated)
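A sketch of the multibeam loop, reusing State, applicable, next_state, and prune from the earlier sketches (all of them illustrative assumptions):

    def decode_multibeam(n, phrase_table, beam_size=100):
        Qs = [[] for _ in range(n + 1)]
        Qs[0] = [State("", (0,) * n, 0.0, None)]      # Q0 = {q0}, Qi = {}
        for i in range(n):                             # for i = 0 ... n-1
            for q in prune(Qs[i], n=beam_size):        # q in beam(Qi)
                for p in applicable(q, phrase_table):  # p in ph(q)
                    q2 = next_state(q, p)              # q' = next(q, p)
                    j = sum(q2.b)                      # j = len(q')
                    Qs[j].append(q2)                   # Add(Qj, q', q, p)
        # assumes at least one hypothesis covers all n input words
        return max(Qs[n], key=lambda s: s.alpha).e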
Tons of Data?
§ Discussed for LMs, but we can now understand the full model!

Tuning for MT
§ Features encapsulate lots of information
§ Basic MT systems have around 6 features
§ P(e|f), P(f|e), lexical weighting, language model
§ How to tune feature weights?
§ Idea 1: Use your favorite classifier
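For concreteness, the model being tuned is linear in its features: a candidate's score is a dot product of feature values and weights. The feature names below follow the slide; the numeric values are made up for illustration.

    def model_score(w, h):
        """Linear model: score = sum over features of weight * value."""
        return sum(w[k] * h[k] for k in h)

    h = {"log_p_e_f": -2.3, "log_p_f_e": -3.1,     # P(e|f), P(f|e)
         "lex": -1.7, "log_lm": -8.4}              # lexical weighting, LM
    w = {"log_p_e_f": 1.0, "log_p_f_e": 0.5,
         "lex": 0.3, "log_lm": 0.8}                # the weights to be tuned
    print(model_score(w, h))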
Why Tuning is Hard
§ Problem 1: There are latent variables
§ Alignments and segmentations
§ Possibility: forced decoding (but it can go badly)

Why Tuning is Hard
§ Problem 2: There are many right answers
§ The reference or references are just a few options
§ No good characterization of the whole class
§ BLEU isn't perfect, but even if you trust it, it's a corpus-level metric, not sentence-level
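To make the corpus-level point concrete, here is a minimal single-reference BLEU sketch (a deliberately simplified version of the metric): clipped n-gram matches and totals are pooled over the whole corpus before precisions are taken, so the score does not decompose into per-sentence scores.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def corpus_bleu(cands, refs, max_n=4):
        """cands, refs: parallel lists of token lists (one reference each)."""
        match, total = [0] * max_n, [0] * max_n
        c_len = r_len = 0
        for cand, ref in zip(cands, refs):
            c_len, r_len = c_len + len(cand), r_len + len(ref)
            for n in range(1, max_n + 1):
                cn, rn = ngrams(cand, n), ngrams(ref, n)
                match[n - 1] += sum(min(c, rn[g]) for g, c in cn.items())
                total[n - 1] += sum(cn.values())
        if 0 in match or 0 in total:
            return 0.0
        log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
        bp = min(1.0, math.exp(1 - r_len / c_len))   # brevity penalty
        return bp * math.exp(log_prec)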
Linear Models: Perceptron
§ The perceptron algorithm
§ Iteratively processes the training set, reacting to training errors
§ Can be thought of as trying to drive down training error
§ The (online) perceptron algorithm:
§ Start with zero weights
§ Visit training instances (xi,yi) one by one
§ Make a prediction: y* = arg max_y w · φ(xi, y)
§ If correct (y* == yi): no change, goto next example!
§ If wrong: adjust weights: w = w + φ(xi, yi) − φ(xi, y*)
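A sketch of that loop in Python; phi(x, y) (the feature map) and candidates(x) (the set of possible outputs) are hypothetical stand-ins for the task-specific pieces.

    from collections import defaultdict

    def dot(w, f):
        """Dot product of a weight vector and a sparse feature dict."""
        return sum(w[k] * v for k, v in f.items())

    def perceptron(data, phi, candidates, epochs=5):
        w = defaultdict(float)                     # start with zero weights
        for _ in range(epochs):
            for x, y in data:                      # visit (xi, yi) one by one
                y_hat = max(candidates(x),         # y* = argmax_y w . phi(x, y)
                            key=lambda y2: dot(w, phi(x, y2)))
                if y_hat != y:                     # if wrong: adjust weights,
                    for k, v in phi(x, y).items():
                        w[k] += v                  # w += phi(xi, yi)
                    for k, v in phi(x, y_hat).items():
                        w[k] -= v                  # w -= phi(xi, y*)
        return w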
Why Tuning is Hard
§ Problem 3: Computational constraints
§ Discriminative training involves repeated decoding
§ Very slow! So people tune on sets much smaller than those used to build phrase tables

Minimum Error Rate Training
§ Standard method: minimize BLEU directly (Och 03)
§ MERT optimizes a discontinuous objective
§ Only works for max ~10 features, but works very well then
§ Here: k-best lists, but forest methods exist (Macherey et al., 08)

[Figure: MERT and Convex Upper Bound of BLEU: model score and BLEU score as functions of θ]
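As a rough illustration of tuning on k-best lists, here is a coordinate-wise grid search that rescores the fixed lists and keeps whatever weights give the best corpus BLEU, reusing dot and corpus_bleu from the sketches above. Real MERT (Och 03) replaces the grid with an exact line search along each direction; everything here (function names, the [-2, 2] grid) is an illustrative assumption.

    def rescore(kbest, w):
        """kbest: one list per sentence of (tokens, feature_dict) candidates.
        Return the 1-best tokens per sentence under weights w."""
        return [max(cands, key=lambda c: dot(w, c[1]))[0] for cands in kbest]

    def tune_sketch(kbest, refs, feat_names, steps=21):
        w = {k: 1.0 for k in feat_names}
        for k in feat_names:                         # one coordinate at a time
            best_v, best_bleu = w[k], -1.0
            for s in range(steps):                   # crude grid over [-2, 2]
                w[k] = -2.0 + 4.0 * s / (steps - 1)
                bleu = corpus_bleu(rescore(kbest, w), refs)
                if bleu > best_bleu:
                    best_bleu, best_v = bleu, w[k]
            w[k] = best_v
        return w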
