
# Recitation: Boosting



## A Formal Description of Boosting

- given training set $(x_1, y_1), \ldots, (x_m, y_m)$
- $y_i \in \{-1, +1\}$ is the correct label of instance $x_i \in X$
- for $t = 1, \ldots, T$:
  - construct distribution $D_t$ on $\{1, \ldots, m\}$
  - find weak classifier ("rule of thumb") $h_t : X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$: $\epsilon_t = \Pr_{D_t}[h_t(x_i) \neq y_i]$
- output final classifier $H_{\text{final}}$

## AdaBoost [with Freund]

- constructing $D_t$:
  - $D_1(i) = 1/m$
  - given $D_t$ and $h_t$:
    $$D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t}\exp(-\alpha_t\, y_i\, h_t(x_i))$$
    where $Z_t$ is a normalization constant and
    $$\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right) > 0$$
- final classifier:
  $$H_{\text{final}}(x) = \operatorname{sign}\left(\sum_t \alpha_t h_t(x)\right)$$

## Toy Example

weak classifiers = vertical or horizontal half-planes; $D_1$ is uniform

- Round 1: $h_1$ with $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$; reweight to get $D_2$
- Round 2: $h_2$ with $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$; reweight to get $D_3$
- Round 3: $h_3$ with $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$

Final classifier: $H_{\text{final}} = \operatorname{sign}(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3)$

## Analyzing the Training Error

- Theorem: write $\epsilon_t$ as $1/2 - \gamma_t$; then
  $$\text{training error}(H_{\text{final}}) \le \prod_t \left[2\sqrt{\epsilon_t(1-\epsilon_t)}\right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2\sum_t \gamma_t^2\right)$$
- so: if $\forall t : \gamma_t \ge \gamma > 0$, then $\text{training error}(H_{\text{final}}) \le e^{-2\gamma^2 T}$
- AdaBoost is adaptive:
  - does not need to know $\gamma$ or $T$ a priori
  - can exploit $\gamma_t \gg \gamma$

## Proof

- let $f(x) = \sum_t \alpha_t h_t(x)$, so that $H_{\text{final}}(x) = \operatorname{sign}(f(x))$
- Step 1: unwrapping the recurrence:
  $$D_{\text{final}}(i) = \frac{1}{m} \cdot \frac{\exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right)}{\prod_t Z_t} = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$$
- Step 2: $\text{training error}(H_{\text{final}}) \le \prod_t Z_t$
  - Proof:
    $$\text{training error}(H_{\text{final}}) = \frac{1}{m}\sum_i \begin{cases} 1 & \text{if } y_i \neq H_{\text{final}}(x_i) \\ 0 & \text{else} \end{cases} = \frac{1}{m}\sum_i \begin{cases} 1 & \text{if } y_i f(x_i) \le 0 \\ 0 & \text{else} \end{cases}$$
    $$\le \frac{1}{m}\sum_i \exp(-y_i f(x_i)) = \sum_i D_{\text{final}}(i) \prod_t Z_t = \prod_t Z_t$$
- Step 3: $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$
  - Proof:
    $$Z_t = \sum_i D_t(i)\exp(-\alpha_t\, y_i\, h_t(x_i)) = \sum_{i : y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha_t} + \sum_{i : y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t}$$
    $$= \epsilon_t\, e^{\alpha_t} + (1-\epsilon_t)\, e^{-\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$$

## How Will Test Error Behave? (A First Guess)

[figure: hypothetical plot of train and test error vs. number of rounds $T$]

expect:

- training error to continue to drop (or reach zero)
- test error to increase when $H_{\text{final}}$ becomes "too complex"
  - "Occam's razor"
  - overfitting
  - hard to know when to stop training

## Actual Typical Run

[figure: train and test error vs. number of rounds $T$, boosting C4.5 on the "letter" dataset]

- test error does not increase, even after 1000 rounds
  - (total ensemble size > 2,000,000 nodes)
- test error continues to drop even after training error is zero!

| # rounds | 5 | 100 | 1000 |
|---|---|---|---|
| train error (%) | 0.0 | 0.0 | 0.0 |
| test error (%) | 8.4 | 3.3 | 3.1 |

- Occam's razor wrongly predicts that the "simpler" rule is better
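The loop described earlier (uniform $D_1$, picking the weak classifier with lowest weighted error, multiplicative reweighting by $\exp(-\alpha_t\, y_i\, h_t(x_i))$, and a weighted-vote final classifier) can be sketched in Python using axis-aligned decision stumps as the weak learners, the 1-D analogue of the toy example's half-plane classifiers. The dataset and the helper names (`best_stump`, `stump_predict`, `adaboost`) are illustrative assumptions, not from the slides.

```python
import numpy as np

def best_stump(X, y, D):
    """Exhaustively pick the axis-aligned threshold stump with the
    smallest weighted error under the distribution D (illustrative weak learner)."""
    m, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thresh, sign, -sign)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    return best, best_err

def stump_predict(stump, X):
    j, thresh, sign = stump
    return np.where(X[:, j] <= thresh, sign, -sign)

def adaboost(X, y, T):
    m = len(y)
    D = np.full(m, 1.0 / m)            # D_1(i) = 1/m
    ensemble = []                      # list of (alpha_t, h_t)
    for t in range(T):
        h, eps = best_stump(X, y, D)   # weak classifier with error eps_t on D_t
        if eps >= 0.5:                 # no edge over random guessing: stop
            break
        if eps == 0:                   # perfect weak classifier: keep it and stop
            ensemble.append((1.0, h))
            break
        alpha = 0.5 * np.log((1 - eps) / eps)   # alpha_t = 1/2 ln((1 - eps_t)/eps_t) > 0
        D *= np.exp(-alpha * y * stump_predict(h, X))  # D_{t+1} ∝ D_t exp(-alpha_t y_i h_t(x_i))
        D /= D.sum()                   # normalize by Z_t
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    # H_final(x) = sign(sum_t alpha_t h_t(x))
    f = sum(a * stump_predict(h, X) for a, h in ensemble)
    return np.where(f >= 0, 1, -1)

# Tiny 1-D dataset: the positive class is an interval, so no single
# stump is perfect, but three boosted stumps classify it exactly.
X = np.array([[0.], [1.], [2.], [3.], [4.]])
y = np.array([-1, -1, 1, 1, -1])
H = adaboost(X, y, T=3)
print((predict(H, X) == y).mean())  # → 1.0 (all 5 training points correct)
```

Note the stopping rule: the derivation of $\alpha_t$ assumes $0 < \epsilon_t < 1/2$, so the sketch stops when the weak learner loses its edge.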
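As a numerical sanity check (not from the slides), plugging the toy example's errors $\epsilon_t = 0.30, 0.21, 0.14$ into the formulas above approximately reproduces the stated $\alpha_t$ values and confirms the chain of inequalities in the training-error theorem:

```python
import math

eps = [0.30, 0.21, 0.14]  # eps_t from the three toy-example rounds

alphas = [0.5 * math.log((1 - e) / e) for e in eps]  # alpha_t = 1/2 ln((1-eps_t)/eps_t)
gammas = [0.5 - e for e in eps]                      # edge gamma_t = 1/2 - eps_t

bound1 = math.prod(2 * math.sqrt(e * (1 - e)) for e in eps)    # prod 2 sqrt(eps_t (1-eps_t))
bound2 = math.prod(math.sqrt(1 - 4 * g * g) for g in gammas)   # prod sqrt(1 - 4 gamma_t^2)
bound3 = math.exp(-2 * sum(g * g for g in gammas))             # exp(-2 sum gamma_t^2)

print([round(a, 2) for a in alphas])  # ~[0.42, 0.66, 0.91], close to the slides' 0.42/0.65/0.92
print(round(bound1, 3), round(bound2, 3), round(bound3, 3))
```

The first two bounds are the same quantity rewritten ($\epsilon(1-\epsilon) = \tfrac14(1 - 4\gamma^2)$), and both come out below the $\exp(-2\sum_t \gamma_t^2)$ relaxation, as the theorem requires. The small differences from the slides' $\alpha_t$ values presumably come from the slides rounding $\epsilon_t$ to two digits.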

