Recitation: Boosting

A Formal Description of Boosting
• given training set $(x_1, y_1), \ldots, (x_m, y_m)$
• $y_i \in \{-1, +1\}$ is the correct label of instance $x_i \in X$
• for $t = 1, \ldots, T$:
  • construct distribution $D_t$ on $\{1, \ldots, m\}$
  • find weak classifier ("rule of thumb") $h_t : X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$: $\epsilon_t = \Pr_{D_t}[h_t(x_i) \neq y_i]$
• output final classifier $H_{\mathrm{final}}$

AdaBoost [with Freund]
• constructing $D_t$:
  • $D_1(i) = 1/m$
  • given $D_t$ and $h_t$:
      $D_{t+1}(i) = \dfrac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \dfrac{D_t(i)\, \exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$
    where $Z_t$ is a normalization constant and $\alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t} > 0$
• final classifier:
      $H_{\mathrm{final}}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$
  (a worked numeric check of $\alpha_t$ and a runnable sketch of the full procedure are given at the end of this note)

Toy Example
• weak classifiers = vertical or horizontal half-planes
• start from the uniform distribution $D_1$
• Round 1: weak classifier $h_1$ with $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$; reweight to get $D_2$
• Round 2: weak classifier $h_2$ with $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$; reweight to get $D_3$
• Round 3: weak classifier $h_3$ with $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$
• Final classifier: $H_{\mathrm{final}} = \mathrm{sign}(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3)$

Analyzing the training error
• Theorem: write $\epsilon_t$ as $1/2 - \gamma_t$; then
      $\text{training error}(H_{\mathrm{final}}) \le \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right)$
• so: if $\forall t: \gamma_t \ge \gamma > 0$, then $\text{training error}(H_{\mathrm{final}}) \le e^{-2\gamma^2 T}$
• AdaBoost is adaptive:
  • does not need to know $\gamma$ or $T$ a priori
  • can exploit $\gamma_t \gg \gamma$

Proof
• let $f(x) = \sum_t \alpha_t h_t(x)$, so that $H_{\mathrm{final}}(x) = \mathrm{sign}(f(x))$
• Step 1: unwrapping the recurrence:
      $D_{\mathrm{final}}(i) = \dfrac{1}{m} \dfrac{\exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right)}{\prod_t Z_t} = \dfrac{1}{m} \dfrac{\exp(-y_i f(x_i))}{\prod_t Z_t}$

Proof (cont.)
• Step 2: $\text{training error}(H_{\mathrm{final}}) \le \prod_t Z_t$
• Proof:
      $\text{training error}(H_{\mathrm{final}}) = \dfrac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i \neq H_{\mathrm{final}}(x_i) \\ 0 & \text{else} \end{cases} = \dfrac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i f(x_i) \le 0 \\ 0 & \text{else} \end{cases}$
      $\le \dfrac{1}{m} \sum_i \exp(-y_i f(x_i)) = \sum_i D_{\mathrm{final}}(i) \prod_t Z_t = \prod_t Z_t$

Proof (cont.)
• Step 3: $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$
• Proof:
      $Z_t = \sum_i D_t(i) \exp(-\alpha_t\, y_i\, h_t(x_i)) = \sum_{i: y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha_t} + \sum_{i: y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t}$
      $= \epsilon_t\, e^{\alpha_t} + (1-\epsilon_t)\, e^{-\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$
  (the last equality follows by plugging in $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, which gives $e^{\alpha_t} = \sqrt{(1-\epsilon_t)/\epsilon_t}$ and $e^{-\alpha_t} = \sqrt{\epsilon_t/(1-\epsilon_t)}$, so both terms equal $\sqrt{\epsilon_t(1-\epsilon_t)}$)

How Will Test Error Behave? (A First Guess)
[figure: hypothesized training and test error curves vs. number of rounds $T$]
• expect:
  • training error to continue to drop (or reach zero)
  • test error to increase when $H_{\mathrm{final}}$ becomes "too complex"
    • "Occam's razor"
    • overfitting
• hard to know when to stop training

Actual Typical Run
(boosting C4.5 on the "letter" dataset)
[figure: training and test error of boosted C4.5 vs. number of rounds $T$, 10 to 1000]
• test error does not increase, even after 1000 rounds
  • (total size > 2,000,000 nodes)
• test error continues to drop even after training error is zero!

      # rounds          5     100    1000
      train error (%)   0.0    0.0    0.0
      test error (%)    8.4    3.3    3.1

• Occam's razor wrongly predicts "simpler" rule is better ...
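As a quick check of the weight formula against the toy example (a worked computation, not part of the original slides), plugging the round-1 error $\epsilon_1 = 0.30$ into the definition of $\alpha_t$ reproduces the value quoted above:

    $\alpha_1 = \frac{1}{2}\ln\frac{1-\epsilon_1}{\epsilon_1} = \frac{1}{2}\ln\frac{0.70}{0.30} = \frac{1}{2}\ln(2.33\ldots) \approx 0.42$
    $Z_1 = 2\sqrt{\epsilon_1(1-\epsilon_1)} = 2\sqrt{0.30 \cdot 0.70} = 2\sqrt{0.21} \approx 0.92$

so already after one round the theorem bounds the training error by about 0.92, and each further round multiplies the bound by another factor $Z_t < 1$.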
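To make the pseudocode concrete, below is a minimal NumPy sketch of AdaBoost with axis-aligned decision stumps (the "vertical or horizontal half-plane" weak learners of the toy example). The synthetic dataset, the stump search, and the helper names (best_stump, stump_predict, adaboost) are illustrative assumptions, not part of the recitation; the sketch also prints the bound $\prod_t Z_t$ from the theorem next to the achieved training error.

    # Minimal AdaBoost sketch with axis-aligned decision stump weak learners.
    import numpy as np

    def best_stump(X, y, D):
        # exhaustive search for the stump (feature, threshold, sign) with the
        # smallest weighted error under the current distribution D
        m, n = X.shape
        best, best_err = None, np.inf
        for j in range(n):
            for thresh in np.unique(X[:, j]):
                for sign in (+1, -1):
                    pred = np.where(X[:, j] > thresh, sign, -sign)
                    err = D[pred != y].sum()
                    if err < best_err:
                        best_err, best = err, (j, thresh, sign)
        return best, best_err

    def stump_predict(stump, X):
        j, thresh, sign = stump
        return np.where(X[:, j] > thresh, sign, -sign)

    def adaboost(X, y, T=10):
        m = len(y)
        D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
        stumps, alphas, Zs = [], [], []
        for t in range(T):
            stump, eps = best_stump(X, y, D)       # weak classifier h_t, error eps_t
            eps = np.clip(eps, 1e-12, 1 - 1e-12)   # guard against a perfect stump
            alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = 1/2 ln((1-eps_t)/eps_t)
            pred = stump_predict(stump, X)
            D = D * np.exp(-alpha * y * pred)      # D_t(i) exp(-alpha_t y_i h_t(x_i))
            Z = D.sum()                            # normalization constant Z_t
            D = D / Z
            stumps.append(stump); alphas.append(alpha); Zs.append(Z)
        def H_final(Xq):                           # H_final(x) = sign(sum_t alpha_t h_t(x))
            scores = sum(a * stump_predict(s, Xq) for a, s in zip(alphas, stumps))
            return np.sign(scores)
        return H_final, Zs

    # usage on a small synthetic 2-D problem: training error never exceeds prod_t Z_t
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(40, 2))
    y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)
    H, Zs = adaboost(X, y, T=10)
    print("training error:", np.mean(H(X) != y), " bound prod Z_t:", np.prod(Zs))

Note that the distribution update and the bound come straight from the slides; only the choice of weak learner (an exhaustive stump search) and the toy dataset are filled in here for the sake of a runnable example.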