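A Formal Description of Boosting

• given training set $(x_1, y_1), \ldots, (x_m, y_m)$
• $y_i \in \{-1, +1\}$ correct label of instance $x_i \in X$
• for $t = 1, \ldots, T$:
    • construct distribution $D_t$ on $\{1, \ldots, m\}$
    • find weak classifier ("rule of thumb") $h_t : X \to \{-1, +1\}$ with small error $\epsilon_t$ on $D_t$:
      \[ \epsilon_t = \Pr_{D_t}[h_t(x_i) \neq y_i] \]
• output final classifier $H_{\text{final}}$

AdaBoost [with Freund]

• constructing $D_t$:
    • $D_1(i) = 1/m$
    • given $D_t$ and $h_t$:
      \[ D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t(i)}{Z_t} \exp(-\alpha_t y_i h_t(x_i)) \]
      where $Z_t$ = normalization constant and
      \[ \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0 \]
• final classifier:
    \[ H_{\text{final}}(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right) \]

Toy Example

[Figure: $D_1$]
• weak classifiers = vertical or horizontal half-planes

Round 1
[Figure: $h_1$ and $D_2$]
• $\epsilon_1 = 0.30$, $\alpha_1 = 0.42$

Round 2
[Figure: $h_2$ and $D_3$]
• $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$

Round 3
[Figure: $h_3$]
• $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$

Final Classifier
[Figure: the three half-planes combined]
\[ H_{\text{final}}(x) = \text{sign}\left(0.42\, h_1(x) + 0.65\, h_2(x) + 0.92\, h_3(x)\right) \]

Analyzing the training error

• Theorem:
    • write $\epsilon_t$ as $1/2 - \gamma_t$
    • then
      \[ \text{training error}(H_{\text{final}}) \le \prod_t \left[ 2\sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_t \gamma_t^2\right) \]
• so: if $\forall t : \gamma_t \ge \gamma > 0$ then $\text{training error}(H_{\text{final}}) \le e^{-2\gamma^2 T}$
• AdaBoost is adaptive:
    • does not need to know $\gamma$ or $T$ a priori
    • can exploit $\gamma_t \gg \gamma$

Proof

• let $f(x) = \sum_t \alpha_t h_t(x) \Rightarrow H_{\text{final}}(x) = \text{sign}(f(x))$
• Step 1: unwrapping recurrence:
    \[ D_{\text{final}}(i) = \frac{1}{m} \cdot \frac{\exp\left(-y_i \sum_t \alpha_t h_t(x_i)\right)}{\prod_t Z_t} = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t} \]

Proof (cont.)

• Step 2: $\text{training error}(H_{\text{final}}) \le \prod_t Z_t$
• Proof:
    \[ \begin{aligned}
    \text{training error}(H_{\text{final}}) &= \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i \neq H_{\text{final}}(x_i) \\ 0 & \text{else} \end{cases} \\
    &= \frac{1}{m} \sum_i \begin{cases} 1 & \text{if } y_i f(x_i) \le 0 \\ 0 & \text{else} \end{cases} \\
    &\le \frac{1}{m} \sum_i \exp(-y_i f(x_i)) \\
    &= \sum_i D_{\text{final}}(i) \prod_t Z_t \\
    &= \prod_t Z_t
    \end{aligned} \]

Proof (cont.)

• Step 3: $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$
• Proof:
    \[ \begin{aligned}
    Z_t &= \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) \\
    &= \sum_{i : y_i \neq h_t(x_i)} D_t(i)\, e^{\alpha_t} + \sum_{i : y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t} \\
    &= \epsilon_t e^{\alpha_t} + (1 - \epsilon_t) e^{-\alpha_t} \\
    &= 2\sqrt{\epsilon_t (1 - \epsilon_t)}
    \end{aligned} \]

How Will Test Error Behave?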
(A First Guess)

[Figure: train and test error vs. # of rounds (T)]

expect:
• training error to continue to drop (or reach zero)
• test error to increase when $H_{\text{final}}$ becomes "too complex"
    • "Occam's razor"
    • overfitting
        • hard to know when to stop training

Actual Typical Run

[Figure: train and test error vs. # of rounds (T), with C4.5 test error shown for comparison (boosting C4.5 on "letter" dataset)]

• test error does not increase, even after 1000 rounds
    • (total size > 2,000,000 nodes)
• test error continues to drop even after training error is zero!

    # rounds       5      100    1000
    train error    0.0    0.0    0.0
    test error     8.4    3.3    3.1

• Occam's razor wrongly predicts "simpler" rule is better ...
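
The AdaBoost update and the Step 2 bound from the proof can be tried out on a small example. The following is a minimal sketch, not the slides' own code: the six data points, the function names (`stump_predict`, `best_stump`, `adaboost`, `training_error`), the exhaustive half-plane search used as the weak learner, and the clamping of $\epsilon_t$ away from zero are all assumptions made for illustration. It tracks $\prod_t Z_t$ so the training error can be compared against it.

```python
import math


def stump_predict(stump, point):
    """Half-plane rule: `sign` where point[axis] > threshold, otherwise -sign."""
    axis, threshold, sign = stump
    return sign if point[axis] > threshold else -sign


def best_stump(points, labels, dist):
    """Weak learner (assumed): exhaustive search for the vertical or horizontal
    half-plane with the smallest weighted error under the distribution `dist`."""
    best_err, best = None, None
    for axis in (0, 1):
        values = sorted({p[axis] for p in points})
        # Candidate thresholds: one below all points, plus midpoints between neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for sign in (+1, -1):
                stump = (axis, threshold, sign)
                err = sum(d for p, y, d in zip(points, labels, dist)
                          if stump_predict(stump, p) != y)
                if best_err is None or err < best_err:
                    best_err, best = err, stump
    return best_err, best


def adaboost(points, labels, rounds):
    """Run AdaBoost; return the list of (alpha_t, h_t) and the product of the Z_t."""
    m = len(points)
    dist = [1.0 / m] * m                      # D_1(i) = 1/m
    ensemble, z_product = [], 1.0
    for _ in range(rounds):
        eps, stump = best_stump(points, labels, dist)
        if eps >= 0.5:                        # no edge over random guessing: stop
            break
        eps = max(eps, 1e-10)                 # sketch-only guard against log(1/0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        # D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        weights = [d * math.exp(-alpha * y * stump_predict(stump, p))
                   for p, y, d in zip(points, labels, dist)]
        z = sum(weights)
        dist = [w / z for w in weights]
        z_product *= z
        ensemble.append((alpha, stump))
    return ensemble, z_product


def predict(ensemble, point):
    """H_final(x) = sign(sum_t alpha_t * h_t(x)); a zero sum is mapped to -1 here."""
    f = sum(alpha * stump_predict(stump, point) for alpha, stump in ensemble)
    return 1 if f > 0 else -1


def training_error(ensemble, points, labels):
    wrong = sum(1 for p, y in zip(points, labels) if predict(ensemble, p) != y)
    return wrong / len(points)


# Hypothetical data: no single half-plane separates the labels.
points = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0), (5.0, 1.0), (6.0, 1.0)]
labels = [+1, +1, -1, -1, +1, +1]

ensemble, z_product = adaboost(points, labels, rounds=3)
err = training_error(ensemble, points, labels)
print("training error:", err, "bound prod Z_t:", z_product)
```

The bound in Step 2 holds for any choice of $\alpha_t$, which is why the sketch can compute $Z_t$ directly as the sum of the unnormalized weights rather than from the closed form in Step 3.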

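The toy example's numbers can be checked against the formulas for $\alpha_t$ and $Z_t$. The sketch below, with illustrative names of its own (`toy_rounds`, `alpha`, `z_value`), recomputes $\alpha_t$ from the printed $\epsilon_t$ and evaluates both sides of the theorem's bound for the three rounds. The recomputed $\alpha_2$ and $\alpha_3$ differ from the printed values in the second decimal place, presumably because the printed $\epsilon_t$ are themselves rounded.

```python
import math

# (epsilon_t, alpha_t) as printed for the three toy rounds.
toy_rounds = [(0.30, 0.42), (0.21, 0.65), (0.14, 0.92)]


def alpha(eps):
    """alpha_t = (1/2) * ln((1 - eps_t) / eps_t)."""
    return 0.5 * math.log((1.0 - eps) / eps)


def z_value(eps):
    """Z_t = 2 * sqrt(eps_t * (1 - eps_t)), from Step 3 of the proof."""
    return 2.0 * math.sqrt(eps * (1.0 - eps))


computed_alphas = [alpha(eps) for eps, _ in toy_rounds]

# Left-hand bound of the theorem: product of the Z_t.
product_bound = 1.0
for eps, _ in toy_rounds:
    product_bound *= z_value(eps)

# Right-hand bound: exp(-2 * sum of gamma_t^2), with gamma_t = 1/2 - eps_t.
exp_bound = math.exp(-2.0 * sum((0.5 - eps) ** 2 for eps, _ in toy_rounds))

for (eps, printed), computed in zip(toy_rounds, computed_alphas):
    print(f"eps={eps:.2f}  printed alpha={printed:.2f}  computed alpha={computed:.4f}")
print(f"prod Z_t = {product_bound:.4f}  exp(-2 sum gamma_t^2) = {exp_bound:.4f}")
```

Both quantities are upper bounds on the training error, so they can stay well above zero even when the combined classifier already labels every training point correctly.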