
obtained by drawing $m$ examples IID from the corrupted distribution $\mathcal{D}_\tau$. Suppose we pick $h \in \mathcal{H}$ using empirical risk minimization: $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\varepsilon}_S(h)$. Also, let $h^* = \arg\min_{h \in \mathcal{H}} \varepsilon_0(h)$. Let any $\delta, \gamma > 0$ be given. Prove that for
$$\varepsilon_0(\hat{h}) \le \varepsilon_0(h^*) + 2\gamma$$
to hold with probability $1 - \delta$, it suffices that
$$m \ge \frac{1}{2(1-2\tau)^2\gamma^2}\,\log\frac{2|\mathcal{H}|}{\delta}.$$

Remark. This result suggests that, roughly, $m$ examples that have been corrupted at noise level $\tau$ are worth about as much as $(1-2\tau)^2 m$ uncorrupted training examples. This is a useful rule of thumb to know if you ever need to decide whether/how much to pay for a more reliable source of training data. (If you've taken a class in information theory, you may also have heard that $(1 - H(\tau))m$ is a good estimate of the information in the $m$ corrupted examples, where $H(\tau) = -(\tau \log_2 \tau + (1-\tau)\log_2(1-\tau))$ is the "binary entropy" function. And indeed, the functions $(1-2\tau)^2$ and $1 - H(\tau)$ are quite close to each other.)
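As a quick numerical illustration of the remark (not part of the original solutions; the helper name `binary_entropy` is ours), the following Python snippet tabulates $(1-2\tau)^2$ against $1 - H(\tau)$ for a few noise levels:

```python
import math

def binary_entropy(tau):
    """Binary entropy H(tau) = -(tau*log2(tau) + (1-tau)*log2(1-tau)), with H(0) = H(1) = 0."""
    if tau in (0.0, 1.0):
        return 0.0
    return -(tau * math.log2(tau) + (1 - tau) * math.log2(1 - tau))

# Compare the rule-of-thumb discount (1-2*tau)^2 with the information-theoretic one 1-H(tau).
print(f"{'tau':>5} {'(1-2*tau)^2':>12} {'1-H(tau)':>10}")
for tau in [0.0, 0.1, 0.2, 0.3, 0.4, 0.45, 0.5]:
    print(f"{tau:>5.2f} {(1 - 2*tau)**2:>12.4f} {1 - binary_entropy(tau):>10.4f}")
```

Both quantities start at 1 for clean data and fall to 0 at $\tau = 0.5$, tracking each other fairly closely in between.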
(c) Comment briefly on what happens as $\tau$ approaches $0.5$.
(b) Answer: We will need to apply the following (in the right order):
$$\forall h \in \mathcal{H},\ |\varepsilon_\tau(h) - \hat{\varepsilon}_\tau(h)| \le \bar{\gamma}\ \text{ w.p. } (1-\delta), \quad \delta = 2K\exp(-2\bar{\gamma}^2 m) \quad (6)$$
$$\varepsilon_\tau = (1-2\tau)\varepsilon_0 + \tau, \quad\text{equivalently}\quad \varepsilon_0 = \frac{\varepsilon_\tau - \tau}{1-2\tau} \quad (7)$$
$$\forall h \in \mathcal{H},\ \hat{\varepsilon}_\tau(\hat{h}) \le \hat{\varepsilon}_\tau(h), \ \text{in particular for } h^* \quad (8)$$
Here is the derivation:
$$
\begin{aligned}
\varepsilon_0(\hat{h}) &= \frac{\varepsilon_\tau(\hat{h}) - \tau}{1-2\tau} && (9)\\
&\le \frac{\hat{\varepsilon}_\tau(\hat{h}) + \bar{\gamma} - \tau}{1-2\tau} && \text{w.p. } (1-\delta) \quad (10)\\
&\le \frac{\hat{\varepsilon}_\tau(h^*) + \bar{\gamma} - \tau}{1-2\tau} && \text{w.p. } (1-\delta) \quad (11)\\
&\le \frac{\varepsilon_\tau(h^*) + 2\bar{\gamma} - \tau}{1-2\tau} && \text{w.p. } (1-\delta) \quad (12)\\
&= \frac{(1-2\tau)\varepsilon_0(h^*) + \tau + 2\bar{\gamma} - \tau}{1-2\tau} && \text{w.p. } (1-\delta) \quad (13)\\
&= \varepsilon_0(h^*) + \frac{2\bar{\gamma}}{1-2\tau} && \text{w.p. } (1-\delta) \quad (14)\\
&= \varepsilon_0(h^*) + 2\gamma && \text{w.p. } (1-\delta) \quad (15)
\end{aligned}
$$
where we used, in this order, (7), (6), (8), (6), and (7); the last two steps are algebraic simplification and defining $\gamma$ as a function of $\bar{\gamma}$, namely $\bar{\gamma} = \gamma(1-2\tau)$. Now we can plug $\bar{\gamma} = \gamma(1-2\tau)$ into the $\delta$ of (6) (with $K = |\mathcal{H}|$), solve for $m$, and we are done.

Note: one could shorten the above derivation and go straight from (9) to (12) by using the result from class that, with probability $1-\delta$, $\varepsilon_\tau(\hat{h}) \le \varepsilon_\tau(h^*) + 2\bar{\gamma}$.

(c) Answer: The closer $\tau$ is to $0.5$, the more samples are needed to get the same generalization error bound. As $\tau$ approaches $0.5$, the training data becomes more and more random, and at $\tau = 0.5$ it carries no information at all about the underlying distribution.
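To make (c) concrete, here is a small Python sketch (not part of the original solutions; the values of $\gamma$, $\delta$, and $|\mathcal{H}|$ are arbitrary illustrations) that evaluates the sample-size bound from part (b) as $\tau$ approaches $0.5$:

```python
import math

def required_m(tau, gamma, delta, H_size):
    """Sample-size bound from part (b): m >= log(2*|H|/delta) / (2*(1-2*tau)^2 * gamma^2)."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * (1 - 2 * tau) ** 2 * gamma ** 2))

# Illustrative (not from the problem) choices of gamma, delta, |H|;
# watch the bound blow up as tau -> 0.5.
gamma, delta, H_size = 0.05, 0.05, 1000
for tau in [0.0, 0.1, 0.2, 0.3, 0.4, 0.45, 0.49]:
    print(f"tau = {tau:.2f}: m >= {required_m(tau, gamma, delta, H_size):,}")
```

The bound grows like $1/(1-2\tau)^2$ and diverges at $\tau = 0.5$, matching the observation that fully random labels carry no information.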
