CSE 6740 Lecture 17
What Loss Function Should I Use? II (Estimation Theory)

Alexander Gray
[email protected]
Georgia Institute of Technology

Today

1. Robustness ("How safe/stable is my loss function?")
2. Comparing Estimators ("How can I say one loss function is superior to another?")

Robustness

We often choose a loss function according to mathematical/computational convenience. Otherwise, robustness is usually the deciding factor.

In the (approximate) words of [Huber, 1981], any statistical procedure should possess the following desirable features:

- It has reasonably good efficiency under the assumed model.
- It is robust in the sense that small deviations from the assumed model should impair the performance only slightly.
- Somewhat larger deviations from the model should not cause a catastrophe.

MLE vs. L2E

Let's revisit $L_2$ estimation (L2E), which we used for KDE. If $f$ is the true density and $\hat{f}_\theta$ is an estimate with parameters $\theta$, the $L_2$ error or $L_2$ distance is

\[
L_2(\theta) = \int \left( \hat{f}_\theta(x) - f(x) \right)^2 dx
            = \int \hat{f}_\theta^2(x)\,dx - 2 \int \hat{f}_\theta(x) f(x)\,dx + \int f^2(x)\,dx. \tag{1}
\]

Note that the third term can be ignored for the purpose of comparing different estimators.

Given a dataset, we wish to find the parameters which minimize the $L_2$ risk

\[
E[L_2(\theta)] = \int \hat{f}_\theta^2(x)\,dx - \frac{2}{N} \sum_{i=1}^{N} \hat{f}_\theta(x_i). \tag{2}
\]

The term $\int \hat{f}_\theta^2(x)\,dx$ can be thought of as a kind of built-in regularization term, which acts to penalize spikes or overly large densities (due to, say, overlapped components in a mixture); the second term is a goodness-of-fit term.

MLE vs. L2E
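As a concrete illustration of minimizing the $L_2$ risk (2), here is a minimal sketch for the simplest possible model, a single Gaussian $\hat{f}_\theta = \phi(\cdot \mid \mu, \sigma^2)$. The single-Gaussian setting and the function names are illustrative choices, not from the lecture; the only extra fact used is the closed form $\int \phi^2(x \mid \mu, \sigma^2)\,dx = 1/(2\sigma\sqrt{\pi})$.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Normal density phi(x | mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def l2_risk(data, mu, sigma):
    """Empirical L2 risk of a single-Gaussian model:
    integral of f_hat^2  minus  (2/N) * sum_i f_hat(x_i).

    For one Gaussian the integral term has the closed form
    1 / (2 * sigma * sqrt(pi)).
    """
    reg = 1.0 / (2.0 * sigma * np.sqrt(np.pi))   # built-in regularization term
    fit = 2.0 * np.mean(gauss_pdf(data, mu, sigma))  # goodness-of-fit term
    return reg - fit
```

Minimizing this over $(\mu, \sigma)$, e.g. on a grid, gives the L2E fit. Note that each data point's contribution to the fit term is bounded by the density's peak, so a single wild point cannot dominate the objective; this bounded influence is the intuition behind L2E's robustness.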
Let's do L2E for a mixture of Gaussians:

\[
\hat{f}_\theta(x) = \sum_{k=1}^{K} \omega_k\, \phi(x \mid \mu_k, \Sigma_k). \tag{3}
\]

The L2E regularization term for a mixture of Gaussians is

\[
\int \hat{f}_\theta^2(x)\,dx = \sum_{k=1}^{K} \sum_{j=1}^{K} \omega_k \omega_j\, \phi(\mu_j \mid \mu_k, \Sigma_k + \Sigma_j). \tag{4}
\]

The expression $\phi(\mu_j \mid \mu_k, \Sigma_k + \Sigma_j)$ comes from the identity

\[
\phi(x \mid \mu_k, \Sigma_k)\, \phi(x \mid \mu_j, \Sigma_j) = \phi(\mu_j \mid \mu_k, \Sigma_k + \Sigma_j)\, \phi(x \mid \mu'_{k,j}, \Sigma'_{k,j});
\]

this and other properties of Gaussians make the integral tractable.

MLE vs. L2E

[Figure: MLE vs. L2E fits in four scenarios: one extreme outlier; 40% uniform noise; 40% mixed Gaussian noise; quasar data (stars as noise).]

Robustness

Let $X \sim N(\mu, \sigma^2)$. The value which minimizes squared-error, or $L_2$, loss, $\arg\min_\theta E(X - \theta)^2$, is the mean of $X$:

\[
\frac{d}{d\theta} E(X - \theta)^2 = 0 \iff \theta = E X. \tag{5}
\]

The value which minimizes absolute, or $L_1$, loss, $\arg\min_\theta E\,|X - \theta|$, is the median: $\frac{d}{d\theta} E\,|X - \theta|$ ...
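The closed-form mixture regularization term (4) can be checked numerically in the one-dimensional case. The sketch below (illustrative code with made-up mixture parameters) evaluates the double sum and compares it against direct numerical integration of $\hat{f}_\theta^2$:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Normal density phi(x | mu, var) with mean mu and variance var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def l2e_reg_term(weights, mus, variances):
    """Closed-form integral of f_hat^2 for a 1-D Gaussian mixture,
    via the Gaussian product identity:
    sum_k sum_j w_k * w_j * phi(mu_j | mu_k, var_k + var_j)."""
    total = 0.0
    for wk, mk, vk in zip(weights, mus, variances):
        for wj, mj, vj in zip(weights, mus, variances):
            total += wk * wj * gauss_pdf(mj, mk, vk + vj)
    return total
```

A quick sanity check: build the mixture density on a fine grid, integrate its square numerically, and confirm both routes agree. In the multivariate case the same double sum applies with $\phi$ the multivariate normal density and $\Sigma_k + \Sigma_j$ the summed covariances.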
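The mean-vs-median claims can also be seen empirically. A minimal sketch (the five-point sample is invented for illustration) brute-force minimizes the empirical $L_2$ and $L_1$ losses over a grid of candidate $\theta$ values:

```python
import numpy as np

# Small sample with one extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Evaluate both empirical losses on a dense grid of candidate thetas.
thetas = np.linspace(-10.0, 1010.0, 200001)
l2_loss = ((x[None, :] - thetas[:, None]) ** 2).sum(axis=1)  # squared error
l1_loss = np.abs(x[None, :] - thetas[:, None]).sum(axis=1)   # absolute error

theta_l2 = thetas[np.argmin(l2_loss)]  # approx. mean(x) = 202.0, dragged by the outlier
theta_l1 = thetas[np.argmin(l1_loss)]  # approx. median(x) = 3.0, unaffected by it
```

The $L_2$ minimizer lands near the mean (202.0), which the single outlier has pulled far from the bulk of the data, while the $L_1$ minimizer lands at the median (3.0), matching the derivation above.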
This note was uploaded on 04/03/2010 for the course CSE 6740 taught by Professor Staff during the Fall '08 term at Georgia Tech.