Insight by Mathematics and Intuition for Understanding Pattern Recognition
Waleed A. Yousef
Faculty of Computers and Information, Helwan University
April 4, 2010

Ch. 3: Linear Models for Regression

We saw that the best regression function is
$$ f(X) = E[Y \mid X]. $$
In general, we can always write
$$ Y \mid X = E[Y \mid X] + \varepsilon = f(X) + \varepsilon, $$
where $\varepsilon$ is a random variable. In linear models it is assumed that $f(X)$ is linear in $X$, and the goal is to estimate the coefficients of $f(X)$.

Linear models:
• were largely developed in the statistics community a long time ago;
• are still a great tool for prediction and can outperform fancier methods;
• can be applied to transformed features (e.g., if we have $X = (X_1, X_2)^T$, we can build the feature vector $X = (X_1, X_2, X_1^2, X_2^2, X_1 X_2)^T$ and then assume $f(X)$ is linear in this new $X$);
• underlie many other methods, which are generalizations of linear models, including neural networks and even some methods for classification.

Suppose that the original feature vector is $Z = (Z_1, \ldots, Z_D)^T$. In linear models we assume that
$$ f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = \beta^T X, \qquad X = (1, X_1, \ldots, X_p)^T, \quad \beta = (\beta_0, \ldots, \beta_p)^T, $$
where the leading 1 accounts for the intercept, and $X_1, \ldots, X_p$ can be:
• the components of the original feature vector;
• basis functions of the features, e.g., $X_1 = Z_1,\ X_2 = Z_2^2,\ X_3 = Z_1 Z_2, \ldots$;
• transformations of the features, e.g., $X_1 = \log Z_1,\ X_2 = \exp(Z_2)$.
The model is still linear in the coefficients (i.e., linear in the new features).

Typically we have $N$ observations, each a pair $(x_i, y_i)$. So we have the data matrix and the response vector
$$ \mathbf{X}_{N \times (p+1)} = \begin{pmatrix} (1, x_1)^T \\ \vdots \\ (1, x_N)^T \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{Np} \end{pmatrix}, \qquad Y_{N \times 1} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}. $$

For any choice of $\beta$, we have a residual sum of squares given by
$$ \mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big( y_i - f(x_i) \big)^2 = \sum_i \big( y_i - \beta^T x_i \big)^2 = \sum_i \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2. $$
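As a concrete illustration of the definitions above, the following sketch builds the $N \times (p+1)$ data matrix with a leading column of ones for the intercept and evaluates $\mathrm{RSS}(\beta)$ for a candidate coefficient vector. The data values and the candidate $\beta$ are made up for illustration; they are not from the lecture.

```python
import numpy as np

# Toy data: N = 4 observations, p = 2 features (values are illustrative only).
Z = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([3.0, 2.5, 5.0, 8.0])

# Build the N x (p+1) data matrix X: a leading column of ones
# accounts for the intercept beta_0, as in the slides.
X = np.column_stack([np.ones(len(Z)), Z])

def rss(beta, X, y):
    """Residual sum of squares: RSS(beta) = sum_i (y_i - beta^T x_i)^2."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

beta = np.array([0.5, 1.0, 0.2])   # an arbitrary candidate coefficient vector
print(rss(beta, X, y))             # ≈ 11.07 for this toy data
```

Note that with $\beta = 0$ the residuals are just the responses themselves, so $\mathrm{RSS}(0) = y^T y$, a quick sanity check on the implementation.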
A natural choice of $\beta$ is the one that minimizes $\mathrm{RSS}$, which can be written in vector form as
$$ \mathrm{RSS}(\beta) = (y - \mathbf{X}\beta)^T (y - \mathbf{X}\beta) = y^T y - 2\beta^T \mathbf{X}^T y + \beta^T \mathbf{X}^T \mathbf{X} \beta. $$
In general, to minimize a scalar function $E$ of a vector $W$, we have to find the point $w$ at which the gradient is zero, where
$$ \nabla E(W) = \frac{\partial E(W)}{\partial W} = \left( \frac{\partial E(W)}{\partial W_1}, \ldots, \frac{\partial E(W)}{\partial W_p} \right)^T. $$
Prove, for any matrix $A$ and vector $\alpha$, that
$$ \nabla (AW) = A, \qquad \nabla (\alpha^T W) = \nabla (W^T \alpha) = \alpha, \qquad \nabla (W^T A W) = (A + A^T) W. $$
Now, back to $\mathrm{RSS}(\beta)$, where we have to find the gradient with respect to $\beta$:
$$ \nabla \mathrm{RSS}(\beta) = -2\mathbf{X}^T y + 2\mathbf{X}^T \mathbf{X} \beta = -2\mathbf{X}^T (y - \mathbf{X}\beta). $$
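Setting the gradient above to zero gives $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T y$ (the normal equations). A minimal numerical sketch of solving them follows; the synthetic data, the true coefficients $(1.0,\ 2.0,\ -0.5)$, and the noise level are assumptions made up for this example.

```python
import numpy as np

# Synthetic data: y = 1.0 + 2.0*Z1 - 0.5*Z2 + small Gaussian noise.
rng = np.random.default_rng(0)
N, p = 50, 2
Z = rng.normal(size=(N, p))
y = 1.0 + 2.0 * Z[:, 0] - 0.5 * Z[:, 1] + rng.normal(scale=0.1, size=N)

X = np.column_stack([np.ones(N), Z])   # prepend the intercept column

# Setting grad RSS(beta) = -2 X^T (y - X beta) to zero yields the
# normal equations X^T X beta = X^T y; solve them directly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At beta_hat the gradient should vanish (up to floating-point error).
grad = -2 * X.T @ (y - X @ beta_hat)
print(beta_hat)               # close to the true coefficients (1.0, 2.0, -0.5)
print(np.max(np.abs(grad)))   # essentially zero
```

In practice one would use `np.linalg.lstsq` (or a QR decomposition) rather than forming $\mathbf{X}^T \mathbf{X}$ explicitly, since the normal equations square the condition number of $\mathbf{X}$; the direct solve is shown here only because it mirrors the derivation.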