Logistic Regression

Jia Li
Department of Statistics, The Pennsylvania State University
Email: [email protected]
http://www.stat.psu.edu/jiali
Preserve linear classification boundaries. By the Bayes rule:

    Ĝ(x) = arg max_k Pr(G = k | X = x) .

The decision boundary between class k and class l is determined by the equation:

    Pr(G = k | X = x) = Pr(G = l | X = x) .

Divide both sides by Pr(G = l | X = x) and take logs. The above equation is equivalent to

    log [ Pr(G = k | X = x) / Pr(G = l | X = x) ] = 0 .

Since we enforce a linear boundary, we can assume

    log [ Pr(G = k | X = x) / Pr(G = l | X = x) ] = a_0^{(k,l)} + Σ_{j=1}^p a_j^{(k,l)} x_j .

For logistic regression, there are restrictive relations between the a^{(k,l)} for different pairs (k, l).

Assumptions

    log [ Pr(G = 1 | X = x) / Pr(G = K | X = x) ] = β_{10} + β_1^T x
    log [ Pr(G = 2 | X = x) / Pr(G = K | X = x) ] = β_{20} + β_2^T x
    ...
    log [ Pr(G = K-1 | X = x) / Pr(G = K | X = x) ] = β_{(K-1)0} + β_{K-1}^T x

For any pair (k, l):

    log [ Pr(G = k | X = x) / Pr(G = l | X = x) ] = β_{k0} - β_{l0} + (β_k - β_l)^T x .

Number of parameters: (K - 1)(p + 1). Denote the entire parameter set by

    θ = {β_{10}, β_1, β_{20}, β_2, ..., β_{(K-1)0}, β_{K-1}} .

The log ratios of posterior probabilities are called log-odds or logit transformations.

Under the assumptions, the posterior probabilities are given by:
    Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / (1 + Σ_{l=1}^{K-1} exp(β_{l0} + β_l^T x)),   for k = 1, ..., K-1;

    Pr(G = K | X = x) = 1 / (1 + Σ_{l=1}^{K-1} exp(β_{l0} + β_l^T x)) .

For Pr(G = k | X = x) given above, obviously:

    They sum to 1: Σ_{k=1}^K Pr(G = k | X = x) = 1.
    A simple calculation shows that the assumptions are satisfied.

Comparison with Linear Regression on Indicators

Similarities:

    Both attempt to estimate Pr(G = k | X = x).
    Both have linear classification boundaries.

Difference:

    Linear regression on the indicator matrix approximates Pr(G = k | X = x) by a linear function of x; the estimate is not guaranteed to fall between 0 and 1 or to sum to 1.
    In logistic regression, Pr(G = k | X = x) is a nonlinear function of x; it is guaranteed to range between 0 and 1 and to sum to 1.

Fitting Logistic Regression Models

Criterion: find the parameters that maximize the conditional likelihood of G given X using the training data.

Denote p_k(x_i; θ) = Pr(G = k | X = x_i; θ).

Given the first input x_1, the posterior probability of its class being g_1 is Pr(G = g_1 | X = x_1). Since the samples in the training data set are independent, the posterior probability of the N samples having classes g_i, i = 1, 2, ..., N, given their inputs x_1, x_2, ..., x_N, is:
    Π_{i=1}^N Pr(G = g_i | X = x_i) .

The conditional log-likelihood of the class labels in the training data set is

    L(θ) = Σ_{i=1}^N log Pr(G = g_i | X = x_i) = Σ_{i=1}^N log p_{g_i}(x_i; θ) .

Binary Classification

For binary classification, if g_i = 1, denote y_i = 1; if g_i = 2, denote y_i = 0.

Let p_1(x; θ) = p(x; θ); then

    p_2(x; θ) = 1 - p_1(x; θ) = 1 - p(x; θ) .

Since K = 2, the parameters are θ = {β_{10}, β_1}. We denote β = (β_{10}, β_1)^T.

If y_i = 1, i.e., g_i = 1,

    log p_{g_i}(x; β) = log p_1(x; β) = y_i log p(x; β) .

If y_i = 0, i.e., g_i = 2,

    log p_{g_i}(x; β) = log p_2(x; β) = (1 - y_i) log(1 - p(x; β)) .

Since either y_i = 0 or 1 - y_i = 0, we have

    log p_{g_i}(x; β) = y_i log p(x; β) + (1 - y_i) log(1 - p(x; β)) .

The conditional log-likelihood is

    L(β) = Σ_{i=1}^N log p_{g_i}(x_i; β) = Σ_{i=1}^N [ y_i log p(x_i; β) + (1 - y_i) log(1 - p(x_i; β)) ] .

There are p + 1 parameters in β = (β_{10}, β_1)^T. Assume a column vector form for β:

    β = (β_{10}, β_{11}, β_{12}, ..., β_{1,p})^T .
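As a quick numerical illustration of the binary conditional log-likelihood L(β) = Σ_i [y_i log p(x_i) + (1 - y_i) log(1 - p(x_i))], the sketch below evaluates it directly; the labels and fitted probabilities are made up and are not from the notes.

```python
# Sketch: evaluating the binary conditional log-likelihood
# L = sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ].
# The labels and fitted probabilities below are hypothetical.
import numpy as np

def log_likelihood(y, p):
    """Binary conditional log-likelihood for labels y and fitted probabilities p."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1.0, 0.0, 1.0, 1.0])      # y_i = 1 iff g_i = 1
p = np.array([0.9, 0.2, 0.7, 0.6])      # hypothetical p(x_i; beta)
L = log_likelihood(y, p)                # always <= 0; 0 only for a perfect fit
```

Each sample contributes log p_i when y_i = 1 and log(1 - p_i) when y_i = 0, exactly as in the case analysis above.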
Here we add the constant term 1 to x to accommodate the intercept:

    x = (1, x_{,1}, x_{,2}, ..., x_{,p})^T .

By the assumption of the logistic regression model:

    p(x; β) = Pr(G = 1 | X = x) = exp(β^T x) / (1 + exp(β^T x))
    1 - p(x; β) = Pr(G = 2 | X = x) = 1 / (1 + exp(β^T x)) .

Substituting the above into L(β):

    L(β) = Σ_{i=1}^N [ y_i β^T x_i - log(1 + e^{β^T x_i}) ] .

To maximize L(β), we set the first-order partial derivatives of L(β) to zero:

    ∂L(β)/∂β_{1j} = Σ_{i=1}^N y_i x_{ij} - Σ_{i=1}^N x_{ij} e^{β^T x_i} / (1 + e^{β^T x_i})
                  = Σ_{i=1}^N y_i x_{ij} - Σ_{i=1}^N p(x_i; β) x_{ij}
                  = Σ_{i=1}^N x_{ij} (y_i - p(x_i; β)) ,

for all j = 0, 1, ..., p.

In matrix form, we write

    ∂L(β)/∂β = Σ_{i=1}^N x_i (y_i - p(x_i; β)) .

To solve the set of p + 1 nonlinear equations ∂L(β)/∂β_{1j} = 0, j = 0, 1, ..., p, we use the Newton-Raphson algorithm.

The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:

    ∂²L(β)/∂β ∂β^T = - Σ_{i=1}^N x_i x_i^T p(x_i; β)(1 - p(x_i; β)) .

The element on the jth row and nth column (counting from 0) is

    ∂²L(β)/∂β_{1j} ∂β_{1n} = - Σ_{i=1}^N [ (1 + e^{β^T x_i}) e^{β^T x_i} - (e^{β^T x_i})² ] x_{ij} x_{in} / (1 + e^{β^T x_i})²
                            = - Σ_{i=1}^N [ x_{ij} x_{in} p(x_i; β) - x_{ij} x_{in} p(x_i; β)² ]
                            = - Σ_{i=1}^N x_{ij} x_{in} p(x_i; β)(1 - p(x_i; β)) .

Starting with β^{old}, a single Newton-Raphson update is

    β^{new} = β^{old} - [ ∂²L(β)/∂β ∂β^T ]^{-1} ∂L(β)/∂β ,

where the derivatives are evaluated at β^{old}.

The iteration can be expressed compactly in matrix form.

    Let y be the column vector of the y_i.
    Let X be the N × (p + 1) input matrix.
    Let p be the N-vector of fitted probabilities, with ith element p(x_i; β^{old}).
    Let W be the N × N diagonal matrix of weights, with ith diagonal element p(x_i; β^{old})(1 - p(x_i; β^{old})).

Then

    ∂L(β)/∂β = X^T (y - p) ,    ∂²L(β)/∂β ∂β^T = - X^T W X .

The Newton-Raphson step is

    β^{new} = β^{old} + (X^T W X)^{-1} X^T (y - p)
            = (X^T W X)^{-1} X^T W (X β^{old} + W^{-1} (y - p))
            = (X^T W X)^{-1} X^T W z ,

where z ≜ X β^{old} + W^{-1} (y - p).

If z is viewed as a response and X as the input matrix, β^{new} is the solution to a weighted least squares problem:

    β^{new} ← arg min_β (z - Xβ)^T W (z - Xβ) .

Recall that linear regression by least squares solves

    arg min_β (z - Xβ)^T (z - Xβ) .

z is referred to as the adjusted response. The algorithm is referred to as iteratively reweighted least squares, or IRLS.
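The algebra rewriting the Newton step as a weighted least squares fit to the adjusted response z can be checked numerically. This is my own sketch with synthetic data (the data and starting point are assumptions, not from the notes):

```python
# Check: beta_old + (X^T W X)^{-1} X^T (y - p)  equals  (X^T W X)^{-1} X^T W z
# with z = X beta_old + W^{-1} (y - p). Synthetic data; one Newton step only.
import numpy as np

rng = np.random.default_rng(0)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # intercept + 2 inputs
y = rng.integers(0, 2, size=N).astype(float)                 # arbitrary 0/1 labels

beta_old = np.zeros(3)
p = 1.0 / (1.0 + np.exp(-X @ beta_old))      # fitted probabilities at beta_old
W = np.diag(p * (1.0 - p))                   # N x N diagonal weight matrix

newton = beta_old + np.linalg.solve(X.T @ W @ X, X.T @ (y - p))
z = X @ beta_old + np.linalg.solve(W, y - p) # adjusted response
wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
```

The two vectors agree because X^T W z expands to X^T W X β^{old} + X^T (y - p), which is exactly the Newton step after multiplying by (X^T W X)^{-1}.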
Pseudo Code

1. β ← 0.
2. Compute y by setting its elements to

       y_i = 1 if g_i = 1;  y_i = 0 if g_i = 2,   i = 1, 2, ..., N.

3. Compute p by setting its elements to

       p(x_i; β) = e^{β^T x_i} / (1 + e^{β^T x_i}),   i = 1, 2, ..., N.

4. Compute the diagonal matrix W. The ith diagonal element is p(x_i; β)(1 - p(x_i; β)), i = 1, 2, ..., N.
5. z ← Xβ + W^{-1}(y - p).
6. β ← (X^T W X)^{-1} X^T W z.
7. If the stopping criterion is met, stop; otherwise go back to step 3.

Computational Efficiency

Since W is an N × N diagonal matrix, direct matrix operations with it may be very inefficient. A modified pseudo code is provided next.

1. β ← 0.
2. Compute y by setting its elements to
       y_i = 1 if g_i = 1;  y_i = 0 if g_i = 2,   i = 1, 2, ..., N.

3. Compute p by setting its elements to

       p(x_i; β) = e^{β^T x_i} / (1 + e^{β^T x_i}),   i = 1, 2, ..., N.

4. Compute the N × (p + 1) matrix X̃ by multiplying the ith row of X by p(x_i; β)(1 - p(x_i; β)), i = 1, 2, ..., N:

       X = [ x_1^T ; x_2^T ; ... ; x_N^T ] ,
       X̃ = [ p(x_1; β)(1 - p(x_1; β)) x_1^T ; ... ; p(x_N; β)(1 - p(x_N; β)) x_N^T ] .

5. β ← β + (X^T X̃)^{-1} X^T (y - p).
6. If the stopping criterion is met, stop; otherwise go back to step 3.
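Putting the two pseudo codes together, a minimal IRLS implementation might look as follows (my own sketch, on made-up synthetic data). It forms X̃ by scaling the rows of X rather than building the N × N matrix W, per the efficiency remark above:

```python
# Minimal IRLS for binary logistic regression (a sketch, not the notes' code).
# The update uses the row-scaled matrix X~ instead of an explicit N x N W.
import numpy as np

def irls(X, y, tol=1e-8, max_iter=100):
    beta = np.zeros(X.shape[1])                      # step 1: beta <- 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # step 3: fitted probabilities
        w = p * (1.0 - p)                            # diagonal of W, kept as a vector
        Xt = w[:, None] * X                          # X~: row i scaled by p_i (1 - p_i)
        step = np.linalg.solve(X.T @ Xt, X.T @ (y - p))  # (X^T X~)^{-1} X^T (y - p)
        beta = beta + step                           # step 5
        if np.max(np.abs(step)) < tol:               # step 6: stopping criterion
            break
    return beta

# Synthetic data generated from assumed true coefficients (-0.5, 1.5)
rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.5]))))
y = (rng.random(N) < prob).astype(float)
beta_hat = irls(X, y)
```

At convergence the score equations Σ_i x_i (y_i - p(x_i; β)) = 0 hold, which is a convenient way to verify the fit without comparing against any particular coefficient values.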
Example: Diabetes Data Set

The input X is two dimensional: X_1 and X_2 are the two principal components of the original 8 variables.

Class 1: without diabetes. Class 2: with diabetes.

Applying logistic regression, we obtain

    β = (0.7679, -0.6816, -0.3664)^T .

The posterior probabilities are:

    Pr(G = 1 | X = x) = e^{0.7679 - 0.6816 X_1 - 0.3664 X_2} / (1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2})

    Pr(G = 2 | X = x) = 1 / (1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}) .

The classification rule is:

    Ĝ(x) = 1  if 0.7679 - 0.6816 X_1 - 0.3664 X_2 ≥ 0
    Ĝ(x) = 2  if 0.7679 - 0.6816 X_1 - 0.3664 X_2 < 0 .

[Figure: solid line, the decision boundary obtained by logistic regression; dashed line, the decision boundary obtained by LDA.]

Within the training data set, the classification error rate is 28.12%; sensitivity 45.9%; specificity 85.8%.

Multiclass Case (K ≥ 3)

When K ≥ 3, β is a (K-1)(p+1)-vector:

    β = (β_{10}, β_{11}, ..., β_{1p}, β_{20}, ..., β_{2p}, ..., β_{(K-1)0}, ..., β_{(K-1)p})^T .

Let β_l denote the (p+1)-vector (β_{l0}, β_l).

The likelihood function becomes

    L(β) = Σ_{i=1}^N log p_{g_i}(x_i; β)
         = Σ_{i=1}^N log [ e^{β_{g_i}^T x_i} / (1 + Σ_{l=1}^{K-1} e^{β_l^T x_i}) ]
         = Σ_{i=1}^N [ β_{g_i}^T x_i - log(1 + Σ_{l=1}^{K-1} e^{β_l^T x_i}) ] ,

where β_K ≜ 0 for the reference class K.

Note: the indicator function I(·) equals 1 when its argument is true and 0 otherwise.

First-order derivatives:

    ∂L(β)/∂β_{kj} = Σ_{i=1}^N [ I(g_i = k) x_{ij} - e^{β_k^T x_i} x_{ij} / (1 + Σ_{l=1}^{K-1} e^{β_l^T x_i}) ]
                  = Σ_{i=1}^N x_{ij} ( I(g_i = k) - p_k(x_i; β) ) .
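The first-derivative formula ∂L/∂β_{kj} = Σ_i x_ij (I(g_i = k) - p_k(x_i; β)) can be verified against a finite-difference gradient of the multiclass log-likelihood. The sketch below uses made-up data with K = 3 and the last class as reference (all names and values are illustrative assumptions):

```python
# Finite-difference check of the multiclass gradient (a sketch; data made up).
# Classes are coded 0, 1 (non-reference) and 2 (reference, with beta fixed at 0).
import numpy as np

def loglik(Beta, X, g):
    """Multiclass logistic log-likelihood; Beta is (K-1) x (p+1)."""
    logits = X @ Beta.T                                   # N x (K-1)
    lse = np.log1p(np.exp(logits).sum(axis=1))            # log(1 + sum_l e^{beta_l^T x})
    own = np.array([logits[i, gi] if gi < Beta.shape[0] else 0.0
                    for i, gi in enumerate(g)])           # beta_{g_i}^T x_i (0 for reference)
    return float(np.sum(own - lse))

rng = np.random.default_rng(3)
N, K = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # p + 1 = 3 columns
g = rng.integers(0, K, size=N)
Beta = rng.normal(scale=0.3, size=(K - 1, X.shape[1]))

# analytic gradient: column k is sum_i x_i (I(g_i = k) - p_k(x_i))
E = np.exp(X @ Beta.T)
P = E / (1.0 + E.sum(axis=1, keepdims=True))              # N x (K-1) posteriors
grad = X.T @ ((g[:, None] == np.arange(K - 1)) - P)       # (p+1) x (K-1)

# numerical gradient by central differences
eps = 1e-6
num = np.zeros_like(Beta)
for k in range(K - 1):
    for j in range(X.shape[1]):
        D = np.zeros_like(Beta); D[k, j] = eps
        num[k, j] = (loglik(Beta + D, X, g) - loglik(Beta - D, X, g)) / (2 * eps)
```

The analytic and numerical gradients should agree to several decimal places, confirming the formula.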
Second-order derivatives:

    ∂²L(β)/∂β_{kj} ∂β_{mn}
        = - Σ_{i=1}^N x_{ij} x_{in} [ I(k = m) e^{β_k^T x_i} (1 + Σ_{l=1}^{K-1} e^{β_l^T x_i}) - e^{β_k^T x_i} e^{β_m^T x_i} ] / (1 + Σ_{l=1}^{K-1} e^{β_l^T x_i})²
        = - Σ_{i=1}^N x_{ij} x_{in} [ p_k(x_i; β) I(k = m) - p_k(x_i; β) p_m(x_i; β) ]
        = - Σ_{i=1}^N x_{ij} x_{in} p_k(x_i; β) [ I(k = m) - p_m(x_i; β) ] .

Matrix form:

    y is the concatenated indicator vector of dimension N(K-1): y = (y_1^T, y_2^T, ..., y_{K-1}^T)^T, where y_k = (I(g_1 = k), I(g_2 = k), ..., I(g_N = k))^T, 1 ≤ k ≤ K-1.
    p is the concatenated vector of fitted probabilities of dimension N(K-1): p = (p_1^T, p_2^T, ..., p_{K-1}^T)^T, where p_k = (p_k(x_1; β), ..., p_k(x_N; β))^T, 1 ≤ k ≤ K-1.
    X̃ is the N(K-1) × (p+1)(K-1) block-diagonal matrix with K-1 copies of X on the diagonal.
    W is an N(K-1) × N(K-1) square matrix of blocks W_{km}, 1 ≤ k, m ≤ K-1, each an N × N diagonal matrix. When k = m, the ith diagonal element of W_{kk} is p_k(x_i; β^{old})(1 - p_k(x_i; β^{old})). When k ≠ m, the ith diagonal element of W_{km} is -p_k(x_i; β^{old}) p_m(x_i; β^{old}).

Similarly to the binary classification case,

    ∂L(β)/∂β = X̃^T (y - p) ,    ∂²L(β)/∂β ∂β^T = - X̃^T W X̃ .

The updating formula from the binary classification case carries over:

    β^{new} = (X̃^T W X̃)^{-1} X̃^T W z ,   where z ≜ X̃ β^{old} + W^{-1} (y - p) ,

or simply:

    β^{new} = β^{old} + (X̃^T W X̃)^{-1} X̃^T (y - p) .

Computation Issues

Initialization: one option is to use β = 0.
Convergence is not guaranteed, but it usually occurs.
Usually the log-likelihood increases after each iteration, but overshooting can occur. In the rare cases where the log-likelihood decreases, cut the step size by half.

Connection with LDA

Under the model of LDA:

    log [ Pr(G = k | X = x) / Pr(G = K | X = x) ]
        = log(π_k / π_K) - (1/2)(μ_k + μ_K)^T Σ^{-1} (μ_k - μ_K) + x^T Σ^{-1} (μ_k - μ_K)
        = a_{k0} + a_k^T x .

The model of LDA satisfies the assumptions of the linear logistic model.

The linear logistic model only specifies the conditional distribution Pr(G = k | X = x). No assumption is made about Pr(X).
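The linearity claim above — that the LDA log posterior ratio reduces to a_{k0} + a_k^T x because the shared quadratic term cancels — can be checked numerically. This sketch uses made-up priors, means, and covariance:

```python
# Sketch: under LDA (shared covariance), log[Pr(G=1|x)/Pr(G=2|x)] is affine in x.
import numpy as np

pi1, pi2 = 0.4, 0.6                                  # assumed class priors
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])           # shared covariance
Si = np.linalg.inv(Sigma)

def log_ratio(x):
    # log pi_k + log phi(x; mu_k, Sigma), differenced; the normalizing constants
    # and the x^T Si x quadratic terms cancel because Sigma is shared
    q1 = -0.5 * (x - mu1) @ Si @ (x - mu1)
    q2 = -0.5 * (x - mu2) @ Si @ (x - mu2)
    return np.log(pi1 / pi2) + q1 - q2

# closed-form coefficients from the slide:
# a0 = log(pi1/pi2) - (1/2)(mu1 + mu2)^T Si (mu1 - mu2),  a = Si (mu1 - mu2)
a0 = np.log(pi1 / pi2) - 0.5 * (mu1 + mu2) @ Si @ (mu1 - mu2)
a = Si @ (mu1 - mu2)

xs = np.random.default_rng(5).normal(size=(20, 2))   # arbitrary test points
vals = np.array([log_ratio(x) for x in xs])
affine = a0 + xs @ a
```

The direct log ratio and the affine form a_0 + a^T x coincide at every test point, which is exactly why the LDA model satisfies the linear logistic assumptions.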
The LDA model specifies the joint distribution of X and G. Pr(X) is a mixture of Gaussians:

    Pr(X) = Σ_{k=1}^K π_k φ(X; μ_k, Σ) ,

where φ is the Gaussian density function.

Linear logistic regression maximizes the conditional likelihood of G given X: Pr(G = k | X = x).

LDA maximizes the joint likelihood of G and X: Pr(X = x, G = k).

If the additional assumption made by LDA is appropriate, LDA tends to estimate the parameters more efficiently by using more information about the data. Samples without class labels can be used under the model of LDA.

LDA is not robust to gross outliers. Because logistic regression relies on fewer assumptions, it tends to be more robust.

In practice, logistic regression and LDA often give similar results.

Simulation

Assume the input X is one-dimensional. The two classes have equal priors, and the class-conditional densities of X are shifted versions of each other. Each conditional density is a mixture of two normals:

    Class 1 (red): 0.6 N(-2, 1/4) + 0.4 N(0, 1).
    Class 2 (blue): 0.6 N(0, 1/4) + 0.4 N(2, 1).

[Figure: the two class-conditional densities.]

LDA Result

Training data set: 2000 samples for each class. Test data set: 1000 samples for each class.

The estimates by LDA: μ̂_1 = -1.1948, μ̂_2 = 0.8224, σ̂² = 1.5268. The boundary value between the two classes is (μ̂_1 + μ̂_2)/2 = -0.1862.

The classification error rate on the test data is 0.2315.

Based on the true distribution, the Bayes (optimal) boundary value between the two classes is -0.7750, and the corresponding error rate is 0.1765.

Logistic Regression Result

Linear logistic regression obtains β = (-0.3288, -1.3275)^T. The boundary value satisfies -0.3288 - 1.3275 X = 0, hence equals -0.2477. The error rate on the test data set is 0.2205.

The estimated posterior probability is:

    Pr(G = 1 | X = x) = e^{-0.3288 - 1.3275 x} / (1 + e^{-0.3288 - 1.3275 x}) .

[Figure: the estimated posterior probability Pr(G = 1 | X = x) compared with its true value based on the true distribution.]
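The simulation is straightforward to replicate. The sketch below redraws training data and fits both rules; it is my own reconstruction, not the original code, and the fitted numbers will differ somewhat from the slides' values since the samples are random:

```python
# Re-running the 1-D simulation: LDA midpoint boundary vs. logistic regression.
import numpy as np

rng = np.random.default_rng(4)

def draw(n, means, sds, w=0.6):
    """Mixture w * N(means[0], sds[0]^2) + (1 - w) * N(means[1], sds[1]^2)."""
    pick = rng.random(n) < w
    return np.where(pick, rng.normal(means[0], sds[0], n), rng.normal(means[1], sds[1], n))

x1 = draw(2000, (-2.0, 0.0), (0.5, 1.0))   # class 1: 0.6 N(-2, 1/4) + 0.4 N(0, 1)
x2 = draw(2000, (0.0, 2.0), (0.5, 1.0))    # class 2: 0.6 N(0, 1/4) + 0.4 N(2, 1)

# LDA: with equal priors and sample sizes, the 1-D boundary is the midpoint of the means
lda_boundary = (x1.mean() + x2.mean()) / 2.0

# logistic regression by Newton-Raphson (class 1 coded as y = 1)
X = np.column_stack([np.ones(4000), np.concatenate([x1, x2])])
y = np.concatenate([np.ones(2000), np.zeros(2000)])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
logit_boundary = -beta[0] / beta[1]        # where beta_0 + beta_1 * x = 0
```

Both fitted boundaries should land near the slides' values of -0.1862 (LDA) and -0.2477 (logistic), with the slope coefficient negative because class 1 sits to the left of class 2.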
This note was uploaded on 02/04/2012 for the course STAT 557, taught by Professor Jia Li during the Fall '09 term at Penn State.