\[
\text{Observed count}(f_k) \;-\; \text{Expected count}(f_k) \tag{7.14}
\]

Thus at the optimal weights for the model, the model's expected feature values match the actual counts in the data.

7.4 Regularization

There is a problem with learning weights that make the model perfectly match the training data. If a feature is perfectly predictive of the outcome because it happens to occur in only one class, it will be assigned a very high weight. The weights for features will attempt to fit the details of the training set perfectly, in fact too perfectly, modeling noisy factors that just accidentally correlate with the class. This problem is called overfitting. To avoid overfitting, a regularization term is added to the objective function in Eq. 7.13. Instead of the optimization in Eq. 7.12, we optimize the following:

\[
\hat{w} = \operatorname*{argmax}_{w} \sum_{j} \log P(y^{(j)} \mid x^{(j)}) \;-\; \alpha R(w) \tag{7.15}
\]

where R(w), the regularization term, is used to penalize large weights. Thus a setting of the weights that matches the training data perfectly, but uses many weights with high values to do so, will be penalized more than a setting that matches the data a little less well but does so using smaller weights.
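As a minimal illustrative sketch (not from the text) of how the objective in Eq. 7.15 could be computed for binary logistic regression, the following NumPy function takes the regularizer R as an argument:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_objective(w, X, y, alpha, R):
    """Eq. 7.15: sum_j log P(y^(j) | x^(j)) - alpha * R(w).

    X: (m, N) feature matrix; y: (m,) labels in {0, 1};
    w: (N,) weight vector; alpha: regularization strength;
    R: a function mapping w to a scalar penalty.
    """
    p = sigmoid(X @ w)                       # P(y=1 | x) for each example
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return log_lik - alpha * R(w)

# Example: an L2 penalty (the case discussed next):
# regularized_objective(w, X, y, alpha=0.1, R=lambda w: np.sum(w ** 2))
```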
There are two common regularization terms R(w). L2 regularization is a quadratic function of the weight values, named because it uses the (square of the) L2 norm of the weight values. The L2 norm, ||W||_2, is the same as the Euclidean distance of the vector from the origin:

\[
R(W) = \lVert W \rVert_2^2 = \sum_{j=1}^{N} w_j^2 \tag{7.16}
\]

The L2-regularized objective function becomes:

\[
\hat{w} = \operatorname*{argmax}_{w} \sum_{j} \log P(y^{(j)} \mid x^{(j)}) \;-\; \alpha \sum_{i=1}^{N} w_i^2 \tag{7.17}
\]

L1 regularization is a linear function of the weight values, named after the L1 norm ||W||_1, the sum of the absolute values of the weights, or Manhattan distance (the Manhattan distance is the distance you'd have to walk between two points in a city with a street grid like New York):

\[
R(W) = \lVert W \rVert_1 = \sum_{i=1}^{N} |w_i| \tag{7.18}
\]

The L1-regularized objective function becomes:

\[
\hat{w} = \operatorname*{argmax}_{w} \sum_{j} \log P(y^{(j)} \mid x^{(j)}) \;-\; \alpha \sum_{i=1}^{N} |w_i| \tag{7.19}
\]

These kinds of regularization come from statistics, where L1 regularization is called 'the lasso' or lasso regression (Tibshirani, 1996) and L2 regularization is called ridge regression, and both are commonly used in language processing. L2 regularization is easier to optimize because of its simple derivative (the derivative of w^2 is just 2w), while L1 regularization is more complex (the derivative of |w| is not continuous at zero). But where L2 prefers weight vectors with many small weights, L1 prefers sparse solutions with some larger weights but many more weights set to zero. Thus L1 regularization leads to much sparser weight vectors: far fewer features end up with nonzero weights.

Both L1 and L2 regularization have Bayesian interpretations as constraints on the prior of how weights should look. L1 regularization can be viewed as a Laplace prior on the weights. L2 regularization corresponds to assuming that weights are distributed according to a gaussian distribution with mean μ = 0. In a gaussian distribution, the further a value lies from the mean, the lower its probability, so such a prior prefers weights that stay close to zero.
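As a short worked step (a standard derivation, not shown in the excerpt above), taking the log of a zero-mean gaussian prior over the weights makes this correspondence explicit:

\[
\log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{w_i^2}{2\sigma^2}\right)
= -\frac{1}{2\sigma^2} \sum_{i=1}^{N} w_i^2 \;+\; \text{const}
\]

so maximizing the log posterior (log likelihood plus log prior) gives the L2-regularized objective of Eq. 7.17 with \(\alpha = 1/(2\sigma^2)\); the same computation with a Laplace prior, \(p(w_i) \propto \exp(-|w_i|/b)\), recovers the L1 objective of Eq. 7.19 with \(\alpha = 1/b\).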
Image of page 98
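To see the sparsity contrast between the two penalties in practice, here is a minimal sketch assuming scikit-learn is available; the synthetic dataset, the solver, and the regularization strength C (scikit-learn's inverse of α) are illustrative choices, not values from the text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 1000 examples, 50 features, only 5 of them actually informative.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           random_state=0)

for penalty in ("l2", "l1"):
    # liblinear supports both penalties; C is the inverse regularization strength.
    clf = LogisticRegression(penalty=penalty, C=1.0, solver="liblinear")
    clf.fit(X, y)
    nonzero = np.count_nonzero(clf.coef_)
    print(f"{penalty}: {nonzero} of {clf.coef_.size} weights are nonzero")
```

On data like this, the L2 model typically keeps all 50 weights nonzero (many of them small), while the L1 model drives many of the uninformative weights exactly to zero.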