jurafsky&martin_3rdEd_17 (1).pdf


However, z can’t be the output of the classifier, since it’s a vector of real-valued numbers, while what we need for classification is a vector of probabilities. There is a convenient function for normalizing a vector of real values, by which we mean converting it to a vector that encodes a probability distribution (all the numbers lie between 0 and 1 and sum to 1): the softmax function. For a vector z of dimensionality D, the softmax is defined as:

\[ \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{D} e^{z_j}} \qquad 1 \le i \le D \tag{8.10} \]

Thus, for example, given a vector z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1], softmax(z) is [0.055, 0.090, 0.0067, 0.10, 0.74, 0.010].

You may recall that softmax is exactly what is used to create a probability distribution from a vector of real-valued numbers (computed by summing weights times features) in logistic regression in Chapter 7; the equation for computing the probability of y being of class c given x in multinomial logistic regression was (repeated here as Eq. 8.11):

\[ p(c \mid x) = \frac{\exp\left(\sum_{i=1}^{N} w_i f_i(c, x)\right)}{\sum_{c' \in C} \exp\left(\sum_{i=1}^{N} w_i f_i(c', x)\right)} \tag{8.11} \]
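As a quick numerical check of Eq. 8.10, here is a minimal Python sketch that applies softmax to the example vector above. The function name and the max-subtraction step are assumptions added for illustration (the subtraction is a common numerical-stability trick that does not change the result); they are not part of the text.

```python
import math

def softmax(z):
    """Map a list of real-valued scores to a probability distribution (Eq. 8.10)."""
    # Assumption for illustration: subtract the max score before exponentiating.
    # This avoids overflow for large scores and leaves the ratios unchanged.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# The example vector from the text.
z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]
probs = softmax(z)

# Every entry lies between 0 and 1, and the entries sum to 1.
print([round(p, 3) for p in probs])
print(sum(probs))
```

The printed values should agree, up to rounding, with the softmax(z) vector given above, with most of the probability mass on the fifth entry (the score 3.2).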

In other words, we can think of a neural network classifier with one hidden layer as building a vector h which is a hidden layer representation of the input, and then running standard logistic regression on the features that the network develops in h. By contrast, in Chapter 7 the features were mainly designed by hand via feature templates. So a neural network is like logistic regression, but (a) with many layers, since a deep neural network is like layer after layer of logistic regression classifiers, and (b) rather than forming the features by feature templates, the prior layers of the network induce the feature representations themselves.

Here are the final equations for a feed-forward network with a single hidden layer, which takes an input vector x, outputs a probability distribution y, and is parameterized by weight matrices W and U and a bias vector b:

\[ h = \sigma(Wx + b) \tag{8.12} \]
\[ z = Uh \tag{8.13} \]
\[ y = \mathrm{softmax}(z) \tag{8.14} \]

8.4 Training Neural Nets

To train a neural net, meaning to set the weights and biases W and b for each layer, we use optimization methods like stochastic gradient descent, just as with logistic regression in Chapter 7. Let’s use the variable θ to mean all the parameters we need to learn (W and b for each layer). The intuition of gradient descent is to start with some initial guess at θ, for example setting all the weights randomly, and then nudge the weights (i.e., change θ slightly) in a direction that improves our system.

8.4.1 Loss function

If our goal is to move our weights in a way that improves the system, we’ll obviously need a metric for whether the system has improved or not. The neural nets we have been describing are supervised classifiers, which means we know the right answer for each observation in the training set. So our goal is for the output from the network for each training instance to be as close as possible to the correct gold label.
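The forward computation in Eqs. 8.12 to 8.14, whose output y is what training compares against the gold label, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layer sizes, the input vector, the random initialization range, and the helper names (`matvec`, `forward`) are all hypothetical, and the weights are random rather than trained.

```python
import math
import random

def sigmoid(v):
    """Logistic function, used here as the hidden-layer nonlinearity sigma."""
    return 1.0 / (1.0 + math.exp(-v))

def softmax(z):
    """Normalize real-valued scores into a probability distribution."""
    m = max(z)  # subtracting the max is only for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def forward(x, W, b, U):
    """One-hidden-layer feed-forward pass."""
    h = [sigmoid(a + b_i) for a, b_i in zip(matvec(W, x), b)]  # h = sigma(Wx + b)
    z = matvec(U, h)                                           # z = Uh
    return softmax(z)                                          # y = softmax(z)

# Hypothetical sizes and values, chosen only for illustration.
random.seed(0)
n_in, n_hidden, n_classes = 4, 3, 2
W = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b = [0.0] * n_hidden
U = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_classes)]

x = [1.0, 0.5, -0.3, 2.0]
y = forward(x, W, b, U)
print(y)  # a probability distribution over the n_classes output classes
```

With random weights the distribution y is arbitrary; training would adjust W, b, and U so that y places more probability on the correct class for each training instance.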