# Lecture 8 - CS221 Lecture Notes: Supervised Learning Summary


In the previous two lectures, we discussed many specific algorithms for supervised learning. Now we're going to take a step back and discuss some of the principles of how to use these learning algorithms to achieve good performance.

## 1 Multi-class classification

When discussing logistic regression and decision trees, we simplified our task by focusing on binary classification tasks, where there are only two categories to distinguish. However, many problems require us to distinguish more than two categories. Many binary classification algorithms can be extended to deal directly with multiple classes, but there is one general approach we can take even for algorithms which don't have straightforward multiclass extensions. In one-vs.-all (also called one-vs.-many or one-vs.-rest), if we are trying to distinguish between N different classes, we train N different classifiers, each of which tries to distinguish one class from all the rest. For instance, suppose we are given the three-class data shown in Figure 1(a). We construct three different classification problems, each of which uses one of the three classes for the positive examples and the other two classes for the negative examples. The resulting classifiers are shown in Figure 1(b-d).

How do we combine these classifiers to get a prediction on a novel example x? Each of the classifiers outputs some sort of confidence score that it sees a positive example. For instance, with logistic regression the confidence score is given by h_θ(x); for decision trees, it is the probability estimate associated with the corresponding leaf node. Our prediction on the new example x is simply the class whose classifier returns the highest confidence score that the example is a member of that class.
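The one-vs.-all procedure described above can be sketched in a few lines. This is a minimal illustration, not code from the notes: the synthetic three-cluster dataset, the gradient-ascent hyperparameters, and all function names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.5, steps=3000):
    """Fit one binary logistic regressor by gradient ascent on the log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta += lr * X.T @ (y - sigmoid(X @ theta)) / len(y)
    return theta

def one_vs_all_train(X, labels, num_classes):
    # One binary subproblem per class: class k is positive, all other classes negative.
    X = np.hstack([np.ones((len(X), 1)), X])  # intercept term
    return [train_binary(X, (labels == k).astype(float)) for k in range(num_classes)]

def one_vs_all_predict(thetas, X):
    X = np.hstack([np.ones((len(X), 1)), X])
    # Each classifier's confidence h_theta(x); predict the class with the highest score.
    scores = np.stack([sigmoid(X @ th) for th in thetas], axis=1)
    return scores.argmax(axis=1)

# Hypothetical three-class data: three well-separated 2-D clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 20)

thetas = one_vs_all_train(X, y, num_classes=3)
preds = one_vs_all_predict(thetas, X)
```

Note that the N classifiers are trained completely independently; only at prediction time are their confidence scores compared.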
[Figure 1: (a) A multiclass classification problem, with three categories. (b-d) Learned classifiers for each of the binary classification subproblems in one-vs.-all.]

## 2 Bias, variance, and generalization error

In the first machine learning lecture, we introduced the ideas of overfitting and underfitting. Recall that we said a model underfits the training data if, like the first model in Figure 2(b), it does not capture all of the structure available from the data. On the other hand, a model overfits if it captures too many of the idiosyncrasies of the training data, as in Figure 2(d). In this section, we define more formally what we mean by overfitting and underfitting.

### 2.1 Regression

For the moment, let's focus on the regression problem. Suppose we have a training set S_train = {(x^(1), y^(1)), ..., (x^(m), y^(m))} sampled independently
[Figure 2: (a) Five data points to which we would like to fit a polynomial model.]
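The underfitting/overfitting behavior sketched in Figure 2 can be reproduced numerically. The following is an illustrative sketch, not from the notes: it assumes a hypothetical smooth target function (a sine curve plus noise) and compares training error against error on fresh held-out samples as the polynomial degree grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m noisy samples from a hypothetical smooth target function."""
    x = rng.uniform(0.0, 4.0, m)
    y = np.sin(x) + rng.normal(0.0, 0.1, m)
    return x, y

x_train, y_train = sample(10)   # a small training set, as in Figure 2(a)
x_test, y_test = sample(200)    # fresh samples to estimate generalization error

errors = {}
for degree in (1, 3, 7):
    # Least-squares polynomial fit of the given degree to the training set.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
```

Training error always shrinks as the degree grows, since a higher-degree polynomial can fit the ten training points ever more closely; the held-out error is what reveals when the model has started fitting noise rather than structure.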
