CSC 2515 (2008), Lecture 10: Support Vector Machines
Geoffrey Hinton
Getting good generalization on big datasets
• If we have a big data set that needs a complicated model, the full Bayesian framework is very computationally expensive.
• Is there a frequentist method that is faster but still generalizes well?
Preprocessing the input vectors
• Instead of trying to predict the answer directly from the raw inputs, we could start by extracting a layer of "features".
  – Sensible if we already know that certain combinations of input values would be useful (e.g. edges or corners in an image).
• Instead of learning the features we could design them by hand (see the sketch below).
  – The hand-coded features are equivalent to a layer of non-linear neurons that do not need to be learned.
  – If we use a very big set of features for a two-class problem, the classes will almost certainly be linearly separable.
• But surely the linear separator will give poor generalization.
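As a concrete illustration, here is a minimal sketch (numpy, with made-up XOR-style data; none of this is from the lecture) of a fixed, hand-coded feature layer: the raw 2-D inputs are not linearly separable, but adding one extra feature, the product of the two inputs, lets an ordinary linear classifier separate them.

```python
import numpy as np

# Toy XOR-style data: not linearly separable in the raw 2-D inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

def hand_coded_features(X):
    """A fixed, non-learned feature layer: the raw inputs plus their product."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

Z = hand_coded_features(X)

# A plain perceptron on the expanded features finds a separating plane.
w, b = np.zeros(Z.shape[1]), 0.0
for _ in range(100):
    for z, t in zip(Z, y):
        if t * (z @ w + b) <= 0:        # misclassified: nudge the boundary
            w, b = w + t * z, b + t

print(np.sign(Z @ w + b))  # [-1.  1.  1. -1.], matching y
```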
Is preprocessing cheating?
• It's cheating if we use a carefully designed set of task-specific, hand-coded features and then claim that the learning algorithm solved the whole problem.
  – The really hard bit is done by designing the features.
• It's not cheating if we learn the non-linear preprocessing.
  – This makes learning much more difficult and much more interesting (e.g. backpropagation after pre-training).
• It's not cheating if we use a very big set of non-linear features that is task-independent.
  – Support Vector Machines do this.
  – They have a clever way to prevent overfitting (first half of lecture).
  – They have a very clever way to use a huge number of features without requiring nearly as much computation as seems to be necessary (second half of lecture; a small sketch of this idea follows).
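The "huge number of features without the computation" point is the kernel trick, developed in the second half of the lecture. A small numerical sketch (numpy; the vectors are made up) shows the idea: a degree-2 polynomial kernel returns exactly the inner product of an explicit quadratic feature expansion, without ever constructing those features.

```python
import numpy as np

def quad_features(x):
    """Explicit degree-2 feature map for a 2-D input (6 features)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel: the same inner product, computed
    directly from the original 2-D inputs."""
    return (1.0 + x @ y) ** 2

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])

print(quad_features(x) @ quad_features(y))  # inner product in the 6-D space
print(poly_kernel(x, y))                    # same number, no features built
```

With d raw inputs and polynomial degree k, the explicit feature space has roughly d^k dimensions, while the kernel is still a single dot product plus a power.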
A hierarchy of model classes
• Some model classes can be arranged in a hierarchy of increasing complexity.
• How do we pick the best level in the hierarchy for modeling a given dataset?
A way to choose a model class
• We want to get a low error rate on unseen data.
  – This is called "structural risk minimization".
• It would be really helpful if we could get a guarantee of the following form:

      test error rate ≤ train error rate + f(N, h, p)

  where N = size of training set, h = measure of the model complexity, and p = the probability that this bound fails. We need p to allow for really unlucky test sets.
• Then we could choose the model complexity that minimizes the bound on the test error rate.
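For reference, one standard guarantee with exactly this shape is Vapnik's VC bound; quoting it here is an illustration, and whether the lecture later uses this exact form is an assumption. With probability at least 1 − p over the choice of training set,

$$
\text{test error} \;\le\; \text{train error} \;+\; \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) + \ln\frac{4}{p}}{N}}
$$

where h is the VC dimension of the model class, playing the role of the complexity measure above.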
A weird measure of model complexity
• Suppose that we pick n datapoints and assign labels of + or – to them at random.
• If our model class (e.g. a neural net with a certain number of hidden units) is powerful enough to learn any association of labels with the data, it's too powerful!
• Maybe we can characterize the power of a model class by the largest number of datapoints for which it can learn every possible assignment of labels (a rough check of this idea is sketched below).
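A rough computational version of that test (a sketch assuming numpy and scipy are available; the points and the choice of a linear model class are illustrative, not from the lecture): enumerate every +/– labelling of a small set of points and ask, via a feasibility linear program, whether the model class can realise all of them.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Exact check: is there a linear boundary w.x + b with
    y_i * (w.x_i + b) >= 1 for every point?  (Feasibility LP.)"""
    n, d = X.shape
    # Variables: [w (d entries), b].  Constraint: -y_i*(x_i.w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0

def can_shatter(X):
    """Can a linear classifier realise *every* +/- labelling of X?"""
    return all(separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0., 0.], [1., 0.], [0., 1.]])
four  = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(can_shatter(three))  # True : 3 points in general position
print(can_shatter(four))   # False: the XOR labelling defeats any line
```

The largest n for which such a test can succeed is the kind of capacity measure h that the bound on the previous slide needs.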