Regression
CPS 170, 4/8/10
Ron Parr

Regression figures provided by Christopher Bishop and © 2007 Christopher Bishop
With content adapted from Lise Getoor, Tom Dietterich, Andrew Moore & Rich Maclin

Supervised Learning
• Given: Training set
• Goal: Good performance on test set
• Assumptions:
– Training samples are independently drawn and identically distributed (IID)
– Test set is from the same distribution as the training set

Fitting Continuous Data (Regression)
• Datum i has feature vector: φ(x^(i)) = (φ_1(x^(i)), …, φ_k(x^(i)))
• Has real-valued target: t^(i)
• Concept space: linear combinations of features: y(x^(i)) = w^T φ(x^(i))
• Learning objective: Search to find the "best" w
• (This is standard "data fitting" that most people learn in some form or another.)

Linearity of Regression
• Regression is typically considered a linear method, but…
• Features are not necessarily linear
• Features are not necessarily linear
• Features are not necessarily linear
• Features are not necessarily linear
• and, BTW, features are not necessarily linear

Regression Examples
• Predicting housing price from: house size, lot size, rooms, neighborhood*, etc.
• Predicting weight from: sex, height, ethnicity, etc.
• Predicting life expectancy increase from: medication, disease state, etc.
• Predicting crop yield from: precipitation, fertilizer, temperature, etc.
• Fitting polynomials: features are monomials

Features/Basis Functions
• Polynomials
• Indicators
• Gaussian densities
• Step functions or sigmoids
• Sinusoids (Fourier basis)
• Wavelets
• Anything you can imagine…

What is "best"?
• No obvious answer to this question
• Three compatible answers:
– Minimize squared error on the training set
– Maximize likelihood of the data (under certain assumptions)
– Project the data onto the "closest" approximation
• Other answers possible

Minimizing Squared Training Set Error
• Why is this good?
• How could this be bad?
• Minimize: E(w) = ∑_{i=1}^N (w^T φ(x^(i)) − t^(i))^2

Minimizing E by Gradient Descent
[Figure: error surface E(w) with gradient vector]
• Start with an initial weight vector w^0
• Compute the gradient ∇E(w^t)
• Compute w^{t+1} = w^t − α ∇E(w^t), where α is the step size
• Repeat until convergence
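The loop above can be sketched in code. This is a minimal illustration, not code from the slides; the demo data, step size, and iteration count are my own choices:

```python
import numpy as np

def gradient_descent(Phi, t, alpha=0.01, iters=5000):
    """Batch gradient descent on E(w) = sum_i (w^T phi(x^(i)) - t^(i))^2.

    Phi: n-by-k design matrix whose rows are feature vectors; t: n targets.
    """
    w = np.zeros(Phi.shape[1])            # initial weight vector w^0
    for _ in range(iters):
        grad = 2 * Phi.T @ (Phi @ w - t)  # gradient of the squared error
        w = w - alpha * grad              # step downhill; alpha = step size
    return w

# Tiny demo: recover y = 1 + 2x with features (1, x)
x = np.array([0.0, 1.0, 2.0, 3.0])
Phi = np.column_stack([np.ones_like(x), x])
t = 1 + 2 * x
w = gradient_descent(Phi, t)
```

In batch mode the full gradient is used on every step; an incremental (stochastic) variant would instead update on one example at a time, typically with a decreasing step size.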
(Adapted from Lise Getoor's slides)

Gradient Descent Issues
• For this particular problem:
– A global minimum exists
– Convergence is "guaranteed" if done in "batch"
• In general:
– Local optimum only
– Batch mode is more stable
– Incremental updates are possible
• Can oscillate
• Use a decreasing step size (Robbins-Monro) to stabilize

Solving the Minimization Directly
E = ∑_{i=1}^n (t^(i) − w^T φ(x^(i)))^2
∇_w E ∝ ∑_{i=1}^n (t^(i) − w^T φ(x^(i))) φ(x^(i))^T
(a scalar times a row vector)
Set the gradient to 0 to find the minimum:
∑_{i=1}^n (t^(i) − w^T φ(x^(i))) φ(x^(i))^T = 0
∑_{i=1}^n t^(i) φ(x^(i))^T − w^T ∑_{i=1}^n φ(x^(i)) φ(x^(i))^T = 0
t^T Φ − w^T Φ^T Φ = 0, i.e., Φ^T t − Φ^T Φ w = 0
w = (Φ^T Φ)^{-1} Φ^T t
where Φ is the matrix whose i-th row is the feature vector φ(x^(i)):
Φ = [φ(x^(1)); φ(x^(2)); …; φ(x^(n))]

What is the Best Choice of Features?
[Figure: noisy samples from a sinusoidal source]
[Figures: polynomial fits of degree 0, 1, 3, and 9 to the samples]

Observations
• Degree 3 is the best match to the source
• Degree 9 is the best match to the samples
• Performance on test data: [Figure: training vs. test error by degree]
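The degree-m fits above can be reproduced from the closed-form solution w = (Φ^T Φ)^{-1} Φ^T t. A minimal sketch on synthetic sinusoidal data (my own stand-in for Bishop's dataset, not the actual one):

```python
import numpy as np

def poly_design(x, degree):
    """Design matrix Phi with monomial features 1, x, ..., x^degree."""
    return np.column_stack([x**j for j in range(degree + 1)])

def fit_least_squares(Phi, t):
    """Least-squares solution w = (Phi^T Phi)^(-1) Phi^T t.

    np.linalg.lstsq solves the same problem without forming the inverse,
    which is numerically safer than a literal matrix inversion.
    """
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def train_error(x, t, degree):
    Phi = poly_design(x, degree)
    w = fit_least_squares(Phi, t)
    return np.sum((Phi @ w - t) ** 2)

# Ten noisy samples from a sinusoidal source
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)

errors = {d: train_error(x, t, d) for d in (0, 1, 3, 9)}
# Training error can only shrink as the degree grows; with 10 points,
# the degree-9 fit interpolates the samples (near-zero training error).
```

This is exactly the "degree 9 is the best match to the samples" phenomenon: training error is a poor guide to test performance.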
Bias and Variance
• Bias: How much of our error comes from our choice of hypothesis space?
• Variance: How much of our error comes from noise in the training data?

Example: 20 points
y = x + 2 sin(1.5x) + N(0, 0.2)
(the N(0, 0.2) term is noise)
• Hypothesis space: linear in x
• 50 fits (20 examples each)
• What are we seeing here?
[Figure: bias — the average fit differs from the true function]
[Figure: variance — the fits differ from each other across training sets]
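The figures above can be reproduced in spirit with a small simulation. The query point, the degree-9 comparison model, and the reading of N(0, 0.2) as a standard deviation of 0.2 are my own assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dataset(n=20):
    """Draw n points from y = x + 2 sin(1.5 x) + noise on [0, 5].
    A noise standard deviation of 0.2 is assumed here."""
    x = rng.uniform(0.0, 5.0, size=n)
    y = x + 2 * np.sin(1.5 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x0 = 2.5                              # query point (assumed)
true_y = x0 + 2 * np.sin(1.5 * x0)    # noise-free target at x0

# 50 fits (20 examples each), for a rigid and a flexible hypothesis space
preds = {1: [], 9: []}
for _ in range(50):
    x, y = sample_dataset()
    for d in preds:
        preds[d].append(np.polyval(np.polyfit(x, y, d), x0))

# stats[d] = (squared bias at x0, variance across training sets)
stats = {d: ((np.mean(p) - true_y) ** 2, np.var(p)) for d, p in preds.items()}
```

The linear model's average prediction misses the sinusoidal bump at x0 (bias), while its 50 fits agree closely with each other (low variance); the degree-9 fits typically behave the other way around.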
Trade-off Between Bias and Variance
• Is the problem a bad choice of polynomial?
• Is the problem that we don't have enough data?
• Answer: Yes
• For small datasets:
– Lower bias → higher variance
– Higher bias → lower variance

Bias and Variance: Lessons Learned
• When data are scarce relative to the "capacity" of our hypothesis space:
– Variance can be a problem
– Restricting the hypothesis space can reduce variance at the cost of increased bias
• When data are plentiful:
– Variance is less of a concern
– We may afford to use a richer hypothesis space

Methods for Choosing Features
• Cross validation
• Regularization

Cross Validation
• Suppose we have many possible hypothesis spaces, e.g., different degree polynomials
• Recall our empirical performance results: [Figure: training vs. test error by degree]
• Why not use the data to find the minimum of the red (test error) curve?

Implementing Cross Validation
• Many possible approaches to cross validation
• A typical approach divides the data into k equally sized chunks:
– Do k instances of learning
– For each instance, hold out 1/k of the data
– Train on the remaining (k−1)/k fraction of the data
– Test on the held-out data
– Average the results
• Can also sample subsets of the data with replacement
• Cross validation can be used to search a range of hypothesis classes to find where overfitting starts
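The recipe above can be sketched as follows; using it to select a polynomial degree is my own illustrative demo, and all names and data are assumptions:

```python
import numpy as np

def k_fold_cv_error(x, t, degree, k=5, seed=0):
    """Average held-out squared error of a degree-d polynomial fit,
    estimated by k-fold cross validation."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)      # k roughly equal chunks
    errs = []
    for i in range(k):                  # k instances of learning
        test = folds[i]                 # hold out 1/k of the data
        train = np.concatenate(folds[:i] + folds[i + 1:])  # train on the rest
        w = np.polyfit(x[train], t[train], degree)
        errs.append(np.mean((np.polyval(w, x[test]) - t[test]) ** 2))
    return np.mean(errs)                # average the results

# Use CV error to compare hypothesis spaces (polynomial degrees)
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)
cv = {d: k_fold_cv_error(x, t, d) for d in (0, 1, 3, 9)}
best = min(cv, key=cv.get)
```

Shuffling the indices before splitting matters: contiguous folds on sorted inputs would force the model to extrapolate on every held-out chunk.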
Problems with Cross Validation
• Cross validation is a sound method, but it requires a lot of data and/or is slow
• Must trade off two factors:
– Want enough data within each run
– Want to average over enough trials
• With scarce data:
– Choose k close to n
– Almost as many learning problems as data points
• With abundant data (then why are you doing cross validation?):
– Choose k = a small constant, e.g., 10
– Not too painful if you have a lot of parallel computing resources and a lot of data, e.g., if you are Google

Regularization
• Cross validation may also be impractical if the range of hypothesis classes cannot easily be enumerated or searched iteratively
• Regularization aims to avoid overfitting, while:
– Avoiding the speed penalty of cross validation
– Not assuming an ordering on hypothesis spaces

Regularization
• Idea: Penalize overly complicated answers
• Ordinary regression minimizes: E(w) = ∑_{i=1}^N (w^T φ(x^(i)) − t^(i))^2
• L2-regularized regression minimizes: E(w) = ∑_{i=1}^N (w^T φ(x^(i)) − t^(i))^2 + λ‖w‖^2
• Note: May exclude constants from the norm

L2 Regularization: Why?
• For polynomials, extreme curves typically require extreme coefficient values
• In general, this encourages use of features only when they lead to a substantial increase in performance
• Problem: How to choose λ (cross validation?)

The L2 Regularized Solution
• Minimize: E(w) = ∑_{i=1}^N (w^T φ(x^(i)) − t^(i))^2 + λ‖w‖^2
• Set the gradient to 0 and solve for w for features Φ: w = (Φ^T Φ + λI)^{-1} Φ^T t
• Compare with the unregularized solution: w = (Φ^T Φ)^{-1} Φ^T t
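The regularized solution can be checked numerically; a minimal sketch, where the data and λ values are my own illustrative choices:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """L2-regularized least squares: w = (Phi^T Phi + lam I)^(-1) Phi^T t."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ t)

# Degree-9 polynomial features on ten noisy sinusoidal samples
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)
Phi = np.column_stack([x**j for j in range(10)])

# Increasing lambda shrinks the weights toward 0 ("flatter" solutions)
norms = [np.linalg.norm(ridge_fit(Phi, t, lam)) for lam in (1e-6, 1e-3, 1e-1, 10.0)]
```

Note that the λI term also makes the linear system well conditioned even when Φ^T Φ is nearly singular, which is exactly the regime where the unregularized solution produces extreme weights.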
Regularization Example
High regularization produces "flat" solutions because the weights must approach 0. Lower values of λ allow for more curviness in the fitted function.

Concluding Comments
• Regression is the most basic machine learning algorithm for continuous targets
• Multiple views are all equivalent:
– Minimize squared loss
– Maximize likelihood
– Orthogonal projection
• Big question: Choosing features
• A step towards understanding this: the bias/variance trade-off
• Cross validation and regularization automate (to some extent) the balancing of bias and variance
This note was uploaded on 02/17/2012 for the course COMPSCI 170 taught by Professor Parr during the Spring '11 term at Duke.