regression[1] - 4/8/10 Regression CPS 170 Ron Parr...


4/8/10
Regression
CPS 170
Ron Parr

Regression figures provided by Christopher Bishop and © 2007 Christopher Bishop
With content adapted from Lise Getoor, Tom Dietterich, Andrew Moore & Rich Maclin

Supervised Learning
•  Given: Training set
•  Goal: Good performance on test set
•  Assumptions:
   –  Training samples are independently drawn and identically distributed (IID)
   –  Test set is from the same distribution as the training set

Fitting Continuous Data (Regression)
•  Datum i has feature vector: $\phi(x^{(i)}) = (\phi_1(x^{(i)}), \ldots, \phi_k(x^{(i)}))$
•  Has real-valued target: $t^{(i)}$
•  Concept space: linear combinations of features, $w^T \phi(x)$
•  Learning objective: Search to find the "best" w
•  (This is standard "data fitting" that most people learn in some form or another.)

Linearity of Regression
•  Regression is typically considered a linear method, but…
•  Features not necessarily linear
•  Features not necessarily linear
•  Features not necessarily linear
•  Features not necessarily linear
•  and, BTW, features not necessarily linear

Regression Examples
•  Predicting housing price from:
   –  House size, lot size, rooms, neighborhood*, etc.
•  Predicting weight from:
   –  Sex, height, ethnicity, etc.
•  Predicting life expectancy increase from:
   –  Medication, disease state, etc.
•  Predicting crop yield from:
   –  Precipitation, fertilizer, temperature, etc.
•  Fitting polynomials
   –  Features are monomials

Features/Basis Functions
•  Polynomials
•  Indicators
•  Gaussian densities
•  Step functions or sigmoids
•  Sinusoids (Fourier basis)
•  Wavelets
•  Anything you can imagine…

What is "best"?
•  No obvious answer to this question
•  Three compatible answers:
   –  Minimize squared error on training set
   –  Maximize likelihood of the data (under certain assumptions)
   –  Project data into "closest" approximation
•  Other answers possible

Minimizing Squared Training Set Error
•  Why is this good?
•  How could this be bad?
•  Minimize:

$$E(w) = \sum_{i=1}^{N} \left( w^T \phi(x^{(i)}) - t^{(i)} \right)^2$$

Minimizing E by Gradient Descent
[Figure: error surface E(w) plotted over w, showing the gradient vector and the iterates w0, w1, w2]
•  Start with initial weight vector $w_0$
•  Compute the gradient $\nabla_w E(w_j)$
•  Compute $w_{j+1} = w_j - \alpha \nabla_w E(w_j)$, where α is the step size
•  Repeat until convergence
(Adapted from Lise Getoor's slides)

Gradient Descent Issues
•  For this particular problem:
   –  Global minimum exists
   –  Convergence "guaranteed" if done in "batch"
•  In general:
   –  Local optimum only
   –  Batch mode more stable
   –  Incremental possible
      •  Can oscillate
      •  Use decreasing step size (Robbins-Monro) to stabilize

Solving the Minimization Directly

$$E = \sum_{i=1}^{n} \left( t^{(i)} - w^T \phi(x^{(i)}) \right)^2$$

$$\nabla_w E \propto \sum_{i=1}^{n} \underbrace{\left( t^{(i)} - w^T \phi(x^{(i)}) \right)}_{\text{scalar}} \underbrace{\phi(x^{(i)})^T}_{\text{row vector}}$$

Set gradient to 0 to find min:

$$\sum_{i=1}^{n} \left( t^{(i)} - w^T \phi(x^{(i)}) \right) \phi(x^{(i)})^T = 0$$

$$\sum_{i=1}^{n} t^{(i)} \phi(x^{(i)})^T - w^T \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T = 0$$

$$t^T \Phi - w^T \Phi^T \Phi = 0 \quad\Longleftrightarrow\quad \Phi^T t - \Phi^T \Phi w = 0$$

$$w = (\Phi^T \Phi)^{-1} \Phi^T t$$

where the i-th row of Φ is the feature vector of datum i:

$$\Phi = \begin{bmatrix} \phi(x^{(1)})^T \\ \phi(x^{(2)})^T \\ \vdots \\ \phi(x^{(n)})^T \end{bmatrix}$$

What is the Best Choice of Features?
[Figure: Noisy Source Data]
[Figures: Degree 0 Fit, Degree 1 Fit, Degree 3 Fit, Degree 9 Fit]

Observations
•  Degree 3 is the best match to the source
•  Degree 9 is the best match to the samples
•  Performance on test data: [figure not included in this preview]

Bias and Variance
•  Bias: How much of our error comes from our choice of hypothesis space?
•  Variance: How much of our error comes from noise in the training data?

Example: 20 points
•  y = x + 2 sin(1.5x) + N(0, 0.2)
   (the N(0, 0.2) term is the noise)
•  Hypothesis space = linear in x

50 fits (20 examples each)
•  What are we seeing here?
[Figures: Bias; Variance]

Trade off Between Bias and Variance
•  Is the problem a bad choice of polynomial?
•  Is the problem that we don't have enough data?
•  Answer: Yes
•  For small datasets:
   –  Lower bias -> Higher variance
   –  Higher bias -> Lower variance

Bias and Variance: Lessons Learned
•  When data are scarce relative to the "capacity" of our hypothesis space
   –  Variance can be a problem
   –  Restricting the hypothesis space can reduce variance at the cost of increased bias
•  When data are plentiful
   –  Variance is less of a concern
   –  May afford to use a richer hypothesis space

Methods for Choosing Features
•  Cross validation
•  Regularization

Cross Validation
•  Suppose we have many possible hypothesis spaces, e.g., different degree polynomials
•  Recall our empirical performance results: [figure not included in this preview]
•  Why not use the data to find the min of the red curve?

Implementing Cross Validation
•  Many possible approaches to cross validation
•  Typical approach divides data into k equally sized chunks:
   –  Do k instances of learning
   –  For each instance hold out 1/k of the data
   –  Train on the (k-1)/k fraction of the data
   –  Test on held out data
   –  Average results
•  Can also sample subsets of data with replacement
•  Cross validation can be used to search a range of hypothesis classes to find where overfitting starts

Problems with Cross Validation
•  Cross validation is a sound method, but requires a lot of data and/or is slow
•  Must trade off two factors:
   –  Want enough data within each run
   –  Want to average over enough trials
•  With scarce data:
   –  Choose k close to n
   –  Almost as many learning problems as data points
•  With abundant data (then why are you doing cross validation?)
   –  Choose k = a small constant, e.g., 10
   –  Not too painful if you have a lot of parallel computing resources and a lot of data, e.g., if you are Google

Regularization
•  Cross validation may also be impractical if the range of hypothesis classes is not easily enumerated and searched iteratively
•  Regularization aims to avoid overfitting, while
   –  Avoiding the speed penalty of cross validation
   –  Not assuming an ordering on hypothesis spaces

Regularization
•  Idea: Penalize overly complicated answers
•  Ordinary regression minimizes:

$$\sum_{i=1}^{n} \left( t^{(i)} - w^T \phi(x^{(i)}) \right)^2$$

•  L2 regularized regression minimizes:

$$\sum_{i=1}^{n} \left( t^{(i)} - w^T \phi(x^{(i)}) \right)^2 + \lambda \|w\|_2^2$$

•  Note: May exclude constants from the norm

L2 Regularization: Why?
•  For polynomials, extreme curves typically require extreme values
•  In general, encourages use of features only when they lead to a substantial increase in performance
•  Problem: How to choose λ (cross validation?)

The L2 Regularized Solution
•  Minimize:

$$\sum_{i=1}^{n} \left( t^{(i)} - w^T \phi(x^{(i)}) \right)^2 + \lambda \|w\|_2^2$$

•  Set gradient to 0, solve for w for features Φ:

$$w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t$$

•  Compare with the unregularized solution:

$$w = (\Phi^T \Phi)^{-1} \Phi^T t$$

Regularization Example
[Figure: fits under different amounts of regularization]
High regularization produces "flat" solutions because weights must approach 0. Lower values allow for more curviness in the value function.

Concluding Comments
•  Regression is the most basic machine learning algorithm for continuous targets
•  Multiple views are all equivalent:
   –  Minimize squared loss
   –  Maximize likelihood
   –  Orthogonal projection
•  Big question: Choosing features
•  Step towards understanding this: Bias/variance trade off
•  Cross validation and regularization automate (to some extent) balancing bias and variance
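
The closed-form solution, L2 regularization, and k-fold cross validation covered in these slides can be sketched together in a short Python script. This is an illustration only, not code from the course: the function names (`poly_features`, `solve`, `fit`, `mse`, `cross_val_mse`), the sin(2πx) data generator, the noise level, and the interleaved fold assignment are all assumptions made for the example. It uses only the standard library, so a small Gaussian-elimination helper stands in for a linear-algebra package.

```python
import math
import random


def poly_features(x, degree):
    """Monomial basis (assumed for this sketch): phi(x) = (1, x, ..., x^degree)."""
    return [x ** j for j in range(degree + 1)]


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * w[c] for c in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def fit(xs, ts, degree, lam=0.0):
    """Solve (Phi^T Phi + lam I) w = Phi^T t; lam = 0 gives ordinary least squares.

    Simplification: the constant weight is penalized along with the others.
    """
    Phi = [poly_features(x, degree) for x in xs]
    k = degree + 1
    A = [[sum(row[a] * row[b] for row in Phi) + (lam if a == b else 0.0)
          for b in range(k)] for a in range(k)]
    rhs = [sum(row[a] * t for row, t in zip(Phi, ts)) for a in range(k)]
    return solve(A, rhs)


def mse(w, xs, ts):
    """Mean squared error of the polynomial with weights w on (xs, ts)."""
    preds = [sum(wj * x ** j for j, wj in enumerate(w)) for x in xs]
    return sum((p - t) ** 2 for p, t in zip(preds, ts)) / len(xs)


def cross_val_mse(xs, ts, degree, k=5, lam=0.0):
    """k-fold CV: hold out each fold once, train on the rest, average held-out MSE.

    Folds are interleaved (every k-th point), an assumption chosen because
    the example inputs are sorted.
    """
    n = len(xs)
    errors = []
    for fold in range(k):
        held = set(range(fold, n, k))
        tr_x = [xs[i] for i in range(n) if i not in held]
        tr_t = [ts[i] for i in range(n) if i not in held]
        te_x = [xs[i] for i in sorted(held)]
        te_t = [ts[i] for i in sorted(held)]
        errors.append(mse(fit(tr_x, tr_t, degree, lam), te_x, te_t))
    return sum(errors) / k


# Hypothetical noisy data: 30 points of sin(2*pi*x) plus Gaussian noise.
rng = random.Random(0)
xs = [i / 29 for i in range(30)]
ts = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.2) for x in xs]

if __name__ == "__main__":
    for degree in (0, 1, 3, 5):
        train_err = mse(fit(xs, ts, degree), xs, ts)
        cv_err = cross_val_mse(xs, ts, degree)
        print(f"degree {degree}: train MSE {train_err:.4f}, 5-fold CV MSE {cv_err:.4f}")
```

Running the script prints the training error and the cross-validated error for a few polynomial degrees. Training error can only go down as the degree grows, because each hypothesis space contains the previous one; the held-out error is the quantity the slides suggest minimizing. Passing a positive `lam` to `fit` or `cross_val_mse` applies the L2 penalty, and trying several values of `lam` under cross validation is one way to approach the "how to choose λ" question raised above.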

This note was uploaded on 02/17/2012 for the course COMPSCI 170 taught by Professor Parr during the Spring '11 term at Duke.
