Introductory Applied Machine Learning
Nearest Neighbour Methods
Victor Lavrenko and Nigel Goddard
School of Informatics

Overview
• Nearest neighbour method
  – classification and regression
  – practical issues: k, distance, ties, missing values
  – optimality and assumptions
• Making kNN fast:
  – K-D trees
  – inverted indices
  – fingerprinting
• References: W&F sections 4.7 and 6.4

Intuition for kNN
• a set of points (x,y) – two classes
• is the box red or blue?
• how did you do it?
  – use Bayes rule? a decision tree? fit a hyperplane?
• nearby points are red
  – use this as the basis for a learning algorithm

Nearest-neighbor classification
• Use the intuition to classify a new point x:
  – find the most similar training example x'
  – predict its class y'
• Voronoi tessellation
  – partitions the space into regions
  – boundary: points at the same distance from two different training examples
• classification boundary
  – non-linear, reflects the classes well
  – compare to Naive Bayes, decision trees, logistic regression
  – impressive for such a simple method

Nearest neighbour: outliers
• The algorithm is sensitive to outliers
  – a single mislabeled example dramatically changes the boundary
• No confidence estimate P(y|x)
• Insensitive to the class prior
• Idea:
  – use more than one nearest neighbour to make the decision
  – count the class labels in the k most similar training examples
    • many "triangles" will outweigh a single "circle" outlier

kNN classification algorithm
• Given:
  – training examples {xi, yi}
    • xi … attribute-value representation of an example
    • yi … class label: {ham, spam}, digit {0, 1, …, 9}, etc.
  – a testing point x that we want to classify
• Algorithm:
  – compute the distance D(x, xi) to every training example xi
  – select the k closest instances xi1…xik and their labels yi1…yik
  – output the class y* that is most frequent in yi1…yik

Example: handwritten digits
• 16x16 bitmaps, 8-bit grayscale
• Euclidean distance over raw pixels
• Accuracy:
  – 7-NN ~95.2%
  – SVM ~95.8%
  – humans ~97.5%

kNN regression algorithm
• Given:
  – training examples {xi, yi}
    • xi … attribute-value representation of an example
    • yi … real-valued target (profit, rating on YouTube, etc.)
  – a testing point x whose target we want to predict
• Algorithm:
  – compute the distance D(x, xi) to every training example xi
  – select the k closest instances xi1…xik and their targets yi1…yik
  – output the mean of yi1…yik (both procedures are sketched in code below)

Example: kNN regression in 1-d
[Figure: piecewise-constant fits produced by 1-NN, 2-NN and 3-NN regression on one-dimensional data]
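A minimal sketch of the two algorithms above, assuming NumPy arrays and Euclidean distance; the function names (knn_classify, knn_regress) are illustrative, not from the slides:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training examples."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # distance to every example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

def knn_regress(X_train, y_train, x, k=3):
    """Mean target among the k nearest training examples."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return float(y_train[nearest].mean())
```

The two functions differ only in the final aggregation step: the most frequent label for classification, the mean target for regression.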
Choosing the value of k
• The value of k has a strong effect on kNN performance
  – large value → everything is classified as the most probable class, P(y)
  – small value → highly variable, unstable decision boundaries
    • small changes to the training set → large changes in classification
  – k affects the "smoothness" of the boundary
• Selecting the value of k
  – set aside a portion of the training data (a validation set)
  – vary k and observe the training and validation errors
  – pick the k that gives the best generalization performance (sketched in code below)

Distance measures
• A key component of the kNN algorithm
  – defines which examples are similar and which aren't
  – can have a strong effect on performance
• Euclidean (numeric attributes): D(x, x') = √( ∑_d (x_d − x'_d)² )
  – symmetric, spherical, treats all dimensions equally
  – sensitive to extreme differences in a single attribute
    • behaves like a "soft" logical OR
• Hamming (categorical attributes): D(x, x') = ∑_d 1[x_d ≠ x'_d]
  – the number of attributes where x and x' differ
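A minimal sketch of that validation-set selection loop, reusing the knn_classify function from the earlier sketch; the candidate grid is an illustrative assumption:

```python
import numpy as np

def select_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9, 15)):
    """Return the k with the lowest error on the held-out validation set."""
    errors = {}
    for k in candidates:
        preds = np.array([knn_classify(X_train, y_train, x, k) for x in X_val])
        errors[k] = np.mean(preds != y_val)  # validation error rate
    return min(errors, key=errors.get)
```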
Distance measures (2)
• Minkowski distance (p-norm): D(x, x') = ( ∑_d |x_d − x'_d|^p )^(1/p)
  – p = 2: Euclidean
  – p = 1: Manhattan
  – p = 0: Hamming … behaves like a logical AND
  – p = ∞: max_d |x_d − x'_d| … behaves like a logical OR
• Kullback-Leibler (KL) divergence, for histograms (x_d > 0, ∑_d x_d = 1):
  D(x, x') = −∑_d x_d log( x'_d / x_d )
  – asymmetric: the excess bits needed to encode x using x'
• Custom distance measures (e.g. BM25 for text)
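A minimal sketch of these measures, assuming dense NumPy vectors (and, for the KL divergence, strictly positive histograms that sum to one); the function names are illustrative:

```python
import numpy as np

def minkowski(x, xp, p=2.0):
    """p-norm distance: p=1 Manhattan, p=2 Euclidean, p=inf max over dimensions."""
    diffs = np.abs(x - xp)
    if np.isinf(p):
        return float(diffs.max())
    return float((diffs ** p).sum() ** (1.0 / p))

def hamming(x, xp):
    """Number of attributes where x and x' differ (categorical attributes)."""
    return int(np.sum(x != xp))

def kl_divergence(x, xp):
    """KL(x || x'): asymmetric; the excess bits to encode x using x'."""
    return float(-np.sum(x * np.log(xp / x)))
```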
kNN: practical issues
• Resolving ties:
  – an equal number of positive and negative neighbours
  – use an odd k (doesn't solve the multi-class case)
  – breaking ties:
    • random: flip a coin to decide positive / negative
    • prior: pick the class with the greater prior
    • nearest: use the 1-NN classifier to decide
• Missing values
  – have to "fill in", otherwise we can't compute the distance
  – key concern: the fill-in should affect the distance as little as possible
  – reasonable choice: the average value across the entire dataset
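A minimal sketch of the prior-based tie-break, extending the earlier knn_classify; the random and 1-NN strategies would slot into the same spot:

```python
import numpy as np
from collections import Counter

def knn_classify_tiebreak(X_train, y_train, x, k=4):
    """Majority vote; ties go to the class with the greater prior."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    votes = Counter(y_train[np.argsort(dists)[:k]])
    top = max(votes.values())
    tied = [c for c, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]
    priors = Counter(y_train)                  # class frequencies in the training set
    return max(tied, key=lambda c: priors[c])  # prior-based tie-break
```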
kNN, Parzen Windows and Kernels
[Figure: a fixed-radius Parzen window vs. a 3-NN neighbourhood, each defining a region R around the query point]

P(y | x) = (1 / |R(x)|) ∑_{x_i ∈ R(x)} 1[y_i = y]
         = ∑_{x_i ∈ R(x)} 1[y_i = y] ⋅ 1 / ∑_{x_i ∈ R(x)} 1
         = ∑_i 1[y_i = y] ⋅ K(x_i, x) / ∑_i K(x_i, x)

Replacing the hard count (weight 1 inside the region, 0 outside) with a kernel weight K(x_i, x) turns the estimate into a soft, distance-weighted vote over all training points.
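A minimal sketch of that kernel-weighted estimate; the Gaussian kernel and its bandwidth are assumed choices (the slide leaves K unspecified):

```python
import numpy as np

def kernel_class_prob(X_train, y_train, x, y, bandwidth=1.0):
    """P(y|x) as a kernel-weighted vote over all training points."""
    d2 = ((X_train - x) ** 2).sum(axis=1)    # squared distances
    w = np.exp(-d2 / (2 * bandwidth ** 2))   # Gaussian kernel K(x_i, x)
    return float(w[y_train == y].sum() / w.sum())
```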
kNN pros and cons
• Almost no assumptions about the data
  – smoothness: nearby regions of the space → the same class
  – assumptions are implied by the distance function (and only locally!)
  – non-parametric approach: "let the data speak for itself"
    • nothing to infer from the data, except k and possibly D()
    • easy to update in an online setting: just add the new item to the training set
• Need to handle missing data: fill in, or create a special distance
• Sensitive to class outliers (mislabeled training instances)
• Sensitive to lots of irrelevant attributes (they affect the distance)
• Computationally expensive:
  – space: need to store all training examples
  – time: need to compute the distance to all examples: O(nd)
    • n … number of training examples, d … cost of computing the distance
    • as n grows, the system becomes slower and slower
    • the expense is at testing time, not training time (bad)

Summary: kNN
• Key idea: nearby points → the same class
  – important to select a good distance function
• Can be used for classification and regression
• Simple, non-linear, asymptotically optimal
  – does not make assumptions about the data
  – "let the data speak for itself"
• Select k by optimizing the error on a held-out set
• Naive implementations are slow for big datasets
  – use K-D trees (low-d) or inverted lists (high-d)

Why is kNN slow?
[Figure: "what you see" vs. "what the algorithm sees" – finding the nearest neighbours of the testing point (red) means computing the distance to every stored example]
Making kNN fast
• Training is O(d), but testing is O(nd)
• Reduce d: dimensionality reduction
  – simple feature selection; other methods are O(d³)
• Reduce n: don't compare to all training examples
  – idea: quickly identify m << n potential near neighbours
    • compare only to those, and pick the k nearest → O(md) time
  – K-D trees: low-dimensional, real-valued data
    • O(d log₂ n); only works when d << n; inexact: may miss neighbours
  – inverted lists: high-dimensional, discrete data
    • O(n'd') where d' << d, n' << n; only for sparse data (e.g. text); exact
  – locality-sensitive hashing: high-d, discrete or real-valued data
    • O(n'd), n' << n … bits in the fingerprint; inexact: may miss near neighbours

K-D tree example
• Building a K-D tree from the training data:
  – pick a random dimension, find the median, split the data, repeat
• Finding the nearest neighbours of a new point (7,4):
  – find the region containing (7,4)
  – compare to all points in that region
• Training points: (1,9), (2,3), (4,1), (3,7), (5,4), (6,8), (7,2), (8,8), (7,9), (9,6)
  – root split: x ≥ 6
  – left subtree split: y ≥ 4 → {(2,3), (4,1)} vs. {(1,9), (3,7), (5,4)}
  – right subtree split: y ≥ 8 → {(7,2), (9,6)} vs. {(6,8), (8,8), (7,9)}
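A minimal sketch of that build-and-descend strategy, cycling through dimensions rather than picking them at random, and without the backtracking a production K-D tree uses to stay exact; run on the slide's points, the query (7,4) is compared only against its own region:

```python
def build_kdtree(points, depth=0, leaf_size=3):
    """Split at the median of one dimension, alternating dimensions per level."""
    if len(points) <= leaf_size:
        return points                          # leaf: a small bucket of points
    d = depth % len(points[0])
    points = sorted(points, key=lambda p: p[d])
    mid = len(points) // 2
    return (d, points[mid][d],                 # (dimension, threshold, left, right)
            build_kdtree(points[:mid], depth + 1, leaf_size),
            build_kdtree(points[mid:], depth + 1, leaf_size))

def nearest_in_region(tree, x):
    """Descend to x's region, then scan only that region (may miss neighbours)."""
    while isinstance(tree, tuple):
        d, threshold, left, right = tree
        tree = right if x[d] >= threshold else left
    return min(tree, key=lambda p: sum((pi - xi) ** 2 for pi, xi in zip(p, x)))

points = [(1,9), (2,3), (4,1), (3,7), (5,4), (6,8), (7,2), (8,8), (7,9), (9,6)]
print(nearest_in_region(build_kdtree(points), (7, 4)))  # -> (7, 2)
```

On these points the sketch reproduces the splits shown on the slide: x ≥ 6 at the root, then y ≥ 4 and y ≥ 8 in the two subtrees.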
Locality-Sensitive Hashing (LSH)
• Random hyperplanes h_1 … h_k
  – the space is sliced into 2^k regions (polytopes)
  – compare x only to the training points in the same region R
[Figure: random hyperplanes slicing the + / − training points into polytope regions; the query falls into region R]
• Complexity: O(kd + dn/2^k)
  – O(kd) to find the region R, k << n: dot-product x with h_1 … h_k
  – then compare to the n/2^k points in R
• Inexact: missed neighbours
  – repeat with a different set of hyperplanes h_1 … h_k
• Why not a K-D tree?
Inverted list example
• The data structure used by search engines (Google, etc.)
  – for each attribute, list all training examples that contain it
  – assumption: most attribute values are zero (sparseness)
• Given a new testing example:
  – merge the inverted lists for the attributes present in the new example
  – O(dn): d … non-zero attributes in the example, n … average length of an inverted list
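A minimal sketch, assuming sparse examples represented as attribute → value dicts (e.g. word counts for text); only examples sharing at least one non-zero attribute with the query are ever touched:

```python
from collections import defaultdict

def build_inverted_index(examples):
    """Map each attribute to the list of examples with a non-zero value for it."""
    index = defaultdict(list)
    for i, example in enumerate(examples):
        for attr in example:
            index[attr].append(i)
    return index

def candidate_neighbours(query, index):
    """Merge the inverted lists of the query's attributes."""
    hits = set()
    for attr in query:
        hits.update(index.get(attr, []))
    return hits   # compute exact distances over this (usually small) set only
```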
