Introductory Applied Machine Learning
Nearest Neighbour Methods
Victor Lavrenko and Nigel Goddard, School of Informatics

Overview
• Nearest neighbour method
  – classification and regression
  – practical issues: k, distance, ties, missing values
  – optimality and assumptions
• Making kNN fast:
  – K-D trees
  – inverted indices
  – fingerprinting
• References: W&F sections 4.7 and 6.4

Intuition for kNN
• set of points (x, y)
  – two classes
• is the box red or blue?
• how did you do it?
  – use Bayes rule?
  – a decision tree?
  – fit a hyperplane?
• nearby points are red
  – use this as the basis for a learning algorithm

Nearest-neighbour classification
• Use the intuition to classify a new point x:
  – find the most similar training example x'
  – predict its class y'
• Voronoi tessellation
  – partitions space into regions
  – boundary: points at the same distance from two different training examples
• classification boundary
  – non-linear, reflects the classes well
  – compare to NB, DT, logistic regression
  – impressive for such a simple method

Nearest neighbour: outliers
• Algorithm is sensitive to outliers
  – a single mislabeled example dramatically changes the boundary
• No confidence P(y|x)
• Insensitive to class prior
• Idea:
  – use more than one nearest neighbour to make the decision
  – count class labels in the k most similar training examples
    • many "triangles" will outweigh a single "circle" outlier

kNN classification algorithm
• Given:
  – training examples {xi, yi}
    • xi … attribute-value representation of example i
    • yi … class label: {ham, spam}, digit {0, 1, …, 9}, etc.
  – testing point x that we want to classify
• Algorithm:
  • compute the distance D(x, xi) to every training example xi
  • select the k closest instances xi1…xik and their labels yi1…yik
  • output the class y* which is most frequent in yi1…yik

Example: handwritten digits
• 16x16 bitmaps
• 8-bit grayscale
• Euclidean distance
  – over raw pixels
• Accuracy:
  – 7-NN ~ 95.2%
  – SVM ~ 95.8%
  – humans ~ 97.5%

kNN regression algorithm
• Given:
  – training examples {xi, yi}
    • xi … attribute-value representation of example i
    • yi … real-valued target (profit, rating on YouTube, etc.)
  – testing point x whose target we want to predict
• Algorithm:
  • compute the distance D(x, xi) to every training example xi
  • select the k closest instances xi1…xik and their labels yi1…yik
  • output the mean of yi1…yik (both procedures are sketched in code below)

Example: kNN regression in 1-d
(figure: 1-NN, 2-NN and 3-NN regression fits on the same 1-d dataset)
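The classification and regression procedures above are the same brute-force loop with a different output step. The following is a minimal sketch, not code from the lecture: it assumes NumPy arrays, Euclidean distance, majority voting for classification and the mean for regression; the function name knn_predict and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, classify=True):
    """Brute-force kNN: compare x to every training example, O(nd)."""
    # Euclidean distance from x to each training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    labels = y_train[nearest]
    if classify:
        # classification: the most frequent label among the k neighbours
        return Counter(labels.tolist()).most_common(1)[0][0]
    # regression: the mean of the k neighbours' real-valued targets
    return labels.mean()

# toy usage (illustrative data, not from the slides)
X = np.array([[1, 9], [2, 3], [4, 1], [6, 8], [7, 2]])
y = np.array([0, 0, 0, 1, 1])
print(knn_predict(X, y, np.array([7, 4]), k=3))   # -> 1
```

There is no training step beyond storing the data; all the work happens at prediction time, which is exactly the cost issue the later slides address.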
Choosing the value of k
• The value of k has a strong effect on kNN performance
  – large value → everything is classified as the most probable class: P(y)
  – small value → highly variable, unstable decision boundaries
    • small changes to the training set → large changes in classification
  – affects the "smoothness" of the boundary
• Selecting the value of k
  – set aside a portion of the training data (validation set)
  – vary k, observe training and validation error
  – pick the k that gives the best generalization performance

Distance measures
• Key component of the kNN algorithm
  – defines which examples are similar and which aren't
  – can have a strong effect on performance
• Euclidean (numeric attributes): D(x, x') = √( Σ_d (x_d − x'_d)² )
  – symmetric, spherical, treats all dimensions equally
  – sensitive to extreme differences in a single attribute
    • behaves like a "soft" logical OR
• Hamming (categorical attributes): D(x, x') = Σ_d 1[x_d ≠ x'_d]
  – the number of attributes where x and x' differ

Distance measures (2)
• Minkowski distance (p-norm): D(x, x') = ( Σ_d |x_d − x'_d|^p )^(1/p)
  – p = 2: Euclidean
  – p = 1: Manhattan
  – p = 0: Hamming … logical AND
  – p = ∞: max_d |x_d − x'_d| … logical OR
• Kullback-Leibler (KL) divergence, for histograms (x_d > 0, Σ_d x_d = 1):
  D(x, x') = −Σ_d x_d log(x'_d / x_d)
  – asymmetric: the excess bits needed to encode x with a code built for x'
• Custom distance measures (e.g. BM25 for text)
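The distance functions above translate directly into code. This is a minimal NumPy sketch under my own naming, not the lecture's implementation; it assumes dense numeric arrays and, for the KL divergence, strictly positive histogram entries.

```python
import numpy as np

def euclidean(x, y):
    # square root of the sum of squared per-dimension differences
    return np.sqrt(((x - y) ** 2).sum())

def hamming(x, y):
    # number of attributes where x and y differ
    return (x != y).sum()

def minkowski(x, y, p=2):
    # p-norm: p=1 Manhattan, p=2 Euclidean, large p approaches max_d |x_d - y_d|
    # (the slide's p=0 "Hamming" case is a labelling convention, not this formula)
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

def kl_divergence(x, y):
    # for histograms with x_d > 0, y_d > 0 and sum_d x_d = 1; asymmetric:
    # the expected extra code length for encoding x with a code built for y
    return (x * np.log(x / y)).sum()
```

Note that kl_divergence(x, y) and kl_divergence(y, x) generally differ, which is the asymmetry the slide points out.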
kNN: practical issues
• Resolving ties:
  – equal number of positive/negative neighbours
  – use an odd k (doesn't solve the multi-class case)
  – breaking ties:
    • random: flip a coin to decide positive/negative
    • prior: pick the class with the greater prior
    • nearest: use a 1-NN classifier to decide
• Missing values
  – have to "fill in", otherwise we can't compute the distance
  – key concern: should affect the distance as little as possible
  – reasonable choice: the average value across the entire dataset

kNN, Parzen windows and kernels
(figure: a fixed-radius Parzen window R around x, compared with a 3-NN neighbourhood)
• Parzen window: estimate P(y|x) from the training points falling inside a region R(x) around x:
  P(y|x) = (1/|R(x)|) Σ_{xi ∈ R(x)} 1[yi = y] = Σ_i 1[yi = y] · 1[xi ∈ R(x)] / Σ_i 1[xi ∈ R(x)]
• Replacing the hard indicator 1[xi ∈ R(x)] with a kernel gives:
  P(y|x) = Σ_i 1[yi = y] · K(xi, x) / Σ_i K(xi, x)

kNN pros and cons
• Almost no assumptions about the data
  – smoothness: nearby regions of space → same class
  – assumptions implied by the distance function (only locally!)
  – non-parametric approach: "let the data speak for itself"
    • nothing to infer from the data, except k and possibly D()
    • easy to update in an online setting: just add the new item to the training set
• Need to handle missing data: fill in, or create a special distance
• Sensitive to class outliers (mislabeled training instances)
• Sensitive to lots of irrelevant attributes (they affect the distance)
• Computationally expensive:
  – space: need to store all training examples
  – time: need to compute the distance to all examples: O(nd)
    • n … number of training examples, d … cost of computing the distance
    • as n grows, the system becomes slower and slower
    • the expense is at testing time, not training time (bad)

Summary: kNN
• Key idea: nearby points → same class
  – important to select a good distance function
• Can be used for classification and regression
• Simple, non-linear, asymptotically optimal
  – does not make assumptions about the data
  – "let the data speak for itself"
• Select k by optimizing error on a held-out set
• Naïve implementations are slow for big datasets
  – use K-D trees (low-d) or inverted lists (high-d)

Why is kNN slow?
(figure: "what you see" vs. "what the algorithm sees" when finding the nearest neighbours of the testing point, shown in red)

Making kNN fast
• Training: O(d), but testing: O(nd)
• Reduce d: dimensionality reduction
  – simple feature selection; other methods are O(d³)
• Reduce n: don't compare to all training examples
  – idea: quickly identify m << n potential near neighbours
    • compare only to those and pick the k nearest neighbours → O(md) time
  – K-D trees: low-dimensional, real-valued data
    • O(d log₂ n), only works when d << n; inexact: may miss neighbours
  – inverted lists: high-dimensional, discrete data
    • O(n'd') where d' << d, n' << n; only for sparse data (e.g. text); exact
  – locality-sensitive hashing: high-d, discrete or real-valued data
    • O(n'd), n' << n … bits in the fingerprint; inexact: may miss near neighbours

K-D tree example
• Building a K-D tree from the training data:
  – pick a random dimension, find the median, split the data, repeat
• Find the nearest neighbours of the new point (7,4):
  – find the region containing (7,4)
  – compare to all points in that region
(figure: ten training points (1,9), (2,3), (4,1), (3,7), (5,4), (6,8), (7,2), (8,8), (7,9), (9,6); root split x ≥ 6; the left subtree splits on y ≥ 4 into leaves {(2,3), (4,1)} and {(1,9), (3,7), (5,4)}; the right subtree splits on y ≥ 8 into leaves {(7,2), (9,6)} and {(6,8), (8,8), (7,9)})

Locality-Sensitive Hashing (LSH)
• Random hyperplanes h1…hk
  – slice the space into 2^k regions (polytopes)
  – compare x only to the training points in the same region R
• Complexity: O(kd + dn/2^k), with k << n
  – O(kd) to find the region R: dot-product x with h1…hk
  – compare to the n/2^k points in R
• Inexact: may miss neighbours
  – repeat with a different set of hyperplanes h1…hk
• Why not a K-D tree?

Inverted list example
• Data structure used by search engines (Google, etc.)
  – for each attribute, list all training examples that contain it
  – assumption: most attribute values are zero (sparseness)
• Given a new testing example:
  – merge the inverted lists for the attributes present in the new example
  – O(dn): d … nonzero attributes, n … average length of an inverted list
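To make the last slide concrete, here is a minimal sketch of the inverted-list idea, assuming binary (present/absent) attributes stored as Python sets. Ranking candidates by the number of shared attributes is my own illustrative choice; in a full implementation an exact kNN with a proper distance would then be run over this short candidate list only.

```python
from collections import defaultdict, Counter

def build_inverted_index(examples):
    """Map each attribute to the list of training examples that contain it."""
    index = defaultdict(list)
    for i, attrs in enumerate(examples):      # attrs: set of nonzero attributes
        for a in attrs:
            index[a].append(i)
    return index

def candidates(index, query_attrs):
    """Merge the inverted lists for the attributes present in the new example."""
    counts = Counter()
    for a in query_attrs:
        for i in index.get(a, []):
            counts[i] += 1                    # number of shared attributes
    return counts                             # only examples sharing >= 1 attribute

# toy usage (illustrative documents, not from the slides)
docs = [{"spam", "free", "win"}, {"meeting", "notes"}, {"free", "notes"}]
idx = build_inverted_index(docs)
print(candidates(idx, {"free", "notes"}))
# doc 2 shares 2 attributes with the query; docs 0 and 1 share 1 each
```

Only the lists for the query's nonzero attributes are touched, which is why the cost depends on the number of nonzero attributes and the average list length rather than on the full training set.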