4/14/11 Classification: Supervised learning, and Model Evaluation Classifier KNN, DT Feb 2011 Tommy W. S. Chow

K Nearest Neighbors (KNN)
• Advantages
  • Nonparametric architecture
  • Simple
  • Powerful
  • Requires no training time
• Disadvantages
  • Memory intensive
  • Classification/estimation is slow
K-NN classifier schematic
For a test instance:
1) Calculate its distance from every training point.
2) Find the K nearest neighbours (say, K = 3).
3) Assign the class label held by the majority of those neighbours.
Figure: deciding whether the “blue” point belongs to the “green” class or the “red” class.
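The three steps of the schematic can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's implementation: the function name `knn_classify`, the Euclidean metric and the toy points are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1) Distance from the query to every training point (Euclidean assumed).
    distances = [math.dist(p, query) for p in train_points]
    # 2) Indices of the k smallest distances.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # 3) Majority vote over the neighbours' labels.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration only.
points = [(2.0, 2.2), (2.5, 1.8), (0.0, 0.0), (3.0, 3.0), (5.0, 5.0)]
labels = ["red", "red", "red", "green", "green"]

# The three nearest neighbours of (2.4, 2.3) are two red points and one green.
print(knn_classify(points, labels, query=(2.4, 2.3), k=3))
```

Choosing an odd K, as on the slide, avoids tied votes when there are only two classes.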

K-NN classifier schematic
The red points are one class and the green points are another. The points circled in black are the three nearest neighbours of the grey point. Because two of those three neighbours are red, the grey point is classified as red.
KNN
• Data: numerical data, ordinal data (non-numerical but ordered, so values have a distance in some sense), and categorical data (non-numerical with no natural distance, e.g. colour: red, black; shape: round, square).
• How to determine distances between values of categorical attributes?
• Alternatives:
  • Use a Boolean distance (0 if the values are the same, 1 if they differ)
  • Introduce differential grading (e.g. weather: ‘drizzling’ and ‘rainy’ are closer than ‘rainy’ and ‘sunny’)
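Both alternatives can be sketched as small distance functions. In this sketch a match is treated as distance 0 and a mismatch as distance 1, so that identical values are the closest. The grade values in `WEATHER_GRADE` are hypothetical numbers chosen only to illustrate differential grading; the slides do not give any.

```python
def boolean_distance(a, b):
    """Distance between two category values: 0 if they match, 1 otherwise."""
    return 0 if a == b else 1

# Hypothetical grades placing weather values on a 0..1 scale (assumed values).
WEATHER_GRADE = {"sunny": 0.0, "cloudy": 0.4, "drizzling": 0.8, "rainy": 1.0}

def graded_distance(a, b, grade=WEATHER_GRADE):
    """Distance between two graded category values: difference of their grades."""
    return abs(grade[a] - grade[b])

print(boolean_distance("red", "black"))
print(graded_distance("drizzling", "rainy"), graded_distance("rainy", "sunny"))
```

With a grading table like this, ‘drizzling’ and ‘rainy’ come out closer than ‘rainy’ and ‘sunny’, whereas the Boolean distance would treat both pairs as equally far apart.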

How to determine the value of “K”
• This is another practical issue that cannot be resolved easily.
• But we can determine K experimentally: use the K that gives the minimum error on a test set.
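One way to carry out this experiment is to compute the held-out error for several candidate values of K and keep the best one. The sketch below assumes Euclidean distance and majority voting; the names `knn_predict`, `error_rate` and `choose_k`, the candidate list and the toy data (which contains one deliberately mislabelled point) are all assumptions for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """`train` is a list of (point, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, held_out, k):
    """Fraction of held-out samples that are misclassified for this K."""
    wrong = sum(knn_predict(train, point, k) != label for point, label in held_out)
    return wrong / len(held_out)

def choose_k(train, held_out, candidates=(1, 3, 5, 7)):
    """Return the candidate K with the lowest held-out error (ties: smallest K)."""
    return min(candidates, key=lambda k: error_rate(train, held_out, k))

# Toy 1-D data, made up for illustration; (1.0,) is a noisy "b" among the "a"s.
train = [((0.0,), "a"), ((0.5,), "a"), ((1.0,), "b"), ((1.5,), "a"),
         ((2.0,), "a"), ((9.0,), "b"), ((9.5,), "b"), ((10.0,), "b")]
held_out = [((1.1,), "a"), ((0.4,), "a"), ((9.4,), "b")]

for k in (1, 3, 5, 7):
    print(k, error_rate(train, held_out, k))
print("chosen K:", choose_k(train, held_out))
```

In this toy case K = 1 is misled by the noisy point, while larger K values outvote it.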
How good is KNN?
• Normally works well for simple, clean data sets
• But suffers from noisy data
• Computationally demanding, as it must calculate and compare many distances
• First use around 60% of the data set as the training set
• Verify the KNN using the remaining 40% as the test set
• If the accuracy is acceptable, use it in the real application
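The 60%/40% verification procedure might look like the following sketch. The helper names `holdout_split` and `accuracy` are assumptions, and `classify(train, point)` stands for any classifier supplied by the caller (for instance a KNN function); it is a hypothetical callable, not part of any library.

```python
import random

def holdout_split(samples, train_fraction=0.6, seed=0):
    """Shuffle the (point, label) samples and split them into (train, test)."""
    shuffled = samples[:]                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # fixed seed for a repeatable split
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, train, test):
    """Fraction of test samples whose predicted label equals the true label."""
    correct = sum(classify(train, point) == label for point, label in test)
    return correct / len(test)

# Ten made-up samples: six go to training, four to testing.
samples = [((float(i), 0.0), "a" if i < 5 else "b") for i in range(10)]
train, test = holdout_split(samples)
print(len(train), len(test))
```

Shuffling before the split matters when the data file is sorted by class, as otherwise one class could be missing from the training portion.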

K Nearest Neighbors
• The key issues involved in training this model include setting:
  • the variable K
  • validation techniques (e.g. cross-validation)
  • the type of distance metric
• Euclidean distance measure: measure the distance between the test point Y and every other data point X, whatever its class.
• Find the K shortest distances (e.g. K = 3, 5 or 7).
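For reference, the Euclidean distance between a test point Y and a data point X can be written out as below. The component symbols x_i and y_i and the attribute count n are notation assumed here rather than taken from the slides. The worked line uses training sample 1 and test sample 31 from this deck's data sets.

```latex
d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

% Training sample 1, X = (2.3, 1.6), and test sample 31, Y = (2.4, 3.6):
d(X, Y) = \sqrt{(2.3 - 2.4)^2 + (1.6 - 3.6)^2} = \sqrt{0.01 + 4.00} \approx 2.00
```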
Training Set

No  Attrib      Class label | No  Attrib      Class label | No  Attrib      Class label
 1  (2.3, 1.6)  1           | 11  (0.7, 4.8)  2           | 21  (4.4, 4.2)  3
 2  (2.1, 1.2)  1           | 12  (0.8, 4.2)  2           | 22  (4.9, 4.2)  3
 3  (2.3, 1.3)  1           | 13  (0.2, 4.7)  2           | 23  (4.5, 4.6)  3
 4  (2.2, 1.2)  1           | 14  (0.2, 4.8)  2           | 24  (4.1, 4.0)  3
 5  (2.7, 1.0)  1           | 15  (0.3, 4.4)  2           | 25  (4.7, 4.3)  3
 6  (2.1, 1.4)  1           | 16  (0.7, 4.5)  2           | 26  (4.5, 4.4)  3
 7  (2.0, 1.6)  1           | 17  (1.0, 4.6)  2           | 27  (4.6, 4.5)  3
 8  (2.4, 1.1)  1           | 18  (0.6, 4.6)  2           | 28  (4.7, 4.1)  3
 9  (2.7, 1.6)  1           | 19  (0.5, 4.3)  2           | 29  (4.1, 4.4)  3
10  (2.6, 1.9)  1           | 20  (0.8, 4.3)  2           | 30  (4.5, 4.8)  3

Test Set

No  Attrib      | No  Attrib      | No  Attrib
31  (2.4, 3.6)  | 36  (3.7, 3.8)  | 41  (2.9, 3.2)
32  (2.8, 2.9)  | 37  (3.8, 3.5)  | 42  (2.6, 3.2)
33  (2.3, 3.0)  | 38  (3.3, 3.7)  | 43  (2.0, 3.6)
34  (2.2, 3.2)  | 39  (3.2, 3.8)  | 44  (2.1, 3.1)
35  (2.7, 3.7)  | 40  (3.3, 3.4)  | 45  (2.4, 3.3)
KNN classification example (k = 5)
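The k = 5 example can be reproduced, at least approximately, from the training and test sets given in this deck. The sketch below assumes Euclidean distance and a simple majority vote, which may differ in detail from how the original figure was produced; the names `CLASS_POINTS`, `TRAIN`, `TEST` and `knn_label` are chosen here for illustration.

```python
from collections import Counter
import math

# Training set copied from the training table: ten points per class label.
CLASS_POINTS = {
    1: [(2.3, 1.6), (2.1, 1.2), (2.3, 1.3), (2.2, 1.2), (2.7, 1.0),
        (2.1, 1.4), (2.0, 1.6), (2.4, 1.1), (2.7, 1.6), (2.6, 1.9)],
    2: [(0.7, 4.8), (0.8, 4.2), (0.2, 4.7), (0.2, 4.8), (0.3, 4.4),
        (0.7, 4.5), (1.0, 4.6), (0.6, 4.6), (0.5, 4.3), (0.8, 4.3)],
    3: [(4.4, 4.2), (4.9, 4.2), (4.5, 4.6), (4.1, 4.0), (4.7, 4.3),
        (4.5, 4.4), (4.6, 4.5), (4.7, 4.1), (4.1, 4.4), (4.5, 4.8)],
}
TRAIN = [(point, label) for label, pts in CLASS_POINTS.items() for point in pts]

# Test set copied from the test table, keyed by sample number.
TEST = {
    31: (2.4, 3.6), 32: (2.8, 2.9), 33: (2.3, 3.0), 34: (2.2, 3.2), 35: (2.7, 3.7),
    36: (3.7, 3.8), 37: (3.8, 3.5), 38: (3.3, 3.7), 39: (3.2, 3.8), 40: (3.3, 3.4),
    41: (2.9, 3.2), 42: (2.6, 3.2), 43: (2.0, 3.6), 44: (2.1, 3.1), 45: (2.4, 3.3),
}

def knn_label(query, k=5):
    """Majority class among the k training points nearest to `query`."""
    nearest = sorted(TRAIN, key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

for number, point in TEST.items():
    print(number, point, "-> class", knn_label(point))
```

For instance, the five nearest training points to test sample 37 at (3.8, 3.5) all belong to class 3, so it is assigned to class 3.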

KNN when K = 15
• The decision boundary can be irregular.
