CS 170
Algorithms
Fall 2014
David Wagner
Soln 12
1. (50 pts.) K-Nearest Neighbors

Digit classification is a classical problem that has been studied in depth by many researchers and computer scientists over the past few decades. It has many applications: for instance, postal services like the US Postal Service, UPS, and FedEx use pre-trained classifiers to speed up and accurately recognize handwritten addresses. Today, over 95% of all handwritten addresses are correctly classified by a computer rather than by a human manually reading the address.

The problem statement is as follows: given an image of a single handwritten digit, build a classifier that correctly predicts the actual digit value of the image. Thus, your classifier receives as input an image of a digit, and must output a class in the set {0, 1, 2, ..., 9}. For this homework, you will attack this problem using a k-nearest neighbors algorithm.

We will give you a data set (a reduced version of the MNIST handwritten digit data set). Each image of a digit is a 28×28 pixel image. We have already extracted features, using a very simple scheme: each pixel is its own feature, so we have 28^2 = 784 features. The value of a feature is the intensity of the corresponding pixel, normalized to be in the range 0..1. We have preprocessed and vectorized these images into feature vectors for you. We have split the data set into training, validation, and test sets, and we've provided the class of each image in the training and validation sets. Your job is to infer the class of each image in the test set. Here are five examples of images that might appear in this data set, to help you visualize the data: [example images omitted]

We want you to do the following steps:

(i) Implement the k-nearest neighbors algorithm. You can implement it in any way you like, in any programming language of your choice. For k > 1, decide on a rule for resolving ties (if there is a tie for the majority vote among the k nearest neighbors when trying to classify a new observation, which one do you choose?).

(ii) Using the training set as your training data, compute the class of each digit in the validation set, and compute the error rate on the validation set, for each of the following candidate values of k: k = 1, 2, 5, 10, 25.

(iii) Did your algorithm perform significantly better for k = 2 than for k = 1? Why do you think this is?

(iv) Look at a few examples of digits in the validation set that your k-nearest neighbors algorithm misclassifies. We have provided images of each of the digits, so you might want to look at the corresponding images to see if you can get some intuition for what's going on. Answer the following questions: Which example digits did you look at? Why do you think the algorithm misclassifies these examples? List some additional features we could add that might help classify these examples more accurately.
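Steps (i) and (ii) above can be sketched as follows. This is a minimal Python sketch, not a required implementation: the function names, the Euclidean distance metric, and the tie-breaking rule (prefer the label of the closest tied neighbor) are our own choices, and the assignment leaves all of them up to you.

```python
import numpy as np

def knn_classify(train_X, train_y, x, k):
    """Predict the digit class of feature vector x by majority vote
    among its k nearest training examples (Euclidean distance)."""
    dists = np.linalg.norm(train_X - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    votes = np.bincount(train_y[nearest], minlength=10)
    tied = np.flatnonzero(votes == votes.max())   # labels tied for the majority
    if len(tied) == 1:
        return int(tied[0])
    # Tie-breaking rule (one deterministic choice among many):
    # scan the neighbors in order of increasing distance and return
    # the first label that is among the tied majority labels.
    for idx in nearest:
        if train_y[idx] in tied:
            return int(train_y[idx])

def validation_error_rate(train_X, train_y, val_X, val_y, k):
    """Fraction of validation examples misclassified by k-NN."""
    preds = np.array([knn_classify(train_X, train_y, x, k)
                      for x in val_X])
    return float(np.mean(preds != val_y))
```

With the provided data loaded into arrays, step (ii) amounts to calling `validation_error_rate` once for each candidate value k in {1, 2, 5, 10, 25} and tabulating the results.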
