CS 170
Algorithms
Fall 2014
David Wagner
HW12
Due Dec. 5, 6:00pm
Instructions.
This homework is due Friday, December 5, at 6:00pm, submitted electronically via glookup.
It is a programming assignment based on a machine learning application.
You may work individually or in groups of two for this assignment. You may not work with more than one
other person. If you work with a partner, you may only use code you wrote together via pair programming or
code you wrote individually. For example, you might use pair programming for problem 1, and for problem 2
only discuss your approaches with one another and then each use your own implementation. You may not
turn in any code that you were not involved in writing
(for instance, “you implement Problem 1, I’ll implement Problem 2” is not allowed). In addition, you may
not discuss your approaches with anyone other than your partner.
1. (50 pts.) K-Nearest Neighbors
Digit classification is a classical problem that has been studied in depth by many researchers and computer
scientists over the past few decades. Digit classification has many applications: for instance, postal services
like the US Postal Service, UPS, and FedEx use pre-trained classifiers to recognize handwritten addresses
quickly and accurately. Today, over 95% of all handwritten addresses are correctly classified
by a computer rather than by a human reading the address manually.
The problem statement is as follows: given an image of a single handwritten digit, build a classifier that
correctly predicts what the actual digit value of the image is. Thus, your classifier receives as input an image
of a digit, and must output a class in the set {0, 1, 2, ..., 9}. For this homework, you will attack this problem
by using a k-nearest neighbors algorithm.
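As a rough illustration of the idea, here is a minimal sketch of a k-nearest neighbors classifier in Python. It rests on several assumptions that the text above does not specify: Euclidean distance as the similarity measure, a simple majority vote among the k closest training examples, and ties broken in favor of the smaller digit. The names `euclidean_distance` and `knn_classify` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_vectors, train_labels, query, k):
    """Predict a label for `query` by majority vote among its k nearest training vectors."""
    # Rank training examples by their distance to the query and keep the k closest.
    nearest = sorted(
        range(len(train_vectors)),
        key=lambda i: euclidean_distance(train_vectors[i], query),
    )[:k]
    # Count how many of the k neighbors carry each label.
    votes = Counter(train_labels[i] for i in nearest)
    best_count = max(votes.values())
    # Assumption: break ties by returning the smallest label among the top vote-getters.
    return min(label for label, count in votes.items() if count == best_count)
```

For example, `knn_classify(train_vectors, train_labels, query, k=3)` returns the most common label among the three training vectors closest to `query`. This brute-force version compares the query against every training example, so each prediction takes time proportional to the size of the training set times the number of features.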
We will give you a data set (a reduced version of the MNIST handwritten digit data set). Each image of a
digit is a 28 × 28 pixel image. We have already extracted features, using a very simple scheme: each pixel is
its own feature, so we have 28^2 = 784 features. The value of a feature is the intensity of that corresponding
pixel, normalized to be in the range 0..1. We have preprocessed and vectorized these images into feature
vectors for you. We have split the data set into training, validation, and test sets, and we’ve provided the
class of each image in the training and validation sets. Your job is to infer the class of each image in the test
set. Here are five examples of images that might appear in this data set, to help you visualize the data:
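To make the feature representation described above concrete, the following Python sketch flattens a 28 × 28 image into a 784-element vector with values in the range 0..1, and renders such a vector as text for a quick look. The provided data is already vectorized, so this is purely illustrative. It assumes raw 8-bit intensities (0..255) and row-major pixel order, neither of which is stated above, and the names `image_to_features` and `render_ascii` are hypothetical.

```python
def image_to_features(pixels):
    """Flatten a 28x28 grid of 0..255 intensities into 784 floats in [0, 1].

    Assumptions: 8-bit input intensities and row-major ordering.
    """
    if len(pixels) != 28 or any(len(row) != 28 for row in pixels):
        raise ValueError("expected a 28x28 grid of pixel intensities")
    return [value / 255.0 for row in pixels for value in row]


def render_ascii(features, side=28):
    """Render a flattened feature vector as text, with darker characters for higher intensity."""
    shades = " .:*#"
    lines = []
    for r in range(side):
        row = features[r * side:(r + 1) * side]
        # Map each intensity in [0, 1] to one of the shade characters.
        lines.append(
            "".join(shades[min(int(v * len(shades)), len(shades) - 1)] for v in row)
        )
    return "\n".join(lines)
```

Calling `print(render_ascii(vector))` on one of the 784-element vectors gives a coarse picture of the digit, which can be handy for spot-checking that the data was loaded in the expected orientation.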
We want you to do the following steps:
