Problem 2: 10-fold Cross Validation

Consider a two-class classification problem with zero-one loss and training data set X_train = {(x_1, y_1), ..., (x_n, y_n)}, with class labels y_i ∈ {0, 1}. Given a test data point x, recall that the k-nearest neighbor classifier calculates ŷ, the predicted class of x, as follows:

• Find the k points in {x_1, ..., x_n} that are closest to x (in terms of Euclidean distance in R^d).
• Predict ŷ to be the majority class among those k closest points.

We denote the above prediction by

    ŷ = f(x; K, X_train)
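For concreteness, here is a minimal sketch of this rule in R; the function knn.predict and its argument names are our own illustration, not part of the assignment (which uses the knn function from library class later on):

    # Predict the class of one test point x (a numeric vector of length d).
    # X.train is the n x d matrix of training points, y.train the 0/1 labels.
    knn.predict <- function(x, K, X.train, y.train) {
      d2 <- rowSums(sweep(X.train, 2, x)^2)   # squared Euclidean distances
      nn <- order(d2)[1:K]                    # indices of the K nearest points
      ifelse(mean(y.train[nn]) > 0.5, 1, 0)   # majority vote (K odd, no ties)
    }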
Given a data set X = {(x_1, y_1), ..., (x_n, y_n)}, describe a step-by-step 10-fold cross validation procedure to choose an optimal value for the parameter K out of the values {1, 3, 5, 7, 9}. You should use the notation f defined above. Address the following issues (one possible implementation in R is sketched after this list):

• What are the 10 folds?
• For each fold, what do you use as the training data and what do you use as the validation data?
• What quantity do you compare for K ∈ {1, 3, 5, 7, 9}?
• How do you determine which K is optimal?
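As a point of reference, one way the whole procedure could look in R is sketched below; it assumes a data matrix X, a 0/1 label vector y, and the hypothetical knn.predict from above (none of these names are prescribed by the assignment):

    cv.knn <- function(X, y, Ks = c(1, 3, 5, 7, 9)) {
      n <- nrow(X)
      fold <- sample(rep(1:10, length.out = n))  # random assignment to 10 folds
      cv.error <- sapply(Ks, function(K) {
        fold.errs <- sapply(1:10, function(i) {
          train <- fold != i                     # nine folds form the training set
          valid <- fold == i                     # the remaining fold validates
          pred <- apply(X[valid, , drop = FALSE], 1, knn.predict,
                        K = K, X.train = X[train, , drop = FALSE],
                        y.train = y[train])
          mean(pred != y[valid])                 # validation error on fold i
        })
        mean(fold.errs)                          # average error over the 10 folds
      })
      Ks[which.min(cv.error)]                    # K with the smallest CV error
    }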
Nearest neighbor methods often work surprisingly well. Can you think of a reason why they may nonetheless be an inconvenient choice for an application running, for example, on a phone or a digital camera?

Problem 3: Cross validating a nearest neighbor classifier

A nearest neighbor classifier requires a parameter (the number K of neighbors used to classify). We will use cross validation to select the value of K for a specific type of data: handwritten digits.

Download the digit data set from Courseworks. The zip archive contains two files, both plain text. The file uspsdata.txt contains a matrix with one data point (a vector of length 256) per row. The 256-vector in each row represents a 16 x 16 image of a handwritten digit. The file uspscl.txt contains the corresponding class labels. The data contains two classes, the digits 5 and 6, so the class labels are stored as -1 and +1, respectively. The image on the right shows the first row, re-arranged as a 16 x 16 matrix and plotted as a grayscale image.

The R function knn in library class implements a k-nearest neighbor classifier. We will only work with two classes and odd values of K, so you do not have to implement tie-breaking.
1. Read the data into R. It can be more convenient to model categorical class data with the factor data type in R; use the function as.factor to transform the class labels into factors. (One way to read the files is sketched below.)
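A minimal sketch of the read-in step, assuming the two files have been extracted into the working directory (read.table, scan, and as.factor are standard R; the variable names X and y are ours):

    # read the 256-dimensional digit images and their class labels
    X <- as.matrix(read.table("uspsdata.txt"))
    y <- as.factor(scan("uspscl.txt"))   # labels -1/+1 stored as a factor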
2. Plot the first four images using the following function. Note that the input x should be a numerical vector of length 256.

    image.print <- function(x) {
      x.matrix <- matrix(x, 16, 16, byrow = FALSE)
      x.matrix.rotated <- t(apply(x.matrix, 1, rev))
      image(x.matrix.rotated, axes = FALSE, col = grey(seq(0, 1, length.out = 256)))
    }

The original image was scanned bottom up from the right, so we first transform x into a 16 x 16 matrix, then rotate and transpose the data.
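For instance, the first four images could be plotted as follows (a usage sketch, assuming the data matrix X from step 1):

    par(mfrow = c(2, 2))              # arrange the plots in a 2 x 2 grid
    for (i in 1:4) image.print(X[i, ])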
3. Randomly split the data into the following subsets (one way to do this is sketched after the list):

• Training data (60% of the entire data set)
• Test data (20%)
• Validation data (20%)
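A sketch of the 60/20/20 split, assuming X and y from step 1 (the index variable names are ours):

    n <- nrow(X)
    idx <- sample(n)                                   # random permutation of 1..n
    n.train <- round(0.6 * n)
    n.test  <- round(0.2 * n)
    train <- idx[1:n.train]                            # 60% training
    test  <- idx[(n.train + 1):(n.train + n.test)]     # 20% test
    valid <- idx[(n.train + n.test + 1):n]             # 20% validation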
4. Train a 1-nearest neighbor classifier using the training data and predict the labels of the images in the test data. What is the test error (the empirical error rate on the test set)? Plot the misclassified images. (A sketch of this step follows.)
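One way this step could look with knn from library class, reusing the splits defined above:

    library(class)
    # 1-NN prediction on the test set
    pred <- knn(X[train, ], X[test, ], y[train], k = 1)
    test.error <- mean(pred != y[test])    # empirical error rate on the test set
    # plot each misclassified test image
    for (i in which(pred != y[test])) image.print(X[test[i], ])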
5. Select K as follows (see the sketch after this list):

• For K ∈ {1, 3, 5, 7, 9, 13}, train the k-nn classifier on the training data and classify the images in the test set. Compute the test error for each K.
• Which value of K should you choose? Why?
• Finally, compute the error rate of the classifier for the optimal value of K on the validation set.
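One possible implementation of the selection loop (the names are ours; it reuses the splits from step 3):

    Ks <- c(1, 3, 5, 7, 9, 13)
    test.errors <- sapply(Ks, function(K) {
      pred <- knn(X[train, ], X[test, ], y[train], k = K)
      mean(pred != y[test])                # test error for this K
    })
    K.opt <- Ks[which.min(test.errors)]    # K with the smallest test error
    # error rate of the chosen classifier on the held-out validation set
    pred.valid <- knn(X[train, ], X[valid, ], y[train], k = K.opt)
    mean(pred.valid != y[valid])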