CIS 4526

Question

Download the Letter Recognition Data Set from the UCI Machine Learning Repository. This dataset contains 20,000 examples. Divide the set so that the first 15,000 examples are for training and the remaining 5,000 for testing.

You will implement 2 algorithms from class: (1) the k-NN algorithm and (2) the "pocket" algorithm.

Let:

  • num_train = number of training examples
  • num_test = number of testing examples
  • num_dims = the dimensionality of the examples

You should implement the following functions. (Implementations that do not conform to these specifications will lose a significant amount of credit for this assignment.)

  • pred_y = test_knn(train_x, train_y, test_x, num_nn)
    where train_x is a (num_train, num_dims) data matrix, train_y is a (num_train,) label vector, test_x is a (num_test, num_dims) data matrix, num_nn is the number of nearest neighbors used for classification, and pred_y is a (num_test,) label vector.
  • w = train_pocket(train_x, train_y, num_iters)
    where train_x is a (num_train, num_dims) data matrix, train_y is a (num_train,) +1/-1 label vector, num_iters is the number of iterations for the algorithm, and w is a vector of learned perceptron weights.
  • pred_y = test_pocket(w, test_x)
    where w is a vector of learned perceptron weights, test_x is a (num_test, num_dims) data matrix, and pred_y is a (num_test,) +1/-1 label vector.
  • acc = compute_accuracy(test_y, pred_y)
    where test_y is a (num_test,) label vector, pred_y is a (num_test,) label vector, and acc is a float between 0.0 and 1.0 representing the classification accuracy.
  • id = get_id()
    where id is a string representing your Temple AccessNet ID (e.g., "tua12345").

For the algorithms (k-NN, pocket), run the following experiments:

  • Randomly subsample the training data for num_train = {100, 1000, 2000, 5000, 10000, 15000}
  • For k-NN, use the following values for k = {1,3,5,7,9} (5 versions of k-NN)
  • Note: You should run at least 6 (algorithms) * 6 (values of num_train) = 36 total experiments. The 6 algorithms are the 5 versions of k-NN plus one-vs-all (OVA) classification with perceptrons.
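
The experiment grid above could be organized roughly as follows. This is only a sketch: subsample and experiment_grid are hypothetical helper names that do not appear in the provided skeleton, and sampling without replacement through a NumPy Generator is an assumed design choice.

```python
import numpy as np


def subsample(train_x, train_y, num_train, rng):
    # Hypothetical helper: draw num_train distinct row indices, then index
    # both arrays with them so that examples and labels stay aligned.
    idx = rng.choice(train_x.shape[0], size=num_train, replace=False)
    return train_x[idx], train_y[idx]


def experiment_grid(train_sizes=(100, 1000, 2000, 5000, 10000, 15000),
                    ks=(1, 3, 5, 7, 9)):
    # Hypothetical helper: one (algorithm, k, num_train) triple per experiment.
    # 5 k-NN variants plus the OVA pocket classifier, for each of the
    # 6 training-set sizes, gives 36 triples.
    algorithms = [("knn", k) for k in ks] + [("pocket_ova", None)]
    return [(name, k, n) for n in train_sizes for (name, k) in algorithms]
```

Passing a seeded generator such as np.random.default_rng(0) makes the subsamples reproducible between runs, which helps when comparing accuracies across algorithms.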

Notes

  • A code skeleton has been provided for you. Assume your code will be run as a module, so do not include any statements outside of functions.
  • Any reference to a "matrix" or "array" or "vector" for input and output should be of the type numpy.ndarray. DO NOT use another type (e.g., lists, dictionary, numpy.mat).
  • For numpy arrays, there is a difference between 1D arrays, where shape=(n,), and 2D arrays with a singleton dimension, where shape=(n,1). Be sure to use 1D arrays where appropriate.
  • As described in class, the pocket algorithm isn't designed for the multi-class case. Consider one-vs-all (OVA) classification and write related code directly in the main function.
  • Do not use (or even refer to) any implementations of k-NN (e.g., sklearn.neighbors) or PLA/pocket.

By the due date, turn in a ZIP file (pa1.zip) which contains:

  • Your single Python source file (written in Python 3) named pa1.py which contains the specified functions (plus any helper code you need).
  • A project write-up (pa1.pdf) that contains:
    • An English description of your algorithms, including any assumptions or design decisions you made. This discussion should include (but not be limited to) any choices you made that were not explicitly described here and how num_iters was selected for the pocket algorithm.
    • For each experiment, report the classification accuracy. Additionally, for one experiment, include a confusion matrix of the results. You will be graded on how well you present these results.
    • Discussion of the various experiments and what contribution the changes had on the accuracy and running time.
    • If there were any problems with your implementation (e.g. clearly wrong output) then make sure to indicate that in your write-up and give as much information as you can as to what you think is causing the problem.
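
For the confusion matrix mentioned above, one possible way to build it from integer class IDs is sketched below. The function name confusion_matrix and the rows-are-true-classes convention are assumptions for illustration, not part of the required interface.

```python
import numpy as np


def confusion_matrix(test_y, pred_y, num_classes):
    # cm[i, j] counts test examples whose true class is i and whose
    # predicted class is j. Assumes labels are integers in [0, num_classes).
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (test_y, pred_y), 1)
    return cm
```

The diagonal holds the correctly classified examples, so cm.trace() / cm.sum() equals the classification accuracy for the same predictions.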

Your submission should be a single ZIP file, which includes only the files specified above. Do not include any other files or internal folders in your submission. Part of your score for this assignment will be for following directions.


# Note: this is just a template for PA 1 and the code is for reference only.
# Feel free to design the pipeline of the *main* function. However, one should keep
# the interfaces for the other functions unchanged. Change the returned values of
# these functions so that they are consistent with the assignment instructions.
# In general, one will only need to add the code below the TO-DO statements to
# finish the assignment. Additional import statements can be included when needed.
#
# For the kNN classifier, one could use existing libraries to compute the pairwise
# Euclidean distances between the test and training data, as for-loops in Python
# are pretty slow. Other than that, the designs of all functions should be your
# original work.


import csv

import numpy as np


def compute_accuracy(test_y, pred_y):
    # TO-DO: add your code here
    return None


def test_knn(train_x, train_y, test_x, num_nn):
    # TO-DO: add your code here
    return None


def test_pocket(w, test_x):
    # TO-DO: add your code here
    return None


def train_pocket(train_x, train_y, num_iters):
    # TO-DO: add your code here
    return None


def get_id():
    # TO-DO: add your code here
    return 'tuxddddd'


def main():
    # Read the data file
    szDatasetPath = './letter-recognition.data' # Put this file in the same place as this script
    listClasses = []
    listAttrs = []
    with open(szDatasetPath) as csvFile:
        csvReader = csv.reader(csvFile, delimiter=',')
        for row in csvReader:
            listClasses.append(row[0])
            listAttrs.append(list(map(float, row[1:])))

    # Generate the mapping from class name to integer IDs
    mapCls2Int = dict([(y, x) for x, y in enumerate(sorted(set(listClasses)))])

    # Store the dataset with numpy array
    dataX = np.array(listAttrs)
    dataY = np.array([mapCls2Int[cls] for cls in listClasses])

    # Split the dataset as the training set and test set
    nNumTrainingExamples = 15000
    trainX = dataX[:nNumTrainingExamples, :]
    trainY = dataY[:nNumTrainingExamples]
    testX = dataX[nNumTrainingExamples:, :]
    testY = dataY[nNumTrainingExamples:]

    # TO-DO: add your code here

    return None


if __name__ == "__main__":
    main()
