CAP4770 – Introduction to Data Mining
Final Project Instruction

Data: We use a gene data set for the final project. The data set is in attributes-in-rows format, stored as comma-separated values. It can be downloaded from:
http://users.cis.fiu.edu/~lli003/teaching/hw-sol/finalproject_datafiles.zip
Username/Password: CAP4770/student

The zip file contains three files:
train.csv: training data, consisting of 69 instances with 7,070 attributes.
train_class.txt: training classes, giving the true label of each instance in the training data, in the same order. There are 5 classes in total: MED, MGL, RHB, JPA and EPD.
test.csv: test data, consisting of 112 unlabeled instances with 7,070 attributes.

Goal: Learn the best classifier from the training data and use it to predict the classes of the test data.

Due Date: December 10th, 2010

Submission:
1. Project report, describing step by step how you built your classifier. Specifically, your report should include:
   1) How you cleaned the data;
   2) How you selected features;
   3) How you trained the classifier.
2. Predicted results (in a YourPatherId.txt file, one class per line in uppercase, in the same order as the test data).
3. Make sure that all the files you submit are zipped into one single file, named “CAP4770_finalproject_firstname_lastname_patherid”.

Important hints:
1. The training and test data are both in attribute-in-row format. You will probably need to transpose the data into attribute-in-column format so that it can be fed into Weka properly.
2.
