Data Preparation

Initially, the data set contained 7071 columns: one for each gene, plus one holding a serial number for each instance. The information about each patient was recorded in rows. There were 70 rows: 69 for the patients and one with the names of the genes.

Our first step was to normalize the data. Domain experts had previously established that values within the range of 20 to 16000 were reliable for this experiment, so we wrote a small program (e.g., in Java) to read the data set and change any gene value less than 20 to 20 and any gene value greater than 16000 to 16000.

Subsequently, we proceeded to select the genes that appeared to be correlated with the outcome. Since many learning algorithms look for non-linear combinations of features, a data set with many genes but few records can produce false positives. Gene reduction therefore increases classification accuracy. To do that, we wrote a small Java program to remove genes that showed little correlation with the outcome.
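The clipping step can be sketched as follows. This is a minimal illustration, not the original program: the class and method names are placeholders, and only the per-value clamping logic comes from the text (values below 20 become 20, values above 16000 become 16000).

```java
import java.util.Arrays;

public class ClipGenes {
    // Reliable range established by the domain experts.
    static final double MIN = 20.0;
    static final double MAX = 16000.0;

    // Clamp a single expression value into the reliable range.
    static double clip(double v) {
        return Math.min(Math.max(v, MIN), MAX);
    }

    // Clip every gene value in a patient row (the serial-number
    // column would be excluded before calling this).
    static double[] clipRow(double[] row) {
        return Arrays.stream(row).map(ClipGenes::clip).toArray();
    }

    public static void main(String[] args) {
        double[] row = {5.0, 120.5, 20000.0, 16000.0};
        System.out.println(Arrays.toString(clipRow(row)));
        // prints [20.0, 120.5, 16000.0, 16000.0]
    }
}
```

Applying this to each of the 69 patient rows (skipping the header row of gene names) reproduces the normalization described above.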
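The gene-selection step could be implemented along these lines. The original text does not specify the correlation measure or threshold, so this sketch assumes Pearson correlation between each gene's values and the outcome labels, with a hypothetical cutoff; all names here are illustrative.

```java
public class GeneFilter {
    // Pearson correlation between a gene's values across patients
    // and the numeric outcome labels.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Keep a gene only if its absolute correlation with the outcome
    // meets the (assumed) threshold; weakly correlated genes are removed.
    static boolean keep(double[] geneValues, double[] outcome, double threshold) {
        return Math.abs(pearson(geneValues, outcome)) >= threshold;
    }

    public static void main(String[] args) {
        double[] gene = {1, 2, 3, 4, 5};
        double[] outcome = {0, 0, 1, 1, 1};
        System.out.printf("r = %.3f%n", pearson(gene, outcome));
        // prints r = 0.866
    }
}
```

Running `keep` over all gene columns and discarding those that fail the test yields the reduced data set used for classification.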