

    first_column = bunchobject.data[:, np.newaxis, 1]
    first_column_norm = normalize_minmax(first_column)
    print(five_number_summary(first_column_norm))

The expected output is as follows. Your answer may be slightly different due to floating point error.

    [{'minimum': 0.0, 'first quartile': 0.21846466012850865, 'median': 0.30875887724044637, 'third quartile': 0.40886033141697664, 'maximum': 1.0}]

Your function should also work if more than one column is input. Hence, for the following test script:

    cols = [1, 7]
    some_columns = bunchobject.data[:, cols]
    snorm = normalize_minmax(some_columns)
    print('normalized', five_number_summary(snorm))

The expected output is as follows.

    normalized [{'minimum': 0.0, 'first quartile': 0.21846466012850865, 'median': 0.30875887724044637, 'third quartile': 0.40886033141697664, 'maximum': 1.0},
                {'minimum': 0.0, 'first quartile': 0.10094433399602387, 'median': 0.1665009940357853, 'third quartile': 0.36779324055666002, 'maximum': 1.0}]

Singapore University of Technology and Design, 2018
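As a rough illustration of the behaviour tested above, the sketch below shows one possible way to write normalize_minmax() and five_number_summary() so that they handle one or several columns. It is not the reference solution: it assumes that normalization is applied column by column as (x - min) / (max - min), that quartiles are computed with np.percentile and its default interpolation, and it runs on a small made-up array (toy) rather than on bunchobject.

```python
import numpy as np


def normalize_minmax(data):
    """Rescale each column of a 2-D array to the range [0, 1]."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    # Broadcasting applies each column's own minimum and maximum to every row.
    # Assumption: no column is constant, otherwise this divides by zero.
    return (data - col_min) / (col_max - col_min)


def five_number_summary(data):
    """Return a list with one summary dictionary per column."""
    data = np.asarray(data, dtype=float)
    summaries = []
    for column in data.T:
        # Assumption: quartiles use np.percentile's default interpolation.
        q1, median, q3 = np.percentile(column, [25, 50, 75])
        summaries.append({
            'minimum': float(column.min()),
            'first quartile': float(q1),
            'median': float(median),
            'third quartile': float(q3),
            'maximum': float(column.max()),
        })
    return summaries


# Made-up data with two columns and four records (not the course dataset).
toy = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [3.0, 20.0],
                [5.0, 50.0]])
toy_norm = normalize_minmax(toy)
toy_summary = five_number_summary(toy_norm)
print(toy_norm)
print(toy_summary)
```

Because the minimum and maximum are taken along axis 0, the same code works whether the input has one column or many, which is the property the second test script checks.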
4. k-Nearest Neighbours model.

Having understood what a confusion matrix says, you are ready to build your first classifier using the k-Nearest Neighbours model. The steps are as follows.

Step 1. Obtain the dataset. You have already seen how to do this.

Step 2. Select the features that are to be included in the dataset. The dataset has 30 features, and for your first analysis, you may select all the features. For example, to select the first twenty features:

    feature_list = range(20)
    data = bunchobject.data[:, feature_list]

Step 3. Normalize each selected numerical feature using min/max normalization.

Step 4. Divide the dataset (which includes the target variable) into two sets, the training set and the test set. The analyst typically decides the percentage; a typical choice is to use 40% of the records as the test set. The performance of the model is checked using the data from the test set. The split is done using the train_test_split() function, which randomly samples the records to give you the two sets. Read the documentation for details.

    from sklearn.model_selection import train_test_split
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=0.40, random_state=42)

Step 5. Select a value of k to build the classifier. The classifier is built using the data from the training set.

Step 6. Use the classifier to make predictions on the target variable in the test set.

A partial set of code for these two steps is given below; each pass is a placeholder for the arguments you must supply. Read the documentation to find out how to complete it.

    from sklearn import neighbors
    clf = neighbors.KNeighborsClassifier(pass)
    clf.fit(pass)
    target_predicted = clf.predict(pass)

Step 7. The results of this classification are reported in the confusion matrix and the various metrics. You have already written a method for this.
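To make the role of k in Steps 5 and 6 more concrete, here is a minimal from-scratch sketch of the idea behind k-Nearest Neighbours: each test record takes the majority label of its k closest training records. This illustrates the model only and is not how scikit-learn implements KNeighborsClassifier; the function name knn_predict, the use of Euclidean distance, the tie-breaking rule and the toy data are all assumptions made for the example.

```python
import numpy as np


def knn_predict(train_data, train_target, query_data, k):
    """Predict a label for each query row by majority vote of its k nearest training rows."""
    train_data = np.asarray(train_data, dtype=float)
    train_target = np.asarray(train_target)
    query_data = np.asarray(query_data, dtype=float)
    predictions = []
    for row in query_data:
        # Euclidean distance from this query row to every training row.
        distances = np.sqrt(((train_data - row) ** 2).sum(axis=1))
        # Indices of the k training rows with the smallest distances.
        nearest = np.argsort(distances)[:k]
        # Count the labels among those neighbours and keep the most frequent one.
        # Assumption: a tie is resolved in favour of the smallest label.
        labels, counts = np.unique(train_target[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Made-up, already-normalized training records: three of class 0, three of class 1.
toy_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                      [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]])
toy_target = np.array([0, 0, 0, 1, 1, 1])
toy_query = np.array([[0.15, 0.15], [0.95, 0.9]])

toy_predicted = knn_predict(toy_train, toy_target, toy_query, k=3)
print(toy_predicted)  # [0 1]
```

Since the vote depends on distances between records, features on very different scales would dominate the result, which is one reason Step 3 normalizes every feature first.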
These steps can be completed in a single function. Write a function knn_classifier() that takes in the following inputs:

  • The bunchobject that is obtained after loading the dataset
  • A list containing the column numbers of the features to be selected
  • The size of the test set as a fraction of the total number of records
  • A random number seed to ensure that the results can be repeated
  • The value of k that is selected
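One possible shape for such a function is sketched below. Several things here are assumptions and not part of the handout: the parameter names, the choice to return the confusion matrix (the required output is not stated in this part of the text), the use of sklearn.metrics.confusion_matrix as a stand-in for the reporting method you wrote earlier, and load_breast_cancer as an example dataset with 30 features. The order of operations follows Steps 2 to 7 above.

```python
import numpy as np
from sklearn import neighbors
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def normalize_minmax(data):
    """Column-wise min/max normalization (assumed form of the earlier exercise)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    return (data - col_min) / (col_max - col_min)


def knn_classifier(bunchobject, feature_list, size, seed, k):
    """Run Steps 2 to 7 and return the confusion matrix (assumed return value)."""
    # Step 2: select the requested feature columns.
    data = bunchobject.data[:, feature_list]
    target = bunchobject.target
    # Step 3: normalize every selected feature to [0, 1].
    data = normalize_minmax(data)
    # Step 4: random, repeatable split into training and test sets.
    data_train, data_test, target_train, target_test = train_test_split(
        data, target, test_size=size, random_state=seed)
    # Step 5: build the classifier from the training set.
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf.fit(data_train, target_train)
    # Step 6: predict the target variable for the test set.
    target_predicted = clf.predict(data_test)
    # Step 7: stand-in for the confusion-matrix method written earlier.
    return confusion_matrix(target_test, target_predicted)


# Assumed example dataset; any bunch object with data and target would do.
bunchobject = load_breast_cancer()
results = knn_classifier(bunchobject, list(range(20)), 0.40, 42, 3)
print(results)
```

Passing the seed through to random_state is what makes repeated calls with the same arguments return the same matrix.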