first_column = bunchobject.data[:, np.newaxis, 1]
first_column_norm = normalize_minmax(first_column)
print(five_number_summary(first_column_norm))
The expected output is as follows. Your answer may differ slightly due to
floating-point error.
[{'minimum': 0.0, 'first quartile': 0.21846466012850865, 'median': 0.30875887724044637, 'third quartile': 0.40886033141697664, 'maximum': 1.0}]
Your function should also work if more than one column is input. Hence, for the
following test script:
cols = [1, 7]
some_columns = bunchobject.data[:, cols]
snorm = normalize_minmax(some_columns)
print('normalized', five_number_summary(snorm))
The expected output is as follows.
normalized [{'minimum': 0.0, 'first quartile': 0.21846466012850865, 'median': 0.30875887724044637, 'third quartile': 0.40886033141697664, 'maximum': 1.0}, {'minimum': 0.0, 'first quartile': 0.10094433399602387, 'median': 0.1665009940357853, 'third quartile': 0.36779324055666002, 'maximum': 1.0}]
Singapore University of Technology and Design, 2018

4. k-Nearest Neighbours model.
Having understood what a confusion matrix says,
you are ready to build your first classifier using the k-Nearest Neighbours model.
The steps are as follows.
Step 1. Obtain the dataset. You have already seen how to do this.
Step 2. Select the features that are to be included in the dataset. The dataset has
30 features, and for your first analysis, you may select all the features. For
example, to select the first twenty features:
feature_list = range(20)
data = bunchobject.data[:, feature_list]
Step 3. Each numerical feature selected is normalized using the min/max
normalization.
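Step 3 relies on the normalize_minmax() function written earlier. As a reminder of what that helper is assumed to do, here is a minimal sketch of per-column min/max normalization (the implementation details of your own function may differ):

```python
import numpy as np

def normalize_minmax(dataset):
    # Sketch: rescale each column to the range [0, 1] using
    # (x - column minimum) / (column maximum - column minimum).
    dataset = np.asarray(dataset, dtype=float)
    col_min = dataset.min(axis=0)
    col_max = dataset.max(axis=0)
    return (dataset - col_min) / (col_max - col_min)
```

After this step, every selected feature lies between 0 and 1, so no single feature dominates the distance calculation used by k-Nearest Neighbours.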
Step 4. The dataset (which includes the target variable) is divided into two sets, the
training set and the test set. The analyst typically decides the split; a common choice
is to use 40% of the records for the test set. The performance of the model is checked
using the data from the test set.
This is done using the train_test_split() method, which conducts a random
sampling from the records to give you the two sets. Read the documentation for
details.
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.40, random_state=42)
Step 5. Select a value of k to build the classifier. The classifier is built using the
data from the training set.
Step 6. The classifier is then used to make predictions on the target variable in the
test set.
A partial set of code for these two steps is given below. Read the documentation to
find out how to complete it.
clf = neighbors.KNeighborsClassifier(pass)
clf.fit(pass)
target_predicted = clf.predict(pass)
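One way the placeholders might be filled in, as a sketch rather than the required answer. The stand-in data below is only there so the snippet runs on its own; in the exercise, data and target come from the bunchobject, and k = 5 is an arbitrary example value:

```python
import numpy as np
from sklearn import neighbors
from sklearn.model_selection import train_test_split

# Stand-in data so this sketch is self-contained; replace with the
# normalized features and target from the loaded dataset.
rng = np.random.RandomState(42)
data = rng.rand(100, 4)
target = (data[:, 0] > 0.5).astype(int)

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.40, random_state=42)

k = 5  # arbitrary example value of k
clf = neighbors.KNeighborsClassifier(n_neighbors=k)
clf.fit(data_train, target_train)           # Step 5: build from the training set
target_predicted = clf.predict(data_test)   # Step 6: predict on the test set
```

Note that fit() takes both the training features and the training targets, while predict() takes only the test features.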
Step 7. The results of this classification are reported in the confusion matrix and the
various metrics. You have already written a method for this.

These steps can be completed in a single function. Write a function
knn_classifier() that takes in the following inputs:
● The bunchobject that is obtained after loading the dataset
● A list containing the column numbers of the features to be selected
● The size of the test set as a fraction of the total number of records
● A random number seed to ensure that the results can be repeated
● The value of k that is selected