PRACHI PATEL
ASSIGNMENT:-2
DSC 441
Problem 1 (10 points):
This problem is an example of data preprocessing needed in a data mining process.
Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following
results:
Age    26    26    29    29    40    45    50    55    60
%fat   10.5  30.5  8.8   20.8  32.4  26.9  30.4  30.2  33.2

Age    55    45    60    55    61    62    63    75    66
%fat   36.6  44.5  30.8  35.4  33.2  36.1  37.9  43.2  37.7
a. (2 points) Draw the box-plots for age and %fat. Interpret the distribution of the data.
Based on the descriptive statistics and boxplot for the Age variable, we can conclude that Age is skewed to the left: the mean (50.11) lies below the median (55).
Based on the descriptive statistics and boxplot of the %fat variable, we can identify two outliers in the data: the low values 8.8 and 10.5.
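The boxplot reading above can be checked numerically. The following is a minimal sketch, not the SPSS procedure itself: it assumes the common 1.5 × IQR fence rule for outliers and the default quartile convention of Python's `statistics.quantiles`. The helper name `box_stats` is hypothetical. Quartile conventions differ between tools, so the fences can shift slightly under another convention.

```python
import statistics

age = [26, 26, 29, 29, 40, 45, 50, 55, 60,
       55, 45, 60, 55, 61, 62, 63, 75, 66]
fat = [10.5, 30.5, 8.8, 20.8, 32.4, 26.9, 30.4, 30.2, 33.2,
       36.6, 44.5, 30.8, 35.4, 33.2, 36.1, 37.9, 43.2, 37.7]


def box_stats(values):
    """Return the quartiles, the 1.5*IQR fences, and the values outside them."""
    # Default method of statistics.quantiles is 'exclusive' (an assumption
    # about which quartile convention to use, not necessarily SPSS's).
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


age_stats = box_stats(age)
fat_stats = box_stats(fat)

for name, stats in (("Age", age_stats), ("%fat", fat_stats)):
    print(name, stats)
```

Under these assumptions, the only values falling outside the fences are the %fat values 8.8 and 10.5, and no age is flagged.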
b. (2 points) Normalize the two attributes based on z-score normalization.
Z-score normalization can be calculated using the following formula:
$$v' = \frac{v - \text{mean}}{\text{std}}$$
From the descriptive statistics, the mean and standard deviation for age are 50.11 and 14.9, respectively, and the mean and standard deviation for %fat are 31.06 and 9.54, respectively. Using SPSS, we can calculate the z-scores for each case quickly.
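As a cross-check outside SPSS, the z-score computation can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name `z_scores` is hypothetical, and it assumes the sample standard deviation (dividing by n − 1), which is the convention that reproduces the values 14.9 and 9.54 quoted above.

```python
import statistics

age = [26, 26, 29, 29, 40, 45, 50, 55, 60,
       55, 45, 60, 55, 61, 62, 63, 75, 66]
fat = [10.5, 30.5, 8.8, 20.8, 32.4, 26.9, 30.4, 30.2, 33.2,
       36.6, 44.5, 30.8, 35.4, 33.2, 36.1, 37.9, 43.2, 37.7]


def z_scores(values):
    """Apply v' = (v - mean) / std to every value."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1), assumed
    return [(v - mean) / std for v in values]


age_z = z_scores(age)
fat_z = z_scores(fat)

# Print each case with its normalized values, rounded for readability.
for a, f, za, zf in zip(age, fat, age_z, fat_z):
    print(f"age={a:>2}  z_age={za:6.2f}   %fat={f:>4}  z_fat={zf:6.2f}")
```

After normalization each attribute has mean 0 and standard deviation 1, so the two can be compared on the same scale.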
c. (2 points) Regardless of the original ranges of the variables, normalization techniques transform the data into new ranges that allow variables to be compared and used on the same scale. What are the value ranges of the following normalization methods? Explain your answer.
i. Min-max normalization
Range [new_min, new_max] = [0, 1].
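When using min-max normalization, the values are linearly rescaled into a specified range: the smallest observed value maps to new_min and the largest to new_max, so every value falls within [new_min, new_max]. Note that this method is sensitive to outliers, because a single extreme value sets the minimum or maximum and compresses the remaining values into a narrow part of the range.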
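The min-max rescaling can be illustrated with a short sketch. The helper name `min_max` and its parameters are hypothetical, chosen to mirror the new_min and new_max notation used above. The sketch assumes the values are not all identical, since otherwise the denominator would be zero.

```python
age = [26, 26, 29, 29, 40, 45, 50, 55, 60,
       55, 45, 60, 55, 61, 62, 63, 75, 66]
fat = [10.5, 30.5, 8.8, 20.8, 32.4, 26.9, 30.4, 30.2, 33.2,
       36.6, 44.5, 30.8, 35.4, 33.2, 36.1, 37.9, 43.2, 37.7]


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumed non-zero (values are not all equal)
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


age_scaled = min_max(age)
fat_scaled = min_max(fat)

print("age  range:", min(age_scaled), "to", max(age_scaled))
print("%fat range:", min(fat_scaled), "to", max(fat_scaled))
```

With the default arguments, the youngest age (26) and the lowest %fat (8.8) map to 0, and the oldest age (75) and the highest %fat (44.5) map to 1.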