Chapter 3 – Data Exploration and Dimension Reduction
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2008

Exploring the data
Statistical summary of data: common metrics
- Average
- Median
- Minimum
- Maximum
- Standard deviation
- Counts & percentages
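The metrics listed above can be computed with a few lines of code. The following is a minimal sketch using only the Python standard library; the `values` list holds toy numbers chosen for illustration (not the Boston Housing data), and `summarize` is a hypothetical helper name.

```python
import statistics

# Toy values for illustration only (NOT the Boston Housing data).
values = [20, 22, 24, 24, 30]


def summarize(xs):
    """Return the common summary metrics for a list of numbers."""
    return {
        "count": len(xs),
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "min": min(xs),
        "max": max(xs),
        "stdev": statistics.stdev(xs),  # sample standard deviation (divides by n - 1)
    }


summary = summarize(values)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

Note that `statistics.stdev` is the sample standard deviation; `statistics.pstdev` would give the population version, and the two can differ noticeably on small data sets.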
Summary Statistics – Boston Housing

Correlations Between Pairs of Variables: Correlation Matrix from Excel

| | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|
| PTRATIO | 1 | | | |
| B | -0.17738 | 1 | | |
| LSTAT | 0.374044 | -0.36609 | 1 | |
| MEDV | -0.50779 | 0.333461 | -0.73766 | 1 |
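A correlation matrix like this one holds the Pearson correlation for every pair of variables. As a rough sketch of how such a matrix could be computed outside Excel, the code below implements the Pearson formula directly. The column names `A` to `D` and their values are toy data made up for illustration, and `pearson` and `correlation_matrix` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation of every column with every other column."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Toy columns for illustration only (NOT the Boston Housing data).
columns = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],  # rises exactly with A
    "C": [5, 4, 3, 2, 1],   # falls exactly as A rises
    "D": [1, 3, 2, 5, 4],   # rises with A, but not perfectly
}

matrix = correlation_matrix(columns)

# Print only the lower triangle, as in the Excel output.
names = list(columns)
for i, row in enumerate(names):
    cells = [f"{matrix[row][col]:8.3f}" for col in names[: i + 1]]
    print(f"{row:>3}", *cells)
```

Because the matrix is symmetric with ones on the diagonal, showing only the lower triangle loses no information.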
Summarize Using Pivot Tables
Counts & percentages are useful for summarizing categorical data

| CHAS | Count of MEDV |
|---|---|
| 0 | 471 |
| 1 | 35 |
| Grand Total | 506 |

Boston Housing example:
- 35 neighborhoods border the Charles River (1)
- 471 neighborhoods do not (0)
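The same kind of count-and-percentage summary can be sketched in code. The example below uses `collections.Counter` on a short list of toy 0/1 flags; the values are invented for illustration and are not the Boston Housing CHAS column.

```python
from collections import Counter

# Toy CHAS-style flags (1 = borders the river, 0 = does not).
# Illustration only, NOT the Boston Housing data.
chas = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

counts = Counter(chas)
total = sum(counts.values())
percentages = {level: 100 * n / total for level, n in counts.items()}

for level in sorted(counts):
    print(f"CHAS={level}: count={counts[level]}, percent={percentages[level]:.1f}%")
print(f"Grand Total: {total}")
```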

Pivot Tables, cont.
Averages are useful for summarizing grouped numerical data
In Boston Housing example: compare average home values in neighborhoods that border the Charles River (1) and those that do not (0)

| CHAS | Average of MEDV |
|---|---|
| 0 | 22.09 |
| 1 | 28.44 |
| Grand Total | 22.53 |
Pivot Tables, cont.
Group by multiple criteria: by # rooms and location
E.g., neighborhoods on the Charles with 6-7 rooms have average house value of 25.92 (\$000)

Average of MEDV

| RM | CHAS = 0 | CHAS = 1 | Grand Total |
|---|---|---|---|
| 3-4 | 25.30 | | 25.30 |
| 4-5 | 16.02 | | 16.02 |
| 5-6 | 17.13 | 22.22 | 17.49 |
| 6-7 | 21.77 | 25.92 | 22.02 |
| 7-8 | 35.96 | 44.07 | 36.92 |
| 8-9 | 45.70 | 35.95 | 44.20 |
| Grand Total | 22.09 | 28.44 | 22.53 |
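Grouping by two criteria amounts to binning the numeric variable, keying each record by the (bin, category) pair, and averaging within each cell. The sketch below illustrates that idea with the standard library. The records are toy values invented for illustration, and `room_bin` and `pivot_mean` are hypothetical helper names.

```python
import math
from collections import defaultdict

# Toy records: (average rooms, river flag, value in thousands of dollars).
# Illustration only, NOT the Boston Housing data.
records = [
    (5.4, 0, 18.0),
    (5.9, 0, 20.0),
    (6.2, 0, 22.0),
    (6.8, 1, 26.0),
    (6.5, 1, 24.0),
    (7.3, 0, 35.0),
]


def room_bin(rm):
    """Label the unit-wide bin containing rm, e.g. 6.2 -> '6-7'."""
    low = math.floor(rm)
    return f"{low}-{low + 1}"


def pivot_mean(rows):
    """Average value for every (room bin, river flag) cell that has data."""
    cells = defaultdict(list)
    for rm, chas, value in rows:
        cells[(room_bin(rm), chas)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in cells.items()}


table = pivot_mean(records)
for (rooms, chas), avg in sorted(table.items()):
    print(f"RM {rooms}, CHAS={chas}: average value {avg:.2f}")
```

Cells with no records simply do not appear in the result, which corresponds to the blank cells in a pivot table.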

Graphs
Histograms
Boston Housing example: histogram shows the distribution of the outcome variable (median house value)
[Figure: histogram of MEDV; x-axis: MEDV, y-axis: Frequency]
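A histogram is a count of how many values fall into each bin. The sketch below computes those frequencies and prints a crude text chart. The values are toy numbers for illustration only, the bin width of 5 is an assumption, and the convention that each bin includes its lower edge but not its upper edge is also an assumption (spreadsheet tools may assign edge values differently).

```python
from collections import Counter

# Toy values for illustration only (NOT the Boston Housing data).
values = [7, 12, 14, 18, 21, 22, 23, 24, 27, 33, 41, 50]
bin_width = 5  # assumed bin width


def histogram(xs, width):
    """Count values per bin; the bin labeled k covers k <= x < k + width."""
    counts = Counter((x // width) * width for x in xs)
    return dict(sorted(counts.items()))


counts = histogram(values, bin_width)
for start, n in counts.items():
    print(f"{start:>3}-{start + bin_width:<3} {'#' * n}")
```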

Boxplots
Side-by-side boxplots are useful for comparing subgroups
Boston Housing example: display distribution of outcome variable (MEDV) for neighborhoods on Charles (1) and not on Charles (0)
[Figure: box plot of MEDV by CHAS (0, 1)]
Box Plot
- Top outliers are defined as those above Q3 + 1.5(Q3 - Q1)
- “max” is the maximum of the non-outliers
- Analogous definitions apply for bottom outliers and for “min”
- Details may differ across software
[Figure: annotated box plot labeling the median, mean, Quartile 1, Quartile 3, “max”, “min”, and outliers]
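The outlier rule can be sketched in code as follows. The data are toy values for illustration only, and `box_stats` is a hypothetical helper name. The quartiles come from `statistics.quantiles` with `method="inclusive"`, which is just one of several quartile conventions, so other software may report slightly different quartiles and fences.

```python
import statistics


def box_stats(xs):
    """Quartiles, fences, outliers, and non-outlier min/max for a box plot."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # bottom outliers lie below this
    upper_fence = q3 + 1.5 * iqr  # top outliers lie above this
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in xs if x < lower_fence or x > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "min": min(inside),  # "min": smallest non-outlier
        "max": max(inside),  # "max": largest non-outlier
        "outliers": outliers,
    }


# Toy values for illustration only (NOT the Boston Housing data).
data = [10, 12, 14, 15, 16, 18, 20, 22, 50]
box = box_stats(data)
for name, value in box.items():
    print(f"{name:>11}: {value}")
```

Here Q1 = 14 and Q3 = 20, so the upper fence is 20 + 1.5 × 6 = 29. The value 50 is flagged as a top outlier, and “max” becomes 22.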

Correlation Analysis
Below: correlation matrix for portion of Boston Housing data
Shows correlation between variable pairs

| | CRIM | ZN | INDUS | CHAS | NOX | RM |
|---|---|---|---|---|---|---|
| CRIM | 1 | | | | | |
| ZN | -0.20047 | 1 | | | | |
| INDUS | 0.406583 | -0.53383 | 1 | | | |
| CHAS | -0.05589 | -0.0427 | 0.062938 | 1 | | |
| NOX | 0.420972 | -0.5166 | 0.763651 | 0.091203 | 1 | |
| RM | -0.21925 | 0.311991 | -0.39168 | 0.091251 | -0.30219 | 1 |
Matrix Plot
Shows scatterplots for variable pairs
Example: scatterplots for 3 Boston Housing variables

