# chap3_data_exploration1 - Data Mining Exploring Data...

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar; adapted by Michael Hahsler.

Topics
- Exploratory Data Analysis
- Summary Statistics
- Visualization

What is data exploration?

A preliminary exploration of the data to better understand its characteristics.

Key motivations of data exploration include:
- Helping to select the right tool for preprocessing or analysis
- Making use of humans’ abilities to recognize patterns: people can recognize patterns not captured by data analysis tools

Related to the area of Exploratory Data Analysis (EDA):
- Created by statistician John Tukey
- Seminal book is Exploratory Data Analysis by Tukey
- A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook

Iris Sample Data Set

Many of the exploratory data techniques are illustrated with the Iris Plant data set.
- Can be obtained from the UCI Machine Learning Repository
- From the statistician Ronald A. Fisher
- Three flower types (classes): Setosa, Virginica, Versicolour
- Four (non-class) attributes: sepal width and length, petal width and length

Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.


Summary Statistics

Summary statistics are numbers that summarize properties of the data.
- Summarized properties include frequency, location and spread. Examples: location (mean), spread (standard deviation)
- Most summary statistics can be calculated in a single pass through the data

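
To illustrate the single-pass claim, here is a minimal sketch that computes the count, mean and sample standard deviation while reading each value only once. It uses Welford's online algorithm, which is one common way to do this and is not something the slides prescribe; the function name `single_pass_summary` and the sample values are made up for illustration.

```python
import math


def single_pass_summary(values):
    """Return (count, mean, sample standard deviation) from one pass over values.

    Uses Welford's online update, so the data never has to be stored
    or read a second time.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    # The sample standard deviation is undefined for fewer than two values.
    std = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, std


# Hypothetical measurements, not taken from any real data set.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
count, mean, std = single_pass_summary(sample)
print(f"n={count} mean={mean:.3f} std={std:.3f}")
```

Because the function only iterates over `values`, it also works on a generator or a file being streamed, which is where a single-pass computation is most useful.
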
Frequency and Mode

The frequency of an attribute value is the percentage of times the value occurs in the data set.
- For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.

The mode of an attribute is the most frequent attribute value.

The notions of frequency and mode are typically used with categorical data.
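
The two definitions above can be sketched in a few lines using `collections.Counter` from the Python standard library. The helper names `frequencies` and `mode` and the sample labels are hypothetical and chosen only for illustration. Note that when several values tie for most frequent, this sketch returns the one that appears first in the data.

```python
from collections import Counter


def frequencies(values):
    """Map each distinct value to the percentage of times it occurs."""
    values = list(values)
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * count / total for value, count in counts.items()}


def mode(values):
    """Return the most frequent value (first seen wins on ties)."""
    counts = Counter(values)
    if not counts:
        raise ValueError("mode() of an empty data set is undefined")
    return counts.most_common(1)[0][0]


# Hypothetical class labels, not the actual Iris data.
labels = ["setosa", "versicolour", "virginica", "setosa"]
print(frequencies(labels))
print(mode(labels))
```
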

Percentiles

For continuous data, the notion of a percentile is more useful.
