# Data Mining: Exploring Data

Lecture Notes for Chapter 3. Slides by Tan, Steinbach, and Kumar, adapted by Michael Hahsler.

## Topics

- Exploratory Data Analysis
- Summary Statistics
- Visualization

## What is data exploration?

A preliminary exploration of the data to better understand its characteristics.

Key motivations of data exploration include:

- Helping to select the right tool for preprocessing or analysis
- Making use of humans’ abilities to recognize patterns. People can recognize patterns not captured by data analysis tools.

Related to the area of Exploratory Data Analysis (EDA):

- Created by statistician John Tukey
- Seminal book is Exploratory Data Analysis by Tukey
- A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook

## Iris Sample Data Set

Many of the exploratory data techniques are illustrated with the Iris Plant data set.

- Can be obtained from the UCI Machine Learning Repository
- From the statistician R. A. Fisher
- Three flower types (classes): Setosa, Virginica, Versicolour
- Four (non-class) attributes: sepal width and length, petal width and length

Image: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.

## Summary Statistics

Summary statistics are numbers that summarize properties of the data.

- Summarized properties include frequency, location, and spread. Examples: location (mean), spread (standard deviation).
- Most summary statistics can be calculated in a single pass through the data.

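
The single-pass claim can be illustrated with a short sketch. The slides do not say which algorithm they have in mind; the version below assumes Welford's online update, one common way to get the mean and the sample standard deviation while reading each value only once. The function name `single_pass_mean_std` and the sample values are made up for illustration.

```python
import math

def single_pass_mean_std(values):
    """Return (mean, sample standard deviation), reading each value once."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count        # update the running mean
        m2 += delta * (x - mean)     # uses both the old and the new mean
    if count < 2:
        raise ValueError("need at least two values for a sample standard deviation")
    return mean, math.sqrt(m2 / (count - 1))

# Illustrative values only (assumption: a handful of sepal lengths in cm)
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0]
mean, std = single_pass_mean_std(sepal_lengths)
print(f"mean={mean:.2f}, std={std:.3f}")
```

Because the loop never revisits earlier values, the same function also works on a generator or a file being streamed from disk.
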
## Frequency and Mode

- The frequency of an attribute value is the percentage of time the value occurs in the data set. For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time.
- The mode of an attribute is the most frequent attribute value.
- The notions of frequency and mode are typically used with categorical data.
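
As a minimal sketch of these two definitions, the snippet below computes relative frequencies and the mode of a categorical attribute with Python's standard `collections.Counter`. The helper names `frequencies` and `mode` and the short `species` list are hypothetical and exist only for this example.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the fraction of records in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]

# Made-up class labels for illustration
species = ["setosa", "versicolour", "setosa", "virginica", "setosa", "virginica"]
freq = frequencies(species)
most_frequent = mode(species)
print(freq)
print(most_frequent)
```

Multiplying each fraction by 100 gives the percentage form used in the definition above.
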

## Percentiles

For continuous data, the notion of a percentile is more useful.
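
The slide text stops before giving a definition, so the following is only a sketch under an assumption: it uses the nearest-rank rule, one of several percentile conventions in use, where the p-th percentile is the smallest observed value such that at least p% of the observations are less than or equal to it. The function name `nearest_rank_percentile` and the sample values are made up for illustration.

```python
import math

def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) by the nearest-rank rule."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the interval (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank in the sorted list
    return ordered[rank - 1]

# Illustrative values only (assumption: ten petal lengths in cm)
petal_lengths = [1.4, 4.7, 5.1, 1.3, 6.0, 4.5, 1.5, 5.9, 4.0, 1.7]
p50 = nearest_rank_percentile(petal_lengths, 50)
p90 = nearest_rank_percentile(petal_lengths, 90)
print(p50, p90)
```

Other conventions interpolate between neighbouring values, so results from statistics packages can differ slightly from this rule on small samples.
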