Data Mining 2011 - Volinsky - Columbia University

Exploratory Data Analysis and Data Visualization
Chapter 2

credits:
  Interactive and Dynamic Graphics for Data Analysis: Cook and Swayne
  Padhraic Smyth’s UCI lecture notes
  R Graphics: Paul Murrell
  Graphics of Large Datasets: Visualizing a Million: Unwin, Theus and Hofmann

Outline
• EDA
• Visualization
  – One variable
  – Two variables
  – More than two variables
  – Other types of data
  – Dimension reduction

EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are important (necessary?) steps in any analysis task.
• get to know your data!
  – distributions (symmetric, normal, skewed)
  – data quality problems
  – outliers
  – correlations and inter-relationships
  – subsets of interest
  – suggest functional relationships
• Sometimes EDA or viz might be the goal!

[Figure: example visualization, flowingdata.com, 9/9/11]

[Figure: example visualization, NYTimes, 7/26/11]

Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
  – means, medians, quantiles, histograms, boxplots
• You should always look at every variable - you will learn something!
• data-driven (model-free)
• Think interactive and visual
  – Humans are the best pattern recognizers
  – You can use more than 2 dimensions!
    • x, y, z, space, color, time…
• especially useful in early stages of data mining
  – detect outliers (e.g. assess data quality)
  – test assumptions (e.g. normal distributions or skewed?)
  – identify useful raw data & transforms (e.g. log(x))
• Bottom line: it is always well worth looking at your data!
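
A first numeric pass over one variable, in the spirit of the EDA slide above, might look like the sketch below. It is only an illustration: the sample values are made up, the function names are hypothetical, the quartile method is one of several conventions, and the 1.5 × IQR cutoff is the usual boxplot whisker rule rather than anything the slides prescribe.

```python
import math
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    # method="inclusive" treats the data as the whole population of interest;
    # other quartile conventions give slightly different cut points.
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Flag points more than k * IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical right-skewed sample (made-up values, not from the slides).
amounts = [1, 2, 2, 3, 3, 4, 5, 8, 13, 100]

print("five-number summary:", five_number_summary(amounts))
print("flagged as outliers:", iqr_outliers(amounts))

# Mean far above median is a quick hint of right skew.
print("mean:", statistics.mean(amounts), "median:", statistics.median(amounts))

# A log transform (the slide's log(x) example) pulls the long tail in,
# so mean and median of the transformed values sit much closer together.
log_amounts = [math.log(v) for v in amounts]
print("log mean:", statistics.mean(log_amounts),
      "log median:", statistics.median(log_amounts))
```

A flagged point is a prompt to look at the record, not proof of bad data; the slide's point is that the check is cheap enough to run on every variable.
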
Summary Statistics
• not visual
• sample statistics of data X
  – mean: $\mu = \sum_i X_i / n$
  – mode: most common value in X
  – median: X = sort(X), median = $X_{n/2}$ (half below, half above)
  – quartiles of sorted X: Q1 value = $X_{0.25n}$, Q3 value = $X_{0.75n}$
    • interquartile range: value(Q3) - value(Q1)
    • range: max(X) - min(X) = $X_n - X_1$
  – variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
  – skewness: $\sum_i (X_i - \mu)^3 / \left[ \left( \sum_i (X_i - \mu)^2 \right)^{3/2} \right]$
    • zero if symmetric; right-skewed more common (what kind of data is right skewed?)
  – number of distinct values for a variable (see unique() in R)
  – Don’t need to report all of these: Bottom line… do these numbers make sense???...
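
The formulas on the Summary Statistics slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's code: the function name and the sample are hypothetical, and the order statistic $X_{pn}$ is read here as the value at 1-based rank ceil(p·n) in the sorted data, which is one of several reasonable conventions. Skewness is computed exactly as written on the slide; the commonly used moment coefficient of skewness equals this value multiplied by $\sqrt{n}$, so the sign and the zero-if-symmetric property are the same.

```python
import math
from collections import Counter


def summary_statistics(values):
    """Compute the slide's summary statistics for a list of numbers."""
    x = sorted(values)
    n = len(x)

    def order_stat(p):
        # Assumed convention: 1-based rank ceil(p * n) in the sorted data.
        rank = max(1, math.ceil(p * n))
        return x[rank - 1]

    mu = sum(x) / n
    ss2 = sum((v - mu) ** 2 for v in x)  # sum of squared deviations
    ss3 = sum((v - mu) ** 3 for v in x)  # sum of cubed deviations
    q1, q3 = order_stat(0.25), order_stat(0.75)

    return {
        "mean": mu,
        "mode": Counter(x).most_common(1)[0][0],  # ties: first one found
        "median": order_stat(0.5),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": x[-1] - x[0],
        "variance": ss2 / n,  # divides by n, as on the slide
        "skewness": ss3 / ss2 ** 1.5 if ss2 > 0 else 0.0,  # slide's formula
        "distinct": len(set(x)),  # analogue of length(unique(x)) in R
    }


# Hypothetical sample with one large value, so the skewness is positive.
sample = [1, 2, 2, 3, 4, 6, 17]
for name, value in summary_statistics(sample).items():
    print(f"{name:>9}: {value}")
```

Printing every statistic side by side makes the slide's closing question easy to answer: a mean well above the median, or a range dominated by one value, is visible at a glance.
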
This note was uploaded on 02/28/2012 for the course ELEN E4815 taught by Professor I during the Spring '12 term at Columbia.