Topic2-EDAViz

Topic2-EDAViz - Data Mining 2011 - Volinsky - Columbia...


Data Mining 2011 - Volinsky - Columbia University

Exploratory Data Analysis and Data Visualization

Chapter 2

credits:
• Interactive and Dynamic Graphics for Data Analysis: Cook and Swayne
• Padhraic Smyth’s UCI lecture notes
• R Graphics: Paul Murrell
• Graphics of Large Datasets: Visualizing a Million: Unwin, Theus and Hofmann

Outline
• EDA
• Visualization
  – One variable
  – Two variables
  – More than two variables
  – Other types of data
  – Dimension reduction

EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are important (necessary?) steps in any analysis task.
• Get to know your data!
  – distributions (symmetric, normal, skewed)
  – data quality problems
  – outliers
  – correlations and inter-relationships
  – subsets of interest
  – suggest functional relationships
• Sometimes EDA or viz might be the goal!

[Figure: flowingdata.com, 9/9/11]

[Figure: NYTimes, 7/26/11]

Exploratory Data Analysis (EDA)
• Goal: get a general sense of the data
  – means, medians, quantiles, histograms, boxplots
• You should always look at every variable - you will learn something!
• Data-driven (model-free)
• Think interactive and visual
  – Humans are the best pattern recognizers
  – You can use more than 2 dimensions!
    • x, y, z, space, color, time...
• Especially useful in early stages of data mining
  – detect outliers (e.g. assess data quality)
  – test assumptions (e.g. normal distributions or skewed?)
  – identify useful raw data & transforms (e.g. log(x))
• Bottom line: it is always well worth looking at your data!
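As a rough illustration of "look at every variable" and of the log(x) transform mentioned above, here is a minimal Python sketch that prints a text histogram of one variable before and after taking logs. The helper `text_histogram`, the bin count, and the toy values are all hypothetical choices made for this example; they do not come from the lecture.

```python
import math

def text_histogram(values, bins=5, width=40):
    """Count values in equal-width bins and draw one '#' bar per bin."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the last bin; clamp it back.
        idx = min(int((v - lo) / step), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        left_edge = lo + i * step
        bar = "#" * round(width * c / peak)
        lines.append(f"{left_edge:10.2f} | {bar} {c}")
    return counts, lines

# Hypothetical right-skewed variable (made-up values, for illustration only).
x = [1, 2, 2, 3, 3, 4, 5, 8, 12, 20, 50, 200]

raw_counts, raw_lines = text_histogram(x)
log_counts, log_lines = text_histogram([math.log10(v) for v in x])

print("raw x")
print("\n".join(raw_lines))
print("log10(x)")
print("\n".join(log_lines))
```

On the raw scale almost every value falls into the first bin, so the histogram mostly shows the two largest values. After the log transform the counts spread across the bins, which is the kind of thing a quick look at the data is meant to reveal.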
Summary Statistics
• not visual
• sample statistics of data X
  – mean: $\mu = \frac{1}{n} \sum_i X_i$
  – mode: most common value in X
  – median: sort X, then median = $X_{n/2}$ (half below, half above)
  – quartiles of sorted X: Q1 = $X_{0.25n}$, Q3 = $X_{0.75n}$
    • interquartile range: Q3 - Q1
    • range: max(X) - min(X) = $X_n - X_1$
  – variance: $\sigma^2 = \frac{1}{n} \sum_i (X_i - \mu)^2$
  – skewness: $\sum_i (X_i - \mu)^3 / \left[ \sum_i (X_i - \mu)^2 \right]^{3/2}$
    • zero if symmetric; right-skewed more common (what kind of data is right skewed?)
  – number of distinct values for a variable (see unique() in R)
  – Don’t need to report all of these. Bottom line: do these numbers make sense?
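The statistics listed above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the course's code: the function name `summarize` and the sample values are hypothetical, and the quartiles use the linear-interpolation rule of `statistics.quantiles`, which is one of several conventions. The skewness here divides each sum by n (the usual moment coefficient); the expression in the list above omits those factors, which rescales the value by $1/\sqrt{n}$ but leaves its sign unchanged.

```python
import statistics

def summarize(values):
    """Compute the summary statistics listed above for a list of numbers."""
    xs = sorted(values)
    n = len(xs)
    mean = sum(xs) / n
    # Variance with divisor n, matching the formula above (not n - 1).
    variance = sum((v - mean) ** 2 for v in xs) / n
    # Moment coefficient of skewness: third central moment / variance^(3/2).
    third_moment = sum((v - mean) ** 3 for v in xs) / n
    skewness = third_moment / variance ** 1.5 if variance > 0 else 0.0
    # Quartiles by linear interpolation; other conventions can give slightly
    # different values on small samples.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return {
        "mean": mean,
        "mode": statistics.mode(xs),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "range": xs[-1] - xs[0],
        "variance": variance,
        "skewness": skewness,
        "distinct": len(set(xs)),  # similar in spirit to length(unique(x)) in R
    }

sample = [1, 2, 2, 3, 4, 5, 20]  # hypothetical data with one large value
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name:>9}: {value:.3f}")
```

For this sample the mean is pulled well above the median by the single large value and the skewness is positive, which is the sort of sanity check the last bullet asks for.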

This note was uploaded on 02/28/2012 for the course ELEN E4815 taught by Professor I during the Spring '12 term at Columbia.
