Modality: unimodal; multi-modal (bimodal, trimodal)
Empirical formula (unimodal, moderately skewed data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$
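
Rearranging the empirical formula gives mode ≈ 3 × median − 2 × mean, which may be handy when the mode is hard to read off directly. The following is a minimal sketch of that rearrangement, not a definitive estimator; the function name `empirical_mode` and the sample values are hypothetical.

```python
from statistics import mean, median


def empirical_mode(values):
    """Estimate the mode from mean - mode ~= 3 * (mean - median).

    Assumes unimodal, moderately skewed data; the result is only a
    rough approximation and can differ from the observed mode.
    """
    return 3 * median(values) - 2 * mean(values)


# Hypothetical sample, used only for illustration.
sample = [1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 10.0]
print("mean:", mean(sample))
print("median:", median(sample))
print("estimated mode:", empirical_mode(sample))
```

On small or strongly skewed samples the estimate can be noticeably off, so it is best compared against the observed mode where one is available.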

Data Mining Exploratory Data Analysis

Symmetric vs. Skewed Data
Data in most real applications are not symmetric. They may instead be either positively skewed, where the mode occurs at a value smaller than the median, or negatively skewed, where the mode occurs at a value greater than the median.
[Figure: symmetric data, positively skewed data, negatively skewed data]

Properties of Normal Distribution Curve
[Figure: normal distribution curve, annotated to show central tendency and data dispersion (spread)]
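
The mode-versus-median rule for skew direction can be sketched as a small check. This is an illustrative assumption-laden example: the function name `skew_direction` is hypothetical, the mode is taken as the most frequent value (ties resolved by first occurrence), and the rule is only meaningful for discrete or binned data where a mode is well defined.

```python
from collections import Counter
from statistics import median


def skew_direction(values):
    """Classify skew by comparing the most frequent value with the median."""
    mode_value, _ = Counter(values).most_common(1)[0]
    med = median(values)
    if mode_value < med:
        return "positively skewed"
    if mode_value > med:
        return "negatively skewed"
    # Mode equals median: this rule alone cannot detect any skew.
    return "no skew detected"


# Hypothetical samples, used only for illustration.
print(skew_direction([1, 1, 1, 2, 3, 5, 9]))
print(skew_direction([1, 5, 7, 8, 9, 9, 9]))
```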

Measures of Data Distribution: Variance and Standard Deviation
Variance and standard deviation (sample: $s$, population: $\sigma$)
Variance (algebraic, scalable computation):
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2-\frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2-\mu^2$$
Q: Can you compute it incrementally and efficiently?
Standard deviation $s$ (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$).

Graphic Displays of Basic Statistical Descriptions
Boxplot: graphic display of the five-number summary.
Histogram: the x-axis represents values, the y-axis represents frequencies.
Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i\%$ of the data are $\le x_i$.
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
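
On the question of computing the variance incrementally: because the variance can be written in terms of the count, the sum, and the sum of squares, one pass over the data (or a merge of partial results) is enough. The sketch below illustrates that idea; the class name `RunningVariance` and its methods are hypothetical, not part of any library.

```python
class RunningVariance:
    """Accumulate n, sum(x), and sum(x^2) so variance needs one pass."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        """Fold one new observation into the running sums."""
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other):
        """Combine sums computed on another partition of the data."""
        self.n += other.n
        self.total += other.total
        self.total_sq += other.total_sq

    def sample_variance(self):
        """s^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)."""
        if self.n < 2:
            raise ValueError("need at least two observations")
        return (self.total_sq - self.total ** 2 / self.n) / (self.n - 1)

    def population_variance(self):
        """sigma^2 = (sum x^2) / N - mu^2."""
        if self.n < 1:
            raise ValueError("need at least one observation")
        mu = self.total / self.n
        return self.total_sq / self.n - mu * mu


# Hypothetical sample, used only for illustration.
rv = RunningVariance()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rv.add(value)
print(rv.sample_variance(), rv.population_variance())
```

The sum-of-squares form can lose precision when the mean is large relative to the spread; Welford's online algorithm is a commonly used, more numerically stable alternative.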

Boxplot
Boxplot: graphic display of the five-number summary.

Measuring the Dispersion of Data: Quartiles & Boxplots
Quartiles: $Q_1$ (25th percentile), $Q_3$ (75th percentile)
Inter-quartile range: $\text{IQR} = Q_3 - Q_1$
Five-number summary: min, $Q_1$, median, $Q_3$, max
Boxplot: data are represented with a box
  $Q_1$, $Q_3$, IQR: the ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  Median ($Q_2$) is marked by a line within the box
  Whiskers: two lines outside the box, extended to the minimum and maximum (or to the most extreme values within the outlier threshold when outliers are plotted individually)
  Outliers: points beyond a specified outlier threshold, plotted individually
Outlier: usually, a value more than 1.5 × IQR above $Q_3$ or below $Q_1$
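
The five-number summary and the 1.5 × IQR outlier rule can be sketched with the Python standard library. This is a minimal example under stated assumptions: the function names are hypothetical, and quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is only one of several quartile conventions, so other tools may report slightly different values.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


def iqr_outliers(values, k=1.5):
    """Return values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical sample, used only for illustration.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(five_number_summary(sample))
print(iqr_outliers(sample))
```

The multiplier `k` is left as a parameter because 1.5 is a convention rather than a fixed rule.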

Visualization of Data Dispersion: 3-D Boxplots
[Figure: 3-D boxplots]

Histogram
Histogram: the x-axis represents values, the y-axis represents frequencies.
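
To make the value and frequency axes concrete, the sketch below counts how many values fall into each of several equal-width bins and prints a text histogram. The function name `histogram_counts`, the choice of equal-width bins, and the sample values are assumptions made for illustration.

```python
def histogram_counts(values, num_bins):
    """Return (bin_edges, counts) for equal-width bins over the data range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in values:
        # The maximum value is placed in the last bin rather than past it.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical sample, used only for illustration.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]
edges, counts = histogram_counts(sample, 3)
for left, right, count in zip(edges, edges[1:], counts):
    # Bin range stands in for the x-axis, bar length for the frequency.
    print(f"[{left:5.1f}, {right:5.1f}) {'#' * count}")
```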

