
# class_09_03 - Statistical Data Mining ORIE 474 Fall 2007...

Statistical Data Mining ORIE 474 Fall 2007 Tatiyana V. Apanasovich 09/03/07 Visualizing Data

3. Visualizing and Exploring Data

- Summarizing Data
- Tools for Displaying
  - Single variables
  - Relationships between two variables
  - More than two variables
- Principal Component Analysis

3.2 Summarizing Data

Suppose that x(1),…,x(n) is a set of n data values. Relevant sample statistics are:

- Location measures:
  - Mean
  - Median and quartiles
  - Mode
- Dispersion or variability measures:
  - Standard deviation and variance
  - Interquartile range and range
- Skewness
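
As a rough illustration of the statistics listed above, the following Python sketch computes each of them with the standard library only. The function name `summarize` and the data values are made up for this example. The quartile convention (the default of `statistics.quantiles`) and the moment-based skewness formula are assumptions, since the slides do not fix either definition.

```python
import statistics


def summarize(values):
    """Compute location, dispersion, and skewness summaries of a sample.

    Assumes at least two values that are not all equal.
    """
    xs = sorted(values)
    n = len(xs)
    mean = statistics.fmean(xs)
    # Quartiles: statistics.quantiles uses the "exclusive" method by default;
    # other conventions give slightly different Q1 and Q3.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    # Moment-based skewness m3 / m2^(3/2) (one common definition, assumed here).
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return {
        "mean": mean,
        "median": statistics.median(xs),
        "quartiles": (q1, q3),
        "modes": statistics.multimode(xs),
        "variance": statistics.variance(xs),  # n - 1 denominator
        "std_dev": statistics.stdev(xs),      # square root of the above
        "iqr": q3 - q1,
        "range": xs[-1] - xs[0],
        "skewness": m3 / m2 ** 1.5,
    }


data = [2, 4, 4, 5, 7, 9, 10, 15]  # made-up values for illustration
summary = summarize(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For this small sample the mean (7) lies above the median (6) and the skewness is positive, which is what one would expect from the long right tail created by the value 15.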

3.2 Summarizing Data (cont’d) Ex: 100 data points sampled from a normal distribution with mean 0 and std. dev. 10
Sample Mean

Location measure. Sample mean:

$$\hat{\mu} = \frac{1}{n} \sum_i x(i)$$

Ex: $\hat{\mu} = 1.36$ (whereas µ=0)

The sample mean is the value that is “central” in the sense that it minimizes the sum of squared differences between it and the data:

$$\hat{\mu} = \arg\min_a \sum_i \bigl(x(i) - a\bigr)^2$$

Proof:

$$\frac{d}{da} \sum_i \bigl(x(i) - a\bigr)^2 = -2 \sum_i \bigl(x(i) - a\bigr) = 0 \;\Rightarrow\; \sum_i x(i) = na \;\Rightarrow\; a = \frac{1}{n} \sum_i x(i)$$
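
The minimizing property of the sample mean can also be checked numerically. The sketch below draws 100 points from a normal distribution with mean 0 and std. dev. 10, as in the example, and compares the sum of squared differences at the sample mean with the sum at nearby values. The seed, the helper name `sum_sq`, and the shift values are arbitrary choices for illustration, so the resulting sample mean will not match the slide's value.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only to make the run repeatable
xs = [random.gauss(0, 10) for _ in range(100)]  # mean 0, std. dev. 10


def sum_sq(a, data):
    """Sum of squared differences between the candidate value a and the data."""
    return sum((x - a) ** 2 for x in data)


mu_hat = statistics.fmean(xs)
best = sum_sq(mu_hat, xs)
print(f"sample mean: {mu_hat:.3f}, sum of squares: {best:.1f}")

# Moving away from the sample mean by d increases the sum by exactly n * d^2.
for d in (-1.0, -0.1, 0.1, 1.0):
    print(f"shift {d:+.1f}: increase {sum_sq(mu_hat + d, xs) - best:.3f}")
```

The increase of n·d² follows from expanding the square: the cross term vanishes because the deviations from the sample mean sum to zero.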

Sample Median and Mode

Sample median = the value that has an equal number of data points above and below it. If, as in our example, n is even, it is usually defined as the value halfway between the two middle values.

Ex: (whereas m=0)
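
A minimal sketch of the median rule just described, including the even-n case. The function name `sample_median` and the test values are hypothetical. Python's built-in `statistics.median` follows the same convention and is used here only as a cross-check.

```python
import statistics


def sample_median(values):
    """Median of a sample: the middle value, or the halfway point of the two middle values."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                    # odd n: single middle value
    return (xs[mid - 1] + xs[mid]) / 2    # even n: halfway between the two middle values


even_sample = [3, 1, 4, 2]  # sorted: 1, 2, 3, 4; middle values are 2 and 3
odd_sample = [5, 1, 3]      # sorted: 1, 3, 5; middle value is 3

print(sample_median(even_sample))
print(sample_median(odd_sample))
print(sample_median(even_sample) == statistics.median(even_sample))
```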
