Data Mining CS57300 Purdue University September 7, 2010

Populations and samples (cont)
Types of probability sampling
• Simple random sampling
  • There is an equal probability of selecting any particular item
• Sampling without replacement
  • As each item is selected, it is removed from the population
• Sampling with replacement
  • Items are not removed from the population as they are selected for the sample; the same item can be picked up more than once
• Stratified sampling
  • Split the data into several partitions; then draw random samples from each partition
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
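The sampling schemes listed above can be sketched with Python's standard `random` module. This is a minimal illustration, not code from the course: the function names, the `key` argument used to assign items to partitions, and the fixed per-partition sample size are all assumptions made for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(items, n, replace=False, rng=random):
    """Draw n items, each position chosen uniformly at random.

    replace=False: an item leaves the pool once it is picked.
    replace=True:  items stay in the pool, so repeats are possible.
    """
    if replace:
        return [rng.choice(items) for _ in range(n)]
    return rng.sample(items, n)


def stratified_sample(items, key, n_per_stratum, rng=random):
    """Partition items by key(item), then sample each partition separately.

    Hypothetical design choice: the same number of items is requested from
    every partition (capped at the partition size).
    """
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = min(n_per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample


if __name__ == "__main__":
    population = list(range(100))
    print(simple_random_sample(population, 5))
    print(simple_random_sample(population, 5, replace=True))
    print(stratified_sample(population, key=lambda x: x % 4, n_per_stratum=2))
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.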

Sample size
[Figure: panels showing 500 points, 2000 points, and 8000 points]
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
Sample size
• What sample size is necessary to get at least one object from each of 10 groups?
Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.
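One way to explore this question numerically is sketched below. It assumes the groups are equally likely on every draw and that draws are independent (sampling with replacement, or a population large enough that removal barely matters); neither assumption is stated on the slide. Under those assumptions, inclusion-exclusion gives the probability that all k groups appear in n draws. The function names are hypothetical.

```python
from math import comb


def prob_all_groups(n, k=10):
    """P(every one of k equally likely groups appears in n independent draws).

    Inclusion-exclusion over the number j of groups that are missed:
    sum_j (-1)^j * C(k, j) * (1 - j/k)^n
    """
    return sum((-1) ** j * comb(k, j) * (1 - j / k) ** n for j in range(k + 1))


def min_sample_size(target, k=10):
    """Smallest n whose coverage probability reaches the target."""
    n = k  # fewer than k draws can never cover k groups
    while prob_all_groups(n, k) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in (10, 20, 40, 60):
        print(n, round(prob_all_groups(n), 4))
    print("n for 90% coverage:", min_sample_size(0.90))
```

The coverage probability is tiny at n = 10 and rises quickly with n, so a sample well above the number of groups is needed before all groups are likely to be represented.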

Statistical inference
Populations and samples
• In data mining we often work with a sample of data from the population of interest
• Estimation techniques allow inferences about population properties from sample data
• If we had the population we could calculate the properties of interest
[Figure: sampling goes from the population (parameter: Beta = 0.546) to the sample (statistic: b = 0.692); inference goes from the sample back to the population]
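The distinction between a population parameter and a sample statistic can be illustrated with a small simulation. The sketch below assumes the quantity of interest is a least-squares slope, builds a synthetic population with a true slope of 0.5, and compares the slope computed on the whole population with the slope computed on a sample of 50. The population, the sizes, and the `slope` helper are invented for illustration and are not the data behind the numbers on the slide.

```python
import random


def slope(xs, ys):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)

# Synthetic population (an assumption for this sketch): y = 0.5 * x + noise.
pop_x = [rng.gauss(0, 1) for _ in range(10_000)]
pop_y = [0.5 * x + rng.gauss(0, 1) for x in pop_x]

# Parameter: computed from the entire population.
beta = slope(pop_x, pop_y)

# Statistic: computed from a simple random sample of 50 items.
idx = rng.sample(range(len(pop_x)), 50)
b = slope([pop_x[i] for i in idx], [pop_y[i] for i in idx])

print(f"population parameter Beta = {beta:.3f}")
print(f"sample statistic     b    = {b:.3f}")
```

Rerunning with a different seed changes b noticeably while Beta stays nearly fixed, which is the gap that inference has to account for.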

Statistical inference
• Infer properties of an unknown distribution with sample data generated from that distribution
• Parameter estimation
  • Infer the value of a population parameter based on a sample statistic (e.g., estimate the mean)
• Hypothesis testing
  • Infer the answer to a question about a population parameter based on a sample statistic (e.g., is the mean non-zero?)
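Both tasks can be sketched for the mean using only the standard library. The example assumes independent observations and a sample large enough for a normal (z) approximation; a t distribution would be the usual choice for small samples but is not available in the standard library. The function names and the simulated data are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def estimate_mean(sample):
    """Parameter estimation: sample mean and its standard error."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    return m, se


def z_test_mean_zero(sample):
    """Hypothesis test of H0: population mean = 0.

    Returns the z statistic and a two-sided p-value under the
    large-sample normal approximation (an assumption of this sketch).
    """
    m, se = estimate_mean(sample)
    z = m / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated sample from a distribution whose true mean is 0.3.
    data = [rng.gauss(0.3, 1.0) for _ in range(200)]
    m, se = estimate_mean(data)
    z, p = z_test_mean_zero(data)
    print(f"estimated mean = {m:.3f} (standard error {se:.3f})")
    print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A small p-value is evidence against a zero mean; a large one means the sample is compatible with a zero mean, not that the mean is zero.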
Example inference procedure

