# 03c-preprocessing-bigdata.pdf - CS171 Intro to ML DM 03c...


CS171 Intro to ML & DM
03c – Data Preprocessing & Big Data Processing
Evangelos (Vagelis) Papalexakis
Many of the slides adapted from Jiawei Han, Micheline Kamber and Jian Pei (authors of the textbook) and Christos Faloutsos; Spark slides adapted from Leman Akoglu (CMU Heinz – Course 95-869)
CS171-S19-L03c
Today’s Agenda
• Data Reduction (sampling)
• Data Transformation
• Intro to Big Data & Hadoop / MapReduce
Sampling
• Sampling: obtaining a small sample s to represent the whole data set N
• Allows an algorithm to run in complexity that is potentially sub-linear in the size of the data
• Key principle: choose a representative subset of the data
• Simple random sampling may perform very poorly in the presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling
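To make the skew problem concrete, the sketch below computes the probability that a simple random sample (drawn without replacement) contains no member of a small group at all. The function name and the example numbers (10,000 items, 10 of them rare, sample of 100) are illustrative assumptions, not taken from the slides.

```python
from math import comb


def prob_sample_misses_group(population_size, group_size, sample_size):
    """Probability that a simple random sample drawn without replacement
    contains no member of a group of `group_size` items."""
    # Count the samples that avoid the group, divided by all possible samples.
    ways_avoiding_group = comb(population_size - group_size, sample_size)
    all_ways = comb(population_size, sample_size)
    return ways_avoiding_group / all_ways


# Hypothetical skewed data set: 10 rare items among 10,000, sample of 100.
p_miss = prob_sample_misses_group(10_000, 10, 100)
print(f"P(sample contains no rare item) = {p_miss:.3f}")
```

Under these assumed numbers the rare group is missed entirely in roughly nine out of ten samples, which is the kind of failure that motivates stratified sampling.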
Types of Sampling
• Simple/uniform random sampling: there is an equal probability of selecting any particular item
• Sampling without replacement: once an object is selected, it is removed from the population
• Sampling with replacement: a selected object is not removed from the population, so it can be selected again
• Stratified sampling: partition the data set and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data from each partition)
• Stratified sampling is used in conjunction with skewed data
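The sampling schemes listed above can be sketched with Python's standard library. This is a minimal illustration, not a reference implementation: the helper names, the toy data set, and the choice to keep at least one item per stratum are all assumptions made for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(items, k, replace=False, rng=None):
    """Uniform random sample of k items, with or without replacement."""
    rng = rng or random.Random()
    if replace:
        return rng.choices(items, k=k)  # an item may be drawn more than once
    return rng.sample(items, k)         # each item is drawn at most once


def stratified_sample(items, key, fraction, rng=None):
    """Draw about the same fraction from every partition (stratum)."""
    rng = rng or random.Random()
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one item so small strata are never dropped.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical skewed data set: 990 "common" records and 10 "rare" ones.
rng = random.Random(0)
data = [("common", i) for i in range(990)] + [("rare", i) for i in range(10)]

without_repl = simple_random_sample(data, 100, rng=rng)
with_repl = simple_random_sample(data, 100, replace=True, rng=rng)
stratified = stratified_sample(data, key=lambda rec: rec[0], fraction=0.1, rng=rng)

rare_in_stratified = sum(1 for label, _ in stratified if label == "rare")
print(len(without_repl), len(with_repl), len(stratified), rare_in_stratified)
```

With a 10% fraction, the stratified sample always contains one of the ten rare records, whereas the simple random samples carry no such guarantee.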
Bootstrapping
• We need to estimate a statistic of the data (e.g., the mean) from a sample
• We can compute the sample mean and get a point estimate
• How accurate is it? How much variability is there?
• Bootstrapping can provide confidence intervals for that statistic
Bootstrapping
• Bootstrapping “pretends” that the sample of data we have is the actual distribution of the data
[Figure: Population (unknown values: ? ? ? ...); Sample (n=5): 4, 4, 2, 5, 7; Bootstrapped Sample (n=5): 7, 7, 5, 4, 2]
Bootstrapping
• From the sample of N data points we have, we sample N items with replacement
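The resampling step above can be turned into a confidence interval for the mean. The sketch below uses the percentile method, which is one common choice and an assumption here, since the slides do not say which interval construction is intended. The function name, the number of resamples, and the seed are likewise illustrative; the five data values are the sample from the figure.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` of the data."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample n items WITH replacement, recompute the statistic, repeat.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the estimates.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


sample = [4, 4, 2, 5, 7]  # the n=5 sample shown on the earlier slide
point_estimate = statistics.mean(sample)
ci_low, ci_high = bootstrap_ci(sample)
print(f"mean = {point_estimate}, 95% CI ~ [{ci_low}, {ci_high}]")
```

The width of the interval reflects how much the sample mean varies across resamples; with only five points it is wide, which answers the "how accurate is it?" question raised earlier.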