Common Data Terms
- Private and sensitive data
- Metadata versus data
- Unstructured vs structured data
- Databases
- Data silos
- Born digital
- Data quality
- Data cleaning
- Raw data
- Open data
- “Big data”
Private and Sensitive Data
Metadata versus Data
Publisher: New York Times
Publication date: May 29, 2015
Author: Paul Krugman
Title: The Insecure American
Unstructured vs Structured Data
sister: Maya Soetoro-Ng
spouse: Michelle Obama
pet: Bo
handedness: left-handedness
member of political party: Democratic Party
Structuring Data with XML Markup
<observation>
  <date>29 January 2015</date>
  <temperature>
    <value>59</value>
    <unit>F</unit>
  </temperature>
  <activity>Not many birds observed today. Some squirrels were running around but not as many as other days. A raccoon came in sight briefly.</activity>
</observation>
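To illustrate why markup helps, here is a minimal sketch in Python that reads the observation record above and pulls out each tagged field. The function name `parse_observation` and the choice to convert the temperature to a number are assumptions made for this illustration, not part of the slides.

```python
import xml.etree.ElementTree as ET

# The observation record from the slide, as an XML string.
OBSERVATION_XML = """
<observation>
  <date>29 January 2015</date>
  <temperature>
    <value>59</value>
    <unit>F</unit>
  </temperature>
  <activity>Not many birds observed today. Some squirrels were running around but not as many as other days. A raccoon came in sight briefly.</activity>
</observation>
"""


def parse_observation(xml_text):
    """Return the marked-up fields of one observation as a dict.

    Hypothetical helper: the field names mirror the tags on the slide.
    """
    root = ET.fromstring(xml_text.strip())
    return {
        "date": root.findtext("date").strip(),
        # The tags tell us this text is a number, so it can be converted.
        "temperature": float(root.findtext("temperature/value")),
        "unit": root.findtext("temperature/unit").strip(),
        # The activity note is still free text inside its tag.
        "activity": root.findtext("activity").strip(),
    }


record = parse_observation(OBSERVATION_XML)
print(record["date"], record["temperature"], record["unit"])
```

The date, value, and unit can be read directly because they are tagged, while the activity note remains unstructured text that a program can store but not easily interpret.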
Databases
- Tabular organization of data
- Efficient “queries”
- Relational databases: relations across tables
- “Key” as the unique identifier of objects across tables
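The ideas of tables, keys, and queries can be sketched with Python's built-in `sqlite3` module. The table layout and the `faculty_id` values below are hypothetical, chosen only to show a key linking two tables. The course and faculty values are sample data.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables; each has a key that uniquely identifies its rows.
# The schema is an assumption made for this illustration.
conn.executescript("""
CREATE TABLE faculty (
    faculty_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE course (
    course_id  INTEGER PRIMARY KEY,
    faculty_id INTEGER NOT NULL REFERENCES faculty(faculty_id),
    semester   TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO faculty VALUES (?, ?)",
    [(1, "A. Gold"), (2, "D. Garcia")],
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?, ?)",
    [(549, 1, "Fall 2015"), (533, 2, "Spring 2015")],
)

# A query that follows the faculty_id key from one table to the other.
rows = conn.execute("""
SELECT course.course_id, faculty.name, course.semester
FROM course
JOIN faculty ON course.faculty_id = faculty.faculty_id
ORDER BY course.course_id
""").fetchall()

for row in rows:
    print(row)
```

Storing the faculty name once and referring to it by key avoids repeating it in every course row, which is one reason relational databases split data across tables.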
Data Silos
- Hard to get data out
- Hard to integrate data across silos
- Examples:
  - Social sites
  - Electronic patient records
Data “Born Digital”
- Originally recorded or created in digital form
Data Quality
Data Cleaning

Course ID   Faculty Name   Semester Offered   Open
549         A. Gold        Fall 2015          Yes
533         D. Garcia      Spring 2015        Open
556         P. Peters      Spring 2015        Y
521         J. Smith       Fall 2017          Open

- Joseph Smith or Jane Smith?
- Errors (maybe)
- Inconsistent formats
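One of the cleaning steps for this table, making the “Open” column consistent, might look like the sketch below. It assumes that “Yes”, “Y”, and “Open” all mean the course is open. That is a guess about the data, not something the slide states, and the function name `normalize_open` and the "closed" spellings are likewise hypothetical.

```python
# The four rows from the slide, with the Open column as originally entered.
ROWS = [
    {"course_id": 549, "faculty": "A. Gold", "semester": "Fall 2015", "open": "Yes"},
    {"course_id": 533, "faculty": "D. Garcia", "semester": "Spring 2015", "open": "Open"},
    {"course_id": 556, "faculty": "P. Peters", "semester": "Spring 2015", "open": "Y"},
    {"course_id": 521, "faculty": "J. Smith", "semester": "Fall 2017", "open": "Open"},
]

# Assumption: these spellings all mean "the course is open".
OPEN_SPELLINGS = {"yes", "y", "open"}
# Assumption: plausible spellings for "not open"; none appear on the slide.
CLOSED_SPELLINGS = {"no", "n", "closed"}


def normalize_open(value):
    """Map a free-form Open value to True, False, or None if unrecognized."""
    key = value.strip().lower()
    if key in OPEN_SPELLINGS:
        return True
    if key in CLOSED_SPELLINGS:
        return False
    return None  # Leave unknown values for a person to review.


cleaned = [{**row, "open": normalize_open(row["open"])} for row in ROWS]

for row in cleaned:
    print(row["course_id"], row["open"])
```

Returning `None` for unrecognized values keeps the cleaning step from silently guessing. Problems such as the ambiguous “J. Smith” or a possibly mistyped year cannot be fixed by a rule like this and need someone to check the source.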
Data Cleaning to the Extreme…

Patient ID   Age   Smoker?
549          45    Occasionally as a teen
533          55    Never before lunch
556          43    Two packs a day
521          78    Quit 3 years ago

Requires a lot of work to be usable
Raw Data
- Raw data is newly collected data before any kind of cleaning or pre-processing.
- Fall '17
- Data Management, 175, Computational Thinking and Data Science