Common Data Terms
• Private and sensitive data
• Metadata versus data
• Unstructured vs structured data
• Databases
• Data silos
• Born digital
• Data quality
• Data cleaning
• Raw data
• Open data
• “Big data”

Private and Sensitive Data

Metadata versus Data
Publisher: New York Times
Publication date: May 29, 2015
Author: Paul Krugman
Title: The Insecure American

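The distinction can be made concrete with a small Python sketch that keeps the descriptive fields (metadata) apart from the content they describe (data). The `Document` class and `find_by_author` helper are hypothetical names chosen for illustration, and the article text is a placeholder because the slide gives only the metadata.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """A record that keeps metadata separate from the data it describes."""
    metadata: dict  # descriptive fields about the content
    data: str       # the content itself


article = Document(
    metadata={
        "publisher": "New York Times",
        "publication_date": "May 29, 2015",
        "author": "Paul Krugman",
        "title": "The Insecure American",
    },
    # Placeholder: the article body is not part of the slide.
    data="<full text of the article would go here>",
)


def find_by_author(documents, author):
    """Search using metadata only, without reading the data itself."""
    return [d for d in documents if d.metadata.get("author") == author]


matches = find_by_author([article], "Paul Krugman")
print([d.metadata["title"] for d in matches])
```

The lookup touches only the metadata, which is one reason catalogs and search indexes tend to store it separately from the content.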
Unstructured vs Structured Data
sister: Maya Soetoro-Ng
spouse: Michelle Obama
pet: Bo
handedness: left-handedness
member of political party: Democratic Party

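One common way to hold structured statements like these is as (subject, property, value) triples. The sketch below is an illustration under assumptions: the slide text does not name the subject, so a placeholder `SUBJECT` is used; values that appear clipped in the slide text are completed on the assumption that the obvious reading is intended; and `values_of` is a hypothetical helper.

```python
# Placeholder subject: the slide text does not name who the statements describe.
SUBJECT = "person_1"

# Each statement is a (subject, property, value) triple.
triples = [
    (SUBJECT, "sister", "Maya Soetoro-Ng"),
    (SUBJECT, "spouse", "Michelle Obama"),
    (SUBJECT, "pet", "Bo"),
    (SUBJECT, "handedness", "left-handedness"),
    (SUBJECT, "member of political party", "Democratic Party"),
]


def values_of(triples, subject, prop):
    """Return every value recorded for a given subject and property."""
    return [value for s, p, value in triples if s == subject and p == prop]


print(values_of(triples, SUBJECT, "spouse"))
```

Because every statement has the same shape, a question such as "who is the spouse?" becomes a simple filter, which is much harder to do reliably over free-running text.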
Structuring Data with XML Markup

<observation>
  <date>29 January 2015</date>
  <temperature>
    <value>59</value>
    <unit>F</unit>
  </temperature>
  <activity>
    Not many birds observed today. Some squirrels were running around but not as many as other days. A raccoon came in sight briefly.
  </activity>
</observation>

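Once an observation is marked up this way, a program can pull out individual fields by tag name. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the activity text is abridged here, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The observation from the slide, with the activity text abridged.
xml_text = """<observation>
  <date>29 January 2015</date>
  <temperature>
    <value>59</value>
    <unit>F</unit>
  </temperature>
  <activity>Not many birds observed today.</activity>
</observation>"""

root = ET.fromstring(xml_text)

# Tags make each field addressable by a path from the root element.
date = root.findtext("date").strip()
value = int(root.findtext("temperature/value"))
unit = root.findtext("temperature/unit").strip()
activity = root.findtext("activity").strip()

print(date, value, unit)
```

The date and temperature are now structured fields, while the activity element still holds unstructured free text, so a single record can mix both kinds of data.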
Databases
• Tabular organization of data
• Efficient “queries”
• Relational databases: relations across tables
• “Key” as the unique identifier of objects across tables

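These ideas can be sketched with Python's built-in `sqlite3` module. The table names, column names, and `faculty_id` values below are assumptions made for illustration; they show two tables related through a key and a query that joins them.

```python
import sqlite3

# In-memory database; the schema and key values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE faculty (
    faculty_id INTEGER PRIMARY KEY,  -- key: unique identifier of each faculty member
    name       TEXT
);
CREATE TABLE course (
    course_id  INTEGER PRIMARY KEY,  -- key: unique identifier of each course
    faculty_id INTEGER REFERENCES faculty(faculty_id),
    semester   TEXT
);
INSERT INTO faculty VALUES (1, 'A. Gold'), (2, 'D. Garcia');
INSERT INTO course VALUES (549, 1, 'Fall 2015'), (533, 2, 'Spring 2015');
""")

# A query that follows the relation between the two tables via the shared key.
rows = conn.execute(
    """
    SELECT course.course_id, faculty.name, course.semester
    FROM course
    JOIN faculty ON course.faculty_id = faculty.faculty_id
    WHERE course.semester = ?
    ORDER BY course.course_id
    """,
    ("Fall 2015",),
).fetchall()

print(rows)
conn.close()
```

Storing the faculty name once and referring to it by key avoids repeating it in every course row, which is the usual motivation for splitting data across related tables.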
Data Silos
• Hard to get data out
• Hard to integrate data across silos
• Examples:
  • Social sites
  • Electronic patient records

Data “Born Digital”
• Originally recorded or created in digital form

Data Quality

Data Cleaning

Course ID   Faculty Name   Semester Offered   Open
549         A. Gold        Fall 2015          Yes
533         D. Garcia      Spring 2015        Open
556         P. Peters      Spring 2015        Y
521         J. Smith       Fall 2017          Open

• Joseph Smith or Jane Smith?
• Errors (maybe)
• Inconsistent formats

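The inconsistent formats in the Open column can be normalized with a small lookup. This is a minimal sketch, assuming that "Yes", "Y", and "Open" all mean the course is open; that mapping, the negative values, and the `normalize_open` helper are assumptions for illustration, not something the slide states.

```python
# Rows as they appear in the table, with the Open column left uncleaned.
rows = [
    {"course_id": 549, "faculty": "A. Gold", "semester": "Fall 2015", "open": "Yes"},
    {"course_id": 533, "faculty": "D. Garcia", "semester": "Spring 2015", "open": "Open"},
    {"course_id": 556, "faculty": "P. Peters", "semester": "Spring 2015", "open": "Y"},
    {"course_id": 521, "faculty": "J. Smith", "semester": "Fall 2017", "open": "Open"},
]

# Assumed mapping: which spellings count as open or closed.
TRUE_VALUES = {"yes", "y", "open"}
FALSE_VALUES = {"no", "n", "closed"}


def normalize_open(value):
    """Map the varied spellings to True/False; None flags a value for review."""
    key = value.strip().lower()
    if key in TRUE_VALUES:
        return True
    if key in FALSE_VALUES:
        return False
    return None


cleaned = [{**row, "open": normalize_open(row["open"])} for row in rows]
print([row["open"] for row in cleaned])
```

Returning `None` for unrecognized values keeps doubtful entries visible instead of silently guessing. Questions such as which J. Smith is meant, or whether Fall 2017 is an error, cannot be settled by code alone and need a check against the original source.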
Data Cleaning to the Extreme…

Patient ID   Age   Smoker?
549          45    Occasionally as a teen
533          55    Never before lunch
556          43    Two packs a day
521          78    Quit 3 years ago

Requires a lot of work to be usable

Raw Data
• Raw data is newly collected data before any kind of cleaning or pre-processing.
