cmsc320_f2018_lec16.pdf - INTRODUCTION TO DATA SCIENCE JOHN P DICKERSON Lecture #16 – CMSC320 Mondays & Wednesdays 2:00pm – 3:15pm


INTRODUCTION TO DATA SCIENCE JOHN P DICKERSON Lecture #16 – 10/22/2018 CMSC320 Mondays & Wednesdays 2:00pm – 3:15pm
ANNOUNCEMENTS
- Mini-Project #3 is not out yet! It will be out after the midterm.
- It will be linked to from ELMS; will also be available at:
- Deliverable is a .ipynb file submitted to ELMS.
- Due before Thanksgiving (TBD).
- Please label your ipynb file something like <lastname>_<firstname>_project3.ipynb, e.g., dickerson_john_project3.ipynb
PROJECT 1 GRADES ARE UP!
General comments:
- People did really well!
- We used a fairly strict rubric, but if you have a real bone to pick with your grade, please triage through TAs/office hours!
Comments for our sanity, moving forward:
- df.head(n) defaults to n = 5; use ~10, 20, 50 as needed.
- Please label your ipynb file something like <lastname>_<firstname>_project3.ipynb, e.g., dickerson_john_project3.ipynb
COMMON ISSUE
The pandas warning: "A value is trying to be set on a copy of a slice from a DataFrame"
Often not a problem! But sometimes a problem …
Example: df[df['intensity'] > 0.1]['color'] = 'red'
This will not set a value in df, because the assignment is chained: the write lands on a temporary copy.
Instead, use df.loc[df['intensity'] > 0.1, 'color'] = 'red'
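The difference between the two forms can be checked on a toy table. This is a minimal sketch: the three-row DataFrame and its values are made up for illustration, and the warning filter is only there to keep the output quiet.

```python
import warnings

import pandas as pd

# Hypothetical toy data; only the column names come from the slide.
df = pd.DataFrame({
    "intensity": [0.05, 0.20, 0.50],
    "color": ["blue", "blue", "blue"],
})

# Chained assignment: the boolean filter returns a new object, so the
# write lands on that temporary copy and df itself is left unchanged.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence pandas' chained-assignment warning
    df[df["intensity"] > 0.1]["color"] = "red"
colors_after_chained = df["color"].tolist()

# A single .loc call selects rows and column together and writes into df.
df.loc[df["intensity"] > 0.1, "color"] = "red"
colors_after_loc = df["color"].tolist()

print(colors_after_chained)
print(colors_after_loc)
```

Only the .loc version changes the rows whose intensity exceeds 0.1.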
MIDTERM: STRUCTURE
50 points = 25% of the total grade
- 10 points: 10 True/False questions, 1 point each
- 10 points: 5 multiple choice questions, 2 points each
- 30 points: 10 short answer questions, 3 points each
Compared to the CMSC320 midterm I posted from last semester, this midterm is shorter.
MIDTERM: CHEAT SHEET
You can use a cheat sheet on the exam:
- Create it on your own
- Handwritten notes only
- One side of one 8.5x11 inch ("normal-sized") sheet of paper
- You'll turn in your cheat sheet with your midterm
QUICK MIDTERM REVIEW
As discussed in previous lectures and on Piazza, the midterm can cover:
- Up to and including last Wednesday's lecture (10/17)
- Quizzes that were due on or before last Wednesday
- Stuff that you should know from doing P1 and P2
Everything is online:
I know this is a lot of material. Rule of thumb: open up a slide deck. Do you feel "comfortable" with the material?
The test will be more qualitative than prior 1xx, 2xx, 3xx tests.
QUICK MIDTERM REVIEW
Data collection → Data processing → Exploratory analysis & data viz → Analysis, hypothesis testing, & ML → Insight & policy decision
Not exhaustive! Ask questions!
DATA COLLECTION (DC) & DATA PROCESSING (DP)
We talked about:
- Scraping data
- RESTful APIs
- Structured data formats (JSON, XML, etc.)
- Regexes
- Data manipulation via the NumPy stack (NumPy, pandas, etc.)
- Indexing, slicing, groups, joins, aggregate queries, etc.
- Tidy data + melting
- Version control (just know how this works qualitatively)
- RDBMS, a little bit of SQL
- Entity resolution & other data integration issues
- Storing stuff as a graph, and manipulating it
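As a quick refresher on two of the items above (tidy data + melting, and aggregate queries), here is a minimal pandas sketch. The table of quiz scores and all of its names are invented for illustration and do not come from the lecture.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per quiz.
wide = pd.DataFrame({
    "student": ["ana", "bo"],
    "quiz1": [8, 6],
    "quiz2": [9, 7],
})

# Melt to tidy form: one row per (student, quiz) observation.
tidy = wide.melt(id_vars="student", var_name="quiz", value_name="score")

# Aggregate query on the tidy table: mean score per student.
mean_by_student = tidy.groupby("student")["score"].mean()

print(tidy)
print(mean_by_student)
```

Once the data is in tidy form, grouping by any identifier column works the same way.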
DC: HTTP REQUESTS
?q=cmsc320&tbs=qdr:m
HTTP GET Request:
GET /?q=cmsc320&tbs=qdr:m HTTP/1.1
Host:
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20100101 Firefox/10.0.1
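The pieces of the request above can be assembled with Python's standard library. This is a sketch only: the host on the slide was not preserved, so example.com is a placeholder assumption, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request

# example.com is a placeholder; the slide's real host is not known here.
BASE_URL = "http://example.com/"
params = {"q": "cmsc320", "tbs": "qdr:m"}

# safe=":" keeps the colon literal so the query string matches the slide.
query = urlencode(params, safe=":")

# Building a Request object does not open a connection.
req = Request(
    f"{BASE_URL}?{query}",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) "
                           "Gecko/20100101 Firefox/10.0.1"},
)

# Reconstruct the first two lines of the raw HTTP message.
request_line = f"{req.get_method()} {req.selector} HTTP/1.1"
host_header = f"Host: {req.host}"

print(request_line)
print(host_header)
```

Passing req to urllib.request.urlopen would send it; that step is left out so the sketch runs offline.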