2016_04_CME250A_2_1 - CME250A A Short Course On Big Data...


Spring 2016 CME250A
CME250A: A Short Course On Big Data And Machine Learning
Cliff Click
[email protected]
cliffc.org/blog

Types of Data
  • Small vs Big
    • “More than one machine” or “not on a flash stick”
    • Basic copy & inspect operations are painfully slow
    • Down-sample to do any work?
  • Sparse vs Dense
    • Most values are missing; text analytics and health
  • Tall-skinny vs short-fat
    • Different algorithms apply
Time to Solution
  • Running time matters, constant factors matter
  • Small data: any tool on any machine works
    • Pick your favorite, comfortable working environment
  • Big data: high n and/or high p
    • Algorithms blow out time or memory
    • Compromise the data science to just do something
  • Ex: GLM; n rows x p features → $O(np^2 + p^3)$
    • Free at p=100, one second at p=1000, 10 minutes at p=10000, one year at p=100000
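The GLM estimate on this slide can be turned into a rough scaling calculation. The sketch below is only an illustration: it counts abstract operations as n*p^2 + p^3, the row count `N_ROWS` is a hypothetical value, and the function name `glm_ops` is made up. It says nothing about wall-clock time, which depends on the constant factors the slide warns about.

```python
def glm_ops(n: int, p: int) -> int:
    """Abstract operation count for the slide's O(n*p^2 + p^3) estimate."""
    return n * p**2 + p**3


# Hypothetical row count, chosen only for illustration.
N_ROWS = 10_000_000

baseline = glm_ops(N_ROWS, 100)
for p in (100, 1_000, 10_000, 100_000):
    ops = glm_ops(N_ROWS, p)
    # Report the count and how much it grew relative to p=100.
    print(f"p={p:>7,}  ops={ops:.2e}  growth vs p=100: {ops / baseline:,.0f}x")
```

Each tenfold increase in p multiplies the p^3 term by 1000 and the n*p^2 term by 100, so wide data becomes expensive far faster than tall data.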

Recap Data Science Basics
  • Load the Data
    • Typically fails first few times; needs filtering, cleanup
  • Build a Model: a mathematical estimate of Reality
  • Check model quality
    • Typically crap: overfits or underfits badly
    • e.g. predicting via name doesn't help future predictions
  • Munge for new features and toss out bad ones
  • Repeat till Model is good enough
  • Use the Model to predict Reality
    • Perhaps change the Future?
  • Lather Rinse Repeat
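The "predicting via name" remark can be made concrete with a toy quality check. The sketch below uses made-up data and hypothetical helper names (`fit_by_name`, `fit_line`, `mae`); it compares a model that memorizes each name against a one-feature least-squares line, scoring both on the training rows and on held-out rows with unseen names.

```python
from statistics import mean

# Made-up toy rows: (name, hours_studied, score). Names act as unique IDs.
train = [("ann", 1.0, 52.0), ("bob", 2.0, 61.0), ("cat", 3.0, 69.0), ("dan", 4.0, 80.0)]
holdout = [("eve", 1.5, 57.0), ("fay", 3.5, 75.0)]


def fit_by_name(rows):
    """Memorize each name's score; unseen names fall back to the training mean."""
    table = {name: y for name, _, y in rows}
    fallback = mean(y for _, _, y in rows)
    return lambda name, x: table.get(name, fallback)


def fit_line(rows):
    """Ordinary least squares on the single numeric feature."""
    xs = [x for _, x, _ in rows]
    ys = [y for _, _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda name, x: intercept + slope * x


def mae(model, rows):
    """Mean absolute error of a model over the given rows."""
    return mean(abs(model(name, x) - y) for name, x, y in rows)


by_name = fit_by_name(train)
line = fit_line(train)
for label, model in (("by name", by_name), ("line", line)):
    print(f"{label:8s} train MAE={mae(model, train):.2f}  holdout MAE={mae(model, holdout):.2f}")
```

The memorizing model is perfect on the rows it has seen and no better than a constant guess on new names, which is the overfitting pattern the quality check is meant to catch.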
Citibikes
  • Rent a bike across town
  • 340 stations; pay by the hour, day, or year
  • At end of day, bikes are all shuffled
    • Need to rebalance bikes-per-station
  • Predict: ride-starts per day per station
  • Goal: inform rebalance crews
  • Data: 10M rides, 2 GB CSV
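The prediction target, ride-starts per day per station, can be built from the raw rides in one streaming pass. This is a minimal sketch under stated assumptions: the column names `starttime` and `start_station_id` are hypothetical, start times are taken to be milliseconds since the Unix epoch (as the first-peek slide notes), and days are cut at UTC midnight rather than local time. The sample rows are made up.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


def ride_starts_per_day(csv_file):
    """Count ride starts per (UTC date, start station) without loading the file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Assumption: starttime holds milliseconds since the Unix epoch.
        start = datetime.fromtimestamp(int(row["starttime"]) / 1000, tz=timezone.utc)
        counts[(start.date().isoformat(), row["start_station_id"])] += 1
    return counts


# Tiny made-up sample standing in for the 2 GB file.
sample = io.StringIO(
    "starttime,start_station_id\n"
    "1380585600000,72\n"
    "1380589200000,72\n"
    "1380592800000,79\n"
    "1380672000000,72\n"
)

counts = ride_starts_per_day(sample)
for (day, station), n in sorted(counts.items()):
    print(day, station, n)
```

Because only the counter is held in memory, the same loop works on a real file opened with `open(path, newline="")`; memory grows with the number of (day, station) pairs, not with the number of rides.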

Demo - Citibikes
  • Size check; launch IDE; load; check summary
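For the "check summary" step, one way to summarize a column too large to inspect directly is a single-pass running summary. The sketch below uses Welford's online algorithm, which is a swapped-in technique for illustration and not necessarily what the demo tool does; the class name `RunningStats` and the sample ages are made up.

```python
import math


class RunningStats:
    """Welford's online algorithm: count, min, max, mean and sigma in one pass."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def sigma(self) -> float:
        """Sample standard deviation (n - 1 in the denominator)."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


# Made-up rider ages standing in for one column of the ride file.
ages = RunningStats()
for age in [25, 31, 38, 44, 52]:
    ages.add(age)

print(f"n={ages.n} min={ages.min} max={ages.max} mean={ages.mean:.1f} sigma={ages.sigma:.1f}")
```

A summary like this is enough to spot insane values (negative ages, absurd maxima) and to see how wide the spread is without ever holding the column in memory.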
Citibikes: first peek
  • 10M rows: too big; must use summaries
  • Check age; see sane ages & sigma
  • Check trip duration; again sane times; large sigma
  • Note: starttime in msec since Unix Epoch
  • Co-linear: station-id, name, long, lat
    • Same data, but perhaps scaled and shifted
    • No net new info, does not help learning, cost to keep
    • Only need 1 start and 1 stop column
    • But maybe want long/lat later for distance feature?
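The redundancy among station-id, name, long and lat can be checked mechanically. The sketch below tests for a one-to-one mapping between columns, which is a simpler stand-in for a true collinearity (linear correlation) test; the helper names, column names and row values are all made up for illustration.

```python
def determines(rows, key_col, other_col):
    """True if every value of key_col maps to exactly one value of other_col."""
    seen = {}
    for row in rows:
        k, v = row[key_col], row[other_col]
        # setdefault stores the first value seen for k and returns it afterwards.
        if seen.setdefault(k, v) != v:
            return False
    return True


def redundant_with(rows, key_col, candidates):
    """Candidate columns in one-to-one correspondence with key_col."""
    return [
        c for c in candidates
        if determines(rows, key_col, c) and determines(rows, c, key_col)
    ]


# Made-up rows with hypothetical column names.
rides = [
    {"station_id": 72, "station_name": "Station A", "lat": 40.767, "lon": -73.994, "duration": 634},
    {"station_id": 79, "station_name": "Station B", "lat": 40.719, "lon": -74.007, "duration": 1547},
    {"station_id": 72, "station_name": "Station A", "lat": 40.767, "lon": -73.994, "duration": 178},
    {"station_id": 79, "station_name": "Station B", "lat": 40.719, "lon": -74.007, "duration": 634},
]

redundant = redundant_with(rides, "station_id", ["station_name", "lat", "lon", "duration"])
print("carry no new info beyond station_id:", redundant)
```

Columns reported here are candidates to drop before modeling; as the slide notes, long/lat may still be worth keeping aside if a distance feature is wanted later.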