Brian Steele · John Chandler · Swarna Reddy

Algorithms for Data Science

Brian Steele
University of Montana
Missoula, MT, USA

John Chandler
School of Business Administration
University of Montana
Missoula, MT, USA

Swarna Reddy
SoftMath Consultants, LLC
Missoula, MT, USA

ISBN 978-3-319-45795-6
ISBN 978-3-319-45797-0 (eBook)
DOI 10.1007/978-3-319-45797-0
Library of Congress Control Number: 2016952812

© Springer International Publishing Switzerland 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Data science has been recognized as a science since roughly 2001. Its origin lies in technological advances that are generating nearly inconceivable volumes of data. The rate at which new data are being produced is not likely to slow for some time. As a society, we have realized that these data provide opportunities to learn about the systems and processes generating the data. But data in its original form is of relatively little value. Paradoxically, the more of it there is, the less its value. It has to be reduced to extract value from it. Extracting information from data is the subject of data science.

Becoming a successful practitioner of data science is a real challenge. The knowledge base incorporates demanding topics from statistics, computer science, and mathematics. On top of that, domain-specific knowledge, if not critical, is very helpful. Preparing students in these three or four areas is necessary. But at some point, the subject areas need to be brought together as a coherent package in what we consider to be a course in data science. A student who lacks a course that actually teaches data science is not well prepared to practice it.

This book serves as a backbone for a course that brings together the main subject areas. We’ve paid attention to the needs of employers with respect to entry-level data scientists—and what they say is lacking from the skills of these new data scientists. What is most lacking is programming ability. From the educators’ point of view, we want to teach principles and theory—the stuff that’s needed by students to learn on their own. We’re not going to be able to teach them everything they need in their careers, or even in the short term. But teaching principles and foundations is the best preparation for independent learning. Fortuitously, there is a subject that encompasses both principles and programming—algorithms. Therefore, this book has been written about the algorithms of data science.

Algorithms for Data Science focuses on the principles of data reduction and core algorithms for analyzing the data of data science. Understanding the fundamentals is crucial to being able to adapt existing algorithms and create new ones. The text provides many opportunities for the reader to develop and improve their programming skills. Every algorithm discussed at length is accompanied by a tutorial that guides the reader through implementation of the algorithm in either Python or R. The algorithm is then applied to a real-world data set. Using real data allows us to talk about domain-specific problems. Regrettably, our self-imposed coding edict eliminates some important predictive analytic algorithms because of their complexity.
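
To give a flavor of that style, the following minimal Python sketch (illustrative only, not drawn from the book's tutorials) reduces a hypothetical pipe-delimited file of itemized contributions to a dictionary of totals keyed by contributor, in the spirit of the data reduction and data dictionaries introduced in Chapters 1 and 2. The file name, delimiter, and field layout are assumptions made for the example.

```python
# Illustrative sketch of dictionary-based data reduction.
# Assumed input: a pipe-delimited text file with contributor name in the first
# field and a contribution amount in the second field (hypothetical layout).

contributions = {}  # key: contributor name, value: running total of amounts

with open('itemized_contributions.txt', encoding='utf-8') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')        # one record per line
        name, amount = fields[0], float(fields[1])   # assumed field positions
        contributions[name] = contributions.get(name, 0.0) + amount

# The reduced dictionary is small relative to the raw file and easy to query,
# for example, listing the ten largest contributors.
top10 = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)[:10]
for name, total in top10:
    print(f'{name}: {total:,.2f}')
```

Because the dictionary grows with the number of distinct contributors rather than with the number of records, the reduced structure stays manageable even when the raw file is very large.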

We have two audiences in mind. One audience is practitioners of data science and the allied areas of statistics, mathematics, and computer science. This audience would read the book if they have an interest in improving their analytical skills, perhaps with the objective of working as a data scientist. The second audience is upper-division undergraduate and graduate students in data science, business analytics, mathematics, statistics, and computer science. This audience would be engaged in a course on data analytics or in self-study.

Depending on the sophistication of the audience, the book may be used for a one- or two-semester course on data analytics. If used for a one-semester course, the instructor has several options regarding the course content. All options begin with Chaps. 1 and 2 so that the concepts of data reduction and data dictionaries are firmly established.

1. If the instructional emphasis is on computation, then Chaps. 3 and 4 on methods for massively large data and distributed computing would be covered. Chapter 12 works with streaming data, and so this chapter is a nice choice to close the course. Chapter 7 on healthcare analytics is optional and might be covered as time allows. The tutorials of Chap. 7 involve relatively large and challenging data sets; these data sets provide the student and instructor with many opportunities for interesting projects.

2. A course oriented toward general analytical methods might pass over Chaps. 3 and 4 in favor of data visualization (Chap. 5) and linear regression (Chap. 6). The course could close with Chap. 9 on k-nearest neighbor prediction functions and Chap. 11 on forecasting.

3. A course oriented toward predictive analytics would focus on Chaps. 9 and 10 on k-nearest neighbor and naïve Bayes prediction functions. The course would close with Chaps. 11 and 12 on forecasting and streaming data.

Acknowledgments

We thank Brett Kassner, Jason Kolberg, and Greg St. George for reviewing chapters, and Guy Shepard for help solving hardware problems and unraveling network mysteries. Many thanks to Alex Philp for anticipating the future and breaking trail. We thank Leonid Kalachev and Peter Golubstov for many interesting conversations and insights.

Missoula, MT, USA        Brian Steele
Missoula, MT, USA        John Chandler
Missoula, MT, USA        Swarna Reddy

Contents

1 Introduction
   1.1 What Is Data Science?
   1.2 Diabetes in America
   1.3 Authors of the Federalist Papers
   1.4 Forecasting NASDAQ Stock Prices
   1.5 Remarks
   1.6 The Book
   1.7 Algorithms
   1.8 Python
   1.9 R
   1.10 Terminology and Notation
        1.10.1 Matrices and Vectors
   1.11 Book Website

Part I Data Reduction

2 Data Mapping and Data Dictionaries
   2.1 Data Reduction
   2.2 Political Contributions
   2.3 Dictionaries
   2.4 Tutorial: Big Contributors
   2.5 Data Reduction
        2.5.1 Notation and Terminology
        2.5.2 The Political Contributions Example
        2.5.3 Mappings
   2.6 Tutorial: Election Cycle Contributions
   2.7 Similarity Measures
        2.7.1 Computation
   2.8 Tutorial: Computing Similarity
   2.9 Concluding Remarks About Dictionaries
   2.10 Exercises
        2.10.1 Conceptual
        2.10.2 Computational

3 Scalable Algorithms and Associative Statistics
   3.1 Introduction
   3.2 Example: Obesity in the United States
   3.3 Associative Statistics
   3.4 Univariate Observations
        3.4.1 Histograms
        3.4.2 Histogram Construction
   3.5 Functions
   3.6 Tutorial: Histogram Construction
        3.6.1 Synopsis
   3.7 Multivariate Data
        3.7.1 Notation and Terminology
        3.7.2 Estimators
        3.7.3 The Augmented Moment Matrix
        3.7.4 Synopsis
   3.8 Tutorial: Computing the Correlation Matrix
        3.8.1 Conclusion
   3.9 Introduction to Linear Regression
        3.9.1 The Linear Regression Model
        3.9.2 The Estimator of β
        3.9.3 Accuracy Assessment
        3.9.4 Computing R²_adjusted
   3.10 Tutorial: Computing β̂
        3.10.1 Conclusion
   3.11 Exercises
        3.11.1 Conceptual
        3.11.2 Computational

4 Hadoop and MapReduce
   4.1 Introduction
   4.2 The Hadoop Ecosystem
        4.2.1 The Hadoop Distributed File System
        4.2.2 MapReduce
        4.2.3 Mapping
        4.2.4 Reduction
   4.3 Developing a Hadoop Application
   4.4 Medicare Payments
   4.5 The Command Line Environment
   4.6 Tutorial: Programming a MapReduce Algorithm
        4.6.1 The Mapper
        4.6.2 The Reducer
        4.6.3 Synopsis
   4.7 Tutorial: Using Amazon Web Services
        4.7.1 Closing Remarks
   4.8 Exercises
        4.8.1 Conceptual
        4.8.2 Computational

Part II Extracting Information from Data

5 Data Visualization
   5.1 Introduction
   5.2 Principles of Data Visualization
   5.3 Making Good Choices
        5.3.1 Univariate Data
        5.3.2 Bivariate and Multivariate Data
   5.4 Harnessing the Machine
        5.4.1 Building Fig. 5.2
        5.4.2 Building Fig. 5.3
        5.4.3 Building Fig. 5.4
        5.4.4 Building Fig. 5.5
        5.4.5 Building Fig. 5.8
        5.4.6 Building Fig. 5.10
        5.4.7 Building Fig. 5.11
   5.5 Exercises

6 Linear Regression Methods
   6.1 Introduction
   6.2 The Linear Regression Model
        6.2.1 Example: Depression, Fatalism, and Simplicity
        6.2.2 Least Squares
        6.2.3 Confidence Intervals
        6.2.4 Distributional Conditions
        6.2.5 Hypothesis Testing
        6.2.6 Cautionary Remarks
   6.3 Introduction to R
   6.4 Tutorial: R
        6.4.1 Remark
   6.5 Tutorial: Large Data Sets and R
   6.6 Factors
        6.6.1 Interaction
        6.6.2 The Extra Sums-of-Squares F-test
   6.7 Tutorial: Bike Share
        6.7.1 An Incongruous Result
   6.8 Analysis of Residuals
        6.8.1 Linearity
        6.8.2 Example: The Bike Share Problem
        6.8.3 Independence
   6.9 Tutorial: Residual Analysis
        6.9.1 Final Remarks
   6.10 Exercises
        6.10.1 Conceptual
        6.10.2 Computational

7 Healthcare Analytics
   7.1 Introduction
   7.2 The Behavioral Risk Factor Surveillance System
        7.2.1 Estimation of Prevalence
        7.2.2 Estimation of Incidence
   7.3 Tutorial: Diabetes Prevalence and Incidence
   7.4 Predicting At-Risk Individuals
        7.4.1 Sensitivity and Specificity
   7.5 Tutorial: Identifying At-Risk Individuals
   7.6 Unusual Demographic Attribute Vectors
   7.7 Tutorial: Building Neighborhood Sets
        7.7.1 Synopsis
   7.8 Exercises
        7.8.1 Conceptual
        7.8.2 Computational

8 Cluster Analysis
   8.1 Introduction
   8.2 Hierarchical Agglomerative Clustering
   8.3 Comparison of States
   8.4 Tutorial: Hierarchical Clustering of States
        8.4.1 Synopsis
   8.5 The k-Means Algorithm
   8.6 Tutorial: The k-Means Algorithm ...