Introduction to Machine Learning with R
Rigorous Mathematical Analysis
by Scott V. Burger

Beijing, Boston, Farnham, Sebastopol, Tokyo

Copyright © 2018 Scott Burger. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles ( ). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]

Editors: Rachel Roumeliotis and Heather Scherer
Production Editor: Kristen Brown
Copyeditor: Bob Russell, Octal Publishing, Inc.
Proofreader: Jasmine Kwityn
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

March 2018: First Edition

Revision History for the First Edition
2018-03-08: First Release

See for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Introduction to Machine Learning with R, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-97644-9 [LSI]

Table of Contents

Preface

1. What Is a Model?
    Algorithms Versus Models: What’s the Difference?
    A Note on Terminology
    Modeling Limitations
    Statistics and Computation in Modeling
    Data Training
    Cross-Validation
    Why Use R?
    The Good
    R and Machine Learning
    The Bad
    Summary

2. Supervised and Unsupervised Machine Learning
    Supervised Models
    Regression
    Training and Testing of Data
    Classification
    Logistic Regression
    Supervised Clustering Methods
    Mixed Methods
    Tree-Based Models
    Random Forests
    Neural Networks
    Support Vector Machines
    Unsupervised Learning
    Unsupervised Clustering Methods
    Summary

3. Sampling Statistics and Model Training in R
    Bias
    Sampling in R
    Training and Testing
    Roles of Training and Test Sets
    Why Make a Test Set?
    Training and Test Sets: Regression Modeling
    Training and Test Sets: Classification Modeling
    Cross-Validation
    k-Fold Cross-Validation
    Summary

4. Regression in a Nutshell
    Linear Regression
    Multivariate Regression
    Regularization
    Polynomial Regression
    Goodness of Fit with Data—The Perils of Overfitting
    Root-Mean-Square Error
    Model Simplicity and Goodness of Fit
    Logistic Regression
    The Motivation for Classification
    The Decision Boundary
    The Sigmoid Function
    Binary Classification
    Multiclass Classification
    Logistic Regression with Caret
    Summary
    Linear Regression
    Logistic Regression

5. Neural Networks in a Nutshell
    Single-Layer Neural Networks
    Building a Simple Neural Network by Using R
    Multiple Compute Outputs
    Hidden Compute Nodes
    Multilayer Neural Networks
    Neural Networks for Regression
    Neural Networks for Classification
    Neural Networks with caret
    Regression
    Classification
    Summary

6. Tree-Based Methods
    A Simple Tree Model
    Deciding How to Split Trees
    Tree Entropy and Information Gain
    Pros and Cons of Decision Trees
    Tree Overfitting
    Pruning Trees
    Decision Trees for Regression
    Decision Trees for Classification
    Conditional Inference Trees
    Conditional Inference Tree Regression
    Conditional Inference Tree Classification
    Random Forests
    Random Forest Regression
    Random Forest Classification
    Summary

7. Other Advanced Methods
    Naive Bayes Classification
    Bayesian Statistics in a Nutshell
    Application of Naive Bayes
    Principal Component Analysis
    Linear Discriminant Analysis
    Support Vector Machines
    k-Nearest Neighbors
    Regression Using kNN
    Classification Using kNN
    Summary

8. Machine Learning with the caret Package
    The Titanic Dataset
    Data Wrangling
    caret Unleashed
    Imputation
    Data Splitting
    caret Under the Hood
    Model Training
    Comparing Multiple caret Models
    Summary

A. Encyclopedia of Machine Learning Models in caret

Index

Preface

In this short introduction, I tackle a few key points.

Who Should Read This Book?

This book is ideally suited for people who have some working knowledge of the R programming language. If you don’t have any knowledge of R, it’s an easy enough language to pick up, and the code is readable enough that you can pretty much get the gist of the code examples herein.

Scope of the Book

This book is an introductory text, so we don’t dive deeply into the mathematical underpinnings of every algorithm covered. Presented here are enough of the details for you to discern the difference between a neural network and, say, a random forest at a high level.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.
O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit .

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at .

To comment or ask technical questions about this book, send email to bookques‐ [email protected]

For more information about our books, courses, conferences, and news, see our website at .

Find us on Facebook:
Follow us on Twitter:
Watch us on YouTube:

Acknowledgments

It’s always been a dream of mine to write a book. When I was in third or fourth grade, my ideal book to write would have been a talk show hosted by my stuffed-animal collection. I never thought at the time that I would develop the skills to one day be shedding light on the complex world of machine learning.
Between then and now, so many things have happened that I need to take a moment to thank some people who have made this book possible in more ways than one: Allison Randal, Amanda Harris, Cristiano Sabiu, Dorothy Duffy, Elayne Britain, Filipe Abdalla, Heather Scherer, Ian Furniss, Kristen Brown, Kristen Larson, Marie Beaugureau, Max Winderbaum, Myrna Fant, Richard Fant, Robert Lippens, Will Wright, and Woody Ciskowski.

CHAPTER 1
What Is a Model?

There was a time in my undergraduate physics studies when I was excited to learn what a model was. I remember the scene pretty well. We were in a Stars and Galaxies class, getting ready to learn about atmospheric models that could be applied not only to the Earth, but to other planets in the solar system as well. I knew enough about climate models to know they were complicated, so I braced myself for an onslaught of math that would take me weeks to parse. When we finally got to the meat of the subject, I was kind of let down: I had already dealt with data models in the past and hadn’t even realized!

Because models are a fundamental aspect of machine learning, perhaps it’s not surprising that this story mirrors how I learned to understand the field of machine learning. During my graduate studies, I was on the fence about going into the financial industry. I had heard that machine learning was being used extensively in that world, and, as a lowly physics major, I felt like I would need to be more of a computational engineer to compete. I came to a similar realization: not only was machine learning not as scary a subject as I originally thought, but I had indeed been using it before. Since before high school, even!

Models are helpful because unlike dashboards, which offer a static picture of what the data shows currently (or at a particular slice in time), models can go further and help you understand the future.
For example, someone who is working on a sales team might only be familiar with reports that show a static picture. Maybe their screen is always up to date with what the daily sales are. There have been countless dashboards that I’ve seen and built that simply say “this is how many assets are in right now.” Or, “this is what our key performance indicator is for today.” A report is a static entity that doesn’t offer an intuition as to how it evolves over time.

Figure 1-1 shows what a report might look like:

    op <- par(mar = c(10, 4, 4, 2) + 0.1)  # widen the bottom margin to fit the car names
    barplot(mtcars$mpg, names.arg = row.names(mtcars), las = 2,
            ylab = "Fuel Efficiency in Miles per Gallon")
    par(op)  # restore the previous graphics settings

Figure 1-1. A distribution of vehicle fuel efficiency based on the built-in mtcars dataset found in R

Figure 1-1 depicts a plot of the mtcars dataset that comes prebuilt with R. The figure shows a number of cars plotted by their fuel efficiency in miles per gallon. This report isn’t very interesting. It doesn’t give us any predictive power. Seeing how the efficiency of the cars is distributed is nice, but how can we relate that to other things in the data and, moreover, make predictions from it?

A model is any sort of function that has predictive power.

So how do we turn this boring report into something more useful? How do we bridge the gap between reporting and machine learning? Oftentimes the correct answer to this is “more data!” That can come in the form of more observations of the same data or by collecting new types of data that we can then use for comparison.

Let’s take a look at the built-in mtcars dataset that comes with R in more detail:
    head(mtcars)
    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

By just calling the built-in object of mtcars within R, we can see all sorts of columns in the data from which to choose to build a machine learning model. In the machine learning world, columns of data are sometimes also called features. Now that we know what we have to work with, we could try seeing if there’s a relationship between the car’s fuel efficiency and any one of these features, as depicted in Figure 1-2:

    pairs(mtcars[1:7], lower.panel = NULL)

Figure 1-2. A pairs plot of the mtcars dataset, focusing on the first seven columns

Each box is its own separate plot, for which the independent variable (x-axis) is the text box at the bottom of the column, and the dependent variable (y-axis) is the text box at the beginning of the row. Some of these plots are more interesting for trending purposes than others. None of the plots in the cyl row, for example, look like they lend themselves easily to simple regression modeling.

In this example, we are plotting some of those features against others.
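As a rough numeric complement to the pairs plot, one could also compute how strongly mpg is linearly associated with each of the other plotted features. This is only a sketch and not part of the original walkthrough: the names `first_seven` and `mpg_cor` are made up for illustration, and Pearson correlation captures only straight-line relationships.

```r
# Illustrative sketch; `first_seven` and `mpg_cor` are hypothetical names.
first_seven <- mtcars[, 1:7]                 # the same seven columns passed to pairs()
mpg_cor <- cor(first_seven)["mpg", ]         # Pearson correlation of mpg with each column
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]  # drop the trivial mpg-versus-mpg entry

# Show the features ordered by strength of linear association with mpg
print(round(mpg_cor[order(abs(mpg_cor), decreasing = TRUE)], 3))
```

Features whose correlation with mpg is far from zero are reasonable first candidates for a simple regression, although a scatterplot is still worth checking before trusting the number.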
The columns, or features, of this data are defined as follows:

mpg
    Miles per US gallon
cyl
    Number of cylinders in the car’s engine
disp
    The engine’s displacement (or volume) in cubic inches
hp
    The engine’s horsepower
drat
    The vehicle’s rear axle ratio
wt
    The vehicle’s weight in thousands of pounds
qsec
    The vehicle’s quarter-mile race time
vs
    The vehicle’s engine cylinder configuration, where “V” is for a v-shaped engine and “S” is for a straight, inline design (coded in the data as 0 for V-shaped and 1 for straight)
am
    The transmission of the vehicle, where 0 is an automatic transmission and 1 is a manual transmission
gear
    The number of gears in the vehicle’s transmission
carb
    The number of carburetors used by the vehicle’s engine

You can read the upper-right plot as “mpg as a function of quarter-mile-time,” for example. Here we are mostly interested in something that looks like it might have some kind of quantifiable relationship. It is up to the investigator to pick out which patterns look interesting. Note that “mpg as a function of cyl” looks very different from “mpg as a function of wt.” In this case, we focus on the latter, as shown in Figure 1-3:

    plot(y = mtcars$mpg, x = mtcars$wt, xlab = "Vehicle Weight",
         ylab = "Vehicle Fuel Efficiency in Miles per Gallon")

Figure 1-3. This plot is the basis for drawing a regression line through the data

Now we have a more interesting kind of dataset. We still have our fuel efficiency, but now it is plotted against the weight of the respective cars in thousands of pounds. From data in this format, we can extract a best fit to all the data points and turn this plot into an equation.
We’ll cover this in more detail in later chapters, but we use a function in R to model the value we’re interested in, called a response, against other features in our dataset:

    mt.model <- lm(formula = mpg ~ wt, data = mtcars)
    coef(mt.model)[2]
    ##        wt
    ## -5.344472
    coef(mt.model)[1]
    ## (Intercept)
    ##    37.28513

In this code chunk, we modeled the vehicle’s fuel efficiency (mpg) as a function of the vehicle’s weight (wt) and extracted values from that model object to use in an equation that we can write as follows:

    Fuel Efficiency = -5.344 × Vehicle Weight + 37.285

Now if we wanted to know what the fuel efficiency was for any car, not just those in the dataset, all we would need to input is its weight, and we get a result. This is the benefit of a model. We have predictive power: given some kind of input (e.g., weight), the model can give us a value for any number we put in.

The model might have its limitations, but this is one way in which we can help to expand the data beyond a static report into something more flexible and more insightful.

A given vehicle’s weight might not actually be predictive of the fuel efficiency as given by the preceding equation. There might be some error in the data or the observation.

You might have come across this kind of modeling procedure before in dealing with the world of data. If you have, congratulations: you have been doing machine learning without even knowing it! This particular type of machine learning model is called linear regression. It’s much simpler than some other machine learning models like neural networks, but the algorithms that make it work are certainly using machine learning principles.

Algorithms Versus Models: What’s the Difference?

Machine learning and algorithms can hardly be separated.
Algorithms are another subject that can seem impenetrably daunting at first, but they are actually quite simple at their core, and you have probably been using them for a long time without realizing it.

An algorithm is a set of steps performed in order.

That’s all an algorithm is. The algorithm for putting on your shoes might be something like putting your toes in the open part of the shoe, and then pressing your foot forward and your heel downward. The set of steps necessary to produce a machine learning algorithm is more complicated than designing an algorithm for putting on your shoes, of course, but one of the goals of this book is to explain the inner workings of the most widely used machine learning models in R by helping to simplify their algorithmic processes.

The simplest algorithm for linear regression involves putting two points on a plot and then drawing a line between them. You get the important parts of the equation (slope and intercept) by taking the difference in the coordinates of those points with respect to some origin. The algorithm becomes more complicated when you try to do the same procedure for more than two points, however. That process involves more equations that can be tedious for a human to compute by hand but very easy for a processor in a computer to handle in microseconds.

A machine learning model like regression or clustering or neural networks relies on the workings of algorithms to help it run in the first place. Algorithms are the engine that underlies the simple R code that we run. They do all the heavy lifting of multiplying matrices, optimizing results, and outputting a number for us to use.

There are many types of models in R, which span an entire ecosystem of machine learning more generally. There are three major types of models: regression models, classification models, and mixed models that are a combination of both. We’ve already encountered a regression model.
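The two-point procedure described earlier in this section, and its extension to many points, can be sketched roughly as follows. The helper name `line_through` and the choice of the two cars are assumptions made purely for illustration. The many-point version uses the standard least-squares formulas for a single predictor, which should agree with the coefficients that lm() reports.

```r
# Illustrative sketch; `line_through` is a hypothetical helper, not from the book.
# Slope and intercept of the straight line through two (x, y) points.
line_through <- function(x1, y1, x2, y2) {
  slope <- (y2 - y1) / (x2 - x1)
  intercept <- y1 - slope * x1
  c(intercept = intercept, slope = slope)
}

# Two cars picked arbitrarily: x is weight (wt), y is fuel efficiency (mpg)
two_cars <- mtcars[c("Mazda RX4", "Hornet Sportabout"), ]
two_point_fit <- line_through(two_cars$wt[1], two_cars$mpg[1],
                              two_cars$wt[2], two_cars$mpg[2])

# With more than two points, least squares gives the best-fit line:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
ls_slope <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)
ls_intercept <- mean(mtcars$mpg) - ls_slope * mean(mtcars$wt)
least_squares_fit <- c(intercept = ls_intercept, slope = ls_slope)

# Coefficients computed by R's own linear model function, for comparison
lm_fit <- coef(lm(mpg ~ wt, data = mtcars))

print(two_point_fit)
print(least_squares_fit)
print(lm_fit)
```

The two-point line depends entirely on which two cars are chosen, whereas the least-squares line uses every observation. That difference is one way to see why the many-point case needs more computation.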
A classification model is different in that we would be trying to take input data and arrange it according to a type, class, group, or other discrete output. Mixed models might start with a regression model and then use the output from that to help it classify other types of data. The reverse could be true for other mixed models. The function call for a simple linear regression in R can be writ...