
Tue Herlau, Mikkel N. Schmidt and Morten Mørup
Introduction to Machine Learning and Data Mining
Lecture notes, Fall 2019, version 1.3
This document may not be redistributed. All rights belong to the authors and DTU.
September 24, 2019
Technical University of Denmark

Notation cheat sheet

Matlab var.      Type        Size    Description
X                Numeric     N × M   Data matrix: the rows correspond to N data objects, each of which contains M attributes.
attributeNames   Cell array  M × 1   Attribute names: name (string) for each of the M attributes.
N                Numeric     Scalar  Number of data objects.
M                Numeric     Scalar  Number of attributes.

Classification
y                Numeric     N × 1   Class index: for each data object, y contains a class index, y_n ∈ {0, 1, ..., C − 1}, where C is the total number of classes.
classNames       Cell array  C × 1   Class names: name (string) for each of the C classes.
C                Numeric     Scalar  Number of classes.

Regression
y                Numeric     N × 1   Dependent variable (output): for each data object, y contains an output value that we wish to predict.

Cross-validation
?train, ?test    —           —       All variables mentioned above appended with "train" or "test" represent the corresponding variable for the training set (training data) or the test set (test data).

This book attempts to give a concise introduction to machine-learning concepts. We believe this is best accomplished by clearly stating what a given method actually does as a sequence of mathematical operations, and by using illustrations and text to provide intuition. We will therefore make use of tools from linear algebra, probability theory and analysis to describe the methods, focusing on as small a set of concepts as possible and striving for maximal consistency.

In the following, vectors will be denoted by lower-case roman letters x, y, ... and matrices by bold, upper-case roman letters A, B, .... A superscript T denotes the transpose. For instance

    A = \begin{bmatrix} -1 & 0 & 2 \\ 1 & 1 & -2 \end{bmatrix}, \quad \text{and if } x = \begin{bmatrix} -1 \\ 4 \\ 1 \end{bmatrix} \text{ then } x^T = \begin{bmatrix} -1 & 4 & 1 \end{bmatrix}.

The ith element of a vector is written as x_i and the i,j'th element of a matrix as A_{ij} (and sometimes A_{i,j} to avoid ambiguity). In the preceding example, x_2 = 4 and A_{2,3} = −2.

During this course the observed dataset, which we feed into our machine learning methods, will consist of N observations, each of which is an M-dimensional vector. For instance, if we have N observations x_1, ..., x_N, then any given observation consists of M numbers:

    x = \begin{bmatrix} x_1 & \dots & x_M \end{bmatrix}^T.

For convenience, we will often combine the observations into an N × M data matrix X,

    X = \begin{bmatrix} x_1^T \\ \vdots \\ x_N^T \end{bmatrix},

in which the ith row of X corresponds to the row vector x_i^T. We will use this notation for our data matrix: the N rows of X correspond to the N observations, and the M columns of X correspond to the M attributes. Often each of the observations x_i will come with a label or target y_i, corresponding to a feature of x_i which we are interested in predicting. In this case we collect the labels in an N-dimensional vector y, and the pair (X, y) is all the data available to the machine learning method. A more comprehensive translation of the notation used in this book and in the exercises can be found in the table above.

Finally, the reader should be familiar with big-sigma notation, which allows us to conveniently write sums and products of multiple terms:

    \sum_{i=1}^{n} f(i) = f(1) + f(2) + \cdots + f(n-1) + f(n),
    \prod_{i=1}^{n} f(i) = f(1) \times f(2) \times \cdots \times f(n-1) \times f(n).
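To make the notation concrete, the following short Python/NumPy sketch (not part of the original notes) ties the cheat-sheet variables and the indexing conventions together. The variable names mirror the Matlab names in the table above, but the NumPy translation, the toy values and the helper f are illustrative assumptions only; note also that the mathematical notation is 1-indexed while NumPy arrays are 0-indexed.

```python
import numpy as np

# A small made-up dataset in the standard format: N = 4 observations (rows),
# M = 3 attributes (columns). Names mirror the notation cheat sheet.
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.3, 3.3, 6.0],
              [5.8, 2.7, 5.1]])
N, M = X.shape                          # number of observations, number of attributes
attributeNames = ["attr1", "attr2", "attr3"]

# Classification labels: one class index per observation, y_n in {0, ..., C-1}.
y = np.array([0, 0, 1, 1])
classNames = ["class A", "class B"]
C = len(classNames)

# The matrix/vector example from the text (1-indexed in the text, 0-indexed here).
A = np.array([[-1, 0,  2],
              [ 1, 1, -2]])
x = np.array([-1, 4, 1])
assert x[1] == 4                        # x_2 = 4
assert A[1, 2] == -2                    # A_{2,3} = -2
A_T = A.T                               # superscript T denotes the transpose (here 3 x 2)

# The i'th row of X is the observation x_i^T.
x_1 = X[0]

# Big-sigma / big-pi notation; these two lines reproduce the worked example below.
f = lambda i: i ** 2
n = 4
print(sum(f(i) for i in range(1, n + 1)))        # 30
print(np.prod([f(i) for i in range(1, n + 1)]))  # 576
```

The pair (X, y) is then the full dataset handed to a machine learning method, and variables suffixed with train or test (e.g. X_train, y_test) denote the corresponding training and test splits.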
As an example, if f(i) = i^2 and n = 4 we have

    \sum_{i=1}^{4} f(i) = 1^2 + 2^2 + 3^2 + 4^2 = 30, \qquad \prod_{i=1}^{4} f(i) = 1^2 \times 2^2 \times 3^2 \times 4^2 = 576.

Course Reading Guide

It is our experience that when students have difficulties understanding a topic of this course, probability theory is most often the culprit. One reason for this is that probability theory can be notationally challenging. For instance,

    P(X = x | Y = y)    (0.1)

is, all else being equal, a fairly unusual way to use an equality sign. Another reason is that many students last encountered probability theory in conjunction with an introductory statistics course, where the main theorems are presented using the notation of stochastic variables and measure spaces. Especially when such a course has an applied focus, there is a tendency that terms such as stochastic variables end up playing a mnemonic role, i.e. as a shorthand for which theorems or rules are supposed to be used in a given situation. This makes it difficult for students to map their notation onto probabilistic primitives such as events, in particular for multivariate distributions. To overcome these problems, both chapter 5 and chapter 6 will be concerned with probability theory. The idea is to provide a ground-up introduction to probability theory with a focus on distributions that can be represented using well-behaved density functions. We advise the reader to make absolutely sure he or she understands the definitions in the green boxes in these chapters.

The disadvantage of this approach is that the amount of reading material for the first weeks may seem excessively long, and we will therefore use stars, i.e. ⋆, to signify that a particular section (including its subsections) is of less significance, perhaps because it recaps material from other courses (such as the introduction to linear algebra), or because it is technical in nature and is supposed to give a more in-depth idea of what is going on (cf. section 5.5 and section 5.4.6). We obviously advise the reader to do the assigned homework problems (see the course website), but failing that, we strongly encourage the reader to at least read the homework problems to get an idea of which parts of the material are more likely to occur at the exam. The focus of the exam is either to understand the material well enough to make common-sense inferences about how it applies in particular situations, or to apply the methods/definitions concretely to particular situations. Note that solutions to the homework problems are included at the end of this book.

Based on feedback in previous semesters, we have begun using colored boxes as an aid for the reader. These boxes are used as follows:

Method: Key definitions or summaries
Summarizes a method or a particularly relevant result. Should be fairly self-contained and relevant as a how-to resource. Make sure you understand the content.

Example: Illustration of how to do something
A small (concrete) example of how to calculate something, either because it is exam relevant, or to show how certain definitions are used in practice.

Technical note: A warning or derivation
Used to provide additional details which may be technical, confusing or simply a lot of work. Easily (and sometimes best) skipped.

Note that the use of boxes is still a work in progress and, as with all other aspects of these notes, we will be very happy to get feedback on how best to make use of them.

Updates in version 1.1

• Added section 3.4.1 on the interpretation of PCA components, which will be useful for project 1.
• Added section 6.3.4 about the cumulative density function and its inverse. This material should be familiar from a statistics/probability class.

Updates in version 1.2

• Added chapter 11 on statistical evaluation and comparison of machine learning models.
• Renamed chapter 16 to avoid confusion with chapter 11 (no other changes).
• Bayesian networks (section 13.3) are now optional reading and will not be part of the exam.

Updates in version 1.3

• Equation (11.35) for McNemar's confidence interval was expressed in the wrong coordinates and has been updated. A small comment was added to the text as an explanation.
• Added section 11.2.1 to provide a brief explanation of baseline models (useful for project 2).
• Aligned the notation in eq. (11.40) with the subsequent notation (ŝ → σ̃).

Contents

Notation cheat sheet
Course reading guide

Part I  Data: Types, Features and Visualization

1 Introduction
  1.1 What is machine learning and data mining
    1.1.1 Machine Learning
    1.1.2 Data mining
    1.1.3 Relationship to artificial intelligence
    1.1.4 Relationship to other disciplines
    1.1.5 Why should I care about machine learning
  1.2 Machine learning tasks
    1.2.1 Supervised learning
    1.2.2 Unsupervised learning
    1.2.3 Reinforcement learning
    1.2.4 The machine-learning toolbox
  1.3 Basic terminology
    1.3.1 Models
    1.3.2 A closer look at what a model does ⋆
  1.4 The machine learning workflow

2 Data and attribute types
  2.1 What is a dataset?
    2.1.1 Attributes
    2.1.2 Attribute types
  2.2 Data issues
  2.3 The standard data format
  2.4 Feature transformations
    2.4.1 One-out-of-K coding
    2.4.2 Binarizing/thresholding
  Problems

3 Principal Component Analysis
  3.1 Projections and subspaces ⋆
    3.1.1 Subspaces
    3.1.2 Projection onto a subspace
  3.2 Principal Component Analysis
  3.3 Singular Value Decomposition and PCA
    3.3.1 The PCA algorithm
    3.3.2 Variance explained by the PCA
  3.4 Applications of principal component analysis
    3.4.1 Example 1: Interpreting PCA components
    3.4.2 Example 2: A high-dimensional example
    3.4.3 Uses of PCA
  Problems

4 Summary statistics and measures of similarity
  4.1 Attribute statistics
    4.1.1 Covariance and Correlation
  4.2 Term-document matrix
  4.3 Measures of distance
    4.3.1 The Mahalanobis Distance
  4.4 Measures of similarity
  Problems
5 Discrete probabilities and information
  5.1 Probability basics
    5.1.1 A primer on binary propositions ⋆
    5.1.2 Probabilities and plausibility
    5.1.3 Basic rules of probability
    5.1.4 Marginalization and Bayes' theorem
    5.1.5 Mutually exclusive events
    5.1.6 Equally likely events
  5.2 Discrete data and stochastic variables
    5.2.1 Example: Bayes theorem and the cars dataset
    5.2.2 Generating random numbers ⋆
    5.2.3 Expectations, mean and variance
  5.3 Independence and conditional independence
  5.4 The Bernoulli, categorical and binomial distributions
    5.4.1 The Bernoulli distribution
    5.4.2 The categorical distribution
    5.4.3 Parameter transformations
    5.4.4 Repeated events
    5.4.5 A learning principle: Maximum likelihood
    5.4.6 The binomial distribution ⋆
  5.5 Information Theory ⋆
    5.5.1 Measuring information
    5.5.2 Entropy
    5.5.3 Mutual information
    5.5.4 Normalized mutual information

6 Densities and models
  6.1 Probability densities
    6.1.1 Multiple continuous parameters
  6.2 Expectations, mean and variance
  6.3 Examples of densities
    6.3.1 The normal and multivariate normal distribution
    6.3.2 Diagonal covariance
    6.3.3 The Beta distribution
    6.3.4 The cumulative density function
    6.3.5 The central limit theorem ⋆
  6.4 Bayesian probabilities and machine learning
    6.4.1 Choosing the prior
  6.5 Bayesian learning in general
  Problems

7 Data Visualization
  7.1 Basic plotting
  7.2 What sets apart a good plot?
  7.3 Visualizing the machine-learning workflow ⋆
    7.3.1 Visualizations to understand loss
    7.3.2 Use visualizations to understand mistakes
    7.3.3 Visualization to debug methods
    7.3.4 Use visualization for an overview
    7.3.5 Illustrati...