Deep Learning
Ian Goodfellow, Yoshua Bengio and Aaron Courville
The MIT Press
Cambridge, Massachusetts | London, England

© 2016 Massachusetts Institute of Technology

This book was set in SFRM1095 by diacriTech, Chennai. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Names: Goodfellow, Ian, author. | Bengio, Yoshua, author. | Courville, Aaron, author.
Title: Deep learning / Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Description: Cambridge, MA : MIT Press, [2017] | Series: Adaptive computation and machine learning series | Includes bibliographical references and index.
Identifiers: LCCN 2016022992 | ISBN 9780262035613 (hardcover : alk. paper)
Subjects: LCSH: Machine learning.
Classification: LCC Q325.5 .G66 2017 | DDC 006.3/1–dc23
LC record available from the Library of Congress.

Contents

Website  xiii
Notation  xix

1  Introduction  1
   1.1  Who Should Read This Book?  8
   1.2  Historical Trends in Deep Learning  12

I  Applied Math and Machine Learning Basics  27

2  Linear Algebra  29
   2.1  Scalars, Vectors, Matrices and Tensors  29
   2.2  Multiplying Matrices and Vectors  32
   2.3  Identity and Inverse Matrices  34
   2.4  Linear Dependence and Span  35
   2.5  Norms  36
   2.6  Special Kinds of Matrices and Vectors  38
   2.7  Eigendecomposition  39
   2.8  Singular Value Decomposition  42
   2.9  The Moore-Penrose Pseudoinverse  43
   2.10  The Trace Operator  44
   2.11  The Determinant  45
   2.12  Example: Principal Components Analysis  45

3  Probability and Information Theory  51
   3.1  Why Probability?  52
   3.2  Random Variables  54
   3.3  Probability Distributions  54
   3.4  Marginal Probability  56
   3.5  Conditional Probability  57
   3.6  The Chain Rule of Conditional Probabilities  57
   3.7  Independence and Conditional Independence  58
   3.8  Expectation, Variance and Covariance  58
   3.9  Common Probability Distributions  60
   3.10  Useful Properties of Common Functions  65
   3.11  Bayes’ Rule  68
   3.12  Technical Details of Continuous Variables  68
   3.13  Information Theory  70
   3.14  Structured Probabilistic Models  74

4  Numerical Computation  77
   4.1  Overflow and Underflow  77
   4.2  Poor Conditioning  79
   4.3  Gradient-Based Optimization  79
   4.4  Constrained Optimization  89
   4.5  Example: Linear Least Squares  92
5  Machine Learning Basics  95
   5.1  Learning Algorithms  96
   5.2  Capacity, Overfitting and Underfitting  107
   5.3  Hyperparameters and Validation Sets  117
   5.4  Estimators, Bias and Variance  119
   5.5  Maximum Likelihood Estimation  128
   5.6  Bayesian Statistics  132
   5.7  Supervised Learning Algorithms  136
   5.8  Unsupervised Learning Algorithms  142
   5.9  Stochastic Gradient Descent  147
   5.10  Building a Machine Learning Algorithm  149
   5.11  Challenges Motivating Deep Learning  151

II  Deep Networks: Modern Practices  161

6  Deep Feedforward Networks  163
   6.1  Example: Learning XOR  166
   6.2  Gradient-Based Learning  171
   6.3  Hidden Units  185
   6.4  Architecture Design  191
   6.5  Back-Propagation and Other Differentiation Algorithms  197
   6.6  Historical Notes  217

7  Regularization for Deep Learning  221
   7.1  Parameter Norm Penalties  223
   7.2  Norm Penalties as Constrained Optimization  230
   7.3  Regularization and Under-Constrained Problems  232
   7.4  Dataset Augmentation  233
   7.5  Noise Robustness  235
   7.6  Semi-Supervised Learning  236
   7.7  Multitask Learning  237
   7.8  Early Stopping  239
   7.9  Parameter Tying and Parameter Sharing  246
   7.10  Sparse Representations  247
   7.11  Bagging and Other Ensemble Methods  249
   7.12  Dropout  251
   7.13  Adversarial Training  261
   7.14  Tangent Distance, Tangent Prop and Manifold Tangent Classifier  263

8  Optimization for Training Deep Models  267
   8.1  How Learning Differs from Pure Optimization  268
   8.2  Challenges in Neural Network Optimization  275
   8.3  Basic Algorithms  286
   8.4  Parameter Initialization Strategies  292
   8.5  Algorithms with Adaptive Learning Rates  298
   8.6  Approximate Second-Order Methods  302
   8.7  Optimization Strategies and Meta-Algorithms  309

9  Convolutional Networks  321
   9.1  The Convolution Operation  322
   9.2  Motivation  324
   9.3  Pooling  330
   9.4  Convolution and Pooling as an Infinitely Strong Prior  334
   9.5  Variants of the Basic Convolution Function  337
   9.6  Structured Outputs  347
   9.7  Data Types  348
   9.8  Efficient Convolution Algorithms  350
   9.9  Random or Unsupervised Features  351
   9.10  The Neuroscientific Basis for Convolutional Networks  353
   9.11  Convolutional Networks and the History of Deep Learning  359

10  Sequence Modeling: Recurrent and Recursive Nets  363
   10.1  Unfolding Computational Graphs  365
   10.2  Recurrent Neural Networks  368
   10.3  Bidirectional RNNs  383
   10.4  Encoder-Decoder Sequence-to-Sequence Architectures  385
   10.5  Deep Recurrent Networks  387
   10.6  Recursive Neural Networks  388
   10.7  The Challenge of Long-Term Dependencies  390
   10.8  Echo State Networks  392
   10.9  Leaky Units and Other Strategies for Multiple Time Scales  395
   10.10  The Long Short-Term Memory and Other Gated RNNs  397
   10.11  Optimization for Long-Term Dependencies  401
   10.12  Explicit Memory  405

11  Practical Methodology  409
   11.1  Performance Metrics  410
   11.2  Default Baseline Models  413
   11.3  Determining Whether to Gather More Data  414
   11.4  Selecting Hyperparameters  415
   11.5  Debugging Strategies  424
   11.6  Example: Multi-Digit Number Recognition  428

12  Applications  431
   12.1  Large-Scale Deep Learning  431
   12.2  Computer Vision  440
   12.3  Speech Recognition  446
   12.4  Natural Language Processing  448
   12.5  Other Applications  465

III  Deep Learning Research  475

13  Linear Factor Models  479
   13.1  Probabilistic PCA and Factor Analysis  480
   13.2  Independent Component Analysis (ICA)  481
   13.3  Slow Feature Analysis  484
   13.4  Sparse Coding  486
   13.5  Manifold Interpretation of PCA  489

14  Autoencoders  493
   14.1  Undercomplete Autoencoders  494
   14.2  Regularized Autoencoders  495
   14.3  Representational Power, Layer Size and Depth  499
   14.4  Stochastic Encoders and Decoders  500
   14.5  Denoising Autoencoders  501
   14.6  Learning Manifolds with Autoencoders  506
   14.7  Contractive Autoencoders  510
   14.8  Predictive Sparse Decomposition  514
   14.9  Applications of Autoencoders  515

15  Representation Learning  517
   15.1  Greedy Layer-Wise Unsupervised Pretraining  519
   15.2  Transfer Learning and Domain Adaptation  526
   15.3  Semi-Supervised Disentangling of Causal Factors  532
   15.4  Distributed Representation  536
   15.5  Exponential Gains from Depth  543
   15.6  Providing Clues to Discover Underlying Causes  544

16  Structured Probabilistic Models for Deep Learning  549
   16.1  The Challenge of Unstructured Modeling  550
   16.2  Using Graphs to Describe Model Structure  554
   16.3  Sampling from Graphical Models  570
   16.4  Advantages of Structured Modeling  572
   16.5  Learning about Dependencies  572
   16.6  Inference and Approximate Inference  573
   16.7  The Deep Learning Approach to Structured Probabilistic Models  575

17  Monte Carlo Methods  581
   17.1  Sampling and Monte Carlo Methods  581
   17.2  Importance Sampling  583
   17.3  Markov Chain Monte Carlo Methods  586
   17.4  Gibbs Sampling  590
   17.5  The Challenge of Mixing between Separated Modes  591

18  Confronting the Partition Function  597
   18.1  The Log-Likelihood Gradient  598
   18.2  Stochastic Maximum Likelihood and Contrastive Divergence  599
   18.3  Pseudolikelihood  607
   18.4  Score Matching and Ratio Matching  609
   18.5  Denoising Score Matching  611
   18.6  Noise-Contrastive Estimation  612
   18.7  Estimating the Partition Function  614

19  Approximate Inference  623
   19.1  Inference as Optimization  624
   19.2  Expectation Maximization  626
   19.3  MAP Inference and Sparse Coding  627
   19.4  Variational Inference and Learning  629
   19.5  Learned Approximate Inference  642

20  Deep Generative Models  645
   20.1  Boltzmann Machines  645
   20.2  Restricted Boltzmann Machines  647
   20.3  Deep Belief Networks  651
   20.4  Deep Boltzmann Machines  654
   20.5  Boltzmann Machines for Real-Valued Data  667
   20.6  Convolutional Boltzmann Machines  673
   20.7  Boltzmann Machines for Structured or Sequential Outputs  675
   20.8  Other Boltzmann Machines  677
   20.9  Back-Propagation through Random Operations  678
   20.10  Directed Generative Nets  682
   20.11  Drawing Samples from Autoencoders  701
   20.12  Generative Stochastic Networks  704
   20.13  Other Generation Schemes  706
   20.14  Evaluating Generative Models  707
   20.15  Conclusion  710

Bibliography  711
Index  767

Website

This book is accompanied by a website, which provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.

Notation

This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, we describe most of these ideas in chapters 2–4. (In the book these symbols are distinguished by typeface: bold for vectors, matrices and tensors, and a distinct face for random variables.)

Numbers and Arrays
a            A scalar (integer or real)
a            A vector
A            A matrix
A            A tensor
I_n          Identity matrix with n rows and n columns
I            Identity matrix with dimensionality implied by context
e^(i)        Standard basis vector [0, ..., 0, 1, 0, ..., 0] with a 1 at position i
diag(a)      A square, diagonal matrix with diagonal entries given by a
a            A scalar random variable
a            A vector-valued random variable
A            A matrix-valued random variable

Sets and Graphs
A            A set
R            The set of real numbers
{0, 1}       The set containing 0 and 1
{0, 1, ..., n}   The set of all integers between 0 and n
[a, b]       The real interval including a and b
(a, b]       The real interval excluding a but including b
A \ B        Set subtraction, i.e., the set containing the elements of A that are not in B
G            A graph
Pa_G(x_i)    The parents of x_i in G

Indexing
a_i          Element i of vector a, with indexing starting at 1
a_{-i}       All elements of vector a except for element i
A_{i,j}      Element i, j of matrix A
A_{i,:}      Row i of matrix A
A_{:,i}      Column i of matrix A
A_{i,j,k}    Element (i, j, k) of a 3-D tensor A
A_{:,:,i}    2-D slice of a 3-D tensor
a_i          Element i of the random vector a

Linear Algebra Operations
A⊤           Transpose of matrix A
A⁺           Moore-Penrose pseudoinverse of A
A ⊙ B        Element-wise (Hadamard) product of A and B
det(A)       Determinant of A

Calculus
dy/dx        Derivative of y with respect to x
∂y/∂x        Partial derivative of y with respect to x
∇_x y        Gradient of y with respect to x
∇_X y        Matrix derivatives of y with respect to X
∇_X y        Tensor containing derivatives of y with respect to X
∂f/∂x        Jacobian matrix J ∈ R^(m×n) of f : R^n → R^m
∇_x² f(x) or H(f)(x)   The Hessian matrix of f at input point x
∫ f(x) dx    Definite integral over the entire domain of x
∫_S f(x) dx  Definite integral with respect to x over the set S

Probability and Information Theory
a ⊥ b        The random variables a and b are independent
a ⊥ b | c    They are conditionally independent given c
P(a)         A probability distribution over a discrete variable
p(a)         A probability distribution over a continuous variable, or over a variable whose type has not been specified
a ∼ P        Random variable a has distribution P
E_{x∼P}[f(x)] or E f(x)   Expectation of f(x) with respect to P(x)
Var(f(x))    Variance of f(x) under P(x)
Cov(f(x), g(x))   Covariance of f(x) and g(x) under P(x)
H(x)         Shannon entropy of the random variable x
D_KL(P ‖ Q)  Kullback-Leibler divergence of P and Q
N(x; µ, Σ)   Gaussian distribution over x with mean µ and covariance Σ

Functions
f : A → B    The function f with domain A and range B
f ∘ g        Composition of the functions f and g
f(x; θ)      A function of x parametrized by θ (sometimes we write f(x) and omit the argument θ to lighten notation)
log x        Natural logarithm of x
σ(x)         Logistic sigmoid, 1 / (1 + exp(−x))
ζ(x)         Softplus, log(1 + exp(x))
‖x‖_p        L^p norm of x
‖x‖          L^2 norm of x
x⁺           Positive part of x, i.e., max(0, x)
1_condition  is 1 if the condition is true, 0 otherwise

Sometimes we use a function f whose argument is a scalar but apply it to a vector, matrix, or tensor: f(x), f(X), or f(X). This denotes the application of f to the array element-wise. For example, if C = σ(X), then C_{i,j,k} = σ(X_{i,j,k}) for all valid values of i, j and k.
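The short sketch below is not part of the book; it is a minimal NumPy illustration (assuming a Python environment with NumPy available) of the logistic sigmoid and softplus defined above and of the element-wise application convention just described. The function names sigmoid and softplus are chosen here for readability.

    import numpy as np

    def sigmoid(x):
        # Logistic sigmoid: sigma(x) = 1 / (1 + exp(-x)), applied element-wise.
        return 1.0 / (1.0 + np.exp(-x))

    def softplus(x):
        # Softplus: zeta(x) = log(1 + exp(x)), applied element-wise.
        # Fine for moderate x; a production version would guard against
        # overflow of exp(x) for large positive x.
        return np.log1p(np.exp(x))

    # Element-wise application of a scalar function to a tensor:
    # if C = sigmoid(X), then C[i, j, k] == sigmoid(X[i, j, k]) for all valid i, j, k.
    X = np.random.randn(2, 3, 4)   # a 3-D tensor of shape (2, 3, 4)
    C = sigmoid(X)

    assert C.shape == X.shape
    assert np.isclose(C[1, 2, 3], sigmoid(X[1, 2, 3]))
    assert np.allclose(softplus(X) - softplus(-X), X)   # zeta(x) - zeta(-x) = x

The last assertion checks the identity ζ(x) − ζ(−x) = x, which follows directly from the definition of the softplus.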
Datasets and Distributions
p_data       The data generating distribution
p̂_data       The empirical distribution defined by the training set
X            A set of training examples
x^(i)        The i-th example (input) from a dataset
y^(i)        The target associated with x^(i) for supervised learning (written as a vector y^(i) when the target is vector-valued)
X            The m × n matrix with input example x^(i) in row X_{i,:}

1 Introduction

Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997).

When programmable computers were first conceived, people wondered whether such machines might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research.

In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.

This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined through its relation ...