Bayesian Reasoning and Machine Learning
© David Barber 2007–2013

Notation List

V - A calligraphic symbol typically denotes a set of random variables
dom(x) - Domain of a variable
x = x - The variable x is in the state x
p(x = tr) - Probability of event/variable x being in the state true
p(x = fa) - Probability of event/variable x being in the state false
p(x, y) - Probability of x and y
p(x ∩ y) - Probability of x and y
p(x ∪ y) - Probability of x or y
p(x|y) - The probability of x conditioned on y
X ⊥⊥ Y | Z - Variables X are independent of variables Y conditioned on variables Z
X ⊤⊤ Y | Z - Variables X are dependent on variables Y conditioned on variables Z
∫_x f(x) - For continuous variables this is shorthand for ∫ f(x) dx, and for discrete variables it means summation over the states of x, Σ_x f(x)
I[S] - Indicator: has value 1 if the statement S is true, 0 otherwise
pa(x) - The parents of node x
ch(x) - The children of node x
ne(x) - Neighbours of node x
dim(x) - For a discrete variable x, this denotes the number of states x can take
⟨f(x)⟩_{p(x)} - The average of the function f(x) with respect to the distribution p(x)
δ(a, b) - Delta function. For discrete a, b this is the Kronecker delta δ_{a,b}, and for continuous a, b the Dirac delta function δ(a − b)
dim x - The dimension of the vector/matrix x
♯(x = s, y = t) - The number of times x is in state s and y in state t simultaneously
♯_x^y - The number of times variable x is in state y
D - Dataset
n - Data index
N - Number of dataset training points
S - Sample covariance matrix
σ(x) - The logistic sigmoid 1/(1 + exp(−x))
erf(x) - The (Gaussian) error function
x_{a:b} - x_a, x_{a+1}, . . . , x_b
i ∼ j - The set of unique neighbouring edges on a graph
I_m - The m × m identity matrix
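As a quick numerical illustration of a few of these items (a small MATLAB sketch only, not part of the book's notation list):

    % discrete variable x with dom(x) = {1,2,3} and distribution p(x)
    px = [0.5 0.3 0.2];            % p(x=1), p(x=2), p(x=3)
    fx = [10 20 30];               % a function f(x) evaluated at each state
    avgf = sum(px .* fx);          % <f(x)>_{p(x)} = sum_x p(x) f(x) = 17

    % indicator I[S]: 1 if the statement S is true, 0 otherwise
    indic = double(avgf > 0);      % I[avgf > 0] = 1

    % logistic sigmoid sigma(x) = 1/(1 + exp(-x))
    sigma = @(x) 1./(1 + exp(-x)); % sigma(0) = 0.5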
Preface

The data explosion

We live in a world that is rich in data, ever increasing in scale. This data comes from many different sources in science (bioinformatics, astronomy, physics, environmental monitoring) and commerce (customer databases, financial transactions, engine monitoring, speech recognition, surveillance, search). Knowing how to process such data and extract value from it is therefore a key and increasingly important skill. Our society also expects ultimately to be able to engage with computers in a natural manner, so that computers can ‘talk’ to humans, ‘understand’ what they say and ‘comprehend’ the visual world around them. These are difficult, large-scale information processing tasks and represent grand challenges for computer science and related fields. Similarly, there is a desire to control increasingly complex systems, possibly containing many interacting parts, such as in robotics and autonomous navigation. Successfully mastering such systems requires an understanding of the processes underlying their behaviour. Processing and making sense of such large amounts of data from complex systems is therefore a pressing modern-day concern and will likely remain so for the foreseeable future.

Machine Learning

Machine Learning is the study of data-driven methods capable of mimicking, understanding and aiding human and biological information processing tasks. In this pursuit, many related issues arise, such as how to compress, interpret and process data. Often these methods are not directed at mimicking human processing directly, but rather at enhancing it, such as in predicting the stock market or retrieving information rapidly. Here probability theory is key, since our limited data and understanding of the problem inevitably force us to address uncertainty. In the broadest sense, Machine Learning and related fields aim to ‘learn something useful’ about the environment within which the agent operates. Machine Learning is also closely allied with Artificial Intelligence, with Machine Learning placing more emphasis on using data to drive and adapt the model. In the early stages of Machine Learning and related areas, similar techniques were discovered in relatively isolated research communities. This book presents a unified treatment via graphical models, a marriage between graph theory and probability theory, which facilitates the transfer of Machine Learning concepts between different branches of the mathematical and computational sciences.

Whom this book is for

The book is designed to appeal to students with only a modest mathematical background in undergraduate calculus and linear algebra. No formal computer science or statistical background is required to follow the book, although a basic familiarity with probability, calculus and linear algebra would be useful. The book should appeal to students from a variety of backgrounds, including Computer Science, Engineering, applied Statistics, Physics and Bioinformatics, who wish to gain an entry to probabilistic approaches in Machine Learning. In order to engage with students, the book introduces fundamental concepts in inference using only minimal reference to algebra and calculus.
More mathematical techniques are postponed until as and when required, always with the concept as primary and the mathematics secondary. The concepts and algorithms are described with the aid of many worked examples. The exercises and demonstrations, together with an accompanying MATLAB toolbox, enable the reader to experiment with and more deeply understand the material. The ultimate aim of the book is to enable the reader to construct novel algorithms. The book therefore places an emphasis on skill learning, rather than being a collection of recipes. This is a key aspect, since modern applications are often so specialised as to require novel methods. The approach taken throughout is to describe the problem as a graphical model, which is then translated into a mathematical framework, ultimately leading to an algorithmic implementation in the BRMLtoolbox. The book is primarily aimed at final-year undergraduates and graduates without significant experience in mathematics. On completion, the reader should have a good understanding of the techniques, practicalities and philosophies of probabilistic aspects of Machine Learning and be well equipped to understand more advanced research-level material.

The structure of the book

The book begins with the basic concepts of graphical models and inference. For the independent reader, chapters 1, 2, 3, 4, 5, 9, 10, 13, 14, 15, 16, 17, 21 and 23 would form a good introduction to probabilistic reasoning, modelling and Machine Learning. The material in chapters 19, 24, 25 and 28 is more advanced, with the remaining material being of more specialised interest. Note that in each chapter the level of material varies in difficulty, typically with the more challenging material placed towards the end of each chapter. As an introduction to the area of probabilistic modelling, a course can be constructed from the material as indicated in the chart (see the part and chapter listing below). The material from parts I and II has been successfully used for courses on Graphical Models. I have also taught an introduction to Probabilistic Machine Learning using material largely from part III, as indicated. These two courses can be taught separately, and a useful approach would be to teach the Graphical Models course first, followed by a separate Probabilistic Machine Learning course. A short course on approximate inference can be constructed from introductory material in part I and the more advanced material in part V, as indicated. The exact inference methods in part I can be covered relatively quickly, with the material in part V considered in more depth. A time-series course can be made by using primarily the material in part IV, possibly combined with material from part I for students who are unfamiliar with probabilistic modelling approaches. Some of this material, particularly in chapter 25, is more advanced and can be deferred until the end of the course, or considered for a more advanced course. The references are generally to works at a level consistent with the book material and which are for the most part readily available.

Accompanying code

The BRMLtoolbox is provided to help readers see how mathematical models translate into actual MATLAB code. There are a large number of demos that a lecturer may wish to use or adapt to help illustrate the material. In addition, many of the exercises make use of the code, helping the reader gain confidence in the concepts and their application. Along with complete routines for many Machine Learning methods, the philosophy is to provide low-level routines whose composition intuitively follows the mathematical description of the algorithm. In this way students may easily match the mathematics with the corresponding algorithmic implementation.
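To give a flavour of this style (a minimal standalone sketch, not the toolbox's actual routines or signatures), the marginal p(x) = Σ_y p(x|y) p(y) can be written as a direct composition of elementary table operations; the toolbox carries out the same steps on potential structures using routines such as multpots and sumpot, listed later:

    % p(y) over 2 states; p(x|y) over 3 states of x (rows), one column per state of y
    py   = [0.4; 0.6];                % p(y=1), p(y=2)
    pxgy = [0.7 0.2;                  % p(x=1|y=1), p(x=1|y=2)
            0.2 0.5;                  % p(x=2|y=1), p(x=2|y=2)
            0.1 0.3];                 % p(x=3|y=1), p(x=3|y=2)
    pxy = pxgy .* repmat(py', 3, 1);  % joint p(x,y) = p(x|y) p(y)
    px  = sum(pxy, 2);                % marginal p(x) = sum_y p(x,y) = [0.40; 0.38; 0.22]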
Part I: Inference in Probabilistic Models
1: Probabilistic Reasoning
2: Basic Graph Concepts
3: Belief Networks
4: Graphical Models
5: Efficient Inference in Trees
6: The Junction Tree Algorithm
7: Making Decisions

Part II: Learning in Probabilistic Models
8: Statistics for Machine Learning
9: Learning as Inference
10: Naive Bayes
11: Learning with Hidden Variables
12: Bayesian Model Selection

Part III: Machine Learning
13: Machine Learning Concepts
14: Nearest Neighbour Classification
15: Unsupervised Linear Dimension Reduction
16: Supervised Linear Dimension Reduction
17: Linear Models
18: Bayesian Linear Models
19: Gaussian Processes
20: Mixture Models
21: Latent Linear Models
22: Latent Ability Models

Part IV: Dynamical Models
23: Discrete-State Markov Models
24: Continuous-State Markov Models
25: Switching Linear Dynamical Systems
26: Distributed Computation

Part V: Approximate Inference
27: Sampling
28: Deterministic Approximate Inference

Course paths indicated in the chart: Graphical Models Course, Probabilistic Machine Learning Course, Probabilistic Modelling Course, Time-series Short Course, Approximate Inference Short Course.

Website

The BRMLtoolbox, along with an electronic version of the book, is available from the book's website. Instructors seeking solutions to the exercises can find information at the website, along with additional teaching materials.

Other books in this area

The literature on Machine Learning is vast, with much relevant literature also contained in statistics, engineering and other physical sciences. A small list of more specialised books that may be referred to for deeper treatments of specific topics is:

• Graphical models
  – Graphical Models by S. Lauritzen, Oxford University Press, 1996.
  – Bayesian Networks and Decision Graphs by F. Jensen and T. D. Nielsen, Springer Verlag, 2007.
  – Probabilistic Networks and Expert Systems by R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter, Springer Verlag, 1999.
  – Probabilistic Reasoning in Intelligent Systems by J. Pearl, Morgan Kaufmann, 1988.
  – Graphical Models in Applied Multivariate Statistics by J. Whittaker, Wiley, 1990.
  – Probabilistic Graphical Models: Principles and Techniques by D. Koller and N. Friedman, MIT Press, 2009.

• Machine Learning and Information Processing
  – Information Theory, Inference and Learning Algorithms by D. J. C. MacKay, Cambridge University Press, 2003.
  – Pattern Recognition and Machine Learning by C. M. Bishop, Springer Verlag, 2006.
  – An Introduction to Support Vector Machines by N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000.
  – Gaussian Processes for Machine Learning by C. E. Rasmussen and C. K. I. Williams, MIT Press, 2006.

Acknowledgements

Many people have helped this book along the way, either in terms of reading, feedback, general insights, allowing me to present their work, or just plain motivation.
Amongst these I would like to thank Dan Cornford, Massimiliano Pontil, Mark Herbster, John Shawe-Taylor, Vladimir Kolmogorov, Yuri Boykov, Tom Minka, Simon Prince, Silvia Chiappa, Bertrand Mesot, Robert Cowell, Ali Taylan Cemgil, David Blei, Jeff Bilmes, David Cohn, David Page, Peter Sollich, Chris Williams, Marc Toussaint, Amos Storkey, Zakria Hussain, Le Chen, Serafín Moral, Milan Studený, Luc De Raedt, Tristan Fletcher, Chris Vryonides, Tom Furmston, Ed Challis and Chris Bracegirdle. I would also like to thank the many students who have helped improve the material during lectures over the years. I'm particularly grateful to Taylan Cemgil for allowing his GraphLayout package to be bundled with the BRMLtoolbox. The staff at Cambridge University Press have been a delight to work with and I would especially like to thank Heather Bergman for her initial endeavours and the wonderful Diana Gillooly for her continued enthusiasm. A heartfelt thank you to my parents and sister – I hope this small token will make them proud. I'm also fortunate to be able to acknowledge the support and generosity of friends throughout. Finally, I'd like to thank Silvia, who made it all worthwhile.

BRMLtoolbox

The BRMLtoolbox is a lightweight set of routines that enables the reader to experiment with concepts in graph theory, probability theory and Machine Learning. The code contains basic routines for manipulating discrete variable distributions, along with more limited support for continuous variables. In addition there are many hard-coded standard Machine Learning algorithms. The website also contains a complete list of all the teaching demos and related exercise material.

BRMLTOOLKIT

Graph Theory

ancestors - Return the ancestors of nodes x in DAG A
ancestralorder - Return the ancestral order of the DAG A (oldest first)
descendents - Return the descendents of nodes x in DAG A
children - Return the children of variable x given adjacency matrix A
edges - Return edge list from adjacency matrix A
elimtri - Return a variable elimination sequence for a triangulated graph
connectedComponents - Find the connected components of an adjacency matrix
istree - Check if graph is singly connected
neigh - Find the neighbours of vertex v on a graph with adjacency matrix G
noselfpath - Return a path excluding self transitions
parents - Return the parents of variable x given adjacency matrix A
spantree - Find a spanning tree from an edge list
triangulate - Triangulate adjacency matrix A
triangulatePorder - Triangulate adjacency matrix A according to a partial ordering
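As a minimal sketch of the kind of operation these graph routines perform (illustrative code only, not the toolbox implementations; it assumes an adjacency matrix A in which A(i,j) = 1 means an edge from node i to node j):

    % toy DAG on 4 nodes: 1->3, 2->3, 3->4
    A = zeros(4); A(1,3) = 1; A(2,3) = 1; A(3,4) = 1;
    x = 4;
    parents_x  = find(A(:,x))';   % nodes with an edge into x  -> [3]
    children_x = find(A(x,:));    % nodes x points to          -> []
    % ancestors of x: repeatedly add parents until nothing new appears
    anc = parents_x; oldanc = [];
    while ~isequal(anc, oldanc)
        oldanc = anc;
        anc = union(anc, find(any(A(:, anc), 2))');
    end
    % anc is now [1 2 3], the ancestors of node 4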
Potential manipulation

condpot - Return a potential conditioned on another variable
changevar - Change variable names in a potential
dag - Return the adjacency matrix (zeros on diagonal) for a Belief Network
deltapot - A delta function potential
disptable - Print the table of a potential
divpots - Divide potential pota by potb
drawFG - Draw the Factor Graph A
drawID - Plot an Influence Diagram
drawJTree - Plot a Junction Tree
drawNet - Plot network
evalpot - Evaluate the table of a potential when variables are set
exppot - Exponential of a potential
eyepot - Return a unit potential
grouppot - Form a potential based on grouping variables together
groupstate - Find the state of the group variables corresponding to a given ungrouped state
logpot - Logarithm of the potential
markov - Return a symmetric adjacency matrix of the Markov Network in pot
maxpot - Maximise a potential over variables
maxsumpot - Maximise or sum a potential over variables
multpots - Multiply potentials into a single potential
numstates - Number of states of the variables in a potential
orderpot - Return potential with variables reordered according to order
orderpotfields - Order the fields of the potential, creating blank entries where necessary
potsample - Draw sample from a single potential
potscontainingonly - Return those potential numbers that contain only the required variables
potvariables - Return information about all variables in a set of potentials
setevpot - Set variables in a potential into evidential states
setpot - Set potential variables to specified states
setstate - Set a potential's specified joint state to a specified value
squeezepots - Eliminate redundant potentials (those contained wholly within another)
sumpot - Sum potential pot over variables
sumpotID - Return the summed probability and utility tables from an ID
sumpots - Sum a set of potentials
table - Return the potential table
ungrouppot - Form a potential based on ungrouping variables
uniquepots - Eliminate redundant potentials (those contained wholly within another)
whichpot - Return potentials that contain a set of variables

Additional routines extend the toolbox to deal with Gaussian potentials: multpotsGaussianMoment.m, sumpotGaussianCanonical.m, sumpotGaussianMoment.m, multpotsGaussianCanonical.m. See demoSumprodGaussCanon.m, demoSumprodGaussCanonLDS.m, demoSumprodGaussMoment.m.

Inference

absorb - Update potentials in absorption message passing on a Junction Tree
absorption - Perform full round of absorption on a Junction Tree
absorptionID - Perform full round of absorption on an Influence Diagram
ancestralsample - Ancestral sampling from a Belief Network
binaryMRFmap - Get the MAP assignment for a binary MRF with positive W
bucketelim - Bucket Elimination on a set of potentials
condindep - Conditional Independence check using graph of variable interactions
condindepEmp - Compute the empirical log Bayes Factor and MI for independence/dependence
condindepPot - Numerical conditional independence measure
condMI - Conditional mutual information I(x,y|z) of a potential
FactorConnectingVariable - Factor nodes connecting to a set of variables
FactorGraph - Return a Factor Graph adjacency matrix based on potentials
IDvars - Probability and decision variables from a partial order
jtassignpot - Assign potentials to cliques in a Junction Tree
jtree - Set up a Junction Tree based on a set of potentials
jtreeID - Set up a Junction Tree based on an Influence Diagram
LoopyBP - Loopy Belief Propagation using the sum-product algorithm
MaxFlow - Ford-Fulkerson max flow - min cut algorithm (breadth first search)
maxNpot - Find the N most probable values and states in a potential
maxNprodFG - N-Max-Product algorithm on a Factor Graph (returns the N most probable states)
maxprodFG - Max-Prod...
MDPemDeterministicPolicy
MDPsolve
MesstoFact
metropolis
mostprobablepath
mostprobablepathmult
sumprodFG
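For example, ancestral sampling (cf. ancestralsample above) draws each variable from its conditional distribution given the already-sampled states of its parents. A minimal two-variable sketch, not the toolbox routine itself:

    % Belief network p(x,y) = p(x) p(y|x) with binary x and y
    px   = [0.3 0.7];           % p(x=1), p(x=2)
    pygx = [0.9 0.2;            % p(y=1|x=1), p(y=1|x=2)
            0.1 0.8];           % p(y=2|x=1), p(y=2|x=2)
    % sample parents before children: x from p(x), then y from p(y|x)
    x = find(rand < cumsum(px), 1);
    y = find(rand < cumsum(pygx(:, x)), 1);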