Bayesian Reasoning and Machine Learning
© David Barber 2007, 2008, 2009, 2010, 2011, 2012, 2013

Notation List
V                   A calligraphic symbol typically denotes a set of random variables
dom(x)              Domain of a variable
x = x               The variable x is in the state x
p(x = tr)           Probability of event/variable x being in the state true
p(x = fa)           Probability of event/variable x being in the state false
p(x, y)             Probability of x and y
p(x ∩ y)            Probability of x and y
p(x ∪ y)            Probability of x or y
p(x|y)              The probability of x conditioned on y
X ⊥⊥ Y | Z          Variables X are independent of variables Y conditioned on variables Z
X ⊤⊤ Y | Z          Variables X are dependent on variables Y conditioned on variables Z
∫x f(x)             For continuous variables this is shorthand for ∫ f(x) dx and for discrete variables means summation over the states of x, Σx f(x)
I [S]               Indicator: has value 1 if the statement S is true, 0 otherwise
pa (x)              The parents of node x
ch (x)              The children of node x
ne (x)              Neighbours of node x
dim (x)             For a discrete variable x, this denotes the number of states x can take
⟨f(x)⟩p(x)          The average of the function f(x) with respect to the distribution p(x)
δ(a, b)             Delta function. For discrete a, b this is the Kronecker delta, δa,b, and for continuous a, b the Dirac delta function δ(a − b)
dim x               The dimension of the vector/matrix x
♯ (x = s, y = t)    The number of times x is in state s and y in state t simultaneously
♯xy                 The number of times variable x is in state y
D                   Dataset
n                   Data index
N                   Number of dataset training points
S                   Sample covariance matrix
σ(x)                The logistic sigmoid 1/(1 + exp(−x))
erf(x)              The (Gaussian) error function
xa:b                xa, xa+1, . . . , xb
i ∼ j               The set of unique neighbouring edges on a graph
Im                  The m × m identity matrix
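A few of these definitions translate directly into code. The following minimal sketch (plain MATLAB, not part of the BRMLtoolbox; the variable names are chosen purely for illustration) shows the logistic sigmoid, the indicator function, the Kronecker delta and an average with respect to a small discrete distribution:

sig = @(x) 1./(1+exp(-x));        % the logistic sigmoid sigma(x) = 1/(1+exp(-x))
ind = @(S) double(S);             % indicator I[S]: 1 if the statement S is true, 0 otherwise
kd  = @(a,b) double(a==b);        % Kronecker delta delta(a,b) for discrete a, b
p   = [0.2 0.5 0.3];              % a distribution p(x) with dom(x) = {1,2,3}
f   = [1 4 9];                    % a function f(x) on the same domain
avg = sum(f.*p);                  % <f(x)>_{p(x)} = 0.2*1 + 0.5*4 + 0.3*9 = 4.9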
Preface

The data explosion

We live in a world that is rich in data, ever increasing in scale. This data comes from many different
sources in science (bioinformatics, astronomy, physics, environmental monitoring) and commerce (customer
databases, financial transactions, engine monitoring, speech recognition, surveillance, search). Possessing
the knowledge as to how to process and extract value from such data is therefore a key and increasingly
important skill. Our society also expects ultimately to be able to engage with computers in a natural manner
so that computers can ‘talk’ to humans, ‘understand’ what they say and ‘comprehend’ the visual world
around them. These are difficult large-scale information processing tasks and represent grand challenges
for computer science and related fields. Similarly, there is a desire to control increasingly complex systems,
possibly containing many interacting parts, such as in robotics and autonomous navigation. Successfully
mastering such systems requires an understanding of the processes underlying their behaviour. Processing
and making sense of such large amounts of data from complex systems is therefore a pressing modern day
concern and will likely remain so for the foreseeable future.

Machine Learning
Machine Learning is the study of data-driven methods capable of mimicking, understanding and aiding
human and biological information processing tasks. In this pursuit, many related issues arise, such as how
to compress, interpret and process data. Often these methods are not directed at mimicking human
processing directly but rather at enhancing it, such as in predicting the stock market or retrieving
information rapidly. Here probability theory is key, since our limited data and understanding
of the problem inevitably force us to address uncertainty. In the broadest sense, Machine Learning and related fields
aim to ‘learn something useful’ about the environment within which the agent operates. Machine Learning
is also closely allied with Artificial Intelligence, with Machine Learning placing more emphasis on using data
to drive and adapt the model.
In the early stages of Machine Learning and related areas, similar techniques were discovered in relatively
isolated research communities. This book presents a unified treatment via graphical models, a marriage
between graph and probability theory, facilitating the transference of Machine Learning concepts between
different branches of the mathematical and computational sciences.

Whom this book is for
The book is designed to appeal to students with only a modest mathematical background in undergraduate
calculus and linear algebra. No formal computer science or statistical background is required to follow the
book, although a basic familiarity with probability, calculus and linear algebra would be useful. The book
should appeal to students from a variety of backgrounds, including Computer Science, Engineering, applied
Statistics, Physics, and Bioinformatics who wish to gain an entry to probabilistic approaches in Machine
Learning. In order to engage with students, the book introduces fundamental concepts in inference using
only minimal reference to algebra and calculus. More mathematical techniques are postponed until as and
when required, always with the concept as primary and the mathematics secondary.
The concepts and algorithms are described with the aid of many worked examples. The exercises and
demonstrations, together with an accompanying MATLAB toolbox, enable the reader to experiment and
more deeply understand the material. The ultimate aim of the book is to enable the reader to construct
novel algorithms. The book therefore places an emphasis on skill learning, rather than being a collection of
recipes. This is a key aspect since modern applications are often so specialised as to require novel methods.
The approach taken throughout is to describe the problem as a graphical model, which is then translated
into a mathematical framework, ultimately leading to an algorithmic implementation in the BRMLtoolbox.
The book is primarily aimed at final year undergraduates and graduates without significant experience in
mathematics. On completion, the reader should have a good understanding of the techniques, practicalities
and philosophies of probabilistic aspects of Machine Learning and be well equipped to understand more
advanced research level material.

The structure of the book
The book begins with the basic concepts of graphical models and inference. For the independent reader
chapters 1, 2, 3, 4, 5, 9, 10, 13, 14, 15, 16, 17, 21 and 23 would form a good introduction to probabilistic reasoning,
modelling and Machine Learning. The material in chapters 19, 24, 25 and 28 is more advanced, with the
remaining material being of more specialised interest. Note that in each chapter the level of material is of
varying difficulty, typically with the more challenging material placed towards the end of each chapter. As
an introduction to the area of probabilistic modelling, a course can be constructed from the material as
indicated in the chart.
The material from parts I and II has been successfully used for courses on Graphical Models. I have also
taught an introduction to Probabilistic Machine Learning using material largely from part III, as indicated.
These two courses can be taught separately and a useful approach would be to teach first the Graphical
Models course, followed by a separate Probabilistic Machine Learning course.
A short course on approximate inference can be constructed from introductory material in part I and the
more advanced material in part V, as indicated. The exact inference methods in part I can be covered
relatively quickly, with the material in part V considered in more depth.
A time-series course can be made by using primarily the material in part IV, possibly combined with material
from part I for students who are unfamiliar with probabilistic modelling approaches. Some of this material,
particularly in chapter 25, is more advanced and can be deferred until the end of the course, or considered
for a more advanced course.
The references are generally to works at a level consistent with the book material and which are for the
most part readily available.

Accompanying code
The BRMLtoolbox is provided to help readers see how mathematical models translate into actual MATLAB code. There are a large number of demos that a lecturer may wish to use or adapt to help illustrate
the material. In addition many of the exercises make use of the code, helping the reader gain confidence
in the concepts and their application. Along with complete routines for many Machine Learning methods,
the philosophy is to provide low level routines whose composition intuitively follows the mathematical description of the algorithm. In this way students may easily match the mathematics with the corresponding
algorithmic implementation.
Chart: book parts, chapters and suggested course tracks (Probabilistic Modelling Course; Time-series Short Course; Approximate Inference Short Course; Probabilistic Machine Learning Course; Graphical Models Course).

Part I: Inference in Probabilistic Models
1: Probabilistic Reasoning
2: Basic Graph Concepts
3: Belief Networks
4: Graphical Models
5: Efficient Inference in Trees
6: The Junction Tree Algorithm
7: Making Decisions

Part II: Learning in Probabilistic Models
8: Statistics for Machine Learning
9: Learning as Inference
10: Naive Bayes
11: Learning with Hidden Variables
12: Bayesian Model Selection

Part III: Machine Learning
13: Machine Learning Concepts
14: Nearest Neighbour Classification
15: Unsupervised Linear Dimension Reduction
16: Supervised Linear Dimension Reduction
17: Linear Models
18: Bayesian Linear Models
19: Gaussian Processes
20: Mixture Models
21: Latent Linear Models
22: Latent Ability Models

Part IV: Dynamical Models
23: Discrete-State Markov Models
24: Continuous-State Markov Models
25: Switching Linear Dynamical Systems
26: Distributed Computation

Part V: Approximate Inference
27: Sampling
28: Deterministic Approximate Inference

Website
The BRMLtoolbox along with an electronic version of the book is available from
Instructors seeking solutions to the exercises can find information at the website, along with additional
teaching materials.
Other books in this area
The literature on Machine Learning is vast, with much relevant work also contained in statistics, engineering and other physical sciences. A small list of more specialised books that may be referred to for
deeper treatments of specific topics is:
• Graphical models
  – Graphical models by S. Lauritzen, Oxford University Press, 1996.
  – Bayesian Networks and Decision Graphs by F. Jensen and T. D. Nielsen, Springer Verlag, 2007.
  – Probabilistic Networks and Expert Systems by R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter, Springer Verlag, 1999.
  – Probabilistic Reasoning in Intelligent Systems by J. Pearl, Morgan Kaufmann, 1988.
  – Graphical Models in Applied Multivariate Statistics by J. Whittaker, Wiley, 1990.
  – Probabilistic Graphical Models: Principles and Techniques by D. Koller and N. Friedman, MIT Press, 2009.

• Machine Learning and Information Processing
  – Information Theory, Inference and Learning Algorithms by D. J. C. MacKay, Cambridge University Press, 2003.
  – Pattern Recognition and Machine Learning by C. M. Bishop, Springer Verlag, 2006.
  – An Introduction to Support Vector Machines by N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000.
  – Gaussian Processes for Machine Learning by C. E. Rasmussen and C. K. I. Williams, MIT Press, 2006.

Acknowledgements
Many people have helped this book along the way either in terms of reading, feedback, general insights,
allowing me to present their work, or just plain motivation. Amongst these I would like to thank Dan
Cornford, Massimiliano Pontil, Mark Herbster, John Shawe-Taylor, Vladimir Kolmogorov, Yuri Boykov,
Tom Minka, Simon Prince, Silvia Chiappa, Bertrand Mesot, Robert Cowell, Ali Taylan Cemgil, David Blei,
Jeff Bilmes, David Cohn, David Page, Peter Sollich, Chris Williams, Marc Toussaint, Amos Storkey, Zakria
Hussain, Le Chen, Serafín Moral, Milan Studený, Luc De Raedt, Tristan Fletcher, Chris Vryonides, Tom
Furmston, Ed Challis and Chris Bracegirdle. I would also like to thank the many students that have helped
improve the material during lectures over the years. I’m particularly grateful to Taylan Cemgil for allowing
his GraphLayout package to be bundled with the BRMLtoolbox.
The staff at Cambridge University Press have been a delight to work with and I would especially like to
thank Heather Bergman for her initial endeavors and the wonderful Diana Gillooly for her continued enthusiasm.
A heartfelt thank you to my parents and sister – I hope this small token will make them proud. I’m also
fortunate to be able to acknowledge the support and generosity of friends throughout. Finally, I’d like to
thank Silvia who made it all worthwhile.

BRMLtoolbox

The BRMLtoolbox is a lightweight set of routines that enables the reader to experiment with concepts in
graph theory, probability theory and Machine Learning. The code contains basic routines for manipulating
discrete variable distributions, along with more limited support for continuous variables. In addition there
are many hard-coded standard Machine Learning algorithms. The website also contains a complete list of
all the teaching demos and related exercise material.

BRMLTOOLKIT
Graph Theory
ancestors           - Return the ancestors of nodes x in DAG A
ancestralorder      - Return the ancestral order of the DAG A (oldest first)
descendents         - Return the descendents of nodes x in DAG A
children            - Return the children of variable x given adjacency matrix A
edges               - Return edge list from adjacency matrix A
elimtri             - Return a variable elimination sequence for a triangulated graph
connectedComponents - Find the connected components of an adjacency matrix
istree              - Check if graph is singly-connected
neigh               - Find the neighbours of vertex v on a graph with adjacency matrix G
noselfpath          - Return a path excluding self transitions
parents             - Return the parents of variable x given adjacency matrix A
spantree            - Find a spanning tree from an edge list
triangulate         - Triangulate adjacency matrix A
triangulatePorder   - Triangulate adjacency matrix A according to a partial ordering
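As a brief, toolbox-independent illustration of the adjacency-matrix convention that the graph routines above operate on (a sketch in plain MATLAB; no toolbox call signatures are assumed):

A = zeros(4);                 % adjacency matrix of a 4-node DAG: A(i,j)=1 means an edge i->j
A(1,2) = 1; A(1,3) = 1;       % edges 1->2 and 1->3
A(3,4) = 1;                   % edge 3->4
pa4 = find(A(:,4))';          % nodes with an edge into node 4, i.e. its parents (here: 3)
ch1 = find(A(1,:));           % nodes that node 1 points to, i.e. its children (here: 2 and 3)

Routines such as parents, children and ancestors operate on adjacency matrices of exactly this form.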
Potential manipulation

condpot             - Return a potential conditioned on another variable
changevar           - Change variable names in a potential
dag                 - Return the adjacency matrix (zeros on diagonal) for a Belief Network
deltapot            - A delta function potential
disptable           - Print the table of a potential
divpots             - Divide potential pota by potb
drawFG              - Draw the Factor Graph A
drawID              - Plot an Influence Diagram
drawJTree           - Plot a Junction Tree
drawNet             - Plot network
evalpot             - Evaluate the table of a potential when variables are set
exppot              - Exponential of a potential
eyepot              - Return a unit potential
grouppot            - Form a potential based on grouping variables together
groupstate          - Find the state of the group variables corresponding to a given ungrouped state
logpot              - Logarithm of the potential
markov              - Return a symmetric adjacency matrix of Markov Network in pot
maxpot              - Maximise a potential over variables
maxsumpot           - Maximise or sum a potential over variables
multpots            - Multiply potentials into a single potential
numstates           - Number of states of the variables in a potential
orderpot            - Return potential with variables reordered according to order
orderpotfields      - Order the fields of the potential, creating blank entries where necessary
potsample           - Draw sample from a single potential
potscontainingonly  - Returns those potential numbers that contain only the required variables
potvariables        - Returns information about all variables in a set of potentials
setevpot            - Sets variables in a potential into evidential states
setpot              - Sets potential variables to specified states
setstate            - Set a potential’s specified joint state to a specified value
squeezepots         - Eliminate redundant potentials (those contained wholly within another)
sumpot              - Sum potential pot over variables
sumpotID            - Return the summed probability and utility tables from an ID
sumpots             - Sum a set of potentials
table               - Return the potential table
ungrouppot          - Form a potential based on ungrouping variables
uniquepots          - Eliminate redundant potentials (those contained wholly within another)
whichpot            - Returns potentials that contain a set of variables
Routines also extend the toolbox to deal with Gaussian potentials:
multpotsGaussianMoment.m, sumpotGaussianCanonical.m, sumpotGaussianMoment.m, multpotsGaussianCanonical.m
See demoSumprodGaussCanon.m, demoSumprodGaussCanonLDS.m, demoSumprodGaussMoment.m
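To give a feel for the kind of computation that routines such as multpots, sumpot and condpot automate, here is a small hand-worked sketch in plain MATLAB (the toolbox potential structures and call signatures are deliberately not assumed here):

tab_a   = [0.3 0.7];                    % a table for p(a), with a in {1,2}
tab_bga = [0.9 0.2; 0.1 0.8];           % a table for p(b|a): rows index b, columns index a
tab_ba  = tab_bga .* repmat(tab_a,2,1); % 'multiply potentials': joint p(b,a) = p(b|a) p(a)
tab_b   = sum(tab_ba,2);                % 'sum a potential over a variable': marginal p(b)
tab_agb = tab_ba ./ repmat(tab_b,1,2);  % 'condition': p(a|b) = p(b,a)/p(b); each row sums to 1

In the toolbox the same steps are expressed by composing the corresponding routines on potential structures, mirroring the factorisation p(b, a) = p(b|a)p(a) directly, in line with the composition philosophy described in the preface.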
Inference

absorb                   - Update potentials in absorption message passing on a Junction Tree
absorption               - Perform full round of absorption on a Junction Tree
absorptionID             - Perform full round of absorption on an Influence Diagram
ancestralsample          - Ancestral sampling from a Belief Network
binaryMRFmap             - Get the MAP assignment for a binary MRF with positive W
bucketelim               - Bucket Elimination on a set of potentials
condindep                - Conditional Independence check using graph of variable interactions
condindepEmp             - Compute the empirical log Bayes Factor and MI for independence/dependence
condindepPot             - Numerical conditional independence measure
condMI                   - Conditional mutual information I(x,y|z) of a potential
FactorConnectingVariable - Factor nodes connecting to a set of variables
FactorGraph              - Returns a Factor Graph adjacency matrix based on potentials
IDvars                   - Probability and decision variables from a partial order
jtassignpot              - Assign potentials to cliques in a Junction Tree
jtree                    - Setup a Junction Tree based on a set of potentials
jtreeID                  - Setup a Junction Tree based on an Influence Diagram
LoopyBP                  - Loopy Belief Propagation using sum-product algorithm
MaxFlow                  - Ford-Fulkerson max flow - min cut algorithm (breadth first search)
maxNpot                  - Find the N most probable values and states in a potential
maxNprodFG               - N-Max-Product algorithm on a Factor Graph (returns the Nmax most probable states)
maxprodFG                - Max-Prod...
MDPemDeterministicPolicy
MDPsolve
MesstoFact
metropolis
mostprobablepath
mostprobablepathmult
sumprodFG
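As a final toolbox-independent illustration, the operation that ancestralsample performs for general Belief Networks can be sketched for the two-variable network p(a)p(b|a) as follows (plain MATLAB; no toolbox data structures or call signatures are assumed):

pa_tab  = [0.3 0.7];                        % p(a)
pba_tab = [0.9 0.2; 0.1 0.8];               % p(b|a): rows index b, columns index a
a = find(rand < cumsum(pa_tab), 1);         % draw a ~ p(a)
b = find(rand < cumsum(pba_tab(:,a)), 1);   % then draw b ~ p(b|a), respecting the parent-to-child order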