A Tutorial on Learning With Bayesian Networks
David Heckerman
[email protected]
March 1995 (Revised November 1996)
Technical Report MSR-TR-95-06
Microsoft Research, Advanced Technology Division
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052

A companion set of lecture slides is available at ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.ps.

Abstract
A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.

1 Introduction
A Bayesian network is a graphical model for probabilistic relationships among a set of variables. Over the last decade, the Bayesian network has become a popular representation for encoding uncertain expert knowledge in expert systems (Heckerman et al., 1995a). More recently, researchers have developed methods for learning Bayesian networks from data. The techniques that have been developed are new and still evolving, but they have been shown to be remarkably effective for some data-analysis problems.

In this paper, we provide a tutorial on Bayesian networks and associated Bayesian techniques for extracting and encoding knowledge from data. There are numerous representations available for data analysis, including rule bases, decision trees, and artificial neural networks, and there are many techniques for data analysis, such as density estimation, classification, regression, and clustering. So what do Bayesian networks and Bayesian methods have to offer? There are at least four answers.

One, Bayesian networks can readily handle incomplete data sets. For example, consider a classification or regression problem where two of the explanatory or input variables are strongly anti-correlated. This correlation is not a problem for standard supervised learning techniques, provided all inputs are measured in every case. When one of the inputs is not observed, however, most models will produce an inaccurate prediction, because they do not encode the correlation between the input variables. Bayesian networks offer a natural way to encode such dependencies.

Two, Bayesian networks allow one to learn about causal relationships. Learning about causal relationships is important for at least two reasons. The process is useful when we are trying to gain understanding about a problem domain, for example, during exploratory data analysis. In addition, knowledge of causal relationships allows us to make predictions in the presence of interventions.
For example, a marketing analyst may want to know whether or not it is worthwhile to increase exposure of a particular advertisement in order to increase the sales of a product. To answer this question, the analyst can determine whether or not the advertisement is a cause for increased sales, and to what degree. The use of Bayesian networks helps to answer such questions even when no experiment about the effects of increased exposure is available.

Three, Bayesian networks in conjunction with Bayesian statistical techniques facilitate the combination of domain knowledge and data. Anyone who has performed a real-world analysis knows the importance of prior or domain knowledge, especially when data is scarce or expensive. The fact that some commercial systems (i.e., expert systems) can be built from prior knowledge alone is a testament to the power of prior knowledge. Bayesian networks have a causal semantics that makes the encoding of causal prior knowledge particularly straightforward. In addition, Bayesian networks encode the strength of causal relationships with probabilities. Consequently, prior knowledge and data can be combined with well-studied techniques from Bayesian statistics.

Four, Bayesian methods in conjunction with Bayesian networks and other types of models offer an efficient and principled approach for avoiding the overfitting of data. As we shall see, there is no need to hold out some of the available data for testing. Using the Bayesian approach, models can be "smoothed" in such a way that all available data can be used for training.

This tutorial is organized as follows. In Section 2, we discuss the Bayesian interpretation of probability and review methods from Bayesian statistics for combining prior knowledge with data. In Section 3, we describe Bayesian networks and discuss how they can be constructed from prior knowledge alone. In Section 4, we discuss algorithms for probabilistic inference in a Bayesian network.
In Sections 5 and 6, we show how to learn the probabilities in a fixed Bayesian-network structure, and describe techniques for handling incomplete data, including Monte-Carlo methods and the Gaussian approximation. In Sections 7 through 12, we show how to learn both the probabilities and structure of a Bayesian network. Topics discussed include methods for assessing priors for Bayesian-network structure and parameters, and methods for avoiding the overfitting of data, including Monte-Carlo, Laplace, BIC, and MDL approximations. In Sections 13 and 14, we describe the relationships between Bayesian-network techniques and methods for supervised and unsupervised learning. In Section 15, we show how Bayesian networks facilitate the learning of causal relationships. In Section 16, we illustrate techniques discussed in the tutorial using a real-world case study. In Section 17, we give pointers to software and additional literature.

2 The Bayesian Approach to Probability and Statistics
To understand Bayesian networks and associated learning techniques, it is important to understand the Bayesian approach to probability and statistics. In this section, we provide an introduction to the Bayesian approach for those readers familiar only with the classical view.

In a nutshell, the Bayesian probability of an event x is a person's degree of belief in that event. Whereas a classical probability is a physical property of the world (e.g., the probability that a coin will land heads), a Bayesian probability is a property of the person who assigns the probability (e.g., your degree of belief that the coin will land heads). To keep these two concepts of probability distinct, we refer to the classical probability of an event as the true or physical probability of that event, and refer to a degree of belief in an event as a Bayesian or personal probability. Alternatively, when the meaning is clear, we refer to a Bayesian probability simply as a probability.

One important difference between physical probability and personal probability is that, to measure the latter, we do not need repeated trials. For example, imagine the repeated tosses of a sugar cube onto a wet surface. Every time the cube is tossed, its dimensions will change slightly. Thus, although the classical statistician has a hard time measuring the probability that the cube will land with a particular face up, the Bayesian simply restricts his or her attention to the next toss, and assigns a probability. As another example, consider the question: What is the probability that the Chicago Bulls will win the championship in 2001? Here, the classical statistician must remain silent, whereas the Bayesian can assign a probability (and perhaps make a bit of money in the process).

One common criticism of the Bayesian definition of probability is that probabilities seem arbitrary. Why should degrees of belief satisfy the rules of probability? On what scale should probabilities be measured?
In particular, it makes sense to assign a probability of one (zero) to an event that will (not) occur, but what probabilities do we assign to beliefs that are not at the extremes? Not surprisingly, these questions have been studied intensely.

With regard to the first question, many researchers have suggested different sets of properties that should be satisfied by degrees of belief (e.g., Ramsey 1931, Cox 1946, Good 1950, Savage 1954, DeFinetti 1970). It turns out that each set of properties leads to the same rules: the rules of probability. Although each set of properties is in itself compelling, the fact that different sets all lead to the rules of probability provides a particularly strong argument for using probability to measure beliefs.

Figure 1: The probability wheel: a tool for assessing probabilities.

The answer to the question of scale follows from a simple observation: people find it fairly easy to say that two events are equally likely. For example, imagine a simplified wheel of fortune having only two regions (shaded and not shaded), such as the one illustrated in Figure 1. Assuming everything about the wheel is symmetric (except for shading), you should conclude that it is equally likely for the wheel to stop in any one position. From this judgment and the sum rule of probability (probabilities of mutually exclusive and collectively exhaustive events sum to one), it follows that your probability that the wheel will stop in the shaded region is the fraction of the wheel's area that is shaded (in this case, 0.3).

This probability wheel now provides a reference for measuring your probabilities of other events. For example, what is your probability that Al Gore will run on the Democratic ticket in 2000? First, ask yourself the question: Is it more likely that Gore will run or that the wheel when spun will stop in the shaded region? If you think that it is more likely that Gore will run, then imagine another wheel where the shaded region is larger.
If you think that it is more likely that the wheel will stop in the shaded region, then imagine another wheel where the shaded region is smaller. Now, repeat this process until you think that Gore running and the wheel stopping in the shaded region are equally likely. At this point, your probability that Gore will run is just the fraction of the wheel's surface area that is shaded.

In general, the process of measuring a degree of belief is commonly referred to as a probability assessment. The technique for assessment that we have just described is one of many available techniques discussed in the Management Science, Operations Research, and Psychology literature. One problem with probability assessment that is addressed in this literature is that of precision. Can one really say that his or her probability for event x is 0.601 and not 0.599? In most cases, no. Nonetheless, in most cases, probabilities are used to make decisions, and these decisions are not sensitive to small variations in probabilities. Well-established practices of sensitivity analysis help one to know when additional precision is unnecessary (e.g., Howard and Matheson, 1983). Another problem with probability assessment is that of accuracy. For example, recent experiences or the way a question is phrased can lead to assessments that do not reflect a person's true beliefs (Tversky and Kahneman, 1974). Methods for improving accuracy can be found in the decision-analysis literature (e.g., Spetzler et al., 1975).

Now let us turn to the issue of learning with data. To illustrate the Bayesian approach, consider a common thumbtack: one with a round, flat head that can be found in most supermarkets. If we throw the thumbtack up in the air, it will come to rest either on its point (heads) or on its head (tails) [1]. Suppose we flip the thumbtack $N + 1$ times, making sure that the physical properties of the thumbtack and the conditions under which it is flipped remain stable over time.
From the first $N$ observations, we want to determine the probability of heads on the $(N+1)$th toss.

In the classical analysis of this problem, we assert that there is some physical probability of heads, which is unknown. We estimate this physical probability from the $N$ observations using criteria such as low bias and low variance. We then use this estimate as our probability for heads on the $(N+1)$th toss. In the Bayesian approach, we also assert that there is some physical probability of heads, but we encode our uncertainty about this physical probability using (Bayesian) probabilities, and use the rules of probability to compute our probability of heads on the $(N+1)$th toss [2].

To examine the Bayesian analysis of this problem, we need some notation. We denote a variable by an upper-case letter (e.g., $X$, $Y$, $X_i$), and the state or value of a corresponding variable by that same letter in lower case (e.g., $x$, $y$, $x_i$). We denote a set of variables by a bold-face upper-case letter (e.g., $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{X}_i$). We use a corresponding bold-face lower-case letter (e.g., $\mathbf{x}$, $\mathbf{y}$, $\mathbf{x}_i$) to denote an assignment of state or value to each variable in a given set. We say that variable set $\mathbf{X}$ is in configuration $\mathbf{x}$. We use $p(X = x \mid \xi)$ (or $p(x \mid \xi)$ as a shorthand) to denote the probability that $X = x$ of a person with state of information $\xi$. We also use $p(x \mid \xi)$ to denote the probability distribution for $X$ (both mass functions and density functions). Whether $p(x \mid \xi)$ refers to a probability, a probability density, or a probability distribution will be clear from context. We use this notation for probability throughout the paper. A summary of all notation is given at the end of the chapter.

Returning to the thumbtack problem, we define $\theta$ to be a variable [3] whose values
correspond to the possible true values of the physical probability. We sometimes refer to $\theta$ as a parameter. We express the uncertainty about $\theta$ using the probability density function $p(\theta \mid \xi)$. In addition, we use $X_l$ to denote the variable representing the outcome of the $l$th flip, $l = 1, \ldots, N+1$, and $D = \{X_1 = x_1, \ldots, X_N = x_N\}$ to denote the set of our observations. Thus, in Bayesian terms, the thumbtack problem reduces to computing $p(x_{N+1} \mid D, \xi)$ from $p(\theta \mid \xi)$.

To do so, we first use Bayes' rule to obtain the probability distribution for $\theta$ given $D$ and background knowledge $\xi$:

$$p(\theta \mid D, \xi) = \frac{p(\theta \mid \xi)\, p(D \mid \theta, \xi)}{p(D \mid \xi)} \tag{1}$$

where

$$p(D \mid \xi) = \int p(D \mid \theta, \xi)\, p(\theta \mid \xi)\, d\theta \tag{2}$$

Next, we expand the term $p(D \mid \theta, \xi)$. Both Bayesians and classical statisticians agree on this term: it is the likelihood function for binomial sampling. In particular, given the value of $\theta$, the observations in $D$ are mutually independent, and the probability of heads (tails) on any one observation is $\theta$ ($1 - \theta$). Consequently, Equation 1 becomes

$$p(\theta \mid D, \xi) = \frac{p(\theta \mid \xi)\, \theta^{h} (1 - \theta)^{t}}{p(D \mid \xi)} \tag{3}$$

where $h$ and $t$ are the number of heads and tails observed in $D$, respectively. The probability distributions $p(\theta \mid \xi)$ and $p(\theta \mid D, \xi)$ are commonly referred to as the prior and posterior for $\theta$, respectively. The quantities $h$ and $t$ are said to be sufficient statistics for binomial sampling, because they provide a summarization of the data that is sufficient to compute the posterior from the prior.

[1] This example is taken from Howard (1970).
[2] Strictly speaking, a probability belongs to a single person, not a collection of people. Nonetheless, in parts of this discussion, we refer to "our" probability to avoid awkward English.
[3] Bayesians typically refer to $\theta$ as an uncertain variable, because the value of $\theta$ is uncertain. In contrast, classical statisticians often refer to $\theta$ as a random variable. In this text, we refer to $\theta$ and all uncertain/random variables simply as variables.

Finally, we average over the possible values of $\theta$ (using the expansion rule of probability) to determine the probability that the $(N+1)$th toss of the thumbtack will come up heads:

$$p(X_{N+1} = \text{heads} \mid D, \xi) = \int p(X_{N+1} = \text{heads} \mid \theta, \xi)\, p(\theta \mid D, \xi)\, d\theta = \int \theta\, p(\theta \mid D, \xi)\, d\theta \equiv E_{p(\theta \mid D, \xi)}(\theta) \tag{4}$$

where $E_{p(\theta \mid D, \xi)}(\theta)$ denotes the expectation of $\theta$ with respect to the distribution $p(\theta \mid D, \xi)$.

To complete the Bayesian story for this example, we need a method to assess the prior distribution for $\theta$. A common approach, usually adopted for convenience, is to assume that this distribution is a beta distribution:

$$p(\theta \mid \xi) = \mathrm{Beta}(\theta \mid \alpha_h, \alpha_t) \equiv \frac{\Gamma(\alpha)}{\Gamma(\alpha_h)\, \Gamma(\alpha_t)}\, \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1} \tag{5}$$

where $\alpha_h > 0$ and $\alpha_t > 0$ are the parameters of the beta distribution, $\alpha = \alpha_h + \alpha_t$, and $\Gamma(\cdot)$ is the Gamma function, which satisfies $\Gamma(x + 1) = x\,\Gamma(x)$ and $\Gamma(1) = 1$. The quantities $\alpha_h$ and $\alpha_t$ are often referred to as hyperparameters to distinguish them from the parameter $\theta$. The hyperparameters $\alpha_h$ and $\alpha_t$ must be greater than zero so that the distribution can be normalized. Examples of beta distributions are shown in Figure 2.

Figure 2: Several beta distributions. Panels: Beta(1,1), Beta(2,2), Beta(3,2), Beta(19,39).

The beta prior is convenient for several reasons. By Equation 3, the posterior distribution will also be a beta distribution:

$$p(\theta \mid D, \xi) = \frac{\Gamma(\alpha + N)}{\Gamma(\alpha_h + h)\, \Gamma(\alpha_t + t)}\, \theta^{\alpha_h + h - 1} (1 - \theta)^{\alpha_t + t - 1} = \mathrm{Beta}(\theta \mid \alpha_h + h, \alpha_t + t) \tag{6}$$

We say that the set of beta distributions is a conjugate family of distributions for binomial sampling. Also, the expectation of $\theta$ with respect to this distribution has a simple form:

$$\int \theta\, \mathrm{Beta}(\theta \mid \alpha_h, \alpha_t)\, d\theta = \frac{\alpha_h}{\alpha} \tag{7}$$

Hence, given a beta prior, we have a simple expression for the probability of heads in the $(N+1)$th toss:

$$p(X_{N+1} = \text{heads} \mid D, \xi) = \frac{\alpha_h + h}{\alpha + N} \tag{8}$$

Assuming $p(\theta \mid \xi)$ is a beta distribution, it can be assessed in a number of ways. For example, we can assess our probability for heads in the first toss of the thumbtack (e.g., using a probability wheel). Next, we can imagine having seen the outcomes of $k$ flips, and reassess our probability for heads in the next toss. From Equation 8, we have (for $k = 1$)

$$p(X_1 = \text{heads} \mid \xi) = \frac{\alpha_h}{\alpha_h + \alpha_t} \qquad p(X_2 = \text{heads} \mid X_1 = \text{heads}, \xi) = \frac{\alpha_h + 1}{\alpha_h + \alpha_t + 1}$$

Given these probabilities, we can solve for $\alpha_h$ and $\alpha_t$. This assessment technique is known as the method of imagined future data.

Another assessment method is based on Equation 6. This equation says that, if we start with a Beta(0, 0) prior [4] and observe $\alpha_h$ heads and $\alpha_t$ tails, then our posterior (i.e., new
prior) will be a Beta($\alpha_h$, $\alpha_t$) distribution. Recognizing that a Beta(0, 0) prior encodes a state of minimum information, we can assess $\alpha_h$ and $\alpha_t$ by determining the (possibly fractional) number of observations of heads and tails that is equivalent to our actual knowledge about flipping thumbtacks. Alternatively, we can assess $p(X_1 = \text{heads} \mid \xi)$ and $\alpha$, which can be regarded as an equivalent sample size for our current knowledge. This technique is known as the method of equivalent samples. Other techniques for assessing beta distributions are discussed by Winkler (1967) and Chaloner and Duncan (1983).

[4] Technically, the hyperparameters of this prior should be small positive numbers so that $p(\theta \mid \xi)$ can be normalized.

Although the beta prior is convenient, it is not accurate for some problems. For example, suppose we think that the thumbtack may have been purchased at a magic shop. In this case, a more appropriate prior may be a mixture of beta distributions, for example,

$$p(\theta \mid \xi) = 0.4\, \mathrm{Beta}(20, 1) + 0.4\, \mathrm{Beta}(1, 20) + 0.2\, \mathrm{Beta}(2, 2)$$
where 0.4 is our probability that the thumbtack is heavily weighted toward heads (tails). In effect, we have introduced an additional hidden or unobserved variable H, whose states correspond to the three possibilities: (1) thumbtack is biased toward heads, (2) thumbtack is biased toward tails, and (3) thumbtac...
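
As a rough illustration of the thumbtack analysis, the following Python sketch works through the conjugate update (Equation 6), the predictive probability of heads (Equations 7 and 8), the method of imagined future data for k = 1, and prediction under a mixture-of-betas prior. All function names are hypothetical and chosen for this example only. The mixture update is not derived in the text shown here; the sketch assumes the standard result that each component's posterior weight is proportional to its prior weight times the marginal likelihood of the data under that component.

```python
from math import exp, lgamma, log


def beta_posterior(alpha_h, alpha_t, h, t):
    """Equation 6: a Beta(alpha_h, alpha_t) prior plus h heads and t tails
    gives a Beta(alpha_h + h, alpha_t + t) posterior."""
    return alpha_h + h, alpha_t + t


def prob_heads(alpha_h, alpha_t):
    """Equation 7: the mean of Beta(alpha_h, alpha_t), which is the
    probability of heads on the next toss."""
    return alpha_h / (alpha_h + alpha_t)


def assess_from_imagined_data(p1, p2):
    """Method of imagined future data for k = 1 (hypothetical helper).

    p1 = p(X1 = heads), p2 = p(X2 = heads | X1 = heads).
    Solves p1 = alpha_h / alpha and p2 = (alpha_h + 1) / (alpha + 1).
    """
    if not (0.0 < p1 < p2 < 1.0):
        raise ValueError("a beta prior requires 0 < p1 < p2 < 1")
    alpha = (1.0 - p2) / (p2 - p1)  # equivalent sample size
    return p1 * alpha, (1.0 - p1) * alpha


def log_beta_function(a, b):
    """Log of the beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def mixture_prob_heads(components, h, t):
    """Probability of heads on the next toss under a mixture-of-betas prior.

    components: list of (weight, alpha_h, alpha_t) tuples.
    Assumption: posterior weight of a component is proportional to
    weight * B(alpha_h + h, alpha_t + t) / B(alpha_h, alpha_t).
    """
    log_w = [
        log(w) + log_beta_function(a + h, b + t) - log_beta_function(a, b)
        for w, a, b in components
    ]
    shift = max(log_w)  # subtract the maximum for numerical stability
    unnorm = [exp(lw - shift) for lw in log_w]
    total = sum(unnorm)
    return sum(
        (u / total) * prob_heads(*beta_posterior(a, b, h, t))
        for u, (_, a, b) in zip(unnorm, components)
    )


# The magic-shop prior from the text.
magic_shop_prior = [(0.4, 20, 1), (0.4, 1, 20), (0.2, 2, 2)]

if __name__ == "__main__":
    a_h, a_t = beta_posterior(1, 1, h=3, t=2)
    print("posterior:", (a_h, a_t), "p(heads):", prob_heads(a_h, a_t))
    print("assessed prior:", assess_from_imagined_data(0.5, 0.6))
    print("mixture, no data:", mixture_prob_heads(magic_shop_prior, 0, 0))
    print("mixture, 10 heads:", mixture_prob_heads(magic_shop_prior, 10, 0))
```

With no data, the mixture prediction reduces to the prior-weighted average of the component means. After a run of heads, nearly all of the posterior weight moves to the heads-biased component, which a single beta prior with the same mean could not express as sharply.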