Education
A Primer on Learning in Bayesian Networks
for Computational Biology
Chris J. Needham*, James R. Bradford, Andrew J. Bulpitt, David R. Westhead
Introduction
Bayesian networks (BNs) provide a neat and compact
representation for expressing joint probability distributions
(JPDs) and for inference. They are becoming increasingly
important in the biological sciences for the tasks of inferring
cellular networks [1], modelling protein signalling pathways
[2], systems biology, data integration [3], classification [4], and
genetic data analysis [5]. The representation and use of
probability theory makes BNs suitable for combining domain
knowledge and data, expressing causal relationships, avoiding
overfitting a model to training data, and learning from
incomplete datasets. The probabilistic formalism provides a
natural treatment for the stochastic nature of biological
systems and measurements. This primer aims to introduce
BNs to the computational biologist, focusing on the concepts
behind methods for learning the parameters and structure of
models, at a time when they are becoming the machine
learning method of choice.
There are many applications in biology where we wish to
classify data; for example, gene function prediction. To solve
such problems, a set of rules is required that can be used for
prediction, but often such knowledge is unavailable, or in
practice there turn out to be many exceptions to the rules or
so many rules that this approach produces poor results.
Machine learning approaches often produce better results,
where a large number of examples (the training set) is used to
adapt the parameters of a model that can then be used for
performing predictions or classifications on data. There are
many different types of models that may be required and
many different approaches to training the models, each with
its pros and cons. An excellent overview of the topic can be
found in [6] and [7]. Neural networks, for example, are often
able to learn a model from training data, but it is often
difficult to extract information about the learned model; with
other methods, such information can provide valuable insights
into the data or problem being solved. A common problem in machine
learning is overfitting, where the learned model is too
complex and generalises poorly to unseen data. Increasing
the size of the training dataset may reduce this; however, this
assumes more training data is readily available, which is often
not the case. In addition, it is often important to determine
the uncertainty in the learned model parameters or even in
the choice of model. This primer focuses on the use of BNs,
which offer a solution to these issues. The use of Bayesian
probability theory provides mechanisms for describing
uncertainty and for adapting the number of parameters to
the size of the data. Using a graphical representation provides
a simple way to visualise the structure of a model. Inspection
of models can provide valuable insights into the properties of
the data and allow new models to be produced.