Inference in Hybrid Bayesian Networks using Dynamic Discretisation
Martin Neil†‡, Manesh Tailor‡ and David Marquez†
† Department of Computer Science, Queen Mary, University of London
‡ Agena Ltd
Abstract
We consider approximate inference in hybrid Bayesian Networks (BNs) and present a new
iterative algorithm that efficiently combines dynamic discretisation with robust propagation
algorithms on junction tree structures. Our approach significantly extends Bayesian Network theory and practice by
offering a flexible way of modelling continuous nodes in BNs conditioned on complex
configurations of evidence and intermixed with discrete nodes as both parents and children of
continuous nodes. Our algorithm is
implemented in a commercial Bayesian Network software package, AgenaRisk, which allows
model construction and testing to be carried out easily. The results from the empirical trials
clearly show how our software can deal effectively with different types of hybrid models
containing elements of expert judgement as well as statistical inference. In particular, we
show how the rapid convergence of the algorithm towards zones of high probability density
makes robust inference analysis possible even in situations where, due to the lack of
information in both prior and data, robust sampling becomes unfeasible.
Keywords: Bayesian networks; Expert systems; Bayesian software; Reasoning under
uncertainty; Statistical inference; Propagation algorithms; Dynamic discretisation.
1. Introduction
In this paper we present a new and powerful approximate algorithm for performing inference
in hybrid Bayesian Networks (BNs) by a process of dynamic discretisation of the domain of
all continuous variables contained in the network. The approach is influenced by the work of
[Kozlov and Koller, 1997] and like them uses entropy error as the basis for approximation.
We differ from their approach by integrating an iterative approximation scheme within
existing BN software architectures, such as in Junction Tree (JT) propagation [Jensen et al.
1990], thus obviating the need for separate data structures and a new propagation algorithm.
By using the data structure commonly used in JT algorithms, we can apply the standard
operations for evidence multiplication, summation and integration, popular in these
architectures. In our scheme, approximation is employed iteratively on an anytime basis when
evidence is entered into the BN.
The power and flexibility of the approach are demonstrated by comparing it to static
discretisation approaches (used in a number of popular BN software tools such as [Netica,
2005, Hugin, 2005]), and the efficacy of the algorithm is tested on a range of models
occurring in practical applications, namely, hybrid Bayesian networks with a discrete child of a
continuous parent and hybrid Bayesian networks with conditionally deterministic variables
(i.e., a variable that is a deterministic function of its parents). We also consider different types
of hybrid models related to statistical inference problems, e.g., finite mixture of Normal
distributions, Bayesian hierarchical models and generalised linear models. In each case we
compare the results with the analytical solution or a solution generated by stochastic sampling
using Markov Chain Monte Carlo (MCMC) methods [Gilks et al., 1996], as appropriate.
The results from the empirical trials are promising, equalling the accuracy of analytical
solutions and either equalling or surpassing the accuracy of results gained by using Mixtures of
Truncated Exponentials (MTE), stochastic sampling using MCMC, and Fast Fourier Transforms
(FFT). In addition to the models presented here the approach has been
applied to a wide variety of real-world modelling problems requiring hybrid Bayesian models
containing elements of expert judgement as well as statistical inference [Neil et al., 2001, Neil
et al. 2003a, Neil et al. 2003b, Fenton et al. 2002, Fenton et al. 2004]. However for brevity
these results cannot be presented here.
Our dynamic discretisation algorithm is implemented in the commercial general-purpose
Bayesian Network software tool AgenaRisk [Agena 2005]. Likewise, the example models are
built and executed using this software as is the graphical output (marginals and BN graphs)
presented here.
In spite of the significant potential to address inferential tasks on complex hybrid models,
there are some limitations in our algorithm related to the choice of the Hugin architecture as a
platform to compute the marginal distributions. Although this algorithm is intended to
produce junction trees with minimum clique sizes, for some statistical models with
d-converging dependency structures on many unobserved and observed variables, the cliques in
the corresponding junction tree can grow exponentially, making the computation of the
marginal distributions very costly or even impossible.
A brief description of the paper is as follows. In Section 2 we describe the problem and
relevant background research. In Section 3 we give a detailed presentation of our algorithm.
Then, in Section 4, we test the efficacy of the algorithm by conducting a series of analyses on
a set of models. Finally Section 5 contains concluding remarks.
2. Background
Hybrid Bayesian networks (BNs) have been widely used to represent full probability models
in a compact and intuitive way. In the Bayesian network framework the independence
structure (if any) in a joint distribution is characterised by a directed acyclic graph, with nodes
representing random variables, which can be discrete or continuous, and may or may not be
observable, and directed arcs representing causal or influential relationships between variables
[Pearl, 1993]. The conditional independence assertions about the variables, represented by the
lack of arcs, significantly reduce the complexity of inference and allow the underlying joint
probability distribution to be decomposed as a product of local conditional probability
distributions (CPDs), each associated with a node and its respective parents [Spiegelhalter and
Lauritzen 1990, Lauritzen, 1996]. If the variables are discrete, the CPDs can be represented as
Node Probability Tables (NPTs), which list the probability that the child node takes on each of
its different values for each combination of values of its parents.
Since a Bayesian network encodes all relevant qualitative and quantitative information
contained in a full probability model, it provides an excellent tool to perform many types of
probabilistic inference tasks [Whittaker, 1990, Heckerman et al., 1995], consisting mainly of
computing the posterior probability distribution of some variables of interest (unknown
parameters and unobserved data) conditioned on some other variables that have been
observed.
A range of robust and efficient propagation algorithms has been developed for exact inference
on Bayesian networks with discrete variables [Pearl, 1988, Lauritzen and Spiegelhalter, 1988,
Shenoy and Shafer, 1990, Jensen et al, 1990]. The common feature of these algorithms is that
the exact computation of posterior marginals is performed through a series of local
computations over a secondary structure, a tree of clusters, which allows calculating the
marginal without computing the joint distribution. See also [Huang, 1996].
In hybrid Bayesian networks, local exact computations can be performed only under the
assumption of Conditional Gaussian (CG) distributions [Lauritzen and Jensen, 2001]. The
advantages and drawbacks of using Conditional Gaussian distributions are well known. They
are useful to model mixtures of Gaussian variables conditioned on discrete and weighted
combinations of CG parents but they are much too inflexible to support general-purpose
inference over hybrid models containing mixtures of discrete labelled, integer and continuous
types and non-Gaussian distributions. Most real applications demand non-standard high
dimensional statistical models with intermixed continuous and discrete variables, where exact
inference becomes computationally intractable.
The present generation of BN software tools attempts to model continuous nodes by numerical
approximation using static discretisation, as implemented in a number of software tools
[Hugin, 2005, Netica, 2005]. Although discretisation allows approximate inference in a
hybrid BN without limitations on the relationships among continuous and discrete variables,
current software implementations require users to define a uniform discretisation of the states
of any numeric node (whether it is continuous or discrete) as a sequence of pre-defined
intervals, which remain static throughout all subsequent stages of Bayesian inference
regardless of any new conditioning evidence. The more intervals you define, the more
accuracy you can achieve, but at a heavy cost of computational complexity. This is made
worse by the fact that you do not necessarily know in advance where the posterior marginal
distribution will lie on the continuum for all nodes and which ranges require the finer
intervals. It follows that where a model contains numerical nodes having a potentially large
range, results are necessarily only crude approximations.
Alternatives to discretisation have been suggested by [Moral et al, 2001, Cobb and Shenoy,
2005a], who describe potential approximations using mixtures of truncated exponential
(MTE) distributions, [Koller et al., 1999] who combine MTE approximations with direct
sampling (Monte Carlo) methods, and [Murphy, 1999] who uses variational methods. There
have also been some attempts at approximate inference on hybrid BNs using Markov Chain
Monte Carlo (MCMC) approaches [Shachter and Peot, 1989]; however, constructing
dependent samples that mix well (i.e., that move rapidly throughout the support of the
target distribution) remains a complex task.
3. Dynamic Discretisation
Let $X$ be a continuous random node in the BN. The range of $X$ is denoted by $\Omega_X$, and the
probability density function (PDF) of $X$, with support $\Omega_X$, is denoted by $f_X$. The idea of
discretisation is to approximate $f_X$ by, first, partitioning $\Omega_X$ into a set of intervals
$\Psi_X = \{w_j\}$, and second, defining a locally constant function $\tilde f_X$ on the partitioning intervals.
The task consists in finding an optimal discretisation set $\Psi_X = \{w_j\}$ and optimal values for
the discretised probability density function $\tilde f_X$. Discretisation operates in much the same way
when $X$ takes integer values, but in this paper we will focus on the case where $X$ is
continuous.
The approach to dynamic discretisation described here searches $\Omega_X$ for the most accurate
specification of the high-density regions (HDR), given the model and the evidence, by
calculating a sequence of discretisation intervals in $\Omega_X$ iteratively. At each stage in the
iterative process a candidate discretisation, $\Psi_X$, is tested to determine whether the resulting
discretised probability density $\tilde f_X$ has converged to the true probability density $f_X$ within an
acceptable degree of precision. At convergence, $f_X$ is then approximated by $\tilde f_X$.
By dynamically discretising the model we achieve more accuracy in the regions that matter
and require less storage space than static discretisations. Moreover, we can adjust the
discretisation anytime in response to new evidence to achieve greater accuracy.
The approach to dynamic discretisation presented here is influenced by work of Kozlov and
Koller on using non-uniform discretisation in hybrid BNs [Kozlov and Koller, 1997]. A
number of features are introduced in their approach:
1. They apply a multivariate approach to discretise continuous functions on
multidimensional domains, and introduce a new data structure, called Binary Split
Partition (BSP) tree, to represent a recursive binary decomposition of a
multidimensional function.
2. They use the relative entropy or Kullback-Leibler (KL) distance between two density
functions f and g as a metric of the error introduced by discretisation:
$$D(f \,\|\, g) = \int_S f(x) \log \frac{f(x)}{g(x)}\, dx$$

Under this metric, the optimal value for the discretised function $\tilde f$ is given by the
mean of the function $f$ in each of the intervals of the discretised domain.
3. They recommend using a bound on the KL distance as an estimate of the relative
entropy error between a function $f$ and its discretisation $\tilde f$, based on the function
mean $\bar f$, the function maximum $f_{\max}$, and the function minimum $f_{\min}$ in the given
discretisation interval $w_j$:

$$E_j = \left[ \frac{f_{\max} - \bar f}{f_{\max} - f_{\min}}\, f_{\min} \log \frac{f_{\min}}{\bar f} \;+\; \frac{\bar f - f_{\min}}{f_{\max} - f_{\min}}\, f_{\max} \log \frac{f_{\max}}{\bar f} \right] |w_j|$$

where $|w_j|$ denotes the length of the discretisation interval $w_j$ (a small numerical sketch of this bound is given after this list).
4. Evidence propagation uses an extension to standard BN inference algorithms such as
the Junction Tree approach [Jensen et al. 1990]. This extension involves the
propagation of weights between cliques to readjust the discretisation when evidence
lies in low-density regions (LDR). This is in addition to normal message passing
operations.
5. They define a series of new operators for multiplication, summation and integration
of continuous nodes expressed over BSP trees.
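To make the entropy-error bound above concrete, here is a small numerical sketch; this is our own illustration, not code from Kozlov and Koller or from AgenaRisk, and the function name `entropy_error` and the choice of a standard Normal density on [0, 1] are ours.

```python
import math

def entropy_error(f_min, f_max, f_mean, width):
    """Bound E_j on the KL error of replacing a density by its mean value
    f_mean over a single interval of the given width (formula above)."""
    if f_max == f_min:  # density is flat on the interval: no discretisation error
        return 0.0
    term_min = f_min * math.log(f_min / f_mean) if f_min > 0 else 0.0
    term_max = f_max * math.log(f_max / f_mean)
    return ((f_max - f_mean) * term_min + (f_mean - f_min) * term_max) \
        / (f_max - f_min) * width

# Example: standard Normal density on the interval [0, 1]
pdf = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
xs = [i / 100.0 for i in range(101)]
vals = [pdf(x) for x in xs]
f_mean = sum(vals) / len(vals)           # crude numerical mean of f over [0, 1]
print(entropy_error(min(vals), max(vals), f_mean, width=1.0))  # small positive error
```

The bound is zero when the density is flat on the interval and grows with both the interval width and the spread between the minimum and maximum of the density, which is what drives the split decisions described below.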
Our approach to dynamic discretisation is simpler, easier to implement using well-known JT
algorithms, such as [Lauritzen and Spiegelhalter, 1988], and produces very accurate results.
Firstly, we choose to handle univariate partitions (i.e. marginal densities only), which are
simpler to implement, instead of tackling the problem of partitioning joint multivariate
distributions such as cliques. Secondly, we use the data structures commonly used in JT
algorithms making the need to support separate BSP data structures redundant. The advantage
of this is that we can apply the normal operations for evidence multiplication, summation and
integration without change. Finally, we can also apply the normal evidence propagation JT
algorithm with the only change being to perform propagation iteratively on an anytime basis
rather than once.
In outline, dynamic discretisation follows these steps:
1. Choose initial discretisations for all continuous variables.
2. Calculate the discretised CPD of each continuous node given the current discretisation
and propagate evidence through the BN.
3. Query the BN to get posterior marginals for each node and split those intervals with
highest entropy error in each node.
4. Continue to iterate the process, by recalculating the conditional probability densities
and propagating the BN, and then querying to get the marginals and then split intervals
with highest entropy error, until the model converges to an acceptable level of
accuracy.
In order to control the growth of the resulting discretisation sets, Ψ X , after each iteration, we
merge those consecutive intervals in Ψ X with the lowest entropy error or that have zero mass
and zero entropy error. Merging intervals is difficult in practice because of a number of
factors. Firstly we do not necessarily want to merge intervals because they have a zero
relative entropy error, as is the case with uniform distributions, since we want those intervals
to help generate finer grained discretisations in any connected child nodes. Also, we wish to
ensure that we only merge zero mass intervals with zero relative entropy error if they belong
to sequences of zero mass intervals because some zero mass intervals might separate out local
maxima in multimodal distributions. To resolve these issues we therefore apply a number of
heuristics whilst merging.
A key challenge in our approach to dynamic discretisation occurs when some
evidence, X = x , lies in a region of ΩX where temporarily there is no probability mass. This
can occur simply because the model as a whole is far from convergence to an acceptable
solution, when the sampling has not yet generated probability mass in the intervals of
interest. This is a dangerous situation, unremarked by [Kozlov and Koller, 1997], which
only occurs when sampling. If sampling, we solve this problem by postponing the
instantiation of evidence to the interval of interest and in the meantime assign it to the closest
neighbouring interval with the aim of maximising the probability mass in the region closest to
the actual HDR. To avoid the problem altogether we recommend estimating deterministic
functions using mixtures of Uniform distributions, as described in Section 3.2.
Similarly, to enter point-valued evidence into a continuous node $X$, we assign a tolerance
bound around the evidence, namely $\delta(x)$, and instantiate $X$ on the interval
$(x - \delta(x),\, x + \delta(x))$.
We consider here a Bayesian network for a set of random variables, $\mathbf{X}$, and partition $\mathbf{X}$ into
the sets $\mathbf{X}_Q$ and $\mathbf{X}_E$, consisting of the set of query variables and the set of observed
variables, respectively.
3.1. The dynamic discretisation algorithm
Our approach to simulation using dynamic discretisation is based on the following algorithm:
1: Initialise the discretisation, $\Psi_X^{(0)}$, for each continuous variable $X \in \mathbf{X}$.
2: Build a junction tree structure to determine the cliques, $\mathbf{C}$, and sepsets.
3: for l = 1 to max_num_ite
4:    Compute the NPTs, $P^{(l)}(X \mid pa\{X\})$, on $\Psi_X^{(l-1)}$ for all nodes $X \in \mathbf{X}_Q$ that have a new discretisation or that are children of parent nodes that have a new discretisation
5:    Initialise the junction tree by multiplying the NPTs for all nodes into the relevant members of $\mathbf{C}$
6:    Enter evidence, $\mathbf{X}_E = e$, into the junction tree
7:    Perform global propagation on the junction tree
8:    for all nodes $X \in \mathbf{X}_Q$
9:        Marginalise/normalise to get the discretised posterior marginals $P^{(l)}(X \mid \mathbf{X}_E = e)$
10:       Compute the approximate relative entropy error $S_X^{(l)} = \sum_{w_j} E_j$ for $P^{(l)}(X \mid \mathbf{X}_E = e)$ over all intervals $w_j$ in $\Psi_X^{(l-1)}$
11:       If $1 - \alpha \le S_X^{(l-k)} / S_X^{(l-k+1)} \le 1 + \alpha$ for $k = 1, 2, 3$  # Stable-entropy-error stopping rule #
          or $S_X^{(l)} < \beta$  # Low-entropy-error stopping rule #
12:       then stop discretisation for node $X$
13:       else create a new discretisation $\Psi_X^{(l)}$ for node $X$:
14:           Split into two halves the interval $w_j$ in $\Psi_X^{(l-1)}$ with the highest entropy error, $E_j$.
15:           Merge those consecutive intervals in $\Psi_X^{(l-1)}$ with the lowest entropy error or that have zero mass and zero entropy error
16:       end if
17:    end for
18: end for
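The following is a deliberately simplified, single-node sketch of the split loop above. It is our own illustration, not the AgenaRisk implementation: the parameters `max_iter` and `beta` are illustrative, the entropy error is approximated numerically rather than via the closed-form bound, and the density is supplied directly instead of being obtained from junction-tree propagation (steps 4-9).

```python
import math

def interval_error(density, a, b, n=20):
    """Approximate relative-entropy error of replacing `density` by its mean
    value on [a, b] (numerical stand-in for the E_j bound of Section 3)."""
    xs = [a + (b - a) * (i + 0.5) / n for i in range(n)]
    vals = [density(x) for x in xs]
    mean = sum(vals) / n
    if mean == 0:
        return 0.0
    # integral of f(x) log(f(x)/mean) dx, approximated by a midpoint rule
    return sum(v * math.log(v / mean) for v in vals if v > 0) * (b - a) / n

def discretise(density, lo, hi, max_iter=25, beta=1e-3):
    """Sketch of the split loop: repeatedly split the interval with the
    highest entropy error until the total error falls below `beta`."""
    edges = [lo, hi]                       # initial one-interval discretisation
    for _ in range(max_iter):
        errors = [interval_error(density, a, b) for a, b in zip(edges, edges[1:])]
        if sum(errors) < beta:             # low-entropy-error stopping rule
            return edges
        j = errors.index(max(errors))      # split the worst interval in half
        edges.insert(j + 1, 0.5 * (edges[j] + edges[j + 1]))
    return edges

# Example: discretise a standard Normal density on [-5, 5]
pdf = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
print(discretise(pdf, -5.0, 5.0))
```

The merge step (line 15) and the stable-entropy-error stopping rule are omitted here for brevity; in the full algorithm they control the growth of the discretisation sets as described above.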
3.2. Estimating deterministic functions using mixtures of Uniform distributions
Once a discretisation has been defined at each step in the algorithm we need to calculate the
marginal probability for all $X$ in the model, by marginalisation from the conditional
distribution of $X$ given its parents, $p(X \mid pa\{X\})$. For standard continuous and discrete
density functions this does not represent a problem, but for more complex conditional
distributions approximation techniques need to be used.
Consider, for instance, the case in which the conditional distribution $p(X \mid pa\{X\})$ involves
a deterministic function of random variables, e.g. $X = f(pa\{X\})$. For differentiable
functions of discrete parents, it is easy to obtain a closed expression for the marginal probability
of $X$ as a function of the joint probability of the parents. In a more general framework, a
simple method for generating the local conditional probability table $p(X \mid pa\{X\})$, commonly
used under the static discretisation approach, proceeds by first sampling values from each
parent interval in $\Omega_{pa\{X\}}$ for all parents of $X$ and calculating the result $X = f(pa\{X\})$, then
counting the frequencies with which the results fall within the static bins predefined for $X$,
and finally normalising the NPT.
Although simple, this procedure is flawed. On the one hand, there is no guarantee that every
bin in $\Omega_X$ will contain probability density if the parents' node values are under-sampled.
The implication of this is that some regions of $\Omega_X$ might be void; they should have
probability mass but do not. Any subsequent inference in the BN will then return an
inconsistency when it encounters either a valid observation in a zero mass interval in $X$ or
attempts inference involving $X$. The only way to counter this under static discretisation is to
generate a large number of samples, which is expensive and made more difficult by the fact
that the sampling configuration settings in tools that use the static approach are inaccessible.
On the other hand, samples from each parent interval in $\Omega_{pa\{X\}}$ are usually taken uniformly
such that at least two samples are taken for each interval in $\Omega_{pa\{X\}}$. As the number of parent
nodes increases, and the states in $\Omega_X$ and $\Omega_{pa\{X\}}$ increase, the number of cells in the NPT,
$p(X \mid pa\{X\})$, increases exponentially.
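As a toy illustration of this failure mode (our own, not from the paper; numpy is assumed to be available), the snippet below builds a histogram for Z = X + Y from deliberately few samples, using the same distributions as the example that follows, and counts how many fixed-width bins end up with zero mass.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.0, 10.0, size=200)    # X ~ N(10, 100): mean 10, sd 10, deliberately few samples
y = rng.normal(10.0, 10.0, size=200)    # Y ~ N(10, 100)
z = x + y                               # Z = X + Y, true density is positive everywhere

counts, edges = np.histogram(z, bins=np.linspace(-40.0, 80.0, 241))  # 0.5-wide static bins
empty = (counts == 0).sum()
print(f"{empty} of {len(counts)} bins received no samples")  # many zero-mass bins
```

Any of those empty bins would behave exactly like the void interval in Figure 1: an observation falling into one of them would be reported as inconsistent even though it is perfectly plausible under the true model.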
Consider, for example, $Z = X + Y$ with $X, Y \sim N(10, 100)$. Here
$p(X, Y, Z) = p(Z \mid X, Y)\, p(X)\, p(Y)$ with $Z = f(X, Y)$. The resulting marginal distribution
$p(Z)$ using Hugin is shown in Figure 1 and we can clearly see that the interval ]80, 80.1]
has been under-sampled, resulting in a zero mass interval. Should we actually observe
evidence $Z = 80.05$ then we will obtain a falsely inconsistent result and any attempt at
inference about the parents of $Z$ will stall.
Figure 1: p ( Z ) with zero mass in the interval ]80, 80.1]
To avoid this issue we resolve all deterministic functions by modelling them as an
approximate mixture of Uniform distributions. This involves taking the upper and lower
bounds of each interval in $\Psi_{pa\{X\}}^{(l)}$, calculating all values of the deterministic function at
these bounds, and then taking the resulting minimum and maximum values as the parameters of
the Uniform distribution. Under dynamic discretisation an increasing number of intervals are produced,
resulting in many more interval combinations, which has the effect of fitting a histogram
composed of Uniform distributions to the continuous function, resulting in a piecewise
continuous function with no voids.
For instance, with $Z = X + Y$ we take the boundaries of each interval from $X$ and $Y$ and
calculate the conditional probability for $Z$ from these:

$$p(Z \mid X \in [x_l, x_u],\, Y \in [y_l, y_u]) = U\big(\min(x_l + y_l,\, x_l + y_u,\, x_u + y_l,\, x_u + y_u),\ \max(x_l + y_l,\, x_l + y_u,\, x_u + y_l,\, x_u + y_u)\big)$$
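A minimal sketch of this construction (our own illustration; the function name `uniform_cell` is ours): each cell of the child's NPT is a Uniform distribution whose bounds are the minimum and maximum of the deterministic function evaluated at the corners of the parent intervals, which is exact for monotone functions such as addition.

```python
from itertools import product

def uniform_cell(f, x_interval, y_interval):
    """Parameters (lower, upper) of the Uniform distribution assigned to
    Z = f(X, Y) when X and Y are restricted to the given intervals."""
    # Evaluate f at every corner of the rectangle of parent intervals; for
    # monotone f (e.g. addition) the extremes occur at the corners.
    corners = [f(x, y) for x, y in product(x_interval, y_interval)]
    return min(corners), max(corners)

# Example: Z = X + Y with X in [1.0, 2.0] and Y in [0.5, 1.5]
print(uniform_cell(lambda x, y: x + y, (1.0, 2.0), (0.5, 1.5)))  # (1.5, 3.5)
```

For non-monotone functions the extremes may lie in the interior of an interval, but as the dynamic discretisation refines the parent intervals the approximation error of the corner evaluation shrinks along with the interval widths.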
4. Empirical Evaluation of Dynamic Discretisation Approach
In this section we study the efficacy of the dynamic discretisation approach in practice by
resolving the following models:
• Normal mixture model
• Hybrid model
• Conditionally deterministic variables
• Statistical inference using a hierarchical normal model
• Statistical inference using a logistic regression model
The first example, Normal mixture distribution, illustrates how the approach can produce
estimates for continuous nodes that have discrete nodes as parents and also illustrates how
multi-modal posterior distributions can be recovered. Here we also illustrate how the iterative
approximation works and illustrate convergence properties of the algorithm.
The second example is a simple hybrid BN consisting of a Conditional Linear Gaussian
model of continuous parents with a discrete child. In the third example we show dynamic
discretisation generating a probability distribution for hybrid BNs with variables that are
deterministic functions of their parents.
It is important to point out that our approach can be used to solve statistical inference
problems on general Bayesian hierarchical models, using both conjugate and non-conjugate
standard prior distributions. To this end the fourth and fifth examples focus on Bayesian
statistical inference using a hierarchical normal model and a logistic regression model respectively.
In each case we compare the solutions under dynamic discretisation with the analytical
equivalent solution or, where this is not possible, with approximate answers using Markov
Chain Monte Carlo (MCMC) methods.
In addition to the examples covered here a very large number of examples covering a wide
variety of predictive and diagnostic problems have been successfully modelled and are
available with the AgenaRisk software.
4.1. Normal Mixture Model
The Normal mixture distribution is an example of a statistical model of continuous nodes that
have discrete nodes as parents. Consider a mixture model with distributions

$$p(X = \text{false}) = p(X = \text{true}) = 0.5$$

$$p(Y \mid X) = \begin{cases} \text{Normal}(\mu_1, \sigma_1^2) & X = \text{false} \\ \text{Normal}(\mu_2, \sigma_2^2) & X = \text{true} \end{cases}$$
The marginal distribution of $Y$ is a mixture of Normal distributions

$$P(Y) = \tfrac{1}{2} N(Y \mid \mu_1, \sigma_1^2) + \tfrac{1}{2} N(Y \mid \mu_2, \sigma_2^2)$$

with mean and variance given by

$$E[Y] = \tfrac{1}{2}(\mu_1 + \mu_2)$$

$$\mathrm{Var}[Y] = \tfrac{1}{2}(\sigma_1^2 + \mu_1^2) + \tfrac{1}{2}(\sigma_2^2 + \mu_2^2) - \tfrac{1}{4}(\mu_1 + \mu_2)^2$$
Figure 2 shows the resulting marginal distribution $p(Y)$ after 25 iterations, for the mixture of
$N(Y \mid 10, 100)$ and $N(Y \mid 50, 10)$, calculated under the static and the dynamic discretisation
approaches. While using the latter approach we are able to recover the exact values for the
mean and variance, $E(Y) = 30$, $\mathrm{Var}(Y) = 455$, the static case produces the approximate
values $\mu_Y = 82.8$ and $\sigma_Y^2 = 12518$, showing clearly just how badly the static discretisation
performs.
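As a quick check (ours, not in the original text), substituting $\mu_1 = 10$, $\sigma_1^2 = 100$, $\mu_2 = 50$, $\sigma_2^2 = 10$ into the formulas above gives

$$E[Y] = \tfrac{1}{2}(10 + 50) = 30, \qquad \mathrm{Var}[Y] = \tfrac{1}{2}(100 + 10^2) + \tfrac{1}{2}(10 + 50^2) - \tfrac{1}{4}(10 + 50)^2 = 100 + 1255 - 900 = 455,$$

which are exactly the values recovered by the dynamic discretisation.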
Figure 2 Comparison of static and dynamic discretisations of the marginal distribution
p (Y ) for the Normal mixture model
To give a clearer insight into the steps involved in the algorithm we now present the resulting
approximations for $p(Y)$ after 2, 4, 6 and 25 iterations, as shown in Figure 3.
Figure 3 Approximation of p (Y ) for Normal mixture problem over 2, 4, 6 and 25
iterations (Graphs show 99.8 percentile range around the median)
After 2 iterations in Figure 3 the following intervals are candidates for splitting: [-100 – 10],
[10 – 55] and [55 – 100], with relative entropy error values of 4.1, 156 and 168 respectively;
thus the interval [55 – 100] is split with highest priority and then [10 – 55]. The interval [-100,
10] is long and thin but has such a small relative entropy value compared to the other intervals
that it has very low priority.
After 4 iterations, the interval [55 – 77.5] has the highest relative entropy error of 18.43. The
next highest relative entropy error, standing at 15.8, corresponds to the interval [32.5 – 55].
Again the interval [-100, 10] is very low priority.
After 6 iterations the multi-modal nature of $p(Y)$ is gradually being revealed and both modes
are competing for attention: [10 – 32.5] and [43.75 – 55] have very close relative entropy
error values at 3.85 and 3.94 respectively. However, [55 – 56.25] gives rise to a higher relative
entropy error of 4.27 and so is the next interval to split.
After 25 iterations we can clearly see a very good approximation to the “true” multi modal
distribution. Notice that the “long tail” interval [-100, -10] has now been split so many times
that it has dropped out of the displayed percentile range for the graph, thus producing an
accurate discretisation in the tail region.
For most problems each node in the model converges relatively quickly, according to one of the
two stopping rules of the dynamic discretisation algorithm. In Figure 4
we show the resulting logarithm of the sum of the relative entropy errors for our
example $p(Y)$ over 20 iterations. The Low-entropy-error stopping rule used a very small
threshold value to ensure the iteration continued up to 50 iterations.
Figure 4: Convergence of p (Y ) over 20 iterations
Clearly, from Figure 4, we can see that the results are highly accurate after as few as 15
iterations and that the sum relative entropy error metric converges nicely. At iteration 15
some intervals merged, resulting in a slight decrease in accuracy. The sum entropy error for
the estimates of $p(Y)$ eventually converges to around $10^{-3}$.
In practice the choice of stopping rule values needs to be traded off against computation time.
For single nodes the increase in computation time is linear, but for larger networks with many
parent nodes the increase is exponential. In this example, using a 3.2 GHz Pentium 4 computer
with 1 GB RAM, 10 iterations took 0.437s and 50 iterations took 1.984s.
Computation times can be significantly improved by refactoring the network to ensure that no
continuous node has more than two parent nodes, thus ensuring the maximum size of the NPTs
is only $n^3$ rather than $n^m$ for $m$ continuous nodes with $n$ states. Refactoring is only possible
for a limited range of NPTs, but most models can potentially benefit from doing so.
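For instance (an illustrative calculation of ours, not from the paper), a continuous node with four parents, each discretised into 20 states, requires an NPT of $20^5 = 3.2$ million cells, whereas chaining the computation through intermediate nodes with at most two parents keeps each table at $20^3 = 8{,}000$ cells.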
4.2. A simple hybrid Bayesian Network
We will use the robot example of [Kozlov and Koller, 1997] to present an application of the
dynamic discretisation algorithm to inference in a hybrid network, in this case a Conditional
Linear Gaussian (CLG) model, where one of the discrete nodes has a continuous parent. We
also illustrate here how the approach can produce accurate estimates in statistical models of
continuous variables whose posterior marginal distributions can vary widely when unlikely
evidence is provided.
Let us assume that a robot can walk randomly on the interval [0, 1]. The position $x$ of the
robot is unknown but we can record it on a number of sensors. We are interested in the
posterior probability of the robot coordinate after reading observations on three sensors,
$p(x_3 \mid o_1, o_2, o_3)$. The BN for the model is shown in Figure 5.
Figure 5 BN for the one-dimensional robot
The unknown coordinates of the robot's position, $x$, are modelled as a CLG network with
two variables: $p(x_3 \mid x_2) \sim N(x_2, 0.01)$ and $p(x_2 \mid x_1) \sim N(x_1, 0.01)$. A non-informative
prior belief is assigned to the first position, $p(x_1) \sim \text{Unif}[0, 1]$. The first two readings, $o_1$ and
$o_2$, are noisy observations of the robot's position, so $p(o \mid x) \sim N(x, 0.01)$. The third
observation is a binary, discrete random variable indicating whether or not the robot is in the
left halfspace $x < 0.5$, and is modelled with a sigmoid function $\left(1 + \exp\{40(x - 0.5)\}\right)^{-1}$.
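To illustrate how a discrete child of a continuous parent is handled once the parent is discretised, the sketch below (ours; the interval choices and the midpoint rule are illustrative) averages the sigmoid observation model over a parent interval to obtain the corresponding NPT entry.

```python
import math

def p_left_given_interval(a, b, n=100):
    """Average of the sigmoid observation model P(o3 = true | x) = 1 / (1 + exp(40(x - 0.5)))
    over the parent interval [a, b], using a midpoint rule."""
    xs = [a + (b - a) * (i + 0.5) / n for i in range(n)]
    return sum(1.0 / (1.0 + math.exp(40.0 * (x - 0.5))) for x in xs) / n

print(p_left_given_interval(0.00, 0.25))  # ~1: robot almost surely in the left half
print(p_left_given_interval(0.45, 0.55))  # ~0.5: interval straddles the threshold
```

As the discretisation of $x$ is refined around 0.5, where the sigmoid changes rapidly, these averaged entries approximate the continuous observation model increasingly well.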
As pointed out by [Kozlov and Koller, 1997, Cobb and Shenoy, 2005], a weakness of
static discretisation is that, as the evidence entered in this type of model becomes more and
more unlikely, the static discrete approximation of the posterior marginal degenerates and the
estimates become very different from the exact answer.
Figure 6 shows how we can obtain good estimates of the posterior marginals with our
algorithm, even when unlikely evidence is entered into the model. As we can see, the results from
AgenaRisk for both likely and unlikely scenarios compare very well to the exact answer.
Here we run three scenarios, each containing different sets of evidence, and compare the
posterior marginal for $x_3$ under each. Scenario 1 corresponds to evidence $o_1 = 0.2$ and
$o_2 = 0.2$, scenario 2 corresponds to evidence $o_1 = 0.2$ and $o_2 = 0.65$, and scenario 3 to
evidence $o_1 = 0.2$ and $o_2 = 0.8$. In all scenarios we set $o_3 = \text{true}$. The mean and variance
statistics for each scenario are listed in Table 1.
Figure 6: Posterior marginal probability for robot position at $x_3$ for different evidence scenarios
Table 1: Posterior moments for variable $x_3$

             Scenario 1    Scenario 2    Scenario 3
E[x3]        0.202         0.498         0.438
Var[x3]      0.013         0.017         0.006

4.3. BNs with conditionally deterministic variables
Here we consider two simple examples of an important class of Bayesian network, namely,
models containing a variable that is a deterministic function of its parents.
Let us first consider the probability distribution for the sum of two independent random
variables $Z = X + Y$, where $X \sim f_X$ and $Y \sim f_Y$, given by the convolution:

$$f_Z(z) = (f_X * f_Y)(z) = \int f_X(x)\, f_Y(z - x)\, dx$$
Calculating such a distribution represents a major challenge for most BN software.
Traditional methods to obtain this function include Fast Fourier Transform (FFT) [Brigham E.
1988] or Monte Carlo simulation. Here we compare an example and solution using
AgenaRisk with the analytical solution produced by convolution of the density functions.
Consider the case $f_X = \text{Uniform}(-2, 2)$ and $f_Y = \text{Triangular}(0, 0, 2)$. The probability density
for $Z = X + Y$ can be obtained analytically by

$$f_Z(z) = \begin{cases} \displaystyle\int_{0}^{2+z} \tfrac{1}{4}\left(1 - \tfrac{y}{2}\right) dy, & -2 \le z \le 0, \\[6pt] \displaystyle\int_{0}^{2} \tfrac{1}{4}\left(1 - \tfrac{y}{2}\right) dy, & 0 \le z \le 2, \\[6pt] \displaystyle\int_{z-2}^{2} \tfrac{1}{4}\left(1 - \tfrac{y}{2}\right) dy, & 2 \le z \le 4. \end{cases}$$

The resulting mean and variance are $E[Z] = 0.667$ and $\mathrm{Var}(Z) = 1.555$.
Using dynamic discretisation, over 40 iterations, results in the set of marginal distributions for
$f_Z = f_X * f_Y$ as shown in Figure 7. The summary statistics are $\mu_Z = 0.667$ and $\sigma_Z^2 = 1.559$,
which are very accurate estimates of the analytical solution.
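As an independent sanity check (ours, not part of the paper; numpy is assumed to be available), a short Monte Carlo simulation of $Z = X + Y$ reproduces these moments:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=1_000_000)          # X ~ Uniform(-2, 2)
y = rng.triangular(0.0, 0.0, 2.0, size=1_000_000)   # Y ~ Triangular(min=0, mode=0, max=2)
z = x + y
print(z.mean(), z.var())  # close to the analytical values 0.667 and 1.555
```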
Figure 7: Marginal distributions from the function $f_Z = f_X * f_Y$ after 40 iterations
The sum entropy error for the estimate of $P(Z)$ is approximately $10^{-6}$.
The second example here, taken from [Cobb and Shenoy, 2005b], illustrates how our
approach can estimate the distribution of variables that are nonlinear deterministic functions
of their continuous parents. The model consists of a variable, $X$, with distribution
$\text{Beta}(2.7, 1.3)$, a variable that is a nonlinear deterministic function of its parent,
$Y = -0.3X^3 + X^2$, and a variable that is a conditionally deterministic function of its parent,
$Z \sim N(2Y + 1, 1)$. Figure 8 shows the posterior marginals for each of the variables before and
after entering the evidence $Z = 0$.
Figure 8 Marginal distributions for X, Y, and Z after 25 iterations. a) Before entering the
evidence; b) after entering the evidence Z = 0
Running the model for 25 iterations results in the summary posterior values given in Table 2,
which are compared with the estimates produced by [Cobb and Shenoy, 2005b] using MTE
approximations to the potentials.
Table 2: Summary posterior values (estimates produced using MTE in brackets)

        Mean               Variance
X       0.6747 (0.6750)    0.0440 (0.4380)
Y       0.3037 (0.3042)    0.0165 (0.0159)
Z       1.6070 (1.6084)    1.0892 (1.0455)

After observing Z = 0
X       0.5890 (0.5942)    0.0481 (0.0480)
Y       0.2511 (0.2560)    0.0174 (0.0167)
As can be observed the results obtained using dynamic discretisation and AgenaRisk compare
very favourably with those achieved using MTE approximations.
4.4. Hierarchical Normal Model
We now present the analysis of a hierarchical model based on the normal distribution. Formally
the hierarchical normal model is described by

$$\{y_{ij}\}_{i=1}^{n_j} \stackrel{iid}{\sim} N(\mu_j, \sigma^2),$$

with conjugate prior distributions for the group means $\mu_j$ given by $N(\mu_0, \sigma_0^2)$ and an Inv-Gamma
distribution with hyperparameters $\alpha, \beta \approx 0$ for the common unknown variance $\sigma^2$.
Figure 9 shows the corresponding graphical model using plate notation [Buntine, 1994].
Figure 9: Hierarchical Normal model
We illustrate this analysis using data from [Gelman et al, 2004] on coagulation times for
blood drawn from randomly drawn animal test subjects on four different diets, A, B, C and D.
We wish to determine whether the treatments are statistically significant.
We assume that the data points, $y_{ij}$, from each diet group, $j = A, B, C, D$, are
independently normally distributed within each group with means $\mu_A, \ldots, \mu_D$ and unknown
common variance $\sigma^2$. The data, $y_{ij}$, are shown in Table 3.
Table 3: Coagulation time data for four groups, A, B, C, D

A: 62, 60, 63, 59
B: 63, 67, 71, 64, 65, 66
C: 68, 66, 71, 67, 68, 68
D: 56, 62, 60, 61, 63, 64, 63, 59
The hierarchical analysis involves assigning a hyperprior density for the hyperparameters
$\mu_0, \sigma_0^2$. In this case a convenient diffuse noninformative hyperprior for the scale parameter,
$\sigma_0^2$, is given by the uniform prior density $P(\sigma_0^2) \propto 1$. We assumed a Uniform
distribution in the range [0, 50] for $\sigma_0^2$ and a diffuse hyperprior $N(0, 10^4)$ for $\mu_0$.
Running the model for 25 iterations results in the summary posterior values given in Table 4,
which are compared with the estimates produced by [Gelman et al., 2004] using Gibbs
sampling.
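As a rough plausibility check (ours, not from the paper), the within-group sample means of the Table 3 data can be computed directly; with diffuse priors the posterior medians of the group means in Table 4 should lie close to these values, shrunk slightly towards the overall mean.

```python
data = {
    "A": [62, 60, 63, 59],
    "B": [63, 67, 71, 64, 65, 66],
    "C": [68, 66, 71, 67, 68, 68],
    "D": [56, 62, 60, 61, 63, 64, 63, 59],
}
for group, ys in data.items():
    print(group, sum(ys) / len(ys))  # sample means: A=61.0, B=66.0, C=68.0, D=61.0
```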
Table 4: Summary posterior values (estimates produced using Gibbs sampling in brackets)

           25%             Median          75%
µ_A        60.5 (60.6)     61.3 (61.3)     62.0 (62.1)
µ_B        65.2 (65.3)     65.8 (65.9)     65.6 (66.6)
µ_C        67.1 (67.1)     67.8 (67.8)     68.4 (68.5)
µ_D        60.7 (60.6)     61.2 (61.1)     61.8 (61.7)
µ          62.5 (62.2)     64.0 (63.9)     65.6 (65.5)
σ²         4.2 (4.84)      5.3 (5.76)      6.8 (7.76)
σ_0²       12.2 (12.96)    20.9 (24.0)     32.6 (57.7)
These results compare very favourably with those achieved using Gibbs sampling; the only
major difference between Gibbs sampling and dynamic discretisation is in the variance estimates,
which can perhaps be attributed to slight differences between the prior distributions chosen.
Figure 10 shows the BN graph model and superimposed marginal posterior distributions
produced within AgenaRisk.
Figure 10: Marginal distributions and BN graph from hierarchical Normal model after
25 iterations (note that for brevity the figure only shows the first two data points for
each class)
The sum entropy error for $\mu$ was 12 and for the variance estimates, $\sigma^2$ and $\sigma_0^2$, it was
approximately $10^{-3}$ (the reader should bear in mind that the entropy error values are not scale
invariant).
4.5. Logistic Regression Model
One of the main advantages of the dynamic discretisation approach used within AgenaRisk is
that, since it targets the highest density regions of the posterior probabilities, the inference
analysis is possible even in situations where there is too little information concerning a
parameter.
We illustrate this using the bioassay example analysed in [Gelman et al. 2004]. It consists of a
nonconjugate logistic regression model for the data shown in Table 5.
Table 5: Drug Trial Data

Dose, x_i (log)    Number of deaths, y_i    Number of animals, n_i
-0.86              0                        5
-0.30              1                        5
-0.05              3                        5
 0.73              5                        5
The logistic regression model is a particular case of the generalized linear model for binary or
binomial data $\{y_i\}_{i=1}^{N} \sim \text{Bin}(n_i, p_i)$, with link function given by the logit transformation of the
probability of success, $\text{logit}(p_i) = \log\frac{p_i}{1 - p_i}$. Such a model is commonly used in acute
toxicity tests or bioassay experiments for the development of chemical compounds, to analyse
the subjects' responses to various doses of the compound. The model of the dose-response
relation is given by the linear equation

$$\text{logit}(p_i) = \log\frac{p_i}{1 - p_i} = \alpha + \beta x_i,$$

where $p_i$ is the probability of a 'positive' outcome for subjects exposed to dose level $x_i$.
The likelihood function for the regression parameters $(\alpha, \beta)$ is given by:

$$L(\alpha, \beta) \propto \prod_i \left( \frac{\exp\{\alpha + \beta x_i\}}{1 + \exp\{\alpha + \beta x_i\}} \right)^{y_i} \left( \frac{1}{1 + \exp\{\alpha + \beta x_i\}} \right)^{n_i - y_i}.$$
In the absence of any prior knowledge about the regression parameters (α , β ) , the use of a
noninformative prior leads to the classical maximum likelihood estimates for α and β ,
which can be obtained using iterative computational procedures, such as the Newton-Raphson
and Fisher-Scoring (or iterative re-weighted least squares) methods [Dobson, 1990].
Gelman et al. use a simple simulation approach, computing the posterior distribution of $\alpha$
and $\beta$ on a grid of points. In order to get an idea of the effective range for the grid, a rough
estimate of the regression parameters is obtained first, by a linear regression of $\text{logit}(y_i / n_i)$ on
$x_i$ for the four data points given in Table 5. Further analysis leads to an approximation of the
posterior density, based on a uniform prior distribution for $(\alpha, \beta)$, on a region sufficiently
large to ensure that the important features of the posterior fall inside the grid.
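A small sketch of that grid computation (our own illustration, not the AgenaRisk or WinBUGS implementation; numpy is assumed and the grid resolution is arbitrary): evaluate the log-likelihood over the suggested range, normalise under the uniform prior, and read off posterior means, which should be broadly comparable with the values reported in Table 6.

```python
import numpy as np

x = np.array([-0.86, -0.30, -0.05, 0.73])   # log dose
y = np.array([0, 1, 3, 5])                  # number of deaths
n = np.array([5, 5, 5, 5])                  # number of animals

alpha = np.linspace(-2.0, 5.0, 200)
beta = np.linspace(-10.0, 40.0, 200)
A, B = np.meshgrid(alpha, beta, indexing="ij")

# Log-likelihood of the logistic regression on the grid (uniform prior)
eta = A[..., None] + B[..., None] * x                     # shape (200, 200, 4)
loglik = (y * eta - n * np.log1p(np.exp(eta))).sum(axis=-1)
post = np.exp(loglik - loglik.max())
post /= post.sum()

print((post * A).sum(), (post * B).sum())   # posterior means of alpha and beta
```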
The model constructed in AgenaRisk is shown in Figure 11 along with the equivalent general
form for the model using the notation of plates.
Figure 11: BN graph from Logistic Regression model. The equivalent plate model is
shown alongside
In addition to the uniform prior distribution on the range [-2, 5] × [-10, 40] suggested by Gelman
et al., we use a noninformative N(0, 1000) prior as a basis for comparison, and model each of
these priors in AgenaRisk using a labelled node "type of prior" to condition each of the
hyperparameters. The resulting marginal distributions are shown in Figure 12.
Figure 12: (a) Marginal posterior distributions for $\alpha$; (b) marginal posterior
distributions for $\beta$. The Uniform prior case is plotted as a histogram and the Gaussian
prior case as a line.
We compare the results obtained with AgenaRisk against the Gibbs sampling estimates
obtained with WinBUGS using the bounded uniform prior. Table 6 shows the posterior
mean estimates for $(\alpha, \beta)$ under each prior obtained in AgenaRisk after 25 iterations,
together with the WinBUGS estimates using the informative uniform prior, produced using a
1,000-update burn-in followed by a further 10,000 updates.
Table 6: Mean estimates for (α, β)

       Uniform Prior    Gaussian Prior    WinBUGS (Uniform)
α      1.452            1.432             1.287
β      13.593           15.584            11.56
The sum entropy errors for the estimates of $(\alpha, \beta)$ are approximately $10^{-2}$ to $10^{-1}$.
Here it is important to point out that for this type of analysis, where there is not enough
information in the data concerning the parameters, robust sampling based on noninformative
'flat' priors becomes infeasible. In effect, trying to analyse these data with WinBUGS, using a
noninformative N(0, 1000) prior, or a uniform prior on a wider range, causes the sampling algorithm
to fail and as a result WinBUGS crashes.
As mentioned before, with AgenaRisk it is possible to obtain reasonably good estimates of
the regression parameters using a "vague" prior, even when there is little information in the
data concerning a parameter. As we can see, the results from AgenaRisk for both cases
compare very well to those produced by WinBUGS in the case of the informative uniform
prior.
5. Concluding Remarks
In this paper we have presented a new approximate inference algorithm for a general class of
hybrid Bayesian Networks. Our algorithm is inspired by the dynamic discretisation approach
suggested by Kozlov and Koller (1997), and like them uses relative entropy to iteratively
adjust the discretisation in response to new evidence, and so achieve more accuracy in the
zones of high posterior density. Our approach though is implemented using the data structures
commonly used in JT algorithms, making it possible to use the natural operations on the
cliques’ potentials, as well as performing propagation iteratively on the junction tree using
standard algorithms.
We have highlighted how our new dynamic discretisation approach overcomes most of the
problems inherent to the static or uniform discretisation approaches adopted by a number of
popular BN software tools. In particular, it addresses the computational inefficiency
caused by supporting too many states to represent the domain, the high level of inaccuracy in
posterior estimates for continuous variables, and the inconsistencies and errors that arise when
evidence is instantiated in areas of the domain that are grossly under-sampled.
Further technical improvements in our algorithm include coping effectively with situations in
which evidence lies in a region where temporarily there is no probability mass, the
introduction of tolerance bounds to enter point-valued evidence into continuous nodes, and
finally the approximation of functions of random variables using mixtures of Uniform
distributions.
The results from the empirical trials clearly show how our software can produce accurate
estimates on different classes of hybrid Bayesian networks appearing in practical applications.
In particular Bayesian networks with discrete children of continuous parents and hybrid
Bayesian networks with variables that are deterministic (linear and non-linear) functions of
their parents are easily modelled. We have shown how our approach can cope with multimodal posterior distributions, as well as models of continuous variables whose posterior
marginal distributions vary widely when unlikely evidence is provided.
We have also shown empirical results that illustrate how our software can deal effectively
with different types of hybrid models related to statistical inference problems, making it a
potential alternative tool for fitting Bayesian statistical models on complex problems in a
powerful and user-friendly way. In particular, we have shown how the rapid convergence of the
algorithm towards zones of high probability density makes robust inference analysis possible
even in situations where, due to the lack of information in both prior and data, robust
sampling becomes unfeasible.
In spite of the significant potential to address inferential tasks on complex hybrid models,
there are some limitations in our algorithm related to the choice of the Hugin architecture as a
platform to compute the marginal distributions. As is well known, the efficiency of the Hugin
architecture depends on the size of the cliques in the associated junction tree. Although this
algorithm is intended to produce junction trees with minimum clique sizes, for some statistical
models with d-converging dependency structures on many unobserved and observed
variables, the cliques in the corresponding junction tree can grow exponentially, making the
computation of the marginal distributions very costly or even impossible.
Although we have successfully addressed inferential tasks on complex (hierarchical) statistical
models with a large number of parameters, with the current implementation we are unable to
solve multiple regression and structural equation modelling problems. This means we need to
look at faster and more efficient approaches to propagation. An extension of our work will
include the Shenoy-Shafer architecture on binary join trees [Shenoy, 1997, Shenoy and Shafer,
1990], designed to reduce the computation involved in the multiplication and marginalisation
of the local conditional distributions, and other methods involving clique factorisation.
Another useful extension that would optimise the use of Bayesian Networks to solve
statistical inference problems is the introduction of plates to represent and manipulate
'replicated nodes' [Buntine, 1994]. Repeated node structures might appear in statistical
models either in the representation of homogeneous data or in the modelling of unobserved
subpopulation parameters (as in the hierarchical models). With the plate representation, a
single indexed node replaces the repeated nodes, and a box, called a plate, indicates that the
enclosed subgraph is duplicated. This not only should give a more compact representation of
the inference model, but also, because of the indexed representation of the nodes, would allow
a more efficient input-output data management. We can see potential benefits in extending
AgenaRisk’ s object-based modelling framework to support this approach.
Acknowledgements
We would like to extend our thanks to the referees for their insightful comments and helpful
suggestions. This paper is based in part on work undertaken on the eXdecide: "Quantified
Risk Assessment and Decision Support for Agile Software Projects" project, EPSRC project
code EP/C005406/1, March 2005 - Feb 2008.
References
Agena Ltd. 2005. AgenaRisk Software Package, www.agenarisk.com
Bernardo J., and Smith A. 1994. Bayesian Theory. John Wiley and Sons, New York.
Brigham E. 1988. Fast Fourier Transform and Its Applications. Prentice Hall; 1st edition.
Buntine W. 1996. A guide to the literature on learning graphical models, IEEE Transactions
on Knowledge and Data Engineering. 8:195-210.
Buntine W. 1994. Operations for Learning with Graphical Models, J. AI Research, 159-225.
Casella G., and George E. I. 1992. Explaining the Gibbs sampler, Am. Stat. 46: 167–174.
Cobb B., and Shenoy P. 2005a. Inference in Hybrid Bayesian Networks with Mixtures of
Truncated Exponentials, University of Kansas School of Business, working paper 294.
Cobb B., and Shenoy P. 2005b. Nonlinear Deterministic Relationships in Bayesian Networks,
In L. Godo (Ed.) ECSQARU, Springer-Verlag Berlin Heidelberg , pp. 27–38,
Dobson A. J. 1990. An introduction to generalized linear models, New York: Chapman &
Hall.
Fenton N., Krause P., and Neil M. 2002. Probabilistic Modelling for Software Quality
Control, Journal of Applied Non-Classical Logics 12(2), pp. 173-188
Fenton N., Marsh W., Neil M., Cates P., Forey S., and Tailor T. 2004. Making Resource
Decisions for Software Projects, 26th International Conference on Software Engineering,
Edinburgh, United Kingdom.
Gelman A., Carlin J. B., Stern H. S., and Rubin D. B. 2004. Bayesian Data Analysis (2nd
Edition), Chapman and Hall, pp. 209 – 302.
Gelfand A., and A. Smith F.M. 1990. Sampling-based approaches to calculating marginal
densities, J. Am. Stat. Asso. 85: 398–409.
Geman S., and Geman D. 1984. Stochastic relaxation, Gibbs distribution and Bayesian
restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6:
721–741.
Gilks W. R., Richardson S., and Spiegelhalter D. J., 1996. Markov chain Monte Carlo in
Practice, Chapman and Hall, London, UK
Heckerman D. 1999. A Tutorial on Learning with Bayesian Networks, Learning in Graphical
Models, M. Jordan, ed. MIT Press, Cambridge, MA.
Heckerman D., Mamdani A., and Wellman M.P. 1995. Real-world applications of Bayesian
networks, Comm. of the ACM, vol. 38, no. 3, pp. 24-68
Huang C., and Darwiche A. 1996. Inference in belief networks: A procedural guide. Int. J.
Approx. Reasoning 15(3): 225-263
Hugin. 2005. www.hugin.com
Jensen F. 1996. An Introduction to Bayesian Networks, Springer.
Jensen F., Lauritzen S.L., and Olesen K. 1990. Bayesian updating in recursive graphical
models by local computations, Computational Statistics Quarterly, 4: 260-282
Koller D., Lerner U., and Angelov D. 1999. A general algorithm for approximate inference
and its applications to Hybrid Bayes Nets, in K.B Laskey and H. Prade (eds.), Proceedings of
the 15th Conference on Uncertainty in Artificial Intelligence, pp. 324–333.
Kozlov A.V., and Koller D. 1997. Nonuniform dynamic discretization in hybrid networks, in
D. Geiger and P.P. Shenoy (eds.), Uncertainty in Artificial Intelligence, 13: 314–325.
Lauritzen S.L. 1996. Graphical Models, Oxford.
Lauritzen S.L., and Jensen F. 2001. Stable local computation with conditional Gaussian
Distributions, Statistics and Computing, 11, 191–203.
Lauritzen S.L., and Spiegelhalter D.J. 1988. Local Computations with Probabilities on
Graphical Structures and their Application to Expert Systems (with discussion), Journal of the
Royal Statistical Society Series B, Vol. 50, No 2, pp.157-224.
Moral S., Rumí R., and Salmerón A. 2001. Mixtures of truncated exponentials in hybrid
Bayesian networks, in P. Besnard and S. Benferhart (eds.), Symbolic and Quantitative
Approaches to Reasoning under Uncertainty, Lecture Notes in Artificial Intelligence, 2143,
156–167.
Murphy K. P. 2001. A Brief Introduction to Graphical Models and Bayesian Networks.
Berkeley, CA: Department of Computer Science, University of California - Berkeley.
Murphy, K. 1999. A variational approximation for Bayesian networks with discrete and
continuous latent variables, in K.B. Laskey and H. Prade (eds.), Uncertainty in Artificial
Intelligence, 15, 467-475.
Netica. 2005. www.norsys.com
Neil M., Fenton N., Forey S., and Harris R. 2001. Using Bayesian Belief Networks to Predict
the Reliability of Military Vehicles, IEE Computing and Control Engineering J 12(1), 11-20
Neil M., Malcolm B., and Shaw R. 2003a. Modelling an Air Traffic Control Environment
Using Bayesian Belief Networks, 21st International System Safety Conference, Ottawa,
Ontario, Canada.
Neil M., Krause P., and Fenton N. 2003b. Software Quality Prediction Using Bayesian
Networks in Software Engineering with Computational Intelligence, (edited by Khoshgoftaar
T. M). The Kluwer International Series in Engineering and Computer Science, Volume 73
Pearl J. 1993. Graphical models, causality, and intervention, Statistical Science, vol 8, no. 3,
pp. 266-273
Spiegelhalter D.J., Thomas A., Best N.G., and Gilks W.R. 1995. BUGS: Bayesian inference
Using Gibbs Sampling, Version 0.50. MRC Biostatistics Unit, Cambridge.
Spiegelhalter D.J., and Lauritzen S.L. 1990. Sequential updating of conditional probabilities
on directed graphical structures, Networks, 20, pp. 579-605.
Shachter R., and Peot M. 1989. Simulation approaches to general probabilistic inference on
belief networks. In Proceedings of the 5th Annual Conference on Uncertainty in AI (UAI),
pages 221-230.
Shenoy P. 1997. Binary join trees for computing marginals in the Shenoy-Shafer architecture,
International Journal of Approximate Reasoning, 17(1), 1–25.
Shenoy P., and Shafer G. 1990. Axioms for probability and belief-function propagation,
Readings in uncertain reasoning, Morgan Kaufmann Publishers Inc, pp. 575-610
Whittaker J. 1990. Graphical Models in Applied Multivariate Statistics, Wiley.