Pattern Recognition 42 (2009) 3046 -- 3056
Contents lists available at ScienceDirect: Pattern Recognition
Journal homepage: www.elsevier.com/locate/pr
Learning Bayesian network parameters under incomplete data with
domain knowledge
Wenhui Liao a,∗, Qiang Ji b
a Thomson Reuters, Eagan, MN 55123, USA
b ECSE, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

ARTICLE INFO
Article history:
Received 24 April 2008
Received in revised form 4 February 2009
Accepted 7 April 2009
Keywords:
Bayesian network parameter learning
Missing data
EM algorithm
Facial action unit (AU) recognition
ABSTRACT
Bayesian networks (BNs) have gained increasing attention in recent years. One key issue in Bayesian networks is parameter learning. When training data are incomplete or sparse, or when multiple hidden nodes exist, learning parameters in Bayesian networks becomes extremely difficult. Under these circumstances, the learning algorithms are required to operate in a high-dimensional search space and can easily get trapped among copious local maxima. This paper presents a learning algorithm that incorporates domain knowledge into the learning to regularize the otherwise ill-posed problem, to limit the search space, and to avoid local optima. Unlike conventional approaches, which typically exploit quantitative domain knowledge such as prior probability distributions, our method systematically incorporates qualitative constraints on some of the parameters into the learning process. Specifically, the problem is formulated as a constrained optimization problem, where the objective function is defined as a combination of the likelihood function and penalty functions constructed from the qualitative domain knowledge. A gradient-descent procedure is then systematically integrated with the E-step and M-step of the EM algorithm to estimate the parameters iteratively until convergence. Experiments with both synthetic data and real data for facial action recognition show that our algorithm significantly improves the accuracy of the learned BN parameters over the conventional EM algorithm.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
In recent years, Bayesian networks (BNs) have been increasingly
used in a wide range of applications including computer vision [1],
bioinformatics [2], information retrieval [3], data fusion [4], decision
support systems and others. A BN is a directed acyclic graph (DAG)
that represents a joint probability distribution among a set of variables, where the nodes denote random variables and the links denote the conditional dependencies among variables. The advantages
of BNs can be summarized as their semantic clarity and understandability by humans, the ease of acquisition and incorporation of prior
knowledge, the possibility of causal interpretation of learned models, and the automatic handling of noisy and missing data [5].
∗ Corresponding author. E-mail addresses: wenhui.liao@thomsonreuters.com (W. Liao), qji@ecse.rpi.edu (Q. Ji). doi:10.1016/j.patcog.2009.04.006

In spite of these advantages, one often faces the problem of learning BNs from training data in order to apply BNs to real-world applications. Typically, there are two categories in learning BNs: one is to learn BN parameters when the BN structure is known, and the other is
to learn both the BN structure and its parameters. In this paper, we focus on BN parameter learning, assuming the BN structure is already known. If the training data are complete, learning BN parameters is not difficult; in the real world, however, training data are often incomplete for various reasons. For example, in a BN modeling video surveillance, the training data may be incomplete because of security issues; in a BN modeling customer behavior, the training data may be incomplete because of privacy issues. Sometimes the training data may be complete but sparse, because some events rarely happen or the data for these events are difficult to obtain.
In general, training data can be missing in three ways: missing at random (MAR), missing completely at random (MCAR), and not missing at random (NMAR). MAR means that the probability of missing data on any variable is not related to its particular value, although it could be related to the values of other variables. MCAR means that the missingness of a variable depends neither on the variable itself nor on the values of other variables in the BN; for example, some hidden (latent) nodes never have data. NMAR means that the missingness is not random but depends on the values of the variables.
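The difference between MCAR and MAR can be made concrete with a small masking experiment. The sketch below is illustrative only (it is not part of the paper); the two-variable dataset and the masking probabilities are invented:

```python
import random

random.seed(0)

# Toy dataset: each record holds two binary variables (x1, x2).
data = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(1000)]

# MCAR: x2 is hidden with a fixed probability, regardless of any value.
mcar = [(x1, None if random.random() < 0.3 else x2) for x1, x2 in data]

# MAR: the probability that x2 is hidden depends on the observed x1,
# but not on the value of x2 itself.
mar = [(x1, None if random.random() < (0.5 if x1 == 1 else 0.1) else x2)
       for x1, x2 in data]

def missing_rate(records, given_x1):
    # Fraction of records with x1 == given_x1 whose x2 is missing.
    sub = [x2 for x1, x2 in records if x1 == given_x1]
    return sum(1 for x2 in sub if x2 is None) / len(sub)

# Under MCAR the missing rate of x2 is roughly the same for x1 = 0 and
# x1 = 1; under MAR it differs sharply between the two groups.
print(missing_rate(mcar, 0), missing_rate(mcar, 1))
print(missing_rate(mar, 0), missing_rate(mar, 1))
```

A hidden (latent) node is the extreme MCAR case: its masking probability is 1 for every record, so no sample ever carries its value.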
The majority of the current learning algorithms assume the MAR
property holds for all the incomplete training samples since learning
is easier for MAR than NMAR and MCAR. The classical approaches
include the Expectation Maximization (EM) algorithm [6] and Gibbs
sampling [7]. Other methods have been proposed to overcome the disadvantages of EM and Gibbs sampling. For example, some methods learn the parameters when data are not missing at random, such as the AI&M procedure [8], the RBE algorithm [9], and the maximum entropy method [10,11]; some methods escape local maxima under the assumption of MAR, such as the information-bottleneck EM (IB-EM) algorithm [12] and the data perturbation method [13]; and other methods speed up the learning procedure, such as the generalized conjugate gradient algorithm [14] and online updating rules [15].
When data are missing completely at random, in other words, when several hidden nodes exist, those methods can fail, and the learned parameters may be quite different from the true parameters. In fact, since there are no data for the hidden nodes, learning parameters becomes an ill-posed problem. Thus, prior domain knowledge is needed to regularize the learning problem. In most
domains, at least some information, either from literature or from
domain experts, is available about the model to be constructed. However, many forms of prior knowledge that an expert might have are
difficult to use directly in existing machine learning algorithms.
Therefore, it is important to formalize the knowledge systematically
and incorporate it into the learning. Such domain knowledge can
help regularize the otherwise ill-posed learning problem, reduce the
search space significantly, and help escape local maxima.
This motivates us to propose a Bayesian network learning algorithm for the case when multiple hidden nodes exist, by systematically incorporating domain knowledge into the learning. Instead of using quantitative domain knowledge, which is often hard to obtain, we propose to exploit qualitative domain knowledge. Qualitative domain knowledge imposes approximate constraints on some parameters or on the relationships among some parameters. This kind of qualitative knowledge is often readily available. Specifically, two types of qualitative constraints are considered: the ranges of parameters, and the relative relationships between different parameters. Instead of using the likelihood function as the objective
to maximize during learning, we define the objective function as a
combination of the likelihood function and the penalty functions
constructed from the domain knowledge. Then, a gradient-descent
procedure is systematically integrated with the Expectation step (E-step) and Maximization step (M-step) of the EM algorithm to estimate the parameters iteratively until convergence. The experiments show that the proposed algorithm significantly improves the accuracy of the learned BN parameters over the conventional EM method.
2. Related work
During the past several years, many methods have been proposed to learn BN parameters when data are missing. Two standard
learning algorithms are Gibbs sampling [7] and EM [6]. Gibbs sampling by Geman and Geman [7] is the basic tool of simulation and
can be applied to virtually any graphical model whether the arcs are
directed or not, and whether the variables are continuous or discrete
[16]. It completes the samples by inferring the missing data from the available information and then learns from the completed database (an imputation strategy). Unfortunately, the Gibbs sampling method suffers from convergence problems arising from correlations between successive samples [10]. In addition, it is not effective when data are missing completely at random (e.g., the case of hidden nodes).
The EM algorithm can be regarded as a deterministic version of
Gibbs sampling used to search for the Maximum Likelihood (ML) or
Maximum a Posteriori (MAP) estimate for model parameters [16,6].
However, when there are multiple hidden variables or a large amount
of missing data, EM easily gets trapped in a local maximum. “With data missing massively and systematically, the likelihood function has a number of local maxima and straight maximum likelihood gives results with unsuitably extreme probabilities” [17]. In addition,
EM algorithms are sensitive to the initial starting points. If the initial
starting points are far away from the optimal solution, the learned
parameters are not reliable.
Different methods have been proposed to help avoid local maxima.
Elidan and Friedman [12] propose an information-bottleneck EM
(IB-EM) algorithm to learn the parameters of BNs with hidden
nodes. It treats the learning problem as a tradeoff between two
information-theoretic objectives, where the first one is to make the
hidden nodes uninformative about the identity of specific instances,
and the second one is to make the hidden variables informative
about the observed attributes. However, although IB-EM has a better
performance than the standard EM for some simple BNs, it is actually worse than EM for complex hierarchical models, as shown in [12]. To escape local maxima in learning, Elidan et al. [13] propose a
solution by perturbing training data. Two basic techniques are used
to perturb the weights of the training data: (1) random reweighing,
which randomly samples weight profiles on the training data, and
(2) adversarial reweighing, which updates the weight profiles to explicitly punish the current hypothesis, with the intent of moving the
search quickly to a nearby basin of attraction. Although this approach usually achieves better solutions than EM, it is still a heuristic and not necessarily able to escape local maxima. It is also much slower than the standard EM algorithm.
The previous methods emphasize improving the machine learning techniques, instead of using domain knowledge to help learning.
Since there are no data available for hidden nodes, it is important to
incorporate any available information about these nodes into learning. The methods for constraining the parameters for a BN include
Dirichlet priors, parameter sharing, and qualitative constraints. According to [18], there are several problems with using Dirichlet priors. First, it is impossible to represent even simple equality constraints on the parameters. Second, it is often beyond an expert's capability to specify a full Dirichlet prior over the parameters of a Bayesian network. Parameter sharing, on the other hand, allows parameters of different models to share the same values, i.e., it allows one to impose equality constraints. Parameter sharing methods, however, do not capture more complicated constraints among parameters, such as inequality constraints. In addition, both Dirichlet priors and parameter sharing methods are restricted to sharing
parameters at the level of sharing a whole CPT or CPTs, instead of at
the level of granularity of individual parameters. To overcome these
limitations, others [19–22,18] propose to explicitly exploit qualitative relationships among parameters and systematically incorporate
them into the parameter estimation process.
Druzdzel et al. [19] give formal definitions of several types of qualitative relationships that can hold between nodes in a BN to help specify the CPTs of BNs, including probability intervals, qualitative influences, and qualitative synergies. They express this available information in a canonical form consisting of (in)equalities expressing
constraints on the hyperspace of possible joint probability distributions, and then use this canonical form to derive upper and lower
bounds on any probability of interest. However, the upper and lower
bounds cannot give sufficient insight into how likely a value from
the interval is to be the actual probability.
Wittig and Jameson [20] present a method for integrating formal
statements of qualitative constraints into two learning algorithms,
APN [23,24] and EM. Two types of qualitative influences [19] are considered as constraints for parameters during learning in this method:
(1) a positive influence holds between two variables (X1 , X2 ) if for
any given value of X2 , an increase in the value of X1 will not decrease
the probability that the value of X2 is equal to or greater than that
given value; and (2) a negative influence can be defined analogously.
They show that the accuracy of the BNs learned with the constraints is superior to that of the corresponding BNs learned without them.
Even when data are complete, domain knowledge can help learning significantly, in particular when insufficient data are available.
Altendorf et al. [21] show how to interpret knowledge of qualitative influences, and in particular of monotonicities, as constraints on probability distributions, and how to incorporate this knowledge into BN learning algorithms. Their method assumes that the values of the variables can be totally ordered, and it focuses on learning from complete but sparse data. They show that the additional constraints provided by qualitative monotonicities can improve the performance of BN classifiers, particularly on extremely small training sets. The assumption that the values of the variables can be totally ordered is, however, too restrictive.
Feelders and van der Gaag [22,25] present a method for learning BN parameters with prior knowledge about the signs of influences between variables. The various signs are translated into order constraints on the network parameters, and isotonic regression is then used to compute order-constrained estimates from the available data. They also focus only on learning from complete but sparse data. In addition, they assume all variables are binary.
The constraints analyzed in these papers [20–22] are somewhat restrictive, in the sense that each constraint must involve all parameters in a conditional probability table (CPT). Niculescu [18] considers a different type of domain knowledge for constraining parameter estimates when data are complete but limited. Two types of constraints are used: one is that the sum of several parameters within one conditional probability distribution is bounded by the sum of other parameters in the same distribution; the other is an upper bound on the sum of several parameters in one conditional probability distribution. The constraints have to be twice differentiable, with continuous second derivatives. The learning is formalized as a constrained optimization problem, and closed-form maximum likelihood parameter estimators are derived.
Our learning method also exploits domain knowledge to help parameter learning, especially for BNs with hidden nodes or a large amount of missing data. Unlike other learning methods, the domain knowledge we use has the following features: (1) it does not need to involve all the parameters in a conditional probability table; (2) it can capture relationships between different parameters; (3) it is associated with confidence levels that reflect the confidence of the domain experts; (4) it is easy to assess and define; and (5) new forms of domain knowledge can be easily incorporated into the algorithm. Our algorithm systematically incorporates the domain knowledge and is capable of learning parameters successfully even when a large percentage of the nodes in a Bayesian network are hidden.
3. Learning Bayesian network parameters
3.1. The basic theory
Let G be a BN with nodes $X_1, \ldots, X_n$. If there is a directed arc from $X_i$ to $X_j$, $X_i$ is called a parent of $X_j$, denoted $pa(X_j)$. Given its parent nodes, a node is conditionally independent of its non-descendants. Thus the joint distribution of all the nodes can be written as

$p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid pa(X_i))$   (1)

Each node is associated with several parameters that describe the conditional probability distribution of the random variable given its parents. We use $\theta$ to denote the entire vector of parameter values $\theta_{ijk} = p(x_i^k \mid pa_i^j)$, where $i$ ($i = 1, \ldots, n$) ranges over all the variables in the BN, $j$ ($j = 1, \ldots, q_i$) ranges over all the possible parent configurations of $X_i$, and $k$ ($k = 1, \ldots, r_i$) ranges over all the possible states of $X_i$. Therefore, $x_i^k$ represents the $k$th state of variable $X_i$, and $pa_i^j$ is the $j$th configuration of the parent nodes of $X_i$.
Given the data set $D = \{D_1, \ldots, D_N\}$, where $D_l = \{x_1[l], \ldots, x_n[l]\}$ consists of instances of the BN nodes, the goal of parameter learning is to find the most probable value $\hat{\theta}$ for $\theta$ that can best explain the data set $D$, which is usually quantified by the log-likelihood function $\log p(D \mid \theta)$, denoted $L_D(\theta)$. If $D$ is complete, based on the conditional independence assumption in BNs as well as the assumption that the samples are independent, we obtain

$L_D(\theta) = \log \left\{ \prod_{m=1}^{N} p(x_1[m], \ldots, x_n[m] : \theta) \right\} = \log \left\{ \prod_{i=1}^{n} \prod_{m=1}^{N} p(x_i[m] \mid pa_i[x_i(m)] : \theta) \right\}$   (2)

where $pa_i[x_i(m)]$ denotes the configuration of the parents of $X_i$ in the $m$th sample. With the maximum likelihood estimation (MLE) method, we obtain the parameter $\theta^*$ as

$\theta^* = \arg\max_{\theta} L_D(\theta)$   (3)

However, when the data $D$ are incomplete, Eq. (2) cannot be applied directly anymore. A common method is the EM algorithm [6]. Let $Y = \{Y_1, Y_2, \ldots, Y_N\}$ be the observed data and $Z = \{Z_1, Z_2, \ldots, Z_N\}$ the missing data, with $D_l = Y_l \cup Z_l$. The EM algorithm starts with some initial guess at the maximum likelihood parameters, $\theta^{(0)}$, and then proceeds to iteratively generate successive estimates $\theta^{(1)}, \theta^{(2)}, \ldots$ by repeatedly applying the Expectation step and Maximization step, for $t = 1, 2, \ldots$

E-step: compute the conditional expectation of the log-likelihood function given the observed data $Y$ and the current parameters $\theta^{(t)}$:

$Q(\theta \mid \theta^{(t)}) = E_{\theta^{(t)}}[\log p(D \mid \theta) \mid \theta^{(t)}, Y]$   (4)

M-step: find the new parameters $\theta^{(t+1)}$ that maximize the expected log-likelihood, under the assumption that the distribution found in the E-step is correct:

$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$   (5)

Each iteration is guaranteed not to decrease the likelihood, and the algorithm finally converges to a local maximum of the likelihood function. However, when there are multiple hidden nodes or when a large amount of data is missing, the EM method easily gets stuck in a local maximum. Next, we show how to incorporate domain knowledge into the learning process in order to reduce the search space and to avoid local maxima.
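To make the E/M alternation of Eqs. (4) and (5) concrete, the sketch below runs EM on a toy two-node network A → B with some values of B missing completely at random. It is an illustration only, not the paper's implementation; the network, its parameters, and the masking rate are invented:

```python
import random

random.seed(1)

# Ground-truth parameters of a two-node BN  A -> B  (both binary).
p_a1 = 0.6                        # p(A = 1)
p_b1_given_a = {0: 0.2, 1: 0.8}   # p(B = 1 | A)

def sample():
    a = int(random.random() < p_a1)
    b = int(random.random() < p_b1_given_a[a])
    return a, b

# Sample complete data, then hide B in roughly 40% of the records.
data = [sample() for _ in range(5000)]
observed = [(a, b if random.random() > 0.4 else None) for a, b in data]

# EM: start from a rough guess and alternate E- and M-steps.
theta_a, theta_b = 0.5, {0: 0.5, 1: 0.5}
for _ in range(50):
    # E-step: accumulate expected counts. A record with missing B
    # contributes its posterior p(B = 1 | A, theta), here theta_b[a].
    n_a1 = 0.0
    n_a = {0: 0.0, 1: 0.0}
    n_b1 = {0: 0.0, 1: 0.0}
    for a, b in observed:
        n_a1 += a
        n_a[a] += 1
        n_b1[a] += theta_b[a] if b is None else b
    # M-step: re-estimate the parameters from the expected counts.
    theta_a = n_a1 / len(observed)
    theta_b = {a: n_b1[a] / n_a[a] for a in (0, 1)}

# The estimates settle close to the generating parameters.
print(round(theta_a, 2), round(theta_b[0], 2), round(theta_b[1], 2))
```

With a fully hidden node in place of the partially observed B, the same loop would still run, but the likelihood would have many equivalent maxima and the result would depend heavily on the initial guess, which is exactly the failure mode the constraints of Section 3.2 are meant to address.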
3.2. Qualitative constraints with confidence
In many real-world applications, domain experts have valuable information about model parameters. We consider two types of constraints: type-I concerns the range of a parameter, and type-II concerns the relative relationships ($>$, $<$, $=$) between different parameters. One of our goals is to make the constraints as simple as possible, so that experts can easily formalize their knowledge into them.

The range of a parameter allows domain experts to specify an upper bound and a lower bound for the parameter, instead of defining specific values. Fig. 1 shows a very simple BN, where we assume all the nodes have binary states. The table on the right is the conditional probability table of the node B, representing $p(B \mid A)$, which can be described by four parameters $\theta_{B00}, \theta_{B01}, \theta_{B10}, \theta_{B11}$, where $\theta_{B00} + \theta_{B01} = 1$ and $\theta_{B10} + \theta_{B11} = 1$. The domain experts may not know the specific
values of $\theta_{B00}$ and $\theta_{B10}$, but they can set the ranges for $\theta_{B00}$ and $\theta_{B10}$ as $0.3 < \theta_{B00} < 0.5$ and $0.6 < \theta_{B10} < 0.9$, respectively.

Fig. 1. A simple Bayesian network example and its conditional probability table: $p(B \mid A)$ with entries $\theta_{B00}, \theta_{B01}$ (for $A = 0$) and $\theta_{B10}, \theta_{B11}$ (for $A = 1$).

In addition to assessing the ranges of parameters, the domain experts may also know the relative relationships between some parameters. For each type-II constraint, if the two associated parameters are in the same CPT of a node, we call it an inner-relationship constraint; if the two parameters come from the CPTs of different nodes, we call it an outer-relationship constraint. For example, if the observation $A = 0$ increases the posterior of $B = 0$ the most, we could say that $p(B = 0 \mid A = 0) > p(B = 0 \mid A = 1)$, in other words, $\theta_{B00} > \theta_{B10}$. Obviously, the constraint $\theta_{B00} > \theta_{B10}$ defines an inner-relationship. On the other hand, if the observation $A = 0$ increases the posterior of $B = 0$ more than it increases the posterior of $C = 0$, we get the constraint $\theta_{B00} > \theta_{C00}$, which defines an outer-relationship, since B and C are from two different CPTs.

In real-world applications, such constraints can usually be found by domain experts. For example, assume we use a BN to model human state and its symptoms, such as blinking rate, head tilt rate, and eye gaze distribution. Some symptom may be a stronger indicator of a particular human state than another symptom. This kind of relationship can be captured by the type-II constraints in the BN. Such constraints can often be identified either through subjective experience or through a sensitivity analysis. These constraints look simple, but they are very important for the hidden nodes, for which no data are available. By adding these constraints to the learning, the domain knowledge can be well utilized to obtain parameters that meet the requirements of real-world applications.

Now we formally define the two types of constraints. Let $\Omega_A$ be the set that includes the parameters whose ranges are known based on the domain knowledge. For each $\theta_{ijk} \in \Omega_A$, we define the range as $l_{ijk} \le \theta_{ijk} \le u_{ijk}$; obviously, $l_{ijk} \ge 0$ and $u_{ijk} \le 1$. Let $\Omega_B$ be the set that includes the parameters whose relative relationships are known based on the domain knowledge. For each pair $\theta_{ijk}, \theta_{i'j'k'} \in \Omega_B$, we have $\theta_{ijk} > \theta_{i'j'k'}$ and/or $\theta_{ijk} = \theta_{i'j'k'}$, where $i \ne i'$, or $j \ne j'$, or $k \ne k'$.

However, the domain knowledge may not be reliable all the time. To account for this, we associate confidence levels $\lambda_{ijk}$ and $\lambda_{ijk}^{i'j'k'}$ with each constraint in the sets $\Omega_A$ and $\Omega_B$, respectively. The value of each $\lambda$ is between 0 and 1. If a domain expert is very confident in a constraint, the corresponding value of $\lambda$ is 1; otherwise, $\lambda$ is less than 1 but larger than 0.

3.3. Parameter learning with uncertain qualitative constraints

Now our goal is to find the optimal parameter $\hat{\theta}$ that maximizes the log-likelihood $L_D(\theta)$ given the three constraints below:

Maximize $L_D(\theta)$
Subject to $\sum_k \theta_{ijk} = 1$, $1 \le i \le n$, $1 \le j \le q_i$, $1 \le k \le r_i$;
$l_{ijk} \le \theta_{ijk} \le u_{ijk}$, $\theta_{ijk} \in \Omega_A$;
$\theta_{ijk} \ge \theta_{i'j'k'}$, $\theta_{ijk}, \theta_{i'j'k'} \in \Omega_B$   (6)

The above defines a constrained optimization problem. For each inequality constraint, we define the following penalty functions:

$g(\theta_{ijk}) = [\theta_{ijk} - l_{ijk}]^-$, $\forall \theta_{ijk} \in \Omega_A$   (7)

$\bar{g}(\theta_{ijk}) = [u_{ijk} - \theta_{ijk}]^-$, $\forall \theta_{ijk} \in \Omega_A$   (8)

$g(\theta_{ijk}, \theta_{i'j'k'}) = [\theta_{ijk} - \theta_{i'j'k'}]^-$, $\forall \theta_{ijk}, \theta_{i'j'k'} \in \Omega_B$   (9)

where $[x]^- = \max(0, -x)$.

Therefore, we can rephrase Eq. (6) as follows:

Maximize $J(\theta) = L_D(\theta) - \frac{w_1}{2} \sum_{\theta_{ijk} \in \Omega_A} \lambda_{ijk} [(g(\theta_{ijk}))^2 + (\bar{g}(\theta_{ijk}))^2] - \frac{w_2}{2} \sum_{\theta_{ijk}, \theta_{i'j'k'} \in \Omega_B} \lambda_{ijk}^{i'j'k'} (g(\theta_{ijk}, \theta_{i'j'k'}))^2$
Subject to $\sum_k \theta_{ijk} = 1$   (10)

where $w_1$ and $w_2$ are penalty weights, decided empirically. Obviously, the penalty varies with the confidence level for each constraint.

In order to solve the problem, we first eliminate the constraint $\sum_k \theta_{ijk} = 1$ by reparameterizing $\theta_{ijk}$, so that the new parameters automatically respect the constraint on $\theta_{ijk}$ no matter what their values are. We define a new parameter $\beta_{ijk}$ so that

$\theta_{ijk} \equiv \frac{\exp(\beta_{ijk})}{\sum_{k'=1}^{r_i} \exp(\beta_{ijk'})}$   (11)

In this way, a local maximum w.r.t. $\beta_{ijk}$ is also a local maximum w.r.t. $\theta_{ijk}$, and vice versa. Most importantly, the constraint is automatically satisfied.

In the next step, we need to compute the derivative of $J(\theta)$ w.r.t. $\theta$. Based on [24], $\nabla_{\theta_{ijk}} L_D(\theta)$ can be expressed as follows:

$\nabla_{\theta_{ijk}} L_D(\theta) = \frac{\partial \log \prod_{l=1}^{N} p(D_l \mid \theta)}{\partial \theta_{ijk}} = \sum_{l=1}^{N} \frac{\partial \log p(D_l \mid \theta)}{\partial \theta_{ijk}} = \sum_{l=1}^{N} \frac{\partial p(D_l \mid \theta) / \partial \theta_{ijk}}{p(D_l \mid \theta)}$   (12)

where

$\frac{\partial p(D_l \mid \theta) / \partial \theta_{ijk}}{p(D_l \mid \theta)} = \frac{\frac{\partial}{\partial \theta_{ijk}} \sum_{j',k'} p(D_l \mid x_i^{k'}, pa_i^{j'}, \theta)\, p(x_i^{k'}, pa_i^{j'} \mid \theta)}{p(D_l \mid \theta)} = \frac{\frac{\partial}{\partial \theta_{ijk}} \sum_{j',k'} p(D_l \mid x_i^{k'}, pa_i^{j'}, \theta)\, p(x_i^{k'} \mid pa_i^{j'}, \theta)\, p(pa_i^{j'} \mid \theta)}{p(D_l \mid \theta)} = \frac{p(D_l \mid x_i^{k}, pa_i^{j}, \theta)\, p(pa_i^{j} \mid \theta)}{p(D_l \mid \theta)} = \frac{p(x_i^{k}, pa_i^{j} \mid D_l, \theta)\, p(D_l \mid \theta)\, p(pa_i^{j} \mid \theta)}{p(x_i^{k}, pa_i^{j} \mid \theta)\, p(D_l \mid \theta)} = \frac{p(x_i^{k}, pa_i^{j} \mid D_l, \theta)}{p(x_i^{k} \mid pa_i^{j}, \theta)} = \frac{p(x_i^{k}, pa_i^{j} \mid D_l, \theta)}{\theta_{ijk}}$   (13)
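The penalty functions of Eqs. (7)-(9) and the reparameterization of Eq. (11) translate almost directly into code. The sketch below is a minimal illustration of those definitions (the confidence weights $\lambda$ and penalty weights $w$ of Eq. (10) are omitted); the function names are ours, not the authors':

```python
import math

def neg_part(x):
    # [x]^- = max(0, -x): positive exactly when the constraint is violated.
    return max(0.0, -x)

def range_penalty(theta, lower, upper):
    # Type-I penalty, Eqs. (7)-(8): g = [theta - l]^-, g_bar = [u - theta]^-.
    # In Eq. (10) this term is scaled by w1 and the confidence lambda.
    return neg_part(theta - lower) ** 2 + neg_part(upper - theta) ** 2

def relation_penalty(theta, theta_other):
    # Type-II penalty, Eq. (9), for the constraint theta >= theta_other.
    return neg_part(theta - theta_other) ** 2

def betas_to_thetas(betas):
    # Eq. (11): softmax over one conditional distribution, so the
    # unconstrained betas always yield thetas that sum to one.
    z = sum(math.exp(b) for b in betas)
    return [math.exp(b) / z for b in betas]

print(range_penalty(0.4, 0.3, 0.5))      # inside the range -> 0.0
print(range_penalty(0.2, 0.3, 0.5) > 0)  # below the lower bound -> True
thetas = betas_to_thetas([0.0, 1.0, -1.0])
print(abs(sum(thetas) - 1.0) < 1e-9)     # normalization holds -> True
```

The quadratic form of the penalties keeps the objective differentiable wherever the constraints are active, which is what makes the gradient step in Eq. (19) possible.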
By combining Eqs. (12) and (13), we obtain

$\nabla_{\theta_{ijk}} L_D(\theta) = \sum_{l=1}^{N} \frac{p(x_i^{k}, pa_i^{j} \mid D_l, \theta)}{\theta_{ijk}}$   (14)

Therefore, based on the chain rule,

$\nabla_{\beta_{ijk}} L_D(\theta) = \frac{\partial L_D(\theta)}{\partial \theta_{ijk}} \frac{\partial \theta_{ijk}}{\partial \beta_{ijk}} = \nabla_{\theta_{ijk}} L_D(\theta)(\theta_{ijk} - \theta_{ijk}^2) = \sum_{l=1}^{N} p(x_i^{k}, pa_i^{j} \mid D_l, \theta)(1 - \theta_{ijk})$   (15)

Similarly, for $g(\theta_{ijk})$, $\bar{g}(\theta_{ijk})$, and $g(\theta_{ijk}, \theta_{i'j'k'})$, the derivatives are as follows:

$\nabla_{\beta_{ijk}} g(\theta_{ijk}) = \begin{cases} \theta_{ijk}^2 - \theta_{ijk} & \text{if } \theta_{ijk} \le l_{ijk} \\ 0 & \text{otherwise} \end{cases}$   (16)

$\nabla_{\beta_{ijk}} \bar{g}(\theta_{ijk}) = \begin{cases} \theta_{ijk} - \theta_{ijk}^2 & \text{if } \theta_{ijk} \ge u_{ijk} \\ 0 & \text{otherwise} \end{cases}$   (17)

$\nabla_{\beta_{ijk}} g(\theta_{ijk}, \theta_{i'j'k'}) = \begin{cases} \theta_{ijk}^2 - \theta_{ijk} & \text{if } \theta_{ijk} \le \theta_{i'j'k'},\ i \ne i' \text{ or } j \ne j' \\ \theta_{ijk}^2 - \theta_{ijk} - \theta_{ijk}\theta_{i'j'k'} & \text{if } \theta_{ijk} \le \theta_{i'j'k'},\ i = i',\ j = j' \\ 0 & \text{otherwise} \end{cases}$   (18)

Therefore, the derivative of $J(\theta)$ w.r.t. $\beta_{ijk}$ is

$\nabla_{\beta_{ijk}} J(\theta) = \sum_{l=1}^{N} p(x_i^{k}, pa_i^{j} \mid D_l, \theta)(1 - \theta_{ijk}) - w_1 \lambda_{ijk} [g(\theta_{ijk}) \nabla_{\beta_{ijk}} g(\theta_{ijk}) + \bar{g}(\theta_{ijk}) \nabla_{\beta_{ijk}} \bar{g}(\theta_{ijk})] - w_2 \sum_{B^+(\theta_{ijk})} [\lambda_{ijk}^{i'j'k'} g(\theta_{ijk}, \theta_{i'j'k'}) \nabla_{\beta_{ijk}} g(\theta_{ijk}, \theta_{i'j'k'})] + w_2 \sum_{B^-(\theta_{ijk})} [\lambda_{i'j'k'}^{ijk} g(\theta_{i'j'k'}, \theta_{ijk}) \nabla_{\beta_{ijk}} g(\theta_{i'j'k'}, \theta_{ijk})]$   (19)

where $B^+(\theta_{ijk})$ is the set of the constraints whose first term is $\theta_{ijk}$, and $B^-(\theta_{ijk})$ is the set of the constraints whose second term is $\theta_{ijk}$. Both $B^+(\theta_{ijk})$ and $B^-(\theta_{ijk})$ belong to the set $\Omega_B$.

Now, we are ready to present the constrained EM (CEM) learning algorithm, as summarized in Table 1. The algorithm consists of three steps. The first two steps are the same as the E-step and M-step of the EM algorithm. In the third step, a gradient-based update is used to force the solutions to move towards the direction of reducing constraint violations.

Table 1
A constrained EM (CEM) learning algorithm.

Repeat until convergence:
  Step 1: E-step: compute the conditional expectation of the log-likelihood function based on Eq. (4);
  Step 2: M-step: find the parameter $\theta$ that maximizes the expected log-likelihood based on Eq. (5);
  Step 3: perform the following optimization procedure based on the gradient-descent method:
    $\theta^t = \theta$; map $\theta^t$ to $\beta^t$ based on Eq. (11)
    Repeat until $\Delta\beta^t \to 0$:
      $\Delta\beta^t = 0$
      for each variable $i$, parent configuration $j$, value $k$:
        for each $D_l \in D$:
          $\Delta\beta_{ijk}^{t+1} = \Delta\beta_{ijk}^{t} + p(x_i^k, pa_i^j \mid D_l, \theta^t)$
        $\Delta\beta_{ijk}^{t+1} = \Delta\beta_{ijk}^{t+1}(1 - \theta_{ijk}) + K$
          ($K$ represents the last three terms in Eq. (19))
      $\beta^{t+1} = \beta^{t} + \Delta\beta^{t+1}$
      map $\beta^{t+1}$ to $\theta^{t+1}$ based on Eq. (11)
      $\theta^t = \theta^{t+1}$
    Go to Step 1
Return $\theta$

4. Experiments

In this section, we compare our algorithm to the EM algorithm with synthetic data. We first describe how the testing data are generated, and then demonstrate the results in three scenarios: (1) varying the type-I constraints; (2) varying the type-II constraints; and (3) varying the number of training samples. Furthermore, in the next section, we apply our algorithm to a real-world application.

4.1. Experiment design

In order to compare the CEM algorithm to the EM algorithm, we design the experiments as follows.

Generation of original BNs. Two BNs, as shown in Fig. 2, are created in the experiments. Then 22 instances are created with the same structures as BN1 and BN2, respectively (11 for each), but with different parameters. Two BNs (called the original BNs) are then used as the ground truth for BN1 and BN2, respectively, and the others are to be learned.

Generation of constraints. For each BN, based on the true CPTs, type-I (the range of a parameter) and type-II (the relationship between different parameters) constraints are generated for the node sets $\Omega_A$ and $\Omega_B$, where $\Omega_A$ and $\Omega_B$ vary in the experiments. Specifically, to generate type-I constraints for the nodes in $\Omega_A$, for each true parameter $\theta_{ijk}$ the lower bound is set to $(1 - r)\theta_{ijk}$ and the upper bound is set to $\min(1, (1 + r)\theta_{ijk})$, where $r$ is a ratio ($0 < r < 1$) that varies in the experiments.

Fig. 2. Two BN examples: (a) BN1; (b) BN2. The shaded nodes are hidden nodes; the others are observable nodes.

Type-II constraints can be divided into
two kinds: inner-relationship constraints, which compare two parameters within the same CPT, and outer-relationship constraints, which compare two parameters in different CPTs. For example, in BN1, two parameters in the CPT of node 18 can be associated with an inner-relationship constraint, while a parameter in the CPT of node 18 and a parameter in the CPT of node 19 can be associated with an outer-relationship constraint. Table 2 shows some examples of type-II constraints for BN1. For each parameter $\theta_{abcd}$ in the table, the first two digits ($ab$) of the subscript represent the index of the node, the third digit ($c$) represents the index of the parent configuration, and the last digit ($d$) represents the state index of the node.

Table 2
Some examples of type-II constraints for BN1.

Inner-relationship constraints:
  Node 14: $\theta_{1411} > \theta_{1421}$
  Node 15: $\theta_{1511} > \theta_{1522}$
  Node 20: $\theta_{2011} > \theta_{2021}$
Outer-relationship constraints:
  Node 13 and Node 17: $\theta_{1311} > \theta_{1711}$
  Node 14 and Node 15: $\theta_{1411} > \theta_{1531}$
  Node 18 and Node 19: $\theta_{1821} > \theta_{1941}$
Generation of training data. For each BN, 500 samples are generated based on the true parameters. The values of the hidden nodes
are then removed from the generated samples. With the remaining
samples, we then learn the 20 BNs with randomly assigned initial
parameters, which are required to be different from the true parameters.
Fig. 3. Learning results vs. type-I constraints. The charts on the left are the results for BN1; the charts on the right are the results for BN2. (a), (b) EM vs. CEM when r = 0.4; (c), (d) CEM when r = 0.4 and r = 0.2; (e), (f) average negative log-likelihood for different BNs.
Fig. 4. Learning results vs. type-II constraints for BN1: (a) EM vs. CEM (eight inner-relationship constraints for the hidden nodes are used); (b) EM vs. CEM (eight outer-relationship constraints for the hidden nodes are used); (c) CEM with eight outer-relationship constraints vs. CEM with eight outer-relationship and eight inner-relationship constraints; (d) eight and 16 inner-relationship constraints, respectively; (e) eight and 16 outer-relationship constraints, respectively; (f) CEM with eight outer-relationship and eight inner-relationship constraints vs. CEM with 16 inner-relationship and 16 outer-relationship constraints. BN2 has similar results.
Evaluation of performance. Two criteria are used to evaluate the performance of the learning algorithms. The first compares the estimated CPT of each node to the ground-truth CPT using the Kullback–Leibler (KL) divergence: the smaller the KL-divergence, the closer the estimated CPT is to the true CPT. The second criterion is the negative log-likelihood per sample, which evaluates how well a learned BN fits the testing data as a whole. To compute it, we first generate 500 testing samples from each original BN. Each learned BN is then evaluated on these samples to obtain the average negative log-likelihood. Since it is a negative log-likelihood, the smaller the value, the better the learned BN fits the data.
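As a sketch, the two criteria can be computed as follows (a pure-Python illustration; the function names and the toy CPT values are ours, not from the paper):

```python
import math

def cpt_kl(true_cpt, est_cpt, eps=1e-12):
    """Mean KL divergence between corresponding rows (one row per
    parent configuration) of a node's true and estimated CPTs."""
    kl_rows = []
    for t_row, e_row in zip(true_cpt, est_cpt):
        kl = sum(t * math.log((t + eps) / (e + eps))
                 for t, e in zip(t_row, e_row))
        kl_rows.append(kl)
    return sum(kl_rows) / len(kl_rows)

def avg_neg_log_likelihood(log_likelihoods):
    """Average negative log-likelihood over test samples (smaller is better)."""
    return -sum(log_likelihoods) / len(log_likelihoods)

# Toy check: an exactly recovered CPT has zero divergence.
true_cpt = [[0.7, 0.3], [0.2, 0.8]]
print(cpt_kl(true_cpt, true_cpt))              # ~0.0: identical CPTs
print(cpt_kl(true_cpt, [[0.5, 0.5]] * 2) > 0)  # True: divergence grows with error
```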
4.2. Learning results vs. type-I constraints
In this section, we compare learning performance when type-I constraints vary while type-II constraints are fixed. Specifically, both sets A and B include only hidden nodes. For type-I constraints, r varies from 0.2 to 0.4. For type-II constraints, only the
inner-relationships for two parameters in the CPT of each hidden
node are used as constraints.
Fig. 3 illustrates the results. In Charts (a) through (d), the x-coordinate denotes the node index, and the y-coordinate denotes the KL-divergence. The center of each bar marks the mean, and its height the standard deviation, both obtained from 10 BN instances. Charts (a) and (b) compare EM to CEM when r = 0.4. We
can see that CEM achieves better results in both mean and standard
deviation of KL-divergence than EM for both BNs. In BN1, for the hidden nodes, the average mean decreases from 1.0687 (EM) to 0.2337
(CEM, r =0.4); the average standard deviation decreases from 0.7659
(EM) to 0.2219 (CEM, r = 0.4). In BN2, for the hidden nodes, the average mean decreases from 0.6657 (EM) to 0.1163 (CEM, r = 0.4); the
average standard deviation decreases from 0.5614 (EM) to 0.0969
(CEM, r = 0.4). Specifically, as shown in Charts (a) and (c), for the
hidden node 18 in BN1, the KL-divergence decreases from around
2.1 (EM) to 0.2 (CEM, r = 0.2); for the hidden node 19 in BN1, the
KL-divergence decreases from around 2.2 (EM) to 0.2 (CEM, r = 0.2).
Charts (c) and (d) compare CEM when r varies. As r decreases from
0.4 to 0.2, the performance of CEM is further improved, especially for
the hidden nodes. The negative log-likelihood further confirms that
CEM performs better than EM, as shown in Charts (e) and (f), where
the x-coordinate indicates BN index and the y-coordinate indicates
the negative log-likelihood. The negative log-likelihood from CEM is
smaller than that from EM.
4.3. Learning results vs. type-II constraints
In the second scenario, we change the type-II constraints while fixing the type-I constraints (r is set to 0.4). We observe the learning results
in the following cases: (1) varying the number of inner-relationship
constraints; (2) varying the number of outer-relationship constraints; and (3) combining inner-relationship constraints and
outer-relationship constraints.
Fig. 4 illustrates the results in the three cases. Charts (a) and
(b) compare EM to CEM when eight inner-relationship constraints
and eight outer-relationship constraints are used, respectively. Obviously, CEM always performs better than EM. For example, in Chart
(a), the average mean for the hidden nodes decreases from 1.0687
(EM) to 0.2337 (CEM), the average standard deviation decreases
Fig. 5. Learning results vs. the number of training samples for BN1: (a) average KL-divergence for the observable nodes whose parameters are independent of the hidden nodes (i.e., nodes 2, 3, 4, 9); (b) average KL-divergence for the other observable nodes; (c) average KL-divergence for the hidden nodes; and (d) negative log-likelihood. BN2 shows similar results.
from 0.7659 (EM) to 0.2219 (CEM); and in Chart (b), the average
mean for the hidden nodes decreases from 1.0687 (EM) to 0.2214
(CEM), the average standard deviation decreases from 0.7659 (EM)
to 0.2377 (CEM). Charts (c) through (f) show how the performance
of CEM varies as different numbers of constraints are used. Chart
(c) demonstrates that when both inner-relationship and outer-relationship constraints are used, the performance is better than when only a single type of constraint is used. The average mean for the hidden nodes decreases from 0.2214 to 0.1929, and the average standard deviation decreases from 0.2377 to 0.1763. Charts (d)–(f) show
that the performance of CEM is improved slightly when we double
the same types of constraints.
4.4. Learning results vs. training samples
We now demonstrate how the learning results vary with the number of training samples. In the experiments, we fix the constraints (r = 0.2, with only eight inner-relationship constraints for the hidden nodes) but vary the number of training samples during learning. Fig. 5 shows the results for BN1 only, since BN2 shows similar results. In order to observe how the training samples affect different types of nodes in BN1, we divide the nodes into three groups: group 1 includes the nodes (2, 3, 4, and 9) whose parameters are independent of the hidden nodes; group 2 includes
the other observable nodes; and group 3 includes all the hidden
nodes.
As shown in Chart (a) of Fig. 5, for the nodes in group 1, the KL-divergence decreases as the number of samples increases, and it is very small in all cases, even with only 200 training samples. Both EM and CEM return the same results since the data are complete for all those nodes. In both Charts (b) and (c), the KL-divergence decreases slightly as the number of training samples increases, while the KL-divergence of CEM remains much smaller than that of EM. For the hidden nodes in particular, the KL-divergence with 6000 training samples is only slightly smaller than with 200. This is because the information about the hidden nodes rarely varies as the total number of training samples increases. Therefore, in order to improve the learning results for the hidden nodes, domain knowledge is more important than additional training samples.
5. A case study
In this section, we apply our algorithm to a real-world application
in computer vision: facial action unit (AU) recognition.
5.1. A Bayesian network for facial action unit modeling and recognition
In recent years, a variety of approaches have been proposed to recognize facial expressions. Besides recognizing six basic facial expressions directly, techniques have also been developed to automatically recognize facial action units. According to the Facial Action Coding System (FACS) by Ekman [26], each AU is related to the contraction of a specific set of facial muscles. FACS has been demonstrated to be a
powerful means for representing and characterizing a large number
Table 3
A list of action units: AU1 (inner brow raiser), AU2 (outer brow raiser), AU4 (brow lowerer), AU5 (upper lid raiser), AU6 (cheek raiser), AU7 (lid tightener), AU9 (nose wrinkler), AU12 (lip corner puller), AU15 (lip corner depressor), AU17 (chin raiser), AU23 (lip tightener), AU24 (lip presser), AU25 (lips part), and AU27 (mouth stretch). In Fig. 6, each AU node AUk is paired with a measurement node Ok.
Fig. 6. A Bayesian network for AU modeling. We adapted it from [27]. The unshaded
nodes are measurement nodes.
of facial expressions through the combination of only a small set of
AUs.
Current techniques for recognizing AUs are mostly based on computer vision. But due to the richness, ambiguity, and dynamic nature of facial actions, as well as image uncertainty and individual differences, it is difficult to recognize each AU individually using computer vision techniques alone. They should be combined with additional semantic information to achieve more robust and consistent AU recognition. Fortunately, there are some inherent relationships among AUs, as described in the FACS manual [26]. For example, the alternative rules provided in the FACS manual describe mutually exclusive relationships among AUs. Furthermore, older versions of FACS also include co-occurrence rules, which were "designed to make scoring more deterministic and conservative, and more reliable" [26].
Therefore, instead of recognizing each AU alone, a Bayesian network can be used to model the probabilistic relationships among different AUs; given the BN model, AU recognition can then be performed through probabilistic inference [27]. Fig. 6 illustrates such
a BN to model 14 AUs, and Table 3 summarizes a list of the 14 AUs
and their meanings. They are adapted from [27]. Such a model is
capable of representing the relationships among different AUs in a
coherent and unified hierarchical structure, accounting for uncertainties in the AU recognition process with conditional dependence
links, and providing principled inference and learning techniques to
systematically combine the domain information and statistical information extracted from the data.
To incorporate the AU recognition results from a computer vision
technique, in the BN model, each AU is connected with a measurement node (unshaded node), which encodes the measurement obtained from computer vision techniques. For AU measurement, we
employ a technique similar to the one described in [28]. The output of the technique is a score for each AU, which is subsequently discretized to produce a value for the corresponding AU measurement node.
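The role of an AU–measurement pair can be illustrated with a minimal two-node example: given a discretized measurement O, the belief in the AU is updated by Bayes' rule. All numbers below are hypothetical, for illustration only, not the paper's learned CPTs:

```python
# Hypothetical two-node fragment of the AU model: one AU node with
# one measurement (O) child, both binary (0 = absent, 1 = present).
p_au = [0.8, 0.2]            # prior: p(AU=0), p(AU=1)
p_o_given_au = [[0.9, 0.1],  # p(O=o | AU=0)
                [0.3, 0.7]]  # p(O=o | AU=1)

def posterior(o):
    """p(AU | O=o) by Bayes' rule: joint p(AU, O=o), then normalize."""
    joint = [p_au[a] * p_o_given_au[a][o] for a in (0, 1)]
    z = sum(joint)
    return [j / z for j in joint]

print(posterior(1))  # observing O=1 raises the belief that the AU is present
```

In the full model of Fig. 6, the same computation is carried out jointly over all 14 AU nodes, so evidence on one measurement also influences related AUs through the links between them.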
5.2. AU model parameter learning
Given the BN structure in Fig. 6 and the AU measurements, we then need to parameterize the BN model before AU inference can commence. For this, we need training data. As defined before, a complete training sample requires the true AU label for each AU node and the measurement for each measurement node. However, manually labeling AUs is usually time consuming and difficult. Moreover, the labeling process is highly subjective and therefore prone to human errors. In addition, some AU events rarely happen in the collected data. Therefore, the training data could be incomplete, biased, or sparse for certain AUs. We thus apply our algorithm to learn the BN parameters using only the AU measurements and some domain knowledge, without any AU labeling.
We first generate constraints. For type-I constraints, domain experts are consulted to specify approximate ranges for most of the parameters. Type-II constraints are also constructed from domain-specific knowledge. Specifically, since the measurement accuracy varies across AUs, depending on the computer vision technique used as well as on the difficulty of the AU, we can rank the measurements by their accuracy and then translate such a ranking into outer-relationships between the corresponding measurement nodes. For example, the computer vision technique usually performs better in recognizing AU2 (outer brow raiser) than AU23 (lip tightener), hence we can derive constraints like
p(O2 = 0|AU2 = 0) > p(O23 = 0|AU23 = 0), p(O2 = 1|AU2 = 1) > p(O23 =
1|AU23 = 1), where 0 means an AU is absent, and 1 means an AU is
present.
More type-II constraints can be obtained based on the properties
of different AUs. For example, for AU6 (cheek raiser) and AU12 (lip corner puller), the probability of AU12 being absent given that AU6 is absent is smaller than the probability of AU12 being present given that AU6 is present, i.e., p(AU12 = 0|AU6 = 0) < p(AU12 = 1|AU6 = 1). For AU1
(inner brow raiser), the influence of AU2 (outer brow raiser) on AU1
is larger than the influence of AU5 (upper lid raiser) on AU1, i.e., p(AU1 = 1|AU2 = 1, AU5 = 0) > p(AU1 = 1|AU2 = 0, AU5 = 1). Overall,
we generate one type-I constraint for each parameter, and about 28
type-II constraints for all the parameters. Of course, the number of
constraints depends on the application and the domain knowledge
available for the application.
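Such qualitative constraints can be turned into penalty terms that are added to the likelihood in the constrained objective. The quadratic hinge form below is our illustrative assumption, not necessarily the paper's exact penalty functions:

```python
def range_penalty(theta, lo, hi):
    """Type-I sketch: penalize a parameter estimate outside [lo, hi]."""
    return max(0.0, lo - theta) ** 2 + max(0.0, theta - hi) ** 2

def order_penalty(theta_big, theta_small):
    """Type-II sketch: penalize violations of theta_big > theta_small."""
    return max(0.0, theta_small - theta_big) ** 2

# Encoding p(O2=1|AU2=1) > p(O23=1|AU23=1) with hypothetical estimates:
print(order_penalty(0.9, 0.6))  # constraint satisfied -> 0.0
print(order_penalty(0.5, 0.6))  # constraint violated  -> positive penalty
```

During learning, the gradient of such penalties pushes violating parameters back toward the feasible region at each M-step, while satisfied constraints contribute nothing.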
5.3. AU recognition results
We use 8000 images collected from Cohn and Kanade's DFAT-504 database [29], where 80% are used for training and 20% for testing. We first use MLE to learn parameters from the complete data, which consist of both the true AU labels and the AU
measurements. Then, we use the EM and CEM algorithms to learn
parameters from the incomplete data, which only includes the AU
measurements.
Fig. 7. Comparison of average AU recognition results using the BNs learned from MLE, EM, and CEM, respectively: (a) true skill score; (b) positive rate; and (c) false alarm.
MLE uses complete training data, while EM and CEM use incomplete data that only include AU measurements.
Fig. 7 compares the AU recognition results with BNs learned from
MLE, EM, and CEM in terms of true skill score (the difference between
the positive rate and the false alarm), positive rate, and false alarm.
CEM performs similarly to MLE (based on complete data), but much
better than EM. The average true skill score is only 0.37 for EM, 0.72
for CEM, and 0.76 for MLE. The positive rate increases from 0.69 (EM)
to 0.82 (CEM), and the false alarm decreases from 0.32 (EM) to 0.1
(CEM). EM fails completely on some AUs, such as AU2, AU7, and AU23, whereas CEM achieves fair performance on all the AUs even though it is learned from unlabeled data. This again shows the importance of domain knowledge: CEM is able to fully utilize it for automatic parameter learning. Compared to MLE, which is based on labeled data, CEM achieves comparable performance without using any labeled AUs. This is very encouraging, and may represent a significant step forward in machine learning in general and BN learning in particular.
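The true skill score used above follows directly from its definition, the positive rate minus the false-alarm rate. The counts below are hypothetical, chosen only to illustrate the computation:

```python
def true_skill_score(tp, fn, fp, tn):
    """TSS = positive rate (sensitivity) minus false-alarm rate."""
    positive_rate = tp / (tp + fn)   # fraction of present AUs detected
    false_alarm = fp / (fp + tn)     # fraction of absent AUs falsely detected
    return positive_rate - false_alarm

# E.g., a positive rate of 0.82 and a false alarm of 0.10 give TSS ~0.72.
print(true_skill_score(82, 18, 10, 90))
```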
6. Conclusion and future work
When a large amount of data are missing, or when multiple
hidden nodes exist, learning parameters in Bayesian networks becomes extremely difficult. The learning algorithms are required to
operate in a high-dimensional search space and could easily get
trapped among copious local maxima. We thus present a constrained
EM algorithm to learn Bayesian network parameters when a large
amount of data are missing in the training data. The algorithm
fully utilizes certain qualitative domain knowledge to regularize the
otherwise ill-posed problem, limit the search space, and avoid local maxima. Compared with the quantitative domain knowledge, such as prior probability distributions, typically used by existing methods, qualitative domain knowledge is local (it concerns only some parameters), easy to specify, and does not require strong assumptions.
For many computer vision and pattern recognition problems, data are often hard to acquire while models become increasingly complex. It therefore becomes increasingly important to incorporate human knowledge into the otherwise ill-posed learning process. Our method can solicit simple yet effective qualitative constraints from human experts, and then systematically incorporate them into the learning process. The improvement in learning performance is significant: experiments on both synthetic data and real data for facial action recognition demonstrate that our algorithm improves the accuracy of the learned parameters significantly over the traditional EM algorithm.
The domain knowledge in the current learning algorithm was formalized by two simple types of constraints: the range of a parameter, and relative relationships between different parameters. Although these are very useful, it is possible to introduce more types of constraints into the learning, such as constraints on the sum of several parameters, parameter sharing, etc. More constraints will help further reduce the search space, although they may not be easy for domain experts to specify.
Furthermore, we assumed model structures are known and thus
only focused on learning model parameters. However, in many applications, the structure could be unknown, or it is too difficult for
domain experts to manually construct a complete structure. Learning the BN structure is therefore also necessary. Most current approaches to BN structure learning assume generic prior probabilities
on graph structures, typically encoding a sparseness bias but otherwise expecting no regularities in the learned structures. We believe
that the domain knowledge about model parameters can also help
in learning the model structure.
References
[1] E. Delage, H. Lee, A. Ng, A dynamic Bayesian network model for autonomous
3d reconstruction from a single indoor image, in: Proceedings of the IEEE
International Conference on Computer Vision and Pattern Recognition, 2006.
[2] J.M. Pena, J. Bjorkegren, J. Tegner, Growing Bayesian network models of gene
networks from seed genes, Bioinformatics (2005) 224–229.
[3] L.M. de Campos, J.M. Fernández-Luna, J.F. Huete, Bayesian networks and
information retrieval: an introduction to the special issue, Information
Processing and Management (2004) 727–733.
[4] Y. Zhang, Q. Ji, Active and dynamic information fusion for facial expression
understanding from image sequence, IEEE Transactions on Pattern Analysis and
Machine Intelligence 27 (5) (2005) 699–714.
[5] N. Friedman, M. Goldszmidt, D. Heckerman, S. Russell, Where is the impact of
Bayesian networks in learning? in: International Joint Conference on Artificial
Intelligence, 1997.
[6] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete
data via the EM algorithm, The Royal Statistical Society Series B 39 (1977) 1–38.
[7] S. Geman, D. Geman, Stochastic relaxation, Gibbs distribution and the Bayesian
restoration of images, IEEE Transactions on Pattern Analysis and Machine
Intelligence 6 (1984) 721–741.
[8] M. Jaeger, The AI&M procedure for learning from incomplete data, in:
Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence,
2006, pp. 225–232.
[9] M. Ramoni, P. Sebastiani, Robust learning with missing data, Machine Learning
45 (2) (2001) 147–170.
[10] R.G. Cowell, Parameter learning from incomplete data using maximum entropy
I: principles, Statistical Research Report, vol. 21, 1999.
[11] R.G. Cowell, Parameter learning from incomplete data using maximum entropy
II: application to Bayesian networks, Statistical Research Report, vol. 21, 1999.
[12] G. Elidan, N. Friedman, The information bottleneck EM algorithm, in:
Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence,
2003, pp. 200–209.
[13] G. Elidan, M. Ninio, N. Friedman, D. Schuurmans, Data perturbation for escaping
local maxima in learning, in: Proceedings of the 18th National Conference on
Artificial Intelligence, 2002, pp. 132–139.
[14] B. Thiesson, Accelerated quantification of Bayesian networks with incomplete
data, in: Proceedings of the First International Conference on Knowledge
Discovery and Data Mining, 1995, pp. 306–311.
[15] E. Bauer, D. Koller, Y. Singer, Update rules for parameter estimation in Bayesian
networks, in: Proceedings of the 13th Conference on Uncertainty in Artificial
Intelligence, 1997, pp. 3–13.
[16] W.L. Buntine, Operations for learning with graphical models, Artificial
Intelligence Research 2 (1994) 159–225.
[17] S.T. Lauritzen, The EM algorithm for graphical association models with missing
data, Computational Statistics and Data Analysis 19 (1995) 191–201.
[18] R.S. Niculescu, T.M. Mitchell, R.B. Rao, A theoretical framework for learning
Bayesian networks with parameter inequality constraints, in: Proceedings of
the 20th International Joint Conference on Artificial Intelligence, 2007.
[19] M.J. Druzdzel, L.C. van der Gaag, Elicitation of probabilities for belief networks:
combining qualitative and quantitative information, in: Proceedings of the 11th
Conference on Uncertainty in Artificial Intelligence, 1995, pp. 141–148.
[20] F. Wittig, A. Jameson, Exploiting qualitative knowledge in the learning of
conditional probabilities of Bayesian networks, in: Proceedings of the Sixteenth
Conference on Uncertainty in Artificial Intelligence, 2000, pp. 644–652.
[21] E.E. Altendorf, A.C. Restificar, T.G. Dietterich, Learning from sparse data by exploiting monotonicity constraints, in: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 2005, pp. 18–26.
[22] A. Feelders, L. van der Gaag, Learning Bayesian network parameters with prior knowledge about context-specific qualitative influences, in: Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, 2005, pp. 193–200.
[23] S. Russell, J. Binder, D. Koller, K. Kanazawa, Local learning in probabilistic networks with hidden variables, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995, pp. 1146–1152.
[24] J. Binder, D. Koller, S. Russell, K. Kanazawa, Adaptive probabilistic networks
with hidden variables, Machine Learning (1997) 213–244.
[25] A. Feelders, L. van der Gaag, Learning Bayesian network parameters under order constraints, International Journal of Approximate Reasoning 42 (2006) 37–53.
[26] P. Ekman, W. Friesen, Facial Action Coding System: A Technique for the
Measurement of Facial Movement, Consulting Psychologists Press, Palo Alto,
CA, 1978.
[27] Y. Tong, W. Liao, Q. Ji, Inferring facial action units with causal relations, in:
Proceedings of the IEEE International Conference on Computer Vision and
Pattern Recognition, 2006.
[28] M.S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, J. Movellan,
Recognizing facial expression: machine learning and application to spontaneous
behavior, Proceedings of the IEEE International Conference on Computer Vision
and Pattern Recognition 2 (2005) 568–573.
[29] T. Kanade, J.F. Cohn, Y. Tian, Comprehensive database for facial expression
analysis, in: Proceedings of FG00, 2000.
About the Author—WENHUI LIAO received the PhD degree from the Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy,
New York, in 2006. Her areas of research include probabilistic graphical models, information fusion, computer vision, and natural language processing. She is currently a
research scientist in the R&D division of Thomson Reuters.
About the Author—QIANG JI received the PhD degree in electrical engineering from the University of Washington in 1998. He is currently an associate professor in the
Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute (RPI), Troy, New York. Prior to joining RPI in 2001, he was an assistant professor
in the Department of Computer Science, University of Nevada, Reno. He also held research and visiting positions with Carnegie Mellon University, Western Research, and
the US Air Force Research Laboratory. His research interests include computer vision, probabilistic reasoning with Bayesian networks for decision making and information
fusion, human computer interaction, pattern recognition, and robotics. He has published more than 100 papers in peer-reviewed journals and conferences. His research
has been funded by local and federal government agencies including the US National Science Foundation (NSF), the US National Institute of Health (NIH), the US Air Force
Office of Scientific Research (AFOSR), the US Office of Naval Research (ONR), DARPA, and the US Army Research Office (ARO) and by private companies including Boeing
and Honda. He is a senior member of the IEEE.