Gibbs sampling for the uninitiated

CS-TR-4956
UMIACS-TR-2010-04
LAMP-TR-153
June 2010
Philip Resnik
Department of Linguistics
Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742-3275
resnik AT umd.edu

Eric Hardisty
Department of Computer Science
Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742-3275
hardisty AT cs.umd.edu
Abstract
This document is intended for computer scientists who would like to try out a Markov Chain Monte
Carlo (MCMC) technique, particularly in order to do inference with Bayesian models on problems related
to text processing. We try to keep theory to the absolute minimum needed, though we work through the
details much more explicitly than you usually see even in “introductory” explanations. That means we’ve
attempted to be ridiculously explicit in our exposition and notation.
After providing the reasons and reasoning behind Gibbs sampling (and at least nodding our heads in the
direction of theory), we work through an example application in detail—the derivation of a Gibbs sampler for
a Naïve Bayes model. Along with the example, we discuss some practical implementation issues, including
the integrating out of continuous parameters when possible. We conclude with some pointers to literature
that we’ve found to be somewhat more friendly to uninitiated readers.
Note: as of June 3, 2010 we have corrected some small errors in the original April 2010 report.
Keywords: Gibbs sampling, Markov Chain Monte Carlo, naïve Bayes, Bayesian inference, tutorial
1 Introduction
Markov Chain Monte Carlo (MCMC) techniques like Gibbs sampling provide a principled way to approximate
the value of an integral.
1.1 Why integrals?
Ok, stop right there. Many computer scientists, including a lot of us who focus on natural language processing,
don’t spend a lot of time with integrals. We spend most of our time and energy in a world of discrete events.
(The word bank can mean (1) a financial institution, (2) the side of a river, or (3) tilting an airplane. Which
meaning was intended, based on the words that appear nearby?) Take a look at Manning and Schuetze
[14], and you’ll see that the probabilistic models we use tend to involve sums, not integrals (the Baum-Welch
algorithm for HMMs, for example). So we have to start by asking: why and when do we care about integrals?
One good answer has to do with probability estimation.1 Numerous computational methods involve
estimating the probabilities of alternative discrete choices, often in order to pick the single most probable
choice. As one example, the language model in an automatic speech recognition (ASR) system estimates
the probability of the next word given the previous context. As another example, many spam blockers use
features of the e-mail message (like the word Viagra, or the phrase send this message to all your friends) to
predict the probability that the message is spam.
Sometimes we estimate probabilities by using maximum likelihood estimation (MLE). To use a standard
example, if we are told a coin may be unfair, and we flip it 10 times and see HHHHTTTTTT (H=heads,
T=tails), it’s conventional to estimate the probability of heads for the next flip as 0.4. In practical terms,
MLE amounts to counting and then normalizing so that the probabilities sum to 1.
1 This subsection is built around the very nice explication of Bayesian probability estimation by Heinrich [7].
[Plot of p^4 (1 − p)^6 for p from 0 to 1.]
Figure 1: Probability of generating the coin-flip sequence HHHHTTTTTT, using different values for
P (heads) on the x-axis. The value that maximizes the probability of the observed sequence, 0.4, is the
maximum likelihood estimate (MLE).
\[ \frac{\mathrm{count}(H)}{\mathrm{count}(H) + \mathrm{count}(T)} = \frac{4}{10} = 0.4 \tag{1} \]
Formally, MLE produces the choice most likely to have generated the observed data.
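As a quick sanity check, here is a minimal Python sketch of "counting and then normalizing" for the coin example, cross-checked against a brute-force search over candidate values of P(heads). The grid resolution and the variable names are our own choices for illustration.

```python
from collections import Counter

flips = "HHHHTTTTTT"  # the observed sequence from the text

# MLE by counting and then normalizing, as in equation (1).
counts = Counter(flips)
pi_mle = counts["H"] / (counts["H"] + counts["T"])

def likelihood(p, heads=counts["H"], tails=counts["T"]):
    """Probability of generating exactly this sequence when P(heads) = p."""
    return p ** heads * (1 - p) ** tails

# Cross-check: evaluate the likelihood on a grid of candidate values
# (the grid spacing of 0.001 is an arbitrary choice for this sketch).
grid = [i / 1000 for i in range(1001)]
pi_grid = max(grid, key=likelihood)

print(pi_mle, pi_grid)
```

Both routes should land on the same value, the peak of the curve in Figure 1.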
In this case, the most natural model µ has just a single parameter, π, namely the probability of heads
(see Figure 1).2 Letting X = HHHHTTTTTT represent the observed data, and y the outcome of the next
coin flip, we estimate
\[ \tilde{\pi}_{MLE} = \operatorname*{argmax}_{\pi} P(X \mid \pi) \tag{2} \]
\[ P(y \mid X) \approx P(y \mid \tilde{\pi}_{MLE}) \tag{3} \]
On the other hand, sometimes we estimate probabilities using maximum a posteriori (MAP) estimation.
A MAP estimate is the choice that is most likely given the observed data. In this case,
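\[ \tilde{\pi}_{MAP} = \operatorname*{argmax}_{\pi} P(\pi \mid X) = \operatorname*{argmax}_{\pi} \frac{P(X \mid \pi) P(\pi)}{P(X)} = \operatorname*{argmax}_{\pi} P(X \mid \pi) P(\pi) \tag{4} \]
\[ P(y \mid X) \approx P(y \mid \tilde{\pi}_{MAP}) \tag{5} \]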
In contrast to MLE, MAP estimation applies Bayes’s Rule, so that our estimate (4) can take into account
prior knowledge about what we expect π to be in the form of a prior probability distribution P (π).3 So,
2 Specifically, µ models each choice as a Bernoulli trial, and the probability of generating exactly this heads-tails sequence for
a given π is π^4 (1 − π)^6. If you type Plot[p^4(1-p)^6,{p,0,1}] into Wolfram Alpha, you get Figure 1, and you can immediately
see that the curve tops out, i.e. the probability of generating the sequence is highest, exactly when p = 0.4. Confirm this by
entering derivative of p^4(1-p)^6 and you’ll get 2/5 = 0.4 as the maximum. Thanks to Kevin Knight for pointing out how
easy all this is using Wolfram Alpha. Also see discussion in Heinrich [7], Section 2.1.
3 We got to (4) from the desired posterior probability by applying Bayes’s Rule and then ignoring the denominator since the
argmax doesn’t depend on it.
for example, we might believe that the coin flipper is a scrupulously honest person, and choose a prior
distribution that is biased in favor of π = 0.5. The more heavily biased that prior distribution is, the more
evidence it will take to shake our pre-existing belief that the coin is fair.4
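To make the effect of the prior concrete, here is a small Python sketch. It assumes a Beta(a, b) prior on π (our assumption at this point in the discussion; it matches the a and b settings mentioned in footnote 4), in which case the posterior is Beta(a + heads, b + tails) and the MAP estimate is the mode of that posterior. The function name is ours.

```python
def map_estimate(heads, tails, a, b):
    """MAP estimate of P(heads) under an assumed Beta(a, b) prior.

    The posterior is Beta(a + heads, b + tails); its mode is
    (a + heads - 1) / (a + b + heads + tails - 2), valid when both
    posterior parameters exceed 1.
    """
    return (heads + a - 1) / (heads + tails + a + b - 2)

heads, tails = 4, 6  # HHHHTTTTTT

mle = heads / (heads + tails)                # 4/10
flat = map_estimate(heads, tails, 1, 1)      # uniform prior: same as the MLE
weak = map_estimate(heads, tails, 2, 2)      # mild bias toward 0.5: 5/12
strong = map_estimate(heads, tails, 10, 10)  # strong bias toward 0.5: 13/28

print(mle, flat, weak, strong)
```

The stronger the prior's bias toward 0.5, the further the estimate is pulled away from 0.4 and toward 0.5 for the same ten flips.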
Now, MLE and MAP estimates are both giving us the best estimate, according to their respective
definitions of “best.” But notice that using a single estimate, whether it’s π̃_MLE or π̃_MAP, throws away
information. In principle, π could have any value between 0 and 1; might we not get better estimates if we
took the whole distribution P (π|X ) into account, rather than just a single estimated value for π? If we do
that, we’re making use of all the information about π that we can wring from the observed data, X .
The way to take advantage of all that information is to calculate an expected value rather than an estimate
using the single best guess for π. Recall that the expected value of a function f (z), when z is a discrete
variable, is
\[ E[f(z)] = \sum_{z \in Z} f(z)\, p(z). \tag{6} \]
Here Z is the set of discrete values z can take, and p(z) is the probability distribution over possible values
for z. If z is a continuous variable, the expected value is an integral rather than a sum:
\[ E[f(z)] = \int f(z)\, p(z)\, dz. \tag{7} \]
For our example, z = π, the function f we’re interested in is f (z) = P (y|π), and the distribution over
which we’re taking the expectation is P (π|X ), i.e. the whole distribution over possible values of π given that
we’ve observed X . That gives us the following expected value for the posterior probability of y given X :
\[ P(y \mid X) = \int P(y \mid \pi)\, P(\pi \mid X)\, d\pi \tag{8} \]
where Bayes’s Rule defines
\[ P(\pi \mid X) = \frac{P(X \mid \pi) P(\pi)}{P(X)} = \frac{P(X \mid \pi) P(\pi)}{\int_{\pi} P(X \mid \pi) P(\pi)\, d\pi}. \tag{9} \]
Notice that, unlike (3) and (5), Equation (8) defines the posterior using a true equality, not an approximation. It takes fully into account our prior beliefs about what the value of π will be, along with the
interaction of those prior beliefs with observed evidence X .
Equations (8) and (9) provide one compelling answer to the question we started with. Why should
even discrete-minded computer scientists care about integrals? Because even when the probability space
is discrete, we often care about good estimates of posterior probabilities. Computing integrals can help us
improve the parameter estimates in our models.5
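For the coin example the integrals in (8) and (9) are easy enough to approximate by brute force, which makes for a useful reference point. The sketch below assumes a uniform prior P(π) = 1 (our choice, not something the derivation requires) and uses a simple midpoint rule; the helper names are ours.

```python
def likelihood(pi, heads=4, tails=6):
    """P(X | pi) for the sequence HHHHTTTTTT."""
    return pi ** heads * (1 - pi) ** tails

def prior(pi):
    """Uniform prior P(pi) on [0, 1]; an assumption made for this sketch."""
    return 1.0

def integrate(f, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    width = 1.0 / steps
    return width * sum(f((i + 0.5) * width) for i in range(steps))

# Denominator of equation (9): P(X) = integral of P(X | pi) P(pi) d pi.
evidence = integrate(lambda pi: likelihood(pi) * prior(pi))

def posterior(pi):
    """Equation (9): P(pi | X)."""
    return likelihood(pi) * prior(pi) / evidence

# Equation (8) with y = heads, so that P(y | pi) = pi.
p_heads = integrate(lambda pi: pi * posterior(pi))
print(p_heads)
```

Under the uniform prior the exact answer is 5/12 ≈ 0.4167, a little closer to 0.5 than the MLE of 0.4, because the whole posterior distribution contributes rather than just its peak.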
4 See http://www.math.uah.edu/STAT/objects/experiments/BetaCoinExperiment.xhtml for a nice applet that lets you explore this idea. If you set a = b = 10, you get a prior strongly biased toward 0.5, and it’s hard to move the posterior too
far from that value even if you generate observed heads with probability p = 0.8. If you set a = b = 2, there’s still a bias
toward 0.5 but it’s much easier to move the posterior off that value. As a second pointer, see some nice, self-contained slides at
http://www.cs.cmu.edu/~lewicki/cp-s08/Bayesian-inference.pdf.
5 Chris Dyer (personal communication) points out you don’t have to be doing Bayesian estimation to care about expected
values. For example, better ways to compute expected values can be useful in the E step of expectation-maximization algorithms,
which give you maximum likelihood estimates for models with latent variables. He also points out that for many models,
Bayesian parameter estimation can be a whole lot easier to implement than EM. The widely used GIZA++ implementation of
IBM Model 3 (a probabilistic model used in statistical machine translation [12]) contains 2186 lines of code; Chris implemented
a Gibbs sampler for Model 3 in 67 lines. On a related note, Kevin Knight’s excellent “Bayesian Inference with Tears: A tutorial
workbook for natural language researchers” [9] was written with goals very similar to our own, but from an almost completely
complementary angle: he emphasizes conceptual connections to EM algorithms and focuses on the kinds of structured problems
you tend to see in natural language processing.
1.2 Why sampling?
The trouble with integrals, of course, is that they can be very difficult to calculate. The methods we learned
in calculus class are fine for classroom exercises, but often cannot be applied to interesting problems in the
real world. Indeed, analytical solutions to (8) and the denominator of (9) might be impossible to obtain, so
we might not be able to determine the exact form of P (π|X ). Gibbs sampling allows us to sample from a
distribution that asymptotically follows P (π|X ) without having to explicitly calculate the integrals.
1.2.1 Monte Carlo: a circle, a square, and a bag of rice
Gibbs Sampling is an instance of a Markov Chain Monte Carlo technique. Let’s start with the “Monte
Carlo” part. You can think of Monte Carlo methods as algorithms that help you obtain a desired value by
performing simulations involving probabilistic choices. As a simple example, here’s a cute, low-tech Monte
Carlo technique for estimating the value of π (the ratio of a circle’s circumference to its diameter).6
Draw a perfect square on the ground. Inscribe a circle in it — i.e. the circle and the square are centered
in exactly the same place, and the circle’s diameter has length identical to the side of the square. Now take
a bag of rice, and scatter the grains uniformly at random inside the square. Finally, count the total number
of grains of rice inside the circle (call that C), and inside the square (call that S).
You scattered rice at random. Assuming you managed to do this pretty uniformly, the ratio between the
circle’s grains and the square’s grains (which include the circle’s) should approximate the ratio between the
area of the circle and the area of the square, so
\[ \frac{C}{S} \approx \frac{\pi (d/2)^2}{d^2}. \tag{10} \]
Solving for π, we get π ≈ 4C/S.
You may not have realized it, but we just solved a problem by approximating the values of integrals. The
true area of the circle, π(d/2)^2, is the result of summing up an infinite number of infinitesimally small points;
similarly for the true area d^2 of the square. The more grains of rice we use, the better our approximation
will be.
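If you don't have a bag of rice handy, the same experiment can be sketched in a few lines of Python. Here the square has side d = 1 and is centered at the origin, and each "grain" is a pair of uniform random coordinates; the function name and the fixed seed are our own choices.

```python
import random

def estimate_pi(num_grains, seed=0):
    """Scatter grains uniformly in a unit square and count those in the inscribed circle."""
    rng = random.Random(seed)
    inside_circle = 0  # this is C in equation (10)
    for _ in range(num_grains):
        x = rng.uniform(-0.5, 0.5)
        y = rng.uniform(-0.5, 0.5)
        # The inscribed circle has radius d/2 = 0.5.
        if x * x + y * y <= 0.25:
            inside_circle += 1
    # Every grain lands in the square, so S = num_grains.
    return 4 * inside_circle / num_grains

for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n))
```

The estimates typically wander for small numbers of grains and settle down as the number grows, which is the "more rice, better approximation" point made above.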
1.2.2 Markov Chains: walking the right walk
In the circle-and-square example, we saw the value of sampling involving a uniform distribution, since the
grains of rice were distributed uniformly within the square. Returning to the problem of computing expected
values, recall that we’re interested in E_p(z)[f(z)] (equation 7), where we’ll assume that the distribution p(z)
is not uniform and, in fact, not easy to work with analytically.
Figure 2 provides an example f (z) and p(z) for illustration. Conceptually, the integral in equation (7)
sums up f (z)p(z) over infinitely many values of z. But rather than touching each point in the sum exactly
once, here’s another way to think about it: if you sample N points z^(0), z^(1), z^(2), . . . , z^(N) at random from
the probability density p(z), then
\[ E_{p(z)}[f(z)] = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} f(z^{(t)}). \tag{11} \]
That looks a lot like a kind of averaged value for f , which makes a lot of sense since in the discrete case
(equation 6) the expected value is nothing but a weighted average, where the weight for each value of z is
its probability.
Notice, though, that the value in the sum is just f(z^(t)), not f(z^(t)) p(z^(t)) as in the integral in equation (7).
Where did the p(z) part go? Intuitively, if we’re sampling according to p(z), and count(z) is the number of
6 We’re elaborating on the introductory example at http://en.wikipedia.org/wiki/Monte_Carlo_method.
[Figure 2 plot: the curves f(z) and p(z) over the axis Z, with the points z1, z2, z3, and z4 marked.]
Figure 2: Example of computing expectation Ep(z) [f (z)]. (This figure was adapted from page 7 of the
handouts for Chris Bishop’s presentation “NATO ASI: Learning Theory and Practice”, Leuven, July 2002,
http://research.microsoft.com/en-us/um/people/cmbishop/downloads/bishop-nato-2.pdf )
times we observe z in the sample, then (1/N) count(z) approaches p(z) as N → ∞. So the p(z) is implicit in the
way the samples are drawn.7
Looking at equation (11), it’s clear that we can get an approximate value by sampling only a finite number
of times, T :
\[ E_{p(z)}[f(z)] \approx \frac{1}{T} \sum_{t=1}^{T} f(z^{(t)}). \tag{12} \]
Progress! Now we have a way to approximate the integral. The remaining problem is this: how do we
sample z^(0), z^(1), z^(2), . . . , z^(T) according to p(z)?
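Before tackling that question, it may help to see equation (12) at work in a case where sampling from p(z) happens to be easy. The Python sketch below returns to the coin example and assumes a uniform prior on π, so that the posterior P(π|X) is a Beta(5, 7) distribution that the standard library can sample from directly. That convenience is exactly what we will not have in interesting models; the helper names are ours.

```python
import random

def expected_value(f, sampler, num_samples):
    """Equation (12): average f over samples drawn according to p(z)."""
    return sum(f(sampler()) for _ in range(num_samples)) / num_samples

rng = random.Random(0)

def sample_posterior():
    """Draw pi from Beta(1 + 4, 1 + 6), the posterior under an assumed uniform prior."""
    return rng.betavariate(5, 7)

# f(pi) = P(y = heads | pi) = pi, so the average approximates equation (8).
estimate = expected_value(lambda pi: pi, sample_posterior, 100_000)
print(estimate)
```

Note that the code never multiplies by the posterior density: the density enters only through how often each region of π gets sampled. The printed value should come out near 5/12 ≈ 0.4167.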
There are a whole variety of ways to go about this; e.g., see Bishop [2] Chapter 11 for discussion of rejection sampling, adaptive rejection sampling, adaptive rejection Metropolis sampling, importance sampling,
sampling-importance-sampling,... For our purposes, though, the key idea is to think of z’s as points in a
state space, and find ways to “walk around” the space (going from z^(0) to z^(1) to z^(2) and so forth) so
that the likelihood of visiting any point z is proportional to p(z). Figure 2 illustrates the reasoning visually:
walking around values for Z, we want to spend our time adding values f (z) to the sum when p(z) is large,
e.g. devoting more attention to the space between z1 and z2 , as opposed to spending time in less probable
portions of the space like the part between z2 and z3 .
A walk of this kind can be characterized abstractly as follows:
    z^(0) := a random initial point
    for t = 0 to T − 1 do
        z^(t+1) := g(z^(t))
    end for
7 Okay, we’re playing a little fast and loose here by talking about counts: with Z continuous, we’re not going to see two
identical samples z^(i) = z^(j), so it doesn’t really make sense to talk about counting how many times a value was seen. But we
did say “intuitively”, right?
Here g is a function that makes a probabilistic choice about what state to go to next according to an
explicit or implicit transition probability Ptrans(z^(t+1) | z^(0), z^(1), . . . , z^(t)).8
The part about probabilistic choices makes this a Monte Carlo technique. What will make it a Markov
Chain Monte Carlo technique is defining things so that the next state you visit, z^(t+1), depends only on the
current state z^(t). That is,
\[ P_{trans}(z^{(t+1)} \mid z^{(0)}, z^{(1)}, \ldots, z^{(t)}) = P_{trans}(z^{(t+1)} \mid z^{(t)}). \tag{13} \]
For you language folks, this is precisely the same idea as modeling word sequences using a bigram model,
where here we have states z instead of having words w.
We included the subscript trans in our notation, so that Ptrans can be read as “transition probability”,
in order to emphasize that these are state-to-state transition probabilities in a (first-order) Markov model.
The heart of Markov Chain Monte Carlo methods is designing g so that the probability of visiting a state z
will turn out to be p(z), as desired. This can be accomplished by guaranteeing that the chain, as defined by
the transition probabilities Ptrans , meets certain conditions. Gibbs sampling is one algorithm that meets
those conditions.9
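To see what "the probability of visiting a state will turn out to be p(z)" means in practice, here is a toy Python sketch of the abstract walk with a hand-built two-state chain. This is not Gibbs sampling, and the transition probabilities are hypothetical values we picked so that the long-run visiting probabilities work out to 0.75 and 0.25.

```python
import random

# Hypothetical transition probabilities Ptrans(next | current) for states 0 and 1.
TRANSITIONS = {0: [0.9, 0.1], 1: [0.3, 0.7]}

def g(state, rng):
    """Pick the next state using only the current state (the Markov property)."""
    return rng.choices([0, 1], weights=TRANSITIONS[state])[0]

def walk(num_steps, seed=0):
    """Run the chain and return the fraction of steps spent in each state."""
    rng = random.Random(seed)
    z = rng.choice([0, 1])  # random initial point
    visits = [0, 0]
    for _ in range(num_steps):
        z = g(z, rng)
        visits[z] += 1
    return [count / num_steps for count in visits]

print(walk(200_000))
```

For this chain the stationary distribution is (0.75, 0.25): it satisfies 0.75 × 0.1 = 0.25 × 0.3, so the flow between the two states balances. The fractions printed by the walk should be close to those values regardless of the starting state.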
1.2.3 The Gibbs sampling algorithm
Gibbs sampling is applicable in situations where Z has at least two dimensions, i.e. each point z is really
z = ⟨z1, . . . , zk⟩, with k > 1. The basic idea in Gibbs sampling is that, rather than probabilistically picking
the next state all at once, you make a separate probabilistic choice for each of the k dimensions, where each
choice depends on the other k − 1 dimensions.10 That is, the probabilistic walk through the space proceeds
as follows:
    z^(0) := ⟨z_1^(0), . . . , z_k^(0)⟩
    for t = 0 to T − 1 do
        for i = 1 to k do
            z_i^(t+1) ∼ P(Z_i | z_1^(t+1), . . . , z_{i−1}^(t+1), z_{i+1}^(t), . . . , z_k^(t))
        end for
    end for
Note that we can obtain the distribution we are sampling from by using the definition of conditional
probability:
\[ P(Z_i \mid z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_k^{(t)}) = \frac{P(z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_i^{(t)}, z_{i+1}^{(t)}, \ldots, z_k^{(t)})}{P(z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_k^{(t)})}. \tag{14} \]
Notice that the only difference between the numerator and the denominator is that the numerator is the full
joint probability, including z_i^(t), whereas z_i^(t) is missing in the denominator. This will be important later.
One full execution of the inner loop and you’ve just computed your new point z^(t+1) = g(z^(t)) =
⟨z_1^(t+1), . . . , z_k^(t+1)⟩.
You can think of each dimension zi as corresponding to a parameter or variable in your model. Using
equation (14), we sample the new value for each variable according to its distribution based on the values
of all the other variables. During this process, new values for the variables are used as soon as you obtain
them. For the case of three variables:
• The new value of z1 is sampled conditioning on the old values of z2 and z3 .
8 We’re deliberately staying at a high level here. In the bigger picture, g might consider and reject one or more states before
finally deciding to accept one and return it as the value for z (t+1) . See discussions of the Metropolis-Hastings algorithm, e.g.
Bishop [2], Section 11.2.
9 We told you we were going to keep theory to a minimum, didn’t we?
10 There are, of course, variations on this basic scheme. For example, “blocked sampling” groups the variables into b < k
blocks and the variables in each block are sampled together based on the other b − 1 blocks.
• The new value of z2 is sampled conditioning on the new value of z1 and the old value of z3 .
• The new value of z3 is sampled conditioning on the new values of z1 and z2 .
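The three-variable case can be sketched directly in Python. The joint distribution below is a hypothetical table of weights over three binary variables that we made up for illustration. Each coordinate is resampled from its conditional given the current values of the other two, which is the joint divided by the sum of the joint over that coordinate, as in equation (14).

```python
import random

# Hypothetical unnormalized joint weights over (z1, z2, z3), each variable in {0, 1}.
WEIGHTS = {
    (0, 0, 0): 1.0, (0, 0, 1): 2.0, (0, 1, 0): 3.0, (0, 1, 1): 1.0,
    (1, 0, 0): 2.0, (1, 0, 1): 4.0, (1, 1, 0): 1.0, (1, 1, 1): 6.0,
}
TOTAL = sum(WEIGHTS.values())

def sample_coordinate(z, i, rng):
    """Draw a new value for z[i] from P(Z_i | all other coordinates)."""
    joint = []
    for value in (0, 1):
        candidate = list(z)
        candidate[i] = value
        joint.append(WEIGHTS[tuple(candidate)])
    # Dividing by the sum over both values of z[i] plays the role of the
    # denominator in equation (14).
    p_one = joint[1] / (joint[0] + joint[1])
    return 1 if rng.random() < p_one else 0

def gibbs(num_iterations, seed=0):
    """Run the sampler and return the fraction of iterations spent in each state."""
    rng = random.Random(seed)
    z = [rng.randint(0, 1) for _ in range(3)]
    counts = dict.fromkeys(WEIGHTS, 0)
    for _ in range(num_iterations):
        for i in range(3):
            # The new value is written into z at once, so later
            # coordinates condition on it within the same iteration.
            z[i] = sample_coordinate(z, i, rng)
        counts[tuple(z)] += 1
    return {state: count / num_iterations for state, count in counts.items()}

freqs = gibbs(200_000)
for state in sorted(WEIGHTS):
    print(state, round(freqs[state], 3), WEIGHTS[state] / TOTAL)
```

The visiting frequencies should approach the normalized weights even though the code never samples a whole state at once. For a table this small you could of course sample the joint directly; the point is only to show the mechanics.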
1.3 The remainder of this document
So, there you have it. Gibbs sampling makes it possible to obtain samples from probability distributions
without having to explicitly calculate the values of their marginalizing integrals (for example, when computing
expected values), by defining a conceptually straightforward approximation. This approximation is based on the idea
of a probabilistic walk through a state space whose dimensions correspond to the variables or parameters in
your model.
Trouble is, from what we can tell, most descriptions of Gibbs sampling pretty much stop there (if they’ve
even gone into that much detail). To someone relatively new to this territory, though, that’s not nearly far
enough to figure out how to do Gibbs sampling. How exactly do you implement the “sampling from the
following distribution” part at the heart of the algorithm (equation 14) for your particular model? How do
you deal with continuous parameters in your model? How do you actually generate the expected values you’re
ultimately interested in (e.g. equation 8), as opposed to just doing the probabilistic walk for T iterations?
Just as the first part of this document aimed at explaining why, the remainder aims to explain how.
In Section 2, we take a very simple probabilistic model, Naïve Bayes, and describe in considerable
(painful?) detail how to construct a Gibbs sampler for it. This includes two crucial things, namely how to
employ conjugate priors and how to actually sample from conditional distributions per equation (14).
In Section 3 we discuss how to actually obtain values from a Gibbs sampler, as opposed to merely watching
it walk around the state space. (Which might be entertaining, but wasn’t really the point.) Our discussion
includes convergence and burn-in, auto-correlation and lag, and other practical issues.
In Section 4 we conclude with pointers to other things you might find it useful to read, as well as an
invitation to tell us how we could make this document more accurate or more useful.
2 Deriving a Gibbs Sampler for a Naïve Bayes Model
In this section we consider Naive Bayes models.11 Let’s assume that items of interest are documents, that
the features under consideration are the words in the document, and that the document-level class variable
we’re trying to predict is a sentiment label whose value is either 0 or 1. For ease of reference, we present
our notation in Figure 3. Figure 4 describes the model as a “plate diagram”, to which we will refer when
describing the model.
2.1 Modeling How Documents are Generated
We represent each document as a bag of words. Given an unlabeled document Wj , our goal is to pick
the best label Lj = 0 or Lj = 1. Sometimes we will refer to 0 and 1 as classes instead of labels. In
further discussion it will be convenient for us to refer to the sets of documents sharing the same label, so for
notational convenience we define the sets C0 = {Wj |Lj = 0} and C1 = {Wj |Lj = 1}. The usual treatment
of these models is to equate “best” with “most probable”, and therefore our goal is to choose the label Lj
for Wj that maximizes P (Lj |Wj ). Applying Bayes’s Rule,
\[ L_j = \operatorname*{argmax}_{L} P(L \mid W_j) = \operatorname*{argmax}_{L} \frac{P(W_j \mid L) P(L)}{P(W_j)} = \operatorname*{argmax}_{L} P(W_j \mid L) P(L), \]
where the denominator P (Wj ) is omitted because it does not depend on L.
This application of Bayes’s rule (the “Bayes” part of “Naïve Bayes”) allows us to think of the model in
terms of a “generative story” that accounts for how documents are created. According to that story, we
11 We assume the reader is familiar with Naive Bayes. For a refresher, see [14].
V                    number of words in the vocabulary.
N                    number of documents in the corpus.
γπ1, γπ0             hyperparameters of the Beta distribution.
γθ                   hyperparameter vector for the multinomial prior.
γθi                  pseudocount for word i.
Cx                   set of documents labeled x.
C                    the set of all documents.
C0 (C1)              number of documents labeled 0 (1).
Wj                   document j’s frequency distribution.
Wji                  frequency of word i in document j.
L                    vector of document labels.
Lj                   label for document j.
Rj                   number of words in document j.
θi                   probability of word i.
θx,i                 probability of word i from the distribution of class x.
NCx(i)               number of times word i occurs in the set of all documents labeled x.
C(−j)                set of all documents except Wj.
L(−j)                vector of all document labels except Lj.
C0(−j) (C1(−j))      number of documents labeled 0 (1) except for Wj.
µ                    set of hyperparameters ⟨γπ1, γπ0, γθ⟩.
Figure 3: Notation.
[Plate diagram: nodes γπ, π, Lj, Wjk (shaded), θ, and γθ; plates labeled Rj, N, and 2.]
Figure 4: Naive Bayes plate diagram
first pick the class label of the document, Lj ; our model will assume that’s done by flipping a coin whose
probability of heads is some value π = P (Lj = 1). We can express this a little more formally as
Lj ∼ Bernoulli(π).
Then, for every one of the Rj word positions in the document, we pick a word wi independently by
sampling randomly according to a probability distribution over words. Which probability distribution we
use is based on the label Lj of the document, so we’ll write them as θ 0 and θ 1 . Formally one would describe
the creation of document j’s bag of words as
Wj ∼ Multinomial(Rj , θ).
(15)
The assumption that the words are chosen independently is the reason we call the model “naïve”.
Notice that logically speaking, it made sense to describe the model in terms of two separate probability
distributions, θ 0 and θ 1 , each one being a simple unigram distribution. The notation in (15) doesn’t explicitly
show that what happens in generating Wj depends on whether Lj was 0 or 1. Unfortunately, that notational
choice seems to be standard, even though it’s less transparent.12 We indicate the existence of two θs by
including the 2 in the lower rectangle of Figure 4, but many plate diagrams in the literature would not.
Another, perhaps clearer way to describe the process would be
Wj ∼ Multinomial(Rj , θ Lj ).
(16)
And that’s it: our “generative story” for the creation of a whole set of labeled documents ⟨Wn, Ln⟩,
according to the Naïve Bayes model, is that this simple document-level generative story gets repeated N
times, as indicated by the N in the upper rectangle in Figure 4.
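The generative story translates almost line for line into code. The following Python sketch is a toy illustration only: the three-word vocabulary, the values of π and the two θ distributions, and the fixed document length are all hypothetical choices of ours.

```python
import random

def generate_corpus(num_docs, doc_length, pi, theta, seed=0):
    """Generate (bag of words, label) pairs following the Naive Bayes story."""
    rng = random.Random(seed)
    vocab_size = len(theta[0])
    corpus = []
    for _ in range(num_docs):
        # L_j ~ Bernoulli(pi): flip a coin for the document's label.
        label = 1 if rng.random() < pi else 0
        # W_j ~ Multinomial(R_j, theta_{L_j}): draw each word position
        # independently from the word distribution for that label.
        words = rng.choices(range(vocab_size), weights=theta[label], k=doc_length)
        counts = [0] * vocab_size
        for word in words:
            counts[word] += 1
        corpus.append((counts, label))
    return corpus

# Hypothetical word distributions theta_0 and theta_1 over a 3-word vocabulary.
theta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
corpus = generate_corpus(num_docs=5, doc_length=8, pi=0.5, theta=theta)
for counts, label in corpus:
    print(label, counts)
```

For simplicity every document here has the same number of word positions; in the model each document j has its own Rj.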
2.2 Priors
Well, ok, that’s not completely it. Where did π come from? Our generative story is going to assume that
before this whole process began, we also picked π randomly. Specifically we’ll assume that π is sampled from
a Beta distribution with parameters γπ1 and γπ0 . These are referred to as hyperparameters because they are
parameters of a prior, which is itself used to pick parameters of the model. In Figure 4 we represent these two
hyperparameters as a single two-dimensional vector γπ = hγπ1 , γπ0 i. When γπ1 = γπ0 = 1, Beta(γπ1 , γπ0 )
is just a uniform distribution, which means that any value for π is equally likely. For this reason we call
Beta(1, 1) an “uninformed prior”.
Similarly, where do θ 0 and θ 1 come from? Just as the Beta distribution can provide an uninformed prior
for a distribution making a two-way choice, the Dirichlet distribution can provide an uninformed prior for
V -way choices, where V is the number of words in the vocabulary. Let γ θ be a V -dimensional vector where
the value of every dimension equals 1. If θ 0 is sampled from Dirichlet(γ θ ), every probability distribution
over words will be equally likely. Similarly, we’ll assume θ 1 is sampled from Dirichlet(γ θ ).13 Formally
π ∼ Beta(γπ )
θ ∼ Dirichlet(γ θ )
Choosing the Beta and Dirichlet distributions as priors for binomial and multinomial distributions, respectively, helps the math work out cleanly. We’ll discuss this in more detail in Section 2.4.2.
12 The actual basis for removing the subscripts on the parameters θ is that we assume the data from one class is independent
of the parameter estimation of all the other classes, so essentially when we derive the probability expressions for one class the
others look the same [19].
13 Note that θ0 and θ1 are sampled separately. There’s no assumption that they are related to each other at all. Also, it’s
worth noting that a Dirichlet distribution is a Beta distribution if the dimension V = 2. Dirichlet generalizes Beta in the same
way that multinomial generalizes binomial. Now you see why we took the trouble to represent γπ1 and γπ0 as a 2-dimensional
vector γπ.
2.3 State space and initialization
Following Pedersen [17, 18], we’re going to describe the Gibbs sampler in a completely unsupervised setting
where no labels at all are provided as training data. We’ll then briefly explain how to take advantage of
labeled data.
State space. Recall that the job of a Gibbs sampler is to walk through a k-dimensional state space defined
by the random variables ⟨Z1, Z2, . . . , Zk⟩ in the model. Every point in that walk is a collection ⟨z1, z2, . . . , zk⟩
of values for those variables.
In the Naı̈ve Bayes model we’ve just described, here are the variables that define the state space.
• one scalar-valued variable π
• two vector-valued variables, θ 0 and θ 1
• binary label variables L, one for each of the N documents
We also have one vector variable Wj for each of the N documents, but these are observed variables, i.e.
their values are already known (which is why Wjk is shaded in Figure 4).
Initialization. The initialization of our sampler is going to be very easy. Pick a value π by sampling from
the Beta(γπ1, γπ0) distribution. Then, for each j, flip a coin with success probability π, and assign label Lj^(0)
(that is, the label of document j at the 0th iteration) based on the outcome of the coin flip. Similarly,
you also need to initialize θ0 and θ1 by sampling from Dirichlet(γθ).
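Here is one way the initialization step might look in Python, using only the standard library. The Dirichlet draw uses the standard construction of normalizing independent Gamma draws; the function names, default hyperparameter values, and return format are our own choices for this sketch.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [draw / total for draw in draws]

def initialize(num_docs, vocab_size, gamma_pi1=1.0, gamma_pi0=1.0, seed=0):
    """Build the initial state: pi, the label vector L^(0), theta_0 and theta_1."""
    rng = random.Random(seed)
    # pi ~ Beta(gamma_pi1, gamma_pi0); pi is the probability of label 1.
    pi = rng.betavariate(gamma_pi1, gamma_pi0)
    # One coin flip per document gives its label at iteration 0.
    labels = [1 if rng.random() < pi else 0 for _ in range(num_docs)]
    # Uninformed Dirichlet prior: every pseudocount equals 1.
    gamma_theta = [1.0] * vocab_size
    theta0 = sample_dirichlet(gamma_theta, rng)
    theta1 = sample_dirichlet(gamma_theta, rng)
    return pi, labels, theta0, theta1

pi, labels, theta0, theta1 = initialize(num_docs=10, vocab_size=5)
print(pi, labels)
```

Each θ comes back as a list of V probabilities summing to 1, and the two are drawn independently of each other, matching the description of the priors above.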
2.4 Deriving the Joint Distribution
Recall that for each iteration t = 1 . . . T of sampling, we update every variable defining the state space by
sampling from its conditional distribution given the other variables, as described in equation (14).
Here’s how we’re going to proceed:
• We will define the joint distribution of all the variables, corresponding to the numerator in (14).
• We simplify our expression for the joint distribution.
• We use our final expression of the joint distribution to define how to sample from the conditional
distribution in (14).
• We give the final form of the sampler as pseudocode.
2.4.1 Expressing and simplifying the joint distribution
According to our model, the joint distribution for the entire document collection is P (C, L, π, θ 0 , θ 1 ; γπ1 , γπ0 , γ θ ).
The semicolon indicates that the values to its right are parameters for this joint distribution. Another way
to say this is that the variables to the left of the semicolon are conditioned on the hyperparameters given to
the right of the semicolon. Using the model’s generative story, and, crucially, the independence assumptions
that are a part of that story, the joint distribution can be decomposed into a product of several factors:14
P (π|γπ1 , γπ0 )P (L|π)P (θ 0 |γ θ )P (θ 1 |γ θ )P (C0 |θ 0 , L)P (C1 |θ 1 , L)
Let’s look at each of these in turn.
14 Note that we can also obtain the products of the joint distribution directly from our graphical model, Figure 4 by multiplying
together each latent variable conditioned on its parents.
• P (π|γπ1 , γπ0 ). The first factor is the probability of choosing this particular value of π given that γπ1
and γπ0 are being used as the hyperparameters of the Beta distribution. By definition of the Beta
distribution, that probability is:
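\[ P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) = \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})}\, \pi^{\gamma_{\pi 1} - 1} (1 - \pi)^{\gamma_{\pi 0} - 1} \tag{17} \]
And because Γ(γπ1 + γπ0) / (Γ(γπ1) Γ(γπ0)) is a constant that doesn’t depend on π, we can rewrite this as:
\[ P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) = c\, \pi^{\gamma_{\pi 1} - 1} (1 - \pi)^{\gamma_{\pi 0} - 1}. \tag{18} \]
The constant c is a normalizing constant that makes sure P(π|γπ1, γπ0) integrates to 1 over all π. Γ(x)
is the gamma function, a continuous-valued generalization of the factorial function. We could also
express (18) as
\[ P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) \propto \pi^{\gamma_{\pi 1} - 1} (1 - \pi)^{\gamma_{\pi 0} - 1}. \tag{19} \]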
• P (L|π). The second factor is the probability of obtaining this specific sequence L of N binary labels,
given that the probability of choosing label = 1 is π. That’s
\begin{align}
P(\mathbf{L} \mid \pi) &= \prod_{n=1}^{N} \pi^{L_n} (1 - \pi)^{(1 - L_n)} \tag{20} \\
&= \pi^{C_1} (1 - \pi)^{C_0} \tag{21}
\end{align}
Recall from Figure 3 that C0 and C1 are nothing more than the number of documents labeled 0 and
1, respectively.15
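The step from (20) to (21) can be checked numerically. The following is a minimal sketch, assuming made-up labels and a made-up value of π; the function names are ours and are not part of the model description.

```python
import math

def label_likelihood_product(labels, pi):
    """Equation (20): multiply one Bernoulli term per document label."""
    prob = 1.0
    for label in labels:
        prob *= pi ** label * (1.0 - pi) ** (1 - label)
    return prob

def label_likelihood_counts(labels, pi):
    """Equation (21): the same value computed from the two class counts."""
    c1 = sum(labels)          # number of documents labeled 1
    c0 = len(labels) - c1     # number of documents labeled 0
    return pi ** c1 * (1.0 - pi) ** c0

# Hypothetical data: eight documents, three of them labeled 1.
labels = [1, 0, 0, 1, 1, 0, 0, 0]
pi = 0.3
by_product = label_likelihood_product(labels, pi)
by_counts = label_likelihood_counts(labels, pi)
```

Both functions return the same number, which is why only the counts C0 and C1 need to be tracked during sampling.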
• P (θ 0 |γ θ ) and P (θ 1 |γ θ ). The third and fourth factors are the probabilities of having sampled these particular choices
of word distributions, given that γ θ was used as the hyperparameter of the Dirichlet distribution.
Since the θ distributions are independent of each other, let’s make the notation a bit easier to read here
and consider each of them in isolation, allowing us to momentarily elide the subscript saying which one
is which. By the definition of the Dirichlet distribution, the probability of each word distribution is
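\begin{align}
P(\boldsymbol{\theta} \mid \boldsymbol{\gamma}_{\theta}) &= \frac{\Gamma\left(\sum_{i=1}^{V} \gamma_{\theta i}\right)}{\prod_{i=1}^{V} \Gamma(\gamma_{\theta i})} \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \tag{22} \\
&= c' \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \tag{23} \\
&\propto \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \tag{24}
\end{align}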
Recall that γθi denotes the value of vector γ θ ’s ith dimension, and similarly, θi is the value for the
ith dimension of vector θ, i.e. the probability assigned by this distribution to the ith word in the
vocabulary. c′ (the constant in (23)) is another normalization constant that we can discard by changing the equality to a
proportionality.
15 Of course we can represent these quantities given the variables we already have, but we define these in the interests of
simplifying the equations somewhat.
• P (C0 |θ 0 , L) and P (C1 |θ 1 , L). These are the probabilities of generating the contents of the bags of
words in each of the two document classes.
Generating the bag of words Wn for document n depends on that document’s label, Ln and the word
probability distribution associated with that label, θ Ln (so θ Ln is either θ 0 or θ 1 ). For notational
simplicity, let’s let θ = θ Ln :
\begin{equation}
P(W_n \mid \mathbf{L}, \boldsymbol{\theta}_{L_n}) = \prod_{i=1}^{V} \theta_i^{W_{ni}} \tag{25}
\end{equation}
Here θi is the probability of word i in distribution θ, and the exponent Wni is the frequency of word i
in Wn .
Now, since the documents are generated independently of each other, we can multiply the value in (25)
for each of the documents in each class to get the combined probability of all the observed bags of
words within a class:
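\begin{align}
P(\mathbb{C}_x \mid \mathbf{L}, \boldsymbol{\theta}_x) &= \prod_{n \in \mathbb{C}_x} \prod_{i=1}^{V} \theta_{x,i}^{W_{ni}} \tag{26} \\
&= \prod_{i=1}^{V} \theta_{x,i}^{N_{\mathbb{C}_x}(i)} \tag{27}
\end{align}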
Where NCx (i) gives the count of word i in documents with class label x.
2.4.2
Choice of Priors and Simplifying the Joint Probability Expression
So why did we pick the Dirichlet distribution as our prior for θ 0 and θ 1 and the Beta distribution as our
prior for π? Let’s look at what happens in the process of simplifying the joint distribution and see what
happens to our estimates of θ (where this can be either θ 0 or θ 1 ) and π once we observe some evidence (i.e.
the words from a single document). Using (19) and (21) from above:
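\begin{align}
P(\pi \mid \mathbf{L}; \gamma_{\pi 1}, \gamma_{\pi 0}) &\propto P(\mathbf{L} \mid \pi)\, P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) \tag{28} \\
&\propto \pi^{C_1} (1 - \pi)^{C_0}\, \pi^{\gamma_{\pi 1} - 1} (1 - \pi)^{\gamma_{\pi 0} - 1} \tag{29} \\
&\propto \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1} \tag{30}
\end{align}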
Likewise, for θ using (24) and (25) from above:
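\begin{align}
P(\boldsymbol{\theta} \mid W_n; \boldsymbol{\gamma}_{\theta}) &\propto P(W_n \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \boldsymbol{\gamma}_{\theta}) \notag \\
&\propto \prod_{i=1}^{V} \theta_i^{W_{ni}} \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \notag \\
&\propto \prod_{i=1}^{V} \theta_i^{W_{ni} + \gamma_{\theta i} - 1} \tag{31}
\end{align}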
If we use the words in all of the documents of a given class, then we have:
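\begin{equation}
P(\boldsymbol{\theta}_x \mid \mathbb{C}_x; \boldsymbol{\gamma}_{\theta}) \propto \prod_{i=1}^{V} \theta_{x,i}^{N_{\mathbb{C}_x}(i) + \gamma_{\theta i} - 1} \tag{32}
\end{equation}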
Notice that (30) is an unnormalized Beta distribution, with parameters C1 + γπ1 and C0 + γπ0 , and (32)
is an unnormalized Dirichlet distribution, with parameter vector ⟨NCx (i) + γθi ⟩ for 1 ≤ i ≤ V . When the
posterior probability distribution is of the same family as the prior probability distribution (that is, the
same functional form, just with different arguments), the prior is said to be a conjugate prior for the likelihood.
The Beta distribution is the conjugate prior for binomial (and Bernoulli) distributions and the Dirichlet
distribution is the conjugate prior for multinomial distributions. Also notice what role the hyperparameters
played: they are added just like observed evidence. It is for this reason that the hyperparameters are
sometimes referred to as pseudocounts.
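The pseudocount reading of (30) and (32) can be made concrete. The sketch below assumes hypothetical labels, word counts, and hyperparameter values; the function names are ours. It only computes the parameters of the posterior distributions, which amounts to adding observed counts to the hyperparameters.

```python
def beta_posterior_params(labels, gamma_pi1, gamma_pi0):
    """Parameters of the Beta posterior in (30): counts plus pseudocounts."""
    c1 = sum(labels)            # documents labeled 1
    c0 = len(labels) - c1       # documents labeled 0
    return c1 + gamma_pi1, c0 + gamma_pi0

def dirichlet_posterior_params(class_word_counts, gamma_theta):
    """Parameter vector of the Dirichlet posterior in (32) for one class."""
    return [n + g for n, g in zip(class_word_counts, gamma_theta)]

# Hypothetical evidence: four labels, and word counts over a 3-word vocabulary.
beta_params = beta_posterior_params([1, 0, 0, 1], gamma_pi1=1, gamma_pi0=1)
dirichlet_params = dirichlet_posterior_params([2, 0, 5], [1, 1, 1])
```

With all hyperparameters set to 1, each class and each word behaves as if it had been observed once before any data arrived.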
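Okay, so if we multiply together the individual factors from Section 2.4.1 as simplified above (in Section
2.4.2) and let µ = ⟨γπ1 , γπ0 , γ θ ⟩ we can express the full joint distribution as:
\begin{equation}
P(\mathbb{C}, \mathbf{L}, \pi, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu) \propto \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1} \prod_{i=1}^{V} \theta_{0,i}^{N_{\mathbb{C}_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{\mathbb{C}_1}(i) + \gamma_{\theta i} - 1} \tag{33}
\end{equation}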
2.4.3
Integrating out π
Having illustrated how to derive the joint distribution for a model, as in (33), it turns out that, for this
particular model, we can make our lives a little bit simpler: we can reduce the effective number of parameters
in the model by integrating our joint distribution with respect to π. This has the effect of taking all possible
values of π into account in our sampler, without representing it as a variable explicitly and having to sample
it at every iteration. Intuitively, “integrating out” a variable is an application of precisely the same principle
as computing the marginal probability for a discrete distribution. For example, if we have an expression for
P (a, b, c), we can compute P (a, b) by summing over all possible values of c, i.e. P (a, b) = Σc P (a, b, c). As
a result, c is “there” conceptually, in terms of our understanding of the model, but we don’t need to deal
with manipulating it explicitly as a parameter. With a continuous variable, the principle is the same, but
we integrate over all possible values of the variable rather than summing.
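For the discrete case, the marginalization just described is a few lines of code. This is a minimal sketch with a made-up joint table over three binary variables; the function name is ours.

```python
from collections import defaultdict

def marginalize_last(joint):
    """Sum a joint table P(a, b, c) over c to obtain P(a, b)."""
    marginal = defaultdict(float)
    for (a, b, _c), p in joint.items():
        marginal[(a, b)] += p
    return dict(marginal)

# Hypothetical joint distribution; the eight entries sum to 1.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15,
    (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.05,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}
p_ab = marginalize_last(joint)
```

After the call, c no longer appears in the table, yet every value it could take has been accounted for. Integrating out π does the same thing with an integral in place of the sum.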
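So, we have
\begin{align}
P(\mathbf{L}, \mathbb{C}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu) &= \int_{\pi} P(\mathbf{L}, \mathbb{C}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1, \pi; \mu)\, d\pi \tag{34} \\
&= \int_{\pi} P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0})\, P(\mathbf{L} \mid \pi)\, P(\boldsymbol{\theta}_0 \mid \boldsymbol{\gamma}_{\theta})\, P(\boldsymbol{\theta}_1 \mid \boldsymbol{\gamma}_{\theta})\, P(\mathbb{C}_0 \mid \boldsymbol{\theta}_0, \mathbf{L})\, P(\mathbb{C}_1 \mid \boldsymbol{\theta}_1, \mathbf{L})\, d\pi \tag{35} \\
&= P(\boldsymbol{\theta}_0 \mid \boldsymbol{\gamma}_{\theta})\, P(\boldsymbol{\theta}_1 \mid \boldsymbol{\gamma}_{\theta})\, P(\mathbb{C}_0 \mid \boldsymbol{\theta}_0, \mathbf{L})\, P(\mathbb{C}_1 \mid \boldsymbol{\theta}_1, \mathbf{L}) \int_{\pi} P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0})\, P(\mathbf{L} \mid \pi)\, d\pi \tag{36}
\end{align}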
At this point let’s focus our attention on the integrand only and substitute the true distributions from
(17) and (21).
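\begin{align}
\int_{\pi} P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0})\, P(\mathbf{L} \mid \pi)\, d\pi &= \int_{\pi} \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})}\, \pi^{\gamma_{\pi 1} - 1} (1 - \pi)^{\gamma_{\pi 0} - 1}\, \pi^{C_1} (1 - \pi)^{C_0}\, d\pi \tag{37} \\
&= \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})} \int_{\pi} \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1}\, d\pi \tag{38}
\end{align}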
Here’s where our use of conjugate priors pays off. Notice that the integrand in (38) is an unnormalized Beta
distribution with parameters C1 + γπ1 and C0 + γπ0 . This means that the value of the integral is just the
normalizing constant for that distribution (the Beta function evaluated at those parameters), which is easy
to look up (e.g. in the entry for the Beta distribution in Wikipedia): the
normalizing constant for distribution Beta(C1 + γπ1 , C0 + γπ0 ) is
\[
\frac{\Gamma(C_1 + \gamma_{\pi 1})\,\Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(C_0 + C_1 + \gamma_{\pi 1} + \gamma_{\pi 0})}
\]
Making that substitution in (38), and also substituting N = C0 + C1 , we arrive at
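\begin{equation}
\int_{\pi} P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0})\, P(\mathbf{L} \mid \pi)\, d\pi = \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})} \cdot \frac{\Gamma(C_1 + \gamma_{\pi 1})\,\Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})} \tag{39}
\end{equation}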
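As a sanity check on (39), the closed form can be compared against a brute-force numerical integral. The sketch below assumes made-up counts and hyperparameters and uses a simple midpoint rule; the function names are ours, and `math.lgamma` is used so that the Gamma functions are evaluated in log space.

```python
import math

def log_beta_fn(a, b):
    """log of Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def closed_form(c1, c0, g1, g0):
    """Right-hand side of (39), computed in log space."""
    return math.exp(log_beta_fn(c1 + g1, c0 + g0) - log_beta_fn(g1, g0))

def numeric_integral(c1, c0, g1, g0, steps=100_000):
    """Midpoint-rule approximation of the left-hand side of (39)."""
    norm = math.exp(-log_beta_fn(g1, g0))   # Gamma(g1+g0) / (Gamma(g1) Gamma(g0))
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        p = (k + 0.5) * h
        total += norm * p ** (c1 + g1 - 1) * (1.0 - p) ** (c0 + g0 - 1)
    return total * h

# Hypothetical values: 3 documents labeled 1, 5 labeled 0, both hyperparameters 2.
exact = closed_form(3, 5, 2.0, 2.0)
approx = numeric_integral(3, 5, 2.0, 2.0)
```

The two values agree to many digits, which is the practical meaning of "the integral is just the normalizing constant".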
Substituting (39) and the definitions of the probability distributions from Section 2.4.1 back into (36)
gives us
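\begin{equation}
P(\mathbf{L}, \mathbb{C}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu) \propto \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})} \cdot \frac{\Gamma(C_1 + \gamma_{\pi 1})\,\Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})} \prod_{i=1}^{V} \theta_{0,i}^{N_{\mathbb{C}_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{\mathbb{C}_1}(i) + \gamma_{\theta i} - 1} \tag{40}
\end{equation}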
Okay, so it would be reasonable at this point to ask, “I thought the point of integrating out π was to
simplify things, so why did we just ’simplify’ the joint expression by adding in a bunch of these Gamma
functions everywhere?” Good question–hold that thought until we derive the sampler.
2.5
Building the Gibbs Sampler
The definition of a Gibbs sampler specifies that in each iteration we assign a new value to variable Zi by
sampling from the conditional distribution
\[
P(Z_i \mid z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_r^{(t)}).
\]
So, for example, to assign the value of $L_1^{(t+1)}$, we need to compute this conditional distribution:16
\[
P(L_1 \mid L_2^{(t)}, \ldots, L_N^{(t)}, \mathbb{C}, \boldsymbol{\theta}_0^{(t)}, \boldsymbol{\theta}_1^{(t)}; \mu),
\]
To assign the value of $L_2^{(t+1)}$, we need to compute
\[
P(L_2 \mid L_1^{(t+1)}, L_3^{(t)}, \ldots, L_N^{(t)}, \mathbb{C}, \boldsymbol{\theta}_0^{(t)}, \boldsymbol{\theta}_1^{(t)}; \mu),
\]
and so forth for $L_3^{(t+1)}$ through $L_N^{(t+1)}$.
Similarly, to assign the value of θ 0 we need to sample from the conditional distribution
\[
P(\boldsymbol{\theta}_0^{(t+1)} \mid L_1^{(t+1)}, L_2^{(t+1)}, \ldots, L_N^{(t+1)}, \mathbb{C}, \boldsymbol{\theta}_1^{(t)}; \mu),
\]
and, for θ 1 ,
\[
P(\boldsymbol{\theta}_1^{(t+1)} \mid L_1^{(t+1)}, L_2^{(t+1)}, \ldots, L_N^{(t+1)}, \mathbb{C}, \boldsymbol{\theta}_0^{(t+1)}; \mu).
\]
Intuitively, at the start of an iteration t, we have a collection of all our current information at this
point in the sampling process. That information includes the word count for each document, the number of
documents labeled 0, the number of documents labeled 1, the word counts for all documents labeled 0, the
word counts for all documents labeled 1, the current label for each document, the current values of θ 0 and θ 1 ,
etc. When we want to sample the new label for document j, we temporarily remove all information (i.e. word
counts and label information) about this document from that collection of information. Then we look at the
conditional probability that Lj = 0 given all the remaining information, and the conditional probability that
Lj = 1 given the same information, and we sample the new label $L_j^{(t+1)}$ by choosing randomly according to
the relative weight of those two conditional probabilities. Sampling to get the new values $\boldsymbol{\theta}_0^{(t+1)}$ and $\boldsymbol{\theta}_1^{(t+1)}$
operates according to the same principle.
16 There’s no superscript on the bags of words C because they’re fully observed and don’t change from iteration to iteration.
2.5.1
Sampling for Document Labels
Okay, so how do we actually do the sampling? We’re almost there. As the final step in our journey, we show
how to select all of the new document labels Lj and the new distributions θ 0 and θ 1 during each iteration
of the sampler. By definition of conditional probability,
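\begin{align}
P(L_j \mid \mathbf{L}^{(-j)}, \mathbb{C}^{(-j)}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu) &= \frac{P(L_j, W_j, \mathbf{L}^{(-j)}, \mathbb{C}^{(-j)}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu)}{P(\mathbf{L}^{(-j)}, \mathbb{C}^{(-j)}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu)} \tag{41} \\
&= \frac{P(\mathbf{L}, \mathbb{C}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu)}{P(\mathbf{L}^{(-j)}, \mathbb{C}^{(-j)}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu)} \tag{42}
\end{align}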
where L(−j) are all the document labels except Lj , and C(−j) is the set of all documents except Wj . The
distribution is over two possible outcomes, Lj = 0 and Lj = 1.
Notice that the numerator is just the complete joint probability described in (40). In the denominator,
we have the same expression, minus all the information about document Wj . Therefore we can work out
what (42) should look like by considering the three factors in (40), one at a time. For each factor, we will
remind ourselves of what it looks like in the numerator (which includes Wj ), and work out what it should
look like in the denominator (excluding Wj ), for each of the two outcomes.
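The first factor in (40),
\begin{equation}
\frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\,\Gamma(\gamma_{\pi 0})}, \tag{43}
\end{equation}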
is very easy. It depends only on the hyperparameters, so removing information about Wj has no impact
on its value. Since it will be the same in both the numerator and denominator of (42), it will cancel out.
Excellent! Two factors to go.
Let’s look at the second factor of (40). Taking all documents into account including Wj , this is
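\begin{equation}
\frac{\Gamma(C_1 + \gamma_{\pi 1})\,\Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})}. \tag{44}
\end{equation}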
Now, in order to compute the denominator of (42), we remove document Wj from consideration. How does
that change things? It depends on what Lj was during the previous iteration. Whether Lj = 0 or Lj = 1
during the previous iteration, the corpus size is effectively reduced from N to N − 1, and the size of one of
the document classes is smaller by one compared to its value in the numerator. If Lj = 0 when we remove
it, then we will have $C_0^{(-j)} = C_0 - 1$ and $C_1^{(-j)} = C_1$. If Lj = 1 during the previous iteration, then we will
have $C_0^{(-j)} = C_0$ and $C_1^{(-j)} = C_1 - 1$. In each case, removing Wj only changes the information we know
about one class, which means that of the two terms in the numerator of this factor, one of them is going to
be the same in the numerator and the denominator of (42). That’s going to cause the terms from one of the
classes (the one Wj did not belong to after the previous iteration) to cancel out.
Let x ∈ {0, 1} be the outcome we’re considering, i.e. the one for which $C_x^{(-j)} = C_x - 1$.17 If we use x
and reorganize the expression a bit, the second factor of (42) can be rewritten from
\[
\frac{\dfrac{\Gamma(C_1 + \gamma_{\pi 1})\,\Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})}}{\dfrac{\Gamma(C_0^{(-j)} + \gamma_{\pi 0})\,\Gamma(C_1^{(-j)} + \gamma_{\pi 1})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1)}}
\]
to
\begin{equation}
\frac{\Gamma(C_x + \gamma_{\pi x})\,\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1)}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})\,\Gamma(C_x + \gamma_{\pi x} - 1)} \tag{45}
\end{equation}
17 Crucially, note that x here is not $L_j^{(t)}$, the label of document j at the previous iteration. We are pulling document j out
of the collection, effectively obviating our knowledge of its current label, and then constructing a distribution over the two
possibilities, Lj = 0 and Lj = 1. So we need for our expression to take x as a parameter, allowing us to build a probability
distribution by asking “what if x = 0?” and “what if x = 1?”.
Using the fact that Γ(a + 1) = aΓ(a) for all a, we can simplify further to get
\begin{equation}
\frac{C_x + \gamma_{\pi x} - 1}{N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1} \tag{46}
\end{equation}
Look, no more pesky Gammas!
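The cancellation from (45) to (46) is easy to confirm numerically. The following is a minimal sketch with made-up counts and hyperparameters; the function names are ours. Expression (45) is evaluated through `math.lgamma` so that large arguments do not overflow.

```python
import math

def gamma_ratio(cx, gamma_x, n, gamma_1, gamma_0):
    """Expression (45), evaluated with log-gamma for numerical safety."""
    total = n + gamma_1 + gamma_0
    log_value = (math.lgamma(cx + gamma_x) + math.lgamma(total - 1)
                 - math.lgamma(total) - math.lgamma(cx + gamma_x - 1))
    return math.exp(log_value)

def simplified_ratio(cx, gamma_x, n, gamma_1, gamma_0):
    """Expression (46): what remains after applying Gamma(a + 1) = a Gamma(a)."""
    return (cx + gamma_x - 1) / (n + gamma_1 + gamma_0 - 1)

# Hypothetical values: 40 of 100 documents in class x = 1, hyperparameters 1.5 and 2.0.
with_gammas = gamma_ratio(40, 1.5, 100, 1.5, 2.0)
without_gammas = simplified_ratio(40, 1.5, 100, 1.5, 2.0)
```

In a sampler only the second function is needed, which is the payoff for carrying the Gamma functions through the derivation.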
Finally, the third factor in (40) is
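\begin{equation}
\prod_{i=1}^{V} \theta_{0,i}^{N_{\mathbb{C}_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{\mathbb{C}_1}(i) + \gamma_{\theta i} - 1}. \tag{47}
\end{equation}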
When we look at what changes when we remove Wj from consideration, in order to write the corresponding
expression in the denominator of (42), we see that it will behave in the same way as the second factor. One
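of the classes remains unchanged, so one of the terms, either the θ 0 or θ 1 , will cancel out when we do the
division. If, similar to above, we let x be the class for which $C_x^{(-j)} = C_x - 1$, then we can again capture
both cases in one expression:
\begin{equation}
\frac{\prod_{i=1}^{V} \theta_{x,i}^{N_{\mathbb{C}_x}(i) + \gamma_{\theta i} - 1}}{\prod_{i=1}^{V} \theta_{x,i}^{N_{\mathbb{C}_x^{(-j)}}(i) + \gamma_{\theta i} - 1}} = \prod_{i=1}^{V} \theta_{x,i}^{W_{ji}} \tag{48}
\end{equation}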
With that, we have finished working through the three factors in (40), which means we have worked out
how to express the numerator and denominator of (42), factor by factor. Recombining the factors, we get
the following expression for the conditional distribution in (42): for x ∈ {0, 1},
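\begin{equation}
\Pr(L_j = x \mid \mathbf{L}^{(-j)}, \mathbb{C}^{(-j)}, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \mu) = \frac{C_x + \gamma_{\pi x} - 1}{N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1} \prod_{i=1}^{V} \theta_{x,i}^{W_{ji}} \tag{49}
\end{equation}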
Let’s take a step back and take a look at what this equation is telling us about how a label is chosen. Its
first factor gives us an indication of how likely it is that Lj = x considering only the distribution of the other
labels. So, for example, if the corpus had more class 0 documents than class 1 documents, this factor would
tend to push the label toward class 0. Its second factor is like a word distribution “fitting room.” We get an
indication of how well the words in Wj “fit” with each of the two distributions. If, for example, the words
from Wj “fit” better with distribution θ 0 (by having a larger value for the document probability using θ 0 )
then this factor will push the label toward class 0 as well.
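The two factors of (49) translate directly into code. The sketch below is one possible rendering, not a reference implementation: the function names and the toy numbers are ours, the counts passed in are assumed to already exclude document j (so that C_x + γπx − 1 in (49) becomes `counts_without_j[x] + gamma_pi[x]`), and the products are computed in log space because a product over many words underflows quickly.

```python
import math
import random

def label_log_weights(word_counts, theta, counts_without_j, gamma_pi, n_docs):
    """Log of expression (49) for x = 0 and x = 1.

    counts_without_j[x]: documents currently in class x, not counting document j.
    gamma_pi[x]: the hyperparameter gamma_pi_x. n_docs: N, including document j.
    """
    denom = n_docs + gamma_pi[0] + gamma_pi[1] - 1
    log_w = []
    for x in (0, 1):
        lw = math.log((counts_without_j[x] + gamma_pi[x]) / denom)
        lw += sum(c * math.log(theta[x][i]) for i, c in enumerate(word_counts) if c)
        log_w.append(lw)
    return log_w

def sample_label(log_w, rng=random):
    """Normalize the two weights and flip a weighted coin."""
    m = max(log_w)                      # subtract the max before exponentiating
    v0, v1 = math.exp(log_w[0] - m), math.exp(log_w[1] - m)
    p1 = v1 / (v0 + v1)
    return (1 if rng.random() < p1 else 0), p1

# Hypothetical document with a 3-word vocabulary and made-up theta values.
log_w = label_log_weights([2, 0, 1], [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
                          counts_without_j=[3, 1], gamma_pi=[1.0, 1.0], n_docs=5)
label, p1 = sample_label(log_w, random.Random(0))
```

In this toy case both factors favor class 0: there are more class 0 documents, and the document's words fit θ0 better, so the probability of drawing label 1 is small.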
So (finally!) here’s the actual procedure to sample from the conditional distribution in (42):
1. Let value0 = expression (49) with x = 0
2. Let value1 = expression (49) with x = 1
3. Let the distribution be ⟨value0 / (value0 + value1), value1 / (value0 + value1)⟩
4. Select the value of $L_j^{(t+1)}$ as the result of a Bernoulli trial (weighted coin flip) according to this distribution.
2.5.2
Sampling for θ
We’ll follow a similar procedure to determine how to sample for new values of θ 0 and θ 1 . Since the estimation
of the two distributions is independent of one another, we’re going to omit the subscripts on θ to make the
notation a bit easier to digest. Just like above, we’ll need to derive an expression for the probability of θ
given all other variables, but our work is a bit simpler in this case. Observe,
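\begin{equation}
P(\boldsymbol{\theta} \mid \mathbb{C}, \mathbf{L}; \mu) \propto P(\mathbb{C}, \mathbf{L} \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mu) \tag{50}
\end{equation}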
Furthermore, recall that, since we used conjugate priors, this posterior, like the prior, works out to be a
Dirichlet distribution. We actually derived the full expression in Section 2.4.2, but we don’t need the full
expression here. All we need to do to sample a new distribution is to make another draw from a Dirichlet
distribution, but this time with parameters NCx (i) + γθi for each i in V . For notational convenience, let’s
define the V dimensional vector t such that each ti = NCx (i) + γθi , where x is again either 0 or 1 depending
on which θ we are resampling. We then sample a new θ as:
θ ∼ Dirichlet(t)
(51)
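The draw in (51) needs nothing more than gamma variates. The following is a minimal sketch, assuming only Python's standard library; the function name `sample_dirichlet` and the example counts are ours. It uses the gamma-normalization construction: draw one Gamma(α_i, 1) variate per dimension, then divide by their sum.

```python
import random

def sample_dirichlet(alphas, rng=random):
    """Draw one vector from Dirichlet(alphas) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]  # shape alpha_i, scale 1
    total = sum(draws)
    return [y / total for y in draws]

# Hypothetical parameter vector t: class word counts [2, 0, 5] plus pseudocounts of 1.
t = [3.0, 1.0, 6.0]
theta = sample_dirichlet(t, random.Random(0))
```

The result is a probability vector of the same length as `t`. Words with larger entries in `t` tend to receive more probability mass, but every draw is different.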
How do you actually implement sampling from a new Dirichlet distribution? To sample a random vector
a = ⟨a1 , . . . , aV ⟩ from the V -dimensional Dirichlet distribution with parameters ⟨α1 , . . . , αV ⟩, one fast way
is to draw V independent samples y1 , . . . , yV from gamma distributions, each with density
\begin{equation}
\text{Gamma}(\alpha_i, 1) = \frac{y_i^{\alpha_i - 1} e^{-y_i}}{\Gamma(\alpha_i)}, \tag{52}
\end{equation}
and then set $a_i = y_i / \sum_{j=1}^{V} y_j$ (i.e., just normalize each of the gamma distribution draws).18
2.5.3
Taking advantage of documents with labels
Using labeled documents is relatively painless: just don’t sample Lj for those documents! Always keep
Lj equal to the observed label. The documents will effectively serve as “ground truth” evidence for the
distributions that created them. Since we never sample for their labels, they will always contribute to the
counts in (49) and (51) and will never be subtracted out.
2.5.4
Putting it all together
Initialization.
Define the priors as in Section 2.2 and initialize them as described in Section 2.3.
for t := 1 to T do
    for j := 1 to N do
        if j is not a training document then
            Subtract j’s word counts from the total word counts of whatever class it’s currently a member of
            Subtract 1 from the count of documents with label Lj
            Assign a new label Lj^(t+1) to document j as described at the end of Section 2.5.1
            Add 1 to the count of documents with label Lj^(t+1)
            Add j’s word counts to the total word counts for class Lj^(t+1)
        end if
    end for
    t0 := vector of total word counts from class 0, including pseudocounts
    θ 0 ∼ Dirichlet(t0 ), as described in Section 2.5.2
    t1 := vector of total word counts from class 1, including pseudocounts
    θ 1 ∼ Dirichlet(t1 ), as described in Section 2.5.2
end for
Sampling iterations. Notice that as soon as a new label for Lj is assigned, this changes the counts that
will affect the labeling of the subsequent documents. This is, in fact, the whole principle behind a Gibbs
sampler!
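To make the loop concrete, here is one way the pseudocode might be rendered in Python. It is a sketch under several assumptions that are ours rather than the text's: documents are word-count vectors, γθ is symmetric (a single scalar `gamma_theta`), unlabeled documents start with uniformly random labels instead of the initialization of Section 2.3, and all function and variable names are hypothetical. The shared denominator of (49) is dropped because it cancels when the two weights are normalized.

```python
import math
import random

def sample_dirichlet(alphas, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [y / total for y in draws]

def gibbs_naive_bayes(docs, observed, vocab_size, iters,
                      gamma_pi=(1.0, 1.0), gamma_theta=1.0, seed=0):
    """docs: list of word-count vectors; observed: {doc index: known label}."""
    rng = random.Random(seed)
    labels = [observed.get(j, rng.randrange(2)) for j in range(len(docs))]
    doc_counts = [labels.count(0), labels.count(1)]
    word_counts = [[0] * vocab_size for _ in range(2)]
    for j, doc in enumerate(docs):
        for i, c in enumerate(doc):
            word_counts[labels[j]][i] += c
    theta = [sample_dirichlet([gamma_theta] * vocab_size, rng) for _ in range(2)]
    history = []
    for _ in range(iters):
        for j, doc in enumerate(docs):
            if j in observed:
                continue                      # labeled documents are never resampled
            old = labels[j]
            doc_counts[old] -= 1              # remove document j from the counts
            for i, c in enumerate(doc):
                word_counts[old][i] -= c
            log_w = []
            for x in (0, 1):                  # numerator of (49), in log space
                lw = math.log(doc_counts[x] + gamma_pi[x])
                lw += sum(c * math.log(theta[x][i]) for i, c in enumerate(doc) if c)
                log_w.append(lw)
            m = max(log_w)
            v0, v1 = math.exp(log_w[0] - m), math.exp(log_w[1] - m)
            new = 1 if rng.random() < v1 / (v0 + v1) else 0
            labels[j] = new
            doc_counts[new] += 1              # add document j back under its new label
            for i, c in enumerate(doc):
                word_counts[new][i] += c
        for x in (0, 1):                      # resample theta_0 and theta_1 as in (51)
            theta[x] = sample_dirichlet([n + gamma_theta for n in word_counts[x]], rng)
        history.append(list(labels))
    return history, theta

# Toy corpus over a 4-word vocabulary; documents 0 and 2 have observed labels.
docs = [[5, 4, 0, 0], [4, 5, 0, 1], [0, 0, 5, 4], [1, 0, 4, 5], [6, 3, 0, 0], [0, 1, 3, 6]]
history, theta = gibbs_naive_bayes(docs, {0: 0, 2: 1}, vocab_size=4, iters=50)
```

`history` holds one label vector per iteration, which is the raw material for the estimates discussed in Section 3.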
That concludes the discussion of how sampling is done. We’ll see how to get from the output of the
sampler to estimated values for the variables in Section 3.
18 For details, see http://en.wikipedia.org/wiki/Dirichlet_distribution (version of April 12, 2010).
2.6
Optional: A Note on Integrating out Continuous Parameters
At this point you might be asking yourself why we were able to integrate out the continuous parameter π
from our model, but did not do something similar with the two word distributions θ 0 and θ 1 . The idea of
doing this even looks promising, but there’s a subtle problem that will get us into trouble and end up leaving
us with an expression filled with Γ functions that will not cancel out. Let us go through the derivations and
see where it leads us. If you follow this piece of the discussion, then you really understand the details!19
Our goal here would be to obtain the probability that a document was generated from the same distribution that generated the words of a particular class of documents, Cx . We would then use this as a
replacement for the product in (48). We start first by calculating the probability of making a single word
draw given the other words in the class, subtracting out the information about Wj . In the equations that
follow there’s an implicit (−j) superscript on all of the counts. If we let wk denote the word at some position
k in Wj then,
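\begin{align}
\Pr(w_k = y \mid \mathbb{C}_x^{(-j)}; \boldsymbol{\gamma}_{\theta}) &= \int_{\Delta} \Pr(w_k = y \mid \boldsymbol{\theta})\, P(\boldsymbol{\theta} \mid \mathbb{C}_x^{(-j)}; \boldsymbol{\gamma}_{\theta})\, d\boldsymbol{\theta} \tag{53} \\
&= \int_{\Delta} \theta_{x,y}\, \frac{\Gamma\left(\sum_{i=1}^{V} (N_{\mathbb{C}_x}(i) + \gamma_{\theta i})\right)}{\prod_{i=1}^{V} \Gamma(N_{\mathbb{C}_x}(i) + \gamma_{\theta i})} \prod_{i=1}^{V} \theta_{x,i}^{N_{\mathbb{C}_x}(i) + \gamma_{\theta i} - 1}\, d\boldsymbol{\theta} \tag{54} \\
&= \frac{\Gamma\left(\sum_{i=1}^{V} (N_{\mathbb{C}_x}(i) + \gamma_{\theta i})\right)}{\prod_{i=1}^{V} \Gamma(N_{\mathbb{C}_x}(i) + \gamma_{\theta i})} \int_{\Delta} \theta_{x,y} \prod_{i=1}^{V} \theta_{x,i}^{N_{\mathbb{C}_x}(i) + \gamma_{\theta i} - 1}\, d\boldsymbol{\theta} \tag{55} \\
&= \frac{\Gamma\left(\sum_{i=1}^{V} (N_{\mathbb{C}_x}(i) + \gamma_{\theta i})\right)}{\prod_{i=1}^{V} \Gamma(N_{\mathbb{C}_x}(i) + \gamma_{\theta i})} \int_{\Delta} \theta_{x,y}^{N_{\mathbb{C}_x}(y) + \gamma_{\theta y}} \prod_{i=1 \wedge i \neq y}^{V} \theta_{x,i}^{N_{\mathbb{C}_x}(i) + \gamma_{\theta i} - 1}\, d\boldsymbol{\theta} \tag{56} \\
&= \frac{\Gamma\left(\sum_{i=1}^{V} (N_{\mathbb{C}_x}(i) + \gamma_{\theta i})\right)}{\prod_{i=1}^{V} \Gamma(N_{\mathbb{C}_x}(i) + \gamma_{\theta i})} \cdot \frac{\Gamma(N_{\mathbb{C}_x}(y) + \gamma_{\theta y} + 1) \prod_{i=1 \wedge i \neq y}^{V} \Gamma(N_{\mathbb{C}_x}(i) + \gamma_{\theta i})}{\Gamma\left(1 + \sum_{i=1}^{V} (N_{\mathbb{C}_x}(i) + \gamma_{\theta i})\right)} \tag{57} \\
&= \frac{N_{\mathbb{C}_x}(y) + \gamma_{\theta y}}{\sum_{i=1}^{V} (N_{\mathbb{C}_x}(i) + \gamma_{\theta i})} \tag{58}
\end{align}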
The process we use is actually the same as the process used to integrate π, just in the multidimensional
case. The set ∆ is the probability simplex of θ, namely the set of all θ such that Σi θi = 1. We get to (54)
by substitution from the formulas we derived in Section 2.4.2, then (55) by factoring out the normalization
constant for the Dirichlet distribution, since it is constant with respect to θ. Note that the integrand of
(56) is actually another Dirichlet distribution, so its integral is its normalization constant (same reasoning
as before). We substitute this in to obtain (57). Using the property of Γ that Γ(x + 1) = xΓ(x) for all x, we
can again cancel all of the Γ terms.
At this point, even though we have a simple and intuitive result for the probability of drawing a single
word from a Dirichlet, we actually need the probability of drawing all words in Wj from the Dirichlet
distribution. What we’d really like to do is assume that the words within a particular document are drawn
from the same distribution, and just calculate the probability of Wj by multiplying (58) over all words in the
vocabulary. But we cannot do that: without the values of θ being known, the draws are not independent,
because each draw changes what our estimate of θ is!
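A tiny exact calculation shows the dependence. The sketch below uses made-up counts and pseudocounts and hypothetical function names. It relies on the standard fact that, with θ integrated out, the second draw is scored with (58) after the first draw has been added to the counts, and it compares that joint probability with the naive square of (58).

```python
from fractions import Fraction

def predictive(counts, gammas, y):
    """Equation (58): probability that a single new word is word y."""
    return Fraction(counts[y] + gammas[y], sum(counts) + sum(gammas))

def prob_two_words(counts, gammas, y1, y2):
    """Exact probability of drawing y1 and then y2 with theta integrated out."""
    first = predictive(counts, gammas, y1)
    updated = list(counts)
    updated[y1] += 1                    # the first draw now counts as evidence
    return first * predictive(updated, gammas, y2)

# Hypothetical class counts over a 3-word vocabulary, pseudocounts of 1.
counts = [2, 1, 0]
gammas = [1, 1, 1]
joint = prob_two_words(counts, gammas, 0, 0)    # word 0 drawn twice
naive = predictive(counts, gammas, 0) ** 2      # pretending the draws are independent
```

Here the joint probability is 2/7 while the naive product is 1/4: seeing word 0 once makes it more likely to be seen again.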
We can see this in two different ways. First, instead of drawing one word in equation (53), do the derivation
by drawing two words at a time.20 You’ll find that once you hit (57), you’ll have an extra set of Gamma
functions that won’t cancel out nicely. The second way to see it is actually by looking at the plate diagram
for the model, Figure 4. Each θ effectively has Rj arrows coming out of it for each document j to individual
words, so every word within a document is in the Markov blanket 21 of the others; therefore we can’t assume
that the words are independent without a fixed θ. At issue here isn’t just that there are multiple instances
of words coming out of each theta, but crucially that those instances are sampled at the same time.
19 The authors thank Wu Ke, who really understood the details, for pointing out our error in an earlier version of this document
and providing the illustrating example we go through next.
20 This is what Wu Ke did to demonstrate his point to us.
The lesson here is to be careful when you integrate out parameters. If you’re doing a single draw from a
multinomial, then integrating out a continuous parameter can make the sampler simpler, since you won’t have
to sample for it at every iteration. If, on the other hand, you do multiple draws from the same multinomial,
integration (although possible) will result in an expression that involves Gamma functions. Calculating
Gamma functions is undesirable since they are computationally expensive, so they can slow down a sampler
significantly.
3
Producing values from the output of a Gibbs sampler
The initialization and sampling iterations in the Gibbs sampling algorithm will produce values for each of
the variables, for iterations t = 1, 2, . . . , T . In theory, the approximated value for any variable Zi can simply
be obtained by calculating:
\begin{equation}
\frac{1}{T} \sum_{t=1}^{T} z_i^{(t)}, \tag{59}
\end{equation}
as discussed in equation (12). However, expression (59) is not always used directly. There are several
additional details to note that are a part of typical sampling practice.22
Convergence and burn-in iterations. Depending on the values chosen during the initialization step,
it may take some time for the Gibbs sampler to reach a point where the points $\langle z_1^{(t)}, z_2^{(t)}, \ldots, z_r^{(t)} \rangle$ are all
coming from the stationary distribution of the Markov chain, which is an assumption of the approximation
in (59). In order to avoid the estimates being contaminated by the values at iterations before that point,
practitioners generally discard the values at iterations t ≤ B, which are referred to as the “burn-in”
iterations, so that the average in (59) is taken only over iterations B + 1 through T .23
Autocorrelation and lag. The approximation in (59) assumes the samples for Zi are independent, but
we know they’re not, because they were produced by a process that conditions on the previous point in the
chain to generate the next point. This is referred to as autocorrelation (sometimes serial autocorrelation),
i.e. correlation between successive values in the data.24 In order to avoid autocorrelation problems (so that
the chain “mixes well”), many implementations of Gibbs sampling average only every Lth value, where L is
referred to as the lag.25 In this context, Jordan Boyd-Graber (personal communication) also recommends
looking at Neal’s [15] discussion of likelihood as a metric of convergence.
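Burn-in and lag amount to a slice over the stored samples. The sketch below assumes the sampled values of one variable are kept in a Python list, with index 0 holding iteration 1; the function names and the sample values are made up. The autocorrelation function shown is one common estimator and is not necessarily the one a given package uses.

```python
def estimate_from_samples(samples, burn_in=0, lag=1):
    """Average as in (59), after dropping the first burn_in values and keeping every lag-th one."""
    kept = samples[burn_in::lag]
    if not kept:
        raise ValueError("no samples left after burn-in")
    return sum(kept) / len(kept)

def autocorrelation(samples, lag):
    """Sample correlation between z^(t) and z^(t-lag) for a single variable."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples)
    cov = sum((samples[t] - mean) * (samples[t - lag] - mean) for t in range(lag, n))
    return cov / var

# Hypothetical label samples for one document over ten iterations.
samples = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
plain = estimate_from_samples(samples)                      # all ten values
thinned = estimate_from_samples(samples, burn_in=4, lag=2)  # iterations 5, 7, 9
rho_0 = autocorrelation(samples, 0)
```

For a binary label the average can be read as the estimated probability that the document belongs to class 1.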
21 The Markov blanket of a node in a graphical model consists of that node’s parents, its children, and the coparents of its
children. [16].
22 Jason Eisner (personal communication) argues, with some support from the literature, that burn-in, lag, and multiple
chains are in fact unnecessary and it is perfectly correct to do a single long sampling run and keep all samples. See [4, 5],
MacKay ([13], end of section 29.9, page 381) and Koller and Friedman ([10], end of section 12.3.5.2, page 522).
23 As far as we can tell, there is no principled way to choose the “right” value for B in advance. There are a variety of
ways to test for convergence, and to measure autocorrelation; see, e.g., Brian J. Smith, “boa: An R Package for MCMC
Output Convergence Assessment and Posterior Inference”, Journal of Statistical Software, November 2007, Volume 21, Issue
11, http://www.jstatsoft.org/ for practical discussion. However, from what we can tell, many people just choose a really big
value for T , pick B to be large also, and assume that their samples are coming from a chain that has converged.
24 Lohninger [11] observes that “most inferential tests and modeling techniques fail if data is autocorrelated”.
25 Again, the choice of L seems to be more a matter of art than science: people seem to look at plots of the autocorrelation
for different values of L and use a value for which the autocorrelation drops off quickly. The autocorrelation for variable Zi
with lag L is simply the correlation between the sequence $Z_i^{(t)}$ and the sequence $Z_i^{(t-L)}$. Which correlation function is used
seems to vary.
Multiple chains. As is the case for many other stochastic algorithms (e.g. expectation maximization as
used in the forward-backward algorithm for HMMs), people often try to avoid sensitivity to the starting
point chosen at initialization time by doing multiple runs from different starting points. For Gibbs sampling
and other Markov Chain Monte Carlo methods, these are referred to as “multiple chains”.26
Hyperparameter sampling. Rather than simply picking hyperparameters, it is possible, and in fact
often critical, to assign their values via sampling (Boyd-Graber, personal communication). See, e.g., Wallach
et al. [20] and Escobar and West [3].
4
Conclusions
The main point of this document has been to take some of the mystery out of Gibbs sampling for computer
scientists who want to get their hands dirty and try it out. Like any other technique, however, caveat lector:
using a tool with only limited understanding of its theoretical foundations can produce undetected mistakes,
misleading results, or frustration.
As a first step toward getting further up to speed on the relevant background, Ted Pedersen’s [18]
doctoral dissertation has a very nice discussion of parameter estimation in Chapter 4, including a detailed
exposition of an EM algorithm for Naı̈ve Bayes and his own derivation of a Naı̈ve Bayes Gibbs sampler that
highlights the relationship to EM. (He works through several iterations of each algorithm explicitly, which
in our opinion merits a standing ovation.) The ideas introduced in Chapter 4 are applied in Pedersen and
Bruce [17]; note that the brief description of the Gibbs sampler there makes an awful lot more sense after
you’ve read Pedersen’s dissertation chapter.
We also recommend Gregor Heinrich’s [7] “Parameter estimation for text analysis.” Heinrich presents
fundamentals of Bayesian inference starting with a nice discussion of basics like maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, all with an eye toward dealing with text.
(We followed his discussion closely above in Section 1.1.) Also, his is one of the few papers we’ve been
able to find that actually provides pseudo-code for a Gibbs sampler. Heinrich discusses in detail Gibbs
sampling for the widely discussed Latent Dirichlet Allocation (LDA) model, and his corresponding code is
at http://www.arbylon.net/projects/LdaGibbsSampler.java.
For a textbook-style exposition, see Bishop [2]. The relevant pieces of the book are a little less standalone than we’d like (which makes sense for a course on machine learning, as opposed to just trying to dive
straight into a specific topic); Chapter 11 (Sampling Methods) is most relevant, though you may also find
yourself referring back to Chapter 8 (Graphical Models).
Those ready to dive into the topic of Markov Chain Monte Carlo in more depth might want to start
with Andrieu et al. [1]. We and Andrieu et al. appear to differ somewhat on the semantics of the word
“introduction,” which is one of the reasons the document you’re reading exists.
Finally, for people interested in the use of Gibbs sampling for structured models in NLP (e.g. parsing),
the right place to start is undoubtedly Kevin Knight’s excellent “Bayesian Inference with Tears: A tutorial
workbook for natural language researchers” [9], after which you’ll be equipped to look at Johnson, Griffiths,
and Goldwater [8].27 The leap from the models discussed here to those kinds of models actually turns out
to be a lot less scary than it appears at first. The main thing to observe is that in the crucial sampling
step (equation (14) of Section 1.2.3), the denominator is just the numerator without $z_i^{(t)}$, the variable whose
new value you’re choosing. So when you’re sampling conditional distributions (e.g. Sections 2.5.1–2.5.4)
in more complex models, the basic idea will be the same: you subtract out counts related to the variable
you’re interested in based on its current value, compute proportions based on the remaining counts, then
pick probabilistically based on the result, and finally add counts back in according to the probabilistic choice
you just made.
26 Again, there seems to be as much art as science in whether to use multiple chains, how many, and how to combine them to
get a single output. Chris Dyer (personal communication) reports that it is not uncommon simply to concatenate the chains
together after removing the burn-in iterations.
27 As an aside, our travels through the literature in writing this document led to an interesting early use of Gibbs sampling
with CFGs: Grate et al. [6]. Johnson et al. had not come across this when they wrote their seminal paper introducing Gibbs
sampling for PCFGs to the NLP community (and in fact a search on scholar.google.com turned up no citations in the NLP
literature). Mark Johnson (personal communication) observes that the “local move” Gibbs sampler Grate et al. describe is
specialized to a particular PCFG, and it’s not clear how to generalize it to arbitrary PCFGs.
Acknowledgments
The creation of this document has been supported in part by the National Science Foundation (award IIS0838801), the GALE program of the Defense Advanced Research Projects Agency (Contract No. HR001106-2-001), and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research
Projects Activity (IARPA), through the Army Research Laboratory. All statements of fact, opinion, or
conclusions contained herein are those of the authors and should not be construed as representing the official
views or policies of NSF, DARPA, IARPA, the ODNI, the U.S. Government, the University of Maryland,
nor of the Resnik or Hardisty families or any of their close relations, friends, or family pets.
The authors are grateful to Jordan Boyd-Graber, Bob Carpenter, Chris Dyer, Jason Eisner, John Goldsmith, Kevin Knight, Mark Johnson, Nitin Madnani, Neil Parikh, Sasa Petrovic, William Schuler, Prithvi
Sen, Wu Ke and Matt Pico (so far!) for helpful comments and/or catching glitches in the manuscript. Extra
thanks to Jordan Boyd-Graber for many helpful discussions and extra assistance with plate diagrams, and
to Jay Resnik for his help debugging the LaTeX document.
References
[1] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning.
Machine Learning, 50:5–43, 2003.
[2] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the
American Statistical Association, 90:577–588, June 1995.
[4] C. Geyer. Burn-in is unnecessary, 2009. http://www.stat.umn.edu/~charlie/mcmc/burn.html, Downloaded October 18, 2009.
[5] C. Geyer. One long run, 2009. http://www.stat.umn.edu/~charlie/mcmc/one.html.
[6] L. Grate, M. Herbster, R. Hughey, D. Haussler, I. S. Mian, and H. Noller. RNA modeling using Gibbs
sampling and stochastic context free grammars. In ISMB, pages 138–146, 1994.
[7] G. Heinrich. Parameter estimation for text analysis. Technical Note Version 2.4, vsonix GmbH and
University of Leipzig, August 2008. http://www.arbylon.net/publications/text-est.pdf.
[8] M. Johnson, T. Griffiths, and S. Goldwater. Bayesian inference for PCFGs via Markov chain Monte
Carlo. In Human Language Technologies 2007: The Conference of the North American Chapter of
the Association for Computational Linguistics; Proceedings of the Main Conference, pages 139–146,
Rochester, New York, April 2007. Association for Computational Linguistics.
[9] K. Knight. Bayesian inference with tears: A tutorial workbook for natural language researchers, 2009.
http://www.isi.edu/natural-language/people/bayes-with-tears.pdf.
[10] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press,
2009.
[11] H. Lohninger. Teach/Me Data Analysis. Springer-Verlag, 1999.
http://www.vias.org/tmdatanaleng/cc_corr_auto_1.html.
[12] A. Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–49, August 2008.
[13] D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press,
2003.
[14] C. Manning and H. Schuetze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[15] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report
CRG-TR-93-1, University of Toronto, 1993. http://www.cs.toronto.edu/~radford/ftp/review.pdf.
[16] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
[17] T. Pedersen. Knowledge lean word sense disambiguation. In AAAI/IAAI, page 814, 1997.
[18] T. Pedersen. Learning Probabilistic Models of Word Sense Disambiguation. PhD thesis, Southern
Methodist University, 1998. http://arxiv.org/abs/0707.3972.
[19] S. Theodoridis and K. Koutroumbas. Pattern Recognition, 4th Ed. Academic Press, 2009.
[20] H. Wallach, C. Sutton, and A. McCallum. Bayesian modeling of dependency trees using hierarchical
Pitman-Yor priors. In ICML Workshop on Prior Knowledge for Text and Language Processing, 2008.