CS-TR-4956, UMIACS-TR-2010-04, LAMP-TR-153
June 2010

GIBBS SAMPLING FOR THE UNINITIATED

Philip Resnik
Department of Linguistics and Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742-3275
resnik AT umd.edu

Eric Hardisty
Department of Computer Science and Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742-3275
hardisty AT cs.umd.edu

Abstract

This document is intended for computer scientists who would like to try out a Markov Chain Monte Carlo (MCMC) technique, particularly in order to do inference with Bayesian models on problems related to text processing. We try to keep theory to the absolute minimum needed, though we work through the details much more explicitly than you usually see even in “introductory” explanations. That means we’ve attempted to be ridiculously explicit in our exposition and notation. After providing the reasons and reasoning behind Gibbs sampling (and at least nodding our heads in the direction of theory), we work through an example application in detail: the derivation of a Gibbs sampler for a Naïve Bayes model. Along with the example, we discuss some practical implementation issues, including the integrating out of continuous parameters when possible. We conclude with some pointers to literature that we’ve found to be somewhat more friendly to uninitiated readers.

Note: as of June 3, 2010 we have corrected some small errors in the original April 2010 report.
Keywords: Gibbs sampling, Markov Chain Monte Carlo, naïve Bayes, Bayesian inference, tutorial

1 Introduction

Markov Chain Monte Carlo (MCMC) techniques like Gibbs sampling provide a principled way to approximate the value of an integral.

1.1 Why integrals?

Ok, stop right there. Many computer scientists, including a lot of us who focus in natural language processing, don’t spend a lot of time with integrals. We spend most of our time and energy in a world of discrete events. (The word bank can mean (1) a financial institution, (2) the side of a river, or (3) tilting an airplane. Which meaning was intended, based on the words that appear nearby?)
Take a look at Manning and Schuetze [14], and you’ll see that the probabilistic models we use tend to involve sums, not integrals (the Baum-Welch algorithm for HMMs, for example). So we have to start by asking: why and when do we care about integrals?

One good answer has to do with probability estimation.¹ Numerous computational methods involve estimating the probabilities of alternative discrete choices, often in order to pick the single most probable choice. As one example, the language model in an automatic speech recognition (ASR) system estimates the probability of the next word given the previous context. As another example, many spam blockers use features of the e-mail message (like the word Viagra, or the phrase send this message to all your friends) to predict the probability that the message is spam.

¹ This subsection is built around the very nice explication of Bayesian probability estimation by Heinrich [7].

Sometimes we estimate probabilities by using maximum likelihood estimation (MLE). To use a standard example, if we are told a coin may be unfair, and we flip it 10 times and see HHHHTTTTTT (H=heads, T=tails), it’s conventional to estimate the probability of heads for the next flip as 0.4. In practical terms, MLE amounts to counting and then normalizing so that the probabilities sum to 1.

$$\frac{\text{count}(H)}{\text{count}(H) + \text{count}(T)} = \frac{4}{10} = 0.4 \tag{1}$$

Figure 1: Probability of generating the coin-flip sequence HHHHTTTTTT, using different values for $P(\text{heads})$ on the x-axis. The value that maximizes the probability of the observed sequence, 0.4, is the maximum likelihood estimate (MLE). (Plot of $p^4(1-p)^6$ for $p$ from 0 to 1.)

Formally, MLE produces the choice most likely to have generated the observed data.
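To make "counting and then normalizing" concrete, here is a minimal Python sketch for the coin example. The function names (`mle_heads`, `likelihood`) and the 1001-point grid are our own illustrative choices, not part of any library; the grid search is only a sanity check that the counting estimate really is the value that maximizes the probability of the observed sequence.

```python
from collections import Counter


def mle_heads(flips: str) -> float:
    """Count and normalize: the MLE of P(heads) for a string of 'H'/'T' flips."""
    counts = Counter(flips)
    return counts["H"] / (counts["H"] + counts["T"])


def likelihood(p: float, flips: str) -> float:
    """Probability of generating exactly this sequence when P(heads) = p."""
    heads = flips.count("H")
    tails = flips.count("T")
    return p ** heads * (1 - p) ** tails


flips = "HHHHTTTTTT"
pi_mle = mle_heads(flips)

# Sanity check: brute-force search over a grid of candidate values for P(heads).
grid = [i / 1000 for i in range(1001)]
pi_grid = max(grid, key=lambda p: likelihood(p, flips))

print(pi_mle, pi_grid)
```

Both numbers should agree, which is just equation (1) and the peak of the curve in Figure 1 seen from two directions.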
In this case, the most natural model $\mu$ has just a single parameter, $\pi$, namely the probability of heads (see Figure 1).² Letting $X$ = HHHHTTTTTT represent the observed data, and $y$ the outcome of the next coin flip, we estimate

$$\tilde{\pi}_{MLE} = \operatorname*{argmax}_{\pi} P(X \mid \pi) \tag{2}$$

$$P(y \mid X) \approx P(y \mid \tilde{\pi}_{MLE}) \tag{3}$$

(Figure 1 credit: generated by Wolfram|Alpha (www.wolframalpha.com) on September 24, 2009; © Wolfram Alpha LLC.)

On the other hand, sometimes we estimate probabilities using maximum a posteriori (MAP) estimation. A MAP estimate is the choice that is most likely given the observed data. In this case,

$$\tilde{\pi}_{MAP} = \operatorname*{argmax}_{\pi} P(\pi \mid X) = \operatorname*{argmax}_{\pi} \frac{P(X \mid \pi)P(\pi)}{P(X)} = \operatorname*{argmax}_{\pi} P(X \mid \pi)P(\pi) \tag{4}$$

$$P(y \mid X) \approx P(y \mid \tilde{\pi}_{MAP}) \tag{5}$$

In contrast to MLE, MAP estimation applies Bayes’s Rule, so that our estimate (4) can take into account prior knowledge about what we expect $\pi$ to be in the form of a prior probability distribution $P(\pi)$.³

² Specifically, $\mu$ models each choice as a Bernoulli trial, and the probability of generating exactly this heads-tails sequence for a given $\pi$ is $\pi^4(1-\pi)^6$. If you type Plot[p^4(1-p)^6,{p,0,1}] into Wolfram Alpha, you get Figure 1, and you can immediately see that the curve tops out, i.e. the probability of generating the sequence is highest, exactly when p = 0.4. Confirm this by entering derivative of p^4(1-p)^6 and you’ll get $\frac{2}{5} = 0.4$ as the maximum. Thanks to Kevin Knight for pointing out how easy all this is using Wolfram Alpha. Also see discussion in Heinrich [7], Section 2.1.

³ We got to (4) from the desired posterior probability by applying Bayes’s Rule and then ignoring the denominator since the argmax doesn’t depend on it.

So, for example, we might believe that the coin flipper is a scrupulously honest person, and choose a prior distribution that is biased in favor of $\pi = 0.5$.
The more heavily biased that prior distribution is, the more evidence it will take to shake our pre-existing belief that the coin is fair.⁴

Now, MLE and MAP estimates are both giving us the best estimate, according to their respective definitions of “best.” But notice that using a single estimate (whether it’s $\tilde{\pi}_{MLE}$ or $\tilde{\pi}_{MAP}$) throws away information. In principle, $\pi$ could have any value between 0 and 1; might we not get better estimates if we took the whole distribution $P(\pi \mid X)$ into account, rather than just a single estimated value for $\pi$? If we do that, we’re making use of all the information about $\pi$ that we can wring from the observed data, $X$.

The way to take advantage of all that information is to calculate an expected value rather than an estimate using the single best guess for $\pi$. Recall that the expected value of a function $f(z)$, when $z$ is a discrete variable, is

$$E[f(z)] = \sum_{z \in Z} f(z)p(z). \tag{6}$$

Here $Z$ is the set of discrete values $z$ can take, and $p(z)$ is the probability distribution over possible values for $z$. If $z$ is a continuous variable, the expected value is an integral rather than a sum:

$$E[f(z)] = \int f(z)p(z)\,dz. \tag{7}$$

For our example, $z = \pi$, the function $f$ we’re interested in is $f(z) = P(y \mid \pi)$, and the distribution over which we’re taking the expectation is $P(\pi \mid X)$, i.e. the whole distribution over possible values of $\pi$ given that we’ve observed $X$. That gives us the following expected value for the posterior probability of $y$ given $X$:

$$P(y \mid X) = \int P(y \mid \pi)P(\pi \mid X)\,d\pi \tag{8}$$

where Bayes’s Rule defines

$$P(\pi \mid X) = \frac{P(X \mid \pi)P(\pi)}{P(X)} = \frac{P(X \mid \pi)P(\pi)}{\int_{\pi} P(X \mid \pi)P(\pi)\,d\pi}. \tag{9}$$

Notice that, unlike (3) and (5), Equation (8) defines the posterior using a true equality, not an approximation. It takes fully into account our prior beliefs about what the value of $\pi$ will be, along with the interaction of those prior beliefs with observed evidence $X$.

Equations (8) and (9) provide one compelling answer to the question we started with.
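For the coin example, equations (8) and (9) are simple enough to approximate by brute force, which makes it easy to see how the three estimates differ. The sketch below assumes a Beta(2, 2) prior on $\pi$ (our choice for illustration; the text above has not committed to any particular prior) and uses a plain midpoint Riemann sum. The function name and the `steps` parameter are hypothetical, and the closed forms used for comparison are the standard results for a Beta prior with Bernoulli observations.

```python
def posterior_predictive_heads(heads, tails, a=2.0, b=2.0, steps=10_000):
    """Approximate P(y = H | X) from equation (8) with a midpoint Riemann sum.

    Assumes a Beta(a, b) prior on pi, written without its normalizing
    constant (the constant cancels between numerator and denominator).
    """
    width = 1.0 / steps
    numerator = 0.0  # integral of P(y=H|pi) * P(X|pi) * P(pi)
    evidence = 0.0   # integral of P(X|pi) * P(pi), the denominator of (9)
    for k in range(steps):
        pi = (k + 0.5) * width
        weight = pi ** heads * (1 - pi) ** tails        # P(X | pi)
        weight *= pi ** (a - 1) * (1 - pi) ** (b - 1)   # prior, up to a constant
        numerator += pi * weight * width                # P(y = H | pi) = pi
        evidence += weight * width
    return numerator / evidence


heads, tails = 4, 6
a = b = 2.0

approx = posterior_predictive_heads(heads, tails, a, b)
exact = (heads + a) / (heads + tails + a + b)                  # posterior mean
mle = heads / (heads + tails)                                  # equation (2)
map_estimate = (heads + a - 1) / (heads + tails + a + b - 2)   # equation (4)

print(mle, map_estimate, approx, exact)
```

With this prior the MLE is 0.4, the MAP estimate is pulled toward 0.5, and the full expectation is pulled a little further still.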
Why should even discrete-minded computer scientists care about integrals? Because even when the probability space is discrete, we often care about good estimates of posterior probabilities. Computing integrals can help us improve the parameter estimates in our models.⁵

⁴ See http://www.math.uah.edu/STAT/objects/experiments/BetaCoinExperiment.xhtml for a nice applet that lets you explore this idea. If you set a = b = 10, you get a prior strongly biased toward 0.5, and it’s hard to move the posterior too far from that value even if you generate observed heads with probability p = 0.8. If you set a = b = 2, there’s still a bias toward 0.5 but it’s much easier to move the posterior off that value. As a second pointer, see some nice, self-contained slides at http://www.cs.cmu.edu/~lewicki/cp-s08/Bayesian-inference.pdf.

⁵ Chris Dyer (personal communication) points out you don’t have to be doing Bayesian estimation to care about expected values. For example, better ways to compute expected values can be useful in the E step of expectation-maximization algorithms, which give you maximum likelihood estimates for models with latent variables. He also points out that for many models, Bayesian parameter estimation can be a whole lot easier to implement than EM. The widely used GIZA++ implementation of IBM Model 3 (a probabilistic model used in statistical machine translation [12]) contains 2186 lines of code; Chris implemented a Gibbs sampler for Model 3 in 67 lines. On a related note, Kevin Knight’s excellent “Bayesian Inference with Tears: A tutorial workbook for natural language researchers” [9] was written with goals very similar to our own, but from an almost completely complementary angle: he emphasizes conceptual connections to EM algorithms and focuses on the kinds of structured problems you tend to see in natural language processing.

1.2 Why sampling?

The trouble with integrals, of course, is that they can be very difficult to calculate.
The methods we learned in calculus class are fine for classroom exercises, but often cannot be applied to interesting problems in the real world. Indeed, analytical solutions to (8) and the denominator of (9) might be impossible to obtain, so we might not be able to determine the exact form of $P(\pi \mid X)$. Gibbs sampling allows us to sample from a distribution that asymptotically follows $P(\pi \mid X)$ without having to explicitly calculate the integrals.

1.2.1 Monte Carlo: a circle, a square, and a bag of rice

Gibbs Sampling is an instance of a Markov Chain Monte Carlo technique. Let’s start with the “Monte Carlo” part. You can think of Monte Carlo methods as algorithms that help you obtain a desired value by performing simulations involving probabilistic choices. As a simple example, here’s a cute, low-tech Monte Carlo technique for estimating the value of $\pi$ (the ratio of a circle’s circumference to its diameter).⁶

Draw a perfect square on the ground. Inscribe a circle in it, i.e. the circle and the square are centered in exactly the same place, and the circle’s diameter $d$ has length identical to the side of the square. Now take a bag of rice, and scatter the grains uniformly at random inside the square. Finally, count the total number of grains of rice inside the circle (call that $C$), and inside the square (call that $S$).

You scattered rice at random. Assuming you managed to do this pretty uniformly, the ratio between the circle’s grains and the square’s grains (which include the circle’s) should approximate the ratio between the area of the circle and the area of the square, so

$$\frac{C}{S} \approx \frac{\pi\left(\frac{d}{2}\right)^2}{d^2}. \tag{10}$$

Solving for $\pi$, we get $\pi \approx \frac{4C}{S}$.

You may not have realized it, but we just solved a problem by approximating the values of integrals. The true area of the circle, $\pi\left(\frac{d}{2}\right)^2$, is the result of summing up an infinite number of infinitesimally small points; similarly for the true area $d^2$ of the square. The more grains of rice we use, the better our approximation will be.
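If you don’t have a bag of rice handy, the experiment is easy to simulate. The following is a minimal sketch, assuming a square of side $d = 1$ centered at the origin; the function name `estimate_pi` and the fixed seed are our own choices, made only so that the run is repeatable.

```python
import random


def estimate_pi(num_grains: int, seed: int = 0) -> float:
    """Scatter grains uniformly in a unit square and count those in the circle."""
    rng = random.Random(seed)
    in_circle = 0  # this is C; every grain lands in the square, so S = num_grains
    for _ in range(num_grains):
        x = rng.uniform(-0.5, 0.5)
        y = rng.uniform(-0.5, 0.5)
        if x * x + y * y <= 0.25:  # inside the inscribed circle of radius d/2
            in_circle += 1
    return 4 * in_circle / num_grains  # pi is approximately 4C/S


for n in (100, 10_000, 200_000):
    print(n, estimate_pi(n))

estimate = estimate_pi(200_000)
```

Running it with more and more grains shows the estimate settling down near 3.14, slowly: the error of this kind of estimate shrinks roughly in proportion to one over the square root of the number of grains.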
⁶ We’re elaborating on the introductory example at http://en.wikipedia.org/wiki/Monte_Carlo_method.

1.2.2 Markov Chains: walking the right walk

In the circle-and-square example, we saw the value of sampling involving a uniform distribution, since the grains of rice were distributed uniformly within the square. Returning to the problem of computing expected values, recall that we’re interested in $E_{p(z)}[f(z)]$ (equation 7), where we’ll assume that the distribution $p(z)$ is not uniform and, in fact, not easy to work with analytically. Figure 2 provides an example $f(z)$ and $p(z)$ for illustration.

Figure 2: Example of computing expectation $E_{p(z)}[f(z)]$; the plot shows the curves $f(z)$ and $p(z)$ over $Z$, with points $z_1$, $z_2$, $z_3$, $z_4$ marked on the axis. (This figure was adapted from page 7 of the handouts for Chris Bishop’s presentation “NATO ASI: Learning Theory and Practice”, Leuven, July 2002, http://research.microsoft.com/en-us/um/people/cmbishop/downloads/bishop-nato-2.pdf )

Conceptually, the integral in equation (7) sums up $f(z)p(z)$ over infinitely many values of $z$. But rather than touching each point in the sum exactly once, here’s another way to think about it: if you sample $N$ points $z^{(1)}, z^{(2)}, \ldots, z^{(N)}$ at random from the probability density $p(z)$, then

$$E_{p(z)}[f(z)] = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} f(z^{(t)}). \tag{11}$$

That looks a lot like a kind of averaged value for $f$, which makes a lot of sense since in the discrete case (equation 6) the expected value is nothing but a weighted average, where the weight for each value of $z$ is its probability.

Notice, though, that the value in the sum is just $f(z^{(t)})$, not $f(z^{(t)})p(z^{(t)})$ as in the integral in equation (7). Where did the $p(z)$ part go? Intuitively, if we’re sampling according to $p(z)$, and $\text{count}(z)$ is the number of times we observe $z$ in the sample, then $\frac{1}{N}\text{count}(z)$ approaches $p(z)$ as $N \to \infty$.
So the $p(z)$ is implicit in the way the samples are drawn.⁷

⁷ Okay, we’re playing a little fast and loose here by talking about counts: with $Z$ continuous, we’re not going to see two identical samples $z^{(i)} = z^{(j)}$, so it doesn’t really make sense to talk about counting how many times a value was seen. But we did say “intuitively”, right?

Looking at equation (11), it’s clear that we can get an approximate value by sampling only a finite number of times, $T$:

$$E_{p(z)}[f(z)] \approx \frac{1}{T} \sum_{t=1}^{T} f(z^{(t)}). \tag{12}$$

Progress! Now we have a way to approximate the integral. The remaining problem is this: how do we sample $z^{(0)}, z^{(1)}, z^{(2)}, \ldots, z^{(T)}$ according to $p(z)$? There are a whole variety of ways to go about this; e.g., see Bishop [2] Chapter 11 for discussion of rejection sampling, adaptive rejection sampling, adaptive rejection Metropolis sampling, importance sampling, sampling-importance-resampling, and so on. For our purposes, though, the key idea is to think of $z$’s as points in a state space, and find ways to “walk around” the space (going from $z^{(0)}$ to $z^{(1)}$ to $z^{(2)}$ and so forth) so that the likelihood of visiting any point $z$ is proportional to $p(z)$. Figure 2 illustrates the reasoning visually: walking around values for $Z$, we want to spend our time adding values $f(z)$ to the sum when $p(z)$ is large, e.g. devoting more attention to the space between $z_1$ and $z_2$, as opposed to spending time in less probable portions of the space like the part between $z_2$ and $z_3$.

A walk of this kind can be characterized abstractly as follows:

    $z^{(0)}$ := a random initial point
    for $t = 0$ to $T - 1$ do
        $z^{(t+1)}$ := $g(z^{(t)})$
    end for

Here $g$ is a function that makes a probabilistic choice about what state to go to next according to an explicit or implicit transition probability $P_{trans}(z^{(t+1)} \mid z^{(0)}, z^{(1)}, \ldots, z^{(t)})$.⁸ The part about probabilistic choices makes this a Monte Carlo technique.
What will make it a Markov Chain Monte Carlo technique is defining things so that the next state you visit, $z^{(t+1)}$, depends only on the current state $z^{(t)}$. That is,

$$P_{trans}(z^{(t+1)} \mid z^{(0)}, z^{(1)}, \ldots, z^{(t)}) = P_{trans}(z^{(t+1)} \mid z^{(t)}). \tag{13}$$

For you language folks, this is precisely the same idea as modeling word sequences using a bigram model, where here we have states $z$ instead of having words $w$. We included the subscript trans in our notation, so that $P_{trans}$ can be read as “transition probability”, in order to emphasize that these are state-to-state transition probabilities in a (first-order) Markov model. The heart of Markov Chain Monte Carlo methods is designing $g$ so that the probability of visiting a state $z$ will turn out to be $p(z)$, as desired. This can be accomplished by guaranteeing that the chain, as defined by the transition probabilities $P_{trans}$, meets certain conditions. Gibbs sampling is one algorithm that meets those conditions.⁹

1.2.3 The Gibbs sampling algorithm

Gibbs sampling is applicable in situations where $Z$ has at least two dimensions, i.e. each point $z$ is really $z = \langle z_1, \ldots, z_k \rangle$, with $k > 1$. The basic idea in Gibbs sampling is that, rather than probabilistically picking the next state all at once, you make a separate probabilistic choice for each of the $k$ dimensions, where each choice depends on the other $k - 1$ dimensions.¹⁰ That is, the probabilistic walk through the space proceeds as follows:

    $z^{(0)} := \langle z_1^{(0)}, \ldots, z_k^{(0)} \rangle$
    for $t = 0$ to $T - 1$ do
        for $i = 1$ to $k$ do
            $z_i^{(t+1)} \sim P(Z_i \mid z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_k^{(t)})$
        end for
    end for

Note that we can obtain the distribution we are sampling from by using the definition of conditional probability:

$$P(Z_i \mid z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_k^{(t)}) = \frac{P(z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_i^{(t)}, z_{i+1}^{(t)}, \ldots, z_k^{(t)})}{P(z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_k^{(t)})}. \tag{14}$$

Notice that the only difference between the numerator and the denominator is that the numerator is the full joint probability, including $z_i^{(t)}$, whereas $z_i^{(t)}$ is missing in the denominator. This will be important later.

One full execution of the inner loop and you’ve just computed your new point $z^{(t+1)} = g(z^{(t)}) = \langle z_1^{(t+1)}, \ldots, z_k^{(t+1)} \rangle$.

You can think of each dimension $z_i$ as corresponding to a parameter or variable in your model. Using equation (14), we sample the new value for each variable according to its distribution based on the values of all the other variables. During this process, new values for the variables are used as soon as you obtain them. For the case of three variables:

• The new value of $z_1$ is sampled conditioning on the old values of $z_2$ and $z_3$.
• The new value of $z_2$ is sampled conditioning on the new value of $z_1$ and the old value of $z_3$.
• The new value of $z_3$ is sampled conditioning on the new values of $z_1$ and $z_2$.

⁸ We’re deliberately staying at a high level here. In the bigger picture, $g$ might consider and reject one or more states before finally deciding to accept one and return it as the value for $z^{(t+1)}$. See discussions of the Metropolis-Hastings algorithm, e.g. Bishop [2], Section 11.2.

⁹ We told you we were going to keep theory to a minimum, didn’t we?

¹⁰ There are, of course, variations on this basic scheme. For example, “blocked sampling” groups the variables into $b < k$ blocks and the variables in each block are sampled together based on the other $b - 1$ blocks.

1.3 The remainder of this document

So, there you have it. Gibbs sampling makes it possible to obtain samples from probability distributions without having to explicitly calculate the values for their marginalizing integrals, e.g. computing expected values, by defining a conceptually straightforward approximation.
This approximation is based on the idea of a probabilistic walk through a state space whose dimensions correspond to the variables or parameters in your model.

Trouble is, from what we can tell, most descriptions of Gibbs sampling pretty much stop there (if they’ve even gone into that much detail). To someone relatively new to this territory, though, that’s not nearly far enough to figure out how to do Gibbs sampling. How exactly do you implement the “sampling from the following distribution” part at the heart of the algorithm (equation 14) for your particular model? How do you deal with continuous parameters in your model? How do you actually generate the expected values you’re ultimately interested in (e.g. equation 8), as opposed to just doing the probabilistic walk for $T$ iterations?

Just as the first part of this document aimed at explaining why, the remainder aims to explain how.

In Section 2, we take a very simple probabilistic model, Naïve Bayes, and describe in considerable (painful?) detail how to construct a Gibbs sampler for it. This includes two crucial things, namely how to employ conjugate priors and how to actually sample from conditional distributions per equation (14).

In Section 3 we discuss how to actually obtain values from a Gibbs sampler, as opposed to merely watching it walk around the state space. (Which might be entertaining, but wasn’t really the point.) Our discussion includes convergence and burn-in, auto-correlation and lag, and other practical issues.

In Section 4 we conclude with pointers to other things you might find it useful to read, as well as an invitation to tell us how we could make this document more accurate or more useful.
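Before moving on to Naïve Bayes, it may help to see the generic walk of Section 1.2.3 run on a toy problem where the conditional distributions are known in closed form. The sketch below is our own illustration, not the sampler derived in Section 2: it assumes a two-dimensional standard normal with correlation `rho`, for which each coordinate given the other is normal with mean `rho` times the other coordinate and variance `1 - rho**2` (a standard textbook fact). The function name, the starting point, and the number of discarded initial iterations are all arbitrary choices.

```python
import random


def gibbs_bivariate_normal(rho, num_iters, burn_in=1000, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1 - rho * rho) ** 0.5  # std. dev. of each conditional
    z1, z2 = 0.0, 0.0                 # z^(0): an arbitrary starting point
    samples = []
    for t in range(num_iters):
        # New values are used as soon as they are available.
        z1 = rng.gauss(rho * z2, cond_sd)  # conditions on the old z2
        z2 = rng.gauss(rho * z1, cond_sd)  # conditions on the new z1
        if t >= burn_in:                   # discard the early part of the walk
            samples.append((z1, z2))
    return samples


samples = gibbs_bivariate_normal(rho=0.8, num_iters=51_000)

# Equation (12): average f(z) over the samples, here with f(z) = z1 * z2,
# whose true expected value under this distribution is rho.
estimate = sum(z1 * z2 for z1, z2 in samples) / len(samples)
print(estimate)
```

The average should land close to 0.8 even though the sampler never draws from the two-dimensional distribution directly; it only ever makes one-dimensional choices, one coordinate at a time.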
2 Deriving a Gibbs Sampler for a Naïve Bayes Model

In this section we consider Naive Bayes models.¹¹ Let’s assume that items of interest are documents, that the features under consideration are the words in the document, and that the document-level class variable we’re trying to predict is a sentiment label whose value is either 0 or 1. For ease of reference, we present our notation in Figure 3. Figure 4 describes the model as a “plate diagram”, to which we will refer when describing the model.

¹¹ We assume the reader is familiar with Naive Bayes. For a refresher, see [14].

    Symbol                                        Meaning
    $V$                                           number of words in the vocabulary.
    $N$                                           number of documents in the corpus.
    $\gamma_{\pi 1}, \gamma_{\pi 0}$              hyperparameters of the Beta distribution.
    $\boldsymbol{\gamma}_\theta$                  hyperparameter vector for the multinomial prior.
    $\gamma_{\theta i}$                           pseudocount for word $i$.
    $\mathbb{C}_x$                                set of documents labeled $x$.
    $\mathbb{C}$                                  the set of all documents.
    $C_0$ ($C_1$)                                 number of documents labeled 0 (1).
    $W_j$                                         document $j$’s frequency distribution.
    $W_{ji}$                                      frequency of word $i$ in document $j$.
    $\mathbf{L}$                                  vector of document labels.
    $L_j$                                         label for document $j$.
    $R_j$                                         number of words in document $j$.
    $\theta_i$                                    probability of word $i$.
    $\theta_{x,i}$                                probability of word $i$ from the distribution of class $x$.
    $N_{\mathbb{C}_x}(i)$                         number of times word $i$ occurs in the set of all documents labeled $x$.
    $\mathbb{C}^{(-j)}$                           set of all documents except $W_j$.
    $\mathbf{L}^{(-j)}$                           vector of all document labels except $L_j$.
    $C_0^{(-j)}$ ($C_1^{(-j)}$)                   number of documents labeled 0 (1) except for $W_j$.
    $\mu$                                         set of hyperparameters $\langle \gamma_{\pi 1}, \gamma_{\pi 0}, \boldsymbol{\gamma}_\theta \rangle$.

Figure 3: Notation.

Figure 4: Naive Bayes plate diagram. (Diagram with nodes $\gamma_\pi$, $\pi$, $L_j$, $\boldsymbol{\gamma}_\theta$, $\boldsymbol{\theta}$, $W_{jk}$ and the labels $R_j$, $2$, and $N$.)

2.1 Modeling How Documents are Generated

We represent each document as a bag of words. Given an unlabeled document $W_j$, our goal is to pick the best label $L_j = 0$ or $L_j = 1$. Sometimes we will refer to 0 and 1 as classes instead of labels. In further discussion it will be convenient for us to refer to the sets of documents sharing the same label, so for notational convenience we define the sets $\mathbb{C}_0 = \{W_j \mid L_j = 0\}$ and $\mathbb{C}_1 = \{W_j \mid L_j = 1\}$. The usual treatment of these models is to equate “best” with “most probable”, and therefore our goal is to choose the label $L_j$ for $W_j$ that maximizes $P(L_j \mid W_j)$. Applying Bayes’s Rule,

$$L_j = \operatorname*{argmax}_{L} P(L \mid W_j) = \operatorname*{argmax}_{L} \frac{P(W_j \mid L)P(L)}{P(W_j)} = \operatorname*{argmax}_{L} P(W_j \mid L)P(L),$$

where the denominator $P(W_j)$ is omitted because it does not depend on $L$.

This application of Bayes’s rule (the “Bayes” part of “Naïve Bayes”) allows us to think of the model in terms of a “generative story” that accounts for how documents are created. According to that story, we first pick the class label of the document, $L_j$; our model will assume that’s done by flipping a coin whose probability of heads is some value $\pi = P(L_j = 1)$. We can express this a little more formally as

$$L_j \sim \text{Bernoulli}(\pi).$$

Then, for every one of the $R_j$ word positions in the document, we pick a word $w_i$ independently by sampling randomly according to a probability distribution over words. Which probability distribution we use is based on the label $L_j$ of the document, so we’ll write them as $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$. Formally one would describe the creation of document $j$’s bag of words as

$$W_j \sim \text{Multinomial}(R_j, \boldsymbol{\theta}). \tag{15}$$

The assumption that the words are chosen independently is the reason we call the model “naïve”.

Notice that logically speaking, it made sense to describe the model in terms of two separate probability distributions, $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$, each one being a simple unigram distribution. The notation in (15) doesn’t explicitly show that what happens in generating $W_j$ depends on whether $L_j$ was 0 or 1. Unfortunately, that notational choice seems to be standard, even though it’s less transparent.¹² We indicate the existence of two $\boldsymbol{\theta}$s by including the 2 in the lower rectangle of Figure 4, but many plate diagrams in the literature would not. Another, perhaps clearer way to describe the process would be

$$W_j \sim \text{Multinomial}(R_j, \boldsymbol{\theta}_{L_j}). \tag{16}$$

And that’s it: our “generative story” for the creation of a whole set of labeled documents $\langle W_n, L_n \rangle$, according to the Naïve Bayes model, is that this simple document-level generative story gets repeated $N$ times, as indicated by the $N$ in the upper rectangle in Figure 4.

2.2 Priors

Well, ok, that’s not completely it. Where did $\pi$ come from? Our generative story is going to assume that before this whole process began, we also picked $\pi$ randomly. Specifically we’ll assume that $\pi$ is sampled from a Beta distribution with parameters $\gamma_{\pi 1}$ and $\gamma_{\pi 0}$. These are referred to as hyperparameters because they are parameters of a prior, which is itself used to pick parameters of the model. In Figure 4 we represent these two hyperparameters as a single two-dimensional vector $\gamma_\pi = \langle \gamma_{\pi 1}, \gamma_{\pi 0} \rangle$. When $\gamma_{\pi 1} = \gamma_{\pi 0} = 1$, $\text{Beta}(\gamma_{\pi 1}, \gamma_{\pi 0})$ is just a uniform distribution, which means that any value for $\pi$ is equally likely. For this reason we call $\text{Beta}(1, 1)$ an “uninformed prior”.

Similarly, where do $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$ come from? Just as the Beta distribution can provide an uninformed prior for a distribution making a two-way choice, the Dirichlet distribution can provide an uninformed prior for $V$-way choices, where $V$ is the number of words in the vocabulary. Let $\boldsymbol{\gamma}_\theta$ be a $V$-dimensional vector where the value of every dimension equals 1. If $\boldsymbol{\theta}_0$ is sampled from $\text{Dirichlet}(\boldsymbol{\gamma}_\theta)$, every probability distribution over words will be equally likely. Similarly, we’ll assume $\boldsymbol{\theta}_1$ is sampled from $\text{Dirichlet}(\boldsymbol{\gamma}_\theta)$.¹³ Formally

$$\pi \sim \text{Beta}(\gamma_\pi)$$

$$\boldsymbol{\theta} \sim \text{Dirichlet}(\boldsymbol{\gamma}_\theta)$$

Choosing the Beta and Dirichlet distributions as priors for binomial and multinomial distributions, respectively, helps the math work out cleanly. We’ll discuss this in more detail in Section 2.4.2.
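The generative story, priors included, is short enough to run forwards. The sketch below is a toy simulation under several assumptions of our own: every document has the same length (the story itself allows a different $R_j$ per document), the four-word vocabulary is made up, and the Dirichlet draw is done by normalizing independent Gamma draws, which is a standard construction. None of the function names come from a library.

```python
import random
from collections import Counter


def sample_dirichlet(gammas, rng):
    """Draw from Dirichlet(gammas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(g, 1.0) for g in gammas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, vocab, gamma_pi=(1.0, 1.0), seed=0):
    """Run the Naive Bayes generative story; gamma_pi is (gamma_pi1, gamma_pi0)."""
    rng = random.Random(seed)
    pi = rng.betavariate(gamma_pi[0], gamma_pi[1])  # pi ~ Beta(gamma_pi1, gamma_pi0)
    gamma_theta = [1.0] * len(vocab)                # uninformed prior: all ones
    # theta_0 and theta_1 are sampled separately from the same prior.
    thetas = [sample_dirichlet(gamma_theta, rng) for _ in range(2)]
    corpus = []
    for _ in range(num_docs):
        label = 1 if rng.random() < pi else 0       # L_j ~ Bernoulli(pi)
        # Each word position is filled independently from theta_{L_j}.
        words = rng.choices(vocab, weights=thetas[label], k=doc_len)
        corpus.append((label, Counter(words)))      # W_j as a bag of words
    return pi, thetas, corpus


vocab = ["good", "bad", "plot", "actor"]  # hypothetical vocabulary
pi, thetas, corpus = generate_corpus(num_docs=5, doc_len=20, vocab=vocab)
for label, bag in corpus:
    print(label, dict(bag))
```

Inference, the subject of the rest of this section, runs this story in reverse: given only the bags of words, recover plausible values for the labels, $\pi$, and the two word distributions.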
¹² The actual basis for removing the subscripts on the parameters $\boldsymbol{\theta}$ is that we assume the data from one class is independent of the parameter estimation of all the other classes, so essentially when we derive the probability expressions for one class the others look the same [19].

¹³ Note that $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$ are sampled separately. There’s no assumption that they are related to each other at all. Also, it’s worth noting that a Dirichlet distribution is a Beta distribution if the dimension $V = 2$. Dirichlet generalizes Beta in the same way that multinomial generalizes binomial. Now you see why we took the trouble to represent $\gamma_{\pi 1}$ and $\gamma_{\pi 0}$ as a 2-dimensional vector $\gamma_\pi$.

2.3 State space and initialization

Following Pedersen [17, 18], we’re going to describe the Gibbs sampler in a completely unsupervised setting where no labels at all are provided as training data. We’ll then briefly explain how to take advantage of labeled data.

State space. Recall that the job of a Gibbs sampler is to walk through a $k$-dimensional state space defined by the random variables $\langle Z_1, Z_2, \ldots, Z_k \rangle$ in the model. Every point in that walk is a collection $\langle z_1, z_2, \ldots, z_k \rangle$ of values for those variables.

In the Naïve Bayes model we’ve just described, here are the variables that define the state space.

• one scalar-valued variable $\pi$
• two vector-valued variables, $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$
• binary label variables $\mathbf{L}$, one for each of the $N$ documents

We also have one vector variable $W_j$ for each of the $N$ documents, but these are observed variables, i.e. their values are already known (which is why $W_{jk}$ is shaded in Figure 4).

Initialization. The initialization of our sampler is going to be very easy. Pick a value $\pi$ by sampling from the $\text{Beta}(\gamma_{\pi 1}, \gamma_{\pi 0})$ distribution. Then, for each $j$, flip a coin with success probability $\pi$, and assign label $L_j^{(0)}$ (that is, the label of document $j$ at the 0th iteration) based on the outcome of the coin flip.
Similarly, you also need to initialize $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$ by sampling from $\text{Dirichlet}(\boldsymbol{\gamma}_\theta)$.

2.4 Deriving the Joint Distribution

Recall that for each iteration $t = 1 \ldots T$ of sampling, we update every variable defining the state space by sampling from its conditional distribution given the other variables, as described in equation (14). Here’s how we’re going to proceed:

• We will define the joint distribution of all the variables, corresponding to the numerator in (14).
• We simplify our expression for the joint distribution.
• We use our final expression of the joint distribution to define how to sample from the conditional distribution in (14).
• We give the final form of the sampler as pseudocode.

2.4.1 Expressing and simplifying the joint distribution

According to our model, the joint distribution for the entire document collection is $P(\mathbb{C}, \mathbf{L}, \pi, \boldsymbol{\theta}_0, \boldsymbol{\theta}_1; \gamma_{\pi 1}, \gamma_{\pi 0}, \boldsymbol{\gamma}_\theta)$. The semicolon indicates that the values to its right are parameters for this joint distribution. Another way to say this is that the variables to the left of the semicolon are conditioned on the hyperparameters given to the right of the semicolon. Using the model’s generative story, and, crucially, the independence assumptions that are a part of that story, the joint distribution can be decomposed into a product of several factors:¹⁴

$$P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0})\,P(\mathbf{L} \mid \pi)\,P(\boldsymbol{\theta}_0 \mid \boldsymbol{\gamma}_\theta)\,P(\boldsymbol{\theta}_1 \mid \boldsymbol{\gamma}_\theta)\,P(\mathbb{C}_0 \mid \boldsymbol{\theta}_0, \mathbf{L})\,P(\mathbb{C}_1 \mid \boldsymbol{\theta}_1, \mathbf{L})$$

Let’s look at each of these in turn.

¹⁴ Note that we can also obtain the products of the joint distribution directly from our graphical model, Figure 4, by multiplying together each latent variable conditioned on its parents.

• $P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0})$. The first factor is the probability of choosing this particular value of $\pi$ given that $\gamma_{\pi 1}$ and $\gamma_{\pi 0}$ are being used as the hyperparameters of the Beta distribution.
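By definition of the Beta distribution, that probability is:

$$P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) = \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\Gamma(\gamma_{\pi 0})}\,\pi^{\gamma_{\pi 1} - 1}(1 - \pi)^{\gamma_{\pi 0} - 1} \tag{17}$$

And because

$$\frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\Gamma(\gamma_{\pi 0})}$$

is a constant that doesn’t depend on $\pi$, we can rewrite this as:

$$P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) = c\,\pi^{\gamma_{\pi 1} - 1}(1 - \pi)^{\gamma_{\pi 0} - 1}. \tag{18}$$

The constant $c$ is a normalizing constant that makes sure $P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0})$ integrates to 1 over all $\pi$. $\Gamma(x)$ is the gamma function, a continuous-valued generalization of the factorial function. We could also express (18) as

$$P(\pi \mid \gamma_{\pi 1}, \gamma_{\pi 0}) \propto \pi^{\gamma_{\pi 1} - 1}(1 - \pi)^{\gamma_{\pi 0} - 1}. \tag{19}$$

• $P(\mathbf{L} \mid \pi)$. The second factor is the probability of obtaining this specific sequence $\mathbf{L}$ of $N$ binary labels, given that the probability of choosing label = 1 is $\pi$. That’s

$$P(\mathbf{L} \mid \pi) = \prod_{n=1}^{N} \pi^{L_n}(1 - \pi)^{(1 - L_n)} \tag{20}$$

$$= \pi^{C_1}(1 - \pi)^{C_0} \tag{21}$$

Recall from Figure 3 that $C_0$ and $C_1$ are nothing more than the number of documents labeled 0 and 1 respectively.¹⁵

• $P(\boldsymbol{\theta}_0 \mid \boldsymbol{\gamma}_\theta)$ and $P(\boldsymbol{\theta}_1 \mid \boldsymbol{\gamma}_\theta)$. The third and fourth factors are the probabilities of having sampled these particular choices of word distributions, given that $\boldsymbol{\gamma}_\theta$ was used as the hyperparameter of the Dirichlet distribution. Since the $\boldsymbol{\theta}$ distributions are independent of each other, let’s make the notation a bit easier to read here and consider each of them in isolation, allowing us to momentarily elide the subscript saying which one is which. By the definition of the Dirichlet distribution, the probability of each word distribution is

$$P(\boldsymbol{\theta} \mid \boldsymbol{\gamma}_\theta) = \frac{\Gamma\left(\sum_{i=1}^{V} \gamma_{\theta i}\right)}{\prod_{i=1}^{V} \Gamma(\gamma_{\theta i})} \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \tag{22}$$

$$= c' \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \tag{23}$$

$$\propto \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \tag{24}$$

Recall that $\gamma_{\theta i}$ denotes the value of vector $\boldsymbol{\gamma}_\theta$’s $i$th dimension, and similarly, $\theta_i$ is the value for the $i$th dimension of vector $\boldsymbol{\theta}$, i.e. the probability assigned by this distribution to the $i$th word in the vocabulary. $c'$ is another normalization constant that we can discard by changing the equality to a proportionality.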
Footnote 15: Of course we can represent these quantities given the variables we already have, but we define these in the interests of simplifying the equations somewhat.

• $P(C_0|\theta_0, L)$ and $P(C_1|\theta_1, L)$. These are the probabilities of generating the contents of the bags of words in each of the two document classes. Generating the bag of words $W_n$ for document $n$ depends on that document's label, $L_n$, and the word probability distribution associated with that label, $\theta_{L_n}$ (so $\theta_{L_n}$ is either $\theta_0$ or $\theta_1$). For notational simplicity, let's let $\theta = \theta_{L_n}$:

$$P(W_n|L, \theta_{L_n}) = \prod_{i=1}^{V} \theta_i^{W_{ni}} \tag{25}$$

Here $\theta_i$ is the probability of word $i$ in distribution $\theta$, and the exponent $W_{ni}$ is the frequency of word $i$ in $W_n$. Now, since the documents are generated independently of each other, we can multiply the value in (25) for each of the documents in each class to get the combined probability of all the observed bags of words within a class:

\begin{align}
P(C_x|L, \theta_x) &= \prod_{n \in C_x} \prod_{i=1}^{V} \theta_{x,i}^{W_{ni}} \tag{26} \\
&= \prod_{i=1}^{V} \theta_{x,i}^{N_{C_x}(i)} \tag{27}
\end{align}

where $N_{C_x}(i)$ gives the count of word $i$ in documents with class label $x$.

2.4.2 Choice of Priors and Simplifying the Joint Probability Expression

So why did we pick the Dirichlet distribution as our prior for $\theta_0$ and $\theta_1$ and the Beta distribution as our prior for $\pi$? Let's look at what happens in the process of simplifying the joint distribution and see what happens to our estimates of $\theta$ (where this can be either $\theta_0$ or $\theta_1$) and $\pi$ once we observe some evidence (i.e. the words from a single document).
Using (19) and (21) from above:

\begin{align}
P(\pi|L; \gamma_{\pi 1}, \gamma_{\pi 0}) &\propto P(L|\pi)\, P(\pi|\gamma_{\pi 1}, \gamma_{\pi 0}) \tag{28} \\
&\propto \left[\pi^{C_1} (1 - \pi)^{C_0}\right] \left[\pi^{\gamma_{\pi 1} - 1} (1 - \pi)^{\gamma_{\pi 0} - 1}\right] \tag{29} \\
&\propto \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1} \tag{30}
\end{align}

Likewise, for $\theta$ using (24) and (25) from above:

\begin{align}
P(\theta|W_n; \gamma_\theta) &\propto P(W_n|\theta)\, P(\theta|\gamma_\theta) \notag \\
&\propto \prod_{i=1}^{V} \theta_i^{W_{ni}} \prod_{i=1}^{V} \theta_i^{\gamma_{\theta i} - 1} \notag \\
&\propto \prod_{i=1}^{V} \theta_i^{W_{ni} + \gamma_{\theta i} - 1} \tag{31}
\end{align}

If we use the words in all of the documents of a given class, then we have:

$$P(\theta_x|C_x; \gamma_\theta) \propto \prod_{i=1}^{V} \theta_{x,i}^{N_{C_x}(i) + \gamma_{\theta i} - 1} \tag{32}$$

Notice that (30) is an unnormalized Beta distribution, with parameters $C_1 + \gamma_{\pi 1}$ and $C_0 + \gamma_{\pi 0}$, and (32) is an unnormalized Dirichlet distribution, with parameter vector $\langle N_{C_x}(i) + \gamma_{\theta i} \rangle$ for $1 \le i \le V$. When the posterior probability distribution is of the same family as the prior probability distribution (that is, the same functional form, just with different arguments), the prior is said to be a conjugate prior for the likelihood. The Beta distribution is the conjugate prior for binomial (and Bernoulli) distributions and the Dirichlet distribution is the conjugate prior for multinomial distributions. Also notice what role the hyperparameters played: they are added just like observed evidence. It is for this reason that the hyperparameters are sometimes referred to as pseudocounts.

Okay, so if we multiply together the individual factors from Section 2.4.1 as simplified above (in Section 2.4.2) and let $\mu = \langle \gamma_{\pi 1}, \gamma_{\pi 0}, \gamma_\theta \rangle$ we can express the full joint distribution as:

$$P(C, L, \pi, \theta_0, \theta_1; \mu) \propto \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1} \prod_{i=1}^{V} \theta_{0,i}^{N_{C_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{C_1}(i) + \gamma_{\theta i} - 1} \tag{33}$$

2.4.3 Integrating out π

Having illustrated how to derive the joint distribution for a model, as in (33), we now note that, for this particular model, we can make our lives a little bit simpler: we can reduce the effective number of parameters in the model by integrating our joint distribution with respect to $\pi$.
This has the effect of taking all possible values of $\pi$ into account in our sampler, without representing it as a variable explicitly and having to sample it at every iteration. Intuitively, “integrating out” a variable is an application of precisely the same principle as computing the marginal probability for a discrete distribution. For example, if we have an expression for $P(a, b, c)$, we can compute $P(a, b)$ by summing over all possible values of $c$, i.e. $P(a, b) = \sum_c P(a, b, c)$. As a result, $c$ is “there” conceptually, in terms of our understanding of the model, but we don't need to deal with manipulating it explicitly as a parameter. With a continuous variable, the principle is the same, but we integrate over all possible values of the variable rather than summing. So, we have

\begin{align}
P(L, C, \theta_0, \theta_1; \mu) &= \int_\pi P(L, C, \theta_0, \theta_1, \pi; \mu)\, d\pi \tag{34} \\
&= \int_\pi P(\pi|\gamma_{\pi 1}, \gamma_{\pi 0})\, P(L|\pi)\, P(\theta_0|\gamma_\theta)\, P(\theta_1|\gamma_\theta)\, P(C_0|\theta_0, L)\, P(C_1|\theta_1, L)\, d\pi \tag{35} \\
&= P(\theta_0|\gamma_\theta)\, P(\theta_1|\gamma_\theta)\, P(C_0|\theta_0, L)\, P(C_1|\theta_1, L) \int_\pi P(\pi|\gamma_{\pi 1}, \gamma_{\pi 0})\, P(L|\pi)\, d\pi \tag{36}
\end{align}

At this point let's focus our attention on the integral only and substitute the true distributions from (17) and (21).

\begin{align}
\int_\pi P(\pi|\gamma_{\pi 1}, \gamma_{\pi 0})\, P(L|\pi)\, d\pi &= \int_\pi \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\Gamma(\gamma_{\pi 0})}\, \pi^{\gamma_{\pi 1} - 1} (1 - \pi)^{\gamma_{\pi 0} - 1}\, \pi^{C_1} (1 - \pi)^{C_0}\, d\pi \tag{37} \\
&= \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\Gamma(\gamma_{\pi 0})} \int_\pi \pi^{C_1 + \gamma_{\pi 1} - 1} (1 - \pi)^{C_0 + \gamma_{\pi 0} - 1}\, d\pi \tag{38}
\end{align}

Here's where our use of conjugate priors pays off. Notice that the integrand in (38) is an unnormalized Beta distribution with parameters $C_1 + \gamma_{\pi 1}$ and $C_0 + \gamma_{\pi 0}$. This means that the value of the integral is just the normalizing constant for that distribution, which is easy to look up (e.g.
in the entry for the Beta distribution in Wikipedia): the normalizing constant for distribution Beta($C_1 + \gamma_{\pi 1}$, $C_0 + \gamma_{\pi 0}$) is

$$\frac{\Gamma(C_1 + \gamma_{\pi 1})\, \Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(C_0 + C_1 + \gamma_{\pi 1} + \gamma_{\pi 0})}$$

Making that substitution in (38), and also substituting $N = C_0 + C_1$, we arrive at

$$\int_\pi P(\pi|\gamma_{\pi 1}, \gamma_{\pi 0})\, P(L|\pi)\, d\pi = \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\Gamma(\gamma_{\pi 0})} \cdot \frac{\Gamma(C_1 + \gamma_{\pi 1})\, \Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})} \tag{39}$$

Substituting (39) and the definitions of the probability distributions from Section 2.4.1 back into (36) gives us

$$P(L, C, \theta_0, \theta_1; \mu) \propto \frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\Gamma(\gamma_{\pi 0})} \cdot \frac{\Gamma(C_1 + \gamma_{\pi 1})\, \Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})} \prod_{i=1}^{V} \theta_{0,i}^{N_{C_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{C_1}(i) + \gamma_{\theta i} - 1} \tag{40}$$

Okay, so it would be reasonable at this point to ask, “I thought the point of integrating out $\pi$ was to simplify things, so why did we just ‘simplify’ the joint expression by adding in a bunch of these Gamma functions everywhere?” Good question: hold that thought until we derive the sampler.

2.5 Building the Gibbs Sampler

The definition of a Gibbs sampler specifies that in each iteration we assign a new value to variable $Z_i$ by sampling from the conditional distribution

$$P(Z_i \mid z_1^{(t+1)}, \ldots, z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, \ldots, z_r^{(t)}).$$

So, for example, to assign the value of $L_1^{(t+1)}$, we need to compute this conditional distribution (footnote 16):

$$P(L_1 \mid L_2^{(t)}, \ldots, L_N^{(t)}, C, \theta_0^{(t)}, \theta_1^{(t)}; \mu),$$

To assign the value of $L_2^{(t+1)}$, we need to compute

$$P(L_2 \mid L_1^{(t+1)}, L_3^{(t)}, \ldots, L_N^{(t)}, C, \theta_0^{(t)}, \theta_1^{(t)}; \mu),$$

and so forth for $L_3^{(t+1)}$ through $L_N^{(t+1)}$. Similarly, to assign the value of $\theta_0^{(t+1)}$ we need to sample from the conditional distribution

$$P(\theta_0^{(t+1)} \mid L_1^{(t+1)}, L_2^{(t+1)}, \ldots, L_N^{(t+1)}, C, \theta_1^{(t)}; \mu),$$

and, for $\theta_1^{(t+1)}$,

$$P(\theta_1^{(t+1)} \mid L_1^{(t+1)}, L_2^{(t+1)}, \ldots, L_N^{(t+1)}, C, \theta_0^{(t+1)}; \mu).$$

Intuitively, at the start of an iteration $t$, we have a collection of all our current information at this point in the sampling process.
That information includes the word count for each document, the number of documents labeled 0, the number of documents labeled 1, the word counts for all documents labeled 0, the word counts for all documents labeled 1, the current label for each document, the current values of $\theta_0$ and $\theta_1$, etc. When we want to sample the new label for document $j$, we temporarily remove all information (i.e. word counts and label information) about this document from that collection of information. Then we look at the conditional probability that $L_j = 0$ given all the remaining information, and the conditional probability that $L_j = 1$ given the same information, and we sample the new label $L_j^{(t+1)}$ by choosing randomly according to the relative weight of those two conditional probabilities. Sampling to get the new values $\theta_0^{(t+1)}$ and $\theta_1^{(t+1)}$ operates according to the same principle.

Footnote 16: There's no superscript on the bags of words $C$ because they're fully observed and don't change from iteration to iteration.

2.5.1 Sampling for Document Labels

Okay, so how do we actually do the sampling? We're almost there. As the final step in our journey, we show how to select all of the new document labels $L_j$ and the new distributions $\theta_0$ and $\theta_1$ during each iteration of the sampler. By definition of conditional probability,

\begin{align}
P(L_j \mid L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \mu) &= \frac{P(L_j, W_j, L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \mu)}{P(L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \mu)} \tag{41} \\
&= \frac{P(L, C, \theta_0, \theta_1; \mu)}{P(L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \mu)} \tag{42}
\end{align}

where $L^{(-j)}$ are all the document labels except $L_j$, and $C^{(-j)}$ is the set of all documents except $W_j$. The distribution is over two possible outcomes, $L_j = 0$ and $L_j = 1$.

Notice that the numerator is just the complete joint probability described in (40). In the denominator, we have the same expression, minus all the information about document $W_j$. Therefore we can work out what (42) should look like by considering the three factors in (40), one at a time.
For each factor, we will remind ourselves of what it looks like in the numerator (which includes $W_j$), and work out what it should look like in the denominator (excluding $W_j$), for each of the two outcomes.

The first factor in (40),

$$\frac{\Gamma(\gamma_{\pi 1} + \gamma_{\pi 0})}{\Gamma(\gamma_{\pi 1})\Gamma(\gamma_{\pi 0})}, \tag{43}$$

is very easy. It depends only on the hyperparameters, so removing information about $W_j$ has no impact on its value. Since it will be the same in both the numerator and denominator of (42), it will cancel out. Excellent! Two factors to go.

Let's look at the second factor of (40). Taking all documents into account including $W_j$, this is

$$\frac{\Gamma(C_1 + \gamma_{\pi 1})\, \Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})}. \tag{44}$$

Now, in order to compute the denominator of (42), we remove document $W_j$ from consideration. How does that change things? It depends on what $L_j$ was during the previous iteration. Whether $L_j = 0$ or $L_j = 1$ during the previous iteration, the corpus size is effectively reduced from $N$ to $N - 1$, and the size of one of the document classes is smaller by one compared to its value in the numerator. If $L_j = 0$ when we remove it, then we will have $C_0^{(-j)} = C_0 - 1$ and $C_1^{(-j)} = C_1$. If $L_j = 1$ during the previous iteration, then we will have $C_0^{(-j)} = C_0$ and $C_1^{(-j)} = C_1 - 1$. In each case, removing $W_j$ only changes the information we know about one class, which means that of the two terms in the numerator of this factor, one of them is going to be the same in the numerator and the denominator of (42). That's going to cause the terms from one of the classes (the one $W_j$ did not belong to after the previous iteration) to cancel out.

Let $x \in \{0, 1\}$ be the outcome we're considering, i.e.
the one for which $C_x^{(-j)} = C_x - 1$ (footnote 17). If we use $x$ and reorganize the expression a bit, the second factor of (42) can be rewritten from

$$\frac{\;\dfrac{\Gamma(C_1 + \gamma_{\pi 1})\, \Gamma(C_0 + \gamma_{\pi 0})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})}\;}{\;\dfrac{\Gamma(C_0^{(-j)} + \gamma_{\pi 0})\, \Gamma(C_1^{(-j)} + \gamma_{\pi 1})}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1)}\;}$$

to

$$\frac{\Gamma(C_x + \gamma_{\pi x})\, \Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1)}{\Gamma(N + \gamma_{\pi 1} + \gamma_{\pi 0})\, \Gamma(C_x + \gamma_{\pi x} - 1)} \tag{45}$$

Footnote 17: Crucially, note that $x$ here is not $L_j^{(t)}$, the label of document $j$ at the previous iteration. We are pulling document $j$ out of the collection, effectively obviating our knowledge of its current label, and then constructing a distribution over the two possibilities, $L_j = 0$ and $L_j = 1$. So we need for our expression to take $x$ as a parameter, allowing us to build a probability distribution by asking “what if $x = 0$?” and “what if $x = 1$?”.

Using the fact that $\Gamma(a + 1) = a\Gamma(a)$ for all $a$, we can simplify further to get

$$\frac{C_x + \gamma_{\pi x} - 1}{N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1} \tag{46}$$

Look, no more pesky Gammas!

Finally, the third factor in (40) is

$$\prod_{i=1}^{V} \theta_{0,i}^{N_{C_0}(i) + \gamma_{\theta i} - 1}\, \theta_{1,i}^{N_{C_1}(i) + \gamma_{\theta i} - 1}. \tag{47}$$

When we look at what changes when we remove $W_j$ from consideration, in order to write the corresponding expression in the denominator of (42), we see that it will behave in the same way as the second factor. One of the classes remains unchanged, so one of the terms, either the $\theta_0$ or $\theta_1$, will cancel out when we do the division. If, similar to above, we let $x$ be the class for which $C_x^{(-j)} = C_x - 1$, then we can again capture both cases in one expression:

$$\prod_{i=1}^{V} \frac{\theta_{x,i}^{N_{C_x}(i) + \gamma_{\theta i} - 1}}{\theta_{x,i}^{N_{C_x^{(-j)}}(i) + \gamma_{\theta i} - 1}} = \prod_{i=1}^{V} \theta_{x,i}^{W_{ji}} \tag{48}$$

With that, we have finished working through the three factors in (40), which means we have worked out how to express the numerator and denominator of (42), factor by factor.
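As a quick numerical sanity check on the simplification from (45) to (46), the following sketch evaluates both forms using only the Python standard library. The function names and the example counts are our own illustrative assumptions and are not part of the derivation; the Gamma functions are evaluated in log space (via `math.lgamma`) purely to avoid overflow for large counts.

```python
import math

def gamma_ratio_45(c_x, n, g_x, g_1, g_0):
    """Expression (45), computed with log-Gamma to avoid overflow.

    c_x -- number of documents in class x, counting document j as labeled x
    n   -- total number of documents N
    g_x -- hyperparameter gamma_pi_x for the class under consideration
    g_1, g_0 -- hyperparameters gamma_pi1 and gamma_pi0
    """
    log_val = (math.lgamma(c_x + g_x)
               + math.lgamma(n + g_1 + g_0 - 1)
               - math.lgamma(n + g_1 + g_0)
               - math.lgamma(c_x + g_x - 1))
    return math.exp(log_val)

def simplified_46(c_x, n, g_x, g_1, g_0):
    """Expression (46): the same quantity with all Gamma functions cancelled."""
    return (c_x + g_x - 1) / (n + g_1 + g_0 - 1)

# Hypothetical example: 40 of 100 documents in class x, all pseudocounts 1.
lhs = gamma_ratio_45(40, 100, 1.0, 1.0, 1.0)
rhs = simplified_46(40, 100, 1.0, 1.0, 1.0)
```

In a sampler one would only ever compute the simplified form; the Gamma version is shown here just to confirm that the two agree up to floating-point error.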
Recombining the factors, we get the following expression for the conditional distribution in (42): for $x \in \{0, 1\}$,

$$\Pr(L_j = x \mid L^{(-j)}, C^{(-j)}, \theta_0, \theta_1; \mu) = \frac{C_x + \gamma_{\pi x} - 1}{N + \gamma_{\pi 1} + \gamma_{\pi 0} - 1} \prod_{i=1}^{V} \theta_{x,i}^{W_{ji}} \tag{49}$$

Let's take a step back and take a look at what this equation is telling us about how a label is chosen. Its first factor gives us an indication of how likely it is that $L_j = x$ considering only the distribution of the other labels. So, for example, if the corpus had more class 0 documents than class 1 documents, this factor would tend to push the label toward class 0. Its second factor is like a word distribution “fitting room.” We get an indication of how well the words in $W_j$ “fit” with each of the two distributions. If, for example, the words from $W_j$ “fit” better with distribution $\theta_0$ (by having a larger value for the document probability using $\theta_0$) then this factor will push the label toward class 0 as well.

So (finally!) here's the actual procedure to sample from the conditional distribution in (42):

1. Let value0 = expression (49) with $x = 0$
2. Let value1 = expression (49) with $x = 1$
3. Let the distribution be $\left\langle \frac{\text{value0}}{\text{value0} + \text{value1}}, \frac{\text{value1}}{\text{value0} + \text{value1}} \right\rangle$
4. Select the value of $L_j^{(t+1)}$ as the result of a Bernoulli trial (weighted coin flip) according to this distribution.

2.5.2 Sampling for θ

We'll follow a similar procedure to determine how to sample for new values of $\theta_0$ and $\theta_1$. Since the estimation of the two distributions is independent of one another, we're going to omit the subscripts on $\theta$ to make the notation a bit easier to digest. Just like above, we'll need to derive an expression for the probability of $\theta$ given all other variables, but our work is a bit simpler in this case. Observe,

$$P(\theta|C, L; \mu) \propto P(C, L|\theta)\, P(\theta|\mu) \tag{50}$$

Furthermore, recall that, since we used conjugate priors, this posterior, like the prior, works out to be a Dirichlet distribution.
We actually derived the full expression in Section 2.4.2, but we don't need the full expression here. All we need to do to sample a new distribution is to make another draw from a Dirichlet distribution, but this time with parameters $N_{C_x}(i) + \gamma_{\theta i}$ for each $i$ from 1 to $V$. For notational convenience, let's define the $V$-dimensional vector $t$ such that each $t_i = N_{C_x}(i) + \gamma_{\theta i}$, where $x$ is again either 0 or 1 depending on which $\theta$ we are resampling. We then sample a new $\theta$ as:

$$\theta \sim \text{Dirichlet}(t) \tag{51}$$

How do you actually implement sampling from a new Dirichlet distribution? To sample a random vector $a = \langle a_1, \ldots, a_V \rangle$ from the $V$-dimensional Dirichlet distribution with parameters $\langle \alpha_1, \ldots, \alpha_V \rangle$, one fast way is to draw $V$ independent samples $y_1, \ldots, y_V$ from gamma distributions, each with density

$$\text{Gamma}(\alpha_i, 1) = \frac{y_i^{\alpha_i - 1} e^{-y_i}}{\Gamma(\alpha_i)}, \tag{52}$$

and then set $a_i = y_i / \sum_{j=1}^{V} y_j$ (i.e., just normalize each of the gamma distribution draws) (footnote 18).

2.5.3 Taking advantage of documents with labels

Using labeled documents is relatively painless: just don't sample $L_j$ for those documents! Always keep $L_j$ equal to the observed label. The documents will effectively serve as “ground truth” evidence for the distributions that created them. Since we never sample for their labels, they will always contribute to the counts in (49) and (51) and will never be subtracted out.

2.5.4 Putting it all together

Initialization. Define the priors as in Section 2.2 and initialize them as described in Section 2.3.
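The two sampling steps used in every iteration, the label draw of Section 2.5.1 and the Dirichlet draw of Section 2.5.2, can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions rather than a reference implementation: the function names, the dictionary representation of a document's word counts, and the choice to work in log space (to avoid underflow when multiplying many small probabilities) are not part of the derivation above.

```python
import math
import random

def label_distribution(doc_counts, docs_per_class, thetas, gamma_pi):
    """Return [P(L_j = 0), P(L_j = 1)], i.e. expression (49) normalized.

    doc_counts     -- dict mapping word index i to its count W_ji in document j
    docs_per_class -- [C_0, C_1] counted AFTER document j has been removed,
                      so the numerator of (49) is docs_per_class[x] + gamma_pi[x]
    thetas         -- [theta_0, theta_1], each a list of word probabilities
    gamma_pi       -- [gamma_pi0, gamma_pi1]
    """
    log_w = []
    for x in (0, 1):
        # The denominator N + gamma_pi1 + gamma_pi0 - 1 of (49) is the same
        # for both outcomes, so it drops out in the normalization below.
        lw = math.log(docs_per_class[x] + gamma_pi[x])
        lw += sum(n * math.log(thetas[x][i]) for i, n in doc_counts.items())
        log_w.append(lw)
    m = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return [v / total for v in w]

def sample_label(dist, rng=random):
    """Bernoulli trial (weighted coin flip): 1 with probability dist[1], else 0."""
    return 1 if rng.random() < dist[1] else 0

def sample_dirichlet(alphas, rng=random):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma(alpha_i, 1) draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(ys)
    return [y / total for y in ys]

# Hypothetical usage with a two-word vocabulary and made-up counts.
rng = random.Random(0)
dist = label_distribution({0: 3}, [5, 5], [[0.9, 0.1], [0.1, 0.9]], [1.0, 1.0])
new_label = sample_label(dist, rng)
new_theta = sample_dirichlet([4.0, 1.0, 6.0], rng)  # t_i = N_Cx(i) + gamma_theta_i
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable, which is convenient when debugging a sampler.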
Sampling iterations.

for $t := 1$ to $T$ do
    for $j := 1$ to $N$ do
        if $j$ is not a training document then
            Subtract $j$'s word counts from the total word counts of whatever class it's currently a member of
            Subtract 1 from the count of documents with label $L_j$
            Assign a new label $L_j^{(t+1)}$ to document $j$ as described at the end of Section 2.5.1
            Add 1 to the count of documents with label $L_j^{(t+1)}$
            Add $j$'s word counts to the total word counts for class $L_j^{(t+1)}$
        end if
    end for
    $t_0$ := vector of total word counts from class 0, including pseudocounts
    $\theta_0 \sim \text{Dirichlet}(t_0)$, as described in Section 2.5.2
    $t_1$ := vector of total word counts from class 1, including pseudocounts
    $\theta_1 \sim \text{Dirichlet}(t_1)$, as described in Section 2.5.2
end for

Notice that as soon as a new label for $L_j$ is assigned, this changes the counts that will affect the labeling of the subsequent documents. This is, in fact, the whole principle behind a Gibbs sampler!

That concludes the discussion of how sampling is done. We'll see how to get from the output of the sampler to estimated values for the variables in Section 3.

Footnote 18: For details, see http://en.wikipedia.org/wiki/Dirichlet_distribution (version of April 12, 2010).

2.6 Optional: A Note on Integrating out Continuous Parameters

At this point you might be asking yourself why we were able to integrate out the continuous parameter $\pi$ from our model, but did not do something similar with the two word distributions $\theta_0$ and $\theta_1$. The idea of doing this even looks promising, but there's a subtle problem that will get us into trouble and end up leaving us with an expression filled with $\Gamma$ functions that will not cancel out. Let us go through the derivations and see where it leads us.
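If you follow this piece of the discussion, then you really understand the details (footnote 19)!

Our goal here would be to obtain the probability that a document was generated from the same distribution that generated the words of a particular class of documents, $C_x$. We would then use this as a replacement for the product in (48). We start first by calculating the probability of making a single word draw given the other words in the class, subtracting out the information about $W_j$. In the equations that follow there's an implicit $(-j)$ superscript on all of the counts. If we let $w_k$ denote the word at some position $k$ in $W_j$ then,

\begin{align}
\Pr(w_k = y \mid C_x^{(-j)}; \gamma_\theta) &= \int_\Delta \Pr(w_k = y \mid \theta)\, P(\theta \mid C_x^{(-j)}; \gamma_\theta)\, d\theta \tag{53} \\
&= \int_\Delta \theta_{x,y}\, \frac{\Gamma\left(\sum_{i=1}^{V} (N_{C_x}(i) + \gamma_{\theta i})\right)}{\prod_{i=1}^{V} \Gamma(N_{C_x}(i) + \gamma_{\theta i})} \prod_{i=1}^{V} \theta_{x,i}^{N_{C_x}(i) + \gamma_{\theta i} - 1}\, d\theta \tag{54} \\
&= \frac{\Gamma\left(\sum_{i=1}^{V} (N_{C_x}(i) + \gamma_{\theta i})\right)}{\prod_{i=1}^{V} \Gamma(N_{C_x}(i) + \gamma_{\theta i})} \int_\Delta \theta_{x,y} \prod_{i=1}^{V} \theta_{x,i}^{N_{C_x}(i) + \gamma_{\theta i} - 1}\, d\theta \tag{55} \\
&= \frac{\Gamma\left(\sum_{i=1}^{V} (N_{C_x}(i) + \gamma_{\theta i})\right)}{\prod_{i=1}^{V} \Gamma(N_{C_x}(i) + \gamma_{\theta i})} \int_\Delta \theta_{x,y}^{N_{C_x}(y) + \gamma_{\theta y}} \prod_{i=1 \wedge i \neq y}^{V} \theta_{x,i}^{N_{C_x}(i) + \gamma_{\theta i} - 1}\, d\theta \tag{56} \\
&= \frac{\Gamma\left(\sum_{i=1}^{V} (N_{C_x}(i) + \gamma_{\theta i})\right)}{\prod_{i=1}^{V} \Gamma(N_{C_x}(i) + \gamma_{\theta i})} \cdot \frac{\Gamma(N_{C_x}(y) + \gamma_{\theta y} + 1) \prod_{i=1 \wedge i \neq y}^{V} \Gamma(N_{C_x}(i) + \gamma_{\theta i})}{\Gamma\left(1 + \sum_{i=1}^{V} (N_{C_x}(i) + \gamma_{\theta i})\right)} \tag{57} \\
&= \frac{N_{C_x}(y) + \gamma_{\theta y}}{\sum_{i=1}^{V} (N_{C_x}(i) + \gamma_{\theta i})} \tag{58}
\end{align}

The process we use is actually the same as the process used to integrate $\pi$, just in the multidimensional case. The set $\Delta$ is the probability simplex of $\theta$, namely the set of all $\theta$ such that $\sum_i \theta_i = 1$. We get to (54) by substitution from the formulas we derived in Section 2.4.2, then (55) by factoring out the normalization constant for the Dirichlet distribution, since it is constant with respect to $\theta$. Note that the integrand of (56) is actually another unnormalized Dirichlet distribution, so its integral is its normalization constant (same reasoning as before). We substitute this in to obtain (57). Using the property of $\Gamma$ that $\Gamma(x + 1) = x\Gamma(x)$ for all $x$, we can again cancel all of the $\Gamma$ terms, which gives (58).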
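To see the cancellation in the last step concretely, the sketch below evaluates the Gamma-function expression (57) and the closed form (58) on a small made-up example and confirms that they agree. The function names, the three-word vocabulary and the counts are illustrative assumptions of ours; `math.lgamma` is used only so that the Gamma terms can be combined without overflow.

```python
import math

def predictive_closed_form(counts, gammas, y):
    """Equation (58): (N(y) + gamma_y) / sum_i (N(i) + gamma_i)."""
    a = [n + g for n, g in zip(counts, gammas)]
    return a[y] / sum(a)

def predictive_via_gammas(counts, gammas, y):
    """Equation (57), evaluated with log-Gamma functions.

    counts -- word counts N_Cx(i) for the class, with document j removed
    gammas -- Dirichlet hyperparameters gamma_theta_i
    y      -- index of the word being drawn
    """
    a = [n + g for n, g in zip(counts, gammas)]
    s = sum(a)
    # Normalization constant of the Dirichlet posterior, factored out in (55).
    log_norm = math.lgamma(s) - sum(math.lgamma(v) for v in a)
    # Value of the integral in (56): the exponent for word y went up by one.
    log_integral = (math.lgamma(a[y] + 1)
                    + sum(math.lgamma(v) for i, v in enumerate(a) if i != y)
                    - math.lgamma(s + 1))
    return math.exp(log_norm + log_integral)

# Hypothetical example: counts [3, 0, 5] and all pseudocounts equal to 1.
p_closed = predictive_closed_form([3, 0, 5], [1.0, 1.0, 1.0], 0)
p_gammas = predictive_via_gammas([3, 0, 5], [1.0, 1.0, 1.0], 0)
```

Both functions return the same value up to floating-point error; with these counts, word 0 has total 3 + 1 out of 8 + 3, so the result is 4/11.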
At this point, even though we have a simple and intuitive result for the probability of drawing a single word from a Dirichlet, we actually need the probability of drawing all words in $W_j$ from the Dirichlet distribution. What we'd really like to do is assume that the words within a particular document are drawn from the same distribution, and just calculate the probability of $W_j$ by multiplying (58) over all the words in the document. But we cannot do that: without the values of $\theta$ being known, the draws are not independent, because each draw has an effect on what our estimate of $\theta$ is! We can see this in two different ways. First, instead of drawing one word in equation (53), do the derivation by drawing two words at a time (footnote 20). You'll find that once you hit (57), you'll have an extra set of Gamma functions that won't cancel out nicely. The second way to see it is actually by looking at the plate diagram for the model, Figure 4. Each $\theta$ effectively has $R_j$ arrows coming out of it for each document $j$ to individual words, so every word within a document is in the Markov blanket (footnote 21) of the others; therefore we can't assume that the words are independent without a fixed $\theta$. At issue here isn't just that there are multiple instances of words coming out of each theta, but crucially that those instances are sampled at the same time.

Footnote 19: The authors thank Wu Ke, who really understood the details, for pointing out our error in an earlier version of this document and providing the illustrating example we go through next.

Footnote 20: This is what Wu Ke did to demonstrate his point to us.

The lesson here is to be careful when you integrate out parameters. If you're doing a single draw from a multinomial, then integrating out a continuous parameter can make the sampler simpler, since you won't have to sample for it at every iteration.
If, on the other hand, you do multiple draws from the same multinomial, integration (although possible) will result in an expression that involves Gamma functions. Calculating Gamma functions is undesirable since they are computationally expensive, so they can slow down a sampler significantly.

3 Producing values from the output of a Gibbs sampler

The initialization and sampling iterations in the Gibbs sampling algorithm will produce values for each of the variables, for iterations $t = 1, 2, \ldots, T$. In theory, the approximated value for any variable $Z_i$ can simply be obtained by calculating:

$$\frac{1}{T} \sum_{t=1}^{T} z_i^{(t)}, \tag{59}$$

as discussed in equation (12). However, expression (59) is not always used directly. There are several additional details to note that are a part of typical sampling practice (footnote 22).

Convergence and burn-in iterations. Depending on the values chosen during the initialization step, it may take some time for the Gibbs sampler to reach a point where the points $\langle z_1^{(t)}, z_2^{(t)}, \ldots, z_r^{(t)} \rangle$ are all coming from the stationary distribution of the Markov chain, which is an assumption of the approximation in (59). In order to avoid the estimates being contaminated by the values at iterations before that point, many practitioners discard the values at iterations $t \le B$, which are referred to as the “burn-in” iterations, so that the average in (59) is taken only over iterations $B + 1$ through $T$ (footnote 23).

Autocorrelation and lag. The approximation in (59) assumes the samples for $Z_i$ are independent, but we know they're not, because they were produced by a process that conditions on the previous point in the chain to generate the next point. This is referred to as autocorrelation (sometimes serial autocorrelation), i.e.
correlation between successive values in the data (footnote 24). In order to avoid autocorrelation problems (so that the chain “mixes well”), many implementations of Gibbs sampling average only every $L$th value, where $L$ is referred to as the lag (footnote 25). In this context, Jordan Boyd-Graber (personal communication) also recommends looking at Neal's [15] discussion of likelihood as a metric of convergence.

Footnote 21: The Markov blanket of a node in a graphical model consists of that node's parents, its children, and the coparents of its children [16].

Footnote 22: Jason Eisner (personal communication) argues, with some support from the literature, that burn-in, lag, and multiple chains are in fact unnecessary and it is perfectly correct to do a single long sampling run and keep all samples. See [4, 5], MacKay ([13], end of section 29.9, page 381) and Koller and Friedman ([10], end of section 12.3.5.2, page 522).

Footnote 23: As far as we can tell, there is no principled way to choose the “right” value for $B$ in advance. There are a variety of ways to test for convergence, and to measure autocorrelation; see, e.g., Brian J. Smith, “boa: An R Package for MCMC Output Convergence Assessment and Posterior Inference”, Journal of Statistical Software, November 2007, Volume 21, Issue 11, http://www.jstatsoft.org/ for practical discussion. However, from what we can tell, many people just choose a really big value for $T$, pick $B$ to be large also, and assume that their samples are coming from a chain that has converged.

Footnote 24: Lohninger [11] observes that “most inferential tests and modeling techniques fail if data is autocorrelated”.

Footnote 25: Again, the choice of $L$ seems to be more a matter of art than science: people seem to look at plots of the autocorrelation for different values of $L$ and use a value for which the autocorrelation drops off quickly. The autocorrelation for variable $Z_i$ with lag $L$ is simply the correlation between the sequence $Z_i^{(t)}$ and the sequence $Z_i^{(t-L)}$. Which correlation function is used seems to vary.
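The burn-in and lag conventions described above can be sketched as a small post-processing step on the list of sampled values for one variable. This is only an illustration under our own assumptions: the function names are hypothetical, and the Pearson correlation is used for the autocorrelation simply because it is the most familiar choice (as noted in footnote 25, practice varies).

```python
import math

def thinned_average(samples, burn_in=0, lag=1):
    """Average of (59) restricted to every lag-th sample after the burn-in.

    samples -- values z_i^(1), ..., z_i^(T) for one variable, in order
    burn_in -- number B of initial iterations to discard
    lag     -- keep only every lag-th remaining value
    """
    kept = samples[burn_in::lag]
    if not kept:
        raise ValueError("no samples left after burn-in")
    return sum(kept) / len(kept)

def autocorrelation(samples, lag):
    """Pearson correlation between z^(t) and z^(t-lag); requires lag >= 1.

    Assumes neither shifted sequence is constant (otherwise the
    denominator below would be zero).
    """
    xs, ys = samples[lag:], samples[:-lag]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical usage on made-up sample sequences.
estimate = thinned_average(list(range(1, 11)), burn_in=4, lag=2)
rho = autocorrelation([1, -1, 1, -1, 1, -1, 1, -1], lag=1)
```

In the first call the kept values are those of iterations 5, 7 and 9; in the second, the strictly alternating sequence is perfectly negatively correlated with itself at lag 1.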
Multiple chains. As is the case for many other stochastic algorithms (e.g. expectation maximization as used in the forward-backward algorithm for HMMs), people often try to avoid sensitivity to the starting point chosen at initialization time by doing multiple runs from different starting points. For Gibbs sampling and other Markov Chain Monte Carlo methods, these are referred to as “multiple chains” (footnote 26).

Hyperparameter sampling. Rather than simply picking hyperparameters, it is possible, and in fact often critical, to assign their values via sampling (Boyd-Graber, personal communication). See, e.g., Wallach et al. [20] and Escobar and West [3].

4 Conclusions

The main point of this document has been to take some of the mystery out of Gibbs sampling for computer scientists who want to get their hands dirty and try it out. Like any other technique, however, caveat lector: using a tool with only limited understanding of its theoretical foundations can produce undetected mistakes, misleading results, or frustration.

As a first step toward getting further up to speed on the relevant background, Ted Pedersen's [18] doctoral dissertation has a very nice discussion of parameter estimation in Chapter 4, including a detailed exposition of an EM algorithm for Naı̈ve Bayes and his own derivation of a Naı̈ve Bayes Gibbs sampler that highlights the relationship to EM. (He works through several iterations of each algorithm explicitly, which in our opinion merits a standing ovation.) The ideas introduced in Chapter 4 are applied in Pedersen and Bruce [17]; note that the brief description of the Gibbs sampler there makes an awful lot more sense after you've read Pedersen's dissertation chapter.
We also recommend Gregor Heinrich's [7] “Parameter estimation for text analysis.” Heinrich presents fundamentals of Bayesian inference starting with a nice discussion of basics like maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, all with an eye toward dealing with text. (We followed his discussion closely above in Section 1.1.) Also, his is one of the few papers we've been able to find that actually provides pseudo-code for a Gibbs sampler. Heinrich discusses in detail Gibbs sampling for the widely discussed Latent Dirichlet Allocation (LDA) model, and his corresponding code is at http://www.arbylon.net/projects/LdaGibbsSampler.java.

For a textbook-style exposition, see Bishop [2]. The relevant pieces of the book are a little less standalone than we'd like (which makes sense for a course on machine learning, as opposed to just trying to dive straight into a specific topic); Chapter 11 (Sampling Methods) is most relevant, though you may also find yourself referring back to Chapter 8 (Graphical Models).

Those ready to dive into the topic of Markov Chain Monte Carlo in more depth might want to start with Andrieu et al. [1]. We and Andrieu et al. appear to differ somewhat on the semantics of the word “introduction,” which is one of the reasons the document you're reading exists.

Finally, for people interested in the use of Gibbs sampling for structured models in NLP (e.g. parsing), the right place to start is undoubtedly Kevin Knight's excellent “Bayesian Inference with Tears: A tutorial workbook for natural language researchers” [9], after which you'll be equipped to look at Johnson, Griffiths, and Goldwater [8] (footnote 27). The leap from the models discussed here to those kinds of models actually turns out to be a lot less scary than it appears at first.
The main thing to observe is that in the crucial sampling step (equation (14) of Section 1.2.3), the denominator is just the numerator without $z_i^{(t)}$, the variable whose new value you're choosing. So when you're sampling conditional distributions (e.g. Sections 2.5.1–2.5.4) in more complex models, the basic idea will be the same: you subtract out counts related to the variable you're interested in based on its current value, compute proportions based on the remaining counts, then pick probabilistically based on the result, and finally add counts back in according to the probabilistic choice you just made.

Footnote 26: Again, there seems to be as much art as science in whether to use multiple chains, how many, and how to combine them to get a single output. Chris Dyer (personal communication) reports that it is not uncommon simply to concatenate the chains together after removing the burn-in iterations.

Footnote 27: As an aside, our travels through the literature in writing this document led to an interesting early use of Gibbs sampling with CFGs: Grate et al. [6]. Johnson et al. had not come across this when they wrote their seminal paper introducing Gibbs sampling for PCFGs to the NLP community (and in fact a search on scholar.google.com turned up no citations in the NLP literature). Mark Johnson (personal communication) observes that the “local move” Gibbs sampler Grate et al. describe is specialized to a particular PCFG, and it's not clear how to generalize it to arbitrary PCFGs.

Acknowledgments

The creation of this document has been supported in part by the National Science Foundation (award IIS0838801), the GALE program of the Defense Advanced Research Projects Agency (Contract No. HR001106-2-001), and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory.
All statements of fact, opinion, or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of NSF, DARPA, IARPA, the ODNI, the U.S. Government, the University of Maryland, nor of the Resnik or Hardisty families or any of their close relations, friends, or family pets.

The authors are grateful to Jordan Boyd-Graber, Bob Carpenter, Chris Dyer, Jason Eisner, John Goldsmith, Kevin Knight, Mark Johnson, Nitin Madnani, Neil Parikh, Sasa Petrovic, William Schuler, Prithvi Sen, Wu Ke and Matt Pico (so far!) for helpful comments and/or catching glitches in the manuscript. Extra thanks to Jordan Boyd-Graber for many helpful discussions and extra assistance with plate diagrams, and to Jay Resnik for his help debugging the LaTeX document.

References

[1] Andrieu, Freitas, Doucet, and Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
[2] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, June 1995.
[4] C. Geyer. Burn-in is unnecessary, 2009. http://www.stat.umn.edu/~charlie/mcmc/burn.html, Downloaded October 18, 2009.
[5] C. Geyer. One long run, 2009. http://www.stat.umn.edu/~charlie/mcmc/one.html.
[6] L. Grate, M. Herbster, R. Hughey, D. Haussler, I. S. Mian, and H. Noller. RNA modeling using Gibbs sampling and stochastic context free grammars. In ISMB, pages 138–146, 1994.
[7] G. Heinrich. Parameter estimation for text analysis. Technical Note Version 2.4, vsonix GmbH and University of Leipzig, August 2008. http://www.arbylon.net/publications/text-est.pdf.
[8] M. Johnson, T. Griffiths, and S. Goldwater. Bayesian inference for PCFGs via Markov chain Monte Carlo.
In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 139–146, Rochester, New York, April 2007. Association for Computational Linguistics. [9] K. Knight. Bayesian inference with tears: A tutorial workbook for natural language researchers, 2009. http://www.isi.edu/natural-language/people/bayes-with-tears.pdf. [10] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009. [11] H. Lohninger. Teach/Me Data Analysis. http://www.vias.org/tmdatanaleng/cc corr auto 1.html. Springer-Verlag, 1999. [12] A. Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–49, August 2008. [13] D. Mackay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. 21 [14] C. Manning and H. Schuetze. Foundations of Statistical Natural Language Processing. MIT Press, 1999. [15] R. M. Neal. Probabilistic inference using markov chain monte carlo methods. Technical Report CRGTR-93-1, University of Toronto, 1993. http://www.cs.toronto.edu/∼radford/ftp/review.pdf. [16] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. [17] T. Pedersen. Knowledge lean word sense disambiguation. In AAAI/IAAI, page 814, 1997. [18] T. Pedersen. Learning Probabilistic Models of Word Sense Disambiguation. PhD thesis, Southern Methodist University, 1998. http://arxiv.org/abs/0707.3972. [19] S. Theodoridis and K. Koutroumbas. Pattern Recognition, 4th Ed. Academic Press, 2009. [20] H. Wallach, C. Sutton, and A. McCallum. Bayesian modeling of dependency trees using hierarchical Pitman-Yor priors. In ICML Workshop on Prior Knowledge for Text and Language Processing, 2008. 22
