# Bayesian Cognitive Modeling: A Practical Course

```Bayesian Cognitive Modeling:
A Practical Course
MICHAEL D. LEE AND ERIC-JAN WAGENMAKERS
March 21, 2012
PRELIMINARY DRAFT
SUGGESTIONS FOR IMPROVEMENT WELCOME
Contents
Preface
page vi
Part I Getting Started
1 Bayesian Basics
1.1 General Principles
1.2 Prediction
1.3 Sequential Updating
1.4 Markov Chain Monte Carlo
3
3
5
6
7
11
2 Getting Started with WinBUGS
2.1 Installing WinBUGS, Matbugs, and R2WinBugs
2.2 Using the Applications
14
14
15
29
Part II Parameter Estimation
iii
1
31
3 Inferences With Binomials
3.1 Inferring a Rate
3.2 Difference Between Two Rates
3.3 Inferring a Common Rate
3.4 Prior and Posterior Prediction
3.5 Posterior Prediction
3.6 Joint Distributions
33
33
35
37
39
42
44
4 Inferences With Gaussians
4.1 Inferring Means and Standard Deviations
4.2 The Seven Scientists
4.3 Repeated Measurement of IQ
47
47
48
50
5 Data Analysis
5.1 Pearson Correlation
5.2 Pearson Correlation With Uncertainty
52
52
54
Contents
iv
5.3
5.4
5.5
5.6
The Kappa Coefficient of Agreement
Change Detection in Time Series Data
Censored Data
Population Size
6 Latent Mixtures
6.1 Exam Scores
6.2 Exam Scores With Individual Differences
6.3 Twenty Questions
6.4 The Two Country Quiz
6.5 Latent Group Assessment of Malingering
6.6 Individual Differences in Malingering
6.7 Alzheimer’s Recall Test Cheating
56
59
61
64
68
68
70
71
74
77
79
82
Preface
87
89
2.4 Answers: Markov Chain Monte Carlo
90
90
91
92
92
3.2 Answers: Difference Between Two Rates
3.3 Answers: Inferring a Common Rate
3.4 Answers: Prior and Posterior Prediction
93
93
94
95
95
97
98
4.1 Answers: Inferring Means and Standard Deviations
4.3 Answers: Repeated Measurement of IQ
100
100
101
102
5.2 Answers: Pearson Correlation With Uncertainty
5.3 Answers: The Kappa Coefficient of Agreement
5.4 Answers: Change Detection in Time Series Data
104
104
104
106
107
108
Contents
v
6.2 Answers: Exam Scores With Individual Differences
6.4 Answers: The Two Country Quiz
6.5 Answers: Latent Group Assessment of Malingering
6.6 Answers: Individual Differences in Malingering
6.7 Answers: Alzheimer’s Recall Test Cheating
110
110
111
112
113
115
116
117
References
119
References
Index
119
123
Preface
This book teaches you how to do Bayesian modeling. Using modern computer
software—and, in particular, the WinBUGS program—this turns out to be surprisingly straightforward. After working through the examples provided in this book,
you should be able to build your own models, apply them to your own data, and
This book is based on three principles. The first is that of accessibility: the
book’s only prerequisite is that you know how to operate a computer; you do not
need any advanced knowledge of statistics or mathematics. The second principle is
that of applicability: the examples in this book are meant to illustrate how Bayesian
modeling can be useful for problems that people in cognitive science care about.
The third principle is that of practicality: this book offers a hands-on, “just do it”
approach that we feel keeps students interested and motivated to learn more.
In line with these three principles, this book has little content that is purely
theoretical. Hence, you will not learn from this book why the Bayesian philosophy
to inference is as compelling as it is; neither will you learn much about the intricate
details of modern sampling algorithms such as Markov chain Monte Carlo, even
though this book could not exist without them.
The goal of this book is to facilitate and promote the use of Bayesian modeling in
cognitive science. As shown by means of examples throughout this book, Bayesian
modeling is ideally suited for applications in cognitive science. It is easy to construct a basic model, and then add individual differences, add substantive prior
information, add covariates, add a contaminant process, and so on. In other words,
Bayesian modeling is flexible and respects the complexities that are inherent in the
modeling of cognitive phenomena.
We hope that after completing this course, you will have gained not only a new
understanding of statistics (yes, it can make sense), but also the technical skills to
implement statistical models that professional but non-Bayesian cognitive scientists
We like to thank John Miyamoto, Eddy Davelaar, Hedderik van Rijn, and Thomas
Palmeri for constructive criticism and suggestions for improvement, and Dora Matzke
for her help in programming and plotting.
Michael D. Lee
Irvine, USA
Eric-Jan Wagenmakers
Amsterdam, The Netherlands
August 2011
vi
PART I
GETTING STARTED
(...) the theory of probabilities is basically just common sense reduced
to calculus; it makes one appreciate with exactness that which accurate
minds feel with a sort of instinct, often without being able to account
for it.
Laplace, 1829
1
The Basics of Bayesian Analysis
1.1 General Principles
The general principles of Bayesian analysis are easy to understand. First, uncertainty or “degree of belief” is quantified by probability. Second, the observed data
are used to update the prior information or beliefs to become posterior information
or beliefs. That’s it!
To see how this works in practice, consider the following example. Assume we
give you a test that consists of 10 factual questions of equal difficulty. What we want
to estimate is your ability θ—the rate with which you answer questions correctly.
Note that we do not directly observe your ability θ; all that we observe is your score
on the test.
Before we do anything else (for example, before we start to look at your data) we
need to specify our prior uncertainty with respect to your ability θ. This uncertainty
needs to be expressed as a probability distribution, called the prior distribution. In
this case, keep in mind that θ can range from 0 to 1, and that you do not know
anything about the topic or about the difficulty level of the questions. Then, a reasonable “prior distribution”, denoted by p (θ), is one that assigns equal probability
to every value of θ. This uniform distribution is shown by the dotted horizontal line
in Figure 1.1.
Now we consider your performance, and find that you answered 9 out of 10
questions correctly. After having seen these data, the updated knowledge about
θ is described by the posterior distribution, denoted p (θ | D), where D indicates
the observed data. Bayes rule specifies how we can combine the information from
the data—that is, the Binomial likelihood p (D | θ)—with the information from the
prior distribution p (θ) to arrive at the posterior distribution p (θ | D):
p (D | θ) p (θ)
.
p(D)
(1.1)
likelihood × prior
.
marginal likelihood
(1.2)
p (θ | D) =
This equation is often verbalized as
posterior =
Note that the marginal likelihood (i.e., the probability of the observed data) does
not involve the parameter θ, and is given by a single number that ensures that
the area under the posterior distribution equals 1. Therefore, Equation 1.1 is often
3
Bayesian Basics
4
5
95%
Density
4
Posterior
3
2
Prior
1
MLE
0
0
0.25
0.5
0.75
1
θ
t
Fig. 1.1
Bayesian parameter estimation for rate parameter θ, after observing 9 correct
responses and 1 incorrect response. The mode of the posterior distribution for θ is
0.9, equal to the maximum likelihood estimate, and the 95% confidence interval
extends from 0.59 to 0.98.
written as
p (θ | D) ∝ p (D | θ) p (θ) ,
(1.3)
which says that the posterior is proportional to the likelihood times the prior. Note
that the posterior distribution is a combination of what we knew before we saw
the data (i.e., the information in the prior distribution), and what we have learned
from the data.
The solid line in Figure 1.1 shows the posterior distribution for θ, obtained when
the uniform prior is updated with the data, that is, k = 9 correct answers out of n =
10 questions. The central tendency of a posterior distribution is often summarized
by its mean, median, or mode. Note that with a uniform prior, the mode of a
posterior distribution coincides with the classical maximum likelihood estimate or
MLE , θ̂ = k/n = 0.9 (Myung, 2003). The spread of a posterior distribution is
most easily captured by a Bayesian x% credible interval that extends from the
(x/2)th to the (100−x/2)th percentile of the posterior distribution. For the posterior
distribution in Figure 1.1, a 95% Bayesian credible interval for θ extends from 0.59
to 0.98. In contrast to the orthodox confidence interval, this means that one can be
95% confident that the true value of θ lies in between 0.59 and 0.98.
Exercises
Exercise 1.1.1 The famous Bayesian statistician Bruno de Finetti published two
big volumes entitled “Theory of Probability” (de Finetti, 1974). Perhaps surprisingly, the first volume starts with the words “probability does not exist”.
5
Prediction
To understand why de Finetti wrote this, consider the following situation:
someone tosses a fair coin, and the outcome will be either heads or tails.
What do you think the probability is that the coin lands heads? Now suppose
you are a physicist with advanced measurement tools, and you can establish
relatively precisely both the position of the coin and the tension in the muscles immediately before the coin is tossed in the air—does this change your
probability? Now suppose you can briefly look into the future (Bem, 2011),
albeit hazily—is your probability still the same?
Exercise 1.1.2 On his blog, prominent Bayesian Andrew Gelman wrote (March
18, 2010) “Some probabilities are more objective than others. The probability
that the die sitting in front of me now will come up ‘6’ if I roll it...that’s
about 1/6. But not exactly, because it’s not a perfectly symmetric die. The
probability that I’ll be stopped by exactly three traffic lights on the way to
school tomorrow morning: that’s...well, I don’t know exactly, but it is what
it is.” Was de Finetti wrong, and is there only one clearly defined probability of Andrew Gelman encountering three traffic lights on the way to school
tomorrow morning?
Exercise 1.1.3 Figure 1.1 shows that the 95% Bayesian credible interval for θ
extends from 0.59 to 0.98. This means that one can be 95% confident that
the true value of θ lies in between 0.59 and 0.98. Suppose you did an orthodox analysis and found the same confidence interval. What is the orthodox
interpretation of this interval?
Exercise 1.1.4 Suppose you learn that the questions are all true or false questions.
Does this knowledge affect your prior distribution? And if so, how would this
prior in turn affect your posterior distribution?
1.2 Prediction
The posterior distribution θ contains all that we know about the rate with which
you answer questions correctly. One way to use the knowledge is prediction.
For instance, suppose we design a new set of 5 questions, all of the same
difficulty as before. How can we formalize our expectations about your performance on this new set? In other words, how can we use the posterior distribution
p (θ | n = 10, k = 9)—which after all represents everything that we know about θ
from the old set—to predict the number of correct responses out of the new set of
nrep = 5 questions? The mathematical solution is to integrate over the posterior,
R
p (k rep | θ, nrep = 5) p (θ | n = 10, k = 9) dθ, where k rep is the predicted number
of correct responses out of the additional set of 5 questions.
Computationally, you can think of this procedure as repeatedly drawing a random
value θi from the posterior, and using that value to every time determine a single
kirep . The end result is p (k rep ), the posterior predictive density of the possible
number of correct responses in the additional set of 5 questions. The important
6
Bayesian Basics
point is that by integrating over the posterior, all predictive uncertainty is taken
into account.
Exercises
Exercise 1.2.1 Instead of “integrating over the posterior”, orthodox methods often use the “plug-in principle”; in this case, the plug-in principle suggest that
we predict p(k rep ) solely based on θ̂, the maximum likelihood estimate. Why
is this generally a bad idea? Can you think of a specific situation in which
this may not be so much of a problem?
1.3 Sequential Updating
Bayesian analysis is particularly appropriate when you want to combine different
sources of information. For instance, assume that we present you with a new set of 5
questions. You answer 3 out of 5 correctly. How can we combine this new information
with the old? Or, in other words, how do we update our knowledge of θ? Consistent
with intuition, Bayes’ rule entails that the prior that should be updated based on
your performance for the new set is the posterior that was obtained based on your
performance for the old set. Or, as Lindley put it, “today’s posterior is tomorrow’s
prior” (Lindley, 1972, p. 2).
When all the data have been collected, however, the precise order in which this
was done is irrelevant; the results from the 15 questions could have been analyzed as
a single batch, they could have been analyzed sequentially, one-by-one, they could
have been analyzed by first considering the set of 10 questions and next the set of
5, or vice versa. For all these cases, the end result, the final posterior distribution
for θ, is identical. This again contrasts with orthodox inference, in which inference
for sequential designs is radically different from that for non-sequential designs (for
a discussion, see, for example, Anscombe, 1963).
Thus, a posterior distribution describes our uncertainty with respect to a parameter of interest, and the posterior is useful—or, as a Bayesian would have it,
necessary—for probabilistic prediction and for sequential updating. In general, the
posterior distribution or any of its summary measures can only be obtained in closed
form for a restricted set of relatively simple models. To illustrate in the case of our
binomial example, the uniform prior is a so–called beta distribution with parameters α = 1 and β = 1, and when combined with the binomial likelihood this yields
a posterior that is also a beta distribution, with parameters α + k and β + n − k.
In simple conjugate cases such as these, where the prior and the posterior belong
to the same distributional family, it is possible to obtain closed form solutions for
the posterior distribution, but in many interesting cases it is not.
Markov Chain Monte Carlo
7
1.4 Markov Chain Monte Carlo
For a long time, researchers could only proceed with Bayesian inference when the
posterior was available in closed form. As a result, practitioners interested in models
of realistic complexity did not much use Bayesian inference. This situation changed
dramatically with the advent of computer-driven sampling methodology generally
known as Markov chain Monte Carlo (MCMC: e.g., Gamerman & Lopes, 2006;
Gilks, Richardson, & Spiegelhalter, 1996). Using MCMC techniques such as Gibbs
sampling or the Metropolis-Hastings algorithm, researchers can directly sample
sequences of values from the posterior distribution of interest, foregoing the need
for closed form analytic solutions. The current adage is that Bayesian models are
limited only by the user’s imagination.
In order to visualize the increased popularity of Bayesian inference, Figure 1.2
plots the proportion of articles that feature the words “Bayes” or “Bayesian”, according to Google Scholar (for a similar analysis for specific journals in statistics
and economics see Poirier, 2006). The time line in Figure 1.2 also indicates the publication of three landmark papers, Geman and Geman (1984), Gelfand and Smith
(1990), and Casella and George (1992), as well as the introduction of WinBUGS, a
general-purpose program that greatly facilitates Bayesian analysis for a wide range
of statistical models (Lunn, Thomas, Best, & Spiegelhalter, 2000; Lunn, Spiegelhalter, Thomas, & Best, 2009; Sheu & O’Curry, 1998). Thus, MCMC methods have
transformed Bayesian inference to a vibrant and practical area of modern statistics.
For a concrete and simple illustration of Bayesian inference using MCMC, consider again the binomial example of 9 correct responses out of 10 questions, and
the associated inference problem for θ, the rate of answering questions correctly.
Throughout this book, we use WinBUGS to specify and fit our models, saving us
the effort to code the MCMC algorithms ourselves. Although WinBUGS does not
work for every research problem application, it will work for many in cognitive science. WinBUGS is easy to learn and is supported by a large community of active
researchers.1
The WinBUGS program requires you to construct a file that contains the model
specification, a file that contains initial values for the model parameters, and a
file that contains the data. The model specification file is most important. For our
binomial example, we set out to obtain samples from the prior and the posterior of
θ. The associated WinBUGS model specification code is three lines long:
model {
theta ~ dbeta(1,1) # the uniform prior for updating by the data
k ~ dbin(theta,n) # the data; in our example, k = 9 and n = 10
thetaprior ~ dbeta(1,1) # a uniform prior not for updating
1
Two alternative software programs are OpenBUGS and JAGS, programs that may be particularly attractive for Mac and Linux users. The model code for OpenBUGS and JAGS is almost
identical to WinBUGS, so that the transition from one program to the other is easy.
Bayesian Basics
8
Proportion of hits on Google Scholar
for ’bayes OR bayesian −author:bayes’
0.06
0.05
0.04
0.03
WinBUGS
Casella & George
Geman & Geman
0.02
Gelfand & Smith
0.01
0.00
1980
1985
1990
1995
2000
2007
Year
t
Fig. 1.2
A Google Scholar perspective on the increasing popularity of Bayesian inference.
}
In this code, the “∼” or twiddle symbol denotes “is distributed as”, dbeta(a,b)
indicates the beta distribution with parameters a and b2 , and dbin(theta,n) indicates the binomial distribution with rate theta and n observations. These and
many other distributions are build in to the WinBUGS program. The “#” or hash
sign is used for commenting out what should not be compiled. As WinBUGS is a
declarative language, the order of the three lines is inconsequential.
When this code is executed, you obtain a sequence of samples (i.e., an MCMC
chain) from the posterior p (θ | D) and a sequence of samples from the prior p (θ).
This sequence is called a chain. In more complex models, it may take some time
before a chain converges from its starting value to what is called its stationary
distribution. To make sure that we only use those samples that come from the
stationary distribution, and hence are unaffected by the starting values, it is good
practice to diagnose convergence by running multiple chains. It is often also good
practice to discard the first samples from each chain. These discarded samples are
called burn in samples. Finally, it can also be helpful, especially when the sampling
process produce auto-correlates sequences, not to record every sample taken in a
2
The dbeta(1,1) distribution is a uniform distribution from 0 to 1. Therefore, the prior distribution for θ could also have been specified as theta ∼ dunif(0,1).
Markov Chain Monte Carlo
9
1.0
0.8
0.6
θ
Chain 1
0.4
Chain 2
Chain 3
0.2
0.0
0
20
40
60
80
100
MCMC Iteration
t
Fig. 1.3
Three MCMC chains for rate parameter θ, after observing 9 correct responses and 1
incorrect response.
chain, but every second, or third, or tenth, or some other subset of samples. This
is known as thinning.
For example, Figure 1.3 shows the first 100 iterations for three chains that were
set up to draw values from the posterior for θ. It is evident that the three chains are
“mixing” well, suggesting early convergence. Quantitative measures for diagnosing
convergence are also available, such as the Gelman and Rubin (1992) R̂ statistic,
that compares within–chain to between–chain variability. For more recommendations regarding convergence see Gelman (1996) and Gelman and Hill (2007).
After assuring ourselves that the chains have converged, we can use the sampled
values to plot a histogram, construct a density estimate, and compute values of
interest. To illustrate, the three chains from Figure 1.3 were run for 3000 iterations
each, for a total of 9000 samples for the prior and the posterior of θ. Figure 1.4
plots histograms for the prior (i.e., dotted line) and the posterior (i.e., thick solid
line). To visualize how the histograms are constructed from the MCMC chains, the
bottom panel of Figure 1.4 plots the MCMC chains sideways; the histograms are
created by collapsing the values along the “MCMC iteration” axis and onto the “θ”
axis.
In the top panel of Figure 1.4, the thin solid lines represent density estimates.
The mode of the density estimate for the posterior of θ is 0.89, whereas the 95%
credible interval is (0.59, 0.98), matching the analytical result shown in Figure 1.1.
The key point is that the analytical intractibilities that limited the scope of
Bayesian parameter estimation have now been overcome. Using MCMC sampling,
posterior distributions can be approximated to any desired degree of accuracy. This
book teaches you to use MCMC sampling and Bayesian inference to do research
with cognitive science models and data.
Bayesian Basics
10
5
95%
Density
4
Posterior
3
2
Prior
1
0
0
0.25
0.5
0.75
1
0.75
1
θ
0
0.25
0.5
MCMC Iteration
0
20
Chain 1
Chain 2
Chain 3
40
60
80
100
t
Fig. 1.4
MCMC–based Bayesian parameter estimation for rate parameter θ, after observing 9
correct responses and 1 incorrect response. The thin solid lines indicate the fit of a
density estimator. Based on this density estimator, the mode of the posterior
distribution for θ is approximately 0.89, and the 95% credible interval extends from
0.59 to 0.98, closely matching the analytical results from Figure 1.1.
Exercise 1.4.1 Use Google and list some other scientific disciplines that use
Bayesian inference and MCMC sampling.
Exercise 1.4.2 The text reads: “Using MCMC sampling, posterior distributions
can be approximated to any desired degree of accuracy”. How is this possible?
11
This section provides some references for further reading. We first list Bayesian
textbooks and seminal papers, then some texts that specifically deal with WinBUGS. We also note that Smithson (2010) presents a useful comparative review of
six introductory textbooks on Bayesian methods.
1.5.1 Bayesian Statistics
This section contains an annotated bibliography on Bayesian articles and books
that we believe are particularly useful or inspiring.
Berger, J. O. and Wolpert, R. L. (1988). The Likelihood Principle (2nd ed.).
Institute of Mathematical Statistics, Hayward (CA). This is a great book if you want
to understand the limitations of orthodox statistics. Insightful and fun.
Bolstad, W. M. (2007). Introduction to Bayesian Statistics (2nd ed.). Wiley,
Hoboken (NJ). Many books claim to introduce Bayesian statistics, but forget
to state on the cover that the introduction is “for statisticians” or “for those
comfortable with mathematical statistics”. The Bolstad book is an exception, as
it does not assume much background knowledge.
Gamerman, D., & Lopes, H. F. (2006). Markov Chain Monte Carlo: Stochastic
Simulation for Bayesian Inference. Chapman & Hall/CRC, Boca Raton (FL). This
book discusses the details of MCMC sampling; a good book, but too advanced for
beginners.
Gelman, A. & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, This book is an extensive
practical guide on how to apply Bayesian regression models to data. WinBUGS
code is provided throughout the book. Andrew Gelman also has an active blog
that you might find interesting: http://andrewgelman.com/
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov Chain Monte
Carlo in Practice. Chapman & Hall/CRC, Boca Raton (FL). A citation classic in
the MCMC literature, this book features many short chapters on all kinds of
sampling-related topics: theory, convergence, model selection, mixture models, and
so on.
Gill, J. (2002). Bayesian Methods: A Social and Behavioral Sciences Approach.
CRC Press, Boca Raton (FL). A well-written book that covers a lot of ground.
Readers need some background in mathematical statistics to appreciate the content.
12
Bayesian Basics
Hoff, P. D. (2009). A First Course in Bayesian Statistical Methods. Springer,
Dordrecht, The Netherlands. A clear and well-written introduction to Bayesian
inference, with accompanying R code. This book requires some familiarity with
mathematical statistics.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press, Cambridge. Jaynes was one of the most ardent supporters of objective
Bayesian statistics. The book is full of interesting ideas and compelling arguments,
as well as being laced with Jaynes’ acerbic wit, but it requires some mathematical
background to appreciate all of the content.
Jeffreys, H. (1939/1961). Theory of Probability. Oxford University Press, Oxford, UK.
Sir Harold Jeffreys is the first statistician who exclusively used Bayesian methods
for inference. Jeffreys also invented the Bayesian hypothesis test, and was generally
far ahead of his time. The book is not always an easy read, in part because the notation is somewhat outdated. Strongly recommended, but only for those who already
have a solid background in mathematical statistics and a firm grasp of Bayesian
thinking. See www.economics.soton.ac.uk/staff/aldrich/jeffreysweb.htm
Lindley, D. V. (2000). The philosophy of statistics. The Statistician, 49, 293337. Dennis Lindley, one the godfathers of current-day Bayesian statistics, explains
why Bayesian inference is right and everything else is wrong. Peter Armitage
commented on the paper: “Lindley’s concern is with the very nature of statistics,
and his argument unfolds clearly, seamlessly and relentlessly. Those of us who
cannot accompany him to the end of his journey must consider very carefully
where we need to dismount; otherwise we shall find ourselves unwittingly at the
bus terminus, without a return ticket.”
McGrayne, S. B. (2011). The Theory That Would Not Die: How Bayes’ Rule
Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant From Two Centuries Of Controversy. Yale University Press.. A fascinating
and accessible overview of the history of Bayesian inference.
Ntzoufras, I. (2009). Bayesian Modeling using WinBUGS. Wiley, Hoboken (NJ).
A great book for learning how to do regression and ANOVA using WinBUGS. See
www.ruudwetzels.com for a detailed review.
O’Hagan, A. & Forster, J. (2004). Kendall’s Advanced Theory of Statistics Vol.
2B: Bayesian Inference (2nd ed.). Arnold, London. If you are willing to read only a
single book on Bayesian statistics, this one is it. The book requires a background
in mathematical statistics.
Royall, R. M. (1997). Statistical Evidence: A Likelihood Paradigm. Chapman &
Hall, London. This book describes the different statistical paradigms, and highlights
13
the deficiencies of the orthodox schools. The content can be appreciated without
much background knowledge in statistics. The main disadvantage of this book is
that the author is not a Bayesian. We still recommend the book, which is saying
something.
1.5.2 WinBUGS Texts
Kruschke, J. K. (2010). Doing Bayesian Data Analysis: A Tutorial Introduction with
R and BUGS. Academic Press, Burlington (MA). This is one of the first Bayesian
books geared explicitly towards experimental psychologists and cognitive scientists.
Kruschke explains core Bayesian concepts with concrete examples and OpenBUGS
code. The book focuses on statistical models such as regression and ANOVA, and
provides a Bayesian approach to data analysis in psychology, cognitive science,
and empirical sciences more generally.
Lee, S.–Y. (2007). Structural Equation Modelling: A Bayesian Approach. Chichester, UK: Wiley.
Ntzoufras, I. (2009). Bayesian modeling using WinBUGS. Hoboken, NJ: Wiley.
Provides an easily accessible introduction to the use of WinBUGS. The book
also presents a variety of Bayesian modeling examples, with the emphasis on
Generalized Linear Models.
Spiegelhalter, D., Best, N. & Lunn, D. (2003). WinBUGS User Manual 1.4.
MRC Biostatistic Unit, Cambridge, UK. Provides an introduction to the use of
WinBUGS, including a useful tutorial and various tips and tricks for new users.
2
Getting Started with WinBUGS
with Dora Matzke
Throughout this course book, you will use the WinBUGS (Lunn et al., 2000)
software to work your way through the exercises. Although it is possible to do the
exercises using the graphical user interface provided by the WinBUGS package, you
can also use the Matlab or R programs to interact with WinBUGS.
In this chapter, we start by working through a concrete example using just WinBUGS. This provides an introduction to the WinBUGS interface, and the basic
theoretical and practical components involved in Bayesian graphical model analysis. Completing the example will also quickly convince you that you do not want
to rely on WinBUGS as your primary means for handling and analyzing data. It is
not especially easy to use as a graphical user interface, and does not have all of the
data management and visualization features needed for research.
Instead, we encourage you to choose either Matlab or R as your primary research
computing environment, and use WinBUGS as an ‘add-on’ that does the Bayesian
inference part of analyses. Some WinBUGS interface capabilities will remain useful,
especially in the exploratory stages of research. But either Matlab or R will be primary. Accordingly, this chapter re-works the concrete example, originally done in
WinBUGS, using both Matlab and R. You should complete just the one corresponding to your preferred research software. You will then be ready for the following
chapters, which assume you are working in either Matlab or R, but understand the
basics on the WinBUGS interface.
2.1 Installing WinBUGS, Matbugs, and R2WinBugs
2.1.1 Installing WinBUGS
WinBUGS is a currently free software, and is available at http://www.mrc-bsu.
Some of the exercises in this course might work without the registration key, but
some of them will not. You can download WinBUGS and the registration key directly from http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml.
14
15
Using the Applications
2.1.2 Installing Matlab and Matbugs
Matlab is a commercial software, and is available at http://www.mathworks.com/.
As best we know, any reasonably recent version of Matlab should let you do the
exercises in this course. Also, as best we know, no toolboxes are required. To
give Matlab the ability to interact with WinBUGS, download the freely available matbugs.m function and put it in your Matlab working directory. You can
MATBUGS/matbugs.html.
2.1.3 Installing R and R2WinBUGS
R is a free software, and is available at http://www.r-project.org/. You can
bin/windows/base/. To give R the ability to interact with WinBUGS, you have
to install the R2WinBUGS package. To install the R2WinBUGS package, start R and
select the Install Package(s) option in the Packages menu. Once you chose your
preferred CRAN mirror, select R2WinBUGS in the Packages window and click on OK.
2.2 Using the Applications
2.2.1 An Example with the Binomial Distribution
We will illustrate the use of WinBUGS, Matbugs, and R by means of a simple
example involving a binary process. A binary process is anything where there are
only two possible outcomes. It might be that something either happens or does
not happen, or that something either succeeds or fails, or that something takes one
value rather than the other. An inference that is often important for these sorts of
processes is the underlying rate at which the process takes one value rather than
the other. Inferences about the rate can be made by observing how many times the
process takes each value over a number of trials.
Suppose that one of the values (e.g., the number of successes) happens on k out
of n trials. These are known, or observed, data. The unknown variable of interest is
the rate θ at which the values are produced. Assuming that what happened on one
trial does not influence the other trials the number of successes k follows a Binomial
distribution, k ∼ Binomial θ, n . This relationship means that by observing the k
successes out of n trials, it is possible to update our knowledge about the rate θ. The
basic idea of Bayesian analysis is that what we know, and what we do not know,
about the variables of interest is always represented by probability distributions.
Data like k and n allow us to update prior distributions for the unknown variables
into posterior distributions that incorporate the new information.
16
Getting Started with WinBUGS
The graphical model representation of our binomial example is shown in Figure 2.1. The nodes represent all the variables that are relevant to the problem. The
graph structure is used to indicate dependencies between the variables, with children depending on their parents. We use the conventions of representing unobserved
variables without shading and observed variables with shading, and continuous variables with circular nodes and discrete variables with square nodes.
Thus, the observed discrete counts of the numbers of successes k and the number
of trials n are represented by shaded and square nodes, and the unknown continuous
rate θ is represented by an unshaded and circular node. Because the number of
successes k depends on the number of trials n and on the rate of success θ, the
nodes representing n and θ are directed towards the node representing k. We will
start with the prior assumption that all possible rates between 0 and 1 are equally
likely. We will thus assume a uniform prior θ ∼ Uniform 0, 1 .
θ
k
θ ∼ Beta(1, 1)
k ∼ Binomial(θ, n)
n
t
Fig. 2.1
Graphical model for inferring the rate of a binary process.
One advantage of using the language of graphical models is that it gives a complete and interpretable representation of a Bayesian probabilistic model. Another
advantage is that WinBUGS can easily implement graphical models, and its various built-in computational algorithms are then able to do all of the inferences
automatically.
2.2.2 Using WinBUGS
WinBUGS requires the user to construct a file that contains the data, a file that
contains the starting values for the model parameters, and a file that contains the
model specification. The WinBUGS model specification code associated with our
binomial example is as follows:
# Inferring a Rate
model {
# Prior on Rate Thetat
theta ~ dbeta(1,1)
# Observed Counts
k ~ dbin(theta,n)
}
Using the Applications
17
Note that, even though conceptually
the prior on θ is Uniform 0, 1 , it has been
implemented as Beta 1, 1 . These two distributions are the same, but our experience is that WinBUGS seems to have fewer computational problems with the Beta
distribution implementation.
Implementing the model shown in Figure 2.1, and obtaining samples from the
posterior distribution of θ, can be done by following these steps.
1. Copy the model specification text above and paste it in a text file. Save the file,
for instance as “Rate 1.txt”.
2. Start WinBUGS. Open your newly created model specification file by selecting the Open option in the File menu, choosing the appropriate directory, and
double-clicking on the model specification file. Do not forget to select files of
type “txt”, or you might be searching for a long time. Now check the syntax
of the model specification code by selecting the Specification option in the
Model menu. Once the Specification Tool window is opened, as shown in
Figure 2.2, highlight the word “model” at the beginning of the code and click on
check model. If the model is syntactically correct and all parameters are given
priors, the message “model is syntactically correct” will appear in the status
bar all the way in the bottom left corner of the WinBUGS window. (Although
beware, the letters are very small and difficult to see).
3. Create a text file that contains the data. The content of the file should look like
this:
list(
k=5,
n=10
)
Save the file, for instance as “Data.Rate 1.txt”.
4. Open the data file and load the data. To open the data file, select the Open
option in the File menu, select the appropriate directory, and double-click on
the data file. To load the data, highlight the word “list” at the beginning of the
data file and click on load data in the Specification Tool window, as shown
in Figure 2.2. If the data are successfully loaded, the message “data is loaded”
will appear in the status bar.
5. Set the number of chains. Each chain is an independent run of the same model
with the same data, although you can vary the set different starting values of
parameters for each chain.1 Chains provide a key test of convergence—something
we will discuss in more detail in a later chapter. In our binomial example, we
will run two chains. To set the number of chains, type “2” in the field labelled
num of chains in the Specification Tool window, shown in Figure 2.2.
6. Compile the model. To compile the model, click on compile in the
1
Running multiple chains is the best and easiest way to ensure WinBUGS uses different random
number sequences in sampling. Doing a single-chain analysis multiple times can use the same
random number sequence, and so produce the same results.
Getting Started with WinBUGS
18
t
Fig. 2.2
Model Specification Tool.
Specification Tool window, shown in Figure 2.2. If the model is successfully compiled, the message “model compiled” will appear in the status bar.
7. Create a text file that contains the starting values of the unobserved variables
(i.e., just the parameter θ for this model). If you do not specify the starting
values, WinBUGS will try to get them from the prior, which may or may not
lead to numerical crashes. It is therefore safer to give a starting value to all
unobserved variables, and especially for variables at nodes ‘at the top’ of the
graphical model, which have no parents.
The content of the file should look like this:
list(
theta=0.1
)
list(
theta=0.9
)
Save the file, for instance as “Start.values.txt”.
8. Open the file that contains the starting values by selecting the Open option in
the File menu, selecting the appropriate directory, and double-clicking on the
file. To load the starting value of θ for the first chain, highlight the word “list” at
the beginning of the file and click on load inits in the Specification Tool
Using the Applications
19
window, shown in Figure 2.2). To load the starting value for the second chain,
highlight the second “list” command and click on load inits once again. If the
starting values are successfully loaded, the message “model is initialized” will
appear in the status bar.
9. Set monitors to store the sampled values of the parameters of interest. To set a
monitor for θ, select the Samples option from the Inference menu. Once the
Sample Monitor Tool window, shown in Figure 2.3, is opened, type “theta” in
the field labelled node and click on set.
10. Specify the number of samples you want to record. To this end, you first have
to specifythe total number of samples you want to draw from the posterior of θ,
and the number of burn-in samples that you want to discard at the beginning
of a sampling run. The number of recorded samples equals the total number of
samples minus the number of burn-in samples. In our binomial example, we will
not discard any of the samples and will set out to obtain 20, 000 samples from
the posterior of θ. To specify the number of recorded samples, type “1” in the
field labelled beg (i.e., WinBUGS will start recording from the first sample) and
type “20000” in the field labelled end in the Sample Monitor Tool window,
shown in Figure 2.3).
t
Fig. 2.3
Sample Monitor Tool.
11. Set “live” trace plots of the unobserved parameters of interest. WinBUGS allows
you to monitor the sampling run in real-time. This can be useful on long sampling
runs, for debugging, and for diagnosing whether the chains have converged. To
set a “live” trace plot of θ, click on trace in the Sample Monitor Tool window,
shown in Figure 2.3, and wait for an empty plot to appear on the screen. Once
WinBUGS starts to sample from the posterior, the trace plot of θ will appear
live on the screen.
12. Specify the total number of samples that you want to draw from the posterior.
This is done by selecting the Update option from the Model menu. Once the
Update Tool window (see 2.4) is opened, type “20000” in the field labelled
updates. Typically, the number you enter in the Update Tool window will correspond to the number you entered in the end field of the Sample Monitor
Tool.
13. Specify how many samples should be drawn between the recorded samples.
Getting Started with WinBUGS
20
You can, for example, specify that only every second drawn sample should be
recorded. This ability to “thin” a chain is is important when successive samples
are not independent but autocorrelated. In our binomial example, we will record
every sample that is drawn from the posterior of θ. To specify this, type “1” in
the field labelled thin in the Update Tool window, shown in Figure 2.4.
14. Specify the number of samples after which WinBUGS should refresh its display.
To this end, type “100” in the field labelled refresh in the Update Tool window,
shown in Figure 2.4.
15. Sample from the posterior. To sample from the posterior of θ, click on update in
the Update Tool window, shown in Figure 2.4). During sampling, the message
“model updating” will appear in the status bar. Once the sampling is finished,
the message “update took x secs” will appear in the status bar.
t
Fig. 2.4
Update Tool.
16. Specify the output format. WinBUGS can produce two types of output; it can
open a new window for each new piece of output or it can paste all output
into a single log file. To specify the output format for our binomial example,
options window.
17. Obtain summary statistics of the posterior distribution. To request summary
statistics based on the sampled values of θ, select the Samples option in the
Inference menu, and click on stats in the Sample Monitor Tool window,
shown in Figure 2.3. WinBUGS will paste a table reporting various summary
statistics for θ in the log file.
18. Plot the posterior distribution. To plot the posterior distribution of θ, click on
density in the Sample Monitor Tool window, shown in Figure 2.3. WinBUGS
will paste the posterior distribution of θ in the log file.
Figure 2.5 shows the log file that contains the results for our binomial example.
The first five lines of the log file document the steps taken to specify and initialize
the model. The first output item is the Dynamic trace plot that allows the θ
variable to be monitored during sampling and is useful for diagnosing whether the
chains have reached convergence. In this case, we can be reasonably confident that
convergence has been achieved because the two chains, shown in different colors,
are overlapping one another. The second output item is the Node statistics table
that presents the summary statistics for θ. Among others, the table shows the mean,
Using the Applications
21
the standard deviation, and the median of the sampled values of θ. The last output
item is the Kernel density plot that shows the posterior distribution of θ.
t
Fig. 2.5
Example of an output log file.
How did WinBUGS produce the results in Figure 2.5? The model specification
file implemented the graphical model from Figure 2.1, saying that there is a rate
θ with a uniform prior, that generates k successes out of n observations. The data
file supplied the observed data, setting k = 5 and n = 10. WinBUGS then sampled
from the posterior of the unobserved variable θ. ‘Sampling’ means drawing a set
of values, so that the relative probability that any particular value will be sampled
is proportional to the density of the posterior distribution at that value. For this
example, the posterior samples for θ are a sequence of numbers like 0.5006, 0.7678,
0.3283, 0.3775, 0.4126, . . .. A histogram of these values is an approximation to the
posterior distribution of θ.
In one sense, it would be nice to understand exactly how WinBUGS managed to
generate the posterior samples. In another sense, if you are interested in building
and analyzing models and data, you do not necessarily need to understand the
computational basis of posterior sampling (any more than you need to know how
SPSS calculates a t-test statistic). If you understand the conceptual basis that
Getting Started with WinBUGS
22
underlies the generation of the posterior samples, you can happily build models and
Rejection Sampling, Markov-Chain Monte-Carlo, and all the rest.2
Error Messages
If the syntax of your model file is incorrect or the data and starting values are
incompatible with your model specification, WinBUGS will balk and produce an
error message. Error messages can provide useful information when it comes to
debugging your WinBUGS code.3 The error messages are displayed in the bottom
left corner of the status bar, in very small letters.
With respect to errors in the model specification, suppose, for example, that you
mistakenly use the “assign” operator (<-) to specify the distribution of the prior
on the rate parameter θ and the distribution of the observed data k:
model {
#Prior on Rate
theta <- dbeta(1,1)
#Observed Counts
k <- dbin(theta,n)
}
As WinBUGS requires you to use the tilde symbol “∼” to denote the distributions of
the prior and the data, it will produce the following error message: “unknown type
of logical function”, as shown in Figure 2.6. As another example, suppose that
you mistype the distribution of the observed counts k, and you mistakenly specify
the distribution of k as follows:
k ~ dbon(theta,n)
WinBUGS will not recognize dbon as an existing probability distribution, and will
produce the following error message: “unknown type of probability density”,
as shown in Figure 2.7.
With respect to errors in the data file, suppose that your data file contains the
following data: k = -5 and n = 10. Note, however, that k is the number of successes
in the 10 trials and it is specified to be binomially distributed. WinBUGS therefore
2
3
Some people find the idea that WinBUGS looks after inference, and there is no need to understand the computational sampling routines in detail, to be a relief. Others find it deeply
disturbing. For the disturbed, there are many Bayesian texts that give detailed accounts of
Bayesian inference using computational sampling. Start with the summary for cognitive scientists presented by Griffiths, Kemp, and Tenenbaum (2008). Continue with the relevant chapters
in the excellent book by MacKay (2003), which is freely available on the Web, and follow the
more technical references from there.
Although nobody ever accused WinBUGS of being user friendly in this regard. The error trap
messaging, in particular, seems to have been written by the same people who did the Dead Sea
Scrolls.
23
Using the Applications
t
Fig. 2.6
WinBUGS error message as a result of incorrect logical operators.
t
WinBUGS error message as a result of a misspecified probability density.
Fig. 2.7
expects the value of k to lie between 0 and n and it will produce the following error message: “value of binomial k must be between zero and order of k”,
as shown in Figure 2.8.
Finally, with respect to erroneous starting values, suppose that you chose 1.5
as the starting value of θ for the second chain. Because θ is the probability of
getting 5 successes in 10 trials, WinBUGS expects the starting value for θ to
Getting Started with WinBUGS
24
t
Fig. 2.8
WinBUGS error message as a result of incorrect data.
lie between 0 and 1. Therefore, specifying a value such as 1.5 produces the following error message: “value of proportion of binomial k must be between
zero and one”, as shown in Figure 2.9.
2.2.3 Using Matbugs
We will use the matbugs function to call the WinBUGS software from within Matlab, and to return the results of the WinBUGS sampling to a Matlab variable for
further analysis. The code we are using to do this follows:
% Set the working directory
cd D:\WinBUGS_Book\Matlab_codes
% Data
k=5;n=10;
% WinBUGS Parameters
nchains=2; % How Many Chains?
nburnin=0; % How Many Burn-in Samples?
nsamples=2e4; %How Many Recorded Samples?
% Assign Matlab Variables to the Observed WinBUGS Nodes
datastruct = struct(’k’,k,’n’,n);
% Initialize Unobserved Variables
start.theta= [0.1 0.9];
25
t
Fig. 2.9
Using the Applications
WinBUGS error message as a result of an incorrect starting value.
for i=1:nchains
S.theta = start.theta(i); % An Intial Value for the Success Rate
init0(i) = S;
end
% Use WinBUGS to Sample
[samples, stats] = matbugs(datastruct, ...
fullfile(pwd, ’Rate_1.txt’), ...
’init’, init0, ...
’nChains’, nchains, ...
’view’, 1, ’nburnin’, nburnin, ’nsamples’, nsamples, ...
’thin’, 1, ’DICstatus’, 0, ’refreshrate’,100, ...
’monitorParams’, {’theta’}, ...
’Bugdir’, ’C:/Program Files/WinBUGS14’);
Some of the options in the Matbugs function control software input and output.
• datastruct contains the data that you want to pass from Matlab to WinBUGS.
• fullfile gives the name of the text file that contains the WinBUGS scripting
of your graphical model (i.e., the model specification file).
• view controls the termination of WinBUGS. If view is set to 0, WinBUGS is
closed automatically at the end of the sampling. If view is set to 1, WinBUGS
remains open and it pastes the results of the sampling run in a log output
file. To be able to inspect the results in WinBUGS, maximize the log output
Getting Started with WinBUGS
26
file and scroll up to the top of the page. Note that if you subsequently want
WinBUGS to return the results to Matlab, you first have to close WinBUGS.
• refreshrate gives the number of samples after which WinBUGS should refresh
its display.
• monitorParams gives the list of variables that will be monitored and returned to
Matlab in the samples variable.
• Bugdir gives the location of the WinBUGS software.
Other options define the values for the computational sampling parameters.
•
•
•
•
init gives the starting values for the unobserved variables.
nChains gives the number of chains.
nburnin gives the number of ‘burn-in’ samples.
nsamples gives the number of recorded samples that will be drawn from the
posterior.
• thin gives the number of drawn samples between those that are recorded.
• DICstatus gives an option to calculate the Divergence Information Criterion
(DIC) statistic. The DIC statistic is intended to be used for model selection,
but is not universally accepted theoretically among Bayesian statisticians. If
DICstatus is set to 0, the DIC statistic will not be calculated. If it is set to 1,
WinBUGS will calculate the DIC statistic.
How did the WinBUGS script and Matlab work together to produce the posterior
samples of θ? The WinBUGS model specification script defined the graphical model
from Figure 2.1. The Matlab code supplied the observed data and the starting values
for θ, and called WinBUGS. WinBUGS then sampled from the posterior of θ and
returned the sampled values in the Matlab variable samples.theta. You can plot
the histogram of these sampled values using Matlab, in the way demonstrated in
the script Rate 1.m. It should look something like the jagged line in Figure 2.10.
Because the probability of any value appearing in the sequence of posterior samples
is decided by its relative posterior probability, the histogram is an approximation
to the posterior distribution of θ.
Besides the sequence of posterior samples, WinBUGS also returns some useful
summary statistics to Matlab. The variable stats.mean gives the mean of the
posterior samples for each unobserved variable, which approximates its posterior
expectation. This can often (but not always, as later exercises explore) be a useful
point-estimate summary of all the information in the full posterior distribution.
Similarly, stats.std gives the standard deviation of the posterior samples for each
unobserved variable.
Finally, WinBUGS also returns the so-called R̂ statistic in the stats.Rhat variable. This is a statistic about the sampling procedure itself, not about the posterior
distribution. The R̂ statistic is proposed by Brooks and Gelman (1997) and it gives
information about convergence. The basic idea is to run two or more chains and
measure the ratio of within–to between–chain variance. If this ratio is close to 1, the
Using the Applications
27
3
Posterior Density
2.5
2
1.5
1
0.5
0
0
0.2
0.4
0.6
0.8
1
Rate
t
Fig. 2.10
Posterior distribution of rate θ for k = 5 successes out of n = 10 trials, based on
20,000 posterior samples.
independent sampling sequences are probably giving the same answer, and there is
reason to trust the results.
2.2.4 Using R2WinBUGS
We will use the bugs() function in the R2WinBUGS package to call the WinBUGS
software from within R, and to return the results of the WinBUGS sampling to a
R variable for further analysis. The code we are using to do this follows.
setwd("D:/WinBUGS_Book/R_codes") #Set the working directory
bugsdir = "C:/Program Files/WinBUGS14"
k = 5
n = 10
data
= list("k", "n")
myinits = list(
list(theta = 0.1),
list(theta = 0.9))
parameters = c("theta")
samples = bugs(data, inits=myinits, parameters,
model.file ="Rate_1.txt",
n.chains=2, n.iter=20000, n.burnin=0, n.thin=1,
DIC=F, bugs.directory=bugsdir,
codaPkg=F, debug=T)
Some of these options control software input and output.
• data contains the data that you want to pass from R to WinBUGS.
28
Getting Started with WinBUGS
• parameters gives the list of variables that will be monitored and returned to R
in the samples variable.
• model.file gives the name of the text file that contains the WinBUGS scripting
of your graphical model (i.e., the model specification file). Avoid using nonalphanumeric characters (e.g., “&” and “*”) in the directory and file names.
Also, make sure that the name of the directory that contains the model file is
not too long, otherwise WinBUGS will generate the following error message :
“incompatible copy”. If WinBUGS fails to locate a correctly specified model
file, try to include the entire path in the model.file argument.
• bugs.directory gives the location of the WinBUGS software.
• codaPkg controls the content of the variable that is returned from WinBUGS.
If codaPkg is set to FALSE, WinBUGS returns a variable that contains the
results of the sampling run. If codaPkg is set to TRUE, WinBUGS returns a
variable that contains the file names of the WinBUGS outputs and the corresponding paths. You can access these output files by means of the R function
• debug controls the termination of WinBUGS. If debug is set to FALSE, WinBUGS is closed automatically at the end of the sampling. If debug is set to
TRUE, WinBUGS remains open and it pastes the results of the sampling run
in a log output file. To be able to inspect the results in WinBUGS, maximize
the log output file and scroll up to the top of the page. Note that if you subsequently want WinBUGS to return the results in the R samples variable, you
first have to close WinBUGS. In general, you will not be able to use R again
until after you terminate WinBUGS.
The other options define the values for the computational sampling parameters.
• inits assigns starting values to the unobserved variables. If you want WinBUGS
to choose these starting values for you, replace inits=myinits in the call to
bugs with inits=NULL.
• n.chains gives the number of chains.
• n.iter gives the number of recorded samples that will be drawn from the posterior.
• n.burnin gives the number of ‘burn-in’ samples.
• n.thin gives the number of drawn samples between those that are recorded.
• DIC gives an option to calculate the DIC statistic. If DIC is set to FALSE, the DIC
statistic will not be calculated. If it is set to TRUE, WinBUGS will calculate
the DIC statistic.
WinBUGS returns the sampled values of θ in the R variable samples. You can
access these values by typing samples\$sims.array. You can also plot the histogram of these sampled values using R, in the way demonstrated in the script
Rate 1.R). Besides the sequence of posterior samples, WinBUGS also returns some
29
useful statistics to R. You can access the summary statistics of the posterior samples, as well as the R̂ statistic mentioned in the previous section by typing samples.
• The BUGS Project webpage http://www.mrc-bsu.cam.ac.uk/bugs/weblinks/
webresource.shtml provides useful links to various articles, tutorial materials,
and lecture notes about Bayesian modeling and the WinBUGS software.
• The BUGS discussion list https://www.jiscmail.ac.uk/cgi-bin/webadmin?
A0=bugs is an online forum where WinBUGS users can exchange tips, ask
questions, and share worked examples.
2.3.2 For Mac users
You can run WinBUGS on Macs using emulators, such as Darwine. As best we
know, you need a Dual Core Intel based Mac and the latest stable version of Darwine
to be able to use R2WinBUGS.
• The Darwine emulator is available at www.kronenberg.org/darwine/.
• The R2WinBUGS reference manual on the R-project webpage cran.r-project.
org/web/packages/R2WinBUGS/index.html provides instructions on how to
run R2winBUGS on Macs.
• Further information for running R2WinBUGS on Macs is available at
ggorjan.blogspot.com/2008/10/runnning-r2winbugs-on-mac.html and
idiom.ucsd.edu/~rlevy/winbugsonmacosx.pdf.
• Further information for running WinBUGS on Macs using a Matlab or R interface
is available at web.mit.edu/yarden/www/bayes.html and www.ruudwetzels.
com/macbugs.
2.3.3 For Linux users
You can run WinBUGS under Linux using emulators, such as Wine and CrossOver.
• The BUGS Project webpage provides useful links to various examples on how to
run WinBUGS under Linux www.mrc-bsu.cam.ac.uk/bugs/faqs/contents.
shtml and how to run WinBUGS using a Matlab interface www.mrc-bsu.cam.
ac.uk/bugs/winbugs/remote14.shtml.
• The R2WinBUGS reference manual on the R-project webpage cran.r-project.
org/web/packages/R2WinBUGS/index.html provides instructions on how to
run R2winBUGS under Linux.
PART II
PARAMETER ESTIMATION
3
Inferences With Binomial
Distributions
3.1 Inferring a Rate
Our first problem completes the introductory example from the “Getting Started
with WinBUGS” chapter, and involves inferring the underlying success rate for a
binary process. The graphical model is shown again in Figure 3.1. Recall that shaded
nodes indicate known values, while unshaded nodes represent unknown values, and
that circular nodes correspond to continuous values, while square nodes correspond
to discrete values.
The goal of inference in the graphical model is to determine the posterior distribution of the rate θ having observed k successes from n trials. The analysis starts
with the prior assumption that all possible rates between 0 and 1 are equally likely.
This corresponds to the uniform prior distribution θ ∼ Uniform 0, 1 which can
equivalently be written in terms of a Beta distribution as θ ∼ Beta 1, 1 .
θ
k
θ ∼ Beta(1, 1)
k ∼ Binomial(θ, n)
n
t
Fig. 3.1
Graphical model for inferring the rate of a binary process.
The script Rate 1.txt implements the graphical model in WinBUGS.
# Inferring a Rate
model {
# Observed Counts
k ~ dbin(theta,n)
# Prior on Rate Theta
theta ~ dbeta(1,1)
}
The code Rate 1.m for Matlab or Rate 1.R for R sets k = 5 and n = 10 and
calls WinBUGS to sample from the graphical model. WinBUGS then returns to
Matlab or R the posterior samples from θ1 . The Matlab or R code also plots the
33
34
Inferences With Binomials
Box 3.1
Beta distributions as conjugate priors
One of the nice properties of using the θ ∼ Beta α, β prior distribution for
a rate θ, is that it has a natural interpretation. The α and β values can be
thought of as counts of “prior successes”
and “prior failures”, respectively.
This means, using a θ ∼ Beta 3, 1 prior corresponds to having the prior
information that 4 previous observations have been made, and 3 of them were
successes. Or, more elaborately, starting
with a θ ∼ Beta 3, 1 is the same
as starting with a θ ∼ Beta 1, 1 , and then seeing data giving two more
successes (i.e., the posterior distribution in the second scenario will be same
as the prior distribution in the first). As always in Bayesian analysis, inference
starts with prior information, and updates that information—by changing the
probability distribution representing the uncertain information—as more information becomes available. When a type of likelihood function (in this case,
the Binomial) does not change the type of distribution (in this case, the Beta)
going from the posterior to the prior, they are said to have a “conjugate” relationship. This is valued a lot in analytic approaches to Bayesian inference,
because it makes for tractable calculations. It is not so important for that
reason in computational approaches, as emphasized in this book, because
sampling methods can handle easily much more general relationships between
parameter distributions and likelihood functions. But conjugacy is still useful in computational approaches because of the natural semantics it gives in
setting prior distributions.
posterior distribution of the rate θ. A histogram of the samples looks something
like the jagged line in Figure 3.2.1
Exercises
Exercise 3.1.1 Alter the data to k = 50 and n = 100, and compare the posterior
for the rate θ to the original with k = 5 and n = 10.
Exercise 3.1.2 For both the k = 50, n = 100 and k = 5, n = 10 cases just
considered, re-run the analyses with many more samples (e.g., ten times as
many) by changing the nsamples variable in Matlab, or the niter variable in
R. This will take some time, but there is an important point to understand.
What controls the width of the posterior distribution (i.e., the expression of
uncertainty in the rate parameter θ)? What controls the quality of the estimate
of the posterior (i.e., the smoothness of the histograms in the figures)?
1
At least, this is what Matlab produces. The density smoothing used by default in R leads to a
different visual appearance.
Difference Between Two Rates
35
3
Posterior Density
2.5
2
1.5
1
0.5
0
0
0.2
0.4
0.6
0.8
1
Rate
t
Fig. 3.2
Posterior distribution of rate θ for k = 5 successes out of n = 10 trials.
Exercise 3.1.3 Alter the data to k = 99 and n = 100, and comment on the shape
of the posterior for the rate θ.
Exercise 3.1.4 Alter the data to k = 0 and n = 1, and comment on what this
3.2 Difference Between Two Rates
Now suppose that now we have two different processes, producing k1 and k2 successes out of n1 and n2 trials, respectively. First, we will make the assumption the
underlying rates are different, so they correspond to different latent variables θ1
and θ2 . Our interest is in the values of these rates, as estimated from the data, and
in the difference δ = θ1 − θ2 between the rates.
The graphical model representation for this problem is shown in Figure 3.3. The
new notation is that the deterministic variable δ is shown by a double-bordered
node. A deterministic variable is one that is defined in terms of other variables,
and inherits its distribution from them. Computationally, deterministic nodes are
unnecessary—all inference could be done with the variables that define them—but
they are often conceptually very useful to include to communicate the meaning of
a model.
The script Rate 2.txt implements the graphical model in WinBUGS.
# Difference Between Two Rates
model {
# Observed Counts
k1 ~ dbin(theta1,n1)
k2 ~ dbin(theta2,n2)
# Prior on Rates
theta1 ~ dbeta(1,1)
Inferences With Binomials
36
δ
k1 ∼ Binomial(θ1 , n1 )
θ1
θ2
k2 ∼ Binomial(θ2 , n2 )
θ1 ∼ Beta(1, 1)
k1
θ2 ∼ Beta(1, 1)
k2
δ ← θ1 − θ2
n1
t
Fig. 3.3
n2
Graphical model for inferring the difference in the rates of two binary process.
theta2 ~ dbeta(1,1)
# Difference Between Rates
delta <- theta1-theta2
}
The code Rate 2.m or Rate 2.R sets k1 = 5, k2 = 7, n1 = n2 = 10, and then
calls WinBUGS to sample from the graphical model. WinBUGS returns to Matlab
or R the posterior samples from θ1 , θ2 and δ. If the main research question is
how different the rates are, then δ is the most relevant variable, and its posterior
distribution is shown in Figure 3.4.
Posterior Density
2
1
−1
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
0.8
Difference in Rates
t
Fig. 3.4
Posterior distribution of the difference between two rates δ = θ1 − θ2 .
1
37
Inferring a Common Rate
There are many ways the full information in the posterior distribution of δ might
usefully be summarized. The Matlab or R code produces a set of these from the
posterior samples, including
• The mean value, which approximates the expectation of the posterior. This is the
point-estimate corresponding to quadratic loss. That is, it tries to pick a single
value close to the truth, with bigger deviations from the truth being punished
more heavily.
• The value with maximum density in the posterior samples, approximating the
posterior mode. This is known as the maximum a posteriori (MAP) estimate,
and is the same as the maximum likelihood estimate (MLE) for ‘flat’ priors.
This point-estimate corresponds to 0-1 loss, which aims to pick the single most
likely value. Estimating the mode requires evaluating the likelihood function
at each posterior sample, and so requires a bit more post-processing work in
Matlab or R.
• The median value, which is the point-estimate corresponding to linear loss.
• The 95% credible interval. This gives the upper and lower values between which
95% of samples fall. Thus, it approximates the bounds on the posterior distribution that contain 95% of the posterior density. The Matlab or R code can
be modified to produce credible intervals for criteria other than 95%.
For the current problem, the mean of δ estimated from the returned samples is
approximately -0.17, the mode is approximately -0.20, the median is approximately
-0.17, and the 95% credible interval is approximately [−0.52, 0.21].
Exercises
Exercise 3.2.1 Compare the data sets k1 = 8, n1 = 10, k2 = 7, n2 = 10 and
k1 = 80, n1 = 100, k2 = 70, n2 = 100.
Exercise 3.2.2 Try the data k1 = 0, n1 = 1, and k2 = 0, n2 = 5.
Exercise 3.2.3 In what context might different possible summaries of the posterior
distribution of δ (i.e., point estimates, or credible intervals) be reasonable, and
when might it be important to show the full posterior distribution?
3.3 Inferring a Common Rate
We continue to consider two binary processes, producing k1 and k2 successes out of
n1 and n2 trials, respectively, but now assume the underlying rate for both is the
same. This means there is just one rate, θ.
The graphical model representation for this problem is shown in Figure 3.5.
An equivalent graphical model, using plate notation, is shown in Figure 3.6.
Plates are bounding rectangles that enclose independent replications of a graphical
structure within a whole model. In this case, the plate encloses the two observed
Inferences With Binomials
38
θ
k1 ∼ Binomial(θ, n1 )
k1
k2
k2 ∼ Binomial(θ, n2 )
θ ∼ Beta(1, 1)
n1
t
Fig. 3.5
n2
Graphical model for inferring the common rate underlying two binary processes.
counts and numbers of trials. Because there is only one latent rate θ (i.e., the
same probability drives both binary processes) it is not iterated inside the plate.
One way to think of plates, which some people find helpful, is as “for loops” from
programming languages (including WinBUGS itself).
θ
ki ∼ Binomial(θ, ni )
ki
θ ∼ Beta(1, 1)
ni
i
t
Fig. 3.6
Graphical model for inferring the common rate underlying two binary processes, using
plate notation.
The script Rate 3.txt implements the graphical model in WinBUGS.
# Inferring a Common Rate
model{
# Observed Counts
k1 ~ dbin(theta,n1)
k2 ~ dbin(theta,n2)
# Prior on Single Rate Theta
theta ~ dbeta(1,1)
}
The code Rate 3.m or Rate 3.R sets k1 , k2 , n1 and n2 , and then call WinBUGS
to sample from the graphical model. Note that the R code sets debug=T, and so will
wait to terminate and return the sampling information. The code also produces a
plot of the posterior distribution for the common rate, as shown in Figure 3.7.
Prior and Posterior Prediction
39
4
Posterior Density
3
2
1
0
0.2
0.4
0.6
0.8
1
Rate
t
Fig. 3.7
Posterior distribution of common rate θ.
Exercises
Exercise 3.3.1 Try the data k1 = 14, n1 = 20, k2 = 16, n2 = 20. How could you
report the inference about the common rate θ?
Exercise 3.3.2 Try the data k1 = 0, n1 = 10, k2 = 10, n2 = 10. What does this
analysis infer the common rate θ to be? Do you believe the inference?
Exercise 3.3.3 Compare the data sets k1 = 7, n1 = 10, k2 = 3, n2 = 10 and
k1 = 5, n1 = 10, k2 = 5, n2 = 10. Make sure, following on from the previous
question, that you understand why the comparison works the way it does.
3.4 Prior and Posterior Prediction
One conceptual way to think about Bayesian analysis is that Bayes Rule provides a
bridge between the unobserved parameters of models and the observed measurement
of data. The most useful part of this bridge is that data allows us to update the
uncertainty (represented by probability distributions) about parameters. But the
bridge can handle two way traffic, and so there is a richer set of possibilities for
relating parameters to data. There are really four distributions available, and they
are all important and useful.
• First, the prior distribution over parameters captures our initial assumptions or
state of knowledge about the psychological variables they represent.
• Secondly, the prior predictive distribution tells us what data to expect, given our
model and our current state of knowledge. The prior predictive is a distribution
40
Inferences With Binomials
over data, and gives the relative probability of different observable outcomes
before any data have been seen.
• Thirdly, the posterior distribution over parameters captures what we know about
the psychological variables having updated the prior information with the evidence provided by data.
• Finally, the posterior predictive distribution tells us what data expect, given the
same model we started with, but with a current state of knowledge that has
been updated by the observed data. Again, the posterior predictive is a distribution over data, and gives the relative probability of different observable
outcomes after data have been seen.
As an example to illustrate these distributions, we return to the simple problem
of inferring a single underlying rate. Figure 3.8 presents the graphical model, and
is the same as Figure 3.1.
θ
k
θ ∼ Beta(1, 1)
k ∼ Binomial(θ, n)
n
t
Fig. 3.8
Graphical model for inferring the rate of a binary process.
The script Rate 4.txt implements the graphical model in WinBUGS, and provides sampling for not just the posterior, but also for the prior, prior predictive and
posterior predictive.
# Prior and Posterior Prediction
model{
# Observed Data
k ~ dbin(theta,n)
# Prior on Rate Theta
theta ~ dbeta(1,1)
# Posterior Predictive
postpredk ~ dbin(theta,n)
# Prior Predictive
thetaprior ~ dbeta(1,1)
priorpredk ~ dbin(thetaprior,n)
}
To allow sampling from the prior, we use a dummy variable thetaprior that is
identical to the one we actually do inference on, but is itself independent of the
data, and so is never updated. Prior predictive sampling is achieved by the variable priorpredk that samples data using the same Binomial, but relying on the
Prior and Posterior Prediction
41
prior rate. Posterior predictive sampling is achieved by the variable postpredk that
samples predicted data using the same Binomial as the actual observed data.
The code Rate 4.m or Rate 4.R sets observed data with k = 1 successes out of
n = 10 observations, and then calls WinBUGS to sample from the graphical model.
The code also draws the four distributions, two in the parameter space (the prior
and posterior for θ), and two in the data space (the prior predictive and posterior
predictive for k). It should look something like Figure 3.9.
8
Prior
Posterior
Density
6
4
2
0
0
0.2
0.4
0.6
0.8
1
Theta
0.4
Prior
Posterior
Mass
0.3
0.2
0.1
0
0
1
2
3
4
5
6
7
8
9
10 11 12 13 14 15
Success Count
t
Fig. 3.9
Prior and posterior for the success rate (top panel), and prior and posterior predictive
for counts of the number of successes (bottom panel), based on data giving k = 1
successes out of n = 15 trials.
Exercises
Exercise 3.4.1 Make sure you understand the prior, posterior, prior predictive
and posterior predictive distributions, and how they relate to each other (e.g.,
why is the top panel of Figure 3.9 a line plot, while the bottom panel is a
bar graph?). Understanding these ideas is a key to understanding Bayesian
analysis. Check your understanding by trying other data sets, varying both k
and n.
Exercise 3.4.2 Try different priors on θ, by changing θ ∼ Beta 1, 1 to θ ∼
Beta 10, 10 , θ ∼ Beta 1, 5 , and θ ∼ Beta 0.1, 0.1 . Use the figures produced
to understand the assumptions these priors capture, and how they interact
with the same data to produce posterior inferences and predictions.
Exercise 3.4.3 Predictive distributions are not restricted to exactly the same experiment as the observed data, but for any experiment where the inferred
Inferences With Binomials
42
model parameters make predictions. In the current simple Binomial setting,
for example, predictive distributions could be found by a new experiment with
n0 6= n observations. Change the graphical model, and Matlab or R code, to
implement this more general case.
Exercise 3.4.4 In October 2009, the Dutch newspaper “Trouw” reported on research conducted by H. Trompetter, a student from the Radboud University
in the city of Nijmegen. For her undergraduate thesis, Hester had interviewed
121 older adults living in nursing homes. Out of these 121 older adults, 24
(about 20%) indicated that they had at some point been bullied by their fellow residents. Trompetter confidently rejected the suggestion that her study
may have been too small to draw reliable conclusions: “If I had talked to more
people, the result would have changed by one or two percent at the most.”
Is Trompetter correct? Use the code Rate 4.m or Rate 4.R, by changing the
dataset variable, to find the prior and posterior predictive for the relevant
rate parameter and bullying counts. Based on these distributions, do you agree
with Trompetter’s claims?
3.5 Posterior Prediction
One important use of posterior predictive distributions is to examine the descriptive
adequacy of a model. It can be viewed as a set of predictions about what data
the most expects to see, based on the posterior distribution over parameters. If
these predictions do not match the data already seen, the model is descriptively
problem of inferring a common rate underlying two binary processes. Figure 3.10
presents the graphical model, and is the same as Figure 3.5.
θ
k1 ∼ Binomial(θ, n1 )
k1
k2
k2 ∼ Binomial(θ, n2 )
θ ∼ Beta(1, 1)
n1
t
Fig. 3.10
n2
Graphical model for inferring the common rate underlying two binary processes.
The script Rate 5.txt implements the graphical model in WinBUGS, and provides sampling for the posterior predictive distribution.
# Inferring a Common Rate, With Posterior Predictive
Posterior Prediction
43
model{
# Observed Counts
k1 ~ dbin(theta,n1)
k2 ~ dbin(theta,n2)
# Prior on Single Rate Theta
theta ~ dbeta(1,1)
# Posterior Predictive
postpredk1 ~ dbin(theta,n1)
postpredk2 ~ dbin(theta,n2)
}
The code Rate 5.m or Rate 5.R sets observed data with k1 = 0 successes out
of n1 = 10 observations, and k2 = 10 successes out of n2 = 10 observations.
The code draws the posterior distribution for the rate and the posterior predictive
distribution, as shown in Figure 3.11.
4
10
9
8
Success Count 2
Density
3
2
1
7
6
5
4
3
2
1
0
0
0.2
0.4
0.6
0.8
1
Theta
t
Fig. 3.11
0
1
2
3
4
5
6
7
8
9 10
Success Count 1
The posterior distribution of the common rate θ for two binary processes (left panel),
and the posterior predictive distribution (right panel), based on 0 and 10 successes
out of 10 observations.
The left panel shows the posterior distribution over the common rate θ for two
binary processes, which gives density to values near 0.5. The right panel shows
the posterior predictive distribution of the model, with respect to the two success
counts. The size of each square is proportional to the predictive mass given to each
possible combination of success count observations. The actual data observed in
this example, with 0 and 10 successes for the two counts, are shown by the cross.
Exercises
Exercise 3.5.1 Why is the posterior distribution in the left panel inherently one
dimensional, but the posterior predictive distribution in the right panel inherently two-dimensional?
Exercise 3.5.2 What do you conclude about the descriptive adequacy of the
model, based on the relationship between the observed data and the posterior
predictive distribution?
Exercise 3.5.3 What can you conclude about the parameter θ?
Inferences With Binomials
44
3.6 Joint Distributions
So far, we have almost always assumed that the number of successes k and number
of total observations n is known, but that the underlying rate θ is unknown. This
has meant that our parameter space has been one-dimensional. Everything learned
from data is incorporated into a single probability distribution representing the
relative likelihood of different values for the rate θ.
For many problems in cognitive science (and more generally), however, there will
be more than one unknown variable of interest, and they will interact. A simple
case of this general property is a binomial process in which both the rate θ and the
total number n unknown, and so the problem is to infer both simultaneously from
counts of successes k.
To make the problem concrete, suppose there are five helpers distributing a bundle of surveys to houses. It is known that each bundle contained the same number
of surveys, n, but the number itself is not known. The only available relevant information is that the maximum bundle is Nmax = 500, and so n must be between
1 and Nmax .
In this problem, it is also not known what the rate of return for the surveys
is. But, it is assumed that each helper distributed to houses selected in a random
enough way that it is reasonable to believe the return rates are the same. It is also
assumed to be reasonable to set a prior on this common rate θ ∼ Beta 1, 1 .
Inferences can simultaneously be made about n and θ from the observed number
of surveys returned for each of the helpers. Assuming the surveys themselves are
able to be identified with their distributing helper when returned, the data will
take the form of m = 5 counts, one for each helper, giving the number of returned
surveys for each.
n
θ
ki ∼ Binomial(θ, n)
θ ∼ Beta(1, 1)
ki
1
1
n ∼ Categorical( nmax
, . . . , nmax
)
i helpers
t
Fig. 3.12
Graphical model for the joint inference of n and θ from a set of m observed counts of
successes k1 , . . . , km.
The graphical model for this problem is shown in Figure 3.12, and the script
Survey.txt implements the graphical model in WinBUGS. Note the use of the
Categorical distribution, which gives probabilities to a finite set of nominal outcomes.
# Inferring Return Rate and Numbers of Surveys from Observed Returns
model {
Joint Distributions
45
# Observed Returns
for (i in 1:m){
k[i] ~ dbin(theta,n)
}
# Priors on Rate Theta and Number n
theta ~ dbeta(1,1)
n ~ dcat(p[])
for (i in 1:nmax){
p[i] <- 1/nmax
}
}
The code Survey.m or Survey.R uses the data k = {16, 18, 22, 25, 27}, and then
calls WinBUGS to sample from the graphical model. Figure 3.13 shows the joint
posterior distribution over n and θ as a scatter-plot, and the marginal distributions
of each as histograms.
It is clear that the joint posterior distributions carries more information than the
marginal posterior distributions. This is very important. It means that just looking
at the marginal distributions will not give a complete account of the inferences
0.6
0.4
Rate of Return
0.8
0.2
0
0
t
Fig. 3.13
100
200
300
Number of Surveys
400
Joint posterior distribution (scatter-plot) of the probability of return θ and the
number of surveys m for observed counts k = {16, 18, 22, 25, 27}. The histograms
show the marginal densities. The red cross shows the expected value of the joint
posterior, and the green circle shows the mode (i.e., maximum likelihood), both
estimated from the posterior samples.
An intuitive graphical way to see that there is extra information in the joint
posterior is to see if it is well approximated by the product of the marginal distri-
Inferences With Binomials
46
butions. Imagine sampling a point from the histogram for n, and then sampling one
from the histogram for θ, and plotting the two-dimensional point corresponding to
these samples. Then imagine repeating this process many times. It should be clear
the resulting scatter-plot would be different from the joint posterior scatter-plot in
Figure 3.13. So, the joint distribution carries information not available from the
marginal distributions.
For this example, it is intuitively obvious why the joint posterior distribution has
the clear non-linear structure it does. One possible way in which 20 surveys might
be returned is if there were only about 50 surveys, but 40% were returned. Another
possibility is that there were 500 surveys, but only a 4% return rate. In general,
the number and return rate can trade-off against each other, sweeping out the joint
posterior distribution seen in Figure 3.13.
Exercises
Exercise 3.6.1 The basic moral of this example is that it is often worth thinking about joint posterior distributions over model parameters. In this case
the marginal posterior distributions are probably misleading. Potentially even
more misleading are common (and often perfectly appropriate) point estimates
of the joint distribution. The red cross in Figure 3.13 shows the expected value
of the joint posterior, as estimated from the samples. Notice that it does not
even lie in a region of the parameter space with any posterior mass. Does this
make sense?
Exercise 3.6.2 The green circle in Figure 3.13 shows an approximation to the
mode (i.e., the sample with maximum likelihood) from the joint posterior
samples. Does this make sense?
Exercise 3.6.3 Try the very slightly changed data k = {16, 18, 22, 25, 28}. How
does this change the joint posterior, the marginal posteriors, the expected
point, and the maximum likelihood point? If you were comfortable with the
mode, are you still comfortable?2
Exercise 3.6.4 If you look at the sequence of samples in WinBUGS, some autocorrelation is evident. The samples ‘sweep’ through high and low values
in a systematic way, showing the dependency of a sample on those immediately preceding. This is a deviation from the ideal situation in which posterior samples are independent draws from the joint posterior. Try thinning
the sampling, taking only every 100th sample, by setting nthin=100 in Matlab or n.thin=100 in R. To make the computational time reasonable, reduce
the number of samples to just 500. How is the sequence of samples visually
different with thinning?
2
This example is based heavily on one we read in a book, but we have lost the reference. If you
know which one, could you please let us know, so we can acknowledge it?
4
Inferences Involving Gaussian
Distributions
4.1 Inferring Means and Standard Deviations
One of the most common inference problems involves assuming data following a
Gaussian (also known as the ‘Normal’, ‘Central’, ‘Maxwellian’) distribution, and
inferring the mean and standard deviation of this distribution from a sample of
observed independent data.
The graphical model representation for this problem is shown in Figure 4.1.
The data are the n observations x1 , . . . , xn. The mean of the Gaussian is µ and
the standard deviation is σ. WinBUGS parameterizes the Gaussian distribution
in terms of the mean and precision, not the mean and variance or the mean and
standard deviation. These are all simply related, with the variance being σ 2 and
the precision being λ = 1/σ 2 .
The prior used for µ is intended to be only weakly informative . It is a Gaussian
centered on zero, but with very low precision (i.e., very large variance), and gives
prior probability to a wide range of possible means for the data. When the goal is
to estimate parameters, this sort of approach is relatively non-controversial.
Setting priors for standard deviations (or variances, or precisions) is trickier, and
certainly more controversial. If there is any relevant information that helps put the
data on scale, so that bounds can be set on reasonable possibilities for the standard
deviation, then setting a uniform over that range is advocated by Gelman (2006).
In this first example, we assume the data are all small enough that setting an upper
bound of 10 on the standard deviation covers all the possibilities.
µ
σ
µ ∼ Gaussian(0, 0.001)
σ ∼ Uniform(0, 10)
xi
xi ∼ Gaussian(µ, σ12 )
i data
t
Fig. 4.1
Graphical model for inferring the mean and standard deviation of data generated by a
Gaussian distribution.
The script Gaussian.txt implements the graphical model in WinBUGS. Note
47
48
Inferences With Gaussians
the conversion of the standard deviation sigma into the precision parameter lambda
used to sample from a Gaussian.
# Inferring the Mean and Standard Deviation of a Gaussian
model{
# Data Come From A Gaussian
for (i in 1:n){
x[i] ~ dnorm(mu,lambda)
}
# Priors
mu ~ dnorm(0,.001)
sigma ~ dunif(0,10)
lambda <- 1/pow(sigma,2)
}
The code Gaussian.m or Gaussian.R creates some artificial data, and applies the
graphical model to make inferences from data. The code does not produce a graph,
or any other output. But all of the information you need to analyze the results is
in the returned variable samples.
Exercises
Exercise 4.1.1 Try a few data sets, varying what you expect the mean and standard deviation to be, and how many data you observe.
Exercise 4.1.2 Plot the joint posterior of µ and σ. Interpret the plot.
Exercise 4.1.3 Suppose you knew the standard deviation of the Gaussian was 1.0,
but still wanted to infer the mean from data. This is a realistic question: For
example, knowing the standard deviation might amount to knowing the noise
associated with measuring some psychological trait using a test instrument.
The xi values could then be repeated measures for the same person, and their
mean the trait value you are trying to infer. Modify the WinBUGS script and
Matlab or R code to do this. What does the revised graphical model look like?
Exercise 4.1.4 Suppose you knew the mean of the Gaussian was zero, but wanted
to infer the standard deviation from data. This is also a realistic question:
Suppose you know the error associated with a measurement is unbiased, so
its average or mean is zero, but you are unsure how much noise there is in the
instrument. Inferring the standard deviation is then a sensible way to infer
the noisiness of the instrument. Once again, modify the WinBUGS script and
Matlab or R code to do this. Once again, what does the revised graphical
model look like?
4.2 The Seven Scientists
This problem is from MacKay (2003, p. 309) where it is (among other things)
treated to a Bayesian solution, but not quite using a graphical modeling approach,
nor relying on computational sampling methods.
The Seven Scientists
49
Seven scientists with wildly-differing experimental skills all make a measurement
of the same quantity. They get the answers x = {−27.020, 3.570, 8.191, 9.898, 9.603,
9.945, 10.056}. Intuitively, it seems clear that the first two scientists are pretty inept
measurers, and that the true value of the quantity is probably just a bit below 10.
The main problem is to find the posterior distribution over the measured quantity,
telling us what we can infer from the measurement. A secondary problem is to infer
something about the measurement skills of the seven scientists.
The graphical model for one (good) way of solving this problem is shown in
Figure 4.2. The assumption is that all the scientists have measurements that follow
a Gaussian distribution, but with different standard deviations. However, because
they are all measuring the same quantity, each Gaussian has the same mean, it is
just the standard deviation that differs.
σi
µ ∼ Gaussian(0, 0.001)
σi ∼ InvSqrtGamma(0.001, 0.001)
µ
xi
xi ∼ Gaussian(µ, σ12 )
i
i data
t
Fig. 4.2
Graphical model for the seven scientists problem.
Notice the different approach to setting priors about the standard deviations
used in this example. This approach has a theoretical basis in scale invariance
arguments (i.e., choosing to set a prior so that changing the measurement scale of
the data does not affect inference). The invariant prior turns is improper (i.e., the
area under the curve is unbounded), meaning it is not really a distribution, but the
limit of a sequence of distributions (see Jaynes, 2003). WinBUGS requires proper
distributions always be used, and so the InvSqrtGamma .001, .001 is intended as
a proper approximation to the theoretically-motivated improper prior. This raises
the issue of whether inference is sensitive to the essentially arbitrary value 0.001.
Gelman (2006) raises some other challenges to this approach. But, it is still worth
The script SevenScientists.txt code implements the graphical model in Figure 4.2 in WinBUGS.
# The Seven Scientists
model{
# Data Come From Gaussians With Common Mean But Different Precisions
for (i in 1:n){
x[i] ~ dnorm(mu,lambda[i])
}
# Priors
mu ~ dnorm(0,.001)
for (i in 1:n){
Inferences With Gaussians
50
lambda[i] ~ dgamma(.001,.001)
sigma[i] <- 1/sqrt(lambda[i])
}
}
Notice that the Inverse-SquareRoot-Gamma prior distribution is implemented
by first setting a prior for the precision, λ ∼ Gamma .001, .001 and then reparameterization to the standard deviation.
The code SevenScientists.m or SevenScientists.R applies the seven scientist
data to the graphical model.
Exercises
Exercise 4.2.1 Draw posterior samples using the Matlab or R code, and reach conclusions about the value of the measured quantity, and about the accuracies
of the seven scientists.
Exercise 4.2.2 Change the graphical model in Figure 4.2 to use a uniform prior
over the standard deviation, as was done in Figure 4.1. Experiment with the
effect the upper bound of this uniform prior has on inference.
4.3 Repeated Measurement of IQ
In this example, we consider how to estimate the IQ of a set of people, each of whom
have done multiple IQ tests. The data are the measures xij for the i = 1, . . . , n
people and their j = 1, . . . , m repeated test scores.
We assume that the differences in repeated test scores are Gaussian error with
zero mean, but some unknown precision. The mean of the Gaussian of a person’s
test scores corresponds to their IQ measure. This will be different for each person.
The standard deviation of the Gaussians corresponds to the accuracy of the testing
instruments in measuring the one underlying IQ value. We assume this is the same
for every person, since it is conceived as a property of the tests themselves.
The graphical model for this problem is shown in Figure 4.3. Because we know
quite a bit about the IQ scale, it makes sense to set priors for the mean and
standard deviation using this knowledge. Our first attempts to set priors (these are
re-visited in the exercises) simply assume the actual IQ values are equally likely to
be anywhere between 0 and 300, and standard deviations are anywhere between 0
and 100.
The script IQ.txt implements the graphical model in WinBUGS.
# Repeated Measures of IQ
model{
# Data Come From Gaussians With Different Means But Common Precision
for (i in 1:n){
for (j in 1:m){
x[i,j] ~ dnorm(mu[i],lambda)
Repeated Measurement of IQ
51
µi
µi ∼ Uniform(0, 300)
σ ∼ Uniform(0, 100)
xij
σ
xij ∼ Gaussian(µi , σ12 )
j tests
i people
t
Fig. 4.3
Graphical model for inferring the IQ from repeated measures.
}
}
# Priors
sigma ~ dunif(0,100)
lambda <- 1/pow(sigma,2)
for (i in 1:n){
mu[i] ~ dunif(0,300)
}
}
The code IQ.m or IQ.R creates a data set corresponding to there being three
people, with test scores of (90, 95, 100), (105, 110, 115), and (150, 155, 160), and
applies the graphical model.
Exercises
Exercise 4.3.1 Use the posterior distribution for each person’s µi to estimate their
IQ. What can we say about the precision of the IQ test?
Exercise 4.3.2 Now, use a more realistic prior assumption for the µi means. Theoretically, IQ distributions should have a mean of 100, and a standard deviation
of 15. This corresponds to having a prior of mu[i] ∼ dnorm(100,.0044), instead of mu[i] ∼ dunif(0,300), because 1/152 = 0.0044. Make this change
in the WinBUGS script, and re-run the inference. How do the estimates of IQ
given by the means change? Why?
Exercise 4.3.3 Repeat both of the above stages (i.e., using both priors on µi )
with a new, but closely related, data set that has scores of (94, 95, 96), (109,
110, 111), and (154, 155, 156). How do the different prior assumptions affect
IQ estimation for these data. Why does it not follow the same pattern as the
previous data?
5
Some Examples Of Data Analysis
5.1 Pearson Correlation
The Pearson-product moment correlation coefficient, usually denoted r, is a very
widely-used measure of the relationship between two variables. It ranges between
+1, indicating a perfect positive linear relationship, to 0, indicating no linear relationship, to −1 indicating a perfect negative relationship. Usually the correlation r
is reported as a single point estimate, perhaps together with a frequentist significance test.
But, rather than just having a single number to measure the correlation, it would
be nice to have a posterior distribution for r, saying how likely each possible level
of correlation was. There are frequentist confidence interval methods that try to
do this, as well as various analytic Bayesian results based on asymptotic approximations (e.g., Donner & Wells, 1986). An advantage of using a computational
approach is the flexibility in the assumptions that can be made. It is possible to
set up a graphical model that allows inferences about the correlation coefficient for
any data generating process and set of prior assumptions about the correlation.
µ
r
σ
µ1 , µ2 ∼ Gaussian(0, 0.001)
σ1 , σ2 ∼ InvSqrtGamma(0.001, 0.001)
r ∼ Uniform(−1, 1)
xi
i data
t
Fig. 5.1

xi ∼ MvGaussian (µ1 , µ2 ) , 
σ12
rσ1 σ2
rσ1 σ2
σ22
−1

Graphical model for inferring a correlation coefficient.
One graphical model for doing this is shown in Figure 5.1. The observed data
take the form xi = (xi1 , xi2) for the ith observation, and, following the theory
behind the correlation coefficient, are modeled as draws from a multivariate Gaussian distribution. The parameters of this distribution are the means and standard
deviations of the two dimensions, and the correlation coefficient that links them.
In Figure 5.1, the variances are given the approximations to non-informative
discussed earlier. The correlation coefficient itself is given a uniform prior over its
possible range. All of these choices would be easily modified, with one obvious
possible change being to give the prior for the correlation more density around 0.
52
Pearson Correlation
53
The script Correlation 1.txt implements the graphical model in WinBUGS.
# Pearson Correlation
model {
# Data
for (i in 1:n){
x[i,1:2] ~ dmnorm(mu[],TI[,])
}
# Priors
mu[1] ~ dnorm(0,.001)
mu[2] ~ dnorm(0,.001)
lambda[1] ~ dgamma(.001,.001)
lambda[2] ~ dgamma(.001,.001)
r ~ dunif(-1,1)
# Reparameterization
sigma[1] <- 1/sqrt(lambda[1])
sigma[2] <- 1/sqrt(lambda[2])
T[1,1] <- 1/lambda[1]
T[1,2] <- r*sigma[1]*sigma[2]
T[2,1] <- r*sigma[1]*sigma[2]
T[2,2] <- 1/lambda[2]
TI[1:2,1:2] <- inverse(T[1:2,1:2])
}
The code Correlation 1.m or Correlation 1.R includes two data sets. Both
involve fabricated data comparing response times (on the x-axis) with IQ measures (on the y-axis), looking for a correlation between simple measures of decisionmaking and general intelligence.
For the first data set in the Matlab and R code, the results shown in Figure 5.2
are produced. The left panel shows a scatter-plot of the raw data. The right panel
shows the posterior distribution of r, together with the standard frequentist pointestimate.
115
Posterior Density
110
IQ
105
100
95
90
85
0
0.5
1
Response Time (sec)
t
Fig. 5.2
1.5
−1
−0.5
0
0.5
Correlation
Data (left panel) and posterior distribution for correlation coefficient (right panel).
The broken line shows the frequentist point-estimate.
1
Data Analysis
54
Exercises
Exercise 5.1.1 The second data set in the Matlab and R code is just the first data
set from Figure 5.2 repeated twice. Set dataset=2 to consider these repeated
data, and interpret the differences in the posterior distributions for r.
Exercise 5.1.2 The current graphical model assumes that the values from the
two variables—the xi = (xi1 , xi2)—are observed with perfect accuracy. When
might this be a problematic assumption? How could the current approach be
extended to make more realistic assumptions?
5.2 Pearson Correlation With Uncertainty
We now tackle the problem asked by the last question in the previous section,
and consider the correlations when there is uncertainty about the exact values of
variables. While it might be plausible that response time could be measured very
accurately, the measurement of IQ seems likely to be less precise. This uncertainty
should be incorporated in an assessment of the correlation between the variables.
µ
r
µi ∼ Gaussian(0, 0.001)
σ
σi ∼ InvSqrtGamma(0.001, 0.001)
r ∼ Uniform(−1, 1)
yi
xi
σe

yi ∼ MvGaussian (µ1 , µ2 ) , 
σ12
rσ1 σ2
rσ1 σ2
σ22
xij ∼ Gaussian(yij , σje)
i data
t
Fig. 5.3
Graphical model for inferring a correlation coefficient, when there is uncertainty
inherent in the measurements.
A simple approach for including this uncertainty is adopted by the graphical
model in Figure 5.3. The observed data still take the form xi = (xi1 , xi2 ) for the
ith person’s response time and IQ measure. But these observations are now draws
from a Gaussian distribution, centered on the unobserved ‘true’ response time and
IQ of that person, denoted yi = (yi1 , yi2 ). These true values are then modeled as
the x were in the previous model in Figure 5.1, as draws from the Multivariate
Gaussian distribution corresponding the correlation.
The precision of the measurements is captured by the standard deviations σ e =
e
(σ1 , σ2e ) of the Gaussian draws for the observed data, xij ∼ Gaussian yij , σje . The
graphical model in Figure 5.3 assumes that the standard deviations are known.
−1

55
Pearson Correlation With Uncertainty
The script Correlation 2.txt implements the graphical model shown in WinBUGS.
# Pearson Correlation With Uncertainty in Measurement
model {
# Data
for (i in 1:n){
y[i,1:2] ~ dmnorm(mu[],TI[,])
for (j in 1:2){
x[i,j] ~ dnorm(y[i,j],lambdapoints[j])
}
}
# Priors
mu[1] ~ dnorm(0,.001)
mu[2] ~ dnorm(0,.001)
lambda[1] ~ dgamma(.001,.001)
lambda[2] ~ dgamma(.001,.001)
r ~ dunif(-1,1)
# Reparameterization
sigma[1] <- 1/sqrt(lambda[1])
sigma[2] <- 1/sqrt(lambda[2])
T[1,1] <- 1/lambda[1]
T[1,2] <- r*sigma[1]*sigma[2]
T[2,1] <- r*sigma[1]*sigma[2]
T[2,2] <- 1/lambda[2]
TI[1:2,1:2] <- inverse(T[1:2,1:2])
}
The code Correlation 2.m uses the same data set as in the previous section, but
has different data sets for different assumptions about the uncertainty in measurement. In the first analysis, these are set to the values σ1e = .03 for response times
(which seem likely to be measured accurately) and σ2e = 1 for IQ (which seems near
the smallest plausible value). The results of this assumption using the model are
shown in Figure 5.4. The left panel shows a scatterplot of the raw data, together
with error bars representing the uncertainty quantified by the observed standard
deviations. The right panel shows the posterior distribution of r, together with the
standard frequentist point estimate.
Exercises
Exercise 5.2.1 Compare the results obtained in Figure 5.4 with those obtained
earlier using the same data in Figure 5.2, for the model without any account
of uncertainty in measurement.
Exercise 5.2.2 Generate results for the second data set, which changes σ2e = 10
for the IQ measurement. Compare these results with those obtained assuming
σ2e = 1.
Exercise 5.2.3 The graphical model in Figure 5.3 assumes the uncertainty for each
variable is known. How could this assumption be relaxed to the case where
the uncertainty is unknown?
Exercise 5.2.4 The graphical model in Figure 5.3 assumes the uncertainty for
each variable is the same for all observations. How could this assumption
Data Analysis
56
115
Posterior Density
110
IQ
105
100
95
90
85
0
0.5
1
Response Time (sec)
t
Fig. 5.4
1.5
−1
−0.5
0
0.5
1
Correlation
Data (left panel), including error bars showing uncertainty in measurement, and
posterior distribution for correlation coefficient (right panel). The broken line shows
the frequentist point-estimate.
be relaxed to the case where, for examples, extreme IQs are less accurately
measured than IQs in the middle of the standard distribution?
5.3 The Kappa Coefficient of Agreement
An important statistical inference problem in a range of physical, biological, behavioral and social sciences is to decide how well one decision-making method agrees
with another. An interesting special case considers only binary decisions, and views
one of the decision-making methods as giving objectively true decisions to which
the other aspires. This problem occurs often in medicine, when cheap or easily administered methods for diagnosis are evaluated in terms of how well they agree with
a more expensive or complicated ‘gold standard’ method.
For this problem, when both decision-making methods make n independent assessments, the data D take the form of four counts: a observations where both
methods decide ‘one’, b observations where the objective method decides ‘one’ but
the surrogate method decides ‘zero’, c observations where the objective method decides ‘zero’ but the surrogate method decides ‘one’, and d observations where both
methods decide ‘zero’, with n = a + b + c + d.
A variety of orthodox statistical measures have been proposed for assessing agreement using these data (but see Basu, Banerjee, & Sen, 2000, for a Bayesian approach). Useful reviews are provided by Agresti (1992), Banerjee, Capozzoli, McSweeney, and Sinha (1999), Fleiss, Levin, and Paik (2003), Kraemer (1992), Kraemer, Periyakoil, and Noda (2004) and Shrout (1998). Of all the measures, however,
it is reasonable to argue that the conclusion of Uebersax (1987) that “the kappa
The Kappa Coefficient of Agreement
57
coefficient is generally regarded as the statistic of choice for measuring agreement”
(p. 140) remains true.
Cohen’s (1960) kappa statistic estimates the level of observed agreement
a+d
n
relative to the agreement that would be expected by chance alone (i.e., the overall
probability for the first method to decide ‘one’ times the overall probability for the
second method to decide ‘one’, and added to this the overall probability for the
second method to decide ‘zero’ times the overall probability for the first method to
decide ‘zero’)
(a + b) (a + c) + (b + d) (c + d)
,
pe =
n2
and is given by
po − pe
.
κ=
1 − pe
po =
Kappa lies on a scale of −1 to +1, with values below 0.4 often interpreted as
“poor” agreement beyond chance, values between 0.4 and 0.75 interpreted as “fair
to good” agreement beyond chance, and values above 0.75 interpreted as “excellent”
agreement beyond chance (Landis & Koch, 1977). The key insight of kappa as a
measure of agreement is its correction for chance agreement.
κ
ξ
ψ
κ ← (ξ − ψ) / (1 − ψ)
ξ ← αβ + (1 − α) γ
ψ ← (πa + πb ) (πa + πc ) + (πb + πd ) (πc + πd )
γ
α
β
α, β, γ ∼ Beta(1, 1)
πa ← αβ
πb ← α (1 − β)
πa
πb
πc
πd
πc ← (1 − α) (1 − γ)
πd ← (1 − α) γ
D ∼ Multinomial([πa , πb, πc, πd ] , n)
D
t
Fig. 5.5
Graphical model for inferring the kappa coefficient of agreement.
The graphical model for a Bayesian version of kappa is shown in Figure 5.5. The
58
Data Analysis
key latent variables are α, β and γ. The rate α is the rate at which the gold standard
method decides ‘one’. This means (1 − α) is the rate at which the gold standard
method decides ‘zero’. The rate β is the rate at which the surrogate method decides
‘one’ when the gold standard also decides ‘one’. The rate γ is the rate at which the
surrogate method decides ‘zero’ when the gold standard decides ‘zero’. The best
way to interpret β and γ is that they are the rate of agreement of the surrogate
method with the gold standard, for the ‘one’ and ‘zero’ decisions respectively.
Using the rates α, β and γ, it is possible to calculate the probabilities that both
methods will decide ‘one’, πa = αβ, that the gold standard will decide ‘one’ but
the surrogate will decide zero, πb = α (1 − β), the gold standard will decide ‘zero’
but the surrogate will decide ‘one’, πc = (1 − α) (1 − γ), and that both methods
will decide ‘zero’, πd = (1 − α) γ.
These probabilities, in turn, describe how the observed data, D, made up of the
counts a, b, c, and d, are generated. They come from a Multinomial distribution
with n trials, where on each trial there is a πa probability of generating an a count,
πb probability for a b count, and so on.
So, observing the data D allows inferences to be made about the key rates α, β
and γ. The remaining variables in the graphical model in Figure 5.5 just re-express
these rates in the way needed to provide an analogue to the kappa measure of chance
corrected agreement. The ξ variable measures the observed rate of agreement, which
is ξ = αβ + (1 − α) γ. The ψ variable measures the rate of agreement that would
occur by chance, which is ψ = (πa + πb ) (πa + πc) + (πb + πd ) (πc + πd ), and could
be expressed in terms of α, β and γ. Finally κ is the chance corrected measure of
agreement on the −1 to +1 scale, given by κ = (ξ − ψ) / (1 − ψ).
The script Kappa.txt implements the graphical model in WinBUGS.
# Kappa Coefficient of Agreement
model {
# Underlying Rates
# Rate objective method decides ’one’
alpha ~ dbeta(1,1)
# Rate surrogate method decides ’one’ when objective method decides ’one’
beta ~ dbeta(1,1)
# Rate surrogate method decides ’zero’ when objective method decides ’zero’
gamma ~ dbeta(1,1)
# Probabilities For Each Count
pi[1] <- alpha*beta
pi[2] <- alpha*(1-beta)
pi[3] <- (1-alpha)*(1-gamma)
pi[4] <- (1-alpha)*gamma
# Count Data
d[1:4] ~ dmulti(pi[],n)
# Derived Measures
# Rate surrogate method agrees with the objective method
xi <- alpha*beta+(1-alpha)*gamma
# Rate of chance agreement
psi <- (pi[1]+pi[2])*(pi[1]+pi[3])+(pi[2]+pi[4])*(pi[3]+pi[4])
# Chance corrected agreement
kappa <- (xi-psi)/(1-psi)
}
59
Change Detection in Time Series Data
The code Kappa.m or Kappa.R includes several data sets, described in the Exercises below, to WinBUGS to sample from the graphical model.
Exercises
Exercise 5.3.1 Influenza Clinical Trial Poehling, Griffin, and Dittus (2002) reported data evaluating a rapid bedside test for influenza using a sample of
233 children hospitalized with fever or respitory symptoms. Of the 18 children
known to have influenza, the surrogate method identified 14 and missed 4. Of
the 215 children known not to have influenza, the surrogate method correctly
rejected 210 but falsely identified 5. These data correspond to a = 14, b = 4,
c = 5, and d = 210. Examine the posterior distributions of the interesting
variables, and reach a scientific conclusion. That is, pretend you are a consultant for the clinical trial. What would your two- or three-sentence ‘take home
message’ conclusion be to your customers?
Exercise 5.3.2 Hearing Loss Assessment Trial Grant (1974) reported data from
a screening of a pre-school population intended to assess the adequacy of a
school nurse assessment of hearing loss in relation to expert assessment. Of
those children assessed as having hearing loss by the expert, 20 were correctly
identified by the nurse and 7 were missed. Of those assessed as not having
hearing loss by the expert, 417 were correctly diagnosed by the nurse but
103 were incorrectly diagnosed as having hearing loss. These data correspond
to a = 20, b = 7, c = 103, d = 417. Once again, examine the posterior
distributions of the interesting variables, and reach a scientific conclusion.
Once again, what would your two- or three-sentence ‘take home message’
Exercise 5.3.3 Rare Disease Suppose you are testing a cheap instrument for detecting a rare medical condition. After 170 patients have been screened, the
test results shower 157 did not have the condition, but 13 did. The expensive ground truth assessment subsequently revealed that, in fact, none of the
patients had the condition. These data correspond to a = 0, b = 0, c = 13,
d = 157. Apply the kappa graphical model to these data, and reach a conclusion about the usefulness of the cheap instrument. What is special about this
data set, and what does it demonstrate about the Bayesian approach?
5.4 Change Detection in Time Series Data
This case study involves near-infrared spectrographic data, in the form of oxygenated hemoglobin counts of frontal lobe activity during an attention task in
of sounding neuro and clinical, and so must be impressive and eminently fundable
work.
Data Analysis
60
The interesting modeling problem is that a change is expected in the time series
of counts because of the attention task. The statistical problem is to identify the
change. To do this, we are going to make a number of strong assumptions. In
particular, we will assume that the counts come from a Gaussian distribution that
always has the same variance, but changes its mean at one specific point in time.
µ1
λ
µ1 , µ2 ∼ Gaussian(0, 0.001)
µ2
λ ∼ Gamma(0.001, 0.001)
τ
ci
ti
i samples
t
Fig. 5.6
τ ∼ Uniform(0, tmax )

 Gaussian(µ , λ)
1
ci ∼
 Gaussian(µ , λ)
2
if ti < τ
if ti ≥ τ
Graphical model for detecting a single change-point in time series.
Figure 5.6 presents a graphical model for detecting the change point. The observed data are the counts ci at time ti for the ith sample. The unobserved variable
τ is the time at which the change happens, and so controls whether the counts have
mean µ1 or µ2 . A uniform prior over the full range of possible times is assumed for
the change point, and generic weakly informative priors are given to the means and
the precision.
The script ChangeDetection.txt implements this graphical model in WinBUGS.
# Change Detection
model {
# Data Come From A Gaussian
for (i in 1:n){
c[i] ~ dnorm(mu[z1[i]],lambda)
}
# Group Means
mu[1] ~ dnorm(0,.001)
mu[2] ~ dnorm(0,.001)
# Common Precision
lambda ~ dgamma(.001,.001)
sigma <- 1/sqrt(lambda)
# Which Side is Time of Change Point?
for (i in 1:n){
z[i] <- step(t[i]-tau)
z1[i] <- z[i]+1
}
# Prior On Change Point
tau ~ dunif(0,tmax)
}
Note the use of the step function. This function returns 1 if its argument is greater
than or equal to zero, and 0 otherwise. The z1 variable, however, serves as an
indicator variable for mu, and therefore it needs to take on values 1 and 2. This is
Censored Data
61
70
60
Value
50
40
30
20
10
0
200
400
600
800
1000
Samples
t
Fig. 5.7
Identification of change-point in time series data.
the reason z is transformed to z1. Study this code and make sure you understand
what the step function accomplishes in this example.
The code ChangeDetection.m or ChangeDetection.R applies the model to the
near-infrared spectrographic data. Uniform sampling is assumed, so that t =
1, . . . , 1778.
The code produces a simple analysis, finding the mean of the posteriors for τ ,
µ1 and µ2 , and using these summary points to overlay the inferences over the raw
data. The result look something like Figure 5.7.
Exercises
Exercise 5.4.1 Draw the posterior distributions for the change-point, the means,
and the common standard deviation.
Exercise 5.4.2 Figure 5.7 shows the mean of the posterior distribution for the
change-point (this is the point in time where the two horizontal lines meet).
Can you think of a situation in which such a plotting procedure can be misleading?
Exercise 5.4.3 Imagine that you apply this model to a data set that has two
change-points instead of one. What could happen?
5.5 Censored Data
Starting 13 April 2005, Cha Sa-soon, a 68-year old grandmother living in Jeonju,
South Korea, repeatedly tried to pass the written exam for a driving license. In
South Korea, this exam features 50 four-choice questions. In order to pass, a score
of at least 60 points out of a maximum of 100. Accordingly, we assume that each
Data Analysis
62
correct answer is worth 2 points, so that in order to pass one needs to answer at
least 30 questions correctly.
What makes Cha Sa-soon special is that she failed to pass the test on 949 consecutive occasions, spending the equivalent of 4,200 US dollars on application fees.
In her last, 950th attempt, Cha Sa-soon scored the required minimum of 30 correct
questions and finally obtained her written exam. After her 775th failure, in February 2009, Mrs Cha told Reuters news agency “I believe you can achieve your goal
if you persistently pursue it. So don’t give up your dream, like me. Be strong and
We know that on her final and 950th attempt, Cha Sa-soon answered 30 questions
correctly. In addition, news agencies report that in her 949 unsuccessful attempts,
the number of correct answers had ranged from 15 to 25. Armed with this knowledge, what can we say about θ, the latent probability that Cha Sa-soon can answer
any one question correctly? Note that we assume each question is equally difficult,
and that Cha Sa-soon does not learn from her earlier attempts.
What makes these data special is that for the failed attempts, we do not know
the precise scores. We only know that these scores range from 15 to 25. In statistical
terms, these data are said to be censored, both from below and above. We follow
and approach inspired by Gelman and Hill (2007, p. 405) to apply WinBUGS to
the problem of dealing with censored data.
θ
θ ∼ Beta(1, 1)
yi
zi
zi ∼ Binomial(θ, n)
15 ≤ zi ≤ 25 if yi = 1
i attempts
n
t
Fig. 5.8
Graphical model for inferring a rate from observed and censored data.
Figure 5.8 presents a graphical model for dealing with the censored data. The
variable zi represents both the first 949 unobserved, and the final observed attempt.
It is called a ‘partially observed’ variable, and is shaded more lightly to denote this.
The variable yi is a simple binary indicator variable, denoting whether or not the
ith attempt is observed. The bounds z lo = 15 and z hi = 25 give the known censored
interval for the unobserved attempts. Finally, n = 50 is the number of questions in
the test. This means that zi ∼ Binomial θ, n I(zlo ,zhi ) when yi indicates a censored
attempt, but is not censored for the final known score z950 = 30. Note that this
means zi is observed once, but not observed the other times. This sort of variable
is known as partially observed, and is denoted in the graphical model by a lighter
for fully unobserved or latent nodes).
Censored Data
63
160
140
95%
0.39 − 0.41
Posterior Density
120
100
80
60
40
20
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Rate
t
Fig. 5.9
Posterior density for Cha Sa-soon’s rate of answering questions correctly.
The script ChaSaSoon.txt implements this graphical model in WinBUGS.
# ChaSaSoon Censored Data
model
{
for (i in 1:nattempts){
# If the Data Were Unobserved y[i]=1, Otherwise y[i]=1
z.low[i] <- 15*equals(y[i],1)+0*equals(y[i],0)
z.high[i] <- 25*equals(y[i],1)+n*equals(y[i],0)
z[i] ~ dbin(theta,n)I(z.low[i],z.high[i])
}
# Uniform Prior on Rate Theta
theta ~ dbeta(1,1)
}
Note the use of the equals command, which returns 1 when its arguments match,
and 0 when they mismatch. Thus, when y[i]=1, for censored data, z.low[i] is set
to 15 and z.hi[i] is set to 25. When y[i]=0 z.low[i] is set to 0 and z.hi[i]
is set to n. These z.low[i] and z.hi[i] values are then applied to censor the
Binomial distribution that generates the test scores.
The code ChaSaSoon.m or ChaSaSoon.R applies the model to the data from Cha
Sa-soon. The posterior density for θ is shown in Figure 5.9, and can be seen to be
relatively peaked. Despite the fact that we do not know the actual scores for 949
of the 950 results, we are still able to infer a lot about θ.
64
Data Analysis
Exercises
Exercise 5.5.1 Do you think Cha Sa-soon could have passed the test by just guessing?
Exercise 5.5.2 What happens when you increase the interval in which you know
the data are located, from 15–25 to something else?
Exercise 5.5.3 What happens when you decrease the number of failed attempts?
Exercise 5.5.4 What happens when you increase Cha Sa-soon’s final score from
30?
Exercise 5.5.5 Do you think the assumption that all of the scores follow a Binomial distribution with a single rate of success is a good model for these
data?
5.6 Population Size
An interesting inference problem that occurs in a number of fields is to estimate
the size of a population, when a census is impossible, but repeated surveying is
possible. For example, the goal might be to estimate the number of animals in a
large woodland area that cannot be search exhaustively. Or, the goal might be to
decide how many students are on a campus, but it is not possible to count them
all.
A clever sampling approach to this problem is given by capture-and-recapture
methods. The basic idea is to capture (i.e., identify, tag, or otherwise remember) a
sample at one time point, and then collect another sample. The number of items
in the second sample that were also in the first then provides relevant information
as to the population size.
Probably the simplest possible version of this approach can be formalized with
t as the population total, x as the number in the first sample, n as the number in
the second sample, and k as the number in both samples. That is, x animals are
tagged or people remembered in the first sample, then k out of n are seen again
(i.e., recaptured) in the second sample.
The statistical model to relate the counts, and make inferences about the population size t based on the hypergeometric distribution, so that the probability that
the true underlying population is t is given by
x t−x
k
n−k
t
n
.
This makes intuitive sense, since the second sample involves taking n items from a
population of t, and has k out of x recaptures, n − k other items out of the other
t − x in the population.
The Bayesian approach to this problem involves putting a prior on t, and using the
hypergeometric distribution as the appropriate likelihood function. Conceptually,
Population Size
65
t
k
x
k ∼ Hypergeometric(n, x, t)
t ∼ Categorical(α)
n
t
Fig. 5.10
Graphical model for inferring a population from capture and recapture data.
this means k ∼ Hypergeometric n, x, t , as in the graphical model in Figure 5.10.
The vector α allows for any sort of prior mass to be given to all the possible counts
for the population total. Since, x + (n − k) items are known to exist, one reasonable
choice of prior might be to make every possibility from x + (n − k) to tmax equally
likely, where tmax is a sensible upper bound on the possible population.
While simple conceptually, there is a a difficulty in implementing the graphical
model in Figure 5.10. The problem is that WinBUGS does not provide the hypergeometric distribution. It is, however, possible to implement distributions that are
not provided, but for which the likelihood function can be expressed in WinBUGS.
This can be done using the either the so-called ‘ones trick’ or ‘zeros trick’. These
tricks rely on simple properties of the Poisson and Bernoulli distributions. By implementing the likelihood function of the new distribution within the Poisson or
Bernoulli distribution, and forcing values of 1 or 0 to be sampled, it can be shown
that the samples actually generated will come from the desired distribution.1
The script Population.txt implements the graphical model in Figure 5.10 in
WinBUGS, using the zeros trick. Note how the terms in the log-likelihood expression
for the hypergeometric distribution are built up to define phi, and a constant C is
used to insure the Poisson distribution is used with a positive value.
# Population
model{
# Hypergeometric Likelihood Via Ones Trick
logterm1 <- logfact(x)-logfact(k)-logfact(x-k)
logterm2 <- logfact(t-x)-logfact(n-k)-logfact((t-x)-(n-k))
logterm3 <- logfact(t)-logfact(n)-logfact(t-n)
C <- 1000
phi <- -(logterm1+logterm2-logterm3)+C
zeros <- 0
zeros ~ dpois(phi)
# Prior on Population Size
for (i in 1:tmax){
tptmp[i] <- step(i-(n-k+x))
tp[i] <- tptmp[i]/sum(tptmp[1:tmax])
1
` ´
The negative log-likelihood
of a sample of 0 from Poisson φ is φ. The likelihood of a sample
` ´
of 1 from Bernoulli θ is θ. So, by setting log φ or θ appropriately, and forcing 1 or 0 to be
observed, sampling effectively proceeds from the distribution defined by φ or θ.
Data Analysis
66
0.14
0.12
Posterior Mass
0.1
0.08
0.06
0.04
0.02
0
0
11
50
Total Population
t
Fig. 5.11
Posterior mass for the population size, known to be 50 or fewer, based on a
capture-recapture experiment with x = 10 items in the first sample, and k = 4 out of
n = 5 recaptured in the second sample.
}
t ~ dcat(tp[])
}
The code Population.m or Population.R applies the model to the data x =
10, k = 4, and n = 5, using uniform prior mass for all possible sizes between
x + (n − k) = 11 and tmax = 50. The posterior distribution for t is shown in
Figure 5.11. The inference is that it is mostly likely there are not many more than
6 items, which makes intuitive sense, since 4 out of 5 in the second sample were
from the original set of 10.
Exercises
Exercise 5.6.1 Try changing the number of items recaptured in the second sample
from k = 4 to k = 0. What inference do you draw about the population size
now?
Exercise 5.6.2 How important is it that the upper bound tmax = 50 correspond
67
Population Size
closely to available information when k = 4 and when k = 0? Justify your
answer by trying both the k = 4 and k = 0 cases with tmax = 100.
Exercise 5.6.3 Suppose, having obtained the posterior mass in Figure 5.11, the
same population was subjected to a new capture-recapture experiment (e.g.,
with a different means of identifying or tagging). What would be an appropriate prior for t?
6
Latent Mixture Models
6.1 Exam Scores
Suppose a group of 15 people sit an exam made up of 40 true-or-false questions,
and they get 21, 17, 21, 18, 22, 31, 31, 34, 34, 35, 35, 36, 39, 36, and 35 right. These
scores suggest that the first 5 people were just guessing, but the last 10 had some
level of knowledge.
One way to make statistical inferences along these lines is to assume there are two
different groups of people. These groups have different probabilities of success, with
the guessing group having a probability of 0.5, and the knowledge group having a
probability greater than 0.5. Whether each person belongs to the first or the second
group is a latent and unobserved variable that can take just two values. Using this
approach, the goal is to infer to which group each person belongs, and also the rate
of success for the knowledge group.
i people
zi
zi ∼ Bernoulli(0.5)
ψ ← 0.5
φ
θi
ki
ψ
φ ∼ Uniform(0.5, 1)

 φ if zi = 1
θi ∼
 ψ if z = 0
i
ki ∼ Binomial(θi , n)
n
t
Fig. 6.1
Graphical modeling for inferring membership of two latent groups, with different rates
of success in answering exam questions.
A graphical model for doing this is shown in Figure 6.1. The number of correct
answers for the ith person is ki , and is out of n = 40. The probability of success on
each question for the ith person is the rate θi . This rate is either φ0 , if the person is
68
69
Exam Scores
in the guessing group, or φ1 if the person is in the knowledge group. Which group
they are in is determined by their binary indicator variable zi , with zi = 0 if the
ith person is in the guessing group, and zi = 1 is they are in the knowledge group.
We assume each of these indicator variables equally likely to be 0 or 1 a priori,
so they have the prior zi ∼ Bernoulli 1/2 . For the guessing group, we assume
that the rate is φ0 = 1/2. For the knowledge group, we use a prior where all rate
possibilities greater than 1/2 are equally likely, so that φ1 ∼ Uniform 0.5, 1 .
The script ExamsQuizzes 1.txt implements the graphical model in WinBUGS.
# Exam Scores
model{
# Each Person Belongs To One Of Two Latent Groups
for (i in 1:p){
z[i] ~ dbern(0.5)
z1[i] <- z[i]+1
}
# First Group Just Guesses
phi[1] <- 0.5
# Second Group Has Some Unknown Greater Rate Of Success
phi[2] ~ dunif(0.5,1)
# Data Follow Binomial With Rate Given By Each Person’s Group Assignment
for (i in 1:p){
theta[i] <- phi[z1[i]]
k[i] ~ dbin(theta[i],n)
}
}
Notice the use of a dummy variable z1[i] <- z[i]+1, which—just as in the
change-detection example in Section 5.4—allows WinBUGS array structures to be
indexed in assigning theta[i].
The code ExamsQuizzes 1.m or ExamsQuizzes 1.R makes inferences about group
membership, and the success rate of the knowledge group, using the model.
Exercises
Exercise 6.1.1 Draw some conclusions about the problem from the posterior distribution. Who belongs to what group, and how confident are you?
Exercise 6.1.2 The initial allocations of people to the two groups in this code is
random, and so will be different every time you run it. Check that this does
not affect the final results from sampling.
Exercise 6.1.3 Include an extra person in the exam, with a score of 28 out of 40.
What does their posterior for z tell you?
Exercise 6.1.4 What happens if you change the prior on the success rate of the
second group to be uniform over the whole range (0, 1), and so allow for
worse-than-guessing performance?
Exercise 6.1.5 What happens if you change the initial expectation that everybody
is equally likely to belong to either group, and have an expectation
that people
generally are not guessing, with (say), zi ∼ Bernoulli 0.9 ?
Latent Mixtures
70
6.2 Exam Scores With Individual Differences
The previous example shows how sampling can naturally and easily find discrete
latent groups. But the model itself has at least one big weakness, which is that it
assumes all the people in the knowledge group have exactly the same rate of success
on the questions.
One straightforward way to allow for individual differences in the knowledge
group is to extend the model hierarchically. This involves drawing the success rate
for each of the people in the knowledge group from an over-arching distribution.
One convenient (but not perfect) choice for this ‘individual differences’ distribution
is a Gaussian. It is a natural statistical model for individual variation, at least in
the absence of any rich theory. But it has the problem of allowing for success rates
below zero and above one. An inelegant but practical and effective way to deal with
this is simply to censor the sampled success rates to the valid range.
i people
zi ∼ Bernoulli(0.5)
zi
µ ∼ Uniform(0.5, 1)
λ ∼ Gamma(0.001, 0.001)
µ
φi
θi
ψ
λ
ki
φi ∼ Gaussian(µ, λ)
ψ ← 0.5

 φi if zi = 1
θi ←
 ψ if z = 0
i
ki ∼ Binomial(θi , n)
n
t
Fig. 6.2
Graphical model for inferring membership of two latent groups, with different rates of
success in answering exam questions, allowing for individual differences in the
knowledge group.
A graphical model that implements this idea is shown in Figure 6.2. It extends
the original model by having a knowledge group success rate φi1 for the ith person.
These success rates are drawn from a Gaussian distribution with mean µ and precision λ. The mean µ is given a Uniform prior between 0.5 and 1.0, consistent with
the original assumption that people in the knowledge group have a greater than
chance success rate.
The script ExamsQuizzes 2.txt implements the graphical model in WinBUGS.
# Exam Scores With Individual Differences
71
Twenty Questions
model {
# Data Follow Binomial With Rate Given By Each Person’s Group Assignment
for (i in 1:p){
k[i] ~ dbin(theta[i,z1[i]],n)
}
# Each Person Belongs To One Of Two Latent Groups
for (i in 1:p){
z[i] ~ dbern(0.5)
z1[i] <- z[i]+1
}
# The Second Group Now Allows Individual Differences
# So There Is a Rate Per Person
for (i in 1:p){
# First Group Is Still Just Guesses
theta[i,1] <- 0.5
# Second Group Drawn From A Censored Gaussian Distribution
thetatmp[i,2] ~ dnorm(mu,lambda)
theta[i,2] <- min(1,max(0,thetatmp[i,2])) # Censor The Probability To (0,1)
}
# Second Group Mean, Precision (And Standard Deviation)
mu ~ dunif(0.5,1) # Greater Than 0.5 Average Success Rate
lambda ~ dgamma(.001,.001)
sigma <- 1/sqrt(lambda)
# Posterior Predictive For Second Group
predphitmp ~ dnorm(mu,lambda)
predphi <- min(1,max(0,predphitmp))
}
Notice that is includes a posterior predictive variable predphi for the knowledge
group success rates of each person.
The code ExamsQuizzes 2.m or ExamsQuizzes 2.R makes inferences about group
membership, the success rate of each person the knowledge group, and the mean
and standard deviation of the over-arching Gaussian for the knowledge group.
Exercises
Exercise 6.2.1 Compare the results of the hierarchical model with the original
model that did not allow for individual differences.
Exercise 6.2.2 Interpret the posterior distribution of by the variable predphi.
How does this distribution relate to the posterior distribution for mu?
Exercise 6.2.3 What does the posterior distribution for the variable theta[1,2]
mean?
Exercise 6.2.4 In what sense could the latent assignment of people to groups in
this case study be considered a form of model selection?
6.3 Twenty Questions
Suppose a group of 10 people attend a lecture, and are asked a set of 20 questions
afterwards, with every answer being either correct or incorrect. The pattern of data
Latent Mixtures
72
is shown in Table 6.1. From this pattern of correct and incorrect answers we want
to infer two things. The first is how well each person attended to the lecture. The
second is how hard each of the questions was.
Table 6.1 Correct and incorrect answers for 10 people on 20 questions.
Question
A B C D E F G H I J K L M N O P Q R S T
Person 1
Person 2
Person 3
Person 4
Person 5
Person 6
Person 7
Person 8
Person 9
Person 10
1
0
0
0
1
1
0
0
0
1
1
1
0
0
0
1
0
0
1
0
1
1
1
0
1
0
0
0
1
0
1
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
1
0
1
1
1
0
0
0
0
1
1
0
1
0
1
1
0
0
1
0
0
0
0
1
0
0
0
0
0
0
1
0
0
1
1
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
1
0
1
0
1
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
One way to make these inferences is to specify a model of how a person’s attentiveness and a question’s difficulty combine to give an overall probability the
question will be answered correctly. A very simple model involves assuming each
person listens to some proportion of the lecture, and that each question has some
probability of being answered correctly if the person was listening at the right point
in the lecture.
pi
θij
qj
pi , qj ∼ Beta(1, 1)
θij ← pi qj
kij ∼ Bernoulli(θij )
kij
i people
j questions
t
Fig. 6.3
Graphical model for inferring the rate people listened to a lecture, and the difficulty of
the questions.
A graphical model that implements this idea is shown in Figure 6.3. Under the
model, if the ith person’s probability of listening is pi , and the jth question’s
probability of being answered correctly if the relevant information is heard is qj ,
then the probability the ith person will answer the jth question correctly is just
73
Twenty Questions
θij = pi qj . The observed pattern of correct and incorrect answers, where kij = 1
if the ith person answered the jth question correctly, and kij = 0 if they did not,
then is a draw from a Bernoulli distribution with probability θij .
The script TwentyQuestions.txt implements the graphical model in WinBUGS.
# Twenty Questions
model {
# Correctness Of Each Answer Is Bernoulli Trial
for (i in 1:np){
for (j in 1:nq){
k[i,j] ~ dbern(theta[i,j])
}
}
# Probability Correct Is Product Of Question By Person Rates
for (i in 1:np){
for (j in 1:nq){
theta[i,j] <- p[i]*q[j]
}
}
# Priors For People and Questions
for (i in 1:np){
p[i] ~ dbeta(1,1)
}
for (j in 1:nq){
q[j] ~ dbeta(1,1)
}
}
The code TwentyQuestions.m or TwentyQuestions.R makes inferences about
the data in Table 6.1 using the model.
Exercises
Exercise 6.3.1 Draw some conclusions about how well the various people listened,
and about the difficulties of the various questions. Do the marginal posterior
distributions you are basing your inference on seem intuitively reasonable?
Exercise 6.3.2 Now suppose that three of the answers were not recorded, for whatever reason. Our new data set, with missing data, now take the form shown
in Table 6.2.
Bayesian inference will automatically make predictions about these missing
values (i.e., “fill in the blanks”) by using the same probabilistic model that
generated the observed data. Missing data are entered as nan (“not a number”)
in Matlab, and NA (“not available”) in R or WinBUGS. Including the variable
k as one to monitor when sampling will then provide posterior values for the
missing values. That is, it provides information about the relative likelihood of
the missing values being each of the possible alternatives, using the statistical
model and the available data.
Look through the Matlab or R code to see how all of this is implemented in
the second dataset. Run the code, and interpret the posterior distributions
for the three missing values. Are they reasonable inferences?
Latent Mixtures
74
Table 6.2 Correct, incorrect and missing answers for 10 people on 20
questions.
Question
A B C D E F G H I J K L M N O P Q R S T
Person 1
Person 2
Person 3
Person 4
Person 5
Person 6
Person 7
Person 8
Person 9
Person 10
1
0
0
0
1
1
0
0
0
1
1
1
0
0
0
1
0
0
1
0
1
1
1
0
1
0
0
0
1
0
1
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
?
0
0
0
0
0
0
1
0
0
0
0
0
1
0
1
1
1
0
0
0
0
1
1
0
1
0
1
1
0
0
1
0
0
0
0
1
0
0
0
0
0
0
1
0
0
1
1
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
?
0
1
0
1
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
1
0
0
1
?
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
6.4 The Two Country Quiz
Suppose a group of people take a historical quiz, and each answer for each person is
scored as correct or incorrect. Some of the people are Thai, and some are Moldovan.
Some of the questions are about Thai history, and would be very likely to be known
by any Thai person, but very unlikely to be known by people from outside the
region. The rest of the questions are about Moldovan history, and would be very
likely to be known by any Moldovan, but not by others.
We do not know who is Thai or Moldovan, and we do not know the content of
the questions. All we have are the data shown in Table 6.3. Spend some time just
looking at the data, and try to infer which people are from the same country, and
which questions relate to their country.
A good way to make these inferences formally is to assume there are two types
of answers. For those where the nationality of the person matches the origin of the
question will be correct with high probability. For those where a person is being
asked about the other country will have a very low probability of being correct.
A graphical model that implements this idea is shown in Figure 6.4. The rate
α is the (expected to be high) probability of a person from a country correctly
answering a question about their country’s history. The rate β is the (expected
to be low) probability of a person correctly answering a question about the other
country’s history. To capture the knowledge about the rates, the priors constrain
α ≥ β, by defining alpha ∼ dunif(0,1) and beta ∼ dunif(0,alpha). At first
glance, this might seem inappropriate, since it specifies a prior for one parameter
in terms of another (unknown, and being inferred) parameter. Conceptually, it is
clearer to think of this syntax as a (perhaps clumsy) way to specify a joint prior
The Two Country Quiz
75
Table 6.3 Correct, incorrect and missing answers for 8 people on 8 questions.
Question
Person
Person
Person
Person
Person
Person
Person
Person
1
2
3
4
5
6
7
8
A
B
C
D
E
F
G
H
1
1
0
0
1
0
0
0
0
0
1
1
0
0
1
1
0
0
1
1
0
0
0
1
1
1
0
0
1
1
0
1
1
1
0
0
1
1
0
0
0
0
1
1
0
0
1
1
0
0
0
1
0
0
1
1
1
1
0
0
1
1
0
0
α
β
α ∼ Uniform(0, 1)
β ∼ Uniform(0, α)
xi
θij
zj
xi ∼ Bernoulli(0.5)
θij
kij

 α if x = z
i
j
←
 β if x =
i 6 zj
kij ∼ Bernoulli(θij )
i people
j questions
t
Fig. 6.4
Graphical model for inferring the country of origin for people and questions.
over α and β in which the α ≥ β. Graphically, the parameter space over (α, β) is
a unit square, and the prior being specified is the half of the square on one side of
the diagonal line α = β.
In the remainder of the graphical model, the binary indicator variable xi assigns
the ith person to one or other country, and zj similarly assigns the jth question to
one or other country. The probability the ith person will answer the jth question
correctly is θij , which is simply α if the country assignments match, and β if they
do not. Finally, the actual data kij indicating whether or not the answer was correct
follow a Bernoulli distribution with rate θij .
The script TwoCountryQuiz.txt implements the graphical model in WinBUGS.
# The Two Country Quiz
model {
# Probability Of Not Forgetting And Guessing
alpha ~ dunif(0,1) # Match
beta ~ dunif(0,alpha) # Mismatch
# Group Membership For People and Questions
Latent Mixtures
76
for (i in 1:np){
pz[i] ~ dbern(0.5)
pz1[i] <- pz[i]+1
}
for (j in 1:nq){
qz[j] ~ dbern(0.5)
qz1[j] <- qz[j]+1
}
# Probability Correct For Each Person-Question Comination By Groups
# High If Person Group Matches Question Group
# Low If No Match
for (i in 1:np){
for (j in 1:nq){
theta[i,j,1,1] <- alpha
theta[i,j,1,2] <- beta
theta[i,j,2,1] <- beta
theta[i,j,2,2] <- alpha
}
}
# Data Are Bernoulli By Rate
for (i in 1:np){
for (j in 1:nq){
k[i,j] ~ dbern(theta[i,j,pz1[i],qz1[j]])
}
}
}
The code TwoCountryQuiz.m or TwoCountryQuiz.R makes inferences about the
data in Table 6.3 using the model.
Exercises
Exercise 6.4.1 Interpret the posterior distributions for x[i], z[j], alpha and
beta. Do the formal inferences agree with your original intuitions?
Exercise 6.4.2 The priors on the probabilities of answering correctly capture
knowledge about what it means to match and mismatch, by imposing an
order constraint α ≥ β. Change the code so that this information is not included, by using priors alpha∼dbeta(1,1) and beta∼dbeta(1,1). Run a
few chains against the same data, until you get an inappropriate, and perhaps
counter-intuitive, result. Describe the result, and discuss why it comes about.
Exercise 6.4.3 Now suppose that three extra people enter the room late, and
begin to take the quiz. One of them (Late Person 1) has answered the first
four questions, the next (Late Person 2) has only answered the first question,
and the final new person (Late Person 3) is still sharpening their pencil, and
has not started the quiz. This situation can be represented as an updated data
set, now with missing data, as in Table 6.4. Interpret the inferences the model
makes about the nationality of the late people, and whether or not they will
get the unfinished questions correct.
Exercise 6.4.4 Finally, suppose that you are now given the correctness scores for
a set of 10 new people, whose data were not previously available, but who
Latent Group Assessment of Malingering
77
Table 6.4 Correct, incorrect and missing answers for 8 people and 3 late
people on 8 questions.
Question
Person 1
Person 2
Person 3
Person 4
Person 5
Person 6
Person 7
Person 8
Late Person 1
Late Person 2
Late Person 3
A
B
C
D
E
F
G
H
1
1
0
0
1
0
0
0
1
0
?
0
0
1
1
0
0
1
1
0
?
?
0
0
1
1
0
0
0
1
0
?
?
1
1
0
0
1
1
0
1
1
?
?
1
1
0
0
1
1
0
0
?
?
?
0
0
1
1
0
0
1
1
?
?
?
0
0
0
1
0
0
1
1
?
?
?
1
1
0
0
1
1
0
0
?
?
?
form part of the same group of people we are studying. The updated data
set is shown in Table 6.5. Interpret the inferences the model makes about the
nationality of the new people. Revisit the inferences about the late people,
and whether or not they will get the unfinished questions correct. Does the
inference drawn by the model for the third late person match your intuition?
There is a problem here. How could it be fixed?
6.5 Latent Group Assessment of Malingering
Armed with the knowledge from the previous sections we now consider a question of considerable practical interest: how to detect if people are cheating on
a test. For instance, people who have been in a car accident may seek financial
compensation from insurance companies by feigning cognitive impairment such as
pronounced memory loss1 . When these people are confronted with a memory test
that is intended to measure the extent of their impairment, they may deliberately
under-perform. This behavior is called malingering, and it may be accompanied by
performance that is much worse than that which is displayed by real amnesiacs.
For instance, malingerers may perform substantially below chance.
However, malingerers may not always be easy to detect, and this is when Bayesian
inference with latent mixture models can be helpful. Not only can we classify people
in two categories—those who malinger and those who are truthful or bona fide—but
we can also quantify our confidence in each classification.
1
pronounced mem-uh-ree los.
Latent Mixtures
78
Table 6.5 Correct, incorrect and missing answers for 8 people, 3 late people,
and 10 new people on 8 questions.
Question
New Person
New Person
New Person
New Person
New Person
New Person
New Person
New Person
New Person
New Person
Person 1
Person 2
Person 3
Person 4
Person 5
Person 6
Person 7
Person 8
Late Person
Late Person
Late Person
1
2
3
4
5
6
7
8
9
10
1
2
3
A
B
C
D
E
F
G
H
1
1
1
1
1
1
1
1
1
1
1
1
0
0
1
0
0
0
1
0
?
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
1
1
0
?
?
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
1
0
?
?
1
1
1
1
1
1
1
1
1
1
1
1
0
0
1
1
0
1
1
?
?
1
1
1
1
1
1
1
1
1
1
1
1
0
0
1
1
0
0
?
?
?
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
1
1
?
?
?
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
1
?
?
?
1
1
1
1
1
1
1
1
1
1
1
1
0
0
1
1
0
0
?
?
?
In an experimental study on malingering, each of p = 22 participants was confronted with a memory test. One group of participants was told to do their best;
these are the bona fide participants. The other group of participants was told to
under-perform by deliberately simulating amnesia. These are the malingerers. Out
of a total of n = 45 test items, the participants get 45, 45, 44, 45, 44, 45, 45, 45,
45, 45, 30, 20, 6, 44, 44, 27, 25, 17, 14, 27, 35, and 30 correct. Because this was an
experimental study, we know that the first 10 participants were bona fide and the
next 12 were instructed to malinger.
The first analysis is straightforward and very similar to the one we did in Section 6.1. We assume that all bona fide participants have the same ability, and so
have the same rate φb of answering each question correctly. For the malingerers,
the rate of answering questions correctly is given by φm , and φb > φm .
The script Malingering 1.txt implements the graphical model in WinBUGS.
# Malingering
model
{
Individual Differences in Malingering
79
ψm
ψb
ψ b ∼ Uniform(0.5, 1)
ψ m ∼ Uniform(0, ψ b)
zi
θi
zi ∼ Bernoulli(0.5)
θi ←
ki

 ψb
if zi = 1
 ψ m if z = 0
i
ki ∼ Binomial(θi , n)
i people
n
t
Fig. 6.5
Graphical model for the detection of malingering.
# Each person belongs to one of two latent groups
for (i in 1:p){
z[i] ~ dbern(0.5)
z1[i] <- z[i]+1
}
# Bonafide group has unknown success rate above chance
psi[1] ~ dunif(0.5,1)
# Malingering group has unknown rate of success below bonafide
psi[2] ~ dunif(0,psi[1])
# Data are binomial with group rate of each person
for (i in 1:p){
theta[i] <- psi[z1[i]]
k[i] ~ dbin(theta[i],n)
}
}
Notice the restriction in the dunif command, which prevents the so-called model
indeterminacy or label-switching problem by ensuring that φb > φm .
The code Malingering 1.m or Malingering 1.R allows you to draw conclusions
about group membership and the success rate of the two groups.
Exercises
6.6 Individual Differences in Malingering
As before, it may seem needlessly restrictive to assume that all members of a group
have the same chance of answering correctly. So now we assume that the ith partic-
80
Latent Mixtures
ipant in each group has a unique rate parameter, θi , which is constrained by group
level distributions.
In Section 6.2, we used group level Gaussians. The problem with that approach
is that values can lie outside the range 0 to 1. These values can just be excluded
from consideration, as we did with the censoring, but this solution is not elegant.
One of several alternatives is to assume that instead of being Gaussian, the group
level distribution is Beta α, β . The α and β values can be thought of as counts of
“prior successes” and “prior failures”, respectively. Because the Beta distribution is
defined on the interval from 0 to 1 it respects the natural boundaries of rates. So we
now have a model in which each individual binomial rate parameter is constrained
by a group level Beta distribution—this complete model is known as the betabinomial.
It is useful to transform the α and β parameters from the Beta distribution to a
group mean µ = α/(α + β) and a group precision λ = α + β. In a first attempt, one
may then assign uniform priors to both µb (the group-level mean for the bona fide
participants) and µm (the group-level mean for the malingerers). Unfortunately,
this assignment does not reflect our knowledge that µb > µm , and so is subject to
label-switching. To avoid this problem we could use the dunif(0,mu bon) syntax
used in the previous example.
However, here we solve the label-switching problem differently. We first define
µm as the additive combination of µb and a difference parameter, as follows:
logit(µmal ) = logit(µb ) − µd . Note that this is an additive combination on the
logit scale, as is customary in beta-binomial models. The logit transformation is
defined as logit(θ) ≡ ln(θ/(1 − θ)) and it transforms values on the rate scale (ranging from 0 to 1) to values on the logit scale (ranging from −∞ to ∞). The logit
transformation is shown in Figure 6.6.
Next, we assign µd a positive-only Gaussian distribution, that is, mu diff ∼
dnorm(0,0.5)I(0,). This insures that the group mean of the malingerers is never
larger than that of the bona fide participants, and the label-switching problem is
solved.
One final note concerns the base rate of malingering φ. In the previous example
we set the base rate equal to 0.5. Now, we assign the base rate φ a prior distribution,
and use the data to infer group membership and at the same time learn about the
base rate.
A graphical model that implements the above ideas is shown in Figure 6.7. The
script Malingering 2.txt implements the graphical model in WinBUGS.
# Malingering, with individual differences
model {
# Each person belongs to one of two latent groups
for (i in 1:p){
z[i] ~ dbern(phi) # phi is the base rate
z1[i] <- z[i]+1
}
# Relatively uninformative prior on base rate
phi ~ dbeta(5,5)
Individual Differences in Malingering
81
6
Logit θ
4
2
0
2.94
0
−2
−4
−6
0.5
0.0
0.2
0.4
0.95
0.6
0.8
1.0
Probability θ
t
Fig. 6.6
The logit transformation. Probabilities θ range from 0 to 1 and are mapped to the
entire real using the logit transform, logit(θ) ≡ ln(θ/(1 − θ)). This transformation is
particularly useful for the modeling of additive effects.
# Data are binomial with rate of each person
for (i in 1:p){
k[i] ~ dbin(theta[i,z1[i]],n)
theta[i,1] ~ dbeta(alpha[1],beta[1])
theta[i,2] ~ dbeta(alpha[2],beta[2])
}
# Transformation to group mean and precision
alpha[1] <- mubon * lambdabon
beta[1] <- lambdabon * (1-mubon)
logit(mumal) <- logit(mubon) - mudiff
alpha[2] <- mumal * lambdamal
beta[2] <- lambdamal * (1-mumal)
# Priors
mubon ~ dbeta(1,1)
mudiff ~ dnorm(0,0.5)I(0,) # Constrained to be postive
lambdabon ~ dunif(40,800)
lambdamal ~ dunif(4,100)
}
The code Malingering 2.m or Malingering 2.R allows you to draw conclusions
about group membership and the success rate of the two groups.
Latent Mixtures
82
µd
µb ∼ Beta(1, 1)
µd ∼ Gaussian(0, 0.5)I(0,∞)
λm
µm
µb
λb
λb ∼ Uniform(40, 800)
λm ∼ Uniform(4, 100)
φ
zi
θi
zi ∼ Bernoulli(φ)
θi ∼
ki
i people
Fig. 6.7
if zi = 0
 Beta(µ λ , (1 − µ ) λ ) if z = 1
m m
m
m
i
ki ∼ Binomial(θi , n)
logitµm ← logitµb + µd
n
t

 Beta(µb λb , (1 − µb ) λb )
φ ∼ Beta(5, 5)
Graphical model for inferring membership of two latent groups, consisting of
malingerers and bona fide participants. Each participant has their own rate of
answering memory questions correctly, coming from group level distributions that
have their means constrained so that the bona fide group is greater than that for the
malingerers.
Exercises
Exercise 6.6.1 Assume you know that the base rate of malingering is 10%. Change
the WinBUGS script to reflect this knowledge. Do you expect any differences?
Exercise 6.6.2 Assume you know for certain that participants 1, 2, and 3 are bona
fide. Change the code to reflect this knowledge.
Exercise 6.6.3 Suppose you add a new participant. What number of questions
Exercise 6.6.4 Try to solve the label-switching problem by using the
dunif(0,mu bon) trick instead of the logit transform.
Exercise 6.6.5 Why do you think the priors for λb and λm are different?
6.7 Alzheimer’s Recall Test Cheating
In this section, we apply the same latent mixture model shown in Figure 6.7 to
different memory test data. Simple recognition and recall tasks are an important
part of screening for Alzheimer’s Disease and Related Disorders (ADRD), and are
83
Alzheimer’s Recall Test Cheating
sometimes administered over the telephone. This practice raises the possibility of
people cheating by, for example, writing down the words they are being asked to
remember.
The data we use come from an informal experiment, in which 118 people (actually,
employees of an insurance company familiar with administering these tests) were
asked either to complete the test normally (giving a total of 61 bona fide people),
or were instructed to cheat (giving a total of 57 malingerers). The particular test
used was a complicated sequence of immediate and delayed free recall tasks, which
we simplify to give a simple score correct out of 40 for each person.
We assume that both the bona fide and malingering groups come from different
populations, following the model in Figure 6.7. Note that we expect the mean of the
malingerers to be higher, since the impact of cheating is to recall more words than
would otherwise be the case. The script Cheating.txt implements the analysis in
WinBUGS
# Cheating Latent Mixture Model
model
{
# Each Person Belongs To One Of Two Latent Groups
for (i in 1:p){
z[i] ~ dbern(phi) # phi is the Base Rate
z1[i] <- z[i]+1
}
# Relatively Uninformative Prior on Base Rate
phi ~ dbeta(5,5)
# Data Follow Binomial With Rate Given By Each Person’s Group Assignment
for (i in 1:p){
k[i] ~ dbin(theta[i,z1[i]],n)
thetatmp[i,1] ~ dbeta(alpha[1],beta[1])
theta[i,1] <- max(.001,min(.999,thetatmp[i,1]))
thetatmp[i,2] ~ dbeta(alpha[2],beta[2])
theta[i,2] <- max(.001,min(.999,thetatmp[i,2]))
}
# Transformation to Group Mean and Precision
alpha[1] <- mubon * lambdabon
beta[1] <- lambdabon * (1-mubon)
logit(mumal) <- logit(mubon) - mudiff
alpha[2] <- mumal * lambdamal
beta[2] <- lambdamal * (1-mumal)
# Priors
mubon ~ dbeta(1,1)
mudifftmp ~ dnorm(0,0.5)
mudiff <- max(0,mudifftmp) # Constrained to be Postive
lambdabon ~ dunif(5,50)
lambdamal ~ dunif(5,50)
# Correct Count
for (i in 1:p){
pct[i] <- equals(z[i],truth[i])
}
pc <- sum(pct[1:p])
}
Latent Mixtures
84
Bona fide
Malingerer
Proportion Correct
0
5
10
15
5
10
15
20
25
30
35
40
20
25
30
35
40
1
0.9
0.8
0.7
0.6
0.5
0.4
0
Total Correct
t
Fig. 6.8
The distribution of total correct recall scores for the Alzheimer’s data, and
classification performance. The top panel shows the distribution of scores for the bona
fide and malingering groups. The bottom panel shows, with the line, the accuracy
achieved using various cut offs to separate the groups, and, with the distribution, the
accuracy achieved by the latent mixture model.
This script is essentially the same as for the previous example, although it modifies
the priors on the precisions of the group distributions. It also includes a variable
pc that keeps track of the accuracy of each classification sample made in sampling, by comparing each person’s latent assignment to the known truth from the
experimental design.
The code Cheating.m or Cheating.R applies the graphical model to the data.
We focus our analysis of the results firstly on the classification accuracy of the
model. The top panel of Figure 6.8 summarizes the data, showing the distribution
of correctly recalled words in both the bona fide and malingering groups. It is clear
that malingerers generally recall more words, but that there is overlap between the
groups.
One way to provide a benchmark classification accuracy is the consider the best
possible ‘cut off’. This is a total correct score below which a person is classified as
bona fide, and at or above which they are classified as a malingerer. The line in
the bottom panel in Figure 6.8 shows the classification accuracy for all possible cut
offs, which peaks at 86.4% accuracy using the cut off of 35. The green distribution
at the left of the panel is the posterior distribution of the pc variable, showing the
range of accuracy achieved by the latent mixture model.
Alzheimer’s Recall Test Cheating
85
1
0.9
BonaFide Classification
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
5
10
15
20
25
30
35
40
Total Correct
t
Fig. 6.9
The relationship between each person’s total correct recall score, and their posterior
classification as belonging to the bona fide or malingering group.
In general, using generative model to solve classification problems is unlikely
to work as well as the best discriminative methods from machine learning and
statistics. The advantage of the generative model is in providing details about the
underlying processes assumed to produce the data, particularly by quantifying uncertainty. A good example of this general feature is shown in Figure 6.9, which
shows the relationship between the total correct raw data score, and the posterior
uncertainty about classification, for each person. This figure shows that the model
infers people with scores below 35 as more likely to be bona fide. But it also shows
how certain the model is about each classification, which provides more information (and more probabilistically coherent information) than many machine learning
methods.
This information about uncertainty is useful, for example, if there are costs or
utilities associated with different classification decisions. Suppose that raising a
false-alarm and suspecting someone of cheating on the screening test costs \$25, perhaps through a wasted follow-up in person test, but that missing someone cheated
on the screening test costs \$100, perhaps through providing insurance that should
not have been provided. With these utilities, the decision should be to classify people as bona fide only if it is four times more likely than them being a malingerer. In
other words, we need 80% certainty they are bona fide. The posterior distribution
of the latent assignment variable z provides exactly this information. It is clear
86
Latent Mixtures
from Figure 6.9 only people with a total correct score below 30 (not 35) should be
treated as bona fide, under this set of utilities.
Exercises
Exercise 6.7.1 Suppose the utilities are very different, so that a false-alarm costs
\$100, because of the risk of litigation in a false accusation, but misses are
relatively harmless, costing \$10 in wasted administrative costs. What decisions
Exercise 6.7.2 What other potential information, besides the uncertainty about
classification, does the model provide? Give at least one concrete example.
PART III
Preface
This document contains answers to the exercises from the book “A Course in
Bayesian Graphical Modeling for Cognitive Science”. Contrary to popular belief,
statistical modeling is rarely a matter of right or wrong; instead, the overriding
concerns are reasonableness and plausibility. Therefore, you may find yourself disagreeing with some of our intended solutions. Please let us know if you believe your
answer is clearly superior to ours, and—if we agree with you—we will adjust the
text accordingly.
Michael D. Lee
Irvine, USA
Eric-Jan Wagenmakers
Amsterdam, The Netherlands
August 2011
89
2
The Basics of Bayesian Analysis
Exercise 2.1.1 The famous Bayesian statistician Bruno de Finetti published two
big volumes entitled “Theory of Probability” (de Finetti, 1974). Perhaps
surprisingly, the first volume starts with the words “probability does not
exist”. To understand why de Finetti wrote this, consider the following
situation: someone tosses a fair coin, and the outcome will be either heads or
tails. What do you think the probability is that the coin lands heads? Now
suppose you are a physicist with advanced measurement tools, and you can
establish relatively precisely both the position of the coin and the tension
in the muscles immediately before the coin is tossed in the air—does this
change your probability? Now suppose you can briefly look into the future
(Bem, 2011), albeit hazily—is your probability still the same?
These different situations illustrate how the concept of probability is always conditional on background knowledge, and does not exist in a vacuum.
This idea is central to the subjective Bayesian school, a school that stresses how
inference is, in the end, dependent on personal beliefs.
Exercise 2.1.2 On his blog, prominent Bayesian Andrew Gelman wrote (March
18, 2010) “Some probabilities are more objective than others. The probability
that the die sitting in front of me now will come up ‘6’ if I roll it...that’s
about 1/6. But not exactly, because it’s not a perfectly symmetric die. The
probability that I’ll be stopped by exactly three traffic lights on the way to
school tomorrow morning: that’s...well, I don’t know exactly, but it is what
it is.” Was de Finetti wrong, and is there only one clearly defined probability
of Andrew Gelman encountering three traffic lights on the way to school
tomorrow morning?
A detailed knowledge of the layout of the traffic signs along the route will
influence your assessment of this probability, as well as your knowledge of how
busy traffic will be tomorrow morning, how often the traffic signs malfunction,
whether traffic will be rerouted because of construction, and so on When you
can look one day into the future, the probability of Andrew Gelman encountering
90
91
three traffic lights on the way to school is either zero or one. As before, probability
statements are conditional on your background knowledge.
Exercise 2.1.3 Figure 1.1 shows that the 95% Bayesian credible interval for θ
extends from 0.59 to 0.98. This means that one can be 95% confident that
the true value of θ lies in between 0.59 and 0.98. Suppose you would do
an orthodox analysis and find the same confidence interval. What is the
orthodox interpretation of this interval?
The orthodox interpretation is that if you repeat the experiment very many
times, and every time determine the confidence interval in the same way as you
did for the observed data, then the true value of θ falls inside the computed
intervals for 95% of the replicate experiments. Note that this says nothing about
the confidence for the current θ, but instead refers to the long-run performance
of the confidence interval method across many hypothetical experiments.
Exercise 2.1.4 Suppose you learn that the questions are all true/false questions.
Does this knowledge affect your prior distribution? And if so, how would this
prior in turn affect your posterior distribution?
With true or false questions, zero ability corresponds to guessing, that is,
θ = .5. Because negative ability is deeply implausible (unless the questions are
deliberately misleading), it makes sense to have a uniform prior that ranges from
.5 to 1, and hence has zero mass below .5. Because there is no prior mass below
.5, there will also be no posterior mass below 0.5.
Exercise 2.2.1 Instead of “integrating over the posterior”, orthodox methods
often use the “plug-in principle”; in this case, the plug-in principle suggest
that we predict p(k rep ) solely based on θ̂, the maximum likelihood estimate.
Why is this a bad idea in general? And can you think of a specific situation
in which this may not be so much of a problem?
The plug-in principle ignores uncertainty in θ, and therefore lead to predictions that are overconfident, that is, predictions that are less variable than
they should be (?, ?). The overconfidence increases with the width of the
posterior distribution. This also means that when the posterior is very peaked,
that is, when we are very certain about θ (for instance because we have observed
many data), the plug-in principle will only result in very little overconfidence.
92
No exercises.
2.4 Answers: Markov Chain Monte Carlo
Exercise 2.4.1 Use Google and list some other scientific disciplines that use
Bayesian inference and MCMC sampling.
Bayesian inference is used in almost all scientific disciplines (but we wanted you
to discover this yourself).
Exercise 2.4.2 The text reads: “Using MCMC sampling, you can approximate
posterior distributions to any desired degree of accuracy”. How is this
possible?
By drawing more and more MCM samples, the discrepancy between the
true distribution and the histogram can be made arbitrarily small. Or, in other
words, longer chains yield better approximations.
3
Inferences Involving Binomial
Distributions
Exercise 3.1.1 Alter the data to k = 50 and n = 100, and compare the posterior
for the rate θ to the original with k = 5 and n = 10.
more peaked. This means that you are more certain about what values for the
difference are plausible, and what values are not.
Exercise 3.1.2 For both the k = 50, n = 100 and k = 5, n = 10 cases just
considered, re-run the analyses with many more samples (e.g., ten times as
many) by changing the nsamples variable in Matlab or the niter variable in
R. This will take some time, but there is an important point to understand.
What controls the width of the posterior distribution (i.e., the expression
of uncertainty in the rate parameter θ)? What controls the quality of the
estimate of the posterior (i.e., the smoothness of the histograms in the
figures)?
The width of the posterior distribution, expressing the uncertainty in the
single true underlying rate, is controlled by the available information in the
data. Thus, higher n leads to narrower posterior distributions. The quality of
the estimate, visually evident by the smoothness of the posterior histogram, is
controlled by how many samples are collected to form the approximation. Note
that these two aspects of the analysis are completely independent. It is possible
to have many data but just collect a few samples in a quick data analysis, to get
a crude approximation to a narrow posterior. Similarly, it is possible to have only
a few data, but collect many samples, to get a very close approximation to a
Exercise 3.1.3 Alter the data to k = 99 and n = 100, and comment on the shape
of the posterior for the rate θ.
The posterior distribution is not symmetric, because of the ‘edge effect’
given by the theoretical upper bound of one for the rate. This goes some way
93
94
to demonstrating how a Bayesian posterior distribution can take any form, and
certainly does not have to be symmetric, or Gaussian, or in any other simple form.
Exercise 3.1.4 Alter the data to k = 0 and n = 1, and comment on what this
The fact that a posterior distribution exists at all shows that Bayesian
analysis can be done even when there are very few data. The posterior distribution is very broad, reflecting the large uncertainty following from the lack of
information, but nonetheless represents (as always) everything that is known and
unknown about the parameter of interest.
3.2 Answers: Difference Between Two Rates
Exercise 3.2.1 Compare the data sets k1 = 8, n1 = 10, k2 = 7, n2 = 10 and
k1 = 80, n1 = 100, k2 = 70, n2 = 100.
When you have more information (i.e., high n) the posteriors–for the individual
rates, as well as for the difference between them that is of interest—become
more peaked. This means that you are more certain about what values for the
difference are plausible, and what values are not.
Exercise 3.2.2 Try the data k1 = 0, n1 = 1, k2 = 0, n2 = 5.
The key to understanding the posterior is that you can be relatively sure
that θ2 is small, but you cannot be so sure about the value of θ1 .
Exercise 3.2.3 In what context might different possible summaries of the posterior distribution of δ (i.e., point estimates, or credible intervals) be reasonable,
and when might it be important to show the full posterior distribution?
In general, point estimates (usually mean, median, or mode) and credible
intervals are appropriate when they convey much the same information as would
be gained from examining the whole posterior distribution. For example, if the
posterior distribution is symmetric and with a small variance, its mean is a good
summary of the entire distribution.
95
3.3 Answers: Inferring a Common Rate
Exercise 3.3.1 Try the data k1 = 14, n1 = 20, k2 = 16, n2 = 20. How could you
report the inference about the common rate θ?
One reasonable reporting strategy here might be to use a measure for
central tendency, such as a mean, median, or mode, together with a credible
interval, for instance a 95% credible interval.
Exercise 3.3.2 Try the data k1 = 0, n1 = 10, k2 = 10, n2 = 10. What does this
analysis infer the common rate θ to be? Do you believe the inference?
The analysis wants you to believe that the most plausible value for the
common rate is around 0.5. This example highlights that the posterior distributions generated by a Bayesian analysis are conditional on the truth of the
observed data, and of the model. If the model is wrong in an important way,
the posteriors will be correct for that model, but probably not useful for the real
problem. If a single rate really did underly k1 = 0 and k2 = 10 then the rate
must be near a half, since it is the most likely way to generate those data. But
the basic assumption of a single rate seems problematic. The data suggest that
a rate of 0.5 is one of the least plausible values. Perhaps the data are generated
by two different rates, instead of one common rate.
Exercise 3.3.3 Compare the data sets k1 = 7, n1 = 10, k2 = 3, n2 = 10 and
k1 = 5, n1 = 10, k2 = 5, n2 = 10. Make sure, following on from the previous
question, that you understand why the comparison works the way it does.
The results for these data sets will be exactly the same. Because the model
assumes a common rate, both data sets can in fact be re-described as having
k = k1 + k2 = 10, n = n1 + n2 = 20.
3.4 Answers: Prior and Posterior Prediction
Exercise 3.4.1 Make sure you understand the prior, posterior, prior predictive
and posterior predictive distributions, and how they relate to each other
(e.g., why is the top panel of Figure 3.9 a line plot, while the bottom panel is
a bar graph?). Understanding these ideas is a key to understanding Bayesian
analysis. Check your understanding by trying other data sets, varying both
k and n.
Line plots are for continuous quantities (e.g., rate parameter θ) and bar plots are
96
for discrete quantities (e.g., success counts of data).
Exercise 3.4.2 Try different
priors on θ, by changing θ ∼ Beta 1, 1 to
θ ∼ Beta 10, 10 , θ ∼ Beta 1, 5 , and θ ∼ Beta 0.1, 0.1 . Use the figures
produced to understand the assumptions these priors capture, and how they
interact with the same data to produce posterior inferences and predictions.
One of the nice properties of using the θ ∼ Beta α, β prior distribution
for a rate θ, is that it has a natural interpretation. The α and β values can be
thought of as counts of “priorsuccesses” and “prior failures”, respectively. This
means, using a θ ∼ Beta 3, 1 prior corresponds to having the prior information
that 4 previous observations have been made, and3 of them were successes. Or,
more elaborately, starting with a θ ∼ Beta 3, 1 is the same as starting with
a θ ∼ Beta 1, 1 , and then seeing data giving two more successes (i.e., the
posterior distribution in the second scenario will be same as the prior distribution
in the first). As always in Bayesian analysis, inference starts with prior information, and updates that information—by changing the probability distribution
When a type of likelihood function (in this case, the Binomial) does not change
the type of distribution (in this case, the Beta) going from the posterior to
the prior, they are said to have a “conjugate” relationship. This is valued a lot
in analytic approaches to Bayesian inference, because it makes for tractable
calculations. It is not so important for that reason in computational approaches,
because sampling methods can handle easily much more general relationships
between parameter distributions and likelihood functions. But conjugacy is still
useful in computational approaches because of the natural semantics it gives in
setting prior distributions.
Exercise 3.4.3 Predictive distributions are not restricted to exactly the same
experiment as the observed data, but for any experiment where the inferred
model parameters make predictions. In the current simple Binomial setting,
for example, predictive distributions could be found by a new experiment
with n0 6= n observations. Change the graphical model, and Matlab or R
code, to implement this more general case.
The script Rate 4 answer.txt implements the modified graphical model.
# Prior and Posterior Prediction
model{
# Observed Data
k ~ dbin(theta,n)
# Prior on Rate Theta
theta ~ dbeta(1,1)
# Posterior Predictive
postpredk ~ dbin(theta,npred)
97
# Prior Predictive
thetaprior ~ dbeta(1,1)
priorpredk ~ dbin(thetaprior,npred)
}
Exercise 3.4.4 In October 2009, the Dutch newspaper “Trouw” reported on
research conducted by H. Trompetter, a student from the Radboud University in the city of Nijmegen. For her undergraduate thesis, Trompetter had
interviewed 121 older adults living in nursing homes. Out of these 121 older
by their fellow residents. Trompetter confidently rejected the suggestion that
her study may have been too small to draw reliable conclusions: “If I had
talked to more people, the result would have changed by one or two percent
at the most.” Is Trompetter correct? Use the code Rate 4.m or Rate 4.R, by
changing the dataset variable, to find the prior and posterior predictive for
the relevant rate parameter and bullying counts. Based on these distributions,
do you agree with Trompetter’s claims?
The 95% credible interval on the predicted number of bullied elderly (out
of a total of 121) ranges from approximately (depending on sampling) 13 to 38.
This means that the percentage varies from 13/121 ≈ 10.7% to 38/121 ≈ 31.4%.
Exercise 3.5.1 Why is the posterior distribution in the left panel inherently
one dimensional, but the posterior predictive distribution in the right panel
inherently two-dimensional?
There is only one parameter, the rate θ, but there are two data, the success
counts k1 and k2 .
Exercise 3.5.2 What do you conclude about the descriptive adequacy of the
model, based on the relationship between the observed data and the posterior
predictive distribution?
The posterior predictive mass, shown by the squares, is very small for the actual
outcome of the experiment, shown by the cross. The posterior prediction is
concentrated on outcomes (around k1 = k2 = 5) that are very different from the
data, and so the model does not seem descriptively adequate.
Exercise 3.5.3 What can you conclude about the parameter θ?
If the model is a good one, the posterior distribution for θ indicates that it is
somewhere between about 0.2 and 0.8, and most likely around 0.5. But, it seems
98
unlikely the model is a good one, and so it is not clear anything useful can be
Exercise 3.6.1 The basic moral of this example is that it is often worth thinking
about joint posterior distributions over model parameters. In this case
the marginal posterior distributions are probably misleading. Potentially
even more misleading are common (and often perfectly appropriate) point
estimates of the joint distribution. The red cross in Figure 3.13 shows the
expected value of the joint posterior, as estimated from the samples. Notice
that it does not even lie in a region of the parameter space with any posterior
mass. Does this make sense?
In general, it seems unhelpful to have a point summary that is not a plausible
terms of the goal of the point estimate. The mean in this example is trying to
minimize squared loss to the true value, and the possible values follow a curved
surface, causing it to lie in the interior. Another way to think about the location
of the mean is physically. It is the center of mass of the joint posterior (i.e., the
place where you would put your finger to make the curved scatterplot balance).
More mundanely, the expectation of the joint posterior is (by mathematical fact)
the combination of the expectations for each parameter taken independently.
Looking at the marginal posteriors, it is clear why the cross lies where it does.
Exercise 3.6.2 The green circle in Figure 3.13 shows an approximation to the
mode (i.e., the sample with maximum likelihood) from the joint posterior
samples. Does this make sense?
This estimate seems to be more useful, at least in the sense that it falls
on values that are plausible. In fact, it falls on the values with the highest density
in the (estimated) posterior. Think of it as sitting on top of the hill surface traced
out by the scatterplot. Nonetheless, it still seems unwise to try and summarize
the complicated and informative curved pattern shown by the joint posterior
scatterplot by a single set of values.
Exercise 3.6.3 Try the very slightly changed data k = {16, 18, 22, 25, 28}. How
does this change the joint posterior, the marginal posteriors, the expected
point, and the maximum likelihood point? If you were comfortable with the
mode, are you still comfortable?
The minor change to the data hardly affects the mean, but greatly shifts
99
the mode. This shows that the mode can be very sensitive to the exact information available, and is a non-robust summary in that sense. Metaphorically, the
hill traced out by the joint density scatterplot has a ‘ridge’ running along the top
that is very flat, and the single highest point can move a long way if the data are
altered slightly.
Exercise 3.6.4 If you look at the sequence of samples in WinBUGS, some autocorrelation is evident. The samples ‘sweep’ through high and low values in a
systematic way, showing the dependency of a sample on those immediately
preceding. This is a deviation from the ideal situation in which posterior
samples are independent draws from the joint posterior. Try thinning the
sampling, taking only every 100th sample, by setting nthin=100 in Matlab
or n.thin=100 in R. To make the computational time reasonable, reduce
the number of samples to just 500. How is the sequence of samples visually
different with thinning?
With thinning, the sequence of samples no longer shows the visual pattern
of autocorrelation, as resembles more of a ‘block’ than a ‘curve’. One colorful
description of the ideal visual appearance of samples is as a ‘fat hairy caterpillar’.
Thinning is needed in this example to achieve that type of visual appearance.
4
Inferences Involving Gaussian
Distributions
4.1 Answers: Inferring Means and Standard
Deviations
Exercise 4.1.1 Try a few data sets, varying what you expect the mean and
standard deviation to be, and how many data you observe.
As usual, posterior distributions become more peaked the more data you
observe. The posterior distribution for µ should be located around the sample average. Highly variable numbers lead to a low precision λ, that is, a high standard
deviation σ. Note that with many data points, you may estimate the standard
deviation σ quite accurately (i.e., the posterior for σ can be very peaked).
In fact, with an infinite number of data, the posterior distribution converges
to a single point. This happens independently of whether the standard deviation σ is large or small; for instance, after observing a large sequence of highly
variable data you can be relatively certain that the standard deviation is very high.
Exercise 4.1.2 Plot the joint posterior of µ and σ. Interpret the plot.
There is a tendency for the joint posterior to be U-shaped. This is because extreme values of µ are only plausible when σ is high.
Exercise 4.1.3 Suppose you knew the standard deviation of the Gaussian was 1.0,
but still wanted to infer the mean from data. This is a realistic question: For
example, knowing the standard deviation might amount to knowing the noise
associated with measuring some psychological trait using a test instrument.
The xi values could then be repeated measures for the same person, and their
mean the trait value you are trying to infer. Modify the WinBUGS script and
Matlab or R code to do this. What does the revised graphical model look like?
The script can be adjusted in several ways. The easiest is probably just to replace
the statement x[i] ∼ dnorm(mu,lambda) with x[i] ∼ dnorm(mu,1). In the
graphical model. This change means that the node for σ is now shaded, because
σ is no longer an unknown quantity that needs to be inferred.
Exercise 4.1.4 Suppose you knew the mean of the Gaussian was zero, but wanted
100
101
to infer the standard deviation from data. This is also a realistic question:
Suppose you know the error associated with a measurement is unbiased, so
its average or mean is zero, but you are unsure how much noise there is
in the instrument. Inferring the standard deviation is then a sensible way
to infer the noisiness of the instrument. Once again, modify the WinBUGS
script and Matlab or R code to do this. Once again, what does the revised
graphical model look like?
Again, the script can be adjusted in several ways. Again, the easiest is
probably just to replace the statement x[i] ∼ dnorm(mu,lambda) with x[i]
∼ dnorm(0,lambda). In the graphical model, this change means that the node
for µ is now shaded, because µ is no longer an unknown quantity that needs
to be estimated. Follow-up question: if you set µ to zero as suggested above,
WinBUGS still provides a trace plot for µ. Why?
Exercise 4.2.1 Draw posterior samples using the Matlab or R code, and reach
conclusions about the value of the measured quantity, and about the accuracies of the seven scientists.
The posterior distributions for most standard deviations are very skewed.
As a result, the posterior mean will be dominated by relatively low proportion of
extreme values. For this reason, it is more informative to look at the posterior
median. As expected, the first two scientists are pretty inept measurers and have
high estimates of sigma. The third scientist does better than the first two, but
also appears more inept than the remaining four.
Exercise 4.2.2 Change the graphical model in Figure 4.2 to use a uniform prior
over the standard deviation, as was done in Figure 4.1. Experiment with the
effect the upper bound of this uniform prior has on inference.
This exercise requires you to put a uniform distribution on sigma, so that the
code needs to read (for an upper bound of 100): sigma[i]∼ dunif(0,100).
Then lambda[i] <-1/pow(sigma[i],2). Note that this change also requires
of lambda, because now sigma is assigned a prior and lambda is calculated
deterministically from sigma, instead of the other way around.
When you make these changes you can see that the difference between
the scientists is reduced. To get a more accurate idea of what is going on you
may want to set the number of MCMC samples to 100,000 (and, optionally,
102
set a thinning factor to 10, so that only every tenth sample is recorded –
this reduces the autocorrelation in the MCMC chains). As before, posterior
median for the first scientist is largest, followed by that of numbers two and
three.
4.3 Answers: Repeated Measurement of IQ
Exercise 4.3.1 Use the posterior distribution for each person’s µi to estimate
their IQ. What can we say about the precision of the IQ test?
The posterior means for the mu parameters are very close to the sample
means. The precision is 1/σ 2 , and because the posterior for sigma is concentrated around 6 the posterior precision is concentrated around 1/36 ≈ 0.03.
Exercise 4.3.2 Now, use a more realistic prior assumption for the µi means. Theoretically, IQ distributions should have a mean of 100, and a standard deviation
of 15. This corresponds to having a prior of mu[i] ~
dnorm(100,.0044), in2
~
stead of mu[i] dunif(0,300), because 1/15 = 0.0044. Make this change in
the WinBUGS script, and re-run the inference. How do the estimates of IQ
given by the means change? Why?
Parameter mu[3] is now estimated to be around 150, which is 5 points
lower than the sample mean. Extremely high scores are tempered by the prior
expectation. That is, an IQ of 150 is much more likely, according to the prior,
than an IQ of 160. The same strong effect of the prior on inference is not evident
for the other people, because their IQ scores have values over a range for which
the prior is (relatively) flat.
Exercise 4.3.3 Repeat both of the above stages (i.e., using both priors on µi )
with a new, but closely related, data set that has scores of (94, 95, 96), (109,
110, 111), and (154, 155, 156). How do the different prior assumptions affect
IQ estimation for these data. Why does it not follow the same pattern as the
previous data?
The tempering effect of prior expectation has now disappeared, and even
under realistic prior assumptions the posterior means for mu are close to the
sample means. This happens because the data suggest that the test is very
accurate, and accurate data are more robust against the prior. One helpful way to
more accurately), and that extra information now overwhelms the prior. Notice
how this example shows that is is not necessarily more data that is needed to
103
remove the influence of priors, but rather more information. Often, of course,
the best way to get more information is to collect more data. But, another way
is to develop data that are more precisely measured, or in some other way more
informative.
5
Some Examples Of Basic Data
Analysis
Exercise 5.1.1 The second data set in the Matlab and R code is just the first
data set from Figure 5.2 repeated twice. Interpret the differences in the
posterior distributions for r for these two data sets.
With more data the posterior distribution for r becomes more peaked,
showing that there is less uncertainty about the true correlation coefficient when
Exercise 5.1.2 The current graphical model assumes that the values from the
two variables—the xi = (xi1 , xi2)—are observed with perfect accuracy. When
might this be a problematic assumption? How could the current approach be
extended to make more realistic assumptions?
Very often in psychology, as with all empirical sciences, data are not measured with arbitrary precision. Other than nominal or ordinal variables (gender,
color, occupation, and so on), most variables are measured imperfectly. Some,
like response time, might be quite precise, consistent with measurement in the
physical sciences. Others, like IQ, or personality traits, are often very imprecise.
The current model makes the assumption that these sorts of measurements are
perfectly precise. Since they are the basis for the correlation coefficient, the
inference understates the uncertainty, and could lead to conclusions that are too
confident, or otherwise inappropriate. The next section shows one approach to
extending the model to address this problem.
5.2 Answers: Pearson Correlation With Uncertainty
Exercise 5.2.1 Compare the results obtained in Figure 5.4 with those obtained
earlier using the same data in Figure 5.2, for the model without any account
of uncertainty in measurement.
104
105
The posterior distributions are (suprisingly, perhaps) quite similar.
Exercise 5.2.2 Generate results for the second data set, which changes σ2e = 10
for the IQ measurement. Compare these results with those obtained assuming
σ2e = 1.
These results are very different. Allowing the large (but perhaps plausibly
large, depending on the measurement instrument) uncertainty in the IQ data introduces large uncertainty into inference about the correlation coefficient. Larger
values are more likely, but all possible values, including negative correlations,
remain plausible. Note also that the expectation of this posterior is the same as
in the case where the uncertainty of measurement is low or non-existent. This is
a good example of the need to base inference on posterior distributions, rather
than point estimates.
Exercise 5.2.3 The graphical model in Figure 5.3 assumes the uncertainty for
each variable is known. How could this assumption be relaxed to the case
where the uncertainty is unknown?
Statistically, it is straightforward to extend the graphical model, making
the σ e variables into parameters with prior distributions, and allowing them to
be inferred from data. Whether the current data would be informative enough
about the uncertainty of measurement to allow helpful inference is less clear. It
might be that different sorts of data, like repeated measurements of the same
people’s IQs, are needed for this model to be effective. But is it straightforward
to implement.
Exercise 5.2.4 The graphical model in Figure 5.3 assumes the uncertainty for
each variable is the same for all observations. How could this assumption
be relaxed to the case where, for examples, extreme IQs are less accurately
measured than IQs in the middle of the standard distribution?
e
The basic statistical idea would be to model the σi2
variables, representing the ith persons error of measurement in their IQ score as a functions of µi2 ,
representing their IQ itself. This would express a relationship between where
people lie on the IQ scale, and how precisely their IQ can be measured. Whatever
relationship is chosen is itself a statistical model, formalizing assumptions about
this relationship, and so can have parameters that are given priors and inferred
from data.
106
5.3 Answers: The Kappa Coefficient of Agreement
Exercise 5.3.1 Influenza Clinical Trial Poehling et al. (2002) reported data
evaluating a rapid bedside test for influenza using a sample of 233 children
hospitalized with fever or respitory symptoms. Of the 18 children known to
have influenza, the surrogate method identified 14 and missed 4. Of the 215
children known not to have influenza, the surrogate method correctly rejected
210 but falsely identified 5. These data correspond to a = 14, b = 4, c = 5,
and d = 210. Plot posterior distributions of the interesting variables, and
reach a scientific conclusion. That is, pretend you are a consultant for the
clinical trial. What would your two- or three-sentence ‘take home message’
The surrogate method does a better job detecting the absence of influenza than
it does detecting the presence of influenza. The 95% Bayesian confidence interval
for kappa is (.51, .84), suggesting that the test is useful.
Exercise 5.3.2 Hearing Loss Assessment Trial Grant (1974) reported data from
a screening of a pre-school population intended to assess the adequacy of a
school nurse assessment of hearing loss in relation to expert assessment. Of
those children assessed as having hearing loss by the expert, 20 were correctly
identified by the nurse and 7 were missed. Of those assessed as not having
hearing loss by the expert, 417 were correctly diagnosed by the nurse but 103
were incorrectly diagnosed as having hearing loss. These data correspond to
a = 20, b = 7, c = 103, d = 417. Once again, plot posterior distributions of
the interesting variables, and reach a scientific conclusion. Once again, what
would your two- or three-sentence ‘take home message’ conclusion be to your
customers?
Compared to the expert, the nurse displays a bias to classify children as
having hearing loss. In addition, the nurse misses 7 out of 27 children with
hearing loss. The nurse is doing a poor job, and this is reflected in the 95%
credible interval for kappa of (approximately, up to sampling) (.12, .29).
Exercise 5.3.3 Rare Disease Suppose you are testing a cheap instrument for detecting a rare medical condition. After 170 patients have been screened, the
test results shower 157 did not have the condition, but 13 did. The expensive ground truth assessment subsequently revealed that, in fact, none of the
patients had the condition. These data correspond to a = 0, b = 0, c = 13,
d = 157. Apply the kappa graphical model to these data, and reach a conclusion about the usefulness of the cheap instrument. What is special about this
data set, and what does it demonstrate about the Bayesian approach?
107
Answers: Change Detection in Time Series Data
The posterior mean for kappa is approximately .05, with a 95% credible interval
of approximately (0, .24). The data are noteworthy because the disease has never
been observed, so there are two zero cells, and a zero column sum. This poses
a challenge for frequentist estimators. In order to deal with the problem of zero
counts a frequentist may add a “1” to each cell in the design, but this amounts
to fabricating data. An attractive property of the Bayesian approach is that it is
always possible to do the analysis.
5.4 Answers: Change Detection in Time Series Data
Exercise 5.4.1 Draw the posterior distributions for the change-point, the means,
and the common standard deviation.
When you look at the trace plots, you may see that it takes a few samples for the chains to lose their dependence on the initial value that was used
as a starting point. These initial values are non-representative outliers, and they
also stretch out the y-axis of the trace plots. In the call to bugs, set burn-in to
10 and observe the change. We discuss this issue in detail in the chapter on
convergence.
With respect to the posterior distributions, it is worthwhile to note that
the key parameter tau is estimated relatively precisely around 732. One of the
reasons for this is that µ1 and µ2 are relatively easy to tell apart.
Exercise 5.4.2 Figure 5.7 shows the mean of the posterior distribution for the
change-point (this is the point in time where the two horizontal lines meet).
Can you think of a situation in which such a plotting procedure can be
One case in which this procedure may be misleading is when the posterior
distribution is relatively wide (i.e., not peaked around its mean); in such a
situation, there is a lot of uncertainty about the location of the change-point,
and the plotting procedure, based on a point estimate, falsely suggests that the
location is determined precisely.
Exercise 5.4.3 Imagine that you apply this model to a data set that has two
change-points instead of one. What could happen?
In this case the model is seriously misspecified. The model assumes that
there are two regimes, but in reality there are three. One thing that could happen
is that the model groups together the two adjacent regimes that are most similar
108
to each other and treats them as one. The problems that result should be visible
from mismatches between the posterior predictive distribution and the data.
Exercise 5.5.1 Do you think Cha Sa-soon could have passed the test by just
guessing?
It is unlikely that Cha Sa-soon was just guessing. First, the posterior distribution for theta is relatively peaked around .40, whereas chance performance
in a four-choice task is only 0.25. Second, the probability of scoring 30 or more
correct answers when guessing equals .00000016 (in R: 1-pbinom(29,50,.25)).
With this success probability, the number of attempts to pass the exam follows
a geometric distribution. Therefore we know that when guessing, the average
number of attempts equals 1/.00000016 ≈ 6, 097, 561, considerably more than
Cha Sa-soon required. The probability of guessing and “only” needing 950
attempts is a relatively low .00016 (in R: pgeom(950, prob=.00000016)).
In contrast, with a theta of .4 the the probability of scoring 30 or more correct
answers equals .0034 (in R: 1-pbinom(29,50,.40)). With this probability, the
associated expected number of attempts until success is 294, and the probability
of passing the exam and “only” needing 950 attempts is a relatively high 0.96.
Exercise 5.5.2 What happens when you increase the interval in which you know
the data are located, from 15–25 to something else?
Increasing the interval increases the posterior uncertainty for theta.
Exercise 5.5.3 What happens when you decrease the number of failed attempts?
When the number of failed attempts becomes low (say 20), the posterior
for theta becomes wider and shifts to values that are somewhat higher.
Exercise 5.5.4 What happens when you increase Cha Sa-soon’s final score from
30?
Not that much! Apparently, the extra information about Cha Sa-soon’s final score is much less informative than the knowledge that she had failed 949
times (and with scores ranging from 15 to 25).
Exercise 5.5.5 Do you think the assumption that all of the scores follow a Binomial distribution with a single rate of success is a good model for these data?
109
It seems to be a poor model. The chance of the same underlying rate
generating all of the censored scores below 25, and then producing the 30, can
be calculated according to the model, and is tiny. Alternative models would
assume some sort of change in the underlying rate. This could psychologically
correspond to learning, for at least some of the problems in the test, at some
point in the sequence of 950 attempts.
6
Exams, Quizzes, Latent Groups, and
Missing Data
Exercise 6.1.1 Draw some conclusions about the problem from the posterior
distribution. Who belongs to what group, and how confident are you?
Inspection of the z[i] nodes confirms that the first five people are confidently assigned to the guessing group, and the remaining ten people are
confidently assigned to the knowledge group. The high confidence is clear from
the fact that the posteriors for z[i] are approximately located either at 0 or 1.
Exercise 6.1.2 The initial allocations of people to the two groups in this code is
random, and so will be different every time you run it. Check that this does
not affect the final results from sampling.
This is easily done by running the code again a few times.
Exercise 6.1.3 Include an extra person in the exam, with a score of 28 out of 40.
What does their posterior for z tell you?
Performance of the new participant is completely ambiguous. The posterior for z is approximately 0.5, indicating that this participant is as likely to
belong to the guessing group as they are to belong to the knowledge group.
Exercise 6.1.4 What happens if you change the prior on the success rate of the
second group to be uniform over the whole range (0, 1), and so allow for
worse-than-guessing performance?
Nothing much. While the original prior assumption makes more sense,
since it captures information we have about the problem, the data are sufficiently
informative that inference is not significantly affected by this change.
Exercise 6.1.5 What happens if you change the initial expectation that everybody is equally likely to belong to either group, and have an expectation that
people generally are not guessing, with (say), zi ∼ Bernoulli 0.9 ?
110
111
Answers: Exam Scores With Individual Differences
This expectation does not change the classification of the first 15 participants much, because these participants are unambiguous in terms of their
performance. However, the new participant with a score of 28 is inferred to be in
the knowledge group with probability 0.9, whereas this was 0.5 before. Because
the data for this participant are ambiguous it is the prior expectation that largely
determines how this participant is classified.
6.2 Answers: Exam Scores With Individual Differences
Exercise 6.2.1 Compare the results of the hierarchical model with the original
model that did not allow for individual differences.
For the first 15 participants the results are essentially unchanged. The
new participant with a score of 28 is now inferred to be in the knowledge group
with probability 0.8, compared to the original 0.5. This happens because the new
participant is more likely to be a low-knowledge member of the knowledge group
than a member of the guessing group. The fact that the current model allows
for individual differences helps it account for the relatively low score of 28.
Exercise 6.2.2 Interpret the posterior distribution of the variable predphi. How
does this distribution relate to the posterior distribution for mu?
The variable predphi is based on a draw from a Gaussian distribution
with mean mu and standard deviation sigma. This predictive distribution
indicates what we can expect about the success rate of a new, as yet unobserved
participant from the knowledge group. If there were no individual differences,
sigma would be zero and draws for predphi are effectively draws from the
posterior of mu. But because there are individual differences, this adds uncertainty
to what we can expect for a new participant and hence the posterior distribution
for predphi is wider than that of mu.
Exercise 6.2.3 What does the posterior distribution for the variable theta[1,2]
mean?
Participant 1 clearly belongs to the guessing group. In the samples, this
participant is almost never assigned to the knowledge group. The value for
theta[1,2] is therefore not directly informative. It is like saying “theta[1,2]
is what we expect the success rate to be if participant 1 was in the knowledge
group.” Perhaps the only sense in which this is useful information is if it is
conceived as hypothetically how the participant would have performed if they
had listened in class and put themselves in the knowledge group.
112
Exercise 6.2.4 In what sense could the latent assignment of people to groups in
this case study be considered a form of model selection?
There are two rival explanations, specified as statistical models, for the
data. These are the guessing or knowledge-based responding accounts. When a
participant is assigned to the guessing group this means that, for that particular
participant, we believe the guessing model is a better explanation for the observed
data than is the knowledge-based model.
Exercise 6.3.1 Draw some conclusions about how well the various people listened,
and about the difficulties of the various questions. Do the marginal posterior distributions you are basing your inference on seem intuitively reasonable?
This question is best answered by tallying the total number of correct responses separately for each participant and for each item. The result show that,
first, participants who answer most items correctly have the highest estimated
values for p, and, second, items answered correctly most often have the highest
estimated values for q. These results are consistent with intuition.
Exercise 6.3.2 Now suppose that three of the answers were not recorded. Think
of a Scantron1 with coffee spilled on it being eaten by a dog. This gives the
new data set shown in Table 6.2. Bayesian inference will automatically make
predictions about these missing values (i.e., “fill in the blanks”) by using
the same probabilistic model that generated the observed data. Missing data
are entered as nan (“not a number”) in Matlab, and NA (“not available”) in
R or WinBUGS. Including the variable k as one to monitor when sampling
will then provide posterior values for the missing values. That is, it provides
information about the relative likelihood of the missing values being each of
the possible alternatives, using the statistical model and the available data.
Look through the Matlab or R code to see how all of this is implemented in
the second dataset. Run the code, and interpret the posterior distributions
for the three missing values. Are they reasonable inferences? Finally, think of
a more realistic application for inferring missing values in cognitive modeling
than dogs eating coffee flavored Scantrons.
The estimates are reasonable. One of the nice things about Bayesian inference is that, given that the model is appropriate, the estimates are always
reasonable. Sometimes the reasonableness may be hidden from your intuition,
1
113
but this just means that your intuition was faulty. Consider person 1 on item
M. We know that person 1 is relatively attentive, because they answer relatively
many questions, and we also know that item M is relatively easy, because many
other people answer item M correctly. This item-person combination looks like it
could have resulted in a correct answer. The inferred probability for M-1 being
correct is approximately 0.74. For item-person combination E-8 the reverse holds.
The item is difficult and the participant is inattentive. Consequently, the inferred
probability for E-8 being correct is approximately 0.01. Finally, combination R-10
is middle-of-the-road on both dimensions, and the inferred probability for it being
correct is 0.41.
These inferred probabilities are directly related to the knowledge about
each participant and item. In fact, if you multiply the estimated p’s and q’s
you can recover the inferred probabilities. For instance, p[1] is approximately
0.88, and q[13] (the M) is approximately 0.84. The multiplication of these
probabilities yields .74. The same calculations may be performed for the other
missing item-participant calculations.
The last part of the question deals with plausible scenarios for missing
data in cognitive modeling. Participants in speeded response time experiments
may not answer fast enough, in designs where the trial is terminated without a
response and the next trial is presented; people answering a questionnaire may
fail to answer a specific question because they overlooked it; participants in
to the removal of the trial; participants in longitudinal experiments may quit the
study prematurely; participants who are asked to monitor some aspect of their
lives at regular intervals over several days may sometimes forget to register their
response, and so on and so forth.
In general, data can be missing in several ways. When data are missing
“completely at random” it is easiest handled. It is more difficult when there
is a relationship between the missing-ness and the parameters of interest. For
example, say person 10 does not complete item Q because he or she realizes
the answer may well be wrong and time is better spend answering easier items.
To handle this situation we need more complicated models of the missing-ness.
Bayesian models, of course.
6.4 Answers: The Two Country Quiz
Exercise 6.4.1 Interpret the posterior distributions for x[i], z[j], alpha and
beta. Do the formal inferences agree with your original intuitions?
114
Yes, people 1, 2, 5, and 6 form one group, and people 3, 4, 7, and 8
form the other group. The model also groups together questions A, D, E, and H
versus questions B, C, F, and G. And the rates of correct decisions for matched
and mismatched groups make sense, too.
Exercise 6.4.2 The priors on the probabilities of answering correctly capture
knowledge about what it means to match and mismatch, by imposing an
order constrain α ≥ β. Change the code so that this information is not
included, by using priors alpha∼dbeta(1,1) and beta∼dbeta(1,1). Run a
few chains against the same data, until you get an inappropriate, and perhaps
counter-intuitive, result. Describe the result, and discuss why it comes about.
The result you get from the analysis with uniform prior can change from chain
to chain, switching the inferences about alpha and beta. This is a basic and
common problem for mixture models, known as model indeterminacy. The
probability α is used whenever xi = zj . If this corresponds to a Thai person
answering a Thai question, then α should be high, as we expect. But there is
nothing stopping the model, without the order constraint, from coding Thai
people as xi = 1 and Moldovan questions as zj = 1, in which case α will
be low. Effectively, with this coding, α and β will swap roles. Overall, there
are four possibilities (two ways people can be encoded, by two ways questions
can be encoded). Our semantics of α being knowledge-based and β being
ignorance-based will apply for 2 of these 4 possible encodings, but will be
reversed for the other two. The core problem is that α and β are statistically
defined the same way in the revised model. This is the indeterminacy. A practical
but inelegant way to solve this problem is by being flexible in interpretation.
A better way is, as per the original model and code, by defining the statistical
model itself more carefully, introducing the order constraint, and removing the
indeterminacy.
Exercise 6.4.3 Now suppose that three extra people enter the room late, and
begin to take the quiz. One of them (Late Person 1) has answered the first
four questions, the next (Late Person 2) has only answered the first question,
and the final new person (Late Person 3) is still sharpening their pencil, and
has not started the quiz. This situation can be represented as an updated
data set, now with missing data, as in Table 6.4. Interpret the inferences the
model makes about the nationality of the late people, and whether or not
they will get the unfinished questions correct.
Late person 1 is confidently placed in the same category as people 1, 2,
5, and 6. This is also reflected in the probabilities of answering the remaining
four questions correctly: 0.88, 0.07, 0.05, 0.90, predicting a “1 0 0 1” pattern
that was also observed for people 1, 2, 5, and 6.
Late person 2 only answered a single question, but this information suf-
115
Answers: Latent Group Assessment of Malingering
fices to assign this person with probability .89 to the same category as persons
3, 4, 7, and 8. This is reflected in the probabilities of answering the remaining
seven questions correctly: .80, .80, .15, .15, .80, .80, .13 predicting a “1 1 0 0 1
1 0” pattern that was relatively typical for people 3, 4, 7, and 8.
Late person 3 did not answer a single question and is equally likely to belong to either group. Because each group has an opposite pattern of answering
any particular question correctly, the model predicts that the performance of
late person 3 will be around chance (slightly worse than chance because not all
questions are answered correctly even if the question matches the nationality).
Exercise 6.4.4 Finally, suppose that you are now given the correctness scores for a
set of 10 new people, whose data were not previously available, but who form
part of the same group of people we are studying. The updated data set shown
in Table 6.5. Interpret the inferences the model makes about the nationality of
the new people. Revisit the inferences about the late people, and whether or not
they will get the unfinished questions correct. Does the inference drawn by the
model for the third late person match your intuition? There is a problem here.
How could it be fixed?
The new people are all classified in the same group as people 1, 2, 5, and
6. However, late person 3 is still equally likely to be classified in either group.
This is a problem in the sense that the model is insensitive to changes in baseline
proportions: if we know that there are many more people in one category than
another this knowledge should affect our prediction for late person 3. In this
case, late person 3 is likely to belong to the same category as the new persons.
The model can be extended to deal with baseline proportions by changing
the line pz[i] ∼ dbern(0.5), which assumes equal baselines, to pz[i] ∼
dbern(phi), and phi ∼ dbeta(1,1), which now estimates the baseline.
6.5 Answers: Latent Group Assessment of
Malingering
The expectation of the posterior for the indicator variable z shows everybody to have (essentially) the value 0 or 1. This means there is certainly
of the classification into the bona fide and malingering classifications, with 0
corresponding to bona fide and 1 to malingering. The first 10 people, as expected,
are bona fide. The rest, with two exceptions, are inferred to be malingerers. The
116
two exceptions are the 14th and 15th people, who scored 44. It seems likely these
people did not follow the instruction to malinger.
6.6 Answers: Individual Differences in Malingering
Exercise 6.6.1 Assume you know that the base rate of malingering is 10%. Change
the WinBUGS script to reflect this knowledge. Do you expect any differences?
The change can be made by changing the definition of the indicator variables to z[i]∼dbern(0.1), This base rate is different from the one that is
inferred for φ, which has posterior expectation of about 0.5, so it is reasonable to
expect inference to be affected. Specifically, when the base rate of malingering is
very low we should be less confident about the classification of participants who
performed poorly.
Exercise 6.6.2 Assume you know for certain that participants 1, 2, and 3 are
bona fide. Change the code to reflect this knowledge.
This change can be made by defining z[1] <- 0, z[2] <- 0 and z[3]
<- 0. It is important, once this change is made, to be sure that initial values are
not set for these three parameters, since they are no longer stochastic variables
to be inferred. This is not always straightforward, in terms of the data structures
being passed to and from WinBUGS from Matlab or R. An effective solution is
to define z[4] <- ztmp[1] to z[22] <- ztmp[19], and set the initial values
directly on the complete ztmp vector.
Exercise 6.6.3 Suppose you add a new participant. What number of questions
A little trial-and-error experimentation finds that an extra score of 41 questions
correct leads to a posterior expectation of about 0.45. This is more uncertain
than the neighboring possibilities of 40 questions correct, with posterior expectation about 0.8, and 42 questions correct, with posterior expectation about 0.15.
Exercise 6.6.4 Try to solve the label-switching problem by using the
dunif(0,mu bon) approach to specifying a joint prior instead of the
logit transform.
To be written
Exercise 6.6.5 Why do you think the priors for λb and λm are different?
117
Conceptually, it might be reasonable to expect less variability for the bona
fide group. Doing the task correctly seems more constraining than the freedom
to simulate amnesia and malinger. Computationally, the lack of variability in the
bona fide scores can cause under- and over-flow issues in sampling, and limiting
the priors to reasonable and computational ranges is a practical approach to
6.7 Answers: Alzheimer’s Recall Test Cheating
Exercise 6.7.1 Suppose the utilities are very different, so that a false-alarm
costs \$100, because of the risk of litigation in a false accusation, but misses
are relatively harmless, costing \$10 in wasted administrative costs. What
If it is 10 times more important not to make false-alarms (i.e., \$100 vs
\$10), then you should only treat someone as a cheater if that is 10 times more
likely. This means the expectation of zi should be less than 0.1 before the
decision is made to classify them as having cheated. It is clear from Figure 6.9
that his applies only to the few people who scored 39 or 40 on the test.
Exercise 6.7.2 What other potential information, besides the uncertainty about
classification, does the model provide? Give at least one concrete example.
The model provides information about the base rate of cheating, as well
as information about the levels of performance, and variability in that performance, for the different groups of people, By providing a complete model of
the data-generating process, Bayesian inference is able to provide a much more
complete analysis of the data than a simple set of classifications.
References
Agresti, A. (1992). Modelling patterns of agreement and disagreement. Statistical
Methods in Medical Research, 1, 201–218.
Anscombe, F. J. (1963). Sequential medical trials. Journal of the American Statistical Association, 58, 365–383.
Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A
review of interrater agreement measures. The Canadian Journal of Statistics,
27, 3–23.
Basu, S., Banerjee, M., & Sen, A. (2000). Bayesian inference for kappa from single
and multiple studies. Biometrics, 56, 577–582.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social
Psychology, 100, 407–425.
Brooks, S. P., & Gelman, A. (1997). General methods for monitoring convergence
of iterative simulations. Journal of Computational and Graphical Statistics,
7, 434–455.
Casella, G., & George, E. I. (1992). Explaining the gibbs sampler. The American
Statistician, 46, 167–174.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20, 37–46.
de Finetti, B. (1974). Theory of Probability, Vol. 1 and 2. New York: John Wiley
& Sons.
Donner, A., & Wells, G. (1986). A comparison of confidence interval methods for
the intraclass correlation coefficient. Biometrics, 42 (2), 401–412.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and
proportions (Third ed.). New York: Wiley.
Gamerman, D., & Lopes, H. F. (2006). Markov chain Monte Carlo: Stochastic
simulation for Bayesian inference. Boca Raton, FL: Chapman & Hall/CRC.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating
marginal densities. Journal of the American Statistical Association, 85, 398–
409.
Gelman, A. (1996). Inference and monitoring convergence. In W. R. Gilks,
S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in
practice (pp. 131–143). Boca Raton (FL): Chapman & Hall/CRC.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical
models. Bayesian Analysis, 1, 515–534.
119
120
References
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using
multiple sequences (with discussion). Statistical Science, 7, 457–472.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the
Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6, 721–741.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain
Monte Carlo in practice. Boca Raton (FL): Chapman & Hall/CRC.
Grant, J. A. (1974). Evaluation of a screening program. American Journal of
Public Health, 64, 66–71.
Griffiths, T. L., Kemp, C., & Tenenbaum, J. B. (2008). Bayesian models of cognition. In R. Sun (Ed.), Cambridge Handbook of Computational Cognitive
Modeling (pp. 59–100). Cambridge, MA: Cambridge University Press.
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge, UK:
Cambridge University Press.
Kraemer, H. C. (1992). Measurement of reliability for categorical data in medical
research. Statistical Methods in Medical Research, 1, 183–199.
Kraemer, H. C., Periyakoil, V. S., & Noda, A. (2004). Kappa coefficients in medical research. In R. B. D’Agostino (Ed.), Tutorials in biostatistics volume 1:
Statistical methods in clinical studies. New York: Wiley.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for
categorical data. Biometrics, 33, 159–174.
Lindley, D. V. (1972). Bayesian Statistics, A Review. Philadelphia (PA): SIAM.
Lunn, D. J., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project:
Evolution, critique and future directions. Statistics in Medicine, 28, 3049–
3067.
Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS – a
Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms.
Cambridge: Cambridge University Press.
Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of
Mathematical Psychology, 47, 90–100.
Poehling, K. A., Griffin, M. R., & Dittus, R. S. (2002). Bedside diagnosis of
influenzavirus infections in hospitalized children. Pediatrics, 110, 83–88.
Poirier, D. J. (2006). The growth of bayesian methods in statistics and economics
since 1970. Bayesian Analysis, 1, 969–980.
Sheu, C.-F., & O’Curry, S. L. (1998). Simulation–based Bayesian inference using
BUGS. Behavioral Research Methods, Instruments, & Computers, 30, 232–
237.
Shrout, P. E. (1998). Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research, 7, 301–317.
121
References
Smithson, M. (2010). A review of six introductory texts on Bayesian methods.
Journal of Educational and Behavioral Statistics, 35, 371–374.
Uebersax, J. S. (1987). Diversity of decision-making models and the measurement
of interrater agreement. Psychological Bulletin, 101, 140–146.
Index
0-1 loss, 37
autocorrelation, 46
Bayes Rule, 3
burn in, 8
censored data, 61
chain, 8
change detection, 59
confidence interval, 4
conjugacy, 6
conjugate priors, 34
convergence, 8
correlation coefficient, 52
credible interval, 4, 9, 37
deterministic variable, 35
distribution
beta, 33, 34
binomial, 15
categorical, 44
Gaussian, 47
multinomial, 58
multivariate Gaussian, 52
uniform, 33
improper distribution, 49
joint distribution, 44, 45
kappa coefficient, 56
label switching, 79
linear loss, 37
marginal distribution, 45
marginal likelihood, 3
maximum a posteriori (MAP), 37
maximum likelihood estimate (MLE), 4, 37
MCMC, 7
median, 37
mixing, 9
model indeterminacy, 79
partially obseved, 62
plates, 37
posterior distribution, 3, 40
posterior expectation, 37
posterior mean, 37
posterior mode, 37
123
posterior predictive, 5
posterior predictive checking, 42
posterior predictive distribution, 40
prediction, 5
prior distribution, 3, 39
prior predictive distribution, 39