Introduction to Bayesian Analysis Procedures SAS/STAT User’s Guide

SAS/STAT® 9.2 User’s Guide
Introduction to Bayesian Analysis Procedures
(Book Excerpt)
SAS® Documentation
This document is an individual chapter from SAS/STAT® 9.2 User’s Guide.
The correct bibliographic citation for the complete manual is as follows: SAS Institute Inc. 2008. SAS/STAT® 9.2
User’s Guide. Cary, NC: SAS Institute Inc.
Copyright © 2008, SAS Institute Inc., Cary, NC, USA
All rights reserved. Produced in the United States of America.
For a Web download or e-book: Your use of this publication shall be governed by the terms established by the vendor
at the time you acquire this publication.
U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation
by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19,
Commercial Computer Software-Restricted Rights (June 1987).
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
1st electronic book, March 2008
2nd electronic book, February 2009
SAS® Publishing provides a complete selection of books and electronic products to help customers use SAS software to
its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit the
SAS Publishing Web site at support.sas.com/publishing or call 1-800-727-3228.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute
Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
Chapter 7
Introduction to Bayesian Analysis
Procedures
Contents
Overview
Introduction
Background in Bayesian Statistics
    Prior Distributions
    Bayesian Inference
    Bayesian Analysis: Advantages and Disadvantages
    Markov Chain Monte Carlo Method
    Assessing Markov Chain Convergence
    Summary Statistics
A Bayesian Reading List
    Textbooks
    Tutorial and Review Papers on MCMC
References
Overview
SAS/STAT software provides Bayesian capabilities in four procedures: GENMOD, LIFEREG,
MCMC, and PHREG. The GENMOD, LIFEREG, and PHREG procedures provide Bayesian analysis in addition to the standard frequentist analyses they have always performed. Thus, these procedures provide convenient access to Bayesian modeling and inference for generalized linear models,
accelerated life failure models, Cox regression models, and piecewise constant baseline hazard models (also known as piecewise exponential models). The MCMC procedure is a general procedure
that fits Bayesian models with arbitrary priors and likelihood functions.
This chapter provides an overview of Bayesian statistics; describes specific sampling algorithms
used in these four procedures; and discusses posterior inference and convergence diagnostics computations. Sources that provide in-depth treatment of Bayesian statistics can be found at the end
of this chapter, in the section “A Bayesian Reading List” on page 173. Additional chapters contain syntax, details, and examples for the individual procedures GENMOD (see Chapter 37, “The
GENMOD Procedure”), LIFEREG (see Chapter 48, “The LIFEREG Procedure”), MCMC (see
Chapter 52, “The MCMC Procedure (Experimental)”), and PHREG (see Chapter 64, “The PHREG
Procedure”).
Introduction
The most frequently used statistical methods are known as frequentist (or classical) methods. These
methods assume that unknown parameters are fixed constants, and they define probability by using
limiting relative frequencies. It follows from these assumptions that probabilities are objective and
that you cannot make probabilistic statements about parameters because they are fixed. Bayesian
methods offer an alternative approach; they treat parameters as random variables and define probability as “degrees of belief” (that is, the probability of an event is the degree to which you believe
the event is true). It follows from these postulates that probabilities are subjective and that you can
make probability statements about parameters. The term “Bayesian” comes from the prevalent usage of Bayes’ theorem, which was named after the Reverend Thomas Bayes, an eighteenth century
Presbyterian minister. Bayes was interested in solving the question of inverse probability: after
observing a collection of events, what is the probability of one event?
Suppose you are interested in estimating $\theta$ from data $y = \{y_1, \ldots, y_n\}$ by using a statistical model
described by a density $p(y \mid \theta)$. Bayesian philosophy states that $\theta$ cannot be determined exactly,
and uncertainty about the parameter is expressed through probability statements and distributions.
You can say that $\theta$ follows a normal distribution with mean 0 and variance 1, if it is believed that
this distribution best describes the uncertainty associated with the parameter. The following steps
describe the essential elements of Bayesian inference:
1. A probability distribution for $\theta$ is formulated as $\pi(\theta)$, which is known as the prior distribution, or just the prior. The prior distribution expresses your beliefs (for example, on the
mean, the spread, the skewness, and so forth) about the parameter before you examine the
data.
2. Given the observed data $y$, you choose a statistical model $p(y \mid \theta)$ to describe the distribution
of $y$ given $\theta$.
3. You update your beliefs about $\theta$ by combining information from the prior distribution and the
data through the calculation of the posterior distribution, $p(\theta \mid y)$.
The third step is carried out by using Bayes’ theorem, which enables you to combine the prior
distribution and the model in the following way:
$$p(\theta \mid y) = \frac{p(\theta, y)}{p(y)} = \frac{p(y \mid \theta)\,\pi(\theta)}{p(y)} = \frac{p(y \mid \theta)\,\pi(\theta)}{\int p(y \mid \theta)\,\pi(\theta)\,d\theta}$$
The quantity
$$p(y) = \int p(y \mid \theta)\,\pi(\theta)\,d\theta$$
is the normalizing constant of the posterior distribution. This quantity $p(y)$ is also the marginal
distribution of $y$, and it is sometimes called the marginal distribution of the data. The likelihood
function of $\theta$ is any function proportional to $p(y \mid \theta)$; that is, $L(\theta) \propto p(y \mid \theta)$. Another way of
writing Bayes’ theorem is as follows:
$$p(\theta \mid y) = \frac{L(\theta)\,\pi(\theta)}{\int L(\theta)\,\pi(\theta)\,d\theta}$$
The marginal distribution $p(y)$ is an integral. As long as the integral is finite, the particular value
of the integral does not provide any additional information about the posterior distribution. Hence,
$p(\theta \mid y)$ can be written up to an arbitrary constant, presented here in proportional form as:
$$p(\theta \mid y) \propto L(\theta)\,\pi(\theta)$$
Simply put, Bayes’ theorem tells you how to update existing knowledge with new information. You
begin with a prior belief $\pi(\theta)$, and after learning information from data $y$, you change or update
your belief about $\theta$ and obtain $p(\theta \mid y)$. These are the essential elements of the Bayesian approach
to data analysis.
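The update from prior to posterior can be illustrated numerically. The following is a minimal sketch, not part of any SAS procedure: it assumes a binomial likelihood, a flat prior, and hypothetical data (7 successes in 10 trials), and it approximates the normalizing constant with a Riemann sum over a grid of parameter values. The function name `grid_posterior` is introduced here only for illustration.

```python
def grid_posterior(y, n, grid_size=1001):
    """Approximate p(theta | y) on an evenly spaced grid over [0, 1].

    Assumptions (illustrative only): binomial likelihood, flat prior.
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    step = thetas[1] - thetas[0]
    prior = [1.0 for _ in thetas]                         # pi(theta) proportional to 1
    likelihood = [t**y * (1.0 - t)**(n - y) for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    normalizer = sum(unnormalized) * step                 # Riemann sum for the integral
    posterior = [u / normalizer for u in unnormalized]
    return thetas, posterior, step


# Hypothetical data: 7 successes in 10 trials.
thetas, posterior, step = grid_posterior(y=7, n=10)
posterior_mean = sum(t * p for t, p in zip(thetas, posterior)) * step
print(round(posterior_mean, 3))
```

Under these assumptions the exact posterior is a beta(8, 4) distribution, so the grid estimate of the posterior mean can be checked against 8/12.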
In theory, Bayesian methods offer simple alternatives to statistical inference—all inferences follow
from the posterior distribution $p(\theta \mid y)$. In practice, however, you can obtain the posterior distribution
with straightforward analytical solutions only in the most rudimentary problems. Most Bayesian
analyses require sophisticated computations, including the use of simulation methods. You generate
samples from the posterior distribution and use these samples to estimate the quantities of interest.
PROC MCMC uses a self-tuning Metropolis algorithm (see the section “Metropolis and Metropolis-Hastings Algorithms” on page 152). The GENMOD, LIFEREG, and PHREG procedures use the
Gibbs sampler (see the section “Gibbs Sampler” on page 154). An important aspect of any analysis
is assessing the convergence of the Markov chains. Inferences based on nonconverged Markov
chains can be both inaccurate and misleading.
Both Bayesian and classical methods have their advantages and disadvantages. From a practical
point of view, your choice of method depends on what you want to accomplish with your data
analysis. If you have prior information (either expert opinion or historical knowledge) that you
want to incorporate into the analysis, then you should consider Bayesian methods. In addition, if
you want to communicate your findings in terms of probability notions that can be more easily understood by nonstatisticians, Bayesian methods might be appropriate. The Bayesian paradigm can
often provide a framework for answering specific scientific questions that a single point estimate
cannot sufficiently address. Alternatively, if you are interested only in estimating parameters based
on the likelihood, then numerical optimization methods, such as the Newton-Raphson method, can
give you very precise estimates and there is no need to use a Bayesian analysis. For further discussions of the relative advantages and disadvantages of Bayesian analysis, see the section “Bayesian
Analysis: Advantages and Disadvantages” on page 149.
Background in Bayesian Statistics
Prior Distributions
A prior distribution of a parameter is the probability distribution that represents your uncertainty
about the parameter before the current data are examined. Multiplying the prior distribution and
the likelihood function together leads to the posterior distribution of the parameter. You use the
posterior distribution to carry out all inferences. You cannot carry out any Bayesian inference or
perform any modeling without using a prior distribution.
Objective Priors versus Subjective Priors
Bayesian probability measures the degree of belief that you have in a random event. By this definition, probability is highly subjective. It follows that all priors are subjective priors. Not everyone
agrees with this notion of subjectivity when it comes to specifying prior distributions. There has
long been a desire to obtain results that are objectively valid. Within the Bayesian paradigm, this
can be somewhat achieved by using prior distributions that are “objective” (that is, that have a minimal impact on the posterior distribution). Such distributions are called objective or noninformative
priors (see the next section). However, while noninformative priors are very popular in some applications, they are not always easy to construct. See DeGroot and Schervish (2002, Section 1.2)
and Press (2003, Section 2.2) for more information about interpretations of probability. See Berger
(2006) and Goldstein (2006) for discussions about objective Bayesian versus subjective Bayesian
analysis.
Noninformative Priors
Roughly speaking, a prior distribution is noninformative if the prior is “flat” relative to the likelihood
function. Thus, a prior $\pi(\theta)$ is noninformative if it has minimal impact on the posterior distribution
of $\theta$. Other names for the noninformative prior are vague, diffuse, and flat prior. Many statisticians
favor noninformative priors because they appear to be more objective. However, it is unrealistic
to expect that noninformative priors represent total ignorance about the parameter of interest. In
some cases, noninformative priors can lead to improper posteriors (nonintegrable posterior density).
You cannot make inferences with improper posterior distributions. In addition, noninformative
priors are often not invariant under transformation; that is, a prior might be noninformative in one
parameterization but not necessarily noninformative if a transformation is applied. A common
choice for a noninformative prior is the flat prior, which is a prior distribution that assigns equal
likelihood on all possible values of the parameter. Intuitively this makes sense, and in some cases,
such as linear regression, flat priors on the regression parameter are noninformative. However, this
is not necessarily true in all cases. For example, suppose there is a binomial experiment with $n$
Bernoulli trials where $y$ 1s are observed. You want to make inferences about the unknown success
probability $p$. A uniform prior on $p$,
$$\pi(p) \propto 1$$
might appear to be noninformative. However, using the uniform prior is actually equivalent to
adding two observations to the data, one 1 and one 0. With small $n$ and $y$, the added observations
can be very influential to the parameter estimate of $p$.
To see this, note that the likelihood is this:
$$p^{y}(1-p)^{n-y}$$
The maximum likelihood estimator (MLE) of $p$ is $y/n$. The uniform prior can be written as a beta
distribution with both shape parameters, $\alpha$ and $\beta$, equal to 1:
$$\pi(p) \propto p^{\alpha-1}(1-p)^{\beta-1}$$
The posterior distribution of $p$ is proportional to the following:
$$p^{\alpha+y-1}(1-p)^{\beta+n-y-1}$$
which is beta($\alpha+y$, $\beta+n-y$). Therefore, the posterior mean is this:
$$\frac{\alpha+y}{\alpha+\beta+n} = \frac{1+y}{2+n}$$
and it can be quite different from the MLE if both n and y are small. See Box and Tiao (1973) for a
more formal development of noninformative priors. See Kass and Wasserman (1996) for techniques
for deriving noninformative priors.
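The effect of the uniform prior on small samples can be seen by evaluating the two estimators side by side. This is a small sketch using hypothetical data values; the function names are introduced here for illustration only.

```python
def mle(y, n):
    """Maximum likelihood estimate of the binomial success probability."""
    return y / n


def posterior_mean(y, n, alpha=1.0, beta=1.0):
    """Posterior mean under a beta(alpha, beta) prior; alpha = beta = 1 is the uniform prior."""
    return (alpha + y) / (alpha + beta + n)


# Hypothetical data sets of increasing size with the same kind of outcome.
for y, n in [(0, 3), (1, 2), (30, 100), (300, 1000)]:
    print(f"y={y:4d} n={n:5d}  MLE={mle(y, n):.4f}  posterior mean={posterior_mean(y, n):.4f}")
```

For small n the two estimates differ noticeably (for example, with y = 0 and n = 3 the MLE is 0 while the posterior mean is 1/5), and the gap shrinks as n grows.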
Improper Priors
A prior $\pi(\theta)$ is said to be improper if
$$\int \pi(\theta)\,d\theta = \infty$$
For example, a uniform prior distribution on the real line, $\pi(\theta) \propto 1$, for $-\infty < \theta < \infty$, is
an improper prior. Improper priors are often used in Bayesian inference since they usually yield
noninformative priors and proper posterior distributions. However, improper prior distributions can
also lead to posterior impropriety (improper posterior distribution). To determine whether a posterior
distribution is proper, you need to make sure that the normalizing constant
$\int p(y \mid \theta)\,\pi(\theta)\,d\theta$ is finite for all
$y$. If an improper prior distribution leads to an improper posterior distribution, inference based on
the improper posterior distribution is invalid.
The GENMOD, LIFEREG, and PHREG procedures allow the use of improper priors—that is, the
flat prior on the real line—for regression coefficients. These improper priors do not lead to any
improper posterior distributions in the models that these procedures fit. PROC MCMC allows the
use of any prior, as long as the distribution is programmable using DATA step functions. However,
the procedure does not verify whether the posterior distribution is integrable. You must ensure this
yourself.
Informative Priors
An informative prior is a prior that is not dominated by the likelihood and that has an impact on the
posterior distribution. If a prior distribution dominates the likelihood, it is clearly an informative
prior. These types of distributions must be specified with care in actual practice. On the other hand,
the proper use of prior distributions illustrates the power of the Bayesian method: information
gathered from the previous study, past experience, or expert opinion can be combined with current
information in a natural way. See the “Examples” sections of the GENMOD and PHREG procedure
chapters for instructions about constructing informative prior distributions.
Conjugate Priors
A prior is said to be a conjugate prior for a family of distributions if the prior and posterior distributions are from the same family, which means that the form of the posterior has the same distributional form as the prior distribution. For example, if the likelihood is binomial, $y \sim \mathrm{Bin}(n, \theta)$,
a conjugate prior on $\theta$ is the beta distribution; it follows that the posterior distribution of $\theta$ is
also a beta distribution. Other commonly used conjugate prior/likelihood combinations include the
normal/normal, gamma/Poisson, gamma/gamma, and gamma/beta cases. The development of conjugate priors was partially driven by a desire for computational convenience—conjugacy provides
a practical way to obtain the posterior distributions. The Bayesian procedures do not use conjugacy
in posterior sampling.
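As a worked illustration of conjugacy, consider the gamma/Poisson combination mentioned above. The symbols $\lambda$, $a$, and $b$ are introduced here for illustration, and the gamma prior is assumed to be parameterized by shape $a$ and rate $b$:

```latex
\begin{aligned}
y_1, \ldots, y_n \mid \lambda &\sim \mathrm{Poisson}(\lambda),
  \qquad \pi(\lambda) \propto \lambda^{a-1} e^{-b\lambda} \\
L(\lambda) &\propto \lambda^{\sum_{i} y_i}\, e^{-n\lambda} \\
p(\lambda \mid y) &\propto L(\lambda)\,\pi(\lambda)
  = \lambda^{a + \sum_{i} y_i - 1}\, e^{-(b+n)\lambda}
\end{aligned}
```

The last line is the kernel of a gamma distribution with shape $a + \sum_i y_i$ and rate $b + n$, so the posterior stays in the same family as the prior.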
Jeffreys’ Prior
A very useful prior is Jeffreys’ prior (Jeffreys 1961). It satisfies the local uniformity property: a
prior that does not change much over the region in which the likelihood is significant and does not
assume large values outside that range. It is based on the Fisher information matrix. Jeffreys’ prior
is defined as
$$\pi(\theta) \propto |I(\theta)|^{1/2}$$
where $|\cdot|$ denotes the determinant and $I(\theta)$ is the Fisher information matrix based on the likelihood
function $p(y \mid \theta)$:
$$I(\theta) = -E\left[\frac{\partial^2 \log p(y \mid \theta)}{\partial \theta^2}\right]$$
Jeffreys’ prior is locally uniform and hence noninformative. It provides an automated scheme for
finding a noninformative prior for any parametric model $p(y \mid \theta)$. Another appealing property of
Jeffreys’ prior is that it is invariant with respect to one-to-one transformations. The invariance
property means that if you have a locally uniform prior on $\theta$ and $\phi(\theta)$ is a one-to-one function of $\theta$,
then $p(\phi(\theta)) = \pi(\theta)\,|\phi'(\theta)|^{-1}$ is a locally uniform prior for $\phi(\theta)$. This invariance principle carries
through to multidimensional parameters as well. While Jeffreys’ prior provides a general recipe for
obtaining noninformative priors, it has some shortcomings: the prior is improper for many models,
and it can lead to improper posterior in some cases; and the prior can be cumbersome to use in high
dimensions. PROC GENMOD calculates Jeffreys’ prior automatically for any generalized linear
model. You can set it as your prior density for the coefficient parameters, and it does not lead to
improper posteriors. You can construct Jeffreys’ prior for a variety of statistical models in PROC
MCMC. See the section “Logistic Regression Model with Jeffreys’ Prior” on page 3592 for an
example. PROC MCMC does not guarantee that the corresponding posterior distribution is proper,
and you need to exercise extra caution in this case.
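As a worked example of the definition, Jeffreys’ prior for the success probability $p$ of a binomial model $y \sim \mathrm{Bin}(n, p)$ can be derived in a few lines. This derivation is a standard textbook illustration and is not taken from the procedure documentation:

```latex
\begin{aligned}
\log p(y \mid p) &= \text{const} + y \log p + (n - y)\log(1 - p) \\
\frac{\partial^2 \log p(y \mid p)}{\partial p^2}
  &= -\frac{y}{p^2} - \frac{n - y}{(1 - p)^2} \\
I(p) &= -E\left[\frac{\partial^2 \log p(y \mid p)}{\partial p^2}\right]
  = \frac{np}{p^2} + \frac{n(1 - p)}{(1 - p)^2}
  = \frac{n}{p(1 - p)} \\
\pi(p) &\propto |I(p)|^{1/2} \propto p^{-1/2}(1 - p)^{-1/2}
\end{aligned}
```

The result is the kernel of a beta(1/2, 1/2) distribution, which in this particular model is a proper prior.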
Bayesian Inference
Bayesian inference about $\theta$ is primarily based on the posterior distribution of $\theta$. There are various
ways in which you can summarize this distribution. For example, you can report your findings
through point estimates. You can also use the posterior distribution to construct hypothesis tests or
probability statements.
Point Estimation and Estimation Error
Classical methods often report the maximum likelihood estimator (MLE) or the method of moments
estimator (MOME) of a parameter. In contrast, Bayesian approaches often use the posterior mean.
The definition of the posterior mean is given by
$$E(\theta \mid y) = \int \theta\,p(\theta \mid y)\,d\theta$$
Other commonly used posterior estimators include the posterior median, defined as
$$\theta : P(\theta \ge \text{median} \mid y) = P(\text{median} \ge \theta \mid y) = \frac{1}{2}$$
and the posterior mode, defined as the value of $\theta$ that maximizes $p(\theta \mid y)$.
The variance of the posterior density (simply referred to as the posterior variance) describes the
uncertainty in the parameter, which is a random variable in the Bayesian paradigm. A Bayesian
analysis typically uses the posterior variance, or the posterior standard deviation, to characterize
the dispersion of the parameter. In multidimensional models, covariance or correlation matrices are
used.
If you know the distributional form of the posterior density of interest, you can report the exact
posterior point estimates. When models become too difficult to analyze analytically, you have to use
simulation algorithms, such as the Markov chain Monte Carlo (MCMC) method to obtain posterior
estimates (see the section “Markov Chain Monte Carlo Method” on page 151). All of the Bayesian
procedures rely on MCMC to obtain all posterior estimates. Using only a finite number of samples,
simulations introduce an additional level of uncertainty to the accuracy of the estimates. Monte
Carlo standard error (MCSE), which is the standard error of the posterior mean estimate, measures
the simulation accuracy. See the section “Standard Error of the Mean Estimate” on page 170 for
more information.
The posterior standard deviation and the MCSE are two completely different concepts: the posterior
standard deviation describes the uncertainty in the parameter, while the MCSE describes only the
uncertainty in the parameter estimate as a result of MCMC simulation. The posterior standard
deviation is a function of the sample size in the data set, and the MCSE is a function of the number
of iterations in the simulation.
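The distinction between the two quantities can be seen in a small simulation. The sketch below assumes a hypothetical beta(8, 4) posterior and draws independent samples from it, so the simple formula MCSE = SD / sqrt(number of draws) applies; draws from an actual Markov chain are correlated, and that formula would then understate the error.

```python
import math
import random
import statistics

random.seed(1)


def summarize(draws):
    """Return (posterior mean, posterior SD, MCSE) estimated from independent draws."""
    sd = statistics.stdev(draws)
    mcse = sd / math.sqrt(len(draws))   # valid only because the draws are independent
    return statistics.fmean(draws), sd, mcse


results = {}
for n_draws in (1_000, 100_000):
    # Hypothetical posterior: beta(8, 4).
    draws = [random.betavariate(8, 4) for _ in range(n_draws)]
    results[n_draws] = summarize(draws)
    mean, sd, mcse = results[n_draws]
    print(f"draws={n_draws:7d}  mean={mean:.4f}  posterior SD={sd:.4f}  MCSE={mcse:.5f}")
```

Increasing the number of draws leaves the posterior standard deviation essentially unchanged, because it is a property of the posterior distribution, while the MCSE shrinks toward zero.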
Hypothesis Testing
Suppose you have the following null and alternative hypotheses: $H_0$ is $\theta \in \Theta_0$ and $H_1$ is $\theta \in \Theta_0^c$,
where $\Theta_0$ is a subset of the parameter space and $\Theta_0^c$ is its complement. Using the posterior
distribution $p(\theta \mid y)$, you can compute the posterior probabilities $P(\theta \in \Theta_0 \mid y)$ and $P(\theta \in \Theta_0^c \mid y)$, or
the probabilities that $H_0$ and $H_1$ are true, respectively. One way to perform a Bayesian hypothesis
test is to accept the null hypothesis if $P(\theta \in \Theta_0 \mid y) \ge P(\theta \in \Theta_0^c \mid y)$ and vice versa, or to accept the
null hypothesis if $P(\theta \in \Theta_0 \mid y)$ is greater than a predefined threshold, such as 0.75, to guard against
falsely accepting the null hypothesis.
It is more difficult to carry out a point null hypothesis test in a Bayesian analysis. A point null
hypothesis is a test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$. If the prior distribution $\pi(\theta)$ is a continuous
density, then the posterior probability of the null hypothesis being true is 0, and there is no point
in carrying out the test. One alternative is to restate the null to be a small interval hypothesis:
$\theta \in \Theta_0 = (\theta_0 - a, \theta_0 + a)$, where $a$ is a very small constant. The Bayesian paradigm can deal with
an interval hypothesis more easily. Another approach is to give a mixture prior distribution to $\theta$ with a positive probability of $p_0$ on $\theta_0$ and the density $(1 - p_0)\,\pi(\theta)$ on $\theta \ne \theta_0$. This prior ensures
a nonzero posterior probability on $\theta_0$, and you can then make realistic probabilistic comparisons.
For more detailed treatment of Bayesian hypothesis testing, see Berger (1985).
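When posterior samples are available, the posterior probability of an interval hypothesis can be estimated as the fraction of samples that fall in the interval. The sketch below assumes a hypothetical beta(8, 4) posterior and the hypothetical null region (0, 0.5); neither comes from the text above.

```python
import random

random.seed(2)


def posterior_probability(draws, lower, upper):
    """Estimate P(lower < theta < upper | y) as the fraction of draws in the interval."""
    return sum(lower < d < upper for d in draws) / len(draws)


# Hypothetical posterior draws: beta(8, 4).
draws = [random.betavariate(8, 4) for _ in range(50_000)]

p_h0 = posterior_probability(draws, 0.0, 0.5)   # H0: theta in (0, 0.5)
p_h1 = 1.0 - p_h0                               # H1: the complement
decision = "H0" if p_h0 >= p_h1 else "H1"
print(f"P(H0 | y) = {p_h0:.3f}, P(H1 | y) = {p_h1:.3f}, accept {decision}")
```

The same function can be compared against a fixed threshold, such as 0.75, instead of against the probability of the complement.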
Interval Estimation
Bayesian set estimates are called credible sets, which are also known as credible intervals. This
is analogous to the concept of confidence intervals used in classical statistics. Given a posterior
distribution $p(\theta \mid y)$, $A$ is a credible set for $\theta$ if
$$P(\theta \in A \mid y) = \int_A p(\theta \mid y)\,d\theta$$
For example, you can construct a 95% credible set for $\theta$ by finding an interval, $A$, over which
$\int_A p(\theta \mid y)\,d\theta = 0.95$.
You can construct credible sets that have equal tails. A $100(1-\alpha)\%$ equal-tail interval corresponds
to the $100(\alpha/2)$th and $100(1-\alpha/2)$th percentiles of the posterior distribution. Some statisticians
prefer this interval because it is invariant under transformations. Another frequently used Bayesian
credible set is called the highest posterior density (HPD) interval.
A $100(1-\alpha)\%$ HPD interval is a region that satisfies the following two conditions:
1. The posterior probability of that region is $100(1-\alpha)\%$.
2. The minimum density of any point within that region is equal to or larger than the density of
any point outside that region.
The HPD is an interval in which most of the distribution lies. Some statisticians prefer this interval
because it is the smallest interval.
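Both kinds of interval can be estimated from posterior samples. The following sketch assumes a hypothetical, right-skewed posterior (a gamma distribution with shape 2 and scale 1) so that the two intervals differ visibly. The HPD estimate uses the shortest window of sorted samples that contains the required fraction of draws, which is a common sample-based approximation that assumes a unimodal posterior; it is not necessarily the method that the SAS procedures use.

```python
import math
import random

random.seed(3)


def equal_tail_interval(draws, alpha=0.05):
    """Interval between the 100(alpha/2)th and 100(1 - alpha/2)th sample percentiles."""
    s = sorted(draws)
    n = len(s)
    return s[int((alpha / 2) * n)], s[int((1 - alpha / 2) * n) - 1]


def hpd_interval(draws, alpha=0.05):
    """Shortest interval that contains 100(1 - alpha)% of the draws (unimodal case)."""
    s = sorted(draws)
    n = len(s)
    k = math.ceil((1 - alpha) * n)                      # number of draws inside the interval
    start = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[start], s[start + k - 1]


# Hypothetical skewed posterior: gamma with shape 2 and scale 1.
draws = [random.gammavariate(2.0, 1.0) for _ in range(50_000)]
et = equal_tail_interval(draws)
hpd = hpd_interval(draws)
print(f"equal-tail: ({et[0]:.3f}, {et[1]:.3f})  width {et[1] - et[0]:.3f}")
print(f"HPD:        ({hpd[0]:.3f}, {hpd[1]:.3f})  width {hpd[1] - hpd[0]:.3f}")
```

For a skewed posterior such as this one, the HPD interval is shifted toward the mode and is narrower than the equal-tail interval; for a symmetric unimodal posterior the two intervals nearly coincide.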
One major distinction between Bayesian and classical sets is their interpretation. The Bayesian
probability reflects a person’s subjective beliefs. Following this approach, a statistician can make
the claim that $\theta$ is inside a credible interval with measurable probability. This property is appealing
because it enables you to make a direct probability statement about parameters. Many people find
this concept to be a more natural way of understanding a probability interval, which is also easier to
explain to nonstatisticians. A confidence interval, on the other hand, enables you to make a claim
that the interval covers the true parameter. The interpretation reflects the uncertainty in the sampling
procedure; a confidence interval of $100(1-\alpha)\%$ asserts that, in the long run, $100(1-\alpha)\%$ of the
realized confidence intervals cover the true parameter.
Bayesian Analysis: Advantages and Disadvantages
Bayesian methods and classical methods both have advantages and disadvantages, and there are
some similarities. When the sample size is large, Bayesian inference often provides results for
parametric models that are very similar to the results produced by frequentist methods. Some advantages to using Bayesian analysis include the following:
• It provides a natural and principled way of combining prior information with data, within a
  solid decision theoretical framework. You can incorporate past information about a parameter
  and form a prior distribution for future analysis. When new observations become available,
  the previous posterior distribution can be used as a prior. All inferences logically follow from
  Bayes’ theorem.
• It provides inferences that are conditional on the data and are exact, without reliance on
  asymptotic approximation. Small sample inference proceeds in the same manner as if one
  had a large sample. Bayesian analysis also can estimate any functions of parameters directly,
  without using the “plug-in” method (a way to estimate functionals by plugging the estimated
  parameters in the functionals).
• It obeys the likelihood principle. If two distinct sampling designs yield proportional likelihood functions for $\theta$, then all inferences about $\theta$ should be identical from these two designs.
  Classical inference does not in general obey the likelihood principle.
• It provides interpretable answers, such as “the true parameter has a probability of 0.95 of
  falling in a 95% credible interval.”
• It provides a convenient setting for a wide range of models, such as hierarchical models and
  missing data problems. MCMC, along with other numerical methods, makes computations
  tractable for virtually all parametric models.
There are also disadvantages to using Bayesian analysis:
• It does not tell you how to select a prior. There is no correct way to choose a prior. Bayesian
  inferences require skills to translate subjective prior beliefs into a mathematically formulated
  prior. If you do not proceed with caution, you can generate misleading results.
• It can produce posterior distributions that are heavily influenced by the priors. From a practical point of view, it might sometimes be difficult to convince subject matter experts who do
  not agree with the validity of the chosen prior.
• It often comes with a high computational cost, especially in models with a large number
  of parameters. In addition, simulations provide slightly different answers unless the same
  random seed is used. Note that slight variations in simulation results do not contradict the
  earlier claim that Bayesian inferences are exact. The posterior distribution of a parameter
  is exact, given the likelihood function and the priors, while simulation-based estimates of
  posterior quantities can vary due to the random number generator used in the procedures.
For more in-depth treatments of the pros and cons of Bayesian analysis, see Berger (1985, Sections
4.1 and 4.12), Berger and Wolpert (1988), Bernardo and Smith (1994, with a new edition coming
out), Carlin and Louis (2000, Section 1.4), Robert (2001, Chapter 11), and Wasserman (2004,
Section 11.9).
The following sections provide detailed information about the Bayesian methods provided in SAS.
Markov Chain Monte Carlo Method
The Markov chain Monte Carlo (MCMC) method is a general simulation method for sampling
from posterior distributions and computing posterior quantities of interest. MCMC methods sample
successively from a target distribution. Each sample depends on the previous one, hence the notion
of the Markov chain. A Markov chain is a sequence of random variables, $\theta^1, \theta^2, \ldots$, for which the
random variable $\theta^t$ depends on all previous $\theta$s only through its immediate predecessor $\theta^{t-1}$. You
can think of a Markov chain applied to sampling as a mechanism that traverses randomly through a
target distribution without having any memory of where it has been. Where it moves next is entirely
dependent on where it is now.
Monte Carlo, as in Monte Carlo integration, is mainly used to approximate an expectation by using
the Markov chain samples. In the simplest version
$$\int_S g(\theta)\,p(\theta)\,d\theta \cong \frac{1}{n}\sum_{t=1}^{n} g(\theta^t)$$
where $g(\cdot)$ is a function of interest and $\theta^t$ are samples from $p(\theta)$ on its support $S$. This approximates the expected value of $g(\theta)$. The earliest reference to MCMC simulation occurs in the physics
literature. Metropolis and Ulam (1949) and Metropolis et al. (1953) describe what is known as
the Metropolis algorithm (see the section “Metropolis and Metropolis-Hastings Algorithms” on
page 152). The algorithm can be used to generate sequences of samples from the joint distribution
of multiple variables, and it is the foundation of MCMC. Hastings (1970) generalized their work,
resulting in the Metropolis-Hastings algorithm. Geman and Geman (1984) analyzed image data
by using what is now called Gibbs sampling (see the section “Gibbs Sampler” on page 154). These
MCMC methods first appeared in the mainstream statistical literature in Tanner and Wong (1987).
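The Monte Carlo approximation of an expectation described above can be sketched in a few lines. This example is an illustration only: it assumes a standard normal target distribution and the function g(θ) = θ², and it uses independent draws rather than Markov chain draws, so it shows only the averaging step.

```python
import random

random.seed(4)


def monte_carlo_expectation(g, sampler, n):
    """Approximate E[g(theta)] by averaging g over n draws produced by sampler()."""
    return sum(g(sampler()) for _ in range(n)) / n


# Hypothetical target: standard normal. E[theta^2] equals the variance, which is 1.
estimate = monte_carlo_expectation(
    g=lambda theta: theta * theta,
    sampler=lambda: random.gauss(0.0, 1.0),
    n=200_000,
)
print(round(estimate, 3))
```

With draws from a Markov chain the same average is used, but successive draws are correlated, so more iterations are typically needed to reach the same accuracy.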
The Markov chain method has been quite successful in modern Bayesian computing. Only in the
simplest Bayesian models can you recognize the analytical forms of the posterior distributions and
summarize inferences directly. In moderately complex models, posterior densities are too difficult
to work with directly. With the MCMC method, it is possible to generate samples from an arbitrary
posterior density $p(\theta \mid y)$ and to use these samples to approximate expectations of quantities of
interest. Several other aspects of the Markov chain method also contributed to its success. Most
importantly, if the simulation algorithm is implemented correctly, the Markov chain is guaranteed
to converge to the target distribution $p(\theta \mid y)$ under rather broad conditions, regardless of where the
chain was initialized. In other words, a Markov chain is able to improve its approximation to the
true distribution at each step in the simulation. Furthermore, if the chain is run for a very long time
(often required), you can recover $p(\theta \mid y)$ to any precision. Also, the simulation algorithm is easily
extensible to models with a large number of parameters or high complexity, although the “curse of
dimensionality” often causes problems in practice.
Properties of Markov chains are discussed in Feller (1968), Breiman (1968), and Meyn and Tweedie
(1993). Ross (1997) and Karlin and Taylor (1975) give a non-measure-theoretic treatment of
stochastic processes, including Markov chains. For conditions that govern Markov chain convergence and rates of convergence, see Amit (1991), Applegate, Kannan, and Polson (1990), Chan
(1993), Geman and Geman (1984), Liu, Wong, and Kong (1991a, b), Rosenthal (1991a, b), Tierney
(1994), and Schervish and Carlin (1992). Besag (1974) describes conditions under which a set of
conditional distributions gives a unique joint distribution. Tanner (1993), Gilks, Richardson, and
Spiegelhalter (1996), Chen, Shao, and Ibrahim (2000), Liu (2001), Gelman et al. (2004), Robert and
Casella (2004), and Congdon (2001, 2003, 2005) provide both theoretical and applied treatments
of MCMC methods. You can also see the section “A Bayesian Reading List” on page 173 for a list
of books with varying levels of difficulty of treatment of the subject and its application to Bayesian
statistics.
Metropolis and Metropolis-Hastings Algorithms
The Metropolis algorithm is named after its inventor, the American physicist and computer scientist
Nicholas C. Metropolis. The algorithm is simple but practical, and it can be used to obtain random
samples from any arbitrarily complicated target distribution of any dimension that is known up to
a normalizing constant. The Bayesian procedures use a special case of the Metropolis algorithm
called the Gibbs sampler to obtain posterior samples.
Suppose you want to obtain $T$ samples from a univariate distribution with probability density function $f(\theta \mid y)$. Suppose $\theta^t$ is the $t$th sample from $f$. To use the Metropolis algorithm, you need to
have an initial value $\theta^0$ and a symmetric proposal density $q(\theta^{t+1} \mid \theta^t)$. For the $(t+1)$th iteration,
the algorithm generates a sample from $q(\cdot \mid \cdot)$ based on the current sample $\theta^t$, and it makes a decision
to either accept or reject the new sample. If the new sample is accepted, the algorithm repeats itself
by starting at the new sample. If the new sample is rejected, the algorithm starts at the current point
and repeats. The algorithm is self-repeating, so it can be carried out as long as required. In practice,
you have to decide the total number of samples needed in advance and stop the sampler after that
many iterations have been completed.
Suppose $q(\theta_{new} \mid \theta^t)$ is a symmetric distribution. The proposal distribution should be an easy distribution from which to sample, and it must be such that $q(\theta_{new} \mid \theta^t) = q(\theta^t \mid \theta_{new})$, meaning that
the likelihood of jumping to $\theta_{new}$ from $\theta^t$ is the same as the likelihood of jumping back to $\theta^t$ from
$\theta_{new}$. The most common choice of the proposal distribution is the normal distribution $N(\theta^t, \sigma)$ with
a fixed $\sigma$. The Metropolis algorithm can be summarized as follows:
1. Set $t = 0$. Choose a starting point $\theta^0$. This can be an arbitrary point as long as $f(\theta^0 \mid y) > 0$.
2. Generate a new sample, $\theta_{new}$, by using the proposal distribution $q(\cdot \mid \theta^t)$.
3. Calculate the following quantity:
   \[ r = \min\left\{ \frac{f(\theta_{new} \mid y)}{f(\theta^t \mid y)},\ 1 \right\} \]
4. Sample $u$ from the uniform distribution $U(0, 1)$.
5. Set $\theta^{t+1} = \theta_{new}$ if $u < r$; otherwise set $\theta^{t+1} = \theta^t$.
6. Set $t = t + 1$. If $t < T$, the number of desired samples, return to step 2. Otherwise, stop.
Note that the number of iterations keeps increasing regardless of whether a proposed sample is
accepted.
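The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any SAS procedure: the function name `metropolis`, the standard normal target, and the proposal scale `sigma=2.4` are all assumptions chosen for the example. The acceptance ratio is computed on the log scale, which is equivalent to step 3 but avoids numerical underflow.

```python
import math
import random


def metropolis(log_f, theta0, n_samples, sigma=1.0, seed=0):
    """Random-walk Metropolis with a normal proposal centered at the current point.

    log_f returns the log of the target density up to an additive constant.
    """
    rng = random.Random(seed)
    theta = theta0
    chain = []
    accepted = 0
    for _ in range(n_samples):
        theta_new = rng.gauss(theta, sigma)                 # step 2: propose
        log_r = min(0.0, log_f(theta_new) - log_f(theta))   # step 3: log of r
        u = rng.random()                                    # step 4
        if u < math.exp(log_r):                             # step 5: accept
            theta = theta_new
            accepted += 1
        chain.append(theta)  # step 6: the iteration counts even after a rejection
    return chain, accepted / n_samples


def log_std_normal(theta):
    # Hypothetical target for illustration: N(0, 1), known up to a constant.
    return -0.5 * theta * theta


chain, rate = metropolis(log_std_normal, theta0=0.0, n_samples=20000, sigma=2.4)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / (len(chain) - 1)
print(f"acceptance rate {rate:.2f}, mean {mean:.3f}, variance {var:.3f}")
```

Because a rejected proposal repeats the current value, the chain always has exactly `n_samples` entries, which matches the note above about the iteration count.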
This algorithm defines a chain of random variates whose distribution will converge to the desired
distribution $f(\theta \mid y)$, and so from some point forward, the chain of samples is a sample from the
distribution of interest. In Markov chain terminology, this distribution is called the stationary distribution of the chain, and in Bayesian statistics, it is the posterior distribution of the model parameters. The reason that the Metropolis algorithm works is beyond the scope of this documentation, but
you can find more detailed descriptions and proofs in many standard textbooks, including Roberts
(1996) and Liu (2001). The random-walk Metropolis algorithm is used in PROC MCMC.
You are not limited to a symmetric random-walk proposal distribution in establishing a valid sampling algorithm. A more general form, the Metropolis-Hastings (MH) algorithm, was proposed
by Hastings (1970). The MH algorithm uses an asymmetric proposal distribution: $q(\theta_{new} \mid \theta^t) \ne q(\theta^t \mid \theta_{new})$. The difference in its implementation comes in calculating the ratio of densities:
\[ r = \min\left\{ \frac{f(\theta_{new} \mid y)\, q(\theta^t \mid \theta_{new})}{f(\theta^t \mid y)\, q(\theta_{new} \mid \theta^t)},\ 1 \right\} \]
Other steps remain the same.
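To see the effect of the extra proposal terms, consider a sketch with a multiplicative log-normal random walk, which is asymmetric. For that proposal the ratio $q(\theta^t \mid \theta_{new}) / q(\theta_{new} \mid \theta^t)$ simplifies to $\theta_{new} / \theta^t$. The target (an unnormalized gamma density with shape 3 and rate 1), the function names, and the tuning value `sigma=0.8` are assumptions made for this example only.

```python
import math
import random


def log_target(theta):
    # Hypothetical target: Gamma(shape=3, rate=1) on theta > 0, up to a constant.
    return 2.0 * math.log(theta) - theta


def metropolis_hastings(log_f, theta0, n_samples, sigma=0.8, seed=1):
    """MH sampler with proposal theta_new = theta * exp(sigma * z), z ~ N(0, 1)."""
    rng = random.Random(seed)
    theta = theta0
    chain = []
    for _ in range(n_samples):
        theta_new = theta * math.exp(rng.gauss(0.0, sigma))
        # log of f(new) q(old | new) / (f(old) q(new | old));
        # for the log-normal walk, q(old | new) / q(new | old) = new / old.
        log_r = (log_f(theta_new) - log_f(theta)
                 + math.log(theta_new) - math.log(theta))
        if rng.random() < math.exp(min(0.0, log_r)):
            theta = theta_new
        chain.append(theta)
    return chain


mh_chain = metropolis_hastings(log_target, theta0=1.0, n_samples=30000)
mh_mean = sum(mh_chain) / len(mh_chain)
print(f"sample mean {mh_mean:.3f} (the gamma target has mean 3)")
```

Dropping the `math.log(theta_new) - math.log(theta)` correction would turn this into a plain Metropolis ratio applied to an asymmetric proposal, and the chain would no longer target the intended density.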
The extension of the Metropolis algorithm to a higher-dimensional $\theta$ is straightforward. Suppose
$\theta = (\theta_1, \theta_2, \ldots, \theta_k)'$ is the parameter vector. To start the Metropolis algorithm, select an initial
value for each $\theta_k$ and use a multivariate version of the proposal distribution $q(\cdot \mid \cdot)$, such as a multivariate
normal distribution, to select a $k$-dimensional new parameter. Other steps remain the same as
those previously described, and this Markov chain eventually converges to the target distribution
$f(\theta \mid y)$. Chib and Greenberg (1995) provide a useful tutorial on the algorithm.
Independence Sampler
Another type of Metropolis algorithm is the “independence” sampler. It is called the independence
sampler because the proposal distribution in the algorithm does not depend on the current point as it
does with the random-walk Metropolis algorithm. For this sampler to work well, you want to have
a proposal distribution that mimics the target distribution and have the acceptance rate be as high as
possible.
1. Set $t = 0$. Choose a starting point $\theta^0$. This can be an arbitrary point as long as $f(\theta^0 \mid y) > 0$.
2. Generate a new sample, $\theta_{new}$, by using the proposal distribution $q(\cdot)$. The proposal distribution does not depend on the current value of $\theta^t$.
3. Calculate the following quantity:
   \[ r = \min\left\{ \frac{f(\theta_{new} \mid y) / q(\theta_{new})}{f(\theta^t \mid y) / q(\theta^t)},\ 1 \right\} \]
4. Sample $u$ from the uniform distribution $U(0, 1)$.
5. Set $\theta^{t+1} = \theta_{new}$ if $u < r$; otherwise set $\theta^{t+1} = \theta^t$.
6. Set $t = t + 1$. If $t < T$, the number of desired samples, return to step 2. Otherwise, stop.
A good proposal density should have thicker tails than those of the target distribution. This requirement sometimes can be difficult to satisfy especially in cases where you do not know what the target
posterior distributions are like. In addition, this sampler does not produce independent samples as
the name seems to imply, and sample chains from independence samplers can get stuck in the tails
of the posterior distribution if the proposal distribution is not chosen carefully. The independence
sampler is used in PROC MCMC.
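The independence sampler can be sketched as follows. The example is illustrative only: it assumes a standard normal target and a standard Cauchy proposal, chosen because the Cauchy has thicker tails than the target, and all function names are hypothetical. The sampler tracks the log of the weight $f(\theta \mid y) / q(\theta)$, so the ratio in step 3 is a difference of two log weights.

```python
import math
import random


def log_f(theta):
    # Hypothetical target: N(0, 1), up to a constant.
    return -0.5 * theta * theta


def sample_cauchy(rng):
    # Inverse-CDF draw from a standard Cauchy; it ignores the current point.
    return math.tan(math.pi * (rng.random() - 0.5))


def log_cauchy(theta):
    # Standard Cauchy log density, up to a constant.
    return -math.log(1.0 + theta * theta)


def independence_sampler(log_f, sample_q, log_q, theta0, n_samples, seed=2):
    rng = random.Random(seed)
    theta = theta0
    log_w = log_f(theta) - log_q(theta)       # log of f / q at the current point
    chain = []
    for _ in range(n_samples):
        theta_new = sample_q(rng)             # step 2
        log_w_new = log_f(theta_new) - log_q(theta_new)
        if rng.random() < math.exp(min(0.0, log_w_new - log_w)):  # steps 3 to 5
            theta, log_w = theta_new, log_w_new
        chain.append(theta)
    return chain


ind_chain = independence_sampler(log_f, sample_cauchy, log_cauchy, 0.0, 20000)
ind_mean = sum(ind_chain) / len(ind_chain)
ind_var = sum((x - ind_mean) ** 2 for x in ind_chain) / (len(ind_chain) - 1)
print(f"mean {ind_mean:.3f}, variance {ind_var:.3f}")
```

Consecutive draws are still dependent, because every rejection repeats the current value.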
Gibbs Sampler
The Gibbs sampler, named by Geman and Geman (1984) after the American physicist Josiah W.
Gibbs, is a special case of the Metropolis sampler in which the proposal distributions exactly match
the posterior conditional distributions and proposals are accepted 100% of the time. Gibbs sampling
requires you to decompose the joint posterior distribution into full conditional distributions for each
parameter in the model and then sample from them. The sampler can be efficient when the parameters are not highly dependent on each other and the full conditional distributions are easy to sample
from. Some researchers favor this algorithm because it does not require an instrumental proposal
distribution as Metropolis methods do. However, while deriving the conditional distributions can
be relatively easy, it is not always possible to find an efficient way to sample from these conditional
distributions.
Suppose $\theta = (\theta_1, \ldots, \theta_k)'$ is the parameter vector, $p(y \mid \theta)$ is the likelihood, and $\pi(\theta)$ is the prior
distribution. The full posterior conditional distribution of $\pi(\theta_i \mid \theta_j, i \ne j, y)$ is proportional to the
joint posterior density; that is,
\[ \pi(\theta_i \mid \theta_j, i \ne j, y) \propto p(y \mid \theta)\, \pi(\theta) \]
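For instance, the one-dimensional conditional distribution of $\theta_1$ given $\theta_j = \theta_j^*$, $2 \le j \le k$, is
computed, up to a normalizing constant, as the following:
\[ \pi(\theta_1 \mid \theta_j = \theta_j^*, 2 \le j \le k, y) \propto p(y \mid \theta = (\theta_1, \theta_2^*, \ldots, \theta_k^*)')\, \pi(\theta = (\theta_1, \theta_2^*, \ldots, \theta_k^*)') \]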
The Gibbs sampler works as follows:
1. Set $t = 0$, and choose an arbitrary initial value of $\theta^{(0)} = \{\theta_1^{(0)}, \ldots, \theta_k^{(0)}\}$.
2. Generate each component of $\theta$ as follows:
   - draw $\theta_1^{(t+1)}$ from $\pi(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_k^{(t)}, y)$
   - draw $\theta_2^{(t+1)}$ from $\pi(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_k^{(t)}, y)$
   - ...
   - draw $\theta_k^{(t+1)}$ from $\pi(\theta_k \mid \theta_1^{(t+1)}, \ldots, \theta_{k-1}^{(t+1)}, y)$
3. Set $t = t + 1$. If $t < T$, the number of desired samples, return to step 2. Otherwise, stop.
The name “Gibbs” was introduced by Geman and Geman (1984). Gelfand et al. (1990) first used
Gibbs sampling to solve problems in Bayesian inference. See Casella and George (1992) for a
tutorial on the sampler. The GENMOD, LIFEREG, and PHREG procedures update parameters
using the Gibbs sampler.
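A small sketch can make the component-by-component update concrete. It assumes a toy target rather than a real posterior: a bivariate normal with zero means, unit variances, and correlation `rho`, for which each full conditional is the normal distribution $N(\rho\,\theta_{other}, 1 - \rho^2)$. The function name and the deliberately poor starting point are assumptions for illustration.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, init=(0.0, 0.0), seed=3):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    theta1, theta2 = init
    chain = []
    for _ in range(n_samples):
        # Each draw conditions on the most recent value of the other component.
        theta1 = rng.gauss(rho * theta2, cond_sd)
        theta2 = rng.gauss(rho * theta1, cond_sd)
        chain.append((theta1, theta2))
    return chain


gibbs_chain = gibbs_bivariate_normal(rho=0.6, n_samples=20000, init=(10.0, -10.0))
kept = gibbs_chain[100:]  # drop a short initial stretch affected by the start
n = len(kept)
m1 = sum(a for a, _ in kept) / n
m2 = sum(b for _, b in kept) / n
s11 = sum((a - m1) ** 2 for a, _ in kept)
s22 = sum((b - m2) ** 2 for _, b in kept)
s12 = sum((a - m1) * (b - m2) for a, b in kept)
corr = s12 / math.sqrt(s11 * s22)
print(f"sample correlation {corr:.3f} (target 0.6)")
```

Every draw is accepted, so there is no proposal to tune. As `rho` approaches 1 the conditional standard deviation shrinks and the chain moves in small steps, which illustrates why the sampler is less efficient for highly dependent parameters.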
Adaptive Rejection Sampling Algorithm
The GENMOD, LIFEREG, and PHREG procedures use the adaptive rejection sampling (ARS) algorithm to sample parameters sequentially from their univariate full conditional distributions. The
ARS algorithm is a rejection algorithm that was originally proposed by Gilks and Wild (1992).
Given a log-concave density (the log of the density is concave), you can construct an envelope to
the density by using linear segments. You then use the linear segment envelope as a proposal density
(it becomes a piecewise exponential density on the original scale and is easy to generate samples
from) in the rejection sampling. The log-concavity condition is met in some of the models fit by
the procedures. For example, the posterior densities for the regression parameters in the generalized linear models are log-concave under flat priors. When this condition fails, the ARS algorithm
calls for an additional Metropolis-Hastings step (Gilks, Best, and Tan 1995), and the modified algorithm becomes the adaptive rejection Metropolis sampling (ARMS) algorithm. The GENMOD,
LIFEREG, and PHREG procedures can recognize whether a model is log-concave and select the
appropriate sampler for the problem at hand.
The GENMOD, LIFEREG, and PHREG procedures implement the ARMS algorithm based on code
kindly provided by Walter R. Gilks, University of Leeds (Gilks 2003), to obtain posterior samples.
For a detailed description and explanation of the algorithm, see Gilks and Wild (1992) and Gilks,
Best, and Tan (1995).
Burn-in, Thinning, and Markov Chain Samples
Burn-in refers to the practice of discarding an initial portion of a Markov chain sample so that the
effect of initial values on the posterior inference is minimized. For example, suppose the target
distribution is $N(0, 1)$ and the Markov chain was started at the value $10^6$. The chain might quickly
travel to regions around 0 in a few iterations. However, including samples around the value $10^6$
in the posterior mean calculation can produce substantial bias in the mean estimate. In theory, if
the Markov chain is run for an infinite amount of time, the effect of the initial values decreases to
zero. In practice, you do not have the luxury of infinite samples. Instead, you assume that after
$t$ iterations, the chain has reached its target distribution and you can throw away the early portion
and use the good samples for posterior inference. The value of $t$ is the burn-in number.
With some models you might experience poor mixing (or slow convergence) of the Markov chain.
This can happen, for example, when parameters are highly correlated with each other. Poor mixing
means that the Markov chain slowly traverses the parameter space (see the section “Visual Analysis
via Trace Plots” on page 156 for examples of poorly mixed chains) and the chain has high dependence. High sample autocorrelation can result in biased Monte Carlo standard errors. A common
strategy is to thin the Markov chain in order to reduce sample autocorrelations. You thin a chain
by keeping every kth simulated draw from each sequence. You can safely use a thinned Markov
chain for posterior inference as long as the chain converges. It is important to note that thinning a
Markov chain can be wasteful because you are throwing away a $(k-1)/k$ fraction of all the posterior
samples generated. MacEachern and Berliner (1994) show that you always get more precise posterior estimates if the entire Markov chain is used. However, other factors, such as computer storage
or plotting time, might prevent you from keeping all samples.
To use the GENMOD, LIFEREG, MCMC, and PHREG procedures, you need to determine the total
number of samples to keep ahead of time. This number is not obvious and often depends on the type
of inference you want to make. Mean estimates do not require nearly as many samples as small-tail
percentile estimates. In most applications, you might find that keeping a few thousand iterations
is sufficient for reasonably accurate posterior inference. In all four procedures, the relationship between the number of iterations requested, the number of iterations kept, and the amount of thinning
is as follows:
\[ \text{kept} = \left[ \frac{\text{requested}}{\text{thinning}} \right] \]
where $[\,\cdot\,]$ is the rounding operator.
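The effect of burn-in and thinning can be illustrated with a toy chain. The sketch below is an assumption-laden stand-in for real sampler output: it uses a first-order autoregressive process whose stationary distribution is $N(0, 1)$, started at $10^6$ as in the burn-in example, and it treats the 10,000 post-burn-in iterations as the requested number. The helper names are hypothetical.

```python
import math
import random


def ar1_chain(n, phi, start, seed=4):
    """AR(1) process with N(0, 1) stationary distribution, used as a toy chain."""
    rng = random.Random(seed)
    x = start
    chain = []
    for _ in range(n):
        x = phi * x + math.sqrt(1.0 - phi * phi) * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def burn_and_thin(chain, n_burn, thin):
    # Discard the first n_burn draws, then keep every thin-th draw.
    return chain[n_burn::thin]


raw = ar1_chain(n=10500, phi=0.9, start=1.0e6)   # 500 burn-in + 10000 requested
mean_all = sum(raw) / len(raw)                   # contaminated by the start value
kept = burn_and_thin(raw, n_burn=500, thin=5)
mean_kept = sum(kept) / len(kept)
print(f"kept {len(kept)} draws; mean with start {mean_all:.1f}, "
      f"mean after burn-in {mean_kept:.3f}")
```

With 10,000 requested iterations and a thinning rate of 5, the number kept is 2,000, in agreement with the relationship above.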
Assessing Markov Chain Convergence
Simulation-based Bayesian inference requires using simulated draws to summarize the posterior
distribution or calculate any relevant quantities of interest. You need to treat the simulation draws
with care. There are usually two issues. First, you have to decide whether the Markov chain
has reached its stationary, or the desired posterior, distribution. Second, you have to determine
the number of iterations to keep after the Markov chain has reached stationarity. Convergence
diagnostics help to resolve these issues. Note that many diagnostic tools are designed to verify a
necessary but not sufficient condition for convergence. There are no conclusive tests that can tell
you when the Markov chain has converged to its stationary distribution. You should proceed with
caution. Also, note that you should check the convergence of all parameters, and not just those of
interest, before proceeding to make any inference. With some models, certain parameters can appear
to have very good convergence behavior, but that could be misleading due to the slow convergence
of other parameters. If some of the parameters have bad mixing, you cannot get accurate posterior
inference for parameters that appear to have good mixing. See Cowles and Carlin (1996) and Brooks
and Roberts (1998) for discussions about convergence diagnostics.
Visual Analysis via Trace Plots
Trace plots of samples versus the simulation index can be very useful in assessing convergence. The
trace tells you if the chain has not yet converged to its stationary distribution—that is, if it needs a
longer burn-in period. A trace can also tell you whether the chain is mixing well. A chain might
have reached stationarity if the distribution of points is not changing as the chain progresses. The
aspects of stationarity that are most recognizable from a trace plot are a relatively constant mean
and variance. A chain that mixes well traverses its posterior space rapidly, and it can jump from
one remote region of the posterior to another in relatively few steps. Figure 7.1 through Figure 7.4
display some typical features that you might see in trace plots. The traces are for a parameter called
$\theta$.
Figure 7.1 Essentially Perfect Trace for $\theta$

Figure 7.1 displays a “perfect” trace plot. Note that the center of the chain appears to be around
the value 3, with very small fluctuations. This indicates that the chain could have reached the right
distribution. The chain is mixing well; it is exploring the distribution by traversing to areas where
its density is very low. You can conclude that the mixing is quite good here.
Figure 7.2 Nonconvergence of $\theta$

Figure 7.2 displays a trace plot for a chain that starts at a very remote initial value and makes its
way to the target distribution. The first few hundred observations should be discarded. This
chain appears to be mixing very well locally. It travels relatively quickly to the target distribution,
reaching it in a few hundred iterations. If you have a chain that looks like this, you would want to
increase the burn-in sample size. If you need to use this sample to make inferences, you would want
to use only the samples toward the end of the chain.
Figure 7.3 Marginal Mixing for $\theta$

Figure 7.3 demonstrates marginal mixing. The chain is taking only small steps and does not traverse
its distribution quickly. This type of trace plot is typically associated with high autocorrelation
among the samples. To obtain a few thousand independent samples, you need to run the chain for
much longer.
Figure 7.4 Bad Mixing, Nonconvergence of $\theta$

The trace plot shown in Figure 7.4 depicts a chain with serious problems. It is mixing very slowly,
and it offers no evidence of convergence. You would want to try to improve the mixing of this chain.
For example, you might consider reparameterizing your model on the log scale. Run the Markov
chain for a long time to see where it goes. This type of chain is entirely unsuitable for making
parameter inferences.
Statistical Diagnostic Tests
The Bayesian procedures include several statistical diagnostic tests that can help you assess Markov
chain convergence. For a detailed description of each of the diagnostic tests, see the following
subsections. Table 7.1 provides a summary of the diagnostic tests and their interpretations.
Table 7.1 Convergence Diagnostic Tests Available in the Bayesian Procedures

| Name | Description | Interpretation of the Test |
|---|---|---|
| Gelman-Rubin | Uses parallel chains with dispersed initial values to test whether they all converge to the same target distribution. Failure could indicate the presence of a multi-mode posterior distribution (different chains converge to different local modes) or the need to run a longer chain (burn-in is yet to be completed). | One-sided test based on a variance ratio test statistic. Large $\hat{R}_c$ values indicate rejection. |
| Geweke | Tests whether the mean estimates have converged by comparing means from the early and latter part of the Markov chain. | Two-sided test based on a z-score statistic. Large absolute z values indicate rejection. |
| Heidelberger-Welch (stationarity test) | Tests whether the Markov chain is a covariance (or weakly) stationary process. Failure could indicate that a longer Markov chain is needed. | One-sided test based on a Cramer–von Mises statistic. Small p-values indicate rejection. |
| Heidelberger-Welch (half-width test) | Reports whether the sample size is adequate to meet the required accuracy for the mean estimate. Failure could indicate that a longer Markov chain is needed. | If a relative half-width statistic is greater than a predetermined accuracy measure, this indicates rejection. |
| Raftery-Lewis | Evaluates the accuracy of the estimated (desired) percentiles by reporting the number of samples needed to reach the desired accuracy of the percentiles. Failure could indicate that a longer Markov chain is needed. | If the total samples needed are more than the Markov chain sample, this indicates rejection. |
| autocorrelation | Measures dependency among Markov chain samples. | High correlations between long lags indicate poor mixing. |
| effective sample size | Relates to autocorrelation; measures mixing of the Markov chain. | Large discrepancy between the effective sample size and the actual simulation iteration indicates poor mixing. |
Gelman and Rubin Diagnostics
Gelman and Rubin diagnostics (Gelman and Rubin 1992; Brooks and Gelman 1997) are based on
analyzing multiple simulated MCMC chains by comparing the variances within each chain and the
variance between chains. Large deviation between these two variances indicates nonconvergence.
Define $\{\theta^t\}$, where $t = 1, \ldots, n$, to be the collection of a single Markov chain output. The parameter $\theta^t$ is the $t$th sample of the Markov chain. For notational simplicity, $\theta$ is assumed to be single
dimensional in this section.
Suppose you have $M$ parallel MCMC chains that were initialized from various parts of the target
distribution. Each chain is of length $n$ (after discarding the burn-in). For each $\theta^t$, the simulations
are labeled as $\theta_m^t$, where $t = 1, \ldots, n$ and $m = 1, \ldots, M$. The between-chain variance $B$ and the
within-chain variance $W$ are calculated as
\[ B = \frac{n}{M-1} \sum_{m=1}^{M} \left( \bar{\theta}_m^{\cdot} - \bar{\theta}_{\cdot}^{\cdot} \right)^2, \quad \text{where } \bar{\theta}_m^{\cdot} = \frac{1}{n} \sum_{t=1}^{n} \theta_m^t, \quad \bar{\theta}_{\cdot}^{\cdot} = \frac{1}{M} \sum_{m=1}^{M} \bar{\theta}_m^{\cdot} \]
\[ W = \frac{1}{M} \sum_{m=1}^{M} s_m^2, \quad \text{where } s_m^2 = \frac{1}{n-1} \sum_{t=1}^{n} \left( \theta_m^t - \bar{\theta}_m^{\cdot} \right)^2 \]
The posterior marginal variance, $\mathrm{var}(\theta \mid y)$, is a weighted average of $W$ and $B$. The estimate of the
variance is
\[ \hat{V} = \frac{n-1}{n} W + \frac{M+1}{nM} B \]
If all $M$ chains have reached the target distribution, this posterior variance estimate should be very
close to the within-chain variance $W$. Therefore, you would expect to see the ratio $\hat{V}/W$ be close
to 1. The square root of this ratio is referred to as the potential scale reduction factor (PSRF). A
large PSRF indicates that the between-chain variance is substantially greater than the within-chain
variance, so that longer simulation is needed. If the PSRF is close to 1, you can conclude that each
of the M chains has stabilized, and they are likely to have reached the target distribution.
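The unrefined PSRF follows directly from the formulas for $B$, $W$, and $\hat{V}$. The sketch below assumes that the chains are given as equal-length lists of draws with burn-in already removed; the function name `psrf` and the two synthetic sets of chains are illustrative assumptions.

```python
import math
import random


def psrf(chains):
    """Potential scale reduction factor sqrt(V / W) for M chains of length n."""
    M = len(chains)
    n = len(chains[0])
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / M
    # Between-chain variance B and within-chain variance W.
    B = n / (M - 1) * sum((m - grand_mean) ** 2 for m in chain_means)
    W = sum(
        sum((x - m) ** 2 for x in c) / (n - 1)
        for c, m in zip(chains, chain_means)
    ) / M
    V = (n - 1) / n * W + (M + 1) / (n * M) * B
    return math.sqrt(V / W)


rng = random.Random(5)
# Three chains that all sample the same N(0, 1) distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
# Three chains stuck around different values.
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 5.0, 10.0)]
psrf_mixed = psrf(mixed)
psrf_stuck = psrf(stuck)
print(f"PSRF for agreeing chains {psrf_mixed:.3f}, for separated chains {psrf_stuck:.3f}")
```

The agreeing chains give a value near 1, while the separated chains give a value far above 1 because $B$ dominates $W$.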
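A refined version of PSRF is calculated, as suggested by Brooks and Gelman (1997), as
\[ \hat{R}_c = \sqrt{ \frac{\hat{d} + 3}{\hat{d} + 1} \cdot \frac{\hat{V}}{W} } = \sqrt{ \frac{\hat{d} + 3}{\hat{d} + 1} \left( \frac{n-1}{n} + \frac{M+1}{nM} \cdot \frac{B}{W} \right) } \]
where
\[ \hat{d} = \frac{2 \hat{V}^2}{\widehat{\mathrm{Var}}(\hat{V})} \]
and
\[ \widehat{\mathrm{Var}}(\hat{V}) = \left( \frac{n-1}{n} \right)^2 \frac{1}{M}\, \widehat{\mathrm{Var}}(s_m^2) + \left( \frac{M+1}{nM} \right)^2 \frac{2}{M-1}\, B^2 + 2\, \frac{(M+1)(n-1)}{n^2 M} \cdot \frac{n}{M} \left( \widehat{\mathrm{cov}}\left( s_m^2, (\bar{\theta}_m^{\cdot})^2 \right) - 2\, \bar{\theta}_{\cdot}^{\cdot}\, \widehat{\mathrm{cov}}\left( s_m^2, \bar{\theta}_m^{\cdot} \right) \right) \]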
All the Bayesian procedures also produce an upper $100(1 - \alpha/2)\%$ confidence limit of $\hat{R}_c$. Gelman
and Rubin (1992) showed that the ratio $B/W$ in $\hat{R}_c$ has an $F$ distribution with degrees of freedom
$M - 1$ and $2W^2 M / \widehat{\mathrm{Var}}(s_m^2)$. Because you are concerned only if the scale is large, not small, only
the upper $100(1 - \alpha/2)\%$ confidence limit is reported. This is written as
\[ \sqrt{ \left( \frac{n-1}{n} + \frac{M+1}{nM} \cdot F_{1-\alpha/2}\left( M - 1,\ \frac{2W^2}{\widehat{\mathrm{Var}}(s_m^2)/M} \right) \right) \cdot \frac{\hat{d} + 3}{\hat{d} + 1} } \]
In the Bayesian procedures, you can specify the number of chains that you want to run. Typically
three chains are sufficient. The first chain is used for posterior inference, such as mean and standard deviation; the other $M - 1$ chains are used for computing the diagnostics and are discarded
afterward. This test can be computationally costly, because it prolongs the simulation $M$-fold.
It is best to choose different initial values for all M chains. The initial values should be as dispersed
from each other as possible so that the Markov chains can fully explore different parts of the distribution before they converge to the target. Similar initial values can be risky because all of the
chains can get stuck in a local maximum; that is something this convergence test cannot detect. If
you do not supply initial values for all the different chains, the procedures generate them for you.
Geweke Diagnostics
The Geweke test (Geweke 1992) compares values in the early part of the Markov chain to those in
the latter part of the chain in order to detect failure of convergence. The statistic is constructed as
follows. Two subsequences of the Markov chain $\{\theta^t\}$ are taken out, with $\{\theta_1^t : t = 1, \ldots, n_1\}$ and
$\{\theta_2^t : t = n_a, \ldots, n\}$, where $1 < n_1 < n_a < n$. Let $n_2 = n - n_a + 1$, and define
\[ \bar{\theta}_1 = \frac{1}{n_1} \sum_{t=1}^{n_1} \theta^t \quad \text{and} \quad \bar{\theta}_2 = \frac{1}{n_2} \sum_{t=n_a}^{n} \theta^t \]
Let $\hat{s}_1(0)$ and $\hat{s}_2(0)$ denote consistent spectral density estimates at zero frequency (see the subsection “Spectral Density Estimate at Zero Frequency” on page 164 for estimation details) for the two
MCMC chains, respectively. If the ratios $n_1/n$ and $n_2/n$ are fixed, $(n_1 + n_2)/n < 1$, and the chain
is stationary, then the following statistic converges to a standard normal distribution as $n \to \infty$:
\[ Z_n = \frac{\bar{\theta}_1 - \bar{\theta}_2}{\sqrt{ \dfrac{\hat{s}_1(0)}{n_1} + \dfrac{\hat{s}_2(0)}{n_2} }} \]
This is a two-sided test, and large absolute z-scores indicate rejection.
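The Geweke statistic can be sketched as follows. Two choices in this example are assumptions, not the procedures' method: the spectral density at zero frequency is approximated by a truncated sum of sample autocovariances (a simple stand-in estimator), and the early and late segments are taken to be the first 10% and the last 50% of the chain. The function names and the synthetic chains are hypothetical.

```python
import math
import random


def spectral_at_zero(x, max_lag=50):
    """Stand-in estimate of the zero-frequency spectral density: s_0 plus twice
    the sample autocovariances, truncated at the first non-positive lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    total = sum(v * v for v in d) / n
    for h in range(1, min(max_lag, n - 1) + 1):
        s_h = sum(d[t] * d[t + h] for t in range(n - h)) / n
        if s_h <= 0.0:
            break
        total += 2.0 * s_h
    return total


def geweke_z(chain, first=0.1, last=0.5):
    n = len(chain)
    early = chain[: int(first * n)]
    late = chain[n - int(last * n):]
    mean_early = sum(early) / len(early)
    mean_late = sum(late) / len(late)
    se = math.sqrt(spectral_at_zero(early) / len(early)
                   + spectral_at_zero(late) / len(late))
    return (mean_early - mean_late) / se


rng = random.Random(6)
n = 4000
stationary = [rng.gauss(0.0, 1.0) for _ in range(n)]
drifting = [3.0 * t / n + rng.gauss(0.0, 1.0) for t in range(n)]
z_stationary = geweke_z(stationary)
z_drifting = geweke_z(drifting)
print(f"z for stationary chain {z_stationary:.2f}, for drifting chain {z_drifting:.2f}")
```

The drifting chain yields a large negative z-score because its early mean is well below its late mean.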
Spectral Density Estimate at Zero Frequency
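For one sequence of the Markov chain $\{\theta_t\}$, the relationship between the $h$-lag covariance sequence
of a time series and the spectral density, $f$, is
\[ s_h = \frac{1}{2\pi} \int_{-\pi}^{\pi} \exp(i \omega h)\, f(\omega)\, d\omega \]
where $i$ indicates that $\omega h$ is the complex argument. Inverting this Fourier integral,
\[ f(\omega) = \sum_{h=-\infty}^{\infty} s_h \exp(-i \omega h) = s_0 \left( 1 + 2 \sum_{h=1}^{\infty} \rho_h \cos(\omega h) \right) \]
It follows that
\[ f(0) = \sigma^2 \left( 1 + 2 \sum_{h=1}^{\infty} \rho_h \right) \]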
which gives an autocorrelation adjusted estimate of the variance. In this equation, $\sigma^2$ is the naive
variance estimate of the sequence $\{\theta_t\}$ and $\rho_h$ is the lag $h$ autocorrelation. Due to obvious computational difficulties, such as calculation of autocorrelation at infinity, you cannot effectively estimate
$f(0)$ by using the preceding formula. The usual route is to first obtain the periodogram $p(\omega)$ of
the sequence, and then estimate $f(0)$ by smoothing the estimated periodogram. The periodogram
is defined to be
\[ p(\omega) = \frac{1}{n} \left[ \left( \sum_{t=1}^{n} \theta_t \sin(\omega t) \right)^2 + \left( \sum_{t=1}^{n} \theta_t \cos(\omega t) \right)^2 \right] \]
The procedures use the following way to estimate $\hat{f}(0)$ from $p$ (Heidelberger and Welch 1981). In
$p(\omega)$, let $\omega = \omega_k = 2\pi k / n$ and $k = 1, \ldots, [n/2]$ (see footnote 1). A smooth spectral density in the domain of
$(0, \pi]$ is obtained by fitting a gamma model with the log link function, using $p(\omega_k)$ as response and
$x_1(\omega_k) = \sqrt{3}\,(4\omega_k / (2\pi) - 1)$ as the only regressor. The predicted value $\hat{f}(0)$ is given by
\[ \hat{f}(0) = \exp\left( \hat{\beta}_0 - \sqrt{3}\, \hat{\beta}_1 \right) \]
where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the estimates of the intercept and slope parameters, respectively.

Footnote 1: This is equivalent to the fast Fourier transformation of the original time series $\theta_t$.
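The periodogram-smoothing estimate can be sketched in plain Python. This is an illustration under stated assumptions, not the procedures' code: the periodogram is evaluated by direct summation at every frequency $\omega_k$, and the gamma model with log link is fit by iteratively reweighted least squares. For that model and link the working weights are all equal, so each iteration reduces to an ordinary least squares fit of the working response. The function names and the white-noise test sequence are hypothetical.

```python
import math
import random


def periodogram(x):
    """Periodogram p(w_k) at w_k = 2 pi k / n for k = 1, ..., floor(n / 2)."""
    n = len(x)
    freqs, values = [], []
    for k in range(1, n // 2 + 1):
        w = 2.0 * math.pi * k / n
        s = sum(x[t - 1] * math.sin(w * t) for t in range(1, n + 1))
        c = sum(x[t - 1] * math.cos(w * t) for t in range(1, n + 1))
        freqs.append(w)
        values.append((s * s + c * c) / n)
    return freqs, values


def fit_f0(freqs, values, n_iter=50):
    """Gamma regression with log link of p(w_k) on x_1(w_k); returns f(0) estimate."""
    xs = [math.sqrt(3.0) * (4.0 * w / (2.0 * math.pi) - 1.0) for w in freqs]
    m = len(xs)
    x_bar = sum(xs) / m
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b0, b1 = math.log(sum(values) / m), 0.0   # start from a constant fit
    for _ in range(n_iter):
        # Working response for the log link: eta + (y - mu) / mu.
        z = []
        for x, y in zip(xs, values):
            eta = b0 + b1 * x
            z.append(eta + y / math.exp(eta) - 1.0)
        z_bar = sum(z) / m
        b1 = sum((x - x_bar) * (zi - z_bar) for x, zi in zip(xs, z)) / s_xx
        b0 = z_bar - b1 * x_bar
    return math.exp(b0 - math.sqrt(3.0) * b1)


rng = random.Random(7)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(512)]
freqs, values = periodogram(white_noise)
f0_hat = fit_f0(freqs, values)
print(f"estimated f(0) = {f0_hat:.3f} (white noise with unit variance has f(0) = 1)")
```

For uncorrelated draws every autocorrelation $\rho_h$ is zero, so $f(0) = \sigma^2$, which gives a simple check on the estimate.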
Heidelberger and Welch Diagnostics
The Heidelberger and Welch test (Heidelberger and Welch 1981, 1983) consists of two parts: a
stationary portion test and a half-width test. The stationarity test assesses the stationarity of a
Markov chain by testing the hypothesis that the chain comes from a covariance stationary process.
The half-width test checks whether the Markov chain sample size is adequate to estimate the mean
values accurately.
Given $\{\theta^t\}$, set $S_0 = 0$, $S_n = \sum_{t=1}^{n} \theta^t$, and $\bar{\theta} = (1/n) \sum_{t=1}^{n} \theta^t$. You can construct the following
sequence with $s$ coordinates on values from $\frac{1}{n}, \frac{2}{n}, \ldots, 1$:
\[ B_n(s) = \frac{S_{[ns]} - [ns]\, \bar{\theta}}{(n\, \hat{p}(0))^{1/2}} \]
where $[\,\cdot\,]$ is the rounding operator, and $\hat{p}(0)$ is an estimate of the spectral density at zero frequency
that uses the second half of the sequence (see the section “Spectral Density Estimate at Zero Frequency” on page 164 for estimation details). For large $n$, $B_n$ converges in distribution to a Brownian bridge (Billingsley 1986). So you can construct a test statistic by using $B_n$. The statistic used
in these procedures is the Cramer–von Mises statistic (see footnote 2); that is, $\int_0^1 B_n(s)^2\, ds = CVM(B_n)$. As
$n \to \infty$, the statistic converges in distribution to a standard Cramer–von Mises distribution. The
integral $\int_0^1 B_n(s)^2\, ds$ is numerically approximated using Simpson’s rule.
Let $y_i = B_n(s)^2$, where $s = 0, \frac{1}{n}, \ldots, \frac{n-1}{n}, 1$, and $i = ns = 0, 1, \ldots, n$. If $n$ is even, let $m = n/2$;
otherwise, let $m = (n-1)/2$. The Simpson’s approximation to the integral is
\[ \int_0^1 B_n(s)^2\, ds \approx \frac{1}{3n} \left[ y_0 + 4 (y_1 + \cdots + y_{2m-1}) + 2 (y_2 + \cdots + y_{2m-2}) + y_{2m} \right] \]
Note that Simpson’s rule requires an even number of intervals. When $n$ is odd, $y_n$ is set to be 0 and
the value does not contribute to the approximation.
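The construction of $B_n$ and the Simpson approximation can be sketched together. In this example the spectral density estimate $\hat{p}(0)$ is passed in as a known value rather than estimated, which is a simplifying assumption; the function name and the two synthetic chains are hypothetical.

```python
import math
import random


def cvm_statistic(chain, p0):
    """Simpson approximation to the integral of B_n(s)^2 over [0, 1].

    p0 is the spectral density at zero frequency, supplied by the caller.
    """
    n = len(chain)
    mean = sum(chain) / n
    y = [0.0]                      # y_0 = B_n(0)^2 = 0 because S_0 = 0
    partial_sum = 0.0
    for i in range(1, n + 1):
        partial_sum += chain[i - 1]
        b = (partial_sum - i * mean) / math.sqrt(n * p0)
        y.append(b * b)
    m = n // 2                     # n / 2 if n is even, (n - 1) / 2 if n is odd
    odd_terms = sum(y[1:2 * m:2])          # y_1, y_3, ..., y_{2m-1}
    even_terms = sum(y[2:2 * m - 1:2])     # y_2, y_4, ..., y_{2m-2}
    return (y[0] + 4.0 * odd_terms + 2.0 * even_terms + y[2 * m]) / (3.0 * n)


rng = random.Random(8)
n = 1000
stationary = [rng.gauss(0.0, 1.0) for _ in range(n)]
trending = [5.0 * t / n + rng.gauss(0.0, 1.0) for t in range(n)]
cvm_stationary = cvm_statistic(stationary, p0=1.0)
cvm_trending = cvm_statistic(trending, p0=1.0)
print(f"statistic for stationary chain {cvm_stationary:.3f}, "
      f"for trending chain {cvm_trending:.1f}")
```

A chain with a trend makes the partial sums deviate systematically from $[ns]\,\bar{\theta}$, which inflates the statistic far beyond the values seen for a stationary chain.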
This test can be performed repeatedly on the same chain, and it helps you identify a time t when
the chain has reached stationarity. The whole chain, $\{\theta^t\}$, is first used to construct the Cramer–von
Mises statistic. If it passes the test, you can conclude that the entire chain is stationary. If it fails
the test, you drop the initial 10% of the chain and redo the test by using the remaining 90%. This
process is repeated until either a time t is selected or it reaches a point where there are not enough
data remaining to construct a confidence interval (the cutoff proportion is set to be 50%).
The part of the chain that is deemed stationary is put through a half-width test, which reports whether
the sample size is adequate to meet certain accuracy requirements for the mean estimates. Running
the simulation less than this length of time would not meet the requirement, while running it longer
would not provide any additional information that is needed. The statistic calculated here is the
relative half-width (RHW) of the confidence interval. The RHW for a confidence interval of level
$1 - \alpha$ is
\[ \mathrm{RHW} = \frac{z_{(1-\alpha/2)}\, (\hat{s}_n / n)^{1/2}}{\hat{\theta}} \]
where $z_{(1-\alpha/2)}$ is the z-score of the $100(1 - \alpha/2)$th percentile (for example, $z_{(1-\alpha/2)} = 1.96$
if $\alpha = 0.05$), $\hat{s}_n$ is the variance of the chain estimated using the spectral density method (see
explanation in the section “Spectral Density Estimate at Zero Frequency” on page 164), $n$ is the
length, and $\hat{\theta}$ is the estimated mean. The RHW quantifies accuracy of the $1 - \alpha$ level confidence
interval of the mean estimate by measuring the ratio between the sample standard error of the
mean and the mean itself. In other words, you can stop the Markov chain if the variability of the
mean stabilizes with respect to the mean. An implicit assumption is that large means are often
accompanied by large variances. If this assumption is not met, then this test can produce false
rejections (such as a small mean around 0 and large standard deviation) or false acceptance (such
as a very large mean with relatively small variance). As with any other convergence diagnostics, you
might want to exercise caution in interpreting the results.

Footnote 2: The von Mises distribution was first introduced by von Mises (1918). The density function is $p(\theta \mid \mu, \kappa) \sim M(\mu, \kappa) = [2\pi I_0(\kappa)]^{-1} \exp(\kappa \cos(\theta - \mu))$ $(0 \le \theta \le 2\pi)$, where the function $I_0(\kappa)$ is the modified Bessel function of the first kind and order zero, defined by $I_0(\kappa) = (2\pi)^{-1} \int_0^{2\pi} \exp(\kappa \cos(\theta - \mu))\, d\theta$.

The stationarity test is one-sided; rejection occurs when the p-value is greater than $1 - \alpha$. To perform
the half-width test, you need to select an $\alpha$ level (the default of which is 0.05) and a predetermined
tolerance value $\epsilon$ (the default of which is 0.1). If the calculated RHW is greater than $\epsilon$, you conclude
that there are not enough data to accurately estimate the mean with $1 - \alpha$ confidence under tolerance
of $\epsilon$.
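The half-width check reduces to a short calculation. The sketch below takes the spectral variance estimate $\hat{s}_n$, the chain length, and the estimated mean as inputs; the function names are hypothetical, and using the absolute value of the mean in the denominator is an assumption made so that negative means give a positive ratio.

```python
import math
from statistics import NormalDist


def relative_half_width(s_n, n, mean, alpha=0.05):
    """RHW = z * sqrt(s_n / n) / |mean| for a (1 - alpha) confidence interval."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * math.sqrt(s_n / n) / abs(mean)


def passes_half_width_test(rhw, eps=0.1):
    # The test fails when the RHW exceeds the tolerance eps.
    return rhw <= eps


rhw_large_mean = relative_half_width(s_n=4.0, n=10000, mean=2.0)
rhw_small_mean = relative_half_width(s_n=4.0, n=10000, mean=0.05)
print(f"RHW {rhw_large_mean:.4f} passes: {passes_half_width_test(rhw_large_mean)}")
print(f"RHW {rhw_small_mean:.4f} passes: {passes_half_width_test(rhw_small_mean)}")
```

The second call uses the same variance and chain length but a mean near 0, and it fails the test. This is the false-rejection case mentioned above.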
Raftery and Lewis Diagnostics
If your interest lies in posterior percentiles, you want a diagnostic test that evaluates the accuracy of
the estimated percentiles. The Raftery-Lewis test (Raftery and Lewis 1992, 1996) is designed for
this purpose. Notation and deductions here closely resemble those in Raftery and Lewis (1996).
Suppose you are interested in a quantity $\theta_q$ such that $P(\theta \le \theta_q \mid y) = q$, where $q$ can be an
arbitrary cumulative probability, such as 0.025. This $\theta_q$ can be empirically estimated by finding
the $[nq]$th number of the sorted $\{\theta^t\}$. Let $\hat{\theta}_q$ denote the estimate, which corresponds to an
estimated probability $P(\theta \le \hat{\theta}_q) = \hat{P}_q$. Because the simulated posterior distribution converges to
the true distribution as the simulation sample size grows, $\hat{\theta}_q$ can achieve any degree of accuracy if
the simulator is run for a very long time. However, running too long a simulation can be wasteful.
Alternatively, you can use coverage probability to measure accuracy and stop the chain when a
certain accuracy is reached.
A stopping criterion is reached when the estimated probability is within $\pm r$ of the true cumulative
probability $q$, with probability $s$, such as $P(\hat{P}_q \in (q - r, q + r)) = s$. For example, suppose you
want the coverage probability $s$ to be 0.95 and the amount of tolerance $r$ to be 0.005. This corresponds to requiring that the estimate of the cumulative distribution function of the 2.5th percentile
be estimated to within $\pm 0.5$ percentage points with probability 0.95.
The Raftery-Lewis diagnostics test finds the number of iterations, $M$, that need to be discarded
(burn-ins) and the number of iterations needed, $N$, to achieve a desired precision. Given a predefined cumulative probability $q$, these procedures first find $\hat{\theta}_q$, and then they construct a binary 0-1
process $\{Z_t\}$ by setting $Z_t = 1$ if $\theta^t \le \hat{\theta}_q$ and 0 otherwise for all $t$. The sequence $\{Z_t\}$ is itself not
a Markov chain, but you can construct a subsequence of $\{Z_t\}$ that is approximately Markovian if it
is sufficiently $k$-thinned. When $k$ becomes reasonably large, $\{Z_t^{(k)}\}$ starts to behave like a Markov
chain.
Next, the procedures find this thinning parameter $k$. The number $k$ is estimated by comparing the
Bayesian information criterion (BIC) between two Markov models: a first-order and a second-order
Markov model. A $j$th-order Markov model is one in which the current value of $\{Z_t^{(k)}\}$ depends on
the previous $j$ values. For example, in a second-order Markov model,
\[ p\left( Z_t^{(k)} = z_t \mid Z_{t-1}^{(k)} = z_{t-1}, Z_{t-2}^{(k)} = z_{t-2}, \ldots, Z_0^{(k)} = z_0 \right) = p\left( Z_t^{(k)} = z_t \mid Z_{t-1}^{(k)} = z_{t-1}, Z_{t-2}^{(k)} = z_{t-2} \right) \]
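where $z_i = \{0, 1\}$, $i = 0, \ldots, t$. Given $\{Z_t^{(k)}\}$, you can construct two transition count matrices for
a second-order Markov model:

For $z_t = 0$:

| | $z_{t-1} = 0$ | $z_{t-1} = 1$ |
|---|---|---|
| $z_{t-2} = 0$ | $w_{000}$ | $w_{010}$ |
| $z_{t-2} = 1$ | $w_{100}$ | $w_{110}$ |

For $z_t = 1$:

| | $z_{t-1} = 0$ | $z_{t-1} = 1$ |
|---|---|---|
| $z_{t-2} = 0$ | $w_{001}$ | $w_{011}$ |
| $z_{t-2} = 1$ | $w_{101}$ | $w_{111}$ |
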
For each $k$, the procedures calculate the BIC that compares the two Markov models. The BIC is
based on a likelihood ratio test statistic that is defined as
\[
G_k^2 = 2 \sum_{i=0}^{1} \sum_{j=0}^{1} \sum_{l=0}^{1} w_{ijl} \log \frac{w_{ijl}}{\hat{w}_{ijl}}
\]
where $\hat{w}_{ijl}$ is the expected cell count of $w_{ijl}$ under the null model, the first-order Markov model,
where the assumption $(Z_t^{(k)} \perp Z_{t-2}^{(k)}) \mid Z_{t-1}^{(k)}$ holds. The formula for the expected cell count is
\[
\hat{w}_{ijl} = \frac{\sum_i w_{ijl} \, \sum_l w_{ijl}}{\sum_i \sum_l w_{ijl}}
\]
The BIC is $G_k^2 - 2 \log(n_k - 2)$, where $n_k$ is the $k$-thinned sample size (every $k$th sample starting
with the first), with the last two data points discarded due to the construction of the second-order
Markov model. The thinning parameter $k$ is the smallest $k$ for which the BIC is negative. When $k$
is found, you can estimate a transition probability matrix between state 0 and state 1 for $\{Z_t^{(k)}\}$:
\[
Q = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}
\]
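A minimal Python sketch of the thinning search follows. It counts the second-order triples $w_{ijl}$ (with $i = z_{t-2}$, $j = z_{t-1}$, $l = z_t$), evaluates $G_k^2$ and the BIC, and returns the smallest $k$ whose BIC is negative. The function names and the `max_k` search bound are hypothetical, and estimating $\alpha$ and $\beta$ by empirical transition frequencies is an assumption, because the text does not say how the procedures estimate them.

```python
import math
from collections import Counter


def second_order_counts(z):
    """Count triples w[(i, j, l)] with z[t-2] = i, z[t-1] = j, z[t] = l."""
    return Counter(zip(z, z[1:], z[2:]))


def g2_statistic(w):
    """Likelihood ratio statistic: second-order model against first-order model."""
    total_sum = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for l in (0, 1):
                observed = w[(i, j, l)]
                if observed == 0:
                    continue  # an empty cell contributes nothing to the sum
                sum_over_l = sum(w[(i, j, b)] for b in (0, 1))
                sum_over_i = sum(w[(a, j, l)] for a in (0, 1))
                sum_over_il = sum(w[(a, j, b)] for a in (0, 1) for b in (0, 1))
                expected = sum_over_i * sum_over_l / sum_over_il
                total_sum += observed * math.log(observed / expected)
    return 2.0 * total_sum


def bic(z_thinned):
    """BIC = G^2 - 2 log(n_k - 2), with n_k the thinned sample size."""
    return g2_statistic(second_order_counts(z_thinned)) - 2.0 * math.log(len(z_thinned) - 2)


def smallest_thinning(z, max_k=50):
    """Smallest k with negative BIC, or None if no k up to max_k qualifies."""
    for k in range(1, max_k + 1):
        if bic(z[::k]) < 0:
            return k
    return None


def transition_probabilities(z_thinned):
    """Assumed estimator: alpha = P(0 -> 1), beta = P(1 -> 0) from pair counts."""
    pairs = Counter(zip(z_thinned, z_thinned[1:]))
    alpha = pairs[(0, 1)] / (pairs[(0, 0)] + pairs[(0, 1)])
    beta = pairs[(1, 0)] / (pairs[(1, 0)] + pairs[(1, 1)])
    return alpha, beta
```

For example, the period-4 sequence `[0, 0, 1, 1] * 25` clearly needs its second-order history at $k = 1$, but thinning with $k = 2$ turns it into a strictly alternating first-order sequence, so `smallest_thinning` returns 2.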
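Because $\{Z_t^{(k)}\}$ is a Markov chain, its equilibrium distribution exists and is estimated by
\[
\pi = (\pi_0, \pi_1) = \frac{(\beta, \alpha)}{\alpha + \beta}
\]
where $\pi_0 = P(\theta \le \theta_q \mid y)$ and $\pi_1 = 1 - \pi_0$. The goal is to find an iteration number $m$ such that
after $m$ steps, the estimated transition probability $P(Z_m^{(k)} = i \mid Z_0^{(k)} = j)$ is within $\epsilon$ of equilibrium
$\pi_i$ for $i, j = 0, 1$. Let $e_0 = (1, 0)$ and $e_1 = 1 - e_0$. The estimated transition probability after step
$m$ is
\[
P(Z_m^{(k)} = i \mid Z_0^{(k)} = j) = e_j \left[
\begin{pmatrix} \pi_0 & \pi_1 \\ \pi_0 & \pi_1 \end{pmatrix}
+ \frac{(1 - \alpha - \beta)^m}{\alpha + \beta}
\begin{pmatrix} \alpha & -\alpha \\ -\beta & \beta \end{pmatrix}
\right] e_i^{\top}
\]
This probability is within $\epsilon$ of $\pi_i$ when
\[
m = \frac{\log\left( \dfrac{(\alpha + \beta)\,\epsilon}{\max(\alpha, \beta)} \right)}{\log(1 - \alpha - \beta)}
\]
assuming $1 - \alpha - \beta > 0$.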
Therefore, by time $m$, $\{Z_t^{(k)}\}$ is sufficiently close to its equilibrium distribution, and you know that
a total size of $M = mk$ should be discarded as the burn-in.
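As a rough numerical illustration of the burn-in formula, the sketch below evaluates $m$ and $M = mk$ for given $\alpha$, $\beta$, tolerance $\epsilon$, and thinning $k$. The function name `burn_in` and the input values are hypothetical, and rounding $m$ up to the next integer is an assumption rather than something the text specifies.

```python
import math


def burn_in(alpha, beta, epsilon, k):
    """Return (m, M): thinned burn-in steps m and total burn-in M = m * k."""
    lam = 1.0 - alpha - beta
    if not 0.0 < lam < 1.0:
        raise ValueError("the formula assumes 0 < 1 - alpha - beta < 1")
    m = math.log((alpha + beta) * epsilon / max(alpha, beta)) / math.log(lam)
    m = max(math.ceil(m), 0)  # assumption: round up to a whole number of steps
    return m, m * k


# Illustrative inputs: alpha = 0.05, beta = 0.45, epsilon = 0.001, k = 2.
m, M = burn_in(0.05, 0.45, 0.001, 2)
```

With these inputs, $1 - \alpha - \beta = 0.5$ and the ratio inside the logarithm is $1/900$, so $m = \lceil 9.81 \rceil = 10$ and $M = 20$.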
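Next, the procedures estimate $N$, the number of simulations needed to achieve desired accuracy on
percentile estimation. The estimate of $P(\theta \le \theta_q \mid y)$ is $\bar{Z}_n^{(k)} = \frac{1}{n} \sum_{t=1}^{n} Z_t^{(k)}$. For large $n$, $\bar{Z}_n^{(k)}$ is
normally distributed with mean $q$, the true cumulative probability, and variance
\[
\frac{1}{n} \, \frac{(2 - \alpha - \beta)\,\alpha\beta}{(\alpha + \beta)^3}
\]
$P(q - r \le \bar{Z}_n^{(k)} \le q + r) = s$ is satisfied if
\[
n = \frac{(2 - \alpha - \beta)\,\alpha\beta}{(\alpha + \beta)^3}
\left\{ \frac{\Phi^{-1}\left( \frac{s + 1}{2} \right)}{r} \right\}^2
\]
Therefore, $N = nk$.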
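The sample-size formula can be checked numerically with a short Python sketch. The function name `required_iterations` is hypothetical, and rounding $n$ up to an integer is an assumption. A useful sanity check is the independent case $\alpha + \beta = 1$, where the variance factor reduces to $\alpha\beta = q(1 - q)$.

```python
import math
from statistics import NormalDist


def required_iterations(alpha, beta, r, s, k):
    """Return (n, N): thinned sample size n and total iterations N = n * k."""
    z = NormalDist().inv_cdf((s + 1.0) / 2.0)  # Phi^{-1}((s + 1) / 2)
    variance_factor = (2.0 - alpha - beta) * alpha * beta / (alpha + beta) ** 3
    n = math.ceil(variance_factor * (z / r) ** 2)  # assumption: round up
    return n, n * k


# Independent-sample check: alpha = q = 0.025 and beta = 1 - q = 0.975.
n, N = required_iterations(0.025, 0.975, r=0.005, s=0.95, k=1)
```

With $r = 0.005$ and $s = 0.95$, this gives $n = 3746$.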
By using similar reasoning, the procedures first calculate the minimal number of iterations needed
to achieve the desired accuracy, assuming the samples are independent:
\[
N_{\min} = \left\{ \Phi^{-1}\left( \frac{s + 1}{2} \right) \right\}^2 \frac{q(1 - q)}{r^2}
\]
If $\{\theta^t\}$ does not have that required sample size, the Raftery-Lewis test is not carried out. If you still
want to carry out the test, increase the number of Markov chain iterations.
The ratio $N / N_{\min}$ is sometimes referred to as the dependence factor. It measures deviation from
posterior sample independence: the closer it is to 1, the less correlated are the samples. There are
a few things to keep in mind when you use this test. This diagnostic tool is specifically designed
for the percentile of interest and does not provide information about convergence of the chain as
a whole (Brooks and Roberts 1999). In addition, the test can be very sensitive to small changes.
Both $N$ and $N_{\min}$ are inversely proportional to $r^2$, so you can expect to see large variations in
these numbers with small changes to input variables, such as the desired coverage probability or
the cumulative probability of interest. Last, the time until convergence for a parameter can differ
substantially for different cumulative probabilities.
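The sensitivity to $r$ is easy to see numerically. The following sketch computes $N_{\min}$ and the dependence factor. The names `n_min` and `dependence_factor` are hypothetical, and rounding up to an integer is an assumption.

```python
import math
from statistics import NormalDist


def n_min(q, r, s):
    """Minimal number of independent draws for tolerance r and coverage s."""
    z = NormalDist().inv_cdf((s + 1.0) / 2.0)
    return math.ceil(z * z * q * (1.0 - q) / (r * r))  # assumption: round up


def dependence_factor(n_total, q, r, s):
    """Ratio N / N_min; values near 1 suggest nearly independent draws."""
    return n_total / n_min(q, r, s)


baseline = n_min(0.025, 0.005, 0.95)   # q = 0.025, r = 0.005, s = 0.95
tighter = n_min(0.025, 0.0025, 0.95)   # halving r roughly quadruples N_min
```

Here `baseline` is 3746 and `tighter` is 14982, which illustrates the inverse proportionality to $r^2$.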
Autocorrelations
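The sample autocorrelation of lag $h$ is defined in terms of the sample autocovariance function:
\[
\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}, \quad |h| < n
\]
The sample autocovariance function of lag $h$ (of $\{\theta_i^t\}$) is defined by
\[
\hat{\gamma}(h) = \frac{1}{n - h} \sum_{t=1}^{n-h} \left( \theta_i^{t+h} - \bar{\theta}_i \right) \left( \theta_i^t - \bar{\theta}_i \right), \quad 0 \le h < n
\]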
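The two definitions translate directly into code. The sketch below is a plain-Python illustration with hypothetical function names. It uses the $1/(n - h)$ normalization given above, which differs from the $1/n$ normalization that some time-series libraries use.

```python
def autocovariance(x, h):
    """Sample autocovariance of lag h with the 1/(n - h) normalization."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / (n - h)


def autocorrelation(x, h):
    """Sample autocorrelation of lag h, defined for |h| < n."""
    return autocovariance(x, abs(h)) / autocovariance(x, 0)


rho_1 = autocorrelation([1, 2, 3, 4, 5], 1)
```

For the sequence 1, 2, 3, 4, 5, the lag-0 autocovariance is 2 and the lag-1 autocovariance is 1, so `rho_1` is 0.5.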
Effective Sample Size
You can use autocorrelation and trace plots to examine the mixing of a Markov chain. A closely
related measure of mixing is the effective sample size (ESS) (Kass et al. 1998).
ESS is defined as follows:
\[
\mathrm{ESS} = \frac{n}{\tau} = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho_k(\theta)}
\]
where $n$ is the total sample size and $\rho_k(\theta)$ is the autocorrelation of lag $k$ for $\theta$. The quantity $\tau$ is
referred to as the autocorrelation time. To estimate $\tau$, the Bayesian procedures first find a cutoff
point $k$ after which the autocorrelations are very close to zero, and then sum all the $\rho_k$ up to that
point. The cutoff point $k$ is such that $\rho_k < 0.05$ or $\rho_k < 2 s_k$, where $s_k$ is the estimated standard
deviation:
\[
s_k = 2 \sqrt{ \frac{1}{n} \left( 1 + 2 \sum_{j=1}^{k-1} \rho_j^2(\theta) \right) }
\]
ESS and $\tau$ are inversely proportional to each other, and low ESS or high $\tau$ indicates bad mixing of
the Markov chain.
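One possible reading of this truncation rule is sketched below in Python. It is an illustration under stated assumptions, not the procedures' code: the function name is hypothetical, the autocorrelations use the $1/(n - h)$ normalization from the "Autocorrelations" section, and only lags strictly before the cutoff are summed (the text leaves open whether the cutoff lag itself is included).

```python
import math


def effective_sample_size(x):
    """Return (ess, tau) using the cutoff rule rho_k < 0.05 or rho_k < 2 * s_k."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    gamma_0 = sum(c * c for c in centered) / n

    def rho(h):
        # Sample autocorrelation with the 1/(n - h) normalization.
        gamma_h = sum(centered[t + h] * centered[t] for t in range(n - h)) / (n - h)
        return gamma_h / gamma_0

    rho_sum = 0.0      # sum of rho_j over lags before the cutoff
    rho_sq_sum = 0.0   # sum of rho_j^2 for j = 1, ..., k - 1
    for k in range(1, n):
        rho_k = rho(k)
        s_k = 2.0 * math.sqrt((1.0 + 2.0 * rho_sq_sum) / n)
        if rho_k < 0.05 or rho_k < 2.0 * s_k:
            break  # cutoff reached; assumption: this lag is not summed
        rho_sum += rho_k
        rho_sq_sum += rho_k * rho_k
    tau = 1.0 + 2.0 * rho_sum
    return n / tau, tau


ess_alternating, tau_alternating = effective_sample_size([0, 1] * 50)
ess_trend, tau_trend = effective_sample_size(list(range(100)))
```

A strictly alternating sequence has a negative lag-1 autocorrelation, so the cutoff is reached immediately and the ESS equals $n$. A slowly drifting sequence accumulates several large autocorrelations and yields a much smaller ESS.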
Summary Statistics
Let $\theta$ be a $p$-dimensional parameter vector of interest: $\theta = \{\theta_1, \ldots, \theta_p\}$. For each $i \in \{1, \ldots, p\}$,
there are $n$ observations: $\theta_i = \{\theta_i^t, \; t = 1, \ldots, n\}$.
Mean
The posterior mean is calculated by using the following formula:
\[
E(\theta_i \mid y) \approx \bar{\theta}_i = \frac{1}{n} \sum_{t=1}^{n} \theta_i^t, \quad \text{for } i = 1, \ldots, p
\]
Standard Deviation
The sample standard deviation, expressed as a variance, is calculated by using the following formula:
\[
\mathrm{Var}(\theta_i \mid y) \approx s_i^2 = \frac{1}{n - 1} \sum_{t=1}^{n} \left( \theta_i^t - \bar{\theta}_i \right)^2
\]
Standard Error of the Mean Estimate
Suppose you have $n$ iid samples, the mean estimate is $\bar{\theta}_i$, and the sample standard deviation is $s_i$.
The standard error of the estimate is $\hat{\sigma}_i / \sqrt{n}$, with $\hat{\sigma}_i = s_i$. However, positive autocorrelation (see the section
“Autocorrelations” on page 169 for a definition) in the MCMC samples makes this an underestimate.
To take account of the autocorrelation, the Bayesian procedures correct the standard error by
using effective sample size (see the section “Effective Sample Size” on page 169).
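Given an effective sample size of $m$, the standard error for $\bar{\theta}_i$ is $\hat{\sigma}_i / \sqrt{m}$. The procedures use the
following formula, expressed as a variance:
\[
\widehat{\mathrm{Var}}(\bar{\theta}_i) = \frac{1 + 2 \sum_{k=1}^{\infty} \rho_k(\theta_i)}{n} \cdot \frac{\sum_{t=1}^{n} \left( \theta_i^t - \bar{\theta}_i \right)^2}{n - 1}
\]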
The standard error of the mean is also known as the Monte Carlo standard error (MCSE). The
MCSE provides a measurement of the accuracy of the posterior estimates, and small values do not
necessarily indicate that you have recovered the true posterior mean.
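The relationship between the naive standard error and the MCSE can be sketched as follows. The function name `mcse` is hypothetical, and the autocorrelation time `tau` (the factor $1 + 2\sum_k \rho_k$) is assumed to be supplied by the caller, for example from an ESS calculation.

```python
import math
import statistics


def mcse(draws, tau=1.0):
    """Monte Carlo standard error: sqrt(tau * s^2 / n), where tau = 1 + 2 * sum(rho_k)."""
    n = len(draws)
    s2 = statistics.variance(draws)  # sample variance with the n - 1 denominator
    return math.sqrt(tau * s2 / n)


draws = [1, 2, 3, 4, 5]
posterior_mean = statistics.fmean(draws)
naive_se = mcse(draws)              # iid case: s / sqrt(n)
adjusted_se = mcse(draws, tau=4.0)  # same as s / sqrt(m) with m = n / tau
```

With `tau = 4.0` the effective sample size is a quarter of $n$, so the standard error doubles relative to the iid case.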
Percentiles
Sample percentiles are calculated using Definition 5 (see Chapter 4, “The UNIVARIATE Procedure,” in Base SAS Procedures Guide: Statistical Procedures).
Correlation
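Correlation between $\theta_i$ and $\theta_j$ is calculated as
\[
r_{ij} = \frac{\sum_{t=1}^{n} \left( \theta_i^t - \bar{\theta}_i \right) \left( \theta_j^t - \bar{\theta}_j \right)}
{\sqrt{\sum_{t} \left( \theta_i^t - \bar{\theta}_i \right)^2 \sum_{t} \left( \theta_j^t - \bar{\theta}_j \right)^2}}
\]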
Covariance
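Covariance between $\theta_i$ and $\theta_j$ is calculated as
\[
s_{ij} = \sum_{t=1}^{n} \left( \theta_i^t - \bar{\theta}_i \right) \left( \theta_j^t - \bar{\theta}_j \right) / (n - 1)
\]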
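For completeness, the correlation and covariance formulas can be written out directly. This is a small sketch with hypothetical function names, where each argument is the list of draws for one parameter.

```python
import math


def covariance(a, b):
    """Sample covariance s_ij with the n - 1 denominator."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def correlation(a, b):
    """Sample correlation r_ij between two sets of draws."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cross / math.sqrt(ss_a * ss_b)


theta_1 = [1.0, 2.0, 3.0, 4.0]
theta_2 = [2.0, 4.0, 6.0, 8.0]  # exactly 2 * theta_1
cov_12 = covariance(theta_1, theta_2)
corr_12 = correlation(theta_1, theta_2)
```

Because the second list is an exact multiple of the first, the correlation is 1 and the covariance is twice the sample variance of the first list.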
Equal-Tail Credible Interval
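Let $\pi(\theta_i \mid y)$ denote the marginal posterior cumulative distribution function of $\theta_i$. A $100(1 - \alpha)\%$
Bayesian equal-tail credible interval for $\theta_i$ is $\left( \theta_i^{\alpha/2}, \theta_i^{1 - \alpha/2} \right)$, where $\pi\left( \theta_i^{\alpha/2} \mid y \right) = \frac{\alpha}{2}$, and
$\pi\left( \theta_i^{1 - \alpha/2} \mid y \right) = 1 - \frac{\alpha}{2}$. The interval is obtained using the empirical $\frac{\alpha}{2}$th and $\left( 1 - \frac{\alpha}{2} \right)$th percentiles
of $\{\theta_i^t\}$.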
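A sketch of the equal-tail computation follows. It assumes that percentile Definition 5 is the empirical distribution function with averaging: writing $np = j + g$ with integer part $j$, the percentile is $(x_j + x_{j+1})/2$ when $g = 0$ and $x_{j+1}$ otherwise. The function names and the small floating-point tolerance are hypothetical choices for this illustration.

```python
import math


def percentile_def5(sorted_x, p):
    """Percentile by the empirical distribution function with averaging."""
    n = len(sorted_x)
    np_ = n * p
    j = math.floor(np_ + 1e-9)  # integer part, tolerant of rounding error
    g = np_ - j                 # fractional part
    if j >= n:
        return sorted_x[-1]
    if g > 1e-9 or j == 0:
        return sorted_x[j]      # x_(j+1) in 1-based indexing
    return 0.5 * (sorted_x[j - 1] + sorted_x[j])


def equal_tail_interval(draws, alpha=0.05):
    """Empirical (alpha/2, 1 - alpha/2) percentile interval."""
    x = sorted(draws)
    return percentile_def5(x, alpha / 2.0), percentile_def5(x, 1.0 - alpha / 2.0)


interval_95 = equal_tail_interval(range(1, 101), alpha=0.05)
interval_90 = equal_tail_interval(range(1, 101), alpha=0.10)
```

For the values 1 through 100, the 95% interval is (3, 98) because $np$ is fractional at both ends, while the 90% interval averages neighboring order statistics and gives (5.5, 95.5).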
Highest Posterior Density (HPD) Interval
For a definition of an HPD interval, see the section “Interval Estimation” on page 149. The procedures use the Chen-Shao algorithm (Chen and Shao 1999; Chen, Shao, and Ibrahim 2000) to
estimate an empirical HPD interval of $\theta_i$:
1. Sort $\{\theta_i^t\}$ to obtain the ordered values:
   \[
   \theta_{i(1)} \le \theta_{i(2)} \le \cdots \le \theta_{i(n)}
   \]
2. Compute the $100(1 - \alpha)\%$ credible intervals:
   \[
   R_j(n) = \left( \theta_{i(j)}, \; \theta_{i(j + [(1 - \alpha)n])} \right)
   \]
   for $j = 1, 2, \ldots, n - [(1 - \alpha)n]$.
3. The $100(1 - \alpha)\%$ HPD interval, denoted by $R_{j^*}(n)$, is the one with the smallest interval
   width among all credible intervals.
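The three steps above can be sketched in Python as follows. This is an illustration of the sorted-window search as described in the text, not the procedures' implementation. The function name is hypothetical, and $[\cdot]$ is taken to be the integer part.

```python
def hpd_interval(draws, alpha=0.05):
    """Shortest interval spanning [(1 - alpha) * n] order-statistic steps."""
    x = sorted(draws)                       # step 1: order the draws
    n = len(x)
    span = int((1.0 - alpha) * n + 1e-9)    # [(1 - alpha) n], guarded against rounding
    if span < 1 or span >= n:
        raise ValueError("need 1 <= [(1 - alpha) n] < n")
    # Step 2: candidate intervals (x[j], x[j + span]) for j = 0, ..., n - span - 1.
    # Step 3: keep the candidate with the smallest width.
    best = min(range(n - span), key=lambda j: x[j + span] - x[j])
    return x[best], x[best + span]


skewed = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
interval = hpd_interval(skewed, alpha=0.2)
```

For the right-skewed sample above, the two candidate 80% intervals are (0, 8) and (1, 100), so the search returns (0, 8).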
Deviance Information Criterion (DIC)
The deviance information criterion (DIC) (Spiegelhalter et al. 2002) is a model assessment tool, and
it is a Bayesian alternative to Akaike’s information criterion (AIC) and the Bayesian information
criterion (BIC, also known as the Schwarz criterion). The DIC uses the posterior densities, which
means that it takes the prior information into account. The criterion can be applied to nonnested
models and models that have non-iid data. Calculation of the DIC in MCMC is trivial—it does not
require maximization over the parameter space, like the AIC and BIC. A smaller DIC indicates a
better fit to the data set.
Letting $\theta$ be the parameters of the model, the deviance information formula is
\[
\mathrm{DIC} = \overline{D(\theta)} + p_D = D(\bar{\theta}) + 2 p_D
\]
where
$D(\theta) = 2 \left( \log(f(y)) - \log(p(y \mid \theta)) \right)$: deviance, where
    $p(y \mid \theta)$: likelihood function with the normalizing constants.
    $f(y)$: a standardizing term that is a function of the data alone. This term is constant with
    respect to the parameter and is irrelevant when you compare different models that have
    the same likelihood function. Since the term cancels out in DIC comparisons, its calculation
    is often omitted.
    NOTE: You can think of the deviance as the difference in twice the log likelihood between
    the saturated, $f(y)$, and fitted, $p(y \mid \theta)$, models.
$\bar{\theta}$: posterior mean, approximated by $\frac{1}{n} \sum_{t=1}^{n} \theta^t$
$\overline{D(\theta)}$: posterior mean of the deviance, approximated by $\frac{1}{n} \sum_{t=1}^{n} D(\theta^t)$. The expected deviance
    measures how well the model fits the data.
$D(\bar{\theta})$: deviance evaluated at $\bar{\theta}$, equal to $-2 \log(p(y \mid \bar{\theta}))$. It is the deviance evaluated at your “best”
    posterior estimate.
$p_D$: effective number of parameters. It is the difference between the measure of fit and the
    deviance at the estimates: $\overline{D(\theta)} - D(\bar{\theta})$. This term describes the complexity of the model, and
    it serves as a penalization term that corrects deviance’s propensity toward models with more
    parameters.
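A small Python sketch shows how the pieces of the DIC fit together once posterior draws are available. It omits the standardizing term $f(y)$, as the text notes is common. The function names, the one-observation normal likelihood, and the two illustrative draws are all hypothetical.

```python
import math


def dic(log_likelihood, draws):
    """Return (DIC, p_D) from posterior draws, with D(theta) = -2 log p(y | theta).

    draws is a list of parameter vectors (tuples); log_likelihood maps one
    vector to log p(y | theta). The standardizing term f(y) is omitted.
    """
    n = len(draws)
    dim = len(draws[0])

    def deviance(theta):
        return -2.0 * log_likelihood(theta)

    d_bar = sum(deviance(theta) for theta in draws) / n  # posterior mean deviance
    theta_bar = tuple(sum(theta[i] for theta in draws) / n for i in range(dim))
    p_d = d_bar - deviance(theta_bar)                    # effective number of parameters
    return d_bar + p_d, p_d


# Hypothetical example: one observation y = 0 from N(theta, 1).
y = [0.0]


def normal_log_likelihood(theta):
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (yi - theta[0]) ** 2 for yi in y)


dic_value, p_d = dic(normal_log_likelihood, [(-1.0,), (1.0,)])
```

In this toy case the mean deviance is $1 + \log(2\pi)$ and the deviance at the posterior mean $\bar{\theta} = 0$ is $\log(2\pi)$, so $p_D = 1$ and the DIC is $2 + \log(2\pi)$.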
A Bayesian Reading List
This section lists a number of Bayesian textbooks of varying degrees of difficulty and a few tutorial and review papers.
Textbooks
Introductory Books
Berry, D. A. (1996), Statistics: A Bayesian Perspective, London: Duxbury Press.
Bolstad, W. M. (2007), Introduction to Bayesian Statistics, 2nd ed. New York: John Wiley & Sons.
DeGroot, M. H. and Schervish, M. J. (2002), Probability and Statistics, Reading, MA: Addison
Wesley.
Gamerman, D. and Lopes, H. F. (2006), Markov Chain Monte Carlo: Stochastic Simulation for
Bayesian Inference, 2nd ed. London: Chapman & Hall/CRC.
Ghosh, J. K., Delampady, M., and Samanta, T. (2006), An Introduction to Bayesian Analysis, New
York: Springer-Verlag.
Lee, P. M. (2004), Bayesian Statistics: An Introduction, 3rd ed. London: Arnold.
Sivia, D. S. (1996), Data Analysis: A Bayesian Tutorial, Oxford: Oxford University Press.
Intermediate-Level Books
Box, G. E. P., and Tiao, G. C. (1992), Bayesian Inference in Statistical Analysis, New York: John
Wiley & Sons.
Chen, M. H., Shao Q. M., and Ibrahim, J. G. (2000), Monte Carlo Methods in Bayesian Computation, New York: Springer-Verlag.
Gelman, A. and Hill, J. (2006), Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge: Cambridge University Press.
Goldstein, M. and Woof, D. A. (2007), Bayes Linear Statistics: Theory and Methods, New York:
John Wiley & Sons.
Harney, H. L. (2003), Bayesian Inference: Parameter Estimation and Decisions, New York:
Springer-Verlag.
Leonard, T. and Hsu, J. S. (1999), Bayesian Methods: An Analysis for Statisticians and Interdisciplinary Researchers, Cambridge: Cambridge University Press.
Liu, J. S. (2001), Monte Carlo Strategies in Scientific Computing, New York: Springer-Verlag.
Marin, J. M. and Robert, C. P. (2007), Bayesian Core: a Practical Approach to Computational
Bayesian Statistics, New York: Springer-Verlag.
Press, S. J. (2002), Subjective and Objective Bayesian Statistics: Principles, Models, and Applications, 2nd ed. New York: Wiley-Interscience.
Robert, C. P. (2001), The Bayesian Choice, 2nd ed. New York: Springer-Verlag.
Robert, C. P. and Casella, G. (2004), Monte Carlo Statistical Methods, 2nd ed. New York: Springer-Verlag.
Tanner, M. A. (1993), Tools for Statistical Inference: Methods for the Exploration of Posterior
Distributions and Likelihood Functions, New York: Springer-Verlag.
Advanced Titles
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, New York: Springer-Verlag.
Bernardo, J. M. and Smith, A. F. M. (2007), Bayesian Theory, 2nd ed. New York: John Wiley &
Sons.
de Finetti, B. (1992), Theory of Probability, New York: John Wiley & Sons.
Jeffreys, H. (1998), Theory of Probability, Oxford: Oxford University Press.
O’Hagan, A. (1994), Bayesian Inference, volume 2B of Kendall’s Advanced Theory of Statistics,
London: Arnold.
Savage, L. J. (1954), The Foundations of Statistics, New York: John Wiley & Sons.
Books Motivated by Statistical Applications and Data Analysis
Carlin, B. P. and Louis, T. A. (2000), Bayes and Empirical Bayes Methods for Data Analysis, 2nd ed.
London: Chapman & Hall.
Congdon, P. (2006), Bayesian Statistical Modeling, 2nd ed. New York: John Wiley & Sons.
Congdon, P. (2003), Applied Bayesian Modeling, New York: John Wiley & Sons.
Congdon, P. (2005), Bayesian Models for Categorical Data, New York: John Wiley & Sons.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004), Bayesian Data Analysis, 2nd ed.
London: Chapman & Hall.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Tutorial and Review Papers on MCMC
Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995), “Bayesian Computation and Stochastic
Systems,” Statistical Science, 10(1), 3–66.
Casella, G. and George, E. (1992), “Explaining the Gibbs Sampler,” The American Statistician, 46,
167–174.
Chib, S. and Greenberg, E. (1995), “Understanding the Metropolis-Hastings Algorithm,” The American Statistician, 49, 327–335.
Chib, S. and Greenberg, E. (1996), “Markov Chain Monte Carlo Simulation Methods in Econometrics,” Econometric Theory, 12, 409–431.
Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M. (1998), “Markov Chain Monte Carlo in
Practice: A Roundtable Discussion,” The American Statistician, 52(2), 93–100.
References
Amit, Y. (1991), “On Rates of Convergence of Stochastic Relaxation for Gaussian and Non-Gaussian Distributions,” Journal of Multivariate Analysis, 38, 82–99.
Applegate, D., Kannan, R., and Polson, N. (1990), Random Polynomial Time Algorithms for Sampling from Joint Distributions, Technical report, School of Computer Science, Carnegie Mellon
University.
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis, Second Edition, New York:
Springer-Verlag.
Berger, J. O. (2006), “The Case for Objective Bayesian Analysis,” Bayesian Analysis, 3, 385–402,
http://ba.stat.cmu.edu/journal/2006/vol01/issue03/berger.pdf.
Berger, J. O. and Wolpert, R. (1988), The Likelihood Principle, 9, Second Edition, Hayward, California: Institute of Mathematical Statistics, monograph series.
Bernardo, J. M. and Smith, A. F. M. (1994), Bayesian Theory, New York: John Wiley & Sons.
Besag, J. (1974), “Spatial Interaction and the Statistical Analysis of Lattice Systems,” Journal of the
Royal Statistical Society B, 36, 192–236.
Billingsley, P. (1986), Probability and Measure, Second Edition, New York: John Wiley & Sons.
Box, G. E. P. and Tiao, G. C. (1973), Bayesian Inference in Statistical Analysis, Wiley Classics
Library Edition, published 1992, New York: John Wiley & Sons.
Breiman, L. (1968), Probability, Reading, MA: Addison-Wesley.
Brooks, S. P. and Gelman, A. (1997), “General Methods for Monitoring Convergence of Iterative
Simulations,” Journal of Computational and Graphical Statistics, 7, 434–455.
Brooks, S. P. and Roberts, G. O. (1998), “Assessing Convergence of Markov Chain Monte Carlo
Algorithms,” Statistics and Computing, 8, 319–335.
Brooks, S. P. and Roberts, G. O. (1999), “On Quantile Estimation and Markov Chain Monte Carlo
Convergence,” Biometrika, 86(3), 710–717.
Carlin, B. P. and Louis, T. A. (2000), Bayes and Empirical Bayes Methods for Data Analysis,
Second Edition, London: Chapman & Hall.
Casella, G. and George, E. I. (1992), “Explaining the Gibbs Sampler,” The American Statistician,
46, 167–174.
Chan, K. S. (1993), “Asymptotic Behavior of the Gibbs Sampler,” Journal of the American Statistical Association, 88, 320–326.
Chen, M. H. and Shao, Q. M. (1999), “Monte Carlo Estimation of Bayesian Credible and HPD
Intervals,” Journal of Computational and Graphical Statistics, 8, 69–92.
Chen, M. H., Shao, Q. M., and Ibrahim, J. G. (2000), Monte Carlo Methods in Bayesian Computation, New York: Springer-Verlag.
Chib, S. and Greenberg, E. (1995), “Understanding the Metropolis-Hastings Algorithm,” The American Statistician, 49, 327–335.
Congdon, P. (2001), Bayesian Statistical Modeling, John Wiley & Sons.
Congdon, P. (2003), Applied Bayesian Modeling, John Wiley & Sons.
Congdon, P. (2005), Bayesian Models for Categorical Data, John Wiley & Sons.
Cowles, M. K. and Carlin, B. P. (1996), “Markov Chain Monte Carlo Convergence Diagnostics: A
Comparative Review,” Journal of the American Statistical Association, 883–904.
DeGroot, M. H. and Schervish, M. J. (2002), Probability and Statistics, 3rd Edition, Reading, MA:
Addison-Wesley.
Feller, W. (1968), An Introduction to Probability Theory and Its Applications, Third Edition, New
York: John Wiley & Sons.
Gelfand, A. E., Hills, S. E., Racine-Poon, A., and Smith, A. F. M. (1990), “Illustration of Bayesian
Inference in Normal Data Models Using Gibbs Sampling,” Journal of the American Statistical
Association, 85, 972–985.
Gelman, A., Carlin, J., Stern, H., and Rubin, D. (2004), Bayesian Data Analysis, Second Edition,
London: Chapman & Hall.
Gelman, A. and Rubin, D. B. (1992), “Inference from Iterative Simulation Using Multiple Sequences,” Statistical Science, 7, 457–472.
Geman, S. and Geman, D. (1984), “Stochastic Relaxation, Gibbs Distribution, and the Bayesian
Restoration of Images,” IEEE Transaction on Pattern Analysis and Machine Intelligence, 6, 721–
741.
Geweke, J. (1992), “Evaluating the Accuracy of Sampling-Based Approaches to Calculating Posterior Moments,” in J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, eds., Bayesian
Statistics, volume 4, Oxford, UK: Clarendon Press.
Gilks, W. (2003), “Adaptive Metropolis Rejection Sampling (ARMS),” software from MRC
Biostatistics Unit, Cambridge, UK, http://www.maths.leeds.ac.uk/~wally.gilks/
adaptive.rejection/web_page/Welcome.html.
Gilks, W. R., Best, N. G., and Tan, K. K. C. (1995), “Adaptive Rejection Metropolis Sampling with
Gibbs Sampling,” Applied Statistics, 44, 455–472.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gilks, W. R. and Wild, P. (1992), “Adaptive Rejection Sampling for Gibbs Sampling,” Applied
Statistics, 41, 337–348.
Goldstein, M. (2006), “Subjective Bayesian Analysis: Principles and Practice,” Bayesian
Analysis, 3, 403–420, http://ba.stat.cmu.edu/journal/2006/vol01/issue03/
goldstein.pdf.
Hastings, W. K. (1970), “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika, 57, 97–109.
Heidelberger, P. and Welch, P. D. (1981), “A Spectral Method for Confidence Interval Generation
and Run Length Control in Simulations,” Communication of the ACM, 24, 233–245.
Heidelberger, P. and Welch, P. D. (1983), “Simulation Run Length Control in the Presence of an
Initial Transient,” Operations Research, 31, 1109–1144.
Jeffreys, H. (1961), Theory of Probability, Third Edition, Oxford: Oxford University Press.
Karlin, S. and Taylor, H. (1975), A First Course in Stochastic Processes, Second Edition, Orlando,
FL: Academic Press.
Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. (1998), “Markov Chain Monte Carlo in Practice:
A Roundtable Discussion,” The American Statistician, 52, 93–100.
Kass, R. E. and Wasserman, L. (1996), “Formal Rules of Selecting Prior Distributions: A Review
and Annotated Bibliography,” Journal of the American Statistical Association, 91, 343–370.
Liu, C., Wong, W. H., and Kong, A. (1991a), Correlation Structure and Convergence Rate of the
Gibbs Sampler (I): Application to the Comparison of Estimators and Augmentation Scheme,
Technical report, Department of Statistics, University of Chicago.
Liu, C., Wong, W. H., and Kong, A. (1991b), Correlation Structure and Convergence Rate of the
Gibbs Sampler (II): Applications to Various Scans, Technical report, Department of Statistics,
University of Chicago.
Liu, J. S. (2001), Monte Carlo Strategies in Scientific Computing, Springer-Verlag.
MacEachern, S. N. and Berliner, L. M. (1994), “Subsampling the Gibbs Sampler,” The American
Statistician, 48, 188–190.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), “Equation of State Calculations by Fast Computing Machines,” Journal of Chemical Physics, 21, 1087–
1092.
Metropolis, N. and Ulam, S. (1949), “The Monte Carlo Method,” Journal of the American Statistical
Association, 44.
Meyn, S. P. and Tweedie, R. L. (1993), Markov Chains and Stochastic Stability, Berlin: Springer-Verlag.
Press, S. J. (2003), Subjective and Objective Bayesian Statistics, New York: John Wiley & Sons.
Raftery, A. E. and Lewis, S. M. (1992), “One Long Run with Diagnostics: Implementation Strategies for Markov Chain Monte Carlo,” Statistical Science, 7, 493–497.
Raftery, A. E. and Lewis, S. M. (1996), “The Number of Iterations, Convergence Diagnostics and
Generic Metropolis Algorithms,” in W. R. Gilks, D. J. Spiegelhalter, and S. Richardson, eds.,
Markov Chain Monte Carlo in Practice, London, UK: Chapman & Hall.
Robert, C. P. (2001), The Bayesian Choice, Second Edition, New York: Springer-Verlag.
Robert, C. P. and Casella, G. (2004), Monte Carlo Statistical Methods, Second Edition, New York:
Springer-Verlag.
Roberts, G. O. (1996), “Markov Chain Concepts Related to Sampling Algorithms,” in W. R. Gilks,
D. J. Spiegelhalter, and S. Richardson, eds., Markov Chain Monte Carlo in Practice, 45–58,
London: Chapman & Hall.
Rosenthal, J. S. (1991a), Rates of Convergence for Data Augmentation on Finite Sample Spaces,
Technical report, Department of Mathematics, Harvard University.
Rosenthal, J. S. (1991b), Rates of Convergence for Gibbs Sampling for Variance Component Models, Technical report, Department of Mathematics, Harvard University.
Ross, S. M. (1997), Simulation, Second Edition, Orlando, FL: Academic Press.
Schervish, M. J. and Carlin, B. P. (1992), “On the Convergence of Successive Substitution Sampling,” Journal of Computational and Graphical Statistics, 1, 111–127.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van der Linde, A. (2002), “Bayesian Measures
of Model Complexity and Fit,” Journal of the Royal Statistical Society, Series B, 64(4), 583–616,
with discussion.
Tanner, M. A. (1993), Tools for Statistical Inference: Methods for the Exploration of Posterior
Distributions and Likelihood Functions, New York: Springer-Verlag.
Tanner, M. A. and Wong, W. H. (1987), “The Calculation of Posterior Distributions by Data Augmentation,” Journal of the American Statistical Association, 82, 528–550.
Tierney, L. (1994), “Markov Chains for Exploring Posterior Distributions,” Annals of Statistics,
22(4), 1701–1762.
von Mises, R. (1918), “Über die ‘Ganzzahligkeit’ der Atomgewicht und verwandte Fragen,”
Physikal. Z., 19, 490–500.
Wasserman, L. (2004), All of Statistics: A Concise Course in Statistical Inference, New York:
Springer-Verlag.