# Probability distributions and maximum entropy

## 1. Introduction
If we want to assign probabilities to the outcomes of an experiment, and see no reason for one outcome to occur
more often than any other, then the outcomes are assigned equal probabilities. This is called
the principle of insufficient reason, or principle of indifference, and goes back to Laplace. If
we happen to know (or learn) something about the non-uniformity of the outcomes, how
should the assignment of probabilities be changed? There is an extension of the principle of
insufficient reason that suggests what to do. It is called the principle of maximum entropy.
After defining entropy and computing it in some examples, we will describe this principle and
see how it provides a natural conceptual role for many standard probability distributions
(normal, exponential, Laplace, Bernoulli). In particular, the normal distribution will be
seen to have a distinguishing property among all continuous probability distributions on R
that may be simpler for students in an undergraduate probability course to appreciate than
the special role of the normal distribution in the central limit theorem.
This paper is organized as follows. In Sections 2 and 3, we describe the principle of maximum entropy in three basic examples. The explanation of these examples is given in Section
4 as a consequence of a general result (Theorem 4.3). Section 5 provides further illustrations
of the maximum entropy principle, with details largely left to the reader as practice. In
Section 6 we state Shannon’s theorem, which characterizes the entropy function on finite
sample spaces. Finally, in Section 7, we prove a theorem about positivity of maximum entropy distributions in the presence of suitable constraints, and derive a uniqueness theorem.
In that final section entropy is considered on abstract measure spaces, while the earlier part
is written in the language of discrete and continuous probability distributions in order to
be more accessible to a motivated undergraduate studying probability.
## 2. Entropy: definitions and calculations
For a discrete probability distribution p on the countable set {x1, x2, . . . }, with pi = p(xi), the entropy of p is defined as

$$h(p) = -\sum_{i \ge 1} p_i \log p_i.$$

For a continuous probability density function p(x) on an interval I, its entropy is defined as

$$h(p) = -\int_I p(x) \log p(x)\, dx.$$
We set 0 log 0 = 0, which is natural if you look at the graph of x log x. See Figure 1.
This definition of entropy, introduced by Shannon [13], resembles a formula for a thermodynamic notion of entropy. Physically, systems are expected to evolve into states with
higher entropy as they approach equilibrium. In our probabilistic context, h(p) is viewed
as a measure of the information carried by p, with higher entropy corresponding to less
information (more uncertainty, or more of a lack of information).
*Figure 1. Graph of y = x log x.*
Example 2.1. Consider a finite set {x1 , x2 , . . . , xn }. If p(x1 ) = 1 while p(xj ) = 0 for j > 1,
then h(p) = −1 log 1 = 0. In this case, statistics governed by p almost surely produce only
one possible outcome, x1 . We have complete knowledge of what will happen. On the other
hand, if p is the uniform density function, where p(xj ) = 1/n for all j, then h(p) = log n.
We will see later (Theorem 3.1) that every probability density function on {x1 , . . . , xn } has
entropy ≤ log n, and the entropy log n occurs only for the uniform distribution. Heuristically, the probability density function on {x1 , x2 , . . . , xn } with maximum entropy turns out
to be the one that corresponds to the least amount of knowledge of {x1 , x2 , . . . , xn }.
Example 2.2. The entropy of the Gaussian density on R with mean µ and variance σ² is

$$-\int_{\mathbf{R}} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(1/2)((x-\mu)/\sigma)^2}\left(-\log(\sqrt{2\pi}\,\sigma) - \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right) dx = \frac{1}{2}\left(1 + \log(2\pi\sigma^2)\right).$$
The mean µ does not enter the final formula, so all Gaussians with a common σ (Figure 2)
have the same entropy.
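As a sanity check (not part of the paper's argument), the closed form in Example 2.2 can be verified by numerical integration. The sketch below uses only the standard library; the helper name `gaussian_entropy_numeric` is ours, and the Riemann-sum approach is just one convenient way to approximate the integral.

```python
import math

def gaussian_entropy_numeric(mu, sigma, width=12, steps=200_000):
    """Approximate h(p) = -integral of p log p for the N(mu, sigma^2) density
    by a midpoint Riemann sum over [mu - width*sigma, mu + width*sigma];
    the tails beyond that window are negligible."""
    a, b = mu - width * sigma, mu + width * sigma
    dx = (b - a) / steps
    total = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * dx
        p = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)
        if p > 0:
            total -= p * math.log(p) * dx
    return total

def closed_form(sigma):
    """The formula from Example 2.2: (1/2)(1 + log(2 pi sigma^2))."""
    return 0.5 * (1 + math.log(2 * math.pi * sigma ** 2))

# The mean should not matter, only sigma (Figure 2); sigma = 0.1 also
# illustrates that the entropy can be negative for small sigma.
for mu, sigma in [(0.0, 1.0), (5.0, 1.0), (0.0, 0.1)]:
    assert abs(gaussian_entropy_numeric(mu, sigma) - closed_form(sigma)) < 1e-6
```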
*Figure 2. Gaussians with the same σ: same entropy.*
For σ near 0, the entropy of a Gaussian is negative. Graphically, when σ is small, a
substantial piece of the probability density function has values greater than 1, and there
−p log p < 0. For discrete distributions, on the other hand, entropy is always ≥ 0, since
values of a discrete probability density function never exceed 1. That entropy can be
negative in the continuous case simply reflects the fact that probability distributions in the
continuous case can be more concentrated than a uniform distribution on [0, 1].
Example 2.3. The entropy of the exponential density on (0, ∞) with mean λ is

$$-\int_0^\infty \frac{1}{\lambda}\, e^{-x/\lambda}\left(-\log\lambda - \frac{x}{\lambda}\right) dx = 1 + \log\lambda.$$
As in the previous example, this entropy becomes negative for small λ.
We now state the principle of maximum entropy: if we are seeking a probability density
function subject to certain constraints (e.g., a given mean or variance), use the density
satisfying those constraints that has entropy as large as possible. Laplace’s principle of
indifference turns out to correspond to the case of a finite sample space with no constraints
(Example 2.1).
The principle of maximum entropy, as a method of statistical inference, is due to Jaynes
[6, 7, 8]. His idea is that this principle leads to the selection of a probability density function
that is consistent with our knowledge and introduces no unwarranted information. Any
probability density function satisfying the constraints that has smaller entropy will contain
more information (less uncertainty), and thus says something stronger than what we are
assuming. The probability density function with maximum entropy, satisfying whatever
constraints we impose, is the one that should be least surprising in terms of the predictions
it makes.
It is important to clear up an easy misconception: the principle of maximum entropy
does not give us something for nothing. For example, a coin is not fair just because we
don’t know anything about it. In fact, to the contrary, the principle of maximum entropy
guides us to the best probability distribution that reflects our current knowledge and it tells
us what to do if experimental data does not agree with predictions coming from our chosen
distribution: understand why the phenomenon being studied behaves in an unexpected
way (find a previously unseen constraint) and maximize entropy over the distributions that
satisfy all constraints we are now aware of, including the new one.
A proper appreciation of the principle of maximum entropy goes hand in hand with a
certain attitude about the interpretation of probability distributions. A probability distribution can be viewed as: (1) a predictor of frequencies of outcomes over repeated trials,
or (2) a numerical measure of plausibility that some individual situation develops in certain ways. Sometimes the first (frequency) viewpoint is meaningless, and only the second
(subjective) interpretation of probability makes sense. For instance, we can ask about the
probability that civilization will be wiped out by an asteroid in the next 10,000 years, or
the probability that the Red Sox will win the World Series again.
Example 2.4. For an example that mixes the two interpretations of probability (frequency
in repeated trials versus plausibility of a single event), consider a six-sided die rolled 1000
times. The average number of dots that are rolled is 4.7. (A fair die would be expected
to have average 7/2 = 3.5.) What is the probability distribution for the faces on this die?
This is clearly an underdetermined problem. (There are infinitely many 6-tuples (p1, . . . , p6) with $p_i \ge 0$, $\sum_i p_i = 1$, and $\sum_i i\,p_i = 4.7$.) We will return to this question in Section 5.
## 3. Three examples of maximum entropy
We illustrate the principle of maximum entropy in the following three theorems. Proofs
of these theorems are in the next section.
Theorem 3.1. For a probability density function p on a finite set {x1 , . . . , xn },
h(p) ≤ log n,
with equality if and only if p is uniform, i.e., p(xi ) = 1/n for all i.
Concretely, if p1, . . . , pn are nonnegative numbers with $\sum p_i = 1$, then Theorem 3.1 says $-\sum p_i \log p_i \le \log n$, with equality if and only if every pi is 1/n. We are reminded now of the arithmetic-geometric mean inequality, but that is really a different inequality: for positive p1, . . . , pn that sum to 1, the arithmetic-geometric mean inequality says $\sum \log p_i \le -n\log n$, with equality if and only if every pi is 1/n. We will see in Section 7 that the arithmetic-geometric mean inequality is a special case of an inequality for entropies of multivariate Gaussians.
Theorem 3.2. For a continuous probability density function p on R with variance σ²,

$$h(p) \le \frac{1}{2}\left(1 + \log(2\pi\sigma^2)\right),$$

with equality if and only if p is Gaussian with variance σ², i.e., for some µ we have $p(x) = (1/(\sqrt{2\pi}\,\sigma))\,e^{-(1/2)((x-\mu)/\sigma)^2}$.
This describes a conceptual role for Gaussians that is simpler than the Central Limit
Theorem.
Theorem 3.3. For any continuous probability density function p on (0, ∞) with mean λ,
h(p) ≤ 1 + log λ,
with equality if and only if p is exponential with mean λ, i.e., p(x) = (1/λ)e−x/λ .
Theorem 3.3 suggests that for an experiment with positive outcomes whose mean value
is known, the most conservative probabilistic model consistent with that mean value is an
exponential distribution.
Example 3.4. To illustrate the inequality in Theorem 3.3, consider another kind of probability density function on (0, ∞), say $p(x) = ae^{-bx^2}$ for x > 0. (Here the exponent involves $-x^2$ rather than $-x$.) For p to have total integral 1 over (0, ∞) requires $b = (\pi/4)a^2$, and the mean is a/2b. Set λ = a/2b, so p has the same mean as the exponential distribution $(1/\lambda)e^{-x/\lambda}$. The conditions $b = (\pi/4)a^2$ and λ = a/2b let us solve for a and b in terms of λ: $a = 2/(\pi\lambda)$ and $b = 1/(\pi\lambda^2)$. In Figure 3 we plot $ae^{-bx^2}$ and $(1/\lambda)e^{-x/\lambda}$ with the same mean. The inequality $h(p) < 1 + \log\lambda$ from Theorem 3.3 is equivalent (after some algebra) to $1/2 + \log 2 - \log\pi$ being positive. This number is approximately .0484.
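The gap claimed in Example 3.4 can be checked numerically (a hedged sketch, not part of the paper; the helper `entropy` and the choice λ = 3.0 are ours and arbitrary):

```python
import math

lam = 3.0                      # any positive mean works; 3.0 is arbitrary
a = 2 / (math.pi * lam)        # from the normalization b = (pi/4) a^2
b = 1 / (math.pi * lam ** 2)   # and the mean condition lam = a/(2b)

def entropy(p, lo, hi, steps=400_000):
    """Midpoint Riemann sum for -integral of p log p over [lo, hi]."""
    dx = (hi - lo) / steps
    s = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        v = p(x)
        if v > 0:
            s -= v * math.log(v) * dx
    return s

p = lambda x: a * math.exp(-b * x * x)
# p decays like exp(-x^2/(pi*lam^2)), so the integrand is negligible past 60*lam.
h_p = entropy(p, 0.0, 60 * lam)
gap = (1 + math.log(lam)) - h_p
# The gap should be 1/2 + log 2 - log pi, about .0484, independent of lam.
assert abs(gap - (0.5 + math.log(2) - math.log(math.pi))) < 1e-4
```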
*Figure 3. Exponential and other distribution with same mean.*
While we will be concerned with the principle of maximum entropy insofar as it explains
a natural role for various classical probability distributions, the principle is also widely used
for practical purposes in the applied sciences [2, 9, 10].
## 4. Explanation of the three examples
To prove Theorems 3.1, 3.2, and 3.3, we want to maximize h(p) over all probability density
functions p satisfying certain constraints. This can be done using Lagrange multipliers.¹ We will instead prove the theorems by taking advantage of special properties of the functional expression −p log p. The application of Lagrange multipliers to such problems is discussed in Appendix A. It is worth noting that the method of Lagrange multipliers would not, on its own, lead to a logically complete solution of the problem, since the maximum entropy might a priori occur at a probability distribution not accessible to that method (analogous to x = 0 as a minimum of |x|, which is inaccessible to methods of calculus).
Lemma 4.1. For x > 0 and y ≥ 0,
y − y log y ≤ x − y log x,
with equality if and only if x = y.
Note the right side is not x − x log x.
Proof. The result is clear if y = 0, by our convention that 0 log 0 = 0. Thus, let y > 0.
The inequality is equivalent to log(x/y) ≤ x/y − 1, and it is easy to check, for t > 0, that
log t ≤ t − 1 with equality if and only if t = 1 (see Figure 4).
*Figure 4. Graph of y = log t lies below y = t − 1 except at t = 1.*
Lemma 4.2. Let p(x) and q(x) be continuous probability density functions on an interval I in the real numbers, with p ≥ 0 and q > 0 on I. We have

$$-\int_I p\log p\,dx \le -\int_I p\log q\,dx$$

if both integrals exist. Moreover, there is equality if and only if p(x) = q(x) for all x.
¹This is a more interesting application of Lagrange multipliers with multiple constraints than the type usually met in multivariable calculus books.
For discrete probability density functions p and q on a set {x1, x2, . . . }, with p(xi) ≥ 0 and q(xi) > 0 for all i,

$$-\sum_{i\ge 1} p(x_i)\log p(x_i) \le -\sum_{i\ge 1} p(x_i)\log q(x_i)$$

if both sums converge. Moreover, there is equality if and only if p(xi) = q(xi) for all i.
Proof. We carry out the proof in the continuous case. The discrete case is identical, using sums in place of integrals. By Lemma 4.1, for any x ∈ I,

$$p(x) - p(x)\log p(x) \le q(x) - p(x)\log q(x), \tag{4.1}$$

so integrating both sides over the interval I gives

$$\int_I p\,dx - \int_I p\log p\,dx \le \int_I q\,dx - \int_I p\log q\,dx \implies -\int_I p\log p\,dx \le -\int_I p\log q\,dx$$

because $\int_I p\,dx$ and $\int_I q\,dx$ are both 1. If this last inequality of integrals is an equality, then the continuous function

$$q(x) - p(x)\log q(x) - p(x) + p(x)\log p(x)$$

has integral 0 over I, so this nonnegative function is 0. Thus (4.1) is an equality for all x ∈ I, so p(x) = q(x) for all x ∈ I by Lemma 4.1. □
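The discrete case of Lemma 4.2 (often called Gibbs' inequality) is easy to probe numerically. This is a hedged sketch, not part of the paper; the helpers `cross_entropy`, `entropy`, and `normalize` are our own names:

```python
import math
import random

def cross_entropy(p, q):
    """-sum p_i log q_i, with the convention 0 log 0 = 0 on the p side."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """h(p) = -sum p_i log p_i."""
    return cross_entropy(p, p)

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(100):
    n = random.randint(2, 8)
    p = normalize([random.random() for _ in range(n)])
    q = normalize([random.random() for _ in range(n)])
    # Lemma 4.2: -sum p log p <= -sum p log q (equality only when p = q).
    assert entropy(p) <= cross_entropy(p, q) + 1e-12
```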
Theorem 4.3. Let p and q be continuous probability density functions on an interval I, with finite entropy. Assume q(x) > 0 for all x ∈ I. If

$$-\int_I p\log q\,dx = h(q), \tag{4.2}$$

then h(p) ≤ h(q), with equality if and only if p = q.

For discrete probability density functions p and q on {x1, x2, . . . }, with finite entropy, assume q(xi) > 0 for all i. If

$$-\sum_{i\ge 1} p(x_i)\log q(x_i) = h(q), \tag{4.3}$$

then h(p) ≤ h(q), with equality if and only if p(xi) = q(xi) for all i.
Proof. By Lemma 4.2, h(p) is bounded above by $-\int_I p\log q\,dx$ in the continuous case and by $-\sum_i p(x_i)\log q(x_i)$ in the discrete case. This bound is being assumed to equal h(q). If h(p) = h(q), then h(p) equals the bound, in which case Lemma 4.2 tells us p equals q. □

Remark 4.4. The conclusions of Theorem 4.3 are still true, by the same proof, if we use ≤ in place of the equalities in (4.2) and (4.3). However, we will see in all our applications that equality in (4.2) and (4.3) is what occurs when we want to use Theorem 4.3.
The integral $-\int_I p\log q\,dx$ and sum $-\sum_{i\ge 1} p(x_i)\log q(x_i)$ that occur in Theorem 4.3 are not entropies, but merely play a technical role in reaching the desired maximum entropy conclusion in the theorems.
Although Theorem 4.3 allows p to vanish somewhere, it avoids the possibility that q = 0
anywhere. Intuitively, this is because the expression p log q becomes unwieldy around points
where q vanishes and p does not. The relation between maximum entropy and zero sets of
probability density functions will be explored in Section 7.
The proofs of Theorems 3.1, 3.2, and 3.3 are quite simple. We will show in each case
that the constraints (if any) imposed in these theorems imply the constraint (4.2) or (4.3)
from Theorem 4.3 for suitable q.
First we prove Theorem 3.1.
Proof. We give two proofs. The first proof will not use Theorem 4.3. The second will.
A probability density function on {x1 , . . . , xn } is a set of nonnegative real numbers
p1 , . . . , pn that add up to 1. Entropy is a continuous function of the n-tuples (p1 , . . . , pn ),
and these points lie in a compact subset of Rn , so there is an n-tuple where entropy is
maximized. We want to show this occurs at (1/n, . . . , 1/n) and nowhere else.
Suppose the pj are not all equal, say p1 < p2. (Clearly n ≠ 1.) We will find a new
probability density with higher entropy. It then follows, since entropy is maximized at
some n-tuple, that entropy is uniquely maximized at the n-tuple with pi = 1/n for all i.
Since p1 < p2, for small positive ε we have p1 + ε < p2 − ε. The entropy of {p1 + ε, p2 − ε, p3, . . . , pn} minus the entropy of {p1, p2, p3, . . . , pn} equals

$$-p_1\log\frac{p_1+\varepsilon}{p_1} - \varepsilon\log(p_1+\varepsilon) - p_2\log\frac{p_2-\varepsilon}{p_2} + \varepsilon\log(p_2-\varepsilon). \tag{4.4}$$

We want to show this is positive for small enough ε. Rewrite (4.4) as

$$-p_1\log\Big(1+\frac{\varepsilon}{p_1}\Big) - \varepsilon\Big(\log p_1 + \log\Big(1+\frac{\varepsilon}{p_1}\Big)\Big) - p_2\log\Big(1-\frac{\varepsilon}{p_2}\Big) + \varepsilon\Big(\log p_2 + \log\Big(1-\frac{\varepsilon}{p_2}\Big)\Big). \tag{4.5}$$

Since log(1 + x) = x + O(x²) for small x, (4.5) is

$$-\varepsilon - \varepsilon\log p_1 + \varepsilon + \varepsilon\log p_2 + O(\varepsilon^2) = \varepsilon\log(p_2/p_1) + O(\varepsilon^2),$$

which is positive when ε is small enough since p1 < p2.
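The perturbation step can be illustrated numerically: moving a little mass from a larger probability to a smaller one raises the entropy, with leading-order gain ε log(p2/p1). A hedged sketch (our own example values, not from the paper):

```python
import math

def entropy(p):
    """h(p) = -sum p_i log p_i for a finite distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

p = [0.1, 0.4, 0.5]          # p1 < p2, as in the proof
eps = 1e-4
p_new = [p[0] + eps, p[1] - eps, p[2]]

gain = entropy(p_new) - entropy(p)
assert gain > 0
# Leading-order prediction from (4.4)-(4.5): gain is about eps * log(p2/p1),
# with an O(eps^2) correction.
assert abs(gain - eps * math.log(p[1] / p[0])) < 10 * eps ** 2
```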
For the second proof, let p be any probability density function on {x1, . . . , xn}, with pi = p(xi). Letting qi = 1/n for all i,

$$-\sum_{i=1}^n p_i\log q_i = \sum_{i=1}^n p_i\log n = \log n,$$

which is the entropy of q. Therefore Lemma 4.2 says h(p) ≤ h(q), with equality if and only if p is uniform. □
Next, we prove Theorem 3.2.
Proof. Let p be a probability density function on R with variance σ 2 . Let µ be its mean.
(The mean exists by definition of variance.) Letting q be the Gaussian with mean µ and
variance σ 2 ,
$$-\int_{\mathbf{R}} p(x)\log q(x)\,dx = \int_{\mathbf{R}} p(x)\left(\frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right)dx.$$

Splitting up the integral into two integrals, the first is $(1/2)\log(2\pi\sigma^2)$ since $\int_{\mathbf{R}} p(x)\,dx = 1$, and the second is 1/2 since $\int_{\mathbf{R}}(x-\mu)^2 p(x)\,dx = \sigma^2$ by definition. Thus the total integral is $(1/2)\log(2\pi\sigma^2) + 1/2$, which is the entropy of q. □
Finally, we prove Theorem 3.3.
Proof. Let p be a probability density function on the interval (0, ∞) with mean λ. Letting
q(x) = (1/λ)e−x/λ ,
$$-\int_0^\infty p(x)\log q(x)\,dx = \int_0^\infty p(x)\left(\log\lambda + \frac{x}{\lambda}\right)dx.$$

Since p has mean λ, this integral is log λ + 1, which is the entropy of q. □
In each of Theorems 3.1, 3.2, and 3.3, entropy is maximized over distributions on a fixed domain satisfying certain constraints. Table 1 summarizes these extra constraints, which in each case amount to fixing the value of some integral.
| Distribution | Domain | Fixed value |
| --- | --- | --- |
| Uniform | Finite set | None |
| Normal with mean µ | R | $\int_{\mathbf{R}} (x-\mu)^2 p(x)\,dx$ |
| Exponential | (0, ∞) | $\int_0^\infty x\,p(x)\,dx$ |

Table 1. Extra constraints
How does one discover these extra constraints? They come from asking, for a given
distribution q (which we aim to characterize via maximum entropy), what extra information
about distributions p on the same domain is needed so that the equality in (4.2) or (4.3)
takes place. For instance, in the setting of Theorem 3.3, we want to realize an exponential
distribution q(x) = (1/λ)e−x/λ on (0, ∞) as a maximum entropy distribution. For any
distribution p(x) on (0, ∞),
$$\begin{aligned}
-\int_0^\infty p(x)\log q(x)\,dx &= \int_0^\infty p(x)\left(\log\lambda + \frac{x}{\lambda}\right)dx \\
&= (\log\lambda)\int_0^\infty p(x)\,dx + \frac{1}{\lambda}\int_0^\infty x\,p(x)\,dx \\
&= \log\lambda + \frac{1}{\lambda}\int_0^\infty x\,p(x)\,dx.
\end{aligned}$$
To complete this calculation, we need to know the mean of p. This is why, in Theorem 3.3,
the exponential distribution is the one on (0, ∞) with maximum entropy having a given
mean. The reader should consider Theorems 3.1 and 3.2, as well as later characterizations
of distributions in terms of maximum entropy, in this light.
Remark 4.5. For any real-valued random variable X on R with a corresponding probability density function p(x), the entropy of X is defined to be the entropy of p: $h(X) = -\int_{\mathbf{R}} p(x)\log p(x)\,dx$. If X1, X2, . . . is a sequence of independent identically distributed real-valued random variables with mean 0 and variance 1, the central limit theorem says that the sums $S_n = (X_1 + \cdots + X_n)/\sqrt{n}$, which all have mean 0 and variance 1, converge in a suitable sense to the normal distribution N(0, 1) with mean 0 and variance 1. Since N(0, 1) is the unique maximum entropy distribution among continuous probability distributions on R with mean 0 and variance 1, each Sn has entropy less than that of N(0, 1). It's natural to ask if, following thermodynamic intuition, the entropies h(Sn) are monotonically increasing (up to the entropy of N(0, 1)). The answer is yes, which provides an entropic viewpoint on the central limit theorem. See [1].
## 5. More examples of maximum entropy
Other probability density functions can also be characterized via maximum entropy using
Theorem 4.3.
Theorem 5.1. Fix real numbers a < b and µ ∈ (a, b). The continuous probability density function on the interval [a, b] with mean µ that maximizes entropy among all such densities (on [a, b] with mean µ) is a truncated exponential density

$$q_\alpha(x) = \begin{cases} C_\alpha e^{\alpha x}, & \text{if } x \in [a,b], \\ 0, & \text{otherwise,} \end{cases}$$

where $C_\alpha$ is chosen so that $\int_a^b C_\alpha e^{\alpha x}\,dx = 1$, and α is the unique real number such that $\int_a^b C_\alpha x e^{\alpha x}\,dx = \mu$.
This answers a question posted on Math Overflow [11].
Rb
Proof. The normalization condition a Cα eαx dx = 1 tells us Cα = α/(eαb − eαa ). At α = 0
this is 1/(b − a), so q0 (x) is the uniform density on [a, b], whose mean is (a + b)/2.
To show, for each µ ∈ (a, b), that qα (x) has mean µ for exactly one real number α, we
work out a general formula for the mean of qα (x). When α 6= 0, the mean µα of qα (x) is
αx
Z b
xe
beαb − aeαa
1
eαx b
xqα (x) dx = Cα
µα =
− ,
− 2 = αb
αa
α
α
α
e
−
e
a
a
and we set µ0 = (a + b)/2, which is the mean of q0 . In Figure 5 is a graph of µα as α varies.
*Figure 5. A graph of µα as a function of α.*
Let’s check µα has the features suggested by Figure 5:
• It is a monotonically increasing function of α,
• limα→−∞ µα = a, limα→∞ µα = b.
For α ≠ 0, write

$$\mu_\alpha = \frac{b(e^{\alpha b} - e^{\alpha a}) + (b-a)e^{\alpha a}}{e^{\alpha b} - e^{\alpha a}} - \frac{1}{\alpha} = b + \frac{b-a}{e^{\alpha(b-a)} - 1} - \frac{1}{\alpha}, \tag{5.1}$$

and differentiate this last formula with respect to α:

$$\frac{d\mu_\alpha}{d\alpha} = -\frac{(b-a)^2 e^{\alpha(b-a)}}{(e^{\alpha(b-a)} - 1)^2} + \frac{1}{\alpha^2} \stackrel{?}{>} 0 \iff (e^{\alpha(b-a)} - 1)^2 \stackrel{?}{>} (\alpha(b-a))^2 e^{\alpha(b-a)}.$$

Setting t = α(b − a), the right inequality is $(e^t - 1)^2 \stackrel{?}{>} t^2 e^t$, or equivalently $e^t - 2 + e^{-t} \stackrel{?}{>} t^2$, and that inequality holds for all t ≠ 0 since the power series of the left side is a series in even powers of t whose lowest-order term is t². This proves µα is monotonically increasing in α for both α < 0 and α > 0. To put these together, we want to check µα < (a + b)/2 for α < 0 and µα > (a + b)/2 for α > 0: using (5.1),

$$\mu_\alpha - \frac{a+b}{2} = \frac{b-a}{2} + \frac{b-a}{e^{\alpha(b-a)} - 1} - \frac{1}{\alpha} = (b-a)\left(\frac{1}{2} + \frac{1}{e^t - 1} - \frac{1}{t}\right)$$

for t = (b − a)α. As a power series in t, this is $(b-a)(t/12 - t^3/720 + \cdots)$, so for α near 0, µα − (a + b)/2 is positive for α > 0 and negative for α < 0.
To compute limα→−∞ µα and limα→∞ µα , the second formula in (5.1) tells us that
limα→∞ µα = b and limα→−∞ µα = b − (b − a) = a.
Thus for each µ ∈ (a, b), there is a unique α ∈ R such that qα(x) has mean µ.
With α chosen so that qα (x) has mean µ, we want to show the probability density
function qα (x) has maximum entropy among all continuous probability density functions
p(x) on [a, b] with mean µ. Since

$$-\int_a^b p(x)\log q_\alpha(x)\,dx = -\int_a^b p(x)(\log C_\alpha + \alpha x)\,dx = -\log C_\alpha - \alpha\mu = h(q_\alpha),$$

Theorem 4.3 tells us that h(p) ≤ h(qα), with equality if and only if p = qα on [a, b]. □
Taking [a, b] = [0, 1], the plot of qα (x) for several values of α is in Figure 6, with α ≤ 0
on the left and α ≥ 0 on the right. At α = 0 we have µ = 1/2 (uniform). As µ → 0 (that
is, α → −∞) we have qα (x) → 0 for 0 < x ≤ 1, and as µ → 1 (that is, α → ∞) we have
qα (x) → 0 for 0 ≤ x < 1. In the limit, q−∞ (x) is a Dirac mass at 0 while q∞ (x) is a Dirac
mass at 1. These limiting distributions are not continuous.
*Figure 6. Plot of qα(x) on [0, 1] for α = 0, ±.5, ±1, ±2, ±3, ±4, and ±7.*
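In practice α in Theorem 5.1 has no closed form, so one solves µα = µ numerically. A hedged sketch on [a, b] = [0, 1], using bisection and relying on the monotonicity of µα proved above (the helper names are ours):

```python
import math

def mean_alpha(alpha, a=0.0, b=1.0):
    """Mean of the truncated exponential q_alpha on [a, b], via formula (5.1)."""
    if abs(alpha) < 1e-9:
        return (a + b) / 2          # the alpha = 0 (uniform) case
    return (b * math.exp(alpha * b) - a * math.exp(alpha * a)) / \
           (math.exp(alpha * b) - math.exp(alpha * a)) - 1 / alpha

def solve_alpha(mu, a=0.0, b=1.0, lo=-200.0, hi=200.0):
    """Bisection: mean_alpha is increasing in alpha, so this converges."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_alpha(mid, a, b) < mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = solve_alpha(0.7)
assert abs(mean_alpha(alpha) - 0.7) < 1e-9
assert alpha > 0                    # a mean above 1/2 forces alpha > 0
```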
Theorem 5.2. Fix λ > 0 and µ ∈ R. For any continuous probability density function p on R with mean µ such that

$$\lambda = \int_{\mathbf{R}} |x-\mu|\,p(x)\,dx,$$

we have

$$h(p) \le 1 + \log(2\lambda),$$

with equality if and only if p is the Laplace distribution with mean µ and variance 2λ², i.e., $p(x) = (1/(2\lambda))e^{-|x-\mu|/\lambda}$ almost everywhere.
Proof. Left to the reader.
Note the constraint on p in Theorem 5.2 involves $\int_{\mathbf{R}} |x-\mu|\,p(x)\,dx$, not $\int_{\mathbf{R}} (x-\mu)\,p(x)\,dx$.
Theorem 5.3. Fix λ > 0 and µ ∈ R. For any continuous probability density function p on R with mean µ such that

$$\int_{\mathbf{R}} p(x)\log\left(e^{(x-\mu)/(2\lambda)} + e^{-(x-\mu)/(2\lambda)}\right) dx = 1,$$

we have

$$h(p) \le 2 + \log\lambda,$$

with equality if and only if p is the logistic distribution

$$p(x) = \frac{1}{\lambda\left(e^{(x-\mu)/(2\lambda)} + e^{-(x-\mu)/(2\lambda)}\right)^2}.$$
Proof. Left to the reader. Use the change of variables $y = e^{(x-\mu)/(2\lambda)}$ and the integral formulas $\int_0^\infty \frac{\log y}{(1+y)^2}\,dy = 0$ and $\int_1^\infty \frac{\log y}{y^2}\,dy = 1$.
Remark 5.4. The logistic distribution is in practice the simplest smooth probability distribution whose tails decay like $e^{-|x-\mu|/\lambda}$ as |x| → ∞, resembling the non-smooth Laplace distribution.
Now we turn to n-dimensional distributions, generalizing Theorem 3.2. Entropy is defined
in terms of integrals over Rn , and there is an obvious n-dimensional analogue of Lemma
4.2 and Theorem 4.3, which the reader can check.
Theorem 5.5. For a continuous probability density function p on Rⁿ with fixed covariances σij,

$$h(p) \le \frac{1}{2}\left(n + \log((2\pi)^n \det\Sigma)\right),$$

where Σ = (σij) is the covariance matrix for p. There is equality if and only if p is an n-dimensional Gaussian density with covariances σij.

We recall the definition of the covariances σij. For an n-dimensional probability density function p, its means are $\mu_i = \int_{\mathbf{R}^n} x_i\,p(x)\,dx$ and its covariances are

$$\sigma_{ij} = \int_{\mathbf{R}^n} (x_i-\mu_i)(x_j-\mu_j)\,p(x)\,dx. \tag{5.2}$$
In particular, σii > 0. When n = 1, σ11 = σ² in the usual notation. The symmetric matrix Σ = (σij) is positive-definite, since the matrix (⟨vi, vj⟩) is positive-definite for any finite set of linearly independent vi in a real inner product space (V, ⟨·, ·⟩).
Proof. The Gaussian densities on Rⁿ are those probability density functions of the form

$$G(x) = \frac{1}{\sqrt{(2\pi)^n \det\Sigma}}\, e^{-(1/2)(x-\mu)\cdot\Sigma^{-1}(x-\mu)},$$

where Σ := (σij) is a positive-definite symmetric matrix and µ ∈ Rⁿ. Calculations show the i-th mean of G, $\int_{\mathbf{R}^n} x_i\,G(x)\,dx$, is the i-th coordinate of µ, and the covariances of G are the entries σij in Σ. The entropy of G is

$$\frac{1}{2}\left(n + \log((2\pi)^n \det\Sigma)\right),$$

by a calculation left to the reader. (Hint: it helps in the calculation to write Σ as the square of a symmetric matrix.)
Now assume p is any n-dimensional probability density function with means and covariances. Define µi to be the i-th mean of p and (σij) to be the covariance matrix of p. Let G be the n-dimensional Gaussian with the means and covariances of p. The theorem follows from Theorem 4.3 and the equation

$$-\int_{\mathbf{R}^n} p(x)\log G(x)\,dx = \frac{1}{2}\left(n + \log((2\pi)^n \det\Sigma)\right),$$

whose verification boils down to checking that

$$\int_{\mathbf{R}^n} (x-\mu)\cdot\Sigma^{-1}(x-\mu)\,p(x)\,dx = n, \tag{5.3}$$

which is left to the reader. (Hint: Diagonalize the quadratic form corresponding to Σ.) □

The next example, which can be omitted by the reader unfamiliar with the context,
involves entropy for probability density functions on local fields.
Theorem 5.6. Let K be a nonarchimedean local field, with dx the Haar measure, normalized so O_K has measure 1. Fix a positive number r. As p varies over all probability density functions on (K, dx) such that

$$b_r = \int_K |x|^r\,p(x)\,dx$$

is fixed, the maximum value of h(p) occurs precisely at those p such that $p(x) = ce^{-t|x|^r}$ almost everywhere, where c and t are determined by the conditions

$$\int_K ce^{-t|x|^r}\,dx = 1, \qquad \int_K |x|^r\left(ce^{-t|x|^r}\right)dx = b_r.$$

Proof. Left to the reader. □
Theorem 5.6 applies to the real and complex fields, using Lebesgue measure. For example,
the cases r = 1 and r = 2 of Theorem 5.6 with K = R were already met, in Theorems 5.2
and 3.2 respectively.
Our remaining examples are discrete distributions. On N = {0, 1, 2, . . . }, the geometric distributions are given by $p(n) = (1-r)r^n$, where 0 ≤ r < 1, and the mean is r/(1 − r). More generally, for a fixed integer k, the geometric distributions on {k, k + 1, k + 2, . . . } are given by $p(n) = (1-r)r^{n-k}$ (0 ≤ r < 1), with mean k + r/(1 − r). In particular, the mean and r in a geometric distribution on {k, k + 1, k + 2, . . . } determine each other (k is fixed). Any probability distribution p on {k, k + 1, . . . } has mean at least k (if it has a mean), so there is a geometric distribution with the same mean as p.
Theorem 5.7. On {k, k + 1, k + 2, . . . }, the unique probability distribution with a given
mean and maximum entropy is the geometric distribution with that mean.
Proof. When the mean is greater than k, this is an application of Theorem 4.3, and is left to the reader. When the mean is k, the geometric distribution with r = 0 (a point mass at k) is the only distribution on {k, k + 1, . . . } with that mean at all. □
Example 5.8. During a storm, 100 windows in a factory break into a total of 1000 pieces.
What is (the best estimate for) the probability that a specific window broke into 4 pieces?
20 pieces?
Letting m be the number of pieces into which a window breaks, our model will let m run
over Z+ . (There is an upper bound on m due to atoms, but taking m ∈ Z+ seems most
reasonable.) The statement of the problem suggests the mean number of pieces a window
broke into is 10. Using Theorem 5.7 with k = 1, the most reasonable distribution on m is geometric with k + r/(1 − r) = 10, i.e., r = 9/10. Then $p(n) = (1/10)(9/10)^{n-1}$. In particular, p(4) ≈ .0729. That is, the (most reasonable) probability a window broke into 4 pieces is around 7.3%. The probability that a window broke into 20 pieces is p(20) ≈ .0135, or a little over 1.3%. The probability a window did not break is p(1) = 1/10, or 10%.
Theorem 5.9. For a fixed y > 0, the unique probability distribution on {0, y, 2y, 3y, . . . } with mean λ and maximum entropy is $p(ny) = y\lambda^n/(\lambda+y)^{n+1}$, which has the form $(1-r)r^n$ with $r = 1/(1 + y/\lambda)$.
Proof. Left to the reader.
Now we consider distributions corresponding to a finite set of (say) n independent trials,
each trial having just two outcomes (such as coin-flipping). The standard example of such a
distribution is a Bernoulli distribution, where each trial has the same probability of success.
The Bernoulli distribution for n trials, with mean λ ∈ [0, n], corresponds to a probability
of success λ/n in each trial.
Theorem 5.10. Let p be a probability distribution on {0, 1}ⁿ corresponding to some sequence of n independent trials having two outcomes. Let pi ∈ [0, 1] be the probability the i-th outcome is 1, and let $\lambda = \sum p_i$ be the mean of p. Then

$$h(p) \le -\lambda\log(\lambda/n) - (n-\lambda)\log(1 - \lambda/n),$$

with equality if and only if p is the Bernoulli distribution with mean λ.
Proof. Left to the reader. Note the independence condition means such p are precisely the set of probability measures on {0, 1}ⁿ that break up into products: p(x1, . . . , xn) = p1(x1)p2(x2) · · · pn(xn), where {pi(0), pi(1)} is a probability distribution on {0, 1}.
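Theorem 5.10 can be probed numerically: for a product distribution the entropy is the sum of the per-trial entropies, and fixing λ = Σ pi, that sum should never exceed the stated bound, with equality when every pi equals λ/n. A hedged sketch (the helper names are ours):

```python
import math
import random

def h2(p):
    """Entropy of a single {0,1}-valued trial with success probability p."""
    return -sum(x * math.log(x) for x in (p, 1 - p) if x > 0)

def product_entropy(ps):
    """Entropy of the product distribution on {0,1}^n: the sum of the h2's."""
    return sum(h2(p) for p in ps)

def bound(lam, n):
    """Right side of Theorem 5.10."""
    return -lam * math.log(lam / n) - (n - lam) * math.log(1 - lam / n)

random.seed(1)
n = 5
for _ in range(200):
    ps = [random.random() for _ in range(n)]
    lam = sum(ps)
    assert product_entropy(ps) <= bound(lam, n) + 1e-12
    # Equality at the Bernoulli distribution p_i = lam/n:
    assert abs(product_entropy([lam / n] * n) - bound(lam, n)) < 1e-12
```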
Our next and final example of the principle of maximum entropy will refine the principle
of indifference on finite sample spaces and lead to a standard probability distribution from
thermodynamics. Let S = {s1 , . . . , sn } be a finite sample space and E : S → R, with
Ej = E(sj ). Physically, the elements of S are states of a system, Ej is the energy a
particle has when it’s in state sj , and pj is the probability of a particle being in state sj , or
equivalently the probability that a particle in the system has energy Ej. For each probability distribution p on S, with pj = p(sj), thought of as a probability distribution for the particles to be in the different states, the corresponding expected value of E is $\sum p_j E_j$. This number is between min Ej and max Ej. Choose a number E such that min Ej ≤ E ≤ max Ej. Our basic question is: what distribution p on S satisfies

$$\sum_j p_j E_j = E \tag{5.4}$$

and has maximum entropy? When all Ej are equal, the expected value condition (5.4) is vacuous (it follows from $\sum p_j = 1$) and we are simply seeking a distribution on S with maximum entropy and no constraints. Theorem 3.1 tells us the answer when there are no constraints: the uniform distribution. What if the Ej are not all equal? (We allow some of the Ej to coincide, just not all.)
If n = 2 and E1 ≠ E2, then elementary algebra gives the answer: if p1 + p2 = 1 and E = p1E1 + p2E2, where E1, E2, and E are known, then

$$p_1 = \frac{E - E_2}{E_1 - E_2}, \qquad p_2 = \frac{E - E_1}{E_2 - E_1}.$$
If n > 2, then the conditions $\sum p_j = 1$ and $\sum p_j E_j = E$ are two linear equations in n unknowns (the pj), so there is not a unique solution. (Recall Example 2.4.)
Theorem 5.11. If the Ej's are not all equal, then for each E between min Ej and max Ej, there is a unique probability distribution q on {s1, . . . , sn} satisfying the condition $\sum q_j E_j = E$ and having maximum entropy. It is given by the formula

$$q_j = \frac{e^{-\beta E_j}}{\sum_{i=1}^n e^{-\beta E_i}} \tag{5.5}$$

for a unique extended real number β in [−∞, ∞] that depends on E. In particular, β = −∞ corresponds to E = max Ej, β = ∞ corresponds to E = min Ej, and β = 0 (the uniform distribution) corresponds to the arithmetic mean $E = (\sum E_j)/n$, so β > 0 when $E < (\sum E_j)/n$ and β < 0 when $E > (\sum E_j)/n$.
When all Ej are equal, qj = 1/n for every β, so Theorem 5.11 could incorporate Theorem
3.1 as a special case. However, we will be using Theorem 3.1 in the proof of certain
degenerate cases of Theorem 5.11.
Proof. For each β ∈ R (we'll deal with β = ±∞ later), let qj(β) be given by the formula
(5.5). Note qj(β) > 0. We view the numbers qj(β) (for j = 1, ..., n) as a probability
distribution q(β) on {s1, ..., sn}, with qj(β) being the probability that particles from the
system are in state sj. The expected value of the energy function E for this probability
distribution is

(5.6)    f(β) = Σ_{j=1}^n qj(β)Ej = (Σ_{j=1}^n Ej e^{−βEj}) / (Σ_{i=1}^n e^{−βEi}).

Writing Z(β) = Σ_{i=1}^n e^{−βEi} for the denominator of the qj(β)'s, f(β) = −Z′(β)/Z(β).
For each β ∈ R, f (β) is the expected value of E relative to the distribution q(β). As β
varies over R, f (β) has values strictly between min Ej and max Ej . (Since we are assuming
the Ej are not all equal, n > 1.) A calculation shows
f′(β) = (Σ_{i,j} (Ei Ej − Ej²) e^{−β(Ei+Ej)}) / (Σ_{i=1}^n e^{−βEi})² = −(Σ_{i<j} (Ei − Ej)² e^{−β(Ei+Ej)}) / (Σ_{i=1}^n e^{−βEi})²,

so f′(β) < 0 for all β ∈ R. Therefore f takes each value in its range only once. Since
(5.7)    lim_{β→−∞} f(β) = max Ej,    lim_{β→∞} f(β) = min Ej,

and

(5.8)    lim_{β→−∞} qi(β) = 1/cmax if Ei = max Ej and 0 if Ei < max Ej;
         lim_{β→∞} qi(β) = 1/cmin if Ei = min Ej and 0 if Ei > min Ej,

where cmax = #{i : Ei = max Ej} and cmin = #{i : Ei = min Ej}, we can use (5.7)
and (5.8) to extend the functions qj(β) and f(β) to the cases β = ±∞. Therefore, for
each E in [min Ej, max Ej], there is a unique β ∈ [−∞, ∞] (depending on E) such that
Σj qj(β)Ej = E.
Now we consider all probability distributions p on {s1, ..., sn} with Σj pj Ej = E. We
want to show the unique distribution p with maximum entropy is given by pj = qj(β),
where β = βE is selected from [−∞, ∞] to satisfy Σj qj(β)Ej = E.
Case 1: min Ej < E < max Ej. Here β ∈ R and qj(β) > 0 for all j. We will show

(5.9)    −Σ_{j=1}^n pj log qj(β) = −Σ_{j=1}^n qj(β) log qj(β),

and then Theorem 4.3 applies.
Set Z(β) = Σ_{i=1}^n e^{−βEi}, the denominator of each qj(β), as before. Then (5.9) follows
from

−Σ_{j=1}^n pj log qj(β) = −Σ_{j=1}^n pj(−βEj − log Z(β))
                        = βE + log Z(β)
                        = −Σ_{j=1}^n qj(β) log qj(β).
Case 2: E = min Ej. In this case, β = ∞. Let c be the number of values of i such that
Ei = min Ej, so

qi(∞) = 1/c if Ei = min Ej, and 0 otherwise.

We cannot directly use Theorem 4.3 to prove this distribution has maximum entropy, since
the probability distribution q(∞) vanishes at some si.
We will show that every distribution p such that Σj pj Ej = min Ej vanishes at each si
where Ei > min Ej. That will imply this class of distributions is supported on the set
{si : Ei = min Ej}, where E restricts to a constant function and q(∞) restricts to the
uniform distribution. Therefore q(∞) will have the maximum entropy by Theorem 3.1.
For ease of notation, suppose the Ej's are indexed so that E1 ≤ E2 ≤ ··· ≤ En. Then

E1 = Σ_{i=1}^n pi Ei = Σ_{i=1}^c pi E1 + Σ_{i=c+1}^n pi Ei.

Subtracting the first term on the right from both sides, and using E1 − Σ_{i=1}^c pi E1 = Σ_{i=c+1}^n pi E1,

Σ_{i=c+1}^n pi E1 = Σ_{i=c+1}^n pi Ei ≥ Σ_{i=c+1}^n pi Ec+1.

Since E1 < Ec+1, we must have pi = 0 for i > c.
Case 3: E = max Ej . This is similar to Case 2 and is left to the reader.
Example 5.12. We return to the weighted die problem from Example 2.4. When rolled
1000 times, a six-sided die comes up with an average of 4.7 dots. We want to estimate, as
best we can, the probability distribution of the faces.
Here the space of 6 outcomes is {1, 2, ..., 6}. We do not know the probability distribution
of the occurrence of the faces, although we expect it is not uniform. (In a uniform
distribution, the average number of dots that occur is 3.5, not 4.7.) What is the best guess for
the probability distribution? By the principle of maximum entropy and Theorem 5.11, our
best guess is q(β0), where β0 is chosen so that Σ_{j=1}^6 j qj(β0) = 4.7, i.e.,

(5.10)    (Σ_{j=1}^6 j e^{−β0 j}) / (Σ_{i=1}^6 e^{−β0 i}) = 4.7.

(That we use a numerical average over 1000 rolls as a theoretical mean value is admittedly
a judgment call.) The left side of (5.10) is a monotonically decreasing function of β0, so it is
easy with a computer to find the approximate solution β0 ≈ −.4632823. The full maximum
entropy distribution is, approximately,

q1 ≈ .039, q2 ≈ .062, q3 ≈ .098, q4 ≈ .157, q5 ≈ .249, q6 ≈ .395.

For example, this suggests there should be about a 25% chance of getting a 5 on each
independent roll of the die.
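Because the left side of (5.10) is strictly decreasing in β0, any one-dimensional root-finder recovers β0. A sketch in Python, using SciPy's `brentq` (one choice among many; replacing 4.7 by 2.8 reproduces Example 5.13):

```python
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

def mean_for_beta(beta):
    # left side of (5.10): the expected roll under q(beta)
    w = np.exp(-beta * faces)
    return float((faces * w).sum() / w.sum())

# the left side is strictly decreasing in beta, so bracket the root and solve
beta0 = brentq(lambda b: mean_for_beta(b) - 4.7, -10, 10)
q = np.exp(-beta0 * faces)
q /= q.sum()
print(round(beta0, 7))  # approximately -0.4632823
print(np.round(q, 3))   # approximately [0.039 0.062 0.098 0.157 0.249 0.395]
```

The bracket [−10, 10] is an assumption that comfortably contains the root here; for other target means it may need widening.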
Example 5.13. If after 1000 rolls the six-sided die comes up with an average of 2.8 dots,
then the maximum entropy distribution of the faces is determined by the parameter β0
satisfying

(5.11)    (Σ_{j=1}^6 j e^{−β0 j}) / (Σ_{i=1}^6 e^{−β0 i}) = 2.8,

and by a computer β0 ≈ .2490454. The maximum entropy distribution for this β0 is

q1 ≈ .284, q2 ≈ .221, q3 ≈ .172, q4 ≈ .134, q5 ≈ .104, q6 ≈ .081.
This method of solution, in extreme cases, leads to results that at first seem absurd.
For instance, if we flip a coin twice and get heads both times, the principle of maximum
entropy would suggest the coin has probability 0 of coming up tails! This setting is far
removed from usual statistical practice, where two trials have no significance. Obviously
further experimentation would be likely to lead us to revise our estimated probabilities
(using maximum entropy in light of new information).
Theorem 5.14. Let S = {s1, s2, ...} be a countable set and E : S → R be a nonnegative
function that takes on any value a uniformly bounded number of times. Write E(sj) as Ej
and assume lim_{i→∞} Ei = ∞. For each number E ≥ min Ej, there is a unique probability
distribution q on S that satisfies the condition Σ_{j≥1} qj Ej = E and has maximum entropy.
It is given by the formula

(5.12)    qj = e^{−βEj} / Σ_{i≥1} e^{−βEi}

for a unique number β ∈ (0, ∞]. In particular, β = ∞ corresponds to E = min Ej.
Proof. This is similar to the proof of Theorem 5.11.
That we only encounter β > 0 in the countably infinite case, while β could be positive or
negative or 0 in the finite case, is related to the fact that the denominator of (5.12) doesn’t
converge for β ≤ 0 when Ei > 0 and there are infinitely many Ei ’s.
Remark 5.15. Theorems 5.7 and 5.9 are special cases of Theorem 5.14: S = {k, k + 1, k +
2, ...} and En = n for n ≥ k in the case of Theorem 5.7, and S = {0, y, 2y, ...} and
En = ny for n ≥ 0 in the case of Theorem 5.9. Setting En = n for n ≥ k in (5.12), for some
β > 0 the maximum entropy distribution in Theorem 5.7 will have

qn = e^{−βn} / Σ_{i≥k} e^{−βi} = e^{−βn} / (e^{−βk}/(1 − e^{−β})) = (1 − e^{−β})e^{−β(n−k)},

which is the geometric distribution (1 − r)r^{n−k} for r = e^{−β} ∈ [0, 1).
Setting En = ny for n ≥ 0 in (5.12), for some β > 0 the maximum entropy distribution
in Theorem 5.9 will have

qn = e^{−βny} / Σ_{i≥0} e^{−βiy} = (1 − e^{−βy})e^{−βny} = (1 − r)r^n

for r = e^{−βy}. The condition that Σ_{n≥0} qn En = E becomes ry/(1 − r) = E, so r =
E/(y + E) = 1/(1 + y/E).
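The formula r = E/(y + E) is easy to sanity-check numerically by summing the series Σ n y qn directly. A minimal sketch (y and E are arbitrary test values; the truncation at 10 000 terms is far more than needed):

```python
y, E = 2.0, 5.0
r = E / (y + E)  # the ratio from Remark 5.15

# mean of q_n = (1 - r) r^n on the sample space {0, y, 2y, ...}
mean = sum(n * y * (1 - r) * r**n for n in range(10_000))
print(abs(mean - E) < 1e-9)  # True
```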
In thermodynamics, the distribution arising in Theorem 5.11 is the Maxwell–Boltzmann
energy distribution of a system of non-interacting particles in thermodynamic equilibrium
having a given mean energy E. The second law of thermodynamics says, roughly, that as
a physical system evolves towards equilibrium its entropy is increasing, which provides the
physical intuition for the principle of maximum entropy.
Jaynes [6, 8] stressed the derivation of the Maxwell–Boltzmann distribution in Theorem
5.11, which does not rely on equations of motion of particles. For Jaynes, the Maxwell–
Boltzmann distribution in thermodynamics does not arise from laws of physics, but rather
from “logic”: we want an energy distribution with a given average value and without any
unwarranted extra properties. Constraining the energy distribution only by a mean value,
the distribution with maximum entropy is the unique Maxwell–Boltzmann distribution with
the chosen mean value. Jaynes said the reason this distribution fits experimental measurements so well is that macroscopic systems have a tremendous number of particles, so the
distribution of a macroscopic random variable (e.g., pressure) is very highly concentrated.
6. A characterization of the entropy function
Shannon’s basic paper [13] gives an axiomatic description of the entropy function on finite
sample spaces. A version of Shannon’s theorem with weaker hypotheses, due to Faddeev
[3], is as follows.
Theorem 6.1. For n ≥ 2, let ∆n = {(p1, ..., pn) : 0 ≤ pi ≤ 1, Σ pi = 1}. Suppose on each
∆n a function Hn : ∆n → R is given with the following properties:
• Each Hn is continuous.
• Each Hn is a symmetric function of its arguments.
• For n ≥ 2, all (p1, ..., pn) ∈ ∆n with pn > 0, and t ∈ [0, 1],

Hn+1(p1, ..., pn−1, tpn, (1 − t)pn) = Hn(p1, ..., pn−1, pn) + pn H2(t, 1 − t).

Then, for some constant k, Hn(p1, ..., pn) = −k Σ_{i=1}^n pi log pi for all n.
Proof. See [3] or [4, Chapter 1]. The proof has three steps. First, verify the function
F(n) = Hn(1/n, 1/n, ..., 1/n) has the form k log n. (This uses the infinitude of the
primes!) Second, reduce the case of rational probabilities to the equiprobable case using
the third property. Finally, handle irrational probabilities by continuity from the rational
probability case. Only in the last step is the continuity hypothesis used.
Unlike Faddeev’s proof, Shannon’s proof of Theorem 6.1 included the condition that
Hn (1/n, 1/n, . . . , 1/n) is monotonically increasing with n (and the symmetry hypothesis
was not included).
The choice of normalization constant k in the conclusion of Theorem 6.1 can be considered as a choice of base for the logarithms. If we ask for Hn (1/n, . . . , 1/n) to be an
increasing function of n, which seems plausible for a measurement of uncertainty on probability distributions, then k > 0. In fact, the positivity of k is forced just by asking that
H2 (1/2, 1/2) > 0.
In Theorem 6.1, only H2 has to be assumed continuous, since the other Hn ’s can be
written in terms of H2 and therefore are forced to be continuous.
The third condition in Theorem 6.1 is a consistency condition. If one event can be viewed
as two separate events with their own probabilities (t and 1 − t), then the total entropy
is the entropy with the two events combined plus a weighted entropy for the two events
separately. It implies, quite generally, that any two ways of viewing a sequence of events
as a composite of smaller events will lead to the same calculation of entropy. That is, if
(q1, ..., qn) ∈ ∆n and each qr is a sum of (say) mr numbers pj^(r) ≥ 0, then

(6.1)    H(p1^(1), p2^(1), ..., pm1^(1), ..., p1^(n), ..., pmn^(n)) = H(q1, ..., qn) + Σ_{r=1}^n qr H(p1^(r)/qr, ..., pmr^(r)/qr).

(The subscripts on H have been dropped to avoid cluttering the notation.)
Although Σi pi² is a conceivable first attempt to measure the uncertainty in a probability
distribution on finite sample spaces, Theorem 6.1 implies this formula has to lead to
inconsistent calculations of uncertainty, since it is not of the unique possible type of formula
satisfying the hypotheses of Theorem 6.1.
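The inconsistency is visible numerically: the third (grouping) property of Theorem 6.1 holds exactly for −Σ pi log pi but fails for Σ pi². A minimal sketch (the probabilities and the split parameter t are arbitrary test values):

```python
import math

def shannon(ps):
    # H(p) = -sum p_i log p_i
    return -sum(p * math.log(p) for p in ps if p > 0)

def sum_of_squares(ps):
    # a candidate "uncertainty" measure that Theorem 6.1 rules out
    return sum(p * p for p in ps)

p1, p2, t = 0.3, 0.7, 0.25
for H in (shannon, sum_of_squares):
    lhs = H([p1, t * p2, (1 - t) * p2])
    rhs = H([p1, p2]) + p2 * H([t, 1 - t])
    print(H.__name__, abs(lhs - rhs) < 1e-12)  # shannon passes, sum_of_squares fails
```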
7. Positivity and uniqueness of the maximum entropy distribution
In each of the three examples from Section 3, the probability density function with
maximum entropy is positive everywhere. Positivity on the whole space played a crucial
technical role in our discussion, through its appearance as a hypothesis in Theorem 4.3.
This raises a natural question: is it always true that a probability density with maximum
entropy is positive? No. Indeed, we have seen examples where the density with maximum
entropy vanishes in some places (e.g., Theorem 5.11 when E is min Ej or max Ej), or we can
interpret some earlier examples in this context (e.g., consider Theorem 3.3 for probability
density functions on R rather than (0, ∞), but with the extra constraint ∫_{−∞}^0 p dx = 0). In
such examples, however, every probability density function satisfying the given constraints
also vanishes where the maximum entropy density vanishes.
Let us refine our question: is the zero set of a probability density with maximum entropy
not a surprise, in the sense that all other probability densities satisfying the constraints also
vanish on this set? (Zero sets, unlike other level sets, are distinguished through their probabilistic interpretation as “impossible events.”) Jaynes believed so, as he wrote [6, p. 623]
“Mathematically, the maximum-entropy distribution has the important property that no
possibility is ignored; it assigns positive weight to every situation that is not absolutely
excluded by the given information.” However, Jaynes did not prove his remark. We will
give a proof when the constraints are convex-linear (that is, the probability distributions
satisfying the constraints are closed under convex-linear combinations). This fits the setting
of most of our examples. (Exceptions are Theorems 3.2, 5.5, and 5.10, where the constraints
are not convex-linear.)
At this point, to state the basic result quite broadly, we will work in the general context of
entropy on measure spaces. Let (S, ν) be a measure space. A probability density function p
on (S, ν) is a ν-measurable function from S to [0, ∞) such that p dν is a probability measure
on S. Define the entropy of p to be

h(p) = −∫_S p log p dν,

assuming this converges. That is, we assume p log p ∈ L¹(S, ν).
(Note the entropy of p depends on the choice of ν, so hν(p) would be a better notation
than h(p). For any probability measure µ on S that is absolutely continuous with respect
to ν, we could define the entropy of µ with respect to ν as −∫_S log(dµ/dν) dµ, where dµ/dν
is a Radon–Nikodym derivative. This recovers the above formulation by writing µ as p dν.)
Theorem 7.1. Let p and q be probability density functions on (S, ν) with finite entropy.
Assume q > 0 on S. If

−∫_S p log q dν = h(q),

then h(p) ≤ h(q), with equality if and only if p = q almost everywhere on S.
Proof. Mimic the proof of Lemma 4.2 and Theorem 4.3.
Example 7.2. For a < b, partition the interval [a, b] into n parts using a = a0 < a1 <
··· < an = b. Pick n positive numbers λ1, ..., λn such that λ1 + ··· + λn = 1. Let q be
the (discontinuous) probability density function on [a, b] with constant value λi/(ai − ai−1)
on each open interval (ai−1, ai) and having arbitrary positive values at the ai's. For any
probability density function p on [a, b] such that ∫_{ai−1}^{ai} p(x) dx = λi for all i,

−∫_a^b p log q dx = −Σ_{i=1}^n λi log(λi/(ai − ai−1)) = h(q).

Therefore q has maximum entropy among all probability density functions on [a, b] satisfying
∫_{ai−1}^{ai} p(x) dx = λi for all i.
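Example 7.2 is easy to test numerically: fix the bin masses, then compare the piecewise-constant q against some other density with the same bin masses. A sketch assuming two bins on [0, 1] and linear-ramp competitors (the bin edges and masses are arbitrary test values):

```python
import numpy as np

# bins of [0, 1] and their prescribed masses lambda_i
edges = np.array([0.0, 0.5, 1.0])
lams = np.array([0.3, 0.7])
widths = np.diff(edges)

# entropy of the piecewise-constant q from Example 7.2
h_q = -np.sum(lams * np.log(lams / widths))

# a competing density with the same bin masses: a linear ramp on each bin
x = np.linspace(0.0, 1.0, 400_001)
p = np.zeros_like(x)
for lo, hi, lam in zip(edges[:-1], edges[1:], lams):
    m = (x >= lo) & (x <= hi)
    p[m] = lam * 2 * (x[m] - lo) / (hi - lo) ** 2

# Riemann sum for h(p) = -integral of p log p (with 0 log 0 = 0)
integrand = np.zeros_like(p)
pos = p > 0
integrand[pos] = p[pos] * np.log(p[pos])
h_p = -integrand.sum() * (x[1] - x[0])
print(h_p < h_q)  # True: the ramps have less entropy than q
```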
Lemma 7.3. The probability density functions on (S, ν) having finite entropy are closed
under convex-linear combinations: if h(p1 ) and h(p2 ) are finite, so is h((1 − ε)p1 + εp2 ) for
any ε ∈ [0, 1].
Proof. We may take 0 < ε < 1.
Let f(t) = −t log t for t ≥ 0, so h(p) = ∫_S f(p(x)) dν(x). From the concavity of f,

f((1 − ε)t1 + εt2) ≥ (1 − ε)f(t1) + εf(t2)

when t1, t2 ≥ 0. Therefore

(7.1)    f((1 − ε)p1(x) + εp2(x)) ≥ (1 − ε)f(p1(x)) + εf(p2(x)).

For an upper bound on f((1 − ε)t1 + εt2), we note that

f(t1 + t2) ≤ f(t1) + f(t2)

for all t1, t2 ≥ 0. This is clear if either t1 or t2 is 0. When t1 and t2 are both positive,

f(t1 + t2) = −(t1 + t2) log(t1 + t2)
           = −t1 log(t1 + t2) − t2 log(t1 + t2)
           ≤ −t1 log t1 − t2 log t2
           = f(t1) + f(t2).

Therefore

(7.2)    (1 − ε)f(p1(x)) + εf(p2(x)) ≤ f((1 − ε)p1(x) + εp2(x)) ≤ f((1 − ε)p1(x)) + f(εp2(x)).

Integrate (7.2) over S. Since ∫_S f(δp(x)) dν = −δ log δ + δh(p) for δ ≥ 0 and p a probability
distribution on S, we see (1 − ε)p1 + εp2 has finite entropy:

(1 − ε)h(p1) + εh(p2) ≤ h((1 − ε)p1 + εp2) ≤ −(1 − ε) log(1 − ε) + (1 − ε)h(p1) − ε log ε + εh(p2).
Let Π be any set of probability density functions p on (S, ν) that is closed under convex-linear
combinations. Let Π0 be the subset of Π consisting of those p ∈ Π with finite entropy.
By Lemma 7.3, Π0 is closed under convex-linear combinations.
Example 7.4. Fix a finite set of (real-valued) random variables X1, ..., Xr on (S, ν) and
a finite set of constants ci ∈ R. Let Π be those probability density functions p on (S, ν)
that give Xi the expected value ci:

(7.3)    ∫_S Xi p dν = ci.

The property (7.3) is preserved for convex-linear combinations of p's.
From a physical standpoint, S is a state space (say, all possible positions of particles in a
box), the Xi ’s are macroscopic physical observables on the states in the state space (energy,
speed, etc.), and a choice of p amounts to a weighting of which states are more or less likely
to arise. The choice of Π using constraints of the type (7.3) amounts to the consideration
of only those weightings p that give the Xi ’s certain fixed mean values ci . (While (7.3) is
the way constraints usually arise in practice, from a purely mathematical point of view we
can simply take ci = 0 by replacing Xi with Xi − ci for all i.)
I am grateful to Jon Tyson for the formulation and proof of the next lemma.
Lemma 7.5. If there are p1 and p2 in Π0 such that, on a subset A ⊂ S with positive
measure, p1 = 0 and p2 > 0, then

h((1 − ε)p1 + εp2) > h(p1)

for small positive ε.

We consider (1 − ε)p1 + εp2 to be a slight perturbation of p1 when ε is small.

Proof. Let B = S − A, so h(p1) = −∫_B p1 log p1 dν. Since p1 and p2 are in Π0, (1 − ε)p1 + εp2 ∈
Π0 for any ε ∈ [0, 1].
As in the proof of Lemma 7.3, let f(t) = −t log t for t ≥ 0. Since p1 = 0 on A,

(7.4)    h(p1) = ∫_S f(p1) dν = ∫_B f(p1) dν.

Then

h((1 − ε)p1 + εp2) = ∫_A f((1 − ε)p1 + εp2) dν + ∫_B f((1 − ε)p1 + εp2) dν
                   = ∫_A f(εp2) dν + ∫_B f((1 − ε)p1 + εp2) dν
                   ≥ ∫_A f(εp2) dν + ∫_B ((1 − ε)f(p1) + εf(p2)) dν    by (7.1)
                   = εh(p2) − (ε log ε)∫_A p2 dν + (1 − ε)h(p1)    by (7.4)
                   = h(p1) + ε(−(log ε)∫_A p2 dν + h(p2) − h(p1)).

Since ν(A) > 0 and p2 > 0 on A, ∫_A p2 dν > 0. Therefore the expression inside the
parentheses is positive when ε is close enough to 0, no matter the size of h(p2) − h(p1).
Thus, the overall entropy is greater than h(p1) for small ε.
Theorem 7.6. If Π0 contains a probability density function q with maximum entropy, then
every p ∈ Π0 vanishes almost everywhere q vanishes.
Proof. If the conclusion is false, there is some p ∈ Π0 and some A ⊂ S with ν(A) > 0 such
that, on A, q = 0 and p > 0. However, for sufficiently small ε > 0, the probability density
function (1 − ε)q + εp lies in Π0 and has greater entropy than q by Lemma 7.5. This is a
contradiction.

A consequence of Theorem 7.6 is the essential uniqueness of the maximum entropy
distribution, if it exists.
Theorem 7.7. If q1 and q2 have maximum entropy in Π0 , then q1 = q2 almost everywhere.
Proof. By Theorem 7.6, we can change q1 and q2 on a set of measure 0 in order to assume
they have the same zero set, say Z. Let Y = S − Z, so q1 and q2 are positive probability
density functions on Y .
Let q = (1/2)(q1 + q2). (Any other convex-linear combination εq1 + (1 − ε)q2, for an
ε ∈ (0, 1), would do just as well for this proof.) Then q > 0 on Y. By Lemma 7.3, q ∈ Π0.
By Lemma 4.2,

(7.5)    h(q1) = −∫_Y q1 log q1 dν ≤ −∫_Y q1 log q dν,

and

(7.6)    h(q2) = −∫_Y q2 log q2 dν ≤ −∫_Y q2 log q dν.

Since

h(q) = −∫_S q log q dν
     = −∫_Y q log q dν
     = −(1/2)∫_Y q1 log q dν − (1/2)∫_Y q2 log q dν
     ≥ (1/2)h(q1) + (1/2)h(q2)
     = h(q1),

maximality implies h(q) = h(q1) = h(q2). This forces the inequalities in (7.5) and (7.6)
to be equalities, which by Lemma 4.2 forces q1 = q and q2 = q almost everywhere on Y.
Therefore q1 = q2 almost everywhere on S.
This uniqueness confirms Jaynes’ remark [6, p. 623] that “the maximum-entropy distribution [. . . ] is uniquely determined as the one which is maximally noncommittal with
regard to missing information.” (italics mine)
The physical interpretation of Theorem 7.7 is interesting. Consider a system constrained
by a finite set of macroscopic mean values. This corresponds to a Π constrained according
to Example 7.4. Theorem 7.7 “proves” the system has at most one equilibrium state.
In contrast to Theorem 7.7, Theorem 3.2 is a context where there are infinitely many
maximum entropy distributions: all Gaussians on R with a fixed variance have the same
entropy. This is not a counterexample to Theorem 7.7, since the property in Theorem 3.2
of having a fixed variance is not closed under convex-linear combinations: consider two
Gaussians on R with the same variance and different means. (Of course “fixed variance”
alone is an artificial kind of constraint to consider, but it indicates an extent to which
the hypotheses of Theorem 7.7 can’t be relaxed.) On the other hand, the simultaneous
conditions "fixed mean" and "fixed variance" for distributions on R are closed under convex-linear
combinations, and we already know there is only one maximum entropy distribution
on R satisfying those two constraints, namely the Gaussian having the chosen fixed mean
and variance.
Theorem 7.7 does not tell us Π has a unique maximum entropy distribution, but rather
that if it has one then it is unique. A maximum entropy distribution need not exist. For
instance, the set of probability density functions on R constrained to have a fixed mean
µ is closed under convex-linear combinations, but it has no maximum entropy distribution
since the one-dimensional Gaussians with mean µ and increasing variance have no maximum
entropy. As another example, which answers a question of Mark Fisher, the set of probability
density functions p on R that satisfy ∫_a^b p(x) dx = λ, where a, b, and λ are fixed with
λ < 1, has no maximum entropy distribution (even though it is closed under convex-linear
combinations): for any c > b, the probability density function

p(x) = λ/(b − a) if a ≤ x ≤ b;  (1 − λ)/(c − b) if b < x ≤ c;  0 otherwise

has entropy

h(p) = −∫_a^b p(x) log p(x) dx − ∫_b^c p(x) log p(x) dx
     = −λ log λ − (1 − λ) log(1 − λ) + λ log(b − a) + (1 − λ) log(c − b),

which becomes arbitrarily large as c → ∞. Tweaking the graph of p near a, b, and c produces
continuous distributions satisfying ∫_a^b p(x) dx = λ that have arbitrarily large entropy. (If
λ = 1 then p = 0 almost everywhere outside [a, b] and such probability density functions
have a maximum entropy distribution, namely the uniform distribution on [a, b].)
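The unbounded growth of h(p) is easy to tabulate from the closed form above. A minimal sketch (a, b, and λ are arbitrary test values; the entropy grows like (1 − λ) log(c − b)):

```python
import math

def two_block_entropy(a, b, c, lam):
    # h(p) for the two-block density: lam/(b-a) on [a,b], (1-lam)/(c-b) on (b,c]
    return (-lam * math.log(lam) - (1 - lam) * math.log(1 - lam)
            + lam * math.log(b - a) + (1 - lam) * math.log(c - b))

a, b, lam = 0.0, 1.0, 0.5
for c in (2.0, 11.0, 101.0, 1001.0):
    print(c, round(two_block_entropy(a, b, c, lam), 3))
```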
The following corollary of Theorem 7.7 answers a question of Claude Girard.
Corollary 7.8. Suppose Π is a set of probability distributions on Rn that is closed under
convex-linear combinations and contains a maximum entropy distribution. If Π is closed
under orthogonal transformations (i.e., when p(x) ∈ Π, so is p(Ax) for any orthogonal
matrix A), then the maximum entropy distribution in Π is orthogonally invariant.
Proof. First, note that when p(x) is a probability distribution on Rn , so is p(Ax) since
Lebesgue measure is invariant under orthogonal transformations. Therefore it makes sense
to consider the possibility that Π is closed under orthogonal transformations. Moreover,
when p(x) has finite entropy and A is orthogonal, p(Ax) has the same entropy (with respect
to Lebesgue measure):
−∫_{R^n} p(Ax) log p(Ax) dx = −∫_{R^n} p(x) log p(x) dx.
Thus, if Π is closed under convex-linear combinations (so Theorem 7.7 applies to Π0 ) and
any orthogonal change of variables, and Π0 contains a maximum entropy distribution, the
uniqueness of this distribution forces it to be an orthogonally invariant function: p(x) =
p(Ax) for every orthogonal A and almost all x in Rn .
Example 7.9. For a real number t > 0, let Πt be the set of probability distributions p(x)
on R^n satisfying the conditions

(7.7)    ∫_{R^n} (x · v)p(x) dx = 0,    ∫_{R^n} (x · x)p(x) dx = t,

where v runs over R^n. The first condition (over all v) is equivalent to the coordinate
conditions

(7.8)    ∫_{R^n} xi p(x) dx = 0

for i = 1, 2, ..., n, so it is saying (in a more geometric way than (7.8)) that any coordinate
of a vector chosen randomly according to p has expected value 0. The second condition
says the expected squared length of vectors chosen randomly according to p is t.
The conditions in (7.7) are trivially closed under convex-linear combinations on p. They
are also closed under any orthogonal transformation on p, since for any orthogonal matrix
A and v ∈ R^n,

∫_{R^n} (x · v)p(Ax) dx = ∫_{R^n} (Ax · Av)p(Ax) dx = ∫_{R^n} (x · Av)p(x) dx = 0

and

∫_{R^n} (x · x)p(Ax) dx = ∫_{R^n} (Ax · Ax)p(Ax) dx = ∫_{R^n} (x · x)p(x) dx = t.
Thus, if Πt has a maximum entropy distribution, Theorem 7.7 tells us it is unique and
Corollary 7.8 tells us it is an orthogonally-invariant function, i.e., it must be a function of
|x|. But neither Theorem 7.7 nor Corollary 7.8 guarantees that there is a maximum entropy
distribution. Is there one? If so, what is it?
We will show Πt has for its maximum entropy distribution the n-dimensional Gaussian
with mean 0 and scalar covariance matrix Σ = (t/n)In . Our argument will not use the
non-constructive Theorem 7.7 or Corollary 7.8.
By Theorem 5.5, any probability distribution on R^n with finite means, finite covariances
(recall (5.2)), and finite entropy has its entropy bounded above by the entropy of the
n-dimensional Gaussian with the same means and covariances. The conditions (7.7) defining
Πt imply any p ∈ Πt has coordinate means µi = 0 and finite covariances (note (xi + xj)² ≤
2xi² + 2xj²), with the diagonal covariance sum Σ_{i=1}^n σii equal to t. Therefore a maximum
entropy distribution in Πt, if one exists, lies among the n-dimensional Gaussians in Πt,
which are the distributions of the form

GΣ(x) = (1/√((2π)^n det Σ)) e^{−(1/2)x·Σ^{−1}x},

where Σ is a positive-definite symmetric matrix with trace t. The entropy of GΣ is

h(GΣ) = (1/2)(n + log((2π)^n det Σ)).

The arithmetic-geometric mean inequality on the (positive) eigenvalues of Σ says

(1/n)Tr(Σ) ≥ (det Σ)^{1/n}

with equality if and only if all the eigenvalues are equal. Therefore

(7.9)    h(GΣ) ≤ (n/2)(1 + log(2πt/n)).

This upper bound is achieved exactly when (det Σ)^{1/n} = t/n, which occurs when the
eigenvalues of Σ are all equal. Since Σ has trace t, the common eigenvalue is t/n, so Σ = (t/n)In.
The Gaussian G_{(t/n)In} has maximum entropy in Πt.
Remark 7.10. In equation (7.9), the right side can be rewritten as h(G(t/n)In ), which shows
the arithmetic-geometric mean inequality (both the inequality and the condition describing
when equality occurs) is equivalent to a special case of entropy maximization.
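The bound (7.9), and the equality case Σ = (t/n)In, can be checked numerically against random trace-t covariance matrices. A sketch (n, t, and the random matrices are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 4, 10.0

def gaussian_entropy(cov):
    # h(G_Sigma) = (1/2)(n + log((2 pi)^n det Sigma))
    return 0.5 * (n + np.log((2 * np.pi) ** n * np.linalg.det(cov)))

bound = (n / 2) * (1 + np.log(2 * np.pi * t / n))  # right side of (7.9)
for _ in range(100):
    m = rng.standard_normal((n, n))
    cov = m @ m.T + 0.1 * np.eye(n)  # a random positive-definite covariance
    cov *= t / np.trace(cov)         # rescale so the trace is t
    assert gaussian_entropy(cov) <= bound + 1e-9
print(bool(np.isclose(gaussian_entropy((t / n) * np.eye(n)), bound)))  # True
```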
Replacing the standard inner product on Rn with an inner product attached to any
positive-definite quadratic form, we can extend Example 7.9 to an entropic characterization of Gaussian distributions that is more geometric than the entropic characterization in
Theorem 5.5.
Theorem 7.11. Let Q be a positive-definite quadratic form on R^n, and t > 0. Consider
the set of probability distributions p on R^n such that

(7.10)    ∫_{R^n} ⟨x, v⟩_Q p(x) dx = 0,    ∫_{R^n} Q(x)p(x) dx = t,

where v runs over R^n and ⟨·, ·⟩_Q is the bilinear form attached to Q.
Entropy in this set of distributions is maximized at the Gaussian with µ = 0 and inverse
covariance matrix Σ^{−1} attached to the quadratic form (n/t)Q: x · Σ^{−1}x = (n/t)Q(x).
Proof. Relative to the usual inner product on R^n, Q(x) = x · Cx and ⟨x, y⟩_Q = x · Cy for
some positive-definite symmetric matrix C. We can write C = D² for a symmetric matrix
D with positive eigenvalues. The conditions (7.10) become

∫_{R^n} (x · v) (p(D^{−1}x)/det D) dx = 0,    ∫_{R^n} (x · x) (p(D^{−1}x)/det D) dx = t,

as v runs over R^n. Let pD(x) = p(D^{−1}x)/det D, which is also a probability distribution.
Thus, p satisfies (7.10) exactly when pD(x) satisfies (7.7). Moreover, a calculation shows
h(pD) = h(p) + log(det D), and the constant log(det D) is independent of p. It depends only
on the initial choice of quadratic form Q. Therefore, our calculations in Example 7.9 tell us
that the unique maximum-entropy distribution for the constraints (7.10) is the p such that

pD(x) = G_{(t/n)In}(x) = (1/√((2π)^n (t/n)^n)) e^{−(1/2)(n/t)x·x}.

Unwinding the algebra, this says

p(x) = (det D / √((2π)^n (t/n)^n)) e^{−(1/2)(n/t)Dx·Dx}
     = (1 / √((2π)^n (t/n)^n (det D²)^{−1})) e^{−(1/2)x·(n/t)D²x}
     = (1 / √((2π)^n det Σ)) e^{−(1/2)x·Σ^{−1}x},

where Σ = (t/n)C^{−1}. This is a Gaussian, with x · Σ^{−1}x = x · (n/t)Cx = (n/t)Q(x).
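The identity h(pD) = h(p) + log(det D) used in the proof can be checked for Gaussian densities, where the entropy has the closed form from Example 7.9. A sketch (D and S are arbitrary test matrices; if X has a Gaussian density p with covariance S, then DX has density pD with covariance D S Dᵀ):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3

def gaussian_entropy(cov):
    # h(G_Sigma) = (1/2)(n + log((2 pi)^n det Sigma))
    return 0.5 * (n + np.log((2 * np.pi) ** n * np.linalg.det(cov)))

m = rng.standard_normal((n, n))
D = m @ m.T + n * np.eye(n)  # a symmetric matrix with positive eigenvalues

S = 0.7 * np.eye(n)          # covariance of the original Gaussian density p
h_p = gaussian_entropy(S)
h_pD = gaussian_entropy(D @ S @ D.T)  # entropy of the transformed density p_D
print(bool(np.isclose(h_pD, h_p + np.log(np.linalg.det(D)))))  # True
```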
In (7.10), the second condition can be described in terms of traces. Letting σij =
∫_{R^n} xi xj p(x) dx be the covariances of p, and Σp = (σij) be the corresponding covariance
matrix for p, the second equation in (7.10) says Tr(Σp C) = t, where C is the symmetric
matrix attached to Q.
The conclusion of Theorem 7.11 shows it is (n/t)Q, rather than the initial Q, which
is preferred. Rescaling Q does not affect the vanishing in the first condition in (7.10),
but changes both sides of the second condition in (7.10). Rescaling by n/t corresponds
to making the right side equal to n (which reminds us of (5.3)). In other words, once we
choose a particular geometry for R^n (a choice of positive-definite quadratic form, up to
scaling), scale the choice so random vectors have expected squared length equal to n. With
this constraint, the maximum entropy distribution with mean vector 0 is the Gaussian with
mean vector 0 and inverse covariance matrix equal to the matrix defined by Q: for x ∈ R^n,
x · Σ^{−1}x = Q(x).
Appendix A. Lagrange Multipliers
To prove a probability distribution is a maximum entropy distribution, an argument often
used in place of Theorem 4.3 is the method of Lagrange multipliers. We will revisit several
examples and show how this idea is carried out. In the spirit of how this is done in practice,
we will totally ignore the need to show Lagrange multipliers is giving us a maximum.
Example A.1. (Theorem 3.1) Among all probability distributions {p1, ..., pn} on a finite
set, to maximize the entropy we want to maximize −Σ_{i=1}^n pi log pi with the constraint
pi ≥ 0 and Σ_{i=1}^n pi = 1. Set

F(p1, ..., pn, λ) = −Σ_{i=1}^n pi log pi + λ(Σ_{i=1}^n pi − 1),

where λ is a Lagrange multiplier and the pi's are (positive) real variables. Since ∂F/∂pj =
−1 − log pj + λ, at a maximum entropy distribution we have −1 − log pj + λ = 0, so pj = e^{λ−1}
for all j. This is a constant value, and the condition p1 + ··· + pn = 1 implies e^{λ−1} = 1/n.
Thus pi = 1/n for i = 1, ..., n.
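The same conclusion can be reached with a numerical constrained optimizer in place of the multiplier computation. A sketch with SciPy (n and the non-uniform starting point are arbitrary; the tiny lower bound on each pi keeps the logarithm defined):

```python
import numpy as np
from scipy.optimize import minimize

n = 5
res = minimize(
    lambda p: np.sum(p * np.log(p)),               # negative entropy
    x0=np.array([0.18, 0.19, 0.20, 0.21, 0.22]),   # a non-uniform start summing to 1
    bounds=[(1e-9, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
)
print(bool(np.allclose(res.x, 1.0 / n, atol=1e-3)))  # the optimizer lands near the uniform distribution
```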
Example A.2. (Theorem 3.2) Let p(x) be a probability distribution on R with mean µ and
variance σ². To maximize −∫_R p(x) log p(x) dx subject to these constraints, we consider

F(p, λ1, λ2, λ3) = −∫_R p(x) log p(x) dx + λ1(∫_R p(x) dx − 1) + λ2(∫_R xp(x) dx − µ)
                   + λ3(∫_R (x − µ)² p(x) dx − σ²)
                 = ∫_R (−p(x) log p(x) + λ1 p(x) + λ2 xp(x) + λ3 (x − µ)² p(x)) dx − λ1 − µλ2 − σ²λ3
                 = ∫_R L(x, p(x), λ1, λ2, λ3) dx − λ1 − µλ2 − σ²λ3,

where L(x, p, λ1, λ2, λ3) = −p log p + λ1 p + λ2 xp + λ3 (x − µ)² p. Since

∂L/∂p = −1 − log p + λ1 + λ2 x + λ3 (x − µ)²

(in ∂L/∂p, p is treated as an indeterminate, not as p(x)), at a maximum entropy distribution
we have −1 − log p(x) + λ1 + λ2 x + λ3 (x − µ)² = 0, so p(x) = e^{λ1−1+λ2 x+λ3 (x−µ)²}. For
∫_R p(x) dx to be finite requires λ2 = 0 and λ3 < 0. Thus p(x) = e^a e^{−b(x−µ)²}, where
a = λ1 − 1 and b = −λ3 > 0.
The integral ∫_R e^a e^{−b(x−µ)²} dx is e^a √(π/b), so p(x) being a probability distribution makes
it √(b/π) e^{−b(x−µ)²}. Then ∫_R xp(x) dx is automatically µ, and ∫_R (x − µ)² p(x) dx is 1/(2b),
so b = 1/(2σ²). Thus p(x) = (1/√(2πσ²)) e^{−(1/2)(x−µ)²/σ²}, a normal distribution.
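As a numerical illustration of this conclusion, one can compare the normal distribution against another distribution on R with the same mean and variance, such as a Laplace distribution. A sketch using scipy.stats (σ is an arbitrary test value; a Laplace density with scale b has variance 2b²):

```python
import numpy as np
from scipy import stats

sigma = 2.0
normal = stats.norm(loc=0.0, scale=sigma)
# a Laplace density with the same mean and variance (variance = 2 b^2)
laplace = stats.laplace(loc=0.0, scale=sigma / np.sqrt(2))

print(float(normal.entropy()), float(laplace.entropy()))
print(normal.entropy() > laplace.entropy())  # True
```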
Example A.3. (Theorem 3.3) Let $p(x)$ be a probability distribution on $(0, \infty)$ with mean $\mu$. To maximize $-\int_0^\infty p(x)\log p(x)\,dx$ subject to this constraint, set
\begin{align*}
F(p, \lambda_1, \lambda_2) &= -\int_0^\infty p(x)\log p(x)\,dx + \lambda_1\Big(\int_0^\infty p(x)\,dx - 1\Big) + \lambda_2\Big(\int_0^\infty xp(x)\,dx - \mu\Big)\\
&= \int_0^\infty \big({-p(x)\log p(x)} + \lambda_1 p(x) + \lambda_2 xp(x)\big)\,dx - \lambda_1 - \lambda_2\mu\\
&= \int_0^\infty L(x, p(x), \lambda_1, \lambda_2)\,dx - \lambda_1 - \lambda_2\mu,
\end{align*}
where $L(x, p, \lambda_1, \lambda_2) = -p\log p + \lambda_1 p + \lambda_2 xp$. Then $\partial L/\partial p = -1 - \log p + \lambda_1 + \lambda_2 x$, so at a maximum entropy distribution we have $-1 - \log p(x) + \lambda_1 + \lambda_2 x = 0$, and thus $p(x) = e^{\lambda_1 - 1 + \lambda_2 x}$ for $x \geq 0$. The condition $\int_0^\infty p(x)\,dx < \infty$ requires $\lambda_2 < 0$. Then $\int_0^\infty e^{\lambda_1 - 1 + \lambda_2 x}\,dx = e^{\lambda_1 - 1}\int_0^\infty e^{\lambda_2 x}\,dx = e^{\lambda_1 - 1}/|\lambda_2|$, so $e^{\lambda_1 - 1} = |\lambda_2|$. Thus $p(x) = |\lambda_2|e^{\lambda_2 x}$. Since $\int_0^\infty xe^{\lambda_2 x}\,dx = 1/\lambda_2^2$, the condition $\int_0^\infty xp(x)\,dx = \mu$ implies $\lambda_2 = -1/\mu$, so $p(x) = (1/\mu)e^{-x/\mu}$, which is an exponential distribution.
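A numerical comparison (ours, with an arbitrarily chosen mean $\mu = 2$) pits the exponential, whose entropy is $1 + \log\mu$, against a Gamma density on $(0,\infty)$ with the same mean, integrating $-p\log p$ by a simple Riemann sum:

```python
import math

mu = 2.0
h_exp = 1 + math.log(mu)  # differential entropy of the exponential with mean mu

# Gamma density with shape 2 and scale theta has mean 2*theta; match the mean.
theta = mu / 2
def p(x):
    return x * math.exp(-x / theta) / theta**2

# Crude Riemann-sum estimate of -int p log p dx on (0, infinity).
dx = 1e-3
h_gamma, x = 0.0, dx
while x < 60 * mu:
    px = p(x)
    if px > 0:
        h_gamma -= px * math.log(px) * dx
    x += dx

assert h_gamma < h_exp
```

The Gamma density, despite sharing the mean, has strictly smaller entropy, as Theorem 3.3 requires.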
Example A.4. (Theorem 5.1) When $p(x)$ is a probability distribution on $[a, b]$ with mean $\mu \in (a, b)$, we maximize $-\int_a^b p(x)\log p(x)\,dx$ by looking at
$$F(p, \lambda_1, \lambda_2) = -\int_a^b p(x)\log p(x)\,dx + \lambda_1\Big(\int_a^b p(x)\,dx - 1\Big) + \lambda_2\Big(\int_a^b xp(x)\,dx - \mu\Big),$$
exactly as in the previous example. The same ideas imply $p(x) = e^{\lambda_1 - 1}e^{\lambda_2 x}$ for $x \in [a, b]$, which has the shape $Ce^{\alpha x}$ from Theorem 5.1. The two constraints $\int_a^b p(x)\,dx = 1$ and $\int_a^b xp(x)\,dx = \mu$ pin down $\lambda_1$ and $\lambda_2$.
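Solving for $\lambda_2$ must be done numerically in general. A sketch (ours, with the illustrative choices $[a,b] = [0,1]$ and $\mu = 0.3$) uses the fact that the mean of $Ce^{\lambda x}$ on $[a,b]$, namely $\frac{be^{\lambda b} - ae^{\lambda a}}{e^{\lambda b} - e^{\lambda a}} - \frac{1}{\lambda}$, increases with $\lambda$, so bisection applies:

```python
import math

a, b, mu = 0.0, 1.0, 0.3  # example interval and target mean (any mu in (a,b))

def mean_of_exponential_shape(lam):
    """Mean of p(x) = C e^{lam x} on [a, b]; lam = 0 gives the uniform mean."""
    if abs(lam) < 1e-9:
        return (a + b) / 2
    num = b * math.exp(lam * b) - a * math.exp(lam * a)
    den = math.exp(lam * b) - math.exp(lam * a)
    return num / den - 1 / lam

# The mean is increasing in lam, so bisect for mean(lam) = mu.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    if mean_of_exponential_shape(mid) < mu:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2
```

Since $\mu = 0.3$ is below the midpoint $0.5$, the recovered $\lambda_2$ is negative, giving a decaying exponential shape on $[0,1]$.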
Example A.5. (Theorem 5.7) When $\{p_k, p_{k+1}, p_{k+2}, \dots\}$ is a probability distribution on $\{k, k+1, k+2, \dots\}$ with mean $\mu$, it has maximum entropy among such distributions when
$$F(p_k, p_{k+1}, \dots, \lambda_1, \lambda_2) = -\sum_{i \geq k} p_i \log p_i + \lambda_1\Big(\sum_{i \geq k} p_i - 1\Big) + \lambda_2\Big(\sum_{i \geq k} ip_i - \mu\Big)$$
satisfies $\partial F/\partial p_j = 0$ for all $j$. The derivative is $-1 - \log p_j + \lambda_1 + \lambda_2 j$, so $p_j = e^{\lambda_1 - 1}e^{\lambda_2 j}$ for all $j$. To have $\sum_{j \geq k} p_j < \infty$ requires $\lambda_2 < 0$. Set $r = e^{\lambda_2} \in (0, 1)$, so $p_j = Cr^j$, where $C = e^{\lambda_1 - 1}$.

The condition $\sum_{i \geq k} p_i = 1$ implies $C = (1-r)/r^k$, so $p_i = (1-r)r^{i-k}$. The condition that the mean is $\mu$ determines the value of $r$, as in the proof of Theorem 5.7.
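Concretely, the mean of $p_i = (1-r)r^{i-k}$ is $k + r/(1-r)$, so $r = (\mu - k)/(\mu - k + 1)$. A short check (ours, with the sample values $k = 1$, $\mu = 3$) verifies normalization and the mean by direct summation:

```python
k, mu = 1, 3.0  # support {1, 2, 3, ...} with target mean 3

# Mean of the shifted geometric distribution is k + r/(1 - r), so:
r = (mu - k) / (mu - k + 1)

# Truncate the series far enough out that the geometric tail is negligible.
total = sum((1 - r) * r**(i - k) for i in range(k, k + 2000))
mean = sum(i * (1 - r) * r**(i - k) for i in range(k, k + 2000))
```

With these values $r = 2/3$, and the truncated sums recover total probability $1$ and mean $3$ to high precision.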
Example A.6. (Theorem 5.11) Let $E_1, \dots, E_n$ be in $\mathbf{R}$ with $E$ lying between $\min E_j$ and $\max E_j$. To find the maximum entropy distribution $\{p_1, \dots, p_n\}$ where $\sum_{i=1}^{n} p_iE_i = E$, consider
$$F(p_1, \dots, p_n, \lambda_1, \lambda_2) = -\sum_{i=1}^{n} p_i \log p_i + \lambda_1\Big(\sum_{i=1}^{n} p_i - 1\Big) + \lambda_2\Big(\sum_{i=1}^{n} p_iE_i - E\Big).$$
Since $\partial F/\partial p_j = -1 - \log p_j + \lambda_1 + \lambda_2 E_j$, at a maximum entropy distribution we have $-1 - \log p_j + \lambda_1 + \lambda_2 E_j = 0$, so $p_j = e^{\lambda_1 - 1 + \lambda_2 E_j}$. The condition $\sum_{j=1}^{n} p_j = 1$ becomes $e^{\lambda_1 - 1}\sum_{j=1}^{n} e^{\lambda_2 E_j} = 1$, so $e^{\lambda_1 - 1} = 1/\sum_{j=1}^{n} e^{\lambda_2 E_j}$. Thus
$$p_i = e^{\lambda_1 - 1}e^{\lambda_2 E_i} = \frac{e^{\lambda_2 E_i}}{\sum_{j=1}^{n} e^{\lambda_2 E_j}}.$$
This agrees with (5.5), where $\lambda_2$ needs to be chosen so that $\sum_{i=1}^{n} p_iE_i = E$, which is the same as $\lambda_2$ satisfying
$$\frac{\sum_{i=1}^{n} E_ie^{\lambda_2 E_i}}{\sum_{j=1}^{n} e^{\lambda_2 E_j}} = E.$$
To align the notation with Theorem 5.11, we should write λ2 as −β.
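This equation, too, is solved numerically in practice. A sketch (ours, with illustrative energy levels and target mean) exploits that the weighted mean $\sum_i E_ie^{\lambda_2 E_i}/\sum_j e^{\lambda_2 E_j}$ is increasing in $\lambda_2$ (its derivative is the variance of the distribution), so bisection finds $\lambda_2$:

```python
import math

E_levels = [0.0, 1.0, 2.0, 4.0]  # example energy levels
E_target = 1.2                   # must lie strictly between min and max level

def mean_energy(lam):
    """Mean of E under the weights e^{lam * E_i} (a Gibbs-type distribution)."""
    weights = [math.exp(lam * e) for e in E_levels]
    return sum(e * w for e, w in zip(E_levels, weights)) / sum(weights)

# mean_energy is increasing in lam, so bisect for mean_energy(lam) = E_target.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    if mean_energy(mid) < E_target:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

weights = [math.exp(lam * e) for e in E_levels]
z = sum(weights)
p = [w / z for w in weights]  # the maximum entropy distribution
```

In the notation of Theorem 5.11, the recovered $\lambda_2$ is $-\beta$; here it is negative because the target $E = 1.2$ is below the unweighted average of the levels.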
It is left to the reader to work out other maximum entropy distributions using Lagrange
multipliers, such as Theorems 5.2, 5.5, and 5.9.
```