Information Theory

Stefan Höst

March 9, 2012

Draft version V001, First Draft Outline, March 2012

© Stefan Höst
Dept. of Electrical and Information Technology
Lund University
Contents
1 Preface
2 Introduction to Information Theory
3 Probability
  3.1 Random variable
  3.2 Expectation and variance
  3.3 Weak law of large numbers
  3.4 Jensen's inequality
  3.5 Stochastic Processes
  3.6 Markov process
4 Information Measure
  4.1 Information
  4.2 Entropy
  4.3 Mutual Information
    4.3.1 Convexity of information measures
  4.4 Entropy of sequences
  4.5 Random Walk
5 Source Coding Theorem
  5.1 Asymptotic Equipartition Property
  5.2 Source Coding Theorem
  5.3 Kraft Inequality
  5.4 Shannon-Fano Coding
  5.5 Huffman Coding
6 Universal Source Coding
  6.1 LZ77
    6.1.1 LZSS
  6.2 LZ78
    6.2.1 LZW
  6.3 Applications of LZ
7 Channel Coding Theory
  7.1 Channel coding theorem
  7.2 Channel coding
    7.2.1 Repetition code
    7.2.2 Hamming Code
    7.2.3 Convolutional code
8 Differential Entropy
  8.1 Information for continuous random variables
  8.2 Gaussian distribution
9 Gaussian Channel
  9.1 Capacity of Gaussian Channel
  9.2 Band limited channel
  9.3 Fundamental Shannon limit
    9.3.1 Limit for fix coding rate
    9.3.2 Limit for QAM
    9.3.3 Limit for MIMO
  9.4 Waterfilling
  9.5 Coding Gain and Shaping Gain
10 Rate Distortion Theory
  10.1 Rate distortion Function
  10.2 Quantization
  10.3 Shannon Limit for fix Pb
  10.4 Lossy compression
  10.5 Transform coding
Bibliography
Chapter 3
Probability
One of the basic insights when setting up a measure for information is that the observed quantity must have choices to contain any information. So, to set up a theory for information we must have a clear notion of probability theory. This chapter gives the parts of probability theory that are needed throughout the rest of the document. It is not, in any way, a complete course in probability theory; for that we refer to e.g. [4] or some other standard textbook in the area.
In short, a probability is a measure of how likely an event is to occur. It is represented by a number between zero and one, where zero means that the event will not happen and one that it is certain to happen. The sum of the probabilities for all possible outcomes is one, since it is certain that one of them will happen.
More formally, it was the Russian mathematician Kolmogorov who in 1933 set up the theory as we know it today. The sample space Ω is the set of all possible outcomes of the random experiment. In discrete probability theory the sample space is finite or countably infinite, e.g.

Ω = {ω1, ω2, . . . , ωn}

An event is a subset of the sample space, A ⊆ Ω. The event is said to occur if the outcome of the random experiment is found in the event. Examples of specific events are

• The impossible event ∅ (the empty subset of Ω)
• The certain event Ω (the full subset)

The elements of Ω, i.e. ω1, ω2, . . . , ωn, viewed as the smallest nonempty subsets {ωi}, are called elementary events or, sometimes, atomic events.
To each event there is assigned a probability measure P , where 0 ≤ P ≤ 1, which is a
measure of how probable the event is to happen. Some fundamental probabilities are
P (Ω) = 1
P (∅) = 0
P (A ∪ B) = P (A) + P (B), if A ∩ B = ∅
where A and B are disjoint events. This implies that

P(A) = \sum_{\omega_i \in A} P(\omega_i)
An important concept in probability theory is how two events are related. This is described by the conditional probability. The probability of event A conditioned on the event B is defined as

P(A|B) = \frac{P(A \cap B)}{P(B)}

If the two events A and B are independent, the probability for A should not change whether or not it is conditioned on B. Hence,

P(A|B) = P(A)

Applying this to the definition of conditional probabilities we can conclude that A and B are independent if and only if the probability of their intersection is equal to the product of the individual probabilities,

P(A \cap B) = P(A) \cdot P(B)
3.1 Random variable
A random variable, or stochastic variable, X is a function from the sample space into a
specified set, most often the real numbers,
X:Ω→X
where X = {x1 , x2 , . . . , xk }, k ≤ n, is the set of possible values. We denote the event
{ω|X(ω) = xi } by {X = xi }.
To describe how the probabilities for the random variable are distributed we use, for the discrete case, the probability function

p_X(x) = P(X = x)

and in the continuous case the density function f_X(x). The distribution function for the variable can now be defined as

F_X(x) = P(X \le x)
which we can derive as¹

F_X(x) = \begin{cases} \sum_{k \le x} p_X(k), & X \text{ discrete} \\ \int_{-\infty}^{x} f_X(\nu)\, d\nu, & X \text{ continuous} \end{cases} \qquad (3.1)

¹ Often in this chapter we will give the result both for discrete and continuous variables, as done here. If one of the expressions is omitted this does not mean it does not exist. In most cases it is straightforward to obtain the discrete or continuous counterpart.
On many occasions later in the document we will omit the index X in p_X(x), when it is clear which variable is used.
To consider how two or more random variables are jointly distributed we can view a
vector of random variables. This vector will then represent a multi-dimensional random variable and can be treated the same way. Hence, we can consider (X, Y ), where
X and Y are random variables with possible outcomes X = {x1 , x2 , . . . , xM } and Y =
{y1, y2, . . . , yN}, respectively. Then the set of joint outcomes is

X \times Y = \{ (x_1, y_1), (x_1, y_2), \ldots, (x_M, y_N) \}
Normally we write the joint probability function pX,Y (x, y) for the discrete case, meaning
the event P (X = x and Y = y), and the corresponding joint density function is denoted
fX,Y (x, y).
The marginal distribution can be derived from the joint distribution as

p_X(x) = \sum_y p_{X,Y}(x, y)

and

f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x, y)\, dy
for the discrete and continuous case, respectively. This leads us to consider how dependent the two variables are. We say that the conditional probability distribution is defined by

p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}

and

f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}
This gives a probability distribution for the variable X when the outcome for Y is known.
Repeating this iteratively, we conclude the chain rule for probabilities for the probability of an n-dimensional random variable as

P(X_1 \ldots X_{n-1} X_n) = P(X_n | X_1 \ldots X_{n-1}) P(X_1 \ldots X_{n-1})
  = P(X_1) P(X_2|X_1) P(X_3|X_1 X_2) \cdots P(X_n | X_1 \ldots X_{n-1})
  = \prod_{i=1}^{n} P(X_i | X_1 \ldots X_{i-1}) \qquad (3.2)
Combining the above results we conclude that two random variables X and Y are statistically independent if and only if
pX,Y (x, y) = pX (x)pY (y)
or
fX,Y (x, y) = fX (x)fY (y)
for all x and y.
3.2 Expectation and variance
Consider an example where the random variable X is the outcome of a throw with a fair dice. The probabilities for the outcomes are all equal, i.e.

p_X(x) = \frac{1}{6}, \qquad x = 1, 2, \ldots, 6
The arithmetic mean of the outcomes is

\bar{X} = \frac{1}{6} \sum_x x = 3.5
Now, instead have a look at a counterfeit dice where a small weight is placed close to the number one, and let Y be the corresponding random variable. Simplifying the result, we assume that the number one will never show and that the number six will show twice as often as before, while the probabilities for the other numbers are unchanged. That is, p_Y(1) = 0, p_Y(i) = 1/6 for i = 2, 3, 4, 5, and p_Y(6) = 1/3. The arithmetic mean above will still be 3.5, but it does not say very much about the actual result. Therefore, we introduce the expected value, which is a mean weighted with the probabilities for the outcomes,

E[X] = \sum_x x\, p_X(x) \qquad (3.3)
In the case with the fair dice the expected value is the same as the arithmetic mean, but
for the manipulated dice it becomes
E[Y] = 0 \cdot 1 + \frac{1}{6} \cdot 2 + \frac{1}{6} \cdot 3 + \frac{1}{6} \cdot 4 + \frac{1}{6} \cdot 5 + \frac{1}{3} \cdot 6 \approx 4.33
A slightly more general definition of the expected value is as follows.

Definition 1 Let g(X) be a real-valued function of a random variable X. The expected value (or mean) of g(X) is, for a discrete variable,

E\big[ g(X) \big] = \sum_{x \in \mathcal{X}} g(x)\, p_X(x)

and, for a continuous variable,

E\big[ g(X) \big] = \int_{\mathbb{R}} g(\nu) f_X(\nu)\, d\nu
Going back to our true and counterfeit dice, we can use the function g(x) = x². This leads to the so called second order moment of the variable X. In the case of the true dice we get

E[X^2] = \sum_{i=1}^{6} \frac{1}{6} i^2 \approx 15.2
and for the counterfeit

E[Y^2] = \sum_{i=2}^{5} \frac{1}{6} i^2 + \frac{1}{3} 6^2 = 21
Later on, in Section 5.1, we will make use of the weak law of large numbers. It states that the arithmetic mean of a vector of independent identically distributed (i.i.d.) random variables will approach the expected value as the length of the vector grows. In this sense the expected value is a most natural definition of the mean of the outcome of a random variable.
The expected value for a multi-dimensional variable is derived similarly, with the joint distribution. In the case of a 2-dimensional vector (X, Y) it becomes

E\big[ g(X, Y) \big] = \begin{cases} \sum_{x,y} g(x, y)\, p_{X,Y}(x, y), & \text{discrete case} \\ \int_{\mathbb{R}^2} g(\nu, \mu) f_{X,Y}(\nu, \mu)\, d\nu\, d\mu, & \text{continuous case} \end{cases}
It can also be that one variable is discrete and one is continuous, so the formula has one
sum and one integral.
From the definition of the expected value and basic calculus it is easy to verify that the expected value is a linear mapping, i.e. that

E[aX + bY] = \sum_{x,y} (ax + by)\, p_{XY}(x, y)
  = \sum_{x,y} ax\, p_{XY}(x, y) + \sum_{x,y} by\, p_{XY}(x, y)
  = a \sum_x x \sum_y p_{XY}(x, y) + b \sum_y y \sum_x p_{XY}(x, y)
  = a \sum_x x\, p_X(x) + b \sum_y y\, p_Y(y)
  = a E[X] + b E[Y] \qquad (3.4)
A more general version of this result is stated at the end of this section.
We can also verify that if X and Y are independent, then the expectation of their product equals the product of their expectations,

E[XY] = \sum_{x,y} xy\, p_{X,Y}(x, y) = \sum_{x,y} xy\, p_X(x) p_Y(y)
  = \sum_x x\, p_X(x) \sum_y y\, p_Y(y) = E[X] E[Y] \qquad (3.5)
While the expected value is a measure of the weighted mean for the outcome of a random
variable, we also need a measure of its variation. This value is called the variance and is
defined as the expected value of the squared distance to the mean.
Definition 2 Let X be a random variable with expected value E[X]. The variance of the variable is

V[X] = E\big[ (X - E[X])^2 \big]

The variance can often be derived from

V[X] = E\big[ (X - E[X])^2 \big] = E\big[ X^2 - 2X E[X] + E[X]^2 \big]
     = E[X^2] - 2E[X] E[X] + E[X]^2 = E[X^2] - E[X]^2

where E[X²] is the second order moment of X. In many descriptions the expected value is denoted by m, but we have here chosen to keep the notation E[X]. Still, it should be regarded as a constant in the derivations.
Often the standard deviation, σ_X, is used as a measure of the variation of the variable. This is the square root of the variance,

\sigma_X^2 = V[X]
For the true and counterfeit dice described earlier in this section we get the variance
V [X] = E[X 2 ] − E[X]2 ≈ 2.9
V [Y ] = E[Y 2 ] − E[Y ]2 ≈ 2.2
The corresponding standard deviations are

\sigma_X = \sqrt{V[X]} \approx 1.7
\sigma_Y = \sqrt{V[Y]} \approx 1.5
To get more understanding of the variance we first need to define one of its relatives, the covariance. This can be seen as a measure of the dependency between two random variables.

Definition 3 Let X and Y be two random variables with expected values E[X] and E[Y]. The covariance between the variables is

\mathrm{Cov}(X, Y) = E\big[ (X - E[X])(Y - E[Y]) \big]
We can easily rewrite the covariance to get

\mathrm{Cov}(X, Y) = E\big[ (X - E[X])(Y - E[Y]) \big]
  = E\big[ XY - X E[Y] - E[X] Y + E[X] E[Y] \big]
  = E[XY] - E[X] E[Y]
From this result, and from (3.5), we see that when X and Y are independent we get zero covariance,

\mathrm{Cov}(X, Y) = 0 \qquad (3.6)
Going back to the variance we can now derive the variance for the combination aX + bY as

V[aX + bY] = E\big[ (aX + bY)^2 \big] - E[aX + bY]^2
  = E\big[ a^2 X^2 + b^2 Y^2 + 2ab XY \big] - \big( a E[X] + b E[Y] \big)^2
  = a^2 E[X^2] + b^2 E[Y^2] + 2ab E[XY] - a^2 E[X]^2 - b^2 E[Y]^2 - 2ab E[X] E[Y]
  = a^2 V[X] + b^2 V[Y] + 2ab\, \mathrm{Cov}(X, Y) \qquad (3.7)

With use of (3.6) we can also conclude that if X and Y are independent we get

V[aX + bY] = a^2 V[X] + b^2 V[Y] \qquad (3.8)
Sometimes it is suitable to have a normalized random variable. We can get that by considering

\tilde{X} = \frac{X - m}{\sigma}

where E[X] = m and V[X] = σ². Then the expectation and variance become

E[\tilde{X}] = \frac{1}{\sigma} \big( E[X] - m \big) = 0
V[\tilde{X}] = \frac{1}{\sigma^2} E\big[ (X - m)^2 \big] = 1
To conclude the part about expected value and variance, we summarize and give more general versions of (3.4), (3.7) and (3.8). The results can be shown in a way very similar to the derivations above.

Theorem 1 Given a set of N random variables X_n and scalar constants α_n, n = 1, 2, . . . , N, the sum

Y = \sum_{n=1}^{N} \alpha_n X_n

has the expected value

E[Y] = \sum_{n=1}^{N} \alpha_n E[X_n]

and the variance

V[Y] = \sum_{n=1}^{N} \alpha_n^2 V[X_n] + 2 \sum_{m < n} \alpha_n \alpha_m \mathrm{Cov}(X_n, X_m)
If the random variables X_n all are independent, then

V[Y] = \sum_{n=1}^{N} \alpha_n^2 V[X_n]
We have already seen a distribution for the outcome of a dice. We will finish this section by considering the geometric distribution and the Gaussian distribution, both often used in technical contexts.
Example 1 [Geometric distribution]
The geometric distribution is a discrete distribution that can be explained from a sequence of coin flips. Assume that the probabilities for head and tail are given by P(head) = α and P(tail) = 1 − α. Let K be the number of flips until a tail shows up. The probability for this number to be k is

p_K(k) = \alpha^{k-1} (1 - \alpha), \qquad k = 1, 2, \ldots

and the distribution function is

F_K(k) = \sum_{i=1}^{k} \alpha^{i-1} (1 - \alpha) = (1 - \alpha) \sum_{n=0}^{k-1} \alpha^n = (1 - \alpha) \frac{1 - \alpha^k}{1 - \alpha} = 1 - \alpha^k
We can directly see that this is indeed a probability distribution by letting k → ∞ in the distribution function, which gives F_K(k) → 1. Another way is to sum over all probabilities,

\sum_{k=1}^{\infty} \alpha^{k-1} (1 - \alpha) = (1 - \alpha) \sum_{n=0}^{\infty} \alpha^n = (1 - \alpha) \frac{1}{1 - \alpha} = 1
Here, and in the remainder of this example, we will make use of the well known sums, for |α| < 1,

\sum_{n=0}^{\infty} \alpha^n = \frac{1}{1 - \alpha}, \qquad
\sum_{n=0}^{\infty} n \alpha^n = \frac{\alpha}{(1 - \alpha)^2}, \qquad
\sum_{n=0}^{\infty} n^2 \alpha^n = \frac{\alpha(1 + \alpha)}{(1 - \alpha)^3}
We can also find the expected value and the second order moment as

E[K] = \sum_{k=1}^{\infty} k \alpha^{k-1} (1 - \alpha) = (1 - \alpha) \sum_{n=0}^{\infty} (n + 1) \alpha^n
     = (1 - \alpha) \Big( \frac{\alpha}{(1 - \alpha)^2} + \frac{1}{1 - \alpha} \Big) = \frac{1}{1 - \alpha}

E[K^2] = \sum_{k=1}^{\infty} k^2 \alpha^{k-1} (1 - \alpha) = (1 - \alpha) \sum_{n=0}^{\infty} (n + 1)^2 \alpha^n
       = (1 - \alpha) \Big( \frac{\alpha(1 + \alpha)}{(1 - \alpha)^3} + \frac{2\alpha}{(1 - \alpha)^2} + \frac{1}{1 - \alpha} \Big) = \frac{1 + \alpha}{(1 - \alpha)^2} \qquad (3.9)
Hence, the variance is

V[K] = E[K^2] - E[K]^2 = \frac{1 + \alpha}{(1 - \alpha)^2} - \Big( \frac{1}{1 - \alpha} \Big)^2 = \frac{\alpha}{(1 - \alpha)^2}
For α = 1/2, i.e. a fair coin, we obtain

p_K(k) = \Big( \frac{1}{2} \Big)^k, \qquad E[K] = 2, \qquad E[K^2] = 6, \qquad V[K] = 2
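These closed-form expressions are easy to verify numerically. A minimal Python sketch, with the infinite sums truncated at a large index (the truncation point is an arbitrary choice, the tail contribution is negligible):

```python
import numpy as np

alpha = 0.5                       # probability of head; a fair coin
k = np.arange(1, 200)             # truncate the infinite sums
p = alpha**(k - 1) * (1 - alpha)  # p_K(k)

E  = np.sum(k * p)                # expected value, should be 1/(1-alpha) = 2
E2 = np.sum(k**2 * p)             # second order moment, should be (1+alpha)/(1-alpha)^2 = 6
V  = E2 - E**2                    # variance, should be alpha/(1-alpha)^2 = 2
print(E, E2, V)                   # -> 2.0 6.0 2.0 (up to truncation error)
```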
The next example treats the Gaussian distribution.
Example 2 [Gaussian distribution] The Gaussian distribution, or Normal distribution, is a very central distribution in probability theory. It is also a very common distribution for modelling continuous channels. The distribution is often denoted X ∼ N(m, σ) and the density function is

f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - m)^2}{2\sigma^2}} \qquad (3.10)
To show that this is actually a density function, and to derive the expectation and variance, we start with a version where we center the function at 0, Y ∼ N(0, σ), with density function

f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{y^2}{2\sigma^2}}
To show that this is a density function, i.e. that \int_{-\infty}^{\infty} f_Y(y)\, dy = 1, we consider the squared integral,

\Big( \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} dy \Big)^2
  = \frac{1}{2\pi\sigma^2} \int_{-\infty}^{\infty} e^{-\frac{y^2}{2\sigma^2}} dy \int_{-\infty}^{\infty} e^{-\frac{z^2}{2\sigma^2}} dz
  = \frac{1}{2\pi\sigma^2} \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} e^{-\frac{y^2 + z^2}{2\sigma^2}} dy\, dz
  = \frac{1}{2\pi\sigma^2} \int_0^{2\pi}\!\!\int_0^{\infty} r\, e^{-\frac{r^2\cos^2\phi + r^2\sin^2\phi}{2\sigma^2}} dr\, d\phi
  = \frac{1}{2\pi} \int_0^{2\pi}\!\!\int_0^{\infty} \frac{r}{\sigma^2} e^{-\frac{r^2}{2\sigma^2}} dr\, d\phi
  = \frac{1}{2\pi} \int_0^{2\pi} \Big[ -e^{-\frac{r^2}{2\sigma^2}} \Big]_0^{\infty} d\phi
  = \frac{1}{2\pi} \int_0^{2\pi} d\phi = 1

where we used the standard variable change to polar coordinates. The expectation and the second order moment can be derived as
E[Y] = \int_{-\infty}^{\infty} y \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} dy
     = \frac{\sigma^2}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} \frac{y}{\sigma^2} e^{-\frac{y^2}{2\sigma^2}} dy
     = \frac{\sigma^2}{\sqrt{2\pi\sigma^2}} \Big[ -e^{-\frac{y^2}{2\sigma^2}} \Big]_{-\infty}^{\infty} = 0

E[Y^2] = \int_{-\infty}^{\infty} y^2 \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} dy
       = \frac{\sigma^2}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} y \frac{y}{\sigma^2} e^{-\frac{y^2}{2\sigma^2}} dy
       = \frac{\sigma^2}{\sqrt{2\pi\sigma^2}} \Big( \Big[ -y e^{-\frac{y^2}{2\sigma^2}} \Big]_{-\infty}^{\infty} - \int_{-\infty}^{\infty} -e^{-\frac{y^2}{2\sigma^2}} dy \Big)
       = \sigma^2 \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} dy = \sigma^2
To make the same derivations in the more general case X ∼ N(m, σ) we will make use of the variable change y = x − m (implying dy = dx and x = y + m). Then Y ∼ N(0, σ), and we get

\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - m)^2}{2\sigma^2}} dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} dy = 1

so it is a density function. The expectation and second order moment are
E[X] = \int_{-\infty}^{\infty} x \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - m)^2}{2\sigma^2}} dx
     = \int_{-\infty}^{\infty} (y + m) \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} dy
     = E[Y] + m \int_{-\infty}^{\infty} f_Y(y)\, dy = m

E[X^2] = \int_{-\infty}^{\infty} x^2 \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - m)^2}{2\sigma^2}} dx
       = \int_{-\infty}^{\infty} (y + m)^2 \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{y^2}{2\sigma^2}} dy
       = E[Y^2] + 2m E[Y] + m^2 = \sigma^2 + m^2

and the variance

V[X] = E[X^2] - E[X]^2 = \sigma^2
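The same results can be checked by simulation. A minimal Python sketch (the parameter values are arbitrary choices) estimates E[X], E[X²] and V[X] from a large number of N(m, σ) samples:

```python
import numpy as np

m, sigma = 1.5, 2.0                       # arbitrary example parameters
rng = np.random.default_rng(0)
x = rng.normal(m, sigma, size=1_000_000)  # samples from N(m, sigma)

print(x.mean())        # approx m          = 1.5
print((x**2).mean())   # approx sigma^2+m^2 = 6.25
print(x.var())         # approx sigma^2    = 4.0
```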
3.3 Weak law of large numbers
The weak law of large numbers will play a central role when proving the source coding theorem, which sets a bound on the compression ratio, as well as the channel coding theorem, which sets a limit on the capability of reliable communication. Those two theorems are very important results in information theory. In this section first Markov's inequality and then Chebyshev's inequality will be stated. These are famous bounds in statistics. The latter will be used to show the weak law of large numbers.
The first inequality is stated in the following theorem. Here we will only consider the
case for discrete random variables but the result holds also for continuous variables.
Theorem 2 (Markov's inequality) Let X be a non-negative random variable with finite expected value. Then, for any a > 0,

P(X > a) \le \frac{E[X]}{a}
To show this we start with the expected value,

E[X] = \sum_{k=0}^{\infty} k\, p(k) \ge \sum_{k=a+1}^{\infty} k\, p(k) \ge \sum_{k=a+1}^{\infty} a\, p(k) = a \sum_{k=a+1}^{\infty} p(k) = a\, P(X > a)

which gives the result.
If we instead of X consider the squared distance to the expected value E[X] we get the following result.

P\big( |X - E[X]| > \epsilon \big) = P\big( (X - E[X])^2 > \epsilon^2 \big) \le \frac{E\big[ (X - E[X])^2 \big]}{\epsilon^2} = \frac{V[X]}{\epsilon^2}

Equivalently this can be stated as in the following theorem, which is Chebyshev's inequality.
Theorem 3 (Chebyshev's inequality) Let X be a random variable with finite expected value E[X] and finite variance V[X]. Then, for any positive ε,

P\big( |X - E[X]| > \epsilon \big) \le \frac{V[X]}{\epsilon^2}
As stated previously this can be used to give a first proof of the weak law of large numbers. Consider a sequence of independent identically distributed (i.i.d.) random variables, X_i, i = 1, 2, . . . , n. The arithmetic mean of the sequence can be viewed as a new random variable,

Y = \frac{1}{n} \sum_{i=1}^{n} X_i

From Theorem 1 we see that the expected value and variance of Y can be expressed as

E[Y] = E[X], \qquad V[Y] = \frac{V[X]}{n}
Applying Chebyshev's inequality yields

P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} X_i - E[X] \Big| > \epsilon \Big) \le \frac{V[X]}{n \epsilon^2}
As n grows the right hand side will tend to zero, bounding the arithmetic mean close to the expected value. Stated differently we get

\lim_{n \to \infty} P\Big( \Big| \frac{1}{n} \sum_{i=1}^{n} X_i - E[X] \Big| < \epsilon \Big) = 1
which gives the weak law of large numbers. This probabilistic convergence is denoted \xrightarrow{p} and is read out as convergence in probability. This gives the following way to express the relation.
Theorem 4 (Weak law of large numbers) Let X₁, X₂, . . . , X_n be i.i.d. random variables with finite expectation E[X]. Form the arithmetic mean as Y = \frac{1}{n} \sum_i X_i. Then Y converges in probability to E[X],

Y \xrightarrow{p} E[X], \qquad n \to \infty

It should be noted that the proof given above requires that the variance is finite, but it is possible to prove the more general theorem stated here as well.
To see how the convergence towards the expectation works in practice we give an example with a binary vector.
Example 3 Consider a length n vector of i.i.d. binary random variables, X = (X₁, X₂, . . . , X_n), with p_X(0) = 1/3 and p_X(1) = 2/3. The probability for a vector to have k ones is

P(k \text{ 1s in } X) = \binom{n}{k} \Big( \frac{2}{3} \Big)^k \Big( \frac{1}{3} \Big)^{n-k} = \binom{n}{k} \frac{2^k}{3^n}

In Figure 3.1 the probability distribution of the number of 1s in a vector of length n = 5 is shown. We see here that a vector with three or four 1s is most likely, and it is interesting to observe that it is less probable to get all five 1s than to get three or four, even though p(1) = 2/3.

  k                0        1        2        3        4        5
  P(k 1s in X)     0.0041   0.0412   0.1646   0.3292   0.3292   0.1317

Figure 3.1: Probability distribution for k 1s in a vector of length 5, when p(1) = 2/3 and p(0) = 1/3.
In Figure 3.2 the distribution of the number of 1s is shown when the length of the vector is increased to 10, 50, 100 and 500. With increasing length it becomes more evident that the most likely number of 1s is about n · E[X].

[Figure 3.2: Probability distributions for k 1s in a vector of length 10 (a), 50 (b), 100 (c) and 500 (d), when p(1) = 2/3 and p(0) = 1/3.]
While we are dealing with the arithmetic mean of i.i.d. random variables it would not be fair not to mention the central limit theorem. This and the law of large numbers are the two main limit theorems in probability theory. However, we will here give the theorem without a proof (as in many basic probability courses). The result is that the arithmetic mean of a sequence of i.i.d. random variables will be normally distributed, independent of the distribution of the variables.
Theorem 5 (Central limit theorem) Let X₁, X₂, . . . , X_n be i.i.d. random variables with finite expectation E[X] = m and finite variance V[X] = σ². Form the arithmetic mean as Y = \frac{1}{n} \sum_i X_i. Then, as n goes to infinity, Y becomes distributed according to a normal distribution, i.e.

\frac{Y - m}{\sigma/\sqrt{n}} \sim N(0, 1), \qquad n \to \infty
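A small simulation illustrates both limit theorems for the binary source in Example 3. The sketch below is only an illustration, with the number of repetitions chosen arbitrarily; it shows the arithmetic mean concentrating around E[X] = 2/3 (weak law) and its normalized version having standard deviation close to 1 (central limit theorem):

```python
import numpy as np

rng = np.random.default_rng(1)
p1 = 2/3                                    # P(X = 1), as in Example 3

for n in (10, 100, 1000, 10000):
    x = rng.random((5000, n)) < p1          # 5000 vectors of n i.i.d. bits
    y = x.mean(axis=1)                      # arithmetic means Y
    z = (y - p1) / np.sqrt(p1 * (1 - p1) / n)   # normalized means
    # y concentrates around 2/3; z behaves approximately as N(0, 1)
    print(n, y.mean().round(4), y.std().round(4), z.std().round(3))
```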
3.4 Jensen's inequality
In many applications, like optimization, functions that are convex are desirable since they
have a structure that is easy to work with. A convex function is defined in the following.
Definition 4 (Convex function) A function g(x) is convex in the interval [a, b] if, for any
x1 , x2 ∈ [a, b] and any λ, 0 ≤ λ ≤ 1,
g(λx1 + (1 − λ)x2) ≤ λg(x1) + (1 − λ)g(x2)
Similarly, a function g(x) is concave in the interval [a, b] if −g(x) is convex in the same interval.
The inequality in the definition can be viewed graphically as in Figure 3.3. Let x1 and x2 be two numbers in the interval [a, b], such that x1 < x2. Then, we can also mark the function values, g(x1) and g(x2), on the plot of g(x). For λ in [0, 1], we get that

x1 ≤ λx1 + (1 − λ)x2 ≤ x2

with equality to the left if λ = 1 and to the right if λ = 0. Then we can also mark the corresponding function value g(λx1 + (1 − λ)x2). While λx1 + (1 − λ)x2 is a value between
x1 and x2, the corresponding value between g(x1) and g(x2) is λg(x1) + (1 − λ)g(x2), i.e. the coordinates

\big( λx1 + (1 − λ)x2,\; λg(x1) + (1 − λ)g(x2) \big)

describe a straight line between (x1, g(x1)) and (x2, g(x2)) for 0 ≤ λ ≤ 1. From this reasoning we see that a convex function is typically shaped like a bowl. Similarly, a concave function is the opposite, shaped like a hill.
Example 4 The functions x² and eˣ are typical examples of convex functions. On the other hand, the functions −x² and log x are concave functions.
In the literature the names convex ∪ and convex ∩ are also used instead of convex and
concave.
Since λ and 1 − λ can be interpreted as a binary probability function, both λx1 + (1 − λ)x2
and λg(x1 ) + (1 − λ)g(x2 ) can be viewed as expected values. A more general way of
putting this is Jensen’s inequality.
Figure 3.3: A graphical view of the definition of convex functions.
Theorem 6 (Jensen’s inequality) If f (x) is a convex function and X a random variable we
have
E[f (X)] ≥ f (E[X])
If f (x) is a concave function and X a random variable we have
E[f (X)] ≤ f (E[X])
Jensen’s inequality is so important that a more thorough outline of the proof is required.
Even though it will only be shown for the discrete case here, it is also valid in the continuous case. As stated prior to the theorem, the binary case follows directly from the
definition of convexity.
To show that the theorem also holds for distributions with more than two outcomes we will use induction. Assume that a₁, a₂, . . . , a_n are positive numbers such that \sum_i a_i = 1 and that

f\Big( \sum_i a_i x_i \Big) \le \sum_i a_i f(x_i)

where f(x) is a convex function. Furthermore, let p₁, p₂, . . . , p_{n+1} be a probability distribution for X. Then

f\big( E[X] \big) = f\Big( \sum_{i=1}^{n+1} p_i x_i \Big)
  = f\Big( p_1 x_1 + \sum_{i=2}^{n+1} p_i x_i \Big)
  = f\Big( p_1 x_1 + (1 - p_1) \sum_{i=2}^{n+1} \frac{p_i}{1 - p_1} x_i \Big)
  \overset{(a)}{\le} p_1 f(x_1) + (1 - p_1) f\Big( \sum_{i=2}^{n+1} \frac{p_i}{1 - p_1} x_i \Big)
  \overset{(b)}{\le} p_1 f(x_1) + (1 - p_1) \sum_{i=2}^{n+1} \frac{p_i}{1 - p_1} f(x_i)
  = \sum_{i=1}^{n+1} p_i f(x_i) = E\big[ f(X) \big]
where inequality (a) comes from the convexity of f and (b) from the induction assumption. What is left to show is that a_i = \frac{p_i}{1 - p_1} satisfies the requirements above. Clearly, since p_i is a probability we have that a_i ≥ 0. The second requirement follows from

\sum_{i=2}^{n+1} \frac{p_i}{1 - p_1} = \frac{1}{1 - p_1} \sum_{i=2}^{n+1} p_i = \frac{1}{1 - p_1} (1 - p_1) = 1

which completes the proof.
Example 5 Since f(x) = x² is a convex function we can see that E[X²] ≥ E[X]², which shows that the variance is a non-negative quantity.
Clearly, the above example comes as no surprise, since it follows already from the definition of the variance. A somewhat more interesting result from Jensen's inequality is the so called log-sum inequality. The function f(t) = t log t is in fact a convex function. Then, if the α_i form a probability distribution, it follows from Jensen's inequality that

\sum_i \alpha_i t_i \log t_i \ge \Big( \sum_i \alpha_i t_i \Big) \log \Big( \sum_i \alpha_i t_i \Big) \qquad (3.11)
This can be used to get

\sum_i a_i \log \frac{a_i}{b_i}
  = \sum_j b_j \sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i} \log \frac{a_i}{b_i}
  \ge \sum_j b_j \Big( \sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i} \Big) \log \Big( \sum_i \frac{b_i}{\sum_j b_j} \frac{a_i}{b_i} \Big)
  = \sum_j b_j \frac{\sum_i a_i}{\sum_j b_j} \log \frac{\sum_i a_i}{\sum_j b_j}
  = \sum_i a_i \log \frac{\sum_i a_i}{\sum_i b_i}

where we identified \alpha_i = \frac{b_i}{\sum_j b_j} and t_i = \frac{a_i}{b_i} in (3.11). Summarizing, we have the following theorem.
Theorem 7 (log-sum inequality) Let a₁, . . . , a_n and b₁, . . . , b_n be non-negative numbers. Then

\sum_i a_i \log \frac{a_i}{b_i} \ge \Big( \sum_i a_i \Big) \log \frac{\sum_i a_i}{\sum_i b_i}
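The inequality is easy to test numerically. A small Python sketch with randomly chosen non-negative numbers (the vector length and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.random(5)       # non-negative numbers a_i
b = rng.random(5)       # non-negative numbers b_i

lhs = np.sum(a * np.log2(a / b))
rhs = np.sum(a) * np.log2(np.sum(a) / np.sum(b))
print(lhs >= rhs, lhs, rhs)   # the log-sum inequality: lhs >= rhs
```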
3.5 Stochastic Processes
So far, we have assumed that the source symbols are independent in time. But it is often useful to also consider how sequences depend on time, i.e. a dynamic system. We will here look at an example to see that texts have dependencies in time and that one letter is clearly dependent on the surrounding letters. This example comes directly from the work of Claude Shannon [15].
Shannon assumed an alphabet with 27 symbols, i.e. 26 letters and 1 space. To get the 0th
order approximation a sample text was generated with equal probability for the letters.
Example 6 [0th order approximation] Choose letters from the English alphabet with equal
probability.
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD
QPAAMKBZAACIBZLHJQD
Clearly, this text does not have much in common with normal written English. So, instead, he counted the number of occurrences per letter in normal English texts, and estimated the probabilities. The probabilities are given by Table 3.1.
  X   P (%)      X   P (%)      X   P (%)
  A   8.167      J   0.153      S   6.327
  B   1.492      K   0.772      T   9.056
  C   2.782      L   4.025      U   2.758
  D   4.253      M   2.406      V   0.978
  E   12.702     N   6.749      W   2.360
  F   2.228      O   7.507      X   0.150
  G   2.015      P   1.929      Y   1.974
  H   6.094      Q   0.095      Z   0.074
  I   6.966      R   5.987

Table 3.1: Probabilities in percent for the letters in English text.
Then, according to those probabilities, a sample text for the 1st order approximation can be generated. Such a text is shown in the next example. Here, the text has more of the structure of English, but it is still far from readable.
Example 7 [1st order approximation] Choose the symbols according to their normal probability (12 % E, 2 % W, etc.):
OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI
ALHENHTTPA OOBTTVA NAH BRL
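A first order approximation of this kind can be generated with a few lines of Python. The sketch below samples letters independently according to Table 3.1; the weight used for the space symbol is an assumption, since the table only lists the 26 letters:

```python
import random

# Letter probabilities (in percent) from Table 3.1, plus an assumed weight for space.
freq = {
    'A': 8.167, 'B': 1.492, 'C': 2.782, 'D': 4.253, 'E': 12.702, 'F': 2.228,
    'G': 2.015, 'H': 6.094, 'I': 6.966, 'J': 0.153, 'K': 0.772, 'L': 4.025,
    'M': 2.406, 'N': 6.749, 'O': 7.507, 'P': 1.929, 'Q': 0.095, 'R': 5.987,
    'S': 6.327, 'T': 9.056, 'U': 2.758, 'V': 0.978, 'W': 2.360, 'X': 0.150,
    'Y': 1.974, 'Z': 0.074, ' ': 18.0,   # the space weight is an assumption
}
symbols, weights = zip(*freq.items())
print(''.join(random.choices(symbols, weights=weights, k=60)))
```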
The next step is to extend the distribution so the probability depends on the previous
letter, i.e. the probability for the letter at time t becomes P (St |St−1 ). In the next example
such text is given.
Example 8 [2nd order approximation] Choose the letters conditioned on the previous
letter:
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN
D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY
TOBE SEACE CTISBE
Similarly, in the next example the 3rd order approximation conditions on the two previous letters. We see here that the text becomes more like English, even if it still does not make any sense.
Example 9 [3rd order approximation] Choose the symbols conditioned on the two previous symbols:
IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID
PONDENOME OF DEMONSTRURES OF THE REPTAGIN IS
REGOACTIONA OF CRE
If, instead of letters, the text is generated from probabilities for words, the first order approximation uses the unconditioned word probabilities.
Example 10 [1st order word approximation] Choose words independently (but with probability):
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME
CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE
TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE
MESSAGE HAD BE THESE
If the probabilities for words are conditioned on the previous word we can get a much
more readable text. Still without any direct meaning, of course.
Example 11 [2nd order word approximation] Choose words conditioned on the previous word:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER
THAT THE CHARACTER OF THIS POINT IS THEREFORE
ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO
EVER TOLD THE PROBLEM FOR AN UNEXPECTED
The above examples show that in many situations it is important to view sequences instead of individual symbols. In probability theory such a sequence is denoted a random process, or a stochastic process. In a general form it can be defined as follows.
Definition 5 (Random process) A discrete random process is a sequence of random variables, \{X_i\}_{i=1}^{n}, defined on the same sample space.

There can be an arbitrary dependency among the variables, and the process is characterized by the joint probability function

P\big( (X_1, X_2, \ldots, X_n) = (x_1, x_2, \ldots, x_n) \big) = p(x_1, x_2, \ldots, x_n), \qquad n = 1, 2, \ldots
Since there is a dependency in time we also need more general parameters, corresponding to the second order moment and the variance of a random variable but depending also on the time shift. The autocorrelation function reflects the correlation in time and is defined as

r_{XX}(n, n+k) = E[X_n X_{n+k}]
If the mean and the autocorrelation function are independent of the time n, i.e. for all n

E[X_n] = E[X]
r_{XX}(n, n+k) = r_{XX}(k)

the process is said to be wide sense stationary (WSS). The relation with the second order moment is that r_{XX}(0) = E[X²]. The corresponding relation for the variance comes with the autocovariance function, defined for a WSS process as

c_{XX}(k) = E\big[ (X_n - E[X])(X_{n+k} - E[X]) \big]

for all n. We can easily see that c_{XX}(k) = r_{XX}(k) − E[X]² and that c_{XX}(0) = V[X].
The class of WSS processes is very powerful for modelling. However, sometimes we want an even stronger condition on the time invariance. We say that a process is stationary, or time-invariant, if the probability distribution does not depend on the time shift. That is, if

P\big( (X_1, \ldots, X_n) = (x_1, \ldots, x_n) \big) = P\big( (X_{\ell+1}, \ldots, X_{\ell+n}) = (x_1, \ldots, x_n) \big)

for all n and time shifts ℓ. Clearly this is a subclass of the WSS processes.
3.6 Markov process
A widely used class of the discrete stationary random processes is the class of Markov processes. Here, the probability for a symbol depends only on the previous symbol. With this simplification we get a system that is relatively easy to handle from a mathematical point of view, while the time dependency in the sequence still makes it a powerful modelling tool.
Definition 6 (Markov chain) A Markov chain, or Markov process, is a stationary random process with unit memory, i.e.

P(x_n | x_{n-1}, \ldots, x_1) = P(x_n | x_{n-1})

for all x_i.
Using the chain rule for probabilities (3.2) we conclude that for a Markov chain the joint probability function is

p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p(x_i | x_{i-1}) = p(x_1) p(x_2|x_1) p(x_3|x_2) \cdots p(x_n|x_{n-1})
The unit memory of a Markov chain gives that we can characterize it by

• A finite set of states

  X = \{x_1, x_2, \ldots, x_r\}

  where the state determines everything about the past. This represents the unit memory of the chain.

• A state transition matrix

  P = [p_{ij}]_{i,j \in \{1,2,\ldots,r\}} \quad \text{where} \quad p_{ij} = p(x_j | x_i)

  and \sum_j p_{ij} = 1.
The behaviour of a Markov chain can be visualized in a state transition graph consisting
of states and edges, labeled with probabilities. The following example explains how this
is related.
Example 12 A three state Markov chain is described by the three states

X = \{s_1, s_2, s_3\}

and a state transition matrix

P = \begin{pmatrix} 1/3 & 2/3 & 0 \\ 1/4 & 0 & 3/4 \\ 1/2 & 1/2 & 0 \end{pmatrix}

From P we see that, conditioned on s1 being the previous state, the probability for s1 is 1/3, for s2 it is 2/3 and for s3 it is 0. This can be viewed as transitions in a graph from state s1 to the other states, see Figure 3.4. Similarly, the other rows in the state transition matrix describe the probabilities for transitions from the other states.
Figure 3.4: A state transition graph of a three state Markov chain.
For a Markov chain with r states the transition matrix will be an r × r matrix with the transition probabilities. Let the initial probabilities for the states at time 0 be

w^{(0)} = \big( w_1^{(0)} \; \cdots \; w_r^{(0)} \big)

where w_i^{(0)} = P(X_0 = i). Then the probability for being in state j at time 1 becomes

w_j^{(1)} = P(X_1 = j) = \sum_i P(X_1 = j | X_0 = i) P(X_0 = i) = \sum_i p_{ij} w_i^{(0)}
Hence, the vector describing the state probabilities at time 1 becomes

w^{(1)} = w^{(0)} P

where P is the state transition matrix. Similarly, letting w^{(n)} be the state probabilities at time n, we get

w^{(n)} = w^{(n-1)} P = \cdots = w^{(0)} P^n
In the following example we view the transition matrices for 2, 4, 8 and 16 steps. Here we see that the rows of P^n become more and more alike, i.e. the state probabilities become more and more independent of the starting distribution.
Example 13 Continuing with the Markov chain from Example 12, the state transition matrix there is

P = \begin{pmatrix} 1/3 & 2/3 & 0 \\ 1/4 & 0 & 3/4 \\ 1/2 & 1/2 & 0 \end{pmatrix}
It shows the probabilities for the transitions from time 0 to time 1. If we instead consider P^n we have the transition probabilities from time 0 to time n. Below we derive the transition probabilities for 2, 4, 8 and 16 steps.
 20 16 36  

0.2778
0.2222
0.5000
72
72
72

 

39
0  ≈ 0.4583 0.5417
0 
P 2 = P P =  33
72
72
24
27
21
0.2917 0.3333 0.3750
72
72
72
 1684
P 4 = P 2P 2 =
5184
 1947
 5184
1779
5184
1808
5184
2049
5184
1920
5184
1692
5184
1188 
5184 
1485
5184



0.3248 0.3488 0.3264


≈ 0.3756 0.3953 0.2292
0.3432 0.3704 0.2865


0.3485 0.3720 0.2794


P 8 = P 4 P 4 ≈ 0.3491 0.3721 0.2788
0.3489 0.3722 0.2789
P 16


0.3488 0.3721 0.2791


= P 8 P 8 ≈ 0.3488 0.3721 0.2791
0.3488 0.3721 0.2791
Already for 16 steps in the graph the numbers in each column are equal, at this accuracy of four decimals (actually, we have this already for P^12). This means that after 16 or more steps in the graph the probability for a state is independent of the starting state. Hence, with any starting distribution w we get

w^{(16)} = w P^{16} = (0.3488 \;\; 0.3721 \;\; 0.2791)

With higher accuracy in the derivations we need to keep multiplying longer, but eventually we reach a stage where the result does not change.
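The matrix powers in the example are easily reproduced numerically. A short Python sketch using numpy:

```python
import numpy as np

P = np.array([[1/3, 2/3, 0  ],
              [1/4, 0,   3/4],
              [1/2, 1/2, 0  ]])

for n in (2, 4, 8, 16):
    print(n)
    print(np.round(np.linalg.matrix_power(P, n), 4))
# For n = 16 all three rows equal (0.3488, 0.3721, 0.2791).
```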
As seen² in Example 13 we will reach an asymptotic distribution w = (w₁, . . . , w_r) such that

\lim_{n \to \infty} P^n = \begin{pmatrix} w \\ w \\ \vdots \\ w \end{pmatrix} = \begin{pmatrix} w_1 & \cdots & w_r \\ w_1 & \cdots & w_r \\ \vdots & \ddots & \vdots \\ w_1 & \cdots & w_r \end{pmatrix}
In this limit we can also see that

\begin{pmatrix} w \\ \vdots \\ w \end{pmatrix} = \lim_{n \to \infty} P^{n+1} = \lim_{n \to \infty} P^n P = \begin{pmatrix} w \\ \vdots \\ w \end{pmatrix} P

Considering one row in the left matrix we conclude that

w = wP

which is the stationary distribution of the system. We are now ready to state a theorem for the relationship between the stationary and the asymptotic distribution.

² We have omitted the formal proof in this text.
Theorem 8 Let w = (w₁ w₂ · · · w_r) be an asymptotic distribution of the state probabilities. Then

• \sum_j w_j = 1

• w is a stationary distribution, i.e. wP = w

• w is the unique stationary distribution for the source.
Clearly the first statement is fulfilled since w is a distribution. The second has already been shown above. The third, on uniqueness, we still need to prove.

Assume that v = (v₁ · · · v_r) is a stationary distribution, i.e. it fulfills \sum_i v_i = 1 and vP = v. Then, as n → ∞, the equation v = vP^n can be written

(v_1 \; \ldots \; v_r) = (v_1 \; \ldots \; v_r) \begin{pmatrix} w_1 & \cdots & w_j & \cdots & w_r \\ \vdots & & \vdots & & \vdots \\ w_1 & \cdots & w_j & \cdots & w_r \end{pmatrix}

This implies that

v_j = (v_1 \; \cdots \; v_r) \begin{pmatrix} w_j \\ \vdots \\ w_j \end{pmatrix} = v_1 w_j + \cdots + v_r w_j = w_j \underbrace{(v_1 + \cdots + v_r)}_{=1} = w_j

That is, v = w, which proves uniqueness.
To derive the stationary distribution we start with the equation wP = w. Equivalently we can write wP − w = 0, and

w(P − I) = 0

However, since w ≠ 0 we see that P − I does not have full rank and we need at least one more equation. For this we can use \sum_j w_j = 1. Hence, the equation system we should solve is

\begin{cases} w(P - I) = 0 \\ \sum_j w_j = 1 \end{cases}
In the next example we show the procedure.
Example 14 Again use the state transition matrix from Example 12,

P = \begin{pmatrix} 1/3 & 2/3 & 0 \\ 1/4 & 0 & 3/4 \\ 1/2 & 1/2 & 0 \end{pmatrix}
Starting with w(P − I) = 0 we have

w \Bigg( \begin{pmatrix} 1/3 & 2/3 & 0 \\ 1/4 & 0 & 3/4 \\ 1/2 & 1/2 & 0 \end{pmatrix} - I \Bigg)
  = w \begin{pmatrix} -2/3 & 2/3 & 0 \\ 1/4 & -1 & 3/4 \\ 1/2 & 1/2 & -1 \end{pmatrix} = 0
In P − I the columns sum to zero (column 2 plus column 3 equals minus column 1), so the matrix does not have full rank. Therefore, exchange column 1 with the equation \sum_j w_j = 1,

w \begin{pmatrix} 1 & 2/3 & 0 \\ 1 & -1 & 3/4 \\ 1 & 1/2 & -1 \end{pmatrix} = (1 \;\; 0 \;\; 0)
This is solved by

w = (1 \;\; 0 \;\; 0) \begin{pmatrix} 1 & 2/3 & 0 \\ 1 & -1 & 3/4 \\ 1 & 1/2 & -1 \end{pmatrix}^{-1}
  = (1 \;\; 0 \;\; 0) \begin{pmatrix} 15/43 & 16/43 & 12/43 \\ 42/43 & -24/43 & -18/43 \\ 36/43 & 4/43 & -40/43 \end{pmatrix}
  = \big( 15/43 \;\; 16/43 \;\; 12/43 \big)
Derived with four decimals we get the same as earlier when the asymptotic distribution was discussed,

w ≈ (0.3488 \;\; 0.3721 \;\; 0.2791)
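The same computation, written as a small Python sketch with a linear-equation solver (the column exchange follows the procedure in Example 14):

```python
import numpy as np

P = np.array([[1/3, 2/3, 0  ],
              [1/4, 0,   3/4],
              [1/2, 1/2, 0  ]])

# Solve w(P - I) = 0 together with sum_j w_j = 1:
# replace the first column of P - I by all ones and solve w M = (1, 0, 0).
M = P - np.eye(3)
M[:, 0] = 1
w = np.linalg.solve(M.T, np.array([1.0, 0.0, 0.0]))
print(w)   # [0.3488 0.3721 0.2791], i.e. (15/43, 16/43, 12/43)
```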
Chapter 4
Information Measure
Information theory is a mathematical theory of communication that has its base in probability theory. It involves specifying what information is and setting up measures for information, and it had its birth with Shannon's landmark paper from 1948 [15]. The theory gives answers to two fundamental questions:
• What is information?
• What is communication?
Even if we think we know what information and communication are, that is not the same as defining them in a mathematical sense. As an example one can consider an electronic
copy of Claude Shannon’s paper in pdf format. The one considered here has the file size
2.2MB. It contains a lot of information about the subject, but how can it be measured?
One way to look at it is to compress it as much as we can. In zip-format the same file
has the size 713kB. So if we should quantify the amount of information in the paper,
the pdf version contains at least 1.5MB that is not necessary for the pure information
in the text. Is then the number 713kB a measure of the contained information? From a
mathematical point of view, we will see that it is definitely closer to the truth. However,
from a philosophical point of view, it is not certain. We can compare this text with a
text of the same size containing only randomly chosen letters. If they have the same file
size, do they contain the same amount of information? These semantic doubts are not
considered in the mathematical model. Instead the question is how much information is
needed to describe the text.
4.1 Information
In his paper Shannon set up a mathematical theory for information and communication, based on probability theory. He gave a quantitative measure of the amount of information stored in one variable and gave limits on how much information can be transmitted from one place to another over a given communication channel.
However, already twenty years earlier, in 1928, Hartley stated that a symbol can contain information only if it has multiple choices [6]. That is, the symbol must be a random variable. Hartley argued that if one symbol, a, has k alternatives, a vector of n independent such symbols (a₁, . . . , a_n) has k^n alternatives. To form a measure of information, one must notice that if the symbol a has the information I, then the vector should have the information nI. The conclusion of this was that an appropriate information measure should be the logarithm of the number of alternatives,

I_H(a) = \log k

In that way

I_H(a_1, \ldots, a_n) = \log k^n = n \log k
Example 15 Consider the outcome of a throw with a fair dice. It has 6 alternatives, and hence, the information according to Hartley is¹

I_H(\text{Dice}) = \log_2 6 \approx 2.585 \text{ bits}
In this example Hartley's information measure makes sense, since it is the number of bits needed to point out one of the six alternatives. But there can be other situations where it runs into problems, like in the next example.

Example 16 Let the variable X be the outcome of a counterfeit coin, with the probabilities P(X = Head) = p and P(X = Tail) = 1 − p. According to Hartley the information is

I_H(X) = \log_2 2 = 1 \text{ bit}
In the previous example the measurement makes sense if the two outcomes are equally likely. Consider instead the case when p is very small, and flip the coin several times after each other. Since p is small we would expect most of the outcomes to be Tail and only a small fraction to be Head. Hence, the normal case in such a vector of outcomes would be Tail, meaning there is not much information in this outcome. In the rare cases when we get Head instead, there would be much more information. In other words, even if Hartley's information measure was groundbreaking at its time, it lacked consideration of the outcome distribution.
In Shannon's information measure, introduced 20 years later, the base of the information quantity is the probability distribution of a random variable. The information achieved about the outcome of one variable by observing another can be viewed as a comparison between the unconditioned and the conditioned probability for that outcome.

¹ Hartley did not specify the base of the logarithm, but using the binary base the information gets the unit bits. In this way it specifies the number of bits required to distinguish between the alternatives.
Definition 7 The information about the event {X = x} from the event {Y = y}, denoted I(X = x; Y = y), is

I(X = x; Y = y) = \log_b \frac{P(X = x | Y = y)}{P(X = x)}

where we assume that P(X = x) ≠ 0 and P(Y = y) ≠ 0.
If nothing else is stated, we will assume the logarithmic base b = 2 to obtain the unit bit. This unit was first used in Shannon's paper, but it was John W. Tukey who coined the expression.
Example 17 The outcome of a dice is reflected by the two random variables

X = Number of pips
Y = Odd or even number

The information achieved about the event X = 3 from the event Y = Odd is

I(X = 3; Y = \text{Odd}) = \log \frac{P(X = 3 | Y = \text{Odd})}{P(X = 3)} = \log \frac{1/3}{1/6} = \log 2 = 1 \text{ bit}

In other words, by knowing that the number of pips is odd, which splits the set of outcomes in half, we gain one bit of information about the event that the number is 3.
We can see from the following derivation that the information achieved from one event about another is a symmetric measure.

I(A; B) = \log \frac{P(A|B)}{P(A)} = \log \frac{P(A, B)}{P(A) P(B)} = \log \frac{P(B|A)}{P(B)} = I(B; A)
Example 18 [cont'd] The information from the event X = 3 about the event Y = Odd is

I(Y = \text{Odd}; X = 3) = \log \frac{P(Y = \text{Odd} | X = 3)}{P(Y = \text{Odd})} = \log \frac{1}{1/2} = \log 2 = 1 \text{ bit}

The knowledge about X = 3 gives us full knowledge about the outcome of Y, which is a binary choice with two equally sized parts. To specify one of the two outcomes of Y one bit is required.
The mutual information can be bounded by

-\infty \le I(X = x; Y = y) \le -\log P(Y = y) \qquad (4.1)

To see this we use the symmetry above and write I(X = x; Y = y) = \log \frac{P(Y = y | X = x)}{P(Y = y)}, where P(Y = y | X = x) is a value between 0 and 1. The two end cases give

P(Y = y | X = x) = 0 \;\Rightarrow\; I(X = x; Y = y) = \log \frac{0}{P(Y = y)} = \log 0 = -\infty

P(Y = y | X = x) = 1 \;\Rightarrow\; I(X = x; Y = y) = \log \frac{1}{P(Y = y)} = -\log P(Y = y)
Notice that since 0 ≤ P(Y = y) ≤ 1 the value −log P(Y = y) is non-negative. If I(X = x; Y = y) = 0 the events X = x and Y = y are statistically independent, since

I(X = x; Y = y) = 0 \;\Rightarrow\; \frac{P(X = x | Y = y)}{P(X = x)} = 1
We are now ready to consider the self information, i.e. the information achieved about
an event by observing the same event.
Definition 8 Let X = Y. Then, the self information in the event X = x is defined as

I(X = x) = I(X = x; X = x) = \log \frac{P(X = x | X = x)}{P(X = x)} = -\log P(X = x)

Hence, −log P(X = x) is the amount of information needed to determine that the event X = x has occurred. The self information is always a non-negative quantity.
4.2 Entropy
The above quantities deal with the information in specific events. If we instead consider the amount of information required on average to determine the outcome of a random variable, we need to consider the expected value of the self information. We get the following important definition.

Definition 9 The entropy, which is a measure of the uncertainty of a random variable, is derived as

H(X) = E_X\big[ -\log p(x) \big] = -\sum_x p(x) \log p(x) \qquad (4.2)
In the derivations, we use the convention that 0 log 0 = 0, which stems from the corresponding limit value. Here, and hereafter, if nothing else is stated the logarithm uses the binary base, i.e. log₂. To convert logarithms to this base we use that

x = a^{\log_a x} = b^{\log_b x} = a^{\log_a b \cdot \log_b x}

where we used that b = a^{\log_a b}. This leads to

\log_a x = \log_a b \cdot \log_b x \quad \Rightarrow \quad \log_b x = \frac{\log_a x}{\log_a b}

Especially, it is convenient to use

\log_2 x = \frac{\ln x}{\ln 2} = \frac{\log_{10} x}{\log_{10} 2}
In e.g. Matlab there is a command log2(n) to derive the binary logarithm.
Since p is a probability, and therefore between 0 and 1, we can directly say that −log p is a non-negative quantity. That also means that the sum in (4.2) is non-negative,

H(X) \ge 0 \qquad (4.3)

i.e. the uncertainty cannot be negative.
Example 19 Consider a coin that could be counterfeit, and let X be the outcome of one flip, with sample space Ω = {Head, Tail}. Denote the probabilities for the outcomes by

P(Head) = p        P(Tail) = 1 − p

The uncertainty of the random variable X is

H(X) = -p \log p - (1 - p) \log(1 - p)

and is shown in Figure 4.1. As an example, if p = 0.1 we get the entropy H(X) = 0.469. That is, on average 0.469 bits of information are needed to determine the outcome of a flip.
The entropy function of a binary choice, as in the previous example, is very important
and has its own notation.
Definition 10 The binary entropy function for the probability p is defined as
h(p) = −p log p − (1 − p) log(1 − p)
It follows directly that the binary entropy function is symmetric in p, i.e. that

h(p) = h(1 − p)

This symmetry can also be seen in Figure 4.1. From the figure we also see that it has a maximum value at h(1/2) = 1. In the case of a coin flip, the maximum corresponds to a fair coin, where the uncertainty of the outcome is one bit. If the probability for Head increases, a natural initial guess is that the outcome will be Head, and the uncertainty decreases. Similarly, if the probability for Tail increases the uncertainty should decrease. At the end points, p = 0 and p = 1, the outcome is known and the uncertainty is zero, corresponding to h(0) = h(1) = 0.
Example 20 Let X be the outcome of a true dice. Then P(X = x) = 1/6, x = 1, 2, . . . , 6. The entropy is

H(X) = -\sum_x \frac{1}{6} \log \frac{1}{6} = \log 6 = 1 + \log 3 \approx 2.5850 \text{ bits}
[Figure 4.1: The binary entropy function h(p) = −p log p − (1 − p) log(1 − p), plotted for 0 ≤ p ≤ 1.]
The definition of the entropy also applies to vector-valued random variables, such as (X, Y) with the joint probability function p(x, y).

Definition 11 The joint entropy is the entropy for a pair of random variables with the joint distribution p(x, y),

H(X, Y) = E_{XY}\big[ -\log p(x, y) \big] = -\sum_x \sum_y p(x, y) \log p(x, y)

Clearly, in the general case with an N-dimensional vector the corresponding entropy function is

H(X_1, \ldots, X_N) = -\sum_{x_1, \ldots, x_N} p(x_1, \ldots, x_N) \log p(x_1, \ldots, x_N)
Example 21 Let X and Y be the outcomes from two independent true dice. Then the joint probability is P((X, Y) = (x, y)) = 1/36 and the joint entropy

H(X, Y) = -\sum_{x,y} \frac{1}{36} \log \frac{1}{36} = \log 36 = 2 \log 6 \approx 5.1699

We conclude that the uncertainty of the outcome of two dice is twice the uncertainty of one dice.
Instead, consider the sum of the dice, Z = X + Y. The probabilities are shown in the following table

  Z      2     3     4     5     6     7     8     9     10    11    12
  P(Z)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
The entropy of Z is

H(Z) = H\big( \tfrac{1}{36}, \tfrac{2}{36}, \tfrac{3}{36}, \tfrac{4}{36}, \tfrac{5}{36}, \tfrac{6}{36}, \tfrac{5}{36}, \tfrac{4}{36}, \tfrac{3}{36}, \tfrac{2}{36}, \tfrac{1}{36} \big)
  = -2 \frac{1}{36} \log \frac{1}{36} - 2 \frac{2}{36} \log \frac{2}{36} - 2 \frac{3}{36} \log \frac{3}{36}
    - 2 \frac{4}{36} \log \frac{4}{36} - 2 \frac{5}{36} \log \frac{5}{36} - \frac{6}{36} \log \frac{6}{36}
  = \cdots = \frac{23}{18} + \frac{5}{3} \log 3 - \frac{5}{18} \log 5 \approx 3.2744

The uncertainty of the sum of the dice is less than the uncertainty of the pair of individual outcomes. This makes sense, since several outcomes of the pair (X, Y) give the same sum Z.
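The entropies in these examples are easy to confirm with a small Python sketch; the helper H below is just an illustration of Definition 9:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention 0 log 0 = 0
    return -np.sum(p * np.log2(p))

print(H([1/6] * 6))                                   # one dice:  2.5850
print(H([1/36] * 36))                                 # two dice:  5.1699
pz = np.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36
print(H(pz))                                          # sum Z:     3.2744
```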
In (4.3) we saw that the entropy function is non-negative. To achieve an upper bound on it we first need an important result, sometimes named the IT-inequality.
Lemma 9 For every positive real number r
log(r) ≤ (r − 1) log(e)
with equality if and only if r = 1.
Proof: Graphically we consider the two functions y₁ = r − 1 and y₂ = ln r as shown in Figure 4.2. From this we conclude that ln r = r − 1 at the point r = 1. To formally show that in all other cases ln r < r − 1 we notice that the derivative of r − 1 is always 1. Then, the derivative of ln r is

\frac{d}{dr} \ln r = \frac{1}{r} \;\; \begin{cases} > 1, & r < 1 \;\Rightarrow\; \ln r < r - 1 \\ < 1, & r > 1 \;\Rightarrow\; \ln r < r - 1 \end{cases}

and we can conclude that ln r ≤ r − 1, with equality if and only if r = 1. Rewriting into the binary logarithm completes the proof.
[Figure 4.2: Graphical interpretation of the IT-inequality, showing the curves y₁ = r − 1 and y₂ = ln r, which touch only at r = 1.]
From what we have seen in the previous examples, and some intuition, we would guess that the maximum value of the entropy occurs when the outcomes have equal probabilities. Assume a random variable X has L outcomes, {x₁, . . . , x_L}, and that P(X = x_i) = 1/L for all of them. Then the entropy is

H(X) = -\sum_i \frac{1}{L} \log \frac{1}{L} = \log L

To show that this actually is a maximum value over all distributions we consider

H(X) - \log L = -\sum_x p(x) \log p(x) - \sum_x p(x) \log L
  = \sum_x p(x) \log \frac{1}{p(x) L}
  \le \sum_x p(x) \Big( \frac{1}{p(x) L} - 1 \Big) \log e
  = \Big( \sum_x \frac{1}{L} - \sum_x p(x) \Big) \log e = (1 - 1) \log e = 0

where the inequality follows from the IT-inequality with r = \frac{1}{p(x) L}, implying that we have equality if and only if p(x)L = 1. In other words, we have shown that

H(X) \le \log L \qquad (4.4)

with equality if and only if p(x) = \frac{1}{L}. Combining (4.3) and (4.4) we can state the following theorem.
Theorem 10 If X is a random variable with L outcomes, |X | = L, then
0 ≤ H(X) ≤ log L
with equality to the left if and only if there exists some i where p(xi ) = 1, and with equality to the
right if and only if p(xi ) = 1/L for all i = 1, 2, . . . , L.
We are now ready to define a conditional entropy. The base is the conditional probability,
say p(x|y). Here we still have two random variables that need to be averaged and we get
the following definition.
Definition 12 The conditional entropy for X conditioned on Y, with the joint probability p(x, y), is

H(X|Y) = E_{XY}\big[ -\log p(x|y) \big] = -\sum_{x,y} p(x, y) \log p(x|y)
Using the chain rule for probabilities, p(x, y) = p(x|y)p(y), the sum can be rewritten as

H(X|Y) = -\sum_x \sum_y p(x, y) \log p(x|y) = -\sum_x \sum_y p(x|y) p(y) \log p(x|y)
       = \sum_y p(y) \Big( -\sum_x p(x|y) \log p(x|y) \Big)

By introducing the entropy of X conditioned on the event Y = y,

H(X|Y = y) = -\sum_x p(x|y) \log p(x|y)

the conditional entropy can be derived as

H(X|Y) = \sum_y H(X|Y = y)\, p(y)
Example 22 The joint distribution of the random variables X and Y is given by

  p(x, y)   y = 0   y = 1
  x = 0     0       1/8
  x = 1     3/4     1/8

The marginal distributions of X and Y can be derived as p(x) = \sum_y p(x, y) and p(y) = \sum_x p(x, y), respectively. This yields

  x      0     1          y      0     1
  p(x)   1/8   7/8        p(y)   3/4   1/4

The individual entropies are

H(X) = h(1/8) ≈ 0.5436
H(Y) = h(1/4) ≈ 0.8113

and the joint entropy

H(X, Y) = H(0, 3/4, 1/8, 1/8) ≈ 1.0613
To calculate the conditional entropy H(X|Y) we first consider the conditional probabilities, derived by p(x|y) = \frac{p(x, y)}{p(y)},

  x     P(X = x | Y = 0)   P(X = x | Y = 1)
  0     0                  1/2
  1     1                  1/2

Therefore,

H(X|Y = 0) = h(0) = 0
H(X|Y = 1) = h(1/2) = 1
Putting things together we get the conditional entropy as

H(X|Y) = H(X|Y = 0) P(Y = 0) + H(X|Y = 1) P(Y = 1) = \frac{3}{4} h(0) + \frac{1}{4} h(1/2) = \frac{1}{4}
We can use the chain rule for probabilities again to achieve a corresponding chain rule for entropies. The joint entropy can be written as

H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y) = -\sum_{x,y} p(x, y) \log p(x|y) p(y)
        = -\sum_{x,y} p(x, y) \log p(x|y) - \sum_y p(y) \log p(y)
        = H(X|Y) + H(Y) \qquad (4.5)

Rewriting the result we get H(X|Y) = H(X, Y) − H(Y). That is, the conditional entropy is the remaining uncertainty of the pair (X, Y) when Y is known. A slightly more general version of (4.5) can be stated as the chain rule for entropies in the following theorem.
Theorem 11 Let X₁, . . . , X_N be an N-dimensional random variable drawn according to p(x₁, . . . , x_N). Then the chain rule for entropies states that

H(X_1, \ldots, X_N) = \sum_{i=1}^{N} H(X_i | X_1, \ldots, X_{i-1})

Example 23 [Cont'd from Example 22] The joint entropy can alternatively be derived as

H(X, Y) = H(X|Y) + H(Y) = \frac{1}{4} + h(1/4) = \frac{9}{4} - \frac{3}{4} \log 3 \approx 1.0613
4.3 Mutual Information
The entropy was obtained by averaging the self information for a random variable. Similarly, the average mutual information achieved about X when observing Y can be defined as follows.
Definition 13 The average mutual information between the random variables X and Y is defined as

I(X; Y) = E_{X,Y}\Big[ \log \frac{p(x|y)}{p(x)} \Big] = \sum_{x,y} p(x, y) \log \frac{p(x|y)}{p(x)} \qquad (4.6)
Utilizing that

\frac{p(x|y)}{p(x)} = \frac{p(x, y)}{p(x) p(y)}

an alternative definition can be made by considering the ratio between the joint and the marginal probabilities,

I(X; Y) = E_{X,Y}\Big[ \log \frac{p(x, y)}{p(x) p(y)} \Big] = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \qquad (4.7)

This is a measure of how strongly the two variables are connected. From (4.7), it is clear that the mutual information is a symmetric measure,

I(X; Y) = I(Y; X)
Breaking up the logarithm in (4.6) or (4.7), it is possible to derive the mutual information from the entropies as

I(X; Y) = E_{X,Y}\Big[ \log \frac{p(x, y)}{p(x) p(y)} \Big]
        = E_{X,Y}\big[ \log p(x, y) - \log p(x) - \log p(y) \big]
        = E_{X,Y}\big[ \log p(x, y) \big] - E_X\big[ \log p(x) \big] - E_Y\big[ \log p(y) \big]
        = H(X) + H(Y) - H(X, Y)
        = H(X) - H(X|Y)
        = H(Y) - H(Y|X)

where in the last two equalities the chain rule was used.
Example 24 [Cont'd from Example 22] The mutual information can be derived as

I(X; Y) = H(X) + H(Y) - H(X, Y) \approx 0.5436 + 0.8113 - 1.0613 = 0.2936

Alternatively, we can use

I(X; Y) = H(X) - H(X|Y) \approx 0.5436 - 1/4 = 0.2936
The entropy and the mutual information are two very important measures of information. The entropy states how much information is needed to determine the outcome of a random variable. It will be shown later that this is a limit of how many bits are needed on average to describe the variable. In other words, this is a limit of how much a source can be compressed without any data being lost. The mutual information, on the other hand, describes the amount of information achieved about the variable X by observing the variable Y. In a communication system a symbol, X, is transmitted over a channel. The received symbol Y, which is a distorted version of X, is used by a receiver to estimate X. The mutual information is a measure of how much information can be transmitted over this channel. It will lead to the concept of the channel capacity.
To get more knowledge about these quantities we introduce the relative entropy. It was first introduced by Kullback and Leibler [12].
Definition 14 Given two probability distributions p(x) and q(x) for the same sample set X. The relative entropy, or Kullback-Leibler distance, is defined as

D(p||q) = E_p[ log p(x)/q(x) ] = ∑_x p(x) log p(x)/q(x)
Example 25 Consider a binary random variable, X ∈ {0, 1}, where we set up two distributions. First we assume that the values are equally probable,
p(0) = p(1) = 1/2
and, secondly, we assume a skew distribution,
q(0) = 1/4 and q(1) = 3/4
The relative entropy from p to q is then

D(p||q) = (1/2) log (1/2)/(1/4) + (1/2) log (1/2)/(3/4)
        = ··· = 1 − (1/2) log 3 ≈ 0.2075
On the other hand, if we consider the relative entropy from q to p we get

D(q||p) = (1/4) log (1/4)/(1/2) + (3/4) log (3/4)/(1/2)
        = ··· = (3/4) log 3 − 1 ≈ 0.1887
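The two relative entropies in Example 25 are easy to verify numerically. A minimal sketch, assuming base-2 logarithms as in the rest of the chapter; the helper name `kl_divergence` is my own.

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [1/2, 1/2]
q = [1/4, 3/4]

print(kl_divergence(p, q))   # 1 - (1/2) log2(3) ~ 0.2075
print(kl_divergence(q, p))   # (3/4) log2(3) - 1 ~ 0.1887
```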
That is, the relative entropy is not a symmetric measure, and cannot be treated as a distance in the normal sense. However, when talking about optimal source coding, we will see that it is natural to view the relative entropy as a distance from one distribution to another. This is the reason for the alternative name Kullback-Leibler distance.
The relative entropy was introduced as a generalized information measure. We can see that the mutual information as defined earlier can be expressed as a special case of the relative entropy as

I(X; Y) = E[ log p(x, y)/( p(x)p(y) ) ] = D( p(x, y) || p(x)p(y) )

The mutual information is the information distance from the joint distribution to the independent case, i.e. the information distance corresponding to the relation between X and Y.
Another aspect of the relative entropy to consider is the relationship with the entropy
function. Consider a random variable with the possible outcomes in the set X with cardinality |X | = L and probability distribution p(x), x ∈ X . Let u(x) = 1/L be the uniform
distribution for the same set of outcomes. Then,

H(X) = −∑_x p(x) log p(x) = log L − ∑_x p(x) log p(x)L
     = log L − ∑_x p(x) log p(x)/u(x) = log L − D(p||u)          (4.8)
where we see that the relative entropy from p(x) to u(x) is the difference between the maximum value of the entropy and the entropy based on the true distribution. Since the maximum value is achieved by the uniform distribution, the relative entropy is some sort of measure of how much p(x) diverges from the uniform distribution.
The relative entropy can be shown to never take negative values. Here we will use the IT-inequality to show this fact.

−D(p||q) = −∑_x p(x) log p(x)/q(x) = ∑_x p(x) log q(x)/p(x)
         ≤ ∑_x p(x) ( q(x)/p(x) − 1 ) log e
         = ( ∑_x q(x) − ∑_x p(x) ) log e = (1 − 1) log e = 0

with equality when q(x)/p(x) = 1, i.e. when p(x) = q(x), for all x. Alternatively, Jensen's inequality can be used to show the same result. We can express the result in a theorem.
Theorem 12 Given two probability distributions p(x) and q(x) for the same sample set X. Then the relative entropy is non-negative,

D(p||q) ≥ 0

with equality if and only if p(x) = q(x) for all x.
Since the mutual information can be expressed as the relative entropy, we directly get the
corollary below.
Corollary 13 For any two random variables X and Y
I(X; Y ) ≥ 0
with equality if and only if they are independent.
Since we can write the mutual information as

I(X; Y) = H(X) − H(X|Y)

and this difference is non-negative, we conclude the following theorem.
Theorem 14 For any two random variables X and Y
H(X|Y ) ≤ H(X)
with equality if and only if they are independent.
Intuitively, this means that the uncertainty will, on average, not increase by observing some side information. If the two variables are independent, we will have the same uncertainty as before. Using this result together with the chain rule for entropy, Theorem 11, we get the next result.
Theorem 15 Let X1, X2, ..., Xn be an n-dimensional random variable drawn according to p(x1, x2, ..., xn). Then

H(X1, X2, ..., Xn) ≤ ∑_{i=1}^{n} H(Xi)

with equality if and only if all Xi are independent.
That is, the uncertainty is minimized when the random vector is considered as a whole, instead of as individual variables. In other words, we should take the relationships between the variables into account when minimizing the uncertainty.
4.3.1 Convexity of information measures
In Definition 4 the terminology of convex functions was defined. In areas like optimization this is an important property, since it makes it much easier to find a global optimum. Here we will show that the information measures we have defined are convex (or concave) functions. We will begin with the relative entropy, which will give the base for the other functions.
Our previous definition of a convex function only covers one-dimensional functions. Therefore, we need to start by generalizing the definition. A straightforward way is to require that the function is convex along every straight line between two points in its domain. This means that for a two-dimensional function we get a surface that resembles a bowl. In equation form we write the condition for convexity of a two-dimensional function g(x, y) as

g( λ(x1, y1) + (1 − λ)(x2, y2) ) = g( λx1 + (1 − λ)x2, λy1 + (1 − λ)y2 )
                                 ≤ λ g(x1, y1) + (1 − λ) g(x2, y2)
The relative entropy can now be written as

D( λp1 + (1 − λ)p2 || λq1 + (1 − λ)q2 )
  = ∑_x ( λp1(x) + (1 − λ)p2(x) ) log [ ( λp1(x) + (1 − λ)p2(x) ) / ( λq1(x) + (1 − λ)q2(x) ) ]
  ≤ ∑_x λp1(x) log [ λp1(x) / λq1(x) ] + ∑_x (1 − λ)p2(x) log [ (1 − λ)p2(x) / ( (1 − λ)q2(x) ) ]
  = λ ∑_x p1(x) log p1(x)/q1(x) + (1 − λ) ∑_x p2(x) log p2(x)/q2(x)
  = λ D(p1||q1) + (1 − λ) D(p2||q2)

where the inequality is a direct application of the log-sum inequality in Theorem 7. Hence, we see that the relative entropy is a convex function in the pair (p, q).
From (4.8) we know that the entropy can be expressed as H_p(X) = log L − D(p||u), where u is the uniform probability function and p the distribution used for calculating the entropy. Then

H_{λp1+(1−λ)p2}(X) = log L − D( λp1 + (1 − λ)p2 || u )
                   ≥ log L − λ D(p1||u) − (1 − λ) D(p2||u)
                   = λ( log L − D(p1||u) ) + (1 − λ)( log L − D(p2||u) )
                   = λ H_{p1}(X) + (1 − λ) H_{p2}(X)

where the inequality follows from the convexity of the relative entropy. We see here that the entropy is a concave function in p.
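The convexity of D(p||q) in the pair (p, q) and the concavity of H(X) in p can also be probed numerically. The sketch below samples random distribution pairs and checks both inequalities; it is only an illustration of the statements above, not a proof, and the parameter choices are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def D(p, q):
    return np.sum(p * np.log2(p / q))

L, lam = 4, 0.3
for _ in range(1000):
    # Random strictly positive distributions over L outcomes.
    p1, p2, q1, q2 = (rng.dirichlet(np.ones(L)) + 1e-9 for _ in range(4))
    p1, p2, q1, q2 = (v / v.sum() for v in (p1, p2, q1, q2))

    # Convexity of the relative entropy in the pair (p, q).
    lhs = D(lam*p1 + (1-lam)*p2, lam*q1 + (1-lam)*q2)
    rhs = lam*D(p1, q1) + (1-lam)*D(p2, q2)
    assert lhs <= rhs + 1e-9

    # Concavity of the entropy in p.
    assert H(lam*p1 + (1-lam)*p2) >= lam*H(p1) + (1-lam)*H(p2) - 1e-9

print("all checks passed")
```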
4.4 Entropy of sequences
When considering random processes and information theoretic measures it is natural
to start with a well known lemma based on a three state Markov chain. It states that
the information cannot increase by data processing, neither pre nor post. It can only be
transformed to another representation (or be destroyed). In information theoretic terms
it can be stated as the next lemma.
Lemma 16 (Data Processing Lemma) If the random variables X, Y and Z form a Markov
chain, X → Y → Z, we have
I(X; Z) ≤ I(X; Y )
I(X; Z) ≤ I(Y ; Z)
For the considered Markov chain we have that, conditioned on Y , X and Z are independent, i.e.
P (XZ|Y ) = P (X|Y )P (Z|XY ) = P (X|Y )P (Z|Y )
where the second equality comes from the Markov condition. Starting with the first inequality of the theorem,
I(X; Z) = H(X) − H(X|Z) ≤ H(X) − H(X|Y Z) = H(X) − H(X|Y ) = I(X; Y )
Similarly, the second inequality comes from
I(X; Z) = H(Z) − H(Z|X) ≤ H(Z) − H(Z|XY ) = H(Z) − H(Z|Y ) = I(Z; Y )
which concludes the lemma.
The entropy function is an important measure of the amount of information stored in a random variable. In the case when we are considering a random process and a sequence of random variables with some sort of correlation in between, we need a more general definition. In this section we will introduce two natural generalizations of the entropy function, and show that they are in fact equivalent. This function can in many cases be used in the same way for a random process as the entropy is used for a random variable.
A natural way to define the entropy per symbol for a sequence is by treating the sequence as a multi-dimensional random variable and averaging over the number of symbols. If the length of the sequence tends to infinity we get the following definition.
Definition 15 The entropy rate of a stochastic process is

H∞(X) = lim_{n→∞} (1/n) H(X1 X2 ... Xn)
To see that this is indeed a generalization of the entropy measure we have from before, we consider a sequence of i.i.d. variables in the next example.
Example 26 Consider a sequence of i.i.d. random variables with entropy H(X). Then the entropy rate becomes the entropy function as defined earlier,

H∞(X) = lim_{n→∞} (1/n) H(X1 ... Xn) = lim_{n→∞} (1/n) ∑_i H(Xi | X1 ... X_{i−1})
      = lim_{n→∞} (1/n) ∑_i H(Xi) = lim_{n→∞} (1/n) H(X) ∑_i 1 = H(X)
We can also define an alternative entropy rate for a stochastic process, where we consider the entropy of the nth variable in the sequence, conditioned on all the previous ones. As n → ∞ we get

H^∞(X) = lim_{n→∞} H(Xn | X1 X2 ... X_{n−1})

To see how this variant relates to the entropy rate in Definition 15 we use the chain rule,

(1/n) H(X1 ... Xn) = (1/n) ∑_i H(Xi | X1 ... X_{i−1})

The right hand side is the arithmetic mean of H(Xi | X1 ... X_{i−1}), and the arithmetic mean of a converging sequence converges to the same limit. Hence, asymptotically as the length of the sequence grows to infinity the two definitions of the entropy rate are equal,

H∞(X) = H^∞(X)
Considering a stationary (time-invariant) random process, it can be seen that

H(Xn | X1 ... X_{n−1}) ≤ H(Xn | X2 ... X_{n−1}) = H(X_{n−1} | X1 ... X_{n−2})

where the inequality follows since conditioning cannot increase the entropy and the equality follows from the stationarity. Hence H(Xn | X1 ... X_{n−1}) is a decreasing function in n. Repeating the argument down to n = 1 gives

H(Xn | X1 ... X_{n−1}) ≤ ··· ≤ H(X2 | X1) ≤ H(X1) = H(X) ≤ log |X|
Finally, since the entropy is a non-negative function we can state the following theorem.
Theorem 17 For a stationary stochastic process the entropy rate is bounded by
0 ≤ H∞ (X) ≤ H(X) ≤ log |X |
In Figure 4.3 the relation between log |X|, H(X), H(Xn | X1 ... X_{n−1}) and H∞(X) is shown.

Figure 4.3: The relation between H∞(X) and H(X). As n grows, H(Xn | X1 ... X_{n−1}) decreases from H(X) ≤ log |X| towards the entropy rate H∞(X).
So far the entropy rate has been treated for the class of stationary random processes. If the subclass of Markov chains is considered it is possible to say more about how to derive it. For the conditional entropy, the Markov condition can be written as H(Xn | X1 ... X_{n−1}) = H(Xn | X_{n−1}). Then, the entropy rate can be derived as

H∞(X) = lim_{n→∞} H(Xn | X1 ... X_{n−1}) = lim_{n→∞} H(Xn | X_{n−1}) = H(X2 | X1)
      = −∑_{i,j} P(X1 = xi, X2 = xj) log P(X2 = xj | X1 = xi)
      = −∑_i P(X1 = xi) ∑_j P(X2 = xj | X1 = xi) log P(X2 = xj | X1 = xi)
      = ∑_i H(X2 | X1 = xi) P(X1 = xi)                                          (4.9)
where

H(X2 | X1 = xi) = −∑_j P(X2 = xj | X1 = xi) log P(X2 = xj | X1 = xi)

In (4.9) the transition probability is given by the state transition matrix

P = [ pij ] = [ P(X2 = xj | X1 = xi) ],   xi, xj ∈ X

and the stationary distribution by wi = P(X1 = xi). In the terminology of the Markov process we get the following theorem.
Theorem 18 For a stationary Markov chain with stationary distribution w and transition matrix P = [pij], the entropy rate can be derived as

H∞(X) = ∑_i wi H(X2 | X1 = xi)

where

H(X2 | X1 = xi) = −∑_j pij log pij
Example 27 The Markov chain shown in Figure 3.4 has the state transition matrix

P = [ 1/3  2/3   0
      1/4   0   3/4
      1/2  1/2   0  ]

In Example 14 the steady state distribution was calculated as

w = ( 15/43  16/43  12/43 )

The conditional entropies are the entropies for each row in P,

H(X2 | X1 = s1) = h(1/3) = log 3 − 2/3
H(X2 | X1 = s2) = h(1/4) = 2 − (3/4) log 3
H(X2 | X1 = s3) = h(1/2) = 1

and the entropy rate becomes

H∞(X) = w1 H(X2 | X1 = s1) + w2 H(X2 | X1 = s2) + w3 H(X2 | X1 = s3)
      = (15/43) h(1/3) + (16/43) h(1/4) + (12/43) h(1/2)
      = (3/43) log 3 + 34/43 ≈ 0.9013 bit/symbol
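The steady-state distribution and the entropy rate in Example 27 can be recomputed with a few lines of linear algebra. This is a sketch of one possible approach (a left eigenvector of P for eigenvalue 1); the variable names are my own.

```python
import numpy as np

P = np.array([[1/3, 2/3, 0.0],
              [1/4, 0.0, 3/4],
              [1/2, 1/2, 0.0]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum 1.
eigvals, eigvecs = np.linalg.eig(P.T)
w = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
w = w / w.sum()
print(w)                      # ~ (15/43, 16/43, 12/43) = (0.3488, 0.3721, 0.2791)

# Entropy rate: the row entropies of P averaged with the stationary weights.
row_entropy = [-sum(p * np.log2(p) for p in row if p > 0) for row in P]
print(float(np.dot(w, row_entropy)))   # ~ 0.9013 bit/symbol
```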
4.5 Random Walk
To be done.
Consider an undirected weighted graph where Wij is the weight of the edge joining nodes xi and xj. Let

Wi = ∑_j Wij      and      W = ∑_{i,j: i<j} Wij

denote the sum of the weights leaving node xi, and the total weight, respectively. The probability of using the edge from node xi to node xj is pij = Wij/Wi.
• The stationary distribution is

  μi = Wi / 2W

• The entropy rate is

  H∞(X) = H( ..., Wij/2W, ... ) − H( ..., Wi/2W, ... )

• If all non-zero edges have Wij = 1,

  H∞(X) = log(2W) − H( ..., Wi/2W, ... )
Example 28 Consider an undirected weighted graph with four nodes x1, x2, x3, x4 and edges x1–x2 (weight 1), x1–x3 (weight 2), x1–x4 (weight 1), x2–x3 (weight 1) and x3–x4 (weight 1).
The conditional probabilities are

pij = Wij / ∑_j Wij

The state transition matrix is

P = [  0   1/4  2/4  1/4
      1/2   0   1/2   0
      2/4  1/4   0   1/4
      1/2   0   1/2   0  ]

With W = 6 and W1 = 4, W2 = 2, W3 = 4, W4 = 2 we get the stationary distribution

μ = ( 4/12  2/12  4/12  2/12 ) = ( 1/3  1/6  1/3  1/6 )

and the entropy rate

H∞(X) = H( 1/12, 2/12, 1/12, 1/12, 1/12, 1/12, 2/12, 1/12, 1/12, 1/12 ) − H( 4/12, 2/12, 4/12, 2/12 )
      ≈ 3.25 − 1.92 = 1.33
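For Example 28 the weight-based entropy-rate formula can be cross-checked against the general Markov-chain expression of Theorem 18. A minimal sketch; the symmetric weight matrix W below encodes the graph as reconstructed above, and the helper name H is my own.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Symmetric edge weights W[i, j] between nodes x1..x4, as reconstructed above.
W = np.array([[0, 1, 2, 1],
              [1, 0, 1, 0],
              [2, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

Wi = W.sum(axis=1)            # weight leaving each node: 4, 2, 4, 2
Wtot = W.sum() / 2            # total weight W = 6
mu = Wi / (2 * Wtot)          # stationary distribution (1/3, 1/6, 1/3, 1/6)

# Entropy rate via the weight formula ...
H_rate = H(W[W > 0] / (2 * Wtot)) - H(mu)

# ... and via the general Markov-chain formula of Theorem 18, as a cross-check.
P = W / Wi[:, None]
H_rate_markov = sum(mu[i] * H(P[i]) for i in range(len(mu)))

print(mu, H_rate, H_rate_markov)   # both entropy rates ~ 1.33 bits
```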
Chapter 5
Source Coding Theorem
Text...
5.1 Asymptotic Equipartition Property
The Asymptotic Equipartition Property (AEP) is a very useful tool in information theory. The basic idea is that a series of i.i.d. events is viewed as a vector of events. The probability of each vector becomes very small as the length grows, so it is pointless to consider the probabilities of the individual vectors. Instead it is the probability for the distribution of events within the vector that is important. From the law of large numbers we can see that only a fraction of the possible outcomes are the probable ones. To see this, consider 100 consecutive tosses with a fair coin. The probability of each of the length-100 vectors is

P(x1, ..., x100) = (1/2)^100 ≈ 8 · 10^−31

So, the probability for each of the vectors in the outcome is very small and it is not very likely that the vector with 100 Heads will occur. However, since there are C(100, 50) ≈ 10^29 vectors with equally many Heads and Tails, the probability of getting such a vector is about

P(50 Heads, 50 Tails) = 2^−100 C(100, 50) ≈ 0.080

which is relatively high. To conclude, it is most likely that the outcome of 100 tosses with a fair coin will result in approximately the same numbers of Heads and Tails. This is in fact a consequence of the weak law of large numbers, which we recapitulate here.
Theorem 19 (The weak law of large numbers) Let X1, X2, ..., Xn be i.i.d. random variables with mean E[X]. Then,

(1/n) ∑_i Xi →(p) E[X]

where →(p) denotes convergence in probability.
The notation that it converges in probability can also be expressed as

lim_{n→∞} P( | (1/n) ∑_i Xi − E[X] | < ε ) = 1

for any ε > 0. This means that the arithmetic mean of several i.i.d. random variables approaches the expected value of the variables. For an example of the weak law of large numbers, see Example 3.
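The convergence in Theorem 19 is easy to visualize by simulation. The sketch below estimates the probability that the sample mean of fair coin flips deviates from 1/2 by more than ε for growing n; the parameter values are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.05
for n in (10, 100, 1000, 10000):
    # 2000 independent experiments, each averaging n fair coin flips (0 or 1).
    means = rng.integers(0, 2, size=(2000, n)).mean(axis=1)
    p_dev = np.mean(np.abs(means - 0.5) >= eps)
    print(n, p_dev)   # the deviation probability shrinks towards zero as n grows
```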
Consider instead the logarithmic probability for a vector of length n, consisting of i.i.d. random variables. By the weak law of large numbers we can conclude that

−(1/n) log p(x) = (1/n) ∑_i ( −log p(xi) ) →(p) H(X)

or, equivalently, that

lim_{n→∞} P( | −(1/n) log p(x) − H(X) | < ε ) = 1

for any ε > 0. That is, for all vectors that can happen, the mean logarithmic probability for the random variable approaches the entropy. For a finite vector of length n we can have a closer look at those vectors that fulfill this criterion. They are called typical sequences and are defined in the next definition.
Definition 16 (AEP) The set of ε-typical sequences Aε(X) is the set of all n-dimensional vectors x = (x1, x2, ..., xn) ∈ X^n such that

| −(1/n) log p(x) − H(X) | ≤ ε          (5.1)
It is possible to rewrite (5.1) as follows

−ε ≤ −(1/n) log p(x) − H(X) ≤ ε
H(X) − ε ≤ −(1/n) log p(x) ≤ H(X) + ε
−n( H(X) + ε ) ≤ log p(x) ≤ −n( H(X) − ε )
2^{−n(H(X)+ε)} ≤ p(x) ≤ 2^{−n(H(X)−ε)}

This is the base for an alternative definition of the AEP.

Definition 17 (AEP, Alternative definition) The ε-typical sequences can be defined as the set of vectors x such that

2^{−n(H(X)+ε)} ≤ p(x) ≤ 2^{−n(H(X)−ε)}          (5.2)
Both definitions of the AEP are frequently used in the literature. Which one is taken as the main definition differs, but quite often both are presented. In the next example we try to show what is meant by ε-typical sequences.
Example 29 Consider a binary 5-dimensional vector with i.i.d. elements where pX(0) = 1/3 and pX(1) = 2/3. Then the entropy for each symbol is H(X) = h(1/3) = 0.918. In the following table all vectors are listed together with their respective probabilities.
x       p(x)          x       p(x)          x       p(x)
00000   0.0041        01011   0.0329 *      10110   0.0329 *
00001   0.0082        01100   0.0165        10111   0.0658 *
00010   0.0082        01101   0.0329 *      11000   0.0165
00011   0.0165        01110   0.0329 *      11001   0.0329 *
00100   0.0082        01111   0.0658 *      11010   0.0329 *
00101   0.0165        10000   0.0082        11011   0.0658 *
00110   0.0165        10001   0.0165        11100   0.0329 *
00111   0.0329 *      10010   0.0165        11101   0.0658 *
01000   0.0082        10011   0.0329 *      11110   0.0658 *
01001   0.0165        10100   0.0165        11111   0.1317
01010   0.0165        10101   0.0329 *
As expected the all-zero vector is the least probable vector, while the all-one vector is the most likely. However, even the most likely vector is not very likely to happen, with probability 0.1317. Still, if we were to pick one single vector as a guess for the outcome, this is the one. But since the elements are i.i.d. it can be argued that the order of the elements is unimportant. If we instead were to guess the type of vector, meaning the number of ones and zeros, the answer would be different. The probability for a vector containing k ones and 5 − k zeros is

P( k 1s, (5 − k) 0s ) = C(n, k) (2/3)^k (1/3)^{n−k} = C(n, k) 2^k / 3^n
which we saw already in Chapter 3. Viewing these numbers in a table we get (in Chapter 3 it is also plotted in a diagram),
k                                        0        1        2        3        4        5
P(#1s in x = k) = C(5, k) 2^k / 3^5   0.0041   0.0412   0.1646   0.3292   0.3292   0.1317
Here it is clear that the most likely vector, the all-one vector, does not belong to the most likely type of vector. When guessing the number of ones, it is more likely to get 3 or 4 ones. This is of course due to the fact that there are more vectors that fulfill this criterion than the single all-one vector. So, this concludes that vectors with 3 or 4 ones are sort of "typical".
The question is then how this relates to the previous definitions of typical sequences. To see this, choose an ε. Here we use 15% of the entropy, which is ε = 0.138,

2^{−n(H(X)+ε)} = 2^{−5(h(1/3)+0.138)} ≈ 0.0257
2^{−n(H(X)−ε)} = 2^{−5(h(1/3)−0.138)} ≈ 0.0669

Then, the ε-typical sequences are the ones with probability between these two numbers. In the table these are marked with a *. Luckily these are the same vectors as we concluded should be the typical ones.
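The table in Example 29 can be regenerated by enumerating all 32 vectors and applying Definition 17 directly. A minimal sketch; it reproduces the probabilities, the bounds 0.0257 and 0.0669, and the 15 ε-typical vectors.

```python
import itertools
import math

p0, p1 = 1/3, 2/3
n = 5
H = -(p0 * math.log2(p0) + p1 * math.log2(p1))   # h(1/3) ~ 0.918
eps = 0.15 * H                                    # ~ 0.138

lower = 2 ** (-n * (H + eps))                     # ~ 0.0257
upper = 2 ** (-n * (H - eps))                     # ~ 0.0669

typical = []
for x in itertools.product([0, 1], repeat=n):
    p = p1 ** sum(x) * p0 ** (n - sum(x))
    if lower <= p <= upper:
        typical.append(x)

print(len(typical))                               # 15 vectors, all with 3 or 4 ones
print(sum(p1 ** sum(x) * p0 ** (n - sum(x)) for x in typical))   # their total probability ~ 0.66
```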
In the previous example we saw that there is a set of typical sequences, and that, judging from their contents, they are the most likely type of sequence to occur. In this example we used a very short vector to be able to list all outcomes, but for longer sequences it can be seen that the ε-typical sequences are only a small fraction of all the sequences. On the other hand, it can also be seen that the probability for a random sequence to belong to the typical sequences is close to one. More formally, we can state the following theorem.
Theorem 20 Consider sequences of length n of i.i.d. random variables. For each ε there exists an integer n0 such that, for each n > n0, the set of ε-typical sequences, A_ε^{(n)}(X), fulfills

P( x ∈ A_ε^{(n)}(X) ) ≥ 1 − ε          (5.3)

(1 − ε) 2^{n(H(X)−ε)} ≤ |A_ε^{(n)}(X)| ≤ 2^{n(H(X)+ε)}          (5.4)
The first part of the theorem, (5.3), is a direct consequence of the law of large numbers, stating that −(1/n) log p(x) approaches H(X) as n grows. That means that there exists an n0, such that for all n ≥ n0

P( | −(1/n) log p(x) − H(X) | < ε ) ≥ 1 − δ

for any δ between zero and one. Letting δ = ε gives

P( | −(1/n) log p(x) − H(X) | < ε ) ≥ 1 − ε

which is equivalent to (5.3). This shows that the probability for an arbitrary sequence to belong to the typical set approaches one as n grows.
To show the second part, that the number of ε-typical sequences is bounded as in (5.4), we need to split the equation in two parts. Starting with the left hand inequality we have, for large enough n0,

1 − ε ≤ P( x ∈ A_ε^{(n)}(X) ) = ∑_{x ∈ A_ε^{(n)}(X)} p(x)
      ≤ ∑_{x ∈ A_ε^{(n)}(X)} 2^{−n(H(X)−ε)} = |A_ε^{(n)}(X)| 2^{−n(H(X)−ε)}
where we in the second inequality used the alternative definition of the AEP. The right hand side can be shown by

1 = ∑_{x ∈ X^n} p(x) ≥ ∑_{x ∈ A_ε^{(n)}(X)} p(x)
  ≥ ∑_{x ∈ A_ε^{(n)}(X)} 2^{−n(H(X)+ε)} = |A_ε^{(n)}(X)| 2^{−n(H(X)+ε)}

which completes the theorem. To see what the theorem means we consider longer sequences than in the previous example.
Example 30 Let X^n = {x} be the set of length n vectors of i.i.d. binary random variables with p(0) = 1/3 and p(1) = 2/3. Let ε be 5% of the entropy H(X) = h(1/3), i.e. ε = 0.046. Then the number of ε-typical sequences and their bounding functions are given in the next table for n = 100, n = 500 and n = 1000. As a comparison, the fraction of ε-typical sequences compared to the total number of sequences is also shown.
n       (1 − ε) 2^{n(H(X)−ε)}   |A_ε^{(n)}(X)|    2^{n(H(X)+ε)}    |A_ε^{(n)}(X)| / |X^n|
100     1.17 · 10^26            7.51 · 10^27      1.05 · 10^29     5.9 · 10^−3
500     1.90 · 10^131           9.10 · 10^142     1.34 · 10^145    2.78 · 10^−8
1000    4.16 · 10^262           1.00 · 10^287     1.79 · 10^290    9.38 · 10^−15
This table shows that the ε-typical sequences only constitute a fraction of the total number
of sequences.
Next, the probability for the ε-typical sequences is given together with the probability for the most probable sequence, the all-one sequence. Here it can be clearly seen that the most likely sequence has a very low probability, and is in fact very unlikely to happen. Instead, the most likely event is that a random sequence is taken from the typical sequences.
n       P( A_ε^{(n)}(X) )    P( x = 11...1 )
100     0.660                2.4597 · 10^−18
500     0.971                9.0027 · 10^−89
1000    0.998                8.1048 · 10^−177
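For the block lengths in Example 30 the vectors can no longer be enumerated, but since the source is i.i.d. the typicality of a vector depends only on its number of ones k. The sketch below recomputes P(A_ε^{(n)}(X)) and |A_ε^{(n)}(X)| by summing over the typical values of k; small deviations from the table may occur depending on how ε and the boundary cases are rounded.

```python
import math

p0, p1 = 1/3, 2/3
H = -(p0 * math.log2(p0) + p1 * math.log2(p1))
eps = 0.05 * H

for n in (100, 500, 1000):
    prob_typical = 0.0
    count_typical = 0
    for k in range(n + 1):
        # -(1/n) log2 p(x) is the same for every vector with k ones
        self_info = -(k * math.log2(p1) + (n - k) * math.log2(p0)) / n
        if abs(self_info - H) <= eps:
            prob_typical += math.comb(n, k) * p1**k * p0**(n - k)
            count_typical += math.comb(n, k)
    print(n, round(prob_typical, 3), f"{float(count_typical):.2e}")
```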
5.2 Source Coding Theorem
In Figure 5.1 a block diagram of a communication system is shown. One of the results from information theory is that the sequence generated from a source often contains redundancy. The source encoder is intended to remove the redundancy and use a minimum size representation of the information. The problem is that as the redundancy is removed the information becomes very vulnerable to disturbances in the transmission. To circumvent this, new redundant data is added in a controlled way by the channel encoder. If errors occur during transmission, the channel decoder will be able to detect and/or correct most of them. Finally, the source decoder decompresses the received sequence back to the original. If the channel decoder manages to correct all errors that occurred on the channel, i.e. Ŷ = Y, the source decoder should implement the inverse of the compression function. Shannon showed that the source coding and the channel coding can be performed independently of each other. Therefore, in this section only the source encoding and decoding will be treated.
Figure 5.1: Shannon's model for a communication system.
One of the questions that information theory tries to answer is how much information is stored in a set of data. A typical example is text, which contains redundancy that allows us to correct spelling errors or misprints. In written English it is often possible to cross out every third letter and still be able to read the text. The possible compression that can be done for a sequence shows how much information it contains. Earlier, we considered the entropy as a measure of the uncertainty, which is the amount of information needed to determine the value. We will see in this section that the entropy is in fact a measure of the possible compression of a source.
In Figure 5.2 the model used for source encoding and decoding is shown. The sequence from the source is fed to the source encoder where the redundancy is removed. The source decoder is the inverse function and reconstructs the source data. In the figure X is an n-dimensional vector from the alphabet X. The length n of the vector is considered to be a random variable and the average length is E[n]. The corresponding compressed sequence, Y, is an ℓ-dimensional vector from the alphabet Y. Also ℓ is a random variable, so the average length of the codeword is E[ℓ]. The decoder estimates the source vector as X̂. When working with lossless source coding, as in this section, it is required that the original message is perfectly reconstructed, X̂ = X. This is the case for compression algorithms working on e.g. texts and also some image processing. However, in many cases for media coding, such as image, video and speech coding, lossy algorithms are used. The compression is much better in those, but the price paid is that the original message cannot be reconstructed exactly, i.e. X̂ ≈ X.
The compression rate is defined as

R = E[ℓ] / E[n]

If the alphabet sizes are equal this means there is a compression if R is less than one. The reason to view n and ℓ as random variables is that, to have compression, either the length of the source vector, the code vector or both must vary. If both lengths were fixed for all messages there would not be any compression. In the next definition the above reasoning is formalized.
Figure 5.2: Block model for a source coding system (Source → X → Encoder → Y → Decoder → X̂).
Definition 18 (Not too general) A source code is a mapping from a finite source vector x ∈ X^n to a finite vector of variable length ℓ, y = (y1 y2 ... yℓ), where the symbols yi ∈ D = Z_D are drawn from a D-ary alphabet. The length of the codeword corresponding to x is denoted ℓ(x).
Naturally the codeword length is a good measure of the efficiency of a chosen code. This is defined as

L = E[ℓ(x)] = ∑_{x ∈ X^n} p(x) ℓ(x)
To derive the source coding theorem, stating that there exists a source code where the average codeword length approaches the entropy of the source, we first need to set up an encoding rule. We saw in the previous section that the typical sequences are only a small fraction of all sequences, but they are the ones most likely to happen. Assuming then that the non-typical sequences almost never happen, we can concentrate on the typical ones. Starting with a list of all sequences, it can be partitioned in two parts, one with the typical and one with the non-typical sequences. To construct the codewords we use a binary prefix stating which set the sequence belongs to. Let us say we use 0 for the typical sequences and 1 for the non-typical. Each of the sets is listed and indexed by binary vectors. For the set of typical sequences the index vector will be of length ⌈log |A_ε^{(n)}(X)|⌉. The non-typical sequences only happen rarely, so for them we can afford an index that enumerates all |X|^n sequences. The following two tables show the idea of the look-up tables.
Typical, prefix = 0            Non-typical, prefix = 1
x     index vector             x     index vector
x0    0...00                   xa    0......00
x1    0...01                   xb    0......01
...   ...                      ...   ...
The same table look-up can be used by the decoding algorithm. The number of typical sequences is

|A_ε^{(n)}(X)| ≤ 2^{n(H(X)+ε)}

and the corresponding codeword length is

ℓ(x) = ⌈log |A_ε^{(n)}(X)|⌉ + 1 ≤ log |A_ε^{(n)}(X)| + 2 ≤ n( H(X) + ε ) + 2

Similarly, we know there are at most |X|^n non-typical sequences and that their codeword length is

ℓ(x) = ⌈log |X|^n⌉ + 1 ≤ log |X|^n + 2 = n log |X| + 2
In Figure 5.3 the procedure is shown graphically.

Figure 5.3: Principle of Shannon's source coding algorithm. The set of all sequences X^n is split into the typical sequences (at most 2^{n(H(X)+ε)} of them, with ℓ(x) ≤ n(H(X)+ε) + 2 bits) and the non-typical sequences (fewer than |X|^n of them, with ℓ(x) ≤ n log |X| + 2 bits).
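The encoding rule described above (a one-bit prefix followed by an index into either the typical list or the full list) can be sketched directly in code for a toy block length. This is only an illustration of the principle under the parameters of Example 29, with function names of my own; it is not an implementation taken from the text.

```python
import itertools
import math

p0, p1 = 1/3, 2/3
n = 5
H = -(p0 * math.log2(p0) + p1 * math.log2(p1))
eps = 0.15 * H

all_seqs = list(itertools.product([0, 1], repeat=n))

def is_typical(x):
    p = p1**sum(x) * p0**(n - sum(x))
    return 2**(-n*(H + eps)) <= p <= 2**(-n*(H - eps))

typical = [x for x in all_seqs if is_typical(x)]
idx_typ = {x: i for i, x in enumerate(typical)}    # index within the typical list
idx_all = {x: i for i, x in enumerate(all_seqs)}   # index within the full list

bits_typ = math.ceil(math.log2(len(typical)))      # index length for typical sequences
bits_all = math.ceil(math.log2(len(all_seqs)))     # index length for the full list

def encode(x):
    if x in idx_typ:
        return '0' + format(idx_typ[x], f'0{bits_typ}b')   # prefix 0: typical
    return '1' + format(idx_all[x], f'0{bits_all}b')       # prefix 1: non-typical

def decode(cw):
    return typical[int(cw[1:], 2)] if cw[0] == '0' else all_seqs[int(cw[1:], 2)]

x = (0, 1, 1, 1, 0)                  # a typical sequence (3 ones)
assert decode(encode(x)) == x
print(encode(x), len(encode(x)))     # 1 + ceil(log2 15) = 5 bits for a typical sequence
```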
Next, we derive a bound on the average length of the codewords.

L = E[ℓ(x)] = ∑_{x ∈ X^n} p(x) ℓ(x) = ∑_{x ∈ A_ε^{(n)}(X)} p(x) ℓ(x) + ∑_{x ∉ A_ε^{(n)}(X)} p(x) ℓ(x)
  ≤ ∑_{x ∈ A_ε^{(n)}(X)} p(x) ( n(H(X) + ε) + 2 ) + ∑_{x ∉ A_ε^{(n)}(X)} p(x) ( n log |X| + 2 )
  = P( x ∈ A_ε^{(n)}(X) ) n(H(X) + ε) + P( x ∉ A_ε^{(n)}(X) ) n log |X| + 2
  ≤ n(H(X) + ε) + ε n log |X| + 2 = n( H(X) + ε + ε log |X| + 2/n ) = n( H(X) + ε' )

where ε' = ε + ε log |X| + 2/n can be made arbitrarily small for sufficiently large n. Summarizing, we state the source coding theorem.
Theorem 21 Let x ∈ X^n be length n vectors of i.i.d. random variables with probability function p(x). Then there exists a code which maps sequences x of length n into binary sequences such that the mapping is invertible and

(1/n) E[ℓ(x)] ≤ H(X) + ε'

for sufficiently large n, where ε' can be made arbitrarily small.
For random processes we do not have i.i.d. variables. However, it is possible to generalize the theorem to this case, using the entropy rate. For a formal proof we refer to e.g. [2].

Theorem 22 Let X^n be a stationary ergodic process. Then there exists a code which maps sequences x of length n into binary sequences such that the mapping is invertible and

(1/n) E[ℓ(x)] ≤ H∞(X) + ε'

for sufficiently large n, where ε' can be made arbitrarily small.
5.3 Kraft Inequality
To be done.

5.4 Shannon-Fano Coding
To be done.

5.5 Huffman Coding
To be done.
Bibliography
[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991.
[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2nd edition, 2006.
[3] Robert G. Gallager. Information Theory and Reliable Communication. Wiley, 1968.
[4] Allan Gut. An Intermediate Course in Probability. Springer-Verlag, 1995.
[5] Richard Hamming. Error detecting and error correcting codes. Bell System Technical
Journal, Vol. 29:pp. 147–160, 1950.
[6] Ralph V. L. Hartley. Transmission of information. Bell System Technical Journal, pp. 535–563, July 1928.
[7] http://www.libpng.org/pub/png. PNG (Portable Network Graphics) home site, maintained by Greg Roelofs.
[8] David A. Huffman. A method for the construction of minimum redundancy codes.
Proceedings of the Institute of Radio Engineers, Vol. 40:pp. 1098–1101, 1952.
[9] Rolf Johannesson. Informationsteori – grundvalen för (tele-) kommunikation. Studentlitteratur, 1988. (in Swedish).
[10] Rolf Johannesson and Kamil Zigangirov. Fundamentals of Convolutional Codes. IEEE
Press, 1999.
[11] Boris Kudryashov. Teoriya informatsii (Information Theory). Piter, 2009. (in Russian).
[12] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical
Statistics, Vol. 22, No 1:79–86, March 1951.
[13] Robert McEliece. The Theory of Information and Coding. Cambridge University Press, 2004. Student Edition.
[14] Khalid Sayood. Introduction to Data Compression. Elsevier Inc., 2006.
[15] Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, Vol. 27:pp. 379–423, 623–656, July, October 1948.
[16] Terry Welch. A technique for high-performance data compression. IEEE Computer,
vol. 17:pp. 8–19, 1984.
[17] John F. Young. Information Theory. Butterworth & Co, 1971.
[18] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, Vol. 23:pp. 337–343, 1977.
[19] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, Vol. 24:pp. 530–536, 1978.