# Information Theory

Stefan Höst

March 9, 2012

Draft version V001, First Draft Outline, March 2012.
© Stefan Höst, Dept. of Electrical and Information Technology, Lund University.

## Contents

1 Preface
2 Introduction to Information Theory
3 Probability
   3.1 Random variable
   3.2 Expectation and variance
   3.3 Weak law of large numbers
   3.4 Jensen's inequality
   3.5 Stochastic Processes
   3.6 Markov process
4 Information Measure
   4.1 Information
   4.2 Entropy
   4.3 Mutual Information
       4.3.1 Convexity of information measures
   4.4 Entropy of sequences
   4.5 Random Walk
5 Source Coding Theorem
   5.1 Asymptotic Equipartition Property
   5.2 Source Coding Theorem
   5.3 Kraft Inequality
   5.4 Shannon-Fano Coding
   5.5 Huffman Coding
6 Universal Source Coding
   6.1 LZ77
       6.1.1 LZSS
   6.2 LZ78
       6.2.1 LZW
   6.3 Applications of LZ
7 Channel Coding Theory
   7.1 Channel coding theorem
   7.2 Channel coding
       7.2.1 Repetition code
       7.2.2 Hamming Code
       7.2.3 Convolutional code
8 Differential Entropy
   8.1 Information for continuous random variables
   8.2 Gaussian distribution
9 Gaussian Channel
   9.1 Capacity of Gaussian Channel
   9.2 Band limited channel
   9.3 Fundamental Shannon limit
       9.3.1 Limit for fix coding rate
       9.3.2 Limit for QAM
       9.3.3 Limit for MIMO
   9.4 Waterfilling
   9.5 Coding Gain and Shaping Gain
10 Rate Distortion Theory
   10.1 Rate distortion Function
   10.2 Quantization
   10.3 Shannon Limit for fix Pb
   10.4 Lossy compression
   10.5 Transform coding
Bibliography
## Chapter 3: Probability

One of the basic insights when setting up a measure for information is that the observed quantity must have choices to contain any information. So, to set up a theory of information we need a clear notion of probability theory. This chapter gives the parts of probability theory that are needed throughout the rest of the document. It is not, in any way, a complete course in probability theory; for that we refer to e.g. [4] or some other standard textbook in the area.

In short, a probability is a measure of how likely an event is to occur. It is represented by a number between zero and one, where zero means the event will not happen and one that it is certain to happen. The sum of the probabilities of all possible outcomes is one, since it is certain that one of them will happen. More formally, it was the Russian mathematician Kolmogorov who in 1933 set up the theory as we know it today.

The sample space Ω is the set of all possible outcomes of the random experiment. In discrete probability theory the sample space is finite or countably infinite; here we write

    Ω = {ω1, ω2, . . . , ωn}

An event is a subset of the sample space, A ⊆ Ω. The event is said to occur if the outcome of the random experiment is found in the event. Examples of specific events are

- The impossible event ∅ (the empty subset of Ω)
- The certain event Ω (the full subset)

The elements of Ω, i.e. the smallest nonempty subsets {ω1}, {ω2}, . . . , {ωn}, are called elementary events or, sometimes, atomic events. To each event there is assigned a probability measure P, where 0 ≤ P ≤ 1, which is a measure of how probable the event is to happen. Some fundamental probabilities are

    P(Ω) = 1
    P(∅) = 0
    P(A ∪ B) = P(A) + P(B),   if A ∩ B = ∅
where A and B are disjoint events. This implies that

    P(A) = ∑_{ωi∈A} P(ωi)

An important concept in probability theory is how two events are related. This is described by the conditional probability. The probability of event A conditioned on the event B is defined as

    P(A|B) = P(A ∩ B) / P(B)

If the two events A and B are independent, the probability of A should not change whether or not it is conditioned on B. Hence,

    P(A|B) = P(A)

Applying this fact to the definition of conditional probability, we conclude that A and B are independent if and only if the probability of their intersection equals the product of the individual probabilities,

    P(A ∩ B) = P(A) · P(B)

### 3.1 Random variable

A random variable, or stochastic variable, X is a function from the sample space into a specified set, most often the real numbers,

    X : Ω → X

where X = {x1, x2, . . . , xk}, k ≤ n, is the set of possible values. We denote the event {ω | X(ω) = xi} by {X = xi}. To describe how the probabilities of the random variable are distributed we use, in the discrete case, the probability function

    p_X(x) = P(X = x)

and in the continuous case the density function f_X(x). The distribution function of the variable can now be defined as

    F_X(x) = P(X ≤ x)

which we can derive as¹

    F_X(x) = ∑_{k≤x} p_X(k)          X discrete
    F_X(x) = ∫_{−∞}^{x} f_X(ν) dν    X continuous        (3.1)

¹ Often in this chapter we give the result both for discrete and continuous variables, as done here. If one of the equations is omitted, this does not mean it does not exist; in most cases it is straightforward to obtain the discrete or continuous counterpart.

On many occasions later in the document we will omit the index X when it is clear which variable is meant.

To consider how two or more random variables are jointly distributed we can view a vector of random variables. This vector then represents a multi-dimensional random variable and can be treated the same way.
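To make the single-variable definitions concrete, the probability function and the distribution function in (3.1) can be tabulated for a small discrete variable. The sketch below uses a fair die as an illustrative choice (the die is an assumption for illustration, not taken from the text above):

```python
# Probability function p_X and distribution function F_X of a discrete
# random variable, here a fair die (an illustrative choice).
from fractions import Fraction

p_X = {x: Fraction(1, 6) for x in range(1, 7)}      # p_X(x) = P(X = x)

def F_X(x):
    """Distribution function F_X(x) = P(X <= x) = sum over k <= x of p_X(k)."""
    return sum(p for k, p in p_X.items() if k <= x)

assert sum(p_X.values()) == 1        # the probabilities sum to one
print(F_X(3))    # 1/2, i.e. P(X <= 3)
print(F_X(6))    # 1, the certain event
```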
Hence, we can consider (X, Y), where X and Y are random variables with possible outcomes X = {x1, x2, . . . , xM} and Y = {y1, y2, . . . , yN}, respectively. Then the set of joint outcomes is

    X × Y = {(x1, y1), (x1, y2), . . . , (xM, yN)}

Normally we write the joint probability function as p_{X,Y}(x, y) for the discrete case, meaning the probability of the event that X = x and Y = y, and the corresponding joint density function is denoted f_{X,Y}(x, y). The marginal distribution can be derived from the joint distribution as

    p_X(x) = ∑_y p_{X,Y}(x, y)

and

    f_X(x) = ∫_ℝ f_{X,Y}(x, y) dy

for the discrete and continuous case, respectively.

This leads us to consider how dependent the two variables are. The conditional probability distribution is defined by

    p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y)

and

    f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)

This gives a probability distribution for the variable X when the outcome of Y is known. Repeating this iteratively, we obtain the chain rule for probabilities, giving the probability of an n-dimensional random variable as

    P(X1 . . . Xn−1 Xn) = P(Xn | X1 . . . Xn−1) P(X1 . . . Xn−1)
                        = P(X1) P(X2|X1) P(X3|X1X2) · · · P(Xn | X1 . . . Xn−1)
                        = ∏_{i=1}^{n} P(Xi | X1 . . . Xi−1)        (3.2)

Combining the above results, we conclude that two random variables X and Y are statistically independent if and only if

    p_{X,Y}(x, y) = p_X(x) p_Y(y)   or   f_{X,Y}(x, y) = f_X(x) f_Y(y)

for all x and y.

### 3.2 Expectation and variance

Consider an example where the random variable X is the outcome of a throw with a fair die. The probabilities of the outcomes are all equal, i.e.

    p_X(x) = 1/6,   x = 1, 2, . . . , 6

The arithmetic mean of the outcomes is

    X̄ = (1/6) ∑_x x = 3.5

Now instead have a look at a counterfeit die, where a small weight is placed close to the number one, and let Y be the corresponding random variable. Simplifying the result, we assume that the number one will never show and that the number six will show twice as often as before, while the probabilities for the other numbers are unchanged.
That is, p_Y(1) = 0, p_Y(i) = 1/6 for i = 2, 3, 4, 5, and p_Y(6) = 1/3. The arithmetic mean over the faces is still 3.5, but it no longer says very much about the actual result. Therefore, we introduce the expected value, which is a mean weighted with the probabilities of the outcomes,

    E[X] = ∑_x x p_X(x)        (3.3)

In the case of the fair die the expected value is the same as the arithmetic mean, but for the manipulated die it becomes

    E[Y] = 0 · 1 + (1/6) · 2 + (1/6) · 3 + (1/6) · 4 + (1/6) · 5 + (1/3) · 6 ≈ 4.33

A slightly more general definition of the expected value is as follows.

Definition 1  Let g(X) be a real-valued function of a random variable X. The expected value (or mean) of g(X) is, for a discrete variable,

    E[g(X)] = ∑_{x∈X} g(x) p_X(x)

and for a continuous variable

    E[g(X)] = ∫_ℝ g(ν) f_X(ν) dν

Going back to our true and counterfeit dice, we can use the function g(x) = x². This leads to the so-called second order moment of the variable. For the true die we get

    E[X²] = ∑_{i=1}^{6} (1/6) i² ≈ 15.2

and for the counterfeit

    E[Y²] = ∑_{i=2}^{5} (1/6) i² + (1/3) 6² = 21

Later on, in Section 5.1, we will make use of the weak law of large numbers. It states that the arithmetic mean of a vector of independent identically distributed (i.i.d.) random variables approaches the expected value as the length of the vector grows. In this sense the expected value is a most natural definition of the mean of the outcome of a random variable.

The expected value of a multi-dimensional variable is derived similarly, with the joint distribution. In the case of a 2-dimensional vector (X, Y) it becomes

    E[g(X, Y)] = ∑_{x,y} g(x, y) p_{X,Y}(x, y)            discrete case
    E[g(X, Y)] = ∫_{ℝ²} g(ν, µ) f_{X,Y}(ν, µ) dν dµ       continuous case

It can also be that one variable is discrete and the other continuous, in which case the formula has one sum and one integral.
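The numbers in the dice example can be reproduced with a few lines of Python; a sketch using exact rational arithmetic:

```python
# Expected values and second order moments for the fair and counterfeit dice.
from fractions import Fraction

def E(g, pmf):
    """E[g(X)] = sum over x of g(x) p_X(x) for a discrete distribution."""
    return sum(g(x) * p for x, p in pmf.items())

fair = {x: Fraction(1, 6) for x in range(1, 7)}
# Counterfeit die: one never shows, six shows twice as often as before.
fake = {**fair, 1: Fraction(0), 6: Fraction(1, 3)}

print(float(E(lambda x: x, fair)))      # 3.5
print(float(E(lambda x: x, fake)))      # 4.33...
print(float(E(lambda x: x**2, fair)))   # 15.16..., about 15.2
print(float(E(lambda x: x**2, fake)))   # 21.0
```

The same helper also gives the variances V[X] = E[X²] − E[X]² ≈ 2.9 and V[Y] ≈ 2.2 quoted further down.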
From the definition of the expected value and basic calculus it is easy to verify that the expected value is a linear mapping, i.e.

    E[aX + bY] = ∑_{x,y} (ax + by) p_{X,Y}(x, y)
               = ∑_{x,y} ax p_{X,Y}(x, y) + ∑_{x,y} by p_{X,Y}(x, y)
               = a ∑_x x ∑_y p_{X,Y}(x, y) + b ∑_y y ∑_x p_{X,Y}(x, y)
               = a ∑_x x p_X(x) + b ∑_y y p_Y(y)
               = a E[X] + b E[Y]        (3.4)

At the end of this section a more general version of this result is stated. We can also verify that if X and Y are independent, then the expectation of their product equals the product of their expectations,

    E[XY] = ∑_{x,y} xy p_{X,Y}(x, y) = ∑_{x,y} xy p_X(x) p_Y(y)
          = ∑_x x p_X(x) ∑_y y p_Y(y) = E[X] E[Y]        (3.5)

While the expected value is a measure of the weighted mean of the outcome of a random variable, we also need a measure of its variation. This measure is called the variance and is defined as the expected value of the squared distance to the mean.

Definition 2  Let X be a random variable with expected value E[X]. The variance of the variable is

    V[X] = E[(X − E[X])²]

The variance can often be derived from

    V[X] = E[(X − E[X])²] = E[X² − 2X E[X] + E[X]²]
         = E[X²] − 2E[X]E[X] + E[X]² = E[X²] − E[X]²

where E[X²] is the second order moment of X. In many descriptions the expected value is denoted by m, but we have here chosen to keep the notation E[X]. Still, it should be regarded as a constant in the derivations.

Often the standard deviation, σ_X, is used as a measure of the variation of the variable. It is the square root of the variance,

    σ_X² = V[X]

For the true and counterfeit dice described earlier in this section we get the variances

    V[X] = E[X²] − E[X]² ≈ 2.9
    V[Y] = E[Y²] − E[Y]² ≈ 2.2

The corresponding standard deviations are

    σ_X = √V[X] ≈ 1.7
    σ_Y = √V[Y] ≈ 1.5

To get more understanding of the variance we first need to define one of its relatives, the covariance. It can be seen as a measure of the dependence between two random variables.
Definition 3  Let X and Y be two random variables with expected values E[X] and E[Y]. The covariance between the variables is

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

We can easily rewrite the covariance to get

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
              = E[XY − X E[Y] − E[X] Y + E[X]E[Y]]
              = E[XY] − E[X]E[Y]

From this result, and from (3.5), we see that when X and Y are independent the covariance is zero,

    Cov(X, Y) = 0        (3.6)

Going back to the variance, we can now derive the variance of the combination aX + bY as

    V[aX + bY] = E[(aX + bY)²] − E[aX + bY]²
               = E[a²X² + b²Y² + 2abXY] − (a E[X] + b E[Y])²
               = a²E[X²] + b²E[Y²] + 2ab E[XY] − a²E[X]² − b²E[Y]² − 2ab E[X]E[Y]
               = a²V[X] + b²V[Y] + 2ab Cov(X, Y)        (3.7)

With (3.6) we can also conclude that if X and Y are independent,

    V[aX + bY] = a²V[X] + b²V[Y]        (3.8)

Sometimes it is convenient to have a normalized random variable. We can get one by considering

    X̃ = (X − m) / σ

where E[X] = m and V[X] = σ². Then the expectation and variance become

    E[X̃] = (1/σ)(E[X] − m) = 0
    V[X̃] = (1/σ²) E[(X − m)²] = 1

To conclude the part about expected value and variance, we summarize with more general versions of (3.4), (3.7) and (3.8). The results can be shown very similarly to the derivations above.

Theorem 1  Given a set of N random variables Xn and scalar constants αn, n = 1, 2, . . . , N, the sum

    Y = ∑_{n=1}^{N} αn Xn

has the expected value

    E[Y] = ∑_{n=1}^{N} αn E[Xn]

and the variance

    V[Y] = ∑_{n=1}^{N} αn² V[Xn] + 2 ∑_{m<n} αn αm Cov(Xn, Xm)

If the random variables Xn are all independent, then

    V[Y] = ∑_{n=1}^{N} αn² V[Xn]

We have already seen a distribution for the outcome of a die. We will finish this section by considering the geometric and Gaussian distributions, both often used in technical contexts.
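Before those examples, identity (3.7) can be sanity-checked numerically on any small joint distribution. In the sketch below the joint probability function is a made-up example, chosen only for illustration:

```python
# Check V[aX + bY] = a^2 V[X] + b^2 V[Y] + 2ab Cov(X, Y) on a small joint pmf.
from fractions import Fraction as F

# Arbitrary joint pmf p_{X,Y}(x, y), an illustrative assumption.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 8), (1, 0): F(1, 8), (1, 1): F(1, 2)}

def E(g):
    return sum(g(x, y) * p for (x, y), p in joint.items())

EX, EY = E(lambda x, y: x), E(lambda x, y: y)
VX = E(lambda x, y: x * x) - EX**2
VY = E(lambda x, y: y * y) - EY**2
cov = E(lambda x, y: x * y) - EX * EY        # Cov(X, Y) = E[XY] - E[X]E[Y]

a, b = 2, -3
lhs = E(lambda x, y: (a*x + b*y)**2) - E(lambda x, y: a*x + b*y)**2
rhs = a*a*VX + b*b*VY + 2*a*b*cov            # right hand side of (3.7)
print(lhs == rhs)    # True
print(lhs)           # 111/64
```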
Example 1 [Geometric distribution]  The geometric distribution is a discrete distribution that can be explained from a sequence of coin flips. Assume that the probabilities for heads and tails are P(heads) = α and P(tails) = 1 − α, and let K be the number of flips until a tail shows up. The probability that this number equals k is

    p_K(k) = α^{k−1}(1 − α),   k = 1, 2, . . .

and the distribution function is

    F_K(k) = ∑_{i=1}^{k} α^{i−1}(1 − α) = (1 − α) ∑_{n=0}^{k−1} α^n = (1 − α)(1 − α^k)/(1 − α) = 1 − α^k

We can directly see that this is indeed a probability distribution, since F_K(k) → 1 as k → ∞. Another way is to sum over all probabilities,

    ∑_{k=1}^{∞} α^{k−1}(1 − α) = (1 − α) ∑_{n=0}^{∞} α^n = (1 − α) · 1/(1 − α) = 1

Here, and in the remainder of this example, we make use of the well-known sums, valid for |α| < 1,

    ∑_{n=0}^{∞} α^n = 1/(1 − α),   ∑_{n=0}^{∞} n α^n = α/(1 − α)²,   ∑_{n=0}^{∞} n² α^n = α(1 + α)/(1 − α)³

We can find the expected value and the second order moment as

    E[K] = ∑_{k=1}^{∞} k α^{k−1}(1 − α) = (1 − α) ∑_{n=0}^{∞} (n + 1) α^n
         = (1 − α)( α/(1 − α)² + 1/(1 − α) ) = 1/(1 − α)

    E[K²] = ∑_{k=1}^{∞} k² α^{k−1}(1 − α) = (1 − α) ∑_{n=0}^{∞} (n + 1)² α^n
          = (1 − α)( α(1 + α)/(1 − α)³ + 2α/(1 − α)² + 1/(1 − α) ) = (1 + α)/(1 − α)²        (3.9)

Hence, the variance is

    V[K] = E[K²] − E[K]² = (1 + α)/(1 − α)² − 1/(1 − α)² = α/(1 − α)²

For α = 1/2, i.e. a fair coin, we obtain

    p_K(k) = (1/2)^k,   E[K] = 2,   E[K²] = 6,   V[K] = 2

The next example treats the Gaussian distribution.

Example 2 [Gaussian distribution]  The Gaussian, or normal, distribution is a very central distribution in probability theory. It is also a very common distribution for modelling continuous channels.
The distribution is often denoted X ∼ N(m, σ) and the density function is

    f_X(x) = 1/√(2πσ²) · e^{−(x−m)²/2σ²}        (3.10)

To show that this is actually a density function, and to derive the expectation and variance, we start with a version centered at 0, Y ∼ N(0, σ), with density function

    f_Y(y) = 1/√(2πσ²) · e^{−y²/2σ²}

To show that this is a density function, i.e. that ∫_{−∞}^{∞} f_Y(y) dy = 1, we consider the squared integral,

    ( ∫_{−∞}^{∞} 1/√(2πσ²) e^{−y²/2σ²} dy )²
        = 1/(2πσ²) ∫_{−∞}^{∞} e^{−y²/2σ²} dy ∫_{−∞}^{∞} e^{−z²/2σ²} dz
        = 1/(2πσ²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^{−(y²+z²)/2σ²} dy dz
        = 1/(2πσ²) ∫_{0}^{2π} ∫_{0}^{∞} r e^{−r²/2σ²} dr dφ
        = 1/(2π) ∫_{0}^{2π} ∫_{0}^{∞} (r/σ²) e^{−r²/2σ²} dr dφ
        = 1/(2π) ∫_{0}^{2π} [ −e^{−r²/2σ²} ]_{0}^{∞} dφ = 1/(2π) ∫_{0}^{2π} dφ = 1

where we used the standard change of variables to polar coordinates, y = r cos φ and z = r sin φ. The expectation and the second order moment can be derived as

    E[Y] = ∫_{−∞}^{∞} y · 1/√(2πσ²) e^{−y²/2σ²} dy
         = σ²/√(2πσ²) ∫_{−∞}^{∞} (y/σ²) e^{−y²/2σ²} dy
         = σ²/√(2πσ²) [ −e^{−y²/2σ²} ]_{−∞}^{∞} = 0

    E[Y²] = ∫_{−∞}^{∞} y² · 1/√(2πσ²) e^{−y²/2σ²} dy
          = σ²/√(2πσ²) ∫_{−∞}^{∞} y · (y/σ²) e^{−y²/2σ²} dy
          = σ²/√(2πσ²) ( [ −y e^{−y²/2σ²} ]_{−∞}^{∞} + ∫_{−∞}^{∞} e^{−y²/2σ²} dy )
          = σ² ∫_{−∞}^{∞} 1/√(2πσ²) e^{−y²/2σ²} dy = σ²

where the second moment follows by partial integration. To make the same derivations in the more general case X ∼ N(m, σ), we use the change of variable y = x − m (implying dy = dx and x = y + m). Then Y ∼ N(0, σ), and we get

    ∫_{−∞}^{∞} 1/√(2πσ²) e^{−(x−m)²/2σ²} dx = ∫_{−∞}^{∞} 1/√(2πσ²) e^{−y²/2σ²} dy = 1

so it is a density function.
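The normalization can also be checked numerically. The sketch below integrates the density (3.10) with a plain Riemann sum; the parameter values and the integration grid are arbitrary choices:

```python
# Numerical check that the Gaussian density (3.10) integrates to one.
import math

m, sigma = 1.5, 2.0                              # arbitrary parameters

def f(x):                                        # density of N(m, sigma)
    return math.exp(-(x - m)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

dx = 1e-3                                        # step of the Riemann sum
# Integrate over [m - 20, m + 20], i.e. ten standard deviations to each side;
# the neglected tails are vanishingly small.
total = sum(f(m - 20 + i * dx) * dx for i in range(int(40 / dx)))
print(round(total, 9))    # 1.0
```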
The expectation and second order moment are

    E[X] = ∫_{−∞}^{∞} x · 1/√(2πσ²) e^{−(x−m)²/2σ²} dx
         = ∫_{−∞}^{∞} (y + m) · 1/√(2πσ²) e^{−y²/2σ²} dy
         = E[Y] + m ∫_{−∞}^{∞} f_Y(y) dy = m

    E[X²] = ∫_{−∞}^{∞} x² · 1/√(2πσ²) e^{−(x−m)²/2σ²} dx
          = ∫_{−∞}^{∞} (y + m)² · 1/√(2πσ²) e^{−y²/2σ²} dy
          = E[Y²] + 2m E[Y] + m² = σ² + m²

and the variance is

    V[X] = E[X²] − E[X]² = σ²

### 3.3 Weak law of large numbers

The weak law of large numbers will play a central role when proving the source coding theorem, which sets a bound on the compression ratio, as well as the channel coding theorem, which sets a limit on the capability of reliable communication. These two theorems are very important results in information theory.

In this section, first Markov's inequality and then Chebyshev's inequality will be stated. These are famous bounds in statistics, and the latter will be used to show the weak law of large numbers. Here we only consider the case of discrete random variables, but the results hold also for continuous variables.

Theorem 2 (Markov's inequality)  Let X be a non-negative random variable with finite expected value. Then, for any a > 0,

    P(X > a) ≤ E[X] / a

To show this we start with the expected value,

    E[X] = ∑_{k≥0} k p(k) ≥ ∑_{k>a} k p(k) ≥ ∑_{k>a} a p(k) = a ∑_{k>a} p(k) = a P(X > a)

which gives the result. If, instead of X, we consider the squared distance to the expected value E[X], we get

    P( (X − E[X])² > ε² ) ≤ E[(X − E[X])²] / ε² = V[X] / ε²

Equivalently, this can be stated as in the following theorem, which is Chebyshev's inequality.

Theorem 3 (Chebyshev's inequality)  Let X be a random variable with finite expected value E[X]. Then, for any positive ε,

    P( |X − E[X]| > ε ) ≤ V[X] / ε²

As stated previously, this can be used to give a first proof of the weak law of large numbers. Consider a sequence of independent identically distributed (i.i.d.) random variables, Xi, i = 1, 2, . . . , n.
The arithmetic mean of the sequence can be viewed as a new random variable,

    Y = (1/n) ∑_{i=1}^{n} Xi

From Theorem 1 we see that the expected value and variance of Y can be expressed as

    E[Y] = E[X]
    V[Y] = V[X] / n

Applying Chebyshev's inequality yields

    P( | (1/n) ∑_{i=1}^{n} Xi − E[X] | > ε ) ≤ V[X] / (nε²)

As n grows, the right hand side tends to zero, bounding the arithmetic mean close to the expected value. Stated differently, we get

    lim_{n→∞} P( | (1/n) ∑_{i=1}^{n} Xi − E[X] | < ε ) = 1

which gives the weak law of large numbers. This type of probabilistic convergence is called convergence in probability and is denoted →ᵖ. This gives the following way to express the relation.

Theorem 4 (Weak law of large numbers)  Let X1, X2, . . . , Xn be i.i.d. random variables with finite expectation E[X]. Form the arithmetic mean Y = (1/n) ∑_i Xi. Then Y converges in probability to E[X],

    Y →ᵖ E[X],   n → ∞

It should be noted that the proof given above requires that the variance is finite, but it is possible to show also the more general theorem given here.

To see how the convergence towards the expectation works in practice, we give an example with a binary vector.

Example 3  Consider a length-n vector of i.i.d. binary random variables, X = (X1, X2, . . . , Xn), with p_X(0) = 1/3 and p_X(1) = 2/3. The probability for a vector to have k ones is

    P(k 1s in X) = C(n, k) (2/3)^k (1/3)^{n−k} = C(n, k) 2^k / 3^n

In Figure 3.1 the probability distribution of the number of 1s in a vector of length n = 5 is shown, both as a table and graphically. We see that it is most likely to get a vector with 3 or 4 1s. It is also interesting to observe that it is less probable to get all five 1s. In Figure 3.2 the distribution of the number of 1s is shown when the length of the vector is increased to 10, 50, 100 and 500. With increasing length it becomes more evident that the most likely vectors have about n · E[X] ones.

Figure 3.1: Probability distribution for k 1s in a vector of length 5, when p(1) = 2/3 and p(0) = 1/3.

| k | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| P(k 1s in x) | 0.0041 | 0.0412 | 0.1646 | 0.3292 | 0.3292 | 0.1317 |

[Figure 3.2: Probability distributions for k 1s in a vector of length 10 (a), 50 (b), 100 (c) and 500 (d), when p(1) = 2/3 and p(0) = 1/3.]

While we are dealing with the arithmetic mean of i.i.d. random variables, it would not be fair not to mention the central limit theorem. This and the law of large numbers are the two main limit results in probability theory. However, we give the theorem without proof (as in many basic probability courses). The result is that the arithmetic mean of a sequence of i.i.d. random variables becomes normal distributed, independently of the distribution of the variables.

Theorem 5 (Central limit theorem)  Let X1, X2, . . . , Xn be i.i.d. random variables with finite expectation E[X] = m and finite variance V[X] = σ². Form the arithmetic mean Y = (1/n) ∑_i Xi. Then, as n goes to infinity, Y becomes distributed according to a normal distribution, i.e.

    (Y − m)/(σ/√n) ∼ N(0, 1),   n → ∞

### 3.4 Jensen's inequality

In many applications, like optimization, convex functions are desirable since they have a structure that is easy to work with. A convex function is defined as follows.

Definition 4 (Convex function)  A function g(x) is convex in the interval [a, b] if, for any x1, x2 ∈ [a, b] and any λ, 0 ≤ λ ≤ 1,

    g(λx1 + (1 − λ)x2) ≤ λ g(x1) + (1 − λ) g(x2)

Similarly, a function g(x) is concave in the interval [a, b] if −g(x) is convex in the same interval.

The inequality in the definition can be viewed graphically as in Figure 3.3. Let x1 and x2 be two numbers in the interval [a, b], such that x1 < x2, and mark the function values g(x1) and g(x2) on a plot of g(x). For λ in [0, 1] we get

    x1 ≤ λx1 + (1 − λ)x2 ≤ x2

with equality to the left if λ = 1 and to the right if λ = 0. We can then also mark the corresponding function value g(λx1 + (1 − λ)x2). While λx1 + (1 − λ)x2 is a value between x1 and x2, the corresponding value between g(x1) and g(x2) is λg(x1) + (1 − λ)g(x2); i.e. the coordinates (λx1 + (1 − λ)x2, λg(x1) + (1 − λ)g(x2)) describe a straight line between (x1, g(x1)) and (x2, g(x2)) as λ ranges over [0, 1]. From this reasoning we see that a convex function is typically shaped like a bowl, and a concave function the opposite, shaped like a hill.

[Figure 3.3: A graphical view of the definition of convex functions.]

Example 4  The functions x² and eˣ are typical convex functions. On the other hand, the functions −x² and log x are concave.

In the literature the names convex ∪ and convex ∩ are also used instead of convex and concave.

Since λ and 1 − λ can be interpreted as a binary probability function, both λx1 + (1 − λ)x2 and λg(x1) + (1 − λ)g(x2) can be viewed as expected values. A more general way of putting this is Jensen's inequality.

Theorem 6 (Jensen's inequality)  If f(x) is a convex function and X a random variable, then

    E[f(X)] ≥ f(E[X])

If f(x) is a concave function and X a random variable, then

    E[f(X)] ≤ f(E[X])

Jensen's inequality is so important that a more thorough outline of the proof is motivated. Even though it is only shown for the discrete case here, it is also valid in the continuous case. As stated prior to the theorem, the binary case follows directly from the definition of convexity. To show that the theorem holds also for distributions with more than two outcomes we use induction. Assume that a1, a2, . . . , an are positive numbers such that ∑_i ai = 1 and that

    f( ∑_i ai xi ) ≤ ∑_i ai f(xi)

where f(x) is a convex function. Furthermore, let p1, p2, . . . , pn+1 be a probability distribution for X.
Then

    f(E[X]) = f( ∑_{i=1}^{n+1} pi xi ) = f( p1 x1 + ∑_{i=2}^{n+1} pi xi )
            = f( p1 x1 + (1 − p1) ∑_{i=2}^{n+1} (pi/(1 − p1)) xi )
            ≤ p1 f(x1) + (1 − p1) f( ∑_{i=2}^{n+1} (pi/(1 − p1)) xi )          (a)
            ≤ p1 f(x1) + (1 − p1) ∑_{i=2}^{n+1} (pi/(1 − p1)) f(xi)           (b)
            = ∑_{i=1}^{n+1} pi f(xi) = E[f(X)]

where inequality (a) comes from the convexity of f and (b) from the induction assumption. What is left to show is that ai = pi/(1 − p1) satisfies the requirements above. Clearly, since pi is a probability, we have ai ≥ 0. The second requirement follows from

    ∑_{i=2}^{n+1} pi/(1 − p1) = (1/(1 − p1)) ∑_{i=2}^{n+1} pi = (1/(1 − p1))(1 − p1) = 1

which completes the proof.

Example 5  Since f(x) = x² is a convex function, we see that E[X²] ≥ E[X]², which shows that the variance is non-negative.

Clearly, the above example comes as no surprise, since it is clear already from the definition of the variance. A somewhat more interesting consequence of Jensen's inequality is the so-called log-sum inequality. The function f(t) = t log t is in fact convex. Then, if the αi form a probability distribution, it follows from Jensen's inequality that

    ∑_i αi ti log ti ≥ ( ∑_i αi ti ) log( ∑_i αi ti )        (3.11)

This can be used to get

    ∑_i ai log(ai/bi) = ∑_j bj · ∑_i (bi/∑_j bj)(ai/bi) log(ai/bi)
                      ≥ ∑_j bj · ( ∑_i (bi/∑_j bj)(ai/bi) ) log( ∑_i (bi/∑_j bj)(ai/bi) )
                      = ( ∑_i ai ) log( ∑_i ai / ∑_j bj )

where we identified αi = bi/∑_j bj and ti = ai/bi in (3.11). Summarizing, we have the following theorem.

Theorem 7 (log-sum inequality)  Let a1, . . . , an and b1, . . . , bn be non-negative numbers. Then

    ∑_i ai log(ai/bi) ≥ ( ∑_i ai ) log( ∑_i ai / ∑_i bi )

### 3.5 Stochastic Processes

So far we have assumed that the source symbols are independent in time, but it is often useful to also consider how sequences depend on time, i.e. a dynamic system. We will here look at an example showing that texts have dependencies in time, and that one letter is clearly dependent on the surrounding letters. This example comes directly from the work of Claude Shannon [15].
Shannon assumed an alphabet with 27 symbols, i.e. 26 letters and 1 space. To get the 0th order approximation, a sample text was generated with equal probability for the letters.

Example 6 [0th order approximation]  Choose letters from the English alphabet with equal probability:

XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD

Clearly, this text does not have much in common with normal written English. So, instead, he counted the number of occurrences per letter in normal English texts and estimated the probabilities, given in Table 3.1. Then, according to those probabilities, a sample text for the 1st order approximation can be generated. Such a text is shown in the next example; it has more of the structure of English, but is still far from readable.

Table 3.1: Probabilities in percent for the letters in English text.

| Letter | P | Letter | P | Letter | P |
|---|---|---|---|---|---|
| A | 8.167 | J | 0.153 | S | 6.327 |
| B | 1.492 | K | 0.772 | T | 9.056 |
| C | 2.782 | L | 4.025 | U | 2.758 |
| D | 4.253 | M | 2.406 | V | 0.978 |
| E | 12.702 | N | 6.749 | W | 2.360 |
| F | 2.228 | O | 7.507 | X | 0.150 |
| G | 2.015 | P | 1.929 | Y | 1.974 |
| H | 6.094 | Q | 0.095 | Z | 0.074 |
| I | 6.966 | R | 5.987 | | |

Example 7 [1st order approximation]  Choose the symbols according to their normal probability (12 % E, 2 % W, etc.):

OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL

The next step is to extend the distribution so that the probability of a letter depends on the previous letter, i.e. the probability for the letter at time t becomes P(St | St−1).

Example 8 [2nd order approximation]  Choose the letters conditioned on the previous letter:

ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

Similarly, the 3rd order approximation in the next example conditions on the two previous letters. We see that the text becomes still more like English, even if it does not make any sense.
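Incidentally, a first-order sample like the one in Example 7 is easy to generate from the frequencies in Table 3.1. A minimal sketch (letters only; Shannon's 27th symbol, the space, is left out here for simplicity, and the seed is an arbitrary choice):

```python
# Generate a 1st order approximation of English text from Table 3.1.
import random

# Letter probabilities in percent, from Table 3.1.
freq = {'A': 8.167, 'B': 1.492, 'C': 2.782, 'D': 4.253, 'E': 12.702,
        'F': 2.228, 'G': 2.015, 'H': 6.094, 'I': 6.966, 'J': 0.153,
        'K': 0.772, 'L': 4.025, 'M': 2.406, 'N': 6.749, 'O': 7.507,
        'P': 1.929, 'Q': 0.095, 'R': 5.987, 'S': 6.327, 'T': 9.056,
        'U': 2.758, 'V': 0.978, 'W': 2.360, 'X': 0.150, 'Y': 1.974,
        'Z': 0.074}

random.seed(0)                                   # arbitrary, for repeatability
sample = ''.join(random.choices(list(freq), weights=list(freq.values()), k=60))
print(sample)                                    # 60 letters with English letter statistics
```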
Example 9 [3rd order approximation]  Choose the symbols conditioned on the two previous symbols:

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTRURES OF THE REPTAGIN IS REGOACTIONA OF CRE

Instead of letters, the text can also be generated from probabilities for words. The first order approximation uses the unconditioned word probabilities.

Example 10 [1st order word approximation]  Choose words independently, but according to their probabilities:

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE

If the probabilities for words are conditioned on the previous word, we get a much more readable text, still without any direct meaning, of course.

Example 11 [2nd order word approximation]  Choose words conditioned on the previous word:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

The above examples show that in many situations it is important to view sequences instead of individual symbols. In probability theory such a sequence is called a random process, or a stochastic process. In a general form it can be defined as follows.

Definition 5 (Random process)  A discrete random process is a sequence of random variables, {Xi}_{i=1}^{n}, defined on the same sample space. There can be an arbitrary dependency among the variables, and the process is characterized by the joint probability function

    P(X1, X2, . . . , Xn = x1, x2, . . . , xn) = p(x1, x2, . . . , xn),   n = 1, 2, . . .

Since there is dependency in time, we also need more general parameters corresponding to the second order moment and the variance of a random variable, now depending also on the time shift.
The autocorrelation function reflects the correlation in time and is defined as

r_XX(n, n+k) = E[X_n X_{n+k}]

If the mean and the autocorrelation function are time independent, i.e. for all n

E[X_n] = E[X]
r_XX(n, n+k) = r_XX(k)

the process is said to be wide sense stationary (WSS). The relation with the second order moment is that r_XX(0) = E[X^2]. The corresponding relation for the variance comes with the autocovariance function, defined for a WSS process as

c_XX(k) = E[(X_n - E[X])(X_{n+k} - E[X])],   for all n

We can easily see that c_XX(k) = r_XX(k) - E[X]^2 and that c_XX(0) = V[X].

The class of WSS processes is very powerful for modelling. However, sometimes we want an even stronger condition on the time invariance. We say that a process is stationary, or time-invariant, if the probability distribution does not depend on the time shift. That is, if

P(X_1, ..., X_n = x_1, ..., x_n) = P(X_{l+1}, ..., X_{l+n} = x_1, ..., x_n)

for all n and all time shifts l. Clearly this is a subclass of the WSS processes.

3.6 Markov process

A widely used class of discrete stationary random processes is the class of Markov processes. Here, the probability for a symbol depends only on the previous symbol. With this simplification we get a system that is relatively easy to handle from a mathematical point of view, while the time dependency in the sequence still makes it a powerful modelling tool.

Definition 6 (Markov chain) A Markov chain, or Markov process, is a stationary random process with unit memory, i.e.

P(x_n | x_{n-1}, ..., x_1) = P(x_n | x_{n-1})

for all x_i. Using the chain rule for probabilities (3.2) we conclude that for a Markov chain the joint probability function is

p(x_1, x_2, ..., x_n) = prod_{i=1}^{n} p(x_i | x_{i-1}) = p(x_1) p(x_2|x_1) p(x_3|x_2) ... p(x_n|x_{n-1})

The unit memory of a Markov chain means that it can be characterized by

• A finite set of states X = {x_1, x_2, ..., x_r}, where the state determines everything about the past.
This represents the unit memory of the chain.

• A state transition matrix P = [p_ij], i, j in {1, 2, ..., r}, where p_ij = p(x_j | x_i) and sum_j p_ij = 1.

The behaviour of a Markov chain can be visualized in a state transition graph, consisting of states and edges labeled with probabilities. The following example shows how they are related.

Example 12 A three state Markov chain is described by the three states X = {s_1, s_2, s_3} and the state transition matrix

    [ 1/3  2/3   0  ]
P = [ 1/4   0   3/4 ]
    [ 1/2  1/2   0  ]

From P we see that, conditioned on s_1 being the previous state, the probability for s_1 is 1/3, for s_2 it is 2/3 and for s_3 it is 0. This can be viewed as transitions in a graph from state s_1 to the other states, see Figure 3.4. Similarly, the other rows in the state transition matrix describe the probabilities for transitions from the other states.

Figure 3.4: A state transition graph of a three state Markov chain.

For a Markov chain with r states, the transition matrix is an r x r matrix with the transition probabilities. Let the initial probabilities for the states at time 0 be

w^(0) = (w_1^(0) ... w_r^(0))

where w_i^(0) = P(X_0 = i). Then the probability for being in state j at time 1 becomes

w_j^(1) = P(X_1 = j) = sum_i P(X_1 = j | X_0 = i) P(X_0 = i) = sum_i p_ij w_i^(0)

Hence, the vector describing the state probabilities at time 1 becomes w^(1) = w^(0) P, where P is the state transition matrix. Similarly, letting w^(n) be the state probabilities at time n, we get

w^(n) = w^(n-1) P = ... = w^(0) P^n

In the following example we view the transition probabilities over 2, 4, 8 and 16 steps, respectively. There we see that the columns of the matrix become more and more independent of the starting distribution.

Example 13 Continuing with the Markov chain from Example 12, the state transition matrix is

    [ 1/3  2/3   0  ]
P = [ 1/4   0   3/4 ]
    [ 1/2  1/2   0  ]

It shows the probabilities for the transitions from time 0 to time 1.
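Before carrying out the hand derivations, note that both the matrix powers in Example 13 and the stationary distribution found later in Example 14 can be reproduced numerically. A small NumPy sketch (the matrix is the one from Example 12):

```python
import numpy as np

# State transition matrix from Example 12.
P = np.array([[1/3, 2/3, 0.0],
              [1/4, 0.0, 3/4],
              [1/2, 1/2, 0.0]])

# Transition probabilities over 2, 4, 8 and 16 steps.
for n in (2, 4, 8, 16):
    print(f"P^{n} =")
    print(np.round(np.linalg.matrix_power(P, n), 4))

# For large n all rows of P^n approach the same vector, so any
# starting distribution gives the same state probabilities.
w0 = np.array([1.0, 0.0, 0.0])
print(np.round(w0 @ np.linalg.matrix_power(P, 16), 4))

# The stationary distribution solves w(P - I) = 0 together with
# sum_j w_j = 1: replace the first column of P - I by ones and
# solve w A = (1, 0, 0), as in Example 14.
A = P - np.eye(3)
A[:, 0] = 1.0
w = np.array([1.0, 0.0, 0.0]) @ np.linalg.inv(A)
print(np.round(w, 4))  # (15/43, 16/43, 12/43) ~ (0.3488, 0.3721, 0.2791)
```

The same replace-one-column trick works for any irreducible chain where P - I has the expected rank deficiency of one.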
If instead we consider P^n, we have the transition probabilities from time 0 to time n. Below we derive the transition probabilities over 2, 4, 8 and 16 steps.

            [ 20/72  16/72  36/72 ]   [ 0.2778  0.2222  0.5000 ]
P^2 = P P = [ 33/72  39/72    0   ] ~ [ 0.4583  0.5417  0      ]
            [ 21/72  24/72  27/72 ]   [ 0.2917  0.3333  0.3750 ]

                [ 1684/5184  1808/5184  1692/5184 ]   [ 0.3248  0.3488  0.3264 ]
P^4 = P^2 P^2 = [ 1947/5184  2049/5184  1188/5184 ] ~ [ 0.3756  0.3953  0.2292 ]
                [ 1779/5184  1920/5184  1485/5184 ]   [ 0.3432  0.3704  0.2865 ]

                [ 0.3485  0.3720  0.2794 ]
P^8 = P^4 P^4 ~ [ 0.3491  0.3721  0.2788 ]
                [ 0.3489  0.3722  0.2789 ]

                  [ 0.3488  0.3721  0.2791 ]
P^16 = P^8 P^8 ~  [ 0.3488  0.3721  0.2791 ]
                  [ 0.3488  0.3721  0.2791 ]

Already for 16 steps in the graph the numbers in each column are equal, at this accuracy of four decimals (actually, this holds already for P^12). This means that after 16 or more steps in the graph, the probability for a state is independent of the starting state. Hence, with any starting distribution w we get

w^(16) = w P^16 = (0.3488 0.3721 0.2791)

With higher accuracy in the derivations we need to keep multiplying longer, but eventually we reach a stage where the matrix no longer changes.

As seen in Example 13, we will reach^2 an asymptotic distribution w = (w_1, ..., w_r) such that

               [ w ]   [ w_1 ... w_r ]
lim_{n->oo} P^n = [ . ] = [     ...     ]
               [ w ]   [ w_1 ... w_r ]

In this limit we can also see that

lim_{n->oo} P^n P = lim_{n->oo} P^{n+1} = lim_{n->oo} P^n

so that

[ w ]     [ w ]
[ . ] P = [ . ]
[ w ]     [ w ]

Considering one row in the left matrix, we conclude that w = wP, which is the stationary distribution of the system. We are now ready to state a theorem on the relationship between the stationary and the asymptotic distribution.

^2 We have omitted the formal proof in this text.

Theorem 8 Let w = (w_1 w_2 ... w_r) be an asymptotic distribution of the state probabilities. Then

• sum_j w_j = 1
• w is a stationary distribution, i.e. wP = w
• w is the unique stationary distribution for the source.

Clearly the first statement is fulfilled since w is a distribution.
The second has already been shown above. The third, on uniqueness, still needs a proof. Assume that v = (v_1 ... v_r) is a stationary distribution, i.e. it fulfills sum_i v_i = 1 and vP = v. Then, as n -> oo, the equation v = vP^n can be written

                              [ w_1 ... w_j ... w_r ]
(v_1 ... v_r) = (v_1 ... v_r) [         ...         ]
                              [ w_1 ... w_j ... w_r ]

Considering the j-th component, this implies that

v_j = v_1 w_j + ... + v_r w_j = w_j (v_1 + ... + v_r) = w_j

since v_1 + ... + v_r = 1. That is, v = w, which proves uniqueness.

To derive the stationary distribution we start with the equation wP = w. Equivalently we can write wP - w = 0, and

w(P - I) = 0

However, since w != 0 we see that P - I does not have full rank, and we need at least one more equation. For this we can use sum_j w_j = 1. Hence, the equation system to solve is

w(P - I) = 0
sum_j w_j = 1

The next example shows the procedure.

Example 14 Again use the state transition matrix from Example 12,

    [ 1/3  2/3   0  ]
P = [ 1/4   0   3/4 ]
    [ 1/2  1/2   0  ]

Starting with w(P - I) = 0 we have

             [ -2/3  2/3   0  ]
w(P - I) = w [  1/4  -1   3/4 ] = 0
             [  1/2  1/2  -1  ]

In P - I each row sums to zero, so column 2 plus column 3 equals minus column 1, and the columns are linearly dependent. Therefore, exchange column 1 for the equation sum_j w_j = 1,

  [ 1  2/3   0  ]
w [ 1  -1   3/4 ] = (1 0 0)
  [ 1  1/2  -1  ]

This is solved by

            [ 1  2/3   0  ]^(-1)                  [ 15  16  12 ]
w = (1 0 0) [ 1  -1   3/4 ]      = (1 0 0) (1/43) [ 42 -24 -18 ] = (15/43 16/43 12/43)
            [ 1  1/2  -1  ]                       [ 36   4 -40 ]

With four decimals we get the same as earlier, when the asymptotic distribution was discussed,

w ~ (0.3488 0.3721 0.2791)

Chapter 4

Information Measure

Information theory is a mathematical theory of communication that has its base in probability theory. It involves specifying what information is and setting up measures for information, and it had its birth with Shannon's landmark paper from 1948 [15]. The theory gives answers to two fundamental questions:

• What is information?
• What is communication?
Even if we think we know what information and communication are, that is not the same as defining them mathematically. As an example, consider an electronic copy of Claude Shannon's paper in pdf format. The one considered here has the file size 2.2 MB. It contains a lot of information about the subject, but how can it be measured? One way to look at it is to compress the file as much as we can. In zip format the same file has the size 713 kB. So, if we should quantify the amount of information in the paper, the pdf version contains at least 1.5 MB that is not necessary for the pure information in the text. Is then the number 713 kB a measure of the contained information? From a mathematical point of view, we will see that it is definitely closer to the truth. However, from a philosophical point of view, it is not certain. We can compare this text with a text of the same size containing only randomly chosen letters. If they have the same file size, do they contain the same amount of information? Such semantic doubts are not considered in the mathematical model. Instead the question is how much information is needed to describe the text.

4.1 Information

In his paper Shannon set up a mathematical theory for information and communication, based on probability theory. He gave a quantitative measure of the amount of information stored in one variable, and gave limits on how much information can be transmitted from one place to another over a given communication channel.

However, already twenty years earlier, in 1928, Hartley stated that a symbol can contain information only if it has multiple choices [6]. That is, the symbol must be a random variable. Hartley argued that if one symbol, a, has k alternatives, then a vector of n independent such symbols, (a_1, ..., a_n), has k^n alternatives. To form a measure of information, notice that if the symbol a has the information I, then the vector should have the information nI.
The conclusion was that an appropriate information measure should be the logarithm of the number of alternatives,

I_H(a) = log k

In that way,

I_H(a_1, ..., a_n) = log k^n = n log k

Example 15 Consider the outcome of a throw with a fair dice. It has 6 alternatives, and hence the information according to Hartley is^1

I_H(Dice) = log_2 6 ~ 2.585 bit

In this example Hartley's information measure makes sense, since it is the number of bits needed to point out one of the six alternatives. But there are other situations where it runs into problems, like in the next example.

Example 16 Let the variable X be the outcome of a counterfeit coin, with the probabilities P(X = Head) = p and P(X = Tail) = 1 - p. According to Hartley the information is

I_H(X) = log_2 2 = 1 bit

In the previous example the measurement makes sense if the two outcomes are equally likely. Consider instead the case when p is very small, and flip the coin several times after each other. Since p is small, we expect most of the outcomes to be Tail and only a small fraction to be Head. Hence, the normal case in such a vector of outcomes is Tail, meaning there is not much information in this outcome. In the rare cases when we instead get Head, there is much more information. In other words, even if Hartley's information measure was groundbreaking at its time, it lacks consideration of the outcome distribution.

In Shannon's information measure, introduced 20 years later, the information quantity is based on the probability distribution of a random variable. The information achieved about the outcome of one variable by observing another can be viewed as the difference between the unconditioned and the conditioned probability for that outcome.

^1 Hartley did not specify the base of the logarithm, but using the binary base the information gets the unit bits. In this way it specifies the number of bits required to distinguish between the alternatives.

Definition 7 The information about the event {X = x} from the event {Y = y}, denoted I(X = x; Y = y), is

I(X = x; Y = y) = log_b ( P(X = x | Y = y) / P(X = x) )

where we assume that P(X = x) != 0 and P(Y = y) != 0.

If nothing else is stated, we will use the logarithmic base b = 2, which gives the unit bit. This unit was first used in Shannon's paper, but it was John W. Tukey who coined the expression.

Example 17 The outcome of a dice is reflected by the two random variables

X = Number of pips
Y = Odd or even number

The information achieved about the event X = 3 from the event Y = Odd is

I(X = 3; Y = Odd) = log ( P(X = 3 | Y = Odd) / P(X = 3) ) = log ( (1/3)/(1/6) ) = log 2 = 1 bit

In other words, by knowing that the number of pips is odd, which splits the set of outcomes in half, we gain one bit of information about the event that the number is 3.

From the following derivation we see that the information achieved from one event about another is a symmetric measure,

I(A; B) = log ( P(A|B)/P(A) ) = log ( P(A, B)/(P(A)P(B)) ) = log ( P(B|A)/P(B) ) = I(B; A)

Example 18 [cont'd] The information from the event X = 3 about the event Y = Odd is

I(Y = Odd; X = 3) = log ( P(Y = Odd | X = 3) / P(Y = Odd) ) = log ( 1/(1/2) ) = log 2 = 1 bit

The knowledge about X = 3 gives us full knowledge about the outcome of Y, which is a binary choice with two equally sized parts. To specify one of the two outcomes of Y, one bit is required.

The mutual information can be bounded by

-oo <= I(X = x; Y = y) <= - log P(Y = y)   (4.1)

To see this, use the symmetry and consider the variations of P(Y = y | X = x), which is a value between 0 and 1. The two end cases give

P(Y = y | X = x) = 0  =>  I(X = x; Y = y) = log ( 0 / P(Y = y) ) = -oo
P(Y = y | X = x) = 1  =>  I(X = x; Y = y) = log ( 1 / P(Y = y) ) = - log P(Y = y)

Notice that since 0 <= P(Y = y) <= 1, the value - log P(Y = y) is non-negative.
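The event informations in Examples 17 and 18 are easy to verify numerically. A minimal sketch, with the probabilities taken directly from the dice example:

```python
from math import log2

# Fair dice: X = number of pips, Y = parity of the outcome.
p_x3 = 1/6            # P(X = 3)
p_odd = 1/2           # P(Y = Odd)
p_x3_given_odd = 1/3  # P(X = 3 | Y = Odd)
p_odd_given_x3 = 1.0  # P(Y = Odd | X = 3)

# I(X = 3; Y = Odd) and I(Y = Odd; X = 3)
i_xy = log2(p_x3_given_odd / p_x3)
i_yx = log2(p_odd_given_x3 / p_odd)
print(i_xy, i_yx)  # both 1 bit, illustrating the symmetry I(A; B) = I(B; A)
```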
If I(X = x; Y = y) = 0 the events X = x and Y = y are statistically independent, since

I(X = x; Y = y) = 0  =>  P(X = x | Y = y) / P(X = x) = 1

i.e. P(X = x, Y = y) = P(X = x)P(Y = y). We are now ready to consider the self information, i.e. the information achieved about an event by observing the same event.

Definition 8 Let X = Y. Then the self information in the event X = x is defined as

I(X = x) = I(X = x; X = x) = log ( P(X = x | X = x) / P(X = x) ) = - log P(X = x)

Hence, - log P(X = x) is the amount of information needed to determine that the event X = x has occurred. The self information is always a non-negative quantity.

4.2 Entropy

The above quantities deal with the information in specific events. If we instead consider the amount of information required on average to determine the outcome of a random variable, we need the expected value of the self information. We get the following important definition.

Definition 9 The entropy, which is a measure of the uncertainty of a random variable, is derived as

H(X) = E_X[ - log p(x) ] = - sum_x p(x) log p(x)   (4.2)

In the derivations we use the convention that 0 log 0 = 0, which stems from the corresponding limit value. Here, and hereafter, if nothing else is stated the logarithm uses the binary base, i.e. log_2. To change the base of a logarithm we use that x = a^{log_a x} = b^{log_b x} = a^{log_a b log_b x}, where b = a^{log_a b}. This leads to

log_a x = log_a b log_b x   =>   log_b x = log_a x / log_a b

Especially, it is convenient to use

log_2 x = ln x / ln 2 = log_10 x / log_10 2

In e.g. Matlab there is a command log2(n) that derives the binary logarithm.

Since p is a probability, and therefore between 0 and 1, we can directly say that - log p is a non-negative quantity. That also means that the sum in (4.2) is non-negative,

H(X) >= 0   (4.3)

i.e. the uncertainty cannot be negative.

Example 19 Consider a coin that could be counterfeit. The outcome from one flip has the sample space Omega = {Head, Tail}.
Denote the probabilities for the outcomes by

P(Head) = p
P(Tail) = 1 - p

The uncertainty of the random variable X is

H(X) = -p log p - (1 - p) log(1 - p)

and is shown in Figure 4.1. As an example, if p = 0.1 we get the entropy H(X) ~ 0.469. That is, on average 0.469 bits of information are needed to determine the outcome of a flip.

The entropy function of a binary choice, as in the previous example, is very important and has its own notation.

Definition 10 The binary entropy function for the probability p is defined as

h(p) = -p log p - (1 - p) log(1 - p)

It follows directly that the binary entropy function is symmetric in p, i.e.

h(p) = h(1 - p)

This symmetry can also be seen in Figure 4.1. From the figure we also see that it has a maximum value at h(1/2) = 1. In the case of a coin flip, the maximum corresponds to a fair coin, where the uncertainty of the outcome is one bit. If the probability for Head increases, a natural initial guess is that the outcome will be Head, and the uncertainty decreases. Similarly, if the probability for Tail increases the uncertainty should decrease. At the end points, p = 0 and p = 1, the outcome is known and the uncertainty is zero, corresponding to h(0) = h(1) = 0.

Example 20 Let X be the outcome of a fair dice. Then P(X = x) = 1/6, x = 1, 2, ..., 6. The entropy is

H(X) = - sum_x (1/6) log(1/6) = log 6 = 1 + log 3 ~ 2.5850 bit

Figure 4.1: The binary entropy function h(p) = -p log p - (1 - p) log(1 - p).

The definition of the entropy extends to vector-valued random variables, such as (X, Y) with the joint probability function p(x, y).

Definition 11 The joint entropy is the entropy for a pair of random variables with the joint distribution p(x, y),

H(X, Y) = E_XY[ - log p(x, y) ] = - sum_x sum_y p(x, y) log p(x, y)

Clearly, in the general case with an N-dimensional vector the corresponding entropy function is

H(X_1, ..., X_N) = - sum_{x_1, ..., x_N} p(x_1, ..., x_N) log p(x_1, ..., x_N)

Example 21 Let X and Y be the outcomes of two independent fair dice. Then the joint probability is P(X, Y = x, y) = 1/36 and the joint entropy

H(X, Y) = - sum_{x,y} (1/36) log(1/36) = log 36 = 2 log 6 ~ 5.1699

We conclude that the uncertainty of the outcome of two dice is twice the uncertainty of one dice. Consider instead the sum of the dice, Z = X + Y. The probabilities are shown in the following table:

Z:     2     3     4     5     6     7     8     9    10    11    12
P(Z): 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

The entropy of Z is

H(Z) = H(1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36)
     = -2 (1/36) log(1/36) - 2 (2/36) log(2/36) - 2 (3/36) log(3/36)
       - 2 (4/36) log(4/36) - 2 (5/36) log(5/36) - (6/36) log(6/36)
     = ... = 23/18 + (5/3) log 3 - (5/18) log 5 ~ 3.2744

The uncertainty of the sum of the dice is less than that of the outcomes of the individual dice. This makes sense, since several outcomes of the pair (X, Y) give the same sum Z.

In (4.3) we saw that the entropy function is non-negative. To achieve an upper bound we first need an important result, sometimes named the IT-inequality.

Lemma 9 For every positive real number r,

log r <= (r - 1) log e

with equality if and only if r = 1.

Proof: Graphically we consider the two functions y_1 = r - 1 and y_2 = ln r, as shown in Figure 4.2. From this we conclude that ln r = r - 1 at the point r = 1. To formally show that in all other cases ln r < r - 1, we notice that the derivative of r - 1 is always 1, while the derivative of ln r is

d/dr ln r = 1/r  which is  > 1 for r < 1  and  < 1 for r > 1

In both cases it follows that ln r < r - 1 away from r = 1, and we conclude that ln r <= r - 1, with equality if and only if r = 1. Rewriting into the binary logarithm completes the proof.

Figure 4.2: Graphical interpretation of the IT-inequality, y_1 = r - 1 and y_2 = ln r.
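The entropies in Examples 19, 20 and 21 can be checked with a few lines of code. A minimal sketch (the helper function entropy is illustrative, not part of the text):

```python
from math import log2

def entropy(probs):
    """H = -sum p log2 p, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Example 19: binary entropy h(0.1) ~ 0.469
print(entropy([0.1, 0.9]))

# Example 20: fair dice, log2(6) ~ 2.5850
print(entropy([1/6] * 6))

# Example 21: sum of two dice, ~ 3.2744
pz = [n / 36 for n in [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]]
print(entropy(pz))
```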
From what we have seen in the previous examples, and some intuition, we would guess that the maximum value of the entropy occurs when the outcomes have equal probabilities. Assume a random variable X has L outcomes, {x_1, ..., x_L}, and that P(X = x_i) = 1/L for all of them. Then the entropy is

H(X) = - sum_i (1/L) log(1/L) = log L

To show that this actually is the maximum value over all distributions, we consider

H(X) - log L = - sum_x p(x) log p(x) - sum_x p(x) log L
             = sum_x p(x) log ( 1/(p(x)L) )
             <= sum_x p(x) ( 1/(p(x)L) - 1 ) log e
             = ( sum_x 1/L - sum_x p(x) ) log e = (1 - 1) log e = 0

where the inequality follows from the IT-inequality with r = 1/(p(x)L), implying that we have equality if and only if p(x)L = 1 for all x. In other words, we have shown that

H(X) <= log L   (4.4)

with equality if and only if p(x) = 1/L. Combining (4.3) and (4.4) we can state the following theorem.

Theorem 10 If X is a random variable with L outcomes, |X| = L, then

0 <= H(X) <= log L

with equality to the left if and only if there exists some i where p(x_i) = 1, and with equality to the right if and only if p(x_i) = 1/L for all i = 1, 2, ..., L.

We are now ready to define a conditional entropy. The base is the conditional probability, say p(x|y). Here we still have two random variables that need to be averaged over, and we get the following definition.

Definition 12 The conditional entropy of X conditioned on Y, with the joint probability p(x, y), is

H(X|Y) = E_XY[ - log p(x|y) ] = - sum_{x,y} p(x, y) log p(x|y)
Using the chain rule for probabilities, p(x, y) = p(x|y)p(y), the sum can be rewritten as

H(X|Y) = - sum_x sum_y p(x, y) log p(x|y) = - sum_x sum_y p(x|y)p(y) log p(x|y)
       = sum_y p(y) ( - sum_x p(x|y) log p(x|y) )

By introducing the entropy of X conditioned on the event Y = y,

H(X|Y = y) = - sum_x p(x|y) log p(x|y)

the conditional entropy can be derived as

H(X|Y) = sum_y H(X|Y = y) p(y)

Example 22 The joint distribution of the random variables X and Y is given by

p(x, y)   Y = 0   Y = 1
X = 0       0      1/8
X = 1      3/4     1/8

The marginal distributions of X and Y can be derived as p(x) = sum_y p(x, y) and p(y) = sum_x p(x, y), respectively. This yields

x:    0    1        y:    0    1
p(x): 1/8  7/8      p(y): 3/4  1/4

The individual entropies are

H(X) = h(1/8) ~ 0.5436
H(Y) = h(1/4) ~ 0.8113

and the joint entropy

H(X, Y) = H(0, 3/4, 1/8, 1/8) ~ 1.0613

To calculate the conditional entropy H(X|Y) we first consider the conditional probabilities, derived by p(x|y) = p(x, y)/p(y),

x:            0    1
P(X|Y = 0):   0    1
P(X|Y = 1):  1/2  1/2

Therefore,

H(X|Y = 0) = h(0) = 0
H(X|Y = 1) = h(1/2) = 1

Putting things together, we get the conditional entropy as

H(X|Y) = H(X|Y = 0)P(Y = 0) + H(X|Y = 1)P(Y = 1) = (3/4) h(0) + (1/4) h(1/2) = 1/4

We can use the chain rule for probabilities again to get a corresponding chain rule for entropies. The joint entropy can be written as

H(X, Y) = - sum_{x,y} p(x, y) log p(x, y) = - sum_{x,y} p(x, y) log p(x|y)p(y)
        = - sum_{x,y} p(x, y) log p(x|y) - sum_y p(y) log p(y)
        = H(X|Y) + H(Y)   (4.5)

Rewriting the result we get H(X|Y) = H(X, Y) - H(Y). That is, the conditional entropy is the remaining uncertainty of the pair (X, Y) when Y is known. A slightly more general version of (4.5) can be stated as the chain rule for entropies in the following theorem.

Theorem 11 Let X_1, ..., X_N be an N-dimensional random variable drawn according to p(x_1, ..., x_N). Then the chain rule for entropies states that

H(X_1, ..., X_N) = sum_{i=1}^{N} H(X_i | X_1, ..., X_{i-1})

Example 23 [Cont'd from Example 22] The joint entropy can alternatively be derived as

H(X, Y) = H(X|Y) + H(Y) = 1/4 + h(1/4) = 9/4 - (3/4) log 3 ~ 1.0613

4.3 Mutual Information

The entropy was obtained by averaging the self information of a random variable. Similarly, the average mutual information achieved about X when observing Y can be defined as follows.

Definition 13 The average mutual information between the random variables X and Y is defined as

I(X; Y) = E_XY[ log ( p(x|y)/p(x) ) ] = sum_{x,y} p(x, y) log ( p(x|y)/p(x) )   (4.6)

Utilizing that

p(x|y)/p(x) = p(x, y)/(p(x)p(y))

an alternative definition can be made from the ratio between the joint and the marginal probabilities,

I(X; Y) = E_XY[ log ( p(x, y)/(p(x)p(y)) ) ] = sum_{x,y} p(x, y) log ( p(x, y)/(p(x)p(y)) )   (4.7)

This is a measure of how strongly the two variables are connected. From (4.7) it is clear that the mutual information is a symmetric measure,

I(X; Y) = I(Y; X)

Breaking up the logarithm in (4.6) or (4.7), it is possible to derive the mutual information from the entropies as

I(X; Y) = E_XY[ log ( p(x, y)/(p(x)p(y)) ) ]
        = E_XY[ log p(x, y) ] - E_X[ log p(x) ] - E_Y[ log p(y) ]
        = H(X) + H(Y) - H(X, Y)
        = H(X) - H(X|Y)
        = H(Y) - H(Y|X)

where the chain rule was used in the last two equalities.

Example 24 [Cont'd from Example 22] The mutual information can be derived as

I(X; Y) = H(X) + H(Y) - H(X, Y) ~ 0.5436 + 0.8113 - 1.0613 = 0.2936

Alternatively, we can use

I(X; Y) = H(X) - H(X|Y) ~ 0.5436 - 1/4 = 0.2936

Both the entropy and the mutual information are very important measures of information. The entropy states how much information is needed to determine the outcome of a random variable. It will be shown later that this is a limit on how many bits, on average, are needed to describe the variable. In other words, it is a limit on how much a source can be compressed without any data being lost.
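The quantities in Examples 22-24 can be verified numerically from the joint distribution alone. A small sketch:

```python
from math import log2

# Joint distribution p(x, y) from Example 22.
p = {(0, 0): 0.0, (0, 1): 1/8,
     (1, 0): 3/4, (1, 1): 1/8}

def H(probs):
    """Entropy of a list of probabilities, with 0 log 0 = 0."""
    return -sum(q * log2(q) for q in probs if q > 0)

# Marginals p(x) and p(y).
px = {x: sum(v for (a, _), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (_, b), v in p.items() if b == y) for y in (0, 1)}

HX = H(px.values())      # h(1/8)  ~ 0.5436
HY = H(py.values())      # h(1/4)  ~ 0.8113
HXY = H(p.values())      # ~ 1.0613
HX_given_Y = HXY - HY    # chain rule: H(X|Y) = H(X,Y) - H(Y) = 1/4
IXY = HX + HY - HXY      # mutual information ~ 0.2936
print(HX, HY, HXY, HX_given_Y, IXY)
```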
The mutual information, on the other hand, describes the amount of information achieved about the variable X by observing the variable Y. In a communication system a symbol X is transmitted over a channel. The received symbol Y, which is a distorted version of X, is used by the receiver to estimate X. The mutual information is a measure of how much information can be transmitted over this channel, and it will lead to the concept of channel capacity.

To get more knowledge about these quantities we introduce the relative entropy. It was first introduced by Kullback and Leibler [12].

Definition 14 Given two probability distributions p(x) and q(x) over the same sample set X, the relative entropy, or Kullback-Leibler distance, is defined as

D(p||q) = E_p[ log ( p(x)/q(x) ) ] = sum_x p(x) log ( p(x)/q(x) )

Example 25 Consider a binary random variable, X in {0, 1}, for which we set up two distributions. First we assume that the values are equally probable,

p(0) = p(1) = 1/2

and, secondly, we assume a skewed distribution,

q(0) = 1/4 and q(1) = 3/4

The relative entropy from p to q is then

D(p||q) = (1/2) log ( (1/2)/(1/4) ) + (1/2) log ( (1/2)/(3/4) ) = ... = 1 - (1/2) log 3 ~ 0.2075

On the other hand, the relative entropy from q to p is

D(q||p) = (1/4) log ( (1/4)/(1/2) ) + (3/4) log ( (3/4)/(1/2) ) = ... = (3/4) log 3 - 1 ~ 0.1887

That is, the relative entropy is not a symmetric measure, and cannot be treated as a distance in the normal sense. However, when talking about optimal source coding, we will see that it is natural to view the relative entropy as a distance from one distribution to another. This is the reason for the alternative name Kullback-Leibler distance. The relative entropy was introduced as a generalized information measure.
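Example 25, including the asymmetry of the relative entropy, can be checked numerically. A minimal sketch:

```python
from math import log2

def D(p, q):
    """Relative entropy D(p||q) = sum p log2(p/q), with 0 log 0 = 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/2]  # uniform distribution
q = [1/4, 3/4]  # skewed distribution

print(D(p, q))  # 1 - (1/2) log2 3 ~ 0.2075
print(D(q, p))  # (3/4) log2 3 - 1 ~ 0.1887, showing D(p||q) != D(q||p)
```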
We can see that the mutual information, as defined earlier, can be expressed as a special case of the relative entropy,

I(X; Y) = E[ log ( p(x, y)/(p(x)p(y)) ) ] = D( p(x, y) || p(x)p(y) )

The mutual information is the information distance from the joint distribution to the independent case, i.e. the information distance corresponding to the relation between X and Y.

Another aspect of the relative entropy is its relationship with the entropy function. Consider a random variable with the possible outcomes in the set X, with cardinality |X| = L, and probability distribution p(x), x in X. Let u(x) = 1/L be the uniform distribution for the same set of outcomes. Then,

H(X) = - sum_x p(x) log p(x) = log L - sum_x p(x) log ( p(x) L )
     = log L - sum_x p(x) log ( p(x)/u(x) ) = log L - D(p||u)   (4.8)

where we see that the relative entropy from p(x) to u(x) is the difference between the maximum value of the entropy and the entropy based on the true distribution. Since the maximum value is achieved by the uniform distribution, we see that the relative entropy is some sort of measure of how much p(x) diverges from the uniform distribution.

The relative entropy can be shown to take only non-negative values. Here we use the IT-inequality to show this fact:

-D(p||q) = - sum_x p(x) log ( p(x)/q(x) ) = sum_x p(x) log ( q(x)/p(x) )
         <= sum_x p(x) ( q(x)/p(x) - 1 ) log e
         = ( sum_x q(x) - sum_x p(x) ) log e = (1 - 1) log e = 0

with equality when q(x)/p(x) = 1, i.e. when p(x) = q(x), for all x. Alternatively, Jensen's inequality can be used to show the same result. We express the result in a theorem.

Theorem 12 Given two probability distributions p(x) and q(x) over the same sample set X, the relative entropy is non-negative,

D(p||q) >= 0

with equality if and only if p(x) = q(x) for all x.

Since the mutual information can be expressed as a relative entropy, we directly get the corollary below.

Corollary 13 For any two random variables X and Y,

I(X; Y) >= 0

with equality if and only if they are independent.
Since we can write the mutual information as I(X; Y) = H(X) - H(X|Y), and this is non-negative, we conclude the following theorem.

Theorem 14 For any two random variables X and Y,

H(X|Y) <= H(X)

with equality if and only if they are independent.

Intuitively, this means that the uncertainty will, on average, not increase by observing some side information. If the two variables are independent, we have the same uncertainty as before. Using this result together with the chain rule for entropy, Theorem 11, we get the next result.

Theorem 15 Let X_1, X_2, ..., X_n be an n-dimensional random variable drawn according to p(x_1, x_2, ..., x_n). Then

H(X_1, X_2, ..., X_n) <= sum_{i=1}^{n} H(X_i)

with equality if and only if all X_i are independent.

That is, the uncertainty is minimized when considering a random vector as a whole, instead of as individual variables. In other words, we should take the relationships between the variables into account when minimizing the uncertainty.

4.3.1 Convexity of information measures

In Definition 4 the terminology of convex functions was defined. In areas like optimization this is an important property, since it makes it much easier to find a global optimum. Here we will show that the information measures we have defined are convex (or concave) functions. We begin with the relative entropy, which gives the base for the other functions.

Our previous definition of a convex function covers only one dimensional functions. Therefore, we need to start by generalizing the definition. A straightforward way is to say that a multidimensional function is convex if it is convex in all dimensions. This means that for a two dimensional function we get a surface that resembles a bowl.
In equation form we write the condition for convexity of a two dimensional function g(x, y) as

g( λ(x_1, y_1) + (1 - λ)(x_2, y_2) ) = g( λx_1 + (1 - λ)x_2, λy_1 + (1 - λ)y_2 )
                                     <= λ g(x_1, y_1) + (1 - λ) g(x_2, y_2)

The relative entropy can now be written as

D( λp_1 + (1 - λ)p_2 || λq_1 + (1 - λ)q_2 )
  = sum_x ( λp_1(x) + (1 - λ)p_2(x) ) log ( (λp_1(x) + (1 - λ)p_2(x)) / (λq_1(x) + (1 - λ)q_2(x)) )
  <= sum_x λp_1(x) log ( λp_1(x) / (λq_1(x)) ) + sum_x (1 - λ)p_2(x) log ( (1 - λ)p_2(x) / ((1 - λ)q_2(x)) )
  = λ sum_x p_1(x) log ( p_1(x)/q_1(x) ) + (1 - λ) sum_x p_2(x) log ( p_2(x)/q_2(x) )
  = λ D(p_1 || q_1) + (1 - λ) D(p_2 || q_2)

where the inequality is a direct application of the log-sum inequality in Theorem 7. Hence, we see that the relative entropy is a convex function in the pair (p, q).

From (4.8) we know that the entropy can be expressed as H_p(X) = log L - D(p||u), where u is the uniform distribution and p the distribution used for calculating the entropy. Then, since u = λu + (1 - λ)u,

H_{λp_1 + (1 - λ)p_2}(X) = log L - D( λp_1 + (1 - λ)p_2 || u )
  >= log L - λ D(p_1 || u) - (1 - λ) D(p_2 || u)
  = λ ( log L - D(p_1 || u) ) + (1 - λ) ( log L - D(p_2 || u) )
  = λ H_{p_1}(X) + (1 - λ) H_{p_2}(X)

where the inequality follows from the convexity of the relative entropy. We see here that the entropy is a concave function in p.

4.4 Entropy of sequences

When considering random processes and information theoretic measures, it is natural to start with a well known lemma based on a three state Markov chain. It states that information cannot increase by data processing, neither pre- nor post-processing. Information can only be transformed to another representation (or be destroyed). In information theoretic terms it can be stated as the next lemma.

Lemma 16 (Data Processing Lemma) If the random variables X, Y and Z form a Markov chain, X -> Y -> Z, we have

I(X; Z) <= I(X; Y)
I(X; Z) <= I(Y; Z)

For the considered Markov chain we have that, conditioned on Y, the variables X and Z are independent, i.e.

P(XZ|Y) = P(X|Y) P(Z|XY) = P(X|Y) P(Z|Y)

where the second equality comes from the Markov condition.
Starting with the first inequality of the lemma,

    I(X; Z) = H(X) − H(X|Z) ≤ H(X) − H(X|YZ) = H(X) − H(X|Y) = I(X; Y)

Similarly, the second inequality comes from

    I(X; Z) = H(Z) − H(Z|X) ≤ H(Z) − H(Z|XY) = H(Z) − H(Z|Y) = I(Y; Z)

which concludes the lemma.

The entropy function is an important measure of the amount of information stored in a random variable. When we consider a random process, i.e. a sequence of random variables with some sort of correlation between them, we need a more general definition. In this section we will introduce two natural generalizations of the entropy function, and show that they are in fact equivalent. This function can in many cases be used in the same way for a random process as the entropy is used for a random variable.

A natural way to define the entropy per symbol for a sequence is by treating the sequence as a multi-dimensional random variable and averaging over the number of symbols. If the length of the sequence tends to infinity we get the following definition.

Definition 15 The entropy rate of a stochastic process is

    H_∞(X) = lim_{n→∞} (1/n) H(X_1 X_2 ... X_n)

To see that this is indeed a generalization of the entropy measure we have from before, we consider a sequence of i.i.d. variables in the next example.

Example 26 Consider a sequence of i.i.d. random variables with entropy H(X). Then the entropy rate becomes the entropy function as defined earlier,

    H_∞(X) = lim_{n→∞} (1/n) H(X_1 ... X_n) = lim_{n→∞} (1/n) Σ_i H(X_i | X_1 ... X_{i−1})
           = lim_{n→∞} (1/n) Σ_i H(X_i) = lim_{n→∞} H(X) (1/n) Σ_i 1 = H(X)

We can also define an alternative entropy rate for a stochastic process, where we consider the entropy of the nth variable in the sequence, conditioned on all the previous ones. As n → ∞ we get

    H^∞(X) = lim_{n→∞} H(X_n | X_1 X_2 ... X_{n−1})

To see how this variant relates to Definition 15 we use the chain rule,

    (1/n) H(X_1 ... X_n) = (1/n) Σ_i H(X_i | X_1 ... X_{i−1})

The right hand side is the arithmetic mean of the terms H(X_i | X_1 ... X_{i−1}). Since these terms converge to H^∞(X) as i grows, their arithmetic mean converges to the same limit. Hence, asymptotically as the length of the sequence grows to infinity, the two definitions of the entropy rate are equal,

    H_∞(X) = H^∞(X)

For a stationary (time-invariant) random process it can be seen that

    H(X_n | X_1 ... X_{n−1}) ≤ H(X_n | X_2 ... X_{n−1}) = H(X_{n−1} | X_1 ... X_{n−2})

where the last equality follows from the stationarity. Hence H(X_n | X_1 ... X_{n−1}) is a decreasing function of n. Decreasing n step by step gives the chain

    H(X_n | X_1 ... X_{n−1}) ≤ · · · ≤ H(X_2 | X_1) ≤ H(X_1) = H(X) ≤ log |X|

Finally, since the entropy is a non-negative function we can state the following theorem.

Theorem 17 For a stationary stochastic process the entropy rate is bounded by

    0 ≤ H_∞(X) ≤ H(X) ≤ log |X|

In Figure 4.3 the relation between log |X|, H(X), H(X_n | X_1 ... X_{n−1}) and H_∞(X) is shown.

Figure 4.3: The relation between H_∞(X) and H(X).

So far the entropy rate has been treated for the class of stationary random processes. For the subclass of Markov chains it is possible to say more about how to derive it. For the conditional entropy, the Markov condition gives H(X_n | X_1 ... X_{n−1}) = H(X_n | X_{n−1}). Then the entropy rate can be derived as

    H_∞(X) = lim_{n→∞} H(X_n | X_1 ... X_{n−1}) = lim_{n→∞} H(X_n | X_{n−1}) = H(X_2 | X_1)
           = −Σ_{i,j} P(X_1 = x_i, X_2 = x_j) log P(X_2 = x_j | X_1 = x_i)
           = −Σ_i P(X_1 = x_i) Σ_j P(X_2 = x_j | X_1 = x_i) log P(X_2 = x_j | X_1 = x_i)
           = Σ_i P(X_1 = x_i) H(X_2 | X_1 = x_i)                                  (4.9)

where

    H(X_2 | X_1 = x_i) = −Σ_j P(X_2 = x_j | X_1 = x_i) log P(X_2 = x_j | X_1 = x_i)

In (4.9) the transition probabilities are given by the state transition matrix

    P = [p_ij] = [P(X_2 = x_j | X_1 = x_i)],  x_i, x_j ∈ X

and the stationary distribution by w_i = P(X_1 = x_i). In the terminology of the Markov process we get the following theorem.
Theorem 18 For a stationary Markov chain with stationary distribution w and transition matrix P = [p_ij], the entropy rate can be derived as

    H_∞(X) = Σ_i w_i H(X_2 | X_1 = x_i)

where

    H(X_2 | X_1 = x_i) = −Σ_j p_ij log p_ij

Example 27 The Markov chain shown in Figure 3.4 has the state transition matrix

    P = [ 1/3  2/3   0
          1/4   0   3/4
          1/2  1/2   0  ]

In Example 14 the steady state distribution was calculated as

    w = ( 15/43  16/43  12/43 )

The conditional entropies are the entropies of the rows of P,

    H(X_2 | X_1 = s_1) = h(1/3) = log 3 − 2/3
    H(X_2 | X_1 = s_2) = h(1/4) = 2 − (3/4) log 3
    H(X_2 | X_1 = s_3) = h(1/2) = 1

and the entropy rate becomes

    H_∞(X) = w_1 H(X_2 | X_1 = s_1) + w_2 H(X_2 | X_1 = s_2) + w_3 H(X_2 | X_1 = s_3)
           = (15/43) h(1/3) + (16/43) h(1/4) + (12/43) h(1/2)
           = (3/43) log 3 + 34/43 ≈ 0.9013 bit/symbol

4.5 Random Walk

To be done.

Consider an undirected weighted graph where W_ij is the weight of the edge joining nodes x_i and x_j. Let

    W_i = Σ_j W_ij,   W = Σ_{i,j: i<j} W_ij

denote the sum of the weights of the edges leaving node x_i, and the total weight, respectively. The probability of leaving node x_i along the edge to x_j is p_ij = W_ij / W_i.

• The stationary distribution is μ_i = W_i / 2W.
• The entropy rate is

    H_∞(X) = H(..., W_ij/2W, ...) − H(..., W_i/2W, ...)

• If all existing edges have W_ij = 1, this simplifies to

    H_∞(X) = log(2W) − H(..., W_i/2W, ...)

Example 28 Consider a weighted graph with four nodes, where the edge weights are W_12 = 1, W_13 = 2, W_14 = 1, W_23 = 1 and W_34 = 1. The conditional probabilities are p_ij = W_ij / Σ_j W_ij, which gives the state transition matrix

    P = [  0   1/4  2/4  1/4
          1/2   0   1/2   0
          2/4  1/4   0   1/4
          1/2   0   1/2   0  ]

With W = 6 and W_1 = 4, W_2 = 2, W_3 = 4, W_4 = 2 we get the stationary distribution

    μ = ( 4/12  2/12  4/12  2/12 ) = ( 1/3  1/6  1/3  1/6 )

and the entropy rate

    H_∞(X) = H(1/12, 2/12, 1/12, 1/12, 1/12, 2/12, 1/12, 1/12, 1/12, 1/12)
           − H(4/12, 2/12, 4/12, 2/12)
           ≈ 3.25 − 1.92 = 1.33

Chapter 5

Source Coding Theorem

Text...
5.1 Asymptotic Equipartition Property

The Asymptotic Equipartition Property (AEP) is a very useful tool in information theory. The basic idea is that a series of i.i.d. events is viewed as a vector of events. The probability of each individual vector becomes very small as the length grows, so it is pointless to consider the probabilities of the individual vectors. Instead it is the distribution of events within the vector that is important. From the law of large numbers we can see that only a fraction of the possible outcomes are the probable ones.

To see this, consider 100 consecutive tosses of a fair coin. The probability of each of the length-100 vectors is

    P(x_1, ..., x_100) = (1/2)^100 ≈ 8 · 10^−31

So the probability of each individual vector in the outcome is very small, and it is not very likely that, for instance, the vector with 100 Heads will occur. However, since there are (100 choose 50) ≈ 10^29 vectors with 50 Heads and 50 Tails, the probability of getting a vector with equal numbers of Heads and Tails is about

    P(50 Heads, 50 Tails) = (100 choose 50) 2^−100 ≈ 0.080

which is relatively high. To conclude, it is most likely that the outcome of 100 tosses with a fair coin will result in approximately the same number of Heads and Tails. This is in fact a consequence of the weak law of large numbers, which we recapitulate here.

Theorem 19 (The weak law of large numbers) Let X_1, X_2, ..., X_n be i.i.d. random variables with mean E[X]. Then,

    (1/n) Σ_i X_i →p E[X]

where →p denotes convergence in probability.

The statement that the mean converges in probability can also be expressed as

    lim_{n→∞} P( |(1/n) Σ_i X_i − E[X]| < ε ) = 1

for any ε > 0. This means that the arithmetic mean of several i.i.d. random variables approaches the expected value of the variables. For an example of the weak law of large numbers, see Example 3.

Consider instead the logarithmic probability for a vector of length n, consisting of i.i.d. random variables.
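As an aside, the two numbers in the coin-toss illustration above are quickly verified with a few lines (math.comb requires Python 3.8 or later):

```python
import math

n = 100
p_single = 0.5 ** n                        # probability of one specific outcome
p_balanced = math.comb(n, 50) * 0.5 ** n   # probability of exactly 50 Heads

print(f"{p_single:.1e}")     # 7.9e-31, i.e. about 8 * 10^-31
print(f"{p_balanced:.3f}")   # 0.080
```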
By the weak law of large numbers we can conclude that

    −(1/n) log p(x) = −(1/n) Σ_i log p(x_i) →p H(X)

or, equivalently, that

    lim_{n→∞} P( |−(1/n) log p(x) − H(X)| < ε ) = 1

for any ε > 0. That is, the mean logarithmic probability of the vectors that occur approaches the entropy of the random variable. For a finite vector length n we can have a closer look at the vectors that fulfill this criterion. They are called typical sequences and are defined in the next definition.

Definition 16 (AEP) The set of ε-typical sequences A_ε(X) is the set of all n-dimensional vectors x = (x_1, x_2, ..., x_n) ∈ X^n such that

    |−(1/n) log p(x) − H(X)| ≤ ε                                   (5.1)

It is possible to rewrite (5.1) as follows:

    −ε ≤ −(1/n) log p(x) − H(X) ≤ ε
    H(X) − ε ≤ −(1/n) log p(x) ≤ H(X) + ε
    −n(H(X) + ε) ≤ log p(x) ≤ −n(H(X) − ε)
    2^{−n(H(X)+ε)} ≤ p(x) ≤ 2^{−n(H(X)−ε)}

This is the base for an alternative definition of the AEP.

Definition 17 (AEP, Alternative definition) The ε-typical sequences can be defined as the set of vectors x such that

    2^{−n(H(X)+ε)} ≤ p(x) ≤ 2^{−n(H(X)−ε)}                         (5.2)

Both definitions of the AEP are frequently used in the literature. It differs which one is used as the main definition, but quite often both are presented. In the next example we try to show what is meant by ε-typical sequences.

Example 29 Consider a binary 5-dimensional vector with i.i.d. elements where p_X(0) = 1/3 and p_X(1) = 2/3. Then the entropy per symbol is H(X) = h(1/3) = 0.918. In the following table all vectors are listed together with their respective probability, and the ε-typical sequences (see below) are marked with a ⋆.

    x       p(x)         x       p(x)         x       p(x)
    00000   0.0041       01011   0.0329 ⋆     10110   0.0329 ⋆
    00001   0.0082       01100   0.0165       10111   0.0658 ⋆
    00010   0.0082       01101   0.0329 ⋆     11000   0.0165
    00011   0.0165       01110   0.0329 ⋆     11001   0.0329 ⋆
    00100   0.0082       01111   0.0658 ⋆     11010   0.0329 ⋆
    00101   0.0165       10000   0.0082       11011   0.0658 ⋆
    00110   0.0165       10001   0.0165       11100   0.0329 ⋆
    00111   0.0329 ⋆     10010   0.0165       11101   0.0658 ⋆
    01000   0.0082       10011   0.0329 ⋆     11110   0.0658 ⋆
    01001   0.0165       10100   0.0165       11111   0.1317
    01010   0.0165       10101   0.0329 ⋆

As expected, the all-zero vector is the least probable, while the all-one vector is the most probable. However, even the most probable vector is not very likely to occur, with probability 0.1317. Still, if we were to pick one single vector as a guess for the outcome, this is the one. But since the elements are i.i.d. it can be argued that the order of the elements is unimportant. If we instead were to guess on the type of vector, meaning the number of ones and zeros, the answer would be different. The probability of a vector containing k ones and 5 − k zeros is

    P(k 1s, 5 − k 0s) = (5 choose k) (2/3)^k (1/3)^{5−k} = (5 choose k) 2^k / 3^5

which we saw already in Chapter 3. Viewing these numbers in a table we get (in Chapter 3 this is also plotted in a diagram):

    k              0       1       2       3       4       5
    P(#1 in x=k)   0.0041  0.0412  0.1646  0.3292  0.3292  0.1317

Here it is clear that the most probable vector, the all-one vector, does not belong to the most probable type of vector. When guessing the number of ones, it is more likely to get 3 or 4 ones. This is of course because there are more vectors fulfilling this criterion than the single all-one vector. So, this suggests that vectors with 3 or 4 ones are in some sense "typical".

The question is then how this relates to the previous definitions of typical sequences. To see this, choose an ε. Here we use 15% of the entropy, which is ε = 0.138,

    2^{−n(H(X)+ε)} = 2^{−5(h(1/3)+0.138)} ≈ 0.0257
    2^{−n(H(X)−ε)} = 2^{−5(h(1/3)−0.138)} ≈ 0.0669

Then the ε-typical sequences are the ones with probability between these two numbers. In the table above these are marked with a ⋆, and they are exactly the same vectors as we concluded should be the typical ones.
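The bookkeeping in Example 29 can be reproduced with a short script that enumerates all 32 vectors and applies the alternative definition (5.2) directly:

```python
import math
from itertools import product

p0, p1, n = 1/3, 2/3, 5
H = -(p0 * math.log2(p0) + p1 * math.log2(p1))   # h(1/3), about 0.918
eps = 0.15 * H                                    # about 0.138, as in Example 29

lo = 2 ** (-n * (H + eps))                        # about 0.0257
hi = 2 ** (-n * (H - eps))                        # about 0.0669

typical = []
for x in product((0, 1), repeat=n):
    k = sum(x)                                    # number of ones in the vector
    p = p0 ** (n - k) * p1 ** k
    if lo <= p <= hi:
        typical.append(x)

print(len(typical))                               # 15 starred vectors
print(sorted(set(sum(x) for x in typical)))       # [3, 4]: three or four ones
```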
In the previous example we saw that there is something called typical sequences, and that they are the most likely sequences to occur, judged from their contents. In the example we used a very short vector to be able to list all sequences, but for longer sequences it can be seen that the ε-typical sequences are just a fraction of all sequences. On the other hand, it can also be seen that the probability for a random sequence to belong to the typical set is close to one. More formally, we can state the following theorem.

Theorem 20 Consider sequences of length n of i.i.d. random variables. For each ε there exists an integer n_0 such that, for each n > n_0, the set of ε-typical sequences, A_ε^{(n)}(X), fulfills

    P(x ∈ A_ε^{(n)}(X)) ≥ 1 − ε                                    (5.3)
    (1 − ε) 2^{n(H(X)−ε)} ≤ |A_ε^{(n)}(X)| ≤ 2^{n(H(X)+ε)}         (5.4)

The first part of the theorem, (5.3), is a direct consequence of the law of large numbers, stating that −(1/n) log p(x) approaches H(X) as n grows. That means there exists an n_0 such that for all n ≥ n_0

    P( |−(1/n) log p(x) − H(X)| < ε ) ≥ 1 − δ

for any δ between zero and one. Letting δ = ε gives

    P( |−(1/n) log p(x) − H(X)| < ε ) ≥ 1 − ε

which is equivalent to (5.3). This shows that the probability for an arbitrary sequence to belong to the typical set approaches one as n grows.

To show the second part, that the number of ε-typical sequences is bounded as in (5.4), we need to split the statement in two parts. Starting with the left hand inequality we have, for large enough n,

    1 − ε ≤ P(x ∈ A_ε^{(n)}(X)) = Σ_{x ∈ A_ε^{(n)}(X)} p(x)
          ≤ Σ_{x ∈ A_ε^{(n)}(X)} 2^{−n(H(X)−ε)} = |A_ε^{(n)}(X)| 2^{−n(H(X)−ε)}

where the second inequality uses the alternative definition of the AEP. The right hand inequality follows from

    1 = Σ_{x ∈ X^n} p(x) ≥ Σ_{x ∈ A_ε^{(n)}(X)} p(x)
      ≥ Σ_{x ∈ A_ε^{(n)}(X)} 2^{−n(H(X)+ε)} = |A_ε^{(n)}(X)| 2^{−n(H(X)+ε)}

which completes the theorem. To see what the theorem means we consider longer sequences than in the previous example.
Example 30 Let X^n = {x} be the set of length-n vectors of i.i.d. binary random variables with p(0) = 1/3 and p(1) = 2/3. Let ε be 5% of the entropy H(X) = h(1/3), i.e. ε = 0.046. The number of ε-typical sequences and their bounding functions are given in the next table for n = 100, n = 500 and n = 1000. As a comparison, the fraction of ε-typical sequences compared to the total number of sequences is also shown.

    n      (1 − ε)2^{n(H(X)−ε)}   |A_ε^{(n)}(X)|    2^{n(H(X)+ε)}    |A_ε^{(n)}(X)|/|X^n|
    100    1.17 · 10^26           7.51 · 10^27      1.05 · 10^29     5.9 · 10^−3
    500    1.90 · 10^131          9.10 · 10^142     1.34 · 10^145    2.78 · 10^−8
    1000   4.16 · 10^262          1.00 · 10^287     1.79 · 10^290    9.38 · 10^−15

This table shows that the ε-typical sequences constitute only a fraction of the total number of sequences. Next, the probability of the set of ε-typical sequences is given together with the probability of the most probable sequence, the all-one sequence. Here it can clearly be seen that the most probable single sequence has a very low probability, and is in fact very unlikely to occur. Instead, the most likely event is that a random sequence belongs to the typical set.

    n      P(A_ε^{(n)}(X))   P(x = 11...1)
    100    0.660             2.4597 · 10^−18
    500    0.971             9.0027 · 10^−89
    1000   0.998             8.1048 · 10^−177

5.2 Source Coding Theorem

In Figure 5.1 a block diagram of a communication system is shown. One of the results from information theory is that the sequence generated by a source often contains redundancy. The source encoder is intended to remove this redundancy and use a minimum size representation of the information. The problem is that when the redundancy is removed the information becomes very vulnerable to disturbances in the transmission. To circumvent this, new redundant data is added in a controlled way by the channel encoder. If errors occur during transmission, the channel decoder will be able to detect and/or correct most of them. Finally, the source decoder decompresses the received sequence back to the original.
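The lossless round trip through the source encoder and decoder can be demonstrated with any standard compressor; below, zlib from the Python standard library serves as a stand-in for the two blocks (the input string is arbitrary, chosen repetitive so the redundancy is obvious):

```python
import zlib

# A redundant source sequence.
text = b"all work and no play makes jack a dull boy. " * 20

compressed = zlib.compress(text)        # source encoder: remove redundancy
restored = zlib.decompress(compressed)  # source decoder: reconstruct

print(restored == text)                 # True: lossless, perfectly reconstructed
print(len(compressed) < len(text))      # True: fewer symbols after encoding
```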
If the channel decoder manages to correct all errors that occurred on the channel, i.e. Ŷ = Y, the source decoder should implement the inverse of the compression function. Shannon showed that the source coding and the channel coding can be performed independently of each other. Therefore, in this section only the source encoding and decoding will be treated.

Figure 5.1: Shannon's model for a communication system.

One of the questions that information theory tries to answer is how much information is stored in a set of data. A typical example is text, which contains redundancy that makes it possible for us to correct spelling errors or misprints. In written English it is often possible to cross out every third letter and still be able to read the text. The possible compression of a sequence shows how much information it contains. Earlier, we considered the entropy as a measure of the uncertainty, which is the amount of information needed to determine the value. We will see in this section that the entropy is in fact a measure of the possible compression of a source.

In Figure 5.2 the model used for source encoding and decoding is shown. The sequence from the source is fed to the source encoder, where the redundancy is removed. The source decoder is the inverse function and reconstructs the source data. In the figure, X is an n-dimensional vector from the alphabet X. The length n of the vector is considered to be a random variable with average length E[n]. The corresponding compressed sequence, Y, is an ℓ-dimensional vector from the alphabet Y. Also ℓ is a random variable, so the average length of the codeword is E[ℓ]. The decoder estimates the source vector as X̂. When working with lossless source coding, as in this section, it is required that the original message is perfectly reconstructed, X̂ = X. This is the case for compression algorithms working on e.g. texts, and also some image processing.
However, in many cases of media coding, such as image, video and speech coding, lossy algorithms are used. The compression is much better for those, but the price paid is that the original message cannot be reconstructed exactly, i.e. X̂ ≈ X.

The compression rate is defined as

    R = E[ℓ] / E[n]

If the alphabet sizes are equal, this means there is compression if R is less than one. The reason to view n and ℓ as random variables is that, to achieve compression, the length of the source vector, the code vector, or both, must vary. If both lengths were fixed for all messages there would not be any compression. In the next definition the above reasoning is formalized.

Figure 5.2: Block model for a source coding system.

Definition 18 (Not too general) A source code is a mapping from a finite source vector x ∈ X^n to a finite vector of variable length ℓ, y = (y_1 y_2 ... y_ℓ), where the symbols y_i ∈ D = Z_D are drawn from a D-ary alphabet. The length of the codeword corresponding to x is denoted ℓ(x).

Naturally the average codeword length is a good measure of the efficiency of a chosen code. It is defined as

    L = E[ℓ(x)] = Σ_{x ∈ X^n} p(x) ℓ(x)

To derive the source coding theorem, stating that there exists a source code where the average codeword length approaches the entropy of the source, we first need to set up an encoding rule. We saw in the previous section that the typical sequences are only a small fraction of all sequences, but they are the ones most likely to occur. Assuming that the non-typical sequences almost never happen, we can concentrate on the typical ones. Starting with a list of all sequences, it can be partitioned in two parts, one with the typical and one with the non-typical sequences. To construct the codewords we use a binary prefix stating which set the sequence belongs to; let us use 0 for the typical and 1 for the non-typical sequences. Each of the sets is then listed and indexed by binary vectors.
For the set of typical sequences the index vector will be of length ⌈log |A_ε^{(n)}(X)|⌉. The non-typical sequences occur only rarely, so for them we can afford to use ⌈log |X|^n⌉ bits. The following two tables show the idea of the look-up tables.

    Typical, prefix 0          Non-typical, prefix 1
    x      index vector        x      index vector
    x_0    0...00              x_a    0......00
    x_1    0...01              x_b    0......01
    ...    ...                 ...    ...

The same table look-up can be used by the decoding algorithm. The number of typical sequences is

    |A_ε^{(n)}(X)| ≤ 2^{n(H(X)+ε)}

and the corresponding codeword length is

    ℓ(x) = ⌈log |A_ε^{(n)}(X)|⌉ + 1 ≤ log |A_ε^{(n)}(X)| + 2 ≤ n(H(X) + ε) + 2

Similarly, we know there are at most |X|^n non-typical sequences, with codeword length

    ℓ(x) = ⌈log |X|^n⌉ + 1 ≤ log |X|^n + 2 ≤ n log |X| + 2

In Figure 5.3 the procedure is shown graphically.

Figure 5.3: Principle of Shannon's source coding algorithm.

Next, we derive a bound on the average length of the codewords:

    L = E[ℓ(x)] = Σ_{x ∈ X^n} p(x) ℓ(x)
      = Σ_{x ∈ A_ε^{(n)}(X)} p(x) ℓ(x) + Σ_{x ∉ A_ε^{(n)}(X)} p(x) ℓ(x)
      ≤ Σ_{x ∈ A_ε^{(n)}(X)} p(x) (n(H(X) + ε) + 2) + Σ_{x ∉ A_ε^{(n)}(X)} p(x) (n log |X| + 2)
      = P(x ∈ A_ε^{(n)}(X)) n(H(X) + ε) + P(x ∉ A_ε^{(n)}(X)) n log |X| + 2
      ≤ n(H(X) + ε) + εn log |X| + 2
      = n (H(X) + ε + ε log |X| + 2/n)
      = n (H(X) + ε′)

where the last inequality uses P(x ∉ A_ε^{(n)}(X)) ≤ ε from Theorem 20, and ε′ = ε + ε log |X| + 2/n can be made arbitrarily small for sufficiently large n. Summarizing, we state the source coding theorem.

Theorem 21 Let x ∈ X^n be length-n vectors of i.i.d. random variables with probability function p(x). Then there exists a code which maps sequences x of length n into binary sequences, such that the mapping is invertible and

    (1/n) E[ℓ(x)] ≤ H(X) + ε′

for sufficiently large n, where ε′ can be made arbitrarily small.

For random processes we do not have i.i.d. variables. However, it is possible to generalize the theorem to this case, using the entropy rate.
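The encoding rule behind Theorem 21 can be sketched at toy scale. The code below reuses the source of Example 29 (n = 5, p(0) = 1/3, ε = 0.15·h(1/3)) and builds the two look-up tables with their one-bit prefix; at such a tiny block length there is of course no actual compression, only a demonstration of invertibility:

```python
import math
from itertools import product

p0, p1, n = 1/3, 2/3, 5
H = -(p0 * math.log2(p0) + p1 * math.log2(p1))   # h(1/3)
eps = 0.15 * H

def is_typical(x):
    k = sum(x)
    p = p0 ** (n - k) * p1 ** k
    return 2 ** (-n * (H + eps)) <= p <= 2 ** (-n * (H - eps))

all_seqs = list(product((0, 1), repeat=n))
typical = [x for x in all_seqs if is_typical(x)]
atypical = [x for x in all_seqs if not is_typical(x)]

bits_typ = math.ceil(math.log2(len(typical)))    # index length, typical table
bits_atyp = n                                    # n bits suffice for the rest

def encode(x):
    """Prefix 0 + index for typical x, prefix 1 + index otherwise."""
    if x in typical:
        return "0" + format(typical.index(x), f"0{bits_typ}b")
    return "1" + format(atypical.index(x), f"0{bits_atyp}b")

def decode(bits):
    table = typical if bits[0] == "0" else atypical
    return table[int(bits[1:], 2)]

# The mapping is invertible for every source vector.
print(all(decode(encode(x)) == x for x in all_seqs))   # True
print(len(encode((1, 0, 1, 1, 0))))                    # 5 bits for a typical x
```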
For a formal proof we refer to e.g. [2].

Theorem 22 Let X_n be a stationary ergodic process. Then there exists a code which maps sequences x of length n into binary sequences, such that the mapping is invertible and

    (1/n) E[ℓ(x)] ≤ H_∞(X) + ε′

for sufficiently large n, where ε′ can be made arbitrarily small.

5.3 Kraft Inequality

To be done.

5.4 Shannon-Fano Coding

To be done.

5.5 Huffman Coding

To be done.

Bibliography

[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991.
[2] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 2nd edition, 2006.
[3] Robert G. Gallager. Information Theory and Reliable Communication. Wiley, 1968.
[4] Allan Gut. An Intermediate Course in Probability. Springer-Verlag, 1995.
[5] Richard Hamming. Error detecting and error correcting codes. Bell System Technical Journal, vol. 29, pp. 147–160, 1950.
[6] Ralph V. L. Hartley. Transmission of information. Bell System Technical Journal, pp. 535–563, July 1928.
[7] PNG (Portable Network Graphics) home site, http://www.libpng.org/pub/png. Maintained by Greg Roelofs.
[8] David A. Huffman. A method for the construction of minimum redundancy codes. Proceedings of the Institute of Radio Engineers, vol. 40, pp. 1098–1101, 1952.
[9] Rolf Johannesson. Informationsteori – grundvalen för (tele-)kommunikation. Studentlitteratur, 1988. (In Swedish.)
[10] Rolf Johannesson and Kamil Zigangirov. Fundamentals of Convolutional Codes. IEEE Press, 1999.
[11] Boris Kudryashov. Teoriya Informatsii (Information Theory). Piter, 2009. (In Russian.)
[12] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, March 1951.
[13] Robert McEliece. The Theory of Information and Coding. Cambridge University Press, 2004. Student edition.
[14] Khalid Sayood. Introduction to Data Compression. Elsevier Inc., 2006.
[15] Claude E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July and October 1948.
[16] Terry Welch. A technique for high-performance data compression. IEEE Computer, vol. 17, pp. 8–19, 1984.
[17] John F. Young. Information Theory. Butterworth & Co, 1971.
[18] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, vol. 23, pp. 337–343, 1977.
[19] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, vol. 24, pp. 530–536, 1978.
