Master Thesis MEE 03:15

LOW BITRATE VIDEO AND AUDIO CODECS FOR INTERNET COMMUNICATION

By Jonas Nilsson & Jesper Nilsson

This thesis is presented as part of the degree of Master of Science in Electrical Engineering at Blekinge Institute of Technology, Ronneby, Sweden, May 2003.

© Copyright by Jonas Nilsson & Jesper Nilsson, 2003
Department of Telecommunications and Signal Processing
Examiner: To Tran
Supervisors: To Tran & Carl Ådahl, Moosehill Productions AB

Abstract

This master thesis discusses the design and implementation of a wavelet-based codec of our own development for both video and image compression. The codec is specifically designed for low bitrate video with minimum complexity, for use in online gaming environments. Results indicate that the performance of the codec in many areas equals or even surpasses that of the international JPEG 2000 standard. We believe that it is suitable for any situation where low bitrate is desirable, e.g. video conferences and mobile communications. The game development company Moosehill Productions AB has shown great interest in our codec and its possible applications. We have also implemented an existing audio solution for low bandwidth use.

Preface

This thesis is the final step before completing our Master of Science degree at the Blekinge Institute of Technology, Department of Telecommunications and Signal Processing, formerly known as the University of Karlskrona/Ronneby. This master thesis corresponds to 20 credits. We would like to thank our examiner To Tran at Blekinge Institute of Technology for his creative ideas, and Carl Ådahl from Moosehill Productions AB for his support. Of course, we are grateful to our parents and our sister for their patience and love. Without their support and encouragement this would not have been possible. Finally, we wish to thank the following: Johan Zackrisson, Benny Svensson, Mats Abrahamsson, Sten Dahlgren, all our relatives and all our friends, you know who you are!
Dated: May 2003
Jonas Nilsson
Jesper Nilsson

Table of Contents

Abstract i
Preface ii
Table of Contents iii

1 Introduction 1

2 Pre Processing 3
  2.1 Color Representation 3
  2.2 Temporal Filtering 3
  2.3 Level Offset 5
  2.4 Normalization 6
  2.5 Color Transformation 6

3 Wavelet Transform 8
  3.1 Historical Perspective 8
  3.2 Multiresolution Analysis 9
  3.3 1-D Wavelet Transform 11
    3.3.1 Perfect Reconstruction 12
    3.3.2 The Polyphase Representation 12
    3.3.3 From Polyphase to Lifting 14
    3.3.4 The Lazy Wavelet Transform 17
    3.3.5 Factoring Wavelets into Lifting Steps 17
    3.3.6 Calculating Lifting Steps 18
    3.3.7 Lifting Properties 21
  3.4 2-D Wavelet Transform 22

4 Quantization 25
  4.1 Uniform Dead-Zone Quantization 25

5 Zero-Tree Coding 28
  5.1 EZW 29
    5.1.1 The Significance Pass 29
    5.1.2 The Refinement Pass 30
    5.1.3 Example of EZW 31
  5.2 SPIHT 33
    5.2.1 Different Zero-Trees 33
    5.2.2 Ordered Lists 34
    5.2.3 The Coding Passes 34
    5.2.4 The SPIHT Algorithm 35
    5.2.5 Example of SPIHT 37
  5.3 MSC3K 40
    5.3.1 Knowledge of Quantization 40
    5.3.2 The MSC3K Algorithm 41

6 Data Compression 43
  6.1 Huffman Coding 43
  6.2 Adaptive Coding 44
  6.3 Arithmetic Coding 45
    6.3.1 The Basics 46
    6.3.2 Implementation 48
    6.3.3 Underflow 50
    6.3.4 Decoding 51
    6.3.5 Arithmetic vs. Huffman 51

7 Video Compression 53
  7.1 Predictive Frames 54
  7.2 Motion Estimation & Motion Compensation 56

8 JPEG 2000 Overview 59
  8.1 Features 59
  8.2 JPEG 2000 Codec 60
    8.2.1 Preprocessing 60
    8.2.2 Wavelet Transform 61
    8.2.3 Quantization 62
    8.2.4 Tier-1 & Tier-2 Coding 62

9 MSC3K Overview 64
  9.1 The Codec 64
  9.2 Color Compression 66
  9.3 Video Compression 67
  9.4 Implementation 68

10 Audio Compression 69
  10.1 History 69
  10.2 Speech Synthesis 72
  10.3 Basic Speech Model 73
  10.4 Speech Production by LPC 73

11 Results 77
  11.1 Image Distortion Measure 77
  11.2 Still Image Compression Comparison 78
    11.2.1 Test Setup 78
    11.2.2 Lena 79
    11.2.3 Blue Rose 79
    11.2.4 Gollum 79
    11.2.5 Wraiths 80
    11.2.6 Child 81
    11.2.7 MSC3K Performance 81
  11.3 Video Compression Results 83

12 Conclusions 90
  12.1 Performance 90
  12.2 Possible Application Areas 91
  12.3 Future Works 92
    12.3.1 The Future of MSC3K 93

A The Euclidean Algorithm 94

B Common Wavelet Transforms 95
  B.1 LeGall 5/3 95
    B.1.1 Direct Form Coefficients 95
    B.1.2 Lifting Coefficients 95
  B.2 Daubechies 9/7 96
    B.2.1 Direct Form Coefficients 96
    B.2.2 Lifting Coefficients 96

Bibliography 98

Chapter 1

Introduction

This thesis work is the result of the idea to integrate a low bitrate real-time video communication system into online gaming. This would enhance the online player's gaming experience. Teams could transmit complex strategy plans to each other, or opponents could argue with each other. The video function would of course be optional, since it requires some sort of video device, e.g. a webcam. A button bound to the video transmitting function would enable the player to decide when to send video, and optional settings would determine which players to send to and which players to ignore.

The concept of video communication for online games places heavy requirements on the video codec, e.g. low computational complexity, low bitrate, error resilience and progressive scalability. At first a DCT (Discrete Cosine Transform) based solution, such as MPEG-4, was considered, but thanks to To Tran we soon started looking into wavelet-based solutions. This resulted in a unique wavelet-based video and image codec that we named "Mystery Science Codec 3000" (MSC3K). Later on in our work we recognized other applications apart from online gaming, and these are discussed in chapter 12.

This thesis covers all the steps in an actual image encoder/decoder and explains them in the order in which they appear. The thesis is organized as follows. Chapter 2 gives a short introduction to digital image representation and some preprocessing procedures normally applied before transforming images to the wavelet domain. The wavelet transformation, and how to calculate and design different wavelet implementations and transforms, is explained in detail in chapter 3.
The data of the wavelet transformed image is reduced using a quantization scheme (chapter 4), followed by the actual encoding of data in chapter 5. Chapter 5 only deals with zero-tree algorithms, of which MSC3K is a variant; the chapter is meant to give the reader a complete understanding of zero-trees. The basics of data compression, Huffman based coding and arithmetic coding, are described in chapter 6.

Chapter 7 deals with video encoding. Some information on standards such as MPEG-1 and MPEG-2 is included, but the main focus is on functions and methods implemented in the MSC3K video codec. An informative overview of the JPEG 2000 standard can be found in chapter 8, followed by an overview of the MSC3K codec in chapter 9. Audio compression is covered in chapter 10, with a detailed explanation of linear prediction based compression. Finally, chapters 11 & 12 present and discuss the results of this thesis work, followed by conclusions and possible future works.

Chapter 2

Pre Processing

2.1 Color Representation

A color image is digitally represented by a set of RGB (red, green, blue) samples. Each RGB tuple is referred to as a pixel, and a 2-D set of pixels makes up a color image. A pixel with RGB=[0,0,0] is black. If each component is described by an 8-bit integer, then RGB=[255,255,255] represents white, RGB=[255,0,0] represents red, and so on.

2.2 Temporal Filtering

Digital video images are often, to various degrees, distorted by noise. The noise degrades the quality of the coded image and it also makes motion estimation a lot more difficult. Some sort of filtering must be applied in order to restore the image quality. In motion pictures, a sequence of images correlated in the temporal dimension is available, and this temporal correlation may be exploited through temporal filtering. Temporal filtering has two major advantages over spatial filtering: it reduces degradation without signal distortion, and it is more suitable for realtime implementations.
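The variance-reduction effect of temporal averaging (derived formally below) can be checked numerically. The following is an illustrative sketch of our own, assuming a static scene corrupted by additive zero-mean Gaussian noise; it is not codec code:

```python
import random

def average_frames(frames):
    """Pixel-wise average of co-located samples across N frames."""
    n = len(frames)
    return [sum(px) / n for px in zip(*frames)]

# Synthetic test: a constant scene (value 100) plus Gaussian noise, sigma = 4.
random.seed(0)
true_row = [100.0] * 2000
frames = [[p + random.gauss(0.0, 4.0) for p in true_row] for _ in range(16)]

def noise_variance(row):
    """Empirical variance of the deviation from the known true value."""
    return sum((p - 100.0) ** 2 for p in row) / len(row)

single = noise_variance(frames[0])               # roughly sigma^2 = 16
averaged = noise_variance(average_frames(frames))  # roughly sigma^2 / 16
```

Averaging 16 frames leaves the noise Gaussian but shrinks its variance by about a factor of 16, without blurring a static scene.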
Suppose a sequence of degraded images g_i(n_1, n_2) is expressed as

g_i(n_1, n_2) = f(n_1, n_2) + v_i(n_1, n_2), \quad 1 \le i \le N    (2.1)

where v_i(n_1, n_2) is zero-mean stationary white Gaussian noise with variance \sigma_v^2 and v_i(n_1, n_2) is independent of v_j(n_1, n_2) for i \ne j. Assume that f(n_1, n_2) is non-random; then the maximum likelihood (ML) estimate of f(n_1, n_2) is

\hat{f}(n_1, n_2) = \frac{1}{N} \sum_{i=1}^{N} g_i(n_1, n_2)    (2.2)

\hat{f}(n_1, n_2) = f(n_1, n_2) + \frac{1}{N} \sum_{i=1}^{N} v_i(n_1, n_2)    (2.3)

The degradation in the frame-averaged image, \hat{f}(n_1, n_2), remains a zero-mean stationary white Gaussian noise, but with variance \sigma_v^2 / N, which is a reduction of the noise variance by a factor of N in comparison to v_i(n_1, n_2). The noise characteristics remain the same, only the variance is reduced, and the processed image will not suffer from any blurring despite the noise reduction. This noise filtering is very effective in reducing noise that may occur in the digitization of a still image using e.g. a video camera. This solution only works if f(n_1, n_2) is stationary. If there is any movement between the frames the result will be blurry, since there is no way to tell whether it is movement or noise that causes the frames to differ.

With a simple modification of this temporal filtering, a new solution can be derived that performs well with realtime video. Every pixel in \hat{f}(n_1, n_2) is compared with the corresponding pixel in the incoming frame g(n_1, n_2), and the absolute difference between the two is compared with a predefined threshold \delta. Differences higher than \delta are considered to be movement, while differences lower than \delta are considered to be noise. The pseudo code of this implementation can be written as:

1. For every pixel [x, y] in \hat{f}(n_1, n_2) do:
   • d = |\hat{f}(x, y) - g(x, y)|
   • if d \ge \delta then \hat{f}(x, y) = g(x, y)
   • else \hat{f}(x, y) = (\hat{f}(x, y) + g(x, y)) / 2

Pixels that differ less than \delta are considered to be noise and are averaged with the previous frame. This may blur the image in places, since small variations do not necessarily mean noise. However, this occasional blurring is hardly noticeable, since the sharp edges are preserved: moving sharp edges generate great differences between frames and are therefore considered movement instead of noise.

2.3 Level Offset

If the original B-bit image sample values are unsigned quantities, an offset of -2^{B-1} is added so that the samples have a signed representation in the range

-2^{B-1} \le x[n] < 2^{B-1}    (2.4)

This is done because the discrete wavelet transform involves high-pass filtering, and the high-pass sub-bands hence have a symmetric distribution about 0. Without the level offset, the LL sub-band would have to be treated as unsigned while all other sub-bands would be signed. This is of course a matter of choice, but when it comes to implementation it is much easier if all sample values are within the same range and no exceptions have to be made.

2.4 Normalization

All image samples are normalized through a division by 2^B to the "unit range"

-\frac{1}{2} \le x[n] < \frac{1}{2}    (2.5)

The samples are treated as floating point values from here on. The color transform and the discrete wavelet transform may be normalized so that all sub-band samples nominally retain this same unit range.

2.5 Color Transformation

An irreversible color transform is often used in lossy image compression formats, e.g. JPEG 2000, JPEG, MPEG-2 and MPEG-4. The reason for this is to separate color and intensity (chrominance and luminance) so that they can be treated differently in the encoder. Intensity, the black & white information, is more important than color, so changing color space can improve compression quality for color images. The color space used by most standards, JPEG 2000 among them, is YCbCr, where Y is the intensity and Cb & Cr are the color components.
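In code, the separation into luminance and chrominance is a pair of 3x3 matrix multiplications per pixel. The following is a minimal illustrative sketch using the standard irreversible YCbCr weights (the same values that appear in the transform equations of this section):

```python
def rgb_to_ycbcr(r, g, b):
    """Forward irreversible color transform (standard YCbCr weights)."""
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse transform; only approximate due to finite precision."""
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return r, g, b

# A gray pixel has no chrominance: Cb and Cr come out (numerically) as 0,
# so all information lands in the Y component.
y, cb, cr = rgb_to_ycbcr(128, 128, 128)
```

A round trip through both functions recovers the original RGB values only to within a small error, which is exactly why the transform is called irreversible.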
The transformation from RGB to YCbCr can be expressed as

\begin{pmatrix} x_Y[n] \\ x_{Cb}[n] \\ x_{Cr}[n] \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.168736 & -0.331264 & 0.5 \\ 0.5 & -0.418688 & -0.081312 \end{pmatrix} \begin{pmatrix} x_R[n] \\ x_G[n] \\ x_B[n] \end{pmatrix}    (2.6)

Note that the elements in the matrix are only approximate values. Observe that x_Y is a weighted average of the red, green and blue color components. The weight coefficients (0.299, 0.587, 0.114) reflect the importance of the information, in terms of visual perception of detail, in the different parts of the color spectrum. Notice that the weight coefficients sum to 1, so no combination of RGB values can render a Y value outside the initial RGB range. The RGB to YCbCr conversion can be reversed with the following equation:

\begin{pmatrix} x_R[n] \\ x_G[n] \\ x_B[n] \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1.402 \\ 1 & -0.344136 & -0.714136 \\ 1 & 1.772 & 0 \end{pmatrix} \begin{pmatrix} x_Y[n] \\ x_{Cb}[n] \\ x_{Cr}[n] \end{pmatrix}    (2.7)

The transformation is called irreversible because, when implemented with finite precision, it is impossible to get back the exact original values. There are of course fully reversible color transformations, but they are exclusively used with lossless compression.

Chapter 3

Wavelet Transform

The Discrete Wavelet Transform (DWT) is the foundation block of modern image compression techniques. Among many, the most commonly known DWT based compression standard is probably JPEG 2000. Due to the nature of natural images, the DWT can efficiently concentrate the image energy into small spatial regions in the wavelet representation. This property of the DWT is very beneficial for image compressibility.

3.1 Historical Perspective

In the history of mathematics, wavelet analysis shows many different origins. Much of the work was performed in the 1930s, and at the time, the separate efforts did not appear to be parts of a coherent theory. Before 1930, the main branch of mathematics leading to wavelets began with Joseph Fourier (1807) and his theories of frequency analysis, often referred to as Fourier synthesis. Joseph Fourier opened up the door to a new functional universe.
After 1807, mathematicians began to explore the meaning of functions, the convergence of Fourier series, and orthogonal systems. A disadvantage of a Fourier expansion, however, is that it has only frequency resolution and no time resolution. This means that although all frequencies present in a signal can be determined, there is no way to know when they are present. Several solutions that are more or less able to represent a signal in the time and frequency domain at the same time have been developed since. Research was gradually led from the notion of frequency analysis to the notion of scale analysis when it started to become clear that an approach measuring average fluctuations at different scales might prove less sensitive to noise. The first recorded mention of wavelets appeared in an appendix to the thesis of Alfred Haar (1909). In the 1930s, several groups working independently researched the representation of functions using scale-varying basis functions. Understanding the concepts of basis functions and scale-varying basis functions is key to understanding wavelets. Paul Levy, a 1930s scientist, used a scale-varying basis function called the Haar basis function to investigate Brownian motion, a type of random signal. He found the Haar basis function superior to the Fourier basis functions for studying small, complicated details in the Brownian motion. In 1980, Jean Morlet and the team at the Marseille Theoretical Physics Center, working under Alex Grossmann in France, first proposed the concept of wavelets in its present theoretical form. They broadly defined wavelets in the context of quantum physics, and provided a way of thinking about wavelets based on physical intuition. In 1985, Stephane Mallat gave wavelets an additional jumpstart through his work in digital signal processing, and later, around 1988, mainly Y. Meyer and his colleagues developed the methods of wavelet analysis, inspired by Mallat's earlier work. Since then, wavelet research has become international.
Ingrid Daubechies, Ronald Coifman, and Victor Wickerhauser are some of the greatest names in the field of wavelet research today.

3.2 Multiresolution Analysis

The wavelet transform can be thought of as a filter bank: in order to wavelet transform a signal, the signal has to pass through a filter bank. The outputs of the different filter stages are the wavelet and scaling function transform coefficients. Analyzing a signal by passing it through a filter bank is not a new idea and has been around for many years under the name sub-band coding.

Figure 3.1: Splitting the signal spectrum with an iterated filter bank.

The filter bank needed in sub-band coding can be built in several ways. One way is to build many band-pass filters to split the spectrum into frequency bands. The advantage is that the width of every band can be chosen freely, in such a way that the spectrum of the signal to analyze is covered in the places where it might be interesting. The disadvantage is that every filter has to be designed separately, and this can be a time consuming process. Another way is to split the spectrum in two with one low-pass and one high-pass filter, and then iterate the split on the low-pass part; this process is graphically displayed in figure 3.1. The advantage of this scheme is that only two filters have to be designed; the disadvantage is that the signal spectrum coverage is fixed. Since the signals from the high-pass filters have different spectrum widths, they can be represented in the time domain with different sample resolutions. For example, in figure 3.1 the 4B signal takes twice the amount of samples to represent in the time domain than the 2B signal. This type of analysis of a signal is often referred to as a multiresolution analysis.

3.3 1-D Wavelet Transform

The wavelet transform (or multiresolution analysis) can be performed using a filter bank with FIR filters as shown in figure 3.2. Combining some blocks in figure 3.2 will generate a multiresolution analysis (figure 3.1) of the input signal.
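The iterated splitting of figure 3.1 can be sketched in a few lines. This is an illustrative example of our own, using simple Haar-style average/difference filters (an assumption for brevity; the filters actually used are introduced later); it shows how each stage halves the number of samples left in the low band:

```python
def haar_split(x):
    """One two-band split: low band of pairwise averages, high band of
    pairwise (half-)differences. Assumes an even-length input."""
    lo = [(x[2*i] + x[2*i + 1]) / 2 for i in range(len(x) // 2)]
    hi = [(x[2*i] - x[2*i + 1]) / 2 for i in range(len(x) // 2)]
    return lo, hi

def multires(x, levels):
    """Iterate the split on the low band, as in figure 3.1."""
    bands = []
    for _ in range(levels):
        x, hi = haar_split(x)
        bands.append(hi)
    bands.append(x)  # the coarsest low-pass band
    return bands

# An 8-sample signal, 2 levels: high bands of length 4 and 2,
# plus a low band of length 2 (the multiresolution representation).
bands = multires([2, 4, 6, 8, 8, 6, 4, 2], 2)
```

The total number of coefficients (4 + 2 + 2) equals the input length, so nothing is gained or lost in size; the energy has merely been reorganized by scale.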
Figure 3.3 gives a more detailed filter bank structure of such an analysis. The signal is split into numerous frequency bands (sub-bands). This can be done with reversible integer-to-integer or irreversible real-to-real wavelet transforms. A reversible wavelet transform is designed so that the signal can be perfectly reconstructed; irreversible transforms can only approximate the reconstructed signal due to finite precision.

Figure 3.2: The discrete forward and inverse wavelet transform presented in direct form. Analysis filtering followed by subsampling, then upsampling and synthesis filtering.

Figure 3.3: A combined forward wavelet transform with sub-bands γ1, γ2 and λ2.

The analysis filters are denoted h_a and g_a, while the synthesis filters are denoted h_s and g_s. The h filters are low-pass filters and the g filters are high-pass. Later on, an additional subscript will be added to indicate even or odd filter coefficients, e.g. h_{ae} represents the even filter coefficients of h_a and h_{ao} the odd coefficients.

3.3.1 Perfect Reconstruction

Usually a signal transform is used to transform a signal to a different domain, perform some operation on the transformed signal, and then inverse transform it back to the original domain. This means, of course, that the transform has to be invertible. If a signal is transformed and then directly inverse transformed, there should be no change; the signal has to be perfectly reconstructed. The conditions for perfect reconstruction are given by [1] as

h_s(z) h_a(z^{-1}) + g_s(z) g_a(z^{-1}) = 2
h_s(z) h_a(-z^{-1}) + g_s(z) g_a(-z^{-1}) = 0    (3.1)

In order to arrive at a non-delayed, perfectly reconstructed signal, the analysis filters (h_a(z) and g_a(z)) have to be time reversed.

3.3.2 The Polyphase Representation

The direct form represented in figure 3.2 is not suitable for implementation, since half of all the computations are thrown away in the subsampling procedure.
Writing out the outputs λ and γ sample by sample shows that every second filter output is computed only to be thrown away by the subsampling. The surviving samples are the decimated convolutions

\lambda[n] = \sum_k h_a[k] \, x[2n - k]    (3.2)

\gamma[n] = \sum_k g_a[k] \, x[2n - k]    (3.3)

By defining the even and odd (polyphase) components

h_a(z) = h_{ae}(z^2) + z^{-1} h_{ao}(z^2)
g_a(z) = g_{ae}(z^2) + z^{-1} g_{ao}(z^2)    (3.4)

λ(z) and γ(z) are then recognized as

\lambda(z) = h_{ae}(z) x_e(z) + h_{ao}(z) z^{-1} x_o(z)
\gamma(z) = g_{ae}(z) x_e(z) + g_{ao}(z) z^{-1} x_o(z)    (3.5)

The whole expression can be written as a multiplication between the matrix P_a(z) and a vector of x_e(z) and z^{-1} x_o(z):

\begin{pmatrix} \lambda(z) \\ \gamma(z) \end{pmatrix} = \underbrace{\begin{pmatrix} h_{ae}(z) & h_{ao}(z) \\ g_{ae}(z) & g_{ao}(z) \end{pmatrix}}_{P_a(z)} \begin{pmatrix} x_e(z) \\ z^{-1} x_o(z) \end{pmatrix}    (3.6)

This implies that the filter bank can be expressed as a filter with even and odd samples as input, and that the subsampling can be done prior to the filtering. Instead of throwing away half of the computations as in the case of figure 3.2, the filter bank can be re-modelled as that of figure 3.4, where P_a(z) is the polyphase matrix for the analysis.

Figure 3.4: A one-stage forward and inverse wavelet transform using polyphase matrices.

Similar to P_a(z), the polyphase matrix for the synthesis, P_s(z), can be formulated as

\begin{pmatrix} x_e(z) \\ z^{-1} x_o(z) \end{pmatrix} = P_s(z) \begin{pmatrix} \lambda(z) \\ \gamma(z) \end{pmatrix} = \begin{pmatrix} h_{se}(z) & g_{se}(z) \\ h_{so}(z) & g_{so}(z) \end{pmatrix} \begin{pmatrix} \lambda(z) \\ \gamma(z) \end{pmatrix}    (3.7)

From figure 3.4 the condition for perfect reconstruction can now be written as

P_a(z^{-1}) P_s(z) = I    (3.8)

The time reversal of P_a is needed to cancel the delays caused by the polyphase matrices. Assuming that P_s(z) is invertible, Cramer's rule can be applied to calculate the inverse:

P_s(z)^{-1} = P_a(z^{-1}) = \frac{1}{h_{se}(z) g_{so}(z) - h_{so}(z) g_{se}(z)} \begin{pmatrix} g_{so}(z) & -g_{se}(z) \\ -h_{so}(z) & h_{se}(z) \end{pmatrix}    (3.9)

If the determinant of P_s(z) equals 1, i.e. h_{se}(z) g_{so}(z) - h_{so}(z) g_{se}(z) = 1, then not only will P_s(z) be invertible, but also

h_{ae}(z) = g_{so}(z^{-1})
h_{ao}(z) = -g_{se}(z^{-1})
g_{ae}(z) = -h_{so}(z^{-1})
g_{ao}(z) = h_{se}(z^{-1})    (3.10)

which implies that

h_a(z) = -z^{-1} g_s(-z^{-1})
g_a(z) = z^{-1} h_s(-z^{-1})    (3.11)

[1] defines the term complementary filter pair as a pair of filters (h, g) whose corresponding polyphase matrix, P(z), has determinant 1. The problem of finding an invertible wavelet transform using FIR filters amounts to finding such a matrix P(z) with determinant 1. If (h_a, g_a) is complementary, then so is (h_s, g_s). From the matrix P_a(z), the filters needed in the invertible wavelet transform follow immediately.

3.3.3 From Polyphase to Lifting

Lifting is the most computationally efficient solution when implementing the wavelet transform; see chapter 3.3.7. The lifting procedure is described by a set of lifting steps, where each lifting step is a polyphase matrix. Extracting the lifting steps is done by factoring the polyphase matrix in a certain way. The lifting theorem [1] states that any other finite filter g_s^{new} complementary to h_s is of the form

g_s^{new}(z) = g_s(z) + h_s(z) s_s(z^2)    (3.12)

where s_s(z^2) is a Laurent polynomial. This can be written in polyphase form as

P_s^{new}(z) = P_s(z) \begin{pmatrix} 1 & s_s(z) \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} h_{se}(z) & h_{se}(z) s_s(z) + g_{se}(z) \\ h_{so}(z) & h_{so}(z) s_s(z) + g_{so}(z) \end{pmatrix}    (3.13)

The lifting theorem can also be used to create the filter h_a^{new}(z) complementary to g_a(z):

h_a^{new}(z) = h_a(z) + g_a(z) s_a(z^2)    (3.14)

with the polyphase matrix

P_a^{new}(z) = \begin{pmatrix} 1 & s_a(z) \\ 0 & 1 \end{pmatrix} P_a(z) = \begin{pmatrix} h_{ae}(z) + g_{ae}(z) s_a(z) & h_{ao}(z) + g_{ao}(z) s_a(z) \\ g_{ae}(z) & g_{ao}(z) \end{pmatrix}    (3.15)

This is also known as primal lifting (illustrated in figure 3.5), where the low-pass sub-band is lifted with the help of the high-pass sub-band.
Figure 3.5: Primal lifting, the low-pass sub-band is lifted with the help of the high-pass sub-band.

Lifting can also be applied to the high-pass sub-band, thus lifting the high-pass sub-band with the help of the low-pass sub-band. This procedure is called dual lifting. For dual lifting the equations become

h_s^{new}(z) = h_s(z) + g_s(z) t_s(z^2)    (3.16)

P_s^{new}(z) = P_s(z) \begin{pmatrix} 1 & 0 \\ t_s(z) & 1 \end{pmatrix} = \begin{pmatrix} h_{se}(z) + g_{se}(z) t_s(z) & g_{se}(z) \\ h_{so}(z) + g_{so}(z) t_s(z) & g_{so}(z) \end{pmatrix}    (3.17)

and

g_a^{new}(z) = g_a(z) + h_a(z) t_a(z^2)    (3.18)

P_a^{new}(z) = \begin{pmatrix} 1 & 0 \\ t_a(z) & 1 \end{pmatrix} P_a(z) = \begin{pmatrix} h_{ae}(z) & h_{ao}(z) \\ g_{ae}(z) + h_{ae}(z) t_a(z) & g_{ao}(z) + h_{ao}(z) t_a(z) \end{pmatrix}    (3.19)

Figure 3.6: Dual lifting, the high-pass sub-band is lifted with the help of the low-pass sub-band.
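As a concrete illustration of one dual (predict) step followed by one primal (update) step, the Haar transform can be written in lifting form. This is an illustrative example of our own, using the Haar filter rather than one of the filters derived later:

```python
def lift_forward(x):
    """Lazy split into even/odd, then one dual and one primal lifting step.
    Assumes an even-length input."""
    even, odd = x[0::2], x[1::2]
    d = [o - e for e, o in zip(even, odd)]       # dual step: predict odd from even
    s = [e + di / 2 for e, di in zip(even, d)]   # primal step: update even with d
    return s, d                                  # low band, high band

def lift_inverse(s, d):
    """Undo each lifting step in reverse order with the opposite sign."""
    even = [si - di / 2 for si, di in zip(s, d)]
    odd = [di + e for e, di in zip(even, d)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

x = [3, 1, 4, 1, 5, 9, 2, 6]
s, d = lift_forward(x)
restored = lift_inverse(s, d)
```

Because each step only adds a function of the other band, inverting it is just subtracting the same quantity again; invertibility is structural rather than something that must be proven per filter.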
This implies that all wavelet transforms can be factored into lifting steps until only the unit matrix and a number of lifting matrices remains, and in some cases an additional scaling matrix (K). The polyphase matrix, P (z), can be written as ! ! m ( ! Y 1 0 K1 0 1 si (z) P (z) = 0 1 0 K2 i=1 0 1 1 0 !) (3.20) ti (z) 1 An inverted dual lifting procedure can be written as P (z) = he (z) ho (z) ! = ge (z) go (z) 1 he (z) ho (z) he (z)t(z) + genew (z) ho (z)t(z) + go (z)new ! ! 0 he (z) ho (z) t(z) 1 ! = genew (z) gonew (z) (3.21) 17 from this it is clear that the factoring conditions are g(z) = h(z)t(z 2 ) + g new (z) ge (z) = he (z)t(z) + genew (z) (3.22) go (z) = ho (z)t(z) + gonew (z) This form is identical to a long division with remainder of Laurent polynomials, where g new (z) is the remainder. Two long divisions have to be carried out in order to extract one lifting step. The Euclidean algorithm for Laurent polynomials can be utilized to calculate the long divisions. The Euclidean algorithm was originally developed to find the greatest common divisor (gcd ) of two natural numbers, but it can be extended to handle Laurent polynomials as described in appendix A. 3.3.6 Calculating Lifting Steps A lot of research has already been performed on designing wavelet filters for all kinds of applications. Factoring FIR filters of an existing wavelet transform into lifting steps will give a fast implementation of an already working wavelet transform. The LeGall 5/3 wavelet transform filter is widely used in lossless image compression, e.g. the JPEG 2000 standard. 
The direct form filter coefficients for the 5/3 analysis transform are often given as

h_{LP}(z) = -\tfrac{1}{8}z^{-2} + \tfrac{1}{4}z^{-1} + \tfrac{3}{4} + \tfrac{1}{4}z - \tfrac{1}{8}z^{2}
g_{HP}(z) = -\tfrac{1}{2}z^{-1} + 1 - \tfrac{1}{2}z    (3.23)

This notation will not generate a reversible wavelet transform; instead the g_{HP} filter should be delayed one sample and the filter coefficients given as

h_{LP}(z) = -\tfrac{1}{8}z^{-2} + \tfrac{1}{4}z^{-1} + \tfrac{3}{4} + \tfrac{1}{4}z - \tfrac{1}{8}z^{2}
g_{HP}(z) = -\tfrac{1}{2}z^{-2} + z^{-1} - \tfrac{1}{2}    (3.24)

Continuing by dividing the coefficients into even and odd parts,

h_a(z) = \underbrace{-\tfrac{1}{8}z^{-2} + \tfrac{3}{4} - \tfrac{1}{8}z^{2}}_{h_{ae}(z^2)} + z^{-1}\underbrace{\left(\tfrac{1}{4} + \tfrac{1}{4}z^{2}\right)}_{h_{ao}(z^2)}
g_a(z) = \underbrace{-\tfrac{1}{2}z^{-2} - \tfrac{1}{2}}_{g_{ae}(z^2)} + z^{-1}\underbrace{\{1\}}_{g_{ao}(z^2)}    (3.25)

and thus

P_a(z) = \begin{pmatrix} h_{ae}(z) & h_{ao}(z) \\ g_{ae}(z) & g_{ao}(z) \end{pmatrix} = \begin{pmatrix} -\tfrac{1}{8}z^{-1} + \tfrac{3}{4} - \tfrac{1}{8}z & \tfrac{1}{4} + \tfrac{1}{4}z \\ -\tfrac{1}{2}z^{-1} - \tfrac{1}{2} & 1 \end{pmatrix}    (3.26)

\det(P_a(z)) = \left(-\tfrac{1}{8}z^{-1} + \tfrac{3}{4} - \tfrac{1}{8}z\right)(1) - \left(\tfrac{1}{4} + \tfrac{1}{4}z\right)\left(-\tfrac{1}{2}z^{-1} - \tfrac{1}{2}\right) = 1    (3.27)

The LeGall 5/3 filter is obviously reversible, since the determinant of the analysis polyphase matrix equals 1; the analysis filters meet the demands set for perfect reconstruction. Now it is time to extract a lifting step. Note that h_{ae}(z) and h_{ao}(z) are the longest filters, so extracting a primal lifting step should lower the order of h_{ae}(z) and h_{ao}(z):

P_a(z) = \begin{pmatrix} 1 & s(z) \\ 0 & 1 \end{pmatrix} \begin{pmatrix} h_{ae}^{new}(z) & h_{ao}^{new}(z) \\ -\tfrac{1}{2}z^{-1} - \tfrac{1}{2} & 1 \end{pmatrix} = \begin{pmatrix} h_{ae}^{new}(z) + s(z)\left(-\tfrac{1}{2}z^{-1} - \tfrac{1}{2}\right) & h_{ao}^{new}(z) + s(z) \\ -\tfrac{1}{2}z^{-1} - \tfrac{1}{2} & 1 \end{pmatrix}    (3.28)

\begin{cases} h_{ae}^{new}(z) + s(z)\left(-\tfrac{1}{2}z^{-1} - \tfrac{1}{2}\right) = -\tfrac{1}{8}z^{-1} + \tfrac{3}{4} - \tfrac{1}{8}z \\ h_{ao}^{new}(z) + s(z) = \tfrac{1}{4} + \tfrac{1}{4}z \end{cases}    (3.29)

From this, the Euclidean algorithm can be used with a_0 = h_{ae}(z) and b_0 = h_{ao}(z). Three different factorizations are possible depending on the choice of the remainder.
The three possible divisions are

$$-\tfrac{1}{8}z^{-1} + \tfrac{3}{4} - \tfrac{1}{8}z = \begin{cases} \left(-\tfrac{1}{2}z^{-1} - \tfrac{1}{2}\right)\left(-\tfrac{7}{4} + \tfrac{1}{4}z\right) - z^{-1} \\ \left(-\tfrac{1}{2}z^{-1} - \tfrac{1}{2}\right)\left(\tfrac{1}{4} + \tfrac{1}{4}z\right) + 1 \\ \left(-\tfrac{1}{2}z^{-1} - \tfrac{1}{2}\right)\left(\tfrac{1}{4} - \tfrac{7}{4}z\right) - z \end{cases} \qquad (3.30)$$

thus

$$s(z) = -\tfrac{7}{4} + \tfrac{1}{4}z: \quad h_{ae}^{new} = -z^{-1}, \; h_{ao}^{new} = 2$$
$$s(z) = \tfrac{1}{4} + \tfrac{1}{4}z: \quad h_{ae}^{new} = 1, \; h_{ao}^{new} = 0 \qquad (3.31)$$
$$s(z) = \tfrac{1}{4} - \tfrac{7}{4}z: \quad h_{ae}^{new} = -z, \; h_{ao}^{new} = 2z$$

Depending on which remainder is chosen, the resulting lifting matrix will have different properties. If, for example, we choose the middle alternative, where $h_{ae}^{new} = 1$, the lifting matrix will have symmetrical filters, which is more efficient for implementations. Furthermore, there will be no extra delays in the lifting matrix. The analysis polyphase matrix after factorization is

$$P_a(z) = \begin{pmatrix} 1 & \tfrac{1}{4} + \tfrac{1}{4}z \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ -\tfrac{1}{2}z^{-1} - \tfrac{1}{2} & 1 \end{pmatrix} \qquad (3.32)$$

The synthesis transform is the inverse of the analysis transform, so inverting the lifting steps of $P_a(z)$ immediately gives the synthesis lifting scheme

$$P_s(z) = \begin{pmatrix} 1 & 0 \\ \tfrac{1}{2}z^{-1} + \tfrac{1}{2} & 1 \end{pmatrix} \begin{pmatrix} 1 & -\tfrac{1}{4} - \tfrac{1}{4}z \\ 0 & 1 \end{pmatrix} \qquad (3.33)$$

There is no need to extract more lifting steps, since the expression already consists of a primal and a dual lifting step.

Figure 3.7: LeGall 5/3 filter in lifting representation.

Comparing computational complexity, the direct form implementation of the analysis part in figure 3.4 requires 5 additions and 4 multiplications if the symmetrical properties of the filters are exploited. The lifting implementation in figure 3.7 requires only 4 additions and 2 multiplications. The LeGall 5/3 filter can also be implemented using an integer-to-integer mapping. Because the lifting filter coefficients are all powers of 2, e.g. $\tfrac{1}{2} = 2^{-1}$, $\tfrac{1}{4} = 2^{-2}$, the multiplications can be replaced by binary shift operations. This is not only computationally efficient, it also makes the integer-to-integer mapping possible, so that lossless compression can be achieved.
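The shift-based integer mapping can be sketched as follows. This is an illustrative one-level implementation, not the thesis' own code; it uses the rounding offsets of the reversible JPEG 2000 5/3 transform and symmetric boundary extension, and assumes an even-length signal:

```python
def forward_53(x):
    """One level of the integer LeGall 5/3 lifting transform.
    Dual step:   d[k] = odd  - floor((left even + right even) / 2)
    Primal step: s[k] = even + floor((d[k-1] + d[k] + 2) / 4)"""
    n = len(x)
    d = [x[2*k + 1] - ((x[2*k] + x[min(2*k + 2, n - 2)]) >> 1)
         for k in range(n // 2)]
    s = [x[2*k] + ((d[max(k - 1, 0)] + d[k] + 2) >> 2)
         for k in range(n // 2)]
    return s, d  # low pass (lambda) and high pass (gamma) coefficients

def inverse_53(s, d):
    """Undo the lifting steps in reverse order; exact for integers."""
    n = 2 * len(s)
    x = [0] * n
    for k in range(len(s)):
        x[2*k] = s[k] - ((d[max(k - 1, 0)] + d[k] + 2) >> 2)
    for k in range(len(d)):
        x[2*k + 1] = d[k] + ((x[2*k] + x[min(2*k + 2, n - 2)]) >> 1)
    return x

x = [3, 7, 1, -4, 9, 0, 2, 5]
assert inverse_53(*forward_53(x)) == x   # lossless: perfect reconstruction
```

Because only additions and right shifts are used, the inverse retraces the forward steps exactly, which is what makes lossless operation possible.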
An overview of the two most commonly used wavelet transforms, the LeGall 5/3 filter and the Daubechies 9/7 filter, and their lifting steps can be found in appendix B.

3.3.7 Lifting Properties

As mentioned before, the greatest advantage of the lifting scheme is the reduced computational complexity. It has been proven [1] that for long filters the lifting scheme cuts the computational complexity in half, compared to the standard FIR filter bank algorithm. The filter bank method has a complexity of O(N), which is already much more efficient than the FFT with its complexity of O(N log N); the lifting scheme in turn halves the operation count again.

Another important factor besides computational complexity is the memory usage of an implementation. For lifting, the samples needed are basically a small number of outputs from the previous lifting step, whereas the direct form needs more surrounding samples to produce an output sample. This can be exploited in lifting by keeping the results from the previous step in CPU registers and calculating all lifting steps before storing the final result in external memory, thus reducing memory accesses and improving speed. Lifting is also easily reversible, due to the structure of the lifting polyphase matrices. Finally, it produces interlaced coefficients: the high pass coefficients are mixed with the low pass coefficients. For further details, see figure B.1.

3.4 2-D Wavelet Transform

As previously discussed, the result of the transform is a set of λ and γ coefficients. These coefficients are in fact the low frequency information and the high frequency information of the original signal. Separating these coefficients and storing them in different spatial regions generates sub-bands, as discussed in chapter 3.2.

Figure 3.8: A) One decomposition level. B) Two decomposition levels.

Transforming an image requires a 2-D wavelet transform. Images can be seen as two dimensional signals, where the x dimension corresponds to the rows of pixels and the y dimension to the columns.
The 1-D transform can be applied in the horizontal direction, one row at a time; this generates the decomposition illustrated in figure 3.9.B. The same 1-D transform can then be applied in the vertical direction, one column at a time, resulting in a complete 2-D wavelet transformation of the original image, as illustrated in figure 3.9.C. At this point the transformation has reached one decomposition level and there are four so-called sub-bands present in the original spatial region. These are named LL1, LH1, HL1 and HH1, where H stands for high pass and L for low pass. The index indicates the decomposition level the sub-band originates from, and the order of the letters indicates horizontal or vertical origin, see figure 3.8. The 2-D transformation procedure can be repeated on the LL sub-band to create more decomposition levels and thus more sub-bands. Figure 3.9.D illustrates a two level decomposition. The wavelet coefficients representing the original image are the basis of all wavelet-based image compression techniques.

Apart from the LL sub-band in figure 3.9, the grey areas represent zeros. This is because the wavelet coefficients, which range from −1/2 to 1/2, are displayed in an unsigned range of 0 to 255; zero is therefore mapped to 127 (−1/2 maps to 0 and 1/2 to 255). The high proportion of zeros in the outer sub-bands indicates that the most important information resides in the LL sub-band, and this will later be used to achieve good compressibility.

Figure 3.9: A) The original gray image of Lena (256x256x8bpp). B) One horizontal 1-D wavelet transform step applied to A. C) A vertical 1-D wavelet transform step applied to B. D) Both vertical and horizontal steps applied to the LL sub-band in C.

Chapter 4

Quantization

Quantization is the element in lossy compression that makes the data more compressible, by reducing the precision of the data in some way. In most lossy compression systems, it is the only source of distortion.
For lossless compression there is no quantization step.

4.1 Uniform Dead-Zone Quantization

This is a variant of scalar quantization, the simplest of all quantization methods. Scalar quantization can be described as a function that maps each element in a subset of the real line to a particular value in that subset. Consider partitioning the real line into M disjoint intervals

$$I_q = (t_q, t_{q+1}), \quad q = 0, 1, \ldots, M-1 \qquad (4.1)$$

with

$$-\infty = t_0 < t_1 < \ldots < t_M = +\infty \qquad (4.2)$$

Within each interval, a point $\hat{x}_q$ is selected as the output value (or codeword) of $I_q$. A scalar quantizer is then a mapping from $\mathbb{R}$ to $\{0, 1, \ldots, M-1\}$. Specifically, for a given x, Q(x) is the index q of the interval $I_q$ which contains x. The inverse quantization, or dequantization, is given by

$$Q^{-1}(q) = \hat{x}_q \qquad (4.3)$$

Figure 4.1: Different graphical representations of scalar quantization.

A very desirable feature of compression systems is the ability to refine the reconstructed data as the bit-stream is decoded. As more of the compressed bit-stream is decoded, the reconstruction can be improved incrementally, until full quality reconstruction is obtained upon decoding the entire bit-stream. Compression systems possessing this property are facilitated by embedded quantization. In embedded quantization, the intervals of higher rate quantizers are embedded within the intervals of lower rate quantizers. Equivalently, the intervals of lower rate quantizers are partitioned to yield the intervals of higher rate quantizers. Uniformly subdividing the intervals of a uniform scalar quantizer yields another uniform scalar quantizer; in this way, a set of embedded uniform scalar quantizers may be constructed.

For the case when ξ = 0,

$$q = Q(x) = \mathrm{sign}(x)\left\lfloor \frac{|x|}{\Delta} \right\rfloor \qquad (4.4)$$

and

$$\hat{x} = Q^{-1}(q) = \begin{cases} 0 & q = 0 \\ \mathrm{sign}(q)\,(|q| + \delta)\,\Delta & q \neq 0 \end{cases} \qquad (4.5)$$

This quantizer has embedded within it all uniform dead-zone quantizers with step sizes $2^p\Delta$ for integer $p \geq 0$.
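A small sketch of this quantizer and of the embedding property (illustrative code; the step size, δ and the function names are ours, not from the thesis):

```python
import math

def quantize(x, step):
    """Uniform dead-zone quantizer of (4.4): q = sign(x) * floor(|x| / step)."""
    q = int(math.floor(abs(x) / step))
    return -q if x < 0 else q

def dequantize(q, step, delta=0.5):
    """Inverse quantizer of (4.5); delta places the reconstruction
    inside the decision interval (midpoint for delta = 1/2)."""
    if q == 0:
        return 0.0
    return math.copysign((abs(q) + delta) * step, q)

# the dead zone around 0 is twice as wide as the other intervals
assert quantize(3.9, 4) == 0 and quantize(-3.9, 4) == 0
assert quantize(9.0, 4) == 2 and dequantize(2, 4) == 10.0

# embedding property: dropping the p least significant magnitude bits
# of q gives the same index as quantizing with step 2^p * step directly
x, step, p = 77.3, 2, 3
q = quantize(x, step)
assert (abs(q) >> p) == abs(quantize(x, (2 ** p) * step))
```

The final assertion is exactly the bit-dropping behaviour described next for the embedded quantizers.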
Assuming that the magnitude of q can be represented with K bits, q can be written in sign-magnitude form as

$$q = Q_{K-1}(x) = s, q_0, q_1, \ldots, q_{K-1} \qquad (4.6)$$

Let

$$q^{(p)} = s, q_0, q_1, \ldots, q_{K-1-p} \qquad (4.7)$$

be the index obtained by dropping the last p bits of q. Equivalently, $q^{(p)}$ is obtained by right-shifting the binary representation of |q| by p bits. It can then be verified that

$$Q_{K-1-p}(x) = q^{(p)} \qquad (4.8)$$

where $Q_{K-1-p}$ is the uniform dead-zone quantizer with step size $2^p\Delta$. If the p LSBs of |q| are unavailable, we may still invert the quantization (dequantize), but at a lower level of quality. The result will be the same as if quantization had been performed with a step size of $2^p\Delta$ (rather than ∆) in the first place. In this situation the inverse quantization is performed as

$$\hat{x} = \begin{cases} 0 & q^{(p)} = 0 \\ \mathrm{sign}(q^{(p)})\,(|q^{(p)}| + \delta)\,2^p\Delta & q^{(p)} \neq 0 \end{cases} \qquad (4.9)$$

When p = 0, this yields the full quality dequantization.

Chapter 5

Zero-Tree Coding

When zero-tree coding was first introduced, it produced state-of-the-art compression performance at a relatively modest level of complexity. First out was the embedded zero-tree coding of wavelet coefficients (EZW) introduced by Shapiro [2]. Later, Shapiro's zero-trees were generalized and set partitioning techniques were introduced to efficiently code these generalized trees of zeros. The improved technique is known as Set Partitioning in Hierarchical Trees (SPIHT) [3]. The SPIHT algorithm produces results superior to those of EZW, at an even lower level of complexity.

Zero-tree coding is applied to the quantized wavelet coefficients. The actual coding is founded on the premise that if a coefficient is small in magnitude, the spatially corresponding coefficients in higher sub-bands of the wavelet transform also tend to be small. This spatial correlation produces a tree-like coding structure, hence the name zero-tree. Our implementation, MSC3K, is a slightly modified SPIHT variant of zero-tree coding.
The modification improves compression performance and ensures that prior knowledge of the quantization procedure is used to achieve faster encoding. For a better understanding of the SPIHT algorithm, a more detailed explanation of EZW is required.

5.1 EZW

EZW can be described as a form of entropy coding applied to the bit-planes of the quantized wavelet coefficients. Assume an original image of size N × N, where N = 2^n for some integer n ≥ D. This image is transformed with D levels of a dyadic tree-structured sub-band transform, an operation that results in 3D + 1 sub-bands of transform coefficients. The transform coefficients are then quantized with a uniform dead-zone quantizer. As a result, the original image is now an N × N array of integers representing the transform. The coefficients are then "sliced" into K + 1 bit-planes. The first such bit-plane consists of the sign bit of each index, s[x, y], x, y = 0, 1, ..., N − 1. The next bit-plane consists of the MSBs of each magnitude, denoted q0[x, y], x, y = 0, 1, ..., N − 1, then q1[x, y] and so on.

Each of the K magnitude bit-planes is coded in two passes. The first pass determines which coefficients should be visited in the refinement pass of the next bit-plane; one might say it indicates whether a coefficient has become newly significant. This pass is called the significance pass and it outputs compressed data. The second pass is called the refinement pass and it outputs the bits of the coefficients that have been determined significant, though not of the newly significant coefficients. The newly significant coefficients become significant in the next bit-plane. So when coding q0 there are no significant coefficients yet, just newly significant ones; thus the refinement pass for q0 is skipped.

5.1.1 The Significance Pass

Each insignificant coefficient in the wavelet transform is visited in raster order (left-to-right, top-to-bottom), first within LLD, then within LHD, then HLD, then HHD, then LHD−1 and so on.
Because most energy is concentrated in the lower frequency sub-bands, this order gives the coefficients their rightful priority. Compression is accomplished via a 4-ary alphabet:

• POS = Significant Positive. This symbol equals 1 followed by a "positive" sign bit.

• NEG = Significant Negative. This symbol equals 1 followed by a "negative" sign bit.

• ZTR = Zero Tree Root. This symbol indicates that the current bit is 0, and that all descendants of the bit are 0.

• IZ = Isolated Zero. This symbol indicates that the bit is 0, but that at least one descendant is non-zero.

Figure 5.1: EZW descendants structure for one sample in the LL3 sub-band.

The descendants of a coefficient in a zero-tree are often referred to as children or grandchildren depending on the relationship, see figure 5.1. The highest frequency sub-bands, HL1, LH1 and HH1, have no children, so the symbols ZTR and IZ have no meaning in these regions. Instead they are replaced by a single symbol Z.

5.1.2 The Refinement Pass

In the refinement pass a single bit is coded for each coefficient known to be significant. A coefficient is significant if it has been coded POS or NEG in a previous bit-plane. Its current refinement bit is simply its corresponding bit in the current bit-plane. The coefficients are visited in a special order, sorted by two criteria: first by magnitude, second by raster order within sub-bands, with the sub-bands visited in the same order as in the significance pass. Coefficients that have been determined significant are treated as IZ or Z symbols in the remaining significance passes.

5.1.3 Example of EZW

The sign plane and the first two magnitude bit-planes (q0, q1) of a D = 3 level dyadic orthonormal transform are shown in figure 5.2. The coding begins with a significance pass through q0. First the single bit in LL3 is examined; since it is 1 and its sign is +, a POS symbol is output. The bit in HL3 is coded as ZTR, indicating that it and all its descendants are 0.
LH3 is coded similarly to LL3, but the sign differs so it is coded as NEG. HH3 is coded as ZTR. No further coding of HL2, HL1, HH2 and HH1 is done for bit-plane q0. Proceeding to LH2, four symbols are coded: ZTR, ZTR, POS, ZTR. Now only LH1 remains to be coded. 12 of its 16 bits have already been coded as zeros by ZTR symbols in LH2; the remaining 4 bits are coded Z, Z, Z, Z. The resulting symbols from the significance pass of q0 are thus:

POS, ZTR, NEG, ZTR,
ZTR, ZTR, POS, ZTR,
Z, Z, Z, Z

The symbols are broken into three lines for clarity. The first line represents regions LL3, LH3, HL3 and HH3, the second line is from LH2, HL2 and HH2, and the last line is from LH1, HL1 and HH1.

By virtue of the q0 significance pass, three coefficients are now known to be significant: q[0, 0], q[0, 1] and q[0, 3]. The q1 bit of each of these coefficients is coded in the q1 refinement pass. No sorting of the significant coefficients is required, since only one bit-plane has been coded: all of these coefficients have their q0 bit set to 1, so sorting by the first criterion is pointless, and they have already been coded in the correct raster order, so sorting by the second criterion is pointless as well.

Figure 5.2: Sign plane (s) and first two magnitude bit-planes (q0, q1) of a three level dyadic sub-band decomposition of an 8 × 8 image.

The code symbols for the q1 refinement pass are thus:

q1[0, 0], q1[0, 1], q1[0, 3] = 1, 1, 0

The significance pass for q1 then codes all remaining bits in the q1 bit-plane. The symbols from this pass are given by:

IZ, POS,
ZTR, POS, ZTR, ZTR, ZTR, ZTR, IZ, ZTR, ZTR, ZTR, ZTR,
Z, Z, Z, Z, Z, NEG, Z, Z, Z, Z, Z, Z

The first IZ symbol originates from region LH3, since it is 0 but has a non-zero descendant in LH2 of the q1 bit-plane. The q1 significance pass has added three more significant coefficients (q[3, 0], q[1, 1] and q[3, 4]). The coding continues with the q2 refinement pass, the q2 significance pass, the q3 refinement pass and so on.
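The symbol decision for a single coefficient can be sketched as follows. This is illustrative code, not from the thesis: the quadtree helper only covers coefficients outside the LL band (the LL parent-child relation of figure 5.1 is special), and the bookkeeping of already-significant coefficients is ignored:

```python
def children(x, y, n):
    """Quadtree children of coefficient (x, y) in an n x n transform.
    Valid outside the LL band; the finest sub-bands have no children."""
    if 2 * x >= n or 2 * y >= n:
        return []
    return [(2 * x, 2 * y), (2 * x + 1, 2 * y),
            (2 * x, 2 * y + 1), (2 * x + 1, 2 * y + 1)]

def descendants(x, y, n):
    stack, out = children(x, y, n), []
    while stack:
        c = stack.pop()
        out.append(c)
        stack.extend(children(c[0], c[1], n))
    return out

def ezw_symbol(q, x, y, p):
    """EZW symbol for quantized coefficient q[x][y] in the bit-plane at
    bit position p (the document's q0 corresponds to the highest p)."""
    n = len(q)
    if (abs(q[x][y]) >> p) & 1:
        return 'POS' if q[x][y] > 0 else 'NEG'
    desc = descendants(x, y, n)
    if not desc:
        return 'Z'
    if all((abs(q[i][j]) >> p) & 1 == 0 for i, j in desc):
        return 'ZTR'
    return 'IZ'

q = [[0] * 8 for _ in range(8)]
q[4][0] = 1                              # one significant child
assert ezw_symbol(q, 2, 0, 0) == 'IZ'    # zero bit, non-zero descendant
q[4][0] = 0
assert ezw_symbol(q, 2, 0, 0) == 'ZTR'
assert ezw_symbol(q, 5, 5, 0) == 'Z'     # finest band: no descendants
```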
5.2 SPIHT

The SPIHT algorithm has much in common with the EZW algorithm, but there are also many differences. In particular, the parent-child relationship in LLD is altered (figure 5.3), the order of the refinement and significance passes is reversed, a new type of zero-tree is introduced, more inter-bit-plane memory is exploited and all SPIHT symbols are binary. The operations performed prior to the actual zero-tree coding, e.g. the wavelet transform and the quantization, are practically identical to those of the EZW algorithm. Both algorithms apply a D level orthonormal transform and quantize the transform indices with a dead-zone scalar quantizer.

Figure 5.3: Parent child relationships in SPIHT.

5.2.1 Different Zero-Trees

SPIHT uses two different kinds of zero-trees, Type A and Type B. Both are used to signal the existence of insignificant sets of coefficients. Type A indicates that all descendants of the root, within a given bit-plane, are zero. The root itself does not need to be zero, and this differs from the EZW zero-tree. In fact, although the Type A zero-tree is specified by the coordinates of its root, the root itself is not included in the tree.

The Type B zero-tree indicates that all grandchildren, great grandchildren, etc., are zero. The Type B tree includes neither the root itself nor the root's children. Type B trees can be thought of as the union of four Type A trees, each of which originates from a child of the Type B root.

5.2.2 Ordered Lists

SPIHT is best explained by a set of ordered lists, used internally by the encoder and the decoder to keep track of significant and insignificant sets of coefficients. There are three kinds of lists:

• List of Insignificant Coefficients (LIC).

• List of Significant Coefficients (LSC).

• List of Insignificant Sets of coefficients (LIS).

LIC holds the indices of insignificant coefficients; one item in the LIC corresponds to one single insignificant coefficient.
LSC holds the indices of significant coefficients, and one item in this list corresponds to one single significant coefficient. The LIS stores the locations of Type A and Type B zero-tree roots. All coefficients are represented by these three lists.

5.2.3 The Coding Passes

First, the coding is initialized by adding each coefficient in LLD to the LIC. The LSC is initially empty, since nothing is known yet about the significance of the coefficients. Every coefficient within the LLD sub-band which has descendants is added as a Type A root to the LIS, thus initially indicating that all coefficients are insignificant.

The order of the significance and refinement passes is reversed compared to EZW. Thus, coding starts with a significance pass, followed by a refinement pass through the current bit-plane. The refinement pass codes a refinement bit for each coefficient that was significant at the end of the previous bit-plane. All coefficients that became significant in the significance pass just prior to the refinement pass can be regarded as "newly significant" coefficients; these are not coded in the following refinement pass.

The significance pass starts by examining the coefficients in the LIC for the current bit-plane. The bit-values of the coefficients are coded; if a bit is coded as 1, its sign bit is coded immediately thereafter and the coefficient is moved from the LIC to the LSC. The list of LIS roots is examined next. If the descendants of a root remain insignificant, a 0 is coded and processing proceeds at the next entry in the LIS. If all descendants are not insignificant, a 1 is coded. Now one of two different actions is taken, depending on the type of the root: 1) A Type A root is changed to Type B and moved to the bottom of the LIS. The four children of the root are coded as if they were in the LIC; after this, the children are either in the LIC or in the LSC, depending on their bit-values. 2) A Type B root is deleted from the LIS.
Its children are added as Type A roots to the LIS. Processing continues in this fashion throughout all entries in the LIS. Note that entries added during the current pass, e.g. from examining a Type B root, are also processed in the same pass.

The refinement pass then codes the magnitude bits of the coefficients in the LSC, i.e. the coefficients that were significant at the start of the current pass. Consequently, the refinement pass of the q0 bit-plane is skipped.

5.2.4 The SPIHT Algorithm

For a given coefficient at location [x, y], let C[x, y] be the set of its children. Let D[x, y] represent all descendants of [x, y] and let G[x, y] represent the set of grandchildren, great grandchildren, etc., of [x, y]. s[x, y] represents the sign of [x, y]. With this notation,

$$G[x, y] = D[x, y] - C[x, y] \qquad (5.1)$$

Let Sk(B) represent a boolean function which returns 0 if the coefficients of bit-plane k, described by B, are all 0, and returns 1 if at least one coefficient is 1. The SPIHT algorithm can be described in pseudo code using the notation above:

1. Initialization

• Set k = 0, LSC = empty, LIC = {all coordinates [x, y] ∈ LLD}, LIS = {all coordinates of LIC that have children}. All entries of the LIS are set to Type A.

2. Significance Pass

• For each [x, y] ∈ LIC, do:

– Output qk[x, y]. If qk[x, y] = 1, output s[x, y] and move [x, y] to the end of the LSC.

• For each [x, y] ∈ LIS, do:

– If the set is of Type A, output Sk(D[x, y]). If Sk(D[x, y]) = 1, then

∗ For each [i, j] ∈ C[x, y], output qk[i, j]. If qk[i, j] = 0, add [i, j] to the LIC. Else, output s[i, j] and add [i, j] to the LSC.

∗ If G[x, y] exists for [x, y], then move [x, y] to the end of the LIS as a Type B root. Else delete [x, y] from the LIS.

– If the set is of Type B, output Sk(G[x, y]). If Sk(G[x, y]) = 1, then add each [i, j] ∈ C[x, y] to the end of the LIS (as Type A) and delete [x, y] from the LIS.

3. Refinement Pass

• For each [x, y] ∈ LSC, output qk[x, y].
Only coefficients that were in the LSC before the most recent significance pass should be refined.

4. Set k = k + 1 and go to step 2.

5.2.5 Example of SPIHT

Figure 5.4: Sign plane and first two magnitude bit-planes of a D = 2 level dyadic sub-band decomposition of an 8 × 8 image.

The sign plane and the first two magnitude bit-planes (q0 and q1) are shown in figure 5.4. Coding starts with the initialization step, after which the lists are given by

LSC: (empty)
LIC: [0,0], [1,0], [0,1], [1,1]
LIS: [1,0]A, [0,1]A, [1,1]A

The significance pass starts with the coding of the coefficients in the LIC. Thus, q0[0, 0] = 1 is output, followed by s[0, 0] = +. Then q0[1, 0] = 0, q0[0, 1] = 1 followed by s[0, 1] = −, and q0[1, 1] = 0 are output, resulting in [0, 0] and [0, 1] being moved to the LSC. Processing then proceeds to the LIS. S0([1, 0]) = 1 is output, since not all descendants of [1, 0] are 0. The children of [1, 0], C[1, 0] = {[2, 0], [3, 0], [2, 1], [3, 1]}, are then coded. The resulting output values are 0, 0, 0, 0. All of the children are added to the LIC, since none of them were significant. The root [1, 0] is changed to Type B and moved to the bottom of the LIS. The next entry in the LIS is [0, 1]; S0([0, 1]) = 1 is output and the children are coded. The output for C[0, 1] = {[0, 2], [1, 2], [0, 3], [1, 3]} is 0, 0, 0, 1, +. The coefficients [0, 2], [1, 2] and [0, 3] are added to the LIC, while [1, 3] is added to the LSC. The root itself is changed to Type B and moved to the bottom. The last original entry in the LIS, [1, 1], is coded 0 since all its descendants are 0. [1, 0]B is coded 1 and C[1, 0] = {[2, 0], [3, 0], [2, 1], [3, 1]} are added to the LIS as Type A roots. [0, 1]B is coded 0, since all its grandchildren are 0. The children of [1, 0] are examined next, since there are no other entries in the LIS. S0([2, 0]) = 0, S0([3, 0]) = 1; this root is not changed to Type B, because it has no grandchildren. The output for C[3, 0] is 1, -, 0, 0, 0.
The outputs S0([2, 1]) = 0 and S0([3, 1]) = 0 finish the significance pass for q0, and the total output is

1,+,0,1,-,0,1,0,0,0,0,1,0,0,0,1,+,0,0,1,0,1,1,-,0,0,0,0,0

As mentioned previously, the refinement pass for q0 is skipped. The ordered lists now look like this:

LSC: [0,0], [0,1], [1,3], [6,0]
LIC: [1,0], [1,1], [2,0], [3,0], [2,1], [3,1], [0,2], [1,2], [0,3], [7,0], [6,1], [7,1]
LIS: [1,1]A, [0,1]B, [2,0]A, [2,1]A, [3,1]A

The q1 significance pass starts with the LIC; the output is

1,-,0,0,1,+,0,0,0,0,1,+,1,+,0,1,-

and [1, 0], [3, 0], [0, 3], [7, 0] and [7, 1] are moved to the LSC. Now the roots described in the LIS are coded. All descendants of [1, 1] are still insignificant, so a 0 is output. S1([0, 1]) = 1, so [0, 2], [1, 2], [0, 3] and [1, 3] are added to the LIS as Type A roots and [0, 1] is deleted from the LIS. Coding continues with S1([2, 0]) = 0, S1([3, 0]) = 0, S1([3, 1]) = 0, S1([0, 2]) = 0 and S1([1, 2]) = 1. C[1, 2] is coded as 0,1,-,0,0 and [3, 4] is added to the LSC, while [2, 4], [2, 5] and [3, 5] are added to the LIC. The remaining coefficients in the LIS are coded S1([0, 3]) = 0 and S1([1, 3]) = 1. The children of [1, 3] are coded 1,+,0,0,1,- and [2, 6] and [3, 7] are added to the LSC, while [3, 6] and [2, 7] are added to the LIC. The root [1, 3] is deleted from the LIS, since it has no grandchildren. The resulting output of the q1 significance pass is thus

1,-,0,0,1,+,0,0,0,0,1,+,1,+,0,1,-,0,1,0,0,0,0,1,0,1,-,0,0,0,1,1,+,0,0,1,-

The q1 refinement pass outputs refinement bits for [0, 0], [0, 1], [1, 3] and [6, 0]. The total output is 1,1,0,0. Coding is now finished and the lists look like this:

LSC: [0,0], [0,1], [1,3], [6,0], [1,0], [3,0], [0,3], [7,0], [7,1], [3,4], [2,6], [3,7]
LIC: [1,1], [2,0], [2,1], [3,1], [0,2], [1,2], [6,1], [2,4], [2,5], [3,5], [3,6], [2,7]
LIS: [1,1]A, [2,0]A, [2,1]A, [3,1]A, [0,2]A, [0,3]A

5.3 MSC3K

MSC3K, "Mystery Science Codec 3000", is a zero-tree compression technique similar to SPIHT.
The core of the MSC3K codec is the zero-tree compression discussed here; for a more complete overview, see chapter 9.1. Two changes have been made relative to the original SPIHT: first, the coefficients in LLD are treated separately and second, some additional inter-band memory is used. The LLD sub-band holds most of the energy in the image. MSC3K therefore codes all of LLD "raw" before the zero-tree coding begins. The LLD coefficients are coded one bit-plane at a time, starting with the most significant bit-plane.

5.3.1 Knowledge of Quantization

The improvements made to the SPIHT algorithm are based on knowledge of the quantization. Generally, the LLD sub-band contains most of the energy of the image, so this region is favored by the quantization, which gives it more bits per coefficient than the other sub-bands. Figure 5.5 illustrates a typical distribution of bits over the sub-bands after quantization. Knowing this distribution, either from a constant quantization or from additional data, makes it possible to improve speed and efficiency.

Figure 5.5: Example of bit-allocation for different sub-band regions of a D = 2 level dyadic sub-band decomposition of an 8 × 8 image.

With this knowledge about the quantization, the bit-planes can be described with different sizes. For instance, bit-planes q0 and q1 only reside in LL2 in figure 5.5. Similarly, bit-plane q2 resides in LH2, HL2 and LL2, while bit-plane q4 also includes HH2. In this case zero-tree coding begins with bit-plane q2, because in bit-planes q0 and q1 there are no descendants in the higher sub-bands. Why code something already known? The answer is of course to avoid it. With additional information about the bit-depth in the different sub-bands it is possible to discard the coding of some Type B roots, given that the outcome is already known. This also saves CPU cycles, because it is not necessary to investigate these Type B roots.
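The skip test can be sketched as follows, for a hypothetical D = 2 layout with known per-sub-band bit-depths. The band names, depths and helper names are all assumed for illustration (a real root's descendants may span several bands, and MSC3K's bookkeeping is richer than this):

```python
# Hypothetical magnitude bit-depths per sub-band after quantization;
# bit-plane k exists in a band only if k < depth[band] (plane 0 = MSB).
depth = {'LL2': 6, 'LH2': 4, 'HL2': 4, 'HH2': 3,
         'LH1': 2, 'HL1': 2, 'HH1': 1}

# Bands that can hold the grandchildren of a Type B root in the key band.
GRANDCHILD_BANDS = {'LH2': ['LH1'], 'HL2': ['HL1'], 'HH2': ['HH1']}

def can_skip_type_b(band, k):
    """True if bit-plane k cannot contain a 1 anywhere among the
    grandchildren; S_k(G) = 0 is then already known, so the root can
    be parked instead of coded."""
    return all(k >= depth[b] for b in GRANDCHILD_BANDS[band])

assert can_skip_type_b('LH2', 3) is True    # LH1 only has planes 0-1
assert can_skip_type_b('LH2', 1) is False
```

Roots for which the test succeeds are exactly the ones the next section parks in the XLIS until a bit-plane where their descendants exist.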
5.3.2 The MSC3K Algorithm

The discarded Type B roots are temporarily stored in an ordered list similar to the lists in the SPIHT algorithm. This list, called XLIS (Extra List of Insignificant Coefficients), holds all Type B roots that would have been coded 0 based on the quantization knowledge. The roots are lifted out of the coding process and then inserted into the LIS again at the start of the next bit-plane, provided that the roots have descendants in that bit-plane.

If the aim is a low bitrate, the quantization is often heavy, creating a wide distribution of bit-depths over the sub-bands. In that case the XLIS improvement gives a significant boost in efficiency; if the goal is high quality and low compression, the improvement is less significant. The extra data describing the quantization information that needs to be coded in the bit-stream is small compared to the overall gain.

The MSC3K algorithm can be described by the following pseudo code:

1. Initialization

• Set k = {first bit-plane where LLD has children}, LSC = empty, LIC = empty, LIS = {all coordinates in LLD that have children}. All entries of the LIS are set to Type A.

2. Significance Pass

• For each [x, y] ∈ LIC, do:

– Output qk[x, y]. If qk[x, y] = 1, output s[x, y] and move [x, y] to the end of the LSC.

• For each [x, y] ∈ XLIS, check if [x, y] has grandchildren in the current bit-plane; if so, move it to the end of the LIS.

• For each [x, y] ∈ LIS, do:

– If the set is of Type A, output Sk(D[x, y]). If Sk(D[x, y]) = 1, then

∗ For each [i, j] ∈ C[x, y], output qk[i, j]. If qk[i, j] = 0, add [i, j] to the LIC. Else, output s[i, j] and add [i, j] to the LSC.

∗ If G[x, y] exists for [x, y], using the current bit-plane size (based on quantization), then move [x, y] to the end of the LIS as a Type B root. Else move [x, y] to the XLIS.

– If the set is of Type B, output Sk(G[x, y]).
If Sk(G[x, y]) = 1, then add each [i, j] ∈ C[x, y] to the end of the LIS (as Type A) and delete [x, y] from the LIS.

3. Refinement Pass

• For each [x, y] ∈ LSC, output qk[x, y]. Only coefficients that were in the LSC before the most recent significance pass should be refined.

4. Set k = k + 1 and go to step 2.

Chapter 6

Data Compression

Most of today's data compression methods fall into one of two camps: dictionary based schemes and statistical methods. Dictionary based methods are widely used, but arithmetic coding combined with statistical methods achieves better performance. Data compression in general operates by taking symbols from an input stream and writing codes to a compressed output stream, where the symbols are usually bytes or bits. Dictionary based compression schemes operate by replacing groups of symbols in the input stream with fixed length codes; this is the case in LZW. Statistical methods take a completely different approach: they encode symbols one at a time into variable length output codes, where the length of each output code varies with the probability, or frequency, of the symbol. Low probability symbols are encoded using many bits, and high probability symbols are encoded using fewer bits. In practice, the line between statistical and dictionary methods is not always distinct.

6.1 Huffman Coding

Huffman coding is probably the best known coding method based on probability statistics. First the probabilities of the incoming symbols in the input stream are calculated. The method, published by D. A. Huffman in 1952, then creates a code table for the symbols, given their probabilities. The Huffman code table has been proved to produce the lowest possible output bit count for the input stream of symbols, when each symbol is coded with an integral number of bits.
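Huffman's construction can be sketched with a priority queue (illustrative code with a classic frequency table, not code from the thesis). It repeatedly merges the two least frequent groups, lengthening every code inside them by one bit:

```python
import heapq

def huffman_code_lengths(freq):
    """Return {symbol: code length in bits} for a {symbol: frequency} map."""
    # heap entries: (total frequency, tie-break id, member symbols)
    heap = [(f, i, [s]) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    length = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, i, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:       # one more merge = one more bit
            length[s] += 1
        heapq.heappush(heap, (f1 + f2, i, syms1 + syms2))
    return length

lengths = huffman_code_lengths({'a': 45, 'b': 13, 'c': 12,
                                'd': 16, 'e': 9, 'f': 5})
```

For this frequency table the most common symbol ends up with a 1-bit code and the two rarest with 4-bit codes, a weighted total of 224 bits per 100 symbols; no assignment of integral-length prefix codes can do better.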
Huffman coding assigns an output code to each symbol, with the output codes being as short as 1 bit or considerably longer than the input symbols, depending entirely on their probabilities. The optimal number of bits to be used for a symbol is

$$\log_2\left(\frac{1}{p}\right) \qquad (6.1)$$

where p is the probability of the symbol. Thus, if the probability of a character is 1/256, as would be found in a random byte stream, the optimal number of bits per character is log2(256), or 8. If the probability goes up to 1/2, the optimal number of bits needed to code the character goes down to 1.

The problem with this scheme lies in the fact that Huffman codes have to be an integral number of bits long. For example, if the probability of a character is 1/3, the optimal number of bits to code that character is around 1.6. The Huffman coding scheme has to assign either 1 or 2 bits to the code, and either choice leads to a longer compressed message than is theoretically possible. This non-optimal coding becomes a noticeable problem when the probability of a character is very high. If a statistical method can be developed that assigns a 90% probability to a given character, the optimal code size would be 0.15 bits, but the Huffman coding system would probably assign a 1 bit code to the symbol, which is 6 times longer than necessary.

6.2 Adaptive Coding

Another problem with Huffman coding arises when there is no statistical information about the incoming symbols. In this case the compression method must adapt, continuously updating the probabilities of the symbols. The problem with adaptive Huffman coding is that rebuilding the Huffman tree is a very expensive process. For an adaptive scheme to be efficient the probabilities must be updated after every incoming symbol, and in the Huffman case this becomes very time-consuming.
When doing non-adaptive coding, the compression scheme makes a single pass over the incoming symbols to collect statistics. The incoming symbols are then encoded using the statistics, which remain unchanged throughout the encoding process. In order for the decoder to decode the compressed stream, it first needs a copy of the statistics. The encoder usually adds a statistics message to the compressed stream so the decoder can read it before starting. This obviously adds some extra data, but a frequency count of each character can usually be stored in as little as 256 bytes with fairly high accuracy. The other option is to use predefined tables; this is quite common in standards, e.g. JPEG and MPEG-1 Audio Layer III (MP3). In order to bypass this problem, adaptive data compression is called for. In adaptive compression both the encoder and the decoder initiate their statistical models in the same state. Each of them processes one symbol at a time and updates its model. A slight amount of efficiency is lost by starting from a non-optimal state, but it is usually more efficient in the long run than passing the statistics in the compressed stream.

6.3 Arithmetic Coding

Arithmetic coding bypasses the idea of replacing an input symbol with a specific code. Instead, the main idea is to represent a whole set of symbols with a single floating point number. The longer the message, the more bits are needed to describe the corresponding number. It was not until recently that practical methods for implementation in fixed-size registers were found. For more theoretical background, see [4].

6.3.1 The Basics

The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniquely decoded to recreate the exact input symbols. Before encoding, a probability has to be assigned to each symbol. For example, the message ”HELLO” would have the probability distribution in table 6.1.
Symbol   Probability   Range
E        1/5           0.00-0.20
H        1/5           0.20-0.40
L        2/5           0.40-0.80
O        1/5           0.80-1.00

Table 6.1: Probability distribution for the message ”HELLO”.

The individual symbols are also assigned a range along a probability line, which is nominally 0 to 1. It does not matter which characters are assigned which segment of the range, as long as it is done in the same manner by both the encoder and the decoder. The symbol range is in direct proportion to the probability of the symbol. Each symbol ”owns” the assigned range up to, but not including, the higher number. So the symbol ”O” in fact has the range 0.80-0.999999...

New Symbol   Low Limit   High Limit
(initial)    0.00        1.00
H            0.20        0.40
E            0.20        0.24
L            0.216       0.232
L            0.2224      0.2288
O            0.22752     0.2288

Table 6.2: Coding scheme for the message ”HELLO”.

When encoding the first symbol, a high and a low limit are assigned from the range of the symbol. So after the symbol ”H” is encoded, the low limit is 0.20 and the high limit is 0.40. During the rest of the encoding process, each new symbol further restricts the possible range of the output number. The next symbol, ”E”, owns the range 0.00-0.20. This range is applied to the current subrange, 0.20-0.40, thus forming the range 0.20-0.24. Repeating this procedure for the message ”HELLO” results in table 6.2. The pseudo code to accomplish this for a message of any length is

• Initialize and set low = 0.0 and high = 1.0.
• For every input symbol do:
– range = high − low
– high = low + range × highRange(symbol)
– low = low + range × lowRange(symbol)
• Output low.

The final low value 0.22752 uniquely codes the message ”HELLO” given the encoding scheme in table 6.2. With this scheme it is also possible to decode the message. The first symbol in the message is extracted by identifying the symbol that owns the code space corresponding to the number. Since 0.22752 falls in the range 0.20-0.40, the first symbol must be ”H”.
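The encoding pseudo code above, together with its inverse, can be sketched in a few lines (a toy model in floating point, not production code; the names are ours). Note that the decoder below reads the midpoint of the final interval rather than the low value itself, which keeps floating-point rounding from pushing a value across a range boundary.

```python
# Symbol ranges from table 6.1
RANGES = {"E": (0.00, 0.20), "H": (0.20, 0.40),
          "L": (0.40, 0.80), "O": (0.80, 1.00)}

def encode(message):
    # Narrow the [low, high) interval once per symbol.
    low, high = 0.0, 1.0
    for sym in message:
        rng = high - low
        low, high = (low + rng * RANGES[sym][0],
                     low + rng * RANGES[sym][1])
    return low, high

def decode(number, length):
    # Reverse the process: find the owning symbol, then rescale.
    out = []
    for _ in range(length):
        for sym, (lo, hi) in RANGES.items():
            if lo <= number < hi:
                out.append(sym)
                number = (number - lo) / (hi - lo)
                break
    return "".join(out)

low, high = encode("HELLO")            # low ends up at 0.22752
message = decode((low + high) / 2, 5)  # decode the midpoint of the interval
```

Running this reproduces the 0.22752 from table 6.2 and recovers ”HELLO”; the message length must be supplied to the decoder, exactly as discussed above.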
Because the low and high ranges of ”H” are known, the encoding procedure can be reversed and the number can be updated to give the next symbol. This is done by subtracting the low limit of ”H”, which is 0.20, and then dividing by the range of ”H”. This gives a value of (0.22752 − 0.20)/0.20 = 0.1376, which indicates that the next symbol is ”E”. The pseudo code for the decoding algorithm looks like this

• Get encoded number and do:
– Output the symbol that falls in the range of the number.
– range = highRange(symbol) − lowRange(symbol)
– number = number − lowRange(symbol)
– number = number/range
• Start over with the new number.

Note that there is no way of telling when there are no more symbols to decode. This is solved by either encoding a special end-of-message symbol or transmitting the message length along with the coded message. The decoding algorithm for the ”HELLO” message proceeds like this:

Encoded Number   Output   Low    High   Range
0.22752          H        0.20   0.40   0.20
0.1376           E        0.00   0.20   0.20
0.688            L        0.40   0.80   0.40
0.72             L        0.40   0.80   0.40
0.8              O        0.80   1.00   0.20

The number remaining after the last step is 0.

6.3.2 Implementation

At first glance, arithmetic coding seems impossible to implement, since it in theory requires infinite precision. Even with 80-bit float registers it is not possible to code more than 10-15 bytes, and the float operations would take a lot of time when working with massive amounts of data. As it turns out, arithmetic coding is best implemented using standard 16-bit or 32-bit integer math. No floating point math is required, nor would it help to use it. What is used instead is an incremental transmission scheme, where fixed size integer state variables receive new bits in at the low end and shift them out the high end, forming a single number that can be as many bits long as the computer's storage medium allows. Going from float to integer requires some adaptations. The range 0.0-1.0 can be represented as 0.0-0.99999... or 0.1111... in binary.
For 16-bit registers this would mean that high is initially set to 0xFFFF and low to 0x0000. The high value continues with FFs forever, and low continues with 0s forever, so extra bits can be shifted in with impunity when they are needed. The 5-digit decimal register equivalent would be high = 99999 and low = 00000. The same encoding algorithm from the previous section can be applied to calculate new range numbers. Note that the range between the two registers is taken to be 100000, not 99999: with the implied infinite number of trailing 9s in the high register, the true width of the interval is 100000. Coding the message ”HELLO” once again gives the range 0.20-0.40 for the symbol ”H”. In register representation this corresponds to high = 40000 and low = 20000, but the high value needs to be decremented, once again because of the implied trailing 9s in the integer register. So the new value of high is 39999. Coding ”E” results in high = 23999 and low = 20000. At this point the most significant digits of high and low match. Due to the nature of arithmetic coding, high and low will continue to grow closer and closer without ever matching. This means that once both registers match in their most significant digit, that digit will never change. This digit can now be shifted out and output as the first digit of the encoded number. The shifting is done by shifting both high and low left by one digit and shifting in a 9 in the least significant digit of high. Processing the whole ”HELLO” message gives

Action          High    Low     Range    Tot. Output
Initial state   99999   00000   100000
Encode H        39999   20000   20000
Encode E        23999   20000   4000
Shift out 2     39999   00000   40000    0.2
Encode L        31999   16000   16000    0.2
Encode L        28799   22400   6400     0.2
Shift out 2     87999   24000   64000    0.22
Encode O        87999   75200   12800    0.22
Shift out 7     79999   52000   28000    0.227
Shift out 5     99999   20000   80000    0.2275
Shift out 2     99999   00000   100000   0.22752

When all symbols have been encoded, the registers have to be shifted out until they reach their initial states. The shifting can be done from either the high or the low register; the output number will still fall within the right range.

6.3.3 Underflow

Under special circumstances the high and low registers may converge on a value without their most significant digits ever matching. E.g. if high = 40004 and low = 39996, the calculated range is going to be only one digit long, which means that the output word will not have enough precision to be accurately encoded. After a few more iterations the registers may be permanently stuck at high = 40000, low = 39999. The range is now too small and any calculations will always return the same values. To prevent this underflow from happening, the encoding algorithm has to be altered. If the most significant digits of high and low are one apart, a test must be performed to see if the second most significant digit is 0 in high and 9 in low. If this is the case, the risk of underflow is high and measures have to be taken. Instead of shifting out the most significant digit, the second digit is deleted instead. An underflow counter has to be incremented to signal that a 0 and a 9 have been thrown away. The operation looks like this

Register    Before   After
High        40688    46889
Low         39232    32320
Underflow   0        1

After each recalculation operation, the underflow check has to be performed again. In case of underflow, the procedure is repeated and the underflow counter is increased by one.
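Using 5-digit decimal registers as in the walk-through above, the shifting encoder can be sketched as follows (our own toy code; it uses integer cumulative counts for the ”HELLO” model and omits the underflow handling, which this particular message never triggers — a general implementation must include it).

```python
# Cumulative counts for the "HELLO" model: E=[0,1), H=[1,2), L=[2,4), O=[4,5)
CUM = {"E": (0, 1), "H": (1, 2), "L": (2, 4), "O": (4, 5)}
TOTAL = 5

def encode_decimal(message):
    low, high = 0, 99999          # 5-digit decimal registers
    digits = []
    for sym in message:
        rng = high - low + 1      # implied trailing 9s make the width exact
        c_lo, c_hi = CUM[sym]
        high = low + rng * c_hi // TOTAL - 1   # -1 for the implied trailing 9s
        low = low + rng * c_lo // TOTAL
        # Shift out digits while the most significant digits match;
        # shift a 0 into low and a 9 into high.
        while high // 10000 == low // 10000:
            digits.append(high // 10000)
            low = (low % 10000) * 10
            high = (high % 10000) * 10 + 9
    # Flush: shift out the remaining digits of low until the registers
    # are back in their initial states (sufficient for this example).
    while not (low == 0 and high == 99999):
        digits.append(low // 10000)
        low = (low % 10000) * 10
        high = (high % 10000) * 10 + 9
    return "".join(map(str, digits))
```

Running `encode_decimal("HELLO")` reproduces the register contents tabulated above and emits the digits 22752, i.e. the number 0.22752.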
When the most significant digits finally match, the digit is first output, followed by the ”underflow” digits that were previously discarded. The underflow digits will all be 0s or 9s, depending on whether high and low converged to the higher or the lower value.

6.3.4 Decoding

Just like in the encoding process, the precision is limited to 16-bit or 32-bit integer registers. A high and a low register are used, and these correspond exactly to the values the high and low registers held in the encoding procedure. An input buffer for the code stream is also needed. By following the same procedure as the encoder, the decoder knows when it is time to shift new bits into the register. The same underflow checks are performed as well.

6.3.5 Arithmetic vs. Huffman

The advantage of arithmetic coding shows when the probabilities vary more. Assume that the message ”QQQQQQQE” is about to be encoded, where the probability of ”Q” is 0.9 and that of ”E” is 0.1. The encoding process is listed in table 6.3.

New Symbol   Low Limit    High Limit
(initial)    0.00         1.00
Q            0.00         0.90
Q            0.00         0.81
Q            0.00         0.729
Q            0.00         0.6561
Q            0.00         0.59049
Q            0.00         0.531441
Q            0.00         0.4782969
E            0.43046721   0.4782969

Table 6.3: Coding scheme for the message ”QQQQQQQE”.

Any number within the final range, e.g. 0.45, will uniquely decode the message. Those two decimal digits take slightly less than 7 bits to specify, which means that eight symbols have been encoded in less than eight bits, compared to an optimal Huffman coded variant, which would need at least 8 bits (one bit per symbol). If the probability of a symbol is even higher, e.g. if the byte value ”0” has a probability of 16382/16383 and EOF (short for ”End Of File”) has 1/16383, then an input data stream with 100,000 ”0” bytes followed by an EOF could be encoded using only approximately 3-4 bytes. The Huffman approach would yield about 12,500 bytes.

Chapter 7 Video Compression

The terminology used for still image compression needs to be extended for video compression.
A still image is from now on referred to as a frame. Several frames make up a video sequence. The frame rate, in frames per second (fps), determines the video update frequency; typical values are 23.976 (film), 25 (PAL) and 30 (NTSC). There are essentially three different frame types in all types of video compression methods.

• I-frames
• P-frames
• B-frames

I-frames are intra coded, i.e. they can be reconstructed without any reference to other frames. This is the same as still image compression. Some compression methods, like MJPEG (motion JPEG), use only I-frames, i.e. the video is encoded as a series of independent still images. This is suitable for video editing systems but not efficient from a compression point of view, since frames based on previous frames can significantly reduce the amount of data to encode. P-frames are forward predicted from the last I-frame or P-frame, i.e. it is impossible to fully reconstruct them without a previous P- or I-frame. B-frames are both forward predicted and backward predicted from the last/next I-frame or P-frame, thus requiring two other frames to be reconstructed. P-frames and B-frames are often referred to as inter coded frames. The relationship between these frame types is illustrated in figure 7.1.

Figure 7.1: An I-frame followed by P- and B-frames. The arrows indicate inter-frame relationships.

7.1 Predictive Frames

The predictive frames require information from other frames. When encoding a set of frames, the encoder measures e.g. the difference between the current frame and the previous frame. If the difference is small, the movement between the two consecutive frames is small or negligible. In this case, instead of encoding the current frame as an I-frame, it is more efficient to encode the difference. This difference, or error frame, can be encoded as a P- or B-frame.
If there is little movement between frames, the difference will not take much bitrate to encode, and P- and B-frames will then continually improve the image quality. The encoder has to decide when to code a new I-frame, usually when there is much movement between frames or when the time interval between I-frames reaches a certain limit. Motion estimation and motion compensation are vital parts of video encoding. The purpose is to detect motion between frames and compensate for it, in order to minimize the error frame. Motion compensation is applied to P- and B-frames. The encoder estimates the motion and supplies motion vectors to predict the movements between the frames.

Figure 7.2: Two consecutive video frames, (a) and (b).

Figure 7.3: Reconstruction of frame b with a motion vector and an error frame.

To illustrate the use of motion vectors and P-frames, imagine a rectangular shape as in figure 7.2 (a). In the next frame the rectangle has moved down to the right and rotated 5°, figure 7.2 (b). The encoder calculates the motion vectors that minimize the error frame, and encodes the vectors and the error frame as a P-frame. Figure 7.3 shows how the decoder then reconstructs frame b with the help of frame a and the information from the P-frame. Thus, the reconstruction of the inter coded frames proceeds as follows:

1. Apply the motion vector to the referred frame.
2. Add the prediction error compensation to the result from step 1.

7.2 Motion Estimation & Motion Compensation

There are three main approaches to performing motion estimation (ME) and motion compensation (MC). The first approach, Block Motion Estimation (BME) and Overlapped Block Motion Compensation (OBMC), works for both DCT and wavelet-based video compression. The other two approaches, 3-D SPIHT [5] and LBSME/MC [6], are specifically developed for wavelet-based compression and will not be discussed in this thesis, since BME/OBMC has proven to be the best combination regarding computational complexity and visual quality [7].
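The two reconstruction steps above can be sketched for a single block (our own toy example with hypothetical frame data; the function name and block layout are illustrative only):

```python
def reconstruct_block(prev, error, bx, by, bs, mv):
    # Step 1: apply the motion vector, i.e. fetch the block it points at
    #         in the previous (reference) frame.
    # Step 2: add the prediction error to that motion-compensated block.
    dx, dy = mv
    out = [[0] * bs for _ in range(bs)]
    for y in range(bs):
        for x in range(bs):
            predicted = prev[by + y + dy][bx + x + dx]
            out[y][x] = predicted + error[y][x]
    return out

# Toy data: an 8x8 reference frame where pixel (x, y) has value x + y
prev = [[x + y for x in range(8)] for y in range(8)]
error = [[0] * 4 for _ in range(4)]
error[0][0] = 5                      # one residual correction
block = reconstruct_block(prev, error, 2, 2, 4, (1, 1))
```

The decoder needs only the reference frame, the motion vector and the error frame, which is exactly the content of a P-frame.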
BME and OBMC are the approach used by the MPEG-1/2/4 standards. As mentioned earlier, this method is efficient and it has been refined over a long time. The idea is fairly straightforward. The frame is divided into a number of blocks; the most common sizes are 16x16 or 8x8 pixels. Each block has a motion vector. Each motion vector is calculated by comparing the current block with a region of the previous frame. The encoder tries to find the region in the previous frame that best matches the block in the current frame. The region in the previous frame must lie within the motion vector range. In e.g. MPEG-1 the range is -63 to +64 pixels for both the x and y axes, but if the video resolution is low then maybe -15 to +16 is sufficient. Figure 7.4 represents a previously decoded I- or P-frame; the dark green area represents the range of a possible motion vector for the black block. The dark green area measures 128x128 pixels and thus the motion vectors have a -63 to +64 pixel range in x and y. The white rectangle represents the current area that is to be compared with the block from the next frame (the black rectangle).

Figure 7.4: The darkened green area represents the possible motion vector range of the black block. The white area will move within the range and will be compared with the black one until a good motion vector that minimizes the error is found.

For each motion vector comparison, the current block is compared to the block given by the motion vector offset in the previous frame. If every motion vector is to be tested, an enormous number of computational operations is required. The differences between the pixels in the blocks are summed; for a 16x16 pixel block size every motion vector comparison amounts to 16 ∗ 16 ∗ 2 = 512 operations. If the motion vector range is −15 to +16 and all possible combinations of motion vectors are to be tested, then 32 ∗ 32 ∗ 16 ∗ 16 ∗ 2 = 524,288 subtractions and additions are performed for each block.
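The exhaustive search just described can be sketched as follows (our own toy code, with frames as plain lists of rows and a sum-of-absolute-differences cost; a real encoder would add early termination and sub-pixel refinement):

```python
def sad(prev, cur, bx, by, dx, dy, bs):
    # Sum of absolute differences between the current block and the
    # displaced block in the previous frame (2 operations per pixel).
    total = 0
    for y in range(bs):
        for x in range(bs):
            total += abs(cur[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return total

def full_search(prev, cur, bx, by, bs, r):
    # Brute-force block motion estimation: test every displacement in
    # [-r, +r] x [-r, +r] and keep the one with the lowest SAD.
    h, w = len(prev), len(prev[0])
    best_cost, best_vec = None, (0, 0)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if not (0 <= by + dy and by + dy + bs <= h
                    and 0 <= bx + dx and bx + dx + bs <= w):
                continue  # displaced block would fall outside the frame
            cost = sad(prev, cur, bx, by, dx, dy, bs)
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (dx, dy)
    return best_vec

# Toy frames: a bright 4x4 object moves from (4,4) to (6,5)
prev = [[10] * 16 for _ in range(16)]
cur = [[10] * 16 for _ in range(16)]
for y in range(4):
    for x in range(4):
        prev[4 + y][4 + x] = 200
        cur[5 + y][6 + x] = 200
```

For the block at (6, 5) in the current frame, the search finds the vector (−2, −1), i.e. the block that was two pixels to the left and one pixel up in the previous frame, with a zero error.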
For a standard video resolution of 320x256 pixels, this sums up to 524,288 ∗ (320/16) ∗ (256/16) = 167,772,160 subtractions and additions. This brute force approach to calculating the optimum motion vector can of course not be used in real time, since it would take too much CPU time. The problem with all block based motion estimation algorithms is to calculate an adequate motion vector within a given time period. There are numerous approaches, but only the brute force method can guarantee that the optimum vector is found (unless the error happens to be 0). Some methods have a specific search order and others use predictive motion vectors based on neighboring vectors. Usually a threshold value determines when an adequate motion vector is found. Once all block motion vectors are calculated, they are encoded together with the error frame. The error frame is encoded as a still image with either DCT or wavelet-based compression. The difficulty with BME/OBMC video encoding amounts to deciding the I/P/B-frame structure and finding good motion vectors.

Chapter 8 JPEG 2000 Overview

8.1 Features

The JPEG 2000 standard supports both lossy and lossless compression of single-component (e.g. grayscale) and multi-component (e.g. color) imagery. In addition to basic compression functionality, the standard also supports the following features

• Progressive recovery of an image by fidelity or resolution.
• Region of interest coding (ROI), whereby different parts of an image can be coded with different fidelity.
• Random access to particular regions of an image without needing to decode the entire code stream.
• Good error resilience.

Due to its excellent coding performance and attractive features, JPEG 2000 has a large potential application base. Work on the JPEG 2000 standard started in March 1997. The purpose of having a new standard was twofold. First, it would address a number of weaknesses in the existing JPEG standard.
Second, it would provide a number of new features not available in the JPEG standard. Much of the success of the original JPEG standard can be attributed to its royalty-free nature. JPEG 2000, however, still struggles with some intellectual property issues that prevent it from being royalty-free. In the long run, these issues may also hold back the standard.

8.2 JPEG 2000 Codec

The codec is based on wavelet/sub-band coding techniques. In order to facilitate both lossy and lossless coding, reversible integer-to-integer and nonreversible real-to-real transforms are employed (see chapter 3). From the codec's point of view, an image is comprised of one or more components. Each component consists of a rectangular array of samples. The sample values of each component are integer valued, and can be either signed or unsigned with a precision of 1 to 38 bits/sample, specified on a per-component basis. All of the components are associated with the same spatial extent in the source image, but represent different spectral or auxiliary information. For example, an RGB color image has three components (red, green and blue). In the case of a grayscale image, there is only one component, corresponding to the luminance/intensity plane. The various components of an image need not be sampled at the same resolution. For example, when color images are represented in a luminance/intensity-chrominance color space (see chapter 2.5), the luminance information is often more finely sampled than the chrominance data.

8.2.1 Preprocessing

If samples are unsigned, a bias is subtracted so that the dynamic range is centered around 0, as in chapter 2.3. If the image is to be coded in a lossy manner, the data is also normalized (chapter 2.4). The intercomponent transform, as it is called in the JPEG 2000 standard, now follows.
Two intercomponent transforms are defined in the baseline JPEG 2000 codec: the irreversible color transform (ICT) and the reversible color transform (RCT).

Figure 8.1: The structure of the (a) encoding and (b) decoding process of the JPEG 2000 standard.

The ICT has already been thoroughly explained in chapter 2.5. The RCT has transform coefficients that are powers of 2 and thus allows for an integer-to-integer transform. In lossy compression the ICT is used, and in lossless the RCT. Both transforms map image data from the RGB to the YCbCr color space and operate on the first three components, with the assumption that these are the RGB color planes.

8.2.2 Wavelet transform

In the standard this is referred to as an intracomponent transform. For lossless compression the LeGall 5/3 filter (appendix B.1) is used. The wavelet transform and an example of LeGall 5/3 are covered in chapter 3. In the case of lossy compression the Daubechies 9/7 filter is used instead (appendix B.2). The number of resolution levels is a parameter for each transform. A typical number is 5, for a sufficiently large image. In the decoding procedure the appropriate inverse wavelet transform is applied.

8.2.3 Quantization

In the encoder, after the intracomponent transform, the resulting coefficients are quantized. Quantization allows greater compression to be achieved, by representing transform coefficients with only the minimal precision required to obtain the desired level of image quality. Quantization is one of the two primary sources of information loss in the coding path (the other being the discarding of coding passes). Transform coefficients are quantized using scalar quantization with a dead zone. A different quantizer is employed for the coefficients of each sub-band, and each quantizer has only one parameter, its step size. This type of quantization is described in chapter 4.1. In the case of lossless compression no quantization is applied.
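A minimal sketch of such a dead-zone scalar quantizer follows (our own illustration; function names are ours, and JPEG 2000 additionally signals the step size per sub-band and allows a configurable reconstruction offset):

```python
import math

def deadzone_quantize(x, step):
    # Uniform dead-zone scalar quantizer: q = sign(x) * floor(|x| / step).
    # The bin around zero is twice as wide as the others, which is what
    # zeroes out many small wavelet coefficients.
    return int(math.copysign(math.floor(abs(x) / step), x))

def deadzone_dequantize(q, step, delta=0.5):
    # Reconstruct a value delta of the way into the decoded bin.
    if q == 0:
        return 0.0
    return math.copysign((abs(q) + delta) * step, q)
```

For example, with step size 1.0 the coefficients 3.7 and −3.7 map to the indices 3 and −3, while anything with magnitude below 1.0 falls into the dead zone and maps to 0.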
8.2.4 Tier-1 & Tier-2 Coding

After quantization is performed in the encoder, tier-1 coding takes place. This is the first of two coding stages. The quantizer indices for each sub-band are partitioned into code blocks. Code blocks are rectangular in shape, and their nominal size is a free parameter of the coding process, subject to certain constraints. After a sub-band has been partitioned into code blocks, each block is independently coded using a context based adaptive binary arithmetic coder, more specifically the MQ coder, also used in the JBIG2 standard [8]. For each code block, an embedded code is produced, comprising numerous coding passes. It is the discarding of these coding passes that is the second source of information loss; thus no passes can be discarded in the lossless encoding path. The bit-plane encoding process generates a sequence of symbols for each coding pass. Some or all of these symbols may be entropy coded with the MQ coder. Although the bit-plane coding technique employed is similar to those used in EZW and SPIHT coding, there are two notable differences:

1. Interband dependencies are not exploited.
2. There are three coding passes per bit-plane instead of two.

This chapter will not go into any details about the coding; for more information see [9]. In the encoder, tier-1 encoding is followed by tier-2 encoding. The input to the tier-2 encoding process is the set of bit-plane coding passes generated during tier-1 encoding. In tier-2 encoding, the coding pass information is packaged into data units called packets, in a process referred to as packetization. The resulting packets are then output to the final code stream. The packetization process imposes a particular organization on coding pass data in the output code stream. This organization facilitates many of the desired codec features, including rate scalability and progressive recovery by fidelity or resolution. A packet is nothing more than a collection of coding pass data.
Each packet is comprised of two parts: a header and a body. The header indicates which coding passes are included in the packet, while the body contains the actual coding pass data itself. In the code stream, the header and body may appear together or separately, depending on the coding options in effect. Rate scalability is achieved through (quality) layers. The coded data for each tile is organized into L layers, numbered from 0 to L − 1, where L ≥ 1. Each coding pass is either assigned to one of the L layers or discarded. The coding passes containing the most important data are included in the lower layers, while the coding passes associated with finer details are included in higher layers. During decoding, the reconstructed image quality improves incrementally with each successive layer processed. In the case of lossy compression, some coding passes may be discarded (i.e. not included in any layer), in which case rate control must decide which passes to include in the final code stream. In the lossless case, all coding passes must be included. Since some coding passes may be discarded, tier-2 coding is the second primary source of information loss in the coding path. More information about JPEG 2000 can be found in the ISO standards [10] and [9].

Chapter 9 MSC3K Overview

As mentioned in earlier chapters, MSC3K is a zero-tree type wavelet-based codec. The codec was developed for real-time low-bandwidth video communication with the following desired features:

• High compression
• Low computational complexity
• Minimum latency (real-time)
• Scalable quality

The name ”Mystery Science Codec 3000” (MSC3K) was inspired by the TV series ”Mystery Science Theater 3000”.

9.1 The Codec

In figure 9.1, the encoding and decoding structure of MSC3K is shown. This is the heart of the MSC3K codec. The preprocessing is essentially level offset (chapter 2.3), normalization (chapter 2.4) and the irreversible color transform (ICT) (chapter 2.5).
Since MSC3K only allows lossy compression, the Daubechies 9/7 filter is used in the wavelet transform, see appendix B.2. The actual transform is implemented using lifting steps, as described in chapter 3, which is the most efficient way to implement a wavelet transform.

Figure 9.1: The structure of the (a) encoding and (b) decoding process of still images in MSC3K.

The quantization method used is the uniform dead-zone quantization described in chapter 4.1. The quantization procedure is the first source of information loss. The next is the zero-tree type compression. MSC3K is a modified version of the SPIHT algorithm (chapter 5.2). The modifications that MSC3K uses are described in chapter 5.3. The zero-tree encoding is controlled by the rate variable, and will terminate once the desired rate is achieved. After the zero-tree encoding a final bitstream is produced, together with a small header that describes size, quantization, color and video settings. The header is about 10 bytes long and is added at the start of the bitstream. The decoder, figure 9.1, takes the bitstream and rebuilds the wavelet tree coefficients in the zero-tree decoder. The effects of the quantization are then compensated, followed by the inverse wavelet transform. Lastly, the postprocessing undoes the effects of the preprocessing and produces a reconstructed image.

9.2 Color Compression

As mentioned earlier, the color image is first transformed to the YCbCr color space with the ICT. The color components, Cb & Cr, are then subsampled by 2 in both directions. So for a 512x512 pixel image, both Cb & Cr will be resized down to 256x256 pixels. This procedure is also used in JPEG & JPEG 2000. The eye is much less sensitive to color information, so the effects of the subsampling are not visible. The color components are then arranged together with the intensity part (Y) to form a single image structure.
This structure is then wavelet transformed, with some modifications to handle the borders between the components. The MSC3K zero-tree type compression encodes the coefficients as if they were a single image and not a three-component image. This way color information is mixed with intensity according to significance/power.

Figure 9.2: (a) Original color image (512x512). (b) Transformed into the YCbCr color space and arranged for wavelet transformation. (c) One step of the DWT.

9.3 Video Compression

MSC3K currently supports I- and P-frames with a BME/OBMC type of motion estimation and motion compensation, see chapter 7. The encoder can either be forced to produce P-frames, or it can be set to automatically decide when a P-frame can be inserted. The previously decoded frame is compared with the current frame. If the error between them is large, the current frame will be encoded as an I-frame; otherwise the encoder goes into P-frame mode.

Figure 9.3: Search pattern for motion vectors in MSC3K.

In P-frame mode, the motion vectors are first calculated. This is done with the block motion estimation method. MSC3K supports 8x8 and 16x16 pixel blocks for the motion estimation. The motion vectors range from about -6 to +6 pixels in a diamond shaped pattern. In figure 9.3 this specific search pattern is demonstrated. The search starts with index 0, the red dot, and then proceeds with 1, 2, 3, etc. The motion vectors are stored as an index that indicates the specific offset. The diamond shaped search pattern has proven to be efficient, since movement is often concentrated around 0 on the x and y axes. The search stops when the error in the current search block is below a certain threshold. In some cases it is better to skip the use of motion vectors and just encode the error frame. Motion vectors for 8x8 pixel blocks at 160x112 pixel resolution take about 300 bytes to store.
Assume for example that the maximum bitrate is set to 1800 bytes; then the BME/OBMC improvement of the error frame must justify the use of these 300 extra bytes. For small movements it is often not justified to use motion vectors. MSC3K measures the improvement of the motion vectors and decides whether or not they will be included in the encoding. The decoder always saves the previously decoded image, so in the case of a P-frame, the previous frame is motion compensated with the motion vectors and then the decoded error frame is added. B-frames have not been implemented, since that would delay the video encoding by about 1-8 frames. For low-bitrate video, the frame rate is often low, 4-8 fps, and in this case a delay of 8 frames would result in a time delay of 1-2 seconds.

9.4 Implementation

The MSC3K codec is developed in C/C++. The codec kernel can easily be adapted to fit any application. The video input system uses DirectX 9 to connect to e.g. webcams or TV cards, which is convenient for in-game applications, since most modern PC games use DirectX. Some GUI (graphical user interface) based test programs have been written that use the MSC3K kernel, e.g. a video test program, see figure 11.6, and a still image compression frontend.

Chapter 10 Audio Compression

When communicating over the Internet, more than images is needed to make a successful conversation; sound is also needed. Our main focus for MSC3K is to use as little bandwidth as possible for image and sound. Therefore, when examining different techniques for audio compression, we chose to use a reference implementation of LPC. LPC produces adequate quality for speech at low bitrates. It is also free to use, without any license fees or restrictions.

10.1 History

In the second half of the 18th century, the idea to construct a speech-producing machine took form. One of the first attempts at mechanical speech generation in recorded history occurred around 1770 at the Imperial Academy of St. Petersburg.
In response to the academy's challenge to explain the physiological differences between five vowels, Kratzenstein won the annual prize for modeling and constructing a series of acoustic resonators patterned after the human vocal tract. The speaking device, crude by today's standards, did however create the vowel sounds with vibrating reeds activated by air passing over them. By varying the acoustic resonators and effectively selecting the formant frequencies by hand, limited speech was generated by this mechanical device.

Figure 10.1: Kratzenstein's resonators for synthesis of vowel sounds. The resonators are actuated by blowing through a free, vibrating reed into the lower end.

Unknown to Kratzenstein at the time, Wolfgang von Kempelen was working in parallel on a more elaborate speaking machine for generating connected speech. He had a different approach to the problem of synthesized speech: he identified the speech organs and tried to simulate them as closely as possible. Kempelen's speaking machine used a bellows to supply air to a reed which, in turn, excited a single, hand-varied resonator for producing voiced sounds. Consonants, including nasals, were simulated by four separate constricted passages, controlled by the fingers of the other hand. This made him the first to build a machine that could produce whole sentences. But since the properties of the vocal cords could not be changed while the machine produced sound, it sounded monotone. An improved version of the machine was built from von Kempelen's description by Sir Charles Wheatstone.

Figure 10.2: Wheatstone's improved construction of von Kempelen's speaking machine.

In 1835, after a period without any significant progress, a man named Joseph Faber succeeded in adding a tongue and a pharyngeal cavity, which could be controlled during speech production, to von Kempelen's model.
He continued his work and succeeded in building a machine that could sing the anthem "God Save the Queen". The evolution of speaking machines then stood still until R. R. Riesz in 1937 demonstrated his mechanical talker which, like the other mechanical devices, was more reminiscent of a musical instrument. The device was shaped like the human vocal tract and constructed primarily of rubber and metal, with playing keys similar to those found on a trumpet. The mechanical talking device produced fairly good speech with a trained operator. With the ten control keys (or valves) operated simultaneously with two hands, the device could produce relatively articulate speech. Riesz had, through his use of the ten keys, allowed for control of almost every movable portion of the human vocal tract. Reports from that time stated that its most articulate speech was produced as it said the word "cigarette".

Figure 10.3: Air under pressure is brought from a reservoir at the right. Two valves, V1 and V2, control the flow. Valve V1 admits air to a chamber L1 in which a reed is fixed. The reed vibrates and interrupts the air flow much like the vocal cords. A spring-loaded slider varies the effective length of the reed and changes its fundamental frequency. Unvoiced sounds are produced by admitting air through valve V2. The configuration of the vocal tract is varied by means of nine movable members representing the lips (1 and 2), teeth (3 and 4), tongue (5, 6 and 7), pharynx (8), and velar coupling (9).

To simplify the control, Riesz also constructed the mechanical talker with finger keys to control the configuration.

10.2 Speech Synthesis

Speech is produced by a cooperation of the lungs, the glottis (with vocal cords) and the articulation tract (mouth and nose cavity). Figure 10.4 shows a cross section of the human speech organ.
For the production of voiced sounds, the lungs press air through the epiglottis and the vocal cords vibrate; they interrupt the air stream and produce a quasi-periodic pressure wave.

Figure 10.4: Cross section of the human speech organ.

The pressure impulses are commonly called pitch impulses, and the frequency of the pressure signal is the pitch frequency or fundamental frequency. Figure 10.5 shows a typical impulse sequence (sound pressure function) produced by the vocal cords for a voiced sound. It is the part of the voice signal that defines the speech melody. When we speak with a constant pitch frequency, the speech sounds monotonous, but in normal speech the frequency changes continuously.

Figure 10.5: Typical impulse sequence produced by the vocal cords for a voiced sound.

The pitch impulses stimulate the air in the mouth and, for certain sounds (nasals), also in the nasal cavity. When the cavities resonate, they radiate a sound wave, which is the speech signal. Both cavities act as resonators with characteristic resonance frequencies, called formant frequencies. The fundamental frequency F0 and the fundamental period N0 (N0 = Fs*T0, where T0 = 1/F0 and Fs is the sampling frequency) will be used later. Since the shape of the mouth cavity can be changed greatly, we are able to pronounce many different sounds. In the case of unvoiced sounds, the excitation of the vocal tract is more noise-like.

10.3 Basic Speech Model

In order to better understand and analyse speech production, we are interested in creating a model of the speech production system which is amenable to further analysis based on theory and techniques from physics. A simple idea is to simulate the vocal tract by a straight tube through which air is blown. It has been found that a tube with a curvature does not sound much different from one without. Furthermore, we assume that the tube is lossless. This means that we do not model energy loss due to friction between the flowing air and the tract walls, or vibration of the walls.
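The relations above, T0 = 1/F0 and N0 = Fs*T0, give the pitch period in samples. A quick worked example (the function name is illustrative):

```python
def pitch_period_samples(f0_hz, fs_hz):
    """Fundamental period in samples: N0 = Fs * T0 = Fs / F0."""
    return fs_hz / f0_hz

# A 100 Hz voice sampled at 8 kHz repeats every 80 samples:
n0 = pitch_period_samples(100.0, 8000.0)
```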
10.4 Speech Production by LPC

In the basic speech model the lungs are replaced by a DC source, the vocal cords by an impulse generator and the articulation tract by a linear filter system. A noise generator produces the unvoiced excitation.

Figure 10.6: Visualisation of the lossless tube model, showing the cross-sectional area A(x) of cylinder x.

In practice, all sounds have a mixed excitation, which means that the excitation consists of voiced and unvoiced portions. Of course, the relation between these portions varies strongly with the sound being generated. In this model, the portions are adjusted by two potentiometers; see figure 10.7a for an illustration. Based on this model, a further simplification can be made (figure 10.7b): instead of the two potentiometers it uses a 'hard' switch which only selects between voiced and unvoiced excitation. The filter, representing the articulation tract, is a simple recursive digital filter; its resonance behavior (frequency response) is defined by a set of filter coefficients. Since the computation of the coefficients is based on the mathematical optimization procedure of linear predictive coding, they are called Linear Prediction Coding coefficients, or LPC coefficients, and the complete model is the so-called LPC vocoder ('vocoder' is a concatenation of the terms 'voice' and 'coding'). In practice, the LPC vocoder is used for speech telephony. Its great advantage is the very low bit rate needed for speech transmission (about 3 kbit/s) compared to PCM (64 kbit/s).

Figure 10.7: a) The human speech production. b) Speech production by a machine.

Another advantage of the LPC vocoder is its manipulation facilities and its close analogy to human speech production. Since the main parameters of the speech production, namely the pitch and the articulation characteristics expressed by the LPC coefficients, are directly accessible, the audible voice characteristics can be widely influenced.
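The LPC coefficients mentioned above are commonly obtained by solving the normal equations with the Levinson-Durbin recursion on the signal's autocorrelation. The following is a generic textbook sketch of that procedure, not the reference implementation used in this thesis:

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a frame x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return LPC coefficients a[1..order] for the predictor
    x_hat[n] = sum_k a[k] * x[n-k], plus the residual energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)           # prediction error shrinks each step
    return a[1:order + 1], err
```

For a first-order decaying signal x[n] = 0.9^n, the recursion recovers the expected predictor coefficient of about 0.9, which is the kind of articulation-tract parameter the vocoder transmits.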
For example, the transformation of a male voice into the voice of a female or a child is very easy. The number of filter coefficients can also be varied to influence the sound characteristics, above all the formant characteristics.

Figure 10.8: LPC

Chapter 11 Results

"A picture says more than a thousand words." This statement is often true, and in this chapter the proper statement would be "a picture says more than any test values". A mathematically calculated quality measure can still be useful as a quick reference to other results, though it might not tell the whole truth.

11.1 Image Distortion Measure

Judging the visual quality of an encoded image compared to the original is not always easy. A single number, representing the overall total distortion in proportion to the peak signal, is often used to compare and describe the visual quality. The Peak Signal to Noise Ratio, or PSNR, is defined by

    MSE = (1 / (N1 N2)) * sum_{n1=0}^{N1-1} sum_{n2=0}^{N2-1} (x[n1, n2] - x̂[n1, n2])^2    (11.1)

    PSNR = 10 log10( (2^B - 1)^2 / MSE )    (11.2)

where MSE is the mean square error of the original image x[n1, n2] compared to the reconstructed image x̂[n1, n2], and B is the number of bits per sample. The PSNR is expressed in dB (decibels). Good reconstructed images typically have PSNR values of 30 dB or more.

This measurement is of course not very accurate as a measure of true visual quality. For example, some areas in the reconstructed image may be well preserved while the background is distorted, giving the image the same PSNR as an image that is evenly distorted. No one can really say that one image is better than the other, because it is a matter of opinion and the true answer lies in the eye of the beholder.

11.2 Still Image Compression Comparison

11.2.1 Test Setup

Three different compression methods will be tested:

• The original JPEG standard. JPEG is the candidate that represents the traditional DCT-based technique.
• The JPEG 2000 standard, entropy-based wavelet compression.
• MSC3K, zero-tree based wavelet compression.
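Equations (11.1) and (11.2) above transcribe directly into code; this sketch assumes images given as lists of rows of integer samples:

```python
import math

def mse(x, xhat):
    """Mean square error between original and reconstructed
    images, as in eq. (11.1)."""
    n1, n2 = len(x), len(x[0])
    return sum((x[i][j] - xhat[i][j]) ** 2
               for i in range(n1) for j in range(n2)) / (n1 * n2)

def psnr(x, xhat, bits=8):
    """Peak signal-to-noise ratio in dB, as in eq. (11.2);
    2^B - 1 is the peak value of a B-bit image (255 for 8 bits)."""
    peak = (2 ** bits - 1) ** 2
    return 10 * math.log10(peak / mse(x, xhat))
```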
BoxTop ProJPEG v5.2, http://www.boxtopsoft.com/projpeg.html, was used to compress standard JPEG images using optimized Huffman tables to achieve the highest possible quality. The JasPer reference implementation of JPEG 2000 was used to compress to the JPEG 2000 standard [11]. The images used in this test were all found on the internet; some of them are not unfamiliar in compression comparisons, e.g. the Lena image, as well as some Kodak pictures and pictures from the motion picture "Lord of the Rings". All original images and all the reconstructed images can be found at http://msc3k.sytes.net. All images will be compressed using lossy compression.

Table 11.1: Specifications of the original images.

    Name       Size     Notes
    Lena       512x512  Gray image.
    Blue Rose  256x256  Color. Cropped from a film frame.
    Gollum     512x512  Color. Cropped from original.
    Wraiths    256x256  Color. Cropped & resized from original.
    Child      256x256  Color. Cropped & resized from original.

Table 11.2: Results from Lena (ratio 1:100).

    Method     PSNR (dB)
    MSC3K      28.30
    JPEG 2000  28.35
    JPEG       26.27

11.2.2 Lena

As shown in table 11.2, JPEG 2000 got a slightly better PSNR value than MSC3K. This is a surprisingly good result for a zero-tree type compression that does not use arithmetic coding. JPEG is trailing behind; in this test, JPEG could actually not be compressed more than 1:90. For more information about Lena, see [12].

11.2.3 Blue Rose

As shown in table 11.3, MSC3K got the best Y PSNR value in both the 1:100 and the 1:200 case. In the 1:200 case MSC3K also got the best total PSNR value.

11.2.4 Gollum

As shown in table 11.4, JPEG 2000 got the best test values in this case.
The Cr value is +2.72 dB better than for MSC3K; a possible explanation is that MSC3K does not range the color sub-bands, thus making some color information less important than it really is.

Table 11.3: Results from Blue Rose.

    Ratio 1:100
    Method     Y PSNR (dB)  Cb PSNR (dB)  Cr PSNR (dB)  Tot PSNR (dB)
    MSC3K      33.93        36.71         39.78         36.80
    JPEG 2000  33.50        39.01         40.75         37.75
    JPEG       32.89        37.50         35.77         35.39

    Ratio 1:200
    Method     Y PSNR (dB)  Cb PSNR (dB)  Cr PSNR (dB)  Tot PSNR (dB)
    MSC3K      32.10        34.00         36.89         34.33
    JPEG 2000  30.35        34.29         36.45         33.70
    JPEG       28.61        30.47         31.84         30.31

Table 11.4: Results from Gollum.

    Ratio 1:100
    Method     Y PSNR (dB)  Cb PSNR (dB)  Cr PSNR (dB)  Tot PSNR (dB)
    MSC3K      22.77        32.40         30.35         28.51
    JPEG 2000  22.94        33.36         31.23         29.18
    JPEG       22.21        30.73         29.01         27.32

11.2.5 Wraiths

As shown in table 11.5, MSC3K got the better Cb PSNR value in the 1:50 case, while JPEG 2000 got a slightly better total value. In the 1:100 case we again see that MSC3K got a slightly better Y PSNR value and that JPEG 2000 got the best total value, closely followed by MSC3K.

Table 11.5: Results from Wraiths.

    Ratio 1:50
    Method     Y PSNR (dB)  Cb PSNR (dB)  Cr PSNR (dB)  Tot PSNR (dB)
    MSC3K      22.94        33.81         29.00         28.58
    JPEG 2000  23.10        33.56         30.05         28.90
    JPEG       22.67        33.22         29.18         28.36

    Ratio 1:100
    Method     Y PSNR (dB)  Cb PSNR (dB)  Cr PSNR (dB)  Tot PSNR (dB)
    MSC3K      20.48        32.96         28.12         27.19
    JPEG 2000  20.47        33.15         28.14         27.26
    JPEG       19.32        29.97         27.02         25.43

11.2.6 Child

As shown in table 11.6, MSC3K yields the best result in all cases when compressing the image called "Child".

11.2.7 MSC3K Performance

A special test program was constructed to correctly measure the CPU and memory usage of MSC3K. The performance data was collected with Intel VTune Performance Analyzer v7.0. The test was to load a color image, 512x512 pixels, and then compress it to a bitstream using the MSC3K compression scheme at ratio 1:150.
The test was repeated 100 times and the results averaged. The test machine was a Pentium III at 900 MHz. Memory usage was 1210 kB, including the test program and the MSC3K kernel. This is not very high considering that the 768 kB input image is first read into temporary buffers and encoded from there. The memory needed is approximately 150% of the input image size. The total process took 0.29 seconds. The most time-consuming task is the wavelet transformation (DWT), see table 11.7, followed by the loading of the input image.

Table 11.6: Results from Child.

    Ratio 1:200
    Method     Y PSNR (dB)  Cb PSNR (dB)  Cr PSNR (dB)  Tot PSNR (dB)
    MSC3K      26.55        31.47         33.17         30.40
    JPEG 2000  25.82        31.03         32.90         29.91

    Ratio 1:250
    Method     Y PSNR (dB)  Cb PSNR (dB)  Cr PSNR (dB)  Tot PSNR (dB)
    MSC3K      25.95        31.01         33.11         30.03
    JPEG 2000  24.74        29.98         32.42         29.04

    Ratio 1:300
    Method     Y PSNR (dB)  Cb PSNR (dB)  Cr PSNR (dB)  Tot PSNR (dB)
    MSC3K      25.49        30.27         32.56         29.44
    JPEG 2000  23.79        29.86         31.33         28.32

Encoding takes 0.20 seconds if the image is already loaded. Surprisingly, the zero-tree type compression takes little or no time at all in comparison. The "other" section includes kernel calls, allocations, bitstream handling and more.

Table 11.7: Percentage of CPU time spent in different encoding steps.

    Process                Percentage
    DWT                    58%
    Loading image          32%
    Quantization           6%
    Zero-tree compression  0.1%
    Other                  4%

11.3 Video Compression Results

The video results of MSC3K are purely empirical. The codec runs smoothly in black and white at 320x240 pixels and 30 fps on a P-III 900 with all features enabled and both encoding and decoding running at the same time. The actual CPU usage is hard to measure, but approximately 60% of the total CPU power is used at 30 fps; many parts could still be optimized and written in assembler to further improve the results. The quality of the video is fully scalable, and MSC3K produces fully functional video even under 5 kB/s.
The delay in the codec is insignificant compared to the internal delay in the webcams and their capture drivers. When the video input reaches the MSC3K codec it is already delayed by about 100 ms. The compressed output is then further delayed by about 20-25 ms in the MSC3K codec. Figure 11.6 shows the MSC3K codec in full action.

Figure 11.1: A) The original gray image of Lena. B) Compressed with MSC3K, ratio 1:100. C) Compressed with JPEG 2000, ratio 1:100. D) Compressed with ProJPEG, ratio 1:90.

Figure 11.2: A) The original color image Blue Rose. B) Compressed with MSC3K, ratio 1:100. C) Compressed with JPEG 2000, ratio 1:100. D) Compressed with ProJPEG, ratio 1:100. E) Compressed with MSC3K, ratio 1:200. F) Compressed with JPEG 2000, ratio 1:200. G) Compressed with ProJPEG, ratio 1:200.

Figure 11.3: A) The original color image of Gollum. B) Compressed with MSC3K, ratio 1:100. C) Compressed with JPEG 2000, ratio 1:100. D) Compressed with ProJPEG, ratio 1:100.

Figure 11.4: A) The original color image Wraiths. B) Compressed with MSC3K, ratio 1:50. C) Compressed with JPEG 2000, ratio 1:50. D) Compressed with ProJPEG, ratio 1:50. E) Compressed with MSC3K, ratio 1:100. F) Compressed with JPEG 2000, ratio 1:100. G) Compressed with ProJPEG, ratio 1:100.

Figure 11.5: A) The original color image Child. B) Compressed with MSC3K, ratio 1:200. C) Compressed with JPEG 2000, ratio 1:200. D) Compressed with MSC3K, ratio 1:250. E) Compressed with JPEG 2000, ratio 1:250. F) Compressed with MSC3K, ratio 1:300. G) Compressed with JPEG 2000, ratio 1:300.

Figure 11.6: Test program for the MSC3K video codec. To the left is the input signal and to the right is the coded output.

Chapter 12 Conclusions

The results in chapter 11 show that all the desired features set out before developing MSC3K (high compression, low computational complexity, minimum latency and scalable quality, see chapter 9.1) have been fulfilled.
The scalable quality feature comes from the fact that MSC3K is a zero-tree based compression algorithm; this is discussed further in chapter 12.2.

12.1 Performance

The temporal noise filter described in chapter 2.2 has been tested with webcams and is implemented as an optional feature in the preprocessing stage of MSC3K. The solution is very computationally efficient compared to spatial filtering, and empirical studies have shown it to be sufficient in terms of noise reduction.

Measuring the time it takes to decompress smaller images, MSC3K and JPEG 2000 perform much the same. For larger images, 512x512 pixels and more, JPEG 2000 is faster. This is due to the fact that JPEG 2000 mostly works with localized memory, while zero-tree based compression uses coefficients from random locations in the image, thus leading to ineffective data-cache usage.

When it comes to compressing images, MSC3K is faster than JPEG 2000. This is because zero-tree based compression algorithms can be terminated as soon as the desired output buffer is filled. At low bitrates the output buffer is smaller and thus filled more quickly, so the compression takes less time. The DWT process is significantly more time-consuming than the zero-tree compression, see chapter 11.2.7. MSC3K is generally faster and far less complex than the JPEG 2000 standard, although MSC3K requires more random memory access patterns. As shown in chapter 11, MSC3K performs well compared to JPEG 2000, especially at low bitrates. Looking closely at the images, you can see that MSC3K preserves the overall details somewhat better than JPEG 2000, although the PSNR values are much the same.
12.2 Possible Application Areas

Apart from online gaming communication, the MSC3K video compression should also be suitable for video conferences and real-time video communication in cellular phones, and its more efficient compression could increase the capacity of digital cameras, since they often use DCT-based compression techniques.

Since MSC3K compression is better than JPEG 2000 when it comes to progressive recovery of an image, it should be very attractive for client-server based video streaming applications, e.g. a scenario where many clients stream video from one server and require different video qualities due to their different bandwidth capacities. Here MSC3K should, with relative ease, be able to provide different quality video streams based on one "original" video stream, without having to re-encode the video stream for each quality/bandwidth setting. The server simply cuts out different portions of the encoded bitstream to serve the clients' different bandwidth needs, see figure 12.1. In MSC3K this can be done at the bit level, not as in JPEG 2000 where the "portioning" can only be done at the block/layer level. In JPEG 2000 this poses another problem, since color information and intensity information are split into separate code-blocks: the question is which code-blocks are to be thrown away and which ones are best kept. With MSC3K, all information is encoded according to rank, and since the bitstream of an MSC3K encoded image can be cut anywhere, this problem is eliminated.

Figure 12.1: Several reconstructed images from one "original" MSC3K encoded bitstream. a) Original image, 192 kB (256x256 pixels in 24-bit color). b) 9.6 kB transferred and decoded (ratio 1:20). c) 6.4 kB transferred and decoded (ratio 1:30). d) 4.8 kB transferred and decoded (ratio 1:40).
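The server-side "portioning" described above amounts to truncating one embedded bitstream to each client's budget; because a zero-tree bitstream orders bits by significance, any prefix of it is itself a valid lower-quality encoding. A hedged sketch (the byte budgets mirror figure 12.1; the function and names are illustrative):

```python
def stream_for_client(encoded: bytes, client_bytes: int) -> bytes:
    """Cut the embedded bitstream to the client's byte budget.
    No re-encoding is needed: a prefix is a valid encoding."""
    return encoded[:min(client_bytes, len(encoded))]

# One encoded image serves all budgets (ratios 1:20, 1:30, 1:40
# of a 192 kB original, as in figure 12.1):
encoded = bytes(range(256)) * 48          # stand-in for a real bitstream
budgets = {"fast": 9600, "medium": 6400, "slow": 4800}
streams = {name: stream_for_client(encoded, b) for name, b in budgets.items()}
```

The key property is that every lower-bandwidth stream is a prefix of every higher-bandwidth one, so the server holds only a single encoded buffer.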
12.3 Future Work

A more advanced quantization method, better suited to zero-tree compression, could improve image restoration quality.

Preprocessing measures could be taken to filter the image with e.g. a smart blur filter before compressing it. A smart blur filter preserves edges but reduces high-frequency information under a certain threshold. This way, high-frequency information located in small spatial regions can be reduced, improving the conditions for zero-tree type compression. The exact filtering properties of the smart filter should be in proportion to the compression rate (more blurring for higher rates).

Some bits in the zero-tree representation have proven to be compressible. SPIHT uses an arithmetic encoder to compress these bits, but an optimized Huffman tree should suffice. By compressing these bits the overall compression efficiency can be improved.

Memory usage can be improved by reusing some internal buffers in the encoding scheme. CPU usage can also be improved by rewriting critical sections of the code in assembly language.

12.3.1 The Future of MSC3K

The implementation so far is only for educational purposes, since there are some intellectual property issues with the SPIHT algorithm. MSC3K cannot go commercial unless an agreement is first reached with the owners of the SPIHT algorithm. This is nothing unusual; almost every wavelet compression technique is protected by various patents. This is also a big problem for the JPEG 2000 standard: if JPEG 2000 is to be a real standard on the internet, all intellectual property issues must first be solved. Because of the issues described above, the final MSC3K product will most likely not use the SPIHT algorithm at all. Alternative methods are being developed, but this thesis will not discuss them. Whatever method is used, the compression results will be equal, or in some cases superior, to those of SPIHT.
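The smart blur preprocessing suggested in section 12.3 could take the following form: average a pixel with its 3x3 neighbourhood only where the local contrast is below a threshold, so edges survive while low-amplitude high-frequency detail is smoothed away. This is only a sketch of the idea, not a tuned filter; the threshold rule is an assumption.

```python
def smart_blur(img, threshold=10):
    """Edge-preserving blur: smooth a pixel only where the 3x3
    neighbourhood's value range is below the threshold."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]          # borders are left untouched
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            win = [img[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            # Local contrast measured as the range of the window.
            if max(win) - min(win) < threshold:
                out[y][x] = sum(win) // len(win)
    return out
```

Small isolated deviations in flat regions are averaged away (helping the zero-tree coder), while a strong edge, whose local range exceeds the threshold, is left intact.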
Appendix A The Euclidean Algorithm

Take two Laurent polynomials a(z) and b(z) ≠ 0 with |a(z)| ≥ |b(z)|. Let a0(z) = a(z) and b0(z) = b(z) and iterate the following steps, starting from i = 0:

    a_{i+1}(z) = b_i(z)
    b_{i+1}(z) = a_i(z) mod b_i(z)                                  (A.1)
    q_{i+1}(z) = a_i(z) / b_i(z)

Then a_n(z) = gcd(a(z), b(z)), where n is the smallest number for which b_n(z) = 0. If gcd(a(z), b(z)) = 1, then a(z) and b(z) have no common divisor except for 1 and they are relatively prime. The iteration procedure can be written as

    ( a(z) )     n  ( q_i(z)  1 )  ( a_n(z) )
    (      )  = prod(           )  (        )                       (A.2)
    ( b(z) )    i=1 (   1     0 )  (   0    )

Iterating one time yields

    a(z) = q_1(z) a_1(z) + b_1(z)                                   (A.3)
    b(z) = a_1(z)

Appendix B Common Wavelet Transforms

B.1 LeGall 5/3

B.1.1 Direct Form Coefficients

    i    ha     ga     hs    gs
    0    3/4    1      1     -3/4
    ±1   1/4    -1/2   1/2   -1/4
    ±2   -1/8                1/8

B.1.2 Lifting Coefficients

    Analysis
    Lifting Step   Polynomial
    1 (Dual)       -1/2 z^-1 + 1 - 1/2 z
    2 (Primal)     1/4 z^-1 + 1 + 1/4 z

    Synthesis
    Lifting Step   Polynomial
    1 (Primal)     -1/4 z^-1 + 1 - 1/4 z
    2 (Dual)       1/2 z^-1 + 1 + 1/2 z

B.2 Daubechies 9/7

B.2.1 Direct Form Coefficients

    i    ha         ga         hs         gs
    0    0.602949   -1.115087  1.115087   0.602949
    ±1   0.266864   0.591272   0.591272   -0.266864
    ±2   -0.078223  0.057543   -0.057543  -0.078223
    ±3   -0.016864  -0.091276  -0.091276  0.016864
    ±4   0.026749                         0.026749

B.2.2 Lifting Coefficients

    Analysis
    Lifting Step   Polynomial
    1 (Primal)     -1.586134 z^-1 + 1 - 1.586134 z
    2 (Dual)       -0.052980 z^-1 + 1 - 0.052980 z
    3 (Primal)     0.882911 z^-1 + 1 + 0.882911 z
    4 (Dual)       0.443507 z^-1 + 1 + 0.443507 z
    Normalization: lowpass 0.812983, highpass -1.230174

    Synthesis
    Lifting Step   Polynomial
    1 (Dual)       -0.443507 z^-1 + 1 - 0.443507 z
    2 (Primal)     -0.882911 z^-1 + 1 - 0.882911 z
    3 (Dual)       0.052980 z^-1 + 1 + 0.052980 z
    4 (Primal)     1.586134 z^-1 + 1 + 1.586134 z
    Normalization: lowpass 1.230174, highpass -0.812983

Figure B.1: Lifting implementation of the LeGall 5/3 filter, even (left) and odd (right). x(n) are the input samples, d(n) are the high-pass coefficients and s(n) are the low-pass coefficients.

Bibliography

[1] I. Daubechies and W. Sweldens.
Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl., 4(3):245-267, 1998.

[2] J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. on Signal Processing, 41(12):3445-3462, 1993.

[3] A. Said and W. A. Pearlman. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans. on Circuits and Systems for Video Technology, 6:243-250, 1996.

[4] P. G. Howard and J. Vitter. Analysis of arithmetic coding for data compression. Information Processing and Management, 28(6):749-763, 1992.

[5] B.-J. Kim, Z. Xiong, and W. A. Pearlman. Multirate 3-D set partitioning in hierarchical trees (3-D SPIHT). IEEE Trans. on Circuits and Systems for Video Technology, 10(8), 2000.

[6] H.-W. Park and H.-S. Kim. Motion estimation using low-band-shift method for wavelet-based moving-picture coding. IEEE Trans. on Image Processing, 9(4), 2000.

[7] M. F. Fu, O. C. Au, and W. C. Chan. Novel motion compensation for wavelet video coding using overlapping. IEEE Trans. on Acoustics, Speech, and Signal Processing, 4:3421-3424, 2002.

[8] ISO/IEC 14492-1. Lossy/lossless coding of bi-level images. 2000.

[9] ISO/IEC 14495-1. Lossless and near-lossless compression of continuous-tone still images - baseline. 2000.

[10] ISO/IEC 15444-1. Information technology - JPEG 2000 image coding system - Part 1: Core coding system. 2000.

[11] M. Adams. The JasPer project home page. http://www.ece.uvic.ca/~mdadams/jasper/.

[12] The Lenna story. http://www.lenna.org.
