Master Thesis MEE 03:15
LOW BITRATE VIDEO AND AUDIO CODECS
FOR INTERNET COMMUNICATION
By
Jonas Nilsson & Jesper Nilsson
THIS THESIS IS PRESENTED AS PART OF THE DEGREE OF
MASTER OF SCIENCE IN ELECTRICAL ENGINEERING
AT
BLEKINGE INSTITUTE OF TECHNOLOGY
RONNEBY, SWEDEN
MAY 2003
© Copyright by Jonas Nilsson & Jesper Nilsson, 2003
Department of Telecommunications and Signal Processing
Examiner: To Tran
Supervisors: To Tran & Carl Ådahl, Moosehill Productions AB
Abstract
This master thesis discusses the design and implementation of our own wavelet-based codec for both video and image compression. The codec is specifically designed for low bitrate video with minimum complexity, for use in online gaming environments. Results indicate that the performance of the codec in many areas equals
or even surpasses that of the international JPEG 2000 standard. We believe that it
is suitable for any situation where low bitrate is desirable, e.g. video conferences and
mobile communications. The game development company Moosehill Productions AB
has shown great interest in our codec and its possible applications. We have also
implemented an existing audio solution for low bandwidth use.
Preface
This thesis is the final step before completing our Master of Science degree at the
Blekinge Institute of Technology, Department of Telecommunications and Signal Processing, formerly known as University of Karlskrona/Ronneby. This master thesis
corresponds to 20 credits.
We would like to thank our examiner To Tran at Blekinge Institute of Technology
for his creative ideas and Carl Ådahl from Moosehill Productions AB for his support.
Of course, we are grateful to our parents and our sister for their patience and love.
Without their support and encouragement this would not have been possible.
Finally, we wish to thank the following:
Johan Zackrisson, Benny Svensson,
Mats Abrahamsson, Sten Dahlgren, all our relatives and all our friends, you know
who you are!
Dated: May 2003
Jonas Nilsson
Jesper Nilsson
Table of Contents
Abstract
Preface
Table of Contents

1 Introduction

2 Pre Processing
  2.1 Color Representation
  2.2 Temporal Filtering
  2.3 Level Offset
  2.4 Normalization
  2.5 Color Transformation

3 Wavelet Transform
  3.1 Historical Perspective
  3.2 Multiresolution Analysis
  3.3 1-D Wavelet Transform
    3.3.1 Perfect Reconstruction
    3.3.2 The Polyphase Representation
    3.3.3 From Polyphase to Lifting
    3.3.4 The Lazy Wavelet Transform
    3.3.5 Factoring Wavelets into Lifting Steps
    3.3.6 Calculating Lifting Steps
    3.3.7 Lifting Properties
  3.4 2-D Wavelet Transform

4 Quantization
  4.1 Uniform Dead-Zone Quantization

5 Zero-Tree Coding
  5.1 EZW
    5.1.1 The Significance Pass
    5.1.2 The Refinement Pass
    5.1.3 Example of EZW
  5.2 SPIHT
    5.2.1 Different Zero-Trees
    5.2.2 Ordered Lists
    5.2.3 The Coding Passes
    5.2.4 The SPIHT Algorithm
    5.2.5 Example of SPIHT
  5.3 MSC3K
    5.3.1 Knowledge of Quantization
    5.3.2 The MSC3K Algorithm

6 Data Compression
  6.1 Huffman Coding
  6.2 Adaptive Coding
  6.3 Arithmetic Coding
    6.3.1 The Basics
    6.3.2 Implementation
    6.3.3 Underflow
    6.3.4 Decoding
    6.3.5 Arithmetic vs. Huffman

7 Video Compression
  7.1 Predictive Frames
  7.2 Motion Estimation & Motion Compensation

8 JPEG 2000 Overview
  8.1 Features
  8.2 JPEG 2000 Codec
    8.2.1 Preprocessing
    8.2.2 Wavelet transform
    8.2.3 Quantization
    8.2.4 Tier-1 & Tier-2 Coding

9 MSC3K Overview
  9.1 The Codec
  9.2 Color Compression
  9.3 Video Compression
  9.4 Implementation

10 Audio Compression
  10.1 History
  10.2 Speech Synthesis
  10.3 Basic Speech Model
  10.4 Speech production by LPC

11 Results
  11.1 Image Distortion Measure
  11.2 Still Image Compression Comparison
    11.2.1 Test Setup
    11.2.2 Lena
    11.2.3 Blue Rose
    11.2.4 Gollum
    11.2.5 Wraiths
    11.2.6 Child
    11.2.7 MSC3K Performance
  11.3 Video Compression Results

12 Conclusions
  12.1 Performance
  12.2 Possible Application Areas
  12.3 Future Works
    12.3.1 The Future of MSC3K

A The Euclidean Algorithm

B Common Wavelet Transforms
  B.1 LeGall 5/3
    B.1.1 Direct Form Coefficients
    B.1.2 Lifting Coefficients
  B.2 Daubechies 9/7
    B.2.1 Direct Form Coefficients
    B.2.2 Lifting Coefficients

Bibliography
Chapter 1
Introduction
This thesis work is the result of the idea to integrate a low bitrate real-time video
communication system into online gaming. This would enhance the online player’s
gaming experience. Teams could transmit complex strategy plans to each other or
opponents could argue with each other. The video function would of course be optional since it requires some sort of video device, e.g. a webcam. A button bound to
the video transmitting function would enable the player to decide when to send video
and optional settings would determine which players to send to and which players to
ignore.
The concept of video communication for online games places heavy requirements
on the video codec, e.g. low computational complexity, low bitrate, error resilience
and progressive scalability. At first a DCT (Discrete Cosine Transform) based solution, such as MPEG-4, was considered, but thanks to To Tran we soon started looking into wavelet-based solutions. This resulted in a unique wavelet-based video and image codec that we named "Mystery Science Codec 3000" (MSC3K).
Later on in our work we recognized other applications, apart from online gaming,
and these are discussed in chapter 12.
This thesis covers all the steps in an actual image encoder/decoder and explains
them in the order in which they appear. The thesis is organized as follows:
Chapter 2 gives a short introduction to digital image representation and some
preprocessing procedures normally applied before transforming images to the wavelet
domain. The wavelet transformation and how to calculate and design different wavelet
implementations and transforms are explained in detail in chapter 3. The data of
the wavelet transformed image is reduced using a quantization scheme (chapter 4)
followed by the actual encoding of data in chapter 5. Chapter 5 deals only with zero-tree algorithms, of which MSC3K is a variant; it is meant to give the reader a complete understanding of zero-trees. The basics of data compression, Huffman based coding and arithmetic coding, are described in chapter 6.
Chapter 7 deals with video encoding. Some information on standards such as
MPEG-1 and MPEG-2 is included, but the main focus is on functions and methods
implemented in the MSC3K video codec.
An informative overview of the JPEG 2000 standard can be found in chapter 8,
followed by an overview of the MSC3K codec in chapter 9.
Audio compression is covered in chapter 10, with a detailed explanation of linear
prediction based compression.
Finally, chapters 11 and 12 present and discuss the results of this thesis work, followed by conclusions and possible future work.
Chapter 2
Pre Processing
2.1 Color Representation
A color image is digitally represented by a set of RGB (red, green, blue) samples.
Each RGB tuple is referred to as a pixel. A 2-D set of pixels makes up a color image. A pixel with RGB = [0, 0, 0] is black. If each component is described by an 8-bit integer, then RGB = [255, 255, 255] represents white, RGB = [255, 0, 0] represents red, and so on.
2.2 Temporal Filtering
Digital video images are often, to various degrees, distorted with noise. The noise
degrades the quality of the coded image and it also makes motion estimation a lot
more difficult. Some sort of filtering must be applied in order to restore the image
quality.
In motion pictures, a sequence of images correlated in the temporal dimension is
available, and this temporal correlation may be exploited through temporal filtering.
Temporal filtering has two major advantages over spatial filtering: it reduces degradation without signal distortion, and it is more suitable for real-time implementations.
Suppose a sequence of degraded images g_i(n_1, n_2) is expressed as

g_i(n_1, n_2) = f(n_1, n_2) + v_i(n_1, n_2), \quad 1 \le i \le N    (2.1)
where v_i(n_1, n_2) is zero-mean stationary white Gaussian noise with variance \sigma_v^2, and v_i(n_1, n_2) is independent from v_j(n_1, n_2) for i ≠ j. Assuming that f(n_1, n_2) is non-random, the maximum likelihood (ML) estimate of f(n_1, n_2) is

\hat{f}(n_1, n_2) = \frac{1}{N} \sum_{i=1}^{N} g_i(n_1, n_2)    (2.2)

\hat{f}(n_1, n_2) = f(n_1, n_2) + \frac{1}{N} \sum_{i=1}^{N} v_i(n_1, n_2)    (2.3)
The degradation in the frame-averaged image, \hat{f}(n_1, n_2), remains a zero-mean stationary white Gaussian noise with variance \sigma_v^2/N, which is a reduction of noise variance by a factor of N in comparison to v_i(n_1, n_2). Noise characteristics remain
the same but the variance is reduced and the processed image will not suffer from any
blurring despite the noise reduction. This noise filtering is very effective in reducing
noise that may occur in digitization of a still image using e.g. a video camera.
This solution only works if f (n1 , n2 ) is stationary. If there are any movements
between the frames the result would be blurry since there is no way to tell if it is the
movement or the noise that causes the frames to differ. With a simple modification of
this temporal filtering, a new solution can be derived that performs well with real-time video.
Every pixel in f̂(n_1, n_2) is compared with the corresponding pixel in g(n_1, n_2), and the difference between the two is compared with a predefined threshold δ. Differences larger than δ are considered to be movement, while differences smaller than δ are considered to be
noise. The pseudo code of this implementation can be written as:

1. For every pixel, [x, y], in f̂(n_1, n_2) do:
   • d = f̂(x, y) − g(x, y)
   • if |d| ≥ δ then f̂(x, y) = g(x, y)
     else f̂(x, y) = (f̂(x, y) + g(x, y))/2

Pixels that differ less than δ are considered to be noise and averaged with the previous frame. This may blur the image in places, since small variations do not necessarily indicate noise. However, this occasional blurring is hardly noticeable
since the sharp edges are preserved. Moving sharp edges generate great differences
between frames and they are therefore considered movement instead of noise.
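As an illustration, here is a minimal Python sketch of this thresholded temporal filter (not the thesis implementation; it assumes NumPy float arrays and uses the absolute difference for the comparison):

import numpy as np

def temporal_filter(f_hat, g, delta):
    # f_hat: running filtered frame, g: incoming frame
    d = np.abs(f_hat - g)
    # movement: keep the new pixel; noise: average with the previous frame
    return np.where(d >= delta, g, (f_hat + g) / 2)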
2.3 Level Offset
If the original B-bit image sample values are unsigned quantities, an offset of -2^{B-1} is added so that the samples have a signed representation in the range

-2^{B-1} \le x[n] < 2^{B-1}    (2.4)
This is done because the discrete wavelet transform involves high-pass filtering, and the resulting sub-band samples hence have a symmetric distribution about 0. Without the level offset, the LL sub-band would have to be treated as unsigned while all other sub-bands would be signed. This is of course a matter of choice, but when it comes to implementation it is much easier if all sample values are within the same range, so that no exceptions have to be made.
2.4 Normalization
All image samples are normalized through a division by 2^B to the "unit range"

-\frac{1}{2} \le x[n] < \frac{1}{2}    (2.5)
The samples are treated as floating point values from here on. The color transform
and the discrete wavelet transform may be normalized so that all sub-band samples
nominally retain this same unit range.
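Taken together, the level offset (2.4) and the normalization (2.5) amount to two trivial operations per sample. A small sketch for B-bit unsigned input:

import numpy as np

def preprocess(samples, B=8):
    x = samples.astype(np.float64) - 2 ** (B - 1)  # level offset, eq (2.4)
    return x / 2 ** B                              # unit range [-1/2, 1/2), eq (2.5)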
2.5 Color Transformation
An irreversible color transform is often used in lossy image compression formats, e.g.
JPEG 2000, JPEG, MPEG-2, MPEG-4, etc. The reason for this is to separate color
and intensity (chrominance and luminance) so that they can be treated differently
in the encoder. Intensity, or the black & white information, is more important than
color so changing color space can improve compression quality for color images.
The color space used by most standards, JPEG 2000 etc, is YCbCr, where Y
is the intensity and Cb & Cr are color components. The transformation from RGB
to YCbCr can be expressed as

\begin{pmatrix} x_Y[n] \\ x_{Cb}[n] \\ x_{Cr}[n] \end{pmatrix} =
\begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.168736 & -0.331264 & 0.5 \\ 0.5 & -0.418688 & -0.081312 \end{pmatrix}
\begin{pmatrix} x_R[n] \\ x_G[n] \\ x_B[n] \end{pmatrix}    (2.6)
Note that the elements in the matrix are only approximate values. Observe that x_Y is a weighted average of the red, green and blue color components. The weight coefficients (0.299, 0.587, 0.114) reflect the importance of the information, in terms of visual perception of detail, in the different parts of the color spectrum. Notice that the sum of the weight coefficients equals 1, so no combination of RGB values can render a Y value greater than the initial RGB range.
The RGB to YCbCr conversion can be reversed with the following equation:

\begin{pmatrix} x_R[n] \\ x_G[n] \\ x_B[n] \end{pmatrix} =
\begin{pmatrix} 1 & 0 & 1.402 \\ 1 & -0.344136 & -0.714136 \\ 1 & 1.772 & 0 \end{pmatrix}
\begin{pmatrix} x_Y[n] \\ x_{Cb}[n] \\ x_{Cr}[n] \end{pmatrix}    (2.7)
The transformation is called irreversible because, when implemented with finite precision, it is impossible to reproduce the exact original values. There are of course fully reversible color transformations, but they are used exclusively for lossless compression.
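As a sketch, the forward transform (2.6) and its inverse (2.7) are plain matrix products. The code below assumes pixels is a NumPy array of shape (..., 3) holding R, G, B samples:

import numpy as np

A = np.array([[ 0.299,     0.587,     0.114    ],
              [-0.168736, -0.331264,  0.5      ],
              [ 0.5,      -0.418688, -0.081312 ]])
A_INV = np.array([[1.0,  0.0,       1.402    ],
                  [1.0, -0.344136, -0.714136 ],
                  [1.0,  1.772,     0.0      ]])

def rgb_to_ycbcr(pixels):
    return pixels @ A.T      # eq (2.6)

def ycbcr_to_rgb(pixels):
    return pixels @ A_INV.T  # eq (2.7), approximate due to finite precision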
Chapter 3
Wavelet Transform
The Discrete Wavelet Transform (DWT) is the foundation block of modern image compression techniques. The most widely known DWT-based compression standard is probably JPEG 2000. Due to the nature of natural images, the DWT can efficiently concentrate the image energy into small spatial regions of the wavelet representation. This property of the DWT is very beneficial for image compressibility.
3.1 Historical Perspective
In the history of mathematics, wavelet analysis shows many different origins. Much
of the work was performed in the 1930s, and at the time, the separate efforts did not
appear to be parts of a coherent theory. Before 1930, the main branch of mathematics
leading to wavelets began with Joseph Fourier (1807) with his theories of frequency
analysis, often referred to as Fourier synthesis.
Joseph Fourier opened up the door to a new functional universe. After 1807
mathematicians began to explore the meaning of functions, Fourier series convergence
and orthogonal systems.
The disadvantage of a Fourier expansion however is that it has only frequency
resolution and no time resolution. This means that although all frequencies can be
determined, there is no way to know when they are present. Several solutions that are more or less able to represent a signal in the time and frequency domains at the same time have been developed since.
Research was gradually led from the notion of frequency analysis to the notion of
scale analysis when it started to become clear that an approach measuring average
fluctuations at different scales might prove less sensitive to noise.
The first recorded mention of wavelets appeared in an appendix to the thesis of
Alfred Haar (1909).
In the 1930s, several groups working independently researched the representation
of functions using scale-varying basis functions. Understanding the concepts of basis
functions and scale-varying basis functions is key to understanding wavelets. Paul
Levy, a 1930s scientist, used a scale-varying basis function called the Haar basis
function to investigate Brownian motion, a type of random signal. He found the Haar basis function superior to the Fourier basis functions for studying small, complicated details in the Brownian motion.
In 1980 Jean Morlet and the team at the Marseille Theoretical Physics Center
working under Alex Grossmann in France first proposed the concept of wavelets in
its present theoretical form. They broadly defined wavelets in the context of quantum physics. These two researchers provided a way of thinking for wavelets based
on physical intuition. In 1985 Stephane Mallat gave wavelets an additional jump-start through his work in digital signal processing. Later, mainly Y. Meyer and his colleagues developed the methods of wavelet analysis, inspired by S. Mallat's earlier work, in 1988. Since then, wavelet research has become international. Ingrid Daubechies, Ronald Coifman, and Victor Wickerhauser are some of the greatest
names in the field of wavelet research today.
3.2 Multiresolution Analysis
The wavelet transform can be thought of as a filter bank. In order to wavelet transform
a signal, the signal has to pass through a filter bank. The output of the different filter
stages are the wavelet- and scaling function transform coefficients. Analyzing a signal
by passing it through a filter bank is not a new idea and has been around for many
years under the name sub-band coding.
Figure 3.1: Splitting the signal spectrum with an iterated filter bank.
The filter bank needed in sub-band coding can be built in several ways. One way is to build many band-pass filters to split the spectrum into frequency bands. The advantage is that the width of every band can be chosen freely, in such a way that the spectrum of the signal to analyze is covered in the places where it might be interesting. The disadvantage is that every filter has to be designed separately, which can be a time consuming process. Another way is to iterate a single low-pass/high-pass split: the spectrum is divided in two, and the low-pass half is then split again, and so on. This process of splitting the spectrum is graphically displayed in figure 3.1. The advantage of this scheme is that only two filters have to be designed; the disadvantage is that the signal spectrum coverage is fixed.
Since the signals from the high-pass filters have different spectrum widths, they can be represented in the time domain with different sample resolutions. For example, in figure 3.1 the 4B signal requires twice as many samples in the time domain as the 2B signal. This type of analysis of a signal is often referred to as multiresolution analysis.
3.3 1-D Wavelet Transform
The wavelet transform (or multiresolution analysis) can be performed using a filter
bank with FIR filters as shown in figure 3.2. Combining some blocks in figure 3.2 will
generate a multiresolution analysis (figure 3.1) of the input signal. Figure 3.3 gives a
more detailed filter bank structure of such an analysis.
The signal is split into numerous frequency bands (sub-bands). This can be done
with reversible integer-to-integer or irreversible real-to-real wavelet transforms. A
reversible wavelet transform is designed so that the signal can be perfectly reconstructed again. Irreversible transforms can only approximate a reconstructed signal
due to finite precision.
Figure 3.2: The discrete forward and inverse wavelet transform presented in direct
form. Analysis filtering followed by subsampling then upsampling and synthesis filtering.
Figure 3.3: A combined forward wavelet transform with sub-bands γ1 , γ2 and λ2 .
The analysis filters are denoted as ha and ga , while the synthesis filters are denoted
as hs and gs . The h filters are low-pass filters and the g filters are high-pass. Later on
an additional prefix will be added to indicate even or odd filter coefficients, e.g. hae
represents the even filter coefficients of ha and hao will represent the odd coefficients.
3.3.1 Perfect Reconstruction
Usually a signal transform is used to transform a signal to a different domain, perform
some operation on the transformed signal and then inverse transform it back to the
original domain. This means of course that the transform has to be invertible. If
a signal is transformed and then directly inverse transformed, there should be no
change; the signal has to be perfectly reconstructed.
The conditions for perfect reconstruction are given by [1] as

h_s(z) h_a(z^{-1}) + g_s(z) g_a(z^{-1}) = 2
h_s(z) h_a(-z^{-1}) + g_s(z) g_a(-z^{-1}) = 0    (3.1)
In order to arrive at a non-delayed, perfectly reconstructed signal, the analyzing
filters (ha (z) and ga (z)) have to be time reversed.
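A quick numeric sanity check of perfect reconstruction, sketched here in Python with the orthonormal Haar pair (an illustrative assumption, not one of the filters used later in the thesis):

import numpy as np

x = np.random.randn(16)
lam = (x[0::2] + x[1::2]) / np.sqrt(2)  # low-pass analysis + subsampling
gam = (x[0::2] - x[1::2]) / np.sqrt(2)  # high-pass analysis + subsampling

y = np.empty_like(x)
y[0::2] = (lam + gam) / np.sqrt(2)      # upsampling + synthesis filtering
y[1::2] = (lam - gam) / np.sqrt(2)
assert np.allclose(y, x)                # the signal is perfectly reconstructed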
3.3.2 The Polyphase Representation
The direct form represented in figure 3.2 is not suitable for implementation since half
of all the computations are thrown away in the subsampling procedure. Analyzing
the signals λ(z) and γ(z) reveals that, when the convolutions are written out sample by sample, half of the computed products are thrown away by the subsampling, leaving only

\lambda[n] = \sum_k h_a[k] \, x[2n-k]    (3.2)

\gamma[n] = \sum_k g_a[k] \, x[2n-k]    (3.3)
By defining
h_a(z) = h_{ae}(z^2) + z^{-1} h_{ao}(z^2)
g_a(z) = g_{ae}(z^2) + z^{-1} g_{ao}(z^2)    (3.4)
λ(z) and γ(z) are then recognized as
\lambda(z) = h_{ae}(z) x_e(z) + h_{ao}(z) z^{-1} x_o(z)
\gamma(z) = g_{ae}(z) x_e(z) + g_{ao}(z) z^{-1} x_o(z)    (3.5)
The whole expression can be written as a multiplication between the matrix P_a(z) and a vector of x_e(z) and z^{-1} x_o(z):

\begin{pmatrix} \lambda(z) \\ \gamma(z) \end{pmatrix} =
\underbrace{\begin{pmatrix} h_{ae}(z) & h_{ao}(z) \\ g_{ae}(z) & g_{ao}(z) \end{pmatrix}}_{P_a(z)}
\begin{pmatrix} x_e(z) \\ z^{-1} x_o(z) \end{pmatrix}    (3.6)
This implies that the filter bank can be expressed as a filter with even and odd samples as input and that the subsampling can be done prior to the filtering. Instead of throwing away half of the computations as in the case of figure 3.2, the filter bank can be re-modelled into that of figure 3.4, where P_a(z) is the polyphase matrix for the analysis.

Figure 3.4: A one-stage forward and inverse wavelet transform using polyphase matrices.

Similar to P_a(z), the polyphase matrix for the synthesis, P_s(z), can be formulated as

\begin{pmatrix} x_e(z) \\ z^{-1} x_o(z) \end{pmatrix} = P_s(z) \begin{pmatrix} \lambda(z) \\ \gamma(z) \end{pmatrix} =
\begin{pmatrix} h_{se}(z) & g_{se}(z) \\ h_{so}(z) & g_{so}(z) \end{pmatrix}
\begin{pmatrix} \lambda(z) \\ \gamma(z) \end{pmatrix}    (3.7)
From figure 3.4, the condition for perfect reconstruction can now be written as

P_a(z^{-1}) P_s(z) = I    (3.8)
The time reversal of Pa is needed to cancel the delays caused by the polyphase
matrices. Assuming that Ps (z) is invertible, Cramer’s rule can be applied to calculate
the inverse.
P_s(z)^{-1} = P_a(z^{-1}) = \frac{1}{h_{se}(z) g_{so}(z) - h_{so}(z) g_{se}(z)}
\begin{pmatrix} g_{so}(z) & -g_{se}(z) \\ -h_{so}(z) & h_{se}(z) \end{pmatrix}    (3.9)
If the determinant of P_s(z) equals 1, i.e. h_{se}(z) g_{so}(z) - h_{so}(z) g_{se}(z) = 1, then not only will P_s(z) be invertible, but also

h_{ae}(z) = g_{so}(z^{-1})
h_{ao}(z) = -g_{se}(z^{-1})
g_{ae}(z) = -h_{so}(z^{-1})
g_{ao}(z) = h_{se}(z^{-1})    (3.10)
which implies that

h_a(z) = -z^{-1} g_s(-z^{-1})
g_a(z) = z^{-1} h_s(-z^{-1})    (3.11)
[1] defines the term complementary filter pair as a pair of filters (h, g) whose corresponding polyphase matrix P(z) has determinant 1. The problem of finding an invertible wavelet transform using FIR filters amounts to finding such a matrix P(z) with determinant 1. If (h_a, g_a) is complementary then so is (h_s, g_s). From the matrix P_a(z), the filters needed in the invertible wavelet transform follow immediately.
3.3.3 From Polyphase to Lifting
Lifting is the most computationally efficient solution for implementing the wavelet transform, see chapter 3.3.7. The lifting procedure is described by a set of lifting steps, where each lifting step is a polyphase matrix. Extracting the lifting steps is done by factoring the polyphase matrix in a certain way.
The lifting theorem [1] states that any other finite filter g_s^{new} complementary to h_s is of the form

g_s^{new}(z) = g_s(z) + h_s(z) s_s(z^2)    (3.12)
where s_s(z^2) is a Laurent polynomial. This can be written in polyphase form as

P_s^{new}(z) = P_s(z) \begin{pmatrix} 1 & s_s(z) \\ 0 & 1 \end{pmatrix} =
\begin{pmatrix} h_{se}(z) & h_{se}(z) s_s(z) + g_{se}(z) \\ h_{so}(z) & h_{so}(z) s_s(z) + g_{so}(z) \end{pmatrix}    (3.13)
The lifting theorem can also be used to create a filter h_a^{new}(z) complementary to g_a(z):

h_a^{new}(z) = h_a(z) + g_a(z) s_a(z^2)    (3.14)

with the polyphase matrix

P_a^{new}(z) = \begin{pmatrix} 1 & s_a(z) \\ 0 & 1 \end{pmatrix} P_a(z) =
\begin{pmatrix} h_{ae}(z) + g_{ae}(z) s_a(z) & h_{ao}(z) + g_{ao}(z) s_a(z) \\ g_{ae}(z) & g_{ao}(z) \end{pmatrix}    (3.15)
This is also known as primal lifting (illustrated in figure 3.5), where the low-pass sub-band is lifted with the help of the high-pass sub-band. Lifting can also be applied to the high-pass sub-band, thus lifting the high-pass sub-band with the help of the low-pass sub-band. This procedure is called dual lifting. For dual lifting the equations become
h_s^{new}(z) = h_s(z) + g_s(z) t_s(z^2)    (3.16)

Figure 3.5: Primal lifting, the low-pass sub-band is lifted with the help of the high-pass sub-band.

P_s^{new}(z) = P_s(z) \begin{pmatrix} 1 & 0 \\ t_s(z) & 1 \end{pmatrix} =
\begin{pmatrix} h_{se}(z) + g_{se}(z) t_s(z) & g_{se}(z) \\ h_{so}(z) + g_{so}(z) t_s(z) & g_{so}(z) \end{pmatrix}    (3.17)
and

g_a^{new}(z) = g_a(z) + h_a(z) t_a(z^2)    (3.18)

P_a^{new}(z) = \begin{pmatrix} 1 & 0 \\ t_a(z) & 1 \end{pmatrix} P_a(z) =
\begin{pmatrix} h_{ae}(z) & h_{ao}(z) \\ g_{ae}(z) + h_{ae}(z) t_a(z) & g_{ao}(z) + h_{ao}(z) t_a(z) \end{pmatrix}    (3.19)

Figure 3.6: Dual lifting, the high-pass sub-band is lifted with the help of the low-pass sub-band.
3.3.4 The Lazy Wavelet Transform
The simplest wavelet transform is often referred to as the lazy wavelet transform, where the polyphase matrices equal the unity matrix. This transform merely divides the input signal into odd and even indexed samples. It is of course fully reversible since no data is discarded. The lazy transform can be used in designing other wavelet transforms: if dual or primal lifting is applied to the lazy transform, the result is a new wavelet transform. Any wavelet transform with finite filters can be obtained starting from the lazy wavelet transform followed by a finite number of primal and dual lifting steps [1].
3.3.5 Factoring Wavelets into Lifting Steps
In the previous section the wavelet transform was lifted as part of designing wavelet transforms. The lifting procedure can also be reversed, thus extracting the lifting steps from an existing transform. Investigating the lifting theorem [1] a bit more closely reveals that every reversible wavelet transform can be regarded as a lazy wavelet transform that has been lifted a number of steps. This implies that any wavelet transform can be factored into lifting steps until only the unit matrix and a number of lifting matrices remain, and in some cases an additional scaling matrix (K). The polyphase matrix P(z) can be written as
P(z) = \begin{pmatrix} K_1 & 0 \\ 0 & K_2 \end{pmatrix}
\prod_{i=1}^{m} \left\{ \begin{pmatrix} 1 & s_i(z) \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ t_i(z) & 1 \end{pmatrix} \right\}    (3.20)
An inverted dual lifting procedure can be written as
P(z) = \begin{pmatrix} h_e(z) & h_o(z) \\ g_e(z) & g_o(z) \end{pmatrix} =
\begin{pmatrix} h_e(z) & h_o(z) \\ h_e(z) t(z) + g_e^{new}(z) & h_o(z) t(z) + g_o^{new}(z) \end{pmatrix} =
\begin{pmatrix} 1 & 0 \\ t(z) & 1 \end{pmatrix}
\begin{pmatrix} h_e(z) & h_o(z) \\ g_e^{new}(z) & g_o^{new}(z) \end{pmatrix}    (3.21)

From this it is clear that the factoring conditions are

g(z) = h(z) t(z^2) + g^{new}(z)
g_e(z) = h_e(z) t(z) + g_e^{new}(z)
g_o(z) = h_o(z) t(z) + g_o^{new}(z)    (3.22)
This form is identical to a long division with remainder for Laurent polynomials, where g^{new}(z) is the remainder. Two long divisions have to be carried out in order to extract one lifting step. The Euclidean algorithm for Laurent polynomials can be utilized to calculate the long divisions. The Euclidean algorithm was originally developed to find the greatest common divisor (gcd) of two natural numbers, but it can be extended to handle Laurent polynomials as described in appendix A.
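For reference, the integer version of the algorithm is only a few lines of Python; for Laurent polynomials the modulo step becomes a long division in which the remainder, and hence the lifting step, can be chosen in several ways (see appendix A):

def gcd(a, b):
    # Euclid's algorithm for natural numbers
    while b:
        a, b = b, a % b
    return a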
3.3.6 Calculating Lifting Steps
A lot of research has already been performed on designing wavelet filters for all kinds
of applications. Factoring FIR filters of an existing wavelet transform into lifting
steps will give a fast implementation of an already working wavelet transform.
The LeGall 5/3 wavelet transform filter is widely used in lossless image compression, e.g. the JPEG 2000 standard. The direct form filter coefficients for the 5/3
analysis transform are often given as
h_{LP}(z) = -\frac{1}{8} z^{-2} + \frac{1}{4} z^{-1} + \frac{3}{4} + \frac{1}{4} z - \frac{1}{8} z^{2}
g_{HP}(z) = -\frac{1}{2} z^{-1} + 1 - \frac{1}{2} z    (3.23)
This notation will not generate a reversible wavelet transform; instead the g_{HP} filter should be delayed one sample and the filter coefficients given as

h_{LP}(z) = -\frac{1}{8} z^{-2} + \frac{1}{4} z^{-1} + \frac{3}{4} + \frac{1}{4} z - \frac{1}{8} z^{2}
g_{HP}(z) = -\frac{1}{2} z^{-2} + z^{-1} - \frac{1}{2}    (3.24)

Continuing with dividing the coefficients into even and odd
h_a(z) = \underbrace{-\frac{1}{8} z^{-2} + \frac{3}{4} - \frac{1}{8} z^{2}}_{h_{ae}(z^2)} + z^{-1} \underbrace{\left(\frac{1}{4} + \frac{1}{4} z^{2}\right)}_{h_{ao}(z^2)}

g_a(z) = \underbrace{-\frac{1}{2} z^{-2} - \frac{1}{2}}_{g_{ae}(z^2)} + z^{-1} \underbrace{(1)}_{g_{ao}(z^2)}    (3.25)
and thus

P_a(z) = \begin{pmatrix} h_{ae}(z) & h_{ao}(z) \\ g_{ae}(z) & g_{ao}(z) \end{pmatrix} =
\begin{pmatrix} -\frac{1}{8} z^{-1} + \frac{3}{4} - \frac{1}{8} z & \frac{1}{4} + \frac{1}{4} z \\ -\frac{1}{2} z^{-1} - \frac{1}{2} & 1 \end{pmatrix}    (3.26)

det(P_a(z)) = \left(-\frac{1}{8} z^{-1} + \frac{3}{4} - \frac{1}{8} z\right)(1) - \left(\frac{1}{4} + \frac{1}{4} z\right)\left(-\frac{1}{2} z^{-1} - \frac{1}{2}\right) = 1    (3.27)
The LeGall 5/3 filter is obviously reversible since the determinant of the analysis polyphase matrix equals 1. The analysis filters meet the demands set for perfect reconstruction. Now it is time to extract a lifting step. Note that h_{ae}(z) and h_{ao}(z) are the longest filters, so extracting a primal lifting step should lower the order of h_{ae}(z) and h_{ao}(z).
P_a(z) = \begin{pmatrix} 1 & s(z) \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} h_{ae}^{new}(z) & h_{ao}^{new}(z) \\ -\frac{1}{2} z^{-1} - \frac{1}{2} & 1 \end{pmatrix} =
\begin{pmatrix} h_{ae}^{new}(z) + s(z)\left(-\frac{1}{2} z^{-1} - \frac{1}{2}\right) & h_{ao}^{new}(z) + s(z) \\ -\frac{1}{2} z^{-1} - \frac{1}{2} & 1 \end{pmatrix}    (3.28)

which gives the conditions

h_{ae}^{new}(z) + s(z)\left(-\frac{1}{2} z^{-1} - \frac{1}{2}\right) = -\frac{1}{8} z^{-1} + \frac{3}{4} - \frac{1}{8} z
h_{ao}^{new}(z) + s(z) = \frac{1}{4} + \frac{1}{4} z    (3.29)
From this the Euclidean algorithm can be used with a_0 = h_{ae}(z) and b_0 = g_{ae}(z). Three different factorizations are possible depending on the choice of the remainder:
-\frac{1}{8} z^{-1} + \frac{3}{4} - \frac{1}{8} z =
\begin{cases}
\left(-\frac{1}{2} z^{-1} - \frac{1}{2}\right)\left(-\frac{7}{4} + \frac{1}{4} z\right) - z^{-1} \\
\left(-\frac{1}{2} z^{-1} - \frac{1}{2}\right)\left(\frac{1}{4} + \frac{1}{4} z\right) + 1 \\
\left(-\frac{1}{2} z^{-1} - \frac{1}{2}\right)\left(\frac{1}{4} - \frac{7}{4} z\right) - z
\end{cases}    (3.30)
thus

s(z) = -\frac{7}{4} + \frac{1}{4} z : \quad h_{ae}^{new} = -z^{-1}, \; h_{ao}^{new} = 2
s(z) = \frac{1}{4} + \frac{1}{4} z : \quad h_{ae}^{new} = 1, \; h_{ao}^{new} = 0
s(z) = \frac{1}{4} - \frac{7}{4} z : \quad h_{ae}^{new} = -z, \; h_{ao}^{new} = 2z    (3.31)
Depending on which remainder is chosen, the resulting lifting matrix will have different properties. E.g. if we choose the middle alternative, where h_{ae}^{new} = 1, the lifting matrix will have symmetrical filters, which is more efficient for implementations. Furthermore there will be no extra delays in the lifting matrix.
The analysis polyphase matrix after factorization is

P_a(z) = \begin{pmatrix} 1 & \frac{1}{4} + \frac{1}{4} z \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ -\frac{1}{2} z^{-1} - \frac{1}{2} & 1 \end{pmatrix}    (3.32)
The synthesis transform is the inverse of the analysis, so inverting the lifting steps
of Pa (z) will immediately give the synthesis lifting scheme
P_s(z) = \begin{pmatrix} 1 & 0 \\ \frac{1}{2} z^{-1} + \frac{1}{2} & 1 \end{pmatrix}
\begin{pmatrix} 1 & -\frac{1}{4} - \frac{1}{4} z \\ 0 & 1 \end{pmatrix}    (3.33)
There is no need to extract more lifting steps, since the expression already consists
of a primal and a dual lifting step.
Figure 3.7: LeGall 5/3 filter in lifting representation.
Comparing computational complexity, the direct form implementation of the analysis part in figure 3.4 requires 5 additions and 4 multiplications if the symmetrical properties of the filters are exploited. The lifting implementation in figure 3.7 requires only 4 additions and 2 multiplications.
The LeGall 5/3 filter can be implemented using an integer-to-integer mapping. Because the lifting filter coefficients are all powers of 2, e.g. 1/2 = 2^{-1} and 1/4 = 2^{-2}, the multiplications can be replaced by binary shift operations. This is not only computationally efficient, it also makes the integer-to-integer mapping possible, so that lossless compression can be achieved.
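A minimal integer-to-integer Python sketch of the two lifting steps derived above (the rounding offsets follow the common JPEG 2000 reversible 5/3 convention, and the simple boundary replication is an assumption of this sketch):

import numpy as np

def legall53_forward(x):
    x = np.asarray(x, dtype=np.int64)
    even, odd = x[0::2].copy(), x[1::2].copy()
    # dual lifting step (predict): d[n] = x[2n+1] - floor((x[2n] + x[2n+2]) / 2)
    right = np.append(even[1:], even[-1])
    d = odd - ((even + right) >> 1)
    # primal lifting step (update): s[n] = x[2n] + floor((d[n-1] + d[n] + 2) / 4)
    left = np.insert(d[:-1], 0, d[0])
    s = even + ((left + d + 2) >> 2)
    return s, d

def legall53_inverse(s, d):
    # run the lifting steps in reverse order with opposite signs
    left = np.insert(d[:-1], 0, d[0])
    even = s - ((left + d + 2) >> 2)
    right = np.append(even[1:], even[-1])
    odd = d + ((even + right) >> 1)
    x = np.empty(even.size + odd.size, dtype=np.int64)
    x[0::2], x[1::2] = even, odd
    return x

x = np.random.randint(0, 256, 16)
assert np.array_equal(legall53_inverse(*legall53_forward(x)), x)  # lossless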
An overview of the two most commonly used wavelet transforms, the LeGall 5/3
filter and the Daubechies 9/7 filter, and their lifting steps can be found in appendix
B.
3.3.7 Lifting Properties
As mentioned before, the greatest advantage of the lifting scheme is the reduced computational complexity. It has been proven [1] that for long filters the lifting scheme halves the computational cost compared to the standard FIR filter bank algorithm. The filter bank method has a complexity of order N, which is already more efficient than the FFT with its complexity of order N log(N). The lifting scheme in turn has a complexity of order N/2.
Another important factor besides computational complexity is the memory usage of an implementation. For lifting, the samples needed are basically a small amount of the previous lifting step's output, while the direct form needs more surrounding samples to produce an output sample. This can be utilized in lifting by storing results from the previous step in CPU registers and calculating all lifting steps before storing the final result in external memory, thus reducing memory accesses and improving speed. Lifting is also easily reversible due to the nature of the lifting polyphase matrices. It produces interleaved coefficients: the high-pass coefficients are mixed with the low-pass coefficients. For further details, see figure B.1.
3.4 2-D Wavelet Transform
As previously discussed the result of the transform is a set of λ and γ coefficients.
These coefficients are in fact the low frequency information and the high frequency
information from the original signal. Separating these coefficients and storing them
in different spatial regions will generate sub-bands, as discussed in chapter 3.2.
Figure 3.8: A) One decomposition level. B) Two decomposition levels.
Transforming an image requires a 2-D wavelet transform. Images can be seen as two-dimensional signals, where the x dimension corresponds to the rows of pixels and the y dimension to the columns. The 1-D transform can be applied in the horizontal direction one row at a time; this will generate the decomposition illustrated in figure 3.9.B. The same 1-D transform can then be applied in the vertical direction, one column at a time, resulting in a complete 2-D wavelet transformation of the original image as illustrated in figure 3.9.C. At this point the transformation has reached one decomposition level and there are four different so-called sub-bands present in the original spatial region. These are named LL_1, LH_1, HL_1 and HH_1, where H stands for high-pass and L for low-pass. The index indicates the decomposition level from which the sub-band originates; the order of the letters indicates horizontal or vertical origin, see figure 3.8.
The 2-D transformation procedure can be repeated on the LL sub-band to create
more decomposition levels and thus more sub-bands. Figure 3.9.D illustrates a two
level decomposition.
The wavelet coefficients, representing the original image, are the basis of all
wavelet-based image compression techniques.
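As a sketch, one 2-D decomposition level can be built from the 1-D transform, here reusing the legall53_forward sketch from chapter 3.3.6, applied first to every row and then to every column of an N x N integer image:

import numpy as np

def dwt2d_level(img):
    n = img.shape[0]
    tmp = np.empty_like(img, dtype=np.int64)
    for r in range(n):                           # horizontal pass
        s, d = legall53_forward(img[r, :])
        tmp[r, :n // 2], tmp[r, n // 2:] = s, d  # L left, H right
    out = np.empty_like(tmp)
    for c in range(n):                           # vertical pass
        s, d = legall53_forward(tmp[:, c])
        out[:n // 2, c], out[n // 2:, c] = s, d  # LL/HL top, LH/HH bottom
    return out  # repeat on out[:n//2, :n//2] for the next decomposition level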
Apart from the LL sub-band in figure 3.9, the grey areas represent zeros. This is because the wavelet coefficients range from -1/2 to 1/2 and are presented in an unsigned range of 0 to 255; zero is thus mapped to 127 (-1/2 = 0 and 1/2 = 255). The high proportion of zeros in the outer sub-bands indicates that the most important information resides in the LL sub-band, and this will later be used to achieve good compressibility.
Figure 3.9: A) The original gray image of Lena (256x256x8bpp). B) One horizontal 1-D wavelet transform step applied to A. C) A vertical 1-D wavelet transform step applied to B. D) Both vertical and horizontal steps applied to the LL sub-band in C.
Chapter 4
Quantization
Quantization is the element in lossy compression that makes the data more compressible by reducing the data precision in some way. In most lossy compression
systems, it is the only source of distortion. For lossless compression there should be
no quantization.
4.1 Uniform Dead-Zone Quantization
This is a variant of scalar quantization. Scalar quantization is the simplest of all
quantization methods. It can be described as a function that maps each element in a
subset of the real line to a particular value in that subset. Consider partitioning the
real line into M disjoint intervals
I_q = (t_q, t_{q+1}), \quad q = 0, 1, ..., M-1    (4.1)

with

-\infty = t_0 < t_1 < ... < t_M = +\infty    (4.2)
Within each interval, a point \hat{x}_q is selected as the output value (or codeword) of I_q. A scalar quantizer is then a mapping from R to {0, 1, ..., M-1}. Specifically, for a given x, Q(x) is the index q of the interval I_q which contains x. The inverse quantization, or dequantizer, is given by

Q^{-1}(q) = \hat{x}_q    (4.3)
Figure 4.1: Different graphical representations of scalar quantization.

A very desirable feature of compression systems is the ability to refine the reconstructed data as the bit-stream is decoded. As more of the compressed bit-stream is decoded, the reconstruction can be improved incrementally until full quality reconstruction is obtained upon decoding the entire bit-stream. Compression systems possessing this property are facilitated by embedded quantization.
In so-called embedded quantization, the intervals of higher rate quantizers are embedded within the intervals of lower rate quantizers. Equivalently, the intervals of lower rate quantizers are partitioned to yield the intervals of higher rate quantizers. Uniformly subdividing the intervals of a uniform scalar quantizer yields another uniform scalar quantizer. In this way, a set of embedded uniform scalar quantizers may be constructed.
For the case when ξ = 0,

q = Q(x) = \mathrm{sign}(x) \left\lfloor \frac{|x|}{\Delta} \right\rfloor    (4.4)

and

\hat{x} = Q^{-1}(q) = \begin{cases} 0 & q = 0 \\ \mathrm{sign}(q)(|q| + \delta)\Delta & q \neq 0 \end{cases}    (4.5)
This quantizer has embedded within it all uniform dead-zone quantizers with step sizes 2^p \Delta for integer p ≥ 0. Assuming that the magnitude of q can be represented with K bits, q can be written in sign-magnitude form as

q = Q_{K-1}(x) = s, q_0, q_1, ..., q_{K-1}    (4.6)

Let

q^{(p)} = s, q_0, q_1, ..., q_{K-1-p}    (4.7)

be the index obtained by dropping the last p bits of q. Equivalently, q^{(p)} is obtained by right shifting the binary representation of |q| by p bits. It can then be verified that

Q_{K-1-p}(x) = q^{(p)}    (4.8)

where Q_{K-1-p} is the uniform dead-zone quantizer with step size 2^p \Delta.
If the p LSBs of |q| are unavailable, we may still invert the quantization (dequantize), but at a lower level of quality. The result will be the same as if quantization had been performed using a step size of 2^p \Delta (rather than \Delta) in the first place. In this situation the inverse quantization is performed as

\hat{x} = \begin{cases} 0 & q^{(p)} = 0 \\ \mathrm{sign}(q^{(p)})(|q^{(p)}| + \delta) 2^p \Delta & q^{(p)} \neq 0 \end{cases}    (4.9)

When p = 0, this yields the full quality dequantization.
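A small Python sketch of (4.4) and (4.9); the reconstruction bias δ is assumed to be 0.5 (the interval midpoint) here, which is a common but not mandated choice:

import numpy as np

def quantize(x, delta):
    return (np.sign(x) * (np.abs(x) // delta)).astype(np.int64)   # eq (4.4)

def dequantize(q, delta, p=0, bias=0.5):
    qp = np.sign(q) * (np.abs(q) >> p)      # drop the p LSBs of |q|
    mag = (np.abs(qp) + bias) * 2 ** p * delta
    return np.where(qp == 0, 0.0, np.sign(qp) * mag)              # eq (4.9)

q = quantize(np.array([-7.3, -0.4, 0.0, 2.9, 11.6]), delta=2.0)   # [-3 0 0 1 5]
print(dequantize(q, 2.0))       # full quality dequantization (p = 0)
print(dequantize(q, 2.0, p=1))  # as if the step size had been 2*delta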
Chapter 5
Zero-Tree Coding
When zero-tree coding was first introduced it produced state-of-the-art compression performance at a relatively modest level of complexity. First out was the Embedded Zerotree Wavelet coding of wavelet coefficients (EZW) introduced by Shapiro [2]. Later, Shapiro's zero-trees were generalized, and set partitioning techniques were introduced to efficiently code these generalized trees of zeros. The improved technique is known as Set Partitioning in Hierarchical Trees (SPIHT) [3]. The SPIHT algorithm produces results superior to those of EZW at an even lower level of complexity.
Zero-tree coding is applied to the quantized wavelet coefficients. The actual coding is founded on the premise that if a coefficient is small in magnitude, the spatially corresponding coefficients in higher sub-bands of the wavelet transform also tend to be small. This spatial correlation produces a tree-like coding structure, hence the name zero-tree.
Our implementation, MSC3K, is a slightly modified SPIHT variant of zero-tree
coding. The modification improves compression performance and ensures that prior
knowledge of the quantization procedure is used to achieve faster encoding. For
better understanding of the SPIHT algorithm a more detailed explanation of EZW is
required.
5.1 EZW
EZW can be described as a form of entropy encoding applied to bit-planes of the
quantized wavelet coefficients. Assume an original image of size N ×N where N = 2n
for some integer n ≥ D. This image is then transformed with D levels of a dyadic
tree-structured sub-band transform. This operation results in 3D + 1 sub-bands of
transform coefficients. The transform coefficients are then quantized with a uniform
dead-zone quantization. As a result the original image is now an N × N array of
integers, representing the transform. The coefficients are then ”sliced” into K + 1
bit-planes. The first such bit-plane consists of the sign bit of each index, s[x, y], x, y = 0, 1, ..., N-1. The next bit-plane consists of the MSBs of each magnitude, denoted by q_0[x, y], then q_1[x, y] and so on. Each of the K magnitude bit-planes is coded in two passes. The first pass determines which coefficients should be
visited in the refinement pass in the next bit-plane, or one might say it indicates if
a coefficient has become newly significant. This pass is called the significance pass
and it outputs compressed data. The second pass is called the refinement pass and it
outputs the bits of coefficients that have been determined significant, not the newly
significant coefficients though. The newly significant coefficients become significant
in the next bit-plane. So when coding q0 , there are no significant coefficients yet, just
newly significant ones. Thus the refinement pass for q0 is skipped.
5.1.1 The Significance Pass
Each insignificant coefficient in the wavelet transform is visited in raster order (left-to-right, top-to-bottom), first within LL_D, then within LH_D, then HL_D, then HH_D, then LH_{D-1} and so on. Because most energy is concentrated in the lower frequency sub-bands, this order gives the coefficients their rightful priority. Compression is accomplished via a 4-ary alphabet:
• POS = Significant Positive. This symbol equals 1 followed by a ”positive” sign
bit.
• NEG = Significant Negative. This symbol equals 1 followed by a ”negative”
sign bit.
• ZTR = Zero Tree Root. This symbol indicates that the current bit is 0, and
that all descendants of the bit are 0.
• IZ = Isolated Zero. This symbol indicates that the bit is 0, but at least one descendant is 1.
Figure 5.1: EZW descendants structure for one sample in the LL3 sub-band.
The descendants of a coefficient in a zero-tree are often referred to as children or grandchildren depending on the relationship, see figure 5.1. The highest frequency sub-bands, HL_1, LH_1 and HH_1, have no children, so the symbols ZTR and IZ have no meaning in these regions. Instead these symbols are replaced by a single symbol Z.
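As a Python sketch, one insignificant coefficient maps to this alphabet during the significance pass as follows (the helper takes the coefficient's bit in the current bit-plane, its sign, and the bits of its descendants in the same bit-plane; an empty descendant list marks the highest-frequency sub-bands):

from enum import Enum

Sym = Enum('Sym', 'POS NEG ZTR IZ Z')

def ezw_symbol(bit, sign, descendant_bits):
    if bit:                      # newly significant: the sign is coded with it
        return Sym.POS if sign >= 0 else Sym.NEG
    if not descendant_bits:      # HL1/LH1/HH1: ZTR and IZ collapse to Z
        return Sym.Z
    return Sym.ZTR if not any(descendant_bits) else Sym.IZ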
5.1.2 The Refinement Pass
In the refinement pass a single bit is coded for each coefficient known to be significant.
A coefficient is significant if it has been coded POS or NEG in a previous bit-plane.
Its current refinement bit is simply its corresponding bit in the current bit-plane. The
coefficients are visited in a special order: they are sorted by two criteria, first by magnitude, second by raster order within sub-bands, with the sub-bands taken in the same order as they are visited in the significance pass.
Coefficients that have been determined significant are treated as IZ or Z symbols in
the remaining significance passes.
5.1.3 Example of EZW
The sign plane and the first two magnitude bit-planes (q0 , q1 ) for a D = 3 level dyadic
orthonormal transform are shown in figure 5.2. The coding begins with a significance
pass through q0. First the single bit in LL3 is examined; since it is 1 and its sign is +, a POS symbol is output. The bit in HL3 is coded as ZTR, indicating that it and all its descendants are 0. LH3 is coded similarly to LL3 but the sign differs, so it is coded as NEG. HH3 is coded as ZTR. No further coding of HL2, HL1, HH2 and HH1 is done for bit-plane q0. Proceeding to LH2, four symbols are coded: ZTR, ZTR, POS, ZTR. Now only LH1 remains to be coded; 12 of its 16 bits have already been coded as zeros by the ZTR symbols in LH2. The remaining 4 bits are coded Z, Z, Z, Z.
The resulting symbols from the significance pass of q0 are given by:
POS, ZTR, NEG, ZTR,
ZTR, ZTR, POS, ZTR,
Z, Z, Z, Z
The symbols are broken into three lines for clarity of the text. The first line
represents region LL3 , LH3 , HL3 and HH3 , the second line is from LH2 , HL2 and
HH2 , the last line is for LH1 , HL1 and HH1 .
By virtue of the q0 significance pass there are now three coefficients known to be
significant, q[0, 0], q[0, 1] and q[0, 3]. The q1 bit for each of these coefficients is coded
in the q1 refinement pass. No sorting is required of the significant coefficients since
only one bit-plane has been coded. This is because all these coefficients have the q0
Figure 5.2: Sign plane (s) and first two magnitude bit-planes (q0, q1) of a three level dyadic sub-band decomposition of an 8 × 8 image.
bit set to 1, so sorting by criterion one is pointless; they have also been coded in the correct raster order, so sorting by criterion two is also pointless. The code symbols for
the q1 refinement pass are thus:
q1 [0, 0], q1 [0, 1], q1 [0, 3] = 1, 1, 0
The significance pass for q1 then codes all remaining bits in the q1 bit-plane. The
symbols from this pass are given by:
IZ, POS
ZTR, POS, ZTR, ZTR, ZTR, ZTR, IZ, ZTR, ZTR, ZTR, ZTR
Z, Z, Z, Z, Z, NEG, Z, Z, Z, Z, Z, Z
The first IZ symbol originates from region LH3 since it’s 0 but has a non-zero descendant in LH2 of the q1 bit-plane.
The q1 significance pass has added three more significant coefficients (q[3, 0], q[1, 1]
and q[3, 4]). The coding continues with the q2 refinement pass, the q2 significance pass,
the q3 refinement pass and so on.
5.2 SPIHT
The SPIHT algorithm has many things in common with the EZW algorithm, but
there are also many differences. In particular, the parent-child relationship in LLD
is altered (figure 5.3), the order of the refinement and significance passes is reversed, a
new type of zero-tree is introduced, more inter-bit-plane memory is exploited and all
SPIHT symbols are binary. The operations done prior to the actual zero-tree coding,
e.g. the wavelet transform and the quantization, are practically identical to those of
the EZW algorithm. Both algorithms apply a D level orthonormal transform and
quantize the transform indices with a dead-zone scalar quantization.
Figure 5.3: Parent child relationships in SPIHT.
5.2.1 Different Zero-Trees
SPIHT uses two different kinds of zero-trees, Type A and Type B. Both are used
to signal the existence of insignificant sets of coefficients. Type A indicates that all
descendants of the root, within a given bit-plane, are zero. The root itself does not
need to be zero and this differs from the EZW zero-tree. In fact, although the Type
A zero-tree is specified by the coordinates of the root, the root itself is not included
in the tree.
The Type B zero-tree indicates that all grandchildren and great grandchildren, etc, are zero. The Type B tree does not include the root itself, nor does it include the root's children. Type B trees can be thought of as the union of four Type A trees, each of which originates from a child of the Type B root.
5.2.2 Ordered Lists
SPIHT is best explained by a set of ordered lists. These are used internally by
the encoder and the decoder to keep track of significant and insignificant sets of
coefficients. There are three kinds of lists:
• List of Insignificant Coefficients (LIC).
• List of Significant Coefficients (LSC).
• List of Insignificant Sets of coefficients (LIS).
The LIC holds the coordinates of insignificant coefficients; one item in the LIC corresponds to one single insignificant coefficient. The LSC holds the coordinates of significant coefficients, and one item in this list corresponds to one single significant coefficient. The LIS stores the locations of Type A and Type B zero-tree roots. All coefficients are represented with these three lists.
5.2.3 The Coding Passes
First, the coding is initialized by adding each coefficient in LLD to the LIC list. The
LSC list is initially empty since nothing is known yet about the significance of the
coefficients. Every coefficient within the LLD sub-band which has descendants is
added as a Type A root to the LIS list, thus initially indicating that all coefficients
are insignificant.
The order of the significance and refinement passes is reversed from that of EZW. Thus, coding starts with a significance pass followed by a refinement pass through the current bit-plane. The refinement pass codes a refinement bit for each coefficient
that was significant at the end of the previous bit-plane. All coefficients that became
significant in the significance pass, prior to the refinement pass, can be regarded as
”newly significant” coefficients. These are not coded in the following refinement pass.
The significance pass starts with examining the coefficients in LIC of the current
bit-plane. The bit-values of the coefficients are coded, if a bit is coded as 1, its sign
bit is coded immediately thereafter, the coefficient is then moved from the LIC list
to the LSC list. The list of LIS roots is examined next. If the descendants of a root
remain insignificant a 0 is coded and processing proceeds to the next entry in LIS. If not all descendants are insignificant, a 1 is coded. Now one of two different actions
is taken depending on the type of the root: 1) A Type A root is changed to Type B
and moved to the bottom of the LIS list. The four children of the root are coded as
if they were in LIC. After this the children are either in LIC or in LSC depending on
their bit-values. 2) A Type B root is deleted from LIS. Its children are added
as Type A roots to LIS.
Processing continues in this fashion throughout all entries in LIS. Note that entries
added in the current pass, e.g. from examining a Type B root, are also processed in
the same pass.
The refinement pass then codes the magnitude bits of the coefficients in the LSC that were already significant at the start of the current pass. Thus the refinement pass of the q0 bit-plane is skipped.
5.2.4 The SPIHT Algorithm
For a given coefficient at location [x, y] let C[x, y] be the set of its children. Let D[x, y]
represent all descendants of [x, y] and let G[x, y] represent the set of grandchildren,
great grandchildren, etc to [x, y]. s[x, y] represents the sign of [x, y]. By this notation
follows:
G[x, y] = D[x, y] - C[x, y]    (5.1)
Let Sk(B) represent a boolean function which returns 0 if the coefficients of bit-plane k, described by B, are all 0, and returns 1 if at least one coefficient is 1.
The SPIHT algorithm can be described in pseudo code using the above described
notation:
1. Initialization
• Set k = 0, LSC = empty, LIC = {all coordinates [x, y] ∈ LLD }, LIS = {all
coordinates of LIC that have children}. All entries of the LIS list are set
to Type A.
2. Significance Pass
• For each [x, y] ∈ LIC, do:
– Output qk [x, y]. If qk [x, y] = 1, output s[x, y] and move [x, y] to the
end of the LSC list.
• For each [x, y] ∈ LIS, do:
– If the set is of Type A, output Sk (D[x, y]). If Sk (D[x, y]) = 1, then
∗ For each [i, j] ∈ C[x, y], output qk [i, j]. If qk [i, j] = 0, add [i, j] to
the LIC. Else, output s[i, j] and add [i, j] to the LSC.
∗ If G[x, y] exists for [x, y] then move [x, y] to the end of LIS as a
Type B root. Else delete [x, y] from the LIS.
– If the set is of Type B, output Sk (G[x, y]). If Sk (G[x, y]) = 1, then
add each [i, j] ∈ C[x, y] to the end of the LIS (as Type A) and delete
[x, y] from LIS.
3. Refinement Pass
• For each [x, y] ∈ LSC, output qk [x, y]. Only coefficients that were in the LSC before the most recent significance pass are refined.
4. Set k = k + 1 and go to Step 2).
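As a Python sketch, the sets C, D, G and the significance test Sk can be written directly from the definitions above (for an n × n dyadic decomposition of a 2-D array q of quantizer indices; the altered LL-band parent-child relation of figure 5.3 is omitted here for brevity):

def C(y, x, n):
    if 2 * y >= n or 2 * x >= n:   # highest-frequency sub-bands have no children
        return []
    return [(2 * y, 2 * x), (2 * y, 2 * x + 1),
            (2 * y + 1, 2 * x), (2 * y + 1, 2 * x + 1)]

def D(y, x, n):                    # all descendants of [y, x]
    out, stack = [], C(y, x, n)
    while stack:
        c = stack.pop()
        out.append(c)
        stack.extend(C(c[0], c[1], n))
    return out

def G(y, x, n):                    # grandchildren and beyond: G = D - C
    children = set(C(y, x, n))
    return [c for c in D(y, x, n) if c not in children]

def S_k(q, coords, k, n_planes):
    # 1 if any coefficient in the set has a 1 in magnitude bit-plane k
    # (k = 0 denotes the most significant magnitude bit-plane)
    return int(any((abs(int(q[c])) >> (n_planes - 1 - k)) & 1 for c in coords))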
5.2.5 Example of SPIHT
Figure 5.4: Sign plane (s) and first two magnitude bit-planes (q0, q1) of a D = 2 level dyadic sub-band decomposition of an 8 × 8 image.
The sign plane and the first two magnitude bit-planes (q0 and q1 ) are shown in
figure 5.4. Coding starts with the initialization step and the lists are given by
LSC        LIC        LIS
           [0,0]      [1,0]A
           [1,0]      [0,1]A
           [0,1]      [1,1]A
           [1,1]
The significance pass starts with the coding of the coefficients in LIC. Thus, q0[0, 0] = 1 is output, followed by s[0, 0] = +. Then q0[1, 0] = 0 and q0[0, 1] = 1, followed by s[0, 1] = −, and q0[1, 1] = 0, so [0, 0] and [0, 1] are moved to LSC.
Processing then proceeds to the LIS. S0([1, 0]) = 1 is output since not all descendants of [1, 0] are 0. The children of [1, 0], C[1, 0] = {[2, 0], [3, 0], [2, 1], [3, 1]}, are then
coded. The resulting output values are 0,0,0,0. All of the children are then added to
LIC since none of them were significant. The root [1, 0] is changed to Type B and
moved to the bottom of LIS. Next entry in LIS is [0, 1], S0 ([0, 1]) = 1 is output and
the children are coded. The output of C[0, 1] = {[0, 2], [1, 2], [0, 3], [1, 3]} is 0,0,0,1,+.
The coefficients [0, 2], [1, 2] and [0, 3] are added to LIC while [1, 3] is added to LSC.
The root itself is changed to Type B and moved to the bottom. The last original
entry in LIS, [1, 1], is coded 0 since all its descendants are 0. [1, 0]B is coded 1 and
C[1, 0] = {[2, 0], [3, 0], [2, 1], [3, 1]} are added to LIS as Type A roots. [0, 1]B is coded
0 since all its grandchildren are 0. The children of [1, 0] are then examined next since
there are no other entries in LIS. S0([2, 0]) = 0 and S0([3, 0]) = 1; this root is not changed
to Type B because it has no grandchildren. The output of C[3, 0] is 1,-,0,0,0. The
outputs S0([2, 1]) = 0 and S0([3, 1]) = 0 finish the significance pass for q0, and the total output is
1,+,0,1,-,0,1,0,0,0,0,1,0,0,0,1,+,0,0,1,0,1,1,-,0,0,0,0,0
As mentioned previously, the refinement pass for q0 is skipped. The ordered lists now look like this:
LSC        LIC        LIS
[0,0]      [1,0]      [1,1]A
[0,1]      [1,1]      [0,1]B
[1,3]      [2,0]      [2,0]A
[6,0]      [3,0]      [2,1]A
           [2,1]      [3,1]A
           [3,1]
           [0,2]
           [1,2]
           [0,3]
           [7,0]
           [6,1]
           [7,1]
The q1 significance pass starts with LIC, the output is 1,-,0,0,1,+,0,0,0,0,1,+,1,+,0,1,and [1, 0], [3, 0], [0, 3], [7, 0], [7, 1] are moved to LSC.
Now the roots in the LIS are coded. All descendants of [1, 1] are still insignificant, so a 0 is output. S1 ([0, 1]) = 1, so [0, 2], [1, 2], [0, 3] and [1, 3] are added to the LIS as Type A roots and [0, 1] is deleted from the LIS. Coding continues with S1 ([2, 0]) = 0, S1 ([2, 1]) = 0, S1 ([3, 1]) = 0, S1 ([0, 2]) = 0 and S1 ([1, 2]) = 1. C[1, 2] is coded as
0,1,-,0,0 and [3, 4] is added to LSC while [2, 4], [2, 5] and [3, 5] are added to LIC. The
remaining coefficients in LIS are coded S1 ([0, 3]) = 0 and S1 ([1, 3]) = 1. The children
are coded 1,+,0,0,1,- and [2, 6] and [3, 7] are added to LSC while [3, 6] and [2, 7] are
added to LIC. The root [1, 3] is deleted from LIS since it has no grandchildren. The
resulting output of the q1 significance pass is thus

1,-,0,0,1,+,0,0,0,0,1,+,1,+,0,1,-,0,1,0,0,0,0,1,0,1,-,0,0,0,1,1,+,0,0,1,-

The q1 refinement pass outputs refinement bits for [0, 0], [0, 1], [1, 3] and [6, 0]. The total output is 1,1,0,0.
Coding is now finished and the lists look like this
LSC      LIC      LIS
[0,0]    [1,1]    [1,1]A
[0,1]    [2,0]    [2,0]A
[1,3]    [2,1]    [2,1]A
[6,0]    [3,1]    [3,1]A
[1,0]    [0,2]    [0,2]A
[3,0]    [1,2]    [0,3]A
[0,3]    [6,1]
[7,0]    [2,4]
[7,1]    [2,5]
[3,4]    [3,5]
[2,6]    [3,6]
[3,7]    [2,7]
5.3 MSC3K
MSC3K, ”Mystery Science Codec 3000”, is a zero-tree compression technique similar
to that of SPIHT. The core of the MSC3K codec is the zero-tree compression discussed
here; for a more complete overview, see chapter 9.1.
There are two changes relative to the original SPIHT: first, the coefficients in LLD are treated separately, and second, some more inter-band memory is used.
The LLD sub-band holds most of the energy in the image. MSC3K codes all
of LLD ”raw” before zero-tree coding begins. The LLD coefficients are coded one
bit-plane at a time, starting with the most significant bit-plane.
5.3.1 Knowledge of Quantization
The improvements made to the SPIHT algorithm are based on knowledge of the quantization. Generally the LLD sub-band contains most of the image energy, so the quantizer favors this region by giving it more bits per coefficient than the other sub-bands. Figure 5.5 illustrates a typical distribution of bits over the sub-bands after quantization. Knowing this distribution, either through constant quantization or through additional data, makes it possible to improve both speed and efficiency.
Figure 5.5: Example of bit-allocation for different sub-band regions of a D = 2 level
dyadic sub-band decomposition of an 8 × 8 image.
With this knowledge about the quantization, the bit-planes can be described with different sizes. For instance, bit-planes q0 and q1 reside only in LL3 in figure 5.5. Similarly, bit-plane q2 resides in LH2 , HL2 and LL3 , while bit-plane q4 also includes HH2 . In this case coding begins with bit-plane q2 , because in bit-planes q0 and q1 there are no descendants in the higher sub-bands.
Why code something that is already known? The answer is of course to avoid it. With additional information about the bit-depth in the different sub-bands it is possible to discard the coding of some Type B roots, given that the outcome is already known. This also saves CPU cycles, because these Type B roots never have to be investigated.
5.3.2 The MSC3K Algorithm
The discarded Type B roots are temporarily stored in an ordered list similar to the lists in the SPIHT algorithm. This list, called the XLIS (Extra List of Insignificant Coefficients), holds all Type B roots that would have been coded 0 according to the quantization knowledge. The roots are lifted out of the coding process and inserted back into the LIS at the start of the next bit-plane, given that the roots have descendants in that bit-plane. If the aim is a low bitrate then the quantization is often heavy, creating a wide distribution of bit-depths across the sub-bands. In this case the XLIS improvement gives a significant boost in efficiency, but if the goal is high quality and low compression then the improvement is less significant. The extra quantization information that needs to be coded in the bitstream is small compared to the overall gain.
The MSC3K algorithm can be described by the following pseudo code.
1. Initialization
• Set k = {first bit-plane where the LLD roots have descendants}, LSC = empty, LIC = empty, LIS = {all coordinates in LLD that have children}. All entries of the LIS list are set to Type A.
2. Significance Pass
• For each [x, y] ∈ LIC, do:
– Output qk [x, y]. If qk [x, y] = 1, output s[x, y] and move [x, y] to the
end of the LSC list.
• For each [x, y] ∈ XLIS, check if [x, y] has grandchildren in the current bit-plane; if so, move [x, y] to the end of the LIS.
• For each [x, y] ∈ LIS, do:
– If the set is of Type A, output Sk (D[x, y]). If Sk (D[x, y]) = 1, then
∗ For each [i, j] ∈ C[x, y], output qk [i, j]. If qk [i, j] = 0, add [i, j] to
the LIC. Else, output s[i, j] and add [i, j] to the LSC.
∗ If G[x, y] exists for [x, y], using the current bit-plane size (based
on quantization), then move [x, y] to the end of LIS as a Type B
root. Else move [x, y] to the XLIS.
– If the set is of Type B, output Sk (G[x, y]). If Sk (G[x, y]) = 1, then
add each [i, j] ∈ C[x, y] to the end of the LIS (as Type A) and delete
[x, y] from LIS.
3. Refinement Pass
• For each [x, y] ∈ LSC, output qk [x, y]. Only coefficients that were in LSC
before the most recent significance pass should be refined.
4. Set k = k + 1 and go to Step 2).
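Reusing the Coord and Root types from the SPIHT sketch in section 5.2.5, the XLIS re-insertion step at the start of each bit-plane could be sketched as follows. hasGrandchildrenInPlane() is a hypothetical helper that consults the per-sub-band bit-depth table derived from the quantization knowledge; it is not part of the published MSC3K code.

// Sketch of the MSC3K XLIS re-insertion step (assumed helper names).
bool hasGrandchildrenInPlane(Coord c, int k);   // from quantization knowledge

void reinsertFromXlis(int k, std::deque<Coord>& xlis, std::deque<Root>& lis) {
    for (auto it = xlis.begin(); it != xlis.end();) {
        if (hasGrandchildrenInPlane(*it, k)) {          // descendants exist now
            lis.push_back({*it, RootType::B, true});    // back into the LIS
            it = xlis.erase(it);
        } else ++it;                                    // stays parked in XLIS
    }
}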
Chapter 6
Data Compression
Most of today’s data compression methods fall into one of two camps: dictionary
based schemes and statistical methods. Dictionary based methods are widely used,
but arithmetic coding combined with statistical methods achieves better performance.
Data compression operates in general by taking symbols from an input stream,
and writing codes to a compressed output stream. Symbols are usually bytes or bits.
Dictionary based compression schemes operate by replacing groups of symbols in the input stream with fixed length codes. This is the case in e.g. LZW.
Statistical methods of data compression take a completely different approach. They operate by encoding symbols one at a time into variable length output codes. The length of the output code varies based on the probability or frequency of the symbol. Low probability symbols are encoded using many bits, and high probability symbols are encoded using fewer bits. In practice, the line between statistical and dictionary methods is not always distinct.
6.1 Huffman Coding
Huffman coding is probably the best known coding method based on probability
statistics. First the probability of the incoming symbols in the input stream is calculated. The method, published by D.A. Huffman in 1952, then creates a code table
for the symbols given their probabilities. The Huffman code table has been proved to produce the lowest possible output bit count for the input stream of symbols, when each symbol is assigned a fixed code.
Huffman coding assigns an output code to each symbol, with the output codes
being as short as 1 bit, or considerably longer than the input symbols, strictly depending on their probabilities. The optimal number of bits to be used for each symbol
is

    log2(1/p)                                    (6.1)
where p is the probability of a given character. Thus, if the probability of a
character is 1/256, such as would be found in a random byte stream, the optimal
number of bits per character is log2 (256), or 8. If the probability goes up to 1/2, the
optimum number of bits needed to code the character would go down to 1.
The problem with this scheme lies in the fact that Huffman codes have to be an
integral number of bits long. For example, if the probability of a character is 1/3, the
optimum number of bits to code that character is around 1.6. The Huffman coding
scheme has to assign either 1 or 2 bits to the code, and either choice leads to a longer
compressed message than is theoretically possible. This non-optimal coding becomes
a noticeable problem when the probability of a character becomes very high. If a
statistical method can be developed that can assign a 90% probability to a given
character, the optimal code size would be 0.15 bits. The Huffman coding system
would probably assign a 1 bit code to the symbol, which is 6 times longer than is
necessary.
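To illustrate how such a code table can be built, the following is a minimal C++ sketch that constructs Huffman code lengths for the symbols of the message ”HELLO” using a priority queue. It is an illustrative toy, not the entropy coder of any codec discussed in this thesis.

// Build Huffman code lengths from symbol frequencies (illustrative sketch).
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Node { long weight; int left, right; };   // leaf if left == -1

void assignLengths(const std::vector<Node>& t, int i, int d, std::vector<int>& len) {
    if (t[i].left < 0) { len[i] = d; return; }
    assignLengths(t, t[i].left, d + 1, len);
    assignLengths(t, t[i].right, d + 1, len);
}

int main() {
    const char* sym = "EHLO";
    long freq[] = {1, 1, 2, 1};                  // counts in "HELLO"
    std::vector<Node> tree;
    using Item = std::pair<long, int>;           // (weight, node index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    for (int i = 0; i < 4; ++i) { tree.push_back({freq[i], -1, -1}); pq.push({freq[i], i}); }
    while (pq.size() > 1) {                      // merge the two lightest subtrees
        Item a = pq.top(); pq.pop();
        Item b = pq.top(); pq.pop();
        tree.push_back({a.first + b.first, a.second, b.second});
        pq.push({a.first + b.first, (int)tree.size() - 1});
    }
    std::vector<int> len(tree.size(), 0);
    assignLengths(tree, (int)tree.size() - 1, 0, len);
    for (int i = 0; i < 4; ++i)                  // prints one optimal set of lengths
        std::printf("%c: %d bits\n", sym[i], len[i]);
}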
6.2 Adaptive Coding
Another problem with Huffman coding comes when there is no statistical information
on the incoming symbols. In this case the compression method must adapt and
continuously update the probabilities of the symbols. The problem with adaptive coding in the Huffman case is that rebuilding the Huffman tree is a very expensive process. For an adaptive scheme to be efficient the probabilities must be updated after every incoming symbol, and for Huffman coding this becomes very time-consuming.
When doing non-adaptive coding, the compression scheme makes a single pass over
the incoming symbols to collect statistics. The incoming symbols are then encoded
using the statistics, which are unchanged throughout the encoding process. In order
for the decoder to decode the compressed stream, it first needs a copy of the statistics.
The encoder usually adds a statistics message to the compressed stream so the decoder
can read it before starting. This obviously adds some extra data, but usually a
frequency count of each character could be stored in as little as 256 bytes with fairly
high accuracy. The other option is to use predefined tables; this is quite common in standards, e.g. JPEG and MPEG-1 Layer 3 (MP3).
In order to bypass this problem, adaptive data compression is called for. In adaptive compression both the encoder and the decoder initiate their statistical model in the same state. Each of them processes one symbol at a time and updates its model. A slight amount of efficiency is lost by starting with non-optimal states, but this is usually more efficient in the long run than passing any statistics in the compressed stream.
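A minimal sketch of such a shared adaptive model is given below: encoder and decoder both start from uniform counts and apply the same update after every symbol, so their probability tables never diverge. The struct and method names are illustrative assumptions.

// Adaptive order-0 model shared by encoder and decoder (sketch).
#include <cstdint>
#include <vector>

struct AdaptiveModel {
    std::vector<uint32_t> count;
    explicit AdaptiveModel(int symbols) : count(symbols, 1) {}  // non-zero start
    uint32_t total() const { uint32_t t = 0; for (auto c : count) t += c; return t; }
    // Cumulative range [lo, hi) of symbol s, in units of total(); this is what
    // an arithmetic coder consumes.
    void range(int s, uint32_t& lo, uint32_t& hi) const {
        lo = 0;
        for (int i = 0; i < s; ++i) lo += count[i];
        hi = lo + count[s];
    }
    void update(int s) { ++count[s]; }   // called after coding each symbol
};

Both sides must call update() at the same point relative to coding each symbol; otherwise the two models drift apart and decoding fails.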
6.3 Arithmetic Coding
Arithmetic coding bypasses the idea of replacing an input symbol with a specific code. Instead, the main idea is to represent a whole sequence of symbols with a single floating point number. The longer the message, the more bits are needed to describe the corresponding number. It was not until fairly recently that practical methods for implementation in fixed-size registers were found. For more theoretical background, see [4].
6.3.1 The Basics
The output from an arithmetic coding process is a single number less than 1 and greater than or equal to 0. This single number can be uniquely decoded to recreate the exact input symbols. Before encoding, a probability has to be assigned to each symbol. The message ”HELLO”, for example, would have the probability distribution shown in table 6.1.
Symbol    Probability    Range
E         1/5            0.00-0.20
H         1/5            0.20-0.40
L         2/5            0.40-0.80
O         1/5            0.80-1.00
Table 6.1: Probability distribution for the message ”HELLO”.
The individual symbols are also assigned a range along a probability line, which
is nominally 0 to 1. It doesn’t matter which characters are assigned which segment
of the range, as long as it is done in the same manner by both the encoder and the
decoder. The symbol range is in direct proportion to the probability of the symbol.
Each symbol ”owns” the assigned range up to, but not including the higher number.
So the symbol ”O” in fact has the range 0.80-0.999999...
New Symbol    Low Limit    High Limit
              0.00         1.00
H             0.20         0.40
E             0.20         0.24
L             0.216        0.232
L             0.2224       0.2288
O             0.22752      0.2288
Table 6.2: Coding scheme for the message ”HELLO”.
When encoding the first symbol, a high and a low limit are assigned according to the range
of the symbol. So after the symbol ”H” is encoded the low limit is assigned 0.20
and the high limit is assigned 0.40. What happens during the rest of the encoding
process is that each new symbol to be encoded will further restrict the possible range
of the output number. The next symbol, ”E”, owns the range 0.00-0.20. This range
will be applied to the current subrange, 0.20-0.40, thus forming the range 0.20-0.24.
Repeating this procedure for the message ”HELLO” results in table 6.2. The pseudo
code to accomplish this for a message of any length is
• Initialize and set low = 0.0 and high = 1.0.
• For every input symbol do:
– range = high − low
– high = low + range × highRange(symbol)
– low = low + range × lowRange(symbol)
• Output low.
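As a concrete illustration, the following is a toy floating point transcription of the loop above, reproducing the ”HELLO” walk-through of table 6.2. It is only a sketch: as section 6.3.2 explains, practical implementations use integer arithmetic, since double precision only survives short messages.

// Toy floating-point arithmetic encoder for the "HELLO" example.
#include <cstdio>
#include <map>
#include <string>

int main() {
    // Symbol ranges from table 6.1.
    std::map<char, std::pair<double, double>> r = {
        {'E', {0.00, 0.20}}, {'H', {0.20, 0.40}},
        {'L', {0.40, 0.80}}, {'O', {0.80, 1.00}}};
    double low = 0.0, high = 1.0;
    for (char c : std::string("HELLO")) {
        double range = high - low;               // narrow the interval
        high = low + range * r[c].second;
        low  = low + range * r[c].first;
        std::printf("%c: low=%.5f high=%.5f\n", c, low, high);
    }
    std::printf("code: %.5f\n", low);            // prints 0.22752
}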
The final low value 0.22752 will uniquely code the message ”HELLO” given the
encoding scheme in table 6.2. With this scheme it is also possible to decode the message. The first symbol in the message is extracted by identifying the symbol that owns
the code space that corresponds to the unique number. Since the number 0.22752
falls between 0.20-0.40 the first symbol must be ”H”. Because the low and high
ranges of ”H” are known, the encoding procedure can be reversed and the range can
be updated to give the next symbol. This is done by subtracting the low limit of
”H”, which is 0.20, and then dividing by the range of ”H”. This gives a value of
(0.22752 − 0.20)/0.20 = 0.1376, which indicates that the symbol is ”E”. The pseudo
code for the decoding algorithm should look like this
• Get the encoded number and do:
– Output the symbol that falls in the range of the number.
– range = highRange(symbol) − lowRange(symbol)
– number = number − lowRange(symbol)
– number = number/range
• Start over with the new number
Note that there is no way of telling when there are no more symbols to decode.
This is solved by either encoding a special symbol at the end or transmitting the
message length along with the coded message.
The decoding algorithm for the ”HELLO” message will proceed like this:
Encoded Number    Output    Low     High    Range
0.22752           H         0.20    0.40    0.20
0.1376            E         0.00    0.20    0.20
0.688             L         0.40    0.80    0.40
0.72              L         0.40    0.80    0.40
0.8               O         0.80    1.00    0.20
0
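A matching toy decoder, again in floating point, reproduces the trace above. The message length is passed in explicitly because, as noted, the code number alone does not signal the end of the message.

// Toy floating-point arithmetic decoder for the "HELLO" example.
#include <cstdio>
#include <map>
#include <string>

int main() {
    std::map<char, std::pair<double, double>> r = {
        {'E', {0.00, 0.20}}, {'H', {0.20, 0.40}},
        {'L', {0.40, 0.80}}, {'O', {0.80, 1.00}}};
    double number = 0.22752;
    int length = 5;                              // transmitted with the message
    std::string out;
    for (int i = 0; i < length; ++i) {
        for (auto& [sym, range] : r)
            if (number >= range.first && number < range.second) {
                out += sym;                      // symbol owning the code space
                number = (number - range.first) / (range.second - range.first);
                break;
            }
    }
    std::printf("%s\n", out.c_str());            // prints HELLO
}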
6.3.2 Implementation
At first glance arithmetic coding seems impossible to implement, since in theory it requires infinite precision. Even with 80-bit float registers it is not possible to code more than 10-15 bytes, and the float operations would take a lot of time when working with massive amounts of data. As it turns out, arithmetic coding is best implemented using standard 16-bit or 32-bit integer math. No floating point math is required, nor would it help to use it. What is used instead is an incremental transmission scheme, where fixed size integer state variables receive new bits in at the low end and shift them out at the high end, forming a single number that can be as many bits long as the computer's storage medium allows.
Going from float to integer requires some adaptations. The range 0.0-1.0 can be
represented as 0.0-0.99999.. or 0.1111.. in binary. For 16-bit registers this would mean
that high is initially set to 0xFFFF and low to 0x0000. The high value continues
with FFs forever, and low continues with 0s forever, so extra bits can be shifted in
with impunity when they are needed.
The 5-digit decimal register equivalent would be high = 99999 and low = 00000.
The same encoding algorithm from the previous section can be applied to calculate
new range numbers. Note that the difference between the two registers will be 100000,
not 99999. Assuming the high register has an infinite number of 9’s added to it
the theoretical difference would be 100000. Coding the message ”HELLO” once
again gives the range 0.20-0.40 for the symbol ”H”. In register representation this
corresponds to high = 40000 and low = 20000, but the high value needs to be
decremented, once again because of the implied trailing 9s in the integer register. So
the new value of high is 39999. Coding ”E” results in high = 23999 and low = 20000.
At this point the most significant digits of high and low match. Due to the nature of arithmetic coding, high and low will continue to grow closer and closer without ever becoming equal. This means that once both registers match in their most significant
digit, that digit will never change. This digit can now be shifted out and output as
the first digit of the encoded number. The shifting is done by shifting both high
and low left by one digit and shifting in a 9 in the least significant digit of high.
Processing the whole ”HELLO” message will give
Action          High     Low      Range     Tot. Output
Initial state   99999    00000    100000
Encode H        39999    20000    20000
Encode E        23999    20000    4000
Shift out 2     39999    00000    40000     0.2
Encode L        31999    16000    16000     0.2
Encode L        28799    22400    6400      0.2
Shift out 2     87999    24000    64000     0.22
Encode O        87999    75200    12800     0.22
Shift out 7     79999    52000    28000     0.227
Shift out 5     99999    20000    80000     0.2275
Shift out 2     99999    00000    100000    0.22752
When all symbols have been encoded the registers have to be shifted until they
reach their initial states. The shifting can be done from either the high or the low
register, the output number will still fall within the right range.
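The following C++ sketch implements the 5-digit decimal-register scheme just described and reproduces the trace above. Underflow handling (section 6.3.3) is omitted for brevity, and the hard-coded cumulative counts come from table 6.1; a real coder would of course use binary registers and an adaptive model.

// Decimal-register arithmetic encoder sketch for the "HELLO" example.
#include <cstdio>
#include <string>

int main() {
    // Cumulative counts over a total of 5, from table 6.1: E,H,L,O.
    auto cum = [](char c, long& lo, long& hi) {
        switch (c) {
            case 'E': lo = 0; hi = 1; break;
            case 'H': lo = 1; hi = 2; break;
            case 'L': lo = 2; hi = 4; break;
            default:  lo = 4; hi = 5; break;     // 'O'
        }
    };
    long low = 0, high = 99999;
    std::string out = "0.";
    for (char c : std::string("HELLO")) {
        long cl, ch;
        cum(c, cl, ch);
        long range = high - low + 1;             // implied trailing digits
        high = low + range * ch / 5 - 1;         // high carries the implied 9s
        low  = low + range * cl / 5;
        while (low / 10000 == high / 10000) {    // leading digits match: shift out
            out += char('0' + low / 10000);
            low  = low % 10000 * 10;
            high = high % 10000 * 10 + 9;        // shift in a 9 at the low end
        }
    }
    for (; low > 0; low = low % 10000 * 10)      // flush remaining digits of low
        out += char('0' + low / 10000);
    std::printf("%s\n", out.c_str());            // prints 0.22752
}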
6.3.3 Underflow
Under special circumstances the high and low registers may converge on a value without the most significant digits ever matching. E.g. if high = 40004 and low = 39996, the calculated range is going to be only one digit long, which means that the output word will not have enough precision to be accurately encoded. After a few more iterations the registers may be permanently stuck at high = 40000, low = 39999. The range is now too small and any calculations will always return the same values.
To prevent this underflow from happening, the encoding algorithm has to be altered. If the most significant digits of high and low are one apart, a test must be performed to see if the second most significant digit is 0 in high and 9 in low. If this is the case, the risk of underflow is high and measures have to be taken: instead of shifting out the most significant digit, the second digit is deleted. An
underflow counter has to be incremented to signal that a 0 and a 9 have been thrown
away. The operation looks like this
Register     Before    After
High         40688     46889
Low          39232     32320
Underflow    0         1
After each recalculation, the underflow check has to be performed again. In case of underflow the procedure is repeated and the underflow counter is increased by one. When the most significant digits finally match, the matching digit is output first, followed by the ”underflow” digits that were previously discarded. The underflow digits will all be 0s or 9s, depending on whether high and low converged to the higher or the lower value.
6.3.4 Decoding
Just like in the encoding process, the precision is limited to 16-bit or 32-bit integer
registers. A high and a low register are used, and these correspond exactly to the
values the high and low registers held in the encoding procedure. An input buffer
for the code stream is also needed. By following the same procedure as the encoder,
the decoder knows when it is time to shift in new bits into the register. The same
underflow checks are performed as well.
6.3.5 Arithmetic vs. Huffman
The advantage of arithmetic coding shows when the probabilities vary more. Assume that the message ”QQQQQQQE” is about to be encoded, where the probability of ”Q” is 0.9 and the probability of ”E” is 0.1. The encoding process is listed in table 6.3.
New Symbol    Low Limit     High Limit
              0.00          1.00
Q             0.00          0.90
Q             0.00          0.81
Q             0.00          0.729
Q             0.00          0.6561
Q             0.00          0.59049
Q             0.00          0.531441
Q             0.00          0.4782969
E             0.43046721    0.4782969
Table 6.3: Coding scheme for the message ”QQQQQQQE”.
Any number within the final range, e.g. 0.45, will uniquely decode the message. Those two decimal digits take slightly less than 7 bits to specify, which means that eight symbols have been encoded in less than eight bits, compared to an optimal Huffman coded variant, which would need a minimum of 9 bits.
If the probability of a symbol is even higher, e.g. if the byte value ”0” has a probability of 16382/16383 and EOF¹ has 1/16383, then an input data stream with 100,000 ”0” bytes followed by an EOF could be encoded using only approximately 3-4 bytes: each ”0” byte costs log2(16383/16382) ≈ 0.000088 bits, so all 100,000 of them cost about 9 bits, and the EOF itself costs log2(16383) ≈ 14 bits. The Huffman approach would yield about 12,500 bytes.
¹ Short for ”End Of File”.
Chapter 7
Video Compression
The terminology used for still image compression needs to be extended for video compression. A still image is from now on referred to as a frame. Several frames make up a video sequence. The frame rate, measured in frames per second (fps), determines the video update frequency; typical values are 23.976 (film), 25 (PAL) and 30 (NTSC).
There are essentially three different frame-types in all types of video compression
methods.
• I-frames
• P-frames
• B-frames
I-frames are intra coded, i.e. they can be reconstructed without any reference
to other frames. This is the same as still image compression. Some compression
methods, like MJPEG (motion JPEG), use only I-frames, i.e. the video is encoded
as a series of independent still images. This is suitable for video editing systems but
not efficient from a compression point of view since frames based on previous frames
can significantly reduce the amount of data to encode.
The P-frames are forward predicted from the last I-frame or P-frame, i.e. it’s
impossible to fully reconstruct them without a previous P or I-frame. The B-frames
are both forward predicted and backward predicted from the last/next I-frame or P-frame, thus requiring two other frames to be reconstructed. P-frames and B-frames are often referred to as inter coded frames. The relationship between these frame types is illustrated in figure 7.1.

Figure 7.1: An I-frame followed by P and B-frames. The arrows indicate inter-frame relationships.
7.1 Predictive Frames
The predictive frames require information from other frames. When encoding a set of frames, the encoder measures e.g. the difference between the current frame and the previous frame. If the difference is small, the movement between the two consecutive frames is small or negligible; in this case, instead of encoding the current frame as an I-frame, it is more efficient to encode the difference. This difference, or error frame, can be encoded as a P or B-frame. If there is little movement between frames, the difference will not take much bitrate to encode, and P and B-frames will then successively improve the image quality. The encoder
has to decide when to code a new I-frame, usually when there is much movement between frames or when the time interval between I-frames reaches a certain limit.
Motion estimation and motion compensation are vital parts of video encoding. Their purpose is to detect motion between frames and compensate for it, to minimize the error frame. Motion compensation is applied to P and B-frames. The encoder estimates the motion and supplies motion vectors that predict the movements between the frames.
Figure 7.2: Two consecutive video frames.
Figure 7.3: Reconstruction of frame b with a motion vector and an error frame.
To illustrate the use of motion vectors and P-frames, imagine a rectangular shape as in figure 7.2 (a). In the next frame the rectangle has moved down to the right and rotated 5°, figure 7.2 (b). The encoder calculates the motion vectors that minimize the error frame, and encodes the vectors and the error frame as a P-frame. Figure 7.3 shows how the decoder then reconstructs frame b with the help of frame a and the information from the P-frame.
Thus, the reconstruction of the inter coded frames proceeds as follows:
1. Apply the motion vector to the referred frame
2. Add the prediction error compensation to the result from step 1.
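Per block, these two steps could be sketched as follows in C++. The function and parameter names are illustrative; clipping of the motion vector against the frame borders is omitted.

// Inter-frame reconstruction of one block: apply the motion vector, then add
// the decoded prediction error (sketch, no border handling).
#include <algorithm>
#include <cstdint>

void reconstructBlock(const uint8_t* ref, const int16_t* residual, uint8_t* dst,
                      int stride, int bs, int mvx, int mvy) {
    for (int y = 0; y < bs; ++y)
        for (int x = 0; x < bs; ++x) {
            int pred = ref[(y + mvy) * stride + (x + mvx)];  // step 1: apply MV
            int pix  = pred + residual[y * bs + x];          // step 2: add error
            dst[y * stride + x] = (uint8_t)std::clamp(pix, 0, 255);
        }
}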
7.2 Motion Estimation & Motion Compensation
There are three main approaches to motion estimation (ME) and motion compensation (MC). The first approach, Block Motion Estimation (BME) with Overlapped Block Motion Compensation (OBMC), works for both DCT and wavelet-based video compression. The other two approaches, 3-D SPIHT [5] and LBSME/MC [6], are specifically developed for wavelet-based compression and will not be discussed in this thesis, since BME/OBMC has proven to be the best combination regarding computational complexity and visual quality [7].
BME and OBMC is the approach used by the MPEG-1/2/4 standards. As mentioned earlier this method is efficient and has been refined over a long time. The idea is fairly straightforward. The frame is divided into a number of blocks; the most common sizes are 16x16 or 8x8 pixels. Each block has a motion vector, calculated by comparing the current block with a region of the previous frame. The encoder tries to find the region in the previous frame that best matches the block in the current frame. The region in the previous frame must lie within the motion vector range. In e.g. MPEG-1 the range is -63 to +64 pixels on both the x and y axes, but if the video resolution is low then perhaps -15 to +16 is sufficient. Figure 7.4 represents a previously decoded I or P-frame; the dark green area represents the range of a possible motion vector for the black block. The dark green
area measures 128x128 pixels and thus the motion vectors have a -63 to +64 pixel
range in x and y space. The white rectangle represents the current area that is to be
compared with the block from the next frame (the black rectangle).
Figure 7.4: The darkened green area represents the possible motion vector range of the black block. The white area moves within the range and is compared with the black one until a good motion vector that minimizes the error is found.
For each motion vector comparison, the current block is compared to the block given by the motion vector offset into the previous frame. If every motion vector is to be tested, an enormous number of operations is required. The differences between the pixels in the blocks are summed; for a 16x16 pixel block size every motion vector comparison amounts to 16 * 16 * 2 = 512 operations. If the motion vector range is -15 to +16 and all possible combinations of motion vectors are to be tested, then 32 * 32 * 16 * 16 * 2 = 524,288 subtractions and additions are performed for each block. For a standard video resolution of 320x256 pixels, this sums up to 524,288 * (320/16) * (256/16) = 167,772,160 subtractions and additions. This brute force approach to calculating the optimum motion vector can of course not be used in real-time, since it would take too much CPU time.
The problem with all block based motion estimation algorithms is to calculate an adequate motion vector within a given time period. There are numerous approaches, but only the brute force method can guarantee that the optimum vector is found (unless the error happens to be 0). Some methods have a specific search order and others use predictive motion vectors based on neighboring vectors. Usually a threshold value determines when an adequate motion vector is found. The brute force variant is sketched below.
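The following C++ sketch performs the brute force search described above, using a sum of absolute differences (SAD) as the block error over a -15 to +16 pixel window. Function and parameter names are illustrative, and border handling and early termination are omitted.

// Brute-force SAD block search over a -15..+16 pixel window (sketch).
#include <climits>
#include <cstdint>

struct MV { int x, y; };

MV bestMotionVector(const uint8_t* cur, const uint8_t* prev, int stride,
                    int bx, int by, int bs) {
    long best = LONG_MAX;
    MV mv{0, 0};
    for (int dy = -15; dy <= 16; ++dy)
        for (int dx = -15; dx <= 16; ++dx) {
            long sad = 0;
            for (int y = 0; y < bs; ++y)
                for (int x = 0; x < bs; ++x) {
                    int a = cur[(by + y) * stride + bx + x];
                    int b = prev[(by + dy + y) * stride + bx + dx + x];
                    sad += a > b ? a - b : b - a;           // |difference|
                }
            if (sad < best) { best = sad; mv = {dx, dy}; }  // keep the best match
        }
    return mv;
}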
Once the block motion vectors are calculated, they are encoded together with the error frame. The error frame is encoded as a still image with either DCT or wavelet-based compression. The difficulty with BME/OBMC video encoding amounts to deciding the I/P/B-frame structure and finding good motion vectors.
Chapter 8
JPEG 2000 Overview
8.1 Features
The JPEG 2000 standard supports both lossy and lossless compression of single-component (e.g. grayscale) and multi-component (e.g. color) imagery. In addition to basic compression functionality, the standard also supports the following features:
• Progressive recovery of an image by fidelity or resolution.
• Region of interest coding (ROI), whereby different parts of an image can be coded with different fidelity.
• Random access to particular regions of an image without needing to decode the
entire code stream.
• Good error resilience.
Due to its excellent coding performance and attractive features, JPEG 2000 has
a large potential application base.
Work on the JPEG 2000 standard started in March 1997. The purpose of having
a new standard was twofold. First, it would address a number of weaknesses in the
existing JPEG standard. Second, it would provide a number of new features not
available in the JPEG standard. Much of the success of the original JPEG standard can be attributed to its royalty-free nature. JPEG 2000, however, still struggles with some intellectual property issues that prevent it from being royalty-free. In the long run, these property issues may also hold back the standard.
8.2 JPEG 2000 Codec
The codec is based on wavelet/sub-band coding techniques. In order to facilitate both
lossy and lossless coding, reversible integer-to-integer and nonreversible real-to-real
transforms are employed (see chapter 3).
From the codec’s point of view, an image is comprised of one or more components.
Each component consists of a rectangular array of samples. The sample values of each
component are integer valued, and can be either signed or unsigned with a precision
of 1 to 38 bits/sample, specified on a per-component basis.
All of the components are associated with the same spatial extent in the source
image, but represent different spectral or auxiliary information. For example, an RGB
color image has three components (red, green and blue). In the case of a grayscale
image, there is only one component, corresponding to the luminance/intensity plane.
The various components of an image need not be sampled at the same resolution. For
example, when color images are represented in a luminance/intensity-chrominance
color space (see chapter 2.5), the luminance information is often more finely sampled
than the chrominance data.
8.2.1 Preprocessing
If samples are unsigned, a bias is subtracted so that the dynamic range is centered around 0, as in chapter 2.3. If the image is to be coded in a lossy manner, the data is also normalized (chapter 2.4).
Figure 8.1: The structure of the (a) encoding and the (b) decoding process of the JPEG 2000 standard.

Intercomponent transform, as it is called in the JPEG 2000 standard, now follows. Two intercomponent transforms are defined in the baseline JPEG 2000 codec: the
irreversible color transform (ICT) and the reversible color transform (RCT). The ICT has already been thoroughly explained in chapter 2.5. The RCT has transform coefficients that are powers of 2 and thus allows for an integer-to-integer transform. In lossy compression the ICT is used and in lossless the RCT. Both transforms map image data from the RGB to the YCbCr color space and operate on the first 3 components, with the assumption that these are the RGB color planes.
8.2.2 Wavelet Transform
In the standard this is referred to as an intracomponent transform. For lossless compression the LeGall 5/3 filter (appendix B.1) is used. The wavelet transform and an example of the LeGall 5/3 filter are covered in chapter 3. In the case of lossy compression the Daubechies 9/7 filter is used instead (appendix B.2). The number of resolution levels is a parameter for each transform; a typical number is 5, for a sufficiently large image. In the decoding procedure the appropriate inverse wavelet transform is applied.
8.2.3 Quantization
In the encoder, after the intracomponent transform, the resulting coefficients are
quantized. Quantization allows greater compression to be achieved, by representing
transform coefficients with only the minimal precision required to obtain the desired
level of image quality. Quantization is one of the two primary sources of information
loss in the coding path (the other being the discarding of coding passes).
Transform coefficients are quantized using scalar quantization with a dead-zone. A different quantizer is employed for the coefficients of each sub-band, and each quantizer has only one parameter, its step size. This type of quantization is described in chapter 4.1. In the case of lossless compression no quantization is applied.
8.2.4 Tier-1 & Tier-2 Coding
After quantization is performed in the encoder, tier-1 coding takes place. This is the
first of two coding stages. The quantizer indices for each sub-band are partitioned
into code blocks. Code blocks are rectangular in shape, and their nominal size is a
free parameter of the coding process, subject to certain constraints.
After a sub-band has been partitioned into code blocks, each block is independently coded using a context-based adaptive binary arithmetic coder, more specifically the MQ coder from the JBIG standard [8]. For each code block an embedded code is produced, comprised of numerous coding passes. It is the discarding of these coding passes that is the second source of information loss; thus no passes can be discarded in the lossless encoding path.
The bit-plane encoding process generates a sequence of symbols for each coding
pass. Some or all of these symbols may be entropy coded with the MQ coder. Although the bit-plane coding technique employed is similar to those used in EZW and
SPIHT coding, there are two notable differences:
1. Interband dependencies are not exploited.
2. Three coding passes per bit-plane instead of two.
This chapter will not go into any details about the coding; for more information see [9].
In the encoder, tier-1 encoding is followed by tier-2 encoding. The input to the tier-2 encoding process is the set of bit-plane coding passes generated during tier-1 encoding. In tier-2 encoding, the coding pass information is packaged into data units called packets, in a process referred to as packetization. The resulting packets are then output to the final code stream. The packetization process imposes a particular organization on coding pass data in the output code stream. This organization facilitates many of the desired codec features, including rate scalability and progressive recovery by fidelity or resolution.

A packet is nothing more than a collection of coding pass data. Each packet is comprised of two parts: a header and a body. The header indicates which coding passes are included in the packet, while the body contains the actual coding pass data itself. In the code stream, the header and body may appear together or separately, depending on the coding options in effect.

Rate scalability is achieved through (quality) layers. The coded data for each tile is organized into L layers, numbered from 0 to L - 1, where L ≥ 1. Each coding pass is either assigned to one of the L layers or discarded. The coding passes containing the most important data are included in the lower layers, while the coding passes associated with finer details are included in higher layers. During decoding, the reconstructed image quality improves incrementally with each successive layer processed. In the case of lossy compression, some coding passes may be discarded (i.e., not included in any layer), in which case rate control must decide which passes to include in the final code stream. In the lossless case, all coding passes must be included. Since some coding passes may be discarded, tier-2 coding is the second primary source of information loss in the coding path.
More information about JPEG 2000 can be found in the ISO standards [10] and
[9].
Chapter 9
MSC3K Overview
As mentioned in earlier chapters, MSC3K is a zero-tree type wavelet-based codec.
The codec was developed for real-time low-bandwidth video communication with the
following desired features:
• High compression
• Low computation complexity
• Minimum latency (real-time)
• Scalable quality
The name ”Mystery Science Codec 3000” (MSC3K) was inspired by the TV-series
”Mystery Science Theater 3000”.
9.1 The Codec
Figure 9.1: The structure of the (a) encoding and the (b) decoding process of still images in MSC3K.

In figure 9.1, the encoding and decoding structure of MSC3K is shown. This is the heart of the MSC3K codec. The preprocessing consists essentially of level offset (chapter 2.3), normalization (chapter 2.4) and the irreversible color transform (ICT) (chapter 2.5). Since MSC3K only allows lossy compression, the Daubechies 9/7 filter is used in the wavelet transform, see appendix B.2. The actual transform is implemented using lifting steps, as described in chapter 3, which is the most efficient way to implement a wavelet transform. The quantization method used is the uniform dead-zone quantization described in chapter 4.1. The quantization procedure is the first
element of information loss. The next is the zero-tree type compression. MSC3K
is a modified version of the SPIHT algorithm, chapter 5.2. The modifications that
MSC3K uses are described in chapter 5.3. The zero-tree encoding is controlled by the
rate variable, and will terminate once the desired rate is achieved. After the zero-tree
encoding a final bitstream is produced together with a small header that describes
size, quantization, color and video settings. The header is about 10 bytes long and is
added at the start of the bitstream.
The decoder, figure 9.1, takes the bitstream and rebuilds the wavelet tree coefficients in the zero-tree decoder. The effects of the quantization are then compensated, followed by the inverse wavelet transform. Last, the postprocessing undoes the effects of the preprocessing and produces a reconstructed image.
9.2 Color Compression
As mentioned earlier, the color image is first transformed to the YCbCr color space with the ICT. The color components, Cb and Cr, are then subsampled by 2 in both directions, so for a 512x512 pixel image both Cb and Cr are resized down to 256x256 pixels. This procedure is also used in JPEG and JPEG 2000. The eye is much less sensitive to color information, so the effects of the subsampling are not visible.
The color components are then arranged together with the intensity part (Y) to form a single image structure. This structure is then wavelet transformed, with some modifications to handle the borders between the components. The MSC3K zero-tree type compression encodes the coefficients as if they belonged to a single image rather than a three-component image. This way color information is mixed with intensity according to significance/power.
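The subsampling step itself could look like the following C++ sketch. The thesis does not specify the downsampling filter, so the 2x2 block averaging used here is an assumption; names are illustrative.

// Subsample a chroma plane by 2 in both directions (sketch; assumes averaging).
#include <cstdint>

void subsample2x2(const uint8_t* src, uint8_t* dst, int w, int h) {
    for (int y = 0; y < h / 2; ++y)
        for (int x = 0; x < w / 2; ++x) {
            int s = src[2*y*w + 2*x]     + src[2*y*w + 2*x + 1]
                  + src[(2*y+1)*w + 2*x] + src[(2*y+1)*w + 2*x + 1];
            dst[y * (w / 2) + x] = (uint8_t)((s + 2) / 4);   // rounded average
        }
}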
Figure 9.2: a) Original color image (512x512). b) Transformed into YCbCr color space and arranged for wavelet transformation. c) One step of the DWT.
9.3 Video Compression
MSC3K currently supports I and P-frames with a BME/OBMC type of motion estimation and motion compensation, see chapter 7. The encoder can either be forced to produce P-frames or be set to automatically decide when a P-frame can be inserted. The previously decoded frame is compared with the current frame. If the error between them is large, the current frame is encoded as an I-frame; otherwise the encoder goes into P-frame mode.
Figure 9.3: Search pattern for motion vectors in MSC3K.
In P-frame mode, the motion vectors are first calculated. This is done with the block motion estimation method. MSC3K supports 8x8 and 16x16 pixel blocks for the motion estimation. The motion vectors range from about -6 to +6 pixels in a diamond shaped pattern. In figure 9.3 this specific search pattern is demonstrated. The search starts with index 0, the red dot, and then proceeds with 1, 2, 3, etc. The motion vectors are stored as an index that indicates the specific offset. The diamond shaped search pattern has proven efficient, since movement is often concentrated around 0 on the x and y axes. The search stops when the error in the current search block is below a certain threshold.
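One way to generate such a search order is sketched below: candidate offsets within the diamond are visited in order of increasing distance from the origin, so small motions are tested first. This is only an illustration of the idea; the actual MSC3K index table in figure 9.3 is not reproduced here.

// Generate a diamond-shaped motion search order, nearest offsets first (sketch).
#include <algorithm>
#include <cstdlib>
#include <vector>

int main() {
    std::vector<std::pair<int, int>> order;
    for (int dy = -6; dy <= 6; ++dy)
        for (int dx = -6; dx <= 6; ++dx)
            if (std::abs(dx) + std::abs(dy) <= 6)       // diamond shape
                order.push_back({dx, dy});
    std::stable_sort(order.begin(), order.end(),
        [](auto a, auto b) {                            // nearest offsets first
            return std::abs(a.first) + std::abs(a.second)
                 < std::abs(b.first) + std::abs(b.second);
        });
    // order[0] is (0,0); since the search stops once the block error drops
    // below the threshold, later (larger) offsets are often never tested.
}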
In some cases it is better to skip the use of motion vectors and just encode the error frame. Motion vectors for 8x8 pixel blocks at 160x112 pixel resolution take about 300 bytes to store. Assume for example that the max bitrate is set to 1800 bytes; then the BME/OBMC improvement of the error frame must justify the use of these 300 extra bytes. For small movements the use of motion vectors is often not justified. MSC3K measures the improvement from the motion vectors and decides whether or not they will be included in the encoding.
The decoder always saves the previously decoded image, so in the case of a P-frame, the previous frame is motion compensated with the motion vectors and then the decoded error frame is added.
B-frames have not been implemented, since they would delay the video encoding by about 1-8 frames. For low-bitrate video the frame rate is often low, 4-8 fps, and in this case a delay of 8 frames would result in a time delay of 1-2 seconds.
9.4 Implementation
The MSC3K codec is developed in C/C++. The codec kernel can easily be adapted to fit any application. The video input system uses DirectX 9 to connect e.g. webcams or TV-cards, which is perfect for in-game applications since most modern PC games use DirectX.
Some GUI¹-based test programs have been written that use the MSC3K kernel, e.g. a video test program, see figure 11.6, and a still image compression frontend.

¹ Graphical user interface.
Chapter 10
Audio Compression
When communicating via the internet you need more than images to make a successful conversation; you also need sound. Our main focus for MSC3K is to use as little bandwidth as possible for image and sound. Therefore, when examining different techniques for audio compression, we chose to use a reference implementation of LPC. LPC produces adequate quality for speech at low bitrates. It is also free to use, without any license fees or restrictions.
10.1 History
In the second half of the 18th century the idea to construct a speech producing machine took form. One of the first attempts at mechanical speech generation in recorded history occurred around 1770 at the Imperial Academy of St. Petersburg. In response to the academy's challenge to explain the physiological differences between five vowels, Kratzenstein won the annual prize for modeling and constructing a series of acoustic resonators patterned after the human vocal tract. The speaking device, crude by today's standards, did, however, create the vowel sounds with vibrating reeds activated by air passing over them. By varying the acoustic resonators and effectively selecting the formant frequencies by hand, limited speech was generated by this mechanical device.
Unknown to Kratzenstein at the time, Wolfgang von Kempelen was working in
Figure 10.1: Kratzenstein's resonators for synthesis of vowel sounds. The resonators are actuated by blowing through a free, vibrating reed into the lower end.
parallel on a more elaborate speaking machine for generating connected speech. He had a different approach to solving the problem of synthesized speech: he identified the speech organs and tried to simulate them as well as possible. Kempelen's speaking machine used a bellows to supply air to a reed which, in turn, excited a single, hand-varied resonator for producing voiced sounds. Consonants, including nasals, were simulated by four separate constricted passages, controlled by the fingers of the other hand. This made him the first to build a machine that could produce whole sentences. But since the properties of the vocal cords could not be changed while the machine produced sound, it sounded monotone. An improved version of the machine was built from von Kempelen's description by Sir Charles Wheatstone.
Figure 10.2: Wheatstone's improved construction of von Kempelen's speaking machine.
In 1835, after a period without any significant progress, a man named Joseph Faber succeeded in adding a tongue and a pharyngeal cavity to von Kempelen's model, both of which could be controlled during speech production. He continued his work and succeeded in building a machine that could sing the anthem ”God Save the Queen”.
The evolution of the speech machines then stood still until R. R. Riesz in 1937 demonstrated his mechanical talker which, like the other mechanical devices, was more reminiscent of a musical instrument. The device was shaped like the human vocal tract and constructed primarily of rubber and metal, with playing keys similar to those found on a trumpet. The mechanical talking device produced fairly good speech with a trained operator. With the ten control keys (or valves) operated simultaneously with two hands, the device could produce relatively articulate speech. Riesz had, through his use of the ten keys, allowed for control of almost every movable portion of the human vocal tract. Reports from that time stated that its most articulate speech was produced as it said the word ”cigarette”.
Figure 10.3: Air under pressure is brought from a reservoir at the right. Two valves, V1 and V2, control the flow. Valve V1 admits air to a chamber L1 in which a reed is fixed. The reed vibrates and interrupts the air flow much like the vocal cords. A spring-loaded slider varies the effective length of the reed and changes its fundamental frequency. Unvoiced sounds are produced by admitting air through valve V2. The configuration of the vocal tract is varied by means of nine movable members representing the lips (1 and 2), teeth (3 and 4), tongue (5, 6 and 7), pharynx (8), and velar coupling (9).
To simplify the control, Riesz also constructed the mechanical talker with finger
keys to control the configuration.
10.2 Speech Synthesis
Speech is produced by a cooperation of the lungs, the glottis (with vocal cords) and the articulation tract (mouth and nose cavity). Figure 10.4 shows a cross section of the human speech organ. For the production of voiced sounds, the lungs press air through the epiglottis and the vocal cords vibrate; they interrupt the air stream and produce a quasi-periodic pressure wave.
Figure 10.4: Cross section of the human speech organ.
The pressure impulses are commonly called pitch impulses, and the frequency of the pressure signal is the pitch frequency or fundamental frequency. Figure 10.5 shows a typical impulse sequence (sound pressure function) produced by the vocal cords for a voiced sound. It is the part of the voice signal that defines the speech melody. When we speak with a constant pitch frequency, the speech sounds monotonous, but in normal speech the frequency changes continuously.
Figure 10.5: Typical pitch impulse sequence produced by the vocal cords for a voiced sound.
The pitch impulses stimulate the air in the mouth and, for certain sounds (nasals), also in the nasal cavity. When the cavities resonate, they radiate a sound wave, which is the speech signal. Both cavities act as resonators with characteristic resonance frequencies, called formant frequencies. The fundamental frequency F0 and the fundamental period N0 (N0 = Fs*T0, where T0 = 1/F0) will be used later. Since the mouth cavity can be changed greatly, we are able to pronounce very many different sounds. In the case of unvoiced sounds, the excitation of the vocal tract is more noise-like.
10.3 Basic Speech Model
In order to better understand and analyse speech production, we are interested in creating a model of the speech production system which is amenable to further analysis based on theory and techniques from physics. A simple idea is to simulate the vocal tract by a straight tube through which air is blown. It has been found that a tube with a curvature does not sound much different from one without. Furthermore, we assume that the tube is lossless. This means that we do not model energy loss due to friction between the flowing air and the tract walls, or vibration of the walls.
10.4 Speech Production by LPC
In the basic speech model the lungs are replaced by a DC source, the vocal cords
by an impulse generator and the articulation tract by a linear filter system. A noise
Figure 10.6: Visualization of the lossless tube model, showing the cross-sectional area A(x) of cylinder x.
generator produces the unvoiced excitation. In practice, all sounds have a mixed excitation, which means that the excitation consists of voiced and unvoiced portions. Of course, the relation between these portions varies strongly with the sound being generated. In this model, the portions are adjusted by two potentiometers. See figure 10.7a for an illustration.
Based on this model, a further simplification can be made (figure 10.7b). Instead of the two potentiometers it uses a 'hard' switch which only selects between voiced and unvoiced excitation. The filter, representing the articulation tract, is a simple recursive digital filter; its resonance behavior (frequency response) is defined by a set of filter coefficients. Since the computation of the coefficients is based on the mathematical optimization procedure of Linear Prediction Coding, they are called Linear Prediction Coding coefficients or LPC coefficients, and the complete model is the so-called LPC Vocoder (Vocoder is a concatenation of the terms 'voice' and 'coding'). In practice, the LPC Vocoder is used for speech telephony. Its great advantage is the very low bit rate needed for speech transmission (about 3 kbit/s)
compared to PCM (64 kbit/s).
Figure 10.7: (a) The human speech production. (b) Speech production by a machine.

A great advantage of the LPC vocoder is its manipulation facilities and its close analogy to human speech production. Since the main parameters of the
speech production, namely the pitch and the articulation characteristics, expressed
by the LPC coefficients, are directly accessible, the audible voice characteristics can
be widely influenced. For example, the transformation of a male voice into the voice
of a female or a child is very easy. Also the number of filter coefficients can be varied
to influence the sound characteristics, above all, the formant characteristics.
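Since the thesis uses an existing LPC reference implementation, the following is only a textbook sketch of how the LPC coefficients could be computed from a speech frame, via the autocorrelation and the Levinson-Durbin recursion; all names are illustrative assumptions.

// Compute LPC coefficients from one speech frame (textbook sketch).
#include <vector>

std::vector<double> lpc(const std::vector<double>& frame, int order) {
    int n = (int)frame.size();
    std::vector<double> r(order + 1, 0.0);
    for (int k = 0; k <= order; ++k)               // autocorrelation r[k]
        for (int i = 0; i < n - k; ++i)
            r[k] += frame[i] * frame[i + k];

    std::vector<double> a(order + 1, 0.0), tmp(order + 1);
    double err = r[0];                             // assumed non-zero frame
    for (int m = 1; m <= order; ++m) {             // Levinson-Durbin recursion
        double acc = r[m];
        for (int i = 1; i < m; ++i) acc -= a[i] * r[m - i];
        double k = acc / err;                      // reflection coefficient
        tmp = a;
        a[m] = k;
        for (int i = 1; i < m; ++i) a[i] = tmp[i] - k * tmp[m - i];
        err *= 1.0 - k * k;                        // prediction error update
    }
    return a;                                      // a[1..order] are the coefficients
}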
Figure 10.8: LPC
Chapter 11
Results
”A picture says more than a thousand words” is a statement that is often true, and in this chapter the proper statement would be ”A picture says more than any test values”. A mathematically calculated value that measures quality can still be useful for quick comparison with other results, though it might not tell the whole truth.
11.1 Image Distortion Measure
Judging the visual quality of an encoded image compared to the original is not always
easy. A single number, representing the overall total distortion in proportion to the
peak signal, is often used to compare and describe the visual quality. The Peak Signal to Noise Ratio, or PSNR, has the formula
    MSE = (1 / (N1 N2)) * sum_{n1=0}^{N1-1} sum_{n2=0}^{N2-1} (x[n1,n2] - x̂[n1,n2])^2    (11.1)

    PSNR = 10 log10( (2^B - 1)^2 / MSE )                                                   (11.2)
where MSE is the Mean Square Error of the original image x[n1,n2] compared to the reconstructed image x̂[n1,n2], and B is the number of bits per sample. The PSNR is expressed in dB (decibels). Good reconstructed images typically have PSNR values of 30 dB or more.
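The two equations translate directly into C++; the sketch below assumes 8-bit samples (B = 8), and the function name is illustrative.

// PSNR of a reconstructed 8-bit image, per equations (11.1) and (11.2).
#include <cmath>
#include <cstdint>

double psnr(const uint8_t* x, const uint8_t* xhat, int n1, int n2) {
    double mse = 0.0;
    for (int i = 0; i < n1 * n2; ++i) {
        double d = (double)x[i] - xhat[i];
        mse += d * d;                            // accumulate squared error
    }
    mse /= (double)n1 * n2;                      // equation (11.1)
    return 10.0 * std::log10(255.0 * 255.0 / mse);   // equation (11.2), B = 8
}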
This measurement is of course not fully accurate for true visual quality. E.g. some areas in the reconstructed image may be well preserved while the background is distorted, giving the image the same PSNR as an image that is evenly distorted. No one can really say that one image is better than the other, because it is a matter of opinion; the true answer lies in the eye of the beholder.
11.2 Still Image Compression Comparison

11.2.1 Test Setup
Three different compression methods will be tested.
• The original JPEG standard. JPEG is the candidate that represents the traditional DCT based technique.
• The JPEG 2000 standard, entropy based wavelet compression.
• MSC3K, zero-tree based wavelet compression.
BoxTop ProJPEG v5.2, http://www.boxtopsoft.com/projpeg.html, was used to compress standard JPEG images using optimized Huffman tables to achieve the highest possible quality.
The Jasper reference implementation of JPEG 2000 was used to compress to the JPEG 2000 standard [11].
The images used in this test were all found on the internet; some of them are not unfamiliar in compression comparisons, e.g. the Lena image, as well as some Kodak pictures and pictures from the motion picture ”Lord of the Rings”. All original images and all the reconstructed images can be found at http://msc3k.sytes.net. All images will be compressed using lossy compression.
Name         Size       Notes
Lena         512x512    Gray image.
Blue Rose    256x256    Color. Cropped from original.
Gollum       512x512    Color. Cropped from a film frame.
Wraiths      256x256    Color. Cropped & resized from original.
Child        256x256    Color. Cropped & resized from original.
Table 11.1: Specifications of the original images.
Lena, Ratio 1:100
Method       PSNR (dB)
MSC3K        28.30
JPEG 2000    28.35
JPEG         26.27
Table 11.2: Results from Lena.
11.2.2 Lena
As shown in table 11.2, JPEG 2000 got a slightly better PSNR value than MSC3K. This is a surprisingly good result for a zero-tree type compression that does not use arithmetic coding. JPEG is trailing behind; in this test JPEG could actually not be compressed more than 1:90. For more information about Lena, see [12].
11.2.3 Blue Rose
As shown in table 11.3, MSC3K got the best Y PSNR value in both the 1:100 and the 1:200 case. In the 1:200 case MSC3K also got the best total PSNR value.
11.2.4 Gollum
As shown in table 11.4, JPEG 2000 got the best test values in this case. The Cr value is +2.72 dB better than MSC3K; a possible explanation for this is that MSC3K
Blue Rose, Ratio 1:100
Method       Y PSNR (dB)    Cb PSNR (dB)    Cr PSNR (dB)    Tot PSNR (dB)
MSC3K        33.93          36.71           39.78           36.80
JPEG 2000    33.50          39.01           40.75           37.75
JPEG         32.89          37.50           35.77           35.39

Blue Rose, Ratio 1:200
Method       Y PSNR (dB)    Cb PSNR (dB)    Cr PSNR (dB)    Tot PSNR (dB)
MSC3K        32.10          34.00           36.89           34.33
JPEG 2000    30.35          34.29           36.45           33.70
JPEG         28.61          30.47           31.84           30.31
Table 11.3: Results from Blue Rose.
Gollum, Ratio 1:100
Method       Y PSNR (dB)    Cb PSNR (dB)    Cr PSNR (dB)    Tot PSNR (dB)
MSC3K        22.77          32.40           30.35           28.51
JPEG 2000    22.94          33.36           31.23           29.18
JPEG         22.21          30.73           29.01           27.32
Table 11.4: Results from Gollum.
doesn't range the color sub-bands, thus making some color information appear less important than it really is.
11.2.5 Wraiths
As shown in table 11.5, MSC3K got the better Cb PSNR value in the 1:50 case, while JPEG 2000 got a slightly better total value. In the 1:100 compression case, we again see that MSC3K got a slightly better Y PSNR value and that JPEG 2000 got the best total value, closely followed by MSC3K.
Wraiths, Ratio 1:50
Method       Y PSNR (dB)    Cb PSNR (dB)    Cr PSNR (dB)    Tot PSNR (dB)
MSC3K        22.94          33.81           29.00           28.58
JPEG 2000    23.10          33.56           30.05           28.90
JPEG         22.67          33.22           29.18           28.36

Wraiths, Ratio 1:100
Method       Y PSNR (dB)    Cb PSNR (dB)    Cr PSNR (dB)    Tot PSNR (dB)
MSC3K        20.48          32.96           28.12           27.19
JPEG 2000    20.47          33.15           28.14           27.26
JPEG         19.32          29.97           27.02           25.43
Table 11.5: Results from Wraiths.
11.2.6 Child
As shown in table 11.6, MSC3K yields the best result in all cases when compressing the image called ”Child”.
11.2.7 MSC3K Performance
A special test program was constructed to correctly measure the CPU and memory usage of MSC3K. The performance data was collected with Intel VTune Performance Analyzer v7.0. The test was to load a color image, 512x512 pixels, and then compress it to a bitstream using the MSC3K compression scheme at ratio 1:150. The test was repeated 100 times and the results averaged. The test machine was a Pentium III 900 MHz.
Memory usage was 1210 kB including the test program and the MSC3K kernel. This is not very high considering that the 768 kB input image is first read into temporary buffers and encoded from there. The memory needed is approximately 150% of the input image size.
The total process took 0.29 seconds. The most time-consuming task is the wavelet transformation (DWT), see table 11.7, followed by the loading of the input image.
Child, Ratio 1:200
Method       Y PSNR (dB)    Cb PSNR (dB)    Cr PSNR (dB)    Tot PSNR (dB)
MSC3K        26.55          31.47           33.17           30.40
JPEG 2000    25.82          31.03           32.90           29.91

Child, Ratio 1:250
Method       Y PSNR (dB)    Cb PSNR (dB)    Cr PSNR (dB)    Tot PSNR (dB)
MSC3K        25.95          31.01           33.11           30.03
JPEG 2000    24.74          29.98           32.42           29.04

Child, Ratio 1:300
Method       Y PSNR (dB)    Cb PSNR (dB)    Cr PSNR (dB)    Tot PSNR (dB)
MSC3K        25.49          30.27           32.56           29.44
JPEG 2000    23.79          29.86           31.33           28.32
Table 11.6: Results from Child.
Encoding takes 0.20 seconds if the image is already loaded. Surprisingly, the zero-tree type compression takes little or no time at all in comparison. The ”other” section includes kernel calls, allocations, bitstream handling and more.
Process                  Percentage
DWT                      58%
Loading image            32%
Quantization             6%
Zero-Tree compression    0.1%
Other                    4%
Table 11.7: Percentage of CPU-time spent in different encoding steps.
11.3 Video Compression Results
The video results of MSC3K are purely empirical. The codec runs smoothly in B&W at 320x240 pixels and 30 fps on a P-III 900, with all features enabled and with both encoding and decoding running at the same time. The actual CPU usage is hard to measure, but approximately 60% of the total CPU power is used at 30 fps; many parts can still be optimized and written in assembler to further improve results.
The quality of the video is fully scalable and MSC3K produces fully functional video even under 5 kb/s. The delay in the codec is insignificant compared to the internal delay in the webcams and their capture drivers. When the video input reaches the MSC3K codec it is already delayed about 100 ms. The compressed output is then further delayed by about 20-25 ms in the MSC3K codec. Figure 11.6 shows the MSC3K codec in full action.
Figure 11.1: A) The original gray image of Lena. B) Compressed with MSC3K. Ratio
1:100. C) Compressed with JPEG 2000. Ratio 1:100. D) Compressed with ProJPEG.
Ratio 1:90.
Figure 11.2: A) The original color image Blue Rose. B) Compressed with MSC3K, ratio 1:100. C) Compressed with JPEG 2000, ratio 1:100. D) Compressed with ProJPEG, ratio 1:100. E) Compressed with MSC3K, ratio 1:200. F) Compressed with JPEG 2000, ratio 1:200. G) Compressed with ProJPEG, ratio 1:200.
Figure 11.3: A) The original color image of Gollum. B) Compressed with MSC3K, ratio 1:100. C) Compressed with JPEG 2000, ratio 1:100. D) Compressed with ProJPEG, ratio 1:100.
Figure 11.4: A) The original color image Wraiths. B) Compressed with MSC3K, ratio 1:50. C) Compressed with JPEG 2000, ratio 1:50. D) Compressed with ProJPEG, ratio 1:50. E) Compressed with MSC3K, ratio 1:100. F) Compressed with JPEG 2000, ratio 1:100. G) Compressed with ProJPEG, ratio 1:100.
Figure 11.5: A) The original color image Child. B) Compressed with MSC3K, ratio 1:200. C) Compressed with JPEG 2000, ratio 1:200. D) Compressed with MSC3K, ratio 1:250. E) Compressed with JPEG 2000, ratio 1:250. F) Compressed with MSC3K, ratio 1:300. G) Compressed with JPEG 2000, ratio 1:300.
Figure 11.6: Test program for the MSC3K video codec. To the left is the input signal
and to the right is the coded output.
Chapter 12
Conclusions
Results from chapter 11 show that all the desired features set out before developing MSC3K (chapter 9.1) have been fulfilled: high compression, low computational complexity, minimum latency and scalable quality. The scalable quality comes from the fact that MSC3K is a zero-tree based compression algorithm; this is discussed further in chapter 12.2.
12.1 Performance
The temporal noise filter described in chapter 2.2 has been tested with webcams and is implemented as an optional feature in the preprocessing stage of MSC3K. The solution is very computationally efficient compared to spatial filtering, and empirical studies have shown it to be sufficient in terms of noise reduction.
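The exact filter is the one defined in chapter 2.2. Purely to illustrate why temporal filtering is so cheap, a generic first-order recursive smoother, not necessarily the thesis' filter, costs one multiply-add per pixel per frame:

```python
import numpy as np

def temporal_smooth(frame, prev_filtered, alpha=0.25):
    # Generic first-order recursive (IIR) temporal filter: blend the new
    # frame with the previously filtered one. One multiply-add per pixel,
    # versus a whole neighbourhood per pixel for a spatial filter.
    # `alpha` is a hypothetical smoothing parameter, not a thesis value.
    return alpha * frame + (1.0 - alpha) * prev_filtered
```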
When measuring the time it takes to decompress smaller images, MSC3K and JPEG 2000 perform much the same. For larger images, 512x512 pixels and up, JPEG 2000 is faster. This is because JPEG 2000 mostly works with localized memory, while zero-tree based compression uses coefficients from scattered locations in the image, leading to ineffective data-cache usage.

When it comes to compressing images, MSC3K is faster than JPEG 2000. This is because zero-tree based compression algorithms can be terminated as soon as the desired output buffer is filled; at low bitrates the output buffer is smaller and thus filled more quickly, so compression takes less time. The DWT is significantly more time-consuming than the zero-tree compression, see chapter 11.2.7.
Overall, MSC3K is faster and far less complex than the JPEG 2000 standard, although it exhibits more random memory access patterns. As shown in chapter 11, MSC3K performs well compared to JPEG 2000, especially at low bitrates. Looking closely at the images, one can see that MSC3K preserves the overall details somewhat better than JPEG 2000, even though the PSNR values are much the same.
12.2 Possible Application Areas
Apart from online gaming communication, MSC3K video compression should also be suitable for video conferences and real-time video communication in cellular phones, and its more efficient compression could increase the effective storage capacity of digital cameras, which often use DCT-based compression techniques. Since MSC3K is better than JPEG 2000 at progressive recovery of an image, it should also be very attractive for client-server based video streaming applications.
For example, consider a scenario where many clients stream video from a single server and require different video quality due to their different bandwidth capacities. MSC3K should, with relative ease, be able to provide video streams of different quality from a single "original" encoded video stream, without having to re-encode the stream for each quality/bandwidth setting. The server simply cuts out different portions of the encoded bit-stream to serve the clients' different bandwidth needs, see figure 12.1.

Figure 12.1: Several reconstructed images from one "original" MSC3K encoded bit-stream. a) Original image, 192 kB (256x256 pixels in 24-bit color). b) 9.6 kB transferred and decoded (ratio 1:20). c) 6.4 kB transferred and decoded (ratio 1:30). d) 4.8 kB transferred and decoded (ratio 1:40).

In MSC3K this portioning can be done at the bit level, whereas in JPEG 2000 it can only be done at the block/layer level. This poses a further problem in JPEG 2000, since color information and intensity information are split into separate code-blocks, and it is unclear which code-blocks should be thrown away and which are best kept. With MSC3K, all information is encoded according to rank, and since the bit-stream of an MSC3K encoded image can be cut anywhere, this problem is eliminated.
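A sketch of the server-side portioning under these assumptions (the names and the per-frame budget heuristic are hypothetical; the point is only that each client receives a prefix of the same embedded bit-stream, sized to its bandwidth):

```python
def portion_stream(encoded_frame, clients, fps=30):
    # One embedded bit-stream, many qualities: each client gets the first
    # `budget` bytes of the frame, and its decoder reconstructs a
    # correspondingly coarser image from whatever prefix it receives.
    for name, bytes_per_second in clients.items():
        budget = bytes_per_second // fps
        yield name, encoded_frame[:budget]

# Hypothetical usage:
# for client, chunk in portion_stream(frame_bits, {"dsl": 16000, "modem": 4000}):
#     send(client, chunk)
```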
12.3 Future Works
A more advanced quantization method, better suited for zero-tree compression, could improve image restoration quality.

Preprocessing measures could also be taken, e.g. filtering the image with a smart blur filter before compressing it. A smart blur filter preserves edges but reduces high-frequency information below a certain threshold. In this way, high-frequency information confined to small spatial regions can be reduced, which improves the conditions for zero-tree type compression. The filtering strength of the smart filter should be proportional to the compression rate (more blurring for higher rates).
Some bits in the zero-tree representation have proven to be compressible. SPIHT uses an arithmetic encoder to compress these bits, but an optimized Huffman tree should suffice. Compressing these bits would improve the overall compression efficiency.
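As an illustration of the suggested alternative, a minimal sketch of building a static Huffman code from collected symbol statistics (how the zero-tree bits would be grouped into symbols is an assumption left open here; symbols are assumed to be hashable scalars):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    # Build a prefix code from symbol frequencies using the classic
    # heap-based Huffman construction.
    freq = Counter(symbols)
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))
        next_id += 1
    codes = {}
    def assign(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                                # leaf: an actual symbol
            codes[node] = prefix or "0"      # single-symbol edge case
    assign(heap[0][2], "")
    return codes

# Example: huffman_code("aaaabbc") gives "a" the shortest code word.
```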
Memory usage can be improved by reusing some internal buffers in the encoding scheme. CPU usage can also be improved by rewriting critical sections of the code in assembly language.
12.3.1 The Future of MSC3K
The implementation so far is for educational purposes only, since there are intellectual property issues surrounding the SPIHT algorithm. MSC3K cannot go commercial unless an agreement is first reached with the owners of the SPIHT algorithm. This is nothing unusual; almost every wavelet compression technique is protected by various patents. This is also a big problem for the JPEG 2000 standard: if JPEG 2000 is to become a true standard on the Internet, all intellectual property issues must first be resolved.

Because of the issues described above, the final MSC3K product will most likely not use the SPIHT algorithm at all. Alternative methods are being developed, but this thesis will not discuss them. Whatever method is eventually used, the compression results are expected to equal, or in some cases surpass, those of SPIHT.
Appendix A
The Euclidean Algorithm
Take two Laurent polynomials $a(z)$ and $b(z) \neq 0$ with $|a(z)| \geq |b(z)|$. Let $a_0(z) = a(z)$ and $b_0(z) = b(z)$ and iterate the following steps, starting from $i = 0$:

$$a_{i+1}(z) = b_i(z), \quad b_{i+1}(z) = a_i(z) \bmod b_i(z), \quad q_{i+1}(z) = a_i(z) / b_i(z) \qquad (A.1)$$

Then $a_n(z) = \gcd(a(z), b(z))$, where $n$ is the smallest number for which $b_n(z) = 0$. If $\gcd(a(z), b(z)) = 1$, then $a(z)$ and $b(z)$ have no common divisor except for 1 and they are relatively prime. The iteration procedure can be written as
$$\begin{pmatrix} a(z) \\ b(z) \end{pmatrix} = \prod_{i=1}^{n} \begin{pmatrix} q_i(z) & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} a_n(z) \\ 0 \end{pmatrix} \qquad (A.2)$$
Iterating once yields

$$a(z) = q_1(z) a_1(z) + b_1(z), \quad b(z) = a_1(z) \qquad (A.3)$$
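As an illustration of iteration (A.1), a minimal sketch for ordinary polynomials using NumPy coefficient arrays (for Laurent polynomials the quotient of each division step is not unique, which this sketch does not model):

```python
import numpy as np

def poly_gcd(a, b, tol=1e-9):
    # Euclidean algorithm on coefficient arrays (highest degree first),
    # following (A.1): repeatedly replace (a, b) by (b, a mod b).
    a = np.trim_zeros(np.asarray(a, dtype=float), 'f')
    b = np.trim_zeros(np.asarray(b, dtype=float), 'f')
    while b.size and np.max(np.abs(b)) > tol:
        _, r = np.polydiv(a, b)        # a(z) = q(z) b(z) + r(z)
        r[np.abs(r) < tol] = 0.0       # suppress floating-point dust
        a, b = b, np.trim_zeros(r, 'f')
    return a                           # gcd up to a scalar factor

# Example: gcd of (z-1)(z-2) and (z-1)(z-3) is proportional to z - 1:
# poly_gcd([1, -3, 2], [1, -4, 3])  ->  array([ 1., -1.])
```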
Appendix B
Common Wavelet Transforms
B.1 LeGall 5/3

B.1.1 Direct Form Coefficients

i     ha      ga      hs      gs
0     3/4     1       1       -3/4
±1    1/4     -1/2    1/2     -1/4
±2    -1/8                    1/8

B.1.2 Lifting Coefficients

Lifting Step    Analysis
1 (Dual)        -(1/2)z^-1 + 1 - (1/2)z
2 (Primal)      (1/4)z^-1 + 1 + (1/4)z

Lifting Step    Synthesis
1 (Primal)      -(1/4)z^-1 + 1 - (1/4)z
2 (Dual)        (1/2)z^-1 + 1 + (1/2)z
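The analysis lifting steps above translate directly into code. A minimal sketch of one decomposition level and its inverse, assuming an even-length floating-point signal and simple boundary mirroring:

```python
import numpy as np

def legall53_analysis(x):
    # Dual step:   d(n) = x(2n+1) - (x(2n) + x(2n+2)) / 2
    # Primal step: s(n) = x(2n)   + (d(n-1) + d(n)) / 4
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    even_next = np.append(even[1:], even[-1])   # mirror at the right edge
    d = odd - 0.5 * (even + even_next)          # high-pass coefficients
    d_prev = np.insert(d[:-1], 0, d[0])         # mirror at the left edge
    s = even + 0.25 * (d_prev + d)              # low-pass coefficients
    return s, d

def legall53_synthesis(s, d):
    # Invert by running the lifting steps backwards with opposite signs.
    d_prev = np.insert(d[:-1], 0, d[0])
    even = s - 0.25 * (d_prev + d)
    even_next = np.append(even[1:], even[-1])
    odd = d + 0.5 * (even + even_next)
    x = np.empty(2 * len(s))
    x[0::2], x[1::2] = even, odd
    return x
```

Perfect reconstruction follows because each lifting step is trivially invertible: the synthesis subtracts exactly what the analysis added, in reverse order.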
B.2 Daubechies 9/7

B.2.1 Direct Form Coefficients

i     ha          ga          hs          gs
0     0.602949    -1.115087   1.115087    0.602949
±1    0.266864    0.591272    0.591272    -0.266864
±2    -0.078223   0.057543    -0.057543   -0.078223
±3    -0.016864   -0.091276   -0.091276   0.016864
±4    0.026749                            0.026749

B.2.2 Lifting Coefficients

Lifting Step    Analysis
1 (Primal)      -1.586134 z^-1 + 1 - 1.586134 z
2 (Dual)        -0.052980 z^-1 + 1 - 0.052980 z
3 (Primal)      0.882911 z^-1 + 1 + 0.882911 z
4 (Dual)        0.443507 z^-1 + 1 + 0.443507 z
Normalization   Lowpass: 0.812983, Highpass: -1.230174

Lifting Step    Synthesis
1 (Dual)        -0.443507 z^-1 + 1 - 0.443507 z
2 (Primal)      -0.882911 z^-1 + 1 - 0.882911 z
3 (Dual)        0.052980 z^-1 + 1 + 0.052980 z
4 (Primal)      1.586134 z^-1 + 1 + 1.586134 z
Normalization   Lowpass: 1.230174, Highpass: -0.812983
Figure B.1: Lifting implementation of the LeGall 5/3 filter, even (left) and odd (right).
x(n) are input samples, d(n) are the high-pass coefficients and s(n) are the low-pass
coefficients.
Bibliography

[1] I. Daubechies and W. Sweldens. Factoring wavelet transforms into lifting steps. J. Fourier Anal. Appl., 4(3):245-267, 1998.

[2] J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Trans. on Signal Processing, 41(12):3445-3462, 1993.

[3] A. Said and W. A. Pearlman. A new, fast, and efficient image codec based on set partitioning in hierarchical trees. IEEE Trans. on Circuits and Systems for Video Technology, 6:243-250, 1996.

[4] P. G. Howard and J. Vitter. Analysis of arithmetic coding for data compression. Information Processing and Management, 28(6):749-763, 1992.

[5] B.-J. Kim, Z. Xiong, and W. A. Pearlman. Multirate 3-D set partitioning in hierarchical trees (3-D SPIHT). IEEE Trans. on Circuits and Systems for Video Technology, 10(8), 2000.

[6] H.-W. Park and H.-S. Kim. Motion estimation using low-band-shift method for wavelet-based moving-picture coding. IEEE Trans. on Image Processing, 9(4), 2000.

[7] M. F. Fu, O. C. Au, and W. C. Chan. Novel motion compensation for wavelet video coding using overlapping. IEEE Trans. on Acoustics, Speech, and Signal Processing, 4:3421-3424, 2002.

[8] ISO/IEC 14492-1. Lossy/lossless coding of bi-level images. 2000.

[9] ISO/IEC 14495-1. Lossless and near-lossless compression of continuous-tone still images: baseline. 2000.

[10] ISO/IEC 15444-1. Information technology: JPEG 2000 image coding system, Part 1: Core coding system. 2000.

[11] M. Adams. The JasPer project home page. http://www.ece.uvic.ca/~mdadams/jasper/.

[12] The Lenna story. http://www.lenna.org.