Chapter 13

VIDEO COMPRESSION STANDARDS
Digital video communication is a complex and compute-intensive process that enables many people to receive video signals from different sources. There are mainly three classes of devices that we use for digital video communications today:

- Digital television sets or set-top boxes are mainly designed to receive video signals from different content providers. These devices rely on one fixed video decoding algorithm that is implemented in hardware or in a combination of hardware and a programmable Reduced Instruction Set Computer (RISC) processor or Digital Signal Processor (DSP). Currently, there are no means provided for uploading a new algorithm once the hardware is deployed to the customer.
- Videophones are usually implemented on DSPs with hardware acceleration for some compute-intensive parts of the video coding and decoding algorithm, such as the DCT and motion estimation. Usually, the set of algorithms used in a particular videophone cannot be replaced.
- Personal Computers are the most flexible and most expensive platform for digital video communication. While a PC with a high-end Pentium III processor is able to decode DVDs, the software is usually preinstalled with the operating system in order to avoid hardware and driver problems. Video decoders for streaming video may be updated using automatic software download and installation, as done by commercial software such as RealPlayer, Windows Media Player, Apple QuickTime and Microsoft NetMeeting.
Digital video communication standards were mainly developed for digital television and video phones in order to enable industry to provide consumers with a bandwidth-efficient terminal at an affordable price. We describe the standards organizations, the meaning of compatibility and applications for video coding standards in Section 13.1. We begin the description of actual standards with the ITU video coding standards H.261 and H.263 (Sec. 13.2) for interactive video communications.
In Section 13.3 we present the standards H.323 and H.324 that define multimedia terminals for audiovisual communications. Within ISO, the Moving Picture Experts Group (MPEG) defined MPEG-1 (Sec. 13.4) and MPEG-2 (Sec. 13.5) for entertainment and digital TV. MPEG-4 (Sec. 13.6) is the first international standard that standardizes not only audio and video communications but also graphics for use in entertainment and interactive multimedia services. All standards describe the syntax and semantics of a bitstream. Section 13.7 presents an overview of the organization of a bitstream as used by H.261, H.263, MPEG-1, MPEG-2, and MPEG-4. Finally, we give a brief description of the ongoing MPEG-7 standardization activity (Sec. 13.8), which intends to standardize the interface for describing the content of an audiovisual document.
13.1 Standardization
Developing an International Standard requires collaboration between many partners from different countries, and an organization that is able to support the standardization process as well as to enforce the standards. In Section 13.1.1 we describe organizations like ITU and ISO. In Section 13.1.2, the meaning of compatibility is defined. Section 13.1.3 briefly describes the workings of a standardization body. In Section 13.1.4, applications for video communications are listed.
13.1.1 Standards Organizations
Standards are required if we want multiple terminals from different vendors to exchange information or to receive information from a common source like a TV broadcast station. Standardization organizations have their roots in the telecom industry, which created the ITU, and in trade, which created ISO.
ITU
The telecom industry has established a long record of setting international standards [7]. At the beginning of electric telegraphy in the 19th century, telegraph lines did not cross national frontiers because each country used a different system and each had its own telegraph code to safeguard the secrecy of its military and political telegraph messages. Messages had to be transcribed, translated and handed over at frontiers before being retransmitted over the telegraph network of the neighboring country. The first International Telegraph Convention was signed in May 1865 and harmonized the different systems used. This event marked the birth of the International Telecommunication Union (ITU).
Following the invention of the telephone and the subsequent expansion of telephony, the Telegraph Union began, in 1885, to draw up international rules for telephony. In 1906 the first International Radiotelegraph Convention was signed. Subsequently, several committees were set up for establishing international standards, including the International Telephone Consultative Committee (CCIF) in 1924, the International Telegraph Consultative Committee (CCIT) in 1925, and
Figure 13.1. Organization of the ITU with its subgroups relevant to digital video communications. Working Parties (WP) are organized into Questions that define the standards. The ITU comprises the sectors ITU-R, ITU-T and ITU-D; within ITU-T, Study Group 16 (Multimedia) is divided into WP1 (Modems & Interfaces: V.34, V.25ter, ...), WP2 (Systems: H.320 for ISDN, H.323 for LANs, H.324 for POTS, T.120 for data) and WP3 (Coding: G.7xx audio, H.261 and H.263 video).
the International Radio Consultative Committee (CCIR) in 1927. In 1927, the Telegraph Union allocated frequency bands to the various radio services existing at the time. In 1934 the International Telegraph Convention of 1865 and the International Radiotelegraph Convention of 1906 were merged to become the International Telecommunication Union (ITU, http://www.itu.int). In 1956, the CCIT and the CCIF were amalgamated to create the International Telephone and Telegraph Consultative Committee (CCITT). In 1989, CCITT published the first digital video coding standard, CCITT Recommendation H.261 [40], which is still relevant today. In 1992, the ITU reformed itself, which resulted in renaming CCIR into ITU-R and CCITT into ITU-T. Consequently, the standards of CCITT are now referred to as ITU-T Recommendations. For example, CCITT H.261 is now known as ITU-T H.261.
Fig. 13.1 shows the structural organization of the ITU, detailing the parts that are relevant to digital video communications. ITU-T is organized in study groups, with Study Group 16 (SG 16) being responsible for multimedia. SG 16 divided its work into different Working Parties (WP), each dealing with several Questions. Here we list some Questions that SG 16 worked on in 2001: Question 15 (Advanced video coding) developed the video coding standards ITU-T Recommendations H.261 and H.263 [49]. Question 19 (Extension to existing ITU-T speech coding standards at bit rates below 16 kbit/s) developed speech coding standards like ITU-T Recommendations G.711 [35], G.722 [36] and G.728 [38]. Question numbers tend to change every four years.
The ITU is an international organization created by a treaty signed by its member countries. Therefore, countries consider dealings of the ITU relevant to their sovereign power. Accordingly, any Recommendation, i.e. standard, of the ITU has to be agreed upon unanimously by the member states.
Therefore, the standardization process in the ITU is often not able to keep up with the progress in modern technology. Sometimes, the process of reaching unanimous decisions does not work, and the ITU recommends regional standards like 7-bit or 8-bit representation of digital speech in the USA and Europe, respectively. As far as mobile telephony is concerned, the ITU did not play a leading role. In the USA there is not even a national mobile telephony standard, as every operator is free to choose a standard of its own liking. This contrasts with the approach adopted in Europe, where the GSM standard is so successful that it is expanding all over the world, USA included. With UMTS (so-called 3rd generation mobile), the ITU-T is retaking its original role of developer of global mobile telecommunication standards.
ISO
The need to establish international standards developed with the growth of trade [7]. The International Electrotechnical Commission (IEC) was founded in 1906 to prepare and publish international standards for all electrical, electronic and related technologies. The IEC is currently responsible for standards of such communication means as "receivers", audio and video recording systems, and audio-visual equipment, currently all grouped in TC 100 (Audio, Video and Multimedia Systems and Equipment). International standardization in other fields, and particularly in mechanical engineering, was the concern of the International Federation of the National Standardizing Associations (ISA), set up in 1926. ISA's activities ceased in 1942, but a new international organization called the International Organization for Standardization (ISO) began to operate in 1947 with the objective "to facilitate the international coordination and unification of industrial standards". All computer-related activities are currently in the Joint ISO/IEC Technical Committee 1 (JTC 1) on Information Technology. This TC has achieved a very large size: about one third of all ISO and IEC standards work is done in JTC 1.
The subcommittees SC 24 (Computer Graphics and Image Processing) and SC 29 (Coding of Audio, Picture, Multimedia and Hypermedia Information) are of interest to multimedia communications. Whereas SC 24 defines computer graphics standards like VRML, SC 29 developed the well-known audiovisual communication standards MPEG-1, MPEG-2 and MPEG-4 (Fig. 13.2). The standards were developed at meetings attended by between 200 and 400 delegates from industry, research institutes and universities.
ISO has been an agency of the United Nations since 1947. ISO and IEC have the status of private not-for-profit companies established according to the Swiss Civil Code. Similarly to the ITU, ISO requires consensus in order to publish a standard. ISO also sometimes fails to establish truly international standards, as can be seen with digital TV: while the same video decoder (MPEG-2 Video) is used worldwide, the audio representation is different in the US and in Europe.

Both ISO (www.iso.ch) and ITU are in constant competition with industry. While ISO and ITU have been very successful in defining widely used audio and video coding standards, they were less successful in defining transport of multimedia
Figure 13.2. Audiovisual communication standards like MPEG-1, MPEG-2 and MPEG-4 are developed by Working Group (WG) 11 of Subcommittee (SC) 29 under ISO/IEC JTC 1.
signals over the Internet. This is currently handled by the Internet Engineering Task Force (IETF, www.ietf.org), a large open international community of network designers, operators, vendors, and researchers concerned with the evolution of the Internet architecture and the smooth operation of the Internet. It is open to any interested individual. Other de facto standards like Java are defined by one or a few companies, thus limiting the access of outsiders and newcomers to the technology.
13.1.2 Requirements for a Successful Standard
International standards are developed in order to allow interoperation of communications equipment provided by different vendors. This results in the following requirements that enable a successful deployment of audio-visual communications equipment in the market place.

1. Innovation: In order for a standard to distinguish itself from other standards
or widely accepted industry standards, it has to provide a significant amount of innovation. Innovation in the context of video coding means that the standard provides new functionalities like broadcast-quality interlaced digital video, video on CD-ROM or improved compression. If the only distinction of a new standard is better compression, the standard should provide at least an improvement that is visible to the consumer and non-expert before its introduction makes sense commercially. This usually translates to a gain of 3 dB in PSNR of the compressed video at a generally acceptable picture quality.
2. Competition: Standards should not prevent competition between different manufacturers. Therefore, standards specifications need to be open and available to everyone. Free software for encoders and decoders also helps to promote a standard. Furthermore, a standard should only define the syntax and semantics of a bitstream, i.e., a standard defines how a decoder works. Bitstream generation is not standardized. Although the development of bitstream syntax and semantics requires having an encoder and a decoder, the standard does not define the encoder. Therefore, manufacturers of standards-compliant terminals can compete not only on price but also on additional features like postprocessing of the decoded media and, more importantly, on encoder performance. In video encoding, major differences result from motion estimation, scene change handling, rate control and optimal bit allocation.
3. Transmission and storage media independence: A content provider should be able to transmit or store the digitally encoded content independent of the network or storage media. As a consequence of this requirement, we use audio and video standards to encode the audiovisual information. Then we use a systems standard to format the audio and video bitstreams into a format that is suitable for the selected network or storage system. The systems standard specifies the packetization, multiplexing and packet header syntax for delivering the audio and video bitstreams. The separation of transmission media and media coding usually creates overhead for certain applications. For example, we cannot exploit the advantages of joint source/channel coding.
4. Forward compatibility: A new standard should be able to understand the bitstreams of prior standards, i.e., a new video coding standard like H.263 [49] should be able to decode bitstreams according to the previous video coding standard H.261 [40]. Forward compatibility ensures that new products can be gradually introduced into the market. The new features of the latest standard are only used when terminals conforming to the latest standard communicate. Otherwise, terminals interoperate according to the previous standard.
5. Backward compatibility: A new standard is backward compatible to an older standard if the old standard can decode bitstreams of the new standard. A very important backward compatible standard was the introduction of analog color TV. Black-and-white receivers were able to receive the color TV
signal and display a slightly degraded black-and-white version of the signal. Backward compatibility in today's digital video standards can be achieved by defining reserved bits in the bit stream that a decoder can ignore. A new standard would transmit extra information using these reserved bits of the old standard. Thus, old terminals will be able to decode bitstreams according to the new standard. Furthermore, they will understand those parts of the bit stream that comply with the old standard. Backward compatibility can put severe restrictions on the improvements that a new standard may achieve over its predecessor. Therefore, backward compatibility is not always implemented in a new standard.
6. Upward compatibility: A new receiver should be able to decode bitstreams that were made for similar receivers of a previous or cheaper generation. Upward compatibility is important if an existing standard is extended. A new HDTV set should be able to receive standard-definition TV since both receivers use the same MPEG-2 standard [19].
7. Downward compatibility: An old receiver should be able to receive and decode bitstreams for the newer generation of receivers. Downward compatibility is important if an existing standard is extended. This may be achieved by decoding only parts of the bitstream, which is easily possible if the new bitstream is sent as a scalable bitstream (Chapter 11).
Obviously, not all of the above requirements are essential for the wide adoption of a standard. We believe that the most important requirements are innovation, competition, and forward compatibility, in this order. Compatibility is most important for devices like TV set-top boxes or mobile phones that cannot be upgraded easily. On the other hand, any multimedia PC today comes with more than ten software video codecs installed, relaxing the importance of compatible audio and video coding standards for this kind of terminal.
13.1.3 Standard Development Process
All video coding standards were developed in three phases: competition, convergence, and verification. Fig. 13.3 shows the process for the video coding standard H.261 [40]. The competition phase started in 1984. The standard was published in December 1990 and revised in 1993.
During the competition phase, the application areas and requirements for the standard are defined. Furthermore, experts gather and demonstrate their best algorithms. Usually, the standardization body issues a Call for Proposals as soon as the requirements are defined in order to solicit input from the entire community. This phase is characterized by independently working, competing laboratories.
The goal of the convergence phase is to collaboratively reach an agreement on the coding method. This process starts with a thorough evaluation of the proposals for the standard. Issues like coding efficiency, subjective quality, implementation
Figure 13.3. Overview of the H.261 standardization process (from [57]). Initially, the target was to develop two standards for video coding at rates of n x 384 kbit/s and m x 64 kbit/s. Eventually, the ITU settled on one standard for a rate of p x 64 kbit/s.
complexity and compatibility are considered when agreeing on a first common framework for the standard. This framework is implemented at different laboratories, and the description is refined until the different implementations achieve identical results. This framework has different names in different standards, such as Reference Model (RM) for H.261, Test Model Near-term (TMN) for H.263, Simulation Model for MPEG-1, Test Model (TM) for MPEG-2, Verification Model (VM) for MPEG-4, and Test Model Long-term (TML) for H.26L. After the first version of the framework is implemented, researchers suggest improvements such as new elements for the algorithm or better parameters for existing elements of the algorithm. These are evaluated against the current framework. Proposals that achieve significant improvements are included in the next version of the framework, which serves as the new reference for further improvements. This process is repeated until the desired level of performance is achieved.
During the verification phase, the specification is checked for errors and ambiguities. Conformance bitstreams and correctly decoded video sequences are generated. A standards-compliant decoder has to decode every conformance bitstream into the correct video sequence.
The standardization process of H.261 can serve as a typical example (Fig. 13.3). In 1985, the initial goal was to develop a video coding standard for bitrates between 384 kbit/s and 1920 kbit/s. Due to the deployment of ISDN telephone lines, another standardization process for video coding at 64 kbit/s up to 128 kbit/s began two years later. In 1988 the two standardization groups realized that one algorithm could be used for coding video at rates between 64 kbit/s and 1920 kbit/s. RM 6 was the first reference model that covered the entire bitrate range. Technical work was finished in 1989, and one year later H.261 was formally adopted by the ITU.
13.1.4 Applications for Modern Video Coding Standards
As mentioned previously, there have been several major initiatives in video coding that have led to a range of video standards for different applications [12].
- Video coding for video teleconferencing, which has led to the ITU standards H.261 for ISDN videoconferencing [40], H.263 for video conferencing over analog telephone lines, desktop and mobile terminals connected to the Internet [49], and H.262/MPEG-2 Video [42][17] for ATM/broadband videoconferencing.
- Video coding for storing movies on CD-ROM and other consumer video applications, with about 1.2 Mbit/s allocated to video coding and 256 kbit/s allocated to audio coding, which led to the initial ISO MPEG-1 standard [16]. Today, MPEG-1 is used for consumer video on CD, karaoke machines, in some digital camcorders and on the Internet. Some digital satellites used MPEG-1 to broadcast TV signals prior to the release of MPEG-2.
- Video coding for broadcast and for storing digital video on DVD, with on the order of 2-15 Mbit/s allocated to video and audio coding, which led to the ISO MPEG-2 standard [19] and specifications for DVD operation by the Digital Audio Visual Council (DAVIC) (www.davic.org) and the DVD consortium [25]. This work was extended to video coding for HDTV, with bitrates ranging from 15 to 400 Mbit/s allocated to video coding. Applications include satellite TV, cable TV, terrestrial broadcast, video editing and storage. Today, MPEG-2 video is used in every digital set-top box. It has also been selected as the video decoder for the American HDTV broadcast system.
- Coding of separate audio-visual objects, both natural and synthetic, is standardized in ISO MPEG-4 [22]. Target applications are Internet video, interactive video, content manipulation, professional video, 2D and 3D computer graphics, and mobile video communications.
In the following sections, we will first describe H.261. For H.263, we will highlight the differences from H.261 and compare their coding efficiency. Then, we will discuss MPEG-1, MPEG-2 and MPEG-4, again focusing on their differences.
13.2 Video Telephony with H.261 and H.263
Video coding at 64 kbit/s was first demonstrated at a conference in 1979 [56]. However, it took more than ten years to define a commercially viable video coding standard at that rate. The standard H.261 was published in 1990 in order to enable video conferencing using between one and thirty ISDN channels. At that time, video conferencing hardware became available from different vendors. Companies like PictureTel that sold video conferencing equipment with a proprietary algorithm quickly offered H.261 as an option. Later, the ITU developed the similar
standard H.263 that enables video communications over analog telephone lines. Today, H.263 video encoder and decoder software is installed on every PC with the Windows operating system.
13.2.1 H.261 Overview
Figure 13.4 shows the block diagram of an H.261 encoder that processes video in the 4:2:0 sampling format. The video coder is a block-based hybrid coder with motion compensation (Sec. 9.3.1). It subdivides the image into macroblocks (MB) of size 16x16 pels. A MB consists of 4 luminance blocks and 2 chrominance blocks, one for the Cr and another for the Cb component. H.261 uses an 8x8 DCT for each block to reduce spatial redundancy, a DPCM loop to exploit temporal redundancy, and unidirectional integer-pel forward motion compensation for MBs (box P in Fig. 13.4) to improve the performance of the DPCM loop.
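To make the macroblock structure concrete, the following Python sketch (our own illustration; the use of NumPy and all variable names are assumptions, not part of the standard) extracts the six 8x8 blocks of one macroblock from a 4:2:0 frame: four luminance blocks from a 16x16 luminance area and one 8x8 block each from the co-located Cb and Cr areas.

    import numpy as np

    def macroblock_blocks(Y, Cb, Cr, mb_row, mb_col):
        """Return the six 8x8 blocks of macroblock (mb_row, mb_col) of a 4:2:0 frame.

        Y has full resolution; Cb and Cr are subsampled by two in both
        directions, so a 16x16 luminance area corresponds to one 8x8 block
        per chrominance component.  Illustrative sketch only.
        """
        y0, x0 = 16 * mb_row, 16 * mb_col            # top-left pel of the MB in Y
        luma = Y[y0:y0 + 16, x0:x0 + 16]
        blocks = [luma[0:8, 0:8], luma[0:8, 8:16],   # four 8x8 luminance blocks
                  luma[8:16, 0:8], luma[8:16, 8:16]]
        cy, cx = 8 * mb_row, 8 * mb_col              # corresponding chroma position
        blocks.append(Cb[cy:cy + 8, cx:cx + 8])      # one Cb block
        blocks.append(Cr[cy:cy + 8, cx:cx + 8])      # one Cr block
        return blocks

    # Example: a QCIF frame (176x144 luminance pels) contains 11x9 macroblocks.
    Y = np.zeros((144, 176), dtype=np.uint8)
    Cb = np.zeros((72, 88), dtype=np.uint8)
    Cr = np.zeros((72, 88), dtype=np.uint8)
    assert len(macroblock_blocks(Y, Cb, Cr, 0, 0)) == 6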
A simple two-dimensional loop filter (see Sec. 9.3.5) may be used to lowpass filter the motion-compensated prediction signal (box F in Fig. 13.4). This usually decreases the prediction error and reduces the blockiness of the prediction image. The loop filter is separable into one-dimensional horizontal and vertical functions with the coefficients [1/4, 1/2, 1/4]. H.261 uses two quantizers for DCT coefficients. A uniform quantizer with stepsize 8 is used in intra-mode for DC coefficients; a nearly uniform midtread quantizer with a stepsize between 2 and 62 is used for AC coefficients in intra-mode and in inter-mode (Fig. 13.5). The input between -T and T, which is called the dead zone, is quantized to 0. Except for the dead zone, the stepsize is uniform. This dead zone avoids coding many small DCT coefficients that would mainly contribute to coding noise.
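The dead-zone quantizer and the loop filter can be illustrated with a small sketch. The following Python fragment is a simplified model, not the normative H.261 arithmetic: the exact rounding and reconstruction rules of the standard are omitted, and the border handling of the filter is our own simplification.

    import numpy as np

    def quantize_ac(coeff, step):
        """Nearly uniform midtread quantizer with a dead zone (sketch).

        Inputs with magnitude below one stepsize map to level 0, so the dead
        zone is twice as wide as the other quantization intervals.  H.261
        restricts step to even values between 2 and 62.
        """
        return np.fix(np.asarray(coeff, dtype=float) / step).astype(int)

    def dequantize_ac(level, step):
        """Reconstruct roughly to the middle of each decision interval (sketch)."""
        level = np.asarray(level)
        return np.where(level == 0, 0, step * level + np.sign(level) * (step // 2))

    def loop_filter(pred):
        """Separable [1/4, 1/2, 1/4] lowpass filter on a prediction block.

        Border samples are left unfiltered in this simplified version.
        """
        out = pred.astype(float)
        out[:, 1:-1] = 0.25 * pred[:, :-2] + 0.5 * pred[:, 1:-1] + 0.25 * pred[:, 2:]
        tmp = out.copy()
        out[1:-1, :] = 0.25 * tmp[:-2, :] + 0.5 * tmp[1:-1, :] + 0.25 * tmp[2:, :]
        return np.rint(out).astype(pred.dtype)

    levels = quantize_ac([-70, -5, 0, 5, 70], step=10)   # -> [-7, 0, 0, 0, 7]
    print(levels, dequantize_ac(levels, step=10))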
The encoder transmits mainly two classes of information for each MB that is coded: DCT coefficients resulting from the transform of the prediction error signal (q in Fig. 13.4) and motion vectors that are estimated by the motion estimator (v and box P in Fig. 13.4). The motion vector range is limited to ±16 pels. The control information that tells the decoder whether and how a MB and its blocks are coded is named macroblock type (MTYPE) and coded block pattern (CBP). Table 13.1 shows the different MB types. In intra-mode, the bitstream contains transform coefficients for each block. Optionally, a change in the quantizer stepsize of 2 levels (MQUANT) can be signaled. In inter-mode, the encoder has the choice of just sending a differentially coded motion vector (MVD) with or without the loop filter on. Alternatively, a CBP may be transmitted in order to specify the blocks for which transform coefficients will be transmitted. Since the standard does not specify an encoder, it is up to the encoder vendor to decide on an efficient Coding Control (CC in Fig. 13.4) to optimally select MTYPE, CBP, MQUANT, loop filter and motion vectors [69]. As a rough guideline, we can select MTYPE, CBP, and MVD such that the prediction error is minimized. However, since the transmission of motion vectors costs extra bits, we do this only if the prediction error using the motion vector is significantly lower than without it. The quantizer stepsize is varied while coding the picture such that the picture does not require more bits
Figure 13.4. Block diagram of an H.261 encoder [40]. The source coder contains the transform (T), quantizer (Q), inverse quantizer and inverse transform, loop filter (F), picture memory with motion estimation and motion compensation (P), and coding control (CC). The side information sent to the video multiplex coder comprises the INTRA/INTER flag (p), the transmitted-or-not flag (t), the quantizer indication (qz), the quantizing indices for the transform coefficients (q), the motion vector (v), and the loop-filter on/off flag (f).
than the coder can transmit during the time between two coded frames. Coding mode selection and parameter selection were previously discussed in Sec. 9.3.3.

Most information within a MB is coded using a variable length code that was derived from statistics of test sequences. Coefficients of the 2D DCT are coded using the run-length coding method discussed in Sec. 9.1.7. Specifically, the quantized DCT coefficients are scanned using a zigzag scan (Fig. 9.8) and converted into symbols. Each symbol includes the number of coefficients that were quantized to 0
Figure 13.5. A midtread quantizer with a dead zone is used in H.261 for all DCT coefficients except the DC coefficient in intra-mode. The bottom part shows the quantization error e = x - Q(x) between the input amplitude x and the output amplitude Q(x).
since the last non-zero coefficient, together with the amplitude of the current non-zero coefficient. Each symbol is coded using a VLC. The encoder sends an End Of Block (EOB) symbol after the last non-zero coefficient of a block (Fig. 9.9).
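The zigzag scan and the (run, level) symbol generation can be sketched in a few lines of Python. This is an illustration of the principle only: the scan order is generated programmatically, the intra DC coefficient is not treated separately here, and the actual VLC tables of the standard are not reproduced.

    import numpy as np

    def zigzag_indices(n=8):
        """Generate the zigzag scan order for an n x n block."""
        return sorted(((u, v) for u in range(n) for v in range(n)),
                      key=lambda p: (p[0] + p[1],
                                     p[0] if (p[0] + p[1]) % 2 else p[1]))

    def run_level_symbols(quantized_block):
        """Convert a quantized 8x8 block into (run, level) symbols plus EOB.

        run   = number of zero coefficients since the last non-zero coefficient
        level = amplitude of the current non-zero coefficient
        Trailing zeros are not coded; the decoder infers them from EOB.
        """
        symbols, run = [], 0
        for (u, v) in zigzag_indices():
            level = int(quantized_block[u, v])
            if level == 0:
                run += 1
            else:
                symbols.append((run, level))
                run = 0
        symbols.append('EOB')
        return symbols

    block = np.zeros((8, 8), dtype=int)
    block[0, 0], block[0, 1], block[2, 0] = 12, 3, -1
    print(run_level_symbols(block))   # [(0, 12), (0, 3), (1, -1), 'EOB']

In the standard, each (run, level) pair is then mapped to a variable length codeword, with an escape mechanism for combinations that are not in the table.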
H.261 does not specify the video encoder capabilities. However, the picture formats that an H.261 decoder has to support are listed in Tab. 13.2. Several standards that set up video conferencing calls exchange video capabilities between terminals. At the minimum level as defined in H.320, a decoder must be capable of decoding QCIF frames at a rate of 7.5 Hz [45]. An optional level of capability is defined at decoding CIF frames at 15 Hz [45]. The maximum level requires the decoding of CIF frames at 30 Hz (30000/1001 Hz, to be precise) [45].
Table 13.1. VLC table for macroblock type (MTYPE). Two MTYPEs are used for intra-coded MBs and eight are used for inter-coded MBs. An 'x' indicates that the syntactic element is transmitted for the MB [40].

Prediction          MQUANT  MVD  CBP  TCOEFF  VLC
Intra                                 x       0001
Intra               x                 x       0000 001
Inter                            x    x       1
Inter               x            x    x       0000 1
Inter + MC                  x                 0000 0000 1
Inter + MC                  x    x    x       0000 0001
Inter + MC          x       x    x    x       0000 0000 01
Inter + MC + FIL            x                 001
Inter + MC + FIL            x    x    x       01
Inter + MC + FIL    x       x    x    x       0000 01
Table 13.2. Picture formats supported by H.261 and H.263.

                     Sub-QCIF  QCIF  CIF   4CIF        16CIF  Custom sizes
Lum Width (pels)     128       176   352   704         1408   < 2048
Lum Height (pels)    96        144   288   576         1152   < 1152
H.261                          yes   Opt   Still pict
H.263                yes       yes   Opt   Opt         Opt    Opt
13.2.2 H.263 Highlights
The H.263 standard is based on the framework of H.261. Due to progress in video compression technology and the availability of high-performance desktop computers at reasonable cost, the ITU decided to include more compute-intensive and more efficient algorithms in the H.263 standard. The development of H.263 had three phases. The technical work for the initial standard was finished in November 1995. An extension of H.263, nicknamed H.263+, was incorporated into the standard in September 1997. The results of the third phase, nicknamed H.263++, were folded into the standard in 1999 and formally approved in November 2000. In this section, we focus on the differences between H.263 as of 1995 and H.261. We also briefly describe H.263 as of 2000.
H.263 Baseline as of 1995 versus H.261
H.263 consists of a baseline decoder with features that any H.263 decoder has to implement. In addition, optional features are defined. The following mandatory features distinguish H.263 as defined in November 1995 from H.261 [6][12]:

1. Half-pixel motion compensation: This feature significantly improves the prediction capability of the motion compensation algorithm in cases where there
Figure 13.6. Prediction of motion vectors uses the median vector of MV1, MV2, and MV3. We assume the motion vector is 0 if one MB is outside the picture or group of blocks. If two MBs are outside, we use the remaining motion vector as prediction.
is object motion that needs fine spatial resolution for accurate modeling. Bilinear interpolation (simple averaging) is used to compute the predicted pels in the case of non-integer motion vectors. The coding of motion vectors uses the median motion vector of the three neighboring MBs as prediction for each component of the vector (Fig. 13.6); a small sketch of both operations follows this list.
2. Improved variable-length coding, including a 3D VLC for improved efficiency in coding DCT coefficients. Whereas H.261 codes the symbols (run, level) and sends an EOB word at the end of each block, H.263 integrates the EOB word into the VLC. The events to be coded are (last, run, level), where last indicates whether the coefficient is the last non-zero coefficient in the block.
3. Reduced overhead at the Group of Blocks level as well as in the coding of MTYPE and CBP.
4. Support for more picture formats (Tab. 13.2).
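As announced in item 1, the following Python sketch illustrates bilinear half-pel interpolation and the median prediction of motion vectors. The rounding conventions and function names used here are our own simplifications, not the normative formulas of the standard.

    import numpy as np

    def median_mv_predictor(mv1, mv2, mv3):
        """Component-wise median of the three neighboring MB vectors (Fig. 13.6)."""
        return (int(np.median([mv1[0], mv2[0], mv3[0]])),
                int(np.median([mv1[1], mv2[1], mv3[1]])))

    def half_pel_sample(ref, y, x):
        """Bilinear interpolation of position (y, x) given in half-pel units.

        Integer positions return the reference pel itself; half-pel positions
        are the (rounded) average of the two or four surrounding integer pels.
        """
        y0, x0 = y // 2, x // 2
        y1, x1 = y0 + (y % 2), x0 + (x % 2)
        return (int(ref[y0, x0]) + int(ref[y0, x1]) +
                int(ref[y1, x0]) + int(ref[y1, x1]) + 2) // 4

    ref = np.arange(16, dtype=np.uint8).reshape(4, 4)
    print(median_mv_predictor((2, 0), (-1, 3), (0, 1)))  # -> (0, 1)
    print(half_pel_sample(ref, 1, 1))                    # -> 3, average of ref[0:2, 0:2]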
In addition to the above improvements, H.263 offers a list of optional features that are defined in annexes of the standard.
I Unrestricted motion vectors (Annex D) that are allowed to point outside the picture improve coding efficiency in the case of camera motion or motion at the picture boundary. The prediction signal for a motion vector that points outside of the image is generated by repeating the boundary pels of the image (see the sketch after this list). The motion vector range is extended to [-31.5, 31].
II Syntax-based arithmetic coding (Annex E) may be used in place of the variable length (Huffman) coding, resulting in the same decoded pictures at an average
Figure 13.7. Prediction of motion vectors in advanced prediction mode with 4 motion vectors in a MB. The predicted value for a component of the current motion vector MV is the median of its predictors.
bitrate saving of 4% for P-frames and 10% for I-frames. However, decoder computational requirements increase by more than 50% [10]. This will limit the number of manufacturers implementing this annex.
III Advanced prediction mode (Annex F) includes the unrestricted motion vector mode. Advanced prediction mode allows for two additional improvements: Overlapped block motion compensation (OBMC) may be used to predict the luminance component of a picture, which improves prediction performance and significantly reduces blocking artifacts (see Sec. 9.3.2) [58]. Each pixel in an 8x8 luminance prediction block is a weighted sum of three predicted values computed from the following three motion vectors: the vector of the current MB and the vectors of the two MBs that are closest to the current 8x8 block. The weighting coefficients used for motion compensation and the equivalent window function for motion estimation were given previously in Figs. 9.16 and 9.17.
The second improvement of advanced motion prediction is the optional use of four motion vectors for a MB, one for each luminance block. This enables better modeling of motion in real images. However, it is up to the encoder to decide in which MB the benefit of four motion vectors is sufficient to justify the extra bits required for coding them. Again, the motion vectors are coded predictively (Fig. 13.7).
IV PB-pictures (Annex G) is a mode that codes a bidirectionally predicted picture together with a normal forward predicted picture. The B-picture temporally precedes the P-picture of the PB-picture. In contrast to bidirectional prediction (Sec. 9.2.4, Fig. 9.12) that is computed on a frame-by-frame basis, PB-pictures use bidirectional prediction on a MB level. In a PB-frame, the number of blocks per MB is 12 rather than 6. Within each MB, the 6 blocks
Figure 13.8. Forward prediction can be used for all B-blocks; backward prediction is only used for those pels that the backward motion vector aligns with pels of the current MB (from [10]).
belonging to the P-picture are transmitted first, followed by the blocks of the B-picture (Fig. 13.8). Bidirectional prediction is derived from the previous decoded frame and the P-blocks of the current MB. As seen in Fig. 13.8, this limits backward prediction to those pels of the B-blocks that are aligned with pels inside the current P-macroblock in the case of motion between the B-picture and the P-picture (light grey area in Fig. 13.8). For the light grey area of the B-block, the prediction is computed by averaging the results of forward and backward prediction. Pels in the white area of the B-block are predicted using forward motion compensation only. An Improved PB-frame mode (Annex M) was adopted later that removes this restriction, enabling the efficiency of regular B-frames (Sec. 9.2.4).
PB-pictures are efficient for coding image sequences with moderate motion. They tend not to work very well for scenes with fast or complex motion or when coding at low frame rates. Since the picture quality of a B-picture has no effect on the coding of subsequent frames, H.263 defines that the B-picture of a PB-picture set is coded at a lower quality than the P-picture by using a smaller quantizer stepsize for P-blocks than for the associated B-blocks. PB-pictures increase the delay of a coding system, since PB-pictures allow the encoder to send bits for the B-frame only after the following P-frame has been captured and processed. This limits their usefulness for interactive real-time applications.
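Item I can be illustrated with a short sketch. The following Python fragment (an illustration under our own assumptions, not the normative procedure) forms a prediction block whose source area may lie partly outside the reference picture; clamping the coordinates to the picture borders is equivalent to repeating the boundary pels.

    import numpy as np

    def predict_block_unrestricted(ref, top, left, height=16, width=16):
        """Motion-compensated prediction with boundary pel repetition (Annex D idea)."""
        rows = np.clip(np.arange(top, top + height), 0, ref.shape[0] - 1)
        cols = np.clip(np.arange(left, left + width), 0, ref.shape[1] - 1)
        return ref[np.ix_(rows, cols)]

    ref = np.arange(144 * 176, dtype=np.int32).reshape(144, 176)   # a QCIF-sized picture
    blk = predict_block_unrestricted(ref, -5, 170)   # vector points above and beyond the right edge
    print(blk.shape)                                 # (16, 16)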
Due to the larger number of coding modes, the encoder decisions are more complex than in H.261. A rate-distortion-optimized H.263 encoder with the options Unrestricted Motion Vector mode and Advanced Prediction was compared to TMN5, the test model encoder used during the standard's development [69]. The optimized encoder increases the PSNR by between 0.5 dB and 1.2 dB over TMN5 at
bitrates between 20 and 70 kbit/s.
H.263 as of 2000
After the initial version of H.263 was approved, work continued, with further optional modes being added. However, since these more than 15 modes are optional, it is questionable whether any manufacturer will implement all of these options. The ITU recognized that and added recommendations for preferred modes to the standard. The most important preferred modes not mentioned above are listed here.
1. Advanced Intra Coding (Annex I): Intra blocks are coded using the block to the left or above as a prediction, provided that block is also coded in intra-mode. This mode increases coding efficiency by 10% to 15% for I-pictures.
2. Deblocking filter (Annex J): An adaptive filter is applied across the boundaries of decoded 8x8 blocks to reduce blocking artifacts. This filter also affects the predicted image and is implemented inside the prediction loop of coder and decoder.
3. Supplemental Enhancement Information (Annex L): This information may be used to provide tagging information for external use as defined by an application using H.263. Furthermore, this information can be used to signal enhanced display capabilities like frame freeze, zoom or chroma-keying (see Sec. 10.3).
4. Improved PB-frame Mode (Annex M): As mentioned before, this mode removes the restrictions placed on the backward prediction of Annex G. Therefore, this mode enables regular bidirectional prediction (Sec. 9.2.4).
The above tools were developed to enhance coding efficiency. In order to enable transport of H.263 video over unreliable networks such as wireless networks and the Internet, a set of tools was also developed for the purpose of error resilience. These are included in Annex H: Forward Error Correction Using BCH Code, Annex K: Flexible Synchronization Marker Insertion Using the Slice Structured Mode, Annexes N and U: Reference Picture Selection, Annex O: Scalability, Annex R: Independent Segment Decoding, Annex V: Data Partitioning and RVLC, and Annex W: Header Repetition. These tools are described in Sec. 14.7.1. Further discussion of H.263 can be found in [6] and in the standard itself [49].
13.2.3 Comparison
Fig. 13.9 compares the performance of H.261 and H.263 [10]. H.261 is shown with and without using the filter in the loop (curves 3 and 5). Since H.261 was designed for data rates of 64 kbit/s and up, we discuss Fig. 13.9 at this rate. Without options, H.263 outperforms H.261 by 2 dB (curves 2 and 3). Another dB is gained if we use the options advanced prediction, syntax-based arithmetic coding, and PB-frames (curve 1). Curve 4 shows that restricting motion vectors in H.263 to integer-pel
Figure 13.9. Performance of H.261 and H.263 for the sequence Foreman at QCIF and 12.5 Hz [10].
reduces coding efficiency by 3 dB. This is due to the reduced motion compensation accuracy and the lack of the lowpass filtering that bilinear interpolation brings for half-pel motion vectors. Comparing curves 3 and 5 shows the effect of this lowpass filter on coding efficiency. The differences between curves 4 and 5 are mainly due to the 3D VLC for coding of transform coefficients as well as improvements in the coding of MTYPE and CBP.
13.3 Standards for Visual Communication Systems
In order to enable useful audiovisual communications, the terminals have to establish a common communication channel, exchange their capabilities and agree on the standards for exchanging audiovisual information. In other words, we need much more than just an audio and a video codec in order to enable audiovisual communication. The setup of communication between a server and a client over a network is handled by a systems standard. ITU-T developed several system standards, including H.323 and H.324, to enable bidirectional multimedia communications over different networks, several audio coding standards for audio communications, and the two important video coding standards H.261 and H.263. Table 13.3 provides an
Table 13.3. ITU-T multimedia communication standards. PSTN: Public Switched Telephone Network; N-ISDN: Narrowband ISDN (2x64 kbit/s); B-ISDN: Broadband ISDN; ATM: Asynchronous Transfer Mode; QoS: guaranteed Quality of Service; LAN: Local Area Network; H.262 is identical to MPEG-2 Video [42][17]; G.7xx represents G.711, G.722, and G.728.

Network        System  Video     Audio        Mux        Control
PSTN           H.324   H.261/3   G.723.1      H.223      H.245
N-ISDN         H.320   H.261     G.7xx        H.221      H.242
B-ISDN/ATM     H.321   H.261     G.7xx        H.221      Q.2931
B-ISDN/ATM     H.310   H.261/2   G.7xx/MPEG   H.222.0/1  H.245
QoS LAN        H.322   H.261/3   G.7xx        H.221      H.242
Non-QoS LAN    H.323   H.261     G.7xx        H.225.0    H.245
overview of the standards for audio, video, multiplexing and call control that these system standards use [5]. In the following, we briefly describe the functionality of the recent standards H.323 [50] and H.324 [43].
13.3.1 H.323 Multimedia Terminals
Recommendation H.323 [50] provides the technical requirements for multimedia communication systems that operate over packet-based networks like the Internet, where guaranteed quality of service is usually not available.
Fig. 13.10 shows the different protocols and standards that H.323 requires for video conferencing over packet networks. An H.323 call scenario optionally starts with a gatekeeper admission request (H.225.0 RAS [47]). Then, call signaling establishes the connection between the communicating terminals (H.225.0 [47]). Next, a communication channel is established for call control and capability exchange (H.245 [48]). Finally, the media flow is established using RTP and its associated control protocol RTCP [64]. A terminal may support several audio and video codecs. However, the support of G.711 audio (64 kbit/s) [35] is mandatory. G.711 is the standard currently used in the Public Switched Telephone Network (PSTN) for digital transmission of telephone calls. If a terminal claims to have video capabilities, it has to include at least an H.261 video codec [40] with a spatial resolution of QCIF. Modern H.323 video terminals usually use H.263 [49] for video communications.
13.3.2 H.324 Multimedia Terminals
H.324 [43] differs from H.323 in that it enables the same communication over networks with guaranteed quality of service, as is available when using V.34 [41] modems over the PSTN. The standard H.324 may optionally support the media types voice, data and video. If a terminal supports one or more of these media, it uses the same audiovisual codecs as H.323. However, it also supports H.263 for video and G.723.1
Figure 13.10. H.323 protocols for multimedia communications over TCP/IP. The protocol stack comprises audio codecs (G.711, G.722, G.728), video codecs (H.261, H.263), data (T.120), call signaling (H.225.0), control (H.245) and gatekeeper registration/admission/status (RAS), carried over RTP/RTCP, TCP and UDP on top of IP.
[37] for audio at 5.3 kbit/s and 6.3 kbit/s. The audio quality of a G.723.1 codec at 6.3 kbit/s is very close to that of a regular phone call. Call control is handled using H.245. Transmission of these different media types over the PSTN requires the media to be multiplexed (Fig. 13.11) following the multiplexing standard H.223 [44]. The multiplexed data is sent to the PSTN using a V.34 modem and the V.8 or V.8bis procedure [52][51] to start and stop transmission. The modem control protocol V.25ter [46] is used if the H.324 terminal uses an external modem.
13.4 Consumer Video Communications with MPEG-1
The MPEG standards were developed by ISO/IEC JTC1 SC29/WG11, which is chaired by Leonardo Chiariglione. MPEG-1 was designed for progressively scanned video used in multimedia applications, and the target was to produce near-VHS-quality video at a bit rate of around 1.2 Mbit/s (1.5 Mbit/s including audio and data). It was foreseen that a lot of multimedia content would be distributed on CD-ROM. At the time of the MPEG-1 development, 1.5 Mbit/s was the access rate of CD-ROM players. The video format is SIF. The final standard supports higher rates and larger image sizes. Another important consideration when developing MPEG-1 were functions that support basic VCR-like interactivity such as fast forward, fast reverse and random access into the stored bitstream at every half-second [54].
13.4.1 MPEG-1 Overview
The MPEG-1 standard, formally known as ISO 11172 [16], consists of 5 parts,
namely Systems, Video, Audio, Conformance, and Software.
MPEG-1 Systems provides a packet structure for combining coded audio and
Figure 13.11. Block diagram of an H.324 multimedia communication system over the Public Switched Telephone Network (PSTN). Within the scope of the recommendation, video I/O equipment connects to the H.263/H.261 video codec, audio I/O equipment to the G.723 audio codec (with a receive path delay), and user data applications to data protocols (V.14, LAPM, etc.); together with the H.245 control protocol and system control, they feed the H.223 multiplex/demultiplex, which connects through a V.34/V.8 modem (controlled via V.25ter) to the PSTN.
video data. It enables the system to multiplex several audio and video streams into one stream that allows synchronous playback of the individual streams. This requires all streams to refer to a common system time clock (STC). From this STC, presentation time stamps (PTS), defining the instant when a particular audio or video frame should be presented on the terminal, are derived. Since coded video with B-frames requires a reordering of decoded images, the concept of Decoding Time Stamps (DTS) is used to indicate by when a certain image has to be decoded.
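The interplay of coding order, PTS and DTS can be illustrated with a small sketch. The following Python fragment is a conceptual model only: the 90 kHz tick values, the fixed one-frame presentation delay and the simple reordering rule are our assumptions for illustration and not the normative MPEG-1 Systems procedure.

    def transmission_order(display_types, frame_period=3600):
        """Reorder a display-order GOP (e.g. 'IBBP...') into coding order and
        attach illustrative DTS/PTS values in 90 kHz ticks (3600 ticks = one
        frame period at 25 Hz).

        Each B-frame is moved behind the reference frame it needs for backward
        prediction.  DTS follows transmission order; PTS follows display order,
        delayed by one frame period so that every reference frame is decoded
        before it is presented.  B-frames have DTS == PTS.
        """
        reordered, pending_b = [], []
        for disp_idx, frame_type in enumerate(display_types):
            if frame_type == 'B':
                pending_b.append((frame_type, disp_idx))
            else:                                # I- or P-frame
                reordered.append((frame_type, disp_idx))
                reordered.extend(pending_b)      # B-frames follow their backward reference
                pending_b = []
        return [(frame_type, n * frame_period,              # DTS
                 (disp_idx + 1) * frame_period)             # PTS
                for n, (frame_type, disp_idx) in enumerate(reordered)]

    for frame_type, dts, pts in transmission_order("IBBPBBP"):
        print(frame_type, dts, pts)   # I 0 3600, P 3600 14400, B 7200 7200, ...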
MPEG-1 Audio is a generic standard that does not make any assumptions about the nature of the audio source. However, audio coding exploits perceptual limitations of the human auditory system for irrelevancy reduction. MPEG-1 Audio is defined in three layers I, II, and III. Higher layers have higher coding efficiency and require more resources for decoding. Especially Layer III was very controversial due to its computational complexity at the time of standardization in the early 1990s. However, today it is this Layer III MPEG-1 Audio codec that is known to every music fan as MP3. The reason for its popularity is sound quality, coding efficiency - and, most of all, the fact that for a limited time the proprietary high-quality encoder source code was available for download on a company's web site. This started the revolution of the music industry (see Sec. 13.1.2).
Figure 13.12. Block diagram of an MPEG-1 encoder.
13.4.2 MPEG-1 Video
The MPEG-1 Video convergence phase started after subjective tests in October 1989 and resulted in the standard published in 1993. Since H.261 was published in 1990, there are many similarities between H.261 and MPEG-1. Fig. 13.12 shows the conceptual block diagram of an MPEG-1 coder. Compared to H.261 (Fig. 13.4) we notice the following differences:
1. The loop filter is gone. Since MPEG-1 uses motion vectors with half-pel accuracy, there is no need for the filter (see Sec. 13.2.3). The motion vector range is extended to ±64 pels.
2. MPEG-1 uses I-, P-, and B-frames. The use of B-frames requires a more complex motion estimator and motion compensation unit. Motion vectors for B-frames are estimated with respect to two reference frames, the preceding I- or P-frame and the next I- or P-frame. Hence, we can associate two motion vectors with each MB of a B-frame. For motion-compensated prediction, we now need two frame stores for these two reference pictures. The prediction mode for B-frames is decided for each MB. Furthermore, the coding order is different from the scan order (see Fig. 9.12), and therefore we need a picture reordering unit at the input of the encoder and at the decoder.
3. For I-frames, quantization of DCT coefficients is adapted to the human visual system by dividing the coefficients by a weight matrix. Figure 13.13 shows
Figure 13.13. Default weights w(u,v) for quantization of I-blocks in MPEG-1. Weights for horizontal and vertical frequencies differ.

u\v    0   1   2   3   4   5   6   7
0      8  16  19  22  22  26  26  27
1     16  16  22  22  26  27  27  29
2     19  22  26  26  27  29  29  35
3     22  24  27  27  29  32  34  38
4     26  27  29  29  32  35  38  46
5     27  29  34  34  35  40  46  56
6     29  34  34  37  40  48  56  69
7     34  37  38  40  48  58  69  83
the default table. Larger weights result in a coarser quantization of the coefficient. It can be seen that the weights increase with the frequency that a coefficient represents. When comparing a coder with and without the weight matrix at identical bitrates, we notice that the weight matrix reduces the PSNR of decoded pictures but increases the subjective quality.
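The role of the weight matrix can be sketched with a few lines of Python. This is a conceptual illustration only; the exact MPEG-1 quantizer formula, its rounding rules and the separate treatment of the DC coefficient are simplified here, and the scale factor 16 merely reflects the normalization commonly used when describing the weighted quantizer.

    import numpy as np

    # Default intra weight matrix w(u, v) from Figure 13.13.
    W = np.array([[ 8, 16, 19, 22, 22, 26, 26, 27],
                  [16, 16, 22, 22, 26, 27, 27, 29],
                  [19, 22, 26, 26, 27, 29, 29, 35],
                  [22, 24, 27, 27, 29, 32, 34, 38],
                  [26, 27, 29, 29, 32, 35, 38, 46],
                  [27, 29, 34, 34, 35, 40, 46, 56],
                  [29, 34, 34, 37, 40, 48, 56, 69],
                  [34, 37, 38, 40, 48, 58, 69, 83]])

    def quantize_intra(dct_block, quantizer_scale):
        """Weighted quantization of an 8x8 intra DCT block (conceptual sketch).

        Each coefficient is divided by its weight before the uniform quantizer
        is applied, so coefficients with large weights (high frequencies) are
        quantized more coarsely.
        """
        return np.rint(16.0 * dct_block / (W * quantizer_scale)).astype(int)

    def dequantize_intra(levels, quantizer_scale):
        """Approximate inverse of quantize_intra."""
        return levels * W * quantizer_scale / 16.0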
Another difference from H.261 is that the DC coefficient of an I-block may be predicted from the DC coefficient of its neighbor to the left. This concept was later extended in JPEG [15][39], H.263, and MPEG-4.
MPEG-1 uses a Group of Pictures (GOP) structure (Fig. 9.12). Each GOP starts with an I-frame followed by a number of P- and B-frames. This enables random access to the video stream as well as the VCR-like functionalities fast forward and reverse.
Because of the large range of bitstream characteristics that is supported by the standard, a special subset of the coding parameters, known as the Constrained Parameters Set (CPS), has been defined (Tab. 13.4). The CPS is a limited set of sampling and bitrate parameters designed to limit computational decoder complexity, buffer size, and memory bandwidth while still addressing the widest possible range of applications. A decoder implemented with the CPS in mind needs only 4 Megabits of DRAM while supporting SIF and CIF. A flag in the bitstream indicates whether or not the bitstream is a CPS bitstream.
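A conformance check against the CPS limits of Table 13.4 can be written down directly. The sketch below is our own illustration; the normative checks are expressed on bitstream header fields rather than on the derived values used here.

    def within_constrained_parameters(width, height, frame_rate, bitrate_bps,
                                      buffer_bytes, max_mv_component):
        """Check derived sequence parameters against the CPS limits of Tab. 13.4."""
        mbs_per_picture = ((width + 15) // 16) * ((height + 15) // 16)
        return (width <= 768 and height <= 576 and
                mbs_per_picture <= 396 and
                mbs_per_picture * frame_rate <= 9900 and
                buffer_bytes <= 327680 and
                abs(max_mv_component) <= 64 and
                bitrate_bps <= 1856000)

    # SIF at 25 Hz and 1.15 Mbit/s satisfies the CPS:
    print(within_constrained_parameters(352, 288, 25, 1150000, 327680, 30))   # True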
Compared to an analog consumer-quality VCR, MPEG-1 codes video with only half the number of scan lines. At a video bitrate of 1.8 Mbit/s, however, it is possible for a good encoder to deliver a video quality that exceeds the quality of a video recorded by an analog consumer VCR onto a used video tape.
13.5 Digital TV with MPEG-2
Towards the end of the MPEG-1 standardization process it became obvious that MPEG-1 would not be able to efficiently compress interlaced digital video at broadcast quality. Therefore, the MPEG group issued a call for proposals to submit
Table 13.4. Constrained Parameters Set for MPEG-1 Video.

Parameter                    Maximum value
pixels/line                  768 pels
lines/picture                576 lines
number of MBs per picture    396 MBs
number of MBs per second     396 x 25 = 330 x 30 = 9900
input buffer size            327680 bytes
motion vector component      ±64 pels
bitrate                      1.856 Mbps
technology for digital coding of audio and video for TV broadcast applications. The best performing algorithms were extensions of MPEG-1 to deal with interlaced video formats. During the collaborative phase of the algorithm development, a lot of similarity with MPEG-1 was maintained.
The main purpose of MPEG-2 is to enable MPEG-1-like functionality for interlaced pictures, primarily using the ITU-R BT.601 (formerly CCIR 601) 4:2:0 format [34]. The target was to produce TV-quality pictures at data rates of 4 to 8 Mbit/s and high-quality pictures at 10 to 15 Mbit/s. MPEG-2 deals with high-quality coding of possibly interlaced video, of either SDTV or HDTV. A wide range of applications, bit rates, resolutions, signal qualities and services is addressed, including all forms of digital storage media, television (including HDTV) broadcasting, and communications [13].
The MPEG-2 standard [19] consists of nine parts: Systems, Video, Audio, Conformance, Software, Digital Storage Media Command and Control (DSM-CC), Non Backward Compatible (NBC) Audio, Real Time Interface, and DSM-CC Conformance. In this section, we provide a brief overview of MPEG-2 Systems, Audio and Video and the MPEG-2 concept of Profiles.
13.5.1 Systems
The requirements for MPEG-2 Systems are to be somewhat compatible with MPEG-1 Systems, to be error resilient, to support transport over ATM networks, and to transport more than one TV program in one stream without requiring a common time base for the programs. An MPEG-2 Program Stream (PS) is forward compatible with MPEG-1 system stream decoders. A PS contains compressed data from the same program, in packets of variable length, usually between 1 and 2 kbytes and up to 64 kbytes. The MPEG-2 Transport Stream (TS) is not compatible with MPEG-1. A TS offers error resilience as required for cable TV networks or satellite TV, uses packets of 188 bytes, and may carry several programs with independent time bases that can be easily accessed for channel hopping.
Figure 13.14. Luminance (Y) and chrominance (Cb, Cr) samples in a 4:2:0 progressive frame for MPEG-1 and MPEG-2.
13.5.2 Audio
MPEG-2 Audio comes in two parts. In Part 3 of the standard, MPEG defines a forward and backward compatible audio format that supports five-channel surround sound. The syntax is designed such that an MPEG-1 audio decoder is able to reproduce a meaningful downmix of the five channels of an MPEG-2 audio bitstream [18]. In Part 7, the more efficient multi-channel audio decoder, the MPEG-2 Advanced Audio Coder (AAC), with sound effects and many other features, is defined [20]. MPEG-2 AAC requires 30% fewer bits than MPEG-1 Layer III Audio for the same stereo sound quality. AAC has been adopted by the Japanese broadcasting industry. AAC is not popular as a format for the Internet because no "free" encoder is available.
13.5.3 Video
MPEG-2 is targeted at TV studios and TV broadcasting for standard TV and HDTV. As a consequence, it has to efficiently support the coding of interlaced video at bitrates adequate for these applications. The major differences between MPEG-1 and MPEG-2 are the following:
1. Chroma samples in the 4:2:0 format are shifted horizontally by 0.5 pels compared to MPEG-1, H.261, and H.263 (Fig. 13.14).
2. MPEG-2 is able to code interlaced sequences in the 4:2:0 format (Fig. 13.15).
3. As a consequence, MPEG-2 allows additional scan patterns for DCT coefficients and motion compensation with blocks of size 16x8 pels.
4. Several differences, e.g., 10-bit quantization of the DC coefficient of the DCT, non-linear quantization, and better VLC tables, improve coding efficiency also for progressive video sequences.
Figure 13.15. Luminance (Y) and chrominance (Cb, Cr) samples in a 4:2:0 interlaced frame where the top field is temporally first.
5. MPEG-2 supports various modes of scalability. Spatial scalability enables different decoders to obtain videos of different picture sizes from the same bitstream. MPEG-2 supports temporal scalability such that a bit stream can be decoded into video sequences of different frame rates. Furthermore, SNR scalability provides the ability to extract video sequences with different amplitude resolutions from the bitstream.
6. MPEG-2 defines profiles and levels, specifying a subset of the MPEG-2 features and their parameter ranges that are signaled in the header of a bitstream (see Sec. 13.5.4). In this way an MPEG-2 compliant decoder knows immediately whether it can decode the bitstream.
7. MPEG-2 allows for much higher bitrates (see Sec. 13.5.4).
In the following we will discuss the extensions introduced to support interlaced video and scalability.
Coding of Interlaced Video
Interlaced video is a sequence of alternating top and bottom fields (see Sec. 1.3.1). Two fields are of identical parity if they are both top fields or both bottom fields. Otherwise, two fields are said to have opposite parity. MPEG-2 considers two types of picture structures for interlaced video (Fig. 13.16). A Frame-picture consists of lines from the top and bottom fields of an interlaced picture in interlaced order. This frame picture structure is also used when coding progressive video. A Field-picture keeps the top and the bottom field of the picture separate. For each of these pictures, I-, P-, and B-picture coding modes are available.
MPEG-2 adds new prediction modes for motion compensation, all related to
interlaced video.
1. Field prediction for Field-pictures is used to predict a MB in a Field-picture. For P-fields, the prediction may come from either field of the two most recently
Figure 13.16. Frame and Field picture structures (side view of the individual fields): each frame consists of a top and a bottom field, either one of which may be temporally first.
coded fields. For B-fields, we use the two fields of the two reference pictures (Fig. 13.17).
2. Field prediction for Frame-pictures splits a MB of the frame into the pels of the top field and those of the bottom field, resulting in two 16x8 Field blocks (Fig. 13.18). Each Field block is predicted independently of the other, similar to the method described in item 1 above. This prediction method is especially useful for rapid motion.
3. Dual Prime for P-pictures transmits one motion vector per MB that can
be used for predicting Field and Frame-pictures from the preceding P- or
I-picture. The target MB is represented as two Field blocks. The coder
computes two predictions for each Field block and averages them. The first prediction of each Field block is computed by doing motion compensation using the transmitted motion vector and the field with the same parity as the reference. The second prediction of each Field block is computed using a corrected motion vector and the field with the opposite parity as reference. The
corrected motion vector is computed assuming linear motion. Considering the temporal distance between the fields of same parity, the transmitted motion vector is scaled to reflect the temporal distance between the fields of opposite parity. Then we add a transmitted Differential Motion Vector (DMV), resulting in the corrected motion vector. For interlaced video, this Dual Prime prediction mode for P-pictures can be as efficient as using B-pictures - without adding the delay of a B-picture. A small sketch of this vector scaling is given below.
4. 16x8 MC for Field-pictures corresponds to field prediction for Frame-pictures. Within a MB, the pels belonging to different fields have their own motion vectors for motion compensation, i.e., two motion vectors are transmitted for P-pictures and four for B-pictures.
These many choices for prediction obviously make the design of an optimal encoder very challenging.

Figure 13.18. Field prediction for Frame-pictures: the MB to be predicted is split into top field pels and bottom field pels. Each 16x8 Field block is predicted separately with its own motion vector (P-frame) or two motion vectors (B-frame).
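The Dual Prime vector scaling described above can be illustrated with a small sketch. This is only a simplified illustration under the linear-motion assumption; the function name, the units, and the rounding are not the normative MPEG-2 procedure (which also includes an exact rounding rule and a small vertical correction between fields of opposite parity).

def dual_prime_corrected_mv(mv, dmv, dist_same_parity, dist_opposite_parity):
    """mv, dmv: (x, y) vectors in half-pel units; distances in field periods."""
    scale = dist_opposite_parity / dist_same_parity   # linear motion assumption
    # simplified rounding; the standard defines the exact rounding and a parity correction
    mv_scaled = (round(mv[0] * scale), round(mv[1] * scale))
    return (mv_scaled[0] + dmv[0], mv_scaled[1] + dmv[1])

# Example: the opposite-parity field lies halfway between the two same-parity fields.
print(dual_prime_corrected_mv(mv=(8, -4), dmv=(1, 0),
                              dist_same_parity=2, dist_opposite_parity=1))
# -> (5, -2)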
In interlaced video, neighboring rows in a MB come from different fields; thus the vertical correlation between lines is reduced when the underlying scene contains motion with a vertical component. MPEG-2 provides two new coding modes to increase the efficiency of prediction error coding.
1. Field DCT reorganizes the pels of a MB into two blocks for the top field and two blocks for the bottom field (Fig. 13.18); see the sketch after this list. This increases the correlation within a block in case of motion and thus increases the coding efficiency.
2. MPEG-2 provides an Alternate scan that the encoder may select on a picture-by-picture basis. This scan puts coefficients with high vertical frequencies earlier than the Zigzag scan. Fig. 13.19 compares the new scan to the conventional Zigzag scan.
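The Field DCT reorganization of item 1 amounts to separating the even and odd lines of a macroblock. The following is a minimal sketch of that splitting; the function name and array layout are illustrative, not taken from the standard.

import numpy as np

def field_reorganize(mb):
    """mb: 16x16 array of luminance pels of one macroblock in frame (interleaved) order."""
    top_field = mb[0::2, :]      # even lines -> top-field block pair (two 8x8 DCT blocks)
    bottom_field = mb[1::2, :]   # odd lines  -> bottom-field block pair
    return top_field, bottom_field

mb = np.arange(256).reshape(16, 16)
top, bottom = field_reorganize(mb)
print(top.shape, bottom.shape)   # (8, 16) (8, 16); each half is then split into two 8x8 blocks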
Scalability in MPEG-2
The MPEG-2 functionality described so far is achieved with the non-scalable syntax of MPEG-2, which is a superset of MPEG-1. The scalable syntax structures the bitstream in layers. The base layer can use the non-scalable syntax and thus be decoded by an MPEG-2 terminal that does not understand the scalable syntax. The basic MPEG-2 scalability tools are data partitioning, SNR scalability, spatial scalability and temporal scalability (see Sec. 11.1). Combinations of these basic scalability tools are also supported.
When using scalable codecs, drift may occur in a decoder that decodes the base layer only. Drift is created if the reference pictures used for motion compensation at the encoder and the base-layer decoder differ. This happens if the encoder uses information of the enhancement layer when computing the reference picture for the base layer. The drift is automatically reset to zero at every I-frame. Drift does not occur in scalable codecs if the encoder does not use any information of the enhancement layer for coding the base layer. Furthermore, a decoder decoding layers in addition to the base layer may not introduce data from the upper layers into the decoding of the lower layers.

Figure 13.19. The Zigzag scan as known from H.261, H.263, and MPEG-1 is augmented by the Alternate scan in MPEG-2 in order to code interlaced blocks that have more correlation in horizontal direction than in vertical direction.
Data Partitioning: Data partitioning splits the video bit stream into two or more layers. The encoder decides which syntactic elements are placed into the base layer and which into the enhancement layers. Typically, high-frequency DCT coefficients are transmitted in the low-priority enhancement layer, while all headers, side information, motion vectors, and the first few DCT coefficients are transmitted in the high-priority base layer. Data partitioning is appropriate when two transmission channels are available. Due to the data partitioning, the decoder can decode the base layer only if the decoder implements a bitstream loss concealer for the higher layers. This concealer can be as simple as setting to zero the missing higher-order DCT coefficients in the enhancement layer. Fig. 13.20 shows a high-level view of the encoder and decoder. The data partitioning functionality may be implemented independently of the encoder and decoder. Data partitioning does not incur any noticeable overhead. However, its performance in an error-prone environment may be poor compared to other methods of scalability [13]. Obviously, we will encounter the drift problem if we decode only the base layer.

Figure 13.20. A data partitioning codec is suited for ATM networks that support two degrees of quality of service.
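The idea of data partitioning can be sketched in a few lines: the coefficients of a block, taken in scan order, are split at an encoder-chosen break point, and a base-layer-only decoder conceals the missing data by treating it as zero. The function names and the break point are illustrative assumptions, not syntax elements of the standard.

def partition_coefficients(coeffs_in_scan_order, break_point):
    base = coeffs_in_scan_order[:break_point]          # high-priority layer
    enhancement = coeffs_in_scan_order[break_point:]   # low-priority layer
    return base, enhancement

def conceal_base_only(base, block_size=64):
    # simple loss concealment: missing higher-order coefficients are set to zero
    return base + [0] * (block_size - len(base))

coeffs = [120, -31, 17, 0, 5, -2, 1] + [0] * 57
base, enh = partition_coefficients(coeffs, break_point=3)
print(base)                          # [120, -31, 17]
print(conceal_base_only(base)[:8])   # decoded base layer with high frequencies zeroed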
SNR Scalability: SNR scalability is a frequency-domain method where all layers are coded with the same spatial resolution, but with differing picture quality achieved through different MB quantization stepsizes. The lower layer provides the basic video quality, while the enhancement layer carries the information which, when added to the lower layer, generates a higher quality reproduction of the input video. Fig. 13.21 shows an SNR scalable coder, which includes a non-scalable base encoder. The base encoder feeds the DCT coefficients after transform and quantization into the SNR enhancement coder. The enhancement coder re-quantizes the quantization error of the base encoder and feeds the coefficients that it sends to the SNR enhancement decoder back into the base encoder, which adds them to its dequantized coefficients and to the encoder feedback loop. Due to the feedback of the enhancement layer at the encoder, drift occurs for any decoder that decodes only the base layer.

Figure 13.21. A detailed view of the SNR scalability encoder. This encoder defaults to a standard encoder if the enhancement encoder is removed.
At a total bitrate between 4 Mbit/s and 9 Mbit/s, the combined picture quality
of base and enhancement layers is 0.5 to 1.1 dB less than that obtained with nonscalable coding. Obviously, SNR scalability outperforms data partitioning in terms
of picture quality for the base layer [60][13].
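A strongly simplified sketch of the re-quantization idea behind SNR scalability follows. Uniform quantizers and the stepsizes 16 and 4 are assumptions for illustration only; MPEG-2 uses its own quantization matrices and stepsizes.

def quantize(value, step):
    return int(round(value / step))

def dequantize(level, step):
    return level * step

def snr_scalable_encode(coeff, base_step=16, enh_step=4):
    base_level = quantize(coeff, base_step)
    base_rec = dequantize(base_level, base_step)
    enh_level = quantize(coeff - base_rec, enh_step)   # re-quantized base-layer error
    return base_level, enh_level

coeff = 75
base_level, enh_level = snr_scalable_encode(coeff)
base_only = dequantize(base_level, 16)                 # 80: base-layer quality
both = base_only + dequantize(enh_level, 4)            # 76: base plus enhancement
print(base_only, both)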
Spatial Scalability: In MPEG-2, spatial scalability is achieved by combining two
complete encoders at the transmitter and two complete decoders at the receiver.
The base layer is coded at low spatial resolution using a motion compensated DCT
encoder such as H.261, MPEG-1 or MPEG-2 (Fig. 13.22). The image in the frame
store of the feedback loop of this base encoder is made available to the spatial enhancement encoder. This enhancement coder is also a motion compensated DCT
encoder which codes the input sequence at the high resolution. It uses the upsampled input from the lower layer to enhance its temporal prediction. The prediction
image in the enhancement layer coder is the weighted sum of the temporal prediction image of the enhancement coder and the spatial prediction image from the
base encoder. Weights may be adapted on a MB level. There are no drift problems
with this coder since neither the encoder nor the decoder introduce information
of the enhancement layer into the base layer. At a total bitrate of 4 Mbit/s, the
combined picture quality of base and enhancement layers is 0.75 to 1.5 dB less than
that obtained with nonscalable coding [13].
Compared to simulcast, i.e. sending two independent bitstreams, one having the base layer resolution and one having the enhancement layer resolution, spatial scalability is more efficient by 0.5 to 1.25 dB [13][61]. Spatial scalability is the appropriate tool to be used in applications where interworking of video standards is necessary and the increased coding efficiency compared to simulcasting is able to offset the extra cost for the complexity of encoders and decoders.

Figure 13.22. An encoder with spatial scalability consists of two complete encoders that are connected using a spatial interpolation filter.
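The weighted prediction used in the enhancement layer can be sketched as follows. The pel-repetition upsampling and the single weight w are simplifications; MPEG-2 defines an interpolation filter and selects weights from a small table, possibly per MB.

import numpy as np

def upsample2x(image):
    # simple pel repetition stands in for the standard's interpolation filter
    return np.repeat(np.repeat(image, 2, axis=0), 2, axis=1)

def enhancement_prediction(temporal_pred, base_reconstruction, w=0.5):
    spatial_pred = upsample2x(base_reconstruction)
    return w * temporal_pred + (1.0 - w) * spatial_pred

base = np.full((72, 88), 100.0)          # low-resolution base-layer reconstruction
temporal = np.full((144, 176), 110.0)    # enhancement-layer motion compensated prediction
pred = enhancement_prediction(temporal, base, w=0.5)
print(pred.shape, pred[0, 0])            # (144, 176) 105.0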
Temporal Scalability: In temporal scalability, the base layer is coded at a lower frame rate using a nonscalable codec, and the intermediate frames can be coded in a second bitstream using the first bitstream reconstruction as prediction [62]. MPEG-2 defines that only two frames may be used for the prediction of an enhancement layer picture. Fig. 13.23 and Fig. 13.24 show two typical configurations. If we mentally collapse the images of enhancement layer and base layer in Fig. 13.23, we notice that the resulting sequence of images and the prediction arrangement is similar to a nonscalable coder, and identical to a nonscalable coder if the base layer uses only I and P frames. Accordingly, the picture quality of temporal scalability is only 0.2 to 0.3 dB lower than that of a nonscalable coder [13]. In Fig. 13.25 we see that enhancement and base layer encoders are two complete codecs that both operate at half the rate of the video sequence. Therefore, the computational complexity of temporal scalability is similar to a nonscalable coder operating at the full frame rate of the input sequence. There are no drift problems.
Temporal scalability is an efficient means of distributing video to terminals with different computational capabilities, like a mobile terminal and a desktop PC. Another application is stereoscopic video transmission, where the right and left channels are transmitted as the enhancement and base layer, respectively. This was discussed previously in Sec. 12.5.1.

Figure 13.23. Temporal scalability may use only the base layer to predict images in the enhancement layer. Obviously, errors in the enhancement layer do not propagate over time.

Figure 13.24. Temporal scalability may use the base layer and the enhancement layer for prediction. This arrangement is especially useful for coding of stereoscopic video.

Figure 13.25. A temporal scalability encoder consists of two complete encoders, with the enhancement encoder using the base layer video as an additional reference for prediction. The Temporal Demux sends pictures alternately to the base encoder and the enhancement encoder.
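The Temporal Demux of Fig. 13.25 can be sketched in two lines: frames are sent alternately to the base and the enhancement encoder, so each operates at half the source frame rate. The even/odd assignment is an illustrative assumption.

def temporal_demux(frames):
    base_layer = frames[0::2]          # e.g., even frames -> base encoder
    enhancement_layer = frames[1::2]   # odd frames -> enhancement encoder
    return base_layer, enhancement_layer

frames = [f"frame{i}" for i in range(8)]
base, enh = temporal_demux(frames)
print(base)  # ['frame0', 'frame2', 'frame4', 'frame6']
print(enh)   # ['frame1', 'frame3', 'frame5', 'frame7']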
13.5.4 Profiles
The full MPEG-2 syntax covers a wide range of features and parameters. Extending the MPEG-1 concept of a constrained parameter set (Tab. 13.4), MPEG-2 defines Profiles that describe the tools required for decoding a bitstream and Levels that describe the parameter ranges for these tools. MPEG-2 initially defined five profiles for video, each adding new tools in a hierarchical fashion. Later, two more profiles were added that do not fit the hierarchical scheme:
1. The Simple profile supports I- and P-frames, the 4:2:0 format, and no scalability. It is currently not used in the market.

2. The Main profile adds support for B-frames. The Main profile at Main level (MP@ML) is used for TV broadcasting. This profile is the most widely used.

3. The SNR profile supports SNR scalability in addition to the functionality of the Main profile. It is currently not used in the market.

4. The Spatial profile supports the functionality of the SNR profile and adds spatial scalability. It is currently not used in the market.

5. Finally, the High profile supports the functionality of the Spatial profile and adds support for the 4:2:2 format. This profile is far too complex to be useful.
6. The 4:2:2 profile supports studio post-production and high quality video for storage and distribution. It basically extends the Main profile to higher bitrates and quality. The preferred frame order in a group of frames is IBIBIBIBI... Equipment with this profile is used in digital studios.
7. The Multiview profile enables the transmission of several video streams in parallel, thus enabling stereo presentations. This functionality is implemented using temporal scalability, thus enabling Main profile decoders to receive one of the video streams. Prototypes exist.
For each profile, MPEG defined levels. Levels essentially define the size of the video frames, the frame rate and the picture types, thus providing an upper limit for the processing power of a decoder. Table 13.5 shows the levels defined for most profiles. The fact that only two fields in Table 13.5 are used in the market (MP@ML and 4:2:2@ML) is a strong indication that standardization is a consensus process: MPEG had to accommodate many individual desires to get patented technology required in an MPEG profile without burdening the main applications, i.e. TV production and broadcasting.

Table 13.5. Profiles and levels in MPEG-2 define allowable picture types (I,P,B), pels/line and lines/picture, picture format, and maximum bitrate (for all layers in case of scalable bitstreams).

Profile:          Simple   Main    SNR     Spatial  High          Multiview  4:2:2
Picture types:    I,P      I,P,B   I,P,B   I,P,B    I,P,B         I,P,B      I,P,B
Chroma format:    4:2:0    4:2:0   4:2:0   4:2:0    4:2:0, 4:2:2  4:2:0      4:2:0, 4:2:2

Maximum bitrate in Mbit/s per level:
Low (352 pels/line, 288 lines/frame, 30 frames/s):
                  -        4       4       -        -             8          -
Main (720 pels/line, 576 lines/frame [512/608 for the 4:2:2 profile], 30 frames/s):
                  15       15      15      -        20            25         50
High-1440 (1440 pels/line, 1152 lines/frame, 60 frames/s):
                  -        60      -       60       80            100        -
High (1920 pels/line, 1152 lines/frame, 60 frames/s):
                  -        80      -       -        100           130        300
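The profile/level signaling lets a decoder decide up front whether a bitstream is within its capabilities. The following is a hedged sketch of such a capability check; the dictionary keys, field names and the two listed combinations are illustrative, with values taken from Table 13.5.

LIMITS = {
    ("Main", "Main"):  {"pels_per_line": 720, "lines_per_frame": 576,
                        "frames_per_s": 30, "mbit_per_s": 15},
    ("4:2:2", "Main"): {"pels_per_line": 720, "lines_per_frame": 608,
                        "frames_per_s": 30, "mbit_per_s": 50},
}

def can_decode(profile, level, width, height, frame_rate, bitrate_mbit):
    limits = LIMITS.get((profile, level))
    if limits is None:
        return False   # unknown combination: a real decoder checks its own capabilities
    return (width <= limits["pels_per_line"] and height <= limits["lines_per_frame"]
            and frame_rate <= limits["frames_per_s"] and bitrate_mbit <= limits["mbit_per_s"])

print(can_decode("Main", "Main", 720, 576, 25, 9.0))    # True: typical MP@ML broadcast
print(can_decode("Main", "Main", 1920, 1152, 60, 80))   # False: would require High level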
13.6 Coding of Audio Visual Objects with MPEG-4
The MPEG-4 standard is designed to address the requirements of a new generation of highly interactive multimedia applications while supporting traditional applications as well. Such applications, in addition to efficient coding, also require advanced
functionalities such as interactivity with individual objects, scalability of contents,
and a high degree of error resilience. MPEG-4 provides tools for object-based coding
of natural and synthetic audio and video as well as graphics. The MPEG-4 standard, similar to its predecessors, consists of a number of parts, the primary parts
being systems, visual and audio. The visual part and the audio part of MPEG-4
include coding of both natural and synthetic video and audio, respectively.
13.6.1 Systems
MPEG-4 Systems enables the multiplexing of audio-visual objects and their composition into a scene. Fig. 13.26 shows a scene that is composed in the receiver and then presented on the display and speakers. A mouse and keyboard may be provided to enable user input. If we neglect the user input, presentation is as on a regular MPEG-1 or MPEG-2 terminal. However, the audio-visual objects are composited into a scene at the receiving terminal, whereas all other standards discussed in this chapter require scene composition to be done prior to encoding. The scene in Fig. 13.26 is composited in a local 3D coordinate system. It consists of a 2D background, a video playing on the screen in the scene, a presenter, coded as a 2D sprite object, with audio, and 3D objects like the desk and the globe. MPEG-4
enables user interactivity by providing the tools to interact with this scene. Obviously, this object-based content description gives tremendous flexibility in creating
interactive content and in creating presentations that are customized to a viewer, be it language, text, advertisements, logos, etc.

Figure 13.26. Audio-visual objects are composed into a scene within the receiver of an MPEG-4 presentation [Courtesy of MPEG-4].
Fig. 13.27 shows the different functional components of an MPEG-4 terminal [1]:

Media or Compression Layer: This is the component of the system performing the decoding of media like audio, video, graphics and other suitable media. Media are extracted from the sync layer through the elementary stream interface. Specific MPEG-4 media include a Binary Format for Scenes (BIFS)
for specifying scene compositions and graphics contents. Another specific MPEG-4 medium is the Object Descriptor (OD). An OD contains pointers to elementary streams, similar to URLs. Elementary streams are used to convey individual MPEG-4 media. ODs also contain additional information such as Quality of Service parameters. This layer is media aware, but delivery unaware, i.e. it does not consider transmission [66].
Sync or elementary stream layer: This component of the system is in charge of the synchronization and buffering of individual compressed media. It receives Sync Layer (SL) packets from the Delivery layer, unpacks the elementary streams according to their timestamps and forwards them to the Compression layer. A complete MPEG-4 presentation transports each medium in a different elementary stream. Some media may be transported in several elementary streams, for instance if scalability is involved. This layer is media unaware and delivery unaware, and talks to the Transport layer through the Delivery Multimedia Integration Framework (DMIF) application interface (DAI). The DAI, in addition to the usual session set-up and stream control functions, also enables setting the quality of service requirements for each stream. The DAI is network independent [14].
Transport layer: The transport layer is media unaware and delivery aware. MPEG-4 does not define any specific transport layer. Rather, MPEG-4 media can be transported on existing transport layers such as RTP, MPEG-2 Transport Stream, H.223 or ATM, using the DAI as specified in [31][2].

Figure 13.27. An MPEG-4 terminal consists of a delivery layer, a synchronization layer, and a compression layer. MPEG-4 does not standardize the actual composition and rendering [Courtesy of MPEG-4].
MPEG-4's Binary Format For Scenes (BIFS)
The BIFS scene model is a superset of the Virtual Reality Modeling Language (VRML) [21][11]. VRML is a modeling language that allows one to describe synthetic 3D objects in a synthetic scene and render them using a virtual camera. MPEG-4 extends VRML in three areas:

2-D scene description is defined for the placement of 2D audiovisual objects onto a screen. This is important if the coded media are only video streams that do not require the overhead of 3D rendering. 2D and 3D scenes may be mixed. Fig. 13.28 shows a scenegraph that places several 2D objects on the screen. The object position is defined using Transform nodes. Some of the objects are 3D objects that require 3D rendering. After rendering, these objects are used as 2D objects and placed into the 2D scene.
BIFS enables the description and animation of scenes and graphics objects using its new compression tools based on arithmetic coders.

MPEG-4 recognizes the special importance of human faces and bodies. It introduced special tools for very efficient description and animation of virtual humans.
Figure 13.28. A scenegraph with 2D and 3D components. The 2D scenegraph requires only simple placement of 2D objects on the image using Transform2D nodes. 3D objects are rendered and then placed on the screen as defined in the 3DLayer nodes. Interaction between objects can be defined using pointers from one node to another (from [65]).
13.6.2 Audio
The tools defined by MPEG-4 Audio [30][3] can be combined into different audio coding algorithms. Since no single coding paradigm was found to span the complete range from very low bitrate coding of speech signals up to high quality multi-channel audio coding, a set of different algorithms has been defined to establish optimum coding efficiency for the broad range of anticipated applications (Fig. 13.29). The scalable audio coder can be separated into several components.
At its lowest rate, a Text-to-Speech (TTS) synthesizer is supported using the MPEG-4 Text-to-Speech Interface (TTSI).

Low rate speech coding (3.1 kHz bandwidth) is based on a Harmonic Vector eXcitation Coding (HVXC) coder at 2 kbit/s up to 4 kbit/s.

Telephone speech (8 kHz bandwidth) and wideband speech (16 kHz bandwidth) are coded using a Code Excited Linear Predictive (CELP) coder at rates between 3850 bit/s and 23800 bit/s. This CELP coder can create scalable bitstreams with 5 layers.
General audio is coded at 16 kbit/s and up to more than 64 kbit/s per channel using a more efficient development of the MPEG-2 AAC coder. Transparent audio quality can be achieved.

Figure 13.29. MPEG-4 Audio supports coding of speech and audio starting at rates below 2 kbit/s up to more than 64 kbit/s per channel for multichannel audio coding.
In addition to audio coding, MPEG-4 Audio defines music synthesis at the receiver using a Structured Audio toolset that provides a single standard to unify the world of algorithmic music synthesis and to implement scalability and the notion of audio objects [9].
13.6.3 Basic Video Coding
Many of the MPEG-4 functionalities require access not only to an entire sequence
of pictures, but to an entire object, and further, not only to individual pictures, but
also to temporal instances of these objects within a picture. A temporal instance of
a video object can be thought of as a snapshot of an arbitrarily shaped object that
occurs within a picture. Like a picture, an object is intended to be an access unit,
and, unlike a picture, it is expected to have a semantic meaning. MPEG-4 enables
content-based interactivity with video objects by coding objects independently using
motion, texture and shape. At the decoder, different objects are composed into a
scene and displayed. In order to enable this functionality, higher syntactic structures
had to be developed. A scene consists of several VideoObjects (VO). The VO has
3 dimensions (2D+time). A VO can be composed of several VideoObjectLayers
(VOL). Each VOL (2D+time) represents various instantiations of a VO. A VOL
can represent different layers of a scalable bitstream or different parts of a VO. A time instant of a VOL is called a VideoObjectPlane (VOP). A VOP is a rectangular
video frame or a part thereof. It can be fully described by its texture variations
(a set of luminance and chrominance values) and its shape. The video encoder
applies the motion, texture and shape coding tools to the VOP using I, P, and B
modes similar to the modes of MPEG-2. For editing and random access purposes,
consecutive VOPs can be grouped into a GroupOfVideoObjectPlanes (GVOP). A
video session, the highest syntactic structure, may consist of several VOs.
The example in Fig. 13.30 shows one VO composed of two VOLs. VOL1 consists
of the tree and the background. VOL2 represents the person. In the example,
VOL1 is represented by two separate VOPs, VOP1 and VOP3. Hence, VOL1
may provide content-based scalability in the sense that a decoder may choose not
to decode one VOP of VOL1 due to resource limitations. VOL2 contains just one
VOP, namely VOP2. VOP2 may be represented using a temporal, spatial or quality
scalable bitstream. In this case, a decoder might again decide to decode only the
lower layers of VOL 2. The example in Fig. 13.30 shows the complex structures
of content-based access and scalability that MPEG-4 supports. However, the given
example could also be represented in a straightforward fashion with three VOs.
The background, the tree and the person are coded as separate VOs with one layer
each, and each layer is represented by one VOP. The VOPs are encoded separately
and composed in a scene at the decoder.
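The session/object/layer/plane hierarchy described above can be made concrete with a few data structures. This is only an illustrative sketch: the class and field names are hypothetical and do not correspond to the syntax element names of the standard.

from dataclasses import dataclass, field
from typing import List

@dataclass
class VideoObjectPlane:            # VOP: one time instant of a layer
    time: float
    coding_type: str               # "I", "P" or "B"

@dataclass
class VideoObjectLayer:            # VOL: one (possibly scalable) layer of a VO
    vops: List[VideoObjectPlane] = field(default_factory=list)

@dataclass
class VideoObject:                 # VO: e.g., the person or the background
    layers: List[VideoObjectLayer] = field(default_factory=list)

@dataclass
class VideoSession:                # highest syntactic structure
    objects: List[VideoObject] = field(default_factory=list)

person = VideoObject(layers=[VideoObjectLayer(vops=[VideoObjectPlane(0.0, "I"),
                                                    VideoObjectPlane(0.04, "P")])])
background = VideoObject(layers=[VideoObjectLayer(vops=[VideoObjectPlane(0.0, "I")])])
session = VideoSession(objects=[background, person])
print(len(session.objects), len(person.layers[0].vops))   # 2 2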
To see how MPEG-4 video coding works, consider a sequence of VOPs. Extending the concepts of intra (I-), predictive (P-) and bidirectionally predictive (B-) pictures of MPEG-1/2 to VOPs, we obtain I-VOPs, P-VOPs and B-VOPs. If two consecutive B-VOPs are used between a pair of reference VOPs (I- or P-VOPs), the resulting coding structure is as shown in Fig. 13.31.

Figure 13.30. Object-based coding requires the decoder to compose different VideoObjectPlanes (VOP) into a scene. VideoObjectLayers (VOL) enable content-based scalability.

Figure 13.31. An example prediction structure using I, P and B-VOPs (from [59]).
Coding Efficiency Tools
In addition to the obvious changes due to the object-based nature of MPEG-4, the following tools were introduced in order to increase coding efficiency compared to MPEG-1 and MPEG-2:
DC Prediction: This is improved compared to MPEG-1/2. Either the previous block or the block above the current block can be chosen as the predictor for the current DC value.
AC Prediction: AC prediction of DCT coefficients is new in MPEG-4. The block that was chosen to predict the DC coefficient is also used for predicting one line of AC coefficients. If the predictor is the previous block, the AC coefficients of its first column are used to predict the co-located AC coefficients of the current block. If the predictor is the block from the previous row, it is used to predict the first row of AC coefficients. AC prediction does not work well for blocks with coarse texture, diagonal edges, or blocks with both horizontal and vertical edges. Switching AC prediction on and off on a block level is desirable but too costly. Consequently, the decision is made on the MB level.
Alternate Horizontal Scan: This scan is added to the two scans of MPEG-2 (Fig. 13.19). The Alternate Scan of MPEG-2 is referred to as Alternate Vertical Scan in MPEG-4. The Alternate Horizontal Scan is created by mirroring the Vertical Scan. The scan is selected at the same time as the AC prediction decision. In case of AC prediction from the previous block, the Alternate Vertical Scan is selected. In case of AC prediction from the block above, the Alternate Horizontal Scan is used. No AC prediction is coupled with the Zigzag scan.
3D VLC: Coding of DCT coefficients is achieved similarly to H.263.
4 Motion Vectors: Four motion vectors per MB are allowed. This is done similarly to H.263.
Unrestricted Motion Vectors: This mode is enabled. Compared to H.263, a much
wider motion vector range of 2048 pels may be used.
Sprite: A Sprite is basically a large background image that gets transmitted to the decoder. For display, the encoder transmits affine mapping parameters that map a part of the image onto the screen. By changing the mapping, the decoder can zoom in and out of the Sprite, and pan to the left or right [8].
Global Motion Compensation: In order to compensate for global motion due to
camera motion, camera zoom or large moving objects, global motion is compensated according to the eight parameter motion model of Eq. 5.5.14 (see
Sec. 5.5.3):
x' = (ax + by + c) / (gx + hy + 1),
y' = (dx + ey + f) / (gx + hy + 1).                            (13.6.1)
Global motion compensation is an important tool to improve picture quality for scenes with large global motion. These scenes are difficult to code using block-based motion. In contrast to scenes with arbitrary motion, the human eye is able to track detail in the case of global motion. Thus, global motion compensation helps to improve the picture quality in the most critical scenes. A small sketch of this warping model is given after this list of tools.
Quarter-pel Motion Compensation: The main target of quarter-pel motion compensation is to enhance the resolution of the motion compensation scheme with only small syntactical and computational overhead, leading to a more accurate motion description and less prediction error to be coded. Quarter-pel motion compensation is only applied to the luminance pels; chrominance pels are compensated with half-pel accuracy.
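The eight-parameter model of Eq. (13.6.1) can be evaluated with a few lines of code. This is a minimal sketch that only maps a single pel position; a real global-motion-compensated predictor additionally needs sub-pel interpolation and the standard's parameter signaling, which are not shown here.

def warp_position(x, y, a, b, c, d, e, f, g, h):
    # Eq. (13.6.1): perspective mapping of a pel position (x, y) to (x', y')
    denom = g * x + h * y + 1.0
    return (a * x + b * y + c) / denom, (d * x + e * y + f) / denom

# Example: pure translation by (2, 3) pels, i.e. a = e = 1, b = d = g = h = 0, c = 2, f = 3.
print(warp_position(10, 20, a=1, b=0, c=2, d=0, e=1, f=3, g=0, h=0))   # (12.0, 23.0)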
As pointed out, some tools are similar to those developed in H.263. As in H.263, the MPEG-4 standard describes overlapped motion compensation. However, this tool is not included in any MPEG-4 profile due to its computational complexity for large picture sizes and due to its limited improvements for high quality video, i.e. there is no MPEG-4 compliant decoder that needs to implement overlapped block motion compensation.
Error Resilience Tools: Besides the tools developed to enhance coding efficiency, a set of tools is also defined in MPEG-4 for enhancing the resilience of the compressed bit streams to transmission errors. These are described in Sec. 14.7.2.
13.6.4 Object-based Video Coding
In order to enable object-based functionalities for coded video, MPEG-4 allows the transmission of shapes for video objects. While MPEG-4 does not standardize the method of defining or segmenting the video objects, it defines the decoding
algorithm, and implicitly an encoding algorithm, for describing the shape. Shape is described using alpha-maps that have the same resolution as the luminance signal. An alpha-map is co-located with the luminance picture. MPEG-4 defines the alpha-map in two parts. The binary alpha-map defines the pels that belong to the object. In the case of grey-scale alpha-maps, we have an additional alpha-map that defines the transparency using 8 bits/pel. Alpha-maps extend the macroblock structure: the 16x16 binary alpha-map of a MB is called a Binary Alpha Block (BAB). In the following, we describe the individual tools that MPEG-4 uses for object-based video coding.
Binary Shape: A context-based arithmetic coder as described in Sec. 10.1.1 is used to code boundary blocks of an object. A boundary block contains pels of the object and of the background. It is co-located with a MB. For non-boundary blocks, the encoder just signals whether the MB is part of the object or not. A sequence of alpha-maps may be coded and transmitted without texture. Alternatively, MPEG-4 uses tools like padding and the DCT or SA-DCT to code the texture that goes with the object. BABs are coded in intra mode and inter mode. Motion compensation may be used in inter mode. Shape motion vector coding uses the motion vectors associated with the texture coding as a predictor.
Padding: In order to code the texture of BABs using a block-based DCT, the texture of the background may be set to any color. In intra mode, this background color has no effect on the decoded pictures and can be chosen by the encoder. However, for motion compensation, the motion vector of the current block may refer to a boundary block in the previous reference picture. Part of the background pels of the reference picture might be located in the area of the current object - hence the value of these background pels influences the prediction loop. MPEG-4 uses padding as described in Sec. 10.2.1 to define the background pels used in prediction.
Shape Adaptive DCT: The encoder may choose to use the SA-DCT for coding the texture of BABs (Sec. 10.2.2). However, padding of the motion compensated prediction image is still required.
Greyscale Shape Coding: MPEG-4 allows the transmission of arbitrary alpha-maps. Since the alpha-maps are defined with 8 bits, they are coded the same way as the luminance signal.
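The role of padding in a boundary block can be illustrated with a deliberately simplified sketch: background pels (alpha == 0) are filled before block-based coding or prediction. Here they are filled with the mean of the object pels; this is an assumption made only for brevity, since the normative MPEG-4 procedure of Sec. 10.2.1 uses horizontal and vertical repetitive padding instead.

import numpy as np

def pad_boundary_block(texture, alpha):
    """texture: 2D array of pel values; alpha: binary map, 1 = object pel."""
    padded = texture.astype(float).copy()
    object_pels = texture[alpha == 1]
    if object_pels.size:                        # leave fully transparent blocks untouched
        padded[alpha == 0] = object_pels.mean()
    return padded

texture = np.array([[10, 12], [200, 220]])
alpha = np.array([[1, 1], [0, 0]])              # bottom row belongs to the background
print(pad_boundary_block(texture, alpha))       # background filled with 11.0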
Fig. 13.32a shows the block diagram of the object-based MPEG-4 video coder. MPEG-4 uses two types of motion vectors: in Fig. 13.32, we name the conventional motion vectors used to compensate the motion of texture Texture Motion. Motion vectors describing the shift of the object shape are called Shape Motion. A shape motion vector may be associated with a BAB. Image analysis estimates the texture and shape motion of the current VOP Sk with respect to the reference VOP S'k-1.
Parameter coding encodes the parameters predictively. The parameters are transmitted and decoded, and the new reference VOP is stored in the VOP memory.

Figure 13.32. Block diagram of the video encoder (a) and the parameter coder (b) for coding of arbitrarily shaped video objects.

The increased complexity due to the coding of arbitrarily shaped video objects becomes evident in Fig. 13.32b. First, shape motion vectors and shape pels are encoded. The
shape motion coder knows which motion vectors to code by analyzing the potentially lossily encoded shape parameters. For texture prediction, the reference VOP
is padded as described above. The prediction error is padded using the original
shape parameters to determine the area to be padded. Then, each MB is encoded
using DCT.
13.6.5 Still Texture Coding
One of the functionalities supported by MPEG-4 is the mapping of static textures onto 2-D or 3-D surfaces. MPEG-4 Visual supports this functionality by providing a separate mode for encoding static texture information. It is envisioned that applications involving interactivity with texture-mapped synthetic scenes require continuous scalability.
For coding static texture maps, Discrete Wavelet Transform (DWT) coding was selected for the flexibility it offers in spatial and quality scalability while maintaining good coding performance (Sec. 11.3.1). In DWT coding, a texture map image is first decomposed using a 2D separable decomposition with Daubechies (9,3)-tap biorthogonal filters. Next, the coefficients of the lowest band are quantized, coded predictively using implicit prediction (similar to that used in intra DCT coding) and arithmetic coding. This is followed by coding of the coefficients of the higher bands using multilevel quantization, zero-tree scanning and arithmetic coding. The resulting bitstream is flexibly arranged, allowing a large number of layers of spatial and quality scalability to be easily derived.
This algorithm was extended to code arbitrarily shaped texture maps. In order to adapt a scan line of the shape to coding with the DWT, MPEG-4 uses leading and trailing boundary extensions that mirror the image signal (Sec. 11.3.1).
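One level of a 2D separable wavelet decomposition can be sketched as follows. For brevity a Haar filter pair is used here as an assumption; the MPEG-4 still texture tool uses Daubechies (9,3)-tap biorthogonal filters and adds the boundary extension mentioned above for arbitrarily shaped regions.

import numpy as np

def haar_1d(signal):
    # analysis along the first axis: averages (low band) and differences (high band)
    low = (signal[0::2] + signal[1::2]) / 2.0
    high = (signal[0::2] - signal[1::2]) / 2.0
    return low, high

def haar_2d_one_level(image):
    # filter horizontally, then vertically, yielding the four subbands
    low_rows, high_rows = haar_1d(image.T)
    low_rows, high_rows = low_rows.T, high_rows.T
    ll, lh = haar_1d(low_rows)
    hl, hh = haar_1d(high_rows)
    return ll, lh, hl, hh

image = np.random.rand(64, 64)
ll, lh, hl, hh = haar_2d_one_level(image)
print(ll.shape, hh.shape)   # (32, 32) (32, 32); further levels decompose the LL band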
13.6.6 Mesh Animation
Mesh-based representation of an object is useful for a number of functionalities such as animation, content manipulation, content overlay, merging natural and synthetic video, and others [67].
Fig. 13.33 shows a mesh coder and its integration with a texture coder. The mesh encoder generates a 2D mesh-based representation of a natural or synthetic video object at its first appearance in the scene. The object is tesselated with triangular patches, resulting in an initial 2D mesh (Fig. 13.34). The node points of this initial mesh are then animated in 2D as the VOP moves in the scene. Alternatively, the motion of the node points can be animated from another source. The 2D motion of a video object can thus be compactly represented by the motion vectors of the node points of the mesh. Motion compensation can be achieved by warping the texture map corresponding to each patch by an affine transform from one VOP to the next. Textures used for mapping onto object mesh models or facial wireframe models are either derived from video or from still images. Whereas mesh analysis is not part of the standard, MPEG-4 defines how to encode 2D meshes and the motion of their node points. Furthermore, the mapping of a texture onto the mesh may be described using MPEG-4.

Figure 13.33. Simplified architecture of an encoder/decoder supporting the 2D-mesh object. The video encoder provides the texture map for the mesh object (from [67]).

Figure 13.34. A content-based mesh designed for the "Bream" video object (from [67]).
13.6.7 Face and Body Animation
An MPEG-4 terminal supporting face and body animation is expected to include a default face and body model. The systems part of MPEG-4 provides means to customize this face or body model by means of face and body definition parameters (FDP, BDP) or to replace it with one downloaded from the encoder. The definition of a scene including 3D geometry and of a face/body model can be sent to the receiver using BIFS [23]. Fig. 13.35 shows a scenegraph that a decoder built
according to the BIFS stream. The Body node defines the location of the body. Its child BDP describes the look of the body using a skeleton with joints, surfaces and surface properties. The bodyDefTable node describes how the model is deformed as a function of the body animation parameters. The Face node is a descendant of the Body node. It contains the face geometry as well as the geometry defining the face deformation as a function of the face animation parameters (FAP). The visual part of MPEG-4 defines how to animate these models using FAPs and body animation parameters (BAP) [24].
Fig. 13.36 shows two phases of a left eye blink (plus the neutral phase) which
have been generated using a simple animation architecture [67]. The dotted half
circle in Fig. 13.36 shows the ideal motion of a vertex in the eyelid as it moves down
according to the amplitude of FAP 19. In this example, the faceDefTable for FAP
19 approximates the target trajectory with two linear segments on which the vertex
actually moves as FAP 19 increases.
Face Animation
Three groups of facial animation parameters (FAP) are defined [67]. First, for low-level facial animation, a set of 66 FAPs is defined. These include head and eye rotations as well as motion of feature points on mouth, ear, nose and eyebrow (Fig. 10.20). Since these parameters are model independent, their amplitudes are scaled according to the proportions of the actual animated model. Second, for high-level animation, a set of primary facial expressions like joy, sadness, surprise and disgust is defined. Third, for speech animation, 14 visemes define mouth shapes that correspond to phonemes. Visemes are transmitted to the decoder or are derived from the phonemes of the Text-to-Speech synthesizer of the terminal.
The FAPs are linearly quantized and entropy coded using arithmetic coding. Alternatively, a time sequence of 16 FAPs can also be DCT coded. Due to efficient coding, it takes only about 2 kbit/s to achieve lively facial expressions.
Body Animation
BAPs manipulate independent degrees of freedom in the skeleton model of the body to produce animation of the body parts [4]. Similar to the face, the remote manipulation of a body model in a terminal with BAPs can accomplish lifelike visual scenes of the body in real time without sending pictorial and video details of the body in every frame. The BAPs will produce reasonably similar high-level results in terms of body posture and animation on different body models, also without the need to transmit a model to the decoder. There are a total of 186 predefined BAPs in the BAP set, with an additional set of 110 user-defined extension BAPs. Each predefined BAP corresponds to a degree of freedom in a joint connecting two body parts. These joints include toe, ankle, knee, hip, spine, shoulder, clavicle, elbow, wrist, and the hand fingers. Extension BAPs are provided to animate additional features beyond the standard ones in connection with body deformation tables [1], e.g. for cloth animation or body parts that are not part of the human skeleton.
The BAPs are categorized into groups with respect to their effect on the body posture. Using this grouping scheme has a number of advantages. First, it allows us to adjust the complexity of the animation by choosing a subset of BAPs. For example, the total number of BAPs in the spine is 72, but significantly simpler models can be used by choosing only a predefined subset. Secondly, assuming that not all motions contain all the BAPs, only the active BAPs can be transmitted to decrease the required bit rate significantly. BAPs are coded similarly to FAPs using arithmetic coding.
Figure 13.35. The scenegraph describing a human body is transmitted in a BIFS stream. The nodes Body and Face are animated using the FAPs and BAPs of the FBA stream. The BDP and FDP nodes and their children describe the virtual human (from [4]).
Figure 13.36. Neutral state of the left eye (left) and two deformed animation phases for the eye blink (FAP 19). The FAP definition defines the motion of the eyelid in negative y-direction; the faceDefTable defines the motion of one of the vertices of the eyelid in x and z direction.
Integration of Speech Synthesis
MPEG-4 acknowledges the importance of TTS for multimedia applications by providing a text-to-speech synthesizer interface (TTSI) to a proprietary TTS. A TTS stream contains text in ASCII and optional prosody in binary form. The decoder decodes the text and prosody information according to the interface defined for the TTS synthesizer. The synthesizer creates speech samples that are handed to the compositor. The compositor presents audio and, if required, video to the user.
Fig. 13.37 shows the architecture for speech-driven face animation that allows synchronized presentation of synthetic speech and talking heads. A second output interface of the TTS sends the phonemes of the synthesized speech as well as start time and duration information for each phoneme to a Phoneme/Bookmark-to-FAP Converter. The converter translates the phonemes and timing information into FAPs that the face renderer uses in order to animate the face model. In addition to the phonemes, the synthesizer identifies bookmarks in the text that convey non-speech related FAPs like joy to the face renderer. The timing information of the bookmarks is derived from their position in the synthesized speech. Since the facial animation is driven completely from the text input to the TTS, transmitting an FAP stream to the decoder is optional. Furthermore, synchronization is achieved since the talking head is driven by the asynchronous, proprietary TTS synthesizer.
Figure 13.37. MPEG-4 architecture for face animation allowing synchronization of facial expressions and speech generated by a proprietary text-to-speech synthesizer.
13.6.8 Profiles
MPEG-4 developed an elaborate structure of profiles. As indicated in Fig. 13.38, an MPEG-4 terminal has to implement several profiles. An object descriptor profile is required to enable the transport of MPEG-4 streams and to identify these streams in the terminal. A scene description profile provides the tools to compose the audio, video or graphics objects into a scene. A 2D scene description profile enables just the placement of 2D video objects; higher profiles provide more functionality. A media profile needs to be implemented in order to present actual content on the terminal. MPEG-4 supports audio, video and graphics as media. Several video profiles are defined. Here, we list only a subset of them and mention their main functionalities.

Figure 13.38. An MPEG-4 terminal has to implement at least one profile of the object descriptor, scene description, and media profiles. Not all profiles within a group are listed (from [53]).
Simple Profile: The Simple profile was created with low-complexity applications in mind. The first usage is mobile use of (audio)visual services, and the second is putting very low complexity video on the Internet. It supports up to four objects in the scene with, at the lowest level, a maximum total size of a QCIF picture. There are 3 levels for the Simple profile with bitrates from 64 to 384 kbit/s. It provides the following tools: I- and P-VOPs, AC/DC prediction, 4 motion vectors, unrestricted motion vectors, slice resynchronization, data partitioning and reversible VLC. This profile is able to decode H.263 video streams that do not use any of the optional annexes of H.263.
Simple Scalable Profile: This profile adds support for B-frames, temporal scalability and spatial scalability to the Simple profile. This profile is useful for applications which provide services at more than one level of quality due to bit-rate or decoder resource limitations, such as Internet use and software decoding.
Advanced Real-Time Simple (ARTS) Profile: This profile extends the capabilities of the Simple profile and provides more sophisticated error protection of rectangular video objects using a back-channel, which signals transmission errors from the decoder to the encoder such that the encoder can transmit video information in intra mode for the affected parts of the newly coded images. It is suitable for real-time coding applications such as videophone, tele-conferencing and remote observation.
Core Profile: In addition to the tools of the Simple profile, it enables scalable still textures, B-frames, binary shape coding and temporal scalability for rectangular as well as arbitrarily shaped objects. It is useful for higher quality interactive services, combining good quality with limited complexity and supporting arbitrarily shaped objects. Also, mobile broadcast services could be supported by this profile. The maximum bitrate is 384 kbit/s at Level 1 and 2 Mbit/s at Level 2.
Core Scalable Visual Profile: This adds object-based SNR as well as spatial/temporal scalability to the Core profile.
Main Profile: The Main profile adds support for interlaced video, grey-scale alpha maps, and sprites. The Main profile was created with broadcast services in mind, addressing progressive as well as interlaced material. It combines the highest quality with the versatility of arbitrarily shaped objects using grey-scale shape coding. The highest level accepts up to 32 objects for a maximum total bitrate of 38 Mbit/s.
Advanced Coding Efficiency (ACE) Profile: This profile targets the transmission of entertainment video at bitrates of less than 1 Mbit/s. However, in terms of specification, it adds to the Main profile by extending the range of bitrates and adding the tools quarter-pel motion compensation, global motion compensation and shape-adaptive DCT.
More profiles are defined for face, body and mesh animation. The definition of a Studio profile is in progress, supporting bit rates of up to 600 Mbit/s for HDTV and arbitrarily shaped video objects in 4:0:0, 4:2:2 and 4:4:4 formats. At the time of this writing, it is still too early to know which profiles will eventually be implemented in products. First generation prototypes implement only the Simple profile, and they target applications in the area of mobile video communications.
13.6.9 Evaluation of Subjective Video Quality
MPEG-4 introduces new functionalities like object-based coding and claims to improve coding efficiency. These claims were verified by means of subjective tests. Fig. 13.39 shows results of subjective coding efficiency tests comparing MPEG-4 video and MPEG-1 video at bitrates between 384 kbit/s and 768 kbit/s, indicating that MPEG-4 outperforms MPEG-1 significantly at these bitrates. MPEG-4 was coded using the tools of the Main profile (Sec. 13.6.8). In Fig. 13.40, we see the improvements in coding efficiency due to the additional tools of the Advanced Coding Efficiency (ACE) profile (Sec. 13.6.8). The quality provided by the ACE profile at 768 kbit/s equals the quality provided by the Main profile at 1024 kbit/s. This makes the ACE profile very attractive for delivering movies over cable modem or digital subscriber lines (DSL) to the home. Further subjective tests showed that the object-based functionality of MPEG-4 does not decrease the subjective quality of the coded video object when compared to coding the video object using frame-based video, i.e. the bits spent on shape coding are compensated by the bits saved by not coding pels outside of the video object. Hence, the advanced tools of MPEG-4 enable content-based video representation without increasing the bitrate for video coding.

Figure 13.39. Subjective quality of MPEG-4 versus MPEG-1. M4_* is an MPEG-4 coder operating at the rate of * kbit/s, M1_* is an MPEG-1 encoder operating at the given rate [27].

Figure 13.40. Subjective quality of MPEG-4 ACE versus MPEG-4 Main profile. M_* is an MPEG-4 coder according to the Main profile operating at the rate * kbit/s, M+_* is an MPEG-4 encoder according to the ACE profile operating at the given rate [26].
13.7 Video Bitstream Syntax
As mentioned earlier, video coding standards define the syntax and semantics of the video bitstream, instead of the actual encoding scheme. They also specify how the bitstream has to be parsed and decoded to produce the decompressed video signal.
In order to support different applications, the syntax has to be flexible. This is achieved by having a hierarchy of different layers that each start with a Header. Each layer performs a different logical function (Tab. 13.6). Most headers can be uniquely identified in the bitstream because they begin with a Start Code, that is, a long sequence of zeros (23 for MPEG-2) followed by a '1' and a start code identifier. Fig. 13.41 visualizes the hierarchy for MPEG-2.
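Because the start code prefix is unique and byte aligned, a decoder can locate headers by a simple byte search, as the following sketch shows. The function name is hypothetical; the prefix 0x000001 and the interpretation of the following start code byte are as described above.

def find_start_codes(bitstream: bytes):
    positions = []
    i = bitstream.find(b"\x00\x00\x01")
    while i != -1:
        code_value = bitstream[i + 3] if i + 3 < len(bitstream) else None
        positions.append((i, code_value))          # (byte offset, start code value)
        i = bitstream.find(b"\x00\x00\x01", i + 3)
    return positions

data = b"\x00\x00\x01\xb3" + b"\xaa" * 5 + b"\x00\x00\x01\x00" + b"\xbb" * 3
print(find_start_codes(data))   # [(0, 179), (9, 0)]: e.g., a sequence header and a picture header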
Sequence: A video sequence commences with a sequence header and may contain additional sequence headers. It includes one or more groups of pictures and ends with an end-of-sequence code. The sequence header and its extensions contain basic parameters such as picture size, image aspect ratio, picture rate and other global parameters. The Video Object Layer header of MPEG-4 has the same functionality; however, it carries additional information that an MPEG-4 decoder needs in order to compose several arbitrarily shaped video sequences into one sequence to be displayed.
Group of Pictures (GOP): A GOP is a header followed by a series of one or more pictures intended to allow random access into the sequence, fast search and
editing. Therefore, the first picture in a GOP is an intra-coded picture (I-picture). This is followed by an arrangement of forward-predictive coded pictures (P-pictures) and optional bidirectionally predicted pictures (B-pictures).
This GOP header also contains a time code for synchronization and editing. A GOP is the base unit for editing and random access since it is coded independently of previous and consecutive GOPs. In MPEG-4, the function of the GOP is achieved by a Group of Video Object Planes (GVOP). Since H.261 and H.263 were designed mainly for interactive applications, they do not use the concept of a GOP. However, the encoder may choose at any time to send an I-picture, thus enabling random access and simple editing.
Picture: A picture is the primary coding unit of a video sequence. A picture consists of three rectangular matrices representing luminance (Y) and two chrominance (Cb and Cr) values. The picture header indicates the picture type (I, P, B), picture structure (field/frame) and perhaps other parameters like motion vector ranges. A VOP is the primary coding unit in MPEG-4. It has the size of the bounding box of the video object.
Each standard divides a picture into groups of MBs. Whereas H.261 and H.263 use a fixed arrangement of MBs, MPEG-1 and MPEG-2 allow for a flexible arrangement. MPEG-4 arranges a variable number of MBs into one group.
Table 13.6. Syntax hierarchy as used in different video coding standards. Each layer starts with a header. An (SC) in a syntax layer indicates that the header of that layer starts with a Start Code. VOP = Video Object Plane, GOB = Group of Blocks (adapted from [13]).

Syntax Layer               Functionality                                                Standard
Sequence (SC)              Definition of entire video sequence                          H.261/3, MPEG-1/2
Video Object Layer (SC)    Definition of entire video object                            MPEG-4
Group of Pictures (SC)     Enables random access into the video stream                  MPEG-1/2
Group of VOP (SC)          Enables random access into the video stream                  MPEG-4
Picture (SC)               Primary coding unit                                          H.261/3, MPEG-1/2
VOP (SC)                   Primary coding unit                                          MPEG-4
GOB (SC)                   Resynchronization, refresh, and error recovery in a picture  H.261/3
Slice (SC)                 Resynchronization, refresh, and error recovery in a picture  MPEG-1/2
Video Packet (SC)          Resynchronization, error recovery in a picture               MPEG-4
MB                         Motion compensation and shape coding unit                    H.261/3, MPEG-1/2/4
Block                      Transform and compensation unit                              H.261/3, MPEG-1/2/4
GOB: H.261 and H.263 divide the image into GOBs of 3 lines of MBs with 11 MBs in one GOB line. The GOB headers define the position of the GOB within the picture. For each GOB, a new quantizer stepsize may be defined. GOBs are important in the handling of errors. If the bitstream contains an error, the decoder can skip to the start of the next GOB, thus limiting the extent of bit errors to within one GOB of the current frame. However, error propagation may occur when predicting the following frame.
Slice: MPEG-1, MPEG-2 and H.263 Annex K extend the concept of GOBs to a variable configuration. A slice groups several consecutive MBs into one unit. Slices may vary in size. In MPEG-1, a slice may be as big as one picture. In MPEG-2, however, at least each row of MBs in a picture starts with a new slice. Having more slices in the bitstream allows better error concealment, but uses bits that could otherwise be used to improve picture quality.
Figure 13.41. Visualization of the hierarchical structure of an MPEG-2 bit stream from the video sequence layer down to the block level, shown for the luminance component. Each layer also has two chrominance components associated with it.
Video Packet Header: The video packet approach adopted by MPEG-4 is based
on providing periodic resynchronization markers throughout the bitstream.
In other words, the length of the video packets are not based on the number
of MBs, but instead on the number of bits contained in that packet. If the
number of bits contained in the current video packet exceeds a threshold as
dened by the encoder, then a new video packet is created at the start of
the next MB. This way, a transmission error causes less damage to regions
with higher activity than to regions that are stationary when compared to the
more rigid slice and GOB structures. The video packet header carries position
information and repeats information of the picture header that is necessary
to decode the video packet.
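As an illustration of this rule (not part of the standard text), the following Python sketch groups macroblocks into video packets once an encoder-chosen bit threshold is exceeded; the per-MB bit counts and the threshold are hypothetical example values.

# Sketch of MPEG-4 video packet formation: a new packet starts at the next MB
# boundary once the bit count of the current packet exceeds an encoder-chosen
# threshold. Macroblock bit counts here are hypothetical example values.

def form_video_packets(mb_bits, threshold=800):
    """Group per-macroblock bit counts into video packets."""
    packets, current = [], []
    bits_in_packet = 0
    for mb_index, bits in enumerate(mb_bits):
        current.append(mb_index)
        bits_in_packet += bits
        if bits_in_packet > threshold:       # threshold exceeded ...
            packets.append(current)          # ... close the packet ...
            current, bits_in_packet = [], 0  # ... and start a new one at the next MB
    if current:
        packets.append(current)
    return packets

# Busy regions (many bits per MB) produce short packets, stationary regions long ones.
print(form_video_packets([500, 600, 40, 30, 20, 700, 650, 10]))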
Macroblock: A MB is a 16 x 16 pixel block in a picture. Using the 4:2:0 format, each
chrominance component has one-half the vertical and horizontal resolution of
the luminance component. Therefore a MB consists of four Y, one Cr, and
one Cb block. Its header carries relative position information, quantizer scale
information, MTYPE information (I, P, B), and a CBP indicating which and
how the 6 blocks of a MB are coded. As with other headers, other parameters
may or may not be present in the header depending on MTYPE. Since MPEG-4 also needs to code the shape of video objects, it extends the MB by a binary
alpha block (BAB) that defines for each pel in the MB whether it belongs
to the VO. In the case of grey-scale alpha maps, the MB also contains four
blocks for the coded alpha maps.
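The block make-up of a macroblock can be summarized with a small sketch; the labels Y0-Y3, Cb, Cr, BAB, and A0-A3 are illustrative names, not standardized identifiers.

# Sketch of the block make-up of one macroblock (MB). In 4:2:0 a 16x16 MB carries
# four 8x8 luminance (Y) blocks plus one 8x8 Cb and one 8x8 Cr block; MPEG-4 adds
# a 16x16 binary alpha block (BAB) and, for grey-scale alpha, four 8x8 alpha blocks.

def macroblock_blocks(grey_scale_alpha=False, mpeg4_shape=False):
    blocks = ["Y0", "Y1", "Y2", "Y3", "Cb", "Cr"]        # the six 8x8 texture blocks
    if mpeg4_shape:
        blocks.append("BAB")                              # 16x16 binary alpha block
        if grey_scale_alpha:
            blocks += ["A0", "A1", "A2", "A3"]            # four 8x8 grey-scale alpha blocks
    return blocks

print(macroblock_blocks())                                # H.261/H.263/MPEG-1/2 style MB
print(macroblock_blocks(mpeg4_shape=True, grey_scale_alpha=True))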
Block: A block is the smallest coding unit in standardized video coding algorithms.
It consists of 8x8 pixels and can be one of three types: Y, Cr, or Cb. The pixels
of a block are represented by their DCT coefficients, coded using a Huffman
code that codes the number of '0's before the next non-zero coefficient and
the amplitude of this coefficient.
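The following sketch illustrates this run-amplitude representation for one quantized coefficient block; the zigzag scan shown is the conventional one, and the example coefficient values are arbitrary. The actual variable-length code tables differ between the standards and are omitted here.

# Sketch of the (zero-run, amplitude) representation of a quantized 8x8 DCT block,
# as used before Huffman coding. The coefficient values below are illustrative.

import numpy as np

def zigzag_indices(n=8):
    """Return the (row, col) visiting order of the conventional zigzag scan."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level_pairs(block):
    """Convert a quantized coefficient block into (run-of-zeros, amplitude) pairs."""
    pairs, run = [], 0
    for r, c in zigzag_indices(block.shape[0]):
        v = block[r, c]
        if v == 0:
            run += 1
        else:
            pairs.append((run, int(v)))
            run = 0
    return pairs                      # trailing zeros are signalled by an end-of-block code

block = np.zeros((8, 8), dtype=int)
block[0, 0], block[0, 1], block[2, 0], block[5, 3] = 34, -3, 2, 1
print(run_level_pairs(block))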
The different headers in the bitstream allow a decoder to recover from errors in
the bitstream and start decoding as soon as it receives a start code. The behaviour
of a decoder when receiving an erroneous bitstream is not defined in the standard.
Different decoders may behave very differently: some decoders crash and require
rebooting of the terminal, others recover within a picture, yet others wait until the
next I-frame before they start decoding again.
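To make the resynchronization idea concrete, the sketch below scans a byte buffer for the byte-aligned start-code prefix 0x000001 used by MPEG-1/2/4 (H.261 and H.263 use bit-aligned start codes, but the principle is the same); the example stream content is fabricated.

# Sketch of decoder resynchronization after a bitstream error: MPEG-1/2/4 start
# codes begin with the byte-aligned prefix 0x00 0x00 0x01, so a decoder that loses
# track can scan forward to the next prefix and resume parsing at that header.

def next_start_code(buf, pos):
    """Return the offset of the next 0x000001 prefix at or after pos, or -1."""
    return buf.find(b"\x00\x00\x01", pos)

stream = b"\xaa\xbb\x00\x00\x01\xb3garbage\x00\x00\x01\x00more"
pos = next_start_code(stream, 0)
while pos != -1:
    code = stream[pos + 3]                 # the byte after the prefix identifies the layer
    print(f"start code 0x{code:02x} at byte {pos}")
    pos = next_start_code(stream, pos + 4)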
13.8 Multimedia Content Description Using MPEG-7
With the ubiquitous use of video, the ability to index and search video sequences becomes an important capability. MPEG-7 is an on-going standardization effort for content description of audio-visual (AV) documents [32, 63]. In principle, MPEG-1, -2, and -4 are designed to represent the information itself, while MPEG-7 is meant to represent information about the information. Looking from another perspective: MPEG-1/2/4 make content available, while MPEG-7 allows you to find the content you need [63]. MPEG-7 is intended to provide complementary functionality to the other MPEG standards: representing information about the content, not the content itself ("the bits about the bits"). While MPEG-4 allows only limited textual meta information to be attached to its streams, the MPEG-7 standard will provide a full set of indexing and search capabilities such that we can search for a movie not only with text keys but also with keys like color histograms, motion trajectory, etc. MPEG-7 will be an international standard by the end of 2001.
In this section, we first provide an overview of the elements standardized by
MPEG-7. We then describe multimedia description schemes, with a focus on content description. We explain how MPEG-7 decomposes an AV document to arrive
at both structural and semantic descriptions. Finally, we describe the visual descriptors used in these descriptions. The descriptors and description schemes presented
below assume that semantically meaningful regions and objects can be segmented
and that the shape and motion parameters, and even semantic labels, of these regions/objects can be accurately extracted. We would like to note that the generation
of such information remains an unsolved problem and may need manual assistance.
The MPEG-7 standard only defines the syntax that can be used to specify this
information, but not the algorithms that can be used to extract it.
13.8.1 Overview
The main elements of the MPEG-7 standard are [32]:
Descriptors (D): The MPEG-7 descriptors are designed to represent features,
including low-level audio-visual features; high-level features of semantic objects, events and abstract concepts; information about the storage media,
and so on. Descriptors define the syntax and the semantics of each feature
representation.
Description Schemes (DS): The MPEG-7 DSs expand on the MPEG-7 descriptors by combining individual descriptors as well as other DSs within more
complex structures and by defining the relationships among the constituent
descriptors and DSs.
A Description Definition Language (DDL): It is a language that allows the
creation of new DSs and, possibly, new descriptors. It also allows the extension and modification of existing DSs. The XML Schema Language has been
selected to provide the basis for the DDL.
System tools: These are tools that are needed to prepare MPEG-7 descriptions for efficient transport and storage, to allow synchronization between
content and descriptions, and to manage and protect intellectual property.
13.8.2 Multimedia Description Schemes
In MPEG-7, the DSs are categorized as pertaining specifically to the audio or visual
domain, or pertaining generically to the description of multimedia. The multimedia DSs are grouped into the following categories according to their functionality
(Fig. 13.42):
Basic elements: These deal with basic datatypes, mathematical structures,
schema tools, linking and media localization tools as well as basic DSs, which
are elementary components of more complex DSs;
Content description: These DSs describe the structural and conceptual aspects of an AV document;
Content management: These tools specify information about the storage media, the creation and the usage of an AV document;
Content organization: These tools address the organization of the content
by classification, by the definition of collections of AV documents, and by
modeling;
Navigation and access: These include summaries for browsing and variations
of the same AV content for adaptation to capabilities of the client terminals,
network conditions or user preferences;
User interaction: These DSs specify user preferences pertaining to the consumption of the multimedia material.
Figure 13.42. Overview of MPEG-7 Multimedia Description Schemes. [Courtesy of MPEG-7].
Content Description
In the following, we briefly describe the DSs for content description. More detailed
information can be found in [29]. The DSs developed for content description fall
in two categories: those describing the structural aspects of an AV document, and
those describing the conceptual aspects.
Structural Aspects: These DSs describe the syntactic structure of an AV document in terms of segments and regions. An AV document (e.g., a video program with audio tracks) is divided into a hierarchy of segments, known as a segment-tree. For example, the entire document is segmented into several story units, each story unit is then divided into different scenes, and finally each scene is split into many camera shots. A segment at each level of the tree can be further divided into a video segment and an audio segment, corresponding to the video frames and the audio waveform, respectively. In addition to using a video segment that contains a set of complete video frames (which may not be contiguous in time), a still or moving region can also be extracted. A region can be recursively divided into sub-regions, to form a region-tree. The concept of the segment tree is illustrated on the left side of Fig. 13.43.

Figure 13.43. Description of an AV document (a news program in this case) based on a segment tree and an event tree. The segment tree is like the table of contents at the beginning of a book, whereas the event tree is like the index at the end of the book. [Courtesy of MPEG-7].

Conceptual Aspects: These DSs describe the semantic content of an AV document in terms of events, objects, and other abstract notions. The semantic DS describes events and objects that occur in a document, and attaches corresponding "semantic labels" to them. For example, the event type could be a news broadcast, a sports
game, etc. The object type could be a person, a car, etc. As with the structure
description, MPEG-7 also uses hierarchical decomposition to describe the semantic
content of an AV document. An event can be further broken up into many subevents to form an event-tree (right side of Fig. 13.43). Similarly, an object-tree can
be formed. An event-object relation graph describes the relation between events
and objects.
Relation between Structure and Semantic DSs:
An event is usually associated
with a segment, and an object with a region. Each event or object may occur
multiple times in a document, and their actual locations (which segment or region)
are described by a set of links, as shown in Fig. 13.43. In this sense, the syntactic
structure, represented by the segment-tree and the region-tree, is like the table of
contents at the beginning of a book, whereas the semantic structure, i.e., the event-tree and the object-tree, is like the index at the end of the book.
13.8.3 Visual Descriptors and Description Schemes
For each segment or region at any level of the segment- or region-tree, a set of audio
and visual descriptors and DSs is used to characterize this segment or region. In
this section, we briefly describe the visual descriptors and DSs that have been
developed to describe the color, texture, shape, motion, and location of a video
segment or object. More complete descriptions can be found in [28, 33].
Color
These descriptors describe the color distributions in a video segment, a moving
region or a still region.
Color space: Five color spaces are defined: RGB, YCrCb, HSV, HMMD, or
monochrome. Alternatively, one can specify an arbitrary linear transformation
matrix from the RGB coordinates.
Color quantization: This descriptor is used to specify the quantization parameters, including the number of quantization levels and starting values for
each color component. Only uniform quantization is considered.
Dominant color: This descriptor specifies the dominant colors in the underlying segment, including the number of dominant colors, a value indicating the
spatial coherence of the dominant color (i.e., whether the dominant color is
scattered over the segment or forms a cluster), and for each dominant color,
the percentage of pixels taking that color, the color value and its variance.
Color histogram: The color histogram is defined in the HSV space. Instead of
the color histogram itself, the Haar transform is applied to the histogram and
the Haar coefficients are specified using variable precision depending on the
available bit rate. Several types of histograms can be specified. The common
color histogram, which includes the percentage of each quantized color among
all pixels in a segment or region, is called ScalableColor. The GoF/GoP Color
refers to the average, median, or intersection (minimum percentage for each
color) of conventional histograms over a group of frames or pictures.
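A minimal sketch of these aggregation rules is shown below, assuming per-frame histograms are already available; the Haar transform and bin quantization of the actual ScalableColor descriptor are omitted, and the bin counts are arbitrary example values.

# Sketch of GoF/GoP color aggregation: per-frame color histograms are combined by
# average, median, or intersection (per-bin minimum percentage of each color).

import numpy as np

def gof_color(frame_histograms, mode="intersection"):
    h = np.asarray(frame_histograms, dtype=float)
    h = h / h.sum(axis=1, keepdims=True)          # normalize each frame histogram
    if mode == "average":
        return h.mean(axis=0)
    if mode == "median":
        return np.median(h, axis=0)
    if mode == "intersection":                    # minimum percentage of each color
        return h.min(axis=0)
    raise ValueError(mode)

hists = np.random.default_rng(0).integers(1, 100, size=(5, 16))   # 5 frames, 16 color bins
print(gof_color(hists, "intersection").round(3))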
Color layout: This descriptor is used to describe at a coarse level the color
pattern of an image. An image is reduced to 8 x 8 blocks with each block
represented by its dominant color. Each color component (Y/Cb/Cr) in the
reduced image is then transformed using the DCT, and the first few coefficients are
specified.
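The computation can be sketched as follows, using block averages in place of a true dominant-color selection and reading the coefficients in raster rather than zigzag order; the image size and the number of retained coefficients are example choices.

# Sketch of the color layout idea: reduce one Y/Cb/Cr plane to an 8x8 grid of
# representative colors, apply an 8x8 DCT, and keep only the first few coefficients.

import numpy as np

def dct_8x8(a):
    """Plain orthonormal 8x8 DCT-II implemented directly from its definition."""
    n = 8
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c @ a @ c.T

def color_layout(plane, num_coeffs=6):
    """plane: one Y/Cb/Cr component of an image, shape (H, W)."""
    h, w = plane.shape
    tiny = plane[: h - h % 8, : w - w % 8].reshape(8, h // 8, 8, w // 8).mean(axis=(1, 3))
    coeffs = dct_8x8(tiny)
    return coeffs.flatten()[:num_coeffs]          # a real encoder would use zigzag order

y_plane = np.random.default_rng(1).integers(0, 256, size=(144, 176)).astype(float)
print(color_layout(y_plane).round(1))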
Color structure: This descriptor is intended to capture the spatial coherence
of pixels with the same color. The counter for a color is incremented as long
as there is at least one pixel with this color in a small neighborhood around
each pixel, called the structuring element. Unlike the color histogram, this
descriptor can distinguish between two images in which a given color is present
in identical amounts but where the structure of the groups of pixels having
that color is different in the two images.
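A sketch of this counting rule, assuming an image already quantized to a small set of color indices and an 8 x 8 structuring element, is given below; how the element scales with image size and how bin amplitudes are quantized in the actual descriptor is omitted here.

# Sketch of the color structure histogram: for each position of the structuring
# element, every color present inside the element increments its bin once,
# regardless of how many pixels of that color the element contains.

import numpy as np

def color_structure_histogram(indexed_image, num_colors, elem=8):
    h, w = indexed_image.shape
    hist = np.zeros(num_colors, dtype=int)
    for r in range(h - elem + 1):
        for c in range(w - elem + 1):
            window = indexed_image[r:r + elem, c:c + elem]
            hist[np.unique(window)] += 1      # each color present counts once per position
    return hist

img = np.random.default_rng(2).integers(0, 4, size=(32, 32))
print(color_structure_histogram(img, num_colors=4))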
Texture
This category is used to describe the texture pattern of an image.
Homogeneous texture: This is used to specify the energy distribution in different orientations and frequency bands (scales). The first two components are
the mean value and the standard deviation of the pixel intensities. The following 30 components are obtained through a Gabor transform with 6 orientation
zones and 5 scale bands.
Texture browsing: This descriptor specifies the texture appearances in terms
of regularity, coarseness and directionality, which are in-line with the type of
descriptions that a human may use in browsing/retrieving a texture pattern.
In addition to regularity, up to two dominant directions and the coarseness
along each direction can be specied.
Edge histogram: This descriptor is used to describe the edge orientation distribution in an image. Three types of edge histograms can be specified, each
with five entries, describing the percentages of directional edges in four possible orientations and non-directional edges. The global edge histogram is
accumulated over every pixel in an image; the local histogram consists of 16
sub-histograms, one for each block in an image divided into 4 x 4 blocks; the
semi-global histogram consists of 13 sub-histograms, one for each sub-region
in an image.
Shape
These descriptors are used to describe the spatial geometry of still and moving
regions.
Contour-based descriptor: This descriptor is applicable to a 2D region with
a closed boundary. MPEG-7 has chosen to use the peaks in the curvature
scale space (CSS) representation [55] to describe a boundary, which has been
found to reect human perception of shapes, i.e., similar shapes have similar parameters in this representation. The CSS representation of a boundary
is obtained by recursively blurring the original boundary using a smoothing
filter, computing the curvature along each filtered curve, and finally determining zero-crossing locations of the curvature after successive blurring. The
descriptor specifies the number of curvature peaks in the CSS, the global eccentricity and circularity of the boundary, the eccentricity and circularity of
the prototype curve, which is the curve leading to the highest peak in the
CSS, the prototype filter, and the positions of the remaining peaks.
Region-based shape descriptor: The region-based shape descriptor makes use
of all pixels constituting the shape, and thus can describe any shape, i.e. not
only a simple shape with a single connected region but also a complex shape
that consists of several disjoint regions. Specifically, the original shape, represented by an alpha map, is projected onto Angular Radial Transform (ART)
basis functions, and the descriptor includes 35 normalized and quantized magnitudes of the ART coefficients.
Shape 3D: This descriptor provides an intrinsic description of 3D mesh models.
It exploits some local attributes of the 3D surface. To derive this descriptor,
the so-called shape index is calculated at every point on the mesh surface,
which depends on the principal curvatures at the point. The descriptor specifies
the shape spectrum, which is the histogram of the shape indices calculated
over the entire mesh. Each entry in the histogram essentially specifies the
relative area of all the 3D mesh surface regions with a shape index lying in
a particular interval. In addition, the descriptor includes the relative area of
planar surface regions of the mesh, for which the shape index is not defined,
and the relative area of all the singular polygonal components, which are
regions for which reliable estimation of the shape index is not possible.
Motion
These descriptors describe the motion characteristics of a video segment or a moving
region as well as global camera motion.
Camera motion: Seven possible camera motions are considered: panning,
tracking (horizontal translation), tilting, booming (vertical translation), zooming, dollying (translation along the optical axis), and rolling (rotation around
the optical axis) (cf. Fig. 5.4). For each motion, two moving directions are
possible. For each motion type and direction, the presence (i.e. duration),
speed and the amount of motion are specied. The last term measures the
area that is covered or uncovered due to a particular motion.
Motion trajectory: This descriptor is used to specify the trajectory of a non-rigid moving object, in terms of the 2D or 3D coordinates of certain key points
at selected sampling times. For each key point, the trajectory between two
adjacent sampling times is interpolated by a specied interpolation function
(either linear or parabolic).
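The interpolation between two key positions can be written out as below; the times, positions, and acceleration value are arbitrary examples, and the parameterization is only one possible form of the second-order interpolation mentioned above.

# Sketch of trajectory interpolation between two key points. With a linear function
# the coordinate changes at constant speed; the parabolic (second-order) form adds
# an acceleration term while still passing through both key points.

def interpolate(t, t0, x0, t1, x1, accel=0.0):
    """Position at time t between key points (t0, x0) and (t1, x1)."""
    dt = t1 - t0
    v = (x1 - x0) / dt - 0.5 * accel * dt     # velocity chosen so that x(t1) == x1
    return x0 + v * (t - t0) + 0.5 * accel * (t - t0) ** 2

# linear (accel = 0) vs. parabolic interpolation of the x-coordinate of one key point
for tau in (0.0, 0.5, 1.0):
    print(tau, interpolate(tau, 0.0, 10.0, 1.0, 20.0),
          interpolate(tau, 0.0, 10.0, 1.0, 20.0, accel=8.0))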
Parametric object motion: This descriptor is used to specify the 2D motion
of a rigid moving object. Five types of motion models are included: translation, rotation/scaling, affine, planar perspective, and parabolic. The planar
perspective and parabolic motions refer to the projective mapping defined in
Eq. 5.5.14, and the biquadratic mapping defined in Eq. 5.5.19, respectively.
In addition to the model type and model parameters, the coordinate origin
and time duration need to be specied.
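As an illustration, the sketch below applies two of the listed model types, translation and affine, to a pixel coordinate; the parameter values are arbitrary, and the exact parameterization and quantization used by the descriptor are not reproduced here.

# Sketch of applying a 2D parametric motion model to a pixel coordinate,
# relative to the signalled coordinate origin.

def affine_motion(x, y, a):
    """(x, y) -> (x', y') under the 6-parameter affine model
       x' = a0 + a1*x + a2*y,  y' = a3 + a4*x + a5*y."""
    return (a[0] + a[1] * x + a[2] * y,
            a[3] + a[4] * x + a[5] * y)

translation = [2.0, 1.0, 0.0, -1.5, 0.0, 1.0]        # pure shift by (2, -1.5)
affine      = [0.0, 1.02, 0.01, 0.0, -0.01, 1.02]    # small rotation plus zoom
print(affine_motion(100.0, 50.0, translation))
print(affine_motion(100.0, 50.0, affine))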
Motion activity: This descriptor is used to describe the intensity and spread
of activity over a video segment (typically at the shot level). Five attributes
can be specified: i) intensity of activity, measured by the standard deviation
of the motion vector magnitudes; ii) direction of activity, which specifies the
dominant or average direction of all motion vectors; iii) spatial distribution
of activity, derived from the run-lengths of blocks with motion magnitudes
lower than the average magnitude; iv) spatial localization of activity; and
v) temporal distribution of activity, described by a histogram of quantized
activity levels over individual frames in the shot.
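The first two attributes can be computed directly from the motion vectors of a shot, as sketched below with made-up vectors; the mapping of the intensity value onto MPEG-7's discrete activity levels is omitted.

# Sketch of two motion-activity attributes: intensity as the standard deviation of
# the motion vector magnitudes, and direction as the average motion vector angle.

import numpy as np

def motion_activity(mvs):
    """mvs: array of shape (N, 2) holding the (dx, dy) motion vectors of a shot."""
    mvs = np.asarray(mvs, dtype=float)
    magnitudes = np.hypot(mvs[:, 0], mvs[:, 1])
    intensity = magnitudes.std()
    direction = np.degrees(np.arctan2(mvs[:, 1].mean(), mvs[:, 0].mean()))
    return intensity, direction

vectors = [(1, 0), (2, 1), (0, 0), (8, 3), (7, 2), (0, 1)]
print(motion_activity(vectors))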
Localization
These descriptors and DSs are used to describe the location of a still or moving
region.
Region locator: This descriptor specifies the location of a region by a brief
and scalable representation of a bounding box or polygon.
Spatial-temporal locator: This DS describes a moving region. It decomposes
the entire duration of the region into a few sub-segments, with each segment
being specified by the shape of the region at the beginning of the segment,
known as a reference region, and the motion between this region and the reference region of the next segment. For a non-rigid object, a FigureTrajectory
DS is developed, which defines the reference region by a bounding rectangle,
ellipse, or polygon, and specifies the motion between reference regions using
the MotionTrajectory descriptor, which specifies the coordinates of selected
key points over successive sampling times. For a rigid region, a ParameterTrajectory DS is used, which uses the RegionLocator descriptor to specify a
reference region, and the parametric object motion descriptor to describe the
motion.
13.9 Summary
Video communications requires standardization in order to build reasonably-priced
equipment that interoperates and caters to a large market. Personal video telephony
was the first application that was targeted by a digital video compression standard.
H.261 was published in 1990, 101 years after Jules Verne wrote down the idea of
a video telephone and 899 years earlier than he predicted [68]. The subsequent
important video compression standards H.263, MPEG-1, MPEG-2, and MPEG-4
were established in 1993, 1995, 1995, and 1999, respectively.
Whereas the H.261 and H.263 standards describe only video compression, the MPEG-1/2/4 standards also describe the representation of audio as well as a system that
enables the joint transmission of audiovisual signals. H.261 is a block-based hybrid
coder with integer-pel motion compensation. The main application for H.261 is
video coding for video conferencing over ISDN lines at rates between 64 kbit/s and
2 Mbit/s. H.263 extends H.261 and adds many features including half-pel motion
compensation, thus enabling video coding for transmission over analog telephone
lines at rates below 56 kbit/s.
MPEG-1 is also derived from H.261. It added half-pel motion compensation,
bidirectional prediction for B-pictures and other improvements in order to meet the
requirements for coding video at rates around 1.2 Mbit/s for consumer video on
CD-ROM at CIF resolution. MPEG-2 is the first standard that is able to code
interlaced video at full TV and HDTV resolution. It extended MPEG-1 to include
new prediction modes for interlaced video. Its main applications are TV broadcasting at rates around 4 Mbit/s and 15 Mbit/s for high-quality video. MPEG-4 video,
based on MPEG-2 and H.263, is the latest video coding standard that introduces
object-based functionalities describing video objects not only with motion and texture but also by their shape. Shape information is co-located with the luminance
signal and coded using a context-based arithmetic coder.
MPEG-2 and MPEG-4 define profiles that require a decoder to implement a subset of the tools that the standard defines. This makes it possible to build standard decoders
that are to some extent tailored towards certain application areas.
Whereas MPEG-1/2/4 standards are developed to enable the exchange of audiovisual data, MPEG-7 aims to enable the searching and browsing of such data.
MPEG-7 can be used independently of the other MPEG standards; an MPEG-7
description might even be attached to an analog movie. MPEG-7 descriptions could
be used to improve the functionalities of previous MPEG standards, but will not
replace MPEG-1, MPEG-2 or MPEG-4.
Since the computational power in terminals increases every year, standardization
bodies try to improve on their standards. ITU currently works on the video coding
standard H.26L, which promises to outperform H.263 and MPEG-4 by more than
1 dB for the same bitrate, or to reduce the bitrate by more than 20% for the same
picture quality when coding video at rates above 128 kbit/s.
13.10 Problems
13.1 What kinds of compatibility do you know?
13.2 What are the most compute intensive parts of an H.261 video encoder? What
are the most compute intensive parts of a decoder?
13.3 What is a loop filter? Why is H.261 the only standard implementing it?
13.4 What are the tools that improve the coding efficiency of H.263 over H.261?
13.5 What is the main difference between MPEG-1 B-frames and H.263 PB-frames
according to the Improved PB-frame mode?
13.6 What is the purpose of the standards H.323 and H.324?
13.7 Why does MPEG-2 have more than one scan mode?
13.8 What effect does the perceptual quantization of I frames have on the PSNR
of a coded picture? How does perceptual quantization affect picture quality?
What is a good guideline for choosing the coefficients of the weight matrix?
13.9 Explain the concept of profiles and levels in MPEG-2.
13.10 Which of the MPEG-2 profiles are used in commercial products? Why do
the others exist?
13.11 What kind of scalability is supported by MPEG-2?
13.12 What is drift? When does it occur?
13.13 Discuss the error resilience tools that H.261, H.263, and MPEG-1/2/4 provide.
Why are MPEG-4 error resilience tools best suited for lossy transmission channels?
13.14 What are the differences between MPEG-1 Layer III audio coding and MPEG-2 NBC audio?
13.15 MPEG-4 allows shape signals to be encoded. In the case of binary shape, how
many blocks are associated with a macroblock? What is their size? What about
greyscale shape coding?
13.16 Why does MPEG-4 video according to the ACE profile outperform MPEG-1
video?
13.17 What part of an MPEG-4 terminal is not standardized?
13.18 Why do video bitstreams contain start codes?
13.19 What is meta information?
13.20 Which standard uses a wavelet coder and for what purpose?
13.21 Why is the definition of FAPs as done in MPEG-4 important for content
creation?
13.22 How is synchronization achieved between a speech synthesizer and a talking
face?
13.23 What is the functionality and purpose of MPEG-4 mesh animation?
13.24 What is the difficulty with video indexing and retrieval? How can a standardized content description interface such as MPEG-7 simplify video retrieval?
13.25 How does the segment-tree in MPEG-7 describe the syntactic structure of a
video sequence? How does the event-tree in MPEG-7 describe the semantic
structure of a video sequence? What are their relations?
13.26 What are the visual descriptors developed by MPEG-7? Assuming these
descriptors are attached to every video sequence in a large video database,
describe ways that you may use them to retrieve certain types of sequences.
13.11 Bibliography
[1] O. Avaro, A. Eleftheriadis, C. Herpel, G. Rajan, and L. Ward. MPEG-4 systems:
Overview. In A. Puri and T. Chen, editors, Multimedia Systems, Standards, and
Networks. Marcel Dekker, 2000.
[2] A. Basso, M. R. Civanlar, and V. Balabanian. Delivery and control of MPEG-4
content over IP networks. In A. Puri and T. Chen, editors, Multimedia Systems,
Standards, and Networks. Marcel Dekker, 2000.
[3] K. Brandenburg, O. Kunz, and A. Sugiyama. MPEG-4 natural audio coding.
Signal Processing: Image Communication, 15(4-5):423-444, 2000.
[4] T. K. Capin, E. Petajan, and J. Ostermann. Efficient modeling of virtual humans
in MPEG-4. In Proceedings of the International Conference on Multimedia and
Expo 2000, page TPS9.1, New York, 2000.
[5] T. Chen. Emerging standards for multimedia applications. In L. Guan, S. Y.
Kung, and J. Larsen, editors, Multimedia Image and Video Processing, pages 1-18.
CRC Press, 2000.
[6] T. Chen, G. J. Sullivan, and A. Puri. H.263 (including H.263++) and other ITU-T video coding standards. In A. Puri and T. Chen, editors, Multimedia Systems,
Standards, and Networks. Marcel Dekker, 2000.
[7] L. Chiariglione. Communication standards: Gotterdammerung? In A. Puri
and T. Chen, editors, Multimedia Systems, Standards, and Networks, pages 1-22.
Marcel Dekker, 2000.
[8] F. Dufaux and F. Moscheni. Motion estimation techniques for digital TV: A
review and a new contribution. Proceedings of the IEEE, 83(6):858-876, June 1995.
[9] Y. Lee, E. D. Scheirer, and J.-W. Yang. Synthetic and SNHC audio in MPEG-4.
Signal Processing: Image Communication, 15(4-5):445-461, 2000.
[10] B. Girod, E. Steinbach, and N. Farber. Comparison of the H.263 and H.261
video compression standards. In Standards and Common Interfaces for Video,
Philadelphia, USA, October 1995. SPIE Proceedings Vol. CR60, SPIE.
[11] J. Hartman and J. Wernecke. The VRML handbook. Addison Wesley, 1996.
[12] B. G. Haskell, P. G. Howard, Y. A. LeCun, A. Puri, J. Ostermann, M. R.
Civanlar, L. Rabiner, L. Bottou, and P. Haffner. Image and video coding - emerging
standards and beyond. IEEE Transactions on Circuits and Systems for Video
Technology, 8(7):814-837, November 1998.
[13] B. G. Haskell, A. Puri, and A. N. Netravali. Digital Video: An Introduction to
MPEG-2. Chapman & Hall, New York, 1997.
[14] C. Herpel, A. Eleftheriadis, and G. Franceschini. MPEG-4 systems: Elementary
stream management and delivery. In A. Puri and T. Chen, editors, Multimedia
Systems, Standards, and Networks. Marcel Dekker, 2000.
[15] ISO/IEC. IS 10918-1: Information technology - digital compression and coding
of continuous-tone still images: Requirements and guidelines, 1990. (JPEG).
[16] ISO/IEC. IS 11172: Information technology - Coding of moving pictures and
associated audio for digital storage media at up to about 1.5 Mbit/s, 1993. (MPEG-1).
[17] ISO/IEC. IS 13818-2: Information technology - Generic coding of moving
pictures and associated audio information: Video, 1995. (MPEG-2 Video).
[18] ISO/IEC. IS 13818-3: Information technology - Generic coding of moving
pictures and associated audio information - part 3: Audio, 1995. (MPEG-2 Audio).
[19] ISO/IEC. IS 13818: Information technology - Generic coding of moving pictures and associated audio information: Systems, 1995. (MPEG-2 Systems).
[20] ISO/IEC. IS 13818-7: Information technology - Generic coding of moving
pictures and associated audio information - part 7: Advanced audio coding (AAC),
1997. (MPEG-2 AAC).
[21] ISO/IEC. IS 14772-1: Information technology - computer graphics and image
processing - the virtual reality modeling language - part 1: Functional specification
and UTF-8 encoding, 1997. (VRML).
[22] ISO/IEC. IS 14496: Information technology - coding of audio-visual objects,
1999. (MPEG-4).
[23] ISO/IEC. IS 14496: Information technology - coding of audio-visual objects -
part 1: Systems, 1999. (MPEG-4 Systems).
[24] ISO/IEC. IS 14496: Information technology - coding of audio-visual objects -
part 2: Visual, 1999. (MPEG-4 Video).
[25] ISO/IEC. IS 16500: Information technology - generic digital audio-visual
systems, 1999. (DAVIC).
[26] ISO/IEC. Report of the formal verification tests on advanced coding efficiency
ACE (formerly Main Plus) profile in version 2. Public document, ISO/IEC JTC
1/SC 29/WG 11 N2824, July 1999.
[27] ISO/IEC. Report of the formal verification tests on MPEG-4 coding efficiency
for low and medium bit rates. Public document, ISO/IEC JTC 1/SC 29/WG 11
N2826, July 1999.
[28] ISO/IEC. CD 15938-3: MPEG-7 multimedia content description interface -
part 3: Visual. Public document, ISO/IEC JTC1/SC29/WG11 W3703, La Baule,
October 2000.
[29] ISO/IEC. CD 15938-5: Information technology - multimedia content description interface - part 5: Multimedia description schemes. Public document,
ISO/IEC JTC1/SC29/WG11 N3705, La Baule, October 2000.
[30] ISO/IEC. IS 14496: Information technology - coding of audio-visual objects -
part 3: Audio, 2000. (MPEG-4 Audio).
[31] ISO/IEC. IS 14496: Information technology - coding of audio-visual objects
- part 6: Delivery multimedia integration framework (DMIF), 2000. (MPEG-4
DMIF).
[32] ISO/IEC. Overview of the MPEG-7 standard (version 4.0). Public document,
ISO/IEC JTC1/SC29/WG11 N3752, La Baule, October 2000.
[33] ISO/IEC. MPEG-7 visual part of eXperimentation Model Version 9.0. Public
document, ISO/IEC JTC1/SC29/WG11 N3914, Pisa, January 2001.
[34] ITU-R. BT.601-5: Studio encoding parameters of digital television for standard
4:3 and wide-screen 16:9 aspect ratios, 1998. (Formerly CCIR 601).
[35] ITU-T. Recommendation G.711: Pulse code modulation (PCM) of voice frequencies, 1988.
[36] ITU-T. Recommendation G.722: 7 kHz audio-coding within 64 kbit/s, 1988.
[37] ITU-T. Recommendation G.723.1: Dual rate speech coder for multimedia
communications transmitting at 5.3 and 6.3 kbit/s, 1988.
[38] ITU-T. Recommendation G.728: Coding of speech at 16 kbit/s using low-delay
code excited linear prediction, 1992.
[39] ITU-T. Recommendation T.81 - Information technology - Digital compression
and coding of continuous-tone still images - Requirements and guidelines, 1992.
(JPEG).
[40] ITU-T. Recommendation H.261: Video codec for audiovisual services at p x 64
kbit/s, 1993.
[41] ITU-T. Recommendation V.34: A modem operating at data signaling rates of
up to 28,800 bit/s for use on the general switched telephone network and on leased
point-to-point 2-wire telephone-type circuits, 1994.
[42] ITU-T. Recommendation H.262: Information technology - generic coding of
moving pictures and associated audio information: Video, 1995.
[43] ITU-T. Recommendation H.324: Terminal for low bitrate multimedia communication, 1995.
[44] ITU-T. Recommendation H.223: Multiplexing protocol for low bit rate multimedia communication, 1996.
[45] ITU-T. Recommendation H.320: Narrow-band visual telephone systems and
terminal equipment, 1997.
[46] ITU-T. Recommendation V.25ter: Serial asynchronous automatic dialling and
control, 1997.
[47] ITU-T. Recommendation H.225.0: Call signaling protocols and media stream
packetization for packet based multimedia communications systems, 1998.
[48] ITU-T. Recommendation H.245: Control protocol for multimedia communication, 1998.
[49] ITU-T. Recommendation H.263: Video coding for low bit rate communication,
1998.
[50] ITU-T. Recommendation H.323: Packet-based multimedia communications
systems, 1998.
[51] ITU-T. Recommendation V.8 bis: Procedures for the identification and selection of common modes of operation between data circuit-terminating equipments
(DCEs) and between data terminal equipments (DTEs) over the public switched
telephone network and on leased point-to-point telephone-type circuits, 1998.
[52] ITU-T. Recommendation V.8: Procedures for starting sessions of data transmission over the public switched telephone network, 1998.
[53] R. Koenen. Profiles and levels in MPEG-4: approach and overview. Signal
Processing: Image Communication, 15(4-5):463-478, 2000.
[54] J. L. Mitchell, W. B. Pennebaker, C. E. Fogg, and D. J. LeGall. MPEG video
compression standard. Digital Multimedia Standards Series. Chapman and Hall,
Bonn, 1996.
[55] F. S. Mokhtarian, S. Abbasi, and J. Kittler. Robust and efficient shape indexing
through curvature scale space. In British Machine Vision Conference, pages 53-62,
Edinburgh, UK, 1996.
[56] H. G. Musmann and J. Klie. TV transmission using a 64 kbit/s transmission
rate. In International Conference on Communications, pages 23.3.1-23.3.5, 1979.
[57] S. Okubo. Reference model methodology - a tool for the collaborative creation
of video coding standards. Proceedings of the IEEE, 83(2):139-150, February 1995.
[58] M. T. Orchard and G. J. Sullivan. Overlapped block motion compensation:
An estimation-theoretic approach. IEEE Trans. Image Process., 3:693-699, 1994.
[59] J. Ostermann and A. Puri. Natural and synthetic video in MPEG-4. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 3805-3809, November 1998.
[60] A. Puri. Video coding using the MPEG-2 compression standard. In SPIE Visual Communications and Image Processing, volume 1199, pages 1701-1713, November 1993.
[61] A. Puri and A. Wong. Spatial domain resolution scalable video coding. In
SPIE Visual Communications and Image Processing, volume 1199, pages 718-729,
November 1993.
[62] A. Puri, L. Yan, and B. G. Haskell. Temporal resolution scalable video coding.
In International Conference on Image Processing (ICIP 94), volume 2, pages 947-951, November 1994.
[63] P. Salembier and J. R. Smith. MPEG-7 multimedia description schemes. IEEE
Trans. Circuits and Systems for Video Technology, 2001, to appear.
[64] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson. RTP: A transport protocol for real-time applications. RFC 1889, IETF, available from
ftp://ftp.isi.edu/in-notes/rfc1889.txt, January 1996.
[65] J. Signes, Y. Fisher, and A. Eleftheriadis. MPEG-4's binary format for scene
description. Signal Processing: Image Communication, 15(4-5):321-345, 2000.
[66] J. Signes, Y. Fisher, and A. Eleftheriadis. MPEG-4: Scene representation and
interactivity. In A. Puri and T. Chen, editors, Multimedia Systems, Standards,
and Networks. Marcel Dekker, 2000.
[67] A. M. Tekalp and J. Ostermann. Face and 2-D mesh animation in MPEG-4.
Signal Processing: Image Communication, Special Issue on MPEG-4, 15:387-421,
January 2000.
[68] J. Verne. In the twenty-ninth century. The day of an American journalist in
2889. In Yesterday and Tomorrow, pages 107-124, London, 1965. Arco. Translated
from French, original text 1889.
[69] T. Wiegand, M. Lightstone, D. Mukherjee, T. Campell, and S. K. Mitra.
Rate-distortion optimized mode selection for very low bit rate video coding and
the emerging H.263 standard. IEEE Trans. Circuits Syst. for Video Technology,
6(2):182-190, Apr. 1996.