CENTER FOR THE COMMERCIALIZATION OF INNOVATIVE
TRANSPORTATION TECHNOLOGY
(CCITT)
USDOT UNIVERSITY TRANSPORTATION CENTER
NORTHWESTERN UNIVERSITY
FINAL REPORT
iTRAC: Intelligent Video Compression for Automated Traffic
Surveillance Systems
Principal Investigators
Sotirios A. Tsaftaris, Research Assistant Professor
Aggelos K. Katsaggelos, Professor
Student
Eren Soyak
Department of Electrical Engineering and Computer Science
August 1, 2010
Disclaimer
The contents of this report reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein.
This document is disseminated under the sponsorship of the Department of
Transportation University Transportation Centers Program, in the interest
of information exchange. The U.S. Government assumes no liability for the
contents or use thereof.
This work was funded by the Northwestern Center for the Commercialization of Innovative Transportation Technology (CCITT).
CCITT (http://www.ccitt.northwestern.edu) is a University Transportation Center funded by the Research and Innovative Technology Administration (http://www.rita.dot.gov/) of USDOT, operated within the Northwestern University Transportation Center in the Robert R. McCormick School of
Engineering and Applied Science (http://www.mccormick.northwestern.edu).
Prof. Sotirios A. Tsaftaris is with the Northwestern University Departments of Electrical Engineering & Computer Science and Radiology. He can
be reached at [email protected]
Prof. Aggelos K. Katsaggelos is with the Northwestern University Department of Electrical Engineering & Computer Science. He can be reached
at [email protected]
Eren Soyak is with the Northwestern University Department of Electrical
Engineering & Computer Science. He can be reached at [email protected]
Chapter 1
Introduction
1.1
Project Summary
Non-intrusive video imaging sensors are commonly used in traffic monitoring and surveillance. For some applications it is necessary to transmit the video data over communication links. However, the high bitrate requirements of video mean either expensive wired communication links or heavy compression to keep the video data within the allowed communications bandwidth. Current video imaging solutions utilize aging video compression standards and require dedicated wired communication lines. Recently H.264 (a newer standard) has been proposed for use in transportation applications. However, most video compression algorithms are not optimized for traffic video data and do not take into account the data analysis that will follow, either in real time at the control center or offline. As a result of compression, the visual quality of the data may be low, but more importantly, as our research efforts in vehicle tracking have shown, tracking accuracy and efficiency are severely affected. iTRAC aims to inject highway content-awareness into the H.264 encoding standard. Our technology operates within the computational limits of consumer-grade hardware. With the resulting reduction in bitrate we envision providing a portable, easy to deploy, low-cost, low-power, wireless video imaging sensor.
1.2
Project Goals
Non-intrusive video imaging sensors are commonly used in traffic monitoring
and surveillance [1]. They are the only cost effective solution that yields information on a large field of view that allows for real time monitoring of video
feeds and video archiving for forensic or traffic analysis applications. Other imaging solutions (e.g., Autosense Solo) can only count and identify vehicles and measure instantaneous speed, without providing any information on the path a vehicle took in an area of interest. Video imaging is the only modality that observes a vehicle’s trajectory (path), which subsequently allows us
to study driver behavior and its possible effects on congestion. Recently,
automated video analysis has been suggested for the extraction of a vehicle’s trajectory, speed, and type (car, truck, etc) for a variety of applications
[2, 3]. Video data are compressed to reduce the amount of information being
transmitted or stored. Even with recent video standards (H.264) the bit-rate
is high forcing the use of dedicated wired lines (T1 or fiber optic lines). A
low cost, low power, wireless video imaging sensor that could be easily deployed over areas of interest, will enable transportation officials to monitor
these areas without a large investment in infrastructure and time-consuming
planning. If the video feed is post-processed by computers to extract the trajectories of each vehicle, the quality of the data has a large impact on the accuracy of the tracking. Therefore, it is critical to maintain tracking
efficiency in the presence of compression.
Through our research we have identified that the quality of the transmitted/archived video is critical for the accurate detection and tracking of
vehicles, humans, or even animals. Video parameters such as resolution,
frame rate, and data rate are quite critical and each have a direct impact on
the performance of many target tracking algorithms. For example, if the resolution is small, and the camera has a wide field of view, targets can become
too small to be trackable [4, 5]. In addition, weather and lighting conditions
can affect the accuracy of tracking algorithms.
Herein we propose iTRAC for H.264, an intelligent algorithmic module
to be used in conjunction with the H.264 encoding standard. Compression
algorithms in general tend to be content agnostic, aiming to minimize the
video data rate while maintaining requested video quality as expressed by
an objective quality metric (e.g., mean squared error). We move away from
this common approach and provide a content aware system based on the
H.264 codec that is designed to minimize the compressed video data rate
while maintaining detection accuracy. iTRAC places special focus on moving objects or targets of interest and compresses them with such quality that
detection and tracking accuracy are maintained at high levels. The question
the encoder has to answer is how much data can be removed such that the
decoder can still detect and track the objects of interest as if there were no
compression at all. So in our case quality is defined simply as the accuracy
of the tracking result. In fact, the Federal Highway Administration defines
quality as “the fitness of data for all purposes that require it” [6]. We should
note that even if a human monitors the video feed in real time, our proposed approach will assist them, since our video data will provide higher visual
fidelity on the moving targets (vehicles) as compared to a content-agnostic
H.264 implementation.
1.3
Outline
In this work we discuss the various technologies that when used individually
or in conjunction with each other implement the iTRAC system. This report
is organized as follows. In Chapter 2 we introduce the problem of optimal
video compression for video surveillance of vehicle traffic, and review existing
work in the literature concerning traffic video tracking and content-specific
video compression. In Chapter 3 we present a method of spatial resource
concentration via Region of Interest (ROI) coding for video compression; this
work has appeared in [7]. In Chapter 4 we present an algorithm to optimize
tracking accuracy for a given bitrate by concentrating available bits in the
frequency domain on the features most important to tracking; this work is
to appear in [8]. Finally in Chapter 5 we present concluding remarks.
Chapter 2
Background
In this chapter we present a review of the state of the art in Traffic Surveillance Systems, focusing on the areas of Video Compression and Video Object Tracking as relevant to the field. This review lays the groundwork for
our novel algorithms and proposed future work in subsequent chapters.
2.1
Real-world Traffic Surveillance Systems
As a natural extension of modern urbanization, increasing vehicle traffic in
populated areas has created a need for automated regulation. The current
trends in urban traffic volume indicate that surveillance and control systems
capable of a diverse range of tasks need to be made available at most medium- and high-utilization roads. The high-level needs such systems must address
include the following:
• gathering low-complexity statistics such as congestion or average vehicle velocity
• gathering high-complexity statistics such as driver behavior or road
conditions
• recording events of interest for purposes such as security, accident documentation or law enforcement
• automatic responses to predefined events such as speed limit violations
or accidents.
By this definition it is clear that the desired systems must possess the capability for higher-order tasks such as identifying vehicles or responding to
“risky” driver behavior. Such capability will require an architecture capable
of complex tasks yet affordable enough to make the required wide-scale deployment feasible. A sample study of capabilities expected from an intelligent
surveillance system is presented in [9].
Current traffic surveillance systems for the most part make use of mature
solutions such as inductor cables embedded in roads to count passing cars and
fixed or handheld radar units for speed detection. Newer technologies that
have seen recent deployment include video surveillance systems to record and
respond to low-complexity events such as red light infractions or improper
safety lane usage. However, even these newer systems are limited in the range
of tasks they can accomplish, and do not possess the capability to address
most of the needs described above.
The system for which algorithms will be proposed in this work is a “centrally controlled” traffic surveillance application. Such a system is composed of a node-based structure, with low-cost, easy-to-install remote camera units
whose small size allows them to blend with the rest of the urban infrastructure. These remote nodes capture and compress video for transmission over a
wireless link to a central processing station, where the bulk of the processing
capability of the system resides. Such a system, given its centrally located
processing power, is unconstrained in the complexity of tasks it can undertake, and yet is still relatively affordable and easy to deploy with sufficient
coverage given the simplicity of its remote nodes. The parameters for the
system would be as follows:
• Low-power nodes requiring only a power connection in terms of physical
links.
• Low-cost and easy to deploy wireless remote nodes, mountable on existing infrastructure such as poles or traffic signals.
• Full-duplex wireless communication channel between central processing
and remote nodes, carrying compressed video on the uplink (remote to
base) and control information on the downlink (base to remote).
• Powerful central processing station where records are kept, statistics
are gathered and automatic responses are generated. Remote nodes
are controlled from here, allowing the system to adapt to changing
conditions such as weather, day/night or even new functionality (implemented via a software update).
Such systems are very difficult to build with existing technology due to
the poor performance of computer vision algorithms on compressed video.
Given the bandwidth limitations of wireless channels, the limited processing
power available at remote nodes and the real-time operating constraints of
many desirable traffic surveillance applications the compressed video that
is transmitted to the central processing station is typically quite poor in
quality. Moreover, most computer vision algorithms that are to be used rely
on models based on the nature of the video content they seek to process.
Such models may no longer be realistic for video distorted by compression.
On the other hand, the bandwidth requirement to send video compressed
at quality that is acceptable for tracking algorithms is not commonly available
in wireless environments, typically requiring expensive dedicated channels or
hard to install and costly to maintain landlines – in [10] a study discussing
typical costs of video surveillance is presented. Given these parameters, wide
scale deployment of effective traffic surveillance systems are not feasible due
to the cost of installation and maintenance. In [11] an example can be seen
of how even modest gains in the compression subsystem can make drastic
changes to the feasibility of real-world traffic surveillance systems.
2.2
Video Compression for Traffic Surveillance
Compression artifacts are debilitating for tracking applications. In reviews
of object tracking presented in [12] and [13] it is shown that most algorithms
focus on the following three features in video to track objects:
• spatial edges
• color histograms
• detected motion boundaries.
Coding artifacts introduced by motion compensated video compression impact all three of these features – color histograms are distorted, true edges
are smeared, and artificial edges are introduced. As a result the estimated
motion field of pixels is sometimes significantly distorted. Other artifacts attributed to heavy quantization are contouring and posterizing (in otherwise
smooth image gradients), staircase noise along curving edges, and “mosquito
noise” around edges. Artifacts attributed to the time aspect of video are motion compensation errors and quantization drift. Compensation errors arise
from the fact that motion compensation does not aim at finding the true
motion of objects but rather the most similar object in a limited search area.
For example, heavily quantized but motionless areas such as the road surface
will flicker with time, appearing to have different intensity. Subsampling of
chroma components (typically from 4:4:4 to 4:2:0) in the YUV colorspace
further reduces the accuracy of color histogram based tracking.
These artifacts and distortions decrease the accuracy of computer vision
based tracking algorithms. Fig. 2.1 offers examples of such distortions.
The left column shows sample images from video sequences, the top being
uncompressed, the center compressed at a ratio of 10^2:3, and the bottom at a ratio of 10^4:3. For each video a background model is computed by taking
the median intensity of each pixel over time, which is then subtracted from
each frame to give an error image (shown in center column). This error is
used to locate objects in each frame, even if they have not moved since the
previous frame. The pixel intensity histograms of the images (shown in right
column) are used to associate objects from different frames, thereby tracking
each object across time. Note that blocking artifacts due to quantization
are much more pronounced in the higher compression ratio video. Distorted
edges and artificial smudges in the difference data impair gradient based
tracking efforts. The intensity histogram is seen to be significantly distorted
for the 10^4:3 case – the artificially introduced peaks make histogram-based
tracking more difficult.
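To make the analysis concrete, the following sketch reproduces the background-model and histogram computation described above. It is an illustration rather than the exact code used to generate Fig. 2.1, and it assumes grayscale video supplied as a (T, H, W) numpy array; the function name is our own.

```python
import numpy as np

def background_error_histogram(frames):
    """Median background, error images and intensity histogram.

    frames is assumed to be a (T, H, W) uint8 array of grayscale video.
    """
    # Background model: per-pixel median intensity over time.
    background = np.median(frames, axis=0)

    # Error image per frame: deviation from the background, which
    # localizes objects even when they have not moved between frames.
    error = np.abs(frames.astype(np.float64) - background)

    # Normalized 256-bin intensity histogram, the feature used to
    # associate objects across frames.
    hist, _ = np.histogram(frames, bins=256, range=(0, 256), density=True)
    return background, error, hist
```

Heavy quantization distorts all three outputs at once, which is why the tracking degradations described above compound each other.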
The subject of standard-compliant video compression specifically optimized for subsequent tracking has been explored as early as [14] in the context
of MPEG compression, where the focus is on concentrating (consolidating)
bitrate on a Region of Interest (ROI). More recently in [15] a more elaborate approach that adds higher level elements such as motion field correction
filtering is proposed in the context of H.263. In [16] a method of using automatic resizing of ROIs detected by video encoder motion estimation in
conjunction with object tracking is presented, where the ROI detection relies
on motion estimation capturing true motion (and not for example best block
match) for good results. In [17] a method of using ROIs to focus limited processing power on highest gain encoder components in the context of H.264
is presented. In [18] an algorithm that specifically does not track individual
vehicles, but rather operates in the compressed domain to detect traffic con8
Figure 2.1: Compression effects on vehicle tracking. The top row is a sample of uncompressed video, its error image vs. the background (median frame), and its intensity histogram respectively. The middle row video was compressed at a ratio of 10^2:3, the bottom row at 10^4:3.
These methods are all low in complexity, but rely on information
generated by the encoder (such as motion vectors or macroblock types) to
limit computation.
2.3
Vehicle Tracking
The field of video object tracking is quite active, with various solutions offering strength/weakness combinations suitable for different applications. For
urban traffic video tracking most applications involve a background subtraction component for target acquisition such as the one developed in [19], and
an inter-frame object association component such as the ones developed in
[20, 21].
Each of these algorithms has its own strengths and weaknesses, and there
is no universally accepted gold-standard object tracking algorithm even in
the specific context of traffic surveillance. The computational complexity of
object tracking algorithms is the main motivation for our work: if such algorithms were simple enough to deploy on low-cost embedded systems it would
be feasible to perform object tracking directly on raw video data without
the need for compression and transmission to a central processing location.
In [22] an in-depth study of the processing burden of state-of-the-art video
tracking systems, including those proposed in [20] and [21], is presented. The
reported complexity of tracking systems analyzed in this study is helpful in
illustrating the unfeasibly high cost of implementing such functionality on a
multitude of remote nodes.
Applications similar to traffic tracking are also relevant to our discussion.
In [23] a survey of video processing techniques for traffic applications is presented, some of which are directly relevant to the pre-processing methods
proposed in this work. In [24] a method of vehicle counting for traffic congestion estimation is presented, a capability that would be very useful where
only an estimate of congestion (but not higher order statistics) is required.
In [25] a review of on-road vehicle detection techniques is presented, where
the camera acquiring the video for tracking is not statically elevated over the
road but instead located within a vehicle in motion on the road itself. Clearly
to be of value such methods need to both be realizable in real-time and to
be of complexity manageable by embedded systems feasible for deployment
on individual vehicles, making them quite relevant to our low-complexity pre-compression algorithms. In [26] a method of lane estimation and tracking
is presented. Lane extraction is of interest to our work in that it can be
used to focus our compression resources on video regions of greatest interest,
and can even be used to guide compression itself as in [15]. In [27] the presented method of road extraction in aerial images can serve as an example of
the challenges and complexity of the problem of road extraction in complex
images.
Chapter 3
Utility-Based Segmentation
and Coding
3.1
Introduction
In this chapter we propose a method of spatial resource concentration for
video compression which is designed to operate on remote nodes in our target system. Given that these remote nodes have low processing power and
memory, our algorithm maintains low requirements for both resources. Our
target technology is a video tracker, and therefore this algorithm seeks to
optimize for tracker performance while minimizing the bitrate required to
transmit the compressed video from the remote node to the central processing station.
The subject of standard-compliant video compression specifically optimized for later tracking has been explored as early as [14] in the context
of MPEG which focuses on concentrating (consolidating) bitrate on a Region of Interest (ROI). More recently in [15] a more elaborate approach that
adds higher level elements such as motion-field correction filtering is proposed in the context of H.263. In [16] a method of using automatic resizing
of ROIs detected by video encoder motion estimation in conjunction with
object tracking is presented, where the ROI detection relies on motion estimation capturing true motion for good results. In [17] a method of using
ROIs to focus limited processing power on highest gain encoder components
in the context of H.264 is presented. These methods are all low in complexity,
but rely on information generated by the encoder (such as motion vectors or
macroblock types) to limit computation.
We propose a computationally efficient ROI extraction method, which
is used during standard-compliant H.264 encoding to consolidate bitrate in
regions in video most likely to contain objects of tracking interest (vehicles).
The algorithm is low in complexity and requires limited modification of the
video compression module. Thus it is easily deployable in non-specialized
low processing power remote nodes of centralized traffic video systems. It
makes no assumptions about the operation of the video encoder (such as its
motion estimation or rate control methods) and is thus suitable for use in a
variety of systems.
3.2
Kurtosis-based Region Segmentation
The proposed algorithm optimizes bit allocation for video compression such
that the available bitrate is consolidated on regions that are expected to contain objects of tracking interest. The algorithm derives (and maintains) the
ROI by a non-parametric model based on the temporal distribution of pixel
intensities. The goal is to isolate a map of pixels which in a given analysis
window show a sharp intensity variation. Rather than regions undergoing
constant change, such as trees, fountains or reflections of the sky, we are
interested in regions undergoing periods of dramatic change such as roads,
whose intensity changes primarily due to passing cars.
In order to detect such regions we use the kurtosis of intensities for each
pixel position over time, defined as
$$\kappa(x) = \frac{\mu_4}{\sigma^4} - 3 = \frac{\frac{1}{n}\sum_{i=0}^{n-1}(x_i - \bar{x})^4}{\left(\frac{1}{n}\sum_{i=0}^{n-1}(x_i - \bar{x})^2\right)^2} - 3, \qquad (3.1)$$
where $x_i$ are the intensities of a pixel at the same spatial position over $n$ samples, and $\bar{x}$ is their mean value. By this normalized
definition the Gaussian distribution has an excess kurtosis of 0. A higher
kurtosis value indicates that the variance of a given distribution is largely
due to fewer but more dramatic changes, whereas a lower value indicates that
a larger number of smaller changes took place. In this aspect kurtosis, used
for a similar method of feature extraction in [28], is a better indicator of the
desired behavior than variance.
To identify a threshold that will help us in isolating areas of interest we
follow a probabilistic approach in modeling areas of interest. Video capture
noise is modeled as additive Gaussian, which is known to have an excess kurtosis of 0.
Therefore, regions of the scene without motion should have excess kurtosis 0.
Movement due to objects such as trees is modeled as a Mixture of Gaussians
(excess kurtosis of 0 by the additive property of kurtosis). The desired type
of motion will be modeled as a Poisson process, which is commonly used
for traffic analysis and is distributed exponentially (with excess kurtosis 6).
Therefore we set our model as X = N + M , where N is Gaussian noise and
M is any movement that occurs on top of it. M is classified as V (motion to
be tracked, such as vehicles) or T (motion to be ignored, such as trees). We
set M = {T if κ(X) ≤ threshold, else V }. The ROI is set to 1 for V and 0
for T type pixel positions.
While an online optimization to set the kurtosis threshold is possible
within a hypothesis testing framework, given the low computational cost requirement of the system a fixed threshold approach is proposed. We therefore
propose to use the threshold of 3, the midpoint between the two models' excess kurtosis values. Note that this method of modeling traffic as a Poisson process
is suitable for common urban and highway traffic, but will not perform well
in extreme cases of bumper to bumper congested traffic.
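A minimal sketch of the segmentation step follows, assuming the grayscale frames of one analysis window are supplied as a (T, H, W) numpy array; the function name and the epsilon guard are our own, but the moment computation follows Eq. 3.1 and the fixed threshold discussed above.

```python
import numpy as np

def kurtosis_roi(frames, threshold=3.0):
    """Excess-kurtosis ROI map for one analysis window (Eq. 3.1).

    frames is assumed to be a (T, H, W) array covering one window
    (e.g., 3 seconds of video); the default threshold of 3 is the
    midpoint between the Gaussian (0) and exponential (6) models.
    """
    x = frames.astype(np.float64)
    dev = x - x.mean(axis=0)
    m2 = (dev ** 2).mean(axis=0)      # biased variance per pixel
    m4 = (dev ** 4).mean(axis=0)      # fourth central moment per pixel
    eps = 1e-12                       # guard for flat (constant) pixels
    kappa = m4 / (m2 ** 2 + eps) - 3.0
    return kappa > threshold          # True = V (track), False = T (ignore)
```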
During encoding, for each frame the extracted ROI is used to suppress
the Displaced Frame Difference (DFD) that is encoded. This is done by
implementing the following change in the rate distortion optimization:
$$m_i = \arg\min_m \left\{ w_i^d \cdot \text{Distortion}_i + \lambda \cdot \text{Rate}_i \right\} \qquad (3.2)$$
where $w_i^d$ is set equal to 0 for areas outside the ROI and equal to 1 for those
within. Note that simply skipping macroblocks outside the ROI will cause the
decoder to possibly infer motion for these regions given H.264 spatial motion
vector prediction. Therefore this step is necessary to code “zero motion”
blocks outside the ROI, limiting motion prediction across ROI boundaries
via explicitly coded zero motion vectors. While such a binary scheme is
not necessarily optimal compared to one with more degrees of flexibility, it
is preferable due to the negligible extra computation it adds to the overall
system.
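As a toy illustration of the modified cost in Eq. 3.2 (not the JM encoder's actual rate-distortion code), the per-macroblock mode decision could be sketched as follows, with `lam` standing in for the encoder's Lagrange multiplier:

```python
def roi_weighted_cost(distortion, rate, in_roi, lam):
    """Toy version of the Lagrangian mode cost of Eq. 3.2.

    in_roi is the binary ROI flag (the weight w_i^d): outside the ROI
    the distortion term vanishes, so the cheapest-rate candidate --
    an explicitly coded zero-motion block -- wins the mode decision.
    """
    w = 1.0 if in_roi else 0.0
    return w * distortion + lam * rate
```

The best mode is then the minimum of this cost over candidate modes, exactly as in unmodified rate-distortion optimization.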
3.3
Experimental Results
The video compression experiments presented herein have been performed
using original and modified versions of the JM (H.264/14496-10 AVC Reference Software) v16.0.
Figure 3.1: Sample frames from “Camera6” and “I-90” sequences (top), their
manually segmented ROI for analysis (center) and automatically extracted
kurtosis-driven ROI for encoding (bottom).
Given that the primary interest is in tracking vehicles,
in our experiments the reconstructed results are analyzed for performance
within the manually derived ROI.
The “I-90” sequence (720x480 @30Hz) was shot on DV tape and is therefore high quality. The “Camera6” content (640x480 @15Hz) was acquired
under the NGSIM license courtesy of the US FHWA; it was MPEG-4 compressed during acquisition and is significantly noisier. Kurtosis estimation
was initialized and updated using 3 second windows (one update per temporal window). While the experiments were executed in MATLAB, the computation and memory requirements are low enough for mobile and embedded
platform implementations. The modifications to the H.264 encoder were
compartmentalized enough to make adding the algorithm to mature products feasible.
In Fig. 3.1 we show some sample detected and manually extracted ROI.
Note that in the figure “I-90” has a detected ROI much closer to the manually
extracted version than “Camera6” – this is because the observer manually
extracting the ROI was asked to mark “areas of interest to urban traffic”,
whereas the kurtosis-based ROI detection algorithm accumulates areas where
cars have actually been within its analysis window. This difference is a
benefit for the detector in that it focuses the ROI to a region where activity
has been reported and not a region where activity could theoretically take
place in the future.
In order to analyze total distortion in tracking we focus on two separate metrics: one measuring the degradation of a tracker's ability to find targets on each frame, and the other its ability to associate these targets
as the same object across frames. For the first the “Bounding Box Overlap Ratio” (BBOR) metric is used. This metric maintains a simple median
background model (updated once per window), which it uses for background
subtraction. The resulting foreground on each frame is thresholded using the
method presented in [29, 30] and processed with morphological operators before bounding boxes (BB) are extracted. For comparing sequences $S_1$ (baseline) and $S_2$ (compressed), the BBOR is defined as

$$\text{BBOR} = \frac{|BB(S_1) \cap BB(S_2)|}{|BB(S_1)|},$$

where $\cap$ denotes the intersection and $|\cdot|$ the cardinality of the sets. Since our
main interest is in tracking vehicles, the manual ROI, which corresponds to
regions vehicles can be found such as roads and parking lots, is used to mask
the video after compression. In our experiments this simulates a specialized
tracker which targets only vehicles.
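A minimal sketch of the BBOR computation follows, assuming the bounding boxes of each sequence have been rasterized into binary masks so that set intersection and cardinality reduce to pixel counting; the manual ROI masking is assumed to have been applied beforehand, and the empty-baseline guard is our own.

```python
import numpy as np

def bbor(mask_s1, mask_s2):
    """Bounding Box Overlap Ratio between baseline and compressed runs.

    mask_s1 and mask_s2 are binary images rasterizing the unions of
    bounding boxes BB(S1) and BB(S2) respectively.
    """
    bb1 = np.asarray(mask_s1, dtype=bool)
    bb2 = np.asarray(mask_s2, dtype=bool)
    # max(..., 1) avoids division by zero when the baseline is empty.
    return np.logical_and(bb1, bb2).sum() / max(bb1.sum(), 1)
```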
A higher value of the BBOR indicates that targets (not necessarily the
same targets from frame to frame) were found in more similar spatial locations between the two sequences being compared. In Fig. 3.2 BBOR results
comparing pre-compression performance to that of default encoding vs. encoding focusing on detected and manual ROIs are presented. Note that
at higher bitrates our algorithm provides significant bitrate reduction given
encoder sensitivity to noise and peripheral “uninteresting” motion (trees,
fountains) – bitrate savings of up to 75% for “I-90” and 50% for “Camera6”
were seen with negligible difference in BBOR. While such large savings are
not maintained at lower bitrates, even at the lowest analyzed bitrate results
never fall below 5-10% savings. The larger savings seen in “I-90” compared
to “Camera6” can be attributed to “I-90” having a simpler and smaller ROI
and a smaller disparity between the detected and manually extracted
ROIs.
For the second analysis the “Mean Shift” tracking method proposed in [20]
and implemented in the OpenCV project available in [31] is used. The metrics
used in this case are number of “false positives” and “false negatives”. Given
that various traffic tracking applications can prefer one type of error to the
other a separate analysis is presented for each. Note that the measurements
for these metrics are done on an observation basis, and while the experiments
have been controlled by averaging repeated tests some degree of subjective
variability is expected. In Figs. 3.3 and 3.4 the number of errors in sample
Mean Shift tracking in uncompressed and compressed sequences are shown.
Note that in all cases an increase in errors is observed for the mid-range
bitrates, where the error numbers go up from high to mid rates and then back
down for the low rates. This behavior can be attributed to the smoothing
effect of coarse quantization removing error-causing features from the video
as the bitrate goes down. It is interesting to observe that the increase in
errors corresponds to 100Kbps - 1Mbps range, which is the operating space
that would be commonly used for acceptable visual quality applications. Also
note that for the “Camera6” sequence, where the detected and manual ROIs
differ, the detected ROI mostly outperforms the manual ROI.
In [32] a quality metric is proposed for tracking that combines scores for
edge sharpness, color histogram preservation and motion boundary sharpness of tracked silhouettes. While this score also covers all features most
significantly degraded by video compression, our metrics were chosen for
their simplicity. Complex metrics which analyze the sharpness of target segmentation or the stability of inter-frame association are available but not
universal.
3.4
Conclusions
We have proposed a novel method of using pixel intensity kurtosis to consolidate video compression bitrate on an ROI incorporating tracked object
trajectories. We have demonstrated that such an approach can lead to up to
75% bitrate savings for comparable tracking performance, and have shown
that an ROI derived by our method of extraction results in performance close
to a manually derived one. The reduction in required bandwidth coupled
with its relatively low processing and memory overhead make the algorithm
attractive for deployment on remote nodes of centralized traffic video tracking applications. The next step is the derivation of online low-complexity
optimization methods for the kurtosis threshold and the number of frames
needed in the analysis window.
(a) “I-90” BBOR
(b) “Camera6” BBOR
Figure 3.2: Bitrate vs BBOR for “I-90” and “Camera6” sequences.
(a) “I-90” false positives
(b) “Camera6” false positives
Figure 3.3: “I-90” and “Camera6” tracking false positive errors as a function
of bitrate.
(a) “I-90” false negatives
(b) “Camera6” false negatives
Figure 3.4: “I-90” and “Camera6” tracking false negative errors as a function
of bitrate.
Chapter 4
Tracking-Optimal Transform
Coding
4.1
Introduction
In this chapter we present an algorithm to optimize tracking accuracy for a
given bitrate by concentrating available bits in the frequency domain on the
features most important to tracking. We also present a tracking accuracy
metric which is more advanced than that used in Chapter 3, combining multiple pertinent metrics into a single measure which we use to iteratively drive
optimization. Our proposed algorithm is similar to the trellis-based R-D optimization presented in [33] in that it seeks to optimize for a given target
by manipulating quantized transform coefficients. However in our work we
optimize for tracking accuracy rather than fidelity, and work on a sequence
level as opposed to an individual transform level. This work is to appear in
[8].
Given the special parameters of centrally controlled traffic surveillance
systems, it is necessary to limit resource requirements, such as for memory
and processing power, for any technique seeking to counter the effects of
video distortion on tracking. Our algorithm is low in complexity and is readily deployable as a simple modular add-on to low processing power remote
nodes of centralized traffic video systems. It makes no assumptions about the
operation of the video encoder (such as its motion estimation or rate control
methods) and is thus suitable for use in a variety of systems. The resulting bitstreams are standard-compliant, thereby guaranteeing interoperability
Figure 4.1: Typical centrally controlled tracking system. Video of objects
to be tracked is acquired (with capture noise Ncap ) at a remote location,
compressed (with encoding distortion Nenc ), and transmitted over a channel
(with channel distortion Nchan ). At the receiver the transmission is decoded,
post-processed and passed on to the tracker.
with other standard-compliant systems.
4.2
Frequency Decomposition of Tracking Features
The active field of video object tracking contains a large variety of algorithms,
yet most of these systems share some fundamental concepts. In reviews of
object tracking presented in [12] and [13] it is shown that most algorithms operate by modeling and segmenting foreground and background objects. Once
the segmentation is complete and the targets located, the targets are tracked
across time based on key features such as spatial edges, color histograms and
detected motion boundaries. The segmentation models and key features for
a particular tracking application are chosen based on the application’s goals
and parameters. For example, color histograms can be useful when tracking
highway vehicle activity during the day, but can be less useful under low light
conditions at night.
Compression artifacts are especially debilitating for video tracking applications. In a scenario where the video is distorted, the performance of
the tracking algorithm may suffer as the foreground/background models become less realistic and key tracking features more difficult to identify. In Fig.
4.1 a typical centrally controlled tracking system is shown, where the video
is captured at a remote location and must be transmitted to a central location for processing. Here the compressed video stream is decoded and
post-processed to remove as much distortion as possible, and then tracking
is performed. Such a separation of the capture and processing locations of
video is required in systems where many sources of video exist (streets, intersections, strategic locations) yet the processing power required to process
the video on-site at each location would be prohibitively costly. Therefore a
central processing location where all the video is sent is required. While the
distortion Ncap from the video acquisition process is inherent to any video
system, the distortion introduced by the video compression and lossy channel transmission (Nenc and Nchan ) are specific to such centrally controlled
systems.
The introduction of measures to alleviate the effects of distortion during
encoding, transmission and post-processing is challenging given the different
types of distortion, the parameters of which may also vary across time. In
the highway vehicle tracking example, Ncap and Nenc may vary based on
lighting conditions, and if a non-dedicated channel such as WiFi is used
Nchan will vary based on signal reception and traffic congestion. Therefore
any measures meant to alleviate distortion effects need to either account for
all such variations in advance or be adaptive to each variation.
In order to optimize for tracking quality a metric to measure tracking
accuracy is required. In [34] a state-of-the-art review for video surveillance
performance metrics is presented. Due to their pertinence in traffic surveillance for our work we choose the Overlap, Precision and Sensitivity metrics
presented therein. Overlap (OLAP) is defined in terms of the ratio of the intersection and union of the Ground Truth (GT) and Algorithm Result (AR)
objects,

$$\text{OLAP} = \frac{GT_i \cap AR_i}{GT_i \cup AR_i}, \qquad (4.1)$$
where GTi are the segmented objects tracked in uncompressed video, the ARi
those tracked in compressed video, ∩ the intersection of the two regions and
∪ their union. Precision (PREC) is defined in terms of the average number
of True Positives (TPs) and False Positives (FPs) per frame as
$$\text{PREC} = \frac{TP}{TP + FP}, \qquad (4.2)$$
where TPs are objects present in both the GT and AR, while FPs are
objects present in the AR but not in the GT. An FP is flagged if an object detected in the AR does not overlap an equivalent object in the GT ($\text{OLAP}(AR_i, GT_i) = 0$). Sensitivity (SENS) is defined in terms of TPs and
False Negatives (FNs) as
$$\text{SENS} = \frac{TP}{TP + FN}, \qquad (4.3)$$
where FNs are objects present in the GT but not in the AR. An FN is flagged
if an object detected in the GT does not overlap an equivalent object in the AR ($\text{OLAP}(GT_i, AR_i) = 0$). We define the aggregate tracking accuracy $A$ as
$$A = \alpha \cdot \text{OLAP} + \beta \cdot \text{PREC} + \gamma \cdot \text{SENS}, \qquad (4.4)$$
where α, β and γ are weighting coefficients. Given that OLAP, SENS, PREC
are all in the range [0, 1], no normalization of A is necessary as long as
α + β + γ = 1.
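A minimal sketch of how the metrics of Eqs. 4.2-4.4 combine is given below; the equal default weights are an illustrative choice of our own, since the report leaves α, β, γ application-dependent.

```python
def precision(tp, fp):
    """PREC of Eq. 4.2; returns 0 when no objects were detected."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def sensitivity(tp, fn):
    """SENS of Eq. 4.3; returns 0 when the GT contains no objects."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def aggregate_accuracy(olap, prec, sens, alpha=1/3, beta=1/3, gamma=1/3):
    """Aggregate tracking accuracy A of Eq. 4.4.

    Since OLAP, PREC and SENS all lie in [0, 1], A stays in [0, 1]
    whenever alpha + beta + gamma = 1.
    """
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha * olap + beta * prec + gamma * sens
```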
4.3
Iterative Quantization Table Optimization
The proposed algorithm seeks to optimize video compression in the system
to adaptively maximize performance under the varying effects of distortion.
To limit the scope of our discussion we will consider only Ncap and Nenc ,
disregarding Nchan . We assert that any given tracking algorithm uses one or
more features that play a greater role in its success than other features. Each
of these features is subject to Ncap and Nenc , possibly as governed by different
functions based on the nature of distortion – for example, a blurring Ncap may
impact edges but not color histograms. We further assert that there exist
undesirable features (such as those introduced by noise) that confuse tracking
efforts and actively detract from tracking accuracy while still consuming bits
to be represented in the compressed video. Each of these features is
coherently represented in the frequency domain by one or more of the spatial
transform filters used in hybrid video coding, an example of which is shown
in Fig. 4.2. The basis functions shown in the figure are those used for the
4x4 transform in the H.264/AVC video coding standard – observe that each
coefficient’s corresponding basis sharpens vertical and/or horizontal edges to
varying degrees, with the exception of the 0-index “DC” basis which sets the
mean value. Also observe that by their nature each basis will represent some
Figure 4.2: Transform coefficients represented as per-coefficient basis functions applied to the source 4x4 block. From left to right, top to bottom, the
coefficient indices are numbered 0, 1, 2, ..., 15.
feature more effectively than others, while at the same time not representing
other features at all – this observation will be key to our optimization.
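To make the frequency decomposition concrete, the basis patterns of Fig. 4.2 can be generated directly from the well-known 4x4 forward core transform matrix of H.264/AVC. This is an illustrative sketch: the row normalization is added purely for display, since the standard folds these scale factors into quantization rather than the transform itself.

```python
import numpy as np

# Forward 4x4 core transform of H.264/AVC (an integer DCT approximation).
C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]], dtype=np.float64)

# Normalize rows for display; the standard carries these scale factors
# in the quantization step instead of the transform.
Cn = C / np.linalg.norm(C, axis=1, keepdims=True)

def basis_image(i, j):
    """Spatial pattern weighed by transform coefficient (i, j).

    For the separable transform Y = C X C^T, coefficient Y[i, j] is the
    inner product of the source block with the outer product of rows i
    and j -- the patterns shown in Fig. 4.2 (coefficient index 4*i + j).
    """
    return np.outer(Cn[i], Cn[j])
```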
Our algorithm automatically identifies and concentrates compression bitrate on frequencies useful to tracking, at the cost of bitrate allocated to
frequencies confusing or useless to tracking. We perform our optimization by
manipulating the quantization of coded transform coefficients. The quantization scheme is varied via the Quantization Table (QT) specified as part of
the Sequence and Picture Parameter Set structures in the H.264/AVC video
compression standard. Each entry of the QT is used to quantize a coefficient resulting from the 4x4 spatial transform depicted in Fig. 4.2 – the goal
is to spend the fewest bits on those coefficients containing the least useful
information pertaining to the features used by the tracker.
Refer to [35] for a description and to [36] for a detailed explanation of the
H.264/AVC frequency transform. The standard specifies quantization for a
given transform coefficient index $idx$ in terms of the quantization point (QP) and the QT as

$$QT = [t_0, t_1, t_2, \ldots, t_{15}], \qquad QP_{idx} = QP \cdot \frac{1}{16}\, QT[idx]. \qquad (4.5)$$
Integers in the range [0, 255] (8 bits) are allowed for each entry to signify a multiplicative per-coefficient modification in the range [1/16, 16]. The probability space for our optimization is therefore of dimension $256^{16}$ for a single
quantizer. Given the large number of costly evaluations that would have to
be tried in an exhaustive approach we proceed using Lagrangian optimization. Based on a chosen set of tracking accuracy criteria, we will iteratively
coarsen quantization of frequencies less useful to tracking, thereby saving
more bits per accuracy reduced than if we simply coarsened quantization
uniformly across all frequencies.
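A minimal sketch of the per-coefficient quantizer scaling of Eq. 4.5, assuming the QT is supplied as a sequence of sixteen 8-bit integers:

```python
import numpy as np

def per_coefficient_qp(base_qp, qt):
    """Effective quantization per coefficient index (Eq. 4.5).

    Each 8-bit QT entry scales the base quantizer multiplicatively,
    with an entry of 16 mapping to unity, so entries span roughly a
    1/16x to 16x modification of the base QP.
    """
    qt = np.asarray(qt, dtype=np.float64)
    return base_qp * qt / 16.0
```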
The optimization is performed by iteratively generating a set of operating
points (OPs), characterized by their bitrate R and accuracy A, and selecting a
subset of these considered superior in performance. These “iteration optimal”
OPs form the basis of the subsequent iteration, whose OPs are generated
by modifying the parameters of the previous iteration's optimal OPs. The
algorithm is said to converge when the set of optimal OPs does not change
between two subsequent iterations. The ultimate goal is to generate a rate-accuracy curve allowing the user to specify a bitrate and receive a QT which
will maximize tracking accuracy.
We define the uniform QT Tinit = [255, 255, ..., 255], which attenuates all
frequencies at the maximum allowed level. The iteration optimal set Sopt is
defined as the strictly increasing set of rate-accuracy pairs which include the
lowest bitrate in the set,
$$S_{opt} \rightarrow (A_k < A_{k+n} \mid R_k < R_{k+n}) \;\; \forall\, n, k; \qquad S_{opt}[0] = \arg\min_k \{R_k\}, \qquad (4.6)$$
where k and k+n are indices into the set of available OPs. The QT relaxation
function Φ is defined as
$$\Phi\{T, idx, C\} = \left[t_0, t_1, t_2, \ldots, \frac{t_{idx}}{C}, \ldots, t_{15}\right]. \qquad (4.7)$$
To initialize our optimization set we generate the OPs obtained by relaxing each entry in Tinit and applying the result across a given range of
quantizers. Of these results we choose the optimal subset S0,opt , which forms
the basis of the first iteration. For each subsequent iteration i, each point
on Si−1,opt is revisited by relaxing entries in their QTs, forming the set of
OPs Si from which the optimal set Si,opt is drawn. Refer to Fig. 4.3 for a
sample iteration. The set of OPs S0 (circles) are generated, and only the
elements of S0 which lie on the strictly increasing S0,opt curve are revisited
to form S1 (crosses). Thereafter only those members of S1 which lie on S1,opt
are revisited for S2 (triangles). The resulting set S2 contains OPs superior
to those on S1,opt , and therefore the algorithm will continue to iterate a third
time using an S2,opt to populate S3.
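A compact sketch of one iteration of this search follows; `relax` implements Φ from Eq. 4.7, `iteration_optimal` the selection rule of Eq. 4.6, and `evaluate` is a placeholder for the costly compress-then-track measurement. The integer rounding and the lower clamp at 1 are our own assumptions for keeping table entries valid.

```python
def relax(qt, idx, c):
    """QT relaxation function Phi of Eq. 4.7: divide entry idx by C."""
    out = list(qt)
    out[idx] = max(1, int(round(out[idx] / c)))
    return out

def iteration_optimal(ops):
    """'Iteration optimal' subset of operating points (Eq. 4.6).

    ops is a list of (rate, accuracy, qt) tuples; starting from the
    lowest-rate point, keep only points whose accuracy strictly
    increases with rate (a strictly increasing rate-accuracy front).
    """
    front = []
    for r, a, qt in sorted(ops, key=lambda op: (op[0], op[1])):
        if not front or a > front[-1][1]:
            front.append((r, a, qt))
    return front

def next_iteration(front, c, evaluate):
    """One iteration: single-entry relaxations of each optimal QT.

    evaluate(qt) is assumed to compress the sample with the candidate
    QT, run the tracker, and return (rate, accuracy); it is by far the
    most expensive step of the loop.
    """
    ops = list(front)
    for _, _, qt in front:
        for idx in range(len(qt)):
            new_qt = relax(qt, idx, c)
            r, a = evaluate(new_qt)
            ops.append((r, a, new_qt))
    return iteration_optimal(ops)
```

Convergence is declared when `iteration_optimal` returns the same front on two subsequent iterations.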
Given that in each iteration only a single QT entry can be modified per OP, the theoretical worst-case convergence bound will involve a maximum of 255/C iterations. Each iteration i can evaluate a maximum of 16i OPs. While
this worst case set already involves close to 20 orders of magnitude fewer
evaluations than the exhaustive search, given the highly unlikely nature of
the worst case it is expected for our algorithm to converge with significantly
fewer evaluations than the worst case allows for. Where a strict convergence
time requirement shorter than the worst case exists, the number of iterations
allowed can be set to a fixed ceiling for a faster resolution guarantee.
Note that the optimization must be performed simultaneously for a range
of base quantizers, as tracking is a nonlinear process subject to different
distortions at each quantization level. It is possible that a finer quantized
OP may result in worse tracking performance due to the introduction of
noise elements which were effectively filtered out with coarser quantization.
Any non-iterative effort to optimize quantization in this sense would require
accurate models of the video content and all sources of distortion, taking
into account all variations across time. Our iterative process allows for per-coefficient quantization optimization without such difficult and error-prone
modeling.
A core assumption of our algorithm is that the distortion process of key
tracking features is stationary for a given video source, at least over sufficiently long periods of time where re-initialization of the optimization to
rebuild the optimal QT each time the distortion process changes is feasible.
Such change detection would need to be provided externally, for example via
light sensors to detect nightfall or via frame histograms to detect inclement
weather.
One limitation of our search method is that it is “greedy,” considering
only single hop modifications to Si−1,opt when populating Si . This limitation
introduces sparsity in the set of OPs that can be reached, making it possible
for the converged Sopt to be suboptimal compared to an exhaustive solution.
While this issue can be readily circumvented by allowing for multi-hop projections when populating Si , the additional computational burden to do so
Figure 4.3: An example showing the first three iterations of the optimization
process in the rate-accuracy domain.
will be unacceptably high for most low-cost embedded devices.
A point related to implementation is that the algorithm requires access
to the ground truth for operation. In a centrally controlled system such
as described in Fig. 4.1 this will not be available. However, a very close
approximation can be obtained by compressing the video sample at high
bitrates and transmitting it at channel capacity over a slower than real-time
interval before starting the optimization. If this is done, such a process would
have to run in series with the optimization, thus adding to the initialization
time requirement.
4.4
Experimental Results
The video compression experiments presented herein have been performed
using the open-source H.264/AVC encoder x264 [37]. The “I-90” and “Golf”
sequences (720x480 @30Hz) were shot on DV tape and are therefore high
quality sources. 600 frames (20 seconds) of each sequence were compressed
using a common QP set of [25, 26, 27, 28, 29, 30] and uniform QTs with tj = 16 for j = 0, 1, ..., 15. The resulting video was used for tracking, and the results
were put through an “iteration optimal” criterion as described in Section 4.3
to generate the “optimal” uniform quantization performance curve.
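For completeness, one way an OP evaluation could drive an unmodified encoder: x264 accepts JM-style custom quantization matrices via its --cqmfile option, so each candidate QT can be written to such a file and passed on the command line. The wrapper below is a hypothetical sketch (file names and paths are placeholders), not the harness used for the reported experiments.

```python
import subprocess

def encode_with_qt(input_path, output_path, qp, cqm_path="qt.cfg"):
    """Compress one sample with a candidate QT using the x264 CLI.

    Assumes the candidate QT has already been written to cqm_path in
    x264's JM-compatible custom-matrix format; all paths here are
    illustrative placeholders.
    """
    subprocess.run(
        ["x264", "--qp", str(qp), "--cqmfile", cqm_path,
         "-o", output_path, input_path],
        check=True,
    )
```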
For our experiments, the post-processing block shown in Fig. 4.1 involves
manually segmenting the road to help automated tracking – segmentation is
performed once and used for all cases where the content was utilized. The
open-source OpenCV “blobtrack” module available at [31] was used as the
object tracker.
Refer to Fig. 4.4 for results from the experiment using the “I-90” sequence
(lightly congested highway traffic) and the Mean Shift tracker described in
[20]. The algorithm was allowed to run for 4 iterations, evaluating a total of
587 OPs. Note that at the higher bitrates close to 40% bitrate savings for
comparable accuracy tracking is possible using our algorithm. Also note the
gradual improvement in performance among curves Sopt,1 , Sopt,2 and Sopt,3 ,
each increasingly superior to the uniform quantized OPs of Sopt,f lat .
Refer to Fig. 4.5 for results from the experiment using the “Golf” sequence
(moderately congested local intersection) and the “Connected Component” tracker
described in [21]. The algorithm was allowed to run for 3 iterations, evaluating a total of 447 OPs. The lower overall tracking accuracies compared
to those in Fig. 4.4 are due to more challenging tracking video being used.
Figure 4.4: Rate-accuracy results for the “I-90” sequence and “Mean Shift” tracking.
Figure 4.5: Rate-accuracy results for the “Golf” sequence and “Connected
Component” tracking.
Note that at lower bitrates savings exceeding 60% in bitrate can be realized
with just 3 iterations, and that as early as Sopt,2 the algorithm has almost
converged. Also note that here a completely different tracker than the one
in Fig. 4.4 has been used on content of a different nature (a hard-to-track traffic intersection as opposed to easier-to-track highway content). Consistent
improvement across such different content and trackers clearly demonstrates
the adaptability of the algorithm.
The computation and memory requirements of the algorithm are low
enough for mobile and embedded platform implementations. Given that
the Lagrangian search can be done offline and needs to be performed only
once per system initialization or reset (triggered manually or due to a large
change in conditions), any system that can perform real time encoding at
remote nodes and tracking at the central node can reasonably complete the
optimization process in a matter of minutes.
4.5
Conclusions
We have proposed a novel method of optimizing object tracking quality in
compressed video through quantization tables. We have demonstrated using
two common object tracking algorithms that our algorithm allows for over
60% bitrate savings while maintaining comparable tracking quality.
Chapter 5
Conclusion
In this report we have discussed the various technologies that when used individually or in conjunction with each other implement the iTRAC system.
Used individually each algorithm can provide up to 75% savings in bitrate
required to transmit traffic surveillance video with comparable automated
tracking quality. Therefore for real-world traffic surveillance applications
featuring automated tracking, systems using iTRAC could operate over existing 3G or WiMAX wireless links within the available bitrates, allowing ubiquitous coverage at reasonable cost.
The results for this project were published in [7, 8, 38, 39].
Bibliography
[1] A. Chatziioanou, S. L. M. Hockaday, S. Kaighn, and L. Ponce, “Video
image processing systems: applications in transportation,” Vehicle Navigation and Information Systems Conference, vol. 38, pp. 17–20, 1995.
[2] A. Chatziioanou, S. L. M. Hockaday, S. Kaighn, and L. Ponce, “Video
content analysis moves to the edge,” tech. rep., IMS Research, January
2007.
[3] M. Kyte, A. Khan, and K. Kagolanu, “Using machine vision (video imaging) technology to collect transportation data,” Innovations in Travel
Survey Methods, vol. Transportation Research Record No. 1412, 1995.
[4] V. Kovali, V. Alexiadis, and P. Zhang, “Video-based vehicle trajectory
data collection,” Transportation Research Board Annual Meeting, 2007.
[5] N. Zingirian, P. Baglietto, M. Maresca, and M. Migliardi, “Customizing
MPEG video compression algorithms to specific application domains:
The case of highway monitoring,” Transportation Research Board Annual Meeting, pp. 46–53, 1997.
[6] J. Versavel, “Traffic data collection: Quality aspects of video detection,”
Transportation Research Board Annual Meeting, 2007.
[7] E. Soyak, S. A. Tsaftaris, and A. K. Katsaggelos, “Content-aware H.264
encoding for traffic video tracking applications,” Proceedings of ICASSP,
March 2010.
[8] E. Soyak, S. A. Tsaftaris, and A. K. Katsaggelos, “Quantization optimized H.264 encoding for traffic video tracking applications,” (to appear)
Proceedings of ICIP, September 2010.
[9] A. J. Lipton, C. H. Heartwell, N. Haering, and D. Madden, “Critical
asset protection, perimeter monitoring, and threat detection using automated video surveillance,” tech. rep., ObjectVideo Inc., 2002.
[10] P. Eaton, “The hidden costs of video surveillance,” tech. rep., Recon
Systems Inc., 2007.
[11] “Roads in oakland county are safer and less congested thanks to wi4
fixed solutions,” tech. rep., Case Study by the Road Commission for
Oakland County, 2007.
[12] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM
Computing Surveys, vol. 38, pp. 13.1–13.45, 2006.
[13] P. F. Gabriel, J. G. Verly, J. H. Piater, and A. Genon, “The state of
the art in multiple object tracking under occlusion in video sequences,”
Proceedings of ACIVS, pp. 166–173, 2003.
[14] N. Zingirian, P. Baglietto, M. Maresca, and M. Migliardi, “Video object
tracking with feedback of performance measures,” Proceedings of ICIAP,
vol. 2, pp. 46–53, 1997.
[15] W. K. Ho, W. Cheuk, and D. P. Lun, “Content-based scalable H.263 video coding for road traffic monitoring,” IEEE Transactions on Multimedia, vol. 7, no. 4, 2005.
[16] R. D. Sutter, K. D. Wolf, S. Lerouge, and R. V. de Walle, “Lightweight object tracking in compressed video streams demonstrated in region-of-interest coding,” EURASIP Journal on Advances in Signal Processing, 2007.
[17] A. K. Kannur and B. Li, “Power-aware content-adaptive H.264 video encoding,” Proceedings of ICASSP, pp. 925–928, 2009.
[18] F. Porikli and X. Li, “Traffic congestion estimation using HMM models without vehicle tracking,” IEEE Proceedings on Intelligent Vehicles,
2004.
[19] S. Cheung and C. Kamath, “Robust techniques for background subtraction in urban traffic video,” Proceedings of VCIP, vol. 5308, no. 1, pp. 881–892, 2004.
[20] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid
objects using mean shift,” Proceedings of CVPR, vol. 2, pp. 142–149,
2000.
[21] A. Senior, A. Hampapur, Y.-L. Tian, L. Brown, S. Pankanti, and
R. Bolle, “Appearance models for occlusion handling,” Proceedings of
the 2nd IEEE Workshop on PETS, December 2001.
[22] T. P. Chen, H. Haussecker, A. Bovyrin, R. Belenov, K. Rodyushkin,
A. Kuranov, and V. Eruhimov, “Computer vision workload analysis:
Case study of video surveillance systems,” Intel Technology Journal,
vol. 9 (12), May 2005.
[23] V. Kastrinaki, M. Zervakis, and K. Kalaitzakis, “A survey of video processing techniques for traffic applications,” Image and Vision Computing, vol. 21, no. 4, pp. 359–381, 2003.
[24] E. Bas, A. M. Tekalp, and F. S. Salman, “Automatic vehicle counting
from video for traffic flow analysis,” IEEE Transactions on Intelligent
Transportation Systems, June 2007.
[25] Z. Sun, G. Bebis, and R. Miller, “On-road vehicle detection: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, May 2006.
[26] J. McCall and M. M. Trivedi, “Video based lane estimation and tracking
for driver assistance: Survey, system, and evaluation,” IEEE Transactions on Intelligent Transportation Systems, vol. 7, pp. 20–37, March
2006.
[27] M. Barzohar, L. Preminger, T. Tasdizen, and D. B. Cooper, “Robust
method for completely automatic aerial detection of occluded roads with
new initialization,” Proceedings of SPIE, Infrared Technology and Applications XXVII, vol. 4820, pp. 688–698, January 2003.
[28] A. Briassouli, V. Mezaris, and I. Kompatsiaris, “Video segmentation and
semantics extraction from the fusion of motion and color information,”
Proceedings of ICIP, vol. 3, pp. 365 – 368, 2007.
[29] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man and Cybernetics, vol. 9, pp. 62–66, 1979.
[30] M. Luessi, M. Eichmann, G. Schuster, and A. Katsaggelos, “Framework for efficient optimal multilevel image thresholding,” Journal of Electronic Imaging, vol. 18, Jan. 2009.
[31] “http://opencv.willowgarage.com.”
[32] C. Erdem, A. M. Tekalp, and B. Sankur, “Video object tracking with
feedback of performance measures,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 310–324, 2003.
[33] J. Wen, M. Luttrell, and J. Villasenor, “Trellis-based R-D optimal quantization in H.263,” IEEE Transactions on Image Processing, vol. 9,
pp. 1431–1434, August 2000.
[34] M. B. A. Baumann, J. Ebling, M. Koenig, H. S. Loos, W. N. M. Merkel,
J. K. Warzelhan, and J. Yu, “A review and comparison of measures
for automatic video surveillance systems,” EURASIP Journal on Image
and Video Processing, 2008.
[35] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview
of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, July 2003.
[36] H. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low-complexity transform and quantization in H.264/AVC,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 598–603, July 2003.
[37] “http://www.videolan.org/developers/x264.html.”
[38] E. Soyak, S. A. Tsaftaris, and A. K. Katsaggelos, “Tracking-optimal pre- and post-processing for H.264 compression in traffic video surveillance applications,” Proc. ICECS, Dec. 2010.
[39] E. Soyak, S. A. Tsaftaris, and A. K. Katsaggelos, “Low-complexity
video compression for automated transportation surveillance,” submitted, IEEE Transactions on Circuits and Systems for Video Technology.