Thesis for the degree of Doctor of Philosophy
Model Based Speech Enhancement and
Coding
David Yuheng Zhao
赵羽珩
Sound and Image Processing Laboratory
School of Electrical Engineering
KTH (Royal Institute of Technology)
Stockholm 2007
Zhao, David Yuheng
Model Based Speech Enhancement and Coding
© Copyright 2007 David Y. Zhao, except where otherwise stated. All rights reserved.
ISBN 978-91-628-7157-4
TRITA-EE 2007:018
ISSN 1653-5146
Sound and Image Processing Laboratory
School of Electrical Engineering
KTH (Royal Institute of Technology)
SE-100 44 Stockholm, Sweden
Telephone +46 (0)8-790 7790
Abstract
In mobile speech communication, adverse conditions, such as noisy acoustic environments
and unreliable network connections, may severely degrade the intelligibility and naturalness of the received speech and increase the listening effort. This thesis focuses
on countermeasures based on statistical signal processing techniques. The main body
of the thesis consists of three research articles, targeting two specific problems: speech
enhancement for noise reduction and flexible source coder design for unreliable networks.
Papers A and B consider speech enhancement for noise reduction. New schemes
based on an extension to the auto-regressive (AR) hidden Markov model (HMM) for
speech and noise are proposed. Stochastic models for speech and noise gains (excitation
variance from an AR model) are integrated into the HMM framework in order to improve
the modeling of energy variation. The extended model is referred to as a stochastic-gain
hidden Markov model (SG-HMM). The speech gain describes the energy variations of the
speech phones, typically due to differences in pronunciation and/or different vocalizations
of individual speakers. The noise gain improves the tracking of the time-varying energy of
non-stationary noise, e.g., due to movement of the noise source. In Paper A, it is assumed
that prior knowledge of the noise environment is available, so that a pre-trained noise
model is used. In Paper B, the noise model is adaptive and the model parameters are
estimated on-line from the noisy observations using a recursive estimation algorithm.
Based on the speech and noise models, a novel Bayesian estimator of the clean speech
is developed in Paper A, and an estimator of the noise power spectral density (PSD) in
Paper B. It is demonstrated that the proposed schemes achieve more accurate models
of speech and noise than traditional techniques, and as part of a speech enhancement
system provide improved speech quality, particularly for non-stationary noise sources.
In Paper C, a flexible entropy-constrained vector quantization scheme based on a Gaussian mixture model (GMM), lattice quantization, and arithmetic coding is proposed. The
method allows for changing the average rate in real-time, and facilitates adaptation to
the currently available bandwidth of the network. A practical solution to the classical
issue of indexing and entropy-coding the quantized code vectors is given. The proposed
scheme has a computational complexity that is independent of rate, and quadratic with
respect to vector dimension. Hence, the scheme can be applied to the quantization of
source vectors in a high dimensional space. The theoretical performance of the scheme
is analyzed under a high-rate assumption. It is shown that, at high rate, the scheme
approaches the theoretically optimal performance, if the mixture components are located
far apart. The practical performance of the scheme is confirmed through simulations on
both synthetic and speech-derived source vectors.
Keywords: statistical model, Gaussian mixture model (GMM), hidden Markov
model (HMM), noise reduction, speech enhancement, vector quantization.
List of Papers
The thesis is based on the following papers:
[A] D. Y. Zhao and W. B. Kleijn, “HMM-based gain-modeling for
enhancement of speech in noise,” IEEE Trans. Audio, Speech
and Language Processing, vol. 15, no. 3, pp. 882–892, Mar. 2007.

[B] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, “On-line
noise estimation using stochastic-gain HMM for speech enhancement,”
IEEE Trans. Audio, Speech and Language Processing, submitted, Apr. 2007.

[C] D. Y. Zhao, J. Samuelsson, and M. Nilsson, “Entropy-constrained
vector quantization using Gaussian mixture models,”
IEEE Trans. Communications, submitted, Apr. 2007.
In addition to Papers A-C, the following papers have also been
produced in part by the author of the thesis:
[1] D. Y. Zhao, J. Samuelsson, and M. Nilsson, “GMM-based
entropy-constrained vector quantization,” in Proc. IEEE Int.
Conf. Acoustics, Speech and Signal Processing, Apr. 2007, pp. 1097–1100.

[2] V. Grancharov, D. Y. Zhao, J. Lindblom, and W. B. Kleijn,
“Low complexity, non-intrusive speech quality assessment,”
IEEE Trans. Audio, Speech and Language Processing, vol. 14,
pp. 1948–1956, Nov. 2006.

[3] ——, “Non-intrusive speech quality assessment with low computational
complexity,” in Proc. Interspeech - ICSLP, Sep. 2006.

[4] D. Y. Zhao and W. B. Kleijn, “HMM-based speech enhancement
using explicit gain modeling,” in Proc. IEEE Int. Conf.
Acoustics, Speech and Signal Processing, vol. 1, May 2006, pp. 161–164.

[5] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, “Method
and apparatus for improved estimation of non-stationary noise
for speech enhancement,” provisional patent application
filed by GN ReSound, 2005.

[6] D. Y. Zhao and W. B. Kleijn, “On noise gain estimation for
HMM-based speech enhancement,” in Proc. Interspeech, Sep.
2005, pp. 2113–2116.

[7] ——, “Multiple-description vector quantization using translated
lattices with local optimization,” in Proc. IEEE Global
Telecommunications Conference, vol. 1, 2004, pp. 41–45.

[8] J. Plasberg, D. Y. Zhao, and W. B. Kleijn, “The sensitivity
matrix for a spectro-temporal auditory model,” in Proc. 12th
European Signal Processing Conf. (EUSIPCO), 2004, pp. 1673–1676.
Acknowledgements
Approaching the end of my Ph.D. study, I would like to thank my supervisor,
Prof. Bastiaan Kleijn, for introducing me to the world of academic research.
Your creativity, dedication, professionalism and hardworking nature have
greatly inspired and influenced me. I also appreciate your patience with
me, and that you spent many hours of your valuable time correcting my
English mistakes.
I am grateful to all my current and past colleagues at the Sound and Image Processing lab. Unfortunately, the list of names is too long to be given
here. Thank you for creating a pleasant and stimulating work environment.
I particularly enjoyed the cultural diversity and the feeling of an international “family” atmosphere. I am thankful to my co-authors Dr. Jonas
Samuelsson, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Jan Plasberg,
Dr. Jonas Lindblom, Dr. Alexander Ypma, and Dr. Bert de Vries. Thank
you for our fruitful collaborations. Furthermore, I would like to express my
gratitude to Prof. Bastiaan Kleijn, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Tiago Falk and Anders Ekman for proofreading the introduction of
the thesis.
During my Ph.D. study, I was involved in projects financially supported
by GN ReSound and the Foundation for Strategic Research. I would like
to thank Dr. Alexander Ypma, Dr. Bert de Vries, Prof. Mikael Skoglund,
Prof. Gunnar Karlsson, for your encouragement and friendly discussions
during the projects.
Thanks to my Chinese fellow students and friends at KTH, particularly
Xi, Jing, Lei, and Jinfeng. I feel very lucky to get to know you. The fun we
have had together is unforgettable.
I would also like to express my deep gratitude to my family, my wife
Lili, for your encouragement, understanding and patience; my beloved kids,
Andrew and the one yet to be named, for the joy and hope you bring. I
thank my parents, for your sacrifice and your endless support; my brother,
for all the adventures and joy we have had together; my mother-in-law, for taking
care of Andrew. Last but not least, I am grateful to our host family, Harley
and Eva, for your kind hearts and unconditional friendship.
David Zhao, Stockholm, May 2007
Contents

Abstract  i
List of Papers  iii
Acknowledgements  v
Contents  vii
Acronyms  xi

I  Introduction  1
1  Statistical models  3
    1.1  Short-term models  4
    1.2  Long-term models  6
    1.3  Model-parameter estimation  9
2  Speech enhancement for noise reduction  12
    2.1  Overview  14
    2.2  Short-term model based approach  17
    2.3  Long-term model based approach  21
3  Flexible source coding  25
    3.1  Source coding  25
    3.2  High-rate theory  30
    3.3  High-rate quantizer design  32
4  Summary of contributions  35
References  37

II  Included papers  51
A  HMM-based Gain-Modeling for Enhancement of Speech in Noise  A1
1  Introduction  A1
2  Signal model  A4
    2.1  Speech model  A4
    2.2  Noise model  A5
    2.3  Noisy signal model  A6
3  Speech estimation  A7
4  Off-line parameter estimation  A9
5  On-line parameter estimation  A11
6  Experiments and results  A13
    6.1  System implementation  A13
    6.2  Experimental setup  A14
    6.3  Evaluation of the modeling accuracy  A15
    6.4  Objective evaluation of MMSE waveform estimators  A16
    6.5  Objective evaluation of sample spectrum estimators  A19
    6.6  Perceptual quality evaluation  A19
7  Conclusions  A22
Appendix I. EM based solution to (12)  A22
Appendix II. Approximation of f_s̄(ḡ_n′|x_n)  A23
References  A24
B  On-line Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement  B1
1  Introduction  B1
2  Noise model parameter estimation  B4
    2.1  Signal model  B4
    2.2  On-line parameter estimation  B6
    2.3  Safety-net state strategy  B9
    2.4  Discussion  B10
3  Noise power spectrum estimation  B11
4  Experiments and results  B12
    4.1  System implementation  B12
    4.2  Experimental setup  B13
    4.3  Evaluation of noise model accuracy  B14
    4.4  Evaluation of noise PSD estimate  B17
    4.5  Evaluation of speech enhancement performance  B18
    4.6  Evaluation of safety-net state strategy  B21
5  Conclusions  B22
Appendix I. Derivation of (13-14)  B25
References  B27
C  Entropy-Constrained Vector Quantization Using Gaussian Mixture Models  C1
1  Introduction  C1
2  Preliminaries  C4
    2.1  Problem formulation  C4
    2.2  High-rate ECVQ  C4
    2.3  Gaussian mixture modeling  C5
3  Quantizer design  C5
4  Theoretical analysis  C6
    4.1  Distortion analysis  C6
    4.2  Rate analysis  C7
    4.3  Performance bounds  C7
    4.4  Rate adjustment  C10
5  Implementation  C10
    5.1  Arithmetic coding  C10
    5.2  Encoding for the Z-lattice  C11
    5.3  Decoding for the Z-lattice  C12
    5.4  Encoding for arbitrary lattices  C12
    5.5  Decoding for arbitrary lattices  C13
    5.6  Lattice truncation  C13
6  The algorithm  C14
7  Complexity  C15
8  Experiments and results  C15
    8.1  Artificial source  C16
    8.2  Speech derived source  C19
9  Summary  C21
References  C22
Acronyms

3GPP  The Third Generation Partnership Project
AMR  Adaptive Multi-Rate
AR  Auto-Regressive
AR-HMM  Auto-Regressive Hidden Markov Model
CCR  Comparison Category Rating
CDF  Cumulative Distribution Function
DFT  Discrete Fourier Transform
ECVQ  Entropy-Constrained Vector Quantizer
EM  Expectation Maximization
EVRC  Enhanced Variable Rate Codec
FFT  Fast Fourier Transform
GMM  Gaussian Mixture Model
HMM  Hidden Markov Model
IEEE  Institute of Electrical and Electronics Engineers
I.I.D.  Independent and Identically-Distributed
IP  Internet Protocol
KLT  Karhunen-Loève Transform
LL  Log-Likelihood
LP  Linear Prediction
LPC  Linear Prediction Coefficient
LSD  Log-Spectral Distortion
LSF  Line Spectral Frequency
MAP  Maximum A-Posteriori
ML  Maximum Likelihood
MMSE  Minimum Mean Square Error
MOS  Mean Opinion Score
MSE  Mean Square Error
MVU  Minimum Variance Unbiased
PCM  Pulse-Code Modulation
PDF  Probability Density Function
PESQ  Perceptual Evaluation of Speech Quality
PMF  Probability Mass Function
PSD  Power Spectral Density
RCVQ  Resolution-Constrained Vector Quantizer
REM  Recursive Expectation Maximization
SG-HMM  Stochastic-Gain Hidden Markov Model
SMV  Selectable Mode Vocoder
SNR  Signal-to-Noise Ratio
SS  Spectral Subtraction
SSNR  Segmental Signal-to-Noise Ratio
STSA  Short-Time Spectral Amplitude
VAD  Voice Activity Detector
VoIP  Voice over IP
VQ  Vector Quantizer
WF  Wiener Filtering
WGN  White Gaussian Noise
Part I
Introduction
Introduction
The advance of modern information technology is revolutionizing the way
we communicate. Global deployment of mobile communication systems has
enabled speech communication from and to nearly every corner of the world.
Speech communication systems based on the Internet protocol (IP), the so-called voice over IP (VoIP), are rapidly emerging. The new technologies
allow for communication over the wide range of transport media and protocols that is associated with the increasingly heterogeneous communication
network environment. In such a complex communication infrastructure,
adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness
of the perceived speech, and increase the required listening effort for
speech communication. The demand for better quality in such situations
has given rise to new problems and challenges. In this thesis, two specific
problems are considered: enhancement of speech in noisy environments and
source coder design for unreliable network conditions.
[Figure 1 block diagram: the speaker's signal, corrupted by noise, passes through enhancement, source encoding, and channel encoding into the network; channel decoding and source decoding then deliver the signal to the listener, with feedback from the network to the sender.]

Figure 1: A one-direction example of speech communication in a noisy environment.
Figure 1 illustrates an adverse speech communication scenario that is
typical of mobile communications. The speech signal is recorded in an
environment with a high level of ambient noise such as on a street, in a
restaurant, or on a train. In such a scenario, the recorded speech signal is
contaminated by additive noise and has degraded quality and intelligibility
compared to the clean speech. Speech quality is a subjective measure of how
pleasant the signal sounds to listeners. It can be related to the amount
of effort required by listeners to understand the speech. Speech intelligibility can be measured objectively based on the percentage of sounds that are
correctly understood by listeners. Naturally, both speech quality and intelligibility are important factors in the subjective rating of a speech signal by
a listener. The level of background noise may be suppressed by a speech
enhancement system. However, suppression of the background noise may
lead to reduced speech intelligibility, due to possible speech distortion associated with excessive noise suppression. Therefore, most practical speech
enhancement systems for noise reduction are designed to improve the speech
quality, while preserving the speech intelligibility [6, 60, 61]. Besides speech
communication, speech enhancement for noise reduction may be applied to
a wide range of other speech processing applications in adverse situations,
including speech recognition, hearing aids and speaker identification.
Another limiting factor for the quality of speech communication is the
network condition. This is a particularly relevant issue for wireless communication, as the network capacity may vary depending on interference from the
physical environment. The same issue occurs in heterogeneous networks
with a large variety of different communication links. Traditional source
coder design is often optimized for a particular network scenario, and may
not perform well in another network. To allow communication with the
best possible quality in the new network, it is necessary to adapt the coding
strategy, e.g., coding rate, depending on the network condition.
This thesis concerns statistical signal processing techniques that facilitate better solutions to the aforementioned problems. Formulated in a statistical framework, speech enhancement is an estimation problem. Source
coding can be seen as a constrained optimization problem that optimizes
the coder under certain rate constraints. Statistical modeling of speech (and
noise) is a key element for solving both problems.
The techniques proposed in this thesis are based on a hidden Markov
model (HMM) or a Gaussian mixture model (GMM). Both are generic models capable of modeling complex data sources such as speech. In Papers A
and B, we propose an extension to an HMM that allows for more
accurate modeling of speech and noise. We show that the improved models
also lead to an improved speech enhancement performance. In Paper C, we
propose a flexible source coder design for entropy-constrained vector quantization. The proposed coder design is based on Gaussian mixture modeling
of the source.
The introduction is organized as follows. In section 1, an overview of
[Figure 2: Histogram of time-domain speech waveform samples (number of samples, log scale, versus amplitude from -1 to 1). Generated using 100 utterances from the TIMIT database, downsampled to 8 kHz, normalized between -1 and 1, with 2048 bins.]
statistical modeling for speech is given. We discuss short-term models for
modeling the local statistics of a speech frame, and long-term models (such
as an HMM or a GMM) that describe the long-term statistics spanning over
multiple frames. Parameter estimation using the expectation-maximization
(EM) algorithm is discussed. In section 2, we introduce speech enhancement
techniques for noise reduction, with our focus on statistical model-based
methods. Different approaches using short-term and long-term models are
reviewed and discussed. Finally, in section 3, an overview of theory and
practice for flexible source coder design is given. We focus on the high-rate
theory and coders derived from the theory.
1  Statistical models
Statistical modeling of speech is a relevant topic for nearly all speech processing applications, including coding, enhancement, and recognition. A
statistical model can be seen as a parameterized set of probability density
functions (PDFs). Once the parameters are determined, the model provides a PDF of the speech signal that describes the stochastic behavior of
the signal. A PDF is useful for designing optimal quantizers [64, 91, 128],
estimators [35,37–39,90], and recognizers [11,105,108] in a statistical framework.
One can make statistical models of speech in different representations.
The simplest one applies directly to the digital speech samples in the waveform domain (analog speech is outside the scope of this thesis). For instance,
a histogram of speech waveform samples, as shown in Figure 2, provides
the long-term averaged distribution of the amplitudes of speech samples.
A similar figure is shown in [135]. The distribution is useful for designing
an entropy code for a pulse-code modulation (PCM) system. However, in
such a model, statistical dependencies among the speech samples are not
explicitly modeled.
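To make the connection to entropy-code design concrete, the sketch below estimates the entropy of a histogram of amplitudes. A synthetic Laplacian-like source stands in for the TIMIT data, and the bin count and normalization are illustrative assumptions, not the thesis's experiment:

```python
import numpy as np

def empirical_entropy_bits(samples, num_bins=2048):
    """Estimate entropy (bits/sample) of quantized amplitudes from a
    histogram, as one would for PCM entropy-code design."""
    counts, _ = np.histogram(samples, bins=num_bins, range=(-1.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

# Illustrative stand-in for speech: Laplacian-like amplitudes in [-1, 1]
rng = np.random.default_rng(0)
x = np.clip(rng.laplace(scale=0.05, size=100_000), -1.0, 1.0)

H = empirical_entropy_bits(x)
# A uniform 11-bit quantizer over [-1, 1] spends 11 bits/sample;
# the heavily skewed amplitude distribution needs considerably fewer.
print(f"empirical entropy: {H:.2f} bits/sample")
```

The gap between the histogram entropy and the fixed-rate cost of uniform quantization is exactly what an entropy code for a PCM system exploits.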
It is more common to model the speech signal as a stochastic process that
is locally stationary within 20-30 ms frames. The speech signal is then processed on a frame-by-frame basis. Statistical models in speech processing
can be roughly divided into two classes, short-term and long-term signal
models, depending on their scope of operation. A short-term model describes the statistics of the signal vector of a particular frame. As the statistics
change from frame to frame, the parameters of a short-term model are to
be determined for each frame. On the other hand, a model that describes
the statistics over multiple signal frames is referred to as a long-term model.
1.1  Short-term models
It is essential to model the sample dependencies within a speech frame of 20-30 ms. This short-term dependency determines the formants of the speech
frame, which are essential for recognizing the associated phoneme. Modeling
of this dependency is therefore of interest to many speech applications.
Auto-regressive model
An efficient tool for modeling the short-term dependencies in the time domain is the auto-regressive (AR) model. Let x[i] denote the discrete-time speech signal; the AR model of x[i] is defined as

x[i] = -\sum_{j=1}^{p} \alpha[j] x[i-j] + e[i],    (1)

where \alpha are the AR model parameters, also called the linear prediction coefficients, p is the model order, and e[i] is the excitation signal, modeled as white Gaussian noise (WGN). An AR process is then modeled as the output of filtering WGN through an all-pole AR model filter.
For a signal frame of K samples, the vector x = [x[1], \ldots, x[K]] can be seen as a linear transformation of a WGN vector e = [e[1], \ldots, e[K]], by considering the filtering process as a convolution formulated as a matrix transform. Consequently, the PDF of x can be modeled as a multi-variate Gaussian PDF, with zero mean and a covariance matrix parameterized using the AR model parameters. The AR Gaussian density function is defined as

f(x) = \frac{1}{(2\pi\sigma^2)^{K/2}} \exp\left( -\frac{1}{2} x^T D^{-1} x \right),    (2)
[Figure 3 block diagram: a pulse-train generator (controlled by the pitch period) and a noise generator, each scaled by a gain, drive a time-varying filter that produces the speech signal.]
Figure 3: Schematic diagram of a source-filter speech production model.
with the covariance matrix D = \sigma^2 (A^H A)^{-1}, where A is a K \times K lower triangular Toeplitz matrix with the first p+1 elements of the first column consisting of the AR coefficients [\alpha[0], \alpha[1], \alpha[2], \ldots, \alpha[p]]^T, \alpha[0] = 1, and the remaining elements equal to zero, (\cdot)^H denotes the Hermitian transpose, and \sigma^2 denotes the excitation variance. Due to the structure of A, the covariance matrix D has the determinant \sigma^{2K}.
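As a sanity check on Eq. (2), the sketch below builds the lower triangular Toeplitz matrix A and evaluates the AR Gaussian log-density by whitening: since det(A) = 1, det(D) = \sigma^{2K} and x^T D^{-1} x = ||Ax||^2 / \sigma^2. The AR(1) example and its parameter values are illustrative assumptions:

```python
import numpy as np

def ar_lower_toeplitz(alpha, K):
    """K x K lower-triangular Toeplitz matrix A whose first column is
    [1, alpha_1, ..., alpha_p, 0, ...]; A @ x recovers the excitation."""
    col = np.zeros(K)
    coeffs = np.concatenate(([1.0], np.asarray(alpha, float)))
    col[:min(K, len(coeffs))] = coeffs[:K]
    A = np.zeros((K, K))
    for i in range(K):
        A[i:, i] = col[:K - i]
    return A

def ar_gaussian_logpdf(x, alpha, sigma2):
    """Log of Eq. (2), using det(D) = sigma2^K (det(A) = 1) and the
    whitening identity x^T D^{-1} x = ||A x||^2 / sigma2."""
    x = np.asarray(x, float)
    K = len(x)
    e = ar_lower_toeplitz(alpha, K) @ x
    return -0.5 * (K * np.log(2.0 * np.pi * sigma2) + e @ e / sigma2)

# Toy AR(1) frame: x[i] = 0.9 x[i-1] + e[i], i.e. alpha = [-0.9] in Eq. (1)
rng = np.random.default_rng(1)
K, alpha, sigma2 = 64, [-0.9], 1.0
e = rng.normal(scale=np.sqrt(sigma2), size=K)
x = np.zeros(K)
for i in range(K):
    x[i] = -alpha[0] * (x[i - 1] if i > 0 else 0.0) + e[i]
print(ar_gaussian_logpdf(x, alpha, sigma2))
```

The whitening form avoids ever forming or inverting the K x K covariance matrix D explicitly.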
Applying the AR model for speech processing is often motivated and
supported by the physical properties of speech production [43]. A commonly
used model of speech production is the source-filter model as shown in
Figure 3. The excitation signal, modeled using a pulse train for a voiced
sound or Gaussian noise for an unvoiced sound, is filtered through a time-varying all-pole filter that models the vocal tract. Due to the absence of a
pitch model, the AR Gaussian model is more accurate for unvoiced sounds.
Frequency-domain models
The spectral properties of a speech signal can be analyzed in the frequency
domain through the short-time discrete Fourier transform (DFT). On a
frame-by-frame basis, each speech segment is transformed into the frequency
domain by applying a sliding window to the samples and a DFT on the
windowed signal.
A commonly used model is the complex Gaussian model, motivated by
the asymptotic properties of the DFT coefficients. Assuming that the DFT
length approaches infinity and that the span of correlation of the signal
frame is short compared to the DFT length, the DFT coefficients may be
considered as a weighted sum of weakly dependent random samples [51, 94,
99,100]. Using the central limit theorem for weakly dependent variables [12,
130], the DFT coefficients across frequency can be considered as independent
zero-mean Gaussian random variables. For k \notin \{1, K/2 + 1\}, the PDF of the DFT coefficient of the k'th frequency bin, X[k], is given by

f(X[k]) = \frac{1}{\pi \lambda^2[k]} \exp\left( -\frac{|X[k]|^2}{\lambda^2[k]} \right),    (3)

where \lambda^2[k] = E[|X[k]|^2] is the power spectrum. For k \in \{1, K/2 + 1\}, X[k] is real and has a Gaussian distribution with zero mean and variance E[|X[k]|^2]. Due to the conjugate symmetry of the DFT for real signals, only half of the DFT coefficients need to be considered.
One consequence of the model is that each frequency bin may be analyzed and processed independently. Although, for typical speech processing DFT lengths, the independence assumption does not always hold, it is
widely used, since it significantly simplifies the algorithm design and reduces
the associated computational complexity. For instance, the complex Gaussian model has been successfully applied in speech enhancement [38, 39, 92].
The frequency domain model of an AR process can be derived under the asymptotic assumption. Assuming that the frame length approaches infinity, A is well approximated by a circulant matrix (neglecting the frame boundary), and is approximately diagonalized by the discrete Fourier transform. Therefore, the frequency domain AR model is of the form of (3), and the power spectrum \lambda^2[k] is given by

\lambda^2_{AR}[k] = \frac{\sigma^2}{\left| \sum_{n=0}^{p} \alpha[n] e^{-jnk} \right|^2}.    (4)
It has been argued [89, 103, 118] that supergaussian density functions,
such as the Laplace density and the two-sided Gamma density, fit speech
DFT coefficients better. However, the experimental studies were based on
the long-term averaged behavior of speech, and may not necessarily hold
for short-time DFT coefficients. Speech enhancement methods based on
supergaussian density functions have been proposed in [19, 85, 86, 89, 90].
The comprehensive evaluations in [90] suggest that supergaussian methods
achieve a consistent but small improvement in segmental SNR relative to the
Wiener filtering approach (which is based on the Gaussian model). However, in terms of perceptual quality, the Ephraim-Malah
short-time spectral amplitude estimators [38, 39] based on the Gaussian
model were found to provide more pleasant residual noise in the processed
speech [90].
1.2  Long-term models
By a long-term model, we mean a model that describes the long-term
statistics spanning multiple signal frames of 20-30 ms. The short-term models in section 1.1 apply to one signal frame only, and the model
parameters obtained for a frame may not generalize to another frame. A
long-term model, on the other hand, is generic and flexible enough to allow
statistical variations in different signal frames. To model different short-term statistics in different frames, a long-term model comprises multiple
states, each containing a sub-model describing the short-term statistics of a
frame. Applied to speech modeling, each sub-model represents a particular
class of speech frames that are considered statistically similar. Some of
the most well-known long-term models for speech processing are the hidden
Markov model (HMM) and the Gaussian mixture model (GMM).
Hidden Markov model
The hidden Markov model (HMM) was originally proposed by Baum and
Petrie [13] as probabilistic functions of finite-state Markov chains. Applications of HMMs in speech processing include automatic speech recognition
[11, 65, 105] and speech enhancement [35, 36, 111, 147].
An HMM models the statistics of an observation sequence, x1 , . . . , xN .
The model consists of a finite number of states that are not observable,
and the transition from one state to another is controlled by a first-order
Markov chain. The observation variables are statistically dependent on the
underlying states, and may be considered as the states observed through
a noisy channel. Due to the Markov assumption on the hidden states, the
observations are statistically dependent across frames.
For a realization of the state sequence, let s = [s_1, \ldots, s_N], where s_n denotes the state of frame n, let a_{s_{n-1} s_n} denote the transition probability from state s_{n-1} to s_n, with a_{s_0 s_1} being the initial state probability, and let f_{s_n}(x_n) denote the output probability of x_n for a given state s_n. The PDF of the observation sequence x_1, \ldots, x_N of an HMM is then given by

f(x_1, \ldots, x_N) = \sum_{s \in S} \prod_{n=1}^{N} a_{s_{n-1} s_n} f_{s_n}(x_n),    (5)

where the summation is over the set S of all possible state sequences.
The joint probability of a state sequence s and the observation sequence
x_1, \ldots, x_N is the product of the transition probabilities and the
output probabilities. The PDF of the observation sequence is then the
sum of the joint probabilities over all possible state sequences.
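For a toy model, Eq. (5) can be evaluated directly by enumerating all state sequences; the standard forward recursion computes the same sum in O(NM^2) time. The transition matrix and stand-in output densities below are illustrative assumptions:

```python
import itertools
import numpy as np

def hmm_likelihood_bruteforce(a0, A, B):
    """Direct evaluation of Eq. (5): sum over all M^N state sequences of
    the product of transition and output probabilities.
    a0: initial state probabilities (M,); A: transition matrix (M, M);
    B[n, s]: output density value f_s(x_n) for frame n, state s."""
    N, M = B.shape
    total = 0.0
    for s in itertools.product(range(M), repeat=N):
        p = a0[s[0]] * B[0, s[0]]
        for n in range(1, N):
            p *= A[s[n - 1], s[n]] * B[n, s[n]]
        total += p
    return total

def hmm_likelihood_forward(a0, A, B):
    """The same quantity via the forward recursion, O(N M^2)."""
    fwd = a0 * B[0]
    for n in range(1, B.shape[0]):
        fwd = (fwd @ A) * B[n]
    return fwd.sum()

rng = np.random.default_rng(2)
a0 = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = rng.uniform(0.1, 1.0, size=(5, 2))  # stand-in values of f_{s_n}(x_n)
print(hmm_likelihood_bruteforce(a0, A, B), hmm_likelihood_forward(a0, A, B))
```

The brute-force sum grows exponentially in N, which is why practical HMM algorithms rely on the forward recursion.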
Applied to speech modeling, each sub-model of a state represents the
short-term statistics of a frame. Using the frequency domain Gaussian
model (3), each state consists of a power spectral density representing a
particular class of speech sounds (with similar spectral properties). The
Markov chain of states models the temporal evolution of short-term speech
power spectra.
[Figure 4 plots: left, a one-dimensional GMM density f(x2); right, a two-dimensional GMM density f(x1, x2) together with the conditional density f(x2|x1 = 0.5).]

Figure 4: Examples of a one-dimensional GMM (left) and a two-dimensional GMM (right).
Gaussian mixture model
Gaussian mixture modeling is a convenient tool useful in a wide range of
applications. An early application of GMM was found in adaptive block
quantization applied to image coding [131]. More recent examples of applications include clustering analysis [150], image coding [126], image retrieval [55], multiple description coding [115], speaker identification [108],
speech coding [80, 112, 114], speech quality estimation [42], speech recognition [33, 34], and vector quantization [58, 127, 128, 148].
A Gaussian mixture model (GMM) is defined as a weighted sum of Gaussian densities,

f(x) = \sum_{m=1}^{M} \rho_m f_m(x),    (6)

where m denotes the component index, \rho_m denote the mixture weights, and f_m(x) are the component Gaussian PDFs. Examples of a one-dimensional GMM and a two-dimensional GMM are shown in Figure 4.
A GMM can be interpreted in two conceptually different ways. It can
be seen as a generic PDF that has the ability to approximate an unknown
PDF by adjusting the model parameters over a database of observations.
Another interpretation of a GMM is as a special case of an HMM, by assuming independent and identically-distributed (I.I.D.) random variables,
such that f (x1 , . . . , xN ) = f (x1 ) . . . f (xN ). The model can then be formulated as an HMM by assigning the initial state probability according to
the mixture component weights, and assuming the matrix of state transition probabilities to be the identity matrix. In fact, to generate random
observations from a GMM, one can first generate an I.I.D. state sequence
according to the mixture component weights, and then generate the observation sequence according to the component PDF of each state sample. The
generated observations will be distributed according to the GMM. Applied
to speech modeling, each component may represent a class of speech sounds,
the weights are according to the probability of their appearance, and each
component PDF describes the power spectrum of the sound class.
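The two-step sampling procedure just described — draw an I.I.D. state sequence from the mixture weights, then draw each observation from the selected component — can be sketched as follows (the weights and component parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-component, one-dimensional GMM
weights = np.array([0.5, 0.3, 0.2])   # mixture weights rho_m
means = np.array([-2.0, 0.0, 3.0])    # component means
stds = np.array([0.5, 1.0, 0.8])      # component standard deviations

# 1) I.I.D. state sequence drawn according to the mixture component weights
states = rng.choice(len(weights), size=10000, p=weights)
# 2) observations drawn from the component PDF of each state sample
samples = rng.normal(means[states], stds[states])

# The sample mean approaches the mixture mean sum_m rho_m * mu_m = -0.4
assert abs(samples.mean() - weights @ means) < 0.1
```

The generated observations are distributed according to the GMM (6), illustrating why a GMM is the I.I.D. special case of an HMM.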
The main difference between a GMM and an HMM with Gaussian sub-models is that, in a GMM, consecutive frames are considered independent,
while in an HMM, the Markov assumption of the underlying state sequence
models their temporal dependency. An HMM is therefore a more general
and powerful model than a GMM. Nevertheless, GMMs are popular models
in speech processing due to their simplicity in analysis and good modeling
capabilities. In fact, GMMs are often used as sub-models of an HMM in
practical implementations of HMM based methods.
1.3 Model-parameter estimation
Using an HMM or a GMM, the joint PDF of a sequence x1 , . . . , xN , denoted
f (x1 , . . . , xN ), is parameterized by unknown parameters, denoted by θ.
Each value of θ corresponds to one particular PDF of x1 , . . . , xN . Since
the true PDF is unknown, the value of θ is to be determined from data.
Let θ̂ denote the estimated parameters, as a function of x1 , . . . , xN . A
commonly used criterion of an optimal estimator is that 1) the estimator is
unbiased, i.e.,
E[θ̂] = θ,   (7)
and 2) the estimator gives an estimate that has the minimum variance. Such
an estimator is referred to as the minimum variance unbiased (MVU) estimator [67]. Unfortunately, the MVU estimator is difficult to determine in
many practical applications. An alternative estimator is the maximum likelihood (ML) estimator, which maximizes the likelihood function or equivalently the logarithm of the likelihood function (log-likelihood function),
θ̂ = arg max_θ f(x_1, . . . , x_N |θ)   (8)
  = arg max_θ log f(x_1, . . . , x_N |θ),   (9)
where f(x_1, . . . , x_N |θ) denotes the likelihood function, viewed as a function of θ for the given data sequence x_1, . . . , x_N. (The likelihood function can also be seen as the PDF of x_1, . . . , x_N conditioned on the parameter vector θ.) The ML approach typically leads to practical estimators that can be implemented for complex estimation tasks. Also, it can be shown [67] that the ML estimator is asymptotically optimal, and approaches the performance of the MVU estimator when the data set is large.
For a generic PDF based on GMM or HMM, directly solving (8) or (9)
leads to intractable analytical expressions. A more practical and commonly
used method is the expectation-maximization (EM) algorithm [31].
Expectation-maximization (EM) algorithm
The expectation-maximization (EM) algorithm originated from early work
of Dempster et al. [31]. The EM algorithm is used in Papers A and B
for parameter estimation of an extended HMM model of speech. The EM
algorithm applies to a given batch of data, and is suitable for off-line optimization of the model. For other model parameters, e.g., in a noise model,
the observation data may not be available off-line and on-line estimation
using a recursive algorithm is necessary. The recursive formulation of the
EM algorithm was proposed in [132], and applied to on-line HMM parameter estimation in [72]. The recursive EM algorithm is used in Paper A for
estimation of the gain model parameters, and in Paper B for estimation of
noise model parameters. Both algorithms are based on the same principle
of incomplete data. Herein, an introduction to the theory behind the EM
algorithm is given.
The EM algorithm is an iterative algorithm that ensures non-decreasing
likelihood scores during each iteration step. Since the likelihood function is
upper-bounded, the algorithm guarantees convergence to a locally optimal
solution. The EM algorithm is designed for the estimation scenario when
the observation sequence is incomplete or partially missing. The missing
observation sequence can be either real or artificial. In either case, the EM
algorithm is applicable when the ML estimator leads to intractable analytical formulas but is significantly simplified when assuming the existence of
additional unknown values.
For notational convenience, let X_1^N = {x_1, . . . , x_N} denote the observation data and Z_1^N = {z_1, . . . , z_N} denote the missing observation data. The complete data is then given by {X_1^N, Z_1^N}, and log f(X_1^N, Z_1^N|θ) is the complete log-likelihood function. Let θ̂^(j) denote the estimated parameter vector from the j'th EM iteration. The EM algorithm consists of two sequential sub-steps within each iteration. The E-step (expectation step) formulates the auxiliary function, denoted Q(θ|θ̂^(j)), which is the expected value of the complete-data log-likelihood with respect to the missing data, given the observation data and the parameter vector estimate of the j'th iteration,

Q(θ|θ̂^(j)) = ∫_{Z_1^N} f(Z_1^N|X_1^N, θ̂^(j)) log f(Z_1^N, X_1^N|θ) dZ_1^N.   (10)
The maximization step in the EM algorithm gives a new estimate of the parameter vector by maximizing the Q(θ|θ̂^(j)) function of the current step,

θ̂^(j+1) = arg max_θ Q(θ|θ̂^(j)).   (11)
It can be noted that the expectation in (10) is evaluated with respect to f(Z_1^N|X_1^N, θ̂^(j)) using the j'th parameter estimate, and that the maximization in (11) is over θ in the complete log-likelihood function, log f(Z_1^N, X_1^N|θ). Therefore, it is essential to select Z_1^N such that differentiating Q(θ|θ̂^(j)) with respect to θ and setting the resulting expression to zero leads to tractable solutions for the update equations.
Estimation of both GMM and HMM parameters can be derived using the
EM framework. For a GMM, the missing data consists of the component
indices from which the observation data is generated. For an HMM, the
missing data is the state sequence. For a detailed derivation of the EM
algorithm for GMM and HMM, we refer to [17]. The EM algorithm and the
recursive EM algorithm are extensively used in Papers A and B to estimate
the model parameters of extended HMMs of speech and noise.
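As a concrete illustration of the E- and M-steps, a minimal sketch of EM for a one-dimensional GMM is given below, where the missing data Z_1^N are the component indices. The data and initialization are synthetic and hypothetical; this is not the extended SG-HMM training of Papers A and B:

```python
import numpy as np

def em_gmm_1d(x, weights, means, variances, iters=50):
    """EM for a 1-D GMM; the missing data are the component indices."""
    for _ in range(iters):
        # E-step: posterior component probabilities (responsibilities),
        # i.e., the expected complete-data statistics
        d = x[:, None] - means[None, :]
        logp = -0.5 * d**2 / variances - 0.5 * np.log(2 * np.pi * variances)
        p = weights * np.exp(logp)
        resp = p / p.sum(axis=1, keepdims=True)
        # M-step: maximize the auxiliary function Q in closed form
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means)**2).sum(axis=0) / nk
    return weights, means, variances

rng = np.random.default_rng(1)
# Synthetic data from two well-separated components at -3 and +3
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
w, m, v = em_gmm_1d(x, np.array([0.5, 0.5]), np.array([-1.0, 1.0]),
                    np.array([1.0, 1.0]))
assert abs(sorted(m)[0] + 3) < 0.3 and abs(sorted(m)[1] - 3) < 0.3
```

Each iteration is guaranteed not to decrease the likelihood, which is exactly the convergence property proven in the next subsection.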
Convergence of the EM algorithm
The iterative procedure of the EM algorithm ensures convergence towards
a locally optimal solution. A proof of convergence is given in [31], and
the proof is outlined in this section. We first show that the log-likelihood
function (9) over a large data set is upper-bounded. We then show that the
EM iterations improve the log-likelihood function in each iteration step.
Assume the existence of a data set consisting of a large number of sequences X_1^N, where each data sequence X_1^N is distributed according to a "true" distribution function f(X_1^N). Note the difference from f(X_1^N|θ), which is a PDF parameterized by the unknown parameter vector θ. The log-likelihood function of
this data set is
LL = log Π_{X_1^N} f(X_1^N|θ) = Σ_{X_1^N} log f(X_1^N|θ) ≈ ∫_{X_1^N} f(X_1^N) log f(X_1^N|θ) dX_1^N.   (12)
Comparing the log-likelihood function with the expectation of the logarithmic transform of the true PDF, ∫_{X_1^N} f(X_1^N) log f(X_1^N) dX_1^N, we get

LL − ∫_{X_1^N} f(X_1^N) log f(X_1^N) dX_1^N = ∫_{X_1^N} f(X_1^N) log [ f(X_1^N|θ) / f(X_1^N) ] dX_1^N
  ≤ ∫_{X_1^N} f(X_1^N) [ f(X_1^N|θ) / f(X_1^N) − 1 ] dX_1^N
  = 1 − 1 = 0,   (13)
where the inequality is due to log x ≤ x − 1. Hence, the log-likelihood function is upper-bounded by the expectation of the logarithmic transform of the true PDF, ∫_{X_1^N} f(X_1^N) log f(X_1^N) dX_1^N.
The log-likelihood score of the j'th iteration can be written as

log f(X_1^N|θ̂^(j)) = log f(X_1^N|θ̂^(j)) ∫_{Z_1^N} f(Z_1^N|X_1^N, θ̂^(j)) dZ_1^N   (14)
  = ∫_{Z_1^N} f(Z_1^N|X_1^N, θ̂^(j)) log f(X_1^N|θ̂^(j)) dZ_1^N   (15)
  = ∫_{Z_1^N} f(Z_1^N|X_1^N, θ̂^(j)) log [ f(X_1^N, Z_1^N|θ̂^(j)) / f(Z_1^N|X_1^N, θ̂^(j)) ] dZ_1^N,   (16)

where the last step uses f(X_1^N, Z_1^N|θ̂^(j)) = f(Z_1^N|X_1^N, θ̂^(j)) f(X_1^N|θ̂^(j)).
The log-likelihood score of the (j + 1)'th iteration can be written as

log f(X_1^N|θ̂^(j+1)) = log ∫_{Z_1^N} f(X_1^N, Z_1^N|θ̂^(j+1)) dZ_1^N   (17)
  = log ∫_{Z_1^N} f(Z_1^N|X_1^N, θ̂^(j)) [ f(X_1^N, Z_1^N|θ̂^(j+1)) / f(Z_1^N|X_1^N, θ̂^(j)) ] dZ_1^N   (18)
  ≥ ∫_{Z_1^N} f(Z_1^N|X_1^N, θ̂^(j)) log [ f(X_1^N, Z_1^N|θ̂^(j+1)) / f(Z_1^N|X_1^N, θ̂^(j)) ] dZ_1^N,   (19)
where the inequality is due to Jensen’s inequality.
Comparing the log-likelihood score of the (j + 1)’th iteration to the
previous iteration, we get
log f(X_1^N|θ̂^(j+1)) − log f(X_1^N|θ̂^(j))
  ≥ ∫_{Z_1^N} f(Z_1^N|X_1^N, θ̂^(j)) log f(X_1^N, Z_1^N|θ̂^(j+1)) dZ_1^N
    − ∫_{Z_1^N} f(Z_1^N|X_1^N, θ̂^(j)) log f(X_1^N, Z_1^N|θ̂^(j)) dZ_1^N   (20)
  = Q(θ̂^(j+1)|θ̂^(j)) − Q(θ̂^(j)|θ̂^(j))   (21)
  ≥ 0,   (22)
where the last step follows from the definition of the maximization step (11).
Since the log-likelihood score is upper bounded, the convergence is proven.
2 Speech enhancement for noise reduction
In Papers A and B, we apply the statistical framework for enhancement of
speech in noisy environments. Noise reduction has become an increasingly
important component in speech communication systems. Based on a statistical framework using models of speech and noise, a noise reduction system
can be formulated as an estimation problem, in which the clean speech is to
be estimated from the noisy observation signal. In estimation theory, the
performance of an estimator is commonly measured as the expected value of
a distortion function that quantifies the difference between the clean speech
signal and the estimated signal. Success of the estimator that minimizes the
expected distortion depends largely on the accuracy of the assumed speech
and noise models and the perceptual relevance of the distortion measure.
Due to the inherent complexity of natural speech and human perception,
the present models and distortion measures are only simplified approximations. A large variety of noise reduction methods has been proposed and
their differences are often due to different assumptions on the models and
distortion measures.
The models discussed in section 1 can be applied to modeling of the
statistics of speech and noise. In the design of a noise reduction system,
additional assumptions on the noisy signal are made. Depending on the
environment, the interfering noise may be additive, e.g., from a separate
noise generator, or convolutive, e.g., due to room reverberations or fading.
Depending on the number of available microphones, the observation data
consist of one or more noisy signals. The corresponding noise reduction
methods are divided into single-channel (one microphone) and multi-channel
(microphone array) systems. The focus of this thesis is on single-channel
noise reduction for an additive noise environment. The noisy speech signal
is assumed to be the sum of the speech and noise signals. Furthermore, the
additive noise is often considered to be statistically independent of speech.
A successful noise reduction system may be used as a preprocessor to a
speech communication system or an automatic speech recognition system.
In fact, several algorithms have been successfully integrated into standardized speech coders for mobile speech communication. Examples of speech
coders with a noise reduction module include the Enhanced Variable Rate
codec (EVRC) [1] and Selectable Mode Vocoder (SMV) [7]. 3GPP (The
third Generation Partnership Project) has standardized minimum performance requirements and an evaluation procedure for the Adaptive Multi-Rate (AMR) codec [6], and a number of noise reduction algorithms that
satisfy the requirements have been developed, e.g., [3–5, 8].
The remainder of this section is organized as follows. An overview of
single-channel noise reduction is given in section 2.1. The focus is on frequency domain approaches that make use of a short-time spectral attenuation filter. The Bayesian estimation framework is introduced and noise
estimation is discussed. Sections 2.2 and 2.3 provide an overview of classical statistical model based methods using short-term models and long-term
models, respectively. Sections 2.2 and 2.3 also serve as an introduction
to Papers A and B. For a more extensive tutorial on speech enhancement
algorithms for noise reduction, we refer to [37, 78, 135].
2.1 Overview
Due to the quasi-stationarity property of speech, the noisy signal is often
processed on a frame-by-frame basis. For additive noise, the noisy speech
signal of the n'th frame is modeled as

y_n[i] = x_n[i] + w_n[i],   (23)
where i is the time index within the n'th frame and y, x, w denote noisy speech, clean speech and noise, respectively. In vector notation, the noisy speech, clean speech and noise frames are denoted by y_n, x_n and w_n, respectively. In the following sections, the speech model parameters are denoted using an overbar ( ¯ ) and the noise model parameters using double dots ( ¨ ).
Short-time spectral attenuation
It is common to perform estimation in the spectral domain, e.g., through
short-time Fourier analysis and synthesis. An analysis window is applied
to each time domain signal segment and the DFT is then applied. For the
n’th frame, the frequency domain signal model is given by
Y_n[k] = X_n[k] + W_n[k],   (24)
where k denotes the index of the frequency band.
For frequency-domain noise reduction, the noisy spectrum is attenuated
in the frequency bins that contain noise. The level of attenuation depends
on the signal to noise ratio (SNR) of the frequency bin. The attenuation is
performed by means of an adaptive spectral attenuation filter, denoted by
Hn [k], that is applied to Yn [k] to obtain an estimate of Xn [k], X̂n [k], given
by:
X̂_n[k] = H_n[k] Y_n[k].   (25)
The inverse DFT is applied to X̂n = [X̂n [1], . . . , X̂n [K]]T to obtain the
enhanced speech signal frame x̂n in the time domain. To avoid discontinuity
at frame boundaries, the overlap and add approach is used in synthesis. The
schematic diagram of a frequency domain noise reduction method using a
short-time Fourier transform is shown in Figure 5.
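The analysis–synthesis chain of Figure 5 can be sketched as follows. With a square-root periodic Hann window applied at both analysis and synthesis, 50% overlap, and no spectral modification (H_n[k] = 1), overlap-add reconstructs the interior of the input exactly; the frame length and window are illustrative choices, not those of the thesis:

```python
import numpy as np

def stft_enhance(y, gain_fn, frame=256):
    """Windowed DFT analysis, per-frame spectral gain, overlap-add synthesis."""
    hop = frame // 2
    # square-root periodic Hann window, applied at analysis and synthesis
    win = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame) / frame))
    n_frames = (len(y) - frame) // hop + 1
    out = np.zeros(len(y))
    for n in range(n_frames):
        seg = y[n * hop : n * hop + frame] * win         # analysis window
        Y = np.fft.rfft(seg)                             # DFT, eq. (24)
        X_hat = gain_fn(n, Y) * Y                        # attenuation, eq. (25)
        out[n * hop : n * hop + frame] += win * np.fft.irfft(X_hat, frame)
    return out

rng = np.random.default_rng(2)
y = rng.standard_normal(4096)
identity = stft_enhance(y, lambda n, Y: 1.0)
# With H_n[k] = 1, overlap-add reconstructs the signal away from the edges
assert np.allclose(identity[256:-256], y[256:-256], atol=1e-10)
```

Any of the attenuation filters discussed below (spectral subtraction, Wiener, STSA) can be plugged in as `gain_fn`.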
Spectral subtraction
One of the classical methods for noise reduction is spectral subtraction
[18, 92]. The method is based on a direct estimation of speech spectral
magnitude by subtracting the noise spectral magnitude from the noisy one.

Figure 5: Noise reduction method using a short-time Fourier transform (the noisy frame y_n is transformed by the DFT, attenuated by the noise reduction filter driven by a noise estimation block, and transformed back by the inverse DFT to give x̂_n).
The attenuation filter for the spectral subtraction method can be formulated as

H_n^SS[k] = ( max(|Y_n[k]|^r − |Ŵ_n[k]|^r, 0) / |Y_n[k]|^r )^(1/r),   (26)
where |Ŵn [k]| is an estimate of the noise spectrum amplitude, and r =
1 for spectral magnitude subtraction [18] and r = 2 for power spectral
subtraction [92]. The power spectral subtraction method is closely related
to the Wiener filtering approach, with the only difference in the power term
(see section 2.2). For both magnitude and power subtractions, the phase
of the noisy spectrum is used unprocessed in construction of the enhanced
speech. One motivation is that the human auditory system is less sensitive
to noise in spectral phase than in spectral magnitude [139], therefore the
noise in spectral phase is less annoying.
Another formulation of the power spectral subtraction [92] is based on
the complex Gaussian model (3) of the noisy spectrum. Under the Gaussian assumption, and due to the independence assumption, the noisy power
spectrum of the k'th frequency bin is λ_n²[k] = λ̄_n²[k] + λ̈_n²[k], where λ̄_n²[k] and λ̈_n²[k] are the power spectra of speech and noise, respectively. The PDF of Y_n[k] is given by

f(Y_n[k]) = 1 / (π(λ̄_n²[k] + λ̈_n²[k])) · exp( −|Y_n[k]|² / (λ̄_n²[k] + λ̈_n²[k]) ).   (27)
Assuming that an estimate λ̈̂_n²[k] of the noise power spectrum exists, the maximum likelihood (ML) estimate of the speech power spectrum maximizes f(Y_n[k]) with respect to λ̄_n²[k]. Using the additional constraint that the power spectrum is nonnegative, the ML estimate of λ̄_n²[k] is given by

λ̄̂_n²[k] = max(|Y_n[k]|² − λ̈̂_n²[k], 0).   (28)
The implementation of the spectral subtraction method is straightforward. However, the processed speech is plagued by the so-called musical
noise phenomenon. The musical noise is the residual noise containing isolated tones located randomly in time and frequency. These tones correspond
to spectral peaks in the original noisy signal and are left alone when their
surrounding spectral bins in the time-frequency plane have been heavily
attenuated. The musical noise is perceptually annoying and several methods have been derived from the spectral subtraction method targeting the
musical noise problem [48, 83, 133, 136].
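A direct transcription of (26) can be sketched as below, with the enhanced spectrum formed as X̂_n[k] = H_n^SS[k] Y_n[k] so that the noisy phase is kept unprocessed; the noise magnitude estimate and the bin values are assumed given and are hypothetical:

```python
import numpy as np

def spectral_subtraction_gain(Y_mag, W_mag, r=2.0):
    """Attenuation filter of eq. (26); r=1: magnitude, r=2: power subtraction."""
    num = np.maximum(Y_mag**r - W_mag**r, 0.0)   # flooring at zero, as in (28)
    return (num / np.maximum(Y_mag, 1e-12)**r) ** (1.0 / r)

# Hypothetical noisy DFT bins and an assumed noise magnitude estimate
Y = np.array([2.0 * np.exp(1j * 0.3), 0.5 + 0j])
W_mag = np.array([1.0, 1.0])
X_hat = spectral_subtraction_gain(np.abs(Y), W_mag) * Y   # noisy phase reused

assert np.isclose(np.abs(X_hat[0]), np.sqrt(3.0))  # gain sqrt(3)/2 on |Y|=2
assert X_hat[1] == 0.0                             # bin below the noise floor
```

The second bin illustrates the source of musical noise: bins falling below the noise estimate are zeroed, while isolated bins that happen to exceed it survive as random tones.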
Bayesian estimation approach
Spectral subtraction was originally an empirically motivated method, and
is often used in practice because of its simplicity. Alternatively, speech enhancement for noise reduction can be formulated in a statistical framework,
such that prior knowledge of speech and noise is incorporated through prior
PDFs. Using additionally a relevant cost function, speech can be estimated
using the Bayesian estimation framework. The following sections, 2.2 and
2.3, are devoted to methods using the Bayesian estimation framework. In
this subsection, a short introduction to Bayesian estimation is given.
At a conceptual level, assuming that speech and noise are random vectors
distributed according to PDFs f (x) and f (w), the joint PDF of x and y
can be formulated using the signal model (23). A Bayesian speech estimator
minimizes the Bayes risk which is the expected cost with respect to both x
and y, defined as
Bayes risk = ∫∫ C(x, x̂) f(x, y) dx dy,   (29)
where C(x, x̂) is a cost function between x and x̂.
A commonly used cost function is the squared error function,

C_MMSE(x, x̂) = ||x − x̂||²,   (30)
and the optimal estimator minimizing the corresponding Bayes risk is the
minimum mean square error (MMSE) estimator. The popularity of MMSE
based estimation is due to its mathematical tractability and good performance. It can be shown, e.g., [67], that the MMSE speech estimator is the conditional expectation of speech given the noisy speech,

x̂ = ∫ x f(x|y) dx.   (31)
Noise estimation
One key component of a single-channel noise reduction system is estimation
of the noise statistics. Often, a speech estimator is designed assuming the
actual noise power spectrum is available, and its optimality is no longer
ensured when the actual noise power spectrum is replaced by an estimate
of the noise spectrum. Therefore, the performance of the speech estimator
depends largely on the quality of the noise estimation algorithm.
A large number of noise estimation algorithms have been proposed. They are typically based on two assumptions: 1) speech does not occupy the entire time-frequency plane, and 2) noise is “more stationary” than speech. By assuming that noise is stationary over a longer (than a frame)
period of time, e.g., up to several seconds, the noise power spectrum can be
estimated by averaging the noisy power spectra in the frequency bands with
low or zero speech activity. The update of the noise estimate is commonly
controlled through a voice-activity detector (VAD), e.g., [1,92], speech presence probability estimate [24, 87, 120], or order statistics [59, 88, 125]. For
instance, the minimum statistics method [88] is based on the observation
that the noisy power spectrum frequently approaches the noise power spectrum level due to speech absence in these frequency bins. To obtain the
minimum statistics, a buffer is used to store past noisy power spectra of
each frequency band. A bias compensation that accounts for the fluctuations of the power spectrum estimate is proposed in [88]. In such a method, the size of the buffer is a trade-off. If the buffer is too small, it may not contain any speech-inactive frames, and speech may be falsely detected as noise.
If the buffer is too large, it may not react fast enough for non-stationary
noise types. In [88], the buffer length is set to 1.5 seconds.
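The buffered-minimum idea can be sketched as below. This is a deliberate simplification: the full method of [88] uses optimally smoothed periodograms and a derived (not fixed) bias compensation, and the fixed `bias` factor here is a hypothetical stand-in:

```python
import numpy as np

def min_statistics_noise_psd(noisy_psd, buf_frames=150, bias=1.5):
    """Track the noise PSD as the bias-compensated minimum of past noisy PSDs.

    noisy_psd: array of shape (n_frames, n_bins); buf_frames is the buffer
    length (e.g., 1.5 s worth of frames in [88]); bias is illustrative.
    """
    n_frames, _ = noisy_psd.shape
    est = np.empty_like(noisy_psd)
    for n in range(n_frames):
        lo = max(0, n - buf_frames + 1)
        est[n] = bias * noisy_psd[lo : n + 1].min(axis=0)  # buffered minimum
    return est

# Stationary unit-level noise with a loud speech burst in frames 50-59:
psd = np.ones((200, 4))
psd[50:60] += 10.0
est = min_statistics_noise_psd(psd)
assert np.allclose(est[55], 1.5)   # the minimum ignores the speech burst
```

Because the burst is shorter than the buffer, the tracked minimum stays at the noise level even during speech activity, which is exactly the property that makes minimum statistics robust without an explicit VAD.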
2.2 Short-term model based approach
This section provides an overview of some Bayesian speech estimators using
short-term models as discussed in section 1.1. In particular, we discuss
Wiener filtering and spectral amplitude estimation methods.
Wiener filtering
Wiener filtering [141] is a classical signal processing technique that has been
applied to speech enhancement. Wiener filters are the optimal linear time-invariant filters under the minimum mean square error (MMSE) criterion, and
can be classified as Bayesian estimators under the constraint of linear estimation. The Wiener filter is also the optimal MMSE estimator under a
Gaussian assumption.
Assuming linear estimation, the speech estimate in the time domain is
of the form
x̂_n = H_n y_n,   (32)
for a linear estimation matrix Hn . The MMSE linear estimator fulfills
H_n^WF = arg min_H ∫∫ ||x_n − H y_n||² f(x_n, y_n) dx_n dy_n.   (33)
The optimal estimator can be solved by taking the first derivative of
(33) with respect to H, and setting the obtained expression to zero. Using
also the assumption that speech and noise have both zero mean and are
statistically independent, the optimal linear estimator is
H_n^WF = D̄_n (D̄_n + D̈_n)^(−1),   (34)
where D̄n and D̈n are the covariance matrices of speech and noise, respectively, of frame n.
An alternative point of view for the Wiener filtering is through MMSE
estimation under a Gaussian assumption. Assuming that speech and noise
are of zero-mean Gaussian distributions, with covariance matrices D̄n and
D̈n , it can be shown [67] that the conditional PDF of xn given yn is also
Gaussian with mean D̄n (D̄n + D̈n )−1 yn . Hence, the MMSE estimator for
a Gaussian model is linear and leads to the same result as the linear MMSE
estimator of an arbitrary PDF (34).
Due to the correspondence between convolution in the time domain and multiplication in the frequency domain, the Wiener filter in the frequency domain has the simpler form of (25) with

H_n^WF[k] = λ̄_n²[k] / (λ̄_n²[k] + λ̈_n²[k]),   (35)
for the k’th frequency bin.
Practical implementation of the Wiener filter requires estimation of the
second-order statistics (the power spectra) of speech and noise, λ̄_n²[k] and λ̈_n²[k]. The noise spectrum may be obtained through a noise estimation
algorithm (section 2.1) and the speech spectrum may be estimated using
(28). Hence, the actual implementation of the Wiener filtering is closely
related to the power spectral subtraction method, and also suffers from the
musical noise issue.
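The relation to power spectral subtraction can be made concrete: plugging the ML speech PSD estimate (28) into the Wiener gain (35) yields max(|Y_n[k]|² − λ̈̂_n²[k], 0)/|Y_n[k]|², i.e., the square of the power-subtraction gain (26) with r = 2. A minimal sketch with hypothetical bin values:

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    """Frequency domain Wiener filter, eq. (35)."""
    return speech_psd / (speech_psd + noise_psd)

# Hypothetical noisy periodograms and an assumed noise PSD estimate
Y_pow = np.array([4.0, 0.5])
noise_pow = np.array([1.0, 1.0])

speech_pow = np.maximum(Y_pow - noise_pow, 0.0)  # ML speech PSD, eq. (28)
H = wiener_gain(speech_pow, noise_pow)

# First bin: (4-1)/4 = 0.75, the square of the power-subtraction gain
# sqrt(3)/2; second bin is fully attenuated.
assert np.allclose(H, [0.75, 0.0])
```

This makes explicit why a Wiener filter built from instantaneous spectral estimates inherits the musical noise behavior of spectral subtraction.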
Speech spectral amplitude estimation
Another well-known and widely applied MMSE estimator for speech enhancement is the MMSE short-time spectral amplitude (STSA) estimator
by Ephraim and Malah [38]. The STSA method does not suffer from musical noise in the enhanced speech, and is therefore interesting also from a
perceptual point of view.
From the MMSE estimation result (31), the STSA estimator is formulated as the expected value of the speech spectral amplitude conditioned
on the noisy spectrum. The STSA estimator depends on two parameters, the a-posteriori SNR R_n^post[k] and the a-priori SNR R_n^prio[k], that need to be evaluated for each frequency bin. The a-posteriori SNR is defined as the ratio of the noisy periodogram and the noise power spectrum estimate²,

R_n^post[k] = |Y_n[k]|² / λ̈̂_n²[k],   (36)
and the a-priori SNR is defined as the ratio of the speech and noise power spectra [38, 92],

R_n^prio[k] = λ̄_n²[k] / λ̈_n²[k].   (37)
The optimal attenuation filter for the STSA estimator is shown to be [38]

H_n[k] = (√π / 2) · √( R_n^prio[k] / ( (1 + R_n^prio[k]) R_n^post[k] ) ) · M( R_n^post[k] R_n^prio[k] / (1 + R_n^prio[k]) ),   (38)

where M is the function

M(θ) = exp(−θ/2) [ (1 + θ) I_0(θ/2) + θ I_1(θ/2) ],   (39)
where I0 and I1 are the zeroth and first-order modified Bessel functions,
respectively.
In practice, the a-priori SNR is not available and needs to be estimated.
Ephraim and Malah proposed a solution based on the decision-directed approach, and the a-priori SNR is estimated as [38]
R_n^prio[k] = α |H_{n−1}[k] Y_{n−1}[k]|² / λ̈̂_n²[k] + (1 − α) max(R_n^post[k] − 1, 0),   (40)

where 0.9 ≤ α < 1 is a tuning parameter.
The STSA method has demonstrated superior perceptual quality compared to classic methods such as spectral subtraction and Wiener filtering. In particular, the enhanced speech does not suffer from the musical
noise phenomenon. It has been argued [22] that the perceptual relevance of
the method is mainly due to the decision-directed approach for estimating the a-priori SNR. (² This definition of the a-posteriori SNR differs from the traditional definition of signal-to-noise ratio by including both speech and noise in the numerator.) The decision-directed approach has an attractive temporal
smoothing behavior on the SNR estimate. It was shown [22] that for low
R^post, i.e., in the absence of speech, R^prio is an averaged SNR over a large number of successive frames; for high R^post, R^prio follows R^post with a one-frame delay. This smoothing behavior has a significant effect on the perceptual quality of the enhanced speech, and the same strategy is applicable
to other enhancement methods such as the Wiener filtering and spectral
subtraction to reduce the musical noise in the enhanced speech.
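Equations (38)–(40) can be transcribed as follows, using the exponentially scaled Bessel functions from scipy for numerical stability; the input SNR values below are hypothetical:

```python
import numpy as np
from scipy.special import i0e, i1e  # i0e(x) = exp(-x) I_0(x), i1e likewise

def stsa_gain(r_prio, r_post):
    """STSA attenuation filter, eqs. (38)-(39)."""
    v = r_post * r_prio / (1.0 + r_prio)
    # M(v) with exp(-v/2) folded into the scaled Bessel functions
    m = (1.0 + v) * i0e(v / 2.0) + v * i1e(v / 2.0)
    return (np.sqrt(np.pi) / 2.0) * np.sqrt(r_prio / ((1.0 + r_prio) * r_post)) * m

def decision_directed(prev_gain, prev_Y, noise_psd, r_post, alpha=0.98):
    """A-priori SNR estimate, eq. (40)."""
    return alpha * np.abs(prev_gain * prev_Y)**2 / noise_psd \
        + (1.0 - alpha) * np.maximum(r_post - 1.0, 0.0)

# At high SNR the STSA gain approaches the Wiener gain R_prio/(1+R_prio)
assert abs(stsa_gain(100.0, 100.0) - 100.0 / 101.0) < 0.05
# At low a-priori SNR the filter attenuates strongly
assert stsa_gain(0.01, 1.0) < 0.2
```

The smoothing of (40) is visible in `decision_directed`: the first term recycles the previous frame's enhanced spectrum, while the second term tracks the current (floored) a-posteriori SNR.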
Research challenges
The methods we have discussed in section 2.2 require an estimate of the
noise power spectrum. Most noise estimation algorithms are based on the
assumption that the noise is stationary or mildly non-stationary. Modern
mobile devices are required to operate in acoustical environments with large
diversity and variability in interfering noise. The non-stationarity may be
due to the noise source itself, i.e., the noise contains impulsive or transient
components, periodic patterns, or due to the movement of the noise source
and/or the recording device. The latter scenario occurs commonly on a
street, where noise sources (e.g., cars) move rapidly around.
Due to the increased degree of non-stationarity in such an environment,
performance of most noise estimation algorithms is non-ideal and may significantly affect the performance of the speech estimation algorithm. Design
of a noise reduction system that is capable of dealing with non-stationary
noise is of great interest and a challenging task to the research community.
From the Bayesian estimation point of view, the performance of the
speech estimator is determined by the accuracy of the speech and noise
models, and perceptual relevancy of the distortion measure. Unfortunately,
the true statistics of speech and noise are not explicitly available. Also,
speech quality is a subjective measure that requires human listening experiments, and is expensive to assess. Objective measures that approximate
the “true” perceptual quality have been proposed, e.g., for the evaluation
of speech codecs [2, 14, 49, 50, 104, 110, 137, 138, 140], but most of them are
complex and mathematically intractable for deriving a speech estimator.
Nevertheless, a number of algorithms that are based on distortion measures that capture some properties of human perception have been proposed [57, 62, 79, 133, 136, 143]. For instance, the speech estimator in Paper
A optimizes a criterion that allows for an adjustable level of residual noise
in the enhanced speech, in order to avoid unnecessary speech artifacts.
Incorporating a higher degree of prior information of speech and noise
can also improve the performance of the speech estimator. With short-term
models, the assumed prior information is mainly the type of PDF, of which
the parameters of both the speech and noise PDFs (their respective power
spectra for Gaussian PDFs) need to be estimated. To incorporate stronger
prior information of speech and noise, more refined statistical models should
be used. For instance, a more precise PDF of speech may be obtained
by using a speech database and optimizing a long-term model through a
training algorithm. In the next section, we discuss noise reduction methods
using long-term models such as an HMM.
2.3 Long-term model based approach
This section provides an overview of noise reduction methods using long-term models from section 1.2. Long-term model based noise reduction systems include HMM based methods [35, 111], GMM based methods [32, 149]
and codebook based methods [73, 121]. In this section, we focus mainly on
methods based on hidden Markov models and their derivatives.
HMM-based speech estimation
Due to the inter-frame dependency of an HMM, the HMM based MMSE estimator (31) of speech can be formulated using current and past frames of the noisy speech,

x̂_n = ∫ x_n f(x_n|Y_1^n) dx_n,   (41)
where Y_1^n = {y_1, . . . , y_n} denotes the noisy observation sequence from frame
1 to n. Due to the Markov chain assumption of the underlying state sequence, both the current and past noisy frames are utilized in the speech
estimator of the current frame.
Assuming that both speech and noise are modeled using an HMM with
Gaussian sub-models, it can be shown that the noisy model is a composite HMM with Gaussian sub-models due to the statistical independence of
speech and noise. If the speech model has M̄ states and the noise model
M̈ states, the composite HMM has M̄ M̈ states, one for each combination
of speech and noise state. Under the Gaussian assumption, the noisy PDF
of a given state is a Gaussian PDF with zero mean and covariance matrix
Ds = D̄s̄ + D̈s̈ . The conditional PDF of speech given the noisy observation
sequence can be formulated as
f(x_n|Y_1^n) = Σ_{s_n} f(s_n, x_n|Y_1^n)
  = Σ_{s_n} f(s_n|Y_1^n) f(x_n|Y_1^n, s_n)
  = Σ_{s_n} f(s_n|Y_1^n) f(x_n|y_n, s_n),   (42)
where the last step is due to the conditional independence property of an
HMM, which states that the observation for a given state is independent of
other states and observations. Hence, the MMSE speech estimator using an
HMM (41) can be formulated as

x̂_n = Σ_{s_n} f(s_n|Y_1^n) ∫ x_n f(x_n|y_n, s_n) dx_n
    = Σ_{s_n} f(s_n|Y_1^n) · D̄_s̄ (D̄_s̄ + D̈_s̈)^(−1) y_n,   (43)
and each of the per-state conditional expectations is a Wiener filter (34).
The HMM based MMSE estimator is then a weighted sum of Wiener filters,
where the weighting factors are the posterior state probabilities given the
noisy observations. The weighting factors can be calculated efficiently using
the forward algorithm [40, 105].
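The weighted-sum structure of (43) can be sketched in the frequency domain, where the per-state covariances are diagonal and each per-state Wiener filter reduces to (35). The state posteriors are assumed to come from a forward-algorithm pass, and all PSD values here are hypothetical:

```python
import numpy as np

def hmm_mmse_estimate(y, state_post, speech_psd, noise_psd):
    """Eq. (43): posterior-weighted sum of per-state Wiener filters.

    y:          noisy DFT frame, shape (K,)
    state_post: f(s_n | Y_1^n) per composite state (from the forward
                algorithm), shape (S,)
    speech_psd, noise_psd: per-state PSDs (diagonal covariances), shape (S, K)
    """
    wiener = speech_psd / (speech_psd + noise_psd)   # per-state filter, (35)
    return (state_post[:, None] * wiener * y).sum(axis=0)

# Two hypothetical composite states over two frequency bins
y = np.array([1.0 + 0j, 2.0 + 0j])
speech = np.array([[3.0, 1.0], [1.0, 3.0]])
noise = np.ones((2, 2))

# With a one-hot posterior, the estimate is a single state's Wiener output
x1 = hmm_mmse_estimate(y, np.array([1.0, 0.0]), speech, noise)
assert np.allclose(x1, [0.75, 1.0])
```

When the posterior spreads over several states, the filters are blended, which is how the HMM estimator interpolates smoothly between speech sound classes.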
A significant difference between the HMM based estimator and the
Wiener filter (34) is that the Wiener filter requires on-line estimation of
both speech and noise statistics, while the per-state estimators of the HMM
approach use speech model parameters obtained through off-line training
(see section 1.2). Therefore, additional speech knowledge is incorporated in
an HMM based method.
Noise reduction using an HMM was originally proposed by Ephraim
in [35, 36] using an HMM with auto-regressive (AR) Gaussian sub-sources
for speech modeling. The noise statistics are modeled using a single Gaussian PDF, and the noise model parameters are obtained using the first few
frames of the noisy speech that are considered to be noise-only. Alternatively, an additional noise estimation algorithm may be used to obtain the
noise statistics. In [111], the HMM based methods are extended to allow
HMM based noise models. The method of [111] uses several noise models,
obtained off-line for different noise environments, and a classifier is used to
determine which noise model should be used. By utilizing off-line trained
models for both speech and noise, strong prior information of speech and
noise is incorporated. By having more than one noise state, each representing a distinct noise spectrum, the method can handle rapid changes of
the noise spectrum within a specific noise environment, and can provide
significant improvement for non-stationary noise environments.
Stochastic gain modeling
A key issue of using long-term speech/noise models obtained through off-line
training is the energy mismatch between the model and the observations.
Since the model is obtained off-line using training data, the energy level
of the training data may differ from the speech in the noisy observation,
causing a mismatch. Hence, strategies for gain adaptation are needed when
long-term models are used.
When using an AR-HMM for speech modeling [35], the signal is modeled
as an AR process in each state. The excitation variance of the AR model
is a parameter that is obtained from the training data. Consequently, the
different energy levels of a speech sound, typically due to differences in
pronunciation and/or different vocalizations of individual speakers, are not
efficiently modeled. In fact, an AR-HMM implicitly assumes that speech
consists of a finite number of frame energy levels. Another energy mismatch
between the speech model and the observation is due to different recording conditions during off-line training and on-line estimation, and may be
considered as a global energy mismatch. In the MMSE speech estimator
of [35], a global gain factor is used to reduce this mismatch.
A similar problem appears in noise modeling, since the noise energy level
in the noisy environment is inherently unknown, time-varying, and in most
natural cases, different from the noise energy level in the training. In [111],
a heuristic noise gain adaptation using a voice activity detector has been
proposed, where the adaptation is performed in speech pauses longer than
100 ms. In [146], continuous noise gain model estimation techniques were
proposed.
In Paper A, we propose a new HMM based gain-modeling technique that
extends the HMM-based methods [35,111] by introducing explicit modeling
of both speech and noise gains in a unified framework. The probability
density function of xn for a given state s̄ is the integral over all possible
speech gains. Modeling the speech energy in the logarithmic domain, we
then have
f_s̄(x_n) = ∫_{−∞}^{∞} f_s̄(ḡ'_n) f_s̄(x_n | ḡ'_n) dḡ'_n,        (44)

where ḡ'_n = log ḡ_n and ḡ_n denotes the speech gain in the linear domain.
The extension of Paper A over the traditional AR-HMM is the stochastic modeling of the speech gain ḡn , where ḡn is considered as a stochastic
process. The PDF of ḡn is modeled using a state-dependent log-normal
distribution, motivated by the simplicity of the Gaussian PDF and the
appropriateness of the logarithmic scale for sound pressure level. In the
logarithmic domain, we have
f_s̄(ḡ'_n) = (1/√(2πσ̄_s̄²)) exp(−(ḡ'_n − µ̄_s̄ − q̄_n)² / (2σ̄_s̄²)),        (45)

with mean µ̄_s̄ + q̄_n and variance σ̄_s̄². The time-varying parameter q̄_n denotes
the speech-gain bias, which is a global parameter compensating for the long-term
energy level of speech, e.g., due to a change of physical location of
the recording device. The parameters {µ̄_s̄, σ̄_s̄²} are modeled to be time-invariant,
and can be obtained off-line using training data, together with
the other speech HMM parameters.
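As a numerical illustration of the log-normal gain model (45), sampling the log-domain gain and exponentiating reproduces the expected log-domain and linear-domain means; the parameter values below are arbitrary and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, q = -1.0, 0.25, 0.3        # hypothetical state parameters and gain bias
# log-domain gain per (45): Gaussian with mean mu + q and variance sigma2
g_log = rng.normal(mu + q, np.sqrt(sigma2), size=200000)
g_lin = np.exp(g_log)                  # linear-domain gain: log-normally distributed
print(g_log.mean())                    # close to mu + q = -0.7
print(g_lin.mean())                    # close to exp(mu + q + sigma2/2)
```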
The model proposed in Paper A is referred to as a stochastic-gain HMM
(SG-HMM), and may be applied to both speech and noise. To obtain the
time-invariant parameters of the model, an off-line training algorithm is
proposed based on the EM technique. For time-varying parameters, such as
q̄n , an on-line estimation algorithm is proposed based on the recursive EM
technique. In Paper A, we demonstrate through objective and subjective
evaluations that the new model improves the modeling of speech and noise,
and also the noise reduction performance.
The proposed HMM based gain modeling is related to the codebook
based methods [73, 74, 122, 124]. In the codebook methods, the prior speech
and noise models are represented as codebooks of spectral shapes, parameterized by linear prediction (LP) coefficients. For each frame, and each
combination of speech and noise code vectors, the excitation variances are
estimated instantaneously. In [73, 123], the most likely combination of code
vectors is used to form the Wiener filter. In [74, 124], a weighted sum
of Wiener filters is proposed, where the weighting is determined by the
posterior probabilities similarly to (43). The Bayesian formulation in [124]
is based on the assumption that both the code vectors and the excitation
variances are uniformly distributed.
Adaptation of the noise HMM
It is widely accepted that the characteristics of speech can be learned from a
(sufficiently rich) database of speech. A successful example is in the speech
coding application, where, e.g., the linear prediction coefficients (LPC) can
be represented with transparent quality using a finite number of bits [97].
However, noise may vary to a large extent in real-world situations. It is
unrealistic to assume that one can capture all of this variation in an initial
learning stage, and this implies that on-line adaptation to changing noise
characteristics is necessary.
Several methods have been proposed to allow noise model adaptation
based on different assumptions, e.g., [149] for a GMM and [84, 93] for an AR-HMM. The algorithms in [84, 149] apply only to a batch of noisy data,
e.g., one utterance, and is not directly applicable for on-line estimation.
The noise model in [149] is limited to stationary Gaussian noise (white or
colored). In [84, 93], a noise HMM is estimated from the observed noisy
speech, using a voice-activity-detector (VAD), such that noise-only frames
are used in the on-line learning.
In Paper B, we propose an adaptive noise model based on SG-HMM. The
results from Paper A demonstrate good performance for modeling speech
and noise. The work is extended in Paper B with an on-line noise estimation
algorithm for enhancement in unknown noise environments. The model parameters are estimated on-line using the noisy observations without requiring a VAD. Therefore, the on-line learning of the noise model is continuous
and facilitates adaptation and correction to changing noise characteristics.
Estimation of the noise model parameters is based on maximizing the likelihood of the noisy observations, and the proposed implementation is based on the
recursive expectation maximization (EM) framework [72, 132].
3 Flexible source coding
The second part of this thesis focuses on flexible source coding for unreliable
networks. In Paper C, we propose a quantizer design based on a Gaussian
mixture model for describing the source statistics. The proposed quantizer
design allows for rate adaptation in real-time depending on the current
network condition. This section provides an overview of the theory and
practice that form the basis for quantizer design.
The section is organized as follows. Basic definitions and design criteria
are formulated in section 3.1. High-rate theory is introduced in section 3.2.
In section 3.3, some practical coder designs motivated by high-rate theory
are discussed.
3.1 Source coding
Information theory, founded by Shannon in his legendary 1948 paper [116],
forms the theoretical foundation for modern digital communication systems.
As shown in [116], the communication chain can roughly be divided into
two separate parts: source coding for data compression and channel coding
for data protection. Optimal performance of source and channel coding
can be predicted using the source and channel coding theorems [116]. An
implication of the theorems is that optimal transmission of a source may be
achieved by separately performing source coding followed by channel coding,
a result commonly referred to as the source-channel separation theorem.
The source and channel coding theorems motivate the design of essentially
all modern communication systems. The optimality of separate source and
channel coding is based on the assumption of coding in infinite dimensions,
and therefore requires an infinite delay. While optimality is not guaranteed
for any practical coder with finite delay, the separate design is used in practical
applications due to its simplicity.
Source coding consists of lossless coding [63, 75–77, 109, 142], to achieve
compression without introducing error, and lossy coding [16, 117], which
achieves a higher compression level but introduces distortion. Modern
source coders utilize lossy coding and remove both irrelevancy and redundancy. They often incorporate lossless coding methods to remove redundancy from parameters that require perfect reconstruction, such as quantization indices. Such a system typically consists of a quantizer (lossy coding)
followed by an entropy coder (lossless coding).

Figure 6: Rate-distortion function for a unit-variance Gaussian source and the mean square error distortion measure.

The rate-distortion performance for the optimal coder design can be predicted by rate-distortion theory [116]. The theory provides a theoretical bound for the achievable rate
without exceeding a given distortion. The rate-distortion bound demonstrates an interesting and intuitive behavior, namely that the achievable
rate is a monotonically decreasing convex function of distortion. An example of a rate-distortion function for a unit-variance Gaussian source is shown
in Figure 6. While the theory is derived under the assumption of infinite
dimensions, it is expected that the behavior is also relevant in lower dimensions. Flexible source coder design should allow adjustment of the rate-distortion trade-off, e.g., multiple operation points on the rate-distortion
curve, such that the coder can operate optimally over a large range of rates.
For an unreliable network, it is then possible to adapt the rate of the coder in
real-time to the current network condition. By operating at a rate adapted
to the network condition, the coder performs with the best quality (lowest
distortion) for the available bandwidth.
Quantization
Let x denote a source vector in the K-dimensional Euclidean space R^K, distributed according to the PDF f(x). Quantization is a function

x̂ = Q(x)        (46)

that maps each source vector x ∈ R^K to a code vector x̂ from a codebook, which consists of a countable set of vectors in R^K. Each code vector is associated with a quantization cell (decision region), cell(x̂), defined as

cell(x̂) = {x ∈ R^K : Q(x) = x̂}.        (47)
An index is assigned to each code vector in the codebook, and the index
is coded with a codeword, i.e., a sequence of bits. Each quantization index
corresponds to a particular codeword and all codewords together form a
uniquely decodable code, which implies that any encoded sequence of indices
can be uniquely decoded. For K = 1, the quantizer applies to scalar values,
and is called a scalar quantizer. The more general quantizer with K ≥ 1 is
called a vector quantizer.
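As a concrete illustration of (46) and (47), a nearest-neighbor quantizer under the squared-error measure can be sketched as follows (the codebook below is arbitrary):

```python
import numpy as np

def quantize(x, codebook):
    """Q(x): map x to the nearest code vector; the implied decision
    regions cell(x_hat) partition R^K."""
    d2 = np.sum((codebook - x)**2, axis=1)   # squared distance to each code vector
    i = int(np.argmin(d2))
    return i, codebook[i]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
i, xq = quantize(np.array([0.9, 1.2]), codebook)
print(i, xq)   # the input falls in cell([1, 1])
```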
Rate
Rate can be defined as the average codeword length per source vector. Since
the available bandwidth is limited, the quantizer is optimized under a rate
constraint. The codewords may be constrained to have a particular fixed
length, resulting in a fixed-rate coder, or constrained to have a particular
average length, resulting in a variable-rate coder.
For a fixed-rate coder with N code vectors in the codebook, the rate is
given by
R_fix. = ⌈log₂ N⌉,        (48)

where ⌈·⌉ denotes the ceiling function (the smallest integer equal to or larger
than the input value). To achieve efficient quantization, N is typically
chosen to be an integer power of 2, to avoid rounding.
For a variable-rate coder, code vectors are coded using codewords with
different lengths. The rate determines the average codeword length,
R_var. = Σ_{x̂} p(x̂) ℓ(x̂),        (49)

where ℓ(x̂) denotes the codeword length of x̂ and p(x̂) = ∫_{cell(x̂)} f(x) dx is
the probability mass function of x̂, obtained by integrating f(x) over the
quantization cell of x̂. From the source coding theorem [116], the optimal
variable-rate code has an average rate approaching the entropy of x̂, denoted
H,
H = − Σ_{x̂} p(x̂) log₂ p(x̂).        (50)
Using an entropy code such as the arithmetic code [75–77,109,142], this rate
can be approached arbitrarily closely by increasing the sequence length.
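A small numerical check of (49) and (50), using a hypothetical four-vector codebook whose probabilities are dyadic, so that a simple prefix code meets the entropy bound exactly:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])        # PMF p(x_hat) of four code vectors
codewords = ["0", "10", "110", "111"]          # a uniquely decodable (prefix) code
R_var = sum(pi * len(c) for pi, c in zip(p, codewords))   # average rate (49)
H = -np.sum(p * np.log2(p))                    # entropy (50)
print(R_var, H)   # 1.75 1.75: this code achieves the entropy bound
```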
Distortion
The performance of a quantizer is measured by the difference or error between the source vector and the quantized code vector. This difference
defines an error function, denoted d(x, x̂), that quantifies the distortion due
to the quantization of x. The performance of the system is given by the
average distortion defined as the expected error with respect to the input
variable,
D = ∫ d(x, x̂) f(x) dx.        (51)
The error function should be meaningful with respect to the source, and it
typically fulfills some basic properties, e.g., [53]
• d(x, x̂) ≥ 0
• if x = x̂ then d(x, x̂) = 0.
The most commonly used error function is the square error function,
d(x, x̂) = (1/K) ||x − x̂||² = (1/K) (x − x̂)^T (x − x̂),        (52)
where || · || denotes the L2 vector norm, and (·)T denotes the transpose.
The mean distortion of (52) is referred to as the mean square error (MSE)
distortion. The MSE distortion is commonly used due to its mathematical
tractability.
Other distortion measures may be conceptually or computationally more
complicated, e.g., by incorporating perceptual properties of the human auditory systems. If a complex error function can be assumed to be continuous
and have continuous derivatives, the error function can be approximated
under a small error assumption. By applying a Taylor series expansion of
d(x, x̂) around x = x̂, and noting that the first two expansion terms vanish
for x = x̂, the error function may then be approximated as [44]
d(x, x̂) ≈ (1/(2K)) (x − x̂)^T W(x̂) (x − x̂),        (53)
where W(x̂) is a K ×K dimensional matrix, the so-called sensitivity matrix,
with the i, jth element defined by
W(x̂)[i, j] = ∂²d(x, x̂)/∂x[i]∂x[j] |_{x=x̂},        (54)

i.e., the matrix of second-order partial derivatives of the error function, evaluated at the expansion point x̂. In [44], the
sensitivity matrix approach was applied to quantization of speech linear
prediction coefficients (LPC) using, e.g., the log-spectral distortion (LSD)
measure. In [101, 102], a sensitivity matrix was derived and analyzed for
a perceptual distortion measure based on a sophisticated spectral-temporal
auditory model [29, 30].
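The sensitivity-matrix approximation can be checked numerically for a scalar (K = 1) log-spectral-style error function; the error function and values below are illustrative only:

```python
import numpy as np

# Hypothetical scalar (K = 1) error function: d(x, xh) = (ln x - ln xh)^2.
d = lambda x, xh: (np.log(x) - np.log(xh))**2

def sensitivity(xh, eps=1e-5):
    """W(xh) per (54): numerical second derivative of d w.r.t. x at x = xh."""
    return (d(xh + eps, xh) - 2.0 * d(xh, xh) + d(xh - eps, xh)) / eps**2

xh, x = 1.0, 1.05                    # a small quantization error
W = sensitivity(xh)                  # analytically 2/xh^2 = 2 here
approx = 0.5 * W * (x - xh)**2       # right-hand side of (53) with K = 1
print(W, d(x, xh), approx)           # the approximation is within a few percent
```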
Quantizer design
The optimal quantizer can be formulated using the constrained optimization
framework, by minimizing the expected distortion, D, while constraining the
average rate below or equal to the available rate budget. The constraint is
defined using (48) for a fixed-rate coder or (50) for a variable-rate coder.
The resulting quantizer designs are referred to as a resolution-constrained
quantizer and an entropy-constrained quantizer, respectively. In general, an
entropy-constrained quantizer has better rate-distortion performance compared to a resolution-constrained quantizer, due to the flexibility of assigning bit sequences of different lengths to different code vectors according to
the probability of their appearance. In Paper C, we propose a flexible design
of an entropy-constrained vector quantizer.
Traditional quantizer design is typically based on an iterative training
algorithm [23,81,82,91]. By iteratively optimizing the encoder (decision regions or quantization cells) and decoder (reconstruction point of each cell),
and ensuring that the performance is non-decreasing at each iteration, a locally optimal quantizer is constructed. Using a proper initialization scheme,
and given sufficient training data, a good rate-distortion performance can
be achieved by a training based quantizer.
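The iterative partition-and-update procedure can be sketched as follows (a k-means-style sketch of the generalized Lloyd algorithm, not the specific designs of [23, 81, 82, 91]):

```python
import numpy as np

def lloyd_vq(data, N, iters=30, seed=0):
    """Iterative quantizer training: alternate nearest-neighbor partitioning
    (encoder) and centroid updates (decoder); the average distortion is
    non-increasing over the iterations."""
    rng = np.random.default_rng(seed)
    cb = data[rng.choice(len(data), N, replace=False)].copy()
    prev = np.inf
    for _ in range(iters):
        d2 = ((data[:, None, :] - cb[None, :, :])**2).sum(-1)
        assign = d2.argmin(1)                        # decision regions
        D = d2[np.arange(len(data)), assign].mean()  # average distortion
        assert D <= prev + 1e-12                     # monotonicity check
        prev = D
        for j in range(N):                           # reconstruction points
            members = data[assign == j]
            if len(members):
                cb[j] = members.mean(0)
    return cb, D
```

Both steps can only decrease the average distortion, so the training converges to a locally optimal codebook.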
An issue with the traditional quantizer design is the computational complexity and memory requirement associated with the codebook search and
storage. As the number of code vectors is exponentially increasing with
respect to the rate, and in general a higher rate (per vector) is needed
for quantizing higher dimensional vectors, the approach is often limited to
low-rate and scalar or low-dimensional vector quantizers.
Another disadvantage of the traditional training based quantizer design
is that the resulting quantizer has a fixed codebook. Its rate-distortion performance is determined off-line, and adaptation of the codebook according
to the current available bandwidth is unfeasible without losing optimality.
To allow a certain flexibility, one can optimize multiple codebooks, each
for a different rate, and select the best codebook. A disadvantage of this
approach is, however, the additional memory requirement, e.g., for storage
of multiple codebooks.
It is desirable to have a flexible quantizer design that allows for fine-grained rate adaptation, such that the rate can be selected from a larger
range with a finer resolution, without increasing the memory requirement.
To achieve this, the quantizer design should avoid the use of an explicit codebook (which requires storage); instead, the code vectors are constructed on-line
for a given rate. High-rate theory provides a theoretical basis for designing
such a quantizer.
3.2 High-rate theory
High-rate theory is a mathematical framework for analyzing the behavior and predicting the performance of a quantizer of any dimensionality.
High-rate theory originated from the early works on scalar quantization
[15, 47, 96, 98], and was later extended for vector quantization [45, 144]. A
comprehensive tutorial on quantization and high-rate theory is given in [54].
Textbooks treating the topic include [16, 46, 56, 71].
High-rate theory applies when the rate is high with respect to the
smoothness of the source PDF f (x). Under this high-rate assumption,
f(x) may be assumed to be constant within each quantization cell, i.e., f(x) ≈ f(x̂) for x ∈ cell(x̂). The probability mass function (PMF) of x̂ can
then be approximated as

p(x̂) ≈ f(x̂) vol(x̂),        (55)
where vol(x̂) denotes the volume of cell(x̂).
Instead of using a codebook, a high-rate quantizer can be formulated
as a quantization point density function, specifying the distribution of the
code vectors, e.g., the number of code vectors per unit volume. A quantization point density function may be defined as a continuous function that
approximates the inverse volumes of the quantization cells [71],
g_c(x) ≈ 1 / vol(x̂),  if x ∈ cell(x̂).        (56)
Under the high-rate assumption, optimal solutions for g_c(x) can be derived
for different rate constraints. Practical quantizers designed according to
the obtained gc (x) approach optimality with increasing rate. Nevertheless,
quantizers designed based on the high-rate theory have shown competitive
performance also at a lower rate. In Paper C, the actual performance of
the proposed quantizer is very close to the performance predicted by
high-rate theory for rates above 3-4 bits per dimension.
High-rate distortion
The average distortion of a quantizer can be reformulated using the high-rate assumption. For the MSE distortion measure, (51) can be written as

D_MSE = (1/K) ∫ ||x − x̂||² f(x) dx
      ≈ (1/K) Σ_{x̂} ∫_{cell(x̂)} ||x − x̂||² f(x̂) dx
      = (1/K) Σ_{x̂} (p(x̂)/vol(x̂)) ∫_{cell(x̂)} ||x − x̂||² dx.        (57)
Let

C(x̂) = (1/K) vol(x̂)^{−(K+2)/K} ∫_{cell(x̂)} ||x − x̂||² dx,        (58)
denote the coefficient of quantization of x̂. C(x̂) is the normalized distortion
with respect to volume, and shows the effect of the geometry of the cell.
The distortion can then be written as
D_MSE = Σ_{x̂} p(x̂) vol(x̂)^{2/K} C(x̂).        (59)
Gersho conjectured that the optimal quantizer has the same cell geometry for all quantization cells, and has a constant coefficient of quantization
C_opt. [45]. Using the optimal cell geometry, C(x̂) ≈ C_opt., the distortion is

D_MSE ≈ C_opt. Σ_{x̂} p(x̂) vol(x̂)^{2/K} ≈ C_opt. ∫ f(x) g_c(x)^{−2/K} dx.        (60)
Resolution-constrained vector quantizer
The optimal resolution-constrained vector quantizer (RCVQ) has the quantization point density function gc (·) that minimizes the average distortion
D, under a rate constraint that the number of code vectors equals N . The
constraint may be formulated through integration of gc (x),
∫ g_c(x) dx = N.        (61)
The optimal gc (·) minimizes the extended criterion
η_rcvq = C_opt. ∫ f(x) g_c(x)^{−2/K} dx + λ (∫ g_c(x) dx − N),        (62)
where the first term is the high-rate distortion (60) and λ is the Lagrange
multiplier for the rate constraint. Solving the Euler-Lagrange equation, and
combining with (61), the optimal gc (x) for constrained-resolution vector
quantization with respect to the MSE distortion measure is given by
g_c(x) = N f(x)^{K/(K+2)} / ∫ f(x)^{K/(K+2)} dx.        (63)
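As a numerical illustration of (63) for a scalar Gaussian source (K = 1, so the exponent K/(K+2) equals 1/3; the grid limits and N below are arbitrary):

```python
import numpy as np

K, N = 1, 64
xs = np.linspace(-8.0, 8.0, 4001)
dx = xs[1] - xs[0]
f = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)          # N(0, 1) density
gc = N * f**(K / (K + 2)) / (np.sum(f**(K / (K + 2))) * dx)   # point density (63)
print(np.sum(gc) * dx)    # integrates to N: the density allocates N code points
```

The density f^{1/3} is flatter than f itself: the optimal resolution-constrained quantizer places points densest near the mean, but less concentrated there than the source density.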
Entropy-constrained vector quantizer
The optimal entropy-constrained vector quantizer (ECVQ) has a quantization point density function gc (·) that minimizes the average distortion D,
under the constraint that the entropy of the code vectors equals a desired
target rate Rt . The entropy of the quantized variable at a high rate can be
formulated as
H ≈ − Σ_{x̂} f(x̂) vol(x̂) log₂(f(x̂) vol(x̂))
  ≈ − ∫ f(x) log₂(f(x)/g_c(x)) dx
  = h + ∫ f(x) log₂ g_c(x) dx,        (64)
where h = −∫ f(x) log₂ f(x) dx is the differential entropy of the source.
Therefore, the constraint can be written as ∫ f(x) log₂ g_c(x) dx = R'_t, where
R'_t = R_t − h. The optimal g_c(·) minimizes the extended criterion
η_ecvq = C_opt. ∫ f(x) g_c(x)^{−2/K} dx + λ (∫ f(x) log₂ g_c(x) dx − R'_t),        (65)
where λ is the Lagrange multiplier for the entropy constraint.
The optimal gc (x) can be solved similarly to the RCVQ case, and the
resulting gc (x) is a constant [45],
g_c(x) = g_c = 2^{R_t − h}.        (66)
This important result implies that, at high rate, uniformly distributed quantization points form an optimal quantizer if the point indices are subsequently coded using an entropy code.
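This result can be checked numerically for a scalar Gaussian source: a uniform quantizer with step size Δ corresponds to g_c = 1/Δ, so the measured index entropy should approach h + log₂ g_c at high rate (the sample size and step sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200000)
h = 0.5 * np.log2(2 * np.pi * np.e)            # differential entropy of N(0, 1)
for step in (0.5, 0.25):
    idx = np.round(x / step).astype(int)       # uniform (lattice) quantizer indices
    _, counts = np.unique(idx, return_counts=True)
    p = counts / len(x)
    H = -np.sum(p * np.log2(p))                # measured entropy (50)
    print(step, H, h - np.log2(step))          # measured vs. high-rate prediction
```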
3.3 High-rate quantizer design
High-rate theory provides a theoretical foundation and guidelines for designing practical quantizers. Quantizer designs motivated by high-rate theory
include lattice quantization [9, 10, 25, 27, 28, 41, 66, 127, 134, 145] and GMM
based quantization [52, 58, 106, 112–115, 127–129, 148, Paper C].
Lattice quantization
A lattice Λ is an infinite set of regularly arranged vectors in the Euclidean
space R^K [28]. It can be defined through a generating matrix G, assumed
to be a square matrix,

Λ(G) = {G^T u : u ∈ Z^K},        (67)
Figure 7: The 2-dimensional hexagonal lattice.
where G is a matrix with linearly independent row vectors that form a set
of basis vectors in R^K. The lattice points generated through G are integer
linear combinations of the row vectors of G.
Applied to quantization, each lattice point λ ∈ Λ is associated with
a Voronoi region containing all points in R^K closest to λ according to the
Euclidean distance measure. The Voronoi region corresponds to the decision
region of a quantizer with the MSE distortion measure. Hence, a closest
point search algorithm can be used for quantization. Figure 7 shows a two-dimensional lattice, the hexagonal lattice (defined by G = [1, 0; 0.5, √3/2]), together with the Voronoi cells.
The high-rate theory for entropy-constrained vector quantization forms
the theoretical basis for using uniformly distributed code vectors. Such
vectors may be generated using a lattice, and the quantizer is then a lattice
quantizer. The design of a coder based on an entropy-constrained lattice
quantizer consists of the following steps: 1) selection of a suitable lattice for
quantization, 2) design of a closest point search algorithm for the selected
lattice, and 3) indexing and assigning variable-length codewords to the code
vectors.
The selection of a good lattice for quantization is considered in [9] for
the MSE distortion. Since the MSE performance of a lattice quantizer is
proportional to the coefficient of quantization associated with the Voronoi
cell shape of the lattice, lattices are optimized for quantization by
minimizing the coefficient of quantization. A study
and tutorial on the “best known” lattices for quantization is given in [9].
A numerical optimization method for obtaining a locally optimal lattice for
quantization in an arbitrary dimension is proposed in [9].
Closest point search algorithms for lattices are considered in [10, 27].
The algorithms proposed in [27] are based on the geometrical structures
of the lattices, and are therefore lattice-specific. A general search algorithm that applies to an arbitrary lattice has been proposed in [10]. Both
methods are considerably faster than a full search over all code vectors in
a traditional codebook, by utilizing the geometrically regular arrangement
of code vectors. As the search complexity and memory requirement using a
traditional codebook are exponentially increasing with respect to the vector
dimensionality and rate, lattice quantization is an attractive approach for
high-dimensional vector quantization.
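A closest-point search can be sketched for the two-dimensional case: round Babai's estimate and refine over a small neighborhood of integer offsets (sufficient in practice for a well-conditioned 2-D basis such as the hexagonal lattice; this is an illustration, not the algorithms of [10, 27]):

```python
import numpy as np

def closest_lattice_point(y, G, radius=1):
    """Closest-point search for a 2-D lattice Λ(G): round Babai's estimate,
    then refine by enumerating nearby integer offsets."""
    u0 = np.round(np.linalg.solve(G.T, y))          # Babai rounding estimate
    best, best_d = None, np.inf
    for da in range(-radius, radius + 1):
        for db in range(-radius, radius + 1):
            u = u0 + np.array([da, db])
            p = G.T @ u                              # candidate lattice point
            dist = np.sum((y - p)**2)
            if dist < best_d:
                best, best_d = p, dist
    return best

G = np.array([[1.0, 0.0], [0.5, np.sqrt(3) / 2]])    # hexagonal lattice of Figure 7
print(closest_lattice_point(np.array([0.9, 0.1]), G))
```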
The practical application of entropy-constrained lattice quantization has
been limited by the lack of a practical algorithm for indexing and assigning
variable-length codewords that works at high rates and dimensions. In [25,
26, 66], lattice truncation and indexing of the lattice vectors are considered.
However, designing a practical entropy code is still limited to low dimensions
and rates, due to the exponentially increasing number of code vectors in the
truncated lattice. In Paper C, we propose a practical indexing and entropy
coding algorithm for a lattice quantization of a Gaussian variable as part of
a GMM based entropy-constrained vector quantizer design.
GMM-based quantization
High-rate theory assumes that the PDF of the source is known. However,
for most practical sources, the PDF is unknown and needs to be estimated
from the data. As discussed in section 2.3, GMMs form a convenient tool
for modeling real-world source vectors with an unknown PDF [107]. They
have been successfully applied to quantization, e.g., [52,58,114,127,128,131,
Paper C].
GMMs have been applied to approximate the high-rate optimal quantization point density for resolution-constrained quantization [58, 114]. A
codebook can be generated using a pseudo random vector generator, once
the optimal quantization point density is obtained. Thus, this method does
not require the storage of multiple codebooks for different rate requirements.
However, the method shares the same search complexity as a traditional
codebook based quantizer. The application is therefore limited to low rates
and dimensions. Nevertheless, the method is useful for predicting the performance of an optimal quantizer [58, 114].
Another GMM quantizer design is based on classification and component quantizers [127–129]. Such a quantizer is based on quantization of the
source vector using only one of the mixture component quantizers. The index of the quantizing mixture component is transmitted as side information.
The scheme is conceptually simple, since the problem is reduced to quantizer design for a Gaussian component. Similar ideas based on classified
quantization appear in image coding [131] and speech coding [68–70].
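The classified-quantization idea can be sketched as follows (a simplified illustration, not the exact schemes of [127–129] or Paper C): select the component by maximum posterior, then quantize in that component's whitened KLT domain with a uniform lattice.

```python
import numpy as np

def classified_quantize(x, weights, means, covs, step=0.5):
    """Classified GMM quantization sketch: select the mixture component with
    the largest posterior, then quantize in that component's whitened KLT
    domain with a uniform (Z^K) lattice of the given step size. The chosen
    component index m is transmitted as side information."""
    post = []
    for w, mu, C in zip(weights, means, covs):
        diff = x - mu
        quad = diff @ np.linalg.solve(C, diff)
        post.append(w * np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * C)))
    m = int(np.argmax(post))
    lam, V = np.linalg.eigh(covs[m])                 # KLT of the chosen component
    z = V.T @ (x - means[m]) / np.sqrt(lam)          # decorrelated, whitened vector
    zq = np.round(z / step) * step                   # uniform lattice quantization
    xq = means[m] + V @ (np.sqrt(lam) * zq)          # reconstruction
    return m, xq
```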
The methods of [127–129] are resolution-constrained vector quantizers
optimized for a fixed target rate. The overall bit rate budget is distributed
among the mixture components and for transmitting the component index.
The component quantizers are based on the so-called companding technique. The companding technique attempts to approximate the optimal
high-rate codebook of a Gaussian variable. It consists of an invertible mapping function that maps the input variable into a domain where a lattice
quantizer is applied. The forward transform is referred to as the compressor and the inverse transform as expander. Applying the expander function
to the lattice structured codebook would result in a codebook that approximates the optimal codebook in the original domain. Unfortunately,
the companding technique is suboptimal for vector dimensions higher than
two [20, 21, 45, 95, 119]. The sub-optimality is due to the fact that the
expander function scales the quantization cells differently in different dimensions and in different locations in space. Hence, the quantization cells
lose their optimal cell shape in the original domain. The great advantage
of the schemes proposed in [127–129] is the low computational complexity.
Also, codebook storage is not required by utilizing a lattice quantizer in the
transformed domain.
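The companding idea can be sketched for a scalar Gaussian variable; here the compressor is, for illustration, the Gaussian CDF (the MSE-optimal scalar compressor would instead be based on the integral of f^{1/3}):

```python
import math

def Phi(x):                                       # compressor: N(0, 1) CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, lo=-10.0, hi=10.0):                # expander, via bisection
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def compand_quantize(x, levels=16):
    u = Phi(x)                                    # map into the compressed domain
    uq = (math.floor(u * levels) + 0.5) / levels  # uniform quantizer in [0, 1]
    return Phi_inv(uq)                            # expand back to the source domain

print(compand_quantize(0.3))
```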
Applying classified quantization to entropy constrained quantization
has been discussed in [52, 71]. Motivated by high-rate theory, the component quantizers of an entropy constrained quantizer are based on lattice
structured codebooks. The focus of [52, 71] is on the high-rate analysis of
such a quantizer. In Paper C, we propose a practical GMM-based entropy
constrained quantizer using the classified quantization framework. In particular, a practical solution to indexing and entropy coding for the lattice
quantizers of the Gaussian components is proposed. We show that, under
certain conditions, the scheme approaches the theoretically optimal performance with increasing rate.
4 Summary of contributions
The focus of this thesis is on statistical model based techniques for speech
enhancement and coding. The main contributions of the thesis can be summarized as
• improved speech and noise modeling and estimation techniques applied to speech enhancement, and
• improvements towards flexible source coding.
The thesis consists of three research articles. Short summaries of the articles
are presented below.
Paper A: HMM-based Gain-Modeling for Enhancement of Speech
in Noise
In Paper A, we propose a hidden Markov model (HMM) based speech
enhancement method using stochastic-gain HMMs (SG-HMMs) for both
speech and noise. We demonstrate that accurate modeling and estimation of
speech and noise gains facilitate good performance in speech enhancement.
Through the introduction of stochastic gain variables, energy variation in
both speech and noise is explicitly modeled in a unified framework. The
speech gain models the energy variations of the speech phones, typically due
to differences in pronunciation and/or different vocalizations of individual
speakers. The noise gain helps to improve the tracking of the time-varying
energy of non-stationary noise. The expectation-maximization (EM) algorithm is used to perform off-line estimation of the time-invariant model
parameters. The time-varying model parameters are estimated on-line using the recursive EM algorithm. The proposed gain modeling techniques are
applied to a novel Bayesian speech estimator, and the performance of the
proposed enhancement method is evaluated through objective and subjective tests. The experimental results confirm the advantage of explicit gain
modeling, particularly for non-stationary noise sources.
Paper B: On-line Noise Estimation Using Stochastic-Gain HMM
for Speech Enhancement
In Paper B, we extend the work in Paper A and introduce an on-line noise
estimation algorithm. A stochastic-gain HMM is used to model the statistics of non-stationary noise with time-varying energy. The noise model is
adaptive and the model parameters are estimated on-line from noisy observations. The parameter estimation is derived for the maximum likelihood
criterion and the algorithm is based on the recursive expectation maximization (EM) framework. The proposed method facilitates continuous
adaptation to changes of both noise spectral shapes and noise energy levels, e.g., due to movement of the noise source. Using the estimated noise
model, we also develop an estimator of the noise power spectral density
(PSD) based on recursive averaging of estimated noise sample spectra. We
demonstrate that the proposed scheme achieves more accurate estimates of
the noise model and the noise PSD. When integrated into a speech enhancement
system, the proposed scheme facilitates a lower level of residual noise in the
enhanced speech.
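The recursive-averaging idea behind the noise PSD estimator can be sketched as follows. In the paper, the decision of which frames contribute to the average comes from the stochastic-gain HMM; here it is abstracted into boolean speech-presence flags passed in by the caller, so the interface is illustrative rather than the paper's.

```python
import numpy as np

def recursive_noise_psd(noisy_frames, speech_presence, alpha=0.9):
    """Estimate the noise PSD by recursive averaging of periodograms.

    A minimal sketch: the PSD estimate is initialized from the first
    frame and thereafter updated only in frames flagged as
    noise-dominated (speech_presence == False).  In Paper B these
    decisions come from the SG-HMM; this interface is an assumption.
    """
    psd = None
    for frame, speech in zip(noisy_frames, speech_presence):
        # Single-frame periodogram (one-sided, real input).
        periodogram = np.abs(np.fft.rfft(frame)) ** 2 / len(frame)
        if psd is None:
            psd = periodogram.copy()
        elif not speech:
            # Exponential smoothing toward the new noise observation.
            psd = alpha * psd + (1 - alpha) * periodogram
    return psd
```

The smoothing constant `alpha` controls the trade-off between estimator variance and tracking speed for non-stationary noise.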
Paper C: Entropy-Constrained Vector Quantization Using Gaussian Mixture Models
In Paper C, a flexible entropy-constrained vector quantization scheme based
on Gaussian mixture models (GMMs), lattice quantization, and arithmetic
coding is presented. A source vector is assumed to have a probability density function described by a GMM. The source vector is first assigned to one of the mixture components, and the Karhunen-Loève transform of the selected mixture component is applied to the vector, followed by quantization using a lattice-structured codebook. Finally, the scalar elements
of the quantized vector are entropy coded sequentially using a specially
designed arithmetic code. The proposed scheme has a computational complexity that is independent of rate, and quadratic with respect to vector
dimension. Hence, the scheme facilitates quantization of high dimensional
source vectors. The flexible design allows for changing of the average rate
on-the-fly. The theoretical performance of the scheme is analyzed under a
high-rate assumption. We show that, at high rate, the scheme approaches
the theoretically optimal performance if the mixture components are located far apart. The performance loss when the mixture components lie close to each other can be quantified for a given GMM. The practical performance of the scheme was evaluated through simulations on both synthetic and speech-derived source vectors, and competitive performance was demonstrated.
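The first three stages of the pipeline described above (component classification, Karhunen-Loève transform, lattice quantization) can be sketched as follows. This is a minimal illustration using a cubic Z^d lattice; Paper C uses more general lattices and adds the arithmetic coding of the indices, which is omitted here.

```python
import numpy as np

def gmm_klt_lattice_quantize(x, means, covs, step=0.25):
    """Classify x to a GMM component, apply its KLT, quantize on Z^d.

    A sketch of the GMM + KLT + lattice stages of the ECVQ scheme.
    The cubic lattice and the step size are simplifying assumptions;
    the entropy-coding stage of the paper is not shown.
    """
    # 1. Classify: pick the component with the highest log-likelihood
    #    (constant terms omitted, they are equal across components).
    best, best_ll = 0, -np.inf
    for m, (mu, C) in enumerate(zip(means, covs)):
        d = x - mu
        ll = -0.5 * (np.linalg.slogdet(C)[1] + d @ np.linalg.solve(C, d))
        if ll > best_ll:
            best, best_ll = m, ll
    mu, C = means[best], covs[best]
    # 2. KLT: decorrelate with the eigenvectors of the selected
    #    component's covariance matrix.
    _, V = np.linalg.eigh(C)
    y = V.T @ (x - mu)
    # 3. Lattice quantization: round the transform coefficients to the
    #    scaled cubic lattice; the integer indices would be entropy
    #    coded in the full scheme.
    indices = np.round(y / step).astype(int)
    x_hat = V @ (indices * step) + mu
    return best, indices, x_hat
```

Because the KLT is orthogonal, the reconstruction error equals the lattice quantization error in the transform domain, i.e., at most `step/2` per coefficient.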