Thesis for the degree of Doctor of Philosophy

Model Based Speech Enhancement and Coding

David Yuheng Zhao 赵羽珩

Sound and Image Processing Laboratory
School of Electrical Engineering
KTH (Royal Institute of Technology)
Stockholm 2007

Zhao, David Yuheng
Model Based Speech Enhancement and Coding
© Copyright 2007 David Y. Zhao, except where otherwise stated. All rights reserved.
ISBN 978-91-628-7157-4
TRITA-EE 2007:018
ISSN 1653-5146

Sound and Image Processing Laboratory
School of Electrical Engineering
KTH (Royal Institute of Technology)
SE-100 44 Stockholm, Sweden
Telephone: +46 (0)8-790 7790

Abstract

In mobile speech communication, adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness of the received speech and increase the listening effort. This thesis focuses on countermeasures based on statistical signal processing techniques. The main body of the thesis consists of three research articles, targeting two specific problems: speech enhancement for noise reduction and flexible source coder design for unreliable networks.

Papers A and B consider speech enhancement for noise reduction. New schemes based on an extension to the auto-regressive (AR) hidden Markov model (HMM) for speech and noise are proposed. Stochastic models for the speech and noise gains (the excitation variance of an AR model) are integrated into the HMM framework in order to improve the modeling of energy variation. The extended model is referred to as a stochastic-gain hidden Markov model (SG-HMM). The speech gain describes the energy variations of the speech phones, typically due to differences in pronunciation and/or different vocalizations of individual speakers. The noise gain improves the tracking of the time-varying energy of non-stationary noise, e.g., due to movement of the noise source.
In Paper A, it is assumed that prior knowledge of the noise environment is available, so that a pre-trained noise model can be used. In Paper B, the noise model is adaptive and its parameters are estimated on-line from the noisy observations using a recursive estimation algorithm. Based on the speech and noise models, a novel Bayesian estimator of the clean speech is developed in Paper A, and an estimator of the noise power spectral density (PSD) in Paper B. It is demonstrated that the proposed schemes achieve more accurate models of speech and noise than traditional techniques, and, as part of a speech enhancement system, provide improved speech quality, particularly for non-stationary noise sources.

In Paper C, a flexible entropy-constrained vector quantization scheme based on a Gaussian mixture model (GMM), lattice quantization, and arithmetic coding is proposed. The method allows the average rate to be changed in real-time, and facilitates adaptation to the currently available bandwidth of the network. A practical solution to the classical issue of indexing and entropy-coding the quantized code vectors is given. The proposed scheme has a computational complexity that is independent of rate, and quadratic with respect to vector dimension. Hence, the scheme can be applied to the quantization of source vectors in a high-dimensional space. The theoretical performance of the scheme is analyzed under a high-rate assumption. It is shown that, at high rate, the scheme approaches the theoretically optimal performance if the mixture components are located far apart. The practical performance of the scheme is confirmed through simulations on both synthetic and speech-derived source vectors.

Keywords: statistical model, Gaussian mixture model (GMM), hidden Markov model (HMM), noise reduction, speech enhancement, vector quantization.

List of Papers

The thesis is based on the following papers:

[A] D. Y. Zhao and W. B.
Kleijn, “HMM-based gain-modeling for enhancement of speech in noise,” IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 3, pp. 882–892, Mar. 2007.

[B] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, “On-line noise estimation using stochastic-gain HMM for speech enhancement,” IEEE Trans. Audio, Speech and Language Processing, submitted, Apr. 2007.

[C] D. Y. Zhao, J. Samuelsson, and M. Nilsson, “Entropy-constrained vector quantization using Gaussian mixture models,” IEEE Trans. Communications, submitted, Apr. 2007.

In addition to Papers A–C, the following papers have also been produced in part by the author of the thesis:

[1] D. Y. Zhao, J. Samuelsson, and M. Nilsson, “GMM-based entropy-constrained vector quantization,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Apr. 2007, pp. 1097–1100.

[2] V. Grancharov, D. Y. Zhao, J. Lindblom, and W. B. Kleijn, “Low complexity, non-intrusive speech quality assessment,” IEEE Trans. Audio, Speech and Language Processing, vol. 14, pp. 1948–1956, Nov. 2006.

[3] ——, “Non-intrusive speech quality assessment with low computational complexity,” in Proc. Interspeech - ICSLP, Sep. 2006.

[4] D. Y. Zhao and W. B. Kleijn, “HMM-based speech enhancement using explicit gain modeling,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, May 2006, pp. 161–164.

[5] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, “Method and apparatus for improved estimation of non-stationary noise for speech enhancement,” 2005, provisional patent application filed by GN ReSound.

[6] D. Y. Zhao and W. B. Kleijn, “On noise gain estimation for HMM-based speech enhancement,” in Proc. Interspeech, Sep. 2005, pp. 2113–2116.

[7] ——, “Multiple-description vector quantization using translated lattices with local optimization,” in Proc. IEEE Global Telecommunications Conference, vol. 1, 2004, pp. 41–45.

[8] J. Plasberg, D. Y. Zhao, and W. B.
Kleijn, “The sensitivity matrix for a spectro-temporal auditory model,” in Proc. 12th European Signal Processing Conf. (EUSIPCO), 2004, pp. 1673–1676.

Acknowledgements

Approaching the end of my Ph.D. study, I would like to thank my supervisor, Prof. Bastiaan Kleijn, for introducing me to the world of academic research. Your creativity, dedication, professionalism, and hard-working nature have greatly inspired and influenced me. I also appreciate your patience with me, and that you spent many hours of your valuable time correcting my English mistakes.

I am grateful to all my current and past colleagues at the Sound and Image Processing lab. Unfortunately, the list of names is too long to be given here. Thank you for creating a pleasant and stimulating work environment. I particularly enjoyed the cultural diversity and the feeling of an international “family” atmosphere.

I am thankful to my co-authors Dr. Jonas Samuelsson, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Jan Plasberg, Dr. Jonas Lindblom, Dr. Alexander Ypma, and Dr. Bert de Vries. Thank you for our fruitful collaborations. Furthermore, I would like to express my gratitude to Prof. Bastiaan Kleijn, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Tiago Falk, and Anders Ekman for proofreading the introduction of the thesis.

During my Ph.D. study, I was involved in projects financially supported by GN ReSound and the Foundation for Strategic Research. I would like to thank Dr. Alexander Ypma, Dr. Bert de Vries, Prof. Mikael Skoglund, and Prof. Gunnar Karlsson for your encouragement and friendly discussions during the projects.

Thanks to my Chinese fellow students and friends at KTH, particularly Xi, Jing, Lei, and Jinfeng. I feel very lucky to have gotten to know you. The fun we have had together is unforgettable.

I would also like to express my deep gratitude to my family: my wife Lili, for your encouragement, understanding and patience; my beloved kids, Andrew and the one yet to be named, for the joy and hope you bring.
I thank my parents, for your sacrifice and your endless support; my brother, for all the adventure and joy we have together; my mother-in-law, for taking care of Andrew. Last but not least, I am grateful to our host family, Harley and Eva, for your kind hearts and unconditional friendship.

David Zhao, Stockholm, May 2007

Contents

Abstract
List of Papers
Acknowledgements
Contents
Acronyms

I Introduction
  1 Statistical models
    1.1 Short-term models
    1.2 Long-term models
    1.3 Model-parameter estimation
  2 Speech enhancement for noise reduction
    2.1 Overview
    2.2 Short-term model based approach
    2.3 Long-term model based approach
  3 Flexible source coding
    3.1 Source coding
    3.2 High-rate theory
    3.3 High-rate quantizer design
  4 Summary of contributions
  References

II Included papers

A HMM-based Gain-Modeling for Enhancement of Speech in Noise
  1 Introduction
  2 Signal model
    2.1 Speech model
    2.2 Noise model
    2.3 Noisy signal model
  3 Speech estimation
  4 Off-line parameter estimation
  5 On-line parameter estimation
  6 Experiments and results
    6.1 System implementation
    6.2 Experimental setup
    6.3 Evaluation of the modeling accuracy
    6.4 Objective evaluation of MMSE waveform estimators
    6.5 Objective evaluation of sample spectrum estimators
    6.6 Perceptual quality evaluation
  7 Conclusions
  Appendix I. EM based solution to (12)
  Appendix II. Approximation of f_s̄(ḡ′_n | x_n)
  References

B On-line Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement
  1 Introduction
  2 Noise model parameter estimation
    2.1 Signal model
    2.2 On-line parameter estimation
    2.3 Safety-net state strategy
    2.4 Discussion
  3 Noise power spectrum estimation
  4 Experiments and results
    4.1 System implementation
    4.2 Experimental setup
    4.3 Evaluation of noise model accuracy
    4.4 Evaluation of noise PSD estimate
    4.5 Evaluation of speech enhancement performance
    4.6 Evaluation of safety-net state strategy
  5 Conclusions
  Appendix I. Derivation of (13-14)
  References

C Entropy-Constrained Vector Quantization Using Gaussian Mixture Models
  1 Introduction
  2 Preliminaries
    2.1 Problem formulation
    2.2 High-rate ECVQ
    2.3 Gaussian mixture modeling
  3 Quantizer design
  4 Theoretical analysis
    4.1 Distortion analysis
    4.2 Rate analysis
    4.3 Performance bounds
    4.4 Rate adjustment
  5 Implementation
    5.1 Arithmetic coding
    5.2 Encoding for the Z-lattice
    5.3 Decoding for the Z-lattice
    5.4 Encoding for arbitrary lattices
    5.5 Decoding for arbitrary lattices
    5.6 Lattice truncation
  6 The algorithm
  7 Complexity
  8 Experiments and results
    8.1 Artificial source
    8.2 Speech derived source
  9 Summary
  References

Acronyms

3GPP    The Third Generation Partnership Project
AMR     Adaptive Multi-Rate
AR      Auto-Regressive
AR-HMM  Auto-Regressive Hidden Markov Model
CCR     Comparison Category Rating
CDF     Cumulative Distribution Function
DFT     Discrete Fourier Transform
ECVQ    Entropy-Constrained Vector Quantizer
EM      Expectation Maximization
EVRC    Enhanced Variable Rate Codec
FFT     Fast Fourier Transform
GMM     Gaussian Mixture Model
HMM     Hidden Markov Model
IEEE    Institute of Electrical and Electronics Engineers
I.I.D.  Independent and Identically-Distributed
IP      Internet Protocol
KLT     Karhunen-Loève Transform
LL      Log-Likelihood
LP      Linear Prediction
LPC     Linear Prediction Coefficient
LSD     Log-Spectral Distortion
LSF     Line Spectral Frequency
MAP     Maximum A-Posteriori
ML      Maximum Likelihood
MMSE    Minimum Mean Square Error
MOS     Mean Opinion Score
MSE     Mean Square Error
MVU     Minimum Variance Unbiased
PCM     Pulse-Code Modulation
PDF     Probability Density Function
PESQ    Perceptual Evaluation of Speech Quality
PMF     Probability Mass Function
PSD     Power Spectral Density
RCVQ    Resolution-Constrained Vector Quantizer
REM     Recursive Expectation Maximization
SG-HMM  Stochastic-Gain Hidden Markov Model
SMV     Selectable Mode Vocoder
SNR     Signal-to-Noise Ratio
SSNR    Segmental Signal-to-Noise Ratio
SS      Spectral Subtraction
STSA    Short-Time Spectral Amplitude
VAD     Voice Activity Detector
VoIP    Voice over IP
VQ      Vector Quantizer
WF      Wiener Filtering
WGN     White Gaussian Noise

Part I

Introduction

The advance of modern information technology is revolutionizing the way we communicate. Global deployment of mobile communication systems has enabled speech communication from and to nearly every corner of the world. Speech communication systems based on the Internet protocol (IP), the so-called voice over IP (VoIP), are rapidly emerging.
The new technologies allow for communication over the wide range of transport media and protocols that is associated with the increasingly heterogeneous communication network environment. In such a complex communication infrastructure, adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness of the perceived speech and increase the listening effort required for speech communication. The demand for better quality in such situations has given rise to new problems and challenges. In this thesis, two specific problems are considered: enhancement of speech in noisy environments and source coder design for unreliable network conditions.

[Figure 1: A one-direction example of speech communication in a noisy environment. (Block diagram: the speaker's signal plus noise passes through enhancement, source encoding, and channel encoding into the network; at the listener's side it is channel decoded and source decoded, with feedback from the network.)]

Figure 1 illustrates an adverse speech communication scenario that is typical of mobile communications. The speech signal is recorded in an environment with a high level of ambient noise, such as on a street, in a restaurant, or on a train. In such a scenario, the recorded speech signal is contaminated by additive noise and has degraded quality and intelligibility compared to the clean speech. Speech quality is a subjective measure of how pleasant the signal sounds to listeners. It can be related to the amount of effort required by listeners to understand the speech. Speech intelligibility can be measured objectively, based on the percentage of sounds that are correctly understood by listeners. Naturally, both speech quality and intelligibility are important factors in the subjective rating of a speech signal by a listener. The level of background noise may be suppressed by a speech enhancement system.
However, suppression of the background noise may lead to reduced speech intelligibility, due to the speech distortion associated with excessive noise suppression. Therefore, most practical speech enhancement systems for noise reduction are designed to improve the speech quality while preserving the speech intelligibility [6, 60, 61]. Besides speech communication, speech enhancement for noise reduction may be applied to a wide range of other speech processing applications in adverse situations, including speech recognition, hearing aids, and speaker identification.

Another limiting factor for the quality of speech communication is the network condition. This is a particularly relevant issue for wireless communication, as the network capacity may vary depending on interference from the physical environment. The same issue occurs in heterogeneous networks with a large variety of communication links. Traditional source coder design is often optimized for a particular network scenario, and may not perform well in another network. To allow communication with the best possible quality in a new network, it is necessary to adapt the coding strategy, e.g., the coding rate, to the network condition.

This thesis concerns statistical signal processing techniques that facilitate better solutions to the aforementioned problems. Formulated in a statistical framework, speech enhancement is an estimation problem. Source coding can be seen as a constrained optimization problem that optimizes the coder under certain rate constraints. Statistical modeling of speech (and noise) is a key element in solving both problems. The techniques proposed in this thesis are based on a hidden Markov model (HMM) or a Gaussian mixture model (GMM). Both are generic models capable of modeling complex data sources such as speech. In Papers A and B, we propose an extension to an HMM that allows for more accurate modeling of speech and noise.
We show that the improved models also lead to improved speech enhancement performance. In Paper C, we propose a flexible source coder design for entropy-constrained vector quantization. The proposed coder design is based on Gaussian mixture modeling of the source.

The introduction is organized as follows. In section 1, an overview of statistical modeling for speech is given. We discuss short-term models for modeling the local statistics of a speech frame, and long-term models (such as an HMM or a GMM) that describe the long-term statistics spanning multiple frames. Parameter estimation using the expectation-maximization (EM) algorithm is discussed. In section 2, we introduce speech enhancement techniques for noise reduction, with a focus on statistical model-based methods. Different approaches using short-term and long-term models are reviewed and discussed. Finally, in section 3, an overview of theory and practice for flexible source coder design is given. We focus on high-rate theory and coders derived from it.

1 Statistical models

[Figure 2: Histogram of time-domain speech waveform samples. Generated using 100 utterances from the TIMIT database, downsampled to 8 kHz, normalized between −1 and 1, with 2048 bins.]

Statistical modeling of speech is a relevant topic for nearly all speech processing applications, including coding, enhancement, and recognition. A statistical model can be seen as a parameterized set of probability density functions (PDFs). Once the parameters are determined, the model provides a PDF of the speech signal that describes the stochastic behavior of the signal. A PDF is useful for designing optimal quantizers [64, 91, 128], estimators [35, 37–39, 90], and recognizers [11, 105, 108] in a statistical framework. One can make statistical models of speech in different representations.
The simplest one applies directly to the digital speech samples in the waveform domain (analog speech is outside the scope of this thesis). For instance, a histogram of speech waveform samples, as shown in Figure 2, provides the long-term averaged distribution of the amplitudes of speech samples. A similar figure is shown in [135]. The distribution is useful for designing an entropy code for a pulse-code modulation (PCM) system. However, in such a model, statistical dependencies among the speech samples are not explicitly modeled.

It is more common to model the speech signal as a stochastic process that is locally stationary within 20-30 ms frames. The speech signal is then processed on a frame-by-frame basis. Statistical models in speech processing can be roughly divided into two classes, short-term and long-term signal models, depending on their scope of operation. A short-term model describes the statistics of the vector for a particular frame. As the statistics change over the frames, the parameters of a short-term model must be determined for each frame. On the other hand, a model that describes the statistics over multiple signal frames is referred to as a long-term model.

1.1 Short-term models

It is essential to model the sample dependencies within a speech frame of 20-30 ms. This short-term dependency determines the formants of the speech frame, which are essential for recognizing the associated phoneme. Modeling of this dependency is therefore of interest to many speech applications.

Auto-regressive model

An efficient tool for modeling the short-term dependencies in the time domain is the auto-regressive (AR) model. Let x[i] denote the discrete-time speech signal; the AR model of x[i] is defined as

    x[i] = -\sum_{j=1}^{p} \alpha[j]\, x[i-j] + e[i],    (1)

where α[j] are the AR model parameters, also called the linear prediction coefficients, p is the model order, and e[i] is the excitation signal, modeled as white Gaussian noise (WGN).
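As an illustrative sketch (not from the thesis; the AR(2) coefficients and excitation variance below are hypothetical), the synthesis recursion (1) and the corresponding AR power spectrum σ²/|A(e^{jω})|² can be simulated directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stable AR(2) coefficients alpha[1], alpha[2] and excitation variance
alpha = np.array([-1.5, 0.8])
sigma2 = 0.1

# Excitation e[i]: white Gaussian noise with variance sigma2
e = rng.normal(scale=np.sqrt(sigma2), size=4000)

# Direct recursion of (1): x[i] = -sum_j alpha[j] x[i-j] + e[i]
x = np.zeros_like(e)
for i in range(len(e)):
    for j in range(1, len(alpha) + 1):
        if i - j >= 0:
            x[i] -= alpha[j - 1] * x[i - j]
    x[i] += e[i]

# AR model power spectrum sigma^2 / |A(e^{jw})|^2 on a frequency grid
w = np.linspace(0.0, np.pi, 256)
A_w = 1.0 + sum(alpha[j] * np.exp(-1j * (j + 1) * w) for j in range(len(alpha)))
psd = sigma2 / np.abs(A_w) ** 2
```

Equivalently, the recursion is an all-pole filter 1/A(z) driven by WGN, which is the filtering interpretation used below.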
An AR process is then modeled as the output of filtering WGN through an all-pole AR model filter. For a signal frame of K samples, the vector x = [x[1], . . . , x[K]] can be seen as a linear transformation of a WGN vector e = [e[1], . . . , e[K]], by considering the filtering process as a convolution formulated as a matrix transform. Consequently, the PDF of x can be modeled as a multi-variate Gaussian PDF, with zero mean and a covariance matrix parameterized by the AR model parameters. The AR Gaussian density function is defined as

    f(x) = \frac{1}{(2\pi\sigma^2)^{K/2}} \exp\!\left( -\frac{1}{2} x^T D^{-1} x \right),    (2)

with the covariance matrix D = σ²(AᴴA)⁻¹, where A is a K × K lower triangular Toeplitz matrix with the first p + 1 elements of the first column consisting of the AR coefficients [α[0], α[1], α[2], . . . , α[p]]ᵀ, α[0] = 1, and the remaining elements equal to zero; ᴴ denotes the Hermitian transpose, and σ² denotes the excitation variance. Due to the structure of A, the covariance matrix D has determinant σ^{2K}.

Applying the AR model to speech processing is often motivated and supported by the physical properties of speech production [43]. A commonly used model of speech production is the source-filter model shown in Figure 3. The excitation signal, modeled using a pulse train for a voiced sound or Gaussian noise for an unvoiced sound, is filtered through a time-varying all-pole filter that models the vocal tract. Due to the absence of a pitch model, the AR Gaussian model is more accurate for unvoiced sounds.

[Figure 3: Schematic diagram of a source-filter speech production model. (A pulse-train generator, controlled by the pitch period, and a noise generator are each scaled by a gain and drive a time-varying filter that produces the speech signal.)]

Frequency-domain models

The spectral properties of a speech signal can be analyzed in the frequency domain through the short-time discrete Fourier transform (DFT).
On a frame-by-frame basis, each speech segment is transformed into the frequency domain by applying a sliding window to the samples and a DFT to the windowed signal. A commonly used model is the complex Gaussian model, motivated by the asymptotic properties of the DFT coefficients. Assuming that the DFT length approaches infinity and that the span of correlation of the signal frame is short compared to the DFT length, the DFT coefficients may be considered as a weighted sum of weakly dependent random samples [51, 94, 99, 100]. Using the central limit theorem for weakly dependent variables [12, 130], the DFT coefficients across frequency can be considered as independent zero-mean Gaussian random variables. For k ∉ {1, K/2 + 1}, the PDF of the DFT coefficient of the k'th frequency bin, X[k], is given by

    f(X[k]) = \frac{1}{\pi \lambda^2[k]} \exp\!\left( -\frac{|X[k]|^2}{\lambda^2[k]} \right),    (3)

where λ²[k] = E[|X[k]|²] is the power spectrum. For k ∈ {1, K/2 + 1}, X[k] is real and has a Gaussian distribution with zero mean and variance E[|X[k]|²]. Due to the conjugate symmetry of the DFT for real signals, only half of the DFT coefficients need to be considered.

One consequence of the model is that each frequency bin may be analyzed and processed independently. Although, for typical speech processing DFT lengths, the independence assumption does not always hold, it is widely used, since it significantly simplifies the algorithm design and reduces the associated computational complexity. For instance, the complex Gaussian model has been successfully applied in speech enhancement [38, 39, 92].

The frequency domain model of an AR process can be derived under the asymptotic assumption. Assuming that the frame length approaches infinity, A is well approximated by a circulant matrix (neglecting the frame boundary), and is approximately diagonalized by the discrete Fourier transform. Therefore, the frequency domain AR model is of the form of (3), and the power spectrum λ²[k] is given by

    \lambda^2_{AR}[k] = \frac{\sigma^2}{\left| \sum_{n=0}^{p} \alpha[n] e^{-jnk} \right|^2}.    (4)

It has been argued [89, 103, 118] that supergaussian density functions, such as the Laplace density and the two-sided Gamma density, fit speech DFT coefficients better. However, the experimental studies were based on the long-term averaged behavior of speech, and may not necessarily hold for short-time DFT coefficients. Speech enhancement methods based on supergaussian density functions have been proposed in [19, 85, 86, 89, 90]. The comprehensive evaluations in [90] suggest that supergaussian methods achieve a consistent but small improvement in segmental SNR relative to the Wiener filtering approach (which is based on the Gaussian model). However, in terms of perceptual quality, the Ephraim-Malah short-time spectral amplitude estimators [38, 39] based on the Gaussian model were found to provide more pleasant residual noise in the processed speech [90].

1.2 Long-term models

By a long-term model, we mean a model that describes the long-term statistics that span multiple signal frames of 20-30 ms. The short-term models in section 1.1 apply to one signal frame only, and the model parameters obtained for a frame may not generalize to another frame. A long-term model, on the other hand, is generic and flexible enough to allow statistical variations in different signal frames. To model different short-term statistics in different frames, a long-term model comprises multiple states, each containing a sub-model describing the short-term statistics of a frame. Applied to speech modeling, each sub-model represents a particular class of speech frames that are considered statistically similar. Some of the most well-known long-term models for speech processing are the hidden Markov model (HMM) and the Gaussian mixture model (GMM).

Hidden Markov model

The hidden Markov model (HMM) was originally proposed by Baum and Petrie [13] as probabilistic functions of finite state Markov chains.
Applications of HMMs in speech processing can be found in automatic speech recognition [11, 65, 105] and speech enhancement [35, 36, 111, 147]. An HMM models the statistics of an observation sequence, x₁, . . . , x_N. The model consists of a finite number of states that are not observable, and the transition from one state to another is controlled by a first-order Markov chain. The observation variables are statistically dependent on the underlying states, and may be considered as the states observed through a noisy channel. Due to the Markov assumption on the hidden states, the observations are statistically dependent on each other. For a realization of the state sequence s = [s₁, . . . , s_N], let s_n denote the state of frame n, let a_{s_{n-1} s_n} denote the transition probability from state s_{n-1} to s_n, with a_{s₀ s₁} being the initial state probability, and let f_{s_n}(x_n) denote the output probability of x_n for a given state s_n. The PDF of the observation sequence x₁, . . . , x_N of an HMM is then given by

    f(x_1, \dots, x_N) = \sum_{s \in S} \prod_{n=1}^{N} a_{s_{n-1} s_n} f_{s_n}(x_n),    (5)

where the summation is over the set S of all possible state sequences. The joint probability of a state sequence s and the observation sequence x₁, . . . , x_N consists of the product of the transition probabilities and the output probabilities. Finally, the PDF of the observation sequence is the sum of the joint probabilities over all possible state sequences.

Applied to speech modeling, each sub-model of a state represents the short-term statistics of a frame. Using the frequency domain Gaussian model (3), each state consists of a power spectral density representing a particular class of speech sounds (with similar spectral properties). The Markov chain of states models the temporal evolution of short-term speech power spectra.
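Evaluating (5) by enumerating all state sequences grows exponentially in N; the standard forward recursion computes the same sum in O(N M²) time for M states. A hedged sketch with a hypothetical two-state HMM, using discrete output PMFs as stand-ins for the speech PDFs f_s(x):

```python
import numpy as np

# Toy HMM: 2 hidden states, binary observations (stand-ins for f_s(x))
a0 = np.array([0.6, 0.4])        # initial state probabilities a_{s0 s1}
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])       # transition probabilities a_{s_{n-1} s_n}
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])       # output probabilities f_s(x), x in {0, 1}

def hmm_likelihood(obs):
    """Sum over all state sequences in (5) via the forward recursion."""
    fwd = a0 * B[:, obs[0]]                 # fwd_1(s) = a_{s0 s} f_s(x_1)
    for x in obs[1:]:
        fwd = (fwd @ A) * B[:, x]           # fwd_n(s) = sum_{s'} fwd_{n-1}(s') a_{s' s} f_s(x_n)
    return fwd.sum()

p = hmm_likelihood([0, 1, 0])
```

For a three-observation sequence the result equals the brute-force sum over the 2³ state sequences, but the recursion scales linearly in the sequence length.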
[Figure 4: Examples of a one-dimensional GMM (left) and a two-dimensional GMM (right).]

Gaussian mixture model

Gaussian mixture modeling is a convenient tool useful in a wide range of applications. An early application of GMMs was in adaptive block quantization applied to image coding [131]. More recent examples of applications include clustering analysis [150], image coding [126], image retrieval [55], multiple description coding [115], speaker identification [108], speech coding [80, 112, 114], speech quality estimation [42], speech recognition [33, 34], and vector quantization [58, 127, 128, 148].

A Gaussian mixture model (GMM) is defined as a weighted sum of Gaussian densities,

    f(x) = \sum_{m=1}^{M} \rho_m f_m(x),    (6)

where m denotes the component index, ρ_m denote the mixture weights, and f_m(x) are the component Gaussian PDFs. Examples of a one-dimensional GMM and a two-dimensional GMM are shown in Figure 4.

A GMM can be interpreted in two conceptually different ways. It can be seen as a generic PDF that has the ability to approximate an unknown PDF by adjusting the model parameters over a database of observations. Another interpretation of a GMM is as a special case of an HMM, with independent and identically-distributed (I.I.D.) random variables, such that f(x₁, . . . , x_N) = f(x₁) . . . f(x_N). The model can then be formulated as an HMM by assigning the initial state probabilities according to the mixture component weights, and letting the probability of a transition into each state equal that state's mixture weight, independently of the previous state. In fact, to generate random observations from a GMM, one can first generate an I.I.D. state sequence according to the mixture component weights, and then generate the observation sequence according to the component PDF of each state sample.
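The two-step sampling procedure just described can be sketched for a toy one-dimensional GMM (all parameter values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D GMM: weights rho_m, component means and standard deviations
rho = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
std = np.array([0.5, 1.0])

def sample_gmm(n):
    # Step 1: draw an i.i.d. state sequence according to the mixture weights
    states = rng.choice(len(rho), size=n, p=rho)
    # Step 2: draw each observation from the Gaussian of its state
    return rng.normal(mu[states], std[states]), states

def gmm_pdf(x):
    # f(x) = sum_m rho_m N(x; mu_m, std_m^2), per (6)
    comp = np.exp(-0.5 * ((x[:, None] - mu) / std) ** 2) / (std * np.sqrt(2 * np.pi))
    return comp @ rho

x, states = sample_gmm(10000)
```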
The generated observations will be distributed according to the GMM. Applied to speech modeling, each component may represent a class of speech sounds, the weights correspond to the probabilities of their occurrence, and each component PDF describes the power spectrum of the sound class. The main difference between a GMM and an HMM with Gaussian sub-models is that, in a GMM, consecutive frames are considered independent, while in an HMM, the Markov assumption on the underlying state sequence models their temporal dependency. An HMM is therefore a more general and powerful model than a GMM. Nevertheless, GMMs are popular models in speech processing due to their simplicity in analysis and good modeling capabilities. In fact, GMMs are often used as sub-models of an HMM in practical implementations of HMM based methods.

1.3 Model-parameter estimation

Using an HMM or a GMM, the joint PDF of a sequence x_1, ..., x_N, denoted f(x_1, ..., x_N), is parameterized by unknown parameters, denoted by θ. Each value of θ corresponds to one particular PDF of x_1, ..., x_N. Since the true PDF is unknown, the value of θ is to be determined from data. Let θ̂ denote the estimated parameters, as a function of x_1, ..., x_N. A commonly used criterion for an optimal estimator is that 1) the estimator is unbiased, i.e.,

E[θ̂] = θ,   (7)

and 2) the estimator has the minimum variance. Such an estimator is referred to as the minimum variance unbiased (MVU) estimator [67]. Unfortunately, the MVU estimator is difficult to determine in many practical applications. An alternative is the maximum likelihood (ML) estimator, which maximizes the likelihood function, or equivalently its logarithm (the log-likelihood function),

θ̂ = \arg\max_θ f(x_1, ..., x_N | θ)   (8)
  = \arg\max_θ \log f(x_1, ..., x_N | θ),   (9)

where f(x_1, ..., x_N | θ) denotes the likelihood function, viewed as a function of θ for the given data sequence x_1, . . .
, x_N. (The likelihood function can also be seen as the PDF of x_1, ..., x_N conditioned on the parameter vector θ.) The ML approach typically leads to practical estimators that can be implemented for complex estimation tasks. Also, it can be shown [67] that the ML estimator is asymptotically optimal, and approaches the performance of the MVU estimator when the data set is large. For a generic PDF based on a GMM or an HMM, directly solving (8) or (9) leads to intractable analytical expressions. A more practical and commonly used method is the expectation-maximization (EM) algorithm [31].

Expectation-maximization (EM) algorithm

The expectation-maximization (EM) algorithm originated from the early work of Dempster et al. [31]. The EM algorithm is used in Papers A and B for parameter estimation of an extended HMM of speech. The EM algorithm applies to a given batch of data, and is suitable for off-line optimization of the model. For other model parameters, e.g., those of a noise model, the observation data may not be available off-line, and on-line estimation using a recursive algorithm is necessary. A recursive formulation of the EM algorithm was proposed in [132], and applied to on-line HMM parameter estimation in [72]. The recursive EM algorithm is used in Paper A for estimation of the gain model parameters, and in Paper B for estimation of the noise model parameters. Both algorithms are based on the same principle of incomplete data. Herein, an introduction to the theory behind the EM algorithm is given. The EM algorithm is an iterative algorithm that ensures non-decreasing likelihood scores over the iterations. Since the likelihood function is upper-bounded, the algorithm converges to a locally optimal solution. The EM algorithm is designed for estimation scenarios in which the observation sequence is incomplete or partially missing. The missing observations can be either real or artificial.
In either case, the EM algorithm is applicable when the ML estimator leads to intractable analytical formulas, but is significantly simplified by assuming the existence of additional unknown values. For notational convenience, let X_1^N = {x_1, ..., x_N} denote the observation data and Z_1^N = {z_1, ..., z_N} the missing observation data. The complete data is then given by {X_1^N, Z_1^N}, and log f(X_1^N, Z_1^N | θ) is the complete-data log-likelihood function. Let θ̂^(j) denote the estimated parameter vector from the j'th EM iteration. The EM algorithm consists of two sequential sub-steps within each iteration. The E-step (expectation step) formulates the auxiliary function, denoted Q(θ | θ̂^(j)), which is the expected value of the complete-data log-likelihood with respect to the missing data, given the observation data and the parameter estimate of the j'th iteration,

Q(θ | \hat{θ}^{(j)}) = \int f(Z_1^N | X_1^N, \hat{θ}^{(j)}) \log f(Z_1^N, X_1^N | θ) \, dZ_1^N.   (10)

The M-step (maximization step) gives a new estimate of the parameter vector by maximizing the Q function of the current step,

\hat{θ}^{(j+1)} = \arg\max_θ Q(θ | \hat{θ}^{(j)}).   (11)

It can be noted that the expectation in (10) is evaluated with respect to f(Z_1^N | X_1^N, θ̂^(j)) using the j'th parameter estimate, while the maximization in (11) is over θ in the complete-data log-likelihood, log f(Z_1^N, X_1^N | θ). Therefore, it is essential to select Z_1^N such that differentiating Q(θ | θ̂^(j)) with respect to θ and setting the resulting expression to zero leads to tractable update equations. Estimation of both GMM and HMM parameters can be derived within the EM framework. For a GMM, the missing data consists of the component indices from which the observations were generated. For an HMM, the missing data is the state sequence. For a detailed derivation of the EM algorithm for GMMs and HMMs, we refer to [17].
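For the GMM case, where the missing data are the component indices, both the E-step and the M-step have closed forms. A minimal one-dimensional sketch of a single EM iteration (illustrative names and scalar-variance components; not the thesis implementation):

```python
import numpy as np

def em_step_gmm(x, w, mu, var):
    """One EM iteration for a 1-D GMM.  The missing data are the
    component indices.  E-step: posterior responsibilities of each
    component for each sample.  M-step: closed-form re-estimates
    maximizing the auxiliary function Q of Eq. (10)-(11)."""
    x = np.asarray(x)[:, None]                              # shape (N, 1)
    # E-step: gamma[n, m] = P(component m | x_n, current parameters)
    lik = w * np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # M-step: weighted sample statistics
    Nm = gamma.sum(axis=0)
    w_new = Nm / len(x)
    mu_new = (gamma * x).sum(axis=0) / Nm
    var_new = (gamma * (x - mu_new) ** 2).sum(axis=0) / Nm
    return w_new, mu_new, var_new
```

Iterating this step yields a non-decreasing log-likelihood, in line with the convergence argument outlined below.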
The EM algorithm and the recursive EM algorithm are extensively used in Papers A and B to estimate the model parameters of extended HMMs of speech and noise.

Convergence of the EM algorithm

The iterative procedure of the EM algorithm ensures convergence towards a locally optimal solution. A proof of convergence is given in [31], and is outlined in this section. We first show that the log-likelihood function (9) over a large data set is upper-bounded. We then show that the EM iterations improve the log-likelihood in each iteration step. Assume a data set consisting of a large number of sequences X_1^N, each distributed according to a "true" distribution f(X_1^N). Note the difference from f(X_1^N | θ), which is a PDF parameterized by the unknown parameter vector θ. The log-likelihood function of this data set is, up to normalization by the number of sequences,

LL = \sum_{X_1^N} \log f(X_1^N | θ) \approx \int f(X_1^N) \log f(X_1^N | θ) \, dX_1^N.   (12)

Comparing the log-likelihood function with the expectation of the logarithm of the true PDF, \int f(X_1^N) \log f(X_1^N) \, dX_1^N, we get

LL − \int f(X_1^N) \log f(X_1^N) \, dX_1^N = \int f(X_1^N) \log \frac{f(X_1^N | θ)}{f(X_1^N)} \, dX_1^N ≤ \int f(X_1^N) \left( \frac{f(X_1^N | θ)}{f(X_1^N)} − 1 \right) dX_1^N = 1 − 1 = 0,   (13)

where the inequality is due to \log x ≤ x − 1. Hence, the log-likelihood function is upper-bounded by the expectation of the logarithm of the true PDF. The log-likelihood score at the j'th iteration can be written as

\log f(X_1^N | \hat{θ}^{(j)}) = \log \left( f(X_1^N | \hat{θ}^{(j)}) \int f(Z_1^N | X_1^N, \hat{θ}^{(j)}) \, dZ_1^N \right)   (14)
  = \int f(Z_1^N | X_1^N, \hat{θ}^{(j)}) \log f(X_1^N | \hat{θ}^{(j)}) \, dZ_1^N   (15)
  = \int f(Z_1^N | X_1^N, \hat{θ}^{(j)}) \log \frac{f(X_1^N, Z_1^N | \hat{θ}^{(j)})}{f(Z_1^N | X_1^N, \hat{θ}^{(j)})} \, dZ_1^N,   (16)

where the last step uses f(X_1^N, Z_1^N | θ̂^(j)) = f(Z_1^N | X_1^N, θ̂^(j)) f(X_1^N | θ̂^(j)). The log-likelihood score at the (j + 1)'th iteration can be written as
\log f(X_1^N | \hat{θ}^{(j+1)}) = \log \int f(X_1^N, Z_1^N | \hat{θ}^{(j+1)}) \, dZ_1^N   (17)
  = \log \int f(Z_1^N | X_1^N, \hat{θ}^{(j)}) \frac{f(X_1^N, Z_1^N | \hat{θ}^{(j+1)})}{f(Z_1^N | X_1^N, \hat{θ}^{(j)})} \, dZ_1^N   (18)
  ≥ \int f(Z_1^N | X_1^N, \hat{θ}^{(j)}) \log \frac{f(X_1^N, Z_1^N | \hat{θ}^{(j+1)})}{f(Z_1^N | X_1^N, \hat{θ}^{(j)})} \, dZ_1^N,   (19)

where the inequality is due to Jensen's inequality. Comparing the log-likelihood score of the (j + 1)'th iteration to that of the previous iteration, we get

\log f(X_1^N | \hat{θ}^{(j+1)}) − \log f(X_1^N | \hat{θ}^{(j)})
  ≥ \int f(Z_1^N | X_1^N, \hat{θ}^{(j)}) \log f(X_1^N, Z_1^N | \hat{θ}^{(j+1)}) \, dZ_1^N   (20)
   − \int f(Z_1^N | X_1^N, \hat{θ}^{(j)}) \log f(X_1^N, Z_1^N | \hat{θ}^{(j)}) \, dZ_1^N   (21)
  = Q(\hat{θ}^{(j+1)} | \hat{θ}^{(j)}) − Q(\hat{θ}^{(j)} | \hat{θ}^{(j)}) ≥ 0,   (22)

where the last step follows from the definition of the maximization step (11). Since the log-likelihood score is upper-bounded, convergence is proven.

2 Speech enhancement for noise reduction

In Papers A and B, we apply the statistical framework to the enhancement of speech in noisy environments. Noise reduction has become an increasingly important component in speech communication systems. Based on a statistical framework using models of speech and noise, a noise reduction system can be formulated as an estimation problem, in which the clean speech is to be estimated from the noisy observation signal. In estimation theory, the performance of an estimator is commonly measured as the expected value of a distortion function that quantifies the difference between the clean speech signal and the estimated signal. The success of the estimator that minimizes the expected distortion depends largely on the accuracy of the assumed speech and noise models and on the perceptual relevance of the distortion measure. Due to the inherent complexity of natural speech and human perception, present models and distortion measures are only simplified approximations.
A large variety of noise reduction methods has been proposed, and their differences are often due to different assumptions on the models and distortion measures. The models discussed in section 1 can be applied to modeling the statistics of speech and noise. In the design of a noise reduction system, additional assumptions on the noisy signal are made. Depending on the environment, the interfering noise may be additive, e.g., from a separate noise source, or convolutive, e.g., due to room reverberation or fading. Depending on the number of available microphones, the observation data consist of one or more noisy signals. The corresponding noise reduction methods are divided into single-channel (one microphone) and multi-channel (microphone array) systems. The focus of this thesis is on single-channel noise reduction for an additive noise environment. The noisy speech signal is assumed to be the sum of the speech and noise signals. Furthermore, the additive noise is often considered to be statistically independent of the speech. A successful noise reduction system may be used as a pre-processor to a speech communication system or an automatic speech recognition system. In fact, several algorithms have been successfully integrated into standardized speech coders for mobile speech communication. Examples of speech coders with a noise reduction module include the Enhanced Variable Rate Codec (EVRC) [1] and the Selectable Mode Vocoder (SMV) [7]. 3GPP (the Third Generation Partnership Project) has standardized minimum performance requirements and an evaluation procedure for noise reduction in the Adaptive Multi-Rate (AMR) codec [6], and a number of noise reduction algorithms that satisfy the requirements have been developed, e.g., [3–5, 8]. The remainder of this section is organized as follows. An overview of single-channel noise reduction is given in section 2.1. The focus is on frequency domain approaches that make use of a short-time spectral attenuation filter.
The Bayesian estimation framework is introduced and noise estimation is discussed. Sections 2.2 and 2.3 provide an overview of classical statistical model based methods using short-term models and long-term models, respectively. Sections 2.2 and 2.3 also serve as an introduction to Papers A and B. For a more extensive tutorial on speech enhancement algorithms for noise reduction, we refer to [37, 78, 135].

2.1 Overview

Due to the quasi-stationarity of speech, the noisy signal is often processed on a frame-by-frame basis. For additive noise, the noisy speech signal of the n'th frame is modeled as

y_n[i] = x_n[i] + w_n[i],   (23)

where i is the time index within the n'th frame, and y, x, w denote noisy speech, clean speech and noise, respectively. In vector notation, the noisy speech, clean speech and noise frames are denoted by y_n, x_n and w_n, respectively. In the following sections, the speech model parameters are denoted using an overbar (¯) and the noise model parameters using double dots (¨).

Short-time spectral attenuation

It is common to perform estimation in the spectral domain, e.g., through short-time Fourier analysis and synthesis. An analysis window is applied to each time domain signal segment and the DFT is then applied. For the n'th frame, the frequency domain signal model is given by

Y_n[k] = X_n[k] + W_n[k],   (24)

where k denotes the index of the frequency band. For frequency-domain noise reduction, the noisy spectrum is attenuated in the frequency bins that contain noise. The level of attenuation depends on the signal-to-noise ratio (SNR) of the frequency bin. The attenuation is performed by means of an adaptive spectral attenuation filter, denoted by H_n[k], that is applied to Y_n[k] to obtain an estimate of X_n[k],

\hat{X}_n[k] = H_n[k] Y_n[k].   (25)

The inverse DFT is applied to X̂_n = [X̂_n[1], ..., X̂_n[K]]^T to obtain the enhanced speech signal frame x̂_n in the time domain.
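The per-frame processing of (24)-(25) can be sketched in a few lines; this is an illustrative skeleton (our own names), with windowing and overlap-add synthesis omitted and the attenuation rule left pluggable:

```python
import numpy as np

def enhance_frame(x_frame, gain_fn, noise_psd):
    """One analysis-attenuate-synthesize cycle for a single frame:
    DFT, apply a real-valued per-bin gain H_n[k], inverse DFT."""
    Y = np.fft.rfft(x_frame)          # analysis, Eq. (24)
    H = gain_fn(Y, noise_psd)         # any attenuation rule, e.g. (26) or (35)
    return np.fft.irfft(H * Y, n=len(x_frame))  # synthesis of x_hat_n
```

With a unit gain this round trip reconstructs the frame exactly; a real noise reduction system additionally applies an analysis window and overlap-add, as described next.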
To avoid discontinuities at frame boundaries, the overlap-and-add approach is used in the synthesis. A schematic diagram of a frequency domain noise reduction method using the short-time Fourier transform is shown in Figure 5.

Figure 5: Noise reduction method using a short-time Fourier transform.

Spectral subtraction

One of the classical methods for noise reduction is spectral subtraction [18, 92]. The method is based on a direct estimation of the speech spectral magnitude by subtracting the noise spectral magnitude from the noisy one. The attenuation filter of the spectral subtraction method can be formulated as

H_n^{SS}[k] = \left( \frac{\max(|Y_n[k]|^r − |\hat{W}_n[k]|^r, 0)}{|Y_n[k]|^r} \right)^{1/r},   (26)

where |Ŵ_n[k]| is an estimate of the noise spectral amplitude, and r = 1 for spectral magnitude subtraction [18] and r = 2 for power spectral subtraction [92]. The power spectral subtraction method is closely related to the Wiener filtering approach, the only difference being the power term (see section 2.2). For both magnitude and power subtraction, the phase of the noisy spectrum is used unprocessed in the construction of the enhanced speech. One motivation is that the human auditory system is less sensitive to noise in the spectral phase than in the spectral magnitude [139]; the noise in the spectral phase is therefore less annoying. Another formulation of power spectral subtraction [92] is based on the complex Gaussian model (3) of the noisy spectrum. Under the Gaussian assumption, and due to the independence assumption, the noisy power spectrum of the k'th frequency bin is λ_n^2[k] = λ̄_n^2[k] + λ̈_n^2[k], where λ̄_n^2[k] and λ̈_n^2[k] are the power spectra of speech and noise, respectively. The PDF of Y_n[k] is given by

f(Y_n[k]) = \frac{1}{\pi(\bar{\lambda}_n^2[k] + \ddot{\lambda}_n^2[k])} \exp\left( − \frac{|Y_n[k]|^2}{\bar{\lambda}_n^2[k] + \ddot{\lambda}_n^2[k]} \right).   (27)

Assuming that an estimate \hat{\ddot{\lambda}}_n^2[k] of the noise power spectrum exists, the maximum likelihood (ML) estimate of the speech power spectrum maximizes f(Y_n[k]) with respect to λ̄_n^2[k]. Using the additional constraint that the power spectrum is non-negative, the ML estimate of λ̄_n^2[k] is given by

\hat{\bar{\lambda}}_n^2[k] = \max(|Y_n[k]|^2 − \hat{\ddot{\lambda}}_n^2[k], 0).   (28)

The implementation of the spectral subtraction method is straightforward. However, the processed speech is plagued by the so-called musical noise phenomenon. Musical noise is residual noise containing isolated tones located randomly in time and frequency. These tones correspond to spectral peaks in the original noisy signal, and are left over when their surrounding spectral bins in the time-frequency plane have been heavily attenuated. Musical noise is perceptually annoying, and several methods targeting the musical noise problem have been derived from the spectral subtraction method [48, 83, 133, 136].

Bayesian estimation approach

Spectral subtraction was originally an empirically motivated method, and is often used in practice because of its simplicity. Alternatively, speech enhancement for noise reduction can be formulated in a statistical framework, such that prior knowledge of speech and noise is incorporated through prior PDFs. Using additionally a relevant cost function, speech can be estimated within the Bayesian estimation framework. The following sections, 2.2 and 2.3, are devoted to methods using the Bayesian estimation framework. In this subsection, a short introduction to Bayesian estimation is given. At a conceptual level, assuming that speech and noise are random vectors distributed according to the PDFs f(x) and f(w), the joint PDF of x and y can be formulated using the signal model (23).
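The subtraction rule (26) amounts to a few lines of array code; the following sketch (hypothetical function name; `W_hat` is whatever noise-spectrum estimate is available) covers both the magnitude and power variants:

```python
import numpy as np

def spectral_subtraction_gain(Y, W_hat, r=2):
    """Attenuation filter of Eq. (26): r=1 gives magnitude
    subtraction, r=2 power subtraction.  The max(., 0) enforces a
    non-negative spectrum, as in the ML estimate of Eq. (28)."""
    num = np.maximum(np.abs(Y) ** r - np.abs(W_hat) ** r, 0.0)
    return (num / np.abs(Y) ** r) ** (1.0 / r)
```

Because the gain is real-valued, multiplying it with Y_n[k] automatically reuses the noisy phase, consistent with the discussion above.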
A Bayesian speech estimator minimizes the Bayes risk, which is the expected cost with respect to both x and y, defined as

\text{Bayes risk} = \int \int C(x, \hat{x}) f(x, y) \, dx \, dy,   (29)

where C(x, x̂) is a cost function between x and x̂. A commonly used cost function is the squared error,

C_{MMSE}(x, \hat{x}) = ||x − \hat{x}||^2,   (30)

and the optimal estimator minimizing the corresponding Bayes risk is the minimum mean square error (MMSE) estimator. The popularity of MMSE based estimation is due to its mathematical tractability and good performance. It can be shown, e.g., [67], that the MMSE speech estimator is the conditional expectation of the speech given the noisy speech,

\hat{x} = \int x f(x | y) \, dx.   (31)

Noise estimation

One key component of a single-channel noise reduction system is the estimation of the noise statistics. Often, a speech estimator is designed assuming that the actual noise power spectrum is available, and its optimality is no longer ensured when the actual noise power spectrum is replaced by an estimate. Therefore, the performance of the speech estimator depends largely on the quality of the noise estimation algorithm. A large number of noise estimation algorithms has been proposed. They are typically based on the assumptions that 1) speech does not occupy the entire time-frequency plane, and 2) noise is "more stationary" than speech. By assuming that the noise is stationary over a period longer than a frame, e.g., up to several seconds, the noise power spectrum can be estimated by averaging the noisy power spectra in the frequency bands with low or zero speech activity. The update of the noise estimate is commonly controlled through a voice-activity detector (VAD), e.g., [1, 92], a speech presence probability estimate [24, 87, 120], or order statistics [59, 88, 125].
For instance, the minimum statistics method [88] is based on the observation that the noisy power spectrum frequently approaches the noise power spectrum level in frequency bins where speech is absent. To obtain the minimum statistics, a buffer is used to store past noisy power spectra of each frequency band. A bias compensation that takes into account the fluctuations of the power spectrum estimate is proposed in [88]. In such a method, the size of the buffer is a trade-off. If the buffer is too small, it may not contain any speech-inactive frames, and speech may be falsely detected as noise. If the buffer is too large, it may not react fast enough to non-stationary noise. In [88], the buffer length is set to 1.5 seconds.

2.2 Short-term model based approach

This section provides an overview of some Bayesian speech estimators using the short-term models discussed in section 1.1. In particular, we discuss Wiener filtering and spectral amplitude estimation methods.

Wiener filtering

Wiener filtering [141] is a classical signal processing technique that has been applied to speech enhancement. Wiener filters are the optimal linear time-invariant filters under the minimum mean square error (MMSE) criterion, and can be classified as Bayesian estimators under the constraint of linear estimation. The Wiener filter is also the optimal MMSE estimator under a Gaussian assumption. Assuming linear estimation, the speech estimate in the time domain is of the form

\hat{x}_n = H_n y_n,   (32)

for a linear estimation matrix H_n. The MMSE linear estimator fulfills

H_n^{WF} = \arg\min_H \int \int ||x_n − H y_n||^2 f(x_n, y_n) \, dx_n \, dy_n.   (33)

The optimal estimator can be found by taking the first derivative of (33) with respect to H, and setting the obtained expression to zero.
Using also the assumption that speech and noise both have zero mean and are statistically independent, the optimal linear estimator is

H_n^{WF} = \bar{D}_n (\bar{D}_n + \ddot{D}_n)^{−1},   (34)

where D̄_n and D̈_n are the covariance matrices of speech and noise, respectively, for frame n. An alternative point of view on Wiener filtering is through MMSE estimation under a Gaussian assumption. Assuming that speech and noise have zero-mean Gaussian distributions with covariance matrices D̄_n and D̈_n, it can be shown [67] that the conditional PDF of x_n given y_n is also Gaussian, with mean D̄_n (D̄_n + D̈_n)^{−1} y_n. Hence, the MMSE estimator for a Gaussian model is linear, and leads to the same result as the linear MMSE estimator (34) for an arbitrary PDF. Due to the correspondence between convolution in the time domain and multiplication in the frequency domain, the Wiener filter in the frequency domain has the simpler form of (25) with

H_n^{WF}[k] = \frac{\bar{\lambda}_n^2[k]}{\bar{\lambda}_n^2[k] + \ddot{\lambda}_n^2[k]},   (35)

for the k'th frequency bin. Practical implementation of the Wiener filter requires estimates of the second-order statistics (the power spectra) of speech and noise, λ̄_n^2[k] and λ̈_n^2[k]. The noise spectrum may be obtained through a noise estimation algorithm (section 2.1), and the speech spectrum may be estimated using (28). Hence, the practical implementation of Wiener filtering is closely related to the power spectral subtraction method, and also suffers from the musical noise problem.

Speech spectral amplitude estimation

Another well-known and widely applied MMSE estimator for speech enhancement is the MMSE short-time spectral amplitude (STSA) estimator by Ephraim and Malah [38]. The STSA method does not suffer from musical noise in the enhanced speech, and is therefore interesting also from a perceptual point of view.
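A small sketch (our own function names) of the frequency-domain gain (35), and of the practical variant in which the speech PSD is replaced by its ML estimate (28), which makes the connection to power spectral subtraction explicit:

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd):
    """Frequency-domain Wiener filter of Eq. (35), per bin."""
    return speech_psd / (speech_psd + noise_psd)

def wiener_gain_plugin(Y, noise_psd):
    """Wiener gain with the speech PSD replaced by its ML estimate
    of Eq. (28): max(|Y|^2 - noise_psd, 0)."""
    speech_psd = np.maximum(np.abs(Y) ** 2 - noise_psd, 0.0)
    return wiener_gain(speech_psd, noise_psd)
```

The plug-in form attenuates a bin fully whenever the noisy power falls below the noise estimate, which is one source of the musical noise mentioned above.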
From the MMSE estimation result (31), the STSA estimator is formulated as the expected value of the speech spectral amplitude conditioned on the noisy spectrum. The STSA estimator depends on two parameters, the a-posteriori SNR R_n^{post}[k] and the a-priori SNR R_n^{prio}[k], that need to be evaluated for each frequency bin. The a-posteriori SNR is defined as the ratio of the noisy periodogram and the noise power spectrum estimate (this definition differs from the traditional definition of the signal-to-noise ratio by including both speech and noise in the numerator),

R_n^{post}[k] = \frac{|Y_n[k]|^2}{\hat{\ddot{\lambda}}_n^2[k]},   (36)

and the a-priori SNR is defined as the ratio of the speech and noise power spectra [38, 92],

R_n^{prio}[k] = \frac{\bar{\lambda}_n^2[k]}{\ddot{\lambda}_n^2[k]}.   (37)

The optimal attenuation filter of the STSA estimator is shown to be [38]

H_n[k] = \frac{\sqrt{\pi}}{2} \sqrt{ \frac{1}{R_n^{post}[k]} \, \frac{R_n^{prio}[k]}{1 + R_n^{prio}[k]} } \; M\!\left( R_n^{post}[k] \, \frac{R_n^{prio}[k]}{1 + R_n^{prio}[k]} \right),   (38)

where M is the function

M(\theta) = \exp\left(−\frac{\theta}{2}\right) \left( (1 + \theta) I_0\!\left(\frac{\theta}{2}\right) + \theta I_1\!\left(\frac{\theta}{2}\right) \right),   (39)

and I_0 and I_1 are the zeroth and first-order modified Bessel functions, respectively. In practice, the a-priori SNR is not available and needs to be estimated. Ephraim and Malah proposed a solution based on the decision-directed approach, where the a-priori SNR is estimated as [38]

R_n^{prio}[k] = \alpha \frac{|H_{n−1}[k] Y_{n−1}[k]|^2}{\hat{\ddot{\lambda}}_n^2[k]} + (1 − \alpha) \max(R_n^{post}[k] − 1, 0),   (40)

where 0.9 ≤ α < 1 is a tuning parameter. The STSA method has demonstrated superior perceptual quality compared to classical methods such as spectral subtraction and Wiener filtering. In particular, the enhanced speech does not suffer from the musical noise phenomenon. It has been argued [22] that the perceptual benefit of the method is mainly due to the decision-directed approach for estimating the a-priori SNR. The decision-directed approach has an attractive temporal smoothing behavior on the SNR estimate.
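The decision-directed update (40) itself is simple to state in code; a sketch with hypothetical names (the previous frame's enhanced power |H_{n-1}Y_{n-1}|^2 is carried over from frame to frame):

```python
import numpy as np

def decision_directed_snr(prev_enh_power, noise_psd, Y, alpha=0.98):
    """Decision-directed a-priori SNR estimate of Eq. (40).

    prev_enh_power: |H_{n-1}[k] Y_{n-1}[k]|^2 from the previous frame
    noise_psd:      current noise PSD estimate
    Y:              current noisy spectrum
    alpha:          smoothing factor, typically 0.9 <= alpha < 1
    """
    r_post = np.abs(Y) ** 2 / noise_psd  # a-posteriori SNR, Eq. (36)
    return alpha * prev_enh_power / noise_psd \
        + (1 - alpha) * np.maximum(r_post - 1.0, 0.0)
```

Larger α gives heavier smoothing across frames, which is the behavior analyzed in [22] and summarized next.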
It was shown [22] that for low R^{post}, i.e., in the absence of speech, R^{prio} is an average SNR over a large number of successive frames; for high R^{post}, R^{prio} follows R^{post} with a one-frame delay. This smoothing behavior has a significant effect on the perceptual quality of the enhanced speech, and the same strategy is applicable to other enhancement methods, such as Wiener filtering and spectral subtraction, to reduce the musical noise in the enhanced speech.

Research challenges

The methods discussed in section 2.2 require an estimate of the noise power spectrum. Most noise estimation algorithms are based on the assumption that the noise is stationary or mildly non-stationary. Modern mobile devices are required to operate in acoustic environments with large diversity and variability in the interfering noise. The non-stationarity may be due to the noise source itself, i.e., the noise contains impulsive or transient components or periodic patterns, or due to the movement of the noise source and/or the recording device. The latter scenario occurs commonly on a street, where noise sources (e.g., cars) move rapidly around. Due to the increased degree of non-stationarity in such environments, the performance of most noise estimation algorithms is non-ideal and may significantly affect the performance of the speech estimation algorithm. The design of a noise reduction system that is capable of dealing with non-stationary noise is of great interest and a challenging task for the research community. From the Bayesian estimation point of view, the performance of the speech estimator is determined by the accuracy of the speech and noise models and the perceptual relevance of the distortion measure. Unfortunately, the true statistics of speech and noise are not explicitly available. Also, speech quality is a subjective measure that requires human listening experiments, and is expensive to assess.
Objective measures that approximate the "true" perceptual quality have been proposed, e.g., for the evaluation of speech codecs [2, 14, 49, 50, 104, 110, 137, 138, 140], but most of them are complex and mathematically intractable for the derivation of a speech estimator. Nevertheless, a number of algorithms based on distortion measures that capture some properties of human perception have been proposed [57, 62, 79, 133, 136, 143]. For instance, the speech estimator in Paper A optimizes a criterion that allows for an adjustable level of residual noise in the enhanced speech, in order to avoid unnecessary speech artifacts. Incorporating a higher degree of prior information about speech and noise can also improve the performance of the speech estimator. With short-term models, the assumed prior information is mainly the type of PDF, of which the parameters of both the speech and noise PDFs (their respective power spectra for Gaussian PDFs) need to be estimated. To incorporate stronger prior information about speech and noise, more refined statistical models should be used. For instance, a more precise PDF of speech may be obtained by using a speech database and optimizing a long-term model through a training algorithm. In the next section, we discuss noise reduction methods using long-term models such as the HMM.

2.3 Long-term model based approach

This section provides an overview of noise reduction methods using the long-term models from section 1.2. Long-term model based noise reduction systems include HMM based methods [35, 111], GMM based methods [32, 149], and codebook based methods [73, 121]. In this section, we focus mainly on methods based on hidden Markov models and their derivatives.

HMM-based speech estimation

Due to the inter-frame dependency of an HMM, the HMM based MMSE estimator (31) of speech can be formulated using the current and past frames of the noisy speech,

\hat{x}_n = \int x_n f(x_n | Y_1^n) \, dx_n,   (41)

where Y_1^n = {y_1, . . .
, y_n} denotes the noisy observation sequence from frame 1 to n. Due to the Markov assumption on the underlying state sequence, both the current and past noisy frames are utilized in the speech estimator of the current frame. Assuming that both speech and noise are modeled using HMMs with Gaussian sub-models, it can be shown that the noisy model is a composite HMM with Gaussian sub-models, due to the statistical independence of speech and noise. If the speech model has M̄ states and the noise model M̈ states, the composite HMM has M̄M̈ states, one for each combination of speech and noise states. Under the Gaussian assumption, the noisy PDF of a given state is a Gaussian PDF with zero mean and covariance matrix D_s = \bar{D}_{\bar{s}} + \ddot{D}_{\ddot{s}}. The conditional PDF of the speech given the noisy observation sequence can be formulated as

f(x_n | Y_1^n) = \sum_{s_n} f(s_n, x_n | Y_1^n)
  = \sum_{s_n} f(s_n | Y_1^n) f(x_n | Y_1^n, s_n)
  = \sum_{s_n} f(s_n | Y_1^n) f(x_n | y_n, s_n),   (42)

where the last step is due to the conditional independence property of an HMM, which states that the observation for a given state is independent of other states and observations. Hence, the MMSE speech estimator using an HMM (41) can be formulated as

\hat{x}_n = \sum_{s_n} f(s_n | Y_1^n) \int x_n f(x_n | y_n, s_n) \, dx_n
  = \sum_{s_n} f(s_n | Y_1^n) \, \bar{D}_{\bar{s}} (\bar{D}_{\bar{s}} + \ddot{D}_{\ddot{s}})^{−1} y_n,   (43)

where each of the per-state conditional expectations is a Wiener filter (34). The HMM based MMSE estimator is thus a weighted sum of Wiener filters, where the weighting factors are the posterior state probabilities given the noisy observations. The weighting factors can be calculated efficiently using the forward algorithm [40, 105]. A significant difference between the HMM based estimator and the Wiener filter (34) is that the Wiener filter requires on-line estimation of both speech and noise statistics, while the per-state estimators of the HMM approach use speech model parameters obtained through off-line training (see section 1.2).
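With diagonal (per-frequency-bin) covariances, the weighted-sum-of-Wiener-filters structure of (43) reduces to a few array operations per frame. A sketch under our own simplifying assumptions: the posteriors f(s_n | Y_1^n) are already available from the forward algorithm, and a single noise state is used for brevity:

```python
import numpy as np

def hmm_mmse_estimate(post, speech_psds, noise_psd, y_spec):
    """HMM-based MMSE estimate, Eq. (43), with diagonal covariances:
    a posterior-weighted sum of per-state Wiener filters.

    post:        (M,)   posterior state probabilities f(s_n | Y_1^n)
    speech_psds: (M, K) speech PSD of each composite state
    noise_psd:   (K,)   noise PSD (single noise state, for simplicity)
    y_spec:      (K,)   noisy spectrum of the current frame
    """
    gains = speech_psds / (speech_psds + noise_psd)  # per-state Wiener gains, Eq. (35)
    H = post @ gains                                 # posterior-weighted combination
    return H * y_spec
```

With a single state of posterior one, the estimator collapses to the ordinary Wiener filter, as the text indicates.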
Therefore, additional speech knowledge is incorporated in an HMM based method. Noise reduction using an HMM was originally proposed by Ephraim in [35, 36], using an HMM with auto-regressive (AR) Gaussian sub-sources for speech modeling. The noise statistics are modeled using a single Gaussian PDF, and the noise model parameters are obtained using the first few frames of the noisy speech, which are assumed to be noise-only. Alternatively, an additional noise estimation algorithm may be used to obtain the noise statistics. In [111], the HMM based methods are extended to allow HMM based noise models. The method of [111] uses several noise models, obtained off-line for different noise environments, and a classifier is used to determine which noise model is to be used. By utilizing off-line trained models for both speech and noise, strong prior information about speech and noise is incorporated. By having more than one noise state, each representing a distinct noise spectrum, the method can handle rapid changes of the noise spectrum within a specific noise environment, and can provide significant improvement for non-stationary noise environments.

Stochastic gain modeling

A key issue in using long-term speech/noise models obtained through off-line training is the energy mismatch between the model and the observations. Since the model is obtained off-line using training data, the energy level of the training data may differ from that of the speech in the noisy observation, causing a mismatch. Hence, strategies for gain adaptation are needed when long-term models are used. When using an AR-HMM for speech modeling [35], the signal is modeled as an AR process in each state. The excitation variance of the AR model is a parameter that is obtained from the training data.
Consequently, the different energy levels of a speech sound, typically due to differences in pronunciation and/or different vocalizations of individual speakers, are not efficiently modeled. In fact, an AR-HMM implicitly assumes that speech consists of a finite number of frame energy levels. Another energy mismatch between the speech model and the observation is due to different recording conditions during off-line training and on-line estimation, and may be considered as a global energy mismatch. In the MMSE speech estimator of [35], a global gain factor is used to reduce this mismatch. A similar problem appears in noise modeling, since the noise energy level in the noisy environment is inherently unknown, time-varying, and in most natural cases different from the noise energy level in the training data. In [111], a heuristic noise gain adaptation using a voice activity detector has been proposed, where the adaptation is performed in speech pauses longer than 100 ms. In [146], continuous noise gain estimation techniques were proposed. In Paper A, we propose a new HMM based gain-modeling technique that extends the HMM based methods [35, 111] by introducing explicit modeling of both speech and noise gains in a unified framework. The probability density function of xn for a given state s̄ is the integral over all possible speech gains. Modeling the speech energy in the logarithmic domain, we then have

fs̄(xn) = ∫_{−∞}^{∞} fs̄(ḡ′n) fs̄(xn | ḡ′n) dḡ′n,   (44)

where ḡ′n = log ḡn and ḡn denotes the speech gain in the linear domain. The extension of Paper A over the traditional AR-HMM is the stochastic modeling of the speech gain ḡn, where ḡn is considered as a stochastic process. The PDF of ḡn is modeled using a state-dependent log-normal distribution, motivated by the simplicity of the Gaussian PDF and the appropriateness of the logarithmic scale for sound pressure level.
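A minimal numerical sketch of such a state-dependent log-normal gain model, written as a Gaussian density in the log domain (all parameter names are illustrative):

```python
import numpy as np

def log_gain_pdf(g_log, mu, q, sigma2):
    """Gaussian density of the log-domain gain g' = log(g), with
    state-dependent mean mu (trained off-line), gain bias q (tracked
    on-line), and state-dependent variance sigma2."""
    return (np.exp(-(g_log - mu - q) ** 2 / (2.0 * sigma2))
            / np.sqrt(2.0 * np.pi * sigma2))
```

Evaluated over a grid, the density integrates to one and peaks at mu + q, i.e., the state mean shifted by the global gain bias.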
In the logarithmic domain, we have

fs̄(ḡ′n) = 1/√(2πσ̄s̄²) · exp(−(ḡ′n − µ̄s̄ − q̄n)² / (2σ̄s̄²)),   (45)

with mean µ̄s̄ + q̄n and variance σ̄s̄². The time-varying parameter q̄n denotes the speech-gain bias, a global parameter compensating for the long-term energy level of speech, e.g., due to a change of physical location of the recording device. The parameters {µ̄s̄, σ̄s̄²} are modeled as time-invariant, and can be obtained off-line using training data, together with the other speech HMM parameters. The model proposed in Paper A is referred to as a stochastic-gain HMM (SG-HMM), and may be applied to both speech and noise. To obtain the time-invariant parameters of the model, an off-line training algorithm is proposed based on the EM technique. For time-varying parameters, such as q̄n, an on-line estimation algorithm is proposed based on the recursive EM technique. In Paper A, we demonstrate through objective and subjective evaluations that the new model improves the modeling of speech and noise, as well as the noise reduction performance.

The proposed HMM based gain modeling is related to the codebook based methods [73, 74, 122, 124]. In the codebook methods, the prior speech and noise models are represented as codebooks of spectral shapes, represented by linear prediction (LP) coefficients. For each frame, and each combination of speech and noise code vectors, the excitation variances are estimated instantaneously. In [73, 123], the most likely combination of code vectors is used to form the Wiener filter. In [74, 124], a weighted sum of Wiener filters is proposed, where the weighting is determined by the posterior probabilities, similarly to (43). The Bayesian formulation in [124] is based on the assumption that both the code vectors and the excitation variances are uniformly distributed.

Adaptation of the noise HMM

It is widely accepted that the characteristics of speech can be learned from a (sufficiently rich) database of speech.
A successful example is the speech coding application, where, e.g., the linear prediction coefficients (LPC) can be represented with transparent quality using a finite number of bits [97]. However, noise may vary to a large extent in real-world situations. It is unrealistic to assume that one can capture all of this variation in an initial learning stage, which implies that on-line adaptation to changing noise characteristics is necessary. Several methods have been proposed to allow noise model adaptation based on different assumptions, e.g., [149] for a GMM and [84, 93] for an AR-HMM. The algorithms in [84, 149] only apply to a batch of noisy data, e.g., one utterance, and are not directly applicable to on-line estimation. The noise model in [149] is limited to stationary Gaussian noise (white or colored). In [84, 93], a noise HMM is estimated from the observed noisy speech using a voice activity detector (VAD), such that only noise-only frames are used in the on-line learning. In Paper B, we propose an adaptive noise model based on the SG-HMM. The results from Paper A demonstrate good performance for modeling speech and noise. The work is extended in Paper B with an on-line noise estimation algorithm for enhancement in unknown noise environments. The model parameters are estimated on-line using the noisy observations, without requiring a VAD. Therefore, the on-line learning of the noise model is continuous and facilitates adaptation and correction to changing noise characteristics. Estimation of the noise model parameters is based on maximizing the likelihood of the noisy model, and the proposed implementation is based on the recursive expectation-maximization (EM) framework [72, 132].

3 Flexible source coding

The second part of this thesis focuses on flexible source coding for unreliable networks. In Paper C, we propose a quantizer design based on a Gaussian mixture model for describing the source statistics.
The proposed quantizer design allows for rate adaptation in real-time, depending on the current network condition. This section provides an overview of the theory and practice that form the basis for the quantizer design. The section is organized as follows. Basic definitions and design criteria are formulated in section 3.1. High-rate theory is introduced in section 3.2. In section 3.3, some practical coder designs motivated by high-rate theory are discussed.

3.1 Source coding

Information theory, founded by Shannon in his landmark 1948 paper [116], forms the theoretical foundation for modern digital communication systems. As shown in [116], the communication chain can roughly be divided into two separate parts: source coding for data compression and channel coding for data protection. The optimal performance of source and channel coding can be predicted using the source and channel coding theorems [116]. An implication of the theorems is that optimal transmission of a source may be achieved by separately performing source coding followed by channel coding, a result commonly referred to as the source-channel separation theorem. The source and channel coding theorems motivate the design of essentially all modern communication systems. The optimality of separate source and channel coding is based on the assumption of coding in infinite dimensions, and therefore requires infinite delay. While optimality is not guaranteed for practical coders with finite delay, the separate design is used in practical applications due to its simplicity. Source coding consists of lossless coding [63, 75–77, 109, 142], which achieves compression without introducing error, and lossy coding [16, 117], which achieves a higher compression level but introduces distortion. Modern source coders utilize lossy coding and remove both irrelevancy and redundancy. They often incorporate lossless coding methods to remove redundancy from parameters that require perfect reconstruction, such as quantization indices.
Such a system typically consists of a quantizer (lossy coding) followed by an entropy coder (lossless coding). The rate-distortion performance of the optimal coder design can be predicted by rate-distortion theory [116]. The theory provides a theoretical bound on the achievable rate without exceeding a given distortion. The rate-distortion bound exhibits an interesting and intuitive behavior, namely that the achievable rate is a monotonically decreasing convex function of distortion. An example of a rate-distortion function for a unit-variance Gaussian source is shown in Figure 6.

Figure 6: Rate-distortion function for a unit-variance Gaussian source and the mean square error distortion measure.

While the theory is derived under the assumption of infinite dimensions, the behavior is expected to be relevant also in lower dimensions. A flexible source coder design should allow adjustment of the rate-distortion trade-off, e.g., multiple operating points on the rate-distortion curve, such that the coder can operate optimally over a large range of rates. For an unreliable network, it is then possible to adapt the rate of the coder in real-time to the current network condition. By operating at a rate adapted to the network condition, the coder performs with the best quality (lowest distortion) for the available bandwidth.

Quantization

Let x denote a source vector in the K-dimensional Euclidean space R^K, distributed according to the PDF f(x). Quantization is a function

x̂ = Q(x)   (46)

that maps each source vector x ∈ R^K to a code vector x̂ from a codebook, which consists of a countable set of vectors in R^K. Each code vector is associated with a quantization cell (decision region), cell(x̂), defined as

cell(x̂) = {x ∈ R^K : Q(x) = x̂}.
(47)

An index is assigned to each code vector in the codebook, and the index is coded with a codeword, i.e., a sequence of bits. Each quantization index corresponds to a particular codeword, and all codewords together form a uniquely decodable code, which implies that any encoded sequence of indices can be uniquely decoded. For K = 1, the quantizer applies to scalar values and is called a scalar quantizer. The more general quantizer with K ≥ 1 is called a vector quantizer.

Rate

Rate can be defined as the average codeword length per source vector. Since the available bandwidth is limited, the quantizer is optimized under a rate constraint. The codewords may be constrained to have a particular fixed length, resulting in a fixed-rate coder, or constrained to have a particular average length, resulting in a variable-rate coder. For a fixed-rate coder with N code vectors in the codebook, the rate is given by

Rfix. = ⌈log2 N⌉,   (48)

where ⌈·⌉ denotes the ceiling function (the smallest integer equal to or larger than the input value). To achieve efficient quantization, N is typically chosen to be an integer power of 2, to avoid rounding. For a variable-rate coder, code vectors are coded using codewords of different lengths. The rate is then the average codeword length,

Rvar. = Σ_{x̂} p(x̂) ℓ(x̂),   (49)

where ℓ(x̂) denotes the codeword length of x̂ and p(x̂) = ∫_{cell(x̂)} f(x) dx is the probability mass function of x̂, obtained by integrating f(x) over the quantization cell of x̂. From the source coding theorem [116], the optimal variable-rate code has an average rate approaching the entropy of x̂, denoted H,

H = −Σ_{x̂} p(x̂) log2 p(x̂).   (50)

Using an entropy code such as the arithmetic code [75–77, 109, 142], this rate can be approached arbitrarily closely by increasing the sequence length.

Distortion

The performance of a quantizer is measured by the difference, or error, between the source vector and the quantized code vector.
This difference defines an error function, denoted d(x, x̂), that quantifies the distortion due to the quantization of x. The performance of the system is given by the average distortion, defined as the expected error with respect to the input variable,

D = ∫ d(x, x̂) f(x) dx.   (51)

The error function should be meaningful with respect to the source, and it typically fulfills some basic properties, e.g., [53]

• d(x, x̂) ≥ 0,
• if x = x̂, then d(x, x̂) = 0.

The most commonly used error function is the square error function,

d(x, x̂) = (1/K) ||x − x̂||² = (1/K) (x − x̂)^T (x − x̂),   (52)

where ||·|| denotes the L2 vector norm and (·)^T denotes the transpose. The mean distortion of (52) is referred to as the mean square error (MSE) distortion. The MSE distortion is commonly used due to its mathematical tractability. Other distortion measures may be conceptually or computationally more complicated, e.g., by incorporating perceptual properties of the human auditory system. If a complex error function can be assumed to be continuous and have continuous derivatives, the error function can be approximated under a small-error assumption. By applying a Taylor series expansion of d(x, x̂) around x = x̂, and noting that the first two expansion terms vanish for x = x̂, the error function may be approximated as [44]

d(x, x̂) ≈ (1/(2K)) (x − x̂)^T W(x̂) (x − x̂),   (53)

where W(x̂) is a K × K matrix, the so-called sensitivity matrix, with the i, j-th element defined by

W(x̂)[i, j] = ∂²d(x, x̂) / (∂x[i] ∂x[j]) |_{x=x̂},   (54)

i.e., the Hessian of the error function with respect to x, evaluated at the expansion point x̂. In [44], the sensitivity matrix approach was applied to quantization of speech linear prediction coefficients (LPC) using, e.g., the log-spectral distortion (LSD) measure. In [101, 102], a sensitivity matrix was derived and analyzed for a perceptual distortion measure based on a sophisticated spectral-temporal auditory model [29, 30].
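A numerical sketch of the second-derivative matrix in (54), here approximated by central finite differences (the function name and step size are illustrative). For the squared error (52) in K = 2 dimensions, the matrix of second derivatives is (2/K) I = I.

```python
import numpy as np

def sensitivity_matrix(d, x_hat, eps=1e-5):
    """Second-order partial derivatives of the error function d(x, x_hat)
    with respect to x, evaluated at x = x_hat, cf. eq. (54), approximated
    by central finite differences."""
    K = len(x_hat)
    W = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            def d_shift(di, dj):
                z = np.array(x_hat, dtype=float)
                z[i] += di
                z[j] += dj
                return d(z, x_hat)
            # Central second difference for the (i, j) entry
            W[i, j] = (d_shift(eps, eps) - d_shift(eps, -eps)
                       - d_shift(-eps, eps) + d_shift(-eps, -eps)) / (4.0 * eps ** 2)
    return W
```

In practice one would derive W(x̂) analytically for the chosen distortion measure, as in [44]; the finite-difference version is only for verification.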
Quantizer design

The optimal quantizer can be formulated using the constrained optimization framework, by minimizing the expected distortion D while constraining the average rate to be below or equal to the available rate budget. The constraint is defined using (48) for a fixed-rate coder, or (50) for a variable-rate coder. The resulting quantizer designs are referred to as a resolution-constrained quantizer and an entropy-constrained quantizer, respectively. In general, an entropy-constrained quantizer has better rate-distortion performance than a resolution-constrained quantizer, due to the flexibility of assigning bit sequences of different lengths to different code vectors according to the probability of their appearance. In Paper C, we propose a flexible design of an entropy-constrained vector quantizer.

Traditional quantizer design is typically based on an iterative training algorithm [23, 81, 82, 91]. By iteratively optimizing the encoder (decision regions or quantization cells) and the decoder (reconstruction point of each cell), and ensuring that the performance is non-decreasing at each iteration, a locally optimal quantizer is constructed. Using a proper initialization scheme, and given sufficient training data, good rate-distortion performance can be achieved by a training based quantizer. An issue with the traditional quantizer design is the computational complexity and memory requirement associated with the codebook search and storage. As the number of code vectors increases exponentially with respect to the rate, and in general a higher rate (per vector) is needed for quantizing higher dimensional vectors, the approach is often limited to low-rate, scalar, or low-dimensional vector quantizers. Another disadvantage of the traditional training based quantizer design is that the resulting quantizer has a fixed codebook.
Its rate-distortion performance is determined off-line, and adaptation of the codebook according to the currently available bandwidth is infeasible without losing optimality. To allow a certain flexibility, one can optimize multiple codebooks, one for each rate, and select the best codebook. A disadvantage of this approach is, however, the additional memory requirement, e.g., for storage of the multiple codebooks. It is desirable to have a flexible quantizer design that allows for fine-grained rate adaptation, such that the rate can be selected from a larger range with a finer resolution, without increasing the memory requirement. To achieve that, the quantizer design should avoid the use of an explicit codebook (which requires storage), and the code vectors should instead be constructed on-line for a given rate. High-rate theory provides a theoretical basis for designing such a quantizer.

3.2 High-rate theory

High-rate theory is a mathematical framework for analyzing the behavior and predicting the performance of a quantizer of any dimensionality. High-rate theory originated from early work on scalar quantization [15, 47, 96, 98], and was later extended to vector quantization [45, 144]. A comprehensive tutorial on quantization and high-rate theory is given in [54]. Textbooks treating the topic include [16, 46, 56, 71]. High-rate theory applies when the rate is high with respect to the smoothness of the source PDF f(x). Under this high-rate assumption, f(x) may be assumed to be constant within each quantization cell, i.e., f(x) ≈ f(x̂) for x ∈ cell(x̂). The probability mass function (PMF) of x̂ can then be approximated as

p(x̂) ≈ f(x̂) vol(x̂),   (55)

where vol(x̂) denotes the volume of cell(x̂). Instead of using a codebook, a high-rate quantizer can be formulated through a quantization point density function, specifying the distribution of the code vectors, e.g., the number of code vectors per unit volume.
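The quality of the cell-probability approximation (55) can be checked numerically for a scalar unit Gaussian: the approximation is excellent for small cells and degrades as the cell grows (function names are illustrative).

```python
import math

def gauss_pdf(x):
    """Unit Gaussian density f(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cell_prob_exact(center, delta):
    """Exact probability mass of the cell [center - delta/2, center + delta/2]."""
    cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return cdf(center + 0.5 * delta) - cdf(center - 0.5 * delta)

def cell_prob_highrate(center, delta):
    """High-rate approximation of eq. (55): p(x_hat) ≈ f(x_hat) · vol(cell)."""
    return gauss_pdf(center) * delta
```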
A quantization point density function may be defined as a continuous function that approximates the inverse volumes of the quantization cells [71],

gc(x) ≈ 1 / vol(x̂),  if x ∈ cell(x̂).   (56)

Under the high-rate assumption, optimal solutions for gc(x) can be derived for different rate constraints. Practical quantizers designed according to the obtained gc(x) approach optimality with increasing rate. Nevertheless, quantizers designed based on high-rate theory have shown competitive performance also at lower rates. In Paper C, the actual performance of the proposed quantizer is very close to the performance predicted by high-rate theory for rates higher than 3–4 bits per dimension.

High-rate distortion

The average distortion of a quantizer can be reformulated using the high-rate assumption. For the MSE distortion measure, (51) can be written as

DMSE = (1/K) ∫ ||x − x̂||² f(x) dx
     ≈ (1/K) Σ_{x̂} f(x̂) ∫_{cell(x̂)} ||x − x̂||² dx
     = (1/K) Σ_{x̂} (p(x̂)/vol(x̂)) ∫_{cell(x̂)} ||x − x̂||² dx.   (57)

Let

C(x̂) = (1/K) vol(x̂)^{−(K+2)/K} ∫_{cell(x̂)} ||x − x̂||² dx   (58)

denote the coefficient of quantization of x̂. C(x̂) is the distortion normalized with respect to volume, and shows the effect of the geometry of the cell. The distortion can then be written as

DMSE = Σ_{x̂} p(x̂) vol(x̂)^{2/K} C(x̂).   (59)

Gersho conjectured that the optimal quantizer has the same cell geometry for all quantization cells, and hence a constant coefficient of quantization Copt. [45]. Using the optimal cell geometry, C(x̂) ≈ Copt., the distortion becomes

DMSE ≈ Copt. Σ_{x̂} p(x̂) vol(x̂)^{2/K} ≈ Copt. ∫ f(x) gc(x)^{−2/K} dx.   (60)

Resolution-constrained vector quantizer

The optimal resolution-constrained vector quantizer (RCVQ) has the quantization point density function gc(·) that minimizes the average distortion D, under the rate constraint that the number of code vectors equals N. The constraint may be formulated through integration of gc(x),

∫ gc(x) dx = N.
(61)

The optimal gc(·) minimizes the extended criterion

ηrcvq = Copt. ∫ f(x) gc(x)^{−2/K} dx + λ (∫ gc(x) dx − N),   (62)

where the first term is the high-rate distortion (60) and λ is the Lagrange multiplier for the rate constraint. Solving the Euler-Lagrange equation, and combining with (61), the optimal gc(x) for resolution-constrained vector quantization with respect to the MSE distortion measure is given by

gc(x) = N f(x)^{K/(K+2)} / ∫ f(x)^{K/(K+2)} dx.   (63)

Entropy-constrained vector quantizer

The optimal entropy-constrained vector quantizer (ECVQ) has a quantization point density function gc(·) that minimizes the average distortion D, under the constraint that the entropy of the code vectors equals a desired target rate Rt. The entropy of the quantized variable at high rate can be formulated as

H ≈ −Σ_{x̂} f(x̂) vol(x̂) log2(f(x̂) vol(x̂))
  ≈ −∫ f(x) log2(f(x)/gc(x)) dx
  = h + ∫ f(x) log2 gc(x) dx,   (64)

where h = −∫ f(x) log2 f(x) dx is the differential entropy of the source. Therefore, the constraint can be written as ∫ f(x) log2 gc(x) dx = R′t, where R′t = Rt − h. The optimal gc(·) minimizes the extended criterion

ηecvq = Copt. ∫ f(x) gc(x)^{−2/K} dx + λ (∫ f(x) log2 gc(x) dx − R′t),   (65)

where λ is the Lagrange multiplier for the entropy constraint. The optimal gc(x) can be solved for similarly to the RCVQ case, and the resulting gc(x) is a constant [45],

gc(x) = gc = 2^{Rt − h}.   (66)

This important result implies that, at high rate, uniformly distributed quantization points form an optimal quantizer, provided the point indices are subsequently coded using an entropy code.

3.3 High-rate quantizer design

High-rate theory provides a theoretical foundation and guideline for designing practical quantizers. Quantizer designs motivated by high-rate theory include lattice quantization [9, 10, 25, 27, 28, 41, 66, 127, 134, 145] and GMM based quantization [52, 58, 106, 112–115, 127–129, 148, Paper C].
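Result (66) can be illustrated in one dimension: a uniform quantizer with point density gc = 2^{Rt − h}, i.e., step size 2^{h − Rt}, produces quantized values whose entropy (50) is approximately the target rate Rt. A small Monte-Carlo sketch for a unit Gaussian, for which h = ½ log2(2πe) ≈ 2.05 bits (helper names are illustrative):

```python
import math
import random

def ecvq_step(target_rate, diff_entropy):
    """One-dimensional reading of eq. (66): gc = 2^(Rt - h) points per
    unit length, hence a uniform step size of 2^(h - Rt)."""
    return 2.0 ** (diff_entropy - target_rate)

def quantized_entropy(samples, step):
    """Empirical entropy (50) of uniformly quantized samples."""
    counts = {}
    for x in samples:
        idx = round(x / step)
        counts[idx] = counts.get(idx, 0) + 1
    n = float(len(samples))
    return -sum(c / n * math.log2(c / n) for c in counts.values())

random.seed(0)
h = 0.5 * math.log2(2.0 * math.pi * math.e)  # differential entropy of N(0, 1)
samples = [random.gauss(0.0, 1.0) for _ in range(200000)]
H = quantized_entropy(samples, ecvq_step(6.0, h))
```

For a target of Rt = 6 bits, the empirical entropy H lands close to 6, consistent with (64) and (66).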
Lattice quantization

A lattice Λ is an infinite set of regularly arranged vectors in the Euclidean space R^K [28]. It can be defined through a generating matrix G, assumed to be a square matrix,

Λ(G) = {G^T u : u ∈ Z^K},   (67)

where G is a matrix with linearly independent row vectors that form a set of basis vectors in R^K. The lattice points generated through G are integer linear combinations of the row vectors of G. Applied to quantization, each lattice point λ ∈ Λ is associated with a Voronoi region containing all points in R^K closest to λ according to the Euclidean distance measure. The Voronoi region corresponds to the decision region of a quantizer with the MSE distortion measure. Hence, a closest point search algorithm can be used for quantization. Figure 7 shows a two-dimensional lattice, the hexagonal lattice (defined by G = [1 0; 1/2 √3/2]), together with its Voronoi cells.

Figure 7: The 2-dimensional hexagonal lattice.

The high-rate theory for entropy-constrained vector quantization forms the theoretical basis for using uniformly distributed code vectors. Such vectors may be generated using a lattice, and the quantizer is then a lattice quantizer. The design of a coder based on an entropy-constrained lattice quantizer consists of the following steps: 1) selection of a suitable lattice for quantization, 2) design of a closest point search algorithm for the selected lattice, and 3) indexing and assigning variable-length codewords to the code vectors. The selection of a good lattice for quantization is considered in [9] for the MSE distortion. Since the MSE performance of a lattice quantizer is proportional to the coefficient of quantization associated with the Voronoi cell shape of the lattice, lattice optimization for quantization is based on an optimization with respect to the coefficient of quantization. A study and tutorial on the "best known" lattices for quantization is given in [9].
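A minimal closest-point search for the hexagonal lattice of Figure 7: round the input's coordinates in the lattice basis, then check the surrounding 3 × 3 integer neighbors. This brute-force sketch only illustrates the idea; it is not one of the fast, lattice-specific algorithms of [27].

```python
import numpy as np

# Generator of the hexagonal lattice (rows are the basis vectors)
G = np.array([[1.0, 0.0],
              [0.5, np.sqrt(3.0) / 2.0]])

def hex_nearest(x):
    """Closest hexagonal-lattice point G^T u to x (eq. (67) with K = 2)."""
    u0 = np.round(np.linalg.solve(G.T, x))  # round in the lattice basis
    best, best_d = None, np.inf
    for di in (-1, 0, 1):                   # check neighboring integer vectors
        for dj in (-1, 0, 1):
            p = G.T @ (u0 + np.array([di, dj]))
            d = np.sum((x - p) ** 2)
            if d < best_d:
                best, best_d = p, d
    return best
```

Because the returned point minimizes the Euclidean distance, it is the MSE decision of the lattice quantizer, i.e., the input lies in the Voronoi region of the returned lattice point.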
A numerical optimization method for obtaining a locally optimal lattice for quantization in an arbitrary dimension is proposed in [9]. Closest point search algorithms for lattices are considered in [10, 27]. The algorithms proposed in [27] are based on the geometrical structures of the lattices, and are therefore lattice-specific. A general search algorithm that applies to an arbitrary lattice has been proposed in [10]. Both methods are considerably faster than a full search over all code vectors in a traditional codebook, by exploiting the geometrically regular arrangement of the code vectors. As the search complexity and memory requirement of a traditional codebook increase exponentially with respect to the vector dimensionality and rate, lattice quantization is an attractive approach for high-dimensional vector quantization. The practical application of entropy-constrained lattice quantization has been limited due to the lack of a practical algorithm for indexing and assigning variable-length codewords that works at high rates and dimensions. In [25, 26, 66], lattice truncation and indexing of the lattice vectors are considered. However, designing a practical entropy code is still limited to low dimensions and rates, due to the exponentially increasing number of code vectors in the truncated lattice. In Paper C, we propose a practical indexing and entropy coding algorithm for lattice quantization of a Gaussian variable, as part of a GMM based entropy-constrained vector quantizer design.

GMM-based quantization

High-rate theory assumes that the PDF of the source is known. However, for most practical sources, the PDF is unknown and needs to be estimated from the data. As discussed in section 2.3, GMMs form a convenient tool for modeling real-world source vectors with an unknown PDF [107]. They have been successfully applied to quantization, e.g., [52, 58, 114, 127, 128, 131, Paper C].
GMMs have been applied to approximate the high-rate optimal quantization point density for resolution-constrained quantization [58, 114]. A codebook can be generated using a pseudo-random vector generator, once the optimal quantization point density has been obtained. Thus, this method does not require the storage of multiple codebooks for different rate requirements. However, the method shares the same search complexity as a traditional codebook based quantizer. The application is therefore limited to low rates and dimensions. Nevertheless, the method is useful for predicting the performance of an optimal quantizer [58, 114]. Another GMM quantizer design is based on classification and component quantizers [127–129]. Such a quantizer quantizes the source vector using only one of the mixture component quantizers. The index of the quantizing mixture component is transmitted as side information. The scheme is conceptually simple, since the problem is reduced to quantizer design for a single Gaussian component. Similar ideas based on classified quantization appear in image coding [131] and speech coding [68–70]. The methods of [127–129] are resolution-constrained vector quantizers optimized for a fixed target rate. The overall bit rate budget is distributed among the mixture components and for transmitting the component index. The component quantizers are based on the so-called companding technique. The companding technique attempts to approximate the optimal high-rate codebook of a Gaussian variable. It consists of an invertible mapping function that maps the input variable into a domain where a lattice quantizer is applied. The forward transform is referred to as the compressor, and the inverse transform as the expander. Applying the expander function to the lattice structured codebook results in a codebook that approximates the optimal codebook in the original domain.
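A scalar companding sketch makes the compressor/expander idea concrete: compress a Gaussian input through its CDF, quantize uniformly (a one-dimensional lattice) in [0, 1], then expand with the inverse CDF. This CDF-based compressor is purely illustrative; it is not the MSE-optimal compressor, and all names are hypothetical.

```python
import math

def phi(x):
    """Gaussian CDF, used here as the compressor."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    """Inverse Gaussian CDF by bisection, used as the expander."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def compand_quantize(x, n_levels):
    """Compress, quantize with a uniform lattice in [0, 1], then expand.
    The expanded codebook is non-uniform in the original domain."""
    u = phi(x)
    idx = min(n_levels - 1, int(u * n_levels))  # uniform cell index
    u_hat = (idx + 0.5) / n_levels              # cell midpoint in [0, 1]
    return phi_inv(u_hat)
```

The expanded reconstruction levels cluster near the mean, where the Gaussian density is high, which is exactly the effect the companding technique is after.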
Unfortunately, the companding technique is suboptimal for vector dimensions higher than two [20, 21, 45, 95, 119]. The sub-optimality is due to the fact that the expander function scales the quantization cells differently in different dimensions and at different locations in space. Hence, the quantization cells lose their optimal cell shape in the original domain. The great advantage of the schemes proposed in [127–129] is the low computational complexity. Also, codebook storage is not required, since a lattice quantizer is utilized in the transformed domain. Applying classified quantization for entropy-constrained quantization has been discussed in [52, 71]. Motivated by high-rate theory, the component quantizers of an entropy-constrained quantizer are based on lattice structured codebooks. The focus of [52, 71] is on the high-rate analysis of such a quantizer. In Paper C, we propose a practical GMM based entropy-constrained quantizer using the classified quantization framework. In particular, a practical solution to indexing and entropy coding for the lattice quantizers of the Gaussian components is proposed. We show that, under certain conditions, the scheme approaches the theoretically optimal performance with increasing rate.

4 Summary of contributions

The focus of this thesis is on statistical model based techniques for speech enhancement and coding. The main contributions of the thesis can be summarized as

• improved speech and noise modeling and estimation techniques applied to speech enhancement, and
• improvements towards flexible source coding.

The thesis consists of three research articles. Short summaries of the articles are presented below.

Paper A: HMM-based Gain-Modeling for Enhancement of Speech in Noise

In Paper A, we propose a hidden Markov model (HMM) based speech enhancement method using stochastic-gain HMMs (SG-HMMs) for both speech and noise.
We demonstrate that accurate modeling and estimation of the speech and noise gains facilitate good performance in speech enhancement. Through the introduction of stochastic gain variables, energy variation in both speech and noise is explicitly modeled in a unified framework. The speech gain models the energy variations of the speech phones, typically due to differences in pronunciation and/or different vocalizations of individual speakers. The noise gain helps to improve the tracking of the time-varying energy of non-stationary noise. The expectation-maximization (EM) algorithm is used to perform off-line estimation of the time-invariant model parameters. The time-varying model parameters are estimated on-line using the recursive EM algorithm. The proposed gain modeling techniques are applied in a novel Bayesian speech estimator, and the performance of the proposed enhancement method is evaluated through objective and subjective tests. The experimental results confirm the advantage of explicit gain modeling, particularly for non-stationary noise sources.

Paper B: On-line Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement

In Paper B, we extend the work of Paper A and introduce an on-line noise estimation algorithm. A stochastic-gain HMM is used to model the statistics of non-stationary noise with time-varying energy. The noise model is adaptive, and the model parameters are estimated on-line from the noisy observations. The parameter estimation is derived for the maximum likelihood criterion, and the algorithm is based on the recursive expectation-maximization (EM) framework. The proposed method facilitates continuous adaptation to changes of both noise spectral shapes and noise energy levels, e.g., due to movement of the noise source. Using the estimated noise model, we also develop an estimator of the noise power spectral density (PSD) based on recursive averaging of estimated noise sample spectra.
We demonstrate that the proposed scheme achieves more accurate estimates of the noise model and the noise PSD. When integrated into a speech enhancement system, the proposed scheme facilitates a lower level of residual noise in the enhanced speech.

Paper C: Entropy-Constrained Vector Quantization Using Gaussian Mixture Models

In Paper C, a flexible entropy-constrained vector quantization scheme based on Gaussian mixture models (GMMs), lattice quantization, and arithmetic coding is presented. A source vector is assumed to have a probability density function described by a GMM. The source vector is first classified to one of the mixture components, and the Karhunen-Loève transform of the selected mixture component is applied to the vector, followed by quantization using a lattice structured codebook. Finally, the scalar elements of the quantized vector are entropy coded sequentially using a specially designed arithmetic code. The proposed scheme has a computational complexity that is independent of rate, and quadratic with respect to vector dimension. Hence, the scheme facilitates quantization of high dimensional source vectors. The flexible design allows for changing the average rate on the fly. The theoretical performance of the scheme is analyzed under a high-rate assumption. We show that, at high rate, the scheme approaches the theoretically optimal performance if the mixture components are located far apart. The performance loss when the mixture components are located close to each other can be quantified for a given GMM. The practical performance of the scheme was evaluated through simulations on both synthetic and speech-derived source vectors, and competitive performance has been demonstrated.

References

[1] Enhanced Variable Rate Codec, speech service option 3 for wideband spread spectrum digital systems. TIA/EIA/IS-127, Jul. 1996.
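The encoder steps summarized for Paper C (classify, KLT, lattice quantize) can be sketched as follows. This is a simplified illustration, not the thesis's implementation: the cubic lattice Z^d (plain rounding) stands in for the lattice codebook, the two-component 2-D GMM and the step size are made-up parameters, and the arithmetic coding of the resulting integer indices is omitted.

```python
import numpy as np

def encode(x, means, covs, weights, step=0.1):
    """Sketch of the Paper C encoder: classify -> KLT -> lattice quantize."""
    # 1. Classify: pick the mixture component with the highest weighted
    #    log-likelihood for this vector.
    scores = []
    for m, c, w in zip(means, covs, weights):
        diff = x - m
        _, logdet = np.linalg.slogdet(c)
        scores.append(np.log(w) - 0.5 * (logdet + diff @ np.linalg.solve(c, diff)))
    k = int(np.argmax(scores))
    # 2. Karhunen-Loeve transform of the selected component: rotate into the
    #    eigenbasis of its covariance so the dimensions decorrelate.
    _, eigvecs = np.linalg.eigh(covs[k])
    y = eigvecs.T @ (x - means[k])
    # 3. Lattice quantization: round to the nearest point of step * Z^d
    #    (a stand-in for the lattice-structured codebook).
    idx = np.round(y / step).astype(int)
    return k, idx, eigvecs

def decode(k, idx, eigvecs, means, step=0.1):
    """Invert the KLT and restore the component mean."""
    return means[k] + eigvecs @ (idx * step)

# Toy two-component GMM in 2-D (illustrative parameters)
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
weights = [0.5, 0.5]

x = np.array([4.8, 5.3])
k, idx, V = encode(x, means, covs, weights)
x_hat = decode(k, idx, V, means)
```

In the full scheme the integer indices `idx` are entropy coded with the arithmetic code, which is what makes the average rate adjustable on-the-fly; since the rotation is orthogonal, the reconstruction error here is bounded by the lattice cell size regardless of the component chosen.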
