University of Crete
Department of Computer Science
FO.R.T.H. Institute of Computer Science

On the Inverse Filtering of Speech
(M.Sc. Thesis)

George P. Kafentzis
Heraklion, October 2010

Submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Science, October 22, 2010.

© 2010 University of Crete & ICS-FO.R.T.H. All rights reserved.

Author: George P. Kafentzis, Department of Computer Science

Committee:
Supervisor: Yannis Stylianou, Associate Professor
Member: Athanasios Mouchtaris, Assistant Professor
Member: Panagiotis Tsakalides, Professor

Accepted by: Panos Trahanias, Professor, Chairman of the Graduate Studies Committee
Heraklion, October 2010

ABSTRACT

ON THE INVERSE FILTERING OF SPEECH

George P. Kafentzis
Computer Science Department
Master of Science

In all proposed source-filter models of speech production, Inverse Filtering (IF) is a well-known technique for obtaining the glottal flow waveform, which acts as the source of the vocal tract system. The estimation of the glottal flow is of high interest in a variety of speech areas, such as voice quality assessment, speech coding and synthesis, as well as speech modification. A major obstacle in comparing and/or suggesting improvements to the current state-of-the-art approaches is simply the lack of real data concerning the glottal flow. In other words, the results obtained from various inverse filtering algorithms cannot be directly evaluated because the actual glottal flow waveform is simply unknown.
In this direction, the use of synthetic speech created from artificial glottal flow waveforms is widely suggested in the literature. This kind of evaluation, however, is not truly objective, because speech synthesis and IF are typically based on similar models of the human voice production apparatus, in our case the traditional source-filter model. This thesis presents three well-known IF methods based on Linear Prediction Analysis (LPA), along with a new method whose performance is compared to the others. The first method is based on conventional autocorrelation LPA, and the second on conventional closed phase covariance LPA. The closed phase is identified using the method suggested by Plumpe and Quatieri, based on statistics of the first formant frequency during a pitch period. The third method is based on the work of Alku et al., who proposed an IF method based on a Mathematically Constrained Closed Phase Covariance LPA, in which mathematical constraints are imposed on the conventional covariance analysis. This results in more realistic root locations of the model on the z-plane. Finally, Magi et al. suggested a new method for extracting the vocal tract filter, called Stabilized Weighted LP Analysis (SWLP), in which a short-time energy window controls the performance of the LP model. This method is suggested for IF due to its interesting property of emphasizing speech samples which typically occur during the closed phase region of the speech signal. This is expected to yield a more robust, in the acoustic sense, vocal tract filter estimate than conventional autocorrelation LP. The three IF approaches, along with the suggested new one, are applied to a database of physically modeled speech signals. In this case, both the glottal flow and the speech signal are available, and direct evaluation of IF methods can be performed.
Robust time and frequency parametrization measures are applied to both the actual glottal flow and the estimated ones, in order to evaluate the performance of the methods. These measures include the Normalized Amplitude Quotient (NAQ), the difference between the first two harmonics (H1-H2) of the glottal spectrum, and the Harmonic Richness Factor (HRF), along with the Signal to Reconstruction Error Ratio (SRER). Experiments were conducted on physically modeled sustained vowels (/aa/, /ae/, /eh/, /ih/) over a wide range of fundamental frequencies (105 to 255 Hz) for both male and female speech. Glottal flow estimates were produced using short-time pitch synchronous analysis and synthesis for the covariance based methods, whereas for the autocorrelation methods, a long analysis window and a short synthesis window were used. The results and measures are compared and discussed, showing the prevalence of the covariance methods, while the suggested method typically produces better results than conventional autocorrelation LP, according to our metrics.

ACKNOWLEDGMENTS

First, I would like to thank my supervisor, Professor Yannis Stylianou, for his continuous support in the M.Sc. program. I am really grateful for his advice, encouragement, guidance, motivation, and above all, the trust and patience that he showed during the time we worked together. He gave me the chance to meet and work with exceptional people and scientists. He also taught me how to view things from different perspectives, how to do research, and how to be persistent when things are not going well. Finally, his help and care in matters beyond academia were truly moving, and for this I thank him twice.
I also thank Professor Paavo Alku, whom I had the opportunity to meet and work with in the Acoustics and Signal Processing Laboratory of the Aalto University of Technology, Helsinki, Finland, for his hospitality, kindness, support, and guidance, and for providing me with the physically modeled speech database used in this thesis. In addition, I would like to thank Dr. Olivier Rosec, whom I had the opportunity to work with for four months at France Telecom R&D, Lannion, France. His continuous support during my stay there was very important to me. I also thank my colleagues, past and present, with whom I shared my days at the University of Crete, and especially in the Multimedia Informatics Laboratory of the Computer Science Department. They made it a wonderful workplace and home for the past years. I would like to specially mention Christos and George Tzagkarakis, Pavlos Mattheakis, Maria Koutsogiannaki, Mairi Astrinaki, George Tzedakis, George Grekas, Christina Lionoudaki, Yannis Agiomyrgiannakis, Miltos Vasilakis, Yannis Pantazis, Andre Holzapfel, Maria Markaki, Nikos Pappas, Ilias Tsompanidis, Evgenios Kornaropoulos, and Dimitris Milioris, for helping me in so many ways and for being real friends. Moreover, I would like to say "thank you" to many others, including George Charitos, Kostas Perakakis, George Charitonidis, Pepi Katsigianni, Petros Androvitsaneas, Kostas Sipsas, Nasos Lagios, Andreas Vourdoulas, Panayotis Tripolitis, Eleni Choremioti, and Kostas Katsaros, for their care and friendship, even if most of them are away from Crete. I am also grateful to the Institute of Computer Science (ICS) of the Foundation for Research and Technology - Hellas (FO.R.T.H.), for awarding me a fellowship during my graduate studies. Many special thanks go to Glykeria Stergioula, for devoting her time to reading and suggesting corrections to the grammar and syntax of this dissertation.
Last, but not least, I thank my family: my parents, Panayotis Kafentzis and Diamanto Tsirolia, for everything; I owe them the man I am today. My little sister Maria-Chrysanthi, for her patience, love, care, and support during all these years we have been together in Crete. My little brother Stelios, for his love and care, and for the sound of his voice almost every night before going to bed.

Contents

Table of Contents
List of Figures

1 Introduction
1.1 Background
1.1.1 Source-Filter Production Model
1.1.2 Linear Prediction Analysis
1.1.3 Inverse Filtering
1.1.4 The Glottal Flow
1.1.5 Closed Phase Inverse Filtering
1.1.6 Parametrization of the Source
1.2 Evaluation of Inverse Filtering Methods
1.3 Physical Modeling of Speech Signals
1.4 Thesis Contribution
1.5 Thesis Organization

2 Vocal Tract Filter Estimation with Linear Prediction
2.1 Autocorrelation based methods
2.1.1 Classic Linear Prediction - Autocorrelation Method
2.1.2 Stabilized Weighted Linear Prediction
2.2 Covariance based methods
2.2.1 Classic Linear Prediction - Covariance Method
2.2.2 Constrained Closed Phase Covariance Linear Prediction
2.3 Inverse Filtering Procedure
2.3.1 VUS detector
2.3.2 Pitch Estimation
2.3.3 Excitation Instants Detection
2.3.4 Covariance-based LP Analysis
2.3.5 Autocorrelation-based LP Analysis
2.4 Summary

3 Results
3.1 Glottal Flow Estimates Examples
3.2 Parametrization Results
3.2.1 Signal to Reconstruction Error Ratio - SRER
3.2.2 Normalized Amplitude Quotient - NAQ
3.2.3 H1-H2
3.2.4 Harmonic Richness Factor - HRF
3.3 Summary

4 Conclusions
4.1 Summary of Findings
4.2 Suggestions for Future Work

Bibliography

List of Figures

1.1 Simple Source-Filter Model
1.2 Phases of the glottal flow and its derivative
1.3 Liljencrants-Fant Model
1.4 Schematic diagram of the lumped-element vocal fold model
1.5 Area function representation of the trachea and vocal tract

2.1 STE and Glottal Flow closed phase relation
2.2 Covariance Frame Misalignment and Glottal Flow Distortion
2.3 Distortion caused by non-minimum phase filter
2.4 Examples of all-pole spectra
2.5 Inverse Filtering Procedure

3.1 Glottal flow estimates for vowel /aa/, fundamental frequency of 105 Hz
3.2 Glottal flow estimates for vowel /aa/, fundamental frequency of 145 Hz
3.3 Glottal flow estimates for vowel /ae/, fundamental frequency of 145 Hz
3.4 Glottal flow estimates for vowel /ae/, fundamental frequency of 210 Hz
3.5 Glottal flow estimates for vowel /ih/, fundamental frequency of 115 Hz
3.6 Glottal flow estimates for vowel /ih/, fundamental frequency of 230 Hz
3.7 Glottal flow estimates for vowel /eh/, fundamental frequency of 105 Hz
3.8 Glottal flow estimates for vowel /eh/, fundamental frequency of 255 Hz

List of Tables

1.1 LF-model parameters
1.2 Input parameters for the vocal fold model

3.1 SRER mean and std for each IF method per vowel
3.2 SRER mean and std for each IF method per frequency
3.3 NAQ_rat mean and std for each IF method per vowel
3.4 NAQ_rat mean and std for each IF method per frequency
3.5 H1-H2_dif mean and std for each IF method per vowel
3.6 H1-H2_dif mean and std for each IF method per frequency
3.7 HRF_dif mean and std for each IF method per vowel
3.8 HRF_dif mean and std for each IF method per frequency

Chapter 1

Introduction

The source of the human speech production system, called the glottal flow, plays an important role in the characteristics of speech and in several scientific fields. The glottal flow can be obtained from the speech signal using a technique called inverse filtering (IF). Inverse filtering is extensively used in basic research on speech production and in its applications to speech analysis, synthesis, and modification. Moreover, increased interest has recently arisen in areas of speech science such as environmental voice care, voice pathology detection, and analysis of the emotional content of speech. Most of the techniques suggested in the literature are based on Linear Prediction (LP) Analysis [43]. The goal of this thesis is to evaluate the performance of a number of IF techniques, using robust time and frequency domain parametrization measures on a database of physically modeled speech signals. This chapter starts with a mathematical framework of the source-filter model of speech production. We then provide motivation for the importance of assessing the performance of IF techniques, followed by a description of previous efforts in that direction. Finally, we discuss the contribution of this thesis and provide an outline for the remainder of the thesis.

1.1 Background

A mathematical framework of the classic source-filter model of speech production is presented here.
1.1.1 Source-Filter Production Model

Speech production can be considered as a linear filtering operation, which is time invariant over short time periods. The overall system of speech production can be divided into three parts: the vocal tract, with impulse response v[n]; the excitation signal e[n], which is the input to the vocal tract; and the lip radiation r[n]:

s[n] = e[n] * v[n] * r[n]

where s[n] is the speech pressure output, i.e. the speech signal, and * denotes convolution. For voiced speech, the excitation is a periodic series of pulses, whereas for unvoiced speech, the excitation has the properties of random noise. Figure 1.1 depicts this simple model: a series of impulses or random noise forms the source signal e[n], which drives the vocal tract v[n], followed by the lip radiation r[n], producing the speech s[n] (Figure 1.1: Simple Source-Filter Model).

In the Z-domain, the above equation can be written as:

S(z) = E(z)V(z)R(z)

It can be shown that the vocal tract filter V(z) can be written as an all-pole filter:

V(z) = \frac{1}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}} = \frac{1}{\prod_{k=1}^{p} (1 - c_k z^{-1})}

where p is the number of poles of the filter.

1.1.2 Linear Prediction Analysis

A primary tool for inverse filtering speech waveforms is Linear Prediction (LP). LP is a very powerful modeling technique which may be applied to time series data. When applied to the speech signal, LP is used to produce an all-pole model of the system filter, V(z), which turns out to be a model of the vocal tract and its resonances, or formants. As previously mentioned, the input to such a model is either a series of pulses or white noise, for voiced and unvoiced speech respectively. So, using LP we can estimate the vocal tract filter v[n] from the speech signal, and then cancel its effect, thus recovering the source waveform. In order to find the vocal tract response, we set up a least squares minimization problem where the error

e[n] = s[n] - \sum_{k=1}^{p} a_k s[n-k]

is to be minimized, where the a_k are the estimates of the \alpha_k.
The minimization is performed over a region R. The total error is given by

E = \sum_{n \in R} e^2[n]

The solutions of this minimization problem are called linear prediction coefficients, from the fact that a speech sample s[n] can be written as a linear combination of p previous samples, that is, it can be predicted from p previous samples. Depending on the selection of the region R, we are led to two different techniques of linear prediction. These two techniques, as well as improvements on them, will be presented in a later chapter. As is well known, using the method of least squares, this model has been successfully applied to a wide range of signals: deterministic, random, stationary, and non-stationary, including speech, where the method has been applied assuming local stationarity. LP has a number of advantages, including:

1. Mathematical tractability of the error measure
2. Stability of the model
3. Favorable computational characteristics of the resulting formulations
4. Spectral estimation properties
5. Wide applicability to a range of signal types

1.1.3 Inverse Filtering

The idea behind inverse filtering is to form a computational model for the vocal tract signal and then to cancel its effect from the speech waveform by filtering the speech signal through the inverse of the model. When we obtain an estimate, V̂(z), of the vocal tract filter V(z), we can cancel its effect by removing the vocal tract response from the speech signal s[n]. After that, we have an estimate of the driving function, ê[n], which is the combined signal of the glottal flow and the lip radiation. In the frequency domain,

\hat{E}(z) = \frac{S(z)}{\hat{V}(z)}

The above equation describes a process called inverse filtering. It is apparent that inverse filtering depends greatly on the estimate of the vocal tract filter, V̂(z). The problem of robust and accurate estimation of the vocal tract is often called spectral estimation in the literature.
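As an illustrative sketch (not the implementation used in this thesis), the autocorrelation-method LP analysis and the inverse filtering step above can be put together as follows; the single-resonance toy "vocal tract", the order p = 2, and all numeric values are hypothetical:

```python
import numpy as np
from scipy.signal import lfilter

def lp_coefficients(s, p):
    """Autocorrelation-method LP: find a_k minimizing sum_n (s[n] - sum_k a_k s[n-k])^2."""
    w = s * np.hamming(len(s))                        # taper to reduce edge effects
    r = np.correlate(w, w, mode='full')[len(w) - 1:]  # autocorrelation, lags 0..N-1
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])             # normal (Yule-Walker) equations

def inverse_filter(s, a):
    """Cancel the all-pole model: pass s through A(z) = 1 - sum_k a_k z^{-k}."""
    return lfilter(np.concatenate(([1.0], -a)), [1.0], s)

# Toy example: one resonance near 500 Hz driven by a 100 Hz impulse train
fs = 8000
e = np.zeros(400)
e[::80] = 1.0                                         # impulse-train excitation
b, a_true = [1.0], [1.0, -2 * 0.9 * np.cos(2 * np.pi * 500 / fs), 0.81]
s = lfilter(b, a_true, e)                             # synthetic "speech"
a_hat = lp_coefficients(s, p=2)
e_hat = inverse_filter(s, a_hat)                      # residual: excitation estimate
```

Since the inverse filter whitens the resonance, the residual e_hat has far less energy than s and approximates the original impulse train, which is the essence of the IF procedure described above.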
A common procedure before linear prediction analysis and inverse filtering is pre-emphasizing the speech signal. Pre-emphasis is the process of filtering the speech signal with a single-zero high-pass filter:

s_{emph}[n] = s[n] - \beta_{emph} s[n-1]

where β_emph is the pre-emphasis coefficient, with values typically between 0.9 and 0.99. The reason for pre-emphasis is that the pre-emphasized spectrum is a closer representation of the vocal tract response, thus allowing linear prediction to better match the vocal tract response rather than the spectrum of the combined excitation and vocal tract. Another common procedure before vocal tract LP analysis is to apply an LP analysis of order one, in order to acquire a preliminary estimate of the combined effects of the glottal flow and the lip radiation on the speech spectrum. The estimated effects are then cancelled from the speech by inverse filtering with the corresponding filter.

1.1.4 The Glottal Flow

The goal of inverse filtering is the estimation of the voice source, i.e. the glottal flow. The glottal flow is the output of the glottis during voicing and thus it is a periodic signal. A period of the glottal flow can be divided into two main parts, corresponding to the state of the glottis:

1. The glottal open phase, where the glottis starts to open, reaches its full openness, and eventually closes again. The open phase can be further divided into the opening phase, where the glottal flow increases from baseline at time 0 to its maximum amplitude Av (also called the amplitude of voicing) at time Tp, and the closing phase, where the glottal flow decreases from Av to a point at time Te where the derivative reaches its negative extremum Ee. Te is the so-called glottal closing instant (GCI) and Ee is called the maximum excitation.

2. The glottal closed phase, where the glottis is closed and no airflow pressure enters the vocal tract.
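Given a sampled glottal flow estimate, the landmarks defined above can be located numerically: Tp as the instant of maximum flow, and the GCI Te as the negative extremum Ee of the flow derivative. A minimal sketch on a hypothetical pulse shape (the shape and all durations below are invented for illustration, not a physiological model):

```python
import numpy as np

# Hypothetical sampled glottal pulse: slow rise over the opening phase,
# a faster fall over the closing phase, then a closed phase at baseline.
n_open, n_close, n_closed = 60, 20, 40
opening = 0.5 * (1 - np.cos(np.pi * np.arange(n_open) / n_open))   # 0 -> Av
closing = np.cos(0.5 * np.pi * np.arange(n_close) / n_close)       # Av -> 0, steeper
g = np.concatenate([opening, closing, np.zeros(n_closed)])

dg = np.diff(g)             # discrete approximation of the glottal flow derivative
Tp = int(np.argmax(g))      # instant of maximum flow Av (end of the opening phase)
Te = int(np.argmin(dg))     # GCI: negative extremum of the derivative
Ee = float(dg[Te])          # maximum excitation (a negative value)
```

With these sample durations the flow peaks at index 60 and the derivative minimum falls near the end of the closing phase, matching the phase boundaries described above.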
In normal voicing, after time Te, the glottal flow derivative is continuous and exponentially returns to 0 at time Tc. This phase is called the return phase, and its exponential time constant is denoted Ta. Figure 1.2 illustrates this analysis on both the glottal flow and the glottal flow derivative waveform (Figure 1.2: Phases of the glottal flow and its derivative).

1.1.5 Closed Phase Inverse Filtering

It is advantageous to restrict the linear prediction analysis region to the time interval where the glottis is closed. The reason is that during the closed phase there is no source/vocal tract interaction [18], so an accurate vocal tract estimate can be calculated. This estimate can then be used to inverse filter both the open and the closed phase. The identification of the glottal closed phase interval from the speech signal is a difficult task. In the literature, there are several approaches to accurately determining the closed phase (and consequently, the glottal flow) in a non-invasive manner. Wong et al. [71] proposed the use of the maximum determinant of the covariance matrix in order to find the closed phase interval. Ananthapadmanabha and Yegnanarayana [1] studied the linear prediction residual in order to find the closed phase. In [50], the authors proposed to detect discontinuities in frequency by confining the analysis around a single frequency. In this latter work, GCIs correspond to the positive zero-crossings of a filtered signal obtained by successive integrations of the speech waveform, followed by a mean removal operation. Childers et al. have discussed two systems [16] [15] [37]. The first is a two-channel analysis approach, using the electroglottographic (EGG) signal along with the speech signal.
The second system uses weighting on speech samples based on the error in previous analysis windows. In [28], [17], and [71], it is suggested that an operator-driven technique is required to estimate the closed phase interval. McKenna [46] suggested the use of Kalman filtering for closed phase identification. Plumpe et al. [53] discussed the importance of an automatic system for finding glottal opening and closure, and suggested the use of a sliding covariance analysis window for calculating the formant trajectories. The closed phase interval was then identified based on first order statistics of the motion of the first formant. This technique is used in this thesis whenever a closed phase covariance analysis is discussed, as will be seen in later chapters.

1.1.6 Parametrization of the Source

Parametrization of the voice source has been the target of intensive research during the past few decades. The goal of parametrization is the expression of the most important features of the glottal flow (or its derivative) using few numerical values. Over the years, a large number of possible parametric representations of glottal flows given by inverse filtering have been proposed. Although quantitative signal processing measures such as the Signal to Reconstruction Error Ratio (SRER) can be applied to the glottal flow estimates, it is also desirable to apply measures that indicate the quality of IF based on the source-filter theory. Thus, we introduce some of these metrics here. In the time domain, there is a considerable number of parametrization measures for the glottal flow (or its derivative) in the literature. These quantify the glottal flow using certain quotients between the closed phase, the opening phase, and the closing phase of the glottal volume velocity waveform [31].
Time-based measures have also been computed using the first derivative of the glottal flow, for example the time difference between the beginning of the closing phase and the instant of the maximal negative peak [64]. The three most common time-based parameters are: (1) the open quotient (OQ), which is the ratio between the open phase of the glottal pulse and the length of the fundamental period; (2) the speed quotient (SQ), which is the ratio between the glottal opening and closing phases; and (3) the closing quotient (CQ), which is the ratio between the glottal closing phase and the length of the fundamental period. However, the extraction of these time-based parameters is often problematic due to the presence of noise or formant ripple in the estimated waveforms. In [7], a more robust time-domain parameter was introduced, called the Normalized Amplitude Quotient (NAQ), which is defined as the ratio of the maximum value of the glottal flow, f_ac, to the product of the minimum value of the glottal flow derivative, d_peak, and the length of the fundamental period, T. The superiority of NAQ over its conventional counterpart, CQ, was demonstrated in both clean and noisy speech conditions. In addition, the calculation of NAQ is straightforward and can be performed automatically. The NAQ is therefore selected as the time-domain parameter for IF evaluation in this thesis. In addition, frequency-domain methods have been developed to quantify the voice source. These are typically based on measuring the decay of the voice source spectrum, either from the spectral harmonics [33] [15] or from the pitch-synchronously computed spectrum [10]. This is justified by the fact that the harmonics in the speech signal below the first formant are often considered important for the perception of vocal quality [32]. It has also been found that glottal spectra show distinctive amplitude relationships between the fundamental and higher harmonics [15].
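The NAQ computation described above can be sketched as follows. This is only an illustration of the f_ac/(d_peak · T) definition: here f_ac is taken as the peak-to-peak flow amplitude (a common convention for a zero-baseline flow), and the sinusoidal "flow" is merely a sanity check, not a glottal pulse:

```python
import numpy as np

def naq(flow, fs, f0):
    """Normalized Amplitude Quotient: NAQ = f_ac / (d_peak * T), with f_ac the
    ac amplitude of the flow, d_peak the magnitude of the negative peak of the
    flow derivative, and T the fundamental period."""
    d = np.diff(flow) * fs          # flow derivative (per second)
    f_ac = flow.max() - flow.min()  # ac amplitude of the flow
    d_peak = -d.min()               # magnitude of the derivative's negative peak
    T = 1.0 / f0                    # fundamental period in seconds
    return f_ac / (d_peak * T)

# Sanity check on a pure sinusoid g = sin(2*pi*f0*t):
# f_ac = 2, d_peak = 2*pi*f0, T = 1/f0, so NAQ = 2/(2*pi) = 1/pi ~ 0.318
fs, f0 = 16000, 100
t = np.arange(int(fs / f0)) / fs
g = np.sin(2 * np.pi * f0 * t)
print(round(naq(g, fs, f0), 3))     # -> 0.318
```

Because NAQ is a ratio of an amplitude to an amplitude-times-time product, it is dimensionless and insensitive to the overall scaling of the flow estimate, which is one reason for its robustness.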
A well-known technique for frequency domain parametrization is the Parabolic Spectral Parameter (PSP) [10], which is based on fitting a parabolic function to a pitch-synchronously computed spectrum of the estimated voice source. The PSP algorithm gives a single numerical value that describes how the spectral decay of an obtained glottal flow behaves with respect to theoretical bounds corresponding to maximal and minimal spectral tilt. Another very common frequency domain parametrization measure is H1-H2, which is defined as the difference in dB between the amplitudes of the fundamental and the second harmonic of the source spectrum [68]. Finally, another parameter, the Harmonic Richness Factor (HRF), is defined from the spectrum of the glottal flow as the difference in dB between the sum of the harmonic amplitudes above the fundamental frequency and the amplitude of the fundamental [15]. These two frequency domain measures are selected for IF evaluation in this thesis. There are two reasons for selecting these two parameters. First, both of them can be computed automatically, with low complexity, and without any user adjustments. Second, both of them are known to reflect the spectral decay of the glottal excitation. This means that if, for example, the glottal flow estimates exhibit the so-called "jags" [71], which are sharp negative peaks of the glottal flow near glottal closure, then this would be reflected in the H1-H2 and HRF values. Finally, one category of voice source parametrization methods is represented by techniques that fit certain predefined mathematical functions to the obtained glottal waveform. In noisy conditions, IF is known to perform poorly, so the glottal flow estimates (and their derivatives) can be severely distorted.
This distortion can be alleviated if the mathematical function is first fitted to the distorted glottal flow derivative, and time-based or spectral measures are then calculated using this mathematical model instead of the original, distorted derivative. Among such functions, the Liljencrants-Fant (LF) model [23] is very popular. Parametrization methods for both the time and frequency domain have been developed for the LF model [35]. LF-waveform parameterization is an intricate process, the complexities and implications of which are not always fully appreciated. In terms of 'accuracy', the result of a parametric fit greatly depends on the optimization algorithm and the cost function to be minimized. In addition, the results are heavily influenced by the presence of random or systematic error in the signal to be fitted. Finally, if the model differs substantially from the process that generated the signal, the concept of accuracy of the fit loses most of its meaning. A closer look at the LF model follows next.

The LF model

In a previous section, we saw that the speech signal can be represented as:

s[n] = e[n] * v[n] * r[n]

where e[n] is the excitation signal, v[n] is the vocal tract impulse response, r[n] is the lip radiation, and * denotes convolution. The lip radiation can be modelled as a first order differentiator, r[n] = δ[n] − δ[n−1]. Re-arranging the terms, we have:

s[n] = e[n] * v[n] * r[n] = (e[n] * r[n]) * v[n] = ê[n] * v[n]

where ê[n] is the derivative of the excitation signal, which is the source signal. Since the source of the speech production system is the glottal flow, ê[n] is the derivative of the glottal flow, the so-called glottal flow derivative. The glottal flow derivative is often called the driving function. In the late 1980's, Liljencrants and Fant suggested a model for the derivative of the glottal flow, called the LF model [23].
The LF model is a four-parameter model, although in [53] it is extended up to seven parameters, including time instants, for speaker identification applications. The LF model is described by the following equations:

x_{LF}(t) = \begin{cases} E_0 e^{\alpha t} \sin(\omega_g t), & 0 \le t \le T_e \\ -\frac{E_0}{\beta T_a} \left( e^{-\beta(t-T_e)} - e^{-\beta(T_c-T_e)} \right), & T_e \le t \le T_c \\ 0, & \text{elsewhere} \end{cases}

where Tc, Te, and Ta are illustrated in Figure 1.3 (Liljencrants-Fant Model), and represent the glottal closure, the glottal excitation, and the return phase, respectively. The four parameters of the model are α, ωg, and E0, which describe the open phase, and Ta, which describes the return phase. The parameter β is dependent on Ta, and although there is no closed form for the relationship between β and Ta, it can be assumed that β ≈ 1/Ta for small values of Ta. Another assumption is that the glottal closure instant, Tc, can be considered to coincide with the glottal opening instant, To, of the next period. The four parameters do not include the timings of glottal opening, excitation, and closure, so these timing values must be given. Due to the large dependence of E0 on α [53], the parameter Ee, the value of the waveform at time Te, can be estimated instead of E0. To calculate E0 from Ee, the equation

E_0 = \frac{E_e}{e^{\alpha T_e} \sin(\omega_g T_e)}

is used. The parameter Ta is the most important parameter in terms of human perception, as it controls the amount of spectral tilt present in the source. The return phase of the LF model is equivalent to a first order lowpass filter [22] with a corner frequency of

F_a = \frac{1}{2\pi T_a}

The LF-model parameters and their significance are summarized in Table 1.1:

Ee: the value of the waveform at time Te
α: determines the ratio of Ee to the height of the positive portion of the glottal flow derivative
ωg: determines the curvature of the left side of the glottal pulse
Ta   An exponential time constant which determines how quickly the waveform returns to zero after time Te

Table 1.1 The four parameters of the Liljencrants-Fant model for the glottal flow derivative waveform.

1.2 Evaluation of Inverse Filtering Methods

A major obstacle both in developing new IF algorithms and in comparing existing methods is the difficulty of assessing the performance of an IF technique. This is mostly because the signal to be estimated, the glottal flow, is unavailable. Therefore, when IF is used to estimate the glottal flow of natural speech, it is actually never possible to assess in detail how closely the estimated waveform corresponds to the true glottal flow generated by the vibrating vocal folds. However, it is possible to use synthesized speech, created from artificial glottal flow waveforms, to assess the accuracy and efficiency of an IF technique. The success of the algorithm can then be judged by quantifying the error between the known input waveform and the version recovered by the algorithm. Although this approach is typically used in the literature [5], [6], [45], [65], [70], this kind of evaluation is not truly objective, because speech synthesis and IF analysis are based on similar models of the human voice production system, such as the source-filter model. An improvement would be to use synthesized speech generated by a more sophisticated articulatory model [34], which allows source-tract interaction [47], [8], [9]. A similar approach is followed in this thesis, as will be presented later. Once the algorithm has been verified and is being used for inverse filtering real speech samples, there are two possible approaches to evaluating the results. One is to compare the waveforms obtained with those obtained by earlier methods. As the aim is typically to establish that the new method is superior, the objectivity of this approach is also doubtful.
This approach can be made more objective when methods are compared using synthetic speech, so that the results can be compared against the original source, as in [6]. In many works, no comparisons are made, a stance which is not wholly unjustified, because there is not enough data available to say which are the correct glottal flow waveforms. On the other hand, using two different methods to extract the glottal flow could be an effective way to confirm that the appearance of the waveform is correct. If new techniques for glottal inverse filtering produce waveforms which 'look like' waveforms that have been produced before, then they are evaluated as better than those which do not; examples of the latter include [4], [19]. Toward the evaluation of IF methods, a physiologically based model of the vocal folds and vocal tract is used here to evaluate different IF methods. In this model, time-varying waveforms of the glottal flow and radiated acoustic pressure are simulated. A detailed description of this model follows next.

1.3 Physical Modeling of Speech Signals

A computational model of the vocal folds and acoustic wave propagation generated the sound pressure and glottal flow waveforms used in this thesis. In detail, self-sustained vocal fold vibration was simulated with three masses coupled to one another through stiffness and damping elements [60]. A schematic diagram of this model is shown in Figure 1.4, where the arrangement of the masses was designed to emulate the body-cover structure of the vocal folds [29]. The input parameters of this model are the lung pressure, the prephonatory glottal half-width (adduction), the resting vocal fold length and thickness, and the normalized activation levels of the cricothyroid (CT) and thyroarytenoid (TA) muscles.
These values were transformed into mechanical parameters of the model, such as mass, stiffness,

Figure 1.4 Schematic diagram of the lumped-element vocal fold model. The cover-body structure of each vocal fold is represented by three masses that are coupled to each other by spring and damping elements. Bilateral symmetry was assumed for all simulations.

and damping, according to [67]. In [66], through aerodynamic and acoustic considerations, the vocal fold model was coupled to the pressures and air flows in the trachea and vocal tract. Bilateral symmetry was assumed for all simulations, such that identical vibrations occur within both the left and right folds. Modifying the resting vocal fold length and the activation levels of the CT and TA muscles resulted in eight different fundamental frequency values (105, 115, 130, 145, 205, 210, 230, and 255 Hz). These values roughly approximate the range of the fundamental frequency in adult male and female speech [30]. The input parameters for all eight cases are shown in Table 1.2. Acoustic wave propagation in both the trachea and vocal tract was computed in time synchrony with the vocal fold model. This was performed with a wave-reflection

F0 (Hz)          105    115    130    145    205    210    230    255
aCT              0.1    0.4    0.1    0.4    0.2    0.3    0.3    0.4
aTA              0.1    0.1    0.4    0.4    0.2    0.2    0.3    0.4
L0 (cm)          1.6    1.6    1.6    1.6    0.9    0.9    0.9    0.9
T0 (cm)          0.3    0.3    0.3    0.3    0.3    0.3    0.3    0.3
ξ01 (cm)         10^-4  10^-4  10^-4  10^-4  10^-4  10^-4  10^-4  10^-4
ξ02 (cm)         0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
PL (dyn/cm^2)    7840   7840   7840   7840   7840   7840   7840   7840

Table 1.2 Input parameters for the vocal fold model used to generate the eight different fundamental frequencies. Notation is identical to that used in Titze and Story (2002). aCT and aTA are the normalized activation levels (ranging from 0 to 1) of the CT and TA muscles, respectively. L0 and T0 are the resting length and thickness of the vocal folds, respectively.
ξ01 and ξ02 are the prephonatory glottal half-widths at the inferior and superior edges of the vocal folds, respectively, and PL is the respiratory pressure applied at the entrance of the trachea. The value of PL shown in the table is equivalent to a pressure of 8 cm H2O.

approach [63], [39], in which the area functions of the vocal tract and trachea were discretized into short cylindrical sections, or tubelets. Reflection and transmission coefficients were calculated at the junctions of consecutive tubelets at each time sample. From these, pressure and volume velocity were then computed to propagate the acoustic waves through the system. The glottal flow was determined by the interaction of the glottal area with the time-varying pressures present just inferior and superior to the glottis, as described by Titze [66]. At the lip termination, the forward and backward traveling pressure wave components were subjected to a radiation load modeled as a resistance in parallel with an inductance, as suggested by Flanagan [25], intended to approximate a piston in an infinite plane baffle. The output pressure is assumed to be representative of the pressure radiated at the lips. To the extent that the piston-in-a-baffle reasonably approximates the radiation load, the calculated output pressure can also be assumed to be representative of the pressure that would be transduced by a microphone in a non-reflective environment. The specific implementation of the vocal tract model used in this thesis was presented in Story [59], and included energy losses due to viscosity, yielding walls, and heat conduction, as well as radiation at the lips.

Figure 1.5 Area function representation of the trachea and vocal tract used to simulate the male /a/ vowel. The vocal fold model of Fig. 1.4 would be located at the 0 cm point indicated by the dashed vertical line. Examples of the glottal flow and output pressure waveforms are shown near the locations at which they would be generated.
In this thesis, glottal flow and speech pressure waveforms were generated for four vowels (/aa/, /ae/, /eh/, and /ih/) in both male and female configurations. The area functions were taken from those reported for an adult male in Story et al [61]. For the simulations of male speech, the area functions were used directly, with the exception of the vocal tract length, which was normalized to 17.46 cm. For the female speech simulations, the male area functions were modified based on the vocal tract dimensions reported in Fitch and Giedd [24]. The trachea was the same in all cases as that shown in Figure 1.5. In summary, the model is a simplified but physically motivated representation of a speaker. It generates both the signal on which inverse filtering is typically performed (the microphone signal) and the signal that is sought (the glottal flow). This provides an idealized test case for inverse filtering algorithms.

1.4 Thesis Contribution

In this thesis, a physiologically based model of the vocal folds and vocal tract is used to evaluate different IF methods. In this model, time-varying waveforms of the glottal flow and radiated acoustic pressure are simulated. By using the simulated speech pressure waveform as input to an IF method, it is possible to determine how close the obtained estimate of the glottal flow is to the simulated glottal flow. This approach differs from using synthetic speech excited by an artificial glottal flow, because here the glottal flow waveform results from the interaction of the self-sustained oscillations of the vocal folds with the subglottal and supraglottal pressures, as happens during real speech production. Therefore, this model generates a glottal flow waveform that is expected to provide a firmer and more realistic test of an IF method than a parametric flow model, in which no source-tract interaction is taken into account.
In order to evaluate the different IF techniques, robust time and frequency domain parametrization measures are typically used. In this way, the most important features of the glottal flow (or glottal flow derivative) estimates are expressed using a few numerical values. The parametrization measures used in this thesis are the Normalized Amplitude Quotient (NAQ), the Signal to Reconstruction Error ratio (SRER), the difference in dB between the first two harmonics of the glottal spectrum (H1-H2), and the Harmonic Richness Factor (HRF), which were previously discussed.

1.5 Thesis Organization

In this chapter, the basic concepts of the source-filter voice production system were introduced, along with their relation to the inverse filtering process. The problem of evaluating IF techniques was illustrated, and a database of physically modeled speech pressure signals was delineated. This database will greatly help in comparing and evaluating different IF methods. In Chapter 2, we discuss the different vocal tract filter estimation methods and their properties, as well as the inverse filtering procedure followed for each method. Chapter 3 covers the results of each IF method, using the measures introduced in this chapter, and finally Chapter 4 concludes the thesis and discusses ideas for future directions in related research.

Chapter 2 Vocal Tract Filter Estimation with Linear Prediction

As discussed in Chapter 1, most IF approaches use Linear Prediction Analysis to estimate the vocal tract filter. In this chapter, we first discuss the linear prediction techniques that are used throughout the rest of the thesis. Next, we describe the inverse filtering procedure that we follow to estimate the glottal flow waveforms.
2.1 Autocorrelation based methods

2.1.1 Classic Linear Prediction - Autocorrelation Method

If we assume that the speech signal is zero outside an interval $0 \le n \le N-1$, then the error signal $e[n]$ will be nonzero in the interval $0 \le n \le N+p-1$; this interval is the region R. Since we are trying to predict nonzero samples from zero samples at the beginning of the interval, large errors will result there. The same holds at the end of the interval, where zero samples are predicted from nonzero ones. For this reason, a tapered window (e.g., Hamming) is often used. The aforementioned assumptions result in the autocorrelation method of linear prediction, which can be formulated as follows. Using a Hamming window $w[n]$ of length $N$, we obtain a windowed speech segment $s_N[n] = s[n]w[n]$. The mean squared prediction error is then defined as

$$E_N = \sum_{n=-\infty}^{\infty} e^2[n] = \sum_{n=-\infty}^{\infty}\Big(s_N[n] - \sum_{k=1}^{p} a_k s_N[n-k]\Big)^2. \quad (2.1)$$

The values of $a_k$ that minimize $E_N$ are found by setting the partial derivatives of $E_N$ with respect to the $a_k$ to zero. This yields the following $p$ equations in the $p$ unknowns $a_k$:

$$\sum_{k=1}^{p} a_k \sum_{n=-\infty}^{\infty} s_N[n-i]\, s_N[n-k] = \sum_{n=-\infty}^{\infty} s_N[n-i]\, s_N[n], \quad 1 \le i \le p. \quad (2.2)$$

Noting that the windowed speech signal $s_N[n] = 0$ outside the window $w[n]$, and introducing the autocorrelation function

$$R(i) = \sum_{n=i}^{N-1} s_N[n]\, s_N[n-i], \quad 0 \le i \le p, \quad (2.3)$$

equation (2.2) becomes

$$\sum_{k=1}^{p} R(|i-k|)\, a_k = R(i), \quad 1 \le i \le p. \quad (2.4)$$

Using matrix notation, the latter equation can be written as

$$\Phi \mathbf{a} = \mathbf{r}, \quad (2.5)$$

where the matrix $\Phi$ is called the autocorrelation matrix and its elements are given by $\Phi_{i,j} = R(|i-j|)$, $1 \le i, j \le p$. The other two vectors are given by $\mathbf{a} = [a_1, a_2, a_3, ..., a_p]^T$ and $\mathbf{r} = [R(1), R(2), R(3), ..., R(p)]^T$.
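Because the system (2.5) is Toeplitz, it can be solved in $O(p^2)$ operations. As a hedged illustration (not code from the thesis), here is a minimal Levinson-Durbin implementation using the prediction error filter convention $A(z) = 1 + \sum_{k=1}^p a_k z^{-k}$, so the predictor coefficients of the text correspond to $-a_k$:

```python
import numpy as np

def lp_autocorrelation(s, p):
    """Autocorrelation-method LP solved with the Levinson-Durbin recursion.
    Returns a = [1, a1, ..., ap] for the error filter A(z) = 1 + sum a_k z^-k
    (the predictor coefficients of the text are -a_k)."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    # autocorrelation R(i) = sum_{n=i}^{N-1} s[n] s[n-i]
    R = np.array([np.dot(s[i:], s[:N - i]) for i in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    E = R[0]                                          # prediction error energy
    for i in range(1, p + 1):
        k = -(R[i] + np.dot(a[1:i], R[i - 1:0:-1])) / E   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        E *= 1.0 - k * k
    return a

# Windowed synthetic AR(2) signal x[n] = 0.9 x[n-1] - 0.5 x[n-2] + w[n];
# the estimated error filter should be close to [1, -0.9, 0.5].
rng = np.random.default_rng(0)
x = np.zeros(4000)
for n in range(2, 4000):
    x[n] = 0.9 * x[n - 1] - 0.5 * x[n - 2] + rng.standard_normal()
a = lp_autocorrelation(x * np.hamming(4000), 2)
```

Since all reflection coefficients satisfy $|k| < 1$ for a positive definite $\Phi$, the resulting filter is guaranteed stable, which is the property discussed next.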
The main advantages of the autocorrelation method are that it always produces a stable filter at a reasonable computational load, and that fast algorithms exist, such as the Levinson-Durbin recursion [55], for solving the matrix system. The effects of the problems at the beginning and end of the analysis region can be reduced by using a non-rectangular window, such as a Hamming window.

2.1.2 Stabilized Weighted Linear Prediction

Stabilized Weighted Linear Prediction (SWLP), introduced by Magi et al [42], is an all-pole modeling method based on Weighted Linear Prediction (WLP) [41]. WLP applies time domain weighting to the square of the prediction error signal. This temporal weighting emphasizes the speech samples that have a high signal-to-noise ratio, and it has been shown that WLP improves the spectral envelope estimation of noisy speech in comparison to conventional LP analysis. Moreover, in contrast to other robust methods of LP [40], [73], the WLP filter parameters can be calculated without any iterative update. A problem is that the WLP filter is not guaranteed to be stable. This drawback is removed by SWLP, where the weighting is chosen such that the all-pole model is always stable. The formulation of WLP is presented next, as well as the SWLP algorithm that ensures stability of the all-pole model.

Weighted Linear Prediction

As in conventional LP, sample $x[n]$ is estimated by a linear combination of the past $p$ samples:

$$\hat{x}[n] = -\sum_{i=1}^{p} a_i x[n-i], \quad (2.6)$$

where the coefficients $a_i \in \mathbb{R}$. The prediction error $e_n(\mathbf{a})$, the residual, is defined as

$$e_n(\mathbf{a}) = x[n] - \hat{x}[n] = x[n] + \sum_{i=1}^{p} a_i x[n-i] = \mathbf{a}^T \mathbf{x}[n], \quad (2.7)$$

where $\mathbf{a} = [a_0\; a_1\; ...\; a_p]^T$ with $a_0 = 1$ and $\mathbf{x}[n] = [x[n]\; ...\; x[n-p]]^T$.
The prediction error energy $E(\mathbf{a})$ in the WLP method is

$$E(\mathbf{a}) = \sum_{n=1}^{N+p} e_n^2(\mathbf{a})\, w_n = \mathbf{a}^T \Big( \sum_{n=1}^{N+p} w_n \mathbf{x}[n]\mathbf{x}^T[n] \Big) \mathbf{a} = \mathbf{a}^T \mathbf{R} \mathbf{a}, \quad (2.8)$$

where $w_n$ is the weight imposed on sample $n$, $N$ is the length of the signal $x[n]$, and $\mathbf{R} = \sum_{n=1}^{N+p} w_n \mathbf{x}[n]\mathbf{x}^T[n]$. This is a constrained minimization problem: minimize $E(\mathbf{a})$ subject to $\mathbf{a}^T \mathbf{u} = 1$, where $\mathbf{u} = [1\; 0\; ...\; 0]^T$. Note that the autocorrelation matrix $\mathbf{R}$ is weighted, in contrast to conventional LP analysis. Matrix $\mathbf{R}$ is symmetric but, due to the weighting, not Toeplitz. However, it is positive definite, which makes the minimization problem convex. Using Lagrange multipliers, it can be shown that $\mathbf{a}$ satisfies the linear equation

$$\mathbf{R} \mathbf{a} = \sigma^2 \mathbf{u}, \quad (2.9)$$

where $\sigma^2 = \mathbf{a}^T \mathbf{R} \mathbf{a}$ is the error energy. Finally, the WLP all-pole model is obtained as $H(z) = 1/A(z)$, where $A(z)$ is the z-transform of the vector $\mathbf{a}$.

Weighting function

The time domain weighting function $w_n$ is the key point of both WLP and SWLP. In [41], the weighting function was chosen to be the Short-Time Energy (STE)

$$w_n = \sum_{i=0}^{M-1} x^2[n-i-1], \quad (2.10)$$

where $M$ is the length of the STE window. The use of the STE window can be justified as follows. The STE function emphasizes the speech samples of large amplitude. It is well known that applying LP analysis to speech samples that belong to the glottal closed phase interval generally results in a more robust spectral representation of the vocal tract. Thus, by emphasizing samples that occur during the glottal closed phase, the STE weighting is likely to yield more robust acoustical cues for the formants. In Figure 2.1, the focus of the STE weighting function on the glottal closed phase is illustrated on a clean vowel.

Stability

However, the WLP method with the STE window does not ensure stability of the all-pole model. Therefore, in [42], a generalized weighting function for WLP is developed in order to guarantee stability.
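Before the stabilization details, the plain WLP solution can be viewed as a weighted least-squares problem. The sketch below (an assumption-laden simplification, not the thesis implementation) computes the STE weights of (2.10) and solves the weighted normal equations directly; it restricts the error interval to samples with a full $p$-sample history rather than the zero-padded interval $1 \le n \le N+p$, and adds a tiny epsilon so silent stretches cannot make the system singular:

```python
import numpy as np

def wlp(x, p, M):
    """Weighted LP with a short-time energy (STE) weight (unstabilized sketch).
    Returns a = [1, a1, ..., ap] minimizing sum_n w_n (x[n] + sum_i a_i x[n-i])^2
    over the samples n = p..N-1."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(p, N)
    # STE weight w_n = sum_{i=0}^{M-1} x[n-i-1]^2; epsilon keeps the
    # weighted system nonsingular over silent stretches
    w = np.array([np.sum(x[max(k - M, 0):k] ** 2) for k in n]) + 1e-9
    X = np.column_stack([x[n - i] for i in range(1, p + 1)])  # past samples
    WX = X * w[:, None]
    # weighted normal equations: (X^T W X) a' = -X^T W x[n]
    a_tail = np.linalg.solve(X.T @ WX, -WX.T @ x[n])
    return np.concatenate(([1.0], a_tail))

# Demo on a synthetic AR(2) signal; the estimate should be near [1, -0.9, 0.5].
rng = np.random.default_rng(1)
x = np.zeros(4000)
for m in range(2, 4000):
    x[m] = 0.9 * x[m - 1] - 0.5 * x[m - 2] + rng.standard_normal()
a = wlp(x, p=2, M=8)
```

On clean stationary data the weighting changes little; its benefit appears for noisy speech, where high-energy (closed phase) samples dominate the fit.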
The autocorrelation matrix $\mathbf{R}$ in Eq. (2.8) can be expressed as $\mathbf{R} = \mathbf{Y}^T \mathbf{Y}$, where $\mathbf{Y} = [\mathbf{y}_0\; \mathbf{y}_1\; \cdots\; \mathbf{y}_p] \in \mathbb{R}^{(N+p)\times(p+1)}$ and

$$\mathbf{y}_0 = [\sqrt{w_1}\, x[1]\; \cdots\; \sqrt{w_N}\, x[N]\; 0\; \cdots\; 0]^T. \quad (2.11)$$

Figure 2.1 Upper panel: time domain waveforms of speech (vowel /a/ produced by a male speaker) and the short-time energy (STE) weight function (M = 8). Lower panel: glottal flow waveform of the vowel /a/ together with the STE weight function (M = 8).

The column vectors are given by

$$\mathbf{y}_{k+1} = \mathbf{B} \mathbf{y}_k, \quad k = 0, 1, \cdots, p-1, \quad (2.12)$$

where $\mathbf{B}$ is the $(N+p)\times(N+p)$ matrix whose only nonzero elements lie on the first subdiagonal, $B_{i+1,i} = \sqrt{w_{i+1}/w_i}$, $i = 1, \cdots, N+p-1$. To guarantee stability, before forming the matrix $\mathbf{Y}$, the elements of the subdiagonal of $\mathbf{B}$ are redefined for all $i = 1, \cdots, N+p-1$ as

$$B_{i+1,i} = \begin{cases} \sqrt{w_{i+1}/w_i}, & \text{if } w_{i+1} \le w_i, \\ 1, & \text{if } w_{i+1} > w_i, \end{cases}$$

so that no element of $\mathbf{B}$ exceeds unity in magnitude. The WLP model computed using this matrix $\mathbf{B}$ is called the Stabilized Weighted Linear Prediction model, and the stability of the all-pole filter is then ensured. For more information on the stability of SWLP, see [42].

2.2 Covariance based methods

2.2.1 Classic Linear Prediction - Covariance Method

If we make no assumption that the speech signal is zero outside the analysis interval, so that the $p$ samples preceding it are also available, then the mean squared prediction error is given by

$$E_N = \sum_{m=0}^{N-1} e^2[m], \quad (2.13)$$

where $e[m] = s[m] - \sum_{k=1}^{p} a_k s[m-k]$, $0 \le m \le N-1$, and the interval $[0, N-1]$ is the prediction error interval.
Using an approach similar to that of the autocorrelation method to minimize the prediction error, we arrive at the covariance method of linear prediction, which is given by the following equation:

$$\Phi \mathbf{a} = \boldsymbol{\psi}, \quad (2.14)$$

where the elements of the covariance matrix $\Phi$ are given by

$$\phi_{i,j} = \sum_{n=0}^{N-1} s[n-i]\, s[n-j], \quad 1 \le i \le p,\; 1 \le j \le p. \quad (2.15)$$

The other two vectors are given by

$$\mathbf{a} = [a_1, a_2, a_3, ..., a_p]^T, \quad \boldsymbol{\psi} = [\phi_{0,1}, \phi_{0,2}, \phi_{0,3}, ..., \phi_{0,p}]^T. \quad (2.16)$$

Matrix $\Phi$ has the properties of a covariance matrix, and thus the system can be efficiently solved using the Cholesky decomposition. The main advantage of the covariance method is that it yields the exact solution for any window length greater than $p$. Also, if the boundaries of the window are handled properly, a rectangular window can be used. The main disadvantage of the method is that stability of the filter is not always guaranteed. It should also be noted that for high pitched speakers the closed phase interval is typically too short, and the estimation of the vocal tract filter is then inaccurate. This yields severe distortions in the estimated glottal flow, such as increased ripple in its closed phase interval. As mentioned in Chapter 1, it is advantageous to restrict the analysis region to the closed phase interval of the glottal flow waveform. This approach is called Closed Phase (CP) Covariance LP Analysis, and it is the covariance technique tested in this thesis.

2.2.2 Constrained Closed Phase Covariance Linear Prediction

Problems with conventional CP covariance analysis

The classic CP covariance LP analysis described in the previous section suffers from certain shortcomings. Several previous studies [38], [69], [72], [57] indicate that CP analysis is very sensitive to the position of the covariance frame, thus giving glottal flow estimates that vary greatly.
This is understandable if we consider that the CP length is typically short, so that the amount of data used to define the parametric model of the vocal tract is sparse. If the position of this frame is misaligned, the vocal tract resonances are poorly modeled, and the glottal flow estimates are severely distorted. The problem is more apparent for high pitched speakers, where the length of the CP interval is very short. This type of distortion is demonstrated in Figure 2.2. In this figure, three glottal flow estimates are shown on the left, all inverse filtered from the same token of a male subject uttering the vowel [a], using only a minor change in the position of the covariance frame. The example shows how a minor change in the position of the covariance frame results in a major change in the glottal flow estimates. On the right of the figure, the corresponding z-plane root locations of the inverse filters are shown.

Figure 2.2 Covariance Frame Misalignment and Glottal Flow Distortion.

It is interesting to notice that in Figs. 2.2(a), 2.2(b), and 2.2(c), the inverse filters have one root on the positive real axis in the z-domain. An inverse filter root located on the positive real axis has the properties of a first order differentiator when the root approaches the unit circle, and a similar effect is also produced by a pair of complex conjugate roots at low frequencies. Thus, the glottal flow estimate in such cases becomes similar to a time-derivative of the flow candidate given by an inverse filter with no such roots, or with roots located in a more neutral position near the origin of the z-plane.
This distortion is most apparent at the time instants where the glottal flow changes most rapidly, that is, near glottal closure. As can be seen in Figs. 2.2(b) and 2.2(c), this distortion typically appears as sharp negative peaks, called "jags" by Wong et al [71], at the time instants of glottal closure. The theory of source-filter speech production suggests that, for non-nasalized voiced sounds, the poles of the vocal tract occur in complex conjugate pairs and the low frequency emphasis of the spectrum results from the glottal source. However, as can be seen in Figs. 2.2(b) and 2.2(c), the estimated vocal tract model has roots on the positive real axis or at low frequencies, and its amplitude spectrum therefore shows a boosting of low frequencies, which contradicts the theory suggested by Fant. Hence, it can be argued that among the three vocal tract models, the one depicted in Fig. 2.2(a) comes closest to representing the amplitude spectrum of an all-pole vocal tract of a vowel. Also, removing the roots of the vocal tract model located on the real axis results in glottal flow estimates that are less dependent on the covariance frame location [71], [13]. Another source of distortion in conventional CP covariance analysis is that the inverse filter may be non-minimum phase, that is, it may have roots outside the unit circle in the z-domain. From a stability point of view, this is not a problem, since the IF is computed using an FIR filter, and non-minimum phase FIR filters do not cause stability problems. However, the use of non-minimum phase filters does cause other kinds of distortion. According to the source-filter theory of speech production, the glottal flow is filtered through the vocal tract, which is considered to be a stable all-pole system for vowels and liquids. In the z-domain, this system must have all its poles inside the unit circle.
Its inverse filter cancels the vocal tract contribution by mapping each pole of the vocal tract to a zero of the IF filter inside the unit circle, resulting in a minimum phase filter. However, it is well known from the theory of digital signal processing that a zero of an FIR filter can be mirrored, that is, replaced by its mirror image partner: a zero at $z = z_0$ can be replaced by a zero at $z = 1/z_0^*$ without changing the shape of the amplitude spectrum of the filter. Hence, from an inverse filtering point of view, there are several candidate inverse filters, one of them minimum phase and the others non-minimum phase, all with the same amplitude spectrum. These filters, however, differ in their phase characteristics. This difference can cause severe distortion, and it is particularly strong in cases where zeros of the inverse filter located in the vicinity of the lowest two formants are moved from inside to outside the unit circle. Figure 2.3 demonstrates this distortion by inverse filtering a vowel waveform with a minimum phase and a non-minimum phase FIR filter. In Fig. 2.3(a), the glottal flow estimated with a minimum phase filter is shown on the left, and the corresponding z-plane representation on the right. In Fig. 2.3(b), the complex conjugate root pair that models the first formant is replaced by its counterpart outside the unit circle. Even though this slight modification changed the root radius by only 0.04, the distortion caused in the glottal flow estimate is severe and appears as increased ripple during the closed phase interval of the glottal cycle, as shown in the left panel of Fig. 2.3(b).

Mathematically Constrained Linear Prediction

The idea of mathematically constrained linear prediction is to modify the conventional covariance analysis so as to reduce the distortion caused by unrealistic locations of the vocal tract model roots.
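Returning for a moment to the zero-mirroring argument: it is easy to verify numerically that reflecting a conjugate zero pair $z_0$, $z_0^*$ to $1/z_0^*$, $1/z_0$ scales the magnitude response by the constant $1/|z_0|^2$ while changing the phase response. A small check, with $z_0$ an arbitrary illustrative zero (not a value from the thesis):

```python
import numpy as np

z0 = 0.9 * np.exp(1j * 0.3 * np.pi)                   # zero inside the unit circle
b_min = np.real(np.poly([z0, np.conj(z0)]))           # minimum-phase FIR pair
b_max = np.real(np.poly([1 / np.conj(z0), 1 / z0]))   # mirrored pair outside

w = np.linspace(0.0, np.pi, 512)                      # frequency grid
ejw = np.exp(1j * w)
H_min = np.polyval(b_min, ejw)                        # responses up to e^{-j2w}
H_max = np.polyval(b_max, ejw)

# Same magnitude shape (up to the constant gain factor |z0|^2):
assert np.allclose(np.abs(H_max) * np.abs(z0) ** 2, np.abs(H_min))
# ...but different phase characteristics:
assert not np.allclose(np.angle(H_max), np.angle(H_min))
```

This is exactly why amplitude-spectrum matching alone cannot distinguish the minimum phase inverse filter from its non-minimum phase relatives.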
This is achieved by not allowing the mean square error (MSE) criterion to locate the roots freely in the z-domain, but instead imposing certain restrictions on the predictor structure, which results in more realistic root locations from the viewpoint of the acoustic theory described by Fant.

Figure 2.3 Distortion caused by a non-minimum phase filter.

In order to implement such restrictions, one has to express the constraint as a concise mathematical equation and apply it in the MSE minimization problem. One such constraint can be expressed with the help of the DC gain of the LP filter. The DC gain can be expressed as the sum of the predictor coefficients,

$$V(e^{j0}) = \sum_{k=0}^{p} \alpha_k = l_{DC}, \quad (2.17)$$

where the $\alpha_k$ are the filter coefficients of the constrained inverse filter and $l_{DC}$ is a pre-defined real value for the gain of the filter at DC. The reason for constraining the DC gain is that, without it, the amplitude response of the vocal tract model may show an excessive boost at zero frequency. It is known from Fant's source-filter theory that the amplitude response of voiced sounds approaches unity at zero frequency [21]. Therefore, a misplaced and short covariance frame might even lead to an amplitude response with larger gain at zero frequency than at the formants, a clear violation of the source-filter theory and its underlying acoustical theory of tube shapes.
By imposing this constraint on the DC gain of the covariance linear prediction analysis, one may expect that the amplitude response of the resulting vocal tract estimates will better match Fant's source-filter theory. It should be noted, however, that this method still leaves the determination of the exact z-domain root locations of the vocal tract model to the MSE criterion, and does not bias the root locations prior to the optimization. The formulation of the DC-constrained LP follows. A mathematically straightforward way to implement such a restriction in conventional LP, as described in a previous section, is to set a certain pre-defined value for the frequency response of the all-pole model at zero frequency, as written in Eq. 2.17. Using matrix notation, the DC-constrained minimization problem can now be formulated as follows: minimize $\boldsymbol{\alpha}^T \Phi \boldsymbol{\alpha}$ subject to $\Gamma^T \boldsymbol{\alpha} = \mathbf{b}$, where $\boldsymbol{\alpha} = [\alpha_0, \cdots, \alpha_p]^T$ is the filter coefficient vector with $\alpha_0 = 1$, $\mathbf{b} = [1, l_{DC}]^T$, and $\Gamma$ is a $(p+1)$-by-2 constraint matrix defined as

$$\Gamma = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{bmatrix}. \quad (2.18)$$

Matrix $\Phi$ is positive definite, so the minimization problem is convex, and the Lagrange multiplier method is suitable for its efficient solution. The objective function is defined as

$$\eta(\boldsymbol{\alpha}, \mathbf{g}) = \boldsymbol{\alpha}^T \Phi \boldsymbol{\alpha} - 2\mathbf{g}^T(\Gamma^T \boldsymbol{\alpha} - \mathbf{b}), \quad (2.19)$$

where $\mathbf{g} = [g_1\; g_2]^T$ is the Lagrange multiplier vector. The above equation can be minimized by setting its derivative with respect to the vector $\boldsymbol{\alpha}$ to zero. Thus, taking into account that the matrix $\Phi$ is symmetric, we have

$$\frac{\partial \eta}{\partial \boldsymbol{\alpha}}(\boldsymbol{\alpha}, \mathbf{g}) = \boldsymbol{\alpha}^T(\Phi^T + \Phi) - 2\mathbf{g}^T\Gamma^T = 2\boldsymbol{\alpha}^T\Phi - 2\mathbf{g}^T\Gamma^T = \mathbf{0}. \quad (2.20)$$

The vector $\boldsymbol{\alpha}$ can be solved from the group of equations

$$\Phi\boldsymbol{\alpha} - \Gamma\mathbf{g} = \mathbf{0}, \quad (2.21)$$
$$\Gamma^T\boldsymbol{\alpha} - \mathbf{b} = \mathbf{0}, \quad (2.22)$$

which finally gives the optimal coefficients of the constrained inverse filter:

$$\boldsymbol{\alpha} = \Phi^{-1}\Gamma(\Gamma^T\Phi^{-1}\Gamma)^{-1}\mathbf{b}. \quad (2.23)$$
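The closed-form solution (2.23) is direct to implement. The sketch below (illustrative signal and $l_{DC}$ value; a small diagonal jitter is an added assumption, used so a nearly predictable frame cannot make $\Phi$ singular) builds the covariance matrix of a frame, forms $\Gamma$ for the DC constraint of (2.18), and verifies that the solution satisfies $\alpha_0 = 1$ and $\sum_k \alpha_k = l_{DC}$:

```python
import numpy as np

def dc_constrained_lp(s, p, l_dc=1.0):
    """DC-constrained covariance LP, Eq. (2.23):
    alpha = Phi^{-1} Gamma (Gamma^T Phi^{-1} Gamma)^{-1} b."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    # (p+1)x(p+1) covariance matrix phi[i, j] = sum_m s[m-i] s[m-j], m = p..N-1
    d = np.column_stack([s[p - i:N - i] for i in range(p + 1)])
    Phi = d.T @ d + 1e-9 * np.eye(p + 1)   # jitter keeps Phi invertible
    Gamma = np.ones((p + 1, 2))
    Gamma[1:, 0] = 0.0                     # column 1 enforces alpha_0 = 1
    b = np.array([1.0, l_dc])              # column 2 enforces sum(alpha) = l_dc
    PhiInvG = np.linalg.solve(Phi, Gamma)  # Phi^{-1} Gamma
    return PhiInvG @ np.linalg.solve(Gamma.T @ PhiInvG, b)

# Illustrative frame: a noise-driven AR(2) signal stands in for a speech frame.
rng = np.random.default_rng(3)
s = np.zeros(400)
for m in range(2, 400):
    s[m] = 0.9 * s[m - 1] - 0.5 * s[m - 2] + rng.standard_normal()
alpha = dc_constrained_lp(s, p=10, l_dc=1.0)
assert np.isclose(alpha[0], 1.0)           # first coefficient fixed to unity
assert np.isclose(alpha.sum(), 1.0)        # DC gain of the inverse filter
```

By construction $\Gamma^T \boldsymbol{\alpha} = \mathbf{b}$ holds exactly (up to rounding), whatever the frame contents, which is the point of the constraint.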
Figure 2.4 Examples of all-pole spectra computed in the closed phase covariance analysis by the conventional LP and by the DC-π constrained LP.

In a similar manner, a constraint at $\omega = \pi$ can be formed. Denoting by $D(z)$ the transfer function of a $p$th order π-constrained inverse filter, the following equation can be written:

$$D(z) = \sum_{k=0}^{p} d_k z^{-k} \;\Rightarrow\; D(e^{j\pi}) = \sum_{k=0}^{p} d_k(-1)^k = l_\pi. \quad (2.24)$$

The π-constrained minimization problem can now be formulated: minimize $\mathbf{d}^T \Phi \mathbf{d}$ subject to $\Gamma^T \mathbf{d} = \mathbf{e}$, where $\mathbf{d} = [d_0 \cdots d_p]^T$ is the filter coefficient vector with $d_0 = 1$, $\mathbf{e} = [1\; l_\pi]^T$, and the new $(p+1)$-by-2 constraint matrix $\Gamma$ is defined as

$$\Gamma = \begin{bmatrix} 1 & 1 \\ 0 & -1 \\ 0 & 1 \\ 0 & -1 \\ \vdots & \vdots \\ 0 & (-1)^p \end{bmatrix}. \quad (2.25)$$

It is also possible to impose a third constraint by requiring simultaneously that the first inverse filter coefficient is equal to unity and that the filter gains at $\omega = 0$ and $\omega = \pi$ are equal to $l_{DC}$ and $l_\pi$, respectively. Then, the constraint equation becomes $\Gamma^T \mathbf{v} = \mathbf{h}$, where $\mathbf{v} = [v_0 \cdots v_p]^T$ is the filter coefficient vector with $v_0 = 1$, $\mathbf{h} = [1\; l_{DC}\; l_\pi]^T$, and the resulting $(p+1)$-by-3 constraint matrix is defined as

$$\Gamma = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & -1 \\ 0 & 1 & 1 \\ 0 & 1 & -1 \\ \vdots & \vdots & \vdots \\ 0 & 1 & (-1)^p \end{bmatrix}. \quad (2.26)$$

An example of the vocal tract filter estimates obtained over a closed phase region with the constrained covariance method and the conventional one is depicted in Figure 2.4.

2.3 Inverse Filtering Procedure

The inverse filtering procedure followed in this thesis is illustrated in Figure 2.5.
Figure 2.5 Inverse Filtering Procedure.

Before processing, the speech and airflow signals are downsampled from 44 kHz to 8 kHz. Care is taken in order to avoid aliasing before downsampling. This is achieved by using an FIR filter with a cutoff frequency at 4 kHz. An analytic description of each sub-system follows.

2.3.1 VUS Detector

First, the speech waveforms generated by physical modeling of the vocal tract are passed through a voiced/unvoiced/silence (VUS) detection algorithm in order to remove any silent or unvoiced regions. Any VUS detection algorithm can be used at this point, but one based on energy and zero crossings is preferred here due to its simplicity, low complexity, and speed. The analysis is performed on short segments of speech, 30 ms in duration and with a 15 ms overlap between successive segments.

2.3.2 Pitch Estimation

Afterwards, an estimate of the pitch of the voiced parts of the waveform is obtained. Although any pitch estimator can be used in this part, the sinusoidal pitch estimator [44] is used here. Pitch estimates are calculated on a speech frame of 30 ms with a 15 ms overlap.

2.3.3 Excitation Instants Detection

According to the pitch estimates generated by the previous algorithm, a pitch-synchronous covariance-based LP analysis of the waveform is performed, with an order of p = 10, for speech signals with sampling frequency Fs = 8 kHz. The purpose of this analysis is not an accurate estimate of the vocal tract or the glottal flow in each frame, but a rough approximation of the excitation signal and the glottal excitation instants, which indicate glottal closure.
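The idea of locating glottal excitation instants from the LP residual can be sketched as follows. This toy example uses a plain autocorrelation LP on a synthetic impulse-train-excited all-pole signal, whereas the thesis performs a pitch-synchronous covariance analysis; the filter coefficients and peak-picking rule below are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def lp_residual_gci(x, fs, f0, p=10):
    """Rough glottal closure instant (GCI) candidates: the residual of a
    p-th order LP analysis peaks near glottal closure. A plain
    autocorrelation LP is used here for simplicity."""
    r = np.correlate(x, x, 'full')[len(x) - 1:len(x) + p]          # r[0..p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p + 1])                            # predictor
    res = np.convolve(x, np.concatenate(([1.0], a)))[:len(x)]      # residual e[n]
    T = int(fs / f0)                                               # pitch period
    return [k * T + int(np.argmax(np.abs(res[k * T:(k + 1) * T])))
            for k in range(len(x) // T)]

# impulse train through a toy all-pole "vocal tract": residual peaks at the impulses
fs, f0 = 8000, 100
exc = np.zeros(800)
exc[::fs // f0] = 1.0
x = lfilter([1.0], [1.0, -1.3, 0.5], exc)
print(lp_residual_gci(x, fs, f0)[:4])
```

The picked instants fall at (or within a couple of samples of) the excitation impulses, which is the level of accuracy needed for the rough pitch-period segmentation described above.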
The reason for this analysis is the identification of pitch periods throughout the whole speech signal, and it is an initial step for the CP covariance analysis, as will be shown later. After this point, the analysis differs depending on the LP method that is used.

2.3.4 Covariance-based LP Analysis

For CP covariance-based methods, it is necessary to identify the closed phase interval. This is achieved by using the sliding covariance analysis suggested by Plumpe et al [53], which provides a robust method to extract the glottal closed phase interval from the speech signal. Specifically, this method of glottal closed phase estimation relies on a sliding covariance analysis and uses the vocal tract formant modulation, which is predicted by Ananthapadmanabha and Fant [2] to vary more slowly in the glottal closed phase than in the open phase and to respond quickly to a change in glottal area. A "stationary" region of formant modulation gives a closed-phase time interval, over which we estimate the vocal tract transfer function. A stationary region is present even when the vocal folds remain partly open [53]. For high-pitched speakers, where the closed phase samples are fewer than twice the order of the LP analysis, a fixed length of NCP + order closed phase samples was used. Then, before inverse filtering, the lip radiation is cancelled using a first order all-pole filter with its pole at z = 0.999. Having the closed phase intervals for each period, a covariance-based LP analysis of order p = 10 is set up on the corresponding speech samples. Finally, the obtained vocal tract estimate is used to inverse filter, pitch synchronously, a speech segment consisting of two pitch periods, and thus the glottal flow is obtained.

2.3.5 Autocorrelation-based LP Analysis

For autocorrelation based methods, the analysis is simpler. The lip radiation is cancelled using a first order all-pole filter with its pole at z = 0.999.
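The lip radiation cancellation described above amounts to filtering the speech with the leaky integrator H(z) = 1/(1 − 0.999 z⁻¹), which approximately inverts the differentiator-like lip radiation. A minimal sketch:

```python
import numpy as np
from scipy.signal import lfilter

def remove_lip_radiation(speech, pole=0.999):
    """Cancel the lip radiation effect with the first-order all-pole filter
    H(z) = 1 / (1 - pole * z^{-1}), i.e. a leaky integrator with its pole
    at z = 0.999, as in Sections 2.3.4 and 2.3.5."""
    return lfilter([1.0], [1.0, -pole], speech)

# integrating a first-differenced signal should approximately restore it
x = np.sin(2 * np.pi * np.arange(800) / 80)
dx = np.diff(x, prepend=0.0)        # crude lip-radiation model: y[n] = x[n] - x[n-1]
y = remove_lip_radiation(dx)
print(np.max(np.abs(y - x)) < 0.05)
```

The pole is placed slightly inside the unit circle (0.999 rather than 1) so that the integrator does not accumulate DC drift, at the cost of a small residual error visible in the check above.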
Next, using the glottal pulses identified in a previous step, a pitch-synchronous LP analysis of order p = 10 is performed over a speech segment of 250 ms. The vocal tract contribution is then cancelled by inverse filtering over a region of two pitch periods, with an overlap of one pitch period. The overall glottal flow is synthesized using the well-known Overlap-Add (OLA) method [55].

2.4 Summary

In this chapter, four LP-based spectral estimation techniques were discussed and applied to an inverse filtering process; two of them are well-known and widely used: the conventional autocorrelation and covariance LP. The other two were recently developed and are referred to as Stabilized Weighted Linear Prediction and Constrained Closed Phase Covariance LP. SWLP is suggested for IF since it has the interesting property of emphasizing the speech samples that typically occur during the closed phase region. Finally, an automatic system for inverse filtering was described.

Chapter 3 Results

In this chapter, we first present glottal flow estimates of several vowels at different fundamental frequencies, using the IF techniques described in Chapter 2. Then, parametrization measures in the time and frequency domains are applied to the estimates, in order to compare and evaluate the performance of the different IF techniques. Finally, comments are presented on both the shape of the glottal flow estimates and the parametrization measures.

3.1 Glottal Flow Estimates Examples

The database consists of four physically modeled speech waveforms, all vowels (/aa/, /ae/, /eh/, /ih/), each one at eight different fundamental frequencies (105, 115, 130, 145, 205, 210, 230, and 255 Hz). That is a total of 32 speech signals. For brevity, we present here characteristic examples for each of the four vowels. In each figure (from left to right), we first show the actual glottal flow waveform, which is encapsulated by a thick-line box.
Then follow the estimates obtained using the autocorrelation-based LPC IF technique, the Stabilized Weighted LPC IF technique with parameter M = 8, the Stabilized Weighted LPC IF technique with parameter M = 24, the covariance-based LPC IF technique, and finally the constrained covariance-based LPC IF technique.

Figure 3.1 Glottal flow estimates for vowel /aa/, fundamental frequency of 105 Hz.

As can be seen in the figures, the estimated glottal flow waveforms are, in general, close to the original one. However, the best fit is achieved by the CP covariance methods, because of a more accurate extraction of the vocal tract filter during the closed phase interval. Also, the SW LP24 waveform, which corresponds to SWLP with M = 24, is closer to the original one than the estimate of the autocorrelation method of LP.

Figure 3.2 Glottal flow estimates for vowel /aa/, fundamental frequency of 145 Hz.

3.2 Parametrization Results

In order to parametrize our results, the previously discussed time and frequency domain measures were applied to the glottal flow estimates. In this chapter, the IF methods are abbreviated as follows: W LP C8 stands for Stabilized Weighted Linear Prediction with STE window of length M = 8, W LP C24 stands for Stabilized Weighted Linear Prediction with STE window of length M = 24, LPC stands for the conventional autocorrelation Linear Prediction, CovLPC stands for the conventional CP covariance Linear Prediction, and CLPC stands for the Constrained CP covariance Linear Prediction.

Figure 3.3 Glottal flow estimates for vowel /ae/, fundamental frequency of 145 Hz.

Figure 3.4 Glottal flow estimates for vowel /ae/, fundamental frequency of 210 Hz.

3.2.1 Signal to Reconstruction Error Ratio - SRER

A typical measure in signal processing algorithms is the Signal to Reconstruction Error Ratio (SRER). SRER is defined as:

$$\mathrm{SRER} = 10\log_{10}\left(\frac{\sigma_{s[n]}}{\sigma_{e[n]}}\right) \quad (3.1)$$

where $s[n]$ is the original glottal flow, $\hat{s}[n]$ is the obtained glottal flow in our case, $e[n] = s[n] - \hat{s}[n]$ is the error between the original and the obtained glottal flow, and $\sigma$ denotes the corresponding standard deviation.

Figure 3.5 Glottal flow estimates for vowel /ih/, fundamental frequency of 115 Hz.

SRER is a means to estimate how well the estimated glottal flow "fits" the original one. In other words, it is an index of the quality of the pitch synchronously resynthesized glottal flow waveform. Table 3.1 shows the mean and the standard deviation of SRER for each IF method per vowel.
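The SRER of Eq. (3.1) is straightforward to compute; for instance:

```python
import numpy as np

def srer(s, s_hat):
    """Signal to Reconstruction Error Ratio of Eq. (3.1), in dB:
    10*log10 of the ratio of the standard deviations of the original
    signal and of the reconstruction error."""
    return 10.0 * np.log10(np.std(s) / np.std(s - s_hat))

# sine plus a small orthogonal error: the ratio of stds is 100, i.e. 20 dB
n = np.arange(1000)
s = np.sin(2 * np.pi * n / 100)
print(round(srer(s, s + 0.01 * np.cos(2 * np.pi * n / 50)), 1))  # 20.0
```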
In table 3.1, it is clearly seen that the CP covariance methods outperform the autocorrelation methods in terms of the SRER. However, the CP covariance methods show a small decrease in performance for the /ih/ vowel. This can be justified by the fact that the first formant of the /ih/ vowel is typically low, and LP is known to perform poorly in such cases.

Figure 3.6 Glottal flow estimates for vowel /ih/, fundamental frequency of 230 Hz.

This, along with a possible misalignment of the CP interval in some frames, and the short length of the CP interval for high-pitched speakers, causes severe distortion in certain parts of the synthesized glottal flow, and thus leads to the decreased SRER value. A clearer picture could be obtained using a segmental SRER, discarding frames where the analysis had poor estimation results. In table 3.2, the SRER mean and standard deviation for each IF method per frequency is illustrated. Here, it is evident that the performance of all IF methods becomes worse when the pitch increases, especially for the CP covariance methods, where the covariance frame becomes too short. In almost all frequencies, the SW LP24 SRERs are higher than those of the conventional autocorrelation LP.

Figure 3.7 Glottal flow estimates for vowel /eh/, fundamental frequency of 105 Hz.
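The segmental SRER mentioned above could be implemented along these lines; the frame length and discard fraction are illustrative assumptions, not values specified in the thesis:

```python
import numpy as np

def segmental_srer(s, s_hat, frame=240, discard=0.1):
    """Frame-wise SRER averaged after dropping the worst `discard`
    fraction of frames, so that frames with poor estimates do not
    dominate the overall score."""
    vals = []
    for i in range(0, len(s) - frame + 1, frame):
        e = s[i:i + frame] - s_hat[i:i + frame]
        vals.append(10 * np.log10(np.std(s[i:i + frame]) / (np.std(e) + 1e-12)))
    vals = np.sort(vals)
    return float(np.mean(vals[int(len(vals) * discard):]))   # drop the worst frames

s = np.sin(2 * np.pi * np.arange(2400) / 100)
print(segmental_srer(s, s + 0.001 * np.cos(2 * np.pi * np.arange(2400) / 37)) > 20.0)
```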
Figure 3.8 Glottal flow estimates for vowel /eh/, fundamental frequency of 255 Hz.

3.2.2 Normalized Amplitude Quotient - NAQ

For the synthesized glottal flows, the NAQ was estimated for each cycle. In order to compare the NAQ values of the original and the estimated glottal flows, the ratio of the NAQ of the actual glottal flow to that of the estimated one is formed, called NAQrat. This ratio should be equal to unity when the estimation of the glottal flow has succeeded perfectly. Table 3.3 shows the mean and the standard deviation of NAQrat for each IF method per vowel.

Vowel    W LP C8      W LP C24     LPC          CovLPC       CLPC
/aa/     33.5(±2.0)   39.7(±4.5)   36.2(±5.7)   34.9(±9.2)   42.1(±13.1)
/ae/     32.6(±2.8)   38.3(±3.3)   36.9(±5.2)   37.5(±8.4)   46.1(±6.3)
/eh/     34.0(±1.9)   38.4(±4.2)   34.0(±4.0)   34.5(±8.2)   43.4(±5.6)
/ih/     37.3(±1.6)   37.6(±3.1)   33.3(±4.6)   32.2(±9.6)   38.8(±8.0)

Table 3.1 In this table, the mean and the standard deviation of the SRER value for each vowel (all 8 frequencies) is illustrated. All values are in dB.
Frequency      W LP C8      W LP C24     LPC          CovLPC       CLPC
F0 = 105 Hz    37.1(±0.8)   42.8(±2.1)   40.7(±3.7)   41.1(±2.3)   51.5(±4.4)
F0 = 115 Hz    35.4(±1.9)   40.1(±1.8)   37.7(±2.8)   41.5(±2.3)   50.0(±2.4)
F0 = 130 Hz    34.1(±1.0)   39.2(±4.0)   38.7(±3.4)   42.4(±2.1)   48.8(±3.8)
F0 = 145 Hz    34.0(±2.1)   37.1(±0.7)   37.5(±3.0)   40.7(±1.8)   49.3(±6.2)
F0 = 205 Hz    33.5(±3.8)   38.4(±1.8)   32.7(±2.5)   31.4(±7.3)   36.4(±2.6)
F0 = 210 Hz    34.1(±3.6)   37.2(±3.2)   32.4(±4.0)   28.2(±5.3)   36.3(±6.7)
F0 = 230 Hz    33.2(±3.5)   38.8(±3.9)   30.9(±4.0)   32.3(±9.4)   33.5(±7.3)
F0 = 255 Hz    33.3(±3.0)   34.2(±5.7)   30.0(±4.2)   20.7(±2.8)   35.0(±5.0)

Table 3.2 In this table, the mean and the standard deviation of the SRER value for each frequency (all 4 vowels) is illustrated. All values are in dB.

Vowel    W LP C8       W LP C24      LPC           CovLPC        CLPC
/aa/     1.06(±0.12)   1.10(±0.12)   1.23(±0.13)   1.27(±0.18)   1.21(±0.12)
/ae/     0.99(±0.11)   1.09(±0.11)   1.25(±0.13)   1.23(±0.10)   1.18(±0.08)
/eh/     1.05(±0.12)   1.13(±0.12)   1.31(±0.14)   1.25(±0.13)   1.18(±0.11)
/ih/     1.10(±0.12)   1.14(±0.12)   1.32(±0.14)   1.24(±0.16)   1.17(±0.11)

Table 3.3 In this table, the mean and the standard deviation of the ratio, denoted as NAQrat, between the NAQ of the actual glottal flow and the NAQ of the estimated one, for each vowel (all 8 frequencies) is illustrated.

In table 3.3, it is evident that SWLP outperforms both the conventional autocorrelation LP and the CP covariance LP. However, an interesting point in this table is the fact that the CP covariance waveforms, especially the conventional one, have NAQrat values that are higher than the others. This comes in contrast with the SRER index, which clearly showed that, in general, the CP covariance glottal flows are closer to the original glottal flow than the other estimated flows. Since NAQ is a measure that depends only on the amplitude values of the waveform and its derivative, even a
small distortion due to an incomplete estimation of the vocal tract filter in the closed phase might result in severe amplitude changes in the glottal flow. That is the reason why NAQ is usually not preferred in the literature when the CP covariance method is examined [8]. In table 3.4, the NAQrat values per frequency are illustrated. The results show no significant change depending on the frequency.

Frequency      W LP C8       W LP C24      LPC           CovLPC        CLPC
F0 = 105 Hz    1.03(±0.12)   1.12(±0.13)   1.28(±0.14)   1.33(±0.11)   1.20(±0.10)
F0 = 115 Hz    1.07(±0.13)   1.14(±0.14)   1.36(±0.16)   1.33(±0.13)   1.22(±0.12)
F0 = 130 Hz    1.03(±0.11)   1.10(±0.12)   1.27(±0.14)   1.26(±0.12)   1.19(±0.11)
F0 = 145 Hz    1.01(±0.10)   1.05(±0.10)   1.28(±0.12)   1.28(±0.10)   1.15(±0.10)
F0 = 205 Hz    1.03(±0.11)   1.11(±0.10)   1.23(±0.12)   1.20(±0.22)   1.17(±0.09)
F0 = 210 Hz    1.06(±0.13)   1.15(±0.12)   1.23(±0.14)   1.20(±0.22)   1.14(±0.16)
F0 = 230 Hz    1.10(±0.11)   1.13(±0.12)   1.30(±0.14)   1.18(±0.13)   1.16(±0.10)
F0 = 255 Hz    1.09(±0.11)   1.10(±0.11)   1.28(±0.13)   1.26(±0.25)   1.24(±0.07)

Table 3.4 In this table, the mean and the standard deviation of the ratio, denoted as NAQrat, between the NAQ of the actual glottal flow and the NAQ of the estimated one, for each frequency (all 4 vowels), is illustrated.

3.2.3 H1-H2

For the synthesized glottal flows, the H1-H2 parameter was estimated for a speech frame of five periods. In order to compare the H1-H2 values of the original and the estimated glottal flows, the difference between the H1-H2 of the actual glottal flow and that of the estimated one is formed, called H1 − H2dif. This difference should be equal to zero when the estimation of the glottal flow has succeeded perfectly. H1-H2 is an index of the spectral tilt of the glottal spectrum. Table 3.5 shows the mean and the standard deviation of H1 − H2dif for each IF method per vowel.
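The H1-H2 measure can be computed from the spectrum of the analysis frame; the sketch below assumes the harmonic amplitudes are simply read off the FFT bins closest to f0 and 2f0, which is one of several possible implementations:

```python
import numpy as np

def h1_h2(flow, f0, fs):
    """H1-H2 in dB: the level difference between the first two harmonic
    amplitudes of the glottal flow spectrum, picked from the FFT bins
    nearest to f0 and 2*f0."""
    spec = np.abs(np.fft.rfft(flow * np.hanning(len(flow))))
    freqs = np.fft.rfftfreq(len(flow), 1.0 / fs)
    h1 = spec[np.argmin(np.abs(freqs - f0))]
    h2 = spec[np.argmin(np.abs(freqs - 2 * f0))]
    return 20.0 * np.log10(h1 / h2)

# second harmonic at half the amplitude of the first: H1-H2 close to 6 dB
fs, f0 = 8000, 100.0
t = np.arange(int(5 * fs / f0)) / fs            # five pitch periods, as in the text
flow = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
print(5.5 < h1_h2(flow, f0, fs) < 6.5)
```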
Vowel    W LP C8        W LP C24        LPC           CovLPC        CLPC
/aa/     0.61(±0.03)    0.20(±0.03)     0.54(±0.03)   0.20(±0.15)   0.08(±0.08)
/ae/     0.19(±0.03)    0.18(±0.02)     0.38(±0.01)   0.18(±0.12)   0.07(±0.04)
/eh/     0.23(±0.03)    0.32(±0.02)     0.44(±0.01)   0.38(±0.15)   0.15(±0.05)
/ih/     -0.16(±0.01)   -0.029(±0.01)   0.26(±0.01)   1.10(±0.34)   0.27(±0.15)

Table 3.5 In this table, the mean and the standard deviation of the difference in dB, denoted as H1 − H2dif, between the H1-H2 of the actual glottal flow and the H1-H2 of the estimated one, for each vowel (all 8 frequencies) is illustrated.

The data in Table 3.5 show that the CP covariance methods outperform the autocorrelation methods, except for the /ih/ vowel, where the difficulty of estimating the formants and the short length of the CP interval affect the resulting waveforms. This is consistent with the theory of CP analysis. However, it is not clear whether SWLP with M = 8 or M = 24 performs better with this criterion. In any case, they both outperform conventional autocorrelation LP. In table 3.6, the H1 − H2dif values per frequency are illustrated. Here, the H1 − H2dif metric performs worse when the frequency increases, which is expected.
Frequency      W LP C8        W LP C24       LPC            CovLPC        CLPC
F0 = 105 Hz    0.13(±0.03)    0.18(±0.01)    0.20(±0.01)    0.20(±0.04)   0.16(±0.03)
F0 = 115 Hz    -0.11(±0.04)   -0.02(±0.02)   -0.03(±0.02)   0.14(±0.04)   0.07(±0.04)
F0 = 130 Hz    0.05(±0.09)    0.09(±0.01)    0.15(±0.01)    0.19(±0.03)   0.11(±0.02)
F0 = 145 Hz    -0.12(±0.01)   -0.05(±0.01)   -0.07(±0.01)   0.15(±0.03)   0.14(±0.04)
F0 = 205 Hz    0.35(±0.01)    0.25(±0.01)    0.53(±0.01)    0.35(±0.19)   0.11(±0.08)
F0 = 210 Hz    0.54(±0.05)    0.36(±0.03)    0.62(±0.04)    0.56(±0.31)   0.18(±0.14)
F0 = 230 Hz    0.33(±0.03)    0.23(±0.02)    0.52(±0.01)    0.60(±0.31)   0.24(±0.21)
F0 = 255 Hz    0.38(±0.02)    0.28(±0.02)    1.32(±0.01)    2.81(±0.56)   0.35(±0.06)

Table 3.6 In this table, the mean and the standard deviation of the difference in dB, denoted as H1 − H2dif, between the H1-H2 of the actual glottal flow and the H1-H2 of the estimated one, for each frequency (all 4 vowels) is illustrated.

3.2.4 Harmonic Richness Factor - HRF

For the synthesized glottal flows, the HRF parameter was estimated for a frame of five glottal periods. HRF is defined as the ratio, in dB, of the sum of the harmonic amplitudes above the fundamental to the amplitude of the fundamental:

$$\mathrm{HRF}_N = \frac{\sum_{n=2}^{N} H_n}{H_1} \quad (3.2)$$

where $H_1$ is the amplitude of the fundamental and $H_n$ are the amplitudes of the higher harmonics. In order to compare the HRF values of the original and the estimated glottal flows, the difference between the HRF of the actual glottal flow and that of the estimated one is formed, called HRFdif. This difference should be equal to zero when the estimation of the glottal flow has succeeded. The HRF criterion is an extension of H1-H2, illustrating the relationship of the fundamental with the higher harmonics. For the calculation of the HRF, the first eight (N = 8) harmonics were included.
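Eq. (3.2) can be computed in the same FFT-based fashion as H1-H2; this is a sketch under the same nearest-bin assumption:

```python
import numpy as np

def hrf(flow, f0, fs, n_harm=8):
    """Harmonic Richness Factor of Eq. (3.2): the ratio, in dB, of the
    summed amplitudes of harmonics 2..N to the amplitude of the
    fundamental, with harmonic amplitudes read off the nearest FFT bins."""
    spec = np.abs(np.fft.rfft(flow))
    freqs = np.fft.rfftfreq(len(flow), 1.0 / fs)
    H = [spec[np.argmin(np.abs(freqs - k * f0))] for k in range(1, n_harm + 1)]
    return 20.0 * np.log10(sum(H[1:]) / H[0])

# harmonics decaying by a factor of 2: sum(H2..H8)/H1 is just below unity (~0 dB)
fs, f0 = 8000, 100.0
t = np.arange(int(5 * fs / f0)) / fs
flow = sum(0.5 ** (k - 1) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 9))
print(round(hrf(flow, f0, fs), 2))
```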
Vowel    W LP C8        W LP C24       LPC            CovLPC         CLPC
/aa/     -0.23(±0.02)   -0.16(±0.02)   -0.45(±0.04)   -0.14(±0.16)   -0.19(±0.10)
/ae/     0.12(±0.01)    -0.19(±0.01)   -0.44(±0.01)   -0.04(±0.08)   -0.18(±0.05)
/eh/     0.01(±0.02)    -0.24(±0.02)   -0.46(±0.02)   -0.02(±0.15)   -0.08(±0.05)
/ih/     -0.23(±0.01)   -0.29(±0.01)   -0.53(±0.02)   0.09(±0.15)    -0.03(±0.09)

Table 3.7 In this table, the mean and the standard deviation of the difference in dB, denoted as HRFdif, between the HRF of the actual glottal flow and the HRF of the estimated one, for each vowel (all 8 frequencies) is illustrated.

Table 3.7 shows the mean and the standard deviation of HRFdif for each IF method per vowel. It is evident that the CP covariance techniques outperform the autocorrelation methods, showing that the vocal tract estimation in the higher formant regions is more accurate when an accurate CP analysis is applied to the speech signal. In addition, the SWLP method provides better results than the conventional autocorrelation LP. In table 3.8, the HRFdif values per frequency are illustrated. It is interesting to note that, except for the constrained CP covariance method, all other methods perform worse when the frequency increases.

3.3 Summary

In this chapter, the resulting waveforms of the IF techniques were parametrized using well known measures in the time and frequency domains.
The prevalence of the covariance methods is depicted, although the analysis required for this method has certain issues, such as the accurate identification of the closed phase region. However, it is interesting that the IF method based on SWLP with M = 24 performs better, in general and given a long enough analysis window, than the conventional autocorrelation method.

Frequency      W LP C8          W LP C24         LPC              CovLPC           CLPC
F0 = 105 Hz    -0.024(±0.003)   -0.119(±0.003)   -0.180(±0.003)   -0.112(±0.03)    -0.098(±0.03)
F0 = 115 Hz    -0.033(±0.006)   -0.096(±0.006)   -0.195(±0.008)   -0.129(±0.025)   -0.093(±0.03)
F0 = 130 Hz    0.041(±0.008)    -0.118(±0.007)   -0.297(±0.005)   -0.133(±0.036)   -0.107(±0.032)
F0 = 145 Hz    0.019(±0.004)    -0.122(±0.004)   -0.382(±0.007)   -0.165(±0.028)   -0.078(±0.025)
F0 = 205 Hz    -0.086(±0.017)   -0.288(±0.017)   -0.630(±0.019)   0.106(±0.243)    -0.154(±0.108)
F0 = 210 Hz    -0.224(±0.032)   -0.427(±0.042)   -0.669(±0.061)   -0.118(±0.258)   -0.143(±0.140)
F0 = 230 Hz    -0.202(±0.037)   -0.335(±0.034)   -0.712(±0.041)   0.072(±0.223)    -0.015(±0.167)
F0 = 255 Hz    -0.152(±0.033)   -0.276(±0.042)   -0.711(±0.046)   0.272(±0.267)    -0.288(±0.097)

Table 3.8 In this table, the mean and the standard deviation of the difference in dB, denoted as HRFdif, between the HRF of the actual glottal flow and the HRF of the estimated one, for each frequency (all 4 vowels) is illustrated.

Chapter 4 Conclusions

4.1 Summary of Findings

In this thesis, an evaluation of inverse filtering techniques was carried out, aiming to reliably estimate the glottal flow waveform directly from speech and to quantify the quality of the estimated glottal flows using parametrization measures. The volume velocity airflow through the glottis, called the glottal flow, is the source of voiced speech. Previous studies have shown the importance of the glottal flow in several areas of speech science. Four different techniques were examined in this thesis, all of them based on linear prediction analysis.
Two of them are widely used in the literature, based on the autocorrelation and covariance methods of linear prediction. The covariance analysis was restricted to the closed phase region of the glottal cycle. Also, two recently developed LP techniques were introduced: Stabilized Weighted Linear Prediction and Constrained Closed Phase Covariance Linear Prediction. SWLP, which was recently suggested for robust vocal tract filter extraction in noisy conditions, computes the all-pole model by imposing temporal weighting on the square of the residual signal. This is achieved by using the short-time energy (STE) as a weighting function. Thus, samples that fit the underlying speech production model well are emphasized. The performance of SWLP in inverse filtering depends on the length of the STE window. This property makes it interesting for IF, and it was our motivation for using it. The Constrained CP Covariance LP is a modified CP algorithm based on imposing certain predefined values on the gains of the vocal tract inverse filter at the angular frequencies 0 and/or π when optimizing the filter coefficients. With these constraints, vocal tract models are less prone to show false low-frequency roots. A major problem in assessing the performance of IF techniques is the lack of direct comparison of the estimated glottal flow waveforms with the actual glottal flow, since the latter can be obtained only with special equipment and in an invasive manner. In this direction, the performance of the discussed techniques is evaluated on a database of speech signals produced by physical modeling of the human voice production system. In this way, both the glottal airflow signals and the speech pressure signals are available and direct comparisons can be made. Several frequencies of different vowels were produced and compared to the inverse filtered estimates. The glottal flows were estimated in a pitch synchronous manner using a system for inverse filtering.
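The STE-weighted LP idea recalled above can be sketched as follows. Note that this simplified weighted LP omits the stabilization step that gives SWLP a guaranteed-stable all-pole filter, so it is only an illustration of the weighting principle, not of the full method:

```python
import numpy as np

def ste_weighted_lp(x, p=10, M=8):
    """Temporally weighted LP with a short-time energy (STE) weight
    w[n] = sum_{i=1..M} x[n-i]^2, which emphasizes high-energy samples
    (typically those of the closed phase region)."""
    N = len(x)
    w = np.array([np.sum(x[max(0, n - M):n] ** 2) for n in range(N)]) + 1e-9
    # weighted covariance-style normal equations: minimize sum_n w[n] e[n]^2
    C = np.zeros((p + 1, p + 1))
    for i in range(p + 1):
        for j in range(p + 1):
            C[i, j] = np.sum(w[p:] * x[p - i:N - i] * x[p - j:N - j])
    a = np.linalg.solve(C[1:, 1:], -C[1:, 0])   # predictor coefficients
    return np.concatenate(([1.0], a))           # inverse filter A(z)

rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * np.arange(400) / 50) + 0.01 * rng.standard_normal(400)
a = ste_weighted_lp(x, p=4, M=8)
print(len(a), a[0])   # 5 1.0
```

Setting the weight to a constant recovers the ordinary covariance method, which makes explicit how the STE window length M controls the emphasis that distinguishes the weighted model from conventional LP.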
The system inverse filters each waveform using all the different LP methods, and synthesizes the glottal flow estimate. For the CP analysis techniques, the CP interval was determined using a statistical technique, which eliminates dependence on a particular type of frequency modulation, and allows the algorithm to adapt to the degree of glottal closure. In order to compare the resulting glottal flow waveforms with the original one in a quantitative way, time and frequency domain parametrization measures were used. In the time domain, the Normalized Amplitude Quotient and the Signal to Reconstruction Error Ratio were used. In the frequency domain, the difference in dB between the first two harmonics (H1-H2) and the Harmonic Richness Factor were used. Inverse filtering experiments showed that SWLP outperforms the conventional autocorrelation LP for a large STE window in our IF system, and shows comparable performance to the covariance based methods. However, the prevalence of the covariance methods, when the CP region is accurately estimated, is obvious, as expected.

4.2 Suggestions for Future Work

The results of this work demonstrate the different LP-based vocal tract filter estimation techniques and their effect on inverse filtering. It was shown that robust and accurate vocal tract filter estimation is crucial in the process of inverse filtering. More sophisticated all-pole models could be used for inverse filtering, such as Discrete All-Pole modeling (DAP) [20]. DAP tries to fit the all-pole model using only the finite set of spectral locations that are related to the harmonic positions of the fundamental frequency. Through DAP, it is possible to obtain estimates of the formant frequencies that are less biased by the harmonic structure of the speech spectrum.
The DAP model uses the discrete Itakura-Saito (IS) error measure, and the optimization criterion is derived in the frequency domain, where the error function reaches its minimum only when the model spectrum coincides with the signal spectrum at all discrete points. Moreover, the DAP method tries to maximize the error flatness. This could be useful for high-pitched speakers, as in female speech, where the bias of the low formant frequencies is more intense. A further problem is that if the order of the original LP method is increased, then the corresponding envelope overestimates the original voiced speech power spectrum. This means that the LP envelope is resolving the harmonics and not the spectral envelope. The Minimum Variance Distortionless Response (MVDR) method [49] provides a smooth spectral envelope even when the model order is increased. In particular, if one chooses the proper order for the MVDR method, the all-pole envelope obtained models a set of spectral samples exactly. In the same sense, nearly any vocal tract filter estimation technique can be used in order to examine how well the effect of the vocal tract filter can be cancelled on a speech signal. Finally, since the CP covariance method is the prevalent method for inverse filtering and a major problem is the accurate identification of the closed phase, the presented database can be useful in this direction. The closed phases for each phoneme are available, and a direct comparison of the performance of several CP identification algorithms can be made. On the parametrization of the glottal flow, the LF model could be used to parametrize both the estimates and the original glottal flow. Since there can be either time or frequency domain LF-parametrization, a more robust framework of "similarity" between glottal flow waveforms can be established, along with the measures described in this thesis.
Also, the accuracy of quotients or spectral parameters deteriorates when the glottal flows are severely affected by noise. The LF model can alleviate the distortion caused by instantaneous noisy peaks of the flow, especially in the case of NAQ, where the amplitude values of the glottal flow and the flow derivative may be distorted. Finally, both the time and frequency domain metrics discussed here do not take into account the presence of a ripple component in the closed phase, an event that happens when there is incomplete cancellation of some of the higher formants by the inverse filter. Alternatively, this component might be explained by the existence of nonlinear coupling between the source and the tract, which cannot be taken into account in any analysis based on linear modeling of the voice production system. In the direction of evaluating the performance of IF, it can be suggested that a more detailed simulation of the physical model of the human voice production system will reveal properties and details that are unknown so far. The database selected for the experiments in this thesis consists of signals that are produced by a simplified but physically-motivated representation of a speaker. A problem is that these vowels sound "unnatural". An interesting approach is the "3D Vocal Tract Project" [74], a three-dimensional vocal tract model for articulatory and visual speech synthesis developed within CTT, the Centre for Speech Technology, KTH Royal Institute of Technology, Sweden. An ideal physical model would provide realistic waveforms at several midpoints of the vocal tract, such as near the glottis, inside the oral cavity, and at the lips. Furthermore, since the area of Voice Care has shown increased interest in non-invasive methods of extracting the glottal flow waveform, it would be interesting to implement a real-time system for glottal inverse filtering.
Recently, a real-time voice pathology detection system based on pitch estimation and a short-time jitter estimator was implemented [12]. This system was implemented in Pure Data (PD), which is a real-time graphical programming environment for audio and graphical processing. This work is also based on short-time speech analysis, as is the IF system described in this thesis, and showed that a real-time approach to speech analysis is both efficient and accurate. In addition, the algorithms embedded in the IF system are low in computational cost, a fact that supports an effort towards a real-time approach.

Bibliography

[1] Ananthapadmanabha, T. and Yegnanarayana, B., 1979, "Epoch extraction from linear prediction residual for identification of closed glottis interval", IEEE Trans. Acoust., Speech and Signal Processing, vol. 27, no. 4, pp. 309-319.
[2] Ananthapadmanabha, T. and Fant, G., 1985, "Calculation of the true glottal flow and its components", STL-QPSR, 1-30.
[3] Akande, O., and Murphy, P., 2005, "Estimation of the vocal tract transfer function with application to glottal wave analysis", Speech Commun., 46, 15-36.
[4] Alkhairy, A., 1999, "An algorithm for glottal volume velocity estimation", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1, 233-236.
[5] Alku, P., 1992, "An automatic method to estimate the time-based parameters of the glottal pulseform", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Proc., 2, 29-32.
[6] Alku, P., 1992, "Glottal wave analysis with Pitch Synchronous Iterative Adaptive Inverse Filtering", Speech Commun., 11, 109-118.
[7] Alku, P., Backstrom, T., and Vilkman, E., 2002, "Normalized Amplitude Quotient for parametrization of the glottal flow", J. Acoust. Soc. Am., 112, 701-710.
[8] Alku, P., Magi, C., Yrttiaho, S., Backstrom, T., and Story, B., 2009, "Closed Phase Covariance Analysis based on Constrained Linear Prediction for Glottal Inverse Filtering", J. Acoust. Soc.
Am., 125(5), 3289-3305.
[9] Alku, P., Story, B., and Airas, M., 2006, "Estimation of the voice source from speech pressure signals: Evaluation of an inverse filtering technique using physical modelling of voice production", Folia Phoniatrica et Logopaedica, 58(2), 102-113.
[10] Alku, P., Strik, H., and Vilkman, E., 1997, "Parabolic Spectral Parameter - A new method for quantification of the glottal flow", Speech Communication, 22, 67-79.
[11] Alku, P., and Vilkman, E., 1996, "A comparison of glottal voice source quantification parameters in breathy, normal, and pressed phonation of female and male speakers", Folia Phoniatr. Logop., 48, 240-254.
[12] Astrinaki, M., 2010, "Real Time Voice Pathology Detection Using Autocorrelation Pitch Estimation and Short Time Jitter Estimator", M.Sc. Thesis, Computer Science Department, University of Crete.
[13] Childers, D., and Ahn, C., 1995, "Modeling the glottal volume velocity waveform for three voice types", J. Acoust. Soc. Am., 97, 505-519.
[14] Childers, D., and Hu, H., 1994, "Speech synthesis by glottal excited linear prediction", J. Acoust. Soc. Am., 96, 2026-2036.
[15] Childers, D., and Lee, C., 1991, "Vocal Quality Factors: Analysis, synthesis, and perception", J. Acoust. Soc. Am., 90, 2394-2410.
[16] Childers, D., Principe, J. C., and Ting, Y. T., 1995, "Adaptive WRLS-VFF for speech analysis", IEEE Transactions on Speech and Audio Processing, 3(3), 209-213.
[17] Cummings, K. E., and Clements, M. A., 1995, "Analysis of the glottal excitation of emotionally styled and stressed speech", J. Acoust. Soc. Am., 98, 88-98.
[18] Childers, D., and Wong, C.-F., 1994, "Measuring and modeling vocal source-tract interaction", IEEE Trans. Biomed. Eng., 41(7), 663-671.
[19] Deng, H., Beddoes, M. P., Ward, R. K., and Hodgson, M., 2003, "Estimating the Glottal Waveform and the Vocal-Tract Filter from a Vowel Sound Signal", Proc. IEEE Pacific Rim Conf.
Communications, Computers and Signal Processing, 1, 297-300.
[20] El-Jaroudi, A. and Makhoul, J., 1991, "Discrete all-pole modelling", IEEE Transactions on Signal Processing, 39(2), 411-423.
[21] Fant, G., 1970, Acoustic Theory of Speech Production (Mouton, The Hague).
[22] Fant, G., 1993, "Some problems in voice source analysis", Speech Communication, 13, 7-22.
[23] Fant, G., Liljencrants, J., and Lin, Q., 1985, "A four parameter model of glottal flow", STL-QPSR, 4, 1-13.
[24] Fitch, T., and Giedd, J., 1999, "Morphology and development of the human vocal tract: A study using magnetic resonance imaging", J. Acoust. Soc. Am., 106, 1511-1522.
[25] Flanagan, J., 1972, Speech Analysis, Synthesis, and Perception (Springer, New York).
[26] Frohlich, M., Michaelis, D., and Strube, H., 2001, "SIM - Simultaneous inverse filtering and matching of a glottal flow model for acoustic speech signals", J. Acoust. Soc. Am., 110, 479-488.
[27] Fu, Q., and Murphy, P., 2006, "Robust glottal source estimation based on joint source-filter model optimization", IEEE Trans. Audio, Speech, Lang. Process., 14, 492-501.
[28] Gobl, C., 1988, "Voice source dynamics in connected speech", STL-QPSR, 1, 123-159.
[29] Hirano, M., 1974, "Morphological structure of the vocal cord as a vibrator and its variations", Folia Phoniatr. Logop., 26, 89-94.
[30] Hollien, H., Dew, D., and Philips, P., 1971, "Phonational frequency ranges of adults", J. Speech Hear. Res., 14, 755-760.
[31] Holmberg, E. B., Hillman, R. E., and Perkell, J. S., 1988, "Glottal airflow and transglottal air pressure measurements for male and female speakers in soft, normal, and loud voice", J. Acoust. Soc. Am., 84, 511-529.
[32] Holmes, J. N., 1973, "The influence of glottal waveform on the naturalness of speech from a parallel formant synthesizer", IEEE Trans. Audio Electroacoust., AU-21, 298-305.
[33] Howell, P., and Williams, M., 1992, "Acoustic Analysis and perception of vowels in children's and teenagers' stuttered speech", J. Acoust. Soc. Am., 91, 1697-1706.
[34] Ishizaka, K., and Flanagan, J. L., 1972, "Synthesis of voiced sounds from a two-mass model of the vocal cords", Bell Syst. Tech. J., 51, 1233-1268.
[35] Kane, J., Kane, M., and Gobl, C., 2010, "A spectral LF model based approach to voice source parameterisation", Interspeech, Kyoto, Japan.
[36] Klatt, D., and Klatt, L., 1990, "Analysis, synthesis, and perception of voice quality variations among female and male talkers", J. Acoust. Soc. Am., 87, 820-857.
[37] Krishnamurthy, A., and Childers, D., 1986, "Two-channel speech analysis", IEEE Trans. Acoust., Speech, Signal Proc., 34, 730-743.
[38] Larar, J., Alsaka, Y., and Childers, D., 1985, "Variability in closed phase analysis of speech", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Tampa, FL, 1089-1092.
[39] Liljencrants, J., 1985, "Speech synthesis with a reflection-type line analog", DS dissertation, Department of Speech Communication and Music Acoustics, Royal Institute of Technology, Stockholm, Sweden.
[40] Lim, J., and Oppenheim, A., 1978, "All-pole modeling of degraded speech", IEEE Trans. Acoust. Speech and Signal Process., ASSP-26(3), 197-210.
[41] Ma, C., Kamp, Y., and Willems, L., 1993, "Robust signal selection for linear prediction analysis of voiced speech", Speech Comm., 12(1), 69-81.
[42] Magi, C., Pohjalainen, J., Backstrom, T., and Alku, P., 2009, "Stabilised Weighted Linear Prediction", Speech Communication, doi:10.1016/j.specom.2008.12.005.
[43] Makhoul, J., 1975, "Linear Prediction: a tutorial review", Proc. IEEE, 63(4), 561-580.
[44] McAulay, R. J., and Quatieri, T. F., 1990, "Pitch estimation and voicing detection based on a sinusoidal speech model", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, (1), 249-252.
[45] Matausek, M.
R., and Batalov, V. S., 1980, "A new approach to the determination of the glottal waveform", IEEE Trans. Acoust., Speech, Signal Proc., 28, 616-622.
[46] McKenna, G., 2001, "Automatic Glottal Closed-Phase Location and Analysis by Kalman Filtering".
[47] Milenkovic, P., 1986, "Glottal inverse filtering by joint estimation of an AR system with a linear input model", IEEE Trans. Acoust. Speech, Signal Process., 34, 28-42.
[48] Miller, R., 1959, "Nature of the vocal cord wave", J. Acoust. Soc. Am., 31, 667-677.
[49] Murthi, M. N., and Rao, B. D., 2000, "All-pole modelling of speech based on the minimum variance distortionless response spectrum", IEEE Transactions on Speech and Audio Processing, 8(3), 221-239.
[50] Murty, K., and Yegnanarayana, B., 2008, "Epoch Extraction From Speech Signals", IEEE Trans. Audio Speech Lang. Processing, 16, 1602-1613.
[51] Naylor, P., Kounoudes, A., Gudnason, J., and Brookes, M., 2007, "Estimation of glottal closure instants in voiced speech using the DYPSA algorithm", IEEE Trans. Audio, Speech, Lang. Process., 15, 34-43.
[52] Oppenheim, A., and Schafer, R., 1989, Discrete-Time Signal Processing (Prentice Hall, Englewood Cliffs, NJ).
[53] Plumpe, M., Quatieri, T., and Reynolds, D., 1999, "Modeling of the glottal flow derivative waveform with application to speaker identification", IEEE Trans. Speech Audio Process., 7, 569-586.
[54] Price, P., 1989, "Male and female voice source characteristics: inverse filtering results", Speech Commun., 8, 261-277.
[55] Quatieri, T., 2001, Discrete-Time Speech Signal Processing: Principles and Practice (Prentice Hall, Englewood Cliffs, NJ).
[56] Rabiner, L., and Schafer, R., 1978, Digital Processing of Speech Signals (Prentice Hall, Englewood Cliffs, NJ).
[57] Riegelsberger, E., and Krishnamurthy, A., 1993, "Glottal source estimation: methods of applying the LF model to inverse filtering", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Minneapolis, MN, Vol.
2, pp. 542-545.
[58] Rothenberg, M., 1973, "A new inverse filtering technique for deriving the glottal air flow waveform during voicing", J. Acoust. Soc. Am., 53, 1632-1645.
[59] Story, B., 1995, "Physiologically-based speech simulation using an enhanced wave-reflection model of the vocal tract", Ph.D. dissertation, University of Iowa.
[60] Story, B., and Titze, I., 1995, "Voice simulation with a body-cover model of the vocal folds", J. Acoust. Soc. Am., 97, 1249-1260.
[61] Story, B., Titze, I., and Hoffman, E., 1996, "Vocal tract area functions from magnetic resonance imaging", J. Acoust. Soc. Am., 100, 537-554.
[62] Strube, H., 1974, "Determination of the instant of glottal closure from the speech wave", J. Acoust. Soc. Am., 56, 1625-1629.
[63] Strube, H., 1982, "Time-varying wave digital filters for modeling analog systems", IEEE Trans. Acoust. Speech and Signal Processing, 30, 864-868.
[64] Sundberg, J., Titze, I., and Scherer, R., 1993, "Phonatory control in male singing: A study of the effects of subglottal pressure, fundamental frequency, and mode of phonation on the voice source", J. Voice, 7, 15-29.
[65] Ting, Y. T., and Childers, D. G., 1990, "Speech Analysis using the Weighted Recursive Least Squares Algorithm with a Variable Forgetting Factor", Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1, 389-392.
[66] Titze, I., 2002, "Regulating glottal airflow in phonation: Application of the maximum power transfer theorem to a low dimensional phonation model", J. Acoust. Soc. Am., 111, 367-376.
[67] Titze, I., and Story, B., 2002, "Rules for controlling low-dimensional vocal fold models with muscle activities", J. Acoust. Soc. Am., 112, 1064-1076.
[68] Titze, I., and Sundberg, J., 1992, "Vocal intensity in speakers and singers", J. Acoust. Soc. Am., 107, 581-588.
[69] Veeneman, D., and BeMent, S., 1985, "Automatic glottal inverse filtering from speech and electroglottographic signals", IEEE Trans. Acoust. Speech, Signal Process., 33, 369-377.
[70] Walker, J., 2003, "Application of the bispectrum to glottal pulse analysis", Proc. NoLisp '03.
[71] Wong, D., Markel, J., and Gray, A., Jr., 1979, "Least Squares glottal inverse filtering from the acoustic speech waveform", IEEE Trans. Acoust., Speech, Signal Process., 27, 350-355.
[72] Yegnanarayana, B., and Veldhuis, N., 1998, "Extraction of vocal-tract system characteristics from speech signals", IEEE Trans. Speech Audio Process., 6, 313-327.
[73] Zhao, Q., Shimamura, T., and Suzuki, J., 1997, "Linear Predictive Analysis of noisy speech", Communications, Computers and Signal Processing, PACRIM'97, Victoria, Canada, August 20-22, (2), 585-588.
[74] "A three-dimensional vocal tract model for articulatory and visual speech synthesis developed within CTT, the Centre for Speech Technology, KTH": http://www.speech.kth.se/multimodal/vocaltract.html
