# Chapter 10: Environmental Robustness

A speech recognition system trained in the lab with clean speech may degrade significantly in the real world if the clean speech used in training does not match real-world speech. If its accuracy does not degrade very much under mismatched conditions, the system is called robust. There are several reasons why real-world speech may differ from clean speech; in this chapter we focus on the influence of the acoustical environment, defined as the transformations that affect the speech signal from the time it leaves the mouth until it is in digital format. Chapter 9 discussed a number of variability factors that are critical to speech recognition. Because the acoustical environment is so important to practical systems, we devote this chapter to ways of increasing environmental robustness, including microphones, echo cancellation, and a number of methods that enhance the speech signal, its spectrum, and the corresponding acoustic model in a speech recognition system.

## 10.1 The Acoustical Environment

The acoustical environment is defined as the set of transformations that affect the speech signal from the time it leaves the speaker's mouth until it is in digital form. Two main sources of distortion are described here: additive noise and channel distortion. Additive noise, such as a fan running in the background, door slams, or other speakers' speech, is common in our daily life. Channel distortion can be caused by reverberation, the frequency response of a microphone, the presence of an electrical filter in the A/D circuitry, the response of the local loop of a telephone line, a speech codec, etc. Reverberation, caused by reflections of the acoustic wave off walls and other objects, can also dramatically alter the speech signal.

### 10.1.1 Additive Noise

Additive noise can be stationary or nonstationary.
Stationary noise, such as that made by a computer fan or air conditioning, has a power spectral density that does not change over time. Nonstationary noise, caused by door slams, radio, TV, and other speakers' voices, has statistical properties that change over time. A signal captured with a close-talking microphone has little noise and reverberation, even though there may be lip smacks and breathing noise. A microphone that is not close to the speaker's mouth may pick up a lot of noise and/or reverberation.

As described in Chapter 5, a signal x[n] is defined as white noise if its power spectrum is flat, $S_{xx}(f) = q$, a condition equivalent to different samples being uncorrelated, $R_{xx}[n] = q\delta[n]$. Thus, a white noise signal must have zero mean. This definition tells us about the second-order moments of the random process, but not about its distribution. Such noise can be generated synthetically by drawing samples from a distribution p(x); thus we could have uniform white noise if p(x) is uniform, or Gaussian white noise if p(x) is Gaussian. While subroutines that generate uniform white noise are typically available, we are often interested in white Gaussian noise, as it better resembles the noise that tends to occur in practice. See Algorithm 10.1 for a method to generate white Gaussian noise. The variable x is normally continuous, but it can also be discrete.

White noise is useful as a conceptual entity, but it seldom occurs in practice. Most of the noise captured by a microphone is colored, since its spectrum is not white. Pink noise is a particular type of colored noise that has a low-pass nature: it has more energy at the low frequencies and rolls off at higher frequencies. The noise generated by a computer fan, an air conditioner, or an automobile engine can be approximated by pink noise. We can synthesize pink noise by filtering white noise with a filter whose magnitude squared equals the desired power spectrum.
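The two ideas just described, generating white Gaussian noise from uniform samples (the construction of Algorithm 10.1 below) and coloring it with a filter, can be sketched in Python as follows. NumPy is assumed, and the one-pole filter coefficient `a` is an illustrative choice rather than a value from the text; it yields low-pass ("pink-like") noise, not exact 1/f pink noise.

```python
import numpy as np

def white_gaussian(n, rng=None):
    """White Gaussian noise via the construction of Algorithm 10.1:
    rho = sqrt(-2 ln r) is Rayleigh-distributed, theta is uniform on
    (0, 2*pi), and x = rho*cos(theta) is then N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    r = 1.0 - rng.uniform(size=n)          # uniform on (0, 1], avoids log(0)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    rho = np.sqrt(-2.0 * np.log(r))        # Rayleigh radius, Eq. (10.2)
    return rho * np.cos(theta)

def colored_noise(n, a=0.9, rng=None):
    """Low-pass colored noise: filter white noise with H(z) = 1/(1 - a z^-1),
    whose squared magnitude rolls off at high frequencies (a is illustrative)."""
    w = white_gaussian(n, rng)
    y = np.empty(n)
    prev = 0.0
    for i, wi in enumerate(w):
        prev = wi + a * prev               # y[n] = w[n] + a*y[n-1]
        y[i] = prev
    return y
```

Drawing `rho*sin(theta)` instead of `rho*cos(theta)` would give a second, independent Gaussian sample, as shown in Algorithm 10.1.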
A great deal of additive noise is nonstationary, since its statistical properties change over time. In practice, even the noises from a computer, an air-conditioning system, or an automobile are not perfectly stationary. Some nonstationary noises, such as keyboard clicks, are caused by physical objects. The speaker can also cause nonstationary noises such as lip smacks and breath noise. The cocktail party effect is the phenomenon by which a human listener can focus on one conversation out of many at a cocktail party. The noise of the conversations that are not focused upon is called babble noise. When the nonstationary noise is correlated with a known signal, the adaptive echo-canceling (AEC) techniques of Section 10.3 can be used.

**ALGORITHM 10.1: WHITE NOISE GENERATION**

To generate white noise in a computer, we can first generate a random variable $\rho$ with a Rayleigh distribution:

$$p_\rho(\rho) = \rho e^{-\rho^2/2} \tag{10.1}$$

from another random variable $r$ with a uniform distribution on (0, 1), $p_r(r) = 1$, by simply equating the probability mass $p_\rho(\rho)\,d\rho = p_r(r)\,dr$, so that $dr/d\rho = \rho e^{-\rho^2/2}$; with integration, it results in $r = e^{-\rho^2/2}$, and the inverse is given by

$$\rho = \sqrt{-2 \ln r} \tag{10.2}$$

If $r$ is uniform on (0, 1), and $\rho$ is computed through Eq. (10.2), it follows a Rayleigh distribution as in Eq. (10.1). We can then generate Rayleigh white noise by drawing independent samples from such a distribution. If we want to generate white Gaussian noise, the method used above does not work, because the integral of the Gaussian distribution does not exist in closed form. However, if $\rho$ follows a Rayleigh distribution as in Eq. (10.1), obtained using Eq. (10.2) where $r$ is uniform on (0, 1), and $\theta$ is uniformly distributed on (0, 2$\pi$), then white Gaussian noise can be generated as the following two variables $x$ and $y$:

$$x = \rho \cos\theta, \qquad y = \rho \sin\theta \tag{10.3}$$

They are independent Gaussian random variables with zero mean and unit variance, since the Jacobian of the transformation is given by

$$J = \begin{vmatrix} \dfrac{\partial x}{\partial \rho} & \dfrac{\partial x}{\partial \theta} \\[2mm] \dfrac{\partial y}{\partial \rho} & \dfrac{\partial y}{\partial \theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -\rho\sin\theta \\ \sin\theta & \rho\cos\theta \end{vmatrix} = \rho \tag{10.4}$$

and the joint density $p(x, y)$ is given by

$$p(x, y) = \frac{p(\rho, \theta)}{J} = \frac{p(\rho)\,p(\theta)}{\rho} = \frac{1}{2\pi} e^{-\rho^2/2} = \frac{1}{2\pi} e^{-(x^2+y^2)/2} = N(x, 0, 1)\, N(y, 0, 1) \tag{10.5}$$

The presence of additive noise can sometimes change the way the speaker speaks. The Lombard effect [40] is a phenomenon by which a speaker increases his vocal effort in the presence of background noise. When a large amount of noise is present, the speaker tends to shout, which entails not only a higher amplitude, but often also a higher pitch, slightly different formants, and a different coloring of the spectrum. It is very difficult to characterize these transformations analytically, but recently some progress has been made [36].

### 10.1.2 Reverberation

If both the microphone and the speaker are in an anechoic¹ chamber or in free space, a microphone picks up only the direct acoustic path. In practice, in addition to the direct acoustic path, there are reflections off walls and other objects in the room. We are well aware of this effect when we are in a large room, where reverberation can prevent us from understanding speech if the reverberation time is too long. Speech recognition systems are much less robust than humans, and they start to degrade with shorter reverberation times, such as those present in a normal office environment. As described in Section 10.2.2, for the direct path, the signal level at the microphone is inversely proportional to the distance r from the speaker.
For the kth reflected sound wave, the sound has to travel a larger distance $r_k$, so its level is proportionally lower. This reflection also takes time $T_k = r_k / c$ to arrive, where c is the speed of sound in air.² Moreover, some energy absorption takes place each time the sound wave hits a surface. The impulse response of such a filter looks like

$$h[n] = \sum_{k=0}^{\infty} \frac{\rho_k}{r_k} \delta[n - T_k] = \frac{1}{c} \sum_{k=0}^{\infty} \frac{\rho_k}{T_k} \delta[n - T_k] \tag{10.6}$$

where $\rho_k$ is the combined attenuation of the kth reflected sound wave due to absorption. Anechoic rooms have $\rho_k \approx 0$. In general $\rho_k$ is a (generally decreasing) function of frequency, so that instead of impulses $\delta[n]$ in Eq. (10.6), other (low-pass) impulse responses are used.

Often we have available a large amount of speech data recorded with a close-talking microphone, and we would like to use the speech recognition system with a far-field microphone. To do that, we can filter the clean-speech training database with a filter h[n], so that the filtered speech resembles speech collected with the far-field microphone, and then retrain the system. This requires estimating the impulse response h[n] of a room. Alternatively, we can filter the signal from the far-field microphone with an inverse filter to make it resemble the signal from the close-talking microphone.

¹ An anechoic chamber is a room whose walls are made of special fiberglass or other sound-absorbing materials, so that it absorbs all echoes. It is equivalent to being in free space, where there are neither walls nor reflecting surfaces.

² In air at standard atmospheric pressure and humidity, the speed of sound is c = 331.4 + 0.6T (m/s), where T is the temperature in degrees Celsius. It varies with different media and different levels of humidity and pressure.
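The retraining approach above, filtering clean speech with a room response h[n] of the form of Eq. (10.6), can be sketched as follows. NumPy is assumed; the reflection delays and gains are illustrative values, not measurements.

```python
import numpy as np

def room_impulse_response(delays_s, gains, fs=16000, length_s=0.15):
    """Build a sparse room impulse response per Eq. (10.6): reflection k
    contributes an impulse of amplitude rho_k / r_k at delay T_k = r_k / c."""
    h = np.zeros(int(length_s * fs))
    for t, g in zip(delays_s, gains):
        n = int(round(t * fs))
        if n < len(h):
            h[n] += g
    return h

def reverberate(x, h):
    """Simulate a far-field recording by convolving clean speech with h[n]."""
    return np.convolve(x, h)[:len(x)]
```

For example, `room_impulse_response([0.0, 0.01, 0.02], [1.0, 0.5, 0.25])` places a direct path at 0 ms and two attenuated reflections at 10 ms and 20 ms.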
One way to estimate the impulse response is to play a white noise signal x[n] through a loudspeaker or artificial mouth; the signal y[n] captured at the microphone is given by

$$y[n] = x[n] * h[n] + v[n] \tag{10.7}$$

where v[n] is the additive noise present at the microphone. This noise is due to sources such as air conditioning and computer fans and is an obstacle to measuring h[n]. The impulse response can be estimated by minimizing the error over N samples

$$E = \frac{1}{N} \sum_{n=0}^{N-1} \left( y[n] - \sum_{m=0}^{M-1} h[m] x[n-m] \right)^2 \tag{10.8}$$

which, taking the derivative with respect to h[l] and equating it to 0, results in our estimate $\hat{h}[l]$:

$$\frac{1}{N} \sum_{n=0}^{N-1} \left( y[n] - \sum_{m=0}^{M-1} \hat{h}[m] x[n-m] \right) x[n-l] = \frac{1}{N} \sum_{n=0}^{N-1} y[n] x[n-l] - \hat{h}[l] - \sum_{m=0}^{M-1} \hat{h}[m] \left( \frac{1}{N} \sum_{n=0}^{N-1} x[n-m] x[n-l] - \delta[m-l] \right) = 0 \tag{10.9}$$

Since we know our white process is ergodic, we can replace time averages by ensemble averages as $N \to \infty$:

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} x[n-m] x[n-l] = E\{ x[n-m] x[n-l] \} = \delta[m-l] \tag{10.10}$$

so that we can obtain a reasonable estimate of the impulse response as

$$\hat{h}[l] = \frac{1}{N} \sum_{n=0}^{N-1} y[n] x[n-l] \tag{10.11}$$

Inserting Eq. (10.7) into Eq. (10.11), we obtain

$$\hat{h}[l] = h[l] + e[l] \tag{10.12}$$

where the estimation error e[l] is given by

$$e[l] = \frac{1}{N} \sum_{n=0}^{N-1} v[n] x[n-l] + \sum_{m=0}^{M-1} h[m] \left( \frac{1}{N} \sum_{n=0}^{N-1} x[n-m] x[n-l] - \delta[m-l] \right) \tag{10.13}$$

If v[n] and x[n] are independent processes, then $E\{e[l]\} = 0$, since x[n] is zero-mean, so the estimate of Eq. (10.11) is unbiased. Its covariance decreases to 0 as $N \to \infty$, with the dominant term being the noise v[n]. The choice of N for a low-variance estimate depends on the filter length M and the noise level present in the room.
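The cross-correlation estimate of Eq. (10.11) can be sketched directly; NumPy is assumed, and x[n] is taken to be unit-variance white Gaussian noise.

```python
import numpy as np

def estimate_impulse_response(x, y, M):
    """Estimate h[l], l = 0..M-1, via Eq. (10.11): cross-correlate the
    white excitation x[n] with the captured signal y[n] and divide by N."""
    N = len(x)
    return np.array([np.dot(y[l:N], x[:N - l]) / N for l in range(M)])
```

With a long enough recording (large N), the estimate converges to the true filter taps, with residual error governed by Eq. (10.13).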
The filter h[n] could also be estimated by playing sine waves of different frequencies, or a chirp³ [52]. Since playing a white noise signal or sine waves may not be practical, another method is based on collecting stereo recordings with a close-talking microphone and a far-field microphone. The filter h[n] of length M is estimated so that, when applied to the close-talking signal x[n], it minimizes the squared error with the far-field signal y[n], which results in the following set of M linear equations:

$$\sum_{m=0}^{M-1} h[m] R_{xx}[m-n] = R_{xy}[n] \tag{10.14}$$

which is a generalization of Eq. (10.11) to the case where x[n] is not a white noise signal.

*Figure 10.1 Typical impulse response of an average office. Sampling rate was 16 kHz. It was estimated by driving a 4-minute segment of white noise through an artificial mouth and using Eq. (10.11). The filter length is about 125 ms.*

It is not uncommon to have reverberation times of over 100 milliseconds in office rooms. In Figure 10.1 we show the typical impulse response of an average office.

³ A chirp function continuously varies its frequency. For example, a linear chirp varies its frequency linearly with time: $\sin(n(\omega_0 + \omega_1 n))$.

### 10.1.3 A Model of the Environment

A widely used model of the degradation encountered by the speech signal when it is corrupted by both additive noise and channel distortion is shown in Figure 10.2. We can derive the relationships between the clean signal and the corrupted signal in both the power-spectrum and cepstrum domains based on such a model [2].

*Figure 10.2 A model of the environment: the clean signal x[m] is filtered by h[m], and the noise n[m] is added to give y[m].*

In the time domain, additive noise and linear filtering result in

$$y[m] = x[m] * h[m] + n[m] \tag{10.15}$$

It is convenient to express this in the frequency domain using the short-time analysis methods of Chapter 6.
To do that, we window the signal, take a 2K-point DFT of Eq. (10.15), and then the magnitude squared:

$$|Y(f_k)|^2 = |X(f_k)|^2 |H(f_k)|^2 + |N(f_k)|^2 + 2\,\mathrm{Re}\{X(f_k) H(f_k) N^*(f_k)\} = |X(f_k)|^2 |H(f_k)|^2 + |N(f_k)|^2 + 2\,|X(f_k)||H(f_k)||N(f_k)| \cos\theta_k \tag{10.16}$$

where $k = 0, 1, \ldots, K$, we have used upper case for frequency-domain linear spectra, and $\theta_k$ is the angle between the filtered signal and the noise for bin k. The expected value of the cross-term in Eq. (10.16) is zero, since x[m] and n[m] are statistically independent. In practice, this term is not zero for a given frame, though it is small if we average over a range of frequencies, as we often do when computing the popular mel-cepstrum (see Chapter 6). When using a filterbank, we can obtain a relationship for the energies at each of the M filters:

$$|Y(f_i)|^2 \approx |X(f_i)|^2 |H(f_i)|^2 + |N(f_i)|^2 \tag{10.17}$$

where it has been shown experimentally that this assumption works well in practice. Equation (10.17) also implicitly assumes that the length of h[n], the filter's impulse response, is much shorter than the window length 2N. That means that for filters with long reverberation times, Eq. (10.17) is inaccurate. For example, for $|N(f)|^2 = 0$, a window shift of T, and a filter impulse response $h[n] = \delta[n-T]$, we have $Y_t[f_m] = X_{t-1}[f_m]$; i.e., the output spectrum at frame t does not depend on the input spectrum at that frame. This is a more serious assumption, which is why speech recognition systems tend to fail under long reverberation times.

By taking logarithms in Eq. (10.17), and after some algebraic manipulation, we obtain

$$\ln|Y(f_i)|^2 \approx \ln|X(f_i)|^2 + \ln|H(f_i)|^2 + \ln\left(1 + \exp\left(\ln|N(f_i)|^2 - \ln|X(f_i)|^2 - \ln|H(f_i)|^2\right)\right) \tag{10.18}$$

Since most speech recognition systems use cepstrum features, it is useful to see the effect of the additive noise and channel distortion directly on the cepstrum.
To do that, let's define the following length-(M + 1) cepstrum vectors:

$$\begin{aligned}
\mathbf{x} &= \mathbf{C} \left( \ln|X(f_0)|^2 \;\; \ln|X(f_1)|^2 \;\; \cdots \;\; \ln|X(f_M)|^2 \right)^T \\
\mathbf{h} &= \mathbf{C} \left( \ln|H(f_0)|^2 \;\; \ln|H(f_1)|^2 \;\; \cdots \;\; \ln|H(f_M)|^2 \right)^T \\
\mathbf{n} &= \mathbf{C} \left( \ln|N(f_0)|^2 \;\; \ln|N(f_1)|^2 \;\; \cdots \;\; \ln|N(f_M)|^2 \right)^T \\
\mathbf{y} &= \mathbf{C} \left( \ln|Y(f_0)|^2 \;\; \ln|Y(f_1)|^2 \;\; \cdots \;\; \ln|Y(f_M)|^2 \right)^T
\end{aligned} \tag{10.19}$$

where C is the DCT matrix and we have used lower-case bold to represent cepstrum vectors. Combining Eqs. (10.18) and (10.19) results in

$$\hat{\mathbf{y}} = \mathbf{x} + \mathbf{h} + \mathbf{g}(\mathbf{n} - \mathbf{x} - \mathbf{h}) \tag{10.20}$$

where the nonlinear function g(z) is given by

$$\mathbf{g}(\mathbf{z}) = \mathbf{C} \ln\left(1 + e^{\mathbf{C}^{-1}\mathbf{z}}\right) \tag{10.21}$$

Equations (10.20) and (10.21) say that we can compute the cepstrum of the corrupted speech if we know the cepstrum of the clean speech, the cepstrum of the noise, and the cepstrum of the filter. In practice, the DCT matrix C is not square, so the dimension of the cepstrum vector is much smaller than the number of filters. This means that we lose resolution when going back to the frequency domain, and thus Eqs. (10.20) and (10.21) represent only an approximation, though it has been shown to work reasonably well.

As discussed in Chapter 9, the distribution of the cepstrum of x can be modeled as a mixture of Gaussian densities. Even if we assume that x follows a Gaussian distribution, y in Eq. (10.20) is no longer Gaussian because of the nonlinearity in Eq. (10.21). It is difficult to visualize the effect on the distribution, given the nonlinearity involved. To provide some insight, let's consider the frequency-domain version of Eq. (10.18) when no filtering is done, i.e., $H(f) = 1$:

$$y = x + \ln\left(1 + \exp(n - x)\right) \tag{10.22}$$

where x, n, and y represent the log-spectral energies of the clean signal, noise, and noisy signal, respectively, for a given frequency. Using simulated data, not real speech, we can analyze the effect of this transformation. Let's assume that both x and n are Gaussian random variables.
We can use Monte Carlo simulation to draw a large number of points from those two Gaussian distributions and obtain the corresponding noisy values y using Eq. (10.22). Figure 10.3 shows the resulting distribution for several values of $\sigma_x$. We fixed $\mu_n = 0$ dB, since it is only a relative level, and set $\sigma_n = 2$ dB, a typical value. We also set $\mu_x = 25$ dB and see that the resulting distribution can be bimodal when $\sigma_x$ is very large. Fortunately, for modern speech recognition systems that have many Gaussian components, $\sigma_x$ is never that large, and the resulting distribution is unimodal.

*Figure 10.3 Distributions of the corrupted log-spectra y of Eq. (10.22) using simulated data. The distribution of the noise log-spectrum n is Gaussian with mean 0 dB and standard deviation of 2 dB. The distribution of the clean log-spectrum x is Gaussian with mean 25 dB and standard deviations of 25, 10, and 5 dB, respectively (the x-axis is expressed in dB). The first distribution is bimodal, whereas the other two are approximately Gaussian. Curves are plotted using Monte Carlo simulation.*

Figure 10.4 shows the distribution of y for two values of $\mu_x$, given the same values for the noise distribution, $\mu_n = 0$ dB and $\sigma_n = 2$ dB, and a more realistic value of $\sigma_x = 5$ dB. We see that the distribution is always unimodal, though not necessarily symmetric, particularly for low SNR ($\mu_x - \mu_n$).

*Figure 10.4 Distributions of the corrupted log-spectra y of Eq. (10.22) using simulated data. The distribution of the noise log-spectrum n is Gaussian with mean 0 dB and standard deviation of 2 dB. The distribution of the clean log-spectrum is Gaussian with standard deviation of 5 dB and means of 10 and 5 dB, respectively. The first distribution is approximately Gaussian while the second is nonsymmetric.*
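The Monte Carlo experiment behind Figures 10.3 and 10.4 can be sketched as follows; NumPy is assumed. Since the figures quote levels in dB rather than natural-log units, the code uses the dB form of Eq. (10.22), y = x + 10 log10(1 + 10^((n - x)/10)), which follows from Y = X + N in the linear power domain.

```python
import numpy as np

def corrupted_log_spectra_db(mu_x, sigma_x, mu_n=0.0, sigma_n=2.0,
                             num=200000, rng=None):
    """Draw clean and noise log-energies (in dB) from two Gaussians and
    return the corrupted log-energy y, the dB form of Eq. (10.22)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(mu_x, sigma_x, num)   # clean log-spectrum, dB
    n = rng.normal(mu_n, sigma_n, num)   # noise log-spectrum, dB
    return x + 10.0 * np.log10(1.0 + 10.0 ** ((n - x) / 10.0))
```

Histogramming the returned samples for $\sigma_x$ of 25, 10, and 5 dB reproduces the qualitative behavior of Figure 10.3: bimodal for very large $\sigma_x$, approximately Gaussian otherwise.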
Curves are plotted using Monte Carlo simulation. The distributions used in an HMM are mixtures of Gaussians, so that, even if each Gaussian component is transformed into a non-Gaussian distribution, the composite distribution can be modeled adequately by another mixture of Gaussians. In fact, if you retrain the model using the standard Gaussian assumption on corrupted speech, you can get good results, so this approximation is not bad.

## 10.2 Acoustical Transducers

Acoustical transducers are devices that convert the acoustic energy of sound into electrical energy (microphones) and vice versa (loudspeakers). In the case of a microphone, this transduction is generally realized with a diaphragm, whose movement in response to sound pressure varies the parameters of an electrical system (a variable-resistance conductor, a condenser, etc.), producing a variable voltage that constitutes the microphone output. We focus on microphones because they play an important role in designing speech recognition systems.

There are near-field, or close-talking, microphones and far-field microphones. Close-talking microphones, either head-mounted or telephone handsets, pick up much less background noise, though they are more sensitive to throat clearing, lip smacks, and breath noise. Placement of such a microphone is often very critical, since, if it is right in front of the mouth, it can produce pops in the signal with plosives such as /p/. Far-field microphones can be lapel-mounted or desktop-mounted and pick up more background noise than near-field microphones. Having a small but variable distance to the microphone can be worse than a larger but more consistent distance, because with a consistent distance the corresponding HMM has lower variability. When used in speech recognition systems, the most important measurement is the signal-to-noise ratio (SNR), since the lower the SNR, the higher the error rate.
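The SNR can be computed from a clean-signal segment and a noise-only segment as the ratio of their average powers in dB; a minimal sketch, assuming NumPy:

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB: 10*log10 of the ratio of the mean
    power of a signal segment to that of a noise-only segment."""
    p_s = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_n = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_s / p_n)
```

For example, a sinusoid of amplitude 1.0 against a noise sinusoid of amplitude 0.1 yields an SNR of 20 dB, since power scales with the square of amplitude.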
In addition, different microphones have different transfer functions, and even the same microphone has a different transfer function depending on the distance between mouth and microphone. Varying noise and channel conditions are a challenge that speech recognition systems have to address, and in this chapter we present some techniques to combat them. The most popular type of microphone is the condenser microphone. We shall study in detail its directionality patterns, frequency response, and electrical characteristics.

### 10.2.1 The Condenser Microphone

A condenser microphone has a capacitor consisting of a pair of metal plates separated by an insulating material called a dielectric (see Figure 10.5). Its capacitance C is given by

$$C = \varepsilon_0 \pi b^2 / h \tag{10.23}$$

where $\varepsilon_0$ is a constant, b is the radius of the plates, and h is the separation between them. If we polarize the capacitor with a voltage $V_{cc}$, it acquires a charge Q given by

$$Q = C V_{cc} \tag{10.24}$$

One of the plates is free to move in response to changes in sound pressure, which results in a change in the plate separation $\Delta h$, thereby changing the capacitance and producing a change in voltage $\Delta V = \Delta h V_{cc} / h$. Thus, the sensitivity⁴ of the microphone depends on the polarizing voltage $V_{cc}$, which is why this voltage can often be 100 V.

*Figure 10.5 A diagram of a condenser microphone, showing the plate dimension b and the plate separation h.*

Electret microphones are a type of condenser microphone that does not require a special polarizing voltage $V_{cc}$, because a charge is impressed on either the diaphragm or the back plate during manufacturing, and it remains for the life of the microphone. Electret microphones are light and, because of their small size, they offer good response at high frequencies. From the electrical point of view, a microphone is equivalent to a voltage source v(t) with an impedance $Z_M$, as shown in Figure 10.6. The microphone is connected to a preamplifier, which has an equivalent impedance $R_L$.
*Figure 10.6 Electrical equivalent of a microphone: a source v(t) with impedance $Z_M$ driving a preamplifier with load impedance $R_L$ and gain G.*

From Figure 10.6 we can see that the voltage across $R_L$ is

$$v_R(t) = v(t) \frac{R_L}{R_M + R_L} \tag{10.25}$$

Maximization of $v_R(t)$ in Eq. (10.25) results in $R_L = \infty$, or in practice $R_L \gg R_M$, which is called bridging. Thus, for highest sensitivity the impedance of the amplifier has to be at least 10 times higher than that of the microphone. If the microphone is connected to an amplifier with lower impedance, there is a load loss of signal level. Most low-impedance microphones are labeled as 150 ohms, though the actual values may vary between 100 and 300 ohms. Medium impedance is 600 ohms, and high impedance is 600–10,000 ohms. In practice, the microphone impedance is a function of frequency. Signal power is measured in dBm, where the 0-dB reference corresponds to 1 mW dissipated in a 600-ohm resistor. Thus, 0 dBm is equivalent to 0.775 V.

Since the output impedance of a condenser microphone is very high (~1 Mohm), a JFET transistor must be coupled to lower the equivalent impedance. Such a transistor needs to be powered with a DC voltage through a different wire, as in Figure 10.7. A standard sound card has a jack with the audio on the tip, ground on the sleeve, DC bias $V_{DD}$ on the ring, and a medium impedance. When using phantom power, the $V_{cc}$ bias is provided directly in the audio signal, which must be balanced to ground.

*Figure 10.7 Equivalent circuit for a condenser microphone with DC bias on a separate wire.*

⁴ The sensitivity of a microphone measures the open-circuit voltage of the electric signal the microphone delivers for a sound wave at a given sound pressure level, often 94 dB SPL, when there is no load or a high impedance. This voltage is measured in dBV, where the 0-dB reference is 1 V rms.

It is important to understand how noise affects the signal of a microphone.
If thermal noise arises in the resistor $R_L$, it has power

$$P_N = 4kTB \tag{10.26}$$

where $k = 1.38 \times 10^{-23}$ J/K is Boltzmann's constant, T is the absolute temperature in kelvin, and B is the bandwidth in Hz. The thermal noise in Eq. (10.26) at room temperature (T = 297 K) and for a bandwidth of 4 kHz is equivalent to –132 dBm. In practice, the noise is significantly higher than this because of preamplifier noise, radio-frequency noise, and electromagnetic interference (poor grounding connections). It is thus important to keep the signal path between the microphone and the preamp as short as possible to avoid extra noise. It is desirable to have a microphone with low impedance, to decrease the effect of noise due to radio-frequency interference and to decrease the signal loss if long cables are used. Most microphones specify their SNR and the range where they are linear (dynamic range). For condenser microphones, a power supply is necessary (a DC bias is required). Microphones with balanced output (the signal appears across two inner wires not connected to ground, with the shield of the cable connected to ground) are more resistant to radio-frequency interference.

### 10.2.2 Directionality Patterns

A microphone's directionality pattern measures its sensitivity to a particular direction. Microphones may also be classified by their directional properties as omnidirectional (or nondirectional) and directional, the latter subdivided into bidirectional and unidirectional, based upon their response characteristics.

#### 10.2.2.1 Omnidirectional Microphones

By definition, the response of an omnidirectional microphone is independent of the direction from which the impinging sound wave is coming. Figure 10.8 shows the polar response of an omnidirectional mic. A microphone's polar response, or pickup pattern, graphs its output voltage for an input sound source with constant level at various angles around the mic.
Typically, a polar response assumes a preferred direction, called the major axis or front of the microphone, which corresponds to the direction in which the microphone is most sensitive. The front of the mic is labeled as zero degrees on the polar plot; but since an omnidirectional mic has no particular direction in which it is most sensitive, it has no true front, and hence the zero-degree axis is arbitrary. Sounds coming from any direction around the microphone are picked up equally. Omnidirectional microphones provide no noise cancellation.

*Figure 10.8 (a) Polar response of an ideal omnidirectional microphone and (b) its cross section, showing the mic opening and the diaphragm.*

Figure 10.8 shows the mechanics of the ideal⁵ omnidirectional condenser microphone. A sound wave creates a pressure all around the microphone. The pressure enters the opening of the mic and the diaphragm moves. An electrical circuit converts the diaphragm movement into an electrical voltage, or response. Sound waves impinging on the mic create a pressure at the opening regardless of the direction from which they are coming; therefore we have a nondirectional, or omnidirectional, microphone. As we have seen in Chapter 2, if the source signal is $Be^{j\omega t}$, the signal at a distance r is given by $(A/r)e^{j\omega t}$, independently of the angle. This is the most inexpensive of the condenser microphones, and it has the advantage of a flat frequency response that does not change with the angle or distance to the microphone. On the other hand, because of its uniform polar pattern, it picks up not only the desired signal but also noise from any direction. For example, if a pair of loudspeakers is monitoring the microphone output, the sound from the loudspeakers can reenter the microphone and create an undesirable sound called feedback.

⁵ Ideal omnidirectional microphones do not exist.

#### 10.2.2.2 Bidirectional Microphones

The bidirectional microphone is a noise-canceling microphone; it responds less to sounds incident from the sides. The bidirectional mic utilizes the properties of a gradient microphone to achieve its noise-canceling polar response. You can see how this is accomplished by looking at the diagram of a simplified gradient bidirectional condenser microphone, as shown in Figure 10.9. A sound impinging upon the front of the microphone creates a pressure at the front opening. A short time later, this same sound pressure enters the back of the microphone. The sound pressure never arrives at the front and back at the same time. This creates a displacement of the diaphragm and, just as with the omnidirectional mic, a corresponding electrical signal. For sounds impinging from the side, however, the pressure from an incident sound wave at the front opening is identical to the pressure at the back. Since the two openings lead to opposite sides of the diaphragm, there is no displacement of the diaphragm, and the sound is not reproduced.

*Figure 10.9 Cross section of an ideal bidirectional microphone, showing a speech sound wave arriving from the front and a noise sound wave arriving from the side.*

*Figure 10.10 Approximation to the noise-canceling microphone of Figure 10.9: a source at distance r and angle θ from the center of the microphone, whose front and rear openings are at (d, 0) and (–d, 0).*

To compute the polar response of this gradient microphone, let's make the approximation of Figure 10.10, where the microphone signal is the difference between the signal at the front and rear of the diaphragm, the separation between plates is 2d, and r is the distance between the source and the center of the microphone.
You can see that $r_1$, the distance between the source and the front of the diaphragm, is the norm of the vector specifying the source location minus the vector specifying the location of the front of the diaphragm:

$$r_1 = \left| r e^{j\theta} - d \right| \tag{10.27}$$

Similarly, you obtain the distance between the source and the rear of the diaphragm:

$$r_2 = \left| r e^{j\theta} + d \right| \tag{10.28}$$

The source arrives at the front of the diaphragm with a delay $\delta_1 = r_1 / c$, where c is the speed of sound in air. Similarly, the delay to the rear of the diaphragm is $\delta_2 = r_2 / c$. If the source is a complex exponential $e^{j2\pi ft}$, the difference signal between the front and rear is given by

$$x(t) = \frac{A}{r_1} e^{j2\pi f(t-\delta_1)} - \frac{A}{r_2} e^{j2\pi f(t-\delta_2)} = \frac{A}{r} e^{j2\pi ft} G(f, \theta) \tag{10.29}$$

where A is a constant and, using Eqs. (10.27), (10.28), and (10.29), the gain $G(f, \theta)$ is given by

$$G(f, \theta) = \frac{e^{-j2\pi \left| e^{j\theta} - \lambda \right| \tau f}}{\left| e^{j\theta} - \lambda \right|} - \frac{e^{-j2\pi \left| e^{j\theta} + \lambda \right| \tau f}}{\left| e^{j\theta} + \lambda \right|} \tag{10.30}$$

where we have defined $\lambda = d / r$ and $\tau = r / c$.

*Figure 10.11 Polar response of a bidirectional microphone obtained through Eq. (10.30) with d = 1 cm, r = 50 cm, c = 33,000 cm/s, and f = 1000 Hz.*

The magnitude of Eq. (10.30) is used to plot the polar response of Figure 10.11. As can be seen from the plot, the pattern resembles a figure eight. The bidirectional mic has an interchangeable front and back, since the response is a maximum in two opposite directions. In practice, this bidirectional microphone is an ideal case, and the polar response has to be measured empirically.

According to the idealized model, the frequency response of omnidirectional microphones is constant with frequency, and this approximately holds in practice for real omnidirectional microphones. On the other hand, the polar pattern of directional microphones is not constant with frequency. Clearly it is a function of frequency, as can be seen in Eq. (10.30).
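The figure-eight pattern of Figure 10.11 can be reproduced numerically by evaluating the magnitude of Eq. (10.30); a sketch assuming NumPy, with the geometry quoted in the figure caption:

```python
import numpy as np

def bidirectional_gain(f, theta, d=0.01, r=0.5, c=330.0):
    """|G(f, theta)| from Eq. (10.30), with lam = d/r and tau = r/c.
    Units: d and r in meters, c in m/s, f in Hz, theta in radians."""
    lam, tau = d / r, r / c
    front = np.abs(np.exp(1j * theta) - lam)   # normalized distance r1/r
    rear = np.abs(np.exp(1j * theta) + lam)    # normalized distance r2/r
    g = (np.exp(-2j * np.pi * front * tau * f) / front
         - np.exp(-2j * np.pi * rear * tau * f) / rear)
    return np.abs(g)
```

At 90° the front and rear path lengths are equal, so the two terms cancel exactly and the gain is zero; at 0° and 180° the gain is maximal and equal, giving the figure eight.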
In fact, the frequency response of a bidirectional microphone at 0° is shown in Figure 10.12 for both near-field and far-field conditions.

Figure 10.12 Frequency response of a bidirectional microphone with d = 1 cm at 0°, plotted as the difference in air pressure (dB) versus frequency (Hz). The maxima are obtained at 8250 Hz and 24,750 Hz and the null at 16,500 Hz; the larger the distance between plates, the lower these frequencies. The solid line corresponds to far-field conditions (λ = 0.02) and the dotted line to near-field conditions (λ = 0.5).

It can be shown, after taking the derivative of G(f, 0) in Eq. (10.30) and equating it to zero, that the maxima are given by

f_n = \frac{c (2n - 1)}{4d}    (10.31)

with n = 1, 2, \ldots. We can observe from Eq. (10.31) that the larger the plate separation, the lower the first maximum. The increase in frequency response, or sensitivity, in the near field compared to the far field is a measure of noise cancellation; consequently, the microphone is said to be noise canceling. The microphone is also referred to as a differential or gradient microphone, since it measures the gradient (difference) in sound pressure between two points in space. The boost in low-frequency response in the near field is also referred to as the proximity effect, often exploited by singers, who boost their bass levels by moving the microphone closer to their mouths.

By evaluating Eq. (10.30) it can be seen that low-frequency sounds in a bidirectional microphone are not reproduced as well as higher frequencies, leading to a thin-sounding mic. The resistive material of a unidirectional microphone reduces the high-frequency response and makes the microphone reproduce low and high frequencies more equally than the bidirectional microphone.

Let's interpret Figure 10.12. The net sound pressure between these two points, separated by a distance D = 2d, is influenced by two factors: phase shift and the inverse square law.
The influence of the sound-wave phase shift is less at low frequencies than at high, because the distance D between the front and rear port entries becomes a small fraction of the low-frequency wavelength. Therefore, there is little phase shift between the ports at low frequencies, and the opposite sides of the diaphragm receive nearly equal amplitude and phase. The result is slight diaphragm motion and a weak microphone output signal. At higher frequencies, the distance D between sound ports becomes a larger fraction of the wavelength, so more phase shift exists across the diaphragm, and this causes a higher microphone output. The pressure difference caused by phase shift rises with frequency at a rate of 20 dB per decade.

As the frequency rises to where the microphone port spacing D equals half a wavelength, the net pressure is at its maximum. In this situation, the diaphragm movement is also at its maximum, since the front and rear see equal amplitudes but opposite polarities of the wave front. This results in a peak in the microphone frequency response, as illustrated in Figure 10.12. As the frequency continues to rise to where the port spacing D equals one complete wavelength, the net pressure is at its minimum. Here, the diaphragm does not move at all, since the front and rear sides see equal amplitude at the same polarity of the wave front. This results in a dip in the microphone frequency response, as shown in Figure 10.12.

A second factor creating a net pressure difference across the diaphragm is the inverse square law. If the sound-pressure difference between the front and rear ports of a noise-canceling microphone were measured near the sound source and again farther from the source, the near-field measurement would be greater than the far-field one. In other words, the microphone's net pressure difference, and therefore its output signal, is greater in the near sound field than in the far field.
The inverse-square-law effect is independent of frequency. The net pressure that causes the diaphragm to move is a combination of both the phase-shift and inverse-square-law effects. These two factors influence the frequency response of the microphone differently, depending on the distance to the sound source. For distant sound, the influence of the net pressure difference from the inverse-square-law effect is weaker than the phase-shift effect; thus, the rising 20-dB-per-decade frequency response dominates the total frequency response. As the microphone is moved closer to the sound source, the influence of the net pressure difference from the inverse square law is greater than that of the phase shift; thus, the total microphone frequency response is largely flat. The difference in near-field to far-field frequency response is a characteristic of all noise-canceling microphones and applies equally to both acoustic and electronic types.

10.2.2.3. Unidirectional Microphones

Unidirectional microphones are designed to pick up the speaker's voice by directing the audio reception toward the speaker, focusing on the desired input and rejecting sounds emanating from other directions that can negatively impact clear communications, such as computer fan noise or other sounds.

Figure 10.13 Cross section of a unidirectional microphone, showing a speech sound wave arriving from the front and a noise sound wave arriving from the side.

Figure 10.13 shows the cross section of a unidirectional microphone, which also relies upon the principles of a gradient microphone. Notice that the unidirectional mic looks similar to the bidirectional one, except that there is a resistive material (often cloth or foam) between the diaphragm and the opening of one end.
The material's resistive properties slow down the sound pressure on its path from the back opening to the diaphragm. The delay is tuned so that a sound pressure impinging on the back of the microphone takes as long to reach the rear of the diaphragm as it takes to travel around to the front of the diaphragm. If the additional delay through the back plate is given by τ_0, the gain can be obtained by modifying Eq. (10.30):

G(f, \theta) = \frac{e^{-j 2\pi |e^{j\theta} - \lambda| \tau f}}{|e^{j\theta} - \lambda|} - \frac{e^{-j 2\pi (\tau_0 + |e^{j\theta} + \lambda| \tau) f}}{|e^{j\theta} + \lambda|}    (10.32)

Unidirectional microphones have the greatest response to sound waves impinging from one direction, typically referred to as the front, or major axis, of the microphone. One typical response of a unidirectional microphone is the cardioid pattern shown in the polar plot of Figure 10.14, plotted from Eq. (10.32). The frequency response at 0° is similar to that of Figure 10.12. Because the cardioid pattern of polar response is so popular among them, unidirectional mics are often referred to as cardioid mics.

Figure 10.14 Polar response of a unidirectional microphone, obtained through Eq. (10.32) with d = 1 cm, r = 50 cm, c = 33,000 cm/s, f = 1 kHz, and τ_0 = 0.06 ms.

Equation (10.32) was derived from the simplified schematic of Figure 10.10, which is an idealized model, so that in practice the polar response of a real microphone has to be measured empirically. The frequency response and polar pattern of a commercial microphone are shown in Figure 10.15.

Figure 10.15 Characteristics of an AKG C1000S cardioid microphone: (a) frequency response for near- and far-field conditions (note the proximity effect) and (b) polar pattern for different frequencies.
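The cardioid response of Eq. (10.32) can be evaluated numerically in the same way as the bidirectional case. A minimal sketch, using the parameter values quoted in the Figure 10.14 caption in SI units (the function name is illustrative):

```python
import cmath

def unidirectional_gain(f, theta, d=0.01, r=0.5, c=330.0, tau0=0.06e-3):
    """Magnitude of Eq. (10.32): gradient microphone with an extra delay
    tau0 through the resistive material at the rear opening."""
    lam = d / r
    tau = r / c
    p1 = abs(cmath.exp(1j * theta) - lam)
    p2 = abs(cmath.exp(1j * theta) + lam)
    g = (cmath.exp(-2j * cmath.pi * p1 * tau * f) / p1
         - cmath.exp(-2j * cmath.pi * (tau0 + p2 * tau) * f) / p2)
    return abs(g)
```

With τ_0 ≈ 2d/c, a wave arriving from the rear reaches both sides of the diaphragm nearly simultaneously, so the response at 180° is far below that at 0°, producing the cardioid shape of Figure 10.14.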
Although this noise cancellation decreases the overall response to sound pressure (sensitivity) of the microphone, the directional and frequency-response improvements far outweigh the lessened sensitivity. The unidirectional microphone is particularly well suited for use as a desktop mic or as part of an embedded microphone in a laptop or desktop computer. Unidirectional microphones achieve superior noise-rejection performance over omnidirectionals. Such performance is necessary for clean audio input and for audio signal processing algorithms such as acoustic echo cancellation, which form the core of speakerphone applications.

10.2.3. Other Transduction Categories

In a passive microphone, sound energy is directly converted to electrical energy, whereas an active microphone requires an external energy source that is modulated by the sound wave. Active transducers thus require phantom power, but can have higher sensitivity.

We can also classify microphones according to the physical property to which the sound wave responds. A pressure microphone has an electrical response that corresponds to the pressure in a sound wave, while a pressure-gradient microphone has a response corresponding to the difference in pressure across some distance in a sound wave. A pressure microphone is a fine reproducer of sound, but a gradient microphone typically has its greatest response in the direction of a desired signal or talker and rejects undesired background sounds. This is particularly beneficial in applications that rely upon the reproduction of only a desired signal, where any undesired signal entering the reproduction severely degrades performance. Such is the case in voice recognition or speakerphone applications.

In terms of the mechanism by which they create an electrical signal corresponding to the sound wave they detect, microphones are classified as electromagnetic, electrostatic, and piezoelectric.
Dynamic microphones are the most popular type of electromagnetic microphone, and condenser microphones the most popular type of electrostatic microphone. Electromagnetic microphones induce voltage based on a varying magnetic field. Ribbon microphones are a type of electromagnetic microphone that employs a thin metal ribbon suspended between the poles of a magnet. Dynamic microphones are electromagnetic microphones that employ a moving coil suspended by a light diaphragm (see Figure 10.16), acting like a loudspeaker but in reverse. The diaphragm moves with changes in sound pressure, which in turn moves the coil, which causes current to flow as lines of flux from the magnet are cut. Dynamic microphones need no batteries or power supply, but they deliver low signal levels that need to be preamplified.

Figure 10.16 Schematic of a dynamic microphone: a coil attached to the diaphragm moves within the field of a magnet, producing the output voltage.

Piezoresistive and piezoelectric microphones are based on the variation of the electric resistance of their sensor induced by changes in sound pressure. Carbon button microphones consist of a small cylinder packed with tiny granules of carbon that, when compacted by sound pressure, reduce the electric resistance. Such microphones, often used in telephone handsets, offer a worse frequency response than condenser microphones, and a lower dynamic range.

10.3. ADAPTIVE ECHO CANCELLATION (AEC)

If a spoken language system allows the user to talk while speech is being output through the loudspeakers, the microphone picks up not only the user's voice, but also the speech from the loudspeaker. This problem may be avoided with a half-duplex system that does not listen when a signal is being played through the loudspeaker, though such systems offer an unnatural user experience. On the other hand, a full-duplex system that allows barge-in by the user to interrupt the system offers a better user experience.
For barge-in to work, the signal played through the loudspeaker needs to be canceled. This is achieved with echo cancellation (see Figure 10.17), as discussed in this section. In hands-free conferencing, the local user's voice is output by the remote loudspeaker, captured by the remote microphone, and, after some delay, output by the local loudspeaker. People are tolerant of these echoes if either they are greatly attenuated or the delay is short. Perceptual studies have shown that the longer the delay, the greater the attenuation needed for user acceptance.

Figure 10.17 Block diagram of an echo-canceling application: x[n] represents the signal from the loudspeaker, s[n] the speech signal, v[n] the local background noise, and e[n] the signal that goes to the microphone.

The use of echo cancellation is mandatory in telephone communications and hands-free conferencing when full-duplex voice communication is desired. This is particularly important when the call is routed through a satellite, which can introduce delays larger than 200 ms. A block diagram is shown in Figure 10.18. In Figure 10.17, the return signal r[n] is the sum

r[n] = d[n] + s[n]    (10.33)

where s[n] is the speech signal and d[n] is the attenuated and possibly distorted version of the loudspeaker's signal x[n]. The purpose of the echo canceler is to remove the echo d[n] from the return signal r[n], which is done by means of an adaptive FIR filter whose coefficients are computed to minimize the energy of the canceled signal e[n]. The filter coefficients are reestimated adaptively to track slowly changing line conditions.

Figure 10.18 Block diagram of echo canceling for a telephone communication.
In Figure 10.18, x[n] represents the remote call signal and s[n] the local outgoing signal. The hybrid circuit H does a two- to four-wire conversion and is nonideal because of impedance mismatches.

This problem is essentially that of adaptive filtering only when s[n] = 0, in other words, when the local user is silent. For this reason, you have to implement a double-talk detection module that detects when the speaker is silent. This is typically feasible because the echo d[n] is usually small, so if the return signal r[n] has high energy, it means that the user is not silent. Errors in double-talk detection result in divergence of the filter, so it is generally preferable to be conservative in the decision and, when in doubt, not adapt the filter coefficients. Initialization can be done by sending a known signal with a white spectrum.

The quality of the filtering is measured by the so-called echo-return loss enhancement (ERLE):

\text{ERLE (dB)} = 10 \log_{10} \frac{E\{d^2[n]\}}{E\{(d[n] - \hat{d}[n])^2\}}    (10.34)

The filter coefficients are chosen to maximize the ERLE. Since the telephone-line characteristics, or the acoustic path (due to speaker movement), can change over time, the filter is often adaptive. Another reason for adaptive filters is that reliable ERLE maximization requires a large number of samples, and such a delay is not tolerable.

In the following sections, we describe the fundamentals of adaptive filtering. While there are some nonlinear adaptive filters, the vast majority are linear FIR filters, with the LMS algorithm being the most important. We introduce the LMS algorithm, study its convergence properties, and present two extensions: the normalized LMS algorithm and transform-domain LMS algorithms.
10.3.1. The LMS Algorithm

Let's assume that a desired signal d[n] is generated from an input signal x[n] as follows:

d[n] = \sum_{k=0}^{L-1} g_k x[n-k] + u[n] = \mathbf{G}^T \mathbf{X}[n] + u[n]    (10.35)

with \mathbf{G} = \{g_0, g_1, \ldots, g_{L-1}\}, the input signal vector \mathbf{X}[n] = \{x[n], x[n-1], \ldots, x[n-L+1]\}, and u[n] being noise that is independent of x[n]. We want to estimate d[n] in terms of the sum of previous samples of x[n]. To do that, we define the estimate signal y[n] as

y[n] = \sum_{k=0}^{L-1} w_k[n] x[n-k] = \mathbf{W}^T[n] \mathbf{X}[n]    (10.36)

where \mathbf{W}[n] = \{w_0[n], w_1[n], \ldots, w_{L-1}[n]\} is the time-dependent coefficient vector. The instantaneous error between the desired and the estimated signal is given by

e[n] = d[n] - \mathbf{W}^T[n] \mathbf{X}[n]    (10.37)

The least mean square (LMS) algorithm updates the value of the coefficient vector in the steepest-descent direction

\mathbf{W}[n+1] = \mathbf{W}[n] + \varepsilon \, e[n] \mathbf{X}[n]    (10.38)

where ε is the step size. This algorithm is very popular because of its simplicity and effectiveness [58].

10.3.2. Convergence Properties of the LMS Algorithm

The choice of ε is important: if it is too small, the adaptation rate will be slow and it might not even track the nonstationary trends of x[n], whereas if ε is too large, the error might actually increase. We now analyze the conditions under which the LMS algorithm converges. Let's define the error in the coefficient vector V[n] as

\mathbf{V}[n] = \mathbf{G} - \mathbf{W}[n]    (10.39)

and combine Eqs. (10.37), (10.38), and (10.39) to obtain

\mathbf{V}[n+1] = \mathbf{V}[n] - \varepsilon \mathbf{X}[n]\mathbf{X}^T[n]\mathbf{V}[n] - \varepsilon u[n]\mathbf{X}[n]    (10.40)

Taking expectations in Eq. (10.40) results in

E\{\mathbf{V}[n+1]\} = E\{\mathbf{V}[n]\} - \varepsilon E\{\mathbf{X}[n]\mathbf{X}^T[n]\mathbf{V}[n]\}    (10.41)

where we have assumed that u[n] and x[n] are independent and that either is a zero-mean process. Finally, we express the autocorrelation of X[n] as

\mathbf{R}_{xx} = E\{\mathbf{X}[n]\mathbf{X}^T[n]\} = \mathbf{Q}\mathbf{\Lambda}\mathbf{Q}^T    (10.42)

where Q is a matrix of its eigenvectors and Λ is a diagonal matrix of its eigenvalues \{\lambda_0, \lambda_1, \ldots, \lambda_{L-1}\}, which are all real valued because of the symmetry of \mathbf{R}_{xx}.
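To make Eqs. (10.36)–(10.38) concrete, here is a minimal single-talk (s[n] = 0) echo-cancellation sketch. The 3-tap echo path G, the white input, and all parameter values are hypothetical; the result is scored with the ERLE of Eq. (10.34) estimated from sample averages:

```python
import math
import random

def lms_cancel(x, r, L, eps):
    """Adapt W with Eq. (10.38), W[n+1] = W[n] + eps*e[n]*X[n], and return
    the echo estimates d_hat[n] = W^T[n] X[n] of Eq. (10.36)."""
    w = [0.0] * L
    d_hat = [0.0] * len(x)
    for n in range(L - 1, len(x)):
        X = [x[n - k] for k in range(L)]          # X[n] = {x[n], ..., x[n-L+1]}
        d_hat[n] = sum(wk * xk for wk, xk in zip(w, X))
        e = r[n] - d_hat[n]                       # e[n], Eq. (10.37)
        w = [wk + eps * e * xk for wk, xk in zip(w, X)]
    return d_hat

random.seed(1)
G = [0.5, -0.3, 0.1]                              # hypothetical echo path
x = [random.uniform(-1, 1) for _ in range(20000)]
d = [sum(g * x[n - k] for k, g in enumerate(G)) if n >= 2 else 0.0
     for n in range(len(x))]
d_hat = lms_cancel(x, d, L=3, eps=0.05)

# ERLE of Eq. (10.34), measured over the second half (after convergence)
half = len(x) // 2
num = sum(v * v for v in d[half:])
den = sum((v - w) ** 2 for v, w in zip(d[half:], d_hat[half:]))
erle = 10.0 * math.log10(num / den)
```

The step size ε = 0.05 satisfies the rule of thumb of Eq. (10.50) for this input power, and the measured ERLE is large once the coefficients have converged to the echo path.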
Although we know that X[n] and V[n] are not statistically independent, we assume in this section that they are, so that we can obtain some insight into the convergence properties. With this assumption, Eq. (10.41) can be expressed as

E\{\mathbf{V}[n+1]\} = (\mathbf{I} - \varepsilon \mathbf{R}_{xx}) E\{\mathbf{V}[n]\}    (10.44)

which, applied recursively, leads to

E\{\mathbf{V}[n]\} = (\mathbf{I} - \varepsilon \mathbf{R}_{xx})^n E\{\mathbf{V}[0]\}    (10.45)

Using Eqs. (10.39) and (10.42) in (10.45), we can express the (i + 1)th element of E\{\mathbf{W}[n]\} as

E\{w_i[n]\} = g_i + \sum_{j=0}^{L-1} q_{ij} (1 - \varepsilon \lambda_j)^n E\{\tilde{v}_j[0]\}    (10.46)

where q_{ij} is the (i + 1, j + 1)th element of the eigenvector matrix Q, and \tilde{v}_j[n] is the (j + 1)th element of the rotated coefficient error vector defined as

\tilde{\mathbf{V}}[n] = \mathbf{Q}^T \mathbf{V}[n]    (10.47)

From Eq. (10.46) we see that the mean value of the LMS filter coefficients converges exponentially to the true value if

0 < \varepsilon < 1 / \lambda_j    (10.48)

for all j, so that the adaptation constant ε must be determined from the largest eigenvalue of \mathbf{R}_{xx} for the mean LMS algorithm to converge. In practice, mean convergence doesn't tell us the nature of the fluctuations that the coefficients experience. Analysis of the variance of V[n], together with some more approximations, results in mean-squared convergence if

0 < \varepsilon < \frac{K}{L \sigma_x^2}    (10.49)

with \sigma_x^2 = E\{x^2[n]\} being the input signal power and K a constant that depends weakly on the nature of the input signal statistics but not on its power. Because of the inaccuracies of the independence assumptions above, a rule of thumb used in practice to determine the adaptation constant ε is

0 < \varepsilon < \frac{0.1}{L \sigma_x^2}    (10.50)

Choosing the largest allowable value of ε in Eq. (10.49) makes the LMS algorithm track nonstationary variations in x fastest and achieve faster convergence. On the other hand, the misadjustment of the filter coefficients increases as both the filter length L and the adaptation constant ε increase.
For this reason, the adaptation constant is often made a function of n (ε[n]), with larger values at first and smaller values once convergence has been determined.

10.3.3. Normalized LMS Algorithm

The normalized LMS algorithm (NLMS) uses the result of Eq. (10.49) and, therefore, defines a normalized step size

\varepsilon[n] = \frac{\varepsilon}{\delta + L \hat{\sigma}_x^2[n]}    (10.51)

where the constant δ avoids a division by zero and \hat{\sigma}_x^2[n] is an estimate of the input signal power, typically computed with an exponential window

\hat{\sigma}_x^2[n] = (1 - \beta) \hat{\sigma}_x^2[n-1] + \beta x^2[n]    (10.52)

or a sliding rectangular window

\hat{\sigma}_x^2[n] = \frac{1}{N} \sum_{i=0}^{N-1} x^2[n-i] = \hat{\sigma}_x^2[n-1] + \frac{1}{N} \left( x^2[n] - x^2[n-N] \right)    (10.53)

where β and N control the effective memory of the estimators in Eqs. (10.52) and (10.53), respectively. Finally, we need to pick ε so that 0 < ε < 2 to assure convergence. The NLMS algorithm simplifies the selection of ε, and the NLMS often converges faster than the LMS algorithm in practical situations.

10.3.4. Transform-Domain LMS Algorithm

As discussed in Section 10.3.2, convergence of the LMS algorithm is determined by the largest eigenvalue of the input. Since complex exponentials are approximate eigenvectors of LTI systems, the LMS algorithm's convergence is dominated by the frequency band with the largest energy, and convergence in other frequency bands is generally much slower. This is the rationale for the subband LMS algorithm, which runs independent LMS algorithms in different frequency bands, as proposed by Boll [14]. The block LMS (BLMS) algorithm keeps the coefficients unchanged for a block k of L samples:

\mathbf{W}[k+1] = \mathbf{W}[k] + \varepsilon \sum_{m=0}^{L-1} e[kL+m] \mathbf{X}[kL+m]    (10.54)

The update in Eq. (10.54) can be represented as a convolution and therefore can be implemented efficiently using length-2N FFTs according to the overlap-save method of Figure 10.19.
Notice that implementing a linear convolution with a circular convolution operator such as the FFT requires the use of the dashed box.

Figure 10.19 Block diagram of the constrained frequency-domain block LMS algorithm, implemented with length-2N FFTs using the overlap-save method. The unconstrained version of this algorithm eliminates the computation inside the dashed box.

An unconstrained frequency-domain LMS algorithm can be implemented by removing the constraint in Figure 10.19, therefore implementing a circular instead of a linear convolution. While this is not exact, the algorithm requires only three FFTs instead of five. In some practical applications, there is no difference in convergence between the constrained and unconstrained cases.

10.3.5. The RLS Algorithm

The search for the optimum filter can be accelerated when the gradient vector is suitably rotated toward the minimum. This approach uses the Newton-Raphson method to iteratively compute the root of f(x) (see Figure 10.20), so that the value at iteration i + 1 is given by

x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}    (10.55)

Figure 10.20 Newton-Raphson method to compute the roots of a function.

To minimize a function f(x), we thus compute the roots of f'(x) through the above method:

x_{i+1} = x_i - \frac{f'(x_i)}{f''(x_i)}    (10.56)

In the case of a vector, Eq. (10.56) is transformed into

\mathbf{w}_{i+1} = \mathbf{w}_i - \varepsilon[n] \left( \nabla^2 e(\mathbf{w}_i) \right)^{-1} \nabla e(\mathbf{w}_i)    (10.57)

where we add a step size ε[n], and where \nabla^2 e(\mathbf{w}_i) is the Hessian of the least-squares function, which, for Eq. (10.37), equals the autocorrelation of x:

\nabla^2 e(\mathbf{w}_i) = \mathbf{R}[n] = E\{\mathbf{x}[n]\mathbf{x}^T[n]\}    (10.58)

The recursive least squares (RLS) algorithm specifies a method of estimating Eq.
(10.58) using an exponential window:

\mathbf{R}[n] = \lambda \mathbf{R}[n-1] + \mathbf{x}[n]\mathbf{x}^T[n]    (10.59)

While the RLS algorithm converges faster than the LMS algorithm, it is also more computationally expensive, as it requires a matrix inversion for every sample. Several algorithms have been derived to speed it up [54].

10.4. MULTIMICROPHONE SPEECH ENHANCEMENT

The use of more than one microphone is motivated by the human auditory system, in which the use of both ears has been shown to enhance detection of the direction of arrival of sound, as well as to increase the SNR relative to when one ear is covered. The methods the human auditory system uses to accomplish this are still not completely known, and the techniques described in this section do not mimic that behavior.

Microphone arrays use multiple microphones and knowledge of the microphone locations to predict delays and thus create a beam that focuses on the direction of the desired speaker and rejects signals coming from other angles. Reverberation, as discussed in Section 10.1.2, can be combated with these techniques. Blind source separation techniques are another family of statistical techniques that typically do not use spatial constraints, but rather the statistical independence between different sources. While in this section we describe only linear processing, i.e., the output speech is a linearly filtered version of the microphone signals, these techniques could also be combined with the nonlinear methods of Section 10.5.

10.4.1. Microphone Arrays

The goals of microphone arrays are twofold: finding the position of a sound source in a room, and improving the SNR of the received signal. Steering is helpful in videoconferencing, where a camera has to follow the current speaker. Since the speaker is typically far away from the microphone, the received signal likely contains a fair amount of additive noise. Microphone arrays can also be used to increase the SNR. Let x[n] be the signal at the source S.
Microphone i picks up a signal

y_i[n] = x[n] * g_i[n] + v_i[n]    (10.60)

that is a filtered version of the source plus additive noise v_i[n]. If we have N such microphones, we can attempt to recover x[n], because all the signals y_i[n] should be correlated. A typical assumption is that all the filters g_i[n] are delayed versions of the same filter g[n]:

g_i[n] = g[n - D_i]    (10.61)

with the delay D_i = d_i / c, d_i being the distance between the source S and microphone i, and c the speed of sound in air. We cannot recover the signal x[n] without knowledge of g[n] or the signal itself, so the goal is to obtain the filtered signal y[n]

y[n] = x[n] * g[n]    (10.62)

so that, combining Eqs. (10.60), (10.61), and (10.62),

y_i[n] = y[n - D_i] + v_i[n]    (10.63)

Assuming the v_i[n] are independent and Gaussian, the optimal estimate of y[n] is given by

\hat{y}[n] = \frac{1}{N} \sum_{i=0}^{N-1} y_i[n + D_i] = y[n] + \bar{v}[n]    (10.64)

which is the so-called delay-and-sum beamformer [24, 29], where the residual noise

\bar{v}[n] = \frac{1}{N} \sum_{i=0}^{N-1} v_i[n + D_i]    (10.65)

has a variance that decreases as the number of microphones N increases, since the noises v_i[n + D_i] are uncorrelated.

Equation (10.64) requires estimation of the delays D_i. To attenuate the additive noise, it is not necessary to identify the absolute delays, but only the delays relative to one reference microphone (for example, the center microphone). It can be shown that the maximum likelihood solution consists in maximizing the energy of \hat{y}[n] in Eq. (10.64), which is the sum of cross-correlations:

\hat{D}_i = \arg\max_{D_i} \left( \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} R_{ij}[D_i - D_j] \right), \quad 0 \le i < N    (10.66)

This approach assumes that we know nothing about the geometry of the microphone placement. In fact, given a point source and assuming no reflections, we can compute the delay based on the distance between the source and the microphone.
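A minimal delay-and-sum sketch of Eq. (10.64), under the simplifying assumptions that the relative delays D_i are known, integer numbers of samples and the noises are white; the source signal, delays, and noise level are all hypothetical:

```python
import math
import random

def delay_and_sum(mics, delays):
    """Eq. (10.64): y_hat[n] = (1/N) * sum_i y_i[n + D_i], with known
    nonnegative integer sample delays D_i relative to the reference mic."""
    N = len(mics)
    length = min(len(m) - d for m, d in zip(mics, delays))
    return [sum(m[n + d] for m, d in zip(mics, delays)) / N
            for n in range(length)]

random.seed(2)
y = [math.sin(0.05 * math.pi * n) for n in range(1000)]   # hypothetical source
delays = [0, 3, 6, 9, 12]                                 # integer-sample D_i
mics = [[(y[n - d] if n >= d else 0.0) + random.gauss(0, 0.5)
         for n in range(1000)] for d in delays]

out = delay_and_sum(mics, delays)
noise_in = sum((mics[0][n] - y[n]) ** 2 for n in range(900)) / 900
noise_out = sum((out[n] - y[n]) ** 2 for n in range(900)) / 900
```

Averaging the N aligned signals leaves the delay-compensated component intact while the residual noise variance drops by roughly a factor of N, per Eq. (10.65).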
The use of geometry allows us to reduce the number of parameters to estimate from (N − 1) to a maximum of 3, in case we desire to estimate the exact source location. This location is often described in spherical coordinates (φ, θ, ρ), with φ being the direction of arrival, θ the elevation angle, and ρ the distance to the reference microphone, as shown in Figure 10.21.

Figure 10.21 Spherical coordinates (φ, θ, ρ), with φ being the direction of arrival, θ the elevation angle, and ρ the distance between the speaker and the reference microphone.

While 2-D and 3-D microphone configurations can be used, which would allow us to determine not just the steering angle φ but also the distance ρ and the elevation θ, linear microphone arrays are the most widely used configurations because they are the simplest. In a linear array all the microphones are placed on a line (see Figure 10.22). In this case, we cannot determine the elevation angle θ. To determine both φ and ρ we need at least two microphones in the array. If the microphones are relatively close to each other compared to the distance to the source, the angle of arrival φ is approximately the same for all microphones. With this assumption, the normalized delay D_i with respect to the reference microphone is given by

D_i = -a_i \sin(\varphi) / c    (10.67)

where a_i is the y-axis coordinate in Figure 10.22 for microphone i, with the reference microphone having a_0 = 0 and thus D_0 = 0.

Figure 10.22 Linear microphone array (five microphones). The source signal arrives at each microphone with a different delay, which allows us to find the correct angle of arrival.

With this approximation, we define D_i(φ), the relative delay of the signal at microphone i with respect to the reference microphone, as a function of the direction-of-arrival angle φ and independent of ρ.
The optimal direction of arrival φ is then the one that maximizes the energy of the estimated signal over a set of samples:

\hat{\varphi} = \arg\max_{\varphi} \sum_n \left( \frac{1}{N} \sum_{i=0}^{N-1} y_i[n + D_i(\varphi)] \right)^2 = \arg\max_{\varphi} \sum_n \left( \frac{1}{N} \sum_{i=0}^{N-1} y_i[n - a_i \sin(\varphi)/c] \right)^2    (10.68)

The term beamforming conveys that this array favors a specific direction of arrival φ and that sources arriving from other directions are not in phase and therefore are attenuated. Since the source can move over time, the maximization of Eq. (10.68) can be done in an adaptive fashion.

As the beam is steered away from the broadside, the system exhibits a reduction in spatial discrimination because the beam pattern broadens. Furthermore, the beamwidth varies with frequency, so an array has an approximate bandwidth given by the upper and lower frequencies f_u and f_l:

f_u = \frac{c}{d \max_{\varphi, \varphi'} |\cos\varphi - \cos\varphi'|}, \qquad f_l = \frac{f_u}{N}    (10.69)

with d being the sensor spacing, φ′ the steering angle measured with respect to the axis of the array, and φ the direction of the source. For a desired steering range of ±30° and five sensors spaced 5 cm apart, the range is approximately 880 to 4400 Hz. We see in Figure 10.23 that at very low frequencies the response is essentially omnidirectional, since the microphone spacing is small compared to the large wavelength. At high frequencies more lobes start appearing, and the array steers not only toward the preferred direction but toward others as well. For speech signals, the upshot is that we either need a lot of microphones to provide a directional polar pattern at low frequencies, or we need them to be spread far enough apart, or both.
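The maximization of Eq. (10.68) can be sketched over a discrete grid of candidate angles, with the delays of Eq. (10.67) rounded to whole samples. The array geometry, sampling rate, candidate grid, and source below are all hypothetical:

```python
import math
import random

def estimate_doa(mics, positions, c, fs, angles):
    """Eq. (10.68): pick the steering angle that maximizes the energy of the
    delay-and-sum output, with D_i = -a_i*sin(phi)/c rounded to samples."""
    best_angle, best_energy = None, -1.0
    for phi in angles:
        delays = [int(round(-a * math.sin(phi) / c * fs)) for a in positions]
        shift = -min(delays)                      # make all delays nonnegative
        delays = [d + shift for d in delays]
        length = min(len(m) - d for m, d in zip(mics, delays))
        energy = sum((sum(m[n + d] for m, d in zip(mics, delays)) / len(mics)) ** 2
                     for n in range(length))
        if energy > best_energy:
            best_angle, best_energy = phi, energy
    return best_angle

random.seed(3)
c, fs = 340.0, 48000
positions = [-0.10, -0.05, 0.0, 0.05, 0.10]       # hypothetical linear array (m)
phi_true = math.radians(20)
true_d = [int(round(-a * math.sin(phi_true) / c * fs)) for a in positions]
shift = -min(true_d)
s = [random.uniform(-1, 1) for _ in range(4000)]  # broadband source
mics = [[s[n - (d + shift)] if n >= d + shift else 0.0 for n in range(4000)]
        for d in true_d]

angles = [math.radians(a) for a in (-40, -20, 0, 20, 40)]
phi_hat = estimate_doa(mics, positions, c, fs, angles)
```

A broadband source is assumed: sample-level misalignment then decorrelates the sum, so only the correct steering angle adds the channels in phase.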
Figure 10.23 Polar pattern of a microphone array with steering angle φ′ = 0 and five microphones spaced 5 cm apart, for 400, 880, 4400, and 8000 Hz from left to right, respectively, for a source located at 5 m.

The polar pattern in Figure 10.23 was computed as follows:

P(f, r, \varphi) = \sum_{i=1}^{N} \frac{e^{-j 2\pi f \left[ a_i \sin\varphi' + | r e^{j\varphi} - j a_i | \right] / c}}{| r e^{j\varphi} - j a_i |}    (10.70)

The sensors could also be spaced nonuniformly, as in Figure 10.24, allowing for better behavior across the frequency spectrum.

Figure 10.24 Nonuniform linear microphone array containing three subarrays for the high, mid, and low frequencies.

Once a microphone array has been steered toward a direction φ′, it attenuates noise sources coming from other directions. The beamwidth depends not only on the frequency of the signal, but also on the steering direction. If the beam is steered toward a direction φ′, the direction of the source for which the beam response falls to half its power has been found empirically to be

\varphi_{3\text{dB}}(f) = \cos^{-1} \left( \cos\varphi' \pm \frac{K}{N d f} \right)    (10.71)

with K being a constant. Equation (10.71) shows that the smaller the array, the wider the beam, and that lower frequencies also yield wider beams. Figure 10.25 shows that the bandwidth of the array when steering toward a 30° direction is lower than when steering at 0°.
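Equation (10.70) itself is straightforward to evaluate. A sketch (function name and five-element geometry illustrative, c = 330 m/s as in the earlier figures):

```python
import cmath
import math

def array_pattern(f, r, phi, positions, steer=0.0, c=330.0):
    """Magnitude of Eq. (10.70): response of a linear array with sensor
    coordinates a_i (meters), steered to angle `steer`, for a point source
    at distance r (meters) and direction phi."""
    p = 0j
    for a in positions:
        dist = abs(r * cmath.exp(1j * phi) - 1j * a)  # |r e^{j phi} - j a_i|
        p += cmath.exp(-2j * math.pi * f * (a * math.sin(steer) + dist) / c) / dist
    return abs(p)

# Hypothetical five-element array spaced 5 cm apart, steered broadside
pos = [-0.10, -0.05, 0.0, 0.05, 0.10]
```

For broadside steering, the mid-band response at φ = 0 clearly exceeds that at φ = 60°, consistent with the patterns of Figure 10.23.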
Figure 10.25 Polar pattern of a microphone array with steering angle φ′ = 30°, five microphones spaced 5 cm apart, for 400, 880, 3000, and 4400 Hz from left to right, respectively, for a source located at 5 m.

Microphone arrays have been shown to improve recognition accuracy when the microphones and the speaker are far apart [51]. Several companies are commercializing microphone arrays for teleconferencing or speech recognition applications.

Only in anechoic chambers does the assumption in Eq. (10.61) hold, since in practice many reflections take place, which are also different for different microphones. In addition, the assumption of a common direction of arrival for all microphones may not hold either. In such reverberant environments, simple delay-and-sum beamformers typically fail. While computing the direction of arrival is much more difficult in this case, the SNR can still be improved. Let's define the desired signal d[n] as that obtained at the reference microphone. We can estimate the vector

\mathbf{H}[n] = \{h_{11}, \ldots, h_{1L}, h_{21}, \ldots, h_{2L}, \ldots, h_{(N-1)1}, \ldots, h_{(N-1)L}\}

for the (N − 1) L-tap filters that minimizes the error [25]

e[n] = d[n] - \mathbf{H}[n]\mathbf{Y}[n]    (10.72)

where the (N − 1) microphone signals are represented in the vector

\mathbf{Y}[n] = \{y_1[n], \ldots, y_1[n-L+1], y_2[n], \ldots, y_2[n-L+1], \ldots, y_{N-1}[n], \ldots, y_{N-1}[n-L+1]\}

The filter coefficients H[n] can be estimated through the adaptive filtering techniques described in Section 10.3. The clean signal is then estimated as

\hat{x}[n] = \frac{1}{2} \left( d[n] + \mathbf{H}[n]\mathbf{Y}[n] \right)    (10.73)

This last method does not assume anything about the geometry of the microphone array.
Blind Source Separation

The problem of separating the desired speech from interfering sources, the cocktail party effect [15], has been one of the holy grails of signal processing. Blind source separation (BSS) is a set of techniques that assume no information about the mixing process or the sources, apart from their mutual statistical independence; hence the term blind. Independent component analysis (ICA), developed in the last few years [19, 38], is a set of techniques to solve the BSS problem that estimate a set of linear filters to separate the mixed signals under the assumption that the original sources are statistically independent.

Let's first consider instantaneous mixing. Assume that R microphone signals $y_i[n]$, denoted by $\mathbf{y}[n] = (y_1[n], y_2[n], \ldots, y_R[n])$, are obtained by a linear combination of R unobserved source signals $x_i[n]$, denoted by $\mathbf{x}[n] = (x_1[n], x_2[n], \ldots, x_R[n])$:

$$\mathbf{y}[n] = \mathbf{G}\mathbf{x}[n] \qquad (10.74)$$

for all n, with G being the R × R mixing matrix. This mixing is termed instantaneous, since the sensor signals at time n depend on the sources at the same, but no earlier, time point. Had the mixing matrix been given, its inverse could have been applied to the sensor signals to recover the sources by $\mathbf{x}[n] = \mathbf{G}^{-1}\mathbf{y}[n]$. In the absence of any information about the mixing, the blind separation problem consists of estimating a separating matrix $\mathbf{H} = \mathbf{G}^{-1}$ from the observed microphone signals alone. The source signals can then be recovered by

$$\mathbf{x}[n] = \mathbf{H}\mathbf{y}[n] \qquad (10.75)$$

We use here the probabilistic formulation of ICA, though alternate frameworks for ICA have also been derived [18].
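As a toy illustration of Eqs. (10.74) and (10.75), the following sketch mixes two signals with a known matrix G and recovers them with H = G⁻¹. The signals and the mixing matrix are hypothetical; the blind problem, of course, is estimating H without knowing G:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent "source" signals (hypothetical: a sine and Laplacian noise)
n = np.arange(1000)
x = np.vstack([np.sin(0.05 * n),           # source 1
               rng.laplace(size=1000)])    # source 2

# Instantaneous mixing, Eq. (10.74): y[n] = G x[n]
G = np.array([[1.0, 0.6],
              [0.4, 1.0]])
y = G @ x

# With G known, separation is just Eq. (10.75) with H = G^-1
H = np.linalg.inv(G)
x_hat = H @ y

print(np.allclose(x_hat, x))  # True: recovery is exact up to rounding
```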
Let $p_x(\mathbf{x}[n])$ be the probability density function (pdf) of the source signals, so that the pdf of the microphone signals y[n] is given by

$$p_y(\mathbf{y}[n]) = |\mathbf{H}| \, p_x(\mathbf{H}\mathbf{y}[n]) \qquad (10.76)$$

and if we furthermore assume the sources x[n] are independent in time of x[n + i], i ≠ 0, then the joint probability is given by

$$p_y(\mathbf{y}[0], \mathbf{y}[1], \ldots, \mathbf{y}[N-1]) = \prod_{n=0}^{N-1} p_y(\mathbf{y}[n]) = |\mathbf{H}|^N \prod_{n=0}^{N-1} p_x(\mathbf{H}\mathbf{y}[n]) \qquad (10.77)$$

whose normalized log-likelihood is given by

$$\Psi = \frac{1}{N} \ln p_y(\mathbf{y}[0], \mathbf{y}[1], \ldots, \mathbf{y}[N-1]) = \ln |\mathbf{H}| + \frac{1}{N} \sum_{n=0}^{N-1} \ln p_x(\mathbf{H}\mathbf{y}[n]) \qquad (10.78)$$

It can be shown that

$$\frac{\partial \ln |\mathbf{H}|}{\partial \mathbf{H}} = \left(\mathbf{H}^T\right)^{-1} \qquad (10.79)$$

so that the gradient of Ψ [38] in Eq. (10.78) is given by

$$\frac{\partial \Psi}{\partial \mathbf{H}} = \left(\mathbf{H}^T\right)^{-1} + \frac{1}{N} \sum_{n=0}^{N-1} \phi\left(\mathbf{H}\mathbf{y}[n]\right) \left(\mathbf{y}[n]\right)^T \qquad (10.80)$$

where φ(x) is expressed as

$$\phi(\mathbf{x}) = \frac{\partial \ln p_x(\mathbf{x})}{\partial \mathbf{x}} \qquad (10.81)$$

If we further assume the distribution is a zero-mean Gaussian with standard deviation σ, then Eq. (10.81) results in

$$\phi(x) = -\frac{x}{\sigma^2} \qquad (10.82)$$

which, inserted into Eq. (10.80), yields

$$\frac{\partial \Psi}{\partial \mathbf{H}} = \left(\mathbf{H}^T\right)^{-1} - \frac{\mathbf{H}}{\sigma^2} \left( \frac{1}{N} \sum_{n=0}^{N-1} \mathbf{y}[n] \left(\mathbf{y}[n]\right)^T \right) = \left(\mathbf{H}^T\right)^{-1} - \frac{\mathbf{H}\mathbf{R}}{\sigma^2} \qquad (10.83)$$

with R being the matrix of cross-correlations, i.e.,

$$R_{ij} = \frac{1}{N} \sum_{n=0}^{N-1} y_i[n] \, y_j[n] \qquad (10.84)$$

Setting Eq. (10.83) to 0 results in maximization of Eq. (10.78) under the Gaussian assumption:

$$\mathbf{H}^T \mathbf{H} = \sigma^2 \mathbf{R}^{-1} \qquad (10.85)$$

which can be solved using the Cholesky decomposition described in Chapter 6. Since σ is generally not known, it can be shown from Eq. (10.85) that the sources can be recovered only to within a scaling factor [17]. Scaling is in general not a big problem, since speech recognition systems perform automatic gain control (AGC). Moreover, the sources can be recovered only to within a permutation. To see this, let's define a two-dimensional matrix A

$$\mathbf{A} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \qquad (10.86)$$

which is orthogonal:

$$\mathbf{A}^T \mathbf{A} = \mathbf{I} \qquad (10.87)$$

If H is a solution of Eq. (10.85), then AH is also a solution.
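Equation (10.85) can be solved via the Cholesky decomposition as the text suggests: if R⁻¹ = LLᵀ with L lower triangular, then H = σLᵀ satisfies HᵀH = σ²R⁻¹. A sketch on synthetic Gaussian data, assuming σ = 1 (the mixing matrix is illustrative); note that this only decorrelates the outputs and, per the discussion above, cannot by itself pin down the separation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Mix two unit-variance Gaussian sources, Eq. (10.74)
x = rng.standard_normal((2, 50000))
G = np.array([[1.0, 0.7], [0.3, 1.0]])
y = G @ x

# Sample cross-correlation matrix, Eq. (10.84)
N = y.shape[1]
R = (y @ y.T) / N

# Solve H^T H = sigma^2 R^{-1}, Eq. (10.85), with sigma = 1:
# if R^{-1} = L L^T (L lower triangular), then H = L^T works.
L = np.linalg.cholesky(np.linalg.inv(R))
H = L.T

# The recovered signals are decorrelated: H R H^T = I exactly
x_hat = H @ y
print(np.allclose(H @ R @ H.T, np.eye(2)))  # True
```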
Thus, a permutation of the sources yields the same correlation matrix in Eq. (10.84). Although we have shown it only under the Gaussian assumption, separation up to a scaling and source permutation is a general result in blind source separation [17].

Unfortunately, the Gaussian assumption does not guarantee separation. To see this, we can define a two-dimensional rotation matrix A

$$\mathbf{A} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \qquad (10.88)$$

which is also orthogonal, so that if H is a solution of Eq. (10.85), then AH is also a solution. The Gaussian assumption entails considering only second-order statistics, and to ensure separation we could consider higher-order statistics. Since speech signals do not follow a Gaussian distribution, we could use a Laplacian distribution, as we saw in Chapter 7:

$$p_x(x) = \frac{\beta}{2} e^{-\beta |x|} \qquad (10.89)$$

which, using Eq. (10.81), results in

$$\phi(x) = \begin{cases} -\beta & x > 0 \\ \beta & x < 0 \end{cases} \qquad (10.90)$$

and thus a nonlinear function of H for Eq. (10.80). Since a closed-form solution is not possible, a common solution in this case is gradient ascent (Ψ is to be maximized), where the gradient is given by

$$\frac{\partial \Psi}{\partial \mathbf{H}_n} = \left(\mathbf{H}_n^T\right)^{-1} + \phi\left(\mathbf{H}_n \mathbf{y}[n]\right) \left(\mathbf{y}[n]\right)^T \qquad (10.91)$$

and the update formula by

$$\mathbf{H}_{n+1} = \mathbf{H}_n + \varepsilon \frac{\partial \Psi}{\partial \mathbf{H}_n} = \mathbf{H}_n + \varepsilon \left[ \left(\mathbf{H}_n^T\right)^{-1} + \phi\left(\mathbf{H}_n \mathbf{y}[n]\right) \left(\mathbf{y}[n]\right)^T \right] \qquad (10.92)$$

which is the so-called infomax rule [10]. Often the nonlinearity in Eq. (10.90) is replaced by a sigmoid⁶ function:

$$\phi(x) = -\beta \tanh(\beta x) \qquad (10.93)$$

which implies a density function

$$p_x(x) = \frac{\beta}{\pi \cosh(\beta x)} \qquad (10.94)$$

The sigmoid nonlinearity converges to the Laplacian one as β → ∞. The nonlinear functions in Eqs. (10.90) and (10.93) can be expanded in Taylor series, so that all the moments of the observed signals are used, and not just the second order, as in the case of the Gaussian assumption.

⁶ The sigmoid function can be expressed in terms of the hyperbolic tangent tanh(x) = sinh(x)/cosh(x), where sinh(x) = (eˣ − e⁻ˣ)/2 and cosh(x) = (eˣ + e⁻ˣ)/2.
These nonlinearities have been shown to be more effective in separating the sources. The use of more accurate density functions for $p_x(\mathbf{x})$, such as a mixture of Gaussians [9], also results in nonlinear φ(x) functions that have shown better separation.

A problem with Eq. (10.92) is that it requires a matrix inversion at every iteration. The so-called natural gradient [7] was suggested to avoid this, while also providing faster convergence. To do this we can multiply the gradient of Eq. (10.91) by a positive definite matrix, for example the inverse of the Fisher information matrix, $\mathbf{H}_n^T \mathbf{H}_n$, to whiten the signal:

$$\mathbf{H}_{n+1} = \mathbf{H}_n + \varepsilon \frac{\partial \Psi}{\partial \mathbf{H}} \mathbf{H}_n^T \mathbf{H}_n \qquad (10.95)$$

which, combined with Eq. (10.91), results in

$$\mathbf{H}_{n+1} = \mathbf{H}_n + \varepsilon \left[ \mathbf{I} + \phi(\hat{\mathbf{x}}[n]) \left(\hat{\mathbf{x}}[n]\right)^T \right] \mathbf{H}_n \qquad (10.96)$$

where the estimated sources are given by

$$\hat{\mathbf{x}}[n] = \mathbf{H}_n \mathbf{y}[n] \qquad (10.97)$$

Notice the similarity of this approach to the RLS algorithm of Section 10.3.5. As in most Newton-Raphson methods, the convergence of this approach is quadratic rather than linear, as long as we are close enough to the maximum.

Another way of overcoming the lack of separation under the independent Gaussian assumption is to make use of temporal information, which we know is important for speech signals. If the model of Eq. (10.74) is extended to contain additive noise

$$\mathbf{y}[n] = \mathbf{G}\mathbf{x}[n] + \mathbf{v}[n] \qquad (10.98)$$

we can compute the autocorrelation of y[n] as

$$\mathbf{R}_y[n] = \mathbf{G} \mathbf{R}_x[n] \mathbf{G}^T + \mathbf{R}_v[n] \qquad (10.99)$$

or, after some manipulation,

$$\mathbf{R}_x[n] = \mathbf{H} \left( \mathbf{R}_y[n] - \mathbf{R}_v[n] \right) \mathbf{H}^T \qquad (10.100)$$

which we know must be diagonal because the sources x are independent, and thus H can be estimated to minimize the squared error of the off-diagonal terms of $\mathbf{R}_x[n]$ for several values of n [11]. Equation (10.100) is a generalization of Eq. (10.85) when considering temporal correlation and additive noise.
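The natural-gradient update of Eq. (10.96) can be sketched in a few lines. Here the update is averaged over the whole batch rather than applied per sample, β is taken as 1 in Eq. (10.93), and the Laplacian sources, mixing matrix, learning rate, and iteration count are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20000

# Two independent super-Gaussian (Laplacian) sources, like speech
x = rng.laplace(size=(2, N))
G = np.array([[1.0, 0.6], [0.5, 1.0]])   # hypothetical mixing matrix
y = G @ x                                # Eq. (10.74)

phi = lambda u: -np.tanh(u)              # Eq. (10.93) with beta = 1

# Natural-gradient updates, Eq. (10.96), in batch-averaged form
W = np.eye(2)
lr = 0.05
for _ in range(2000):
    xh = W @ y                           # Eq. (10.97)
    W = W + lr * (np.eye(2) + (phi(xh) @ xh.T) / N) @ W

# Check separation: each recovered row matches one distinct source,
# up to the scaling and permutation ambiguity discussed above
xh = W @ y
corr = np.abs(np.corrcoef(np.vstack([xh, x]))[:2, 2:])
print(corr.round(2))
```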
Figure 10.26 Convolutional model for the case of two microphones: the sources x₁[n] and x₂[n] pass through mixing filters g_ij[n] to produce the microphone signals y₁[n] and y₂[n], which pass through reconstruction filters h_ij[n] to produce the estimates x̂₁[n] and x̂₂[n].

The case of instantaneous mixing is not realistic, as we need to consider the transfer functions between the sources and the microphones created by the room acoustics. It can be shown that the reconstruction filters $h_{ij}[n]$ in Figure 10.26 completely recover the original signals $x_i[n]$ if and only if the matrix of their z-transforms is the inverse of the matrix of z-transforms of the mixing filters $g_{ij}[n]$:

$$\begin{pmatrix} H_{11}(z) & H_{12}(z) \\ H_{21}(z) & H_{22}(z) \end{pmatrix} = \begin{pmatrix} G_{11}(z) & G_{12}(z) \\ G_{21}(z) & G_{22}(z) \end{pmatrix}^{-1} = \frac{1}{G_{11}(z)G_{22}(z) - G_{12}(z)G_{21}(z)} \begin{pmatrix} G_{22}(z) & -G_{12}(z) \\ -G_{21}(z) & G_{11}(z) \end{pmatrix} \qquad (10.101)$$

If the matrix in Eq. (10.101) is not invertible, separation is impossible. This can happen if both microphones pick up the same signal, which can occur if the two microphones are too close to each other or the two sources are too close to each other. It's reasonable to assume the mixing filters $g_{ij}[n]$ to be FIR, with a length that generally depends on the reverberation time, which in turn depends on the room size, microphone position, wall absorbance, and so on. In general this means that the reconstruction filters $h_{ij}[n]$ have an infinite impulse response. In addition, the filters $g_{ij}[n]$ may have zeros outside the unit circle, so that perfect reconstruction filters would need to have poles outside the unit circle. For this reason it is not possible, in general, to recover the original signals exactly. In practice, it's convenient to assume the reconstruction filters to be FIR of length q, which means that the original signals x₁[n] and x₂[n] will not be recovered exactly.
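When the mixing filters are known and we idealize the convolutions as circular, Eq. (10.101) can be applied bin by bin in the frequency domain, inverting a 2 × 2 matrix at each FFT bin. The sketch below uses hypothetical short FIR mixing filters whose determinant is nonzero on the unit circle; under these idealized assumptions recovery is exact, which is precisely what is lost with real room responses:

```python
import numpy as np

rng = np.random.default_rng(0)
Nfft = 1024

# Hypothetical short FIR mixing filters g_ij[n]: direct path + one echo
g = np.zeros((2, 2, Nfft))
g[0, 0, 0] = 1.0
g[0, 1, 3] = 0.5     # cross-coupling with a 3-sample delay
g[1, 0, 2] = 0.3
g[1, 1, 0] = 1.0     # det = 1 - 0.15 z^-5, nonzero on the unit circle

x = rng.standard_normal((2, Nfft))   # two source signals

# Circular mixing per frequency bin: Y(k) = G(k) X(k)
Gf = np.fft.fft(g, axis=2)           # shape (2, 2, Nfft)
Xf = np.fft.fft(x, axis=1)
Yf = np.einsum('ijk,jk->ik', Gf, Xf)

# Reconstruction filters, Eq. (10.101): invert the 2x2 matrix at each bin
Hf = np.linalg.inv(Gf.transpose(2, 0, 1)).transpose(1, 2, 0)
Xhat_f = np.einsum('ijk,jk->ik', Hf, Yf)
x_hat = np.real(np.fft.ifft(Xhat_f, axis=1))

print(np.allclose(x_hat, x))  # True: exact recovery under circular mixing
```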
Thus the problem consists in estimating the reconstruction filters $h_{ij}[n]$ directly from the microphone signals y₁[n] and y₂[n], so that the estimated signals x̂_i[n] are as close as possible to the original signals. Often we are satisfied if the resulting signals are separated, even if they contain some amount of reverberation. An approach commonly used to combat this problem consists of taking a filterbank and assuming instantaneous mixing within each filter [38]. This approach can separate real sources much more effectively, but it suffers from the problem of permutations, which in this case is more severe, because frequencies from different sources can be mixed together. To avoid this, we may need a probabilistic model of the sources that takes into account correlations across frequencies [3]. Another problem occurs when the number of sources is larger than the number of microphones.

10.5. ENVIRONMENT COMPENSATION PREPROCESSING

The goal of this section is to present a number of techniques used to clean the signal of additive noise and/or channel distortions prior to the speech recognition system. Although the techniques presented here are developed for the case of one microphone, they can be generalized to the case where several microphones are available, using the approaches described in Section 10.4. These techniques can also be used to enhance the signal captured with a speakerphone or a desktop microphone in teleconferencing applications.

Since the human auditory system is so robust to changes in the acoustical environment, many researchers have attempted to develop signal processing schemes that mimic the functional organization of the peripheral auditory system [27, 49]. The PLP cepstrum described in Chapter 6 has also been shown to be very effective in combating noise and channel distortions [60].
Another alternative is to consider the feature vector an integral part of the recognizer, and thus researchers have investigated its design so as to maximize recognition accuracy, as discussed in Chapter 9. Such approaches include LDA [34] and neural networks [45]. These discriminatively trained features can also be optimized to operate better under noisy conditions, thus possibly beating the standard mel-cepstrum, especially when several independent features are combined [50].

The mel-cepstrum is the most popular feature vector for speech recognition. In this context we present a number of techniques that have been proposed over the years to compensate for the effects of additive noise and channel distortions on the cepstrum.

10.5.1. Spectral Subtraction

The basic assumption in this section is that the desired clean signal x[m] has been corrupted by additive noise n[m]:

$$y[m] = x[m] + n[m] \qquad (10.102)$$

and that x[m] and n[m] are statistically independent, so that the power spectrum of the output y[m] can be approximated as the sum of the power spectra:

$$|Y(f)|^2 \approx |X(f)|^2 + |N(f)|^2 \qquad (10.103)$$

with equality if we take expected values, as the expected value of the cross term vanishes (see Section 10.1.3).

Although we don't know $|N(f)|^2$, we can obtain an estimate using the average periodogram over M frames that are known to be just noise (i.e., when no signal is present), as long as the noise is stationary:

$$|\hat{N}(f)|^2 = \frac{1}{M} \sum_{i=0}^{M-1} |Y_i(f)|^2 \qquad (10.104)$$

Spectral subtraction supplies an intuitive estimate for $|X(f)|^2$ using Eqs. (10.103) and (10.104) as

$$|\hat{X}(f)|^2 = |Y(f)|^2 - |\hat{N}(f)|^2 = |Y(f)|^2 \left(1 - \frac{1}{SNR(f)}\right) \qquad (10.105)$$

where we have defined the frequency-dependent signal-to-noise ratio SNR(f) as

$$SNR(f) = \frac{|Y(f)|^2}{|\hat{N}(f)|^2} \qquad (10.106)$$

Equation (10.105) describes the magnitude of the Fourier transform but not the phase.
This is not a problem if we are interested in computing the mel-cepstrum, as discussed in Chapter 6. We can just modify the magnitude and keep the original phase of Y(f) using a filter $H_{ss}(f)$:

$$\hat{X}(f) = Y(f) \, H_{ss}(f) \qquad (10.107)$$

where, according to Eq. (10.105), $H_{ss}(f)$ is given by

$$H_{ss}(f) = \sqrt{1 - \frac{1}{SNR(f)}} \qquad (10.108)$$

Since $|\hat{X}(f)|^2$ is a power spectral density, it has to be positive, and therefore

$$SNR(f) \geq 1 \qquad (10.109)$$

but we have no guarantee that SNR(f), as computed by Eq. (10.106), satisfies Eq. (10.109). In fact, it is easy to see that noise frames do not comply. To enforce this constraint, Boll [13] suggested modifying Eq. (10.108) as follows:

$$H_{ss}(f) = f_{ss}\big(SNR(f)\big) \qquad (10.110)$$

with a ≥ 0, so that the quantity within the square root is always positive, where $f_{ss}(x)$ is given by

$$f_{ss}(x) = \sqrt{\max\left(1 - \frac{1}{x},\; a\right)} \qquad (10.111)$$

It is useful to express SNR(f) in dB, so that

$$x = 10 \log_{10} SNR \qquad (10.112)$$

and the gain of the filter in Eq. (10.111) also in dB:

$$g_{ss}(x) = 20 \log_{10} f_{ss}(x) \qquad (10.113)$$

Using Eqs. (10.111) and (10.112), we can express Eq. (10.113) as

$$g_{ss}(x) = \max\left( 10 \log_{10}\left(1 - 10^{-x/10}\right),\; -A \right) \qquad (10.114)$$

after expressing the attenuation a in dB:

$$a = 10^{-A/10} \qquad (10.115)$$

Equation (10.114) is plotted in Figure 10.27 for A = 10 dB.

The spectral subtraction rule in Eq. (10.111) is quite intuitive. To implement it we can do a short-time analysis, as shown in Chapter 6: take overlapping windowed segments, zero-pad, compute the FFT, modify the magnitude spectrum, take the inverse FFT, and overlap-add the resulting windows. This implementation results in output speech that has significantly less noise, though it exhibits what is called musical noise [12]. This is caused by frequency bands f for which $|Y(f)|^2 \approx |\hat{N}(f)|^2$.
As shown in Figure 10.27, a frequency f₀ for which $|Y(f_0)|^2 < |\hat{N}(f_0)|^2$ is attenuated by A dB, whereas a neighboring frequency f₁, where $|Y(f_1)|^2 > |\hat{N}(f_1)|^2$, receives much smaller attenuation. These rapid changes with frequency introduce tones at varying frequencies that appear and disappear rapidly.

The main reason for the presence of musical noise is that the estimates of SNR(f) through Eqs. (10.104) and (10.106) are poor. This is partly because SNR(f) is computed independently for each frequency, whereas we know that SNR(f₀) and SNR(f₁) are correlated if f₀ and f₁ are close to each other. Thus, one possibility is to smooth the filter in Eq. (10.114) over frequency. This approach suppresses a smaller amount of noise, but it does not distort the signal as much, and thus may be preferred by listeners. Similarly, smoothing over time

$$SNR(f, t) = \gamma \, SNR(f, t-1) + (1 - \gamma) \frac{|Y_t(f)|^2}{|\hat{N}(f)|^2} \qquad (10.116)$$

can also be done to reduce the distortion, at the expense of a smaller noise attenuation. Smoothing over both time and frequency can be done to obtain more accurate SNR measurements, and thus less distortion. As shown in Figure 10.28, the use of spectral subtraction can reduce the error rate.

Figure 10.27 Magnitude of the spectral subtraction filter gain as a function of the input instantaneous SNR for A = 10 dB, for the spectral subtraction of Eq. (10.114), magnitude subtraction of Eq. (10.118), and oversubtraction of Eq. (10.119) with β = 2 dB.

Additionally, the attenuation A can be made a function of frequency. This is useful when we want to suppress more noise at one frequency than another, trading off noise reduction against nonlinear distortion of the speech. Other enhancements to the basic algorithm have been proposed to reduce the musical noise. Sometimes Eq.
(10.111) is generalized to

$$f_{ms}(x) = \left( \max\left(1 - \frac{1}{x^{\alpha/2}},\; a\right) \right)^{1/\alpha} \qquad (10.117)$$

where α = 2 corresponds to the power spectral subtraction rule in Eq. (10.111), and α = 1 corresponds to the magnitude subtraction rule (plotted in Figure 10.27 for A = 10 dB):

$$g_{ms}(x) = \max\left( 20 \log_{10}\left(1 - 10^{-x/20}\right),\; -A \right) \qquad (10.118)$$

Another variation, called oversubtraction, consists of multiplying the estimate of the noise power spectral density $|\hat{N}(f)|^2$ in Eq. (10.104) by a constant $10^{\beta/10}$, with β > 0, which transforms the power spectral subtraction rule of Eq. (10.114) into

$$g_{os}(x) = \max\left( 10 \log_{10}\left(1 - 10^{-(x-\beta)/10}\right),\; -A \right) \qquad (10.119)$$

This causes $|Y(f)|^2 < 10^{\beta/10} |\hat{N}(f)|^2$ to occur more often than not for frames for which $|Y(f)|^2 \approx |\hat{N}(f)|^2$, and thus reduces the musical noise.

Figure 10.28 Word error rate as a function of SNR (dB) using Whisper on the Wall Street Journal 5000-word dictation task. White noise was added at different SNRs, and the system was trained with speech at the same SNR. The solid line represents the baseline system trained with clean speech, the dotted line the use of spectral subtraction with the same clean-speech HMMs. They are compared to a system trained on speech with the same SNR as the speech tested on.

10.5.2. Frequency-Domain MMSE from Stereo Data

You have seen that several possible functions, such as Eqs. (10.114), (10.118), or (10.119), can be used to attenuate the noise, and it is not clear that any one of them is better than the others, since each has been obtained through different assumptions. This opens the possibility of estimating the curve g(x) using a different criterion and, thus, different approximations than those used in Section 10.5.1.
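The three suppression rules can be written directly from Eqs. (10.114), (10.118), and (10.119). A minimal sketch follows; the default A = 10 dB and β = 2 dB simply match the values used for Figure 10.27, and a small floor inside the logarithm avoids log(0) at 0 dB SNR before the −A clipping takes effect:

```python
import numpy as np

def g_ss(x_db, A=10.0):
    """Power spectral subtraction gain in dB, Eq. (10.114)."""
    lin = np.maximum(1 - 10 ** (-x_db / 10), 1e-12)   # avoid log(0)
    return np.maximum(10 * np.log10(lin), -A)

def g_ms(x_db, A=10.0):
    """Magnitude subtraction gain in dB, Eq. (10.118)."""
    lin = np.maximum(1 - 10 ** (-x_db / 20), 1e-12)
    return np.maximum(20 * np.log10(lin), -A)

def g_os(x_db, A=10.0, beta=2.0):
    """Oversubtraction gain in dB, Eq. (10.119): the power rule shifted by beta."""
    return g_ss(x_db - beta, A)

x = np.array([0.0, 10.0, 30.0])   # instantaneous SNR in dB
print(g_ss(x), g_ms(x), g_os(x))
```

All three gains approach 0 dB at high SNR and are floored at −A dB at low SNR; magnitude subtraction attenuates more at moderate SNRs, while oversubtraction shifts the power rule to the right by β dB.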
Figure 10.29 Empirical curves of input-to-output instantaneous SNR. Eight different curves, for frequency bands from 0 to 8 kHz, are obtained following Eq. (10.121) [2], using speech recorded simultaneously from a close-talking microphone and a desktop microphone.

One interesting possibility occurs when we have pairs of stereo utterances that have been recorded simultaneously in noise-free conditions in one channel and noisy conditions in the other. In this case, we can estimate f(x) using a minimum mean squared error criterion (Porter and Boll [47], Ephraim and Malah [23]), so that

$$\hat{f}(x) = \arg\min_{f(x)} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( |X_i(f_j)| - f\big(SNR(f_j)\big) \, |Y_i(f_j)| \right)^2 \qquad (10.120)$$

or g(x) as

$$\hat{g}(x) = \arg\min_{g(x)} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( 10 \log_{10} |X_i(f_j)|^2 - g\big(SNR(f_j)\big) - 10 \log_{10} |Y_i(f_j)|^2 \right)^2 \qquad (10.121)$$

which can be solved by discretizing f(x) and g(x) into several bins and summing over all M frequencies and N frames. This approach results in a curve that is smoother, and thus produces less musical noise and lower distortion. Stereo utterances of noise-free and noisy speech are needed to estimate f(x) and g(x) through Eqs. (10.120) and (10.121) for any given acoustical environment; they can be collected with two microphones, or the noisy speech can be obtained by adding to the clean speech artificial noise from the testing environment. Another generalization of this approach is to use a different function f(x) or g(x) for every frequency [2], as shown in Figure 10.29. This also allows a lower squared error, at the expense of having to store more data tables. In the experiments of Figure 10.29, we note that more subtraction is needed at lower frequencies than at higher frequencies in this case. If such stereo data is available to estimate these curves, the enhanced speech sounds better [23] than with spectral subtraction.
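Once the SNR axis is discretized into bins, the minimizer of Eq. (10.121) within each bin is simply the average clean-minus-noisy log power of the samples falling in that bin. The sketch below illustrates this on synthetic "stereo" power spectra; all distributions and sizes are hypothetical stand-ins for real simultaneous recordings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stereo spectra: clean power |X|^2 and noisy power |Y|^2 = |X|^2 + |N|^2
n_frames, n_freqs = 2000, 64
X2 = rng.gamma(shape=1.0, scale=10.0, size=(n_frames, n_freqs))  # clean power
N2 = rng.gamma(shape=1.0, scale=1.0, size=(n_frames, n_freqs))   # noise power
Y2 = X2 + N2
noise_psd = N2.mean(axis=0)                   # Eq. (10.104) estimate

# Instantaneous SNR in dB, Eqs. (10.106) and (10.112)
snr_db = 10 * np.log10(Y2 / noise_psd)

# Discretize into 1-dB bins; within each bin, the least-squares solution
# of Eq. (10.121) is the mean of the clean-minus-noisy log power
bins = np.round(snr_db).astype(int)
target = 10 * np.log10(X2) - 10 * np.log10(Y2)
g_hat = {b: target[bins == b].mean() for b in np.unique(bins)}

# Little suppression is needed at high SNR, much more at low SNR
print(g_hat[15], g_hat[0])
```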
When used in speech recognition systems, it also leads to higher accuracies [2].

10.5.3. Wiener Filtering

Let's reformulate Eq. (10.102) from the statistical point of view. The process y[n] is the sum of the random process x[n] and the additive noise process v[n]:

$$y[n] = x[n] + v[n] \qquad (10.122)$$

We wish to find a linear estimate x̂[n] in terms of the process y[n]:

$$\hat{x}[n] = \sum_{m=-\infty}^{\infty} h[m] \, y[n-m] \qquad (10.123)$$

which is the result of a linear time-invariant filtering operation. The MMSE estimate of h[n] in Eq. (10.123) minimizes the squared error

$$E\left\{ \left[ x[n] - \sum_{m=-\infty}^{\infty} h[m] \, y[n-m] \right]^2 \right\} \qquad (10.124)$$

which results in the famous Wiener-Hopf equation

$$R_{xy}[l] = \sum_{m=-\infty}^{\infty} h[m] \, R_{yy}[l-m] \qquad (10.125)$$

so that, taking Fourier transforms, the resulting filter can be expressed in the frequency domain as

$$H(f) = \frac{S_{xy}(f)}{S_{yy}(f)} \qquad (10.126)$$

If the signal x[n] and the noise v[n] are orthogonal, which is often the case, then

$$S_{xy}(f) = S_{xx}(f) \quad \text{and} \quad S_{yy}(f) = S_{xx}(f) + S_{vv}(f) \qquad (10.127)$$

so that Eq. (10.126) is given by

$$H(f) = \frac{S_{xx}(f)}{S_{xx}(f) + S_{vv}(f)} \qquad (10.128)$$

Equation (10.128) is called the noncausal Wiener filter. It can be realized only if we know the power spectra of both the noise and the signal. Of course, if $S_{xx}(f)$ and $S_{vv}(f)$ do not overlap, then H(f) = 1 in the band of the signal and H(f) = 0 in the band of the noise. In practice, $S_{xx}(f)$ is unknown. If it were known, we could compute its mel-cepstrum, which would coincide exactly with the mel-cepstrum before noise addition. To solve this chicken-and-egg problem, we need some kind of model. Ephraim [22] proposed the use of an HMM where, if we know which state the current frame falls under, we can use its mean spectrum as $S_{xx}(f)$. In practice we do not know which state each frame falls into either, so he proposed to weight the filters for each state by the a posteriori probability that the frame falls into each state.
This algorithm, when used in speech enhancement, results in gains of 15 dB or more. A causal version of the Wiener filter can also be derived. A dynamical state model algorithm called the Kalman filter (see [42] for details) is also an extension of the Wiener filter.

10.5.4. Cepstral Mean Normalization (CMN)

Different microphones have different transfer functions, and even the same microphone has a varying transfer function depending on the distance to the microphone and the room acoustics. In this section we describe a powerful and simple technique that is designed to handle convolutional distortions and, thus, increases the robustness of speech recognition systems to unknown linear filtering operations.

Given a signal x[n], we compute its cepstrum through short-time analysis, resulting in a set of T cepstral vectors $X = \{\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{T-1}\}$. The sample mean x̄ is given by

$$\bar{\mathbf{x}} = \frac{1}{T} \sum_{t=0}^{T-1} \mathbf{x}_t \qquad (10.129)$$

Cepstral mean normalization (CMN) (Atal [8]) consists of subtracting x̄ from each vector $\mathbf{x}_t$ to obtain the normalized cepstrum vector $\hat{\mathbf{x}}_t$:

$$\hat{\mathbf{x}}_t = \mathbf{x}_t - \bar{\mathbf{x}} \qquad (10.130)$$

Let's now consider a signal y[n], which is the output of passing x[n] through a filter h[n]. We can compute another sequence of cepstrum vectors $Y = \{\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_{T-1}\}$. Now let's further define a vector h as

$$\mathbf{h} = \mathbf{C} \left( \ln |H(\omega_0)|^2, \; \ldots, \; \ln |H(\omega_M)|^2 \right)^T \qquad (10.131)$$

where C is the DCT matrix. We can see that

$$\mathbf{y}_t = \mathbf{x}_t + \mathbf{h} \qquad (10.132)$$

and thus the sample mean ȳ equals

$$\bar{\mathbf{y}} = \frac{1}{T} \sum_{t=0}^{T-1} \mathbf{y}_t = \frac{1}{T} \sum_{t=0}^{T-1} \left( \mathbf{x}_t + \mathbf{h} \right) = \bar{\mathbf{x}} + \mathbf{h} \qquad (10.133)$$

and its normalized cepstrum is given by

$$\hat{\mathbf{y}}_t = \mathbf{y}_t - \bar{\mathbf{y}} = \hat{\mathbf{x}}_t \qquad (10.134)$$

which indicates that cepstral mean normalization is immune to linear filtering operations. This procedure is performed on every utterance, for both training and testing. Intuitively, the mean vector x̄ conveys the spectral characteristics of the current microphone and room acoustics.
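The invariance of Eq. (10.134) is easy to check numerically. The sketch below uses random vectors as stand-in cepstra (the frame count, dimension, and channel offset are hypothetical), since any time-invariant linear channel adds the same vector h to every frame per Eq. (10.132):

```python
import numpy as np

rng = np.random.default_rng(0)

def cmn(cepstra):
    """Cepstral mean normalization, Eqs. (10.129)-(10.130):
    subtract the per-utterance mean from every frame."""
    return cepstra - cepstra.mean(axis=0)

# Hypothetical utterance: T frames of 13-dimensional cepstra
T, D = 200, 13
x = rng.standard_normal((T, D))

# A linear channel adds the same offset h to every cepstral frame, Eq. (10.132)
h = rng.standard_normal(D)
y = x + h

# After CMN, the filtered and clean cepstra are identical, Eq. (10.134)
print(np.allclose(cmn(y), cmn(x)))  # True
```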
In the limit, as T → ∞ for each utterance, we should expect the means of utterances from the same recording environment to be the same. Applying CMN to the cepstrum vectors does not modify the delta or delta-delta cepstrum.

Let's analyze the effect of CMN on a short utterance. Assume that our utterance contains a single phoneme, say /s/. The mean x̄ will be very similar to the frames in this phoneme, since /s/ is quite stationary. Thus, after normalization, x̂_t ≈ 0. A similar result occurs for other fricatives, which means that it would be impossible to distinguish these ultrashort utterances, and the error rate would be very high. If the utterance contains more than one phoneme but is still short, this problem is not insurmountable, but the confusion among phonemes is still higher than if no CMN had been applied. Empirically, it has been found that this procedure does not degrade the recognition rate on utterances from the same acoustical environment, as long as they are longer than 2-4 seconds. Yet the method provides significant robustness against linear filtering operations. In fact, for telephone recordings, where each call has a different frequency response, the use of CMN has been shown to provide as much as a 30% relative decrease in error rate.

When a system is trained on one microphone and tested on another, CMN can provide significant robustness. Interestingly enough, it has been found in practice that the error rate for utterances within the same environment is actually somewhat lower, too. This is surprising, given that there is no mismatch in channel conditions. One explanation is that, even for the same microphone and room acoustics, the distance between the mouth and the microphone varies for different speakers, which causes slightly different transfer functions, as we studied in Section 10.2.
In addition, the cepstral mean characterizes not only the channel transfer function, but also the average frequency response of different speakers. By removing the long-term speaker average, CMN can act as a sort of speaker normalization.

Figure 10.30 Word error rate as a function of SNR (dB) for both no CMN (solid line) and CMN-2 [5] (dotted line). White noise was added at different SNRs and the system was trained with speech at the same SNR. The Whisper system is used on the 5000-word Wall Street Journal task with a bigram language model.

One drawback of CMN is that it does not discriminate between silence and speech in computing the utterance mean. An extension to CMN consists of computing different means for speech and noise [5]:

$$\mathbf{h}^{(j+1)} = \frac{1}{N_s} \sum_{t \in q_s} \mathbf{x}_t - \mathbf{m}_s, \qquad \mathbf{n}^{(j+1)} = \frac{1}{N_n} \sum_{t \in q_n} \mathbf{x}_t - \mathbf{m}_n \qquad (10.135)$$

i.e., the channel estimate is the difference between the average vector for speech frames in the utterance and the average vector $\mathbf{m}_s$ for speech frames in the training data, and similarly for the noise frames with $\mathbf{m}_n$. Speech/noise discrimination can be done by classifying frames into speech and noise frames, computing the average cepstra for each, and subtracting the corresponding training-data averages. This procedure works well as long as the speech/noise classification is accurate; it is best done by the recognizer, since other speech detection algorithms can fail in high background noise (see Section 10.6.2). As shown in Figure 10.30, this algorithm improves robustness not only to varying channels but also to noise.

10.5.5. Real-Time Cepstral Normalization

CMN requires the complete utterance to compute the cepstral mean; thus, it cannot be used in a real-time system, and an approximation needs to be used.
In this section we discuss a modified version of CMN that addresses this problem, as well as a related technique called RASTA that attempts to do the same thing.

We can interpret CMN as subtracting the output of a low-pass filter d[n] whose T coefficients are identical and equal to 1/T; the overall operation is thus a high-pass filter with a cutoff frequency ω_c arbitrarily close to 0. This interpretation indicates that we can use other types of high-pass filters. One that has been found to work well in practice is based on the exponential filter, so that the cepstral mean x̄_t is a function of time:

$$\bar{\mathbf{x}}_t = \alpha \mathbf{x}_t + (1 - \alpha) \bar{\mathbf{x}}_{t-1} \qquad (10.136)$$

where α is chosen so that the filter has a time constant⁷ of at least 5 seconds of speech. Other types of filters have been proposed in the literature. A popular approach consists of an IIR bandpass filter with the transfer function

$$H(z) = 0.1 z^4 \, \frac{2 + z^{-1} - z^{-3} - 2 z^{-4}}{1 - 0.98 z^{-1}} \qquad (10.137)$$

which is used in so-called relative spectral processing, or RASTA [32]. As in CMN, the high-pass portion of the filter is expected to alleviate the effect of convolutional noise introduced in the channel, while the low-pass portion helps to smooth some of the fast frame-to-frame spectral changes. Empirically, it has been shown that the RASTA filter behaves similarly to the real-time implementation of CMN, albeit with a slightly higher error rate. Both the RASTA filter and real-time implementations of CMN require the filter to be properly initialized; otherwise, the first utterance may use an incorrect cepstral mean. The original derivation of RASTA includes a few stages prior to the bandpass filter, and the filter is applied to the spectral energies, not the cepstrum.

⁷ The time constant τ of a low-pass filter is defined as the value for which the output is cut in half. For an exponential filter of parameter α and sampling rate F_s, α = ln 2/(τ F_s).

10.5.6. The Use of Gaussian Mixture Models

Algorithms such as the spectral subtraction of Section 10.5.1 or the frequency-domain MMSE of Section 10.5.2 implicitly assume that different frequencies are uncorrelated from each other.
Because of that, the spectrum of the enhanced signal may exhibit abrupt changes across frequency and not look like the spectrum of a real speech signal. Using the model of the environment of Section 10.1.3, we can express the clean-speech cepstral vector x as a function of the observed noisy cepstral vector y as

$$\mathbf{x} = \mathbf{y} - \mathbf{h} - \mathbf{C} \ln\left( 1 - e^{\mathbf{C}^{-1}(\mathbf{n} - \mathbf{y})} \right) \qquad (10.138)$$

where the noise cepstral vector n is a random vector. The MMSE estimate of x is given by

$$\hat{\mathbf{x}}_{MMSE} = E\{\mathbf{x} \mid \mathbf{y}\} = \mathbf{y} - \mathbf{h} - \mathbf{C} \, E\left\{ \ln\left( 1 - e^{\mathbf{C}^{-1}(\mathbf{n} - \mathbf{y})} \right) \,\middle|\, \mathbf{y} \right\} \qquad (10.139)$$

where the expectation uses the distribution of n. The solution of Eq. (10.139) is a nonlinear function, which can be learned, for example, with a neural network [53].

A popular way to attack this problem consists of modeling the probability distribution of the noisy speech y as a mixture of K Gaussians:

$$p(\mathbf{y}) = \sum_{k=0}^{K-1} p(\mathbf{y} \mid k) P[k] = \sum_{k=0}^{K-1} N\left(\mathbf{y}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) P[k] \qquad (10.140)$$

where P[k] is the prior probability of each Gaussian component k. If x and y are jointly Gaussian within class k, then p(x | y, k) is also Gaussian [42], with mean

$$E\{\mathbf{x} \mid \mathbf{y}, k\} = \boldsymbol{\mu}_k^x + \boldsymbol{\Sigma}_k^{xy} \left( \boldsymbol{\Sigma}_k^{y} \right)^{-1} \left( \mathbf{y} - \boldsymbol{\mu}_k^y \right) = \mathbf{C}_k \mathbf{y} + \mathbf{r}_k \qquad (10.141)$$

so that the joint distribution of x and y is given by

$$p(\mathbf{x}, \mathbf{y}) = \sum_{k=0}^{K-1} p(\mathbf{x}, \mathbf{y} \mid k) P[k] = \sum_{k=0}^{K-1} p(\mathbf{x} \mid \mathbf{y}, k) \, p(\mathbf{y} \mid k) \, P[k] = \sum_{k=0}^{K-1} N\left(\mathbf{x}; \mathbf{C}_k \mathbf{y} + \mathbf{r}_k, \boldsymbol{\Gamma}_k\right) N\left(\mathbf{y}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) P[k] \qquad (10.142)$$

where $\mathbf{r}_k$ is called the correction vector, $\mathbf{C}_k$ is the rotation matrix, and the matrix $\boldsymbol{\Gamma}_k$ tells us how uncertain we are about the compensation. A maximum likelihood estimate of x maximizes the joint probability in Eq. (10.142).
Assuming the Gaussians do not overlap very much (as in the FCDCN algorithm [2]), the maximization can be carried over the most likely class:

$$\hat{\mathbf{x}}_{ML} \approx \arg\max_{\mathbf{x}}\max_{k}\, p(\mathbf{x},\mathbf{y},k) = \arg\max_{\mathbf{x}}\max_{k}\, N(\mathbf{y};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\,N(\mathbf{x};\mathbf{C}_k\mathbf{y}+\mathbf{r}_k,\boldsymbol{\Gamma}_k)\,P[k] \qquad (10.143)$$

whose solution is

$$\hat{\mathbf{x}}_{ML} = \mathbf{C}_{\hat{k}}\mathbf{y} + \mathbf{r}_{\hat{k}} \qquad (10.144)$$

where

$$\hat{k} = \arg\max_{k}\, N(\mathbf{y};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)P[k] \qquad (10.145)$$

It is often more robust to compute the MMSE estimate of x (as in the CDCN [2] and RATZ [43] algorithms):

$$\hat{\mathbf{x}}_{MMSE} = E\{\mathbf{x}\mid\mathbf{y}\} = \sum_{k=0}^{K-1} p(k\mid\mathbf{y})\,E\{\mathbf{x}\mid\mathbf{y},k\} = \sum_{k=0}^{K-1} p(k\mid\mathbf{y})\left(\mathbf{C}_k\mathbf{y}+\mathbf{r}_k\right) \qquad (10.146)$$

i.e., a weighted sum over all mixture components, where the posterior probability $p(k\mid\mathbf{y})$ is given by

$$p(k\mid\mathbf{y}) = \frac{p(\mathbf{y}\mid k)P[k]}{\sum_{k'=0}^{K-1} p(\mathbf{y}\mid k')P[k']} \qquad (10.147)$$

The rotation matrix $\mathbf{C}_k$ in Eq. (10.144) can be replaced by I with a modest degradation in performance in return for faster computation [21]. A number of different algorithms [2, 43] have been proposed that vary in how the parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, $\mathbf{r}_k$, and $\boldsymbol{\Gamma}_k$ are estimated. If stereo recordings of both the clean signal and the noisy signal are available, we can estimate $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ by fitting a Gaussian mixture model to y as described in Chapter 3; then $\mathbf{C}_k$, $\mathbf{r}_k$, and $\boldsymbol{\Gamma}_k$ can be estimated directly by linear regression of x on y. The FCDCN algorithm [2, 6] is a variant of this approach in which it is assumed that $\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}$, $\boldsymbol{\Gamma}_k = \gamma^2\mathbf{I}$, and $\mathbf{C}_k = \mathbf{I}$, so that $\boldsymbol{\mu}_k$ is estimated through a VQ procedure and $\mathbf{r}_k$ is the average difference $(\mathbf{x}-\mathbf{y})$ over the vectors y that belong to mixture component k. An enhancement is to use the instantaneous SNR of a frame, defined as the difference between the log-energy of that frame and the average log-energy of the background noise: it is advantageous to use different correction vectors for different instantaneous SNR levels. The log-energy can be replaced by the zeroth-order cepstral coefficient with little change in recognition accuracy.
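The MMSE estimator of Eqs. (10.146)-(10.147) can be sketched for diagonal-covariance Gaussians with the simplification $\mathbf{C}_k = \mathbf{I}$ mentioned above; all names and numbers here are illustrative:

```python
import math

def mmse_correction(y, means, variances, priors, corrections):
    """MMSE estimate of Eq. (10.146) with C_k = I: add to y the
    posterior-weighted sum of per-class correction vectors r_k.
    Gaussians are diagonal; `variances[k]` holds the diagonal."""
    logp = []
    for mu, var, p in zip(means, variances, priors):
        ll = math.log(p)
        for yi, mi, vi in zip(y, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * vi) + (yi - mi) ** 2 / vi)
        logp.append(ll)
    m = max(logp)                      # log-sum-exp for numerical safety
    w = [math.exp(l - m) for l in logp]
    s = sum(w)
    post = [wi / s for wi in w]        # p(k | y), Eq. (10.147)
    return [yi + sum(post[k] * corrections[k][i] for k in range(len(post)))
            for i, yi in enumerate(y)]
```

When one class clearly dominates the posterior, the estimate reduces to the hard ML choice of Eqs. (10.144)-(10.145).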
It is also possible to compute the MMSE estimate instead of picking the most likely codeword (as in the RATZ algorithm [43]); the resulting correction vector is then a weighted average of the correction vectors for all classes. Often, stereo recordings are not available, and we need other means of estimating the parameters $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, $\mathbf{r}_k$, and $\boldsymbol{\Gamma}_k$. CDCN [6] is one such algorithm; it uses the model of the environment described in Section 10.1.3, which defines a nonlinear relationship between x, y, and the environmental parameters n and h for the noise and channel. This method also uses an MMSE approach in which the correction vector is a weighted average of the correction vectors for all classes. An extension of CDCN using a vector Taylor series approximation [44] of that nonlinear function has been shown to offer improved results. Other methods that require neither stereo recordings nor a model of the environment are presented in [43].

10.6. ENVIRONMENTAL MODEL ADAPTATION

We now describe a number of techniques that achieve compensation by adapting the HMM to the noisy conditions. The most straightforward method is to retrain the whole HMM on speech from the new acoustical environment. Another option is to apply the standard adaptation techniques discussed in Chapter 9 to the case of environment adaptation. We also consider a model of the environment that allows constrained adaptation methods, which adapt more efficiently than the general techniques.

10.6.1. Retraining on Corrupted Speech

If there is a mismatch between acoustical environments, it is sensible to retrain the HMM. This is done in practice for telephone speech, where only telephone speech, and no clean high-bandwidth speech, is used in the training phase. Unfortunately, training a large-vocabulary speech recognizer requires a very large amount of data, which is often not available for a specific noisy condition.
For example, it is difficult to collect a large amount of training data in a car driving at 50 mph, whereas it is much easier to record it at idle speed. Having a small amount of matched-conditions training data could be worse than having a large amount of mismatched-conditions training data. Often we want to adapt our model given a relatively small sample of speech from the new acoustical environment.

Figure 10.31 Word error rate as a function of the testing-data SNR (dB) for Whisper trained on clean data (mismatched, solid line) and for a system trained on noisy data at the same SNR as the testing set (matched, dotted line). White noise at different SNRs is added.

One option is to take a noise waveform from the new environment, add it to all the utterances in our database, and retrain the system. If the noise characteristics are known ahead of time, this method allows us to adapt the model to the new environment with a relatively small amount of data from it, yet use a large amount of training data. Figure 10.31 shows the benefit of this approach over a system trained on clean speech for the case of additive white noise. If the target acoustical environment also has a different channel, we can additionally filter all the utterances in the training data prior to retraining. If the noise sample is available offline, this simple technique can provide good results at no cost during recognition; otherwise the noise addition and model retraining would need to occur at runtime.
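The corruption step can be sketched as follows; the helper below is hypothetical (the text does not prescribe an implementation) and assumes a global-SNR definition over the whole utterance:

```python
import math

def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` at a target global SNR in dB.  The
    noise is tiled if shorter than the speech and scaled so that
    10*log10(P_speech / P_noise) equals snr_db."""
    p_speech = sum(s * s for s in speech) / len(speech)
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_noise = sum(n * n for n in tiled) / len(tiled)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, tiled)]
```

In a multistyle setup, the same helper would be run over the whole training corpus for several noise types and SNR levels before retraining.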
This is feasible for speaker-dependent small-vocabulary systems, where the training data can be kept in memory and the retraining time can be small, but it is generally not feasible for large-vocabulary speaker-independent systems because of memory and computational limitations.

Figure 10.32 Word error rates of multistyle training compared to matched-noise training as a function of the SNR in dB for additive white noise. Whisper is trained as in Figure 10.30. The error rate of multistyle training is between 12% (for low SNR) and 25% (for high SNR) higher in relative terms than that of matched-condition training. Nonetheless, multistyle training does better than a system trained on clean data for all conditions other than clean speech.

One possibility is to create a number of artificial acoustical environments by corrupting our clean database with noise samples of varying levels and types, as well as varying channels, and then use all those waveforms from multiple acoustical environments in training. This is called multistyle training [39], since the training data represent a number of different conditions. Because of the diversity of the training data, the resulting recognizer is more robust to varying noise conditions. In Figure 10.32 we see that, though the multistyle error-rate curve is generally above the matched-condition curve, particularly for clean speech, multistyle training does not require knowledge of the specific noise level and is thus a viable alternative to the theoretical lower bound of matched conditions.

10.6.2. Model Adaptation

We can also use the standard adaptation methods used for speaker adaptation, such as MAP or MLLR, described in Chapter 9. Since MAP is an unstructured method, it can offer results similar to those of matched conditions, but it requires a significant amount of adaptation data.
MLLR can achieve reasonable performance with about a minute of speech for minor mismatches [41]. For severe mismatches, MLLR requires a larger number of transformations, which in turn require a larger amount of adaptation data, as discussed in Chapter 9.

Let's analyze the case of a single MLLR transform where the affine transformation is simply a bias. In this case the MLLR transform consists only of a vector h that, as in the case of CMN described in Section 10.5.4, can be estimated from a single utterance. Instead of estimating h as the average cepstral mean, this method estimates h as the maximum likelihood estimate given a set of sample vectors $X = \{\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{T-1}\}$ and an HMM model λ [48]; it is a version of the EM algorithm in which all the vector means are tied together (see Algorithm 10.2). This procedure for estimating the cepstral bias yields a very slight reduction in error rate over CMN, although the improvement is larger for short utterances [48].

ALGORITHM 10.2 MLE SIGNAL BIAS REMOVAL
Step 1: Initialize $\mathbf{h}^{(0)} = \mathbf{0}$ at iteration $j = 0$.
Step 2: Obtain model $\lambda^{(j)}$ by updating the means from $\mathbf{m}_k$ to $\mathbf{m}_k + \mathbf{h}^{(j)}$, for all Gaussians k.
Step 3: Run recognition with model $\lambda^{(j)}$ on the current utterance and determine a state segmentation $\theta[t]$ for each frame t.
Step 4: Estimate $\mathbf{h}^{(j+1)}$ as the vector that maximizes the likelihood, which, using covariance matrices $\boldsymbol{\Sigma}_k$, is given by
$$\mathbf{h}^{(j+1)} = \left(\sum_{t=0}^{T-1}\boldsymbol{\Sigma}_{\theta[t]}^{-1}\right)^{-1}\sum_{t=0}^{T-1}\boldsymbol{\Sigma}_{\theta[t]}^{-1}\left(\mathbf{x}_t - \mathbf{m}_{\theta[t]}\right) \qquad (10.148)$$
Step 5: If converged, stop; otherwise, increment j and go to Step 2. In practice two iterations are often sufficient.

If both additive noise and linear filtering are applied, the cepstra of the noise frames and of most speech frames are affected differently. The speech/noise mean normalization [5] algorithm can be extended similarly, as shown in Algorithm 10.3.
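Step 4 of Algorithm 10.2, Eq. (10.148), reduces to elementwise weighted averaging when the covariance matrices are diagonal; a sketch under that assumption (names are illustrative):

```python
def estimate_bias(frames, segmentation, means, inv_vars):
    """One evaluation of Eq. (10.148) with diagonal covariances:
    h = (sum_t S_{theta[t]}^-1)^-1 * sum_t S_{theta[t]}^-1 (x_t - m_{theta[t]}).
    `segmentation[t]` is the Gaussian index theta[t] for frame t, and
    `inv_vars[k]` is the diagonal of Sigma_k^-1."""
    dim = len(frames[0])
    num = [0.0] * dim
    den = [0.0] * dim
    for x, k in zip(frames, segmentation):
        for i in range(dim):
            num[i] += inv_vars[k][i] * (x[i] - means[k][i])
            den[i] += inv_vars[k][i]
    return [num[i] / den[i] for i in range(dim)]
```

Iterating Steps 2-4 alternates between re-segmenting with the shifted means and re-estimating the bias, exactly as the tied-mean EM interpretation suggests.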
The idea is to estimate vectors $\bar{\mathbf{n}}$ and $\bar{\mathbf{h}}$ such that all the Gaussians associated with the noise model are shifted by $\bar{\mathbf{n}}$, and all remaining Gaussians are shifted by $\bar{\mathbf{h}}$. We can make Eq. (10.150) more efficient by tying all the covariance matrices. This transforms Eq. (10.150) into

$$\bar{\mathbf{h}}^{(j+1)} = \frac{1}{N_s}\sum_{t\in q_s}\mathbf{x}_t - \mathbf{m}_s, \qquad \bar{\mathbf{n}}^{(j+1)} = \frac{1}{N_n}\sum_{t\in q_n}\mathbf{x}_t - \mathbf{m}_n \qquad (10.149)$$

i.e., the difference between the average vector for speech frames in the utterance and the average vector $\mathbf{m}_s$ for speech frames in the training data, and similarly for the noise frames with $\mathbf{m}_n$. This is essentially the same equation as in the speech/noise cepstral mean normalization described in Section 10.5.4; the difference is that the speech/noise discrimination is done by the recognizer instead of by a separate classifier. This method is more accurate in high-background-noise conditions, where traditional speech/noise classifiers can fail. As a compromise, a codebook with considerably fewer Gaussians than the recognizer's can be used to estimate $\bar{\mathbf{n}}$ and $\bar{\mathbf{h}}$.

ALGORITHM 10.3 SPEECH/NOISE MEAN NORMALIZATION
Step 1: Initialize $\mathbf{h}^{(0)} = \mathbf{0}$, $\mathbf{n}^{(0)} = \mathbf{0}$ at iteration $j = 0$.
Step 2: Obtain model $\lambda^{(j)}$ by updating the means of speech Gaussians from $\mathbf{m}_k$ to $\mathbf{m}_k + \mathbf{h}^{(j)}$, and of noise Gaussians from $\mathbf{m}_l$ to $\mathbf{m}_l + \mathbf{n}^{(j)}$.
Step 3: Run recognition with model $\lambda^{(j)}$ on the current utterance and determine a state segmentation $\theta[t]$ for each frame t.
Step 4: Estimate $\mathbf{h}^{(j+1)}$ and $\mathbf{n}^{(j+1)}$ as the vectors that maximize the likelihood for speech frames ($t\in q_s$) and noise frames ($t\in q_n$), respectively:
$$\mathbf{h}^{(j+1)} = \left(\sum_{t\in q_s}\boldsymbol{\Sigma}_{\theta[t]}^{-1}\right)^{-1}\sum_{t\in q_s}\boldsymbol{\Sigma}_{\theta[t]}^{-1}\left(\mathbf{x}_t - \mathbf{m}_{\theta[t]}\right), \qquad \mathbf{n}^{(j+1)} = \left(\sum_{t\in q_n}\boldsymbol{\Sigma}_{\theta[t]}^{-1}\right)^{-1}\sum_{t\in q_n}\boldsymbol{\Sigma}_{\theta[t]}^{-1}\left(\mathbf{x}_t - \mathbf{m}_{\theta[t]}\right) \qquad (10.150)$$
Step 5: If converged, stop; otherwise, increment j and go to Step 2.

10.6.3. Parallel Model Combination

By using the clean-speech models and a noise model, we can approximate the distributions obtained by training an HMM with corrupted speech.
The memory requirements for the algorithm are then significantly reduced, as the training data is not needed online. Parallel model combination (PMC) is a method to obtain the distribution of noisy speech given the distributions of clean speech and noise as mixtures of Gaussians. As discussed in Section 10.1.3, if the clean-speech cepstrum follows a Gaussian distribution and the noise cepstrum follows another Gaussian distribution, the noisy speech has a distribution that is no longer Gaussian. The PMC method nevertheless assumes that the resulting distribution is Gaussian, with the mean and covariance matrix of the true non-Gaussian distribution. If the distribution of clean speech is a mixture of N Gaussians and the distribution of the noise is a mixture of M Gaussians, the distribution of the noisy speech contains NM Gaussians. The feature vector is often composed of the cepstrum, delta cepstrum, and delta-delta cepstrum. The model combination is shown in Figure 10.33.

Figure 10.33 Parallel model combination for the case of a one-state noise HMM. The clean-speech and noise HMMs are mapped from the cepstral domain to the log-spectral domain ($\mathbf{C}^{-1}$), then to the linear domain ($\exp$), combined by addition (with gain g), and mapped back through $\log$ and $\mathbf{C}$ to obtain the noisy-speech HMM.

If the mean and covariance matrix of the cepstral noise vector n are given by $\boldsymbol{\mu}_n^c$ and $\boldsymbol{\Sigma}_n^c$, respectively, we first compute the mean and covariance matrix in the log-spectral domain:

$$\boldsymbol{\mu}_n^l = \mathbf{C}^{-1}\boldsymbol{\mu}_n^c, \qquad \boldsymbol{\Sigma}_n^l = \mathbf{C}^{-1}\boldsymbol{\Sigma}_n^c\left(\mathbf{C}^{-1}\right)^T \qquad (10.151)$$

In the linear domain, $\mathbf{N} = e^{\mathbf{n}}$ has a lognormal distribution, whose mean vector $\boldsymbol{\mu}_N$ and covariance matrix $\boldsymbol{\Sigma}_N$ can be shown (see Chapter 3) to be given by

$$\mu_N[i] = \exp\left\{\mu_n^l[i] + \Sigma_n^l[i,i]/2\right\}, \qquad \Sigma_N[i,j] = \mu_N[i]\,\mu_N[j]\left(\exp\left\{\Sigma_n^l[i,j]\right\} - 1\right) \qquad (10.152)$$

with expressions similar to Eqs. (10.151) and (10.152) for the mean and covariance matrix of X.
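In the scalar case (one log-spectral bin, ignoring the cepstral rotation C), the moment mappings of Eq. (10.152) and its inverse, Eq. (10.155), form a round trip; a sketch (function names are illustrative):

```python
import math

def log_to_linear(mu_l, var_l):
    """Eq. (10.152), scalar: moments of N = exp(n) for n ~ N(mu_l, var_l)."""
    mu = math.exp(mu_l + var_l / 2.0)
    var = mu * mu * (math.exp(var_l) - 1.0)
    return mu, var

def linear_to_log(mu, var):
    """The inverse mapping, Eq. (10.155), scalar: recover the log-domain
    Gaussian parameters from lognormal linear-domain moments."""
    var_l = math.log(var / (mu * mu) + 1.0)
    mu_l = math.log(mu) - var_l / 2.0
    return mu_l, var_l
```

PMC applies `log_to_linear` to speech and noise separately, adds the linear-domain moments, and applies `linear_to_log` to the sum.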
Using the model of the environment with no filter is equivalent to obtaining a random linear spectral vector Y given by (see Figure 10.33)

$$\mathbf{Y} = \mathbf{X} + \mathbf{N} \qquad (10.153)$$

and, since X and N are independent, we can obtain the mean and covariance matrix of Y as

$$\boldsymbol{\mu}_Y = \boldsymbol{\mu}_X + \boldsymbol{\mu}_N, \qquad \boldsymbol{\Sigma}_Y = \boldsymbol{\Sigma}_X + \boldsymbol{\Sigma}_N \qquad (10.154)$$

Although the sum of two lognormally distributed variables is not lognormal, the popular lognormal approximation [26] consists in assuming that Y is lognormal. In this case we can apply the inverse formulae of Eq. (10.152) to obtain the mean and covariance matrix in the log-spectral domain:

$$\Sigma_y^l[i,j] = \ln\left\{\frac{\Sigma_Y[i,j]}{\mu_Y[i]\,\mu_Y[j]} + 1\right\}, \qquad \mu_y^l[i] = \ln\mu_Y[i] - \frac{1}{2}\ln\left\{\frac{\Sigma_Y[i,i]}{\mu_Y[i]\,\mu_Y[i]} + 1\right\} \qquad (10.155)$$

and finally return to the cepstral domain by applying the inverse of Eq. (10.151):

$$\boldsymbol{\mu}_y^c = \mathbf{C}\boldsymbol{\mu}_y^l, \qquad \boldsymbol{\Sigma}_y^c = \mathbf{C}\boldsymbol{\Sigma}_y^l\mathbf{C}^T \qquad (10.156)$$

The lognormal approximation cannot be used directly for the delta and delta-delta cepstrum. Another variant, which can be used in this case and is more accurate than the lognormal approximation, is data-driven parallel model combination (DPMC) [26], which uses Monte Carlo simulation: random cepstrum vectors are drawn from both the clean-speech HMM and the noise distribution, and the corresponding noisy-speech cepstra are created by applying Eqs. (10.20) and (10.21) to each sample point. These composite cepstrum vectors are not kept in memory; only their means and covariance matrices are, thus reducing the required memory, though still requiring a significant amount of computation. The number of vectors drawn from the distribution was at least 100 in [26]. A way of reducing the number of random vectors needed to obtain good Monte Carlo estimates is proposed in [56]. A version of PMC using numerical integration, which is very computationally expensive, yielded the best results. Figure 10.34 and Figure 10.35 compare the values estimated through the lognormal approximation to the true values, where for simplicity we deal with scalars.
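The scalar comparison behind those figures can be reproduced in miniature (in natural-log units rather than dB). The sketch below pits the lognormal approximation against Monte Carlo ground truth, together with the first-order VTS expansion of Section 10.6.4 that the figures also show; all names are illustrative:

```python
import math, random

def lognormal_pmc(mu_x, var_x, mu_n, var_n):
    """Lognormal approximation, Eqs. (10.152)-(10.155), scalar case."""
    def to_linear(m, v):
        mu = math.exp(m + v / 2.0)
        return mu, mu * mu * (math.exp(v) - 1.0)
    mX, vX = to_linear(mu_x, var_x)
    mN, vN = to_linear(mu_n, var_n)
    mY, vY = mX + mN, vX + vN                 # Eq. (10.154)
    var_y = math.log(vY / (mY * mY) + 1.0)    # Eq. (10.155)
    return math.log(mY) - var_y / 2.0, var_y

def vts(mu_x, var_x, mu_n, var_n):
    """First-order VTS, scalar: y = x + ln(1 + exp(n - x))."""
    a = 1.0 / (1.0 + math.exp(mu_n - mu_x))   # scalar Eq. (10.161)
    mu_y = mu_x + math.log(1.0 + math.exp(mu_n - mu_x))
    return mu_y, a * a * var_x + (1.0 - a) ** 2 * var_n

def monte_carlo(mu_x, var_x, mu_n, var_n, trials=20000, seed=1):
    """Ground truth by sampling the scalar environment model directly."""
    rng = random.Random(seed)
    ys = []
    for _ in range(trials):
        x = rng.gauss(mu_x, math.sqrt(var_x))
        n = rng.gauss(mu_n, math.sqrt(var_n))
        ys.append(x + math.log(1.0 + math.exp(n - x)))
    m = sum(ys) / trials
    return m, sum((y - m) ** 2 for y in ys) / trials
```

For small variances all three agree closely; widening `var_x` and `var_n` reproduces the divergence between the lognormal and VTS curves seen in the figures.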
Thus x, n, and y represent the log-spectral energies of the clean signal, noise, and noisy signal, respectively, for a given frequency. Assuming x and n to be Gaussian with means $\mu_x$ and $\mu_n$ and variances $\sigma_x^2$ and $\sigma_n^2$, respectively, we see that the lognormal approximation is accurate when the standard deviations $\sigma_x$ and $\sigma_n$ are small.

10.6.4. Vector Taylor Series

The model of the acoustical environment described in Section 10.1.3 describes the relationship between the cepstral vectors x, n, and y of the clean speech, noise, and noisy speech, respectively:

$$\mathbf{y} = \mathbf{x} + \mathbf{h} + \mathbf{g}(\mathbf{n} - \mathbf{x} - \mathbf{h}) \qquad (10.157)$$

where h is the cepstrum of the filter and the nonlinear function g(z) is given by

$$\mathbf{g}(\mathbf{z}) = \mathbf{C}\ln\left(1 + e^{\mathbf{C}^{-1}\mathbf{z}}\right) \qquad (10.158)$$

Moreno [44] suggests the use of a Taylor series to approximate the nonlinearity in Eq. (10.158), though he applies it in the spectral instead of the cepstral domain. We follow that approach to compute the mean and covariance matrix of y [4]. Assume that x, h, and n are Gaussian random vectors with means $\boldsymbol{\mu}_x$, $\boldsymbol{\mu}_h$, and $\boldsymbol{\mu}_n$ and covariance matrices $\boldsymbol{\Sigma}_x$, $\boldsymbol{\Sigma}_h$, and $\boldsymbol{\Sigma}_n$, respectively, and furthermore that x, h, and n are independent. After algebraic manipulation it can be shown that the Jacobians of Eq. (10.157) with respect to x, h, and n, evaluated at the means, can be expressed as

$$\left.\frac{\partial\mathbf{y}}{\partial\mathbf{x}}\right|_{(\boldsymbol{\mu}_n,\boldsymbol{\mu}_x,\boldsymbol{\mu}_h)} = \left.\frac{\partial\mathbf{y}}{\partial\mathbf{h}}\right|_{(\boldsymbol{\mu}_n,\boldsymbol{\mu}_x,\boldsymbol{\mu}_h)} = \mathbf{A}, \qquad \left.\frac{\partial\mathbf{y}}{\partial\mathbf{n}}\right|_{(\boldsymbol{\mu}_n,\boldsymbol{\mu}_x,\boldsymbol{\mu}_h)} = \mathbf{I} - \mathbf{A} \qquad (10.159)$$

where the matrix A is given by

$$\mathbf{A} = \mathbf{C}\mathbf{F}\mathbf{C}^{-1} \qquad (10.160)$$

and F is a diagonal matrix whose elements are given by the vector $\mathbf{f}(\boldsymbol{\mu})$, evaluated at $\boldsymbol{\mu} = \boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h$, which in turn is given by

$$\mathbf{f}(\boldsymbol{\mu}) = \frac{1}{1 + e^{\mathbf{C}^{-1}\boldsymbol{\mu}}} \qquad (10.161)$$

Using Eq. (10.159), we can then approximate Eq. (10.157) by a first-order Taylor series expansion around $(\boldsymbol{\mu}_n, \boldsymbol{\mu}_x, \boldsymbol{\mu}_h)$ as

$$\mathbf{y} \approx \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + \mathbf{g}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h) + \mathbf{A}(\mathbf{x} - \boldsymbol{\mu}_x) + \mathbf{A}(\mathbf{h} - \boldsymbol{\mu}_h) + (\mathbf{I} - \mathbf{A})(\mathbf{n} - \boldsymbol{\mu}_n) \qquad (10.162)$$

The mean of y, $\boldsymbol{\mu}_y$, can be obtained from Eq.
(10.162) as

$$\boldsymbol{\mu}_y \approx \boldsymbol{\mu}_x + \boldsymbol{\mu}_h + \mathbf{g}(\boldsymbol{\mu}_n - \boldsymbol{\mu}_x - \boldsymbol{\mu}_h) \qquad (10.163)$$

and its covariance matrix $\boldsymbol{\Sigma}_y$ by

$$\boldsymbol{\Sigma}_y \approx \mathbf{A}\boldsymbol{\Sigma}_x\mathbf{A}^T + \mathbf{A}\boldsymbol{\Sigma}_h\mathbf{A}^T + (\mathbf{I}-\mathbf{A})\boldsymbol{\Sigma}_n(\mathbf{I}-\mathbf{A})^T \qquad (10.164)$$

so that even if $\boldsymbol{\Sigma}_x$, $\boldsymbol{\Sigma}_h$, and $\boldsymbol{\Sigma}_n$ are diagonal, $\boldsymbol{\Sigma}_y$ is no longer diagonal. Nonetheless, we can assume it to be diagonal, because this way we can transform a clean HMM into a corrupted HMM that has the same functional form and use a decoder that has been optimized for diagonal covariance matrices.

It is difficult to visualize how good the approximation is, given the nonlinearity involved. To provide some insight, let's consider the frequency-domain version of Eqs. (10.157) and (10.158) when no filtering is done:

$$y = x + \ln\left(1 + \exp(n - x)\right) \qquad (10.165)$$

where x, n, and y represent the log-spectral energies of the clean signal, noise, and noisy signal, respectively, for a given frequency. In Figure 10.34 we show the mean and standard deviation of the noisy log-spectral energy y in dB as a function of the mean of the clean log-spectral energy x, whose standard deviation is 10 dB. The log-spectral energy of the noise n is Gaussian with mean 0 dB and standard deviation 2 dB. We compare the correct values, obtained through Monte Carlo simulation (or DPMC), with the values obtained through the lognormal approximation of Section 10.6.3 and the first-order VTS approximation described here. We see that the VTS approximation is more accurate than the lognormal approximation for the mean and especially for the standard deviation of y, assuming the model of the environment described by Eq. (10.165).

Figure 10.34 Means and standard deviations of noisy speech y in dB according to Eq. (10.165), comparing Monte Carlo simulation, first-order VTS, and PMC. The distribution of the noise log-spectrum n is Gaussian with mean 0 dB and standard deviation 2 dB.
The distribution of the clean log-spectrum x is Gaussian, having a standard deviation of 10 dB and a mean varying from –25 to 25 dB. Both the mean and the standard deviation of y are more accurately estimated by first-order VTS than by PMC.

Figure 10.35 is similar to Figure 10.34 except that the standard deviation of the clean log-energy x is only 5 dB, a more realistic number in speech recognition systems. In this case, both the lognormal approximation and the first-order VTS approximation are good estimates of the mean of y, though the standard deviation estimated through the lognormal approximation in PMC is not as good as that obtained through first-order VTS, again assuming the model of the environment described by Eq. (10.165). The overestimate of the variance in the lognormal approximation might, however, be useful if the model of the environment is not accurate.

Figure 10.35 Means and standard deviations of noisy speech y in dB according to Eq. (10.165), comparing Monte Carlo simulation, first-order VTS, and PMC. The distribution of the noise log-spectrum n is Gaussian with mean 0 dB and standard deviation of 2 dB. The distribution of the clean log-spectrum x is Gaussian with a standard deviation of 5 dB and a mean varying from –25 dB to 25 dB. The mean of y is well estimated by both PMC and first-order VTS; the standard deviation of y is more accurate in first-order VTS than in PMC.

To compute the means and covariance matrices of the delta and delta-delta parameters, let's take the derivative of the approximation of y in Eq.
(10.162) with respect to time:

$$\frac{\partial\mathbf{y}}{\partial t} \approx \mathbf{A}\frac{\partial\mathbf{x}}{\partial t} \qquad (10.166)$$

Since the delta-cepstrum, computed as $\Delta\mathbf{x}_t = \mathbf{x}_{t+2} - \mathbf{x}_{t-2}$, is related to the time derivative [28] by

$$\Delta\mathbf{x}_t \approx 4\,\frac{\partial\mathbf{x}_t}{\partial t} \qquad (10.167)$$

it follows that

$$\boldsymbol{\mu}_{\Delta y} \approx \mathbf{A}\boldsymbol{\mu}_{\Delta x} \qquad (10.168)$$

and similarly

$$\boldsymbol{\Sigma}_{\Delta y} \approx \mathbf{A}\boldsymbol{\Sigma}_{\Delta x}\mathbf{A}^T + (\mathbf{I}-\mathbf{A})\boldsymbol{\Sigma}_{\Delta n}(\mathbf{I}-\mathbf{A})^T \qquad (10.169)$$

where we assumed that h is constant within an utterance, so that $\Delta\mathbf{h} = \mathbf{0}$. Similarly, for the delta-delta cepstrum the mean is given by

$$\boldsymbol{\mu}_{\Delta^2 y} \approx \mathbf{A}\boldsymbol{\mu}_{\Delta^2 x} \qquad (10.170)$$

and the covariance matrix by

$$\boldsymbol{\Sigma}_{\Delta^2 y} \approx \mathbf{A}\boldsymbol{\Sigma}_{\Delta^2 x}\mathbf{A}^T + (\mathbf{I}-\mathbf{A})\boldsymbol{\Sigma}_{\Delta^2 n}(\mathbf{I}-\mathbf{A})^T \qquad (10.171)$$

where we again assumed that h is constant within an utterance, so that $\Delta^2\mathbf{h} = \mathbf{0}$.

Equations (10.163), (10.168), and (10.170) resemble the MLLR adaptation formulae of Chapter 9 for the means, though in this case the matrix is different for each Gaussian and is heavily constrained. We are interested in estimating the environmental parameters $\boldsymbol{\mu}_n$, $\boldsymbol{\mu}_h$, and $\boldsymbol{\Sigma}_n$ given a set of T observation frames $\mathbf{y}_t$; this estimation can be done iteratively using the EM algorithm on Eq. (10.162). If the noise process is stationary, $\boldsymbol{\Sigma}_{\Delta n}$ can be approximated, assuming independence between $\mathbf{n}_{t+2}$ and $\mathbf{n}_{t-2}$, by $\boldsymbol{\Sigma}_{\Delta n} = 2\boldsymbol{\Sigma}_n$. Similarly, $\boldsymbol{\Sigma}_{\Delta^2 n}$ can be approximated, assuming independence between $\Delta\mathbf{n}_{t+1}$ and $\Delta\mathbf{n}_{t-1}$, by $\boldsymbol{\Sigma}_{\Delta^2 n} = 4\boldsymbol{\Sigma}_n$. If the noise process is not stationary, it is best to estimate $\boldsymbol{\Sigma}_{\Delta n}$ and $\boldsymbol{\Sigma}_{\Delta^2 n}$ directly from input data.

If the distribution of x is a mixture of N Gaussians, each Gaussian is transformed according to the equations above. If the distribution of n is also a mixture of M Gaussians, the composite distribution has NM Gaussians. While this increases the number of Gaussians, the decoder remains functionally the same as for clean speech. Because normally you do not want to alter the number of Gaussians in the system when doing noise adaptation, it is often assumed that n is a single Gaussian.

10.6.5.
Retraining on Compensated Features

We have discussed adapting the HMM to the new acoustical environment using the standard front-end features, in most cases the mel-cepstrum. Section 10.5 dealt with cleaning the noisy features without retraining the HMMs. It is logical to consider a combination of both, where the features are cleaned to remove noise and channel effects and the HMMs are then retrained to take into account that this processing stage is not perfect. This idea is illustrated in Figure 10.36, where we compare the word error rate of standard matched-noise-condition training with that of matched-noise-condition training after the features have been compensated by a variant of the Gaussian-mixture algorithms described in Section 10.5.6 [21]. The low error rates of both curves in Figure 10.36 are hard to obtain in practice, because they assume we know exactly what the noise level and type are ahead of time, which in general is difficult. On the other hand, this approach could be combined with the multistyle training discussed in Section 10.6.1 or with a set of clustered models as discussed in Chapter 9.

Figure 10.36 Word error rates of matched-noise training without feature preprocessing and with the SPLICE algorithm [21] as a function of the SNR in dB for additive white noise. The error rate with the Gaussian-mixture model is up to 30% lower than that of standard noisy matched conditions for low SNRs, while it is about the same for high SNRs.

10.7. MODELING NONSTATIONARY NOISE

The previous sections deal mostly with stationary noise. In practice, there are many nonstationary noises, which often match a random word in the system's lexicon better than the silence model does. In this case, the benefit of using speech recognition vanishes quickly.
The most typical types of noise present in desktop applications are mouth noise (lip smacks, throat clearings, coughs, nasal clearings, heavy breathing, uhms and uhs, etc.), computer noise (keyboard typing, microphone adjustment, computer fan, disk head seeking, etc.), and office noise (phone rings, paper rustles, shutting doors, interfering speakers, etc.). We can use a simple method that has been successful in speech recognition [57], shown in Algorithm 10.4. In practice, updating the transcription turns out to be important, because human labelers often miss short noises that the system can uncover. Since the noise training data are often limited in coverage, some noises can easily be matched to short word models such as if or two. Because of the unique characteristics of noise rejection, we often need to further augment the confidence measures described in Chapter 9: in practice, we need an additional classifier to provide more detailed discrimination between speech and noise. We can use a two-level classifier for this purpose; the ratio between the all-speech model score (fully connected context-independent phone models) and the all-noise model score (fully connected silence and noise phone models) can be used.

Another approach [55] consists of having an HMM for noise with several states to deal with nonstationary noises. The decoder then needs to perform a three-dimensional Viterbi search, which at each frame evaluates every possible speech state against every possible noise state to achieve the speech/noise decomposition (see Figure 10.37). The computational complexity of such an approach is very large, though in theory it can handle nonstationary noises quite well.

ALGORITHM 10.4 EXPLICIT NOISE MODELING
Step 1: Augment the vocabulary with noise words (such as ++SMACK++), each composed of a single noise phoneme (such as +SMACK+), which are thus modeled with a single HMM.
These noise words have to be labeled in the transcriptions so that they can be trained.
Step 2: Train the noise models, as well as the other models, using the standard HMM training procedure.
Step 3: Update the transcription. To do so, convert the transcription into a network in which the noise words can optionally be inserted between each pair of words in the original transcription. A forced-alignment segmentation is then conducted with the current HMM and the optional noise words inserted; the segmentation with the highest likelihood is selected, yielding an updated transcription.
Step 4: If converged, stop; otherwise go to Step 2.

Figure 10.37 Speech/noise decomposition with a three-dimensional Viterbi decoder: each observation frame is jointly accounted for by a state of the speech HMM and a state of the noise HMM.

10.8. HISTORICAL PERSPECTIVE AND FURTHER READING

This chapter contains a number of diverse topics that are often described in different fields; no single reference covers them all. For further reading on adaptive filtering, see the books by Widrow and Stearns [59] and Haykin [30]. Theodoridis and Bellanger [54] provide a good summary of adaptive filtering, and Breining et al. [16] a good summary of echo-canceling techniques. Lee [38] has a good summary of independent component analysis for blind source separation. Deller et al. [20] describe a number of techniques for speech enhancement. Juang [35] and Junqua [37] survey techniques for improving the robustness of speech recognition systems to noise. Acero [2] compares a number of feature-transformation techniques in the cepstral domain and introduces the model of the environment.

Adaptive filtering theory emerged early in the twentieth century. The Wiener and LMS filters were derived by Wiener and Widrow in 1949 and 1960, respectively. Norbert Wiener joined the MIT faculty in 1919 and made profound contributions to generalized harmonic analysis, the famous Wiener-Hopf equation, and the resulting Wiener filter.
The LMS algorithm was developed by B. Widrow and his colleagues at Stanford University in the early 1960s. From a practical point of view, the use of gradient microphones (Olsen [46]) has proven to be one of the more important contributions to increased robustness; directional microphones are commonplace today in most speech recognition systems. Boll [13] first suggested the use of spectral subtraction. This has been the cornerstone of noise suppression, and many systems nowadays still use a variant of Boll's original algorithm. The cepstral mean normalization algorithm was proposed by Atal [8] in 1974, although it was not until the early 1990s that it became commonplace in speech recognition systems evaluated in the DARPA speech programs [33]. Hermansky proposed PLP [31] in 1990. The work of Rich Stern's robustness group at CMU (especially the Ph.D. thesis work of Acero [1] and Moreno [43]) and the Ph.D. thesis of Gales [26] also advanced the understanding of the effect of noise on the cepstrum. Bell and Sejnowski [10] gave the field of independent component analysis a boost in 1995 with their infomax rule. The field of source separation is a promising alternative for improving the robustness of speech recognition systems when more than one microphone is available.

REFERENCES

[1] Acero, A., Acoustical and Environmental Robustness in Automatic Speech Recognition, PhD Thesis, Electrical and Computer Engineering, 1990, Carnegie Mellon University, Pittsburgh.
[2] Acero, A., Acoustical and Environmental Robustness in Automatic Speech Recognition, 1993, Boston, Kluwer Academic Publishers.
[3] Acero, A., S. Altschuler, and L. Wu, "Speech/Noise Separation Using Two Microphones and a VQ Model of Speech Signals," Int. Conf. on Spoken Language Processing, 2000, Beijing, China.
[4] Acero, A., et al., "HMM Adaptation Using Vector Taylor Series for Noisy Speech Recognition," Int. Conf. on Spoken Language Processing, 2000, Beijing, China.
[5] Acero, A. and X.D. Huang, "Augmented Cepstral Normalization for Robust Speech Recognition," Proc. of the IEEE Workshop on Automatic Speech Recognition, 1995, Snowbird, UT.
[6] Acero, A. and R. Stern, "Environmental Robustness in Automatic Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1990, Albuquerque, NM, pp. 849-852.
[7] Amari, S., A. Cichocki, and H.H. Yang, "A New Learning Algorithm for Blind Separation," Advances in Neural Information Processing Systems, 1996, Cambridge, MA, MIT Press.
[8] Atal, B.S., "Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification," Journal of the Acoustical Society of America, 1974, 55(6), pp. 1304-1312.
[9] Attias, H., "Independent Factor Analysis," Neural Computation, 1998, 11, pp. 803-851.
[10] Bell, A.J. and T.J. Sejnowski, "An Information Maximisation Approach to Blind Separation and Blind Deconvolution," Neural Computation, 1995, 7(6), pp. 1129-1159.
[11] Belouchrani, A., et al., "A Blind Source Separation Technique Using Second Order Statistics," IEEE Trans. on Signal Processing, 1997, 45(2), pp. 434-444.
[12] Berouti, M., R. Schwartz, and J. Makhoul, "Enhancement of Speech Corrupted by Acoustic Noise," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1979, pp. 208-211.
[13] Boll, S.F., "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. on Acoustics, Speech and Signal Processing, 1979, 27(Apr.), pp. 113-120.
[14] Boll, S.F. and D.C. Pulsipher, "Suppression of Acoustic Noise in Speech Using Two Microphone Adaptive Noise Cancellation," IEEE Trans. on Acoustics, Speech and Signal Processing, 1980, 28(Dec.), pp. 751-753.
[15] Bregman, A.S., Auditory Scene Analysis, 1990, Cambridge, MA, MIT Press.
[16] Breining, C., et al., "Acoustic Echo Control," IEEE Signal Processing Magazine, 1999, pp. 42-69.
[17] Cardoso, J., "Blind Signal Separation: Statistical Principles," Proc. of the IEEE, 1998, 86(10), pp. 2009-2025.
[18] Cardoso, J.F., "Infomax and Maximum Likelihood for Blind Source Separation," IEEE Signal Processing Letters, 1997, 4, pp. 112-114.
[19] Comon, P., "Independent Component Analysis: A New Concept," Signal Processing, 1994, 36, pp. 287-314.
[20] Deller, J.R., J.H.L. Hansen, and J.G. Proakis, Discrete-Time Processing of Speech Signals, 2000, IEEE Press.
[21] Deng, L., et al., "Large-Vocabulary Speech Recognition Under Adverse Acoustic Environments," Int. Conf. on Spoken Language Processing, 2000, Beijing, China.
[22] Ephraim, Y., "Statistical Model-Based Speech Enhancement System," Proc. of the IEEE, 1992, 80(10), pp. 1526-1555.
[23] Ephraim, Y. and D. Malah, "Speech Enhancement Using Minimum Mean Square Error Short Time Spectral Amplitude Estimator," IEEE Trans. on Acoustics, Speech and Signal Processing, 1984, 32(6), pp. 1109-1121.
[24] Flanagan, J.L., et al., "Computer-Steered Microphone Arrays for Sound Transduction in Large Rooms," Journal of the Acoustical Society of America, 1985, 78(5), pp. 1508-1518.
[25] Frost, O.L., "An Algorithm for Linearly Constrained Adaptive Array Processing," Proc. of the IEEE, 1972, 60(8), pp. 926-935.
[26] Gales, M.J., Model Based Techniques for Noise Robust Speech Recognition, PhD Thesis, Engineering Department, 1995, Cambridge University.
[27] Ghitza, O., "Robustness against Noise: The Role of Timing-Synchrony Measurement," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1987, pp. 2372-2375.
[28] Gopinath, R.A., et al., "Robust Speech Recognition in Noise - Performance of the IBM Continuous Speech Recognizer on the ARPA Noise Spoke Task," Proc. ARPA Workshop on Spoken Language Systems Technology, 1995, pp. 127-133.
[29] Griffiths, L.J. and C.W. Jim, "An Alternative Approach to Linearly Constrained Adaptive Beamforming," IEEE Trans. on Antennas and Propagation, 1982, 30(1), pp. 27-34.
[30] Haykin, S., Adaptive Filter Theory, 2nd ed., 1996, Upper Saddle River, NJ, Prentice-Hall.
[31] Hermansky, H., "Perceptual Linear Predictive (PLP) Analysis of Speech," Journal of the Acoustical Society of America, 1990, 87(4), pp. 1738-1752.
[32] Hermansky, H. and N. Morgan, "RASTA Processing of Speech," IEEE Trans. on Speech and Audio Processing, 1994, 2(4), pp. 578-589.
[33] Huang, X.D., et al., "The SPHINX-II Speech Recognition System: An Overview," Computer Speech and Language, 1993, pp. 137-148.
[34] Hunt, M. and C. Lefebre, "A Comparison of Several Acoustic Representations for Speech Recognition with Degraded and Undegraded Speech," Int. Conf. on Acoustics, Speech and Signal Processing, 1989, pp. 262-265.
[35] Juang, B.H., "Speech Recognition in Adverse Environments," Computer Speech and Language, 1991, 5, pp. 275-294.
[36] Junqua, J.C., "The Lombard Reflex and Its Role in Human Listeners and Automatic Speech Recognition," Journal of the Acoustical Society of America, 1993, 93(1), pp. 510-524.
[37] Junqua, J.C. and J.P. Haton, Robustness in Automatic Speech Recognition, 1996, Kluwer Academic Publishers.
[38] Lee, T.W., Independent Component Analysis: Theory and Applications, 1998, Kluwer Academic Publishers.
[39] Lippmann, R.P., E.A. Martin, and D.P. Paul, "Multi-Style Training for Robust Isolated-Word Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1987, Dallas, TX, pp. 709-712.
[40] Lombard, E., "Le Signe de l'Élévation de la Voix," Ann. Maladies Oreille, Larynx, Nez, Pharynx, 1911, 37, pp. 101-119.
[41] Matassoni, M., M. Omologo, and D. Giuliani, "Hands-Free Speech Recognition Using a Filtered Clean Corpus and Incremental HMM Adaptation," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, 2000, Istanbul, Turkey, pp. 1407-1410.
Mendel, J.M., Lessons in Estimation Theory for Signal Processing, Communications, and Control, 1995, Upper Saddle River, NJ, Prentice Hall. Moreno, P., Speech Recognition in Noisy Environments, PhD Thesis in Electrical and Computer Engineering 1996, Carnegie Mellon University, Pittsburgh. Moreno, P.J., B. Raj, and R.M. Stern, "A Vector Taylor Series Approach for Environment Independent Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1996, Atlanta pp. 733-736. Morgan, N. and H. Bourlard, Continuous Speech Recognition: An Introduction to Hybrid HMM/Connectionist Approach, in IEEE Signal Processing Magazine, 1995. pp. 25-42. Olsen, H.F., "Gradient Microphones," Journal of the Acoustical Society of America, 1946, 17,(3), pp. 192-198. Porter, J.E. and S.F. Boll, "Optimal Estimators for Spectral Restoration of Noisy Speech," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1984, San Diego, CA pp. 18.A.2.1-4. Rahim, M.G. and B.H. Juang, "Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition," IEEE Trans. on Speech and Audio Processing, 1996, 4(1), pp. 19-30. Seneff, S., "A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing," Journal of Phonetics, 1988, 16(1), pp. 55-76. Sharma, S., et al., "Feature Extraction Using Non-Linear Transformation for Robust Speech Recognition on the Aurora Database," Int. Conf. on Acoustics, Speech and Signal Processing, 2000, Istanbul, Turkey pp. 1117-1120. Sullivan, T.M. and R.M. Stern, "Multi-Microphone Correlation-Based Processing for Robust Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1993, Minneapolis pp. 2091-2094. Suzuki, Y., et al., "An Optimum Computer-Generated Pulse Signal Suitable for the Measurement of Very Long Impulse Responses," Journal of the Acoustical Society of America, 1995, 97(2), pp. 1119-1123. 538 Environmental Robustness [53] Tamura, S. and A. 
Waibel, "Noise Reduction Using Connectionist Models," Int. Conf. on Acoustics, Speech and Signal Processing, 1988, New York pp. 553-556. Theodoridis, S. and M.G. Bellanger, Adaptive Filters and Acoustic Echo Control, in IEEE Signal Processing Magazine, 1999. pp. 12-41. Varga, A.P. and R.K. Moore, "Hidden Markov Model Decomposition of Speech and Noise," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1990 pp. 845-848. Wan, E.A., R.V.D. Merwe, and A.T. Nelson, "Dual Estimation and the Unscented Transformation" in Advances in Neural Information Processing Systems, S.A. Solla, T.K. Leen, and K.R. Muller, eds. 2000, Cambridge, MA, pp. 666-672, MIT Press. Ward, W., "Modeling Non-Verbal Sounds for Speech Recognition," Proc. Speech and Natural Language Workshop, 1989, Cape Cod, MA, Morgan Kauffman pp. 311-318. Widrow, B. and M.E. Hoff, "Adaptive Switching Algorithms," IRE Wescon Convention Record, 1960 pp. 96-104. Widrow, B. and S.D. Stearns, Adaptive Signal Processing, 1985, Upper Saddle River, NJ, Prentice Hall. Woodland, P.C., "Improving Environmental Robustness in Large Vocabulary Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1996, Atlanta, Georgia pp. 65-68. [54] [55] [56] [57] [58] [59] [60]
