CHAPTER 10

Environmental Robustness
A speech recognition system trained in the lab with clean speech may degrade significantly in the real world if the clean speech used in training doesn't match real-world speech. If its accuracy doesn't degrade very much under mismatched conditions, the system is called robust. There are several reasons why real-world speech may differ from clean speech; in this chapter we focus on the influence of the acoustical environment, defined as the transformations that affect the speech signal from the time it leaves the mouth until it is in digital format.
Chapter 9 discussed a number of variability factors that are critical to speech recognition. Because the acoustical environment is so important to practical systems, we devote this chapter to ways of increasing environmental robustness, including microphones, echo cancellation, and a number of methods that enhance the speech signal, its spectrum, and the corresponding acoustic model in a speech recognition system.
10.1. THE ACOUSTICAL ENVIRONMENT
The acoustical environment is defined as the set of transformations that affect the speech
signal from the time it leaves the speaker’s mouth until it is in digital form. Two main
sources of distortion are described here: additive noise and channel distortion. Additive
noise, such as a fan running in the background, door slams, or other speakers’ speech, is
common in our daily life. Channel distortion can be caused by reverberation, the frequency
response of a microphone, the presence of an electrical filter in the A/D circuitry, the response of the local loop of a telephone line, a speech codec, etc. Reverberation, caused by
reflections of the acoustical wave in walls and other objects, can also dramatically alter the
speech signal.
10.1.1. Additive Noise
Additive noise can be stationary or nonstationary. Stationary noise, such as that made by a
computer fan or air conditioning, has a power spectral density that does not change over
time. Nonstationary noise, caused by door slams, radio, TV, and other speakers’ voices, has
statistical properties that change over time. A signal captured with a close-talking microphone has little noise and reverberation, even though there may be lip smacks and breathing
noise. A microphone that is not close to the speaker’s mouth may pick up a lot of noise
and/or reverberation.
As described in Chapter 5, a signal x[n] is defined as white noise if its power spectrum is flat, S_{xx}(f) = q, a condition equivalent to different samples' being uncorrelated, R_{xx}[n] = q\delta[n]. Thus, a white noise signal has to have zero mean. This definition tells us
about the second-order moments of the random process, but not about its distribution. Such
noise can be generated synthetically by drawing samples from a distribution p(x); thus we
could have uniform white noise if p(x) is uniform, or Gaussian white noise if p(x) is Gaussian. While typically subroutines are available that generate uniform white noise, we are
often interested in white Gaussian noise, as it better resembles the noise that tends to occur
in practice. See Algorithm 10.1 for a method to generate white Gaussian noise. Variable x is
normally continuous, but it can also be discrete.
White noise is useful as a conceptual entity, but it seldom occurs in practice. Most of
the noise captured by a microphone is colored, since its spectrum is not white. Pink noise is
a particular type of colored noise that has a low-pass nature, as it has more energy at the low
frequencies and rolls off at higher frequencies. The noise generated by a computer fan, an air
conditioner, or an automobile engine can be approximated by pink noise. We can synthesize
pink noise by filtering white noise with a filter whose magnitude squared equals the desired
power spectrum.
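As an illustration, the following sketch (in Python with NumPy/SciPy, the convention used for the code sketches in this chapter) designs an FIR filter whose magnitude response is the square root of a desired power spectrum and filters white Gaussian noise through it. The breakpoint frequencies and power values are made up for illustration and stand in for a measured fan or engine spectrum.

import numpy as np
from scipy.signal import firwin2, lfilter

fs = 16000                                 # assumed sampling rate in Hz
rng = np.random.default_rng(0)
white = rng.standard_normal(4 * fs)        # 4 s of white Gaussian noise

# Hypothetical desired power spectrum S(f): flat at low frequencies, rolling off above.
freqs = np.array([0.0, 200.0, 1000.0, 4000.0, fs / 2])
power = np.array([1.0, 1.0, 0.1, 0.01, 0.001])

# The filter magnitude must equal the square root of the desired power spectrum.
h = firwin2(513, freqs, np.sqrt(power), fs=fs)
colored = lfilter(h, 1.0, white)           # low-pass ("pink-like") colored noise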
A great deal of additive noise is nonstationary, since its statistical properties change
over time. In practice, even the noises from a computer, an air conditioning system, or an
automobile are not perfectly stationary. Some nonstationary noises, such as keyboard clicks,
are caused by physical objects. The speaker can also cause nonstationary noises such as lip
smacks and breath noise. The cocktail party effect is the phenomenon under which a human
listener can focus on one conversation out of many at a cocktail party. The noise of the
conversations that are not focused upon is called babble noise. When the nonstationary noise
is correlated with a known signal, the adaptive echo-canceling (AEC) techniques of Section
10.3 can be used.
ALGORITHM 10.1 WHITE NOISE GENERATION
To generate white noise in a computer, we can first generate a random variable ρ with a Rayleigh distribution:

p_\rho(\rho) = \rho e^{-\rho^2/2}    (10.1)

from another random variable r with a uniform distribution between (0, 1), p_r(r) = 1, by simply equating the probability mass p_\rho(\rho)\,d\rho = p_r(r)\,dr, so that \frac{dr}{d\rho} = \rho e^{-\rho^2/2}; with integration, it results in r = e^{-\rho^2/2}, and the inverse is given by

\rho = \sqrt{-2 \ln r}    (10.2)

If r is uniform between (0, 1), and ρ is computed through Eq. (10.2), it follows a Rayleigh distribution as in Eq. (10.1). We can then generate Rayleigh white noise by drawing independent samples from such a distribution.

If we want to generate white Gaussian noise, the method used above does not work, because the integral of the Gaussian distribution does not exist in closed form. However, if ρ follows a Rayleigh distribution as in Eq. (10.1), obtained using Eq. (10.2) where r is uniform between (0, 1), and θ is uniformly distributed between (0, 2π), then white Gaussian noise can be generated as the following two variables x and y:

x = \rho \cos(\theta), \qquad y = \rho \sin(\theta)    (10.3)

They are independent Gaussian random variables with zero mean and unity variance, since the Jacobian of the transformation is given by

J = \begin{vmatrix} \partial x/\partial\rho & \partial x/\partial\theta \\ \partial y/\partial\rho & \partial y/\partial\theta \end{vmatrix} = \begin{vmatrix} \cos\theta & -\rho\sin\theta \\ \sin\theta & \rho\cos\theta \end{vmatrix} = \rho    (10.4)

and the joint density p(x, y) is given by

p(x, y) = \frac{p(\rho, \theta)}{J} = \frac{p(\rho)\,p(\theta)}{\rho} = \frac{e^{-\rho^2/2}}{2\pi} = \frac{1}{2\pi} e^{-(x^2 + y^2)/2} = N(x, 0, 1)\,N(y, 0, 1)    (10.5)
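The following sketch is a direct transcription of Algorithm 10.1 in Python/NumPy (the sample count and seed are arbitrary). In practice a library routine such as numpy.random.standard_normal does the same job; the explicit form simply mirrors Eqs. (10.2) and (10.3).

import numpy as np

def white_gaussian_noise(n_samples, seed=0):
    """Zero-mean, unit-variance white Gaussian noise via Algorithm 10.1."""
    rng = np.random.default_rng(seed)
    r = 1.0 - rng.random(n_samples)        # uniform on (0, 1]; avoids log(0)
    theta = rng.uniform(0.0, 2.0 * np.pi, n_samples)
    rho = np.sqrt(-2.0 * np.log(r))        # Rayleigh variable, Eq. (10.2)
    return rho * np.cos(theta)             # Eq. (10.3); rho*sin(theta) gives a second, independent sequence

noise = white_gaussian_noise(16000)
print(noise.mean(), noise.var())           # approximately 0 and 1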
The presence of additive noise can sometimes change the way the speaker speaks. The
Lombard effect [40] is a phenomenon by which a speaker increases his vocal effort in the
presence of background noise. When a large amount of noise is present, the speaker tends to
shout, which entails not only a higher amplitude, but also often higher pitch, slightly different formants, and a different coloring of the spectrum. It is very difficult to characterize
these transformations analytically, but recently some progress has been made [36].
10.1.2. Reverberation
If both the microphone and the speaker are in an anechoic1 chamber or in free space, a microphone picks up only the direct acoustic path. In practice, in addition to the direct acoustic
path, there are reflections off walls and other objects in the room. We are well aware of this
effect when we are in a large room, which can prevent us from understanding speech if the reverberation time is too long. Speech recognition systems are much less robust than humans and
they start to degrade with shorter reverberation times, such as those present in a normal office environment.
As described in Section 10.2.2, the signal level at the microphone is inversely proportional to the distance r from the speaker for the direct path. For the kth reflected sound wave,
the sound has to travel a larger distance rk , so that its level is proportionally lower. This
reflection also takes time Tk = rk / c to arrive, where c is the speed of sound in air.2 Moreover, some energy absorption a takes place each time the sound wave hits a surface. The
impulse response of such a filter looks like

h[n] = \sum_{k=0}^{\infty} \frac{\rho_k}{r_k} \delta[n - T_k] = \frac{1}{c} \sum_{k=0}^{\infty} \frac{\rho_k}{T_k} \delta[n - T_k]    (10.6)
where ρ k is the combined attenuation of the kth reflected sound wave due to absorption.
Anechoic rooms have ρ k ≈ 0 . In general ρ k is a (generally decreasing) function of frequency, so that instead of impulses δ [n] in Eq. (10.6), other (low-pass) impulse responses
are used.
Often we have available a large amount of speech data recorded with a close-talking
microphone, and we would like to use the speech recognition system with a far field microphone. To do that we can filter the clean-speech training database with a filter h[n], so that
the filtered speech resembles speech collected with the far field microphone, and then retrain
the system. This requires estimating the impulse response h[n] of a room. Alternatively, we
can filter the signal from the far field microphone with an inverse filter to make it resemble
the signal from the close-talking microphone.
1 An anechoic chamber is a room that has walls made of special fiberglass or other sound-absorbing materials so that it absorbs all echoes. It is equivalent to being in free space, where there are neither walls nor reflecting surfaces.
2 In air at standard atmospheric pressure and humidity the speed of sound is c = 331.4 + 0.6T (m/s), where T is the temperature in degrees Celsius. It varies with different media and different levels of humidity and pressure.
One way to estimate the impulse response is to play a white noise signal x[n] through
a loudspeaker or artificial mouth; the signal y[n] captured at the microphone is given by
y[n] = x[n] ∗ h[n] + v[n]
(10.7)
where v[n] is the additive noise present at the microphone. This noise is due to sources such
as air conditioning and computer fans and is an obstacle to measuring h[n]. The impulse
response can be estimated by minimizing the error over N samples
E = \frac{1}{N} \sum_{n=0}^{N-1} \left( y[n] - \sum_{m=0}^{M-1} h[m] x[n-m] \right)^2    (10.8)
which, taking the derivative with respect to h[l] and equating it to 0, results in our estimate \hat{h}[l]:

\frac{\partial E}{\partial h[l]}\bigg|_{h[l]=\hat{h}[l]} = \frac{1}{N} \sum_{n=0}^{N-1} \left( y[n] - \sum_{m=0}^{M-1} \hat{h}[m] x[n-m] \right) x[n-l]
= \frac{1}{N} \sum_{n=0}^{N-1} y[n] x[n-l] - \sum_{m=0}^{M-1} \hat{h}[m] \left( \frac{1}{N} \sum_{n=0}^{N-1} x[n-m] x[n-l] \right)
= \frac{1}{N} \sum_{n=0}^{N-1} y[n] x[n-l] - \hat{h}[l] - \sum_{m=0}^{M-1} \hat{h}[m] \left( \frac{1}{N} \sum_{n=0}^{N-1} x[n-m] x[n-l] - \delta[m-l] \right) = 0    (10.9)
Since we know our white process is ergodic, it follows that we can replace time averages by
ensemble averages as N → ∞ :
\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} x[n-m] x[n-l] = E\{ x[n-m] x[n-l] \} = \delta[m-l]    (10.10)
so that we can obtain a reasonable estimate of the impulse response as
\hat{h}[l] = \frac{1}{N} \sum_{n=0}^{N-1} y[n] x[n-l]    (10.11)
Inserting Eq. (10.7) into Eq. (10.11), we obtain

\hat{h}[l] = h[l] + e[l]    (10.12)
where the estimation error e[l] is given by

e[l] = \frac{1}{N} \sum_{n=0}^{N-1} v[n] x[n-l] + \sum_{m=0}^{M-1} h[m] \left( \frac{1}{N} \sum_{n=0}^{N-1} x[n-m] x[n-l] - \delta[m-l] \right)    (10.13)
If v[n] and x[n] are independent processes, then E{e[l ]} = 0 , since x[n] is zero-mean,
so that the estimate of Eq. (10.11) is unbiased. The covariance matrix decreases to 0 as
N → ∞ , with the dominant term being the noise v[n]. The choice of N for a low-variance
estimate depends on the filter length M and the noise level present in the room.
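The measurement can be simulated end to end with a few lines; in the sketch below the "true" room response and the noise level are invented for illustration, and in a real measurement y[n] would come from the microphone while x[n] is the white noise actually played.

import numpy as np

rng = np.random.default_rng(1)
N, M = 64000, 2000                          # 4 s of probe noise at 16 kHz, 125-ms filter

# Hypothetical room response: a direct path plus a decaying train of reflections.
h_true = np.zeros(M)
h_true[0] = 1.0
taps = np.arange(200, M, 150)
h_true[taps] = 0.5 * np.exp(-taps / 600.0)

x = rng.standard_normal(N)                  # white probe signal played by the loudspeaker
v = 0.05 * rng.standard_normal(N)           # ambient noise at the microphone
y = np.convolve(x, h_true)[:N] + v          # Eq. (10.7)

# Eq. (10.11): h_hat[l] = (1/N) * sum_n y[n] x[n-l]
h_hat = np.array([np.dot(y[l:], x[:N - l]) / N for l in range(M)])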
The filter h[n] could also be estimated by playing sine waves of different frequencies
or a chirp3 [52]. Since playing a white noise signal or sine waves may not be practical, another method is based on collecting stereo recordings with a close-talking microphone and a
far field microphone. The filter h[n] of length M is estimated so that when applied to the
close-talking signal x[n] it minimizes the squared error with the far field signal y[n], which
results in the following set of M linear equations:
\sum_{m=0}^{M-1} h[m] R_{xx}[m-n] = R_{xy}[n]    (10.14)
which is a generalization of Eq. (10.11) when x[n] is not a white noise signal.
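Because the matrix with entries R_xx[m−n] is symmetric Toeplitz, Eq. (10.14) can be solved with a Levinson-type solver. A minimal sketch, assuming the close-talking signal x and far-field signal y are already available as equal-length NumPy arrays (the correlation estimates used here are the simple biased ones):

import numpy as np
from scipy.linalg import solve_toeplitz

def estimate_channel(x, y, M):
    """Estimate an M-tap filter h mapping close-talking x to far-field y, Eq. (10.14)."""
    N = len(x)
    # Biased autocorrelation of x and cross-correlation of x and y for lags 0..M-1.
    r_xx = np.array([np.dot(x[:N - k], x[k:]) / N for k in range(M)])
    r_xy = np.array([np.dot(x[:N - k], y[k:]) / N for k in range(M)])
    # The Toeplitz matrix R_xx[m-n] is fully defined by its first column r_xx.
    return solve_toeplitz(r_xx, r_xy)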
Figure 10.1 Typical impulse response of an average office. Sampling rate was 16 kHz. It was
estimated by driving a 4-minute segment of white noise through an artificial mouth and using
Eq. (10.11). The filter length is about 125 ms.
It is not uncommon to have reverberation times of over 100 milliseconds in office
rooms. In Figure 10.1 we show the typical impulse response of an average office.
10.1.3. A Model of the Environment
A widely used model of the degradation encountered by the speech signal when it gets corrupted by both additive noise and channel distortion is shown in Figure 10.2. We can derive the relationships between the clean signal and the corrupted signal both in power-spectrum and cepstrum domains based on such a model [2].

3 A chirp function continuously varies its frequency. For example, a linear chirp varies its frequency linearly with time: sin(n(ω_0 + ω_1 n)).
Figure 10.2 A model of the environment.
In the time domain, additive noise and linear filtering result in

y[m] = x[m] ∗ h[m] + n[m]    (10.15)
It is convenient to express this in the frequency domain using the short-time analysis
methods of Chapter 6. To do that, we window the signal, take a 2K-point DFT in Eq. (10.15)
and then the magnitude squared:
|Y(f_k)|^2 = |X(f_k)|^2 |H(f_k)|^2 + |N(f_k)|^2 + 2\,\mathrm{Re}\{ X(f_k) H(f_k) N^*(f_k) \}
= |X(f_k)|^2 |H(f_k)|^2 + |N(f_k)|^2 + 2\,|X(f_k)|\,|H(f_k)|\,|N(f_k)| \cos(\theta_k)    (10.16)
where k = 0, 1, \ldots, K, we have used uppercase for frequency-domain linear spectra, and \theta_k is the angle between the filtered signal and the noise for bin k.
The expected value of the cross-term in Eq. (10.16) is zero, since x[m] and n[m] are
statistically independent. In practice, this term is not zero for a given frame, though it is
small if we average over a range of frequencies, as we often do when computing the popular
mel-cepstrum (see Chapter 6). When using a filterbank, we can obtain a relationship for the
energies at each of the M filters:
|Y(f_i)|^2 \approx |X(f_i)|^2 |H(f_i)|^2 + |N(f_i)|^2    (10.17)
where it has been shown experimentally that this assumption works well in practice.
Equation (10.17) is also implicitly assuming that the length of h[n], the filter’s impulse
response, is much shorter than the window length 2N. That means that for filters with long
reverberation times, Eq. (10.17) is inaccurate. For example, for |N(f)|^2 = 0, a window shift of T, and a filter's impulse response h[n] = \delta[n-T], we have Y_t[f_m] = X_{t-1}[f_m]; i.e., the
output spectrum at frame t does not depend on the input spectrum at that frame. This is a
more serious assumption, which is why speech recognition systems tend to fail under long
reverberation times.
By taking logarithms in Eq. (10.17), and after some algebraic manipulation, we obtain
\ln|Y(f_i)|^2 \approx \ln|X(f_i)|^2 + \ln|H(f_i)|^2 + \ln\left( 1 + \exp\left( \ln|N(f_i)|^2 - \ln|X(f_i)|^2 - \ln|H(f_i)|^2 \right) \right)    (10.18)
Since most speech recognition systems use cepstrum features, it is useful to see the effect of the additive noise and channel distortion directly on the cepstrum. To do that, let’s
define the following length-(M + 1) cepstrum vectors:
x = C\,(\, \ln|X(f_0)|^2 \;\; \ln|X(f_1)|^2 \;\; \cdots \;\; \ln|X(f_M)|^2 \,)
h = C\,(\, \ln|H(f_0)|^2 \;\; \ln|H(f_1)|^2 \;\; \cdots \;\; \ln|H(f_M)|^2 \,)
n = C\,(\, \ln|N(f_0)|^2 \;\; \ln|N(f_1)|^2 \;\; \cdots \;\; \ln|N(f_M)|^2 \,)
y = C\,(\, \ln|Y(f_0)|^2 \;\; \ln|Y(f_1)|^2 \;\; \cdots \;\; \ln|Y(f_M)|^2 \,)    (10.19)
where C is the DCT matrix and we have used lower-case bold to represent cepstrum vectors.
Combining Eqs. (10.18) and (10.19) results in
yˆ = x + h + g(n − x − h)
(10.20)
where the nonlinear function g(z) is given by
g(z) = C \ln\left( 1 + e^{C^{-1} z} \right)    (10.21)
Equations (10.20) and (10.21) say that we can compute the cepstrum of the corrupted
speech if we know the cepstrum of the clean speech, the cepstrum of the noise, and the cepstrum of the filter. In practice, the DCT matrix C is not square, so that the dimension of the
cepstrum vector is much smaller than the number of filters. This means that we are losing
resolution when going back to the frequency domain, and thus Eqs. (10.20) and (10.21) represent only an approximation, though it has been shown to work reasonably well.
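A small sketch of Eqs. (10.19)-(10.21), using a square DCT matrix so that C^{-1} is an exact inverse (with the usual non-square C, a pseudo-inverse would be used instead, which is where the loss of resolution mentioned above comes in). The log filterbank energies x_log, h_log, and n_log are assumed to be given.

import numpy as np

def dct_matrix(M):
    """Orthonormal DCT-II matrix mapping M log filterbank energies to M cepstra."""
    k = np.arange(M)[:, None]
    n = np.arange(M)[None, :]
    C = np.sqrt(2.0 / M) * np.cos(np.pi * k * (2 * n + 1) / (2 * M))
    C[0, :] /= np.sqrt(2.0)
    return C

def corrupted_cepstrum(x_log, h_log, n_log):
    """Cepstrum of noisy speech from clean, channel, and noise log-spectra, Eq. (10.20)."""
    C = dct_matrix(len(x_log))
    x, h, n = C @ x_log, C @ h_log, C @ n_log                 # Eq. (10.19)
    g = C @ np.log1p(np.exp(np.linalg.inv(C) @ (n - x - h)))  # Eq. (10.21)
    return x + h + g                                          # Eq. (10.20)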
As discussed in Chapter 9, the distribution of the cepstrum of x can be modeled as a
mixture of Gaussian densities. Even if we assume that x follows a Gaussian distribution, y in
Eq. (10.20) is no longer Gaussian because of the nonlinearity in Eq. (10.21).
It is difficult to visualize the effect on the distribution, given the nonlinearity involved.
To provide some insight, let’s consider the frequency-domain version of Eq. (10.18) when
no filtering is done, i.e., H ( f ) = 1 :
y = x + ln (1 + exp ( n − x ) )
(10.22)
where x, n, and y represent the log-spectral energies of the clean signal, noise, and noisy
signal, respectively, for a given frequency. Using simulated data, not real speech, we can
analyze the effect of this transformation. Let’s assume that both x and n are Gaussian random variables. We can use Monte Carlo simulation to draw a large number of points from
those two Gaussian distributions and obtain the corresponding noisy values y using Eq.
(10.22). Figure 10.3 shows the resulting distribution for several values of σ x . We fixed
µ n = 0 dB , since it is only a relative level, and set σ n = 2 dB , a typical value. We also set
µ x = 25dB and see that the resulting distribution can be bimodal when σ x is very large.
Fortunately, for modern speech recognition systems that have many Gaussian components,
σ x is never that large and the resulting distribution is unimodal.
Figure 10.3 Distributions of the corrupted log-spectra y of Eq. (10.22) using simulated data.
The distribution of the noise log-spectrum n is Gaussian with mean 0 dB and standard deviation of 2 dB. The distribution of the clean log-spectrum x is Gaussian with mean 25 dB and
standard deviations of 25, 10, and 5 dB, respectively (the x-axis is expressed in dB). The first
distribution is bimodal, whereas the other two are approximately Gaussian. Curves are plotted
using Monte Carlo simulation.
Figure 10.4 shows the distribution of y for two values of µ x , given the same values for
the noise distribution, µ n = 0 dB and σ n = 2 dB , and a more realistic value for σ x = 5dB .
We see that the distribution is always unimodal, though not necessarily symmetric, particularly for low SNR ( µ x − µ n ).
Figure 10.4 Distributions of the corrupted log-spectra y of Eq. (10.22) using simulated data.
The distribution of the noise log-spectrum n is Gaussian with mean 0 dB and standard deviation of 2 dB. The distribution of the clean log-spectrum is Gaussian with standard deviation of
5 dB and means of 10 and 5 dB, respectively. The first distribution is approximately Gaussian
while the second is nonsymmetric. Curves are plotted using Monte Carlo simulation.
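Curves like those in Figures 10.3 and 10.4 can be reproduced with a short Monte Carlo simulation; a sketch follows, where the dB values are treated as 10 log10 of power (one reasonable unit convention) and the number of draws is arbitrary.

import numpy as np

def simulate_noisy_logspectrum(mu_x, sigma_x, mu_n=0.0, sigma_n=2.0, n_draws=100000, seed=0):
    """Draw clean and noise log-energies (in dB) and combine them with Eq. (10.22)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_x, sigma_x, n_draws)
    n = rng.normal(mu_n, sigma_n, n_draws)
    scale = np.log(10.0) / 10.0              # dB -> natural log of power
    return x + np.log1p(np.exp(scale * (n - x))) / scale

# Histogram of the bimodal case of Figure 10.3 (mu_x = 25 dB, sigma_x = 25 dB).
hist, edges = np.histogram(simulate_noisy_logspectrum(25.0, 25.0), bins=100, density=True)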
The distributions used in an HMM are mixtures of Gaussians so that, even if each
Gaussian component is transformed into a non-Gaussian distribution, the composite distribution can be modeled adequately by another mixture of Gaussians. In fact, if you retrain the
model using the standard Gaussian assumption on corrupted speech, you can get good results, so this approximation is not bad.
10.2. ACOUSTICAL TRANSDUCERS
Acoustical transducers are devices that convert the acoustic energy of sound into electrical
energy (microphones) and vice versa (loudspeakers). In the case of a microphone this transduction is generally realized with a diaphragm, whose movement in response to sound pressure varies the parameters of an electrical system (a variable-resistance conductor, a condenser, etc.), producing a variable voltage that constitutes the microphone output. We focus
on microphones because they play an important role in designing speech recognition systems.
There are near field or close-talking microphones, and far field microphones. Close-talking microphones, either head-mounted or telephone handsets, pick up much less background noise, though they are more sensitive to throat clearing, lip smacks, and breath noise.
Placement of such a microphone is often very critical, since, if it is right in front of the
mouth, it can produce pops in the signal with plosives such as /p/. Far field microphones can
be lapel mounted or desktop mounted and pick up more background noise than near field
microphones. Having a small but variable distance to the microphone could be worse than a
larger but more consistent distance, because the corresponding HMM may have lower variability.
When used in speech recognition systems, the most important measurement is the signal-to-noise ratio (SNR), since the lower the SNR the higher the error rate. In addition, different microphones have different transfer functions, and even the same microphone offers
different transfer functions depending on the distance between mouth and microphone.
Varying noise and channel conditions are a challenge that speech recognition systems have
to address, and in this chapter we present some techniques to combat them.
The most popular type of microphone is the condenser microphone. We shall study in
detail its directionality patterns, frequency response, and electrical characteristics.
10.2.1. The Condenser Microphone
A condenser microphone has a capacitor consisting of a pair of metal plates separated by an
insulating material called a dielectric (see Figure 10.5). Its capacitance C is given by
C = \varepsilon_0 \pi b^2 / h    (10.23)
where ε 0 is a constant, b is the width of the plate, and h is the separation between the plates.
If we polarize the capacitor with a voltage Vcc , it acquires a charge Q given by
Q = CVcc
(10.24)
One of the plates is free to move in response to changes in sound pressure, which results in a change in the plate separation ∆h, thereby changing the capacitance and producing
a change in voltage ∆V = ∆hVcc / h . Thus, the sensitivity4 of the microphone depends on the
polarizing voltage Vcc , which is why this voltage can often be 100 V.
Figure 10.5 A diagram of a condenser microphone.
Electret microphones are a type of condenser microphones that do not require a special polarizing voltage Vcc , because a charge is impressed on either the diaphragm or the
back plate during manufacturing and it remains for the life of the microphone. Electret microphones are light and, because of their small size, they offer good responses at high
frequencies.
From the electrical point of view, a microphone is equivalent to a voltage source v(t)
with an impedance ZM, as shown in Figure 10.6. The microphone is connected to a preamplifier which has an equivalent impedance RL.
Figure 10.6 Electrical equivalent of a microphone.
From Figure 10.6 we can see that the voltage on RL is
v_R(t) = v(t) \frac{R_L}{R_M + R_L}    (10.25)
Maximization of vR (t ) in Eq. (10.25) results in RL = ∞ , or in practice RL >> RM ,
which is called bridging. Thus, for highest sensitivity the impedance of the amplifier has to
be at least 10 times higher than that of the microphone. If the microphone is connected to an
amplifier with lower impedance, there is a load loss of signal level. Most low-impedance
microphones are labeled as 150 ohms, though the actual values may vary between 100 and 300 ohms. Medium impedance is 600 ohms and high impedance is 600–10,000 ohms. In practice, the microphone impedance is a function of frequency. Signal power is measured in dBm, where the 0-dB reference corresponds to 1 mW dissipated in a 600-ohm resistor. Thus, 0 dBm is equivalent to 0.775 V.

4 The sensitivity of a microphone measures the open-circuit voltage of the electric signal the microphone delivers for a sound wave at a given sound pressure level, often 94 dB SPL, when there is no load or a high impedance. This voltage is measured in dBV, where the 0-dB reference is 1 V rms.
Since the output impedance of a condenser microphone is very high (~ 1 Mohm), a
JFET transistor must be coupled to lower the equivalent impedance. Such a transistor needs
to be powered with DC voltage through a different wire, as in Figure 10.7. A standard sound
card has a jack with the audio on the tip, ground on the sleeve, DC bias VDD on the ring, and
a medium impedance. When using phantom power, the VCC bias is provided directly in the
audio signal, which must be balanced to ground.
Figure 10.7 Equivalent circuit for a condenser microphone with DC bias on a separate wire.
It is important to understand how noise affects the signal of a microphone. If thermal
noise arises in the resistor RL, it will have a power
PN = 4kTB
(10.26)
where k = 1.38 × 10⁻²³ J/K is Boltzmann's constant, T is the absolute temperature in kelvins, and B is the bandwidth in Hz. The thermal noise in Eq. (10.26) at room temperature (T = 297 K) and for a bandwidth of 4 kHz is equivalent to –132 dBm. In practice, the noise is significantly
higher than this because of preamplifier noise, radio-frequency noise and electromagnetic
interference (poor grounding connections). It is, thus, important to keep the signal path between the microphone and the preamp as short as possible to avoid extra noise. It is desirable to have a microphone with low impedance to decrease the effect of noise due to radiofrequency interference, and to decrease the signal loss if long cables are used. Most microphones specify their SNR and range where they are linear (dynamic range). For condenser
microphones, a power supply is necessary (DC bias required). Microphones with balanced
output (the signal appears across two inner wires not connected to ground, with the shield of
the cable connected to ground) are more resistant to radio frequency interference.
10.2.2. Directionality Patterns
A microphone's directionality pattern measures its sensitivity to a particular direction. Microphones may also be classified by their directional properties as omnidirectional (or nondirectional) and directional, the latter subdivided into bidirectional and unidirectional, based upon their response characteristics.
10.2.2.1. Omnidirectional Microphones
By definition, the response of an omnidirectional microphone is independent of the direction
from which the encroaching sound wave is coming. Figure 10.8 shows the polar response of
an omnidirectional mike. A microphone’s polar response, or pickup pattern, graphs its output voltage for an input sound source with constant level at various angles around the mic.
Typically, a polar response assumes a preferred direction, called the major axis or front of
the microphone, which corresponds to the direction at which the microphone is most sensitive. The front of the mike is labeled as zero degrees on the polar plot, but since an omnidirectional mic has no particular direction at which it is the most sensitive, the omnidirectional
mike has no true front and hence the zero-degree axis is arbitrary. Sounds coming from any
direction around the microphone are picked up equally. Omnidirectional microphones provide no noise cancellation.
Figure 10.8 (a) Polar response of an ideal omnidirectional microphone and (b) its cross section.
Figure 10.8 shows the mechanics of the ideal5 omnidirectional condenser microphone.
A sound wave creates a pressure all around the microphone. The pressure enters the opening
of the mic and the diaphragm moves. An electrical circuit converts the diaphragm movement
into an electrical voltage, or response. Sound waves impinging on the mic create a pressure
at the opening regardless of the direction from which they are coming; therefore we have a
nondirectional, or omnidirectional, microphone. As we have seen in Chapter 2, if the source
signal is Be jω t , the signal at a distance r is given by ( A / r )e jω t independently of the angle.
This is the most inexpensive of the condenser microphones, and it has the advantage
of a flat frequency response that doesn’t change with the angle or distance to the microphone. On the other hand, because of its uniform polar pattern, it picks up not only the desired signal but also noise from any direction. For example, if a pair of speakers is monitoring the microphone output, the sound from the speakers can reenter the microphone and
create an undesirable sound called feedback.
5 Ideal omnidirectional microphones do not exist.
10.2.2.2. Bidirectional Microphones
The bidirectional microphone is a noise-canceling microphone; it responds less to sounds
incident from the sides. The bidirectional mic utilizes the properties of a gradient microphone to achieve its noise-canceling polar response. You can see how this is accomplished
by looking at the diagram of a simplified gradient bidirectional condenser microphone, as
shown in Figure 10.9. A sound impinging upon the front of the microphone creates a pressure at the front opening. A short time later, this same sound pressure enters the back of the
microphone. The sound pressure never arrives at the front and back at the same time. This
creates a displacement of the diaphragm and, just as with the omnidirectional mic, a corresponding electrical signal. For sounds impinging from the side, however, the pressure from
an incident sound wave at the front opening is identical to the pressure at the back. Since
both openings lead to one side of the diaphragm, there is no displacement of the diaphragm,
and the sound is not reproduced.
Figure 10.9 Cross section of an ideal bidirectional microphone.
Figure 10.10 Approximation to the noise-canceling microphone of Figure 10.9.
To compute the polar response of this gradient microphone let’s make the approximation of Figure 10.10, where the microphone signal is the difference between the signal at the
front and rear of the diaphragm, the separation between plates is 2d, and r is the distance
between the source and the center of the microphone.
You can see that r1 , the distance between the source and the front of the diaphragm, is
the norm of the vector specifying the source location minus the vector specifying the location of the front of the diaphragm
r_1 = \left| r e^{j\theta} - d \right|    (10.27)
Similarly, you obtain the distance between the source and the rear of the diaphragm
r_2 = \left| r e^{j\theta} + d \right|    (10.28)
The source arrives at the front of the diaphragm with a delay δ1 = r1 / c , where c is the
speed of sound in air. Similarly, the delay to the rear of the diaphragm is δ 2 = r2 / c . If the
source is a complex exponential e jω t , the difference signal between the front and rear is
given by
x(t) = \frac{A}{r_1} e^{j 2\pi f (t - \delta_1)} - \frac{A}{r_2} e^{j 2\pi f (t - \delta_2)} = \frac{A}{r} e^{j 2\pi f t} G(f, \theta)    (10.29)
where A is a constant and, using Eqs. (10.27), (10.28) and (10.29), the gain G ( f ,θ ) is given
by
G(f, \theta) = \frac{e^{-j 2\pi |e^{j\theta} - \lambda| \tau f}}{|e^{j\theta} - \lambda|} - \frac{e^{-j 2\pi |e^{j\theta} + \lambda| \tau f}}{|e^{j\theta} + \lambda|}    (10.30)
where we have defined λ = d / r and τ = r / c .
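The polar patterns that follow can be computed directly from this expression; a sketch evaluating |G(f, θ)| on a grid of angles is shown below. Setting the extra rear-path delay tau0 to zero gives the bidirectional pattern of Eq. (10.30); a nonzero tau0 gives the unidirectional variant discussed in Section 10.2.2.3. The geometry values match those quoted in Figure 10.11.

import numpy as np

def gradient_mic_gain(theta, f, d=0.01, r=0.5, c=330.0, tau0=0.0):
    """|G(f, theta)| for the two-port gradient microphone of Figure 10.10 (meters, seconds, Hz)."""
    lam = d / r                            # lambda = d / r
    tau = r / c                            # tau = r / c
    e = np.exp(1j * theta)
    front = np.abs(e - lam)                # |e^{j theta} - lambda|
    rear = np.abs(e + lam)                 # |e^{j theta} + lambda|
    g = (np.exp(-2j * np.pi * front * tau * f) / front
         - np.exp(-2j * np.pi * (tau0 + rear * tau) * f) / rear)
    return np.abs(g)

angles = np.linspace(0.0, 2.0 * np.pi, 360)
pattern = gradient_mic_gain(angles, f=1000.0)   # figure-eight pattern, cf. Figure 10.11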
Figure 10.11 Polar response of a bidirectional microphone obtained through (10.30) with d = 1
cm, r = 50 cm, c = 33,000 cm/s, and f = 1000 Hz.
The magnitude of Eq. (10.30) is used to plot the polar response of Figure 10.11. As
can be seen by the plot, the pattern resembles a figure eight. The bidirectional mic has an
interchangeable front and back, since the response is a maximum in two opposite directions.
In practice, this bidirectional microphone is an ideal case, and the polar response has to be
measured empirically.
According to the idealized model, the frequency response of omnidirectional microphones is constant with frequency, and this approximately holds in practice for real omnidirectional microphones. On the other hand, the polar pattern of directional microphones is not
constant with frequency. Clearly it is a function of frequency, as can be seen in Eq. (10.30).
In fact, the frequency response of a bidirectional microphone at 0° is shown in Figure 10.12
for both near field and far field conditions.
Figure 10.12 Frequency response of a bidirectional microphone with d = 1 cm at 0°. The larger the distance between plates, the lower the frequency. The highest values are obtained for
8250 Hz and 24,750 Hz and the null for 16,500 Hz. The solid line corresponds to far field conditions ( λ = 0.02 ) and the dotted line to near field conditions ( λ = 0.5 ).
It can be shown, after taking the derivative of G(f, 0) in Eq. (10.30) and equating it to zero, that the maxima are given by

f_n = \frac{(2n - 1)\, c}{4 d}    (10.31)
with n = 1, 2, \ldots. We can observe from Eq. (10.31) that the larger the port spacing, the lower the first maximum.
The increase in frequency response, or sensitivity, in the near field, compared to the
far field, is a measure of noise cancellation. Consequently the microphone is said to be noise
canceling. The microphone is also referred to as a differential or gradient microphone, since
it measures the gradient (difference) in sound pressure between two points in space. The
boost in low-frequency response in the near field is also referred to as the proximity effect,
often used by singers to boost their bass levels by getting the microphone closer to their
mouths.
By evaluating Eq. (10.30) it can be seen that low-frequency sounds in a bidirectional
microphone are not reproduced as well as higher frequencies, leading to a thin sounding
mic. The resistive material of a unidirectional microphone reduces the high-frequency response and makes the microphone reproduce low and high frequencies more equally than
the bidirectional microphone.
Let’s interpret Figure 10.12. The net sound pressure between these two points, separated by a distance D = 2d, is influenced by two factors: phase shift and inverse square law.
The influence of the sound-wave phase shift is less at low frequencies than at high,
because the distance D between the front and rear port entries becomes a small fraction of
the low-frequency wavelength. Therefore, there is little phase shift between the ports at low
frequencies, as the opposite sides of the diaphragm receive nearly equal amplitude and
phase. The result is slight diaphragm motion and a weak microphone output signal. At
higher frequencies, the distance D between sound ports becomes a larger fraction of the
wavelength. Therefore, more phase shift exists across the diaphragm. This causes a higher
microphone output.
The pressure difference caused by phase shift rises with frequency at a rate of 20 dB
per decade. As the frequency rises to where the microphone port spacing D equals half a
wavelength, the net pressure is at its maximum. In this situation, the diaphragm movement is
also at its maximum, since the front and rear see equal amplitude but in opposite polarities
of the wave front. This results in a peak in the microphone frequency response, as illustrated
in Figure 10.12. As the frequency continues to rise to where the microphone port spacing D
equals one complete wavelength, the net pressure is at its minimum. Here, the diaphragm
does not move at all, since the front and rear sides see equal amplitude at the same polarity
of the wave front. This results in a dip in the microphone frequency response, as shown in
Figure 10.12.
A second factor creating a net pressure difference across the diaphragm is the impact
of the inverse square law. If the sound-pressure difference between the front and rear ports
of a noise-canceling microphone were measured near the sound source and again further
from the source, the near field measurement would be greater than the far field. In other
words, the microphone's net pressure difference and, therefore, output signal, is greater in
the near sound field than in the far field. The inverse-square-law effect is independent of
frequency. The net pressure that causes the diaphragm to move is a combination of both the
phase shift and inverse-square-law effect. These two factors influence the frequency response of the microphone differently, depending on the distance to the sound source. For
distant sound, the influence of the net pressure difference from the inverse-square-law effect
is weaker than the phase-shift effect; thus, the rising 20-dB-per-decade frequency response
dominates the total frequency response. As the microphone is moved closer to the sound
source, the influence of the net pressure difference from the inverse square law is greater
than that of the phase shift; thus the total microphone frequency response is largely flat.
The difference in near field to far field frequency response is a characteristic of all
noise-canceling microphones and applies equally to both acoustic and electronic types.
10.2.2.3. Unidirectional Microphones
Unidirectional microphones are designed to pick up the speaker's voice by directing the
audio reception toward the speaker, focusing on the desired input and rejecting sounds emanating from other directions that can negatively impact clear communications, such as computer noise from fans or other sounds.
Figure 10.13 Cross section of a unidirectional microphone.
Figure 10.13 shows the cross-section of a unidirectional microphone, which also relies
upon the principles of a gradient microphone. Notice that the unidirectional mic looks similar to the bidirectional, except that there is a resistive material (often cloth or foam) between
the diaphragm and the opening of one end. The material's resistive properties slow down the
pressure on its path from the back opening to the diaphragm, thus optimizing it so that a
sound pressure impinging on the back of the microphone takes equally long to reach the rear
of the diaphragm as to reach the front of the diaphragm. If the additional delay through the
back plate is given by τ 0 , the gain can be given by
G(f, \theta) = \frac{e^{-j 2\pi |e^{j\theta} - \lambda| \tau f}}{|e^{j\theta} - \lambda|} - \frac{e^{-j 2\pi \left( \tau_0 + |e^{j\theta} + \lambda| \tau \right) f}}{|e^{j\theta} + \lambda|}    (10.32)
which was obtained by modifying Eq. (10.30). Unidirectional microphones have the greatest
response to sound waves impinging from one direction, typically referred to as the front, or
major axis of the microphone. One typical response of a unidirectional microphone is the
cardioid pattern shown in the polar plot of Figure 10.14, plotted from Eq. (10.32). The frequency response at 0° is similar to that of Figure 10.12. Because the cardioid pattern of polar
response is so popular among them, unidirectional mics are often referred to as cardioid
mics.
Figure 10.14 Polar response (a) of a unidirectional microphone and its cross section (b). The
polar response was obtained through Eq. (10.32) with d = 1 cm, r = 50 cm, c = 33,000 cm/s, f
= 1 kHz, and τ 0 = 0.06 ms.
Equation (10.32) was derived under a simplified schematic based on Figure 10.10,
which is an idealized model so that, in practice, the polar response of a real microphone has
to be measured empirically. The frequency response and polar pattern of a commercial microphone are shown in Figure 10.15.
Figure 10.15 Characteristics of an AKG C1000S cardioid microphone: (a) frequency response
for near and far field conditions (note the proximity effect) and (b) polar pattern for different
frequencies.
Although this noise cancellation decreases the overall response to sound pressure (sensitivity) of the microphone, the directional and frequency-response improvements far outweigh the lessened sensitivity. It is particularly well suited for use as a desktop mic or as
part of an embedded microphone in a laptop or desktop computer. Unidirectional microphones achieve superior noise-rejection performance over omnidirectionals. Such performance is necessary for clean audio input and for audio signal processing algorithms such as
acoustic echo cancellation, which form the core of speakerphone applications.
10.2.3. Other Transduction Categories
In a passive microphone, sound energy is directly converted to electrical energy, whereas an
active microphone requires an external energy source that is modulated by the sound wave.
Active transducers thus require phantom power, but can have higher sensitivity.
We can also classify microphones according to the physical property to which the
sound wave responds. A pressure microphone has an electrical response that corresponds to
the pressure in a sound wave, while a pressure gradient microphone has a response corresponding to the difference in pressure across some distance in a sound wave. A pressure
microphone is a fine reproducer of sound, but a gradient microphone typically has a response greatest in the direction of a desired signal or talker and rejects undesired background sounds. This is particularly beneficial in applications that rely upon the reproduction
of only a desired signal, where any undesired signal entering the reproduction severely degrades performance. Such is the case in voice recognition or speakerphone applications.
In terms of the mechanism by which they create an electrical signal corresponding to
the sound wave they detect, microphones are classified as electromagnetic, electrostatic, and
piezoelectric. Dynamic microphones are the most popular type of electromagnetic microphone and condenser microphones the most popular type of electrostatic microphone.
Electromagnetic microphones induce voltage based on a varying magnetic field. Ribbon microphones are a type of electromagnetic microphones that employ a thin metal ribbon
suspended between the poles of a magnet. Dynamic microphones are electromagnetic microphones that employ a moving coil suspended by a light diaphragm (see Figure 10.16),
acting like a speaker but in reverse. The diaphragm moves with changes in sound pressure,
which in turn moves the coil, which causes current to flow as lines of flux from the magnet
are cut. Dynamic microphones need no batteries or power supply, but they deliver low signal levels that need to be preamplified.
Figure 10.16 Dynamic microphone schematics.
Piezoresistive and piezoelectric microphones are based on the variation of electric resistance of their sensor induced by changes in sound pressure. Carbon button microphones
consist of a small cylinder packed with tiny granules of carbon that, when compacted by
sound pressure, reduce the electric resistance. Such microphones, often used in telephone
handsets, offer a worse frequency response than condenser microphones, and lower dynamic
range.
10.3. ADAPTIVE ECHO CANCELLATION (AEC)
If a spoken language system allows the user to talk while speech is being output through the
loudspeakers, the microphone picks up not only the user’s voice, but also the speech from
the loudspeaker. This problem may be avoided with a half-duplex system that does not listen
when a signal is being played through the loudspeaker, though such systems offer an unnatural user experience. On the other hand, a full-duplex system that allows barge-in by the user
to interrupt the system offers a better user experience. For barge-in to work, the signal
played through the loudspeaker needs to be canceled. This is achieved with echo cancellation (see Figure 10.17), as discussed in this section.
In hands-free conferencing the local user’s voice is output by the remote loudspeaker,
whose signal is captured by the remote microphone and after some delay is output by the
local loudspeaker. People are tolerant to these echoes if either they are greatly attenuated or
the delay is short. Perceptual studies have shown that the longer the delay, the greater the
attenuation needed for user acceptance.
Figure 10.17 Block diagram of an echo-canceling application. x[n] represents the signal from
the loudspeaker, s[n] the speech signal, v[n] the local background noise, and e[n] the signal
that goes to the microphone.
The use of echo cancellation is mandatory in telephone communications and hands-free conferencing when it is desired to have full-duplex voice communication. This is particularly important when the call is routed through a satellite that can have delays larger than
200 ms. A block diagram is shown in Figure 10.18.
In Figure 10.17, the return signal r[n] is the sum
r[n] = d [n] + s[n]
(10.33)
where s[n] is the speech signal and d[n] is the attenuated and possibly distorted version of
the loudspeaker’s signal x[n]. The purpose of the echo canceler is to remove the echo d[n]
from the return signal r[n], which is done by means of an adaptive FIR filter whose coefficients are computed to minimize the energy of the canceled signal e[n]. The filter coefficients are reestimated adaptively to track slowly changing line conditions.
Figure 10.18 Block diagram of echo canceling for a telephone communication. x[n] represents
the remote call signal, s[n] the local outgoing signal. The hybrid circuit H does a 2-4 wire conversion and is nonideal because of impedance mismatches.
This problem is essentially that of adaptive filtering only when s[n] = 0 , or in other
words when the user is silent. For this reason, you have to implement a double-talk detection
module that detects when the speaker is silent. This is typically feasible because the echo
d[n] is usually small, and if the return signal r[n] has high energy it means that the user is
not silent. Errors in double-talk detection result in divergence of the filter, so it is generally
preferable to be conservative in the decision and when in doubt not adapt the filter coefficients. Initialization could be done by sending a known signal with white spectrum.
The quality of the filtering is measured by the so-called echo-return loss enhancement
(ERLE):
\mathrm{ERLE\ (dB)} = 10 \log_{10} \frac{E\{d^2[n]\}}{E\{(d[n] - \hat{d}[n])^2\}}    (10.34)
The filter coefficients are chosen to maximize the ERLE. Since the telephone-line
characteristics, or the acoustic path (due to speaker movement), can change over time, the
filter is often adaptive. Another reason for adaptive filters is that reliable ERLE maximization requires a large number of samples, and such a delay is not tolerable.
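Measured over a block of samples, Eq. (10.34) reduces to a ratio of short-term energies; a minimal sketch:

import numpy as np

def erle_db(d, d_hat):
    """Echo-return loss enhancement, Eq. (10.34), estimated from sample averages."""
    return 10.0 * np.log10(np.mean(d ** 2) / np.mean((d - d_hat) ** 2))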
In the following sections, we describe the fundamentals of adaptive filtering. While
there are some nonlinear adaptive filters, the vast majority are linear FIR filters, with the
LMS algorithm being the most important. We introduce the LMS algorithm, study its convergence properties, and present two extensions: the normalized LMS algorithm and transform-domain LMS algorithms.
10.3.1. The LMS Algorithm
Let’s assume that a desired signal d[n] is generated from an input signal x[n] as follows
d[n] = \sum_{k=0}^{L-1} g_k x[n-k] + u[n] = G^T X[n] + u[n]    (10.35)
with G = \{g_0, g_1, \ldots, g_{L-1}\}, the input signal vector X[n] = \{x[n], x[n-1], \ldots, x[n-L+1]\}, and u[n] being noise that is independent of x[n].
We want to estimate d[n] in terms of the sum of previous samples of x[n]. To do that
we define the estimate signal y[n] as
y[n] = \sum_{k=0}^{L-1} w_k[n] x[n-k] = W^T[n] X[n]    (10.36)
where W[n] = \{w_0[n], w_1[n], \ldots, w_{L-1}[n]\} is the time-dependent coefficient vector. The instantaneous error between the desired and the estimated signal is given by
e[n] = d [n] − WT [n]X[n]
(10.37)
The least mean square (LMS) algorithm updates the value of the coefficient vector in
the steepest descent direction
W[n + 1] = W[n] + ε e[n]X[n]
(10.38)
where ε is the step size. This algorithm is very popular because of its simplicity and effectiveness [58].
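A sample-by-sample sketch of Eqs. (10.36)-(10.38) follows; the filter length and step size are illustrative, and in echo cancellation d[n] would be the return signal r[n] (during periods with no near-end speech) and x[n] the loudspeaker signal.

import numpy as np

def lms(x, d, L=128, eps=0.01):
    """LMS adaptive filter, Eqs. (10.36)-(10.38); returns the error signal and final weights."""
    W = np.zeros(L)
    X = np.zeros(L)                         # X[n] = (x[n], x[n-1], ..., x[n-L+1])
    e = np.zeros(len(x))
    for n in range(len(x)):
        X = np.roll(X, 1)
        X[0] = x[n]
        e[n] = d[n] - W @ X                 # Eqs. (10.36) and (10.37)
        W += eps * e[n] * X                 # Eq. (10.38)
    return e, W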
10.3.2. Convergence Properties of the LMS Algorithm
The choice of ε is important: if it is too small, the adaptation rate will be slow and it might
not even track the nonstationary trends of x[n] , whereas if ε is too large, the error might
actually increase. We analyze the conditions under which the LMS algorithm converges.
Let’s define the error in the coefficient vector V[n] as
V[n] = G − W[n]
(10.39)
and combine Eqs. (10.37), (10.38), and (10.39) to obtain
V[n + 1] = V[n] − ε X[n]XT [n]V[n] − ε u[n]X[n]
(10.40)
Taking expectations in Eq. (10.40) results in
E\{V[n+1]\} = E\{V[n]\} - \varepsilon E\{X[n] X^T[n] V[n]\}
(10.41)
where we have assumed that u[n] and x[n] are independent and that either is a zero-mean
process. Finally, we express the autocorrelation of X[n] as
R xx = E{X[n]XT [n]} = QΛQT
(10.42)
where Q is a matrix of its eigenvectors and Λ is a diagonal matrix of its eigenvalues
\{\lambda_0, \lambda_1, \ldots, \lambda_{L-1}\}, which are all real valued because of the symmetry of R_{xx}.
Although we know that X[n] and V[n] are not statistically independent, we assume
in this section that they are, so that we can obtain some insight on the convergence properties. With this assumption, Eq. (10.41) can be expressed as
E{V[n + 1]} = E{V[n]}(1 − ε R xx )
(10.44)
which, applied recursively, leads to
E\{V[n+1]\} = E\{V[0]\}\,(1 - \varepsilon R_{xx})^n    (10.45)
Using Eqs. (10.39) and (10.42) in (10.45), we can express the (i + 1)th element of E{W[n]} as

E\{w_i[n]\} = g_i + \sum_{j=0}^{L-1} q_{ij} (1 - \varepsilon \lambda_j)^n E\{\tilde{v}_j[0]\}    (10.46)

where q_{ij} is the (i + 1, j + 1)th element of the eigenvector matrix Q, and \tilde{v}_j[n] is the (j + 1)th element of the rotated coefficient error vector defined as

\tilde{V}[n] = Q^T V[n]    (10.47)
From Eq. (10.46) we see that the mean value of the LMS filter coefficients converges
exponentially to the true value if
0 < ε < 1/ λ j
(10.48)
so that the adaptation constant ε must be determined from the largest eigenvalue of the autocorrelation matrix of X[n]
for the mean LMS algorithm to converge.
In practice, mean convergence doesn’t tell us the nature of the fluctuations that the coefficients experience. Analysis of the variance of V[n] together with some more
approximations result in mean-squared convergence if
0 < \varepsilon < \frac{K}{L \sigma_x^2}    (10.49)
with σ x2 = E{x 2 [n]} being the input signal power and K a constant that depends weakly on
the nature of the input signal statistics but not on its power.
Because of the inaccuracies of the independence assumptions above, a rule of thumb
used in practice to determine the adaptation constant ε is
0 < \varepsilon < \frac{0.1}{L \sigma_x^2}    (10.50)
The choice of largest value for ε in Eq. (10.49) makes the LMS algorithm track nonstationary variations in x fastest, and achieve faster convergence. On the other hand, the
misadjustment of the filter coefficients increases as both the filter length L and adaptation
constant ε increase. For this reason, often the adaptation constant can be made a function of
n ( ε [n] ), with larger values at first and smaller values once convergence has been determined.
10.3.3. Normalized LMS Algorithm
The normalized LMS algorithm (NLMS) uses the result of Eq. (10.49) and, therefore, defines a normalized step size
\varepsilon[n] = \frac{\varepsilon}{\delta + L \hat{\sigma}_x^2[n]}    (10.51)
where the constant \delta avoids a division by 0 and \hat{\sigma}_x^2[n] is an estimate of the input signal power, which is typically done with an exponential window

\hat{\sigma}_x^2[n] = (1 - \beta)\,\hat{\sigma}_x^2[n-1] + \beta\, x^2[n]    (10.52)
or a sliding rectangular window
\hat{\sigma}_x^2[n] = \frac{1}{N} \sum_{i=0}^{N-1} x^2[n-i] = \hat{\sigma}_x^2[n-1] + \frac{1}{N}\left( x^2[n] - x^2[n-N] \right)    (10.53)
where both β and N control the effective memory of the estimators in Eqs. (10.52) and
(10.53), respectively. Finally, we need to pick ε so that 0 < ε < 2 to assure convergence.
Choice of the NLMS algorithm simplifies the selection of ε , and the NLMS often converges faster than the LMS algorithm in practical situations.
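Relative to the LMS sketch of Section 10.3.1, only the step size changes; a minimal NLMS variant using the exponential power estimate of Eq. (10.52), with illustrative values for eps, beta, and delta:

import numpy as np

def nlms(x, d, L=128, eps=0.5, beta=0.01, delta=1e-6):
    """NLMS adaptive filter: LMS with the normalized step size of Eq. (10.51)."""
    W = np.zeros(L)
    X = np.zeros(L)
    e = np.zeros(len(x))
    sigma2 = 0.0
    for n in range(len(x)):
        X = np.roll(X, 1)
        X[0] = x[n]
        sigma2 = (1.0 - beta) * sigma2 + beta * x[n] ** 2   # Eq. (10.52)
        e[n] = d[n] - W @ X
        W += (eps / (delta + L * sigma2)) * e[n] * X        # Eqs. (10.38) and (10.51)
    return e, W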
10.3.4. Transform-Domain LMS Algorithm
As discussed in Section 10.3.2, convergence of the LMS algorithm is determined by the
largest eigenvalue of the input. Since complex exponentials are approximate eigenvectors
for LTI systems, the LMS algorithm’s convergence is dominated by the frequency band with
largest energy, and convergence in other frequency bands is generally much slower. This is
the rationale for the subband LMS algorithm, which performs independent LMS algorithms
for different frequency bands, as proposed by Boll [14].
The block LMS (BLMS) algorithm keeps the coefficients unchanged for a block k of L
samples
W[k+1] = W[k] + \varepsilon \sum_{m=0}^{L-1} e[kL+m]\, X[kL+m]    (10.54)
which is represented by a linear convolution and therefore can be implemented efficiently
using length-2N FFTs according to the overlap-save method of Figure 10.19. Notice that implementing a linear convolution with a circular convolution operator such as the FFT requires the use of the dashed box.
Figure 10.19 Block diagram of the constrained frequency-domain block LMS algorithm. The
unconstrained version of this algorithm eliminates the computation inside the dashed box.
An unconstrained frequency-domain LMS algorithm can be implemented by removing
the constraint in Figure 10.19, therefore implementing a circular instead of a linear convolution. While this is not exact, the algorithm requires only three FFTs instead of five. In some
practical applications, there is no difference in convergence between the constrained and
unconstrained cases.
10.3.5. The RLS Algorithm
The search for the optimum filter can be accelerated when the gradient vector is properly
deviated toward the minimum. This approach uses the Newton-Raphson method to iteratively compute the root of f(x) (see Figure 10.20) so that the value at iteration i + 1 is given
by
x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}    (10.55)
Figure 10.20 Newton-Raphson method to compute the roots of a function.
To minimize function f(x) we thus compute the roots of f’(x) through the above
method:
x_{i+1} = x_i - \frac{f'(x_i)}{f''(x_i)}    (10.56)
In the case of a vector, Eq. (10.56) is transformed into
w_{i+1} = w_i - \varepsilon[n] \left( \nabla^2 e(w_i) \right)^{-1} \nabla e(w_i)    (10.57)
where we add a step size ε [n] , and where ∇ 2 e(w i ) is the Hessian of the least-squares function which, for Eq. (10.37), equals the autocorrelation of x:
∇ 2 e(w i ) = R[n] = E{x[n]xT [n]}
(10.58)
The recursive least squares (RLS) algorithm specifies a method of estimating Eq.
(10.58) using an exponential window:
R[n] = λ R[n − 1] + x[n]xT [n]
(10.59)
While the RLS algorithm converges faster than the LMS algorithm, it also is more
computationally expensive, as it requires a matrix inversion for every sample. Several algorithms have been derived to speed it up [54].
10.4. MULTIMICROPHONE SPEECH ENHANCEMENT
The use of more than one microphone is motivated by the human auditory system, in which
the use of both ears has been shown to enhance detection of the direction of arrival, as well
as increase SNR when one ear is covered. The methods the human auditory system uses to
accomplish this task are still not completely known, and the techniques described in this
section do not mimic that behavior.
Microphone arrays use multiple microphones and knowledge of the microphone locations to predict delays and thus create a beam that focuses on the direction of the desired
speaker and rejects signals coming from other angles. Reverberation, as discussed in Section
10.1.2, can be combated with these techniques. Blind source separation techniques are another family of statistical techniques that typically do not use spatial constraints, but rather
statistical independence between different sources.
While in this section we describe only linear processing, i.e., the output speech is a
linearly filtered version of the microphone signals, we could also combine these techniques
with the nonlinear methods of Section 10.5.
10.4.1. Microphone Arrays
The goals of microphone arrays are twofold: finding the position of a sound source in a
room, and improving the SNR of the received signal. Steering is helpful in videoconferencing, where a camera has to follow the current speaker. Since the speaker is typically far
away from the microphone, the received signal likely contains a fair amount of additive
noise. Microphone arrays can also be used to increase the SNR.
Let x[n] be the signal at the source S. Microphone i picks up a signal

y_i[n] = x[n] * g_i[n] + v_i[n]    (10.60)

that is a filtered version of the source plus additive noise v_i[n]. If we have N such microphones, we can attempt to recover x[n] because all the signals y_i[n] should be correlated.
A typical assumption made is that all the filters gi [n] are delayed versions of the
same filter g[n]
gi [n] = g[n − Di ]
(10.61)
with the delay Di = di / c , d i being the distance between the source S and microphone i, and
c the speed of sound in air. We cannot recover signal x[n] without knowledge of g[n] or the
signal itself, so the goal is to obtain the filtered signal y[n]
y[n] = x[n]* g[n]
(10.62)
so that, combining Eqs. (10.60), (10.61), and (10.62),
yi [n] = y[n − Di ] + vi [n]
(10.63)
Assuming the v_i[n] are independent and Gaussian, the optimal estimate of y[n] is given by

\hat{y}[n] = \frac{1}{N} \sum_{i=0}^{N-1} y_i[n + D_i] = y[n] + v[n]    (10.64)
which is the so-called delay-and-sum beamformer [24, 29], where the residual noise v[n]
v[n] = \frac{1}{N} \sum_{i=0}^{N-1} v_i[n + D_i]    (10.65)
has a variance that decreases as the number of microphones N increases, since the noises
vi [n + Di ] are uncorrelated.
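A minimal Python sketch of the delay-and-sum beamformer of Eq. (10.64) follows (our own illustration under simplifying assumptions: the delays D_i are known integers in samples, and np.roll is used as a circular stand-in for the time shifts).

import numpy as np

def delay_and_sum(mics, delays):
    """Average the microphone signals after compensating the delays D_i, Eq. (10.64)."""
    out = np.zeros(len(mics[0]))
    for y_i, D_i in zip(mics, delays):
        out += np.roll(y_i, -D_i)          # advance channel i by D_i samples: y_i[n + D_i]
    return out / len(mics)

# Simulated example: one (filtered) source y[n] delayed differently at 5 microphones,
# plus independent noise of variance 0.25 per channel.
rng = np.random.default_rng(1)
y = rng.standard_normal(4000)
delays = [0, 3, 6, 9, 12]                  # D_i in samples
mics = [np.roll(y, D) + 0.5 * rng.standard_normal(len(y)) for D in delays]
enhanced = delay_and_sum(mics, delays)
print("residual noise variance:", round(float(np.var(enhanced - y)), 3),
      "(vs 0.25 per microphone)")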
Equation (10.65) requires estimation of the delays Di . To attenuate the additive noise
v[n], it is not necessary to identify the absolute delays, but rather the delays relative to one
reference microphone (for example, the center microphone). It can be shown that the maximum likelihood solution consists in maximizing the energy of \hat{y}[n] in Eq. (10.64), which is the sum of cross-correlations:

\hat{D}_i = \arg\max_{D_i} \left( \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} R_{ij}[D_i - D_j] \right), \quad 0 \le i < N    (10.66)
This approach assumes that we know nothing about the geometry of the microphone
placement. In fact, given a point source and assuming no reflections, we can compute the
delay based on the distance between the source and the microphone. The use of geometry
allows us to reduce the number of parameters to estimate from (N - 1) to a maximum of 3, in
case we desire to estimate the exact location. This location is often described in spherical
coordinates (ϕ ,θ , ρ ) with ϕ being the direction of arrival, θ the elevation angle, and ρ
the distance to the reference microphone, as shown in Figure 10.21.
Figure 10.21 Spherical coordinates (ϕ ,θ , ρ ) with ϕ being the direction of arrival, θ the
elevation angle, and ρ the distance to the reference microphone.
While 2-D and 3-D microphone configurations can be used, which would allow us to determine not just the steering angle ϕ, but also the distance to the origin ρ and the elevation θ, linear microphone arrays are the most widely used configurations because they are the simplest. In a linear array all the microphones are placed on a line (see Figure 10.22). In this case, we cannot determine the elevation angle θ. To determine both ϕ and ρ we need at least three microphones in the array.
If the microphones are relatively close to each other compared to the distance to the
source, the angle of arrival ϕ is approximately the same for all signals. With this assumption, the normalized delay Di with respect to the reference microphone is given by
Di = −ai sin(ϕ ) / c
(10.67)
where ai is the y-axis coordinate in Figure 10.22 for microphone i, where the reference microphone has a0 = 0 and also D0 = 0 .
Figure 10.22 Linear microphone array (five microphones). The source signal arrives at each
microphone with a different delay, which allows us to find the correct angle of arrival.
With this approximation, we define D_i(ϕ), the relative delay of the signal at microphone i with respect to the reference microphone, as a function of the direction of arrival ϕ and independent of ρ. The optimal direction of arrival \hat{\varphi} is then the one that maximizes the energy of the estimated signal \hat{y}[n] over a set of samples:

\hat{\varphi} = \arg\max_{\varphi} \sum_n \left( \frac{1}{N} \sum_{i=0}^{N-1} y_i[n + D_i(\varphi)] \right)^2 = \arg\max_{\varphi} \sum_n \left( \frac{1}{N} \sum_{i=0}^{N-1} y_i[n - a_i \sin(\varphi)/c] \right)^2    (10.68)
The term beamforming reflects the fact that this array favors a specific direction of arrival ϕ: sources arriving from other directions are not in phase and are therefore attenuated. Since the source can move over time, the maximization of Eq. (10.68) can be done in an adaptive fashion.
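The maximization of Eq. (10.68) can be sketched as a simple search over candidate angles, as in the hypothetical Python example below (our own names; the delays are rounded to integer samples, and the values of c and fs are assumptions), which picks the steering angle whose delay-and-sum output has the largest energy.

import numpy as np

def estimate_doa(mics, positions, fs, c=340.0):
    """Search for the angle that maximizes the energy of the delay-and-sum output, Eq. (10.68)."""
    best_phi, best_energy = 0.0, -np.inf
    for phi in np.deg2rad(np.arange(-90, 91)):
        delays = np.round(-positions * np.sin(phi) / c * fs).astype(int)   # D_i(phi) in samples
        beam = sum(np.roll(y_i, -D_i) for y_i, D_i in zip(mics, delays)) / len(mics)
        energy = float(np.sum(beam ** 2))
        if energy > best_energy:
            best_phi, best_energy = phi, energy
    return best_phi

# Five-element linear array with 5 cm spacing; the source direction (20 degrees)
# is simulated with integer-sample delays plus independent noise.
fs, c = 48000, 340.0
positions = np.array([-0.10, -0.05, 0.0, 0.05, 0.10])     # a_i in meters
true_delays = np.round(-positions * np.sin(np.deg2rad(20.0)) / c * fs).astype(int)
rng = np.random.default_rng(2)
s = rng.standard_normal(16000)
mics = [np.roll(s, D) + 0.3 * rng.standard_normal(len(s)) for D in true_delays]
print("estimated direction of arrival:",
      round(float(np.degrees(estimate_doa(mics, positions, fs))), 1), "deg")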
As the beam is steered away from the broadside, the system exhibits a reduction in
spatial discrimination because the beam pattern broadens. Furthermore, beamwidth varies
with frequency, so an array has an approximate bandwidth given by the upper f u and lower
f l frequencies
f_u = \frac{c}{d \max_{\varphi, \varphi'} |\cos\varphi - \cos\varphi'|}, \qquad f_l = \frac{f_u}{N}    (10.69)
with d being the sensor spacing, ϕ ′ the steering angle measured with respect to the axis of
the array, and ϕ the direction of the source. For a desired range of ±30o and five sensors
spaced 5 cm apart, the range is approximately 880 to 4400 Hz. We see in Figure 10.23 that
at very low frequencies the response is essentially omnidirectional, since the microphone
spacing is small compared to the large wavelength. At high frequencies more lobes start
appearing, and the array steers toward not only the preferred direction but others as well.
For speech signals, the upshot is that we either need a lot of microphones to provide a directional polar pattern at low frequencies, or we need them to be spread far enough apart, or
both.
Figure 10.23 Polar pattern of a microphone array with steering angle of ϕ ′ = 0 , five microphones spaced 5 cm apart for 400, 880, 4400, and 8000 Hz from left to right, respectively, for
a source located at 5 m.
The polar pattern in Figure 10.23 was computed as follows:

P(f, r, \varphi) = \sum_{i=1}^{N} \frac{e^{-j 2\pi f \left[ a_i \sin\varphi' + |r e^{j\varphi} - j a_i| \right] / c}}{|r e^{j\varphi} - j a_i|}    (10.70)
though the sensors could be spaced nonuniformly, as in Figure 10.24, allowing for better
behavior across the frequency spectrum.
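As a rough illustration of how Figure 10.23 can be generated, the sketch below evaluates Eq. (10.70) numerically for a five-element array with 5-cm spacing and a source at 5 m (our own code; we take the denominator to be the source-to-sensor distance |re^{jϕ} - ja_i| and assume c ≈ 340 m/s, so the numbers are only indicative).

import numpy as np

def array_response(f, r, phis, positions, steer_angle=0.0, c=340.0):
    """Evaluate |P(f, r, phi)| of Eq. (10.70) for a linear array with sensor y-coordinates a_i."""
    gains = []
    for phi in phis:
        src = r * np.exp(1j * phi)                      # source location in the plane
        total = 0.0 + 0.0j
        for a_i in positions:
            dist = abs(src - 1j * a_i)                  # |r e^{j phi} - j a_i|
            phase = -2j * np.pi * f * (a_i * np.sin(steer_angle) + dist) / c
            total += np.exp(phase) / dist               # spherical spreading plus steering phase
        gains.append(abs(total))
    return np.array(gains)

# Five microphones spaced 5 cm apart, source at 5 m, broadside steering (phi' = 0).
positions = np.array([-0.10, -0.05, 0.0, 0.05, 0.10])
phis = np.deg2rad(np.arange(0, 360, 5))
for f in (400.0, 880.0, 4400.0):
    g = array_response(f, 5.0, phis, positions)
    print(int(f), "Hz  gain at 0 deg:", round(float(g[0]), 2),
          "  gain at 60 deg:", round(float(g[12]), 2))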
Figure 10.24 Nonuniform linear microphone array containing three subarrays for the high,
mid, and low frequencies.
Once a microphone array has been steered toward a direction ϕ′, it attenuates noise sources coming from other directions. The beamwidth depends not only on the frequency of the signal, but also on the steering direction. If the beam is steered toward a direction ϕ′, then the direction of the source for which the beam response falls to half its power has been found empirically to be
\varphi_{3\text{dB}}(f) = \cos^{-1}\left\{ \cos\varphi' \pm \frac{K}{N d f} \right\}    (10.71)
with K being a constant. Equation (10.71) shows that the smaller the array, the wider the
beam, and that lower frequencies yield wider beams also. Figure 10.25 shows that the bandwidth of the array when steering toward a 30° direction is lower than when steering at 0°.
Figure 10.25 Polar pattern of a microphone array with steering angle of ϕ ′ = 30o , five microphones spaced 5 cm apart for 400, 880, 3000, and 4400 Hz from left to right, respectively, for
a source located at 5 m.
Microphone arrays have been shown to improve recognition accuracy when the microphones and the speaker are far apart [51]. Several companies are commercializing microphone arrays for teleconferencing or speech recognition applications.
Only in anechoic chambers does the assumption in Eq. (10.61) hold, since in practice
many reflections take place, which are also different for different microphones. In addition,
the assumption of a common direction of arrival for all microphones may not hold either.
For this case of reverberant environments, single beamformers typically fail. While computing the direction of arrival is much more difficult in this case, the SNR can still be improved.
Let’s define the desired signal d[n] as that obtained in the reference microphone. We
can estimate the vector H[n] = {h11 ,L , h1L , h21 ,L , h2 L ,L , h( N −1)1 ,L , h( N −1) L } for the (N-1) Ltap filters that minimizes the error array [25]
e[n] = d [n] − H[n]Y[n]
(10.72)
where the (N – 1) microphone signals are represented in the vector
Y[n] = { y1[n],L , y1[n − L − 1], y2 [n],L , y2 [n − L − 1],L , y N −1[n],L , yN −1[n − L − 1]}
The filter coefficients G[n] can be estimated through the adaptive filtering techniques described in Section 10.3. The clean signal is then estimated as
\hat{x}[n] = \frac{1}{2} \left( d[n] + H[n] Y[n] \right)    (10.73)
This last method does not assume anything about the geometry of the microphone array.
10.4.2. Blind Source Separation
The problem of separating the desired speech from interfering sources, the cocktail party
effect [15], has been one of the holy grails in signal processing. Blind source separation
(BSS) is a set of techniques that assume no information about the mixing process or the
sources, apart from their mutual statistical independence, hence is termed blind. Independent
component analysis (ICA), developed in the last few years [19, 38], is a set of techniques to
solve the BSS problem that estimate a set of linear filters to separate the mixed signals under
the assumption that the original sources are statistically independent.
Let’s first consider instantaneous mixing. Let’s assume that R microphone signals
yi [n] , denoted by y[n] = ( y1[n], y2 [n],L , yR [n]) , are obtained by a linear combination of R
unobserved source signals xi [n] , denoted by x[n] = ( x1[n], x2 [n],L , xR [n]) :
y[n] = Gx[n]
(10.74)
for all n, with G being the R x R mixing matrix. This mixing is termed instantaneous, since
the sensor signals at time n depend on the sources at the same, but no earlier, time point.
Had the mixing matrix been given, its inverse could have been applied to the sensor signals
to recover the sources by x[n] = G −1y[n] . In the absence of any information about the mixing, the blind separation problem consists of estimating a separating matrix H = G −1 from
the observed microphone signals alone. The source signals can then be recovered by
x[n] = Hy[n]
(10.75)
We’ll use here the probabilistic formulation of ICA, though alternate frameworks for
ICA have been derived also [18]. Let px (x[n]) be the probability density function (pdf) of
the source signals, so that the pdf of microphone signals y[n] is given by
py (y[n]) =| H | px (Hy[n])
(10.76)
and if we furthermore assume that the sources are independent across time, i.e., x[n] is independent of x[n+i] for i ≠ 0, then the joint probability is given by

p_y(y[0], y[1], \ldots, y[N-1]) = \prod_{n=0}^{N-1} p_y(y[n]) = |H|^N \prod_{n=0}^{N-1} p_x(H y[n])    (10.77)

whose normalized log-likelihood is given by
\Psi = \frac{1}{N} \ln p_y(y[0], y[1], \ldots, y[N-1]) = \ln|H| + \frac{1}{N} \sum_{n=0}^{N-1} \ln p_x(H y[n])    (10.78)
It can be shown that

\frac{\partial \ln|H|}{\partial H} = \left( H^T \right)^{-1}    (10.79)
so that the gradient of Ψ [38] in Eq. (10.78) is given by

\frac{\partial \Psi}{\partial H} = \left( H^T \right)^{-1} + \frac{1}{N} \sum_{n=0}^{N-1} \phi(H y[n]) \left( y[n] \right)^T    (10.80)
where φ (x) is expressed as
φ ( x) =
∂ ln px (x)
∂x
(10.81)
If we further assume the distribution is a zero-mean Gaussian with standard deviation σ, then Eq. (10.81) results in

\phi(x) = -\frac{x}{\sigma^2}    (10.82)
which inserted into Eq. (10.80) yields

\frac{\partial \Psi}{\partial H} = \left( H^T \right)^{-1} - \frac{H}{\sigma^2} \left( \frac{1}{N} \sum_{n=0}^{N-1} y[n] \left( y[n] \right)^T \right) = \left( H^T \right)^{-1} - \frac{H}{\sigma^2} R    (10.83)
with R being the matrix of cross-correlations, i.e.,

R_{ij} = \frac{1}{N} \sum_{n=0}^{N-1} y_i[n] y_j[n]    (10.84)
Setting Eq. (10.83) to 0 results in maximization of Eq. (10.78) under the Gaussian assumption:
H^T H = \sigma^2 R^{-1}    (10.85)
which can be solved using the Cholesky decomposition described in Chapter 6.
Since σ is generally not known, it can be shown from Eq. (10.85) that the sources can
be recovered only to within a scaling factor [17]. Scaling is in general not a big problem,
since speech recognition systems perform automatic gain control (AGC). Moreover, the
sources can be recovered to within a permutation. To see this, let’s define a two-dimensional
matrix A

A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}    (10.86)
which is orthogonal:
AT A = I
(10.87)
If H is a solution of Eq. (10.85), then AH is also a solution. Thus, a permutation of the
sources yields the same correlation matrix in Eq. (10.84). Although we have shown it only
under the Gaussian assumption, separation up to a scaling and source permutation is a general result in blind source separation [17].
Unfortunately, the Gaussian assumption does not guarantee separation. To see this, we
can define a two-dimensional rotation matrix A
A = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}    (10.88)
which is also orthogonal, so that if H is a solution of Eq. (10.85), then AH is also a solution.
The Gaussian assumption entails considering only second-order statistics, and to ensure separation we could consider higher-order statistics. Since speech signals do not follow
a Gaussian distribution, we could use a Laplacian distribution, as we saw in Chapter 7:
p_x(x) = \frac{\beta}{2} e^{-\beta |x|}    (10.89)
which, using Eq. (10.81), results in
\phi(x) = \begin{cases} -\beta & x > 0 \\ \beta & x < 0 \end{cases}    (10.90)
and thus a nonlinear function of H for Eq. (10.80). Since a closed-form solution is not possible, a common solution in this case is gradient descent, where the gradient is given by
\frac{\partial \Psi}{\partial H_n} = \left( H_n^T \right)^{-1} + \phi(H_n y[n]) \left( y[n] \right)^T    (10.91)
and the update formula by

H_{n+1} = H_n - \varepsilon \frac{\partial \Psi}{\partial H_n} = H_n - \varepsilon \left[ \left( H_n^T \right)^{-1} + \phi(H_n y[n]) \left( y[n] \right)^T \right]    (10.92)
which is the so-called infomax rule [10].
Often the nonlinearity in Eq. (10.90) is replaced by a sigmoid⁶ function:

⁶ The sigmoid function can be expressed in terms of the hyperbolic tangent tanh(x) = sinh(x)/cosh(x), where sinh(x) = (e^x - e^{-x})/2 and cosh(x) = (e^x + e^{-x})/2.
\phi(x) = -\beta \tanh(\beta x)    (10.93)

which implies a density function

p_x(x) = \frac{\beta}{2\pi \cosh(\beta x)}    (10.94)
The sigmoid converges to the Laplacian as β → ∞ . Nonlinear functions in Eqs.
(10.90) and (10.93) can be expanded in Taylor series so that all the moments of the observed
signals are used and not just the second order, as in the case of the Gaussian assumption.
These nonlinearities have been shown to be more effective in separating the sources. The
use of more accurate density functions for px (x) , such as a mixture of Gaussians [9], also
results in nonlinear φ ( x) functions that have shown better separation.
A problem of Eq. (10.92) is that it requires a matrix inversion at every iteration. The so-called natural gradient [7] was suggested to avoid this, also providing faster convergence. To do this we can multiply the gradient of Eq. (10.91) by a positive definite matrix, for example the inverse of the Fisher information matrix, H_n^T H_n, to whiten the signal:

H_{n+1} = H_n - \varepsilon \frac{\partial \Psi}{\partial H_n} H_n^T H_n    (10.95)
which, combined with Eq. (10.91), results in

H_{n+1} = H_n - \varepsilon \left[ I + \phi(\hat{x}[n]) (\hat{x}[n])^T \right] H_n    (10.96)
where the estimated sources are given by
xˆ[n] = H n y[n]
(10.97)
Notice the similarity of this approach to the RLS algorithm of Section 10.3.5. Similarly to
most Newton-Raphson methods, the convergence of this approach is quadratic instead of
linear as long as we are close enough to the maximum.
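A compact sketch of the natural-gradient rule for an instantaneous two-source mixture is given below (our own illustration, not code from the text). It uses the tanh nonlinearity of Eq. (10.93) in batch form and, since Ψ in Eq. (10.78) is a log-likelihood to be maximized, steps along the positive natural-gradient direction; with the sign convention of Eq. (10.96) this corresponds to choosing a negative ε. The learning rate, iteration count, and β = 1 are arbitrary choices, and the recovered sources are only defined up to scaling and permutation, as discussed above.

import numpy as np

def natural_gradient_ica(Y, n_iter=500, eps=0.05, beta=1.0):
    """Batch natural-gradient update based on Eq. (10.96), ascending the likelihood."""
    R, N = Y.shape
    H = np.eye(R)
    for _ in range(n_iter):
        X_hat = H @ Y                                   # current source estimates, Eq. (10.97)
        phi = -beta * np.tanh(beta * X_hat)             # nonlinearity of Eq. (10.93)
        grad = np.eye(R) + (phi @ X_hat.T) / N          # batch average of I + phi(x) x^T
        H = H + eps * grad @ H                          # positive step to maximize Psi
    return H

# Two super-Gaussian (Laplacian) sources mixed instantaneously as in Eq. (10.74).
rng = np.random.default_rng(3)
X = rng.laplace(size=(2, 20000))
G = np.array([[1.0, 0.6], [0.4, 1.0]])
Y = G @ X
H = natural_gradient_ica(Y)
print("H G (ideally close to a scaled permutation):")
print(np.round(H @ G, 2))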
Another way of overcoming the lack of separation under the independent Gaussian assumption is to make use of temporal information, which we know is important for speech
signals. If the model of Eq. (10.74) is extended to contain additive noise
y[n] = Gx[n] + v[n]
(10.98)
we can compute the autocorrelation of y[n] as
R y [n] = GR x [n]G T + R v [n]
(10.99)
or, after some manipulation,
R x [n] = H(R y [n] − R v [n])HT
(10.100)
which we know must be diagonal because the sources x are independent, and thus H can be
estimated to minimize the squared error of the off-diagonal terms of R x [n] for several values of n [11]. Equation (10.100) is a generalization of Eq. (10.85) when considering temporal correlation and additive noise.
Figure 10.26 Convolutional model for the case of two microphones.
The case of instantaneous mixing is not realistic, as we need to consider the transfer
functions between the sources and the microphones created by the room acoustics. It can be
shown that the reconstruction filters hij [n] in Figure 10.26 will completely recover the original signals xi [n] if and only if their z-transforms are the inverse of the z-transforms of the
mixing filters gij [n] :
\begin{pmatrix} H_{11}(z) & H_{12}(z) \\ H_{21}(z) & H_{22}(z) \end{pmatrix} = \begin{pmatrix} G_{11}(z) & G_{12}(z) \\ G_{21}(z) & G_{22}(z) \end{pmatrix}^{-1} = \frac{1}{G_{11}(z) G_{22}(z) - G_{12}(z) G_{21}(z)} \begin{pmatrix} G_{22}(z) & -G_{12}(z) \\ -G_{21}(z) & G_{11}(z) \end{pmatrix}    (10.101)
(10.101)
If the matrix in Eq. (10.101) is not invertible, separability is impossible. This can happen if both microphones pick up the same signal, which could happen if either the two microphones are too close to each other or the two sources are too close to each other. It’s reasonable to assume the mixing filters gij [n] to be FIR filters, whose length will generally
depend on the reverberation time, which in turn depends on the room size, microphone position, wall absorbance, and so on. In general this means that the reconstruction filters hij [n]
have an infinite impulse response. In addition, the filters gij [n] may have zeroes outside the
unit circle, so that perfect reconstruction filters would need to have poles outside the unit
circle. For this reason it is not possible, in general, to recover the original signals exactly.
In practice, it’s convenient to assume such filters to be FIR of length q, which means
that the original signals x_1[n] and x_2[n] will not be recovered exactly. Thus the problem
consists in estimating the reconstruction filters hij [n] directly from the microphone signals
y1[n] and y2 [n] , so that the estimated signals xˆi [n] are as close as possible to the original
signals. Often we are satisfied if the resulting signals are separated, even if they contain
some amount of reverberation.
An approach commonly used to combat this problem consists of taking a filterbank
and assuming instantaneous mixing within each filter [38]. This approach can separate real
sources much more effectively, but it suffers from the problem of permutations, which in
this case is more severe because frequencies from different sources can be mixed together.
To avoid this, we may need a probabilistic model of the sources that takes into account correlations across frequencies [3]. Another problem occurs when the number of sources is larger than the number of microphones.
10.5. ENVIRONMENT COMPENSATION PREPROCESSING
The goal of this section is to present a number of techniques used to clean up the signal of
additive noise and/or channel distortions prior to the speech recognition system. Although
the techniques presented here are developed for the case of one microphone, they can be
generalized to the case where several microphones are available using the approaches described in Section 10.4. These techniques can also be used to enhance the signal captured
with a speakerphone or a desktop microphone in teleconferencing applications.
Since the human auditory system is so robust to changes in the acoustical environment, many researchers have attempted to develop signal processing schemes that mimic the
functional organization of the peripheral auditory system [27, 49]. The PLP cepstrum described in Chapter 6 has also been shown to be very effective in combating noise and channel distortions [60].
Another alternative is to consider the feature vector as an integral part of the recognizer, and thus researchers have investigated its design so as to maximize recognition accuracy, as discussed in Chapter 9. Such approaches include LDA [34] and neural networks
[45]. These discriminatively trained features can also be optimized to operate better under
noisy conditions, thus possibly beating the standard mel-cepstrum, especially when several
independent features are combined [50]. The mel-cepstrum is the most popular feature vector for speech recognition. In this context we present a number of techniques that have been
proposed over the years to compensate for the effects of additive noise and channel distortions on the cepstrum.
10.5.1. Spectral Subtraction
The basic assumption in this section is that the desired clean signal x[m] has been corrupted
by additive noise n[m]:
y[m] = x[m] + n[m]
(10.102)
and that both x[m] and n[m] are statistically independent, so that the power spectrum of the
output y[m] can be approximated as the sum of the power spectra:
|Y(f)|^2 \approx |X(f)|^2 + |N(f)|^2    (10.103)
with equality if we take expected values, as the expected value of the cross term vanishes
(see Section 10.1.3).
Although we don't know |N(f)|^2, we can obtain an estimate using the average periodogram over M frames that are known to be just noise (i.e., when no signal is present), as long as the noise is stationary:

|\hat{N}(f)|^2 = \frac{1}{M} \sum_{i=0}^{M-1} |Y_i(f)|^2    (10.104)
Spectral subtraction supplies an intuitive estimate for |X(f)|^2 using Eqs. (10.103) and (10.104) as

|\hat{X}(f)|^2 = |Y(f)|^2 - |\hat{N}(f)|^2 = |Y(f)|^2 \left( 1 - \frac{1}{SNR(f)} \right)    (10.105)

where we have defined the frequency-dependent signal-to-noise ratio SNR(f) as

SNR(f) = \frac{|Y(f)|^2}{|\hat{N}(f)|^2}    (10.106)
Equation (10.105) describes the magnitude of the Fourier transform but not the phase.
This is not a problem if we are interested in computing the mel-cepstrum as discussed in
Chapter 6. We can just modify the magnitude and keep the original phase of Y ( f ) using a
filter H ss ( f ) :
Xˆ ( f ) = Y ( f ) H ss ( f )
(10.107)
where, according to Eq. (10.105), H ss ( f ) is given by
H_{ss}(f) = \sqrt{1 - \frac{1}{SNR(f)}}    (10.108)
Since |\hat{X}(f)|^2 is a power spectral density, it has to be positive, and therefore

SNR(f) \ge 1    (10.109)
but we have no guarantee that SNR ( f ) , as computed by Eq. (10.106), satisfies Eq. (10.109).
In fact, it is easy to see that noise frames do not comply. To enforce this constraint, Boll [13]
suggested modifying Eq. (10.108) as follows:
H_{ss}(f) = \sqrt{\max\left( 1 - \frac{1}{SNR(f)},\; a \right)}    (10.110)

with a ≥ 0, so that the quantity within the square root is always positive, and where f_{ss}(x) is given by

f_{ss}(x) = \sqrt{\max\left( 1 - \frac{1}{x},\; a \right)}    (10.111)
It is useful to express SNR ( f ) in dB so that
x = 10 log10 SNR
(10.112)
and the gain of the filter in Eq. (10.111) also in dB:
g ss ( x ) = 20 log10 f ss ( x )
(10.113)
Using Eqs. (10.111) and (10.112), we can express Eq. (10.113) by

g_{ss}(x) = \max\left( 10 \log_{10}\left( 1 - 10^{-x/10} \right),\; -A \right)    (10.114)

after expressing the attenuation a in dB:

a = 10^{-A/10}    (10.115)
Equation (10.114) is plotted in Figure 10.27 for A = 10 dB.
The spectral subtraction rule in Eq. (10.111) is quite intuitive. To implement it we can
do a short-time analysis, as shown in Chapter 6, by using overlapping windowed segments,
zero-padding, computing the FFT, modifying the magnitude spectrum, taking the inverse
FFT, and adding the resulting windows.
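A bare-bones Python implementation of this analysis-modification-synthesis loop is sketched below (our own code, not from the text; the frame sizes and the floor a are arbitrary choices). It estimates the noise spectrum from the first few frames as in Eq. (10.104), applies the floored gain of Eq. (10.110) while keeping the noisy phase as in Eq. (10.107), and resynthesizes by overlap-add.

import numpy as np

def spectral_subtraction(y, frame_len=400, hop=200, noise_frames=10, a=0.1):
    """Short-time spectral subtraction with the floor of Eq. (10.110) and overlap-add resynthesis."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    noise_psd = np.zeros(frame_len // 2 + 1)        # Eq. (10.104) from the first noise-only frames
    for i in range(noise_frames):
        Y = np.fft.rfft(win * y[i * hop:i * hop + frame_len])
        noise_psd += np.abs(Y) ** 2 / noise_frames
    out = np.zeros(len(y))
    norm = np.zeros(len(y))
    for i in range(n_frames):
        Y = np.fft.rfft(win * y[i * hop:i * hop + frame_len])
        snr = (np.abs(Y) ** 2 + 1e-12) / (noise_psd + 1e-12)   # Eq. (10.106)
        gain = np.sqrt(np.maximum(1.0 - 1.0 / snr, a))         # Eq. (10.110)
        x_frame = np.fft.irfft(gain * Y, frame_len)            # noisy phase is kept, Eq. (10.107)
        out[i * hop:i * hop + frame_len] += win * x_frame
        norm[i * hop:i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-12)

# Example: a tone preceded by silence, plus white noise; the first frames are noise only.
rng = np.random.default_rng(4)
fs = 16000
t = np.arange(3 * fs) / fs
x = np.concatenate([np.zeros(fs), 0.5 * np.sin(2 * np.pi * 440 * t[fs:])])
y = x + 0.1 * rng.standard_normal(len(x))
x_hat = spectral_subtraction(y)
print("SNR before (dB):", round(10 * np.log10(np.var(x) / np.var(y - x)), 1),
      " after (dB):", round(10 * np.log10(np.var(x) / np.var(x_hat - x)), 1))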
This implementation results in output speech that has significantly less noise, though it exhibits what is called musical noise [12]. This is caused by frequency bands f for which |Y(f)|^2 \approx |\hat{N}(f)|^2. As shown in Figure 10.27, a frequency f_0 for which |Y(f_0)|^2 < |\hat{N}(f_0)|^2 is attenuated by A dB, whereas a neighboring frequency f_1, where |Y(f_1)|^2 > |\hat{N}(f_1)|^2, has a much smaller attenuation. These rapid changes with frequency introduce tones at varying frequencies that appear and disappear rapidly.
The main reason for the presence of musical noise is that the estimates of SNR ( f )
through Eqs. (10.104) and (10.106) are poor. This is partly because SNR ( f ) is computed
independently for each frequency, whereas we know that SNR(f_0) and SNR(f_1) are correlated if f_0 and f_1 are close to each other. Thus, one possibility is to smooth the filter in Eq.
(10.114) over frequency. This approach suppresses a smaller amount of noise, but it does not
distort the signal as much, and thus may be preferred by listeners. Similarly, smoothing over
time
SNR(f, t) = \gamma\, SNR(f, t-1) + (1 - \gamma) \frac{|Y(f)|^2}{|\hat{N}(f)|^2}    (10.116)
can also be done to reduce the distortion, at the expense of a smaller noise attenuation.
Smoothing over both time and frequency can be done to obtain more accurate SNR measurements and thus less distortion. As shown in Figure 10.28, use of spectral subtraction can
reduce the error rate.
Figure 10.27 Magnitude of the spectral subtraction filter gain as a function of the input
instantaneous SNR for A = 10 dB, for the spectral subtraction of Eq. (10.114), magnitude
subtraction of Eq. (10.118), and oversubtraction of Eq. (10.119) with β = 2 dB.
Additionally, the attenuation A can be made a function of frequency. This is useful
when we want to suppress more noise at one frequency than another, which is a tradeoff
between noise reduction and nonlinear distortion of speech.
Other enhancements to the basic algorithm have been proposed to reduce the musical noise. Sometimes Eq. (10.111) is generalized to

f_{ms}(x) = \left( \max\left( 1 - \frac{1}{x^{\alpha/2}},\; a \right) \right)^{1/\alpha}    (10.117)
where α = 2 corresponds to the power spectral subtraction rule in Eq. (10.111), and α = 1
corresponds to the magnitude subtraction rule (plotted in Figure 10.27 for A = 10 dB):
g_{ms}(x) = \max\left( 20 \log_{10}\left( 1 - 10^{-x/5} \right),\; -A \right)    (10.118)
Another variation, called oversubtraction, consists of multiplying the estimate of the noise power spectral density |\hat{N}(f)|^2 in Eq. (10.104) by a constant 10^{\beta/10}, where β > 0, which causes the power spectral subtraction rule in Eq. (10.114) to be transformed to another function

g_{os}(x) = \max\left( 10 \log_{10}\left( 1 - 10^{-(x-\beta)/10} \right),\; -A \right)    (10.119)

This causes |Y(f)|^2 < |\hat{N}(f)|^2 to occur more often than |Y(f)|^2 > |\hat{N}(f)|^2 for frames for which |Y(f)|^2 \approx |\hat{N}(f)|^2, and thus reduces the musical noise.
Figure 10.28 Word error rate as a function of SNR (dB) using Whisper on the Wall Street
Journal 5000-word dictation task. White noise was added at different SNRs, and the system
was trained with speech with the same SNR. The solid line represents the baseline system
trained with clean speech, the dotted line the use of spectral subtraction with the previous clean
HMMs. They are compared to a system trained on the same speech with the same SNR as the
speech tested on.
10.5.2. Frequency-Domain MMSE from Stereo Data
You have seen that several possible functions, such as Eqs. (10.114), (10.118), or (10.119),
can be used to attenuate the noise, and it is not clear that any one of them is better than the
others, since each has been obtained through different assumptions. This opens the possibility of estimating the curve g ( x ) using a different criterion, and, thus, different approximations than those used in Section 10.5.1.
Figure 10.29 Empirical curves for input-to-output instantaneous SNR. Eight different curves
for 0, 1, 2, 3, 4, 5, 6, 7 and 8 kHz are obtained following Eq. (10.121) [2] using speech recorded simultaneously from a close-talking microphone and a desktop microphone.
One interesting possibility occurs when we have pairs of stereo utterances that have
been recorded simultaneously in noise-free conditions in one channel and noisy conditions
in the other channel. In this case, we can estimate f(x) using a minimum mean squared error criterion (Porter and Boll [47], Ephraim and Malah [23]), so that

\hat{f}(x) = \arg\min_{f(x)} \left\{ \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| X_i(f_j) - f\left( SNR(f_j) \right) Y_i(f_j) \right|^2 \right\}    (10.120)
or g(x) as

\hat{g}(x) = \arg\min_{g(x)} \left\{ \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left( 10 \log_{10} |X_i(f_j)|^2 - g\left( SNR(f_j) \right) - 10 \log_{10} |Y_i(f_j)|^2 \right)^2 \right\}    (10.121)
which can be solved by discretizing f(x) and g(x) into several bins and summing over all M
frequencies and N frames. This approach results in a curve that is smoother and thus results
in less musical noise and lower distortion. Stereo utterances of noise-free and noisy speech
are needed to estimate f(x) and g(x) through Eqs. (10.120) and (10.121) for any given acoustical environment and can be collected with two microphones, or the noisy speech can be
obtained by adding to the clean speech artificial noise from the testing environment.
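A possible way to tabulate g(x) from stereo data is sketched below (our own illustration with synthetic spectra standing in for time-aligned clean/noisy frames): the instantaneous SNR of Eq. (10.112) is discretized into bins and, within each bin, the least-squares solution of Eq. (10.121) is simply the average log-spectral difference between the clean and noisy channels.

import numpy as np

def estimate_gain_curve(X_mag2, Y_mag2, N_mag2, n_bins=40, snr_range=(-10.0, 30.0)):
    """Tabulate g(x) of Eq. (10.121) by averaging the log-spectral difference within SNR bins."""
    snr_db = 10.0 * np.log10(Y_mag2 / N_mag2)                    # x = 10 log10 SNR(f), Eq. (10.112)
    target = 10.0 * np.log10(X_mag2) - 10.0 * np.log10(Y_mag2)   # desired gain in dB
    edges = np.linspace(snr_range[0], snr_range[1], n_bins + 1)
    idx = np.clip(np.digitize(snr_db, edges) - 1, 0, n_bins - 1)
    curve = np.zeros(n_bins)
    for b in range(n_bins):
        mask = idx == b
        curve[b] = target[mask].mean() if mask.any() else 0.0    # per-bin least-squares solution
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, curve

# Synthetic "stereo" data: random clean power spectra plus independent unit-power noise.
rng = np.random.default_rng(5)
X2 = np.abs(rng.standard_normal((500, 129)) * 4.0) ** 2
Y2 = X2 + np.abs(rng.standard_normal((500, 129))) ** 2
centers, g = estimate_gain_curve(X2, Y2, np.ones(129))
print("g(x) at", centers[::10].round(1), "dB SNR:", g[::10].round(1), "dB")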
Another generalization of this approach is to use a different function f(x) or g(x) for
every frequency [2] as shown in Figure 10.29. This also allows for a lower squared error at
the expense of having to store more data tables. In the experiments of Figure 10.29, we note
that more subtraction is needed at lower frequencies than at higher frequencies in this case.
If such stereo data is available to estimate these curves, it makes the enhanced speech
sound better [23] than does spectral subtraction. When used in speech recognition systems, it
also leads to higher accuracies [2].
10.5.3. Wiener Filtering
Let’s reformulate Eq. (10.102) from the statistical point of view. The process y[n] is the
sum of random process x[n] and the additive noise v[n] process:
y[n] = x[n] + v[n]
(10.122)
We wish to find a linear estimate \hat{x}[n] in terms of the process y[n]:

\hat{x}[n] = \sum_{m=-\infty}^{\infty} h[m] y[n-m]    (10.123)
which is the result of a linear time-invariant filtering operation. The MMSE estimate of h[n] in Eq. (10.123) minimizes the squared error

E\left\{ \left( x[n] - \sum_{m=-\infty}^{\infty} h[m] y[n-m] \right)^2 \right\}    (10.124)
which results in the famous Wiener-Hopf equation

R_{xy}[l] = \sum_{m=-\infty}^{\infty} h[m] R_{yy}[l-m]    (10.125)
so that, taking Fourier transforms, the resulting filter can be expressed in the frequency domain as

H(f) = \frac{S_{xy}(f)}{S_{yy}(f)}    (10.126)
If the signal x[n] and the noise v[n] are orthogonal, which is often the case, then

S_{xy}(f) = S_{xx}(f) \quad \text{and} \quad S_{yy}(f) = S_{xx}(f) + S_{vv}(f)    (10.127)
so that Eq. (10.126) is given by

H(f) = \frac{S_{xx}(f)}{S_{xx}(f) + S_{vv}(f)}    (10.128)
Equation (10.128) is called the noncausal Wiener filter. This can be realized only if
we know the power spectra of both the noise and the signal. Of course, if S xx ( f ) and
Svv ( f ) do not overlap, then H ( f ) = 1 in the band of the signal and H ( f ) = 0 in the band
of the noise.
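Given the two power spectra, the noncausal Wiener gain of Eq. (10.128) is a one-liner; the sketch below (our own toy example, with an assumed low-pass signal spectrum and white noise of unit power spectral density) simply evaluates it on a frequency grid.

import numpy as np

def wiener_gain(Sxx, Svv):
    """Noncausal Wiener filter of Eq. (10.128): H(f) = Sxx(f) / (Sxx(f) + Svv(f))."""
    return Sxx / (Sxx + Svv)

# Toy example on a normalized frequency grid [0, 0.5].
f = np.linspace(0.0, 0.5, 257)
Sxx = 10.0 / (1.0 + (f / 0.05) ** 2)     # assumed (known) clean-signal power spectrum
Svv = np.ones_like(f)                    # white noise PSD
H = wiener_gain(Sxx, Svv)
print("gain at DC:", round(float(H[0]), 2), "  gain at f = 0.25:", round(float(H[128]), 2))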
In practice, S xx ( f ) is unknown. If it were known, we could compute its mel-cepstrum,
which would coincide exactly with the mel-cepstrum before noise addition. To solve this
chicken-and-egg problem, we need some kind of model. Ephraim [22] proposed the use of
an HMM where, if we know what state the current frame falls under, we can use its mean
spectrum as S xx ( f ) . In practice we do not know what state each frame falls into either, so
he proposed to weight the filters for each state by the posterior probability that the frame
falls into each state. This algorithm, when used in speech enhancement, results in gains of
15 dB or more.
A causal version of the Wiener filter can also be derived. A dynamical state model algorithm called the Kalman filter (see [42] for details) is also an extension of the Wiener filter.
10.5.4. Cepstral Mean Normalization (CMN)
Different microphones have different transfer functions, and even the same microphone has
a varying transfer function depending on the distance to the microphone and the room
acoustics. In this section we describe a powerful and simple technique that is designed to
handle convolutional distortions and, thus, increases the robustness of speech recognition
systems to unknown linear filtering operations.
Given a signal x[n], we compute its cepstrum through short-time analysis, resulting in a set of T cepstral vectors X = \{ x_0, x_1, \ldots, x_{T-1} \}. Its sample mean \bar{x} is given by

\bar{x} = \frac{1}{T} \sum_{t=0}^{T-1} x_t    (10.129)
Cepstral mean normalization (CMN) (Atal [8]) consists of subtracting \bar{x} from each vector x_t to obtain the normalized cepstrum vector \hat{x}_t:

\hat{x}_t = x_t - \bar{x}    (10.130)
Let’s now consider a signal y[n], which is the output of passing x[n] through a filter
h[n]. We can compute another sequence of cepstrum vectors Y = {y 0 , y1 ,L , yT −1} . Now
let’s further define a vector h as
(
h = C ln H (ω 0 )
2
L ln H (ω M )
2
)
(10.131)
where C is the DCT matrix. We can see that
y t = xt + h
(10.132)
and thus the sample mean \bar{y} equals

\bar{y} = \frac{1}{T} \sum_{t=0}^{T-1} y_t = \frac{1}{T} \sum_{t=0}^{T-1} (x_t + h) = \bar{x} + h    (10.133)
and its normalized cepstrum is given by

\hat{y}_t = y_t - \bar{y} = \hat{x}_t    (10.134)
which indicates that cepstral mean normalization is immune to linear filtering operations.
This procedure is performed on every utterance for both training and testing. Intuitively, the
mean vector \bar{x} conveys the spectral characteristics of the current microphone and room acoustics. In the limit, when T → ∞ for each utterance, we should expect means from utterances from the same recording environment to be the same. Applying CMN to the cepstrum vectors does not modify the delta or delta-delta cepstrum.
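The invariance of Eq. (10.134) is easy to verify numerically, as in the hypothetical sketch below (our own code): adding a constant cepstral bias h to every frame, as a fixed linear channel does in Eq. (10.132), leaves the CMN-normalized features unchanged up to rounding error.

import numpy as np

def cepstral_mean_normalization(X):
    """Eq. (10.130): subtract the per-utterance sample mean of Eq. (10.129) from each cepstral vector."""
    return X - X.mean(axis=0, keepdims=True)

rng = np.random.default_rng(6)
X = rng.standard_normal((300, 13))             # T cepstral vectors of dimension 13
h = rng.standard_normal(13)                    # cepstrum of a hypothetical channel
Y = X + h                                      # Eq. (10.132)
print("max difference after CMN:",
      float(np.abs(cepstral_mean_normalization(X) - cepstral_mean_normalization(Y)).max()))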
Let’s analyze the effect of CMN on a short utterance. Assume that our utterance contains a single phoneme, say /s/. The mean x will be very similar to the frames in this phoneme, since /s/ is quite stationary. Thus, after normalization, xˆ t ≈ 0 . A similar result will
happen for other fricatives, which means that it would be impossible to distinguish these
ultrashort utterances, and the error rate will be very high. If the utterance contains more than
one phoneme but is still short, this problem is not insurmountable, but the confusion among
phonemes is still higher than if no CMN had been applied. Empirically, it has been found
that this procedure does not degrade the recognition rate on utterances from the same acoustical environment, as long as they are longer than 2–4 seconds. Yet the method provides
significant robustness against linear filtering operations. In fact, for telephone recordings,
where each call has a different frequency response, the use of CMN has been shown to provide as much as 30% relative decrease in error rate. When a system is trained on one microphone and tested on another, CMN can provide significant robustness.
Interestingly enough, it has been found in practice that the error rate for utterances
within the same environment is actually somewhat lower, too. This is surprising, given that
there is no mismatch in channel conditions. One explanation is that, even for the same microphone and room acoustics, the distance between the mouth and the microphone varies for
different speakers, which causes slightly different transfer functions, as we studied in Section 10.2. In addition, the cepstral mean characterizes not only the channel transfer function,
but also the average frequency response of different speakers. By removing the long-term speaker average, CMN can act as a sort of speaker normalization.
Figure 10.30 Word error rate as a function of SNR (dB) for both no CMN (solid line) and
CMN-2 [5] (dotted line). White noise was added at different SNRs and the system was trained
with speech with the same SNR. The Whisper system is used on the 5000-word Wall Street
Journal task using a bigram language model.
One drawback of CMN is that it does not discriminate between silence and speech when computing the utterance mean. An extension to CMN consists in computing different means for noise and speech [5]:

h^{(j+1)} = \frac{1}{N_s} \sum_{t \in q_s} x_t - m_s

n^{(j+1)} = \frac{1}{N_n} \sum_{t \in q_n} x_t - m_n    (10.135)
i.e., the difference between the average vector for speech frames in the utterance and the
average vector m s for speech frames in the training data, and similarly for the noise frames
m n . Speech/noise discrimination could be done by classifying frames into speech frames
and noise frames, computing the average cepstra for each, and subtracting them from the
average in the training data. This procedure works well as long as the speech/noise classification is accurate. It’s best done by the recognizer, since other speech detection algorithms
can fail in high background noise (see Section 10.6.2). As shown in Figure 10.30, this algorithm has been shown to improve robustness not only to varying channels but also to noise.
10.5.5. Real-Time Cepstral Normalization
CMN requires the complete utterance to compute the cepstral mean; thus, it cannot be used
in a real-time system, and an approximation needs to be used. In this section we discuss a
modified version of CMN that can address this problem, as well as a set of techniques called
RASTA that attempt to do the same thing.
We can interpret CMN as subtracting from each cepstral vector the output of a low-pass filter d[n] whose T coefficients are identical and equal to 1/T; the overall operation is therefore a high-pass filter with a cutoff frequency \omega_c arbitrarily close to 0. This interpretation indicates that we can implement other types of high-pass filters. One that has been found to work well in practice is the exponential filter, for which the cepstral mean \bar{x}_t is a function of time:

\bar{x}_t = \alpha x_t + (1 - \alpha) \bar{x}_{t-1}    (10.136)

where α is chosen so that the filter has a time constant⁷ of at least 5 seconds of speech.
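A causal version of CMN based on Eq. (10.136) can be sketched as follows (our own code; the value α ≈ 0.0014 assumes 100 cepstral frames per second and the roughly 5-second time constant mentioned above, and the running mean is simply initialized from the first frame).

import numpy as np

def realtime_cmn(X, alpha=0.0014):
    """Causal CMN: track the cepstral mean with the exponential filter of Eq. (10.136) and subtract it."""
    mean = X[0].copy()                                # crude initialization from the first frame
    out = np.empty_like(X)
    for t in range(len(X)):
        mean = alpha * X[t] + (1.0 - alpha) * mean    # Eq. (10.136)
        out[t] = X[t] - mean
    return out

# Toy check: a constant bias on the first cepstral coefficient is gradually removed.
rng = np.random.default_rng(7)
X = rng.standard_normal((2000, 13)) + np.array([1.0] + [0.0] * 12)
print("c0 mean before:", round(float(X[:, 0].mean()), 2),
      " after (last 1000 frames):", round(float(realtime_cmn(X)[1000:, 0].mean()), 2))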
Other types of filters have been proposed in the literature. In fact, a popular approach
consists of an IIR bandpass filter with the transfer function:
H(z) = 0.1 z^{4} \cdot \frac{2 + z^{-1} - z^{-3} - 2 z^{-4}}{1 - 0.98 z^{-1}}    (10.137)
which is used in the so-called relative spectral processing or RASTA [32]. As in CMN, the
high-pass portion of the filter is expected to alleviate the effect of convolutional noise introduced in the channel. The low-pass filtering helps to smooth some of the fast frame-to-frame
spectral changes present. Empirically, it has been shown that the RASTA filter behaves
similarly to the real-time implementation of CMN, albeit with a slightly higher error rate.
Both the RASTA filter and real-time implementations of CMN require the filter to be properly initialized. Otherwise, the first utterance may use an incorrect cepstral mean. The original derivation of RASTA includes a few stages prior to the bandpass filter, and this filter is
performed on the spectral energies, not the cepstrum.
10.5.6. The Use of Gaussian Mixture Models
Algorithms such as spectral subtraction of Section 10.5.1 or the frequency-domain MMSE
of Section 10.5.2 implicitly assume that different frequencies are uncorrelated from each
other. Because of that, the spectrum of the enhanced signal may exhibit abrupt changes
across frequency and not look like spectra of real speech signals. Using the model of the
environment of Section 10.1.3, we can express the clean-speech cepstral vector x as a function of the observed noisy cepstral vector y as
⁷ The time constant τ of a low-pass filter is defined as the time for which the output is cut in half. For an exponential filter of parameter α and sampling rate F_s, α = ln 2/(τ F_s).
x = y - h - C \ln\left( 1 - e^{C^{-1}(n - y)} \right)    (10.138)
where the noise cepstral vector n is a random vector. The MMSE estimate of x is given by

\hat{x}_{MMSE} = E\{x \mid y\} = y - h - C\, E\left\{ \ln\left( 1 - e^{C^{-1}(n - y)} \right) \mid y \right\}    (10.139)
where the expectation uses the distribution of n. Solution to Eq. (10.139) results in a nonlinear function which can be learned, for example, with a neural network [53].
A popular model to attack this problem consists in modeling the probability distribution of the noisy speech y as a mixture of K Gaussians:

p(y) = \sum_{k=0}^{K-1} p(y \mid k) P[k] = \sum_{k=0}^{K-1} N(y, \mu_k, \Sigma_k) P[k]    (10.140)
where P[k] is the prior probability of each Gaussian component k. If x and y are jointly
Gaussian within class k, then p(x | y, k ) is also Gaussian [42] with mean:
E\{x \mid y, k\} = \mu_k^x + \Sigma_k^{xy} \left( \Sigma_k^{y} \right)^{-1} (y - \mu_k^y) = C_k y + r_k    (10.141)
so that the joint distribution of x and y is given by

p(x, y) = \sum_{k=0}^{K-1} p(x, y \mid k) P[k] = \sum_{k=0}^{K-1} p(x \mid y, k)\, p(y \mid k)\, P[k] = \sum_{k=0}^{K-1} N(x, C_k y + r_k, \Gamma_k)\, N(y, \mu_k, \Sigma_k)\, P[k]    (10.142)
where rk is called the correction vector, Ck is the rotation matrix, and the matrix Γ k tells
us how uncertain we are about the compensation.
A maximum likelihood estimate of x maximizes the joint probability in Eq. (10.142).
Assuming the Gaussians do not overlap very much (as in the FCDCN algorithm [2]):
\hat{x}_{ML} \approx \arg\max_{k} p(x, y, k) = \arg\max_{k} N(y, \mu_k, \Sigma_k)\, N(x, C_k y + r_k, \Gamma_k)\, P[k]    (10.143)

whose solution is

\hat{x}_{ML} = C_{\hat{k}}\, y + r_{\hat{k}}    (10.144)

where

\hat{k} = \arg\max_{k} N(y, \mu_k, \Sigma_k) P[k]    (10.145)
It is often more robust to compute the MMSE estimate of x (as in the CDCN [2] and
RATZ [43] algorithms):
\hat{x}_{MMSE} = E\{x \mid y\} = \sum_{k=0}^{K-1} p(k \mid y)\, E\{x \mid y, k\} = \sum_{k=0}^{K-1} p(k \mid y)\, (C_k y + r_k)    (10.146)
as a weighted sum over all mixture components, where the posterior probability p(k \mid y) is given by

p(k \mid y) = \frac{p(y \mid k) P[k]}{\sum_{k=0}^{K-1} p(y \mid k) P[k]}    (10.147)

The rotation matrix C_k in Eq. (10.144) can be replaced by I with a modest degradation in performance in return for faster computation [21].
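The MMSE correction of Eqs. (10.146) and (10.147) reduces to a posterior-weighted sum of correction vectors; the sketch below (our own toy example with C_k = I, diagonal covariances, and made-up correction vectors r_k) shows the computation.

import numpy as np

def mmse_correction(y, means, variances, priors, r):
    """Eq. (10.146) with C_k = I: x_hat = sum_k p(k|y) (y + r_k), using diagonal Gaussians."""
    # Log of N(y; mu_k, Sigma_k) P[k] for each mixture component k.
    log_lik = -0.5 * np.sum((y - means) ** 2 / variances + np.log(2 * np.pi * variances), axis=1)
    log_post = log_lik + np.log(priors)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                                  # p(k|y), Eq. (10.147)
    return y + post @ r                                 # posterior-weighted correction

# Two-component toy model; r_k would normally be learned from stereo data.
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
priors = np.array([0.5, 0.5])
r = np.array([[2.0, 2.0], [-1.0, -1.0]])
print(mmse_correction(np.array([0.2, -0.1]), means, variances, priors, r))
print(mmse_correction(np.array([4.8, 5.1]), means, variances, priors, r))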
A number of different algorithms [2, 43] have been proposed that vary in how the parameters µ k , Σk , rk , and Γ k are estimated. If stereo recordings are available from both the
clean signal and the noisy signal, then we can estimate µ k , Σk by fitting a mixture Gaussian
model to y as described in Chapter 3. Then Ck , rk and Γ k can be estimated directly by
linear regression of x and y. The FCDCN algorithm [2, 6] is a variant of this approach when
it is assumed that Σ k = σ 2 I , Γ k = γ 2 I , and Ck = I , so that µ k and rk are estimated
through a VQ procedure and rk is the average difference (y − x) for vectors y that belong to
mixture component k. An enhancement is to use the instantaneous SNR of a frame, defined
as the difference between the log-energy of that frame and the average log-energy of the
background noise. It is advantageous to use different correction vectors for different instantaneous SNR levels. The log-energy can be replaced by the zeroth-order cepstral coefficient
with little change in recognition accuracy. It is also possible to estimate the MMSE solution
instead of picking the most likely codeword (as in the RATZ algorithm [43]). The resulting
correction vector is a weighted average of the correction vectors for all classes.
Often, stereo recordings are not available and we need other means of estimating parameters µ k , Σ k , rk , and Γ k . CDCN [6] is one such algorithm that has a model of the environment as described in Section 10.1.3, which defines a nonlinear relationship between x,
y and the environmental parameters n and h for the noise and channel. This method also
uses an MMSE approach where the correction vector is a weighted average of the correction
vectors for all classes. An extension of CDCN using a vector Taylor series approximation
[44] for that nonlinear function has been shown to offer improved results. Other methods
that do not require stereo recordings or a model of the environment are presented in [43].
10.6. ENVIRONMENTAL MODEL ADAPTATION
We describe a number of techniques that achieve compensation by adapting the HMM to the
noisy conditions. The most straightforward method is to retrain the whole HMM with the
speech from the new acoustical environment. Another option is to apply standard adaptive
techniques discussed in Chapter 9 to the case of environment adaptation. We consider a
model of the environment that allows constrained adaptation methods for more efficient
adaptation in comparison to the general techniques.
10.6.1. Retraining on Corrupted Speech
If there is a mismatch between acoustical environments, it is sensible to retrain the HMM.
This is done in practice for telephone speech where only telephone speech, and no clean
high-bandwidth speech, is used in the training phase.
Unfortunately, training a large-vocabulary speech recognizer requires a very large
amount of data, which is often not available for a specific noisy condition. For example, it is
difficult to collect a large amount of training data in a car driving at 50 mph, whereas it is
much easier to record it at idle speed. Having a small amount of matched-conditions training
data could be worse than a large amount of mismatched-conditions training data. Often we
want to adapt our model given a relatively small sample of speech from the new acoustical
environment.
Figure 10.31 Word error rate as a function of the testing data SNR (dB) for Whisper trained
on clean data (solid line) and a system trained on noisy data at the same SNR as the testing set
(dotted line). White noise at different SNRs is added.
One option is to take a noise waveform from the new environment, add it to all the utterances in our database, and retrain the system. If the noise characteristics are known ahead
of time, this method allows us to adapt the model to the new environment with a relatively
small amount of data from the new environment, yet use a large amount of training data.
Figure 10.31 shows the benefit of this approach over a system trained on clean speech for
the case of additive white noise. If the target acoustical environment also has a different
channel, we can also filter all the utterances in the training data prior to retraining. This
method allows us to adapt the model to the new environment with a relatively small amount
of data from the new environment.
If the noise sample is available offline, this simple technique can provide good results
at no cost during recognition. Otherwise the noise addition and model retraining would need
to occur at runtime. This is feasible for speaker-dependent small-vocabulary systems where
the training data can be kept in memory and where the retraining time can be small, but it is
generally not feasible for large-vocabulary speaker-independent systems because of memory
and computational limitations.
Figure 10.32 Word error rates of multistyle training compared to matched-noise training as a
function of the SNR in dB for additive white noise. Whisper is trained as in Figure 10.30. The
error rate of multistyle training is between 12% (for low SNR) and 25% (for high SNR) higher
in relative terms than that of matched-condition training. Nonetheless, multistyle training does
better than a system trained on clean data for all conditions other than clean speech.
One possibility is to create a number of artificial acoustical environments by corrupting our clean database with noise samples of varying levels and types, as well as varying
channels. Then all those waveforms from multiple acoustical environments can be used in
training. This is called multistyle training [39], since the training data represents a variety of different conditions. Because of the diversity of the training data, the resulting recognizer is more robust to varying noise conditions. In Figure 10.32 we see that, though generally the error-rate
curve is above the matched-condition curve, particularly for clean speech, multistyle training
does not require knowledge of the specific noise level and thus is a viable alternative to the
theoretical lower bound of matched conditions.
10.6.2. Model Adaptation
We can also use the standard adaptation methods used for speaker adaptation, such as MAP
or MLLR described in Chapter 9. Since MAP is an unstructured method, it can offer results
similar to those of matched conditions, but it requires a significant amount of adaptation
data. MLLR can achieve reasonable performance with about a minute of speech for minor
mismatches [41]. For severe mismatches, MLLR also requires a large number of transformations, which, in turn, require a larger amount of adaptation data as discussed in Chapter 9.
Let’s analyze the case of a single MLLR transform, where the affine transformation is
simply a bias. In this case the MLLR transform consists only of a vector h that, as in the
case of CMN described in Section 10.5.4, can be estimated from a single utterance. Instead
of estimating h as the average cepstral mean, this method estimates h as the maximum likelihood estimate, given a set of sample vectors X = \{ x_0, x_1, \ldots, x_{T-1} \} and an HMM model λ [48]; it is a version of the EM algorithm in which all the vector means are tied together (see Algorithm 10.2). This procedure for estimating the cepstral bias yields a slight reduction in error rate over CMN, with the improvement being larger for short utterances [48].
ALGORITHM 10.2 MLE SIGNAL BIAS REMOVAL
Step 1: Initialize h (0) = 0 at iteration j = 0
Step 2: Obtain model λ ( j ) by updating the means from m k to m k + h ( j ) , for all Gaussians k.
Step 3: Run recognition with model λ ( j ) on the current utterance and determine a state segmentation θ [t ] for each frame t.
Step 4: Estimate h^{(j+1)} as the vector that maximizes the likelihood, which, using covariance matrices \Sigma_k, is given by:

h^{(j+1)} = \left( \sum_{t=0}^{T-1} \Sigma_{\theta[t]}^{-1} \right)^{-1} \sum_{t=0}^{T-1} \Sigma_{\theta[t]}^{-1} \left( x_t - m_{\theta[t]} \right)    (10.148)
Step 5: If converged, stop; otherwise, increment j and go to Step 2. In practice two iterations
are often sufficient.
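For diagonal covariances, the update of Eq. (10.148) reduces to a per-dimension weighted average of the residuals x_t - m_θ[t]; the sketch below (our own code, with a made-up two-state segmentation standing in for Step 3 of Algorithm 10.2) computes one such update.

import numpy as np

def estimate_bias(X, seg_means, seg_inv_vars):
    """Eq. (10.148) for diagonal covariances: tied-mean ML bias given a state segmentation."""
    # seg_means[t], seg_inv_vars[t]: mean and diagonal inverse covariance of the Gaussian for frame t.
    num = np.sum(seg_inv_vars * (X - seg_means), axis=0)    # sum_t Sigma^-1 (x_t - m_theta[t])
    den = np.sum(seg_inv_vars, axis=0)                      # sum_t Sigma^-1
    return num / den

# Toy example: frames drawn from two "states" shifted by a common channel bias h.
rng = np.random.default_rng(9)
h_true = np.array([0.5, -0.3, 0.2])
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, -1.0]])
states = rng.integers(0, 2, size=500)
X = means[states] + h_true + 0.1 * rng.standard_normal((500, 3))
h_hat = estimate_bias(X, means[states], np.ones((500, 3)) * 100.0)   # inverse variance 1/0.01
print("estimated bias:", h_hat.round(2))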
If both additive noise and linear filtering are applied, the cepstrum for the noise and
that for most speech frames are affected differently. The speech/noise mean normalization
[5] algorithm can be extended similarly, as shown in Algorithm 10.3. The idea is to estimate
a vector n and h , such that all the Gaussians associated to the noise model are shifted by
n , and all remaining Gaussians are shifted by h .
We can make Eq. (10.150) more efficient by tying all the covariance matrices. This
transforms Eq. (10.150) into
h^{(j+1)} = \frac{1}{N_s} \sum_{t \in q_s} x_t - m_s

n^{(j+1)} = \frac{1}{N_n} \sum_{t \in q_n} x_t - m_n    (10.149)
i.e., the difference between the average vector for speech frames in the utterance and the
average vector m s for speech frames in the training data, and similarly for the noise frames
m n . This is essentially the same equation as in the speech-noise cepstral mean normalization described in Section 10.5.4. The difference is that the speech/noise discrimination is
done by the recognizer instead of by a separate classifier. This method is more accurate in
high-background-noise conditions where traditional speech/noise classifiers can fail. As a
compromise, a codebook with considerably fewer Gaussians than a recognizer can be used
to estimate n and h .
ALGORITHM 10.3 SPEECH/NOISE MEAN NORMALIZATION
Step 1: Initialize h (0) = 0 , n (0) = 0 at iteration j = 0
Step 2: Obtain model λ ( j ) by updating the means of speech Gaussians from m k to
m k + h ( j ) , and of noise Gaussians from ml to m l + n ( j ) .
Step 3: Run recognition with model λ ( j ) on the current utterance and determine a state segmentation θ [t ] for each frame t.
Step 4: Estimate h^{(j+1)} and n^{(j+1)} as the vectors that maximize the likelihood for speech frames (t ∈ q_s) and noise frames (t ∈ q_n), respectively:

h^{(j+1)} = \left( \sum_{t \in q_s} \Sigma_{\theta[t]}^{-1} \right)^{-1} \sum_{t \in q_s} \Sigma_{\theta[t]}^{-1} \left( x_t - m_{\theta[t]} \right)

n^{(j+1)} = \left( \sum_{t \in q_n} \Sigma_{\theta[t]}^{-1} \right)^{-1} \sum_{t \in q_n} \Sigma_{\theta[t]}^{-1} \left( x_t - m_{\theta[t]} \right)    (10.150)

Step 5: If converged, stop; otherwise, increment j and go to Step 2.

10.6.3. Parallel Model Combination
By using the clean-speech models and a noise model, we can approximate the distributions obtained by training an HMM with corrupted speech. The memory requirements for the algorithm are then significantly reduced, as the training data is not needed online. Parallel model combination (PMC) is a method to obtain the distribution of noisy speech, given the distributions of clean speech and noise as mixtures of Gaussians. As discussed in Section 10.1.3, if the clean-speech cepstrum follows a Gaussian distribution and the noise cepstrum follows another Gaussian distribution, the noisy speech has a distribution that is no longer Gaussian. The PMC method nevertheless assumes that the resulting distribution is Gaussian, with its mean and covariance matrix set to those of the resulting non-Gaussian distribution. If it is assumed that the distribution of clean speech is a mixture
of N Gaussians, and the distribution of the noise is a mixture of M Gaussians, the distribution of the noisy speech contains NM Gaussians. The feature vector is often composed of the
cepstrum, delta cepstrum, and delta-delta cepstrum. The model combination can be seen in
Figure 10.33.
If the mean and covariance matrix of the cepstral noise vector n are given by \mu_n^c and \Sigma_n^c, respectively, we first compute the mean and covariance matrix in the log-spectral domain:

\mu_n^l = C^{-1} \mu_n^c
\Sigma_n^l = C^{-1} \Sigma_n^c (C^{-1})^T    (10.151)
Figure 10.33 Parallel model combination for the case of one-state noise HMM.
In the linear domain N = e^n, the distribution is lognormal, whose mean vector \mu_N and covariance matrix \Sigma_N can be shown (see Chapter 3) to be given by

\mu_N[i] = \exp\left\{ \mu_n^l[i] + \Sigma_n^l[i, i]/2 \right\}
\Sigma_N[i, j] = \mu_N[i]\, \mu_N[j] \left( \exp\left\{ \Sigma_n^l[i, j] \right\} - 1 \right)    (10.152)
with expressions similar to Eqs. (10.151) and (10.152) for the mean and covariance matrix
of X.
Using the model of the environment with no filter is equivalent to obtaining a random
linear spectral vector Y given by (see Figure 10.33)
Y = X+N
(10.153)
and, since X and N are independent, we can obtain the mean and covariance matrix of Y as
µY = µX + µN
ΣY = ΣX + ΣN
(10.154)
Although the sum of two lognormal distributions is not lognormal, the popular lognormal approximation [26] consists in assuming that Y is lognormal. In this case we can
apply the inverse formulae of Eq. (10.152) to obtain the mean and covariance matrix in the
log-spectral domain:
\Sigma_y^l[i, j] = \ln\left\{ \frac{\Sigma_Y[i, j]}{\mu_Y[i]\, \mu_Y[j]} + 1 \right\}
\mu_y^l[i] = \ln \mu_Y[i] - \frac{1}{2} \ln\left\{ \frac{\Sigma_Y[i, i]}{\mu_Y[i]^2} + 1 \right\}    (10.155)
and finally return to the cepstrum domain applying the inverse of Eq. (10.151):

\mu_y^c = C \mu_y^l
\Sigma_y^c = C \Sigma_y^l C^T    (10.156)
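The chain of Eqs. (10.151) through (10.156) for a single pair of Gaussians can be written compactly as below (our own sketch; an orthonormal DCT matrix stands in for the cepstral transform C, and the toy means and covariances are arbitrary).

import numpy as np

def dct_matrix(M):
    """Orthonormal DCT-II matrix used here as the cepstral transform C."""
    n = np.arange(M)
    C = np.cos(np.pi / M * (n[None, :] + 0.5) * n[:, None]) * np.sqrt(2.0 / M)
    C[0, :] /= np.sqrt(2.0)
    return C

def pmc_lognormal(mu_x_c, Sigma_x_c, mu_n_c, Sigma_n_c, C):
    """Lognormal PMC: combine clean-speech and noise cepstral Gaussians via Eqs. (10.151)-(10.156)."""
    Cinv = np.linalg.inv(C)
    def to_linear(mu_c, Sig_c):
        mu_l = Cinv @ mu_c                                   # Eq. (10.151)
        Sig_l = Cinv @ Sig_c @ Cinv.T
        mu_lin = np.exp(mu_l + np.diag(Sig_l) / 2)           # Eq. (10.152)
        Sig_lin = np.outer(mu_lin, mu_lin) * (np.exp(Sig_l) - 1.0)
        return mu_lin, Sig_lin
    mu_X, Sig_X = to_linear(mu_x_c, Sigma_x_c)
    mu_N, Sig_N = to_linear(mu_n_c, Sigma_n_c)
    mu_Y, Sig_Y = mu_X + mu_N, Sig_X + Sig_N                 # Eq. (10.154)
    Sig_y_l = np.log(Sig_Y / np.outer(mu_Y, mu_Y) + 1.0)     # Eq. (10.155)
    mu_y_l = np.log(mu_Y) - 0.5 * np.diag(Sig_y_l)
    return C @ mu_y_l, C @ Sig_y_l @ C.T                     # Eq. (10.156)

# Small numerical example with 4-dimensional features and a single noise Gaussian.
M = 4
C = dct_matrix(M)
mu_x_l = np.array([2.0, 1.5, 1.0, 0.5])                      # clean log-spectral means
mu_n_l = np.zeros(M)                                         # noise log-spectral means
mu_y_c, Sigma_y_c = pmc_lognormal(C @ mu_x_l, 0.1 * np.eye(M),
                                  C @ mu_n_l, 0.05 * np.eye(M), C)
print("corrupted cepstral mean:", mu_y_c.round(2))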
The lognormal approximation cannot be used directly for the delta and delta-delta cepstrum. Another variant that can be used in this case and is more accurate than the lognormal
approximation is the data-driven parallel model combination (DPMC) [26], which uses
Monte Carlo simulation to draw random cepstrum vectors from both the clean-speech HMM
and noise distribution to create cepstrum of the noisy speech by applying Eqs. (10.20) and
(10.21) to each sample point. These composite cepstrum vectors are not kept in memory,
only their means and covariance matrices are, therefore reducing the required memory
though still requiring a significant amount of computation. The number of vectors drawn
from the distribution was at least 100 in [26]. A way of reducing the number of random vectors needed to obtain good Monte Carlo simulations is proposed in [56]. A version of PMC
using numerical integration, which is very computationally expensive, yielded the best results.
Figure 10.34 and Figure 10.35 compare the values estimated through the lognormal
approximation to the true value, where for simplicity we deal with scalars. Thus x, n, and y
represent the log-spectral energies of the clean signal, noise, and noisy signal, respectively,
for a given frequency. Assuming x and n to be Gaussian with means \mu_x and \mu_n and standard deviations \sigma_x and \sigma_n, respectively, we see that the lognormal approximation is accurate when the standard deviations \sigma_x and \sigma_n are small.
10.6.4. Vector Taylor Series
The model of the acoustical environment described in Section 10.1.3 describes the relationship between the cepstral vectors x, n, and y of the clean speech, noise, and noisy speech,
respectively:
y = x + h + g(n - x - h)    (10.157)
where h is the cepstrum of the filter, and the nonlinear function g(z) is given by
$$g(z) = C \ln\!\left(1 + e^{C^{-1}z}\right) \tag{10.158}$$
Moreno [44] suggests the use of a Taylor series to approximate the nonlinearity in Eq. (10.158), though he applies it in the spectral domain instead of the cepstral domain. We follow that approach to compute the mean and covariance matrix of y [4].
Assume that x, h, and n are Gaussian random vectors with means $\mu_x$, $\mu_h$, and $\mu_n$ and covariance matrices $\Sigma_x$, $\Sigma_h$, and $\Sigma_n$, respectively, and furthermore that x, h, and n are independent. After algebraic manipulation it can be shown that the Jacobians of Eq. (10.157) with respect to x, h, and n, evaluated at $\mu = \mu_n - \mu_x - \mu_h$, can be expressed as
$$\left.\frac{\partial y}{\partial x}\right|_{(\mu_n,\mu_x,\mu_h)} = \left.\frac{\partial y}{\partial h}\right|_{(\mu_n,\mu_x,\mu_h)} = A, \qquad \left.\frac{\partial y}{\partial n}\right|_{(\mu_n,\mu_x,\mu_h)} = I - A \tag{10.159}$$
where the matrix A is given by
$$A = C F C^{-1} \tag{10.160}$$
and F is a diagonal matrix whose diagonal elements are given by the vector $f(\mu)$, which in turn is given by
$$f(\mu) = \frac{1}{1 + e^{C^{-1}\mu}} \tag{10.161}$$
Using Eq. (10.159), we can then approximate Eq. (10.157) by a first-order Taylor series expansion around $(\mu_n, \mu_x, \mu_h)$ as
$$y \approx \mu_x + \mu_h + g(\mu_n - \mu_x - \mu_h) + A(x - \mu_x) + A(h - \mu_h) + (I - A)(n - \mu_n) \tag{10.162}$$
The mean of y, $\mu_y$, can be obtained from Eq. (10.162) as
$$\mu_y \approx \mu_x + \mu_h + g(\mu_n - \mu_x - \mu_h) \tag{10.163}$$
and its covariance matrix $\Sigma_y$ by
$$\Sigma_y \approx A\,\Sigma_x\,A^T + A\,\Sigma_h\,A^T + (I - A)\,\Sigma_n\,(I - A)^T \tag{10.164}$$
so that even if $\Sigma_x$, $\Sigma_h$, and $\Sigma_n$ are diagonal, $\Sigma_y$ is no longer diagonal. Nonetheless, we often assume it is diagonal, since this lets us transform a clean HMM into a corrupted HMM with the same functional form and use a decoder that has been optimized for diagonal covariance matrices.
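As a companion to Eqs. (10.158)-(10.164), the following sketch adapts the static parameters of a single clean-speech Gaussian with first-order VTS. The function and variable names and the assumption of a square, invertible C are not from the text; they are only illustrative.

```python
import numpy as np

def vts_adapt(mu_x, Sigma_x, mu_h, Sigma_h, mu_n, Sigma_n, C):
    """First-order VTS adaptation of one cepstral-domain Gaussian."""
    C_inv = np.linalg.inv(C)
    mu = mu_n - mu_x - mu_h                      # expansion point

    # Eq. (10.161) and Eq. (10.160): f(mu) and A = C F C^{-1}
    f = 1.0 / (1.0 + np.exp(C_inv @ mu))
    A = C @ np.diag(f) @ C_inv
    I = np.eye(len(mu_x))

    # Eq. (10.158): g(mu) = C ln(1 + exp(C^{-1} mu))
    g = C @ np.log1p(np.exp(C_inv @ mu))

    # Eq. (10.163): adapted mean
    mu_y = mu_x + mu_h + g
    # Eq. (10.164): adapted covariance, diagonalized afterwards so a decoder
    # built for diagonal covariances can be used unchanged
    Sigma_y = (A @ Sigma_x @ A.T + A @ Sigma_h @ A.T
               + (I - A) @ Sigma_n @ (I - A).T)
    return mu_y, np.diag(np.diag(Sigma_y))
```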
It is difficult to visualize how good the approximation is, given the nonlinearity involved. To provide some insight, let's consider the frequency-domain version of Eqs. (10.157) and (10.158) when no filtering is done:
$$y = x + \ln\!\left(1 + e^{(n - x)}\right) \tag{10.165}$$
where x, n, and y represent the log-spectral energies of the clean signal, noise, and noisy signal, respectively, for a given frequency. In Figure 10.34 we show the mean and standard deviation of the noisy log-spectral energy y in dB as a function of the mean of the clean log-spectral energy x, whose standard deviation is 10 dB. The log-spectral energy of the noise n is Gaussian with mean 0 dB and standard deviation 2 dB. We compare the correct values obtained through Monte Carlo simulation (or DPMC) with the values obtained through the lognormal approximation of Section 10.6.3 and the first-order VTS approximation described here. We see that the VTS approximation is more accurate than the lognormal approximation for the mean, and especially for the standard deviation, of y, assuming the model of the environment described by Eq. (10.165).
Figure 10.34 Means and standard deviations of noisy speech y in dB according to Eq. (10.165). (Left panel: mean of y; right panel: standard deviation of y; both plotted against the mean of x, with curves for Monte Carlo, first-order VTS, and PMC.) The distribution of the noise log-spectrum n is Gaussian with mean 0 dB and standard deviation 2 dB. The distribution of the clean log-spectrum x is Gaussian, having a standard deviation of 10 dB and a mean varying from -25 to 25 dB. Both the mean and the standard deviation of y are more accurate in first-order VTS than in PMC.
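The comparison in Figure 10.34 can be reproduced numerically for a single operating point. The sketch below works in natural-log energy units (a dB conversion would be needed to match the figure's axes exactly) and compares a Monte Carlo estimate with the scalar form of the first-order VTS approximation; all names and the sample count are choices of the sketch.

```python
import numpy as np

def scalar_comparison(mu_x, sigma_x, mu_n=0.0, sigma_n=2.0,
                      n_samples=200000, seed=0):
    """Monte Carlo vs. first-order VTS for the scalar model of Eq. (10.165)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_x, sigma_x, n_samples)
    n = rng.normal(mu_n, sigma_n, n_samples)
    y = x + np.log1p(np.exp(n - x))              # Eq. (10.165)
    mc = (y.mean(), y.std())

    # Scalar first-order VTS: a = 1/(1 + exp(mu_n - mu_x))
    a = 1.0 / (1.0 + np.exp(mu_n - mu_x))
    vts_mean = mu_x + np.log1p(np.exp(mu_n - mu_x))
    vts_std = np.sqrt(a**2 * sigma_x**2 + (1.0 - a)**2 * sigma_n**2)
    return mc, (vts_mean, vts_std)
```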
Figure 10.35 is similar to Figure 10.34, except that the standard deviation of the clean log-energy x is only 5 dB, a more realistic number in speech recognition systems. In this case, both the lognormal approximation and the first-order VTS approximation are good estimates of the mean of y, though the standard deviation estimated through the lognormal approximation in PMC is not as good as that obtained through first-order VTS, again assuming the model of the environment described by Eq. (10.165). The overestimate of the variance in the lognormal approximation might, however, be useful if the model of the environment is not accurate.
Figure 10.35 Means and standard deviations of noisy speech y in dB according to Eq. (10.165). (Left panel: mean of y; right panel: standard deviation of y; curves for Monte Carlo, first-order VTS, and PMC.) The distribution of the noise log-spectrum n is Gaussian with mean 0 dB and standard deviation 2 dB. The distribution of the clean log-spectrum x is Gaussian with a standard deviation of 5 dB and a mean varying from -25 dB to 25 dB. The mean of y is well estimated by both PMC and first-order VTS. The standard deviation of y is more accurate in first-order VTS than in PMC.
To compute the means and covariance matrices of the delta and delta-delta parameters, let's take the derivative of the approximation of y in Eq. (10.162) with respect to time:
$$\frac{\partial y}{\partial t} \approx A \frac{\partial x}{\partial t} \tag{10.166}$$
so that the delta-cepstrum, computed as $\Delta x_t = x_{t+2} - x_{t-2}$, is related to the derivative [28] by
$$\Delta x \approx 4\,\frac{\partial x_t}{\partial t} \tag{10.167}$$
so that
$$\mu_{\Delta y} \approx A\,\mu_{\Delta x} \tag{10.168}$$
and similarly
$$\Sigma_{\Delta y} \approx A\,\Sigma_{\Delta x}\,A^T + (I - A)\,\Sigma_{\Delta n}\,(I - A)^T \tag{10.169}$$
where we assumed that h is constant within an utterance, so that $\Delta h = 0$.
Similarly, for the delta-delta cepstrum, the mean is given by
$$\mu_{\Delta^2 y} \approx A\,\mu_{\Delta^2 x} \tag{10.170}$$
and the covariance matrix by
$$\Sigma_{\Delta^2 y} \approx A\,\Sigma_{\Delta^2 x}\,A^T + (I - A)\,\Sigma_{\Delta^2 n}\,(I - A)^T \tag{10.171}$$
where we again assumed that h is constant within an utterance, so that $\Delta^2 h = 0$.
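Since the dynamic parameters reuse the same matrix A as the static ones, Eqs. (10.168)-(10.171) reduce to a few matrix products. A brief sketch, continuing the hypothetical vts_adapt example above and assuming h is constant within the utterance:

```python
import numpy as np

def vts_adapt_dynamic(A, mu_dx, Sigma_dx, Sigma_dn,
                      mu_ddx, Sigma_ddx, Sigma_ddn):
    """Adapt delta and delta-delta parameters with the same matrix A."""
    I = np.eye(A.shape[0])
    mu_dy = A @ mu_dx                                                  # Eq. (10.168)
    Sigma_dy = A @ Sigma_dx @ A.T + (I - A) @ Sigma_dn @ (I - A).T     # Eq. (10.169)
    mu_ddy = A @ mu_ddx                                                # Eq. (10.170)
    Sigma_ddy = A @ Sigma_ddx @ A.T + (I - A) @ Sigma_ddn @ (I - A).T  # Eq. (10.171)
    return mu_dy, Sigma_dy, mu_ddy, Sigma_ddy
```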
Equations (10.163), (10.168), and (10.170) resemble the MLLR adaptation formulae
of Chapter 9 for the means, though in this case the matrix is different for each Gaussian and
is heavily constrained.
We are interested in estimating the environmental parameters $\mu_n$, $\mu_h$, and $\Sigma_n$, given a set of T observation frames $y_t$. This estimation can be done iteratively using the EM algorithm on Eq. (10.162). If the noise process is stationary, $\Sigma_{\Delta n}$ could be approximated, assuming independence between $n_{t+2}$ and $n_{t-2}$, by $\Sigma_{\Delta n} = 2\Sigma_n$. Similarly, $\Sigma_{\Delta^2 n}$ could be approximated, assuming independence between $\Delta n_{t+1}$ and $\Delta n_{t-1}$, by $\Sigma_{\Delta^2 n} = 4\Sigma_n$. If the noise process is not stationary, it is best to estimate $\Sigma_{\Delta n}$ and $\Sigma_{\Delta^2 n}$ directly from the input data.
If the distribution of x is a mixture of N Gaussians, each Gaussian is transformed according to the equations above. If the distribution of n is also a mixture of M Gaussians, the
composite distribution has NM Gaussians. While this increases the number of Gaussians, the
decoder is still functionally the same as for clean speech. Because normally you do not want
to alter the number of Gaussians of the system when you do noise adaptation, it is often assumed that n is a single Gaussian.
10.6.5. Retraining on Compensated Features
We have discussed adapting the HMM to the new acoustical environment using the standard
front-end features, in most cases the mel-cepstrum. Section 10.5 dealt with cleaning the
noisy feature without retraining the HMMs. It’s logical to consider a combination of both,
where the features are cleaned to remove noise and channel effects and then the HMMs are
retrained to take into account that this processing stage is not perfect. This idea is illustrated
in Figure 10.36, where we compare the word error rate of standard matched-noise-condition training with that of matched-noise-condition training after it has been compensated
by a variant of the mixture Gaussian algorithms described in Section 10.5.6 [21].
The low error rates of both curves in Figure 10.36 are hard to obtain in practice, because they assume we know exactly what the noise level and type are ahead of time, which
in general is hard to do. On the other hand, this could be combined with the multistyle training discussed in Section 10.6.1 or with a set of clustered models discussed in Chapter 9.
Figure 10.36 Word error rates of matched-noise training without feature preprocessing ("unprocessed matched condition") and with the SPLICE algorithm [21] ("SPLICE-processed matched condition"), as a function of the SNR in dB for additive white noise. The error rate with the mixture Gaussian model is up to 30% lower than that of standard noisy matched conditions at low SNRs, while it is about the same at high SNRs.
10.7. MODELING NONSTATIONARY NOISE
The previous sections deal mostly with stationary noise. In practice, there are many nonstationary noises that often match a random word in the system’s lexicon better than the silence
model. In this case, the benefit of using speech recognition vanishes quickly.
The most typical types of noise present in desktop applications are mouth noise (lip
smacks, throat clearings, coughs, nasal clearings, heavy breathing, uhms and uhs, etc.), computer noise (keyboard typing, microphone adjustment, computer fan, disk head seeking, etc.), and office noise (phone rings, paper rustles, doors shutting, interfering speakers, etc.).
We can use a simple method that has been successful in speech recognition [57] as shown in
Algorithm 10.4.
In practice, updating the transcription turns out to be important, because human labelers often miss short noises that the system can uncover. Since the noise training data are often limited in terms of coverage, some noises can easily be matched to short word models, such as if or two. Due to the unique characteristics of noise rejection, we often need to further augment the confidence measures described in Chapter 9. In practice, we need an additional classifier to provide more detailed discrimination between speech and noise. We can use a two-level classifier for this purpose: the ratio between the all-speech model score (fully connected context-independent phone models) and the all-noise model score (fully connected silence and noise phone models) can be used, as sketched below.
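A minimal sketch of such a two-level rejection score follows. The two scoring functions are hypothetical placeholders for decoders run with the all-speech and all-noise model sets, and the threshold is an assumption that would be tuned on held-out data.

```python
def speech_noise_ratio(frames, score_all_speech, score_all_noise):
    """Log-likelihood ratio between an all-speech model (fully connected
    context-independent phones) and an all-noise model (fully connected
    silence and noise phones). Both score functions are placeholders."""
    return score_all_speech(frames) - score_all_noise(frames)

def accept_as_speech(frames, score_all_speech, score_all_noise, threshold=0.0):
    # Reject the segment as noise when the ratio falls below the threshold.
    return speech_noise_ratio(frames, score_all_speech, score_all_noise) > threshold
```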
Another approach [55] consists of having an HMM for noise with several states to deal with nonstationary noises. The decoder then needs to perform a three-dimensional Viterbi search that evaluates, at each frame, every possible speech state as well as every possible noise state to achieve the speech/noise decomposition (see Figure 10.37). The computational complexity of such an approach is very large, though in theory it can handle nonstationary noises quite well.
ALGORITHM 10.4 EXPLICIT NOISE MODELING
Step 1: Augmenting the vocabulary with noise words (such as ++SMACK++), each composed
of a single noise phoneme (such as +SMACK+), which are thus modeled with a single HMM.
These noise words have to be labeled in the transcriptions so that they can be trained.
Step 2: Training noise models, as well as the other models, using the standard HMM training
procedure.
Step 3: Updating the transcription. To do that, convert the transcription into a network, where the noise words can be optionally inserted between each word in the original transcription (see the sketch after this algorithm). A forced-alignment segmentation is then conducted with the current HMMs, allowing the optional noise words to be inserted. The segmentation with the highest likelihood is selected, thus yielding an optimal transcription.
Step 4: If converged, stop; otherwise go to Step 2.
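As an illustration of Step 3, the sketch below builds a simple network in which any noise word may optionally appear before, between, or after the words of the original transcription. The noise-word names and the data structure are assumptions of the sketch; a real system would pass such a network to its forced-alignment tool.

```python
def transcription_network(words, noise_words=("++SMACK++", "++NOISE++")):
    """Return a list of slots for forced alignment: each slot is a tuple of
    allowed tokens, where the empty string means the optional noise is skipped."""
    optional_noise = ("",) + tuple(noise_words)
    network = [optional_noise]           # optional noise before the first word
    for w in words:
        network.append((w,))             # the word itself is mandatory
        network.append(optional_noise)   # optional noise after each word
    return network

# Example: transcription_network(["open", "the", "door"])
```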
Figure 10.37 Speech/noise decomposition and a three-dimensional Viterbi decoder. (The diagram shows a noise HMM and a speech HMM jointly accounting for the sequence of observations.)
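The decomposition decoder can be sketched as a Viterbi search over the joint (speech state, noise state) space. How a speech state and a noise state combine to generate an observation is left to a caller-supplied function, since that combination model is not specified here; the uniform initial probabilities and all names are likewise assumptions of the sketch.

```python
import numpy as np

def decompose_viterbi(obs, logA_s, logA_n, joint_loglik):
    """Speech/noise decomposition: Viterbi over the joint (speech, noise) state space.
    obs: sequence of T observations; logA_s, logA_n: log transition matrices;
    joint_loglik(o, i, j): log-likelihood of observation o given speech state i
    and noise state j (caller-supplied combination model)."""
    S, N, T = logA_s.shape[0], logA_n.shape[0], len(obs)
    delta = np.full((T, S, N), -np.inf)
    back = np.zeros((T, S, N, 2), dtype=int)
    for i in range(S):
        for j in range(N):
            delta[0, i, j] = joint_loglik(obs[0], i, j)
    for t in range(1, T):
        for i in range(S):
            for j in range(N):
                # Best predecessor pair (i', j') under both transition models.
                scores = delta[t - 1] + logA_s[:, i][:, None] + logA_n[:, j][None, :]
                ip, jp = np.unravel_index(np.argmax(scores), scores.shape)
                delta[t, i, j] = scores[ip, jp] + joint_loglik(obs[t], i, j)
                back[t, i, j] = (ip, jp)
    # Backtrace from the best final joint state.
    i, j = np.unravel_index(np.argmax(delta[-1]), delta[-1].shape)
    path = [(int(i), int(j))]
    for t in range(T - 1, 0, -1):
        i, j = back[t, i, j]
        path.append((int(i), int(j)))
    path.reverse()
    return path
```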
10.8. HISTORICAL PERSPECTIVE AND FURTHER READING
This chapter contains a number of diverse topics that are often described in different fields; no single reference covers them all. For further reading on adaptive filtering, you can check the books by Widrow and Stearns [59] and Haykin [30]. Theodoridis and Bellanger [54] provide a good summary of adaptive filtering, and Breining et al. [16] a good summary of echo-canceling techniques. Lee [38] has a good summary of independent component analysis for blind source separation. Deller et al. [20] provide a number of techniques for speech enhancement. Juang [35] and Junqua [37] survey techniques used in improving the robustness of speech recognition systems to noise. Acero [2] compares a number of feature transformation techniques in the cepstral domain and introduces the model of the environment.
Adaptive filtering theory emerged early in the 1900s. The Wiener and LMS filters
were derived by Wiener and Widrow in 1919 and 1960, respectively. Norbert Wiener joined
the MIT faculty in 1919 and made profound contributions to generalized harmonic analysis,
the famous Wiener-Hopf equation, and the resulting Wiener filter. The LMS algorithm was
developed by B. Widrow and his colleagues at Stanford University in the early 1960s.
From a practical point of view, the use of gradient microphones (Olsen [46]) has
proven to be one of the more important contributions to increased robustness. Directional
microphones are commonplace today in most speech recognition systems.
Boll [13] first suggested the use of spectral subtraction. This has been the cornerstone
for noise suppression, and many systems nowadays still use a variant of Boll’s original algorithm.
The cepstral mean normalization algorithm was proposed by Atal [8] in 1974, although it wasn't until the early 1990s that it became commonplace in most speech recognition systems evaluated in the DARPA speech programs [33]. Hermansky proposed PLP [31]
in 1990. The work of Rich Stern’s robustness group at CMU (especially the Ph.D. thesis
work of Acero [1] and Moreno [43]) and the Ph.D. thesis of Gales [26] also represented advances in the understanding of the effect of noise in the cepstrum.
Bell and Sejnowski [10] gave the field of independent component analysis a boost in
1995 with their infomax rule. The field of source separation is a promising alternative to
improve the robustness of speech recognition systems when more than one microphone is
available.
REFERENCES
[1] Acero, A., Acoustical and Environmental Robustness in Automatic Speech Recognition, PhD Thesis, Electrical and Computer Engineering, 1990, Carnegie Mellon University, Pittsburgh.
[2] Acero, A., Acoustical and Environmental Robustness in Automatic Speech Recognition, 1993, Boston, Kluwer Academic Publishers.
[3] Acero, A., S. Altschuler, and L. Wu, "Speech/Noise Separation Using Two Microphones and a VQ Model of Speech Signals," Int. Conf. on Spoken Language Processing, 2000, Beijing, China.
[4] Acero, A., et al., "HMM Adaptation Using Vector Taylor Series for Noisy Speech Recognition," Int. Conf. on Spoken Language Processing, 2000, Beijing, China.
[5] Acero, A. and X.D. Huang, "Augmented Cepstral Normalization for Robust Speech Recognition," Proc. of the IEEE Workshop on Automatic Speech Recognition, 1995, Snowbird, UT.
[6] Acero, A. and R. Stern, "Environmental Robustness in Automatic Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1990, Albuquerque, NM, pp. 849-852.
[7] Amari, S., A. Cichocki, and H.H. Yang, "A New Learning Algorithm for Blind Separation," in Advances in Neural Information Processing Systems, 1996, Cambridge, MA, MIT Press.
[8] Atal, B.S., "Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification," Journal of the Acoustical Society of America, 1974, 55(6), pp. 1304-1312.
[9] Attias, H., "Independent Factor Analysis," Neural Computation, 1998, 11, pp. 803-851.
[10] Bell, A.J. and T.J. Sejnowski, "An Information Maximisation Approach to Blind Separation and Blind Deconvolution," Neural Computation, 1995, 7(6), pp. 1129-1159.
[11] Belouchrani, A., et al., "A Blind Source Separation Technique Using Second Order Statistics," IEEE Trans. on Signal Processing, 1997, 45(2), pp. 434-444.
[12] Berouti, M., R. Schwartz, and J. Makhoul, "Enhancement of Speech Corrupted by Acoustic Noise," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1979, pp. 208-211.
[13] Boll, S.F., "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. on Acoustics, Speech and Signal Processing, 1979, 27(Apr.), pp. 113-120.
[14] Boll, S.F. and D.C. Pulsipher, "Suppression of Acoustic Noise in Speech Using Two Microphone Adaptive Noise Cancellation," IEEE Trans. on Acoustics, Speech and Signal Processing, 1980, 28(December), pp. 751-753.
[15] Bregman, A.S., Auditory Scene Analysis, 1990, Cambridge, MA, MIT Press.
[16] Breining, C., "Acoustic Echo Control," IEEE Signal Processing Magazine, 1999, pp. 42-69.
[17] Cardoso, J., "Blind Signal Separation: Statistical Principles," Proc. of the IEEE, 1998, 9(10), pp. 2009-2025.
[18] Cardoso, J.F., "Infomax and Maximum Likelihood for Blind Source Separation," IEEE Signal Processing Letters, 1997, 4, pp. 112-114.
[19] Comon, P., "Independent Component Analysis: A New Concept," Signal Processing, 1994, 36, pp. 287-314.
[20] Deller, J.R., J.H.L. Hansen, and J.G. Proakis, Discrete-Time Processing of Speech Signals, 2000, IEEE Press.
[21] Deng, L., et al., "Large-Vocabulary Speech Recognition Under Adverse Acoustic Environments," Int. Conf. on Spoken Language Processing, 2000, Beijing, China.
[22] Ephraim, Y., "Statistical Model-Based Speech Enhancement System," Proc. of the IEEE, 1992, 80(1), pp. 1526-1555.
[23] Ephraim, Y. and D. Malah, "Speech Enhancement Using Minimum Mean Square Error Short Time Spectral Amplitude Estimator," IEEE Trans. on Acoustics, Speech and Signal Processing, 1984, 32(6), pp. 1109-1121.
[24] Flanagan, J.L., et al., "Computer-Steered Microphone Arrays for Sound Transduction in Large Rooms," Journal of the Acoustical Society of America, 1985, 78(5), pp. 1508-1518.
[25] Frost, O.L., "An Algorithm for Linearly Constrained Adaptive Array Processing," Proc. of the IEEE, 1972, 60(8), pp. 926-935.
[26] Gales, M.J., Model Based Techniques for Noise Robust Speech Recognition, PhD Thesis, Engineering Department, 1995, Cambridge University.
[27] Ghitza, O., "Robustness against Noise: The Role of Timing-Synchrony Measurement," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1987, pp. 2372-2375.
[28] Gopinath, R.A., et al., "Robust Speech Recognition in Noise - Performance of the IBM Continuous Speech Recognizer on the ARPA Noise Spoke Task," Proc. ARPA Workshop on Spoken Language Systems Technology, 1995, pp. 127-133.
[29] Griffiths, L.J. and C.W. Jim, "An Alternative Approach to Linearly Constrained Adaptive Beamforming," IEEE Trans. on Antennas and Propagation, 1982, 30(1), pp. 27-34.
[30] Haykin, S., Adaptive Filter Theory, 2nd ed., 1996, Upper Saddle River, NJ, Prentice-Hall.
[31] Hermansky, H., "Perceptual Linear Predictive (PLP) Analysis of Speech," Journal of the Acoustical Society of America, 1990, 87(4), pp. 1738-1752.
[32] Hermansky, H. and N. Morgan, "RASTA Processing of Speech," IEEE Trans. on Speech and Audio Processing, 1994, 2(4), pp. 578-589.
[33] Huang, X.D., et al., "The SPHINX-II Speech Recognition System: An Overview," Computer Speech and Language, 1993, pp. 137-148.
[34] Hunt, M. and C. Lefebre, "A Comparison of Several Acoustic Representations for Speech Recognition with Degraded and Undegraded Speech," Int. Conf. on Acoustics, Speech and Signal Processing, 1989, pp. 262-265.
[35] Juang, B.H., "Speech Recognition in Adverse Environments," Computer Speech and Language, 1991, 5, pp. 275-294.
[36] Junqua, J.C., "The Lombard Reflex and Its Role in Human Listeners and Automatic Speech Recognition," Journal of the Acoustical Society of America, 1993, 93(1), pp. 510-524.
[37] Junqua, J.C. and J.P. Haton, Robustness in Automatic Speech Recognition, 1996, Kluwer Academic Publishers.
[38] Lee, T.W., Independent Component Analysis: Theory and Applications, 1998, Kluwer Academic Publishers.
[39] Lippmann, R.P., E.A. Martin, and D.P. Paul, "Multi-Style Training for Robust Isolated-Word Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1987, Dallas, TX, pp. 709-712.
[40] Lombard, E., "Le Signe de l'élévation de la Voix," Ann. Maladies Oreille, Larynx, Nez, Pharynx, 1911, 37, pp. 101-119.
[41] Matassoni, M., M. Omologo, and D. Giuliani, "Hands-Free Speech Recognition Using a Filtered Clean Corpus and Incremental HMM Adaptation," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, 2000, Istanbul, Turkey, pp. 1407-1410.
[42] Mendel, J.M., Lessons in Estimation Theory for Signal Processing, Communications, and Control, 1995, Upper Saddle River, NJ, Prentice Hall.
[43] Moreno, P., Speech Recognition in Noisy Environments, PhD Thesis, Electrical and Computer Engineering, 1996, Carnegie Mellon University, Pittsburgh.
[44] Moreno, P.J., B. Raj, and R.M. Stern, "A Vector Taylor Series Approach for Environment Independent Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1996, Atlanta, pp. 733-736.
[45] Morgan, N. and H. Bourlard, "Continuous Speech Recognition: An Introduction to Hybrid HMM/Connectionist Approach," IEEE Signal Processing Magazine, 1995, pp. 25-42.
[46] Olsen, H.F., "Gradient Microphones," Journal of the Acoustical Society of America, 1946, 17(3), pp. 192-198.
[47] Porter, J.E. and S.F. Boll, "Optimal Estimators for Spectral Restoration of Noisy Speech," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1984, San Diego, CA, pp. 18.A.2.1-4.
[48] Rahim, M.G. and B.H. Juang, "Signal Bias Removal by Maximum Likelihood Estimation for Robust Telephone Speech Recognition," IEEE Trans. on Speech and Audio Processing, 1996, 4(1), pp. 19-30.
[49] Seneff, S., "A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing," Journal of Phonetics, 1988, 16(1), pp. 55-76.
[50] Sharma, S., et al., "Feature Extraction Using Non-Linear Transformation for Robust Speech Recognition on the Aurora Database," Int. Conf. on Acoustics, Speech and Signal Processing, 2000, Istanbul, Turkey, pp. 1117-1120.
[51] Sullivan, T.M. and R.M. Stern, "Multi-Microphone Correlation-Based Processing for Robust Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1993, Minneapolis, pp. 2091-2094.
[52] Suzuki, Y., et al., "An Optimum Computer-Generated Pulse Signal Suitable for the Measurement of Very Long Impulse Responses," Journal of the Acoustical Society of America, 1995, 97(2), pp. 1119-1123.
[53] Tamura, S. and A. Waibel, "Noise Reduction Using Connectionist Models," Int. Conf. on Acoustics, Speech and Signal Processing, 1988, New York, pp. 553-556.
[54] Theodoridis, S. and M.G. Bellanger, "Adaptive Filters and Acoustic Echo Control," IEEE Signal Processing Magazine, 1999, pp. 12-41.
[55] Varga, A.P. and R.K. Moore, "Hidden Markov Model Decomposition of Speech and Noise," Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1990, pp. 845-848.
[56] Wan, E.A., R.V.D. Merwe, and A.T. Nelson, "Dual Estimation and the Unscented Transformation," in Advances in Neural Information Processing Systems, S.A. Solla, T.K. Leen, and K.R. Muller, eds., 2000, Cambridge, MA, pp. 666-672, MIT Press.
[57] Ward, W., "Modeling Non-Verbal Sounds for Speech Recognition," Proc. Speech and Natural Language Workshop, 1989, Cape Cod, MA, Morgan Kaufmann, pp. 311-318.
[58] Widrow, B. and M.E. Hoff, "Adaptive Switching Algorithms," IRE Wescon Convention Record, 1960, pp. 96-104.
[59] Widrow, B. and S.D. Stearns, Adaptive Signal Processing, 1985, Upper Saddle River, NJ, Prentice Hall.
[60] Woodland, P.C., "Improving Environmental Robustness in Large Vocabulary Speech Recognition," Int. Conf. on Acoustics, Speech and Signal Processing, 1996, Atlanta, Georgia, pp. 65-68.