BLIND SOURCE SEPARATION IN REAL TIME
USING SECOND ORDER STATISTICS
Silva Ruwan Lakmal
Bo Zhu
This thesis is presented as part of Degree of
Master of Science in Electrical Engineering
Blekinge Institute of Technology
September 2007
Blekinge Institute of Technology
School of Engineering
Department of Applied Signal Processing
Supervisor: Dr. Nedelko Grbić
Examiner: Dr. Nedelko Grbić
Abstract
A multi-stage approach to speech enhancement improves the quality of separation over standard techniques such as spectral subtraction and beamforming. Two algorithms are implemented for convolutive mixtures, covering two important stages of a speech enhancement system: the source separation stage and the background denoising stage. For source separation, a blind source separation method based on second order statistics has been adopted, whereas for background denoising a method based on minimum statistics of the subband power has been used. An efficient real time algorithm for convolutive blind source separation of broadband signals has been realized by exploiting the second order statistics and the non-stationarity of the signals. The real time system is capable of separating sources in a two microphone setup.
Keywords: Blind source separation, convolutive mixtures, second order statistics, adaptive decorrelation
Acknowledgment
Firstly, we would like to express our sincere gratitude to our research supervisor, Dr. Nedelko Grbić, for providing us with an interesting thesis topic to work with and for accepting us for supervision. We thank you for guiding us throughout this thesis work, enlightening us with advice and knowledge, and encouraging us all the time with your innovative work.
We would like to thank BTH for giving us the opportunity to study in a well established scientific environment and for providing the background knowledge needed to work on the thesis.
Last, but not least, we thank our parents for encouraging us towards higher studies and assisting us by all means during our study period in Sweden.
Thank you all!
Lakmal
Bo
Contents
1. Introduction
   1.1 Scope
   1.2 Organization of the Thesis
2. Background and Related work
   2.1 Background
      2.1.1 Instantaneous mixing model
      2.1.2 Convolutive mixing model
   2.2 Solution approaches for convolutive BSS
      2.2.1 Extending ICA in time domain
      2.2.2 Frequency domain BSS
      2.2.3 Time-Frequency domain BSS
   2.3 Related work on second order statistics
3. Foundations
   3.1 Second order statistics
   3.2 Short Time Fourier Transform (STFT)
   3.3 Steepest Descent Algorithm
   3.4 Convolutive BSS
   3.5 Cross-correlation, circular and linear convolution
4. Offline BSS
   4.1 Problem formulation
   4.2 Derivation of the Offline Algorithm
      4.2.1 Cross-correlation
      4.2.2 Backward model
      4.2.3 Power normalization
   4.3 Offline Algorithm
5. Online BSS
   5.1 Introduction
   5.2 Derivation in time domain
   5.3 Frequency domain conversion
   5.4 Power normalization
6. Spectral Subtraction based on minimum statistics
   6.1 Components
   6.2 Algorithm
   6.3 Subtraction rule
7. Real Time Implementation
   7.1 Requirements
   7.2 Matlab implementation
   7.3 Experiment setup and algorithm
   7.4 Simulation results
8. Evaluation and Results
   8.1 Introduction
   8.2 BSS evaluation method
   8.3 Experiments and Results
   8.4 Evaluation of the spectral subtraction method
9. Conclusion
References
List of Figures
2.1 Block diagram for an instantaneous mixing model
2.2 Multi-path problem in convolutive mixing
4.1 BSS system
4.2 Pictorial view of the calculation of Rx
4.3 Flow chart for the offline algorithm
5.1 Flow chart for the online algorithm
6.1 Block diagram of the spectral calculation process
6.2 Structure of the noise subband power estimation algorithm
7.1 Experiment equipment
7.2 Software routines
8.1 Time signals of convolutive input signals and output signals
8.2 Time domain signals of the input noisy signal and the output signal after denoising
8.3 Frequency domain plots of the input noisy signal and the output signal
Chapter 1
Introduction
In today's technological society, human computer interactions are ever increasing. In many new systems, voice recognition platforms are implemented to give users more convenient ways of operating equipment and systems. For instance, dictating into your favorite word processing application is more convenient and efficient than typing. Vehicle manufacturers have recently considered implementing systems operated by voice commands, such as activating dashboard buttons. At the same time, the increased use of personal communication devices, personal computers and wireless cellular telephones has given rise to new innovative services based on voice recognition. Examples include checking bank balances, finding routes in big cities and checking transportation timetable information using cellular phones. However, the performance of such systems degrades substantially in real life due to surrounding noise and other acoustic signals, such as music, vehicular noise and the speech of other people, being picked up by the microphones of mobile phones. Hence, it is crucial for the success of these innovative systems that unwanted signals are separated out, and that only the required user's speech signal is fed into the voice recognition system.
In this thesis, we contribute to solving this problem by verifying a BSS method based on second order statistics and by making suggestions for improving the technique in real time systems.
1.1 Scope
In mobile environments, speech enhancement can be conducted using two main building blocks. First, a blind source separation stage is adopted, followed by a background denoising process to remove further noise contained in the separated signal. Several methods exist to perform these two operations, such as conventional beamforming and single channel denoising techniques. The two methods we considered are BSS based on second order statistics and background denoising based on minimum statistics. These methods have been the basis for several successful implementations that exist today and are considered benchmarks in their respective categories.
The work of this thesis involves the general case of recovering convolutive mixtures of wideband signals with at least as many sensors as sources. More emphasis is given to the practical type of signals, which are quite often non-stationary, and this non-stationarity is used explicitly in the development of the methods. An efficient solution to the permutation problem of the frequency domain algorithm is introduced, a problem which is dealt with in more depth in other proposals.
The two main processing blocks are evaluated: the BSS process and the background denoising process. However, more emphasis is given to the BSS process since, carefully implemented, the background denoising can also be incorporated within the BSS process.
Experiments are carried out separately for the BSS and background denoising methods using a two microphone setup.
1.2 Organization of the Thesis
The remainder of the report is structured as follows. Chapter 2 provides information about the problem formulation. It further discusses the solution approaches available today and related research carried out in the field.
The basic foundations of acoustic signal processing, and particularly the theory behind second order statistics, are discussed in Chapter 3. It gives further insight into the Multiple Adaptive Decorrelation (MAD) algorithm. The derivation of the offline version of the MAD algorithm is introduced in Chapter 4, based on the so called "backward model". A detailed flow chart for the implementation of this algorithm is provided.
In Chapter 5, the online version of the MAD algorithm is introduced and the direct time domain derivation is presented. Practical implementation details, such as techniques for power normalization and the block processing approach, are discussed further.
Chapter 6 introduces the background denoising method based on minimum statistics, namely Martin's algorithm. This method is discussed only briefly, since background denoising can even be implemented within the BSS process.
The real time implementation and experiments are discussed in Chapter 7. This chapter also includes the improvements that were made to make the algorithm work in real time.
Chapter 8 provides an evaluation of the performance of the algorithms. A recently proposed standard method is used to evaluate the quality of separation and the distortion measure of the BSS algorithms. A qualitative measure is used to evaluate the background denoising method.
Finally, Chapter 9 concludes the work of this thesis.
Chapter 2
Background and Related work
It is the aim of this chapter to provide an overview of the origin and solution approaches to the source
separation problem in general.
2.1 Background
A number of single or multiple microphone based signal processing algorithms have become popular for performing speech enhancement in real world applications. They often combine a probabilistic framework and statistical models of the desired speech signal with spatial information about the signal mixtures, using an array of microphones with a known geometry to suppress interfering signals; this latter approach is also known as beamforming [12].
In recent years, much attention has turned towards blind source separation. "Blind" means that both the sources and the mixing process are unknown, and only the recordings of the mixture are available. The art of separating source signals by observing only the mixed signals is known as Blind Source Separation, and it can be considered an alternative to beamforming. Source separation aims to recover a set of source signals from a set of mixed signals. The signal mixing process can be modeled in two different ways, i.e. the instantaneous mixing model and the convolutive mixing model, which are described below.
2.1.1. Instantaneous mixing model
When a signal mixture is represented as a linear combination of the original sources at every time instant, it is defined as an instantaneous mixing model. Suppose that the observed mixture at the sensors is x(n) = [x_1(n), \ldots, x_{d_x}(n)]^T of the source signals s(n) = [s_1(n), \ldots, s_{d_s}(n)]^T, where n is a discrete time index. Then an instantaneous mixture can be represented mathematically as follows:

x(n) = A s(n) + v(n)    (1)

where A is a d_x \times d_s mixing matrix and v(n) is the additive noise. A pictorial view of the instantaneous model is shown in Figure 2.1.

Figure 2.1: Block diagram for an instantaneous mixing model (the separation algorithm estimates \hat{s}(n) from the observed mixtures)

This is the simplest form of modeling a mixture of signals. Practically, when the signals are recorded in an ideal environment, i.e. one without reverberation, the mixture recorded by the sensors can be considered an instantaneous mixture. Other applications, such as signals found in biomedical contexts and images, are examples of close to instantaneous mixture problems.
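As a concrete illustration of Eq. (1), the following sketch mixes two toy sources with a hypothetical 2x2 mixing matrix; the matrix entries and noise level are arbitrary choices, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative source signals s(n), n = 0..999 (white-noise stand-ins).
s = rng.standard_normal((2, 1000))

# Hypothetical mixing matrix A and small additive noise v(n).
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
v = 0.01 * rng.standard_normal((2, 1000))

# Instantaneous mixture: x(n) = A s(n) + v(n) at every time instant n.
x = A @ s + v
```

Each sample of x is a fixed linear combination of the same-time samples of s, which is exactly what distinguishes this model from the convolutive one introduced next.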
2.1.2. Convolutive mixing model
Although instantaneous mixing models are extensively studied, and many algorithms with promising separation results have been proposed for them, the instantaneous model fails to describe real life situations, for instance a recording made in a real room. In such a situation the microphones pick up not only the source signals, but also delayed and attenuated versions of the same signals due to reverberation. Hence, each microphone can be viewed as receiving a filtered version of the different signals, which can be modeled as follows:

x(n) = \sum_{p=0}^{P-1} A(p) s(n-p) + v(n)    (2)

where A(p) is the multipath multi-channel filter. The source mixtures are called convolved mixtures, since acoustic signals recorded simultaneously in a reverberant environment can be described as sums of differently convolved sources. The following figure depicts an example of a convolutive mixing system.
Figure 2.2: Multi-path problem in convolutive mixing (the separation algorithm estimates \hat{s}(n) from the convolved mixtures)
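As an illustration, the convolutive model of Eq. (2) can be simulated with a short hypothetical channel; the filter taps below are random placeholders, not a measured room response.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, P = 1000, 8                       # P: assumed channel length

s = rng.standard_normal((2, n_samples))      # two toy sources
A = 0.3 * rng.standard_normal((P, 2, 2))     # A(p): one 2x2 matrix per lag p
A[0] = np.eye(2)                             # direct path at lag 0

# x(n) = sum_{p=0}^{P-1} A(p) s(n-p): each sensor sees filtered sums of sources.
x = np.zeros((2, n_samples))
for p in range(P):
    x[:, p:] += A[p] @ s[:, :n_samples - p]
```

Each entry of A acts as an FIR filter between one source and one sensor, so every microphone observes a superposition of delayed, attenuated source copies.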
There are many instances where it is necessary to separate signals from a mixture to obtain the original sources by observing only the mixture, i.e. without prior knowledge of the original sources. This is a typical requirement especially in acoustic signal processing and speech enhancement applications. BSS is one way of solving this kind of problem, and significant research has been conducted on it recently. One such problem is described by the so called "cocktail party" problem.
Suppose you are at a cocktail party. It is not a big problem for human beings to concentrate on listening
to one voice in the room, even when there are lots of other sound sources, such as other conversations,
music, background noise, etc. present in the room.
The "cocktail party" problem can be described as the ability to focus one's listening attention on a
single talker among a cacophony of conversation and background noise; also known as "cocktail party
effect". For a long time, it has been recognized as an interesting and a challenging problem. It is not
known exactly how humans are able to separate the different sound sources [11].
From a technical perspective, the "cocktail party problem" is called the "multichannel blind deconvolution" problem. It aims at separating a set of mixtures of convolved signals detected by an array of microphones. This is performed extremely well by the human brain, and over the years attempts have been made to capture this functionality using assemblies of abstracted neurons or adaptive processing units.
In recent research, blind source separation has become the common approach for dealing with the "cocktail party" problem. Other interesting application areas of BSS include speech enhancement with multiple microphones, crosstalk removal in multi-channel communication, Direction of Arrival (DOA) estimation in sensor arrays, and improvements over beamforming microphones for audio and passive sonar.
2.2 Solution approaches for convolutive BSS
Many natural signals are convolutive mixtures, so methods to separate source signals from convolutive mixtures are vital. The solution to the problem of instantaneous mixtures is quite well known and can easily be addressed by applying an ICA algorithm. Convolutive mixtures in the time domain can be modeled as multiple instantaneous mixtures in the frequency domain, so the BSS problem for convolutive mixtures can be solved by simply employing ICA on each frequency bin. However, the main difficulty of this simple approach is the so called "permutation problem". The following section briefly describes ways [10] of extending ICA to handle convolutive mixtures, as well as the merits and demerits associated with each technique.
2.2.1 Extending ICA in time domain
This is the simplest way to separate signals from a convolutive mixture, although it has several drawbacks. In the time domain, it is possible to extend ICA directly to convolutive mixtures, and one would obtain good separation provided that the algorithm converges. However, extending the ICA algorithm to a convolutive mixture is more complex than for an instantaneous mixture, and it is computationally expensive when dealing with long FIR filters, due to the many convolution operations.
2.2.2 Frequency domain BSS
In the frequency domain, complex valued ICA for instantaneous mixtures is employed for each frequency bin. The advantage of this approach is that the ICA algorithm remains simple and is performed separately at each frequency bin. Compared to convolutions in time domain techniques, frequency domain multiplications are computationally more efficient; consequently, the frequency domain BSS approach is more efficient than its time domain counterpart. Further improvements can be achieved by adopting fast algorithms such as FastICA [13][14], and/or by parallel computation of multiple frequency bins. However, a serious problem of frequency domain techniques is the "permutation ambiguity": the permutations in each frequency bin must be aligned so that the separated signal in the time domain contains frequency components from the same source. This is the so called "permutation problem" of frequency domain BSS techniques. The other issue in such implementations is the circularity effect of the discrete frequency representation. These major problems are solved in the method proposed by Parra [1], which is the basis of this thesis and is described in detail in the chapters that follow.
2.2.3 Time-Frequency domain BSS
In some time domain BSS methods, convolution in the time domain is sped up by the overlap-save / overlap-add methods in the frequency domain, whereas other methods update the filter coefficients in the frequency domain while the nonlinear functions for evaluating independence are applied in the time domain. The permutation problem is avoided because the independence of the separated signals is evaluated in the time domain. By choosing an appropriate window function, the circularity problem can be minimized.
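The frequency domain speed-up mentioned above rests on the convolution theorem; the following sketch performs linear convolution via zero-padded FFTs (the block-wise overlap-save and overlap-add methods apply the same identity frame by frame).

```python
import numpy as np

def fft_linear_convolution(h, x):
    """Linear convolution computed via zero-padded FFTs.

    Zero-padding to at least len(h) + len(x) - 1 turns the circular
    convolution implied by the DFT into a linear one.
    """
    n = len(h) + len(x) - 1
    N = 1 << (n - 1).bit_length()        # next power of two >= n
    y = np.fft.irfft(np.fft.rfft(h, N) * np.fft.rfft(x, N), N)
    return y[:n]

h = np.array([1.0, 0.5, 0.25])           # a toy FIR filter
x = np.random.default_rng(2).standard_normal(64)
y = fft_linear_convolution(h, x)
```

The result matches direct time-domain convolution, but the FFT route costs O(N log N) instead of O(N^2), which is why long mixing and demixing filters are handled in the frequency domain.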
2.3 Related work on Second Order statistics
The types of signals that are dealt with in this thesis are acoustic speech signals, which can be considered as broadband signals. There are mainly three [2] general types of algorithms known in the field of blind source separation for convolutive mixtures of broadband signals:
• Algorithms that diagonalize a single estimate of the second order statistics.
• Algorithms that simultaneously diagonalize second order statistics at multiple times, exploiting the non-stationarity of the signals.
• Algorithms that identify statistically independent signals by considering higher order statistics.
In the first type of algorithm, decorrelated signals are generated by diagonalizing second order statistics, as described in the next section; this involves a simple structure that can be implemented efficiently, as pointed out in [5]. However, the main problem with such implementations is that convergence to the correct solution is not always guaranteed, since a single decorrelation is not a sufficient condition for achieving independent model sources. Hence, to achieve independent source models for stationary signals, higher order statistics need to be considered, obtained either by direct measurement and optimization of higher order statistics [6] or indirectly by assuming a model of the shape of the cumulative density function (cdf) of the signals [7]. Most of these methods are fairly complex and difficult to implement, whereas the other methods fail when the cdf cannot be assumed accurately.
A handful of online algorithms have been proposed for methods based on single decorrelation and on the indirect higher order methods [8][9], corresponding to their offline counterparts described above. However, they suffer from the same issues discussed earlier. Since the data in an online BSS algorithm may change its properties quickly, fast convergence of the non-static filter is an essential criterion.
The method we are verifying is based on proposals by Parra in [1] and [2], which performs the convolutive signal decorrelation efficiently and accurately; both the offline and the online versions are discussed and implemented within this thesis. This technique falls under the second of the three categories mentioned above.
The second process, background denoising, is considered next. Especially in modern hands free speech communication environments, the speech signal is most of the time corrupted by background noise. The aim of introducing this process after the BSS process is to suppress any additional background noise contained in the separated signal, to make it even clearer. For this purpose we have chosen Martin's algorithm [3], since it is efficient and suitable for real time applications.
Chapter 3
Foundations
In this chapter, some fundamental concepts of BSS are introduced, which form the foundation for the discussions in the remaining chapters.
3.1 Second order Statistics
In second order and higher order statistical separation criteria, the basic assumption made is the statistical independence of the source signals. In general, capturing statistical independence requires higher order statistics, which suggests that second order statistics alone are not sufficient to analyze statistical independence. For instance, if the source signals are independently and identically distributed samples of a stationary distribution, second and fourth order statistics are not sufficient for separation [1]. However, natural signals (voice, images) are often neither stationary nor independently distributed, so it is possible to use either second order statistics, or both second and fourth order statistics, as a measure of independence.
In many cases, for signals with temporally correlated sources, separation can be performed entirely based on second order statistics. It is also known that non-stationary signals can be separated by means of decorrelation. The more difficult "convolutive separation", i.e. the separation of mixtures of signals that are both correlated and non-stationary, can also be solved using only second order statistics.
3.2 Short Time Fourier Transform (STFT)
Generally, in analysis using the DFT, it is assumed that the frequency components do not change with time. Hence, the length of the window does not affect the DFT results, and the signal properties remain the same from the beginning to the end of the window. However, the signals we deal with in practice, for instance non-stationary signals such as radar, sonar, speech and data communication signals, have properties such as amplitude, frequency and phase that change over time. For such signals a single DFT estimate is not sufficient, as it gives no information about the time at which a frequency component occurs. To deal with such signals, the STFT, or Time Dependent Fourier Transform (TDFT), has been introduced. The STFT is a commonly used tool in speech processing applications.
The STFT is a simple concept: a moving window is applied to the signal, and the Fourier transform is taken of the signal within the window as the window is moved. Mathematically, the STFT of a signal x(n) can be defined as follows:

X(n, \omega) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, e^{-j\omega m}    (3)

where w(n) is a window sequence. As a result of the STFT process, the single discrete variable n of the one dimensional signal x(n) is converted into a two dimensional function of the discrete time variable n and the continuous frequency variable \omega. The STFT is periodic in \omega with period 2\pi.
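A windowed-frame sketch of the STFT in Eq. (3); the Hann window, frame length and hop size are illustrative choices rather than values used in the thesis.

```python
import numpy as np

def stft(x, T=256, hop=128):
    """STFT as in Eq. (3): slide a window over x and DFT each windowed frame."""
    w = np.hanning(T)
    starts = range(0, len(x) - T + 1, hop)
    return np.array([np.fft.fft(w * x[n:n + T]) for n in starts])

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)               # a 440 Hz test tone
X = stft(x)                                     # frames x 256 frequency bins

# The tone should concentrate near bin 440 / (8000 / 256), i.e. around bin 14.
peak_bin = int(np.argmax(np.abs(X[0, :128])))
```

Because each frame gets its own spectrum, a change in the tone's frequency halfway through the signal would show up as a change in the peak bin across frames, which is exactly the time-frequency information a single DFT cannot provide.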
3.3 Steepest Descent Algorithm
In many ICA and BSS algorithms it is common to use the steepest descent, or gradient descent, approach to obtain the optimized weights for a particular solution. This method provides a good tradeoff between computational complexity and convergence performance.
Suppose a starting point x_0 is known for a given function E to be optimized. The steepest descent algorithm then iterates as follows to update the current value of x:

x_{k+1} = x_k - \mu \nabla E(x_k)    (4)

Here, \mu is the step size, which needs to be chosen carefully depending on the application, and the last term points in the negative gradient direction. The iteration is continued until no further change occurs in x, i.e., x should converge to the minimum point of the error function E.
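A minimal sketch of the steepest descent iteration of Eq. (4) on a toy quadratic error function; the function, step size and tolerance are illustrative choices.

```python
import numpy as np

def steepest_descent(grad, x0, mu=0.1, tol=1e-8, max_iter=10000):
    """Iterate x_{k+1} = x_k - mu * grad E(x_k) until the update stalls."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = mu * grad(x)
        x = x - step
        if np.linalg.norm(step) < tol:   # no further change in x
            break
    return x

# Minimize E(x) = (x0 - 3)^2 + (x1 + 1)^2; its gradient is 2 * (x - [3, -1]).
xmin = steepest_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), [0.0, 0.0])
```

For this convex example any sufficiently small mu converges; in the BSS algorithms of the following chapters the step size additionally has to be adapted per frequency, which motivates the power normalization discussed later.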
3.4 Convolutive BSS
In this section, background theory is formulated for the BSS method considered in this thesis. Most of this theory is adapted from the work published by Parra [17].
For real environments, the instantaneous mixing model is not a sufficient descriptor. In real environments, signals arrive at the sensors at different times and with different delays. Some signals are reflected at boundaries and obstacles of the system under consideration, resulting in multiple delays. Such a phenomenon is known as a multipath environment and can be modeled as a Finite Impulse Response (FIR) convolutive mixture,

x(n) = \sum_{p=0}^{P-1} A(p) s(n-p)    (5)

Given the above model, the problem is to identify the P coefficient matrices A(p) of the channel, where P is the channel length, and thereby to estimate \hat{s}(n), the source signals. This is a rather complex process, since A is a matrix of filters rather than a matrix of scalars, as in the instantaneous mixing model. Even when the channel is identified, inverting it to obtain the inverse matrix is hard, since the inverse may be recursive and may result in an unstable Infinite Impulse Response (IIR) filter.
An alternative approach is to formulate an FIR inverse model W,

\hat{s}(n) = \sum_{p=0}^{Q-1} W(p) x(n-p)    (6)

and to estimate W such that the model sources \hat{s}(n) = [\hat{s}_1(n), \ldots, \hat{s}_{d_s}(n)]^T are statistically independent.
3.5 Cross-Correlation, circular and linear convolution
The cross-correlation of a discrete signal x(n) can be expressed as

R_x(n, n+\tau) = E[x(n)\, x^H(n+\tau)]    (7)

where E[\cdot] denotes the expectation operator. For stationary signals the correlations depend only on the relative time lag and not on the absolute time, i.e.,

R_x(n, n+\tau) = R_x(\tau)    (8)

Now consider taking the z-transform of R_x(\tau). In practice, the z-transform is evaluated at a limited number of sample points of z (i.e. the DFT). Then,

R_x(z) = A(z) \Lambda_s(z) A^H(z)    (9)

where \Lambda_s(z) is the z-transform of the autocorrelations of the sources, and A(z) is the matrix of z-transformed FIR filters. It should be noted that, due to the independence assumption, \Lambda_s(z) is diagonal.
Suppose T equidistant sampling points on the unit circle are considered. Then it is possible to replace the z-transform with the Discrete Fourier Transform (DFT). For periodic signals the DFT corresponds to circular convolution, whereas in (5) and (6) linear convolution was assumed. Linear convolution can be approximated by circular convolution provided the frame length T is chosen much larger than the channel filter length, i.e. P \ll T. The STFT X(n, \omega) of x(n) can then be expressed as follows:

X(n, \omega) \approx A(\omega) S(n, \omega), \quad \text{for } P \ll T    (10)

where X(n, \omega) is the DFT of a frame of size T starting at discrete time n; S(n, \omega) and A(\omega) are defined analogously.
For non-stationary signals, however, the cross-correlations are time dependent. Estimating the cross-power spectrum with a resolution of 1/T can be difficult if the stationarity time of the source signal is of the order of T or less. However, any cross-power spectrum average that diagonalizes the source signals may be sufficient [1].
Parra and Spence in [1] considered the simple sample average

\hat{R}_x(n, \omega) = \frac{1}{N} \sum_{t=0}^{N-1} X(n+tT, \omega)\, X^H(n+tT, \omega)    (11)

which yields the following in matrix form:

\hat{R}_x(n, \omega) = A(\omega) \Lambda_s(n, \omega) A^H(\omega)    (12)

Due to the independence assumption, and if N is considered to be sufficiently large, \Lambda_s(n, \omega) can be modeled as a diagonal matrix. It is important that the signals are non-stationary for (12) to be linearly independent for different times n.
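The sample average of Eq. (11) can be sketched directly; the sensor count, frame length T and number of frames N below are arbitrary illustration values.

```python
import numpy as np

def cross_power_spectrum(x, n, T, N):
    """Estimate R_x(n, w) as in Eq. (11): average X X^H over N frames
    of length T starting at sample n (illustrative sketch)."""
    d = x.shape[0]                               # number of sensors
    R = np.zeros((T, d, d), dtype=complex)       # one d x d matrix per bin
    for t in range(N):
        X = np.fft.fft(x[:, n + t*T : n + (t+1)*T], axis=1)   # d x T spectra
        for k in range(T):
            Xk = X[:, k:k+1]                     # column vector at bin k
            R[k] += Xk @ Xk.conj().T
    return R / N

x = np.random.default_rng(3).standard_normal((2, 4096))
R = cross_power_spectrum(x, n=0, T=256, N=8)
```

Each per-bin estimate is an average of rank-one outer products, so it is Hermitian by construction; the separation algorithm then seeks a W that makes all of these matrices simultaneously diagonal.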
Chapter 4
Offline algorithm
4.1 Problem formulation
The method considered performs blind source separation of convolutive signals by simultaneously diagonalizing second order statistics at multiple time periods in the frequency domain; it has a simple structure and can be implemented efficiently.
If x = A s, and given x, we are interested in finding A and s. This is a simple problem if prior information about either A or s is known: the solution can then be determined explicitly, provided that the inverse of A exists or can be approximated using a pseudo-inverse. The main problem dealt with in this thesis is the situation where we have no information about either A or s, and the task is to determine A^{-1} and s purely based on a set of basic assumptions made on A and s, with knowledge of x only. This process of finding solutions is known as blind source/signal separation.
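For contrast with the blind setting, if A were known the sources could be recovered with a pseudo-inverse; the 3-sensor, 2-source mixing matrix below is a made-up example, not data from the thesis.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical known mixing matrix: 3 sensors, 2 sources (full column rank).
A = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.7, 0.3]])
s = rng.standard_normal((2, 500))
x = A @ s                                # noiseless instantaneous mixture

# With prior knowledge of A the problem is no longer blind: s = pinv(A) x.
s_hat = np.linalg.pinv(A) @ x
```

In the noiseless, full-rank case the pseudo-inverse recovers the sources exactly; the whole difficulty of BSS is that A is unavailable, so such a direct inversion is impossible.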
Figure 4.1: BSS system. The source signals pass through the mixing system A (delay, attenuation, reverberation) to produce the observed sensor signals; the demixing system W estimates \hat{s}_1 and \hat{s}_2 using only the observations x_1 and x_2.
4.2 Derivation of the Offline Algorithm
Two approaches can be found in the literature for solving convolutive separation based on second order statistics, the so called "forward model" and the "backward model" [1]. Given a forward model, finding a stable inverse is not always guaranteed. Hence our implementation is based on the latter approach, since it is more robust. In addition to source separation, these techniques can be used for additive noise power estimation.
The source separation problem is described here. Assume the statistically non-stationary, independent source signals are denoted s(n),

s(n) = [s_1(n), \ldots, s_{d_s}(n)]^T    (13)

These sources are convolved and mixed in a linear medium and observed in a multi-path environment, leading to the sensor signals x(n),

x(n) = [x_1(n), \ldots, x_{d_x}(n)]^T    (14)

It is further assumed that d_s \le d_x, which means there are at least as many sensors as sources. The sensor signals with additive noise are given in the discrete time domain by the equation

x(n) = \sum_{p=0}^{P-1} A(p) s(n-p) + v(n)    (15)

Source separation techniques are used to identify the P coefficient matrices of the channel A(p) and to find an estimate \hat{s}(n) of the unknown source signals.
Alternatively, to estimate the source signals, there exists a finite impulse response (FIR) multi-path filter W(p),

\hat{s}(n) = \sum_{p=0}^{Q-1} W(p) x(n-p)    (16)

This is known as the backward model.
4.2.1 Cross-Correlation
For non-stationary signals the cross-correlation is time dependent and varies from one estimation segment to another. The cross-correlation estimates are computed as follows:

\hat{R}_x(n, \omega_k) = \frac{1}{N} \sum_{t=0}^{N-1} X(n+tT, \omega_k)\, X^H(n+tT, \omega_k)    (17)

where

\omega_k = \frac{2\pi k}{T}, \quad k = 1, 2, \ldots, K

X(n+tT, \omega_k) = \mathrm{FFT}[\bar{x}(n+tT)]    (18)

\bar{x}(n) = [x(n), \ldots, x(n+T-1)]    (19)

For each average we can also write

\hat{R}_x(n, \omega) = A(\omega) \Lambda_s(n, \omega) A^H(\omega)    (20)

If N is sufficiently large, we can model \Lambda_s(n, \omega) as diagonal due to the independence assumption. For (20) to be linearly independent for different times n, it is necessary that \Lambda_s(n, \omega) changes over time for a given frequency, i.e., that the signals are non-stationary.
4.2.2 Backward Model
Using the cross-correlation estimate of equation (17), we can find the source signals whose cross-power spectra satisfy the following equation:

\Lambda_s(n, \omega) = W(\omega) \hat{R}_x(n, \omega) W^H(\omega)    (21)

It is important to use non-overlapping averaging times for \hat{R}_x(n_k, \omega), i.e. n_k = kNT (where T is the window length, N is the number of intervals used to estimate each cross-power matrix, and K is the number of matrices to be diagonalized), because the independence condition must be fulfilled for every time instant. If the signals vary sufficiently fast, overlapping times may need to be chosen. The windows may overlap one after another, which means each DFT value is derived from signal information that is also contained in the previous window. In an audio signal processing system, the specific value of T is selected based on the acoustics of the room in which the signals are recorded. For example, if the signal is recorded in a large room with a strong reverberation effect, a sufficiently long window size T should be chosen so as to include the reverberation effect and achieve good separation quality.
The multipath model W that simultaneously satisfies these equations for the K time estimates can be found using a least squares estimation procedure as follows:

\hat{W}, \hat{\Lambda}_s = \underset{W,\, \Lambda_s}{\operatorname{argmin}} \sum_{t=1}^{T} \sum_{k=1}^{K} \| E(k, \omega_t) \|^2    (22)

subject to W(p) = 0 for p > Q and W_{ii}(\omega) = 1, where

\omega_t = \frac{2\pi t}{T}, \quad t = 1, 2, \ldots, T

E(k, \omega) = W(\omega) \hat{R}_x(k, \omega) W^H(\omega) - \Lambda_s(k, \omega)    (23)
To compute the filter coefficients W, a gradient descent algorithm is applied which minimizes the cost function J with respect to W. The gradients are

\frac{\partial J}{\partial W^*(\omega)} = 2 \sum_{k=1}^{K} E(k, \omega)\, W(\omega)\, \hat{R}_x(k, \omega)    (24)

\frac{\partial J}{\partial \Lambda_s^*(k, \omega)} = -\operatorname{diag} E(k, \omega)    (25)

The values of W are updated as W_{new} = W_{old} - \mu \nabla_W E, where \mu \nabla_W E is the gradient step and \mu is a weighting constant that controls the size of the update.
The optimal solution for \Lambda_s, given W(\omega), can be computed explicitly at every gradient step, which yields:

\hat{\Lambda}_s(k, \omega) = \operatorname{diag}[W(\omega) \hat{R}_x(k, \omega) W^H(\omega)]    (26)
In order to achieve an accurate solution, the gradient descent process is constrained in the time domain
so that the filter coefficients in W(τ) attain certain values. Effectively, the T filter coefficients in W(ω)
are parameterized with Q parameters in W(τ). Arbitrary permutations will not satisfy this condition on
the length of the filter, W(τ) = 0 for τ > Q ≪ T. Requiring zero coefficients for lags τ > Q restricts the
solutions to be continuous, or “smooth”, in the frequency domain.
The filter size constraint can be enforced by projecting the unconstrained gradients in (24) onto the
subspace of permissible solutions. This projection is implemented by transforming the gradient into the
time domain, zeroing all components with τ > Q, and transforming back to the frequency domain.
The unit gain constraint on the diagonal filters is enforced simply by keeping the diagonal filter
coefficients constant at W_ii(ω) = 1 [1].
In this manner, the “permutation problem” is solved and a unique solution for the FIR filter coefficients
is computed; these coefficients are then used to process the received signals and efficiently separate the
source signals.
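The projection step described above (transform to the time domain, zero the lags beyond Q, transform back, and fix the diagonal gain) can be sketched as follows (illustrative NumPy, names our own):

```python
import numpy as np

def project_filters(W, Q):
    """Project frequency-domain unmixing filters W (T x d x d, one matrix per
    DFT bin) onto the constraint set: FIR length Q << T and unit-gain diagonal."""
    w_t = np.fft.ifft(W, axis=0)        # transform to the time domain
    w_t[Q:] = 0                         # zero all lags beyond Q ("smoothness")
    W = np.fft.fft(w_t, axis=0)         # back to the frequency domain
    for i in range(W.shape[1]):
        W[:, i, i] = 1.0                # unit gain constraint W_ii(w) = 1
    return W
```

The same projection can be applied to the gradient itself, which keeps the search inside the permissible subspace.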
4.2.3 Power normalization
Experimentally it has been shown [2] that convergence of the gradient algorithm can be substantially
improved when different adaptation constants are used for every frequency. According to (24), the
gradient terms are scaled with the square of R̂_x, and the signal powers vary considerably across
frequencies. Consequently, the gradient terms at different frequencies have considerably different
magnitudes. To obtain comparable update steps for different frequencies, the gradients are scaled by a
power normalization process. This amounts to introducing a weighting factor in the cost function:

J = Σ_{k=1}^{K} Σ_{ω=0}^{T−1} λ(ω) ‖E(ω, k)‖²        (27)

It has been shown in [2] that good results can be achieved when λ(ω) is a straightforward power
normalization defined as:

λ(ω) = ( Σ_{k=1}^{K} ‖R̂_x(ω, k)‖² )^{−1}        (28)
4.3 Offline algorithm
The basic offline algorithm proceeds as follows. Blocks (segments) of input signal containing the
mixed signals are accumulated. The algorithm then divides the input signal into a number of T-length
periods (windows) and performs a discrete Fourier transform (DFT) on the mixed signal over each
T-length period. The K cross-power spectra, each averaged over N of the T-length periods, are
computed. Using these cross-power values, a gradient descent procedure computes the coefficients of
an FIR filter that will effectively separate the source signals from the input signal by simultaneously
decorrelating the K cross-power spectra.
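A minimal sketch of this offline procedure, with the filter-length constraint and power normalization omitted for brevity, could look as follows (illustrative NumPy, not the thesis code):

```python
import numpy as np

def offline_mad(R, n_iter=200, mu=0.001):
    """Sketch of the offline gradient descent: R[k, w] holds the K estimated
    cross-power matrices per DFT bin; returns unmixing matrices W[w].
    (Illustrative only; constraints and power normalization are omitted.)"""
    K, T, d, _ = R.shape
    W = np.tile(np.eye(d, dtype=complex), (T, 1, 1))
    for _ in range(n_iter):
        grad = np.zeros_like(W)
        for k in range(K):
            WR = W @ R[k]                          # W(w) R_x(w,k), all bins at once
            E = WR @ W.conj().transpose(0, 2, 1)   # W R W^H
            E = E - np.eye(d) * E                  # off-diagonal error, cf. eq. (23)
            grad += 2 * E @ WR                     # gradient, cf. eq. (24)
        W = W - mu * grad
    return W
```

With the diagonal model set to diag(W R W^H), only the off-diagonal (cross-correlation) error drives the update, which is exactly the simultaneous decorrelation criterion.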
Figure 4.1: Pictorial view of the calculation of Rx
The flow chart for the implementation of the offline algorithm is shown in figure 4.2.
Figure 4.2: Flow chart for the offline algorithm
Chapter 5
Online BSS
Non-stationary signals in a static multipath environment can be recovered by simultaneously
decorrelating the time-varying second order statistics, as discussed previously. However, when the
sources are moving, which is the more practical and realistic scenario, the "static multipath"
assumption is no longer valid.
In an online algorithm, the data cannot be stored; they must be processed as they arrive to produce the
output. Although many online algorithms exist, we have chosen [2] since it is considered a benchmark
for this type of algorithm and has proven to be efficient in real time environments.
One way to deal with the online scenario is to convert the offline algorithm directly to an online
version. When doing so, the non-stationary signal statistics required for convergence must be estimated
on the fly, which is practically difficult.
The algorithm proposed by Parra in [2], which we have implemented in this thesis, is an efficient
online gradient algorithm with an adaptive step size in the frequency domain, based on second order
statistics: the Multiple Adaptive Decorrelation (MAD) algorithm.
5.1 Introduction
The main objective of the MAD algorithm is to converge quickly in changing environments. The
algorithm attempts to optimize the decorrelation of the cross correlation along time. Similar to the
offline counterpart, the online version is a gradient descent algorithm. The general idea of a gradient
descent algorithm is to optimize a cost function, which in this case is defined in terms of the separation
filters. The derivation of the cost function is illustrated in section 5.2.
The online algorithm is derived directly in the time domain and later transformed into the frequency
domain for efficiency and faster convergence of the online updates.
5.2 Derivation in time domain
Consider again s(t) = [s₁(t), …, s_{d_s}(t)]ᵀ, the non-stationary independent source signals. When these
signals are observed in a multipath environment, the observed (measured) signal
x(t) = [x₁(t), …, x_{d_x}(t)]ᵀ can be modeled as follows:

x(t) = Σ_{τ=0}^{P−1} A(τ) s(t − τ) + n(t)        (29)

where A(τ) is the time domain mixing filter of order P, and n(t) is the background noise. For effective
separation of the sources it is necessary to assume that there exist at least as many sensors as sources,
i.e. d_s ≤ d_x. It has been shown in [23] that, under certain conditions on the coefficients of A(τ), the
sources can be recovered by finding a sequence of optimal filter coefficients for the unmixing filter
matrices W(τ), of suitably chosen length Q, such that:

ŝ(t) = Σ_{τ=0}^{Q−1} W(τ) x(t − τ)        (30)
Similar to the offline algorithm, Q ≪ T in order to prevent the permutation effect. The estimated
sources should have diagonal second order statistics at all times, i.e.:

∀ t, τ :  E[ ŝ(t) ŝ^H(t − τ) ] = Λ_s(t, τ)        (31)

where Λ_s(t, τ) = diag[ λ_{s₁}(t, τ), …, λ_{s_{d_s}}(t, τ) ] contains the autocorrelations of the sources at
times t, which have to be estimated from the measured data. Unlike previous work, this online
algorithm is derived in the time domain and is later converted into the frequency domain to improve
efficiency and provide faster convergence of the online update rule.
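The unmixing operation in equation (30) is a multichannel FIR filter; a direct time-domain sketch (illustrative, names our own):

```python
import numpy as np

def unmix(W, x):
    """Apply the multichannel FIR unmixing filter of eq. (30):
    s_hat(t) = sum_{tau=0}^{Q-1} W[tau] x(t - tau).
    W: (Q, d_s, d_x) filter taps; x: (d_x, n) observed signals."""
    Q = W.shape[0]
    n = x.shape[1]
    s = np.zeros((W.shape[1], n))
    for tau in range(Q):
        # each lag contributes W[tau] applied to a delayed copy of x
        s[:, tau:] += W[tau] @ x[:, :n - tau]
    return s
```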
When the sampling average starts at time t, the estimated expectation of a sequence f(t) can be written
as:

Ê_t[ f(t) ] = (1/N) Σ_{τ′=0}^{N−1} f(t + τ′)        (32)

A separation criterion for simultaneous diagonalization using (32) can then be defined as follows:

J = Σ_t J(t),   J(t) = Σ_τ ‖E(t, τ)‖²        (33)

E(t, τ) = Ê_t[ ŝ(t) ŝ^H(t − τ) ] − Λ̂_s(t, τ)        (34)

       = (1/N) Σ_{τ′=0}^{N−1} ŝ(t + τ′) ŝ^H(t + τ′ − τ) − Λ̂_s(t, τ)        (35)
where ||.|| is the Frobenius norm.
The cost function to be minimized is J, and the algorithm searches for separation filters by minimizing
J with a gradient descent procedure. For a given W, the optimal estimates of the autocorrelations
Λ_s(t, τ) are the diagonal elements of Ê_t[ ŝ(t) ŝ^H(t − τ) ]. Hence the gradient only needs to be
calculated with respect to W. The stochastic gradient is now calculated for the online algorithm with a
gradient step for every t. The following relation can be derived with this information:

∂J(t)/∂W(l) = Σ_τ Σ_{τ′} [ ŝ(t+τ′) ŝ^H(t+τ′−τ) − Λ̂_s(t, τ) ] Σ_{τ″} ŝ(t+τ″−τ) x^H(t+τ″−l)
            + Σ_τ Σ_{τ′} [ ŝ(t+τ′−τ) ŝ^H(t+τ′) − Λ̂_s(t, τ) ] Σ_{τ″} ŝ(t+τ″) x^H(t+τ″−τ−l)        (36)
This can be simplified as follows:

ΔW(l) = −2μ Σ_τ Σ_{τ′} [ ŝ(t+τ′) ŝ^H(t+τ′−τ) − Λ̂_s(t, τ) ] Σ_{τ″} ŝ(t+τ″−τ) x^H(t+τ″−l)        (37)

The sums over τ′ and τ″ are the averaging operations, and the sum over τ is from (34). If it is assumed
that the estimated cross-correlations do not change significantly within the averaging time scale, the
update can be written in terms of the estimated cross-correlation

R̂_x(t, τ) = Ê_t[ x(t) x^H(t − τ) ]        (38)

as

ΔW(t) = −2μ E(t) ∗ W ∗ R̂_x(t)        (39)

E(t) = W ∗ R̂_x(t) ∗ W^H − Λ̂_s(t)        (40)

where ∗ denotes convolution over the filter lags.
5.3 Frequency domain conversion
Since convolution is computationally expensive, the time-domain gradient is transformed into the
frequency domain with T frequency bins. Due to the assumption Q ≪ T, (39) can be approximated in
the frequency domain as follows:

ΔW(ω) = −2μ E(t, ω) W(ω) R̂_x(t, ω)        (41)

where

E(t, ω) = W(ω) R̂_x(t, ω) W^H(ω) − Λ_s(t, ω)        (42)
5.4 Power Normalization
An adaptive power normalization factor is introduced to improve the convergence of the algorithm,
similar to the offline counterpart. To perform a proper Newton-Raphson update, the inverse of the
Hessian is required; however, computing this inverse is difficult in our case. One way to overcome this
is to neglect the off-diagonal terms of the Hessian, which results in efficient gradient updates when the
coefficients are not strongly coupled. According to (41), the approximate frequency domain gradient
update at a given frequency depends only on W(ω), so the parameters are decoupled across
frequencies. On the other hand, if several elements of W(ω) at a single frequency are strongly
dependent, the diagonal approximation is quite poor. This results in poor performance when the powers
of the different channels differ considerably. Hence the gradient directions of the matrix elements of
W(ω) for a given frequency should not be modified individually.
Instead, the original gradient is used with an adaptive step size, determined by a normalization factor
h(t, ω) for each frequency, i.e.:

ΔW(ω) = −μ h^{−2}(t, ω) ∂J(t)/∂W*(ω)        (43)

To calculate h(t, ω), the norm over all matrix elements at frequency ω is considered, which gives:

h(t, ω) = ‖ W(ω) R̂_x(t, ω) W^H(ω) ‖        (44)

This is an adaptive power normalization; the resulting updates are stable and lead to faster
convergence.
Figure 5.1: Flow chart for the online algorithm
In practice, the online algorithm is implemented as a block processing procedure. The signals are
windowed with a length of T, and each block of T samples is then transformed into the frequency
domain; in signal processing terms, we calculate the Short Time Fourier Transform (STFT) of the
signal. These frequency bins are then used to calculate the estimated frequency-domain
cross-correlations.
It is typical in online algorithms to implement the expectation operation as an exponentially weighted
average of past values. This can be expressed as:

R̂_x(t, ω) = (1 − γ) R̂_x(t − 1, ω) + γ x(t, ω) x^H(t, ω)        (45)

where γ is a forgetting factor which should be determined based on the stationarity of the signal.
Finally, when W has converged, these weights are used to filter the observed signals and obtain the
separated source estimates.
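The recursive estimate of equation (45) amounts to one rank-one update per DFT bin; a sketch (illustrative NumPy, names our own):

```python
import numpy as np

def update_cross_power(R_prev, X, gamma):
    """Exponentially windowed cross-power estimate, cf. eq. (45):
    R(t, w) = (1 - gamma) R(t-1, w) + gamma x(t, w) x^H(t, w).
    X: (d, T) STFT frame of the current block; R_prev: (T, d, d)."""
    inst = np.einsum('aw,bw->wab', X, X.conj())   # x(t,w) x^H(t,w) per bin w
    return (1.0 - gamma) * R_prev + gamma * inst
```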
Chapter 6
Spectral Subtraction based on minimum statistics
The second building block we consider is the background denoising process. Spectral subtraction based
algorithms are generally used to enhance noisy speech signals. One approach is to use a speech activity
detector to detect the speech pause, which requires additional procedures and equipment. In this thesis
we will be using an approach based on minimum statistics of sub bands, proposed by Martin [3]. This
algorithm is capable of tracking non-stationary noise signals and eliminates the problem of speech
activity detection. We selected this algorithm due to its computational simplicity and hence its
suitability for real time applications.
6.1 Components
Let x(n) be the sum of a zero mean speech signal s(n) and a zero mean noise signal v(n), where n is
the discrete time index:

x(n) = s(n) + v(n)        (46)

If s(n) and v(n) are assumed to be statistically independent, then taking the expectation of the squared
signal on both sides gives:

E[x²(n)] = E[s²(n)] + E[v²(n)]        (47)
The spectral processing is based on a DFT filter bank with M sub bands and a decimation/interpolation
ratio of R [5]. A pictorial view of the algorithm is depicted in figure 6.1. As demonstrated there, only
the magnitudes of the DFT sub band signals are changed; the phase components are preserved.
Typically, M is chosen to be 256 and R to be 64.
Figure 6.1: Block diagram of the spectral calculation process
The algorithm comprises two main procedures:
1. A noise power estimator
2. A subtraction rule which translates the sub band Signal to Noise Ratio (SNR) into a spectral
weighting factor
The basic idea is to attenuate sub bands with low SNR (via the spectral weighting factor) and to keep
the sub bands with high SNR intact. These two components have a significant impact on the audible
residual noise [4].
6.2 Noise Estimation
Two approaches to noise estimation can be identified. The simpler of the two is analysis during speech
pauses, which has two disadvantages:
(a) Changes in the noise spectrum during speech periods cannot be detected, i.e. the noise has to be
stationary over long time periods.
(b) A voice activity detector (VAD) must be introduced to interrupt noise estimation during speech
activity. One major difficulty in this case is the recognition of unvoiced phonemes.
We use the second approach, a minimum statistics algorithm proposed by Martin [7].
The first step in estimating the noise power is to calculate the short time subband signal power
P_x(i, k), where i is the frame index and k the subband index. This is performed using a recursively
smoothed periodogram, updated according to:

P_x(i, k) = α P_x(i − 1, k) + (1 − α) |X(i, k)|²        (48)
Typically, α is set to a value between 0.9 and 0.95. By calculating a weighted minimum of the short
time power estimate P_x(i, k) within a window of D subband power samples, the noise power estimate
P_n(i, k) is obtained as follows:

P_n(i, k) = o_min P_min(i, k)        (49)

where P_min(i, k) is the estimated minimum noise power and o_min is a compensation factor for the
bias of the minimum estimate, which depends only on known algorithmic parameters. To improve
computational efficiency, the data window of length D is decomposed into W windows of length L,
where W · L = D. The minimum noise power estimate P_min(i, k) of a subband is obtained by
frame-wise comparison of the smoothed signal power estimate P_x(i, k) with the minima of the W − 1
preceding sub-windows, which are stored in a FIFO register of depth W. The process of determining
the minimum of W consecutive sub-window minima is depicted in figure 6.2. It can be seen that in the
case of decreasing noise power a fast update of the minimum power estimate is achieved, whereas in
the case of increasing noise power the update of the noise estimate is delayed by up to D samples [19].
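The minimum tracking described above can be sketched for a single subband as follows (illustrative Python; the parameter values are not the ones used in the thesis):

```python
from collections import deque

class MinStatNoiseTracker:
    """Minimum-statistics noise tracking for one subband (sketch): smooths
    |X(i,k)|^2 as in eq. (48), keeps the minima of n_sub sub-windows of L
    frames each in a FIFO, and returns o_min times the overall minimum, as in
    eq. (49). Parameter names and defaults are illustrative."""
    def __init__(self, alpha=0.92, L=8, n_sub=4, o_min=1.5):
        self.alpha, self.L, self.o_min = alpha, L, o_min
        self.P = 0.0                       # smoothed periodogram P_x(i, k)
        self.cur_min = float('inf')        # minimum inside the current sub-window
        self.count = 0
        self.fifo = deque(maxlen=n_sub)    # minima of the previous sub-windows
    def update(self, mag_sq):
        self.P = self.alpha * self.P + (1.0 - self.alpha) * mag_sq   # eq. (48)
        self.cur_min = min(self.cur_min, self.P)
        self.count += 1
        if self.count == self.L:           # sub-window complete: push its minimum
            self.fifo.append(self.cur_min)
            self.cur_min, self.count = float('inf'), 0
        return self.o_min * min(list(self.fifo) + [self.cur_min])    # eq. (49)
```

Because the FIFO only forgets a sub-window minimum after it leaves the buffer, the estimate reacts quickly to decreasing noise power but follows increases with a delay, as noted above.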
Figure 6.2: Structure of the noise subband power estimation algorithm
6.3 Subtraction rule
We calculate the short time signal power |X̄(i, k)|² by smoothing the successive magnitude-squared
input spectra with a first order recursive network:

|X̄(i, k)|² = γ |X̄(i − 1, k)|² + (1 − γ) |X(i, k)|²        (50)

where γ ≤ 0.9.
As proposed by Berouti in [17], the spectral magnitudes are subtracted with an oversubtraction factor
o_sub(i, k), and the maximum subtraction is limited by a spectral floor constant 0.01 ≤ f_c ≤ 0.05. The
output magnitude then becomes:

|Y(i, k)| = √(f_c P_n(i, k))      if |X(i, k)| G(i, k) ≤ √(f_c P_n(i, k))
          = |X(i, k)| G(i, k)     otherwise        (51)

where

G(i, k) = ( 1 − o_sub(i, k) P_n(i, k) / |X̄(i, k)|² )^{1/2}        (52)

A large oversubtraction factor o_sub(i, k) eliminates residual spectral peaks ('musical noise'), but at
the same time it affects the quality of the speech, resulting in low energy phonemes being suppressed.
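The subtraction rule of equations (51)-(52) for one subband sample can be sketched as follows (the helper name and the default parameter values are our own):

```python
import math

def subtract(mag, mag_sq_smooth, p_noise, o_sub=2.0, floor_c=0.02):
    """Sketch of the subtraction rule, cf. eqs. (51)-(52): oversubtraction by
    o_sub with a spectral floor constant floor_c (0.01..0.05).
    mag = |X(i,k)|, mag_sq_smooth = smoothed |X(i,k)|^2, p_noise = P_n(i,k)."""
    ratio = 1.0 - o_sub * p_noise / mag_sq_smooth    # inside eq. (52)
    gain = math.sqrt(ratio) if ratio > 0.0 else 0.0  # weighting factor G(i,k)
    floor_mag = math.sqrt(floor_c * p_noise)         # maximum attenuation, eq. (51)
    return max(mag * gain, floor_mag)
```

High-SNR bins pass almost unchanged, while low-SNR bins are clamped to the spectral floor instead of being driven to zero, which reduces musical noise.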
Chapter 7
Computer simulation
This section introduces the implementation of the BSS algorithms discussed in the earlier chapters in a
real time system. The basic platform used to implement the online algorithm was originally developed
by Dr. Nedelko Grbić and his research group at the department of Applied Signal Processing, Blekinge
Institute of Technology. This includes an extension to MATLAB, which interacts with a stereo sound
card to sample the incoming voice data and to make the output data available in real time, after
processing with an algorithm of one’s choice. This is a very cost effective and a convenient way to test
real time scenarios, hence suitable for rapid application development due to its ability to perform
simulations on MATLAB, before being implemented on an actual real time system. Once favorable
results are obtained with this setup, it's a matter of converting the MATLAB code to C/C++ code
suitable for an embedded application.
7.1 Requirements
The main requirements of a real time system are an efficient algorithm and low latency in capturing the
input and producing the output. Even though the MAD algorithm is efficient, it still needs to be
implemented carefully to run in real time in MATLAB. The incoming data must be processed without
iterating over the same data, since the input arrives in real time and the output must be produced in
real time with a small delay.
7.2 MATLAB implementation
It is well known that in MATLAB, nested "for" loops tend to degrade the efficiency of a program due
to element-wise array access. Hence the MAD algorithm was implemented in a block processing
fashion using vectorized operations. The block processing rule ensures faster data processing, and a
significant speed improvement was achieved in this way.
For instance, using

    n = 11; x = rand(n); y = rand(n);
    z = x(1:2:n) + y(1:2:n);

is more efficient than using

    for i = 1:2:n, z(i) = x(i) + y(i); end
There are certain restrictions imposed by the sound card: the sampling rate and the block size are fixed
by the manufacturer. The available sampling rates are 48 kHz and 44.1 kHz, with a block size of 512
samples. Hence we need to work within these constraints and ensure the algorithm works properly
under the given sampling rate and block size.
7.3 Experiment Setup and algorithm
The basic hardware used in this phase is:
• A laptop (1.73 GHz Intel Pentium M 740)
• A stereo sound card (Echo Indigo; sampling rates 48/44.1 kHz, block size 512 samples, one
input/output channel)
• A flexible two-microphone set
• A loudspeaker
Figure 7.1 shows the equipment that is being used in this experiment.
Figure 7.1: Experiment equipment
Figure 7.2 shows a block diagram of the software routines of the real time setup.
Figure 7.2: Software routines
7.4 Simulation results
The experiments were carried out in a 9 m² room with background music. The distance between the
speakers and the microphones was 30 cm, in a square arrangement.
Sample frequency     44.1 kHz
Block size           512
Filter length        128
Microphones used     2
Table 7.1: Experiment setup values for real time testing
Chapter 8
Evaluation
8.1 Introduction
Speech enhancement refers to improving the quality and/or intelligibility of a degraded speech signal.
Measuring such improvements is difficult because the nature and characteristics of noise signals can
change dramatically over time and from application to application, so a robust measure is hard to find.
Another reason is that performance measures are defined differently by different researchers.
Two popular criteria for performance measurement are:
(1) Quality: a subjective measure that depends on individual preferences.
(2) Intelligibility: an objective measure, for instance the percentage of words correctly identified by
listeners. A main issue with this type of measurement is listener fatigue, which refers to the
listener's ear tuning out unwanted noises and focusing on the wanted ones.
8.2 BSS evaluation method
A standard measurement technique is important for assessing the separation quality of BSS algorithms,
since many researchers use their own methods to evaluate their results, which does not give a true
picture of the separation quality and makes it hard to compare different algorithms. To overcome these
problems and to compare different types of BSS techniques, a standard evaluation method is needed.
We use the technique proposed by Schobben, Torkkola and Smaragdis in [15] to evaluate our results.
The separation quality of the j-th separated output is defined as:

S_j = 10 log₁₀ ( E[ŝ²_{j,j}(n)] / E[(Σ_{i≠j} ŝ_{j,i}(n))²] )        (53)

where ŝ_{j,i}(n) is the j-th output when only the i-th source is active. With this equation, the power
ratio between the desired source in the separated output and all the other, disturbing sources is
calculated.
The distortion of the j-th separated output is defined as:

D_j = 10 log₁₀ ( E[(x_{j,j}(n) − α_j ŝ_{j,j}(n))²] / E[x²_{j,j}(n)] )        (54)

where x_{j,j}(n) is the contribution of the j-th source alone to the j-th sensor, and
α_j = E[x²_{j,j}(n)] / E[ŝ²_{j,j}(n)] makes the measure scale invariant. E[·] denotes the expectation
operator.
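The quality measure of equation (53) can be computed directly from recordings in which only one source is active at a time; a sketch (illustrative NumPy, names our own):

```python
import numpy as np

def separation_quality(outputs):
    """Cf. eq. (53): outputs[j][i] is the j-th separated output signal recorded
    when only source i is active; returns S_j in dB for every output j."""
    J = len(outputs)
    S = []
    for j in range(J):
        desired = np.mean(outputs[j][j] ** 2)                              # wanted source power
        disturb = np.mean(sum(outputs[j][i] for i in range(J) if i != j) ** 2)  # leakage power
        S.append(10.0 * np.log10(desired / disturb))
    return S
```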
8.3 Experiment and Results
The following settings are kept fixed throughout the experiment:
• The mixing is a two source, two sensor setup, in a real room recording as well as in simulated
scenarios
• The input sequences are music and a male voice, recorded for 8 seconds
• The signals are sampled at a rate of 8 kHz in the recordings
• The block size of the algorithm is 512 samples
• The filter lengths were varied from 64 samples to 512 samples
Separation quality (S1 | S2) dB    MAD Offline           MAD Online
Filter length 64                   20.4621 | 31.5317     21.9559 | 34.4566
Filter length 128                  16.3620 | 21.6626     20.9290 | 31.4024
Filter length 256                  16.6603 | 22.0382     20.0148 | 33.1949
Filter length 512                  12.6127 | 27.4357     13.1895 | 11.4447
Table 8.1: Separation Quality Measures for an Instantaneous Mixture
Separation distortion (D1 | D2) dB   MAD Offline            MAD Online
Filter length 64                     -15.9390 | -13.3931    -13.1940 | -9.0089
Filter length 128                    -12.0912 | -7.8137     -9.9158 | -10.0921
Filter length 256                    -14.2202 | -7.9876     -7.7516 | -4.2034
Filter length 512                    -10.1250 | -8.5993     -4.8297 | 12.6678
Table 8.2: Separation Distortion Measures for an Instantaneous Mixture
Separation quality (S1 | S2) dB    MAD Offline          MAD Online
Filter length 64                   0.7391 | 18.8786     1.6904 | 10.9569
Filter length 128                  5.3014 | 17.0672     0.4649 | 15.0045
Filter length 256                  3.5759 | 18.0015     2.1190 | 13.3240
Filter length 512                  1.2806 | 18.2095     -5.2883 | 12.4346
Table 8.3: Separation Quality Measures for a Static Mixture
Separation distortion (D1 | D2) dB   MAD Offline          MAD Online
Filter length 64                     -6.9282 | 4.6183     -7.1653 | -11.5754
Filter length 128                    1.0028 | 7.6093      -10.3319 | -14.9425
Filter length 256                    1.8191 | 9.4782      -12.6137 | -16.7639
Filter length 512                    1.4328 | 3.0287      -9.7559 | -10.1933
Table 8.4: Separation Distortion Measures for a Static Mixture
Separation quality (S1 | S2) dB    MAD Offline         MAD Online
Filter length 64                   3.3408 | 3.1798     4.7635 | -1.8335
Filter length 128                  7.7058 | 4.0183     7.3421 | -1.3522
Filter length 256                  7.4332 | 6.6986     2.8662 | 3.6885
Filter length 512                  4.7900 | 6.4837     -1.7124 | 9.0720
Table 8.5: Separation Quality Measures for a Head Mixture
Separation distortion (D1 | D2) dB   MAD Offline           MAD Online
Filter length 64                     -0.2479 | -5.0346     0.1738 | -1.0855
Filter length 128                    -1.0609 | -3.9216     -0.2862 | -0.8946
Filter length 256                    -0.5360 | -3.0300     -0.6044 | -0.7258
Filter length 512                    -1.0753 | -4.6627     -0.1054 | -4.7607
Table 8.6: Separation Distortion Measures for a Head Mixture
Separation quality (S1 | S2) dB    MAD Offline         MAD Online
Filter length 64                   3.4720 | 3.4264     0.5001 | -0.5227
Filter length 128                  1.7252 | 2.1411     -5.3413 | 8.8600
Filter length 256                  5.6180 | 6.2060     8.1500 | 7.1554
Filter length 512                  3.2465 | 6.9594     5.6853 | 10.2311
Table 8.7: Separation Quality Measures for a Simulated Real Room
Separation distortion (D1 | D2) dB   MAD Offline           MAD Online
Filter length 64                     -3.5196 | -4.4073     -3.5004 | -4.5466
Filter length 128                    -3.8052 | -4.0388     -3.4879 | -2.6749
Filter length 256                    -2.9301 | -3.1329     -2.7426 | -3.0561
Filter length 512                    -2.7404 | -2.9757     -2.4048 | -3.1939
Table 8.8: Separation Distortion Measures for a Simulated Real Room
Tables 8.1 to 8.8 summarize the separation quality and the separation distortion, as described in
equations (53) and (54) respectively, for different signal mixtures. In this experiment, the offline and
online MAD algorithms were evaluated against instantaneous mixtures, static mixtures, head mixtures
and a simulated real room, based on the tools provided in [15], for varying filter lengths Q.
The static mixture uses a 2x2 FIR mixing matrix with filters generated from various kinds of random
filters. The head mixture performs a 2x2 mixing using the Head Related Transfer Functions (HRTFs)
of a dummy head. This generates sounds in a manner similar to how the signals reach each of our ears,
by filtering based on the shape of a human head [16].
As can be seen from the results, the online algorithm performs comparably to its offline counterpart.
Hence the online algorithm described in [2] is very efficient and can be used in real time systems for
separating convolutive mixtures. As expected, the quality of separation is highest for the instantaneous
mixture, which also gives low distortion. For the instantaneous mixture, the separation quality
decreases as the filter length is increased. In the simulated reverberant room, the separation quality
measure is not as high as in the instantaneous case, but the perceived quality of separation is
subjectively satisfactory.
An issue with the setup used in the implementation was a "clicking" effect audible in the separated
signals. This is due to the block processing and can be reduced by implementing the STFT in a filter
bank structure.
The following figures display the time domain input signals and the corresponding outputs of a more
difficult convolutive mixture, comprising a human voice mixed with a music signal.
Figure 8.1: Time signals of the convolutive input signals (a), (b) and the output signals (c), (d)
8.4 Evaluation of Spectral subtraction method
We use plots of the input and output signals in the time domain and in the frequency domain (PSD) for
a qualitative evaluation of Martin's algorithm. Figure 8.2 shows the time domain plot of a signal in
which a female voice is recorded with car noise in the background. It can be seen that most of the noise
is suppressed in the output signal. However, unlike with the BSS algorithms, "musical noise" is present
in the output. Hence, care should be taken when selecting the parameters of Martin's algorithm,
especially the value of γ. On many occasions, reasonably good results were achieved by setting
γ = 0.85.
Figure 8.2: Time domain signals of the noisy input signal (a) and the output signal after denoising (b)
Figure 8.3: Frequency domain plots of the noisy input signal (a) and the output signal (b)
Chapter 9
Conclusion
We verified two methods which can be used as building blocks for speech enhancement in mobile
environments. Two frequency domain algorithms, the Multiple Adaptive Decorrelation method for
speech separation and Martin's algorithm for background denoising, were implemented. The
implementations were successfully tested on a two microphone setup in real room recordings and in a
real time realization.
More emphasis was given to the BSS algorithm, since background denoising can also be achieved with
BSS. The MAD algorithm was implemented both as an offline and as an online algorithm, with
promising results. Improved hearing quality can probably be achieved by realizing the STFT in a filter
bank. The online version of the MAD algorithm was implemented in a block processing manner, and
successful real time separation of signals in a two microphone setup was achieved in real room
environments.
Martin's algorithm was implemented and tested for background denoising. Applications that require
noise suppression alone could adopt this method, since it is less complex and more computationally
efficient than BSS algorithms. However, the "musical noise" typical of spectral subtraction methods is
an issue that needs to be addressed.
The performance of the implementation was evaluated using the standard technique proposed by
Schobben, Torkkola and Smaragdis. The results show, on average, a separation of 15-20 dB for
instantaneous mixtures and about 3 dB in simulated real room environments. In the real time
recordings, good separation was achieved as perceived by the listener, but this was not quantified with
the standard measures.
References
[1] Parra, L., Spence, C., “Convolutive Blind Separation of Non-Stationary Sources”, IEEE Trans. on
Speech and Audio Proc., vol 8., pp. 320-327, May 2000.
[2] Parra L., Spence C., "On-line convolutive source separation of non-stationary signals", Journal of
VLSI Signal Processing, 26(1/2):39-46, Aug. 2000.
[3] Martin R., "Spectral Subtraction Based on Minimum Statistics", Proc. EUSIPCO'94, pp. 1181-1185, 1994.
[4] Vary P., “Noise Suppression by Spectral Magnitude Estimation - Mechanism and Theoretical Limits
-”, Signal Processing, Vol. 8, pp. 387-400, Elsevier, 1985.
[5] Weinstein E., Feder M., and Oppenheim A. V. “Multi-Channel Signal Separation by Decorrelation”,
IEEE Trans. Speech Audio Processing, vol 1, no.4, pp. 405-413, April 1993.
[6] Yellin D., Weinstein E., “Multichannel Signal Separation: Methods and Analysis”, IEEE Trans.
Signal Processing, vol. 44, no1, pp. 106-118, 1996.
[7] Lambert R., Bell A. “Blind Separation of Multiple Speakers in a Multipath Environment”, Proc.
ICASSP 97, 1997, pp. 423-426.
[8] Amari S., Douglas C. S., Cichocki A. and Yang H. H., “Novel On-Line Adaptive Learning
Algorithms for Blind Deconvolution Using the Natural Gradient Approach”, Proc. 11th IFAC
Symposium on System Identification, vol. 3, pp. 1057-1062, Kitakyushu City, Japan, July 1997.
[9] Smaragdis P., “Blind separation of convolved mixtures in the frequency domain”, Neurocomputing,
vol. 22, pp. 21-34, 1998.
[10] Benesty J., Makino S., Chen J., "Speech Enhancement", Springer Verlag, May 2005.
[11] Cichocki A., Amari S., "Adaptive Blind Signal and Image Processing: Learning Algorithms and
Applications", John Wiley & Sons, May 2002. ISBN 0-471-60791-6.
[12] Visser, E., Lee, T.-W., “Blind Source Separation in Mobile Environments using A Priori
Knowledge”, ICASSP 2004, III 893-896, Montreal, May 2004.
[13] Hyvärinen, A. “Fast and Robust Fixed-Point Algorithms for Independent Component Analysis”,
IEEE Transactions on Neural Networks 10(3):626-634, 1999.
[14] Hyvärinen, A. and Oja, E. “A Fast Fixed-Point Algorithm for Independent Component Analysis”
Neural Computation, 9(7):1483-1492, 1997.
[15] Schobben D., Torkkola K., and Smaragdis P., "Evaluation of Blind Signal Separation Methods",
Proceedings Int. Workshop Independent Component Analysis and Blind Signal Separation, Aussois,
France, 1999.
[16] Schobben D. E., “Real-Time Adaptive Concepts in Acoustics: Blind Signal Separation and
Multichannel Echo Cancellation”, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001.
[17] Berouti M., Schwartz R., and Makhoul J., “Enhancement of Speech Corrupted by Acoustic
Noise”, Proc. IEEE Conf. ASSP, pp. 208-211, April 1979.
[18] Parra L., "An Introduction to Independent Component Analysis and Blind Source Separation", Part
of a course on Neural Network in the EE department in Princeton Univ., November 1998.
[19] Schmitt S., Sandrock M., Cronemeyer J., ”Single Channel Noise Reduction for Hands Free
Operation in Automotive Environments”, AES 112th Convention, Munich, Germany May, 2002.
[20] Grbić N., Tao X.-J., Nordholm S., Claesson I., "Blind signal separation using overcomplete
subband representation", IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, 2001.
[21] Parra L. C., "Steerable Frequency-Invariant Beamforming for Arbitrary Arrays", Journal of the
Acoustical Society of America, 119 (6), pp. 3839-3847, June 2006.
[22] Parra L. C., "Least squares frequency invariant beamforming", IEEE Workshop on Applications of
Signal Processing to Audio and Acoustics, Mohonk, October 2005, pp. 102-105.
[23] Nelson P. A., Orduna-Bustamante F. and Hamada H., "Inverse filter design and equalization zones
in multichannel sound reproduction," IEEE Trans. Speech Audio Process. 3(3), 185–192 (1995).