A speech recognition-based telephone auto-attendant

by

G F v B van Leeuwen

Submitted in partial fulfilment of the requirements for the degree
Master of Engineering (Computer Engineering)
in the Faculty of Engineering
UNIVERSITY OF PRETORIA

© University of Pretoria
Summary

Keywords: speaker-independent speech recognition, telephone auto attendant, dialogue, grammar, whole word hidden Markov modelling, Mel frequency cepstral coefficients, Viterbi search, level building algorithm, noise compensation

This dissertation details the implementation of a real-time, speaker-independent telephone auto attendant from first principles on limited quality speech data. An auto attendant is a computerized agent that answers the phone and, through the use of speech recognition technology, switches the caller through to the desired person's extension after conducting a limited dialogue to determine the wishes of the caller. The platform is a computer with a telephone interface card. The speech recognition engine uses whole word hidden Markov modelling, with a limited vocabulary and a constrained (finite state) grammar. The feature set used is based on Mel frequency spaced cepstral coefficients. The Viterbi search is used together with the level building algorithm to recognise speech within the utterances. Word-spotting techniques, including a "garbage" model, are used. Various techniques compensating for noise and a varying channel transfer function are employed to improve the recognition rate. An Afrikaans conversational interface prompts the caller for information. Detailed experiments illustrate the dependence and sensitivity of the system on its parameters, and show the influence of several techniques aimed at improving the recognition rate.
Samevatting

Sleutelwoorde: spreker-onafhanklike spraakherkenning, telefoonoutomaat, dialoog, grammatika, heelwoord verskuilde Markov modellering, Mel frekwensie kepstrale koeffisiente, Viterbi soektog, vlakbou algoritme, ruiskompensasie

Hierdie verhandeling beskryf die implementering van 'n intydse, spreker-onafhanklike telefoonoutomaat vanaf eerste beginsels. 'n Telefoonoutomaat is 'n geoutomatiseerde antwoorddiens wat die inbeller na 'n gevraagde persoon se uitbreiding deurskakel. Die outomaat bepaal met wie die inbeller wil praat deur 'n beperkte dialoog te voer en van spraakherkenningstegnologie gebruik te maak. Die implementeringsplatform is 'n rekenaar met 'n telefoon koppelvlakkaart. Die spraakherkenner gebruik heelwoord verskuilde Markov modelle met 'n beperkte woordeskat en 'n begrensde grammatika, en probeer om relatief swak kwaliteit spraak te herken. Die kenmerke van die spraak wat gebruik word vir herkenning is Mel frekwensie-gespasieerde kepstrale koeffisiente. Die Viterbi soektog saam met die vlakbou ("level building") algoritme word gebruik om die woorde te herken. Verskeie tegnieke word gebruik om te kompenseer vir ruis en 'n varierende kanaal. 'n Afrikaanse dialoog (spraakkoppelvlak) lei die oproepvloei. Eksperimentele resultate wat die stelsel se afhanklikheid en sensitiwiteit vir sy parameters beskryf, word gegee. Die resultate dien ook om die invloed van die verskillende tegnieke wat gebruik is om die herkenner te verbeter, te kwantifiseer.
Acknowledgements
I would like to thank the following people for their help and support, without which
this project would not have been possible:
Chapter 1
Introduction
The system implemented and studied in this dissertation is a telephone auto-attendant.
The definition of an auto-attendant in this context is a system that is able to attend to incoming telephone calls without human intervention. The specific task undertaken is that of an auto-receptionist, in other words a system that, with the aid of a conversational interface, tries to determine with whom a caller wishes to speak, and puts the call through if a name was successfully negotiated. The conversational interface is implemented in the Afrikaans language spoken in South Africa.
The system is implemented on a personal computer equipped with an industry standard
telephone line interface. The telephone line interface digitises audio waveforms from
the analogue telephone line so that the speech waveforms can be processed by the
computer.
Speech recognition technology lies at the heart of the system, enabling the transformation of the digitised speech waveforms into representations the software can understand and interpret. Conversational feedback in the form of pre-recorded questions and responses is provided.
The auto-attendant was developed as a real-world implementation of leading edge speech recognition theory. The motivations for this work are:
• to determine the performance of a speech recognition system with the low bandwidth audio provided by the telephone system,
• to show that current computer hardware technology is fast enough for real-time
speech processing and recognition, and
Some of the characteristics of a system like this are not only that there are in the order of 50 different vocabulary words to be recognised, but also that, because these words are the names of people (with whom the callers are often unfamiliar), the pronunciation of the names varies greatly. A substantial amount of data is therefore necessary to train such a system. Even though many of the pronunciations are clearly incorrect, they were the natural response of the caller reading the name of the callee, and must therefore be regarded as valid pronunciations.
Similar work has been done in the past, notably the last few years since computer hardware has started to approach acceptable performance for real-time speech recognition.
Section 1.2 details these developments.
A system must be implemented which can interface a computer with a telephone line. This system must be able to play back and record data to and from the telephone line. A conversational interface must guide the caller into giving the name of a person he wishes to speak to (the system must thus act as a telephone auto attendant). The name must be recognised from the incoming wave data, and, if the computer has a sufficiently high confidence that the recognition was correct, the extension number of the person must be looked up, and the call must be forwarded to that number. If the recognition confidence is not high enough, the system must prompt the caller into confirming (medium confidence) or repeating (low confidence) the answer.
Experiments need to be performed off-line on pre-recorded telephone speech data to
compare the performance of several techniques to enhance the speech recognition rate.
A large number of projects dealing with telephone speech recognition and the development of auto attendants exist worldwide:

Chien and Wang [1] researched methods of improving automatic speech recognition for telephone quality speech, and developed the phone-dependent channel compensated hidden Markov model (PDCC-HMM) for speech recognition. Their method involves taking into account the average channel transfer function during training so that no additional computation is necessary when the system is in the on-line mode. For this, they assume that the overall transfer function of the telephone channel does not have a large variance.
Guojun et al. [2] developed an automatic telephone operator that routed calls based on recognizing the callee's name. They required a fixed format firstname-surname combination, and achieved 99% recognition in their preliminary results using a small data set.
Fraser et al. [3] developed a system named "Operetta" which had a more sophisticated conversational interface, allowing user-friendly call handling. Operetta could handle eight simultaneous telephone calls using low-cost technology.
Koumpis et al. [4] developed an auto attendant system for the Technical University of Crete. They used a phoneme based recogniser, and achieved 97.5% correct name retrieval for a dictionary of 350 names and services. Their system used the commercial Nuance Communications speech recognition engine, and the dictionary of names and services was created using strings of phonemes.
Kellner et al. [5] from Philips developed an automatic telephone switchboard and directory information system. This system combined state-of-the-art speech recognition and language understanding to deliver a system which could understand very complex dialogues, and provide the user with a variety of information or switch the call through to a desired party. With a database of over 600 names, they achieved a dialogue success rate (i.e. the dialogue concluded successfully, often with retry attempts included) of 95% overall.
Schramm et al. [6], building on the previously mentioned Philips work, developed an advanced system to deal with large scale directory assistance. A two-stage approach was used, where the first stage asked for a city name (in Germany), and the second stage prompted the user for a last name and first name. The largest of the directories was for the city of Berlin, containing 1.3 million entries. They showed that for cooperative users more than 90% of all simple requests can be automated successfully. If the last name was not recognised accurately, the caller would be asked to spell out the name.
After this dissertation was started, computers started becoming powerful enough to warrant commercial exploitation of speech recognition. The result is that several commercial software developers' toolkits (SDK's) have been developed, enabling application developers to build telephone based speech recognisers and interactive systems with minimum effort. At least two such SDK's are currently available, from Nuance¹ and Speechworks². However, none of the commercially available systems currently offer Afrikaans, or any other official South African language (other than English), as a language option.

¹ www.nuance.com
² www.speechworks.com
1.3
Approach
Our group decided to use the hardware of an industry leader in telephone speech interfacing, to ensure reliability and quality of speech data. A Dialogic D/21H card was used together with a Pentium 133 MHz class computer for the hardware platform of our system.

A decision was made to implement the speech recognition using whole word hidden Markov models, partly because no labeled phonemic database existed for the South African accent.
Initially it was decided to use and modify (as necessary) the OGI CSLU³ speech toolkit with the Dialogic card in the Solaris operating system. After encountering some difficulties, our group decided to implement our own object oriented speech toolkit, the Hidden Markov Toolkit for Speech Recognition (HMTSR⁴). During the course of this dissertation this toolkit was further enhanced by adding features such as support for additional wave file formats, support for creating and using finite state grammar networks, and the inclusion of duration modeling. Support for various noise compensation techniques was also added.

For the purpose of this dissertation, HMTSR had to be ported to the Microsoft Windows NT platform because of limitations in the Dialogic support for Solaris⁵. Software was then written to integrate the HMTSR toolkit with the Dialogic drivers, and to control the flow of the conversational interface and telephone line control with some state machines.
For the purpose of recording training data, the system was configured so that it did not perform recognition, but rather prompted callers to say the names of people. This training data was then labeled (i.e. the precise beginning and ending of each word stored along with a name) so that it could be used for training the hidden Markov model representations.

Testing of the system was performed off-line with pre-recorded utterances, so that any modifications to the system or parameters could be tested against consistent speech data.

³ Oregon Graduate Institute Center for Spoken Language Understanding. Toolkit available from website: http://cslu.cse.ogi.edu/toolkit/
⁴ See website http://www.ee.up.ac.za/ee/pattern...recognition..page/HMTSR/HMTSR-0.2.tgz
⁵ As well as the lack of support at that stage for Linux drivers.
This dissertation makes the following contributions to the field of automatic speech recognition:
• The process of building a speech recognition system from first principles is explained in detail.
• The performance sensitivity with regard to different techniques and system parameters is shown, giving a comparative basis of the merits of each approach.
• The system is the first of its kind known to be developed in the Afrikaans language.
• The HMTSR package which was developed by our group provided the functionality to train HMM models from speech data, and to perform a level-building search with an FSN grammar on a speech utterance. For the purposes of this dissertation, I added the capabilities to optionally perform Cepstral Mean Subtraction, Spectral Mean Subtraction, duration modeling (instead of using the state transition probability matrix), normalisation of the confidence measure (using a garbage model) and robust start/end point detection. Furthermore, I wrote a graphical program (using the Qt widget set in Linux) to create and manipulate the FSN grammars, and integrated the file format into HMTSR. I also co-authored⁶ a Windows program to ease labeling of the data, as well as providing the ability to compare labeled test data and the transcribed output of the recognition engine. I ported HMTSR to Microsoft Windows NT, and integrated it into software I developed to interface to the Dialogic card, which spawns the conversational interface state machine in a thread so that multiple simultaneous calls can be handled.

⁶ The other programmer was Janus Brink.
• Background Theory
This chapter provides the necessary technical information needed to understand the auto-attendant system. Since speech recognition is an application of pattern recognition, many of the fundamentals of pattern recognition apply. Feature extraction needs to be performed on the raw speech signal to provide distinctive information for the pattern recognition process. A suitable pattern comparison technique needs to be found, as well as a method for "training" the system with existing speech data. Several techniques for optimising system performance are also considered.
• System
The hardware and software architecture of the system is provided, from a top-down overview to implementation details and specific parameters employed.

• Experiments
A chapter on experimental work details the data sets used as well as experiments performed to optimise system parameters and to benchmark system performance. The performance sensitivity of the system to the different methods and parameters is determined from these experiments.
• Conclusion
A conclusion is given which summarizes the work done and provides details of future work that evolves from this dissertation.
• Appendix
The appendix contains detailed descriptions of the source code implementation.
Chapter 2
Background Theory
2.1
System overview
The framework of the telephone speech recognition engine is provided here, enabling
the reader to tie up the theory with the various system components.
The input to the system is analogue speech data from a telephone line. This input data
is converted into digital form by a special analogue to digital converter card adapted
to a telephone connection (see Figure 2.1).
This raw digital waveform data then needs to be analysed and processed to extract
only the distinguishing characteristics from the data, reducing the sheer volume of data
and easing the task of the classifier in the speech recognition algorithm. This process
is called feature extraction.
The system needs to be trained to recognise speech, and for this the features of a
training set are provided to a training algorithm. Once the system has been trained,
it can recognise utterances that are similar to the training set.
Figure 2.1: System overview, showing the telephone interface card, call forwarding via the telephone interface card, and the off-line mode.
Various preprocessing techniques can be utilised to improve the feature set, e.g. filtering out noise, and a few of these techniques are discussed. A conversational interface
prompts the caller into giving people's names, and a finite state grammar network assists the classifier in recognising the utterances.
If a callee name is correctly recognised,
the telephone call is forwarded to the appropriate telephone number via the PABX.
In off-line mode, pre-recorded training and testing data can be used to form a basis for
performing experiments to compare different speech recognition techniques. Some techniques to improve the efficiency (speed and accuracy) of the training and recognition
algorithms are also described in this chapter.
2.2
Telephone quality speech
The telephone system in South Africa utilises copper pairs to carry analogue signals from each telephone point to a Central Office. From here the telephone signals are usually digitised to 8 bit resolution at 8kHz sampling frequency and routed to the destination, where the digital signal is converted into an analogue signal again. This digitisation process leads to two problems: quantisation noise introduced by the 8 bit resolution, and an upper limit on the usable signal bandwidth imposed by the 8kHz sampling frequency.
Furthermore, due to a combination of filters, transformers and signal transducers in various parts of the telephone system, non-linearities and a lower bandwidth limit of about 300Hz are also imposed on the signal.

The effective voice channel of the telephone system can be approximated by a non-linear bandpass filter with lower and upper cut-offs of 300Hz and 3.7kHz respectively, and with 8-bit quantisation noise.
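For illustration, a minimal sketch of this channel approximation is given below. The fourth-order Butterworth band-pass filter and the simple linear 8-bit quantiser are assumptions made for the example, not details of the telephone network itself.

```python
# Sketch of the telephone channel approximation described above: a band-pass
# filter (300 Hz - 3.7 kHz) followed by 8-bit quantisation. The Butterworth
# filter order and the linear quantiser are illustrative assumptions.
import numpy as np
from scipy.signal import butter, lfilter

FS = 8000  # telephone sampling frequency in Hz

def telephone_channel(speech, fs=FS):
    # Band-pass filter approximating the 300 Hz - 3.7 kHz voice channel.
    low, high = 300.0 / (fs / 2), 3700.0 / (fs / 2)
    b, a = butter(4, [low, high], btype="band")
    filtered = lfilter(b, a, speech)

    # 8-bit quantisation: scale to [-1, 1] and round to 256 levels.
    filtered = filtered / (np.max(np.abs(filtered)) + 1e-9)
    return np.round(filtered * 127) / 127.0

if __name__ == "__main__":
    t = np.arange(0, 1.0, 1.0 / FS)
    tone = np.sin(2 * np.pi * 440 * t)   # stand-in for a speech signal
    print(telephone_channel(tone)[:5])
```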
2.3
Cepstra as features for speech recognition
Due to the local stationarity of the speech signal, short-time spectral estimates of a speech signal are used in speech recognition applications as a basis for extracting features from the continuous signal (from Rabiner et al. [7]). A short-time spectrum is normally obtained by placing a data window on the sample sequence, and the resultant spectral estimate is considered a snapshot of the speech characteristics at a particular instant t. This method effectively discretises the continuous speech signal. This spectral representation is denoted by S(ω, t) to emphasise the time variability of the speech. The data window slides across the signal sequence, producing a series of short-time spectral representations {S(ω, t)}.
A relatively robust, reliable feature set for speech recognition can be found by computing the cepstral coefficients of a speech window. The complex cepstrum of a signal is defined as the Fourier transform of the log of the signal spectrum. For a power spectrum (magnitude-squared Fourier transform) S(ω), which is symmetric with respect to ω = 0 and is periodic for a sampled data sequence, the Fourier series representation of log S(ω) can be expressed as

log S(ω) = Σ_{n=−∞}^{∞} c_n e^{−jnω}.
The cepstral coefficients are therefore the coefficients of the Fourier transform representation of the log magnitude spectrum. This representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given analysis frame. The cepstral coefficients thus give a feature vector for a given speech window. The lower index coefficients represent the spectral envelope, whereas the higher index coefficients represent the pitch and excitation signals [8].
The following equation shows that the c₀ coefficient is directly proportional to the energy underneath the log power spectrum:

c₀ = (1 / 2π) ∫_{−π}^{π} log S(ω) dω.

The variances of the cepstral coefficients are essentially inversely proportional to the square of the coefficient index (E{c_n²} ∝ 1/n²).
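As a concrete illustration of the cepstral analysis described above, the following sketch computes real cepstral coefficients of a single speech frame via the inverse FFT of the log power spectrum. The frame length, Hamming window and number of retained coefficients are assumptions made for the example.

```python
# Sketch: real cepstral coefficients of one speech frame, computed as the
# inverse DFT of the log power spectrum. Frame length, Hamming window and the
# number of retained coefficients are illustrative choices only.
import numpy as np

def frame_cepstrum(frame, num_coeffs=13):
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)  # log S(w)
    cepstrum = np.fft.irfft(log_power)                  # Fourier coefficients c_n
    return cepstrum[:num_coeffs]  # low-index coefficients: spectral envelope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_frame = rng.standard_normal(256)   # stand-in for 32 ms of 8 kHz speech
    print(frame_cepstrum(fake_frame))
```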
Spectral transitions are believed to play an important role in human speech perception.
This leads to the idea [8] that an improved representation can be obtained by extending
the analysis to include information about the temporal cepstral derivative. Both first
and second order temporal derivatives have been investigated and found to improve
the performance of speech-recognition systems. To introduce temporal order into the
cepstral representation, we denote the mth cepstral coefficient at time t by cm(t), where
t refers to the analysis speech frame in practice. The cepstral time derivative is approximated as follows[7, p. 116]: The time derivative of the log magnitude spectrum
has a Fourier series representation of the form
∂/∂t [ log |S(e^{jω}, t)| ] = Σ_{m=−∞}^{∞} (∂c_m(t)/∂t) e^{−jωm}.

Since c_m(t) is a discrete time representation, using a first- or second-order difference equation is inappropriate to approximate the derivative, as it is too noisy. The partial derivative ∂c_m(t)/∂t can be approximated by an orthogonal polynomial fit (least squares estimate) over a finite length window:

∂c_m(t)/∂t ≈ Δc_m(t) = µ Σ_{k=−K}^{K} k · c_m(t + k),

where µ is an appropriate normalization constant and (2K + 1) is the number of frames over which the computation is performed. Typically a value of K = 3 has been found appropriate for computation of the first-order temporal derivative.
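A small sketch of this windowed regression is given below; setting the normalisation constant µ to 1/Σk² and using K = 3 are assumptions made for the example.

```python
# Sketch: first-order temporal derivatives (delta cepstra) computed with the
# least-squares regression over a (2K+1)-frame window described above.
# Setting mu = 1 / sum(k^2) and K = 3 are illustrative assumptions.
import numpy as np

def delta_features(cepstra, K=3):
    """cepstra: array of shape (num_frames, num_coeffs)."""
    num_frames, _ = cepstra.shape
    mu = 1.0 / sum(k * k for k in range(-K, K + 1))
    padded = np.pad(cepstra, ((K, K), (0, 0)), mode="edge")
    deltas = np.zeros_like(cepstra)
    for t in range(num_frames):
        for k in range(-K, K + 1):
            deltas[t] += mu * k * padded[t + K + k]
    return deltas

if __name__ == "__main__":
    feats = np.random.default_rng(1).standard_normal((100, 13))
    print(delta_features(feats).shape)  # (100, 13)
```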
2.3.1
Optimal encoding of temporal information
Other research by Milner [9] indicates that a polynomial approximation for temporal derivatives does not optimally encode the temporal information. He calculates a temporal transformation matrix H, so that the dot product of M (a matrix consisting of M rows of successive cepstral column feature vectors) and H gives the generalised feature matrix V.

He shows that it is important that the columns of H are orthogonal to decorrelate the columns of V (thus extracting temporal information). He goes on to show several different H matrices that can effectively perform transformations to M to extract temporal information. For example, to generate static, first and second order cepstral derivatives (based on, in his case, a regression analysis), H is a matrix with three columns of regression weights applied across a stack of successive frames, with the stack width, M, set to 5. The 0th column of the resulting feature matrix, V,
is the static cepstrum, the 1st column is the first derivative, and the 2nd column gives
the second derivative.
His research shows that to optimally encode the temporal information, the covariance matrix of M needs to be diagonalised, and this is achieved by the Karhunen-Loeve Transform (KLT) matrix. By pooling together many examples from the speech training data, the covariance of the stacked cepstral coefficients can be determined. The KLT transform matrix, H, is given by the eigenvectors of the covariance matrix. The eigenvectors which form the KLT basis functions (the columns of the H matrix) are ranked in terms of their eigenvalues (i.e. the 0th KLT basis function corresponds to the eigenvector which has the largest eigenvalue).

Using this KLT transform was shown by Milner to give a performance increase over using standard first and second order temporal derivatives, but unfortunately this research was discovered after the experiments in Chapter 4 had already been completed. This method can be considered for future work. Calculation of the KLT transform matrix is a computationally intensive task, but it only needs to be performed once on the training data, so it does not affect the real-time operation of the system.
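The sketch below illustrates how such a KLT matrix could be estimated from training data: stack neighbouring cepstral frames, pool the stacks, and take the eigenvectors of their covariance matrix ranked by eigenvalue. The stack width and the number of retained basis functions are assumptions made for the example.

```python
# Sketch: estimating a KLT (Karhunen-Loeve) transform matrix H from stacked
# cepstral frames, as described above. Stack width and the number of retained
# basis functions are illustrative assumptions.
import numpy as np

def klt_matrix(cepstra, stack_width=5, num_basis=3):
    """cepstra: (num_frames, num_coeffs); returns H of shape (stack_width, num_basis)."""
    half = stack_width // 2
    stacks = []
    for t in range(half, len(cepstra) - half):
        # one stack = the same coefficient taken from 'stack_width' neighbouring frames
        stacks.append(cepstra[t - half:t + half + 1, :].T)   # (num_coeffs, stack_width)
    pooled = np.concatenate(stacks, axis=0)                  # pool over coefficients and time
    cov = np.cov(pooled, rowvar=False)                       # (stack_width, stack_width)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                        # rank by eigenvalue, largest first
    return eigvecs[:, order[:num_basis]]

if __name__ == "__main__":
    feats = np.random.default_rng(2).standard_normal((500, 13))
    print(klt_matrix(feats).shape)  # (5, 3)
```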
Psychophysical studies [7, p. 183] have shown that human perception of the frequency content of sounds, either for pure tones or for speech signals, does not follow a linear scale, e.g. a 4kHz sine audio wave is not perceived as being twice as high as a 2kHz tone. This has led to the idea of defining the subjective pitch of pure tones. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the "mel" scale. As a reference point, the pitch of a 1kHz tone, 40dB above the perceptual hearing threshold, is defined as 1000 mels. Other subjective pitch values are obtained by adjusting the frequency of a tone such that it is half or twice the perceived pitch of a reference tone (with a known mel frequency).
Figure 2.2: Subjective pitch (in mels) as a function of frequency, plotted against a linear frequency scale (top) and against a logarithmic frequency scale (bottom).
Figure 2.2 shows subjective pitch as a function of frequency. The upper curve shows
the relationship of the subjective pitch to frequency on a linear scale; it shows that
subjective pitch in mels increases less rapidly as the stimulus frequency is increased
linearly. The lower curve, on the other hand, shows the subjective pitch as a function
of the logarithmic frequency; it indicates that the subjective pitch is essentially linear
with the logarithmic frequency beyond about 1000Hz.
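In software the mel scale is commonly approximated by the analytic formula mel(f) = 2595 log₁₀(1 + f/700). The sketch below uses this widely used approximation, which is an assumption for the example rather than a formula given in the text above.

```python
# Sketch: the commonly used analytic approximation of the mel scale and its
# inverse. The 2595/700 form is a standard approximation, not a formula taken
# from the dissertation itself.
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

if __name__ == "__main__":
    for f in (500, 1000, 2000, 4000):
        print(f, "Hz ->", round(hz_to_mel(f), 1), "mels")
    # The 1 kHz reference maps to roughly 1000 mels, as in the definition above.
```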
Hidden Markov models (HMM) [7,10] of the acoustic feature vectors form the basis of
most modern speech recognition systems. They have a number of desirable qualities
which will be detailed in this section.
Hidden Markov modelling finds its basis in
statistical signal modelling, which is a widely used practice in pattern recognition. The
HMM method provides a natural and highly reliable way of modelling and recognizing
speech for a wide range of applications and integrates well into systems incorporating
both task syntax and semantics. HMMs will be discussed only in the context of speech
recognition, although they have much wider application, and were in fact not originally
developed for speech recognition in particular.
An observable Markov model is a system that may be described as being in one of a set of N distinct states, which at regularly spaced, discrete times may undergo a change of state (possibly back to the same state) according to a set of probabilities associated with each state (Figure 2.3).
The time instants associated with the state changes are denoted by t = 1,2, ... , and
the actual state at time t is given by qt. A state in a discrete-time, first-order Markov
chain has a probabilistic dependence only on the preceding state, i.e.,
P[qt = j | qt−1 = i, qt−2 = k, ...] = P[qt = j | qt−1 = i].

Only the Markov processes where this probability is independent of time are considered for speech recognition (i.e. the probability to go to the next state is based only on the current state and not on the history). Thus, the transition probabilities aij are given by

aij = P[qt = j | qt−1 = i],   1 ≤ i, j ≤ N,   (2.6)

with

aij ≥ 0,   ∀ j, i   (2.7)

Σ_{j=1}^{N} aij = 1,   ∀ i   (2.8)

to conform to standard stochastic constraints.
The model described above is an observable Markov process because the output of the process is the set of states at each instant of time, where each state corresponds to an observable event. An example of such a Markov process is a simple traffic light, where the observable states are [green, orange, red],¹ and a possible transition matrix A could be given by

A = {aij} = [ 0.6  0.4  0
              0    0.4  0.6
              0.5  0    0.5 ],

indicating that green cannot go to red, orange cannot go to green, and red cannot go to orange. Given this model, we can determine things such as the probability of a certain sequence of events, or what the probability is of a system staying in a certain state for a specified time.
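A minimal sketch of such a calculation is given below: it scores an observed state sequence under the traffic light transition matrix above, assuming for illustration that the chain starts in the green state with probability 1.

```python
# Sketch: probability of an observed state sequence under the traffic-light
# Markov model above. Starting in 'green' with probability 1 is an assumption
# made for the example.
STATES = ["green", "orange", "red"]
A = [
    [0.6, 0.4, 0.0],   # green  -> green/orange/red
    [0.0, 0.4, 0.6],   # orange -> green/orange/red
    [0.5, 0.0, 0.5],   # red    -> green/orange/red
]

def sequence_probability(sequence, start_state="green"):
    prob = 1.0 if sequence[0] == start_state else 0.0
    for prev, cur in zip(sequence, sequence[1:]):
        prob *= A[STATES.index(prev)][STATES.index(cur)]
    return prob

if __name__ == "__main__":
    print(sequence_probability(["green", "green", "orange", "red", "red"]))
    # 0.6 * 0.4 * 0.6 * 0.5 = 0.072
```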
Hidden Markov models extend observable Markov models by including the case in which the observation is a probabilistic function of the state instead of a directly observable event. That is, the resulting model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (it is hidden), but can only be observed through another set of stochastic processes that produce the sequence of observations. An example related to the speech recognition problem at hand can be given as follows: imagine that every possible sound that can be made by humans can be described by a finite set of numbers, each number corresponding to a single sound.²

A hidden Markov model for a word ideally consists of as many states as are required to describe each transition from one sound to another through the word from beginning to end. The observation sequence produced by this model is a series of equally time spaced numbers.

¹ We could extend this to include flashing red for error, or blank for failure.
² Note that this can be approximated in reality by using a vector quantizer codebook, although this does result in serious performance degradation compared to continuous densities.
To match a particular word model to an observation sequence entails finding the probability of observing that sequence given a model with observation output probability
at each state, and a given state transition matrix. Note that, as in this example, most
HMM based speech recognizers use a strict left to right transition model (also called a
Bakis model). This left-right model allows for auto-transitions
and transitions to the
next sequential state only. This configuration matches the left-right temporal property
of a spoken word.
• N, the number of states in the model. The number of states in a model cannot necessarily be deduced from the observation sequence by looking at the number of transitions. The states are labeled {1, 2, ..., N}, and the state at time t is denoted by qt.
• M, the number of distinct observation symbols per state, or, in other words, the discrete alphabet size. The individual symbols are labeled V = {v₁, v₂, ..., v_M}.

• A = {aij}, the state transition probability distribution, where

aij = P[qt+1 = j | qt = i],   1 ≤ i, j ≤ N,

restricted here so that aij = 0 except for j = i and j = i + 1. This describes a left-right model without skipping transitions, i.e. the state can only go to the next sequentially numbered state, or stay within itself, as is commonly used in speech recognition. As will be seen in Section 2.9, a transition matrix is not optimal for speech recognition.
• The initial state distribution, where

πi = 1 if i = 1, and πi = 0 if i ≠ 1,

for a strict left-right model, to ensure that the model starts in the first state. (The probability for the model to be in state i initially is given by πi.)
As has previously been mentioned, observations with a discrete probability density lead
to a serious performance degradation for speech recognition, since we are dealing with
a continuous signal from which feature vectors can be extracted.
It would be advantageous to use HMMs with continuous observation densities to model the continuous signal representations directly.
To use a continuous observation density, a probability density function (pdf) must
be chosen which has parameters that can be estimated consistently from the training
data. The most general representation of the pdf, for which the parameter estimation
equations can be expressed analytically, is a finite mixture of Gaussian pdf's of the
form
M
bj(o)
= L CjkW(O,
k=l
/Ljb Ujk),
where
° is the
observation vector being modelled,
Cjk
is the mixture coefficient for the
kth mixture in state j and W is a Gaussian probability density function with mean
vector
#-tjk
and covariance matrix Ujk for the kth mixture component in state j. The
mixture gains
Cjk
satisfy the stochastic constraint
The pdf in Eq. (2.10) can be used to approximate with arbitrary precision any finite
continuous density function, making it suitable for the speech recognition problem.
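The sketch below evaluates such a mixture density for a single observation vector. The diagonal-covariance simplification and the example parameters are assumptions made for illustration.

```python
# Sketch: evaluating the Gaussian-mixture state output density b_j(o) of
# Eq. (2.10) for one observation vector, using diagonal covariances.
# The diagonal-covariance simplification is an assumption for the example.
import numpy as np

def mixture_density(o, weights, means, variances):
    """o: (D,); weights: (M,); means, variances: (M, D). Returns b_j(o)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))
        expo = np.exp(-0.5 * np.sum((o - mu) ** 2 / var))
        total += c * norm * expo
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    D, M = 13, 4
    weights = np.full(M, 1.0 / M)            # mixture gains summing to 1
    means = rng.standard_normal((M, D))
    variances = np.ones((M, D))
    print(mixture_density(rng.standard_normal(D), weights, means, variances))
```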
Three problem formulations exist with regard to HMMs that are of interest to the speech recognition problem:

1. Given an observation sequence O = (o₁ ... o_T) and a number of models, how does one compute the probability of each model, so that the most likely model to have produced that observation sequence can be found? This has direct application in recognizing a feature sequence of a spoken word.

2. Given the observation sequence and a model, how does one choose a state sequence q = (q₁ q₂ ... q_T) that best explains the observation (i.e. the optimal state sequence)? There is no "correct" state sequence, so an optimality criterion is used to solve this problem. This has application in continuous speech recognition to find optimal state sequences, is discussed in Section 2.7, and forms part of both the training and the recognition.

3. How does one adjust the model parameters to maximise the probability of an observation sequence given the model, P(O|λ)? The observation sequence used
to adjust the model parameters is called a "training" sequence. This is an important problem for speech recognition since it allows one to optimally adapt model
parameters to observed training data, giving the best model for a real speech
signal during training.
Section 2.6 shows how the "expectation-maximization"
method can be used to iteratively select these model parameters,
given one or
more training sequences.
2.6
Expectation-Maximisation
The most difficult problem in HMMs is to determine a method to adjust the model
parameters to satisfy a certain optimization criterion.
There is no analytical way to
solve for these parameters to maximize the probability of the observation sequence, but
an iterative procedure known as the Baum- Welch [7] method can be used to choose
the maximum likelihood (ML) parameters.
We define ξt(i, j) as the probability of being in state i at time t, and state j at time t + 1, given the model and the observation sequence; it can be written as

ξt(i, j) = P(qt = i, qt+1 = j | O, λ)   (2.12)
         = αt(i) aij bj(ot+1) βt+1(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} αt(i) aij bj(ot+1) βt+1(j),   (2.13)

where αt(i) is defined as the probability of the partial observation sequence o₁o₂...ot (until time t) and state i at time t, and βt(i) is defined as the probability of the partial observation sequence from t + 1 to the end, given state i at time t.

If we define γt(i) as the probability of being in state i at time t given the entire observation sequence and the model, we can relate γt(i) to ξt(i, j) by summing over j, giving

γt(i) = Σ_{j=1}^{N} ξt(i, j).   (2.14)
If we sum γt(i) over the time index t, we get a quantity that can be interpreted as the expected (over time) number of transitions made from state i (if we exclude t = T). Similarly, summation of ξt(i, j) over t = 1 to t = T − 1 can be interpreted as the expected number of transitions from state i to state j. Using this knowledge, we can obtain a set of reasonable reestimation formulas for π, A and B:

π̄i = expected frequency in state i at time t = 1 = γ₁(i)   (2.15)

āij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
    = Σ_{t=1}^{T−1} ξt(i, j) / Σ_{t=1}^{T−1} γt(i)   (2.16)

b̄j(k) = (expected number of times in state j and observing symbol vk) / (expected number of times in state j)
      = Σ_{t=1, ot=vk}^{T} γt(j) / Σ_{t=1}^{T} γt(j)   (2.17)
If we define the current model as λ = (A, B, π) and use it to compute the right hand side of Eqs. (2.15)-(2.17), and we define the reestimated model as λ̄ = (Ā, B̄, π̄) as determined from the left-hand side of Eqs. (2.15)-(2.17), then it has been proven by Baum et al. that either the initial model λ defines a critical point of the likelihood function, in which case λ̄ = λ, or model λ̄ is more likely than model λ in the sense that P(O|λ̄) > P(O|λ); i.e. we have found a new model from which the observation sequence is more likely to have been produced. If we iteratively use this procedure, we can improve the probability of O being observed from the model until some limiting point is reached. The final result of this reestimation procedure is a maximum likelihood estimate of the HMM.
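A compact sketch of the transition reestimation step is shown below, following Eqs. (2.13), (2.14) and (2.16); it assumes the forward (α) and backward (β) probabilities have already been computed, and works with unscaled probabilities purely for clarity.

```python
# Sketch: one Baum-Welch update of the transition matrix from precomputed
# forward (alpha) and backward (beta) probabilities, following Eqs. (2.13),
# (2.14) and (2.16). Unscaled probabilities are used purely for clarity.
import numpy as np

def reestimate_transitions(alpha, beta, A, B_obs):
    """alpha, beta: (T, N); A: (N, N); B_obs[t, j] = b_j(o_t). Returns A-bar."""
    T, N = alpha.shape
    xi_sum = np.zeros((N, N))
    gamma_sum = np.zeros(N)
    for t in range(T - 1):
        # xi_t(i, j) proportional to alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        xi = alpha[t, :, None] * A * B_obs[t + 1, None, :] * beta[t + 1, None, :]
        xi /= xi.sum() + 1e-300
        xi_sum += xi
        gamma_sum += xi.sum(axis=1)          # gamma_t(i) = sum_j xi_t(i, j)
    return xi_sum / (gamma_sum[:, None] + 1e-300)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    T, N = 20, 3
    alpha, beta = rng.random((T, N)), rng.random((T, N))
    A = np.array([[0.7, 0.3, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]])
    B_obs = rng.random((T, N))
    print(reestimate_transitions(alpha, beta, A, B_obs))
```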
μ̄jk = Σ_{t=1}^{T} γt(j, k) ot / Σ_{t=1}^{T} γt(j, k),   (2.21)

where γt(j, k) is the probability of being in state j at time t with the kth mixture component accounting for ot, i.e.

γt(j, k) = γt(j) · [ cjk N(ot, μjk, Ujk) / Σ_{m=1}^{M} cjm N(ot, μjm, Ujm) ].
These equations can be interpreted as follows: The reestimation formula for c̄jk is the ratio between the expected number of times the system is in state j using the kth mixture component, and the expected number of times the system is in state j. The reestimation formula for the mean vector μ̄jk weights each numerator term by the observation, thereby giving the expected value of the portion of the observation vector accounted for by the kth mixture component. The covariance matrix Ūjk is also weighted by the observation vector.
As previously mentioned, left-right or Bakis models are commonly used for speech recognition. Due to the transient nature of the states within a model of this form, only a small number of observations is available for any state. Therefore, to have sufficient data to make reliable estimates of all model parameters, one has to use multiple observation sequences.
Since the reestimation
formulas rely on previous values of the HMM parameters,
a
question that arises is "How does one choose initial estimates of the HMM parameters
so that the local maximum obtained after reestimation is equal to or as close as possible
to the global maximum of the likelihood function?"
Experience has shown that either random (subject to the stochastic and nonzero constraints) or uniform initial estimates of the π and A parameters are adequate for giving estimates of these parameters in almost all cases. However, for the B parameters, experience has shown that good initial estimates are essential in the continuous-distribution case [7, p. 370].
A procedure for providing good initial estimates of the B parameters has been devised
and is called the segmental K-means training procedure [7, p. 382]. This iterative
procedure also requires an initial model estimate but this can be chosen either randomly
or based on any available model appropriate to the data.
The set of training observation sequences is segmented into states, based on the current model λ. This segmentation is achieved by finding the optimal state sequence via the Viterbi algorithm, and then backtracking along the optimal path. The result of segmenting each of the training sequences is, for each of the N states, a maximum likelihood estimate of the set of observations that occur within each state j according to the current model.
Since we are using continuous observation densities, a segmental K-means procedure
is used to cluster the observation vectors within each state j into a set of M clusters
using a Euclidean distortion measure. Each cluster represents one of the M mixtures
of the bj (Ot) density. From the clustering, a set of updated model parameters is derived
as follows:
cjm = number of vectors classified in cluster m of state j, divided by the number of vectors in state j

μjm = sample mean of the vectors classified in cluster m of state j

Ujm = sample covariance matrix of the vectors classified in cluster m of state j
Based on this state segmentation, when using a transition matrix, updated estimates of the aij coefficients can be obtained by counting the number of transitions from state i to state j and dividing it by the number of transitions from state i to any state (including itself).
An updated model λ̄ is obtained from the new model parameters, and the formal reestimation procedure (Equations (2.20), (2.21) and (2.22)) is used to reestimate all model parameters. The resulting model is then compared to the previous model by computing a distance score that reflects the statistical similarity of the HMMs. If the model distance exceeds a threshold, then the old model λ is replaced by the reestimated model and the overall training loop is repeated. If the model distance falls below the threshold, then model convergence is assumed and the final model parameters are saved.
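The sketch below illustrates the clustering step of this segmental K-means initialisation for a single state: the feature vectors Viterbi-assigned to the state are clustered into M groups, and the mixture weights, means and covariances of Eq. (2.10) are read off the clusters. Using scikit-learn's KMeans and diagonal covariances are implementation conveniences assumed for the example.

```python
# Sketch: initialising the Gaussian mixture of one HMM state from the feature
# vectors assigned to that state by a Viterbi segmentation (segmental K-means).
# scikit-learn's KMeans and diagonal covariances are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def init_state_mixture(state_vectors, num_mixtures=4):
    """state_vectors: (num_vectors, dim) features Viterbi-assigned to one state."""
    km = KMeans(n_clusters=num_mixtures, n_init=10, random_state=0).fit(state_vectors)
    weights, means, variances = [], [], []
    for m in range(num_mixtures):
        cluster = state_vectors[km.labels_ == m]
        weights.append(len(cluster) / len(state_vectors))   # c_jm
        means.append(cluster.mean(axis=0))                   # mu_jm
        variances.append(cluster.var(axis=0) + 1e-3)         # diagonal of U_jm
    return np.array(weights), np.array(means), np.array(variances)

if __name__ == "__main__":
    vecs = np.random.default_rng(5).standard_normal((200, 13))
    c, mu, var = init_state_mixture(vecs)
    print(c.sum(), mu.shape, var.shape)   # 1.0 (4, 13) (4, 13)
```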
The way in which the "optimal" state sequence associated with any given observation
sequence is calculated is dependent on the definition of the optimal state sequence, i.e.
an optimality
criterion needs to be specified. The most widely used criterion in speech
recognition is to maximize P(qIO, >'), which is equivalent to maximizing P(q, 01>'). A
formal technique called the Viterbi algorithm exists for finding this single best state
sequence, and is formulated
as follows [7]:
To find the single best state sequence, q = (q₁ q₂ ... q_T), for the given observation sequence O = (o₁ o₂ ... o_T), we need to define the quantity

δt(i) = max_{q₁, q₂, ..., qt−1} P[q₁ q₂ ... qt−1, qt = i, o₁ o₂ ... ot | λ],

that is, δt(i) is the highest probability along a single path at time t which accounts for the first t observations and ends in state i. By induction we get

δt+1(j) = [ max_{i} δt(i) aij ] bj(ot+1).   (2.25)

To retrieve the actual state sequence, we need to keep track of the argument that maximized Eq. (2.25) for each t and j. We can do this via the array ψt(j). The complete procedure for finding the best sequence can be stated as follows (using the log to reduce the number of multiplications required):

Preprocessing:
π̃i = log(πi),   1 ≤ i ≤ N
b̃i(ot) = log[bi(ot)],   1 ≤ i ≤ N, 1 ≤ t ≤ T

Initialization:
δ̃₁(i) = log(δ₁(i)) = π̃i + b̃i(o₁),   1 ≤ i ≤ N
ψ₁(i) = 0,   1 ≤ i ≤ N

Recursion (with ãij = log(aij)):
δ̃t(j) = max_{1≤i≤N} [ δ̃t−1(i) + ãij ] + b̃j(ot),   2 ≤ t ≤ T, 1 ≤ j ≤ N   (2.28)
ψt(j) = argmax_{1≤i≤N} [ δ̃t−1(i) + ãij ],   2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination:
P̃* = max_{1≤i≤N} [ δ̃T(i) ]
q*_T = argmax_{1≤i≤N} [ δ̃T(i) ]

The best state sequence is then recovered by backtracking, q*_t = ψt+1(q*_{t+1}) for t = T − 1, T − 2, ..., 1.

The preprocessing step needs to be performed only once and saved, thus its cost is negligible. The rest of the calculations required are in the order of N²T additions.
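A compact log-domain implementation of this procedure is sketched below; it follows the preprocessing, initialisation, recursion and termination steps above for a single left-right model, and the tiny example model at the end is purely illustrative.

```python
# Sketch: log-domain Viterbi search for the single best state sequence,
# following the preprocessing, initialisation, recursion and termination
# steps described above. The tiny example model at the bottom is illustrative.
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """log_pi: (N,); log_A: (N, N); log_B[t, i] = log b_i(o_t). Returns (score, path)."""
    T, N = log_B.shape
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]                       # initialisation
    for t in range(1, T):                              # recursion
        for j in range(N):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_B[t, j]
    path = [int(np.argmax(delta[T - 1]))]              # termination
    for t in range(T - 1, 0, -1):                      # backtracking
        path.append(int(psi[t, path[-1]]))
    return float(np.max(delta[T - 1])), path[::-1]

if __name__ == "__main__":
    log_pi = np.log([1.0, 1e-12, 1e-12])               # strict left-right start
    log_A = np.log(np.array([[0.6, 0.4, 1e-12],
                             [1e-12, 0.7, 0.3],
                             [1e-12, 1e-12, 1.0]]))
    log_B = np.log(np.random.default_rng(6).random((10, 3)))
    print(viterbi(log_pi, log_A, log_B))
```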
2.8
Level building algorithm
The Viterbi method on its own works well for doing isolated word recognition, i.e. matching a single spoken word to one of several word models. For many applications, notably those referred to as "command-and-control" applications, this approach works well and is appropriate. However, for the purpose of this dissertation, the recognition engine needs to be able to cope with a sequence of words from a limited recognition vocabulary, and thus using the Viterbi method alone will not suffice.

To formulate the problem at hand we need to make some definitions [7]. Assume that we are given the sequence of feature vectors (extracted from the speech signal, e.g. Mel-scaled cepstra) T = t(1), t(2), ..., t(M), and that we also have the HMM word models λv, 1 ≤ v ≤ V, for each of the V vocabulary words.
The connected word-recognition problem can be summarised by the question: Given a fluently spoken sequence of words, how can we determine the optimum match in terms of a concatenation of word models? That is, how do we find R* (the best possible sequence of reference patterns) that best matches T (the test pattern)?
• The number of words, L, in the string is not usually known (although often upper and lower bounds exist).

• The word utterance boundaries are unknown, except for the beginning of the first word and the end of the last word.³

• The word boundaries are fuzzy and not uniquely identifiable. It is difficult to automatically find accurate word boundaries because of sound coarticulation, i.e. the ending sound of one word is "morphed" into the beginning sound of the next word when spoken fluently.

• For a set of V word models, and for a given value of L (the number of words in the string, if known), there are V^L possible combinations of composite matching patterns; for anything but extremely small values of V and L the exponential number of composite matching patterns implies that the connected word-recognition problem cannot be solved by exhaustive means.

A non-exhaustive matching algorithm is required to solve the connected word-recognition problem efficiently. Fortunately algorithms exist with which to do this. In particular, the level building algorithm will be discussed in this text.

³ The beginning and end of the utterance could be silence or noise, but that can also be classified as a word model.
Assume there are L word patterns in R*, where L varies between the minimum and maximum possible values. The best possible sequence of reference patterns R* is given by a concatenation of L reference patterns:

R* = R_{q1} ⊕ R_{q2} ⊕ ... ⊕ R_{qL},

where ⊕ denotes concatenation.

An approach to calculate R* efficiently, called the Level Building algorithm, has been devised [7, pp. 400-422]. The basic approach is as follows: At the first "level", a Viterbi
alignment is performed with each reference model to the test sequence.
The range
of starting frames (initial state) is 1 to T - Nv, where Nv is the number of states in
each reference model, and T is the number of frames in the test sequence. The range
of ending frames (final state) for each Viterbi alignment is given by Nv to T. These
parallelogram style ranges result from the strict left-right limitation of the Bakis model
HMMs. This first level results in probability estimates at the final state for each model
at test frames Ni to T. Over all the models, a minimum starting frame for the next
level is determined by the model with the smallest number of states, say Nm. The
model with the highest probability at each point in the range Nm to T is stored along
with the probability value, and this vector is the result of the first "level." At the next
level, the calculation is picked up starting from time point Nm where the previous level
ended. Each reference model is again aligned with the rest of the test sequence, with
the probabilities accumulating from the best models at each point from the previous
level. After the second level, the new starting point will be 2Nm. At the end of each
level, the best scores and models are stored, as well as a backpointer indicating the
source point that value was calculated from. After the iteration has gone through all
the levels, the highest total accumulated probability at frame T over all levels is found
(since the best matching sequence can also be less than L concatenated word models).
By using the backpointer, the best reference frame sequence which resulted in this score can be found.
Figure 2.4 shows the resulting trellis shaped grid of results obtained for a particular reference model during the level building computation.

Figure 2.4: Trellis-shaped grid of scores for a single reference model during the level building computation, shown as a function of the test frame.

2.8.1
Frame synchronous approach
The standard level-building approach requires that the entire set of feature vectors T
be known before recognition can start, which implies that it is not possible to start the
recognition before the speaker has finished the utterance.
An alternative approach is
to use a method known as the frame synchronous level-building approach [11]. As its
name suggests, this method can do most of the calculations frame by frame, making it
useful for real-time implementations.
For the case of HMMs using Viterbi alignment to
eventually find the optimum path through the trellis of output observation probability
densities, this approach can be simplified to merely calculating the output probability
density for each incoming test frame for every state of each model, and storing the
partially calculated accumulated Viterbi score. These calculations form the bulk of the
computational
cost of this speech recognition approach, and can be done with each
incoming frame, assuming enough computational power exists in the processor. At the
end of the utterance the relatively low computational cost of backtracking and finding
the highest accumulated scores on each "level" can be done to find the best overall
matching sequence of concatenated word models.
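The per-frame update at the heart of this approach can be sketched as below: for every incoming feature vector, each model's partial Viterbi scores are advanced by one frame, so that only the backtracking remains once the utterance ends. The data structures and the stand-in emission scorer are assumptions made for the example.

```python
# Sketch: the per-frame score update of a frame-synchronous search. For each
# incoming feature vector, the partial log-Viterbi scores of every model are
# advanced by one frame; only backtracking is left for the end of the
# utterance. The data structures used here are illustrative assumptions.
import numpy as np

class FrameSynchronousScorer:
    def __init__(self, models):
        # models: list of (log_pi, log_A, emission_fn) per word HMM
        self.models = models
        self.scores = [None] * len(models)   # partial delta vectors, one per model

    def push_frame(self, feature):
        for m, (log_pi, log_A, emit) in enumerate(self.models):
            log_b = emit(feature)                       # log b_i(o_t) for every state
            if self.scores[m] is None:                  # first frame: initialisation
                self.scores[m] = log_pi + log_b
            else:                                       # later frames: recursion
                prev = self.scores[m]
                self.scores[m] = np.max(prev[:, None] + log_A, axis=0) + log_b

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    emit = lambda o: np.log(rng.random(3))              # stand-in emission scorer
    model = (np.log([1.0, 1e-12, 1e-12]),
             np.log([[0.6, 0.4, 1e-12], [1e-12, 0.7, 0.3], [1e-12, 1e-12, 1.0]]),
             emit)
    scorer = FrameSynchronousScorer([model])
    for _ in range(5):
        scorer.push_frame(rng.standard_normal(13))
    print(scorer.scores[0])
```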
2.9
Duration modelling
A major weakness of conventional HMMs is that they implicitly model state durations
by a Geometric distribution,
which is usually inappropriate.
Erroneous recognition
with HMM based systems is often associated with unreasonable occupancy durations
of the (phoneme-like) states from which the model is built up.
Each state in a left-right (Bakis) HMM has an auto-transition probability aii, and a probability to go to the next state. The sum of these two probabilities is necessarily 1. Assuming that the model stays in a state with probability aii, the probability of remaining in that state for a duration of exactly τ frames is the Geometric distribution

p(τ) = aii^{τ−1} (1 − aii).

Figure 2.5 shows the relative decay over time for two cases, one with a high auto-transition probability, and one with a low auto-transition probability. It is obvious that for a high auto-transition probability the decay is relatively slow, and this is in direct contrast with the actual probability of durations of acoustic events, which drops rapidly to zero when the duration is either too short or too long.
This exponential durational model imposes little limitation on how long an utterance can occupy a certain state in the model. This freedom often creates false alarms and/or substitutions by allowing an utterance to virtually skip over the unsuitable states, while lingering unreasonably long in more suitable states [12]. In an improved model for speech structure, duration is incorporated explicitly by specifying probability distribution functions for the occupancy durations of states. Although this intuitively will lead to better recognition results, computational and memory requirements may be adversely affected. A method for implementing duration modelling with negligible overhead needs to be used to keep the system fast enough for real-time recognition.

Figure 2.5: Decay of the implicit geometric state duration distribution over time, for a small and for a large initial (auto-transition) probability.
Several methods of state duration modelling have been proposed in the literature to
enhance the performance of HMM based speech recognition systems:
• The simplest state duration modelling technique is to have an upper and lower
bound on the allowed state durations, and to disallow paths in the Viterbi search
that exceed these specifications .
• In the Ferguson model [13] each state i has an associated discrete state duration probability density di(τ), τ = 1, 2, ..., τmax, where τmax is the largest duration allowed. Ferguson incorporated the estimation of di(τ) into the Baum-Welch reestimation algorithm. This method has two disadvantages, namely the excessive computational overhead that it imposes, and the excessive amount of training data that might be required to estimate all the duration parameters: each state i has τmax duration parameters, and sufficient statistics on each duration τ need to be collected at each state i so as to estimate di(τ) reliably.
• Russel & Moore [14] and Levinson [15] used parametric state duration distributions (Poisson and gamma distributions respectively) to solve the problem of
requiring large amounts of training data. Their methods still imposed excessive
computational
requirements.
• Rabiner et al. [16] suggested a postprocessor approach in order to incorporate duration modelling in a computationally efficient way. Besides real-time difficulties, a major disadvantage of the backtracking approach is that the duration contribution to the standard Viterbi metric is only added after candidate paths have been collected. This could mean that the correct path might not be one of those candidates.
Burshtein [17] has suggested a practical approach to duration modelling that avoids the above mentioned problems. He proposes a modified Viterbi algorithm that incorporates duration modelling, and has essentially the same computational requirements as the conventional Viterbi algorithm.

Since his method also relies on a parametric distribution, he performed experiments dedicated to finding the most appropriate distribution for state durations. He compared histograms of state durations to a Gaussian distribution, a gamma distribution and a geometric distribution (to show how inaccurate this implicit duration model is), and found the gamma distribution to most accurately describe the state durations (Figure 2.6).
He shows that the gamma distribution is capable of describing both the monotonic character of the geometric distribution, and the unimodal character of the Gaussian distribution. In addition, unlike the Gaussian distribution, the gamma distribution is one-sided; i.e. it assigns zero probability to negative τ's, which is appropriate for duration distributions. Finally, the gamma distribution possesses a slower decay rate,
e^{−aτ}, which is more appropriate for duration modelling than the fast e^{−aτ²} decay of the Gaussian distribution.

Figure 2.6: State duration distributions for the 7th state of the word 'seven': empirical distribution (solid line), Gauss fit (dashed line), gamma fit (dotted line) and geometric fit (dash-dot line), from [17].
The gamma duration density has the form

p(τ) = K τ^{p−1} e^{−aτ},   (2.30)

where a > 0 and p > 0, and K is a normalizing term calculated so that the sum of p(τ) from 0 to ∞ equals 1. The parameters a and p can easily be estimated by noting that the mean and variance of the gamma distribution are given by E{τ} = p/a and VAR{τ} = p/a². a and p can thus be estimated from the empirical expectation Ê(τ) and empirical variance V̂AR(τ) of the duration by

â = Ê(τ) / V̂AR(τ)   (2.31)

p̂ = Ê²(τ) / V̂AR(τ).   (2.32)
The modified Viterbi algorithm keeps track of the duration Ds(t) of each state s at time t. Let Ms denote the time at which the peak value of the gamma distribution occurs at state s, and let l(u) = log p(u), where p(·) here is the gamma density. The duration penalty, P, of making a transition from state s at time t to state s̃ at time t + 1 is given by:

P = 0                           if Ds(t) < Ms and s̃ = s
P = l(Ds(t + 1)) − l(Ds(t))     if Ds(t) ≥ Ms and s̃ = s
P = l(Ds(t))                    if Ds(t) < Ms and s̃ ≠ s
P = l(Ms)                       if Ds(t) ≥ Ms and s̃ ≠ s.   (2.33)
The motivation is the following: before the duration reaches Ms, we do not penalize for remaining in the same state, unlike the conventional Viterbi algorithm (P = 0 in Eq. (2.33)). After the duration reaches Ms, we penalize gradually, unlike the backtracking approach of Rabiner et al. mentioned earlier (P = l(Ds(t + 1)) − l(Ds(t)) in Eq. (2.33)). If the next state is different from the current state and the duration is less than Ms, the penalty is l(Ds(t)). If moving to a new state and the duration is greater than or equal to Ms, the penalty is fixed at l(Ms).
The duration penalty is incorporated into the Viterbi algorithm instead of using the log state transition probability penalty ãij in Eq. (2.28). For greater computational efficiency, the l(u) function can be stored in a lookup table for each state of each HMM, thus trading a relatively small amount of memory for a large computational saving.
It was found in practice that it is unnecessary to iteratively update the duration modeling parameters during training. Instead, the models are trained with the standard transition matrix reestimation formulas (Eq. (2.16)). After the models have been trained, each model is Viterbi aligned to all its training examples, and the empirical mean and variance of the durations of each state for each model are obtained as described below. The sum of each state's durations (over all the training examples for each HMM) is stored as statedur, and the sum of the squares of the durations is stored as statedursq. These values can be used to calculate the empirical mean and variance values of the duration for each state:

Ê(τ) = statedur / examples

V̂AR(τ) = statedursq / examples − Ê²(τ),

where examples is the number of training examples for each HMM. Ê(τ) and V̂AR(τ) can now be substituted into Eqs. (2.31) and (2.32) to obtain the values of â and p̂ needed to calculate the discrete gamma distribution (Eq. (2.30)). This method is less computationally intensive and easier to implement than recursively updating the duration modeling parameters, and results in values that accurately model the state durations.
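A small sketch of this post-training step is given below: from the per-state duration sums it derives â and p̂ via Eqs. (2.31) and (2.32) and precomputes the l(u) = log p(u) lookup table used by the duration penalty of Eq. (2.33). The maximum tabulated duration is an assumption made for the example.

```python
# Sketch: estimating the gamma duration parameters a-hat and p-hat from the
# per-state duration statistics (Eqs. (2.31) and (2.32)) and precomputing the
# l(u) = log p(u) lookup table used by the duration penalty of Eq. (2.33).
# The maximum tabulated duration is an illustrative assumption.
import numpy as np

def gamma_duration_table(statedur, statedursq, examples, max_duration=50):
    mean = statedur / examples                      # E(tau)
    var = statedursq / examples - mean ** 2         # VAR(tau)
    a_hat = mean / var
    p_hat = mean ** 2 / var
    u = np.arange(1, max_duration + 1)
    density = u ** (p_hat - 1.0) * np.exp(-a_hat * u)
    density /= density.sum()                        # K chosen so the table sums to 1
    return np.log(density), int(np.argmax(density)) + 1   # l(u) table and M_s

if __name__ == "__main__":
    # e.g. 120 training examples whose durations in this state sum to 600 frames
    table, peak = gamma_duration_table(statedur=600.0, statedursq=3600.0, examples=120)
    print(peak, table[:5])
```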
2.10
Channel compensation techniques
Environmental
robustness and speaker independence are important
issues of current
speech recognition research. Normalisation methods might make use of the model but
primarily influence the signal such that important information is kept and unwanted
distortions are cancelled out.
When using the telephone system, the overall channel transfer function can differ from
connection to connection. These variations can be due to the physical wires in different
parts of the connection, different microphones and speakers, codecs such as GSM and
others⁴, and analogue to digital as well as digital to analogue converters in the signal
path. These variations lead to degradation in speech recognition performance. To try
to improve the performance in the presence of such a varying channel, a method needs
to be used to try to compensate for the transfer function of the channel, effectively
normalising the speech to the channel transfer function. Most modern speech recognition systems use a channel normalisation approach called Cepstral Mean Subtraction
to compensate for the acoustic channel, as well as for the speaker.
Cepstral Mean Subtraction[19] (CMS) is a simple but effective channel normalisation
technique.
Many adaptation
methods require an adaptation
method that has to be
trained. In contrast to this, CMS is purely signal based and tries to eliminate disturbing
channel and speaker effects before the signal is used to train a recogniser. It is also a
very convenient technique to use if cepstra are already being used as the features for a
recogniser.
⁴ Telephone network providers are starting to use the internet as part of their core network in some countries. Telkom is also considering Voice-Over-IP to be an essential element of their Next Generation Network.
When a speech signal passes through a linear time-invariant channel, the convolutional distortion becomes multiplicative in the spectral domain and additive in the log-spectral domain. Since the cepstrum is just a linear transformation of the log-spectrum, both can be regarded equally in this context. CMS deals with channel effects, such as the use of different microphones on differing telephone brands, or the use of a cellphone.
For speech recognition, a short-time analysis is performed, resulting in the speech spectrum S_t(ω) and the measured spectrum I_t(ω) after modification by the channel transfer function C(ω) (which is assumed not to be a function of time), where

    I_t(ω) = S_t(ω) · C(ω),

and where y_t(ω) is the log-spectrum or cepstrum (the cepstrum is just a linear transformation of the log-spectrum, so both can be treated equally in this context), given by

    y_t(ω) = log I_t(ω) = log S_t(ω) + log C(ω).

The assumption of a constant channel C(ω) allows it to be compensated for by subtracting the mean of the measured cepstrum, leading to the cepstral mean subtracted feature z_t:

    z_t(ω) = y_t(ω) − (1/T) Σ_τ y_τ(ω) = log S_t(ω) − (1/T) Σ_τ log S_τ(ω).

This shows that a speech mean (the time average of log S_t(ω)) is also subtracted. We can now use z_t as the input feature vector for training and performing recognition, with the effect of the channel transfer function essentially eliminated.
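A minimal sketch of this operation on a feature matrix (one vector of cepstral coefficients per frame) is given below; the data layout and function name are assumptions, not the HMTSR interface.

    #include <cstddef>
    #include <vector>

    // z_t = y_t - mean(y): remove the per-coefficient mean (over all frames of the
    // utterance or training token) so that a constant cepstral offset caused by the
    // channel is cancelled.
    void cepstralMeanSubtract(std::vector<std::vector<double>>& features)
    {
        if (features.empty()) return;
        const std::size_t dim = features[0].size();
        std::vector<double> mean(dim, 0.0);

        for (const auto& frame : features)
            for (std::size_t i = 0; i < dim; ++i)
                mean[i] += frame[i];
        for (std::size_t i = 0; i < dim; ++i)
            mean[i] /= static_cast<double>(features.size());

        for (auto& frame : features)
            for (std::size_t i = 0; i < dim; ++i)
                frame[i] -= mean[i];
    }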
Several variations of Cepstral Mean Subtraction
have been developed to try and solve
some of the problems that arise from using it:
1. Standard CMS: taking z_t = y_t − ȳ_t as the new input feature, Westphal [19] studied the effect of CMS on several cases (continuous speech, between-word pauses, and silence) assuming static noise. For segmented speech without many pauses, the compensation works well, although some noise dependence is introduced. For the pause case, a shift is introduced that is proportional to the channel. In conversational speech there is a greater variance of the pause proportion, which will reduce the desired channel compensation.
2. Speech-based Cepstral Mean Subtraction (SCMS) [19]: To try to overcome the dependence on the pause proportion, SCMS estimates the mean only on speech frames (z_t = y_t − m^spe), using

    m^spe = ( Σ_t w_t · y_t ) / ( Σ_t w_t ),

where w_t is the probability p(speech | y_t), or the output of a speech detector (1 for speech, 0 for pause). Experiments suggest that it is only really worthwhile to employ this method in cases where there are relatively long between-word silences. For speech utterances it is therefore easier to strip off the (longer) initial and final silences, and then use standard CMS on the speech. Normal between-word pauses are too short to make using SCMS worth the extra overhead.
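A small sketch of the SCMS mean estimate above, assuming the weights w_t are supplied per frame by a speech detector or a speech-probability estimator:

    #include <cstddef>
    #include <vector>

    // m_spe = (sum_t w_t * y_t) / (sum_t w_t): cepstral mean estimated on speech
    // frames only, with per-frame speech weights w_t in [0, 1].
    std::vector<double> speechOnlyMean(const std::vector<std::vector<double>>& y,
                                       const std::vector<double>& w)
    {
        std::vector<double> mean(y.empty() ? 0 : y[0].size(), 0.0);
        double wsum = 0.0;
        for (std::size_t t = 0; t < y.size(); ++t) {
            wsum += w[t];
            for (std::size_t i = 0; i < mean.size(); ++i)
                mean[i] += w[t] * y[t][i];
        }
        if (wsum > 0.0)
            for (double& m : mean) m /= wsum;
        return mean;   // subtract from the speech frames: z_t = y_t - m_spe
    }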
CMS deals with channel distortion,
which introduces convolutional
noise to the signal.
A method is also needed to get rid of additive spectral noise, to eliminate any slow
time varying signals and DC offsets, and background noise (car engine noise, constant
babble, white noise).
There are several variants of spectral subtraction, similar to CMS, that try to compensate for the additive spectral noise [20]. One approach estimates the noise power spectrum and subtracts this from the entire speech utterance. This approach gives rise to a number of problems. Firstly, because noise is estimated during pauses, the performance of the spectral subtraction system relies upon a robust noise/speech classification system. If a misclassification occurs, this may result in a mis-estimation of the noise model and thus a degradation of the speech estimate. Spectral subtraction may also result in negative power spectrum values, which are reset to non-negative values in a variety of ways. This results in residual noise, often called musical noise.
Continuous spectral subtraction continuously updates an estimate of the average of the
long term spectrum (since speech changes rapidly, the average over a long period should
closely match the spectrum of the additive noise signal). In an off-line recogniser, the
average of an entire utterance can be calculated before recognition if the utterance is
not longer than a few seconds. In this case, the compensation method becomes known
as Spectral Mean Subtraction
(SMS).
SMS [20] simply involves calculating the average (over time) power spectrum by adding
the power spectra of all the frames in the utterance, and then dividing by the number
of frames. This average power spectrum is then subtracted from each frame. Within
this framework occasional negative estimates of the power spectrum occur. To make
the estimates consistent, some artificial flooring is required, yielding musical noise as
previously mentioned. This musical noise is caused by the remaining isolated patches
of energy in the time frequency representation.
Much effort has been put into reducing this musical noise. One effective way is smoothing the short-time spectra over time; this, however, has the adverse side effect of introducing echoes.
While reducing the average level of the background noise substantially, plain SMS has
been found in the literature to be rather ineffective in improving intelligibility and
quality for broadband
background noise, and thus not very effective for improving
speech recognition. This has been reaffirmed by the experiments in Section 4.6.
This chapter on theory has provided the necessary background information for describing the implementation
of the telephone speech recognition system.
Chapter 3
Telephone speech recognition
system
The ultimate
goal of our system is to act as an automatic
telephone receptionist,
prompting the caller into giving the name of the person (within the department)
to
whom they'd like to speak, and forwarding the call.
Two major parts comprise the system, namely hardware necessary for implementation,
and the software which performs the recognition and controls the flow of events during
a call. These two parts will be discussed in detail in this chapter.
The relatively slow¹ CPU speed encouraged careful and intelligent optimisation of the speech recognition software.
The Dialogic model D/21H telephone card enables two telephone lines to be connected to the computer, although only one was ever tested in practice. It interfaces to the PC on the PCI (Peripheral Component Interconnect) bus, providing facilities such as analogue-to-digital and digital-to-analogue conversion of the telephone speech, ring detection, dial-tone detection and DTMF (Dual Tone Multi Frequency signalling, in common use for telephone systems) encoding and decoding.
The telephone interface card was connected to the PABX at the campus of the University of Pretoria. This particular PABX has the facility of being able to forward a call by sending what is known as a hookflash (putting the phone "on-hook" for approximately 100ms, then taking it "off-hook" again) followed by the four-digit internal (to the University of Pretoria) telephone number sent via DTMF. Since the recommended maximum rate of DTMF digit transmission is 10 per second (i.e. 100ms per digit), the whole process of forwarding a call takes approximately 500ms.
The software which performs the sound data processing, feature extraction, model training and level-building search was developed in-house at the University of Pretoria by the Pattern Recognition Group. The software toolkit (named the "Hidden Markov Toolkit for Speech Recognition", HMTSR²) was developed in Gnu C++ in the Linux operating system using an extensive object-oriented approach, described in detail later.

¹Compared to the leading-edge Intel processors in 1999 (the 550MHz Intel Pentium III, 5 to 8 times faster). At the beginning of 2001, 1.2GHz processors were available.
²See website http://www.ee.up.ac.za/ee/pattern.recognition-page/HMTSR/HMTSR-0.2.tgz
The telephone speech recognition system was necessarily developed on a Windows NT 4.0 platform, since at the time of development no Linux drivers existed for the Dialogic card³. A Windows NT Gnu C++ compatible compiler, along with libraries emulating Unix system calls, had been developed by Cygnus Solutions, and HMTSR was ported into the Windows environment using this tool, with some modifications. HMTSR thus formed the backend of the telephone speech recogniser. The Dialogic API (Application Programmer's Interface) libraries had to be ported (they were originally written for Microsoft Visual C++) to be compatible with Cygwin, with some major modifications relating to the way threads are handled under NT. A console application was written to handle incoming calls (multiple calls, i.e. 2 for the D/21H card, can be handled simultaneously), a thread being spawned for each incoming call. The thread takes care of the conversational interface with the caller by prompting him and detecting when he has finished speaking. Speech to be recognised is passed to the HMTSR level-building search, and action is taken depending on the transcription of the models found in the search.
3.2.1
Hidden Markov Toolkit for Speech Recognition
The HMTSR toolkit comprises a number of classes (Object Oriented Programming) to facilitate speech processing and hidden Markov modelling of raw waveform data, and these classes are all grouped together as a library. The library provides the following functionality:

• Perform feature extraction (pre-emphasis, Hamming window, mel-spaced filter banks, mel-cepstra).

• Dealing with the transcriptions of labeled training data, as well as storing transcriptions of recognised data.

³Dialogic released Linux drivers for this card in July 2000.
The process by which a matrix of cepstral coefficients (vector of coefficients over time)
is extracted from a raw sound wave file can be summarised as follows (Figure
3.3).
The input data consists of a vector of floating point values (representing the sound
wave file, values normalised between 0 and 1) with a corresponding length (number
of samples) and a sample rate (8kHz for the telephone speech recognition system).
This data is divided into a number of windows, determined by two parameters, namely windowsize and framestep. The windowsize determines the number of samples (the unit of the windowsize parameter is seconds, so the actual number of samples can only be calculated given the sample rate) used for every cepstral calculation, and the framestep determines the offset from the current position where the next windowsize samples begin. The framestep parameter is usually less than windowsize so that there is some overlap between adjacent windows. For example, for windowsize = 16ms and framestep = 10ms, the first cepstral coefficient vector will be calculated from the first 16ms of speech, and the second vector will encompass 10ms to 26ms of the speech segment.
The total number of windows in a given length of speech data is therefore given by

    nwindows = ( nsamples − (windowsamples − framesamples)/2 ) / framesamples,

where windowsamples and framesamples are calculated from the sample rate as follows: windowsamples = windowsize · samplerate and framesamples = framestep · samplerate.
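Assuming the sample counts follow from the parameters as given above, the framing arithmetic can be sketched as follows (rounding behaviour and names are illustrative only):

    struct Framing { int windowsamples; int framesamples; int nwindows; };

    // Derive the framing quantities from windowsize and framestep (in seconds),
    // following the formulas in the text above (a sketch only).
    Framing computeFraming(double windowsize, double framestep,
                           double samplerate, int nsamples)
    {
        Framing f;
        f.windowsamples = static_cast<int>(windowsize * samplerate + 0.5); // e.g. 0.016 s at 8 kHz -> 128
        f.framesamples  = static_cast<int>(framestep  * samplerate + 0.5); // e.g. 0.010 s at 8 kHz -> 80
        f.nwindows = static_cast<int>(
            (nsamples - (f.windowsamples - f.framesamples) / 2.0) / f.framesamples);
        return f;
    }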
Since a fast Fourier transform can only be calculated for lengths of data that are powers
of two, the next power of two, that is larger than windowsamples,
is determined.
This
gives the size of the Fourier transform vector.
Preemphasis is performed on each window. Preemphasis is typically a first-order FIR filter used to spectrally flatten the signal (i.e. lessen the effect of spectral tilt) and make it less susceptible to finite precision effects later in the signal processing. The output s̃(n) of the preemphasis network is related to the input s(n) by the difference equation

    s̃(n) = s(n) − a · s(n − 1),

where a is the preemphasis coefficient (0.98 in this system, see Table 4.5).
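A sketch of this first-order pre-emphasis filter, applied in place to one frame (the default coefficient of 0.98 is taken from Table 4.5):

    #include <cstddef>
    #include <vector>

    // s~(n) = s(n) - a * s(n - 1): first-order FIR pre-emphasis, applied in place.
    // Iterating backwards ensures s(n - 1) is still the original sample.
    void preemphasise(std::vector<double>& frame, double a = 0.98)
    {
        for (std::size_t n = frame.size(); n-- > 1; )
            frame[n] -= a * frame[n - 1];
    }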
Each frame (consisting of windowsamples samples) is now windowed so as to minimize signal discontinuities at the beginning and end of each frame. A window is used to taper the signal to zero at the beginning and end of each frame. There are a number of functions which achieve this, the most popular one being the Hamming window (Figure 3.1), which has the form

    w(n) = 0.54 − 0.46 · cos( 2πn / (windowsamples − 1) ),  0 ≤ n ≤ windowsamples − 1.

(Figure 3.1 shows the Hamming window as a function of sample number.)
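The Hamming window above can be applied to a frame as in the following sketch:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), 0 <= n <= N - 1,
    // multiplied into the frame of N = windowsamples samples.
    void hammingWindow(std::vector<double>& frame)
    {
        const double N  = static_cast<double>(frame.size());
        const double pi = 3.14159265358979323846;
        for (std::size_t n = 0; n < frame.size(); ++n)
            frame[n] *= 0.54 - 0.46 * std::cos(2.0 * pi * static_cast<double>(n) / (N - 1.0));
    }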
The power spectrum of this windowed frame is now calculated to fftsize elements. The DC component of the power spectrum is set to zero, and the power spectrum vector is then passed through a mel frequency spaced triangular filter bank, the centre frequencies of which are determined by

    filtercentre_i = 700 · (1 + samplerate/1400)^((i+1)/nfilters) − 700,  0 ≤ i < nfilters,

where nfilters is another system parameter. For the default case where samplerate is 8kHz and nfilters = 12, the centre frequencies (in Hertz) are located at the values given by this formula, and the base corners of the triangles, as indices into the power spectrum vector, are determined by:

    fc_i = int( filtercentre_i / (samplerate / fftsize) + 0.5 ).

Figure 3.2: A filter-bank design in which each filter has a triangle bandpass frequency response with bandwidth and spacing determined by a constant mel frequency interval, from [7, p. 190].
The energy underneath each triangular filter in the filter bank is summed, and nfilters energy values are returned. The natural logarithm of each of these values is taken, after which a discrete cosine transform is performed, resulting in melorder (another system parameter) mel frequency scaled cepstral coefficients, mfcc_i, 0 ≤ i < melorder. Liftering is applied to normalise the contribution from each cepstral term.
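The placement of the triangular filters follows directly from the two formulas above; the sketch below computes the centre frequencies and their indices into the power spectrum (the function name and types are illustrative, not the HMTSR processing routines):

    #include <cmath>
    #include <vector>

    // Mel-spaced centre frequencies (in Hz) and their positions as indices into
    // the power spectrum vector, following the two formulas in the text.
    void melFilterPositions(double samplerate, int nfilters, int fftsize,
                            std::vector<double>& centreHz, std::vector<int>& centreIdx)
    {
        centreHz.assign(nfilters, 0.0);
        centreIdx.assign(nfilters, 0);
        for (int i = 0; i < nfilters; ++i) {
            const double c = 700.0 * std::pow(1.0 + samplerate / 1400.0,
                                              (i + 1.0) / nfilters) - 700.0;
            centreHz[i]  = c;   // the top filter lands at the Nyquist frequency
            centreIdx[i] = static_cast<int>(c / (samplerate / fftsize) + 0.5);
        }
    }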
Temporal derivatives (so-called delta features) are optionally calculated next using
polynomial approximation. Depending on the delta order (only first and second order
delta parameters are considered), the feature vector doubles or triples in size per frame.
For the optional case where Spectral Mean Subtraction is performed, the power spectrum vectors for the whole utterance are calculated, a mean is derived, and the mean is subtracted with artificial flooring from each power spectrum vector. The artificial flooring employed uses the following logic: if the mean spectral value is greater than the spectral value (i.e. the subtraction would give a negative result), the result is 0.1 · spectralvalue; otherwise it is spectralvalue − meanvalue. Several values (including 0) of this constant were tried, with 0.1 giving the best results.
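A sketch of this spectral mean subtraction with the 0.1 flooring rule, operating on the per-frame power spectra of one utterance (layout and names assumed):

    #include <cstddef>
    #include <vector>

    // Subtract the utterance-average power spectrum from every frame; whenever the
    // mean exceeds the value (the difference would be negative), floor the result
    // to 0.1 * spectralvalue instead.
    void spectralMeanSubtract(std::vector<std::vector<double>>& power,
                              double floorFactor = 0.1)
    {
        if (power.empty()) return;
        const std::size_t bins = power[0].size();
        std::vector<double> mean(bins, 0.0);

        for (const auto& frame : power)
            for (std::size_t k = 0; k < bins; ++k) mean[k] += frame[k];
        for (std::size_t k = 0; k < bins; ++k)
            mean[k] /= static_cast<double>(power.size());

        for (auto& frame : power)
            for (std::size_t k = 0; k < bins; ++k)
                frame[k] = (mean[k] > frame[k]) ? floorFactor * frame[k]
                                                : frame[k] - mean[k];
    }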
For the cases of generating the feature cache and model training,
Cepstral Mean Sub-
traction is optionally performed on this feature matrix since only whole words are being
considered.
In the case where a search on an utterance
has to be performed, the initial
and ending silences are first trimmed off (clarified further on) before CMS is performed.
After this entire procedure has been followed on a raw wave sound utterance,
the feature
matrix (cepstral coefficients vs. frame (time)) is ready to be used by the hidden Markov
modelling routines.
Figure 3.3 summarises the feature extraction
process.
Finite State Network grammar
A graphical tool was developed in X-windows using the QT library to facilitate the creation of finite state networks. Network nodes can be dynamically created and moved around, and given properties such as a model name and a display name. Nodes can be connected to other nodes, or to themselves, and flags can be set indicating whether the node is optional, initial (the start of a grammar path) or final (the end of a grammar path).
The tool writes to and reads from text files with the same lexical format as used by
the grammar class.
Many of the errors in the recognition process can be attributed to phrases that cannot possibly be correct, and would immediately be seen as such by a human. In the context of this system, this includes phrases in which names and surnames from different people are mixed, phrases with repetitions of a word, and phrases that have an incorrect word order. This can be remedied fairly easily by implementing a Finite State Network (FSN) grammar, limiting the search to "possible" combinations. The topology of the grammar network for a single callee is shown in Figure 3.4.
(Figure 3.3: Feature extraction flow: raw wave sound → divide into frames → perform pre-emphasis on each frame → Hamming window each frame → compute power spectrum of each frame → optionally subtract spectral mean → summate filter bank outputs → compute natural log of each filter output → perform discrete cosine transform, resulting in cepstral coefficients → perform liftering on each coefficient to reduce spectral tilt → calculate temporal derivatives → optionally perform Cepstral Mean Subtraction → feature set.)

(Figure 3.4: Grammar network for a single callee; the legend marks possible start and end nodes.)
Other optional models are placed in the FSN to improve robustness. An initial silence model is included, followed by a "breath-in" model to compensate for the fact that many people inhale sharply just before uttering a phrase. An optional "breath-out" model follows the above-mentioned grammar network. Other possibilities exist, e.g. Title → Name, but since this is very informal and uncommon, it is considered erroneous for the purposes of this system.
The full
grammar network (Figure 3.5) consists of the previously mentioned paths
for each callee in parallel, so that names, surnames and titles of different callees cannot
be confused.
The FSN grammar is implemented in the forward Viterbi search by adding either ln(0) (approximated by the largest negative number the PC can represent) to the log probability for an impossible transition, or ln(1) to the log probability for a grammatically correct transition. This is analogous to multiplying the probability by 0 or 1 when not working in the ln domain. In this way very little overhead is incurred on the search engine, but dramatically improved recognition results can be obtained (see Section 4.4).
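The grammar constraint can thus be viewed as a simple additive mask on transition scores in the log domain; the fragment below illustrates the idea (the constant used to approximate ln(0) and the surrounding search code are assumptions):

    #include <limits>

    // Log-domain equivalent of multiplying a transition probability by 1 (allowed
    // by the FSN) or by 0 (forbidden): add ln(1) = 0, or force the score to the
    // ln(0) approximation so that the path can never win.
    const double LOG_ZERO = -std::numeric_limits<double>::max();

    double grammarMaskedScore(double logTransitionScore, bool allowedByFSN)
    {
        return allowedByFSN ? logTransitionScore + 0.0  // + ln(1)
                            : LOG_ZERO;                 // effectively + ln(0)
    }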
3.3
End point detection
End point detection entails finding the ends of an utterance,
i.e. finding where in
temporal space the utterance starts, and where it ends. This can be done by the recognition engine by incorporating an optional beginning and ending silence model, but
this gives rise to some complexities. Accurately detecting the endpoint is important,
since incorrect endpoint detection can greatly affect the performance of the recogniser
(see Figure 3.6).
Figure 3.6: Contour of digit recognition accuracy (percent correct) as a function of endpoint perturbation (in ms), from [7, p. 144]. (The horizontal axis shows the change in beginning point, in milliseconds, from −150 to 150.)
The beginning and ending silences' lengths can vary greatly. Since the silence model is
a single state model, the state duration modelling will apply to the entire silence. This
is not desired, so compensation must be made in the system to allow for HMMs to
which duration modelling does not apply. This unnecessary complexity can be avoided
by first applying end point detection to the utterance. Furthermore, the Viterbi search
together with the level-building algorithm take linearly longer with longer utterances,
and are potentially much slower than an algorithm which only has to find silence. Thus
by first removing the silence, large time savings can be made in the recognition process.
Optional noise models can still be incorporated in the recognition FSN, to compensate
for noise artifacts that may have been missed by the endpoint detection. In this way,
a "hybrid" two-stage speech endpoint detection system is implemented.
For speech produced in laboratory circumstances, i.e. carefully articulated and relatively noise free, accurate detection of speech is a simple problem. In practice however,
one or more problems make accurate endpoint detection difficult. One particular class
of problems is those attributed
to the speaker and manner of producing the speech.
For example, during articulation, the talker often produces sound artifacts, including
lip smacks, heavy breathing, and mouth clicks and pops.
(Figure: example utterance waveform containing a mouth-click artifact; horizontal axis: frame number.)
The energy levels of the artifacts are often comparable to speech energy levels, ruling
out using an energy envelope threshold as an accurate means of speech detection.
To detect the silences, the HMM output probability for a silence model is considered, together with an empirically determined threshold. If the utterance silence output probability estimate falls below this threshold for an empirically determined number of frames T_initial in a moving average series, the speech is considered to start T_initial frames back. Similarly, if the silence output probability estimate is above the threshold for T_final frames, the speech is considered to have ended T_final frames ago. By choosing T_initial and T_final carefully, so as to easily include the shortest vocabulary word yet exclude between-word pauses, we can eliminate most mouth clicks and pops and any other short unwanted sounds. The procedure described here was implemented by me in the system.
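A sketch of this two-threshold endpointing logic is given below; the per-frame silence scores, the threshold and the counts T_initial and T_final are the empirically tuned quantities described above, while the representation and function name are assumptions (the moving-average smoothing of the score is also omitted):

    #include <cstddef>
    #include <vector>

    struct Endpoints { int startFrame = -1; int endFrame = -1; };

    // Two-threshold endpointing on per-frame silence scores: speech is taken to
    // start once the silence score has stayed below the threshold for Tinitial
    // consecutive frames, and to end once it has stayed above the threshold for
    // Tfinal consecutive frames.
    Endpoints detectEndpoints(const std::vector<double>& silenceScore,
                              double threshold, int Tinitial, int Tfinal)
    {
        Endpoints ep;
        int below = 0, above = 0;
        for (int t = 0; t < static_cast<int>(silenceScore.size()); ++t) {
            if (ep.startFrame < 0) {
                below = (silenceScore[t] < threshold) ? below + 1 : 0;
                if (below >= Tinitial)
                    ep.startFrame = t - Tinitial + 1;          // start "Tinitial frames back"
            } else {
                above = (silenceScore[t] > threshold) ? above + 1 : 0;
                if (above >= Tfinal) {
                    ep.endFrame = t - Tfinal;                  // ended "Tfinal frames ago"
                    break;
                }
            }
        }
        if (ep.startFrame >= 0 && ep.endFrame < 0)             // speech ran to the end
            ep.endFrame = static_cast<int>(silenceScore.size()) - 1;
        return ep;
    }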
To facilitate more accurate system performance
(in terms of the caller being put
through to the correct person), a conversational interface is needed to clarify any cases
where the recognition process did not generate a sufficiently high confidence measure
(see Section 4.8).
The system conversational interface follows the flow diagram which is depicted in Figure 3.9.
(The flow diagram in Figure 3.9 includes prompts such as a strict "askname" prompt and an "RUsure" confirmation prompt containing the recognised person's name.) The following cases can occur:
1. The confidence level in recognising the utterance is below a certain threshold (determined empirically). In this case, the prompt is repeated and the user is asked to repeat his answer. This could be due to excessive noise, an invalid response or a speaker dependent problem (breathing noise, mouth pops/clicks, strange accent).
2. The system recognises the utterance incorrectly, and since the confidence level
is not sufficiently high, the user is asked for confirmation. Assuming the yes/no
recogniser has an extremely high recognition performance4, the person will either
have to repeat his answer (a stricter prompt is played, specifying that the user
must say the name and surname of the callee), or his call will be forwarded.
3. The system recognises the phrase correctly, but since the confidence level is not
sufficiently high, the user is asked for confirmation, as in the previous case.
4. The system recognises the phrase incorrectly, and the caller is forwarded to the
wrong number as the confidence measure is sufficiently high.
5. The system recognises the phrase correctly, and the caller is forwarded to the
correct person.
This is obviously the case the system strives to achieve most
often.
3.5
Dealing with extraneous speech
The conversational interface strives to prompt callers into giving only answers in the correct grammatical format (e.g. Name Surname, or Yes/No); however, the cases of extraneous speech need to be taken into account. Some people respond to a prompt like "Please say with whom you'd like to speak" by replying: "I'd like to speak to so-and-so please," even though they realise it's a machine they're speaking to. The system under normal conditions would expect a reply to be just the name⁵.

⁴Highly probable, since only two vocabulary words need to be distinguished.
⁵In any of the generally accepted name lexical formats.
To attempt
to deal with this, a grammar FSN can be constructed
using a so-called
garbage hidden Markov model to try and cater for any out-of-vocabulary
(OOV) words.
The idea of the garbage model is to give a higher score in the Viterbi search than any
of the in-vocabulary models for an OOV word, whilst giving a lower score for an invocabulary word than the correct corresponding model [21].
Several HMM structures for a garbage model have been tried in the literature, the best
result being achieved for a model with a single state and one or more mixtures.
To train this garbage model, all the vocabulary words are used together, based on the
assumption that there is enough variance between these words to make a good garbage
model.
Dealing with this extraneous speech has not been implemented in the online (real-time
recognition) version of the system.
It was left out for performance considerations,
and it has been found in practice that the prompts are clear enough that the caller
usually only utters phrases in one of the correct grammar patterns.
Off-line experiments
have been done to test the effectiveness of including a garbage model in the system
(see Section 4.9). This concludes the description of the telephone speech recognition
system that has been implemented. The next chapter details some experiments to test
the efficiency of various components of the system.
Chapter 4
Experiments
To test the system performance, and determine the sensitivity to methods used and
system parameter variations, experiments were performed. The experiments were done
with the system in off-line mode, using data previously recorded from the telephone
line.
This was done so that all recognition results from the different methods and
parameter settings reflect the same data, allowing fair comparisons to be made.
4.1
Experimental data
Since the system uses the whole word recognition approach, only a limited number of
words can be recognised by the system. The words which the system recognises can
be seen in Table 4.1.
The name models (each first name is one word, and double surnames are also treated
as a single word model) that the system recognises can be seen in Table 4.2.
Total distinct words: 42
Total number of people: 18
Titles: 5
The telephone system was initially set up to play a prompt, and then record an utterance without performing any kind of recognition or giving any feedback to the user. This was done so as to be able to record enough data to "bootstrap" the HMM models with some seed data to be able to perform recognition. Once sufficient data had been recorded in this manner (approximately eight examples per model, spoken by different speakers, was determined to be good enough), the system was trained and put into "live" operation, where callers' utterances were recognised (and recorded), and the conversational interface guided them to the point where their call was forwarded to the callee.
interface guided them to the point where their call was forwarded to the callee.
The system was left to run in this manner for a period of time, until sufficient data
(male and female) had been collected to perform meaningful experiments.
The target
for data collection was set so that at least 16 spoken words per model were recorded.
Table 4.4 summarizes the data set.
Due to the nature of the data recording technique, it is difficult to establish exactly
how many different speakers there are in the data set, but it is estimated that there
are approximately 25 to 35 different speakers.
Table 4.4: Summary of the data set
Total male utterances: 179
Total female utterances: 183
Number of labeled words: 950
A "leave-one-out" experimental setup was used throughout the test. A total of five sets
of training/testing
the remaining
data was used. Each training set consists of ~ of all the data, with
i used
for testing. In this way all the data can be used for testing and
training, but the test data never appears in the training set. An average performance
value is obtained over all five sets for each experiment to give the final result.
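A generic sketch of such a five-way split follows; how utterances were actually assigned to the five sets is not specified in the text, so the index-based partitioning here is purely illustrative:

    #include <vector>

    // Fold k of a five-way split: items with index % 5 == k form the test set and
    // the rest the training set; the final figure is the average of the five
    // test-set recognition rates.
    void fiveFoldIndices(int nItems, int k,
                         std::vector<int>& trainIdx, std::vector<int>& testIdx)
    {
        trainIdx.clear();
        testIdx.clear();
        for (int i = 0; i < nItems; ++i)
            (i % 5 == k ? testIdx : trainIdx).push_back(i);
    }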
4.2
Adjustable system parameters
The system parameter set which can be optimised for maximum recognition performance was discussed in detail in the previous chapter, and can be summarised as
follows:
• windowsize: Length of raw sound (in seconds) used to calculate each feature frame (in the order of 10ms to 30ms).

• framestep: Length of raw sound (in seconds) to step. The difference windowsize − framestep determines the overlap between adjacent feature vectors (in the order of 5ms to 30ms).

• MelOrder: The number of mel-scaled cepstral coefficients to calculate per frame (in the order of 6 to 12).

• nfilters: Number of filters in the mel-scale filter bank. This value defaults to 1.5 · samplerate/1000.

• EMiter: Number of expectation-maximisation iterations to perform. This value defaults to 10, as the trained mean and variance values for the HMM mixtures seem to stabilise after approximately 8 iterations.

• Delta order [9]: The order of the temporal cepstral derivative. May take on the values 0 (no delta features), 1 or 2. For example, for a MelOrder of 8 and a delta order of 2, the feature vector will be of length 24 per frame.

• nstates: The number of states to use in the hidden Markov word models. This can actually be specified per model, so that different models can have different numbers of states to try to account for the varying number of sounds in words.

• nmixtures: The number of Gaussian mixtures to use per state in each model. The number of mixtures can also be independently set for each state in each model if desired.

• performDuration: Boolean flag to choose between using traditional state transition matrices, resulting in implicit geometric state durations, or using lngamma pdf based state duration modelling (see Section 2.9).

• performCMS: Boolean flag indicating whether or not to perform Cepstral Mean Subtraction on the feature matrix.

• performSMS: Boolean flag indicating whether or not to perform Spectral Mean Subtraction on the power spectrum.
Experimental results are shown where these parameters are varied to find the best system performance. Since the parameter variations can influence each other (e.g. if the window size is smaller, one might want to consider increasing nstates, since more features will be calculated per word), the parameters have to be varied in combination rather than one at a time, leading to a large number of experiments and extensive computation.

Some of the parameter variations were found not to have much influence on each other, and these were kept at a fixed value during the iteration procedure, to try to keep the computation realistic. The values in Table 4.5 were used.
Table 4.5: Fixed parameter values
performCMS: on
performSMS: off
Delta: 2
pre-emph: 0.98
performDuration: on
grammar: full
EMiter: 10
nfilters: 12
Viterbiiter: 10
The parameters in Table 4.6 were varied in the specified ranges. These ranges were obtained through coarse initial experimentation to determine in what range the optimum point occurs.

The framestep and windowsize parameters were varied together, since it was previously found that the best results are obtained if framestep is equal to slightly more than half of windowsize. Since framestep is the increment for each successive start of windowsize samples, this means that there is always an overlap with the previous frame.

Table 4.6: Parameter ranges varied in the experiments
nmixtures: 1, 2, 3
nstates: 8, 9, 10, 11, 12
windowsize: 8, 12, 16, 20, 30 ms
framestep: 5, 7, 10, 12, 19 ms
MelOrder: 6, 8, 10, 12
The best overall person recognition rate attained was 95.68%, with the parameter
values as given in Table 4.7.
Table 4.7: Best parameter values
nmixtures: 1
nstates: 10
windowsize: 20ms
framestep: 12ms
MelOrder: 8
The worst overall person recognition rate attained with these parameter variations was
73.02%, with the parameter values as in Table 4.8.
The fact that the feature vectors will be large (MelOrder
= 12) and that there are
three mixtures per model to train, implies that there is probably insufficient training
data available to successfully train the models, providing a possible explanation for the
poor results with these parameters.
The small window size also means that at an 8kHz sampling rate there are only 64 samples per window, so the FFT results in frequency components with frequency spacings of 125Hz. This means that the spectral resolution available to the lower, more closely spaced mel filters is very coarse.
Table 4.8: Worst parameter values
nmixtures: 3
nstates: 11
windowsize: 8ms
framestep: 5ms
MelOrder: 12
If we take the average recognition results over each parameter variation (e.g. the average of all the results where MelOrder = 8), we obtain the averages plotted in Figures 4.1, 4.2, 4.3 and 4.4 (average recognition rate as a function of MelOrder, the framestep/windowsize combination, nmixtures and nstates).

Although the parameters are interdependent, there is still a very close correlation between the overall best and worst parameter sets and the results of the above average parameter variation scores. These "best" and "worst" parameter sets are used as a baseline for the rest of the parameter and method variation experiments, each effect (e.g. using CMS or not) being measured on both the best and worst sets to see the effect.
It can also be seen that the recognition results do not seem to vary much with the
number of states, as long as there are enough states to model at least the minimum
number of distinct sounds in each word. Another experiment was performed where the
word models were categorised into long and short words, with the long words having 50% more states than the short words. This also made no significant difference to the recognition results, but when the experiment was repeated without duration modeling, it became clear that a difference could be observed. Without duration modeling, the number of states provides a crude lower duration limit due to the strict left-right nature of the Bakis models, leading to a noticeable performance increase. With fewer states it is possible (in the absence of duration modeling) for the model to be in one particular state for only one frame, but if there are more states than distinct sounds in the word, there is a minimum duration time that will be spent in each sound model.
Experiments were also performed where the number of mixtures per state was varied, i.e. instead of having 2 mixtures per state for each state in the model, different variations were tried. The rationale behind this was that the initial and final states are more likely to have problems with breath-in, breath-out or other noises, as well as simply being prone to having their data affected by incorrect end-point detection (see Section 3.3). No significant differences in recognition performance were observed for any of these variations with either the "best" or "worst" parameter sets previously determined, so this avenue of investigation was not pursued further.
4.3
Effect of temporal derivative
To determine the effect of the temporal derivative on the recognition performance,
the system parameters in Table 4.5 were used, with Delta = 0 and Delta = 1. This
parameter set was used with the sets in Tables 4.7 and 4.8 to give the results as in
Table 4.9.
From this we can conclude that using a first or second order temporal derivative significantly reduces the recognition error rate (by 67%) for the "best" parameter case. The
first order temporal derivative slightly improves the recognition rate for the "worst"
parameter set, but the second order temporal derivative increases the error rate above
what it was without using it at all. Therefore, we can use Delta = 1 for good performance and less computation.
Table 4.9: Effect of the temporal derivative order (Delta) on the person recognition rate
best param (Delta = 0): 86.65%
best param (Delta = 1): 95.56%
best param (Delta = 2): 95.68%
worst param (Delta = 0): 77.08%
worst param (Delta = 1): 83.39%
worst param (Delta = 2): 73.02%

4.4
Effect of grammar
To determine the effect the grammar FSN has on recognition performance, experiments were set up with less restrictive FSN grammars. Two different grammars were used: one (general) which followed the pattern shown in Figure 3.4, thereby allowing unknown names to be found (i.e. different titles, names and surnames may be mixed), and a second grammar (allwords) which allowed any number of words (titles, names and surnames) in any order. The recognition experiment was repeated using again the "best" and "worst" parameter sets, and comparing these to the results with the full grammar (see Table 4.10).

Table 4.10: Person recognition rate for the different grammars
param set    full      general   allwords
best         95.68%    77.14%    19.18%
worst        73.02%    52.08%    9.92%

From these results it is clear that the incorporation of a restrictive FSN grammar dramatically improves the recognition rate. As expected, on inspection of the error phrases for the general grammar, the additional errors (compared to the full grammar) were caused by the emergence of unknown names (e.g. "mevrou renier vanleeuwen"). Closer inspection of the error cases for the allwords grammar revealed that the majority of the additional erroneous phrases (compared to the general grammar) did in fact contain the correct phrase, but extra words (usually short words) were found at the beginning or end of the phrase (e.g. "prof prof botha"). As the FSN grammar becomes less constrained, the level-building search takes longer to execute, since more possibilities arise.
4.5
Effect of Cepstral Mean Subtraction
To determine the effect of CMS on the recognition performance, the system parameters in Table 4.5 were used, with performCMS = 0. This parameter set was used with the sets in Tables 4.7 and 4.8 to give the results in Table 4.11.

Table 4.11: Effect of Cepstral Mean Subtraction on the person recognition rate
best param (performCMS = 0): 90.54%
best param (performCMS = 1): 95.68%
worst param (performCMS = 0): 69.1%
worst param (performCMS = 1): 73.02%
CMS leads to a 54% reduction in errors for the "best" parameter set, a significant improvement. The telephone channel transfer function is highly variable, since phone calls go through at least two A/D and D/A conversions, as well as variable lengths of copper wire, or they could even be made from a cellular phone, where codecs such as GSM influence the overall channel transfer function. Using CMS to compensate for this is a very successful technique. The error rate is also slightly improved for the "worst" parameter set, which can be expected, since CMS does not increase the feature vector size, but merely attempts to improve the features by compensating for channel effects.
4.6
Effect of Spectral Mean Subtraction

To determine the effect of SMS on the recognition performance, the system parameters in Table 4.5 were used, with performSMS = 1. Since it is possible that CMS can have an effect on the result, the tests were also run with performCMS = 0. This parameter set was used with the sets in Tables 4.7 and 4.8 to give the results in Table 4.12.

Table 4.12: Effect of Spectral Mean Subtraction on the person recognition rate
best param (performSMS = 1): 90.65%
best param (performSMS = 0): 95.68%
worst param (performSMS = 1): 72.58%
worst param (performSMS = 0): 73.02%
best param (performSMS = 1, CMS = 0): 87.18%
best param (performSMS = 0, CMS = 0): 90.54%
worst param (performSMS = 1, CMS = 0): 72.06%
worst param (performSMS = 0, CMS = 0): 69.1%

SMS always appears to diminish performance, except in the "worst" parameter case without CMS, where performance is only slightly increased. Subtracting the mean of the spectrum appears to have more of a detrimental effect on recognition performance than a positive one. This could be due to the "musical noise" mentioned in Section 2.10.2, and the fact that most of the utterances do not exhibit audible constant additive noise, which SMS was developed to counter. SMS does not appear to be a very effective method for improving speech recognition in this system. Therefore, we use CMS = 1 and SMS = 0 for best performance (95.7%).
4.7
Effect of ln(gamma) based state duration modeling

To determine the effect of duration modeling on the recognition performance, the system parameters in Table 4.5 were used, with performDuration = off. This parameter set was used with the sets in Tables 4.7 and 4.8 to give the results in Table 4.13.

Table 4.13: Effect of duration modeling on the person recognition rate
best param (performDuration = off): 93.22%
best param (performDuration = on): 95.68%
worst param (performDuration = off): 80.8%
worst param (performDuration = on): 73.02%
As can be seen by these results, duration modeling improves the recognition rate slightly
for the "best" parameter set (error rate reduced by 36%), but significantly degrades
performance for the "worst" parameter set. This could be due to the fact that since
there are more states to train per model, there is less data available to train each
state, leading to inaccurate estimates of the duration parameters.
Therefore, we use performDuration = on for best performance.
4.8
Confidence measure

The output of the level-building search gives an ln(probability) score. Since this score is strongly dependent on the utterance length as well as a number of other factors, these scores cannot be used as an absolute or even relative point of reference for deciding whether the best utterance given by the level-building search was correctly recognised or not.
In an attempt to alleviate this problem, an FSN grammar consisting only of the garbage
model was created, with the garbage model able to appear a number of times to make
up an utterance.
Once a search has been run on an utterance using the normal grammar
FSN, another search can be done using this garbage FSN. The results relative to each
other do show a significant correspondence with regard to recognition accuracy. This
method can be used to generate a meaningful confidence measure by using the rules
given in Table 4.14. The confidence measure can in turn be used by the conversational
interface to decide the path of events to follow.
Table 4.14: Confidence decision rules
confidence                               decision
1. utterance > 10 · garbage              accept
2. garbage < utterance < 10 · garbage    confirm
3. utterance < garbage                   reject
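Taken literally, the rules of Table 4.14 amount to the following decision function on the two level-building scores; how the raw ln-probability scores are scaled before this comparison is not detailed here, so this is only a sketch:

    enum class Decision { Accept, Confirm, Reject };

    // Literal implementation of the rules in Table 4.14, applied to the score of
    // the best path under the normal grammar (utterance) and under the
    // garbage-only grammar (garbage).
    Decision classifyConfidence(double utterance, double garbage)
    {
        if (utterance > 10.0 * garbage) return Decision::Accept;   // rule 1
        if (utterance > garbage)        return Decision::Confirm;  // rule 2
        return Decision::Reject;                                   // rule 3
    }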
Using these rules as a confidence measure, the classification of recognition results obtained for the "best" and "worst" parameter sets can be seen in Table 4.15. These
results consider the case where a confirmation is necessary (i.e. case 2 in Table 4.14),
to be the same as an incorrect decision. Of course, the rules can be altered to change
the sensitivity of the system as desired, but ideally there should be more false rejections
than false acceptances, since a false acceptance is far worse than a false rejection.
The sum of false rejections and correct acceptance gives identical results to the previous
"best" and "worst" recognition rates (i.e. 95.68% and 73.02%), as can be expected.
These results show that this simple method does give a reliable confidence measure.
Table 4.15: Classification of recognition results using the confidence measure
best: false rejection: 5.2%
best: false acceptance: 2.8%
best: correct rejection: 1.7%
best: correct acceptance: 90.3%
worst: false rejection: 1.7%
worst: false acceptance: 21.5%
worst: correct rejection: 5.2%
worst: correct acceptance: 71.6%

4.9
Experiments with extraneous speech
Off-line experiments were performed with speech data that contained extraneous speech.
A garbage model was trained to compensate for (extraneous) words not in the normal vocabulary by using all 950 labeled words re-labeled as garbage, since the overall variance between words is very high. The HMM for the garbage model consisted of one state and one mixture (this was found to give the best results), and a new grammar FSN was created which added multiple optional garbage models to the start and end of the full grammar for the level-building search. Typical test utterances included:
• Ek wil met meneer ben vanleeuwen praat asseblief ("I would like to speak to
mister bert vanleeuwen please").
• Sit my deur na janus brink nou dadelik ("Put me through to janus brink immediately").
Table 4.16: Extraneous speech recognition results
number of utterances: 36
person recognition: 88.89%
It can be seen that the recognition rate is still relatively high, but since the grammar is much less constrained and the overall utterance length longer, the level-building search takes much longer (in the order of 1 second on a Pentium III 550MHz processor). On
closer inspection of the four error cases, one was found to have somebody laughing
loudly in the background during the crucial part of the utterance, and two more had
only the first name (i.e. no title or surname) in the utterance, in between extraneous
speech, leading to a greater chance of failure. This indicates (as can be expected) that
there is a higher chance of successfully recognizing a name within extraneous speech
when more of the name (e.g. title or surname) is specified.
4.9.1
Case study
One of the utterances ("Ek wil met meneer ben vanleeuwen praat asseblief") was used to provide an illustration of the working of the level-building algorithm and Viterbi alignment.

The utterance length was 2.18 seconds, giving 182 frames of features in total (with a framestep of 12ms). The resultant final match consisted of six garbage models, followed by meneer ben vanleeuwen, followed by more garbage models, so that meneer, bert and vanleeuwen were found on levels six, seven and eight of the level-building search (the first level is marked zero). The transcription of the result showed that meneer went from frame 33 to frame 59, ben went up to frame 80 and vanleeuwen carried on till frame 115, which matches very well with the positions in the wave file if converted to time.
Figure 4.5 shows the outputs of the δ_t(i) function (the data was modified slightly prior to plotting by replacing the log(zero) values (represented by −9.9999998 · 10^10) with the lowest probability number in the rest of the search) during Viterbi alignment of the aforementioned HMMs on levels six, seven and eight (see also Section 2.7 and Figure 2.4). Notice that vanleeuwen has 10 states, and the other two models have only 6 states (this experiment was still run with the long/short word models mentioned in Section 4.2.1), and that states 2 to N have probability log(zero) in the first N − 1 frames due to the left-right nature of the Bakis type HMMs.
Effect of gender on recognition performance
To see how much effect the gender of the speaker has on classification performance, the
female data in the training sets were relabelled so that the HMM model names were all
uppercase letters (this was done for backward-compatibility;
to do a genderless search
one only has to perform case-insensitive comparisons between the recognised phrase
and the transcription).
This of course also has the effect of reducing the training
data for male and female models compared to the genderless model, so one can expect
reduced recognition performance if the same parameter sets are used. A new optimum
parameter set could be calculated, but ideally the amount of training data should be
increased. The latter is not an easy option however, since many more utterances need
to be recorded and painstakingly labeled.
The FSN grammar was expanded to include all female model paths (identical except
for the model name case difference), so the level-building search on this experiment
set takes twice as long as normal, but the training time remains about the same, since
there are now twice as many HMMs, but each with (on average) half as much training
data.
Figure 4.5: δ_t(i) for HMMs meneer (top), bert (middle) and vanleeuwen (bottom) on levels six, seven and eight of a level-building search with extraneous speech.
The results for overall recognition accuracy as well as gender classification accuracy were determined, and can be seen in Table 4.17.

Table 4.17: Recognition and gender classification accuracy with gender-specific models
parameter set    recognition rate    male class. accuracy    female class. accuracy
best param       85.29%              72.62%                  93.64%
worst param      41.28%              93.85%                  91.78%
Since the numbers of male and female utterances were almost the same (179 vs. 183),
the classification results were relatively high (much greater than 50%, which would be
as good as chance), showing that the chosen features for recognition are not entirely
gender independent,
and therefore probably not entirely speaker independent either.
The reason the cepstral coefficients were chosen as features was that in the literature
they were found to have a large degree of speaker independence.
The fact that the
system works as well as it does for speakers that are not in the training set at all shows
that there is truth in this, but obviously there is still a degree of speaker dependence.
The leading edge dictation software available today still requires a large amount of
training for a specific speaker, and thereafter the recognition rates are very speaker
dependent.
More research needs to be done on finding effective speaker independent
features.
Real-time performance
The telephone speech recognition system was implemented on a PC equipped with
an Intel Pentium 133MHz processor. It was found that the entire process of feature
extraction,
performing the level-building search, deciding whether this was a good
enough match and subsequently dialing the call to forward it to the correct person
took less than 1 second for an average utterance length of over 2 seconds, quick enough for the system to respond in real time.
Chapter 5
Summary and conclusion
5.1
Summary of work and results
The main goal of this dissertation was to implement a telephone speech recognition
system (with the specific task of a telephone auto-attendant)
algorithms,
using state-of-the-art
and to perform experiments with the system to determine performance
and compare some different techniques to enhance performance.
The hardware solution consisted of a Pentium 133MHz class computer using the Microsoft Windows NT operating system, with a Dialogic D/21H telephone interface card.
A literature study was undertaken to determine which speech recognition algorithm to
use, and find out what techniques have been developed to enhance performance. Based
on this, a decision was made to use hidden Markov models to represent whole words
in continuous speech.
A speech toolkit (HMTSR) was developed by our research group during the course
of this dissertation,
and a number of enhancements and modifications specific to this
project were made. This toolkit was ported to the Windows operating system, since
initially there was no Dialogic support for the Linux operating system.
The toolkit
was integrated with the main state machine controlling the conversational interface,
and the Dialogic API was utilised to control the D/21H card so that waveform speech
data could be recorded and played back as desired.
Almost one thousand words in total were recorded and manually labeled to form the
training set. This training set could then be used as a basis for performing experiments
to compare different techniques, and assess system performance.
The system in operational on-line mode could then recognise spoken names with a
confidence measure, and forward the telephone call to the desired person's telephone
number.
Based on the experimental results in Chapter 4, the overall best person recognition
performance attained off-line was 95.68%,
using the parameters as specified in Ta-
bles 4.5 and 4.7. The on-line system performed rapidly enough even on slow hardware
to attain "real-time" recognition.
The main contributions of this work can be summarised as follows:

• The process of implementing a speech recognition system from first principles was detailed.
• It was shown how low quality telephone speech data can be successfully used in
a speech recognition system.
• The system performance sensitivity with regard to different implementation techniques and system parameters was shown.
• The system is the first of its kind known to be developed in the Afrikaans language.
Furthermore, the results of this work were also published and presented at various conferences (PRASA¹ 1998 [22], PRASA 1999 [23] and AfriCon 1999 [24]).

¹Pattern Recognition Association of South Africa
A feature of the current system is the fact that it uses word-based recognition models
rather than phonemic models.
This implies that it is labour intensive to add more
callees to the system: Each word in the new callee's name needs to be recorded by at
least ten different speakers, all the words carefully labeled in the .wav data files, new
HMMs trained and added to the system, as well as adding the new name to the full
grammar FSN.
To improve upon this restriction would mean using a sub-word unit based recognition
system.
The commercially available systems use the latter approach, and are more
flexible in that they are not word-based, but rather based on the recognition of subword units, so that any vocabulary of words can be constructed using the sub-word
units.
However, the word-based approach is much less computationally
intensive in
terms of searches, and requires far less training data to implement each individual
word model. A sub-word unit system requires massive amounts of carefully labeled
data to achieve sufficient accuracy, and these data sets come at a large price. The
search performed by sub-word unit systems also requires that words need to be found
in strings of sub-word units, as well as only allowing the correct words in the grammar.
The effective grammar models used thus need to be much more complex.
Using the KLT transform matrix to optimally encode temporal information (see Section 2.3.1) can also be considered for future work, since it improves recognition accuracy over using standard first and second order temporal derivatives, with little or no additional computational cost. This research by Milner [9] looks promising, but was discovered only after the experiments had been completed, so it was not investigated as part of this dissertation.
Using whole word hidden Markov modeling for telephone speech recognition has given
very good recognition results. The use of a finite state grammar network has proven
particularly helpful. The inclusion of the temporal derivative in the feature set, noise
compensation
by cepstral mean subtraction
and duration modeling have all helped
to reduce the error rate. The ability to discern the accuracy of a recognition search
was essential for the inclusion of a conversational interface, and the initial experiments
with extraneous speech look very promising.
The result from the gender separation
experiment indicates that there is still a degree of speaker dependence in the feature
set, but despite this the overall recognition performance of the system is still very high
with speakers not in the training set.
A good telephone auto attendant
can be on duty 24/7/52, handling one or a dozen
incoming lines with ease - always patient, always doing its best to get the caller to
the right extension - for a fraction of the cost of human staffing. An auto attendant
is such a simple idea, yet it's taken so long to reach fruition. There are currently only
a handful of companies providing commercial auto attendants,
but the market is now
ready to accept them, and technology is now ready to provide reliable auto attendants.
Appendix A
HMTSR detailed description
The Hidden Markov Toolkit for Speech Recognition¹, developed by the Pattern Recognition group at the University of Pretoria, is written in C++ and makes extensive use of object oriented programming techniques. The following classes form the core of HMTSR:
cache: Facilitates caching the extracted feature vectors from the raw sound files to speed up experiments and model training.
cluster:
Implementation of the K-segmental means algorithm to cluster the feature
vectors for the specified number of mixtures in the HMM.
grammar:
Implements a finite state network class, parsing the description from file
and creating the network structure in memory. This is used to restrict grammar in the
level-building search.
¹See website http://www.ee.up.ac.za/ee/pattern.recognition-page/HMTSR/HMTSR-0.2.tgz
math:
Routines to calculate Fast Fourier Transform, Discrete Cosine Transform,
Inverse Discrete Cosine Transform, Power Spectrum.
mem: Routines for allocating and freeing memory regions for 2-dimensional, 3-dimensional and 4-dimensional arrays of arbitrary type.
processing:
Algorithms implemented to calculate pre-emphasis, mel-spaced filter
banks, mel-cepstra, hamming window, based on parameters of the system such as windowsize (in milliseconds), stepsize (milliseconds), how many mel-cepstral coefficients
to calculate.
All the feature extraction is done by these routines. A raw soundfile is
passed in, and a matrix of cepstral coefficients vs time (frame) is returned.
Can also
optionally perform Spectral Mean Subtraction, since this must be done on the power
spectrum of the signal before the mel-cepstra are extracted.
sound:
Class that deals with audio file formats, to extract the raw audio data as an
array of floating point numbers from NIST, AIFC and raw data files. Supports various
sample rates, bits per sample and number of channels.
speech:
Class to encapsulate sound and processing.
A speech object is instanti-
ated with a filename as parameter, automatically retrieves the speech data from the
file, and performs feature extraction and manipulation, e.g. Cepstral Mean Subtraction
and temporal derivative calculation.
transcription: A class to deal with transcriptions of speech data, i.e. labeling which HMM represents which section of speech. Converts this information to and from the .lola² file format.

hmm: This actually consists of a number of classes and utility functions:
².lola comes from location label, a file format for labeled transcriptions found in the OGI CSLU toolkit. See website http://cslu.cse.ogi.edu/toolkit/old/man/n/seglist.html for details of the file format.
• logmath: Utility functions to perform log mathematics. Log mathematics is used throughout the HMM calculations since it involves addition and subtraction instead of multiplication and division, resulting in faster, more efficient code.
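As a generic illustration of the idea (not the HMTSR logmath code): multiplying two probabilities stored as natural logarithms becomes an addition, and adding probabilities can be done stably with the usual log-sum trick:

    #include <algorithm>
    #include <cmath>

    // Multiplying two probabilities stored as natural logarithms is an addition.
    double logMul(double lna, double lnb) { return lna + lnb; }

    // Adding two probabilities stored as natural logarithms:
    // ln(a + b) = hi + ln(1 + exp(lo - hi)), which avoids underflow.
    double logAdd(double lna, double lnb)
    {
        const double hi = std::max(lna, lnb);
        const double lo = std::min(lna, lnb);
        return hi + std::log1p(std::exp(lo - hi));
    }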
• state: This class represents a state of an HMM, containing all the mixtures of that state with their mixture weights, all the means and variances of each mixture, and the transition probabilities (or state duration gamma pdf properties), together with a function to calculate the observation probability for the state (weighted sum of mixture probabilities), as well as functions to update the state parameters iteratively.
• hmm:
This class represents the hidden Markov model, consisting of state ob-
jects, transition probability matrix (or state duration parameters)
and a model
name as well as a display name (e.g. the silence model has a blank display name).
The class has methods to perform vector quantization, expectation-maximization,
Viterbi alignment and duration parameter training.
• search: Implements the level-building Viterbi search of HMMs over a grammar network, with optional state duration modeling.

• model: The model class consists of a number of HMM objects. It is used to group HMMs with similar parameters (in terms of number of states, mixtures and feature dimension), and can parse a text file describing the model to create the hmm objects.
• duration: This class encapsulates a single gamma pdf precalculated array and its parameters (mean and variance), as well as functions to calculate lngamma and the lngammapdf.
Bibliography
[1] J.T. Chien and H.C. Wang, "Phone-dependent channel compensated hidden Markov model for telephone speech recognition," IEEE Signal Processing Letters, vol. 5, pp. 143-145, June 1998.

[2] Z. Guojun, Z. Lieguang, and F. Chongxi, "Automatic telephone operator using speech recognition," in Proceedings of the International Conference on Communication Technology (ICCT), 1996, vol. 1, pp. 420-423.

[3] N. Fraser, B. Salmon, and T. Thomas, "Call routing by name recognition: Field trial results for the Operetta system," in IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA), 1996, vol. 1, pp. 101-104.

[4] K. Koumpis, V. Digalakis, and H. Murveit, "Design and implementation of auto attendant system for the T.U.C. campus using speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, May 1998, vol. 2, pp. 845-848.

[5] A. Kellner, B. Rueber, F. Seide, and B.-H. Tran, "Padis - an automated telephone switchboard and directory information system," Speech Communication, vol. 23, pp. 95-111, 1997.

[6] H. Schramm, B. Rueber, and A. Kellner, "Strategies for name recognition in automatic directory assistance systems," Speech Communication, vol. 31, pp. 329-338, 2000.