Automatic Stress Detection in Emergency
(Telephone) Calls
Iulia Lefter 1,2,3
Leon J. M. Rothkrantz 1,2
David A. van Leeuwen 3
Pascal Wiggers 1

1 Delft University of Technology, Delft, The Netherlands
2 The Netherlands Defence Academy
3 TNO Defence, Security and Safety, The Netherlands
Abstract: The abundance of calls to emergency lines during crises
is difficult to handle by the limited number of operators. Detecting
whether the caller is experiencing extreme emotions can be a way
of distinguishing the more urgent calls. Apart from these, there are
several other applications that can benefit from awareness of the
emotional state of the speaker. This paper describes the design of
a system for selecting the calls that appear to be urgent, based
on emotion detection. The system is trained using a database
of spontaneous emotional speech from a call-centre. Four machine
learning techniques are applied, based on either prosodic or spectral
features, resulting in individual detectors. As a last stage, we
investigate the effect of fusing these detectors into a single detection
system. We observe an improvement in the Equal Error Rate (eer)
from 19.0 % on average for 4 individual detectors to 4.2 % when fused
using linear logistic regression. All experiments are performed in a
speaker independent cross-validation framework.
Keywords: Stress detection, emotional speech detection, crisis
management, speech analysis, machine learning
Biographical notes: Iulia Lefter received her MSc in Media and
Knowledge Engineering from Delft University of Technology. Now she
is a PhD student at the same university, in a project involving the
Netherlands Defence Academy and TNO. Her interests are multi-modal
information fusion, emotion recognition in speech, behavior analysis
and autonomous surveillance systems.
Leon Rothkrantz is Professor of Computer Science at the
Netherlands Defence Academy and at Delft University of Technology.
He holds MSc degrees in both Mathematics and Psychology. He
received his PhD in Mathematics from the University of Amsterdam. His
research interests are multimodal communication, artificial intelligence,
signal processing and sensor technology.
Pascal Wiggers is an assistant professor in the man-machine
interaction group of Delft University of Technology. His research
focuses on automatic speech recognition, multi-modal interaction,
intuitive and collaborative interfaces, emotion recognition and
probabilistic methods for machine learning such as dynamic Bayesian
networks.
David van Leeuwen is full Professor at Radboud University
Nijmegen and Senior Researcher at TNO Human Factors. He received
his MSc in Physics at Delft University of Technology and his PhD at the
University of Leiden. He has been active in the field of large vocabulary
continuous speech recognition (evaluation, development of Dutch
system), word spotting, and speaker and language recognition. He
has organised several benchmark evaluations (LVCSR Fr/Ge/BrEng:
EU SQALE in 1995, Forensic Speaker Recognition NFITNO in 2003,
LVCSR Dutch: N-Best in 2008). He has participated in NIST SRE,
LRE and RT evaluations since 2003. He has been a representative in
several NATO IST Research task groups on speech technology since
2002, and an ISCA member since 1995.
1 Introduction
During critical phases of military operations, stress levels may rise considerably. A
high level of stress has a negative impact on performance, and is an indicator that
people are physically or mentally overloaded or in some other critical situation.
Stress and negative emotions in general have a strong influence on voice
characteristics. Because speech is a natural means of communication, we can utilise
the paralinguistic features of speech to detect stress and (negative) emotions in a
non-intrusive way by monitoring the communication.
The encoding of physical stress in the voice is highly influenced by the increase
in respiration rate, which increases the subglottal pressure during phonation.
The amount of speech produced between breaths is reduced and the articulation rate is
affected. The muscle activity of the larynx and vocal folds is changed by stress,
which modifies the air velocity through the glottis and the sound frequency (higher
vocal fold tension yields a higher frequency). The activity of other muscles, such as the
tongue, jaw and lips, which shape the resonant cavities, is also affected by stress
and results in changes in speech production (Hansen07).
Stress can appear at different levels, from physical stressors like g-forces or
vibration, unconscious physiological stressors like fatigue, illness and sleep deprivation,
and conscious physiological stressors like noise or a poor communication channel, to
psychological stressors like emotion, high workload, confusion or frustration over
contradictory information (Hansen00), (Hansen07).
The research described in this paper focuses on automatic detection of speech
showing negative emotions. The proposed approach is based on emotion detection
from the sound signal, and does not use any linguistic information. The purpose is
to obtain a language independent emotion detection system.
As opposed to most contributions in the field which use corpora of emotions
portrayed by actors, we use a database that contains recordings of genuine
emotional speech from a call-centre. It contains neutral utterances and utterances
with negative emotions like anger, frustration and stress. The use of a database
with real emotions is very valuable to our research. Real-life data involves many
blended emotions and different intensities, which makes the problem more
complicated. Furthermore, since the telephone acts as a filter and much of the
higher frequency band of the speech signal is removed, detecting the emotion
becomes more difficult, for humans and machines alike. However, this is
exactly the kind of data that we expect in a real application.
Despite the challenges of using a database with spontaneous emotions, we were
able to separate neutral from emotional speech with high accuracy. For this we
extracted a set of prosodic features as well as spectral features. Four classification
methods employing these features were tested and compared. Each of them was
outperformed by the final approach, in which we fused all the individual classifiers
into one detector.
The paper is organised as follows. First, related work is presented, with
emphasis on each step in the development of an emotion recognition system, namely
the available databases, the features that are relevant for emotion recognition and
the machine learning techniques used. Furthermore, a number of studies that use
speech to extract information about abnormal situations, including cases of
aggression and stress detection, are reviewed. Our approach to emotion
detection is presented in the third section. Here the database used in this work,
the features and the classification techniques are described. This is followed by the
analysis of our results and a discussion. Possible applications in the military domain
are outlined in the penultimate section, followed by our conclusions.
2 Related work
In this section we illustrate what the components of emotion recognition
systems are and what kinds of choices can be made at each step. After introducing,
in turn, the state of the art in databases, features and classification techniques, we
briefly review a number of papers with applications in the military domain.
2.1 Databases of emotional speech
Emotion recognition from speech is typically based on using machine learning
techniques for training models on recorded data (Pantic03), (Ververidis06) and
(Zeng07). Therefore, the availability of labeled emotional speech databases is
a prerequisite for the design of such systems. The effectiveness as well as the
perceived results of emotion recognition systems are, to a high extent, dependent
on the databases used for training and evaluation. It has been suggested that the
focus should shift from acted databases to ones that contain spontaneous emotions
(DouglasCowie03).
Several approaches have been used to elicit spontaneous emotions during the
collection of speech databases.
Human machine interaction was used with the help of Wizard of Oz (WoZ)
scenarios (Steidl08) or computer-based dialogue systems (Zeng07). Recent
scenarios for the recording of spontaneous databases include driving simulators
(McNahon08), first-person shooter computer games (Truong08) and earthquake emulators
(Ververidis08).
Three databases with natural and simulated recordings of stress were recorded
for the nato project “Speech under Stress Conditions” (Hansen00). The first
one, susc-0/1 (Speech under Stress Conditions), includes recordings of speech
during genuine stress from a crashing aircraft and an engine out situation. Besides
these psychological stress situations, recordings of voice under physical stress
have also been made. The second database, susas (Speech Under Simulated and
Actual Stress) (Hansen97), contains a large variety of stress conditions and speaking styles.
Computer tracking tasks were used to create stress such as time pressure and
workload; roller coaster rides and helicopter commands were used to provoke
physical and psychological stress, also producing the Lombard effect (Lombard11).
The third database, dlp, is obtained from license plate reading tasks under
psychological stress from time pressure.
2.2 Emotional features in speech
A large set of acoustic features has been found so far to correlate with emotions.
In general researchers use different subsets, and try to find the most suitable
combinations. The acoustic features can be categorised in prosodic, spectral, and
voice quality features.
Prosody studies the rhythm, stress and intonation of speech and works on
larger segments of speech, such as syllables, words, phonemes or entire turns of
a speaker. Pitch, loudness, speaking rate, duration, pause and rhythm are all
perceived characteristics of prosody. Even though these characteristics have no
unique acoustic correlates, there are strong correlations between
them. Prosodic features are the most popular in speech emotion recognition
(Steidl08), (Batliner06), (Truong07), and (Krajewski07).
Besides the fundamental frequency which correlates with pitch, the speech
signal contains other spectral characteristics that are influenced by emotional
arousal. Such examples are harmonics and formants. Mel Frequency Cepstral
Coefficients (mfcc) (Davis90), which are generally used in speech recognition,
are also spectral features successfully used for emotion detection. They can be
augmented by Linear Predictive Cepstral Coefficients (lpcc), Mel Filter Bank
(mfb) features (Steidl08) or Perceptual Linear Prediction (plp) (Truong07).
Studies on voice quality report that there is a strong correlation between
voice quality features and emotions. Examples of voice qualities are neutral,
whispery, breathy, creaky, and harsh or falsetto voice. Jitter measures the cycle-to-cycle
variation of the period length, while shimmer measures the cycle-to-cycle variation of
the peak or average amplitude. The harmonics-to-noise ratio (hnr) measures the degree of
periodicity in a sound and is another voice quality feature. Applications involving
these features can be found in (Krajewski07) and (Vlasenko07a).
In general there are two units of analysis that are widely used: the utterance
level and the frame level. As a general approach for extracting utterance level
features, different statistical functions are applied over the overall contour of the
features, as for example fundamental frequency. The idea is that much information
about emotion can be extracted from the overall trend of a certain parameter.
Spectral features, however, give information at the frame level (∼ 10 ms), which
can be modelled much more accurately because of the relatively large number of
frames in an utterance. In (Datcu06) the efficiency of using different numbers of
frames per speech utterance is investigated.
A comprehensive set of acoustic cues, their perceived correlates, definitions and
acoustic measurements in vocal affect expression is provided in (Juslin05). The
same work also provides guidelines for choosing a minimum set of features that is
bound to provide emotion discrimination capabilities.
An extensive study by Scherer, Johnstone and Klasmeyer (Scherer03b)
presents an overview of empirically identified major effects of emotion on vocal
expression. Studies on the relevance of acoustic features for stress can be found in
(Rothkrantz04), (Tolkmitt86). In (Zhou01) the benefits for using features derived
from the Teager energy operator in the case of classification of speech under stress
are presented.
There is a trend to generate an increasingly large set of features and statistics
of these features, and use a selection algorithm to determine the most appropriate
ones. Examples of such approaches can be found in (Schuller05a), (Batliner06),
(Schuller07). Even though there is an improvement in the results achieved
using the reduced feature set, this approach generates models that are very
well adapted to the specific database, while the capability to generalise is
decreased. Strong evidence of this is given in (Vogt05), where a large number
of features was extracted for acted and spontaneous databases and the most
relevant ones were selected using correlation-based feature selection. The resulting
feature subsets are different for acted and for spontaneous speech. It appears that
emotions in acted speech can be well recognised using the pitch values, while for
the spontaneous case, mfccs perform better.
2.3 Classification techniques
After the feature sets are obtained, different kinds of classifiers are used
for training and testing. Experiments using linear classifiers (Krajewski07),
Gaussian mixture models (gmm) (Truong07), neural networks (nn) (Petrushin99),
(Krajewski07), hidden Markov models (hmm) (Vlasenko07a), Bayesian networks
(bn) (Schuller05a), random forests, support vector machines (svm) (Truong07)
and other machine learning techniques are reported. In general it is difficult
to compare the results across different experiments because different general
conditions are applied: different databases, units of analysis, feature sets, some
report speaker dependent, some speaker independent results, etc. There are also
studies investigating the capabilities of such classifiers. It appears that support
vector machines are among the most successful. Recent approaches combine the
results of different weak classifiers in ensembles, and their combination is a stronger
classifier, which in almost all cases gives better results than each individual
classifier (Schuller05b).
2.4 State of the art with military applications
In (Devillers07) a corpus with recordings from a medical emergency call centre
is used. Relief, anger, fear and sadness are recognised by means of linguistic
content and paralinguistic cues. Best results are obtained for fear in the case
of paralinguistic cues, and for relief in the case of linguistic content. Another
approach (Morrison05) focuses on distinguishing angry speech from neutral speech. The
purpose is to deploy the real-time classification system in call centres. Emotion
recognition for the purpose of crisis management in the context of an automatic
online dispatcher is reported in (Fitrianie08). The system is able to gain
information about the crisis from the user’s input, and also to recognise the
affective state by means of linguistic content, using a database with valence and
arousal scores for specific keywords.
An interesting contribution that uses emotion recognition techniques for the
purpose of audio-based surveillance systems is presented in (Clavel08). Here the
focus is on detecting one type of emotion, namely fear. The research is based on a
database of movie segments that show either no emotion or different levels of fear.
Besides examples aiming at detecting different emotions, similar systems can be
developed for use in surveillance systems. In (Valenzise07), two gmm classifiers
are used for separating screams from noise and gunshots from noise. Furthermore,
the system is able to perform acoustic source localisation using a microphone array
and the sounds' times of arrival, with a linear-correction least-squares localisation
algorithm. A verbal aggression detection system is described in (Hengel07). What
is particular about their approach is that the system does not use classification
methods, but is rather a knowledge based system. The detection methodology is
inspired from the way in which sounds are detected by the human auditory system.
The detection of stress from speech can also be used for the purpose of
deception detection. An example based on finding the physiological microtremor
known to appear due to psychological stress is (Mbitiru08).
Several studies analyse the speech of drivers and pilots as special categories.
Stress detection in the case of drivers’ speech is described in (Fernandez03).
Speech can also be used to acquire information on pilots' fatigue, from recordings
made in cockpit and laboratory conditions, which are highly affected by noise
(Legros08).
Although, as shown in this section, many contributions are available in the field
of emotion recognition from speech, the problem is far from solved and more
research is needed to increase recognition rates, to diminish the influence
of noise in different environments and to deploy such systems in real time.
3 Methodology
We move on to describing the details of our emotion detection approach. First, the
database is introduced, followed by the feature sets used. The section continues
with more details on the classifiers used.
3.1 The South-African database
The South-African database contains genuine recordings from call-centres. The
data has been labeled according to two classes: emotional and neutral. In Figure 1
the spectrograms and pitch contours of a neutral and an emotional sample are depicted.
Table 1  The final feature set used in our experiments

Feature type                   Functionals
Pitch                          Mean, Standard deviation, Range, Absolute slope (without octave jumps), Jitter
Intensity                      Mean, Standard deviation, Range, Absolute slope, Shimmer
Formants                       Mean F1, Mean F2, Mean F3, Mean F4
Long term averaged spectrum    Slope, Hammarberg index, High energy
Spectrum                       Centre of gravity, Skewness
There is a total of 3000 utterances, out of which 1000 have been labeled
as emotional. In most of the cases the emotional utterances show anger,
dissatisfaction or frustration, while the neutral ones are just emotionally neutral
sentences. The recorded utterances are mainly short, with an average length of 4.30
seconds. In total, the data set contains 215.02 minutes of speech. Based on the call
identities, we can estimate that messages from 1200 speakers have been recorded.
The data of each class was split into 10 balanced subsets, taking into account gender
independence. Based on a 10-fold cross-validation scheme, 8 subsets were used
for training, one for validation, and one for testing. The detection was performed
gender independently.
3.2 Feature set
Choosing an optimal feature set is an open issue. The choice of the feature set
has a great impact on solving the classification problem. Even though research has
shown that certain features change their behavior under the influence of
emotions, it is important to decide which ones to use and also which statistics
comprise the most relevant information. A high number of features and
statistics contains a lot of redundancy and also makes the process very expensive.
On the other hand, too few features might miss important information about
emotion. Therefore, a compromise should be made when choosing the right set of
features.
We utilised two kinds of features: prosody-based features at the utterance level
and spectral features at the frame level.
3.2.1 Prosodic features
Based on the research in (Juslin05), (Truong08) and (Vogt05), we used prosodic
features as indicated in Table 1. These features are extracted at the utterance
level, and are based mainly on pitch and intensity with some additions in spectral
shape. The features have been extracted using Praat (Boersma01).
Table 2 shows the ranking of the features based on information gain ratio.
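As an illustration of how such utterance-level functionals can be obtained in practice, the sketch below uses the praat-parselmouth Python interface to Praat together with NumPy. The file name, the restriction to voiced frames and the exact subset of statistics are illustrative assumptions, not the paper's exact configuration.

import numpy as np
import parselmouth  # Python interface to Praat

def prosodic_functionals(wav_path):
    # Sketch: a few of the utterance-level functionals of Table 1 (means,
    # standard deviations and ranges of the pitch and intensity contours).
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch().selected_array['frequency']
    f0 = f0[f0 > 0]                                   # keep voiced frames only
    if f0.size == 0:                                  # guard for fully unvoiced input
        f0 = np.array([0.0])
    inten = snd.to_intensity().values.flatten()

    feats = {}
    for name, contour in (('pitch', f0), ('intensity', inten)):
        feats[name + '_mean'] = float(np.mean(contour))
        feats[name + '_std'] = float(np.std(contour))
        feats[name + '_range'] = float(np.max(contour) - np.min(contour))
    return feats

print(prosodic_functionals('utterance.wav'))   # 'utterance.wav' is a placeholder path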
3.2.2 Spectral features
As is customary in almost all speech processing technologies (speech, speaker
and language recognition) we used cepstral features for emotion recognition.
Table 2  Information gain (ig) for the prosodic feature set

IG        Feature                            IG        Feature
0.27868   pitch(mean)                        0.08102   jitter
0.15186   ltas(cog)                          0.05889   ltas(slope)
0.13838   pitch(range)                       0.05063   intens(range)
0.12302   intens(mean)                       0.04389   ltas(skew)
0.12232   pitch(std)                         0.04007   intens(std)
0.1133    pitch(slope no octave jumps)       0.02835   Hammarberg
0.10513   intens(absolute slope)             0.02621   F3(mean)
0.10201   pitch(absolute slope)              0.01588   F2(mean)
0.09952   high energy                        0.0134    shimmer
0.09101   F4(mean)                           0.00751   F1(mean)
Specifically, we used Perceptual Linear Prediction coefficients (Hermansky90) with
rasta processing (Hermansky92). Twelve plp coefficients were extracted over
a window of 32 ms, every 16 ms, and these were augmented with log energy.
Derivatives over 5 consecutive frames (deltas) were added to these, resulting in 26
features.
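For readers who want a comparable frame-level front end, the sketch below computes 13 MFCCs plus deltas with librosa as a commonly available stand-in for the PLP-RASTA features used in the paper. The 8 kHz sampling rate is an assumption for telephone speech; the window and hop sizes follow the 32 ms / 16 ms values above.

import numpy as np
import librosa

def cepstral_features(wav_path, sr=8000):
    # Sketch: 13 MFCCs + deltas over 5 frames -> 26-dimensional frame vectors,
    # analogous in size to the PLP(+log energy)+delta features described above.
    y, sr = librosa.load(wav_path, sr=sr)          # telephone speech assumed at 8 kHz
    n_fft = int(0.032 * sr)                        # 32 ms analysis window
    hop = int(0.016 * sr)                          # 16 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc, width=5)   # derivative over 5 consecutive frames
    return np.vstack([mfcc, delta]).T              # shape: (n_frames, 26)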
3.3 Classification techniques
This section describes a series of experiments we did on the South-African
database. The experiments include the use of four classification approaches based
on the features mentioned in Section 3.2, and also two methods for decision-level
fusion of multiple classifiers. The purpose is to see how well the classifiers combine,
how well the features complement each other, and which method yields the best
results.
All experiments are implemented using 10-fold speaker-independent cross-validation.
In the case of fusion, double cross-validation was needed, since the scores
come from different classifiers and span different ranges (some are log-likelihoods,
some probabilities). As we want to perform a linear combination of the scores, a
first step is to bring all the scores into the same range. This can be done using
t-normalisation (Auckenthaler04). The data was divided into training, development
and test sets, for each possibility in the cross-validation framework. For ease of
understanding, we refer to the emotional samples as target and the neutral samples
as non-target. The mean and standard deviation of the scores of the non-target
development set are used to normalise the scores of the test set.
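A minimal sketch of this t-normalisation step, assuming the detector scores are available as NumPy arrays:

import numpy as np

def t_norm(test_scores, nontarget_dev_scores):
    # Normalise test scores with the mean and standard deviation of the
    # non-target (neutral) development scores, as described above.
    mu = np.mean(nontarget_dev_scores)
    sigma = np.std(nontarget_dev_scores)
    return (np.asarray(test_scores) - mu) / sigma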
3.3.1 gmms with plp features
A first classification approach was to model each of the two classes, emotional and
neutral, by a mixture of Gaussians, see Figure 2(a). The distributions of frames for
emotional and neutral speech were modelled by two gmms. These were trained
using EM, in four iterations. Classification of test utterances was based on the
difference of the log-likelihoods of the test data given the two models.
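A minimal sketch of such a detector with scikit-learn's GaussianMixture is given below; the diagonal covariances, the mixture size and the variable names are assumptions rather than the exact experimental configuration.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(emotional_frames, neutral_frames, n_components=64):
    # One GMM per class, trained with EM on frame-level feature vectors.
    gmm_emo = GaussianMixture(n_components=n_components, covariance_type='diag',
                              max_iter=4).fit(emotional_frames)
    gmm_neu = GaussianMixture(n_components=n_components, covariance_type='diag',
                              max_iter=4).fit(neutral_frames)
    return gmm_emo, gmm_neu

def detection_score(gmm_emo, gmm_neu, utterance_frames):
    # Average per-frame log-likelihood difference: positive -> more emotional.
    return float(gmm_emo.score(utterance_frames) - gmm_neu.score(utterance_frames))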
The experiments include using different numbers of Gaussians in the mixture in order
to inspect which number is the most beneficial and what the influence of varying the
number of Gaussians is on the discrimination result.
In the case of plp features the dimensionality is high: 26 features per
frame, and therefore a large number of feature values for an entire file. In order to obtain a
good representation, the data needs to be modelled with an adequate number of
mixtures. Experiments with 64, 128 and 256 mixtures were performed.
A precomputed Universal Background Model (ubm) was also used as a basis
for Maximum A-Posteriori (map) adaptation (Reynolds00). The ubm was trained
on English telephone conversations, so the recording conditions of these samples
and of the ones in the database we use are similar. The ubm contains 512 mixtures.
map adaptation of the ubm is computationally much more efficient than training
gmms from scratch.
3.3.2 gmms with prosodic features
The prosodic features were also modelled using a gmm, but because the data
is much sparser than in the case of spectral features (only one data point per
utterance) the number of Gaussians had to be chosen much smaller—in our
experiments 2, 4, and 8.
3.3.3 The Dot Scoring approach (ds)
Instead of the full computation of the likelihood of the test sample using the ubm
and a map-adapted gmm, it is possible to linearise the gmm likelihood function
(Brummer09). This operation leads to very efficient modelling and scoring, reducing
the training and testing process to the evaluation of the inner product (dot product)
of the sufficient statistics of the train and test utterances with respect
to the ubm. In both speaker and language recognition this method has proved
to work with high performance, specifically because channel compensation can be
carried out at the level of the sufficient statistics. We have adapted this dot-scoring
system for emotion classification as well.
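The sketch below illustrates the underlying idea in strongly simplified form: per-utterance zeroth- and first-order Baum-Welch statistics are collected with respect to the UBM, and utterances are compared through an inner product of the centred first-order statistics. The channel compensation and exact scaling of the real dot-scoring system (Brummer09) are omitted; the UBM is assumed to be a fitted scikit-learn GaussianMixture.

import numpy as np

def centred_first_order_stats(ubm, frames):
    # Baum-Welch statistics of one utterance with respect to the UBM.
    post = ubm.predict_proba(frames)          # responsibilities, (n_frames, n_components)
    n_c = post.sum(axis=0)                    # zeroth-order statistics
    f_c = post.T @ frames                     # first-order statistics
    f_c -= n_c[:, None] * ubm.means_          # centre around the UBM means
    return f_c.flatten()                      # statistics "supervector"

def dot_score(ubm, train_frames, test_frames):
    # Simplified dot scoring: inner product of the two statistics supervectors.
    a = centred_first_order_stats(ubm, train_frames)
    b = centred_first_order_stats(ubm, test_frames)
    return float(a @ b)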
3.3.4 svm with prosodic features
Support Vector Machines (svm) are among the most popular classifiers in
speech emotion recognition, due to their high generalisation capability. Given the
separation problem of two classes, a support vector machine will determine a
hyperplane that can completely separate these two classes, as shown in Figure
2(b). The idea is to find a hyperplane that maximises the margin between the
two datasets, and the samples that lie on the margin are called support vectors.
The features mentioned in Table 1 are used for this experiment, along with an svm
with RBF kernel. The experiment is designed in the speaker-independent cross-validation framework.
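A sketch of this classifier with scikit-learn is shown below; the speaker-grouped cross-validation, the standardisation step and the use of class probabilities as detection scores are reasonable defaults, not necessarily the exact experimental settings.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def svm_detection_scores(X, y, speakers, n_splits=10):
    # X: (n_utterances, n_prosodic_features), y: 1 = emotional, 0 = neutral,
    # speakers: speaker identity per utterance, for speaker-independent folds.
    scores = np.zeros(len(y), dtype=float)
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=speakers):
        clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', probability=True))
        clf.fit(X[train_idx], y[train_idx])
        scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]  # P(emotional)
    return scores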
3.3.5 The ubm-gmm-svm approach (ugs)
Recently in speaker recognition it has been shown that the map adapted means
of the ubm can be used as high-dimensional data points to train an svm
(Campbell06b). The utterances of the two classes, emotional and neutral, are used
as positive and negative examples for the svm, and classification occurs by map
adaptation of the ubm to the test utterance, and using the means as the test data
point in the svm.
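The sketch below illustrates the supervector idea under simplifying assumptions: only the UBM means are adapted, in a single relevance-MAP pass, and the kernel normalisation used in (Campbell06b) is omitted.

import numpy as np
from sklearn.svm import SVC

def map_mean_supervector(ubm, frames, relevance=16.0):
    # Relevance-MAP adaptation of the UBM means towards one utterance,
    # stacked into a fixed-length supervector.
    post = ubm.predict_proba(frames)             # responsibilities (n_frames, C)
    n_c = post.sum(axis=0)                       # soft counts per component
    f_c = post.T @ frames                        # first-order statistics
    alpha = n_c / (n_c + relevance)              # adaptation coefficients
    ml_means = f_c / np.maximum(n_c[:, None], 1e-10)
    adapted = alpha[:, None] * ml_means + (1.0 - alpha[:, None]) * ubm.means_
    return adapted.flatten()

def train_supervector_svm(ubm, utterances, labels):
    # utterances: list of frame matrices; labels: 1 = emotional, 0 = neutral.
    X = np.vstack([map_mean_supervector(ubm, u) for u in utterances])
    return SVC(kernel='linear', probability=True).fit(X, labels)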
3.3.6 Fusion of classifiers
For a successful fusion, it is important that the approaches being fused handle
the problems differently, so that they complement each other. In our case,
the approaches previously described can be roughly divided into gmm-based
approaches and svm-based approaches, from the classification point of view, and
in spectral (frame level) features and prosodic (utterance level) features from
the features point of view. Given these different approaches, we can expect to
benefit from the fusion between gmm-based and svm-based approaches, since
the classification procedure is very different, as well as from the fusion between
different types of features.
Three examples of late fusion are implemented within this work:
• fusion between gmm with plp features and svm with prosodic features,
• fusion between the ubm-gmm-svm approach and svm with prosodic features,
• fusion between the Dot Scoring approach and svm with prosodic features.
As a first step we have performed fusion based on a linear combination of the
t-normalised scores of our classifiers. For this purpose we have used equal weights
for the classifiers.
Besides the fusion done using a linear combination of the t-normalised scores,
linear logistic regression fusion was performed. Given that N classifiers are to be
combined, and their output scores are s_{1j}, s_{2j}, ..., s_{Nj} for a detection trial j, the
fusion creates a score based on the following linear combination of them:

f_j = a_0 + \sum_i a_i s_{ij}.
The constant a_0 is added to the formula for calibration, and has in itself
no discrimination power. The weights a_i can be interpreted as the contributions
of each classifier to the final score. This approach is implemented using FoCal
(Brummer06), a tool that provides simultaneous fusion and calibration. The fusion
is linear, and a specific weight is assigned to each component in such a way as to
make the fusion optimal, using C_{wlr} as the minimisation objective:
C_{wlr} = \frac{P}{K} \sum_{j=1}^{K} \log\left(1 + e^{-f_j - \mathrm{logit}\,P}\right) + \frac{1-P}{L} \sum_{j=1}^{L} \log\left(1 + e^{g_j + \mathrm{logit}\,P}\right),

where

f_j = a_0 + \sum_{i=1}^{N} a_i s_{ij}, \qquad g_j = a_0 + \sum_{i=1}^{N} a_i r_{ij}, \qquad \mathrm{logit}\,P = \log\frac{P}{1-P},
and where K is the number of target trials, L the number of non-target trials,
N the number of detectors to be fused, f_j the fused target scores, g_j the fused
non-target scores, s_{ij} the N × K matrix of scores for the target trials, and
r_{ij} the N × L matrix of scores for the non-target trials.
In our case P = 0.5 and C_{wlr} is the C_{llr} introduced in the next section. What makes
the results of the logistic regression training more powerful is that the fused scores
tend to be well-calibrated detection log-likelihood-ratios.
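FoCal itself is a separate toolkit; as an approximation, the linear logistic regression fusion described above can be sketched with scikit-learn, where the learned coefficients play the role of the weights a_i and the intercept that of the calibration constant a_0 (assuming t-normalised scores and a prior of 0.5):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels):
    # dev_scores: (n_trials, n_detectors) matrix of t-normalised detector scores,
    # dev_labels: 1 = target (emotional), 0 = non-target (neutral).
    return LogisticRegression().fit(dev_scores, dev_labels)

def fused_scores(fuser, test_scores):
    # decision_function returns a_0 + sum_i a_i * s_ij, i.e. the fused score f_j.
    return fuser.decision_function(test_scores)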
4 Results and interpretation
The det (Detection Error Tradeoff) (Martin97) curve is a plot of the false alarm
probability on the horizontal axis and miss probability on the vertical axis. It is
a variant of the traditional roc curve (Receiver Operating Characteristic) which
presents false alarm probability on the horizontal axis and the correct detection
rate on the vertical axis. In the case of the det curve both error types are treated
uniformly and the resulting plots are close to linear. The best performing systems
lead to det curves very close to the origin.
A value that summarises the det curve is the equal error rate (eer). It is the
value where P_{fa} = P_{miss}.
det curves and equal error rates are very good measures of the discrimination
capability of a recognition system. However, in real applications the threshold has
to be set beforehand.
The detection cost function is a simultaneous measure of discrimination
capability and of calibration. It is based on the actual costs of misses
(C_{miss} = 1) and false alarms (C_{fa} = 1), as well as the prior probability
of a target (P_{tar} = 0.5). Based on these parameters, the detection cost function is
calculated as follows:

C_{det}(P_{miss}, P_{fa}) = C_{miss} P_{miss} P_{tar} + C_{fa} P_{fa} (1 − P_{tar}).
After calculating the detection function, the natural approach is to choose a
threshold that minimises the cost function. However, choosing such a threshold
that is expected to have a proper behavior on unseen data is a challenging task.
A metric called C_{llr} encapsulates information about the application-independent
performance of a recognition system. For a more detailed description, please refer
to (Brummer06) and (vanLeeuwen07).
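The sketch below shows how the eer and the detection cost can be computed from arrays of target and non-target scores, following the definitions above; the brute-force threshold search is for illustration only.

import numpy as np

def miss_fa_rates(target_scores, nontarget_scores, threshold):
    p_miss = float(np.mean(np.asarray(target_scores) < threshold))
    p_fa = float(np.mean(np.asarray(nontarget_scores) >= threshold))
    return p_miss, p_fa

def equal_error_rate(target_scores, nontarget_scores):
    # Scan all observed scores as candidate thresholds and pick the one where
    # the miss and false-alarm probabilities are closest.
    thresholds = np.concatenate([target_scores, nontarget_scores])
    best = min(thresholds,
               key=lambda t: abs(np.subtract(*miss_fa_rates(target_scores, nontarget_scores, t))))
    return float(np.mean(miss_fa_rates(target_scores, nontarget_scores, best)))

def detection_cost(target_scores, nontarget_scores, threshold,
                   c_miss=1.0, c_fa=1.0, p_tar=0.5):
    p_miss, p_fa = miss_fa_rates(target_scores, nontarget_scores, threshold)
    return c_miss * p_miss * p_tar + c_fa * p_fa * (1.0 - p_tar)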
Figure 3(a) depicts the equal error rates for gmm classification with different
numbers of mixtures and plp features. It is important to note that for 512
Gaussians, the gmm was adapted from a ubm of different data. Apart from this
result, we can see that the trend is to have smaller equal error rates as the number
of Gaussians in the mixture increases. In our case, 256 Gaussians lead to the
smallest equal error rate (19.5).
Looking at the outputs for gmm with the prosodic features depicted in Figure
3(b), we can see that the errors seem to be minimal for a mixture with 4 Gaussians,
and that changing the number of Gaussians in any direction leads to worse
performance. The results show that plp features are more appropriate than the
prosodic feature set when using gmm.
Table 3  Results in eers for individual classifiers and classifier fusion schemes with equal weights and weights determined by FoCal

Combination            gmm     ugs     ds      svm      eer (%)
gmm                    1       -       -       -        21.2
ugs                    -       1       -       -        19.8
ds                     -       -       1       -        19.6
svm                    -       -       -       1        15.5
gmm + svm (equal)      1       -       -       1        11.3
gmm + svm (FoCal)      3.03    -       -       5.77     9.9
ugs + svm (equal)      -       1       -       1        10.5
ugs + svm (FoCal)      -       5.62    -       5.44     10.1
ds + svm (equal)       -       -       1       1        11.3
ds + svm (FoCal)       -       -       1.81    5.27     11.0
all four (FoCal)       3.35    5.92    1.72    5.13     4.2
The experiments continue with using gmm with plp features, ubm-gmm-svm,
Dot Scoring and svm with the prosodic feature set. Table 3 shows the results of
these individual classifiers and their fusion in terms of equal error rates. The results
of two types of fusion are provided, both performed on the t-normalised scores:
with equal weights for each classifiers and with weights determined by logistic
regression with FoCal. The first four columns of the table show the weights for the
classifiers, while the last columns shows the equal error rate given the combination
in that row of the table.
Among the individual classifiers, the smallest equal error rates are obtained
for svm using the prosodic features. ugs and ds have similar performances, while
gmm with plp features leads to the highest eer.
T-normalising the scores and combining them linearly with equal weights leads
to a large improvement in performance in all cases. It is interesting to see that the
fusion between gmm and svm gives the same eer as the fusion between ds and svm,
even though individually ds works better than gmm, so there is more benefit in
fusing gmm with svm. The fusion of ugs and svm has the best performance in the
case of equal-weight fusion.
Using logistic regression for the computation of weights leads to even better
results in all cases. This time the best results are obtained by the combination of
gmm and svm, and the worst by ds and svm. However, the differences between
the results are relatively small.
In the case of fusion by logistic regression it is interesting to examine the
coefficients FoCal has assigned to the classifiers. In the case of fusing gmm with
svm, svm has a higher weight, and it also performs better individually. This leads
to a higher improvement from the equal weight fusion than in the case of the other
two combinations. The fusion between ds and svm is done with a much smaller
weight for ds, even though by itself it performs better than gmm. On the other
hand, when fusing ugs with svm, we can see that the two classifiers are given
almost the same importance. Interestingly, the weight of ubm-gmm-svm is slightly
higher, while its individual performance is slightly worse than that of svm.
As a final step, we have fused all the individual classifiers using logistic
regression.
Table 4  Cost values for the fused systems

Method      Cllr      min Cllr    eer (%)    Cdet      min Cdet
FoCal       0.2613    0.2505      7.05       0.0695    0.0684
FoCal TN    0.1624    0.1533      4.23       0.0416    0.0411
This was done on both the original scores and on the t-normed scores
for gmm and Dot Scoring. The det curves of these fusions are presented in Figure
4. The fusion of all four individual classifiers (gmm, svm, ubm-gmm-svm and Dot
Scoring) leads to an equal error rate of 7.1% if FoCal is used on the original scores.
When the scores are t-normalised and then fused the equal error rate drops to
4.2% which is a major improvement. The ubm-gmm-svm seems to be the most
important for the final result, being assigned the highest weight, followed closely
by svm. Dot Scoring has the least importance.
Two circles are marked in Figure 4. They represent the minimum detection
costs. Table 4 gives more insight into the cost associated with making a decision
for the system based on the fusion of all individual classifiers, with and without
t-norming. We can see that for these well-calibrated systems the actual detection
costs (C_{det}) are very close to the minimum costs (min C_{det}). This means that
with the chosen threshold we do not lose much performance.
We refer to the area on the det plot corresponding to a threshold as an
operating point. In Figure 4, boxes are drawn around the operating points based on
the 95% confidence intervals of P_{fa} and P_{miss}. We can see that in both cases the
operating point is very close to the point of minimum cost, which shows that the
calibration is good. The situation in Figure 4 for the black curve is optimal, since
the minimum cost lies within the operating point confidence area.
Using proper information about the application parameters (C_{fa}, C_{miss} and
P_{tar}) we can set a threshold θ, defined as:

\theta = -\log \frac{C_{miss} P_{tar}}{C_{fa}(1 - P_{tar})},
and expect this to be the optimum threshold in terms of cost. The point we
operate at is located on the det curve at the (P_{miss}, P_{fa}) point given by this threshold.
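As a small worked illustration with the parameters used in this paper (C_{miss} = C_{fa} = 1 and P_{tar} = 0.5), the threshold becomes

\theta = -\log \frac{1 \cdot 0.5}{1 \cdot (1 - 0.5)} = -\log 1 = 0,

so a trial is labelled emotional whenever its fused log-likelihood-ratio score is positive.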
As an alternative to computing C_{det} for a given cost function, it is possible
to evaluate the detection and calibration performance by integrating over all cost
functions. The measure that does this is C_{llr} (see Table 4), and the
calibration performance can be assessed by comparing it to C_{llr}^{min}, which is
C_{llr} with optimal calibration (Brummer06).
5 Applications
Enabling machines to perceive information about the emotions a subject is
experiencing has a high scientific relevance and making use of such technology has
strong societal impacts. In this section we briefly present a number of possible
applications for a stress detector based on speech: support for other speech
technologies, automatic surveillance, emergency call centres and pilots/troops
communicating with the base, as depicted in Figure 6 a) and b). We also observe
the need for real-time processing in all these applications, and briefly present a
proposed architecture and our online, almost real-time, emotion recognition system,
which is based in part on the technology discussed in the previous sections.
Speech technology has already been employed for a number of problems
with security applications like speech recognition, speaker identification, language
recognition and deception detection. Opportunities for further developments are
highlighted in (Weinstein91). Speech recognition can be used for detecting certain
key words or even recognising continuous speech, and the result can show hints
of aggression or contain important information that can be retrieved. In high
performance fighter aircraft and helicopter environments the pilot could give
commands by using a speech recogniser and therefore be able to concentrate better
on flying the aircraft in crisis situations. However, speech recognition can be
severely degraded when the pilot's voice is modified due to high stress
levels, the Lombard effect and strong noise. Speaker verification and recognition can
also be affected by stress or other emotions experienced by the speaker. Having
information on the emotion a speaker is experiencing can be used as a basis for
addressing the aforementioned shortcomings. The effects of stressed speech on
speech and speaker recognition are analysed in (Steeneken99) for three databases
with recordings of different types of stress. In the case of both speech and speaker
recognition, a training stage using a speech database is needed. As shown by this
research, the most severe negative effects are noticed when training is done only
with stress-free data and testing with stress data, and less severe for matched
training.
In the case of emergency call centres, an operator can probably best assess the
urgency of a call. However, it is not always possible to have an operator at any
time and any place. Often during crises, emergency lines are overloaded with calls.
Therefore, a system able to recognise emotions or the stress level of a speaker can
be used together with an answering machine, so that urgent calls are selected
and answered with high priority by the operator while less urgent calls are
handled by the machine. While speaking with a certain caller, the operator can
be assisted by the stress detector and receive continuous feedback about the stress
level of the speaker. A key feature of a prosody-based stress detector is that
it does not rely on language-related information. This way it can detect whether a call
seems urgent even if the speaker is using a different language than the one from
the training database, and the human operator could ask for an interpreter.
Besides detecting the stress levels of people accessing a call centre, it is also
important to be aware of the emotional state of the operator, who, during a crisis
can be overloaded and might not be able to cope with the calls. In such a case
other operators should be asked to offer support.
The emergency call centre is only a general application of the stress detection
system. The communication between pilots or troops and the base has some similar
features. In dangerous cases, when the troops are under fire, the communication
is expected to be full of extreme emotions and emergency alerts should be given
to the base. A number of different sensors for sweat and heart rate measurements
can be employed. Nevertheless, a stress detector based on voice is a least intrusive
way and is expected to have a high level of acceptance.
Automatic surveillance is another area of application of stress and negative
emotions detectors. Scenes dominated by aggression are most likely accompanied
by panic, screams, and negative emotions like anger and fear. A surveillance system
based on sound can detect these scenarios on its own or in combination with
video surveillance.
As can easily be noticed, all these applications require real-time processing.
An architecture for an online stress detection system is depicted in Figure 5.
After speech acquisition, a preprocessing stage is necessary. Speech needs to be
segmented into smaller units that will be analysed. Noise filtering is an important
stage that should be done before feature extraction. Different types of features can
be extracted, like spectral or prosodic features. These should be normalised and
classified based on the already trained models. A classifier fusion stage is included
in order to increase the accuracy, as encouraged by our research. We have already
developed an online, almost real time, system based on this architecture, using
only prosodic features and an svm classifier (Lefter10). It can currently detect
anger, happiness and sadness, and also has a separate anger detection function, as
depicted in Figure 7.
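A skeleton of how the stages of Figure 5 could be chained in code is sketched below; every stage is a placeholder callable, to be filled in with the segmentation, denoising, feature extraction, normalisation, classification and fusion components discussed in the previous sections.

class OnlineStressDetector:
    # Sketch of the processing chain of Figure 5; all stages are placeholder callables.
    def __init__(self, segmenter, denoiser, feature_extractors, classifiers, normaliser, fuser):
        self.segmenter = segmenter                      # raw audio -> speech segments
        self.denoiser = denoiser                        # noise filtering per segment
        self.feature_extractors = feature_extractors    # e.g. prosodic and spectral extractors
        self.classifiers = classifiers                  # trained detectors, one per feature type
        self.normaliser = normaliser                    # e.g. t-normalisation of the scores
        self.fuser = fuser                              # score-level fusion (e.g. logistic regression)

    def process(self, audio_chunk):
        results = []
        for segment in self.segmenter(audio_chunk):
            clean = self.denoiser(segment)
            scores = [clf(extract(clean))
                      for extract, clf in zip(self.feature_extractors, self.classifiers)]
            results.append(self.fuser(self.normaliser(scores)))
        return results                                  # one fused stress score per segment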
6 Conclusions
In this work we propose an approach for automatic detection of speech coloured
by negative emotions. Such a system is relevant in several military scenarios, when
critical situations need to be detected. Since we trained on a corpus of neutral and
emotionally coloured speech from a call centre, i.e. real-life data, we can regard our
results as more reliable and more representative of what can be expected in a real
application, compared to systems using acted data.
We have extracted prosodic and spectral features and used them for training
svm- and gmm-based detectors. All gmm-based approaches, namely gmm, ubm-gmm-svm
and Dot Scoring, lead to equal error rates close to 20%, while svm performs
better, showing an equal error rate of 15.5%. Fusing a gmm-based detector with
an svm-based one leads to a drop in equal error rate to 11% on average, based on a
linear equal-weight combination. The average becomes 10.3% when the weights are
calculated using logistic regression. We can conclude that gmm-based detectors with
spectral features and svm-based detectors with prosodic features complement each
other very well. The highest improvement is obtained when all classifiers are fused
by FoCal, and the eer drops to 4.2%.
7 Figures
Figure 1  Spectrogram and pitch for neutral (left) and emotional (right) speech
Figure 2  Classification techniques: (a) Gaussian mixtures, (b) SVM
Figure 3  Equal error rates for gmm with different numbers of mixtures: (a) plp features, (b) prosodic features
Figure 4  det curves for the fusion of Dot Scoring, svm, gmm and ubm-gmm-svm by logistic regression
Figure 5  Design of a real time stress detector
Figure 6  Examples of application scenarios: (a) Pilot under fire, (b) Communication with the base
Figure 7  GUI of a real-time emotion recogniser (Lefter10)
References
[Auckenthaler04] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas.
Score
Normalization for Text-Independent Speaker Verification Systems. Digital
Signal Processing, pages 42–54, 2000.
[Batliner06] A. Batliner, S. Steidl, B. Schuller, D. Seppi, K. Laskowski, T. Vogt,
L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson.
Combining Efforts for Improving Automatic Classification of Emotional
User States. In Proceedings of IS-LTC, pages 240–245, 2006.
[Boersma01] P. Boersma. Praat, a System for Doing Phonetics by Computer. Glot
International, 5(9/10):341–345, 2001.
[Brummer06] N. Brümmer and J. du Preez. Application-independent Evaluation of
Speaker Detection. Computer Speech and Language, pages 230–275, 2006.
[Brummer09] N. Brümmer. Discriminative Acoustic Language Recognition via Channel-Compensated GMM Statistics. In Proceedings of Interspeech, pages 2187–
2190. ISCA, 2009.
[Campbell06b] W. Campbell, D. Sturim, and D. Reynolds. Support Vector Machines
Using GMM Supervectors for Speaker Verification.
IEEE Signal
Processing Letters, 13(5):308–311, 2006.
[Clavel08] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette. Fear-type Emotion Recognition for Future Audio-based Surveillance Systems.
Speech Communication, pages 487–503, 2008.
[Datcu06] D. Datcu and L. J. M. Rothkrantz. The Recognition of Emotions from
Speech using GentleBoost Classifier. In CompSysTech’06, 2006.
[Davis90] S. B. Davis and P. Mermelstein.
Comparison of Parametric
Representations for Monosyllabic Word Recognition in Continuously
Spoken Sentences. Readings in Speech Recognition, pages 65–74, 1990.
[Devillers07] L. Devillers and L. Vidrascu. Real-Life Emotion Recognition in Speech.
Speaker Classification II: Selected Projects, pages 34–42, 2007.
[DouglasCowie03] E. Douglas-Cowie, N Campbell, R. Cowie, and P. Roach. Emotional
Speech: Towards a New Generation of Databases. Speech Communication,
pages 33–60, 2003.
[Fernandez03] R. Fernandez and R. W. Picard. Modeling Drivers’ Speech Under Stress.
Speech Communication, 40(1-2):145–159, 2003.
[Fitrianie08] S. Fitrianie and L. J. M. Rothkrantz. An Automated Online Crisis
Dispatcher. International Journal of Emergency Management, 5(1/2):123–
144, June 2008.
[Hansen00] J. H. L. Hansen, C. Swail, A. J. South, R. K. Moore, H. Steeneken,
D. A. van Leeuwen, E. J. Cupples, T. Anderson, C. R. A. Vloeberghs,
I. Trancoso, and P. Verlinde. The Impact of Speech Under Stress on
Military Speech Technology. In NATO RTO-TR-10, AC/323(IST)TP/5
IST/TG-01, 2000.
[Hansen07] J. H. Hansen and S. Patil. Speech Under Stress: Analysis, Modeling
and Recognition. Speaker Classification I: Fundamentals, Features, and
Methods, Lecture Notes in Artificial Intelligence, pages 108–137, 2007.
[Hansen97] J. H. Hansen and S. E. Ghazale. Getting started with SUSAS. In
Proceedings of Eurospeech’97, pages 1743–1746, 1997.
[Hengel07] P.W.J. van Hengel and T.C. Andringa. Verbal Aggression Detection in
Complex Social Environments. IEEE Conference on Advanced Video and
Signal Based Surveillance, pages 15–20, 2007.
[Hermansky90] H. Hermansky. Perceptual Linear Predictive (PLP) Analysis for Speech.
Journal of the Acoustical Society of America, pages 1738–1752, 1990.
[Hermansky92] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. RASTA-PLP speech
analysis technique. In IEEE International Conference on Acoustics,
Speech, and Signal Processing, pages 121–124, 1992.
[Juslin05] P.N. Juslin and K.R. Scherer. In J. Harrigan, R. Rosenthal, and K.
Scherer, (Eds.) - The New Handbook of Methods in Nonverbal Behavior
Research, chapter Vocal expression of affect, pages 65–135. Oxford
University Press, 2005.
[Krajewski07] J. Krajewski and B. Kroger. Using Prosodic and Spectral Characteristics
for Sleepiness Detection. In Interspeech, pages 94–98, 2007.
[Lefter10] I. Lefter, P. Wiggers, and L. J. M. Rothkrantz. EmoReSp - An Online
Emotion Recognizer Based on Speech. In CompSysTech’10, 2010.
[Legros08] C. Legros, R. Ruiz, and P. Plantin De Hugues. Analysing Cockpit and
Laboratory Recordings to Determine Fatigue Levels in Pilots Voices. In
Journal of the Acoustical Society of America, pages 3070–3070, 2008.
[Lombard11] E. Lombard. Le Signe de l'Élévation de la Voix. In Ann. Maladies Oreille,
Larynx, Nez, Pharynx, pages 101–119, 1911.
[Martin97] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki.
The Det Curve In Assessment Of Detection Task Performance. In
Proceedings Eurospeech ’97, pages 1895–1898, 1997.
[Mbitiru08] N. Mbitiru, P. Tay, J. Z. Zhang, and R. D. Adams. Analysis of Stress
in Speech Using Empirical Mode Decomposition. In Proceedings of The
2008 IAJC-IJME International Conference, 2008.
[McNahon08] E. McNahon, R. Cowie, J. Wagner, and E. Andre. Multimodal Records
of Driving Influenced by Induced Emotion. In International Conference
on Language Resources and Evaluation, 2008.
[Morrison05] D. Morrison, R. Wang, L. C. De Silva, and W.L. Xu. Real-time Spoken
Affect Classification and its Application in Call-Centres. In Proceedings
of the Third International Conference on Information Technology and
Applications (ICITA’05), 2005.
[Pantic03] M. Pantic and L.J.M. Rothkrantz.
Towards an Affect-sensitive
Multimodal Human-Computer Interaction. Proceedings of the IEEE,
pages 1370–1390, 2003.
[Petrushin99] V. A. Petrushin. Emotion in Speech: Recognition and Application to Call
Centers. In Artificial Neural Networks in Engineering, pages 7–10, 1999.
[Reynolds00] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker Verification
Using Adapted Gaussian Mixture Models. Digital Signal Processing, pages
19–41, 2000.
[Rothkrantz04] L.J.M. Rothkrantz, P. Wiggers, J.W.A. van Wees, and R.J. van Vark.
Voice Stress Analysis. Text, Speech and Dialogues, Lecture Notes in
Artificial Intelligence, pages 449–456, 2004.
[Scherer03b] K. R. Scherer, T. Johnstone, and G. Klasmeyer. Vocal Expression of
Emotion. R. J. Davidson, H. Goldsmith, K. R. Scherer (Eds.) - Handbook
of the Affective Sciences. Oxford University Press, 2003.
[Schuller05a] B. Schuller, R. J. Villar, G. Rigoll, and M. Lang. Meta-Classifiers
in Acoustic and Linguistic Feature Fusion-Based Affect Recognition.
In IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP ’05), pages 325–328, 18-23, 2005.
[Schuller05b] B. Schuller, S. Reiter, R. Muller, M. Al-Hames, M. Lang, and
G. Rigoll. Speaker Independent Speech Emotion Recognition by Ensemble
Classification. In IEEE International Conference on Multimedia and Expo
(ICME), pages 864–867, 2005.
[Schuller07] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner,
L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson. The
Relevance of Feature Type for the Automatic Classification of Emotional
User States: Low Level Descriptors and Functionals. In Proceedings of
Interspeech, pages 2253–2256, 2007.
[Steeneken99] H.J.M. Steeneken and J.H.L. Hansen. Speech Under Stress Conditions:
Overview of the Effect on Speech Production and on System Performance.
IEEE International Conference on Acoustics, Speech, and Signal
Processing, pages 2079–2082, 1999.
[Steidl08] S. Steidl, A. Batliner, E. Nöth, and J. Hornegger. Quantification of
Segmentation and F0 Errors and Their Effect on Emotion Recognition. In
TSD ’08: Proceedings of the 11th International Conference on Text, Speech
and Dialogue, pages 525–534, Berlin, Heidelberg, 2008. Springer-Verlag.
[Tolkmitt86] F. J. Tolkmitt and K. R. Scherer. Effect of Experimentally Induced Stress
on Vocal Parameters. Journal of Experimental Psychology and Human
Perceptual Performance, 12(3):302–313, 1986.
[Truong07] K. P. Truong and D. A. van Leeuwen. Automatic Discrimination Between
Laughter and Speech. Speech Communication, pages 144–158, 2007.
[Truong08] K. P. Truong and S. Raaijmakers. Automatic Recognition of Spontaneous
Emotions in Speech Using Acoustic and Lexical Features. In Proceedings
of the 5th international workshop on Machine Learning for Multimodal
Interaction (MLMI), pages 161–172, 2008.
[Valenzise07] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti.
Scream and Gunshot Detection and Localization for Audio-surveillance
Systems.
IEEE Conference on Advanced Video and Signal Based
Surveillance, pages 21–26, 2007.
[Ververidis06] D. Ververidis and C. Kotropoulos.
Emotional Speech Recognition:
Resources, Features, and Methods. Speech Communication, pages 1162–
1181, 2006.
[Ververidis08] D. Ververidis, I. Kotsia, C. Kotropoulos, and I. Pitas. Multi-modal
Emotion-related Data Collection within a Virtual Earthquake Emulator.
In International Conference on Language Resources and Evaluation, 2008.
[Vlasenko07a] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll. Combining
Frame and Turn-Level Information for Robust Recognition of Emotions
within Speech. In Proceedings of Interspeech, 2007.
[Vogt05] T. Vogt and E. Andre.
Comparing Feature Sets for Acted and
Spontaneous Speech in View of Automatic Emotion Recognition. In IEEE
International Conference on Multimedia and Expo, pages 474–477, 2005.
[Weinstein91] C.J. Weinstein. Opportunities for Advanced Speech Processing in Military
Computer-based Systems. Proceedings of the workshop on Speech and
Natural Language, 79(11):1626–1641, 1991.
[Zeng07] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang. A Survey of Affect
Recognition Methods: Audio, Visual and Spontaneous Expressions. In
ICMI ’07: Proceedings of the 9th International Conference on Multimodal
Interfaces, pages 126–133, 2007.
[Zhou01] G. Zhou, J.H.L. Hansen, and J.F. Kaiser. Nonlinear Feature Based
Classification of Speech Under Stress. IEEE Transactions on Speech and
Audio Processing, 9(3):201–216, 2001.
[vanLeeuwen07] D. A. van Leeuwen and N. Brümmer. An introduction to application-independent evaluation of speaker recognition systems.
Speaker
Classification I: Fundamentals, Features, and Methods, Lecture Notes in
Artificial Intelligence, pages 330–353, 2007.