Int. J. of Intelligent Defence Support Systems, Vol. x, No. x, xxxx

Automatic Stress Detection in Emergency (Telephone) Calls

Iulia Lefter (1,2,3), Leon J. M. Rothkrantz (1,2), David A. van Leeuwen (3) and Pascal Wiggers (1)

(1) Delft University of Technology, Delft, The Netherlands
(2) The Netherlands Defence Academy
(3) TNO Defence, Security and Safety, The Netherlands

Abstract: The abundance of calls to emergency lines during crises is difficult to handle by the limited number of operators. Detecting whether the caller is experiencing extreme emotions can be a solution for distinguishing the more urgent calls. Apart from this, several other applications can benefit from awareness of the emotional state of the speaker. This paper describes the design of a system for selecting the calls that appear to be urgent, based on emotion detection. The system is trained using a database of spontaneous emotional speech from a call-centre. Four machine learning techniques are applied, based on either prosodic or spectral features, resulting in individual detectors. As a last stage, we investigate the effect of fusing these detectors into a single detection system. We observe an improvement in the Equal Error Rate (eer) from 19.0 % on average for the 4 individual detectors to 4.2 % when fused using linear logistic regression. All experiments are performed in a speaker-independent cross-validation framework.

Keywords: stress detection, emotional speech detection, crisis management, speech analysis, machine learning

Biographical notes: Iulia Lefter received her MSc in Media and Knowledge Engineering from Delft University of Technology. She is now a PhD student at the same university, in a project involving the Netherlands Defence Academy and TNO. Her interests are multi-modal information fusion, emotion recognition in speech, behaviour analysis and autonomous surveillance systems.
Leon Rothkrantz is Professor of Computer Science at the Netherlands Defence Academy and at Delft University of Technology. He holds MSc degrees in both Mathematics and Psychology, and received his PhD in Mathematics at the University of Amsterdam. His research interests are multimodal communication, artificial intelligence, signal processing and sensor technology.

Pascal Wiggers is an assistant professor in the man-machine interaction group of Delft University of Technology. His research focuses on automatic speech recognition, multi-modal interaction, intuitive and collaborative interfaces, emotion recognition and probabilistic methods for machine learning such as dynamic Bayesian networks.

David van Leeuwen is full Professor at Radboud University Nijmegen and Senior Researcher at TNO Human Factors. He received his MSc in Physics at Delft University of Technology and his PhD at the University of Leiden. He has been active in the field of large vocabulary continuous speech recognition (evaluation, development of a Dutch system), word spotting, and speaker and language recognition. He has organised several benchmark evaluations (LVCSR Fr/Ge/BrEng: EU SQALE in 1995, Forensic Speaker Recognition NFI-TNO in 2003, LVCSR Dutch: N-Best in 2008). He has participated in NIST SRE, LRE and RT evaluations since 2003. He has been a representative in several NATO IST research task groups on speech technology since 2002, and an ISCA member since 1995.

Copyright © 2010 Inderscience Enterprises Ltd.

1 Introduction

During critical phases of military operations the stress level may rise considerably. A high level of stress has a negative impact on performance, and is an indicator that people are physically or mentally overloaded or in some other critical situation. Stress and negative emotions in general have a strong influence on voice characteristics.
Because speech is a natural means of communication, we can utilise the paralinguistic features of speech to detect stress and (negative) emotions in a non-intrusive way by monitoring the communication. The encoding of physical stress in the voice is highly influenced by the increase in respiration rate, which increases the subglottal pressure during phonation. The stretches of speech between breaths become shorter and the articulation rate is affected. The muscle activity of the larynx and vocal folds is changed by stress, which modifies the air velocity through the glottis and the sound frequency (if the vocal folds are under higher tension, the frequency increases). The activity of other muscles, like those of the tongue, jaw and lips, which shape the resonant cavities, is also affected by stress and results in changes in speech production (Hansen07). Stress can appear at different levels, from physical stressors like g-forces or vibration, unconscious physiological stressors like fatigue, illness or sleep deprivation, and conscious physiological stressors like noise or a poor communication channel, to psychological stressors like emotion, high workload, confusion or frustration over contradictory information (Hansen00), (Hansen07).

The research described in this paper focuses on automatic detection of speech showing negative emotions. The proposed approach is based on emotion detection from the sound signal, and does not use any linguistic information. The purpose is to obtain a language-independent emotion detection system. As opposed to most contributions in the field, which use corpora of emotions portrayed by actors, we use a database that contains recordings of genuine emotional speech from a call-centre. It contains neutral utterances and utterances with negative emotions like anger, frustration and stress. The use of a database with real emotions is very valuable to our research.
Real-life data implies many blended emotions and different intensities, which can make the problem more complicated. Furthermore, since the telephone acts as a filter and much of the higher frequency band of the speech signal is reduced, detecting the emotion becomes more difficult, both for humans and for machines. However, this is exactly the kind of data that we expect in a real application. Despite the challenges of using a database with spontaneous emotions, we were able to separate neutral from emotional speech with high accuracy. For this we extracted a set of prosodic features as well as spectral features. Four classification methods employing these features were tested and compared. Each of them was outperformed by the final approach, in which we fused all the individual classifiers into one detector.

The paper is organised as follows. First, related work is presented, with emphasis on each step in the development of an emotion recognition system, namely the available databases, the emotion characteristic features that are relevant for emotion recognition, and the machine learning techniques used. Furthermore, a number of studies using speech for extracting information about abnormal situations and cases of aggression and stress detection are reviewed. Our approach for emotion detection is presented in the third section. Here the database used in this work, the features and the classification techniques are described. This is followed by the analysis of our results and a discussion. Possible applications in the military domain are outlined in the penultimate section, followed by our conclusions.

2 Related work

In this section we illustrate what the components of emotion recognition systems are and what kind of choices can be made at each step. After introducing, in turn, the state of the art in databases, features and classification techniques, we briefly review a number of papers with applications in the military domain.
2.1 Databases of emotional speech

Emotion recognition from speech is typically based on using machine learning techniques for training models on recorded data (Pantic03), (Ververidis06), (Zeng07). Therefore, the availability of labelled emotional speech databases is a prerequisite for the design of such systems. The effectiveness as well as the perceived results of emotion recognition systems are, to a high extent, dependent on the databases used for training and evaluation. It has been suggested that the focus should shift from acted databases to ones that contain spontaneous emotions (DouglasCowie03).

Several approaches have been used to elicit spontaneous emotions in the collection of speech databases. Human-machine interaction was used with the help of Wizard of Oz (WoZ) scenarios (Steidl08) or computer-based dialogue systems (Zeng07). Recent scenarios for the recording of spontaneous databases are driving simulators (McNahon08), first-person shooter computer games (Truong08) or earthquake emulators (Ververidis08).

Three databases with natural and simulated recordings of stress were recorded for the nato project "Speech under Stress Conditions" (Hansen00). The first one, susc-0/1 (Speech under Stress Conditions), includes recordings of speech during genuine stress from a crashing aircraft and an engine-out situation. Besides these psychological stress situations, recordings of the voice under physical stress have also been made. The second database, susas (Speech Under Simulated and Actual Stress) (Hansen97), contains a large variety of stressors and speaking styles. Computer tracking tasks were used for creating stressors like time pressure and workload; roller coaster rides and helicopter commands were used for provoking physical and psychological stress, also eliciting the Lombard effect (Lombard11).
The third database, dlp, is obtained from license plate reading tasks under psychological stress from time pressure.

2.2 Emotional features in speech

A large set of acoustic features has so far been found to correlate with emotions. In general researchers use different subsets, and try to find the most suitable combinations. The acoustic features can be categorised into prosodic, spectral, and voice quality features.

Prosody studies the rhythm, stress and intonation of speech and works on larger segments of speech, like syllables, words, phonemes or entire turns of a speaker. Pitch, loudness, speaking rate, duration, pause and rhythm are all perceived characteristics of prosody. Even though there are no unique acoustic correspondents of these characteristics, there are strong correlations between them. Prosodic features are the most popular in speech emotion recognition (Steidl08), (Batliner06), (Truong07), (Krajewski07).

Besides the fundamental frequency, which correlates with pitch, the speech signal contains other spectral characteristics that are influenced by emotional arousal. Examples are harmonics and formants. Mel Frequency Cepstral Coefficients (mfcc) (Davis90), which are generally used in speech recognition, are spectral features also successfully used for emotion detection. They can be augmented by Linear Predictive Cepstral Coefficients (lpcc), Mel Filter Bank (mfb) features (Steidl08) or Perceptual Linear Prediction (plp) features (Truong07).

Studies on voice quality report a strong correlation between voice quality features and emotions. Examples of voice qualities are neutral, whispery, breathy, creaky, harsh or falsetto voice. Jitter measures the cycle-to-cycle variation of period length, while shimmer measures the cycle-to-cycle variation of peak or average amplitude. The harmonics-to-noise ratio (hnr), which measures the degree of periodicity in a sound, is another voice quality feature.
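Jitter and shimmer as defined above can be computed directly from the sequences of glottal cycle lengths and peak amplitudes. A minimal numpy sketch of the local variants (the function names and the synthetic example are ours; toolkits such as Praat offer several refinements of these measures):

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive glottal period
    lengths, normalised by the mean period (cycle-to-cycle variation)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """The same cycle-to-cycle measure applied to peak amplitudes."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Synthetic example: periods (in seconds) of five consecutive glottal cycles.
cycle_periods = [0.0100, 0.0102, 0.0098, 0.0101, 0.0099]
jitter_value = local_jitter(cycle_periods)
```

A perfectly periodic voice yields zero jitter and shimmer; rough or tense phonation raises both.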
Applications involving these features can be found in (Krajewski07) and (Vlasenko07a).

In general, two units of analysis are widely used: the utterance level and the frame level. As a general approach for extracting utterance-level features, different statistical functions are applied over the overall contour of a feature, for example the fundamental frequency. The idea is that much information about emotion can be extracted from the overall trend of a certain parameter. Spectral features, however, give information at the frame level (∼10 ms), which can be modelled much more accurately because of the relatively large number of frames in an utterance. In (Datcu06) the efficiency of using distinct numbers of frames per speech utterance is investigated.

A comprehensive set of acoustic cues, their perceived correlates, definitions and acoustic measurements in vocal affect expression is provided in (Juslin05). The same work also provides guidelines for choosing a minimum set of features that is bound to provide emotion discrimination capabilities. An extensive study by Scherer, Johnstone and Klasmeyer (Scherer03b) presents an overview of empirically identified major effects of emotion on vocal expression. Studies on the relevance of acoustic features for stress can be found in (Rothkrantz04), (Tolkmitt86). In (Zhou01) the benefits of using features derived from the Teager energy operator for the classification of speech under stress are presented.

There is a trend to generate an increasingly large set of features and statistics of these features, and to use a selection algorithm to determine the most appropriate ones. Examples of such approaches can be found in (Schuller05a), (Batliner06), (Schuller07).
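The utterance-level functionals mentioned above (statistical functions applied over the frame-level contour of a feature such as F0) can be sketched as follows. A minimal numpy example, with our own function name and the common convention of discarding unvoiced frames (F0 = 0):

```python
import numpy as np

def contour_functionals(f0, frame_step=0.01):
    """Utterance-level statistics over a frame-level F0 contour (Hz),
    sampled every frame_step seconds. Unvoiced frames (f0 == 0) are
    discarded before the functionals are computed."""
    f0 = np.asarray(f0, dtype=float)
    times = np.arange(len(f0)) * frame_step
    voiced = f0 > 0
    f0v, tv = f0[voiced], times[voiced]
    slope, _intercept = np.polyfit(tv, f0v, 1)   # least-squares line fit
    return {
        "mean": f0v.mean(),
        "std": f0v.std(),
        "range": f0v.max() - f0v.min(),
        "abs_slope": abs(slope),                  # Hz per second
    }

# A rising toy contour with unvoiced frames at both ends.
feats = contour_functionals([0.0, 100.0, 110.0, 120.0, 0.0])
```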
Even though there is an improvement in the results achieved using the reduced feature set, this approach generates models which are very well adapted to the specific database, while the capability for generalisation is decreased. Strong evidence for this is given in (Vogt05), where a large number of features was extracted for acted and spontaneous databases and the most relevant ones were selected using correlation-based feature selection. The resulting feature subsets are different for acted and for spontaneous speech. It appears that emotions in acted speech can be well recognised using pitch values, while for the spontaneous case, mfccs perform better.

2.3 Classification techniques

After the feature sets are obtained, different kinds of classifiers are used for training and testing. Experiments using linear classifiers (Krajewski07), Gaussian mixture models (gmm) (Truong07), neural networks (nn) (Petrushin99), (Krajewski07), hidden Markov models (hmm) (Vlasenko07a), Bayesian networks (bn) (Schuller05a), random forests, support vector machines (svm) (Truong07) and other machine learning techniques are reported. In general it is difficult to compare results across experiments because different conditions apply: different databases, units of analysis and feature sets, and some report speaker-dependent, some speaker-independent results. There are also studies investigating the capabilities of such classifiers. It appears that support vector machines are among the most successful. Recent approaches combine the results of different weak classifiers in ensembles; their combination is a stronger classifier, which in almost all cases gives better results than each individual classifier (Schuller05b).

2.4 State of the art with military applications

In (Devillers07) a corpus with recordings from a medical emergency call centre is used. Relief, anger, fear and sadness are recognised by means of linguistic
content and paralinguistic cues. The best results are obtained for fear in the case of paralinguistic cues, and for relief in the case of linguistic content. Another approach (Morrison05) focuses on distinguishing angry speech from neutral speech. The purpose is to deploy the real-time classification system in call centres. Emotion recognition for the purpose of crisis management, in the context of an automatic online dispatcher, is reported in (Fitrianie08). The system is able to gain information about the crisis from the user's input, and also to recognise the affective state by means of linguistic content, using a database with valence and arousal scores for specific keywords.

An interesting contribution that uses emotion recognition techniques for the purpose of audio-based surveillance systems is presented in (Clavel08). Here the focus is on detecting one type of emotion, namely fear. The research is based on a database with segments from movies that show no emotion or different levels of fear. Besides examples aiming at detecting different emotions, similar systems can be developed for use in surveillance systems. In (Valenzise07), two gmm classifiers are used for detecting screams from noise and gunshots from noise. Furthermore, the system is able to perform acoustic source localisation using a microphone array and the sounds' times of arrival, using a linear correction least squares localisation algorithm. A verbal aggression detection system is described in (Hengel07). What is particular about their approach is that the system does not use classification methods, but is rather a knowledge-based system. The detection methodology is inspired by the way in which sounds are detected by the human auditory system.

The detection of stress from speech can also be used for the purpose of deception detection. An example, based on finding the physiological microtremor known to appear due to psychological stress, is (Mbitiru08).
Several studies analyse the speech of drivers and pilots as special categories. Stress detection in drivers' speech is described in (Fernandez03). Speech can also be used for acquiring information on pilots' fatigue, from speech recorded in cockpit and laboratory conditions, which is highly affected by noise (Legros08).

Although, as shown in this section, many contributions are available in the field of emotion recognition from speech, the problem is far from being solved, and more research is needed to increase the recognition rates, to diminish the influence of noise in different environments and to deploy the systems in real time.

3 Methodology

We move on to describing the details of our emotion detection approach. First, the database is introduced, followed by the feature sets used. The section continues with more details on the classifiers used.

3.1 The South-African database

The South-African database contains genuine recordings from call-centres. The data has been labelled according to two classes: emotional and neutral. In Figure 1 the spectrograms and pitch contours of a neutral and an emotional sample are depicted.

Table 1: The final feature set used in our experiments

Feature type                  Functionals
Pitch                         Mean, standard deviation, range, absolute slope (without octave jumps), jitter
Intensity                     Mean, standard deviation, range, absolute slope, shimmer
Formants                      Mean F1, mean F2, mean F3, mean F4
Long term averaged spectrum   Slope, Hammarberg index, high energy
Spectrum                      Centre of gravity, skewness

There is a total of 3000 utterances, of which 1000 have been labelled as emotional. In most cases the emotional utterances show anger, dissatisfaction or frustration, while the neutral ones are just emotionally neutral sentences. The recorded utterances are mainly short, with an average length of 4.30 seconds. In total, the data set contains 215.02 minutes of speech.
Based on the call identities, we can estimate that messages from 1200 speakers have been recorded. The data of each class was split into 10 balanced subsets, taking gender independence into account. Based on a 10-fold cross-validation scheme, 8 subsets were used for training, one for validation, and one for testing. The detection was performed gender-independently.

3.2 Feature set

Choosing an optimal feature set is an open issue. The choice of the feature set has a great impact on solving the classification problem. Even though there are features whose behaviour has been proven to change under the influence of emotions, it is important to decide which ones to use and also which statistics comprise the most relevant information. A high number of features and statistics contains a lot of redundancy and makes the process very expensive. On the other hand, too few features might miss important information about emotion. Therefore, a compromise should be made in choosing the right set of features. We utilised two kinds of features: prosody-based features at the utterance level and spectral features at the frame level.

3.2.1 Prosodic features

Based on the research in (Juslin05), (Truong08) and (Vogt05), we used the prosodic features indicated in Table 1. These features are extracted at the utterance level, and are based mainly on pitch and intensity, with some additions in spectral shape. The features have been extracted using Praat (Boersma01). Table 2 shows the ranking of the features based on information gain ratio.

3.2.2 Spectral features

As is customary in almost all speech processing technologies (speech, speaker and language recognition) we used cepstral features for emotion recognition.
Table 2: Information gain (ig) for the prosodic feature set

IG        Feature                          IG        Feature
0.27868   pitch(mean)                      0.08102   jitter
0.15186   ltas(cog)                        0.05889   ltas(slope)
0.13838   pitch(range)                     0.05063   intens(range)
0.12302   intens(mean)                     0.04389   ltas(skew)
0.12232   pitch(std)                       0.04007   intens(std)
0.1133    pitch(slope no octave jumps)     0.02835   Hammarberg
0.10513   intens(absolute slope)           0.02621   F3(mean)
0.10201   pitch(absolute slope)            0.01588   F2(mean)
0.09952   high energy                      0.0134    shimmer
0.09101   F4(mean)                         0.00751   F1(mean)

Specifically, we used Perceptual Linear Prediction coefficients (Hermansky90) with rasta processing (Hermansky92). Twelve plp coefficients were extracted over a window of 32 ms, every 16 ms, and these were augmented with log energy. Derivatives over 5 consecutive frames (deltas) were added, resulting in 26 features.

3.3 Classification techniques

This section describes a series of experiments on the South-African database. The experiments include four classification approaches based on the features mentioned in Section 3.2, as well as two methods for decision-level fusion of multiple classifiers. The purpose is to see how well the classifiers combine, how well the features complement each other, and which method yields the best results. All experiments are implemented using 10-fold speaker-independent cross-validation.

In the case of fusion, double cross-validation was needed, since the scores come from different classifiers and span different ranges (some are log likelihoods, some probabilities). As we want to perform a linear combination of the scores, a first step is to bring all scores into the same range. This can be done using t-normalisation (Auckenthaler04). The data was divided into training, development and test sets for each possibility in the cross-validation framework. For ease of understanding, we name the emotional samples target, and the neutral samples non-target.
The mean and standard deviation of the scores of the non-target development set are used to normalise the scores in the test set.

3.3.1 gmms with plp features

A first classification approach was to model each of the two classes, emotional and neutral, by a mixture of Gaussians, see Figure 2(a). The distributions of frames for emotional and neutral speech were modelled by two gmms. These were trained using EM, in four iterations. Classification of test utterances was based on the difference of the log likelihoods of the test data given the models.

The experiments include using different numbers of Gaussian mixture components, in order to inspect which is most beneficial and what the influence of varying the number of components is on the discrimination result. In the case of plp features the number of features is high: 26 features per frame, and therefore a large number of features for an entire file. In order to obtain a good representation, the data needs to be modelled with an adequate number of components. Experiments with 64, 128 and 256 components were performed. A precomputed Universal Background Model (ubm) was also used as a basis for Maximum A-Posteriori (map) adaptation (Reynolds00). The ubm was trained on English telephone conversations, so the recording conditions of these samples and those of the database we use are similar. The ubm contains 512 components. map adaptation of the ubm is computationally much more efficient than training gmms from scratch.

3.3.2 gmms with prosodic features

The prosodic features were also modelled using a gmm, but because the data is much sparser than in the case of spectral features (only one data point per utterance) the number of Gaussians had to be chosen much smaller: in our experiments 2, 4, and 8.
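The log-likelihood-difference classification of Sections 3.3.1 and 3.3.2 can be illustrated with a stripped-down version in which each class is modelled by a single diagonal-covariance Gaussian rather than a full mixture (the paper used 2–8 components for prosodic and 64–256 for plp features; function names and the synthetic data below are ours):

```python
import numpy as np

def fit_diag_gauss(frames):
    """Fit one diagonal-covariance Gaussian to feature frames (rows),
    a single-component stand-in for the class gmms in the paper."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6   # variance floor for numerical stability
    return mu, var

def log_likelihood(frames, model):
    mu, var = model
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (frames - mu) ** 2 / var)))

def detection_score(test_frames, emo_model, neu_model):
    """Difference of log likelihoods: positive scores favour 'emotional'."""
    return (log_likelihood(test_frames, emo_model)
            - log_likelihood(test_frames, neu_model))

# Synthetic two-class training data: 3-dimensional frames per class.
rng = np.random.default_rng(0)
emo_model = fit_diag_gauss(rng.normal(2.0, 1.0, size=(500, 3)))
neu_model = fit_diag_gauss(rng.normal(-2.0, 1.0, size=(500, 3)))
```

A test utterance is scored by summing the per-frame log likelihoods under each class model and taking the difference, exactly as in the paper, only with more mixture components there.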
3.3.3 The Dot Scoring approach (ds)

Instead of fully computing the likelihood of the test sample using the ubm and the map-adapted gmm, it is possible to linearise the gmm likelihood function (Brummer09). This operation leads to very efficient modelling and scoring, reducing the training and testing process to the evaluation of the inner product (dot product) of the sufficient statistics of the train and test utterances with respect to the ubm. In both speaker and language recognition this method has proved to work with high performance, specifically because channel compensation can be carried out at the level of the sufficient statistics. We have adapted this dot-scoring system for emotion classification as well.

3.3.4 svm with prosodic features

Support Vector Machines (svm) are among the most popular classifiers in speech emotion recognition, due to their high generalisation capability. Given the separation problem of two classes, a support vector machine will determine a hyperplane that can completely separate these two classes, as shown in Figure 2(b). The idea is to find a hyperplane that maximises the margin between the two data sets; the samples that lie on the margin are called support vectors. The features mentioned in Table 1 are used for this experiment, along with an svm with an rbf kernel. The experiment is designed in the speaker-independent cross-validation framework.

3.3.5 The ubm-gmm-svm approach (ugs)

Recently in speaker recognition it has been shown that the map-adapted means of the ubm can be used as high-dimensional data points to train an svm (Campbell06b). The utterances of the two classes, emotional and neutral, are used as positive and negative examples in the svm, and classification occurs by map adaptation of the ubm to the test utterance and using the means as the test data point in the svm.
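The map adaptation underlying both Section 3.3.1 and the ubm-gmm-svm approach can be sketched as mean-only map adaptation of a small ubm to an utterance, after which the stacked means form the high-dimensional data point ("supervector") given to the svm. A minimal numpy sketch under our own naming; the relevance factor value is illustrative:

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_vars, ubm_weights, frames, relevance=16.0):
    """Mean-only map adaptation (Reynolds-style) of a diagonal-covariance
    ubm to the feature frames of one utterance."""
    C = ubm_means.shape[0]
    # Per-frame log density of each ubm component (diagonal Gaussians).
    logp = np.stack([
        np.log(ubm_weights[c])
        - 0.5 * np.sum(np.log(2 * np.pi * ubm_vars[c])
                       + (frames - ubm_means[c]) ** 2 / ubm_vars[c], axis=1)
        for c in range(C)
    ])
    logp -= logp.max(axis=0)              # stabilise before exponentiating
    post = np.exp(logp)
    post /= post.sum(axis=0)              # responsibilities, shape (C, n_frames)
    n = post.sum(axis=1)                  # zero-order sufficient statistics
    fx = post @ frames                    # first-order sufficient statistics
    alpha = (n / (n + relevance))[:, None]
    xbar = fx / np.maximum(n[:, None], 1e-10)
    return alpha * xbar + (1.0 - alpha) * ubm_means

def supervector(adapted_means):
    """Stack the adapted means into one high-dimensional svm data point."""
    return adapted_means.reshape(-1)
```

Components that see many frames move towards the utterance data; unseen components stay at the ubm means, so the supervector encodes how the utterance deviates from the background model.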
3.3.6 Fusion of classifiers

For a successful fusion, it is important that the approaches being fused handle the problem differently, so that they complement each other. In our case, the approaches previously described can be roughly divided into gmm-based and svm-based approaches from the classification point of view, and into spectral (frame level) and prosodic (utterance level) features from the feature point of view. Given these different approaches, we can expect to benefit from the fusion of gmm-based and svm-based approaches, since the classification procedures are very different, as well as from the fusion of different types of features. Three examples of late fusion are implemented in this work:

• fusion between gmm with plp features and svm with prosodic features,
• fusion between the ubm-gmm-svm approach and svm with prosodic features,
• fusion between the Dot Scoring approach and svm with prosodic features.

As a first step we performed fusion based on a linear combination of the t-normalised scores of our classifiers, using equal weights for the classifiers. Besides this, linear logistic regression fusion was performed. Given that N classifiers are to be combined, and their output scores are s_{1j}, s_{2j}, ..., s_{Nj} for a detection trial j, the fusion creates a score based on the following linear combination:

f_j = a_0 + Σ_{i=1..N} a_i s_{ij}.

The constant a_0 is added to the formula for calibration, and has in itself no discrimination power. The weights a_i can be interpreted as the contributions of each classifier to the final score. This approach is implemented using FoCal (Brummer06), a tool that provides simultaneous fusion and calibration.
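The t-normalisation and equal-weight score fusion described above amount to only a few lines; a minimal numpy sketch (function names are ours):

```python
import numpy as np

def t_norm(scores, nontarget_dev_scores):
    """Normalise detector scores by the mean and standard deviation of
    non-target scores from the development set (Section 3.3)."""
    mu = np.mean(nontarget_dev_scores)
    sigma = np.std(nontarget_dev_scores)
    return (np.asarray(scores, dtype=float) - mu) / sigma

def equal_weight_fusion(score_arrays):
    """Linear combination of t-normalised detector scores, equal weights."""
    return np.mean(score_arrays, axis=0)
```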
The fusion is linear, and a specific weight is assigned to each component in such a way as to make the fusion optimal, using C_wlr as minimisation objective:

C_wlr = (P/K) Σ_{j=1..K} log(1 + e^{−f_j − logit P}) + ((1−P)/L) Σ_{j=1..L} log(1 + e^{g_j + logit P}),

where

f_j = a_0 + Σ_{i=1..N} a_i s_{ij},   g_j = a_0 + Σ_{i=1..N} a_i r_{ij},   logit P = log(P / (1−P)),

and where
K = the number of target trials,
L = the number of non-target trials,
N = the number of detectors to be fused,
f_j = the fused target scores,
g_j = the fused non-target scores,
s_{ij} = the N × K matrix of scores for target trials, and
r_{ij} = the N × L matrix of scores for non-target trials.

In our case P = 0.5, and C_wlr is the C_llr introduced in the next section. What makes the results of the logistic regression training more powerful is that the fused scores tend to be well-calibrated detection log-likelihood-ratios.

4 Results and interpretation

The det (Detection Error Tradeoff) curve (Martin97) is a plot of the false alarm probability on the horizontal axis against the miss probability on the vertical axis. It is a variant of the traditional roc (Receiver Operating Characteristic) curve, which presents the false alarm probability on the horizontal axis and the correct detection rate on the vertical axis. In the case of the det curve both error types are treated uniformly and the resulting plots are close to linear. The best performing systems lead to det curves very close to the origin. A value that summarises the det curve is the equal error rate (eer): the value where P_FA = P_miss. det curves and equal error rates are very good measures of the discrimination capability of a recognition system. However, in real applications the threshold has to be set beforehand. The Detection Cost Function is a simultaneous measure of discrimination capability and calibration.
It is based on computing a cost from the actual costs of misses (C_miss = 1) and false alarms (C_fa = 1), as well as the prior probability of a target (P_tar = 0.5). Based on these parameters, the detection cost function is calculated as follows:

C_det(P_miss, P_fa) = C_miss P_miss P_tar + C_fa P_fa (1 − P_tar).

After calculating the detection cost function, the natural approach is to choose a threshold that minimises it. However, choosing such a threshold that can be expected to behave properly on unseen data is a challenging task. A metric called C_llr encapsulates information about the application-independent performance of a recognition system. For a more detailed description please refer to (Brummer06) and (vanLeeuwen07).

Figure 3(a) depicts the equal error rates for gmm classification with different numbers of mixture components and plp features. It is important to note that for 512 Gaussians, the gmm was adapted from a ubm trained on different data. Apart from this result, we can see that the trend is towards smaller equal error rates as the number of Gaussians in the mixture increases. In our case, 256 Gaussians lead to the smallest equal error rate (19.5 %).

Looking at the outputs for gmm with the prosodic features depicted in Figure 3(b), we can see that the errors are minimal for a mixture of 4 Gaussians, and that changing the number of Gaussians in either direction leads to worse performance. The results show that plp features are more appropriate than the prosodic feature set when using gmms.
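The eer and C_det used in this section can be computed from the sets of target and non-target scores as follows (a minimal numpy sketch; the eer here is taken at the observed threshold where the miss and false-alarm rates are closest):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep the decision threshold over all observed
    scores and return the point where miss and false-alarm rates meet."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tar, non]))
    p_miss = np.array([np.mean(tar < t) for t in thresholds])
    p_fa = np.array([np.mean(non >= t) for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])

def c_det(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_tar=0.5):
    """Detection cost function with the costs and target prior of the text."""
    return c_miss * p_miss * p_tar + c_fa * p_fa * (1.0 - p_tar)
```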
Table 3  Results in eers for individual classifiers and classifier fusion schemes with equal weights and weights determined by FoCal

  gmm    ugs    ds     svm    eer (%)
  1      -      -      -      21.2
  -      1      -      -      19.8
  -      -      1      -      19.6
  -      -      -      1      15.5
  1      -      -      1      11.3
  3.03   -      -      5.77    9.9
  -      1      -      1      10.5
  -      5.62   -      5.44   10.1
  -      -      1      1      11.3
  -      -      1.81   5.27   11.0
  3.35   5.92   1.72   5.13    4.2

The experiments continue with gmm on plp features, ubm-gmm-svm, Dot Scoring, and svm on the prosodic feature set. Table 3 shows the results of these individual classifiers and their fusion in terms of equal error rates. Two types of fusion are reported, both performed on the t-normalised scores: with equal weights for all classifiers and with weights determined by logistic regression with FoCal. The first four columns of the table give the weights of the classifiers, while the last column gives the equal error rate for the combination in that row. Among the individual classifiers, the smallest equal error rate is obtained by svm using the prosodic features. ugs and ds have similar performance, while gmm with plp features leads to the highest eer. T-normalising the scores and combining them linearly with equal weights leads to a large improvement in performance in all cases. Interestingly, the fusion between gmm and svm gives the same eer as the fusion between ds and svm, even though individually ds works better than gmm; fusing gmm with svm is therefore relatively more beneficial. The fusion of ugs and svm performs best among the equal-weight fusions. Using logistic regression to compute the weights leads to even better results in all cases. This time the best result is obtained by the combination of gmm and svm, and the worst by ds and svm; however, the differences between the results are relatively small.
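The score pre-processing used above can be sketched as follows (a minimal sketch assuming cohort-based t-normalisation in the spirit of (Auckenthaler04); the exact normalisation procedure is not detailed in this section, so the helper names and cohort construction are our assumptions):

```python
import numpy as np

def t_norm(score, cohort_scores):
    """Normalise a trial score by the mean and standard deviation of the
    scores the same test segment obtains against a cohort of impostor models
    (assumed cohort-based t-norm)."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores)
    return (score - mu) / sigma

def equal_weight_fusion(detector_scores):
    """Equal-weight linear fusion: the mean of the normalised detector scores,
    taken per trial (axis 0 indexes the detectors)."""
    return np.mean(detector_scores, axis=0)
```

After t-norming, scores from different detectors live on comparable scales, which is what makes the naive equal-weight combination effective.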
In the case of fusion by logistic regression it is interesting to examine the coefficients FoCal has assigned to the classifiers. When fusing gmm with svm, svm has a higher weight, and it also performs better individually. This leads to a larger improvement over the equal-weight fusion than in the case of the other two combinations. The fusion between ds and svm is done with a much smaller weight for ds, even though by itself ds performs better than gmm. On the other hand, when fusing ugs with svm, the two classifiers are given almost the same importance. Interestingly, the weight of ubm-gmm-svm is slightly higher, while its performance is slightly worse than that of svm. As a final step, we have fused all the individual classifiers using logistic regression. This was done both on the original scores and on the t-normed scores for gmm and Dot Scoring.

Table 4  Cost values for the fused systems

  Method     Cllr     min Cllr   eer (%)   Cdet     min Cdet
  FoCal      0.2613   0.2505     7.05      0.0695   0.0684
  FoCal TN   0.1624   0.1533     4.23      0.0416   0.0411

The det curves of these fusions are presented in Figure 4. The fusion of all four individual classifiers (gmm, svm, ubm-gmm-svm and Dot Scoring) leads to an equal error rate of 7.1% when FoCal is used on the original scores. When the scores are t-normalised before fusion, the equal error rate drops to 4.2%, a major improvement. ubm-gmm-svm appears to be the most important for the final result, being assigned the highest weight, followed closely by svm; Dot Scoring has the least importance. Two circles are marked in Figure 4; they represent the minimum detection costs. Table 4 gives more insight into the cost associated with making a decision for the system based on the fusion of all individual classifiers, with and without t-norming. For these well-calibrated systems the actual detection costs (Cdet) are very close to the minimum costs (min Cdet).
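The Cllr values reported in Table 4 can be computed from calibrated log-likelihood-ratio scores as in the following sketch (a minimal sketch after (Brummer06); the function name is ours):

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Application-independent log-likelihood-ratio cost, in bits.

    Scores are assumed to be calibrated log-likelihood ratios (natural log).
    A system that always outputs llr = 0 (no information) scores exactly 1.
    """
    tar = np.asarray(tar_llrs, dtype=float)
    non = np.asarray(non_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-tar)))   # cost on target trials
    c_non = np.mean(np.log2(1.0 + np.exp(non)))    # cost on non-target trials
    return 0.5 * (c_tar + c_non)
```

Comparing this value with its minimum over all monotone re-calibrations of the scores (min Cllr) isolates the calibration loss from the discrimination loss.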
The actual cost is very close to the minimum cost, which means that with the chosen threshold we do not lose much performance. We refer to the point on the det plot corresponding to a threshold as an operating point. In Figure 4, boxes are drawn around the operating points based on the 95% confidence intervals of Pfa and Pmiss. In both cases the operating point is very close to the point of minimum cost, which shows that the calibration is good. The situation in Figure 4 for the black curve is optimal, since the minimum cost lies within the confidence area of the operating point. Using information about the application parameters (Cfa, Cmiss and Ptar), we can set a threshold θ defined as

\theta = -\log \frac{C_{miss} P_{tar}}{C_{fa} (1 - P_{tar})},

and expect this to be an optimal threshold in terms of cost. The point we operate at is located on the det curve at the (Pmiss, Pfa) point given by the threshold. As an alternative to computing Cdet for a given cost function, it is possible to evaluate the detection and calibration performance by integrating over all cost functions. The measure that does this is Cllr (cf. Table 4), and the calibration performance can be assessed by comparing it to Cllr with optimal calibration, min Cllr (Brummer06).

5 Applications

Enabling machines to perceive information about the emotions a subject is experiencing has high scientific relevance, and making use of such technology has a strong societal impact. In this section we briefly present a number of possible applications for a speech-based stress detector: support for other speech technologies, automatic surveillance, emergency call centres, and pilots or troops communicating with the base, as depicted in Figure 6 a) and b). We also observe the need for real-time processing in all these applications, propose an architecture, and briefly present our online, almost real-time emotion recognition system that is based in part on the technology discussed in the previous sections. Speech technology has already been employed for a number of problems with security applications, such as speech recognition, speaker identification, language recognition and deception detection; opportunities for further developments are highlighted in (Weinstein91). Speech recognition can be used for detecting certain keywords or even recognising continuous speech, and the result can show hints of aggression or contain important information to be retrieved. In high-performance fighter aircraft and helicopter environments, the pilot could give commands via a speech recogniser and thus concentrate better on flying the airplane in crisis situations. However, speech recognition can be severely degraded by modifications of the pilot's voice due to a high stress level, the Lombard effect and strong noise. Speaker verification and recognition can also be affected by stress or other emotions experienced by the speaker. Information on the emotion a speaker is experiencing can serve as a basis for mitigating the aforementioned shortcomings. The effects of stressed speech on speech and speaker recognition are analysed in (Steeneken99) for three databases with recordings of different types of stress. For both speech and speaker recognition, a training stage using a speech database is needed. As that research shows, the most severe negative effects occur when training is done only with stress-free data and testing with stressed data; the effects are less severe for matched training. In the case of emergency call centres, an operator can probably best assess the urgency of a call. However, it is not always possible to have an operator at any time and any place.
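The threshold-setting rule θ discussed above can be sketched as follows (a minimal sketch; it assumes scores that are calibrated log-likelihood ratios, and the function names are ours):

```python
import numpy as np

def bayes_threshold(c_miss=1.0, c_fa=1.0, p_tar=0.5):
    """Cost-optimal decision threshold for calibrated llr scores."""
    return -np.log((c_miss * p_tar) / (c_fa * (1.0 - p_tar)))

def decide(llr, c_miss=1.0, c_fa=1.0, p_tar=0.5):
    """Accept the target (stress) hypothesis when the llr exceeds theta."""
    return llr >= bayes_threshold(c_miss, c_fa, p_tar)
```

With the default application parameters (Cmiss = Cfa = 1, Ptar = 0.5), the threshold is 0; making false alarms more costly than misses raises it.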
Often during crises, emergency lines are overloaded with calls. A system able to recognise the emotions or stress level of a speaker can therefore be used together with an answering machine, so that urgent calls are selected and answered with high priority by the operator, while less urgent calls are handled by the machine. While speaking with a caller, the operator can be assisted by the stress detector and receive continuous feedback about the stress level of the speaker. A key feature of a prosody-based stress detector is that it does not rely on language-related information. It can thus detect that a call seems urgent even if the speaker uses a language different from that of the training database, in which case the human operator can ask for an interpreter. Besides detecting the stress levels of people calling a call centre, it is also important to be aware of the emotional state of the operator, who during a crisis can be overloaded and might not be able to cope with the calls; in such a case, other operators should be asked to offer support. The emergency call centre is only one application of the stress detection system. The communication between pilots or troops and the base has similar features. In dangerous situations, when troops are under fire, the communication is expected to be full of extreme emotions, and emergency alerts should be sent to the base. A number of different sensors, e.g. for sweat and heart-rate measurements, could be employed; nevertheless, a voice-based stress detector is the least intrusive option and is expected to have a high level of acceptance. Automatic surveillance is another area of application for detectors of stress and negative emotions. Scenes dominated by aggression are most likely accompanied by panic, screams, and negative emotions such as anger and fear.
A sound-based surveillance system can detect these scenarios on its own or in combination with video surveillance. All these applications require real-time processing. An architecture for an online stress detection system is depicted in Figure 5. After speech acquisition, a preprocessing stage is necessary: speech needs to be segmented into smaller units to be analysed, and noise filtering should be performed before feature extraction. Different types of features can be extracted, such as spectral or prosodic features. These are normalised and classified using the already trained models. A classifier fusion stage is included in order to increase the accuracy, as supported by our research. We have already developed an online, almost real-time system based on this architecture, using only prosodic features and an svm classifier (Lefter10). It can currently detect anger, happiness and sadness, and also has a separate anger detection function, as depicted in Figure 7.

6 Conclusions

In this work we propose an approach for the automatic detection of speech coloured by negative emotions. Such a system is relevant in several military scenarios in which critical situations need to be detected. Since we trained on a corpus with neutral and emotionally coloured speech from a call centre, i.e. real-life data, we can regard our results as more reliable and more representative of what can be expected in a real application than those of systems using acted data. We have extracted prosodic and spectral features and used them for training svm- and gmm-based detectors. All gmm-based approaches, namely gmm, ubm-gmm-svm and Dot Scoring, lead to equal error rates close to 20%; svm performs better, with an equal error rate of 15.5%. Fusing a gmm-based detector with an svm-based one reduces the equal error rate to 11% on average, using a linear equal-weight combination.
The average becomes 10.3% when the weights are calculated using logistic regression. We can conclude that gmm-based detectors with spectral features and svm-based detectors with prosodic features complement each other very well. The largest improvement is obtained when all classifiers are fused by FoCal, with the eer dropping to 4.2%.

7 Figures

Figure 1  Spectrogram and pitch for neutral (left) and emotional (right) speech
Figure 2  Classification techniques: (a) Gaussian mixtures, (b) svm
Figure 3  Equal error rates for gmm with different numbers of mixtures: (a) plp features, (b) prosodic features
Figure 4  det curves for the fusion of Dot Scoring, svm, gmm and ubm-gmm-svm by logistic regression
Figure 5  Design of a real-time stress detector
Figure 6  Examples of application scenarios: (a) pilot under fire, (b) communication with the base
Figure 7  GUI of a real-time emotion recogniser (Lefter10)

References

[Auckenthaler04] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas. Score Normalization for Text-Independent Speaker Verification Systems. Digital Signal Processing, pages 42–54, 2000.
[Batliner06] A. Batliner, S. Steidl, B. Schuller, D. Seppi, K. Laskowski, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson. Combining Efforts for Improving Automatic Classification of Emotional User States. In Proceedings of IS-LTC, pages 240–245, 2006.
[Boersma01] P. Boersma. Praat, a System for Doing Phonetics by Computer. Glot International, 5(9/10):341–345, 2001.
[Brummer06] N. Brümmer and J. du Preez. Application-independent Evaluation of Speaker Detection. Computer Speech and Language, pages 230–275, 2006.
[Brummer09] N. Brümmer. Discriminative Acoustic Language Recognition via Channel-Compensated GMM Statistics. In Proceedings of Interspeech, pages 2187–2190. ISCA, 2009.
[Campbell06b] W. Campbell, D. Sturim, and D. Reynolds. Support Vector Machines Using GMM Supervectors for Speaker Verification. IEEE Signal Processing Letters, 13(5):308–311, 2006.
[Clavel08] C. Clavel, I. Vasilescu, L. Devillers, G. Richard, and T. Ehrette. Fear-type Emotion Recognition for Future Audio-based Surveillance Systems. Speech Communication, pages 487–503, 2008.
[Datcu06] D. Datcu and L. J. M. Rothkrantz. The Recognition of Emotions from Speech using GentleBoost Classifier. In CompSysTech'06, 2006.
[Davis90] S. B. Davis and P. Mermelstein. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. Readings in Speech Recognition, pages 65–74, 1990.
[Devillers07] L. Devillers and L. Vidrascu. Real-Life Emotion Recognition in Speech. Speaker Classification II: Selected Projects, pages 34–42, 2007.
[DouglasCowie03] E. Douglas-Cowie, N. Campbell, R. Cowie, and P. Roach. Emotional Speech: Towards a New Generation of Databases. Speech Communication, pages 33–60, 2003.
[Fernandez03] R. Fernandez and R. W. Picard. Modeling Drivers' Speech Under Stress. Speech Communication, 40(1-2):145–159, 2003.
[Fitrianie08] S. Fitrianie and L. J. M. Rothkrantz. An Automated Online Crisis Dispatcher. International Journal of Emergency Management, 5(1/2):123–144, June 2008.
[Hansen00] J. H. L. Hansen, C. Swail, A. J. South, R. K. Moore, H. Steeneken, D. A. van Leeuwen, E. J. Cupples, T. Anderson, C. R. A. Vloeberghs, I. Trancoso, and P. Verlinde. The Impact of Speech Under Stress on Military Speech Technology. In NATO RTO-TR-10, AC/323(IST)TP/5 IST/TG-01, 2000.
[Hansen07] J. H. Hansen and S. Patil. Speech Under Stress: Analysis, Modeling and Recognition. Speaker Classification I: Fundamentals, Features, and Methods, Lecture Notes in Artificial Intelligence, pages 108–137, 2007.
[Hansen97] J. H. Hansen and S. E. Ghazale. Getting Started with SUSAS. In Proceedings of Eurospeech'97, pages 1743–1746, 1997.
[Hengel07] P. W. J. van Hengel and T. C. Andringa. Verbal Aggression Detection in Complex Social Environments. IEEE Conference on Advanced Video and Signal Based Surveillance, pages 15–20, 2007.
[Hermansky90] H. Hermansky. Perceptual Linear Predictive (PLP) Analysis for Speech. Journal of the Acoustical Society of America, pages 1738–1752, 1990.
[Hermansky92] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. RASTA-PLP Speech Analysis Technique. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 121–124, 1992.
[Juslin05] P. N. Juslin and K. R. Scherer. Vocal Expression of Affect. In J. Harrigan, R. Rosenthal, and K. Scherer (Eds.), The New Handbook of Methods in Nonverbal Behavior Research, pages 65–135. Oxford University Press, 2005.
[Krajewski07] J. Krajewski and B. Kroger. Using Prosodic and Spectral Characteristics for Sleepiness Detection. In Interspeech, pages 94–98, 2007.
[Lefter10] I. Lefter, P. Wiggers, and L. J. M. Rothkrantz. EmoReSp – An Online Emotion Recognizer Based on Speech. In CompSysTech'10, 2010.
[Legros08] C. Legros, R. Ruiz, and P. Plantin De Hugues. Analysing Cockpit and Laboratory Recordings to Determine Fatigue Levels in Pilots Voices. Journal of the Acoustical Society of America, pages 3070–3070, 2008.
[Lombard11] E. Lombard. Le Signe de l'Élévation de la Voix. Ann. Maladies Oreille, Larynx, Nez, Pharynx, pages 101–119, 1911.
[Martin97] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. The DET Curve in Assessment of Detection Task Performance. In Proceedings of Eurospeech'97, pages 1895–1898, 1997.
[Mbitiru08] N. Mbitiru, P. Tay, J. Z. Zhang, and R. D. Adams. Analysis of Stress in Speech Using Empirical Mode Decomposition. In Proceedings of the 2008 IAJC-IJME International Conference, 2008.
[McNahon08] E. McNahon, R. Cowie, J. Wagner, and E. Andre. Multimodal Records of Driving Influenced by Induced Emotion. In International Conference on Language Resources and Evaluation, 2008.
[Morrison05] D. Morrison, R. Wang, L. C. De Silva, and W. L. Xu. Real-time Spoken Affect Classification and its Application in Call-Centres. In Proceedings of the Third International Conference on Information Technology and Applications (ICITA'05), 2005.
[Pantic03] M. Pantic and L. J. M. Rothkrantz. Towards an Affect-sensitive Multimodal Human-Computer Interaction. Proceedings of the IEEE, pages 1370–1390, 2003.
[Petrushin99] V. A. Petrushin. Emotion in Speech: Recognition and Application to Call Centers. In Artificial Neural Networks in Engineering, pages 7–10, 1999.
[Reynolds00] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, pages 19–41, 2000.
[Rothkrantz04] L. J. M. Rothkrantz, P. Wiggers, J. W. A. van Wees, and R. J. van Vark. Voice Stress Analysis. Text, Speech and Dialogues, Lecture Notes in Artificial Intelligence, pages 449–456, 2004.
[Scherer03b] K. R. Scherer, T. Johnstone, and G. Klasmeyer. Vocal Expression of Emotion. In R. J. Davidson, H. Goldsmith, and K. R. Scherer (Eds.), Handbook of the Affective Sciences. Oxford University Press, 2003.
[Schuller05a] B. Schuller, R. J. Villar, G. Rigoll, and M. Lang. Meta-Classifiers in Acoustic and Linguistic Feature Fusion-Based Affect Recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), pages 325–328, 2005.
[Schuller05b] B. Schuller, S. Reiter, R. Muller, M. Al-Hames, M. Lang, and G. Rigoll. Speaker Independent Speech Emotion Recognition by Ensemble Classification. In IEEE International Conference on Multimedia and Expo (ICME), pages 864–867, 2005.
[Schuller07] B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson. The Relevance of Feature Type for the Automatic Classification of Emotional User States: Low Level Descriptors and Functionals. In Proceedings of Interspeech, pages 2253–2256, 2007.
[Steeneken99] H. J. M. Steeneken and J. H. L. Hansen. Speech Under Stress Conditions: Overview of the Effect on Speech Production and on System Performance. IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2079–2082, 1999.
[Steidl08] S. Steidl, A. Batliner, E. Nöth, and J. Hornegger. Quantification of Segmentation and F0 Errors and Their Effect on Emotion Recognition. In TSD '08: Proceedings of the 11th International Conference on Text, Speech and Dialogue, pages 525–534. Springer-Verlag, Berlin, Heidelberg, 2008.
[Tolkmitt86] F. J. Tolkmitt and K. R. Scherer. Effect of Experimentally Induced Stress on Vocal Parameters. Journal of Experimental Psychology and Human Perceptual Performance, 12(3):302–313, 1986.
[Truong07] K. P. Truong and D. A. van Leeuwen. Automatic Discrimination Between Laughter and Speech. Speech Communication, pages 144–158, 2007.
[Truong08] K. P. Truong and S. Raaijmakers. Automatic Recognition of Spontaneous Emotions in Speech Using Acoustic and Lexical Features. In Proceedings of the 5th International Workshop on Machine Learning for Multimodal Interaction (MLMI), pages 161–172, 2008.
[Valenzise07] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti. Scream and Gunshot Detection and Localization for Audio-surveillance Systems. IEEE Conference on Advanced Video and Signal Based Surveillance, pages 21–26, 2007.
[Ververidis06] D. Ververidis and C. Kotropoulos. Emotional Speech Recognition: Resources, Features, and Methods. Speech Communication, pages 1162–1181, 2006.
[Ververidis08] D. Ververidis, I. Kotsia, C. Kotropoulos, and I. Pitas. Multi-modal Emotion-related Data Collection within a Virtual Earthquake Emulator. In International Conference on Language Resources and Evaluation, 2008.
[Vlasenko07a] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll. Combining Frame and Turn-Level Information for Robust Recognition of Emotions within Speech. In Proceedings of Interspeech, 2007.
[Vogt05] T. Vogt and E. Andre. Comparing Feature Sets for Acted and Spontaneous Speech in View of Automatic Emotion Recognition. In IEEE International Conference on Multimedia and Expo, pages 474–477, 2005.
[Weinstein91] C. J. Weinstein. Opportunities for Advanced Speech Processing in Military Computer-based Systems. Proceedings of the Workshop on Speech and Natural Language, 79(11):1626–1641, 1991.
[Zeng07] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang. A Survey of Affect Recognition Methods: Audio, Visual and Spontaneous Expressions. In ICMI '07: Proceedings of the 9th International Conference on Multimodal Interfaces, pages 126–133, 2007.
[Zhou01] G. Zhou, J. H. L. Hansen, and J. F. Kaiser. Nonlinear Feature Based Classification of Speech Under Stress. IEEE Transactions on Speech and Audio Processing, 9(3):201–216, 2001.
[vanLeeuwen07] D. A. van Leeuwen and N. Brümmer. An Introduction to Application-independent Evaluation of Speaker Recognition Systems. Speaker Classification I: Fundamentals, Features, and Methods, Lecture Notes in Artificial Intelligence, pages 330–353, 2007.