Image and Vision Computing 32 (2014) 682–691
Exploiting multi-expression dependences for implicit multi-emotion
video tagging☆
Shangfei Wang a,⁎, Zhilei Liu a, Jun Wang a, Zhaoyu Wang a, Yongqiang Li b, Xiaoping Chen a, Qiang Ji c
a Key Lab of Computing and Communication Software of Anhui Province, School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, PR China
b School of Electrical Engineering and Automation, Harbin Institute of Technology, Harbin, Hei Longjiang 15000, PR China
c Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
Article info
Article history:
Received 21 May 2013
Received in revised form 21 March 2014
Accepted 30 April 2014
Available online 9 May 2014
Keywords:
Implicit video tagging
Multi-emotion
Multi-expression
Abstract
In this paper, a novel approach of implicit multiple emotional video tagging is proposed, which considers the relations between the users' facial expressions and emotions as well as the relations among multiple expressions.
First, the audiences' expressions are inferred through a multi-expression recognition model, which consists of an image-driven expression measurement component and a Bayesian network representing the co-existence and mutual exclusion relations among multiple expressions. Second, the videos' multi-emotion tags are obtained
from the recognized expressions by another Bayesian Network, capturing the relations between expressions
and emotions. Results of the experiments conducted on the JAFFE and NVIE databases demonstrate that the performance of expression recognition is improved by considering the relations among multiple expressions. Furthermore, the relations between expressions and emotions help improve emotional tagging, as our approach
outperforms the traditional expression-based or image-driven implicit tagging methods.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Recent years have seen a rapid increase in the size of digital video
collections. Because emotion is an important component in the personalized classification and retrieval of digital videos, assigning emotional
tags to videos has been an active research area in recent decades [1].
This tagging work is usually divided into two categories: explicit and
implicit tagging [2]. Explicit tagging involves a user manually labeling
a video's emotional content based on his/her visual examination of the
video. Implicit tagging, on the other hand, refers to assigning tags to
videos based on an automatic analysis of a user's spontaneous response
while consuming the videos [2]. Although explicit tagging is a major
method at present, it is time-consuming and imposes extra work on users. Implicit emotion tagging can overcome these limitations of explicit tagging.
The manifestations of human emotion are various, including physiological signals and visual behaviors, both of which have been adopted as users' spontaneous nonverbal responses in implicit emotion tagging research. Physiological signals reflect subtle, unconscious variations in bodily functions during emotional experience; these functions are controlled by the Sympathetic Nervous System (SNS), and most of them cannot be easily
captured by other sensory channels or observation methods. However, to capture physiological responses, users are required to wear complex apparatuses, which may make some users feel uncomfortable. In contrast, when obtaining an implicit tag from visual behavior, no complex apparatus other than one standard visible-light camera is needed. Thus the video-based approach is more easily applied outside the laboratory.
Facial expression analysis is one of the most feasible visual behaviors
for implicit video tagging. Six basic expression categories proposed by
Paul Ekman are the widely adopted descriptors, including anger, disgust, fear, happiness, sadness, and surprise [3]. Present expression-based video tagging research, however, assumes that a person displays only a single expression category at a time. This assumption is challenged by psychology studies, which demonstrate that some expression manifestations are combinations of specific basic expression categories [4] because of the underlying facial anatomy. In addition, some expressions may rarely appear together, such as happiness and sadness. These co-existence and mutual exclusion phenomena of basic expressions should
be considered in expression recognition. In this paper, multi-expression
recognition is conducted. The relationships among expressions are
taken into consideration by a Bayesian Network (BN).
Even though facial expression is the major visual manifestation of emotion, expressions and emotions are still different [5–8]. Thus, we cannot treat them as identical [9]. However, the relationships between them are rarely analyzed or considered [10]. In video emotion tagging, the recognized facial expressions are typically regarded directly as the emotions. Similar to expressions, multiple emotions may appear while subjects watch stimuli videos [11] or during daily communication. Therefore, it is reasonable to tag a video with multiple emotion categories and to consider their relationships. In this paper, we use a BN to capture the relations between multiple emotions and multiple expressions.
In this paper, a multi-emotion tagging method is proposed. First, the
multi-expressions of the user are recognized through a novel expression recognition model, which combines an image-driven expression recognition step with a BN model exploiting the relations among multiple expressions.
Then, the video's multi-emotion tags are obtained from recognized
multi-expressions through another BN model considering the relations
between the expressions and emotions. Expression recognition results
on both JAFFE [12] and NVIE [13] databases demonstrate the importance of modeling multi-expression relations. Emotional tagging results
on the NVIE database validate the effectiveness of our approach.
2. Related work
To the best of our knowledge, the study of emotional tagging of videos was first conducted at the beginning of the last decade by Moncrieff et al. [14]. Early studies mainly inferred emotional tags directly from the video content, and so they belong to the explicit approach.
Two kinds of tag descriptors are often used. One is the categorical approach, which uses a subset of the six basic emotions (happiness, sadness, surprise, fear, disgust, and anger) [15–18] or other emotion descriptors (such as boring, exciting, and so on) [19–29]. The other is the dimensional approach [18,19,27–35], such as valence and arousal [18,28–34].
Various kinds of visual and audio features (e.g., color, motion, and
sound energy) are extracted from videos [14,19,36,37]. The mapping
from video features to emotional tags is accomplished by different machine learning methods, such as support vector machine [36], support
vector regression [38], neural networks [23], hidden Markov model
[17,18,25], dynamic Bayesian networks [27], conditional random fields
[39], etc.
Only recently have researchers begun to realize that users' spontaneous physiological and behavioral responses are useful cues for videos' emotional tags. By recognizing users' emotions from the physiological/behavioral signals induced while watching the videos, the tags can be obtained automatically. This is called the implicit approach. Pantic and Vinciarelli [2] were the first to introduce the concept of implicit human-centered tagging and to identify the main problems in this research area. Currently, implicit emotion tagging of videos mainly uses physiological signals or subjects' spontaneous visual behavior. In this section,
we give a brief review of video emotion tagging and related work, such
as indexing, retrieval, segmentation, and summarization of the emotional content from videos [40].
2.1. Implicit emotion tagging of videos using physiological signals
Several researchers have focused on implicit tagging using physiological signals, which could reflect subtle variations in the human
body. Money and Agius [41,42] investigated whether users' physiological responses, such as Galvanic Skin Response (GSR), respiration, Blood
Volume Pulse (BVP), Heart Rate (HR) and Skin Temperature (ST), can
serve as summaries of affective video content. They collected 10 subjects' physiological responses while they watched three films and two
award-winning TV shows. Experimental results showed the potential
of the physiological signals as external user-based information for
affective video content summaries. They [43] further proposed
Entertainment-Led Video Summaries (ELVIS) to identify the most entertaining sub-segments of videos based on their previous study.
Soleymani et al. [44,45] analyzed the relationships between subjects'
physiological responses, the subject's emotional valence as well as
arousal, and the emotional content of the videos. Moreover, Kierkels
et al. [46] implemented an affect-based multimedia retrieval system
by using both implicit and explicit tagging methods. Two multimodal
databases have further been constructed for implicit tagging. One is DEAP (Database for Emotion Analysis using Physiological signals) [47], in which electroencephalogram (EEG) and peripheral physiological signals, including GSR, respiration amplitude, ST, electrocardiogram (ECG), BVP, electromyography (EMG) and electrooculogram (EOG), were collected from 32 participants while they watched 40 one-minute excerpts of music videos. Frontal face videos were also recorded for 22 of the 32 participants. The other database is MAHNOB-HCI [48], in which face videos, speech, eye gaze, and both peripheral and central nervous system physiological signals of 27 subjects were recorded during two experiments. In the first experiment, subjects self-reported their felt emotions for 20 emotion-eliciting videos using arousal, valence, dominance and predictability as well as emotional keywords. In the second experiment, subjects assessed agreement or disagreement of the displayed tags with short videos or images.
While these two pioneer groups investigated many kinds of physiological signals as the implicit feedback, other researchers focused only
on one or two kinds of physiological signals. For example, Canini et al.
[49] investigated the relationship between GSR and affective video features for the arousal dimension. Smeaton and Rothwell [50] proposed to
detect film highlights from viewers' HR and GSR. Toyosawa and Kawai
[51] proposed to extract attentive shots with the help of subjects'
heart rate and heart rate variability.
Two researcher groups considered event-related potential (ERP) as
subjects' implicit feedback. One of them [52] attempted to validate
video tags using an N400 ERP. Another group [15] attempted to perform
implicit emotion tagging of multimedia content through a brain-computer interface system based on a P300 ERP.
Recently, Abadi et al. [53] proposed to differentiate between low
versus high arousal and low versus high valence using the
Magnetoencephalogram (MEG) brain signal.
Instead of using contact and intrusive physiological signals, Krzywicki et al. [54] adopted facial thermal signatures, a non-contact and non-intrusive physiological signal, to analyze the affective content of films.
The studies described above have indicated the potential of physiological signals for the implicit emotion tagging of videos. However, to acquire physiological signals, subjects are usually required to wear several contact apparatuses, which may make them feel uncomfortable and hinders the real-world application of these methods.
2.2. Implicit emotion tagging of videos using spontaneous visual behavior
Several researchers have turned to implicit tagging according to
human spontaneous visual behavior, since it can be measured using
non-contact and non-intrusive techniques, and easily applied in real
life. Joho et al. [55] proposed to detect personal highlights in videos by
analyzing viewers' facial activities. The experimental results on a
dataset of 10 participants watching eight video clips suggested that
compared with the activity in the lower part, the activity in the upper
part of face tended to be more indicative of personal highlights. Arapakis
et al. [56] proposed a multimodal recommender system that uses facial
expression to convey users' emotional feedback. The experimental results on 24 participants verified the effectiveness of facial expression
for the recommender system.
Other than focusing on subjects' whole facial activity, Ong and
Kameyama [30] analyzed affective video content by using viewers'
pupil sizes and gazing points. Experimental results on 6 subjects
watching 3 videos showed the effectiveness of their approach.
Peng et al. [57] proposed to fuse users' eye movements (like blink or
saccade) and facial expressions (positive or negative) for home video
summarization. Their experimental results on 8 subjects watching 5
video clips, demonstrated the feasibility of both eye movements and facial expressions for video summarization application. They [58] also
proposed and integrated an interest meter module into a video summarization system, and achieved good performance.
McDuff et al. [59] proposed to classify “liking” and “desire to watch
again” automatically from spontaneous smile responses. Their study
based on over 1500 facial responses to media collected from the Internet demonstrated the feasibility of using facial responses for content effectiveness evaluation.
The studies described above illustrate the development of methods for using spontaneous visual behavior in the implicit tagging of videos. However, the assumption made by the above studies is that the expressions displayed by the subjects were the same as their internal feelings when they watched the videos. For this reason, most researchers have used the recognized expression directly as the emotional tag of the videos.
However, research has indicated that internal feelings and displayed facial
behaviors are related, but not always the same [5–7]. Facial expressions
reflect not only emotions, but also social context etc. [8]. Thus, a certain
emotion does not necessarily produce a certain spontaneous expression.
Furthermore, present research on expression-based emotional video tagging assumes that subjects display only one expression while watching videos, and thus that there is only one emotional tag for a video. However, it is very hard to find a subject who expresses a high level of a single expression without the presence of any other categories, either in day-to-day living or in laboratory emotion-elicitation experiments [9,11]. From the existing expression databases [12,13] with multi-expression labels, we can observe that multiple expressions can be assigned to a single expression image. For example, a sample labeled with fear may also show a certain degree of anger and surprise. On the other hand, a sample labeled with happiness rarely co-occurs with negative expressions such as anger and sadness. The same holds for the emotional tags of videos. For example, Gross and Levenson [11] developed a set of films to elicit eight emotion states. Based on their study, the videos that elicit amusement always elicit happiness and surprise. The videos that induce anger may also induce some degree of disgust, sadness, fear and surprise. The videos that induce disgust may also induce fear and surprise to some extent. However, the videos that induce anger and disgust may not induce a high level of happiness. These phenomena of co-existence and mutual exclusion for emotional categories are also revealed in [60]. Until now, there has been little research considering multi-expression recognition and multi-emotion tagging of videos [61]. Although Kierkels et al. [46] and Kierkels and Pun [62] have considered multiple labels for emotional tagging, they do not exploit the dependencies among the labels; each label is predicted individually.
Thus, in this paper, we treat expression recognition and video emotion tagging as multi-label classification problems. We propose two BNs: one to systematically capture the dependencies among expressions, and the other to model the relations between multiple emotions and multiple
expressions.
Compared to the related work, our contributions are as follows. First, we are among the first to model the co-existent and mutually exclusive relationships among multiple expressions and to exploit them for multi-expression recognition. Second, we are the first to consider the relations between the expressions and the emotions in implicit video tagging.
3. Method
The framework of our implicit emotion tagging approach is shown in
Fig. 1, which consists of two modules: multi-expression recognition and
multi-emotion tagging. The former module includes feature extraction,
expression measurements extraction using an image-driven method,
and multi-expression recognition using a BN, which captures the relations among multiple expressions. The latter module infers the multiple emotions of the stimulus video from the recognized multi-expressions.
The details are described in the following sections.
3.1. Multi-expression recognition
3.1.1. Feature extraction
In this paper, Active Appearance Model (AAM) [63] features, which capture both texture and shape information, are extracted from the apex expressional images. First, the subject's eyes are located automatically using an eye-location method based on AdaBoost and Haar features [64]. Then, all the expressional images are normalized to 400 × 400 grayscale images. After that, the face is labeled with 61 points as shown in Fig. 2. Here, the AAMs are trained in a person-independent manner, and the AAM tool from [65] is used. Finally, a 30-dimensional appearance feature vector, including shape and texture features, is obtained for each facial image.
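As a rough, hypothetical sketch of this preprocessing step (not the authors' implementation), the following Python code locates the eyes with an OpenCV Haar cascade, rotates the face so that the eyes are level, and resizes the result to a 400 × 400 grayscale image; the AAM fitting itself is left to an external tool such as the one from [65], and the function and parameter names here are assumptions.

```python
import cv2
import numpy as np

def normalize_face(image_path, size=400):
    """Sketch: align a face image using detected eye centers and resize it
    to a size x size grayscale image (stand-in for the eye-location and
    normalization step described above)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    eye_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_eye.xml")
    eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(eyes) < 2:
        # Fall back to a plain resize if both eyes are not found.
        return cv2.resize(gray, (size, size))
    # Take the two largest detections and compute their centers.
    (x1, y1, w1, h1), (x2, y2, w2, h2) = sorted(eyes, key=lambda e: -e[2])[:2]
    c1 = np.array([x1 + w1 / 2.0, y1 + h1 / 2.0])
    c2 = np.array([x2 + w2 / 2.0, y2 + h2 / 2.0])
    left, right = (c1, c2) if c1[0] < c2[0] else (c2, c1)
    # Rotate so the inter-ocular line becomes horizontal, then resize.
    angle = np.degrees(np.arctan2(right[1] - left[1], right[0] - left[0]))
    center = tuple(((left + right) / 2).astype(float))
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    aligned = cv2.warpAffine(gray, M, gray.shape[::-1])
    return cv2.resize(aligned, (size, size))
```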
3.1.2. Expression measurement extraction
In order to select discriminative features for classification, the F-test
statistic [66] is used for feature selection. The significance of all features
can be ranked by sorting their corresponding F-ratios in descending
order, and then 10-fold cross-validation is adopted during training to empirically determine the number l and to select the top l features. The
F-ratio of feature x is calculated using Eq. (1):

\mathrm{F\text{-}ratio}(x) = \frac{\sum_{c=1}^{N} n_c (\bar{x}_c - \bar{x})^2 / (N-1)}{\sum_{c=1}^{N} (n_c - 1)\,\sigma_c^2 / (n - N)}    (1)

where \bar{x}_c is the average of a single feature x within class c, \sigma_c^2 is its within-class variance, \bar{x} is the global mean, n_c is the number of samples of class c, N is the number of classes, and n is the total number of samples.

Fig. 1. Framework of our proposed method.
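To make the ranking step concrete, the following is a minimal numpy sketch of Eq. (1), assuming \sigma_c^2 is the within-class sample variance; the function names are hypothetical, and the number of selected features l would be chosen by the 10-fold cross-validation described above.

```python
import numpy as np

def f_ratio(x, y):
    """One-way ANOVA F-ratio of Eq. (1) for a single feature.

    x : (n,) feature values, y : (n,) integer class labels.
    """
    classes = np.unique(y)
    N, n = len(classes), len(x)
    global_mean = x.mean()
    between = sum(np.sum(y == c) * (x[y == c].mean() - global_mean) ** 2
                  for c in classes) / (N - 1)
    within = sum((np.sum(y == c) - 1) * x[y == c].var(ddof=1)
                 for c in classes) / (n - N)
    return between / within

def select_top_features(X, y, l):
    """Rank all features by F-ratio (descending) and keep the top l."""
    ratios = np.array([f_ratio(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-ratios)[:l]
```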
Fig. 2. Distribution of AAM points.

After feature selection, SVMs are adopted as the classifiers. The input of each SVM is the feature vector consisting of the selected features. The output of each SVM is a binary value, indicating whether the sample has a certain expression tag or not. There are a total of six classifiers, corresponding to the six basic expression categories proposed by Ekman, namely happiness, disgust, fear, surprise, anger and sadness, since the databases adopted in our experiments, i.e. the NVIE database and the JAFFE database, are annotated with only the six basic expression categories. Finally, each sample is attached with a binary string of 6 bits. This binary string is used as the input to the BN model in the following step.

3.1.3. Multi-expression recognition using BN

As traditional facial expression recognition methods treat each expression category individually and do not consider the dependencies among categories in the training set, some valuable information may be lost. In order to model the semantic relationships among expression categories, we utilize a BN model for further expression recognition. As a probabilistic graphical model, a BN can effectively capture the dependencies among variables in data. In our work, each node of the BN is an expression label, and the links and their conditional probabilities capture the probabilistic dependencies among expressions.

3.1.3.1. BN structure and parameter learning. BN learning consists of structure learning and parameter learning. The structure consists of the directed links among the nodes, while the parameters are the conditional probabilities of each node given its parents.

Given the dataset of multiple target labels D_L = \{T_i\}_{i=1}^{m}, where T_i = \{\lambda_{ij}\}_{j=1}^{n}, m is the number of samples, and n is the number of labels, structure learning aims to find a structure G that maximizes a score function. In this work, we employ the Bayesian Information Criterion (BIC) [67] score function, which is defined as follows:

\mathrm{Score}(G) = \max_{\theta} \log p(D_L \mid G, \theta) - \frac{\mathrm{Dim}_G}{2} \log m    (2)

where the first term is the log-likelihood function of the parameters \theta with respect to the data D_L and structure G, representing the fitness of the network to the data; the second term is a penalty relating to the complexity of the network, and Dim_G is the number of independent parameters.

To learn the structure, we employ our BN structure learning algorithm [68]. By exploiting the decomposition property of the BIC score function, this method learns an optimal BN structure efficiently and guarantees to find the global optimum structure, independent of the initial structure. Furthermore, the algorithm provides an anytime valid solution, i.e., it can be stopped at any time with the best solution found so far and an upper bound on the global optimum. As a state-of-the-art method in BN structure learning, it allows the relationships among expressions to be captured automatically. Details of this algorithm can be found in [68]. Examples of the trained BN structures are shown in Figs. 4 and 5.

After the BN structure is constructed, the parameters can be learned from the training data. Learning the parameters in a BN means finding the most probable values \hat{\theta} for \theta that can best explain the training data. Here, let \lambda_j denote a variable of the BN, and \theta_{jlk} denote a probability parameter of the BN; then

\theta_{jlk} = P\big(\lambda_j^k \mid pa^l(\lambda_j)\big)    (3)

where j \in \{1,\ldots,n\}, l \in \{1,\ldots,r_j\} and k \in \{1,\ldots,s_j\}. Here n denotes the number of variables (nodes in the BN); pa(\lambda_j) represents the parents of variable \lambda_j; r_j is the number of possible instantiations of pa(\lambda_j); s_j is the number of state instantiations of \lambda_j. Hence, \lambda_j^k denotes the kth state of variable \lambda_j and pa^l(\lambda_j) the lth instantiation of its parents.

Based on the Markov condition, any node in a Bayesian network is conditionally independent of its non-descendants given its parents. The joint probability distribution represented by the BN can therefore be written as P(\lambda) = P(\lambda_1, \ldots, \lambda_n) = \prod_j P(\lambda_j \mid pa(\lambda_j)). In this work, the "fitness" of parameters \theta to the training data D is quantified by the log-likelihood function \log(P(D \mid \theta)), denoted as L_D(\theta). Assuming the training samples are independent, and based on the conditional independence assumptions in the BN, the log-likelihood function is given in Eq. (4):

L_D(\theta) = \log \prod_{j=1}^{n} \prod_{l=1}^{r_j} \prod_{k=1}^{s_j} \theta_{jlk}^{\,n_{jlk}}    (4)

where n_{jlk} is the number of samples in D containing both \lambda_j^k and pa^l(\lambda_j).

Since there is no hidden node in the BN and fully labeled training data are used in this work, maximum likelihood estimation (MLE) can be formulated as the constrained optimization problem in Eq. (5):

\max_{\theta} L_D(\theta) \quad \text{s.t.} \quad g_{jl}(\theta) = \sum_{k=1}^{s_j} \theta_{jlk} - 1 = 0    (5)

where g_{jl} imposes the constraint that the parameters of each node sum to 1 over all the states of that node. Solving the above problem yields \theta_{jlk} = n_{jlk} / \sum_{k} n_{jlk}.
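As an illustration of this closed-form estimate, the sketch below computes the conditional probability table of one label node from fully observed binary labels by simple counting; the data layout and the small pseudo-count used to avoid empty counts are assumptions, not part of the paper.

```python
import numpy as np
from itertools import product

def learn_cpt(labels, j, parents, n_states=2):
    """Estimate P(label j | its parents) by counting, i.e. the MLE
    theta_jlk = n_jlk / sum_k n_jlk derived from Eq. (5).

    labels : (m, n) binary label matrix; parents : list of parent columns.
    Returns a dict mapping each parent configuration to a distribution
    over the states of node j.
    """
    cpt = {}
    for pa_config in product(range(n_states), repeat=len(parents)):
        mask = np.ones(len(labels), dtype=bool)
        for col, val in zip(parents, pa_config):
            mask &= labels[:, col] == val
        counts = np.array([np.sum(labels[mask, j] == k)
                           for k in range(n_states)], dtype=float) + 1e-6
        cpt[pa_config] = counts / counts.sum()
    return cpt
```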
3.2. BN inference
A complete BN model is obtained after parameter and structure learning. Given the expression measurements obtained in the first procedure,
the true expression category of the input sample is estimated through BN
inference. During the BN inference, the posterior probability of categories
can be estimated by combining the likelihood from measurement with
the learned prior model. Let λj and Mλj, j ∈ {1,…n}, denote the variable
and the corresponding measurements obtained by image-driven
methods respectively. Then, the probability of each expression combination pattern given the measurements is calculated as follows:
Y^{\star} = \arg\max_{\lambda_1, \ldots, \lambda_n} P(\lambda_1, \ldots, \lambda_n \mid M_{\lambda_1}, \ldots, M_{\lambda_n}) \propto \arg\max_{\lambda_1, \ldots, \lambda_n} \prod_{j=1}^{n} P(M_{\lambda_j} \mid \lambda_j) \prod_{j=1}^{n} P(\lambda_j \mid pa(\lambda_j))    (6)
The first part of the equation is the likelihood of λj given the measurements and the second part is the product of the conditional
probabilities of each category node λj given its parents pa(λj), which are
BN model parameters that have been learned. In practice, the belief
propagation algorithm [69] is used to estimate the posterior probability
of each category node efficiently.
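Since there are only six binary expression nodes, the MAP assignment of Eq. (6) can even be found by enumerating all 2^6 label combinations, as in the sketch below; the paper itself uses belief propagation [69], and the measurement-likelihood and CPT interfaces used here are assumptions.

```python
import numpy as np
from itertools import product

def map_expressions(meas_lik, cpts, parents, n_labels=6):
    """Brute-force version of Eq. (6): return the binary expression vector
    maximizing prod_j P(M_j | lambda_j) * prod_j P(lambda_j | pa(lambda_j)).

    meas_lik : (n_labels, 2) array, meas_lik[j, s] = P(M_j | lambda_j = s)
    cpts     : list of dicts, cpts[j][parent_config] = distribution over j
    parents  : list of lists of parent indices for each node
    """
    best, best_score = None, -np.inf
    for combo in product((0, 1), repeat=n_labels):
        score = 0.0
        for j, state in enumerate(combo):
            pa_config = tuple(combo[p] for p in parents[j])
            score += np.log(meas_lik[j, state] + 1e-12)
            score += np.log(cpts[j][pa_config][state] + 1e-12)
        if score > best_score:
            best, best_score = combo, score
    return np.array(best)
```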
3.3. Multi-emotion tagging from multi-expression measurements
Expression is the facial appearance of the subject's emotion, and the
emotion tag is the latent feeling. In order to establish the relationship
between the recognized expressions and the individual emotion states,
another BN is constructed for video emotion tagging, whose structure is
manually defined as shown in Fig. 3. It includes 12 discrete nodes,
representing six emotion tags Y = Y1,…,Y6 and six recognized expressions X = X1,…,X6. Each node has 2 states (1,0), representing whether
this expression or emotion tag exists or not. The connections of these
nodes capture the transition relationship from expression to emotion.
We choose to use the six emotional prototypes mainly because of the
database we used for our experiments. The NVIE database is annotated
with only the six emotions. Our method however is not limited to the
six emotions and it can be trained to classify other emotion categories.
Given the BN's structure as shown in Fig. 3, the BN parameters, i.e.,
the prior probability P(Yj) of Yj (j = 1,…6), and the conditional probability P(Xl|Yj)(l,j = 1,…,6) are learned from the training data through the
maximum likelihood estimation. After training, the posterior probability P(Yj|X = X1,…,X6) of a testing sample is calculated according to the
following equation:
P(Y_j \mid X = \{X_1, \ldots, X_6\}) = \frac{P(Y_j)\, P(X = \{X_1, \ldots, X_6\} \mid Y_j)}{P(X = \{X_1, \ldots, X_6\})} = \frac{P(Y_j) \prod_{l=1}^{6} P(X_l \mid Y_j)}{P(X = \{X_1, \ldots, X_6\})}    (7)
Based on the calculated posterior probability P(Yj|X = X1,…,X6) of
each emotion node, the final emotion tag of the corresponding stimuli
video is obtained as follows: Z = (z_1, \ldots, z_j, \ldots, z_6), where z_j = \arg\max_{t \in \{0,1\}} P(Y_j = t \mid X), \; j = 1, \ldots, 6.
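A minimal sketch of Eq. (7) together with this decision rule, assuming the priors and conditional probabilities have been estimated by counting as in Section 3.1.3, might look as follows (the array layouts are assumptions):

```python
import numpy as np

def tag_emotions(x, prior, cond):
    """Eq. (7) plus the arg-max decision rule.

    x     : (6,) binary vector of recognized expressions X1..X6
    prior : (6, 2) array, prior[j, t] = P(Y_j = t)
    cond  : (6, 6, 2, 2) array, cond[j, l, t, s] = P(X_l = s | Y_j = t)
    Returns z, the (6,) binary vector of emotion tags.
    """
    z = np.zeros(6, dtype=int)
    for j in range(6):
        post = np.array([prior[j, t] *
                         np.prod([cond[j, l, t, x[l]] for l in range(6)])
                         for t in (0, 1)])
        post /= post.sum()   # normalization plays the role of P(X = {X1..X6})
        z[j] = int(post[1] > post[0])
    return z
```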
4. Experiments
4.1. Experimental conditions
Presently, three public databases can be used for implicit video emotion tagging. They are DEAP [47], MAHNOB-HCI [48] and NVIE [13]. The
first two databases provide neither multiple emotion tags for a video,
nor expression annotations. Thus, this paper adopted the NVIE database,
which contains both posed expressions and video-elicited spontaneous
expressions of more than 100 subjects. During the spontaneous expression collection experiments, the participants self-reported the intensity of the six basic emotion categories, ranging from 0 to 4, for each stimulus video according to their emotional experience. These can be
regarded as the emotion tags of the videos. In addition, the NVIE database provides the facial expression intensity annotations of both apex
facial images and image sequences in six categories, ranging from 0 to
2. The construction details of the NVIE database can be found in [13].
Therefore, spontaneous samples with six expression and emotion categories in the NVIE database are considered in this paper to recognize
multiple expressions of users and to assign multiple emotions to videos.
Before the experiments, the expression and emotion annotations of
each sample are converted to a binary vector according to the following
strategy: if the annotation value of an expression or emotion category
is larger than 0, the state of this expression or emotion is set to be 1, otherwise, it is set to be 0. Ultimately, 1154 samples are selected.
With respect to facial expression databases, exhaustive surveys can be found in [70] and [71]. Most existing expression databases assign only one expression category to an image or image sequence, except for the JAFFE and NVIE databases. Thus, besides the NVIE database, the JAFFE database is adopted for the multi-expression recognition experiments. The JAFFE database is a posed database consisting of only apex facial expression images, which were evaluated by 60 raters on a five-point (1–5) intensity scale for the six basic expression categories. An expression is considered present if its average intensity is higher than 3. Some images have intensities below 3 for all six expressions; those images are removed from the experiment. After image preprocessing, we obtain 1154 images from NVIE and 188 images from JAFFE. Since both the NVIE and JAFFE databases provide apex facial images, we need not identify the apex frames ourselves and simply use the existing apex images in the databases. We do
not employ the temporal information in our algorithm. Table 1 presents
the distribution of samples.
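For concreteness, the two binarization rules described above (NVIE: annotation value larger than 0; JAFFE: average rater intensity higher than 3) could be implemented as in the following sketch; the array shapes and names are assumptions.

```python
import numpy as np

def binarize_nvie(intensities):
    """NVIE rule: an expression/emotion is present if its annotated
    intensity (0-4 for emotions, 0-2 for expressions) is larger than 0."""
    return (np.asarray(intensities) > 0).astype(int)

def binarize_jaffe(rater_scores):
    """JAFFE rule: an expression is present if the average of the raters'
    1-5 intensity scores exceeds 3; images with no expression above 3
    are discarded (None is returned for those)."""
    mean_scores = np.asarray(rater_scores, dtype=float).mean(axis=0)
    labels = (mean_scores > 3).astype(int)
    return labels if labels.any() else None
```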
The evaluation metrics of multi-label classification differ from those of single-label classification, since for each instance there are multiple labels which may be classified partly correctly and partly incorrectly. Thus, two kinds of metrics are commonly used, example-based and label-based measures [72], evaluating the multi-label emotional tagging performance from the view of instances and of labels respectively. We adopt both kinds of measures in this work. Let Ti denote the true labels for instance i, and Zi the predicted labels for instance i; both are binary strings. m represents the number of instances and n the number of labels. The example-based measures, hamming loss, precision, and subset accuracy, are defined in Eqs. (8)–(10), and the label-based measures, precision and accuracy, are defined in Eqs. (11)–(12).
Fig. 3. BN for video emotion tagging based on recognized expression tags.
Table 1
Sample distribution in the NVIE and JAFFE databases.

Database   Hap.   Dis.   Fea.   Sur.   Ang.   Sad.
NVIE       399    326    232    259    283    245
JAFFE       41     96     45     58     57     70
Example-based measures:
\mathrm{Hamming\ loss} = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} \big[ I(j \in T_i \wedge j \notin Z_i) + I(j \notin T_i \wedge j \in Z_i) \big]    (8)
where I is the indicator function. Hamming loss measures the degree of
distance between predicted labels and actual labels.
\mathrm{Precision} = \frac{1}{m} \sum_{i=1}^{m} \frac{|T_i \cap Z_i|}{|Z_i|}    (9)
Precision is the proportion of correctly predicted labels among the predicted labels, averaged over all instances.
\mathrm{Subset\ accuracy} = \frac{1}{m} \sum_{i=1}^{m} I(T_i = Z_i)    (10)
Subset accuracy indicates the ratio of completely correctly predicted
samples to the total number of samples.
Label-based measures:
\mathrm{Precision,}\quad P_{micro} = \frac{\sum_{j=1}^{n} \sum_{i=1}^{m} T_i^j Z_i^j}{\sum_{j=1}^{n} \sum_{i=1}^{m} Z_i^j}    (11)
Pmicro is the proportion of correctly predicted positive labels among all predicted positive labels, pooled over all labels and instances.
\mathrm{Accuracy,}\quad Acc_{micro} = \frac{\sum_{j=1}^{n} \sum_{i=1}^{m} I(T_i^j = Z_i^j)}{nm}    (12)
Accmicro is the proportion of correctly classified instance-label pairs among all such pairs, i.e., the accuracy averaged over all labels.
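The five measures of Eqs. (8)–(12) can be computed directly from the binary matrices of true and predicted labels, for example as in this sketch (T and Z are m × n 0/1 arrays; the small epsilon guarding empty predictions is an assumption not discussed in the paper):

```python
import numpy as np

def multilabel_metrics(T, Z, eps=1e-12):
    """Eqs. (8)-(12): T and Z are (m, n) binary arrays of true and
    predicted labels."""
    T, Z = np.asarray(T), np.asarray(Z)
    m, n = T.shape
    hamming_loss = np.sum(T != Z) / (n * m)                      # Eq. (8)
    precision = np.mean(np.sum(T & Z, axis=1) /
                        (np.sum(Z, axis=1) + eps))               # Eq. (9)
    subset_acc = np.mean(np.all(T == Z, axis=1))                 # Eq. (10)
    micro_prec = np.sum(T & Z) / (np.sum(Z) + eps)               # Eq. (11)
    micro_acc = np.sum(T == Z) / (n * m)                         # Eq. (12)
    return dict(hamming_loss=hamming_loss, precision=precision,
                subset_accuracy=subset_acc, micro_precision=micro_prec,
                micro_accuracy=micro_acc)
```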
For the hamming loss, the smaller the value, the better the performance; for the other metrics, the larger the value, the better. In order to ensure subject independence, the subjects are divided into ten groups as equally as possible. Then 10-fold cross-validation is adopted: for each fold, one group is taken as the testing set and the other nine groups as the training set. Therefore, there is no intersection between the subjects of the training set and those of the testing set. A t-test is used to examine the significance of the improvements.
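A subject-independent split of this kind can be obtained, for instance, with scikit-learn's GroupKFold, using the subject identity as the grouping variable; this is only an illustrative sketch, not the authors' code.

```python
from sklearn.model_selection import GroupKFold

def subject_independent_folds(features, labels, subject_ids, n_splits=10):
    """Yield train/test index pairs such that no subject appears in both
    the training and the testing set of any fold."""
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(features, labels, groups=subject_ids):
        yield train_idx, test_idx
```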
To further evaluate the effectiveness of our approach, we conduct
cross-database experiments in addition to the within database experiments. The NVIE database can be used for both expression recognition
and emotional tagging, while the JAFFE database can only be used for
expression recognition, since it is a posed facial expression database.
Therefore we conduct two kinds of cross-database multi-expression
recognition experiments. In the first experiment, we train the SVM classifiers using one database and test its performance on the other database. In the second experiment, we train the BN model using the
labels from one database, and evaluate its performance on the other database. The results are summarized in Section 4.2.3.
4.2. Experimental results of expression recognition
4.2.1. Analysis of the relations among expressions
We quantify the co-occurrence between different expressions using the conditional probability P(λj|λi), which measures the probability that expression λj occurs given that expression λi occurs. Tables 2 and 3 show the conditional probabilities between different expressions for the NVIE and JAFFE databases respectively.
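The entries of Tables 2 and 3 can be reproduced from the binary expression labels by simple counting, for example as in this sketch, where L is assumed to be an m × 6 binary label matrix:

```python
import numpy as np

def cooccurrence_conditionals(L, eps=1e-12):
    """Return a 6 x 6 matrix C with C[i, j] = P(expression j | expression i),
    estimated from an (m, 6) binary label matrix L by counting."""
    L = np.asarray(L, dtype=float)
    joint = L.T @ L                # joint[i, j] = #samples with both i and j
    counts = L.sum(axis=0)         # counts[i]  = #samples with expression i
    return joint / (counts[:, None] + eps)
```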
From Table 2, we can find that subjects can display multiple expressions; for instance, anger is often accompanied by sadness with high probability. There exist two kinds of relationships among expressions: co-occurrence and mutual exclusion. For example, P(anger|happiness) and P(sadness|happiness) are 0.00 and 0.016, suggesting that happiness rarely coexists with anger or sadness. Disgust and anger are co-occurrent, with a relatively high P(disgust|anger) of 0.495. Similar to the NVIE database, the JAFFE database also shows the two relations of co-existence and mutual exclusion. For example, P(disgust|anger) is 0.965, showing that anger and disgust frequently appear together, while P(happiness|sadness) is 0, indicating a mutually exclusive relationship.
Comparing the two tables, we find that the values of P(λj|λi) for the NVIE database range from 0.0 to 0.535, while those for the JAFFE database range from 0.0 to 0.965. The top-ranked and bottom-ranked P(λj|λi) of the two databases are not exactly the same either, although most of them are similar. The inconsistency between the databases in terms of expression overlap may be caused by the inherent database biases.
Table 2
Dependencies among expressions for the NVIE database. Each entry is P(λj|λi); rows index λi and columns index λj.

λi \ λj   Hap.    Dis.    Fea.    Sur.    Ang.    Sad.
Hap.      1       0.038   0.03    0.263   0       0.01
Dis.      0.046   1       0.255   0.059   0.429   0.221
Fea.      0.056   0.358   1       0.228   0.073   0.108
Sur.      0.405   0.073   0.205   1       0.043   0.015
Ang.      0       0.495   0.060   0.039   1       0.463
Sad.      0.016   0.294   0.102   0.016   0.535   1

Table 3
Dependencies among expressions for the JAFFE database. Each entry is P(λj|λi); rows index λi and columns index λj.

λi \ λj   Hap.    Dis.    Fea.    Sur.    Ang.    Sad.
Hap.      1       0       0       0.049   0       0
Dis.      0       1       0.375   0.219   0.573   0.563
Fea.      0       0.8     1       0.578   0.133   0.689
Sur.      0.035   0.362   0.448   1       0.103   0.276
Ang.      0       0.965   0.105   0.105   1       0.368
Sad.      0       0.771   0.443   0.229   0.3     1
Fig. 4. The trained BN structure from NVIE database. The shaded nodes are the hidden
nodes we want to infer and the unshaded nodes are the corresponding measurement
nodes.
Fig. 5. The trained BN structure from the JAFFE database. The shaded nodes are the hidden nodes we want to infer and the unshaded nodes are the corresponding measurement nodes.

After all, the NVIE database is a spontaneous database, while the JAFFE database is a posed facial expression database.

To systematically capture the relationships among labels, we train a BN on each database. Figs. 4 and 5 show the BNs learned from the NVIE database and the JAFFE database respectively. The links in the structures represent the dependencies among labels. In Fig. 4, the links from happiness to fear and to anger demonstrate that there are strong dependencies between these two pairs. From Table 2, we can see that P(fear|happiness) and P(anger|happiness) are 0.03 and 0, indicating mutually exclusive relations. In Fig. 5, the link from anger to disgust reflects a co-occurrence relationship, since P(disgust|anger) is 0.965 in Table 3.

Comparing the two trained BNs with the two dependency tables, we find that the label pairs whose conditional probabilities are top ranked or bottom ranked are linked in the BNs for the NVIE and JAFFE databases in most cases. This demonstrates the effectiveness of the BN structure learning method, which can effectively capture the mutually exclusive and co-existent relationships among multiple expression labels. Some common co-existent and mutually exclusive relationships are well established in both structures. For example, the mutually exclusive expression pairs sadness and happiness, and disgust and happiness, are modeled in both BN structures, and the co-existent expression pair sadness and disgust is present in both BNs. Besides, there exist some differences between the two structures. For example, there is a link from anger to disgust in the JAFFE structure which is not present in the NVIE structure, and there is a link from fear to sadness in the NVIE structure which is not captured in the JAFFE structure. The reason for these differences is that the conditional probability distributions for expression pairs in the two databases are different, indicating the database bias.

4.2.2. Experimental results of multi-expression recognition within database

Tables 4 and 5 show the experimental results on the NVIE database and the JAFFE database respectively. From Table 4, we can find that our approach improves the performance of expression recognition by considering the relations among expressions, since most of the example-based and label-based measures of our approach are better than those of the SVM classifier that does not consider the relations among expressions. Most of the improvements are significant. The improvement in subset accuracy demonstrates that our approach obtains more completely correctly classified samples; thus, our method makes the predictions more accurate than traditional methods. For the NVIE database, the average accuracy rate over the six expressions is increased from 77.3% to 80.3% by employing the relationships among expressions. For the JAFFE database, our approach increases the average accuracy rate from 84.0% to 84.3%. These experimental results demonstrate the effectiveness of our approach, since it can more effectively capture the dependencies among expression labels.

Table 4
Multiple expression recognition results on the NVIE database.

Method       Ham.          Pre.          SubAcc.       MicPre.       MicAcc.
SVM          0.227         0.586         0.235         0.534         0.773
SVM + BN     0.197         0.626         0.335         0.596         0.803
Sig. level   1.3 × 10⁻⁶*   2.9 × 10⁻⁶*   4.3 × 10⁻⁸*   1.3 × 10⁻⁸*   1.3 × 10⁻⁶*

"Ham.", "Pre." and "SubAcc." are the example-based measures hamming loss, precision and subset accuracy; "MicPre." and "MicAcc." are the label-based measures micro precision and micro accuracy. "*" indicates a difference at the 0.05 significance level.

Table 5
Multiple expression recognition results on the JAFFE database.

Method       Ham.     Pre.      SubAcc.   MicPre.   MicAcc.
SVM          0.160    0.777     0.441     0.840     0.840
SVM + BN     0.157    0.809     0.543     0.786     0.843
Sig. level   0.878    0.028*    0.005*    0.019*    0.878

"Ham.", "Pre." and "SubAcc." are the example-based measures hamming loss, precision and subset accuracy; "MicPre." and "MicAcc." are the label-based measures micro precision and micro accuracy. "*" indicates a difference at the 0.05 significance level.

4.2.3. Experimental results of cross-database multi-expression recognition

The cross-database multi-expression recognition experimental results are summarized in Table 6. From Table 6, we can observe the following: 1) For both cross-database experiments, the performance decreases compared to the corresponding within-database experiments. This decrease is expected because of the incongruity of the two databases and their inherent biases. In particular, the NVIE database is a spontaneous expression database, while JAFFE is a posed expression database. Hence, the image features or the BN model learned from one database cannot completely characterize the expressions in the other database. 2) The cross-database performance decrease, however, is asymmetric. Specifically, the cross-database performance on the JAFFE database is better than that on the NVIE database. This suggests that the NVIE database is more general and more applicable to the JAFFE database; the reverse, however, is not true. 3) Comparing the two cross-database experiments (the "SVM + BN (cross)" row and the "SVM (cross)" row of Table 6), SVM + BN improves the performance over SVM. This suggests that the BN model, though learnt from a different database, can still generalize to another database despite the significant differences between the two databases.

Table 6
The multi-expression recognition results of the cross-database experiments.

                     NVIE                                              JAFFE
Method               Ham.    Pre.    SubAcc.  MicPre.  MicAcc.        Ham.    Pre.    SubAcc.  MicPre.  MicAcc.
SVM (within)         0.227   0.586   0.235    0.534    0.773          0.160   0.777   0.441    0.840    0.840
SVM (cross)          0.280   0.432   0.180    0.449    0.720          0.342   0.271   0.085    0.425    0.658
SVM + BN (within)    0.197   0.626   0.335    0.596    0.803          0.157   0.809   0.543    0.786    0.843
SVM + BN (cross)     0.207   0.621   0.338    0.583    0.793          0.213   0.677   0.340    0.670    0.787

"Ham.", "Pre." and "SubAcc." are the example-based measures hamming loss, precision and subset accuracy; "MicPre." and "MicAcc." are the label-based measures micro precision and micro accuracy. "Within" refers to within-database and "cross" to cross-database testing.

4.3. Experimental results of emotional tagging of videos

The relationships between expressions and emotions are shown in Table 7. From Table 7, it can be seen clearly that there are multiple expressions for one emotion, and multiple emotions for one expression. For example, out of 432 self-reported happiness emotions, 346 samples present a happiness expression, and 210 samples present a surprise expression. This indicates the necessity of modeling the relations between expressions and emotions.
Based on the predefined BN in Section 3.3, the testing results obtained in Section 4.2.2 are used as the inputs of this model. The final video emotion tagging results are obtained based on the decision method defined in Section 3.3, and are given in the "Exp-BN-Emo" row of Table 8.
In order to verify the effectiveness of our multi-emotion tagging
method, two comparative experiments are conducted. In the first comparative experiment, emotion tags are recognized by an image-driven
method, which uses AAM features and SVMs, similar to the first step
of expression recognition. The tagging results are shown in the "SVM-Emo" row of Table 8. In the second experiment, the recognized expressions are directly regarded as the subjects' emotions; this is the commonly adopted expression-based video tagging method. The results are shown in the "RecExp-Emo" row of Table 8.
Comparing these two sets of results with ours, we can see that our approach (shown in the "Exp-BN-Emo" row) achieves better results, since most of the example-based and label-based measures of our approach are better than those of the image-driven method and of traditional expression-based emotion tagging. Most of the improvements are significant. The improvements in hamming loss indicate that the predictions of our method are closer to the actual labels than those of the image-driven method and traditional expression-based emotion tagging. Our
approach gets an average accuracy of 75.4% for six emotions, while the
image-driven method obtains an average accuracy of 72.1% and the
expression-based emotion tagging gets an average accuracy of 74.2%.
These comparisons validate the effectiveness of our video emotion tagging method, which considers the relationships among multiple expressions as well as the relationships between the expressions and emotions.
Since the BN is learned from the ground truth labels, it can only capture
the genuine relationships between expression and emotion, and cannot
model the masking expressions and emotions.
There are a few issues we would like to investigate in the future. First, besides facial expressions, emotions can also be characterized by facial action units (AUs). Facial expressions describe facial behavior globally, while facial action units represent facial muscle actions locally. Instead of using expression categories, AUs could also be used for emotional video tagging. Since current video tagging databases do not provide AU labels for the viewers, we only use expressions as the descriptors for emotional video tagging in this paper. In the future, we may investigate an emotional tagging approach using AUs.
Second, in this paper self-reports are used to obtain the emotions of the subjects, since this is the most commonly used method in previous research. However, the emotions of subjects are very difficult to obtain, and even self-reports are not always reliable due to problems such as cognitive bias [73]. Recently, Healey's work [74] indicated that triangulating multiple sources of ground truth information, such as "In situ" ratings, "End-of-Day" ratings and "Third Party" ratings, leads to a set of more reliable emotion labels. We may refer to this work to obtain ground-truth emotion labels in future work.
Finally, as we discussed in Section 4.2.1, the incongruity between the databases and their inherent biases may pose challenges for cross-database expression recognition and video emotion tagging. The expression overlap of different databases may not be exactly the same due to the inherent database biases. We will further investigate this issue in the future.
Table 7
The coexisting matrix of expressions and emotions from the NVIE database. Rows are self-reported emotions and columns are annotated expressions; the numbers in parentheses are the total sample counts per category.

Emo \ Exp     Hap.(399)  Dis.(326)  Fea.(232)  Sur.(259)  Ang.(283)  Sad.(245)
Hap.(432)     346        29         21         210        12         5
Dis.(405)     18         260        162        80         79         50
Fea.(330)     14         113        187        79         39         19
Sur.(445)     138        40         51         205        18         8
Ang.(232)     6          171        68         43         161        107
Sad.(213)     11         92         57         33         93         157

Table 8
Multiple emotion video tagging results on the NVIE database.

Method        Ham.          Pre.          SubAcc.   MicPre.       MicAcc.
SVM-Emo       0.279         0.544         0.172     0.523         0.721
Exp-BN-Emo    0.246         0.638         0.228     0.612         0.754
Sig. level    1.6 × 10⁻³*   5.8 × 10⁻⁵*   0.038*    3.3 × 10⁻⁵*   1.6 × 10⁻³*
RecExp-Emo    0.259         0.601         0.214     0.567         0.742
Exp-BN-Emo    0.246         0.638         0.228     0.612         0.754
Sig. level    0.005*        1.2 × 10⁻⁴*   0.120     7.4 × 10⁻⁴*   0.005*

"Ham.", "Pre." and "SubAcc." are the example-based measures hamming loss, precision and subset accuracy; "MicPre." and "MicAcc." are the label-based measures micro precision and micro accuracy. "*" indicates a difference at the 0.05 significance level.

5. Conclusion

In this paper, a video emotion tagging model, which considers the relationships among the facial expressions as well as the relationships between the expressions and emotions, is proposed and validated on the JAFFE and NVIE databases. Experimental results demonstrate that: (1) the relations among the expressions can be well captured through the Bayesian network's structure and parameters, and these relations help improve the performance of expression recognition; (2) emotion tagging results that consider the relationships between the expressions and emotions are better than those of image-driven methods and those obtained by directly regarding the expressions as the emotions. All these results verify the effectiveness of our proposed method.

Acknowledgments

This work has been supported by the National Program 863 (2008AA01Z122), the National Natural Science Foundation of China (Grant Nos. 61175037, 61228304), the US NSF (A40338), the Special Innovation Project on Speech of Anhui Province (11010202192), a project from the Anhui Science and Technology Agency (1106c0805008) and the Fundamental Research Funds for the Central Universities.
References
[1] S.-f. Wang, X.-f. Wang, Emotional semantic detection from multimedia: a brief overview, Kansei Engineering and Soft Computing: Theory and Practice, 2011. 126–146.
[2] M. Pantic, A. Vinciarelli, Implicit human-centered tagging [social sciences], IEEE Signal Proc. Mag. 26 (6) (2009) 173–180.
[3] P. Ekman, W.V. Friesen, Constants across cultures in the face and emotion, J. Pers.
Soc. Psychol. 17 (2) (1971) 124.
[4] S.C. Widen, J.A. Russell, A. Brooks, Anger and disgust: discrete or overlapping categories? 2004 APS Annual Convention, Boston College, Chicago, IL, 2004.
[5] J.-M. Fernandez-Dols, F. Sanchez, P. Carrera, M.-A. Ruiz-Belda, Are spontaneous expressions and emotions linked? An experimental test of coherence, J. Nonverbal
Behav. 21 (3) (1997) 163–177.
[6] A.J. Fridlund, Human Facial Expression: an Evolutionary View, Academic Press, 1994.
[7] I.B. Mauss, R.W. Levenson, L. McCarter, F.H. Wilhelm, J.J. Gross, The tie that binds?
Coherence among emotion experience, behavior, and physiology, Emotion 5 (2)
(2005) 175.
[8] J. Gratch, L. Cheng, S. Marsella, J. Boberg, Felt Emotion and Social Context Determine
the Intensity of Smiles in a Competitive Video Game, 2013. 1–8.
[9] J.A. Russell, J.-A. Bachorowski, J.-M. Fernández-Dols, Facial and vocal expressions of
emotion, Annu. Rev. Psychol. 54 (1) (2003) 329–349.
[10] J.T. Cacioppo, G.G. Berntson, T. Aue, Social Psychophysiology, Wiley Online Library, 1983.
[11] J.J. Gross, R.W. Levenson, Emotion elicitation using films, Cognit. Emot. 9 (1) (1995)
87–108.
[12] J.G. Michael, J. Lyons, Miyuki Kamachi, Japanese female facial expressions (JAFFE),
Database of digital images (1997).
[13] S. Wang, Z. Liu, S. Lv, Y. Lv, G. Wu, P. Peng, F. Chen, X. Wang, A natural visible and
infrared facial expression database for expression recognition and emotion inference, IEEE Trans. Multimedia 12 (7) (2010) 682–691.
[14] S. Moncrieff, C. Dorai, S. Venkatesh, Affect computing in film through sound energy
dynamics, Proceedings of the ninth ACM international conference on Multimedia,
ACM, 2001, pp. 525–527.
[15] A. Yazdani, J.-S. Lee, T. Ebrahimi, Implicit emotional tagging of multimedia using eeg
signals and brain computer interface, Proceedings of the first SIGMM workshop on
Social media, ACM, 2009, pp. 81–88.
[16] M. Xu, X. He, J.S. Jin, Y. Peng, C. Xu, W. Guo, Using Scripts for Affective Content Retrieval, 2011. 43–51.
[17] K. Sun, J. Yu, Video affective content representation and recognition using video affective tree and hidden markov models, Affective Computing and Intelligent Interaction, Springer, 2007, pp. 594–605.
[18] M. Xu, J.S. Jin, S. Luo, L. Duan, Hierarchical movie affective content analysis based on
arousal and valence features, Proceedings of the 16th ACM international conference
on Multimedia, ACM, 2008, pp. 677–680.
[19] R.M.A. Teixeira, T. Yamasaki, K. Aizawa, Determination of emotional content of video
clips by low-level audiovisual features, Multimedia Tools Appl. 61 (1) (2012) 21–49.
[20] X.Y. Chen, Z. Segall, Xv-pod: an emotion aware, affective mobile video player, Computer Science and Information Engineering, 2009 WRI World Congress on, vol. 3,
IEEE, 2009, pp. 277–281.
[21] S. Zhao, H. Yao, X. Sun, P. Xu, X. Liu, R. Ji, Video indexing and recommendation based
on affective analysis of viewers, Proceedings of the 19th ACM International Conference on Multimedia, ACM, 2011, pp. 1473–1476.
[22] G. Irie, K. Hidaka, T. Satou, T. Yamasaki, K. Aizawa, Affective video segment retrieval
for consumer generated videos based on correlation between emotions and emotional audio events, Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on, IEEE, 2009, pp. 522–525.
[23] S.C. Watanapa, B. Thipakorn, N. Charoenkitkarn, A sieving ann for emotion-based
movie clip classification, IEICE Trans. Inf. Syst. 91 (5) (2008) 1562–1572.
[24] H.L. Wang, L.-F. Cheong, Affective understanding in film, IEEE Trans. Circuits Syst.
Video Technol. 16 (6) (2006) 689–704.
[25] M. Xu, J. Wang, X. He, J.S. Jin, S. Luo, H. Lu, A three-level framework for affective content analysis and its case studies, Multimedia Tools Appl. (2012) 1–23.
[26] Z. Lu, X. Wen, X. Lin, W. Zheng, A video retrieval algorithm based on affective features, Computer and Information Technology, 2009. CIT'09, Ninth IEEE International
Conference on, vol. 1, IEEE, 2009, pp. 134–138.
[27] S. Arifin, P.Y. Cheung, A novel probabilistic approach to modeling the pleasurearousal-dominance content of the video based on “working memory”, Semantic
Computing, 2007. ICSC 2007. International Conference on, IEEE, 2007, pp. 147–154.
[28] C.H. Chan, G.J. Jones, Affect-based indexing and retrieval of films, Proceedings of the
13th Annual ACM International Conference on Multimedia, ACM, 2005, pp. 427–430.
[29] M. Soleymani, J. Kierkels, G. Chanel, T. Pun, A bayesian framework for video affective
representation, Affective Computing and Intelligent Interaction and Workshops,
2009. ACII 2009. 3rd International Conference on, IEEE, 2009, pp. 1–7.
[30] K.-M. Ong, W. Kameyama, Classification of video shots based on human affect, Inf.
Media Technol. 4 (4) (2009) 903–912.
[31] A. Hanjalic, L.-Q. Xu, Affective video content representation and modeling, IEEE
Trans. Multimedia 7 (1) (2005) 143–154.
[32] S. Zhang, Q. Tian, S. Jiang, Q. Huang, W. Gao, Affective mtv analysis based on arousal
and valence features, Multimedia and Expo, 2008 IEEE International Conference on,
IEEE, 2008, pp. 1369–1372.
[33] S. Zhang, Q. Huang, S. Jiang, W. Gao, Q. Tian, Affective visualization and retrieval for
music video, IEEE Trans. Multimedia 12 (6) (2010) 510–522.
[34] S. Arifin, P.Y. Cheung, User attention based arousal content modeling, Image Processing, 2006 IEEE International Conference on, IEEE, 2006, pp. 433–436.
[35] S. Arifin, P.Y. Cheung, A computation method for video segmentation utilizing the
pleasure-arousal-dominance emotional information, Proceedings of the 15th International Conference on Multimedia, ACM, 2007, pp. 68–77.
[36] C.-Y. Wei, N. Dimitrova, S.-F. Chang, Color-mood analysis of films based on syntactic
and psychological models, Multimedia and Expo, 2004. ICME'04. 2004 IEEE International Conference on, vol. 2, IEEE, 2004, pp. 831–834.
[37] A. Hanjalic, Extracting moods from pictures and sounds: towards truly personalized
tv, IEEE Signal Proc. Mag. 23 (2) (2006) 90–100.
[38] L. Canini, S. Benini, R. Leonardi, Affective recommendation of movies based on selected
connotative features, IEEE Trans. Circuits Syst. Video Technol. 23 (4) (2013) 636–647.
[39] M. Xu, C. Xu, X. He, J.S. Jin, S. Luo, Y. Rui, Hierarchical affective content analysis in
arousal and valence dimensions, Signal Process. 93 (8) (2013) 2140–2150.
[40] S. Wang, Z. Liu, Y. Zhu, M. He, X. Chen, Q. Ji, Implicit video emotion tagging from audiences' facial expression, Multimedia Tools Appl. (2014) 1–28.
[41] A.G. Money, H. Agius, Feasibility of personalized affective video summaries, Affect
and Emotion in Human–computer Interaction, Springer, 2008, pp. 194–208.
[42] A.G. Money, H. Agius, Analysing user physiological responses for affective video
summarisation, Displays 30 (2) (2009) 59–70.
[43] A.G. Money, H. Agius, Elvis: entertainment-led video summaries, ACM Trans. Multimedia Comput. Commun. Appl. (TOMCCAP) 6 (3) (2010) 17.
[44] M. Soleymani, G. Chanel, J.J. Kierkels, T. Pun, Affective ranking of movie scenes using
physiological signals and content analysis, Proceedings of the 2nd ACM Workshop
on Multimedia Semantics, ACM, 2008, pp. 32–39.
[45] M. Soleymani, G. Chanel, J. Kierkels, T. Pun, Affective characterization of movie
scenes based on multimedia content analysis and user's physiological emotional responses, Multimedia, 2008. ISM 2008. Tenth IEEE International Symposium on, IEEE,
2008, pp. 228–235.
[46] J.J. Kierkels, M. Soleymani, T. Pun, Queries and tags in affect-based multimedia retrieval, Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on,
IEEE, 2009, pp. 1436–1439.
[47] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt,
I. Patras, Deap: a database for emotion analysis; using physiological signals, IEEE
Trans. Affect. Comput. 3 (1) (2012) 18–31.
[48] M. Soleymani, J. Lichtenauer, T. Pun, M. Pantic, A multimodal database for affect recognition and implicit tagging, IEEE Trans. Affect. Comput. 3 (1) (2012) 42–55.
[49] L. Canini, S. Gilroy, M. Cavazza, R. Leonardi, S. Benini, Users' response to affective film
content: a narrative perspective, Content-Based Multimedia Indexing (CBMI), 2010
International Workshop on, IEEE, 2010, pp. 1–6.
[50] A.F. Smeaton, S. Rothwell, Biometric responses to music-rich segments in films: the
CDVPlex, Content-Based Multimedia Indexing, 2009. CBMI'09. Seventh International
Workshop on, IEEE, 2009. 162–168.
[51] S. Toyosawa, T. Kawai, An experience oriented video digesting method using heart
activity and its applicable video types, Advances in Multimedia Information
Processing-PCM 2010, Springer, 2010, pp. 260–271.
[52] S. Koelstra, C. Muhl, I. Patras, Eeg analysis for implicit tagging of video data, Affective
Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, IEEE, 2009, pp. 1–6.
[53] M.K. Abadi, M. Kia, R. Subramanian, P. Avesani, N. Sebe, Decoding Affect in Videos
Employing the Meg Brain, Signal, 2013. 1–6.
[54] A.T. Krzywicki, G. He, B.L. O'Kane, Analysis of facial thermal variations in response to
emotion: eliciting film clips, SPIE Defense, Security, and Sensing, International Society for Optics and Photonics, 2009, (734312–734312).
[55] H. Joho, J. Staiano, N. Sebe, J.M. Jose, Looking at the viewer: analysing facial activity
to detect personal highlights of multimedia contents, Multimedia Tools Appl. 51 (2)
(2011) 505–523.
[56] I. Arapakis, Y. Moshfeghi, H. Joho, R. Ren, D. Hannah, J.M. Jose, Integrating facial expressions into user profiling for the improvement of a multimodal recommender
system, Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on,
IEEE, 2009, pp. 1440–1443.
[57] W.-T. Peng, C.-H. Chang, W.-T. Chu, W.-J. Huang, C.-N. Chou, W.-Y. Chang, Y.-P.
Hung, A real-time user interest meter and its applications in home video summarizing, Multimedia and Expo (ICME), 2010 IEEE International Conference on, IEEE,
2010, pp. 849–854.
[58] W.-T. Peng, W.-T. Chu, C.-H. Chang, C.-N. Chou, W.-J. Huang, W.-Y. Chang, Y.-P.
Hung, Editing by viewing: automatic home video summarization by viewing behavior analysis, IEEE Trans. Multimedia 13 (3) (2011) 539–550.
[59] D. McDuff, R. el Kaliouby, D. Demirdjian, R. Picard, Predicting online media effectiveness based on smile responses gathered over the internet, IEEE Conference on Automatic Face and Gesture Recognition 2013, IEEE, 2013, pp. 1–7.
[60] P. Philippot, Inducing and assessing differentiated emotion–feeling states in the laboratory, Cognit. Emot. 7 (2) (1993) 171–193.
[61] Z. Wang, S. Wang, M. He, Q. Ji, Emotional tagging of videos by exploring multiemotion coexistence, Automatic Face & Gesture Recognition and Workshops (FG
2013), 2013 IEEE International Conference on, IEEE, 2013.
[62] J.J. Kierkels, T. Pun, Simultaneous exploitation of explicit and implicit tags in affectbased multimedia retrieval, Affective Computing and Intelligent Interaction and
Workshops, 2009. ACII 2009. 3rd International Conference on, IEEE, 2009, pp. 1–6.
[63] T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models, IEEE Trans. Pattern
Anal. Mach. Intell. 23 (6) (2001) 681–685.
[64] Y. Lv, S. Wang, P. Shen, A real-time attitude recognition by eye-tracking, Proceedings
of the Third International Conference on Internet Multimedia Computing and Service, ACM, 2011, pp. 170–173.
[65] T. Cootes, AAM tools, [online]. available http://personalpages.manchester.ac.uk/
staff/timothy.f.cootes/ .
[66] T. Wu, J. Duchateau, J.-P. Martens, D. Van Compernolle, Feature subset selection for
improved native accent identification, Speech Comm. 52 (2) (2010) 83–98.
[67] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (2) (1978) 461–464.
[68] C.P. de Campos, Q. Ji, Efficient structure learning of bayesian networks using constraints, J. Mach. Learn. Res. 12 (3) (2011) 663–689.
[69] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[70] Z. Zeng, M. Pantic, G.I. Roisman, T.S. Huang, A survey of affect recognition methods:
audio, visual, and spontaneous expressions, IEEE Trans. Pattern Anal. and Mach. Intelligence 31 (1) (2009) 39–58.
[71] http://emotion-research.net/wiki/databases.
[72] M.S. Sorower, A literature survey on algorithms for multi-label learning, Tech. Rep.
(2010) 1–25.
[73] J.D. Laird, C. Bresler, The process of emotional experience: a self-perception theory
(1992).
[74] J. Healey, Recording affect in the field: towards methods and metrics for improving
ground truth labels, Affective Computing and Intelligent Interaction, Springer, 2011,
pp. 107–116.