Visual Aesthetic Quality Assessment with Multi-task
Deep Learning
Yueying Kao, Ran He, Kaiqi Huang
CRIPAC & NLPR, Institute of Automation, Chinese Academy of Sciences
Abstract. This paper considers the problem of assessing visual aesthetic quality
with semantic information. We cast the assessment problem as the main task
of a multi-task deep model, and argue that semantic recognition offers the
key to addressing it. Based on convolutional neural networks, we
propose a general multi-task framework with four different structures. In each
structure, the aesthetic quality assessment task and the semantic recognition task are
leveraged jointly, and different features are explored to improve the quality assessment.
Moreover, an effective strategy for keeping the effects of the semantic and aesthetic
tasks balanced is developed to optimize the parameters of our framework.
The correlation analysis among the tasks validates the importance of semantic
recognition in aesthetic quality assessment. Extensive experiments verify the
effectiveness of the proposed multi-task framework, and further corroborate the
above proposition.
Keywords: Visual aesthetic quality assessment, multi-task learning, semantic
information
Introduction
Aesthetic image analysis has recently attracted increasing attention in the computer vision
community [6, 9, 18, 25]. It is related to the high-level perception of visual aesthetics,
and is useful in many applications, e.g., image retrieval, photo management, and
photography [5, 11].
The main challenge in automatically assessing the aesthetic quality of images is that
visual aesthetics is a subjective attribute. Many efforts have been made to address this
issue. Data-driven approaches [4, 7, 17, 19, 25–28, 31] are often used to learn from
images whose aesthetic quality is labeled by humans. For aesthetic feature extraction,
most of these approaches treat visual aesthetic quality assessment as a single,
standalone classification task.
Handcrafted features represent the earlier attempts. They are based on intuitions about how
people perceive the aesthetic quality of images or on photographic rules. These features
include color [4, 11, 23], the rule of thirds [4], simplicity [17, 25], and composition [7].
Later, generic image descriptors such as bag-of-visual-words (BOV) [3] and Fisher
vectors (FV) [8] are used to assess aesthetic quality. They are shown to outperform
the traditional handcrafted features [19, 20, 22]. Recently, deep convolutional neural
networks (CNNs) [12, 30] have been applied to aesthetic quality assessment [10, 15].
[Fig. 1 image labels: Aesthetic: High, Semantic: Portraiture; Aesthetic: High, Semantic: Sky, Architecture; Aesthetic: Low, Semantic: Food and Drink; Aesthetic: Low, Semantic: Still Life, Nature.]
Fig. 1: Example images with aesthetic and semantic labels.
Nevertheless, these computational approaches provide results that are either accurate
or interpretable, but rarely both [18].
For human beings, aesthetic quality assessment is always coupled with the
identification of the semantic content of images [14, 21]. It is difficult for humans to treat
aesthetic quality assessment as an isolated and independent task. When humans assess
the aesthetic quality of an image, they first understand what they are assessing; that
is, they already know the semantic information of the image. As seen in Fig. 1, we can
recognize the semantic content of these images at a glance and assess their aesthetic
quality quickly. Hence it is reasonable to assume that aesthetic quality assessment and
semantic recognition are correlated tasks for machine learning. The task of semantic
recognition is potentially helpful to improve the task of automatically assessing visual
aesthetic quality.
This paper employs a multi-task convolutional neural network (MTCNN) to address
the problem of aesthetic quality assessment. Multi-task learning can learn multiple
related tasks in parallel with shared knowledge. It has been demonstrated that this
approach can boost some or all of the tasks [2]. Our goal is to utilize semantic
recognition in the joint objective function to improve the aesthetic quality assessment.
Multi-task learning is suited to our problem. Facing the different learning difficulties
in the two tasks, we present a strategy to keep the effect of both tasks balanced in the
joint objective function. The strategies of treating all tasks equally and early stopping
are often adopted in existing works [2, 29, 32]. To investigate how best to take advantage of
semantic information and how it influences the aesthetic task, we
present four MTCNNs with different structures. In addition, the correlation between the
different tasks is analyzed to explain the factors in aesthetic quality assessment and make
our results more interpretable.
Our contributions are summarized as follows:
– Instead of taking visual aesthetic quality assessment as an isolated task, we propose
to exploit semantic recognition to assess the aesthetic quality jointly with a
multi-task convolutional neural network. It is a novel attempt to learn aesthetic
features with the help of a related task, i.e., semantic recognition.
– Four MTCNNs, including three basic MTCNNs with different structures and an
enhanced MTCNN, are developed to explore different features with the supervision
of aesthetic and semantic labels. The correlation between aesthetic quality assessment and
semantic recognition is analyzed from our MTCNNs, which can explain the factors
in aesthetic quality assessment and makes our results more interpretable.
– The proposed method significantly outperforms the state-of-the-art methods on the
challenging AVA dataset.
The rest of this paper is organized as follows: we summarize related work in Sec. 2,
describe our method in detail in Sec. 3, present the experiments in Sec. 4, and conclude
the paper in Sec. 5.
2 Related work
Since our work is related to aesthetic quality assessment and multi-task learning,
we mainly review work on these two topics in this section.
Aesthetic quality assessment: Most previous works [4, 7, 11, 16, 19, 24] on aesthetic
quality assessment focus on the challenging problem of designing appropriate features.
Typically, handcrafted features are proposed based on the intuitions about human
perception of the aesthetic quality of images or photographic rules. For example, Datta
et al. [4] design certain visual features such as colorfulness, the rule of thirds, and low
depth of field indicators, to discriminate between aesthetically pleasing and displeasing
images. Dhar et al. [7] extract some high level attributes including compositional,
content, and sky-illumination attributes, which are characteristically used by humans
to describe images. Luo et al. [16] and Tang et al. [25] consider that different types
of images may call for different aesthetic criteria and design visual features
in different ways according to the variety of photo content. In [19], generic image
descriptors are used to assess aesthetic quality. It is shown that they can outperform
the traditional handcrafted features.
Despite the success of handcrafted features and generic image descriptors, CNNs
have been applied to aesthetic quality assessment [10, 15] and obtain new state-of-the-art performance. CNNs learn aesthetic features automatically. However, they extract
features by treating aesthetic quality assessment as an independent problem. The best
network in [15], RDCNN, aims to leverage the idea of multi-task learning with
style attributes to help determine the aesthetic quality of images. Unfortunately, due
to many missing labels for the style attributes, they cannot jointly perform aesthetics
categorization and style classification in a single neural network, and instead concatenate the
features of the aesthetics and style by using transfer learning. Our work is also related to
CNNs for aesthetics classification. In contrast, firstly, we exploit semantic information
to assist in learning aesthetic representation with a multi-task learning framework.
Secondly, we can jointly learn aesthetics categorization and semantic recognition with
a single multi-task network, which is different from RDCNN [15]. Finally, in the real
world, images can be labeled with semantic information much more easily than with
style attributes, because only professional photographers and photography enthusiasts
are familiar with all the style attributes.
Multi-task learning: Multi-task learning aims to boost the generalization performance
by learning multiple related tasks simultaneously [1, 2, 13, 32]. Deep neural networks
can learn features jointly under multiple objectives and are therefore well suited to
multi-task learning. Multi-task learning based on deep neural networks has been applied to
many computer vision problems [29, 32]. However, there are many strategies for sharing
knowledge and organizing the learning process. For example, Yim et al. [29] treat all tasks
as equally important. In contrast, an early stopping strategy is used for some related tasks [32],
due to the different learning difficulties and convergence rates of different tasks. In our problem,
because the semantic recognition task is much easier than aesthetic quality assessment,
common features of the two tasks are learned simultaneously and an effective strategy
of keeping the effect of all tasks balanced in the joint objective function is used.
3 Method
In this section, we propose to exploit semantic information to help identify
the aesthetic quality of images, assuming that the two are related
attributes [14, 21]. Our problem is formulated as an MTCNN model and its framework is
illustrated in Fig. 2. To explore the effect of semantic recognition on aesthetic quality
assessment, three basic MTCNNs and an enhanced MTCNN are presented to optimize
the model.
[Fig. 2 diagram: a 227 × 227 × 3 crop of a 256 × 256 input image passes through shared convolutional and fully-connected layers (layers 1–6, with parameters Θ) for common feature representation learning, and then splits into task-specific layer-7 branches with parameters Wa (Task 1: aesthetic output y) and Ws (Task 2: semantic output z). The figure also illustrates transferring the learned parameters from the large-scale dataset to a small dataset.]
Fig. 2: The framework of our multi-task learning and the illustration for the architecture of our MTCNN #1. Color code used: purple = convolutional layer, yellow = fully-connected layer.

3.1 Problem Formulation

Our problem can be interpreted as a probabilistic model. Using the probabilistic
formulation, various deep networks can solve our problem by optimizing the model
parameters that maximize the posterior probability. Then, Bayesian analysis is
leveraged to predict the most likely aesthetic quality and semantic attributes of given
images.
Assume we have a training dataset with a total of N samples, associated
with C aesthetic classes and M semantic attributes. Considering that each image has
only one aesthetic class and multiple semantic attributes in the real world, each image is
represented as $(x_n, y_n, z_n)$, $n = 1, 2, \ldots, N$. Here $x_n$ represents the n-th image sample,
$y_n = c$, $c = 0, \ldots, C-1$ is the aesthetic label and $z_n = [z_n^1, \ldots, z_n^m, \ldots, z_n^M]^T$ is
the semantic label for the n-th image sample. If the n-th image sample has the m-th
semantic attribute, the m-th semantic label is set as $z_n^m = 1$, otherwise $z_n^m = 0$.
Therefore a given dataset is denoted as $(X, Y, Z) = \{(x_n, y_n, z_n), n \in \{1, 2, \ldots, N\}\}$.
For our MTCNNs (our MTCNN #1 is shown in Fig. 2), $\Theta$ denotes the common
parameters in the bottom layers that learn features for all tasks, and $W = [W_a, W_s]$
indicates the task-specific parameters. $W_a$ and $W_s$ represent the parameters for
aesthetic quality assessment and semantic recognition respectively. Each column in
$W_a$ or $W_s$ corresponds to a subtask. The goal is to find the optimal or suboptimal
parameters $\Theta, W, \lambda$ by maximizing the following posterior probability
$$\hat{\Theta}, \hat{W}, \hat{\lambda} = \underset{\Theta, W, \lambda}{\arg\max}\; p(\Theta, W, \lambda \mid X, Y, Z), \qquad (1)$$
where λ is the weight coefficient of the semantic recognition task in the joint learning
process.
Based on the Bayesian theorem, we have
$$p(\Theta, W, \lambda \mid X, Y, Z) = \frac{p(X, Y, Z \mid \Theta, W, \lambda)\, p(\Theta, W, \lambda)}{p(X, Y, Z)} \propto p(X, Y, Z \mid \Theta, W, \lambda)\, p(\Theta, W, \lambda), \qquad (2)$$
where p(X, Y, Z|Θ, W, λ) is the conditional probability, and p(Θ, W, λ) is the prior
probability.
Then Eqn. (1) takes the form
$$\hat{\Theta}, \hat{W}, \hat{\lambda} \propto \underset{\Theta, W, \lambda}{\arg\max}\; p(Y \mid X, \Theta, W_a)\, p(Z \mid X, \Theta, W_s, \lambda)\, p(\Theta)\, p(W)\, p(\lambda). \qquad (3)$$
Each term in Eqn. (3) is defined as:
1) The conditional probability p(Y |X, Θ, Wa ) corresponds to the task of aesthetic
quality assessment. Here assessing aesthetic quality is interpreted as a classification
problem and modeled as a multinomial logistic regression similar to traditional
classification problems [12]. The conditional probability p(Y |X, Θ, Wa ) can be
formulated as
$$p(Y \mid X, \Theta, W_a) = \prod_{n=1}^{N} \sum_{c=1}^{C} 1\{y_n = c\}\, p(y_n = c \mid x_n, \Theta, W_a), \qquad (4)$$
where $1\{\cdot\}$ is the indicator function: $1\{\text{a true statement}\} = 1$ and
$1\{\text{a false statement}\} = 0$. $p(y_n = c \mid x_n, \Theta, W_a)$ is calculated by the softmax
function
$$p(y_n = c \mid x_n, \Theta, W_a) = \frac{\exp\!\big(W_a^{c\,T} (\Theta^T x_n)\big)}{\sum_{l=1}^{C} \exp\!\big(W_a^{l\,T} (\Theta^T x_n)\big)}. \qquad (5)$$
2) The conditional probability p(Z|X, Θ, Ws , λ) corresponds to the semantic
recognition. Since each element of the semantic label of a given image is binary,
$z_n^m \in \{0, 1\}$, the recognition of each semantic attribute can be interpreted as a logistic
regression. Hence the conditional probability $p(Z \mid X, \Theta, W_s, \lambda)$ can be written as
$$p(Z \mid X, \Theta, W_s, \lambda) = \prod_{n=1}^{N} \prod_{m=1}^{M} \Big( p(z_n^m = 1 \mid x_n, \Theta, W_s^m)^{z_n^m} \big(1 - p(z_n^m = 1 \mid x_n, \Theta, W_s^m)\big)^{1 - z_n^m} \Big)^{\lambda}, \qquad (6)$$
where $p(z_n^m = 1 \mid x_n, \Theta, W_s^m)$ is calculated by a sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$.
3) The prior probability $p(\Theta)$ corresponds to the network parameters for the common
features. The parameters $\Theta$ can be initialized from a standard normal distribution as in
previous networks [12]: $p(\Theta) = \prod_{k=1}^{K} p(\theta_k) = \prod_{k=1}^{K} \mathcal{N}(0, I)$, where $0$ is a zero matrix
and $I$ is an identity matrix.
4) Similar to $\Theta$, the parameters $W$ for the specific tasks can also be initialized
from a standard normal distribution. Thus, the prior probability can be $p(W) =
p(W_a)\,p(W_s) = \mathcal{N}_a(0, I)\,\mathcal{N}_s(0, I)$.
5) $\lambda$ is used to control the influence of the semantic recognition task in the final
objective function. The prior probability $p(\lambda)$ is implemented by assuming $\lambda$ obeys
a normal distribution, $p(\lambda) = \mathcal{N}(\mu, \sigma^2)$.
Then Eqns. (4), (5) and (6) are substituted into Eqn. (3), the negative log is
taken, and the constant terms are omitted. As a result, the objective function
becomes
$$\underset{\Theta, W, \lambda}{\arg\min}\Big\{ -\sum_{n=1}^{N}\sum_{c=1}^{C} 1\{y_n = c\} \log \frac{\exp\!\big(W_a^{c\,T}(\Theta^T x_n)\big)}{\sum_{l=1}^{C}\exp\!\big(W_a^{l\,T}(\Theta^T x_n)\big)} - \lambda \sum_{n=1}^{N}\sum_{m=1}^{M} \Big( z_n^m \log \sigma\!\big(W_s^{m\,T}(\Theta^T x_n)\big) + (1 - z_n^m)\log\big(1 - \sigma\!\big(W_s^{m\,T}(\Theta^T x_n)\big)\big) \Big) + \Theta^T\Theta + W^T W + (\lambda - \mu)^2 \Big\}. \qquad (7)$$
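To make the optimization concrete, the following is a minimal sketch of the two data terms of Eqn. (7) in a PyTorch-style formulation (an assumption on our part; the paper does not specify the implementation framework, and the function and argument names are illustrative). The regularization terms $\Theta^T\Theta + W^T W$ would typically be realized as weight decay in the optimizer, and $(\lambda - \mu)^2$ disappears once $\lambda$ is fixed as in Sec. 3.2.

```python
import torch
import torch.nn.functional as F

def joint_loss(aesthetic_logits, semantic_logits, y, z, lam):
    """Data terms of the multi-task objective in Eqn. (7).

    aesthetic_logits: (N, C) scores W_a^T(Theta^T x) for the aesthetic classes.
    semantic_logits:  (N, M) scores W_s^T(Theta^T x) for the semantic attributes.
    y: (N,) integer aesthetic labels; z: (N, M) binary semantic labels.
    lam: weight of the semantic task (Sec. 3.2 fixes lam = 1/M).
    """
    # First term: softmax (multinomial logistic) loss for aesthetic classification.
    aesthetic_loss = F.cross_entropy(aesthetic_logits, y, reduction='sum')
    # Second term: cross-entropy over the M sigmoid outputs for semantic recognition.
    semantic_loss = F.binary_cross_entropy_with_logits(
        semantic_logits, z.float(), reduction='sum')
    return aesthetic_loss + lam * semantic_loss
```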
3.2 Optimization Procedure
The multi-task objective function in Eqn. (7) can be optimized by a network through
stochastic gradient descent (SGD) [12]. Here the MTCNN is applied to search for the
optimal parameters Θ, W, λ. One architecture of our MTCNNs is shown in Fig. 2. First, all
tasks share knowledge in the bottom layers. Then specific features are learned for each task
in the top layers. Finally, the combination of the softmax loss function for aesthetic quality
prediction (the first term in Eqn. (7)) and the cross-entropy loss function for semantic
recognition (the second term in Eqn. (7)) is employed to update the parameters of the
network jointly.
Traditionally, multiple tasks are treated as equally important in the back propagation of
multi-task learning [2, 29], assuming that they reach their best performance at roughly
the same time. However, different tasks may have different learning difficulties and
convergence rates. Caruana [2] proposes to control the effect of different tasks by
adjusting the learning weight on each output task, and also puts forward strategies
such as early stopping for this problem. The early stopping strategy has been used in
some works [32] and achieves good performance. Nevertheless, this strategy is
not suited to our problem, because the extra task, semantic recognition, is much
easier and converges more rapidly than the main task, aesthetic quality assessment.
Our experimental results (details in Sec. 4) show that, when the converged semantic
recognition task is stopped early, the training error of the aesthetic task continues to
decrease only very slowly and does not drop obviously. We believe this is mainly because
aesthetics is subjective and needs the help of the semantic task throughout the entire
training process. Hence, we present a simple strategy to keep the effect of all tasks
balanced in back propagation. Because the softmax loss function only considers the value
corresponding to the ground-truth label for each example, whereas the semantic loss
sums over all M attributes, λ = 1/M is fixed in the objective function for the entire
training process.
[Fig. 3 diagram: four architectures (MTCNN #1, MTCNN #2, MTCNN #3, and the Enhanced MTCNN), each taking the same input and branching into aesthetic (Wa) and semantic (Ws) outputs at different depths; the Enhanced MTCNN adds an extra aesthetic branch Wa' on top of the shallow shared layers Θ1, with the deeper shared layers denoted Θ2.]
Fig. 3: Explored MTCNNs with different architectures. The details of MTCNN #1 are
illustrated in Fig. 2. Color code used: purple = convolutional layer + max pooling, grey
= convolutional layer, yellow = fully-connected layer.
3.3 MTCNN Implementation
To implement the multi-task model, we investigate several multi-task network
architectures to utilize semantic information for visual aesthetic quality assessment.
These networks are explained in Fig. 3. The supervision of aesthetic and semantic
labels can be in the same or different layers in the network. Here we propose three
basic network architectures and an enhanced network. For all networks, the input is a
227 × 227 × 3 patch randomly extracted from a resized 256 × 256 × 3 image, as in previous
work [15].
MTCNN #1: Since our goal is to discover the effective features for aesthetic
assessment with the help of semantic information, a simple idea is to learn all
parameters for aesthetic representations with aesthetic and semantic supervision in a
network. MTCNN #1 implements this idea. The architecture of MTCNN #1 (in Fig. 3)
is detailed in Fig. 2. The network contains four convolutional layers and two fully-connected
layers with parameters Θ for common feature learning. The parameters
W = [Wa, Ws] from layer 6 to layer 7 are learned separately for each task. Then, the
softmax loss function is adopted for aesthetic quality prediction, and the cross entropy
loss function for semantic recognition. The combination of the two loss functions is
employed to jointly update the parameters of the network.
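As an illustration, the sketch below instantiates the shared-trunk, two-head structure of MTCNN #1 in PyTorch (an assumption on our part; the paper does not release code). The filter counts and kernel sizes follow the numbers visible in Fig. 2 where possible, but the padding, normalization, and exact feature-map sizes are illustrative guesses; training would combine the two heads with the joint loss sketched after Eqn. (7).

```python
import torch
import torch.nn as nn

class MTCNN1(nn.Module):
    """Shared trunk (Theta) with two task-specific heads (W_a, W_s), in the spirit of MTCNN #1."""

    def __init__(self, num_aesthetic_classes=2, num_semantic_attrs=29):
        super().__init__()
        # Shared parameters Theta: four convolutional layers ...
        self.shared_conv = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
        )
        # ... and two 4096-unit fully-connected layers (layers 5 and 6).
        self.shared_fc = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        )
        # Task-specific parameters W_a and W_s (layer 7 in Fig. 2).
        self.aesthetic_head = nn.Linear(4096, num_aesthetic_classes)
        self.semantic_head = nn.Linear(4096, num_semantic_attrs)

    def forward(self, x):                       # x: (N, 3, 227, 227)
        h = self.shared_fc(self.shared_conv(x))
        return self.aesthetic_head(h), self.semantic_head(h)
```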
MTCNN #2: To explore different structures for aesthetic features learning, we
introduce MTCNN #2 (shown in Fig. 3) to allow some top layers to learn aesthetic
representations independently without semantic supervision. Similar to MTCNN #1, the
network #2 contains four convolutional layers with parameters Θ for common feature
learning. Different from the architecture #1, layers 5, 6 and 7 in the network #2 learn
parameters W = [Wa , Ws ] separately for the two tasks. The loss functions are also the
same as in architecture #1.
MTCNN #3: Since CNNs can learn hierarchical features, we consider the low-level
features of a network for our main task in the MTCNN #3 (shown in Fig. 3). In this
network, four convolutional layers and three fully-connected layers are designed for
semantic recognition, while two convolutional layers and two fully-connected layers are designed for
aesthetic quality assessment. The two tasks share knowledge Θ in the two convolutional
layers. The other layers are used to learn specific parameters W = [Wa , Ws ] for each
task. The loss functions are also the same as in architecture #1.
Enhanced MTCNN: To further explore the effective aesthetic features, we propose
an enhanced MTCNN by combining MTCNN #1 and MTCNN #3. That is, we add
extra aesthetic supervision in the first two layers in MTCNN #1. Shown in Fig. 3, the
common parameters Θ1 in the first and second convolutional layers are learned for all
three tasks, the common parameters Θ2 in the other two convolutional layers and the two
fully-connected layers are learned for two tasks, and the specific parameters W = [Wa, Wa', Ws]
are learned separately in the top layers. Our goal is to enhance the supervision of the aesthetic
labels in the first and second convolutional layers under the premise of ensuring the
influence of semantic information in the whole network. Here we denote Θ = [Θ1, Θ2]. The
objective function in Eqn. (7) is transformed to
$$\underset{\Theta, W, \lambda}{\arg\min}\Big\{ -\sum_{n=1}^{N}\sum_{c=1}^{C} 1\{y_n = c\} \log \frac{\exp\!\big(W_a^{c\,T}(\Theta^T x_n)\big)}{\sum_{l=1}^{C}\exp\!\big(W_a^{l\,T}(\Theta^T x_n)\big)} - \sum_{n=1}^{N}\sum_{c=1}^{C} 1\{y_n = c\} \log \frac{\exp\!\big(W_a'^{\,c\,T}(\Theta_1^T x_n)\big)}{\sum_{l=1}^{C}\exp\!\big(W_a'^{\,l\,T}(\Theta_1^T x_n)\big)} - \lambda \sum_{n=1}^{N}\sum_{m=1}^{M} \Big( z_n^m \log \sigma\!\big(W_s^{m\,T}(\Theta^T x_n)\big) + (1 - z_n^m)\log\big(1 - \sigma\!\big(W_s^{m\,T}(\Theta^T x_n)\big)\big) \Big) + \Theta^T\Theta + W^T W + (\lambda - \mu)^2 \Big\}, \qquad (8)$$
where the first term in Eqn. (8) is our main task, and the second term is the added task.
We fix λ = 2/M based on our strategy for the enhanced MTCNN.
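Under the same assumptions as the earlier loss sketch (PyTorch-style, illustrative names), the enhanced objective of Eqn. (8) simply adds a second softmax term for the auxiliary aesthetic branch attached to the shallow features Θ1:

```python
import torch.nn.functional as F

def enhanced_joint_loss(aes_logits, aux_aes_logits, sem_logits, y, z, lam):
    # Main aesthetic term on the full shared trunk (Theta), an auxiliary aesthetic
    # term on the shallow features (Theta_1), and the lam-weighted semantic term.
    main_aes = F.cross_entropy(aes_logits, y, reduction='sum')
    aux_aes = F.cross_entropy(aux_aes_logits, y, reduction='sum')
    semantic = F.binary_cross_entropy_with_logits(sem_logits, z.float(), reduction='sum')
    return main_aes + aux_aes + lam * semantic   # here lam is fixed to 2/M
```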
4 Experiments
In this section, we evaluate the proposed method on the challenging large-scale AVA
dataset. Experimental results show the benefits of semantic information and the
effectiveness of the proposed method.
4.1 AVA Dataset
The AVA dataset [22] is one of the largest and most challenging datasets for visual
aesthetic quality assessment. It contains more than 255,000 images gathered from
www.dpchallenge.com. Each image was scored by about 200 voters on an aesthetic scale
from one to ten. In addition, each image contains 0, 1 or 2 semantic tags (attributes).
We select the 185,751 images used in this paper based on two rules: 1) more than 3,000
images are available for each tag; 2) each image contains at least one tag. Eventually
29 semantic tags are chosen. From the 185,751 images, 20,000 images are selected
randomly as the testing set, similar to [15], and the remaining 165,751 images form the
training set. For the aesthetic labels, we follow the experimental setup of [15, 22]: the
training set is divided into two classes, high quality and low quality images. We designate
the images with an average score larger than 5 + δ as high quality images and those with
an average score smaller than 5 − δ as low quality images. Images with an average score
between 5 − δ and 5 + δ are discarded. We set δ to 0 and 1 respectively for the training
set to obtain the ground truth labels. There are 165,751 images in the training set when
δ = 0 and 38,994 images when δ = 1. We set δ to 0 for the testing set regardless of the
value of δ used for the training set. For the semantic labels, each image is labeled with a
29-dimensional binary vector.
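To make the labeling rule concrete, here is a minimal sketch (NumPy; the function and variable names are illustrative, not from the paper) of how the δ threshold converts mean AVA scores into binary aesthetic labels:

```python
import numpy as np

def make_aesthetic_labels(mean_scores, delta):
    """Binary aesthetic labels from mean AVA scores, following the delta rule."""
    mean_scores = np.asarray(mean_scores, dtype=float)
    high = mean_scores > 5 + delta          # high quality images
    low = mean_scores < 5 - delta           # low quality images
    keep_mask = high | low                  # images in the ambiguous band are discarded
    labels = high.astype(int)[keep_mask]    # 1 = high quality, 0 = low quality
    return labels, keep_mask

# Example: delta = 1 keeps only clearly high or low quality training images.
labels, keep = make_aesthetic_labels([6.3, 5.4, 3.8, 5.0], delta=1)  # labels -> [1, 0]
```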
Table 1: Accuracy (%) of our MTCNN #1 with different λ on the AVA dataset.

δ    λ = 0    λ = 1/29    λ = 2/29    λ = 1    with early stopping
0    72.19    76.15       75.76       73.54    73.43
1    75.13    75.90       75.82       73.12    74.28

4.2 Evaluating the Effectiveness of Keeping Balance Strategy
In the objective function, λ is used to control the contributions from semantic
information. To validate our strategy of keeping the influence of the two tasks balanced,
we implement our MTCNN #1 with our strategy λ = 1/M (here λ = 1/29), and we
also compare the experimental results of MTCNN #1 with λ = 0, λ = 2/29, λ = 1 and
the early stopping strategy (shown in Table 1). Comparing the results with and without the
supervision of semantic labels, MTCNN #1 with λ ≠ 0 performs better than that
with λ = 0, which indicates that the supervision is effective. What's more, the results shown
in Table 1 demonstrate that our strategy λ = 1/29 performs best for both values of δ.
When λ = 1/29, the aesthetic and semantic tasks have the same effect on the back
propagation process. Therefore the effectiveness of our strategy is verified.
To further demonstrate the effectiveness of our MTCNN with our strategy, we also
analyze the accuracy on each semantic tag using MTCNN #1 with different settings of
λ in Fig. 4. As shown, our MTCNN #1 with λ = 1/29 performs best overall
and on most semantic tags. We also observe that the same method achieves different
results on different semantic tags, and the improvements brought by the MTCNNs also
vary across semantic tags. For example, the semantic tags “Family” and
“Snapshot” obtain a great improvement with the different methods.
Fig. 4: Accuracy on each semantic tag using MTCNN #1 with different λ when δ = 0
and δ = 1.
4.3 Evaluating the Benefits of Semantic Information
To evaluate our MTCNNs with the help of semantic information for aesthetic
classification, we compare our results (fixing λ = 1/29 for the three basic MTCNNs and
λ = 2/29 for the enhanced MTCNN) with those of our single-task CNN (STCNN, i.e.,
MTCNN #1 with λ = 0) on the AVA dataset for both values of δ. As shown in Table 2, all
four MTCNNs perform better than our STCNN, especially when δ = 0. Aesthetic
quality classification with δ = 0 is more challenging than that with δ = 1 [22]. These
results demonstrate the effectiveness of semantic information.
Table 2: Accuracy (%) of different methods on the AVA dataset.

δ    Our STCNN    MTCNN #1    MTCNN #2    MTCNN #3    Enhanced MTCNN    [22]    SCNN [15]    DCNN [15]    RDCNN [15]
0    72.19        76.15       75.91       75.92       76.58             66.7    71.20        73.25        74.46
1    75.13        75.90       75.81       75.37       76.04             67.0    68.63        73.05        73.70
In addition, we analyze the results of the four MTCNNs to investigate how best to
take advantage of semantic information and how it influences the aesthetic task. We can
see that the more supervision the semantic labels provide for aesthetic feature learning,
the better performance our MTCNN achieves. It also reveals
that the low-level features of MTCNN #3 can still perform well. Therefore, under
the premise of ensuring the effect of semantic information in the whole network, we
enhance the aesthetic supervision in the two bottom layers. Experimental results show
that our enhanced MTCNN for the main task performs best.
[Fig. 5 panels: (a) δ = 0, STCNN (MTCNN, λ = 0); (b) δ = 0, MTCNN, λ = 2/29; (c) δ = 1, STCNN (MTCNN, λ = 0); (d) δ = 1, MTCNN, λ = 2/29.]
Fig. 5: Learned filters in the first convolutional layer with STCNN for the aesthetic task
only and MTCNN #1 for the two tasks, with both δ = 0 and δ = 1.
To qualitatively demonstrate the benefits of our MTCNN with semantic information,
we show in Fig. 5 the learned filters in the first convolutional layer with an STCNN for
the aesthetic task only and our MTCNN #1, with both δ = 0 and δ = 1. Compared to
the filters learned without semantic information, the filters with semantic information
are smoother, cleaner and more understandable. The proposed MTCNN can learn more
color and high frequency edge information than STCNN. These differences can also
be observed from the examples of test images correctly classified by MTCNN but
misclassified by STCNN in Fig. 6. The high quality images often have more vivid colors
and clearer edges than the low quality images. Most of the low quality images in Fig. 6
are blurred and dull. This indicates that the supervision of semantic labels for aesthetic
feature learning is very beneficial, and aesthetic and semantic tasks are related to some
extent.
Fig. 6: Example test images correctly classified by MTCNN but incorrectly by STCNN.
The labels of the images on the first and second rows are high aesthetic quality, and the
labels of the images on the third and fourth rows are low aesthetic quality.
4.4 Comparison with Other State-of-the-art Methods
To further validate our MTCNNs with semantic information for aesthetic classification,
we compare our results with those of the state-of-the-art methods in [15, 22] on the AVA
dataset. As shown in Table 2, all four MTCNNs perform better than the method in [22],
SCNN [15], DCNN [15] and RDCNN [15] for both values of δ. The method in [22] is
the baseline of the AVA dataset and is implemented by extracting SIFT-based [19] features
and using an SVM classifier. SCNN is a single-column CNN, DCNN is a double-column CNN
with two inputs consisting of a global view and a local view, and RDCNN is a double-column
CNN with an aesthetic column and a style column. Thus, the results in Table 2 illustrate
the effectiveness of our MTCNNs with the semantic recognition task.
Furthermore, we also train a separate model for each semantic label to assess
aesthetic quality. Due to the different numbers of images for different semantic labels, we
only train four CNNs separately, for “Landscape”, “Nature”, “Still Life” and “Black
and White”; these four labels have the most images among the 29 labels. Here we
call the CNNs trained separately for the four semantic labels “respective CNNs”. For
example, the respective CNN for “Landscape” is trained only with “Landscape” images
for aesthetic categorization. Figure 7 shows the results with different methods for
aesthetic classification on “Landscape”, “Nature”, “Still Life” and “Black and White”
separately, with both values of δ. As shown in Fig. 7, all the MTCNNs outperform the
respective CNN on each semantic label, which also demonstrates the effectiveness of
the proposed MTCNNs. Moreover, MTCNNs do not need to know the semantic labels
of test images, while the respective CNNs have to know them.
Fig. 7: The accuracy with different methods for aesthetic classification on “Landscape”,
“Nature”, “Still Life” and “Black and White” separately, with both δ = 0 and δ = 1.
4.5 Inter-Task Correlation Analysis
To demonstrate the effectiveness of MTCNNs with semantic information and to further
investigate how semantic information influences the aesthetic task, we analyze the
correlation between the two tasks. Since each column vector of the task-specific matrix
W = [Wa, Ws] in the network corresponds to the parameters of a subtask, we calculate
the correlation coefficient between any two column vectors of the weight matrix W as
the correlation between the corresponding subtasks. As shown in layer 7 of Fig. 2, in
our problem the aesthetic classification task has two subtasks, high aesthetic and low
aesthetic, and the semantic recognition task has 29 subtasks. Figure 8 presents the
correlation between any two subtasks learned by MTCNN #1 with δ = 0, which also
verifies that semantic information is beneficial for aesthetic estimation. As seen in Fig. 8,
the low aesthetic subtask has a high negative correlation with the high aesthetic subtask.
We can also see that the aesthetic subtasks have high correlations with certain semantic
attributes. For instance, recognition of the semantic tags “Snapshot” and “Candid” has a
high positive correlation with the low aesthetic subtask; in the real world, most “Snapshot”
and “Candid” images are usually regarded as low aesthetic quality images. In contrast,
“Advertisement” and “Seascapes” recognition has a positive correlation with the high
aesthetic subtask, which accords with the observation that most “Seascapes” and
“Advertisement” images are usually taken as high aesthetic quality images. In addition,
Fig. 8 also visualizes the correlations among the different semantic tag recognition subtasks.
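As a concrete illustration, the correlation analysis can be reproduced from the learned layer-7 weight matrix with a few lines of NumPy (a sketch with illustrative names; the paper does not specify the exact implementation):

```python
import numpy as np

def subtask_correlations(W_a, W_s):
    """Pearson correlations between subtask weight vectors, as in Sec. 4.5.

    W_a: (D, 2) columns for the low/high aesthetic subtasks.
    W_s: (D, M) columns for the M semantic attribute subtasks.
    Returns a (2 + M) x (2 + M) correlation matrix.
    """
    W = np.concatenate([W_a, W_s], axis=1)   # W = [W_a, W_s], one column per subtask
    return np.corrcoef(W, rowvar=False)      # correlate columns, i.e. subtasks
```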
[Fig. 8: a 31 × 31 matrix of correlation coefficients between the two aesthetic subtasks (low, high) and the 29 semantic tag subtasks.]
Fig. 8: Correlation between any two subtasks of aesthetic quality classification and semantic
recognition learned by MTCNN #1 with δ = 0.
5 Conclusion
In this paper, we have employed semantic information to help discover representations
for aesthetic quality assessment by formulating a multi-task deep learning framework.
Aesthetic quality assessment is no longer treated as an isolated problem. To make full
use of the semantic information and to investigate how it influences the aesthetic task,
four MTCNNs have been developed to learn the aesthetic representation jointly under
the supervision of aesthetic and semantic labels. At the same time, a strategy of keeping
the effect of the two tasks balanced has been presented to optimize the parameters of
our MTCNNs. In addition, the correlations between the two tasks have been analyzed to
investigate the role of semantic recognition in aesthetic quality assessment. Experimental
results have shown that our MTCNNs perform better than the state-of-the-art methods.
It is demonstrated that semantic information is beneficial to aesthetic feature learning
and that the high-level features in the network play an important role in aesthetic quality
assessment. In the future, we will explore other useful factors to help improve aesthetic
quality assessment.
References
1. Abdulnabi, A.H., Wang, G., Lu, J., Jia, K.: Multi-task cnn model for attribute prediction.
IEEE TMM 17(11), 1949–1959 (2015)
2. Caruana, R.: Multitask learning. Machine learning 28(1), 41–75 (1997)
3. Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags
of keypoints. In: Workshop on statistical learning in computer vision, ECCV. vol. 1, pp. 1–2.
Prague (2004)
4. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a
computational approach. In: ECCV. pp. 288–301 (2006)
5. Datta, R., Li, J., Wang, J.Z.: Learning the consensus on visual quality for next-generation
image management. In: ACM MM. pp. 533–536 (2007)
6. Datta, R., Li, J., Wang, J.Z.: Algorithmic inferencing of aesthetics and emotion in natural
images: An exposition. In: ICIP. pp. 105–108 (2008)
7. Dhar, S., Ordonez, V., Berg, T.L.: High level describable attributes for predicting aesthetics
and interestingness. In: CVPR. pp. 1657–1664 (2011)
8. Jaakkola, T.S., Haussler, D., et al.: Exploiting generative models in discriminative classifiers.
NIPS pp. 487–493 (1999)
9. Joshi, D., Datta, R., Fedorovskaya, E., Luong, Q.T., Wang, J.Z., Li, J., Luo, J.: Aesthetics
and emotions in images. IEEE Signal Processing Magazine 28(5), 94–115 (2011)
10. Kao, Y., Wang, C., Huang, K.: Visual aesthetic quality assessment with a regression model.
In: ICIP. pp. 1583 – 1587 (2015)
11. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In:
CVPR. pp. 419–426 (2006)
12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. In: NIPS. pp. 1097–1105 (2012)
13. Liu, W., Mei, T., Zhang, Y., Che, C., Luo, J.: Multi-task deep visual-semantic embedding for
video thumbnail selection. In: CVPR. pp. 3707–3715 (2015)
14. Locher, P.J.: The aesthetic experience with visual art at first glance. In: Investigations Into
the Phenomenology and the Ontology of the Work of Art, pp. 75–88. Springer (2015)
15. Lu, X., Lin, Z., Jin, H., Yang, J., Wang, J.Z.: Rapid: Rating pictorial aesthetics using deep
learning. In: ACM MM. pp. 457–466 (2014)
16. Luo, W., Wang, X., Tang, X.: Content-based photo quality assessment. In: ICCV. pp. 2206–
2213 (2011)
17. Luo, Y., Tang, X.: Photo and video quality evaluation: Focusing on the subject. In: ECCV.
pp. 386–399 (2008)
18. Marchesotti, L., Murray, N., Perronnin, F.: Discovering beautiful attributes for aesthetic
image analysis. IJCV 113(3), 246–266 (2015)
19. Marchesotti, L., Perronnin, F., Larlus, D., Csurka, G.: Assessing the aesthetic quality of
photographs using generic image descriptors. In: ICCV. pp. 1784–1791 (2011)
20. Marchesotti, L., Perronnin, F., Meylan, F.: Learning beautiful (and ugly) attributes. In:
BMVC. vol. 7, pp. 1–11 (2013)
21. Mullin, C., Hayn-Leichsenring, G., Wagemans, J.: There is beauty in gist: An investigation
of aesthetic perception in rapidly presented scenes. Journal of vision 15(12), 123–123 (2015)
22. Murray, N., Marchesotti, L., Perronnin, F.: Ava: A large-scale database for aesthetic visual
analysis. In: CVPR. pp. 2408–2415 (2012)
23. Nishiyama, M., Okabe, T., Sato, I., Sato, Y.: Aesthetic quality classification of photographs
based on color harmony. In: CVPR. pp. 33–40 (2011)
24. Niu, Y., Liu, F.: What makes a professional video? a computational aesthetics approach.
IEEE Transactions on Circuits and Systems for Video Technology 22(7), 1037–1049 (2012)
25. Tang, X., Luo, W., Wang, X.: Content-based photo quality assessment. IEEE TMM 15(8),
1930–1943 (2013)
26. Wang, Y., Dai, Q., Feng, R., Jiang, Y.G.: Beauty is here: Evaluating aesthetics in videos using
multimodal features and free training data. In: ACM MM. pp. 369–372 (2013)
27. Wu, O., Hu, W., Gao, J.: Learning to predict the perceived visual quality of photos. In: ICCV.
pp. 225–232 (2011)
28. Yeh, H.H., Yang, C.Y., Lee, M.S., Chen, C.S.: Video aesthetic quality assessment by
temporal integration of photo-and motion-based features. IEEE TMM 15(8), 1944–1957
(2013)
29. Yim, J., Jung, H., Yoo, B., Choi, C., Park, D., Kim, J.: Rotating your face using multi-task
deep neural network. In: CVPR. pp. 676 – 684 (2015)
30. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV.
pp. 818–833 (2014)
31. Zhang, L., Gao, Y., Zimmermann, R., Tian, Q., Li, X.: Fusion of multichannel local and
global structural cues for photo aesthetics evaluation. IEEE TIP 23(3), 1419–1429 (2014)
32. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task
learning. In: ECCV, pp. 94–108 (2014)