VLDBJ manuscript No.
(will be inserted by the editor)
Effective Deep Learning Based Multi-Modal Retrieval
Wei Wang, Xiaoyan Yang, Beng Chin Ooi, Dongxiang Zhang, Yueting Zhuang
the date of receipt and acceptance should be inserted later
Abstract Multi-modal retrieval is emerging as a new search
paradigm that enables seamless information retrieval from
various types of media. For example, users can simply snap
a movie poster to search for relevant reviews and trailers.
The mainstream solution to the problem is to learn a set of
mapping functions that project data from different modalities into a common metric space in which conventional indexing schemes for high-dimensional space can be applied.
Since the effectiveness of the mapping functions plays an
essential role in improving search quality, in this paper, we
exploit deep learning techniques to learn effective mapping
functions. In particular, we first propose a general learning objective that captures both intra-modal and inter-modal
semantic relationships of data from heterogeneous sources.
Then, we propose two learning algorithms based on the general objective: (1) an unsupervised approach that uses stacked
auto-encoders (SAEs) and requires minimum prior knowledge on the training data, and (2) a supervised approach using deep convolutional neural network (DCNN) and neural
Wei Wang
School of Computing, National University of Singapore, Singapore.
E-mail: [email protected]
Xiaoyan Yang
Advanced Digital Sciences Center, Illinois at Singapore Pte, Singapore.
E-mail: [email protected]
Beng Chin Ooi
School of Computing, National University of Singapore, Singapore.
E-mail: [email protected]
Dongxiang Zhang
School of Computing, National University of Singapore, Singapore.
E-mail: [email protected]
Yueting Zhuang
College of Computer Science and Technology, Zhejiang University,
Hangzhou, China
E-mail: [email protected]
language model (NLM). Our training algorithms are memory efficient with respect to the data volume. Given a large
training dataset, we split it into mini-batches and adjust the
mapping functions continuously for each batch. Experimental results on three real datasets demonstrate that our proposed methods achieve significant improvement in search
accuracy over the state-of-the-art solutions.
1 Introduction
The prevalence of social networking has significantly increased the volume and velocity of information shared on
the Internet. A tremendous amount of data in various media types is being generated every day in social networking
systems. For instance, Twitter recently reported that over 340 million tweets were sent each day¹, while Facebook reported that around 300 million photos were created each day². These data, together with other domain-specific data,
such as medical data, surveillance and sensory data, are big
data that can be exploited for insights and contextual observations. However, effective retrieval of such huge amounts
of media from heterogeneous sources remains a big challenge.
In this paper, we exploit deep learning techniques, which
have been successfully applied in processing media data [33,
29, 3], to solve the problem of large-scale information retrieval from multiple modalities. Each modality represents
one type of media such as text, image or video. Depending
on the heterogeneity of data sources, we have two types of searches:
1. Intra-modal search has been extensively studied and
widely used in commercial systems. Examples include
web document retrieval via keyword queries and content-based image retrieval.
2. Cross-modal search enables users to explore relevant
resources from different modalities. For example, a user
can use a tweet to retrieve relevant photos and videos
from other heterogeneous data sources. Meanwhile, he can search for relevant textual descriptions or videos by submitting an interesting image as a query.
There has been a long stream of research on multi-modal
retrieval [45, 4, 44, 36, 30, 21, 43, 25]. These works follow
the same query processing strategy, which consists of two
major steps. First, a set of mapping functions are learned to
project data from different modalities into a common latent
space. Second, a multi-dimensional index for each modality
in the common space is built for efficient similarity retrieval.
Since the second step is a classic kNN problem and has been
extensively studied [16, 40, 42], we focus on the optimization of the first step and propose two types of novel mapping
functions based on deep learning techniques.
We propose a general learning objective that effectively
captures both intra-modal and inter-modal semantic relationships of data from heterogeneous sources. In particular, we
differentiate modalities in terms of their representations’ ability to capture semantic information and robustness when
noisy data are involved. The modalities with better representations are assigned with higher weight for the sake of
learning more effective mapping functions. Based on the objective function, we design an unsupervised algorithm using stacked auto-encoders (SAEs). SAE is a deep learning
model that has been widely applied in many unsupervised
feature learning and classification tasks [31, 38, 13, 34]. If
the media are annotated with semantic labels, we design a
supervised algorithm to realize the learning objective. The
supervised approach uses a deep convolutional neural network (DCNN) and neural language model (NLM). It exploits the label information, thus can learn robust mapping
functions against noisy input data. DCNN and NLM have
shown great success in learning image features [20, 10, 8]
and text features [33, 28] respectively.
Compared with existing solutions for multi-modal retrieval, our approaches exhibit three major advantages. First,
our mapping functions are non-linear and are more expressive than the linear projections used in IMH [36] and CVH [21].
The deep structures of our models can capture more abstract
concepts at higher layers, which is very useful in modeling categorical information of data for effective retrieval.
Second, we require minimum prior knowledge in the training. Our unsupervised approach only needs relevant data
pairs from different modalities as the training input. The
supervised approach requires additional labels for the media objects. In contrast, MLBE [43] and IMH [36] require a
big similarity matrix of intra-modal data for each modality.
LSCMR [25] uses training examples, each of which con-
sists of a list of objects ranked according to their relevance
(based on manual labels) to the first object. Third, our training process is memory efficient because we split the training dataset into mini-batches and iteratively load and train
each mini-batch in memory. In contrast, many existing works (e.g., CVH, IMH) have to load the whole training dataset into memory, which is infeasible when the training dataset is too large.
In summary, the main contributions of this paper are:
– We propose a general learning objective for learning mapping functions to project data from different modalities
into a common latent space for multi-modal retrieval.
The learning objective differentiates modalities in terms of how well their input features capture semantics.
– We realize the general learning objective by one unsupervised approach and one supervised approach based
on deep learning techniques.
– We conduct extensive experiments on three real datasets
to evaluate our proposed mapping mechanisms. Experimental results show that the performance of our method
is superior to state-of-the-art methods.
The remainder of the paper is organized as follows. Problem statements and overview are provided in Section 2 and
Section 3. After that, we describe the unsupervised and supervised approaches in Section 4 and Section 5 respectively.
Query processing is presented in Section 6. We discuss related works in Section 7 and present our experimental study
in Section 8. We conclude our paper in Section 9. This work
is an extended version of [39] which proposed an unsupervised learning algorithm for multi-modal retrieval. We add a
supervised learning algorithm as a new contribution in this
work. We also revise Section 3 to unify the learning objective of the two approaches. Section 5 and Section 8.3 are
newly added for describing the supervised approach and its
experimental results.
2 Problem Statements
In our data model, the database D consists of objects from
multiple modalities. For ease of presentation, we use images and text as two sample modalities to explain our idea, i.e., we assume that D = DI ∪ DT. To conduct multi-modal
retrieval, we need a relevance measurement for the query
and the database object. However, since the database consists of objects from different modalities, there is no widely accepted measurement. A common approach is to learn a
set of mapping functions that project the original feature
vectors into a common latent space such that semantically relevant objects (e.g., an image and its tags) are located close to each other.
Consequently, our problem includes the following two subproblems.
Fig. 1: Flowchart of multi-modal retrieval framework. Step
1 is offline model training that learns mapping functions.
Step 2 is offline indexing that maps source objects into latent
features and creates proper indexes. Step 3 is online multimodal kNN query processing.
Definition 1 Common Latent Space Mapping
Given an image x ∈ DI and a text document y ∈ DT , find
two mapping functions fI : DI → Z and fT : DT → Z
such that if x and y are semantically relevant, the distance
between fI (x) and fT (y) in the common latent space Z,
denoted by distZ (fI (x), fT (y)), is small.
The common latent space mapping provides a unified
approach to measuring distance of objects from different
modalities. As long as all objects can be mapped into the
same latent space, they become comparable. Once the mapping functions fI and fT have been determined, the multimodal search can then be transformed into the classic kNN
problem, defined as follows:
Definition 2 Multi-Modal Search
Given a query object Q ∈ Dq and a target domain Dt (q, t ∈
{I, T }), find a set O ⊂ Dt with k objects such that ∀o ∈ O
and o′ ∈ Dt \ O, distZ(fq(Q), ft(o′)) ≥ distZ(fq(Q), ft(o)).
Since both q and t have two choices, four types of queries can be derived, namely Qq→t with q, t ∈ {I, T}. For instance, QI→T searches for relevant text in DT given an image
from DI . By mapping objects from different high-dimensional
feature spaces into a low-dimensional latent space, queries
can be efficiently processed using existing multi-dimensional
indexes [16, 40]. Our goal is then to learn a set of effective
mapping functions which preserve well both intra-modal semantics (i.e., semantic relationships within each modality)
and inter-modal semantics (i.e., semantic relationships across
modalities) in the latent space. The effectiveness of mapping functions is measured by the accuracy of multi-modal
retrieval using latent features.
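Once the mapping functions are learned, Definition 2 reduces multi-modal search to a classic kNN problem in the latent space Z. The following minimal sketch illustrates this with a brute-force scan over toy vectors; the actual system would use the multi-dimensional indexes cited above, and all data values here are illustrative only.

```python
import numpy as np

def knn_search(query_vec, target_latent, k=5):
    """Brute-force kNN in the common latent space Z.

    query_vec:     latent feature f_q(Q) of the query, shape (d,)
    target_latent: latent features f_t(o) of all target-modality objects, shape (n, d)
    Returns indices of the k nearest objects by Euclidean distance.
    """
    dists = np.linalg.norm(target_latent - query_vec, axis=1)
    return np.argsort(dists)[:k]

# Toy example: four "text" objects already mapped into a 3-d latent space.
text_latent = np.array([[0.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0],
                        [0.0, 2.0, 0.0],
                        [5.0, 5.0, 5.0]])
# An "image" query mapped into the same space by f_I (a Q_{I->T} query).
image_query = np.array([0.9, 0.1, 0.0])
print(knn_search(image_query, text_latent, k=2))  # indices of the two nearest text objects
```

The same routine serves all four query types; only the choice of mapping function and target index changes.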
3 Overview of Multi-modal Retrieval
The flowchart of our multi-modal retrieval framework is illustrated in Figure 1. It consists of three main steps: 1) offline model training, 2) offline indexing, and 3) online kNN query processing. In step 1, relevant image-text pairs are used as
input training data to learn the mapping functions. For example, image-text pairs can be collected from Flickr where the
text features are extracted from tags and descriptions for images. If they are associated with additional semantic labels
(e.g., categories), we use a supervised training algorithm.
Otherwise, an unsupervised training approach is used. After
step 1, we can obtain a mapping function fm : Dm → Z
for each modality m ∈ {I, T }. In step 2, objects from different modalities are first mapped into the common space Z
by function fm . With such unified representation, the latent
features from the same modality are then inserted into a high
dimensional index for kNN query processing. When a query
Q ∈ Dm comes, it is first mapped into Z using its modal-specific mapping function fm. Based on the query type, k
nearest neighbors are retrieved from the index built for the
target modality and returned to the user. For example, image
index is used for queries of type QI→I and QT →I against
the image database.
General learning objective A good objective function
plays a crucial role in learning effective mapping functions.
In our multi-modal search framework, we design a general
learning objective function L. By taking into account the
image and text modalities, our objective function is defined
as follows:
L = βI LI + βT LT + LI,T + ξ(θ)   (1)
where Lm , m ∈ {I, T } is called the intra-modal loss to
reflect how well the intra-modal semantics are captured by
the latent features. The smaller the loss, the more effective the learned mapping functions are. LI,T is called the
inter-modal loss which is designed to capture inter-modal
semantics. The last term is used as regularization to prevent
over-fitting [14] (L2 Norm is used in our experiment). θ denotes all parameters involved in the mapping functions. βm ,
m ∈ {I, T } denotes the weight of the loss for modality m
in the objective function. We observe in our training process that assigning different weights to different modalities
according to the nature of its data offers better performance
than treating them equally. For the modality with lower quality input feature (due to noisy data or poor data representation), we assign smaller weight for its intra-modal loss in
the objective function. The intuition of setting βI and βT in
this way is that, by relaxing the constraints on intra-modal
loss, we enforce the inter-modal constraints. Consequently,
the intra-modal semantics of the modality with lower quality input feature can be preserved or even enhanced through
their inter-modal relationships with high-quality modalities.
Details of setting βI and βT will be discussed in Section 4.3
and Section 5.3.
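The structure of Equation 1 can be sketched as a plain function; the loss terms here are placeholders for the modality-specific losses defined later, the weights and data are illustrative, and the L2 penalty matches the regularizer used in the experiments.

```python
import numpy as np

def total_loss(L_I, L_T, L_IT, params, beta_I=0.5, beta_T=1.0, lam=1e-4):
    """General objective L = beta_I*L_I + beta_T*L_T + L_IT + xi(theta).

    L_I, L_T : intra-modal losses of the image/text modalities
    L_IT     : inter-modal loss over relevant image-text pairs
    params   : list of parameter arrays; xi(theta) is an L2 penalty
    Setting beta_I < beta_T illustrates down-weighting a noisier modality.
    """
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return beta_I * L_I + beta_T * L_T + L_IT + reg

# Toy values: with lam=0.1 and four unit parameters, reg = 0.4.
print(total_loss(2.0, 1.0, 0.5, [np.ones(4)], beta_I=0.5, beta_T=1.0, lam=0.1))
```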
Training  Training is to find the optimal parameters involved in the mapping functions that minimize L. Two types
(b) Training Stage I
(c) Training Stage II
Fig. 2: Flowchart of training. Relevant images (or text) are associated with the same shape. In single-modal training,
objects of same shape and modality are moving close to each other. In multi-modal training, objects of same shape from all
modalities are moving close to each other.
Fig. 3: Model of MSAE, which consists of one SAE for each
modality. The trained SAE maps input data into latent features.
Fig. 4: Auto-Encoder
of mapping functions are proposed in this paper. One is trained by an unsupervised algorithm, which uses simple image-text pairs for training; no other prior knowledge is required. The other is trained by a supervised algorithm, which exploits additional label information to learn mapping functions that are robust against noisy training data. For both mapping functions, we design a two-stage training procedure to find the optimal parameters. The complete training process is illustrated in Figure 2. In stage I, one mapping function is trained independently for each modality, with the objective of mapping similar features in one modality close to each other in the latent space. This training stage serves as the pre-training of stage II by providing a good initialization for the parameters. In stage II, we jointly optimize Equation 1 to capture both intra-modal and inter-modal semantics. The learned mapping functions project semantically relevant objects close to each other in the latent space, as shown in Figure 2.

4 Unsupervised Approach – MSAE
In this section, we propose an unsupervised learning algorithm called MSAE (Multi-modal Stacked Auto-Encoders) to learn the mapping functions fI and fT. The model is shown in Figure 3. We first present preliminary knowledge of the auto-encoder and stacked auto-encoders. Based on stacked auto-encoders, we then address how to define the terms LI, LT and LI,T of our general learning objective (Equation 1).
4.1 Background: Auto-encoder & Stacked Auto-encoder
Auto-encoder Auto-encoder has been widely used in unsupervised feature learning and classification tasks [31, 38, 13,
34]. It can be seen as a special neural network with three layers – the input layer, the latent layer, and the reconstruction
layer. As shown in Figure 4, the raw input feature x0 ∈ Rd0
in the input layer is encoded into latent feature x1 ∈ Rd1
Effective Deep Learning Based Multi-Modal Retrieval
via a deterministic mapping fe :
x1 = fe(x0) = se(W1ᵀ x0 + b1)   (2)
where se is the activation function of the encoder, W1 ∈
Rd0 ×d1 is a weight matrix and b1 ∈ Rd1 is a bias vector.
The latent feature x1 is then decoded back to x2 ∈ Rd0 via
another mapping function fd :
x2 = fd(x1) = sd(W2ᵀ x1 + b2)   (3)
Similarly, sd is the activation function of the decoder with
parameters {W2 , b2 }, W2 ∈ Rd1 ×d0 , b2 ∈ Rd0 . Sigmoid
function or Tanh function is typically used as the activation
functions se and sd . The parameters {W1 , W2 , b1 , b2 } of
the auto-encoder are learned with the objective of minimizing the difference (called reconstruction error) between the
raw input x0 and the reconstruction output x2 . Squared Euclidean distance, negative log likelihood and cross-entropy
are often used to measure the reconstruction error. By minimizing the reconstruction error, we can use the latent feature
to reconstruct the original input with minimum information
loss. In this way, the latent feature preserves regularities (or
semantics) of the input data.
Stacked Auto-encoder Stacked Auto-encoders (SAE)
are constructed by stacking multiple (e.g., h) auto-encoders.
The input feature vector x0 is fed to the bottom auto-encoder.
After training the bottom auto-encoder, the latent representation x1 is propagated to the higher auto-encoder. The same
procedure is repeated until all the auto-encoders are trained.
The latent representation xh from the top (i.e., h-th) autoencoder, is the output of the stacked auto-encoders, which
can be further fed into other applications, such as SVM for
classification. The stacked auto-encoders can be fine-tuned by minimizing the reconstruction error between the input feature x0 and the reconstruction feature x2h, which is computed by forwarding x0 through all encoders and then through all decoders, as shown in Figure 5. In this way, the
output feature xh can reconstruct the input feature with minimal information loss. In other words, xh preserves regularities (or semantics) of the input data x0 .
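A single auto-encoder pass can be sketched in a few lines; the shapes follow the definitions above (W1 is d0 × d1, W2 is d1 × d0), and the sigmoid is used for both activations here purely for simplicity. All weights and inputs are random toy values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x0, W1, b1, W2, b2):
    """Encode the raw input x0 into the latent feature x1, then decode
    it back into the reconstruction x2 (Equations 2 and 3)."""
    x1 = sigmoid(W1.T @ x0 + b1)   # latent feature
    x2 = sigmoid(W2.T @ x1 + b2)   # reconstruction
    return x1, x2

def reconstruction_error(x0, x2):
    """Squared Euclidean reconstruction error between input and output."""
    return np.sum((x0 - x2) ** 2)

rng = np.random.default_rng(0)
d0, d1 = 8, 3
W1, b1 = rng.normal(size=(d0, d1)) * 0.1, np.zeros(d1)
W2, b2 = rng.normal(size=(d1, d0)) * 0.1, np.zeros(d0)
x0 = rng.random(d0)
x1, x2 = autoencoder_forward(x0, W1, b1, W2, b2)
print(x1.shape, reconstruction_error(x0, x2))
```

Stacking simply feeds each x1 into the next encoder; training (not shown) adjusts the parameters to shrink the reconstruction error.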
Fig. 5: Fine-tune Stacked Auto-Encoders
Fig. 6: Distribution of image (6a) and text (6b) features extracted from the NUS-WIDE training dataset (see Section 8). Each figure is generated by averaging the units of each feature vector and then plotting the histogram over all data.
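The diagnostic behind Figure 6 is easy to reproduce: average each feature vector's units to get one scalar per object, then histogram those scalars to decide whether a Gaussian or a Poisson reconstruction model fits. The data below are synthetic stand-ins, not the NUS-WIDE features.

```python
import numpy as np

def average_unit_values(features):
    """One scalar per feature vector: the mean of its units (dimensions)."""
    return features.mean(axis=1)

rng = np.random.default_rng(1)
# Dense real-valued "image" features vs. sparse count-valued "text" features.
image_feat = rng.normal(loc=2.0, scale=0.5, size=(1000, 64))
text_feat = rng.poisson(lam=0.3, size=(1000, 64)).astype(float)

img_hist, _ = np.histogram(average_unit_values(image_feat), bins=20)
txt_hist, _ = np.histogram(average_unit_values(text_feat), bins=20)
print(img_hist.sum(), txt_hist.sum())  # every vector falls into some bin
```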
4.2 Realization of the Learning Objective in MSAE
4.2.1 Modeling Intra-modal Semantics of Data
We extend SAEs to model intra-modal losses in the general
learning objective (Equation 1). Specifically, LI and LT are
modeled as the reconstruction errors for the image SAE and
the text SAE respectively. Intuitively, if the two reconstruction errors are small, the latent features generated by the top
auto-encoder would be able to reconstruct the original input
well, and consequently, capture the regularities of the input
data well. This implies that, with small reconstruction error, two objects from the same modality that are similar in
the original space would also be close in the latent space. In
this way, we are able to capture the intra-modal semantics
of data by minimizing LI and LT respectively. But to use
SAEs, we have to design the decoders of the bottom autoencoders carefully to handle different input features.
The raw (input) feature of an image is a high-dimensional real-valued vector (e.g., a color histogram or bag-of-visual-words). In the encoder, each input image feature is mapped
to a latent vector using Sigmoid function as the activation
function se (Equation 2). However, in the decoder, the Sigmoid activation function, whose range is [0,1], performs poorly
on reconstruction because the raw input unit (referring to
one dimension) is not necessarily within [0,1]. To solve this
issue, we follow Hinton [14] and model the raw input unit
as a linear unit with independent Gaussian noise. As shown
in Figure 6a, the average unit value of image feature typically follows Gaussian distribution. When the input data is
normalized with zero mean and unit variance, the Gaussian
noise term can be omitted. In this case, we use an identity
function for the activation function sd in the bottom decoder.
Let x0 denote the input image feature vector, x2h denote the
feature vector reconstructed from the top latent feature xh (h
is the depth of the stacked auto-encoders). Using Euclidean
distance to measure the reconstruction error, we define LI
Wei Wang, Xiaoyan Yang, Beng Chin Ooi, Dongxiang Zhang, Yueting Zhuang
for x0 as:

LI(x0) = ||x0 − x2h||₂²   (4)
The raw (input) feature of text is a word count vector or tag occurrence vector³. We adopt the Rate Adapting Poisson model [32] for reconstruction because the histogram of the average value of text input units generally follows a Poisson distribution (Figure 6b). In this model, the activation function in the bottom decoder is

x2h = sd(z2h) = l · e^{z2h} / Σj e^{z2hj}   (5)

where l = Σj x0j is the number of words in the input text, and z2h = W2hᵀ x2h−1 + b2h. The probability of a reconstruction unit x2hi being the same as the input unit x0i is:

p(x2hi = x0i) = Pois(x0i, x2hi)   (6)

where Pois(n, λ) = e^{−λ} λⁿ / n!. Based on Equation 6, we define LT using the negative log likelihood:

LT(x0) = −log Πi p(x2hi = x0i)   (7)
By minimizing LT, we require x2h to be similar to x0. In
other words, the latent feature xh is trained to reconstruct
the input feature well, and thus preserves the regularities of
the input data well.
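The Poisson decoder and its negative log-likelihood loss (Equations 5 through 7) can be sketched as follows; the count vector and pre-activations are toy values, and numerical niceties such as the stabilized softmax are implementation choices rather than part of the model.

```python
import numpy as np
from math import lgamma

def poisson_decoder(z, l):
    """Rate Adapting Poisson decoder: x2h = l * softmax(z) (Equation 5)."""
    e = np.exp(z - z.max())          # numerically stable softmax
    return l * e / e.sum()

def poisson_nll(x0, rate):
    """L_T(x0) = -sum_i log Pois(x0_i, rate_i) (Equations 6 and 7),
    using log Pois(n, lam) = n*log(lam) - lam - log(n!)."""
    return -np.sum(x0 * np.log(rate) - rate
                   - np.array([lgamma(n + 1) for n in x0]))

x0 = np.array([2.0, 0.0, 1.0, 0.0])   # word-count vector, l = 3 words
z = np.array([0.5, -1.0, 0.2, -1.0])  # decoder pre-activations
rate = poisson_decoder(z, l=x0.sum())
print(rate.sum(), poisson_nll(x0, rate))  # rates sum to l = 3
```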
4.2.2 Modeling Inter-modal Semantics of Data
Given one relevant image-text pair (x0 , y0 ), we forward them
through the encoders of their stacked auto-encoders to generate latent feature vectors (xh , yh ) (h is the height of the
SAE). The inter-modal loss is then defined as

LI,T(x0, y0) = dist(xh, yh) = ||xh − yh||₂²   (8)
By minimizing LI,T , we capture the inter-modal semantics
of data. The intuition is quite straightforward: if two objects
x0 and y0 are relevant, the distance between their latent features xh and yh shall be small.
4.3 Training

Following the training flow shown in Figure 2, in stage I we train a SAE for the image modality and a SAE for the text modality separately. Back-Propagation [22] (see Appendix) is used to calculate the gradients of the objective loss, i.e., LI or LT, w.r.t. the parameters. The parameters are then updated according to mini-batch Stochastic Gradient Descent (SGD) (see Appendix), which averages the gradients contributed by a mini-batch of training records (images or text documents) and then adjusts the parameters. The learned image and text SAEs are fine-tuned in stage II by Back-Propagation and mini-batch SGD with the objective of finding the optimal parameters that minimize the learning objective (Equation 1). In our experiments, we observe that training is more stable if we alternately adjust one SAE while keeping the other fixed.

Setting βI & βT   βI and βT are the weights of the reconstruction errors of the image and text SAEs respectively in the objective function (Equation 1). As mentioned in Section 3, they are set based on the quality of each modality's raw (input) feature. We use an example to illustrate the intuition. Consider a relevant object pair (x0, y0) from modalities x and y. Assume x's feature is of low quality in capturing semantics (e.g., due to noise) while y's feature is of high quality. If xh and yh are the latent features generated by minimizing the reconstruction errors, then yh preserves the semantics well while xh is not as meaningful due to the low quality of x0. To solve this problem, we combine the inter-modal distance between xh and yh in the learning objective function and assign a smaller weight to the reconstruction error of x0. This is the same as increasing the weight of the inter-modal distance from xh to yh. As a result, the training algorithm will move xh towards yh to make their distance smaller. In this way, the semantics of the low-quality xh can be enhanced by the high-quality feature yh.

In the experiment, we evaluate the quality of each modality's raw feature on a validation dataset by performing intra-modal search against the latent features learned in single-modal training. The modality with worse search performance is assigned a smaller weight. Notice that, because the dimensions of the latent space and the original space are usually of different orders of magnitude, the scales of LI, LT and LI,T are different. In the experiment, we also scale βI and βT to make the losses comparable, i.e., within an order of magnitude.

³ The binary value for each dimension indicates whether the corresponding tag appears or not.

5 Supervised Approach – MDNN
In this section, we propose a supervised learning algorithm
called MDNN (Multi-modal Deep Neural Network) based
on a deep convolutional neural network (DCNN) model and
a neural language model (NLM) to learn mapping functions
for the image modality and the text modality respectively.
The model is shown in Figure 7. First, we provide some
background on DCNN and NLM. Second, we extend one
DCNN [20] and one NLM [28] to model intra-modal losses
involved in the general learning objective (Equation 1). Third,
the inter-modal loss is specified and combined with the intramodal losses to realize the general learning objective. Finally, we describe the training details.
Effective Deep Learning Based Multi-Modal Retrieval
co-occur, SGM models the conditional probability p(a|b)
using softmax:
feature fT(y)
Latent feature fI(x)
Input image x
Input text y
Fig. 7: Model of MDNN, which consists of one DCNN for
image modality, and one Skip-Gram + MLP for text modality. The trained DCNN (or Skip-Gram + MLP) maps input
data into latent features.
5.1 Background: Deep Convolutional Neural Network &
Neural Language Model
Deep Convolutional Neural Network (DCNN) DCNN has
shown great success in computer vision tasks [8, 10] since
the first DCNN (called AlexNet) was proposed by Alex [20].
It has specialized connectivity structure, which usually consists of multiple convolutional layers followed by fully connected layers. These layers form stacked, multiple-staged
feature extractors, with higher layers generating more abstract features from lower ones. On top of the feature extractor layers, there is a classification layer. Please refer to
[20] for a more comprehensive review of DCNN.
The input to DCNN is raw image pixels such as an RGB
vector, which is forwarded through all feature extractor layers to generate a feature vector that is a high-level abstraction of the input data. The training data of DCNN consists
of image-label pairs. Let x denote the image raw feature and
fI (x) the feature vector extracted from DCNN. t is the binary label vector of x. If x is associated with the i-th label
li , ti is set to 1 and all other elements are set to 0. fI (x) is
forwarded to the classification layer to predict the final output p(x), where pi (x) is the probability of x being labelled
with li . Given x and fI (x), pi (x) is defined as:
eva ·vb
p(a|b) = P vã ·v
ã e
where va and vb are vector representations
Pof word a and
context b respectively. The denominator ã evã ·vb is expensive to calculate given a large vocabulary, where ã is
any word in the vocabulary. Thus, approximations were proposed to estimate it [28]. Given a corpus of sentences, SGM
is trained to learn vector representations v by maximizing
Equation 11 over all co-occurring pairs.
The learned dense vectors can be used to construct a
dense vector for one sentence or document (e.g., by averaging), or to calculate the similarity of two words, e.g., using
the cosine similarity function.
5.2 Realization of the Learning Objective in MDNN
5.2.1 Modeling Intra-modal Semantics of Data
Having witnessed the outstanding performance of DCNNs
in learning features for visual data [8, 10], and NLMs in
learning features for text data [33], we extend one instance
of DCNN – AlexNet [20] and one instance of NLM – SkipGram model (SGM) [28] to model the intra-modal semantics of images and text respectively.
Image We employ AlexNet to serve as the mapping function fI for image modality. An image x is represented by an
RGB vector. The feature vector fI (x) learned by AlexNet
is used to predict the associated labels of x. However, the
objective of the original AlexNet is to predict single label of
an image while in our case images are annotated with multiple labels. We thus follow [11] to extend the softmax loss
(Equation 10) to handle multiple labels as follows:
1 X
LI (x, t) = − P
ti log pi (x)
i ti i
where pi (x) is defined in Equation 9. Different from SAE,
which models reconstruction error to preserve intra-modal
semantics, the extended AlexNet tries to minimize the prediction error LI shown in Equation 12. By minimizing prewhich is a softmax function. Based on Equation 9, we dediction error, we require the learned high-level feature vecfine the prediction error, or softmax loss as the negative log
tors fI (x) to be discriminative in predicting labels. Images
with similar labels shall have similar feature vectors. In this
way, the intra-modal semantics are preserved.
LI (x, t) = −
ti log pi (x)
Text We extend SGM to learn the mapping function fT
for text modality. Due to the noisy nature of text (e.g., tags)
Neural Language Model (NLM) NLMs, first introduced associated with images [23], directly training the SGM over
in [2], learn a dense feature vector for each word or phrase,
the tags would carry noise into the learned features. Howcalled a distributed representation or a word embedding. Among ever, labels associated with images are carefully annotated
them, the Skip-Gram model (SGM) [28] proposed by Mikolov and are more accurate. Hence, we extend the SGM to inteet al. is the state-of-the-art. Given a word a and context b that
grate label information so as to learn robust features against
efI (x)i
pi (x) = P f (x)
Wei Wang, Xiaoyan Yang, Beng Chin Ooi, Dongxiang Zhang, Yueting Zhuang
fT (y) = W2 · s(W1 ỹ + b1 ) + b2
s(v) = max(0, v)
Indexed Image Latent Feature Vectors
noisy text (tags). The main idea is, we first train a SGM [28],
treating all tags associated with one image as an input sentence. After training, we obtain one word embedding for
each tag. By averaging word embeddings of all tags of one
image, we create one text feature vector for those tags. Second, we build a Multi-Layer Perceptron (MLP) with two
hidden layers on top of the SGM. The text feature vectors
are fed into the MLP to predict image labels. Let y denote
the input text (e.g., a set of image tags), ỹ denote the averaged word embedding generated by SGM for tags in y.
MLP together with SGM serves as the mapping function fT
for the text modality,
where W1 and W2 are weight matrices, b1 and b2 are bias vectors, and s() is the ReLU activation function [20]^4. The loss function of the MLP is similar to that of the extended AlexNet for image label prediction:

$L_T(y, t) = -\sum_i t_i \log q_i(y) \qquad (15)$

$q_i(y) = \frac{e^{f_T(y)_i}}{\sum_j e^{f_T(y)_j}} \qquad (16)$

We require the learned text latent features fT(y) to be discriminative for predicting labels. In this way, we model the intra-modal semantics for the text modality^5.

5.2.2 Modeling Inter-modal Semantics of Data

After extending the AlexNet and the Skip-Gram model to preserve the intra-modal semantics for images and text respectively, we jointly learn the latent features for image and text to preserve the inter-modal semantics. We follow the general learning objective in Equation 1 and realize LI and LT using Equations 12 and 15 respectively. Euclidean distance is used to measure the difference of the latent features for an image-text pair, i.e., LI,T is defined similarly as in Equation 8. By minimizing the distance of the latent features for an image-text pair, we require their latent features to be close in the latent space. In this way, the inter-modal semantics are preserved.

5.3 Training

Similar to the training of MSAE, the training of MDNN consists of two steps. The first step trains the extended AlexNet and the extended NLM (i.e., MLP+Skip-Gram) separately^6. The learned parameters are used to initialize the joint model. All training is conducted by back-propagation using mini-batch SGD (see Appendix) to minimize the objective loss (Equation 1).

Setting βI & βT In the unsupervised training, we assign a larger βI to make the training prone to preserving the intra-modal semantics of images if the input image feature is of higher quality than the input text feature, and vice versa. For supervised training, since the intra-modal semantics are preserved based on reliable labels, we do not distinguish the image modality from the text modality in the joint training. Hence βI and βT are set to the same value. In the experiment, we set βI = βT = 1. To keep the three losses within one order of magnitude, we scale the inter-modal distance by 0.01.

Fig. 8: Illustration of Query Processing (an image or text query is mapped by function fI or fT into the common latent space and searched against the indexed image/text latent feature vectors of the image DB and text DB)

^4 We tried both the Sigmoid function and the ReLU activation function for s(); ReLU offers better performance.

^5 Notice that in our model, we fix the word vectors learned by SGM. They can also be fine-tuned by integrating the objective of SGM (Equation 11) into Equation 15.
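The joint objective of the training step above can be sketched for a single image-text pair as follows. This is our own illustrative recomputation, not the authors' implementation (which runs over mini-batches with back-propagation); the latent vectors, label vector, and the helper names are hypothetical.

```python
import numpy as np

def softmax_loss(scores, t):
    """Negative log-likelihood over softmax probabilities (the form
    of the intra-modal losses L_I and L_T); t is a binary label vector."""
    p = np.exp(scores - scores.max())  # shift for numerical stability
    p = p / p.sum()
    return -np.sum(t * np.log(p + 1e-12))

def joint_loss(fI_x, fT_y, t, beta_I=1.0, beta_T=1.0, scale=0.01):
    """beta_I * L_I + beta_T * L_T + scale * L_{I,T}: two intra-modal
    softmax losses plus the inter-modal Euclidean distance, which is
    scaled by 0.01 to keep the three terms within one order of magnitude."""
    L_IT = np.sum((fI_x - fT_y) ** 2)  # squared Euclidean distance
    return (beta_I * softmax_loss(fI_x, t)
            + beta_T * softmax_loss(fT_y, t)
            + scale * L_IT)

# One hypothetical image-text pair with 21 candidate labels.
rng = np.random.default_rng(0)
fI_x, fT_y = rng.standard_normal(21), rng.standard_normal(21)
t = np.zeros(21)
t[3] = 1.0
loss = joint_loss(fI_x, fT_y, t)
assert loss > 0
```

Note that when the two latent vectors coincide the inter-modal term vanishes and only the two intra-modal losses remain.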
6 Query Processing
After the unsupervised (or supervised) training, each modality has a mapping function. Given a set of heterogeneous
data sources, high-dimensional raw features (e.g., bag-of-visual-words or RGB features for images) are extracted from
each source and mapped into a common latent space using
the learned mapping functions. In MSAE, we use the image
(resp. text) SAE to project image (resp. text) input features
into the latent space. In MDNN, we use the extended DCNN
(resp. extended NLM) to map the image (resp. text) input
feature into the common latent space.
After the mapping, we create VA-Files [40] over the latent features (one per modality). VA-File is a classic index
that can overcome the curse of dimensionality when answering nearest neighbor queries. It encodes each data point into
a bitmap and the whole bitmap file is loaded into memory
for efficient scanning and filtering. Only a small number
^6 In our experiment, we use the parameters trained by Caffe [18] to initialize the AlexNet to accelerate the training. We use Gensim (http://radimrehurek.com/gensim/) to train the Skip-Gram model with the dimension of word vectors being 100.
of real data points will be loaded into memory for verification. Given a query input, we check its media type and
map it into the latent space through its modal-specific mapping function. Next, intra-modal and inter-modal searches
are conducted against the corresponding index (i.e., the VAFile) shown in Figure 8. For example, the task of searching
relevant tags of one image, i.e., QI→T , is processed by the
index for the text latent vectors.
To further improve the search efficiency, we convert the
real-valued latent features into binary features, and search
based on Hamming distance. The conversion is conducted
using existing hash methods that preserve the neighborhood
relationship. For example, in our experiment (Section 8.2),
we use Spectral Hashing [41], which converts real-valued
vectors (data points) into binary codes with the objective to
minimize the Hamming distance of data points that are close
in the original Euclidean space. Other hashing approaches
like [35, 12] are also applicable.
The conversion from real-valued features to binary features trades off effectiveness for efficiency. Since there is
information loss when real-valued data is converted to binaries, it affects the retrieval performance. We study the tradeoff between efficiency and effectiveness on binary features
and real-valued features in the experiment section.
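The efficiency gain from binary codes comes down to replacing floating-point Euclidean distance with XOR plus popcount. A minimal sketch of linear-scan search under both representations; the sign-based codes here stand in for a real hashing method such as Spectral Hashing, purely for illustration:

```python
import numpy as np

def knn_euclidean(db, q, k):
    """Linear-scan k-NN over real-valued latent features."""
    d = np.sum((db - q) ** 2, axis=1)
    return np.argsort(d)[:k]

def knn_hamming(codes, qcode, k):
    """Linear-scan k-NN over packed binary codes: XOR, then popcount."""
    d = np.unpackbits(codes ^ qcode, axis=1).sum(axis=1)
    return np.argsort(d, kind="stable")[:k]

rng = np.random.default_rng(1)
db = rng.standard_normal((1000, 32)).astype(np.float32)  # 32-d latent features
codes = np.packbits(db > 0, axis=1)  # naive sign-based 32-bit codes
q, qcode = db[42], codes[42]
assert knn_euclidean(db, q, 1)[0] == 42   # a point is its own nearest neighbor
assert 42 in knn_hamming(codes, qcode, 5)
```

The Hamming scan touches 4 bytes per point instead of 32 floats, which is the source of the speedup reported in the experiments; the accuracy loss comes from the quantization, not from the scan itself.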
7 Related Work
The key problem of multi-modal retrieval is to find an effective mapping mechanism, which maps data from different
modalities into a common latent space. An effective mapping mechanism would preserve both intra-modal semantics
and inter-modal semantics well in the latent space, and thus
generates good retrieval performance.
Linear projection has been studied to solve this problem [21, 36, 44]. The main idea is to find a linear projection matrix for each modality that maps semantic relevant
data into similar latent vectors. However, when the distribution of the original data is non-linear, it would be hard to
find a set of effective projection matrices. CVH [21] extends
the Spectral Hashing [41] to multi-modal data by finding a
linear projection for each modality that minimizes the Euclidean distance of relevant data in the latent space. Similarity matrices for both inter-modal data and intra-modal
data are required to learn a set of good mapping functions.
IMH [36] learns the latent features of all training data first
before it finds a hash function to fit the input data and output
latent features, which could be computationally expensive.
LCMH [44] exploits the intra-modal correlations by representing data from each modality using its distance to cluster
centroids of the training data. Projection matrices are then
learned to minimize the distance of relevant data (e.g., image and tags) from different modalities.
Other recent works include CMSSH [4], MLBE [43] and
LSCMR [25]. CMSSH uses a boosting method to learn the
projection function for each dimension of the latent space.
However, it requires prior knowledge such as semantic relevant and irrelevant pairs. MLBE explores correlations of
data (both inter-modal and intra-modal similarity matrices)
to learn latent features of training data using a probabilistic
graphic model. Given a query, it is converted into the latent space based on its correlation with the training data.
Such correlation is decided by labels associated with the
query. However, labels of a query are usually not available
in practice, which makes it hard to obtain its correlation with
the training data. LSCMR [25] learns the mapping functions with the objective to optimize the ranking criteria (e.g.,
MAP) directly. Ranking examples (a ranking example is a
query and its ranking list) are needed for training. In our algorithm, we use simple relevant pairs (e.g., image and its
tags) as training input. Thus no prior knowledge such as irrelevant pairs, similarity matrix, ranking examples and labels of queries, is needed.
Multi-modal deep learning [29, 37] extends deep learning to multi-modal scenario. [37] combines two Deep Boltzmann Machines (DBM) (one for image, one for text) with a
common latent layer to construct a Multi-modal DBM. [29]
constructs a bimodal deep auto-encoder with two deep auto-encoders (one for audio, one for video). Both models
aim to improve the classification accuracy of objects with
features from multiple modalities. Thus they combine different features to learn a good (high dimensional) latent feature. In this paper, we aim to represent data with low-dimensional
latent features to enable effective and efficient multi-modal
retrieval, where both queries and database objects may have
features from only one modality. DeViSE [9] from Google
shares similar idea with our supervised training algorithm. It
embeds image features into text space, which are then used
to retrieve similar text features for zero-shot learning. Notice
that the text features used in DeViSE to learn the embedding
function are generated from high-quality labels. However, in
multi-modal retrieval, queries usually do not come with labels and text features are generated from noisy tags. This
makes DeViSE less effective in learning robust latent features against noisy input.
8 Experimental Study
This section provides an extensive performance study of our
solution in comparison with the state-of-the-art methods. We
examine both efficiency and effectiveness of our method including training overhead, query processing time and accuracy. Visualization of the training process is also provided
to help understand the algorithms. All experiments are conducted on CentOS 6.4 using CUDA 5.5 with NVIDIA GPU
(GeForce GTX TITAN). The size of main memory is 64GB
Wei Wang, Xiaoyan Yang, Beng Chin Ooi, Dongxiang Zhang, Yueting Zhuang
and the size of GPU memory is 6GB. The code and hyperparameter settings are available online^7. In the rest of this
section, we first introduce our evaluation metrics, and then
study the performance of unsupervised approach and supervised approach respectively.
8.1 Evaluation Metrics
We evaluate the effectiveness of the mapping mechanism
by measuring the accuracy of the multi-modal search, i.e.,
Qq→t (q, t ∈ {T, I}), using the mapped latent features. Unless otherwise specified, searches are conducted against real-valued
latent features using Euclidean distance. We use Mean Average Precision (MAP) [27], one of the standard information
retrieval metrics, as the major evaluation metric. Given a set
of queries, the Average Precision (AP) for each query q is
calculated as,
$AP(q) = \frac{\sum_{k=1}^{R} P(k)\,\delta(k)}{\sum_{j=1}^{R} \delta(j)}$
where R is the size of the test dataset; δ(k) = 1 if the k-th
result is relevant, otherwise δ(k) = 0; P (k) is the precision
of the result ranked at position k, which is the fraction of true
relevant documents in the top k results. By averaging AP
for all queries, we get the MAP score. The larger the MAP
score, the better the search performance. In addition to MAP,
we measure the precision and recall of search tasks. Given
a query, the ground truth is defined as: if a result shares at
least one common label (or category) with the query, it is
considered as a relevant result; otherwise it is irrelevant.
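The AP definition above translates directly into code; a small sketch of our own, for illustration:

```python
def average_precision(relevant):
    """AP for one ranked result list; `relevant` holds the delta(k)
    flags: 1 if the k-th returned result is relevant, else 0."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # P(k): fraction of relevant in top k
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """MAP: mean of AP over all query result lists."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Relevant results at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
assert abs(average_precision([1, 0, 1]) - 5 / 6) < 1e-9
```

Note that the denominator is the number of relevant results in the returned list, matching the sum of delta(j) in the equation.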
Besides effectiveness, we also evaluate the training overhead in terms of time cost and memory consumption. In addition, we report query processing time.
8.2 Experimental Study of Unsupervised Approach
First, we describe the datasets used for unsupervised training. Second, an analysis of the training process by visualization is presented. Last, a comparison with previous works, including CVH [21], CMSSH [4] and LCMH [44], is provided.^8
Table 1: Statistics of Datasets for Unsupervised Training

8.2.1 Datasets

Unsupervised training requires relevant image-text pairs, which are easy to collect. We use three datasets to evaluate the performance: NUS-WIDE [5], Wiki [30] and Flickr1M [17].
NUS-WIDE The dataset contains 269,648 images from
Flickr, with each image associated with 6 tags on average.
We refer to the image and its tags as an image-text pair.
There are 81 ground truth labels manually annotated for
evaluation. Following previous works [24, 44], we extract
190,421 image-text pairs annotated with the most frequent
21 labels and split them into three subsets for training, validation and test respectively. The size of each subset is shown
in Table 1. For validation (resp. test), 100 (resp. 1000) queries
are randomly selected from the validation (resp. test) dataset.
Image and text features are provided in the dataset [5]. For
images, SIFT features are extracted and clustered into 500
visual words. Hence, an image is represented by a 500 dimensional bag-of-visual-words vector. Its associated tags are
represented by a 1,000 dimensional tag occurrence vector.
Wiki This dataset contains 2,866 image-text pairs from
the Wikipedia’s featured articles. An article in Wikipedia
contains multiple sections. The text and its associated image in one section is considered as an image-text pair. Every
image-text pair has a label inherited from the article’s category (there are 10 categories in total). We randomly split the
dataset into three subsets as shown in Table 1. For validation
(resp. test), we randomly select 50 (resp. 100) pairs from
the validation (resp. test) set as the query set. Images are
represented by 128 dimensional bag-of-visual-words vectors based on SIFT feature. For text, we construct a vocabulary with the most frequent 1,000 words excluding stop
words, and represent one text section by 1,000 dimensional
word count vector like [25]. The average number of words
in one section is 131, which is much higher than that in NUS-WIDE. To avoid overflow in Equation 6 and smooth the text
input, we normalize each unit x as log(x + 1) [32].
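The log(x + 1) smoothing above is a one-liner in numpy; a tiny sketch with made-up counts:

```python
import numpy as np

# Smooth a word-count vector with x -> log(x + 1) before feeding it
# to the text SAE; log1p is exact at x = 0 and well-behaved for
# large counts.
counts = np.array([0.0, 3.0, 120.0])
smoothed = np.log1p(counts)
assert smoothed[0] == 0.0
```

Zero counts stay exactly zero, so the sparsity pattern of the input vector is preserved.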
Flickr1M This dataset contains 1 million images associated with tags from Flickr. 25,000 of them are annotated with labels (there are 38 labels in total). The image feature is a 3,857 dimensional vector concatenating SIFT features, color histogram, etc. [37]. Like NUS-WIDE, the text feature is represented by a tag occurrence vector with 2,000 dimensions. All the image-text pairs without annotations are used for training. For validation and test, we randomly select 6,000 pairs with annotations respectively, among which 1,000 pairs are used as queries.

^7 http://www.comp.nus.edu.sg/~wangwei/code

^8 The code and parameter configurations for CVH and CMSSH are available online at http://www.cse.ust.hk/~dyyeung/code/mlbe.zip; the code for LCMH is provided by the authors. Parameters are set according to the suggestions provided in the paper.
Before training, we use ZCA whitening [19] to normalize each dimension of image feature to have zero mean and
unit variance.
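ZCA whitening can be sketched as follows (our own minimal version; see [19] for the full treatment). It centers the data, rotates into the eigenbasis of the covariance, rescales each direction to unit variance, and rotates back:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten an (n_samples, n_features) matrix: after the
    transform, features are decorrelated with (approximately)
    unit variance, while staying close to the original axes."""
    Xc = X - X.mean(axis=0)                      # zero-mean per dimension
    cov = Xc.T @ Xc / Xc.shape[0]                # sample covariance
    eigval, eigvec = np.linalg.eigh(cov)         # symmetric eigendecomposition
    W = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return Xc @ W

# Toy data with very different per-dimension scales.
X = np.random.default_rng(2).standard_normal((500, 8)) * np.arange(1, 9)
Xw = zca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]
assert np.allclose(cov_w, np.eye(8), atol=1e-2)  # ~identity covariance
```

The small eps guards against dividing by near-zero eigenvalues when some feature directions are (almost) degenerate.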
8.2.2 Training Visualization
In this section we visualize the training process of MSAE
using the NUS-WIDE dataset as an example to help understand the intuition of the training algorithm and the setting of
the weight parameters, i.e., βI and βT . Our goal is to learn
a set of effective mapping functions such that the mapped
latent features capture both intra-modal semantics and intermodal semantics well. Generally, the inter-modal semantics
is preserved by minimizing the distance of the latent features of relevant inter-modal pairs. The intra-modal semantics is preserved by minimizing the reconstruction error of
each SAE and through inter-modal semantics (see Section 4
for details).
First, following the training procedure in Section 4, we
train a 4-layer image SAE with the dimension of each layer
as 500 → 128 → 16 → 2. Similarly, a 4-layer text SAE (the
structure is 1000 → 128 → 16 → 2) is trained9 . There is
no standard guideline for setting the number of latent layers
and units in each latent layer for deep learning [1]. In all our
experiments, we adopt the widely used pyramid-like structure [15, 6], i.e. decreasing layer size from the bottom (or
first hidden) layer to the top layer. In our experiment, we observed that 2 latent layers perform better than a single latent
layer. But there is no significant improvement from 2 latent
layers to 3 latent layers. Latent features of sampled image-text pairs from the validation set are plotted in Figure 9a.
The pre-training stage initializes SAEs to capture regularities of the original features of each modality in the latent
features. On the one hand, the original features may be of
low quality to capture intra-modal semantics. In such a case,
the latent features would also fail to capture the intra-modal
semantics. We evaluate the quality of the mapped latent features from each SAE by intra-modal search on the validation
dataset. The MAP of the image intra-modal search is about
0.37, while that of the text intra-modal search is around 0.51.
On the other hand, as the SAEs are trained separately, intermodal semantics are not considered. We randomly pick 25
relevant image-text pairs and connect them with red lines in
Figure 9b. We can see the latent features of most pairs are far
away from each other, which indicates that the inter-modal
semantics are not captured by these latent features. To solve
the above problems, we integrate the inter-modal loss in the
learning objective as Equation 1. In the following figures,
we only plot the distribution of these 25 pairs for ease of illustration.
^9 The last layer with two units is for visualization purposes, such that the latent features can be shown in a 2D space.
(a) 300 random image-text pairs
(b) 25 image-text pairs
Fig. 9: Visualization of latent features after projecting them
into 2D space (Blue points are image latent features; White
points are text latent features. Relevant image-text pairs are
connected using red lines)
Second, we adjust the image SAE with the text SAE
fixed from epoch 1 to epoch 30. One epoch means one pass
of the whole training dataset. Since the MAP of the image
intra-modal search is worse than that of the text intra-modal
search, according to the intuition in Section 3, we should
use a small βI to decrease the weight of image reconstruction error LI in the objective function, i.e., Equation 1. To
verify this, we compare the performance of two choices of
βI , namely βI = 0 and βI = 0.01. The first two rows
of Figure 10 show the latent features generated by the image SAE after epoch 1 and epoch 30. Comparing image-text
pairs in Figure 10b and 10d, we can see that with smaller
βI , the image latent features move closer to their relevant
text latent features. This is in accordance with Equation 1,
where smaller βI relaxes the restriction on the image reconstruction error, and in turn increases the weight for intermodal distance LI,T . By moving close to relevant text latent
features, the image latent features gain more semantics. As
shown in Figure 10e, the MAPs increase as training goes
on. MAP of QT →T does not change because the text SAE
is fixed. When βI = 0.01, the MAPs do not increase in Figure 10f. This is because image latent features hardly move
close to the relevant text latent features as shown in Figure 10c and 10d. We can see that the text modality is of
better quality for this dataset. Hence it should be assigned a
larger weight. However, we cannot set a too large weight for
it as explained in the following paragraph.
Third, we adjust the text SAE with the image SAE fixed
from epoch 31 to epoch 60. We also compare two choices
of βT , namely 0.01 and 0.1. βI is set to 0. Figure 11 shows
the snapshots of latent features and the MAP curves of each
setting. From Figure 10b to 11a, which are two consecutive
snapshots taken from epoch 30 and 31 respectively, we can
see that the text latent features move much closer to the relevant image latent features. It leads to the big changes of
MAPs at epoch 31 in Figure 11e. For example, QT →T substantially drops from 0.5 to 0.46. This is because the sudden moves towards images change the intra-modal relation-
(a) βI = 0, epoch 1
(b) βI = 0, epoch 30
(a) βT = 0.01,epoch 31
(b) βT = 0.01,epoch 60
(c) βI = 0.01, epoch 1
(d) βI = 0.01, epoch 30
(c) βT = 0.1,epoch 31
(d) βT = 0.1,epoch 60
(e) βI = 0
(f) βI = 0.01
(e) βT = 0.01
(f) βT = 0.1
Fig. 10: Adjusting Image SAE with Different βI and Text
SAE fixed (a-d show the positions of features of image-text
pairs in 2D space)
ships of text latent features. Another big change happens on
QI→T , whose MAP increases dramatically. The reason is
that when we fix the text features from epoch 1 to 30, an
image feature I is pulled to be close to (or nearest neighbor of) its relevant text feature T . However, T may not be
the reverse nearest neighbor of I. In epoch 31, we move T
towards I such that T is more likely to be the reverse nearest neighbor of I. Hence, the MAP of query QI→T is greatly
improved. On the contrary, QT →I decreases. From epoch 32
to epoch 60, the text latent features on the one hand move
close to relevant image latent features slowly, and on the
other hand rebuild their intra-modal relationships. The latter is achieved by minimizing the reconstruction error LT
to capture the semantics of the original features. Therefore,
both QT→T and QI→T grow gradually. Comparing Figure 11a and 11c, we can see the distance of relevant latent
features in Figure 11c is larger than that in Figure 11a. The
reason is that when βT is larger, the objective function in
Equation 1 pays more effort to minimize the reconstruction
error LT . Consequently, less effort is paid to minimize the
inter-modal distance LI,T . Hence, relevant inter-modal pairs
cannot move closer. This effect is reflected as minor changes
of MAPs at epoch 31 in Figure 11f in contrast with that in
Fig. 11: Adjusting Text SAE with Different βT and Image
SAE fixed (a-d show the positions of features of image-text
pairs in 2D space)
Figure 11e. Similarly, small changes happen between Figure 11c and 11d, which leads to minor MAP changes from
epoch 32 to 60 in Figure 11f.
8.2.3 Evaluation of Model Effectiveness on NUS-WIDE
We first examine the mean average precision (MAP) of our
method using Euclidean distance against real-valued features. Let L be the dimension of the latent space. Our MSAE
is configured with 3 layers, where the image features are
mapped from 500 dimensions to 128, and finally to L. Similarly, the dimension of text features is reduced from 1000 →
128 → L by the text SAE. βI and βT are set to 0 and 0.01
respectively according to Section 8.2.2. We test L with values 16, 24 and 32. The results compared with other methods
are reported in Table 2. Our MSAE achieves the best performance for all four search tasks. It demonstrates an average improvement of 17%, 27%, 21%, and 26% for QI→I ,
QT →T ,QI→T , and QT →I respectively. CVH and CMSSH
prefer smaller L in queries QI→T and QT →I . The reason
is that they need to train far more parameters with larger L,
and the learned models will be farther from the optimal solutions. Our method is less sensitive to the value of L. This is
Table 2: Mean Average Precision on NUS-WIDE dataset

Table 3: Mean Average Precision on NUS-WIDE dataset (using Binary Latent Features)
probably because with multiple layers, MSAE has stronger
representation power and thus is more robust under different settings of L.

Figure 12 shows the precision-recall curves and the recall-candidates ratio curves (used by [43, 44]), which show the
change of recall when inspecting more results on the returned rank list. We omit the figures for QT →T and QI→I as
they show similar trends as QT →I and QI→T . Our method
shows the best accuracy except when recall is 0^10, whose precision p implies that the nearest neighbor of the query appears as the (1/p)-th returned result. This indicates that our
method performs the best for general top-k similarity retrieval except k=1. For the recall-candidates ratio, the curve
of MSAE is always above those of other methods. It shows
that we get better recall when inspecting the same number of
objects. In other words, our method ranks more relevant objects at higher (front) positions. Therefore, MSAE performs
better than other methods.
Besides real-valued features, we also conduct experiments
against binary latent features for which Hamming distance
is used as the distance function. In our implementation, we
choose Spectral Hashing [41] to convert real-valued latent
feature vectors into binary codes. Other comparison algorithms use their own conversion mechanisms. The MAP scores
are reported in Table 3. We can see that 1) MSAE still performs better than other methods. 2) The MAP scores using
Hamming distance are not as good as those of Euclidean distance. This is due to the possible information loss when converting real-valued features into binary features.
8.2.4 Evaluation of Model Effectiveness on Wiki Dataset

We conduct similar evaluations on the Wiki dataset as on NUS-WIDE. For MSAE with latent features of dimension L, the structure of its image SAE is 128 → 128 → L, and the structure of its text SAE is 1000 → 128 → L. Similar to the settings on NUS-WIDE, βI is set to 0 and βT is set to 0.01.

The performance is reported in Table 4. MAPs on the Wiki dataset are much smaller than those on NUS-WIDE except for QT→T. This is because the images of Wiki are of much lower quality. It contains only 2,000 images that are highly diversified, making it difficult to capture the semantic relationships within images, and between images and text. Query task QT→T is not affected as Wikipedia's featured articles are well edited and rich in text information. In general, our method achieves an average improvement of 8.1%, 30.4%, 32.8%, and 26.8% for QI→I, QT→T, QI→T, and QT→I respectively. We do not plot the precision-recall curves and recall-candidates ratio curves as they show similar trends to those on NUS-WIDE.

8.2.5 Evaluation of Model Effectiveness on Flickr1M

We configure a 4-layer image SAE as 3857 → 1000 → 128 → L, and a 4-layer text SAE as 2000 → 1000 → 128 → L for this dataset. Different from the other two datasets, the original image features of Flickr1M are of higher quality as they consist of both local and global features. For intra-modal search, the image latent feature performs equally well as the text latent feature. Therefore, we set both βI and βT to 0.01.

We compare the MAP of MSAE and CVH in Table 5. MSAE outperforms CVH in most of the search tasks. The results of LCMH and CMSSH cannot be reported as both methods run out of memory in the training stage.

Table 5: Mean Average Precision on Flickr1M dataset

^10 Here, recall $r = \frac{\#\text{retrieved relevant results}}{\#\text{all relevant results}} \approx 0$.
Fig. 12: Precision-Recall and Recall-Candidates Ratio on NUS-WIDE dataset (panels show QI→T and QT→I for L = 16, 24, 32)

Table 4: Mean Average Precision on Wiki dataset

Fig. 13: Training Cost Comparison on Flickr1M Dataset
8.2.6 Evaluation of Training Cost
We use the largest dataset Flickr1M to evaluate the training
cost of time and memory consumption. The results are reported in Figure 13. The training cost of LCMH and CMSSH
are not reported because they run out of memory on this
dataset. We can see that the training time of MSAE and
CVH increases linearly with respect to the size of the training dataset. Due to the stacked structure and multiple iterations of passing the dataset, MSAE is not as efficient as
CVH. Roughly, the overhead is about the number of training iterations times the height of MSAE. Possible solutions
for accelerating the MSAE training include adopting distributed deep learning [7]. We leave this as our future work.
Figure 13b shows the memory usage of the training process. Given a training dataset, MSAE splits them into minibatches and conducts the training batch by batch. It stores
the model parameters and one mini-batch in memory, both
of which are independent of the training dataset size. Hence,
the memory usage stays constant when the size of the training dataset increases. The actual minimum memory usage
for MSAE is smaller than 10GB. In our experiments, we allocate more space to load multiple mini-batches into memory to save disk reading cost. CVH has to load all training
Table 6: Statistics of Datasets for Supervised Training
Fig. 14: Querying Time Comparison Using Real-valued and
Binary Latent Features
data into memory for matrix operations. Therefore, its memory usage increases with respect to the size of the training dataset.
8.2.7 Evaluation of Query Processing Efficiency
We compare the efficiency of query processing using binary
latent features and real-valued latent features. Notice that
all methods (i.e., MSAE, CVH, CMSSH and LCMH) perform similarly in query processing after mapping the original data into latent features of same dimension. Data from
the Flickr1M training dataset is mapped into a 32 dimensional latent space to form a large dataset for searching. To
speed up the query processing of real-valued latent features,
we create an index (i.e., VA-File [40]) for each modality. For
binary latent features, we do not create any indexes, as linear scan offers decent performance as shown in Figure 14.
It shows the time of searching 50 nearest neighbors (averaged over 100 random queries) against datasets represented
using binary latent features (based on Hamming distance)
and real-valued features (based on Euclidean distance) respectively. We can see that the querying time increases linearly with respect to the dataset size for both binary and
real-valued latent features. But, the searching against binary
latent features is 10× faster than that against real-valued latent features. This is because the computation of Hamming
distance is more efficient than that of Euclidean distance.
By taking into account the results from the effectiveness evaluations, we can see that there is a trade-off between efficiency and effectiveness in feature representation. The binary encoding greatly improves efficiency at the expense of accuracy degradation.
8.3 Experimental Study of Supervised Approach
8.3.1 Datasets
Supervised training requires input image-text pairs to be associated with additional semantic labels. Since Flickr1M does
not have labels and Wiki dataset has too few labels that are
not discriminative enough, we use NUS-WIDE dataset to
evaluate the performance of supervised training. We extract
203, 400 labelled pairs, among which 150, 000 are used for
training. The remaining pairs are evenly partitioned into two
sets for validation and testing. From both sets, we randomly
select 2000 pairs as queries. This labelled dataset is named NUS-WIDE-a.
We further extract another dataset from NUS-WIDE-a
by filtering out the pairs with more than one label. This dataset
is named NUS-WIDE-b and is used to compare with DeViSE [9], which is designed for training against images annotated with single label. In total, we obtain 76, 000 pairs.
Among them, we randomly select 60, 000 pairs for training
and the rest are evenly partitioned for validation and testing.
1000 queries are randomly selected from the two datasets respectively.
8.3.2 Training Analysis
NUS-WIDE-a In Figure 15a, we plot the total training loss
L and its components (LI , LT and LI,T ) in the first 50, 000
iterations (one iteration for one mini-batch) against the NUSWIDE-a dataset. We can see that training converges rather
quickly. The training loss drops dramatically at the very beginning and then decreases slowly. This is because initially
the learning rate is large and the parameters approach quickly
towards the optimal values. Another observation is that the
intra-modal loss LI for the image modality is smaller than
LT for the text modality. This is because some tags may be
noisy or not very relevant to the associated labels that represent the main visual content in the images. It is difficult
to learn a set of parameters to map noisy tags into the latent
space and predict the ground truth labels well. The inter-modal training loss is calculated at a different scale and is normalized to be within one order of magnitude of LI and LT.
The MAPs for all types of searches using the supervised training model are shown in Figure 15b. As can be seen, the MAPs first gradually increase and then become stable in the last few iterations. It is worth noting that the MAPs are much higher than the results of unsupervised training (MSAE) in Figure 11. There are two reasons for this superiority. First, the supervised training algorithm (MDNN) exploits DCNN and NLM to learn better visual and text features, respectively.
Second, labels bring in more semantics and make the learned latent features more robust to noise in the input data (e.g., visually irrelevant tags).
Fig. 15: Visualization of Training on NUS-WIDE-a ((a) Training Loss; (b) MAP on Validation Dataset; (c) Precision-Recall on Validation Dataset)
Fig. 16: Visualization of Training on NUS-WIDE-b ((a) Training Loss; (b) MAP on Validation Dataset; (c) Precision-Recall on Validation Dataset)
Besides MAP, we also evaluate MDNN for multi-label
prediction based on precision and recall. For each image (or
text), we consider its k labels with the largest probabilities
based on Equation 9 (or 16). For the i-th image (or text),
let Ni denote the number of labels out of k that belong to
its ground truth label set, and Ti the size of its ground truth
label set. The precision and recall are defined according to
[11] as follows:
precision = (Σᵢ₌₁ⁿ Nᵢ) / (k · n),  recall = (Σᵢ₌₁ⁿ Nᵢ) / (Σᵢ₌₁ⁿ Tᵢ)
where n is the test set size. The results are shown in Figure 15c (k = 3). The performance decreases at the early stage and then goes up. This is because, at the early stage, in order to minimize the inter-modal loss, the training may perturb the pre-trained parameters substantially, which hurts the intra-modal search performance. Once the inter-modal loss is reduced to a certain level, training starts to adjust the parameters to minimize both the inter-modal and intra-modal losses, and the classification performance starts to increase. We can also see that the performance of latent text features is not as good as that of latent image features, due to the noise in tags. Under the same experimental setting as [11], our overall precision and recall are 7% and 7.5% higher, respectively, than those reported in [11].
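The overall precision and recall above can be computed as in the following sketch (the helper name is hypothetical; `probs` stands for the per-label probabilities from Equation 9 or 16, and `truth` for the ground-truth label sets):

```python
import numpy as np

def overall_precision_recall(probs, truth, k=3):
    """probs: (n, num_labels) array of predicted label probabilities.
    truth: list of n ground-truth label-index sets.
    Returns overall precision and recall over the top-k predictions."""
    n = probs.shape[0]
    hits = 0          # sum of N_i: correct labels among the top-k
    total_truth = 0   # sum of T_i: ground-truth set sizes
    for i in range(n):
        topk = set(np.argsort(-probs[i])[:k])   # k most probable labels
        hits += len(topk & truth[i])
        total_truth += len(truth[i])
    return hits / (k * n), hits / total_truth
```

Note that when every item has exactly one label (Tᵢ = 1) and k = 1, both denominators equal n, so precision and recall coincide, as observed on NUS-WIDE-b.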
NUS-WIDE-b Figure 16 shows the training results on the NUS-WIDE-b dataset. The results demonstrate patterns similar to those in Figure 15. However, the MAPs become lower, possibly due to the smaller training dataset and the fewer associated labels. In Figure 16c, the precision and recall for classification using image (or text) latent features are the same. This is because each image-text pair has only one label, so Tᵢ = 1; when we set k = 1, the denominator k · n in precision equals Σᵢ₌₁ⁿ Tᵢ in recall.
2D Visualization To demonstrate that the learned mapping functions generate semantically discriminative latent features, we extract the 8 most popular labels and, for each label, randomly sample 300 image-text pairs from the test dataset of NUS-WIDE-b. Their latent features are projected into a 2-dimensional space by t-SNE [26]. Figure 17a shows the 2-dimensional image latent features, where one point represents one image feature, and Figure 17b shows the 2-dimensional text features. Labels are distinguished using different shapes. We can see that the features are well clustered according to their labels. Further, image features and text features that are semantically relevant to the same labels are projected to similar positions in the 2D space. For example, in both figures, the red circles are on the left side, and the
Fig. 17: Visualization of Latent Features Learned by MDNN for the Test Dataset of NUS-WIDE-b ((a) Image Latent Feature; (b) Text Latent Feature; features represented by the same shapes and colors are annotated with the same label)
Table 7: Mean Average Precision using Real-valued Latent Feature
blue right triangles are in the top area. The two figures together confirm that our supervised training is very effective
in capturing semantic information for multi-modal data.
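The 2D projection in Figure 17 can be reproduced along the following lines (a sketch assuming scikit-learn's t-SNE rather than the tree-based accelerated implementation of [26]; the feature array is a random placeholder for the learned latent features, shrunk to 50 samples per label to keep the sketch fast):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless rendering
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder latent features: 8 labels, 50 samples each, 81 dimensions
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(8), 50)
features = rng.normal(size=(400, 81)) + 3.0 * labels[:, None]  # crude per-label offset

points = TSNE(n_components=2, random_state=0).fit_transform(features)

markers = ['o', 's', '^', 'v', 'D', 'P', 'X', '*']  # one shape per label, as in Fig. 17
for lab in range(8):
    sel = labels == lab
    plt.scatter(points[sel, 0], points[sel, 1], marker=markers[lab], s=8)
plt.savefig('latent_tsne.png')
```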
8.3.3 Evaluation of Model Effectiveness on NUS-WIDE
In our final experiment, we compare MDNN with DeViSE [9] in terms of the effectiveness of multi-modal retrieval. DeViSE maps image features into the text feature space. Its learning objective is to minimize the rank hinge loss based on the latent features of an image and its labels. We implement this algorithm and extend it to handle multiple labels by averaging their word vector features; we denote this algorithm as DeViSE-L. In addition, we implement a variant of DeViSE, denoted DeViSE-T, whose learning objective is to minimize the rank hinge loss based on the latent features of an image and its tag(s). Similarly, if there are multiple tags, we average their word vectors. The results are shown in Table 7. The retrieval is conducted using real-valued latent features with cosine similarity as the distance function. We can see that MDNN performs much better than both DeViSE-L and DeViSE-T for all four types of searches on both NUS-WIDE-a and NUS-WIDE-b. The main reason is that the image tags are not all visually relevant, which makes it hard for the text (tag) features to capture the visual semantics in DeViSE. MDNN exploits the label information during training,
which helps to train a model that generates more robust features against noisy input tags. Hence the performance of MDNN is better.
Fig. 18: Training Cost Comparison on the NUS-WIDE-a Dataset ((a) Training Time; (b) Memory Consumption)
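Retrieval with real-valued latent features and cosine similarity, as used for Table 7, amounts to a ranking of the following form (a sketch; the feature vectors are placeholders for the learned latent features):

```python
import numpy as np

def cosine_rank(query, candidates):
    """Rank candidate latent features by cosine similarity to a query.
    query: (d,) vector; candidates: (n, d) matrix.
    Returns candidate indices, most similar first."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q                      # cosine similarity per candidate
    return np.argsort(-sims)          # descending similarity

# Cross-modal query: an image latent feature ranked against text latent features
image_query = np.array([1.0, 0.0, 1.0])
text_feats = np.array([[1.0, 0.1, 0.9],   # similar direction
                       [0.0, 1.0, 0.0],   # orthogonal
                       [0.5, 0.5, 0.5]])
order = cosine_rank(image_query, text_feats)
```

Because all four search types (image-to-image, image-to-text, text-to-image, text-to-text) operate in the common latent space, the same ranking routine serves them all; only the source of the query and candidate features changes.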
8.3.4 Evaluation of Training Cost
We report the training cost in terms of training time (Fig. 18a) and memory consumption (Fig. 18b) on the NUS-WIDE-a dataset. The training time includes the pre-training for each single modality and the joint multi-modal training. MDNN and DeViSE-L take longer to train than MSAE, because their convolution operations are time consuming. Further, MDNN involves pre-training stages for both the image modality and the text modality, and thus incurs a longer training time than DeViSE-L. The memory footprint of MDNN is similar to that of DeViSE-L, as both methods rely on a DCNN, which accounts for most of the memory consumption. DeViSE-L uses features of higher dimensionality (100 dimensions) than MDNN (81 dimensions), which leads to the difference of about 100 megabytes shown in Fig. 18b.
8.3.5 Comparison with Unsupervised Approach
By comparing Table 7 and Table 2, we can see that the supervised approach (MDNN) performs better than the unsupervised approach (MSAE). This is not surprising because MDNN consumes more information than MSAE. Although the two methods share the same general training objective, the exploitation of label semantics helps MDNN learn better features for capturing the semantic relevance of data from different modalities. In terms of memory consumption, MDNN and MSAE perform similarly (Fig. 18b).
9 Conclusion
In this paper, we have proposed a general framework (objective) for learning mapping functions for effective multi-modal retrieval. Both intra-modal and inter-modal semantic relationships of data from heterogeneous sources are captured in the general learning objective function. Given this general objective, we have implemented one unsupervised training algorithm and one supervised training algorithm to learn the mapping functions based on deep learning techniques. The unsupervised algorithm uses stacked auto-encoders as the mapping functions for the image modality and the text modality; it only requires simple image-text pairs for training. The supervised algorithm uses an extended DCNN as the mapping function for images and an extended NLM as the mapping function for text data; label information is integrated in the training to learn mapping functions that are robust against noisy input data. The experimental results confirm the improvements of our method over previous works in search accuracy. Based on the processing strategies outlined in this paper, we have built a distributed training platform (called SINGA) to enable efficient training of large-scale deep learning models. We shall report the system architecture and its performance in future work.
10 Acknowledgments
This work is supported by A*STAR project 1321202073. Xiaoyan Yang is supported by the Human-Centered Cyber-physical Systems (HCCS) programme by A*STAR in Singapore.
References
1. Bengio Y (2012) Practical recommendations for gradient-based training of deep architectures. CoRR
2. Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. Journal of Machine Learning Research 3:1137–1155
3. Bengio Y, Courville AC, Vincent P (2013) Representation learning: A review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
4. Bronstein MM, Bronstein AM, Michel F, Paragios N (2010) Data fusion through cross-modality metric learning using similarity-sensitive hashing. In: CVPR
5. Chua TS, Tang J, Hong R, Li H, Luo Z, Zheng YT (2009) NUS-WIDE: A real-world web image database from National University of Singapore. In: Proc. of ACM Conf. on Image and Video Retrieval (CIVR'09), Santorini, Greece
6. Ciresan DC, Meier U, Gambardella LM, Schmidhuber J (2012) Deep big multilayer perceptrons for digit recognition. vol 7700, Springer, pp 581–598
7. Dean J, Corrado G, Monga R, Chen K, Devin M, Le QV, Mao MZ, Ranzato M, Senior AW, Tucker PA, Yang K, Ng AY (2012) Large scale distributed deep networks. In: NIPS, Lake Tahoe, Nevada, United States, pp 1232–
8. Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, Darrell T (2013) DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531
9. Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato M, Mikolov T (2013) DeViSE: A deep visual-semantic embedding model. In: NIPS, Lake Tahoe, Nevada, United States, pp 2121–2129
10. Girshick RB, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR 2014, Columbus, OH, pp 580–587
11. Gong Y, Jia Y, Leung T, Toshev A, Ioffe S (2013) Deep convolutional ranking for multilabel image annotation. CoRR abs/1312.4894
12. Gong Y, Lazebnik S, Gordo A, Perronnin F (2013) Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans Pattern Anal Mach Intell 35(12):2916–2929
13. Goroshin R, LeCun Y (2013) Saturating auto-encoder. CoRR abs/1301.3577
14. Hinton G (2010) A practical guide to training restricted Boltzmann machines. Tech. rep.
15. Hinton G, Salakhutdinov R (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
16. Hjaltason GR, Samet H (2003) Index-driven similarity search in metric spaces. ACM Trans Database Syst
17. Huiskes MJ, Lew MS (2008) The MIR Flickr retrieval evaluation. In: Multimedia Information Retrieval
18. Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick RB, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. In:
MM’14, Orlando, FL, USA, 2014, pp 675–678
19. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. rep.
20. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet
classification with deep convolutional neural networks.
In: Advances in Neural Information Processing Systems
25, pp 1106–1114
21. Kumar S, Udupa R (2011) Learning hash functions for
cross-view similarity search. In: IJCAI, pp 1360–1365
22. LeCun Y, Bottou L, Orr G, Müller K (1998) Efficient
BackProp. In: Orr G, Müller KR (eds) Neural Networks: Tricks of the Trade, Lecture Notes in Computer
Science, vol 1524, Springer Berlin Heidelberg, Berlin,
Heidelberg, chap 2, pp 9–50
23. Liu D, Hua X, Yang L, Wang M, Zhang H (2009) Tag
ranking. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid,
Spain, April 20-24, 2009, pp 351–360, DOI 10.1145/
24. Liu W, Wang J, Kumar S, Chang SF (2011) Hashing
with graphs. In: ICML, pp 1–8
25. Lu X, Wu F, Tang S, Zhang Z, He X, Zhuang Y (2013) A low rank structural large margin method for cross-modal ranking. In: SIGIR, pp 433–442
26. van der Maaten L (2014) Accelerating t-SNE using
Tree-Based Algorithms. Journal of Machine Learning
Research 15:3221–3245
27. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press
28. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space.
CoRR abs/1301.3781
29. Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY
(2011) Multimodal deep learning. In: ICML, pp 689–
30. Rasiwasia N, Pereira JC, Coviello E, Doyle G, Lanckriet GRG, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: ACM
Multimedia, pp 251–260
31. Rifai S, Vincent P, Muller X, Glorot X, Bengio Y (2011)
Contractive auto-encoders: Explicit invariance during
feature extraction. In: ICML, pp 833–840
32. Salakhutdinov R, Hinton GE (2009) Semantic hashing.
Int J Approx Reasoning 50(7):969–978
33. Socher R, Manning CD (2013) Deep learning for NLP
(without magic). In: Human Language Technologies:
Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings,
June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta,
Georgia, USA, pp 1–3
34. Socher R, Pennington J, Huang EH, Ng AY, Manning
CD (2011) Semi-supervised recursive autoencoders for
predicting sentiment distributions. In: EMNLP, pp 151–
35. Song J, Yang Y, Huang Z, Shen HT, Hong R (2011) Multiple feature hashing for real-time large scale near-duplicate video retrieval. In: MM 2011, ACM, pp 423–
36. Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013)
Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: SIGMOD Conference, pp
37. Srivastava N, Salakhutdinov R (2012) Multimodal
learning with deep boltzmann machines. In: NIPS, pp
38. Vincent P, Larochelle H, Bengio Y, Manzagol PA
(2008) Extracting and composing robust features with
denoising autoencoders. In: ICML, pp 1096–1103
39. Wang W, Ooi BC, Yang X, Zhang D, Zhuang Y (2014)
Effective multi-modal retrieval based on stacked autoencoders. PVLDB 7(8):649–660
40. Weber R, Schek HJ, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB, pp 194–205
41. Weiss Y, Torralba A, Fergus R (2008) Spectral hashing.
In: NIPS, pp 1753–1760
42. Zhang D, Agrawal D, Chen G, Tung AKH (2011) Hashfile: An efficient index structure for multimedia data.
In: ICDE 2011, April 11-16, 2011, Hannover, Germany,
IEEE Computer Society, pp 1103–1114
43. Zhen Y, Yeung DY (2012) A probabilistic model for
multimodal hash function learning. In: KDD, pp 940–
44. Zhu X, Huang Z, Shen HT, Zhao X (2013) Linear cross-modal hashing for efficient multimedia search. In: ACM Multimedia Conference, MM '13, Barcelona, Spain, October 21-25, 2013, pp 143–152
45. Zhuang Y, Yang Y, Wu F (2008) Mining semantic correlation of heterogeneous multimedia data for
cross-media retrieval. IEEE Transactions on Multimedia 10(2):221–229
Wei Wang, Xiaoyan Yang, Beng Chin Ooi, Dongxiang Zhang, Yueting Zhuang
A Appendix
Algorithm 1 MiniBatchSGD(D, b, f, α)
Input: D, training dataset
Input: b, batch size
Input: f, the initial mapping function
Input: α, learning rate // set manually
Output: f, updated mapping function
1. init θ // init all model parameters of f
2. repeat
3.   for i = 0 to |D|/b do
4.     B = D[i ∗ b : i ∗ b + b]
5.     ∇θ = average(BackPropagation(x, f) for x ∈ B)
6.     θ = θ − α ∗ ∇θ
7. until converge
8. return f
Algorithm 2 BackPropagation(x0, f)
Input: x0, input feature vector
Input: f, the mapping function with parameter set θ
Output: ∇θ, gradients of the parameters of f
1. h ← height of f, i.e., the number of layers
2. {xi} (i = 1..h) ← f(x0 | θ) // forward through all layers
3. δh = ∂L/∂xh
4. for i = h to 1 do
5.   ∇θi = δi ∗ ∂xi/∂θi // i.e., ∂L/∂θi
6.   δi−1 = δi ∗ ∂xi/∂xi−1
7. return {∇θi} (i = 1..h)
In this section, we present the mini-batch Stochastic Gradient Descent (mini-batch SGD) algorithm and the Back-Propagation (BP) algorithm [22], which are used throughout this paper to train MSAE and MDNN.
Mini-batch SGD minimizes the objective loss (e.g., L, LI , LT ) by
updating the parameters involved in the mapping function(s) based on
the gradients of the objective w.r.t. the parameters. Specifically, it iterates over the whole dataset to extract mini-batches (Line 4). For each mini-batch, it averages the gradients computed by BP (Line 5) and updates the parameters (Line 6).
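Algorithm 1 can be written out as the following minimal sketch (a toy quadratic loss stands in for L, LI or LT; the descent step subtracts α ∗ ∇θ to reduce the loss):

```python
import numpy as np

def minibatch_sgd(data, batch_size, theta, alpha, grad_fn, epochs=100):
    """Minimal mini-batch SGD: average per-example gradients over each
    mini-batch and take a gradient descent step."""
    for _ in range(epochs):
        for i in range(len(data) // batch_size):
            batch = data[i * batch_size:(i + 1) * batch_size]
            g = np.mean([grad_fn(x, theta) for x in batch], axis=0)
            theta = theta - alpha * g  # descend the averaged gradient
    return theta

# Toy objective: L(theta) = mean over x of (theta - x)^2, minimised at mean(data)
data = np.array([1.0, 2.0, 3.0, 4.0])
theta = minibatch_sgd(data, 2, np.array(0.0), 0.1,
                      grad_fn=lambda x, t: 2 * (t - x))
```

With a fixed learning rate the parameter oscillates slightly around the optimum (here 2.5), mirroring the behaviour described above where a large initial rate moves the parameters quickly towards their optimal values.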
BP calculates the gradients of the objective loss (e.g., L, LI, LT) w.r.t. the parameters involved in the mapping function (e.g., fI, fT) using the chain rule (Equations 19, 20). It forwards the input feature vector through all layers of the mapping function (Line 2). Then it propagates the gradients backwards according to the chain rule (Lines 4-6). θi denotes the parameters of the i-th layer. The gradients are returned at Line 7.
∂L/∂θi = (∂L/∂xi) · (∂xi/∂θi)  (19)
∂L/∂xi−1 = (∂L/∂xi) · (∂xi/∂xi−1)  (20)