Supervised Semantic Indexing

Bing Bai, Jason Weston, David Grangier, Ronan Collobert
NEC Labs America, Princeton, NJ

Olivier Chapelle, Kilian Weinberger
Yahoo! Research, Santa Clara, CA
ABSTRACT
In this article we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI, our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models over all pairs of word features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations and correlated feature hashing (CFH). We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing methods; H.3.3 [Information Search and Retrieval]: Retrieval models
General Terms
Algorithms
Keywords
semantic indexing, learning to rank, content matching
1. INTRODUCTION
In this article we study the task of learning to rank documents, given a query, by modeling their semantic content.
Although natural language can express the same concepts in
many different ways using different words, classical ranking
algorithms do not attempt to model the semantics of language, and simply measure the word overlap between texts.
For example, a classical vector space model, see e.g. [1], uses
weighted word counts (e.g. via tf-idf) as a feature representation of a text, and the cosine similarity for comparing to
other texts. If two words in query and document texts mean
the same thing but are different unique strings, there is no
contribution to the matching score derived from this semantic similarity. Indeed, if the texts do not share any words at
all, no match is inferred.
There exist several unsupervised learning methods to try
to model semantics, in particular Latent Semantic Indexing [11], and related methods such as pLSA and LDA [19,
3]. These methods choose a low dimensional feature representation of “latent concepts” that is constructed via a
linear mapping from the (bag of words) content of the text.
This mapping is learnt with a reconstruction objective, either based on mean squared error (LSI) or likelihood (pLSA,
LDA). As these models are unsupervised, they may not learn
a matching score that works well for the task of interest. Supervised LDA (sLDA) [2] has been proposed where a set of
auxiliary labels are trained on jointly with the unsupervised
task. However, the supervised task is not a task of learning
to rank because the supervised signal is at the document
level and is query independent.
In this article we propose Supervised Semantic Indexing
(SSI) which defines a class of models that can be trained
on a supervised signal (i.e., labeled data) to provide a ranking of a database of documents given a query. This signal
is defined at the (query,documents) level and can either be
point-wise — for instance the relevance of the document to
the query — or pairwise — a given document is better than
another for a given query. In this work, we focus on pairwise preferences. For example, if one has click-through data
yielding query-target relationships, one can use this to train
these models to perform well on this task [22]. Or, if one is
interested in finding documents related to a given query document, one can use known hyperlinks to learn a model that
performs well on this task [16]. Moreover, our approach can model queries and documents separately, which accommodates differing word distributions between documents and queries. This might be important in cases like matching advertisements to web pages, where the two distributions are different and a good match does not necessarily have overlapping words.
Learning to rank as a supervised task is not a new subject; however, most methods and models have typically relied on
optimizing over only a few hand-constructed features, e.g.
based on existing vector space models such as tf-idf, the title, URL, PageRank and other information, see e.g. [22, 5].
Our work is orthogonal to those works, as it presents a way
of learning a model for query and target texts by considering features generated by all pairs of words between the
two texts. The difficulty here is that such feature spaces
are very large and we present several models that deal with
memory, speed and capacity control issues. In particular
we propose constraints on our model that are diagonal preserving but otherwise low rank and a technique of hashing
features (sharing weights) based on their correlation, called
correlated feature hashing (CFH). In fact, both our proposed
methods can be used in conjunction with other features and
methods explored in previous work for further gains.
We show experimentally on retrieval tasks developed from
Wikipedia that our method strongly outperforms word-feature
based models such as tf-idf vector space models, LSI and
other baselines on document-document and query-document
tasks. Finally, we give results on an Internet advertising task
using proprietary data from an online advertising company.
The rest of this article is organized as follows. Section 2 describes our method, Section 3 discusses prior work, Section 4 presents the experimental study of our method, and Section 5 concludes with a discussion.
2. SUPERVISED SEMANTIC INDEXING
Let us denote the set of documents in the corpus as {d_t}_{t=1}^ℓ ⊂ R^D and a query text as q ∈ R^D, where D is the dictionary size, and the j-th dimension of a vector indicates the frequency of occurrence of the j-th word, e.g. using the tf-idf weighting and then normalizing to unit length [1]. (In fact, in our resulting methods there is no need to restrict q and d to the same dimensionality D, but we will make this assumption for simplicity of exposition.)
2.1 Basic Model
The set of models we propose are all special cases of the following type of model:

    f(q, d) = q^T W d = \sum_{i,j=1}^{D} q_i W_{ij} d_j                (1)

where f(q, d) is the score between a query q and a given document d, and W ∈ R^{D×D} is the weight matrix, which will be learned from a supervised signal. This model can capture synonymy and polysemy (hence the term "semantic" in the name of the algorithm) as it looks at all possible cross terms, and can be tuned directly for the task of interest. We do not use stemming since our model can already match words with common stems (if it is useful for the task). Note that negative correlations via negative values in the weight matrix W can also be encoded.

Expressed in another way, given the pair q, d we are constructing the joint feature map:

    Φ_{(i-1)D+j}(q, d) = (q d^T)_{ij}                                  (2)

where Φ_s(·) is the s-th dimension in our feature space, and choosing the set of models:

    f(q, d) = w · Φ(q, d).                                             (3)

Note that a model taking pairs of words as features is essential here; a simple approach concatenating (q, d) into a single vector and using f(q, d) = w · [q, d] is not a viable option as it would result in the same document ordering for any query.

We could train any standard method such as a ranking perceptron or a ranking SVM using our choice of features. However, without further modifications, this basic approach has a number of problems in terms of speed, storage space and capacity, as we will now discuss.
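To make the scoring function concrete, here is a minimal sketch (our own illustration, not the authors' released code) of evaluating equation (1) for sparse tf-idf vectors stored as dictionaries; only the nonzero cross terms are visited.

    import numpy as np

    def score_basic(q, d, W):
        """SSI basic model score f(q, d) = q^T W d of equation (1).

        q, d : sparse vectors as dicts {word index: tf-idf weight}.
        W    : dense numpy array of shape (D, D).
        Only the m * n nonzero cross terms q_i * W[i, j] * d_j are computed.
        """
        return sum(qi * W[i, j] * dj
                   for i, qi in q.items()
                   for j, dj in d.items())

    # Toy usage with a tiny dictionary of D = 4 words.
    D = 4
    W = np.eye(D)                 # W = I recovers the plain word-overlap (cosine-style) score
    q = {0: 0.7, 2: 0.7}          # sparse, unit-normalized tf-idf query
    d = {2: 0.6, 3: 0.8}          # sparse document
    print(score_basic(q, d, W))   # prints approximately 0.42: only the shared word (index 2) contributes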
Efficiency of a dense W matrix.
We analyze both memory and speed considerations. Firstly, this method so far assumes that W fits in memory (unless sparsity is somehow enforced). If the dictionary size is D = 30,000, then this requires 3.4GB of RAM (assuming floats), and if the dictionary size is 2.5 million (as it will be in our experiments in Section 4) this amounts to 14.5 terabytes. The vectors q and d are sparse, so computing the score of a single query-document pair involves mn computations q_i W_{ij} d_j, where q and d have m and n nonzero terms, respectively. We have found this is reasonable for training, but may be an issue at test time. (Of course, any method can be sped up by applying it to only a subset of pre-filtered documents, filtering using some faster method.) Alternatively, one can compute v = q^T W once, and then compute v d for each document. This is the same speed as a classical vector space model where the query contains D terms, assuming W is dense. The capacity of this model is also obviously rather large. As every pair of words between query and target is modeled separately, any pair not seen during the training phase will not have its weight trained. Regularizing the weights so that unseen pairs have W_{ij} = 0 is thus essential (this is discussed in Section 2.3). However, this is still not ideal, and clearly a huge number of training examples will be necessary to train so many weights, most of which are not used for any given training pair (q, d).

Overall, a dense matrix W is challenging in terms of memory footprint, computation time and controlling its capacity for good generalization. In the next section we describe ways of improving over this basic approach.
2.2 Improved Model: Low Rank (diagonal preserving) W Matrices

An efficient scheme is to constrain W in the following way:

    W = U^T V + I.                                                     (4)

Here, U and V are N × D matrices. This induces an N-dimensional "latent concept" space in a similar way to LSI. However, it differs in several ways:

• Most importantly, it is trained from a supervised signal using preference relations (ranking constraints).

• Further, U and V differ, so it does not assume the query and target document should be embedded in the same way. This can hence model cases when the query text distribution is very different to the document text distribution, e.g. the queries are typically short and have different word occurrence and co-occurrence statistics. In content matching, query and target texts could be quite different and are naturally modeled in this setup.

• Finally, the addition of the identity term means this model automatically learns the tradeoff between using the low dimensional space and a classical vector space model. This is important because the diagonal of the W matrix gives the specificity of picking out when a word co-occurs in both documents (indeed, setting W = I is equivalent to cosine similarity using tf-idf, see below). The matrix I is full rank and therefore cannot be approximated with the low rank model U^T V, so our model combines both. Note that the weights of U and V are learnt, so one does not necessarily need a weighting parameter multiplied by I.
However, the efficiency and memory footprint are as favorable as LSI. Typically, one caches the N-dimensional representation for each document to use at query time.

We also highlight several other regularization variants, which are further possible ways of constraining W:

• W = I: if q and d are normalized tf-idf vectors, this is equivalent to using the standard cosine similarity with no learning (and no synonymy or polysemy).

• W = D, where D is a diagonal matrix: one learns a re-weighting of tf-idf using labeled data (still no synonymy or polysemy). This is similar to a method proposed in [16].

• W = U^T U + I: we constrain the model to be symmetric; the query and target document are treated in the same way.
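As a rough illustration of the low rank constraint (a sketch under our own variable names, not the paper's implementation), the score under W = U^T V + I can be computed without ever materializing W, and the N-dimensional embeddings U q and V d can be cached per query and per document:

    import numpy as np

    def score_low_rank(q, d, U, V):
        """SSI score with W = U^T V + I, computed without forming the D x D matrix:
           f(q, d) = (U q) . (V d) + q . d
        q, d : length-D vectors (sparse in practice); U, V : arrays of shape (N, D)."""
        return np.dot(U @ q, V @ d) + np.dot(q, d)

    # Sanity check against the explicit W = U^T V + I on a small random instance.
    rng = np.random.default_rng(0)
    D, N = 200, 20
    U, V = rng.normal(size=(N, D)), rng.normal(size=(N, D))
    q, d = rng.random(D), rng.random(D)
    W = U.T @ V + np.eye(D)
    assert np.isclose(score_low_rank(q, d, U, V), q @ W @ d)

In a retrieval setting one would precompute V d for every document in the index, so ranking against a query only costs one N-dimensional dot product plus the sparse overlap q · d per candidate.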
2.2.1 Correlated Feature Hashing
Another way to both lower the capacity of our model and
decrease its storage requirements is to share weights among
features.
Hash Kernels (Random Hashing of Words).
In [29] the authors proposed a general technique called "Hash Kernels" where they approximate the feature representation Φ(x) with:

    Φ̄_j(x) = \sum_{i ∈ W : h(i) = j} Φ_i(x)

where h : W → {1, . . . , H} is a hash function that reduces the feature space down to H dimensions while maintaining sparsity, and W is the set of initial feature indices. The software Vowpal Wabbit (http://hunch.net/~vw/) implements this idea (as a regression task) for joint feature spaces on pairs of objects, e.g. documents. In this case, the hash function used for a pair of words (s, t) is h(s, t) = mod(sP + t, H), where P is a large prime. This yields

    Φ̄_j(q, d) = \sum_{(s,t) ∈ {1,...,D}^2 : h(s,t) = j} Φ_{s,t}(q, d)            (5)

where Φ_{s,t}(·) indexes the feature on the word pair (s, t), e.g. Φ_{s,t}(·) = Φ_{(s-1)D+t}(·). This technique is equivalent to sharing weights, i.e. constraining W_{st} = W_{kl} when h(s, t) = h(k, l). In this case, the sharing is done pseudo-randomly, and collisions in the hash table generally result in sharing weights between term pairs that share no meaning in common.
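For illustration, a minimal sketch of this random word-pair hashing (our own simplification; it is not the Vowpal Wabbit code, and the prime P below is an arbitrary choice):

    def hash_pair(s, t, H, P=1_000_003):
        """Random hashing of the word pair (s, t) into one of H buckets: h(s, t) = (s*P + t) mod H."""
        return (s * P + t) % H

    def hash_kernel_features(q, d, H):
        """Hash-kernel feature map of equation (5): bucket j accumulates the sum of q_s * d_t
        over all nonzero word pairs (s, t) with h(s, t) = j.  q, d: dicts {word index: weight}."""
        phi = {}
        for s, qs in q.items():
            for t, dt in d.items():
                j = hash_pair(s, t, H)
                phi[j] = phi.get(j, 0.0) + qs * dt
        return phi

    # The model then scores with a single weight vector w of length H:
    #   f(q, d) = sum_j w[j] * phi[j]
    # so any two word pairs that collide in the hash table are forced to share a weight.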
Correlated Feature Hashing.
We thus suggest a technique to share weights (or equivalently hash features) so that collisions actually happen for terms with close meaning. For that purpose, we first sort the words in our dictionary in frequency order, so that i = 1 is the most frequent and i = D is the least frequent. For each word i = 1, . . . , D, we calculate its DICE coefficient [30] with respect to each word j = 1, . . . , F among the top F most frequent words:

    DICE(i, j) = 2 · cooccur(i, j) / (occur(i) + occur(j))

where cooccur(i, j) counts the number of co-occurrences of i and j at the document or sentence level, and occur(i) is the total number of occurrences of word i. Note that these scores can be calculated from a large corpus of completely unlabeled documents. For each i, we sort the F scores (largest first) so that S_p(i) ∈ {1, . . . , F} corresponds to the index of the p-th largest DICE score DICE(i, S_p(i)). We can then use the Hash Kernel approximation Φ̄(·) given in equation (5), relying on the "hashing" function:

    h(i, j) = (S_1(i) − 1)F + S_1(j)

This strategy is equivalent to pre-processing our documents and replacing all the words indexed by i with S_1(i). Note that we have reduced our feature space from D^2 features to H = F^2 features. This reduction can be important, as shown in our experiments in Section 4: e.g. for our Wikipedia experiments, we have F = 30,000 as opposed to D = 2.5 million.
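A rough sketch of how the DICE-based mapping might be computed from an unlabeled corpus (our own illustration; function and variable names are not from the paper):

    from collections import Counter

    def dice_buckets(docs, F, k=1):
        """For every word, return its k most DICE-correlated words among the F most frequent ones,
        with DICE(i, j) = 2 * cooccur(i, j) / (occur(i) + occur(j)) counted at the document level.
        docs : list of token lists (a large unlabeled corpus)."""
        occur = Counter(w for doc in docs for w in set(doc))
        frequent = [w for w, _ in occur.most_common(F)]
        freq_set = set(frequent)
        cooccur = Counter()
        for doc in docs:
            words = set(doc)
            for i in words:
                for j in words & freq_set:
                    cooccur[(i, j)] += 1      # includes j == i, so DICE(i, i) = 1 for frequent words
        S = {}
        for i in occur:
            scored = sorted(((2.0 * cooccur[(i, j)] / (occur[i] + occur[j]), j) for j in frequent),
                            reverse=True)
            S[i] = [j for _, j in scored[:k]]
        return S

    # Replacing every word i of a document by S[i][0] (its most DICE-correlated frequent word)
    # implements the single-bin hashing h(i, j) = (S_1(i) - 1)F + S_1(j) described above;
    # a frequent word simply maps to itself.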
Moreover, we can also combine correlated feature hashing with the low rank W matrix constraint described in Section 2.2. In that case U and V are reduced from N × D matrices to N × F matrices, because the set of features is no longer the entire dictionary but the first F words.
Correlated Feature Hashing by Multiple Binning.
It is also suggested in [29] to hash a feature Φ_i(·) so that it contributes to multiple features Φ̄_j(·) in the reduced feature space. This strategy theoretically lessens the consequence of collisions. In our case, we can construct multiple hash functions from the values S_p(·), p = 1, . . . , k, i.e. the top k correlated words according to their DICE scores:

    Φ̄_j(q, d) = (1/k) \sum_{p = 1,...,k} \sum_{(s,t) ∈ {1,...,D}^2 : h_p(s,t) = j} Φ_{s,t}(q, d)        (6)

where

    h_p(s, t) = (S_p(s) − 1)F + S_p(t).                                (7)
Pseudocode for constructing the feature map Φ̄(q, d) is given in Algorithm 1. Equation (6) defines the reduced feature space as the mean of k feature maps, which are built using hashing functions based on the p = 1, . . . , k most correlated words. Equation (7) defines the hash function for a pair of words i and j using the p-th most correlated words S_p(i) and S_p(j). That is, the new feature space consists of, for each word in the original document, the top k most correlated words from the set of size F of the most frequently occurring words. Hence, as before, there are never more than H = F^2 possible features. Overall, this is in fact equivalent to pre-processing our documents and replacing all the words indexed by i with S_1(i), . . . , S_k(i), with appropriate weights.

Algorithm 1 Construction of the correlated feature hashing feature map (6)
  Initialize the F-dimensional vectors a and b to 0.
  for p = 1, . . . , k do
    for all i = 1, . . . , D such that q_i > 0 or d_i > 0 do
      j ← S_p(i)
      a_j ← a_j + q_i
      b_j ← b_j + d_i
    end for
  end for
  ∀ i, j ∈ [1 . . . F],  Φ̄_{(i−1)F+j}(q, d) ← (1/k) (a b^T)_{ij}
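The following is a small Python rendering of Algorithm 1 as written (a sketch, assuming S maps each word index to the 0-based indices of its k most correlated frequent words):

    import numpy as np

    def cfh_feature_map(q, d, S, F, k):
        """Correlated feature hashing feature map of Algorithm 1.

        q, d : sparse dicts {word index: weight} over the full dictionary.
        S    : dict mapping each word index i to a list of its k most correlated
               frequent-word indices, each in {0, ..., F-1}.
        Returns an F x F array whose (i, j) entry is the reduced feature at index i*F + j.
        """
        a = np.zeros(F)
        b = np.zeros(F)
        for p in range(k):
            for i, qi in q.items():
                a[S[i][p]] += qi
            for i, di in d.items():
                b[S[i][p]] += di
        # For large F one would keep a and b sparse and avoid materializing the full outer product.
        return np.outer(a, b) / k

Note that, exactly as in the pseudocode, a and b are accumulated over all k bins before the single outer product is taken.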
Hashing n-grams.
One can also use these techniques to incorporate n-gram
features into the model without requiring a huge feature
representation that would have no way of fitting in memory.
We simply use the DICE coefficient between an n-gram i
and the first F words j = 1, . . . , F , and proceed as before.
In fact, our feature space size does not increase at all, and
we are free to use any value of n.
2.3 Training Methods
We now discuss how to train the models we have described
in the previous section.
2.3.1 Training the Basic Model
Suppose we are given a set of tuples R (labeled data), where each tuple contains a query q, a relevant document d+ and an irrelevant (or lower ranked) document d−. We would like to choose W such that q^T W d+ > q^T W d−, expressing that d+ should be ranked higher than d−.

For that purpose, we employ the margin ranking loss [18], which has already been used in several IR methods before [22, 5, 16], and minimize:

    \sum_{(q, d+, d−) ∈ R} max(0, 1 − q^T W d+ + q^T W d−).            (8)

This optimization problem is solved through stochastic gradient descent (see, e.g. [5]): iteratively, one picks a random tuple and makes a gradient step for that tuple:

    W ← W + λ (q (d+)^T − q (d−)^T),   if 1 − q^T W d+ + q^T W d− > 0

Obviously, one should exploit the sparsity of q and d when calculating these updates. To train our model, we choose the (fixed) learning rate λ which minimizes the training error. We also suggest initializing the training with W = I, as this initializes the model to the same solution as a cosine similarity score. This introduces a prior expressing that the weight matrix should be close to I, considering term correlation only when it is necessary to increase the score of a relevant document, or conversely decrease the score of a non-relevant document. Training is terminated when the error on a validation set is no longer improving.
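As an illustration only (not the authors' implementation), one stochastic update of the loss (8) for the dense-W model can be written as follows; the sparsity of q, d+ and d− is exploited so that only the touched entries of W are read or written:

    import random

    def sgd_step_dense(W, q, d_pos, d_neg, lam):
        """One margin-ranking SGD step for f(q, d) = q^T W d (equation (8)).
        q, d_pos, d_neg : sparse dicts {word index: weight}; W : dense (D, D) array; lam : learning rate."""
        def score(d):
            return sum(qi * W[i, j] * dj for i, qi in q.items() for j, dj in d.items())
        if 1.0 - score(d_pos) + score(d_neg) > 0:      # the margin is violated
            for i, qi in q.items():                    # W <- W + lam * q (d+ - d-)^T, sparse entries only
                for j, dj in d_pos.items():
                    W[i, j] += lam * qi * dj
                for j, dj in d_neg.items():
                    W[i, j] -= lam * qi * dj

    def train_basic(W, triplets, lam, epochs=1):
        """triplets: list of (q, d_pos, d_neg) tuples sampled from the labeled set R.
        W would typically be initialized to the identity, as suggested in the text."""
        for _ in range(epochs):
            for q, d_pos, d_neg in random.sample(triplets, len(triplets)):
                sgd_step_dense(W, q, d_pos, d_neg, lam)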
Stochastic training is highly scalable and is easy to implement for our model. Our method thus far is a margin ranking perceptron [9] with a particular choice of features (2).
It thus involves a convex optimization problem and is hence
related to a ranking SVM [18, 22], except we have a highly
scalable optimizer. However, we note that such optimization cannot be easily applied to probabilistic methods such
as pLSA because of their normalization constraints. Recent
methods like LDA [3] also suffer from scalability issues.
Researchers have also explored optimizing various alternative loss functions other than the ranking loss including
optimizing normalized discounted cumulative gain (NDCG)
and mean average precision (MAP) [5, 6, 7, 33]. In fact, one
could use those optimization strategies to train our models
instead of optimizing the ranking loss as well.
2.3.2 Training with a Low Rank W Matrix
When the W matrix is constrained, e.g. W = U^T V + I, training is done in a similar way to before, but in this case by making a gradient step to optimize the parameters U and V:

    U ← U + λ V (d+ − d−) q^T,   if 1 − f(q, d+) + f(q, d−) > 0
    V ← V + λ U q (d+ − d−)^T,   if 1 − f(q, d+) + f(q, d−) > 0.
Note this is no longer a convex optimization problem. In our
experiments we initialized the matrices U and V randomly
using a normal distribution with mean zero and standard
deviation one.
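A corresponding sketch for the low rank parameterization (ours, not the paper's code), using dense vectors for brevity and applying both gradient steps with the pre-update parameter values:

    import numpy as np

    def sgd_step_low_rank(U, V, q, d_pos, d_neg, lam):
        """One margin-ranking SGD step for f(q, d) = (U q).(V d) + q.d, i.e. W = U^T V + I.
        U, V : (N, D) arrays updated in place; q, d_pos, d_neg : length-D vectors; lam : learning rate."""
        f = lambda d: np.dot(U @ q, V @ d) + np.dot(q, d)
        if 1.0 - f(d_pos) + f(d_neg) > 0:              # hinge is active
            diff = d_pos - d_neg
            Uq = U @ q                                 # keep the old U q for V's update
            U += lam * np.outer(V @ diff, q)           # U <- U + lam * V (d+ - d-) q^T
            V += lam * np.outer(Uq, diff)              # V <- V + lam * U q (d+ - d-)^T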
2.3.3 Training with Feature Hashing
The same training techniques as described above can be
used for training with feature hashing, just with a different
choice of feature map, dependent on the hashing technique
chosen.
2.4 Applications

2.4.1 Standard Retrieval
We consider two standard retrieval models: returning relevant documents given a keyword-based query, and finding
related documents with respect to a given query document,
which we call the query-document and document-document
tasks.
Our methods can naturally be trained to solve these tasks. We note here that so far our models have only included features based on the bag-of-words model, but there is nothing stopping us from adding other kinds of features as well. Typical
choices include: features based on the title, body, words in
bold font, the popularity of a page, its PageRank, the URL,
and so on, see e.g. [1]. However, in this paper for clarity
and simplicity our experiments use a setup where we only
use raw words.
2.4.2 Content Matching
Our models can also be applied to identify two differing
types of text as a matching pair, for example a sequence of
text (which could be a web page or an email or a chat log)
with a targeted advertisement. In this case, click-through
data can provide supervision. Here, again for simplicity, we
assume both text and advert are represented as words. In
practice, however, other types of engineered features could
be added for optimal performance.
3. PRIOR WORK
A tf-idf vector space model and LSI [11] are the two main baselines we will compare to. We already mentioned that pLSA [19] and LDA [3] both have scalability problems and are not reported to generally outperform LSA and TF-IDF [13]. Moreover, in the introduction we discussed how sLDA [2] provides supervision at the document level (via a class label or regression value) and is not a learning-to-rank task, whereas here we study supervision at the (query, documents) level. In this section, we discuss other relevant methods.
In [16] the authors learned the weights of an orthogonal vector space model on Wikipedia links, improving over the OKAPI method. Joachims et al. [22] trained an SVM with hand-designed features based on the title, body, search engine rankings and the URL. Burges et al. [5] proposed a neural network method using a similar set of features (569 in total). In contrast, we limit ourselves to body text (not using title, URL, etc.) and train on at most D^2 = 900 million features.
Query Expansion, often referred to as blind relevance feedback, is another way to deal with synonyms, but requires
manual tuning and does not always yield a consistent improvement [34].
The authors of [17] used a model related to the ones we describe, but for the task of image retrieval, and [15] also used a related (regression-based) method for advert placement. Both use the idea of the cross-product feature space in the perceptron algorithm as in equation (2), which is implemented in the software related to these two publications, PAMIR (http://www.idiap.ch/pamir/) and Vowpal Wabbit (http://hunch.net/~vw/). The task of document retrieval, and the use of low rank matrices, is not studied in those works.
Several authors [28, 23] have proposed interesting nonlinear versions of (unsupervised) LSI using neural networks and showed they outperform LSI or pLSA. However, in the case of [28] we note their method might require considerable computational expense, and hence they only used a dictionary size of 2,000. Finally, [31] proposes a "supervised" LSI for classification. This has a similar-sounding title to ours, but is quite different because it is based on applying LSI to document classification rather than improving ranking via known preference relations. The authors of [12] proposed "Explicit Semantic Analysis", which represents the meaning of texts in a high-dimensional space of concepts by building a feature space derived from the human-organized knowledge of an encyclopedia, e.g. Wikipedia. In the new space, cosine similarity is applied. SSI could also be applied to such feature representations, so that they are no longer agnostic to a particular supervised task.
Another related area of research is distance metric learning [32, 21, 14]. Methods like LMNN [32] also learn a model similar to the basic model (Section 2.1) with the full matrix W (but not with our improvements to this model). They constrain W to be a positive semidefinite matrix during the optimization. Their method has considerable computational cost: for example, even after considerable optimization of the algorithm, it still takes 3.5 hours to train on 60,000 examples and 169 features (a pre-processed version of MNIST). This would hence not be scalable for large scale text ranking experiments. Nevertheless, Chechik et al. compared LMNN [32], LEGO [21] and MCML [14] to a stochastic gradient method with a full matrix W (the basic model of Section 2.1) on a small image ranking task and report in fact that the stochastic method provides both improved results and efficiency (oral presentation at the Snowbird Machine Learning Workshop, see http://snowbird.djvuzone.org/abstracts/119.pdf).
4. EXPERIMENTAL STUDY
Learning a model of term correlations over a large vocabulary is a considerable challenge that requires a large amount of training data. Standard retrieval datasets like TREC (http://trec.nist.gov/) or LETOR [24] contain only a few hundred training queries, and are hence too small for that purpose. Moreover, some datasets only provide a few pre-processed features like PageRank or BM25, and not the actual words. Click-through data from web search engines could provide valuable supervision. However, such data is not publicly available, and hence experiments on such data are not reproducible.
We hence conducted most experiments on Wikipedia and
used links within Wikipedia to build a large scale ranking
task. Thanks to its abundant, high-quality labeling and
structuring, Wikipedia has been exploited in a number of
applications such as disambiguation [4, 10], text categorization [26, 20], relationship extraction [27, 8], and searching
[25] etc. Specifically, Wikipedia link structures were also
used in [26, 25, 27].
We considered several tasks: document-document retrieval, described in Section 4.1, and query-document retrieval, described in Section 4.2. In Section 4.3 we also give results on an Internet advertising task using proprietary data from an online advertising company.
In these experiments we compare our approach, Supervised Semantic Indexing (SSI), to the following methods: tf-idf with cosine similarity (TFIDF), Query Expansion (QE), LSI (we use the SVDLIBC software, http://tedlab.mit.edu/~dr/svdlibc/, and the cosine distance in the latent concept space), αLSI + (1 − α)TFIDF, and a margin ranking perceptron using Hash Kernels. Note that SSI with an "unconstrained W" is just a margin ranking perceptron with a particular choice of feature map, and SSI using hash kernels is the approach of [29] employing a ranking loss. For LSI we report the best value of α and embedding dimension (50, 100, 200, 500, 750 or 1000), optimized on the training set ranking loss. We then report the low rank version of SSI using the same choice of dimension. Query Expansion involves applying TFIDF and then adding the mean vector β \sum_{i=1}^{E} d_{r_i} of the top E retrieved documents, multiplied by a weighting β, to the query, and applying TFIDF again. We report the error rate where β and E are optimized using the training set ranking loss.
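For concreteness, a simple sketch of this QE baseline as we read it (our own illustration; tfidf_rank is a hypothetical helper returning document indices ranked by cosine similarity to a tf-idf vector):

    import numpy as np

    def query_expansion_rank(q, docs, tfidf_rank, E, beta):
        """Blind relevance feedback: retrieve with TFIDF, add beta times the mean of the
        top-E retrieved document vectors to the query, then retrieve again.
        q : tf-idf query vector; docs : list/array of tf-idf document vectors."""
        first_pass = tfidf_rank(q, docs)                       # document indices, best first
        mean_top = np.mean([docs[i] for i in first_pass[:E]], axis=0)
        return tfidf_rank(q + beta * mean_top, docs)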
For each method, we measure the ranking loss (the percentage of tuples in R that are incorrectly ordered), the precision P(n) at position n = 10 ([email protected]) and the mean average precision (MAP), as well as their standard errors. For computational reasons, MAP and [email protected] were measured by averaging over a fixed set of 1,000 test queries, where for each query the linked test set documents plus random subsets of 10,000 documents were used as the database, rather than the whole testing set. The ranking loss is measured using 100,000 testing tuples (i.e. 100,000 queries, and for each query one random positive and one random negative target document were selected).

4.1 Document-Document Retrieval

We considered a set of 1,828,645 English Wikipedia documents as a database, and split the 24,667,286 links (we removed links to calendar years as they provide little information while being very frequent) randomly into two portions, 70% for training and 30% for testing. We then considered the following task: given a query document q, rank the other documents such that if q links to d then d should be highly ranked.

Limited Dictionary Size.
In our first experiments, we used only the top 30,000 most frequent words. This allowed us to compare all methods with the proposed approach, Supervised Semantic Indexing (SSI), using a completely unconstrained W matrix as in equation (1). LSI is also feasible to compute in this setting. We compare several variants of our approach, as detailed in Section 2.2.

Results on the test set are given in Table 1. All the variants of our method SSI strongly outperform the existing techniques TFIDF, LSI and QE. SSI with unconstrained W performs worse than its low rank counterparts, probably because it has too much capacity given the training set size. Non-symmetric low-rank SSI, W = U^T V + I, slightly outperforms its symmetric counterpart W = U^T U + I. Diagonal SSI, W = D, is only a learned re-weighting of word weights, but still slightly outperforms TFIDF. In terms of our baselines, LSI is slightly better than TFIDF, but QE in this case does not improve much over TFIDF, perhaps because of the difficulty of this task, i.e. there may often be too many irrelevant documents in the top E documents initially retrieved for QE to help.

Unlimited Dictionary Size.
In our second experiment we no longer constrained methods to a fixed dictionary size, so all 2.5 million words are used. As we were unable to compute LSI for the full dictionary size, we used the LSI computed in the previous experiment on 30,000 words and combined it with TFIDF using the entire dictionary. In this setting we compared our baselines with the low rank SSI method W = (U^T V)_n + I, where n means that we constrained the entries of U and V corresponding to infrequent words (i.e. all words apart from the most frequent n) to equal zero. The reason for this constraint is that it can stop the method overfitting: if a word is used in one document only, then its embedding can take on any value independent of its content. Infrequent words are still used in the diagonal of the matrix (via the +I term). The results, given in Table 2, show that using this constraint outperforms an unconstrained choice of n = 2.5M. Figure 1 shows scatter plots where SSI outperforms the baselines TFIDF and LSI in terms of average precision.

Overall, compared to the baselines the same trends are observed as in the limited dictionary case, indicating that the restriction in the previous experiment did not bias the results in favor of any one algorithm. Note also that as a page has on average just over 3 test set links to other pages, the maximum [email protected] one can achieve in this case is 0.31, while our best model reaches 0.263 for this measure.

[Figure 1: Scatter plots of Average Precision for 500 documents: (a) SSI_30k+I_2.5M vs. TFIDF_2.5M, (b) SSI_30k+I_2.5M vs. the best combination of LSI_30k and TFIDF_2.5M.]
Hash Kernels and Correlated Feature Hashing.
In the full dictionary size experiments of Table 2 we also compare Hash Kernels [29] with our Correlated Feature Hashing method described in Section 2.2.1. For Hash Kernels we tried several sizes of hash table H (1M, 3M and 6M); we also tried adding a diagonal to the learned matrix, in a similar way as is done for LSI. We note that if the hash table is big enough this method is equivalent to SSI with an unconstrained W; however, for the hash sizes we tried, Hash Kernels did not perform as well. For correlated feature hashing, we simply used the SSI model W = (U^T V)_30k + I from the 7th row in the table to model the most frequent 30,000 words, and trained a second model using equation (6) with k = 5 to model all other words, combining the two models with a mixing factor (which was also learned). The resulting "SSI: CFH (1-grams)" is the best performing method we have found. Doing the same trick but with 2-grams instead also improved over Low Rank SSI, but not by as much. Combining both 1-grams and 2-grams, however, did not improve over 1-grams alone.
Training and Testing Splits.
In some cases, one might be worried that our experimental setup splits training and testing data only by partitioning the links, but not the documents; hence the performance of our model when new, unseen documents are added to the database might be in question. We therefore also tested an experimental setup where the test set of documents is completely separate from the training set of documents, by removing all training set links between training and testing documents. In fact, this does not alter the performance significantly, as shown in Table 3. This indicates that our model can accommodate a growing corpus without frequent re-training.
Table 1: Empirical results for document-document ranking on Wikipedia (limited dictionary size of D = 30,000 words).

  Algorithm               | Parameters | Rank Loss | MAP         | [email protected]
  TFIDF                   | 0          | 1.62%     | 0.329±0.010 | 0.163±0.006
  QE                      | 2          | 1.62%     | 0.330±0.010 | 0.163±0.006
  LSI                     | 1000D      | 4.79%     | 0.158±0.006 | 0.098±0.005
  αLSI + (1−α)TFIDF       | 200D+1     | 1.28%     | 0.346±0.011 | 0.170±0.007
  SSI: W = D              | D          | 1.41%     | 0.355±0.009 | 0.177±0.007
  SSI: W unconstrained    | D^2        | 0.41%     | 0.477±0.011 | 0.212±0.007
  SSI: W = U^T U + I      | 200D       | 0.41%     | 0.506±0.012 | 0.225±0.007
  SSI: W = U^T V + I      | 400D       | 0.30%     | 0.517±0.011 | 0.229±0.007

Table 2: Empirical results for document-document ranking on Wikipedia (unlimited dictionary size, all D = 2.5M words). Results for random hashing (i.e., hash kernels [29]) and correlated feature hashing (CFH) on all words are included.

  Algorithm                     | Params      | Rank Loss | MAP         | [email protected]
  TFIDF                         | 0           | 0.842%    | 0.432±0.012 | 0.193±0.007
  QE                            | 2           | 0.842%    | 0.432±0.012 | 0.193±0.007
  αLSI_30k + (1−α)TFIDF         | 200×30k+1   | 0.721%    | 0.433±0.012 | 0.193±0.007
  SSI: W = (U^T U)_2.5M + I     | 50D         | 0.200%    | 0.503±0.012 | 0.220±0.007
  SSI: W = (U^T V)_100k + I     | 100×100k    | 0.178%    | 0.536±0.012 | 0.233±0.008
  SSI: W = (U^T V)_60k + I      | 100×60k     | 0.172%    | 0.541±0.012 | 0.232±0.008
  SSI: W = (U^T V)_30k + I      | 200×30k     | 0.158%    | 0.547±0.012 | 0.239±0.008
  SSI: Hash Kernels [29]        | 1M          | 2.98%     | 0.239±0.009 | 0.127±0.005
  SSI: Hash Kernels             | 3M          | 1.75%     | 0.301±0.01  | 0.152±0.006
  SSI: Hash Kernels             | 6M          | 1.37%     | 0.335±0.01  | 0.164±0.007
  SSI: Hash Kernels + αI        | 1M+1        | 0.525%    | 0.466±0.011 | 0.207±0.007
  SSI: Hash Kernels + αI        | 3M+1        | 0.370%    | 0.474±0.012 | 0.211±0.007
  SSI: Hash Kernels + αI        | 6M+1        | 0.347%    | 0.485±0.011 | 0.215±0.007
  SSI: CFH (2-grams)            | 300×30k     | 0.149%    | 0.559±0.012 | 0.249±0.007
  SSI: CFH (1-grams)            | 300×30k     | 0.119%    | 0.614±0.012 | 0.263±0.008

Table 3: Empirical results for document-document ranking in two train/test setups: partitioning into train+test sets of links, or into train+test sets of documents with no cross-links (limited dictionary size of 30,000 words). The two setups yield similar results.

  Algorithm            | Testing Setup           | Rank Loss | MAP         | [email protected]
  SSI: W = U^T V + I   | Partitioned links       | 0.407%    | 0.506±0.012 | 0.225±0.007
  SSI: W = U^T V + I   | Partitioned docs+links  | 0.401%    | 0.503±0.010 | 0.225±0.006

Importance of the Latent Concept Dimension.
In the above experiments we simply chose the dimension N of the low rank matrices to be the same as the best latent concept dimension for LSI. However, we also tried some experiments varying N and found that the error rates are fairly invariant to this parameter. For example, using a limited dictionary size of 30,000 words we achieve a ranking loss of 0.39%, 0.30% or 0.31% for N = 100, 200, 500 using a W = U^T V + I type model.

Importance of the Identity Matrix for Low Rank Representations.
The addition of the identity term in our model W = U^T V + I allows this model to automatically learn the tradeoff between using the low dimensional space and a classical vector space model. The diagonal elements count when there are exact matches (co-occurrences) of words between the documents. The off-diagonal part (approximated with a low rank representation) captures topics and synonyms. Using only W = I yields the inferior TFIDF model. Using only W = U^T V also does not work as well as W = U^T V + I. Indeed, we obtain a mean average precision of 0.42 with the former, and 0.51 with the latter. Similar results can be seen with the error rate of LSI with or without adding the (1−α)TFIDF term; however, for LSI this modification seems rather ad hoc, rather than being a natural constraint on the general form of W as in our method.

Ignoring the Diagonal.
On the other hand, for some tasks it is not possible to use the identity matrix at all, e.g. if one wishes to perform cross-language retrieval. Out of curiosity, we thus also tested our method SSI training a dense matrix W where the diagonal is constrained to be zero, so only synonyms can be used. This obtained a test ranking loss of 0.69% (limited dictionary size case), compared to 0.41% with the diagonal. (Note that the model W = U^T V without the identity achieved a ranking loss of 0.56%; however, that model can represent at least some of the diagonal.)

Training Speed.
Training our model over the 1.2M documents (where the number of triplets in R is obviously much larger) takes on the order of a few days on a standard machine (single CPU) with our current implementation. As triplets are sampled stochastically, not all possible triplets have been seen in this time; however, the test error on a validation set has reached its minimum by that time, see Section 4.4.
4.2 Query-Document Retrieval

We also tested our approach in a query-document setup. We used the same setup as before, but we constructed queries by keeping only k random words from query documents in an attempt to mimic a "keyword search". First, using the same setup as in the previous section with a limited dictionary size of 30,000 words, we present results for keyword queries of length k = 5, 10 and 20 in Table 4. SSI yields similar improvements over the baselines as in the document-document retrieval case. Here, we do not report full results for Query Expansion; however, it did not give large improvements over TFIDF, e.g. for the k = 10 case we obtain 0.084 MAP and 0.0376 [email protected] for QE at best. Results for k = 10 using an unconstrained dictionary are given in Table 5. Again, SSI yields similar improvements. Overall, non-symmetric SSI gives a slight but consistent improvement over symmetric SSI. Changing the embedding dimension N (capacity) did not appear to affect this; for example, for k = 10 and N = 100 we obtain 3.11% / 0.215 / 0.097 for Rank Loss/MAP/[email protected] using SSI W = U^T U + I and 2.93% / 0.235 / 0.102 using SSI W = U^T V + I (results in Table 4 are for N = 200). Finally, correlated feature hashing again improved over models without hashing.
4.3 Content Matching
We present results on matching adverts to web pages, a problem closely related to document retrieval. We obtained proprietary data from an online advertising company in the form of pairs of web pages and adverts that were clicked while displayed on that page. We only considered clicks in position 1 and discarded the sessions in which none of the ads was clicked. This is a way to circumvent the well known position bias problem, i.e. the fact that links appearing in lower positions are less likely to be clicked even if they are relevant. Indeed, by construction, every negative example comes from a slate of adverts in which there was a click in a lower position; it is thus likely that the user examined that negative example but deemed it irrelevant (as opposed to the user not clicking because he did not even look at the advert).

We consider these (web page, clicked-on-ad) pairs as positive examples (q, d+), and any other randomly chosen ad is considered as a negative example d− for that query page. 1.9M pairs were used for training and 100,000 pairs for testing. The web pages contained 87 features (words) on average, while the ads contained 19 features on average. The two classes (clicks and no-clicks) are roughly balanced. From the way we construct the dataset, this means that when a user clicks on an advert, he/she clicks about half of the time on the one in the first position.

We compared TFIDF, Hash Kernels and Low Rank SSI on this task. The results are given in Table 6. In this case TFIDF performs very poorly: often the positive (page, ad) pairs share very few, if any, features, and even when they do, this does not appear to be very discriminative. Hash Kernels and Low Rank SSI appear to perform rather similarly, both strongly outperforming TFIDF. The rank loss on this dataset is two orders of magnitude higher than on the Wikipedia experiments described in the previous sections. This is probably due to a combination of two factors: first, the positive and negative classes are balanced, whereas there were only a few positive documents in the Wikipedia experiments; and second, click data are much more noisy.

[Figure 2: Wikipedia-based retrieval task: training with different training data sizes; panels (a)-(d) use 1%, 4%, 10% and 30% of the data for training. In each panel, the (lower) blue curve is training error, and the (higher) red curve is testing error. Here, the total data size is 8.1M qJap-dEng pairs, so 1% of the data is about 80,000 pairs.]
We might add, however, that at test time, Low Rank SSI
has a considerable advantage over Hash Kernels in terms
of speed. As the vectors U q and V d can be cached for each
page and ad, a matching operation only requires N multiplications (a dot product in the “embedding” space). However,
for hash kernels |q||d| hashing operations and multiplications
have to be performed, where | · | means the number of nonzero elements. For values such as |q| = 100, |d| = 100 and
N = 100 that would mean Hash Kernels would be around
100 times slower than Low Rank SSI at test time, and this
difference gets larger if more features are used.
Table 6: Content Matching experiments on proprietary data of web-page/advertisement pairs.

  Algorithm                   | Parameters       | Rank Loss
  TFIDF                       | 0                | 45.60%
  SSI: Hash Kernels [29]      | 1M               | 26.15%
  SSI: Hash Kernels           | 10M              | 25.56%
  SSI: W = (U^T V)_10k + I    | 50×10k = 0.5M    | 25.83%
  SSI: W = (U^T V)_20k + I    | 50×20k = 1M      | 26.68%
  SSI: W = (U^T V)_30k + I    | 50×30k = 1.5M    | 26.98%

Table 4: Empirical results for query-document ranking on Wikipedia where the query has k keywords (this experiment uses a limited dictionary size of D = 30,000 words). For each k we measure the ranking loss, MAP and [email protected] metrics.

  k = 5
  Algorithm            | Params  | Rank Loss | MAP         | [email protected]
  TFIDF                | 0       | 21.6%     | 0.047±0.004 | 0.023±0.0007
  αLSI + (1−α)TFIDF    | 200D+1  | 14.2%     | 0.049±0.004 | 0.023±0.0007
  SSI: W = U^T U + I   | 200D    | 4.80%     | 0.161±0.007 | 0.079±0.003
  SSI: W = U^T V + I   | 400D    | 4.37%     | 0.166±0.007 | 0.083±0.003

  k = 10
  Algorithm            | Params  | Rank Loss | MAP           | [email protected]
  TFIDF                | 0       | 14.0%     | 0.083±0.006   | 0.035±0.001
  αLSI + (1−α)TFIDF    | 200D+1  | 9.73%     | 0.089±0.006   | 0.037±0.001
  SSI: W = U^T U + I   | 200D    | 3.10%     | 0.2138±0.0009 | 0.095±0.004
  SSI: W = U^T V + I   | 400D    | 2.91%     | 0.229±0.009   | 0.100±0.004

  k = 20
  Algorithm            | Params  | Rank Loss | MAP         | [email protected]
  TFIDF                | 0       | 9.14%     | 0.128±0.007 | 0.054±0.002
  αLSI + (1−α)TFIDF    | 200D+1  | 6.36%     | 0.133±0.007 | 0.059±0.002
  SSI: W = U^T U + I   | 200D    | 1.87%     | 0.287±0.01  | 0.126±0.005
  SSI: W = U^T V + I   | 400D    | 1.80%     | 0.302±0.01  | 0.130±0.005

Table 5: Empirical results for query-document ranking for k = 10 keywords (unlimited dictionary size of D = 2.5 million words).

  Algorithm            | Params      | Rank Loss | MAP         | [email protected]
  TFIDF                | 0           | 12.86%    | 0.128±0.008 | 0.035±0.003
  αLSI + (1−α)TFIDF    | 200×30k+1   | 8.95%     | 0.133±0.008 | 0.051±0.003
  SSI: W = U^T V + I   | 400×30k     | 3.02%     | 0.261±0.010 | 0.113±0.004

4.4 Notes on Overfitting Issues

As stated earlier, we did not experience overfitting problems in most of our experiments. However, overfitting happened when we reduced the training data to a very limited amount. Figure 2 shows a document-document Wikipedia retrieval task (note that this is a different train/test document setup from Section 4.1, so the results are not comparable) where we use different training data sizes. It appears that overfitting occurs when the training data size is very small (1% and 4%). The gap between training and testing error reduces with increased training data size, as expected. However, especially for larger training set sizes, one can see that we did not experience a divergence between train and test error as the number of iterations of online learning increased.

5. DISCUSSION

We have described a versatile, powerful set of discriminatively trained models for document ranking. Many "learning to rank" papers have focused on the problem of selecting the objective to optimize (given a fixed class of functions) and typically use a relatively small number of hand-engineered features as input. This work is orthogonal to those works as it studies models with large feature sets generated by all pairs of words between the query and target texts. The challenge here is that such feature spaces are very large, and we thus presented low rank models that deal with the memory, speed and capacity control issues. In fact, all of our proposed methods can be used in conjunction with other features and objective functions explored in previous work for further gains.

Many generalizations of our work are possible: adding more features into our models as we just mentioned, generalizing to other kinds of nonlinear models, and exploring the use of the same models for other tasks such as question answering. In general, web search and other standard retrieval tasks currently often depend on entering query keywords which are likely to be contained in the target document, rather than the user directly describing what they want to find. Our models are capable of learning to rank using either the former or the latter.

6. ADDITIONAL AUTHORS

Additional authors: Yanjun Qi (NEC Labs America, Princeton, NJ) and Kunihiko Sadamasa (NEC Labs America, Princeton, NJ).
7. REFERENCES
[1] R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern Information Retrieval. Addison-Wesley, Harlow, England, 1999.
[2] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), 2007.
[3] D. M. Blei, A. Ng, and M. I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
[4] R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In EACL, pages 9–16, 2006.
[5] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML 2005, pages 89–96, New York, NY, USA, 2005. ACM Press.
[6] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems: Proceedings of the 2006 Conference. MIT Press, 2007.
[7] Z. Cao, T. Qin, T. Y. Liu, M. F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM Press, New York, NY, USA, 2007.
[8] S. Chernov, T. Iofciu, W. Nejdl, and X. Zhou. Extracting semantic relationships between wikipedia categories. In 1st International Workshop "SemWiki2006 - From Wiki to Semantics" (SemWiki 2006), co-located with ESWC2006, Budva, 2006.
[9] M. Collins and N. Duffy. New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 263–270. Association for Computational Linguistics, Morristown, NJ, USA, 2001.
[10] S. Cucerzan. Large-scale named entity disambiguation based on wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 708–716, Prague, June 2007. Association for Computational Linguistics.
[11] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. JASIS, 41(6):391–407, 1990.
[12] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In International Joint Conference on Artificial Intelligence, 2007.
[13] P. Gehler, A. Holub, and M. Welling. The rate adapting poisson (RAP) model for information retrieval and object recognition. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[14] A. Globerson and S. Roweis. Visualizing pairwise similarity via semidefinite programming. In AISTATS, 2007.
[15] S. Goel, J. Langford, and A. Strehl. Predictive indexing for fast search. In Advances in Neural Information Processing Systems 21, 2009.
[16] D. Grangier and S. Bengio. Inferring document similarity from hyperlinks. In CIKM '05, pages 359–360, New York, NY, USA, 2005. ACM.
[17] D. Grangier and S. Bengio. A discriminative kernel-based approach to rank images from text queries. IEEE Trans. PAMI, 30(8):1371–1384, 2008.
[18] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. MIT Press, Cambridge, MA, 2000.
[19] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR 1999, pages 50–57. ACM Press, 1999.
[20] J. Hu, L. Fang, Y. Cao, H. Zeng, H. Li, Q. Yang, and Z. Chen. Enhancing text clustering by leveraging wikipedia semantics. In SIGIR '08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 179–186, New York, NY, USA, 2008. ACM.
[21] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online metric learning and fast similarity search. In Advances in Neural Information Processing Systems (NIPS), 2008.
[22] T. Joachims. Optimizing search engines using clickthrough data. In ACM SIGKDD, pages 133–142, 2002.
[23] M. Keller and S. Bengio. A neural network for text representation. In International Conference on Artificial Neural Networks, ICANN, 2005. IDIAP-RR 05-12.
[24] T. Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007.
[25] D. N. Milne, I. H. Witten, and D. M. Nichols. A knowledge-based search engine powered by wikipedia. In CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 445–454, New York, NY, USA, 2007. ACM.
[26] Z. Minier, Z. Bodo, and L. Csato. Wikipedia-based kernels for text categorization. In 9th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pages 157–164, 2007.
[27] M. Ruiz-Casado, E. Alfonseca, and P. Castells. Automatic extraction of semantic relationships for wordnet by means of pattern learning from wikipedia. In NLDB, pages 67–79. Springer Verlag, 2005.
[28] R. Salakhutdinov and G. Hinton. Semantic hashing. In Proceedings of the SIGIR Workshop on Information Retrieval and Applications of Graphical Models, Amsterdam, 2007.
[29] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, A. Strehl, and V. Vishwanathan. Hash kernels. In Twelfth International Conference on Artificial Intelligence and Statistics, 2009.
[30] F. Smadja, K. R. McKeown, and V. Hatzivassiloglou. Translating collocations for bilingual lexicons: a statistical approach. Computational Linguistics, 22(1):1–38, 1996.
[31] J. Sun, Z. Chen, H. Zeng, Y. Lu, C. Shi, and W. Ma. Supervised latent semantic indexing for document categorization. In ICDM 2004, pages 535–538, Washington, DC, USA, 2004. IEEE Computer Society.
[32] K. Weinberger and L. Saul. Fast solvers and efficient implementations for distance metric learning. In International Conference on Machine Learning, 2008.
[33] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR, pages 271–278, 2007.
[34] L. Zighelnic and O. Kurland. Query-drift prevention for robust query expansion. In SIGIR 2008, pages 825–826, New York, NY, USA, 2008. ACM.