Analyzing the Effect of Query Class on Document Retrieval Performance
Pawel Kowalczyk, Ingrid Zukerman, and Michael Niemann
School of Computer Science and Software Engineering
Monash University
Clayton, VICTORIA 3800, AUSTRALIA
{pawel,ingrid,niemann}@csse.monash.edu.au
Abstract. Analysis of queries posed to open-domain question-answering systems indicates that particular types of queries are dominant, e.g., queries about
the identity of people, and about the location or time of events. We applied a rule-based mechanism and performed manual classification to classify queries into
such commonly occurring types. We then experimented with different adjustments to our basic document retrieval process for each query type. The application of the best retrieval adjustment for each query type yielded improvements in
retrieval performance. Finally, we applied a machine learning technique to automatically learn the manually classified query types, and applied the best retrieval
adjustments obtained for the manual classification to the automatically learned
query classes. The learning algorithm exhibited high accuracy, and the retrieval
performance obtained for the learned classes was consistent with the performance
obtained for the rule-based and manual classifications.
1 Introduction
The growth in popularity of the Internet highlights the importance of developing systems that generate responses to queries targeted at large unstructured corpora. These
queries vary in their informational goal and topic, ranging from requests for descriptions of people or things, to queries about the location or time of events, and questions
about specific attributes of people or things. There is also some variation in the success
of question-answering systems in answering the different types of queries.
Recently, there has been some work on predicting whether queries can be answered
by the documents in a particular corpus [1, 2]. The hope is that by identifying features
that affect the “answerability” of queries, the queries can be modified prior to attempting
document retrieval, or appropriate steps can be taken during the retrieval process to
address problems that arise due to these features.
In this paper, we investigate the use of query type as a predictor of document retrieval performance in the context of a question answering task, and as a basis for the
automatic selection of a retrieval policy. The first step in our study consisted of performing two types of query classification: a coarse-grained classification, which was
performed by means of a rule-based mechanism, and a finer-grained classification,
which was done manually. We considered these two types of classification because
finer-grained classes are believed to be more informative than coarser-grained classes
when extracting answers to queries from retrieved documents [3]. However, prior to
committing to a particular classification grain, we must examine its effect on document
retrieval performance (as document retrieval is the step that precedes question answering).
Our analysis of the effect of query type on document retrieval performance shows
that performance varies across different types of queries. This led us to experiment with
different types of adjustments to our basic document retrieval process, in order to determine the best adjustment for each type of query. The application of specific adjustments
to the retrieval of documents for different query types yielded improvements in retrieval
performance both for the coarse-grained and the finer-grained query classes.
In the last step of our study, we applied a supervised machine learning technique,
namely Support Vector Machines (SVMs) [4], to learn the finer-grained query types
from shallow linguistic features of queries, and used the query-type-based retrieval
adjustments to retrieve documents for the learned classes.1 Our results for both the
machine learning algorithm and the document retrieval process are encouraging. The
learning algorithm exhibited high accuracy, and the resultant retrieval performance was
consistent with the performance obtained for the rule-based and manually-derived query
categories.

1 The automation of the finer-grained classification is necessary in order to incorporate it as a step in an automatic document-retrieval and question-answering process.
In the next section, we review related research. In Section 3, we describe our document retrieval procedure. Next, we describe our data set, discuss our rule-based classification process and our manual classification process, and present the results of our
experiments with the adjustments to the retrieval procedure. In Section 5, we describe
the data used to train the SVM for query classification, and evaluate the performance
of the SVM and the retrieval performance for the automatically learned classes. In Section 6, we summarize the contribution of this work.
2 Related Research
Our research is at the intersection of query classification systems [5, 6, 3] and performance prediction systems [1, 2].
Query classification systems constitute a relatively recent development in Information Retrieval (IR). Radev et al. [5] and Zhang and Lee [3] studied automatic query
classification based on the type of the expected answer. Their work was motivated by
the idea that such a classification can help select a suitable answer in a document when
performing an open-domain question-answering task. Radev et al. compared a machine
learning approach with a heuristic (hand-engineered) approach for query classification,
and found that the latter approach outperformed the former. Zhang and Lee experimented with five machine learning methods to learn query classes, and concluded that
when only surface text features are used, SVMs outperform the other techniques. Kang
and Kim’s study of query classification [6] was directed at categorizing queries according to the task at hand (informational, navigational or transactional). They postulated
that appropriate query classification supports the application of algorithms dedicated
to particular tasks. Our work resembles that of Zhang and Lee in its use of SVMs.
However, they considered two grains of classifications: coarse (6 classes) and fine (50
classes), while we consider an intermediate grain (11 classes). More importantly, like
Kang and Kim, we adjust our retrieval policy based on query class. However, Kang and
Kim’s classes were broad task-oriented classes, while we offer finer distinctions within
the informational task.
Performance-prediction systems identify query features that predict retrieval performance. The idea is that queries that appear “unpromising” can be modified prior to
attempting retrieval, or retrieval behaviour can be adjusted for such queries. Cronen-Townsend et al. [1] developed a clarity score that measures the coherence of the language used in documents which "generate" the terms in a query. Thus, queries with a
high clarity score yield a cohesive set of documents, while queries with a low clarity
score yield documents about different topics. Zukerman et al. [2] adopted a machine
learning approach to predict retrieval performance from the surface features of queries
and word frequency counts in the corpus. They found that queries were “answerable”
when they did not contain words whose frequency exceeded a particular threshold (this
threshold is substantially lower than the frequency of stop words, which are normally
excluded from the retrieval process). This finding led to the automatic removal of such
words from queries prior to document retrieval, yielding significant improvements in
retrieval performance. The work described in this paper predicts retrieval performance
from surface features of queries (by first using these features to classify queries). However, it does not consider corpus-related information. Additionally, unlike the system
described in Zukerman et al. that modifies the queries, our system dynamically adjusts
its retrieval behaviour.2

2 We also replicated Zukerman et al.'s machine learning experiments. However, since our document retrieval technique combines the vector-space model with boolean retrieval (Section 3), the results obtained when their machine learning approach was used with our system differed from Zukerman et al.'s original findings.
3 Document Retrieval
Our retrieval mechanism combines the classic vector-space model [7] with a paraphrase-based query expansion process [8, 2]. This mechanism is further adjusted by considering different numbers of paraphrases (between 0 and 19) and different retrieval policies.
Below we describe our basic retrieval procedure, followed by the adjustments; a short code sketch of the procedure appears after the listing.
Procedure Paraphrase&Retrieve
1. Tokenize, tag and lemmatize the query.
Tagging is performed using Brill’s part-of-speech tagger [9]. Lemmatizing consists
of converting words into lemmas, which are uninflected versions of words.
2. Generate replacement lemmas for each content lemma in the query.
The replacement lemmas are the intersection of lemmas obtained from two resources: WordNet [10] and a thesaurus that was automatically constructed from
the Oxford English Dictionary. The thesaurus also yields similarity scores between
each query lemma and its replacement lemmas.
3. Propose paraphrases for the query using different combinations of replacement
lemmas, compute the similarity score between each paraphrase and the query, and rank the paraphrases according to their score.
The similarity score of each paraphrase is computed from the similarity scores between the original lemmas in the query and the corresponding replacement lemmas
in the paraphrase.
4. Retain the lemmatized query plus the top K paraphrases (the default value for K
is 19).
5. Retrieve documents for the query and its paraphrases using a paraphrase-adjusted
version of the vector-space model.
For each lemma in the original query or its paraphrases, documents that contain
this lemma are retrieved. Each document is scored using a function that combines
the tf.idf (term frequency inverse document frequency) score [7] of the query lemmas and paraphrase lemmas that appear in the document, and the similarity score
between the paraphrase lemmas that appear in the document and the query lemmas.
The tf.idf part of the score takes into account statistical features of the corpus, and
the similarity part takes into account semantic features of the query.
6. Retain the top N documents (at present N = 200).
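To make the scoring in steps 5 and 6 concrete, the following minimal sketch shows one plausible way of combining tf.idf with paraphrase similarity. The paper does not publish its exact combination function, so the weighting used here, as well as the index and data-structure names, are illustrative assumptions.

import math
from collections import defaultdict

def score_documents(query_lemmas, paraphrases, index, num_docs, top_n=200):
    """Rank documents for a lemmatized query and its paraphrases.

    `index` maps lemma -> {doc_id: term frequency}; `paraphrases` is a list of
    (lemma_list, similarity_to_query) pairs. The way tf.idf and paraphrase
    similarity are combined below is illustrative only -- the paper does not
    give its exact scoring function.
    """
    scores = defaultdict(float)
    # The original query lemmas contribute plain tf.idf (similarity weight 1.0).
    weighted_sources = [(query_lemmas, 1.0)] + list(paraphrases)
    for lemmas, similarity in weighted_sources:
        for lemma in lemmas:
            postings = index.get(lemma, {})
            if not postings:
                continue
            idf = math.log(num_docs / len(postings))
            for doc_id, tf in postings.items():
                # tf.idf reflects corpus statistics; the similarity factor
                # reflects how close the paraphrase is to the original query.
                scores[doc_id] += tf * idf * similarity
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]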
3.1 Adjustments to the basic retrieval procedure
The adjustments to our retrieval procedure pertain to the number of paraphrases used
for query expansion and to the retrieval policy used in combination with the vector-space model. The effect of these adjustments on retrieval performance is discussed in
Sections 4.1 and 4.2.
Number of paraphrases. We consider different numbers of paraphrases (between 0 and
19), in addition to the original query.
Retrieval policies. Our system features three boolean document retrieval policies, which
are used to constrain the output of the vector-space model: (1) 1NNP, (2) 1NG (1 Noun
Group), and (3) MultipleNGs. A code sketch of these filters appears after the list.
– 1NNP – retrieve documents that contain at least one proper noun (NNP) from the
query.3 If no proper nouns are found, fall back to the vector-space model.
– 1NG – retrieve documents that contain the content words of at least one of the noun
groups in the query, where a noun group is a sequence of nouns possibly interleaved
by adjectives and function words, e.g., “Secretary of State”, “pitcher’s mound” or
“house”.
– MultipleNGs – retrieve documents that contain at least g NGs in the query, where
g = min{2, # of NGs in the query}.
3 NNP is the tag used for singular proper nouns in parsers and part-of-speech taggers. This tag is part of the Penn Treebank tag-set (http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html).
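As an illustration of the three policies, a minimal sketch follows. The query attribute names (proper_nouns, noun_groups) are assumptions, not structures taken from the paper.

def passes_policy(doc_lemmas, query, policy):
    """Check whether a candidate document satisfies a boolean retrieval policy.

    `query` is assumed to expose `proper_nouns` (a set of NNP lemmas) and
    `noun_groups` (each a set of content lemmas); these names are illustrative.
    """
    doc = set(doc_lemmas)
    if policy == "1NNP":
        # At least one proper noun from the query; if the query has none,
        # fall back to accepting the document (pure vector-space ranking).
        return not query.proper_nouns or bool(query.proper_nouns & doc)
    if policy == "1NG":
        # All content words of at least one noun group must appear.
        return any(ng <= doc for ng in query.noun_groups)
    if policy == "MultipleNGs":
        # At least g noun groups, where g = min(2, number of NGs in the query).
        g = min(2, len(query.noun_groups))
        return sum(ng <= doc for ng in query.noun_groups) >= g
    raise ValueError(f"unknown policy: {policy}")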
4 Query Classification and Retrieval Adjustment
Our dataset consists of 911 unique queries from the TREC11 and TREC12 corpora.
These queries were obtained from logs of public repositories such as MSNSearch and
AskJeeves. Their average length is 8.9 words, with most queries containing between 5
and 12 words. The answers to these queries are retrieved from approximately 1 million
documents in the AQUAINT corpus (this corpus is part of the NIST Text Research
Collection, http://trec.nist.gov). These documents are newspaper articles from
the New York Times, Associated Press Worldstream (APW), and Xinhua English (People’s Republic of China) news services. Thus, the task at hand is an example of the more
general problem of finding answers to questions in open-domain documents that were
not designed with these questions in mind (in contrast to encyclopedias).
We first extracted six main query features by automatically performing shallow linguistic analysis of the queries. These features are
1. Type of the initial query words – corresponds mostly to the first word in the query,
but merges some words, such as “what” and “which”, into a single category, and
considers additional words if the first word is “how”.
2. Main focus – the attribute sought in the answer to the query, e.g., “How far is it
from Earth to Mars?” (similar components have been considered in [11, 12]).
3. Main verb – the main content verb of the query (different from auxiliary verbs such
as “be” and “have”), e.g., “What book did Rachel Carson write in 1962?”. It often
corresponds to the head verb of the query, but it may also be a verb embedded
in a subordinate clause, e.g., “What is the name of the volcano that destroyed the
ancient city of Pompeii?”.
4. Rest of the query – e.g., “What is the name of the volcano that destroyed the ancient
city of Pompeii?”.
5. Named entities – entities characterized by sequences of proper nouns, possibly interleaved with function words, e.g., “Hong Kong” or “Hunchback of Notre Dame”.
6. Prepositional phrases – e.g., “In the bible, who is Jacob’s mother?”.
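For concreteness, the six shallow features could be held in a simple container such as the following sketch; the field names, and the example focus and verb values suggested in the comments, are illustrative rather than taken from the paper.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ShallowQueryFeatures:
    """Container for the six shallow features extracted from a query."""
    initial_word_type: str                  # e.g. "what/which", "how far", "who"
    main_focus: Optional[str] = None        # attribute sought, presumably "far" in the Mars example
    main_verb: Optional[str] = None         # main content verb, presumably "write" in the Carson example
    rest_of_query: List[str] = field(default_factory=list)
    named_entities: List[str] = field(default_factory=list)         # e.g. ["Hong Kong"]
    prepositional_phrases: List[str] = field(default_factory=list)  # e.g. ["in the bible"]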
After the shallow analysis, the queries were automatically classified by a rule-based
system into six broad categories which represent the type of the desired answer.
1. location, e.g., "In what country did the game of croquet originate?".
2. name, e.g., "What was Andrew Jackson's wife's name?".
3. number, e.g., "How many chromosomes does a human zygote have?".
4. person, e.g., "Who is Tom Cruise married to?".
5. time, e.g., "What year was Alaska purchased?".
6. other, which is the default category, e.g., "What lays blue eggs?".
The rules for query classification considered two main factors: type of the initial
query words (feature #1 above), and main-focus words (feature #2). The first factor
was used to identify location, number, person and time queries (“where”, “how [much
| many | ADJ]”, “[who | whom | whose]” and “when” respectively). The main focus
words were then used to classify queries whose initial word is “what”, “which” or “list”.
For example, “country”, “state” and “river” indicate location (e.g., “What is the state
6
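A rough sketch of how such rules might be encoded is given below; the keyword sets extend the examples in the text with an assumed entry, and the rule for the name category is a plausible guess, since the paper does not spell it out.

LOCATION_FOCI = {"country", "state", "river", "city"}  # examples from the text plus "city" (assumed)
TIME_FOCI = {"date", "year"}

def classify_query(initial_word_type, main_focus):
    """Coarse rule-based classification into the six answer-type categories."""
    if initial_word_type == "where":
        return "location"
    if initial_word_type.startswith("how"):           # "how much", "how many", "how ADJ"
        return "number"
    if initial_word_type in {"who", "whom", "whose"}:
        return "person"
    if initial_word_type == "when":
        return "time"
    if initial_word_type in {"what", "which", "list"} and main_focus:
        if main_focus in LOCATION_FOCI:
            return "location"
        if main_focus in TIME_FOCI:
            return "time"
        if main_focus == "name":
            return "name"                              # plausible rule, not stated explicitly
    return "other"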
Query type    # of queries   # with answers   19/1NNP: ans queries (%)   Best: #para/policy   Best: ans queries (%)
location      172            167              137 (82.0%)                8/MultNG             149 (89.2%)
name           63             60               45 (75.0%)                1/MultNG              49 (81.7%)
number        169            147              109 (74.1%)                0/MultNG             122 (83.0%)
person         61             55               48 (87.3%)                19/NNP                48 (87.3%)
time          143            138              115 (83.3%)                19/MultNG            124 (89.9%)
other         303            275              191 (69.5%)                12/MultNG            204 (74.2%)
total         911            842              645 (76.6%)                                     696 (82.7%)

Table 1. Breakdown of automatically derived categories for TREC11 and TREC12 queries; performance for 19 paraphrases and 1NNP retrieval policy; best retrieval adjustment and best performance (measured in answerable queries)
Table 1 shows the breakdown of the six query categories, together with the retrieval
performance when using our default retrieval method (19 paraphrases and the 1NNP
retrieval policy), and when using the retrieval adjustment that yields the best retrieval
performance for each query class. The first column lists the query type, the second column shows the number of queries of this type, and the third column shows the number
of queries for which TREC participants found answers in the corpus (obtained from the
TREC judgment file).4 The retrieval performance for our default method appears in the
fourth column. The fifth and sixth columns present information pertaining to the best
performance (discussed later in this section).

4 TREC releases a judgment file for all the answers submitted by participants. For each submitted answer (and source document for that answer) the file contains a number which represents a degree of correctness. Thus, at present, when assessing the performance of our retrieval procedure, we are bounded by the answers found by previous TREC participants.
We employ a measure called number of answerable queries to assess retrieval performance. This measure, which was introduced in [2], returns the number of queries
for which the system has retrieved at least one document that contains the answer to a
query. We use this measure because the traditional precision measure is not sufficiently
informative in the context of a question-answering task. For instance, consider a situation where 10 correct documents are retrieved for each of 2 queries and 0 correct
documents for each of 3 queries, compared to a situation where 2 correct documents
are retrieved for each of 5 queries. Average precision would yield a better score for the
first situation, failing to address the question of interest for the question-answering task,
namely how many queries have a chance of being answered, which is 2 in the first case
and 5 in the second case.
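The measure itself is straightforward to compute; a minimal sketch, together with the worked example from the text, is shown below (the dictionary-based input format is an assumption).

def answerable_queries(retrieved_relevant_counts):
    """Number of queries with at least one retrieved answer-bearing document.

    `retrieved_relevant_counts` maps each query to how many of its retrieved
    documents contain the answer.
    """
    return sum(1 for n in retrieved_relevant_counts.values() if n > 0)

# The example from the text: 10 correct documents for each of 2 queries and 0
# for each of 3 queries scores 2, while 2 correct documents for each of 5
# queries scores 5, even though average precision would favour the former.
assert answerable_queries({"q1": 10, "q2": 10, "q3": 0, "q4": 0, "q5": 0}) == 2
assert answerable_queries({"q1": 2, "q2": 2, "q3": 2, "q4": 2, "q5": 2}) == 5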
As seen from the results in the first four columns of Table 1, there are differences in
retrieval performance for the six categories. Also, the other category (which is rather
uninformative) dominates, and exhibits the worst retrieval performance. This led to two
directions of investigation: (1) examine the effect of number of paraphrases and retrieval
policy on retrieval performance, and (2) refine the query classification to increase the
specificity of the “other” category in particular.
4.1 Effect of number of paraphrases and retrieval policy on performance
We ran all the combinations of number of paraphrases and retrieval policies on our six
query classes (a total of 60 runs: 20 × 3), and selected the combination of number of
paraphrases and retrieval policy that gave the best result for each query type (where several adjustments yielded the same performance, the adjustment with the lowest number
of paraphrases was selected). The fifth and sixth columns in Table 1 show the results of
this process. The fifth column shows the number of paraphrases and retrieval policy that
yield the best performance for a particular query type, and the sixth column shows the
number of answerable queries obtained by these adjustments. As can be seen from these
results, the retrieval adjustments based on query type yield substantial performance improvements.
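A sketch of this selection procedure appears below; the `evaluate` callback stands in for a full retrieval run and is an assumption about how such an experiment might be wired up.

from itertools import product

POLICIES = ("1NNP", "1NG", "MultipleNGs")
PARAPHRASE_COUNTS = range(20)          # 0 to 19 paraphrases

def best_adjustment_per_class(query_classes, evaluate):
    """Pick the (num_paraphrases, policy) pair maximising answerable queries.

    `evaluate(query_class, num_paraphrases, policy)` is assumed to run the
    retrieval procedure and return the number of answerable queries; ties are
    broken in favour of fewer paraphrases, as in the paper.
    """
    best = {}
    for qclass in query_classes:
        candidates = []
        for num_para, policy in product(PARAPHRASE_COUNTS, POLICIES):
            score = evaluate(qclass, num_para, policy)
            # Sort key: higher score first, then fewer paraphrases.
            candidates.append((score, -num_para, num_para, policy))
        score, _, num_para, policy = max(candidates)
        best[qclass] = (num_para, policy)
    return best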
4.2 Manual refinement of query classes
We re-examined the automatically derived query classes with a view towards a more
precise identification of the type of the desired answer (as stated above, the hope is
that this more precise categorization will help to find answers in documents). At the
same time, we endeavoured to define categories that had some chance of being automatically identified. This led us to the 11 categories specified below. These classes
include five of the six previously defined categories (name was split between person
and attribute). The queries in the six original classes were then manually re-tagged
with the new classes.
1. location, e.g., "In what country did the game of croquet originate?".
2. number, e.g., "How many chromosomes does a human zygote have?".
3. person, e.g., "Who is Tom Cruise married to?".
4. time, e.g., "What year was Alaska purchased?".
5. attribute – an attribute of the query's topic, e.g., "What is Australia's national blossom?".
6. howDoYouSay – the spelling-out of an acronym or the translation of a word, e.g., "What does DNA stand for?".
7. object – an object or the composition of an object, e.g., "What did Alfred Nobel invent?".
8. organization – an organization or group of people, e.g., "What company manufactures Sinemet?".
9. process – a process or how an event happened, e.g., "How did Mahatma Gandhi die?". It is worth noting that 80% of the queries in this category are about how somebody died.
10. term – a word that defines a concept, e.g., "What is the fear of lightning called?".
11. other – queries that did not fit in the other categories, e.g., "What is the chemical formula for sulfur dioxide?".
Query type      # of queries   # with answers   19/1NNP: ans queries (%)   Best: #para/policy   Best: ans queries (%)
∗location       206            198              163 (82.3%)                6/MultNG             175 (88.4%)
∗number         187            163              120 (73.6%)                6/MultNG             135 (82.8%)
∗person         118            106               88 (83.0%)                2/NNP                 90 (84.9%)
∗time           144            139              116 (83.5%)                19/MultNG            125 (89.9%)
attribute       121            112               81 (72.3%)                4/MultNG              85 (75.9%)
howDoYouSay      21             20               10 (50.0%)                4/NNP                 12 (60.0%)
object           13             13                8 (61.5%)                0/1NG                 11 (84.6%)
organization     26             25               22 (88.0%)                0/1NG                 24 (96.0%)
process          35             30               21 (70.0%)                0/MultNG              25 (83.3%)
term             30             27               11 (40.7%)                0/1NG                 13 (48.1%)
∗other           10              9                5 (55.6%)                0/MultNG               6 (66.7%)
total           911            842              645 (76.6%)                                     701 (83.3%)

Table 2. Breakdown of manually tagged categories for TREC11 and TREC12 queries; performance for 19 paraphrases and 1NNP retrieval policy; best retrieval adjustment and best performance (measured in answerable queries)
Table 2 shows the breakdown of the 11 query categories (the original categories are
asterisked), together with the retrieval performance obtained for our default retrieval
policy (19 paraphrases, 1NNP). As for Table 1, the first column lists the query types, the
second column shows the number of queries of each type, and the third column shows
the number of queries which were deemed correct according to the TREC judgment file.
The retrieval performance for our default method appears in the fourth column. The fifth
and sixth columns contain the retrieval adjustments yielding the best performance, and
the result obtained by these adjustments, respectively. As can be seen from these results,
the retrieval adjustments based on the finer-grained, manually-derived query types yield
performance improvements that are similar to those obtained for the coarser rule-based
query types.
It is worth noting that there is nothing intrinsically important that distinguishes these
11 categories from other options. The main factor is their ability to improve system
performance, which spans two aspects of the system: document retrieval and answer
extraction. Since the finer categories are more informative than the coarser ones, the
hope is that they will assist during the answer extraction stage. Our results show that
this will not occur at the expense of retrieval performance.
5 Using Support Vector Machines to Learn Query Classes
The SVM representation of each query has 11 parts, which may be roughly divided
into three sections: coarse properties (3 parts), fine-grained properties (6 parts), and
WordNet properties (2 parts).
– Coarse properties – these are properties that describe a query in broad terms.
• headTarget – the target or topic of the question, which is the first sequence
of proper nouns in the query, and if no proper nouns are found then it is the
first noun group, e.g., for the query “Who is Tom Cruise married to?”, the
headTarget is “Tom Cruise”.
• headConcept – the attribute we want to find out about the target, e.g., for the
query “What is the currency of China?”, the headConcept is “currency”.
• headAction – the action performed on the target, which is mostly the head verb
of the query, e.g., “married” in the above query about Tom Cruise.
– Fine properties – these properties correspond to the six query features extracted
from the query by performing shallow linguistic analysis (Section 4): (1) type of the
initial query words, (2) main focus, (3) main verb, (4) rest of the query, (5) named
entities, and (6) prepositional phrases. They provide additional detail about a query
to that provided by the coarse properties (but main focus and main verb often overlap with headConcept and headAction respectively).
– WordNet properties – these properties contain the WordNet categories for the top
four WordNet senses of the main verb and main focus of the query.5
• verbWNcat – e.g., “marry” has two senses, both of which are social, yielding
the value social: 2.
• focusWNcat – e.g., “currency” has four senses, two of which are attribute
(which is different from our attribute query type), one possession, and one
state, yielding the values attribute: 2, possession: 1, state: 1.
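This sense collation can be reproduced roughly with a modern WordNet interface; the sketch below uses NLTK purely for illustration (the exact sense counts depend on the WordNet version, so they may differ slightly from the figures quoted above).

from collections import Counter
from nltk.corpus import wordnet as wn

def wordnet_category_counts(lemma, pos, top_k=4):
    """Collate WordNet lexicographer categories over a word's top senses.

    For the noun "currency" this yields counts over categories such as
    noun.attribute, noun.possession and noun.state (the "noun."/"verb."
    prefixes come from WordNet's lexicographer file names).
    """
    senses = wn.synsets(lemma, pos=pos)[:top_k]
    return Counter(s.lexname() for s in senses)

# focusWNcat for "currency" and verbWNcat for "marry":
print(wordnet_category_counts("currency", wn.NOUN))
print(wordnet_category_counts("marry", wn.VERB))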
Each of these parts contains a bag-of-lemmas (recall that words are lemmatized),
which are modified as follows (a code sketch of these modifications appears after the list).
– Proper nouns are replaced by designations which represent how many consecutive
proper nouns are in a query, e.g., “Who is Tom Cruise married to?” yields who be
2NNP marry.
– Similarly, abbreviations are replaced by their designation.
– Certain combinations of up-to three query-initial words are merged, e.g., “what is
the” and “who is”.
The SVM was trained as follows. For each query type, we separated the 911 queries
into two groups: queries that belong to that type and the rest of the queries. For instance,
when training to identify person queries, our data consisted of 118 positive samples
and 793 negative samples. Each group was then randomly split in half, where one half
was used for training and the other half for testing, e.g., for our person example, both
the training set and the test set consisted of 59 positive samples and 397 negative samples.6
5 These properties perform word-sense collation, rather than word-sense disambiguation.

6 We also used another training method where the 911 queries were randomly split into two halves: training and testing. We then used the queries of a particular class in the training set, e.g., person, as positive samples (and the rest of the training queries as negative samples). Similarly any queries of that class that were found in the test set were considered positive samples (and the rest of the queries were negative samples). Although both methods yielded consistent results, we prefer the method described in the body of the paper, as it guarantees a consistent number of positive training samples.
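The per-class training scheme could be reproduced along the following lines; scikit-learn and the plain bag-of-lemmas text representation are used purely for illustration and are not the authors' implementation.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_one_vs_rest(query_texts, labels, target_class, seed=0):
    """Train a binary SVM for one query class against all others.

    Each query is represented here as a plain bag-of-lemmas string; the paper's
    11-part representation would concatenate the coarse, fine and WordNet
    properties instead.
    """
    y = np.array([1 if lab == target_class else 0 for lab in labels])
    # Split positives and negatives in half, preserving class balance,
    # as in the per-class training scheme described above.
    X_train, X_test, y_train, y_test = train_test_split(
        query_texts, y, test_size=0.5, stratify=y, random_state=seed)
    vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+")
    clf = LinearSVC()
    clf.fit(vectorizer.fit_transform(X_train), y_train)
    predictions = clf.predict(vectorizer.transform(X_test))
    return clf, vectorizer, y_test, predictions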
Query type      # of queries   # with answers   SVM recall (STDV)   SVM precision (STDV)   Retrieval: ans queries (STDV)
∗location       206            198              0.93 (0.02)         0.93 (0.04)            87.8% (2.1%)
∗number         187            163              0.96 (0.02)         0.99 (0.01)            82.6% (3.0%)
∗person         118            106              0.95 (0.03)         0.82 (0.04)            85.0% (2.8%)
∗time           144            139              1.00 (0.01)         0.99 (0.01)            88.7% (3.4%)
attribute       121            112              0.84 (0.05)         0.90 (0.04)            74.7% (3.9%)
howDoYouSay      21             20              0.86 (0.13)         0.81 (0.09)            62.3% (12.2%)
object           13             13              0.74 (0.18)         0.90 (0.12)            82.7% (12.2%)
organization     26             25              0.62 (0.13)         0.91 (0.09)            91.9% (6.6%)
process          35             30              0.97 (0.03)         0.99 (0.02)            85.1% (6.8%)
term             30             27              0.99 (0.02)         0.83 (0.04)            46.2% (10.1%)
∗other           10              9              0.42 (0.19)         0.77 (0.23)            68.2% (16.4%)
total           911            842              0.92                0.93                   82.6%

SVM recall and precision are averaged over 20 runs; retrieval performance is averaged over 20 runs using the best adjustment for each class.

Table 3. Recall and precision obtained by SVM for 11 manually derived categories for TREC11 and TREC12 queries; average retrieval performance for SVM-learned queries
Table 3 shows the recall and precision of the SVM for the 11 manually-derived
query types, where recall and precision are defined as follows.
Recall = (number of queries in class i learned by the SVM) / (number of queries in class i)

Precision = (number of queries in class i learned by the SVM) / (number of queries attributed by the SVM to class i)
Also shown is the retrieval performance for the learned classes after the application of
query-type-based retrieval adjustments. Both results were obtained from 20 trials.
As seen from the results in Table 3, six of the learned classes had over 93% recall,
and seven classes had over 90% precision. The other class had a particularly low recall (42%) and also a rather low precision (77%). However, this is not surprising owing
to the amorphous nature of the queries in this class (i.e., they had no particular distinguishing features, so queries belonging to different classes found their way into other, and queries from the other class wandered into the remaining classes). The organization class had a high
precision, but a lower recall, as some organization queries were wrongly identified as person queries (this is a known problem in question answering, as it is often
hard to distinguish between organizations and people without domain knowledge). The
object class exhibited this problem to a lesser extent. Some attribute queries were
mis-classified as location queries and others as person queries. This explains the
lower recall obtained for attribute, and together with the organization problem
mentioned above, also explains the lower precision obtained for person (location is
less affected owing to the larger number of location queries).
Overall, although we used only a modified bag-of-words approach, we obtained better results for our fine-grained classification than those obtained by Zhang and Lee [3]
for a coarse classification (the best performance they obtained with an SVM for bag-of-words was 85.8%, for n-grams 87.4%, and for a kernel that takes into account syntactic
structure 90.0%). This may be attributed to our use of WordNet properties, and our distinction between coarse, fine and WordNet properties. Also, it is worth noting that the
features considered significant by the SVM for the different classes are intuitively appealing. For instance, examples of the modified lemmas considered significant for the
location class are: “city”, “country”, “where” and “where be”, while examples of the
modified lemmas for the time class are: “year”, “date”, “when” and “when be”.
The retrieval performance shown in the last column of Table 3 is consistent with
that shown in Table 2. That is, the retrieval performance obtained for the automatically learned classes was not significantly different from that obtained for the manually tagged classes.
6 Conclusion
We have studied two aspects of the question-answering process – performance prediction and query classification – and offered a new contribution at the intersection of
these aspects: automatic selection of adjustments to the retrieval procedure based on
query class. Overall, our results show that retrieval performance can be improved by
dynamically adjusting the retrieval process on the basis of automatically learned query
classes.
In query classification, we applied SVMs to learn query classes from manually
tagged queries. Although our input was largely word based, our results (averaged over
20 runs) were superior to recent results in the literature. This may be attributed to the
breakdown of query properties into coarse-grained, fine-grained and WordNet-based.
In performance prediction, we first used coarse-grained query classes learned by a
rule-based system as predictors of retrieval performance, and as a basis for the dynamic
selection of adjustments to the retrieval procedure. This yielded improvements in retrieval performance. Next, finer-grained, manually-derived classes were used as the basis for the dynamic selection of retrieval adjustments. Retrieval performance was maintained for these finer-grained categories. This is an encouraging result, as fine-grained
categories are considered more useful than coarse-grained categories for answer extraction. Finally, the retrieval adjustments were applied to the SVM-learned, fine-grained
query categories, yielding a retrieval performance that is consistent with that obtained
for the manually-derived categories. This result demonstrates the applicability of our
techniques to an automated question-answering process.
7 Acknowledgments
This research is supported in part by the ARC Centre for Perceptive and Intelligent
Machines in Complex Environments. The authors thank Oxford University Press for
the use of their electronic data, and Tony for developing the thesaurus.
References
1. Cronen-Townsend, S., Zhou, Y., Croft, W.B.: Predicting query performance. In: SIGIR’02
– Proceedings of the 25th ACM International Conference on Research and Development in
Information Retrieval, Tampere, Finland (2002) 299–306
2. Zukerman, I., Raskutti, B., Wen, Y.: Query expansion and query reduction in document
retrieval. In: ICTAI2003 – Proceedings of the 15th International Conference on Tools with
Artificial Intelligence, Sacramento, California (2003) 552–559
3. Zhang, D., Lee, W.S.: Question classification using Support Vector Machines. In: SIGIR’03
– Proceedings of the 26th ACM International Conference on Research and Development in
Information Retrieval, Toronto, Canada (2003) 26–32
4. Vapnik, V.: Statistical Learning Theory. Wiley-Interscience, New York (1998)
5. Radev, D., Fan, W., Qi, H., Wu, H., Grewal, A.: Probabilistic question answering from the
Web. In: WWW2002 – Proceedings of the 11th World Wide Web Conference, Honolulu,
Hawaii (2002) 408–419
6. Kang, I.H., Kim, G.: Query type classification for Web document retrieval. In: SIGIR’03
– Proceedings of the 26th ACM International Conference on Research and Development in
Information Retrieval, Toronto, Canada (2003) 64–71
7. Salton, G., McGill, M.: An Introduction to Modern Information Retrieval. McGraw Hill
(1983)
8. Zukerman, I., Raskutti, B.: Lexical query paraphrasing for document retrieval. In: COLING’02 – Proceedings of the International Conference on Computational Linguistics, Taipei,
Taiwan (2002) 1177–1183
9. Brill, E.: A simple rule-based part of speech tagger. In: ANLP-92 – Proceedings of the Third
Conference on Applied Natural Language Processing, Trento, IT (1992) 152–155
10. Miller, G., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An
on-line lexical database. Journal of Lexicography 3 (1990) 235–244
11. Moldovan, D., Harabagiu, S., Pasca, M., Mihalcea, R., Girju, R., Goodrum, R., Rus, V.: The
structure and performance of an open-domain question answering system. In: ACL2000 –
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics,
Hong Kong (2000) 563–570
12. Zukerman, I., Horvitz, E.: Using machine learning techniques to interpret WH-questions.
In: ACL01 Proceedings – the 39th Annual Meeting of the Association for Computational
Linguistics, Toulouse, France (2001) 547–554