UNIVERSITY OF CALIFORNIA
SANTA CRUZ
COST-EFFECTIVE CREATION OF SPECIALIZED SEARCH ENGINES
A dissertation submitted in partial satisfaction of the
requirements for the degree of
DOCTOR OF PHILOSOPHY
in
COMPUTER SCIENCE
by
Reiner Kraft
March 2005
The Dissertation of Reiner Kraft
is approved:
Professor Raymie Stata, Chair
Professor Darrell Long
Professor Patrick Mantey
Dr. Kevin McCurley
Robert C. Miller
Vice Chancellor for Research and
Dean of Graduate Studies
Copyright © by
Reiner Kraft
2005
Contents
List of Figures
List of Tables
Abstract
Dedication
Acknowledgments

1 Introduction and Motivation
  1.1 Background
      1.1.1 Observing Users Searching for Specialized Information
      1.1.2 Dimensions for Search Specialization
      1.1.3 Document Type vs. File Type
  1.2 Thesis Statement
  1.3 Dissertation Outline

2 Iterative Filtering MetaSearch (IFM)
  2.1 Introduction and Related Work
      2.1.1 Metasearch
      2.1.2 Specialized Search Engines with Corpus
  2.2 IFM Overview

3 Case Study: The Buying Guide Finder (BGF)
  3.1 Overview and Introduction
      3.1.1 Marketing Research Related to Consumer Buying Behavior
      3.1.2 The Difficulty of Finding Buying Guides on the Web
  3.2 Approach
      3.2.1 Binoculars: Our Benchmark Topic
      3.2.2 Query Templates: A Framework for Specifying and Generating Queries
      3.2.3 Harvesting Results: Doc-Type Screening
      3.2.4 Selecting Terms: PMI-IR
  3.3 Evaluation
      3.3.1 Metrics
      3.3.2 Experimental Setup
      3.3.3 Judgment Guidelines
      3.3.4 Results
  3.4 Related Work to BGF
  3.5 Summary and Conclusions for BGF

4 Extending the Google Web Service API for IFM applications
  4.1 Overview and Motivation
  4.2 A Declarative Approach: The Projection Operation
      4.2.1 Supplementing the Snippet Operator
      4.2.2 Introducing the Projection Operation
      4.2.3 Other Considerations
  4.3 Experimental Setup and Results
      4.3.1 Generating Result Vectors with BGF
      4.3.2 Filtering Results in BGF
      4.3.3 API Implementations
      4.3.4 Modifications to BGF
      4.3.5 Results
  4.4 Summary and Conclusions for the Proposed IFM Web Services API

5 Doc-type Classification via Automated Feature Engineering
  5.1 Introduction and Motivation
  5.2 Related Work
      5.2.1 Doc-Type and Genre Classification
      5.2.2 Classification and Feature Selection
  5.3 Automated Feature Engineering
      5.3.1 Feature Selection
      5.3.2 Feature Augmentation
  5.4 Tuning the Doc-Type Classifier
      5.4.1 Adjusting Parameter Settings
      5.4.2 Additional Features and Heuristics
      5.4.3 Example: Adding Heuristics for Classifying Homepages
  5.5 Experiments and Results
      5.5.1 Building Training Sets
      5.5.2 Metrics
      5.5.3 Methodology
      5.5.4 Feature Distribution
  5.6 Summary and Conclusions for Doc-Type Classification

6 Case Study: Contextual Search
  6.1 Introduction and Overview
  6.2 Contextual Search
      6.2.1 Related Work
      6.2.2 Terminology
      6.2.3 Approaches for Implementing Contextual Search
  6.3 Adapting IFM for Contextual Search
      6.3.1 Query Generation
      6.3.2 Implementing Ranking in IFM
  6.4 Evaluation and Results
      6.4.1 Methodology
      6.4.2 Judgment Guidelines
      6.4.3 Experimental Setup
      6.4.4 Metrics
      6.4.5 Results
      6.4.6 Does IFM have Recall Limitations?

7 Conclusion
  7.1 Summary
  7.2 Future Work
      7.2.1 IFM
      7.2.2 BGF
      7.2.3 IFM Web Service API
      7.2.4 Doc-Type Classification
      7.2.5 Contextual Search

Bibliography
List of Figures
2.1  Iterative, filtering metasearch information flow and structure.
3.1  Experimental BGF results comparing the precision-at-10 (P@10), which measures the fraction of retrieved documents that are relevant at a document cut-off value of 10, of the BASE algorithm and BGF.
4.1  Experimental results comparing precision-at-10 (P@10), which measures the fraction of retrieved documents that are relevant, but capped at 10, of BGF using the standard Google API vs. using the Projection API on the Google and Nutch search engines.
5.1  Classification accuracy chart showing the Naive Bayes classifier's accuracy expressed as a percentage for homepages, buying guides, and recipes.
5.2  Feature scores distribution chart, where feature type scores are expressed as percentages between structural features, meta-data features, lexical affinities features, and text-only features.
6.1  Three different approaches for implementing contextual search.
List of Tables
1.1  Examples for document types.
3.1  Query templates that were used in BGF.
3.2  BGF product-category phrases along with their abbreviations.
3.3  Per-topic consensus expressed as percentages for "simple" (whether or not the document is a buying guide) and "good" buying guide (whether or not the document is a good buying guide) judgments.
3.4  Experimental BGF results showing precision-at-10 (P@10) and precision-at-5 (P@5), which measure the fraction of retrieved documents that are relevant at different document cut-off values (10 and 5) for 10 different product categories, along with average precision scores. We distinguish between a "simple" buying guide judgment ("Is the document a buying guide?") and a "good" buying guide judgment ("Is the document a good buying guide?").
5.1  Classification accuracy expressed as a percentage comparing the baseline against feature selection and feature augmentation using the Naive Bayes classifier.
5.2  Classification accuracy expressed as a percentage comparing the baseline against feature selection and feature augmentation for the maximum entropy classifier.
5.3  Feature counts distribution, where feature type counts are expressed as percentages between structural features, meta-data features, lexical affinities features, and text-only features.
5.4  Feature scores distribution, where feature type scores are expressed as percentages between structural features, meta-data features, lexical affinities features, and text-only features.
6.1  Experimental results for Strong Precision (SP) and Strong Enhancement (SE) in the context-only scenario (query-less) using the SE1 search engine. Strong Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision. Strong Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
6.2  Experimental results for Strong Precision (SP) and Strong Enhancement (SE) in the context-only scenario (query-less) using the SE2 search engine. Strong Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision. Strong Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
6.3  Experimental results for Precision (P) and Enhancement (E) in the context-only scenario (query-less) using the SE1 search engine. Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision. Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
6.4  Experimental results for Precision (P) and Enhancement (E) in the context-only scenario (query-less) using the SE2 search engine. Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision. Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
6.5  Experimental results for Strong Precision (SP) and Strong Enhancement (SE) in the context plus query scenario using the SE1 search engine. Strong Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision. Strong Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
6.6  Experimental results for Strong Precision (SP) and Strong Enhancement (SE) in the context plus query scenario using the SE2 search engine. Strong Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision. Strong Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
6.7  Experimental results for Precision (P) and Enhancement (E) in the context plus query scenario using the SE1 search engine. Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision. Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
6.8  Experimental results for Precision (P) and Enhancement (E) in the context plus query scenario using the SE2 search engine. Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision. Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
6.9  Change of rank positions between QR-1 and RB-5-1 using weight w1.
6.10 Change of rank positions between QR-1 and RB-5-1 using weight 2.5 × w1.
6.11 Change of rank positions between QR-1 and RB-5-1 using weight 5 × w1.
6.12 Change of rank positions between QR-1 and RB-5-1 using weight 10 × w1.
Abstract
Cost-effective Creation of Specialized Search Engines
by
Reiner Kraft
People increasingly use Web search engines to fill a wide variety of navigational, informational, and
transactional needs. However, today’s major search engines demonstrate poor precision when searching
for specialized information (e.g., buying guides). Specialized search engines address this problem by
assuming a specific information need and returning results according to that assumption. This leads to
increased precision with simple queries. However, building a specialized search engine requires a high
level of expertise combined with a significant development effort, making them very expensive.
This thesis proposes a flexible, extensible architecture and tools for building specialized search
engines based on Iterative, Filtering Metasearch (IFM). With our framework, an advanced web developer
or consultant can build a search engine that is specialized by document type within a few man-weeks of
effort. Furthermore, we show how IFM can successfully be used and adapted to the broader context of
building specialized search applications, and also discuss the limitations of applying IFM.
To my wife,
Bettina M. Kraft,
who made all of this possible, for her endless encouragement and patience.
Iron rusts from disuse; water loses its purity from stagnation and in cold weather becomes frozen;
even so does inaction sap the vigors of the mind.
– Leonardo da Vinci (1452-1519), Italian engineer, painter, & sculptor.
Acknowledgments
I appreciate the many contributions that people have made to this thesis. First of all, I am grateful to my
advisor, Raymie Stata, for his guidance and patience with me during these past two years. I would also
like to express my deepest appreciation to my committee members Patrick Mantey, Kevin McCurley, and
Darrell Long for their helpful guidance and insightful comments on this research. Furthermore, I also
wish to thank the faculty and staff of the computer science department at UC Santa Cruz for their help
and support during the last 5 years.
When I started to take graduate classes at Stanford University back in 1997 while I was still
working at the IBM Almaden Research Center in San Jose, CA, there were a few people I would like
to mention here who motivated me to pursue a PhD degree in the first place. First, my former second-line and senior manager, Norm Pass: without his support and encouragement this would not have been possible. Second, my friend and mentor Qi Lu helped and guided me during the early stages of my
career and encouraged me further to pursue my PhD degree. He is still very supportive, and his advice
is always valuable. I would also like to thank my former managers at IBM, Daniel Ford and Eugene
Shekita, who would always encourage me to pursue my education, as well as Shanghua Teng (now a Professor at Boston University), who was a visiting scientist during the summer of 1998, and with
whom I worked together on various interesting projects back then. Overall there were many more people
at IBM whom I have to thank for their help and support while working there.
Finally, I would like to thank many friends and co-workers at Yahoo!. Special thanks go to my
manager, Farzin Maghoul, who was very supportive the past year.
The text of this dissertation includes a reprint of the following previously published material:
R. Kraft and R. Stata. Finding Buying Guides with a Web Carnivore, 1st Latin American Web Congress
(LA-WEB), Santiago, pages 84–92, November 2003. The co-author listed in this publication directed
and supervised the research which forms the basis for the dissertation.
The work in Chapter 6 was done while I was working at Yahoo!. Farzin Maghoul contributed background to Section 6.2 on contextual search. Chi-Chao Chang contributed to Section 6.4 on evaluation
and results, and Raymie Stata directed and supervised the research.
Last but not least, I appreciate my wife, Bettina M. Kraft, and Michele Repine for their kind and
thorough proofreading of my thesis.
Chapter 1
Introduction and Motivation
1.1 Background
Using web search engines on the World Wide Web for problem solving has become
increasingly commonplace. However, the problem of poor precision when searching for specialized
information (e.g., buying guides) using today’s major web search engines (e.g., Yahoo! 1 , Google 2 ) is
well known in the literature [59], [33], and [53].
A widely accepted de facto standard for web search is a simple query form that accepts keywords and returns document locators (URLs 3 ) as results. In general, keyword-based query articulation is difficult and requires a substantial learning effort. Therefore typical queries are short, comprising an average of two to three terms per query [83], [72]. When searching for specialized information, the problem of query formulation becomes harder: a user has to translate a complex information need into a limited set of keywords.
1 http://search.yahoo.com
2 http://www.google.com
3 A Universal Resource Locator (URL) represents a unifying syntax for the expression of names and addresses of objects on the
network as used in the World-Wide Web. Its details are specified in RFC 1630 (http://www.ietf.org/rfc/rfc1630.txt).
1.1.1 Observing Users Searching for Specialized Information
We performed a small experiment where we asked some users to search for buying guides on
various topics (e.g., digital cameras). We then observed how the users approached that task. Our anecdotal investigation indicates that many expert web users utilize a query formulation approach that can be broken down into four types of behaviors:
Doc-type expansion: Adding terms related to the document type or genre to the query. With document type
we refer not to the document’s file or mime type, but rather to its intent. In the case of the “buying
guide” document type, for example, such terms could include buying guide or feature overview.
Doc-type expansion is effective for mining search engines for documents within a specific document type.
Topic expansion: Trying alternative terms or synonyms to obtain better results for some topics. For example, a first query for "binoculars" might be followed by a second query expanded to "binocs".
Iteration: Applying doc-type and topic expansion iteratively, sometimes exploring local optimizations
(modifying the previous query in a small way to refine the results), sometimes exploring global
optimizations (entering a different query to find a new collection of results).
Ranking and Filtering: Identifying and recording good results, organizing and ranking them by their
overall quality and relevancy. When enough results are found, the search stops.
This overall search and refinement process is time-consuming even for experts. Specialized search engines address this issue by assuming a specific information need and returning results according to that assumption.
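To make these behaviors concrete, the following minimal sketch (with illustrative terms that are not taken from our experiment) shows how doc-type and topic expansion can be combined mechanically into a sequence of candidate queries:

# Minimal sketch of doc-type and topic expansion; the term lists are hypothetical.
topic_terms = ["binoculars", "binocs"]                # topic expansion candidates
doctype_terms = ["buying guide", "feature overview"]  # doc-type expansion candidates

def expand_queries(topics, doctypes):
    """Combine each topic variant with each doc-type variant (iteration)."""
    for topic in topics:
        for doctype in doctypes:
            yield topic + " " + doctype

for query in expand_queries(topic_terms, doctype_terms):
    print(query)  # "binoculars buying guide", "binoculars feature overview", ...

In practice the expert user also filters and ranks the harvested results by hand; the IFM approach introduced in Chapter 2 automates both halves of this loop.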
1.1.2 Dimensions for Search Specialization
There are various dimensions for search specialization:
File format: Specialization is done by file or mime type of a document (e.g., Yahoo Image Search 4 ).
Geography/Language: Specialization is done by either a geographic region (e.g., based on the URL)
or the language of a document. For example, major search engines offer language specific search
applications (e.g., Google.de 5 ).
Transient Information: Specialization is done by focusing on transient information. Google News 6 or
Daypop 7 are examples of search engines that focus on transient information.
Document Type: Specialization is done by the intent of the document. ResearchIndex [57], MRQE
[63], and BGF [53] fall into this category.
User Intent: Specialization is done by looking at the user’s intent (why a user is searching) at the moment of search. For example, search engines could base specialization on Broder’s [13] classification of queries into informational, navigational, and transactional queries. Ahoy! [46], the
homepage finder, applies user intent specialization.
Search Context: Specialization is done by applying and leveraging the search context to augment the
query. The promise is a higher relevancy within the given context. We describe contextual search
in more detail in Chapter 6. Y!Q 8 represents an example of a contextual search application.
Overall, all of these specialization approaches have one goal in common: Increased precision
with simple queries. However, today there are only a limited number of specialized search engines
available on the Web, and no framework or tools yet exist to build these easily.
4 http://search.yahoo.com/images
5 http://www.google.de
6 http://news.google.com
7 http://www.daypop.com
8 http://yq.search.yahoo.com
1.1.3 Document Type vs. File Type
It is important to note the difference between file format and document type: For the purpose
of this discussion the document type refers to the intent of the document, rather than a file or mime type.
Popular examples of document types are shown in Table 1.1.
Buying guides
Movie reviews
Product reviews
Lyrics
Recipes
Personal homepages
Commercial homepages
News articles
Answers to homework questions in textbooks
Vacation impressions
Student assignments
License agreements or other legal documents
Research papers
Restaurant and/or hotel reviews
Corporate mission statements
Course schedules
Conference or workshop call for papers
Software bug fixes
Driving or other travel directions to a location
Product announcements by manufacturer (not reseller or reviewer)
Legal meeting notices (e.g., city council hearings, planning commissions, etc.)
Environmental impact reports
Pornographic material
Concert venue schedules
Table 1.1: Examples for document types.
Specialized search by document type has been identified as an important area of research [59],
[65], [46], but there are still many unresolved issues. In particular, the identification of a document type
is a highly subjective process, which makes the classification of document types very difficult.
1.2 Thesis Statement
We have described above the major specialization approaches for web search. The major
problem we identified with specialized web search solutions is that they are very expensive to build, and
require a high level of expertise combined with a significant development effort.
This thesis focuses on one approach: Specialization by document type. The primary objective
of this dissertation is to build a framework and tools with which an advanced web developer or consultant
can build a search engine specialized by document type within a few man-weeks of effort. Once this is
accomplished, our secondary goal is to show how the proposed framework can be more broadly applied. We therefore offer an in-depth study of contextual search and its application as another dimension of
specialized search applied within the proposed framework.
1.3 Dissertation Outline
Chapter 2 proposes a framework and approach for implementing specialized search, which we refer to as "Iterative, Filtering Metasearch (IFM)". We provide an overview with background information and identify three major issues:
1. Query formulation
2. Filtering/Ranking
3. Web search engine APIs
We then show in Chapters 3, 4, and 5 how the IFM approach can be used to build specialized
search applications by document type.
Chapter 3 presents a specific example of an IFM application: The Buying Guide Finder (BGF).
BGF represents a specialized search application by document type that focuses on buying guides. We
describe the BGF architecture in detail, introduce Query Templates as a generic framework for solving
IFM’s query formulation problem, and give experimental results that show that BGF indeed outperforms
a major search engine in terms of precision and relevancy when searching for buying guides.
We observed while working with BGF that the exposed search engine interfaces are not ideal
for IFM: Existing search engine APIs are targeted towards human usage and consumption. We therefore
propose in Chapter 4 a specific web service API that introduces a projection operation as a preferred
interface to a search engine that works well with IFM applications and imposes relatively little overhead on the search engine. We show in experiments that the proposed API further increases BGF's precision and
relevancy.
Furthermore, we identified the need for a systematic doc-type classification solution that works
well for domain transfer: A doc-type classifier for buying guides should be easily transferable to a
different domain, for example recipes. Chapter 5 investigates the problem of doc-type classification in
depth. We propose a novel feature engineering method that uses off-the-shelf classification packages and
works well for a variety of different document types, while requiring relatively little training effort. We
then conclude that our IFM approach combined with the proposed doc-type classification system indeed
satisfies our primary goal of providing a framework and tools with which an advanced web developer or
consultant can build a search engine specialized by document type within a few man-weeks of effort.
Chapter 6 explores how contextual search can be implemented using IFM and shows a comparison of IFM’s performance against two other approaches for implementing contextual search. The
motivation derives from our secondary goal to study in depth contextual search and its application as
another dimension of specialized search applied within the proposed framework. The reason we picked contextual search is that it requires IFM to use ranking to do filtering. We describe an IFM application that implements contextual search and uses a flexible rank aggregation scheme. While working on contextual search we observed possible recall limitations of IFM, which we investigate in Section 6.4.6. We qualitatively compare different approaches – naive query rewriting and rank biasing – for implementing contextual search against our IFM implementation, and present experimental results that show the qualitative differences between these approaches.
We then conclude in Chapter 7 and present possible areas of future work.
Chapter 2
Iterative Filtering MetaSearch (IFM)
2.1 Introduction and Related Work
There are two major strategies for building specialized web search engines. This section de-
scribes both in more detail and addresses their limitations:
Metasearch The term metasearch engine typically refers to applications whose intent is to offer broad
searching services along the lines of traditional search engines like Alta Vista (http://www.altavista.com/) and Google (http://www.google.com).
Metasearch engines were among the first applications built on top of web search engines. They
distribute their queries to possibly multiple "real" search engines and combine the results for the
user. SavvySearch [40], [22] and MetaCrawler [70] were both launched in mid-1995 and appear
to have been invented independently. Since then, a large body of literature has been written on the
topic (see [61] for a survey).
Specialized search engines with corpus These are self-contained search engines with their own indices
that achieve specialization through a carefully constructed corpus (typically by utilizing focused crawling and information extraction tools). For example, [18], [6], [69], and [3] take this approach.
Focused crawling is a concept that seems to predate web metasearch (see [11]). A common strategy
for focused crawling is to use a best-first strategy that ranks links according to the match between
their anchor text (the visible text in a hyperlink, i.e., the text embedded between the <a> and </a> tags) and a model of the search goal.
Broadly speaking, search can be specialized in several dimensions as discussed in the previous
chapter. In this chapter we focus on two dimensions:
1. Specialization by topic (as in the case of the nanotechnology engine [19]).
2. Specialization by document type (as in the case of the homepage finder [46] and BGF [53]).
Sometimes, an engine is specialized in both dimensions, as in the case of the ResearchIndex
[57], which specializes in research papers (doc-type) in scientific fields (topic). While certain document
types (such as homepages [46]) have been studied for a while, it should be noted that the more general
aspect of document type classification [59] itself has captured more research interest recently, along with
work related to genre search [29], [48], [49]. The difference between document type and genre is that
document type refers to the intent of a document, while genre focuses more on its style (e.g., subjective,
objective, casual, formal). In the literature both terms are often used synonymously. In our work we
focus primarily on document type, not genre.
2.1.1 Metasearch
Based on their intent, we can classify metasearch engines into
• General metasearch engines
• Specialized metasearch engines
The benefit of general metasearch engines (e.g., Dogpile, http://www.dogpile.com) is said to be better recall (because
the coverage of the metasearch engine is the union of that of the underlying engines). However, the
advantages of metasearch are not confined to better recall, but also include better precision. This is
explained by the fact that different search engines employ different ranking strategies. For example, if
a result element has been ranked highly in many different result lists, it may be indeed a good match.
Alternatively, if a result element appears highly ranked in only one or a few search engines, chances
are that it does not deserve an overall better ranking position. Rank aggregation techniques (e.g., [23],
[28]) focus on this issue, and may therefore produce a fairer and overall more robust ranking with better
precision.
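To illustrate the flavor of such rank aggregation, the following minimal sketch applies a simple Borda-count scheme (a generic stand-in, not the specific methods of [23] or [28]): a result ranked highly by several engines accumulates a higher aggregate score.

# Borda-count sketch: each engine awards (list length - position) points per result.
def borda_aggregate(result_lists):
    scores = {}
    for ranking in result_lists:
        n = len(ranking)
        for position, url in enumerate(ranking):
            scores[url] = scores.get(url, 0) + (n - position)
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["u1", "u2", "u3"]   # hypothetical ranked URL lists from two engines
engine_b = ["u2", "u1", "u4"]
print(borda_aggregate([engine_a, engine_b]))  # u1 and u2 rise above u3 and u4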
The relationship between a metasearch engine and other search engines can be either one-to-one or one-to-many. We therefore have the following cases:
1. General metasearch - on top of one search engine
2. General metasearch - on top of many search engines
3. Specialized metasearch - on top of one search engine
4. Specialized metasearch - on top of many search engines
For general metasearch we typically see case number two implemented (we mention case number one in the interest of being comprehensive). In our work we specifically focus on case number three
where we have specialized metasearch on top of one search engine. We plan to extend our work to case
number four in future work.
An alternative goal for metasearch engines – besides higher recall and better precision in the
general case – is to provide higher precision in the context of a specific information need. Furthermore,
the idea of reusing the large indices of general-purpose search engines represents an attractive strategy
for building specialized search solutions [26]. We call these specialized metasearch engines. There is a
developing body of work applying specialized metasearch engines to homepage finding [32], [46], [74],
[25]. This example of specialized metasearch helps the user specifically find personal homepages. Other
examples of specialized metasearch engines include one that specializes in the topic of nanotechnology
[19], one that finds news articles related to closed-caption text [39], and the buying guide finder [53] that
we describe in depth in Chapter 3.
While from an economic perspective it seems compelling to use metasearch to build specialized search solutions (e.g., there is no index to build and maintain), specialized metasearch engines remain difficult to develop. The main reason is that each search engine uses its own
query language and syntax, and returns results in some HTML format (parsing the result format is highly
error-prone). The information needs of metasearch applications have been so deeply considered that the
STARTS protocol [37] was proposed in an effort to standardize the interfaces used by metasearch engines
to access underlying search engines. Such standardization efforts (if widely adopted), and emerging web
service APIs (e.g., Google API [36]) will help to address the need of standard interfaces to automate
processing. Today’s available web search engine APIs are targeted more towards human consumption of
results, making them a poor choice for automated processing by specialized metasearch applications. We point out these limitations in depth in Chapter 4.
Another major challenge to building specialized metasearch engines is how to specify and implement filtering of search results to achieve the desired specialization. Today this requires a significant
level of expertise and manual tuning of classifiers, as can be seen with examples like the homepage finder
[46], or more recent work related to document type classification [59].
Similarly, query formulation that leads to many high-quality results according to a user's specialized information need remains difficult, and there is no "off the shelf" solution available. Glover et al. [32] propose to learn query modifications based on support vector machines (SVMs). The problem with this approach is that it requires a significant training phase and expertise, but it certainly
represents a viable step towards automating query formulation for specialized information needs.
Oyama et al. [65] present a different approach to query formulation by using decision trees
to generate “keyword spices”. This model does not filter documents. Instead, the idea is to extend the
user’s input query with domain specific Boolean expressions (keyword spices) to obtain better precision.
However, there are several limitations with this approach. First, although precision will be higher with
carefully constructed queries that contain domain-specific extensions, our experiments indicate that a filtering step still seems to be beneficial to weed out unwanted results. Second, because not all search engines support full Boolean search operators, the results obtained when issuing Boolean queries to such engines are not well defined. In addition, Oyama et al. do not describe how the proposed techniques can be
extended to fully automate the construction of domain-specific web search engines for various domains.
2.1.2 Specialized Search Engines with Corpus
One example of a popular specialized search engine we mentioned before is ResearchIndex,
which is an autonomous citation indexing system for research papers. It uses focused crawling as its
primary source to discover and harvest new information. ResearchIndex also uses web search engines
and heuristics to locate research papers. This is done by using queries for documents that contain certain
words (e.g., “publications”, “papers”). The obtained search results are then used as a seed list for their
crawler. ResearchIndex downloads the research papers themselves, extracts meta-data (such as title,
author, and citations), and does its own indexing. Other examples of specialized search engines include
the FAQ Finder [14], the Movie Review Query Engine [63], and one that allows searching for biomedical
information [75].
At present there seems to be more literature on specialized search via focused crawling [18, 6, 3] than on specialized metasearch; however, we believe that the comprehensive scope of modern web search
engines will turn this tide. The increased recall achieved by focused crawling will be minimal, while the
decreased cost of metasearch will be compelling.
2.2 IFM Overview
As the previous sections illustrate, the metasearch concept of building web IR applications on
top of large, general-purpose web search engines is a broad one. In this section we introduce a specific
approach for specialized metasearch which we refer to as “iterative, filtering metasearch (IFM)”. Again,
it is our belief that growing collection sizes and web services APIs will lead to increased interest in this
approach.
The IFM approach can be used as a basis to build web carnivores. Etzioni coined this colorful
phrase [26]. In this analogy, web pages are at the bottom of the web information food chain. Search
engines are the herbivores of the food chain, grazing on web pages and regurgitating them as searchable
indices. Carnivores sit at the top of the food chain, intelligently hunting and feasting on the herbivores.
The carnivore approach leverages the significant and continuous effort required to maintain a world-class search engine (crawling, scrubbing, de-spamming, parsing, indexing, and ranking). In a search
context, the carnivore approach is applicable when standard web search engines are known to contain
the documents of interest, but do not return them in response to naive queries. IFM represents an example
of the carnivore approach; we therefore sometimes use these terms synonymously.
Figure 2.1: Iterative, filtering metasearch information flow and structure.
Figure 2.1 depicts the overall structure of IFM. [19], [39], and [53], for example, are in this
family ([32] is also in this family, although it does not seem to perform filtering).
In the IFM scenario some form of input arrives from which output is to be generated. In
[39], for example, that input is in the form of processed closed-caption text, while in [53] that input is
simple phrases naming product categories. A list of URLs, and potentially some descriptive meta-data,
is produced as output.
In between this input and output, a query-formulation algorithm generates queries to be fed to
the search engine. The search engine processes the queries and returns results to be fed into the IFM
application’s filtering and ranking engine (Filter/Ranker). The filtering and ranking engine then selects
and possibly re-ranks results from the search engine. It also can produce feedback to the query generation
phase, which has an impact on future queries submitted to the search engine.
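The following minimal Python sketch illustrates this information flow; the component interfaces and names here are illustrative assumptions, not a prescribed API of any of the systems discussed:

# Illustrative IFM loop: query generation, search, filtering/ranking, and feedback.
def run_ifm(input_spec, query_generator, search_engine, filter_ranker, goal=50):
    """Return harvested results for the given input (e.g., a product phrase)."""
    harvest = []
    feedback = None
    while len(harvest) < goal:
        query = query_generator.next_query(input_spec, feedback)  # may use feedback
        if query is None:                       # query generator exhausted
            break
        results = search_engine.search(query)   # e.g., (url, title, snippet) tuples
        kept, feedback = filter_ranker.select(results)  # filter and possibly re-rank
        harvest.extend(kept)
    return harvest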
As mentioned earlier, this form of metasearch is a generalization of traditional metasearch,
but the new components are significant: In traditional metasearch, the user’s query is either passed
directly through to the underlying search engine, or is only slightly changed for syntactic reasons. In our
more general case, the query generator is much more active, perhaps inferring queries that were not
explicitly given [39], or perhaps significantly augmenting the user’s query [32, 53].
Also, traditional metasearch engines typically re-rank results from the underlying search engine (in fact, this “rank aggregation” problem is a research topic on its own [28, 23]). However, typically
they do not filter results. In our own experience, and according to [39], such filtering can significantly
increase the precision of specialized metasearch engines.
Chapter 3
Case Study: The Buying Guide Finder (BGF)
3.1 Overview and Introduction
Research on buying behavior indicates that buying guides play an important role in the
overall buying process. Given this, we built a web carnivore that finds buying guides on behalf of
consumers. Finding buying guides is an instance of the more general problem of specialized search by
document type. BGF finds buying guides by issuing machine-generated queries to a web search engine
and filtering the results. It represents an instance of IFM that takes a product phrase as input
and returns (what are thought to be) buying guides as output. This chapter describes our BGF system
and quantitatively compares it to a basic search engine. Our system almost always returns more buying
guides, and often returns twice as many. In addition, our user study also suggests that we return better
buying guides.
3.1.1 Marketing Research Related to Consumer Buying Behavior
Silverman et al. [71] report that, although consumer purchasing on the Web is increasing,
consumers still abort 23% of sales transactions that they initiate: Four out of five that abort do so because
of search-related reasons. Further, they report that many more potential transactions are never initiated
due to search-related failings. The current process of finding information relevant to purchasing decisions
is simply too complicated for the naive Internet user. These findings motivated us to explore new kinds
of search systems that help consumers at various stages of the buying process.
Guttman et al. [38] provide an overview of marketing research related to consumer buying
behavior. They describe a multi-activity decision-making process in which product brokering is an important, early activity. The goal of product brokering is to learn the high-level features and characteristics
of a product category (vs. a specific product model) and how those features and characteristics relate to
the buyer’s personal needs and constraints. Buying guides provide just this information. Good buying
guides for digital cameras, for example, describe features such as “pixel count” and “optical zoom” and
relate those features to personal goals such as “sending prints to grandmother” and to constraints such as
budget. As part of our larger program on decision-support systems (DSS) for consumer e-commerce, we
have built BGF to aid in the important activity of product brokering. The BGF takes as input a product
category (a.k.a. topic), such as “digital camera” or “washing machine”, and returns buying guides for
the given product category.
The Web contains many good buying guides on a bewildering range of product categories.
The Web also contains lots of useful shopping pages that are not buying guides. For example, product
reviews describe the features and benefits of one or a few particular products. Comparison shopping
sites compare features and prices of usually quite a few particular products. Although this information
is valuable, it typically does not address the needs of product brokering. Unfortunately, this panoply of
information makes it hard to find buying guides when such guides are needed.
3.1.2 The Difficulty of Finding Buying Guides on the Web
While many buying guides can be found on the Web, finding those guides is difficult, if not impossible, for the average consumer. Web search engines typically index many buying guides on many topics,
but simple queries do not often return these results.
There are many plausible reasons why simple queries do not return those buying guides. One
of them is that an author of a buying guide may use a different wording or terminology (e.g., feature
overview, shopping guide, review). In this case a search for buying guide would not necessarily return them. Another reason is economic: commerce sites are working hard to improve their search engine rankings, so good buying guides that are not tied to commerce channels are at a disadvantage and tend to be shunted aside. Based on anecdotal experiments we observed that buying guides are
themselves a precious commodity, and while everyone wants them, few sites actually have them. But
there are still plenty of good buying guides available on the Web covering a plethora of topics. This
observation further motivated us to build an automated way of finding good buying guides, and to make
them available to users who need them.
This chapter points out novel aspects of our system that are applicable to building specialized search applications. Overall the results obtained from this work were encouraging and provided
experimental evidence that IFM is indeed a viable and cost-effective strategy for building specialized
search engines by document type. The next section of this chapter describes our BGF system, including
a number of variations with which we have experimented. Then we evaluate and compare the results of
a standard web search engine – Google – to our BGF. At the end of this chapter we conclude and discuss
related work.
3.2 Approach
Our basic approach is based on a manual technique that many expert web users employ when
searching the Web for documents of a particular type. When manually performing doc-type searches,
one occasionally obtains good results by simply typing the product category directly into a regular search
engine. Also, sometimes a specialized search engine is available (e.g., the Movie Review Query Engine
[63]). However, when these simple approaches fail, our anecdotal investigation (presented in Section
1.1.1) indicates that many expert web users utilize a more or less manual and tedious refinement approach.
BGF automates this technique. Our system repeatedly executes trials, each trial consisting of
three steps:
1. Building a query (during which we employ doc-type and topic expansion).
2. Sending the query to a search engine.
3. Harvesting results (where we employ filtering).
The system runs trials until the desired number of results is obtained. It uses Google as its
underlying search engine. Google is well-suited to our task because it has broad coverage, good ranking,
and exposes an API (http://www.google.com/apis/) that allows programmatic access. At a high level, our results are independent of
Google and could be applied with any search engine. However, as will be seen, certain aspects of the
Google API do impact the details of our system.
3.2.1 Binoculars: Our Benchmark Topic
Before describing the details of our system, we first describe the infrastructure we created for
tuning those details. Early on in our project it became clear that, in the details of what we were building,
we were facing a huge number of design decisions, each of which could have a significant impact on the
performance of our system. On the one hand, it was impractical for us to do user-based evaluations of
the many combinations we were facing. On the other hand, we wanted a mechanism for using data to
test our decisions.
We created such a mechanism by manually building a reasonably exhaustive benchmark set
of buying guides for a single benchmark topic (“binoculars”). When exploring our design space, we
used traditional precision and recall metrics (see also Section 3.3.1) to measure the impact of various
alternatives. Later in this section, when we say that one approach works “better” than others, such
claims are based on measurements against this benchmark set. During development, to avoid overtuning to our benchmark topic, we occasionally performed additional evaluations against a handful of
secondary benchmark topics (“digital cameras” and “cars”). In these secondary evaluations, we simply
measured precision of the first tens of results by manually inspecting for buying guides.
As just suggested, this approach to tuning could over-tune our system to the benchmark topic.
It could also introduce a biased notion of what is a “buying guide”. The evaluation in Section 3.3 suggests
that neither of these issues became a problem for us.
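For reference, the measurement itself is straightforward; a minimal sketch (with hypothetical URLs, not our actual evaluation code) of computing precision and recall against such a benchmark set is:

# Precision/recall of returned URLs against a manually built benchmark set.
def precision_recall(returned_urls, benchmark_urls):
    returned, relevant = set(returned_urls), set(benchmark_urls)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

returned = ["u1", "u2", "u9", "u10"]        # hypothetical system output
benchmark = ["u1", "u2", "u3", "u4", "u5"]  # hypothetical benchmark buying guides
print(precision_recall(returned, benchmark))  # (0.5, 0.4)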
We selected “binoculars” as our primary benchmark topic for the following reasons. First, we
wanted a topic that was somewhat obscure, but which had non-trivial representation in the Web. Google
indicated that it has around 1.5M documents containing “binoculars” (as of June ’03), a small number but
not a tiny one. Second, we wanted a non-electronic product category, under the theory that buying guides
for electronic categories are more easily found on the Web (the evaluation suggests this assumption may
not be the case).
Given the topic “binoculars”, we manually searched Google for buying guides. We examined
over 3,000 (unique) results from over 1,000 queries (carefully crafted to be biased towards buying guides) and identified a benchmark set of around 200 buying guides. Based on the diminishing returns
we observed late in our searching process, we believe this benchmark set to be reasonably exhaustive for
our topic.
It should be noted that the process of manually creating the benchmark set gave us good insight
into the types of queries that do and do not tend to turn up buying guides. It also pointed out certain
problems (e.g., “doc-type drift”) that we needed to address. Thus, we benefited unexpectedly from
generating this benchmark set early in the project.
3.2.2 Query Templates: A Framework for Specifying and Generating Queries
On the surface, our approach to generating queries seems easy: simply combine some topic
and doc-type terms. However, a number of details conspire to make it harder than it first seems:
Term generation We need to generate a candidate pool, that is, a set of candidate terms from which we
randomly pick to generate queries. We divide this pool into topic candidates and doc-type candidates, and in fact further subdivide the latter into verb, noun and adjective doc-type candidates.
Query formulation Once we have a good candidate pool, we need a process for turning it into a sequence of queries.
We return to the problem of term generation in Section 3.2.4. The rest of this subsection
discusses query formulation.
Naively giving the entire candidate pool to a search engine as a single query does not yield
good results. Rather, our experience indicates that queries need to be formulated more surgically, e.g.,
combining a single topic term with a doc-type verb and doc-type noun. Further, web search engines
typically have rich query languages. For example, Google treats terms early in the query as more "important" than later words, it is sensitive to the proximity of words within a query, and it offers a number of operators such as intitle: [15]. We found that utilizing such features can greatly improve yields.
These factors caused us to create a query template language to more quickly explore the huge
space of strategies for query formulation. A query template represents a pattern for specifying and
generating a family of queries. At execution time the query templates get expanded into a sequence of
queries by systematically picking terms from the candidate pools.
Our query template language comprises:
Literals A literal represents text that gets simply copied when the template gets expanded.
Placeholders A placeholder is associated with a topic or doc-type candidate term and gets replaced at
the time the template gets expanded.
Decorations A decoration is associated with placeholders to manipulate the default behavior when the
placeholder gets expanded.
Roughly speaking, our query template language works like this:
1. Our four candidate pools are represented by the placeholders:
• TOPIC
• DOCTYPE NOUN
• DOCTYPE VERB
• DOCTYPE ADJECTIVE
2. A simple query template is simply a sequence of literals and placeholders. For example, the
template “intitle:TOPIC DOCTYPE NOUN” generates a query consisting of “intitle:”
followed by a topic term followed by a doc-type noun.
3. If expanded exhaustively, even a small template and candidate pool can generate a huge number of
queries. However, our high-level strategy is to generate just a few queries per template, then move
on to another template in a sequence of templates. Thus, we need a mechanism for controlling the
number of expansions that occur.
4. This is done through decorations placed on the placeholders. Thus, for example, TOPIC=2 indicates that only two expansions should be tried. Therefore the template TOPIC=3 DOCTYPE VERB=2
would generate six queries. The expansion is done in a pseudo-random fashion such that the same
template against the same candidate pool will generate the same expanded queries.
Query Template                               Example
+TOPIC +DT VERB[0] +DT NOUN[0]               +binocular +buying +guide
+TOPIC[0] +TOPIC +DT VERB[0] +DT NOUN[0]     +binocular +binocs +buying +guide
+TOPIC +DT VERB[0] +DT NOUN                  +binocular +buying +advice
+TOPIC[0] DT VERB                            +binocular choosing
+TOPIC[0] TOPIC DT VERB                      +binocular 8x21 pick
+TOPIC DT VERB DT ADJECTIVE                  +binoculars buying right
+TOPIC[0] DT VERB TOPIC                      +binocular choose magnification
+intitle:TOPIC TOPIC DT VERB                 +intitle:binocular binocs buying
+TOPIC DT VERB DT ADJECTIVE DT NOUN          +binocular choose best guide
DT VERB DT ADJECTIVE DT NOUN +TOPIC          choosing best guide +binocular

Table 3.1: Query templates that were used in BGF.
The above description is meant only to give a rough overview of our rich template language.
It is beyond the scope of this section to describe it entirely.
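To make the mechanics concrete, the following is a minimal Python sketch of how such a template might be expanded against candidate pools. The function name expand_template, the pool names, and the terms shown are illustrative assumptions, not the actual BGF implementation; the real template language is richer than what is handled here.

    import itertools
    import random

    def expand_template(template, pools, seed=0):
        # pools maps placeholder names (e.g. "TOPIC", "DT_NOUN") to term lists.
        # Decorations: NAME[i] always picks the i-th pool term, NAME=k tries k
        # different pool terms, and a bare NAME is treated as NAME=1.  Tokens
        # that are not placeholders are copied through as literals.  The random
        # picks are seeded, so the same template and pools always produce the
        # same expanded queries.
        rng = random.Random(seed)
        slots = []                                    # one list of alternatives per token
        for token in template.split():
            plus = "+" if token.startswith("+") else ""
            name = token.lstrip("+")
            if name.endswith("]") and "[" in name:    # fixed pick, e.g. TOPIC[0]
                base, idx = name[:-1].split("[")
                slots.append([plus + pools[base][int(idx)]])
            elif "=" in name and name.split("=")[0] in pools:   # e.g. TOPIC=2
                base, k = name.split("=")
                slots.append([plus + t for t in rng.sample(pools[base], int(k))])
            elif name in pools:                       # bare placeholder
                slots.append([plus + rng.choice(pools[name])])
            else:                                     # literal text, e.g. intitle:
                slots.append([token])
        return [" ".join(parts) for parts in itertools.product(*slots)]

    # Hypothetical pools standing in for the real candidate pools:
    pools = {"TOPIC": ["binocular", "binocs"],
             "DT_VERB": ["buying", "choosing"],
             "DT_NOUN": ["guide", "advice"]}
    print(expand_template("+TOPIC=2 +DT_VERB[0] +DT_NOUN[0]", pools))
    # e.g. ['+binocular +buying +guide', '+binocs +buying +guide']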
As already mentioned, our strategy is to generate only a few queries (5-10) from a given template, harvest the results from these, and then move on to another template in a larger sequence. The
system evaluated in Section 3.3 used ten hand-crafted templates shown in Table 3.1. We generated these
templates both by reflecting on our experience in creating the benchmark set and by using the benchmark set to test a large set of alternatives. In general, query templates that start with topic terms followed by doc-type terms seem to work best. At the same time, we slightly favored more doc-type terms and fewer topic terms. We found that, on the doc-type side, mixing parts of speech yields better results, e.g., “choosing guide” works better than “choosing selecting”. This observation led us to create multiple
classes of doc-type terms. Finally, we found that smaller, simpler templates yield better results. Thus,
when ranking templates, we try the simple ones first and move to the more complicated ones if the simple
ones do not yield enough results.
To sum up, query templates represent a generic framework that can be used for IFM’s query
formulation phase. The query templates specified in Table 3.1 can be easily re-used when building an
IFM application for a different doc-type (e.g., recipes). What has to be done in this case is to generate
different candidate terms, which we describe in Section 3.2.4.
3.2.3 Harvesting Results: Doc-Type Screening
Recall that our overall algorithm looks something like the following:
    Harvest := { };
    for each query template T in our template sequence:
        for each query Q generated by T:
            resultSet := submit Q to the search engine;
            Harvest := Harvest UNION filter(resultSet);
            if |Harvest| > goal, we're done; otherwise continue;
As the above pseudo code suggests, we have found that filtering the output using a doc-type
screener yields higher-quality results (this finding is in keeping with [39]).
Doc-type screening is based on term vectors. We combine terms associated with a result—
including its title terms, URL, and “snippet” terms— into a result term vector which we then compare
against a doc-type screening vector. (Note that the inclusion of a fair number of doc-type terms in
the query yields a fair number of doc-type terms in the snippets of good documents.) Our process
for generating this screening vector is described in the next subsection. We compare these vectors by
computing their cosine.
We observed that “doc-type drift” occurred in earlier versions of our system. For example,
queries often took us to price-comparison pages (which, as discussed in Section 3.1.1, are not what we
mean by “buying guides”). To counter such drift, we added a doc-type discrimination vector. This vector
contains words (such as comparison) which are negatively correlated with the results that we desire.
Thus, our total filtering process consists of computing a doc-type screening and discrimination
score using these two vectors, combining them in a linear fashion that gives extra weight to discrimination, and then selecting against a rather high threshold. We take at most two results from each query,
which yields a higher-quality result-set overall. By prioritizing the discriminator and thresholding on the
high side, we are being conservative. However, filtering is done in a context in which many candidates
are being generated. (Again, reminiscent of [39]).
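As a rough illustration of this filtering step, the following Python sketch compares a result's term vector against the screening and discrimination vectors by cosine similarity and combines the two scores. The weight and threshold values are illustrative assumptions only; the actual vectors, weights, and threshold used in BGF are not reproduced here.

    import math
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two sparse term-count vectors (Counters).
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def passes_doc_type_filter(result_terms, screening_vec, discrimination_vec,
                               disc_weight=2.0, threshold=0.3):
        # Screen one result (its title, URL, and snippet terms) against the
        # doc-type screening and discrimination vectors.  Extra weight is given
        # to the discrimination score and the threshold is set on the high
        # side, mirroring the conservative policy described above.  The
        # specific numbers are placeholders.
        result_vec = Counter(result_terms)
        screening_score = cosine(result_vec, screening_vec)
        discrimination_score = cosine(result_vec, discrimination_vec)
        combined = screening_score - disc_weight * discrimination_score
        return combined >= threshold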
3.2.4 Selecting Terms: PMI-IR
We need to select terms for a number of reasons: we need terms for a number of candidate
pools, plus terms for the doc-type screening vector and doc-type discrimination vector. This section
discusses the selection of these terms.
We have not spent much time on selection of terms for the topic pool. All we currently do is
apply simple stemming operations to the category phrase supplied by the user.
To produce doc-type-related terms, we first tried the simple approach: looking up synonyms in
various thesauri (e.g., Wordnet). However, this approach did not work: it suggested many bad terms and
failed to suggest many good ones. So we turned to PMI-IR [77], an unsupervised learning algorithm for
recognizing synonyms. PMI-IR measures the similarity of a pair of words by observing their co-occurrences, exploiting the fact that related words co-occur more often than statistical independence would predict.
In the simplest case, we define the co-occurrence score of a candidate term choice with respect to a problem term problem as:

    score(choice) = hits(problem AND choice) / hits(choice)

To calculate the co-occurrence score for choice, we send a query to Google and ask how many documents contain both terms – problem AND choice – and then we ask how many documents contain choice alone. The ratio of these two numbers is the score for choice. PMI-IR is supposed to work better with a NEAR operator. However, such an operator is not supported by Google, so we used the simpler Boolean AND.
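The computation itself is tiny; the sketch below assumes a helper hit_count(query) that submits a query to a web search engine and returns the estimated number of matching documents (such a helper is not part of the Google API and is only a placeholder here).

    def pmi_ir_score(choice, problem, hit_count):
        # Simple AND-based PMI-IR co-occurrence score: the fraction of pages
        # containing 'choice' that also contain 'problem'.  hit_count(query)
        # is a placeholder for a function returning the engine's estimated
        # number of matching documents.
        joint = hit_count(f"{problem} AND {choice}")
        alone = hit_count(choice)
        return joint / alone if alone else 0.0

    # Illustrative (made-up) counts: if hits("buying guide AND choose") were
    # 2,000,000 and hits("choose") were 50,000,000, the score would be 0.04.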
Our overall term-selection process is complicated and includes a number of manual steps. Here
are the highlights of what we do:
PMI-IR seeds We have hand-crafted two seed vectors for the PMI-IR algorithm. These vectors are
small, containing fewer than five terms each. In the PMI literature, the terms in these vectors are
called problem words. One vector contains seeds for the screening vector (e.g., “buying guide”)
and the other seeds for the discrimination vector (e.g., “comparative”).
PMI-IR database We have implemented a wrapper around the Google API that runs the PMI-IR scoring
algorithm against the results returned by a query. This algorithm takes the above “problem” vectors
as inputs. Each run of this algorithm adds new scores to our PMI-IR database, a disk file in which
we keep scores returned by the PMI-IR algorithm. These databases have grown to over 3,000
terms each. The update algorithm takes a long time to run, so we do not run it very often.
Doc-type vectors We compute the screening and discrimination vectors from the PMI-IR database. Basically, we take the top-scoring 100 terms from each database; however, we do introduce an element of “hand tuning” in the process. These vectors are recomputed only very occasionally, far less often than the PMI-IR database is updated.
Doc-type term pools All of the doc-type pools (noun, verb, and adjective) are maintained by hand.
These are quite small, currently containing 25 terms in total. When we recompute the doc-type
screening vector, we keep an eye out for terms we might add to the doc-type pools, but in practice
we rarely change these pools.
3.3 Evaluation
We ran a simple user study comprising a group of seven evaluators to test both the effectiveness
and the generality of our system. Remember that we tuned our algorithms by testing recall against
a “benchmark” product category (Section 3.2.1). While this technique allowed quick refinements, it
is possible that these refinements work only for the benchmark categories and do not work generally.
Thus, in addition to testing effectiveness, we wanted to test that our system generalized over a range of
categories.
3.3.1 Metrics
In our experiments we used the standard IR metrics precision and recall [42]: precision measures the fraction of retrieved documents that are relevant, while recall measures the fraction of all relevant documents that are retrieved. More formally:

    precision = (r / n) × 100

    recall = (r / R) × 100
In both equations r represents the number of relevant documents retrieved, R represents the
total number of relevant documents, and n the number of documents retrieved. We multiply by a factor
of 100 to obtain a percentage, instead of a value between 0 and 1. In the remainder of the paper we will
stick to this percentage notation for precision.
We can see that precision measures the efficiency of the search, while recall measures its
breadth. Furthermore, we will use the term DCV (document cut-off value) for the number of documents
retrieved. In our BGF experiment we set DCV = 10, which is also known in the literature as precision-at-ten ([email protected]). [email protected] measures the number of relevant documents contained in the top-ten ranked documents returned by a search engine (“10%” means one of those top-ten documents was relevant). [email protected] is used widely for evaluating Web search engines because it corresponds closely to
what a web searcher experiences (ten results on a “results page”). Sometimes we also evaluate precision
at lower DCV values (e.g., DCV = 5). It is typically assumed that recall is not an issue within the
top-ten results; that is, the corpus contains well over ten relevant results, and thus the problem is one of
precision, not recall [52].
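For concreteness, a minimal sketch of the [email protected] computation on a ranked list of relevance judgments (1 = relevant, 0 = not relevant); the example numbers are made up.

    def precision_at_k(relevant_flags, k=10):
        # [email protected]: the percentage of relevant documents among the top k retrieved.
        # relevant_flags must be ordered by rank.
        top = relevant_flags[:k]
        return 100.0 * sum(top) / k

    print(precision_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 0]))   # -> 40.0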
3.3.2 Experimental Setup
We measured the performance of BASE and BGF on ten different product categories. For
each category, a “product-category phrase” was chosen (see Table 3.2).
Nickname    Category Phrase
bin         Binoculars
nb          Notebooks
car         Cars
pc          Pressure cooker
dc          Digital camera
saw         Skilsaws
dvd         Dvd players
vac         Vacuum cleaners
mp3         MP3 players
wm          Washing machines

Table 3.2: BGF product-category phrases along with their abbreviations.
Our baseline system (a.k.a. BASE) was to submit that product-category phrase, augmented by “naive doc-type expansion”, to the Google search engine. “Naive doc-type expansion” means we attached the phrase “buying guide” to the product-category phrase. We added this phrase because the category phrase by itself typically fails to
turn up any buying guides, which makes for an unreasonably poor baseline to compare our own system
against. While the phrase “buying guide” may seem restrictive at first glance (we could have chosen
different ones), anecdotal evidence suggested that it generally returns the most buying guides (compared
to other phrases such as review, comparative review, feature guide, or other similar ones). Based on these
observations we felt that BASE represented a strong baseline, and we think it is important to highlight
this point: our improvements are therefore measured relative to a strong baseline.
Evaluating the results for ten categories is a fair amount of work. However, we wanted to
measure a reasonably large number of categories to test the generality of our approach. As pointed out
earlier, bin was used for tuning our algorithms. Seven of the ten categories – nb, vac, dvd, mp3, pc,
saw, and wm – were suggested by our evaluators after we froze our algorithm. We did this to further
ensure that our study measured the generality of our approach (it might be noted that two categories,
dc and car, were also suggested by our users, but these categories were secondary benchmarks for our
tuning process and thus could be biased).
The output of the two systems was combined randomly and duplicates eliminated. These
combined lists were presented to the evaluators (on a category-by-category basis) for measurement.
3.3.3 Judgment Guidelines
Our general instructions to the evaluators stated:
You are being asked to make two judgments for us. First, you are being asked to judge
whether or not each document is or is not a buying guide. Second, for those documents you
believe are buying guides, you are being asked to judge whether they are “good” or “bad”.
The instructions to the evaluators went on to discuss in more detail what is a buying guide. We
felt it was important that the evaluators judge buying guides in terms of fulfilling the information needs
of the product-brokering process rather than in terms of specific characteristics. Thus, our instructions
included paragraphs such as:
A buying guide is defined in terms of its intent. A buying guide is meant to help people at
a certain point in the buying process. Imagine that you know nothing about digital cameras
but you think you might want one. At the very beginning, you’re less interested in the specific
details of particular products and more interested in learning about the entire category.
When looking at a document, ask yourself: Is this document useful given that I know little
about this category and I’m trying to learn about it?
In selecting evaluators, we tried to obtain diversity of backgrounds. Two were CS researchers,
two were programmers (one in India), two were marketing professionals, and one was an artist. The evaluators did not communicate amongst themselves regarding the evaluation. We computed our precision
numbers on the basis of a simple majority amongst the evaluators. However, because of the open-ended
nature of the judgments being made, we wanted to see consensus among the evaluators. We considered
that a “consensus” had been reached when 2/3 of our evaluators agreed.
3.3.4 Results
The results of the evaluation are given in Table 3.4 and, graphically, in Figure 3.1. We present
the [email protected] numbers for both the simple judgment (“is it a buying guide?”) and the “goodness” judgment (“is it a good buying guide?”).
Category    Consensus (simple BG)    Consensus (good BG)
bin         90%                      76%
car         88%                      85%
dc          81%                      85%
dvd         86%                      82%
mp3         86%                      71%
nb          70%                      81%
pc          100%                     94%
saw         88%                      75%
vac         83%                      88%
wm          86%                      68%
All         85%                      80%

Table 3.3: Per-topic consensus expressed as percentages for “simple” (whether or not the document is a buying guide) and “good” buying guide (whether or not the document is a good buying guide) judgments.
In the left column of Table 3.4 we show the categories that were evaluated by our judges (see also Table 3.2). Columns 2 and 3 show [email protected] for both BASE and BGF for the “simple buying guide” judgment (whether a document is a buying guide or not). Columns 4 and 5 then show [email protected] for the “good buying guide” judgment. The last two columns show [email protected] numbers for both BASE and BGF. The last row contains the average precision scores for each column.
The aggregated consensus numbers (shown in Table 3.3) do not suggest any anomalies or
patterns: For the simple “buying guide” judgment, consensus was reached on 85%; for “good buying
guide”, on 80%. Overall, we were satisfied with the level of consensus reached amongst a diversified
collection of evaluators.
In half of the product categories, our approach at least doubles the number of buying guides in
the top-ten. In another category, our approach improves performance (on the simple test) by over 30%.
In another two categories, performance stays the same, and in two categories performance degrades
slightly. Overall, this study shows both effectiveness (especially given that our baseline is more than a
simple search engine) and generality (especially given that some of our strongest improvements are for
categories supplied by our users).
Our hypothesis for why BGF (and IFM in general) performs better than BASE is that it explores multiple strategies in the form of carefully articulated queries.
            Simple BG ([email protected])       Good BG ([email protected])         Good BG ([email protected])
Category    BASE      BGF            BASE      BGF            BASE      BGF
bin         60%       80%            40%       40%            60%       40%
car         20%       60%            0%        10%            0%        20%
dc          40%       30%            20%       0%             0%        0%
dvd         20%       40%            20%       30%            20%       40%
mp3         40%       40%            10%       10%            0%        20%
nb          40%       40%            10%       0%             20%       0%
pc          10%       20%            10%       20%            20%       40%
saw         0%        60%            0%        20%            0%        20%
vac         20%       70%            10%       60%            20%       60%
wm          80%       70%            30%       50%            40%       80%
All         33%       51%            15%       24%            18%       32%

Table 3.4: Experimental BGF results showing precision-at-10 ([email protected]) and precision-at-5 ([email protected]) – the fraction of retrieved documents that are relevant at document cut-off values of 10 and 5 – for ten different product categories, along with average precision scores. We distinguish between a “simple” buying guide judgment (“Is the document a buying guide?”) and a “good” buying guide judgment (“Is the document a good buying guide?”).
Each of these queries may return buying guides that are buried deep within the search engine’s index. A single query may miss some good buying guides, or return none at all, but other strategies may well surface good ones. BGF then has the opportunity to benefit from these good strategies by keeping the “winners” and eliminating the “losers”. BGF therefore uses a more robust approach than the single query a web user typically issues: if that single query misses and returns few or no buying guides, BGF still has many other strategies left to explore. As an example, consider the saw category: the simple query failed and produced no buying guides, but BGF was able to return many.
Because of the limited resources at our disposal, the relatively small number of evaluators (seven), and the total number of judgments (on the order of 1,400), the results may not be statistically significant. However, this does not necessarily mean that there is no difference between the BASE and BGF methods, merely that the test was unable to detect one. Keen [50] makes the valuable point that differences that are not statistically significant can still be important if they occur repeatedly in many different contexts.
Figure 3.1: Experimental BGF results comparing the precision-at-10 ([email protected]) – the fraction of retrieved documents that are relevant at a document cut-off value of 10 – of the BASE algorithm and BGF.
Since we had a total of ten different product categories sampled from a broad range, suggested by our users (most of them after we froze our algorithm), we think that the differences between the two methods are significant enough to be considered important.
3.4 Related Work to BGF
To our knowledge, there has been no work done related to doc-type classification of buying
guides. Doc-type and also genre classification itself are areas of active research and quite different
from topic classification. Finn et al. [29] identify genre as an important factor in retrieving useful
documents. They investigate the performance of three different techniques and are primarily focused on
domain transfer. Karlgren [48] also performs experiments investigating various word-based and text-based statistics for the purpose of improving retrieval results. In other work, Karlgren et al. [49] propose a genre scheme for web pages and outline a search interface prototype that incorporates genre and content. Genre classification is related to doc-type classification: whereas doc-type classification tries to classify documents by their intent, genre classification tries to classify documents by their style. It would be interesting to investigate whether we can apply our doc-type classification method
to genre as well.
Turney [77] describes the PMI-IR algorithm, how it compares to LSA, and shows experimental
results on its performance on the TOEFL test. In our approach we apply PMI-IR to find related words
for doc-type or topic information. So far the PMI-IR algorithm has produced good results. The PMI-IR paper also discusses a somewhat better scoring function, which we could not use at the time because it requires a NEAR (proximity) operator, and unfortunately Google does not yet support such a proximity operator.
Liu et al. [9] are also trying to use search engines to mine topic-specific knowledge on the Web.
Their goal is to help people systematically learn in-depth knowledge of a topic on the Web. The difference from our approach is that they propose techniques to first identify sub-topics or salient concepts of a topic, and then find and organize the informative pages (like a book), instead of focusing on genre. It might be interesting to first use our technique to harvest a large set of buying guides
for a given topic and then use their technique to organize this collection like a big buying guide book.
Davison et al. [21] analyze search engine traffic by focusing on the queries-to-results graph
generated by a search engine. Mining that graph can show interesting relations. For example, related
URLs can be found, or, for a given URL, the list of search queries that yielded that URL can be retrieved. It might be interesting to use the proposed technique to enhance our query candidate pool: for example, if we know the URL of a good buying guide, what queries did people use to find it? We
could then analyze these queries to gain insights on how to further improve our techniques for topic and
genre expansion.
Giles et al. [31] describe CiteSeer, an autonomous citation indexing system for research papers.
It uses crawling as its primary source to discover and harvest new information. They also use web search
engines and heuristics to locate papers by using queries for documents that contain certain words (e.g.,
“publications”, “papers”). The obtained search results are then used as a seed list for their crawler.
Although they realized the potential of search engines as a source for discovering new information, their
queries are manually composed and fixed. They do not use the techniques described in this paper (e.g.,
query sequences using genre expansion) to avoid crawling. Instead they use the search engine only as
a shortcut for some potential leads that need to be explored further by a crawler. We are going one
step further since we are eliminating the crawling step completely. Our assumption is that everything
useful is already crawled somewhere by a web search engine. Instead we want to leverage crawling,
preprocessing, and indexing from web search engines and let them do the necessary ground work that
enables us to find desired information faster with less effort.
The Movie Review Query Engine [63] helps to find movie reviews quickly. It does this by
managing a database of known sites, where reviews are located, that are downloaded regularly to keep it
up to date. Although they also focus on a specific document type – movie reviews – there are many differences compared to our work. They use manually edited lists of known sites to locate new information, instead of genre expansion techniques combined with query sequences that are run against search engines to collect useful genre-specific information. They also do not perform genre screening, since they can rely on the fact that the sites they know contain reviews.
The idea itself of using query expansion techniques to improve precision is not new. Mitra
et al. [62] show that adding words to queries via blind feedback, without any user input, can improve
precision of such queries. We are also using expansion techniques to improve precision, but the methods
we use are quite different. Where query expansion in the user scenario is typically a query refinement
process combined with some form of relevance feedback (or not) to eventually find the desired piece of
information, we are constructing sequences of queries up front based on query templates. Our process is
fully automatic and the user is not involved in providing relevance feedback. Also, we do not download documents for feedback: our doc-type screener uses “cheap” meta-data that already provides strong
hints to perform doc-type classification. Another interesting direction related to query refinement and
expansion techniques based on anchor text of a document collection is discussed in [54], and it would
be worthwhile to investigate in future work how this could be integrated with IFM’s query formulation
algorithm. Our application represents an interesting way of using query expansion and therefore may
stimulate new research on query expansion techniques based on query templates and adaptive query
sequencing.
3.5 Summary and Conclusions for BGF
This chapter presented BGF – a web carnivore for finding buying guides on the Web. BGF
is an instance of IFM, and illustrates its use and its benefits as a cost-effective solution for building specialized search engines by document type. Finding buying guides is an important part of consumer e-commerce, and it’s a problem ill served by existing technology. Our evaluation indicated that our approach significantly improves the ability of consumers to find buying guides using simple product-category phrases.
As mentioned earlier, finding buying guides is also an instance of the larger problem of specialized search by document type. We will show in the next chapter that our ideas will translate well to
the more general context. In particular, we believe that our contributions in query templates, doc-type
expansion, and our unsupervised approach of generating doc-type-related terms will all be applicable to
other document types as well. Combined with the work described in Chapter 5 we can build an IFM
application that requires relatively little expertise to develop, returns highly relevant results according to
the specialized information need, and can be easily adjusted to work with new document types.
While BGF is largely automated, a few manual steps still remain. It therefore may not be easy to get the doc-type screener working with new document types. This motivated our work
described in Chapter 5 to build a fully automated doc-type classifier that requires relatively little training
effort, and works well for a variety of different document types. Experiments in Chapter 5 will show that
we were able to achieve this while still maintaining high precision.
Chapter 4
Extending the Google Web Service API for IFM applications
4.1 Overview and Motivation
In the previous chapters we described IFM and BGF, an instance of IFM. IFM type applications
obtain their result data from existing web search engines rather than directly crawling and indexing data
on the Web. This technique makes it economical to build very powerful, comprehensive search engines
for very small, specialized purposes.
Historically, such IFM applications have been built by “page scraping” (simulating an HTML
form submission and parsing the resulting HTML output). While effective, this approach is tedious,
error-prone, and lacks robustness. A web services API – such as the Google API [36] – remedies the
deficiencies of page scraping.
What would be an “ideal” search engine API for IFM applications? In this chapter, we explore
this design question in depth. We describe the structure of a large class of IFM applications and their
information retrieval requirements. We propose a declarative approach – a “projection operation” –
that better suits the needs of these IFM applications than do the current “snippet-based” operations described in Section 3.2.3. We have implemented this projection operation on both the Google and Nutch [64] search engines (Nutch is open-source software that implements a web search engine). We
describe an experiment on these implementations that shows that, for a particular IFM application –
our “Buying Guide Finder” – precision when using the new API improves from 64.28% to 75.71% on
Google, and improves from 34.28% to 52.85% on Nutch.
Recently, a number of pure IFM type applications have been reported in the literature [53],
[39]. We believe this is the start of a large and important trend in the area of specialized web information
systems. We believe the tide is turning in favor of IFM applications for two reasons:
• The corpora of search engines have become so large (in document count) and complete (in document types) that they include high-quality documents meeting almost all information needs. This
claim is speculative, of course, and in separate work we are doing comparative studies with focused crawlers to support it. This was not true in the past, which forced people toward focused crawlers.
• The Google API [36] has enabled rapid development of IFM applications. In the past, one had to
write page-scraping software that was unreliable and required constant maintenance. The Google
API significantly changes the cost of writing and maintaining IFM applications.
While the Google API is extremely valuable for writing IFM applications, its current design is focused more on human-oriented than machine-oriented inputs and outputs.
The next section of this chapter discusses the strengths and the weaknesses of the Google API
for writing a large class of IFM applications. We then explore the space of alternative APIs for such
IFM applications and pick a particular point in that space. We describe an implementation of that design
point on the Nutch search engine and present some quality results suggesting that our design is indeed
an improvement. Section 4.3 presents and discusses our experimental results, and Section 4.4 presents
some concluding remarks.
We provide the first in-depth analysis of a web services API specialized for the needs of IFM
applications. This analysis focuses on the strengths and weaknesses (to IFM applications) of Google’s
“snippets”, but looks more broadly as well. On the basis of this analysis, we propose what we believe
to be a widely applicable web search engine API for IFM. This design work is focused on the central
method that takes a query and returns results.
We also measure how it improves the performance of a particular IFM application. This measurement suggests that such API improvements are significant for any search engine, but particularly
significant for search engines that have limited coverage and/or immature ranking. We further describe
implementation issues associated with the proposed IFM API for search engines (e.g., result compression).
We also provide the first published look at the Nutch search engine: a brief description of Nutch, a comparison (in the context of the experiment described above) against Google, and comments on its suitability as a research platform.
4.2 A Declarative Approach: The Projection Operation
The Google API [36] is a web services API for accessing the Google search engine. This API
has three methods:
• doSpellingSuggestion(): provides access to Google’s spelling checker.
• doGetCachedPage(): retrieves a copy of a given URL from Google’s cache (if it’s there).
• doGoogleSearch(): executes a query and returns the results. This is the workhorse method of
the API and the focus of this chapter. This method takes as input the query itself plus some control
information, which we shall ignore (e.g., whether or not family filtering should be applied). This
method produces as output a list of results, with the following meta-data items for each result: the
URL, the page title, the page category, and the snippet text.
As mentioned in Section 3.2.3, summary-based IFM applications rely on the meta-data items
– and in particular the snippet text – to perform their post-processing. This approach is vital to building
lower-latency IFM applications. However, it puts the IFM application at the mercy of the search engine’s
snippet-generation algorithm.
Snippet text is for human consumption. It typically contains one or two sentence fragments
surrounding terms from the input query. For example, for the query “binocular buying guide,” the snippet
text of one result was:
<font size=-1>You are here: Home Page ... Sports &
Fitness ... Sports and Fitness <b>Buying</b>Guides<br>
... <b>Binocular</b> <b>Buying</b><b>Guide</b>.
<b>...</b> <b>Binocular</b><b>Buying</b><b>Guide</b>.
What Do the Numbers Mean?<b>...</b></font>
For some IFM applications (e.g., question-answering systems such as [56]), human-oriented
fragments are exactly what is useful. For others, however, they are suboptimal:
Missing terms. Snippets are focused around terms from the input query. Often there are other terms of
interest to the IFM application, but including such terms in the query over-specifies the query and
degrades overall performance.
Irrelevant terms. Snippets contain irrelevant words (including stop words) that make the snippet more
understandable for humans but are noise to many IFM applications. Such terms consume bandwidth that could otherwise be used for more useful terms. We define “irrelevant” to be not of
use for a particular application, since what seems irrelevant for one application may be crucial for
another.
Detagging. A minor point: Google snippets are rendered in HTML, which requires extraneous parsing.
In the following subsection we consider alternatives to snippet text that might be better for
some IFM applications. We select one of these alternatives – a “projection” summary – and explore it in
more detail. This proposed projection operation is meant to supplement rather than replace the existing
snippet operation: snippets are also useful to many IFM applications, typically where projections are not
useful.
4.2.1 Supplementing the Snippet Operator
We considered the following alternatives to the snippet meta-data:
Send page. In this approach, the search-engine returns the entire page to the IFM application, perhaps
parsed, perhaps not.
Take code. In this approach, the IFM application sends page-processing code to the search engine, which
returns the output of that code.
Send indexing terms. In this approach, the search-engine returns a fixed term vector of indexing terms
([81]) for each page.
Send a projection. In this approach, the IFM application sends a list of “terms of interest” to the search engine, which returns per-page counts of those terms for each result page.
Send a score. In this approach, the search-engine returns a score for result pages based on a term-vector
provided by the IFM application. This is similar to the projection approach above, except that the
server returns a single, numeric score for each result page rather than an entire term vector.
The focus of this section is on the idea of sending a “projection,” which we believe is particularly well suited to filtering metasearch engines. We discuss projections in detail in the following
subsection, but first let’s consider the other alternatives briefly.
Sending entire pages back to the IFM application would maximize the data available for post-processing at the IFM application. However, such an operation would consume significant bandwidth. At
the very least, this bandwidth would be a burden on the search engine. Even to many IFM applications,
obtaining the entire page would introduce significant latency without much gain in output-quality. Thus,
we do not see this operation serving the needs of high-throughput, summary-based IFM applications.
However, it may still represent a viable alternative if a fully customizable feature set is desired. Also, for content-based metasearch engines, and for training scenarios of some summary-based metasearch engines (e.g., [32]), it is indeed valuable to obtain the entire page. In these scenarios, the search engine would be relieved of this burden if the IFM application went directly to the Web to fetch the page content.
However, one of the problems with this approach is that the web page might have changed in the meantime. Going to the web page directly would then result in getting a different version of the page (which
is probably not desirable). For this reason the search engine should probably provide a change detection hash to more easily detect these cases. It will therefore be worthwhile to investigate the send page
approach in more depth in future work.
Sending code to the search engine (imperative approach) is the opposite extreme to sending
the entire page back to the IFM application. Allowing the IFM application to send its post-processing
code to the search engine gives the IFM application access to entire pages while avoiding excessive
data transfer. However, this approach did not seem practical from a scalability or security perspective.
Perhaps this approach would be reasonable using a restricted page-processing language that provided the
search engine with hard limits on the per-page computation effort. However, it does not seem reasonable
to allow the IFM application to send arbitrary code to a search engine. One possible way to do this
would be to have a separate feature server, which would allow executing limited page processing code
within a “sandbox style” container. With this approach we let the search engine do what it can do best:
Retrieving result locators for a given query. An IFM application would then take these locators and
send them (along with some executable code) to such a feature server, which would then return a stream
of desirable features that can be used by a classifier (e.g., our doc-type classifier described in the next
chapter). There is definitely potential for this approach, which is why we will probably add it to our
future roadmap.
A third alternative would be to send a fixed term-vector of “indexing terms” back to the IFM
application. As suggested in [76], such an interface would be useful for certain clustering and classification IFM applications (see also [82]). However, this approach gives the IFM application no control over
the terms returned and thus is not as well suited for filtering IFM applications (because the IFM application’s choice of filtering terms may not coincide with the search-engine’s choice of indexing terms). A
search operation returning fixed term-vectors might also be a fine addition to a Google-like API, but our
focus in this section is on returning a projection.
Finally, instead of sending the entire projection, we might try to optimize further by sending
a score based on that projection. While this approach is initially attractive in that it reduces the size of
results, this approach also reduces the flexibility offered to the IFM application (which may have its own
scoring algorithm). Also, after considering the fairly reasonable compression techniques described in the
next subsection, the bandwidth savings may not be significant. Thus, we consider the scoring approach
to be dominated by the pure projection approach.
4.2.2 Introducing the Projection Operation
Similar to doGoogleSearch(), a projection operation takes a query and returns meta-data
for a set of results. In this case, however, the operation also takes a set of “terms of interest” and returns
as part of the meta-data a count of those terms in each result. We call this a “projection” operation
because it projects results onto a fixed subspace of the term universe.
Such an operation is ideal for metasearch engines, both traditional ones and filtering, iterative
ones. An important part of traditional metasearch engines is ranking the results that are returned. With
a projection operation, the metasearch engine can send terms relevant to this ranking (e.g., query terms
and expansion thereof) and receive strong evidence to be used during ranking. Filtering metasearch
engines typically have a set of filtering terms used to make filtering decisions. By sending those terms to
the projection operation, these engines can overcome the “missing term” problem that plagues snippet-based processing.
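The following sketch shows, from the client's point of view, how a filtering metasearch engine might use such an operation. The function projection_search() and the result fields are hypothetical; they stand in for whatever concrete method a search engine would expose.

    TERMS_OF_INTEREST = ["buying", "guide", "choose", "comparison", "price"]

    def filter_with_projection(query, projection_search, keep_result):
        # projection_search(query, terms) is a hypothetical API call: it runs
        # the query as usual and returns, for each result, the per-page counts
        # of exactly the terms we asked about.  keep_result(term_counts)
        # encapsulates the engine's own filtering logic.
        kept = []
        for result in projection_search(query, TERMS_OF_INTEREST):
            # result.counts[i] is the count of TERMS_OF_INTEREST[i] on that page
            term_counts = dict(zip(TERMS_OF_INTEREST, result.counts))
            if keep_result(term_counts):
                kept.append(result.url)
        return kept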
The STARTS proposal [37] is very close to our proposal, so a careful comparison is warranted:
• In our proposal, the search engine takes a query string and a set of “terms of interest” as input.
The search engine executes the query in its usual fashion; then projects the results onto the IFM
application’s terms of interest.
• In the STARTS proposal, the search engine takes a “filtering expression” and a “ranking expression.” The filter expression is a Boolean expression selecting pages of interest. The ranking
expression is, to first order, like our “terms of interest.” In the STARTS proposal, the search engine is supposed to use both the filter and ranking expressions to select (and rank) results to return,
and then projects the results onto the union of query terms (from both the filtering and ranking
expressions).
We believe that separating the query from the projection (as is done in SQL, for example)
is simpler and more flexible. Further, we expect that many web search engines would not be able to
implement the “filter/rank” semantics very effectively. Thus, we believe our proposal is both more useful on the client side and more practical to implement on the server side. (It should be noted that the STARTS proposal was designed for federating Digital-Library services, including services such as LexisNexis, as well as web search engines. Our own proposal was not designed for this broader context and may not be well suited to it.)
We have considered two implementation aspects of our proposed projection operation: implementation on the server side, and compression of results. These factors are of particular concern when
scaling the implementation. Our analysis leaves us optimistic that the proposal will scale well.
On the server side, the projection process admits a particularly simple implementation. Associate an index (a small integer) with each input term, and build a table mapping from these terms to their
indexes. These input terms come in the form of an array from the client; as we will see, it is advantageous to assign each term the index of its position in this array. Allocate an array of counters for these terms, indexed by the aforementioned indices; then scan the result page, looking up each term in the table and incrementing the associated counter when found. Since these lookups will typically miss, a Bloom filter can be put in front of the table to eliminate most of them. (The main purpose of a Bloom filter is to provide a space-efficient data structure for set membership. Indeed, to maximize space efficiency, correctness is sacrificed: if a given key is not in the set, then a Bloom filter may give the wrong answer – a false positive – but the probability of such a wrong answer can be made small.)
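A minimal sketch of that server-side loop, in Python for readability (a production implementation would live inside the engine's result-processing path). The optional maybe_of_interest argument stands in for the Bloom-filter membership test; all names are assumptions.

    def project_page(page_terms, terms_of_interest, maybe_of_interest=None):
        # A term's position in terms_of_interest doubles as its index in the
        # returned count vector, exactly as described above.
        index_of = {term: i for i, term in enumerate(terms_of_interest)}
        counts = [0] * len(terms_of_interest)
        for term in page_terms:
            # Cheap Bloom-filter-style test first: most terms are not of
            # interest, so most exact lookups can be skipped.
            if maybe_of_interest is not None and not maybe_of_interest(term):
                continue
            i = index_of.get(term)
            if i is not None:
                counts[i] += 1
        return counts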
To compress the results, we once again assign each input term to the index in which it appears
in the input array. We return the counts in a dense sequence according to this same indexing scheme.
This dense sequence is encoded using a simple prefix code. As a point of reference, in the experiments
described in the next section, our input vectors were 355 terms in length, a total of around 2,500 characters. The above compression scheme resulted in an average of 2.7 bits per count (including a 2x overhead for base64 encoding), or an average of 120 bytes per result vector (this compares favorably to the typical 150 bytes per Google snippet). We believe that a more aggressive compression scheme (e.g., using
run-lengths for long strings of zeros) could improve this compression further.
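As an illustration of the kind of encoding involved, the sketch below packs a count vector with an Elias-gamma-style prefix code and base64. The choice of gamma coding (and of encoding each count plus one, so that zeros cost a single bit) is an assumption; the text does not specify which prefix code was actually used.

    import base64

    def elias_gamma_bits(n):
        # Elias gamma code for a positive integer n, as a '0'/'1' string.
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    def encode_counts(counts):
        # Encode each count c >= 0 as gamma(c + 1), concatenate the bits,
        # pad to whole bytes, and base64 the result.  The decoder is assumed
        # to know how many counts to read (the length of the input term array).
        bits = "".join(elias_gamma_bits(c + 1) for c in counts)
        bits += "0" * (-len(bits) % 8)
        raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
        return base64.b64encode(raw).decode("ascii")

    # A mostly-zero 355-term count vector stays small after encoding:
    print(encode_counts([0] * 350 + [3, 1, 0, 2, 5]))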
For scaling reasons, we believe that computation should be pushed onto the IFM application
whenever possible. In this regard, one might consider requiring that the client pre-hash the input terms,
or perhaps use Rabin fingerprints in place of term text [12]. However, we have not done any experiments
to indicate what benefits such optimizations might yield.
4.2.3 Other Considerations
The primary goal presented in this section is to propose the inclusion of a projection operator
in the web services API offered by web search engines. However, our experience with IFM applications
to date suggests the following additional points:
• As suggested above, projection-search supplements the snippet-search rather than replaces it. Similarly, it may be helpful to include an indexing-term-search as well.
• It would be helpful to include, in the meta-data for search results, document-length information in both bytes and terms.
• It also would be helpful to obtain document counts for terms as well as term frequencies within
documents. The STARTS proposal already suggests returning this information with the results
meta-data as well. However, we believe it should be made available via a separate method because IFM applications can typically retrieve the information once and then use it across multiple
searches done at the search engine.
4.3 Experimental Setup and Results
To measure the benefit of – and to otherwise gain experience with – our proposed projection
operator, we implemented the Projection API: an extension of the Google API that includes the proposed
projection operator described in the previous section. We actually built two implementations of this API:
one on top of Google itself, and another on the Nutch search engine [64].
We modified an existing IFM application – our buying guide finder – to utilize the Projection
API. For seven product categories, we compared the performance of BGF implemented on the standard
Google API against BGF implemented using the Projection API. On both search engines (Google and
Nutch), for all product categories, BGF using the Projection API outperformed BGF on the existing
Google API. The first subsection provides background on our two implementations of the Projection
API. The final subsection describes our measurements in more detail.
4.3.1 Generating Result Vectors with BGF
Referring back to our main BGF algorithm (see Section 3.2.3), after generating a query and
submitting it to the search engine, we get back a list of “candidate results.” Before being returned to the
user, these candidates are filtered to ensure that they are truly buying guides. To perform the filtering, we
create a term vector for each of the candidates; we call this vector the result vector for the candidate.
For each candidate result, the original Google API returns meta-data items including title text, URL, and snippet text. When running against this API, we create result vectors by extracting terms
from all of these meta-data items. Our Projection API, on the other hand, directly returns a result vector
for each candidate, so no translation is needed. However, in this case, the contents of the result vector
are controlled by the inputs we give the Projection API; this issue is discussed further in Section 4.2.
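A simplified sketch of how a result vector can be built from these meta-data items when running against the standard API; the tokenization and HTML stripping shown here are simplifications, not the exact BGF code.

    import re
    from collections import Counter

    def result_vector_from_metadata(title, url, snippet):
        # Combine title, URL, and snippet terms into one sparse term-count
        # vector.  The URL is split on punctuation so that its words contribute
        # terms too; snippet HTML tags are stripped first.
        text = " ".join([title, re.sub(r"[/.\-_]", " ", url), snippet])
        text = re.sub(r"<[^>]+>", " ", text)
        terms = re.findall(r"[a-z0-9]+", text.lower())
        return Counter(terms)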
4.3.2 Filtering Results in BGF
Once we have a result vector for a candidate, we then filter the candidate by comparing its
result vector against both a doc-type and discrimination vector. The doc-type vector contains terms that
are positively correlated with the genre in question (e.g., “buying guide”). The discrimination vector
contains terms that are negatively correlated with the doc-type (e.g., “comparison”).
To pass through the filter, a result vector has to score high in the comparison with the doc-type
vector and score low in the comparison with the discrimination vector. BGF generates a doc-type score
by taking the dot product of the result and doc-type vectors and similarly a discrimination score. To
filter results, it divides the doc-type score by the discrimination score and compares against a threshold.
We felt it best to reduce false positives at the expense of increasing false negatives (which translates into
favoring higher-quality results at the expense of time to converge). Thus, our threshold penalized high
discrimination scores more than it rewarded high doc-type scores.
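A minimal sketch of this ratio-based decision, analogous to the screening step described in Chapter 3 but using dot products and a score ratio; the threshold value is illustrative only.

    from collections import Counter

    def dot(a, b):
        # Dot product of two sparse term-count vectors (Counters).
        return sum(a[t] * b[t] for t in a if t in b)

    def passes_filter(result_vec, doc_type_vec, discrimination_vec, threshold=4.0):
        # Keep a candidate only if its doc-type score sufficiently outweighs
        # its discrimination score; the small epsilon avoids division by zero.
        doc_type_score = dot(result_vec, doc_type_vec)
        discrimination_score = dot(result_vec, discrimination_vec)
        return doc_type_score / (discrimination_score + 1e-9) > threshold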
4.3.3 API Implementations
To take our measurements, we needed a standard Google and Projection API implementation
for both Google and Nutch.
Of course, Google already supports the Google API. We provided a Projection API for Google
via a proxy. Recall that the Projection API adds a new operation, which we happened to call doNutchSearch(),
that takes a list of terms as input and returns a term vector on those terms for each result in the output. Our
proxy implemented this operation by sending query requests to Google (via doGoogleSearch()),
fetching the full pages of those results via doGetCachedPage(), and computing the needed term
vectors from this raw page information.
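A rough sketch of such a proxy; google_search(query) and get_cached_page(url) are hypothetical wrappers around doGoogleSearch() and doGetCachedPage() (their exact shapes are assumptions), and the tokenization is simplified.

    import re

    def do_projection_search(query, terms_of_interest, google_search, get_cached_page):
        # Answer a projection-style request using only the standard operations:
        # run the query, fetch each result's cached page, and count the
        # client's terms of interest on that page.
        index_of = {t: i for i, t in enumerate(terms_of_interest)}
        projected = []
        for result in google_search(query):
            page = get_cached_page(result.url)
            counts = [0] * len(terms_of_interest)
            for term in re.findall(r"[a-z0-9]+", page.lower()):
                i = index_of.get(term)
                if i is not None:
                    counts[i] += 1
            projected.append((result.url, counts))
        return projected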
In the case of Nutch, we needed to write an implementation of both the Google and Projection
APIs. This turned out to be relatively easy. However, as this is the first experiment run on Nutch by
people outside of the core team, it might be appropriate to briefly describe Nutch.
The Nutch web search engine includes its own crawler, uses Lucene 3 for indexing, and provides “transparent ranking” during search: an “explain” link next to each search result, when clicked, describes the factors that determined its ranking. This software is about two years old, created by a team of
four people working part time.
The Nutch system consists of a number of programs for creating and updating three on-disk
data structures: a meta-data database, a page collection, and an index on the page collection. The metadata database contains meta-data about all pages seen by the Nutch installation, for example, the date of
last download (if any), the content MD5 4 at last download, retry information, and link information. The
page collection contains both the raw, downloaded data plus a parsed version of that data. The index is a
Lucene index of both the page text and a collection of fields generated from the meta-data; these fields include anchor text and also link-based ranking scores.
3 http://jakarta.apache.org/lucene/
4 The MD5 algorithm takes as input a message of arbitrary length and produces as output a 128-bit “fingerprint” or “message
digest” of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest,
or to produce any message having a given pre-specified target message digest. The MD5 algorithm is intended for digital signature
applications, where a large file must be “compressed” in a secure manner before being encrypted with a private (secret) key under
a public-key cryptosystem such as RSA.
The central Nutch programs include generate, which generates a “fetch list” consisting of
highly-ranked pages that are due for a download, either because they have not been downloaded before or
they are due for a refresh. The fetch command downloads the pages in the fetch list. The updatedb
command updates all the meta-data databases based on this download; analyze generates static page
rankings (e.g., based on inlink counts); and index updates the index.
A Nutch installation is created by seeding an initial fetch list, then repeating a fetch, update, analyze, and generate cycle. The operator can insert indexing between the analyze and generate commands
at any time, but often multiple cycles are performed between indexing.
While Nutch is primarily a software system, the Nutch team has also built its own installation
of Nutch. This installation is currently private, although they did provide us access to it. When we used
this installation, it contained a 100M-page collection gathered in the Summer of 2003. This collection
was seeded using DMOZ 5 . This collection was one of the first the Nutch team had built, and was an
order of magnitude larger than their previous test collection. Given the early stage of development when
we were working with Nutch we were very pleased with its search quality.
The Nutch search engine is small and simple, and thus was very easy to modify for our purposes. Nutch is not yet very modular, so we really had to rip into it in pieces. While our resulting code is
not very sharable, this process did not hinder us in any way. The Nutch code is already available to the research community and our understanding is that the Nutch team and the Internet Archive are discussing
the feasibility of making instances of Nutch available to other researchers. If this occurs, we believe
it will be very valuable to the research community, as building collections of this size is operationally
prohibitive to most research teams.
5 http://www.dmoz.org
4.3.4 Modifications to BGF
The baseline BGF algorithm (as described in Section 3.2.3) was designed to run against the
Google API. Only a few slight modifications were needed to make it work with the Projection API.
Instead of calling the old doGoogleSearch(), the modified BGF calls the new
doNutchSearch(). As described in Section 4.2, this new operation takes a list of “terms of interest”, which it uses to formulate term vectors for results. For this argument, we passed the union of
terms from both the doc-type and discrimination vectors (355 terms total). When doNutchSearch()
returns, the modified BGF directly uses the term vectors that are returned as result vectors (rather than
computing a result vector from the title, URL, and snippet text as is done by the unmodified BGF).
In the original BGF described in Chapter 3, the filter step was implemented by comparing a
linear combination of the doc-type and discrimination scores against a threshold without performing any
normalization. However, the counts in the result vectors computed for the Google API are lower than the
counts for the Projection API. To normalize the comparison across the two APIs, we changed our filtering
algorithm to compare the ratio of the two scores against a threshold (as described in Section 4.3.2).
4.3.5 Results
For our evaluation we used the same metrics that were used by BGF and described in Section
3.3.1. We measured precision-at-ten ([email protected]) of four systems for seven product categories. The results
are summarized in Figure 4.1.
The four systems measured were our BGF running against:
• Standard API using Google
• Projection API using Google
• Standard API using the Nutch search engine
• Projection API using the Nutch search engine
The seven product categories were binoculars, notebook (computers), mp3 players, telescopes,
cars, pressure cookers, and HDTV.
Figure 4.1: Experimental results comparing the precision-at-10 ([email protected]) – the fraction of retrieved documents that are relevant among the top ten results – of BGF using the standard Google API vs. the Projection API, on both the Google and Nutch search engines.
Our measurements are presented in Figure 4.1. We can see that in almost all cases the Projection API outperformed the Google API. The precision improved from 64.28% to 75.71% for the Google
search engine, and improved from 34.28% to 52.85% for the Nutch search engine.
Because of resource constraints we were not able to conduct a full-fledged and editorially
staffed relevancy evaluation for the API comparison. The results can therefore not be considered statistically significant. However, as we already pointed out earlier, this does not necessarily mean that there
is no difference between BGF using the standard API and BGF using the projection API, merely that
the test was unable to detect one. We again cite Keen [50], who makes the valuable point that differences that are not statistically significant can still be important if they occur repeatedly in many different contexts. Since we had a total of seven different product categories sampled from a broad range, and saw a similar improvement on two different search engines, we think that the differences between the two methods
are significant enough to be considered important.
We were especially encouraged by this result because our original BGF was optimized for
the Google API. We are often asked why we did not use more traditional classification techniques (e.g.,
SVMs [43]) for the filtering part of BGF. The answer is that we started down that path but were hampered
by the relatively small number of terms generated by the snippet data (see Section 4.2). The Projection
API returns a richer set of terms with each result, which we believe would support much stronger approaches to filtering.
In addition to demonstrating improved performance for the new API, these results also provide
an interesting comparison of Nutch and Google. Given the maturity of Nutch versus Google, it was
not surprising to see Google outperform Nutch across the board. However, we were surprised to see a
correlation between the performance of Nutch and Google. Where Google did well, Nutch did reasonably well too, and where Google had trouble, so did Nutch. We have formulated a number of hypotheses
for this correlation:
• Our BGF is better on some topics than others, this bias being independent of the search engine
being used.
• The crawling policies of Google and Nutch are similar enough that they both tend to cover and
miss topics in similar proportions.
• The coverage of the underlying Web itself is uneven, with better coverage for topics like notebook
computers and worse coverage for pressure cookers.
We suspect there is some truth in each of these hypotheses.
4.4 Summary and Conclusions for the Proposed IFM Web Services API
In this chapter we considered the design of a web services API for web search engines in the
specific context of web IFM applications.
Our primary conclusion is that search engines should support a projection operation to supplement their snippet operation. Like the snippet operation, the projection operation takes a query and
returns meta-data for a set of results matching the query. However, where the meta-data for the snippet
operation includes human-consumable snippets based on terms from the query, the meta-data for the
projection operation includes a machine-consumable term vector based on “terms of interest” provided
by the IFM application.
We reviewed the extensive literature on IFM applications and found that the proposed projection operation would be well suited for traditional metasearch engines – the largest class of IFM application so far – where the projection can be used to merge results. In addition, the projection operation is
particularly well suited to the emerging family of “iterative, filtering metasearch engines.”
To further support our proposal, we built an implementation of the projection operation on
two search engines, Nutch and Google. We modified BGF – an iterative, filtering metasearch engine for
buying guides originally written for a snippet operation – to run on the projection operation. On both
search engines, for all seven test categories, the modified projection-based BGF was never worse than
the original snippet-based one: for our particular IFM application – the Buying Guide Finder – precision when using the new API improved from 64.28% to 75.71% on Google, and from 34.28% to 52.85% on Nutch. These results were particularly encouraging given that the modifications made to BGF were minimal; that is, we did not even try to utilize the additional data offered by the new projection API.
Our work on doc-type specific search continues, and we believe the new projection API will
play an important role. In particular, we believe it will allow us to better leverage more systematic (and
powerful) classification techniques to perform doc-type filtering.
Chapter 5
Doc-type Classification via Automated Feature Engineering
5.1 Introduction and Motivation
We have shown in the previous chapters that web carnivores – specialized web search applications layered on top of general-purpose search engines – are an effective way to improve precision for specialized searches without requiring users to formulate complicated queries. Examples
from the literature include homepage, buying-guide, and movie-review finders. Web carnivores are typically built around a document-type classifier that processes results from the underlying search engine
and separates the desired specialized results from less precise ones. These classifiers are commonly built
in an ad hoc manner, requiring a high level of expertise combined with a significant development effort.
In this chapter, we look at reducing the cost of building web carnivores (or, more generally, IFM-type applications) by using standard classification algorithms, rather than ad hoc approaches, to solve the
doc-type classification problem. This problem is central to our overall research program and has broader
applications as well. It has received some attention in the literature [41], [59]. Previous work in this area
has focused on ad hoc techniques, which, while effective, cannot easily be developed by experienced
web developers or consultants. We therefore need a general doc-type classification algorithm that can
achieve good performance (greater than 90% accuracy) given only a small amount of labeled training
data.
We have found that two aspects of IFM contribute substantially to the challenges of building such
a doc-type classification algorithm:
Training bias In our metaphorical thinking, a document has two “signals”: A typically stronger signal indicating the topic and a weaker signal indicating the doc-type. To help isolate this weaker
signal, it would be nice if the training data had a few examples from a large number of topics.
Unfortunately, in practice, it’s much easier to collect training data consisting of a larger number
of examples from a smaller number of topics. With this type of training data, a naive use of standard classification algorithms can “lock on” to the union of these topic-signals rather than to the
underlying doc-type signal.
Input bias Referring to Figure 2.1, in IFM the classifier is fed from the output of a query that was shaped to generate documents of interest. This means that the classifier is not separating the
desired class of documents from a random selection of the Web but rather is separating it from
“near misses”. This makes it more difficult to achieve our target of over 90% accuracy.
The main contribution of this chapter is to evaluate whether standard “off-the-shelf” classification algorithms and packages that are available on the Web are well suited for doc-type classification.
As we shall see, standard classification algorithms are indeed up to the task – but only if the
features used for classification are appropriately engineered.
This chapter describes a systematic, turnkey approach to developing doc-type classifiers. Our
approach starts with standard classification algorithms (Naive Bayes in particular). We show how to
automate the feature-engineering process by combining feature selection with a number of feature-augmentation techniques.
From our experimental results we find that our automated approach achieves high accuracy for the three selected document types – buying guides, personal homepages, and recipes – while requiring relatively little training effort. We have reason to believe that the proposed methods generalize well to other document types with similar performance.
The remainder of this chapter is organized as follows: Section 5.2 discusses work related to doc-type and genre classification, classification in general, and feature selection. Section 5.3 describes our automated feature-engineering techniques that can be used with standard classifiers to solve the doc-type problem. Although our system provides good baseline results, in Section 5.4 we provide some examples of how an advanced web developer would be able to fine-tune the system to further increase
classification accuracy if needed. Section 5.5 presents results from our experiments, measuring the
performance of our classifier on three different document types. We then provide a summary and concluding comments, and discuss future work related to doc-type classification.
5.2 Related Work
This section reviews work on document type and genre classification, text classification, and feature selection, and describes how it relates to our work.
5.2.1 Doc-Type and Genre Classification
While certain document types and genres (such as homepages [46]) have been studied for some time, the more general problem of document type classification [59], along with genre search [29], [30], [48], [49, 4], [51], [58], has attracted more research interest only recently. However, the presented solutions are mostly based on manual rules, carefully handcrafted feature sets, and heuristics, which makes it difficult to transfer these approaches to new document types.
For example, the aforementioned work on PageTypeSearch [59] classifies web pages into document types by comparing pages with typical structural characteristics of the types. While these hand-crafted, rule-based solutions achieve good performance compared with standard keyword-based search systems, adapting these techniques to new document types is very difficult and requires substantial expertise. These observations initially motivated us to explore a more systematic approach to doc-type classification according to the goals outlined in the introduction.
Hu [41] describes features and methods for document image comparison and classification at
the spatial layout level. The presented methods are useful for visual-similarity-based document retrieval as well as for fast initial document type classification without OCR. A novel feature set called
interval encoding is introduced to capture elements of spatial layout. This feature set encodes region
layout information in fixed-length vectors, which can be used for fast page layout comparison. The
feature-augmentation technique we present could possibly make use of the proposed interval-encoding feature set, which could be added as additional meta-data. Whether this would lead to increased accuracy for certain document types would need further investigation.
5.2.2 Classification and Feature Selection
There is extensive literature on both classification algorithms (e.g., [44], [79]) and feature selection (e.g., [1], [45], [80]).
Regarding classification algorithms, we ran experiments using Decision Trees [67], Winnow
[78], Naive Bayes [60], and Maximum Entropy [47]. Each performed better than the others on some document types, but none was the clear winner. The important point is that all were improved by our feature-engineering techniques. Thus, the selection of an underlying classifier is reduced to an engineering decision beyond the scope of our research.
Although we are aware of a number of alternative techniques for feature selection, we did not
have time to explore their impact on performance in the way we did with different classifiers. We suspect
the outcome would be largely the same: While the difference between algorithms might be large enough
to make this a significant engineering decision, there is no fundamental issue to be explored here.
5.3 Automated Feature Engineering
When we were studying the literature on document type classification, we noticed that many
classifiers were built based on manual rules, or ad hoc solutions. These highly customized classifiers
typically performed well for the particular task for which they were designed. However, these solutions
are difficult to develop. Therefore, we turned our attention to standard text-book classification algorithms
and asked: Can these algorithms achieve at least 90% classification accuracy? And can they do so with
relatively little training data?
Standard classification algorithms can indeed achieve these goals, but not without careful engineering of the feature space. Further, given our requirement that IFMs be buildable by advanced web
developers or consultants, this feature engineering needs to be done in an automated fashion. We have
developed three such automated means of feature engineering:
Feature selection Traditionally, feature selection is used to boost accuracy by eliminating noise introduced by irrelevant features. In our context, feature selection has the added benefit of handling
training bias, eliminating topic signals that may confuse the classifier.
Feature augmentation For certain document types, text features alone will not produce the desired
accuracy. We therefore need to introduce additional types of features that we can derive in an automated way and which represent good discriminators.
Feature co-occurrence The co-occurrence of certain features within a document is not statistically random. Phrases are the most valuable of these co-occurrences. For example, in buying guides
there are phrases like “buying guide” or “product guide” whose terms tend to appear close together. Therefore another important form of feature augmentation is finding these high-information co-occurrences.
5.3.1 Feature Selection
When we first ran standard classifiers on our benchmark data, we immediately observed that
the classification accuracy was not in the range we wanted it to be. Looking at the data more closely,
we saw that, in addition to typical noise terms, the classification model also contained certain secondary signals in the form of topic terms, which seemed to have a negative impact on the classification accuracy.
For example, buying guides are available for various topics (e.g., for digital cameras, binoculars). These
topics in the context of doc-type classification represent noise and are therefore not relevant when identifying the document type.
Feature selection was the first form of automated feature engineering that we considered.
While it helped in the context of text-only features, it turned out to be even more important when we
later explored feature-augmentation.
Feature selection is well studied ([1], [45], [80]), and there are a variety of algorithms available
that can be used. In our experiments we used the Fisher Discrimination Index (FDI) described in detail by
Chakrabarti [17]. Given a two-class learning problem, FDI will determine the set of most discriminative
features: Let X and Y be the sets of document vectors corresponding to the two classes, for which we
want to find the most discriminative features. Each document in X is represented by a document vector
(the components of the vector may be term counts) and scaled to unit length. We define
\[
\mu_X = \frac{1}{|X|} \sum_{x \in X} x
\qquad \text{and} \qquad
\mu_Y = \frac{1}{|Y|} \sum_{y \in Y} y
\]
to be the mean vectors (or centroids) for each class. Further, let µX,t be the t-th component of µX, µY,t the t-th component of µY, xt the t-th component of document vector x, and yt the t-th component of document vector y.
We can calculate the FDI score of a term t as follows:
\[
FDI(t) = \frac{(\mu_{X,t} - \mu_{Y,t})^2}{\frac{1}{|X|} \sum_{x \in X} (x_t - \mu_{X,t})^2 \;+\; \frac{1}{|Y|} \sum_{y \in Y} (y_t - \mu_{Y,t})^2}
\]
For each term in our document collection we calculate its FDI score, and sort the terms in
decreasing order of FDI(t). The top k ranked terms are then chosen as features. Picking a small set of
features (e.g., top 10% or less) will typically result in improved accuracy [80].
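To make this step concrete, here is a minimal Python sketch of FDI-based feature selection following the formula above. It assumes the documents of each class have already been converted to unit-length term-count vectors (rows of a NumPy array); the small epsilon in the denominator is our own guard against division by zero and is not part of the original definition.

import numpy as np

def fdi_scores(X, Y):
    # X, Y: 2-D arrays whose rows are unit-length document vectors for the two classes.
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)            # class centroids
    var_x = ((X - mu_x) ** 2).sum(axis=0) / X.shape[0]     # average within-class spread, class X
    var_y = ((Y - mu_y) ** 2).sum(axis=0) / Y.shape[0]     # average within-class spread, class Y
    return (mu_x - mu_y) ** 2 / (var_x + var_y + 1e-12)    # epsilon avoids division by zero

def select_top_features(X, Y, fraction=0.10):
    # Keep the top `fraction` of terms by FDI score (e.g., the top 10%).
    scores = fdi_scores(X, Y)
    k = max(1, int(len(scores) * fraction))
    return np.argsort(scores)[::-1][:k]                    # indices of the selected terms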
5.3.2 Feature Augmentation
When running early experiments we noticed that solely using text features did not produce
the desired accuracy. While working with data of different document types and reviewing the literature
related to doc-type classification, we noticed that there are certain types of features that seemed to be
useful for a broad range of document types. For example, in the case of homepages [46] the URL
structure seemed to be a strong indicator for that document type. In the case of buying guides [53]
certain meta-data worked well (e.g., attributes such as the size of the document, number of hyperlinks),
as well as structural information (e.g., a term that appeared in the title of a document or in a header).
This observation led to the following idea:
• Automatically derive a broad set of different types of features and add them in a systematic way to the feature space.
• Use feature selection to select the types of features that serve as the best discriminators for a particular document type.
We identified at least two broad classes of features that are relevant to a variety of document
types:
Meta-data These can be derived from the document. For example, the size of a document, its URL, the
number of outgoing links.
Structure We want to leverage the HTML structure and associate structural information with a term: for example, whether a term appears in the title of the document, in a header, or in the body.
This short list of features is far from exhaustive, but we decided that it was sufficient to get
started.
Simply adding features using a brute-force approach would significantly increase the feature space. For meta-data features this was not a problem, since the number of features per document was small and fixed. However, adding structure tokens as well would roughly double the number
of features, and therefore introduce noise that would make it difficult to reach the desired classification
accuracy level even with feature selection techniques. We therefore flattened the structural feature space
by grouping together similar structural elements into structural classes. This decreased the number of
added features significantly, while still introducing good discriminators that worked well for certain
document types.
The remainder of this section will describe the features we used and how they were selected in
more detail, along with a description of the algorithms and overall data-flow of the system that was used
to automatically generate the desired feature set for a given document type.
5.3.2.1 Meta-data
Meta-data represents data that can be derived from the content of the document itself or explicitly from annotation. In our experiments we decided to derive the following meta-data:
• Size of the document in bytes
• Number of hyperlinks
• Number of words
• Number of unique words
All of these features were very easy to derive locally from the document without having to
look at a more global context of a document collection. It would certainly be interesting to add more
complicated ones, for example, the number of inlinks to a document, a document’s Pagerank [66] or
other meta-data that can be derived from a global document collection.
Once we had calculated the values for each of those features we would need to map those
into the feature space. Since all of these were represented by numbers, we needed to convert them
in a way that a classifier could better interpret them in the vector space model. We decided to use a
logarithmic scale to convert numbers into discrete feature tokens. For example, a document with a size
of 8 bytes would produce a feature token /metatoken/size/3, a size of 16 bytes would result in
/metatoken/size/4, and so on. Other mapping schemes would certainly be interesting to explore
further (e.g., using a harmonic scale), but this seemed to be a reasonable approach to keep the number of
features low, and still capture the desired semantics.
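A minimal sketch of this log-scale mapping might look as follows. Only the /metatoken/size/... naming is taken from the example above; the token names for the remaining meta-data attributes are hypothetical placeholders.

import math

def metadata_tokens(size_bytes, num_links, num_words, num_unique_words):
    # Map each numeric attribute onto a discrete token via a base-2 logarithmic scale,
    # e.g. 8 bytes -> /metatoken/size/3, 16 bytes -> /metatoken/size/4.
    def bucket(name, value):
        scale = int(math.log2(value)) if value > 0 else 0
        return "/metatoken/%s/%d" % (name, scale)
    return [
        bucket("size", size_bytes),
        bucket("links", num_links),               # hypothetical token name
        bucket("words", num_words),               # hypothetical token name
        bucket("uniquewords", num_unique_words),  # hypothetical token name
    ]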
5.3.2.2 URL
The URL was parsed, normalized, and then divided into its structural parts: protocol, host,
path, filename, and argument parameters. We discarded the protocol part and the arguments, and simply added path tokens. Also, we looked for special characters within the URL, and produced a feature
/metatoken/url/special for this separately.
For example, consider the URL:
http://www.xyz.com/users/˜a/papers/index.html
For this URL the tokenization algorithm would generate the following feature tokens:
/metatoken/url/host/www.xyz.com
/metatoken/url/host/xyz.com
/metatoken/url/path/users
/metatoken/url/path/˜a
/metatoken/url/path/papers
/metatoken/url/path/index.html
/metatoken/url/special/˜
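The tokenization just illustrated can be sketched as follows; the rules (the full host plus its two-label suffix, one token per path component including the filename, and a special-character token for the tilde) are our reading of the example rather than a full specification.

from urllib.parse import urlparse

def url_tokens(url):
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    tokens = []
    if host:
        tokens.append("/metatoken/url/host/" + host)
        labels = host.split(".")
        if len(labels) > 2:                                  # also emit xyz.com for www.xyz.com
            tokens.append("/metatoken/url/host/" + ".".join(labels[-2:]))
    for part in parsed.path.split("/"):                      # one token per path component
        if part:
            tokens.append("/metatoken/url/path/" + part)
    if "~" in url:                                           # special characters get their own token
        tokens.append("/metatoken/url/special/~")
    return tokens

# url_tokens("http://www.xyz.com/users/~a/papers/index.html") reproduces the token list above.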
5.3.2.3 Structure
HTML pages contain structural elements that are often used by authors to structure the content
better, emphasize words or sentences, or change the visual appearance of the document when rendered
in a web browser. During our initial experiments we observed, when looking at the rendered version of
certain document types, that they typically share some visual properties. For example, the overall layout
of different buying guides looked very similar. This observation suggested using the HTML structure to
generate additional features that can be used to increase the accuracy of doc-type classification.
However, obtaining a proper Document Object Model (DOM) 1 tree that represents the structure of the document is a major challenge when parsing HTML. The problem is that HTML
does not enforce a strict syntax and grammar. Today’s web browsers are built to handle faulty HTML
pages and still try to render them as well as possible. In our usage scenario we were mostly concerned
with obtaining the most important structural properties of a page. For example, we wanted to know
whether a term appeared in the title, in the header of a paragraph, or in a list. This kind of information
can be extracted using simpler parsing techniques to generate a “shallow” hierarchical representation of
the document. Similar elements were grouped together in structural classes. For example, all header
tags can be grouped into one header class. This allowed us to keep the feature space small, while still
capturing important structural semantics.
1 A programming interface that allows HTML pages and XML documents to be created and modified as if they were program
objects. DOM makes the elements of these documents available to a program as data structures, and supplies methods that may be
invoked to perform common operations upon the document’s structure and data. DOM is both platform and language neutral, and
is a standard of the World Wide Web Consortium (W3C).
We focused on the most commonly used structural HTML tags and performed the following
mapping:
TITLE Terms that appeared in the title element were mapped to /structure/title/term token
features, where term represents a placeholder for the actual term. As an example, if the term buying appeared in the title element, we would generate a token /structure/title/buying.
LI Terms within lists were mapped to /structure/list/term.
H1-H7 Header terms were mapped to /structure/header/term.
B, I, U, TT, S, DFN, EM, STRONG These tags can all be used to emphasize terms (and there are others). All of these were mapped to /structure/emph/term.
There are other structural tags (e.g., tables) that could be used to further improve the feature set. Our experimental results will show later that the selected subset of structural features is sufficient to achieve the desired accuracy for various document types.
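As an illustration, the grouping of tags into structural classes can be sketched as below. The (tag, term) pair interface for the shallow parser is an assumption of this sketch, not part of the original system description.

# Structural classes as described above; tags are grouped so the feature space stays small.
STRUCTURE_CLASSES = {
    "title": {"title"},
    "list": {"li"},
    "header": {"h1", "h2", "h3", "h4", "h5", "h6", "h7"},
    "emph": {"b", "i", "u", "tt", "s", "dfn", "em", "strong"},
}

def structure_tokens(tagged_terms):
    # tagged_terms: iterable of (tag, term) pairs from a shallow HTML parse,
    # e.g. [("title", "buying"), ("li", "binoculars")].
    tokens = []
    for tag, term in tagged_terms:
        for cls, tags in STRUCTURE_CLASSES.items():
            if tag.lower() in tags:
                tokens.append("/structure/%s/%s" % (cls, term.lower()))
    return tokens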
5.3.2.4 Phrase Selection
Standard classification algorithms (e.g., Naive Bayes) assume that feature occurrences are
statistically independent. However, this is almost never the case. We felt that our framework for automated feature engineering must consider this important source of information. At the same time, while
co-occurrence is a broad concept, we felt that phrase selection – finding frequently occurring pairs of
proximate words – was the best place to focus our efforts.
Our work on phrase selection was inspired by the concept of lexical affinities (LAs), successfully developed to capture term relationships in the context of query refinement [16]. The general idea of
LAs is that we have a window with a fixed size: We then iterate over all terms within the document. For
each term we look at the neighborhood of terms within the specified window size and build a compound
term that could be added to the feature space (a phrase candidate). As an example, consider the sentence buying guide for binoculars; a window size of 1 would yield the following LAs (sketched in code after the list):
• buying guide
• guide for
• for binoculars
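The windowing step referenced above can be sketched as follows; the document is treated as a flat list of tokens and each token is paired with its right-hand neighbors inside the window.

def lexical_affinities(terms, window=1):
    # Pair each term with its following neighbors within the window to form phrase candidates.
    las = []
    for i, term in enumerate(terms):
        for j in range(i + 1, min(i + 1 + window, len(terms))):
            las.append(term + " " + terms[j])
    return las

# lexical_affinities(["buying", "guide", "for", "binoculars"], window=1)
# -> ["buying guide", "guide for", "for binoculars"]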
However, if we add all possible LAs with even a small window size of 1, then we would typically multiply the number of features by ten or more. Early experiments showed that this would degrade
classification accuracy significantly, even when feature selection techniques were applied. Therefore, a
more careful selection was developed that adds only LAs that help improve the overall classification accuracy. This led to a relatively small number of added phrases, but those contributed well to the overall accuracy.
Our phrase selection algorithm comprises the following steps:
• In the first step of the algorithm, we pre-process the data (e.g., detagging) and perform feature
selection by calculating the FDI score for each feature. We then sort all features in decreasing
order of their score.
• Second, we select a phrase-expansion pool by selecting the top k ranked features based on their
FDI score.
• Third, we iterate over all documents in the document collection. Where we see a token in the
phrase-expansion pool, we combine it with the following token and insert the combination into the
pool of phrase candidates, counting occurrences of each phrase candidate.
• Fourth, we consider these phrase candidates, along with the original terms, renormalizing the
combined counts to a new unit vector and computing FDI scores for the elements of this new,
combined vector.
• Fifth, we select only those phrase candidates that score higher than their first term alone (recall that the first terms of phrase candidates come from the original set of top-k scoring terms). For
example, consider the phrase candidate buying guide. If the original term, buying, scores
2.13, and buying guide scores 3.0, we would pick buying guide and add it to the feature
space. Otherwise, if buying guide scores less than 2.13 the algorithm would have discarded
the phrase candidate.
We needed to decide whether to simply add those phrases or to actually replace the original
term with the phrase. Intuitively, removing the original term has the problem that the term might also have been used in different contexts along with other terms (since the FDI score is computed over all term occurrences in all documents). If it is replaced with an expanded term we lose this information. Experiments showed that classification accuracy indeed degrades when doing replacement. Therefore we add phrases and keep the original terms too.
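Putting the five steps together, the phrase-selection procedure can be sketched as follows. The fdi_by_term argument is a caller-supplied helper (for example, built on the feature-selection sketch given earlier) that maps two collections of token lists to a dictionary of term-to-FDI scores; the default pool size of 150 is simply a placeholder within the 100–250 range suggested in Section 5.4.1.

def select_phrases(docs_pos, docs_neg, fdi_by_term, k=150):
    # Steps 1-2: score single terms and pick the top-k phrase-expansion pool.
    base_scores = fdi_by_term(docs_pos, docs_neg)
    pool = {t for t, _ in sorted(base_scores.items(),
                                 key=lambda kv: kv[1], reverse=True)[:k]}

    # Step 3: where a pool term occurs, join it with the following token as a phrase candidate.
    def with_phrases(doc):
        expanded = list(doc)
        for a, b in zip(doc, doc[1:]):
            if a in pool:
                expanded.append(a + " " + b)
        return expanded

    docs_pos_x = [with_phrases(d) for d in docs_pos]
    docs_neg_x = [with_phrases(d) for d in docs_neg]

    # Step 4: re-score original terms and phrase candidates together.
    combined = fdi_by_term(docs_pos_x, docs_neg_x)

    # Step 5: keep a phrase only if it out-scores its first term alone.
    return [p for p in combined
            if " " in p and combined[p] > combined.get(p.split()[0], 0.0)]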
5.4 Tuning the Doc-Type Classifier
Although the design goal is a system that requires relatively little hand tuning and expertise – our baseline accuracy of over 90% on average should be sufficient for most purposes – it will almost always be the case that, when working with a new document type, manual tuning could improve the overall classification accuracy even further. We therefore want to provide simple knobs and methods so that this tuning remains in line with our goal: an advanced web developer or consultant should be able to perform it relatively easily if needed.
This section will highlight some of these tuning aspects. They fall into two categories:
• Adjusting parameter settings
• Additional features and heuristics
The following sections describe these in more detail and provide an example of how this could be done for the document type homepage.
5.4.1 Adjusting Parameter Settings
First, the system will expose an API with a few parameters that can be tuned and adjusted when
working with a new document type. Again, the default settings there should be sufficient as a starting
point. Second, when working with a new document type, there can be some features that are unique to that particular type. If added to the feature space, they could produce even better results for that
particular doc-type. We therefore want to provide a systematic way and API to add those heuristics if
needed.
This is an overview of the parameters that are exposed for tuning:
Frequencies for meta-data tokens Text classifiers take term frequencies into account. Therefore we
need to decide during the augmentation phase what frequency values we assign to meta-data tokens. Depending on the approach we can introduce a tuning parameter that allows adjusting the
assigned term frequencies accordingly. In general, there could be a fixed global value, or one that
takes the document size and other term occurrences into account when assigning term frequencies
to these meta-data tokens. In our experiments we used a fixed global value that turned out to be
sufficient for our purposes. However, tuning it was worthwhile and improved accuracy.
Picking good parameters for term expansion boundaries Phrase selection requires fixing the parameter k that determines the number of top ranked terms that are used as a basis for term expansion.
Experiments with different sizes for k have shown that a relatively modest number for k provides
good accuracy improvements (e.g., 100 ≤ k ≤ 250), while choosing larger values for k decreases classification performance and eventually introduces noise into the feature space.
5.4.2 Additional Features and Heuristics
Depending on the doc-type there might be a set of features or heuristics that produce very good discriminators. For example, a URL typically conveys hidden semantics in the form of naming conventions widely adopted by web masters to name certain types of pages. Many systems that try to find and classify homepages [46] exploit these hidden URL
semantics. Generally, these tuning efforts fall into one of the two following categories:
Additional features that can be used in the augmentation process: We could pick more or different
meta-data features, use structure in a more precise way, or use new types of features (e.g., anchor
text). Again, our doc-type classification system allows these to be added systematically within the feature-augmentation phase as part of the meta-data.
Additional doc-type-specific heuristics: In the case of homepages it turned out that there are valuable heuristics, based on hidden semantics in URLs, that help boost classification accuracy. There may be other document types for which other special heuristics are helpful. The proposed doc-type classifier system allows these to be introduced systematically in the augmentation phase as new types of features.
5.4.3 Example: Adding Heuristics for Classifying Homepages
To illustrate how a developer could add these heuristics or additional features to our classifi-
cation system, we present an example for the document type homepages. In our experiments we were
able to achieve strong baseline accuracy for homepages without any additional tuning. Adding those
heuristics in the proposed systematic way increased accuracy slightly. Given the fact that adding those
heuristics to our system required relatively little effort, it might be worthwhile to do this.
Most approaches in the literature on homepage classification [46] take hidden semantics of URLs into account: The domain name of a URL, for example, can carry additional semantics indicating that a document is about a certain topic or of a certain type. We decided to use some of
these heuristics to identify a common and broad set of hidden rules and naming conventions that indicate that a URL is likely to be the homepage of some user.
The way to integrate these heuristics into our system is to map them into our feature space as meta-data tokens. One could therefore write a tool that parses and analyzes a document based on these
heuristics, and produces a list of meta-data feature tokens using some predefined naming scheme that
can be inserted into the feature space.
Here is an example of some heuristics that could be used for homepages, and how they would
be mapped into our system:
Domain membership A membership of a URL to a certain domain could be an indicator that a document belongs to a certain document type. For example, domains like members.aol.com or
geocities.com are likely to contain homepages. For documents that belong to one of these
domains we would insert a feature token /metatoken/url/homepage.
Path Patterns Web masters typically use certain naming conventions for path names, such as /homepages/,
/people/, /u/, /users/, /homes/, /personal/, and /˜, to indicate that the document
represents a homepage, or at least is a document within a user’s personal homepage. For documents whose URLs contain these patterns we would insert a feature token /metatoken/url/homepage.
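Mapped into code, these heuristics might look like the following sketch. The domain and path-pattern lists are just the illustrative examples from the text, not an authoritative set, and the function name is ours.

from urllib.parse import urlparse

# Illustrative examples only; a real deployment would maintain longer lists.
HOMEPAGE_DOMAINS = {"members.aol.com", "geocities.com"}
HOMEPAGE_PATH_PATTERNS = ("/homepages/", "/people/", "/u/", "/users/",
                          "/homes/", "/personal/", "/~")

def homepage_heuristic_tokens(url):
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    in_domain = any(host == d or host.endswith("." + d) for d in HOMEPAGE_DOMAINS)
    in_path = any(p in parsed.path for p in HOMEPAGE_PATH_PATTERNS)
    return ["/metatoken/url/homepage"] if in_domain or in_path else []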
We can see that the way in which the heuristics are mapped into our classification system is
intuitive: A new feature would be represented as a new metatoken, and inserted into the feature space.
The feature augmentation and selection methods will then figure out how to use that new feature in the classification process (if it is used at all).
To sum up, adding heuristics can be done in a systematic way by adding features based on these heuristics as meta-data tokens into the feature space. Along with the two tuning parameters exposed through the API, one can fine-tune our doc-type classification system to further enhance the classification accuracy if needed.
5.5 Experiments and Results
This section describes the experiments we conducted with our doc-type classifier in detail.
For these experiments we did not fine-tune our classifier, since we were interested in obtaining a high baseline accuracy, which should require no manual intervention.
5.5.1 Building Training Sets
In our experiments we focused on three common and popular document types:
• Buying Guides
• Personal Homepages
• Recipes
We chose buying guides, since we wanted to compare results with our work related to the
buying guide finder. Before selecting other document types we compiled an extensive list of available
document types (see Table 1.1). Our goal was to select two other document types, which we believed
would require different feature types besides text. Since homepages have been studied intensively in the literature (e.g., homepage finders), we decided that this document type would also be a good one for comparing against other published results. We then picked another popular document type,
recipes.
We were trying to find a standard and comprehensive benchmark from the IR and classification
community that we could use for the task of doc-type classification. Unfortunately we could not find an appropriate one. Therefore we decided to build our own benchmarks for the three different document types so we could see whether the results of the proposed doc-type classifier generalize. Building a
training benchmark for one document type is costly and time consuming. We built three benchmarks of small size, and we plan to expand these over time.
To build training sets more quickly and efficiently we used our web carnivore: The web carnivore queried the Yahoo search engine 2 for the particular document type using query templates that were likely to produce results of the desired document type. Instead of filtering and classifying the results (as the carnivore would typically do), it simply collected the data and ensured that the URLs were unique. We would then manually evaluate and label the data as either of the desired
document type or not. We did this for buying guides, recipes, and personal homepages. The goal was to
get at least 80 positive examples and 80 negative examples for each document type. We felt that labeling
that amount of data is manageable and in line with our goal of limiting the effort it would take to build
a specialized search application by document type. For example, if our document type classifier needed thousands of examples to work well, it would probably be very time consuming to build one for a new document type.
As mentioned earlier, the literature points out domain transfer as an important problem for doc-type classification [29]: A doc-type classifier trained on binocular buying guides would do poorly on
buying guides for digital cameras. Therefore an additional requirement for building our benchmark was
that it would need to include a variety of different topics for each document type. For buying guides
and recipes we therefore had 40 topics, and for each topic we would pick two positive and two negative
examples. For personal homepages the topic would correspond to the actual author of the homepage. In
this case we would actually have 80 different topics, comprising one positive and one negative example
per topic. Earlier experiments on buying guides indicated that having a variety of different topics would
actually decrease the overall classification accuracy (compared to having one topic with many samples),
which is consistent with the findings in the literature and confirms the difficulty of doing domain transfer.
2 http://search.yahoo.com/
Once we had obtained the necessary training data for each document type we developed the
tooling to run the experiments. The data was first tokenized using an HTML parser, which would also
detag the document (removing all HTML tags) to clean it up. We used W3C Tidy 3 for this task.
5.5.2 Metrics
In each trial each classifier would output its classification accuracy. The classification accuracy
is defined to be the ratio between all correctly predicted documents and the total number of documents
in the test set. We multiplied this ratio by 100 to obtain the accuracy as a percentage. More precisely:
\[
\text{accuracy} = \frac{c}{n} \times 100
\]
In this equation c represents the number of correctly predicted documents, and n the total
number of documents in the test set. To get reliable data for the average classification accuracy, we re-ran the trials and averaged the accuracy over a total of 1,000 trials.
5.5.3 Methodology
For our experiments we used a standard “off-the-shelf” classification package, Mallet 4 (A Machine Learning for Language Toolkit), which implements various standard text classification algorithms. From this package we primarily used the Naive Bayes classifier, as well as one based on a maximum entropy model [47]. We picked a Naive Bayes classifier since it is popular, easy to implement, and efficient to use. The latter aspect is especially important if we are considering integrating a classification solution within an IFM-type application. The second classifier, maximum entropy, was used to see whether the improvements we observed for Naive Bayes would be similar for a different sort of classifier.
3 http://www.w3.org/People/Raggett/tidy/
4 http://www.cs.umass.edu/~mccallum/mallet/
                        Homepages    Buying Guides    Recipes
Baseline                   65.72%           81.52%     88.40%
Feature Selection          78.93%           82.94%     98.27%
Feature Augmentation       88.43%           94.47%     99.16%
Table 5.1: Classification accuracy expressed as a percentage comparing the baseline against feature selection and feature augmentation using the Naive Bayes classifier.
                        Homepages    Buying Guides    Recipes
Baseline                   81.35%           87.49%     90.34%
Feature Selection          88.43%           91.13%     94.80%
Feature Augmentation       90.23%           92.26%     94.25%
Table 5.2: Classification accuracy expressed as a percentage comparing the baseline against feature selection and feature augmentation for the maximum entropy classifier.
In our experiments we always ran both classifiers on the preprocessed (detagged) data set. The tool randomly picked 50% of the labeled data to be the training set; the remaining 50% represented our test set.
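A minimal sketch of this trial protocol is shown below, using scikit-learn's MultinomialNB as a stand-in for the Mallet Naive Bayes classifier actually used in the experiments; the function name and the use of CountVectorizer for tokenization are our assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def average_accuracy(texts, labels, trials=1000):
    # texts: detagged (and possibly feature-augmented) documents; labels: 1/0 for the doc-type.
    X = CountVectorizer().fit_transform(texts)
    total = 0.0
    for seed in range(trials):
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.5, random_state=seed)    # random 50/50 split per trial
        model = MultinomialNB().fit(X_train, y_train)
        total += model.score(X_test, y_test)                 # fraction of correct predictions
    return 100.0 * total / trials                            # average accuracy as a percentage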
In the first experiment we ran both classifiers on each document type on the pre-processed data
to obtain a baseline accuracy (see tables 5.1 and 5.2).
We can see that the maximum entropy classifier performs better on average than Naive Bayes: Across all document types Naive Bayes achieves roughly 78.55% accuracy, while maximum entropy achieves an average of 86.39% (which is somewhat below our desired target of 90%).
In the next experiment we then applied feature selection (FDI) to the set of all features in the
document collection (see the second row of tables 5.1 and 5.2). Keeping only the top 10% of the features, we ran both classifiers again and recorded the performance improvements. On average Naive Bayes now performed at 86.71%, while maximum entropy performed at 91.45%. For both classifiers we could see a significant improvement; maximum entropy was even performing above 90%.
Figure 5.1: Classification accuracy chart showing the Naive Bayes classifier’s accuracy expressed as a percentage for homepages, buying guides, and recipes.
We now applied our feature augmentation technique combined with feature selection as described in the previous section. Again, we measured the accuracy of both classifiers when only the top 10% of all features were used (see row 3 in tables 5.1 and 5.2). On average over all document types Naive Bayes now performed at 94.02% accuracy, and maximum entropy was up at 92.24%. We have
reached our goal of having both classifiers perform at over 90% accuracy on average.
Figure 5.1 charts the Naive Bayes classifier’s accuracy, expressed as a percentage, for homepages, buying guides, and recipes. Looking at feature selection, we can see that for homepages and recipes the accuracy was boosted significantly, while in the case of buying guides feature selection did not help much. For feature augmentation we can see a boost for homepages and buying guides, while it did not help much for recipes (which already had 98.27% accuracy before feature augmentation).
The overall amount of training data needed was moderate (only 80 positive and 80 negative
labels), which fulfills our goal of requiring relatively little training effort. However, given the small size
of the benchmark the results may not be statistically significant. Since we used many diverse topics
in each benchmark set (for buying guides and recipes we had 40 topics, and for homepages even 80),
and also looked at three different document types (and observed the same trend), we are optimistic
that the results should generalize well to other document types. We also plan to increase our benchmark size.
From these experiments we conclude that feature augmentation along with feature selection
on average improved the classification accuracy significantly. In cases where feature selection alone was not able to boost results into the 90% region, feature augmentation helped achieve our goal. We therefore achieved our accuracy goal of at least 90% over all document types we tested. This compares favorably with the 91% accuracy we achieved using our BGF on the same benchmark data for buying guides: We were able to obtain a better classification accuracy than BGF while using standard classification algorithms, confirming that our doc-type classification technique meets (or in this case beats) the classification accuracy of an ad hoc solution.
5.5.4 Feature Distribution
The main idea of our feature augmentation technique is essentially to carefully add a broad variety of different types of features, and then let feature selection pick the ones that serve as good discriminators. We therefore wanted to measure the distribution of the different feature types and how they contribute to the accuracy for different document types. What we wanted to see was whether, for some document types, structural features contribute more to the accuracy, while for other document types text features are the most useful for the classification task.
To measure the feature contribution we first counted the number of features for each of the
following feature types:
• Structure (Title)
• Structure (Header)
• Structure (Emph)
• Structure (List)
• Structure (Total)
• Meta-data
• LAs
• Text only
We divided the structural features into subcategories (title, header, emph, list) to obtain more detailed information on the overall distribution, but also aggregated all structure tokens to obtain a total feature contribution for them. The feature selection process selected the top 10% of all features. We then analyzed the distribution of these selected features by type. The feature
count distribution is represented in Table 5.3. Besides the raw counts, we looked at how FDI scores were
distributed among the different feature types. These are represented in Table 5.4, and look very similar
to the feature counts distribution.
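The per-type tally can be sketched as follows, based on the token-naming scheme from Section 5.3.2; recognizing LA features by an embedded space is our own convention, since the text does not specify how phrase features are marked in the feature space.

from collections import Counter

def feature_type_distribution(selected_features):
    counts = Counter()
    for f in selected_features:
        if f.startswith("/structure/"):
            counts["Structure (" + f.split("/")[2].capitalize() + ")"] += 1
            counts["Structure (Total)"] += 1
        elif f.startswith("/metatoken/"):
            counts["Meta-data"] += 1
        elif " " in f:                                        # phrase (LA) features
            counts["LAs"] += 1
        else:
            counts["Text only"] += 1
    total = sum(v for k, v in counts.items() if k != "Structure (Total)")
    return {k: 100.0 * v / total for k, v in counts.items()}  # percentages, as in Table 5.3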
We were less concerned with the absolute distribution, since it was clear that there are more
text or structure features in the corpus, compared to meta-data or LA features. Therefore the focus for
this experiment was to observe the relative distribution between the different document types. Some
observations:
• Text features seem to be most important for recipes, and least important for buying guides.
• Meta-data features seem to be most important for homepages, while they are insignificant for buying guides and recipes.
• Structural features seem to be most important for homepages.
• LAs seem to be most important for buying guides.
                        Homepages    Buying Guides    Recipes
Structure (Title)           1.05%            1.65%      1.82%
Structure (Header)          3.40%            1.18%      1.50%
Structure (Emph)           13.02%            8.36%      9.89%
Structure (List)            2.50%            8.26%      1.56%
Structure (Total)          19.97%           19.45%     14.77%
Meta-data                   0.62%            0.66%      0.59%
LAs                         1.55%            4.38%      1.63%
Text only                  77.86%           75.51%     83.02%
Table 5.3: Feature counts distribution, where feature type counts are expressed as percentages between structural features, meta-data features, lexical affinities features, and text only features.
                        Homepages    Buying Guides    Recipes
Structure (Title)           1.68%            1.24%      1.45%
Structure (Header)          3.50%            0.81%      0.96%
Structure (Emph)           10.26%            4.86%      7.36%
Structure (List)            2.14%            5.98%      0.93%
Structure (Total)          17.58%           12.90%     10.71%
Meta-data                   2.63%            1.17%      0.77%
LAs                         3.75%           11.21%      3.50%
Text only                  76.04%           74.72%     85.02%
Table 5.4: Feature scores distribution, where feature type scores are expressed as percentages between structural features, meta-data features, lexical affinities features, and text only features.
Looking across all doc-types, we can see that the feature selection applied to our augmented feature set is doing a good job of “automated feature engineering,” that is, of selecting the types of features that
work best for each document type.
5.6 Summary and Conclusions for Doc-Type Classification
Our primary goal was to investigate whether standard “off-the-shelf” text classifiers can be
used for classifying documents by their type, while achieving high accuracy.
Figure 5.2: Feature scores distribution chart, where feature type scores are expressed as percentages between structural features, meta-data features, lexical affinities features, and text only features.
The experiments conducted with the three different document types confirm that this is indeed feasible. However, achieving the desired high accuracy was only possible with the combination of feature selection and feature augmentation techniques introduced earlier: With these we achieved accuracy above our desired threshold of 90% with different text classifiers. It was interesting to see how the two techniques complemented each other for different document types. For example, for buying guides feature selection was less important, but feature augmentation made the difference, while for recipes feature selection alone provided very good results. In the case of homepages both feature selection and feature augmentation
together helped to achieve the desired accuracy. The advantage of this outcome is that we can leverage
all the research and work related to text classification and feature selection, without having to rely on
hand-crafted rules and ad-hoc classification implementations when developing document classifiers for
new document types.
The second goal was to develop a method for document type classification that requires relatively little training effort. In our experiments we only used a set of 80 positively labeled and 80
negatively labeled documents. That was sufficient to achieve the desired accuracy. Having more training
data would probably be helpful and further increase accuracy.
Another important goal was that a document type classifier for a new document type can be
developed by someone with relatively little expertise. We believe we could develop tooling that would allow an advanced web developer or consultant to train a classifier simply by presenting a list of samples. A modified version of a web carnivore could be used to build such a tool. Once the training phase has been completed, the document type classifier would be ready for use.
We conclude that the presented feature-engineering techniques can serve as the basis for document type classifiers that can be used and integrated in IFM-type applications, as well as in a broader variety of information retrieval tasks that require document type classification.
Chapter 6
Case Study: Contextual Search
6.1 Introduction and Overview
So far, we have used IFM for applications that rely on filtering techniques, as in the case of BGF. The filtering approach typically requires a complex classifier that decides whether
to keep a result element or not. Such a filtering step may be too expensive to implement within a search
engine environment, making IFM a perfect choice for the scenario when such a complex filtering step is
required (e.g., doc-type classification).
We were interested to find out whether IFM can also be successfully applied to ranking of
results, instead of filtering them. The problem of ranking is not a binary decision: From a list of results
that a search engine returns we have to apply one or more scoring functions (instead of a filtering step), and re-order the results accordingly. If IFM showed good performance when re-ranking results, this would open up an even broader applicability of the IFM method for different usage scenarios (besides filtering). On the other hand, if we discovered performance problems, this might indicate limitations of using IFM for applications that require ranking or re-ranking of results.
We therefore decided to study IFM’s performance and usefulness for implementing contextual
search as an example application that requires ranking instead of filtering. Contextual search tries to
better capture a user’s information need by augmenting the user’s query with contextual information
extracted from the search context (for example, terms from the file the user is currently editing or the
email the user is currently reading).
This chapter describes three implementations of contextual search for the Web. Naive query-rewriting sends the user’s query augmented with a few terms from the search context to an unmodified
web search engine. Rank-biasing sends the user’s query along with a representation of the context to a
search engine specifically modified for contextual search. Finally, IFM sends multiple augmentations of
the user’s query to an unmodified search engine and then re-ranks the resulting candidate pool.
We describe an evaluation of these three approaches involving over 7,000 relevancy judgments.
In brief, our evaluation shows both rank-biasing and IFM outperforming naive query-rewriting but shows
no statistically significant difference in relevancy between rank-biasing and IFM. However, rank-biasing outperformed IFM on a second criterion, enhancement, which motivated us to conduct an additional experiment to study the cause of this finding.
Section 6.2 provides an overview of contextual search, related work, and terminology, and describes the three contextual search engines that we built; the following section describes in detail how we adapted IFM for implementing contextual search. In Section 6.4 we present our evaluation and results, along with a discussion of the insights we gained. In particular, we point out
possible recall limitations of IFM in Section 6.4.6. We then conclude and discuss future work.
6.2 Contextual Search
Today’s web search engines accept keyword-based queries and return results that are relevant
to these queries. These engines have proven to be extremely useful, perhaps surprisingly so given the
short length of a typical web query and the huge size of today’s web corpora. However, relevancy is a
significant challenge for search engines, especially for queries with an ambiguous topic or intent (such
as the infamous “jaguar” example). Contextual web search aims to improve the relevancy of web search
results by considering the implicit information context of the user’s current task along with the user’s
explicit query.
Often, search queries are formulated while the user is engaged in some larger task. In these
cases, there is often an information context available that can help refine the meaning of the user’s query.
For example, a user may be browsing a web page about the Jaguar car. The article stimulates some
interest and the user wants to know something related to that car. A contextual search engine might take
that web page as an additional input to disambiguate and otherwise augment the user’s explicit query.
Contextual search involves two mechanisms. The first is a UI mechanism for obtaining an
appropriate context for a query. Such a mechanism is beyond the scope of this chapter. Rather, we assume the existence of such a mechanism that provides us, along with the user’s explicit query, with an appropriate piece
of text (e.g., an article, paragraph, sentence, or even just a few words), which we call the context. The
second mechanism is the contextual search-engine itself, which takes the query and context together and
returns results.
This chapter describes and evaluates three approaches to building a contextual search engine:
• Naive query-rewriting sends the user’s query augmented with a few terms from the context to an
unmodified, general-purpose web search engine and simply returns these results.
• Rank-biasing sends the user’s query along with the context to a search engine specifically modified
for contextual search.
• IFM sends multiple augmentations of the user’s query to an unmodified search engine and then
re-ranks the resulting results into a final result.
Naive query-rewriting and IFM sit on top of a general-purpose web search engine, while rank-biasing requires internal modifications to such a search engine.
Engineering efficiencies would argue against modifying something as complicated as a web
search engine. However, we were interested in understanding how much additional performance might
be available if one indeed makes the effort to directly modify the search engine. Thus, we implemented
all three approaches and compared their results.
6.2.1 Related Work
Overall, contextual search has gained more interest recently [7] given the current fierce competition around web search, since it promises to return more relevant results to the user by leveraging the search context. The majority of contextual search work revolves around learning user profiles based on previous searches, search results, and, more recently, web navigation patterns. The information system uses what it learns to represent the user’s interests for the refinement of future searches. Context-learning work [34], [5], [35] focused on learning the context from judged relevant documents, query terms, document vectors, and so on. A different approach, treating the context as a query [39], uses the context as background for topic-specific search and extracts a query that represents the context and therefore the task at hand. The work in this chapter is based on an initial context provided by the user as part of the user’s current information need.
6.2.2 Terminology
At the beginning of Section 6.2 we argued that contextual search can lead to more relevant results for a query given a certain context. Unfortunately the word context has been overloaded with
many different meanings. We therefore want to define more precisely what we mean by a context and
some related terminology, since this represents the foundation for our experimental evaluation.
Context A piece of text (e.g., a few words, a sentence, a paragraph, an article) that has been authored
by someone.
Context Term Vector A dense representation or digest of a context that can be obtained using various
text or entity recognition algorithms (for example [10], [8]) represented in the vector space model
[81]. In this model extracted terms are typically associated with weights, which represent the
importance of a term within the context. In Figure 6.1 our query generator takes as input a query
and a context (where the context will be represented by a context term vector).
Contextual Search Query A search query that comprises a keyword query and a context (represented
by a context term vector).
Contextual Search A search metaphor that is based on contextual search queries. Its goal is to provide
more relevant results to a user within the specified context.
Simple Queries (SQ) Queries of this type are regular keyword based search queries (not contextual
search queries) typically comprising a few keywords or phrases, but no special or expensive operators (e.g., rank-biasing or proximity operators).
Complex Queries (CQ) Queries of this type are regular keyword based search queries (not contextual
search queries) typically comprising keywords or phrases plus ranking operators, and are therefore
more expensive to evaluate.
Standard Search Engine Backend (Std. SE) A standard or typical web search engine backend like
Yahoo! or Google that supports simple queries.
Modified Search Engine Backend (Mod. SE) A standard search engine backend that has been modified to support complex search queries using rank-biasing techniques (see below).
Contextual Search Engine (CSE) An application front-end that supports contextual search queries. It
can use a standard or modified search engine back-end to evaluate these contextual search queries.
In the remainder of this chapter we use the term query to refer to regular search queries (either
simple or complex). If we refer to contextual search queries we will explicitly state that.
6.2.3 Approaches for Implementing Contextual Search
There are different approaches available for implementing contextual search. Figure 6.1 illustrates these. One dimension is the number of queries we send per contextual search query to the search
engine (either sequentially or in parallel). The second dimension is the type of queries we send to the
search engine. We distinguish here between simple and complex queries (see definition above).
Figure 6.1: Three different approaches for implementing contextual search.
We can enumerate and name these approaches as follows (the numbering corresponds to Figure
6.1).
(1) Query-Rewriting: Send one simple query per contextual search query to a standard search engine
backend.
(2) Rank-Biasing: Send one complex query per contextual search query to a modified search engine
backend.
(3) IFM: Send multiple, simple queries per contextual search query to a standard search engine backend.
For completeness we also mention the approach of sending multiple, complex queries to a modified search engine backend. However, we did not consider this case further. The problem with that approach is its high cost: there are economic decisions to consider when trying to increase the scalability of a web search engine to support more traffic (since resources are limited). One dimension along which to increase scalability is to improve the number of simple queries that can be handled per second (this is the typical scenario). The other dimension is to allow queries to be more complex (e.g., to put more logic into the ranking function). However, investing in both scenarios at once is too costly and therefore not very likely.
The following paragraphs describe the different approaches in more detail.
6.2.3.1 Query-Rewriting
In this scenario (see approach one in Figure 6.1) we send one simple query to a standard search
engine backend per contextual search query. A naive query-rewriting approach would therefore simply
concatenate all query and context term vector terms to form a rather long query using AND semantics. Unfortunately, there are three major problems that limit the usefulness of this simple approach:
• The query and context term vector together may comprise more terms than the search engine
supports for evaluation. For example, Google only supports up to 10 terms in its API [15].
• The more terms are added conjunctively, the more restricted the query becomes, and the fewer results it will likely return (low recall).
• A document that is considered relevant because it matches most of the terms may still not be considered enhancing by a user.
For our experiments we implemented a query-rewriting algorithm that sorts the terms of the context term vector by decreasing weight and appends the top five tokens to the user's query.
As an example, consider a query q and a context term vector ctv = (a, bc, d, e, f), where the terms are sorted by decreasing weight and bc is a phrase comprising two tokens. Our query-rewriter component would then construct the rewritten query q′ = q a bc d e.
We added this approach to our experiments to obtain a good baseline.
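The query-rewriting step just described can be sketched as follows (Python; this is an illustration of the algorithm above, not the code actually used in our experiments). Terms are appended in decreasing order of weight until the five-token budget is exhausted; a phrase counts as several tokens.

def rewrite_query(query: str, ctv: dict, max_tokens: int = 5) -> str:
    # Naive query rewriting: append the highest-weighted context terms to the
    # user's query until max_tokens tokens have been appended. AND semantics
    # are assumed to be applied by the search engine backend.
    rewritten, used = [query], 0
    for term in sorted(ctv, key=ctv.get, reverse=True):
        tokens = len(term.split())
        if used + tokens > max_tokens:
            break
        rewritten.append(term)
        used += tokens
    return " ".join(rewritten)


# Mirrors the example above: ctv = (a, bc, d, e, f), with bc a two-token phrase.
print(rewrite_query("q", {"a": 0.9, "b c": 0.8, "d": 0.7, "e": 0.5, "f": 0.1}))
# -> q a b c d e   (f is dropped because the five-token budget is used up)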
6.2.3.2 The Rank-Biasing Approach
In this approach we have full access to the internals of a search engine (e.g., index, runtime)
and are able to modify its data structures, indexing, and query processing code to support contextual
search optimally. This enables us to expose that functionality through a specialized contextual search
API. The query generator component (see approach two in Figure 6.1) then can generate and send a
complex query, which contains ranking instructions, to a modified search engine that implements such a
specialized contextual search API. The complex query is made up of two parts: the actual query keywords and the formulated context. The main goal is to maintain the same level of recall as if the query were issued without a context. For this purpose we used the special ranking operator provided by the search engine to boost the score of documents in the candidate set that are similar to the associated context.
The main focus of this approach is on recall: Optional ranking terms have no impact on recall
and only boost the score of the documents in the candidate set (identified by the query). The ranking
score is adjusted based on a tf/idf weighting of the optional context terms. Our underlying assumption
and motivation for this approach is that there are only a few matching documents per contextual search
query available in the corpus, and we want to make sure we find them.
Our query-rewriting algorithm would construct a complex query as follows:
• Add the query as a prefix to the rewritten query.
• Sort the terms of the context term vector by decreasing weight, and add the top five ranked terms to the query using a rank operator r(t, w). Each rank operator takes as arguments the term t and a weight w, which is derived by multiplying the term's original term vector weight by a rank scaling factor so that the weights make sense to the search engine's ranking algorithms.
There are still many challenges in how to use such a rank-biasing operator. First, the mapping of weights from the context terms into weights that the operator uses (the rank scaling factor) is not intuitive. Second, the number of rank operators used also impacts the overall performance. We conducted many experiments to determine settings that worked well.
As an example, consider again a query q and a context term vector ctv = (a, bc, d, e, f), where the terms are sorted by decreasing weight and bc is a phrase comprising two tokens. Our query-rewriter component would then construct the following rewritten query: q′ = q r(a, w1) r(bc, w2) r(d, w3) r(e, w4) r(f, w5).
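A sketch of how such a complex query string could be assembled (Python; the r(term, weight) operator notation follows the description above, while the concrete rank scaling factor of 10 is purely illustrative):

def build_rank_biased_query(query: str, ctv: dict, scale: float = 10.0, top_k: int = 5) -> str:
    # Construct a complex query: the original query as prefix, followed by
    # optional rank operators r(t, w) for the top-k context terms. Each weight
    # is the term's context-vector weight multiplied by a rank scaling factor.
    terms = sorted(ctv, key=ctv.get, reverse=True)[:top_k]
    operators = ["r(%s, %.1f)" % (t, ctv[t] * scale) for t in terms]
    return " ".join([query] + operators)


# ctv = (a, bc, d, e, f), sorted by decreasing weight
print(build_rank_biased_query("q", {"a": 0.9, "bc": 0.8, "d": 0.7, "e": 0.5, "f": 0.1}))
# -> q r(a, 9.0) r(bc, 8.0) r(d, 7.0) r(e, 5.0) r(f, 1.0)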
6.3 Adapting IFM for Contextual Search
In Figure 6.1, approach number three illustrates our IFM model and architecture, and shows
how we adapted IFM for implementing contextual search. In IFM the query generator component uses
query templates to generate a set of n simple search queries per contextual query. It will then send
these queries (in parallel or sequentially) to the search engine (step 2), which will evaluate them. As an
intermediate step (not shown in the figure) the search engine will then generate up to n candidate sets.
Each of the n candidate sets is then ranked according to the search engine’s scoring function. The search
engine will then return the top k ranked results for each of the n result sets, and return those as input to
IFM's re-ranking algorithm (step 3). As a last step IFM applies its own ranking scheme to the union of those results (the IFM candidate set) and returns the top-ranked IFM results to the user.
We need to point out here that the current implementation does not do filtering, but ranking, which is quite different. One big advantage of IFM is that it can implement very complex scoring functions, since it is not limited by the constraints under which a large web search engine has to operate. A search engine, however, typically uses many features for its ranking and has efficient access to a potentially very large candidate set. In contrast, IFM typically requests a small number of results per query, on the order of tens or hundreds, that are already ranked by the search engine. We can immediately see that the problem of recall may be a great disadvantage for IFM's ranking if the desired documents are scarce and are not returned by the search engine. IFM then has no chance of returning them to the user, even if it implements a strong scoring method. Our primary focus so far was on precision: the problem then was eliminating the losers, whereas in a recall scenario IFM would need to find the "needle" in the haystack (which is quite different). We will investigate this problem in depth when comparing and evaluating the different approaches.
We believe the biggest advantage of the IFM approach is that it tries out multiple strategies, represented by different queries, which makes the approach very robust. The idea is that one or more of the strategies will probably work out and produce good results. This is in contrast to the rank-biasing approach, where we put all our "faith" into one complex query; if that query fails, the overall results are not good. The underlying assumption for the IFM approach is that there are many matching documents available in the corpus per contextual search query. Its focus then is on precision: it may miss a few good documents, but those returned should have high quality.
Another advantage is that the IFM approach can easily be adapted to work with many different search engines. In addition, from an engineering perspective IFM has more flexibility in what functionality to implement, since it is more loosely coupled and does not have to adhere to internal search engine requirements and performance constraints. Furthermore, since IFM does not depend on a search engine's
production cycle, there will be a quick turn-around when experimenting with new ranking functions.
One disadvantage of this approach is that it may not be as efficient as a tightly coupled approach with internal access to a search engine. However, its overall flexibility and cost efficiency make this approach very attractive. It should be noted that the effort required to build rank-biasing was much larger than the effort required to adapt IFM for contextual search (which required only a few days' work and little tuning effort). While this evaluation does not prove that search-engine modifications have no role in the larger problem of contextual search, it does suggest that initial work in this area is more efficiently pursued on top of general-purpose engines rather than inside of them.
As described in Chapter 2, IFM comprises two major components, focusing on query generation and filtering (the latter is replaced by ranking in this case). The remainder of this section describes these in more detail.
6.3.1 Query Generation
IFM uses queries to retrieve document locators (plus meta-data) from a search engine. In the
case of the buying guide finder IFM used query templates as a basis for generating queries. The overall
idea is to send many queries that explore the problem domain in a systematic way to the search engine.
Each query can be seen as a different strategy that should be explored. IFM then merges the results into
a candidate set, which will then be filtered according to IFM’s own ranking and filtering algorithm.
To implement contextual search, IFM also uses query templates. As a first step IFM derives a term candidate pool from the query and the context term vector. Query templates contain placeholders, which we fill in with terms from that pool. When these templates are expanded, each query template may result in possibly many queries.
As an example, if we have a query q and a context term vector ctv = (a, b, d, e), where the terms a, b, d, e are sorted by decreasing weight, a query template may generate the following list of queries: qa, qb, qd, qe, qab, qbd, qde, qabd, qabe, and qabde.
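The following sketch shows this kind of template expansion (Python; the template shown, which combines the query with subsets of the highest-weighted terms of the candidate pool, is illustrative and not one of the four templates actually used in our experiments):

from itertools import combinations


def expand_template(query: str, ctv: dict, max_terms: int = 4, max_queries: int = 10):
    # Illustrative query-template expansion: combine the user's query with
    # subsets of the highest-weighted context terms. Each generated query
    # represents one strategy to be sent to the backend search engine.
    pool = sorted(ctv, key=ctv.get, reverse=True)[:max_terms]  # term candidate pool
    queries = []
    for size in range(1, len(pool) + 1):
        for combo in combinations(pool, size):
            queries.append(" ".join((query,) + combo))
            if len(queries) == max_queries:
                return queries
    return queries


# ctv = (a, b, d, e), sorted by decreasing weight
for generated in expand_template("q", {"a": 0.9, "b": 0.8, "d": 0.6, "e": 0.4}):
    print(generated)  # q a, q b, q d, q e, q a b, q a d, ...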
Overall, query templates simply provide a convenient notation for specifying what type of queries to construct from the term candidate pool, since in most cases we do not want to exhaustively build all possible query combinations: we want to pick only the strategies that are most likely to produce good results. We achieved this by combining query and context terms in such a way that higher-ranked terms and the actual query are always present; these higher-ranked terms are then combined with different permutations of lower-ranked terms. Our experiments indicated that a much smaller subset of all possible queries (we used just four query templates) is sufficient.
The approach of using query templates is also very attractive for the following reason: it issues different queries, based on permutations of the term candidate pool, that will eventually return documents with different degrees of overlap with the context term vector. In some sense each unique query, besides representing a different strategy, also represents a different view of the context, and the results returned by the search engine are relevant to that particular view. Some of them may be very similar to the original document (in case the query contained most of the context terms), while others are quite different, but there is still some similarity since each result was derived from at least a few context terms.
To populate the term candidate pool we merge the query and the terms of the context term vector. Longer queries (more than two terms) can optionally be segmented into their base concepts. For example, the query britney spears concert could be segmented into britney spears and concert.
We should point out that query generation in BGF required significant effort: we used an unsupervised learning algorithm, PMI-IR, to first build up a pool of term candidates (which was quite expensive). In addition, terms were typed, and the query templates were therefore chosen more carefully, exploiting the term types, based on continuous testing and tuning. In contrast, the query templates we chose for contextual search are quite simple and required only a little tuning. We believe that further work concentrating on query generation may help to further increase IFM's relevancy for contextual search.
6.3.2 Implementing Ranking in IFM
As mentioned earlier, in our buying guide finder IFM would use a doc-type classifier to make a filtering decision on whether a document should be considered a buying guide or not. In the case of contextual search the decision is not a binary one. We therefore have to apply some re-ranking to the returned documents so that the documents that best match a contextual search query are ranked high. As stated earlier, the key motivation for this chapter was to find out whether IFM can be successfully applied to this scenario, where it has to do ranking instead of filtering.
6.3.2.1 Determining a Document Set for Ranking
If we look again at approach three in Figure 6.1, we can see that we have a choice to make about which result lists to apply ranking and filtering to. First, after step 4 we have n ranked result sets from the search engine. This suggests we could apply a standard metasearch ranking approach to the original search result lists, where we would try to leverage the ranking of the search engine and aggregate the different rankings into a consolidated score. Rank aggregation approaches such as [70] and [23] provide suggestions on how that can be done in a robust and systematic way. A rank aggregation method would work well if URLs appear on average in many result lists (if we have a high variance). When looking at some examples this could be confirmed, making this approach worthwhile to consider.
Alternatively, after step four (see Figure 6.1) we can merge all result elements into one IFM candidate set, in which each URL appears only once. IFM's re-ranking component then uses this candidate set as a basis for ranking and filtering. We show results for both approaches in the experimental section.
6.3.2.2 Ranking on Snippet or Document Data
We also had to decide what data to use as a basis for our ranking and filtering.
There are two alternatives:
1. Scoring based on the search result element meta-data (e.g., title, summary, terms) that the search
engine backend returns.
2. Scoring based on the original document.
From our previous work on BGF we knew that using search result element meta-data (snippets) is not optimal, for several reasons: it lacks important terms, and it depends on the algorithm the search engine uses and on the original search query (which may not help us decide whether a result element is a good contextual match). Initial experiments confirmed that snippet data is indeed not optimal for our ranking purposes, so we chose to do scoring based on the original document.
6.3.2.3 A Rank Aggregation Framework for Contextual Search
If we use the original document source as a basis for our re-ranking and filtering, we need to think about a ranking function that is optimal for contextual search. In earlier experiments we had successfully used a scoring function based on the cosine similarity between the query/context term vector and a term vector generated from the document itself. For this reason we definitely wanted to run experiments using a cosine similarity scoring function: result elements are sorted by decreasing similarity score and the top k ranked elements are returned to the user.
However, it is not clear that there exists one perfect scoring function that is most suitable for contextual search. We were also not strongly in favor of any particular scoring function (although we had seen some encouraging results with cosine similarity earlier). Overall there are potentially infinitely many scoring functions that could be applied to the problem of contextual search. So instead of using just one, it seemed reasonable to use many. Given that assumption, we needed a fair and robust way of combining and aggregating the different scores.
Our approach was based on rank aggregation methods originally proposed by [23], and is also similar to [28] and [27]. Our use of rank aggregation was motivated in part because it produces
a flexible system, in which we can easily experiment with different scoring functions. Furthermore, it
allows combining scoring functions in a robust, fair, and systematic way to produce high quality results
for contextual search queries.
In our IFM system we therefore define a simple interface for a scoring function: it takes as input parameters a document, a query, and a context term vector, and returns a score. The higher the score, the more relevant the document is considered. Assume we have k scoring functions and n documents. We pass each document into each scoring function, so we obtain k scores per document, resulting in k lists of document scores. Each list is then sorted by its score in decreasing order, with ties broken arbitrarily. To calculate the aggregated score for each document d we calculate its average or median rank. We then sort all documents by their aggregated score to obtain IFM's final ranking order.
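A compact sketch of this aggregation scheme (Python; the scoring-function interface and the median-rank aggregation follow the description above, while the function and variable names are illustrative):

from statistics import median
from typing import Callable, Dict, List, Tuple

# A scoring function takes (document text, query, context term vector) and
# returns a score; higher means more relevant. Lower-is-better scorers (e.g.,
# a spam-term count) can be plugged in by negating their score.
ScoringFn = Callable[[str, str, Dict[str, float]], float]


def aggregate_rankings(documents: List[Tuple[str, str]], query: str,
                       ctv: Dict[str, float], scorers: List[ScoringFn]) -> List[str]:
    # Score every (doc_id, text) pair with every scoring function, turn each
    # score list into a ranking (ties broken arbitrarily by the sort), and
    # order documents by their median rank across all scorers.
    ranks: Dict[str, List[int]] = {doc_id: [] for doc_id, _ in documents}
    for scorer in scorers:
        ordered = sorted(documents, key=lambda d: scorer(d[1], query, ctv), reverse=True)
        for position, (doc_id, _) in enumerate(ordered, start=1):
            ranks[doc_id].append(position)
    return sorted(ranks, key=lambda doc_id: median(ranks[doc_id]))

The returned list is IFM's final ranking order; the top k elements of it are what IFM returns to the user.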
We used a total of 10 different scoring functions. Some of them were based on the overlap between the context term vector and the document; others considered term frequencies in various ways. One used cosine similarity, and another used the original aggregated rank. There are probably many more one could think of. Again, our flexible rank aggregation architecture allows different scoring functions to be plugged in easily for further experimentation.
6.3.2.4 Scoring Functions
The following scoring functions were used:
Cosine Determines the cosine similarity between the query/context term vector and the vectorized document. Higher is better.
Vector coverage Determines the percentage of terms from the query/context term vector that occur in
the document. Higher is better.
Weighted vector coverage Determines the percentage of the sum of weighted terms from the context
term vector (using the term weights of the context term vector) that occur in the document. Higher is better.
TF Counts term frequencies of terms in the document that occur in the query/context term vector. Higher
is better.
Weighted TF Counts term frequencies of terms in the document that occur in the query/context term
vector and multiplies them by the weight of the term in the context term vector. Higher is better.
Number of spam terms We count the frequencies of terms from the context term vector in the document. If a term has a frequency over a very high threshold it is considered a spam term, and we count the number of such terms. Lower is better.
Aggregated original rank This uses the aggregated original rank from the search engine ranking. Lower
is better.
Number of outlinks We rank pages with fewer links higher, to favor pages with more information and fewer links. Lower is better.
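Toy versions of three of the scoring functions listed above, matching the interface from the previous sketch (Python; deliberately simplified: a real implementation would tokenize the document properly, and the spam threshold of 50 is an arbitrary illustrative value):

def weighted_vector_coverage(text: str, query: str, ctv: dict) -> float:
    # Fraction of the total context weight whose terms occur in the document.
    total = sum(ctv.values()) or 1.0
    return sum(w for term, w in ctv.items() if term in text) / total


def weighted_tf(text: str, query: str, ctv: dict) -> float:
    # Term frequencies of context terms in the document, weighted by each
    # term's weight in the context term vector.
    return sum(text.count(term) * w for term, w in ctv.items())


def negated_spam_term_count(text: str, query: str, ctv: dict, threshold: int = 50) -> float:
    # Number of context terms whose frequency in the document exceeds a very
    # high threshold; negated so that "higher is better" holds for aggregation.
    return -float(sum(text.count(term) > threshold for term in ctv))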
6.4 Evaluation and Results
This section first presents the methodology used to evaluate contextual searches; in particular,
it outlines the modifications to standard IR evaluation techniques [20] to account for contexts, web search
results, and user experience. It also offers some details and examples on the editorial guidelines. We then
define the metrics used for evaluation, present the experimental results, and conclude with some editorial
observations and discussion.
6.4.1 Methodology
We base our evaluation on document relevance judgments from expert judges. In the standard
Cranfield model [20], judgments are issued with respect to a topic (or query) in a blind setting to remove
bias. The evaluation is “end-to-end”, that is, we are only concerned with whether a contextual search
technique provides a boost in overall relevance and not with the quality of intermediate data such as
context vectors, SQ and CQ (simple and complex queries). Our methodology differs from the standard
framework in three ways:
Relevancy to the Context and the Query Expert judges are asked to provide relevance judgments for
documents returned by the query given the context; that is, they are asked to read the context as
well prior to issuing any judgment.
Perceived Relevance Search engines always return document titles and abstracts that are generated
based on document summarization techniques and highly tailored to the given query (recall that
query here refers to SQ and CQ queries constructed from the original query and the context). In
our setting, judges base their relevance judgment on the title, abstract, and URL of the document.
Automatic document summarization engines have improved substantially with time, compared to
a decade ago when they first appeared, and typically do a good job distilling the gist of the document. Current research [2] has shown that perceived relevance – judgment based on summary
rather than the contents of the entire document – is cost-effective and keeps the same level of
sensitivity and resolution.
Enhancement to User Experience In addition to relevance, expert judges are asked if a particular document appears to bring in more information – background, details, complementary information –
that is not originally in the context but is highly relevant. In other words, we want to know if a
particular document not only meets but “exceeds” the information needs expressed by the context
and query.
6.4.2 Judgment Guidelines
A judge is asked to select a context/query pair that he or she feels capable of judging reliably
out of a pool of contexts. Once selected, the pair is removed from the pool. Given a context and a query,
a judge is presented with a list of web results, one at a time. For each web result, the judge answers the
following question, “Is this result relevant to the query and the context?” Judgments are captured in a
4-point scale:
• Yes (relevant)
• Somewhat (somewhat relevant)
• No (not relevant)
• Can’t Tell.
The editorial guidelines define the relevance judgments as follows:
Yes: The result is relevant with respect to the context and query. A relevant result is not necessarily
enhancing, however it must bear a connection to the context plus query beyond the superficial. In
addition, the result must demonstrate a connection to the gist of the context plus query (not just
provide word matches or simple relationships).
Somewhat: The result is somewhat relevant to the ideas expressed by the full context and query, or is
fully relevant to one aspect of the context and/or query. A duplicative or repetitive result that is
nonetheless relevant should receive a Somewhat judgment.
No: The result is not relevant to the context and query, or is outside the scope of the context. A result
that matches keywords, but does not provide relevant information should be judged No.
Can’t Tell: The result is unclear in its relationship to the context (this should be used sparingly – most
situations where there is no apparent relevance should receive a No judgment – see example below).
Consider the following context:
It discusses at length the Dallas Cowboys’ (an American pro-football team) decision to
release quarterback Quincy Carter. The closing paragraph mentions the fact that this action
leaves Vinny Testaverde as the starting quarterback.
A web result relating directly to the Dallas Cowboys (e.g., the official website with news flashes)
and Quincy Carter would typically be judged “Yes”. A result about Jimmy Carter, the former president
of the U.S., would be judged “No”. A result entitled “Cowboy History and Information” with an abstract
that does not clarify whether this refers to Dallas Cowboys or an ordinary cowboy would typically receive
a “Can’t Tell”.
In addition to relevance the judges were also asked to answer the question: “Does it enhance
the user’s search experience?” in a 4-point scale: Yes, Somewhat, No, and Can’t Tell. A Yes answer means
that the result adds to the context, giving a perception of offering different and interesting information.
The editorial guidelines define the enhancement judgments as follows:
Consider the information need/interests that the user expressed by selecting to further explore the context.
Yes: The result significantly adds to the context and query, offering different and interesting information
(or having the perception of doing so). A Yes result provides background information, additional
details, or other complementary information. The important aspect is that it understands the information need as expressed through the context and query and provides an enhancing addition. A
Yes result educates the user on the subject at hand.
Somewhat: The result adds information to the original context and query, but doesn’t go beyond information that could be found by a similar web search.
No: The result provides information that would not satisfy the perceived need of the user. Incomplete,
repetitive, or duplicative information should receive a No judgment.
Can’t Tell: The result is unclear as to whether it adds to the context and query (this should be used
sparingly – most situations where there is an unclear contextual enhancement should receive a
’no’ judgment).
For example, a result containing Quincy Carter's career stats as a Dallas quarterback would be enhancing. Arguably, a Somewhat answer would be applicable if the result were a list of all starting quarterbacks for the Dallas Cowboys and their tenures over the last thirty years. A No judgment would certainly be applied to articles that simply repeat the information in the context; for example, an article by a different reporter, published at a different time, about the very same subject. Obviously, a result about Jimmy Carter would receive a No judgment.
In summary, relevance judgments attempt to capture relatedness, whereas enhancement judgments seek to capture added value. For our experiments we focus primarily on enhancement, since it is a much stronger criterion than relevance. Within enhancement we typically prefer enhancement at five (E@5) or strong enhancement at five (SE@5).
6.4.3 Experimental Setup
We use a test benchmark of 100 contexts. Contexts are paragraphs of text selected from a random set of recent news articles, top web pages, or other web-based sources. The minimum length of a context was 11 words, the average 56, the median 49, and the maximum 184 words. For each context, a separate group of expert editors was asked to come up with three to five reasonable related web search queries. One query was randomly selected from this list to form the context/query pair. The average query length, in words, was 2.83, which is commensurate with other published figures [73], [83].
We use two major standard engines, SE1 and SE2, and one modified engine ME1. ME1 is SE1 with a custom interface that implements the rank-biasing approach. The rationale is to see if the techniques are sensitive to backend engines with substantially different ranking algorithms and index sizes (but of the same order of magnitude). We evaluate twelve different contextual search configurations,
each returning a maximum of five search results:
QR-1: This search configuration is based on the query-rewriting approach using SE1 (see Figure 6.1, approach number 1). The number of terms used from the context term vector is capped at five to prevent overly long queries.
QR-2: This search configuration is based on a query-rewriting approach using SE2 (see Figure 6.1
approach number 1): Same configuration as QR-1, different backend.
RB-2-1: This search configuration is based on the rank-biasing approach using SE1 (see Figure 6.1
approach number 2). Given a context/query pair, if the original query is a single-term query, CQ
is composed by the original query along with one context term; otherwise, CQ is just the original
query. Two context terms are used as ranking adjustors, equally weighted.
RB-5-1: This search configuration is based on the rank-biasing approach using SE1 (see Figure 6.1
approach number 2). Given a context/query pair, if the original query is a single-term query, CQ
is composed by the original query along with one context term; otherwise, CQ is just the original
query. Five context terms are used as ranking adjustors, equally weighted.
RB-2-2: This search configuration is the same as RB-2-1 but using SE2.
RB-5-2: This search configuration is the same as RB-5-1 but using SE2.
IFM-COS-1: This search configuration is based on our IFM approach (see Figure 6.1 approach number
3) using cosine similarity ranking on SE1.
IFM-RA-1: This search configuration is based on our IFM approach (see Figure 6.1 approach number
3) using rank aggregation on original results from SE1.
IFM-RAVG-1: This search configuration is based on our IFM approach (see Figure 6.1 approach number 3) using rank aggregation on IFM candidate set on SE1.
IFM-COS-2: This search configuration is the same as IFM-COS-1 but using SE2 as backend.
IFM-RA-2: This search configuration is the same as IFM-RA-1 but using SE2 as backend.
IFM-RAVG-2: This search configuration is the same as IFM-RAVG-1 but using SE2 as backend.
We also present experimental results using these 100 contexts but without the query. We refer
to these experiments as “query-less”. The goal is to obtain a set of baseline numbers that factor out the
query selection process. For this test, RB-2-1 and RB-5-1 are slightly modified to always take one and
three context terms as CQ.
6.4.4 Metrics
For our experiments we are using the following metrics:
Precision at 1 and 5 (P@1, P@5): Defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision.
Strong Precision at 1 and 5 (SP@1, SP@5): Defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision.
Enhancement at 1 and 5 (E@1, E@5): Defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
Strong Enhancement at 1 and 5 (SE@1, SE@5): Defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
Coverage: Defined as the percentage of context/query pairs that return at least one result.
NumResults: Defined as the total number of results returned. Coverage and NumResults are reasonable proxies for recall.
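As a sketch, precision (or enhancement) at k can be computed from the judgments of one context/query pair as follows (Python; the Yes/Somewhat handling mirrors the definitions above, and the variable names are illustrative):

def metric_at_k(judgments, k=5, strong=False):
    # judgments: labels ("Yes", "Somewhat", "No", "Can't Tell") for the ranked
    # results of one context/query pair, best rank first. Returns the share of
    # the top-k retrieved results counted as relevant/enhancing (Yes only if
    # strong=True, Yes or Somewhat otherwise), expressed as a percentage.
    accepted = {"Yes"} if strong else {"Yes", "Somewhat"}
    top = judgments[:k]
    if not top:
        return 0.0
    return 100.0 * sum(label in accepted for label in top) / len(top)


labels = ["Yes", "Somewhat", "No", "Yes", "Can't Tell"]
print(metric_at_k(labels, k=5))               # P@5 or E@5  = 60.0
print(metric_at_k(labels, k=5, strong=True))  # SP@5 or SE@5 = 40.0
print(metric_at_k(labels, k=1, strong=True))  # SP@1 or SE@1 = 100.0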
6.4.5 Results
We conducted two experiments: one using the context/query pairs benchmark and another
using the contexts-only benchmark. Between the two tests, we collected 7,013 judgments using a group of
19 expert judges. In both tests, most contexts were judged by more than one judge; all contexts were
covered by at least one judge. Multiple judgments by different judges on a given result are averaged to
form one judgment.
At this point we focus on relevancy and enhancement only, not on coverage, since we observed that overall coverage did not seem to be an issue in either scenario (context only; query and context). Per contextual search query we requested up to five results, and all providers achieved 100% coverage (except QR-2, which was slightly below 100%; this was expected since we limit recall with longer queries).
Because our benchmark set is rather small, we applied bootstrap sampling [24, 42] to obtain
confidence intervals for our precision and enhancement metrics. The method used is a form of subsampling where we create several (1,000 in our case) replicate data sets from a single data set. More
specifically:
In our benchmark we have n query/context pairs Q (n = 100). Each Q has up to k ranked result elements r (in our experiments we set k = 5). We denote by r_{i,j} the result element for Q_j at position i (1 ≤ i ≤ k and 1 ≤ j ≤ n), and s(r_{i,j}) represents the judgment function that assigns to each r_{i,j} a relevancy score. We construct a judgment score matrix S = [s(r_{i,j})].
We support the following relevancy scores:
• 1.0 (“Yes” judgment)
• 2.0 (“Somewhat” judgment)
• 3.0 (“No” judgment)
• 4.0 (“Can’t tell” judgment)
In case we have more than one judgment for r_{i,j}, we average the scores, so that we have an aggregated judgment score 1.0 ≤ s(r_{i,j}) ≤ 4.0.
To do judgment-level bootstrapping we take the judgment score matrix S as input and generate 1,000 replicated versions S_m, where 1 ≤ m ≤ 1000. To generate a replicated judgment score matrix S_m we randomly pick s(r_{i,j}) values (with replacement) and copy them to S_m. Since we want S_m to have the same dimensions as S, we do this k · n times for each S_m.
Once we have our 1,000 replicas of S we evaluate the relevancy (or enhancement) metric for each replica S_m, and then calculate the mean relevancy x̄ (or enhancement) and standard deviation σ over these replica scores. Assuming an approximately normal distribution, we can then represent a 95% confidence interval as

x̄ ± 1.96 · σ
For each replicate set, we recomputed the precision and enhancement metric as outlined above,
thereby obtaining an empirical distribution of the metrics. We report the average of that distribution. For
all distributions, 95% confidence intervals corresponded to roughly +/- 3%.
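A sketch of this judgment-level bootstrap (Python with NumPy; the 1,000 replicas and the normal-approximation interval follow the procedure above, while the example metric passed in is only a crude illustration):

import numpy as np


def bootstrap_ci(score_matrix, metric, replicas=1000, seed=0):
    # score_matrix: k x n array of (averaged) judgment scores s(r_{i,j}).
    # metric: function mapping a k x n score matrix to a single number.
    # Returns (mean, half-width) of the 95% normal-approximation interval.
    rng = np.random.default_rng(seed)
    S = np.asarray(score_matrix, dtype=float)
    values = []
    for _ in range(replicas):
        # Resample k*n judgment scores with replacement to build one replica S_m.
        replica = rng.choice(S.ravel(), size=S.shape, replace=True)
        values.append(metric(replica))
    values = np.asarray(values)
    return values.mean(), 1.96 * values.std(ddof=1)


# Crude example metric: percentage of rank-1 scores judged "Yes" (score 1.0),
# i.e. a rough proxy for a strong metric at position 1.
strong_at_1 = lambda S: 100.0 * np.mean(S[0] <= 1.5)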
6.4.5.1 Context only
Table 6.1 shows the Strong Precision and Strong Enhancement metrics for the context-only
experiment, and Table 6.2 shows the same experiment conducted with SE2. In these experiments we
didn’t use a query. The task for contextual search in this scenario becomes a “more like this” task: Given
a context, return related information.
We can see that QR-1 performs lowest with SE1 (both the RB and IFM approaches work better) and therefore, as expected, represents our baseline. Surprisingly, QR-2 with SE2 performs remarkably well, very similar to IFM on SE2. This indicates that there is a difference between SE1 and SE2 in the ranking algorithm for queries of five to six terms. It seems that SE2 is better able to interpret and rank queries of this size. Since we send query terms in order of decreasing weight, if SE2 gives a boost to terms that appear first, this could explain the behavior. SE1 instead may treat all terms equally or may use a different ranking scheme (difficult to tell without knowing the internals). However, what we do know so far is that SE2 seems to work well for query-rewriting when the query has five to six terms (but there still might be a recall issue).
RB-5-1 works better than RB-2-1. This indicates that using more rank operators with smaller weights for biasing seems to be beneficial. RB-5-1 outperforms QR-1, and is stronger than IFM in the enhancement metrics. This suggests that we might have a recall problem: results that are enhancing are buried deep in the search engine, and the only way to retrieve them is by boosting them internally. This would explain why IFM cannot match RB's performance for enhancement, since it does not retrieve deep results. Typically in our experiments IFM would send queries and obtain the top 20 ranked results, or sometimes as deep as the top 100. But if enhancing results appear deeper in the result list, it is very difficult or almost impossible for IFM to retrieve them. Since this would represent a major limitation of the IFM approach when using ranking, we dedicate Section 6.4.6 to shedding light on this problem.
C (SE1)        SP@1      SP@5      SE@1      SE@5
QR-1           56.38%    53.85%    39.36%    40.72%
RB-2-1         63.27%    53.18%    61.22%    49.69%
RB-5-1         68.48%    63.68%    58.70%    52.12%
IFM-COS-1      56.70%    49.48%    42.27%    40.41%
IFM-RA-1       60.82%    58.76%    41.24%    44.95%
IFM-RAVG-1     59.79%    57.94%    44.33%    43.92%

Table 6.1: Experimental results for Strong Precision (SP) and Strong Enhancement (SE) in the context-only scenario (query-less) using the SE1 search engine. Strong Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision. Strong Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
From all IFM approaches we can see that the one using cosine similarity for ranking performs
worst. This seems to indicate that the cosine similarity measure may not be too useful for ranking of
contextual search results. However, the cosine similarity measure depends on how we generate term
vectors from the documents, and the context. It might be the case that different vectorization techniques
may produce better results.
The difference between IFM-RA-1 and IFM-RAVG-1 is also very interesting: in the case of IFM-RA-1 we simply aggregate the ranks from the original search engine result sets. This method is much less expensive to implement than IFM-RAVG-1, where we have to fetch the actual document, perform preprocessing and parsing, and apply a number of scoring functions before we obtain an aggregated score. This finding provides a strong motivation to make IFM-RA-1 the preferred choice from an engineering perspective: even if both perform at similar levels, the extra work required for IFM-RAVG-1 does not seem worthwhile.
When comparing the overall performance of SE1 and SE2 we can see that in general SE2 performs slightly better than or similar to SE1 for the IFM approaches, but performs quite a bit better for simple query-rewriting, as pointed out earlier.
Table 6.3 shows the numbers for Precision and Enhancement that we obtained when using SE1, and Table 6.4 shows the same measurements taken with SE2.
C (SE2)        SP@1      SP@5      SE@1      SE@5
QR-2           63.16%    56.16%    50.53%    47.03%
IFM-COS-2      54.64%    49.07%    42.27%    39.18%
IFM-RA-2       65.98%    58.56%    51.55%    47.63%
IFM-RAVG-2     62.89%    54.64%    51.55%    44.54%

Table 6.2: Experimental results for Strong Precision (SP) and Strong Enhancement (SE) in the context-only scenario (query-less) using the SE2 search engine. Strong Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision. Strong Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
C (SE1)        P@1       P@5       E@1       E@5
QR-1           88.30%    83.48%    60.64%    56.11%
RB-2-1         84.69%    78.64%    72.45%    61.19%
RB-5-1         93.48%    88.68%    72.83%    64.15%
IFM-COS-1      81.44%    74.64%    55.67%    54.43%
IFM-RA-1       89.69%    83.92%    57.73%    59.38%
IFM-RAVG-1     87.63%    84.33%    56.70%    57.53%

Table 6.3: Experimental results for Precision (P) and Enhancement (E) in the context-only scenario (query-less) using the SE1 search engine. Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision. Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
We notice a similar trend compared to the numbers for strong precision. QR-2 is still better than QR-1. Overall QR-2 works remarkably well, even slightly better than the best IFM approach in P@5 on SE2. Both RB approaches work better than IFM, especially in the enhancement metric. We definitely need to investigate further why this is the case (see Section 6.4.6).
6.4.5.2 Context and Query
We then conducted the same experiment now using a query and a context. This represents a
typical contextual search scenario, where we have a context, and use that context somehow to augment
C (SE2)        P@1       P@5       E@1       E@5
QR-2           92.63%    86.76%    62.11%    60.96%
IFM-COS-2      76.29%    71.13%    52.58%    54.23%
IFM-RA-2       95.88%    85.77%    65.98%    62.06%
IFM-RAVG-2     91.75%    83.71%    69.07%    60.82%

Table 6.4: Experimental results for Precision (P) and Enhancement (E) in the context-only scenario (query-less) using the SE2 search engine. Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision. Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
C+Q (SE1)      SP@1      SP@5      SE@1      SE@5
QR-1           48.57%    48.56%    35.71%    37.06%
RB-2-1         58.33%    54.17%    52.08%    48.96%
RB-5-1         61.46%    58.25%    56.25%    51.98%
IFM-COS-1      53.66%    48.91%    37.80%    38.20%
IFM-RA-1       58.54%    54.74%    41.46%    43.31%
IFM-RAVG-1     50.00%    49.76%    36.59%    37.38%

Table 6.5: Experimental results for Strong Precision (SP) and Strong Enhancement (SE) in the context plus query scenario using the SE1 search engine. Strong Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision. Strong Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
the user’s query.
Table 6.5 shows the metrics for Strong Precision and Strong Enhancement of the context and
query experiment on SE1.
Overall the results of the context and query scenario are similar to the context-only numbers. Notice that QR-2 and QR-1 now perform very similarly. Rank-biasing is slightly better than IFM in relevancy (given our roughly 3% margin of error), but the gap is significant for enhancement. As mentioned before, we explore this observation further in Section 6.4.6.
The best IFM and RB configurations outperform QR considerably. IFM performs slightly better on SE2.
C+Q (SE2)      SP@1      SP@5      SE@1      SE@5
QR-2           50.00%    47.32%    32.89%    36.90%
IFM-COS-2      43.90%    48.54%    34.15%    40.78%
IFM-RA-2       64.63%    57.18%    47.56%    49.39%
IFM-RAVG-2     56.10%    51.22%    46.34%    42.68%

Table 6.6: Experimental results for Strong Precision (SP) and Strong Enhancement (SE) in the context plus query scenario using the SE2 search engine. Strong Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes-only for Precision. Strong Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes-only judgment for enhancement.
C + Q (SE1)    P@1       P@5       E@1       E@5
QR-1           82.86%    82.11%    54.29%    50.16%
RB-2-1         78.13%    78.96%    69.79%    65.63%
RB-5-1         77.08%    75.57%    73.96%    66.39%
IFM-COS-1      80.49%    76.89%    56.10%    55.23%
IFM-RA-1       81.71%    79.56%    52.44%    56.45%
IFM-RAVG-1     75.61%    75.24%    53.66%    53.16%

Table 6.7: Experimental results for Precision (P) and Enhancement (E) in the context plus query scenario using the SE1 search engine. Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision. Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
The most surprising observation in Tables 6.7 and 6.8 is that QR-1 now substantially outperforms QR-2 in relevancy (they are still similar in the enhancement metric). In the previous experiments it was the other way around. The only difference between the two scenarios is that when we have a query, we use it as a prefix for the rewritten query, which makes the rewritten query even longer. For example, a query with three terms results in up to eight terms being used in the query-and-context scenario, whereas in the context-only scenario we use only up to five terms from the context. This suggests that overall SE1 seems
C + Q (SE2)    P@1       P@5       E@1       E@5
QR-2           76.32%    75.30%    44.74%    50.30%
IFM-COS-2      80.49%    76.21%    46.34%    53.64%
IFM-RA-2       82.93%    82.73%    60.98%    65.45%
IFM-RAVG-2     81.71%    77.56%    59.76%    55.37%

Table 6.8: Experimental results for Precision (P) and Enhancement (E) in the context plus query scenario using the SE2 search engine. Precision is defined as the number of relevant results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered relevant if it receives a judgment of Yes or Somewhat for Precision. Enhancement is defined as the number of enhancing results divided by the number of retrieved results, but capped at one (or five), and expressed as a percentage. A result is considered enhancing if it receives a Yes or Somewhat judgment for enhancement.
to work better with longer queries for producing relevant results.
Also, IFM now slightly outperforms RB in relevancy, but RB is still better for the enhancing
metric.
Overall we are very pleased with IFM's performance, given the amount of effort that went into developing the IFM prototype for contextual search: RB was developed and tested by an engineering team, and tuned by dedicated relevancy teams and experts, with a total of several person-months of work. In contrast, the IFM prototype was implemented within a few days of work, with relatively little tuning and no editorial support at all. Considering this, and the fact that IFM significantly outperforms QR (which is already a strong baseline), we conclude that IFM represents a cost-effective solution for developing specialized search engines that performs well with relatively little tuning effort.
To sum up, these are our key insights:
• Precision isn’t the differentiating metric, enhancement is.
• The QR approach has, as expected, the lowest relevancy of all approaches in the context-only scenario, but still achieves reasonable precision. In the context and query scenario QR-1 even outperforms RB and IFM in relevancy, but has similar numbers for enhancement.
• QR's relevance is strongly dependent on the underlying engine.
• Both IFM and RB generally outperform QR significantly.
• IFM and RB are similar in relevancy; RB has a slight edge.
• RB beats IFM in the enhancement metric.
The key question arising from these results is why RB does a better job of returning enhancing results in both scenarios. We investigate this in more detail in the next section.
6.4.6 Does IFM have Recall Limitations?
The advantage of the rank-biasing approach is that it can find and return deep results (the "needle in the haystack"), whereas IFM is "fishing" more at the surface of the typically shallow result sets returned by the search engine. This works well in a precision scenario, where we have many good results and our focus is on eliminating bad ones. For example, in our experiments IFM would fetch only 20 results per query. Increasing this number is expensive, since IFM needs to process all result elements. Also, there are typically limits imposed by the search engine on how many results can be retrieved per request.
If it turns out that recall causes a major problem for IFM in the contextual search scenario, this would indicate a general weakness of the IFM approach, making it generally not a preferred choice for scenarios where ranking is involved: some applications may require recall and the ability to look at results that are not at the surface. For other applications this may not be an issue. But typically one may not know in advance whether recall is an issue. We therefore conducted another experiment to investigate this question.
What we observed in our experiments is that for many contexts and queries (see Tables 6.3, 6.5, and 6.7) IFM obtains a relevancy similar to RB (sometimes even better), but loses significantly against RB in the enhancement metrics. For example, for E@5 the difference in Table 6.7 is about 13%, and for SE@5 in Table 6.5 the gap is almost 8%.
To verify whether we really have a recall problem, we picked 12 context/query pairs from our benchmark where RB performed very well in relevancy and enhancement based on the judgments. In fact, we sorted the list of contexts by their average enhancement and relevancy score and picked the top-ranked contexts based on this score. For all of them, all five results had been judged enhancing.
We then issued a rewritten, simple query (using QR-1) derived from each of these contexts and queries to SE1 and requested 1,000 results. Also, we issued a complex query using RB-5-1 and retrieved the top five results. We knew from the relevance judgments that these results from RB-5-1 are highly relevant and enhancing.
In the next step we looked at where each of these five result elements was positioned in the long result list obtained with QR-1. The goal was to obtain a rank tuple (url, before, after) for each of these five URLs, recording the URL, its rank position in the list obtained from QR-1, and its rank position in the list obtained from RB-5-1. With this information we could calculate the rank difference for each URL, that is, the difference between the rank of a URL u in the list obtained from QR-1 and the rank of u in the list obtained from rank-biasing.
An example will illustrate this better: assume we have a URL u ranked at position 1 in the list obtained from RB-5-1 and at position 307 in the list obtained from QR-1. We would then generate the rank tuple (u, 307, 1), and the rank difference would be |307 − 1| = 306.
We then derived the rank tuples for all 12 context/query pairs. Since each result list of RB-5-1 contains at most five URLs, we obtain up to five rank tuples per context/query pair.
Recall that RB-5-1 sends complex queries that comprise rank operators, each with an associated weight. These weights determine the boost for a term. We ran a total of four experiments using different weight settings. The first weight setting, which we refer to as w1 (see Table 6.9), is the same as the one used in the experiments in the previous section. We then ran the same experiment with a higher weight
setting to measure the differences. The weight settings we used were 2.5 × w1 (Table 6.10), 5 × w1
(Table 6.11), and 10 × w1 (Table 6.12). We have chosen this weight scale based on observations from
earlier tuning experiments with RB-5-1.
The rank tuples provide insight into where RB-5-1's top-ranked URLs came from. If most of them are buried deep in QR-1's result list or, even worse, are not included there at all, we have confirmation that we indeed have a recall problem. Otherwise, if the results were ranked high enough that IFM could actually retrieve them but did not return them high enough, we simply have a ranking problem with IFM. The latter case can be addressed by further tuning IFM's ranking function, whereas the former recall case would indicate a significant problem for applications where recall is an issue.
Table 6.9 shows, per row, the contextual search query number (e.g., Q1), the maximum difference in rank position (Max DR) for this query, the sum of all rank differences (Sum DR), and the number of search result elements that could not be reached by IFM.
We define a result as not reachable by IFM if it has a rank > 20. A search result element that appears in the list of RB-5-1 but not in the list of QR-1 is assigned a rank of 1001 (since the maximum size of the list is 1,000).
An example illustrates how we calculate Max DR, Sum DR, and the number of URLs not reachable by IFM. Let (u, 307, 1) be a rank tuple obtained from Q1. The rank distance is then |1 − 307| = 306. We calculate the rank distance for all five tuples of Q1 and take the maximum (Max DR). Calculating the sum of rank differences (Sum DR) works analogously by taking the sum of the rank distances over these five tuples. In our example u's rank in the QR-1 list is 307, which is greater than 20, so we count it as not reachable by IFM. We can easily see that the number of unreachable URLs per context/query pair can be at most five in our case. At the end we calculate and show averages over the maximum and sum of rank distances, as well as the average percentage of URLs not reachable by IFM per result set.

Query      Max DR    Sum DR    # Not Reachable by IFM
Q1         9         15        0
Q2         1000      2082      3
Q3         998       2007      2
Q4         999       1997      2
Q5         997       997       1
Q6         997       1999      2
Q7         998       2001      2
Q8         206       221       2
Q9         998       1008      1
Q10        31        65        2
Q11        16        19        1
Q12        34        53        1
Average    606.92    1038.67   31.67 %

Table 6.9: Change of rank positions between QR-1 and RB-5-1 using weight w1.
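A small sketch of how the per-query statistics in Tables 6.9 through 6.12 are derived from the rank tuples (Python; the reachability cutoff of 20 and the sentinel rank of 1001 come from the definitions above, and the extra tuples in the example are made up for illustration):

def rank_stats(rank_tuples, reachable_cutoff=20):
    # rank_tuples: (url, rank_in_QR1_list, rank_in_RB51_list) for one
    # context/query pair; a URL missing from QR-1's 1,000-result list carries
    # the sentinel rank 1001. Returns (Max DR, Sum DR, # not reachable by IFM).
    diffs = [abs(before - after) for _, before, after in rank_tuples]
    not_reachable = sum(before > reachable_cutoff for _, before, _ in rank_tuples)
    return max(diffs), sum(diffs), not_reachable


# The tuple (u, 307, 1) is the example from the text; v and w are made up.
print(rank_stats([("u", 307, 1), ("v", 12, 2), ("w", 1001, 3)]))
# -> (998, 1314, 2)   # u and w are not reachable by IFM, v is within the top 20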
Table 6.9 indicates that, on average, between one and two result elements per result set could not be reached by IFM. To be more precise: 31.67% of URLs per result set could not be reached by IFM.
We know that for these selected 12 context/query pairs all URLs had been rated as enhancing. We also know that 31.67% of them are not reachable by IFM, even though they represent enhancing results. Worse, there are still 88 other context/query pairs in our benchmark that are probably affected similarly, but we did not look at these further. Studying all 100 context/query pairs in detail would require a new user study along with new judgments, which is expensive. The experiment provided us with insights and anecdotal evidence as to why RB may perform better. However, at this point we cannot conclude for sure that we have a recall problem.
Based on the relatively small effort we spent on tuning IFM, we have reason to believe that we can further increase IFM’s performance. For example, there might be more good URLs obtained at higher ranked positions that IFM incorrectly ranked too low (a poor ranking strategy). In addition, we pointed out earlier that IFM’s query formulation strategy is quite simple compared to that of BGF.
Query      Max DR     Sum DR    # Not Reachable by IFM
Q1              3          8    0
Q2           1000       3073    4
Q3            999       2007    2
Q4            999       1997    2
Q5            997        997    1
Q6            997       1999    2
Q7            998       2008    3
Q8            525       1394    5
Q9           1000       1998    2
Q10            98        160    3
Q11            17         21    1
Q12            23         53    1
Average    638.00    1309.58    43.33 %

Table 6.10: Change of rank positions between QR-1 and RB-5-1 using weight 2.5 × w1.
Query formulation enhancements also seem likely to improve IFM’s relevancy. However, the fact that IFM cannot reach these 31.67% of URLs remains. Our plan is to further improve IFM’s ranking, using the results obtained from these experiments to guide the tuning.
Tables 6.10-6.12, which use higher weight settings, show that the percentage of unreachable URLs increases further as we increase the boost weight of RB-5-1.
Query      Max DR     Sum DR    # Not Reachable by IFM
Q1              4         16    0
Q2           1000       3073    4
Q3            999       2615    4
Q4            999       1997    2
Q5            997        997    1
Q6            999       1017    2
Q7            999       2009    2
Q8            997       3077    5
Q9           1000       2274    5
Q10           999       1157    4
Q11           755        775    2
Q12            35         53    1
Average    815.25    1588.33    53.33 %

Table 6.11: Change of rank positions between QR-1 and RB-5-1 using weight 5 × w1.
Query      Max DR     Sum DR    # Not Reachable by IFM
Q1            998       1068    2
Q2           1000       3073    4
Q3           1000       3610    5
Q4           1000       2001    2
Q5            999       1996    2
Q6            999       2013    2
Q7            999       3028    4
Q8            997       3077    5
Q9           1000       2274    5
Q10          1000       2128    4
Q11           757        777    2
Q12            96        142    2
Average    903.75    2098.92    65.00 %

Table 6.12: Change of rank positions between QR-1 and RB-5-1 using weight 10 × w1.
Chapter 7
Conclusion
7.1 Summary
The objective of this dissertation was to build a framework and tools with which an advanced web developer or consultant can build a search engine specialized by document type within a few man-weeks of effort.
We conclude that we have reached this goal. First, we have shown in the case of BGF that IFM represents a cost-effective approach for building web carnivores that outperform major search engines within their area of specialization. However, building BGF still required significant expertise and development effort. We addressed this issue by presenting a novel feature engineering system that allows the use of “off-the-shelf” classifiers and performs as well as BGF in the case of buying guides. We have also tested our doc-type classifier on other popular document types – homepages and recipes – and have seen similar accuracy. Finally, we have presented an extension to a web service API for search engines that allows IFM application developers to more easily retrieve meta-data from a search engine that can be used to improve classification accuracy.
Further anecdotal evidence that IFM represents a cost-effective approach came when we implemented IFM for contextual search (described in the previous chapter): it took us literally a few days of work to obtain relevancy very similar to that of rank biasing, which had been a large, ongoing development effort spanning several months and comprising dedicated engineering teams and many relevancy experts for tuning.
Given such a framework for building cost-effective specialized search applications by doc-type, the secondary goal of this thesis was to investigate whether the approach could be applied more broadly to other dimensions of specialization. We explored contextual search as one such area. The interesting aspect, and our motivation for choosing contextual search, was that it requires ranking in order to do filtering.
We saw strong results for relevancy and enhancement when comparing IFM’s performance against query rewriting, which confirms that IFM overall represents a cost-effective approach for building specialized search solutions.
We summarize IFM’s strengths and advantages as follows:
• Easy to implement, requires no internal access to search engines.
• Good for prototyping and rapid development of specialized search solutions.
• High precision: Works particularly well for precision scenarios, especially for filtering, where IFM
can implement complex classification algorithms that could potentially never be integrated into a
web search engine architecture. Filtering and ranking algorithms are not restricted by constraints
of a web search engine.
• Delivers robust results, since multiple queries explore different strategies of the problem domain.
• Our feature engineering method allows IFM applications specialized by document type to be built cost-effectively within a few man-weeks of effort.
We summarize the disadvantages and problems we encountered with IFM as follows:
• Loose coupling with the search engine may not be efficient enough for certain applications, but tight coupling with a search engine is expensive to develop and implement.
• Filtering or ranking is expensive when based on document content (the web service API extensions presented in Chapter 4 mitigate this problem).
• The ideal IFM API for search engine integration has not yet been defined. Promising candidates are “send page” and “send code”, which we want to explore further in future work.
7.2 Future Work
This thesis introduced IFM as a method for cost-effectively generating specialized search engines. Although we have done a great deal of work on IFM over the past two years, there are still many open questions and issues worth exploring in future work.
7.2.1 IFM
We looked at two major areas of specialization for IFM: doc-type and contextual search. There are other interesting areas of specialization that are worth exploring further. With each new area we will probably learn more about IFM’s strengths and weaknesses, as we did when we developed IFM for contextual search.
There are also engineering aspects that remain to be explored more deeply to make IFM more scalable: for example, how can we minimize the number of queries needed per request while still maintaining good relevancy?
7.2.2 BGF
We presented BGF and showed that it returned more relevant buying guides than a user would get directly from Google. Although its performance was already quite good, it would be interesting to tune it further and increase its overall relevancy.
Also, we did not have the time to fully integrate our BGF system with the proposed doc-type classifier. It would therefore be interesting to build a generic “doc-type finder” based on IFM and our doc-type classifier, and to test it on various other document types that we have not explored in this thesis.
We are also intrigued by the possibility of utilizing inter-trial feedback. When observing expert human Web searchers, we see that the results of one search influence the formulation of the next. For example, good results might suggest small refinements, while bad results might suggest a whole new approach. We would like to integrate this and other feedback strategies into our system. We believe our template language provides a good foundation for such work.
We have also performed some initial experiments with more aggressive topic expansion. These early attempts have helped for some topics, but have hurt for others – too many others. We are looking for techniques that help more consistently. In this regard, our work so far suggests that topic discriminators or some other mechanism for reducing topic drift will be important.
Another ongoing aspect of the BGF work is reducing the number of queries that get issued. Our initial code would often issue many hundreds of queries per topic. We have reduced this to under a hundred, but would like to reduce it even further (while improving precision).
7.2.3 IFM Web Service API
What is the ideal IFM API for web search engines? Although our projection API is a first step in this direction, the problem is not yet solved. Sending code (an imperative approach) and sending pages may represent viable alternatives, but both require substantial effort, so we did not include them in the scope of this thesis.
In particular, sending code potentially raises many security concerns. For example, what
are the dimensions for constraining the code (e.g., memory, CPU cycles, I/O requests, access to outside
data sources, temporary disk space, etc.)?
Furthermore, we also pointed out the operation of sending a complete page back to the IFM application. That would give the IFM application the same flexibility as the imperative approach: it can process and analyze the full page and derive any desirable feature set. The major problem with this approach is bandwidth consumption. A typical IFM application sends many simple queries per search request, and each of these queries returns on the order of 10 to 20 URLs. A single IFM request can therefore result in hundreds of documents that would need to be streamed to the client. Some optimizations are possible, for example if the search engine were able to support batch processing of queries along with a streaming API. In this case IFM would send a list of queries for one request; the search engine would evaluate all of them, determine the union of results, and then stream a compressed version of the documents back to the client. There are many engineering challenges. Nevertheless, this approach – if implemented efficiently – may represent a viable alternative worth exploring in future work.
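As an illustration only, the following sketch shows what a batch-plus-streaming client interface might look like; the class and method names (BatchSearchClient, stream_documents) and the transport object are hypothetical and do not correspond to any existing search engine API.

    # Minimal sketch (assumptions): a hypothetical batch interface in which the
    # IFM application sends all simple queries of one request at once and the
    # search engine streams back gzip-compressed documents for the union of the
    # result URLs. The transport object is any stub whose request() method
    # yields (url, compressed_html) pairs; nothing here reflects an existing API.
    import gzip
    from typing import Iterable, Iterator, Tuple

    class BatchSearchClient:
        def __init__(self, transport):
            self.transport = transport  # object with a request(list_of_queries) method

        def stream_documents(self, queries: Iterable[str]) -> Iterator[Tuple[str, bytes]]:
            """Yield (url, html_bytes) for the union of results of all queries."""
            seen = set()
            for url, compressed in self.transport.request(list(queries)):
                if url in seen:            # the union removes duplicate URLs
                    continue
                seen.add(url)
                yield url, gzip.decompress(compressed)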
In addition to sending pages or sending code, there are possibly more APIs that a search engine might expose, including statistics on terms, documents, and posting lists, tokenized documents, or various aggregated statistics (e.g., per site or per directory). One problem that is difficult to address with the proposed term-based projection API arises when features are not based solely on simple, independent terms, for example in the case of feature comparison charts that have rows for products and columns for features (or vice versa). Structures like these require a parser that recognizes such a table structure rather than just a sequence of tokens in the document. It may therefore be interesting to explore more sophisticated parsing techniques that go beyond the simple token stream model in future work, and to see how they could be integrated into the proposed API.
7.2.4 Doc-Type Classification
We have shown that our doc-type classifier works well for three document types. The experimental data suggests that it will also work well for other document types that use similar feature sets. However, showing that this is indeed the case is not easy: while our doc-type classifier may work well for the majority of popular doc-types, there can always be a new document type for which it does not. In that scenario we would probably have to investigate what type of feature set is needed and see whether there is a systematic way of adding that type of feature, so that we would not need to repeat this step for a new but similar doc-type.
Furthermore, we have not experimented with other promising feature sets that are more difficult to obtain. There are many interesting directions to explore in this context. For example, adding meta-data obtained from a global analysis of the document corpus might further enhance accuracy and broaden support for different document types. Adding anchor text, for example, might help for certain document types where the text features of the document alone are too weak as discriminators. The work related to interval encoding [41] might also be useful for certain types of documents.
In addition, there is a plethora of feature selection techniques available; comparing them to the one we used (FDI) would be interesting. There are also more text classifiers available. In our experiments we used Naive Bayes, maximum entropy, decision trees, and winnow. Looking, for example, at support vector machines (SVMs) and comparing the performance of our feature selection and augmentation techniques could provide interesting insights. However, since we already reached the desired accuracy goal using just the Naive Bayes classifier, this seemed to be of lower priority.
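As a sketch of such a comparison, the snippet below contrasts a linear SVM with a Naive Bayes baseline on the same feature vectors, using scikit-learn purely as a stand-in toolkit; it is illustrative only and not the experimental setup used in this thesis.

    # Minimal sketch (assumptions): cross-validated comparison of a Naive Bayes
    # baseline against a linear SVM on the same doc-type feature matrix X with
    # labels y. scikit-learn is used here as a stand-in toolkit.
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def compare_classifiers(X, y, folds=5):
        """Print mean cross-validated accuracy for each classifier."""
        for name, clf in (("naive bayes", MultinomialNB()),
                          ("linear svm", LinearSVC())):
            scores = cross_val_score(clf, X, y, cv=folds)
            print(f"{name}: mean accuracy {scores.mean():.3f}")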
Last but not least, while working with feature sets we envisioned a different form of search engine integration in which we would introduce a feature server. Given a URL or a list of URLs, such a feature server would return customized feature sets that can be used for classification purposes. Such a feature server could be a valuable platform for experimentation with all kinds of feature sets and the applications that can be built on top of them. Designing and building such a feature server also represents a challenging problem from a systems perspective and is worth pursuing further.
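A minimal sketch of such a feature server interface follows; the class and method names and the notion of pluggable extractor functions are hypothetical design assumptions, not an existing system.

    # Minimal sketch (assumptions): a feature server that, given a URL or a list
    # of URLs, returns customized feature sets for classification. Extractor
    # functions (e.g., for anchor text or on-page terms) are supplied by the
    # caller; all names here are illustrative only.
    from typing import Callable, Dict, Iterable, List

    FeatureDict = Dict[str, float]

    class FeatureServer:
        def __init__(self, extractors: Dict[str, Callable[[str], FeatureDict]]):
            # Maps a feature-set name (e.g., "anchor_text") to a url -> features function.
            self.extractors = extractors

        def features(self, urls: Iterable[str],
                     feature_sets: List[str]) -> Dict[str, Dict[str, FeatureDict]]:
            """Return {url: {feature_set_name: features}} for the requested sets."""
            return {url: {name: self.extractors[name](url) for name in feature_sets}
                    for url in urls}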
7.2.5 Contextual Search
We still need to investigate further the qualitative differences between IFM and RB, and in particular why RB outperforms IFM. We have reason to believe that spending slightly more effort on tuning IFM’s ranking may eventually increase IFM’s relevancy to a level similar to or better than RB’s.
Once IFM’s relevancy is similar to or better than RB’s, we need to address IFM’s scalability and performance issues. While RB uses one complex query, IFM uses multiple simple queries. It will therefore be interesting to investigate in more detail the performance and relevancy trade-off between sending multiple simple queries and sending one complex query.
Bibliography
[1] D. W. Aha and R. L Bankert. A comparative evaluation of sequential feature selection algorithms.
In Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, pages
1–7. Ft. Lauderdale, FL: Unpublished. (NCARAI TR: AIC-94-026), 1995.
[2] K. Ali, Y. Juan, and C. Chang. Exploring cost-effective approaches to human evaluation of search
engine relevance. In ECIR ’05, 27th European Conference on Information Retrieval, Santiago de
Compostela, Spain, March 2005.
[3] Niran Angkawattanawit and Arnon Rungsawang. Learnable crawling: An efficient approach to
topic-specific web resource discovery. In The 2nd International Symposium on Communications
and Information Technology (ISCIT2002), 2002.
[4] A.D. Bagdanov and M. Worring. Fine-grained document genre classification using first order random graphs. In Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on, Vol., Iss., 2001, pages 79–83, 2001.
[5] Nicklas J. Belkin, Robert Oddy, and Helen M. Brooks. ASK for Information Retrieval: Part I.
Background and Theory, Part II. Results of a Design Study. Journal of Documentation, 38(3):61–
71, 145–164, Sep. 1982.
[6] Donna Bergmark, Carl Lagoze, and Alex Sbityakov. Focused crawls, tunneling, and digital libraries. In Proceedings of the 6th European Conference on Research and Advanced Technology for
Digital Libraries, pages 91–106. Springer-Verlag, 2002.
[7] Krishna Bharat. Searchpad: explicit capture of search context to support web search. In Proceedings of the 9th international World Wide Web conference on Computer networks : the international
journal of computer and telecommunications networking, pages 493–501. North-Holland Publishing Co., 2000.
[8] Daniel M. Bikel, Richard L. Schwartz, and Ralph M. Weischedel. An algorithm that learns what’s
in a name. Machine Learning, 34(1-3):211–231, 1999.
[9] Bing Liu, Chee Wee Chin, and Hwee Tou Ng. Mining topic-specific concepts and definitions on the
web. In Proceedings of the twelfth international World Wide Web conference, 2003.
[10] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. Exploiting diverse knowledge sources
via maximum entropy in named entity recognition. In Proceedings 6th Workshop on very large
corpora at the 17th International Conference on Computational Linguistics and the 36th Annual
meeting of the Association for Computational Linguistics, Montreal, Canada, 1998.
[11] Paul De Bra and R. D. J. Post. Searching for arbitrary information in the World Wide Web: the
fish-search for Mosaic. In Second WWW Conference, Chicago, 1994.
[12] Andrei Z. Broder. Some applications of rabin’s fingerprint method. In Sequences II: Methods in
Communications, Security, and Computer Science, R. Capocelli, A. D. Santis, and U. Vaccaro, Eds.
Springer Verlag, pages 143–152, 1993.
[13] Andrei Z. Broder. A taxonomy of web search. SIGIR Forum, 36(2):3–10, 2002.
[14] R. Burke, K. Hammond, V. Kulyukin, S. Lytinen, N. Tomuro, and S. Schoenberg. Natural language
processing in the faq finder system: Results and prospects, 1997.
[15] T. Calishain and R. Dornfest. Google Hacks: 100 Industrial-Strength Tips & Tools. O’Reilly, ISBN
0596004478, 2003.
[16] David Carmel, Eitan Farchi, Yael Petruschka, and Aya Soffer. Automatic query refinement using
lexical affinities with maximal information gain. In Proceedings of the 25th annual international
ACM SIGIR conference on Research and development in information retrieval, pages 283–290.
ACM Press, 2002.
[17] Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002.
[18] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: a new approach
to topic-specific Web resource discovery. Computer Networks (Amsterdam, Netherlands: 1999),
31(11–16):1623–1640, 1999.
[19] Michael Chau, Hsinchun Chen, Jailun Qin, Yilu Zhou, Yi Qin, Wai-Ki Sung, and Daniel McDonald. Comparison of two approaches to building a vertical search tool: A case study in the
nanotechnology domain. In Proceedings Joint Conference on Digital Libraries, Portland, OR.,
2002.
[20] C.W. Cleverdon, J. Mills, and M. Keen. Factors determining the performance of indexing systems.
Volume I - Design, Volume II - Test Results, ASLIB Cranfield Project, Reprinted in Sparck Jones &
Willett, Readings in Information Retrieval, 1966.
[21] B. D. Davison, D. G. Deschenes, and D. B. Lewanda. Finding relevant website queries. In Proceedings of the twelfth international World Wide Web conference, 2003.
[22] Daniel Dreilinger and Adele E. Howe. Experiences with selecting search engines using metasearch.
ACM Transactions on Information Systems, 15(3):195–222, 1997.
[23] Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the
web. In Proceedings of the tenth international conference on World Wide Web, pages 613–622.
ACM Press, 2001.
[24] B. Efron. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1–26, 1979.
[25] Tina Eliassi-Rad and Jude Shavlik. Intelligent Web agents that learn to retrieve and extract information. Physica-Verlag GmbH, 2003.
[26] Oren Etzioni. Moving up the information food chain: Deploying softbots on the world wide web.
In Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth
Innovative Applications of Artificial Intelligence Conference, pages 1322–1326, Menlo Park, 4–
8 1996. AAAI Press / MIT Press.
[27] Ronald Fagin, Ravi Kumar, Kevin S. McCurley, Jasmine Novak, D. Sivakumar, John A. Tomlin,
and David P. Williamson. Searching the workplace web. In WWW ’03: Proceedings of the twelfth
international conference on World Wide Web, pages 366–375. ACM Press, 2003.
[28] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient similarity search and classification via
rank aggregation. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 301–312. ACM Press, 2003.
[29] A. Finn, N. Kushmerick, and B. Smyth. Genre classification and domain transfer for information
filtering. In Proc. 24th European Colloquium on Information Retrieval Research, Glasgow, pages
353–362, 2002.
[30] Aidan Finn and Nicholas Kushmerick. Learning to classify documents according to genre. In
IJCAI-03 Workshop on Computational Approaches to Style Analysis and Synthesis, 2003.
[31] C. Lee Giles, Kurt Bollacker, and Steve Lawrence. CiteSeer: An automatic citation indexing
system. In Ian Witten, Rob Akscyn, and Frank M. Shipman III, editors, Digital Libraries 98 –
The Third ACM Conference on Digital Libraries, pages 89–98, Pittsburgh, PA, June 23–26 1998.
ACM Press.
[32] Eric Glover, Gary Flake, Steve Lawrence, William P. Birmingham, Andries Kruger, C. Lee Giles,
and David Pennock. Improving category specific web search by learning query modifications. In
Symposium on Applications and the Internet, SAINT, pages 23–31, San Diego, CA, January 8–12
2001. IEEE Computer Society, Los Alamitos, CA.
[33] Eric J. Glover, Steve Lawrence, William P. Birmingham, and C. Lee Giles. Architecture of a
metasearch engine that supports user information needs. In Proceedings of the eighth international
conference on Information and knowledge management, pages 210–216. ACM Press, 1999.
[34] Ayse Goker. Capturing information need by learning user context. In Sixteenth International Joint
Conference in Artificial Intelligence: Learning About Users Workshop, pages 21–27, 1999.
[35] Ayse Goker, Stuart Watt, Hans I. Myrhaug, Nik Whitehead, Murat Yakici, Ralf Bierig, Sree Kanth
Nuti, and Hannah Cumming. User context learning for intelligent information retrieval. In EUSAI
’04: Proceedings of the 2nd European Union symposium on Ambient intelligence, pages 19–24.
ACM Press, 2004.
[36] Google Web APIs. http://www.google.com/apis/.
[37] Luis Gravano, Chen-Chuan K. Chang, Hector Garcia-Molina, and Andreas Paepcke. Starts: Stanford proposal for internet meta-searching. In Proceedings of the 1997 ACM SIGMOD international
conference on Management of data, pages 207–218. ACM Press, 1997.
[38] Robert H. Guttmann and Pattie Maes. Agent-mediated integrative negotiation for retail electronic
commerce. Lecture Notes in Computer Science, pages 70–90, 1999.
[39] Monika Henzinger, Bay-Wei Chang, Brian Milch, and Sergey Brin. Query-free news search. In
Twelfth international World Wide Web Conference (WWW-2003), Budapest, Hungary, May 20-24
2003.
[40] Adele E. Howe and Daniel Dreilinger. SAVVYSEARCH: A metasearch engine that learns which
search engines to query. AI Magazine, 18(2):19–25, 1997.
[41] Jianying Hu, Ramanujan Kashi, and Gordon T. Wilfong. Document classification using layout
analysis. In DEXA Workshop, pages 556–560, 1999.
[42] David Hull. Using statistical testing in the evaluation of retrieval experiments. In SIGIR ’93:
Proceedings of the 16th annual international ACM SIGIR conference on Research and development
in information retrieval, pages 329–338. ACM Press, 1993.
[43] Thorsten Joachims. Text categorization with suport vector machines: Learning with many relevant
features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142.
Springer-Verlag, 1998.
[44] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant
features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137–142, Chemnitz, DE, 1998. Springer Verlag,
Heidelberg, DE.
[45] George H. John, Ron Kohavi, and Karl Pfleger. Irrelevant features and the subset selection problem.
In International Conference on Machine Learning, pages 121–129, 1994.
[46] Jonathan Shakes, Marc Langheinrich, and Oren Etzioni. Dynamic reference sifting: A case study in
the homepage domain. In Sixth International World Wide Web Conference, pages 1193–1204, Apr.
1997.
[47] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999.
[48] Jussi Karlgren. Stylistic experiments in information retrieval. In T. Strzalkowski, editor, Natural
Language Information Retrieval, 1999.
[49] Jussi Karlgren, Ivan Bretan, Johan Dewe, Anders Hallberg, and Niklas Wolkert. Iterative information retrieval using fast clustering and usage-specific genres. In Eighth DELOS workshop on User
Interfaces in Digital Libraries, Stockholm, Sweden, pages 85–92, 1998.
[50] E. Michael Keen. Presenting results of experimental retrieval comparisons. Information Processing
and Management, 28(4):491–502, 1992.
[51] Brett Kessler, Geoffrey Nunberg, and Hinrich Schütze. Automatic detection of text genre. In
Philip R. Cohen and Wolfgang Wahlster, editors, Proceedings of the Thirty-Fifth Annual Meeting
of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 32–38, Somerset, New Jersey, 1997.
Association for Computational Linguistics.
[52] Mei Kobayashi and Koichi Takeda. Information retrieval on the web. ACM Comput. Surv.,
32(2):144–173, 2000.
[53] Reiner Kraft and Raymie Stata. Finding buying guides with a web carnivore. In 1st Latin American
Web Congress (LA-WEB), Santiago, pages 84–92, November 2003.
[54] Reiner Kraft and Jason Zien. Mining anchor text for query refinement. In Proceedings of WWW2004, International Conference of the World Wide Web, May 2004.
[55] Cody Kwok, Oren Etzioni, and Daniel S. Weld. Scaling question answering to the web. ACM
Trans. Inf. Syst., 19(3):242–262, 2001.
[56] Cody C. T. Kwok, Oren Etzioni, and Daniel S. Weld. Scaling question answering to the web. In
World Wide Web, pages 150–161, 2001.
[57] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and Autonomous Citation
Indexing. IEEE Computer, 32(6):67–71, 1999.
[58] Yong-Bae Lee and Sung Hyon Myaeng. Text genre classification with genre-revealing and subjectrevealing features. In Proceedings of the 25th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 145–150. ACM Press, 2002.
[59] Katsushi Matsuda and Toshikazu Fukushima. Task-oriented world wide web retrieval by document type classification. In Proceedings of the eighth international conference on Information and
knowledge management, pages 109–113. ACM Press, 1999.
[60] Andrew McCallum and Kamal Nigam. A comparison of event models for naive bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41–48. Technical
Report WS-98-05. AAAI Press, 1998.
[61] Weiyi Meng, Clement Yu, and King-Lup Liu. Building efficient and effective metasearch engines.
ACM Comput. Surv., 34(1):48–89, 2002.
[62] Mandar Mitra, Amit Singhal, and Chris Buckley. Improving automatic query expansion. In Research and Development in Information Retrieval, pages 206–214, 1998.
[63] Movie Review Query Engine. http://www.mrqe.com.
[64] Nutch. http://www.nutch.org/docs/en/.
[65] Satoshi Oyama, Takashi Kokubo, Teruhiro Yamada, Yasuhiko Kitamura, and Toru Ishida. Keyword
spices: A new method for building Domain-Specific web search engines. In Bernhard Nebel,
editor, Proceedings of the seventeenth International Conference on Artificial Intelligence (IJCAI01), pages 1457–1466, San Francisco, CA, August 4–10 2001. Morgan Kaufmann Publishers,
Inc.
[66] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking:
Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[67] J. R. Quinlan. Learning decision tree classifiers. ACM Comput. Surv., 28(1):71–72, 1996.
[68] Daniel E. Rose and Danny Levinson. Understanding user goals in web search. In WWW ’04:
Proceedings of the 13th international conference on World Wide Web, pages 13–19. ACM Press,
2004.
[69] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web
resource discovery. In Proceedings of WWW8, Toronto (ON), pages 545–562, 1999.
[70] Erik Selberg and Oren Etzioni. The MetaCrawler Architecture for Resource Aggregation on the
Web. IEEE Expert, 12(1):8–14, January 1997.
[71] Barry G. Silverman, Mintu Bachann, and Khaled Al-Akharas. Implications of buyer decision theory
for design of ecommerce websites, June 2001.
[72] Craig Silverstein, Hannes Marais, Monika Henzinger, and Michael Moricz. Analysis of a very
large web search engine query log. SIGIR Forum, 33(1):6–12, 1999.
[73] Craig Silverstein, Hannes Marais, Monika Henzinger, and Michael Moricz. Analysis of a very
large web search engine query log. SIGIR Forum, 33(1):6–12, 1999.
[74] Ellen Spertus. Parasite: mining structural information on the web. In Selected papers from the sixth
international conference on World Wide Web, pages 1205–1215. Elsevier Science Publishers Ltd.,
1997.
[75] P. Srinivasan, J. Mitchell, O. Bodenreider, G. Pant, and F. Menczer. Web crawling agents for
retrieving biomedical information. In Proc. Int. Workshop on Agents in Bioinformatics (NETTAB02), 2002.
[76] Raymie Stata, Krishna Bharat, and Farzin Maghoul. The term vector database: fast access to
indexing terms for web pages. In Proceedings of the 9th International World Wide Web Conference,
May 2000.
[77] Peter D. Turney. Mining the web for synonyms: Pmi-ir versus lsa on toefl. In Proceedings of the
12th European Conference on Machine Learning, pages 491–502. Springer-Verlag, 2001.
[78] P.P.T.M. van Mun. Text classification in information retrieval using winnow (citeseer.ist.psu.edu/133034.html).
[79] Yiming Yang. An evaluation of statistical approaches to text categorization. Information Retrieval,
1(1/2):69–90, 1999.
[80] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization.
In Douglas H. Fisher, editor, Proceedings of ICML-97, 14th International Conference on Machine
Learning, pages 412–420, Nashville, US, 1997. Morgan Kaufmann Publishers, San Francisco, US.
[81] Clement T. Yu, K. Lam, and Gerard Salton. Term weighting in information retrieval using the term
precision model. JACM, 29(1):152–170, 1982.
[82] Oren Zamir and Oren Etzioni. Grouper: a dynamic clustering interface to Web search results.
Computer Networks (Amsterdam, Netherlands: 1999), 31(11–16):1361–1374, 1999.
[83] Jason Zien, Joerg Meyer, John Tomlin, and Joy Liu. Web query characteristics and their implications on search engines. IBM Research Report, RJ 10199, November 2000.