Method for organizing records of database search activity by topical

US 20030014399A1
(19) United States
(12) Patent Application Publication (10) Pub. No.: US 2003/0014399 A1
Hansen et al. (43) Pub. Date: Jan. 16, 2003
(54) METHOD FOR ORGANIZING RECORDS OF DATABASE SEARCH ACTIVITY BY TOPICAL RELEVANCE
(76) Inventors: Mark H. Hansen, Hoboken, NJ (US); Elizabeth A. Shriver, Jersey City, NJ (US)
Correspondence Address:
FAY, SHARPE, FAGAN, MINNICH & McKEE, LLP
Seventh Floor
1100 Superior Avenue
Cleveland, OH 44114-2518 (US)
(21) Appl. No.: 10/096,688
(22) Filed: Mar. 12, 2002
Related U.S. Application Data
(60) Provisional application No. 60/275,068, filed on Mar. 12, 2001.
Publication Classification
(51) Int. Cl. ... G06F 7/00
(52) U.S. Cl. ... 707/3
(57) ABSTRACT
A method for organizing records of a database by topical relevance generates statistics on relevance by monitoring search terms used and search paths traversed by a database user community. Records reviewed most often in relation to a given search term are assumed to be most relevant to that search term in the eyes of members of the user community. Additionally, a record reviewed in relation to a plurality of search terms is determined to be related by topical relevance to other records reviewed in relation to that plurality of search terms. Again, a probability is calculated, based on a frequency of record review and search terms used, as a measure of this record topical relevance. An embodiment directed toward Internet searches provides for seeding the probability calculations with information from labeled data available from open source Internet directories. The activities of the user community are monitored, for example, at a proxy server, or by reviewing proxy server logs. Other monitoring points are contemplated.
[Representative drawing: excerpt of a proxy server log (reference numerals 114, 118, 122, 124) showing a Google search request followed by two requests to the ieee-infocom.org site.]
[Drawing sheets 1-5 of 5: figures not reproduced.]
METHOD FOR ORGANIZING RECORDS OF DATABASE SEARCH ACTIVITY BY TOPICAL RELEVANCE
BACKGROUND OF THE INVENTION
[0001] This application claims the benefit of Provisional Application Serial No. 60/275,068, filed Mar. 12, 2001, the entire substance of which is incorporated herein by reference.
[0002] The invention is related to the art of data search. It is described in reference to World Wide Web and Internet searching. However, those of ordinary skill in the art will understand that the described embodiments can readily be adapted to other database or data search tasks.
[0003] A great deal of work is being done to improve database and Web searching. For example, Ayse Goker and Daqing He, in Analyzing Web Search Logs to Determine Session Boundaries for Unoriented Learning, Proceedings of the Adaptive Hypermedia and Adaptive Web-Based Systems International Conference (Trento, Italy), pages 319-322, August 2000, incorporated herein by reference in its entirety, define a search session to be a meaningful unit of activities, with the intention of using it as input for a learning technique. Sessions are determined by a length in time from the first search query. Goker reports that a session boundary of 11-15 minutes compares well with human judgment. This is a simple model, and does not allow for determining which events in the time window correspond to Web searching. Additionally, Goker analyzed logs from search engines only.
[0004] Johan Bollen, in Group User Models for Personalized Hyperlink Recommendation, Proceedings of the Adaptive Hypermedia and Adaptive Web-Based Systems International Conference (Trento, Italy), pages 39-50, August 2000, incorporated herein by reference in its entirety, presents a method to reconstruct user searching using the Web server log entries of the Los Alamos Research Library corresponding to access to the digital library of journal articles. The resulting retrieval paths are a group user model. The group user model is used to construct relationships between journals using a V×V matrix, where V is the set of hypertext pages. In this library of journal articles, a journal article is represented by a URL (Universal Resource Locator). This approach will not scale well and would be overwhelmed when V is the set of publicly-accessed URLs.
[0005] Many techniques exist for automatically determining the category of a document based on its content (e.g., Yiming Yang and Xin Liu, in A Re-Examination of Text Categorization Methods, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, Calif.), pages 42-49, ACM, August 1999 and its references, all of which are incorporated herein by reference in their entirety) and the in- and out-links of the document. For example, Jeffrey Dean and Monika R. Henzinger in Finding Related Web Pages in the World Wide Web, Proceedings of the Eighth International World Wide Web Conference (WWW8) (Toronto, Canada), pages 389-401, Elsevier Science, May 1999, incorporated herein by reference in its entirety, Dharmendra S. Modha and W. Scott Spangler, in Clustering Hypertext With Applications to Web Searching, Proceedings of the ACM Hypertext 2000 Conference (San Antonio, Tex.), May 2000, incorporated herein by reference in its entirety, and Attardi et al. (Giuseppe Attardi, Antonio Gulli, and Fabrizio Sebastiani, in Theseus: Categorization by Context, Proceedings of the Eighth International World Wide Web Conference (WWW8) (Toronto, Canada), pages 389-401, Elsevier Science, May 1999, incorporated herein by reference in its entirety) use the context surrounding a link in an HTML document to extract information for categorizing the document referred to by the link. Oren Zamir and Oren Etzioni, in Web Document Clustering: A Feasibility Demonstration, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98) (Melbourne, Australia), pages 46-54, ACM, August 1998, incorporated herein by reference in its entirety, use the snippets of text returned by search engines to quickly group the results based on phrases shared between documents. Tsuyoshi Murata, in Discovery of Web Communities Based on the Co-Occurrence of References, Proceedings of the Third International Conference on Discovery Science (DS'2000) (Kyoto, Japan), December 2000, incorporated herein by reference in its entirety, computes clusters of URLs returned by a search engine by entering the URLs themselves as secondary queries.
[0006] Clusters of similar Web pages can be developed using the approach presented by Dean and Henzinger, which finds pages similar to a specified one by using connectivity information on the Web. The Context Classification Engine catalogs documents with one or more categories from a controlled set. For example, see Classifying Content With Ultraseek Server CCE by Walter Underwood of Inktomi Search Software CCE, Foster City, Calif., incorporated herein by reference in its entirety. The categories can be arranged in either a hierarchical or enumerative classification scheme. Finally, DynaCat, by Wanda Pratt, Marti A. Hearst, and Lawrence M. Gagan in A Knowledge-Based Approach to Organizing Retrieved Documents, Proceedings of the 6th National Conference on Artificial Intelligence (AAAI-99); Proceedings of the 11th Conference on Innovative Applications of Artificial Intelligence (Orlando, Fla.), pages 80-85, AAAI/MIT Press, July 1999, incorporated herein by reference in its entirety, dynamically categorizes search results into a hierarchical organization using a model of the domain terminology.
[0007] Another approach to document categorization is "content ignorant." For example, Doug Beeferman and Adam Berger in Agglomerative Clustering of a Search Engine Query Log, Proceedings of the 2000 Conference on Knowledge Discovery and Data Mining (Boston, Mass.), pages 407-416, August 2000, incorporated herein by reference in its entirety, use click-through data to discover disjoint sets of similar queries and disjoint sets of similar URLs. Their algorithm represents each query and URL as a node in a graph and creates edges representing the user action of selecting a specified URL in response to a given query. Nodes are then merged in an iterative fashion until some termination condition is reached. This algorithm forces a hard clustering of queries and URLs. This algorithm works on large sets of data in batch mode, and does not include prior labeled data from existing content hierarchies. By focusing on click-through statistics, these authors only see an abbreviated portion of a user's activities while searching. This paper also only advocates improving Web search by proposing for users alternative queries taken from the disjoint sets of queries built by their algorithm.
[0008] Approaches to hierarchical classification such as that discussed by Ke Wang, Senqiang Zhou, and Shiang Chen Liew in Building Hierarchical Classifiers Using Class Proximity, Proceedings of the Twenty-fifth International Conference on Very Large Databases (Edinburgh, Scotland, UK), pages 363-374, September 1999, incorporated herein by reference in its entirety, when applied to our data, would only allow for one URL to be related with each query.
[0009] Most recent work in Web searching has been to improve the search engine ranking algorithms. For example, PageRank, by Sergey Brin and Lawrence Page, in The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proceedings of the Seventh International World Wide Web Conference (Brisbane, Australia), Elsevier Science, April 1998, incorporated herein by reference in its entirety, The WISE System by Budi Yuwono and Dik Lun Lee, in WISE: A World Wide Web Resource Database System, IEEE Transactions on Knowledge and Data Engineering, 8(4):548-554, August 1996, incorporated herein by reference in its entirety, Budi Yuwono and Dik L. Lee, in Server Ranking for Distributed Text Retrieval Systems on the Internet, Proceedings of the 5th International Conference on Database Systems for Advanced Applications (DASFAA '97) (Melbourne, Australia), pages 41-49, April 1997, incorporated herein by reference in its entirety, and NECI's metasearch engine, by Steve Lawrence and C. Lee Giles, in Inquirus, the NECI Meta Search Engine, Proceedings of the Seventh International World Wide Web Conference (Brisbane, Australia), pages 95-105, Elsevier Science, April 1998, incorporated herein by reference in its entirety, are examples of such work. Direct Hit (www.directhit.com) claims to track which Web sites a searcher selects from the list provided by a search engine, how much time she spends on those sites, and takes into account the position of that site relative to other sites on the list provided. Thus, for future queries, the most popular and relevant sites are notated in the search engine results.
[0010] WebWatcher attempts to serve as a tour guide to Web neighborhoods, see WebWatcher: A Learning Apprentice for the World Wide Web by Robert Armstrong, Dayne Freitag, Thorsten Joachims, and Tom Mitchell in Proceedings of the 1995 AAAI Spring Symposium on Information Gathering From Heterogeneous, Distributed Environments (Palo Alto, Calif.), pages 6-12, March 1995, incorporated herein by reference in its entirety, and WebWatcher: A Tour Guide for the World Wide Web by Thorsten Joachims, Dayne Freitag, and Tom M. Mitchell in Proceedings of 15th International Joint Conference on Artificial Intelligence (IJCAI97) (Nagoya, Japan), pages 770-777, Morgan Kaufmann, August 1997, incorporated herein by reference in its entirety. Users invoke WebWatcher by following a link to the WebWatcher server, then continue browsing as WebWatcher accompanies them, providing advice along the way on which link to follow next based on a stated user goal. WebWatcher gains expertise by analyzing user actions, statements of interest, and the set of pages visited by users. Their studies suggested that WebWatcher could achieve close to the human level of performance on the problem of predicting which link a user will follow given a page and a statement of interest.
[0011] Marko Balabanovic and Yoav Shoham in Fab: Content-Based, Collaborative Recommendation, Communications of the ACM, 40(3):66-72, March 1997, incorporated herein by reference in its entirety, discuss Fab, a Web recommendation system; this system is not designed to assist in Web searching, and it requires users to rate Web pages. WebGlimpse, described by Udi Manber, Mike Smith, and Burra Gopal in WebGlimpse: Combining Browsing and Searching, Proceedings of the 1997 USENIX Annual Technical Conference (Anaheim, Calif.), pages 195-206, January 1997, incorporated herein by reference in its entirety, restricts Web searches to a neighborhood of similar pages, perhaps searching with additional keywords in the neighborhood. It saves one from building site-specific search engines.
[0012] Clever, described by Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg in Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, Proceedings of the Seventh International World Wide Web Conference (Brisbane, Australia), Elsevier Science, April 1998, incorporated herein by reference in its entirety, and D. Gibson, J. Kleinberg, and P. Raghavan in Inferring Web Communities from Link Topologies, Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space - Structure in Hypermedia Systems (Pittsburgh, Pa.), pages 225-234, June 1998, incorporated herein by reference in its entirety, builds on the HITS (Hypertext-Induced Topic Search) algorithm, which seeks to find authoritative sources of information on the Web, together with sites (hubs) featuring good compilations of such authoritative sources. The original HITS algorithm first uses a standard text search engine to gather a "root set" of pages matching the query subject. Next, it adds to the pool all pages pointing to or pointed to by the root set. Thereafter, it uses only the links between these pages to distill the best authorities and hubs. The key insight is that these links capture the annotative power (and effort) of millions of individuals independently building Web pages. Clever additionally uses the content of the Web pages. SALSA, described by R. Lempel and S. Moran in The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect, Proceedings of the Ninth International World Wide Web Conference (WWW9) (Amsterdam, Netherlands), May 2000, incorporated herein by reference in its entirety, presents another method to find hubs and authorities.
[0013] Paul P. Maglio and Rob Barrett, in How to Build Modeling Agents to Support Web Searchers, Proceedings of the Sixth International Conference on User Modeling (UM97) (Sardinia, Italy), Springer Wien, New York, June 1997, incorporated herein by reference in its entirety, studied how people search for information on the Web. They formalized the concept of waypoints, key nodes that lead users to their searching goal. To support the searching behavior they observed, Maglio and Barrett constructed a Web agent to help identify the waypoint based on a user's searching history. Unfortunately, it is not clear how to extend the waypoint URL so that other users can profit from it.
[0014] All of this work is motivated, at least in part, by a general need to improve database and Internet searching. However, a large part of the motivation to improve Web searching is brought about by the advent of mobile computing and communication devices and services. For example, cell phone and personal digital assistant (PDA) users are demanding Internet connectivity. One of the fundamental design challenges of today's mobile devices is the constraints of their small displays. For example, PDAs may have a display space of 160x160 pixels, while a cellular phone can be limited to only five lines of 14 characters each. Differences in display real estate and access to peripherals like keyboards and mice can alter the user experience with much of the content available on the Web. These display limitations, as well as bandwidth limitations related to constraints of mobile communication, are accommodated through special connectivity services.
[0015] Considering the interface constraints in the mobile environment, one can easily see how important proper selection of content becomes in mobile Web searching applications. Without the benefit of refining content selection, delivery, and distribution, a user may be inundated with search results, and may be unable to manipulate the content in a manner satisfactory to the task, context, or application at hand. As such, it would be desirable to have an improved search system for general Internet and database applications, but also for tailoring search results for display on a limited browser screen.
[0016] Of the available methods to improve search results, there are several techniques that are commonly used:
[0017] Improved ranking algorithms. Current search engines crawl the Web and build indexes on the keywords that they deem are important. The keywords are used to identify which URLs should be displayed. A great deal of work has been done to improve the ranking of the URLs. For example, see the work of Brin and Page mentioned above.
[0018] Meta-search engines. A meta-search engine queries a group of popular engines, hoping that the combined results will be more useful than the results from any one engine. For example, MetaCrawler collates results, eliminates duplication, and displays the results with aggregate scores (see The MetaCrawler Architecture for Resource Aggregation on the Web, IEEE Expert, 12(12):8-14, January/February 1997, by Erik Selberg and Oren Etzioni, incorporated herein by reference in its entirety).
[0019] Dedicated search engines. There exist a number of search engines specializing in particular topics.
[0020] Specialized directories. Yahoo, About, LookSmart, and DMOZ organize pages into topic directories. These special hierarchies are maintained by one or more editors, and hence their coverage is somewhat limited and their quality can vary. These directory structures are also referred to as resource lists or catalogs.
[0021] Bookmarks. Individuals often keep a set of bookmarks of frequently visited pages and share their bookmark files with others interested in the same topics, e.g. www.backflip.com.
[0022] With reference to the two last techniques, members of a community (office, work group, or social organization) often think about, and research, the same set of topics. When searching for information on the Web, if others from one's community have recently performed the same searches, it would be helpful to know what they found; search results could then feed into a shared pool of knowledge. To be practically useful, this pool needs to be maintained without requiring direct input from the members of the community.
[0023] However, gathering such a pool is only useful if queries are repeated. In examining 17 months of proxy server logs at Bell Labs, 20% of the queries sent to search engines had been done before. Based on this promising number, SearchLight, a system disclosed in U.S. patent application Ser. No. 09/428,031, filed Oct. 27, 1999, entitled Method for Improving Web Searching Performance by Using Community-Based Filtering by Shriver and Small, which is incorporated herein by reference in its entirety, was built, which transparently constructs a database of search engine queries and a subset of the URLs visited in response to those queries. Then, when a user views the results of a query from a search engine, SearchLight augments the results with URLs from the database. Experimental results indicate that among all the cases when a search involves a query contained in the SearchLight database, the desired URL is among those in the SearchLight display 64% of the time.
[0024] Unfortunately, if the SearchLight database is large, it will have many of the same problems experienced by other search engines: too many results to display, with the order being the only technique to help the user.
[0025] There is a desire to provide a scalable method to improve or augment available data searching techniques.
BRIEF SUMMARY OF THE INVENTION
[0026] Therefore, a method of improving search of a database has been developed. The method comprises monitoring user search activity in a user population, extracting search sessions, defined by search queries and paths, from user search activity, determining groups of semantically related queries or paths based on search session data, determining probabilities that records in the database are relevant for each query or path group, maintaining a table associating an index for each record in the database with the probability that the record is relevant for each query or path group, and supplementing search results with information regarding records from the database with tabulated relevance probabilities.
[0027] In some embodiments, monitoring user search activity in a user population and extracting search sessions from user search activity includes off-line processing of proxy server access logs to determine search sessions (where off-line refers to a batch style of processing in which data are handled at regular intervals, e.g. once a day).
[0028] In some embodiments, monitoring user search activity in a user population and extracting search sessions from user search activity includes on-line processing in a proxy server to determine search sessions (where on-line refers to an event-driven style of processing in which data are handled each time a search session ends).
[0029] In some of these embodiments, determining search sessions includes determining complete search sessions. For example, a search session is determined to include all the Web pages visited while performing the searching task, including, for example, not only the Web pages presented in a search engine results page, but also pages explored as a result of viewing pages listed on the search engine results page.
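The off-line extraction of search sessions described above can be illustrated with a minimal sketch. It is not the extractor of the invention: the `QUERY_PARAM` rule table, the `(user, timestamp, url)` entry layout, and the `SearchSession` type are hypothetical, and the sketch simplifies by attributing every page a user requests after a query to that query until a new query arrives or the one-hour inactivity limit mentioned in the detailed description is exceeded. It does not check whether pages are actually linked from the results.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse, parse_qs

# Hypothetical rule table: search engine host -> query-string parameter
# that carries the search terms. Real engines may use other parameters.
QUERY_PARAM = {"www.google.com": "q", "search.yahoo.com": "p"}

INACTIVITY_LIMIT = 3600  # seconds; illustrative one-hour timeout


@dataclass
class SearchSession:
    user: str
    query: str
    started: float
    last_seen: float
    urls: list = field(default_factory=list)


def search_terms(url):
    """Return the search terms if url is a query to a known engine, else None."""
    parts = urlparse(url)
    param = QUERY_PARAM.get(parts.netloc)
    if param is None:
        return None
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None


def extract_sessions(entries):
    """Group (user, timestamp, url) log entries into search sessions."""
    open_sessions = {}  # user -> session currently being built
    finished = []
    for user, ts, url in sorted(entries, key=lambda e: e[1]):
        current = open_sessions.get(user)
        # Close a session that has been inactive for too long.
        if current and ts - current.last_seen > INACTIVITY_LIMIT:
            finished.append(open_sessions.pop(user))
            current = None
        terms = search_terms(url)
        if terms is not None:
            # A new query ends the previous session and starts a new one.
            if current:
                finished.append(current)
            open_sessions[user] = SearchSession(user, terms, ts, ts)
        elif current:
            # Any other request is attributed to the open session.
            current.urls.append(url)
            current.last_seen = ts
    finished.extend(open_sessions.values())
    return finished
```

An on-line variant would apply the same rules inside the proxy as each request arrives, emitting a session as soon as it is closed rather than after a batch pass over the log.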
BRIEF DESCRIPTION OF THE SEVERAL
VIEWS OF THE DRAWINGS
[0030] The invention may take form in various compo
nents and arrangements of components, and in various
procedures and arrangements of procedures. The drawings
are only for purposes of illustrating preferred embodiments,
they are not to scale, and are not to be construed as limiting
the invention.
[0031] FIG. 1 is a portion of an exemplary proxy server log corresponding to a search session.
[0032] FIG. 2 represents data related to a search session that was extracted from the proxy server log of FIG. 1.
[0033] FIG. 3 is an exemplary browser window illustrating a first search results augmentation scheme.
[0034] FIG. 4 is an exemplary browser window illustrating a second search results augmentation scheme.
[0035] FIG. 5 is a portion of an exemplary set of predetermined directory or labeled data.
[0036] FIG. 6 is a flow diagram summarizing a method for organizing records of a database by topical relevance.
[0037] FIG. 7 is a block diagram illustrating a first system operative to implement aspects of the methods of the invention.
[0038] FIG. 8 is a block diagram illustrating a second system operative to implement aspects of the methods of the invention.
DETAILED DESCRIPTION OF THE
INVENTION
[0039] We consider enhancing the standard search facility associated with a database. Users initiate searches by submitting queries to the search facility, where each query consists of one or more search terms. The present invention is based on the idea that semantically related search terms (even if they do not include any of the same words) lead users to access similar records in a database while they are searching. By combining the complete search activities from a large community of users, search terms can be grouped through clustering or grouping. Then, for each group, the most relevant records are identified, again using the data collected from user activities. When a user submits a query to a search engine, the present invention, which is termed Hyponym, decides to which group or groups the search term belongs, and then displays indices for the most relevant records stratified by the identified query groups.
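The final lookup described above (find the groups a search term belongs to, then list the most relevant records per group) might be sketched as follows. The nested dictionaries `assoc` (term to group weights) and `relevance` (group to record weights), the function name `suggest`, and the example group labels are all hypothetical; how the weights are learned is a separate matter.

```python
def suggest(term, assoc, relevance, groups=2, per_group=3):
    """Return [(group, [record, ...]), ...] for the groups most tied to term.

    assoc[term][group] and relevance[group][record] are hypothetical
    dictionaries holding already-computed numerical weights.
    """
    weights = assoc.get(term, {})
    # Strongest groups first; ties broken by group name for determinism.
    best = sorted(weights, key=lambda g: (-weights[g], g))[:groups]
    results = []
    for group in best:
        recs = relevance.get(group, {})
        ranked = sorted(recs, key=lambda r: (-recs[r], r))[:per_group]
        results.append((group, ranked))
    return results
```

Returning the records grouped by query group, rather than as one flat list, mirrors the stratified display the paragraph describes.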
[0040] More particularly, the method consists of the following steps: 1) User activities are passively monitored as they access the standard search facility of a database. Users submit queries to the search facility, where each query consists of one or more search terms. 2) We summarize the sequence of user activities during a searching task into a structure called a search session. Technically, a search session consists of a user's search terms and the indices of the records they accessed in the database while searching. A search session may also include a timestamp. 3) We maintain a table of the number of times each record was accessed in response to each search term issued by a community of users. Every time a user conducts a search, we increment the appropriate elements in this table based on the associated search session. 4) Recognizing that semantically related search terms lead users to access many of the same records, we use this table to form groups or clusters of search terms, known as query groups, based on the patterns of accesses recorded by the search sessions. With some kinds of clustering, a search term may belong to several groups and a numerical score is used to describe the strength of association. 5) Then, again using the tabulated search session data, we estimate the chance that each record in the database is relevant for the different query groups. (It is possible to also use the tabulated data to introduce groups of URLs as well. In this case we would estimate the probability that a group of URLs is relevant for a group of queries.) The resulting numerical scores are called relevance weights. Either of steps (4) and (5) can be updated every time a user completes a search, a method known as on-line processing; or they can be done periodically, processing a number of search sessions in a batch, i.e., as in off-line processing. 6) When a new search is initiated, we identify the group or groups with which the user's search term is most strongly associated and return a list of the indices to the most relevant records in the database, stratified by query group.
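The table maintained in step 3 can be pictured with a small sketch. The class name `AccessTable` and its methods are hypothetical, and the sketch covers only the counting and a per-term ranking by access share; the grouping and relevance estimation of steps 4 and 5 are not shown.

```python
from collections import defaultdict


class AccessTable:
    """Counts how often each record was accessed for each search term."""

    def __init__(self):
        # counts[search_term][record_index] -> number of accesses
        self.counts = defaultdict(lambda: defaultdict(int))

    def add_session(self, search_term, record_indices):
        """Increment the table entries touched by one search session."""
        for record in record_indices:
            self.counts[search_term][record] += 1

    def top_records(self, search_term, n=3):
        """Records most often accessed for a term, as (record, share) pairs."""
        row = self.counts.get(search_term, {})
        total = sum(row.values())
        ranked = sorted(row.items(), key=lambda kv: (-kv[1], kv[0]))
        return [(rec, cnt / total) for rec, cnt in ranked[:n]]
```

For example, after `add_session("infocom", ["urlA", "urlB"])` and `add_session("infocom", ["urlA"])`, `top_records("infocom")` ranks `urlA` first with a two-thirds share of the accesses.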
[0041] In some embodiments, the query groups are computed via a mixture model. This kind of clustering will typically involve computing association weights (relating search terms to clusters) and relevance weights (relating database records to query groups) via the well-known expectation-maximization (EM) algorithm.
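As an illustration only, the following sketch fits a textbook mixture of multinomials by EM to a term-by-record count matrix. The specific model, the smoothing constant, the deterministic round-robin initialization, and the function name `em_mixture` are assumptions made for the example; the text above does not commit to these details.

```python
import math


def em_mixture(counts, k, iters=50, smooth=0.01):
    """Fit a mixture of multinomials to a term-by-record count matrix.

    counts[q][j] is how often record j was accessed for search term q.
    Returns (assoc, relevance): assoc[q][g] is the association weight of
    term q with group g; relevance[g][j] is the relevance weight of
    record j within group g. Requires iters >= 1.
    """
    n_terms, n_records = len(counts), len(counts[0])
    # Deterministic soft start: term q leans toward group q mod k.
    assoc = [[(2.0 if q % k == g else 1.0) / (k + 1) for g in range(k)]
             for q in range(n_terms)]
    for _ in range(iters):
        # M-step: smoothed group proportions and record distributions.
        prior = [(smooth + sum(assoc[q][g] for q in range(n_terms)))
                 / (k * smooth + n_terms) for g in range(k)]
        relevance = []
        for g in range(k):
            row = [smooth + sum(assoc[q][g] * counts[q][j]
                                for q in range(n_terms))
                   for j in range(n_records)]
            total = sum(row)
            relevance.append([v / total for v in row])
        # E-step: posterior group weights per term, computed in log space.
        for q in range(n_terms):
            logp = [math.log(prior[g])
                    + sum(counts[q][j] * math.log(relevance[g][j])
                          for j in range(n_records))
                    for g in range(k)]
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            s = sum(w)
            assoc[q] = [v / s for v in w]
    return assoc, relevance
```

Because the association weights are probabilities rather than hard assignments, a search term can belong to several groups at once, which is the property the mixture formulation is chosen for.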
[0042] In some embodiments of the method, the clustering can be aided by information in existing structures that provide organization to the database. This might include a tree structure that associates records in the database with a hierarchically specified set of topics. We refer to information of this kind as labeled data because it directly associates database records with broad topics. In some embodiments of the method, when a mixture model is employed, this labeled data can be used via a simple approximate EM scheme.
[0043] An embodiment directed toward improving Web search specializes the database to the collection of pages available on one or more Web sites, and takes the standard search facility to be an existing search engine. In this context, the labeled data to help form query clusters and relevance weights could consist of an existing content hierarchy (like www.yahoo.com or www.about.com).
[0044] In situations where either the content in a database or the terms being searched for by the community of users continually changes, the methods for integrating new search session data should function in near real time. This necessitates an on-line mechanism for learning query groups and relevance weights. When this clustering involves a mixture model, an on-line variant of the EM algorithm can be employed.
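One simple way to picture such an on-line mechanism is a stepwise update that folds a single finished search session into the current estimates. The sketch below is an assumption-laden illustration, not the on-line EM variant of the invention: the fixed learning rate `rate`, the function name `online_update`, and the requirement that all relevance weights be strictly positive are choices made for the example.

```python
import math


def online_update(relevance, prior, session_counts, rate=0.1):
    """Blend one finished search session into current group estimates.

    relevance[g][j]: weight of record j in group g (each row sums to 1,
                     all entries strictly positive).
    prior[g]:        share of sessions attributed to group g (sums to 1).
    session_counts[j]: accesses of record j in the new session.
    Updates relevance and prior in place; returns the association
    weights computed for this session.
    """
    k = len(prior)
    total = sum(session_counts)
    # E-step for this one session, in log space.
    logp = [math.log(prior[g])
            + sum(c * math.log(relevance[g][j])
                  for j, c in enumerate(session_counts) if c)
            for g in range(k)]
    top = max(logp)
    w = [math.exp(v - top) for v in logp]
    s = sum(w)
    assoc = [v / s for v in w]
    # Stepwise M-step: move each group toward the session in proportion
    # to how strongly the session is associated with that group.
    for g in range(k):
        step = rate * assoc[g]
        prior[g] = (1 - rate) * prior[g] + rate * assoc[g]
        if total:
            relevance[g] = [(1 - step) * r + step * c / total
                            for r, c in zip(relevance[g], session_counts)]
    return assoc
```

Each update costs time proportional to the number of groups times the number of records touched, so it can run every time a session ends instead of waiting for a batch pass.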
[0045] Information about users as they search is distilled into an object known as a search session, the pairing of a user's search term and the records they accessed while searching, e.g., the query and complete path. The present invention relies on two assumptions: (1) Search sessions can be obtained; and (2) the information contained in a collection of search sessions can be used to assist in searching "in the future." In one embodiment of the invention, the World Wide Web is searched by users for HTML documents relating to a given search term. Viewing the World Wide Web as a database, the separate records or Web pages are indexed by their URLs. For Web searching, a user's search session consists of their search terms and URLs of all the HTML pages they visited in response to their query. Several options are available for monitoring user activities on the World Wide Web. In one embodiment, we have made use of proxy server logs. A proxy server is a computer that connects a community of users to the public Internet. It accesses content on behalf of this community. Requests for HTML pages and other items are directed to the proxy server, which in turn establishes a connection with the appropriate host on the Internet and retrieves the desired item. It then delivers the item to the user who issued the request. As part of this process of serving content, proxy servers record the URLs of the items requested by their users. From this large log file, the search sessions for every person using the particular proxy server can be extracted. A search session extractor takes as input proxy server logs, and outputs queries and ordered sets of URLs visited for those queries and timestamps for these events.
[0046] While the disclosed methods can be applied generally to database searches, we provide extra detail concerning an exemplary embodiment involving search session extraction from proxy server logs.
[0047] Search Session Extraction Example: Proxy Server Logs
[0048] As noted above, a proxy server handles all the requests made by a user community and hence records a wealth of information about user behaviors. With these data, access to the complete path a user follows while searching is available. Given a proxy server log, we can extract search sessions in one of two ways. First, we can "replay" that part of a user's actions that are directly associated with a search task (i.e., re-retrieving the pages a user requested) to determine the path the user followed. This scheme is referred to as an off-line collection scheme. Alternatively, we can avoid the overhead of replaying requests by instead modifying a proxy to directly log the information needed to determine search sessions, or to have a background daemon processing the Web pages while they are still in the file system cache. We refer to this setup as an on-line collection scheme.
[0049] In the context of searching the World Wide Web for HTML pages, search sessions consist of a query posted to a search engine together with the URLs of HTML pages the user accesses in response to the query. Recall that a proxy will record all the items requested by a user, which includes the embedded URLs (such as image files) on each HTML page they view that are fetched automatically by the browser. Therefore, for the purpose of enhancing Web [...]
[...] search?hl=en&q=network+statistics in the proxy server log. A table of rules for how to extract the search terms from each (popular) known search engine (say, www.lycos.com, www.google.com, and search.yahoo.com among others) is easily maintained. It is more difficult to determine, using only proxy server logs, when a search session ends. In order to do so, the following assumptions are made: (1) Once a user submits a search query, as long as the user visits pages that are referenced, directly or indirectly through a link, by the results of the search query, the search session has not ended. This is not true when the user types in a URL that is also in the currently displayed page; this case is rare. (2) A search session ends if it is inactive for more than an hour. Inactivity is determined using the timestamp of the last URL added to the search session. (3) The user can perform a side task using their browser, and then return to the original searching task. The first URL in the side task is a transitional URL. Finally, a technical condition is required in settings where users aggressively "multi-task": (4) The user does not have more than 10 search sessions active at any one time.
[0051] A completed search session is one where a user visits at least one URL. The user could view the search engine results and decide not to visit any links, resulting in an incomplete search session.
[0052] FIG. 1 contains a subset 110 of the fields available in an exemplary proxy log corresponding to a search session. Many of the fields are not needed for the search session
extractor, and thus are not shoWn. For example, the proxy
log subset includes an IP (Internet protocol) address 114
associated With a proxy user, a time stamp 118 associated
With the logged event, a URL 122 associated With a target
Web page. Where the event is a search engine search the
URL can includes search terms 124. FIG. 2 lists the result
ing search session 210 With timestamps 214. There are many
complications that need to be addressed When extracting a
search session, such as, for example, handling multiple
concurrent searches from the same user on similar topics.
Details of the search session extractor are described by
Elizabeth Shriver and Mark Hansen in Search Session
Extraction: A User Model of Searching. Bell Labs Technical
Report, January 2002, incorporated herein by reference in its
entirety.
[0053] A re?nement of a query occurs When the user
modi?es the query or decides to use a different search
engine. For example, the user’s ?rst query might be “high
blood pressure”, the second query could be “high blood
pressure causes”, and the third could be “hypertension”.
[0054] Since the search terms could completely change
search for HTML pages, We exclude these other URLs from
during a re?nement, it Was determined that a query is an
a search session, and throughout the rest of this disclosure,
element in a re?nement by the amount of time betWeen tWo
consecutive queries from a user. For example, if the amount
take “URL” to mean an HTML URL. (HoWever, the meth
ods disclosed here are clearly extendable to other ?le and
data types.)
of time is short (e.g. less than 10 minutes), the queries are
assumed to be related. This heuristic Was veri?ed (by eye)
for a month Worth of queries and found to be suf?cient. A
[0050]
As Will be clear to those of ordinary skill in the art,
?nding the beginning of a search session from a proxy server
log is trivial: a session begins When a user submits a query
to a knoWn search engine like WWW.google.com. In terms of
the proxy server log ?le, this event is associated With a string
of the form “http://WWW.google.com/search?hl=en&q=
query”, Where “query” is another string consisting of one or
more search terms. For example, a search for “netWork
statistics” Will generate the string http://WWW.google.com
more sophisticated approach involves modeling the time
betWeen the initiation of search sessions, and deriving
user-speci?c time constants. A query that is not re?ned is a
simple query. Queries that are re?nements are grouped into
topic sessions.
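The extraction heuristics of paragraphs [0050] to [0054] can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the extractor of the cited technical report: every non-search URL is attached to the user's most recent session, so the link-following test of assumption (1), the side tasks of assumption (3), and the cap of assumption (4) are not modeled. The names `QUERY_RULES`, `extract_query`, and `extract_sessions` are hypothetical, the rule table holds only the Google form quoted in the text, and the `example.org` URLs in the usage note are made up.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical rule table: search-engine host -> name of its query parameter.
# Only the Google form quoted in the text is included here.
QUERY_RULES = {"www.google.com": "q"}
SESSION_TIMEOUT = 3600     # assumption (2): one hour of inactivity ends a session
REFINEMENT_WINDOW = 600    # queries under 10 minutes apart are treated as refinements


def extract_query(url):
    """Return the search terms if url is a query to a known engine, else None."""
    parts = urlparse(url)
    param = QUERY_RULES.get(parts.netloc.lower())
    if param is None:
        return None
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None


def extract_sessions(events):
    """events: iterable of (ip, timestamp_seconds, url), sorted by timestamp.

    Returns a list of session dicts with keys ip, query, start, last, urls, topic.
    Sessions sharing a topic id form one topic session.
    """
    sessions = []
    current = {}      # ip -> most recent session started by that user
    next_topic = 0
    for ip, ts, url in events:
        query = extract_query(url)
        prev = current.get(ip)
        if query is not None:
            # A new query close in time to the previous one is a refinement.
            if prev is not None and ts - prev["start"] < REFINEMENT_WINDOW:
                topic = prev["topic"]
            else:
                topic = next_topic
                next_topic += 1
            session = {"ip": ip, "query": query, "start": ts,
                       "last": ts, "urls": [], "topic": topic}
            sessions.append(session)
            current[ip] = session
        elif prev is not None and ts - prev["last"] <= SESSION_TIMEOUT:
            # Inactivity is measured from the last URL added to the session.
            prev["urls"].append((ts, url))
            prev["last"] = ts
    return sessions
```

Completed search sessions, in the sense of paragraph [0051], would then be those with a non-empty `urls` list, for example `[s for s in extract_sessions(events) if s["urls"]]`.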
[0055] Class of Algorithms
[0056] The search session data 210 contain the URLs 218 visited during user searches. From this information, many things can be determined. For example, how long a user visited a page, which page was visited first, which page was visited the most across search sessions for a specific query, and other information can be extracted from the session data 210. Thus, a class of algorithms is defined which manipulate search and topic sessions to improve Web search. Two examples from this class, SearchLight and Hyponym, will be discussed below. The general form of input into this class is $(t, q, u)$, where $u$ is a URL selected from the group of URLs formed by the transitive closure of the search engine results for query $q$. The timestamp $t$ is the difference in time between the current event and the previous. In the general form of output from the algorithm, each query $q_i$ is associated with one or more query groups, each URL $u_j$ is associated with one or more URL groups, and each query group $Q_k$ is associated with one or more URL groups $U_l$. A first relation is captured by the triple $(q_i, Q_k, w^q_{ik})$, where $w^q_{ik}$ is the probability that $q_i$ belongs to group $Q_k$. A second relation is captured by the triple $(u_j, U_l, w^u_{jl})$, where $w^u_{jl}$ is the probability that $u_j$ belongs to group $U_l$. A third relation is represented by $(Q_k, U_l, w_{kl})$, where $w_{kl}$ is the probability that $Q_k$ and $U_l$ are related. That is, with probability $w_{kl}$, the URLs in $U_l$ contain information about the queries in $Q_k$.
[0057] Practically, the triples are put in a table (often another database) which is then queried when a user performs a search. Clearly, the table can be placed at any point in the Web path that recognizes that the user is performing a query; possible spots are at the browser, in a proxy server if one is used, and at a search engine server.
[0058] We now briefly present a simple element of the class known as SearchLight. SearchLight uses a table of query and target URL pairs $(q, u)$, but does not involve any kind of clustering. The present invention, Hyponym, is best explained as an extension of SearchLight.
[0059] SearchLight
[0060] SearchLight begins with a table that records the number of times each query and target URL pair $(q, u)$ occurs among a collection of search sessions. Somewhat heuristically, the target URL for a search session is defined to be the last page that the user visits before they move to a new task. Other possible definitions include the URL that the searcher stays on for the longest amount of time and the first 5 URLs that the searcher visits.
[0061] The table is used to find and display URLs related to a query input by a user. For example, with reference to FIG. 3, SearchLight displays the URLs 314 by weight 318. (FIG. 3 assumes that SearchLight is implemented in a proxy; if it were implemented in a search engine, the window would have only the lower frame.)
[0062] SearchLight is triggered into action when a user enters a search string or query into a search engine. If necessary, SearchLight first modifies the query by converting it to lower case, removing punctuation, and sorting the terms alphabetically. If there are no table entries for the modified query, SearchLight considers intersecting sets of the search terms. This ensures that the application provides URLs even if the query is only a close approximation to those in the table. So, if a search for "cryptosystem mceliece" does not have any exact matches in the table, URLs would be returned from queries such as "mceliece", "cryptosystem", and even "robert mceliece".
[0063] Another search efficiency enhancing feature is an algorithm that replaces and/or expands abbreviations from the browser's search string and retrieves results for the matching abbreviated term. To determine the common abbreviations, for each URL logged in our exemplary proxy log, a list of all queries in which the URL was the last URL selected was generated. The lists of queries for the most frequent URLs were examined, and processing for the 12 most common abbreviations was added in a table lookup routine. For example, "NY" is replaced by "New York," and "air lines" (and vice versa). Of course, other kinds of enhancements can be added. For example, equating words with their plurals could be done by a word stemmer. The URL 314 list is sorted so that the most frequently accessed page is displayed first. As the number of URLs increases for queries, the URLs with low counts are moved off of the list that is displayed to the user. Thus, old URLs are displaced with newer URLs.
[0064] Hyponym
[0065] Aside from post-processing that enlarges or reduces search terms, SearchLight relies on an exact match to make recommendations. In studying the SearchLight table, it can be found that search terms that are semantically related often lead users to the same collection of URLs. Therefore, groups of queries are formed based on the similarity of their associated search sessions. In turn, by combining search sessions with queries in a given group, the relevance of the URLs recommended is improved. This is the basic idea behind Hyponym. When a user initiates a new search, they are presented with a display of query groups related to the search terms and the most relevant URLs for each group.
[0066] The present invention includes algorithms for both forming the query groups as well as determining the most relevant URLs for each group. The present invention, or Hyponym, constructs a statistical mixture model to describe the data contained in a table, e.g., the SearchLight table. This model has as its parameters the probability that a given query belongs to a particular group as well as a set of group-specific relevance weights assigned to collections of URLs. The algorithms attempt to fit the same model to the data. Some embodiments of Hyponym employ a standard EM (Expectation-Maximization) algorithm. However, this technique has problems related to scaling (both in the number of search sessions as well as the number of groups needed to obtain a reasonable fit) and therefore has disadvantages. Other embodiments of Hyponym use a relatively less computationally expensive scheme that is referred to as approximate EM. The approximate EM technique usually arrives at a different fit than the standard EM; however, there is typically little practical difference between the two. Finally, given the dynamic character of many databases (like the collection of pages on the Web), we will also introduce an embodiment of Hyponym that includes on-line variants of the EM algorithm that allow us to process search sessions in real time.
[0067] The Hyponym Idea
[0068] Given the description above, each query $q_i$ is associated with one or more groups. This relation is captured by the triple $(q_i, k, w_{ik})$, where $k$ denotes a group ID and $w_{ik}$ is the probability that $q_i$ belongs to group $k$. Then, for each
group, a number of relevant URLs are identified. This is described by the triple $(k, u_j, \lambda_{kj})$, where $u_j$ is a URL and $\lambda_{kj}$ is a weight that determines how likely it is that $u_j$ is associated with the queries belonging to group $k$. These triples are stored in a table that Hyponym uses. An example of a query-group triple $(q_i, k, w_{ik})$ might look like,
[0069] (infocom+2000, 304, 0.9)
[0070] While the associated group-relevance triples $(k, u_j, \lambda_{kj})$ might be
[0071] (304, http://www.ieee-infocom.org/2000/, 0.5)
[0072] (304, http://www.ieee-infocom.org/2000/program.html, 0.5)
[0073] As mentioned above, sets of such triples constitute the parameters in a statistical model for the search sessions contained in a table, similar to that described in reference to SearchLight.
[0074] A mixture model is employed to form both the query groups as well as the relevance weights. Assume that a dataset has $I$ queries that we would like to assign to $K$ groups, and in turn determine group-specific relevance weights for each of $J$ URLs. For the moment, let $n_{ij}$ denote the number of times the URL $u_j$ was selected by some user during a search session under the query $q_i$. Let $n_i = (n_{i1}, \ldots, n_{iJ})$ denote the vector of counts associated with query $q_i$. This vector is modeled as coming from a mixture of the form
$$p(n_i) = \sum_{k=1}^{K} \alpha_k \, p(n_i \mid \lambda_k) \qquad (1)$$
[0075] where the terms $\alpha_k$ sum to one and denote the proportion of the population coming from the kth component. Also associated with the kth component in the mixture is a vector of parameters $\lambda_k$. From a sampling perspective, one can consider the entire dataset $n_i$, $i = 1, \ldots, I$, as being generated in two steps: first, one of the $K$ groups $k^*$ is selected according to the probabilities $\alpha_k$, and then the associated distribution $p(\cdot \mid \lambda_{k^*})$ is used to generate the vector $n_i$.
[0076] The specification of each component in the mixture (1) is considered. It is assumed that in the kth component, the data $n_{ij}$ come from a Poisson distribution with mean $\lambda_{kj}$, where the counts for each different URL $u_j$ are independent. Then, setting $\lambda_k = (\lambda_{k1}, \ldots, \lambda_{kJ})$, the likelihood of the kth component associated with a vector of counts $n_i$ is given by
$$p(n_i \mid \lambda_k) = \prod_{j=1}^{J} \frac{e^{-\lambda_{kj}} \, \lambda_{kj}^{n_{ij}}}{n_{ij}!} \qquad (2)$$
[0077] To fit a model of this type, a set of unobserved (or missing) indicator variables $\gamma_{ik}$ is introduced, where $\gamma_{ik} = 1$ if $q_i$ is in group $k$, and zero otherwise. Then, the so-called complete data likelihood for both the set of counts $n_i$ and the indicator variables $\gamma_i = (\gamma_{i1}, \ldots, \gamma_{iK})$, $i = 1, \ldots, I$, can be expressed as
$$\prod_{i=1}^{I} \prod_{k=1}^{K} \left[ \alpha_k \, p(n_i \mid \lambda_k) \right]^{\gamma_{ik}} \qquad (3)$$
[0078] The parameters $\lambda_{kj}$ are referred to as relevance weights, and the probability that $\gamma_{ik} = 1$ is used as the kth group weight for query $q_i$ (the $w_{ik}$ mentioned at the beginning of this section).
[0079] A number of different algorithms fit this model and, in turn, perform a clustering. They are presented below.
[0080] The table is used to display URLs related to the query searched by the user. Referring to FIG. 4, the query groups 414, 418 are displayed by weight, with the URLs 422, 426 in each group ordered by weight.
[0081] Standard EM Algorithm
[0082] As explained by A. P. Dempster, N. M. Laird, and D. B. Rubin, in Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion), Journal of the Royal Statistical Society (Series B), 39:1-38, 1977, incorporated herein by reference in its entirety, the standard Expectation-Maximization (EM) algorithm is a convenient statistical tool for finding maximum likelihood estimates of the parameters in a mixture model.
[0083] The EM algorithm alternates between two steps: an E-step, in which the expectation of the complete data log-likelihood is computed conditional on the observed data and the current parameter estimates, and an M-step, in which the parameters maximizing the expected log-likelihood from the E-step are found. The E-step consists of calculating the conditional expectation of the indicator variables $\gamma_{ik}$, which are denoted:
$$\hat{\gamma}_{ik} = \frac{\hat{\alpha}_k \, p(n_i \mid \hat{\lambda}_k)}{\sum_{l=1}^{K} \hat{\alpha}_l \, p(n_i \mid \hat{\lambda}_l)} \qquad (4)$$
[0084] where $p(\cdot \mid \lambda_k)$ is given in (2). In this expression, the quantities $\hat{\alpha}_k$ and $\hat{\lambda}_k$ denote our current parameter estimates. Note that $\hat{\gamma}_{ik}$ is an estimate of the probability that query $q_i$ belongs to group $k$, and will be taken to be our query group weights. Then, for the M-step, $\hat{\gamma}_{ik}$ is substituted for $\gamma_{ik}$ in (3), and the result is maximized with respect to the parameters $\alpha_k$ and $\lambda_k$. In this case, a closed form expression is available, giving the updates
$$\hat{\alpha}_k = \frac{1}{I} \sum_{i=1}^{I} \hat{\gamma}_{ik}, \qquad \hat{\lambda}_{kj} = \frac{\sum_{i=1}^{I} \hat{\gamma}_{ik} \, n_{ij}}{\sum_{i=1}^{I} \hat{\gamma}_{ik}}$$
[0085] Clearly, these simple updates make the EM algorithm a convenient tool for determining query group weights and relevance weights. Unfortunately, the algorithm can converge slowly and will often reach only a local maximum. To obtain a good solution, we start the EM process from several random initial conditions and take the best of the converged fits.
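The E-step and M-step described in paragraphs [0083] to [0085] can be illustrated with a small, dependency-free sketch. It assumes the model stated in the text (group proportions that sum to one, independent Poisson counts per URL within a group) and uses the standard closed-form updates for such a mixture: responsibilities proportional to the proportion times the Poisson likelihood, proportions equal to the average responsibility, and means equal to the responsibility-weighted average count. The function names, the `floor` guard against zero means, and the `init_lam` parameter are assumptions made for illustration, not part of the original disclosure.

```python
import math
import random


def log_poisson(counts, lam):
    """log p(n_i | lambda_k): independent Poisson counts, one mean per URL."""
    return sum(n * math.log(m) - m - math.lgamma(n + 1)
               for n, m in zip(counts, lam))


def e_step(data, alpha, lam):
    """gamma[i][k]: posterior probability that query i belongs to group k."""
    gamma = []
    for counts in data:
        logs = [math.log(a) + log_poisson(counts, m) for a, m in zip(alpha, lam)]
        top = max(logs)                      # subtract the max for stability
        weights = [math.exp(x - top) for x in logs]
        total = sum(weights)
        gamma.append([w / total for w in weights])
    return gamma


def m_step(data, gamma, floor=1e-9):
    """Closed-form updates of group proportions and per-group URL means."""
    n_groups, n_urls = len(gamma[0]), len(data[0])
    alpha, lam = [], []
    for k in range(n_groups):
        mass = max(sum(g[k] for g in gamma), floor)
        alpha.append(mass / len(data))
        lam.append([
            max(sum(g[k] * c[j] for g, c in zip(gamma, data)) / mass, floor)
            for j in range(n_urls)
        ])
    return alpha, lam


def fit(data, n_groups, n_iter=50, init_lam=None, seed=0):
    """Run EM from one starting point; data[i][j] is the count n_ij."""
    rng = random.Random(seed)
    n_urls = len(data[0])
    alpha = [1.0 / n_groups] * n_groups
    lam = init_lam or [[rng.uniform(0.5, 1.5) for _ in range(n_urls)]
                       for _ in range(n_groups)]
    for _ in range(n_iter):
        gamma = e_step(data, alpha, lam)
        alpha, lam = m_step(data, gamma)
    return alpha, lam, e_step(data, alpha, lam)
```

Because EM may stop at a local maximum, as the text notes, one would typically call `fit` with several different `seed` values and keep the fit with the highest likelihood; that selection step is omitted here for brevity.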
[0086] Approximate Algorithm With Prior Data
[0087] In moving from the original SearchLight to the Hyponym embodiments, the query groups formed by the mixture model introduced above allow the borrowing of strength from search sessions initiated with different but semantically related search terms. The mixture approach, however, is highly unstructured in the sense that only user data is incorporated to learn the groups and relevance weights. Having taken a probabilistic approach to grouping, prior information about related URLs is incorporated from an existing directory like DMOZ (www.dmoz.com) or Yahoo. These directories constitute predetermined directories and include labeled data. For example, consider the 30,000 categories identified by DMOZ. (FIG. 5 contains a subset of the fields for a few sample DMOZ entries.) Embodiments of Hyponym that are based on the approximate EM algorithms use a directory structure or labeled data to seed query groups. The data in such a structure can be represented as a set of pairs $(l, u_{lj})$, where $u_{lj}$ is a URL in the lth group. The probability $w^u_{jl}$ is not specified in the directory, so it is assumed to have a value of $\alpha$.
[0088] With this algorithm, as illustrated in the pseudo code below, mappings between queries and URLs are established when either the query or the URL has been seen in either the prior data (from a proxy log) or the sessions that have already been processed.
[0089] readPriorData
[0090] for each session s
[0091]   if (query in s was NOT seen before) and
[0092]      (# URLs existed/# URLs in s<T)
[0093]     put s aside to be processed using BasicEM algorithm
[0094]   else
[0095]     createURLGroup if needed
[0096]     add mappings between query and URL groups
[0097] output mappings
[0098] The remaining sessions are processed in a batch using the standard EM algorithm described above. The approximate algorithm can be tuned with the threshold value $T$ ($0 \le T \le 1$) to force more of the URLs in the session to exist in the prior data or the previously processed data. The approximate EM algorithm supports processing data as search sessions or topic sessions.
[0099] The approximate EM algorithm has the advantage of incorporating prior or predetermined data, but has the disadvantage of only slowly adding to the set of clusters when a new topic is found in the data.
[0100] On-line EM Algorithm With Labeled Data
[0101] Using the probabilistic mixture model of the data, a more formal on-line algorithm can take a directory structure as input and learn the query clusters. The on-line EM algorithm presented in On-Line EM Algorithm for the Normalized Gaussian Network by Masa-aki Sato and Shin Ishii, Neural Computation, 12(2):407-432, February 2000, incorporated herein by reference in its entirety, is used to process data arriving in a stream of search sessions. To understand this approach, consider the E-step given in (4) and the corresponding M-step. Because the Poisson model is part of an exponential family, these updates should be considered in terms of the sufficient statistics for the complete data model. Then, the E-step computes
[...]
[0102] where the updated probabilities are given in [...]. In the on-line approach, the sums over the queries $i$ are replaced with a weighted version indexed by time. Let $q_t$, $u_t$, $n_t$ represent a search session initiated at time $t$, and let $u_t = (u_{t,1}, \ldots, u_{t,J_t})$ denote the URLs visited and $n_t = (n_{t,1}, \ldots, n_{t,J_t})$ their frequency within the session. Then, an on-line version of (6) resembles
[...]
and
$$T_{j_{t,l}k}(t) = (1 - \eta(t)) \, T_{j_{t,l}k}(t-1) + \eta(t) \, \hat{\gamma}_{tk} \, n_{t,l} \quad \text{for } l = 1, \ldots, J_t$$
[0103] where $j_{t,l}$, $l = 1, \ldots, J_t$, represents an index for URL $u_{t,l}$. In these expressions,
[...]
and
[...]
[0104] The term $0 < \eta(t) < 1$ is used to control the learning rate. Technically, given a long stream of data, the rate should be reduced as more data is seen. Given the sparse nature of the search session data, a constant learning rate is found to perform adequately. Additionally, Sato and Ishii have shown that for a fixed number of clusters, this approach provides a stochastic approximation to the maximum likelihood estimators for query group membership and the relevance weights. Given the large number of clusters in a directory like DMOZ, it is impractical to do a full M-step at each time point. Instead, we choose to only take a partial M-step and update just those relevance weights with indices contained in the incoming search session $(q_t, u_t, n_t)$. This kind of alteration is well known (for example, see A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants by Radford M. Neal and Geoffrey E. Hinton in Learning in Graphical Models, pages 355-368, Kluwer Academic Publishers, 1998, incorporated herein by reference in its entirety) and in the basic EM algorithm does not affect convergence.
[0105] To incorporate the labeled data, a set of clusters is initiated so that each URL $u_{lj}$ in the lth group of the existing hierarchy is assigned some fixed value $\lambda_{lj}$. Then, when a new search session $(q_t, u_t, n_t)$ arrives, (7) is evaluated to see how well it fits with the existing groups. If the probability is too small, we initiate a new cluster with the URLs $u_t$ and intensities $n_t$. When faced with a long stream of data, splitting clusters and deleting unused clusters may be necessary.
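One way to picture the on-line processing of paragraphs [0104] and [0105] is the sketch below. It is an assumption-laden illustration rather than the update used in the disclosure: groups are seeded from labeled URL lists with a fixed relevance weight, each incoming session is scored against every group under the Poisson model, and only the relevance weights of URLs present in the session are nudged toward the observed counts with a constant learning rate `eta` (a partial M-step). The specific update form, the `EPS` floor, and the function names are hypothetical; creation of new clusters for poorly fitting sessions is left out.

```python
import math

EPS = 1e-6  # hypothetical floor used inside the log for URLs a group has not seen


def seed_groups(directory, weight=1.0):
    """directory: list of URL lists, one per labeled group.

    Returns (alpha, lam): uniform group proportions and, per group,
    a dict mapping each seeded URL to a fixed initial relevance weight.
    """
    alpha = [1.0 / len(directory)] * len(directory)
    lam = [{url: weight for url in urls} for urls in directory]
    return alpha, lam


def responsibilities(counts, alpha, lam):
    """Posterior probability of each group for one session {url: count}."""
    logs = []
    for a, means in zip(alpha, lam):
        ll = math.log(a) - sum(means.values())      # URLs not visited have count 0
        for url, n in counts.items():
            ll += n * math.log(max(means.get(url, 0.0), EPS)) - math.lgamma(n + 1)
        logs.append(ll)
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]
    total = sum(weights)
    return [w / total for w in weights]


def online_step(counts, alpha, lam, eta=0.1):
    """Process one session in place and return its group responsibilities."""
    gamma = responsibilities(counts, alpha, lam)
    for k, g in enumerate(gamma):
        alpha[k] = (1 - eta) * alpha[k] + eta * g
        for url, n in counts.items():               # partial M-step: session URLs only
            old = lam[k].get(url, 0.0)
            lam[k][url] = old + eta * g * (n - old)
    return gamma
```

With two seeded groups, a session whose URLs all belong to the first group would be assigned to it almost entirely, and only that group's weights for the visited URLs would move noticeably.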
[0106] Summary
[0107] Referring to FIG. 6, the present invention concerns a method (610) for improving the standard search facility for a database. The activities of a community of users are monitored as they search the database (step 614). With each search, a user is presented with a list of indices for records in the database. Preferably this list also summarizes each of the potentially relevant records. The user reviews the list and accesses records that appear (from the accompanying information) to be related to their search terms. The user's search terms together with the indices of the items they access are combined to form a search session (step 618). The extraction of search sessions can happen either when the session has ended (known as on-line processing) or periodically in a batch, say, processing a log file of user activities once a day (known as off-line processing).
[0108] The extracted search sessions are then used to formulate groups of semantically related queries, and to associate with each group a set of relevance weights, or technically, the probabilities that each record satisfies the queries in each group (step 622). In an exemplary embodiment, the formation of query groups and relevance weights is accomplished by fitting a mixture model. In this case, a probability distribution is constructed that describes how the data were generated. In other embodiments, the clustering of queries and the determination of relevance weights can be done in separate steps. In still another embodiment, groups of records could also be formed from the search session data, in which case the relevance weights would associate query groups and record groups. This computation could be done either in one step (by formulating a slightly more elaborate mixture model) or in two or more separate steps. This processing can be done either for each new search session (on-line processing) or at regular intervals in batch mode (off-line processing).
[0109] The essential byproduct of this component of the invention is a collection of query groups and relevance weights. We use this data to aid users with future searches. In addition to the output from the standard search facility, we also present the user with an additional display built from our table of query groups and relevance weights (step 626). Given a new query, the present invention first identifies one or more query groups based on the search terms of the query. Then, for each group, indices for the most relevant records in each group are presented to the user. This list of indices is stratified by query group, making it easier to browse the search results.
[0110] The user population referred to above is preferably a community of users with something in common. For example, the user community can be the workers of a company or organization, mobile device users in a particular location, or users that are grouped together because of common interests or habits. The method 610 of improving Web search is beneficially applied to these kinds of user populations because the common interest or aspect of the community can be used to automatically narrow or fine tune search results. For example, where a user population is defined as Bell Labs workers and a search uses the term "ATM", the reliance on population search path statistics of the method 610 of improving Web search may direct users to pages containing information about Asynchronous Transfer Mode switches. At the same time, where a user population is defined as mobile device users, search results related to the search term "ATM" might be supplemented with a list giving priority to Web pages containing information about the location of Automatic Teller Machines. Similarly, where a user population is defined as mobile device users in New York, searches for "restaurants" are supplemented with lists of pages prioritized toward those with information about restaurants near them in New York. Such users are relieved from having to wade through information about restaurants in Chicago and elsewhere.
[0111] The methods of this invention, such as those summarized in FIG. 6, may be implemented in a variety of communication and computing environments. As explained above, for example, they may be implemented in proxy servers, search engine provider hardware, gateways, and other points in database or Internet search paths. With a full understanding of the present invention, those of skill in the art will readily determine suitable hardware and software configurations for their particular applications.
[0112] For example, with reference to FIG. 7 and FIG. 8, a user in a user community 710, 810 uses a Web browser to access a Web search engine such as Google or Yahoo. All of the Web traffic generated by the user goes through a Web proxy server 714, 814, so the request for a search engine also does. Once the proxy server 714, 814 determines that the request is a search engine request, it routes it on to the search engine, which is a Web server 718, 818 in the Internet 722, 822, and also sends the query to a Clusterer 726, 826. The Clusterer 726, 826 sends records whose probabilities have passed a threshold. The records are maintained in tables in the Clusterer 726, 826.
[0113] These tables are generated using search session data and/or labeled data 734, 834. If the Clusterer is an on-line Clusterer 826, the search session data is input into the Clusterer 826 as individual search sessions from the Search Session Extractor 830. If the Clusterer is an off-line Clusterer 726, the search session data is batched by the Search Session Extractor 730, perhaps batching by a 24-hour period. Certain versions of the Clusterer 726, 826 might use labeled data 734, 834 in their algorithms.
[0114] The input into the Search Session Extractor 730, 830 is the proxy access logs. If the Clusterer is on-line, each log event is sent to the Extractor as it occurs. Otherwise, a batch of log events is sent.
[0115] Clearly, there are different types of searches that users perform; sometimes, there is one desired page (e.g., a conference call-for-papers announcement), and other times, the searching process of visiting many pages allows the user to find the desired information (e.g., what is available on the Web about wireless handsets). In addition, the user can use the desired page as a jump-off point for further exploration. SearchLight is most successful when one page is desired. Hyponym is successful for both types of searching.
[0116] The invention has been described with reference to particular embodiments. Modifications and alterations will occur to others upon reading and understanding this specification. It is intended that all such modifications and alterations are included insofar as they come within the scope of the appended claims or equivalents thereof.
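As an illustration of the lookup step summarized in paragraph [0109], the sketch below finds the query groups for a new query and lists each group's URLs by relevance weight. It is only a sketch under assumptions: the two dictionaries stand in for the table of triples, the normalization borrows the lower-casing, punctuation removal, and term sorting that paragraph [0062] describes for SearchLight, and the names `normalize` and `recommend` are hypothetical. The example rows in the usage note reuse the triples given in paragraphs [0069] to [0072].

```python
import string


def normalize(query):
    """Lower-case the query, strip punctuation, and sort the terms."""
    cleaned = query.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(cleaned.split()))


def recommend(query, query_groups, group_urls, top_n=5):
    """Return [(group_id, group_weight, [(url, weight), ...]), ...].

    query_groups: normalized query -> list of (group_id, weight)
    group_urls:   group_id -> list of (url, relevance_weight)
    Groups and URLs are each ordered by decreasing weight.
    """
    result = []
    groups = sorted(query_groups.get(normalize(query), []),
                    key=lambda gw: gw[1], reverse=True)
    for group_id, weight in groups:
        urls = sorted(group_urls.get(group_id, []),
                      key=lambda uw: uw[1], reverse=True)
        result.append((group_id, weight, urls[:top_n]))
    return result
```

For instance, with `query_groups = {normalize("infocom 2000"): [(304, 0.9)]}` and `group_urls = {304: [("http://www.ieee-infocom.org/2000/", 0.5), ("http://www.ieee-infocom.org/2000/program.html", 0.5)]}`, a call such as `recommend("Infocom 2000!", query_groups, group_urls)` would return group 304 with both URLs, while an unknown query returns an empty list.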
Jan. 16, 2003
US 2003/0014399 A1
10
We claim:
1. A method of improving search of a database, the
probabilities that records in the database are relevant for
method comprising:
labeled data.
12. The method of improving database search of claim 11
each query or path group includes using predetermined
monitoring user search activity in a user population;
Wherein determining groups of semantically related queries
extracting search sessions, de?ned by search queries and
or paths based on search session data and determining
probabilities that records in the database are relevant for
each query or path group includes applying an approximate
paths, from user search activity;
determining groups of semantically related queries or
paths based on search session data;
determining probabilities that records in the database are
relevant for each query or path group;
maintaining a table associating an index for each record in
the database With the probability that the record is
relevant for each query or path group; and,
supplementing search results With information regarding
Expectation-MaximiZation algorithm to the predetermined
labeled data.
13. The method of improving database search of claim 11
Wherein determining groups of semantically related queries
or paths based on search session data and determining
probabilities that records in the database are relevant for
each query or path group includes using predetermined
labeled data by seeding query or path groups.
14. The method of improving database search of claim 1
records from the database With tabulated relevance
Wherein determining groups of semantically related queries
probabilities.
or paths based on search session data and determining
probabilities that records in the database are relevant for
2. The method of improving database search of claim 1
Wherein the search is Web page search and the database
includes a collection of available Web pages.
3. The method of improving database search of claim 1
Wherein the search is Web page search and the database
includes a collection of publicly available Internet Web
pages.
4. The method of improving database search of claim 1
Wherein the search is Web page search and the database
includes a collection of private intranet Web pages.
5. The method of improving database search of claim 2
Wherein monitoring user search activity in a user population
and extracting search sessions from user search activity
includes off-line processing of proxy server access logs to
determine search sessions.
6. The method of improving database search of claim 2
Wherein monitoring user search activity in a user population
and extracting search sessions from user search activity
includes on-line processing in a proxy server to determine
search sessions.
7. The method of improving database search of claim 2
Wherein monitoring user search activity in a user population
and extracting search sessions from user search activity
includes off-line processing of proxy server access logs to
determine complete search sessions.
8. The method of improving database search of claim 2
Wherein monitoring user search activity in a user population
and extracting search sessions from user search activity
includes on-line processing in a proxy server to determine
complete search sessions.
9. The method of improving database search of claim 1
includes extracting topic sessions, de?ned by multiple
search sessions Where the queries include re?nements, from
user search activity.
10. The method of improving database search of claim 1
wherein determining groups of semantically related queries
or paths based on search session data and determining
probabilities that records in the database are relevant for
each query or path group includes clustering queries based
on a similarity of the associated search paths using a Poisson
mixture model.
11. The method of improving database search of claim 1
wherein determining groups of semantically related queries
or paths based on search session data and determining
[...]
each query or path group includes clustering queries or paths
in an on-line fashion.
15. The method of improving database search of claim 1
wherein maintaining a table associating the index for each
record includes using a database to store the table.
16. The method of improving database search of claim 1
wherein supplementing search results with information
regarding records from the database with tabulated
relevance probabilities includes displaying the information in a
separate area of the display from results of a search engine.
17. The method of improving database search of claim 1
wherein supplementing search results with information
regarding records from the database with tabulated
relevance probabilities includes modifying the order of the
information.
18. The method of improving database search of claim 1
wherein determining groups of semantically related queries
or paths based on search session data and determining
probabilities that records in the database are relevant for
each query or path group includes clustering data.
19. The method of improving database search of claim 1
wherein determining groups of semantically related queries
based on search session data and determining probabilities
that records in the database are relevant for each query group
includes clustering queries based on a similarity of items in
their associated search paths.
20. The method of improving database search of claim 19
wherein determining groups of semantically related queries
or paths based on search session data and determining
probabilities that records in the database are relevant for
each query or path group includes clustering queries or paths
using an Expectation-Maximization algorithm.
21. A method of improving search of a database, the
method comprising:
monitoring user search activity in a user population;
extracting search sessions, defined by search queries and
paths, from user search activity;
determining groups of semantically related paths based on
search session data;
determining probabilities that records in the database are
relevant for each path group;
maintaining a table associating an index for each record in
the database with the probability that the record is
relevant for each path group; and,
supplementing search results with information regarding
records from the database with tabulated relevance
probabilities.
22. A method of improving search of a database, the
method comprising:
monitoring user search activity in a user population;
extracting search sessions, defined by search queries and
paths, from user search activity;
determining groups of semantically related queries based
on search session data;
determining probabilities that records in the database are
relevant for each query group;
maintaining a table associating an index for each record in
the database with the probability that the record is
relevant for each query group; and,
supplementing search results with information regarding
records from the database with tabulated relevance
probabilities.
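By way of illustration only, the steps recited in claim 22 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the session data, the record names, the function names, and the 0.3 overlap threshold are all hypothetical, and a simple greedy grouping by Jaccard overlap of search paths stands in for the Poisson mixture model and Expectation-Maximization clustering recited in claims 10 and 20. The relevance probability is taken here to be a record's share of all record visits within a query group.

```python
from collections import Counter, defaultdict

# Hypothetical search sessions: (query, records visited on the search path).
SESSIONS = [
    ("jaguar car", ["cars.example/jaguar", "cars.example/dealers"]),
    ("jaguar auto", ["cars.example/jaguar"]),
    ("jaguar cat", ["zoo.example/jaguar", "wiki.example/big-cats"]),
    ("jaguar animal", ["zoo.example/jaguar"]),
]


def jaccard(a, b):
    """Overlap of two record collections, from 0.0 (disjoint) to 1.0 (equal)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_queries(sessions, threshold=0.3):
    """Greedy single-pass grouping (an assumed stand-in for mixture-model
    clustering): a query joins the first group whose pooled path records
    overlap its own path by at least `threshold`, else starts a new group."""
    groups = []  # each group: {"queries": set, "records": set}
    for query, path in sessions:
        for group in groups:
            if jaccard(path, group["records"]) >= threshold:
                group["queries"].add(query)
                group["records"].update(path)
                break
        else:
            groups.append({"queries": {query}, "records": set(path)})
    return groups


def relevance_table(sessions, groups):
    """Build record -> {group index: probability}, where the probability is
    the record's share of all record visits made within that query group."""
    group_of = {q: i for i, g in enumerate(groups) for q in g["queries"]}
    visits = defaultdict(Counter)
    for query, path in sessions:
        visits[group_of[query]].update(path)
    table = defaultdict(dict)
    for gi, counter in visits.items():
        total = sum(counter.values())
        for record, n in counter.items():
            table[record][gi] = n / total
    return dict(table), group_of


def supplement(results, query, table, group_of):
    """Reorder search-engine results by tabulated relevance probability for
    the query's group; results for unknown queries are returned unchanged."""
    gi = group_of.get(query)
    if gi is None:
        return list(results)
    return sorted(results, key=lambda r: -table.get(r, {}).get(gi, 0.0))


groups = group_queries(SESSIONS)
table, group_of = relevance_table(SESSIONS, groups)
ranked = supplement(
    ["cars.example/dealers", "zoo.example/jaguar", "cars.example/jaguar"],
    "jaguar auto",
    table,
    group_of,
)
print(ranked)
```

In this sketch the table is an in-memory dictionary; claim 15 contemplates storing it in a database instead. Reordering corresponds to the option in claim 17, and the same probabilities could instead be shown in a separate display area as in claim 16.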