US 20030014399A1
(19) United States
(12) Patent Application Publication
Hansen et al.
(10) Pub. No.: US 2003/0014399 A1
(43) Pub. Date: Jan. 16, 2003
(54) METHOD FOR ORGANIZING RECORDS OF DATABASE SEARCH ACTIVITY BY TOPICAL RELEVANCE
(76) Inventors: Mark H. Hansen, Hoboken, NJ (US); Elizabeth A. Shriver, Jersey City, NJ (US)
Correspondence Address: FAY, SHARPE, FAGAN, MINNICH & McKEE, LLP, Seventh Floor, 1100 Superior Avenue, Cleveland, OH 44114-2518 (US)
(21) Appl. No.: 10/096,688
(22) Filed: Mar. 12, 2002
Related U.S. Application Data
(60) Provisional application No. 60/275,068, filed on Mar. 12, 2001.
Publication Classification
(51) Int. Cl.: G06F 7/00
(52) U.S. Cl.: 707/3

(57) ABSTRACT

A method for organizing records of a database by topical relevance generates statistics on relevance by monitoring search terms used and search paths traversed by a database user community. Records reviewed most often in relation to a given search term are assumed to be most relevant to that search term in the eyes of members of the user community. Additionally, a record reviewed in relation to a plurality of search terms is determined to be related by topical relevance to other records reviewed in relation to that plurality of search terms. Again, a probability is calculated, based on a frequency of record review and search terms used, as a measure of this record topical relevance. An embodiment directed toward Internet searches provides for seeding the probability calculations with information from labeled data available from open source Internet directories. The activities of the user community are monitored, for example, at a proxy server, or by reviewing proxy server logs. Other monitoring points are contemplated.
[Drawing sheets 1 to 5 omitted: figure content not recoverable, apart from fragments of a proxy server log excerpt carrying reference numerals 114, 118, 122 and 124.]

METHOD FOR ORGANIZING RECORDS OF DATABASE SEARCH ACTIVITY BY TOPICAL RELEVANCE

BACKGROUND OF THE INVENTION

This application claims the benefit of Provisional Application Serial No. 60/275,068, filed Mar. 12, 2001, the entire substance of which is incorporated herein by reference.

The invention is related to the art of data search. It is described in reference to World Wide Web and Internet searching. However, those of ordinary skill in the art will understand that the described embodiments can readily be adapted to other database or data search tasks.

A great deal of work is being done to improve database and Web searching. For example, Ayse Goker and Daqing He, in Analyzing Web Search Logs to Determine Session Boundaries for Unoriented Learning, Proceedings of the Adaptive Hypermedia and Adaptive Web-Based Systems International Conference (Trento, Italy), pages 319-322, August 2000, incorporated herein by reference in its entirety, define a search session to be a meaningful unit of activities, with the intention of using it as input for a learning technique. Sessions are determined by a length in time from the first search query. Goker reports that a session boundary of 11-15 minutes compares well with human judgment. This is a simple model, and does not allow for determining which events in the time window correspond to Web searching. Additionally, Goker analyzed logs from search engines only.

Johan Bollen, in Group User Models for Personalized Hyperlink Recommendation, Proceedings of the Adaptive Hypermedia and Adaptive Web-Based Systems International Conference (Trento, Italy), pages 39-50, August 2000, incorporated herein by reference in its entirety, presents a method to reconstruct user searching using the Web server log entries of the Los Alamos Research Library corresponding to access to the digital library of journal articles. The resulting retrieval paths are a group user model. The group user model is used to construct relationships between journals using a V×V matrix, where V is the set of hypertext pages. In this library of journal articles, a journal article is represented by a URL (Universal Resource Locator). This approach will not scale well and would be overwhelmed when V is the set of publicly-accessed URLs.

Many techniques exist for automatically determining the category of a document based on its content (e.g., Yiming Yang and Xin Liu, in A Re-Examination of Text Categorization Methods, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval (Berkeley, Calif.), pages 42-49, ACM, August 1999 and its references, all of which are incorporated herein by reference in their entirety) and the in- and out-links of the document. For example, Jeffrey Dean and Monika R. Henzinger in Finding Related Web Pages in the World Wide Web, Proceedings of the Eighth International World Wide Web Conference (WWW8) (Toronto, Canada), pages 389-401, Elsevier Science, May 1999, incorporated herein by reference in its entirety, Dharmendra S. Modha and W. Scott Spangler, in Clustering Hypertext With Applications to Web Searching, Proceedings of the ACM Hypertext 2000 Conference (San Antonio, Tex.), May 2000, incorporated herein by reference in its entirety, and Attardi et al. (Giuseppe Attardi, Antonio Gulli, and Fabrizio Sebastiani, in Theseus: Categorization by Context, Proceedings of the Eighth International World Wide Web Conference (WWW8) (Toronto, Canada), pages 389-401, Elsevier Science, May 1999, incorporated herein by reference in its entirety) use the context surrounding a link in an HTML document to extract information for categorizing the document referred to by the link. Oren Zamir and Oren Etzioni, in Web Document Clustering: A Feasibility Demonstration, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98) (Melbourne, Australia), pages 46-54, ACM, August 1998, incorporated herein by reference in its entirety, use the snippets of text returned by search engines to quickly group the results based on phrases shared between documents. Tsuyoshi Murata, in Discovery of Web Communities Based on the Co-Occurrence of References, Proceedings of the Third International Conference on Discovery Science (DS'2000) (Kyoto, Japan), December 2000, incorporated herein by reference in its entirety, computes clusters of URLs returned by a search engine by entering the URLs themselves as secondary queries.

Clusters of similar Web pages can be developed using the approach presented by Dean and Henzinger, which finds pages similar to a specified one by using connectivity information on the Web. The Context Classification Engine catalogs documents with one or more categories from a controlled set. For example, see Classifying Content With Ultraseek Server CCE by Walter Underwood of Inktomi Search Software CCE, Foster City, Calif., incorporated herein by reference in its entirety. The categories can be arranged in either a hierarchical or enumerative classification scheme. Finally, DynaCat, by Wanda Pratt, Marti A. Hearst, and Lawrence M. Gagan in A Knowledge-Based Approach to Organizing Retrieved Documents, Proceedings of the 6th National Conference on Artificial Intelligence (AAAI-99); Proceedings of the 11th Conference on Innovative Applications of Artificial Intelligence (Orlando, Fla.), pages 80-85, AAAI/MIT Press, July 1999, incorporated herein by reference in its entirety, dynamically categorizes search results into a hierarchical organization using a model of the domain terminology.

Another approach to document categorization is "content ignorant." For example, Doug Beeferman and Adam Berger in Agglomerative Clustering of a Search Engine Query Log, Proceedings of the 2000 Conference on Knowledge Discovery and Data Mining (Boston, Mass.), pages 407-416, August 2000, incorporated herein by reference in its entirety, use click-through data to discover disjoint sets of similar queries and disjoint sets of similar URLs. Their algorithm represents each query and URL as a node in a graph and creates edges representing the user action of selecting a specified URL in response to a given query. Nodes are then merged in an iterative fashion until some termination condition is reached. This algorithm forces a hard clustering of queries and URLs. It works on large sets of data in batch mode, and does not include prior labeled data from existing content hierarchies. By focusing on click-through statistics, these authors only see an abbreviated portion of a user's activities while searching. This paper also only advocates improving Web search by proposing for users alternative queries taken from the disjoint sets of queries built by their algorithm.

Approaches to hierarchical classification such as that discussed by Ke Wang, Senqiang Zhou, and Shiang Chen Liew in Building Hierarchical Classifiers Using Class Proximity, Proceedings of the Twenty-fifth International Conference on Very Large Databases (Edinburgh, Scotland, UK), pages 363-374, September 1999, incorporated herein by reference in its entirety, when applied to our data, would only allow for one URL to be related with each query.

Most recent work in Web searching has been to improve the search engine ranking algorithms. For example, PageRank, by Sergey Brin and Lawrence Page, in The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proceedings of the Seventh International World Wide Web Conference (Brisbane, Australia), Elsevier Science, April 1998, incorporated herein by reference in its entirety, the WISE System by Budi Yuwono and Dik Lun Lee, in WISE: A World Wide Web Resource Database System, IEEE Transactions on Knowledge and Data Engineering, 8(4):548-554, August 1996, incorporated herein by reference in its entirety, Budi Yuwono and Dik L. Lee, in Server Ranking for Distributed Text Retrieval Systems on the Internet, Proceedings of the 5th International Conference on Database Systems for Advanced Applications (DASFAA '97) (Melbourne, Australia), pages 41-49, April 1997, incorporated herein by reference in its entirety, and NECI's metasearch engine, by Steve Lawrence and C. Lee Giles, in Inquirus, the NECI Meta Search Engine, Proceedings of the Seventh International World Wide Web Conference (Brisbane, Australia), pages 95-105, Elsevier Science, April 1998, incorporated herein by reference in its entirety, are examples of such work. Direct Hit (www.directhit.com) claims to track which Web sites a searcher selects from the list provided by a search engine and how much time she spends on those sites, and takes into account the position of each site relative to other sites on the list provided. Thus, for future queries, the most popular and relevant sites are notated in the search engine results.

WebWatcher attempts to serve as a tour guide to Web neighborhoods; see WebWatcher: A Learning Apprentice for the World Wide Web by Robert Armstrong, Dayne Freitag, Thorsten Joachims, and Tom Mitchell in Proceedings of the 1995 AAAI Spring Symposium on Information Gathering From Heterogeneous, Distributed Environments (Palo Alto, Calif.), pages 6-12, March 1995, incorporated herein by reference in its entirety, and WebWatcher: A Tour Guide for the World Wide Web by Thorsten Joachims, Dayne Freitag, and Tom M. Mitchell in Proceedings of 15th International Joint Conference on Artificial Intelligence (IJCAI97) (Nagoya, Japan), pages 770-777, Morgan Kaufmann, August 1997, incorporated herein by reference in its entirety. Users invoke WebWatcher by following a link to the WebWatcher server, then continue browsing as WebWatcher accompanies them, providing advice along the way on which link to follow next based on a stated user goal. WebWatcher gains expertise by analyzing user actions, statements of interest, and the set of pages visited by users. Their studies suggested that WebWatcher could achieve close to the human level of performance on the problem of predicting which link a user will follow given a page and a statement of interest.

Marko Balabanovic and Yoav Shoham in Fab: Content-Based, Collaborative Recommendation, Communications of the ACM, 40(3):66-72, March 1997, incorporated herein by reference in its entirety, discuss Fab, a Web recommendation system; this system is not designed to assist in Web searching, and it requires users to rate Web pages. WebGlimpse, described by Udi Manber, Mike Smith, and Burra Gopal in WebGlimpse: Combining Browsing and Searching, Proceedings of the 1997 USENIX Annual Technical Conference (Anaheim, Calif.), pages 195-206, January 1997, incorporated herein by reference in its entirety, restricts Web searches to a neighborhood of similar pages, perhaps searching with additional keywords in the neighborhood. It saves one from building site-specific search engines.

Clever, described by Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg in Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, Proceedings of the Seventh International World Wide Web Conference (Brisbane, Australia), Elsevier Science, April 1998, incorporated herein by reference in its entirety, and by D. Gibson, J. Kleinberg, and P. Raghavan in Inferring Web Communities from Link Topologies, Proceedings of the 9th ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space - Structure in Hypermedia Systems (Pittsburgh, Pa.), pages 225-234, June 1998, incorporated herein by reference in its entirety, builds on the HITS (Hypertext-Induced Topic Search) algorithm, which seeks to find authoritative sources of information on the Web, together with sites (hubs) featuring good compilations of such authoritative sources. The original HITS algorithm first uses a standard text search engine to gather a "root set" of pages matching the query subject. Next, it adds to the pool all pages pointing to or pointed to by the root set. Thereafter, it uses only the links between these pages to distill the best authorities and hubs. The key insight is that these links capture the annotative power (and effort) of millions of individuals independently building Web pages. Clever additionally uses the content of the Web pages. SALSA, described by R. Lempel and S. Moran in The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect, Proceedings of the Ninth International World Wide Web Conference (WWW9) (Amsterdam, Netherlands), May 2000, incorporated herein by reference in its entirety, presents another method to find hubs and authorities.

Paul P. Maglio and Rob Barrett, in How to Build Modeling Agents to Support Web Searchers, Proceedings of the Sixth International Conference on User Modeling (UM97) (Sardinia, Italy), Springer Wien, New York, June 1997, incorporated herein by reference in its entirety, studied how people search for information on the Web. They formalized the concept of waypoints, key nodes that lead users to their searching goal. To support the searching behavior they observed, Maglio and Barrett constructed a Web agent to help identify the waypoint based on a user's searching history. Unfortunately, it is not clear how to extend the waypoint URL so that other users can profit from it.

All of this work is motivated, at least in part, by a general need to improve database and Internet searching. However, a large part of the motivation to improve Web searching is brought about by the advent of mobile computing and communication devices and services. For example, cell phone and personal digital assistant (PDA) users are demanding Internet connectivity. One of the fundamental design challenges of today's mobile devices is the constraint of their small displays. For example, PDAs may have a display space of 160x160 pixels, while a cellular phone can be limited to only five lines of 14 characters each. Differences in display real estate and access to peripherals like keyboards and mice can alter the user experience with much of the content available on the Web. These display limitations, as well as bandwidth limitations related to constraints of mobile communication, are accommodated through special connectivity services.

Considering the interface constraints in the mobile environment, one can easily see how important proper selection of content becomes in mobile Web searching applications. Without the benefit of refining content selection, delivery, and distribution, a user may be inundated with search results, and may be unable to manipulate the content in a manner satisfactory to the task, context, or application at hand. As such, it would be desirable to have an improved search system for general Internet and database applications, but also for tailoring search results for display on a limited browser screen.

Of the available methods to improve search results, there are several techniques that are commonly used:

Improved ranking algorithms. Current search engines crawl the Web and build indexes on the keywords that they deem are important. The keywords are used to identify which URLs should be displayed. A great deal of work has been done to improve the ranking of the URLs. For example, see the work of Brin and Page mentioned above.

Meta-search engines. A meta-search engine queries a group of popular engines, hoping that the combined results will be more useful than the results from any one engine. For example, MetaCrawler collates results, eliminates duplication, and displays the results with aggregate scores (see The MetaCrawler Architecture for Resource Aggregation on the Web, IEEE Expert, 12(12):8-14, January/February 1997, by Erik Selberg and Oren Etzioni, incorporated herein by reference in its entirety).

Dedicated search engines. There exist a number of search engines specializing in particular topics.

Specialized directories. Yahoo, About, LookSmart, and DMOZ organize pages into topic directories. These special hierarchies are maintained by one or more editors, and hence their coverage is somewhat limited and their quality can vary. These directory structures are also referred to as resource lists or catalogs.

Bookmarks. Individuals often keep a set of bookmarks of frequently visited pages and share their bookmark files with others interested in the same topics, e.g. www.backflip.com.

With reference to the two last techniques, members of a community (office, work group, or social organization) often think about, and research, the same set of topics. When searching for information on the Web, if others from one's community have recently performed the same searches, it would be helpful to know what they found; search results could then feed into a shared pool of knowledge. To be practically useful, this pool needs to be maintained without requiring direct input from the members of the community.

However, gathering such a pool is only useful if queries are repeated. In examining 17 months of proxy server logs at Bell Labs, 20% of the queries sent to search engines had been done before. Based on this promising number, SearchLight was built, a system disclosed in U.S. patent application Ser. No. 09/428,031, filed Oct. 27, 1999, entitled Method for Improving Web Searching Performance by Using Community-Based Filtering by Shriver and Small, which is incorporated herein by reference in its entirety. SearchLight transparently constructs a database of search engine queries and a subset of the URLs visited in response to those queries. Then, when a user views the results of a query from a search engine, SearchLight augments the results with URLs from the database. Experimental results indicate that among all the cases when a search involves a query contained in the SearchLight database, the desired URL is among those in the SearchLight display 64% of the time.

Unfortunately, if the SearchLight database is large, it will have many of the same problems experienced by other search engines: too many results to display, with the order being the only technique to help the user. There is a desire to provide a scalable method to improve or augment available data searching techniques.

BRIEF SUMMARY OF THE INVENTION

Therefore, a method of improving search of a database has been developed. The method comprises monitoring user search activity in a user population, extracting search sessions, defined by search queries and paths, from user search activity, determining groups of semantically related queries or paths based on search session data, determining probabilities that records in the database are relevant for each query or path group, maintaining a table associating an index for each record in the database with the probability that the record is relevant for each query or path group, and supplementing search results with information regarding records from the database with tabulated relevance probabilities.

In some embodiments, monitoring user search activity in a user population and extracting search sessions from user search activity includes off-line processing of proxy server access logs to determine search sessions (where off-line refers to a batch style of processing in which data are handled at regular intervals, e.g. once a day).

In some embodiments, monitoring user search activity in a user population and extracting search sessions from user search activity includes on-line processing in a proxy server to determine search sessions (where on-line refers to an event driven style of processing in which data are handled each time a search session ends).

In some of these embodiments, determining search sessions includes determining complete search sessions. For example, a search session is determined to include all the Web pages visited while performing the searching task, including, for example, not only the Web pages presented in a search engine results page, but also pages explored as a result of viewing pages listed on the search engine results page.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The invention may take form in various components and arrangements of components, and in various procedures and arrangements of procedures. The drawings are only for purposes of illustrating preferred embodiments, they are not to scale, and are not to be construed as limiting the invention.

FIG. 1 is a portion of an exemplary proxy server log corresponding to a search session.

FIG. 2 represents data related to a search session that was extracted from the proxy server log of FIG. 1.

FIG. 3 is an exemplary browser window illustrating a first search results augmentation scheme.

FIG.
4 is an exemplary browser window illustrating a second search results augmentation scheme.

FIG. 5 is a portion of an exemplary set of predetermined directory or labeled data.

FIG. 6 is a flow diagram summarizing a method for organizing records of a database by topical relevance.

FIG. 7 is a block diagram illustrating a first system operative to implement aspects of the methods of the invention.

FIG. 8 is a block diagram illustrating a second system operative to implement aspects of the methods of the invention.

DETAILED DESCRIPTION OF THE INVENTION

We consider enhancing the standard search facility associated with a database. Users initiate searches by submitting queries to the search facility, where each query consists of one or more search terms. The present invention is based on the idea that semantically related search terms (even if they do not include any of the same words) lead users to access similar records in a database while they are searching. By combining the complete search activities from a large community of users, search terms can be grouped through clustering. Then, for each group, the most relevant records are identified, again using the data collected from user activities. When a user submits a query to a search engine, the present invention, which is termed Hyponym, decides to which group or groups the search term belongs, and then displays indices for the most relevant records stratified by the identified query groups.

More particularly, the method consists of the following steps:

1) User activities are passively monitored as they access the standard search facility of a database. Users submit queries to the search facility, where each query consists of one or more search terms.

2) We summarize the sequence of user activities during a searching task into a structure called a search session.
Technically, a search session consists of a user's search terms and the indices of the records they accessed in the database while searching.

3) We maintain a table of the number of times each record was accessed in response to each search term issued by a community of users. Every time a user conducts a search, we increment the appropriate elements in this table based on the associated search session. A search session may also include a timestamp.

4) Recognizing that semantically related search terms lead users to access many of the same records, we use this table to form groups or clusters of search terms, known as query groups, based on the patterns of accesses recorded by the search sessions. With some kinds of clustering, a search term may belong to several groups, and a numerical score is used to describe the strength of association.

5) Then, again using the tabulated search session data, we estimate the chance that each record in the database is relevant for the different query groups. (It is also possible to use the tabulated data to introduce groups of URLs. In this case we would estimate the probability that a group of URLs is relevant for a group of queries.) The resulting numerical scores are called relevance weights. Either of steps (4) and (5) can be updated every time a user completes a search, a method known as on-line processing; or they can be done periodically, processing a number of search sessions in a batch, i.e. as in off-line processing.

6) When a new search is initiated, we identify the group or groups with which the user's search term is most strongly associated and return a list of the indices to the most relevant records in the database, stratified by query group.

In some embodiments, the query groups are computed via a mixture model.
This kind of clustering will typically involve computing association weights (relating search terms to clusters) and relevance weights (relating database records to query groups) via the well-known expectation-maximization (EM) algorithm.

In some embodiments of the method, the clustering can be aided by information in existing structures that provide organization to the database. This might include a tree structure that associates records in the database with a hierarchically specified set of topics. We refer to information of this kind as labeled data because it directly associates database records with broad topics. In some embodiments of the method, when a mixture model is employed, this labeled data can be used via a simple approximate EM scheme.

An embodiment directed toward improving Web search specializes the database to the collection of pages available on one or more Web sites, and takes the standard search facility to be an existing search engine. In this context, the labeled data used to help form query clusters and relevance weights could consist of an existing content hierarchy (like www.yahoo.com or www.about.com).

In situations where either the content in a database or the terms being searched for by the community of users continually changes, the methods for integrating new search session data should function in near real time. This necessitates an on-line mechanism for learning query groups and relevance weights. When this clustering involves a mixture model, an on-line variant of the EM algorithm can be employed.

Information about users as they search is distilled into an object known as a search session, the pairing of a user's search term and the records they accessed while searching, e.g. the query and complete path.
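As a rough illustration of the count table of step 3 and of the mixture-model clustering mentioned above, the following Python sketch fits a small multinomial mixture by EM. It is a minimal sketch under stated assumptions, not the disclosed implementation: the function names (build_count_table, em_query_groups), the smoothing constant alpha, the fixed iteration count, the seeding of each group from one search term, and the example record indices are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def build_count_table(sessions):
    """Count how often each record was accessed for each search term.

    `sessions` is an iterable of (search_term, [record_index, ...]) pairs,
    one pair per search session.
    """
    table = defaultdict(lambda: defaultdict(int))
    for term, records in sessions:
        for rec in records:
            table[term][rec] += 1
    return table


def em_query_groups(table, k, iters=25, alpha=0.1):
    """Fit a k-group multinomial mixture to the rows of the count table.

    Returns (assoc, relevance):
      assoc[term][g]     association weight of a search term with group g
      relevance[g][rec]  estimated probability that a record is relevant to g
    """
    terms = list(table)
    records = sorted({rec for row in table.values() for rec in row})

    def normalise(weights):
        total = sum(weights.values())
        return {rec: w / total for rec, w in weights.items()}

    # Assumed, deterministic start: seed group g with the smoothed counts
    # of the g-th search term (requires k <= number of terms).
    relevance = [
        normalise({rec: table[terms[g]].get(rec, 0) + alpha for rec in records})
        for g in range(k)
    ]
    mixing = [1.0 / k] * k
    assoc = {}
    for _ in range(iters):
        # E-step: posterior weight of each group for each search term,
        # computed in log space for numerical stability.
        for term in terms:
            logp = [
                math.log(mixing[g])
                + sum(n * math.log(relevance[g][rec])
                      for rec, n in table[term].items())
                for g in range(k)
            ]
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            assoc[term] = [x / sum(w) for x in w]
        # M-step: re-estimate mixing proportions and relevance weights.
        for g in range(k):
            share = sum(assoc[t][g] for t in terms) / len(terms)
            mixing[g] = max(share, 1e-12)  # guard against log(0)
            weights = {rec: alpha for rec in records}
            for t in terms:
                for rec, n in table[t].items():
                    weights[rec] += assoc[t][g] * n
            relevance[g] = normalise(weights)
    return assoc, relevance


# Hypothetical search sessions: (search term, record indices accessed).
sessions = [
    ("hypertension", ["u1", "u2"]),
    ("network statistics", ["u3", "u4"]),
    ("high blood pressure", ["u1", "u2"]),
    ("traffic measurement", ["u3"]),
    ("hypertension", ["u1"]),
]
assoc, relevance = em_query_groups(build_count_table(sessions), k=2)
for term, weights in assoc.items():
    print(term, [round(w, 3) for w in weights])
```

In this toy data, search terms that share no words but lead to the same records end up with their largest association weight on the same group. Seeding from the first k terms is only a convenience for reproducibility; an on-line variant would update the same quantities as each search session completes.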
The present invention relies on two assumptions: (1) search sessions can be obtained; and (2) the information contained in a collection of search sessions can be used to assist in searching "in the future." In one embodiment of the invention, the World Wide Web is searched by users for HTML documents relating to a given search term. Viewing the World Wide Web as a database, the separate records or Web pages are indexed by their URLs. For Web searching, a user's search session consists of their search terms and the URLs of all the HTML pages they visited in response to their query. Several options are available for monitoring user activities on the World Wide Web. In one embodiment, we have made use of proxy server logs. A proxy server is a computer that connects a community of users to the public Internet. It accesses content on behalf of this community. Requests for HTML pages and other items are directed to the proxy server, which in turn establishes a connection with the appropriate host on the Internet and retrieves the desired item. It then delivers the item to the user who issued the request.

As part of this process of serving content, proxy servers record the URLs of the items requested by their users. From this large log file, the search sessions for every person using the particular proxy server can be extracted. A search session extractor takes proxy server logs as input, and outputs queries, ordered sets of the URLs visited for those queries, and timestamps for these events.

While the disclosed methods can be applied generally to database searches, we provide extra detail concerning an exemplary embodiment involving search session extraction from proxy server logs.

Search Session Extraction Example: Proxy Server Logs

As noted above, a proxy server handles all the requests made by a user community and hence records a wealth of information about user behaviors. With these data, access to the complete path a user follows while searching is available. Given a proxy server log, we can extract search sessions in one of two ways. First, we can "replay" that part of a user's actions that are directly associated with a search task (i.e., re-retrieving the pages a user requested) to determine the path the user followed. This scheme is referred to as an off-line collection scheme. Alternatively, we can avoid the overhead of replaying requests by instead modifying a proxy to directly log the information needed to determine search sessions, or by having a background daemon process the Web pages while they are still in the file system cache. We refer to this setup as an on-line collection scheme.

In the context of searching the World Wide Web for HTML pages, search sessions consist of a query posted to a search engine together with the URLs of the HTML pages the user accesses in response to the query. Recall that a proxy will record all the items requested by a user, which includes the embedded URLs (such as image files) on each HTML page they view that are fetched automatically by the browser. Therefore, for the purpose of enhancing Web search for HTML pages, we exclude these other URLs from a search session, and throughout the rest of this disclosure, take "URL" to mean an HTML URL. (However, the methods disclosed here are clearly extendable to other file and data types.)

As will be clear to those of ordinary skill in the art, finding the beginning of a search session from a proxy server log is trivial: a session begins when a user submits a query to a known search engine like www.google.com. In terms of the proxy server log file, this event is associated with a string of the form "http://www.google.com/search?hl=en&q=query", where "query" is another string consisting of one or more search terms. For example, a search for "network statistics" will generate the string http://www.google.com/search?hl=en&q=network+statistics in the proxy server log. A table of rules for how to extract the search terms from each (popular) known search engine (say, www.lycos.com, www.google.com, and search.yahoo.com among others) is easily maintained. It is more difficult to determine, using only proxy server logs, when a search session ends. In order to do so, the following assumptions are made:

(1) Once a user submits a search query, as long as the user visits pages that are referenced, directly or indirectly through a link, by the results of the search query, the search session has not ended. This is not true when the user types in a URL that is also in the currently displayed page; this case is rare.

(2) A search session ends if it is inactive for more than an hour. Inactivity is determined using the timestamp of the last URL added to the search session.

(3) The user can perform a side task using their browser, and then return to the original searching task. The first URL in the side task is a transitional URL.

Finally, a technical condition is required in settings where users aggressively "multi-task":

(4) The user does not have more than 10 search sessions active at any one time.

A completed search session is one where a user visits at least one URL. The user could view the search engine results and decide not to visit any links, resulting in an incomplete search session.

FIG. 1 contains a subset 110 of the fields available in an exemplary proxy log corresponding to a search session. Many of the fields are not needed for the search session extractor, and thus are not shown. For example, the proxy log subset includes an IP (Internet protocol) address 114 associated with a proxy user, a time stamp 118 associated with the logged event, and a URL 122 associated with a target Web page. Where the event is a search engine search, the URL can include search terms 124. FIG. 2 lists the resulting search session 210 with timestamps 214. There are many complications that need to be addressed when extracting a search session, such as, for example, handling multiple concurrent searches from the same user on similar topics. Details of the search session extractor are described by Elizabeth Shriver and Mark Hansen in Search Session Extraction: A User Model of Searching, Bell Labs Technical Report, January 2002, incorporated herein by reference in its entirety.

A refinement of a query occurs when the user modifies the query or decides to use a different search engine. For example, the user's first query might be "high blood pressure", the second query could be "high blood pressure causes", and the third could be "hypertension".

Since the search terms could completely change during a refinement, whether a query is an element in a refinement is determined by the amount of time between two consecutive queries from a user. For example, if the amount of time is short (e.g. less than 10 minutes), the queries are assumed to be related. This heuristic was verified (by eye) for a month's worth of queries and found to be sufficient. A more sophisticated approach involves modeling the time between the initiation of search sessions, and deriving user-specific time constants. A query that is not refined is a simple query. Queries that are refinements are grouped into topic sessions.

Class of Algorithms

The search session data 210 contain the URLs 218 visited during user searches. From this information, many things can be determined.
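To make the session data concrete, the extraction heuristics described in the preceding section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the extractor of the cited technical report: the event tuple layout, the `SEARCH_RULES` table, the `Session` class, and the function names are all hypothetical. The sketch attributes every non-search URL to the user's most recent session, so it does not model the link-reachability rule (1), side tasks (3), or the limit of 10 concurrent sessions (4).

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse, parse_qs

# Hypothetical rule table: search-engine host -> name of the query parameter.
# Only the Google pattern appears in the text; other engines would need
# their own entries.
SEARCH_RULES = {"www.google.com": "q"}

SESSION_TIMEOUT = 3600  # seconds; a session ends after one hour of inactivity
REFINE_WINDOW = 600     # seconds; queries under 10 minutes apart are refinements


@dataclass
class Session:
    user: str
    query: str
    start: float          # time the query was submitted
    last_active: float    # time of the last URL added to the session
    topic: int            # topic-session id shared by refinements
    urls: list = field(default_factory=list)


def extract_query(url):
    """Return the search terms if url is a known search-engine request."""
    parts = urlparse(url)
    param = SEARCH_RULES.get(parts.netloc)
    if param is None:
        return None
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None


def extract_sessions(events):
    """events: iterable of (user, unix_time, url) tuples sorted by time."""
    sessions = []
    current = {}      # user -> that user's most recent session
    next_topic = 0
    for user, t, url in events:
        query = extract_query(url)
        prev = current.get(user)
        if query is not None:
            # A new query close in time to the previous one is a refinement
            # and joins the same topic session.
            if prev is not None and t - prev.start < REFINE_WINDOW:
                topic = prev.topic
            else:
                topic = next_topic
                next_topic += 1
            session = Session(user, query, t, t, topic)
            sessions.append(session)
            current[user] = session
        elif prev is not None and t - prev.last_active <= SESSION_TIMEOUT:
            prev.urls.append((t, url))
            prev.last_active = t
        # Otherwise the URL falls outside any active session and is ignored.
    return sessions


def completed(sessions):
    """A completed search session is one where at least one URL was visited."""
    return [s for s in sessions if s.urls]
```

Calling `extract_sessions` on a time-sorted list of log events yields one `Session` per query; `completed` then drops the incomplete sessions, and sessions sharing a `topic` value form a topic session.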
For example, how long a user visited a page, which page was visited first, which page was visited the most across search sessions for a specific query, and other information can be extracted from the session data 210. Thus, a class of algorithms is defined which manipulate search and topic sessions to improve Web search. Two examples from this class, SearchLight and Hyponym, will be discussed below. The general form of input into this class is $(t, q, u)$, where $u$ is a URL selected from the group of URLs formed by the transitive closure of the search engine results for query $q$. The timestamp $t$ is the difference in time between the current event and the previous one. In the general form of output from the algorithm, each query $q_i$ is associated with one or more query groups, each URL $u_j$ is associated with one or more URL groups, and each query group $Q_k$ is associated with one or more URL groups $U_l$. A first relation is captured by the triple $(q_i, Q_k, w^q_{ik})$, where $w^q_{ik}$ is the probability that $q_i$ belongs to group $Q_k$. A second relation is captured by the triple $(u_j, U_l, w^u_{jl})$, where $w^u_{jl}$ is the probability that $u_j$ belongs to group $U_l$. A third relation is represented by $(Q_k, U_l, w_{kl})$, where $w_{kl}$ is the probability that $Q_k$ and $U_l$ are related. That is, with probability $w_{kl}$, the URLs in $U_l$ contain information about the queries in $Q_k$.

Practically, the triples are put in a table (often another database) which is then queried when a user performs a search. Clearly, the table can be placed at any point in the Web path that recognizes that the user is performing a query; possible spots are at the browser, in a proxy server if one is used, and at a search engine server.

We now briefly present a simple element of the class known as SearchLight. SearchLight uses a table of query and target URL pairs $(q, u)$, but does not involve any kind of clustering. The present invention, Hyponym, is best explained as an extension of SearchLight.

SearchLight

SearchLight begins with a table that records the number of times each query and target URL pair $(q, u)$ occurs among a collection of search sessions. Somewhat heuristically, the target URL for a search session is defined to be the last page that the user visits before they move to a new task. Other possible definitions include the URL that the searcher stays on for the longest amount of time and the first 5 URLs that the searcher visits.

The table is used to find and display URLs related to a query input by a user. For example, with reference to FIG. 3, SearchLight displays the URLs 314 by weight 318. (FIG. 3 assumes that SearchLight is implemented in a proxy; if it were implemented in a search engine, the window would have only the lower frame.)

SearchLight is triggered into action when a user enters a search string or query into a search engine. If necessary, SearchLight first modifies the query by converting it to lower case, removing punctuation, and sorting the terms alphabetically. If there are no table entries for the modified query, SearchLight considers intersecting sets of the search terms. This ensures that the application provides URLs even if the query is only a close approximation to those in the table. So, if a search for "cryptosystem mceliece" does not have any exact matches in the table, URLs would be returned from queries such as "mceliece", "cryptosystem", and even "robert mceliece".

Another search efficiency enhancing feature is an algorithm that replaces and/or expands abbreviations from the browser's search string and retrieves results for the matching abbreviated term. To determine the common abbreviations, for each URL logged in our exemplary proxy log, a list of all queries in which the URL was the last URL selected was generated. The lists of queries for the most frequent URLs were examined, and processing for the 12 most common abbreviations was added in a table lookup routine. For example, "NY" is replaced by "New York," and "air lines" (and vice versa). Of course, other kinds of enhancements can be added. For example, equating words with their plurals could be done by a word stemmer. The URL 314 list is sorted so that the most frequently accessed page is displayed first. As the number of URLs increases for queries, the URLs with low counts are moved off of the list that is displayed to the user. Thus, old URLs are displaced by newer URLs.

Hyponym

Aside from post-processing that enlarges or reduces search terms, SearchLight relies on an exact match to make recommendations. In studying the SearchLight table, it can be found that search terms that are semantically related often lead users to the same collection of URLs. Therefore, groups of queries are formed based on the similarity of their associated search sessions. In turn, by combining search sessions with queries in a given group, the relevance of the URLs recommended is improved. This is the basic idea behind Hyponym. When a user initiates a new search, they are presented with a display of query groups related to the search terms and the most relevant URLs for each group.

The present invention includes algorithms both for forming the query groups and for determining the most relevant URLs for each group. The present invention, or Hyponym, constructs a statistical mixture model to describe the data contained in a table, e.g. the SearchLight table. This model has as its parameters the probability that a given query belongs to a particular group as well as a set of group-specific relevance weights assigned to collections of URLs. The algorithms attempt to fit the same model to the data. Some embodiments of Hyponym employ a standard EM (Expectation-Maximization) algorithm. However, this technique has problems related to scaling (both in the number of search sessions and in the number of groups needed to obtain a reasonable fit) and therefore has disadvantages. Other embodiments of Hyponym use a relatively less computationally expensive scheme that is referred to as approximate EM. The approximate EM technique usually arrives at a different fit than the standard EM; however, there is typically little practical difference between the two. Finally, given the dynamic character of many databases (like the collection of pages on the Web), we will also introduce an embodiment of Hyponym that includes on-line variants of the EM algorithm that allow us to process search sessions in real time.

The Hyponym Idea

Given the description above, each query $q_i$ is associated with one or more groups. This relation is captured by the triple $(q_i, k, w_{ik})$, where $k$ denotes a group ID and $w_{ik}$ is the probability that $q_i$ belongs to group $k$. Then, for each group, a number of relevant URLs are identified. This is described by the triple $(k, u_j, \lambda_{kj})$, where $u_j$ is a URL and $\lambda_{kj}$ is a weight that determines how likely it is that $u_j$ is associated with the queries belonging to group $k$. These triples are stored in a table that Hyponym uses. An example of a query-group triple $(q_i, k, w_{ik})$ might look like

(infocom+2000, 304, 0.9)

while the associated group-relevance triples $(k, u_j, \lambda_{kj})$ might be

(304, http://www.ieee-infocom.org/2000/, 0.5)
(304, http://www.ieee-infocom.org/2000/program.html, 0.5)

As mentioned above, sets of such triples constitute the parameters in a statistical model for the search sessions contained in a table, similar to that described in reference to SearchLight.

A mixture model is employed to form both the query groups and the relevance weights. Assume that a dataset has $I$ queries that we would like to assign to $K$ groups, and in turn determine group-specific relevance weights for each of $J$ URLs. For the moment, let $n_{ij}$ denote the number of times the URL $u_j$ was selected by some user during a search session under the query $q_i$. Let $n_i = (n_{i1}, \ldots, n_{iJ})$ denote the vector of counts associated with query $q_i$. This vector is modeled as coming from a mixture of the form

$$p(n_i) = \sum_{k=1}^{K} \alpha_k \, p(n_i \mid \lambda_k) \tag{1}$$

where the terms $\alpha_k$ sum to one and denote the proportion of the population coming from the $k$th component. Also associated with the $k$th component in the mixture is a vector of parameters $\lambda_k$. From a sampling perspective, one can consider the entire dataset $n_i$, $i = 1, \ldots, I$, as being generated in two steps: first, one of the $K$ groups $k^*$ is selected according to the probabilities $\alpha_k$, and then the associated distribution $p(\cdot \mid \lambda_{k^*})$ is used to generate the vector $n_i$.

The specification of each component in the mixture (1) is considered next. It is assumed that in the $k$th component, the data $n_{ij}$ come from a Poisson distribution with mean $\lambda_{kj}$, where the counts for each different URL $u_j$ are independent. Then, setting $\lambda_k = (\lambda_{k1}, \ldots, \lambda_{kJ})$, the likelihood of the $k$th component associated with a vector of counts $n_i$ is given by

$$p(n_i \mid \lambda_k) = \prod_{j=1}^{J} \frac{e^{-\lambda_{kj}} \, \lambda_{kj}^{n_{ij}}}{n_{ij}!} \tag{2}$$

To fit a model of this type, a set of unobserved (or missing) indicator variables $y_{ik}$ is introduced, where $y_{ik} = 1$ if $q_i$ is in group $k$, and zero otherwise. Then, the so-called complete data likelihood for both the set of counts $n_i$ and the indicator variables $y_i = (y_{i1}, \ldots, y_{iK})$, $i = 1, \ldots, I$, can be expressed as

$$\prod_{i=1}^{I} \prod_{k=1}^{K} \left[ \alpha_k \, p(n_i \mid \lambda_k) \right]^{y_{ik}} \tag{3}$$

The parameters $\lambda_{kj}$ are referred to as relevance weights, and the probability that $y_{ik} = 1$ is used as the $k$th group weight for query $q_i$ (the $w_{ik}$ mentioned at the beginning of this section).

A number of different algorithms fit this model and, in turn, perform a clustering. They are presented below.

The table is used to display URLs related to the query searched by the user. Referring to FIG. 4, the query groups 414, 418 are displayed by weight, with the URLs 422, 426 in each group ordered by weight.

Standard EM Algorithm

As explained by A. P. Dempster, N. M. Laird, and D. B. Rubin, in Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion), Journal of the Royal Statistical Society (Series B), 39:1-38, 1977, incorporated herein by reference in its entirety, the standard Expectation-Maximization (EM) algorithm is a convenient statistical tool for finding maximum likelihood estimates of the parameters in a mixture model.

The EM algorithm alternates between two steps: an E-step, in which the expectation of the complete data log-likelihood, conditional on the observed data and the current parameter estimates, is computed, and an M-step, in which the parameters maximizing the expected log-likelihood from the E-step are found. The E-step consists of calculating the conditional expectation of the indicator variables $y_{ik}$, which are denoted:

$$\hat{y}_{ik} = \frac{\hat{\alpha}_k \, p(n_i \mid \hat{\lambda}_k)}{\sum_{l=1}^{K} \hat{\alpha}_l \, p(n_i \mid \hat{\lambda}_l)} \tag{4}$$

where $p(\cdot \mid \lambda_k)$ is given in (2). In this expression, the quantities $\hat{\alpha}_k$ and $\hat{\lambda}_k$ denote our current parameter estimates. Note that $\hat{y}_{ik}$ is an estimate of the probability that query $q_i$ belongs to group $k$, and will be taken to be our query group weights. Then, for the M-step, $\hat{y}_{ik}$ is substituted for $y_{ik}$ in (3), and the result is maximized with respect to the parameters $\alpha_k$ and $\lambda_k$. In this case, a closed form expression is available, giving the updates

$$\hat{\alpha}_k = \frac{1}{I} \sum_{i=1}^{I} \hat{y}_{ik}, \qquad \hat{\lambda}_{kj} = \frac{\sum_{i=1}^{I} \hat{y}_{ik} \, n_{ij}}{\sum_{i=1}^{I} \hat{y}_{ik}} \tag{5}$$

Clearly, these simple updates make the EM algorithm a convenient tool for determining query group weights and relevance weights. Unfortunately, the convergence of this algorithm can be slow, and it will often converge to only a local maximum. To obtain a good solution, we start the EM process from several random initial conditions and take the best of the converged fits.

Approximate Algorithm With Prior Data

In moving from the original SearchLight to the Hyponym embodiments, the query groups formed by the mixture model introduced above allow the borrowing of strength from search sessions initiated with different but semantically related search terms. The mixture approach, however, is highly unstructured in the sense that only user data is incorporated to learn the groups and relevance weights. Having taken a probabilistic approach to grouping, prior information about related URLs is incorporated from an existing directory like DMOZ (www.dmoz.com) or Yahoo. These directories constitute predetermined directories and include labeled data. For example, consider the 30,000 categories identified by DMOZ. (FIG. 5 contains a subset of the fields for a few sample DMOZ entries.) Embodiments of Hyponym that are based on the approximate EM algorithms use a directory structure or labeled data to seed query groups. The data in such a structure can be represented as a set of pairs $(l, u_{lj})$, where $u_{lj}$ is a URL in the $l$th group. The probability $w^u_{jl}$ is not specified in the directory, so it is assumed to have a value of $\alpha$.

With this algorithm, as illustrated in the pseudo code below, mappings between queries and URLs are established when either the query or the URL has been seen in either the prior data (from a proxy log) or the sessions that have already been processed.

readPriorData
for each session s
    if (query in s was NOT seen before) and
       (# URLs existed / # URLs in s < T)
        put s aside to be processed using BasicEM algorithm
    else
        createURLGroup if needed
        add mappings between query and URL groups
output mappings

The remaining sessions are processed in a batch using the standard EM algorithm described above. The approximate algorithm can be tuned with the threshold value $T$ ($0 \le T \le 1$) to force more of the URLs in the session to exist in the prior data or the previously processed data. The approximate EM algorithm supports processing data as search sessions or topic sessions.

The approximate EM algorithm has the advantage of incorporating prior or predetermined data, but has the disadvantage of only slowly adding to the set of clusters when a new topic is found in the data.

On-line EM Algorithm With Labeled Data

Using the probabilistic mixture model of the data, a more formal on-line algorithm can take a directory structure as input and learn the query clusters. The on-line EM algorithm presented in On-Line EM Algorithm for the Normalized Gaussian Network by Masa-aki Sato and Shin Ishii, Neural Computation, 12(2):407-432, February 2000, incorporated herein by reference in its entirety, is used to process data arriving in a stream of search sessions. To understand this approach, consider the E-step given in (4) and the M-step in (5). Because the Poisson model is part of an exponential family, these updates should be considered in terms of the sufficient statistics for the complete data model (3). Then, the E-step computes

$$\sum_{i=1}^{I} \hat{y}_{ik} \qquad \text{and} \qquad \sum_{i=1}^{I} \hat{y}_{ik} \, n_{ij} \tag{6}$$

where the updated probabilities are given in (4). In the on-line approach, the sums over the queries $i$ are replaced with a weighted version indexed by time. Let $(q_t, u_t, n_t)$ represent a search session initiated at time $t$, and let $u_t = (u_{t,1}, \ldots, u_{t,J_t})$ denote the URLs visited and $n_t = (n_{t,1}, \ldots, n_{t,J_t})$ their frequency within the session. Then, an on-line version of (6) resembles

$$T_k(t) = (1 - \eta(t)) \, T_k(t-1) + \eta(t) \, \hat{y}_{tk}$$

and

$$T_{j_{t,l},k}(t) = (1 - \eta(t)) \, T_{j_{t,l},k}(t-1) + \eta(t) \, \hat{y}_{tk} \, n_{t,l} \qquad \text{for } l = 1, \ldots, J_t.$$

In these expressions, $j_{t,l}$, $l = 1, \ldots, J_t$, represents an index for URL $u_{t,l}$.

[Two further equations, including equation (7), are not legible in the source.]

The term $0 < \eta(t) < 1$ is used to control the learning rate. Technically, given a long stream of data, the rate should be reduced as more data is seen. Given the sparse nature of the search session data, a constant learning rate is found to perform adequately. Additionally, Sato and Ishii have shown that for a fixed number of clusters, this approach provides a stochastic approximation to the maximum likelihood estimators for query group membership and the relevance weights. Given the large number of clusters in a directory like DMOZ, it is impractical to do a full M-step at each time point. Instead, we choose to take only a partial M-step and update just those relevance weights with indices contained in the incoming search session $(q_t, u_t, n_t)$. This kind of alteration is well known (for example, see A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants by Radford M. Neal and Geoffrey E. Hinton in Learning in Graphical Models, pages 355-368, Kluwer Academic Publishers, 1998, incorporated herein by reference in its entirety) and in the basic EM algorithm does not affect convergence.

To incorporate the labeled data, a set of clusters is initiated so that each URL $u_{lj}$ in the $l$th group of the existing hierarchy is assigned some fixed value $\lambda_{lj}$. Then, when a new search session $(q_t, u_t, n_t)$ arrives, (7) is evaluated to see how well it fits with the existing groups. If the probability is too small, we initiate a new cluster with the URLs $u_t$ and intensities $n_t$. When faced with a long stream of data, splitting clusters and deleting unused clusters may be necessary.

Summary

Referring to FIG. 6, the present invention concerns a method (610) for improving the standard search facility for a database. The activities of a community of users are monitored as they search the database (step 614). With each search, a user is presented with a list of indices for records in the database. Preferably this list also summarizes each of the potentially relevant records. The user reviews the list and accesses records that appear (from the accompanying information) to be related to their search terms. The user's search terms together with the indices of the items they access are combined to form a search session (step 618). The extraction of search sessions can happen either when the session has ended (known as on-line processing) or periodically in a batch, say, processing a log file of user activities once a day (known as off-line processing).

The extracted search sessions are then used to formulate groups of semantically related queries, and to associate with each group a set of relevance weights, or technically, the probabilities that each record satisfies the queries in each group (step 622). In an exemplary embodiment, the formation of query groups and relevance weights is accomplished by fitting a mixture model. In this case, a probability distribution is constructed that describes how the data were generated. In other embodiments, the clustering of queries and the determination of relevance weights can be done in separate steps. In still another embodiment, groups of records could also be formed from the search session data, in which case the relevance weights would associate query groups and record groups. This computation could be done either in one step (by formulating a slightly more elaborate mixture model) or in two or more separate steps. This processing can be done either for each new search session (on-line processing) or at regular intervals in batch mode (off-line processing).

The essential byproduct of this component of the invention is a collection of query groups and relevance weights. We use this data to aid users with future searches. In addition to the output from the standard search facility, we also present the user with an additional display built from our table of query groups and relevance weights (step 626). Given a new query, the present invention first identifies one or more query groups based on the search terms of the query. Then, for each group, indices for the most relevant records in each group are presented to the user. This list of indices is stratified by query group, making it easier to browse the search results.

The user population referred to above is preferably a community of users with something in common. For example, the user community can be the workers of a company or organization, mobile device users in a particular location, or users that are grouped together because of common interests or habits. The method 610 of improving Web search is beneficially applied to these kinds of user populations because the common interest or aspect of the community can be used to automatically narrow or fine tune search results. For example, where a user population is defined as Bell Labs workers, for search results related to the search term "ATM", the reliance on population search path statistics of the method 610 of improving Web search may direct users to pages containing information about Asynchronous Transfer Mode switches. At the same time, where a user population is defined as mobile device users, search results related to the search term "ATM" might be supplemented with a list giving priority to Web pages containing information about the location of Automatic Teller Machines. Similarly, where a user population is defined as mobile device users in New York, searches for "restaurants" are supplemented with lists of pages prioritized toward those with information about restaurants near them in New York. Such users are relieved from having to wade through information about restaurants in Chicago and elsewhere.

The methods of this invention such as those summarized in FIG. 6 may be implemented in a variety of communication and computing environments. As explained above, for example, they may be implemented in proxy servers, search engine provider hardware, gateways, and other points in database or Internet search paths. With a full understanding of the present invention, those of skill in the art will readily determine suitable hardware and software configurations for their particular applications.

For example, with reference to FIG. 7 and FIG. 8, a user in a user community 710, 810 uses a Web browser to access a Web search engine such as Google or Yahoo. All of the Web traffic generated by the user goes through a Web proxy server 714, 814, so the request for a search engine also does. Once the proxy server 714, 814 determines that the request is a search engine request, it routes it on to the search engine, which is a Web server 718, 818 in the Internet 722, 822, and also sends the query to a Clusterer 726, 826. The Clusterer 726, 826 sends records whose probabilities have passed a threshold. The records are maintained in tables in the Clusterer 726, 826.

These tables are generated using search session data and/or labeled data 734, 834. If the Clusterer is an on-line Clusterer 826, the search session data is input into the Clusterer 826 as individual search sessions from the Search Session Extractor 830. If the Clusterer is an off-line Clusterer 726, the search session data is batched by the Search Session Extractor 730, perhaps batching by a 24-hour period. Certain versions of the Clusterer 726, 826 might use labeled data 734, 834 in their algorithms.

The input into the Search Session Extractor 730, 830 is the proxy access logs. If the Clusterer is on-line, each log event is sent to the Extractor as it occurs. Otherwise, a batch of log events is sent.

Clearly, there are different types of searches that users perform; sometimes, there is one desired page (e.g., a conference call-for-papers announcement), and other times, the searching process of visiting many pages allows the user to find the desired information (e.g., what is available on the Web about wireless handsets). In addition, the user can use the desired page as a jump-off point for further exploration. SearchLight is most successful when one page is desired. Hyponym is successful for both types of searching.

The invention has been described with reference to particular embodiments. Modifications and alterations will occur to others upon reading and understanding this specification. It is intended that all such modifications and alterations are included insofar as they come within the scope of the appended claims or equivalents thereof.

We claim:

1. A method of improving search of a database, the method comprising:
monitoring user search activity in a user population;
extracting search sessions, defined by search queries and paths, from user search activity;
determining groups of semantically related queries or paths based on search session data;
determining probabilities that records in the database are relevant for each query or path group;
maintaining a table associating an index for each record in the database with the probability that the record is relevant for each query or path group; and,
supplementing search results with information regarding records from the database with tabulated relevance probabilities.

2. The method of improving database search of claim 1 wherein the search is Web page search and the database includes a collection of available Web pages.

3. The method of improving database search of claim 1 wherein the search is Web page search and the database includes a collection of publicly available Internet Web pages.

4. The method of improving database search of claim 1 wherein the search is Web page search and the database includes a collection of private intranet Web pages.

5. The method of improving database search of claim 2 wherein monitoring user search activity in a user population and extracting search sessions from user search activity includes off-line processing of proxy server access logs to determine search sessions.

6. The method of improving database search of claim 2 wherein monitoring user search activity in a user population and extracting search sessions from user search activity includes on-line processing in a proxy server to determine search sessions.

7. The method of improving database search of claim 2 wherein monitoring user search activity in a user population and extracting search sessions from user search activity includes off-line processing of proxy server access logs to determine complete search sessions.

8. The method of improving database search of claim 2 wherein monitoring user search activity in a user population and extracting search sessions from user search activity includes on-line processing in a proxy server to determine complete search sessions.

9. The method of improving database search of claim 1 includes extracting topic sessions, defined by multiple search sessions where the queries include refinements, from user search activity.

10. The method of improving database search of claim 1 wherein determining groups of semantically related queries or paths based on search session data and determining probabilities that records in the database are relevant for each query or path group includes clustering queries based on a similarity of the associated search paths using a Poisson mixture model.

11. The method of improving database search of claim 1 wherein determining groups of semantically related queries or paths based on search session data and determining probabilities that records in the database are relevant for each query or path group includes using predetermined labeled data.

12. The method of improving database search of claim 11 wherein determining groups of semantically related queries or paths based on search session data and determining probabilities that records in the database are relevant for each query or path group includes applying an approximate Expectation-Maximization algorithm to the predetermined labeled data.

13. The method of improving database search of claim 11 wherein determining groups of semantically related queries or paths based on search session data and determining probabilities that records in the database are relevant for each query or path group includes using predetermined labeled data by seeding query or path groups.

14. The method of improving database search of claim 1 wherein determining groups of semantically related queries or paths based on search session data and determining probabilities that records in the database are relevant for each query or path group includes clustering queries or paths in an on-line fashion.

15. The method of improving database search of claim 1 wherein maintaining a table associating the index for each record includes using a database to store the table.

16. The method of improving database search of claim 1 wherein supplementing search results with information regarding records from the database with tabulated relevance probabilities includes displaying the information in a separate area of the display from results of a search engine.

17. The method of improving database search of claim 1 wherein supplementing search results with information regarding records from the database with tabulated relevance probabilities includes modifying the order of the information.

18. The method of improving database search of claim 1 wherein determining groups of semantically related queries or paths based on search session data and determining probabilities that records in the database are relevant for each query or path group includes clustering data.

19. The method of improving database search of claim 1 wherein determining groups of semantically related queries based on search session data and determining probabilities that records in the database are relevant for each query group includes clustering queries based on a similarity of items in their associated search paths.

20. The method of improving database search of claim 19 wherein determining groups of semantically related queries or paths based on search session data and determining probabilities that records in the database are relevant for each query or path group includes clustering queries or paths using an Expectation-Maximization algorithm.

21. A method of improving search of a database, the method comprising:
monitoring user search activity in a user population;
extracting search sessions, defined by search queries and paths, from user search activity;
determining groups of semantically related paths based on search session data;
determining probabilities that records in the database are relevant for each path group;
maintaining a table associating an index for each record in the database with the probability that the record is relevant for each path group; and,
supplementing search results with information regarding records from the database with tabulated relevance probabilities.

22. A method of improving search of a database, the method comprising:
monitoring user search activity in a user population;
extracting search sessions, defined by search queries and paths, from user search activity;
determining groups of semantically related queries based on search session data;
determining probabilities that records in the database are relevant for each query group;
maintaining a table associating an index for each record in the database with the probability that the record is relevant for each query group; and,
supplementing search results with information regarding records from the database with tabulated relevance probabilities.