Detecting Events with Date and Place Information in Unstructured Text

Detecting Events with Date and Place Information in Unstructured Text
Detecting Events with Date and Place Information in
Unstructured Text
David A. Smith
Perseus Project, Tufts University
Medford, MA 02155
[email protected]
ABSTRACT
Digital libraries of historical documents provide a wealth of
information about past events, often in unstructured form.
Once dates and place names are identified and disambiguated,
using methods that can differ by genre, we examine collocations to detect events. Collocations can be ranked by several measures, which vary in effectiveness according to type
of events, but the log-likelihood measure (−2 log λ) offers
a reasonable balance between frequently and infrequently
mentioned events and between larger and smaller spatial and
temporal ranges. Significant date-place collocations can be
displayed on timelines and maps as an interface to digital
libraries. More detailed displays can highlight key names
and phrases associated with a given event.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content
Analysis and Indexing; H.5.2 [Information Interfaces and
Presentation]: User Interfaces—Graphical user interfaces
General Terms
Design
Keywords
event detection, geographic visualization, phrase browsing
1.
INTRODUCTION
Digital libraries of historical documents provide a wealth
of information about past events in an unstructured form.
Natural questions about particular periods and places are
“What happened then?” and “What happened here?”, but
they may not be best answered by ad hoc queries typed
into search forms. Simply by restricting our queries to certain collections catalogued by time or place, we can exclude many irrelevant events, but questions of relevance, in
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
JCDL’02, July 13-17, 2002, Portland, Oregon, USA.
Copyright 2002 ACM 1-58113-513-0/02/0007 ...$5.00.
a broad sense, remain. What events will different users find
relevant when browsing four thousand years of history, or the
nineteenth century, or 1862? What events are significant, in
some sense, at global, national, and local scales? Of particular interest to digital libraries, dates, places, and events can
provide general interfaces for access to diverse collections.
Automatically detected events can also augment manually
produced metadata, particularly for long documents that
cover many topics.
The Perseus Digital Library Project (http://www.perseus.
tufts.edu) has focused on developing automatic methods
for structuring large document collections, especially in the
humanities. Generalizing tools we first built for ancient
Greek literature, art, and archaeology, we have built testbeds
on English Renaissance literature, ancient and early modern
science, the history and topography of London, and United
States history in the nineteenth century. We have previously worked on named-entity, term, and date identification [3] and on place name disambiguation [9]. Especially in
the United States, where there are a Springfield and several
Middletowns in every state, place names have to be disambiguated before they can be plotted on maps.
Building on this work with individual terms, names, and
dates, we have exploited co-occurrences of dates and place
names in our testbeds to detect and describe likely events in
a digital library. We use statistical measures to determine
the relative significance of various events. We have also built
interfaces that help users preview likely regions of interest
for a given range of space and time and that identify key
phrases associated with each possible event.
2.
PRIOR WORK ON NEWS TEXTS
Although our testbeds are primarily in the humanities,
it is useful to compare applications for historical digital libraries with the Topic Detection and Tracking (TDT) study.
As with similar competitive evaluations, such as TREC for
information retrieval, TDT seeks to advance the state of the
art by concentrating research around a quantitatively evaluated task. TDT aims at developing techniques for “discovering and threading together topically related material
from streams of data such as newswire and broadcast news”
[12]. Topics are defined as specific events, “something (nontrivial) happening in a certain place at a certain time” [13]
although some researchers use event to mean a single happening within a larger topic story [6]. Due to its focus on
news data, TDT possesses “an explicitly time-tagged corpus”. Although not part of the TDT task, systems such as
[8] for visualizing news broadcasts on maps also take advantage of a time-tagged data stream.
TDT systems, by design, will aggregate stories over a span
of several days, even with some gaps, into single event topics. Despite the definition of an event, however, as occurring
in a certain place, most TDT systems do not directly take
geographical location into account. Geographical names,
rather, are treated just like other named entities, such as
personal and company names, or even as single words. Although some TDT systems perform retrospective event detection across an entire corpus, many are designed to handle
the more difficult task of classifying stories into topics in the
order in which they come in. Applications to historical documents should be able to take advantage of less error-prone
retrospective methods.
The most significant problem in adapting TDT methods
to historical texts is the difficulty of handling long-running
topics. For the mid-1990s events in the second TDT study,
systems had trouble treating the O. J. Simpson case or the
investigation of the Oklahoma city bombing as a single event
[11, 13]. Many historical documents discuss long-running
events, and many users will wish to browse digital libraries
at a scale larger than events of a few days’ length.
3.
THE HISTORICAL DOMAIN
Since a precise dateline heads each story, modern news
texts are of course explicitly time-tagged. Indexing schemes
can associate every term — be it a word, phrase, or named
entity — with that date. Most historical texts do not fit
this model for three reasons: discursiveness, digression, and
scale. First, historical texts tend to be discursive, not broken into discrete date units. While some genres, such as
chronicles and diaries, do fit this format, they do not make
up a very sizable portion of most digital library collections.
Domain-specific formatting cues, such as the title and dateline in news stories, can be used to segment such texts, but
we need to automatically discover which documents should
be so segmented in order for the solution to be scalable.
Most documents, however, although not neatly segmentable,
still contain a large amount of date information, but the
association of each date in a text and the terms around
is not one of simple “aboutness”. Second, historical documents tend to be more digressive than news stories. Even if
there is a main linear narrative, a historian will often digress
about events from before or after the main period, or taking
place in another region. These digressions, of course, may
themselves provide information about other events. Henry
Wheatley, in his 1891 survey of London streets, mentions
that “Quebec Street commemorates the capture of Quebec by General Wolfe in 1759.” Finally, many historical
documents are simply on a larger scale than news stories.
Not only are books, and even chapters, orders of magnitude
longer than newspaper pieces, but the ranges of time and
space covered are often much larger.
In addition to problems of interpretation, historical documents present obstacles merely to identifying relevant dates.
First of all, many scholarly works are strewn with bibliographic citations. Bibliographic dates can be useful in their
own right; it would be interesting to see, for example, that
a work published in the 1990s cited works mostly from the
1960s. Bibliography is not, however, directly related to historical narrative and distracts from most information needs.
News stories seldom make citations and current academic
practice relegates much bibliography to a separate section,
but older works often mix citations with narrative. In general, accurately identifying bibliographic references has been
an active area of research with varying success [1]; nevertheless, as McKay and Cunningham point out [7], identifying
bibliographic dates is easier than identifying (and linking)
entire citations.
Further problems arise when older documents use dating
schemes other than the modern, Western Gregorian calendar. Simultaneous events may have different dates on different calendars, as when the Russian revolution in Orthodox,
Julian October took place in Western, Gregorian November. Even more involved are the problems with ancient
systems that dated by the years in which various magistrates — such as Athenian archons or Roman consuls —
served. At present, Perseus often avoids these problems by
acquiring texts already annotated, in footnotes or headings,
with modern date equivalents. Also, older texts with more
involved and uncertain dating systems tend, unfortunately
for historians, to contain many fewer dates.
4.
RANKING COLLOCATIONS
Once dates and other features have been identified and, if
necessary, disambiguated, they can be used to detect events
in documents. Our initial experiments have focused on associations of dates and places. To cite one precedent, Swan
and Allan report better event detection when associating
named entities, rather than simple phrases, with dates[10].
Unlike other projects, we have privileged place names over
other named entities since we can identify multiple names
referring to a single place and detect the use of the same
name for different places.
Since we cannot depend on our source documents to have
marked or easily detectable story divisions, we must define
some sort of window of association. Given the discursive and
digressive properties of our documents mentioned above, we
have chosen sentences and paragraphs. We count, for example, the number of sentences that contain each date or place
and the number of times each date and place occur in the
same sentence. For each date-place pair, we can thus build
a contingency table where a is the number of times date D
and place P occur in the same sentence, b the number of
times D occurs without P , c the number of times P occurs
without D, and d the number of sentences in which neither
D nor P occur. These counts can be used to calculate several different measures of association between the date and
place. Widely used measures are mutual information (MI)
[2], chi-squared (χ2 ), and phi-squared (φ2 ), which is χ2 normalized on the number of association windows. Dunning
argued that the assumption that text tokens are normally
distributed overestimated the significance of rare statistical
events and proposed the log-likelihood test (−2 log λ) based
on the binomial or multinomial distributions [4].
We have experimented with these statistics to test their
effectiveness at detecting events. Without a definitive list
of events in our testbeds, we have concentrated on relative
ordering of events by significance rather than absolute relevance or irrelevance. As described below, users can select
the amount of event information they want to see, and we
hope this will effectively take them from short, highly precise
lists, to total recall of all events in the corpus. As an example, we compare the twenty top-ranked events by each test
for all world events of the nineteenth century (tables 1–4).
Place
Corinth, Mississippi
Gettysburg, Pennsylvania
Mobile Bay, Alabama
Mobile Bay, Alabama
California, United States
Malvern Hill, Virginia
Knoxville, Tennessee
Waterloo, Belgium
Spotsylvania, Virginia
Virginia, United States
Pittsburg Landing, Tennessee
Walcheren, Netherlands
Gettysburg, Pennsylvania
Chancellorsville, Virginia
Crimea, Ukraine
Atlanta, Georgia
Huntsville, Alabama
Great Britain, United Kingdom
California, United States
United States
Date
1862
July 3 1863
August 5 1864
August 6 1864
1849
July 1 1862
1862
1815
May 12 1864
1860
1862
1809
1863
May 3 1863
1854
1864
1862
1812
1850
1861
Count
320
164
110
80
227
76
170
82
66
264
124
53
154
49
65
138
88
86
131
245
−2 log λ
2745.31
2076.08
1870.14
1375.46
1219.85
1113.22
1078.49
995.161
994.899
963.186
881.619
860.891
749.540
618.326
608.433
568.375
561.238
536.693
521.704
503.163
Table 1: 19th c. events: Ranked by log-likelihood
Place
Wakulla county, Florida
Mobile Bay, Alabama
Mobile Bay, Alabama
Queretaro, Mexico
Dooly, Georgia
Crisfield, Maryland
Broad Creek, Massachusetts
Walcheren, Netherlands
Spotsylvania, Virginia
Waynesboro, Georgia
Jeffersonville, Ohio
Mayo, Cape Verde
Malvern Hill, Virginia
Puerto Cabello, Venezuela
Gettysburg, Pennsylvania
Mobile Bay, Alabama
Pocomoke, North Carolina
Five Forks, Maryland
Appomattox county, Virginia
Greenwich, Connecticut
January 7
August 5
August 6
May
December 17
September
September
May 12
December 4
March 13
March 12
July 1
July 26
July 3
August 8
September
April 1
January 31
May 30
Date
1859
1864
1864
1848
1860
1874
1874
1809
1864
1864
1862
1835
1862
1861
1863
1864
1874
1865
1863
1848
Count
9
110
80
10
7
5
5
53
66
16
5
5
76
6
164
20
7
5
6
7
χ2
2193820
935482
736456
576247
498001
491228
439518
290660
262641
255647
255635
246335
232525
191783
152491
141363
139885
138559
137580
125128
Table 2: Ranked by chi-squared
The φ2 measure would produce the same ranking as χ2 and
is not listed. We have also included place-date pairs ranked
by raw association counts. Using a common rule of thumb in
contingency table analysis, we exclude date-place pairs with
fewer than five occurrences. Perseus collections for this period focus on British an U.S. history: the Bolles collection on
the history and topography of London; three collections on
California, the Upper Midwest, and the Chesapeake region
from the Library of Congress’ American Memory project;
and a collection of memoirs and official records of the U.S.
Civil War.
The log-likelihood measure achieves a balance between
events at a very specific place and time — such as the battles
of Gettysburg (specifically the third day, July 3, 1863), Mobile Bay, Malvern Hill, Spotsylvania, and Waterloo — and
larger regions of concentration — such as the California Gold
Rush of 1849 and 1850 or the Crimean War. Civil War batPlace
Wakulla county, Florida
Crisfield, Maryland
Broad Creek, Massachusetts
Dooly, Georgia
Queretaro, Mexico
Jeffersonville, Ohio
Mayo, Cape Verde
Puerto Cabello, Venezuela
Five Forks, Maryland
Appomattox county, Virginia
Greenbrier county, West Virginia
Abingdon, United Kingdom
Pocomoke, North Carolina
Greenwich, Connecticut
Ashley River, South Carolina
Waynesboro, Georgia
Pocotaligo, South Carolina
Washington, Georgia
Drummond Island, Michigan
Nantucket, Massachusetts
January 7
September
September
December 17
May
March 13
March 12
July 26
April 1
January 31
March
March 22
September
May 30
December 7
December 4
December 20
May 4
March
August
Date
1859
1874
1874
1860
1848
1862
1835
1861
1865
1863
1858
1860
1874
1848
1864
1864
1864
1865
1816
1841
Count
9
5
5
7
10
5
5
6
5
6
5
6
7
7
5
16
7
8
7
5
MI
17.8951
16.5841
16.4237
16.1185
15.8144
15.6418
15.5884
14.9642
14.7583
14.4851
14.3862
14.3106
14.2867
14.1258
14.0987
13.9639
13.7488
13.7094
13.6673
13.6232
Table 3: Ranked by mutual information
Place
Corinth, Mississippi
Virginia, United States
United States
California, United States
Richmond, Virginia
Knoxville, Tennessee
Gettysburg, Pennsylvania
Gettysburg, Pennsylvania
United States
United States
Atlanta, Georgia
Georgia, United States
United States
California, United States
Virginia, United States
Virginia, United States
United States
Pittsburg Landing, Tennessee
Washington, United States
United States
Date
1862
1860
1861
1849
1862
1862
July 3 1863
1863
1812
1860
1864
1864
1862
1850
1861
1862
1864
1862
1862
1848
Count
320
264
245
227
171
170
164
154
152
146
138
136
134
131
131
128
128
124
124
122
Table 4: Ranked by raw association count
tles are well represented, probably because several different
memoirs, diaries, and official histories will discuss the same
event, while events in other corpora are less likely to receive
repeat coverage. The chi-squared and mutual information
scores highlight associations of rarer dates and places; for
example, January 7, 1859 in Wakulla county, Florida, is singled out as the day that the offices of Tax Assessor and
Collector and Sheriff were combined. Since this particular
day and place are not mentioned except when together, the
chi-squared and mutual information scores overestimate the
significance of these nine occurrences. Similarly, Crisfield,
Maryland, in September, 1874, is singled out with only five
collocations due to a murder that occurred there. Although
these are undoubtedly events, they are not very useful for
a user wishing to get a sense of the contents of the digital
library. Interestingly, all of the χ2 scores in these top twenty
in table 2 are far above the significance threshold of 10.83
for 99.9% confidence; while the statistic may be useful for
determining absolute significance, it may not be as useful
for establishing rank among significant collocations.
On the whole, mutual information shows a greater bias
for rare events: in the top twenty ranked by MI, no event
is represented by more than 16 passages. Log-likelihood
and χ2 exhibit a greater range in the number of passages
supporting each event. Although ranking by raw counts
privileges whole years and larger regions such as states and
countries, such a result may also be appropriate at scales of
the whole world and a century.
Finally, note that the raw count list contains only one
event with a month and day — the heavily covered battle of
Gettysburg. All events in the mutual information list contain at least a month, and χ2 only shows one event without
a month or day: the half-hearted Walcheren expedition of
1809 that is mentioned in many British officers’ biographies.
The log-likelihood measure, again, shows a balance of specific and more general dates.
Even outside the scope of precise dates, log-likelihood
ranking can perform well. Beyond the nineteenth century,
fewer dates are recorded precisely to the day. Tables 5 and 6
show events in the sixth and fifth centuries BC, and the thirteenth and fourteenth centuries AD. The digital library contains substantial material on the ancient period. As noted
above, however, there are fewer dates to exploit in older documents, and the lower counts bear this out. The low numbers show up in a bogus disambiguation of “Lade” for the
United Kingdom instead of Greece. Still, decisive moments
in Greek history are clear with the end of the Peloponnesian
Place
Aegospotami, Turkey
Plataea
Salamis, Greece
Delium, Greece
Lade, United Kingdom
Athens, Greece
Samos, Greece
Olynthus
Tanagra, Greece
Sybaris
Greece
Athens, Greece
Mantinea, Greece
Athens, Greece
Syracuse, Italy
Amphipolis, Greece
Sparta, Greece
Sardes, Turkey
Thurii
Sicily, Italy
Date
405 BC
479 BC
480 BC
424 BC
494 BC
431 BC
440 BC
432 BC
457 BC
510 BC
480 BC
480 BC
418 BC
404 BC
485 BC
422 BC
404 BC
481 BC
443 BC
415 BC
Count
24
17
20
11
9
18
14
9
8
9
20
22
7
14
8
6
10
6
5
9
−2 log λ
467.124
241.044
211.093
203.543
174.566
160.52
151.662
146.786
136.139
129.891
128.819
125.905
116.546
114.052
106.041
101.548
99.4967
96.6489
96.5052
91.6774
Table 5: Events in the 6th and 5th centuries BC,
ranked by log-likelihood
Place
Poitiers, France
Lewes, United Kingdom
Crecy, France
Bannockburn, United Kingdom
Neville’s Cross, United Kingdom
Gascony, France
Lewes, United Kingdom
Sluys, Netherlands
Lewes, United Kingdom
Montfort, France
Flanders, Belgium
Gascony, France
Gascony, France
Epsom, United Kingdom
Lewes, United Kingdom
Halidon Hill, United Kingdom
Montfort, France
Gascony, France
Montfort, France
Bannockburn, United Kingdom
Date
1356
1264
1346
1314
1346
1264
1265
1340
1263
1264
1297
1265
1297
1265
1258
1333
1263
1253
1265
1313
Count
19
19
16
15
11
14
13
11
12
11
14
11
11
11
11
8
9
10
9
9
−2 log λ
357.045
314.943
309.233
305.789
235.198
233.708
222.948
217.536
208.978
201.241
193.794
193.198
190.275
183.179
182.392
177.775
176.772
176.184
172.843
172.033
Table 6: Events in the 13th and 14th centuries
war at the battle of Aegispotami and the climax of the Persian wars at Plataea. The Perseus Digital Library does not
contain any resources specifically for medieval history, but
enough allusions are made in the Bolles London collection
to detect some significant events in medieval England. The
battles of Poitiers, Lewes, Crecy, and Bannockburn, at the
top of the list, are decisive events in the Hundred Years War,
the unrest in the reign of Henry III, and the Scottish struggle with the English. When working with small numbers of
passages, however, the different ranking strategies appear to
make less difference (table 7).
5.
5.1
BROWSING EVENTS
Geo-Temporal Overview
Place
Neville’s Cross, United Kingdom
Halidon Hill, United Kingdom
Bannockburn, United Kingdom
Boroughbridge, United Kingdom
Bretigny, France
Crecy, France
Poitiers, France
Sluys, Netherlands
Codnor, United Kingdom
Montfort, France
Montfort, France
Bannockburn, United Kingdom
Bannockburn, United Kingdom
Poitou, France
Crecy, France
Neville’s Cross, United Kingdom
Neville’s Cross, United Kingdom
Sluys, Netherlands
Montfort, France
Crecy, France
Date
1346
1333
1314
1322
1360
1346
1356
1340
1241
1263
1265
1313
1306
1214
1342
1341
1338
1344
1264
1356
Count
11
8
15
8
6
16
19
11
5
9
9
9
9
7
9
5
5
6
11
9
χ2
821941
821624
786028
626645
593667
530521
483353
449818
430686
363850
296822
287064
275102
267580
264700
262741
236020
228297
227686
215066
Table 7: Events in the 13th and 14th centuries
ranked by chi-square
Figure 1: Map of top events from 1400–1600: for
this period, the DL primarily deals with British history. Sites in Europe are English expeditions.
We have developed an interface to explore these associations with a combination of graphical and tabular display.
This display is useful not only for browsing the results of
our event detection but also as a generalized interface to
many heterogeneous digital libraries. In addition to lists or
timelines of significant events, we also generate global or regional maps. When the user selects a particular range of
time — whether a century, decade, or year — the map is
updated to show the sites of significant events in that range.
Users can also zoom in on particular regions to see events
in a specific area. The locations of top-scoring events in
any given space-time range are brighter in color and labeled
on the map; lower-scoring events are fainter in color. The
top-ranked events are also listed below the map, with date,
place, and the number of times they co-occur in the digital
library.
Figures 1, 2, and 3 show three snapshots of the North
America and Europe, the primary focus of the Perseus Digital Library collections. Users browsing at these two hundred
year intervals can clearly see the shift in coverage from Europe — primarily Britain — in the early modern period to
North America in the nineteenth century. Events on the
continent of Europe tend to relate to English and British
wars: Holland (1586), Blenheim (1704), Fontenoy (1745),
and Waterloo (1815). As we observed with the tabular data
above, battles stand out particularly well, since they are
memorable, and heavily documented, events that occur at
a specific place and time. An error in figure 2 is instructive: the town of Monmouth, in Wales, is associated with
1685. This collocation highlights the rebellion of the Duke of
Monmouth, Charles II’s illegitimate son, against James II.
Many of the references in the DL to the Duke could be construed as ambiguous: e.g., “commanded regiment of horse
against Monmouth” or “summoned to join royalist forces
against Monmouth”. The collocation, nevertheless, points
to an important event in Britain in 1685.
5.2
Phrase Browsing
If users wish to explore the detected events more closely,
they can click on the date-place collocation and call up a
display of the individual text passages from the digital library. Since the Perseus system disambiguates toponyms in
texts, these searches are for the unique toponym identifiers,
not for the names themselves as strings.
The default display organizes these passages by phrases
common to two or more sentences. This clustering feature is
Phrase
fire of london
great fire
city of london
charles ii
act of parliament
duke of york
christ church oxford
house of commons
dreadful fire
rebuilding of the city
college oxford
privy council
view of london
burning of london
church of st
Count
21
21
8
6
4
4
3
3
3
2
3
3
2
2
2
Table 8: Clusters for London, 1666
Figure 2: Map of top events from 1600–1800: the DL
continues its focus on Britain. Some North American information, particularly the capture of Quebec,
is present. The strong association of ‘Monmouth’
with 1685 refers to the Duke of Monmouth’s uprising.
Phrase
san francisco
discovery of gold in california
discovery of gold
gold rush
united states
gold fields
trip to california
gold fever
cape horn
california gold
california during the years
early in the year
Count
19
8
10
9
9
7
5
6
6
6
3
3
Table 9: Clusters for California, 1849
available for all searches, not just these date-place searches,
in the Perseus Digital Library. We produce the clusters at
run time using a suffix-tree algorithm similar to [14]. The
phrases are ranked by a score s that combines the number
of words in the phrase w with the number of passages in
the cluster p, using a cluster-constant c, usually set to 0.5
(equation 1). Clustering is polythetic: each search result
may belong to one or more clusters. The clustering and
ranking are fast enough to be used interactively without
any offline computation, as in [5].
s=p·
Figure 3: Map of top events from 1800–1900: collections on pioneering in the Upper Midwest and
California (note the 1849 at the extreme west) combine with a Civil War collection to give a North
American focus. The battle of Waterloo holds out
for British history.
1 − e−cw
1 + e−cw
(1)
The examples show clusters for London, 1666, the date
of the Great Fire (table 8); for California, 1849, the Gold
Rush (table 9); and for Atlanta, 1864, when a Union army
captured the city (table 10). Phrases containing dates are
removed since they mostly show variations like “fire in 1666”
and “fire in the year 1666”. Note that the cluster head
phrases need not contain the search terms.
These phrases can characterize events by listing associated
people or places, such as the opposing generals Sherman
and Johnston, San Francisco, or Cape Horn, around which
many sailed to California. Phrase clusters may also be more
descriptive: “rebuilding of the city”, “gold fever”, or “march
to the sea”. The user can also group passages by the book
or collection from which they come. The number of distinct
Phrase
military division of the mississippi
atlanta ga
atlanta georgia
atlanta campaign
march to the sea
major general
general sherman
sherman’s army
effective strength of the army
advance on atlanta
battle of atlanta
capture of atlanta
general joseph e johnston
maj gen
kenesaw mountain
Count
13
19
18
14
5
8
7
5
3
4
4
4
3
4
4
Table 10: Clusters for Atlanta, 1864
documents recording a date-place collocation could be useful
in deciding an event’s significance.
6.
CONCLUSIONS
Although historical documents cannot often benefit from
the tight topic focus and reliable structure of news or scholarly articles, their broad scope and lack of structure can
provide a useful testbed for building more scalable architectures for event detection and information extraction systems. Once detected and ranked, events can provide a useful generic interface to digital library systems through maps,
timelines, and tabular displays.
Evaluating these and other methods of event detection
requires attention to varying information needs. Does the
user wish to gain a broad overview of a particular corpus
or subcorpus or to focus on events that stand out from
the rest of the corpus? Since the distance between places
or dates is measurable, and not arbitrary as in many topic
browsing systems, we can group the data to minimize the
aggregation effects of using individual days, years, or places
as terms of association. We have concentrated on ranking
events using statistical measures, finding evidence that the
log-likelihood measure achieves a balance among spatial and
temporal scope and frequency of occurrence. Future work
can concentrate on finding genre-specific cues for events in
diaries, letters, encyclopedias, and biographical dictionaries.
We have also built a browsing interface so that users can see
regions of concentration within the digital library and explore names and phrases associated with a given event.
7.
ACKNOWLEDGMENTS
The work presented in this paper has been supported by
a grant from the Digital Library Initiative Phase 2 (NSF
IIS-9817484), with particular backing from the National Endowment for the Humanities and the National Science Foundation. I would also like to thank Greg Crane and Jeff
Rydberg-Cox for comments on this paper.
8.
REFERENCES
[1] Donna Bergmark and Carl Lagoze. An architecture for
automatic reference linking. In Proceedings of ECDL
2001, pages 115–126, Darmstadt, 4-9 September 2001.
[2] Kenneth Church and Patrick Hanks. Word association
norms, mutual information, and lexicography.
Computational Linguistics, 16(1):22–29, 1990.
[3] Gregory Crane, David A. Smith, and Clifford E.
Wulfman. Building a hypertextual digital library in
the humanities: A case study on London. In
Proceedings of the First ACM+IEEE Joint Conference
on Digital Libraries, pages 426–434, Roanoke, VA,
24-28 June 2001.
[4] Ted Dunning. Accurate methods for the statistics of
surprise and coincidence. Computational Linguistics,
19(1):61–74, 1993.
[5] Steve Jones and Gordon Paynter. Topic-based
browsing within a digital library using keyphrases. In
Proceedings of the 4th ACM Conference on Digital
Libraries, pages 114–121, Berkeley, CA, 11-14 August
1999.
[6] Vikash Khandelwal, Rahul Gupta, and James Allan.
An evaluation corpus for temporal summarization. In
James Allan, editor, Proceedings of HLT 2001, First
International Conference on Human Language
Technology Research, San Francisco, 2001. Morgan
Kaufmann.
[7] Dana McKay and Sally Jo Cunningham. Mining dates
from historical documents. Technical report,
Department of Computer Science, University of
Waikato, 2000.
[8] Andreas M. Olligschlaeger and Alexander G.
Hauptmann. Multimodal information systems and
GIS: The Informedia digital video library. In
Proceedings of the ESRI User Conference, San Diego,
California, July 1999.
[9] David A. Smith and Gregory Crane. Disambiguating
geographic names in a historical digital library. In
Proceedings of ECDL, pages 127–136, Darmstadt, 4-9
September 2001.
[10] Russell Swan and James Allan. Extracting significant
time varying features from text. In Proceedings of the
Eighth International Conference on Information
Knowledge Management (CIKM ’99), pages 38–45,
Kansas City, MO, November 1999.
[11] Russell Swan and James Allan. Automatic generation
of overview timelines. In Proceedings of the 23rd
Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval,
pages 49–56, Athens, Greece, July 2000.
[12] Charles L. Wayne. Multilingual topic detection and
tracking: Successful research enabled by corpora and
evaluation. In LREC 2000: 2nd International
Conference on Language Resources and Evaluation,
Athens, Greece, June 2000.
[13] Yiming Yang, Tom Pierce, and Jaime Carbonell. A
study on retrospective and on-line event detection. In
Proceedings of the 21st Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval, pages 28–36, Melbourne,
Australia, August 1998.
[14] Oren Zamir, Oren Etzioni, Omid Madani, and
Richard M. Karp. Fast and intuitive clustering of web
documents. In Proceedings of the 3rd ACM SIGKDD
Conference, 1997.
Was this manual useful for you? yes no
Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Download PDF

advertisement