Hybrid Methods for Coreference Resolution in Swedish
Printed in Sweden by Universitetsservice AB, Stockholm 2010
Distributor: Department of Linguistics, Stockholm University
The aim of this thesis is to improve coreference resolution in Swedish by providing a hybrid approach based on combining data-driven methods and linguistic knowledge. Coreference resolution here consists in identifying all expressions in a text that have the same referent, for example, a person or an object.
The linguistic knowledge is based on Accessibility theory (Ariel 1990).
This is used for selecting likely anaphor-antecedent pairs from the set of all possible such pairs in a text. The data-driven method adopted is Memory-Based Learning (MBL), a supervised method based on the idea that learning means storing experiences in memory, and that new problems are solved by reusing solutions from similar experiences (Daelemans and Van den Bosch 2005).
The referring expressions covered by the system are names, definite descriptions, and pronouns. In order to maximize performance, we use different classifiers with a specific set of linguistically motivated features for each type of expression. The great majority of features used for classification are domain- and language-independent.
We demonstrate two ways of using this method of linguistically motivated selection of anaphor-antecedent pairs.
First, the amount of training examples stored in memory is reduced. We find that for coreference resolution of definite descriptions and names, the amount of training data can thereby be reduced with only a minor loss in performance, but for pronoun resolution there is a negative effect.
Second, selection can be used for improving on coreference resolution results. This is the first step in our hybrid approach to coreference resolution, where the second step is the application of an MBL classifier for determining coreference between the selected pairs. Results indicate that this hybrid approach is advantageous for coreference resolution of definite descriptions and names. For pronoun resolution, there is a negative effect on recall along with a positive effect on precision.
The aim of this thesis is to investigate whether coreference resolution for Swedish can be improved by a hybrid model consisting of a combination of data-driven and knowledge-based methods. Coreference resolution is the task of recognizing and linking all expressions in a text that have the same referent, for example a particular person or an object.
Accessibility theory (Ariel 1990) forms the basis of a method for selecting, for each anaphor, likely candidate antecedents from the set of all possible candidates in a text. The data-driven method used is Memory-Based Learning (MBL), a supervised method built on the idea that learning means storing experiences in memory, and that new problems are handled by reusing earlier problem-solving experiences (Daelemans and Van den Bosch 2005).
The referring expressions handled are names, definite descriptions, and pronouns. For each type of expression, a separate classifier with linguistically motivated features is used. The great majority of the features used for classification are domain- and language-independent.
The thesis shows two ways in which a linguistically motivated method for selecting likely candidate antecedents for each anaphor can be used.
First, the selection method is used to reduce the number of training examples stored in the memory of the MBL classifier. The results show that for coreference resolution of definite descriptions and names, the amount of training data can be reduced radically in this way with only a marginal loss in classifier performance, whereas the selection method has a negative effect for pronouns.
Second, the selection method can be used to improve the results of coreference resolution. The selection method is the first step in our hybrid model for coreference resolution; the second step is that the selected pairs of anaphor and candidate antecedent are classified as coreferent or non-coreferent by the MBL classifier. The results indicate that this approach is advantageous for definite descriptions and names, whereas for pronouns it yields lower recall combined with higher precision.
First of all, I want to thank my supervisors Martin Volk and Mats Wirén. You have given me advice, encouragement, and moral support at crucial moments, and I could never have completed this thesis without you.
Thanks to past and present members of the Section for Computational Linguistics at Stockholm University: Sofia Gustafson-Capkova, David Hagstrand,
Britt Hartmann, Magnus Sahlgren, Yvonne Samuelsson, Jörgen Aasa, Jennifer Spenader, and Robert Östling. A special thanks to my office mates,
Hans Hjelm and Henrik Oxhammar, for interesting discussions, relaxing coffee breaks, and for helping out in so many ways; thanks also to Henrik for singing so beautifully to me every day.
I wish to thank everybody at the Department of Linguistics for contributing to a warm and friendly atmosphere. In particular, I thank Linda Habermann,
Nada Djokic, Cilla Nilsson, Christina Hellman, and Liisa Karhapää for not only patiently sorting out administrative tangles, but doing so with a smile.
Thanks also to Bosse Kassling for sysadmin at SU, and to Robert Andersson for sysadmin at GSLT.
During my time as a PhD student, I have had the opportunity to teach: thanks to all my students, especially Aisha Renee Malmgren. Thanks also to the directors of studies who have helped along the way: Päivi Juvonen, Sofia
Gustafson-Capkova, Eva Lindström, Tove Gerholm, and Ludvig Florén.
A special thanks to Tove Gerholm and Anna-Lena Nilsson, for support and cooperation during my term as director of studies. Thanks also to study career advisors Calle Börstell and Heléne Walberg.
Thanks to Maria Koptjevskaja Tamm, Östen Dahl, and the PhD students in
General Linguistics for a rewarding tutorial.
Thanks to everyone at the Graduate School of Language Technology;
I feel truly privileged to have been part of this constructive and inspiring environment. A special thanks to Atelach Alemu Argaw for all those pep talks/lunches/discussions on NLP, but most of all for being a good friend.
Thanks also to the people I first met during my years in another inspiring environment, the Language Technology Program at Uppsala University; in particular Lars Borin, Pelle Weijnitz, Fredrik Olsson, and Eva Forsbom.
Finally, thanks to my favorite people in Västerbotten: the Hellgrens, the
Hanssons, and the Jonssons, and my favorite Southerners the Björkenstams.
Thanks to my friends, who – regardless of time, space, and work load – always make my life brighter, especially Maria, Nina, Ellen, and Sara. Thanks
to my parents, Margareta and Jan, and my sister Helena for their constant encouragement and support.
For everything else, I thank my family: Tomas, Nora, and the little one. This is for you.
Stockholm, April 2010
When we talk or write about things, such as people, objects, or events, we use names, e.g., Richard B. Cheney and Bush, different kinds of descriptive expressions, e.g., the Vice President, or pronouns such as they.
(1.1) Hours before they were to leave office after eight troubled years,
George W. Bush and Richard B. Cheney had one final and painful piece of business to conclude. For over a month Cheney had been pleading, cajoling, even pestering Bush to pardon the Vice President’s
In particular, we do not have to use the same name or descriptive expression every time we mention the same thing; by choosing different ways to mention a referent we can add more information, or stress a particular property of the person or object in question, e.g., Richard B. Cheney can be referred to as the Vice President. Nor do we have to use full names, e.g., George W. Bush; we can replace these with a shorter name variant such as Bush.
When we repeatedly use expressions that refer to the same referent, we say that these expressions are coreferent. Coreference resolution is the process of interpretation of coreference between expressions. Thus, from an application point of view, the goal of coreference resolution is typically to find all expressions that corefer within a document.
There are two main motivations for why coreference resolution is an interesting subject for a computational linguist: first, finding out which noun phrases refer to the same person or object in a text is an important preprocessing step in many Natural Language Processing (NLP) applications, and second, coreference resolution is an interesting problem in its own right – it is something that humans do all the time, seemingly without difficulty, but from a computational perspective it is a difficult task.
This raises questions regarding the processing of noun phrases in discourse: What are the main problems we need to deal with? How can anaphoric noun phrases be recognized, and the most likely candidate antecedents be selected among all preceding noun phrases in the discourse? What kinds of linguistic knowledge do humans use for resolution, and how can such knowledge be modeled or approximated?
TIME Magazine (Aug. 3, 2009): Massimo Calabresi & Michael Weisskopf: “The Final Days”
If we are able to build robust systems for coreference resolution, many existing NLP systems can benefit from the additional information provided.
For example, for Information Extraction purposes, coreference resolution is needed in order to merge information about the same entity (or referent) located in different parts of the discourse (Grishman 2003), for example all different mentions of Richard B. Cheney in the example above.
For Passage Retrieval, where a query is answered by a passage from the discourse, coreference resolution can improve on recall (Morton 2005). Similarly, in a Question Answering (QA) setting, coreference resolution is one of the preprocessing steps needed in order to answer the question “What position did Scooter Libby have in the Bush Administration?” based on the article from
Preiss and Briscoe (2003) have shown that adding coreference resolution to a
QA system can improve on the overall accuracy.
In approaches to Automatic Summarization and Automatic Abstracting, coreference resolution can improve on techniques for extracting important sentences, and provide information on the main topic of the text. When the appropriate sentences have been selected, natural language generation techniques can be used to replace repetitions of the same phrase with anaphoric expressions in order to generate coherent summaries (Moore and Wiemer-
Hastings 2003, Mitkov 2003).
Similarly, in systems for natural language interaction, such as Dialog Systems, both interpretation and generation of referring expressions are important
(Androutsopoulos and Aretoulaki 2003).
For Machine Translation, it is important to resolve pronouns to their antecedents when translating to languages that mark the gender of pronouns. For example, intra-sentential pronoun resolution is needed for correct translation
(1.2) (a) The cat slept because it was tired.
(b) *Katten sov för det var trött.
Machine translation, and other applications where word-sense disambiguation is important, might also benefit from coreference resolution when linking a full-form lexical noun phrase, e.g., des cheques-repas (‘the meal vouchers’), to a partially repeated form with multiple senses, des cheques.
Translated 2010-04-19. URL: http://translate.google.com
The task of coreference resolution is related to that of anaphora resolution, where the goal is to find and interpret words or phrases called anaphors that are pointing back to a previously mentioned expression in the discourse, called the antecedent. We say that the anaphor and the antecedent are coreferent when they have the same referent in the real or hypothetical world.
A distinction can be made between referential noun phrases, e.g., proper names and definite descriptions which can refer independently of the linguistic context (but may also be dependent on a preceding noun phrase for interpretation, see e.g., (Ariel 1990)), and anaphoric noun phrases, e.g., pronouns, which typically require a linguistic antecedent in order to be correctly interpreted. In the context of coreference resolution, we want to find coreference links between noun phrases, some of which may be categorized as anaphoric and some as referential.
Throughout this thesis, we will use the term anaphor to mean a noun phrase that is coreferent with one (or more) of the preceding noun phrases, and the term antecedent to mean each coreferent noun phrase preceding the anaphor.
noun phrase Cheney, and the term antecedent for the noun phrase Richard B.
A number of formal definitions of coreference have been proposed; the definition by Van Deemter and Kibble (1999) states that, assuming that the referring expressions $\alpha_1$ and $\alpha_2$ are occurrences of noun phrases, and that both have a unique reference in the context in which they occur:
$$\alpha_1 \text{ and } \alpha_2 \text{ corefer if and only if } \mathrm{Reference}(\alpha_1) = \mathrm{Reference}(\alpha_2)$$
It follows from this definition that the coreference relation is symmetrical and transitive, i.e., that if $\alpha_1$ and $\alpha_2$ are coreferential, and $\alpha_2$ and $\alpha_3$ are coreferential, we can conclude that $\alpha_1$ and $\alpha_3$ are coreferential (Van Deemter and Kibble 1999).
A coreference chain is a sequence formed by all noun phrases that corefer in a given text. Chains can stretch across sentence and paragraph boundaries, and across speaker boundaries within the same discourse. The coreference chains partition the noun phrases within the discourse into equivalence classes
results in the following four coreference chains: [they, George W. Bush and Richard B. Cheney], [George W. Bush, Bush], [Richard B. Cheney, Cheney, the Vice President], and [the Vice President’s former chief of staff, I. Lewis
An alternative would be to reserve the use of the term anaphor for (anaphoric) pronouns, and to use the term subsequent-mention for (referential) proper names and definite descriptions; to the extent that we do not need this terminological difference, we refrain from making it.
Here, the pronoun they, which refers to the subsequent coordinated noun phrase George W. Bush and Richard B. Cheney, is an example of cataphora.
The task of coreference resolution can be further divided according to the linguistic form of the expressions linked by coreference relations: they can be either nominal (i.e., a proper name, a lexical noun phrase, or a pronoun) or non-nominal (e.g., events, facts, propositions). In this thesis, we are concerned
More specifically, we distinguish between the following types of noun phrases:
1. Proper names and other named entities, categorized as e.g., person names (e.g., George W. Bush), organizations (e.g., the World Bank), locations (e.g., Texas), or miscellaneous names such as e.g.,
2. Lexical noun phrases, e.g., the Vice President, which has a common noun head. In order to distinguish between the set of all lexical noun phrases and the subset of lexical noun phrases that are definite and thus may be anaphoric, we refer to the latter as definite descriptions.
3. Pronouns, e.g., they.
Depending on the type, these noun phrase types can enter into different kinds of relations with other noun phrases. Definite descriptions can enter into synonymy, hyperonymy, and hyponymy relations with other lexical noun phrases.
Such a relation may provide a clue to a coreference relation between two expressions. As indicated by the term, definite descriptions can add information about the referent, or serve as a paraphrase of its antecedent.
characters George W. Bush, Richard B. Cheney, and I. Lewis Libby are mentioned multiple times, using different expressions. Bush is referred to by name variants such as George W. Bush and Bush, definite descriptions such as the President, the former President, and the Commander in Chief, and pronouns such as he and his. In addition to pronouns, Cheney is referred to by the name variants Richard B. Cheney, Dick Cheney and Cheney, the definite descriptions the Vice President, his [Bush’s] deputy and closest counselor, his [Libby’s] boss and the outgoing Vice President. Libby is referred to by the name variants I. Lewis (Scooter) Libby, Libby, Scooter, and the definite descriptions the Vice President’s former chief of staff, Cheney’s former top aide on domestic and foreign policy, his [Cheney’s] aide, and Cheney’s longtime aide.
The last chain consists of two noun phrases in apposition; we treat appositive noun phrases separated by a comma as two expressions.
Due to this restriction, we do not handle coreference between non-nominal entities, e.g., the link between one final and painful piece of business and the event described in the second
Lexical noun phrases have also been referred to as common noun NPs (see e.g., Hoste 2005), or nominal mentions (see e.g., Haghighi and Klein 2007, Ng 2008).
Knowledge needed for coreference resolution
In the previous section, we defined the three types of noun phrases (NPs) covered by our coreference resolution system. Accordingly, resolution of each of these NP types can be described as an individual task, which may require e.g., specific kinds of knowledge or strategies. In this thesis, we distinguish between these three tasks:
1. Identification of coreferent Named Entities, e.g., the names Richard B. Cheney, Cheney, Dick Cheney might all refer to the same referent within a discourse (or, in the case of cross-document coreference resolution, within different discourses). In addition to recognition of coreference between such name variants, and between definite descriptions and subsequent-mention named entities (NEs), this task also includes identification of acronyms such as CIA – The Central Intelligence Agency.
2. Resolution of definite descriptions, that is, determining the antecedents for definite lexical NPs. E.g., the President of the United States might refer to the same entity as other lexical NPs such as the President or the Commander in Chief, or an NE like George W. Bush within a discourse. This resolution task can be further divided into the resolution of:
a) repeated form of a lexical NP (e.g., the President – the President), or partially repeated form of a lexical NP (e.g., the White House counsel – the counsel),
b) redescriptions: some type of lexical replacement, e.g., (near) synonymy (e.g., the Vice President – the President’s deputy) or hyperonymy (e.g., the Vice President’s former chief of staff – his aide), or when the antecedent is an NE and the anaphor is a definite description (e.g., Fred Fielding – the counsel).
3. Pronoun resolution, e.g., the pronoun he (singular, masc.) can refer to either of the NPs the President, Cheney, or Scooter Libby depending on the context of each occurrence. The goal of this task is to determine the correct antecedent for each such pronoun.
These three subtasks require different means for resolution, and are of varying degrees of difficulty: e.g., for languages such as English and Swedish, NEs and definite descriptions can often be resolved by estimations of string similarity between the anaphor and the antecedent because repetitions are frequent, whereas string similarity is a less predictable factor in pronoun resolution.
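The role of string similarity for NEs and definite descriptions can be illustrated with a minimal sketch. The function names, the whitespace tokenization, and the use of the last token as a stand-in for the head noun are illustrative assumptions, not the features used in this thesis.

```python
def tokens(np_text):
    """Lower-case whitespace tokenization (a simplifying assumption)."""
    return np_text.lower().split()

def is_full_repetition(anaphor, antecedent):
    """True if both NPs consist of exactly the same tokens in the same order."""
    return tokens(anaphor) == tokens(antecedent)

def is_partial_repetition(anaphor, antecedent):
    """True if every anaphor token also occurs in a longer antecedent."""
    a, b = tokens(anaphor), tokens(antecedent)
    return a != b and set(a) <= set(b)

def same_head(anaphor, antecedent):
    """Crude head match: compares the last token of each NP (an assumption)."""
    return tokens(anaphor)[-1] == tokens(antecedent)[-1]

name_match = is_partial_repetition("Bush", "George W. Bush")
head_match = same_head("the counsel", "the White House counsel")
pronoun_match = is_partial_repetition("he", "George W. Bush")
```

Under these assumptions the name variant and the partially repeated definite description both match their antecedents, while the pronoun shares no string material with its antecedent, which is why string similarity contributes little to pronoun resolution.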
It is commonly assumed that agreement in person, gender, and number guides anaphora resolution, and psycholinguistic studies of anaphoric reference (e.g., reading time, eye-tracking, and priming studies) show that resolution takes less processing time if there is only one possible antecedent. For example, studies show that it takes less processing to resolve pronouns within a shorter distance of the antecedent compared to more distant ones, and to resolve reference to topical concepts compared to less topical ones (Sanders and Spooren 2007, Singer 2007). Thus, knowledge of morphological agreement, distance, and prominence is likely to be important for the resolution process.
Also referred to as same head anaphora, see e.g., (Vieira and Poesio 2000).
Even when there is more than one possible antecedent, as in the classic example by Winograd (1972:p. 33), humans usually have no difficulty in determining the antecedent:
(1.3) The city council denied the demonstrators a permit because ...
(a) ... they feared violence. (they = the city council)
(b) ... they advocated revolution. (they = the demonstrators)
are alike syntactically. One of the antecedent candidates, the demonstrators, agrees in grammatical number with the anaphor they, while the other, the city council
For resolution, humans use not only lexical knowledge but also knowledge about the world in general, and about the objectives and motivations of city councils and demonstrators in particular. Naturally, this kind of real-world knowledge and the reasoning needed for arriving at the most plausible solution based on this knowledge are difficult to model.
In most approaches to anaphora and coreference resolution, both knowledge-based and data-driven, some of the following kinds of knowledge are used:
• Morphological and lexical knowledge: used to check for agreement in e.g., grammatical gender and number between the anaphor and antecedent.
• Lexico-semantic knowledge: used to check for animacy agreement between the anaphor and antecedent, or semantic relatedness, e.g., synonymy, hyponymy/hyperonymy.
• Syntactic knowledge: pronoun behavior is governed by grammatical rules, e.g., that reflexives typically refer to the subject of the clause.
Further, it has been argued that e.g., for English, there is a grammatical role hierarchy that determines salience (Brennan, Friedman and
Pollard 1987). Other commonly used syntactic restrictions are based on the c-command constraint (Bosch 1983).
• Discourse knowledge: e.g., information on which utterance is the most salient at a certain point in the discourse, i.e., the focus (Azzam, Humphreys and Gaizauskas 1999) or center (Brennan et al. 1987). Winograd (1972) among others distinguishes between the local discourse and the overall discourse.
Thus, this is also an example of how grammatical number agreement is a preference but not a prerequisite for anaphora in English. In most algorithms for English pronoun resolution, number
• Recency information: recency is a factor since especially pronouns, but also other types of anaphoric NPs, are local in the sense that the antecedent is usually found within a narrow context (see e.g.,
Charniak 1972, Mitkov 2002).
• Real-world knowledge: e.g., described by Charniak (1972) as selectional restrictions on the candidate antecedents put forth by verb semantics, e.g., that ‘they’ in ‘They announced that ...’ can refer to both a group of people and an organization, while ‘They hollered that ...’ can (most probably) only refer to a group of people, due to differences in semantics between the two verbs ‘announce’ and ‘holler’.
This level of knowledge still does not suffice for cases such as exam-
common-sense knowledge, involving all the previous kinds of knowledge.
Aims and outline of the thesis
The general aim of this thesis is to explore how data-driven methods in combination with linguistic knowledge can be used for coreference resolution in
Swedish. To this end, we adopt a hybrid approach which combines a supervised machine learning approach with linguistically motivated selection of anaphor-antecedent candidates, and features based on a large set of linguistic and other properties.
The basic approach to coreference resolution used in this thesis is to redefine the problem as a pair-wise classification problem, where each NP within a document is compared to all preceding NPs. For each such pair, linguistic knowledge on both NPs and on the relation between the two NPs is used to classify this pair as coreferent or non-coreferent. In a second step, coreference chains are built from the NPs predicted by the classifier as coreferent.
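The two steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, NPs are identified by their position in document order, and chains are formed by merging pairwise links with a union-find structure, which is one common way to take the transitive closure and is not necessarily the linking procedure used in this thesis.

```python
def candidate_pairs(n):
    """Pair each NP with every preceding NP: (antecedent_idx, anaphor_idx)."""
    return [(i, j) for j in range(n) for i in range(j)]

def build_chains(n, coreferent_pairs):
    """Merge pairwise coreference links into chains (transitive closure)."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for antecedent, anaphor in coreferent_pairs:
        parent[find(anaphor)] = find(antecedent)

    chains = {}
    for x in range(n):
        chains.setdefault(find(x), []).append(x)
    # Singletons are NPs that were not linked to any other NP.
    return [chain for chain in chains.values() if len(chain) > 1]

nps = ["George W. Bush", "Richard B. Cheney", "Cheney", "Bush", "the Vice President"]
pairs = candidate_pairs(len(nps))
# Suppose a classifier labelled these three pairs as coreferent.
chains = build_chains(len(nps), [(0, 3), (1, 2), (2, 4)])
```

Here the link between Cheney and the Vice President places the latter in the same chain as Richard B. Cheney, even though that pair was never classified as coreferent directly.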
The data-driven method used to predict coreference is Memory-Based
Learning (MBL), a supervised method based on the idea that learning means storing experiences in memory, and that new problems are solved by reusing solutions from similar experiences (Daelemans and Van den Bosch 2005).
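The idea behind memory-based learning can be sketched as a nearest-neighbour classifier: training instances are stored as they are, and a new instance receives the majority label of the most similar stored instances. The class name, the overlap distance, the value of k, and the toy features below are assumptions made for illustration; they do not reproduce the classifier configuration or feature set used in this thesis.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions in which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

class MemoryBasedClassifier:
    def __init__(self, k=1):
        self.k = k
        self.memory = []  # stored (features, label) pairs

    def fit(self, instances, labels):
        """'Learning' is simply storing the training instances in memory."""
        self.memory = list(zip(instances, labels))
        return self

    def predict(self, instance):
        """Return the majority label among the k nearest stored instances."""
        nearest = sorted(
            self.memory, key=lambda item: overlap_distance(item[0], instance)
        )[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical features: (same head?, gender agreement?, sentence distance)
train_x = [("yes", "yes", "0"), ("no", "no", "3"), ("yes", "no", "1"), ("no", "yes", "5")]
train_y = ["coreferent", "non-coreferent", "coreferent", "non-coreferent"]
clf = MemoryBasedClassifier(k=1).fit(train_x, train_y)
prediction = clf.predict(("yes", "yes", "1"))
```

Because every stored instance is a potential neighbour, reducing the number of stored training examples directly reduces memory use and the number of comparisons made at classification time.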
The referring expressions covered by the system are names, definite descriptions, and pronouns. In order to maximize performance, we use different classifiers with a specific set of linguistically motivated features for each type of expression. The great majority of features used for classification are domain- and language-independent.
In our hybrid approach to coreference resolution, the first step is to apply a knowledge-based method for selecting likely anaphor-antecedent candidates
based on Accessibility theory, as put forth by Ariel (1990). Through instance selection, unlikely anaphor-antecedent pairs are labeled as non-coreferent and removed from further processing. The second step is to hand the remaining, likely candidates to a classifier which then labels each pair as either coreferent or non-coreferent. Finally, in order to form coreference chains, the pair-wise predictions of the classifier are used to link coreferent NPs together.
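The two-step control flow can be sketched as below. The selection criterion shown here, a maximum sentence distance per anaphor type, is a hypothetical stand-in chosen only to make the example runnable; the actual selection criteria derived from Accessibility theory are described in chapter 5 and are not reproduced here. The names `MAX_SENTENCE_DISTANCE`, `is_likely_pair`, and `resolve` are likewise assumptions.

```python
# Hypothetical distance limits (in sentences) per anaphor type; None = no limit.
MAX_SENTENCE_DISTANCE = {"pronoun": 2, "definite_description": 10, "name": None}

def is_likely_pair(anaphor_type, sentence_distance):
    """Step 1: knowledge-based selection of likely anaphor-antecedent pairs."""
    limit = MAX_SENTENCE_DISTANCE[anaphor_type]
    return limit is None or sentence_distance <= limit

def resolve(pairs, classify):
    """Label each pair; unlikely pairs never reach the classifier."""
    labels = []
    for pair in pairs:
        if not is_likely_pair(pair["type"], pair["distance"]):
            labels.append("non-coreferent")  # removed by instance selection
        else:
            labels.append(classify(pair["features"]))  # step 2: classifier
    return labels

def toy_classifier(features):
    """Stand-in for a trained classifier (an assumption for this sketch)."""
    return "coreferent" if features["agrees"] else "non-coreferent"

pairs = [
    {"type": "pronoun", "distance": 1, "features": {"agrees": True}},
    {"type": "pronoun", "distance": 7, "features": {"agrees": True}},
    {"type": "name", "distance": 40, "features": {"agrees": True}},
]
labels = resolve(pairs, toy_classifier)
```

The second pair shows the trade-off of such a filter: a distant pronoun pair is labeled non-coreferent without being classified, which can raise precision but also costs recall if the pair was in fact coreferent.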
This thesis is structured as follows. Chapter 2 gives an introduction to knowledge-based and data-driven methods for anaphora and coreference resolution, with focus on the types of linguistic knowledge used and on methods for antecedent selection.
Part I describes our work on annotation. Chapter 3 gives an overview of coreference annotation guidelines, and describes the process of annotating a corpus with relations between NPs; some initial observations on coreference and related phenomena in this data are outlined in chapter 4.
Part II describes our experiments on hybrid methods for coreference resolution in Swedish. In chapter 5, we describe how linguistic knowledge is incorporated in the system through methods for selecting likely anaphor-antecedent candidates; we also describe the construction of the complete feature set.
Chapter 6 concerns the development of specific feature sets for names, definite descriptions, and pronouns. We report on cross-validation experiments on the training data, where we compare the results of classifiers using a basic feature set to the results of task-specific classifiers with linguistically motivated feature sets.
In chapter 7, the classifiers described in chapter 6 are evaluated on a held-out test data set. During evaluation, we focus on the effect of linguistically motivated instance selection on the training data: can we reduce the amount of training data without loss in performance? We also evaluate our hybrid approach: is it beneficial to combine linguistically motivated selection of anaphor-antecedent pairs with a data-driven model of coreference?
In chapter 8, we conclude our work with a discussion on the main issues and contributions of this thesis, and on ideas for future work.
The main contributions of this thesis are the following:
• The first coreference resolution system for Swedish where the three tasks of identification of coreferent Named Entities, resolution of definite descriptions, and pronoun resolution have been addressed in a uniform way, using specialized classifiers for each task in order to maximize performance.
• A hybrid approach, which combines a standard data-driven model for classification with linguistically-based instance selection. The instance selection method used to select likely candidate antecedents is based on Accessibility theory, as put forth by Ariel (1990). To our knowledge, Accessibility theory has not previously been used for this purpose in computational approaches to coreference resolution.
• A comprehensive feature set for coreference resolution in Swedish.
From this set, linguistically motivated feature selection is then used to create specialized feature sets for each of the three coreference resolution subtasks.
2. Related Work on Anaphora and Coreference Resolution
Before we explain our approach to coreference resolution in Swedish, we will summarize prior work on anaphora and coreference resolution, including both knowledge-based and data-driven approaches. In particular we look at the kinds of linguistic knowledge incorporated in these approaches, both as features describing the anaphors and the antecedents, and as models for selecting the most likely among the candidate antecedents. We will also consider how preferences and conditions for coreference can differ between languages, and what this entails for our approach for Swedish.
We use the term knowledge-based approaches for algorithms and systems for anaphora and coreference resolution consisting of rules and heuristics based on linguistic knowledge. While most of the algorithms mentioned in this section have been developed for English, algorithms for languages such as Norwegian, German, and Swedish are also described.
The first step in any anaphora or coreference resolution algorithm is to identify and select the set of anaphors to be resolved. The types of entities identified as anaphors depend on the definition of the resolution task, and also on the application area for which the resolution system is intended, the language, and the text genre and domain. Anaphors allowed in systems for pronoun resolution are e.g., third person pronouns and possessives (see e.g., Mitkov, Evans and Orasan 2002), or third person pronouns, reflexives and reciprocals (see e.g., Lappin and Leass 1994), and for coreference resolution e.g., NEs, lexical NPs, pronouns (personal, demonstrative, reflexive, and relative), and possessive and relative determiners (e.g., Hartrumpf 2001).
The main motivation for restricting or categorizing the set of anaphors is that different NP types require different means for resolution. The set of anaphors can also be restricted in order to remove cases that are too difficult to resolve, e.g., in the algorithm for Norwegian anaphora resolution by Holen
(2007), anaphors with coordinated or split antecedents, and ambiguous pro-
Norwegian det, like the English it, can function as an expletive.
A common modular architecture, used for both pronoun resolution and coreference resolution, consists of the following steps (for each selected anaphor):
1. Identifying and selecting a set of candidate antecedents.
2. Processing the set of candidate antecedents, by filtering out unlikely candidates by applying constraints (also referred to as eliminating factors (Mitkov 1997)), and/or ranking the candidates according to preferences that either increase or decrease the combined score (or weight) of the candidate.
3. Selecting the most likely antecedent, e.g., the closest candidate according to some measure of proximity or recency (Hobbs 1978, Haghighi and Klein 2009), or the highest ranking candidate based on the combined score of the preferences (Mitkov et al. 2002, Lappin and Leass 1994, Mitkov, Evans, Orasan, Ha and Pekar 2007, Holen 2007).
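The steps above can be sketched as a small filter-and-rank procedure. The particular constraints, preferences, and weights below are invented for illustration and do not correspond to any of the cited systems; in particular, treating number agreement as a hard constraint is a simplification.

```python
def resolve_anaphor(anaphor, candidates, constraints, preferences):
    """Filter candidates by hard constraints, then pick the highest-scoring one."""
    viable = [c for c in candidates
              if all(holds(anaphor, c) for holds in constraints)]
    if not viable:
        return None
    return max(viable, key=lambda c: sum(score(anaphor, c) for score in preferences))

# Hypothetical constraints (eliminating factors).
def number_agrees(anaphor, cand):
    return anaphor["number"] == cand["number"]

def gender_agrees(anaphor, cand):
    return anaphor["gender"] == cand["gender"]

# Hypothetical preferences with made-up weights.
def recency(anaphor, cand):
    return -(anaphor["sent"] - cand["sent"])  # closer candidates score higher

def subject_bonus(anaphor, cand):
    return 2 if cand["role"] == "subj" else 0

anaphor = {"text": "he", "number": "sg", "gender": "masc", "sent": 3}
candidates = [
    {"text": "the President", "number": "sg", "gender": "masc", "sent": 1, "role": "subj"},
    {"text": "Cheney", "number": "sg", "gender": "masc", "sent": 2, "role": "obj"},
    {"text": "the demonstrators", "number": "pl", "gender": "unknown", "sent": 3, "role": "obj"},
]
best = resolve_anaphor(anaphor, candidates,
                       [number_agrees, gender_agrees], [recency, subject_bonus])
```

With these toy weights the subject bonus outweighs the recency advantage of the closer candidate; selecting purely by proximity would instead amount to using recency as the only preference.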
Identifying and selecting candidate antecedents
Depending on the architecture of the system, different approaches to identification and selection of candidate antecedents are used. The motivation for candidate selection is to improve on the resolution accuracy by restricting the search space. Candidate antecedent selection is commonly based on the notion of recency, as the scope of a discourse referent is typically limited.
The so-called linear-k model (Cristea, Ide, Marcu and Tablan 2000) is commonly used in knowledge-based approaches to pronoun resolution. In this model, k is set to the number of sentences included in the search space for each anaphor, e.g., k=3 for the current and two preceding sentences (Mitkov et al. 2002, Holen 2007). The size of the search space can be different for different types of pronouns; e.g., in CogNIAC, a system aimed at high precision resolution, the rule for reflexive pronouns picks out the nearest candidate antecedent in the current sentence, while the rule for subject pronouns extends the search scope to the preceding sentence if the anaphor is the subject of the current sentence (Baldwin 1997).
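A linear-k search space can be sketched as a simple window over sentence indices. The dictionary keys `sent` (sentence index) and `pos` (position in document order) and the function name are assumptions made for this illustration.

```python
def linear_k_candidates(anaphor, nps, k=3):
    """NPs preceding the anaphor in the current or the k-1 preceding sentences."""
    first_sentence = anaphor["sent"] - (k - 1)
    return [np for np in nps
            if first_sentence <= np["sent"] <= anaphor["sent"]
            and np["pos"] < anaphor["pos"]]

nps = [
    {"text": "A", "sent": 0, "pos": 0},
    {"text": "B", "sent": 1, "pos": 1},
    {"text": "C", "sent": 2, "pos": 2},
    {"text": "D", "sent": 3, "pos": 3},
    {"text": "E", "sent": 3, "pos": 5},  # follows the anaphor, so never a candidate
]
anaphor = {"text": "it", "sent": 3, "pos": 4}
window_3 = [np["text"] for np in linear_k_candidates(anaphor, nps, k=3)]
window_1 = [np["text"] for np in linear_k_candidates(anaphor, nps, k=1)]
```

Using a different k for different pronoun types, as in the CogNIAC rules mentioned above, would amount to choosing the value of k per anaphor before calling such a function.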
A related, linear solution is to restrict the search space to the discourse segment (e.g., paragraph) that contains the anaphor (see e.g., Azzam et al.
Cristea et al. (2000) contrast the simple linear-k model to the hierarchical discourse-VT-k model. This model is based on a model of global discourse structure called Veins theory (Cristea, Ide and Romary 1998), where the discourse structure is expressed as hierarchically ordered discourse structure trees as defined in Rhetorical Structure Theory (Mann and Thompson 1988).
Also referred to as e.g., preferential factors, see (Mitkov 1997).
Cristea et al. (2000) propose that by applying the discourse-VT-k model, a more refined subset of candidate antecedents can be accessed from the k discourse units that hierarchically precede the discourse unit where the anaphor is located. However, until “veins” can be accurately identified automatically, this model cannot be used for practical anaphora resolution
While linear-k models are often based on the relatively short distance between anaphoric pronouns and their antecedents, it has been observed by e.g., Fraurud (1992) that animate referents can be referred to by pronouns (e.g., han ‘he’, hon ‘she’) located further away from the antecedent, compared to inanimate referents (referred to by den, ‘it’+uter, or det, ‘it’+neuter, depending on the grammatical gender of the closest antecedent). In her algorithm for Swedish pronoun resolution, Fraurud does not restrict the search scope for pronouns but selects the closest, compatible candidate antecedent among all candidates.
Corpus studies have also shown that coreferential subsequent-mention definite descriptions and NEs show different patterns of antecedent distance than anaphoric pronouns (see e.g., (Ariel 1990) and (Vieira and Poesio 2000)).
Haghighi and Klein (2009) use different antecedent selection methods for pronominal and non-pronominal anaphora in their system for English coreference resolution. They do not restrict the initial set of candidate antecedents for anaphoric pronouns, while the final antecedent selection is based on proximity defined as distance within and between syntax trees. The candidate antecedent search space for definite descriptions and NEs is restricted to NPs with the same head noun or similar NEs (i.e., NPs of which the anaphor is a partial or full repetition) regardless of distance, NPs in certain syntax patterns with the anaphor (e.g., appositive relationships), or NPs that are semantically compatible with the anaphor (e.g., synonyms).
Processing the candidate antecedents
The processing of the set of candidate antecedents is typically a combination of the application of a set of constraints that rule out unlikely cases of reference, and a set of preferences that, among the remaining candidates, reward the likely and penalize the unlikely candidates. This processing ends with a set of likely candidates among which the most likely candidate (according to some measure) is selected as the antecedent.
Constraints are rules that filter out incompatible candidates. For definite descriptions and NEs, incompatibility can be defined as semantic disagreement, e.g., regarding animacy. For pronouns, incompatibility is often defined as intra-sentential syntactic restrictions on anaphora and morphological disagreement (see e.g., (Lappin and Leass 1994), (Hobbs 1978), (Baldwin 1997), and (Mitkov 2002)).
Because most approaches to anaphora and coreference resolution have been developed for and applied to English data, the morphological agreement constraint is often presented as a definite condition. Morphological agreement in English entails only number and natural gender agreement, as grammatical gender is not a feature of this language, while for languages such as Swedish and Norwegian with both grammatical and natural gender, compatibility is a
Constraints can also be learned from corpora, making the system a hybrid of knowledge-based and data-driven methods; in addition to linguistically motivated constraints, Hartrumpf (2001) uses a training corpus to learn distance constraints (on sentence, paragraph, or phrase level) and semantic compatibility. Similarly, Haghighi and Klein (2009) use constraints on semantic agreement, with semantic knowledge learned from unlabeled corpora.
Preferences (or factors) are a set of rules that assign a score to each candidate antecedent. The score can be either positive or negative, and this numerical value is typically based on linguistic motivations, and tuned during system development.
Distance, repetition, and identical collocation patterns are examples of positive preferences in the English anaphora resolution system MARS, while indefiniteness is one of the negative preferences. These preferences can be described as largely language- and genre-independent, and the MARS algorithm has been adapted to other languages, e.g., Bulgarian and Portuguese.
But the original MARS system for English also uses highly language- and genre-dependent preferences, e.g., term preference (awarded to representative terms within a specific genre), and frequently occurring sequence patterns, e.g., imperative constructions in technical manuals (Mitkov 2002).
As most of the approaches to pronoun resolution discussed in this section have been developed for English, they have some properties in common that may not be optimal for languages such as Norwegian, Swedish, and German: morphological agreement is treated as a hard constraint rather than a preference, and based on the surface structure in English (as captured in e.g., the grammatical role hierarchy of forward-looking centers in the Centering approach to pronouns (Brennan et al. 1987)), subject candidates are preferred over non-subjects, and syntactic parallelism is rewarded.
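The difference between treating morphological agreement as a hard constraint and treating it as a preference can be made concrete with a small sketch. The candidate names, the `score` values (standing in for the combined score of all other preferences), and the size of the penalty are invented for illustration.

```python
from typing import Dict, List, Optional


def select_antecedent(anaphor: Dict, candidates: List[Dict],
                      agreement_is_hard: bool,
                      penalty: float = -2.0) -> Optional[Dict]:
    """Pick the best-scoring candidate.

    If agreement_is_hard is True, candidates that disagree in gender are
    removed. Otherwise they stay in the set but receive a negative score.
    """
    best, best_score = None, float("-inf")
    for cand in candidates:
        agrees = cand["gender"] == anaphor["gender"]
        if agreement_is_hard and not agrees:
            continue                      # constraint: filtered out
        score = cand["score"] + (0.0 if agrees else penalty)
        if score > best_score:
            best, best_score = cand, score
    return best


anaphor = {"text": "pronoun", "gender": "fem"}
candidates = [
    {"text": "NP1", "gender": "neut", "score": 5.0},   # strong, but disagrees
    {"text": "NP2", "gender": "fem", "score": 1.0},    # weak, but agrees
]
hard = select_antecedent(anaphor, candidates, agreement_is_hard=True)
soft = select_antecedent(anaphor, candidates, agreement_is_hard=False)
```

Under the hard constraint the disagreeing candidate can never be selected, whereas under the preference setting it can still win if its remaining score is high enough.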
Holen (2007) describes the adaptation of the constraints and preferences of the English pronoun resolution algorithms by Mitkov et al. (2002) and Lappin and Leass (1994) to Norwegian; even though English and Norwegian are considered to be closely related languages, Holen finds that for Norwegian anaphora resolution, it is more advantageous to treat morphological agreement as a preference than a constraint, and that some of the preferences based on grammatical function, e.g., giving preference to antecedent candidates that are
As Swedish and Norwegian are closely related (more so than English and Norwegian), these findings are interesting for coreference resolution in Swedish.
See (Barbu, Evans and Mitkov 2002) for a discussion on number agreement constraints as a source of error for resolution of English plural pronouns.
See also (Strube and Hahn 1999) on the adaptation of the Centering approach for German, a language with freer constituent order. They propose replacing the grammatical function hierarchy of the Centering approach with criteria that reflect the information structure (in terms of hearer-old and hearer-new discourse entities) of the utterances.
Selecting the most likely antecedent
The final stage of the resolution process is the selection of the most likely antecedent. In knowledge-based systems where the candidate antecedents are graded according to the combined score of a set of preferences, the highest ranking candidate above a given threshold (in order to eliminate single, weak candidates) is selected as the antecedent. In such systems, recency is typically a built-in condition; the set of candidate antecedents is collected from the two or three preceding sentences in order to restrict the search space and thus increase the precision of the system (see e.g., (Mitkov et al. 2002), (Lappin and Leass 1994), and (Holen 2007)).
In systems where the emphasis is on filtering out candidates based on syntactic knowledge and semantic incompatibility, proximity as in distance within and between syntax trees is used to select the antecedent (see e.g., (Hobbs 1978) and (Haghighi and Klein 2009)).
A major distinction within the field of knowledge-based anaphora and coreference resolution is between knowledge-intensive and knowledge-poor approaches; we define the former as approaches requiring high-level processing including semantic information such as selectional restrictions and inference, and the latter as requiring preprocessing on a lower level, including e.g., morphological and syntactic analysis.
Hobbs’ “naive” approach
The “naive” algorithm for resolution of English pronouns by Hobbs (1978) is based on deep syntactic analysis, where for each anaphor a left-to-right breadth-first search of each preceding parse tree is performed in search of candidate antecedents. The algorithm first looks for antecedents of reflexive pronouns, and then for antecedents of other types of pronouns. Each candidate antecedent is compared against a set of constraints (based on e.g., the number and gender of the anaphor), and the first (i.e., the most recent) candidate to satisfy these constraints is selected. Intra-sentential resolution is favored over inter-sentential, and short-distance antecedents are favored over more distant ones, because the algorithm begins by searching the syntax tree of the anaphor clause and continues from there to embedded clauses and prior clauses, and then on to preceding sentences until an antecedent is found. Hobbs evaluated the algorithm manually on a total of 300 pronouns from two text types: fiction (100 pronouns) and non-fiction (200 pronouns); with selectional constraints, 91.7% of all pronouns are resolved correctly.
While Hobbs’ original algorithm is based on deep syntactic analysis, an implementation of Hobbs’ algorithm by Kehler, Appelt, Taylor and Simma (2004) does not use deep parse trees, but searches through NP groups following the preferred search order by Hobbs: in the current sentence, starting with the NP to the left of the anaphor, working from right to left, in the immediately preceding sentence from left to right (thus, because of English word order, giving preference to the subject), and in the second preceding sentence from left to right. The final result, the average of 14 evaluation runs on the ACE 2002 evaluation data set (consisting of newspaper and newswire text, with a total of 762 third person pronouns), was 68.8% correctly resolved pronouns.
Hobbs’ “naive” algorithm has been implemented and used as a baseline for evaluation by, among others, Lappin and Leass (1994), Tetreault (1999), Mitkov (2002), and Kehler et al. (2004).
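The search order over NP groups described above for the implementation without deep parse trees can be sketched as follows. This is only an approximation based on the description in the text, not the code of Kehler et al. (2004); the representation of a sentence as a list of NP strings and the `compatible` callback are assumptions.

```python
from typing import Callable, List, Optional


def search_order(sentences: List[List[str]], anaphor_sentence: int,
                 anaphor_index: int) -> List[str]:
    """Return candidate NPs in the order in which they are visited.

    sentences: each sentence is a list of NP strings in surface order.
    anaphor_index: index of the anaphor within its own sentence.
    """
    order: List[str] = []
    # Current sentence: NPs to the left of the anaphor, from right to left.
    order.extend(reversed(sentences[anaphor_sentence][:anaphor_index]))
    # Immediately preceding and second preceding sentence: left to right.
    for offset in (1, 2):
        s = anaphor_sentence - offset
        if s >= 0:
            order.extend(sentences[s])
    return order


def first_compatible(sentences: List[List[str]], anaphor_sentence: int,
                     anaphor_index: int,
                     compatible: Callable[[str], bool]) -> Optional[str]:
    """Select the first visited NP that satisfies the constraints."""
    for np in search_order(sentences, anaphor_sentence, anaphor_index):
        if compatible(np):
            return np
    return None


sentences = [["A", "B"], ["C", "D"], ["E", "F", "it"]]
visited = search_order(sentences, anaphor_sentence=2, anaphor_index=2)
chosen = first_compatible(sentences, 2, 2, compatible=lambda np: np in {"D", "A"})
```

Because preceding sentences are scanned from left to right, the first NP of a preceding sentence (in English typically the subject) is visited before later NPs of that sentence.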
RAP by Lappin and Leass
In RAP (Resolution of Anaphora Procedure) by Lappin and Leass (1994), a combination of constraints and preferences is used for English pronoun resolution. The algorithm is based primarily on syntactic knowledge and a dynamic model of attention state. The algorithm is modular, and consists of an intra-sentential syntactic filter for candidate anaphor-antecedent pairs for which coreference can be ruled out based on a set of syntactic constraints, followed by a set of salience factors. For each candidate antecedent not removed by the syntactic filter, weights are assigned to the set of preferences defined as salience factors, e.g., grammatical role, proximity, and sentence recency. The weights describe a grammatical role hierarchy that e.g., ranks
When an anaphor is linked to an antecedent, an equivalence class of coreferent NPs is formed (or updated). The attention state is calculated during processing for each equivalence class, based on the combined salience weights of all elements in the class, sentence recency, and the weight of the salience factors of the anaphor. For inter-sentential anaphora, the most recent NP within each equivalence class is selected as a candidate antecedent, and new, local salience weights are calculated. Among these candidates, the candidate with the highest salience weight that agrees with the anaphor in number and gender is selected as the antecedent.
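A much simplified sketch of salience bookkeeping over equivalence classes is given below. The factor names, the weight values, and the halving of salience at each sentence boundary are assumptions made for this illustration and should not be read as the exact RAP procedure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional

# Illustrative factor weights; hypothetical values chosen for this sketch.
WEIGHTS: Dict[str, float] = {
    "sentence_recency": 100.0,
    "subject": 80.0,
    "object": 50.0,
}


class SalienceModel:
    def __init__(self) -> None:
        # Maps an equivalence class id to its accumulated salience weight.
        self.salience: Dict[str, float] = defaultdict(float)

    def new_sentence(self) -> None:
        # Assumed decay: older mentions count less after each sentence.
        for class_id in self.salience:
            self.salience[class_id] /= 2

    def add_mention(self, class_id: str, factors: Iterable[str]) -> None:
        # A new member adds the weights of its salience factors to the class.
        self.salience[class_id] += sum(WEIGHTS[f] for f in factors)

    def best(self, compatible_classes: Iterable[str]) -> Optional[str]:
        # Only classes that agree with the anaphor are passed in here.
        known = [c for c in compatible_classes if c in self.salience]
        if not known:
            return None
        return max(known, key=lambda c: self.salience[c])


model = SalienceModel()
model.add_mention("printer", ["sentence_recency", "subject"])   # 180
model.add_mention("cable", ["sentence_recency", "object"])      # 150
model.new_sentence()                                            # 90 and 75
model.add_mention("cable", ["sentence_recency", "subject"])     # 75 + 180
winner = model.best(["printer", "cable"])
```

The example shows how a class that is mentioned again in the current sentence overtakes a class that was more salient one sentence earlier.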
This hierarchy is similar to the one used in the Centering approach to anaphora resolution by Brennan et al. (1987).
Lappin and Leass (1994) evaluate RAP on 360 pronouns in computer manuals, using full parse trees as input. The evaluation of RAP results in 86% correctly resolved pronouns (89% correct intra-sentential, and 74% correct inter-sentential anaphoric links), while an implementation of Hobbs’ algorithm used as a baseline results in 82% correctly resolved pronouns (81% correct intra-sentential, and 87% correct inter-sentential links). Because intra-sentential anaphora is more frequent than inter-sentential in the evaluation data, the algorithm by Lappin and Leass obtained better results than Hobbs’ algorithm.
While the system implemented by Lappin and Leass (1994) used full parse trees as input, a less knowledge-intensive version (where the input data is tagged with morpho-syntactic information and syntactic functions) of this algorithm has been implemented by Kennedy and Boguraev (1996); in comparison with an implementation of the original algorithm in (Lappin and Leass 1994), this system does well considering the differences in the level of preprocessing (and thus the complexity of the factors used for resolution), and genre (press releases, product announcements, newspaper articles, and web pages compared to computer manuals).
Fraurud’s algorithm for Swedish singular pronouns
The algorithm for resolution of Swedish singular pronouns by Fraurud (1992) requires morpho-syntactic analysis (including syntactic function) and NP-chunking, and, because the algorithm does not allow anaphors to corefer with the subject of the clause or with NPs coreferential with the subject, elaborate syntactic reasoning. The algorithm is based on morphological agreement and syntactic constraints in combination with recency information.
When evaluated manually on 457 pronouns after manually removing plural pronouns and pronouns with reference to propositions (the latter very difficult to do automatically), Fraurud reports that 91% of the anaphors were correctly resolved. The results varied depending on the text type of the test data: the best results were obtained for fiction (99.3% correct) and reports from court proceedings (93.3% correct), while pronoun resolution in news articles resulted in 75% correctly resolved anaphors (due to a larger proportion of difficult third person inanimate pronouns compared to the other text types). Given the level of preprocessing, the manual removal of plural pronouns and pronouns referring to propositions, the enforcement of an incrementally processed noncoreference constraint, and a manual resolution and evaluation process, these results emphasize the difficulty of the task.
Approaches based on Centering Theory
Centering theory by Grosz, Joshi and Weinstein (1995) is a theory of coherence between utterances within a discourse segment. This theory has been used as the foundation for anaphora resolution algorithms, e.g., the Centering approach to pronoun resolution by Brennan et al. (1987) and the Left-Right Centering (LRC) algorithm by Tetreault (1999).
According to the Centering approach, a discourse segment consists of a sequence of utterances. Each utterance is associated with an ordered list of forward-looking centers, consisting of those entities that are realized by linguistic expressions in the utterance. The list is ordered by grammatical function, that is, the subject is preferred as the center of the utterance, which coincides with surface constituent order in English. The first item on the list is the preferred center. Each utterance is about one entity at a time, the backward-looking center, which has already been introduced into the discourse as the
The way utterances are linked together is described by the transition state: continuing, retaining, or shifting. The transition state is based on two factors: whether the backward-looking center is the same from one utterance to the next (continuing or retaining) or not (shifting), and whether this entity coincides with the preferred center (continuing) or not (retaining) (Brennan et al. 1987).
The Centering approach is governed by a system of constraints and rules which includes a ranking of the transitions (continuing is preferred over retaining, which is preferred over shifting). It is concerned with inter-sentential anaphoric relations, while intra-sentential anaphoric relations are handled by another module based on e.g., the surface syntactic structure of the utterance (Brennan et al. 1987).
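The two factors that determine the transition state, and the ranking of the transitions, can be written down as a small sketch. The function names are hypothetical, and the treatment of an utterance without a previous backward-looking center (`cb_prev` is `None`) as non-shifting is an assumption of this sketch.

```python
from typing import List, Optional, Tuple

# Lower rank = more preferred transition.
RANK = {"continuing": 0, "retaining": 1, "shifting": 2}


def transition(cb_prev: Optional[str], cb_curr: str, cp_curr: str) -> str:
    """Classify the transition between two adjacent utterances.

    cb_prev: backward-looking center of the previous utterance
    cb_curr: backward-looking center of the current utterance
    cp_curr: preferred center of the current utterance
    """
    if cb_prev is not None and cb_curr != cb_prev:
        return "shifting"
    return "continuing" if cb_curr == cp_curr else "retaining"


def preferred_reading(cb_prev: Optional[str],
                      readings: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Among alternative (cb_curr, cp_curr) readings of an utterance,
    return the one that yields the most preferred transition."""
    return min(readings, key=lambda r: RANK[transition(cb_prev, r[0], r[1])])


readings = [("Maria", "Maria"), ("Anna", "Maria"), ("Anna", "Anna")]
best = preferred_reading("Anna", readings)
```

In a resolution setting, each reading would correspond to one possible assignment of antecedents to the pronouns of the current utterance.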
Brennan et al. (1987) do not report on any formal evaluation of the Centering approach to pronouns, but an implementation of this pronoun resolution model is evaluated against a modified approach, the Left-Right Centering Algorithm (LRC), and the results are discussed in (Tetreault 1999). LRC builds on the principles of the Centering approach to pronoun resolution by Brennan et al. (1987), but allows for incremental processing of an utterance, which Tetreault (1999) found to be a more plausible cognitive model of pronoun resolution. The algorithm requires Penn Treebank-style full parse trees, as the algorithm was developed for manually parsed text from the Penn Wall Street
The results of the LRC algorithm are also compared to the results of an implementation of Hobbs’ “naive” algorithm. The evaluation data consists of 195 news articles (1,696 pronouns, after removal of quoted speech pronouns) with full parse trees; on this data set, the performance of LRC and Hobbs’ is similar: 72.8% correctly resolved pronouns by Hobbs’, and 72.4% by LRC
The evaluations cited above show the difficulty of the resolution task. With manual preprocessing and/or manually corrected and restricted input (e.g., removal of pronouns with reference to propositions), the results of the pronoun resolution algorithms described are impressive (cf. Hobbs’ 91.7% correctly resolved pronouns, and Fraurud’s 91%). With automatic preprocessing, the results decrease; fully automatic implementations of Hobbs’ algorithm result in 71% (Mitkov 2002) and 68.8% (Kehler et al. 2004) correctly resolved anaphors.
The backward-looking center corresponds roughly to Sidner’s discourse focus (Brennan et al. 1987).
Baldwin (1997) reports 73% correctly resolved anaphors for CogNIAC, and error analysis shows that while 25% of the errors were due to the resolution system, most errors were the result of unspecified software problems, misclassification of NEs, errors in the morphological analysis, and problems with NP identification.
Different implementations of Hobbs’ “naive” algorithm have been used as a baseline by many of the authors cited here. In implementations where both the preprocessing and the resolution process are fully automatic, and where e.g., the syntactic analysis is less complex, the results for Hobbs’ algorithm decrease compared to the original evaluation.
Depending on the algorithm, the types of anaphors included in the algorithm, and the text genre the algorithm is applied to, the results may vary: e.g., Fraurud (1992) reports 91% correctly resolved pronouns on average, but only 75% for the subcategory news articles with the largest proportion of difficult cases.
Data-driven approaches require data annotated with coreferential or anaphoric relations between NPs. Projects that have had a big impact on the field are the coreference resolution tasks within the series of Message Understanding Conferences (MUC-6 and MUC-7) and the Automatic Content Extraction (ACE) program. These initiatives have resulted in annotation guidelines and tools, and have influenced evaluation methods for coreference resolution.
Data-driven approaches to coreference resolution can be further sub-categorized as supervised and unsupervised. A supervised approach is commonly a two-stage operation, where classification of pairs of anaphors and candidate antecedents is followed by linking the pairs predicted as coreferent into coreference chains. The classification is based on feature vectors constructed for each pair, both features describing each NP alone, and features comparing the two NPs. Such approaches are referred to as supervised because annotated training data is needed to learn a model for classification of candidate anaphor–antecedent pairs, followed by a method for combining the output into coreference chains, possibly also learned from labeled data.
In an unsupervised approach, the coreference resolution task is defined as a clustering task, where the NPs in a document are partitioned into clusters based on similarity between feature vectors describing each NP. Such unsupervised approaches do not require annotated data as training data.
MUC-6, URL: http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
MUC-7, URL: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_toc.html
ACE, URL: http://projects.ldc.upenn.edu/ace/
Coreference resolution is the task of determining which NPs in a text belong to the same coreference chain (or equivalence class), but in order to apply machine learning to this task it is often defined as a binary classification task, taking candidate pairs of anaphors and antecedents as input. Machine learning approaches have been applied to both pronoun resolution and coreference resolution, and even though the definition of the task and the definition of what constitutes an anaphor varies, most supervised approaches are constructed in similar ways, described below.
The methods of preprocessing applied in order to identify and select the candidate anaphor–antecedent pairs and to construct a set of features describing their characteristics, as reported in the literature cited here, vary from low-level to advanced, but most approaches have in common a preprocessing phase consisting of tokenization, part-of-speech tagging, Named Entity recognition, and NP chunking. Each of these preprocessing steps can (and does) lead to errors, which then propagate to the next step. As with any kind of high-level natural language processing, this kind of error chaining effect is difficult to avoid in coreference resolution.
After the initial preprocessing, additional information can be added, depending on the characteristics of the language and on the available resources; information sources commonly used are e.g., WordNet (Fellbaum 1998) or EuroWordNet (Vossen 1998) for adding semantic information.
From this preprocessed data, the next task is to select a set of anaphors and a set of candidate antecedents. This can be viewed as part of the coreference resolution task, or it can be viewed as the first of two related tasks, where the first task can be described as entity detection (or identification), and the second task is to identify coreference relations between the entities. By dividing the task into two, researchers can focus on coreference resolution without having to solve the entity detection task.
In knowledge-based approaches (as described in the previous section), a set of candidate antecedents is selected for each anaphor, and from this set constraints and preferences are used to find the most likely antecedent. Similarly, in most supervised machine-learning approaches to coreference resolution, a set of candidate antecedents is selected for each anaphor, and candidate pairs of anaphors and antecedents are created. A classifier is used to determine whether each such pair is coreferent or not (or the probability of the pair being coreferent). If the task is defined as coreference resolution, the anaphor–antecedent pairs classified as coreferent are linked together in order to form coreference chains.
In order to build such a classifier, preprocessed training data labeled with coreference information is used as input to a machine-learning algorithm. During this training phase, some of the important issues are:
1. Identifying and selecting a set of candidate antecedents: often all preceding NPs within some threshold are selected for each anaphor.
From the sets of anaphors and candidate antecedents, training instances are created by pairing up each anaphor with all its previous candidate antecedents. This typically results in an uneven class distribution with mostly negative instances, and a few positive instances.
In order to rebalance the data some method of selecting a subset of instances can be used.
2. Creating a set of features: each feature can either be descriptive, as in e.g., describing the anaphor or the candidate antecedent by naming their word class or gender, or comparative, as in comparing the anaphor to the antecedent, by e.g., labeling them as being compatible as to gender, or as non-compatible.
3. Choosing a machine-learning algorithm: many different algorithms have been applied to this task, e.g., Decision Trees (Soon, Ng and
The training phase is followed by an evaluation phase, where the classifier trained on the training data is evaluated on preprocessed test data by:
1. Identifying and selecting a set of anaphors and, for each anaphor, a set of candidate antecedents, and from these sets creating candidate pairs of anaphors and antecedents.
2. Applying the classifier trained during the previous phase to the test data. This results in pairs of NPs labeled as coreferent or non-coreferent (or with some probability of coreference).
3. Linking the pairs together into coreference chains, and evaluating the result.
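The final linking step can be illustrated by one simple strategy: taking the transitive closure of all pairs classified as coreferent. This is only one of several possible linking methods and is chosen here as an assumption for the sake of a compact sketch; mentions are represented by their index in the document.

```python
from typing import Dict, List, Tuple


def link_chains(n_mentions: int,
                coreferent_pairs: List[Tuple[int, int]]) -> List[List[int]]:
    """Merge pairwise coreference decisions into chains (union-find)."""
    parent = list(range(n_mentions))

    def find(i: int) -> int:
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the earliest mention as representative of the chain.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    chains: Dict[int, List[int]] = {}
    for i in range(n_mentions):
        chains.setdefault(find(i), []).append(i)
    # Mentions that were never linked are not chains.
    return [chain for chain in chains.values() if len(chain) > 1]


# Pairs (anaphor, antecedent) labeled as coreferent by some classifier.
pairs = [(3, 0), (5, 3), (4, 1)]
chains = link_chains(6, pairs)
```

A known weakness of plain transitive closure is that a single false positive pair merges two otherwise separate chains, which is one reason why the linking method itself is sometimes learned from labeled data.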
Selection of training instances
The first step after preprocessing is to identify and select the anaphors according to the definition of what can constitute an anaphor (e.g., personal pronouns with the exception of relative pronouns, Named Entities, and definite descriptions), and then to pair these anaphors with all preceding NPs. This is due to the definition of the coreference resolution task: the objective is to find all coreference chains within a document, that is, to link all NPs within a document that refer to the same entity. Thus, we need to compare each definite NP with every preceding NP.
Each NP pair, consisting of an anaphor and a candidate antecedent, makes up one instance. If the pair is coreferent, it is a positive instance, and if the pair is not coreferent, it is a negative instance. In the instance creation method proposed by McCarthy and Lehnert (1995), all positive and negative instances are used as training data. But NP coreference is a rare phenomenon since most NPs are not coreferent; corpus studies have shown that most referents referred to by a definite NP have not been previously introduced by an overt, coreferent antecedent (see e.g., Vieira and Poesio 2000, Fraurud 1992). Consequently, most such candidate anaphor–antecedent pairs are not coreferent, and therefore there are more negative than positive instances.
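The skewed class distribution that results from pairing every NP with all preceding NPs can be seen in a small constructed example. As a simplification, the sketch treats every NP as a potential anaphor; the entity labels are invented.

```python
from typing import List, Optional, Tuple

Instance = Tuple[int, int, bool]   # (antecedent index, anaphor index, label)


def all_pairs_instances(entity_ids: List[Optional[str]]) -> List[Instance]:
    """Create one instance for every NP paired with every preceding NP.

    entity_ids[i] is the annotated entity of NP i, or None if NP i does not
    corefer with any other NP.
    """
    instances: List[Instance] = []
    for j in range(1, len(entity_ids)):
        for i in range(j):
            coreferent = (entity_ids[i] is not None
                          and entity_ids[i] == entity_ids[j])
            instances.append((i, j, coreferent))
    return instances


# Seven NPs: entity "a" is mentioned three times, entity "b" twice.
entity_ids = ["a", None, "b", "a", None, "b", "a"]
instances = all_pairs_instances(entity_ids)
n_positive = sum(1 for _, _, label in instances if label)
n_negative = len(instances) - n_positive
```

For n NPs this scheme creates n(n-1)/2 instances, so the number of negative instances grows quadratically with document length while the number of positive instances depends only on the chain sizes.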
In order to create more evenly distributed data sets, different instance selection methods can be used. This can be seen as combining linguistic knowledge with data-driven methods, creating a hybrid approach; see e.g., (Hendrickx, Hoste and Daelemans 2007) where purely data-driven methods are compared to a hybrid approach where linguistic knowledge is used to restrict the data sets.
The motivation for reducing the number of training instances is both to improve on the speed of the classifier, and to improve on the classification results.
Instance selection can focus on negative instance filtering, where the objective is to remove as many negative instances as possible (even though a few positive instances might be lost in the process). In positive instance filtering, positives that may be difficult for the classifier to solve, and that thus “contaminate” the training data, can be removed (see e.g., Ng and Cardie 2002a, Uryupina 2004), or the aim can be to remove less important or easy positive instances in order to force the learner to focus on the difficult cases (Hendrickx et al. 2007).
Some commonly used strategies for selecting training instances are based on the idea that only one antecedent is needed in order to connect an anaphor to the proper coreference chain. Soon et al. (2001) propose restricting the search space of each anaphor to the closest coreferent antecedent: for each anaphor, a positive instance is created for the anaphor and its closest antecedent; and any intervening NPs are used to create negative instances for that anaphor.
A variation of this approach described in (Ng and Cardie 2002a) is to choose the most likely antecedent for pronominal and non-pronominal NPs, thus allowing for differences in anaphoric behavior between pronouns and other types of NPs: for non-pronominal NPs, the most likely antecedent is the closest non-pronominal preceding antecedent, and for pronouns, it is the closest preceding antecedent.
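The two selection strategies described above can be sketched in one function. The mention representation is hypothetical, and the decision to skip intervening NPs that are themselves coreferent with the anaphor when creating negative instances is an assumption of this sketch.

```python
from typing import List, Optional, Tuple

MentionInfo = Tuple[Optional[str], bool]   # (entity id or None, is_pronoun)
Instance = Tuple[int, int, bool]           # (antecedent, anaphor, label)


def select_instances(mentions: List[MentionInfo],
                     prefer_non_pronominal: bool = False) -> List[Instance]:
    """Create one positive instance per anaphor plus negatives in between.

    prefer_non_pronominal=False: the closest preceding antecedent is used.
    prefer_non_pronominal=True: a non-pronominal anaphor is paired with its
    closest non-pronominal antecedent instead.
    """
    instances: List[Instance] = []
    for j, (entity, anaphor_is_pronoun) in enumerate(mentions):
        if entity is None:
            continue
        antecedent = None
        for i in range(j - 1, -1, -1):
            cand_entity, cand_is_pronoun = mentions[i]
            if cand_entity != entity:
                continue
            if (prefer_non_pronominal and not anaphor_is_pronoun
                    and cand_is_pronoun):
                continue
            antecedent = i
            break
        if antecedent is None:
            continue                      # first mention: no instances
        instances.append((antecedent, j, True))
        instances.extend((i, j, False) for i in range(antecedent + 1, j)
                         if mentions[i][0] != entity)
    return instances


# Constructed example: "Anna", "the book", "she", "Maria", "Anna".
mentions = [("a", False), (None, False), ("a", True), ("b", False), ("a", False)]
closest = select_instances(mentions)
variant = select_instances(mentions, prefer_non_pronominal=True)
```

In the example, the second occurrence of the name is paired with the pronoun under the first strategy, but with the first occurrence of the name under the second strategy.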
The instance selection method described in (Soon et al. 2001) is extended with filters based on linguistic knowledge to further reduce the number of instances in (Strube, Rapp and Müller 2002); in a resolution system for German, instances are discarded if, e.g.:
• The anaphor is indefinite.
• The anaphor (or the antecedent) is embedded in the antecedent (or the anaphor).
• For non-pronominal NP pairs: they have different semantic class values.
• The anaphor is a pronoun and the entities have different agreement values (person, gender, number).
This filter has been developed for German, which (like Swedish) allows for cases where a non-pronominal anaphor points to an antecedent that has a different grammatical gender. The filter also takes into account the syntactic structure of the NPs, by removing instances where the anaphor is embedded in the candidate antecedent (or vice versa).
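The four filter conditions listed above translate directly into a predicate over instance pairs. The `NP` fields below are hypothetical stand-ins for the annotation the filters would need, and the example values are invented.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NP:
    start: int                          # token offset of the first token
    end: int                            # token offset after the last token
    is_pronoun: bool
    is_indefinite: bool
    semantic_class: str
    agreement: Tuple[str, str, str]     # (person, gender, number)


def embedded(inner: NP, outer: NP) -> bool:
    """True if the token span of inner lies within the span of outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def keep_instance(anaphor: NP, antecedent: NP) -> bool:
    """Return False if the pair should be discarded."""
    if anaphor.is_indefinite:
        return False
    if embedded(anaphor, antecedent) or embedded(antecedent, anaphor):
        return False
    if (not anaphor.is_pronoun and not antecedent.is_pronoun
            and anaphor.semantic_class != antecedent.semantic_class):
        return False
    if anaphor.is_pronoun and anaphor.agreement != antecedent.agreement:
        return False
    return True


name = NP(0, 2, False, False, "human", ("3", "fem", "sg"))
definite = NP(5, 7, False, False, "human", ("3", "neut", "sg"))
pronoun = NP(9, 10, True, False, "human", ("3", "fem", "sg"))
inner = NP(1, 2, False, False, "human", ("3", "fem", "sg"))
```

Note that the agreement check applies only to pronominal anaphors, so a definite description with a different grammatical gender than its antecedent (here `definite` and `name`) is kept.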
Features for coreference resolution
In order to classify a candidate anaphor–antecedent pair as coreferent or non-coreferent, we need different kinds of knowledge describing the properties of both NPs, as well as comparative knowledge about the candidate anaphor–antecedent pair. In knowledge-based approaches such knowledge is referred to as constraints and preferences (or factors), while in data-driven approaches we use the term features.
The content and the size of the set of features used in data-driven approaches to coreference resolution vary greatly, as well as the level of linguistic knowledge expressed by these features: from basic knowledge and shallow analysis methods such as string matching, to advanced knowledge on semantic relatedness and discourse structure.
Criteria for selecting the feature set are e.g., relevance according to previous research (e.g., corpus studies on anaphora, and research on coreference resolution), relevance for the particular language in question, low annotation cost and/or high reliability of automatic tagging, and either domain-independence or relevance for the domain. Thus, depending on the language and the domain, different information sources are used to construct the feature set: for English, common information sources are part-of-speech tagging, morphological analysis, and semantic categorization based on information on hyperonymy/hyponymy and synonymy from WordNet (Fellbaum 1998).
The feature set used by Soon et al. (2001) for English coreference resolution
class agreement, with values determined using WordNet as the primary information source. The ten chosen classes are arranged in an IS-A hierarchy (with person and object as top nodes), and each class is mapped to a WordNet synset. During preprocessing, a semantic class determination module looks up the first WordNet sense of the head word of each markable. As senses in WordNet are ordered by frequency, this means that each markable is mapped to the most frequent sense of its head noun (making this feature domain-insensitive). The selected sense is used as a semantic class label, and matched against the classes in the IS-A hierarchy of WordNet. If the semantic class of the markable is a subclass of one of the classes C in the IS-A hierarchy, its semantic class is C, else it is unknown. But WordNet is a limited source of information.
If WordNet mapping cannot be used for determining the semantic class of either the anaphor or the antecedent, the head nouns of both NPs are compared for string similarity (akin to feature 4, String match) (Soon et al. 2001).
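The semantic class lookup and its string-based fallback can be approximated with a toy hierarchy. The dictionary `HYPERNYM` is a hand-made stand-in for WordNet, only four of the ten classes are included, and two details are assumptions of this sketch: classes are taken to agree when one is a subclass of the other, and string similarity is reduced to a case-insensitive exact match.

```python
from typing import Dict, Optional

# Toy hypernym links standing in for the first WordNet sense (hypothetical).
HYPERNYM: Dict[str, str] = {
    "teacher": "person",
    "woman": "female",
    "car": "vehicle",
    "vehicle": "object",
}

# A fragment of an IS-A hierarchy of semantic classes: class -> parent class.
CLASS_PARENT: Dict[str, Optional[str]] = {
    "person": None,
    "female": "person",
    "male": "person",
    "object": None,
}


def semantic_class(head_noun: str) -> str:
    """Walk up the hypernym links until one of the chosen classes is hit."""
    node: Optional[str] = head_noun
    visited = set()
    while node is not None and node not in visited:
        if node in CLASS_PARENT:
            return node
        visited.add(node)
        node = HYPERNYM.get(node)
    return "unknown"


def is_subclass(cls: Optional[str], ancestor: str) -> bool:
    while cls is not None:
        if cls == ancestor:
            return True
        cls = CLASS_PARENT.get(cls)
    return False


def semclass_agreement(head1: str, head2: str) -> bool:
    c1, c2 = semantic_class(head1), semantic_class(head2)
    if "unknown" in (c1, c2):
        # Fallback: compare the head nouns as strings.
        return head1.lower() == head2.lower()
    return is_subclass(c1, c2) or is_subclass(c2, c1)
```

A head noun that cannot be mapped to any class, such as `"idea"` in this toy hierarchy, agrees only with an identical head noun.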
The gender agreement feature is based on several clues, e.g., designators such as Mr., Ms. and lists of common male and female names for NEs. The semantic class determination of the head noun is used to determine gender for lexical NPs: markables classified as objects are marked as “neutral”, and markables that either are classified as persons or that cannot be classified are marked as “unknown”. When constructing the gender agreement feature for a candidate anaphor–antecedent pair, three values are used: unknown when the gender of either one is unknown, true when they agree in gender, and false when they disagree (Soon et al. 2001).
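The three-valued logic of the gender agreement feature can be sketched as follows. The function name and the use of None for an unknown gender are assumptions made for illustration.

```python
from typing import Optional

def gender_agreement(ana_gender: Optional[str], ante_gender: Optional[str]) -> str:
    """Three-valued gender agreement; None stands for an unknown gender."""
    if ana_gender is None or ante_gender is None:
        return "unknown"
    return "true" if ana_gender == ante_gender else "false"
```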
The appositive feature is based on recognition of both tight and loose appositives, and requires that at least one of the NPs is a proper name. Appositive constructs are found by checking for the existence of verbs and punctuation between the anaphor and its most immediate antecedent (Soon et al. 2001).
The feature set constructed by Soon et al. (2001) has been used as a baseline by e.g., Ng and Cardie (2002b) who expand the feature set with features based on more sophisticated linguistic knowledge, e.g., by adding lexical features
(including more complex string matching features), semantic features based on more fine-grained semantic compatibility tests using WordNet, grammatical features (describing e.g., grammatical roles), and a positional feature describing distance in the number of intervening paragraphs. Ng and Cardie
(2002b) also add two features based on knowledge-based methods: a simple, rule-based pronoun resolution algorithm and a rule-based coreference resolution algorithm. Ng and Cardie (2002b) find that using the full feature set results in a decrease in performance compared to the Soon baseline. They attribute this to problems with resolution of definite descriptions and to data fragmentation due to the increased size of the feature set. By manual feature selection, aiming for elimination of low-precision rules for definite descriptions, performance is improved.
Because full and partial repetitions of already introduced NPs are common
(as shown in corpus studies on anaphora by e.g., Vieira and Poesio (2000)), string similarity is a reliable clue for coreference resolution. Strube et al.
(2002) suggest using the minimum edit distance between the anaphor and the candidate antecedent (and vice versa) as a language- and domain-independent generalization of the string matching and alias features used by e.g., Soon et al. (2001). In their approach to German coreference resolution, Strube et al.
(2002) find that using features based on the minimum edit distance between anaphor and antecedent results in an improvement over the baseline (using a feature set similar to that of (Soon et al. 2001)).
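A feature pair based on minimum edit distance might look like the sketch below. The edit distance is the standard Levenshtein distance; normalizing it once by the length of the anaphor and once by the length of the antecedent is an assumption about how "and vice versa" could be realized, and the function names are hypothetical.

```python
from typing import Tuple

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def med_features(ana: str, ante: str) -> Tuple[float, float]:
    """Return the edit distance normalized by the anaphor and by the antecedent length."""
    d = edit_distance(ana, ante)
    ana_med = (len(ana) - d) / len(ana) if ana else 0.0
    ante_med = (len(ante) - d) / len(ante) if ante else 0.0
    return ana_med, ante_med
```

Identical strings yield (1.0, 1.0); the two values diverge when one NP is a partial repetition of the other.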
The Soon feature for semantic class agreement based on the WordNet sense of the head word of the NP is commonly used, but there has also been work on corpus-based methods for inducing semantic features for definite descriptions; e.g., Ng (2007) uses an NE recognizer to label NEs with semantic classes (e.g., person, organization, location), and a dependency parser to extract appositive relations from a large, unlabeled corpus. The semantic class of a lexical NP is determined by computing the probability that the NP co-occurs with each of the NE types. For lexical NPs not belonging to any of the NE semantic classes, the WordNet first sense heuristic is used. Adding this feature to a baseline set of features (employed by the system described in (Ng and Cardie 2002b)) resulted in a small but significant improvement over the baseline.
Table 2.1: Features used by Soon et al. (2001). Terminology: ana, j = candidate anaphor; ante, i = candidate antecedent; T = true, F = false.
1 Distance Distance between ana and ante measured in sentences; if in same sentence, distance = 0. (Value: 0 .. n-1, where n is the number of sentences in the text.)
2 i-Pronoun Is ante a pronoun? (Value: T/F)
3 j-Pronoun Is ana a pronoun? (Value: T/F)
4 String match Are ana and ante identical after removal of articles (a, an, the) and demonstratives (this, that, these, those)? (Value: T/F)
5 Def NP Is ana a definite noun phrase (i.e., starts with the)? (Value: T/F)
6 Dem NP Is ana a demonstrative noun phrase (i.e., starts with this, that, these, those)? (Value: T/F)
7 Number Do ana and ante agree with respect to number? If not a pronoun, the morphological root of a noun is used to determine number. (Value: T/F)
8 Sem class Do ana and ante agree with respect to ten predefined semantic classes, e.g., person, female, male, object, organization, location, and date? If the semantic class of ana or ante cannot be determined, the head nouns of both NPs are compared for string similarity; if the same string (T), else (unknown).
9 Gender Do ana and ante agree as to gender? (Value: T/F/unknown)
10 Proper name Are both ana and ante proper names? (Value: T/F)
11 Alias Are ana and ante similar, including substring match, abbreviations, and acronyms? (Value: T/F)
12 Appositive Is ana appositive of ante? (Value: T/F)
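As an illustration of feature 4 (String match) in Table 2.1, the following sketch strips articles and demonstratives before comparing two NPs. Lowercasing and whitespace tokenization are assumptions of this sketch.

```python
from typing import List

# Articles and demonstratives removed before comparison (as listed in Table 2.1).
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}

def strip_determiners(np: str) -> List[str]:
    """Lowercase the NP and drop articles and demonstratives."""
    return [tok for tok in np.lower().split() if tok not in DETERMINERS]

def string_match(ana: str, ante: str) -> bool:
    """True if the two NPs are identical once determiners are removed."""
    return strip_determiners(ana) == strip_determiners(ante)
```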
Among others Ng and Cardie (2002b) employ two features describing the probability of an NP having an antecedent (referred to as the anaphoricity of an NP) and the probability that a pair of NPs are coreferential computed from coreference annotated corpora.
Features based on linguistic theory, e.g., Centering theory, can also be added; (Iida, Inui, Takamura and Matsumoto 2003) use a feature set similar to that of (Ng and Cardie 2002b) in combination with features capturing contextual information based on an application of Centering theory to
Machine-learning algorithms used for coreference resolution
When a feature vector has been constructed for each instance in the data, the training data set can be used for training a classifier. In most work on data-driven coreference resolution, standard machine-learning algorithms have been used, e.g., Decision Trees (McCarthy and Lehnert 1995, Aone and Bennet
1995, Soon et al. 2001, Ng and Cardie 2002b, Strube et al. 2002, Yang, Zhou,
Su and Tan 2003), Rule learning (Hoste 2005), Memory-based learning (Hoste
2005, Nøklestad 2009), Maximum Entropy (Morton 2000, Kehler et al. 2004), and Support Vector Machines (Iida et al. 2003).
The choice of learning algorithm for a particular task depends on many parameters, e.g., the definition of the learning task, and the characteristics of the feature set. For example, Decision Tree learning algorithms are robust to errors in the training data, with respect to both class labels and features
(Mitchell 1997). This robustness makes them suitable for NLP, both because manual annotation of e.g., coreference is not always consistent and thus there might be errors when that annotation is transformed into class labels in the training data, and because preprocessing may add errors to the feature set. Decision Trees are also suitable when some feature values are unknown, which is a possibility when dealing with natural language. This method can handle both numerical and categorical data, and both numerical (e.g., distance between NPs) and categorical features might be clues to coreference resolution.
In Decision Tree learning algorithms, instances (for coreference resolution typically a pair of NPs) are classified by sorting them down the tree from the root to some leaf node, where each internal node specifies a test of some feature (or attribute), each branch corresponds to the tested feature value, and each leaf node provides a classification for the instance. Decision trees are constructed top-down, by first finding the feature that best classifies the data
(Mitchell 1997). For coreference resolution, this is typically string matching.
As two strings tested for similarity can be either similar or not, this results in two possible values for the string matching test: positive or negative.
Soon et al. (2001) present a decision tree classifier for coreference resolution in English constructed using the C5.0 algorithm on the MUC-6 training
an unknown instance is for string matching where the positive value results in a classification of the anaphor and the antecedent as coreferent, while a test for whether the anaphor is a pronoun or not is branched from the negative value. Non-pronominal anaphors are then tested for being in apposition to the candidate antecedent, where a positive value gives a classification as coreferent, and a negative value is branched into a test of whether the anaphor is an acronym of the candidate antecedent, and thus coreferent with the antecedent, or not. Pronominal anaphors are first tested for gender agreement, where non-agreement and unknown gender values result in a classification as non-coreferent, while gender agreement branches into a test of whether the antecedent too is a pronoun. If so, the anaphor and candidate antecedent are classified as coreferent, but if the antecedent is not a pronoun, there is a test for distance where candidates in different sentences are classified as non-coreferent, and candidates within the same sentence are tested for number agreement. This test for number agreement is the final test in the tree, where pairs with a positive value are classified as coreferent, and pairs with a negative value are classified as non-coreferent.
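The tree described above can be written out as a sequence of nested tests. This is a sketch that follows the prose description only; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    str_match: bool   # feature 4: string match
    j_pronoun: bool   # anaphor is a pronoun
    i_pronoun: bool   # antecedent is a pronoun
    appositive: bool  # anaphor is in apposition to the antecedent
    alias: bool       # anaphor is an acronym/alias of the antecedent
    gender: str       # "true", "false" or "unknown"
    distance: int     # distance in sentences
    number: bool      # number agreement

def classify(p: Pair) -> bool:
    """Return True if the pair is classified as coreferent."""
    if p.str_match:
        return True
    if not p.j_pronoun:
        # Non-pronominal anaphor: apposition first, then alias.
        return p.appositive or p.alias
    if p.gender != "true":
        # Disagreement and unknown gender both lead to non-coreferent.
        return False
    if p.i_pronoun:
        return True
    if p.distance > 0:
        return False
    return p.number
```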
Note that in this tree, there is no test for e.g., semantic class agreement even
al. (2001) report that the values of this feature are noisy, because the task of finding the proper semantic class for each NP based on WordNet is difficult.
The absence of testing for semantic class agreement might also be due to the bias of Decision Tree learning toward the frequent and typical; compared to string matching or apposition, coreference between semantically related lexical NPs is a rare phenomenon.
Decision tree learning algorithms are sometimes referred to as “eager” learners, as a lot of effort is used for building a decision tree for efficient classification of new instances, in contrast to “lazy” learners such as
Memory-Based Learning (MBL) algorithms which have a simple learning algorithm (simply storing all training data in memory) but a more complex classification algorithm based on retrieval of instances and similarity-based reasoning. The basis for MBL is learning as a cognitive task: when people learn from experience, they reuse their memory rather than extract a set of rules from the most typical and frequent events (Daelemans and Van den
A memory-based classifier is trained by storing instances in memory, and classification of an unknown instance is performed by looking up instances with similar feature vectors in memory, and extrapolating from the class labels
of these stored instances. By similar vectors, we mean the nearest neighbors of an instance represented as a point in an example space where the dimensions are the features used to describe each instance. The Tilburg memory-based learner (TIMBL) is an implementation of the k-nearest neighbor approach optimized for working with linguistic data sets (Hendrickx et al. 2007, Daelemans and Van den Bosch 2005).
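The store-then-retrieve idea can be illustrated with a small k-nearest neighbor classifier. This sketch uses an unweighted overlap distance (the number of mismatching feature values) and majority voting; it is a simplification for illustration and does not reproduce TIMBL or its interface. All names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Instance = Tuple[Sequence[str], str]  # (feature vector, class label)

def overlap_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of feature positions in which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

class MemoryBasedClassifier:
    def __init__(self, k: int = 1) -> None:
        self.k = k
        self.memory: List[Instance] = []

    def train(self, instances: Sequence[Instance]) -> None:
        # "Learning" is just storing the instances in memory.
        self.memory.extend(instances)

    def classify(self, features: Sequence[str]) -> str:
        # Retrieve the k most similar stored instances and take a majority vote.
        ranked = sorted(self.memory, key=lambda inst: overlap_distance(features, inst[0]))
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]
```

Because no instance is discarded during training, an infrequent or exceptional example can still determine the classification of a new instance that resembles it.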
Lazy learning methods such as MBL are suitable for NLP tasks such as coreference resolution, where there are many exceptions to the most frequent cases and where ignoring such exceptional and infrequent examples can be harmful. MBL has been applied to coreference resolution by e.g., ? and to pronoun resolution by Nøklestad (2009).
Selection of test instances
In order to evaluate the classifier trained on the training data, a disjoint set of test data is needed. This data set is constructed in the same way as the training data, using the same feature set to describe the instances. But just as the coreference training data set is unevenly distributed with mostly negative instances and a few positive instances, so is the test data.
Thus, the motivation for the selection of test instances is similar to the motivation for training instance selection: to improve on the speed of the classifier, and to improve on precision by excluding unlikely candidate antecedents. One major difference is that for test instance selection, information about the target concept (that is, the manual coreference annotations) cannot be used for selecting a subset of test instances.
Test instance selection often takes the NP type of the anaphor into account, e.g., by restricting the search scope for pronouns to the immediate context (defined as the current and the two (Yang et al. 2003), three (Klenner and Ailloud 2008), or five previous sentences (Uryupina 2004)), by filtering the candidate antecedents of non-pronominal anaphors for string similarity, and by applying a restricted search scope if the strings are not similar, e.g., two (Yang et al. 2003), or three sentences (Hendrickx, Hoste and Daelemans 2008).
Proximity in combination with observations regarding typical behavior and features of different NP types is also taken into account in the approach by
Yang et al. (2003): for each pronominal anaphor all candidate antecedents in the current and the previous two sentences that agree in number, gender, and person are added, while for non-pronominal anaphors all non-pronominal antecedent candidates are added.
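The selection scheme attributed to Yang et al. (2003) above can be sketched as a simple filter. The Mention class and its fields are hypothetical, and agreement is reduced to equality of attribute values for the purposes of this sketch.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Mention:
    text: str
    sentence: int     # index of the sentence containing the mention
    is_pronoun: bool
    number: str
    gender: str
    person: int

def agrees(a: Mention, b: Mention) -> bool:
    """Agreement in number, gender, and person (simplified to equality)."""
    return (a.number, a.gender, a.person) == (b.number, b.gender, b.person)

def select_candidates(ana: Mention, preceding: List[Mention], window: int = 2) -> List[Mention]:
    """Select candidate antecedents for ana among the preceding mentions."""
    if ana.is_pronoun:
        # Pronouns: current and previous `window` sentences, with agreement.
        return [m for m in preceding
                if ana.sentence - m.sentence <= window and agrees(ana, m)]
    # Non-pronominal anaphors: all non-pronominal candidates.
    return [m for m in preceding if not m.is_pronoun]
```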
By applying linguistically motivated filters (comparable to the hard constraints applied in knowledge-based approaches), the number of instances can be further reduced, e.g., by removing instances where the candidate antecedent and the pronominal anaphor disagree in number, gender, and person
(see e.g., (Klenner and Ailloud 2008) for German, (Hendrickx et al. 2007) for
Dutch, and (Nøklestad 2009) for Norwegian).
The result of applying the classifier to the test data is for each candidate anaphor–antecedent pair a class label (coreferent or non-coreferent) or a confidence score depending on the classifier type. This pair-wise result can be evaluated using the standard computational linguistics evaluation measures precision, recall and F-score. Here, precision and recall are estimated in comparison to the label of each pair (extracted from the manual coreference annotation).
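Pair-wise precision, recall, and F-score can be computed as in the sketch below, where each pair carries a gold label and a predicted label (True for coreferent). Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
from typing import Sequence, Tuple

def precision_recall_f(gold: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float, float]:
    """Pair-wise precision, recall, and balanced F-score for the coreferent class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if p and not g)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```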
Linking coreferent pairs into coreference chains
But since the task is coreference resolution, classifying the candidate anaphor–antecedent pairs as coreferent or non-coreferent is not enough; we want to combine the classifier output into coreference chains, or equivalence classes.
Methods for linking the output of the classifier into coreference chains can be based on some property of the prototypical coreference relation, e.g., proximity: in the closest-first clustering algorithm, each anaphor is linked to the closest preceding NP among the NPs classified as coreferent with the anaphor
(if any) (Soon et al. 2001).
Another property of coreference is that some NP types are more likely as antecedents than others. The best-first clustering algorithm is a modification of closest-first clustering, where the anaphor is linked to the most likely preceding NP (that is, with the highest confidence value returned by the classifier) among all NPs classified as coreferent (Ng and Cardie 2002b). In comparison to closest-first clustering, this clustering method may increase precision.
The coreference relation exists between all coreferent NPs in a document, not only the closest or the most likely. In the aggressive-merge clustering algorithm, each anaphor is merged with all of the preceding NPs classified as coreferent. Compared to closest-first clustering and best-first clustering, more merging occurs which may increase recall (Ng 2005).
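The three linking strategies can be compared on the same classifier output. In this sketch, mentions are identified by their position in the text, and the input maps each anaphor to the preceding NPs classified as coreferent with it, together with a confidence value; this data layout and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# anaphor index -> [(antecedent index, classifier confidence)], positives only
Candidates = Dict[int, List[Tuple[int, float]]]
Link = Tuple[int, int]

def closest_first(cands: Candidates) -> List[Link]:
    """Link each anaphor to the closest preceding positive candidate."""
    return [(max(i for i, _ in c), j) for j, c in cands.items() if c]

def best_first(cands: Candidates) -> List[Link]:
    """Link each anaphor to the positive candidate with the highest confidence."""
    return [(max(c, key=lambda x: x[1])[0], j) for j, c in cands.items() if c]

def aggressive_merge(cands: Candidates) -> List[Link]:
    """Link each anaphor to every positive candidate."""
    return [(i, j) for j, c in cands.items() for i, _ in c]

def chains(links: List[Link], n_mentions: int) -> List[List[int]]:
    """Merge links into coreference chains (equivalence classes) with union-find."""
    parent = list(range(n_mentions))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in links:
        parent[find(j)] = find(i)
    groups: Dict[int, List[int]] = {}
    for m in range(n_mentions):
        groups.setdefault(find(m), []).append(m)
    return sorted(g for g in groups.values() if len(g) > 1)
```

For an anaphor with two positive candidates, closest-first and best-first each keep one link, possibly a different one, while aggressive-merge keeps both and therefore produces larger chains.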
Other methods focus on enforcing the transitivity of the coreference relation, i.e., that if A and B are coreferential, and B and C are coreferential we can conclude that A and C are coreferential. Klenner and Ailloud (2008) propose a method where coreference sets are incrementally generated starting from the “safest” pairs classified as coreferent. The safest pairs are defined as those for which the classifier can find no counter-examples in the training data, while the least safe pairs are those for which there are no positive examples. From the safest seed clusters, an n-best beam search in combination with hard constraints based on linguistic knowledge is used to incrementally create consistent coreference chains.
The evaluation of coreference resolution systems is an open issue, and there are a number of different measures that can be used, e.g., the MUC score (Vilain, Burger, Aberdeen, Connolly and Hirschman 1995), B-cubed
(Baldwin and Morton 1998), CEAF (Luo 2005), and the ACE score
(Doddington, Mitchell, Przybocki, Ramshaw, Strassel and Weischedel 2004).
The ACE score has been developed within the Automatic Content Extraction program for the specific ACE task definition (which includes coreference resolution between seven types of entities: person, organization, location, geopolitical entity, weapon, vehicle, and facility) and the ACE data sets. The ACE
The evaluation metric by Vilain et al. (1995), commonly referred to as the
MUC score, is one of the most widely used evaluation scores for coreference.
It is an attempt to move from evaluating the task of finding one antecedent for an anaphoric expression to evaluating the task of finding coreference chains
(or equivalence classes) in terms of recall, precision, and F-score. The MUC score was developed with an Information Extraction scenario in mind, and result is as “close to the task of coreference resolution as needed in real-world
The MUC score is based on the links between NPs; it compares the equivalence classes derived from the links between the NPs in the answer key (henceforth: the key) with the equivalence classes derived from the links between the NPs in the corresponding system output (henceforth: the response). In the
MUC scoring scheme, the recall error for a specific equivalence class is computed by determining how many links between NPs would have to be added to the response in order to create the same equivalence class in the response as in the key, and dividing that by the number of correct links. Recall is then the size of the minimal set of links needed to generate the equivalence class in question minus the number of missing links, divided by the size of the minimal set of correct links of the equivalence class. This measure can be extended to a set of equivalence classes by summing over the key equivalence classes. There is an inverse relationship between precision and recall in the MUC scoring scheme in that precision is computed in the same way as recall, only precision is computed by determining how many links would have to be added to the key in order to create the response (Vilain et al. 1995).
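The link-based computation can be made concrete with a short sketch. For a key class S, the minimal number of links is |S| - 1, and the number of missing links is |p(S)| - 1, where p(S) is the partition of S induced by the response (mentions absent from the response count as singletons). Recall is then the sum of |S| - |p(S)| over the key classes, divided by the sum of |S| - 1. The function names are hypothetical.

```python
from typing import Hashable, List, Set, Tuple

Partition = List[Set[Hashable]]

def _link_recall(key: Partition, response: Partition) -> float:
    """Sum of (|S| - |p(S)|) over key classes, divided by the sum of (|S| - 1)."""
    numerator = denominator = 0
    for s in key:
        parts = set()   # response classes that intersect S
        singletons = 0  # mentions of S missing from the response
        for mention in s:
            idx = next((i for i, r in enumerate(response) if mention in r), None)
            if idx is None:
                singletons += 1
            else:
                parts.add(idx)
        numerator += len(s) - (len(parts) + singletons)
        denominator += len(s) - 1
    return numerator / denominator if denominator else 0.0

def muc_score(key: Partition, response: Partition) -> Tuple[float, float, float]:
    """Return (recall, precision, F-score); precision swaps the roles of key and response."""
    recall = _link_recall(key, response)
    precision = _link_recall(response, key)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f
```

For a key class {A, B, C, D} and a response {A, B}, {C, D}, one link is missing out of three, so recall is 2/3, while both response links are correct, so precision is 1.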
The MUC score was first developed for evaluation of the shared coreference resolution task of the sixth Message Understanding Conference (MUC-6) (Vilain et al. 1995).
Baldwin and Morton (1998) argue that the MUC scoring scheme has two major problems: first, it ignores (antecedentless and non-anaphoric) single mentions since they are not linked to anything, and second, all errors are considered to be equally harmful. Baldwin and Morton claim that not all coreference errors are equally damaging, and thus should be penalized differently. As a solution to these problems, they propose the B-cubed metric. There, the precision and recall are computed for each entity (NP) in the document, whether that entity is part of a coreference chain or not, by looking at the presence or absence of each entity relative to each of the other entities in the equivalence classes. This is then combined to produce the overall precision and recall. The B-cubed measure can be used in two versions: with equal weights for each entity, or with equal weights for each equivalence class; the former is more sensitive to precision errors than the latter.
Baldwin and Morton (1998) argue that the former version is more appropriate for evaluation of coreference resolution systems in Information Extraction tasks, where coreference precision errors can be harmful to the system, and that the latter version is more appropriate for systems adapted for Information
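The per-entity computation can be sketched as follows for the version with equal weights for each entity (NP). For each NP, precision is the share of its response class that also belongs to its key class, and recall is the share of its key class that also belongs to its response class. Treating an NP that is missing from one of the partitions as a singleton is an assumption of this sketch.

```python
from typing import Hashable, List, Set, Tuple

Partition = List[Set[Hashable]]

def b_cubed(key: Partition, response: Partition) -> Tuple[float, float]:
    """Return (precision, recall), averaged with equal weight per NP."""
    mentions: Set[Hashable] = set().union(*key, *response)
    if not mentions:
        return 0.0, 0.0

    def class_of(m: Hashable, partition: Partition) -> Set[Hashable]:
        # NPs missing from a partition are treated as singleton classes.
        return next((c for c in partition if m in c), {m})

    precision = recall = 0.0
    for m in mentions:
        k, r = class_of(m, key), class_of(m, response)
        overlap = len(k & r)
        precision += overlap / len(r)
        recall += overlap / len(k)
    return precision / len(mentions), recall / len(mentions)
```

Unlike the link-based MUC score, a singleton NP contributes to both averages here.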
Luo (2005) agrees with the first problem identified by Baldwin and Morton, and in addition argues that the MUC F-score favors systems that produce fewer entities, and therefore is unable to distinguish between system outputs of different quality, and in fact might result in higher F-score for worse systems. Luo proposes the Constrained Entity-Aligned F-measure, CEAF, which is designed to address this problem.
Given that the evaluation of coreference resolution systems is an open issue, the 5th International Workshop on Semantic Evaluations (SemEval-2010) task
Coreference Resolution in Multiple Languages aims to evaluate coreference resolution systems by employing both B-cubed and CEAF in comparison with the MUC score. The purpose is to investigate the advantages and drawbacks of the different metrics (Recasens, Martí, Taulé, Màrquez and Sapena 2009).
There are also visualization tools that can be used both as an aid during annotation and to explore the output of a coreference or anaphora resolution system, see e.g., (Witte and Tang 2007) and (Johansson, Nøklestad and
The results of coreference resolution systems vary for the different subtasks
(identification of coreferent NEs, resolution of definite descriptions, and pronoun resolution). Typically, the NE identification task is the most successful, while resolution of definite descriptions is the most difficult. State-of-the-art results reported on the MUC and ACE data sets for fully automatic English coreference resolution are around 60% (MUC F-score), reaching up to 70%, depending on which evaluation metric and data set are used (see e.g.,
(Soon et al. 2001), (Ng and Cardie 2002b), and (Ng 2007)). For pronoun resolution in Norwegian, Nøklestad (2009) reports an accuracy of 72%; this system compares favorably to the knowledge-based system by Holen (2007).
The approaches described above are described as supervised since they require training data annotated with coreference relations. But creating or acquiring annotated data is expensive, and for some (or indeed, most) languages there is little or no coreference annotated data to be had; by using unsupervised (or semi-supervised) approaches to coreference resolution this problem can be avoided.
Unsupervised approaches to coreference resolution are based on the observation that because the coreference relation is symmetric, transitive, and reflexive, and each chain of coreferent NPs therefore defines an equivalence class, the coreference resolution problem can be viewed as a clustering problem (Cardie and Wagstaff 1999).
The algorithm by Cardie and Wagstaff (1999) is based on the intuition that all NPs referring to a specific entity within the discourse should be similar or related in some way, and that their conceptual distance should be small; given a description of each NP (i.e., a set of features) and a method for measuring the conceptual distance between two NPs, coreferent NPs are grouped together by a clustering algorithm.
The features used to describe each NP are the words contained in the NP, the head noun, the index number of the NP (within the document), information on the NP type (pronoun, definite or indefinite, proper name), whether or not the NP is in an appositive construction, number (singular or plural), gender
(masculine, feminine, either, or neuter), animacy, and the semantic class based on the WordNet sense of the head noun (Cardie and Wagstaff 1999).
The distance measure compares the feature vectors of two NPs and makes a local coreference decision based on constraints and preferences representing linguistic knowledge about coreference, while the clustering algorithm coordinates these local decisions across the whole discourse by using context-dependent constraints and preferences to partition the NPs into equivalence classes (Cardie and Wagstaff 1999).
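The interplay between a local distance measure and a global clustering step can be illustrated with a sketch in the spirit of the description above. Constraints are modeled as an infinite distance, preferences as finite penalties. The feature names, the weights, the clustering radius, and the order in which NPs are visited (from the end of the document towards the beginning) are all assumptions of this sketch and are not taken from the text.

```python
import math
from typing import Any, Dict, List

NP = Dict[str, Any]

def distance(a: NP, b: NP) -> float:
    """Conceptual distance: infinite if a constraint is violated, else summed preferences."""
    for feat in ("number", "gender", "animacy"):
        if a[feat] != b[feat]:
            return math.inf                       # constraint: values must be compatible
    d = 0.0
    if a["head"] != b["head"]:
        d += 1.0                                  # preference: same head noun (hypothetical weight)
    d += 0.1 * abs(a["index"] - b["index"])       # preference: nearby NPs (hypothetical weight)
    return d

def _compatible(nps: List[NP], labels: List[int], a: int, b: int) -> bool:
    """Two clusters can merge only if no pair of their NPs violates a constraint."""
    return all(distance(x, y) < math.inf
               for x, lx in zip(nps, labels) if lx == a
               for y, ly in zip(nps, labels) if ly == b)

def cluster(nps: List[NP], radius: float) -> List[int]:
    """Return a cluster label for each NP."""
    labels = list(range(len(nps)))
    for j in range(len(nps) - 1, 0, -1):
        for i in range(j - 1, -1, -1):
            if labels[i] == labels[j]:
                continue
            if distance(nps[i], nps[j]) < radius and _compatible(nps, labels, labels[i], labels[j]):
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels
```

The compatibility check is what makes the decision global: a pair that looks close locally is still kept apart if merging would place two incompatible NPs in the same equivalence class.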
Unsupervised approaches to coreference resolution for English typically use a limited set of features to describe each NP, including the head noun, the entity type (e.g., person, location, organization), gender, and number. In their generative, nonparametric Bayesian approach to unsupervised coreference resolution, Haghighi and Klein (2007) also incorporate salience as a dynamic model of attention states, while Ng (2008) and Poon and Domingos (2008) both hypothesize that the use of salience for resolution of non-pronominal NPs will harm performance as such NPs are less sensitive to salience.
In addition to the above-mentioned features, Poon and Domingos (2008) also add recognition of syntactic relations such as appositive and predicative NPs in their approach, which performs joint inference using Markov Logic networks.
While unsupervised approaches to coreference resolution are attractive as they do not require annotated training data, their performance is as yet not comparable to that of fully supervised approaches (Ng 2008).
In this section, we have discussed knowledge-based and data-driven methods for anaphora and coreference resolution. Some of the important issues are:
• The language: some constraints and preferences (or features) are language-insensitive, while others are language-specific. E.g., in systems for English, word order can be used for pronoun resolution, but this is a less predictable factor for languages with freer word order.
• The genre and domain: some types of text are more difficult to process than others, e.g., because the proportions of “easy” and “hard” cases vary between genres. Further, constraints and preferences (or features) for resolution can be genre- and/or domain-insensitive, or specifically tailored for a particular genre and/or domain.
• The knowledge sources used for processing NPs: from simple heuristics such as string similarity to complex features describing e.g., semantic relatedness between NPs.
Coreference in Swedish News Text
3. Coreference Annotation
The purpose of the coreference annotation project described here is to create data for development and evaluation of Swedish coreference resolution, but also to better understand the phenomenon from a linguistic point of view.
The importance of linguistic study can be motivated by the fact that until among others Fraurud (1992) and Vieira and Poesio (2000) presented corpus studies on definite NPs that showed that most definite NPs are not coreferent with previously mentioned NPs, it was often assumed that coreference was the most frequent phenomenon. This resulted in resolution algorithms where definite NPs were assumed to have an explicit antecedent. Similarly, early algorithms for pronoun resolution favored intra-sentential anaphora over inter-sentential anaphora; however, as reported by e.g., Fraurud (1992), Ariel
(1990), and McEnery, Tanaka and Botley (1997), distribution patterns for pronouns show that inter-sentential anaphora are more frequent.
In this chapter the work of annotating the data used in this thesis is described: a collection of Swedish financial news texts collected from the Inter-
A corpus consisting of 365 documents has been annotated with eight types of Named Entities (NEs). A subset of 66 documents (about 22,000 tokens) has been annotated with a selected set of relations between NPs, including coreference, building on the BREDT reference annotation scheme
Related work on anaphora and coreference annotation
There are a number of different corpus annotation schemes aimed at capturing different types of anaphoric and referential phenomena (including coreference), e.g., the Lancaster/IBM UCREL scheme that was used for annotating the Lancaster/IBM anaphoric treebank (McEnery et al. 1997). The UCREL scheme outlines how pronouns, noun phrases, and clauses can be co-indexed within the framework of cohesion as described in (Halliday and Hasan 1976).
An annotation scheme that tackles the anaphora problem as a whole is the MATE-GNOME scheme (Poesio 2004). This scheme is used in the Pie-in-the-sky initiative (Pustejovsky, Meyers, Palmer and Poesio 2005), with the purpose of forming a single unified representation of semantic annotation by merging PropBank, NomBank, TimeBank, Penn Discourse TreeBank, and using the MATE-GNOME scheme for anaphoric annotation.
Sources are listed in appendix A.
A similar project is SYN-RA (SYNtax-based Reference Annotation), where morphological, syntactic, semantic, and anaphoric information is combined.
The reference annotation is based on the MATE scheme, but only relations where high inter-annotator agreement has been reported in other projects are annotated (Hinrichs, Kübler and Naumann 2005).
While the objective of these projects is to capture a wide range of anaphoric relations, there are also annotation schemes that focus specifically on the coreference relation between NPs. In this section, three different annotation initiatives, MUC (Hirschman and Chinchor 1997), ACE (LDC 2008), and
BREDT (Borthen 2004b), are described.
The MUC-6 and MUC-7 annotation schemes for English
Among the most widely used data sets to date for machine learning experiments on coreference resolution are the Message Understanding Conference
(MUC-6 and MUC-7) coreference data sets. The purpose of the MUC initiatives, funded by NIST, was to create data for Information Extraction development and evaluation, and to that end, data was annotated with different categories of names for the MUC Named Entity Recognition (NER) task
(Chinchor 1997), and coreference relations between NPs in the MUC Coreference Resolution task (Hirschman and Chinchor 1997).
The MUC-6 and MUC-7 coreference annotated data, which is available from the LDC, has been used in many different experiments, e.g., (Soon et al. 2001), (Ng and Cardie 2002b), (Ng and
Cardie 2002a), (Yang et al. 2003), and (Hoste 2005). The MUC initiative also resulted in the widely used MUC-score for evaluation of equivalence classes
(Vilain et al. 1995).
The MUC coreference annotation scheme focuses on NPs that, as stated by Hirschman, Robinson, Burger and Vilain (1997), refer to the same entity.
This is called the IDENTITY, or IDENT, relation. The elements in the text available for annotation are called markables, and this includes nouns, NPs
(including NEs, dates, currency expressions, and percentages), and personal and demonstrative pronouns (Hirschman and Chinchor 1997).
The annotator is encouraged to first mark up the possible markables in the data, and then in a second step partition the markables into equivalence classes based on the IDENT relation (Hirschman et al. 1997, Hirschman and Chinchor 1997).
LDC, URL: http://projects.ldc.upenn.edu
The SGML mark-up consists of a COREF tag with four attributes: a unique identifier (ID) for each NP, a reference pointer (REF) that establishes coreference links to a previously mentioned NP, a relation type attribute (where the value of TYPE always is “IDENT” as the only annotated relation in the MUC data is IDENTITY), and an attribute MIN identifying the minimal (head) el-
(3.1) <s> <COREF ID=“0”>Ocean Drilling & Exploration Co.</COREF> will sell <COREF ID=“3” MIN=“business”><COREF ID=“2” TYPE=“IDENT” REF=“0”>its</COREF> contract-drilling business</COREF>, and took a $50.9 million loss from discontinued operations in <COREF ID=“12” MIN=“quarter”>the third quarter</COREF> because of the planned sale. </s>
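To show how the ID and REF attributes encode coreference chains, the following sketch reads the opening COREF tags of a marked-up string and groups the identifiers by following the REF pointers. It assumes straight ASCII quotes around attribute values (unlike the typeset example above) and that each REF points to an earlier ID; the function name is hypothetical.

```python
import re
from typing import Dict, List, Optional

TAG = re.compile(r"<COREF\s+([^>]*)>")
ATTR = re.compile(r'(\w+)="([^"]*)"')

def coref_chains(sgml: str) -> List[List[str]]:
    """Group COREF IDs into chains by following REF pointers back to the first mention."""
    refs: Dict[str, Optional[str]] = {}
    order: List[str] = []
    for tag in TAG.finditer(sgml):
        attrs = dict(ATTR.findall(tag.group(1)))
        order.append(attrs["ID"])
        refs[attrs["ID"]] = attrs.get("REF")

    def root(mention_id: str) -> str:
        # Follow REF pointers until a mention without a (known) antecedent is reached.
        while refs.get(mention_id) is not None:
            mention_id = refs[mention_id]
        return mention_id

    chains: Dict[str, List[str]] = {}
    for mention_id in order:
        chains.setdefault(root(mention_id), []).append(mention_id)
    # Markables that are never linked form no chain.
    return [chain for chain in chains.values() if len(chain) > 1]
```

Applied to the mark-up in (3.1), rewritten with ASCII quotes, this would return a single chain linking the markables with ID 0 and ID 2.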
In both the MUC-6 and the MUC-7 data sets, thirty documents in English annotated with coreference are used as training documents. The MUC-6 and MUC-7 training data sets contain 1644 and 1905 anaphoric NPs, respectively.
The test set for MUC-6 contains thirty documents and 1627 anaphoric noun phrases. Twenty texts, containing 1311 anaphoric NPs, serve as the MUC-7 test set. All documents in both MUC-6 and MUC-7 are news wire articles.
For the annotation process, two tools were available: SRA’s Discourse Tagging Tool and MITRE’s Alembic Workbench. For visualization, the SGML tags were converted into HTML, and each document was displayed in tabular form, where each column corresponds to a paragraph (ordered from left to right), and each row to a coreference chain (Hirschman et al. 1997).
The MUC data is widely used even though there is disagreement as to how well the MUC coreference task definition captures this linguistic phenomenon. The criticism against the MUC coreference task definition and annotation scheme is based on the reported inter-annotator agreement, as well as linguistic concerns and the overall performance of systems trained on the MUC data (see e.g., (Hirschman et al. 1997), (Van Deemter and Kibble 1999), (Van Deemter and Kibble 2000), (Kibble and Van Deemter 2000), (Mitkov 2002), and (Borthen 2004b)). The goal of the MUC-7 annotation project was to achieve good inter-annotator agreement, defined as 95% (Hirschman and Chinchor 1997), but in review Hirschman et al. (1997) found that the inter-annotator agreement for the MUC-6 and MUC-7 data sets was 83% and 84%, respectively.
Van Deemter and Kibble (2000) claim that the coreference relation as defined in the MUC coreference task definition is too extended: they argue against the inclusion of non-referring NPs (e.g., quantified NPs and predicative NPs) and bound anaphors, and intensionality (e.g., change over time).
They also identify what they term “the markable problem”, that is, the problem of defining which entities are to be available for coreference annotation, as MUC annotators are encouraged to mark up all the candidate markables first, and the relations second (Hirschman et al. 1997). This means that e.g., coordinated NPs are to be marked both individually and as a unit. Van Deemter and Kibble suggest redefining what a referring NP is, and narrowing down the coreference task to cover only the identity relation proper, even though this means less input to e.g., an IE system.
Several annotation schemes for languages other than English are based on the MUC coreference annotation scheme; e.g., (Hartrumpf 2001) and (Hoste 2005) base their annotation work (for German and Dutch, respectively) on the MUC guidelines, with some modifications in accordance with (Van Deemter and Kibble 2000).
The ACE annotation scheme for English, Chinese, and Arabic
The Automatic Content Extraction (ACE) program began in 1999, and it is related to the MUC project both in terms of motivations behind the project and the issues addressed. The objective of the ACE program is to develop automatic content extraction technology of entities, relations, and events to support automatic processing of human language in text form (Doddington et al. 2004). The ACE data is used in a number of experiments on coreference resolution, see e.g., (Luo, Ittycheriah, Jing, Kambhatla and Roukos 2004),
(Chen and Hacioglu 2006), (Ng 2007).
In the ACE initiative, the overall task is to extract entities (cf. discourse referents), the relations between entities, and events these entities are involved in.
The surface realisations of the entities are called mentions (cf. markables in the MUC initiative) (Doddington et al. 2004). The entity detection and tracking (EDT) task can be described as a combination of the MUC NER task and the MUC coreference resolution task, where all mentions of an entity (candidate mentions are names, descriptions, and pronouns) are to be found and partitioned into equivalence classes based on reference to the same entity.
The ACE initiative has so far resulted in data for entity detection and tracking, relation detection and characterization, and event extraction in English,
Chinese, and Arabic. Within the project, data-specific evaluation metrics have also been developed (Doddington et al. 2004). The current focus is on cross-document and cross-language (English-Arabic) entity detection and tracking
(Strassel, Przybocki, Peterson, Song and Maeda 2008). The annotation guidelines, corpora and other linguistic resources in support of the ACE program are
In the ACE EDT task the annotation format is based on XML, and the annotation target objects are a selected set of seven types of entities with the attributes type (person, organization, geo-political entity, etc.), sub-type
(e.g., individual persons, or groups of persons), and entity class, which can be either referential (with the sub-types: negatively quantified, specific referential, generic referential, under-specified referential) or attributive (with subclasses appositive or predicative, etc.). All name mentions, nominal mentions, and pronominal mentions of each entity are to be annotated (Doddington et al. 2004, LDC 2008).
LDC, URL: http://projects.ldc.upenn.edu/ace
For the annotation process, the ACE tool is used. This tool is based on the
and relies on color-coded underlining as the ACE annotations can be embedded or overlapping. The tool guides the annotation process, by requiring the user to decide on one task before moving on to the next.
Within the ACE initiative there is much emphasis on annotation quality, and there are several control mechanisms in order to ensure high quality annotations. Doddington et al. (2004) report an inter-annotator agreement for the
EDT task on the ACE 2003 data set of 88% for English, 87% for Chinese, and
74% for Arabic.
The BREDT annotation scheme for Norwegian
The objective of the BREDT project is to create linguistically plausible data for statistical and machine learning experiments. To this end, Borthen (2004a) proposes an annotation scheme that distinguishes between coreference, defined as identity of reference, and other types of reference and predicative
NPs, in order to improve on the results of machine learning. Borthen (2004b) argues along the lines of (Van Deemter and Kibble 2000), that initiatives such as the MUC-6 and MUC-7 coreference annotation guidelines that aim for efficiency by avoiding loss of information (by e.g., marking all predicative NPs as coreferential with their subjects in positive sentences), lead to a representation of the reference phenomenon that is not linguistically plausible and that this will not generate an optimal result if used as training data for machine learning.
The BREDT annotation scheme differentiates between coreference, which is defined as identity of reference between discourse referents, and other related types of reference, i.e., metonymy, intensional reference (to handle hypotheticals), reciprocal reference, bound variables, identity of sense, and reference to sets
(subset, superset, and excluding reference). The scheme also covers less common types of reference, e.g., inalienable possessions (restricted to body parts) and family relations. The inclusion of these two relations may have to do with the genre of the BREDT corpus these guidelines were developed for: fiction. In order to produce negative training examples for machine learning, expletives (pleonastic ‘it’) and predicative NPs are also marked
The elements in the text that can be marked are noun phrases (proper names, lexical NPs, and pronouns), and certain determiners (possessives and genitives). Empty traces, ellipses, or relative pronouns are not marked (Borthen 2004a).
AGTK, URL: http://sourceforge.net/projects/agtk/
The format described in the BREDT annotation guidelines differs from those of the MUC and ACE initiatives (SGML and XML): here, the tokens in the corpus are verticalized and indexed, and the annotation is in the form of a list. The phrase (or stretch of discourse) the anaphor is pointing to is defined in the list as the index number of the antecedent combined with the code for
numbers 3 and 4, which is annotated as [ant=“3-4r”] where r signifies coreference:
(3.2) Annotation example cited from the BREDT annotation guidelines
(Borthen 2004b:p. 15): Nested anaphoric NPs.
1. Jeg (’I’)
2. har (’have’)
3. en (’a’)
4. hund. (’dog’)
5. Jeg (’I’)
6. elsker (’love’)
7. hunden (’dog-def’) [ant=“3-4r”]
8. min (’mine’). [ant=“5r”]
The list format allows an anaphor to refer to more than one antecedent (i.e., the list can take more than one antecedent referred to via different types of relation):
(3.3) Annotation example cited from the BREDT annotation guidelines
(Borthen 2004b:p.14): Disjoint antecedents, and multiple relations.
5. Hun (‘she’) [ant=“1r”]
7. Han (‘he’) [ant=“3r”]
15. De (‘they’) [ant=“5,7r”]
25. Hun (‘she’) [ant=“5r;15sub”]
The guidelines caution against adding more than two antecedents for one anaphor; the latest activation act of each reference type is sufficient. In ex-
, and the first occurrence (index 1) Kari as the latest activation (index 5) is sufficient for resolution. The antecedents are ordered by how informative they
are (with coreference as the primary relation), and not by recency (see index
The annotation format also allows for a different solution to handling of
line 7) as well as the whole coordinated NP (Kari and Ola on lines 5–7) are linked to an antecedent, or in the case of the coordinated NP, antecedents. The link for the whole coordinated NP is marked on the conjunction (og, line 6).
(3.4) Annotation example cited from the BREDT annotation guidelines
(Borthen 2004b:p.16): Coordinated NPs.
5. Kari [ant=“1r”]
6. og (’and’) [ant=“1,3r”]
7. Ola [ant=“3r”]
8. sov (’slept’).
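The ant values in examples (3.2) to (3.4) follow a regular pattern that can be read mechanically. The Python sketch below is our own illustration and not part of the BREDT tools: the function name is hypothetical, and the grammar (links separated by semicolons, antecedents separated by commas, token ranges written with a hyphen, followed by a lower-case relation code) is inferred from the examples cited above only. Relation codes are treated as opaque strings.

```python
import re

# One link: a comma-separated list of token indices or index ranges,
# followed by a lower-case relation code (e.g. "r" or "sub").
LINK = re.compile(r'^([0-9,\-]+)([a-z]+)$')


def parse_ant(value):
    """Parse an ant value into a list of (spans, relation) links.

    Each span is a (first_token, last_token) pair of index numbers.
    """
    links = []
    for part in value.split(";"):          # several links: "5r;15sub"
        m = LINK.match(part.strip())
        if m is None:
            raise ValueError(f"unrecognised link: {part!r}")
        spans = []
        for target in m.group(1).split(","):   # several antecedents: "5,7"
            start, _, end = target.partition("-")  # token range: "3-4"
            spans.append((int(start), int(end or start)))
        links.append((spans, m.group(2)))
    return links


print(parse_ant("3-4r"))      # [([(3, 4)], 'r')]
print(parse_ant("5,7r"))      # [([(5, 5), (7, 7)], 'r')]
print(parse_ant("5r;15sub"))  # [([(5, 5)], 'r'), ([(15, 15)], 'sub')]
```

Keeping the spans and the relation code apart makes it easy to count, for instance, how many anaphors carry more than one link.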
Within the BREDT project, a corpus of Norwegian fiction, henceforth the
BREDT corpus, consisting of 41,969 tokens has been annotated (in XML for-
and a graphical interface for displaying and editing anaphoric relations in text has been developed
The BREDT corpus has been used in a number of experiments on data-driven coreference resolution in Norwegian; see e.g., (Nøklestad and
Johansson 2006), (Johansson and Nøklestad 2008), (Nøklestad 2009).
The MUC, ACE, and BREDT annotation initiatives are all interesting in terms of how the objectives are formulated, how the annotation task is defined, and how the annotation process is structured. Comparisons between these three initiatives formed the basis for the annotation process described in the following section.
In the MUC annotation guidelines, the annotators are instructed to base the annotation on their knowledge of the world, and to disregard e.g., theories of how NPs are resolved, even though this means that some relations will be impossible to identify with current NLP technology (Hirschman and Chinchor 1997). In the ACE project, however, the annotators are instructed that all annotated relations should be based on textual or contextual evidence found within the scope of the document, and not on annotators’ knowledge of the world (Doddington et al. 2004).
Thanks to Christer Johansson and Anders Nøklestad for making the BREDT corpus available.
BREDT, URL: http://bredt.uib.no/demonstrators.html
The BREDT guidelines address “the markable problem” as defined by Van Deemter and Kibble (2000) in that the annotator does not have to handle coordinated NPs unless they refer to a previously mentioned item. However, the annotation process on verticalised data without phrase chunks or markables seems likely to be error-prone. Despite the arguments in (Van Deemter and Kibble 2000) against a separation of markable and coreference annotation, inter-annotator agreement did improve in the MUC-7 annotation project when the two-step model of markable annotation followed by relation annotation was adopted (Hirschman et al. 1997).
The possibility for multiple antecedents and different relation types in the
BREDT guidelines is also an interesting solution, as it allows for the annotation of complex cohesion patterns. But at the same time, this feature of
BREDT makes the annotation process even more complicated, especially if the annotation is performed by a single annotator.
The fine-grained set of relations in the BREDT guidelines allows the annotator to mark e.g., predicative NPs and set relations. The inclusion of relation types other than coreference facilitates investigations into e.g., the distribution patterns of these relations in text. Among others, Van Deemter and Kibble (2000) argue that the inclusion of predicative NPs and bound variables is harmful to the coreference resolution task; by isolating such relations we can investigate how frequent these phenomena are. A more fine-grained set of relation types may also lead to better annotation quality, as suggested by
Coreference Annotation of Swedish News Text
This section describes our work of annotating a corpus of Swedish financial news texts that will function as training and evaluation data in experiments on
The goal of the coreference annotation process is to mark a select set of relations between NPs, with coreference as the primary relation, in as consistent a manner as possible. The BREDT guidelines for Norwegian fiction (Borthen 2004a) are adapted to Swedish news text with the exception that only one relation type per anaphor is allowed.
The manual annotation process consists of three steps: annotation of Named
Entities, annotation of markables, and annotation of relations between markables.
First, Named Entities (NEs) are identified and classified. The result is a collection of texts marked with SGML-tags specifying the start and end of the
See appendix A for a list of the sources.
The result is a corpus consisting of 365 texts, henceforth referred to as SEK-365.
Second, all markables, defined as NEs and NPs, are marked in a subset of SEK-365 consisting of 66 documents (henceforth SEK-66). This is done semi-automatically: first an NP chunker based on finite-state techniques automatically identifies simple NPs, followed by manual annotation of complex
NPs. The Penn Treebank bracketing guidelines for Treebank II style (Bies, Ferguson, Katz and MacIntyre 1995) were used as a reference during this process.
The third stage involves the manual annotation of relations between markables, based on the BREDT annotation guidelines. The annotation of mark-
Annotation of Named Entities
Correct identification and classification of NEs are important for coreference resolution, because e.g., company names, brand names and product names are often related and thus similar as to surface structure. NE classification can help a system to correctly resolve a link between the markables mobiltillverkaren
Nokia (‘the mobile phone manufacturer Nokia’) and konkurrenten Nokia (‘the competitor Nokia’). NE classification can also rule out any linking between mobiltillverkaren Nokia and en ny Nokia (‘a new Nokia’), referring to a new mobile phone model, manufactured by the company Nokia. A correct multiclass classification can also help in recognizing e.g., a redescription in the form of the definite description företaget (‘the company’), which is coreferent
(3.5) När Nokia på fredagen presenterade en ny 3G-telefon ... [...] Företaget lanserar ett imponerande antal nya mobiler.
(‘When Nokia on Friday presented a new 3G mobile phone... [...] The
To our knowledge, there exist two Swedish corpora annotated with NEs: the
Stockholm-Umeå Corpus (SUC) of about a million words, manually annotated with 9 classes of NEs (Wennstedt 1995), and the KTH news corpus which consists of about 108,000 Swedish news articles. In this corpus, 100 documents (about 18,000 tokens) have been manually annotated with four types of NEs: person names, locations, organizations, and time and date expressions (Hassel 2001, Dalianis and Åström 2001). Neither of these corpora is suitable for our task: SUC because it contains many different genres of text, whereas coreference resolution is to some extent a domain-dependent task, and also because SUC consists of samples rather than full documents. The set of name types in the KTH news corpus is limited, and the corpus is not available due to copyright issues.
SEK = Svensk Ekonomikorpus.
All translations are approximate.
The annotation process
In this project, a text collection of 365 texts (SEK-365) has been annotated with NEs, that is, words and phrases in the texts that function as proper names in a wide sense have been identified and categorized as different types of NEs.
The annotation of NEs was performed in two stages; firstly, following the MUC-7 Named Entity Task Definition (Chinchor 1997), names denoting persons, organizations, and locations were annotated automatically using the Learn-Filter-Apply-Forget method, as described in (Volk and
Clematide 2001), adapted for Swedish.
After manual correction and inspection of the result of the automatic annotation, names of a domain-specific NE category, namely financial indices, were marked with a specific tag. Finally, products, services, and trademarks,
NE categories that are frequent in this domain, were annotated based on the findings of an exploratory study on the corpus (Nilsson and Malmgren 2006).
A tag for miscellaneous names was used when no other category was applicable.
During the annotation process, all judgments are based on the annotators’ interpretation of the textual and contextual clues, and world knowledge. We decided to annotate from a human reader’s point of view, similarly to the
MUC-7 guidelines (Hirschman and Chinchor 1997), rather than to try to separate world knowledge from textual and contextual clues as suggested in e.g., the ACE project (LDC 2008).
Below follows a brief description of the NE types in this corpus.
The organization name category includes names denoting a wide range of organizations with the common denominator that they are organizations of people with a common goal, e.g., financial gain (companies), academic research and education (universities and institutes), handling of public sector funds
(governmental organizations) etc., and an established organizational structure.
(3.6) Försäkringsbolaget <ORG>Skandia</ORG> rapporterar ...
(‘The insurance company <ORG>Skandia</ORG> reports ...’)
pre-modifiers are not included within the NE tags.
Organization names found in compounds have been marked as NEs of the type ORG when the compound itself is used to name some kind of established organization, e.g., Wallenbergsfären (‘the Wallenberg group’) and GM-gruppen (‘the GM group’).
Thanks to Aisha Malmgren for the annotation of brand names.
Organization names that include other types of names, e.g., the location name Stockholm in Stockholms universitet (‘Stockholm University’) are marked only as organizations, and the embedded location name is not
Person names denote human beings, and can consist of either a first name, e.g., Pehr, a last name Gyllenhammar, or a full name Pehr G Gyllenhammar.
Tight appositives such as job titles are often attached to person names, as in
(3.7) ... förre Volvochefen <PERS>Pehr G Gyllenhammar</PERS> ...
(‘... former Volvo executive <PERS>Pehr G Gyllenhammar</PERS>
Like organization names, person names can occur in different kinds of compounds (e.g., Wallenbergsfären, above) where they are not annotated as person
as NEs at all:
(3.8) Det Wallenbergdominerade riskkapitalbolaget EQT ...
(‘The Wallenberg-dominated venture capital company EQT ...’)
Location names denote geographical entities: e.g., countries (Sverige ‘Sweden’), regions (Europa, ‘Europe’), cities and towns (Stockholm), and streets
(Strandvägen). Some names belonging to the location category, e.g., names of
(3.9) (a) Huvudkontoret kommer att ligga i Sverige.
(‘The headquarters will be located in Sweden.’)
(b) Sverige har tryckt på för att de fattiga ländernas makt ska öka.
(‘Sweden has argued for an increase in power for the poor countries.’).
In our annotation, this dual function is not distinguished; all occurrences of names of geographical areas or geopolitical entities are annotated as location names even though, for coreference resolution purposes, a distinction between location names that can refer to geopolitical entities (e.g., countries, regions, and cities) and names that cannot (e.g., mountains and rivers) may be beneficial.
In the ACE initiative, according to the multi-class and multi-sub-class ACE Entity Detection and Tracking Task Definition (LDC 2008), such embedded NEs are to be marked.
Trademarks are symbols or words used to identify a company or a product. A
similar to a company name (e.g., the (abbreviated) company name Skandia
tinction between trademarks and company names is generally clear from the context; e.g., trademarks are frequently preceded by the appositive varumärket
(Nilsson and Malmgren 2006).
(3.10) Under de närmaste åren kommer <ORG>Skandias</ORG> försäljning att vara betydligt sämre än den har varit historiskt. Och därför är även varumärket <TM>Skandia</TM> mindre värt idag än vad det har varit tidigare.
(‘During the next couple of years, <ORG>Skandia</ORG> sales will be considerably lower than historically. And because of that, the trademark <TM>Skandia</TM> is also valued lower today than previously.’)
Names denoting artefacts produced by a company are annotated as product names. Similarly to trademarks, product names are often similar to company names that occur in the same context, e.g.:
(3.11) <ORG>Mazda</ORG> har till exempel ökat ned 33,5 procent i år tack vare framgången med <PROD>Mazda 6</PROD>.
(‘<ORG>Mazda</ORG> has had an increase of 33.5 percent this year due to the success of <PROD>Mazda 6</PROD>.’)
Compound noun phrases that denote a related group of products, and consist of a product name and a noun, are marked as product names, e.g., Sagan om Ringen-filmerna (lit. ‘The Lord of the Ring-movies’).
This category includes names of e.g., financial services, or newspapers, television stations, and news wire services when these names refer to the service of providing information. Akin to trademarks and products, there may be an organization (i.e., a company) with the same or a similar name; in such cases,
from the same document:
Within e.g., the ACE EDT annotation task there is a distinction between location names that denote geographical entities, and entities that can also function as geopolitical entities (LDC
(3.12)(a) Hon är en ivrig nyhetskonsument som börjar med DN, Svenska
Dagbladet och Dagens Industri på morgonen, Ekot och
<SERV>TT</SERV> hela dagen, Rapport och Aktuellt på kvällen.
(‘She is an eager consumer of news, who begins with [the newspapers] DN, Svenska Dagbladet, and Dagens Industri in the morning, [the radio news broadcast] Ekot and [the news wire service] <SERV>TT</SERV> throughout the day, and [the TV news broadcasts] Rapport and Aktuellt in the evening.’)
(b) Som chef för <ORG>TT</ORG> var det naturligt att inte rösta, säger hon.
(‘As an executive of <ORG>TT</ORG>, not voting was natural, she says.’)
Because of the domain, names of stock market indices such as Dow Jones industriindex (‘Dow Jones Index’), OMX-index (‘OMX Index’), and CAC 40-index (‘CAC 40 Index’) are annotated as a separate name category.
Any name occurring in the corpus which cannot be categorized as one of the above-mentioned types is annotated as miscellaneous.
Out of the 365 texts annotated with NEs as described above, 66 documents have been annotated with coreference and other relations between markables, based on the BREDT annotation guidelines. In this section, the annotation of markables (defined as NPs and NEs) and the annotation of relations between markables are described.
The annotation process
Motivated by a study on annotation quality by Hirschman et al. (1997), who show that coreference annotation quality is improved by adopting the MUC two-step annotation process where the first step consists of marking up all NPs
(or markables), and the second of defining the relations between markables, we decided to follow this two-step process.
Thus, the first stage of analysis consists of marking up all NPs in the texts, and the second is to link all referring NPs with explicit antecedents following the annotation guidelines described in (Borthen 2004a). The linking is performed by adding each anaphoric or cataphoric NP to the most recent antecedent. Because the annotation was performed by a single annotator (the author), we do not claim to present a general analysis, but rather one that represents one person’s interpretation of the text collection.
We use an SGML-type annotation format, akin to that used in the MUC project. This allows for a simple visualization technique, similar to that described in (Hirschman et al. 1997) where the SGML tags are converted to
HTML and the NPs are color-coded and displayed in tabular form.
The basis of judgment for the annotations also follows the MUC guidelines; the annotation of relations between NPs is based on the annotator’s interpretation of the textual and contextual clues, and world knowledge.
Annotation of markables
Following the BREDT guidelines, NPs consisting of a proper name, or with either a lexical noun, a nominalized adjective, or a pronoun as head are considered as candidate markables. Some determiners, namely possessives and genitives, are also eligible for annotation, while empty traces, ellipsis, and the relative pronoun som (‘who’, ‘which’) are not included.
The annotation of markables was performed semi-automatically.
First, TreeTagger (Schmid 1994), a part-of-speech tagger trained on the
was used to annotate the documents with part-of-speech tags. Second, an NP chunker based on finite-state techniques identified base NPs. Finally, nested NPs were manually annotated.
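As an illustration of what finite-state chunking over part-of-speech tags can look like, the following Python sketch recognizes base NPs with the pattern "optional determiner, any number of adjectives, one or more nouns or proper names", plus single pronouns. This is not the chunker used in this work: the pattern, the function name, and the tag names (DT, JJ, NN, PM, PN, VB, PP) are assumptions made for the sake of the example.

```python
def chunk_base_nps(tagged):
    """Return (start, end) token spans (end exclusive) of base NPs.

    `tagged` is a list of (word, tag) pairs. The tag names are assumed.
    """
    spans = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":    # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] in ("NN", "PM"):  # nouns / proper names
            k += 1
        if k > j:                                # at least one head was found
            spans.append((i, k))
            i = k
        elif tagged[i][1] == "PN":               # a pronoun is an NP on its own
            spans.append((i, i + 1))
            i += 1
        else:
            i += 1
    return spans


# Tokens from example (3.13); the tags for the verb and prepositions are assumed.
sentence = [("Allvarlig", "JJ"), ("kritik", "NN"), ("riktas", "VB"),
            ("mot", "PP"), ("ledningen", "NN"), ("för", "PP"),
            ("Skandia", "PM"), ("Liv", "PM")]
print(chunk_base_nps(sentence))  # [(0, 2), (4, 5), (6, 8)]
```

A chunker of this kind only finds flat NPs. The complex NP covering ledningen för Skandia Liv as a whole is exactly the kind of nested structure that is left for the manual annotation step.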
During the manual annotation of complex, nested NPs, the Penn Treebank bracketing guidelines (Bies et al. 1995) were used as a reference. The decision whether to include or exclude e.g., a postposed prepositional phrase was based on whether the phrase could be interpreted as a restrictive modifier or not.
Coordinated NPs are not marked as a unit. Coordinated NPs referred to by e.g., a plural pronoun are handled during the reference relation annotation; this is done by marking the pronoun with a list consisting of the index numbers of the NPs coordinated by the conjunction (see below).
As a result, each identified markable was marked with an SGML-type tag REF with the attributes identification (id), type of relation (type), and minimal string (min), defined as the head of the phrase if nominal or pronominal, or, if an NE, the entire name. After chunking, a Perl script was used to add a unique index number to each markable within a document, and to identify the minimal string for each markable, which was added as a value to the min attribute.
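The kind of post-processing described here can be sketched in a few lines of Python. The sketch below is not the Perl script used in this work; the function names are hypothetical, and the head rule (take the last token of a markable without nested markables) is a deliberately naive assumption that would not, for instance, keep a multi-word name together.

```python
import itertools
import re


def number_markables(text):
    """Give every opening <REF> tag a document-unique id attribute."""
    counter = itertools.count(1)
    return re.sub(r"<REF\b", lambda m: f'<REF id="{next(counter)}"', text)


def add_min_flat(text):
    """Add a min attribute to markables that contain no nested markable.

    The last token is taken as the head, which is a naive assumption.
    """
    def repl(m):
        attrs, body = m.group(1), m.group(2)
        tokens = body.split()
        if not tokens:                 # empty markable: leave untouched
            return m.group(0)
        return f'<REF{attrs} min="{tokens[-1]}">{body}</REF>'

    # [^<]+ ensures that the markable body contains no other tag.
    return re.sub(r"<REF([^>]*)>([^<]+)</REF>", repl, text)


chunked = "<REF>Allvarlig kritik</REF> riktas mot <REF>ledningen</REF>"
numbered = number_markables(chunked)
print(add_min_flat(numbered))
# <REF id="1" min="kritik">Allvarlig kritik</REF> riktas mot <REF id="2" min="ledningen">ledningen</REF>
```

Numbering the markables in document order means that the index numbers can later be used directly as values of the ref attribute.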
Relations between markables
For each markable, the possible values of the attribute type are “anc” (anchor, used for both initial-mentions and single-mentions), or a selected set of the relations defined in the BREDT annotation guidelines (Borthen 2004b):
• “coref” (coreference)
• “sub” (subset)
• “super” (superset)
• “excl” (excluding reference)
• “bound” (bound variables)
• “sense” (identity-of-sense)
• “reci” (reciprocal reference)
• “each” (distributive reference)
• “met” (metonymy).
Thanks to David Hagstrand for making this tagger available.
These relations require that an additional attribute, ref (short for reference) is added to the SGML tag. This attribute takes as value the index number of the
tagsledningen with index 7 is coreferentially linked to the markable ledningen with index 2.
(3.13) Example of NE and coreference annotation:
<REF type="anc" id="1" min="kritik">
Allvarlig JJ kritik
</REF> riktas mot
<REF type="anc" id="2" min="ledningen"> ledningen for
<REF type="anc" id="3" min="Skandia Liv">
<REF type="coref" id="7" ref="2" min="foretagsledningen">
Predicative NPs are linked to the correlate with the type attribute value “pred”, and the index number of the correlate as the ref attribute value.
In addition to the relation types mentioned above, occurrences of expletive det
(‘it’) are marked by the type value “expl”.
Finally, pronouns and demonstratives without explicit NP antecedents, i.e., anaphors that refer to entities outside the discourse or to implicitly introduced entities or propositions, are annotated with a special tag, NOTE, describing e.g., if the markable refers to an event introduced by a VP in the previous sentence, or clause.
In order to work around the problem of deciding e.g., whether coordinated
NPs should be marked as a unit, coordinated NPs that function as antecedents for plural definite descriptions or plural pronouns are not marked as a unit;
instead, the value of the ref attribute of the anaphor is a list, which allows for one or more links to antecedent NPs. The same approach is used to handle split antecedents of plural definite descriptions and plural pronouns.
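The constraints on the type and ref attributes described in this section can be summarized as a small consistency check. The Python sketch below is illustrative only: the class and function names are hypothetical, and the comma as separator in a list-valued ref is an assumption, since the separator is not specified above. A single type field per markable reflects the restriction to one relation type per anaphor.

```python
from dataclasses import dataclass, field

# Relation types that must point at one or more antecedents via ref.
RELATION_TYPES = {"coref", "sub", "super", "excl", "bound",
                  "sense", "reci", "each", "met", "pred"}
# Types that stand alone: anchors and expletive det.
STANDALONE_TYPES = {"anc", "expl"}


@dataclass
class Markable:
    id: int
    type: str
    min: str
    ref: list = field(default_factory=list)


def parse_ref(value):
    """Read a ref value such as "2" or "5,7" as a list of index numbers.

    The comma separator is an assumption of this sketch.
    """
    return [int(part) for part in value.split(",") if part.strip()]


def check(markable, known_ids):
    """Return a list of problems found for one markable (empty if none)."""
    problems = []
    if markable.type in RELATION_TYPES:
        if not markable.ref:
            problems.append("relation type without ref")
        problems += [f"unknown antecedent {r}" for r in markable.ref
                     if r not in known_ids]
    elif markable.type in STANDALONE_TYPES:
        if markable.ref:
            problems.append("standalone type with ref")
    else:
        problems.append(f"unknown type {markable.type!r}")
    return problems


ids = {2, 5, 7, 15}
plural = Markable(id=15, type="coref", min="de", ref=parse_ref("5,7"))
dangling = Markable(id=16, type="coref", min="hon")
print(check(plural, ids))    # []
print(check(dangling, ids))  # ['relation type without ref']
```

With a single annotator, a check of this kind is a cheap way of catching dangling links and mistyped relation codes before the data is used for training.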
Because verticalized text with POS tags and SGML tags for NEs and markables is difficult to read, a simple visualization tool was developed to facilitate the annotation process and improve the annotation quality. The tool consists of a Perl script that converts the documents into HTML so that the annotation can be reviewed in a browser. Each document is presented in table format, where each cell contains a sentence, and each row a paragraph. The script highlights each markable and indents nested structures, and adds color codes to each coreference chain. A similar visualization tool was used in the MUC annotation project (Hirschman et al. 1997).
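The table layout can be illustrated with a short Python sketch. It is not the Perl script mentioned above: the input structure (paragraphs as lists of sentences, sentences as lists of text segments paired with a chain number or None), the function name, and the color palette are all assumptions, and the indentation of nested markables is left out.

```python
import html

# Background colors, reused cyclically when there are more chains than colors.
PALETTE = ["#ffd54f", "#81d4fa", "#a5d6a7", "#f48fb1"]


def render(paragraphs):
    """Render paragraphs as table rows and sentences as cells.

    Each sentence is a list of (text, chain) segments, where chain is
    None for text that does not belong to a coreference chain.
    """
    colors = {}
    rows = []
    for paragraph in paragraphs:
        cells = []
        for sentence in paragraph:
            parts = []
            for text, chain in sentence:
                safe = html.escape(text)
                if chain is None:
                    parts.append(safe)
                else:
                    # The first time a chain is seen it gets the next color.
                    color = colors.setdefault(
                        chain, PALETTE[len(colors) % len(PALETTE)])
                    parts.append(
                        f'<span style="background:{color}">{safe}</span>')
            cells.append("<td>" + " ".join(parts) + "</td>")
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"


# Three mentions of one chain, spread over two paragraphs.
doc = [
    [[("Finansinspektionen", 1), ("har i en rapport ...", None)],
     [("FI", 1), ("har tvingat ägaren ...", None)]],
    [[("Myndigheten", 1), ("avser nu ...", None)]],
]
page = render(doc)
```

Because every mention of a chain receives the same background color, a link to the wrong antecedent shows up as a color that does not fit its surroundings.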
Relation types used in annotation
There are nine relation types used in the annotation of this data (described below). Some of the relation types described in the BREDT annotation guidelines do not apply to this genre and domain; e.g., references to family members and reference to body parts (termed “inalienable possessions” in (Borthen
2004b)) are infrequent in this data.
The coreference relation is defined as identity of reference between two markables, that is, they both refer to the same referent in the real (or conceptual) world. During the annotation, the objective has been to ensure that the annotated coreference relations are transitive and symmetrical, i.e., that the anaphor can be substituted for the antecedent without changing the meaning of the utterance.
The coreference relation can be anaphoric or cataphoric. Coreference can occur between different types of NPs: between full form NEs such as Finansinspektionen and acronyms such as FI, and definite descriptions such as myndigheten
(3.14) Finansinspektionen, som är den myndighet som granskar försäkringsbolagen, har i en rapport nämnt flera internaffärer som inte har gynnat försäkringsspararna. I en del fall har
FI tvingat ägaren att betala tillbaka pengarna.
Myndigheten avser nu att granska samtliga försäkringsbolag.
(‘Finansinspektionen, the government agency that supervises the insurance companies, has in a report mentioned several internal affairs that have not been beneficial for the insurance holders. In some cases,
FI has forced the owner to return the money. The agency will now inspect all insurance companies.’)
In the BREDT annotation guidelines, occurrences of coreference between hypothetical or intensional referents are not included in the coreference class; according to the guidelines, the links between Affären (‘The deal’), affären,
pose is to allow for such cases to be singled out (Borthen 2004b).
(3.15) Det franska flygbolaget Air France kommer att lägga ett bud på konkurrenten KLM. Budet är på drygt 7 miljarder kronor. KLM bekräftade senare att en överenskommelse om ett samgående ingåtts.
(‘The French airline Air France will make a bid on the competitor KLM. The bid is of approximately 7 billion SEK. KLM later confirmed that an agreement on a merger has been reached.’)
Contrary to the BREDT guidelines, we do not distinguish between real and hypothetical or intensional reference, in part because there are few cases of intensional reference in this corpus, and in part because we argue that the reference relation between the NPs is the same, whether the discourse referent is real, hypothetical, or intensional. That is, it is not the relation between the markables ett bud på konkurrenten KLM and budet, but the context (e.g., Air France kommer att, ‘Air France will’) that tells us that the referent introduced by the antecedent ett bud på KLM (‘a bid for KLM’) is intensional.
A subset relation (or partitive reference) is introduced when some expression refers to a subset of a set of entities previously introduced in the discourse.
The subset relation holds between (sets of) discourse referents in a specific discourse, and it is not lexically encoded, i.e., it is not a hyponymic relation.
(3.16) EU-kommissionen har ställt en rad kompletterande frågor om den nordiska elmarknaden. Inte förrän i måndags, när den sista frågan besvarades, kunde ansökan betraktas som komplett.
(‘The European Commission has posed a number of follow-up questions about the Nordic market for electricity. The application was not considered complete until Monday, when the last question was answered.’)
Superset is an anaphoric relation that holds between a plural NP and some previously or subsequently introduced entity or entities that belong to a subset of what the plural NP refers to.
Superset relations are common in direct speech, where e.g., a first person plural pronoun makes a cataphoric reference to the organization the speaker is
In order to handle cases of pronouns in quoted speech consistently, all first person pronouns in quoted speech are annotated as relational to the speaker.
(3.17) – Våra kundundersökningar visar att videokonferenser inte är intressanta i dagsläget, säger
(‘Our customer surveys show that video conferences are not in demand at present, says Sara Kullgren.’)
Excluding reference is an anaphoric relation where the entity the antecedent refers to is excluded from the set of entities to which the anaphor refers.
(3.18) Frankrike, till skillnad från resten av euroländerna, har inte infört...
(‘France, contrary to the rest of the Euro countries, has not adopted
According to the MUC definition of coreference, bound anaphors referring to quantified NPs, e.g., the reflexive pronoun deras (‘their’) that is dependent on the quantified NP många anställda (‘many employees’) in exam-
that no equivalence relation holds between the referents of the two entities
(Hirschman and Chinchor 1997). This over-extension of the coreference relation has been criticized by e.g., Mitkov (2002) and Kibble and Van Deemter
Following the BREDT annotation guidelines, bound anaphors with quantified antecedents are not included in the coreference class, but annotated as bound variables:
(3.19) Många anställda menar att just på deras avtalsområde finns det särskilda skäl att ställa högre krav på löneökningar.
(‘Many employees argue that in their occupational branch there are particular reasons for demanding additional salary increases.’)
The coreference relation is described in terms of identity, but it is important to distinguish between “identity-of-sense” and “identity-of-reference”. In an identity-of-reference relation the anaphor denotes the same entity as its antecedent, whereas in an identity-of-sense relation the anaphor does not denote the same entity as its antecedent but one of a similar description.
Identity-of-sense anaphors have a reference that is independent of the reference of the antecedent, and while the anaphor inherits all or parts of its sense from the antecedent a new discourse referent is introduced:
(3.20) ... länder med löner som är en bråkdel av de som betalas i västvärlden.
(‘... countries with salaries that are a fraction of those paid in the western world.’)
Reciprocal reference is triggered by the reciprocal pronoun varandra (‘each other’), and while it is closely related to coreference it has a more specialized meaning. Instances such as this could be integrated into the coreference class, but as it might be worthwhile to study the phenomenon in isolation there is a specific tag for reciprocal reference (Borthen 2004b).
(3.21) [...] att SAS går samman med Finnair. Båda är små i ett internationellt sammanhang och naggar på varandras marknader.
(‘ [...] that SAS merges with Finnair. Both are small in an international context, and [they are] chipping away at each other’s markets.’)
Distributive reference is triggered by the distributive pronouns vardera, var och en
(3.22) [...] arbetarna får tillbaka sina anställningar och ett skadestånd till var och en på 100.000 kronor.
(‘[...] the workers will get their jobs back and damages to a sum of
100.000 are awarded to each.’)
The metonymy relation is closely related to coreference. Through metonymy, a set of associations is transferred which may or may not be important to
Norrköping is used to refer to the office of a governmental organization located in Norrköping.
(3.23) Vi har gjort en anmälan till länsarbetsnämnden i Norrköping [...].
Samtidigt har bara Region Mitt, där Norrköping och verkets huvudkontor är de största enheterna, blivit ...
(‘We have made a petition to the county labor committee in
Norrköping [...]. At the same time only Region Mitt, where
Norrköping and the agency’s headquarters are the largest units, has become [...].’)
Only cases of metonymy that influence the interpretation of the reference links
below, where a company name is used to refer to the value of the shares of that company, fall outside the scope of this annotation task.
(3.24) Den Nasdaq-noterade nätverkskoncernen Cisco Systems, [...], rasade med 3,9 procent.
(‘The Nasdaq-listed network concern Cisco Systems, [...], plummeted by 3.9 percent.’)
Following the BREDT guidelines, two types of non-referential expressions that resemble referential expressions, predicative NPs and expletive det (‘it’), are also marked in order to exclude them from the referential expressions in the text.
Predicative NPs add to the description of the discourse referent in question, but they are typically not definite enough to single out a specific referent within the discourse:
(3.25) Vasakronan är ett av Sveriges största fastighetsbolag med [...].
(‘Vasakronan is one of the largest real estate companies in Sweden with [...].’)
Occurrences of expletive det (‘it’) do not refer to a discourse referent, while the third person inanimate pronoun det may do so. The distinction between
mention of det has an expletive function, whereas the second det can be interpreted as a propositional reference to the fact [the money is invested in the subsidiary] established in the previous sentence.
(3.26) Det är i dotterbolaget som Skandias drygt en miljon försäkringssparares pengar finns.
Det har lett till [...].
(‘It is in the subsidiary the money belonging to the just over one million insurance savers of Skandia is (has been invested). It has led to [...].’)
In addition to the relation types listed above, some markables are annotated with a specific tag, NOTE, with information on how that markable was interpreted. Included in this class are e.g., pronominal markables that refer to entities outside the discourse, or to implicitly introduced entities or propositions.
In this chapter, the annotation of NEs, markables, and relations between markables has been described. This is a difficult task, as shown in studies on interannotator agreement in e.g., the MUC project (Hirschman et al. 1997).
The BREDT annotation guidelines, with a fine-grained set of relation types, allow for singling out related phenomena (e.g., predicative NPs and superset/subset relations) (Borthen 2004b); our conclusion is that this helps ensure annotation quality. To further facilitate the annotation process and improve the annotation quality, we developed a simple visualization tool similar to that described by Hirschman et al. (1997).
Because this annotation was done by a single annotator, we do not claim to present a general analysis, but rather one that represents one person’s interpretation of the text collection. We will discuss some difficult cases of coreference in the next chapter, where we also present some observations based on the annotated data.
4. Coreference in Swedish news text
This chapter presents some observations on coreference and related phenomena in Swedish financial news text, based on the annotation work described in the previous chapter.
The coreference annotated corpus (SEK-66) consists of a combination of newswire and newspaper texts (cf. the MUC and ACE corpora) within the domain of financial news. The total number of words in these texts is 21,935.
The texts range in length from seven sentences to 59 sentences, and the total number of sentences is 1,323. The average text consists of 20 sentences, and
There are 730 coreference chains in total (consisting of two or more NPs); the chains range in length from two to 51 NPs, with a mean length of 3.6 NPs and a median length of 2. That is, the majority of chains are very short, but there are also (longer documents with) very long coreference chains: 45% of all chains consist of two NPs, and only about 5% of more than ten NPs.
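The chain-length figures above are plain descriptive statistics over the annotated chains. As a minimal sketch, assuming each chain is available as a list of markable identifiers (a hypothetical representation; the storage format of SEK-66 is not specified here), such a summary could be computed as follows. The function name and the toy data are illustrative only.

```python
from statistics import mean, median


def chain_length_summary(chains):
    """Summarize coreference chain lengths.

    `chains` is a hypothetical representation: a list of chains, each a
    list of markable identifiers. Only chains with two or more NPs count,
    mirroring the definition of a chain used in the text.
    """
    lengths = [len(chain) for chain in chains if len(chain) >= 2]
    if not lengths:
        raise ValueError("no chains with two or more NPs")
    return {
        "n_chains": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        # Share of chains consisting of exactly two NPs.
        "share_len_2": sum(1 for n in lengths if n == 2) / len(lengths),
        # Share of chains consisting of more than ten NPs.
        "share_over_10": sum(1 for n in lengths if n > 10) / len(lengths),
    }


# Toy data, not taken from SEK-66; the one-element list is ignored.
toy_chains = [["m1", "m4"], ["m2", "m3", "m7"], ["m5", "m6"], ["m8"]]
summary = chain_length_summary(toy_chains)
```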
In these studies, the distribution of NE types in the corpus annotated with
NEs, referred to as SEK-365, is described. We also investigate the distributional patterns of the relation types annotated in the SEK-66 corpus, the distribution of coreferential NP types, and the distribution of intra- and intersentential coreference per NP type.
Finally, some difficult cases of coreference are discussed: coordinated and split antecedents, quoted speech pronouns, and the pronoun det (‘it’), which can have an expletive or an anaphoric function.
Distribution of NE types
The distribution of NE types in the SEK-365 corpus tells us which types of entities the texts are about, and by studying the NE types in the corpus we can learn something about the coreference resolution subtask of identification
Because of the domain (financial news) of this corpus, names classified as
organization names are single unit names such as Skandia or the acronym GM, but there are also multiple unit names such as General Motors or Den Norske
Of course, not all entities are named, but we assume that entities important enough to be named are central to the coreference resolution task.
Figure 4.1: Distribution of NE categories in the SEK-365 corpus; tokens (number of occurrences of each name type) and types (the number of different NEs of each name type). The name types are: organization (ORG), person (PERS), location (LOC), financial index (FIND), service (SERV), product (PROD), miscellaneous (MISC), and trademark (TM).
Name variants are common; e.g., the same organization can be denoted as both Skandia AB and Skandia within the same discourse. Tight appositives, e.g., in försäkringsbolaget Skandia, are common in the first introduction of a name in a discourse.
The name type person (PERS) is the second most frequent in the data (see
to person names, e.g., LO-ekonomen Sandro Scocco. Akin to organization names, this category also consists of name variants, i.e., full names, first names (Sandro), and last names (Scocco).
The third largest category consists of location names (LOC); this category is fairly static as there are only a small number of types. As mentioned in sec-
areas or geopolitical entities even though this may be beneficial for corefer-
While the majority of all names belong to the three standard NE types, listed above, there are some additional NE types that occur in this domain, e.g., names of stock market indices and other types of financial indices. Names denoting products or services provided by a company also occur, as well as trademarks, defined as symbols or words used to identify a company or a
Table 4.1: Distribution of first-mention (single-mention or initial-mention/antecedent) and subsequent-mention (anaphoric) NPs over coreferential, relational (excluding coreference, e.g., superset, subset, and identity-of-sense), and single-mention (nonrelational) classes of markables in SEK-66. Relational classes include all non-
First-mention Subseq-mention All markables
Single-mention NPs 3670
In the second version of the Stockholm-Umeå Corpus (SUC 2.0; 1,000,000 tokens), where the annotation of NEs was performed manually (Wennstedt 1995), names such as England are annotated as location if they refer to a geographical area, and as organization if they refer to a geopolitical entity. In SUC 2.0, there are 249 different names that occur in the corpus annotated as both location and organization. Of these 249 different names, 2747 occurrences that are annotated as location names have been found, and 814 annotated as organization names.
The types of NEs differ with the domain: in the ACE Entity Detection and Tracking task, the entity types are: Person, Organization, Location, Facility, Weapon, Vehicle and Geo-Political
Entity. Neither weapons nor vehicles occur in our corpus.
Distribution of relation types
only 33% (2,186) are subsequent-mentions linked to another markable via one of the relation types according to the annotation. Most of these links can be described as coreferential: of the 2,186 relational markables, 86% (1,887)
The majority of the relations in the text collection are anaphoric; only 60 occurrences of cataphora have been found. Most cataphoric relations occur in quoted speech, where e.g., a deictic pronoun (vi, ‘we’) is used as a cataphoric reference to the organization the speaker is speaking on behalf of (see exam-
In order to categorize the participant NPs of the coreference chains, and of other types of relations between NPs, we distinguish between the first NP in a coreference chain (or relation) and the subsequent NPs. Following Fraurud (1992), we distinguish between NPs as:
• Initial-mentions: NPs followed by one or more coreferential NPs (i.e., the initial NP in a coreference chain), or by one or more NPs of some other relation type (e.g., superset or subset).
• Subsequent-mentions: (anaphoric) NPs preceded by one or more coreferential NPs (i.e., NPs in a coreference chain), or by one or more NPs of some other relation type.
• Single-mentions: NPs neither preceded nor followed by any coreferent NPs, or any NPs of some other relation type.
• First-mentions: initial-mentions and single-mentions combined.
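The four categories above can be derived mechanically from the annotation. The following is a minimal sketch, assuming markables are given as identifiers in textual order and relations as groups of related markables (both hypothetical representations). One further assumption is made for markables that take part in several relations: being preceded by a related NP takes precedence, so such a markable is counted as a subsequent-mention.

```python
def classify_mentions(markables, groups):
    """Label each markable as 'initial', 'subsequent', or 'single'.

    markables: list of markable ids in textual order (hypothetical format).
    groups: iterable of collections of ids that are related to each other,
            e.g. coreference chains or anaphor-antecedent pairs.
    """
    position = {m: i for i, m in enumerate(markables)}
    labels = {m: "single" for m in markables}
    for group in groups:
        ordered = sorted(group, key=position.__getitem__)
        if len(ordered) < 2:
            continue  # a lone markable is not in any relation
        first, rest = ordered[0], ordered[1:]
        # Assumption: 'subsequent' is never downgraded to 'initial'.
        if labels[first] != "subsequent":
            labels[first] = "initial"
        for m in rest:
            labels[m] = "subsequent"
    return labels


def first_mentions(labels):
    """First-mentions are initial-mentions and single-mentions combined."""
    return {m for m, label in labels.items() if label in ("initial", "single")}
```

For example, with markables `a` to `e` in textual order and one chain containing `a`, `c`, and `d`, the sketch labels `a` as initial, `c` and `d` as subsequent, and `b` and `e` as single, so the first-mentions are `a`, `b`, and `e`.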
Because our analysis presented below also includes other relation types than coreference, we use the same terminology regardless of relation type.
or subset) are often linked to markables that already belong to a coreference chain: there are only 20 relational initial-mention markables on 299 relational
– this is both because coreference is the most frequent relation, and because if a markable is in both a coreference relation with another markable, and (as an initial-mention) in another type of relation, it is listed here as coreferential because coreference is defined as the primary relation type.
data is presented. Coreference is the most frequent relation type in the data: about 86% of all relations belong to this type. All other relation types are infrequent.
Here, we are discussing direct anaphora, i.e., anaphoric definite NPs with explicit NP antecedents.
Note that such relations consist of pairs of NPs, an anaphor and an antecedent, while coreference chains can consist of any number of NPs.
Table 4.2: Distribution of relation types in SEK-66.
Reciprocal reference (e.g., varandra, ‘each other’)
Distributive reference (e.g., vardera, ‘each’)
The second most frequent relation type is that between a predicative NP and its subject, which accounts for 3.9% of the total number of relations.
The third and fourth most frequent relation types are the subset and superset relations, both at about 3.5%. These two relations are closely related, and resolution of both relations involves challenges such as handling split and coordinated antecedents and plural pronouns, and quoted speech pronouns.
There were 30 occurrences (1.4% of the total) of the fifth most frequent relation type, excluding reference, but for all other relation types, fewer than 10 occurrences (0.3% or less of the total) were found.
Among the NPs annotated as related to other NPs, most (1,887 of 2,168;
85.8% of all relational NPs) are coreferent, but as most NPs (77% of all NPs) are not annotated as related to other NPs, coreference can be described as a minority class. We make the assumption that the number of coreference relations in this data set is sufficient for training and evaluating a classifier for Swedish coreference resolution, but we will have reason to return to this question in subsequent chapters.
The fact that the relation between subject and predicate NP is relatively frequent lends support to the claim that there may be a negative effect on the results if this relation is included in the coreference resolution task (see e.g., Van Deemter and Kibble 1999, Van Deemter and Kibble 2000, Borthen
2004a). Thus, we will not include predicate NPs in the coreference class (cf.
the MUC Coreference Task Definition (Hirschman and Chinchor 1997)).
All other relation types besides coreference are very infrequent in this corpus. If we were to use this data to train a classifier over all these classes of
relations we would have few positive training examples for most of them, and thus we cannot expect good classification results. In order to build a classifier for such infrequent phenomena, a larger corpus is needed.
In the following sections, we will focus on the subset of coreferent NPs and leave the other relation types aside.
Table 4.3: Distribution of initial-mention (antecedent) and subsequent-mention
(anaphoric) NPs over NP types in SEK-66.
Initial-mention Subseq-mention All markables
Distribution of coreferential NP types
In this thesis, we are concerned with coreference between NPs, and we dis-
into initial-mention and subsequent-mention NPs over the three NP types. Of the 730 coreference chains in the corpus, two thirds start with an initial lexical
NP, and 44% of all subsequent-mention NPs belong to this NP type. A third of all chains start with a NE, but no chains start with a pronoun. The latter NP type makes up 24% of all subsequent-mentions.
This NP type is referred to as definite descriptions in the second part of the thesis, when focus is on resolution of subsequent-mention definite lexical NPs.
There are few instances of pronouns, with a total of 454 occurrences, and as the pronoun word class is quite heterogeneous, this makes pronoun resolution even more difficult. This data set may be too small for pronoun resolution; especially resolution of the less frequent pronoun types might suffer from data
the pronoun resolution results.
Table 4.4: Distribution of simple and complex single-mention, initial-mention (antecedent) and subsequent-mention (anaphoric) definite descriptions in the
Swedish financial news text (SEK-66).
1150 132 533
Anaphoricity in definite descriptions
In coreference resolution systems, definite descriptions are typically treated as subsequent-mention candidates, matched against preposed NEs, indefinite lexical NPs, other definite descriptions, and pronouns. But as shown in sec-
30% of all definite descriptions are part of a coreference chain as either the initial-mention or a subsequent-mention. That is, most definite descriptions are antecedentless, as they do not have an overt NP antecedent in the preceding context.
scriptions are listed in the second and third column; these categories combined cover almost 80% of all definite descriptions. That is, most of the definite descriptions in our data should not be resolved to any antecedent (and, most of
Figure 4.2: NP complexity: Complex definite descriptions classified as a) NPs with
PP modifiers, b) NPs with genitive or possessive modifiers, c) NPs consisting of two content words or more, d) other complex NPs.
the definite descriptions in the training data given to the classifier will constitute negative training examples).
Thus, resolution of definite descriptions is difficult not only because definite descriptions can be used as descriptions of Named Entities, and because they can enter into relations with other lexical NPs based on synonymy, hyperonymy, and hyponymy that can be hard to interpret, but also because the majority of definite descriptions are non-coreferent.
In a corpus study by Fraurud (1992) on professional, non-fiction Swedish prose, more than 50% of all simple NPs (consisting of a single definite noun) and 75% of all complex definite descriptions (defined as definite NPs with any kind of modifier, e.g., an adjectival modifier, or a postposed prepositional phrase) are classified as first-mention (single-mention or initial-mention NPs), and thus do not have an overt, contextual antecedent.
There are similar findings for English, where e.g., Poesio and Vieira (1998), and Vieira and Poesio (2000) observe that restrictive postmodification is the most frequent feature of antecedentless definite descriptions.
with an enclitic definite article), and complex NPs (all other definite NPs with e.g., at least one restrictive adjectival modifier, a genitive or possessive modifier, or a postposed prepositional phrase).
While the proportions of simple and complex NPs are about the same for single-mention and initial-mention definite descriptions (47.4% resp. 46% simple, and 52.6% resp. 54% complex NPs), the majority of subsequent-mention definite descriptions are simple NPs (64.1%).
In the aforementioned study, Fraurud (1992) found that 85% of all definite NPs with a genitive or possessive modifier were either initial- or single-mention. She attributes this to the fact that a genitive or possessive construction explicitly relates the referent to another referent, and thus serves as an introduction.
modifiers, b) NPs with genitive or possessive modifiers, c) NPs consisting of two content words or more, d) other complex NPs. In this categorization, PP modification is a strong feature of single-mention definite descriptions. We can also observe that initial-mention long NPs are infrequent, compared to single-mention and subsequent-mention long NPs. Complex NPs with genitive or possessive modifiers occur in all three categories, and the proportion of subsequent-mentions with genitive or possessive modifiers is larger than expected in view of Fraurud’s findings. Although we cannot conclude that complex NPs are single-mention or initial-mention and that simple NPs are subsequent-mention, it seems likely that knowledge on NP complexity, and especially on subcategories of complex NPs (with PP modifiers, or with genitive or possessive modifiers) will be beneficial to coreference resolution.
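To illustrate how NP complexity might be turned into a categorical feature, the following is a minimal sketch of the four-way split of complex NPs into a) to d), plus simple NPs. The `DefiniteNP` record, its field names, the category labels, and the order in which the tests are applied are all assumptions made for illustration; they are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class DefiniteNP:
    """Hypothetical record of surface properties of a definite description."""
    n_content_words: int
    has_pp_modifier: bool = False
    has_genitive_or_possessive: bool = False
    has_other_modifier: bool = False


def complexity_category(np):
    """Map a definite description to one complexity category.

    The precedence (PP before genitive/possessive before length) is an
    assumption; an NP with several properties gets the first match.
    """
    if np.has_pp_modifier:
        return "complex:pp"
    if np.has_genitive_or_possessive:
        return "complex:genitive_possessive"
    if np.n_content_words >= 2:
        return "complex:long"
    if np.has_other_modifier:
        return "complex:other"
    return "simple"
```

A single definite noun such as affären would come out as `simple` under this sketch, while an NP flagged as having a PP modifier would come out as `complex:pp`.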
There are also other factors that determine the anaphoricity of a definite description, e.g., if the referent can be identified from encyclopedic knowledge.
In a study on a subset of Penn Treebank, Vieira and Poesio (2000) found that expressions that can be described as referring to encyclopedic knowledge, such as the sun and the moon, often occur as single- or initial-mentions; other frequent single-mentions are e.g., time expressions.
These findings are mirrored in the analysis of our corpus, where among the most frequent single-mention definite descriptions are different expressions referring to time (e.g., året, ‘the year’, kvartalet, ‘the quarter’, månaden,
‘the month’, veckan, ‘the week’). Most single-mention definite descriptions can be defined as expressions referring to encyclopedic knowledge within this specific domain. Frequent single-mentions are e.g., important agents within the domain like facken (‘the unions’), regeringen (‘the government’), euroländerna (‘the Euro countries’), LO-ekonomerna (‘the LO economists’), institutions such as börsen (‘the stock market’), penningmarknaden (‘the money
market’), and important concepts such as industrin (‘the industry’), ekonomin
(‘the economy’), arbetsmarknaden (‘the labour market’), marknaden (‘the market’), arbetslösheten (‘the unemployment rate’), tillväxten (‘the growth’).
These findings suggest that domain-specific lexical resources are important to coreference resolution.
Resolution of definite descriptions is a challenging task, not only because definite descriptions can enter into synonymy or hyperonymy relations with other lexical NPs but also because most definite descriptions are single-mentions. Synonymy or hyperonymy relations are difficult to identify and thus may affect the recall for this NP type; such relations are also a challenge in terms of precision because e.g., synonymy is a clue to, but does not equal, coreference between two NPs. In the experiments described in the following chapters, we do not assume that a definite description has a coreferent antecedent.
Some difficult cases of coreference
In this section, some challenging cases of coreference are described. Each case is interesting and worthy of a thesis in itself. This section serves as an illustration of the diversity of the coreference phenomenon, and as a justification for the limitations of the scope of this thesis.
One of the problems encountered during coreference annotation is how to handle coordinated NPs, e.g., whether they should be marked as a unit or not. One solution, suggested by Van Deemter and Kibble (2000), is to only mark coordinated constructions in cases where there is an anaphoric expression (thus, the markable annotation is dependent on the coreference annotation).
(4.1) I Världsbanken och IMF skiljer sig röstandelen åt med hänsyn till hur mycket pengar som ett land pytsar in i institutionerna.
(‘In the World Bank and IMF, the number of votes differs according to how much money a country contributes to the institutions.’)
Coordinated NPs constitute a problem during coreference resolution, not only because this is a difficult (sub)task but also because adding all syntactically conjoined NPs as candidate antecedents further adds both to the imbalance between coreferent and non-coreferent instances and to the complexity of resolution of plural NPs.
Our solution to the annotation problem is to connect the referring expression to the antecedent by marking the referring expression with a list consisting of the index numbers of the NPs coordinated by the conjunction (see
Världsbanken and IMF.
In the experiments on coreference resolution described in this thesis, we do not attempt to handle coordinated NPs, and they are not recognized as units. The annotated coreference links between anaphors and multiple, coordinated antecedents are not included as examples of coreference (i.e., positive instances) in the training or evaluation data used in the experiments described in the following chapters.
If plural anaphors with coordinated antecedents are difficult to handle, plural anaphors with split antecedents, that is, referring expressions with antecedents consisting of a set of syntactically disjoint NPs, are even more challenging:
(4.2) Hon tror att det på sikt kan leda till att SAS går samman med Finnair.
Det vore det mest naturliga för SAS del. Båda bolagen är små i ett internationellt sammanhang och naggar på varandras marknader.
De kan vinna mycket på ett samgående, säger hon.
(‘She believes that this will eventually lead to SAS merging with
Finnair. That would be the most natural solution for SAS. Both companies are small in an international context, and [they] are eating away at each other’s markets. They have much to gain from a merger, she says.’)
Correct resolution of the link between the plural definite description båda bolagen (‘both companies’) and the antecedent, consisting of the syntactically
knows that the referring expression limits the number of members in the antecedent set to two, and that it can identify both SAS and Finnair as the names of two companies that are in focus at that point in the discourse.
In the experiments on coreference resolution described in this thesis, we do not attempt to handle anaphors with split antecedents. That is, the annotated coreference links between such anaphors and their disjoint antecedents are not included in the training or evaluation data used in the experiments described in the following chapters.
While this is a way to limit the scope of the coreference resolution task, it also means that the set of plural pronouns and plural definite descriptions given to the classifier for resolution will include NPs that are coreferent with two or more coordinated or disjoint NPs. That is, some plural pronouns and plural definite descriptions in the training and test data will have a plural antecedent, but some will appear to be antecedentless because their antecedents are coordinated or disjoint. This is likely to make resolution of plural NPs more difficult for the classifier, and we will return to this question in the following chapters.
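The effect of leaving out links to coordinated or split antecedents can be made concrete with a small sketch. It assumes that each annotated link is a pair of an anaphor identifier and a list of antecedent identifiers, where a list with more than one element stands for a coordinated or split antecedent (a hypothetical encoding; the function name is also illustrative).

```python
def select_positive_links(links):
    """Separate usable links from links with multiple antecedents.

    links: list of (anaphor_id, [antecedent_ids]) pairs (hypothetical format).
    Returns the single-antecedent links kept as positive instances, and the
    set of anaphors that remain in the data but appear antecedentless
    because all their links point to coordinated or split antecedents.
    """
    positives = []
    resolved = set()
    multi_only = set()
    for anaphor, antecedents in links:
        if len(antecedents) == 1:
            positives.append((anaphor, antecedents[0]))
            resolved.add(anaphor)
        else:
            multi_only.add(anaphor)
    return positives, multi_only - resolved


# Toy version of example (4.2): 'båda bolagen' has a split antecedent,
# while 'de' has the single antecedent 'båda bolagen'.
toy_links = [("båda bolagen", ["SAS", "Finnair"]), ("de", ["båda bolagen"])]
kept, seemingly_antecedentless = select_positive_links(toy_links)
```

In the toy example only the link from de to båda bolagen is kept as a positive instance, while båda bolagen itself is still presented to the classifier without any annotated antecedent.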
Quoted speech pronouns
In news text, quoted speech is frequent, and consequently so are quoted speech pronouns and other referring expressions within quotes that might be interpreted as coreferent with expressions outside the quote. While this is straightforward for human interpretation, automatic resolution requires that the system recognizes when a person is speaking on behalf of an organization, as in
(4.3) – Jag tror att man kan lösa det på olika sätt. Vi har ett väldigt nära samarbete med Lufthansa och det är fullt tillräckligt för oss, säger SAS informationsdirektör Hans Ollongren, som också framhåller att SAS idag äger flera mindre bolag i Norden.
(‘– I believe that this can be solved in different ways. We are co-operating very closely with Lufthansa, and that is quite sufficient for us, says SAS information manager Hans Ollongren, who also stresses that SAS today owns several smaller companies within the
Here, a coreference link might be added between the plural pronouns vi, oss
on behalf of the company, or the management.
In the annotation of this corpus, singular first person pronouns in quoted speech are annotated as coreferent with the speaker, while plural first person pronouns are annotated as relational to the speaker through the superset rela-
While superset and subset relations are too infrequent to be included in the
within and outside quotes are added to the training and evaluation data used in the experiments in this thesis; this inclusion will be further discussed in the following chapters.
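The annotation convention for first person pronouns in quoted speech amounts to a simple rule, sketched below. The word lists and the function name are illustrative assumptions, and the lists are not claimed to match the pronoun inventory actually used in the annotation.

```python
# Illustrative word lists (assumption): Swedish first person pronouns,
# including possessive forms such as 'våra' in example (3.17).
SINGULAR_FIRST_PERSON = {"jag", "mig", "min", "mitt", "mina"}
PLURAL_FIRST_PERSON = {"vi", "oss", "vår", "vårt", "våra"}


def quoted_pronoun_relation(pronoun):
    """Return the relation between a quoted first person pronoun and the speaker.

    'coreference' for singular forms, 'superset' for plural forms, and
    None for any other token.
    """
    token = pronoun.lower()
    if token in SINGULAR_FIRST_PERSON:
        return "coreference"
    if token in PLURAL_FIRST_PERSON:
        return "superset"
    return None
```

Applied to example (4.3), Jag would be linked to the speaker by coreference, whereas Vi and oss would be linked to the speaker through the superset relation.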
Table 4.5: Number of occurrences, and percentage of det (‘it’) annotated as expletive, coreferent, or as referential to entities outside the discourse or to implicitly introduced entities (i.e., without an explicit NP antecedent note).
Annotation type N
% N det
Expletive det 68 76 183 83 251 81
No explicit NP antecedent 20 22 24 11 44 14
Anaphoric and non-anaphoric
In Swedish, annotation and resolution of the third person inanimate pronoun det (‘it’) is further complicated by the fact that det can also occur as a definite article in an NP, or as an expletive.
Most of these occurrences, 251 (81%), are annotated as expletive det. In fact, anaphoric det in a coreference relation is the least common use – there are only
16 occurrences in the entire corpus of coreferential det. This is not a uniquely
Swedish phenomenon; similar distribution patterns for expletives and third person pronouns have been reported for Norwegian (Nøklestad 2009, Holen
2007), German (Klenner and Ailloud 2008), English (McEnery et al. 1997), and Dutch (Hoste 2005).
Although good resolution results for this pronoun type are unlikely (especially after splitting the corpus into disjoint training and evaluation data sets), we will not remove instances of this pronoun from the data. We motivate this by the fact that other pronoun types are also infrequent (even though they are not ambiguous). However, we will have reason to return to this question in the following chapters.
In this chapter, we have presented some observations on coreference and related phenomena in our annotated data. These findings can be summarized as follows:
• Distribution of NE types: most of the NEs in this corpus belong to the organization name category, followed by other standard NE categories such as person names and location names. Domain-specific NE types include, e.g., financial indices and products.
• Distribution of relations between NPs: among the NPs that point to some textual antecedent, coreference is the most frequent relation type. However, compared to the class of NPs that are not related to any other NP in the discourse, coreference is the minority class.
• Distribution of coreferent NP types: about half of the NPs in coreference relations are lexical NPs, about a third are NEs, and the rest are pronouns.
• Factors for determining anaphoricity: among definite descriptions, only a minority are anaphoric. Some factors that may help in recognizing antecedentless definite descriptions are e.g., NP complexity, and world- and domain-knowledge.
• Difficult cases of coreference: plural anaphors with coordinated and split antecedents, anaphors in quoted speech, and third person inanimate pronoun det are some of the challenges.
Coreference Resolution in Swedish
5. Data Preparation
In this chapter, the preprocessing of the data and the construction of pairs of anaphors and candidate antecedents (i.e., positive and negative instances for the classifier) are described. The different methods we use for selecting anaphor-antecedent candidates and the motivations behind them are discussed.
We also describe the knowledge sources used for feature construction, as well as how the feature set was constructed, and why we believe these features to be informative for Swedish coreference resolution.
Referring expressions handled by the system
In this thesis, the topic is resolution of coreference between referring expressions in Swedish. The expressions identified as potential anaphors in these experiments are Named Entities, definite descriptions, and pronouns.
Named Entities (NEs) are words and noun phrases which function as proper names in a wide sense. Depending on the domain, different classes of NEs typically occur; in the type of texts used here, names of trademarks and products occur in addition to the most frequent classes: names of organizations, persons, and locations. The different classes of NEs covered here are described
While the different types of NEs share some properties, e.g., that most NEs are capitalized, uniquely referring names, there are also differences: NEs can be categorized as e.g., animate or inanimate, and if animate, as either masculine or feminine.
Most, if not all, animate beings in this domain are also human. Therefore, we use the term animate meaning animate-human.
Definite descriptions are definite NPs with a common noun head. In Swedish, definite NPs must have at least one of the following markers for definiteness: a pre-modifying definite attribute (e.g., a possessive pronoun as in min bok, ‘my book’, mitt hus, ‘my house’), a definite marker in terms of an enclitic suffix on the head word (e.g., boken, ‘book’+def +uter, huset ‘house’+def +neuter), or a combination of these two (e.g., den boken, ‘that book’+def +uter). Definiteness can also be marked by a pre-modifying adjectival attribute in definite form (e.g., samma bok, ‘the same book’). Instances of definite NPs without a nominal head, e.g., a determiner followed by a nominalized adjective as in de (‘the indicted’) are also categorized as definite descriptions.
In some cases, a definite description must agree in gender and number with its antecedent. E.g., coreference between a pronoun and a definite description requires agreement in gender and number, while two definite descriptions of
Plural definite descriptions can have a plural antecedent, or coordinated or split antecedents (see sec-
or inanimate, or in some (rare) cases as either masculine or feminine.
Swedish pronouns are a relatively closed but heterogeneous word class that can be divided into definite pronouns, interrogative pronouns, quantitative pronouns, and relational pronouns. Pronouns can occur as the head of an NP, or as an attribute in an NP. They tell us how a referent can be identified within a specific context, but have little intrinsic descriptive content. In this thesis we are concerned with definite pronouns that can be classified as:
1. Personal pronouns, e.g., jag (‘I’), hon (‘she’), de (‘they’),
2. Demonstrative pronouns, e.g., denna (‘this+uter (one)’), detta
3. Reflexive pronouns, e.g., sig (‘himself’/‘herself’/‘itself’)
4. Relative pronouns, e.g., som (‘who’); resolution of relative pronouns is not included in these experiments as instances of such pronouns
Definite pronouns can function either as deixis or anaphora. Deictic pronouns (i.e., the first- and second person) refer to the speaker/listener (or via the superset relation, to some group the speaker/listener belongs to), and do not necessarily rely on an overt antecedent for identification. Anaphoric pronouns (the third-person pronouns, and demonstrative pronouns) typically refer to an (overt and proximate) antecedent. In terms of accessibility, both types signal that the referent is maximally accessible, either within the situation or the textual context. However, deictic and anaphoric pronouns tend to behave differently: deictic pronouns often occur as cataphors in quoted speech.
Anaphoric pronouns can also be classified as either free anaphors that are not subject to syntactic constraints where reference is concerned, or syntactically bound anaphors where the correlate is subject to syntactic constraints (e.g., reflexive pronouns prototypically corefer with the subject in a finite clause). In terms of scope, free anaphors can be situated further away from their antecedents than syntactically bound anaphors.
There are two grammatical genders in Swedish, neuter and uter.
Anaphoric pronouns can refer to either animate (e.g., 3 pers sg: han (‘he’), hon (‘she’)), inanimate (e.g., 3 pers sg: den, det, (‘it’)), or to both animate and inanimate referents (e.g., 3 pers pl: de (‘they’)) referred to as ‘mixed’ anaphoric pronouns below. In a corpus study of Swedish pronouns in fiction (short stories), reports on court proceedings, and non-fiction (articles on technology), Fraurud (1992) found that while most pronouns with an inanimate referent were located in the same sentence as the antecedent or in the preceding sentence, animate referents were referred to by anaphoric pronouns at greater distances. Fraurud concludes that animacy is an important factor influencing the scope of a referent.
In a second study on Swedish plural pronouns, Fraurud (1992) found that for third person plural pronouns, which are ‘mixed’ as to animacy, there were no significant differences in scope between animate and inanimate referents, and that the antecedents of most plural pronouns (97%) were found within the same or the immediately preceding sentence.
Pronouns can also be categorized by their grammatical function. The referent denoted by the grammatical subject in a clause is typically more prominent than the referent denoted by the object; consequently, pronouns in non-subject form are typically situated within a shorter distance to their antecedents than pronouns in subject form.
Plural pronouns can refer either to one plural antecedent, or to (any number of) coordinated or split antecedents. We do not add the links between plural pronouns and coordinated or split antecedents in the annotated data to the training and test data described below; in order to identify cases of coreference between one anaphor and multiple antecedents and to solve that particular resolution problem, additional discourse models on a level above NPs are needed.
Thus, occurrences of plural pronouns constitute a potential source of errors, but because we have no way of knowing when a plural pronoun has a group of antecedents and when it has a single, plural antecedent (other than resolution), such plural pronouns with coordinated or split antecedents are not removed from the training and test data.
A solution to this problem is to remove all plural pronouns from the training and test data, but as we are interested in finding out how our feature set works for resolution of plural pronouns we do not remove these pronouns. Additional motivation for not removing such cases is that definite descriptions and NEs
Selection and interpretation of definite referring expressions
The three types of definite referring expressions we are concerned with here (NEs, definite descriptions, and pronouns) can be described in terms of accessibility, that is, how accessible, or active, the (mental representation of a) referent of an NP is at a given point in the discourse. Ariel (1990) argues that the degree of activation can be seen in the writer’s choice of different NP types, and that this choice between NEs, definite descriptions, and pronouns is a type of grammatical coding that guides the processing. That is, the selection and interpretation of referring expressions is based not only on the content of the expression, but also on the degree of accessibility indicated by the speaker.
Ariel (1990) distinguishes between Low Accessibility Markers (e.g., Named
Entities and definite descriptions) and High Accessibility Markers (e.g., pronouns). Low Accessibility Markers are typically uniquely referring and highly linguistically informative, and they signal the introduction or reintroduction of a referent, and possibly also a termination of the current topical referent, while High Accessibility Markers such as pronouns signal continued activation of the current topical referent.
We use the Accessibility Marking Scale (Ariel 1990, p. 73), from Low Accessibility Markers with much linguistic content (e.g., full names with modifiers) to High Accessibility Markers with little linguistic content (e.g., pronouns), as a basis for categorizing the NPs in our data:
Accessibility Marking Scale: Full name + modifier < Full name < Long definite description < Short definite description < Last name < First name < Distal demonstrative < Proximate demonstrative < Stressed pronoun < Unstressed pronoun < Cliticized pronoun < Extremely High Accessibility Markers (e.g., reflexives).
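For illustration, the scale can be encoded as an ordered enumeration so that two NP forms can be compared directly. This is only a sketch: the class name `Accessibility`, the member names, and the helper `is_low_accessibility` are hypothetical, and the cut-off after first names is an assumption that follows the grouping of names and definite descriptions as Low Accessibility Markers given above.

```python
from enum import IntEnum


class Accessibility(IntEnum):
    """Positions on the Accessibility Marking Scale, lowest accessibility first."""
    FULL_NAME_WITH_MODIFIER = 1
    FULL_NAME = 2
    LONG_DEFINITE_DESCRIPTION = 3
    SHORT_DEFINITE_DESCRIPTION = 4
    LAST_NAME = 5
    FIRST_NAME = 6
    DISTAL_DEMONSTRATIVE = 7
    PROXIMATE_DEMONSTRATIVE = 8
    STRESSED_PRONOUN = 9
    UNSTRESSED_PRONOUN = 10
    CLITICIZED_PRONOUN = 11
    EXTREMELY_HIGH = 12  # e.g., reflexives


def is_low_accessibility(marker):
    # Assumed cut-off: names and definite descriptions count as low accessibility.
    return marker <= Accessibility.FIRST_NAME


# A full name marks lower accessibility than an unstressed pronoun.
print(Accessibility.FULL_NAME < Accessibility.UNSTRESSED_PRONOUN)
```

An integer-backed enumeration keeps the ordering explicit, which is convenient when selection rules need to ask whether a candidate antecedent is of lower accessibility than the anaphor.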
From a processing point of view, High Accessibility Markers such as pronouns can often only be accessed within the same or the previous sentence, whereas
Low Accessibility Markers such as NEs and definite descriptions can be accessed within or across paragraphs, but only rarely within shorter distances, e.g., in the same sentence. That is, this categorization is based on how informative an NP is, how uniquely referring it is, and its degree of prominence
This categorization is the basis for the selection of candidate antecedents in
The annotation process, including the annotation of NEs, NPs, and reference relations between NPs according to the BREDT annotation scheme, is de-
In these experiments, we assume that we have been given as part of our input the markable boundaries and information on the NEs (boundaries and class label). That is, we assume gold NP chunking and Named Entity Recognition, the motivation being that preprocessing errors related to recognition of NPs and NEs are harmful to the resolution results. A number of studies have shown that such preprocessing errors make up a large part of the resolution errors, see e.g., Baldwin (1997), Mitkov et al. (2002), and Morton (2005).
Others who make similar assumptions are e.g., Denis and Baldridge (2008),
Poon and Domingos (2008), Haghighi and Klein (2007), and Haghighi and
Klein (2009). We do not assume that we have been given the head word and
On this data, the following preprocessing steps were performed in order to enrich the data with information that might be useful for coreference resolution:
• Part-of-speech tagging (including morpho-syntactic information)
a stochastic tagger based on a Hidden
Markov model. The statistics used are extracted from the automatically POS-tagged and manually corrected Stockholm-Umeå
Corpus, SUC, a balanced corpus of approximately 1,000,000 words
(Ejerhed, Källgren, Wennstedt and Åström 1992). The tag set is a modified version of the SUC tag set. A tagging accuracy of 96.3% is reported; for unknown words, accuracy is 92.0% (Carlberger and
version with a pre-trained memory-based model for Swedish (Nivre,
Hall, Nilsson, Chanev, Eryiğit, Kübler, Marinov and Marsi 2007).
• Based on the output of the Granska Tagger and the MaltParser analysis, a Perl script is used to categorize NPs as simple (defined as a single noun or pronoun, or a noun with a determiner), multi-word (an NP with more than one content word), or complex NPs (nested NPs with e.g., a genitive modifier or a PP), and the head word of each NP is marked. This script also determines the NP type, based on the part-of-speech of the head word and the definiteness of the NP. For simple NPs, information on definiteness is provided by the Granska analysis. For complex NPs, definiteness is determined by a set of rules, e.g., ett hus (‘a house’) is indefinite, detta hus (‘this house’) is definite.
The NP types are: definite description, indefinite lexical NP, cardinal number, ordinal number, definite pronoun, indefinite pronoun, NE, or unknown.
In the ACE data set, used by Haghighi and Klein (2007), the NP types are names, descriptions, and pronouns.
Granska, URL: http://www.csc.kth.se/tcs/humanlang/tools.html
MaltParser, URL: http://maltparser.org
MaltParser 0.4, URL: http://w3.msi.vxu.se/ nivre/research/MaltParser.html
• Each instance of the NE type ‘organization’ is extended with the synsets for företag, organisation (‘company’, ‘organization’), and instances of the NE type ‘person’ are extended with the synsets for människa, person (‘human being’, ‘person’) in Swedish WordNet.
The data is arbitrarily split on the document level into two disjoint sets of training and test data. As we do not have enough data for a disjoint development data set, the training data is used for development and tuning of the
the final evaluation of these classifiers is performed on the test data (described
The split on the document level is motivated by the fact that coreference is a discourse phenomenon, and we want to evaluate the results over complete documents. The split has consequences both as to the distribution of (different types of) NEs, definite descriptions and pronouns in the training and test data sets, and the distribution of coreference relations between NPs (some documents are longer, and thus contain more coreference links). Because of the different difficulty levels of the resolution tasks, distributional differences between the training and test data might affect the outcome of this task. We
Both the training and test data sets are preprocessed in the same way, the same instance creation and instance selection methods are applied, and the same feature set is constructed for both data sets.
Construction of negative and positive instances
Because coreference is a relation that holds between all NPs within a document that refer to the same discourse entity, each NP within a document might be coreferent with any preceding NP. Consequently, training instances for the classifier are constructed by pairing each NP with the previously occurring NPs in each document, thus forming candidate anaphor–antecedent pairs. An instance consisting of a coreferent anaphor–antecedent pair is a positive instance, and an instance consisting of a non-coreferent pair is a negative instance.
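The pairing procedure can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the `Mention` class, the function `build_instances`, and the entity ids in the toy document are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    entity: int  # gold discourse entity id (toy values, assumed for illustration)


def build_instances(mentions):
    """Pair each NP with every preceding NP; label the pair by gold coreference."""
    instances = []
    for position, anaphor in enumerate(mentions):
        for antecedent in mentions[:position]:
            is_positive = anaphor.entity == antecedent.entity
            instances.append((anaphor, antecedent, is_positive))
    return instances


# Toy document in textual order; entity ids are invented for the sketch.
doc = [
    Mention("Svenskt Näringsliv", 1),
    Mention("Ebba Lindsö", 2),
    Mention("The organization", 1),
    Mention("hon", 2),
]
instances = build_instances(doc)
positives = [inst for inst in instances if inst[2]]
print(len(instances), len(positives))
```

With n NPs in a document this construction yields n(n-1)/2 candidate pairs, so the number of negative instances grows much faster than the number of positive ones.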
For every NP there is a large number of possible antecedents, and since coreference is a relatively rare relation (as not all definite NPs are coreferent with a preceding expression), the ratio of positive instances is typically very
Svenskt OrdNät, URL: http://www.lingfil.uu.se/ling/swn.html
there are 6 positive instances and 22 negative
(5.1) [Den nya vd:n för [Svenskt Näringsliv]
, [Ebba Lindsö]
, tillträdde på [onsdagen]
5 ska nu bli tydligare i [[sin]
, säger [hon]
(‘[The new CEO of [SN]
, [Ebba Lindsö]
, entered on
. [The organization]
5 will become more resolute in [[its]
From this point of view, reducing the number of instances by some method of instance selection is an important issue in the approach to coreference resolution described in this thesis, and we will discuss different instance selection strategies in detail in the following section.
The complete set of negative and positive training instances includes instances where the candidate anaphor is e.g., an indefinite lexical NP. Because we define the set of referring expressions covered by the system as NEs, def-
where the candidate anaphor is classified as one of the three types of referring expressions is selected for further processing. This selection is automatic, and performed by extracting the instances where a) the head word of the candidate anaphor matches one of the pronouns in the select pronoun set, b) the candidate anaphor is an NE, or c) the candidate anaphor is classified as a definite description.
At this point, a basic syntactic constraint is used to remove unlikely anaphor-antecedent candidates for all NP types, following e.g., Strube et al. (2002): NPs nested within other NPs are not allowed to function as either anaphor or antecedent to the immediately dominant NP, or to other NPs
pronoun sin (‘its’) is not allowed to match against the dominant NP sin framtoning (‘its image’, ‘its appearance’).
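A minimal sketch of the nesting constraint, assuming that each NP is represented by a hypothetical `(start, end)` token span with an exclusive end offset. For simplicity the sketch rejects a pair whenever one span contains the other; it does not distinguish the immediately dominant NP from NPs higher up.

```python
def is_nested(inner, outer):
    """True if span `inner` lies inside the distinct span `outer`."""
    return inner != outer and outer[0] <= inner[0] and inner[1] <= outer[1]


def violates_nesting(anaphor_span, antecedent_span):
    """A pair is rejected if either NP is nested within the other."""
    return (is_nested(anaphor_span, antecedent_span)
            or is_nested(antecedent_span, anaphor_span))


# Hypothetical token offsets: "sin framtoning" covers tokens 10-11, "sin" token 10.
sin_framtoning = (10, 12)
sin = (10, 11)
print(violates_nesting(sin, sin_framtoning))
```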
The subset of negative and positive training instances where the candidate anaphor is classified as a referring expression is further divided into three different data sets based on the NP type of the anaphor. By thus dividing the data, we can train different classifiers for the three NP types. The motivation for this is partly that the different NP types require different types of information for resolution, and partly because the NP types are not equally distributed in the
The four coreference chains (including single-mention NPs that constitute chains of length
= 1) are: [Den nya vd:n för Svenskt Näringsliv
, Ebba Lindsö
], [Svenskt Näringsliv
, [sin framtoning]
Table 5.1: Training instances of anaphors and candidate antecedents, constructed
stances are instances where the anaphor is nested within the candidate antecedent, e.g., sin in sin framtoning.
Anaphor Candidate antecedents
2 Svenskt Näringsliv 1 Nya vd:n för Svenskt Näringsliv
3 Ebba Lindsö 2 Svenskt Näringsliv
3 Ebba Lindsö
Nya vd:n för Svenskt Näringsliv
Nya vd:n för Svenskt Näringsliv onsdagen
6 sin framtoning
6 sin framtoning
6 sin framtoning
6 sin framtoning
6 sin framtoning
Nya vd:n för Svenskt Näringsliv
Nya vd:n för Svenskt Näringsliv
6 sin framtoning
3 Ebba Lindsö
2 Svenskt Näringsliv
1 Nya vd:n för Svenskt Näringsliv
6 sin framtoning
3 Ebba Lindsö
2 Svenskt Näringsliv
1 Nya vd:n för Svenskt Näringsliv
Table 5.2: Candidate anaphor-antecedent pairs in the complete training data set partitioned by the NP type of the anaphor, and in the complete training data set (i.e., all three NP types combined).
NP type of anaphor Number of candidate pairs % of total
All NP types
data and this might influence the outcome if one single classifier is used. Since the NP types are unequally distributed in the data, the data sets are of varying size: about two thirds of all candidate anaphors in the training data set are definite descriptions, while pronouns and NEs make up about one sixth each
Instance selection strategies
Coreference between NPs is a relatively rare phenomenon, as most definite NPs
therefore the majority of candidate anaphor-antecedent pairs are not coreferent, and consequently function as negative examples for the classifier (cmp.
In the complete NE training data set, 3.76% of all pairs are positive examples, and in the pronoun data set, 4.67% are positive. As definite descriptions make up the most frequent NP type, this data set is both the largest (63.95%
In order to restrict the search space for each anaphor, thereby creating a more evenly distributed data set, different instance selection strategies can be used. By selecting a subset of training instances, we can reduce the processing time of the classifier, and studies by e.g., Uryupina (2004) and Hendrickx et al. (2007) show that linguistically motivated instance selection can also improve the classification results.
Additional motivation for instance selection is that by instance selection we can further define the task: if we know what kind of data we are trying to classify, and what kind of relations we are trying to find, we can do a better job of selecting suitable features for this task. If we narrow down the task of coreference resolution to e.g., identification of coreferent NEs, and we define the task as identification of coreference and non-coreference relations between pairs of NEs and either other NEs or lexical NPs, rather than coreference and
non-coreference between pairs of any type of NPs, a better informed feature selection can be performed.
Instance selection methods can be based on e.g., restricting the search space to the closest preceding NP (Soon et al. 2001), or the closest, easily resolved preceding NP (Ng and Cardie 2002b), or linear restrictions in combination with filters (for filtering of negative or positive instances, or both) based on conditions on gender or number agreement, or semantic type
The instance selection strategies in our experiments are based on corpus studies on the connection between the referential form of an NP and its cognitive status, its accessibility as an antecedent (Ariel 1990). This theory of a direct connection between referential form and cognitive status, in combination with different constraints on definite description repetitions and redescriptions, and with semantic and syntactic constraints on pronouns, forms the basic framework for our instance selection method.
Accessibility theory, put forth by Ariel (1990), is focused on the form of the anaphor and of the antecedent, and the relationship between them, which makes it an interesting model for selecting for each anaphor the most likely candidate antecedents among all preceding NPs. The idea is to use Accessibility theory for selecting likely anaphor-antecedent pairs for the classifier-clusterer approach to coreference resolution used here. This instance selection method includes both negative and positive instance filtering, as the basis for the selection is how likely it is that the anaphor and the candidate antecedent are coreferent based on their respective degree of accessibility.
In the sections below, different instance selection methods for NEs, definite descriptions, and pronouns based on Accessibility theory are described.
We contrast these methods against a basic, linear-k instance selection technique based on corpus studies that show that the different NP types display different behavior in terms of the linear distance between the anaphor and the antecedent (McEnery et al. 1997, Vieira and Poesio 2000, Fraurud 1992); pronouns typically corefer with an antecedent in the immediate context (i.e., the current or the preceding sentence), whereas NEs and definite descriptions can display long-distance coreference relations. In this model, k is set to the number of sentences included in the search space for each type of anaphor, with different values for NEs, definite descriptions, and pronouns respectively.
In a sense this technique is also based on cognitive status as it allows for large search scopes for (low accessibility) subsequent-mention NEs and definite descriptions, and short scopes for (high accessibility) pronouns; the difference is that there are no restrictions on the types of candidate antecedents that are allowed for each type of anaphor.
The basic discourse units used in all the instance selection strategies described below are the sentence, defined as a unit recognized by the tokenizer, and the paragraph, defined as a unit beginning and ending with a new line.
Both these units are seen as structural clues provided by the writer that can be used for coreference resolution.
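The linear-k technique can be sketched as a simple distance filter. The sketch assumes that each NP is a dictionary with a hypothetical `sent` field holding its sentence index, and that the current sentence is included in the search space along with the k preceding ones.

```python
def linear_k(pairs, k):
    """Keep (anaphor, antecedent) pairs at most k sentences apart."""
    return [
        (anaphor, antecedent)
        for anaphor, antecedent in pairs
        if 0 <= anaphor["sent"] - antecedent["sent"] <= k
    ]


# Toy pairs with sentence distances 0, 3 and 12.
anaphor = {"text": "hon", "sent": 12}
pairs = [
    (anaphor, {"text": "Ebba Lindsö", "sent": 12}),
    (anaphor, {"text": "Svenskt Näringsliv", "sent": 9}),
    (anaphor, {"text": "onsdagen", "sent": 0}),
]
print(len(linear_k(pairs, 10)), len(linear_k(pairs, 2)))
```

Because the filter looks only at distance, it places no restriction on the NP type of the candidate antecedent, which is the point of contrast with the Accessibility-based filters.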
Table 5.3: Positive (i.e., coreferent) and negative candidate anaphor-antecedent pairs for each NP type before and after application of instance selection on the training data set: number of negative instances, and percentage of negatives compared to all negatives in the complete training data set (columns 2 and 3); number of positive instances, and percentage of positives compared to all positives in the complete training data set (columns 4 and 5); total number of instances, and percentage of instances compared to all instances in the complete data set (columns 6 and 7).
NE-filter 6,820 21.1
Instance selection strategies for Named Entities
The task of identifying coreferent NEs is primarily a task of recognizing different name variants: NEs with or without modifiers, substrings of an NE (e.g., a last name or a first name), and acronyms. There are also cases of NEs being subsequent-mentions of definite descriptions, or pronouns (i.e., cataphora), but such cases are rare.
Subsequent-mention low accessibility markers such as full proper names and names with modifiers can be described as instances of repetition (or reintroduction) rather than anaphora in that the subsequent-mention NE does not
(The coreference resolution task is of course still to recognize which NEs are coreferent and which are not.)
When discussing resolution of coreferent subsequent-mention NEs, we use the term anaphor even though they may not be anaphoric in the strict sense.
Table 5.4: Coverage of the total number of anaphors belonging to each NP type in the complete training data set for each NP type, and in the versions of the training data after instance selection; by ‘cover’ we mean to include the anaphor and at least one coreferent antecedent in the data set.
As typically uniquely referring expressions within the discourse, NEs can display long-distance coreference relations (see e.g., (Vieira and
more than 90% of all subsequent-mention NEs in the complete training data set, we use 20
(covering 99% of all subsequent-mention NEs) and 10 sentences (covering
used by e.g., Hendrickx et al. (2007).
We will contrast the outcomes of this basic, linear-k instance selection strategy (with k set to 20 and 10 sentences, respectively) to that of an instance selection method (called “NE-filter”, below) based on the characteristics of NEs as a class of NPs, and the cognitive status of the preceding NPs described by
Ariel (1990) as accessibility.
We use the paragraph as a delimiter, as it constitutes a more flexible, and linguistically and cognitively plausible, distance measure than a fixed sentence limit.
By cover, we mean to include both the subsequent-mention NE and at least one coreferent antecedent in the data set.
The Accessibility theory-based instance selection method for NEs is used to select the most likely antecedents for each anaphor, based on the following strategy:
• Subsequent-mention NEs that are similar (defined as string similarity after removal of e.g., tight appositives) to previous NPs are treated as repetitions and allowed these preceding NPs as training instances regardless of distance.
• As NEs are categorised as low accessibility markers based on the corpus studies by Ariel (1990), they are paired with other low accessibility markers in the immediate context, defined as the current and the two preceding paragraphs:
    • In the current paragraph, NEs are paired with all preceding other NEs (in order to identify acronyms), and definite descriptions. Preceding indefinite lexical NPs, or high accessibility markers such as pronouns are not allowed as candidate antecedents,
    • In the immediately preceding paragraph, other NEs, complex NPs and multi-word NPs are allowed as candidate antecedents,
    • In the second preceding paragraph, other NEs and complex NPs are allowed as candidate antecedents.
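The strategy above can be sketched as a single predicate over an anaphor and a candidate antecedent. The dictionary fields (`text`, `type`, `structure`, `para`) and the function names are hypothetical, and `similar` is only a placeholder (a case-insensitive substring test) standing in for the string similarity measure, which is not specified in detail here.

```python
def similar(a, b):
    # Placeholder similarity: case-insensitive substring match (an assumption).
    a, b = a.lower(), b.lower()
    return a in b or b in a


def ne_filter(anaphor, candidate):
    """Decide whether `candidate` is kept as an antecedent for an NE anaphor."""
    if similar(anaphor["text"], candidate["text"]):
        return True  # repetitions are allowed regardless of distance
    distance = anaphor["para"] - candidate["para"]
    is_ne = candidate["type"] == "NE"
    if distance == 0:  # current paragraph: other NEs and definite descriptions
        return is_ne or candidate["type"] == "defdesc"
    if distance == 1:  # preceding paragraph: NEs, complex and multi-word NPs
        return is_ne or candidate["structure"] in ("complex", "multi-word")
    if distance == 2:  # second preceding paragraph: NEs and complex NPs
        return is_ne or candidate["structure"] == "complex"
    return False


anaphor = {"text": "Lindsö", "type": "NE", "structure": "simple", "para": 5}
repetition = {"text": "Ebba Lindsö", "type": "NE", "structure": "multi-word", "para": 0}
pronoun = {"text": "hon", "type": "pronoun", "structure": "simple", "para": 5}
print(ne_filter(anaphor, repetition), ne_filter(anaphor, pronoun))
```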
The selection strategies described above result in the following versions of the NE training data set constructed for the task of identification of subsequent-mention coreferent NEs:
1. NE-20: In the least restricted, basic instance selection strategy, all NEs are matched against all NPs in the 20 previous sentences, covering
99% of all subsequent-mention coreferent NEs in the training data
of all negatives, and the set of positives to 86% in the complete NE
2. NE-10: All NEs are matched against all NPs in the 10 previous sentences, covering 95% of all subsequent-mention coreferent NEs in the training data. The set of negatives is reduced to 66% of all negatives, and the set of positives to 62% in the complete NE training data set.
3. NE-filter: All NEs are matched against all other previous NPs according to a) repeated or partially repeated form, and b) the accessibility of the antecedent, covering 97% of all subsequent-mention coreferent NEs in the training data. The set of negatives is reduced to 21% of all negatives, and the set of positives to 69% in the complete NE training data set.
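The coverage and reduction figures reported for each version of the data set can be computed along the following lines. This is a sketch under assumptions: instances are represented as hypothetical `(anaphor_id, antecedent_id, label)` tuples, and the function names are invented for the example.

```python
def coverage(full, selected):
    """Share of resolvable anaphors that keep at least one coreferent antecedent."""
    def resolvable(instances):
        return {anaphor for anaphor, _antecedent, label in instances if label}
    all_resolvable = resolvable(full)
    return len(resolvable(selected) & all_resolvable) / len(all_resolvable)


def retained(full, selected, label):
    """Share of positive (label=True) or negative (label=False) instances kept."""
    total = sum(1 for _a, _b, lab in full if lab == label)
    kept = sum(1 for _a, _b, lab in selected if lab == label)
    return kept / total


# Toy data: three anaphors (3, 4, 5), four positive and three negative pairs.
full = [(3, 1, True), (3, 2, False), (4, 1, False), (4, 2, True),
        (4, 3, False), (5, 1, True), (5, 3, True)]
selected = [(3, 1, True), (4, 3, False), (5, 3, True)]
print(coverage(full, selected), retained(full, selected, True))
```

Coverage and retention of positives can differ, as they do in the figures above: an anaphor remains covered as long as one of its coreferent antecedents survives selection, even if its other positive pairs are removed.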
Instance selection strategies for definite descriptions
Most of the NPs in the data used in these experiments are classified as lexical
NPs, and consequently this NP type makes up the largest data set after
We use three instance selection strategies for definite descriptions. In the first two, based on linear distance, definite descriptions are matched against all NPs within two restricted search scopes (15 previous sentences and 10 previous sentences), which both cover more than 90% of all anaphoric definite
The outcomes of this basic, linear-k selection strategy are compared to that of a strategy based on linguistic and cognitive studies on anaphora (called
Resolution of definite descriptions can be described as the dual task of recognition of repetitions (i.e., same-head anaphora (Vieira and Poesio 2000)) and redescriptions (i.e., different-head anaphora, e.g., synonyms, hyperonyms, and definite descriptions with NE antecedents). Our instance selection strategy allows for repetitions regardless of distance in the document (akin to the NE-filter strategy, above) based on findings in corpus studies by e.g., Vieira and Poesio (2000) on anaphora in English and by Fraurud (1992) on definite NPs in Swedish.
Candidate redescriptions are selected based on Accessibility theory (Ariel
1990), where definite descriptions are categorised as low accessibility markers, with ranked subcategories based on NP complexity; that is, complex NPs and multi-word NPs are less accessible than non-complex NPs. These subcategories are allowed different search scopes. When determining the scope, both the form of the anaphor and of the antecedent are taken into account. Candidate anaphor-antecedent pairs are filtered according to the following strategy:
• Within the current paragraph, separate strategies are used for complex and non-complex anaphors:
    • For complex NPs, all preceding NEs, other complex NPs, and multi-word NPs are allowed as candidate antecedents (i.e., complex NPs are not allowed candidate antecedents of lower accessibility).
    • For non-complex NPs, all types of preceding NPs are allowed as candidate antecedents.
• In the preceding paragraph, low accessibility NPs (defined as NEs, complex NPs, and multi-word NPs) are allowed as candidate antecedents.
• In the second preceding paragraph, only low accessibility NPs (defined as NEs and complex NPs) are allowed as candidate antecedents.
• Regardless of distance, repetitions are allowed; that is, all preceding
NPs that are similar to the anaphor (i.e., have the same head) within the document are allowed as candidate antecedents.
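The filtering rules for definite descriptions can be sketched in the same way as for NEs. The field names (`head`, `type`, `structure`, `para`) and the function name are hypothetical, and the sketch assumes that ‘complex anaphor’ means an NP whose structure is categorized as complex, with multi-word anaphors treated as non-complex.

```python
def defdesc_filter(anaphor, candidate):
    """Decide whether `candidate` is kept for a definite description anaphor."""
    if anaphor["head"] == candidate["head"]:
        return True  # same-head repetitions are allowed regardless of distance
    distance = anaphor["para"] - candidate["para"]
    is_ne = candidate["type"] == "NE"
    low_accessibility = is_ne or candidate["structure"] in ("complex", "multi-word")
    if distance == 0:
        # Complex anaphors only take low accessibility antecedents.
        return low_accessibility if anaphor["structure"] == "complex" else True
    if distance == 1:
        return low_accessibility
    if distance == 2:
        return is_ne or candidate["structure"] == "complex"
    return False


simple_anaphor = {"head": "boken", "type": "defdesc", "structure": "simple", "para": 6}
complex_anaphor = {"head": "huset", "type": "defdesc", "structure": "complex", "para": 6}
pronoun = {"head": "den", "type": "pronoun", "structure": "simple", "para": 6}
print(defdesc_filter(simple_anaphor, pronoun), defdesc_filter(complex_anaphor, pronoun))
```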
Using the strategies outlined above, the following versions of the complete definite description training data set are formed:
1. DefDesc-15: All definite descriptions are matched against all NPs in the 15 previous sentences, covering 96% of all anaphoric definite
atives are reduced to 81% of all negatives, and the positives to 88% of
2. DefDesc-10: All definite descriptions are matched against all NPs in the 10 previous sentences, covering 91% of all anaphoric definite descriptions in the training data. The negatives are reduced to 66% of all negatives, and the positives to 77% of all positives in the complete DefDesc training data set.
3. DefDesc-filter: All definite descriptions are matched against a) all previous similar NPs (allowing full and partial repetitions regardless of distance), b) NPs within the current and the previous two paragraphs according to rules based on Accessibility theory, thus covering 94% of all anaphoric definite descriptions in the training data.
The negatives are reduced to 22% of all negatives, and the positives to 73% of all positives in the complete DefDesc training data set.
Instance selection strategies for pronouns
Agreement in person, number, and gender is important to pronoun resolution, and psycholinguistic studies show that such information guides the processing; studies also show that links to more accessible referents take less time to process: e.g., a pronoun with one referent rather than several, with proximal referents rather than distant, and with topical concepts rather than less topical (Sanders and Spooren 2007).
In many approaches to knowledge-based pronoun resolution for English, agreement between the anaphor and the candidate antecedent has been used as a hard constraint in order to filter out unlikely candidates. The same constraint has been applied for instance selection in data-driven approaches (see
In these experiments we will use agreement as a feature during classification rather than as a hard constraint, because we want to allow for coreference links between e.g., utrikesministern (‘the secretary of state’+uter), statsrådet
(‘cabinet member’+neuter), and hon (‘she’), or between regeringen (‘the cabinet’+sg) and de (‘they’+pl). Additional motivation for not using agreement as a hard constraint is that it requires preprocessing with a very high level of accuracy, or it will potentially lead to a loss in recall.
Corpus studies show that the majority of all anaphoric pronouns can be resolved to an antecedent within a short distance from the anaphor, typically a few sentences (see e.g., (McEnery et al. 1997) for English, (Fraurud 1992) for
Swedish). Based on this knowledge, a common strategy for instance selection of pronouns, in both knowledge-based and data-driven approaches to pronoun resolution, is to restrict the search scope to the immediate context, e.g., three
(Mitkov 2002, Klenner 2007, Klenner and Ailloud 2009), or five preceding sentences (Uryupina 2004), a so-called linear-k strategy.
The basic strategy used here (resulting in training data sets called “Pronoun-5” and “Pronoun-3”, below) is to restrict the search scope to five and three sentences, respectively, thus covering at least one antecedent for 100% and 99%, respectively, of all anaphoric pronouns in the complete training data.
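A linear-k strategy of this kind can be sketched in a few lines. The function below is an illustration only: the name `linear_k_pairs` and the representation of NPs as a list of sentence indices are assumptions, as is the choice to include NPs that precede the anaphor within its own sentence.

```python
from typing import List, Tuple

def linear_k_pairs(np_sentences: List[int], anaphor_idx: int, k: int) -> List[Tuple[int, int]]:
    """Pair the anaphor with every preceding NP at most k sentences back.

    np_sentences[i] is the sentence index of the i-th NP in document order.
    """
    anaphor_sentence = np_sentences[anaphor_idx]
    return [
        (anaphor_idx, i)
        for i in range(anaphor_idx)
        if anaphor_sentence - np_sentences[i] <= k
    ]
```

With `k=3` and `k=5` this yields candidate sets corresponding to the Pronoun-3 and Pronoun-5 scopes.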
The outcomes of this linear-k strategy are contrasted to that of an attempt to improve on the classification results by applying different selection strategies for different types of pronouns, based on corpus studies on anaphora by
Fraurud (1992), Ariel (1990), and McEnery et al. (1997).
The pronoun-type specific instance selection strategy (called “Pronoun-filter”, below) is based on the characteristics of each pronoun, for example whether the pronoun is categorised as syntactically bound or free, or anaphoric or deictic. The filter also takes into account whether the pronoun can be classified as animate, inanimate, or ‘mixed’ (i.e., can denote either an animate or inanimate referent). Finally, the selection method makes use of the grammatical function (subject, object, or genitive) of the pronoun; where grammatical function cannot be determined from the form of the pronoun (i.e., for the third person inanimate pronouns den, ‘it’+uter, and det, ‘it’+neuter), it relies on the syntactic analysis by MaltParser.
Based on these characteristics, we apply an instance selection method where different search scopes (i.e., windows within which all NPs are regarded as candidate antecedents) are allowed for different types of pronouns. The scope is measured in the number of sentences, or the number of paragraphs between the anaphor and the antecedent. The search space of each pronoun is restricted as follows:
• All grammatical forms of deictic pronouns, e.g., vi (‘we’), oss (‘us’), vår
(‘our’) are allowed a scope consisting of the same or the immediately preceding paragraph.
• Free anaphors are grouped according to whether they typically refer to animate or inanimate referents, or if they can refer to both animate and inanimate referents:
• Animate 3rd person singular pronouns in subject form, e.g., hon (‘she’): the same or the immediately preceding paragraph.
Animate 3rd person singular pronouns in object form, e.g., henne (‘her’): within the same paragraph.
Animate 3rd person singular pronouns in genitive form, e.g., hennes (‘her’): within the same paragraph.
• Inanimate 3rd person singular pronoun den (‘it’+uter) is restricted to the same and the two preceding sentences if analyzed as a subject by MaltParser, or else restricted to the current and the immediately preceding sentence.
Inanimate 3rd person singular pronoun det (‘it’+neuter) is restricted to the same and the immediately preceding sentence if analyzed as a subject, or else restricted to the current sentence (the shorter scope, relative to den, is due to the fact that det also can function as an expletive).
Inanimate 3rd person singular genitive/possessive dess
(‘its’) is restricted to the same sentence.
• Mixed 3rd person plural pronoun in subject form de (‘they’) is allowed a scope of the same and the two preceding sentences.
Mixed 3rd person plural pronoun in object form dem
(‘them’): the same and the preceding sentence.
Mixed 3rd person plural genitive deras (‘their’): the same and the preceding sentence.
• Syntactically bound anaphors, e.g., 3rd person reflexive sig (‘himself/herself/itself, themselves’) are not allowed to function as antecedents, and as anaphors their scope is restricted to the current sentence.
• Demonstratives, e.g., denna (‘this’+uter), detta (‘this’+neuter), dessa
(‘these’), which according to Accessibility theory are intermediate accessibility markers (Ariel 1990), are allowed a scope of the current and the immediately preceding paragraph if analyzed as a subject, or else a scope of the current or the immediately preceding sentence.
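The pronoun-specific scopes listed above amount to a lookup table from pronoun class and grammatical function to a window size. The sketch below encodes that table; the class labels, the `SCOPES` dictionary, and the function names are illustrative assumptions, and the restriction that reflexives cannot serve as antecedents is not modelled.

```python
from typing import Tuple

# (unit, n): the candidate may be in the current unit or up to n units back.
SCOPES = {
    ("deictic", "any"): ("paragraph", 1),
    ("animate_sg", "subject"): ("paragraph", 1),
    ("animate_sg", "object"): ("paragraph", 0),
    ("animate_sg", "genitive"): ("paragraph", 0),
    ("den", "subject"): ("sentence", 2),
    ("den", "other"): ("sentence", 1),
    ("det", "subject"): ("sentence", 1),
    ("det", "other"): ("sentence", 0),
    ("dess", "any"): ("sentence", 0),
    ("mixed_pl", "subject"): ("sentence", 2),
    ("mixed_pl", "object"): ("sentence", 1),
    ("mixed_pl", "genitive"): ("sentence", 1),
    ("reflexive", "any"): ("sentence", 0),
    ("demonstrative", "subject"): ("paragraph", 1),
    ("demonstrative", "other"): ("sentence", 1),
}

def scope_for(pronoun_class: str, function: str) -> Tuple[str, int]:
    """Look up the search scope, falling back to 'other' and then 'any'."""
    for key in ((pronoun_class, function), (pronoun_class, "other"), (pronoun_class, "any")):
        if key in SCOPES:
            return SCOPES[key]
    raise KeyError((pronoun_class, function))

def in_scope(pronoun_class: str, function: str, sentence_dist: int, paragraph_dist: int) -> bool:
    """True if a candidate at the given distances falls inside the pronoun's window."""
    unit, n = scope_for(pronoun_class, function)
    dist = sentence_dist if unit == "sentence" else paragraph_dist
    return 0 <= dist <= n
```

Keeping the scopes in a table rather than in nested conditionals makes it easy to compare alternative window sizes in an experiment.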
Three versions of the complete pronoun training data sets are created by applying the instance selection techniques described above:
1. Pronoun-5: All pronouns are matched against all NPs in the five preceding sentences, covering 100% of all anaphoric pronouns in the training data. The negatives are reduced to 30% of all negatives, and the positives are reduced to 36% of all positives in the complete pronoun training data set.
2. Pronoun-3: All pronouns are matched against all NPs in the three preceding sentences, covering 99% of all anaphoric pronouns in the training data. The negatives are reduced to 20% of all negatives, and the positives to 28% of all positives in the complete pronoun training data set.
3. Pronoun-filter: All pronouns are matched against all preceding NPs according to the constraints described above, covering 97% of all
anaphoric pronouns in the training data, and reducing the negatives to 11% of all negatives, and the positives to 23% of all positives in the complete pronoun training data set.
Construction of the feature set
For each instance, consisting of a candidate anaphor-antecedent pair, we need a set of features for classification. Features can describe some characteristic of the anaphor or the antecedent (e.g., natural gender or grammatical function), or compare the anaphor to the antecedent (e.g., if the baseforms of the head words of the two markables are identical, or if the two markables occur in the same sentence). Below, the former are called descriptive features, and the latter comparative features.
A number of information sources are used to derive the features described below:
• Morpho-syntactic information is provided by the Granska Tagger
• The dependency analysis by MaltParser is used to construct the syntactic features, the local context features, and the NP complexity features.
• String matching is determined by a set of regular expressions, and by estimating the minimum edit distance (Jurafsky and Martin 2000).
• Animacy information for definite descriptions is derived from the definitions in the Swedish lexicon Lexin, available from
• Acronyms are recognized by a simple rule-based acronym finder; e.g., H&M is matched to Hennes & Mauritz.
• Semantic relatedness is estimated using two resources:
• The general synonymy lexicon SynLex, consisting of 25,000 word pairs graded according to semantic relatedness, freely available from CSC, KTH Royal Institute of Technology (Kann and Rosell 2006),
• Word-Space models of semantically related words in texts within this domain (Nilsson and Hjelm 2009).
The features in the complete feature set are described below. The most informative features for identification of coreferent NEs, resolution of definite
Throughout this section, the term antecedent is used instead of the correct, but more cumbersome term candidate antecedent.
Granska, URL: http://www.csc.kth.se/tcs/humanlang/tools.html
MaltParser, Växjö University. URL: http://w3.msi.vxu.se/~nivre/research/MaltParser.html
Språkbanken, Gothenburg University. URL: http://spraakbanken.gu.se/
SynLex, KTH CSC. URL: http://lexin2.nada.kth.se/synlex.html
The positional features describe the location of the anaphor and the location of the antecedent within the document, the current paragraph, and the current sentence. The underlying motivation for these features is to approximate prominence within the document, within each paragraph, and within each sentence. Two features describe whether or not each NP is located in one of two specific types of text segments: the title of the text, or a quotation. The following positional features are included:
Descriptive positional features
• (1, 2) The position of the anaphor/antecedent within the document, expressed as the index number of the current paragraph counted from the beginning of the document. (Value: 1 .. the number of paragraphs in document.)
• (3, 4) The position of the anaphor/antecedent within the current paragraph expressed as the index number of the NP counted from the beginning of the paragraph. (Value: 1 .. the number of NPs in paragraph.)
• (5, 6) The antecedent/anaphor is the left-most NP in the current sentence, approximating prominence (that is, assuming that the most important information is presented first). (Value: y/n.)
• (7, 8) The anaphor/antecedent is located in the title of the text; the motivation for this feature is that the most important concepts in the text are likely to be mentioned in the headline (cf., “textual designators” in (Bergler and Knoll 1996)). (Value: y/n.)
• (9, 10) The anaphor/antecedent occurs within a quotation, or directly adjacent to/within the same sentence as a quote, e.g., both pronouns det (‘it’) and han (‘he’) in – Det är korrekt, säger han. (‘– That is accurate, he says.’). (Value: y/n.)
The comparative positional features describe the location of the antecedent in relation to that of the anaphor; the discourse segments we use are sentences, paragraphs, and quotations. There are also two features describing distance in terms of the number of intervening NPs, and the number of intervening sentences.
Comparative positional features
• A set of three features compares the location of the anaphor and the candidate antecedent: (11) are the two NPs located in the same sentence (value: y/n), (12) in adjacent sentences (value: y/n), and/or (13) in the same paragraph (value: y/n).
• (14) Both the anaphor and the antecedent are located within the same quotation. (Value: y/n.)
• The distance between the anaphor and the antecedent estimated as
• (15) the number of intervening NPs. (Value: 1 .. the number of NPs in document-1.)
• (16) the number of intervening sentences. (Value: 0 .. the number of sentences in document-1.)
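Given running indices for NPs, sentences, and paragraphs, features (11) to (13), (15), and (16) reduce to simple comparisons and differences. The sketch below is illustrative: the class `Markable` and the feature names are assumptions, feature (14) is omitted because it requires quotation boundaries, and feature (15) is computed as the difference in NP index, which is one reading consistent with a value range starting at 1.

```python
from dataclasses import dataclass

@dataclass
class Markable:
    np_index: int    # running index of the NP in the document
    sentence: int    # running index of the sentence in the document
    paragraph: int   # index of the paragraph

def yn(flag: bool) -> str:
    return "y" if flag else "n"

def comparative_positional(anaphor: Markable, antecedent: Markable) -> dict:
    """Comparative positional features for one anaphor-antecedent pair."""
    sent_dist = anaphor.sentence - antecedent.sentence
    return {
        "same_sentence": yn(sent_dist == 0),                              # (11)
        "adjacent_sentences": yn(sent_dist == 1),                         # (12)
        "same_paragraph": yn(anaphor.paragraph == antecedent.paragraph),  # (13)
        "np_distance": anaphor.np_index - antecedent.np_index,            # (15)
        "sentence_distance": sent_dist,                                   # (16)
    }
```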
The morpho-syntactic features describe the head word of both the antecedent and the anaphor in terms of e.g., grammatical gender, number, and definiteness. There is also a feature for pronouns, specifying the pronoun type, and for some pronoun types, the form of the pronoun. The following descriptive features are included:
Descriptive morpho-syntactic and lexical features
• (17, 18) The part-of-speech of the head word of the anaphor/antecedent according to the analysis of the Granska POS-tagger.
• (19, 20) The grammatical gender of the head word of the anaphor/antecedent according to the Granska analysis. (Value: uter, neuter, uter/neuter, unknown.)
• (21, 22) The number of the head word of the anaphor/antecedent according to the Granska analysis. (Value: sg, pl, sg/pl, unknown.)
• (23, 24) The definiteness of the head word of the anaphor/antecedent according to the Granska analysis. (Value: def, indef, def/indef.)
• (25, 26) The case of the anaphor/antecedent. (Value: sub, obj, sub/obj, nom, gen, unknown.)
• (27, 28) The type of pronoun of the anaphor/antecedent. Pronouns are categorized as either subject (e.g., hon, ‘she’), object (e.g., henne, ‘her’), genitive (e.g., hennes, ‘her’), possessive (e.g., min, ‘my’), demonstrative (e.g., denna, ‘this’), distributive (e.g., vardera, ‘each’), or reciprocal (e.g., varandra, ‘each other’). (Value: subjPN, objPN, genPN, possPN, demPN, eachPN, reciPN, and na for not applicable.)
• (29, 30) The NP type of the anaphor/antecedent, according to the part-of-speech of the head word and the definiteness of the NP. This feature distinguishes between definite descriptions (feature value: defDesc), indefinite lexical NPs (value: indDesc), cardinal numbers (value: CN), ordinal numbers (value: ON), personal pronouns (value: defPron), indefinite pronouns (value: indPron), NEs (value: NE), or unknown.
The comparative features describe the anaphor and the antecedent in terms of the compatibility of their respective morpho-syntactic features:
Comparative morpho-syntactic and lexical features
• (31) Gender agreement between the anaphor and the antecedent, both in terms of grammatical and natural gender. (Value: y/n/na. The value
‘n’ is reserved for genuine disagreement, e.g., neuter – uter; when the gender of one, or both of the NPs is unknown, ‘na’ is used. If an NP does not have a grammatical gender, e.g., in the case of NEs, ‘na’ is used.)
• (32) Number agreement between the anaphor and the antecedent.
(Value: y/n/na. The value ‘n’ is reserved for disagreement, e.g., singular – plural; when the number of one of the NPs is unknown, ‘na’ is used.)
• (33) Definiteness agreement between the anaphor and the antecedent.
(Value: y/n/na. The value ‘n’ is reserved for disagreement, e.g., indefinite – definite; when the definiteness of one of the NPs is unknown, ‘na’ is used.)
• (34) Morphological agreement in terms of grammatical gender and number between the anaphor and the antecedent; a complete match of these features is required for a positive value. (Value: y/n.)
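The three-valued agreement scheme (y/n/na) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, underspecified tags such as uter/neuter or sg/pl are treated as compatible with either alternative (the text does not say how they are handled), and natural gender is not modelled.

```python
def agreement(a: str, b: str) -> str:
    """Three-valued agreement: 'y', 'n', or 'na' when a value is unknown or absent."""
    if a in ("unknown", "na") or b in ("unknown", "na"):
        return "na"
    # Underspecified tags such as 'uter/neuter' or 'sg/pl' are split into alternatives.
    return "y" if set(a.split("/")) & set(b.split("/")) else "n"

def morphological_match(gender_a: str, gender_b: str, number_a: str, number_b: str) -> str:
    """Feature (34): positive only when gender and number both agree."""
    both = agreement(gender_a, gender_b) == "y" and agreement(number_a, number_b) == "y"
    return "y" if both else "n"
```

Because agreement is encoded as a feature rather than applied as a filter, a pair such as regeringen (sg) and de (pl) receives the value ‘n’ for number agreement but remains available to the classifier.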
The syntactic features describe the syntactic function of both the candidate antecedent and the anaphor, expressed as dependency relations. The features also describe the internal syntax of each NP in terms of complexity; the inclusion of these features is based on findings that complex NPs (when occurring as a candidate anaphor) are more likely to be antecedentless (see e.g., (Fraurud 1992) and (Vieira and Poesio 2000)), and that low accessibility markers such as complex NPs are more likely to be antecedents (Ariel 1990).
Descriptive syntactic features
• (35, 36) The dependency relation of the head word of the anaphor/antecedent according to the MaltParser analysis, e.g., subject (SUB), object (OBJ), predicative complement (PRD), apposition (APP), second conjunct (CC), complement of preposition
(PR), or attribute/adnominal modifier (ATT). (Value: the dependency relation of head word according to the MaltParser analysis.)
• (37, 38) Complexity of the anaphor/antecedent, categorized as single word NPs, e.g., huset (‘the house’), multiple (content) word NPs, e.g., det vita huset (‘the white house’), or nested NPs, e.g., huset på berget
(‘the house on the hill’) or hennes hus (‘her house’). (Value: ‘single’,
• (39, 40) The antecedent/anaphor is a complex NP with a nested preposition phrase, e.g., huset på berget (‘the house on the hill’).
• (41, 42) The antecedent/anaphor is a complex NP with a genitive/possessive modifier, e.g., hennes hus (‘her house’). (Value: y/n.)
There are two comparative syntactic features; the first checks for syntactic parallelism between the anaphor and the antecedent (i.e., whether the respective syntactic function of (the head word of) the anaphor and the antecedent are the same), and the second marks whether the anaphor is dependent on the same word as the antecedent, according to the MaltParser dependency analysis. The second feature can capture those instances where the anaphor and the antecedent are conceptually connected, e.g., through a verb (cf., “trigger families” in (McCarthy and Lehnert 1995)).
Comparative syntactic features
• (43) Syntactic parallelism (in terms of the dependency relations of the head words) between the anaphor and the antecedent, e.g., both NPs are the grammatical subject. (Value: y/n.)
• (44) The anaphor is dependent on the same word as the antecedent, according to the MaltParser analysis. (Value: y/n.)
Local context features
The local context features consist of e.g., an appositive NP of the antecedent/anaphor, and of the word on which the antecedent/anaphor is dependent, according to the output of the MaltParser analysis.
Descriptive context features
• (45, 46) The maximal string of the anaphor/antecedent, if the NP is a pronoun. This feature is intended for pronouns only, as this is a heterogeneous but closed word-class. (Value: string.)
• (47, 48) The appositive NP of the anaphor/antecedent (if any), e.g., flygbolaget in flygbolaget SAS (‘the airline SAS’). (Value: string, or
‘na’ if there is no appositive NP.)
• (49, 50) The word the anaphor/antecedent is dependent on according to the MaltParser analysis. The motivation for this feature is partly that the verb can tell us something about the NP, e.g., säger (‘says’) implicates animacy while såldes (‘was sold’) implicates inanimacy, but also that propositional pronouns can be recognized through the verb. Based on a corpus study, Fraurud (1992) lists phrases such as e.g., det innebär (‘it means’), det medför (‘it entails’), det leder till
(‘it leads to’). (Value: string, or ‘na’ if not applicable.)
• (51, 52) The part-of-speech of the word the anaphor/antecedent is dependent on according to the MaltParser analysis. (Value: string)
String similarity features
The string similarity features capture different types of similarity between
NPs, e.g., full or partial matching between the antecedent and the anaphor, or the minimum edit distance (i.e., the number of character editing operations needed in order to transform one string of characters into another). The motivation for these features is the strong tendency of both NEs and definite descriptions for (full or partial) repetitions (same head anaphora, see e.g., (Vieira and Poesio 2000, Garera and Yarowsky 2006)).
The minimal string is defined as the head word in the case of definite descriptions and pronouns, and as the complete name of NEs excluding e.g., tight appositives. The maximal string is defined as the complete markable, including e.g., modifiers and tight appositives. The baseform of the minimal string is obtained from the Granska analysis output.
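The minimum edit distance referred to above can be computed with the standard dynamic-programming recurrence. The sketch below assumes unit cost for insertion, deletion, and substitution, which matches a count of editing operations; the function name is illustrative, and other formulations assign substitution a higher cost.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit cost for insertion, deletion, and substitution."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]
```

For instance, bolaget and flygbolaget are four insertions apart, while identical strings have distance 0.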
Comparative similarity features
• (53) The anaphor is a repetition of the maximal string of the antecedent, i.e., the anaphor and the antecedent are identical, e.g., flygbolaget (‘the airline’) – flygbolaget are identical, but det franska fullservicebolaget – fullservicebolaget (‘the French full service airline’) are not. (Value: y/n.)
• (54) The anaphor is a repetition of the minimal string of the antecedent, i.e., the anaphor is identical to the antecedent head word, e.g., flygbolaget – det franska flygbolaget. (Value: y/n.)
• (55) The baseform of the anaphor is identical to the baseform of the minimal string of the antecedent, e.g., det franska fullservicebolaget
(baseform: fullservicebolag) – ett fullservicebolag (‘a full service airline’) . (Value: y/n.)
• A set of six features allows for substring matching between the anaphor and the antecedent, as Swedish is a compounding language:
• (56, 57) The minimal string of the anaphor/antecedent is a substring of the antecedent/anaphor maximal string, e.g., det konkurshotade bolaget (min: bolaget) – flygbolaget.
• (58, 59) The maximal string of the anaphor/antecedent is a substring of the antecedent/anaphor maximal string, e.g., bolaget – flygbolaget. (Value: y/n.)
• (60, 61) The baseform of the anaphor/antecedent is a substring of the antecedent/anaphor maximal string, e.g., bolaget (baseform: bolag) – ett fullservicebolag. (Value: y/n.)
• A set of three features estimates how similar the maximal strings, minimal strings and baseform strings of the anaphor and the antecedent are, based on the number of editing operations (insertion, deletion or substitution of a character) needed in order to transform one of the strings into the other. This measure of string similarity (or distance) is called the minimum edit distance (Jurafsky and Martin 2000):
• (62) The minimum edit distance between the maximal strings of the anaphor and the antecedent. (Value: n, where n is the number of insertions, deletions, and substitutions of characters needed in order to achieve identical strings; and where the value 0 indicates that the strings are identical.)
• (63) The minimum edit distance between the minimal strings of the anaphor and the antecedent. (Value: n, see above.)
• (64) The minimum edit distance between the baseforms of the anaphor and the antecedent. (Value: n, see above.)
• (65, 66) An NP identical to the anaphor/antecedent is nested within the antecedent/anaphor, e.g., Sverige (‘Sweden’) – Sveriges huvudstad (‘The capital of Sweden’). Matching is performed between both surface forms and baseforms. (Value: y/n.)
• A set of six features allows for string matching between the appositive NP of the antecedent (if any) and the anaphor or vice versa, e.g., flygbolaget – flygbolaget SAS; this feature can help identify coreference between a definite description and an NE:
• (67, 68) The appositive NP of the anaphor/antecedent is equal to the antecedent/anaphor maximal string. (Value: y/n/na.)
• (69, 70) The appositive NP of the anaphor/antecedent is equal to the antecedent/anaphor minimal string. (Value: y/n/na.)
• (71, 72) The appositive NP of the anaphor/antecedent is equal to the antecedent/anaphor baseform. (Value: y/n/na, where ‘na’ stands for not applicable.)
• (73) Either the anaphor or the antecedent is an acronym of the other NP, e.g., HD and Högsta domstolen (‘The Supreme Court’). (Value: y/n.)
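The repetition and substring features (53), (54), and (56) to (59) can be sketched as plain string comparisons over the minimal and maximal strings. The function and feature names below are illustrative assumptions, as is the decision to compare lower-cased strings and to take the anaphor's maximal string as "the anaphor" in features (53) and (54).

```python
def yn(flag: bool) -> str:
    return "y" if flag else "n"

def substring_features(ana_min: str, ana_max: str, ante_min: str, ante_max: str) -> dict:
    """Repetition and substring features for one anaphor-antecedent pair."""
    a_min, a_max = ana_min.lower(), ana_max.lower()
    b_min, b_max = ante_min.lower(), ante_max.lower()
    return {
        "full_repetition": yn(a_max == b_max),        # (53)
        "head_repetition": yn(a_max == b_min),        # (54)
        "ana_min_in_ante_max": yn(a_min in b_max),    # (56)
        "ante_min_in_ana_max": yn(b_min in a_max),    # (57)
        "ana_max_in_ante_max": yn(a_max in b_max),    # (58)
        "ante_max_in_ana_max": yn(b_max in a_max),    # (59)
    }
```

Substring matching is what lets the head bolaget of det konkurshotade bolaget match the compound flygbolaget, even though neither NP repeats the other.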
The acronym feature (73) is determined by a simple rule-based approach, where any sequence of fewer than six upper-case characters is regarded as a potential acronym. That sequence is matched against any NP consisting of at least one string of more than six characters within the document.
In cases such as (5.2), the string FI can be recognized as an acronym of the full form NP Finansinspektionen based on the syntax of the clause. The acronym is then only allowed to match variants of that full form NP within the document.
(5.2) (a) Finansinspektionen (FI) har utrett ...
(‘The Financial Supervisory Authority (FI) has investigated ...’)
(b) Finansinspektionen, FI, har utrett ...
(‘The Financial Supervisory Authority, FI, has investigated ...’)
In order to recognize that e.g., AD is an acronym for the compound Arbetsdomstolen (‘the Swedish Labour Court’), linguistic knowledge in combination with domain-specific, lexical knowledge is used. In this case, matching is allowed because the first character in the acronym (A) matches the first character of the compound, and the second character in the acronym (D) matches domstolen. This decision is based on a rule that a string with an ‘s’ (a common Swedish interfix) can be decompounded (here: into Arbet#s#domstolen) if the suffix of the decompounded string (here: domstolen, ‘the court’) is listed as a typical “organization suffix” in the domain-dependent knowledge-base of the acronym finder. Once such a match has been found in the document, the acronym is only allowed to match with variations of that string (e.g., with or without a genitive suffix).
A combination of the rules described above is used to recognize that e.g., ARN is an acronym for Allmänna reklamationsnämnden – A matches Allmänna, R matches reklamation#s#nämnden, and N matches nämnden (‘the supervisor committee’) which is listed in the knowledge-base of the acronym finder.
As acronym matching is only allowed within documents, the accuracy of this feature is satisfactory, but there are limitations to this simple approach, for instance cases where there is no match between the first character in the acronym and the first character in the NP, e.g., FOI – Totalförsvarets forskningsinstitut (‘The Swedish Defense Research Agency’).
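The initial-letter matching and the decompounding rule can be sketched as below. This is a simplified illustration, not the acronym finder itself: `ORG_SUFFIXES` is a hypothetical two-item stand-in for the domain-specific knowledge base, the function names are assumptions, and neither the syntactic cue in (5.2) nor acronyms containing non-letters (such as H&M) are handled.

```python
from itertools import product

# Hypothetical stand-in for the finder's domain-specific list of organization suffixes.
ORG_SUFFIXES = ("domstolen", "nämnden")

def initial_options(word: str) -> list:
    """Possible initial-letter strings for one word of the full-form NP."""
    word = word.lower()
    options = [word[0]]
    for suffix in ORG_SUFFIXES:
        stem = word[: -len(suffix)]
        # Decompound at the interfix 's', e.g. arbet#s#domstolen.
        if word.endswith(suffix) and len(stem) > 1 and stem.endswith("s"):
            options.append(word[0] + suffix[0])
    return options

def is_acronym_of(acronym: str, full_form: str) -> bool:
    """True if the acronym can be read off the initials of the full-form NP."""
    words = full_form.split()
    # Fewer than six upper-case characters in the potential acronym.
    if not (acronym.isalpha() and acronym.isupper() and len(acronym) < 6):
        return False
    # The NP must contain at least one string of more than six characters.
    if not any(len(w) > 6 for w in words):
        return False
    candidates = ("".join(parts) for parts in product(*map(initial_options, words)))
    return acronym.lower() in candidates
```

As in the text, the sketch fails on FOI and Totalförsvarets forskningsinstitut, where the acronym does not start with the first character of the NP.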
The lexico-semantic features are used to capture relations between the anaphor and the antecedent, such as synonymy and hyperonymy. They also describe the antecedent and the anaphor in terms of e.g., animacy and natural gender.
Descriptive lexico-semantic features
• (74, 75) The natural gender of the anaphor/antecedent for pronouns and NEs; in case of NEs denoting humans this feature was determined by consulting lists of common male and female given names in Swedish and English. (Value: masc, fem, unknown.)
• (76, 77) Animacy (or rather: humanness) of the anaphor/antecedent.
For pronouns and NEs according to their type (e.g., animate pronouns, or NEs of the type PERSON), and for definite descriptions according to information on animacy extracted from definitions in the Swedish lexicon Lexin. (Value: anim, inanim, unknown.)
• (78, 79) The NE type of the anaphor/antecedent, according to the NE annotation. (Value: PERS, ORG, LOC, TRADEMARK, SERVICE,
• A set of six features is used for temporal expressions, describing if an anaphor/antecedent is denoting e.g., (80, 81) a month, (82, 83) a day of the week, or (84, 85) some other common temporal period. The motivation for these features is firstly that temporal expressions can corefer (e.g., ‘April’ and ‘last month’), and secondly that e.g., Vieira and Poesio (2000) report that temporal expressions are one common type of antecedentless definite NPs. (Value: y/n.)
The comparative lexico-semantic features capture agreement in animacy, NE type, and NP type. A set of two features is used to determine synonymy between the baseforms of the anaphor and the antecedent: the SynLex synonymy lexicon (Kann and Rosell 2006) is used to determine whether the minimal strings are synonyms, and if so, to what degree on a scale of 3.0 to 5.0.
In the SynLex project, word pairs were automatically constructed by lexicon look-up, and automatically refined using information on co-occurrence extracted from corpora. The refined set of word pairs was then voluntarily graded by users of an on-line dictionary, Lexin. The grading, on a scale from 0 (for non-related words) to 5 (for synonyms), was used to construct a list of 25,000 word pairs graded as semantically related (graded 3.0 to 3.9) and as synonymic (from grade 4.0 to 5.0) (Kann and Rosell 2006).
As SynLex is a general lexicon, and semantic relatedness between words often is domain- and context-dependent, we additionally use lexical knowledge modeled on domain-specific data. Word-space, or distributional, semantics can provide knowledge on lexico-semantic similarity in any language and domain, given appropriate text material. We use such models to derive a set of features meant to capture semantic similarity, e.g., synonymy and hyponymy, between two lexical NPs, and between lexical NPs and NEs (Nilsson and Hjelm 2009).
The construction of models for distributional semantics involves counting co-occurrences for the words in the model, and thus, such models can be classified into two groups depending on the type of co-occurrence used: first-order co-occurrence, where two words occur within the same text segment (e.g., the same document, or the same paragraph or sentence), and second-order co-occurrence, where two words co-occur with a third word (Manning and Schütze 1999). In the models used here, we use two scopes for counting co-occurrences: 1) words occurring within the same document, and 2) words occurring in a fixed-size sliding window (Nilsson and Hjelm 2009). Following Sahlgren (2006), we refer to the relations in the model constructed from first-order co-occurrence as syntagmatic, and to the relations in the model constructed from second-order co-occurrence as paradigmatic.
I thank my former colleague Hans Hjelm for coming up with the idea to combine our thesis subjects, and for creating the word-space models and the feature combinations.
Three word-space models of similarity are trained on a collection of about 1.5 million tokens from the same genre and domain (financial news in Swedish, collected from the Internet) as the coreference annotation data, including the annotated data. (As we do not use the coreference annotation when training the word-spaces this is not a case of training on the test data.)
Three models of similarity are created for each of the two types of co-occurrence: syntagmatic (document co-occurrence) and paradigmatic (co-occurrence within a window of three words). These models are constructed using the standard cosine similarity measure both on the vectors from the co-occurrence matrices (termed “plain” below), and on dimensionality-reduced vectors (using singular value decomposition, “SVD”) that are supposed to capture “latent” relations among words not directly accessible through the co-occurrence data. We also use the statistical measure mutual information (“MI”) on the co-occurrence matrices (Nilsson and Hjelm 2009).
From these three models of syntagmatic relations (termed “s-plain”, “s-SVD”, and “s-MI”), the score for each candidate anaphor-antecedent pair found within the model is added as a feature. The same is done for the three models of paradigmatic relations (termed “p-plain”, “p-SVD”, and “p-MI” below). From each of these models, we also extract two binary features for each candidate anaphor-antecedent pair. The first feature is positive only if the antecedent is the highest ranking coreference candidate of the anaphor, and the second is positive if the antecedent is among the top-10 candidates of the anaphor. Finally, we add a binary feature which is positive if the anaphor and the antecedent occur in the intersection of the syntagmatic and paradigmatic models.
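The difference between the two "plain" model types, and the score and rank features derived from them, can be illustrated on a toy corpus. The sketch below is not the word-space implementation of Nilsson and Hjelm (2009): all function names are assumptions, the window is taken to extend three words to each side, and the SVD and MI variants as well as the intersection feature are left out.

```python
import math
from collections import Counter, defaultdict

def syntagmatic_vectors(documents):
    """Word vectors over documents (first-order, document co-occurrence)."""
    vectors = defaultdict(Counter)
    for doc_id, tokens in enumerate(documents):
        for token in tokens:
            vectors[token][doc_id] += 1
    return vectors

def paradigmatic_vectors(documents, window=3):
    """Word vectors over neighbouring words within a sliding window."""
    vectors = defaultdict(Counter)
    for tokens in documents:
        for i, token in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def top_k(word, vectors, k=10):
    """The k words most similar to `word`, best first (ties broken alphabetically)."""
    target = vectors.get(word, Counter())
    scores = {w: cosine(target, vec) for w, vec in vectors.items() if w != word}
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

def rank_features(word, other, vectors, k=10):
    """Similarity score plus the two binary rank features for a pair of words."""
    ranking = top_k(word, vectors, k)
    return {
        "score": cosine(vectors.get(word, Counter()), vectors.get(other, Counter())),
        "is_top": bool(ranking) and ranking[0] == other,
        "in_top_k": other in ranking,
    }

documents = [["bolag", "aktie", "kurs"], ["företag", "aktie", "kurs"], ["bank", "ränta"]]
s_plain = syntagmatic_vectors(documents)
p_plain = paradigmatic_vectors(documents)
```

In this toy corpus bolag and företag never occur in the same document, so their syntagmatic similarity is zero, yet they share the same neighbours and therefore rank highest for each other in the paradigmatic model.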
Comparative lexico-semantic features
• (86) Agreement regarding the animacy of the anaphor and of the antecedent. (Value: y/n.)
• (87) The anaphor and the antecedent belong to the same NE type, e.g., ‘person’, ‘organization’, or ‘location’. Following Fliedner (2006), this feature also applies to lexical NPs by matching the head word against the semantic extension of the NE (ORG is extended with the synsets in Swedish WordNet for företag, organisation (‘company’, ‘organization’), and PERSON is extended with the synsets for människa, person (‘human being’,
• (88) The anaphor and the antecedent belong to the same category of
NPs, i.e., pronouns, NEs, or lexical NPs. (Value: y/n.)
Svenskt OrdNät, URL: http://www.lingfil.uu.se/ling/swn.html
• Two features describe whether the anaphor and the antecedent are semantically related, according to the SynLex lexicon:
• (89) There exists a relation of synonymy between the anaphor head word and the antecedent head word, according to the SynLex synonymy lexicon. (Value: y/n.)
• (90) The degree of synonymy between the anaphor and the antecedent head words, expressed as the SynLex synonymy score. (Value: n, where n is a score between 3.0 and 5.0.)
• A set of three features describes the syntagmatic relations between the anaphor and the antecedent, as captured by three different types of syntagmatic Word-Space models (based on document co-occurrence):
• (91) Model constructed using cosine on plain vectors (“s-plain-cosine”).
• (92) Model constructed using cosine on dimensionality-reduced vectors (“s-SVD-cosine”). (Value: 0.0 .. 1.0)
• (93) Model constructed using Mutual Information on plain vectors (“s-plain-MI”).
• A set of three features describes the paradigmatic relations between the anaphor and the antecedent, as captured by three different types of paradigmatic Word-Space models (based on window co-occurrence), as a “similarity” score:
• (94) Model constructed using cosine on plain vectors (“p-
• (95) Model constructed using cosine on dimensionality-
(Value: 0.0 .. 1.0)
• (96) Model constructed using MI on plain vectors (“p-plain-
MI”); for an example, see the model for undersökning (‘in-
• A set of three features marks whether the anaphor – antecedent pair is the highest ranking pair in each of the three syntagmatic models:
(97) s-plain-cosine, (98) s-SVD-cosine, and (99) s-plain-MI; row 1
• A set of three features marks whether the anaphor – antecedent pair is the highest ranking pair in each of the three paradigmatic Word-
Space models: (100) p-plain-cosine, (101) p-SVD-cosine, and (102)
• A set of three features describes if the anaphor is one of the top ten most similar words of the antecedent in each of the three syntagmatic
Word-Space models: (103) s-plain-cosine, (104) s-SVD-cosine, and
• A set of three features describes if the anaphor is one of the top ten most similar words of the antecedent in each of the three paradigmatic Word-Space models: (106) p-plain-cosine, (107)
• A set of three features describes whether the anaphor can be found in the intersection between the sets consisting of the top ten most similar words of the antecedent in the paradigmatic model and in the syntagmatic model:
• (109) The anaphor can be found in the intersection of the cosine paradigmatic (“p-plain”) and the cosine syntagmatic model (“s-plain”) of the antecedent (columns 1 and 2 in ta-
• (110) The anaphor can be found in the intersection of the MI paradigmatic (“p-MI”) and the MI syntagmatic model
• (111) The anaphor can be found in the intersection of the
SVD paradigmatic (“p-SVD”) and the SVD syntagmatic model (“s-SVD”) of the antecedent (columns 5 and 6 in
In this chapter, data preparation in terms of preprocessing, instance creation and instance selection, and the construction of the complete feature set were described. In the following chapters, the training and test data sets will be used in experiments on hybrid methods for coreference resolution: the training
6. Development of NP specific feature sets
In this chapter, experiments on linguistically motivated feature selection for classification of candidate anaphor-antecedent pairs are described. We introduce the machine-learning algorithm, Memory-Based Learning, and the learner, TIMBL, used for training classifiers for resolution of candidate anaphor-antecedent pairs. We also describe the measures used to evaluate the performance of the classifiers.
and the informativeness of the features for classification of each NP type is discussed. The outcome of these experiments is used as a baseline in the experiments with linguistically motivated feature selection for each of the three coreference resolution tasks: Identification of coreferent NEs, resolution of definite descriptions, and pronoun resolution. The NP type specific feature set
The general setup of our experiments is the following. For each of the three tasks we use three versions of the same data set, each constructed using different instance selection methods (e.g., for NEs we use three versions of the com-
Each data set is split in two parts: a training set used for development and
described in this chapter, and a held-out test set.
In this chapter we describe the development phase of different classifiers for each of the three tasks. During the development phase, the experiments are performed using n-fold cross-validation on the training data set, with n set to five. This means that each training data set is split arbitrarily, at the document level, into five subsets. Iteratively, each subset is used as a held-out test set while the remaining four subsets of the training data are used for training.
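The document-level split described above can be sketched as follows. This is a minimal illustration and not the procedure actually used in the experiments: the instance representation (a dict with a `doc_id` key), the function names, and the seeded shuffle are assumptions made for the example.

```python
import random
from collections import defaultdict


def document_level_folds(instances, n_folds=5, seed=0):
    """Assign whole documents to folds so that no document is split across folds.

    `instances` is a list of dicts with a hypothetical "doc_id" key.
    """
    doc_ids = sorted({inst["doc_id"] for inst in instances})
    rng = random.Random(seed)
    rng.shuffle(doc_ids)  # arbitrary assignment of documents to folds
    fold_of_doc = {doc: i % n_folds for i, doc in enumerate(doc_ids)}

    folds = defaultdict(list)
    for inst in instances:
        folds[fold_of_doc[inst["doc_id"]]].append(inst)
    return [folds[i] for i in range(n_folds)]


def cross_validation_splits(folds):
    """Yield (training instances, held-out instances), one pair per fold."""
    for i, held_out in enumerate(folds):
        train = [inst for j, fold in enumerate(folds) if j != i for inst in fold]
        yield train, held_out
```

Because documents differ in length, the folds produced this way contain different numbers of instances, which is why the fold sizes cannot be assumed to be equal when the results are averaged.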
In the five-fold cross-validation experiments described in this chapter, our focus is the selection of a suitable set of features for each task. The classifiers are compared to a baseline classifier using the same, basic feature set for each of the three tasks.
We do not have enough data to use separate data sets for development and training.
Memory-based learning with TIMBL for coreference resolution
For classification of the candidate anaphor-antecedent pairs, we use the Tilburg memory-based learner, TIMBL (Daelemans and Van den Bosch 2005). The general idea of memory-based learning (MBL) is learning as a cognitive task: when people learn, they do not extract a set of rules from the most typical and frequent experiences but rather store all experiences in memory, and solve new problems by reusing solutions from similar experiences. MBL is referred to as a lazy learning method because all data is kept for processing, whereas eager learning methods abstract from the frequent and typical.
Lazy learning methods such as MBL are suitable for a task like coreference resolution, where there are many exceptions to the most frequent cases and where ignoring such exceptional and infrequent examples can be harmful. MBL has been applied to coreference resolution (see, e.g., Hoste 2005; Hendrickx et al. 2008) and to pronoun resolution (see, e.g., Nøklestad 2009).
An MBL system consists of two components: a memory-based learning component and a similarity-based classification component. During learning, the learning component adds new examples (below referred to as training instances) to the memory without abstraction, selection or restructuring. The basis for the classification is the description of each new instance to be classified in comparison to the training instances in memory; in this case the in-
description of each such candidate anaphor-antecedent pair consists of a set of features which should ideally be as informative, relevant, and noise-less as possible. For MBL, the features are put together to form a feature vector, where each feature has a fixed slot. In order to provide training examples for the classifier, a feature vector describing each anaphor-antecedent pair is labeled with a class label; in this case coreferent or non-coreferent. Given a set of feature vectors, the machine learning algorithm induces a classifier that – given a sufficient number of good training examples – can decide whether two
NPs are coreferent or not (Daelemans and Van den Bosch 2005).
This decision is based on the similarity between the instance to be classified and all training instances in memory. Similarity is computed by using a metric that measures the distance between the unknown instance and all the training instances. Each new, unknown instance is classified as the most frequent class among the most similar examples, called the nearest neighbors. The number of nearest neighbors is expressed by k (Daelemans and Van den Bosch 2005).
In all experiments reported here, the TIMBL standard search algorithm IB1 is used in combination with the Overlap distance metric and the feature weighting measure Gain Ratio (see Daelemans and Van den Bosch (2005) for a description of the TIMBL software package). The value for k has to be determined experimentally for each data set (Hoste 2005). For our cross-validation experiments on the training data, we used the integers from 1 to 10 as possible values for the pronoun data sets, and added 12, 14, 16, 18, 20, 25, and 30 as possible values for the definite description and NE data sets. The value resulting in the best outcome in the five-fold cross-validation experiments on the training data is used for the classifiers
The Overlap distance metric is used in combination with Gain Ratio feature weighting in all experiments reported below. The feature weighting measure Gain Ratio is a variant of Information Gain feature weighting aimed at normalizing information gain over features with a high number of values (Daelemans and Van den Bosch 2005). This measure may be suitable for this task because we have, e.g., local context features with a very high number of values.
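To make the combination of the Overlap metric, Gain Ratio weighting, and k nearest neighbors concrete, the following is a small sketch written from the textbook definitions. It is not TIMBL itself and makes no claim about TIMBL's internals: the function names are hypothetical, feature values are treated as plain symbols, and the neighbor set is taken to be all training instances at the k smallest distances, which is one possible way of handling ties.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of symbols."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(vectors, labels, i):
    """Information gain of feature i, normalized by the entropy of its values."""
    by_value = {}
    for vec, lab in zip(vectors, labels):
        by_value.setdefault(vec[i], []).append(lab)
    total = len(labels)
    info_gain = entropy(labels) - sum(
        len(subset) / total * entropy(subset) for subset in by_value.values()
    )
    split_info = entropy([vec[i] for vec in vectors])
    return info_gain / split_info if split_info > 0 else 0.0


def classify(train_vectors, train_labels, new_vector, k=1):
    """Majority class among the training instances at the k smallest distances."""
    weights = [gain_ratio(train_vectors, train_labels, i) for i in range(len(new_vector))]
    # Weighted Overlap: a mismatching feature contributes its weight, a match contributes 0.
    distances = [
        sum(w for w, a, b in zip(weights, vec, new_vector) if a != b)
        for vec in train_vectors
    ]
    nearest = set(sorted(set(distances))[:k])
    votes = Counter(lab for d, lab in zip(distances, train_labels) if d in nearest)
    return votes.most_common(1)[0][0]
```

A feature that splits the training data into many small groups has a high split entropy, so its information gain is scaled down; this is the normalization effect that motivates Gain Ratio for features with many values.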
Hoste (2005) has shown that by parameter optimization, including both classifier settings and feature selection, classification results can be significantly improved, and that the difference between results before and after optimization of one learner can be bigger than the difference between two learners. Therefore, we believe that the results reported here can be improved both by tuning the classifiers and by applying other types of machine learning algorithms to this task. However, as our main objective is to investigate how linguistically and cognitively motivated constraints and features can affect the outcome, we leave such improvements for future experiments.
Evaluation measures for classification
The classifiers are evaluated by estimating the precision, recall and F-score per classified instance (i.e., anaphor-antecedent pair). As a baseline for each of the three tasks (and three versions of the data sets) we use a classifier trained on a basic set of features, similar to that used in high-performance systems such
We also evaluate the results of the classifiers per anaphor, rather than per anaphor-antecedent pair. This evaluation is performed by comparing the class of correctly resolved anaphors with the classes consisting of precision errors and recall errors.
Values for k in the baseline experiments: NE-20: 7, NE-10: 7, NE-filter: 5, DefDesc-15: 5,
DefDesc-10: 5, DefDesc-filter: 8, Pronoun-5: 2, Pronoun-3: 3, Pronoun-filter: 2. Values for k in the experiments on the selected feature sets: NE-20: 10, NE-10: 8, NE-filter: 10, DefDesc-15:
7, DefDesc-10: 7, DefDesc-filter: 10, Pronoun-5: 2, Pronoun-3: 2, Pronoun-filter: 3.
Evaluation of the minority class
The overall performance of a classifier can be evaluated in terms of accuracy, where all errors are considered equally, and accuracy is the proportion of correct classifications by the classifier. Because we have defined the task of finding coreferent NPs as a task of classifying pairs of NPs that are either coreferent or non-coreferent, the baseline accuracy can be estimated by classifying all pairs as the majority class.
However, accuracy may not be appropriate for this task because the data sets are imbalanced, that is, in each of the data sets there is a majority class consisting of non-coreferent pairs of NPs and a minority class of coreferent pairs of NPs. The degree of imbalance differs between the data sets depending on the instance selection methods used, but in all data sets coreference is by far the minority class. For example, in the NE-20 data set (consisting of
set (with 6,820 non-coreferent and 875 coreferent instances) the baseline accuracy is 89% without finding even one coreference link. Therefore, it is more reasonable to evaluate classifier performance for the minority class labeled as coreference.
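The arithmetic behind this baseline can be checked with a few lines of code. The sketch below uses the instance counts quoted above; the function name and return values are illustrative assumptions.

```python
def majority_baseline(n_non_coreferent, n_coreferent):
    """Accuracy and coreference recall when every pair gets the majority class."""
    total = n_non_coreferent + n_coreferent
    accuracy = max(n_non_coreferent, n_coreferent) / total
    # If non-coreferent is the majority class, no coreference link is ever predicted.
    coreference_recall = 1.0 if n_coreferent > n_non_coreferent else 0.0
    return accuracy, coreference_recall


accuracy, recall = majority_baseline(6820, 875)
print(f"baseline accuracy: {accuracy:.1%}, coreference recall: {recall:.0%}")
```

With 6,820 non-coreferent and 875 coreferent instances, the accuracy is 6,820 / 7,695, which rounds to 89%, while recall for the coreference class is zero.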
Evaluation per anaphor-antecedent pair
Since our goal is to build classifiers which can resolve coreferential links between NPs, we evaluate the results of our experiments in terms of precision, recall , and F-score of the coreference class. These metrics measure the ability of each classifier to correctly classify instances of this minority class: a high recall indicates that the classifier finds most of the coreference links between the NPs in the data set, while a high precision means that the classifier makes few errors by incorrectly classifying non-coreferent NP pairs as coreferent.
Because we work on a two-class problem of classification of coreferential and non-coreferential pairs of NPs, there are four kinds of outcomes of the classification: true and false positives and true and false negatives. These basic counts, used for calculating the precision, recall, and F-score, are described in Table 6.1. The true positive (TP) is the number of instances correctly classified as coreferent, and the false positive (FP) is the number of instances incorrectly classified as coreferent. The false negative (FN) is the number of coreferent instances classified as non-coreferent, and the true negative (TN) is the number of non-coreferent instances correctly classified.
Table 6.1: The basic counts used in the evaluation measures precision, recall, and F-score, as well as for calculating the number of resolved anaphors, precision errors, and recall errors per anaphor.
                               True class: coreferent    True class: non-coreferent
Predicted: coreferent          True positive (TP)        False positive (FP)
Predicted: non-coreferent      False negative (FN)       True negative (TN)
Total no. of classifications   TP + FN = P               FP + TN = N
When evaluating the five-fold cross-validation results of the classifier, the micro-averaged results are presented. Micro-averaged results are computed on all classifications over the five data sets considered as one concatenated output. The use of micro-averaging (as opposed to macro-averaging, where the scores are calculated separately for each fold, and the mean of the resulting values is presented) is motivated by the fact that the data is partitioned into five folds on the document level, resulting in unequal partitions. That is, the documents are not of the same size and thus the five folds are not of the same size. Further, the distribution of coreference relations is uneven between documents, and therefore coreference relations are unevenly distributed between the five folds. The micro-averaged performance measures used during evaluation per anaphor-antecedent pair are:
• Precision: calculated as the number of correctly classified coreference relations (TP) divided by the total number of relations classified as coreferent (TP+FP),
• Recall: calculated as the number of correctly classified coreference relations (TP) divided by the total number of coreferent relations (P) in the data,
• F-score: a combination of precision and recall, calculated as their harmonic mean, $F\text{-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. This harmonization penalizes large differences between precision and recall (Daelemans and Van den Bosch 2005).
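The difference between micro- and macro-averaging over folds can be illustrated with a short sketch. The per-fold counts in the usage example are invented for illustration and are not results from the experiments; the function names are likewise hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score for the coreference class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def micro_average(fold_counts):
    """Pool the (TP, FP, FN) counts of all folds, then compute the scores once."""
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    return prf(tp, fp, fn)


def macro_average(fold_counts):
    """Compute the scores per fold, then take the unweighted mean."""
    scores = [prf(*c) for c in fold_counts]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical (TP, FP, FN) counts for one large and one small fold.
folds = [(8, 2, 2), (1, 1, 3)]
print("micro:", micro_average(folds))
print("macro:", macro_average(folds))
```

In the macro average the small fold counts as much as the large one, whereas the micro average weights every classified pair equally; with folds of unequal size the two can therefore diverge noticeably.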
consisting of NPs A, B, C, D, and E is described. The links (ten in total) between the NPs represent the anaphor-antecedent pairs (instances) in the test data set that are included in the evaluation of the classifier. The NPs A, D, and E (colored blue) are labeled as coreferent, meaning that the anaphor-antecedent pairs E-D, E-A, and D-A are positives, and all other anaphor-antecedent pairs (E-C, E-B, D-C, D-B, C-B, C-A, and B-A) are negatives.
The classifier labels each pair as either coreferent or non-coreferent, and during the evaluation the predicted class of each pair is compared to the true class, and added to the total count of true and false positives, and true and false negatives. Because both the number of anaphors and the number of candidate antecedents for each anaphor vary between the versions of the data sets used for five-fold cross-validation for each task in this chapter, the classifier results evaluated per anaphor-antecedent pair in each data set cannot be used to compare the different classifiers.
Figure 6.1: Example of instances constructed from a discourse consisting of NPs A, B, C, D, and E, where NPs A, D, and E are coreferent. These links (ten in total) represent anaphor-antecedent pairs (instances) that are included in the evaluation of the classifier. That is, resolution of anaphor E is counted four times: in combination with candidate antecedent D, C, B, and A, respectively.
Following Daelemans and Van den Bosch (2005), all tests for statistical significance applied to the results are performed using the one-tailed, paired-sample t-test, and henceforth, when a result is described as significant it means that it is significant at the 5% level, i.e., p<0.05.
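As an illustration of the test, the sketch below computes the paired-sample t statistic with the standard library only. Pairing the scores per cross-validation fold is an assumption made for this example, as are the function name and the scores themselves. The critical value 2.132 is the one-tailed 5% threshold of the t distribution with four degrees of freedom (five paired observations).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean of the pairwise differences a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-fold F-scores for two classifiers (not results from the thesis).
system = [0.72, 0.70, 0.75, 0.71, 0.74]
baseline = [0.68, 0.69, 0.70, 0.70, 0.69]

t = paired_t_statistic(system, baseline)
CRITICAL_T_DF4_ONE_TAILED_05 = 2.132
print(f"t = {t:.3f}, significant at the 5% level: {t > CRITICAL_T_DF4_ONE_TAILED_05}")
```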
Evaluation per anaphor
The evaluation measures described above are calculated from the instances of the classified data (i.e., each pair of anaphor and candidate antecedent in the data set), and do not tell us how well the classifiers performed per anaphor; because coreference is a relation between all NPs referring to a specific discourse referent, an anaphor can have multiple coreferent antecedents.
In theory, an anaphor needs only one correctly classified link to one of its antecedents in order to be resolved. A classifier that, given the instances described in Figure 6.1, fails to recognize the link between NPs E and A, and that incorrectly classifies the NP pair E-C as coreferent, achieves a count of two true positives, one false positive, one false negative, and six true negatives. However, the false negative E-A is not harmful, because the classifier correctly labeled E-D and D-A as coreferent. That is, false negatives are harmful to recall when none of the links between an anaphor and its antecedents is found, but might not affect the outcome if at least one correct link has been established between the anaphor and one of its antecedents. Low recall estimated per anaphor-antecedent pair in the data does not necessarily mean that the classifier cannot resolve an anaphor, only that it cannot resolve all the possible coreference links to all the antecedents of that anaphor.
Further, depending on the method of instance selection used, the set of antecedents (and anaphors) varies between the data sets handed to the classifier: one data set may include a large number of “easy” instances (e.g., cases where string similarity can be used for resolution) while another includes more “difficult” instances (e.g., pronouns or cases where semantic relatedness is a clue for resolution).
Therefore, we also evaluate the classification results per anaphor. This evaluation is done by calculating the number of TP, FP, and FN for each anaphor, and the results are presented as:
• Resolved: correctly classified anaphors (i.e., with at least one classification evaluated as a TP);
• Precision errors: anaphors with incorrectly classified antecedents (i.e., with at least one classification evaluated as a FP);
• Recall errors: anaphors with unresolved antecedent links (i.e., with at least one classification evaluated as a FN).
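The three per-anaphor classes can be computed directly from the classified pairs. The sketch below applies the definitions to the five-NP example discussed earlier (A, D, and E coreferent; the classifier misses E-A and wrongly links E-C). The tuple layout and the function name are assumptions made for the example.

```python
def per_anaphor_classes(instances):
    """Group anaphors into resolved, precision-error, and recall-error sets.

    `instances` holds (anaphor, antecedent, gold, predicted) tuples, where
    gold and predicted are True for coreferent and False for non-coreferent.
    An anaphor may end up in more than one set.
    """
    resolved, precision_errors, recall_errors = set(), set(), set()
    for anaphor, _antecedent, gold, predicted in instances:
        if gold and predicted:          # true positive
            resolved.add(anaphor)
        elif predicted and not gold:    # false positive
            precision_errors.add(anaphor)
        elif gold and not predicted:    # false negative
            recall_errors.add(anaphor)
    return resolved, precision_errors, recall_errors


nps = ["A", "B", "C", "D", "E"]
gold_links = {("E", "D"), ("E", "A"), ("D", "A")}
predicted_links = {("E", "D"), ("D", "A"), ("E", "C")}

# Every NP is paired with each NP preceding it: ten instances in total.
instances = [
    (ana, ante, (ana, ante) in gold_links, (ana, ante) in predicted_links)
    for i, ana in enumerate(nps)
    for ante in nps[:i]
]
resolved, precision_errors, recall_errors = per_anaphor_classes(instances)
print(sorted(resolved), sorted(precision_errors), sorted(recall_errors))
```

Here anaphor E falls into all three classes, while D is resolved without errors; B and C, which have no coreferent antecedent and attract no false positives, belong to none of the classes.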
The overlap between these three classes is calculated, e.g., anaphors with both correctly classified antecedents and unresolved antecedents belong in both the resolved class and the recall error class. The result of this evaluation
where the green circle contains all resolved anaphors, the red circle contains all anaphors with precision errors, and the yellow circle contains all anaphors with recall errors. The areas labeled A, B, and C contain all anaphors belonging to only one class: resolved, precision errors, or recall errors. The overlap areas are labeled AB (containing correctly resolved anaphors with at least one false positive), BC (containing anaphors with both false positives and false negatives), AC (containing correctly resolved anaphors with at least one false negative), and ABC (containing anaphors belonging to all three classes).
The Venn diagrams are created using the Venn Diagram Plotter from the Pacific Northwest
National Laboratory (PNNL), available at URL: http://omics.pnl.gov.
Figure 6.2: Example 3-circle area-proportional Venn diagram: resolved anaphors (area
A), precision errors (area B), and recall errors (area C).
Ideally, we want as many resolved anaphors as possible, while at the same time keeping the precision errors to a minimum. A large overlap between the classes A (resolved) and C (recall errors), that is, many anaphors with both correctly recognized and missing links to their antecedents, is probably not as damaging (since in theory, an anaphor only needs one antecedent in order to be resolved) as a large overlap between the classes A (resolved) and B (precision errors), or B (precision errors) and C (recall errors).
A possible baseline for this evaluation measure, inspired by the accuracy baseline, is to assign all anaphors the majority class of non-coreferent, that is, to place all anaphors in the recall error class. That is, the goal of the classifier is to assign predictions such that there are as many resolved anaphors as possible, while at the same time keeping the class of precision errors as small as possible.
Table 6.2: The basic feature set used for all NP types (Named Entities, definite descriptions, and pronouns). A set of four features (two for anaphors and two for antecedents) describe the NP in terms of morpho-syntactic properties, whereas 11 features compare the anaphor to the antecedent, or describe the anaphor–antecedent pair in terms of
Comparative positional features
    16 Distance in sentences
Descriptive morpho-syntactic and lexical features
    Definiteness of the anaphor/antecedent head word
    Pronoun type of the anaphor/antecedent
Comparative morpho-syntactic and lexical features
Comparative similarity features
Comparative lexico-semantic features
    Same NE type
    Same NP type
    Synonyms in SynLex
    SynLex synonymy score
Baseline experiments on anaphor – antecedent classification
In this section, the initial experiments on classification of candidate anaphor-antecedent pairs are described. The outcome of the classifications with a basic feature set discussed in this section is used as a baseline for the experiments with the selected feature sets described in the following sections.
A basic feature set for coreference resolution
As a starting point for our experiments on coreference resolution, we select
basic set is comparable to the feature sets for English used by, e.g., Soon et al. (2001), and as a baseline by, e.g., Ng and Cardie (2002b). There are differences between our basic feature set and the one used in the studies mentioned above, most notably that we use the lexicon SynLex (Kann and Rosell 2006), which describes semantic relatedness in terms of synonymy, whereas they use WordNet (Fellbaum 1998) for estimating semantic relatedness in terms of synonymy and hyperonymy. There are also differences in how the features have been constructed, but the feature set used by Soon et al. and our basic feature set are similar in that these features are relatively easy to construct.
The features in the basic feature set describe the head words of the anaphor and the antecedent of each candidate pair in terms of definiteness and pronoun type (if a pronoun), and compare the candidate pair in terms of agreement in grammatical and natural gender, number, and animacy. The distance between the anaphor and the antecedent is measured in the number of intervening sentences. There are two features describing similarity between the maximal and the minimal strings of the anaphor and the antecedent. There are also features that mark whether the anaphor and the antecedent belong to the same
NE category or the same NP category, and whether one is an acronym of the other. Finally, semantic information is added by two features that describe if the baseforms of the two words are synonyms according to SynLex, and the
This feature set contains some information that may be important for resolution of each NP type: morphological agreement and distance for pronouns, similarity and synonymy information for definite descriptions, and similarity and recognition of acronyms for NEs.
are presented as the micro-averaged precision, recall and F-score of five-fold cross-validation experiments for each data set, using the same basic feature set. We also evaluate the outcome of the classifiers per anaphor.
Table 6.3: NE classification results using the basic feature set. Results from 5-fold cross-validation experiments on the NE training data sets (NE-20, NE-10, and NE-filter) for all subsequent-mention NEs (‘Total’), and categorized per antecedent type: NEs, lexical NPs, and pronouns. Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each category.
NE classification results with the basic feature set
tal’), and categorized according to the antecedent type (i.e., each antecedent is categorized as either an NE, a definite or indefinite lexical NP, or a pronoun); these results show that the classifiers perform well when the task is to recognize variants of NEs, with better precision than recall, but that none of the classifiers are able to resolve NEs to definite description or pronoun
classified data, and does not tell us how well the classifiers performed per anaphor. Because coreference is a relation between all NPs referring to a specific discourse referent, an anaphor can have multiple antecedents. Due to the different instance selection methods used for the three versions of the NE training data set, both the number of anaphors and the number of candidate an-
anaphor. Here, we can see that most of the recall errors are overlapping with the class of resolved anaphors, and that very few precision errors are made by all three classifiers.
When inspecting the output from the classifiers, we find that both the precision and the recall scores are largely explained by the strong tendency towards repetition of NEs: most correctly classified coreference pairs are repetitions of the same name.
There are some false positives due to non-coreferent, similar NEs, e.g., all three classifiers establish a link between varumärket Skandia (‘the trademark Skandia’) and the company name Skandia. None of the three classifiers are able to recognize acronyms such as H&M – Hennes & Mauritz, despite the feature marking possible acronyms. Some cases of NEs with tight appositives, e.g., programvaruföretaget IFS (‘the software company IFS’) – IFS are correctly classified, but there are also cases that are not recognized, e.g., KLM:s (‘KLM’+gen) – konkurrenten KLM (‘the competitor KLM’). The features for repetition must also include handling of e.g., genitive forms.
The classifiers are not able to recognize any lexical NPs or pronouns as antecedents, even though there are a few false positives. The features in the basic feature set are not sufficiently informative for handling difficult cases such as redescriptions.
The instance selection method based on Accessibility theory (see
difficult cases (i.e., long-distance redescriptions and pronouns) in this data set compared to the NE-20 and NE-10 data sets. (The NE-filter data set does not include pronoun antecedents.) This is the reason why the classifier trained on the NE-filter data set achieves the best total result (F-score 71.58%, compared to NE-20 F-score 59.21% and NE-10 F-score
Table 6.4: Definite description classification results using the basic feature set. Results from 5-fold cross-validation experiments on the definite description training data sets (DefDesc-15, DefDesc-10, and DefDesc-filter) for all anaphors (‘Total’), and categorized per antecedent type: NEs, lexical NPs, and pronouns. Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each category.
Table 6.5: Definite description classification results using the basic feature set. Results from 5-fold cross-validation experiments, categorized as repetitions (‘Repet’) and redescriptions (‘Redesc’). Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each category.
Definite description classification results with the basic feature set
Resolution of definite descriptions is similar to NE identification in that repetitions of definite descriptions with the same head are frequent. Thus, sim-
the classification results for definite descriptions are presented per antecedent type; the three classifiers all fail to recognize NE antecedents, whereas lexical NPs as antecedents are more successfully dealt with. Pronominal antecedents are also problematic, and recall in particular is very low. Because pronouns are not allowed as candidate antecedents in the DefDesc-filter data set, recall results are better for the classifier trained on this data set, where such difficult cases have been removed.
are presented per anaphor as resolved links, precision errors and recall errors.
Contrary to the results for NEs, there are few correctly resolved anaphors and many precision errors. The largest class for all three classifiers is the recall error class.
definite descriptions are categorized depending on whether the anaphor can be described as a (full or partial) repetition or a redescription of the antecedent, the basic feature set cannot handle redescriptions for this NP type. The only features for handling redescriptions in the basic feature set are the two SynLex features for semantic relatedness, but they do not contribute to the results.
Therefore, the correctly classified instances are examples of repetitions of the head word such as andra kvartalet (‘second quarter’) – andra kvartalet, or kvartalet (‘the quarter’) – det tredje kvartalet (‘the third quarter’).
However, string similarity does not equal coreference. The classifier with the largest scope (DefDesc-15) misclassifies a number of instances where the head word is repeated, e.g., första kvartalet (‘first quarter’) – andra kvartalet (‘second quarter’), as coreferent. Both instances are filtered out in the DefDesc-10 data set because the distance between the NPs is greater than ten sentences, but as the head words (kvartalet) are identical, both instances occur in the DefDesc-filter data. The classifier trained on this data set makes a correct classification in both cases (as non-coreferent).
required for successful resolution of both repetitions and redescriptions; the recall scores for repetitions are around 50% for all three data sets, while none of the redescriptions are resolved correctly. False negatives (by all classifiers) include, e.g., bolaget (‘the company’) – det franska flygbolaget Air France (‘the French airline Air France’, flygbolaget = ‘aviationcompany’+def). By adding selected features from the complete feature set that handle, e.g., partially repeated forms, such cases might be resolved.
than identification of coreferent NEs; for all three classifiers, the largest class consists of recall errors, and the overlap with the class of anaphors with correctly resolved links is small. All three classifiers also display a large number of precision errors, even though the basic feature set includes only two, “conservative” features for string similarity (either a match between the maximal strings of the anaphor and the antecedent, or between the anaphor and the head word of the antecedent).
Table 6.6: Pronoun classification results using the basic feature set. Results from 5-fold cross-validation experiments on the pronoun training data sets (Pronoun-5, Pronoun-3, and Pronoun-filter) for all anaphors (‘Total’), and categorized depending on the antecedent type: NEs, lexical NPs, and pronouns. Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each category.
Pronoun classification results with the basic feature set
The results for the classifiers trained on the three pronoun data sets are pre-
tecedent type. From these results, we find that none of the classifiers are able to recognize coreference between pronouns and lexical NPs based on this feature set.
sifiers for pronouns are presented per anaphor. For all three classifiers, the largest class consists of recall errors, and the second largest of precision errors.
Table 6.7: Coreference classification results for pronouns using the basic feature set, with anaphors categorized as deictic and anaphoric pronouns. Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) are reported for each category.
Table 6.8: Coreference classification results for pronouns with the basic feature set, with anaphors categorized as free and bound anaphoric pronouns. Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) are reported for each category.
Table 6.9: Coreference classification results for pronouns with the basic feature set. The results are categorized according to whether the referent is animate (‘Anim.’), inanimate (‘Inan.’), or either animate or inanimate (‘Mixed’). Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) are reported for each category.
The basic feature set is not sufficient for successful pronoun resolution, and the performance of the classifiers will likely improve if the basic feature set is extended with features selected from the complete feature set. But it is also important to discuss the quality of the features. From the results of the pronoun classifiers trained using the basic feature set, it seems that the quality of the features for e.g., gender and number agreement (for inanimate pronouns), and animacy agreement (for both animate and inanimate pronouns) is poor: no lexical NPs are recognized as antecedents, and the precision scores overall are low.
Since pronouns are a heterogeneous group, both in terms of relative frequency in the data and other properties, it is interesting to look at the results from different angles. By dividing the set of anaphors into deictic and
nouns is a more challenging task than resolution of anaphoric pronouns. For all three basic classifiers trained on the three different data sets, precision is better for anaphoric pronouns than for deictic pronouns. This indicates that we need features specifically describing characteristics of deictic pronouns.
We also need to improve on recall for both anaphoric and deictic pronouns.
these results show that resolution of bound anaphors is problematic; the basic feature set cannot describe such cases.
By dividing the personal pronouns into categories based on whether they can be used to refer to animate, inanimate, and both animate and inanimate
nouns is an easier task for the classifier than resolution of inanimate or mixed
and det (‘it’) are infrequent in this data collection, and most occurrences of det have an expletive function. Therefore, data sparseness may be one reason for these results, but we also conclude that the features in the basic feature set are not suitable for recognizing relations between inanimate pronouns and their antecedents.
Informative features for each NP type
By looking at the results for each of the three tasks from different viewpoints, e.g., by comparing resolution of animate pronouns to inanimate pronouns, we can tell more about the feature set: which features are useful, and what additional information we need for improving resolution of that particular NP type. We can also use information from the TIMBL classifiers on how infor-
found them to be for each data set, as expressed by the gain ratio value (i.e., the weight) of each feature presented in the TIMBL output. The rank presented in
as an indication of which kinds of features are important for each of the three tasks.
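To make the notion of feature informativity concrete, the following is a minimal sketch of how a gain ratio value can be computed for a discrete feature and used to rank features. It is an independent illustration and not the TIMBL implementation; the function names and the toy instances are invented for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(items)
    counts = Counter(items)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def gain_ratio(values, labels):
    """Information gain of a feature divided by its split info."""
    total = len(labels)
    by_value = defaultdict(list)
    for value, label in zip(values, labels):
        by_value[value].append(label)
    # Expected class entropy after splitting the data on the feature value.
    remainder = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(instances, labels, names):
    """Return the feature names sorted from most to least informative."""
    scores = {
        name: gain_ratio([inst[i] for inst in instances], labels)
        for i, name in enumerate(names)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Invented toy data: (repetition, same_np_type) for four candidate pairs.
toy_instances = [("yes", "yes"), ("yes", "no"), ("no", "yes"), ("no", "no")]
toy_labels = ["coref", "coref", "non", "non"]
toy_ranking = rank_features(toy_instances, toy_labels, ["repetition", "same_np_type"])
```

In this toy data the repetition feature separates the two classes perfectly and is therefore ranked first, while the second feature carries no information about the class.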
The most informative features for all three tasks, and all data sets, are the comparative similarity features describing pairs of anaphors and antecedents where the anaphor is a repeated form or a partially repeated form of the antecedent. For NEs, the feature recognizing acronyms is ranked in third place
(even though none of the NE classifiers are able to recognize any coreferent acronyms); this feature is not applicable to the other two tasks.
For the remaining feature set, there are differences in feature informativity both between the three tasks, and also between the three versions of the training data set for each NP type.
For NE classification, in addition to the string similarity features, the most informative features are the lexico-semantic features comparing animacy, NE type, and NP type, while the morpho-syntactic features (both comparative and descriptive) and the distance features are not very informative. The SynLex features for recognition of semantically related words are not applicable to
Table 6.10: The basic feature set used for all NP types (NEs, definite descriptions, and pronouns), with the features ranked (1 – 15), based on the Gain Ratio values for each feature presented in the TIMBL output. Averaged ranking results from five-fold cross-validation experiments for each NP type and data set. Features that are not applicable to a data set are colored gray.
Comparative positional features
6 Distance (sentences) 12 12 8
Descriptive morpho-syntactic features
23 Def. head (ana)
24 Def. head (ante)
27 Pronoun type (ana)
28 Pronoun type (ante)
Comparative morpho-syntactic features
31 Gender agreement
32 Number agreement
Comparative similarity features
54 Partial repetition
Comparative lexico-semantic features
86 Animacy agreement
87 Same NE type
88 Same NP type
89 SynLex synonyms
90 SynLex syn. score
The ranking of features for the classifiers trained on the versions of the NE data set where the linear-k instance selection methods were used (NE-20 and NE-10) is identical, while there are some differences in the ranking for the classifier trained on the data set where instance selection was based on Accessibility theory (NE-filter). First, pronouns are not allowed as antecedents in the NE-filter data set, making the antecedent pronoun type feature not applicable, and second, the feature measuring the distance between the anaphor and the antecedent in number of intervening sentences is ranked higher for the classifier trained on this data set than for the classifiers trained on the linear-k data sets.
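The difference between the two kinds of instance selection can be sketched as follows. This is a strongly simplified illustration under our own assumptions: the `Mention` class, the NP type labels, and the example distances are invented, candidates are assumed to precede the anaphor, and the Accessibility-based filter is reduced to the single restriction mentioned above (no pronouns as antecedents of NEs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    sentence: int  # index of the sentence containing the mention
    np_type: str   # "NE", "lexical", or "pronoun" (labels assumed here)


def linear_k_pairs(anaphor, candidates, k):
    """Pair the anaphor with every candidate at most k sentences back."""
    return [
        (anaphor, cand)
        for cand in candidates
        if 0 <= anaphor.sentence - cand.sentence <= k
    ]


def filtered_pairs(anaphor, candidates, allowed_types):
    """Pair the anaphor only with candidates of an allowed NP type."""
    return [(anaphor, cand) for cand in candidates if cand.np_type in allowed_types]


# Invented example: an NE anaphor in sentence 12 and three earlier mentions.
candidates = [
    Mention("Leif Pagrotsky", 0, "NE"),
    Mention("Han", 1, "pronoun"),
    Mention("ministern", 2, "lexical"),
]
anaphor = Mention("Pagrotsky", 12, "NE")

pairs_k10 = linear_k_pairs(anaphor, candidates, 10)
pairs_k20 = linear_k_pairs(anaphor, candidates, 20)
pairs_filter = filtered_pairs(anaphor, candidates, {"NE", "lexical"})
```

With k = 10 only the closest candidate is paired with the anaphor, while k = 20 also admits the pronoun. The type-based filter keeps the distant NE but drops the pronoun, which is why a pronoun type feature for the antecedent has nothing to describe in such a data set.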
For resolution of definite descriptions, the most informative features (in addition to the comparative similarity features) are the features comparing animacy and NP type, the feature describing whether the head word of the anaphor is definite or not, and the features comparing number and gender agreement. We also find that the SynLex features for semantic relatedness are not very informative for any of the classifiers.
There are differences in feature ranking between the classifiers trained and tested on the versions of the DefDesc data set where the linear-k instance selection methods were used (DefDesc-15 and DefDesc-10), and the classifier trained on the data set constructed with an instance selection method based on
Accessibility theory (DefDesc-filter).
The feature for NE type agreement, which includes semantic extensions for matching lexical NPs to the most frequent NE types person and organization, is ranked higher by the classifier trained on DefDesc-filter than by the other two.
The opposite can be observed for the features for semantic relatedness. In the linear-k data sets all antecedents (including all kinds of redescriptions that require features for semantic relatedness) are allowed within a span of 15 and 10 previous sentences, respectively. In the DefDesc-filter data set, there are probably fewer cases of redescriptions because of the restrictions on both noncomplex NPs and complex NPs and NEs as antecedents based on Accessibility theory.
The most surprising finding from the feature ranking of the three pronoun classifiers is that distance is the least informative feature. This might be due to the restricted scopes of all the versions of the Pronoun training data set. We also find that gender and number agreement are not among the highest ranking features for any of the classifiers, even though these features might be important for finding e.g., lexical NP antecedents.
There are minor differences in feature ranking between the classifiers trained and tested on the pronoun data sets where the linear-k instance selection methods were used (Pronoun-5 and Pronoun-3), and the classifier trained on the data set with an instance selection method based on linguistic knowledge (Pronoun-filter).
In this section, we have discussed the results from a baseline experiment on anaphor-antecedent pair classification, using a basic feature set consisting of
15 features. In the following sections we will attempt to improve on the results of these baseline classifiers by extending the feature sets for each task:
• The evaluation of the results of identification of NE variants shows that we need to improve on recall by adding features for recognition of definite description and pronoun antecedents while trying to maintain the precision figures;
• For the definite description resolution task, features that facilitate recognition of redescriptions may improve on recall. We also need to improve on recognition of repetitions (especially partially repeated forms), as recall figures are around 50% for this category;
• Judging by the results of the pronoun classifiers using the basic feature set, pronoun resolution is the most difficult task, and the basic feature set is not sufficient for this task. All aspects of antecedent identification need improvement.
different classification tasks require different feature sets. Below, we will describe the selection of specific feature sets for each NP type and resolution task, based on linguistic motivations, and the results of the classifiers trained on these extended feature sets will be compared to the baseline results of the classifiers trained using the basic feature set.
Identification of coreferent Named Entities
fication with the basic feature set resulted in high precision for all three NE data sets, but that we need to improve the recall by adding more features that can help the classifier to recognize lexical NP and pronominal antecedents.
We could also see that the features for comparing string similarity and animacy are important for NE classification.
Feature selection for identification of coreferent NEs
The feature selection for NE classification is based on the characteristics of coreferent subsequent-mention NEs, and of their typical candidate antecedents. Following Ariel (1990), we define these candidates as low accessibility markers such as NEs and lexical NPs. The complete feature set for NE classification is listed in Appendix B.
From the complete feature set, all descriptive and comparative positional features are selected, as these features are meant to capture prominence in
terms of which NPs are important at a particular point in the discourse, or within a particular discourse unit (paragraph or sentence). We also want to compare the location of the anaphor and of the candidate antecedent as NEs often display long-distance coreference relations.
From the set of morpho-syntactic features, the descriptive features for pronoun type of the candidate antecedent and the set of features for recognizing temporal expressions are selected. No comparative morpho-syntactic features are added as there are no constraints on grammatical gender and number agreement between subsequent-mention NEs and their antecedents.
All descriptive and comparative syntactic features are added, in order to capture the cases where an NE is coreferent with an NP in the immediate context, e.g., an appositive NP or an acronym. We also use the local context features that describe the appositive of the NP (if any) and the dominant constituent of the NP according to the MaltParser dependency analysis.
All the comparative similarity features are used, as string similarity is very informative for NE classification. String similarity is estimated both as string overlap, and as the minimal edit distance between the two strings. This set of features also includes features that compare any appositive NP of the anaphor
(or antecedent) to the maximal string, minimal string, or baseform of the antecedent (or anaphor), capturing the similarity between e.g., the appositive flygbolaget of the NE flygbolaget American Airlines (‘the airline American
Airlines’) and the head word of the definite description det största flygbolaget i världen
(lit. ‘the largest airline in the world’), or världens största flygbolag
(lit. ‘the world’s largest airline’).
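The two kinds of string similarity mentioned above can be illustrated with a short sketch. The exact definitions used by the system are not given in this section, so the token-based overlap measure below (a Jaccard coefficient) is an assumption made for the example, and the function names are our own. The edit distance is the standard Levenshtein distance.

```python
def edit_distance(a, b):
    """Minimal number of character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_overlap(a, b):
    """Share of lower-cased tokens common to both strings (assumed measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# The appositive of the NE compared to the head words of the two descriptions.
dist_definite = edit_distance("flygbolaget", "flygbolaget")
dist_genitive = edit_distance("flygbolaget", "flygbolag")
overlap = token_overlap("flygbolaget American Airlines",
                        "det största flygbolaget i världen")
```

The appositive flygbolaget matches the head word of the first description exactly, and differs from the head word flygbolag of the second by two characters, whereas the full strings share only one token. This is the kind of case where comparing the appositive to the head word is more useful than comparing maximal strings.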
Semantic compatibility is important for NE classification: we use all descriptive lexico-semantic features (e.g., natural gender, animacy, and NE type), and also the animacy and NE type agreement features. The SynLex features are not applicable to the NE identification task as SynLex does not include names.
From the set of features derived from the Word-Space models described
and the binary features from the syntagmatic and paradigmatic models constructed using SVD. The decision to use this subset of Word-Space derived features was based on comparisons of different subsets of Word-Space features (identical to those described in (Nilsson and Hjelm 2009)) on five-fold cross-validation on the training data; the subset of Word-Space derived fea-
These features were found to be the most informative during cross-validation experiments on resolution of redescriptions, described in (Nilsson and Hjelm 2009).
Table 6.11: NE classification results using the selected feature set, compared to the basic and the complete feature set. Results from 5-fold cross-validation experiments on the NE training data sets (NE-20, NE-10, and NE-filter) presented as micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’).
Precision Recall F-score
Basic feature set NE-20
Complete feature set NE-20
Selected feature set NE-20
Basic feature set NE-10
Complete feature set NE-10
Selected feature set NE-10
Basic feature set NE-filter
Complete feature set NE-filter
Selected feature set NE-filter
NE classification results
the selected feature set for NEs perform significantly better than the baselines (i.e., the classifiers using the basic feature set) in terms of recall, resulting in a significant improvement in F-score for all classifiers (significant at the 1% level (p≤0.01); NE-20: p≤0.00975, NE-10: p≤0.00949, NE-filter: p≤0.00570).
The results for the data sets constructed using the distance-based, linear-k instance selection methods, NE-20 and NE-10, are similar with F-scores of
73.55% for NE-20 and 75.11% for NE-10. Compared to the baseline scores from the experiment with the basic feature set, both classifiers show an improvement in recall combined with a drop in precision. The classifier trained on the NE-filter data set, constructed using the Accessibility theory-based instance selection method, achieves an F-score of 91.09%. For this data set, the decrease in precision (compared to the classifier trained on the same data set using the basic feature set) is smaller, from 97.38% to 91.20%.
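As a reminder of how the reported scores relate to each other, the sketch below computes micro-averaged precision, recall, and F-score from per-fold counts. It assumes that the F-score is the balanced harmonic mean of precision and recall, and the counts in the example are invented for illustration, not taken from our experiments.

```python
def micro_prf(fold_counts):
    """Micro-averaged scores from a list of (tp, fp, fn) counts, one per fold.

    The counts are summed over all folds before the scores are computed,
    so that large folds weigh more than small ones.
    """
    tp = sum(counts[0] for counts in fold_counts)
    fp = sum(counts[1] for counts in fold_counts)
    fn = sum(counts[2] for counts in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented counts for two folds: (true positives, false positives, false negatives).
precision, recall, f_score = micro_prf([(8, 2, 4), (6, 0, 2)])
```

Because the F-score is a harmonic mean, a gain in recall can outweigh a moderate drop in precision, which is the pattern seen when moving from the basic to the selected feature set.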
The evaluation per anaphor presented in figures 6.4, 6.6, and 6.8 on page 144 can be compared to the evaluation per anaphor for the classifiers trained on the basic feature set presented in figures 6.3, 6.5, and 6.7 on the
same page. For all three classifiers using the selected feature set, the largest class consists of resolved anaphors. The class of recall errors is reduced while
the class of precision errors shows a minor increase in comparison with the classifiers using the basic feature set.
Table 6.12: NE classification results using the selected NE feature set. Results from
5-fold cross-validation experiments on the NE training data sets (NE-20, NE-10, and
NE-filter) for all subsequent-mention NEs (‘Total’), and categorized per antecedent type: NEs, lexical NPs, and pronouns. Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each category.
The selected feature set, aimed at resolving coreference between low accessibility markers such as NEs and lexical NPs, gives the best result compared to the basic feature set and the complete feature set for the classifier trained on the data set where the instance selection method removes high accessibility markers (e.g., pronouns) as antecedents. In the data sets where such cases occur, i.e., the NE-20 and NE-10 data sets, the classifiers using the complete feature set show slightly better results than the classifiers using the selected feature set, but as these kinds of relationships (between NEs and previously mentioned pronouns) are rare, the difference is marginal.
The classifiers using the selected feature set are better at finding lexical
NP antecedents and pronominal antecedents than the classifiers using the ba-
the identification of name variants. These classifiers correctly identify (most) acronyms such as H&M and Hennes & Mauritz, and cases with e.g., genitive suffixes such as KLM:s and konkurrenten KLM.
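A hedged sketch of the kind of matching involved in these two cases is given below. It does not reproduce the features used by the classifiers. It only shows, with helper functions of our own, how a genitive ending written as ‘:s’ can be stripped and how an acronym can be compared to the initials of a longer name.

```python
import re


def strip_genitive(name):
    """Remove a genitive ending written as ':s', as in KLM:s."""
    return re.sub(r":s$", "", name)


def initials(name):
    """Concatenate the first character of each whitespace-separated token."""
    return "".join(token[0] for token in name.split())


def is_acronym_of(short, full):
    """True if the short form equals the initials of the full form."""
    return strip_genitive(short).upper() == initials(full).upper()


def name_in_np(name, np):
    """True if the name (without genitive ending) is a token of the NP."""
    return strip_genitive(name) in np.split()


acronym_match = is_acronym_of("H&M", "Hennes & Mauritz")
genitive_match = name_in_np("KLM:s", "konkurrenten KLM")
```

Note that the ampersand is treated as a token of its own here, which happens to give the right result for H&M. Other acronym conventions would need additional rules.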
Even though the classifiers do not recognize many lexical NP antecedents and pronominal antecedents (recall is low for all three classifiers), the outcomes, both per antecedent type and overall, are improved compared to the baseline experiment. The lexical NPs recognized as coreferent are mostly different descriptions of companies (e.g., bolaget, företaget) since such cases are quite frequent in this domain. The correctly recognized pronominal antecedents are all animate third person pronouns han and hon.
Precision errors include, e.g., the trademark varumärket Skandia and the company name Skandia, which was also a problem for the basic classifiers. Overall, metonymic use of NEs is problematic: e.g., the string Norrköping referring to both the town and the headquarters of a governmental agency located in that town.
The added features for matching substrings not only contribute to an increase in recall, but also in some instances to false positives such as an incorrect link between the company names Ericsson and Sony Ericsson.
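Why a substring feature cannot distinguish these cases on its own is easy to see in a small sketch. The function below is our own illustration of a token-level substring test, not the feature definition used in the experiments.

```python
def is_token_substring(short, long):
    """True if the tokens of `short` occur contiguously inside `long`."""
    short_tokens, long_tokens = short.split(), long.split()
    if not short_tokens:
        return False
    width = len(short_tokens)
    return any(
        long_tokens[i:i + width] == short_tokens
        for i in range(len(long_tokens) - width + 1)
    )


# A genuine name variant and a false positive yield the same feature value.
genuine = is_token_substring("American Airlines", "flygbolaget American Airlines")
spurious = is_token_substring("Ericsson", "Sony Ericsson")
```

Both calls return the same value, so the classifier has to rely on other features (for instance NE type or context) to separate the two situations.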
Recall errors include, e.g., pronominal antecedents incorrectly classified as noncoreferent because of missing information on natural gender (as lists of common Swedish and English first names were used to add information on natural gender to person names, names not present on those lists do not have such information). Some of the recall errors are due to coreference links that require some degree of world knowledge and/or reasoning for resolution, e.g., in ex-
ministern and Leif Pagrotsky requires knowledge about the referent.
(6.2) De senaste veckorna har uppmuntrande meddelanden strömmat in till
Leif Pagrotsky. Han var en av dem som fick skulden för att det blev nej i EMU-valet, och både näringsliv och fackförbund har krävt hans avgång. Göran Persson tycks ha letat efter lämpligt sätt att göra sig av med den kontroversielle ministern.
(Approx. ‘The past few weeks, encouraging messages for Leif
Pagrotsky have been pouring in. He was one of those blamed for the outcome of the EMU-election, and both the industry and labor unions have demanded his resignation. It seems Göran Persson has been looking for a convenient way to get rid of the controversial secretary.’)
fier using the selected feature set trained on NE-20 data sets are presented per anaphor; here, the increase in the number of resolved anaphors (and the reduction in recall errors) for the classifier using the selected feature set is accompanied by an increase in precision errors. The outcome is similar for the classifier using the selected feature set trained on NE-10 data sets, presented
The best results per anaphor are achieved by the classifier using the se-
comparable number of resolved anaphors to the NE-20 classifier (trained and tested on a data set which covers a larger number of anaphor-antecedent pairs
of all classifiers.
Figure 6.3: NE-20: the basic feature set. Combined results from 5-fold cross-validation on the NE-20 training data.
Figure 6.4: NE-20: the selected feature set. Combined results from 5-fold cross-validation on the NE-20 training data set.
Figure 6.5: NE-10: the basic feature set. Combined results from 5-fold cross-validation on the NE-10 training data set.
Figure 6.6: NE-10: the selected feature set. Combined results from 5-fold cross-validation on the NE-10 training data set.
Figure 6.7: NE-filter: the basic feature set. Combined results from 5-fold cross-validation on the NE-filter training data set.
Figure 6.8: NE-filter: the selected feature set. Combined results from 5-fold cross-validation on the NE-filter training data set.
Resolution of Definite Descriptions
inite descriptions is a more difficult task than identification of coreferent NE variants. The baseline classifiers for definite descriptions achieve an F-score of between 35% and 40% for all versions of the DefDesc training data set; the problem for these classifiers is recall, which is below 30% for all data sets, while the precision scores are above 60%.
During feature selection for resolution of definite descriptions, our main goal is to increase recall, but we also need to improve on precision.
Feature selection for resolution of definite descriptions
The feature selection for resolution of definite descriptions is based on the characteristics of anaphors of this NP type (that is, subsequent-mention, definite lexical NPs) and their typical antecedents. Following Ariel (1990), we define definite descriptions as low accessibility markers, and differentiate between candidate antecedents described as low accessibility markers (i.e., NEs and complex lexical NPs) and high accessibility markers (e.g., pronouns). The features selected from the complete feature set for the definite description resolution task are listed in appendix B.
When evaluating the results of the classifiers trained on the three versions of the DefDesc training data set with the basic feature set (described in sec-
mative, but that they did not cover all cases of partial repetitions. Therefore, all comparative similarity features in the complete feature set are selected. By adding features that compare the two candidate NPs in terms of maximal strings, the head word, the baseform of the head word, and (for candidate antecedent NEs) tight appositives, we hope to capture relations both between two lexical NPs and between a definite description and a previously mentioned NE.
The selected feature set for definite descriptions is similar to the NE feature set described in the previous section. For example, all descriptive and comparative positional features are used. Like NEs, definite descriptions do not have to agree in grammatical gender and definiteness, but they tend to agree as to animacy and number. The features describing natural gender (of the anaphor and the antecedent, and agreement between the two) might also be of interest for this resolution task.
The set of features describing temporal expressions is also added to the selected set. These features are based on a study by Vieira and Poesio (2000) which shows that such expressions frequently occur as antecedentless definite descriptions. These features might also help in finding coreference relations between temporal expressions such as perioden april–maj (‘the period April – May’) and andra kvartalet (‘the second quarter’).
All descriptive and comparative syntactic features are used, as we want to capture the cases where a definite description is coreferent with an NE in the immediate context, e.g., in an appositive relationship. The local context features, e.g., the feature consisting of the dominant constituent of the NP according to the MaltParser analysis, are also selected. This feature might tell us something about, for instance, the animacy of the NP: the verb säger (‘says’) is typically used with people, whereas the verb rapporterar (‘reports’) is used in connection with e.g., companies and news agencies.
Also used are all descriptive lexico-semantic features (e.g., natural gender, animacy, and NE type of the antecedent candidate), and the animacy and NE type agreement features.
For cases where semantic relatedness (e.g., synonymy and near-synonymy) may be a clue to resolution, we use the features describing the anaphor and the candidate antecedent in terms of semantic relatedness as expressed in the
We also add a subset of the features derived from the word-space models
as in the selected feature set for NEs (the similarity scores from all the syntagmatic models, and the binary features from the syntagmatic and paradigmatic models constructed using SVD), as we found these features to be the most advantageous of the word-space features for this task based on comparisons of different subsets (as described in (Nilsson and Hjelm 2009)).
Table 6.13: Definite description classification results using the selected feature set, compared to the basic and the complete feature set. Results from 5-fold cross-validation experiments on the definite description training data sets (DefDesc-15, DefDesc-10, and DefDesc-filter) presented as micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’).
Precision Recall F-score
Basic feature set DefDesc-15
Complete feature set DefDesc-15
Selected feature set DefDesc-15
Basic feature set DefDesc-10
Complete feature set DefDesc-10
Selected feature set DefDesc-10
Basic feature set DefDesc-filter
Complete feature set DefDesc-filter
Selected feature set DefDesc-filter
Definite description classification results
complete feature set. Compared to the baseline, both precision and recall are improved for all three classifiers using the selected feature set. While the improvement in precision is modest for all three classifiers, there is a significant increase in recall.
The results for the selected feature set are presented classified per the an-
recall is at 37% for the data sets DefDesc-10 and DefDesc-15, and 51% for the data set with instances selected according to Accessibility theory (DefDesc-filter) where difficult cases of e.g., long-distance redescriptions and pronomi-
repetition and redescription.
The evaluation per anaphor presented in figures 6.10, 6.12, and 6.14 on page 149 can be compared to the evaluation per anaphor for the classifiers trained on the basic feature set presented in figures 6.9, 6.11, and 6.13 on the
same page. For all three classifiers using the selected feature set, the largest class consists of unresolved anaphors (recall errors). Both the class of resolved anaphors and the class of precision errors are increased in comparison with the classifiers using the basic feature set.
Table 6.14: Definite description classification results using the selected feature set.
Results from 5-fold cross-validation experiments on the definite description training data sets (DefDesc-15, DefDesc-10, and DefDesc-filter) categorized depending on the antecedent type: NEs, lexical NPs, and pronouns. Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each category.
Our goal, to improve on both recall and precision, has been achieved for all classifiers even though there is still room for improvement. The recall errors are of several different types, e.g., cases not covered by the features for recognizing string similarity, missing information on semantic relatedness, and cases where real-world knowledge and/or reasoning is needed.
Table 6.15: Definite description classification results using the selected feature set. Results from 5-fold cross-validation experiments categorized as repetitions (‘Repet’) and redescriptions (‘Redesc’). Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each category.
the results are categorized as resolution of repetitions and redescriptions.
While both recall and precision scores for repetitions are above 70% for all three classifiers, the results for redescriptions are low: precision scores are around 50% and recall scores around 6%.
Even though these figures constitute an improvement (as the classifiers using the basic feature set did not recognize any redescriptions), it is clear that in order to successfully solve the task we must add additional features for capturing the different kinds of semantic relatedness and real world knowledge needed to resolve the cases here categorized as redescriptions. For example, the classifiers cannot recognize that denna justering (‘this adjustment’) is coreferent with en nedrevidering (‘a downward revision’), and they also cannot recognize cases where real world knowledge is needed, e.g., in order to resolve sin mindre nederländska konkurrent (‘their smaller Dutch competitor’) to (the Dutch airline) KLM.
The string similarity features added to the selected feature set contribute to the improved recognition of repetitions, but there are still problems with string matching, e.g., the classifiers do not recognize that the compound
(‘The Opec decision’) is coreferent with the nested NP Opecs beslut (‘The decision of Opec’).
each classifier is evaluated per anaphor. Results for the classifiers using the
These figures show that overall the improvements per anaphor by the classifiers using the selected feature set are modest in comparison with the results of the basic classifiers. Further, the difficulty of this task is manifest in fig-
fiers using the selected feature set (compared to the basic feature set), the recall errors class is still the largest for all three classifiers, and the majority of all subsequent-mention definite descriptions are not successfully matched against any antecedent.
Figure 6.9: DefDesc-15: the basic feature set. Combined results from 5-fold cross-validation.
Figure 6.10: DefDesc-15: the selected feature set. Combined results from 5-fold cross-validation.
Figure 6.11: DefDesc-10: the basic feature set. Combined results from 5-fold cross-validation.
Figure 6.12: DefDesc-10: the selected feature set. Combined results from 5-fold cross-validation.
Figure 6.13: DefDesc-filter: the basic feature set. Combined results from 5-fold cross-validation.
Figure 6.14: DefDesc-filter: the selected feature set. Combined results from 5-fold cross-validation.
In the baseline experiment, the pronoun classifiers were the least accurate with
F-scores ranging from 20% to 35%. While recall is the biggest problem, precision scores are also low: around 45% for all three pronoun data sets. Thus, we need to select features for pronoun resolution from the complete feature set that will lead to an increase in both recall and precision.
Feature selection for pronoun resolution
The pronoun resolution task is difficult, not only because (as high accessibility markers) pronouns can have both low accessibility markers such as NEs and definite descriptions, and other high accessibility markers as antecedents, but also because the pronoun class itself is heterogeneous. There are many different types of pronouns with different characteristics, and some of these pronoun types are more frequent than others. Even within a pronoun type, say anaphoric-animate pronouns, there are subsets (e.g., different genders, or grammatical forms) of different frequency. When selecting a specific feature set for pronoun resolution, we must try to describe as many different pronoun types (and their antecedents) as possible.
In the selected feature set for pronouns, we use all available information about both the anaphor and the candidate antecedent: the position within the document, the morpho-syntactic and lexico-semantic information, the syntactic information, and the context features, both describing each NP and comparing the two NPs.
most informative feature for pronouns in the basic feature set; in the selected pronoun feature set, features for repetition of the anaphor (or antecedent) baseform are added to the two similarity features used in the basic feature set. The added features may capture relations between e.g., henne (‘her’) and hon (‘she’), both with the baseform hon.
We also add the features that check for an NP with an identical surface form or baseform to the anaphor (or antecedent) nested within the antecedent
(or anaphor), in order to rule out a link between hennes yrke (‘her profession’) and hon (‘she’), or henne (‘her’).
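The two kinds of added features, baseform repetition and the nested NP check, can be sketched as follows. The small baseform lexicon is a toy stand-in for the morphological analysis used in the system, and mapping hennes to the baseform hon is an assumption we make so that the example mirrors the case discussed above.

```python
# Toy baseform lexicon (assumption); unknown tokens are their own baseform.
BASEFORMS = {"hon": "hon", "henne": "hon", "hennes": "hon"}


def baseform(token):
    """Look up the baseform of a token, falling back to the lower-cased token."""
    return BASEFORMS.get(token.lower(), token.lower())


def same_baseform(anaphor, antecedent):
    """Baseform repetition: both pronouns share the same baseform."""
    return baseform(anaphor) == baseform(antecedent)


def nested_baseform_match(pronoun, np_tokens):
    """True if a multi-word NP contains a token with the pronoun's baseform."""
    return len(np_tokens) > 1 and any(
        baseform(token) == baseform(pronoun) for token in np_tokens
    )


repetition = same_baseform("henne", "hon")
nested = nested_baseform_match("hon", ["hennes", "yrke"])
not_nested = nested_baseform_match("hon", ["bolaget"])
```

A positive value for the first feature supports a link between the two pronouns, while a positive value for the second signals that the matching form is only embedded in a larger NP with a different referent, so the link should be dispreferred.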
None of the features describing semantic relatedness, as defined in the SynLex lexicon or in the Word-Space models described above, are used for this NP type as they do not apply to pronoun resolution.
A complete list of the features selected for pronoun resolution is presented in appendix B.
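The selections described for the three NP types can be summarized as a simple configuration. The sketch below is only a schematic reading of the descriptions in this chapter. The group names are labels invented for the example, and the authoritative feature lists are those in appendix B.

```python
# Feature groups per NP type, as described in the text (group names are ours).
SELECTED_FEATURE_GROUPS = {
    "NE": {
        "positional", "antecedent_pronoun_type", "temporal", "syntactic",
        "local_context", "similarity", "lexico_semantic", "word_space",
    },
    "DefDesc": {
        "positional", "agreement", "temporal", "syntactic", "local_context",
        "similarity", "lexico_semantic", "synlex", "word_space",
    },
    "Pronoun": {
        "positional", "morpho_syntactic", "syntactic", "local_context",
        "similarity", "nested_np", "lexico_semantic",
    },
}


def shared_groups(config):
    """Feature groups that are selected for every NP type."""
    return set.intersection(*config.values())


def uses(config, np_type, group):
    """True if the given feature group is selected for the NP type."""
    return group in config[np_type]


common = shared_groups(SELECTED_FEATURE_GROUPS)
```

Keeping the selection in one place like this makes the differences explicit: the features for semantic relatedness are only selected where lexical material is available in the anaphor, while positional, syntactic, context, similarity, and lexico-semantic features are used throughout.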
Table 6.16: Pronoun classification results using the selected feature set, compared to the basic and the complete feature set. Results from 5-fold cross-validation experiments on the pronoun training data sets (Pronoun-5, Pronoun-3, and Pronoun-filter) presented as micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’).
Precision Recall F-score
Basic feature set Pronoun-5
Complete feature set Pronoun-5
Selected feature set Pronoun-5
Basic feature set Pronoun-3
Complete feature set Pronoun-3
Selected feature set Pronoun-3
Basic feature set Pronoun-filter
Complete feature set Pronoun-filter
Selected feature set Pronoun-filter
Table 6.17: Pronoun classification results using the selected feature set. Results from
5-fold cross-validation experiments on the pronoun training data sets (Pronoun-5,
Pronoun-3, and Pronoun-filter) are presented per anaphor (‘Total’), and per antecedent type: NEs, lexical NPs, and pronouns. Micro-averaged precision (‘P’), recall
(‘R’), and F-score (‘F’) reported for each category.
Pronoun classification results
The pronoun classifiers trained on the selected feature set perform better than both the classifiers using the basic feature set and the classifiers using the complete feature set; the increase in recall is larger, but there is also an improve-
Overall, there is a significant increase in F-score for all three classifiers when extending the basic feature set with the selected features as outlined above.
Although the results for lexical NP antecedents are low in comparison to the other two antecedent NP types, they are improved compared to the results of the classifiers using the basic feature set as none of the basic classifiers were able to resolve
Table 6.18: Pronoun classification results with the selected feature set: anaphors categorized as deictic and anaphoric pronouns. Micro-averaged precision (‘P’), recall
(‘R’), and F-score (‘F’) are reported for each category.
Compared to the baseline experiment, there is a bigger improvement in the
more successfully. While precision increases for both anaphoric and deictic pronouns with the selected feature set, the recall for deictic pronouns decreases compared to the basic classifiers.
Table 6.19: Pronoun classification results with the selected feature set: anaphors categorized as free and bound pronouns. Micro-averaged precision (‘P’), recall (‘R’), and
F-score (‘F’) are reported for each category.
pronouns; compared to the baseline experiments, the results for both bound and free anaphors are improved with the selected feature set.
The results categorized according to the animacy of the referent (i.e., the pronouns are categorized as animate, inanimate, and ‘mixed’) are presented
Table 6.20: Pronoun classification results with the selected feature set: the results are categorized according to whether the referent is animate (‘Anim.’), inanimate (‘Inan.’), or either animate or inanimate (‘Mixed’). Micro-averaged precision (‘P’), recall (‘R’), and F-score (‘F’) are reported for each category.
results are modestly improved, but these classifiers also perform better when resolving animate pronouns than inanimate and ‘mixed’ pronouns.
Pronoun resolution, which was the most difficult task for the classifiers using the basic feature set, also proves difficult for the classifiers using the selected feature set. For example, the problem of distinguishing between anaphoric and expletive det (‘it’) is not solved by extending the feature set. In the respective output of the classifiers trained on the Pronoun-5 and Pronoun-3 data sets, there are two instances of correct classification of a coreference link between the pronoun det and an antecedent (one antecedent is a lexical NP, bolaget (‘the company’), and the other a company name, Ericsson). In the output of the classifier trained on the Pronoun-filter data set, there is only one such correctly classified instance, det and bolaget.
Other sources of errors are the class of pronouns that can refer to both animate and inanimate referents, and plural pronouns that may refer to a single
(plural) antecedent but also to coordinated or split antecedents. As mentioned
moved from the training and test data because there is no method of knowing when a plural pronoun has coordinated or split antecedents and when it has a single, plural antecedent (other than resolution).
Evaluation per anaphor shows that for all three classifiers trained using the
errors than correctly resolved anaphors. By adding features from the complete feature set, this result can be reversed: all the extended classifiers (in
resolved anaphors, and a decrease in precision errors.
Figure 6.15: Pronoun-5: the basic feature set. Combined results from 5-fold cross-validation.
Figure 6.16: Pronoun-5: the selected feature set. Combined results from 5-fold cross-validation.
Figure 6.17: Pronoun-3: the basic feature set. Combined results from 5-fold cross-validation.
Figure 6.18: Pronoun-3: the selected feature set. Combined results from 5-fold cross-validation.
Figure 6.19: Pronoun-filter: the basic feature set. Combined results from 5-fold cross-validation.
Figure 6.20: Pronoun-filter: the selected feature set. Combined results from 5-fold cross-validation.
Different classifiers for different pronoun types
As shown in the previous section, the results of the three pronoun classifiers vary for different types of pronouns. Therefore, we will experiment with classifiers for five different pronoun types:
1. Anaphoric-animate pronouns: 3rd person singular animate,
2. Anaphoric-inanimate pronouns: 3rd person singular inanimate, and singular demonstrative pronouns,
3. Anaphoric-mixed pronouns: 3rd person plural, and plural demonstrative pronouns,
4. Deictic pronouns: 1st and 2nd person personal pronouns,
5. Bound pronouns: e.g., reflexives, reciprocals.
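As an illustration of how pronouns might be routed to the five type-specific classifiers listed above, the following sketch maps a pronoun token to a category label. The word lists are small, incomplete samples chosen for this example and the function name is hypothetical; the actual categorization in the thesis follows the classification used during instance selection and may rely on more than the word form.

```python
from typing import Optional

# Illustrative, incomplete Swedish pronoun lists (assumptions for this sketch).
ANIMATE = {"han", "hon", "honom", "henne"}             # 3rd person singular animate
INANIMATE = {"den", "det", "denna", "detta"}           # 3rd sg. inanimate, sg. demonstratives
MIXED = {"de", "dem", "dessa"}                         # 3rd person plural, pl. demonstratives
DEICTIC = {"jag", "mig", "du", "dig", "vi", "oss", "ni", "er"}  # 1st and 2nd person
BOUND = {"sig", "sin", "sitt", "sina", "varandra"}     # reflexives, reciprocals


def pronoun_type(token: str) -> Optional[str]:
    """Return the classifier category for a pronoun, or None if unknown."""
    word = token.lower()
    if word in ANIMATE:
        return "anaphoric-animate"
    if word in INANIMATE:
        return "anaphoric-inanimate"
    if word in MIXED:
        return "anaphoric-mixed"
    if word in DEICTIC:
        return "deictic"
    if word in BOUND:
        return "bound"
    return None
```

A lookup of this kind only fixes which classifier receives an instance; it does not address ambiguous forms such as expletive det, which remain a problem for the classifier itself.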
This division follows the classification used during instance selection. As the discussion of informative features for each pronoun type below shows, the distinction has some merit: the pronoun types differ in terms of feature informativeness.
Additional motivation for this experiment is that in a system for Norwegian pronoun resolution, Nøklestad (2009) compares the results of one single classifier for all pronoun types to the results of three classifiers (for han/hun
(‘he’/‘she’), den (‘it’ masc/fem), and de (‘they’)); while there is no difference for resolution of han/hun and de, there is a significant increase in accuracy for resolution of den. Nøklestad also finds that the features have different degrees of importance for the various pronoun types.
In this section, two main research questions are of interest. First, can we improve on the results of each classifier through a more refined feature selection based on each specific task and the specific properties of each pronoun type? Second, is it beneficial to train separate classifiers for these five classes of pronouns, rather than to train one single classifier for all pronouns? We will try to answer these questions by comparing the combined results of the pronoun-type specific classifiers to the results of the single pronoun classifiers described above.
The setup of the experiments is as follows: Firstly, the classifiers for each pronoun type (listed as 1–5, above) are trained with the same pronoun-specific feature set (referred to as the basic pronoun feature set, below) as in the experiment with the single classifiers, above, and secondly, with selected features for each pronoun type.
The pronoun-type specific feature selection is linguistically motivated, based on the properties of each pronoun type and on the definition of each specific task, e.g., syntactic features are included for bound pronouns, and animacy similarity features and string similarity features are included for the anaphoric-animate pronouns. The features are selected from the complete
the feature ranking of the five-fold cross-validation experiments with the feature set used for the single pronoun classifiers for each pronoun type (1–5) is considered. The results of the classifiers using the basic pronoun feature set are used as a baseline.
Table 6.21: Anaphoric-animate pronouns classification results using the selected anaphoric-animate feature set, compared to the baseline results of classifiers using the complete pronoun feature set.
Basic pronoun feature set P-5-anim
Selected animate feature set P-5-anim
Basic pronoun feature set P-3-anim
Selected animate feature set P-3-anim
Basic pronoun feature set P-fi-anim
Selected animate feature set P-fi-anim
Precision Recall F-score
The task of classifying pairs of anaphoric animate pronouns and candidate antecedents involves candidate antecedents of any NP type: NEs, lexical NPs, or other pronouns.
The initial experiments with the basic pronoun feature set show that the most informative features for resolution of anaphoric-animate pronouns are those that describe the natural gender of the anaphor and the antecedent, and the animacy of the antecedent. Features that capture repetitions (of the same form or the base form) are also important, as prominent referents referred to by high accessibility markers such as pronouns are usually referred to by other high accessibility markers. Other features that capture aspects of accessibility, e.g., the pronoun type of the antecedent and NP complexity features, are also important, as are the part-of-speech and definiteness of the antecedent. In keeping with the definition of anaphoric-animate pronouns as free anaphors (compare syntactically bound anaphors, below), syntactic features are not very informative for resolution of anaphoric-animate pronouns.
Based on these observations, a selected set of features is used to train classifiers on each of the three data sets (P-5-anim, P-3-anim, and P-fi-anim). The
compared to the baseline results of the classifiers using the basic pronoun feature set described in the previous section. Both classifiers trained on the
Pronoun-3 and Pronoun-filter data sets show a slight increase with the selected
feature set for anaphoric-animate pronouns, whereas the Pronoun-5 classifier shows a decrease in both precision and recall. As linguistically motivated feature selection is closely connected to the understanding of the task, this is to be expected: the more control one has over the set of instances (in terms of which types of anaphors and antecedents are allowed), the better the results of linguistically motivated feature selection. However, the difference between the classifiers using the basic pronoun feature set and the selected anaphoric-animate feature set is not statistically significant.
Table 6.22: Coreference classification results for anaphoric-inanimate pronouns: results from 5-fold cross-validation experiments with the classifiers using the selected anaphoric-inanimate feature set, compared to the baseline results of classifiers using the complete pronoun feature set.
Complete pronoun feature set P-5-inanim
Selected inanimate feature set P-5-inanim
Complete pronoun feature set P-3-inanim
Selected inanimate feature set P-3-inanim
Complete pronoun feature set P-fi-inanim
Selected inanimate feature set P-fi-inanim
Precision Recall F-score
Classification of anaphoric inanimate pronouns is a difficult task, especially since in Swedish, the inanimate 3rd person singular pronoun det (‘it’) can also have an expletive function. The results for inanimate pronouns by the single
This resolution task relies heavily on morphological agreement (in gender and number) between the anaphor and the candidate antecedent, and the initial experiments with the basic pronoun feature set show that informative features for anaphoric-inanimate pronouns include those that capture morphological aspects of the anaphor and antecedent, and morphological similarities between the anaphor and the antecedent. Syntactic features that describe how the two NPs are related to each other in terms of distance or dependency relations are also highly ranked, as are features describing the NP complexity of the antecedent.
The results of the classifiers trained using the basic pronoun feature set compared to the results of the classifiers trained on the selected feature set are
imate pronouns do not result in better performance of any of the three classifiers; the classifiers trained with the basic pronoun feature set perform better
(on the P-5-inanim data set) or as well (on the P-3-inanim and P-fi-inanim data sets) as the classifiers using the inanimate pronoun-specific feature set.
Based on these results, we conclude that better features – both features of higher quality, and new features not included in our complete feature set – are needed in order to improve on the results of resolution of inanimate pronouns.
Table 6.23: Anaphoric-mixed pronoun classification results using the selected anaphoric ‘mixed’ feature set, compared to the baseline results of classifiers using the complete pronoun feature set.
Complete pronoun feature set P-5-mixed
Selected mixed feature set P-5-mixed
Complete pronoun feature set P-3-mixed
Selected mixed feature set P-3-mixed
Complete pronoun feature set P-fi-mixed
Selected mixed feature set P-fi-mixed
Precision Recall F-score
Anaphoric pronouns that can refer to either animate or inanimate entities, e.g., the ‘mixed’ third person plural de (‘they’), are similar to both animate and inanimate pronouns in terms of what features are useful for resolution.
These similarities can be seen when comparing feature informativeness of the basic pronoun feature set used to train the classifiers for animate, inanimate, and mixed pronouns, but there are also differences between the three pronoun types that lend support to the decision to select specific feature sets for these three categories.
From the output of the initial experiments, we can conclude that features that capture repetitions (of the full form or the base form) are highly informative for resolution of anaphoric-mixed pronouns, as they are for animate pronouns. Like resolution of inanimate pronouns, resolution of ‘mixed’ pronouns also depends on syntactic features describing distance and dependency.
Morphological features are also highly ranked, e.g., the feature describing the case of the anaphor.
The results of the classifiers for ‘mixed’ pronouns are presented in
P-3-mixed data sets, while there is a small increase of both precision and recall for the classifier trained and tested on the P-fi-mixed data set.
For this task, there seems to be a positive effect of training a specific classifier, compared to the single pronoun classifiers (results presented in ta-
single pronoun classifiers, there is an increase in recall for all three classifiers.
For the P-fi-mixed classifier (trained on the most restricted version of the data set, both in terms of positive and negative instances), there is also an increase in precision.
Table 6.24: Coreference classification results for deictic pronouns: results from 5-fold cross-validation experiments with the classifiers using the selected deictic feature set, compared to the baseline results of classifiers using the complete pronoun feature set.
Complete pronoun feature set P-5-deixis
Selected deictic feature set P-5-deixis
Complete pronoun feature set P-3-deixis
Selected deictic feature set P-3-deixis
Complete pronoun feature set P-fi-deixis
Selected deictic feature set P-fi-deixis
Precision Recall F-score
Coreference resolution involving deictic pronouns is problematic, because they typically refer to the speaker, the listener within the discourse, or in some cases the reader. Deictic pronouns are also often used in a superset relation, rather than a coreference relation.
The initial experiments with the basic pronoun feature set show that there are similarities between the most informative features of anaphoric animate pronouns and those of deictic pronouns (also animate); e.g., features that describe the NPs in terms of repetition (baseform or full form), and the natural gender of the anaphor and antecedent are important for resolution of both pronoun types.
But there are also differences that lend support to the decision to distinguish between anaphoric and deictic pronouns, e.g., local context features are more informative for resolution of deictic pronouns than for anaphoric-animate pronouns. Because deictic pronouns frequently occur in quoted speech, it is possible that e.g., verb semantics can help resolution.
Linguistically motivated feature selection for deictic pronouns from the available feature set does not lead to better results for the classifiers trained on the P-3-deixis and P-fi-deixis data sets, while there is an increase in precision for the P-5-deixis classifier, leading to an increase in F-score. However, it seems that features of higher quality in combination with new features are needed for this resolution task.
Table 6.25: Coreference classification results for syntactically bound pronouns: results from 5-fold cross-validation experiments with the classifiers using the selected ‘bound’ feature set, compared to the baseline results of classifiers using the complete pronoun feature set.
Complete pronoun feature set P-5-bound
Selected bound feature set P-5-bound
Complete pronoun feature set P-3-bound
Selected bound feature set P-3-bound
Complete pronoun feature set P-fi-bound
Selected bound feature set P-fi-bound
Precision Recall F-score
The results of the single pronoun classifiers for syntactically bound pronouns,
than resolution of free anaphors. Syntactically bound pronouns refer to an antecedent within the sentence (or clause), and thus this task is likely to benefit from the pronoun-type specific instance selection method where all candidate antecedents across sentence boundaries are removed.
The initial experiments with the basic pronoun feature set show that (in keeping with linguistic intuition) the most important features for resolution of syntactically bound pronouns are syntactic features that either describe the anaphor or the antecedent in terms of position, or that compare the distance between them or syntactic dependency. Some of the features here classified as syntactic also in some sense capture the prominence of the antecedent, e.g., the feature that describes whether the antecedent is the first NP in the sentence or not. Features describing natural and morphological gender are also highly informative for resolution of bound pronouns.
The results of the classifiers for bound pronouns using both the basic pro-
effect of feature selection is small for the classifiers trained on the more restricted data sets (P-3-bound and P-fi-bound), while there is an increase in precision for the classifier trained on the P-5-bound data set. Overall, it seems that in order to improve on resolution better features are needed. It is also possible that resolution of this pronoun type would benefit from a rule-based approach.
Table 6.26: Coreference classification results for the combined result of the five different pronoun classifiers, compared to the results using the single pronoun classifier.
(Micro-averaged results from five-fold cross-validation experiments).
Single classifier Pronoun-5
Comb. pronoun-type spec. classifiers Pronoun-5
Single classifier Pronoun-3
Comb. pronoun-type spec. classifiers Pronoun-3
Single classifier Pronoun-filter
Comb. pronoun-type spec. classifiers Pronoun-filter
Precision Recall F-score
Combining the results
The combined classifier results for each of the three data sets are presented
The combined resolution results on the data sets for which the linear-k instance selection method was used, Pronoun-5 and Pronoun-3, are not affected, while the result of the Pronoun-filter classifier is improved, with an increase in both precision and recall. We attribute this to the relationship between our task description, the instance selection method used, and the linguistic motivations behind the feature selection: this is the data set over which we have the most control. (Because the distribution of pronoun types is uneven, i.e., there are fewer instances of e.g., inanimate, mixed, and syntactically bound pronouns than animate pronouns, any improvements in resolution of a small class, e.g., inanimate pronouns, are lost in the combined results, where the results for the majority class are the most influential.)
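The effect of an uneven class distribution on micro-averaged scores can be made concrete with a small sketch. The counts below are invented purely for illustration and are not results from the experiments; the function names are likewise assumptions for this example.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def micro_average(per_class: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Micro-average: pool the counts of all classes, then compute P/R/F."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return prf(tp, fp, fn)


# Invented counts, for illustration only (not thesis results).
before = {"animate": (800, 200, 200), "inanimate": (10, 30, 40)}
after = {"animate": (800, 200, 200), "inanimate": (20, 30, 30)}

class_gain = prf(*after["inanimate"])[2] - prf(*before["inanimate"])[2]
micro_gain = micro_average(after)[2] - micro_average(before)[2]
```

With these made-up counts, the F-score of the small class rises by well over ten points, while the pooled, micro-averaged F-score moves by less than one point, because the majority class contributes most of the pooled counts.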
In this chapter, our focus has been linguistically motivated feature selection for each of the three coreference resolution tasks. The results of cross-validation experiments on the training data, with classifiers using task-specific feature sets, have been compared to the results of classifiers using an identical, baseline set of features.
In the next chapter, we will test these classifiers on held-out test data sets, and compare the results. We will also investigate the effect of applying instance selection on the test data; before linking the coreferent NPs in each document in the test data into coreference chains, we will add the instances
removed during preprocessing (labeled as non-coreferent). In so doing, we can compare the effect of not only the different feature sets and the different instance selection methods used to restrict the training data, but also the effect of applying the same instance selection methods as a linguistic filter to the test data.
7. Coreference Resolution
By combining machine learning with instance selection, we create a hybrid approach to coreference resolution where the first step (after preprocessing
selection method that labels each anaphor-antecedent pair as a likely or an unlikely candidate for coreference.
Each anaphor-antecedent pair in the subset of unlikely candidates is labeled as non-coreferent and removed from further processing, while the subset of likely candidates is handed to the TIMBL classifier which labels each pair as either coreferent or non-coreferent. After classification, the two subsets are merged for evaluation on the coreference chain level.
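The two-step procedure can be summarized in a short sketch. The function and parameter names are hypothetical: `is_likely` stands in for whichever instance selection method is used, and `classify` stands in for the trained memory-based classifier; neither reproduces the actual implementation.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (anaphor, candidate antecedent)


def hybrid_resolve(
    pairs: Iterable[Pair],
    is_likely: Callable[[Pair], bool],
    classify: Callable[[Pair], bool],
) -> Dict[Pair, bool]:
    """Step 1: filter pairs with linguistic knowledge.
    Step 2: classify only the pairs that survive the filter.
    Returns one coreferent/non-coreferent label per input pair."""
    labels: Dict[Pair, bool] = {}
    likely: List[Pair] = []
    for pair in pairs:
        if is_likely(pair):
            likely.append(pair)
        else:
            labels[pair] = False  # unlikely candidates: non-coreferent
    for pair in likely:
        labels[pair] = classify(pair)
    return labels  # both subsets merged, ready for chain construction


# Toy usage with stand-in functions.
seen: List[Pair] = []


def toy_classifier(pair: Pair) -> bool:
    seen.append(pair)
    return True


toy_pairs = [("det", "bolaget"), ("det", "Anna"), ("hon", "Anna")]
toy_labels = hybrid_resolve(
    toy_pairs,
    is_likely=lambda p: p != ("det", "Anna"),
    classify=toy_classifier,
)
```

In the toy run, the pair filtered out in the first step never reaches the classifier, yet it still appears in the merged output with a non-coreferent label.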
In this chapter, the held-out test data is used for evaluating the classifiers, the selected feature sets, and the instance selection methods used on the training data as described in the previous chapters. The preprocessing and preparation of the test data set is the same as that for the training data, as described in
In addition to the evaluation of the classifiers, we also investigate the effect of linguistically motivated instance selection on the test data. The motivation for instance selection on the test data is to improve coreference resolution by removing unlikely candidate anaphor-antecedent pairs; that is, the instance selection method is applied as a linguistic filter for solving some of the (non-coreferent) cases on the basis of linguistic knowledge, before classification.
The TIMBL classifiers for NEs, definite descriptions, and pronouns developed during the cross-validation experiments described in the previous chapter are trained on the training data for each NP type, and evaluated on the corresponding set of held-out test data.
For each NP type, we use different methods for instance selection on the training data, resulting in three different classifiers (e.g., NE-20, NE-10, and
NE-filter). We also use the same methods for instance selection on the test data, as the first step in our hybrid approach. The different methods for lin-
Evaluation of the classifiers is performed both per anaphor-antecedent pair in each version of the test data sets, and per anaphor. The classifier results (i.e., the result by the memory-based classifier on the subset of instances
labeled as likely candidates during instance selection) are presented in sec-
As coreference is a relation between all NPs that refer to the same referent within a discourse, we also merge the set of anaphor-antecedent pairs removed during test data instance selection (labeled as non-coreferent) with the output of the classifiers (i.e., instances labeled as either coreferent or non-coreferent). The anaphor-antecedent pairs classified as coreferent by the classifier are linked together in order to form coreference chains, and the resulting chains are evaluated against the annotated gold standard.
The method for linking the coreferent pairs into chains is described in sec-
Test data preparation
The methods used to select and classify the different types of referring expressions and to select likely anaphor-antecedent pairs from the set of all possible
are also used on the test data sets. Below follows a brief recapitulation of the instance selection methods, and a description of the resulting test data sets.
Instance selection methods used on the test data
After extraction of the referring expressions, the complete test data set is split into three parts depending on the type of expression: NEs, definite descriptions, and pronouns. That is, a subset for evaluation of e.g., pronoun resolution is formed, consisting of all candidate anaphor-antecedent pairs in the complete test data set where the anaphor is a pronoun.
Each of these three subsets (for NEs, definite descriptions, and pronouns) is further processed by instance selection in three different ways, resulting in three versions of the same test data (sub)set. That is, for each of the NP types we use three different methods of determining if an anaphor-antecedent pair is a likely candidate for coreference or not.
For NEs and definite descriptions, instance selection is based on Accessibility theory, put forth by Ariel (1990). This theory focuses on the form of the anaphor and of the antecedent, described as their accessibility, and the relationship between them (in terms of distance). We use this theory as a basis for selecting the candidate anaphor-antecedent pairs that are most likely to be coreferent based on their respective degree of accessibility, in combination with different constraints on description repetitions and redescriptions.
For pronouns, described by Ariel (1990) as high accessibility markers that can refer to both low accessibility markers such as NEs and definite descriptions and to other high accessibility markers, we add semantic and syntactic constraints that form the basic framework for our instance selection method.
These methods are contrasted against a basic, linear-k instance selection technique, where k is the limit for the search scope of each anaphor measured in number of preceding sentences. This method is motivated by corpus studies that show that the different NP types display different behavior in terms of the linear distance between the anaphor and the antecedent (see e.g., Vieira and Poesio 2000, Fraurud 1992). For NEs we use two such limits, k=20 and k=10, both of which cover at least one antecedent for more than 95% of all anaphors. For definite descriptions, k is set to 15 and 10 sentences, covering 95% and 90% of all anaphor-antecedent pairs, respectively. For pronouns, k is set to 5 and 3 sentences, covering the antecedent for more than 99% of all anaphors in the training data.
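A minimal sketch of linear-k candidate generation is given below. The mention representation and the function name are assumptions for this example. The sketch also assumes that candidates in the anaphor's own sentence are included in the search scope, and for brevity it treats every mention as a potential anaphor, whereas each data set described here pairs only anaphors of one NP type with their candidates.

```python
from typing import List, Tuple

Mention = Tuple[int, str]  # (sentence index, NP string), in document order


def linear_k_pairs(mentions: List[Mention], k: int) -> List[Tuple[Mention, Mention]]:
    """Pair each mention with every preceding mention that occurs at most
    k sentences earlier (same-sentence candidates included by assumption)."""
    pairs = []
    for i, anaphor in enumerate(mentions):
        for antecedent in mentions[:i]:
            if anaphor[0] - antecedent[0] <= k:
                pairs.append((anaphor, antecedent))
    return pairs


doc = [(0, "Ericsson"), (1, "bolaget"), (5, "det")]
pairs_k3 = linear_k_pairs(doc, 3)
pairs_k5 = linear_k_pairs(doc, 5)
```

In this toy document, lowering k from 5 to 3 removes both candidate antecedents of det, which shows how a tighter search scope trades coverage for a smaller instance set.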
Resulting data sets
For evaluation of our model for identification of coreferent NEs, we use these versions of the complete NE test data set:
1. NE-20-test: All NEs are matched against all NPs in the 20 preceding sentences.
2. NE-10-test: All NEs are matched against all NPs in the 10 preceding sentences.
3. NE-filter-test: All NEs are matched against all preceding NPs according to constraints based on Accessibility theory, as described in
For evaluation of our approach to resolution of definite descriptions, we use these versions of the complete definite description test data set:
1. DefDesc-15-test: All definite descriptions are matched against all NPs in the 15 preceding sentences.
2. DefDesc-10-test: All definite descriptions are matched against all NPs in the 10 preceding sentences.
3. DefDesc-filter-test: All definite descriptions are matched against all preceding NPs according to constraints based on Accessibility theory,
For evaluation of our approach to pronoun resolution, we use these versions of the complete pronoun test data set:
1. Pronoun-5-test: All pronouns are matched against all NPs in the 5 preceding sentences.
2. Pronoun-3-test: All pronouns are matched against all NPs in the 3 preceding sentences.
3. Pronoun-filter-test: All pronouns are matched against all preceding
Constructing coreference chains
In this thesis so far, we have focused on identifying coreference relations between pairs of NPs, referred to as anaphor-antecedent pairs. But because coreference is a relation between all NPs that refer to the same referent within a discourse, we also want to evaluate the output of the classifiers on the coreference chain level. Pair-wise classification forms the basis for our approach to coreference resolution, but in order to form coreference chains we need to link all pairs that are coreferent into one chain.
In this study, we use aggressive-merge clustering (Ng 2005) in order to create coreference links from the pair-wise predictions by TIMBL.
Other commonly used methods for clustering coreferent NPs for coreference resolution include closest-first clustering, where the closest preceding coreferent NP (if any) of each anaphor is selected. This method is based on the idea that recency is important for coreference resolution, and that each anaphor needs only one antecedent for resolution (see e.g., Soon et al. 2001, Strube et al. 2002, Ng 2005). A related approach is best-first clustering, where the closest NP with the highest coreference likelihood value (estimated by the classifier, or some other method) is selected from the set of preceding coreferent NPs (if any). Like closest-first clustering, this method assumes that each anaphor needs only one antecedent for resolution. The more sophisticated selection of an antecedent allows for potentially higher precision than closest-first clustering (see e.g., Aone and Bennet 1995, Ng and Cardie 2002b, Ng 2005).
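The difference between the two single-antecedent strategies can be sketched as follows. The candidate representation and function names are assumptions for this example; in particular, the confidence value stands in for whatever likelihood estimate a given system provides.

```python
from typing import List, Optional, Tuple

# One candidate antecedent for a given anaphor:
# (antecedent id, classified as coreferent?, likelihood estimate).
# Candidates are assumed to be ordered from closest to farthest.
Candidate = Tuple[str, bool, float]


def closest_first(candidates: List[Candidate]) -> Optional[str]:
    """Select the closest preceding NP classified as coreferent, if any."""
    for antecedent, coreferent, _ in candidates:
        if coreferent:
            return antecedent
    return None


def best_first(candidates: List[Candidate]) -> Optional[str]:
    """Select the coreferent NP with the highest likelihood estimate;
    ties go to the closest one, since max() keeps the first maximum."""
    positive = [c for c in candidates if c[1]]
    if not positive:
        return None
    return max(positive, key=lambda c: c[2])[0]


cands = [("np3", False, 0.2), ("np2", True, 0.6), ("np1", True, 0.9)]
```

For the sample candidates, closest-first picks np2 because it is the nearest positive candidate, whereas best-first picks the more distant np1 because of its higher likelihood estimate.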
Unlike the other two methods, which require that only one antecedent be chosen, aggressive-merge clustering may yield higher recall as each anaphor is merged with all of its preceding coreferent NPs. That is, a cluster is created for each anaphor, consisting of that NP and all antecedents classified as coreferent. If an NP in that cluster is coreferent with an NP in another cluster, those two clusters are merged (hence the name aggressive-merge). This allows for
(potentially) higher recall than the other two methods (Ng 2005).
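Aggressive-merge clustering amounts to taking the transitive closure over all pairs classified as coreferent. The sketch below uses a union-find structure to do this; the function name and the input format are assumptions for this example, not the implementation used in this study.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def aggressive_merge(
    coreferent_pairs: Iterable[Tuple[Hashable, Hashable]]
) -> List[Set[Hashable]]:
    """Merge all pairs classified as coreferent into chains: any two
    clusters that share an NP end up in the same chain."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for anaphor, antecedent in coreferent_pairs:
        root_a, root_b = find(anaphor), find(antecedent)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters: Dict[Hashable, Set[Hashable]] = {}
    for mention in list(parent):
        clusters.setdefault(find(mention), set()).add(mention)
    return list(clusters.values())


chains = aggressive_merge([("b", "a"), ("c", "b"), ("e", "d")])
```

Note that a single incorrect positive link is enough to merge two otherwise correct chains, which is the price paid for the higher recall.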
The motivation for choosing aggressive-merge clustering is that it is a simple, recall-friendly method that allows for multiple antecedents. Additional motivation is that in a comparison between these three methods by Ng (2005), aggressive-merge was found to be the most successful method on three differ-
When evaluating the same trials with the B-cubed score aggressive-merge was the most successful method on one out of three test sets. In the other two trials evaluated with the B-cubed score, the closest-first method was the most successful, perhaps because the B-cubed score
(contrary to the MUC score) rewards successful recognition of non-coreference relationships, and further penalizes incorrect merging of two large clusters more heavily than the merging of
Evaluation of coreference chains
Coreference chains, or equivalence classes of coreferent NPs within a document, are evaluated as to how they conform with the correct (that is, manually annotated) coreference chains in the test data set.
All results on the coreference chain level are reported using the standard
the motivation for choosing the MUC score is that it is widely used, and even though other measures have been proposed, there is (as of yet) no established best practice for
For each test data set, results are evaluated with MUC-score precision, recall, and F-score (Vilain et al. 1995), where the response (i.e., the coreference chains found by the system) is compared to the key (i.e., the chains in the manually annotated test data). As all results for each particular task are compared to the same key, comparison can be made between both classifiers and instance selection methods.
The results for the different NP types are calculated against the subset key for each NP type.
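For readers unfamiliar with the link-based MUC score, the following sketch shows how recall and precision are computed from key and response chains, following the definition in Vilain et al. (1995). It is an independent illustration written for this text, not the LingPipe implementation used for the reported results, and it assumes that each NP occurs in at most one chain on each side.

```python
from typing import List, Set, Tuple


def _link_counts(gold: List[Set[str]], other: List[Set[str]]) -> Tuple[int, int]:
    """For each chain S in `gold`, count |S| - |p(S)| and |S| - 1, where
    p(S) is the partition of S induced by the chains in `other`
    (NPs missing from `other` count as singleton parts)."""
    numerator = denominator = 0
    for chain in gold:
        parts = set()
        unaligned = 0
        for mention in chain:
            index = next((i for i, c in enumerate(other) if mention in c), None)
            if index is None:
                unaligned += 1
            else:
                parts.add(index)
        numerator += len(chain) - (len(parts) + unaligned)
        denominator += len(chain) - 1
    return numerator, denominator


def muc_score(key: List[Set[str]], response: List[Set[str]]) -> Tuple[float, float, float]:
    """Return MUC precision, recall, and F-score."""
    r_num, r_den = _link_counts(key, response)
    p_num, p_den = _link_counts(response, key)
    recall = r_num / r_den if r_den else 0.0
    precision = p_num / p_den if p_den else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Key: one chain of four NPs; response: the same NPs split into two chains.
p, r, f = muc_score([{"A", "B", "C", "D"}], [{"A", "B"}, {"C", "D"}])
```

In this example the response contains no wrong links, so precision is 1, but one of the three links needed to connect the key chain is missing, so recall is 2/3.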
Table 7.1: Baseline coreference chains results for the test data sets for NEs, definite descriptions, and pronouns. MUC score precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each data set. B 1: string similarity, B 2: recency and morphological agreement.
B 1 98.82
F P R F P R F
R F P
R F P
B 1 63.64
R F P
R F P
B 1 14.72
B 2 10.65
We use an implementation of the MUC score as defined in (Vilain et al. 1995) from the open source NLP toolkit LingPipe, by Alias-i. URL: http://alias-i.com/lingpipe/
The 5th International Workshop on Semantic Evaluations (SemEval-2010) task Coreference
Resolution in Multiple Languages aims to evaluate coreference resolution systems by employing both B-cubed and CEAF in comparison with the MUC score in order to investigate the advantages and drawbacks of the different metrics (Recasens et al. 2009).
Table 7.2: Baseline coreference chain results for all NP types combined, reported as
MUC score precision (‘P’), recall (‘R’), and F-score (‘F’). The best test data instance selection method (NE-filter-test, DefDesc-filter-test, Pronoun-filter-test) is combined with the (best) baseline classifier for each NP type (NE B1, DefDesc B1, Pronoun
Instance selection method + baseline
Precision Recall F-score
For evaluation of the coreference resolution task on the coreference chain level, we calculate two baselines based on simple heuristics.
Because repetitions of coreferent NPs are frequent, the first baseline (B 1) is based on string matching. Each NP in the test data set is classified as coreferent with any identical preceding NP, and all NPs classified as coreferent are linked together using the aggressive-merge clustering method.
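The string-matching baseline can be sketched as follows. The sketch assumes that aggressive-merge amounts to taking the transitive closure over all pairs classified as coreferent, and that NPs are compared after trimming whitespace and case folding; both are assumptions made for illustration, as are the function names.

```python
from typing import Dict, List, Tuple


def string_match_pairs(nps: List[str]) -> List[Tuple[int, int]]:
    """Return (antecedent, anaphor) index pairs for identical NP strings."""
    normalized = [np.strip().casefold() for np in nps]  # normalization is an assumption
    pairs = []
    for j in range(len(nps)):
        for i in range(j):
            if normalized[i] == normalized[j]:
                pairs.append((i, j))
    return pairs


def aggressive_merge(n_mentions: int, pairs: List[Tuple[int, int]]) -> List[List[int]]:
    """Link all positively classified pairs into chains (transitive closure)."""
    parent = list(range(n_mentions))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters: Dict[int, List[int]] = {}
    for idx in range(n_mentions):
        clusters.setdefault(find(idx), []).append(idx)
    # Singletons are not coreference chains and are dropped.
    return [members for members in clusters.values() if len(members) > 1]


nps = ["Ericsson", "bolaget", "Ericsson", "facket", "bolaget", "Ericsson"]
chains = aggressive_merge(len(nps), string_match_pairs(nps))
```

Here the three occurrences of Ericsson and the two occurrences of bolaget each form one chain, while facket stays unlinked, which is the kind of recall error the baseline cannot avoid.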
A second baseline (B 2) is calculated for pronouns. This baseline is based on recency and morphological agreement: each pronoun is classified as coreferent with the most recent NP of matching grammatical or natural gender and number (if any), and all NPs classified as coreferent are linked using the aggressive-merge clustering method.
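A minimal sketch of this recency-and-agreement heuristic is given below. The `Mention` record and its gender and number labels are hypothetical simplifications of the annotation, and the resulting pairs would still have to be clustered with the merge step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mention:
    text: str
    gender: str  # illustrative labels, e.g. "fem", "neu", "utr"
    number: str  # "sg" or "pl"
    is_pronoun: bool = False


def recency_agreement_pairs(mentions: List[Mention]) -> List[Tuple[int, int]]:
    """Pair each pronoun with the closest preceding NP that agrees with it."""
    pairs = []
    for j, anaphor in enumerate(mentions):
        if not anaphor.is_pronoun:
            continue
        # Scan leftwards so that the most recent agreeing NP wins.
        for i in range(j - 1, -1, -1):
            candidate = mentions[i]
            if candidate.gender == anaphor.gender and candidate.number == anaphor.number:
                pairs.append((i, j))
                break
    return pairs


mentions = [
    Mention("Anna", "fem", "sg"),
    Mention("bolaget", "neu", "sg"),
    Mention("hon", "fem", "sg", is_pronoun=True),
    Mention("det", "neu", "sg", is_pronoun=True),
    Mention("de", "utr", "pl", is_pronoun=True),
]
pairs = recency_agreement_pairs(mentions)
```

The plural pronoun de finds no agreeing candidate and is left unresolved, in line with the "(if any)" condition of the baseline.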
The coreference chains, the result of merging the anaphor-antecedent pairs labeled as non-coreferent during instance selection and the baseline classification of the remaining anaphor-antecedent pairs in each data set (e.g., NE-20-test or NE-10-test), are evaluated against the key for that NP type, allowing us to compare the results on the different data sets and evaluate the effect of the instance selection methods.
As we can see, recall is generally low, as these simple heuristics can only identify a small fraction of the coreference relations. Baseline 1, based on string similarity, gives higher precision than recall for all data sets, even though the results vary. On all the NE test data sets, B 1 gives 99% precision. The same baseline for definite descriptions results in at best 65% precision, and for pronouns, 19% precision. The best baseline results, compared to the complete key, are obtained by applying the Accessibility theory-based instance selection methods to the data sets (i.e., NE-filter-test, DefDesc-filter-test, and Pronoun-filter-test).
This baseline is estimated by combining the classifier output with the highest
MUC F-score for NEs, definite descriptions, and pronouns respectively (see
(that is, all versions of the NE test data set cover the same number of similar strings); we selected the smallest data set, NE-filter-test. The combined output is linked using aggressive-merge, and the resulting coreference chains are evaluated against the complete key, resulting in a baseline of MUC precision 58%, MUC recall 40%, and MUC F-score 47% for coreference resolution on our test data.
In the previous chapter, selected feature sets for the three NP types were developed during five-fold cross-validation experiments on the training data. In this section, the NP type specific classifiers trained on the training data using these selected feature sets are evaluated on the test data.
First, the classifier results per test data set are reported, and the effect of
stance selection methods and the classifiers are evaluated against the manually annotated gold standard. Here, we focus on the effect of instance selection on the test data, that is, our hybrid method of combining linguistically motivated selection of likely anaphor-antecedent pairs with a data-driven model for classification. Our hypothesis is that if we use linguistic knowledge to restrict the test data set to the most likely anaphor-antecedent pairs, we can improve on precision.
Table 7.3: NE classification results on data sets NE-20-test, NE-10-test, and NE-filter-test with classifiers trained on the data sets NE-20, NE-10, and NE-filter. Precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each classifier and data set.
NE-20-test NE-10-test NE-filter-test
P R F P R F P R F
Anaphor-antecedent classification results
The memory-based classifiers for NEs, definite descriptions, and pronouns are trained on the training data using the NP-type specific selected feature sets
Figure 7.1: Classifiers NE-20, NE-10, and NE-filter, each tested on the three test data sets NE-20-test, NE-10-test, and NE-filter-test. Results presented per anaphor: correctly resolved coreference links (TP; green), precision errors (FP; red), and recall errors (FN; yellow).
When reviewing the results below, we are interested in the effect of instance selection on the training data. Our main concern is whether restriction of the training data will lead to a loss in recall by removing positive instances, and possibly also a loss in precision by removing negative instances. We will also discuss the selected feature set for each NP type, and frequent error types.
on different versions of the NE training data set of different size but using
to classify the instances of each of the three test data sets. The classification results per test data set are listed in each column; the results are similar for all classifiers on each of the test data sets, with F-scores around 79% on the NE-20 and NE-10 test data sets, and 90% on the NE-filter test set. (Comparisons between the results for the different test sets will not be made here, but we will return to the question of test data instance selection in the next section where resolution on the coreference chain level is discussed.)
As mentioned above, the versions of the NE training data set are of varying size: about 80% of the negative instances, and 30% of the positives in the complete NE training data set are removed when constructing the NE-filter training data set, while about 10% of the negatives and 15% of the positives in the complete set are removed when constructing the NE-20 training data set
the training data sets).
The results for the NE-filter classifier on all versions of the NE test data set
positive instances in the training data can be substantially reduced by using the Accessibility theory-based instance selection method with only a minor decrease in classifier performance. For this task, the selection of training data instances has a small negative effect on recall, while the effect on precision is positive.
sifiers and data sets, the largest class consists of resolved anaphors, and the smallest class of precision errors. All classifiers achieve the largest class of resolved anaphors, and the smallest class of precision errors on the NE-filter-test data set.
The most frequent recall error type consists of unrecognized coreference links between NEs and definite descriptions, e.g., between the organization name Metall and facket (‘the union’), indicating that the features used for recognizing semantic relatedness must be improved.
There are also precision errors that may be due to the semantic relatedness features, such as incorrect links between NEs and definite descriptions, e.g., Ericsson – deras närstående bolag (‘their related companies’).
The most frequent type of precision error in the NE-20 and NE-10 test data sets is an incorrect link between an NE and a pronoun. Such cases are removed from the NE-filter-test data, which may be the reason for the higher precision on this data set by all classifiers. Precision errors are also caused by instances of non-coreferent similar strings, e.g., ägaren Skandia (‘the owner Skandia’) – deras ägare (‘their owner’).
DefDesc-15, DefDesc-10 and DefDesc-filter training data sets are presented.
All classifiers use the selected feature set for definite descriptions (described
The performance of the classifiers, trained on varying amounts of data, is similar on all three test data sets, with F-scores around 50% on the DefDesc-15 and DefDesc-10 test sets, and around 60% on the DefDesc-filter test set. This similarity between the classifiers is also shown in the evaluation per anaphor,
Table 7.4: Definite description classification results on data sets DefDesc-15-test,
DefDesc-10-test, and DefDesc-filter-test with classifiers trained on the data sets
DefDesc-15, DefDesc-10, and DefDesc-filter. Precision (‘P’), recall (‘R’), and F-score
(‘F’) reported for each classifier and data set.
restricted version of the training data set, indicate that applying the Accessibility theory-based instance selection method used for definite descriptions does not result in a loss in recall, and only a small decrease in precision.
Akin to the NE-filter training data set discussed above, about 80% of the negative instances and 30% of the positives in the complete definite description training data set are removed when constructing the DefDesc-filter training data set. By comparison, about 20% of both negatives and positives are removed for the DefDesc-15 training data set, and about 35% of the negatives and 25% of the positives for the DefDesc-10 training data set. Most of the NPs in our training data are definite descriptions; this class makes up about 60% of the anaphor-antecedent pairs
jor benefit of our Accessibility theory-based instance selection method is an increase in processing speed, without harm to either precision or recall.
Most of the correctly recognized links are full or partial repetitions of the same description. Repetition is also the most frequent cause of precision errors: instances of non-coreferent repetition of a definite description occur, often as terms or concepts that are frequent within this domain (and in some cases may be described as domain-specific encyclopedic knowledge, see sec-
ders’), bolaget (‘the company’).
Most of the recall errors are unrecognized repetitions (in spite of the features for string similarity), and redescriptions, e.g., den typen av förändringar (‘that type of changes’) – reformer (‘reforms’).
Pronoun-5, Pronoun-3 and Pronoun-filter data sets with the selected feature
Figure 7.2: Classifiers DefDesc-15, DefDesc-10, and DefDesc-filter, each tested on the three test data sets DefDesc-15-test, DefDesc-10-test, and DefDesc-filter-test. Results presented per anaphor: correctly resolved coreference links (TP; green), precision errors (FP; red), and recall errors (FN; yellow).
are obtained by the Pronoun-3 classifier on the Pronoun-filter-test data set: precision 48.63%, recall 34.23%, and F-score 40.18%; for all classifiers and test data sets, F-score is between 20% and 40%.
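The reported F-score is consistent with the balanced harmonic mean of precision and recall, which the following sketch recomputes from the two figures just quoted. The function name and the `beta` parameter are illustrative; only the balanced case (beta = 1) is assumed to match the evaluation here.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (balanced when beta = 1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Pronoun-3 classifier on Pronoun-filter-test: P = 48.63, R = 34.23
print(round(f_score(48.63, 34.23), 2))  # 40.18
```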
These results clearly show that for this resolution task, the application of linguistically motivated instance selection on the training data as described
training data set performs significantly worse on all three data sets, compared to the other two classifiers trained on data where a linear-k instance selection method was applied (k=3 and 5 sentences, respectively); e.g., the precision score of the Pronoun-filter classifier is 19% on the Pronoun-5 test data (with
pared to 37% for the Pronoun-3 classifier, and 38% for the Pronoun-5 classifier.
classification does not benefit from reducing the number of instances in the
Table 7.5: Pronoun classification results on data sets Pronoun-5-test, Pronoun-3-test, and Pronoun-filter-test with classifiers trained on the data sets Pronoun-5, Pronoun-3, and Pronoun-filter. Precision (‘P’), recall (‘R’), and F-score (‘F’) reported for each classifier and data set.
Pronoun-5-test Pronoun-3-test Pronoun-filter-test
P R F P R F P R F
training data based on the properties of the different pronoun types: the application of our linguistically motivated instance selection method on the training data results in a classifier (P-filter) with the largest number of precision errors on each of the three test data sets. The lowest number of precision errors is obtained by the Pronoun-5 classifier, while the largest number of resolved anaphors on all test data sets is obtained by the Pronoun-3 classifier.
Compared to the cross-validation experiments described in the previous chapter, the results for all three classifiers are disappointing. It is of course to be expected that the results for the cross-validation experiments after development (including feature selection) on that particular data set will be better than the evaluation results on the held-out test data set, but for this particular task of pronoun resolution there is a considerable difference between the cross-validation results and the test results.
During evaluation we noticed that the pronoun type distribution differs considerably between the training and test data sets. Firstly, there are proportionally more pronouns (and fewer definite descriptions) in the training data set than in the test data set, and secondly, there are more occurrences of “difficult” pronouns (e.g., anaphoric–inanimate and anaphoric–mixed) in the test data set than in the training data set.
and because it was performed on the document level the proportion of “easy” vs. “hard” instances was not monitored (then, of course, the split would no longer be arbitrary) – if we were to keep track of the difficulty levels of the data, we would have to look at the proportions of definite description repetitions and redescriptions, of NEs with NE and non-NE antecedents, and of the different pronoun types.
In this case, there are more “hard” instances in the test data set than in the training data set, which explains the difference between the results of the five-fold cross-validation experiments on the training data, and the classifier tests on the test data; to some degree this may also explain the disappointing pronoun resolution results overall.
Figure 7.3: Classifiers Pronoun-5, Pronoun-3, and Pronoun-filter, each tested on the three test data sets Pronoun-5-test, Pronoun-3-test, and Pronoun-filter-test. Results presented per anaphor: correctly resolved coreference links (TP; green), precision errors (FP; red), and recall errors (FN; yellow).
With larger data sets, the risk of this problem occurring would likely be reduced, but as we are dealing with natural language we must handle the difficult cases as well as the easier ones. It is a valuable lesson that when it comes to pronoun resolution, the pronoun type distribution of the test data has a great influence on the evaluation results. In fact, if we use default TIMBL parameters and the complete feature set (so as not to benefit from the feature selection on the training data in the cross-validation experiments) to train a classifier on the Pronoun-3 training data and test on the Pronoun-3 test data, we get the results 33% precision, 33% recall, and 33% F-score. If we then use the same settings to train a classifier on the Pronoun-3 test data, and evaluate on the Pronoun-3 training data, this classifier obtains 53% precision, 39% recall, and
Table 7.6: Classification of different types of pronouns – anaphoric-animate, anaphoric-inanimate, anaphoric-mixed, deictic, and bound pronouns – evaluated on test data sets Pronoun-5-test, Pronoun-3-test, and Pronoun-filter-test. The precision
(‘P’), recall (‘R’), and F-score (‘F’) reported for each pronoun type classifier and data set.
[Table body lost in extraction. Surviving column headers: P-5-inan-test, P-3-inan-test, P-fi-inan-test; and, under “Syntactically bound pronouns”: P-5-bound-test, P-3-bound-test, P-fi-bound-test (P, R, and F for each).]
Different classifiers for different pronoun types
experiments on the three versions of the pronoun training data set (Pronoun-5,
Pronoun-3, and Pronoun-filter), with each data set partitioned into five different subsets depending on the pronoun type of the anaphor: anaphoric-animate, anaphoric-inanimate, anaphoric-either, deictic and bound pronouns. In this section, the classifiers trained on these training data sets are tested on the corresponding subsets of the pronoun test data sets.
The results of the single pronoun classifiers, discussed in the previous section, show that the feature set selected for pronoun resolution, and the approaches to pronoun resolution used here are not sufficient. Our decision to include all types of definite pronouns except relative pronouns as potential
results; if we look at each pronoun type in isolation we can identify some of the problem areas for future work. Results for all pronoun-type specific clas-
For animate pronouns, the highest precision on all three data sets is achieved by the Pronoun-5-animate classifier, while the best recall is obtained by the Pronoun-3-animate classifier. The Pronoun-filter-animate classifier is unable to recognize as many instances as the other two: the recall figures for this classifier are below 40% on all three test data sets.
All three classifiers for inanimate pronouns are unable to resolve coreference links between such pronouns and their antecedents; the results presented
feature set is insufficient for classification. Precision scores are between 27% and 36%, and recall scores are between 9% and 17% for all classifiers for inanimate pronouns.
Resolution of anaphoric pronouns that can refer to either animate or inanimate referents (here referred to as anaphoric-mixed pronouns) is – similarly to classification of inanimate pronouns – a difficult task, as already shown in
the results of the three classifiers trained on the training data sets Pronoun-5-mixed, Pronoun-3-mixed, and Pronoun-filter-mixed, and tested on the corresponding test sets are presented. While the recall scores are a little better than for classification of inanimate pronouns, precision scores for all classifiers are between 18% and 34%, with the highest precision on the most restricted data set. From these results, we conclude that the present feature set is insufficient for classification of anaphoric-mixed pronouns. Because these data sets consist of plural pronouns that may refer to a single (plural) antecedent but also to coordinated or split antecedents, a different approach to resolution is needed.
Deictic pronouns belong to the pronoun types that benefited the most from the feature selection for the pronoun specific deixis classifiers in the five-fold
the classification results for deictic pronouns are less encouraging; the best F-score, 23%, is achieved by the classifier trained on the most restricted data set (Pronoun-3-deixis) on the test data set where antecedents are allowed within a scope of the current or the preceding paragraph (Pronoun-filter-deixis).
The classifiers for syntactically bound pronouns (Pronoun-5-bound,
Pronoun-3-bound, and Pronoun-filter-bound) tested on the bound pronoun
classifiers for animate pronouns (above). The classifier trained on the
Pronoun-filter-bound training data and tested on the Pronoun-filter-bound test data yields the best results with 59% F-score. In these two data sets, only candidate antecedents located (to the left of the pronoun) within the same sentence are allowed; this result indicates that linguistically motivated instance selection of both training and test data can be effective for this pronoun type.
Table 7.7: Coreference chain results for the NE test data sets. MUC score precision (‘P’), recall (‘R’), and F-score (‘F’) are reported for each classifier and data set. The best F-score results for each test data set (columns) are highlighted; the best F-score results for each classifier (rows) are in italics.
[Table body lost in extraction; surviving value: B 1 98.82.]
Coreference resolution results
Before evaluating the results on the coreference chain level, the instances removed from each data set by the application of the different instance selection methods are labeled as non-coreferent and merged with the classifier output.
The merged output is transformed into coreference chains by forming a cluster of coreferent antecedents for each anaphor, and merging the over-
resulting coreference chains are evaluated using the MUC score for preci-
results for both the NP specific tasks, and the method for combining the baseline results for the complete coreference resolution task are described in sec-
In this section, we will discuss the effect of instance selection on the test data. Our hypothesis is that if we restrict the test data set to the most likely anaphor-antecedent pairs (based on their respective accessibility), we can improve on precision. This is motivated by the idea that the choice made by the speaker/writer to use a specific form of referential expression (e.g., a simple lexical NP rather than a complex lexical NP, or a pronoun rather than an NE) in a specific context is a clue to processing (see e.g., Ariel 1990).
NE-10-train, and NE-fi-train data sets, yield the best results on the NE-filter-test data, mainly because of better precision. Our hybrid approach of restricting the set of instances in the test data in accordance with Accessibility theory before classification seems to be advantageous for identification of coreferent
On the NE-filter test data set, where Accessibility theory-based instance selection was used, better results are obtained by the NE-10 and NE-20 classifiers than by the NE-filter classifier (which is trained on Accessibility theory-filtered data), due to better recall by the former. However, the classifier trained on the NE-filter-train data achieves the best results on the test data sets where linear distance is used for instance selection (i.e., NE-20-test and NE-10-test), through better precision. Thus, if we are aiming for precision, linguistically motivated selection of anaphor-antecedent pairs is an option.
Compared to the baseline, there is a loss in precision, but the gain in recall is large enough to yield a substantial increase in F-score for all classifiers evaluated on all test data sets.
Table 7.8: Coreference chain results for the DefDesc test data sets using three extended classifiers trained on the training data sets DefDesc-15, DefDesc-10, and DefDesc-filter. MUC precision (‘P’), recall (‘R’), and F-score (‘F’) are reported for each classifier and data set. The best F-score results for each test data set (columns) are highlighted; the best F-score results for each classifier (rows) are in italics.
When evaluating the outcome of resolution of definite descriptions on the coreference chain level, we find that all classifiers achieve their best results in combination with instance selection on the test data based on Accessibility theory (i.e., on the DefDesc-filter-test data set), and the classifier trained on the data set filtered according to Accessibility theory (i.e., DefDesc-filter)
Compared to the baseline, the DefDesc-filter classifier obtains better recall figures overall, with modest gains in precision. The other two classifiers, DefDesc-15 and DefDesc-10, also give improvements in recall paired with lower precision compared to the baseline.
The MUC F-scores for the merged output of the definite description classifiers (about 44% to 49%) are markedly lower than the MUC F-scores for the output of the NE classifiers, which again shows that resolution of definite descriptions is a more difficult task than identification of coreferent NEs.
These results show a positive effect of the Accessibility theory-based instance selection method on both training and test data: first, the classifier trained on the data set filtered according to Accessibility theory (DefDesc-filter) achieves the best precision and the best recall on all data sets, and second, the best results for all classifiers are obtained in combination with the Accessibility theory-based instance selection of the test data (i.e., on the DefDesc-filter-test data set). Because the gain in F-score is relatively small, the main benefit of this approach is the reduction in size of the training data; the DefDesc-filter classifier is trained on 23% of the complete training data set, while the DefDesc-10 classifier is trained on 65%, and DefDesc-15 on 88%.
Table 7.9: Coreference chain results for the pronoun test data sets using three extended single classifiers (P-5, P-3, and P-fi), and the extended combined classifiers (P-5-c, P-3-c, and P-fi-c). MUC precision (‘P’), recall (‘R’), and F-score (‘F’) are reported for each classifier and data set. The best F-score results for each test data set (columns) are highlighted; the best F-score results for each classifier (rows) are in italics.
P-5-test P-3-test P-filter-test
The results of the output of the single pronoun classifiers on the three versions of the pronoun test data set merged into coreference chains are presented in
two baselines, these results are not satisfactory.
The Pronoun-5 classifier yields the best precision (44% and 47%) in two out of three versions of the test data, while the best recall (34% and 28%) in two out of three tests is obtained by the Pronoun-filter classifier (even though this classifier is trained on fewer instances than the other two classifiers).
The Pronoun-3 classifier, which is trained on data that allows antecedent candidates within the three preceding sentences, gives the best F-score overall, with the best result on the Pronoun-5-test data set (35% F-score). This classifier also gives the best precision overall, 51%, on the Pronoun-filter data set.
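The linear-k selection behind data sets such as Pronoun-3 and Pronoun-5 can be sketched as a sentence-distance window over candidate antecedents. The sketch assumes that mentions are indexed in text order, that each mention carries a sentence index, and that candidates in the anaphor's own sentence are included; these details and the function name are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def linear_k_pairs(
    sentence_ids: List[int], anaphors: Iterable[int], k: int
) -> List[Tuple[int, int]]:
    """Return (candidate, anaphor) pairs whose sentence distance is at most k.

    sentence_ids[m] is the sentence index of mention m; mentions are
    assumed to be listed in text order.
    """
    pairs = []
    for j in anaphors:
        for i in range(j):  # only candidates preceding the anaphor
            if sentence_ids[j] - sentence_ids[i] <= k:
                pairs.append((i, j))
    return pairs


sentence_ids = [0, 0, 1, 3, 4, 4]  # six mentions spread over five sentences
pronoun3_style = linear_k_pairs(sentence_ids, anaphors=[5], k=3)
```

With k = 3 the two mentions in sentence 0 fall outside the window of the anaphor in sentence 4, so only three of five candidates are kept; a larger k keeps more candidates at the cost of more negative instances.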
It is difficult to interpret the results as to the influence of the instance selection methods as linguistic filters for removing unlikely anaphor-antecedent pairs from the test data before classification. Overall, there is an improvement in precision for all classifiers when evaluated on the Pronoun-filter test data set, but this gain is paired with a loss in recall. The same tendency can be detected on the other two test data sets: in comparing the results by the same classifiers on Pronoun-5 test and Pronoun-3 test we can note an increase in precision and a decrease in recall when the search scope of the anaphors is further restricted.
The positive effect of linguistically motivated instance selection as seen for
NEs and definite descriptions cannot be seen here; our conclusion is that our approach to pronoun resolution must be improved both as to feature selection and instance selection. As indicated by the results of the classifiers for different pronoun types, we can also improve on the overall results by narrowing the pronoun resolution task to a subset of all pronoun types, e.g., animate pronouns.
Table 7.10: Combined coreference linking results. The choice of classifiers and instance selection methods for the test data sets are based on 1) MUC F-score, and 2) evaluation per anaphor. Results presented as MUC precision, recall, and F-score.
Precision Recall F-score
1) Best MUC F-score
2) Best result per anaphor (Venn diagrams)
Combining the results
In order to compare the results of our system for the complete task of coreference resolution, we use two methods for combining the results for identification of coreferent NEs, resolution of definite descriptions, and pronoun resolution: the first method is to merge the output from the combination of classifier and test data instance selection method with the best MUC F-score
for each task, and the second is to merge the output from the combination of classifier and test data instance selection method with the best result per anaphor for each task (defined as the classifier with the largest class of resolved anaphors including anaphors with both resolved links and recall errors, and the smallest class of precision errors, as described in the three-circle Venn diagrams).
The following two combinations of classifiers and test data instance selection methods are compared:
1. The classifiers and the test data instance selection methods with the best MUC F-score for each task: for NEs we select the NE-10 classifier tested on the NE-filter-test data set (MUC F-score 81%), for definite descriptions we select the DefDesc-filter classifier on the
DefDesc-filter-test data set (MUC F-score 49%), and for pronouns the Pronoun-3 classifier on the Pronoun-5-test data set (MUC F-score
2. The classifiers and the test data instance selection methods with
NE-filter-test data set (MUC F-score 81%), for definite descriptions we select the DefDesc-10 classifier on the DefDesc-filter-test data set (MUC F-score 48%), and for pronouns the Pronoun-3 classifier on the Pronoun-filter-test data set (MUC F-score 32%).
give higher precision and higher recall. The results of the two combinations are similar, but we can see that the MUC F-score combination method gives higher recall, while the best Venn combination method yields better precision.
Our hybrid approach to coreference resolution combines a supervised machine learning approach with linguistically motivated selection of anaphor-antecedent candidates. In the first step of this approach, instance selection is used to select likely candidate anaphor-antecedent pairs from the set of all possible pairs; the second step is the application of a memory-based classifier for determining coreference between the selected pairs. After classification the sets of anaphor-antecedent candidates labeled as non-coreferent (i.e., the unlikely anaphor-antecedent pairs removed during instance selection and the pairs labeled as non-coreferent by the classifier) and the set of candidates labeled as coreferent by the classifier are merged. The pairs labeled as coreferent are clustered using the aggressive-merge method, and the resulting coreference chains are evaluated against the manually annotated gold standard.
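The control flow of these steps can be sketched as below. The filter and the classifier are passed in as plain functions: `is_likely` stands in for the Accessibility theory-based selection and `classify` for the memory-based classifier, neither of which is reproduced here, and the clustering again assumes that aggressive-merge is a transitive closure over positive pairs. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]  # (antecedent index, anaphor index)


def hybrid_resolve(
    pairs: List[Pair],
    is_likely: Callable[[Pair], bool],
    classify: Callable[[Pair], bool],
    n_mentions: int,
) -> Tuple[Dict[Pair, bool], List[List[int]]]:
    """Filter pairs, classify the rest, merge the labels, and cluster."""
    labels: Dict[Pair, bool] = {}
    for pair in pairs:
        # Step 1: pairs rejected by the filter are labeled non-coreferent.
        # Step 2: the remaining pairs are handed to the classifier.
        labels[pair] = classify(pair) if is_likely(pair) else False

    # Cluster positive pairs by transitive closure (union-find).
    parent = list(range(n_mentions))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), coreferent in labels.items():
        if coreferent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters: Dict[int, List[int]] = {}
    for idx in range(n_mentions):
        clusters.setdefault(find(idx), []).append(idx)
    chains = [members for members in clusters.values() if len(members) > 1]
    return labels, chains


# Toy stand-ins: a distance-based filter and a string-match "classifier".
mentions = ["Metall", "facket", "det", "Metall"]
all_pairs = [(i, j) for j in range(len(mentions)) for i in range(j)]
labels, chains = hybrid_resolve(
    all_pairs,
    is_likely=lambda p: p[1] - p[0] <= 2,
    classify=lambda p: mentions[p[0]] == mentions[p[1]] or p == (0, 1),
    n_mentions=len(mentions),
)
```

In the toy run the pair (0, 3) is removed by the filter before classification, so the second Metall is never linked: the filter can only lower the number of positive decisions, which is why it tends to trade recall for precision.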
We also use the same instance selection methods to reduce the amount of examples stored in the memory of the classifier (i.e., the training data), as
The output of this hybrid approach is compared to that of a baseline method based on simple heuristics. During evaluation we focus on:
1. the effect of reducing the amount of classifier training data by removing unlikely anaphor-antecedent candidates, and
2. the effect of restricting the test data by applying instance selection as a linguistic filter which labels unlikely anaphor-antecedent candidates as non-coreferent, before handing the remaining candidates to the classifier for recognition of coreferent anaphor-antecedent pairs.
Regarding the effect of reducing the amount of classifier training data, we find that for NEs and definite descriptions the training data can be substantially reduced by applying Accessibility theory-based instance selection with only a minor decrease in performance. The main benefit of training data instance selection for these two tasks is an increase in processing speed as the data is reduced to about 22% of the complete training data set for both NP types (see
For pronouns, the results are less encouraging with respect to reduction of the training data; we find that linguistically motivated instance selection on the training data is not beneficial to pronoun resolution, and that the best results in two out of three cases are obtained by the classifier trained on the largest amount of training data (Pronoun-5). The difference between the results for this classifier and for the classifier trained on all candidates in the three preceding sentences (Pronoun-3), however, is small. When evaluating the pronoun type specific classifiers, we find that the classifiers trained on the most restricted data sets for anaphoric-animate pronouns and deictic pronouns yield the best results while the results for the other pronoun types are
(possibly with the exception of the classifier for anaphoric-animate pronouns) suffer from data sparseness, and that training data instance selection in some cases aggravates this problem.
When evaluating the results on the coreference chain level, we find that there is a positive effect of restricting the test data for NEs and definite descriptions by applying instance selection as a linguistic filter which labels unlikely anaphor-antecedent candidates as non-coreferent.
For pronouns, removing unlikely anaphor-antecedent candidates through linguistically motivated test instance selection has a positive effect on precision, coupled with a negative effect on recall; this is probably due to the fact that this method filters out more anaphor-antecedent candidates than the other
We summarize the outcome of our experiments on hybrid methods for coreference resolution as follows:
• For NEs, all three classifiers obtain their best results on the test data set where Accessibility theory-based instance selection has been applied (NE-filter-test). The best result overall is achieved by the NE-10 classifier in combination with Accessibility theory-based test instance selection (MUC F-score 81.37%). In two out of three cases (on the linear-k data sets NE-20-test and NE-10-test), the classifier trained on data where Accessibility theory-based instance selection
• For definite descriptions, all three classifiers obtain their best results on the test data set where Accessibility theory-based instance selection has been applied. For each test data set, regardless of instance selection method (i.e., DefDesc-15-test, DefDesc-10-test, and DefDesc-filter-test), the classifier trained on data where Accessibility theory-based instance selection has been applied (DefDesc-filter) achieves the best results due to better precision. The best result overall is MUC F-score 49.31% (63.31% precision, 40.38% recall), given by the DefDesc-filter classifier on the DefDesc-filter-test. (See ta-
• For pronouns, the classifier trained on the most restricted data set (Pronoun-3) in combination with the least restricted instance selection method (Pronoun-5-test) yields the best result, MUC F-score 35.37% (precision 42.8%, recall 30.13%). For all classifiers, there is a positive effect of linguistically motivated test instance selection (Pronoun-filter-test) as pertains to precision, but as this method is the most restrictive there is also a loss in recall. (See
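The F-scores reported in this summary are consistent with the balanced harmonic mean of precision and recall. The following is a minimal sketch under that assumption; the function name f_score is illustrative and not part of the system described in this thesis.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall.

    Both arguments are given in the same unit (here: percent).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the best DefDesc and pronoun results.
defdesc_f = f_score(63.31, 40.38)
pronoun_f = f_score(42.8, 30.13)

print(f"DefDesc F-score: {defdesc_f:.2f}")
print(f"Pronoun F-score: {pronoun_f:.2f}")
```

The computed values agree with the reported scores to within rounding; a small deviation for pronouns is to be expected, since precision is given with one decimal only.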
In combining the results for identification of coreferent NEs, resolution of definite descriptions, and pronoun resolution, we can evaluate the result for the complete coreference resolution task. Our combined result, based on the best
This is an improvement compared to the combined baseline score of MUC F-score 47.20%, but as we have noted in the discussions of the three tasks, our approach is most successful for identification of coreferent NEs and resolution of definite descriptions, whereas further work is needed in order to solve the pronoun resolution task.
The aim of this thesis is to explore how data-driven methods in combination with linguistic knowledge can be used for coreference resolution in Swedish.
We have presented a hybrid approach which combines machine learning with linguistically motivated selection of anaphor-antecedent candidates, covering names, definite descriptions, and pronouns. This is, to our knowledge, the first system where coreference resolution in Swedish has been addressed in a uniform way, using specialized classifiers for each task in order to maximize performance.
The data-driven method adopted is Memory-Based Learning (MBL), a supervised method based on the idea that learning means storing experiences in memory, and that new problems are solved by reusing solutions from similar experiences (Daelemans and Van den Bosch 2005). The general approach to coreference resolution used in this thesis is to pair each NP within a document with all preceding NPs. For each such pair, linguistic knowledge on both NPs and on the relation between the two NPs is used to predict whether this pair is coreferent or not. In a second step, coreference chains are built from the NPs predicted by the classifier as coreferent.
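The pairing step and the memory-based classification step can be illustrated with a short sketch. It is a simplification under stated assumptions: the feature tuples, the unweighted overlap distance, and the function names are hypothetical, and the classifiers used in this thesis rely on a much richer feature set and on the parameter settings of the MBL software.

```python
from collections import Counter
from typing import List, Tuple

Features = Tuple[str, ...]


def candidate_pairs(nps: List[str]) -> List[Tuple[int, int]]:
    """Pair each NP (by position in the document) with every preceding NP."""
    return [(ante, ana) for ana in range(len(nps)) for ante in range(ana)]


def overlap_distance(a: Features, b: Features) -> int:
    """Number of feature positions on which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(memory: List[Tuple[Features, bool]], query: Features, k: int = 1) -> bool:
    """Label a query by majority vote among its k nearest stored instances."""
    nearest = sorted(memory, key=lambda item: overlap_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Three NPs in document order yield three (antecedent, anaphor) index pairs.
pairs = candidate_pairs(["Ericsson", "bolaget", "det"])

# Hypothetical training instances: (string match, gender agreement, distance).
memory = [
    (("match", "same_gender", "near"), True),
    (("no_match", "diff_gender", "far"), False),
]
prediction = classify(memory, ("match", "same_gender", "far"))
```

Learning here amounts to storing the instances in memory; all generalization happens at classification time, when the query is compared with the stored instances.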
One of the main contributions of this thesis is a comprehensive feature set for coreference resolution in Swedish. This feature set was compiled based primarily on the following criteria: relevance according to previous research on knowledge-based and data-driven anaphora and coreference resolution, findings in corpus studies on these phenomena, and relevance for the language and the domain. From this set, we used linguistically motivated feature selection for each of the three coreference resolution subtasks: identification of coreferent NEs, resolution of definite descriptions, and pronoun resolution.
While there are language-specific features, the majority of features can readily be adapted to other languages. The only domain-specific knowledge added to the system is provided by a set of word-space derived features. A key advantage to using word-space, or distributional, semantics is that this model can provide knowledge on lexico-semantic similarity in any language and domain, given appropriate text material (Nilsson and Hjelm 2009).
The linguistic knowledge for selecting likely anaphor-antecedent pairs is based on Accessibility theory, which is focused on the relation between the referential form of an NP and its cognitive status, its accessibility as an antecedent (Ariel 1990). This theory of a direct relation between referential form and cognitive status forms the basic framework for our instance selection methods.
As stated by Daelemans, Van den Bosch and Zavrel (1999), forgetting exceptions is harmful in language learning: natural language is noisy and complex, and contains both generalizable regularities and exceptions. Our instance selection methods are not based on the removal of irregularities and exceptions, but aim to solve part of the coreference resolution problem, namely, how to recognize and discard unlikely candidate antecedents from the set of candidates for each anaphor using linguistic knowledge, before the problem of recognizing coreference between (the most likely) anaphor-antecedent pairs is handled by the classifier. The reduction of the set of NPs is further motivated by the imbalance between coreferent and non-coreferent NPs: only a small part of the possible links between NPs is coreferential, and the majority of NPs are not coreferentially related to any other NP within the document.
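The idea of discarding unlikely candidates before classification can be sketched as a filter placed in front of a classifier. The sketch assumes only the general tendency described by Accessibility theory, namely that pronouns tend to retrieve nearby antecedents while names and definite descriptions can retrieve more distant ones. The distance limits, the Pair class, and the function names are hypothetical and do not reproduce the selection criteria used in this thesis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Pair:
    anaphor_type: str        # "name", "defdesc" or "pronoun"
    sentence_distance: int   # sentences between antecedent and anaphor


# Hypothetical limits, for illustration only.
MAX_SENTENCE_DISTANCE: Dict[str, int] = {"pronoun": 2, "defdesc": 10, "name": 20}


def is_likely(pair: Pair) -> bool:
    """Linguistic filter: keep only candidates within the assumed distance limit."""
    return pair.sentence_distance <= MAX_SENTENCE_DISTANCE[pair.anaphor_type]


def resolve(pairs: List[Pair], classifier: Callable[[Pair], bool]) -> List[bool]:
    """Label filtered-out pairs as non-coreferent; classify only the rest."""
    return [is_likely(p) and classifier(p) for p in pairs]


pairs = [Pair("pronoun", 1), Pair("pronoun", 8), Pair("name", 8)]
# A stand-in classifier that accepts every pair makes the filter's effect visible.
decisions = resolve(pairs, classifier=lambda p: True)
```

Because the filter rejects a pair before the classifier is consulted, the same function can serve both to reduce the training data and to restrict the test data.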
We have demonstrated two ways of using this method of linguistically motivated selection of anaphor-antecedent pairs. First, the amount of training examples stored in memory is reduced. We find that for coreference resolution of definite descriptions and names, the amount of training data can be reduced, with only a minor loss in performance. For pronoun resolution there is a negative effect of reducing the training data. The reduction of training data is substantial; we use less than 1/4 of the complete training data set.
Second, selection can be used to improve on coreference resolution results.
This is the first step in our hybrid approach to coreference resolution, where the second step is the application of an MBL classifier for determining coreference between the selected pairs. Results indicate that this hybrid approach is advantageous for coreference resolution of definite descriptions and names.
For names, there is an increase in precision, while for definite descriptions there is an increase in recall. For pronoun resolution, there is a negative effect on recall along with a positive effect on precision.
Because Accessibility theory is claimed to be universal (although the set of referring expressions varies between languages) (Ariel 1990, Ariel 2006), our instance selection method based on this theory can likely be applied to other languages.
In this thesis, we have addressed three tasks of coreference resolution, namely identification of coreferent NEs, resolution of definite descriptions, and pronoun resolution. We end this thesis by discussing ideas for further improving our system.
First, in order to make our system fully automatic, we need to add automatic recognition of markables (i.e., NPs), and NE recognition to the system.
Because information on NP complexity is used both for instance selection and feature construction, we need parsing rather than simple NP chunking. PP attachment disambiguation may further improve on resolution results (see e.g.,
Second, the instance selection methods can be improved. In these experiments, we use the same methods for reduction of the training data, and for removing unlikely candidates in the unseen data before classification; it may be that these two tasks require different instance selection methods. It would also be interesting to investigate whether these methods, based on Accessibility theory which is claimed to be universal, have a similar effect for other languages.
Third, the data-driven part of our system can be improved:
• We use specialized classifiers for each task with specific feature sets, where the feature selection is linguistically motivated. It would be interesting to contrast these results to those obtained by automatic feature selection. It is likely that results can be improved further by exploring different combinations of learner parameter settings, feature sets, and methods for instance selection (cf. Hoste 2005).
• Even though one of the themes in this thesis has been reduction of training data, we acknowledge that a way of improving on the results of the classifiers would be to add more, carefully selected, instances to the training data; in particular, positive examples of redescriptions (i.e., coreferent lexical NPs with different head words) and all kinds of pronouns.
• The features used for determining coreference between NPs must be improved, not only by careful feature selection, but also by adding new features that better capture some of the relations between coreferent NPs. In particular, better features (both of higher quality, and new features) and possibly also new resolution strategies for e.g., inanimate and plural pronouns are needed for pronoun resolution.
• The aggressive-merge method for linking the predictions of the classifier into coreference chains is one of the most basic methods; by adopting a more complex strategy, either data-driven or knowledge-based, results would likely improve.
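As an illustration of the chain-building step, the sketch below assumes that aggressive merging means taking the transitive closure of all pairs that the classifier labels as coreferent. The function name and the union-find implementation are choices made for this example, not a description of the code used in this thesis.

```python
from typing import Dict, List, Tuple


def aggressive_merge(n_nps: int, coreferent_pairs: List[Tuple[int, int]]) -> List[List[int]]:
    """Merge every positively classified pair into chains (transitive closure)."""
    parent = list(range(n_nps))

    def find(i: int) -> int:
        # Follow parent links to the chain representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the earliest NP in the document as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    chains: Dict[int, List[int]] = {}
    for i in range(n_nps):
        chains.setdefault(find(i), []).append(i)
    # NPs that are not linked to any other NP do not form a chain.
    return [chain for chain in chains.values() if len(chain) > 1]


chains = aggressive_merge(6, [(0, 2), (2, 5), (3, 4)])
```

A single false positive link is enough to merge two otherwise separate chains, which is one reason why a more selective linking strategy might improve precision.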
Finally, recognition of coreference between NEs and definite descriptions, and resolution of redescriptions are difficult tasks, as they require (possibly domain-dependent) knowledge on semantic relatedness. We have experimented with two knowledge sources for lexico-semantic similarity: the general synonymy lexicon SynLex, and word-space models trained on data from the same domain as our training and test data. While the results of using features derived from word-space models are encouraging, it may be advantageous to add other sources of semantic knowledge, including domain-specific or encyclopedic knowledge.
1. 2003-09-24 Hård kritik mot Skandia Livs ledning. Dan Olsson, DN
2. 2003-09-24 Teknikras på Wall Street. TT-AFP
3. 2003-09-25 Konjunkturen sämre än i tidigare SCB-rapport. Johan Schück,
4. 2003-09-25 IT ledde börsnedgången. TT
5. 2003-09-25 Världens ekonomier hotas av obalans. Johan Schück, DN
6. 2003-09-25 Pagrotsky får ryggdunkar av svenska folket. Juan Flores, DN
7. 2003-09-25 Liten ökning av handelsnettot. TT
8. 2003-09-25 Opecbeslut drog ned Tokyo. TT-Reuters
9. 2003-09-25 Minskad försäljning för H&M. Pia Gripenberg, DN Ekonomi
10. 2003-09-25 Nokias nya mobil - design till varje pris. TT
11. 2003-09-25 Japan godkänner Scaniafordon. TT
12. 2003-09-25 Mycket blev sagt, litet beslutat. Marianne Björklund, DN
13. 2003-09-25 Starkare krona kan ge färre jobb. Dan Olsson, DN Ekonomi
14. 2003-09-25 Electrolux lägger ut mer produktion i låglöneländer. Anders Olsson, DN Ekonomi
15. 2003-09-25 Bush överger "starka dollarn politik". Lennart Pehrson, DN
16. 2003-09-29 Sony Ericssons 3G-mobil först efter nyår. TT
17. 2003-09-29 Teknik drog upp börserna i New York. TT-Reuters
18. 2003-09-29 Air France köper KLM för sju miljarder. TT-Reuters
19. 2003-09-29 Ericsson spås ljusare framtid. Kalle Nilsson, DN Ekonomi
20. 2003-09-30 Svenskars sparande minskar. TT
21. 2003-09-30 Svenskar optimister trots trög återhämtning. TT
22. 2003-09-30 Både upp och ner på Tokyobörser. TT
23. 2003-09-30 Vasakronans vd skeptisk till fastigheter på börsen. Anders Olsson, DN Ekonomi
24. 2003-09-30 Låga löner vanliga i industrin. Bosse Andersson, DN Ekonomi
25. 2003-09-30 Advokater fällda i bedrägerihärva. Lasse Wierup, DN Ekonomi
26. 2003-09-30 Försiktig uppgång i Stockholm. TT
27. 2003-09-30 Näringslivet kritisk mot småföretagsplan. TT
28. 2003-09-30 Danska staten går in i oljebranschen. TT
29. 2003-09-30 Ny kassa dyr för liten ICA-butik. Sophie Petzell, DN Ekonomi
30. 2003-09-30 Ericsson avvaktar med affärer. Kalle Nilsson, DN Ekonomi
31. 2003-09-30 Lindsö vill ha huvudkontoren kvar i Sverige. Bosse Andersson,
32. 2003-09-30 Sjuk och utbränd chef kostar en miljon. Bosse Andersson, DN
33. 2003-09-30 Dyster statistik sänkte Wall Street. TT-Reuters
34. 2003-09-30 Bristyrke i vinter: medlare. Cecilia Jacobsson, DN Ekonomi
35. 2003-09-30 3,5 procent rimlig löneökningstakt anser facken. Cecilia Jacobsson, DN Ekonomi
36. 2003-09-30 Analytiker: "SAS går samman med Finnair". Olof Sandberg,
37. 2003-09-30 Livbolag ska kunna bötfällas. Dan Olsson, DN Ekonomi
38. 2003-09-30 Många PPM-fonder stiger i värde. Maria Crofts, DN Ekonomi
39. 2003-09-30 Medlarbas manar till återhållsamhet. Bosse Andersson, DN
40. 2003-09-30 Fastighetsbolagen vill vara kvar på börsen. Anders Olsson, DN
41. 2003-10-01 Även Folksam sänker pensioner. Dan Olsson, DN Ekonomi
42. 2003-10-01 Konsumtionen viktig för svensk tillväxt. TT
43. 2003-10-01 Tokyo steg efter tankan-rapport. TT-Reuters
44. 2003-10-01 Bilförsäljningen väntas öka. TT
45. 2003-10-01 Tillväxten bromsas upp. TT
46. 2003-10-01 Bures bokslutskommuniké vilseledde marknaden. Pia Gripenberg, DN Ekonomi
47. 2003-10-01 Euromaint drar ned med trehundra. TT
48. 2003-10-01 Bilförsäljningen väntas öka. Kalle Nilsson, DN Ekonomi
49. 2003-10-01 Telia satsar på snabb mobilteknik. Anders Lignell/TT
50. 2003-10-01 EU försenar Sydkrafts Graningebud. Kalle Nilsson, DN
51. 2003-10-01 Kundkontakter ett "sjukt" jobb. Leni Weilenmann, DN
52. 2003-10-01 Nordbankenmäklare överklagar vårddomen. Ingrid Carlberg,
53. 2003-10-01 Folksam återtar pensionspengar. Dan Olsson, DN Ekonomi
54. 2003-10-02 Kraftig uppgång på Asienbörser. TT-Reuters
55. 2003-10-07 Konkursdrabbad efter inbrott. Juan Flores, DN Ekonomi
56. 2003-10-07 Nya varsel hotar Migrationsverket. TT
57. 2003-10-07 Skandiachefers lägenhetsaffärer blir rättsfråga. DN Ekonomi
58. 2003-10-07 Färre optimistiska om landets ekonomi. TT
59. 2003-10-07 Hongkong gick emot asiatisk optimism. TT-Reuters
60. 2003-10-07 Ericsson upp på fallande börs. TT
61. 2003-10-07 Shorts på jobbet fall för Arbetsdomstolen. Cecilia Jacobsson,
62. 2003-10-07 Tre nya fall av insiderbrott. TT
63. 2003-10-07 Mejerichefer åtalas för dödsolycka. TT
64. 2003-10-07 Tiden läker alla sår även Skandias. TT
65. 2003-10-07 Dystrare läge på arbetsmarknaden. TT
66. 2003-10-07 Ericsson-vd tror på säkrare marknad. TT
B. The complete feature set
Table B.1: The complete feature set, and the feature set for the respective NP types Named Entities, definite descriptions, and pronouns. Descriptive features describe both anaphor and antecedent (anaphor/antecedent), whereas comparative features describe some aspect of comparison between the two NPs. The features are described
Feature description Named Entities DefDescs Pronouns
Descriptive positional features
1, 2 NP no. in document
3, 4 NP no. in paragraph
5, 6 Left-most NP in sentence
7, 8 NP in headline
9, 10 NP in quoted speech
Comparative positional features
Distance in NPs
Distance in sentences
Descriptive morpho-syntactic and lexical features
Head word part-of-speech
19, 20 Grammatical gender
21, 22 Number
23, 24 Definiteness
25, 26 Case
Comparative morpho-syntactic and lexical features
Descriptive syntactic features
35, 36 Dependency relation of head word
37, 38 Complexity of NP
39, 40 NP with nested PP
41, 42 NP with genitive modifier
Comparative syntactic features
Dependency between NPs
Descriptive context features
45, 46 The maximal NP
47, 48 The apposition, if any
49, 50 Dominant constituent of NP
51, 52 POS of dominant constituent
Comparative similarity features
Substring min anaphor
Substring min antecedent
Substring max anaphor
Substring max antecedent
Substring baseform anaphor
Substring baseform antecedent
Nested NP of ana eq. to ante
Nested NP of ante eq. to ana
App. NP of ante eq. to ana
App. NP of ana eq. to ante
App. NP of ante eq. to ana min
App. NP of ana eq. to ante min
App. NP of ante eq. to ana max
App. NP of ana eq. to ante max
Descriptive lexico-semantic features
74, 75 Natural gender
76, 77 Animacy
78, 79 NE type
80, 81 Temporal expr: month
82, 83 Temporal expr: day
84, 85 Temporal expr: other
Comparative lexico-semantic features
Same NE type
Same NP type
Synonyms in SynLex
SynLex synonymy score
91 WS syntagmatic ‘s-plain’
92 WS syntagmatic ‘s-SVD’
93 WS syntagmatic ‘s-MI’
94 WS paradigmatic ‘p-plain’
95 WS paradigmatic ‘p-SVD’
96 WS paradigmatic ‘p-MI’
97 WS best pair ‘s-plain’
98 WS best pair ‘s-MI’
99 WS best pair ‘s-SVD’
100 WS best pair ‘p-plain’
101 WS best pair ‘p-MI’
102 WS best pair ‘p-SVD’
103 WS top-10 ‘s-plain’
104 WS top-10 ‘s-MI’
105 WS top-10 ‘s-SVD’
106 WS top-10 ‘p-plain’
107 WS top-10 ‘p-MI’
108 WS top-10 ‘p-SVD’
108 WS intersection top-10 ‘plain’
109 WS intersection top-10 ‘MI’
110 WS intersection top-10 ‘SVD’
110 Total no. features
Androutsopoulos, I. and Aretoulaki, M. (2003). Natural language interaction, in R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics, Oxford University Press, chapter 35, pp. 629–649.
Aone, C. and Bennet, S. W. (1995). Evaluating automated and manual acquisition of anaphora resolution strategies, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL, Cambridge, MA, pp. 122–129.
Ariel, M. (1990). Accessing Noun-Phrase Antecedents, Taylor & Francis.
Ariel, M. (2006). Accessibility theory, in K. Brown (ed.), Encyclopedia of Language & Linguistics, 2 edn, Elsevier, Oxford, pp. 15–18.
Azzam, S., Humphreys, K. and Gaizauskas, R. (1999). Coreference Resolution in a Multilingual Information Extraction System, Proceedings of the ACL Workshop on Coreference and its Applications, ACL, Maryland.
Baldwin, B. (1997). CogNIAC: High precision coreference with limited knowledge and linguistic resources, Proceedings of the ACL’97/EACL’97 Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts, Madrid, Spain, pp. 38–45.
Baldwin, B. and Morton, T. S. (1998). Dynamic Coreference-based Summarization, Proceedings of the Third Conference on Empirical Methods in Natural Language Processing, Granada, Spain.
Barbu, C., Evans, R. and Mitkov, R. (2002). A corpus based analysis of morphological disagreement in anaphora resolution, Proceedings of LREC 2002, The Third International Conference on Language Resources and Evaluation, Las Palmas, Spain.
Bergler, S. and Knoll, S. (1996). Coreference Patterns in the Wall Street Journal, in C. Percy, C. Meyer and I. Lancashire (eds), Synchronic corpus linguistics. Papers from the Sixteenth International Conference on English Language Research on Computerized Corpora, Toronto 1995, Amsterdam: Rodopi.
Bies, A., Ferguson, M., Katz, K. and MacIntyre, R. (1995). Bracketing Guidelines for Treebank II Style, 1995, CIS Technical Report MS-CIS-95-06 edn, Penn Treebank Project, University of Pennsylvania.
Borthen, K. (2004a). Annotation scheme for BREDT. Version 1.0, University of Bergen.
Borthen, K. (2004b). Predicative NPs and the annotation of reference chains, Proceedings of Coling 2004, Geneva, Switzerland, pp. 1175–1178.
Bosch, P. (1983). Agreement and Anaphora. A Study of the Role of Pronouns in Syntax and Discourse, Cognitive Science Series, Academic Press.
Brennan, S. E., Friedman, M. W. and Pollard, C. J. (1987). A centering approach to pronouns, Proceedings of the 25th Annual Meeting of the ACL, Association of Computational Linguistics, Stanford, CA, pp. 155–162.
Cardie, C. and Wagstaff, K. (1999). Noun Phrase Coreference as Clustering, Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, ACL, pp. 82–89.
Carlberger, J. and Kann, V. (1999). Implementing an efficient part-of-speech tagger, Software Practice and Experience
Charniak, E. (1972). Toward a Model of Children’s Story Comprehension, PhD thesis, Massachusetts Institute of Technology.
Chen, Y. and Hacioglu, K. (2006). Exploration of coreference resolution: The ACE entity detection and recognition task, Text, Speech and Dialogue, Vol. 4188/2006 of Lecture Notes in Computer Science
Chinchor, N. (1997). MUC-7 Named Entity Task Definition (version 3.5), Proceedings of the Seventh Message Understanding Conference (MUC-7). Available from http://www.itl.nist.gov/ (Last checked Oct. 14, 2005.).
Cristea, D., Ide, N. and Romary, L. (1998). Veins Theory: A model of global discourse cohesion and coherence, Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and of the 17th International Conference on Computational Linguistics (COLING-ACL’98), Canada, pp. 281–285.
Cristea, D., Ide, N., Marcu, D. and Tablan, V. (2000). An empirical investigation of the relation between discourse structure and co-reference, COLING 2000, 18th International Conference on Computational Linguistics, Proceedings of the Conference, Saarbrücken, Germany, pp. 208–214.
Daelemans, W. and Van den Bosch, A. (2005). Memory-Based Language Processing, Studies in Natural Language Processing, Cambridge University Press.
Daelemans, W., Van den Bosch, A. and Zavrel, J. (1999). Forgetting exceptions is harmful in language learning, Machine Learning. Special issue on Natural
Dalianis, H. and Åström, E. (2001). SweNam - A Swedish Named Entity recognizer. Its construction, training, and evaluation., Technical Report TRITA-NA-
Denis, P. and Baldridge, J. (2008). Specialized models and ranking for coreference resolution, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, ACL, Honolulu, Hawaii, pp. 660–669.
Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S. and Weischedel, R. (2004). The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation, Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Ejerhed, E., Källgren, G., Wennstedt, O. and Åström, M. (1992). The Linguistic Annotation System of the Stockholm-Umeå Corpus Project, Technical Report 33, Department of General Linguistics, University of Umeå.
Fellbaum, C. D. (ed.) (1998). WordNet: An Electronic Lexical Database
Fliedner, G. (2006). Towards natural interactive question answering, Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, Italy.
Fraurud, K. (1992). Processing Noun Phrases in Natural Language Discourse, PhD thesis, Stockholm University.
Garera, N. and Yarowsky, D. (2006). Resolving and generating definite anaphora by modeling hypernymy using unlabeled corpora, Proceedings of the Conference on Natural Language Learning, CoNLL.
Grishman, R. (2003). Information Extraction, in R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics, Oxford University Press, chapter 30, pp. 545–559.
Grosz, B., Joshi, A. and Weinstein, S. (1995). Centering: A Framework for Modeling the Local Coherence of Discourse, Computational Linguistics 2
Haghighi, A. and Klein, D. (2007). Unsupervised coreference resolution in a nonparametric Bayesian model, Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL, Prague, Czech Republic, pp. 848–855.
Haghighi, A. and Klein, D. (2009). Simple coreference resolution with rich syntactic and semantic features, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, ACL, Singapore.
Halliday, M. A. K. and Hasan, R. (1976). Cohesion in English
Hartrumpf, S. (2001). Coreference Resolution with Syntactico-Semantic Rules and Corpus Statistics, Proceedings of the Fifth Computational Natural Language Learning Workshop (CoNLL-2001), Toulouse, France, pp. 137–144.
Hassel, M. (2001). Internet as Corpus. Automatic Construction of a Swedish News
Hendrickx, I., Hoste, V. and Daelemans, W. (2007). Evaluating hybrid versus data-driven coreference resolution, Anaphora: Analysis, Algorithms and Applications. 6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007. Lagos, Portugal, March 2007. Selected papers, Springer, pp. 137–150.
Hendrickx, I., Hoste, V. and Daelemans, W. (2008). Semantic and Syntactic Features for Dutch Anaphora Resolution, Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2008, Haifa, Israel, February 17-23, 2008, Lecture Notes in Computer Science, Springer.
Hinrichs, E. W., Kübler, S. and Naumann, K. (2005). A Unified Representation for Morphological, Syntactic, Semantic, and Referential Annotations, Proceedings of the Workshop on Frontiers in Corpus Annotation II: Pie in the Sky. The 43rd Annual Meeting of the Association for Computational Linguistics, pp. 13–20.
Hirschman, L. and Chinchor, N. (1997). MUC-7 Coreference Task Definition (version 3.0), Proceedings of the Seventh Message Understanding Conference (MUC-7). Available from http://www.itl.nist.gov/ (Last checked Oct. 14,
Hirschman, L., Robinson, P., Burger, J. and Vilain, M. (1997). Automating coreference: The Role of Annotated Training Data, AAAI Spring Symposium on Applying Machine Learning to Discourse Processing
Hobbs, J. R. (1978). Resolving Pronoun References, : 311–338. Reprinted in Readings in Natural Language Processing, B. Grosz, K. Sparck-Jones, and B. Webber, editors, pp. 339-352, Morgan Kaufmann Publishers, Los Altos, California.
Holen, G. I. (2007). Automatic anaphora resolution for Norwegian, Anaphora: Analysis, Algorithms and Applications. 6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007. Lagos, Portugal, March 2007. Selected papers, Springer, pp. 151–167.
Hoste, V. (2005). Optimization Issues in Machine Learning of Coreference Resolution, PhD thesis, Universiteit Antwerpen.
Iida, R., Inui, K., Takamura, H. and Matsumoto, Y. (2003). Incorporating Contextual Cues in Trainable Models for Coreference Resolution, Proceedings of the EACL 2003 Workshop on The Computational Treatment of Anaphora, Budapest, Hungary.
Johansson, C. and Nøklestad, A. (2008). Improving an Anaphora Resolution System for Norwegian, Proceedings of the Second Workshop on Anaphora Resolution (WAR II)
Johansson, C., Nøklestad, A. and Reigem, Ø. (2006). Developing a re-usable web-demonstrator for automatic anaphora resolution with support for manual editing of coreference chains, Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), ELRA, pp. 1161–1166.
Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall.
Kann, V. and Rosell, M. (2006). Free construction of a Swedish dictionary of synonyms, in S. Werner (ed.), Proceedings of the 15th NODALIDA conference, Joensuu, Finland.
Kehler, A., Appelt, D., Taylor, L. and Simma, A. (2004). The (non)utility of predicate-argument frequencies for pronoun interpretation, Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2004), Boston, MA, USA.
Kennedy, C. and Boguraev, B. (1996). Anaphora for everyone: pronominal anaphora resolution without a parser, Proceedings of the 16th conference on Computational Linguistics (COLING’96), Copenhagen, Denmark.
Kibble, R. and Van Deemter, K. (2000). Coreference Annotation: Whither?, Proceedings of the 2nd International Conference on Language Resources and Evaluation, LREC 2000, ELRA, Athens, Greece.
Klenner, M. (2007). Enforcing Consistency on Coreference Sets, In Recent Advances in Natural Language Processing (RANLP), pp. 323–328.
Klenner, M. and Ailloud, E. (2008). Enhancing Coreference Clustering, Proceedings from the Second Bergen Workshop on Anaphora Resolution (WAR II), pp. 31–40.
Klenner, M. and Ailloud, E. (2009). Optimization in coreference resolution is not needed: A nearly-optimal zero-one ILP algorithm with intensional constraints, Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009)
Lappin, S. and Leass, H. (1994). An algorithm for pronominal anaphora resolution,
ACE (Automatic Content Extraction) English Annotation Guidelines for Entities. Version 6.6.
Luo, X. (2005). On Coreference Resolution Performance Metrics, Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, ACL, Vancouver, British Columbia, Canada, pp. 25–32.
Luo, X., Ittycheriah, A., Jing, H., Kambhatla, N. and Roukos, S. (2004). A mention-synchronous coreference resolution algorithm based on the Bell tree, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (ACL’04)
Mann, W. C. and Thompson, S. A. (1988). Rhetorical Structure Theory: Toward a functional theory of text organization,
Manning, C. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, MA.
McCarthy, J. F. and Lehnert, W. G. (1995). Using decision trees for coreference resolution, in C. Mellish (ed.), Proceedings of the Fourteenth International Conference on Artificial Intelligence, pp. 1050–1055.
McEnery, A., Tanaka, I. and Botley, S. (1997). Corpus Annotation and Reference
R. Mitkov and B. Boguraev (eds),
Proceedings of the Association for Computational Linguistics Workshop on Anaphora Resolution for Unrestricted Texts, Madrid , pp. 67–74.
Mitchell, T. M. (1997).
Mitkov, R. (1997). Factors in anaphora resolution: they are not the only things that matter. A case study based on two different approaches, Proceedings of the ACL’97/EACL’97 workshop on Operational factors in practical, robust anaphora resolution, Madrid, Spain, pp. 14–21.
Mitkov, R. (2002).
Mitkov, R. (2003). Anaphora Resolution, in R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics, Oxford University Press, pp. 266–283.
Mitkov, R., Evans, R. and Orasan, C. (2002). A new, fully automatic version of Mitkov’s knowledge-poor pronoun resolution method, Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing, Vol. 2276 of Lecture Notes In Computer Science, Springer-Verlag, pp. 168–186.
Mitkov, R., Evans, R., Orasan, C., Ha, L. A. and Pekar, V. (2007). Anaphora resolution: To what extent does it help NLP?, Anaphora: Analysis, Algorithms and Applications. 6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007. Lagos, Portugal, March 2007. Selected papers, Springer, pp. 179–190.
Moore, J. D. and Wiemer-Hastings, P. (2003). Discourse in computational linguistics and artificial intelligence, in
A. C. Graesser, M. A. Gernsbacher and S. R.
Handbook of Discourse Processes
, Lawrence Erlbaum Associates, chapter 12, pp. 439–486.
Morton, T. (2005).
Using Semantic Relations to Improve Information Retrieval
PhD thesis, University of Pennsylvania.
Morton, T. S. (2000). Coreference for NLP Applications, Proceedings of the 38th
Annual Meeting of the Association for Computational Linguistics
Ng, V. (2005). Machine Learning for Coreference Resolution: From Local Classification to Global Ranking, Proceedings of the 43rd Annual Meeting of the
Association for Computational Linguistics (ACL), Ann Arbor, MI
, pp. 157–
Ng, V. (2007). Shallow semantics for coreference resolution, Proceedings of the Twentieth International Joint Conference on Artificial Intelligence
Ng, V. (2008). Unsupervised models for coreference resolution, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pp. 640–649.
Ng, V. and Cardie, C. (2002a). Combining Sample Selection and Error-Driven Pruning for Machine Learning of Coreference Rules, Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL, pp. 55–62.
Ng, V. and Cardie, C. (2002b). Improving Machine Learning Approaches to Coreference Resolution, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), ACL, Philadelphia, PA, USA, pp. 104–111.
Nilsson, K. and Hjelm, H. (2009). Using semantic features derived from word-space models for Swedish coreference resolution, in
K. Jokinen and E. Bick (eds),
Proceedings of the 17th Nordic Conference of Computational Linguistics,
Nilsson, K. and Malmgren, A. (2006). Towards automatic recognition of product names: an exploratory study of brand names in economic texts, in Proceedings of the 15th Nodalida conference, Joensuu 2005, Joensuu, Finland.
Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryiğit, G., Kübler, S., Marinov, S. and Marsi, E. (2007). MaltParser: A language-independent system for data-driven dependency parsing, Natural Language Engineering 13
Nøklestad, A. (2009). A Machine Learning Approach to Anaphora Resolution Including Named Entity Recognition, PP Attachment Disambiguation, and Animacy Detection, PhD thesis, University of Oslo.
Nøklestad, A. and Johansson, C. (2006). Detecting reference chains in Norwegian, in S. Werner (ed.), Proceedings of the 15th NODALIDA conference, Joensuu, Ling@JoY: University of Joensuu electronic publications in linguistics and language technology 1, Joensuu. ISBN 952-458-771-8, ISSN 1796-1114.
Poesio, M. (2004). The MATE/GNOME Scheme for Anaphoric Annotation, Revisited, Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004, Boston, MA, USA, pp. 154–162.
Poesio, M. and Vieira, R. (1998). A corpus-based investigation of definite description use,
Poon, H. and Domingos, P. (2008). Joint unsupervised coreference resolution with Markov logic, Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, ACL, Honolulu, Hawaii.
Pustejovsky, J., Meyers, A., Palmer, M. and Poesio, M. (2005). Merging PropBank, NomBank, TimeBank, Penn Discourse Treebank and Coreference, Proceedings of the Workshop on Frontiers in Corpus Annotation II: Pie in the Sky. The 43rd Annual Meeting of the Association for Computational Linguistics, pp. 13–20.
Recasens, M., Martí, T., Taulé, M., Màrquez, L. and Sapena, E. (2009). SemEval-2010 Task 1: Coreference Resolution in Multiple Languages, Proceedings of the NAACL HLT Workshop on Semantic Evaluations: Recent Achievements and Future Directions, ACL, Boulder, CO, USA, pp. 70–75.
Sahlgren, M. (2006). The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations between Words in High-Dimensional Vector Spaces, PhD thesis, Stockholm University, Stockholm, Sweden.
Sanders, T. and Spooren, W. (2007). Discourse and Text Structure, in D. Geeraerts and H. Cuyckens (eds), The Oxford Handbook of Cognitive Linguistics, Oxford University Press, chapter 35, pp. 916–944.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees, Proceedings of International Conference on New Methods in Language Processing
Singer, M. (2007). Inference processing in discourse comprehension, in M. G. Gaskell
The Oxford Handbook of Psycholinguistics
, Oxford University Press, chapter 20, pp. 343–360.
Soon, W. M., Ng, H. T. and Lim, D. C. Y. (2001).
A Machine Learning Approach to Coreference Resolution of Noun Phrases,
Strassel, S., Przybocki, M., Peterson, K., Song, Z. and Maeda, K. (2008). Linguistic
Resource and Evaluation Techniques for Evaluation of Cross-Document Automatic Content Extraction,
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08)
Strube, M. (2006). Corpus-based and machine learning approaches to anaphora resolution. A critical assessment, in
M. Schwarz-Friesel, M. Consten and M. Knees
Anaphors in Text
, Germany: de Gruyter.
Strube, M. and Hahn, U. (1999). Functional centering: Grounding referential coherence in information structure,
Computational Linguistics 25
Strube, M., Rapp, S. and Müller, C. (2002). The influence of minimum edit distance on reference resolution, Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL, Philadelphia, PA, USA, pp. 312–319.
Tetreault, J. R. (1999). Analysis of syntax-based pronoun resolution methods, Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL’99), Maryland, USA, pp. 602–605.
Uryupina, O. (2004). Linguistically Motivated Sample Selection for Coreference Resolution,
Proceedings of the Fifth Discourse Anaphora and Anaphora Resolution Colloquium, DAARC 2004
Van Deemter, K. and Kibble, R. (1999). What is coreference, and what should coreference annotation be?, in
A. Bagga, B. Baldwin and S. Shelton (eds),
Proceedings of the ACL Workshop on Coreference and Its Applications, ACL,
Van Deemter, K. and Kibble, R. (2000). On Coreferring: Coreference in MUC and related annotation schemes,
Computational Linguistics 26
Viberg, Å., Lindmark, K., Lindvall, A. and Mellenius, I. (2002).
Proceedings of Euralex 2002, Copenhagen University
, pp. 407–412.
Vieira, R. and Poesio, M. (2000). An empirically based system for processing definite descriptions, Computational Linguistics
Vilain, M., Burger, J., Aberdeen, J., Connolly, D. and Hirschman, L. (1995). A Model-Theoretic Coreference Scoring Scheme, Proceedings of the Sixth Message Understanding Conference (MUC-6), Morgan Kaufmann, Columbia, Maryland.
Volk, M. and Clematide, S. (2001). Learn-Filter-Apply-Forget. Mixed Approaches to Named Entity Recognition, Proc. of 6th International Workshop on Applications of Natural Language for Information Systems, Madrid, Spain.
Vossen, P. (1998). Introduction to EuroWordNet, Computers and the Humanities
Watson, R., Preiss, J. and Briscoe, T. (2003). The contribution of domain-independent robust pronominal resolution to open-domain question answering, Symposium on Reference Resolution and its Applications to Question Answering and Summarization, pp. 75–82.
Wennstedt, O. (1995). Annotering av namn i SUC-korpusen, in K. G. Ottósson, R. V.
Fjeld and A. Torp (eds),
The Nordic Languages and Modern Linguistics 9.
Proceedings of the Ninth International Conference of Nordic and General
, University of Oslo, Novis forlag, pp. 315–324.
Winograd, T. (1972). Understanding Natural Language, Academic Press.
Witte, R. and Tang, T. (2007). Task-Dependent Visualization of Coreference Resolution Results, International Conference on Recent Advances in Natural Language Processing (RANLP 2007), Borovets, Bulgaria.
Yang, X., Zhou, G., Su, J. and Tan, C. (2003). Coreference resolution using competition learning approach, Proceedings of ACL 2003, Sapporo, Japan, 7-12, pp. 176–183.