Coreference Resolution via Hypergraph Partitioning

Dissertation submitted in fulfillment of the requirements for the doctoral degree of the Neuphilologische Fakultät of the Ruprecht-Karls-Universität Heidelberg

Presented by Jie Cai, from China

First examiner: Prof. Dr. Michael Strube
Second examiner: Prof. Dr. Anette Frank
Submitted: 08.11.2012
Defended: 04.04.2013

Abstract

Coreference resolution is one of the most fundamental Natural Language Processing tasks: it aims to identify the coreference relation in texts. The task is to group mentions (i.e. phrases of interest) into sets, so that all mentions in one set refer to the same entity (i.e. a real-world object). Mentions are conventionally proper names, common nouns and pronouns. Lately, the coreference task has been extended to deal with verb phrases as well; in this thesis, however, we work only with noun phrase mentions. By linking mentions together in a document, not only are entities recovered, but different fragments of the context are also connected, which leads to a better text understanding. Coreference resolution is therefore essential to many applications, such as text summarization and information extraction.

In this thesis, we propose a novel coreference model based on hypergraph partitioning. Our system is named COPA, standing for Coreference Partitioner. Given a raw document, COPA represents it as a hypergraph, to which hypergraph partitioning algorithms are applied to derive the coreference sets directly.

The Coreference Representation. The coreference relation is a high-dimensional relation, because it depends on multiple types of basic relations (e.g. string similarities and semantic relatedness). Most previous work on the coreference resolution task combines the basic relations between mentions into a single relation and derives the coreference sets afterward. Since it is relatively expensive to learn such a combination of basic relations, we propose a novel hypergraph representation model for coreference resolution.
In our model, the mentions are taken as vertices in the hypergraph, and the relational features derived from the basic relations serve as hyperedges. The hypergraph allows for multiple edges between vertices, which suits the high-dimensional nature of the coreference relation. Moreover, in a hypergraph one hyperedge can connect more than two vertices. As a result, the hypergraph directly represents relations between sets of mentions, as required for the coreference resolution task. Since the basic relations are incorporated in an overlapping manner, COPA only needs a few training documents to achieve competitive performance. This weakly supervised nature makes COPA a good candidate when it is applied to different domains or languages, or when only limited training data is available.

The Coreference Inference. The inference for the coreference resolution task deals with sets of mentions. It needs to capture the relations between multiple mentions in order to derive the final coreference sets. Therefore, we consider coreference resolution a set problem. Most previous coreference models address the set problem by dividing the resolution into two steps, a classification step and a clustering step (e.g. Soon et al. (2001)). The classification step decides for each pair of mentions whether they are coreferent or not. On top of the pairwise decisions, the clustering step then groups mentions into the final sets. Because of this two-step division, the classification performance is not necessarily positively correlated with the final evaluation numbers. It is difficult to track the error propagation and hard to optimize with respect to the final coreference sets. Moreover, since the coreference decisions are made between pairs of mentions independently, global context information is missing in those models. In this thesis, we propose a global coreference model via hypergraph partitioning.
We design two algorithms based on the spectral clustering technique: a hierarchical partitioner, R2, and a flat k-way partitioner, flatK. We also propose extensions to COPA's clustering algorithms that include constraints to enforce cluster-level consistency. The constrained COPA is a first attempt towards a better learning scheme for our system. It solves the cluster-level inconsistency problem and at the same time contributes to research in the constrained graph clustering field.

The Coreference Evaluation. Since COPA is an end-to-end coreference system, this thesis also addresses the important implementation issues encountered when applying clustering algorithms in practice. For instance, the existing evaluation metrics become problematic when the automatically identified mentions do not align with the ones in the ground truth. We therefore propose variants of the coreference evaluation metrics to tackle this problem. COPA outperforms several baseline systems in fair settings, using the same features and the same mentions, so that only the effectiveness of the models themselves is compared. It also performs competitively with state-of-the-art systems across different evaluation metrics, different data sets and different domains.

Acknowledgments

I often imagined what it would take for me to get all the way to this point, writing my acknowledgments. I have learned that it takes four years of learning and working; it takes countless rounds of producing results worse than the baseline systems; most importantly, it takes all the guidance and support I received during these years. I am very lucky to have Prof. Dr. Michael Strube as my Ph.D. supervisor. The only easy part of this four-year study has been the communication with Michael, who is always around to give advice and to listen.
I greatly appreciate the freedom he offered and the patience he showed, even at the times when I was very much puzzled and took forever to find my way out. Michael has been a very helpful supervisor and always manages to make complicated things simple for me. I would also like to thank Prof. Dr. Anette Frank, who took me on as her student when I first came to Germany and who is now my second examiner. I thank Anette for all the warm encouragement and the helpful advice. The thesis is much better thanks to the precious efforts of my proof-readers. Angela Fahrni literally contributed three pages of German abstract to my thesis; Sebastian Martschat read so many versions of it and helped to improve both the content and the writing; Yufang Hou always read carefully and provided me with many constructive suggestions. Besides my Ph.D. sisters and brothers, Dr. Camille Guinaudeau gave me many helpful review comments; my English proof-reader Dr. Jiawei Mao corrected the entire thesis thoroughly; my friend Shiyang Lu sent me detailed reviews from faraway, the other half of the earth. Since I know by heart how busy all my kind proof-readers are, I appreciate so much the time each of them squeezed out for my thesis. My family and friends have constantly encouraged, supported and inspired me throughout the four years. My parents Houming Cai and Jieyun Liu are there for me as they always are, and they make things so easy by being such cheerful parents. I thank my husband Yangyang Zhao for all the phone calls which accompanied me walking down the hill in the dark forest, for all the good memories we had in Europe and for being very supportive even when I got very cranky during the Ph.D. study. I thank Angela again here, who has been at my side (literally, again) for talking through the confusions, for discussing research and for the casual gossip.
I also would like to thank my dear friends Lingling Kong and Zhen Zeng for simply everything. There is so much more I should mention and so many more people who helped. HITS gGmbH supported my Ph.D. program; the city of Heidelberg gave me a German home; things would have been much harder for me at the beginning if it were not for the help of my former colleague Viola Ganter; I still remember my first amazing trip, thanks to my favorite postdoc Dr. Vivi Nastase. At the end, I want to thank myself too. Cai, I am happy to see that you learned a lot along the way. I appreciate the hard work you did over these years, and especially that you managed to keep jogging at the same time. I am glad that you had a good time.

Declaration

I hereby declare that I have written this thesis independently and have used no sources or aids other than those explicitly named.

(Jie Cai)

Contents

1 Introduction
  1.1 Anaphora and Coreference
  1.2 The Coreference Resolution Task
    1.2.1 Representing the Coreference Relation
    1.2.2 Inferring the Coreference Relation
    1.2.3 Evaluating Coreference Resolution
    1.2.4 Cheap Learning?
  1.3 Contributions of this Thesis
    1.3.1 Representing the Coreference Relation
    1.3.2 Inferring the Coreference Relation
    1.3.3 Evaluating Coreference Resolution
    1.3.4 Cheap Learning!
    1.3.5 Other Contributions
  1.4 The Thesis Structure
  1.5 Published Work

2 Related Work On Coreference Models
  2.1 Early Theories and Formalisms
    2.1.1 Centering
    2.1.2 Binding Theory
  2.2 Rule-based Deterministic Coreference Models
    2.2.1 Hobbs' Algorithm
    2.2.2 Lappin and Leass' Algorithm
    2.2.3 Haghighi and Klein's Simple System
    2.2.4 Stanford's Multi-Pass Sieve System
  2.3 Unsupervised Coreference Models
    2.3.1 Cardie and Wagstaff's Clustering Method
    2.3.2 Haghighi and Klein's Bayesian Model
    2.3.3 Ng's EM Clustering Method
    2.3.4 Poon and Domingos' Markov Logic Model
    2.3.5 Kobdani et al.'s Bootstrapping Model
  2.4 Weakly Supervised Coreference Models
    2.4.1 Multi-view Co-training Models
    2.4.2 Single-view Bootstrapping Methods
  2.5 Supervised Coreference Models
    2.5.1 Two-step Methods
    2.5.2 Preference Models
    2.5.3 One-step Methods
      2.5.3.1 Clustering Methods
      2.5.3.2 Probabilistic Models
  2.6 Summary

3 Data Sets for Coreference Resolution
  3.1 MUC
  3.2 ACE
  3.3 OntoNotes
  3.4 I2B2
  3.5 Summary

4 COPA: Coreference Partitioner
  4.1 Introduction to COPA
  4.2 The Mathematical Background
    4.2.1 The Hypergraph Representation
    4.2.2 Hypergraph Partitioning
      4.2.2.1 Spectral Clustering
      4.2.2.2 Spectral Clustering for Hypergraphs
  4.3 COPA: Coreference Resolution via Hypergraph Partitioning
    4.3.1 Preprocessing Pipeline
    4.3.2 Constructing Hypergraphs for Documents
    4.3.3 Hypergraph Resolver
      4.3.3.1 Recursive 2-way Partitioner
      4.3.3.2 Flat k-way Partitioner
    4.3.4 k Model: Predicting the Number of Entities
    4.3.5 Complexity of COPA
  4.4 Implementation Issues
    4.4.1 The Post-processing For Pronoun Anaphors
    4.4.2 Partitioning Issues
  4.5 Hypergraphs to Standard Graphs
    4.5.1 The Star Expansion
    4.5.2 The Clique Expansion
  4.6 Summary

5 COPA Features
  5.1 The Feature Categorization in the Hypergraph
  5.2 Negative Features
  5.3 Positive Features
  5.4 Weak Features
  5.5 The Distance Feature
  5.6 The Learned Hyperedge Weights
  5.7 Summary

6 Evaluation Metrics for End-to-end Coreference Resolution
  6.1 Evaluation Metrics for the End-to-end Coreference Resolution
    6.1.1 MUC
    6.1.2 B3
      6.1.2.1 Existing B3 variants
      6.1.2.2 Our proposed variant: B3sys
      6.1.2.3 B3sys Example Output
    6.1.3 CEAF
      6.1.3.1 Problems of CEAForig
      6.1.3.2 Existing CEAF variants
      6.1.3.3 Our proposed variant: CEAFsys
    6.1.4 BLANC
  6.2 Experiments with the Proposed Evaluation Metrics
    6.2.1 Data and Mention Taggers
    6.2.2 The Artificial Setting
    6.2.3 The Realistic Setting
  6.3 Summary

7 Evaluating COPA
  7.1 COPA vs. Baselines
    7.1.1 Data
    7.1.2 The Mention Tagger
    7.1.3 Evaluation Metrics
    7.1.4 Results
      7.1.4.1 COPA vs. SOON
      7.1.4.2 COPA vs. B&R
      7.1.4.3 Running Time
    7.1.5 Discussion
  7.2 COPA vs. State-of-the-art Systems
    7.2.1 Data
    7.2.2 The Mention Tagger
    7.2.3 Evaluation Metrics
    7.2.4 Results
    7.2.5 Discussions
  7.3 COPA in the Medical Domain
    7.3.1 Data
    7.3.2 The Mention Tagger
    7.3.3 Evaluation Metrics
    7.3.4 Results
    7.3.5 Discussions
  7.4 Error Analysis
    7.4.1 COPA Errors for News Articles
    7.4.2 COPA Errors for Clinical Reports
  7.5 Experiments on the Training Data Size
  7.6 Experiments on the k Model
  7.7 Summary

8 The Constrained COPA
  8.1 Background
    8.1.1 Enforcing Transitivity in Coreference Resolution
    8.1.2 Literature on Constrained Clustering
  8.2 Inconsistency Analysis on Output Coreference Sets
  8.3 Our Proposal: the Constrained COPA
    8.3.1 Constrained Data Clustering: COP-KMeans
    8.3.2 Our Variant of COP-KMeans
    8.3.3 Constrained Hypergraph Spectral Clustering
    8.3.4 Constrained COPA Partitioners
  8.4 Cannot-Link Constraints for Coreference Resolution
  8.5 Experiments on the Constrained COPA
    8.5.1 Experiments with Artificial Clean Constraints
    8.5.2 Experiments with Automatically Generated Constraints
  8.6 Summary

9 Conclusions
  9.1 Main Contributions
  9.2 Future Work
List of Figures
List of Tables
List of Algorithms
Bibliography

Chapter 1

Introduction

"Hi Cai, you must be very brave to work on Coreference Resolution."
– Prof. Mirella Lapata –

This thesis addresses the challenge of within-document coreference resolution, the task of grouping the referring expressions (i.e. phrases) of entities (i.e. real-world semantic objects) into coreference sets so that all expressions in one set refer to the same entity. The coreference relation depends on multiple basic relations, such as shallow syntactic relations and semantic relatedness. It can be derived from one of the basic relations or from a combination of several, depending on the context. We therefore consider the coreference relation a complex, high-dimensional relation, as opposed to the basic low-dimensional relations. Since the coreference resolution task is not only to detect the pairwise coreference relation but also to group the referring expressions into sets, we consider the task a set problem. By analyzing the linguistic phenomena of the coreference relation and understanding the task requirements, we raise four important questions which are addressed throughout the thesis: (1) representing the coreference relation, (2) inferring the coreference relation, (3) evaluating coreference resolution, and (4) learning cheaply.

Our proposed coreference model is motivated by the first two questions; both its representation model and its inference method address requirements (1) and (2) correspondingly. Our model represents documents as hypergraphs, which allow for multiple edges between vertices and multiple vertices within one edge. The vertices are the referring expressions from the documents, and the multiple edges between them enable us to break down the complex coreference relation into multiple basic ones. Moreover, the hyperedges containing multiple vertices straightforwardly represent the sets of expressions. On top of the hypergraph representation, we apply graph partitioning techniques to partition the hypergraphs into sub-hypergraphs, each of which corresponds to a coreference set. Our system is named COPA, standing for Coreference Partitioner. COPA differs significantly from previous local models, since it is able to take the global context (of a document) into consideration and to generate all coreference sets simultaneously, in one step.

We work in an end-to-end system setting, which takes raw text as input and extracts coreference sets in a fully automatic way. Since the presence of noise is unavoidable in such a realistic setup, this thesis addresses not only the modeling itself but also the practical issues. For instance, our proposed evaluation metrics aim to overcome the problems of the widely used metrics when evaluating the noisy output of end-to-end coreference systems.

In this chapter, we start by introducing the coreference phenomena from a linguistic point of view in Section 1.1. Section 1.2 then describes the coreference resolution task and the four questions that emerge from it. In Section 1.3, we convey the intuitions behind our proposal of COPA and the main contributions of the thesis. The general structure of the thesis is given at the end, in Section 1.4.

1.1 Anaphora and Coreference

In order to preserve coherence in texts while at the same time keeping the phrasal expressions diverse, referring expressions are used frequently. In the following Example (1), the pronouns [him], [he] and [his] are all referring expressions, which are called anaphors or anaphoric expressions. An anaphor refers to an antecedent, a preceding phrase (e.g. [Yemen's President]), and the two talk about the same object in the world. A world object is called an entity, for instance YEMEN'S PRESIDENT in Example (1).
The process of identifying the correct antecedent for an anaphor is anaphora resolution.

Example (1): [Yemen's President]1 has repeatedly said an internal explosion rocked the "USS Cole", but tomorrow the U.S. official expects [him]1 to announce that [he]1 has changed [his]1 mind, and tomorrow the search for bodies will resume.

Besides pronominal anaphors, as shown in Example (1), definite and demonstrative phrases are often used as anaphoric expressions too (e.g. [the meeting] and [the regulators] in Example (2)). Proper names can either mention a new entity or refer to a previous one, as do both mentions of [Lincoln]. (The entities are set in capitals throughout this thesis.)

Example (2): In [a highly unusual meeting in Sen. DeConcini's office in April 1987]1, the five senators asked [federal regulators]2 to ease up on [Lincoln]3. According to notes taken by one of the participants at [the meeting]1, [the regulators]2 said [Lincoln]3 was gambling dangerously with depositors' federally insured money and was "a ticking time bomb."

An anaphor and its antecedent are said to be coreferent with each other. In other words, both of them are linguistic expressions that refer to a specific entity. It is common for an entity to have multiple linguistic expressions in a document, which together form a coreference chain or coreference set (e.g. all the phrases marked with the same subscripts in Example (1) form one coreference set). The process of identifying the coreference sets within or across documents is coreference resolution. As Example (2) illustrates, a document tends to have multiple coreference sets, and coreference resolution is to identify all of them. Coreference resolution is closely related to anaphora resolution, and it can be viewed as a post-processing step on the antecedent-anaphor output of anaphora resolution.
Considering Example (1), resolving [him], [he] and [his] to [Yemen's President] during anaphora resolution helps to generate the entire coreference set. In this thesis, however, we argue that global (set-level) information is missing from such a post-processing interpretation. In the same Example (1), once the first two pronouns have been resolved to the entity YEMEN'S PRESIDENT, the third one is more likely to refer to this salient entity too, rather than to the entity THE U.S. OFFICIAL. As a result, a set-based one-step coreference resolution model is preferable due to its global nature.

1.2 The Coreference Resolution Task

In this section, the crucial requirements for modeling the coreference resolution task are discussed within an end-to-end system framework. Our proposed coreference model is motivated by these requirements and addresses all of them throughout the thesis. The coreference resolution task is to group referring expressions into sets so that all expressions in one set refer to the same entity. An end-to-end coreference system takes raw documents as input and generates the identified coreference sets as output, via a pipeline of automatic processors. Figure 1.1 shows an example text displayed in MMAX, a multilayer visualization tool that helps illustrate the coreference examples (Müller & Strube, 2006). The phrases that need to be resolved are conventionally called mentions in the task, such as [Gore], [I], [he], [his opponent] and [the vice president]. In this thesis, the mentions marked with square brackets (i.e. []) are true mentions, which are taken from the ground truth annotation, and the ones in curly brackets (i.e. {}) are system mentions, which are derived automatically. The running entity in this example is GORE, whose corresponding coreference set is {[Gore], [I], [he], [his opponent], [the vice president], ...}.
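To make the set problem concrete, the following toy sketch (illustrative Python, not part of COPA; the mention strings and pairwise decisions are invented for the example) shows the conventional two-step pipeline: independent pairwise decisions followed by transitive closure. A single wrong local link merges two entities, which is exactly the kind of error propagation that a set-based one-step model avoids.

```python
# Toy illustration of coreference resolution as a set problem.
# A two-step pipeline first makes independent pairwise decisions, then
# builds sets by transitive closure; one wrong local link merges entities.

mentions = ["Yemen's President", "him", "he", "his", "the U.S. official"]

# Hypothetical pairwise classifier output: 1 = "coreferent".
pairwise = {
    ("Yemen's President", "him"): 1,
    ("Yemen's President", "he"): 1,
    ("him", "he"): 1,
    ("he", "his"): 1,
    ("his", "the U.S. official"): 1,  # a single erroneous local decision
}

def closure_sets(mentions, pairwise):
    """Group mentions by transitive closure over positive pairwise links."""
    parent = {m: m for m in mentions}
    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m
    for (a, b), coref in pairwise.items():
        if coref:
            parent[find(a)] = find(b)
    sets = {}
    for m in mentions:
        sets.setdefault(find(m), set()).add(m)
    return list(sets.values())

print(closure_sets(mentions, pairwise))
# The one wrong link pulls "the U.S. official" into the same set as
# "Yemen's President" -- the local error propagates through the closure.
```

Because each pairwise decision is taken in isolation, nothing in the clustering step can undo the bad link; a global, set-level model can weigh all relations jointly instead.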
Figure 1.1: Example (3): Coreference Resolution in MMAX

The pre-processing components may vary between different systems, but the most important ones are sentence splitting, POS tagging, mention detection and syntactic parsing. The pre-processors provide a coreference system with the mentions to be resolved and with contextual information for assisting the resolution procedure. When external resources are available, more components for knowledge extraction may be incorporated into the system accordingly. The following subsections introduce the most important aspects for designing a coreference system.

1.2.1 Representing the Coreference Relation

The coreference relation is a high-dimensional relation. By interpreting the coreference relation as a high-dimensional relation, we refer to the fact that the coreference relation depends on different types of basic relations, such as shallow syntactic dependency and semantic relatedness. These basic relations are considered to be low-dimensional, and together they form the (more) complex coreference relation. We use the same Example (3) (Figure 1.2) in this subsection to convey the high-dimension property of the coreference relation. It can be seen that within the example text, there are several diverse basic (low-dimensional) relations which comprise the coreference relation. In Example (3), for the entity GORE, the coreference relation between the first [Gore] and the second [Gore] can be easily detected just based on their high string similarity. However, in order to resolve the coreference relation between [Gore] and [the vice president], external knowledge resources are necessary for providing relevant information about vice president GORE. If it has been mentioned in the preceding text that GORE is a vice president (e.g.
in a text fragment "the Vice President Gore"), the relation can also be retrieved from the text itself by extracting the relevant attributes for the entity GORE before the resolution.

Figure 1.2: Example (3): Coreference Relation is High-Dimensional (part 1)

For the same Example (3), Figure 1.3 illustrates a more complex coreference relation between the mentions [his opponent] and [Gore], whose resolution requires a reasoning scheme over the two entities GORE and BUSH. In order to identify the relation between [his opponent] and [Gore] correctly, it is necessary to first resolve [his] to [Bush] and afterward to extract the fact that GORE is the opponent of BUSH in the debate. In this case, the coreference relation is much more complex than the ones between mentions which share the same strings.

Figure 1.3: Example (3): Coreference Relation is High-Dimensional (part 2)

The coreference relation of pronouns is often based on local phenomena. Consider the pronoun [He] marked in Figure 1.3, which is in a parallel sentential structure with [Gore], i.e. "[Gore] said" and "[He] added". Such a structural relation is a reasonably confident indicator of the coreference relation for pronouns. However, structural information is a much weaker indicator for most of the non-pronominal anaphors. To sum up, the coreference relation can be inferred from multiple low-dimensional relations (e.g. string match and parallel structure). Depending on the types of the participating mentions and the local contexts, different basic relations can be dominating or can interact with each other during coreference resolution. Q1: How to represent the multiple low-dimensional relations and to allow their interactions? is the first question to consider in terms of the representation model for a coreference resolution system.

1.2.2 Inferring the Coreference Relation

The coreference resolution task is a set problem.
The coreference resolution task is to group mentions into disjoint coreference sets, so that each set corresponds to an entity. The resolution decision for one mention depends on the resolutions of all the others in the same text, which together provide the global context for the mention in focus. As explained for Example (3) (Figure 1.3), the resolution of the mention [his opponent] should benefit from the resolution of the embedded mention [his]. Therefore, inferring the coreference sets simultaneously is essential to making use of the complete context. In order to achieve overall optimized coreference sets, the inference procedure needs to consider not only the relations between mentions within the same sets, but also the relations between mentions from different sets. Since the optimization is conducted at the output end, it is important to preserve all relations from a document until the final generation of the coreference sets. Hence it is preferable to have the coreference sets identified directly from the original relations. Q2: How to derive coreference sets directly and simultaneously? is the second crucial question we need to consider. It concerns the choice of the inference algorithm.

1.2.3 Evaluating Coreference Resolution

Evaluating the system output sets against the true coreference sets is no trivial matter. There have been several evaluation metrics designed for the coreference resolution task, either evaluating on mention pairs or on sets directly. However, they become problematic in a realistic system setup, where the system mentions no longer align with the true mentions. Q3: How to evaluate end-to-end coreference resolution systems? is the third concern of ours in this thesis.

1.2.4 Cheap Learning?

There are several data sets proposed for evaluating coreference resolution systems, most of which are collections of news articles, such as the examples illustrated in this section. Since the coreference relation is a general linguistic phenomenon, coreference resolution is applicable to different domains (e.g. the medical domain) and to different languages. This raises the need for large amounts of annotated data for model training. Since annotating corpora manually is expensive, the question Q4: Can we use less training data? becomes important when extending the coreference system to open domain texts or when applying the system to multilingual tasks.

1.3 Contributions of this Thesis

Most recent approaches to coreference resolution divide the task into two steps: (1) a classification step which determines whether a pair of mentions is coreferent or which outputs a confidence value for this pair, and (2) a clustering step which groups mentions into entities based on the output of step 1. In this thesis, we propose a global one-step model — COPA — to approach the coreference resolution task. COPA is a novel coreference model which avoids the division into two steps and instead performs a global decision in one step. It represents a document as a hypergraph, where the vertices denote the mentions and the edges denote the (low-dimensional) relational features between mentions. Coreference resolution is performed globally in one step by partitioning the hypergraph into sub-hypergraphs so that all mentions in one sub-hypergraph refer to the same entity. The left part of Figure 1.4 illustrates the hypergraph built by COPA and the right part shows the COPA output after the partitioning procedure. This example is described in more detail in Chapter 4.

Figure 1.4: COPA Example: Processing Illustration

With COPA, we are able to address the four questions raised in Section 1.2, which are explicated in Section 1.3.1 to Section 1.3.4.
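To make the contrast with COPA's one-step design concrete, the classic two-step scheme can be sketched in a few lines. This is a hedged toy, not any cited system: the pairwise "classifier" is just case-insensitive string match, a hypothetical stand-in for a learned model, and the clustering step is plain transitive closure via union-find.

```python
# Toy two-step baseline: (1) pairwise classification, (2) clustering.
# The classifier below is a hypothetical stand-in (string match only);
# real systems such as Soon et al. (2001) learn this decision.

def pairwise_coreferent(m1, m2):
    """Step 1: a toy pairwise decision based on string match."""
    return m1.lower() == m2.lower()

def cluster_by_transitive_closure(mentions):
    """Step 2: group mentions via union-find over the pairwise links."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if pairwise_coreferent(mentions[i], mentions[j]):
                parent[find(j)] = find(i)

    sets = {}
    for i, m in enumerate(mentions):
        sets.setdefault(find(i), []).append(m)
    return list(sets.values())

print(cluster_by_transitive_closure(["Gore", "he", "gore", "Bush"]))
```

Because each pairwise decision in step 1 is made independently, no set-level evidence can influence it, which is exactly the limitation the one-step model is designed to avoid.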
1.3.1 Representing the Coreference Relation

Previous two-step models attempt to predict a single confidence value between a pair of mentions by learning a combination of features from the training data (Soon et al., 2001; Luo et al., 2004; Rahman & Ng, 2009; Bengtson & Roth, 2008). Since these models base their clustering step on the collapsed relations, some global information which could have guided step 2 is already lost. On the other hand, global information cannot be accessed in step 1 when making the pairwise decisions. The hypergraph representation of COPA (e.g. Figure 1.4 (a)) enables the multiple relational features to enter directly (as hyperedges), without the necessity of collapsing them into single ones (as standard edges) as standard graph models would have to. Compared with the standard graph, the hypergraph has additional representational power. A hyperedge connects two or more vertices (e.g. the hyperedge connecting [Obama], [US president Barack Obama] and [Barack Obama]), and between vertices there can be multiple hyperedges involved (for the sake of a clear illustration, Figure 1.4 does not include overlapping hyperedges). The set property and the overlapping manner of hyperedges make the hypergraph a good candidate for representing the coreference relation. In brief, the hypergraph allows for representing multiple low-dimensional relations and capturing set-level information, so that COPA's representation model captures coreference phenomena intuitively. Moreover, since the hypergraph is a generalization of the standard graph, the algorithms based on standard graphs are still applicable to hypergraphs with the necessary adaptations. It is easy to include more relations as hyperedges in the hypergraph model, and various graph-based inference algorithms are supported on top of the COPA model.
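As an illustration only (the feature name and grouping logic below are invented, not COPA's actual feature set), a hypergraph over mentions can be built by letting each relational feature induce hyperedges over arbitrarily large sets of mentions:

```python
# Sketch of a mention hypergraph: vertices are mention indices, and each
# relational feature induces hyperedges that may span more than two
# vertices; hyperedges from different features may overlap.

from collections import defaultdict

def build_hypergraph(mentions, feature_fns):
    """Return {feature_name: [hyperedge, ...]}, each hyperedge a set of indices."""
    hypergraph = defaultdict(list)
    for name, fn in feature_fns.items():
        groups = defaultdict(set)
        for idx, mention in enumerate(mentions):
            key = fn(mention)
            if key is not None:
                groups[key].add(idx)
        for members in groups.values():
            if len(members) > 1:  # a hyperedge needs at least two vertices
                hypergraph[name].append(members)
    return dict(hypergraph)

mentions = ["Obama", "US president Barack Obama", "Barack Obama", "they"]
features = {
    # Hypothetical feature: mentions sharing the substring "obama".
    "partial_string_match": lambda m: "obama" if "obama" in m.lower() else None,
}
print(build_hypergraph(mentions, features))
```

The single hyperedge produced here connects three mentions at once, mirroring the [Obama] example above; a second feature over the same mentions would simply add overlapping hyperedges rather than forcing the relations to be collapsed into one pairwise weight.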
1.3.2 Inferring the Coreference Relation

For most of the two-step methods, the classification steps vary in the choices of the classifiers and the numbers of features used. The clustering step exhibits much more variation: local variants utilize greedy search strategies (Soon et al., 2001; Ng & Cardie, 2002), while global variants optimize globally but still upon the pairwise output from step 1 (Luo et al., 2004; Daumé III & Marcu, 2005; Nicolae & Nicolae, 2006; Denis & Baldridge, 2009). As already mentioned, since these methods base their global clustering step on a local pairwise model, some global information which could have guided step 2 is already lost. There have also been attempts at establishing global one-step models, most of which are probabilistic ones (Culotta et al., 2007; Sapena et al., 2010; McCallum & Wellner, 2005; Poon & Domingos, 2008). The global models allow one to make use of set-level information and more context during the inference procedure. Upon the hypergraph representation, COPA applies graph partitioning techniques to derive coreference sets directly and simultaneously. The graph partitioning algorithms of COPA generate the optimized coreference sets, so that the mentions within the same set are connected to each other as closely as possible, while the mentions from different sets are connected as loosely as possible. COPA is the first graph-partitioning-based coreference model that takes all mentions from a document into one unified graph and achieves competitive performance across different data sets in a realistic setting. Partitioning algorithms enable us to make a global coreference decision by using whatever contextual information is encoded in the graph, rather than working in a sequential and local manner. Unlike the probabilistic models, COPA is based on a graph partitioning technique that is preferable for its simple inference procedure.
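COPA's actual inference uses spectral techniques (Chapter 4); purely to make the partitioning objective concrete, the following hedged sketch brute-forces all two-way splits of a tiny weighted graph and picks the one minimizing the normalized cut, so that within-set weight stays high and cross-set weight low. The vertices and edge weights are invented toy values.

```python
# Brute-force normalized-cut sketch (illustrative only; real systems use
# spectral methods, since enumerating splits is exponential in |V|).

from itertools import combinations

def cut_weight(edges, part_a):
    """Total weight of edges crossing the boundary of part_a."""
    return sum(w for (u, v), w in edges.items()
               if (u in part_a) != (v in part_a))

def volume(edges, part):
    """Total weight of edges touching the part (a degree-sum proxy)."""
    return sum(w for (u, v), w in edges.items() if u in part or v in part)

def best_two_way_split(vertices, edges):
    """Minimize Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B)."""
    best, best_score = None, float("inf")
    for size in range(1, len(vertices) // 2 + 1):
        for chosen in combinations(vertices, size):
            a, b = set(chosen), set(vertices) - set(chosen)
            vol_a, vol_b = volume(edges, a), volume(edges, b)
            if vol_a == 0 or vol_b == 0:
                continue
            cut = cut_weight(edges, a)
            score = cut / vol_a + cut / vol_b
            if score < best_score:
                best, best_score = (a, b), score
    return best

vertices = ["Gore", "he", "Bush", "his"]
edges = {("Gore", "he"): 2.0, ("Bush", "his"): 2.0, ("he", "his"): 0.1}
print(best_two_way_split(vertices, edges))
```

The weak he-his edge is cut while the two strongly connected pairs survive intact; recursing on each part until a stopping criterion holds is the usual way to obtain more than two sets.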
We differ from Nicolae & Nicolae's graph partitioning model (Nicolae & Nicolae, 2006), as we do not make pairwise coreference predictions and we handle all types of mentions in one unified model.

1.3.3 Evaluating Coreference Resolution

In this thesis, we address an important issue in the coreference resolution task — evaluation metrics. Since the most widely used metrics are designed to handle true mentions only, they become problematic when evaluating end-to-end coreference systems. We propose variants of different evaluation metrics for dealing with this issue.

1.3.4 Cheap Learning!

The hypergraph-based coreference model of COPA derives the coreference relation by analyzing the graph structure at the inference phase, and the relational features used for the graph construction are simply represented in an overlapping manner. Since no feature combination function needs to be learned beforehand, COPA only requires a small amount of training data to learn the weights for the low-dimensional relations (i.e. hyperedge weights), which makes COPA a weakly supervised system.

1.3.5 Other Contributions

Coreference resolution is a set problem and thus the coreference relation is a transitive relation. Due to the transitive closure which is implicitly computed during the partitioning process of COPA, inconsistent coreference sets may be derived. Different optimization strategies have been employed in the literature in order to enforce coreference transitivity. In this thesis, we address this problem within the graph partitioning framework by proposing constrained clustering algorithms. We propose a novel method to combine constrained data clustering algorithms with the spectral graph clustering technique via the spectral embedding, thereby contributing to the constrained graph clustering field. At the same time, the constrained COPA addresses coreference problems which can only be solved by considering cluster-level consistencies.
We experiment with both artificial clean constraints and automatically generated ones. Although the clean setting produces promising improvements, our results with the automatically generated constraints are mostly negative for now. Further efforts on designing more high-recall constraints are needed. Extensive experiments show that COPA outperforms strong baseline systems in strict fair comparisons, and that it performs competitively with a small feature set and a small amount of training data across different domains.

1.4 The Thesis Structure

The thesis is organized into two parts: (1) Chapter 1 to Chapter 7 form the backbone of our contributions to the coreference resolution task; (2) Chapter 8 introduces the important extensions we made to the basic version of the COPA model, both in the algorithms and in solving special types of coreference problems.

• Chapter 1 helps the reader to develop an idea of the work presented in this thesis — the motivation and the significant contributions.

• Chapter 2 introduces the important related work for coreference resolution, which provides a big picture of the task modeling.

• Chapter 3 describes the corpora used throughout the thesis. The annotation schemes adopted by each of the data sets are illustrated and the important differences between them are pointed out. The chapter aims at assisting the reader to get familiar with the coreference phenomena and the annotation-related issues, both of which are important for understanding the coreference resolution task addressed in this thesis.

• Chapter 4 introduces our proposed coreference system — COPA. The chapter is self-contained, with the representation model, the partitioning algorithms and the system components described in detail. For the techniques involved in the basic version of COPA, readers can read Chapter 4 and Chapter 7 (for experiments) alone, looking up the features in Chapter 5 as necessary.
• Chapter 5 presents the features used in COPA.

• Chapter 6 discusses the problems of the previous evaluation metrics and then introduces our variants of the metrics for evaluating end-to-end coreference systems. Experiments verifying our variants are included at the end of the chapter. For readers who have been working in the field and are concerned about evaluating end-to-end coreference systems from their own experience, Chapter 6 can be read as a stand-alone chapter.

• Chapter 7 evaluates COPA with thorough experimental comparisons, against strong baseline systems and state-of-the-art systems in different domains.

• Chapter 8 describes the constrained version of COPA. We aim to guide the system towards more consistent partitionings by imposing negative (i.e. Cannot-Link) constraints on the partitioning algorithm. Experimental results for constrained COPA are provided within the chapter. For readers interested in graph clustering algorithms, Chapter 8 focuses on including constraints in graph clustering algorithms without changing the objective functions, and applies the proposed methods to coreference resolution as an application. Readers may also want to check the implementation issues addressed in Chapter 4, which give important hints for using clustering techniques in real applications.

• Chapter 9 concludes the entire thesis and suggests future improvement directions for graph-based coreference models.

1.5 Published Work

The proposal of COPA is published in (Cai & Strube, 2010a), where the hypergraph representation of texts and the coreference inference via partitioning are described. (Cai et al., 2011b) describes the positive-negative-weak feature engineering and illustrates the application of COPA to large corpora to compete with the state-of-the-art systems. COPA's participation in clinical tasks is introduced in (Cai et al., 2011a).
The proposed evaluation metrics for end-to-end coreference resolution are published in (Cai & Strube, 2010b).

Chapter 2

Related Work On Coreference Models

Understanding and automatically resolving the coreference phenomena in texts has been of interest to computational linguists for decades, from the early work on linguistic theories to the latest research exploring machine learning techniques. The early theories (Section 2.1) are included in this chapter to illustrate the linguistic insights they provide, which still inspire good features for modern methods. However, the mainstream of research has moved towards machine-learning-based task modeling (Sections 2.3 to 2.5). In this chapter, the most important research lines in the field are introduced. The existing coreference models are categorized according to their learning schemes — rule-based systems (Section 2.2), unsupervised models (Section 2.3), weakly supervised methods (Section 2.4) and finally supervised ones (Section 2.5). Our proposed system is a supervised coreference model. However, we show in Chapter 7 that our system only needs a little training data to achieve competitive performance, which makes it a weakly supervised one (when using limited training data). Unlike the weakly supervised methods in Section 2.4, which make use of unlabeled data together with labeled data, our model is only trained on (manually) annotated data in a conventional supervised manner, without making bootstrapping procedures necessary.

2.1 Early Theories and Formalisms

In this section, two important theories related to coreference resolution are introduced. Centering theory (Section 2.1.1) studies the referring relation between utterances (e.g. sentences) and entities in order to model discourse coherence. This theory can be used directly to estimate the possible entity assignments for referring expressions, and therefore to predict the coreference relation.
Centering theory is summarized with its important claims in this section; details are provided in the corresponding references. Binding theory (Section 2.1.2) models the preference of antecedents for anaphoric expressions on dependency trees. It can be easily adopted as relational features (or constraints) for machine-learning-based coreference models.

2.1.1 Centering

Centering theory (Grosz & Sidner, 1986; Grosz et al., 1995; Strube & Hahn, 1999) is a theory of the local component of attentional state. Joshi & Kuhn (1979) and Joshi & Weinstein (1981) show that there is a connection between changes in immediate focus and the complexity of the inference required for understanding the utterances in the corresponding discourse. From a coreference modeling point of view, the less complex the required inference is, the more likely it is to be a correct usage of referring expressions in the utterances. Centers (e.g. referring expressions) of an utterance refer to the entities which help to link the utterance to others within a discourse segment. Each utterance U in a discourse segment DS has a set of forward-looking centers, Cf(U, DS), and (except for the segment-initial utterance) a single backward-looking center, Cb(U, DS). The simplified notations are Cf(U) and Cb(U). When a center c is the semantic interpretation of an utterance U, this is expressed as a relation — U directly realizes c. The "realizes" relation is a generalization of "directly realizes". Since the realization relation combines syntactic, semantic, discourse, and intentional factors, the centers of an utterance are determined by the properties of the utterance in focus, the corresponding discourse segment and the discourse. The center elements of Cf(Un) are derived from the expressions that constitute Un, and they are partially ordered according to their prominence in Un.
The top ranked element of Cf(Un) that is also realized in Un+1 is taken as Cb(Un+1). Three types of transition relations between pairs of utterances are defined:

1. Center continuation: Cb(Un+1) = Cb(Un), and this entity is the top ranked element of Cf(Un+1).

2. Center retaining: Cb(Un+1) = Cb(Un), but this entity is not the top ranked element of Cf(Un+1).

3. Center shifting: Cb(Un+1) ≠ Cb(Un).

Different centering transitions between utterances indicate different degrees of coherence for the corresponding segment. The most fundamental claim of centering theory is that the inference load on the hearer decreases as the discourse coherence increases. Several other major claims are provided, which can be used as constraints for coreference modeling:

1. A unique Cb: each Un has only one backward-looking center.

2. Ranking of Cf: the elements of Cf are partially ordered according to a number of factors.

3. Centering constrains realization possibilities: if any element of Cf(Un) is realized by a pronoun in Un+1, then Cb(Un+1) must be realized by a pronoun too.

4. Preferences among sequences of center transitions: sequences of continuation are preferred over sequences of retaining; sequences of retaining are preferred over sequences of shifting.

5. Primacy of partial information: a semantic theory supporting the construction of partial interpretations is necessary.

6. Locality of Cb(Un): Cb(Un) cannot correspond to Cf(Un−2) or other prior sets of forward-looking centers.

7. Centering is controlled by a combination of discourse factors: centers are determined on the basis of a combination of syntactic, semantic and pragmatic processes.

Centering theory connects the focus of attention, the choice of referring expressions, and the coherence of utterances within a discourse segment.
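The transition definitions above can be sketched directly. This toy formulation treats entities as strings and takes the ranked Cf lists as given, whereas real centering implementations must also solve the nontrivial realization and ranking problems.

```python
# Toy classifier for the three centering transitions. Utterances are
# given as ranked forward-looking center lists Cf(U).

def backward_center(cf_prev, cf_curr):
    """Cb(U_{n+1}): top ranked element of Cf(U_n) realized in U_{n+1}."""
    for center in cf_prev:
        if center in cf_curr:
            return center
    return None

def transition(cb_prev, cf_prev, cf_curr):
    """Classify the transition from U_n to U_{n+1}."""
    cb_curr = backward_center(cf_prev, cf_curr)
    if cb_curr != cb_prev:
        return "shifting"
    if cf_curr and cf_curr[0] == cb_curr:
        return "continuation"
    return "retaining"

# U_n:     "Gore said ..."          Cf = [GORE]
# U_{n+1}: "He added ..."           Cf = [GORE]
print(transition("GORE", ["GORE"], ["GORE"]))          # continuation
# U_{n+1}: "Bush attacked him ..."  Cf = [BUSH, GORE]
print(transition("GORE", ["GORE"], ["BUSH", "GORE"]))  # retaining
```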
It has been used in extended or re-formulated forms for anaphora resolution tasks (Brennan et al., 1987; Hahn & Strube, 1997; Strube, 1998; Walker, 1998).

2.1.2 Binding Theory

The binding theory is formulated in Chomsky's Lectures on Government and Binding (Chomsky, 1981; Chomsky, 1995), which discusses anaphora within the generative paradigm. It considers the anaphoric relation for reflexive pronouns, reciprocals, personal pronouns and referential expressions (lexical noun phrases), by imposing syntactic constraints on their NP interpretations. Reflexives and reciprocals need local antecedents; pronouns may have an antecedent, but must be free locally; referential expressions must be free. The three principles of binding theory are:

Principle A: An anaphor (reflexive or reciprocal) must be bound in its governing category. Example: [John]i saw [himself]i. ([John] binds [himself], and they are coreferential.)

Principle B: A pronoun (except a reflexive or reciprocal) must be free in its governing category. Example: [John]i saw [him]j. ([John] binding [him] would violate the principle, so they are not coreferential.)

Principle C: A referential expression must be free everywhere. Example: [John]i saw [Katja]j. ([John] binding [Katja] would violate the principle, so they are not coreferential.)

The binding theory is helpful in ruling out antecedents for pronominal anaphors that would violate the proposed constraints, as well as in assigning possible antecedents to bound anaphors. For instance, our feature (6) corresponds to Principle C and feature (17) to Principle A (see Chapter 5).

2.2 Rule-based Deterministic Coreference Models

The coreference resolution systems from earlier years (e.g. Hobbs (1978) and Lappin & Leass (1994)) rely on manually configured rules, most of which are derived from linguistic interpretations of the coreference phenomena.
A couple of recently emerged coreference resolution systems (Sections 2.2.3 and 2.2.4) are likewise built entirely upon heuristic rules and perform in a deterministic manner. These systems aim to explore how syntactic and semantic information helps the task, while setting aside the effect of the learning schemes. The successfully explored heuristic rules can inspire (strong) features for machine-learning-based algorithms (see Sections 2.3, 2.4 and 2.5), and the deterministic systems may serve as good baselines for the complex coreference models.

2.2.1 Hobbs' Algorithm

Hobbs (1978) proposes one of the first algorithmic approaches to pronoun resolution, determining the antecedents for pronominal anaphors by searching syntactic parse trees and incorporating semantic analysis. Hobbs' first algorithm operates on surface parse trees, which are assumed to be correctly available for each sentence to be resolved. A surface parse tree exhibits the grammatical structure of a sentence. This simple method traverses the tree in a particular order, looking for a noun phrase of the correct gender and number as the expected antecedent of a pronoun. Selectional constraints can further be applied to restrict the candidate antecedents. Hobbs' second algorithm works on texts where the syntactically derivable coreference and non-coreference relations have already been detected. The texts should be in logical representations, exhibiting functional semantic relationships. In this semantic algorithm, there are four principal semantic operations on the logical notations of texts. These are (1) detecting inter-sentence connections, (2) interpreting general words or predicates in context, (3) merging redundant statements and (4) extracting the yet unidentified entities. The four operations together are able to accomplish pronoun resolution most of the time.
Where they fail, the naive (syntactic) algorithm is used to determine the final antecedent. Hobbs' approach remains one of the most influential works in the field and frequently serves as a common benchmark for evaluating later proposals (Mitkov, 2002).

2.2.2 Lappin and Leass' Algorithm

Lappin & Leass (1994) propose an algorithm, RAP (Resolution of Anaphora Procedure), which is applied to the syntactic representations generated by McCord's Slot Grammar parser (McCord, 1989). The system uses multiple salience measures, which capture a variety of syntactic properties, and additionally a model of attentional state. From a list of candidate antecedents of a pronominal anaphor, RAP determines the preferred one by relying on several components:

1. An intra-sentential syntactic filter.

2. A morphological filter, which rules out candidate NPs that disagree with the pronoun in person, number or gender.

3. A separate filter which identifies pleonastic pronouns.

4. Several salience values assigned to each NP, which favor (i) subject over non-subject NPs, (ii) direct objects over other complements, (iii) arguments of a verb over adjuncts and objects of prepositional phrase adjuncts of the verb, and (iv) head nouns over complements of head nouns.

5. An overall salience value calculated for each equivalence class of NPs.

6. A decision maker which, at the end, selects the preferred antecedent for each anaphoric pronoun.

Lappin & Leass test RAP on five computer manuals containing approximately 82,000 tokens. The success rate of the system is optimized on the training set in a heuristic way. In the blind test, RAP scores higher than a Slot Grammar version of Hobbs' algorithm (Hobbs, 1978).

2.2.3 Haghighi and Klein's Simple System

Haghighi & Klein (2009) present a deterministic coreference system, which is driven by syntactic and semantic compatibility lists extracted from an unlabeled corpus.
They deliberately break from the standard view of focusing on coreference modeling and instead devote themselves to exploring linguistic features in a simple deterministic manner. Haghighi and Klein's system works in a three-step process. For each anaphor, a best antecedent is chosen or set to NULL, following three steps:

1. Syntactic Constraints: a self-contained syntactic module generates syntactic structures using an augmented parser and extracts syntactic paths from the anaphor to its candidate antecedents. When applicable, syntactic constraints either enforce or disallow coreference relations on paths.

2. Semantic Constraints: a self-contained semantic module evaluates semantic compatibilities between head words and between names, further filtering the antecedents remaining from step 1.

3. Selection: select the final antecedent with the minimal tree distance to the anaphor.

For agreement constraints, Haghighi & Klein implement person, number and entity type agreement. Role appositives and predicate nominatives are extracted from syntactic trees to assist non-pronominal resolution. A set of compatible word pairs which match the predicate-nominative patterns is extracted from two external data sets, so that rich semantic knowledge can be accessed. The simple system manages to outperform the state-of-the-art unsupervised coreference resolution systems and is broadly comparable to the state-of-the-art supervised systems. The authors suggest using the system as a simple-to-reproduce, high-performance baseline for future work in the field.

2.2.4 Stanford's Multi-Pass Sieve System

In the CoNLL-2011 shared task (Pradhan et al., 2011), one of the most influential shared tasks on coreference resolution, the Stanford system (Lee et al., 2011) won in all provided settings. The proposed Multi-pass Sieve system is built in an architecture which implements multiple sieves in a cascaded manner.
The sieves are applied top-down, ordered from highest precision to lowest. Since each sieve can use all information available (including the predictions from previous sieves), cluster-level features (e.g. cluster head match) have a means to come into the model. The sieves proposed are described briefly below.

1. Pass 1: Exact string match.

2. Pass 2: Precise constructions (e.g. appositive; predicate nominative; role appositive).

3. Pass 3: Strict head match (e.g. cluster head match; compatible modifiers).

4. Passes 4 & 5 & 6: Variants of head match.

5. Pass 7: Pronoun resolution.

Despite its simplicity, Stanford's multi-sieve system achieves more competitive performance than most of the complex models. With careful engineering, it is easy to add more sieves and features without harming the performance, whereas this frequently happens with more sophisticated models.

2.3 Unsupervised Coreference Models

Generally speaking, unsupervised models are studied to ease the requirements for expensive human annotations. However, the unsupervised coreference models have not yet surpassed the supervised ones. In this section, an unsupervised clustering method, three unsupervised probabilistic models and one bootstrapping method for coreference resolution are described.

2.3.1 Cardie and Wagstaff's Clustering Method

Cardie & Wagstaff (1999) represent the mentions to be resolved as vertices in a graph. Edge weights are calculated from a distance metric which measures the degree of compatibility between vertices. The proposed distance metric is

dist(NP_i, NP_j) = Σ_f w_f × incompatibility_f(NP_i, NP_j)

where f ranges over the pairwise features. To generate the coreference sets, an agglomerative clustering algorithm is applied afterward, which merges compatible partial clusters according to the judgments from the distance metric.
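A minimal sketch of this scheme, with invented toy features and weights (the real feature set and the greedy cluster-merging details of Cardie & Wagstaff differ):

```python
# Sketch of Cardie & Wagstaff style clustering: a weighted incompatibility
# distance over mention feature dictionaries, then greedy agglomerative
# merging of clusters whose cross-pairs all fall below a radius.

def distance(np_i, np_j, weights, incompat_fns):
    """dist(NP_i, NP_j) = sum_f w_f * incompatibility_f(NP_i, NP_j)."""
    return sum(w * incompat_fns[f](np_i, np_j) for f, w in weights.items())

def agglomerate(mentions, weights, incompat_fns, radius):
    """Greedily merge clusters while some pair of clusters is compatible."""
    clusters = [[m] for m in mentions]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if all(distance(a, b, weights, incompat_fns) < radius
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters

# Toy incompatibility features (hypothetical, for illustration only).
fns = {
    "gender": lambda a, b: 0 if a["gender"] == b["gender"] else 1,
    "head":   lambda a, b: 0 if a["head"] == b["head"] else 1,
}
w = {"gender": 10.0, "head": 1.0}
mentions = [{"head": "gore", "gender": "m"},
            {"head": "gore", "gender": "m"},
            {"head": "bush", "gender": "m"}]
print(agglomerate(mentions, w, fns, radius=1.0))
```

A large weight such as the one on gender effectively turns that feature into a hard incompatibility constraint, which is how the original metric rules out impossible merges.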
The algorithm performs in a greedy manner and does not allow clusters with incompatible mentions. It may therefore become problematic when dealing with noisy data sets.

2.3.2 Haghighi and Klein's Bayesian Model

Haghighi & Klein (2007) propose a fully generative model for coreference resolution. A nonparametric Bayesian model is adopted in order to avoid assuming the number of entities in advance. For non-pronominal mentions, the model makes decisions based on their dependencies on mention heads. For pronouns, the model incorporates parameters for entity type, gender and number. Entity salience is added to the model too. Haghighi & Klein report higher numbers than Cardie & Wagstaff (1999) on the MUC-6 data, and show that including more unannotated data can improve the performance, owing to the unsupervised learning nature of their model. However, Haghighi & Klein's Bayesian model is difficult to extend, since including more features requires changing the model structure.

2.3.3 Ng's EM Clustering Method

Ng (2008) recasts the unsupervised coreference resolution problem as EM clustering. The adopted joint probability is

P(D, C) = P(C) P(D|C)

where D represents an observed document and C is a clustering on it. The document is further represented by mention pairs, and 7 features are applied to each pair of mentions. Therefore P(D|C) is given by

P(D|C) = Π_{m_ij ∈ Pairs(D)} P(m_ij^1, ..., m_ij^7 | C_ij)

The parameters (i.e. the probabilities of the features given the clusterings) are estimated using an EM algorithm, and at the end a converged clustering C is induced for each document. In order to cope with the number of possible clusterings, which is exponential in the number of mentions in a document, complex schemes are proposed to choose only the best clusterings at each iteration.
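To make the factorization concrete, the following sketch evaluates log P(D|C) for a candidate clustering with a single invented feature and hand-set parameters; the real model uses seven features per pair and estimates the parameters with EM:

```python
import math

# Sketch: P(D|C) factorized over mention pairs. 'params' maps
# (feature_vector, coreferent?) to a probability; here it is hand-set,
# whereas Ng (2008) estimates it with EM.

def log_p_doc_given_clustering(pair_features, clustering, params):
    total = 0.0
    for (i, j), feats in pair_features.items():
        coreferent = clustering[i] == clustering[j]
        total += math.log(params[(feats, coreferent)])
    return total

# One toy feature (string match) observed for the pair of mentions 0 and 1.
params = {
    (("str_match",), True): 0.9,
    (("str_match",), False): 0.1,
}
pairs = {(0, 1): ("str_match",)}
same = log_p_doc_given_clustering(pairs, {0: 0, 1: 0}, params)
diff = log_p_doc_given_clustering(pairs, {0: 0, 1: 1}, params)
print(same > diff)  # the clustering grouping the matching pair scores higher
```

Maximizing this quantity over clusterings is exactly where the exponential search space arises, motivating the pruning schemes described above.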
Ng achieves better performance compared with the enhanced version of Haghighi & Klein's system, but his system is still not comparable to supervised coreference models.

2.3.4 Poon and Domingos' Markov Logic Model

In order to perform joint inference across mentions, as opposed to focusing on pairwise relations, Poon & Domingos (2008) make use of the expressive power of Markov Logic to represent relations between mentions in first-order logic. Poon & Domingos propose an unsupervised system based on Markov Logic Networks to infer the coreference sets. Several relational features are adopted, where m stands for a mention, c for a cluster and e for an entity.

1. Head match for non-pronouns: ¬IsPrn(m) ∧ InCluster(m, +c) ∧ Head(m, +t)
2. Mention type agreement: InCluster(m, c) ⇒ (Type(m, e) ↔ Type(c, e))
3. Pronoun-cluster type agreement: IsPrn(m) ∧ InCluster(m, c) ∧ Head(m, +t) ∧ Type(c, +e)
4. Apposition constraint: Appo(x, y) ⇒ (InCluster(x, c) ↔ InCluster(y, c))
5. Predicate nominative constraint: PredNom(x, y) ⇒ (InCluster(x, c) ↔ InCluster(y, c))

Poon & Domingos report competitive performance of their system, which benefits from leveraging relations between mentions from the cluster-level perspective. Markov Logic provides an easy way of incorporating cluster-level features, which is non-trivial for pairwise models. However, their big gain from adding the appositive and predicate nominative constraints cannot be reproduced on other data sets, where these relations are not annotated as coreferent.

2.3.5 Kobdani et al.'s Bootstrapping Model

Kobdani et al. (2011) collect word associations from large unlabeled data sets, and propose an unsupervised system to learn association scores between mentions. At testing time, the word association scores are used in the same way as coreference probabilities.
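As an illustration of mining such associations from raw text, the sketch below turns co-occurrence counts within a token window into pointwise-mutual-information scores; the windowing and the PMI scoring are assumptions for illustration, not Kobdani et al.'s exact recipe:

```python
import math
from collections import Counter

# Sketch: word association scores from unlabeled text via co-occurrence
# counts in a sliding window, scored with PMI (illustrative assumptions).

def pmi_scores(sentences, window=5):
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for tokens in sentences:
        for i, w in enumerate(tokens):
            word_counts[w] += 1
            total += 1
            for v in tokens[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    return {
        (w, v): math.log((c / total) /
                         ((word_counts[w] / total) * (word_counts[v] / total)))
        for (w, v), c in pair_counts.items()
    }

scores = pmi_scores([["obama", "president"], ["obama", "president"],
                     ["weather", "rain"]])
```

Words that co-occur more often than their individual frequencies predict receive a positive score, which can then stand in for a coreference probability between mentions headed by those words.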
Built upon the predictions of the unsupervised system, a self-training scheme is adopted to learn the coreference relation in a conventional supervised manner. Since no manually labeled data is used, the self-training system can be viewed as unsupervised too, and it outperforms several strong unsupervised systems.

2.4 Weakly Supervised Coreference Models

Weakly (semi-) supervised learning algorithms work with little labeled data and attempt to make use of the unlabeled data during learning. They are expected to perform better than unsupervised methods due to the available (although limited) guidance from the training labels. In this section, several weakly supervised coreference models are described.

2.4.1 Multi-view Co-training Models

Co-training (Blum & Mitchell, 1998) is a multi-view bootstrapping method which gradually extends the training (labeled) set with automatically labeled data. Co-training algorithms utilize multiple learners, each of which captures a separate view of the data (i.e. uses a disjoint subset of features to represent the data). Müller et al. (2002) apply a co-training method to coreference resolution using two classifiers and therefore two views of the data. They propose a feature selection strategy to create the two feature subsets, seeding the two views with the two best features and greedily selecting the remaining ones one by one. Besides the greedy feature selection method of Müller et al., Ng & Cardie (2003) also experiment with random selection and with selection according to feature types. The two classifiers are trained with their own feature sets and predict labels for the unlabeled data. At each training iteration, each classifier chooses its most confident predictions and adds the auto-labeled data to the training set of the other classifier. However, the results reported by Müller et al.
are mostly negative, and Ng & Cardie do not obtain improvements with co-training algorithms either. The main difficulties lie in the generation of the independent feature sets (views), the choice of the number of iterations, and the growth speed of the training data (Pierce & Cardie, 2001). Raghavan et al. (2012) propose semantic and temporal features as views for their co-training classifiers, and these views appear to work on clinical data sets.

2.4.2 Single-view Bootstrapping Methods

Ng & Cardie (2003) compare multi-view weakly supervised methods with single-view ones, applied to coreference resolution. They propose two single-view algorithms, a self-training algorithm and an EM algorithm. Both of their single-view methods are based on the bootstrapping scheme. The self-training algorithm involves a committee of classifiers, each of which is trained on a randomly sampled subset of the labeled data. The classifiers predict labels for all the unlabeled data, and the predictions agreed upon by all of the classifiers are added to the labeled data. The single-view weakly supervised EM assumes a parametric model of data generation. The unlabeled data are considered to be missing labels, and the algorithm optimizes the posterior probability of the parameters given both the labeled and the unlabeled data. More details can be found in Nigam et al. (2000). Ng & Cardie (2003) conclude that the single-view methods easily outperform the multi-view co-training algorithm for the coreference resolution task.

2.5 Supervised Coreference Models

Due to the existence of well-annotated corpora (see Chapter 3 for details), more attention has recently been paid to supervised coreference resolution modeling. Although coreference resolution is a set problem (i.e. grouping mentions into sets), the first machine-learning-based approaches apply pairwise classification models which break the problem down into a two-step process (Section 2.5.1).
The success of the two-step method is mainly due to its expressive simplicity and straightforward learning strategy. However, more global models are entering the field (Section 2.5.3), aiming to overcome the performance bottleneck caused by the lack of information beyond the pairwise level (e.g. relations between more than two mentions). Both local and global models are introduced in this section, so that readers can grasp the motivation for and the importance of working on global models, specifically on the relatively simpler graph-partitioning-based inference.

2.5.1 Two-step Methods

The mention-pair model was first proposed by Aone & Bennett (1995) and McCarthy & Lehnert (1995). However, Soon et al.'s system (Soon et al., 2001) is the first successful attempt to apply machine learning techniques to the mention-pair model for coreference resolution, and it has become the most widely used baseline system in the field. Soon et al. divide the task into two steps, a classification step and a clustering step. In step 1, a classifier decides for pairs of mentions whether they are coreferent or not. Based on the classification decisions, the clustering component merges mention pairs into sets, so that all mentions in one set are coreferent with each other. A decision tree classifier (e.g. C5, Quinlan (1993)) is adopted along with 12 features for step 1, and the closest-first search strategy for step 2 (i.e. choosing the closest positive antecedent for the anaphor in focus). A simple example illustrating the two-step processing is given below.

• Mention list: a1, b1, a2, b2, a3

• Step 1: Classification step:
  For b1: a1 ←| b1
  For a2: a1 ← a2; b1 ←| a2
  For b2: a1 ←| b2; b1 ← b2; a2 ←| b2
  For a3: a1 ← a3; b1 ←| a3; a2 ← a3; b2 ←| a3

• Step 2: Clustering step:
  Set1: {a1, a2, a3}
  Set2: {b1, b2}

The sign ←| denotes that a mention pair is predicted not to be coreferent, and the sign ← marks the pairs which are predicted to be coreferent with each other. In the literature, one line of improvements after Soon et al. proceeds along two directions: proposing more powerful pairwise classifiers (in step 1), or clustering the pairwise decisions with better algorithms (in step 2). For a more detailed overview, readers are referred to Ng (2010).

Work on the Classification Step. Step 1 can be improved by exploring more powerful classifiers. Besides the decision tree classifier (e.g. Soon et al. (2001), Ng & Cardie (2002)), the Maximum Entropy classifier (e.g. Luo et al. (2004)) and the averaged perceptron learning algorithm (e.g. Bengtson & Roth (2008)) have also been applied to the classification step. Researchers have also worked on enriching the feature set for step 1. Ng & Cardie (2002) extend Soon et al.'s feature set to a size of 52, including more sophisticated linguistic knowledge. Bengtson & Roth (2008) stress the importance of feature selection and propose their system as an enhanced baseline for complex coreference models. Ponzetto & Strube (2006) are the first to exploit semantic features (by means of semantic role labeling) and world knowledge (from Wikipedia) for coreference resolution, and Rahman & Ng (2011) proceed to analyze in detail the behavior of combining world knowledge with different models. Since world knowledge (especially when obtained from web data) is noisy, it is still an open question how to make use of it in a robust way. More recent attempts can be found in Kobdani et al. (2011) and Bansal & Klein (2012).

Work on the Clustering Step. By always choosing the closest positive antecedent (as in Soon et al. (2001)), the pairwise decisions from the classification step are linked into sets.
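The closest-first linking of the toy example above can be sketched as follows, assuming the pairwise classifier is given (here a stand-in that marks mentions of the same letter as coreferent):

```python
# Sketch of the closest-first clustering step: each anaphor is linked to
# the nearest preceding mention the pairwise classifier accepts, and the
# links are then closed transitively into sets.

def closest_first(mentions, is_coreferent):
    antecedent = {}
    for j in range(1, len(mentions)):
        for i in range(j - 1, -1, -1):  # scan candidates, closest first
            if is_coreferent(mentions[i], mentions[j]):
                antecedent[j] = i
                break
    cluster = list(range(len(mentions)))
    for j in sorted(antecedent):
        cluster[j] = cluster[antecedent[j]]
    return cluster

# Stand-in classifier: mentions of the same letter are coreferent.
mentions = ["a1", "b1", "a2", "b2", "a3"]
print(closest_first(mentions, lambda x, y: x[0] == y[0]))
# -> [0, 1, 0, 1, 0], i.e. Set1 = {a1, a2, a3}, Set2 = {b1, b2}
```

Note that each link decision looks only at one pair at a time; a single wrong positive link merges two entire entities, which is exactly the error propagation discussed next.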
Since the closest-first strategy is too sensitive to error propagation, Ng & Cardie (2002) instead propose a best-first method which links the most confident positive antecedents. Luo et al. (2004) perform a greedy search on a bell tree representation (Figure 2.1). In each step, a decision is made to connect the anaphor in focus (e.g. 3*) with a previously constructed partial entity (e.g. [12]). Although this method moves towards entity-level modeling, the greedy (and sequential) nature of the algorithm excludes important information contained in all paths other than the chosen one.

Figure 2.1: Luo's Bell Tree Method (Luo et al., 2004)

Optimization algorithms have been applied to the clustering step in order to achieve better performance given the output of the classification step. For instance, both Klenner (2007) and Finkel & Manning (2008) impose transitivity constraints in integer linear programming (ILP) to enforce the transitive closure which cannot be taken care of by greedy algorithms.

2.5.2 Preference Models

Selecting the correct antecedent for an anaphor among all candidate antecedents can also be approached by preference modeling, which predicts the winning candidates based on comparisons between all candidates. Preference models allow one to consider not only the coreference relations between antecedents and anaphors, but also the competition between antecedents.

Twin Candidate Model. A twin candidate model is proposed by Yang et al. (2005) to model the competition between pairs of antecedents. Each anaphor ana, together with two candidate antecedents ante1 and ante2, forms one tuple instance {ana, ante1, ante2}, which has three possible labels: 10 indicating a preference for ante1, 01 a preference for ante2, and 00 indicating that ana is non-anaphoric. The best antecedents are ranked top in a round-robin manner. Yang et al.
propose features describing relations between a pair of antecedents, which are not accessible to non-preference models.

• inter_SentDist: distance between ante1 and ante2 in sentences
• inter_StrSim: 0, 1 or 2 if StrSim(ante1, ana) is equal to, larger than or less than StrSim(ante2, ana) (where StrSim(·, ·) measures the string similarity between two mentions)
• inter_SemSim: 0, 1 or 2 if SemSim(ante1, ana) is equal to, larger than or less than SemSim(ante2, ana) (where SemSim(·, ·) measures the semantic agreement between two mentions in WordNet)

Ranking Models. Denis & Baldridge (2007) rank all candidate antecedents for pronominal anaphors simultaneously, and their system is shown to significantly outperform the twin candidate model. To be able to exploit cluster-level information on top of the mention ranking model, Rahman & Ng (2009) propose to rank clusters instead of antecedents.

Preference models begin to explore global relations without assuming pairwise predictions as given. However, due to their sequential nature, only the preceding context of each anaphor participates in the decision making, which keeps them close to the two-step methods.

2.5.3 One-step Methods

In this section, one-step models for the coreference resolution task are introduced. These are the closest to our work, in that they resolve all mentions simultaneously by considering the full available context.

2.5.3.1 Clustering Methods

Two algorithms are described in this section, both of which perform global inference by means of clustering algorithms. Nicolae and Nicolae's graph clustering algorithm, to be introduced below, is still built upon pairwise classification output (as edge weights). However, it is considered a global model, as they do not cluster mentions into coreference sets sequentially, but resolve them all together.

Cardie and Wagstaff's Method.
It is worth noting that Cardie and Wagstaff's method (Cardie & Wagstaff, 1999) in Section 2.3 is unsupervised, since the edge weights are set manually. However, their clustering mechanism can easily be adapted into a supervised version by learning the weights automatically. Recall that Cardie & Wagstaff represent the mentions to be resolved as vertices in a graph, and edge weights are calculated from a distance metric which measures the degree of compatibility between vertices. An agglomerative clustering algorithm is applied afterward to generate the coreference sets.

Nicolae and Nicolae's Best-cut. Nicolae & Nicolae (2006) describe a graph-cut-based algorithm with the same graph representation as Cardie and Wagstaff's. The graph-cut strategy superficially resembles our approach. However, they apply the cutting algorithm only to the output of a classification step, which forms a weighted standard graph as shown in Figure 2.2.

Figure 2.2: Nicolae and Nicolae's Best-cut Method (Nicolae & Nicolae, 2006)

They report considerable improvements over state-of-the-art systems including Luo et al. (2004). However, since they change not only the clustering strategy but also the features for the classification step, it is not clear whether the improvements are due to the graph-based clustering technique. Furthermore, they separate pronoun resolution from the core processing and adopt a standard two-step method for pronouns. The fact that their algorithm is only applied to a subset of mentions makes it less elegant than ours.

2.5.3.2 Probabilistic Models

Being conceptually similar to the graph clustering algorithms, probabilistic models optimize the entity assignments by considering all relations available in the context at hand. Different inference frameworks have been explored in the literature to capture cluster-level information (e.g. transitivity), and different approximation algorithms are used to make globally optimized predictions.
It is not yet clear which model is distinctly superior.

McCallum and Wellner's Conditional Model. McCallum & Wellner (2005) introduce three discriminative, conditional-probabilistic models for coreference resolution, all of them undirected graphical models. The models condition on the mentions and generate entity assignments for them. It is shown that the most advanced version (i.e. the third model) can be transformed into an equivalent graph, with mentions as vertices and edge weights ranging from −∞ to +∞. The inference thus becomes a graph partitioning problem, where e.g. correlation clustering (Bansal et al., 2002) can be applied to handle the negative edges.

Culotta's First-order Logic Method. Culotta et al. (2007) adopt a first-order logic representation in which features over sets of mentions (i.e. cluster-level features) are implemented. The proposed models can be viewed as estimating the parameters for each cluster-wise compatibility independently and then combining them via clustering. Uniform sampling is used for generating training instances (i.e. positive/negative clusters) in one model, and online training schemes are proposed for the other two improved versions. They use four features in the model. The first is an enumeration over pairs of noun phrases. The second is the output of a pairwise model. The third is the cluster size. The fourth counts mention type, number and gender in each cluster. They assume true mentions as input and report numbers for only one evaluation metric. It is not clear whether the improvement in results carries over to system mentions.

Sapena's Relaxation Labeling Algorithm. Sapena et al. (2010) use a constraint-based approach (i.e. relaxation labeling) for coreference resolution. They generate pairwise predictions as constraints using a decision tree classifier and represent them in a graph.
Afterward, they optimize with respect to the constraints (both positive and negative ones) in an iterative procedure. It is shown that the proposed model outperforms an ILP algorithm with transitivity enforced. In his thesis (Sapena, 2012), Sapena shows that his graph representation can be viewed as a hypergraph, as illustrated in Figure 2.3. The mentions are taken as vertices, and the constraints generated from the decision tree are taken as edges (e.g. e1, e2 and e3). The main differences between Sapena's work and ours are that (1) his hyperedges represent learned combinations of features, while ours are derived directly from simple (low-dimensional) relational features; and (2) his resolution model is a probabilistic model, while ours operates within the graph-based clustering framework. The two works differ in both the representation model and the resolution algorithm, despite the similar naming.

Figure 2.3: Hypergraph Representation in Sapena's Thesis (Sapena, 2012)

Markov Logic Models for Coreference Resolution. As mentioned, Poon & Domingos (2008) propose a learning-based unsupervised Markov Logic Model for coreference resolution, which manages to incorporate cluster-level features via formulas. Song et al. (2012) implement a supervised framework using Markov Logic to perform the mention pair classification and the mention clustering jointly. They make use of the expressive power of Markov Logic Networks to include hard (global) constraints for the best-first scheme and for transitivity. Frank et al. (2012) adopt Markov Logic Networks to detect errors in automatic semantic annotations. The automatic system predictions for word sense disambiguation and coreference resolution are taken together into their model, and are optimized (i.e. corrected) via joint inference. Both Song et al.'s and Frank et al.'s models can be viewed as optimization methods for step 2 of the two-step coreference framework.
2.6 Summary

Two-step Coreference Models. Although coreference resolution is naturally a clustering problem, which aims to cluster mentions into coreference sets, most of the recent approaches divide the task into two steps: (1) a classification step which determines whether a pair of mentions is coreferent or outputs a confidence value, and (2) a clustering step which groups mentions into entities based on the output of step 1. Soon et al. (2001) first propose the two-step strategy under the machine learning framework, i.e. pairwise classification followed by clustering, using a set of twelve powerful features. Their system is based solely on information about the mention pair (i.e. anaphor and antecedent), and does not take any information about other mentions into account. However, it turned out to be difficult to improve upon their results by applying a more sophisticated learning method without improving the features.

A number of approaches have focused on improving coreference modeling within the two-step framework, either by proposing linguistically informed or world-knowledge-based features, or by applying different optimization algorithms in the clustering phase. Most of the two-step methods are considered local, because they make coreference decisions on pairs of mentions and cluster the mentions into sets considering only the preceding antecedents. In order to exploit the full context, global models are preferred over the two-step methods.

Global Coreference Models. As an example of graph partitioning models for coreference resolution, Nicolae & Nicolae (2006) propose a graph-cut-based approach where mentions are vertices and edge weights are learned from pairwise coreference classifiers. Unfortunately, they only manage to resolve non-pronominal mentions in this framework and have to approach pronoun resolution separately.
This work is superficially similar to ours, but our graph-based model includes mentions of all types in the graph representation. In this way, we are able to access the full context of the document in focus, which makes our model fully global.

Graphical models have the advantage of precise probabilistic formulation, which enables coreference systems to learn complex dependency structures between mentions and entities. However, the learning and inference procedures can be complicated even with approximation (e.g. Finkel & Manning (2008)), which makes them less preferable than simpler coreference systems such as ours.

Lang et al. (2009) propose an unsupervised coreference resolution system based on a hypergraph partitioning algorithm, which was not accessible before our first proposal (Cai & Strube, 2010a). Lang et al. represent mentions as vertices and generate hyperedges directly from features. Unfortunately, no strict experimental comparison (with the same feature sets) is provided to verify the effect of their model. Furthermore, the mentions along with their heads and semantic types are all taken from the gold annotation in Lang et al.'s system. In contrast, in this thesis we present a complete hypergraph partitioning model for coreference resolution and provide thorough experiments with realistic system settings. Crucial issues regarding both the clustering algorithms and the coreference application are addressed in this thesis. For instance, we propose the feature categorization in Chapter 5 to ensure the stable construction of the hypergraphs. Extensive experiments across different domains and different evaluation metrics demonstrate the effectiveness and robustness of our proposed system.

Chapter 3

Data Sets for Coreference Resolution

Two data sets have been frequently used for years to evaluate coreference resolution.
The former is from the MUC conferences (see Section 3.1) and the latter is provided by the Automatic Content Extraction (ACE) program (see Section 3.2). Stoyanov et al. (2009) point out that there are significant differences between these corpora in the annotation of mentions and of the coreference relation, which will be illustrated in this chapter. A much larger corpus, OntoNotes (see Section 3.3), was recently released. It became the standard evaluation set for the coreference resolution task soon after its first use in the CoNLL 2011 shared task (Pradhan et al., 2011). In this thesis, we also experiment on a medical data set (see Section 3.4), which consists of clinical reports with the coreference relation annotated between persons, (clinical) problems, treatments, etc. We describe the coreference data sets before introducing our proposed coreference model, aiming to help readers better understand the coreference phenomena and the annotation-scheme-related problems involved in the task.

3.1 MUC

The MUC data sets consist of the MUC-6 data (MUC-VI Text Collection) (Chinchor & Sundheim, 2003), with a standard training/testing division (30/30), and the MUC-7 data (North American News Text Corpora) (Chinchor, 2001) (30/20). The documents in the MUC data sets are all news articles, and are prepared (annotated) for four evaluation tasks: Named Entity Recognition, Coreference Resolution, Template Elements and Scenario Templates. The MUC corpora are annotated with general types of mentions, but only the ones that participate in the coreference relation. In other words, entities containing single mentions (denoted as singleton entities) are not tagged, such as "the Federal Railway Labor Act" in the following MUC Example. It is also worth noting that neither appositions nor predicate nominatives are annotated as coreference relations.
MUC Example: Under the Federal Railway Labor Act, if the mediator fails to bring [the two sides]1 together and [the two sides]1 do n't agree to binding arbitration, [a 30-day cooling-off period]2 follows. After [that]2, [the union]3 can strike or the company can lock [the union]3 out.

Since we only focus on the end-to-end coreference resolution problem, which takes raw text as input without assuming any annotations, mentions need to be detected automatically. Our mention tagger (see Chapter 7) tends to identify too many mentions for the MUC data, as there is no restriction on the types of mentions to be resolved. This therefore results in too many spurious coreference sets, such as an entity containing several [yesterday] mentions.

3.2 ACE

There are four corpora from the ACE program: ACE 2002 (Mitchell et al., 2002), ACE 2003 (Mitchell et al., 2003), ACE 2004 (Mitchell et al., 2004) and ACE 2005. The annotations of the ACE data cover six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. The ACE data sets draw on different types of document sources, i.e. news wire reports, broadcast news programs and newspapers, in three different languages, i.e. Arabic, Chinese and English. In this thesis, we use both ACE 2003 and ACE 2004. Since we do not have access to the official ACE testing data (only available to ACE participants), we follow Bengtson & Roth (2008) in dividing the ACE 2004 English training data into training, development and testing partitions (268/76/107). We randomly split the 252 ACE 2003 training documents using the same proportions into training, development and testing (151/38/63). The coreference relation in the ACE data sets is annotated only among mentions of certain entity types.
For instance, ACE 2004 adopts 7 entity types: Person (PER), Organization (ORG), Location (LOC), Geo-Political Entity (GPE), Facility (FAC), Vehicle (VEH) and Weapon (WEA). Singleton entities are allowed in the ACE data as long as they are of the required entity types. In the following ACE Example, which illustrates the ACE annotations, both mentions [Palestinian]1 and [the former Soviet Union]4 form singleton entities due to their GPE types.

ACE Example: The problem arose after [[Palestinian]1 Mahmood Abu Talib, [whose]2 testimony the court has been hearing since Friday]2, refused to continue answering a question by [[defense lawyer]3 Richard Keen]3 about the detailed reasons for [his]3 having lived in [the former Soviet Union]4 for a period of 18 months in the 70s. [The lawyer]3 asked the judges to force [Abu Talib]5 to answer the question aimed at demonstrating [the witness]5 's "professional terrorism" precedents.

There are several special relations that are taken as the coreference relation in the ACE data sets, such as appositive (e.g. entity 2), predicate nominative and role appositive (e.g. [[defense lawyer]3 Richard Keen]3). Features designed to capture these special relations might not work when moving to different data sets, as they usually do not form the coreference relation from the linguistic perspective. It is relatively easy to detect ACE mentions given the fixed entity types. However, since entity extraction is also implicitly evaluated via singleton entities, it brings non-trivial implementation issues to the coreference evaluation metrics (for more details, readers are referred to Chapter 6).

3.3 OntoNotes

The OntoNotes Release 4.0 corpus (Weischedel et al., 2011), provided by the Linguistic Data Consortium (LDC), is used for the CoNLL 2011 shared task on modeling unrestricted coreference in OntoNotes.
It consists of 2,999 English documents, of which 1,674 are chosen as the training data, 202 as the development set and 207 as the testing set for the shared task. The collection contains news wire texts, broadcast news, broadcast conversations, magazine and web documents. The diverse text types impose additional challenges on coreference systems. In addition to the coreference relation, the OntoNotes data is also tagged with syntactic trees, high-coverage verb and some noun propositions, partial verb and noun word senses, and 18 named entity types. The shared task provides two types of annotation layers, the gold layers (for the training set) and the system predicted layers (for all sets). The participating systems can only access system predicted information during the testing phase, which explicitly stresses the importance of the end-to-end coreference setting.

In the OntoNotes data, appositive structures are annotated as a separate type and are not included in the coreference sets. Predicate nominatives are not considered coreferent either. Event coreference is annotated, such as the entity containing [overcoming]2 and [This example]2 in the following OntoNotes Example (1). As shown in OntoNotes Example (2), generic phrases (e.g. [Officials]1) are also tagged as mentions as long as other mentions are coreferent with them. GPEs are linked to references to their governments, e.g. [China]1 and [the Chinese government 's]1 in OntoNotes Example (3).

OntoNotes Example (1): [The South Korean team of veterans]1, by [overcoming]2 [their]1 injuries to give a display of athleticism at the international level, have emerged from the shadow of war and transformed [their]1 handicaps into glorious results. [This example]2 should provide food for thought to the disabled and sports communities in the future.
OntoNotes Example (2): [Officials]1 say [they]1 have reduced the reunion schedule from four days to three and will spend some $ 800,000 to bring the families together , compared with the nearly $ 1.6 million it spent for the August event . OntoNotes Example (3): [China]1 today blacked out a CNN interview that was critical of [the Chinese government 's]1 handling of the SARs epidemic and of [the country 's]1 health care system. 3.4 I2B2 The I2B2/VA/Cincinnati Children's 2011 challenge (Uzuner et al., 2012) held one NLP shared task in 2011, the first track of which was on coreference resolution. Participants were asked to mark the concept mentions (i.e. entity mentions), including pronouns, as coreferent or not. Data for this track were provided by Partners HealthCare, Beth Israel Deaconess Medical Center (MIMIC II Database), University of Pittsburgh, and the Mayo Clinic. According to the different settings, the task was further divided into tasks 1A, 1B and 1C. We participated in all three of them. The ODIE corpus (including the Mayo and Pittsburgh data sets) is used for task 1A and task 1B. Task 1B provides manually annotated mentions (referred to as concepts in the task description) while task 1A requires automatic mention detection. The ODIE corpus consists of 97 training documents. The I2B2/VA/Cincinnati corpus (including the Partners, Beth and Pittsburgh data sets) with 492 training documents is used for task 1C, where the true mentions are provided too. The entities of interest in the I2B2 data sets, covering persons, problems, treatments, tests, etc., are significantly different from the ones in standard coreference data sets (i.e. the corpora previously introduced in this chapter). All the texts are in semi-structured formats, describing the clinical treatments a patient receives as well as a rich set of his/her relevant information, e.g. the admission date, the date of birth, etc.
I2B2 Example (1): [Attending]1 : [Gayle M Whitener , M.D.]1 I2B2 Example (2): On hospital day 2 she experienced [atrial fibrillation]1 with HR in the 140s. We decided given her age that she would not be a good candidate for cardioversion for [her afib]1 nor would she be a good candidate for coumadin. I2B2 Example (3): [VULVAR CANCER]1 . A tumor was noted on her vulva which was biopsied and revealed [squamous cell carcinoma in situ]1 . Examples from the I2B2 corpora are shown above. It can be seen that, due to the organized structures, some of the coreference entities are obvious to resolve, e.g. [Attending]1 and [Gayle M Whitener , M.D.]1 in I2B2 Example (1). However, abbreviations (e.g. [atrial fibrillation]1 and [her afib]1 in I2B2 Example (2)) can be difficult, as can the variants of medical expressions (e.g. [VULVAR CANCER]1 and [squamous cell carcinoma in situ]1 in I2B2 Example (3)). 3.5 Summary In order to convey the improvements one achieves, researchers in the coreference resolution field conventionally conduct comparison experiments on several standard data sets. The documents selected for the corpora are conventionally news articles. Only recently, with the OntoNotes data, has the community started to include speech transcripts and other text types. In this chapter, the coreference data sets used by our system are introduced, including one additional medical corpus. The given examples show that the entity types and the annotation schemes vary between data sets, so that corpus-specific system engineering and feature design are necessary to some degree. For instance, features capturing knowledge about GPE entities are required for news articles, while for clinical reports medical-domain-specific knowledge is needed in order to solve the difficult cases. Nevertheless, linguistically driven features (e.g. binding constraints) can be applied universally.
Chapter 4 COPA: Coreference Partitioner In this thesis, we propose a novel coreference resolution model that represents documents as hypergraphs, upon which partitioning algorithms are applied to derive the coreference sets directly and simultaneously. Our system is named COPA, standing for Coreference Partitioner. The Hypergraph Representation. Unlike most of the previous work that resolves the pairwise relations independently (e.g. the two-step methods in Chapter 2), representing documents as graphs enables COPA to have a global view of the relations between all mentions. More specifically, we propose the hypergraph model for the representation, motivated by the high-dimension property of the coreference relation. Standard graph models have to collapse the multiple low-dimensional relations between mentions into single ones (i.e. the coreference relation) as edges, which leads to a loss of information before the inference phase. In contrast, a hypergraph is a graph in which (a) a hyperedge can connect more than two vertices, and (b) between two vertices there can exist multiple hyperedges. Therefore, our hypergraph model is able to maintain the original low-dimensional relations as overlapping hyperedges (i.e. (b)) until the final inference, and the model also easily represents sets of mentions (i.e. (a)), which suits well the set property of coreference resolution. The Partitioning Inference. Upon the hypergraph representation, COPA produces the coreference sets so that the mentions within the same set are closely connected and different sets are far apart from each other. In order to achieve such an optimization, we propose to apply the graph partitioning technique as the inference method for coreference resolution. Graph partitioning algorithms seek a cut upon the graph edges, so that the derived subgraphs are optimized with respect to a specific graph cut function.
In COPA, we adopt the Normalized Cut (NCut) function, which measures both the inner-set and the inter-set connectivities. The spectral clustering algorithm is employed to optimize the NCut value, so that the inner-set connections are as strong as possible while the inter-set ones are as weak as possible. With the graph partitioning algorithm applied, the optimized coreference sets can be derived simultaneously. The Chapter Organization. Section 4.1 illustrates how COPA works via examples. The mathematical background of both the hypergraph model and the spectral clustering algorithm is described in Section 4.2, which provides the notation used throughout the thesis. Section 4.3 describes in detail our proposed hypergraph partitioning model for coreference resolution. The important issues in applying the graph partitioning technique to practical uses are discussed in Section 4.4. As mentioned previously, the hypergraph is a generalization of the standard graph and is equipped with additional representational power. However, there exist standard graphs into which the hypergraph can be transformed (see Section 4.5). Upon the standard graphs, more graph-based algorithms can be directly applied. Therefore, such a transformation gives hypergraph-based models the freedom to choose the inference algorithm. Although COPA performs directly on the hypergraphs, future extensions of the inference method may benefit from such a graph transformation. 4.1 Introduction to COPA Figure 4.1 shows the modules of our proposed coreference resolution system. The COPA system includes the learning modules for collecting the hyperedge weights (i.e. the Hyperedge Learner in Section 4.3.2) and for predicting the number of entities k (i.e. the k model in Section 4.3.4).
The resolution modules of the COPA system construct the hypergraph models for the testing documents (using the Hypergraph Builder in Section 4.3.2) and partition them into sub-hypergraphs (using the Hypergraph Resolver in Section 4.3.3).

[Figure 4.1: COPA Model Illustration. The training set feeds the learning modules: the Hyperedge Learner produces the hyperedge weights and the k model produces the predicted k. The testing set feeds the resolution modules: the Hypergraph Builder produces hypergraphs, which the Hypergraph Resolver turns into coreference sets.]

COPA Example. To illustrate how COPA works, an example of a short document involving two entities, BARACK OBAMA and NICOLAS SARKOZY, is provided in Table 4.1. [US President Barack Obama] came to Toronto today. [Obama] discussed the financial crisis with [President Sarkozy]. [He] talked to [him] about the recent downturn of the European markets. [Barack Obama] will leave Toronto tomorrow. Table 4.1: COPA Example: Texts A hypergraph (Figure 4.2 a) is built for the example document based on three features. Two red (solid line) hyperedges denote the feature partial string match: {US President Barack Obama, Barack Obama, Obama} and {US President Barack Obama, President Sarkozy}. One green (dashed line) hyperedge denotes the feature pronoun match: {he, him}. Two blue (dashed-dotted line) hyperedges denote the feature subject|object match: {Obama, he} and {President Sarkozy, him}. Each of the hyperedges has an associated edge weight (examples of which can be seen in Section 4.3.2). On this initial representation, the spectral clustering technique is applied to find two partitions that have the strongest within-cluster connections and at the same time the weakest between-cluster relations. The cut found in this way is called Normalized Cut (abbreviated as NCut), which avoids the trivial partitions frequently output by the min-cut algorithm (see Section 4.2.2).
The two resulting sub-hypergraphs (Figure 4.2 b) correspond to the two resolved entities shown on both sides of the bold dashed line, i.e. the upper left sub-graph being BARACK OBAMA and the lower right NICOLAS SARKOZY. In real cases, multiple entities can be found within one document.

[Figure 4.2: COPA Example: Processing Illustration]

4.2 The Mathematical Background 4.2.1 The Hypergraph Representation A hypergraph is a graph in which hyperedges can connect more than two vertices, and between two vertices there can be multiple hyperedges. The Hypergraph Notation. Let HG = (V, E) be a hypergraph with a vertex set V and a hyperedge set E. The hyperedges can connect arbitrarily many vertices, such that E ⊆ {U | U ⊆ V, |U| > 1}. A weighted HG has a positive weight value w(e) associated with each hyperedge e. A vertex v is incident with a hyperedge e if it is connected with the edge, denoted as v ∈ e. For a vertex v ∈ V, the degree of v is the total weight of the hyperedges connecting to it, defined as

d(v) = \sum_{e \in E : v \in e} w(e)    (4.1)

For a hyperedge e ∈ E, its degree is the number of vertices connected by it, denoted as

\delta(e) = |e|    (4.2)

In order to be analyzed mathematically, the hypergraph representation is further transformed into matrices. The incidence matrix H of a HG is a |V| × |E| matrix with entries H(v, e) = 1 if v ∈ e and 0 otherwise. Dv and De denote the diagonal matrices with the vertex and hyperedge degrees respectively, and W the diagonal matrix with the corresponding hyperedge weights. After the transformation, the matrices contain full information about the original hypergraphs. The Matrix Computation Example. We use the hypergraph in Figure 4.3 as an example to illustrate the matrix computations introduced above. The numbers in brackets are the corresponding hyperedge weights.
[Figure 4.3: An Example for the Hypergraph Notation. Four vertices v1, ..., v4 are connected by the hyperedges e1 (0.4) = {v1, v2, v3}, e2 (0.1) = {v2, v3} and e3 (0.7) = {v3, v4}.]

The incidence matrix H of this hypergraph (rows v1, ..., v4; columns e1, e2, e3) and the hyperedge weight matrix W are

H = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad W = diag(0.4, 0.1, 0.7)

The degrees of the vertices are calculated as

d(v1) = w(e1) = 0.4
d(v2) = w(e1) + w(e2) = 0.5
d(v3) = w(e1) + w(e2) + w(e3) = 1.2
d(v4) = w(e3) = 0.7

so that the vertex degree matrix Dv and the hyperedge degree matrix De are

Dv = diag(0.4, 0.5, 1.2, 0.7), \quad De = diag(3, 2, 2)

4.2.2 Hypergraph Partitioning Grouping data into meaningful clusters is well known as cluster analysis or data clustering, whose goal is to discover the intrinsic structures of the data sets in focus (see Jain et al. (1999) for an overview). The data points to be clustered are usually in vector-based feature representations, the quality of which often influences the performance of the clustering algorithms directly. For tasks where the relations between data points are of greater interest, such as coreference resolution, explicit data vector representations can be avoided by resorting to graph models. Partitioning upon graphs is also referred to as graph clustering. Graph clustering is the task of dividing the vertices in a graph into sets (i.e. sub-graphs), such that vertices within a set are tightly connected to each other in some pre-defined sense, while vertices from different sets are loosely related. The edges to be removed to output the sub-graphs form a cut, and these edges are said to be crossing the cut. In a weighted graph, the value of a cut is defined as the sum of the weights of the edges crossing the cut. Graph clustering algorithms aim at finding a partition that optimizes the chosen cut value, so that the partition provides an optimal segmentation of the graph.
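The matrix computations of Section 4.2.1 can be reproduced mechanically. The following sketch is our own illustration (numpy assumed), deriving Dv and De from H and the edge weights for the Figure 4.3 hypergraph:

```python
import numpy as np

# Incidence matrix H of the Figure 4.3 hypergraph:
# rows v1..v4, columns e1..e3; H[v, e] = 1 iff v is incident with e.
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 1],
              [0, 0, 1]], dtype=float)

w = np.array([0.4, 0.1, 0.7])   # hyperedge weights w(e1), w(e2), w(e3)
W = np.diag(w)

# Vertex degrees d(v) = sum of the weights of incident hyperedges (Eq. 4.1):
Dv = np.diag(H @ w)             # diag(0.4, 0.5, 1.2, 0.7)
# Hyperedge degrees delta(e) = number of incident vertices (Eq. 4.2):
De = np.diag(H.sum(axis=0))     # diag(3, 2, 2)
```

The matrix products reproduce exactly the degree values worked out by hand above.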
Spectral clustering is a family of clustering algorithms that has been proven to work efficiently in applications and frequently outperforms standard clustering algorithms such as k-means. In COPA, we adopt a spectral clustering algorithm that can perform directly on hypergraph models. 4.2.2.1 Spectral Clustering Taking two-way partitioning as an example, we briefly introduce the intuitions behind spectral clustering in this section. The Standard Graph Cut. Let A, B denote two disjoint sub-graphs of the original graph G = (V, E) (V, E being the vertex set and edge set respectively), where A ∪ B = V and A ∩ B = ∅. The standard graph cut is defined as

cut(A, B) = \sum_{u \in A, v \in B} w(u, v)    (4.3)

Finding the minimum cut (min-cut) of a graph (i.e. min_{A,B} cut(A, B)) is the simplest and most direct way to solve the partitioning problem. The min-cut is well studied (see Stoer & Wagner (1997) for algorithms and discussions) and is used in applications too (Wu & Leahy, 1993). However, it has been noticed that the min-cut criterion favors cutting isolated vertices (Jain et al., 1999), which have few edges connecting them to others in the graph, so that the corresponding cut value is small. Most applications focus on detecting meaningful cluster structures (i.e. clusters consisting of multiple vertices), and are not interested in such trivial singletons output by min-cut algorithms. Normalized Cut. Shi & Malik (2000) propose a new measure of disassociation between sub-graphs, taking the inner-cluster density into consideration too. The new measure is called Normalized Cut (NCut):

NCut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}    (4.4)

where assoc(A, V) = \sum_{u \in A, t \in V} w(u, t) sums the weights of all edges between the vertices in sub-graph A and all vertices in the original graph. Therefore, by minimizing the NCut value, the resulting sub-graphs should be weakly connected to each other while being as dense as possible at the same time.
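As a concrete illustration of Eqs. 4.3 and 4.4 (a toy graph of our own, not from the thesis), the cut, assoc and NCut quantities can be computed directly:

```python
# A toy weighted undirected graph (hypothetical example),
# stored as an edge -> weight map.
w = {("a", "b"): 1.0, ("b", "c"): 0.2, ("c", "d"): 1.0}

def weight(u, v):
    return w.get((u, v), w.get((v, u), 0.0))

def cut(A, B):                       # Eq. 4.3
    return sum(weight(u, v) for u in A for v in B)

def assoc(A, V):                     # sum of edges between A and all of V
    return sum(weight(u, t) for u in A for t in V)

def ncut(A, B):                      # Eq. 4.4
    V = A | B
    return cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)

# The weak edge ("b","c") is the natural place to cut:
# cut({a,b},{c,d}) = 0.2, assoc = 2.2 on both sides.
```

Note how a singleton split such as {"a"} | {"b","c","d"} gives a larger NCut, since the cut weight 1.0 is divided by the small assoc of the singleton, which is exactly the behavior that penalizes trivial partitions.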
However, introducing the inner-cluster factor makes the minimization of NCut an NP-hard problem. Spectral clustering techniques (Chung, 1997; Shi & Malik, 2000; Ng et al., 2002) solve the relaxed version by partitioning the rows of a matrix (see the Laplacian matrix Lsym in Section 4.2.2.2) according to the components of the top few singular vectors of the matrix. They are simple to implement and reasonably fast, and have been shown to frequently outperform traditional clustering algorithms such as the k-means algorithm in applications (von Luxburg, 2007). 4.2.2.2 Spectral Clustering for Hypergraphs Zhou et al. (2007) generalize spectral clustering to operate directly on hypergraphs (in contrast to e.g. Agarwal et al. (2005), who partition a graph that approximates the hypergraph). In COPA, we adopt their hypergraph spectral clustering algorithm. Following the same intuition behind the standard normalized cut introduced in Section 4.2.2.1, hypergraph spectral clustering defines the NCut_{hg} of a k-way partitioning P_k as

NCut_{hg}(P_k) = \sum_{1 \le i \le k} \frac{vol \, \partial V_i}{vol \, V_i}    (4.5)

where V_i ∩ V_j = ∅ for all 1 ≤ i, j ≤ k and i ≠ j. The volume vol V_i of a vertex set V_i is defined by

vol \, V_i = \sum_{v \in V_i} d(v)    (4.6)

The hyperedge boundary ∂V_i is defined as the graph cut separating V_i from the other vertices in the graph, such that

\partial V_i = \{e \in E \mid e \cap V_i \neq \emptyset, \; e \cap V_i^c \neq \emptyset\}    (4.7)

where V_i^c denotes the complement of V_i. The volume of the hyperedge boundary is defined by

vol \, \partial V_i = \sum_{e \in \partial V_i} w(e) \frac{|e \cap V_i| \, |e \cap V_i^c|}{\delta(e)}    (4.8)

When a minimized NCut_{hg}(P_k) value is reached, the linkage between clusters is as weak as possible while it is as dense as possible within clusters. The minimization can be approached using a relaxation approach, which approximates discrete cluster memberships with continuous real numbers by solving the eigenproblem of the hypergraph Laplacian. The symmetric Laplacian (Lsym) (von Luxburg, 2007) is adopted.
L_{sym} = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}    (4.9)

Given a hypergraph HG, a set of matrices is generated. Dv and De denote the diagonal matrices containing the vertex and hyperedge degrees respectively. The |V| × |E| matrix H represents the HG with entries h(v, e) = 1 if v ∈ e and 0 otherwise. H^T is the transpose of H. W is the diagonal matrix with the edge weights. Let (λ_i, v_i), i = 1, ..., n, be the eigenvalues and the associated eigenvectors of L_{sym}, where 0 ≤ λ_1 ≤ ... ≤ λ_n and ||v_i|| = 1. The continuous solution to minimizing NCut_{hg}(P_k) is then provided by a new data representation X with lower dimensions compared with the original data dimensions:

X = (v_1, ..., v_k)    (4.10)

where X is called the k-th order spectral embedding of the graph. It has been shown that k is generally equal to the number of clusters (Ng et al., 2002). A standard data clustering algorithm, such as the k-means method (MacQueen, 1967), can afterward be applied to cluster the graph nodes in the new space. An illustration is given in Figure 4.4 to show how spectral clustering works on graph models.

[Figure 4.4: Illustration of Spectral Graph Clustering. A graph is transformed into the matrices H, Dv, De and W, from which the graph Laplacian L is computed; its eigendecomposition yields the spectral embedding of the vertices, upon which a data clustering algorithm produces the sub-graphs.]

4.3 COPA: Coreference Resolution via Hypergraph Partitioning Figure 4.5 illustrates the work flow of the COPA system. The system takes raw documents as input and outputs the expected coreference sets. The pre-processing components perform text parsing (e.g. POS tagging and syntactic parsing), mention identification, and mention-relevant information extraction (e.g. semantic class identification). With the identified mentions and the extracted features, COPA represents the input text as hypergraphs. At the end, COPA partitions the hypergraphs into coreference sets.
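Before turning to the system details, the spectral machinery of Section 4.2.2 can be checked numerically on the Figure 4.3 toy hypergraph. This is our own illustration (numpy assumed; the edge memberships e1 = {v1, v2, v3}, e2 = {v2, v3}, e3 = {v3, v4} are read off the incidence matrix): it computes the hypergraph NCut of Eqs. 4.5-4.8 for the bipartition {v1, v2} | {v3, v4} and builds the Laplacian of Eq. 4.9, whose smallest eigenvalue is 0 for a connected hypergraph:

```python
import numpy as np

# Toy hypergraph from Figure 4.3: hyperedges as vertex sets with weights.
edges = [({"v1", "v2", "v3"}, 0.4), ({"v2", "v3"}, 0.1), ({"v3", "v4"}, 0.7)]
V = {"v1", "v2", "v3", "v4"}

d = {v: sum(we for verts, we in edges if v in verts) for v in V}  # Eq. 4.1

def vol(S):                         # Eq. 4.6: volume of a vertex set
    return sum(d[v] for v in S)

def vol_boundary(Vi):               # Eq. 4.8, boundary edges per Eq. 4.7
    Vc = V - Vi
    return sum(we * len(verts & Vi) * len(verts & Vc) / len(verts)
               for verts, we in edges if verts & Vi and verts & Vc)

def ncut_hg(parts):                 # Eq. 4.5 for a k-way partitioning
    return sum(vol_boundary(Vi) / vol(Vi) for Vi in parts)

# Matrix form and the Laplacian of Eq. 4.9 (rows v1..v4, columns e1..e3).
H = np.array([[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1]], dtype=float)
w = np.array([0.4, 0.1, 0.7])
Dv_isqrt = np.diag(1.0 / np.sqrt(H @ w))
De_inv = np.diag(1.0 / H.sum(axis=0))
L_sym = np.eye(4) - Dv_isqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_isqrt

eigvals = np.linalg.eigvalsh(L_sym)  # ascending; eigvals[0] is (numerically) 0
```

The cheap cut {v1, v2} | {v3, v4} only crosses e1 and e2, giving vol∂V_i = 0.4 · 2 · 1/3 + 0.1 · 1 · 1/2 ≈ 0.317 on both sides.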
[Figure 4.5: Illustration of COPA System Flow. Raw text passes through the preprocessing components (parsing, mention detection with basic properties such as gender, number and semantic class, and the feature extractors) into the hypergraph representation; the hypergraph partitioner and the decoder then produce the coreference sets.]

4.3.1 Preprocessing Pipeline COPA is implemented on top of the BART toolkit (Versley et al., 2008). Documents are transformed into the MMAX2 format (Müller & Strube, 2006), which allows for easy visualization and (linguistic) debugging. Each document is stored in several XML files representing different layers of annotations. These annotations are created by a pipeline of preprocessing components. We use the Stanford MaxentTagger (Toutanova et al., 2003) for part-of-speech tagging, and the Stanford Named Entity Recognizer (Finkel et al., 2005) for annotating named entities. In order to derive syntactic information, we use the Charniak/Johnson reranking parser (Charniak & Johnson, 2005) combined with a constituent-to-dependency conversion tool [1]. We have implemented an in-house mention tagger, which makes use of the parsing output, the part-of-speech tags, as well as the chunks from the Yamcha Chunker (Kudoh & Matsumoto, 2000). The mention tagger automatically detects the mention boundaries, along with their syntactic heads. The separated-annotation-layer scheme and the flexible feature representation (see Chapter 5) enable COPA to incorporate knowledge easily. For instance, to enrich the system with medical domain information, we query the Unified Medical Language System (UMLS) [2] and the MetaMap software (Aronson, 2001) for each mention. All the top matched concept names returned by the MetaMap API as well as their corresponding definitions in the UMLS database are collected during preprocessing. 4.3.2 Constructing Hypergraphs for Documents The Hypergraph Builder component of COPA represents documents as undirected hypergraphs with basic relational features.
Hyperedges are derived from the adopted feature set.

[1] http://nlp.cs.lth.se/software/treebank_converter
[2] http://www.nlm.nih.gov/research/umls/

Each hyperedge corresponds to a feature instance modeling a specific relation of that feature type between two or more mentions. This leads to initially overlapping sets of mentions (as in Figure 4.2(1a)). Hyperedges are assigned weights that are calculated from the training data using the Hyperedge Learner component, as the percentage of the initial edges that are in fact coreferent. For instance, when calculating the edge weight for the HeadMatch feature, 126 corresponding binary relations are found, out of which 55 are coreferent. As a result, the edge weight for HeadMatch is 55/126 = 0.4365. Since only basic statistics are collected from the annotated data, COPA is not sensitive to the size of the training set (see Chapter 7). The weights of some of Soon et al. (2001)'s features learned from the ACE 2004 training data are given in Table 4.2.

Edge Name       Weight
Alias           0.777
StrMatch Pron   0.702
Appositive      0.568
StrMatch Npron  0.657
NonPron Pron    0.403

Table 4.2: Hyperedge Weight Examples for ACE 2004 Data

4.3.3 Hypergraph Resolver Raw documents are transformed into hypergraphs with mentions as vertices and features as edges. In contrast to the common practice in graph models, we incorporate rich relational information directly, without assuming a distance metric, and maintain all the relations until the final generation of the coreference sets. As introduced in Section 4.2.2.2, for a given hypergraph the hypergraph Laplacian Lsym is computed. After solving for the eigenvectors of Lsym, a new representation of the original vertices is formed. As illustrated in Figure 4.6, after forming a matrix using the eigenvectors as columns, the rows of the matrix are taken as the new vector representations of the vertices.
The vertices in the new spectral space can easily be partitioned, because by then they are well separated.

[Figure 4.6: Illustration of the Spectral Embedding. The eigenvectors vector1, ..., vectork form the columns of a matrix; row i of the matrix is the new vector representation of vertexi.]

The Hypergraph Resolver (i.e. the partitioner) aims to detect the intrinsic cluster structure of the hypergraph. It partitions every hypergraph into several sub-hypergraphs, each corresponding to one set of coreferent mentions (see e.g. the output in Figure 4.2(1b), which contains two sub-hypergraphs). Sections 4.3.3.1 and 4.3.3.2 describe our proposed partitioning algorithms, which form the core parts of the Hypergraph Resolver. 4.3.3.1 Recursive 2-way Partitioner We propose a recursive variant of spectral clustering, the recursive 2-way partitioning (R2 partitioner) (Cai & Strube, 2010a). This method does not need any information about the number of target sets (the number k of clusters). Instead, a stopping criterion α⋆ has to be provided, which is adjusted on development data. At each recursion step, the R2 partitioner bi-partitions the graph in focus, and the resulting partitions are kept only if the cut value is smaller than α⋆. The graph Laplacian is re-computed at each recursion based on the input graph. The algorithmic details are given in Algorithm 1. In the R2 partitioner, only one eigenvector V2 is used for the spectral embedding, and consequently the new vertex representation has only one dimension. Therefore, directly searching for the best splitting point in V2 is sufficient to partition the graph, with the vertices ordered according to their corresponding V2 values. For the recursion, all the sub-hypergraphs that can be partitioned with an NCut value smaller than α⋆ are partitioned further. When the NCut value is bigger than α⋆, this suggests a strong connectivity within the hypergraph in focus, so that it should not be partitioned any more.
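The recursive procedure just described, formalized in Algorithm 1, can also be sketched executably. The following is our own simplified illustration, not COPA's implementation: it uses a dense eigensolver, omits the singleton threshold β, and takes hyperedges with pre-assigned weights as input:

```python
import numpy as np

def _restrict(edges, V):
    """Hyperedges restricted to V; edges with fewer than 2 vertices left are dropped."""
    return [(verts & V, we) for verts, we in edges if len(verts & V) > 1]

def _ncut2(kept, A, B):
    """Two-way hypergraph NCut (Eqs. 4.5-4.8) of the bipartition (A, B)."""
    deg = {v: sum(we for verts, we in kept if v in verts) for v in A | B}
    volA, volB = sum(deg[v] for v in A), sum(deg[v] for v in B)
    bd = sum(we * len(verts & A) * len(verts & B) / len(verts)
             for verts, we in kept if verts & A and verts & B)
    return bd / volA + bd / volB if volA > 0 and volB > 0 else float("inf")

def r2_partition(edges, V, alpha):
    """Recursive 2-way partitioning sketch: split along the second smallest
    eigenvector of L_sym while the best NCut stays below alpha."""
    kept = _restrict(edges, set(V))
    order = sorted(V)
    if len(order) < 2 or not kept:
        return [set(V)]
    idx = {v: i for i, v in enumerate(order)}
    H = np.zeros((len(order), len(kept)))
    for j, (verts, _) in enumerate(kept):
        for v in verts:
            H[idx[v], j] = 1.0
    w = np.array([we for _, we in kept])
    d = H @ w
    if np.any(d == 0):              # isolated vertex: stop splitting (sketch guard)
        return [set(V)]
    Dv_isqrt = np.diag(1.0 / np.sqrt(d))
    L = (np.eye(len(order)) -
         Dv_isqrt @ H @ np.diag(w) @ np.diag(1.0 / H.sum(axis=0)) @ H.T @ Dv_isqrt)
    _, vecs = np.linalg.eigh(L)
    by_v2 = [order[i] for i in np.argsort(vecs[:, 1])]   # order by 2nd eigenvector
    best, i_best = min((_ncut2(kept, set(by_v2[:i]), set(by_v2[i:])), i)
                       for i in range(1, len(by_v2)))
    if best < alpha:                # recurse on both sub-hypergraphs
        return (r2_partition(edges, set(by_v2[:i_best]), alpha) +
                r2_partition(edges, set(by_v2[i_best:]), alpha))
    return [set(V)]
```

On a toy hypergraph with two densely connected vertex groups joined by one weak hyperedge, the sketch recovers the two groups and then stops, since any further split has NCut above the threshold.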
Algorithm 1 R2 partitioner
Note: L_{sym} = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
Note: NCut(S) := vol \, \partial S \, (1 / vol \, S + 1 / vol \, S^c)
input: target hypergraph HG, predefined α⋆
1: Given a HG, construct its Dv, H, W and De
2: Compute L for the HG
3: Solve L for the second smallest eigenvector V2
4: for each splitting point in V2 do
5:   calculate NCut_i
6: end for
7: Choose the splitting point with min_i(NCut_i)
8: Generate two sub-HGs
9: if min_i(NCut_i) < α⋆ then
10:   for each sub-HG do
11:     Bi-partition the sub-HG with the R2 partitioner
12:   end for
13: else
14:   Output the current sub-HG
15: end if
output: partitioned HG

Since the mention detectors usually aim at high recall, there are many system mentions which do not match true mentions. Including system mentions in the graphs results in loosely connected outliers, which COPA is expected to split off as singleton clusters. Since using the Normalized Cut does not generate singleton clusters, a heuristic singleton detection strategy is proposed in COPA. We apply a threshold β to each node in the graph: if a node's degree is below the threshold, the node is removed. 4.3.3.2 Flat k-way Partitioner The R2 partitioner generates an optimized bi-partitioning at each recursion step. Due to its hierarchical nature, however, it is not guaranteed that the final output clusters are also globally optimized, and it does not have any intrinsic means to include global constraints to guide the clustering globally. In order to overcome these problems, we propose a flat variant of the partitioner, the flatK partitioner (see Algorithm 2). k clusters are output simultaneously by making use of the k smallest eigenvectors of the hypergraph Laplacian Lsym (as in Figure 4.6).
Algorithm 2 flatK partitioner
Note: L_{sym} = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
Note: NCut(P_k) = \sum_{1 \le i \le k} vol \, \partial V_i / vol \, V_i
input: target hypergraph HG, number of clusters k
1: Given a HG, construct its Dv, H, W and De
2: Compute Lsym for the HG
3: Solve Lsym for the k smallest eigenvectors v_1, ..., v_k
4: Construct the spectral embedding X = (v_1, ..., v_k)
5: Apply k-means to the points (x_i), i = 1, ..., n, to produce k clusters C_1, ..., C_k
output: partitioned HG with clusters C_1, ..., C_k

To assist the flatK partitioner, we propose a preference-based k model to predict the number of entities within documents. The details of the k model are introduced in Section 4.3.4. 4.3.4 k Model: Predicting the Number of Entities Most clustering methods for multi-cluster tasks assume the number of clusters k to be known beforehand. However, if k is not known, choosing it turns out to be a general problem for clustering algorithms, especially when partitioning noisy data. Several methods to estimate k have been proposed (for an overview see Milligan & Cooper (1985) and von Luxburg (2010)), which focus on detecting the intrinsic cluster structures from the data, where clustering is viewed as an unsupervised task. The methods analyzing the cluster structures, such as the gap statistic (Tibshirani et al., 2001) and the stability measurements (Ben-David et al., 2006), require relatively big graphs to support valid statistics. For instance, when there are fewer than 100 vertices in a graph to be partitioned, these analysis methods are not able to work stably. Since documents vary largely in their numbers of mentions, COPA seeks methods that are not sensitive to the graph sizes when predicting the number of entities. In this thesis, we propose a supervised k model to decide on a k, the number of entities, for each hypergraph.
The objective of our k model is to find the best k that optimizes the end coreference performance. The best k does not necessarily correspond to the number of true entities (the true k) when spurious system mentions are included in the hypergraphs. We address the k prediction problem with preference modeling, where two partitionings with two different k's compete with each other and the better partitioning is expected to generate a better coreference performance (e.g. a higher F-score). By applying preference modeling, the differences between partitionings can be captured, which is less sensitive to noise than methods solely analyzing the graph structures. In order to avoid confusion, the terms Partitioning, Partition and Cluster are clarified via the following example.

• mentions: m1, m2, m3, m4, m5
• a partitioning P2 (k = 2): {m1, m2}, {m3, m4, m5}
• a partitioning P3 (k = 3): {m1, m2}, {m3, m4}, {m5}
• an example cluster (partition): {m1, m2}

Our proposed k model is outlined in Algorithm 3. Given a set of possible k's for a hypergraph, a preference model is trained to find the best k with respect to the application F-score. The details of the model are described in the following subsections.
Algorithm 3 k model outline
Training:
Construct hypergraphs for the documents
for each hypergraph do
  Estimate the k range, [k1, kx]
  Decide on OneCluster
  for ki ∈ [k2, kx] do
    Generate a partitioning Pki
  end for
  Find the best partitioning Pkbest
  Pair the {Pkbest, Pki}, kbest < ki, as positive training instances
  Pair the {Pki, Pkbest}, ki < kbest, as negative training instances
end for
Build the k model from the training instances
Testing:
Construct hypergraphs for the documents
for each hypergraph do
  Estimate the k range, [k1, kx]
  Decide on OneCluster
  for ki ∈ [k2, kx] do
    Generate a partitioning Pki
  end for
  Pair each {Pki, Pkj}, ki < kj, as testing instances
  Use the learned k model to annotate the instances
  Choose the best Pk using the round-robin scheme
  Output Pk
end for

Training. Before the training, a range of possible k's for each hypergraph is estimated based on the string properties of the mentions. The lower bound is set to 1, while the upper bound is the number of different mention strings. Determining the possible k's can also be approached by including more linguistic knowledge, for instance by setting the lower bound to the number of different proper names, which most likely refer to different entities. Since determining whether a graph should be partitioned at all (a binary decision) is easier than deciding on the best partitioning (a preference decision), the partitioning with k = 1, denoted as OneCluster, is decided separately by simply looking at the partitioning with k = 2, as opposed to the other situations in which both partitionings need to be considered. A graph whose 2-way partitioning generates a high NCut value (greater than 0.1 in our experiments) will prefer OneCluster, and all the others are passed to the preference model. We partition each hypergraph built from the training data with a set of possible k's.
The resulting partitioning with ki is denoted as Pki. The k model aims to find arg max_ki F(Pki), where F(Pki) denotes the coreference F-score when the partitioning Pki is taken. Two partitionings are paired as one training instance, {Pki, Pkj} with ki < kj. An instance is labeled positive when F(Pki) > F(Pkj), and negative otherwise. This way, the k model casts the original problem of picking the best k into a binary classification task in which the preference within each pair of k's is learned.

Testing. For testing data, all pairs of partitionings {Pki, Pkj} with ki < kj are selected as instances. The learned k model assigns each instance a positive or negative label, with positive indicating a preference for Pki and negative for Pkj. To find the top k from the pairwise preference decisions, a round-robin strategy is adopted. We assign each partitioning Pki a confidence value conf(Pki) = pos(Pki) − neg(Pki), where pos(Pki) is the number of times Pki is preferred and neg(Pki) the number of times it is not. The top k then is simply the one with the highest confidence value.

k Model Features. Only a few features are currently used for the k model proposed in this section. For an instance {Pi, Pj}, the features are: (1) MaxNCut1: the biggest NCut value of partitioning Pi; (2) MaxNCut2: the biggest NCut value of partitioning Pj; (3) MaxNCutDiff: the difference between the biggest NCut values of partitioning Pi and partitioning Pj; (4) kDiff: the difference between the k values used for partitioning Pi and partitioning Pj; (5) ConNumDiff: the difference between the numbers of constraints violated in partitioning Pi and partitioning Pj; the constraints used are simply the negative features used in COPA (see Section 5.2). For the k model learner, a decision tree classifier (J48, provided by Witten & Frank (2005)) is used.
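The round-robin selection described above can be made concrete. The following is a minimal Python sketch, where `prefers_first` is a hypothetical stand-in for the learned preference classifier (the actual system uses a J48 decision tree over the NCut-based features):

```python
from itertools import combinations

# A minimal sketch of the round-robin scheme: `prefers_first` stands in for
# the learned k model and judges each instance {P_ki, P_kj} with ki < kj.
# conf(P_ki) = pos(P_ki) - neg(P_ki); the k with the highest value wins.
def round_robin_best_k(ks, prefers_first):
    conf = {k: 0 for k in ks}
    for ki, kj in combinations(sorted(ks), 2):
        if prefers_first(ki, kj):   # positive label: P_ki is preferred
            conf[ki] += 1
            conf[kj] -= 1
        else:                       # negative label: P_kj is preferred
            conf[kj] += 1
            conf[ki] -= 1
    return max(conf, key=conf.get)
```

For instance, a classifier that always prefers the partitioning with k = 3 makes `round_robin_best_k([2, 3, 4], ...)` return 3.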
4.3.5 Complexity of COPA

In COPA, the hyperedge weights are assigned using simple descriptive statistics, so the time the Hypergraph Resolver needs for building the hypergraph model, transforming the hypergraph into matrices and computing the graph Laplacian matrix is not substantial. For eigensolving, we use an open source library provided by the Colt project3, which implements a Householder-QL algorithm for the eigenvalue decomposition. When applied to the symmetric graph Laplacian, the complexity of the eigensolving is O(n^3), where n is the number of mentions in the hypergraph. For the R2 partitioner, only the top two eigenvectors are required at each recursion, so the decomposition can easily be improved by the Lanczos algorithm, which gives O(nm) as the computational cost, with m being the number of edges of an equivalent (different) graph of the hypergraph. The equivalent graph here is described implicitly by the hypergraph Laplacian. To sum up, the worst-case computational complexity of our resolving procedure is O(n^3), and in the hierarchical manner it is only O(nm). Spectral clustering only becomes problematic when the graph has millions of vertices. For documents, in which at most hundreds of mentions appear, it is not an issue at all.

4.4 Implementation Issues

4.4.1 The Post-processing for Pronoun Anaphors

In a hypergraph built by COPA, pronouns are connected to all other non-pronouns which do not violate any agreement relations, such as gender and number agreement. In an end-to-end setting, many singleton entities are included in the hypergraphs via their connections to pronouns. As mentioned before, a spectral clustering algorithm is unable to separate singletons during partitioning, so we may derive clusters mixed with singleton entities. In order to address this issue, we propose a post-processing strategy.
For a pronoun anaphor, only its strongest connection within its assigned cluster is kept and all other links are removed. Figure 4.7 gives an example of the post-processing for pronouns; the graph is shown in standard graph form for the sake of clarity. The dashed (red) circles indicate the cluster boundaries.

[Figure 4.7: Illustration of the Post-processing for Pronouns. Before post-processing, {he} is linked to {a1} (0.7), {a2} (0.6), {b1} (0.2) and {c1} (0.1), and {a1} is linked to {a2} (0.8); after post-processing, only the links {he}–{a1} and {a1}–{a2} remain.]

3 http://acs.lbl.gov/~hoschek/colt/

Consider the generated cluster on the left side of Figure 4.7, which contains the mentions {a1}, {a2}, {he}, {b1}, {c1}, with links between {he} and all the other mentions and one link between {a1} and {a2}. Assuming the strongest connection to {he} is {a1}, the proposed post-processing removes {b1} and {c1} while leaving {a1}, {a2}, {he} in the final cluster. This post-processing is driven by the intuition that the connections between pronouns and non-pronouns are not confident enough to support transitive closures. For instance, the links between {he} and {b1}, {c1} are not confident enough to enforce a connection between {b1} and {c1}. We only maintain one link per pronoun after the partitioning procedure, e.g. the one between {he} and {a1}, but keep the other relations transitive, so that {a2} is also in the final cluster.

4.4.2 Partitioning Issues

Graph Components. The number of zero eigenvalues corresponds to the number of components in the graph (von Luxburg, 2007). A graph component is a disconnected sub-graph, and in COPA multiple components can occur when only limited features are used, so that not all mentions from the document are connected (directly or via a path). Different components can be processed separately during the partitioning process, for the sake of reducing complexity.
Only for the connected graphs are the top k eigenvectors taken as described for the spectral embedding.

Eigenvalue Smoothing. It is worth noting that, depending on the implementation details of the eigendecomposition component, the solved eigenvalues can be of a double or a float type. It is necessary to smooth the eigenvalues, for instance by applying an epsilon variable (i.e. a small number) to allow for small fluctuations in the eigenvalues.

The k-means Initialization. It is well known that the k-means algorithm is sensitive to the initialization of the cluster centers. Since there is a lot of noise involved in our hypergraphs, the decision on the initial cluster centers becomes even more crucial. Accidentally choosing noisy mentions as initial centers can generate unexpected clusters. In COPA, we address this issue by restricting the initial cluster centers to proper names, which are more likely to lead entities. This modification introduces application-specific knowledge into the k-means algorithm to guide the initialization, and it can easily be improved by estimating the entity centers using more information.

4.5 Hypergraphs to Standard Graphs

The hypergraph is a generalization of the standard graph. It is possible to find graphs which approximate hypergraphs and can thus be processed with standard graph-based algorithms. In order to preserve the representational power of the hypergraph, in COPA we avoid the transformation step by applying the partitioning algorithm directly to the hypergraph models. However, in this section, we introduce graphs equivalent to the hypergraph, which serve as alternatives when hypergraph-based algorithms are not available or when one wants to explore more inference models upon the hypergraph representation. The two most commonly used ones are Star Expansion and Clique Expansion (Agarwal et al., 2005).
Star Expansion (in Section 4.5.1) introduces a new star vertex for each hyperedge, which connects all the vertices covered by the original hyperedge. As a result, a bi-partite graph is generated, where the edge weights can be assigned by distributing the corresponding hyperedge weights evenly. Clique Expansion (in Section 4.5.2) expands each hyperedge into a clique, and the similarity between two vertices is proportional to the summed weights of their common hyperedges.

4.5.1 The Star Expansion

Star Expansion transforms the hypergraph into a bi-partite graph, in which additional starred vertices correspond to the original hyperedges. All the vertices belonging to a hyperedge are therefore connected to the new starred vertex in the bi-partite graph. The weight of each of the multiple edges generated from one hyperedge e is normalized by the degree of e:

w′(u, e) = w(e) / δ(e)    (4.11)

where w(e) is the original hyperedge weight and u is a vertex connected to e.

4.5.2 The Clique Expansion

Clique Expansion transforms each hyperedge into several pairwise edges (Zien et al., 1999), so that the vertices in a hyperedge form a clique. The new edge weight between vertices u and v is

w′(u, v) = µ Σ_e h(u, e) h(v, e) w(e)    (4.12)

where w(e) is the original hyperedge weight, h(u, e) indicates whether vertex u belongs to hyperedge e, and µ is a fixed scalar.

4.6 Summary

Our Contributions. In this chapter, we introduce our proposed coreference resolution model, COPA, standing for Coreference Partitioner. Our contributions are two-fold: (1) representing the coreference relation with the hypergraph model, and (2) inferring coreference sets using hypergraph partitioning algorithms. COPA represents documents in the hypergraph model, so that the multiple low-dimensional relations between mentions are easily expressed as hyperedges without the necessity of combining them before the final decision.
Upon the constructed hypergraphs, the spectral clustering technique is applied to derive coreference sets directly and simultaneously. By adopting spectral clustering algorithms, it is ensured that the mentions within a coreference set are closely related, while the ones from different sets are far apart from each other.

Spectral Hypergraph Partitioning for Coreference Resolution. The proposed hypergraph partitioning model looks at the entire graph to make coreference decisions. Not only the context preceding a mention but also the one following it is evaluated to assign the mention to one of the clusters. We propose two partitioning algorithms for COPA: the R2 partitioner performs hierarchical clustering, and the flatK partitioner partitions only once. To assist the flatK partitioner, we propose a novel k model to predict the number of entities within documents.

End-to-end Coreference Resolution. We address the coreference resolution problem in an end-to-end system setup, where noise is unavoidable and the mentions to be resolved may not align with the true mention set. Implementing coreference models in end-to-end systems is very important, since it has been observed that improved performance on true mentions does not necessarily translate into improved performance on system mentions (Ng, 2008). The implementation issues of applying clustering techniques to coreference resolution are addressed in this chapter too.

Overall, the hypergraph representation of COPA avoids the expensive training for the feature combination, and its lightweight partitioning-based inference does not require complex probabilistic estimations. COPA's partitioning-based strategy can be taken as a general preference model, where the preference of entities for one mention depends on information about all other mentions. Therefore, we believe that COPA is a coreference model preferable not only to the previous local models but also to complicated graphical methods.
Chapter 5 COPA Features

In this chapter, we introduce the feature representation scheme encoded in COPA. Our features aim to capture the linguistic phenomena of the coreference relation, as well as data-specific statistics. COPA has been applied to various types of data sets, ranging from news articles (e.g. the MUC, ACE and OntoNotes data sets in Chapter 3) to clinical reports (e.g. the I2B2 corpus); the feature sets it implements therefore cover both general and domain-specific information.

5.1 The Feature Categorization in the Hypergraph

Positive relational features can be incorporated into the hypergraph model of COPA as types of hyperedges (e.g. in Figure 4.2 (b) the two hyperedges marked by "– ··" are of the same type, derived from the feature subject/object match), so that a realized hyperedge is an instance of the corresponding type. All hyperedge instances that are derived from the same type have the same weight, but they may get re-weighted by the distance feature (Section 5.5). Negative relations can be treated either as filters to be applied in the graph construction phase (e.g. the negative features described in Section 5.2) or as constraints to be applied during the inference procedure (see Chapter 8). In this chapter, we only focus on the features adopted for constructing the hypergraphs, which fall into three categories:

Negative Features: to prevent hyperedges between mentions;
Positive Features: to generate relatively strong hyperedges between mentions;
Weak Features: to add hyperedges to an existing hypergraph without introducing new mentions into the hypergraph.

Negative features here act as global filters, preventing incompatible mentions from being connected in a graph. For instance, although [Mr. Clinton] and [Mrs. Clinton] match via the substring match (positive) feature, there is no hyperedge built between them due to their incompatible gender.
COPA differentiates between positive and weak features, because spectral clustering algorithms have no intrinsic means of handling singleton clusters. Recall that the spectral clustering technique targets optimizing the normalized cut (NCut) value, which has the intra-cluster connectivity factor as its denominator. This makes it impossible to output singleton clusters. In order to avoid too much noise (e.g. singleton mentions) in our hypergraph model, we construct the graphs in a conservative manner. While weak relations contribute to the graph structure, they tend to draw too many singleton mentions into the graph. So we construct hypergraphs solely out of the positive features, and only add weak relations into the graph afterward, without introducing new vertices at all. In the following sections we describe the features implemented in COPA.

5.2 Negative Features

Negative features describe pairwise relations between mentions that are most likely not coreferent. They have conventionally been used in combination with other features (Soon et al., 2001), and were implemented as weak positive features in an early version of COPA (Cai & Strube, 2010a). Now we apply negative features as global filters in the graph construction phase. When mentions are detected to be in a negative relation, it is ensured that no edges are built between them in the hypergraphs.

(1) N Gender, (2) N Number: Two mentions do not agree in gender or number. For instance, no edge is allowed between the mentions [Hillary Clinton] and [he] due to their incompatible gender. The mention [Mr. Sisulu] has the negative relation of incompatible number with the mention [boys].

(3) N SemanticClass: Two mentions do not agree in semantic class. For news articles (e.g. the MUC, ACE and OntoNotes data sets), only Object, Date, Person and the other top categories derived from WordNet (Fellbaum, 1998) are used. For clinical reports (e.g.
the I2B2 corpus), this feature is replaced by feature (7), which identifies the medical type of each mention.

(4) N Mod: Two mentions have the same syntactic head, and the anaphor has a modifier that does not occur in the antecedent or that contradicts the modifiers of the antecedent. For instance, a negative relation is built between the mentions [expedited proceedings] and [the investigation proceedings], as the modifiers of the two mentions convey different information. However, simply requiring the modifiers to be the same cannot handle situations in which the modifiers differ without contradicting each other (e.g. [the case in question] and [the case against the accused]). The current version of COPA does not take care of these difficult cases.

(5) N DSPrn: Two first person pronouns (i.e. [I], [me], [my], etc.) in direct speech which are assigned to different speakers should not be linked together. The speaker information is given in the OntoNotes data set.

(6) N ContraSubjObj: Two mentions are in the subject and object positions of the same verb, and the anaphor is not a possessive pronoun. For instance, in [John] talks to [him], [John] should not be coreferent with [him].

(7) N i2b2Type: Two mentions have different mention types (e.g. treatment, problem, etc., as defined in the I2B2 data set). For instance, [Ischemic bowel] has an incompatible I2B2 type with [Thoracentesis], as a clinical problem mention cannot be coreferent with a medical treatment mention.

(8) N i2b2Quant: Two mentions are modified by different quantities. For instance, the mention [heart rate] in the text fragment "heart rate 116" and the mention [a heart rate] in the text fragment "a heart rate of 128" cannot be coreferent.

(9) N i2b2ConName: Two mentions have the same syntactic head, but their matched (if any) concept names in MetaMap are different. For example, the mention [back pain] and the mention [chest pain] are in this negative relation.
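The filtering role of the negative features can be sketched as follows; this is a minimal Python sketch with hypothetical predicate names (the actual features operate on full linguistic annotations):

```python
from itertools import combinations

# A minimal sketch of negative features as global filters during graph
# construction: a candidate hyperedge is dropped as soon as any pair of its
# mentions stands in a negative relation.
def allow_hyperedge(mentions, negative_checks):
    """negative_checks: list of functions (m1, m2) -> True if incompatible."""
    return not any(check(m1, m2)
                   for m1, m2 in combinations(mentions, 2)
                   for check in negative_checks)

# Hypothetical N_Gender check over toy mention dictionaries.
n_gender = lambda a, b: a["gender"] != b["gender"]
```

With this sketch, a hyperedge between a feminine and a masculine mention is rejected, while one between two feminine mentions is allowed.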
5.3 Positive Features

The majority of the well-studied coreference features (e.g. Stoyanov et al. (2009)) are positive coreference indicators. In our system, the mentions that participate in positive relations are included in the hypergraphs as vertices.

(10) StrMatch Npron & (11) StrMatch Pron: After discarding stop words, if the strings of mentions completely match and the mentions are not pronouns, they are put into hyperedges of the StrMatch Npron type. When the matched mentions are pronouns, they are connected with a StrMatch Pron hyperedge. We differentiate the two types of string matching, as pronouns convey much less information than non-pronouns do.

(12) Alias: After discarding stop words, if mentions are aliases of each other (i.e. proper names with partial match, full names and acronyms of organizations, etc.). For instance, [Australia's Qintex] and [Qintex Australia Ltd.] are aliases of each other.

(13) HeadMatch: If the syntactic heads of mentions match, such as [the U.S. rules] and [the rules].

(14) Nprn Prn: If the antecedent is not a pronoun and the anaphor is a pronoun. The feature is designed with the intuition that pronouns are used to refer to existing entities. Although this feature is not highly weighted, it is crucial for integrating pronouns into the hypergraph.

(15) Speaker12Prn: If the speaker of a second person pronoun is talking to the speaker of a first person pronoun, the two pronouns are connected with a hyperedge. Hyperedges of this type contain only first and second person pronouns. This feature is useful for the OntoNotes data set, where speaker information (e.g. the speaker names and the speech boundaries) is explicitly provided.

(16) DSPrn: If one of the mentions is the subject of a speech verb, and the other mentions are first person pronouns within the corresponding direct speech. Direct speech boundaries are detected simply by pairing double quotes.
(17) ReflexivePrn: If the anaphor is a reflexive pronoun, and the antecedent is the subject of the same clause. Dependency trees are utilized to conduct the necessary grammatical analysis. In the sentence "[today's generation of Taiwanese] save our island's last remaining forest of these giant trees, for [themselves] and later generations?", the marked mentions are linked via this feature.

(18) PossPrn: If the anaphor is a possessive pronoun, and the antecedent is the subject of the same sub-clause. In the sentence "How would you feel if [your child] learned from [his] classmates to cough up phlegm all over the place?", the marked mentions are in this relation.

(19) GPEIsA: If the antecedent is a Named Entity of the GPE entity type (i.e. one of the ACE entity types (NIST, 2004)), and the anaphor is a definite expression of the same type. For instance, [Iraq] is linked with [the nation].

(20) OrgIsA: If the antecedent is a Named Entity of the Organization entity type, and the anaphor is a definite expression of the same type. For instance, [Google Inc.] is linked with [the company]. Features (19) and (20) capture the IsA relations for specific types of Named Entities, and are designed for news article data sets.

(21) Appositive: Two mentions are in an appositive structure, such as the mention [Laurence Tribe, Gore's attorney] and its embedded mention [Gore's attorney]. Depending on the annotation scheme of the adopted data set, this relation may or may not be a coreference indicator.

(22) Concept: We disambiguate each Named Entity to Wikipedia entries (Fahrni et al., 2012), and mentions are connected if they are linked to the same entries. For instance, [South Korea] and [ROK] are disambiguated to the same entry, so that they are connected by this feature.

(23) i2b2PisA: A pseudo IsA relation. One mention appears in another mention's definition extracted from the UMLS thesaurus.
For instance, the mentions [Paracentesis] and [the tap] are captured by this feature, since the top ranked definition of [the tap] is "Paracentesis".

(24) i2b2Abbr: One mention is in abbreviation format (i.e. with all letters capitalized), and the other mention matches (exactly or partially) its concept name extracted by MetaMap. For instance, the mention [EGD] is identified to be the abbreviation of the mention [esophagogastroduodenoscopy].

(25) i2b2CatMatch: There is always structured information in the clinical data sets (e.g. I2B2), as shown in the text "[Attending]: [Erm K. Neidwierst , M.D.]". The mentions are linked when they appear in the same category slot of the report and both are persons.

(26) i2b2PrnPreference: This is a data-specific feature, describing the preferences for certain types of pronouns. For example, first person singular pronouns in the data set mostly refer to the physician who writes the clinical report.

5.4 Weak Features

Weak features are weak coreference indicators. Using them as positive features would introduce too much noise into the graph (i.e. a graph with too many singletons). We apply weak features only to mentions already integrated in the graph, so that the weak information provides it with a richer structure.

(27) W VerbAgree: If the anaphor is a pronoun, and the antecedent appears as a subject or an object in a previous sentence. The verbs of both mentions should be the same. For instance, the sentence "Born in Homei, Changhua in 1928, [Hsu] studied the violin in Japan as a youth" is followed by the sentence "Later, [he] studied in France ...", so that the two marked mentions share this W VerbAgree relation.

(28) W Subject: If mentions are subjects.

(29) W Synonym: If mentions are synonymous as indicated by WordNet, such as [the town] and [the village].

(30) W i2b2SubStr: One mention is a substring of the other.
For instance, the mention [Cisplatin] is a substring of the mention [Cisplatin chemotherapy].

5.5 The Distance Feature

Graph models cannot deal well with positional information, such as the distance between mentions or the sequential ordering of mentions in a document. Therefore the hypergraph model of COPA has no obvious means of encoding distance information. However, the distance between mentions plays an important role in coreference resolution, especially for resolving pronouns. We do not encode distance as a binary feature, as this would introduce too many hyperedges into the graph. Instead, we use distance to re-weight hyperedges of degree 2, which are supposed to be sensitive to positional information. We experiment with two types of distance weights: (31) sentence distance, as used in Soon et al. (2001)'s feature set, and (32) compatible mentions distance, as introduced by Bengtson & Roth (2008).

5.6 The Learned Hyperedge Weights

Table 5.1 and Table 5.2 provide example feature weights (i.e. hyperedge weights) learned from the OntoNotes training set, in order to indicate the hypergraph structures we derived. I2B2-relevant feature weights are shown in Table 5.3. In Table 5.4, the statistics for the negative features suggest how strongly these features contribute to non-coreference decisions. The OntoNotes data does not annotate appositive relations as coreference relations, so that Feature (21) gets a surprisingly small weight.
Table 5.1: Positive Feature Weights on OntoNotes Data

  Positive Features        Weights
  (10) StrMatch Npron      0.766
  (11) StrMatch Pron       0.620
  (12) Alias               0.733
  (13) HeadMatch           0.614
  (14) Nprn Prn            0.176
  (15) Speaker12Prn        0.552
  (16) DSPrn               0.9
  (17) ReflexivePrn        0.567
  (18) PossPrn             0.75
  (19) GPEIsA              0.308
  (20) OrgIsA              0.111
  (21) Appositive          0.001
  (22) Concept             0.494

Table 5.2: Weak Feature Weights on OntoNotes Data

  Weak Features            Weights
  (27) W VerbAgree         0.342
  (28) W Subject           0.4425
  (29) W Synonym           0.429

Table 5.3: Feature Weights on I2B2 Data

  I2B2 Features            Weights
  (23) i2b2PisA            0.348
  (24) i2b2Abbr            0.423
  (25) i2b2CatMatch        0.935
  (26) i2b2PrnPreference   0.967
  (30) W i2b2SubStr        0.594

Table 5.4: Negative Feature Statistics on OntoNotes Data

  Negative Features        Statistics
  (1) N Gender             -0.993
  (2) N Number             -0.996
  (3) N SemanticClass      -0.993
  (4) N Mod                -0.853
  (5) N DSPrn              -0.762
  (6) N ContraSubjObj      -0.997
  (7) N i2b2Type           -0.999
  (8) N i2b2Quant          -0.999
  (9) N i2b2ConName        -0.816

5.7 Summary

In COPA, features are expressed as hyperedges. Since the combination of features is done implicitly during the inference phase, the features in the graph construction phase are simply included in an overlapping manner. Therefore it is straightforward and costs little effort to include more features in COPA. We categorize the features into three types, which not only indicate the linguistic functions of the different features but also provide a systematic way for feature development in COPA. Negative relations are interpreted as global filters during the graph construction in this chapter, and they are explored further in Chapter 8 as global constraints which are applied during the inference phase. Coreference decisions depend on preferences, where negative information in certain cases contributes as much as the conventional positive indicators.
Chapter 6 Evaluation Metrics for End-to-end Coreference Resolution

Evaluating clustering results is one of the most important issues in cluster analysis, and is referred to as clustering validation (Halkidi et al., 2001). When the ground truth is provided, the evaluation methods aim to measure how similar the clustering results are to the gold annotations. For instance, the evaluation metrics for coreference resolution measure the output coreference sets (i.e. clusters) against the ground truth sets provided by domain experts. Since there may be different numbers of output clusters (e.g. coreference sets) compared with the gold annotations, such an evaluation task is different from evaluating classification problems, which directly assess the label assignments of instances. The evaluation becomes even more complicated when the numbers of output instances (e.g. mentions) also differ from the gold ones. In this chapter, we focus on the end-to-end system setting for the coreference resolution task, and propose evaluation algorithms to assess noisy coreference output. Early research on coreference resolution has worked in the true mention setting, where the mentions participating in coreference sets are given along with their exact boundaries. The commonly used coreference resolution evaluation metrics are designed for such systems, but evaluate the output coreference sets from different perspectives. For instance, the MUC score (Vilain et al., 1995) in Section 6.1.1 operates on the relations between mentions, the B3 algorithm (Bagga & Baldwin, 1998) in Section 6.1.2 operates on the relations between mentions and sets, and the CEAF algorithm (Luo, 2005) in Section 6.1.3 captures the relations between sets. However, it is not trivial to apply these metrics to end-to-end coreference systems, where the automatically identified system mentions may not align with the true mentions.
To be consistent with the literature, in this chapter key mention is used to refer to true mention. In Section 6.1, we discuss the problems of the existing coreference metrics and propose two variants of the B3 and CEAF algorithms which can be applied to noisy coreference output dealing with system mentions. Our experiments in Section 6.2 show that our variants lead to intuitive and reliable results for end-to-end coreference systems.

6.1 Evaluation Metrics for the End-to-end Coreference Resolution

6.1.1 MUC

The MUC score (Vilain et al., 1995) counts the minimum number of links between mentions to be inserted or deleted when mapping a system response to a gold standard key set. Consider the following example:

Key: {m1, m2, m3, m4}
Response: {m1, m2} {m3, m4}

Figure 6.1 illustrates the relations between mentions for both the key and the response. Since the response sets require at least one link (e.g. between m1 and m4) to form a set (i.e. {m1, m2, m3, m4}) which matches the provided key, the recall is given as Recall = 2/3. The precision is computed as Precision = 2/2, as all the links in the response are correct.

[Figure 6.1: The MUC Score Illustration]

Although pairwise links capture the relations in a set, they cannot represent singleton entities, i.e. entities which are mentioned only once. Therefore, the MUC score is not suitable for the ACE data (http://www.itl.nist.gov/iad/mig/tests/ace/), which includes singleton entities in the keys. Moreover, the MUC score does not give credit for separating singleton entities from other chains. This becomes problematic in a realistic system setup, when mentions are extracted automatically.

6.1.2 B3

The B3 algorithm (Bagga & Baldwin, 1998) overcomes the shortcomings of the MUC score.
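The MUC link counting described above can be made concrete before turning to B3. The following is a minimal Python sketch (not the official scorer), where the key and the response are given as lists of mention sets:

```python
# A minimal sketch of the MUC score (Vilain et al., 1995): for each gold set,
# count the links missing to reconnect the partitions induced on it by the
# other side; precision swaps the roles of key and response.
def muc(key, response):
    def links(gold, pred):
        num = den = 0
        for g in gold:
            touching = [p for p in pred if g & p]
            covered = set().union(*touching) if touching else set()
            # partitions of g: intersecting pred sets plus uncovered singletons
            partitions = len(touching) + len(g - covered)
            num += len(g) - partitions
            den += len(g) - 1
        return num / den if den else 0.0
    return links(response, key), links(key, response)  # (precision, recall)
```

On the running example, `muc([{"m1","m2","m3","m4"}], [{"m1","m2"}, {"m3","m4"}])` yields precision 2/2 and recall 2/3, matching the figures above.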
Instead of looking at the links, B3 computes precision and recall for all mentions in the document, which are then combined to produce the final precision and recall numbers for the entire output. For each mention, the B3 algorithm computes a precision and recall score using Equations 6.1 and 6.2:

Precision(mi) = |Rmi ∩ Kmi| / |Rmi|    (6.1)

Recall(mi) = |Rmi ∩ Kmi| / |Kmi|    (6.2)

where Rmi is the response chain (i.e. the system output) which includes the mention mi, and Kmi is the key chain (the manually annotated gold standard) which includes mi. The overall precision and recall are computed by averaging them over all mentions. Consider the same example as in the previous section:

Key: {m1, m2, m3, m4}
Response: {m1, m2} {m3, m4}

Figure 6.2 illustrates the relations between mentions and their corresponding sets.

[Figure 6.2: The B3 Algorithm Illustration]

According to Equations 6.1 and 6.2,

Precision(m1) = 2/2, Recall(m1) = 2/4
Precision(m2) = 2/2, Recall(m2) = 2/4
Precision(m3) = 2/2, Recall(m3) = 2/4
Precision(m4) = 2/2, Recall(m4) = 2/4

Since B3's calculations are based on mentions, singletons are taken into account. However, a problematic issue arises when system mentions have to be dealt with: B3 assumes the mentions in the key and in the response to be identical. Hence, B3 has to be extended to deal with system mentions which are not in the key and key mentions not extracted by the system, so-called twinless mentions (Stoyanov et al., 2009).

6.1.2.1 Existing B3 Variants

A few variants of the B3 algorithm for dealing with system mentions have been introduced recently. Stoyanov et al. (2009) suggest two variants of the B3 algorithm to deal with system mentions, B30 and B3all.1
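Before looking at the variants, the plain B3 computation of Equations 6.1 and 6.2 can be sketched directly. This is a minimal Python sketch for the true mention setting, assuming every mention occurs in exactly one key chain and one response chain:

```python
# A minimal sketch of B3 on identical mention sets: per-mention precision and
# recall (Equations 6.1 and 6.2), averaged over all mentions.
def b_cubed(key_sets, response_sets):
    mentions = [m for k in key_sets for m in k]
    p = r = 0.0
    for m in mentions:
        K = next(k for k in key_sets if m in k)       # key chain of m
        R = next(s for s in response_sets if m in s)  # response chain of m
        overlap = len(K & R)
        p += overlap / len(R)
        r += overlap / len(K)
    return p / len(mentions), r / len(mentions)
```

On the running example, every mention contributes precision 2/2 and recall 2/4, so the sketch returns overall precision 1.0 and recall 0.5.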
For example, a key and a response are provided as below:

Key: {a b c}
Response: {a b d}

B30 discards all twinless system mentions (i.e. mention d) and penalizes recall by setting Recall(mi) = 0 for all twinless key mentions (i.e. mention c). The B30 precision, recall and F-score (i.e. F = 2 · Precision · Recall / (Precision + Recall)) for the example are calculated as:

PrB30 = 1/2 (2/2 + 2/2) = 1.0
RecB30 = 1/3 (2/3 + 2/3 + 0) ≈ 0.444
FB30 = 2 × (1.0 × 0.444) / (1.0 + 0.444) ≈ 0.615

B3all retains twinless system mentions. It assigns 1/|Rmi| to a twinless system mention as its precision and, similarly, 1/|Kmi| to a twinless key mention as its recall. For the same example, the B3all precision, recall and F-score are given by:

PrB3all = 1/3 (2/3 + 2/3 + 1/3) ≈ 0.556
RecB3all = 1/3 (2/3 + 2/3 + 1/3) ≈ 0.556
FB3all = 2 × (0.556 × 0.556) / (0.556 + 0.556) ≈ 0.556

¹ Our discussion of B30 and B3all is based on the analysis of the source code available at http://www.cs.utah.edu/nlp/reconcile/.

Tables 6.1, 6.2 and 6.3 illustrate the problems with B30 and B3all. The rows labeled System give the original keys and system responses, while the rows labeled B30, B3all and B3sys show the performance generated by Stoyanov et al.'s variants and the one we introduce in this chapter, B3sys (the row labeled CEAFsys is discussed in Subsection 6.1.3).

Table 6.1: Problems of B30

System 1: key {a b c}, response {a b d}
           P      R      F
B30        1.0    0.444  0.615
B3all      0.556  0.556  0.556
B3r&n      0.556  0.556  0.556
B3sys      0.667  0.556  0.606
CEAFsys    0.5    0.667  0.572

System 2: key {a b c}, response {a b d e}
           P      R      F
B30        1.0    0.444  0.615
B3all      0.375  0.556  0.448
B3r&n      0.375  0.556  0.448
B3sys      0.5    0.556  0.527
CEAFsys    0.4    0.667  0.500

In Table 6.1, there are two system outputs (System 1 and System 2). Mentions d and e are twinless system mentions which are erroneously resolved, and c is a twinless key mention.
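The behavior of the two variants on such examples can be sketched as follows (our Python reading of the published descriptions, not the Reconcile scorer code itself; mentions are hashable ids):

```python
def _chain(m, sets):
    return next(s for s in sets if m in s)

def b3_zero(key, response):
    """B30: twinless system mentions are discarded; twinless key
    mentions contribute recall 0."""
    key_m, sys_m = set().union(*key), set().union(*response)
    # drop twinless system mentions from every response chain
    resp0 = [s & key_m for s in response if s & key_m]
    prec = [len(_chain(m, key) & _chain(m, resp0)) / len(_chain(m, resp0))
            for m in sys_m & key_m]
    rec = [len(_chain(m, key) & _chain(m, resp0)) / len(_chain(m, key))
           if m in sys_m else 0.0
           for m in key_m]
    return sum(prec) / len(prec), sum(rec) / len(rec)

def b3_all(key, response):
    """B3all: a twinless system mention scores 1/|Rmi| precision,
    a twinless key mention 1/|Kmi| recall."""
    key_m, sys_m = set().union(*key), set().union(*response)
    prec = [(len(_chain(m, key) & _chain(m, response)) if m in key_m else 1)
            / len(_chain(m, response)) for m in sys_m]
    rec = [(len(_chain(m, key) & _chain(m, response)) if m in sys_m else 1)
           / len(_chain(m, key)) for m in key_m]
    return sum(prec) / len(prec), sum(rec) / len(rec)
```

For key {a b c} and response {a b d} this reproduces the numbers above: B30 gives precision 1.0 and recall ≈ 0.444, while B3all gives ≈ 0.556 for both.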
System 1 is supposed to be slightly better with respect to precision, because System 2 produces one more spurious resolution (i.e. for mention e). However, B30 computes exactly the same numbers for both systems. Hence, B30 does not penalize erroneous coreference relations if the mentions involved do not appear in the key; e.g. putting mention d or e into Set 1 does not count as a precision error. In short, B30 is too lenient, evaluating only the correctly extracted mentions.

Table 6.2: Problems of B3all (1)

System 1: key {a b c}, response {a b d}
           P      R      F
B3all      0.556  0.556  0.556
B3r&n      0.556  0.556  0.556
B3sys      0.667  0.556  0.606
CEAFsys    0.5    0.667  0.572

System 2: key {a b c}, response {a b d} {c}
           P      R      F
B3all      0.667  0.556  0.606
B3r&n      0.667  0.556  0.606
B3sys      0.667  0.556  0.606
CEAFsys    0.5    0.667  0.572

Table 6.3: Problems of B3all (2)

System 1: key {a b}, response {a b d}
           P      R    F
B3all      0.556  1.0  0.715
B3r&n      0.556  1.0  0.715
B3sys      0.556  1.0  0.715
CEAFsys    0.667  1.0  0.800

System 2: key {a b}, response {a b d} {i} {j} {k}
           P      R    F
B3all      0.778  1.0  0.875
B3r&n      0.556  1.0  0.715
B3sys      0.556  1.0  0.715
CEAFsys    0.667  1.0  0.800

B3all deals well with the problem illustrated in Table 6.1; the figures reported correspond to intuition. However, B3all can output different results for identical coreference resolutions when exposed to different mention taggers, as shown in Tables 6.2 and 6.3. B3all manages to penalize erroneous resolutions of twinless system mentions; however, it ignores twinless key mentions when measuring precision. In Table 6.2, System 1 and System 2 generate the same output, except that the mention tagger in System 2 also extracts mention c. Intuitively, the same numbers are expected for both systems. However, B3all gives a higher precision to System 2, which results in a higher F-score.
B3all retains all twinless system mentions, as can be seen in Table 6.3. System 2's mention tagger tags more mentions (i.e. the mentions i, j and k), while both System 1 and System 2 have identical coreference resolution performance. Still, B3all outputs quite different results for precision and thus for F-score. This is due to the credit B3all takes from the unresolved singleton twinless system mentions (i.e. mentions i, j and k in System 2). Since the metric is expected to evaluate the end-to-end coreference system performance rather than the mention tagging quality, it is not satisfying to observe that B3all's numbers fluctuate when the system is exposed to different mention taggers.

Rahman & Ng (2009) apply another variant, denoted here as B3r&n. They remove only those twinless system mentions that are singletons before applying the B3 algorithm. Thus, a system is not rewarded for spurious mentions which are correctly identified as singletons during resolution (as is the case with B3all's higher precision for System 2 in Table 6.3). We assume that Rahman & Ng apply a strategy similar to B3all after the removal step (this is not made clear in Rahman & Ng (2009)). While it avoids the problem with singleton twinless system mentions, B3r&n still suffers from the problem of dealing with twinless key mentions, as illustrated in Table 6.2.

6.1.2.2 Our proposed variant — B3sys

We here propose a coreference resolution evaluation metric, B3sys, which deals with system mentions more adequately (see the rows labeled B3sys in Tables 6.1, 6.2, 6.3, 6.8 and 6.9). We put all twinless key mentions into the response as singletons, which enables B3sys to penalize non-resolved coreferent key mentions without penalizing non-resolved singleton key mentions, and also avoids the problem B3all and B3r&n have, as shown in Table 6.2. All twinless system mentions that are deemed not coreferent (hence being singletons) are discarded.
To calculate B3sys precision, all twinless system mentions that are mistakenly resolved are put into the key, since they are spurious resolutions (equivalent to the assignment operations in B3all) which should be penalized by precision. Unlike B3all, B3sys does not benefit from unresolved twinless system mentions (i.e. the twinless singleton system mentions). For recall, the algorithm only goes through the original key sets, similar to B3all and B3r&n. Details are given in Algorithm 4.

Algorithm 4 B3sys
Input: key sets key, response sets response
Output: precision P, recall R and F-score F
1: Discard all the singleton twinless system mentions in response;
2: Put all the twinless annotated mentions into response;
3: if calculating precision then
4:   Merge all the remaining twinless system mentions with key to form keyp;
5:   Use response to form responsep;
6:   Go through keyp and responsep;
7:   Calculate B3 precision P;
8: end if
9: if calculating recall then
10:   Discard all the remaining twinless system mentions in response to form responser;
11:   Use key to form keyr;
12:   Go through keyr and responser;
13:   Calculate B3 recall R;
14: end if
15: Calculate F-score F

For example, a coreference resolution system has the following key and response:

Key: {a b c}
Response: {a b d} {i j}

To calculate the precision of B3sys, the key and response are altered to:

Keyp: {a b c} {d} {i} {j}
Responsep: {a b d} {i j} {c}

So, the precision of B3sys is given by:

PrB3sys = 1/6 (2/3 + 2/3 + 1/3 + 1/2 + 1/2 + 1) ≈ 0.611

The modified key and response for recall are:

Keyr: {a b c}
Responser: {a b} {c}

The resulting recall of B3sys is:

RecB3sys = 1/3 (2/3 + 2/3 + 1/3) ≈ 0.556

Thus the F-score is calculated as:

FB3sys = 2 × (0.611 × 0.556) / (0.611 + 0.556) ≈ 0.582

B3sys indicates more adequately the performance of end-to-end coreference resolution systems.
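Algorithm 4 and the worked example can be sketched end to end (an illustrative Python reading of the procedure, with mentions as hashable ids; not the scorer used in the experiments):

```python
def _chain(m, sets):
    return next(s for s in sets if m in s)

def _b3(key, response):
    """Plain B3 averages, assuming every mention occurs on both sides."""
    resp_m, key_m = set().union(*response), set().union(*key)
    p = sum(len(_chain(m, key) & _chain(m, response)) / len(_chain(m, response))
            for m in resp_m) / len(resp_m)
    r = sum(len(_chain(m, key) & _chain(m, response)) / len(_chain(m, key))
            for m in key_m) / len(key_m)
    return p, r

def b3_sys(key, response):
    key_m = set().union(*key)
    # step 1: discard singleton twinless system mentions
    resp = [set(s) for s in response if len(s) > 1 or s & key_m]
    sys_m = set().union(*resp) if resp else set()
    twinless_sys = sys_m - key_m
    # step 2: put twinless key mentions into the response as singletons
    resp = resp + [{m} for m in key_m - sys_m]
    # precision: spurious resolutions enter the key as singletons
    key_p = [set(s) for s in key] + [{m} for m in twinless_sys]
    p, _ = _b3(key_p, resp)
    # recall: drop remaining twinless system mentions, keep the original key
    resp_r = [s - twinless_sys for s in resp if s - twinless_sys]
    _, r = _b3(key, resp_r)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

On the example above it returns precision ≈ 0.611, recall ≈ 0.556 and F-score ≈ 0.582.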
It is not easily tricked by different mention taggers. Further example analysis for the proposed B3sys can be found in Section 6.1.2.3.

6.1.2.3 B3sys Example Output

Here, we provide additional examples for analyzing the behavior of B3sys where we systematically vary system outputs. Since we propose B3sys for dealing with end-to-end systems, we consider only examples also containing twinless mentions. The systems in Tables 6.4 and 6.6 generate different twinless key mentions while keeping the twinless system mentions untouched. In Tables 6.5 and 6.7, the number of twinless system mentions changes through different responses while the number of twinless key mentions is fixed.

In Table 6.4, B3sys recall goes up when more key mentions are resolved into the correct set. The precision stays the same, because there is no change in the number of erroneous resolutions (i.e. the spurious cluster with mentions i and j). For the examples in Tables 6.5 and 6.7, B3sys gives worse precision to the outputs with more spurious resolutions, but the same recall if the systems resolve the key mentions in the same way. Since the set of key mentions intersects with the set of twinless system mentions in Table 6.6, we do not have an intuitive explanation for the decrease in precision from response1 to response4. However, both the F-score and the recall still show the right tendency.
Table 6.4: Analysis of B3sys 1

key: Set 1 = {a b c d e}
              Set 1          Set 2    P      R      F
response1     {a b}          {i j}    0.857  0.280  0.422
response2     {a b c}        {i j}    0.857  0.440  0.581
response3     {a b c d}      {i j}    0.857  0.68   0.784
response4     {a b c d e}    {i j}    0.857  1.0    0.923

Table 6.5: Analysis of B3sys 2

key: Set 1 = {a b c d e}
              Set 1      Set 2          P      R      F
response1     {a b c}    {i j}          0.857  0.440  0.581
response2     {a b c}    {i j k}        0.75   0.440  0.555
response3     {a b c}    {i j k l}      0.667  0.440  0.530
response4     {a b c}    {i j k l m}    0.6    0.440  0.508

Table 6.6: Analysis of B3sys 3

key: Set 1 = {a b c d e}
              Set 1              P      R      F
response1     {a b i j}          0.643  0.280  0.390
response2     {a b c i j}        0.6    0.440  0.508
response3     {a b c d i j}      0.571  0.68   0.621
response4     {a b c d e i j}    0.551  1.0    0.711

Table 6.7: Analysis of B3sys 4

key: Set 1 = {a b c d e}
              Set 1                P      R      F
response1     {a b c i j}          0.6    0.440  0.508
response2     {a b c i j k}        0.5    0.440  0.468
response3     {a b c i j k l}      0.429  0.440  0.434
response4     {a b c i j k l m}    0.375  0.440  0.405

6.1.3 CEAF

Luo (2005) criticizes the B3 algorithm for using entities more than once, because B3 computes precision and recall of mentions by comparing the entities containing that mention. Hence, Luo proposes the CEAF algorithm, which aligns the entities in key and response. CEAF applies a similarity metric (which can be either mention- or entity-based) to each pair of entities (i.e. sets of mentions) to measure the goodness of each possible alignment. The best mapping is used for calculating CEAF precision, recall and F-measure. Consider the same example as used for the previous metrics:

Key: {m1, m2, m3, m4}
Response: {m1, m2} {m3, m4}

The best mapping of the key and response sets is illustrated in Figure 6.3. Since the response set R1 is aligned with the key set K1, R2 is forced to align with an empty set.
Figure 6.3: The CEAF Alignment Illustration

Luo proposes two entity-based similarity metrics (Equations 6.3 and 6.4) for an entity pair (Ki, Rj) originating from the key, Ki, and the response, Rj:

φ3(Ki, Rj) = |Ki ∩ Rj|   (6.3)

φ4(Ki, Rj) = 2|Ki ∩ Rj| / (|Ki| + |Rj|)   (6.4)

The CEAF precision and recall are derived from the alignment with the best total similarity, denoted Φ(g*), as shown in Equations 6.5 and 6.6:

Precision = Φ(g*) / Σi φ(Ri, Ri)   (6.5)

Recall = Φ(g*) / Σi φ(Ki, Ki)   (6.6)

If not specified otherwise, we apply Luo's φ3(⋆, ⋆) in the example illustrations. We denote the original CEAF algorithm as CEAForig. Detailed calculations are illustrated via a new example below:

Key: {a b c}
Response: {a b d}

The CEAForig φ3(⋆, ⋆) values are given by:

φ3(K1, R1) = 2   (K1: {a b c}; R1: {a b d})
φ3(K1, K1) = 3
φ3(R1, R1) = 3

So the CEAForig evaluation numbers are:

PrCEAForig = 2/3 ≈ 0.667
RecCEAForig = 2/3 ≈ 0.667
FCEAForig = 2 × (0.667 × 0.667) / (0.667 + 0.667) ≈ 0.667

6.1.3.1 Problems of CEAForig

CEAForig was intended to deal with key mentions. Its adaptation to system mentions has not been addressed explicitly. Although CEAForig theoretically does not require the same number of mentions in key and response, it still cannot be directly applied to end-to-end systems, because the entity alignments are based on mention mappings. As can be seen from Table 6.8, CEAForig fails to produce intuitive results for system mentions. System 2 outputs one more spurious entity (containing mentions i and j) compared with System 1, yet achieves the same CEAForig precision. Since twinless system mentions do not have mappings in the key, they contribute nothing to the mapping similarity.
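The alignment-based computation of Equations 6.3 to 6.6 can be sketched as below (a Python illustration using brute force over all one-to-one alignments, so only suitable for toy examples; production scorers use the Kuhn-Munkres algorithm instead):

```python
from itertools import permutations

def phi3(k, r):                       # Equation 6.3
    return len(k & r)

def ceaf(key, response, phi=phi3):
    """CEAForig: score the best one-to-one entity alignment."""
    # pad the key with empty sets so every response entity can align
    k = list(key) + [set()] * max(0, len(response) - len(key))
    best = max(sum(phi(ki, ri) for ki, ri in zip(perm, response))
               for perm in permutations(k, len(response)))
    p = best / sum(phi(ri, ri) for ri in response)   # Equation 6.5
    r = best / sum(phi(ki, ki) for ki in key)        # Equation 6.6
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

ceaf([{'a','b','c'}], [{'a','b','d'}]) reproduces the CEAForig example above, with precision, recall and F-score all ≈ 0.667.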
So, resolution mistakes involving system mentions are not counted, and, moreover, the precision is easily skewed by the number of output entities. CEAForig reports very low precision for system mentions (see also Stoyanov et al. (2009)).

Table 6.8: Problems of CEAForig

System 1: key {a b c}, response {a b} {c} {i} {j}
           P      R      F
CEAForig   0.4    0.667  0.500
B3sys      1.0    0.556  0.715
CEAFsys    0.667  0.667  0.667

System 2: key {a b c}, response {a b} {i j} {c}
           P    R      F
CEAForig   0.4  0.667  0.500
B3sys      0.8  0.556  0.656
CEAFsys    0.6  0.667  0.632

Table 6.9: Problems of CEAFr&n

System 1: key {a b c}, response {a b} {i j} {k l} {c}
           P      R      F
CEAFr&n    0.286  0.667  0.400
B3sys      0.714  0.556  0.625
CEAFsys    0.571  0.667  0.615

System 2: key {a b c}, response {a b} {i j k l} {c}
           P      R      F
CEAFr&n    0.286  0.667  0.400
B3sys      0.571  0.556  0.563
CEAFsys    0.429  0.667  0.522

6.1.3.2 Existing CEAF variants

Rahman & Ng (2009) briefly introduce their CEAF variant, which is denoted as CEAFr&n here. They use φ3(⋆, ⋆), which results in equal CEAFr&n precision and recall figures when using true mentions. Since Rahman & Ng's experiments using system mentions produce unequal precision and recall figures, we assume that, after removing twinless singleton system mentions, they do not put any twinless mentions into the other set. In the example in Table 6.9, CEAFr&n does not adequately penalize the incorrectly resolved entities consisting of twinless system mentions. So CEAFr&n does not distinguish between System 1 and System 2. It can be concluded from the examples that the same number of mentions in key and response is needed for computing the CEAF score.

6.1.3.3 Our proposed variant — CEAFsys

We propose to adjust CEAF in the same way as we did for B3sys, resulting in CEAFsys. We put all twinless key mentions into the response as singletons. All singleton twinless system mentions are discarded.
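Mirroring the B3sys treatment (mistakenly resolved twinless system mentions join the key for precision; only the original key sets are used for recall), the whole CEAFsys adjustment can be sketched as follows (an illustrative Python reading, with a brute-force φ3 alignment that is only feasible for toy examples):

```python
from itertools import permutations

def _phi3(a, b):
    return len(a & b)

def _ceaf(key, response):
    """Best one-to-one alignment score, normalized for (P, R)."""
    k = list(key) + [set()] * max(0, len(response) - len(key))
    best = max(sum(_phi3(ki, ri) for ki, ri in zip(perm, response))
               for perm in permutations(k, len(response)))
    return (best / sum(_phi3(ri, ri) for ri in response),
            best / sum(_phi3(ki, ki) for ki in key))

def ceaf_sys(key, response):
    key_m = set().union(*key)
    # discard singleton twinless system mentions
    resp = [set(s) for s in response if len(s) > 1 or s & key_m]
    sys_m = set().union(*resp) if resp else set()
    twinless_sys = sys_m - key_m
    # add twinless key mentions to the response as singletons
    resp = resp + [{m} for m in key_m - sys_m]
    # precision: mistakenly resolved twinless system mentions join the key
    key_p = [set(s) for s in key] + [{m} for m in twinless_sys]
    p, _ = _ceaf(key_p, resp)
    # recall: drop the remaining twinless system mentions, keep the key
    resp_r = [s - twinless_sys for s in resp if s - twinless_sys]
    _, r = _ceaf(key, resp_r)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For System 2 in Table 6.8 (key {a b c}, response {a b} {i j} {c}) this yields 0.6 / 0.667 / 0.632, matching the table.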
For calculating CEAFsys precision, all twinless system mentions which were mistakenly resolved are put into the key. For computing CEAFsys recall, only the original key sets are considered. In this way CEAFsys deals adequately with system mentions (see Algorithm 5 for details).

Algorithm 5 CEAFsys
Input: key sets key, response sets response
Output: precision P, recall R and F-score F
1: Discard all the singleton twinless system mentions in response;
2: Put all the twinless annotated mentions into response;
3: if calculating precision then
4:   Merge all the remaining twinless system mentions with key to form keyp;
5:   Use response to form responsep;
6:   Form map g⋆ between keyp and responsep;
7:   Calculate CEAF precision P using φ3(⋆, ⋆);
8: end if
9: if calculating recall then
10:   Discard all the remaining twinless system mentions in response to form responser;
11:   Use key to form keyr;
12:   Form map g⋆ between keyr and responser;
13:   Calculate CEAF recall R using φ3(⋆, ⋆);
14: end if
15: Calculate F-score F

Taking System 2 in Table 6.8 as an example, key and response are altered for precision:

Keyp: {a b c} {i} {j}
Responsep: {a b} {i j} {c}

So the φ3(⋆, ⋆) values are as below, listing only the best mapping:

φ3(K1, R1) = 2   (K1: {a b c}; R1: {a b})
φ3(K2, R2) = 1   (K2: {i}; R2: {i j})
φ3(∅, R3) = 0   (R3: {c})
φ3(R1, R1) = 2
φ3(R2, R2) = 2
φ3(R3, R3) = 1

The precision is thus given by:

PrCEAFsys = (2 + 1 + 0) / (2 + 2 + 1) = 0.6

The key and response for recall are:

Keyr: {a b c}
Responser: {a b} {c}

The resulting φ3(⋆, ⋆) values are:

φ3(K1, R1) = 2   (K1: {a b c}; R1: {a b})
φ3(∅, R2) = 0   (R2: {c})
φ3(K1, K1) = 3
φ3(R1, R1) = 2
φ3(R2, R2) = 1

The recall and F-score are thus calculated as:

RecCEAFsys = 2/3 ≈ 0.667
FCEAFsys = 2 × (0.6 × 0.667) / (0.6 + 0.667) ≈ 0.632

However, one additional complication arises with regard to the similarity metrics used by CEAF. It turns out that only φ3(⋆, ⋆) is suitable for dealing with system mentions, while φ4(⋆, ⋆) produces unintuitive results (see Table 6.10).

Table 6.10: Problems of φ4(⋆, ⋆)

System 1: key {a b c}, response {a b} {c} {i} {j}
            P      R      F
φ4(⋆, ⋆)    0.4    0.8    0.533
φ3(⋆, ⋆)    0.667  0.667  0.667

System 2: key {a b c}, response {a b} {i j} {c}
            P      R      F
φ4(⋆, ⋆)    0.489  0.8    0.607
φ3(⋆, ⋆)    0.6    0.667  0.632

φ4(⋆, ⋆) computes a normalized similarity for each entity pair using the summed number of mentions in the key and the response. CEAF precision then distributes that similarity evenly over the response set. Spurious system entities, such as the one containing mentions i and j in Table 6.10, are not penalized. φ3(⋆, ⋆) calculates unnormalized similarities. It compares the two systems in Table 6.10 adequately. Hence we use only φ3(⋆, ⋆) in CEAFsys.

When normalizing the similarities by the number of entities or mentions in the key (for recall) and the response (for precision), the CEAF algorithm considers all entities or mentions to be equally important. Hence CEAF tends to compute quite low precision for system mentions, which does not represent the system performance adequately. Here, we do not address this issue.

6.1.4 BLANC

Recently, a new coreference resolution evaluation algorithm, BLANC, has been introduced (Recasens & Vila, 2010).
This measure implements the Rand index (Rand, 1971), which was originally developed to evaluate clustering methods. The BLANC algorithm deals correctly with singleton entities and rewards correct entities according to the number of mentions. However, a basic assumption behind BLANC is that the sum of all coreferential and non-coreferential links is constant for a given set of mentions. This implies that BLANC assumes identical mentions in key and response. It is not clear how to adapt BLANC to system mentions. We do not address this issue here.

6.2 Experiments with the Proposed Evaluation Metrics

While Section 6.1 used toy examples to motivate our metrics B3sys and CEAFsys, we here report results on two larger experiments using ACE2004 data.

6.2.1 Data and Mention Taggers

We use the ACE2004 (Mitchell et al., 2004) English training data, which we split into three sets following Bengtson & Roth (2008): Train (268 docs), Dev (76) and Test (107). We use two in-house mention taggers. The first (SM1) implements a heuristic aiming at high recall. The second (SM2) uses the J48 decision tree classifier (Witten & Frank, 2005). The number of detected mentions, head coverage and accuracy on the testing data are shown in Table 6.11.

Table 6.11: Mention Taggers on ACE2004 Data

                              SM1      SM2
training     mentions         31,370   16,081
             twin mentions    13,072   14,179
development  mentions         8,045    –
             twin mentions    3,371    –
test         mentions         8,387    4,956
             twin mentions    4,242    4,212
head coverage                 79.3%    73.3%
accuracy                      57.3%    81.2%

6.2.2 The Artificial Setting

For the artificial setting we report results on the development data using the SM1 tagger. To illustrate the stability of the evaluation metrics with respect to different mention taggers, we reduce the number of twinless system mentions in intervals of 10%, while correct (non-twinless) ones are kept untouched.
The coreference resolution system used is the BART (Versley et al., 2008) reimplementation of Soon et al. (2001). The results are plotted in Figures 6.4 and 6.5.

Figure 6.4: Artificial Setting B3 Variants (F-score on the ACE2004 development data for MUC, B3sys, B30, B3all and B3r&n, plotted against the proportion of twinless system mentions used)

Figure 6.5: Artificial Setting CEAF Variants (F-score on the ACE2004 development data for MUC, CEAFsys, CEAForig and CEAFr&n, plotted against the proportion of twinless system mentions used)

Omitting twinless system mentions from the training data while keeping the number of correct mentions constant should improve the coreference resolution performance, because a more precise coreference resolution model is obtained. As can be seen from Figures 6.4 and 6.5, the MUC score, B3sys and CEAFsys follow this intuition.

6.2.3 The Realistic Setting

Experiment 1

For the realistic setting we compare SM1 and SM2 as preprocessing components for the BART (Versley et al., 2008) reimplementation of Soon et al. (2001). The coreference resolution system with the SM2 tagger performs better, because a better coreference model is obtained from system mentions with higher accuracy. The MUC, B3sys and CEAFsys metrics show the same tendency when applied to systems with different mention taggers (Tables 6.12, 6.13 and 6.14; bold numbers in the original are significantly higher according to a paired t-test with p < 0.05). Since the MUC scorer does not evaluate singleton entities, it produces numbers which are too low to be informative.
Table 6.12: Realistic Setting MUC

              R     P     F
Soon (SM1)    51.7  53.1  52.4
Soon (SM2)    49.1  69.9  57.7

Table 6.13: Realistic Setting B3 Variants

              B3sys             B30               B3all             B3r&n
              R     P     F     R     P     F     R     P     F     R     P     F
Soon (SM1)    65.7  76.8  70.8  57.0  91.1  70.1  65.1  85.8  74.0  65.1  78.7  71.2
Soon (SM2)    64.1  87.3  73.9  54.7  91.3  68.4  64.3  87.1  73.9  64.3  84.9  73.2

Table 6.14: Realistic Setting CEAF Variants

              CEAFsys           CEAForig          CEAFr&n
              R     P     F     R     P     F     R     P     F
Soon (SM1)    66.4  61.2  63.7  62.0  39.9  48.5  62.1  59.8  60.9
Soon (SM2)    67.4  65.2  66.3  60.0  56.6  58.2  60.0  66.2  62.9

As shown in Table 6.13, B3all reports counter-intuitive results when a system is fed with system mentions generated by different mention taggers. B3all cannot be used to evaluate two different end-to-end coreference resolution systems, because the mention tagger is likely to have a bigger impact than the coreference resolution system. B30 fails to generate the right comparison too, because it is too lenient, ignoring all twinless mentions. The CEAForig numbers in Table 6.14 illustrate the big influence the system mentions have on precision (e.g. the very low precision number for Soon (SM1)). The big improvement for Soon (SM2) is largely due to the system mentions it uses, rather than to different coreference models. Both B3r&n and CEAFr&n show no serious problems in the experimental results. However, as discussed before, they fail to penalize the spurious entities with twinless system mentions adequately.

Table 6.15: Realistic Setting B30 vs. B3sys

              B3sys             B30
              R     P     F     R     P     F
Soon (SM2)    64.1  87.3  73.9  54.7  91.3  68.4
Bengtson      66.1  81.9  73.1  69.5  74.7  72.0

Experiment 2

We compare the results of Bengtson & Roth's (2008) system with our Soon (SM2) system. Bengtson & Roth's embedded mention tagger aims at high precision, generating half as many mentions as SM1 (explicit statistics are not available to us).
Bengtson & Roth report a B3 F-score for system mentions which is very close to the one for true mentions. Their B3 variant does not impute errors of twinless mentions and is assumed to be quite similar to the B30 strategy. We integrate both the B30 and B3sys variants into their system and show results in Table 6.15 (we cannot report significance, because we do not have access to results for single documents in Bengtson & Roth's system). It can be seen that, when different variants of the evaluation metrics are applied, the performance of the systems varies wildly.

6.3 Summary

In this chapter, we address problems of commonly used evaluation metrics for coreference resolution and suggest two variants of B3 and CEAF, called B3sys and CEAFsys. In contrast to the variants proposed by Stoyanov et al. (2009), B3sys and CEAFsys are able to deal with end-to-end systems which do not use any gold information. The numbers produced by B3sys and CEAFsys indicate the resolution performance of a system more adequately, without being easily tricked by twisting preprocessing components. We believe that the explicit description of evaluation metrics, as given in this chapter, is a precondition for the reliable comparison of end-to-end coreference resolution systems.

Chapter 7

Evaluating COPA

In order to analyze the effectiveness of COPA, we present three groups of comparison experiments (1, 2 and 3) and two analytical ones (4 and 5) in this chapter.

1. Section 7.1 compares COPA against two baseline systems, both of which are pairwise models with strong features. The comparisons aim to convey the superiority of the global partitioning method proposed in COPA over local pairwise models, with all preprocessors (including the mention detector) being the same.

2.
Section 7.2 shows the performance of COPA in the CoNLL 2011 shared task on coreference resolution, which is one of the most influential shared tasks in the field. Demonstrating COPA's results in the task enables us to assess the competitiveness of our system by comparing it with the most important state-of-the-art systems.

3. Section 7.3 tests COPA on medical data sets, to illustrate the robustness of COPA when adapted to new domains.

4. Experiments on the weakly supervised property of COPA are shown in Section 7.5.

5. Experiments analyzing our proposed k model are in Section 7.6.

Since the experimental settings differ between sections, discussions are provided separately in each section, making them self-contained. Features mentioned in this chapter are described in more detail in Chapter 5, and the data sets are introduced in Chapter 3.

7.1 COPA vs. Baselines

We compare COPA with two implementations of pairwise models. The first baseline is SOON – the BART (Versley et al., 2008) reimplementation of Soon et al. (2001), with few (i.e. 12) but strong features. Our second baseline is B&R – Bengtson & Roth (2008)¹, which exploits a much larger feature set while keeping the machine learning approach simple. Bengtson & Roth (2008) show that their system outperforms much more sophisticated machine learning approaches such as Culotta et al. (2007), who reported the best results on true mentions before Bengtson & Roth (2008). Bengtson & Roth (2008)'s is the strongest pairwise model on the ACE data sets before the CoNLL 2011 shared task (which is discussed in Section 7.2), and its source code is accessible for modifications, so that strictly fair comparisons can be conducted. Therefore, Bengtson & Roth (2008)'s system is the second reasonable competitor for evaluating COPA in this section.
Both baseline systems are chosen because they are the strongest pairwise models available for illustrating the effectiveness of our proposed global method. We use the same pre-processors (including the mention detection) for all systems to exclude possible influences from them. Differences in the outputs thus mainly indicate differences in the inference algorithms.

7.1.1 Data

We use the MUC6 data (Chinchor & Sundheim, 2003) with the standard training/testing divisions (30/30) and the MUC7 data (Chinchor, 2001) (30/20). Since we do not have access to the official ACE testing data (only available to ACE participants), we follow Bengtson & Roth (2008) in dividing the ACE 2004 English training set (Mitchell et al., 2004) into training, development and testing partitions (268/76/107). We randomly split the 252 ACE 2003 training documents (Mitchell et al., 2003) using the same proportions into training, development and testing (151/38/63). The systems were tuned on the development data and run only once on the testing data.

7.1.2 The Mention Tagger

We implement a classification-based mention tagger, which tags each NP chunk (e.g. the output of the Yamcha Chunker) as being an ACE mention or not, with the necessary postprocessing for embedded mentions. For the ACE 2004 testing data, we cover 75.8% of the syntactic heads of mentions with 73.5% accuracy. Since the MUC data sets do not limit the mentions to specific semantic classes as the ACE sets do, our mention tagger directly outputs all the embedded noun phrases.

¹ http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=FLBJCOREF

7.1.3 Evaluation Metrics

In order to report realistic results, we neither assume true mentions as input nor do we evaluate only on true mentions.
Instead, we use an in-house mention tagger for automatically extracting mentions, and evaluate using variants of the evaluation metrics B3 (Bagga & Baldwin, 1998) and CEAF (Luo, 2005), named B3sys and CEAFsys respectively, which are adapted to the evaluation of end-to-end coreference resolution systems (see Chapter 6). For the sake of completeness we also report the MUC score.

7.1.4 Results

7.1.4.1 COPA vs. SOON

In this section, we compare the SOON baseline with COPA using the R2 partitioner (parameters α⋆ and β optimized on the development data). COPA uses the same features as adopted by SOON, which are shown in Table 7.1. Moreover, the two systems use the same set of system mentions.

Table 7.1: COPA Features for Comparing with SOON (details in Chapter 5)

Negative:  (1) N Gender, (2) N Number, (3) N SemanticClass
Positive:  (10) StrMatch Npron, (11) StrMatch Pron, (12) Alias, (14) Nprn Prn, (21) Appositive, (31) sentence distance

Table 7.2 gives the comparison results. It can be seen that, even with the same features, COPA consistently outperforms SOON on all data sets using all evaluation metrics. With the exception of MUC7, ACE 2003 and ACE 2004 evaluated with CEAFsys, all of COPA's improvements are statistically significant. When evaluated using MUC and B3sys, COPA with the R2 partitioner boosts recall on all data sets while losing precision. This led us to believe that incorporating more features would increase precision without losing too much recall. Hence we integrated features from Bengtson & Roth (2008)'s system to conduct the second comparison in Section 7.1.4.2.
Table 7.2: SOON vs. COPA R2 (SOON features, system mentions; bold in the original indicates a significant improvement in F-score over SOON according to a paired t-test with p < 0.05)

                       SOON                COPA with the R2 partitioner
                       R     P     F       R     P     F     α⋆    β
MUC      MUC6          59.4  67.9  63.4    62.8  66.4  64.5  0.08  0.03
         MUC7          52.3  67.1  58.8    55.2  66.1  60.1  0.05  0.01
         ACE 2003      56.7  75.8  64.9    60.8  75.1  67.2  0.07  0.03
         ACE 2004      50.4  67.4  57.7    54.1  67.3  60.0  0.05  0.04
B3sys    MUC6          53.1  78.9  63.5    56.4  76.3  64.1  0.08  0.03
         MUC7          49.8  80.0  61.4    53.3  76.1  62.7  0.05  0.01
         ACE 2003      66.9  87.7  75.9    71.5  83.3  77.0  0.07  0.03
         ACE 2004      64.7  85.7  73.8    67.3  83.4  74.5  0.07  0.03
CEAFsys  MUC6          56.9  53.0  54.9    62.2  57.5  59.8  0.08  0.03
         MUC7          57.3  54.3  55.7    58.3  54.2  56.2  0.06  0.01
         ACE 2003      71.0  68.7  69.8    71.1  68.3  69.7  0.07  0.03
         ACE 2004      67.9  65.2  66.5    68.5  65.5  67.0  0.07  0.03

In brief, Table 7.2 conveys that the global hypergraph partitioning method of COPA models the coreference resolution task more adequately than Soon et al. (2001)'s local model – even when using the very same features and the same mentions.

7.1.4.2 COPA vs. B&R

Table 7.3 gives our reproduced B&R numbers on the ACE 2004 testing data using the true (and system) mention settings, in comparison to the numbers reported in their paper. Their lenient variant of B3 (Stoyanov et al., 2009) is used, which discards all twinless mentions². Table 7.3 shows that their reported numbers are successfully regenerated. Replacing their preprocessing components with ours yields a B3sys F-score of 74.8, which is comparable to the 74.0 obtained using their own components.

² Mentions which are not aligned with true mentions are called twinless (Stoyanov et al., 2009).
                                          Reported              Reproduced
                                       R     P     F          R     P     F
true mention (lenient B³)            74.5  88.3  80.8       73.0  89.6  80.4
B&R's system mention (lenient B³)    72.5  84.9  78.24      72.1  83.2  77.3
B&R's system mention (B³sys)           –     –   73.8       68.3  80.8  74.0
COPA's system mention (B³sys)          –     –     –        66.3  85.8  74.8

Table 7.3: Reproduced Numbers of B&R

In Table 7.4 we report the B³sys performance of SOON and B&R on the ACE 2004 testing data (the data set B&R's original results were reported on), using true mentions and using COPA's automatically identified system mentions. For evaluation we use B³sys only, because Bengtson & Roth (2008)'s system does not allow one to easily integrate CEAF. B&R considerably outperforms SOON (we cannot compute statistical significance, because B&R does not provide single-document performance). The difference using system mentions, however, is not as big as we expected. Bengtson & Roth (2008) reported very good results when using true mentions. For evaluating on system mentions, however, they were using the lenient B³. When replacing this with B³sys, the difference between SOON and B&R shrinks.

                                      SOON                B&R (Reproduced)
                                  R     P     F          R     P     F
true mention (B³sys)             67.4  90.3  77.2       73.0  89.6  80.4
COPA's system mention (B³sys)    64.7  85.7  73.8       66.3  85.8  74.8

Table 7.4: Baselines on the ACE 2004 Testing Data

In this section, we compare the B&R system (using our preprocessing components and mention tagger) and COPA with the R2 partitioner using B&R features. The features are given in Table 7.5. COPA does not use the learned features from B&R, as this would have implied embedding a pairwise coreference resolution system in COPA.
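The significance claims in these comparisons rest on paired t-tests over per-document F-scores. A minimal stdlib sketch of the test statistic (the p-value would then be looked up in a t-distribution with n−1 degrees of freedom, e.g. via scipy.stats):

```python
from math import sqrt

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-document scores of two systems.
    A larger |t| means the mean per-document difference is less likely
    to be chance; requires non-constant differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / sqrt(var / n)
```

This is also why no significance is reported for B&R: the paired test needs the two systems' scores on the same individual documents, which B&R does not provide.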
  Negative:  (1) N Gender, (2) N Number, (3) N SemanticClass, (4) N Mod
  Positive:  (10) StrMatch Npron, (11) StrMatch Pron, (12) Alias, (13) HeadMatch,
             (14) Nprn Prn, (21) Appositive, (31) sentence distance,
             (32) compatible mention distance
  Weak:      (27) W VerbAgree, (29) W Synonym

Table 7.5: COPA Features for Comparing with B&R (details in Chapter 5)

The comparison results are provided in Table 7.6. We report results for ACE 2003 and ACE 2004. The parameters are optimized on the ACE 2004 data. COPA with the R2 partitioner outperforms B&R on both data sets. Bengtson & Roth (2008) developed their system on ACE 2004 data and never exposed it to ACE 2003 data. We suspect that the relatively poor result of B&R on the ACE 2003 data is caused by over-fitting to ACE 2004. This shows that COPA is a highly competitive system, as it outperforms Bengtson & Roth (2008)'s system, which claims the best performance on the ACE 2004 data.

B³sys             B&R                 COPA with the R2 partitioner
              R     P     F          R     P     F
ACE 2003    56.4  97.3  71.4       70.3  86.5  77.5
ACE 2004    66.3  85.8  74.8       68.4  84.4  75.6

Table 7.6: B&R vs. COPA R2 (B&R features, COPA's system mentions)

7.1.4.3 Running Time

On a machine with 2 AMD Opteron CPUs and 8 GB RAM, COPA finishes preprocessing, training and partitioning the ACE 2004 data set in 15 minutes, which is slightly faster than our duplicated SOON baseline and much faster than the original B&R system.

7.1.5 Discussion

Most previous attempts to solve the coreference resolution task globally have been hampered by employing a local pairwise model in the classification step (i.e. step 1 mentioned in Chapter 2), while only the clustering step realizes a global approach (e.g. Luo et al. (2004), Nicolae & Nicolae (2006), Klenner (2007), Denis & Baldridge (2009), lesser so Culotta et al. (2007)).
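COPA, by contrast, partitions a hypergraph over all mentions in one global step. A rough sketch of that representation (with hypothetical stand-ins, not COPA's actual code: `feature_fns` plays the role of the relational features in Tables 7.1 and 7.5):

```python
from collections import defaultdict

def build_hypergraph(mentions, feature_fns):
    """Mentions become vertices; every relational feature induces
    hyperedges, each of which may connect more than two mentions.
    feature_fns maps a feature name to a function mention -> key;
    all mentions sharing a key form one hyperedge."""
    hyperedges = []
    for name, fn in feature_fns.items():
        groups = defaultdict(list)
        for i, mention in enumerate(mentions):
            key = fn(mention)
            if key is not None:
                groups[key].append(i)
        for vertices in groups.values():
            if len(vertices) > 1:  # a hyperedge needs at least two vertices
                hyperedges.append((name, frozenset(vertices)))
    return hyperedges

# toy head-match feature: mentions sharing a head string join one hyperedge
edges = build_hypergraph(["Clinton", "Mr. Clinton", "he"],
                         {"head_match": lambda m: m.split()[-1].lower()})
```

Because each feature contributes its own overlapping hyperedges, multiple basic relations coexist between the same vertices without being combined into a single score first.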
In this section, we conducted experiments comparing our coreference resolution system, COPA, against two strong baselines (Soon et al., 2001; Bengtson & Roth, 2008). Soon et al. (2001) is the first two-step model, with 12 very strong features. Bengtson & Roth (2008)'s system has been claimed to achieve the best performance on the ACE 2004 data (using true mentions; Bengtson & Roth (2008) did not report any comparison with other systems using system mentions). COPA implements a global decision in one step via hypergraph partitioning and considers all the relations in a graph, which enables it to outperform the two strong pairwise models.

It has been observed that improved performance with true mentions does not necessarily translate into improved performance when system mentions are used (Ng, 2008). We follow Stoyanov et al. (2009) and argue that evaluating the performance of coreference resolution systems on true mentions is unrealistic. Hence we integrate an ACE mention tagger into our system, tune the system towards the real task, and evaluate only using system mentions. While Ng (2008) could not show that superior models achieved superior results on system mentions, COPA is able to outperform both baseline systems in strict comparisons and in an end-to-end setup.

7.2 COPA vs. State-of-the-art Systems

COPA participated in the CoNLL shared task on modeling unrestricted coreference (Pradhan et al., 2011), and we submitted COPA's results to the open setting of the task. We used only 30% of the training data (randomly selected) and 20 features (see Table 7.7).
  Negative:  (1) N Gender, (2) N Number, (3) N SemanticClass, (4) N Mod,
             (5) N DSPrn, (6) N ContraSubjObj
  Positive:  (10) StrMatch Npron, (11) StrMatch Pron, (12) Alias, (13) HeadMatch,
             (14) Nprn Prn, (15) Speaker12Prn, (16) DSPrn, (17) ReflexivePrn,
             (18) PossPrn, (19) GPEIsA, (20) OrgIsA, (31) sentence distance,
             (32) compatible mention distance
  Weak:      (27) W VerbAgree, (28) W Subject, (29) W Synonym

Table 7.7: COPA Features for the CoNLL 2011 Shared Task (details in Chapter 5)

7.2.1 Data

The CoNLL shared task aims to predict coreference on the OntoNotes data. There are 1,674 training documents, 202 development documents and 207 testing documents. As is customary for CoNLL tasks, two tracks are provided, i.e. closed and open. For the closed track, participating systems are restricted to the distributed resources (with the predicted layers of information provided by the task), in order to allow fair algorithmic comparisons. The open track allows for unrestricted usage of additional external resources. Since several off-the-shelf preprocessing components are used, COPA participates in the open track (without actually using additional resources such as Wikipedia).

7.2.2 The Mention Tagger

For the CoNLL shared task, we incorporate information from syntactic parse trees into our mention tagger. Both the semantic classes and the syntactic heads are generated along with the system mentions. The official evaluation of the mention taggers shows that the performance of our mention tagger falls into the average-performance group (see Table 7.8).

              R      P      F1
COPA        67.15  67.64  67.40
max open    74.31  67.87  70.94

Table 7.8: COPA's Mention Tagger Performance on the CoNLL Testing Set

7.2.3 Evaluation Metrics

The unweighted average of MUC, BCUBED and CEAF(E) is used as the final score in the CoNLL shared task. CEAF(E) uses the entity-based similarity metric (see Chapter 6).
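For illustration, CEAF(E) can be sketched as follows, using the entity similarity φ4(G, S) = 2|G ∩ S| / (|G| + |S|) and a brute-force search for the best one-to-one entity alignment (a naive sketch; a real scorer uses the Kuhn-Munkres algorithm instead of enumerating permutations):

```python
from itertools import permutations

def ceaf_e(gold_sets, sys_sets):
    """Naive CEAF(E) (Luo, 2005): align gold and system entities
    one-to-one so that the total phi4 similarity is maximal, then
    normalise by the number of gold (recall) resp. system (precision)
    entities.  Brute force, so only suitable for toy inputs."""
    def phi4(g, s):
        return 2 * len(g & s) / (len(g) + len(s))

    if len(gold_sets) >= len(sys_sets):
        big, small = gold_sets, sys_sets
    else:
        big, small = sys_sets, gold_sets
    best = max(sum(phi4(a, b) for a, b in zip(perm, small))
               for perm in permutations(big, len(small)))
    recall = best / len(gold_sets)
    precision = best / len(sys_sets)
    return recall, precision, 2 * recall * precision / (recall + precision)
```

Unlike B³, an entity can contribute to at most one aligned pair, so splitting one gold entity into several system entities is charged on the precision side.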
Each of the three metrics is considered to represent a different important dimension (Denis & Baldridge, 2009): MUC is based on links, BCUBED on mentions and CEAF on entities. Their combination should be adequate for evaluating the performance of a coreference resolution system.

7.2.4 Results

The stopping criterion α⋆ (see Section 4.2.2.2) is tuned on development data to optimize the final coreference scores. A value of 0.06 is chosen for the CoNLL testing set. COPA's results on the development set and the testing set are displayed in Table 7.9 and Table 7.10 respectively. The Overall numbers in both tables are the average scores of MUC, BCUBED and CEAF(E). In Table 7.11, the best performances in both the open and the closed track are given, along with the median numbers. Since COPA does not use additional resources anyway, the closed numbers can still be roughly compared with. This is mentioned in the overview paper of the task too (see the second paragraph on page 18 of Pradhan et al. (2011)).

Metric      R      P      F1
MUC       52.69  57.94  55.19
BCUBED    64.26  73.39  68.52
CEAF(M)   54.44  54.44  54.44
CEAF(E)   45.73  40.92  43.19
BLANC     69.78  75.26  72.13
Overall                 55.63

Table 7.9: COPA's results on the CoNLL development set

Metric      R      P      F1
MUC       56.73  58.90  57.80
BCUBED    64.60  71.03  67.66
CEAF(M)   53.37  53.37  53.37
CEAF(E)   42.71  40.68  41.67
BLANC     69.77  73.96  71.62
Overall                 55.71

Table 7.10: COPA's results on the CoNLL testing set

              F1
COPA        55.71
max open    58.31
med open    54.32
max closed  57.79
med closed  50.98

Table 7.11: Overall Results on the CoNLL testing set

The best system of the CoNLL 2011 shared task is Stanford's Multi-Pass Sieve system (Lee et al., 2011), which is based on heuristic rules. The second-ranking systems are not significantly different from ours, for instance Sapena's system, which uses an iterative probabilistic model with the constraints between mentions learned from a decision tree.
Both systems are described in Chapter 2. Overall, COPA performs competitively compared with the state-of-the-art systems in the field, while using a relatively small set of features and a small amount of training data.

7.2.5 Discussion

The CoNLL 2011 shared task enables us to compare our coreference model COPA with the state-of-the-art systems on a much bigger data set, the OntoNotes data. We use only 30% of the training documents to learn the hyperedge weights, and the learned COPA model comes in as the second team in the open track, in which five teams participated. Since COPA does not use additional resources, it can be considered to belong to the second group in the closed track too (Pradhan et al., 2011), where 18 teams participated. Pradhan et al. (2011) conclude that most of the participating systems are still two-step models, fully trained on the training set using the approach described in Soon et al. (2001). This suggests again that COPA's global partitioning algorithm outperforms the pairwise models under the CoNLL setup, even with a small set of features (i.e. 22).

7.3 COPA in the Medical Domain

We participated in all three tasks of the 2011 i2b2/VA Track on Challenges in Natural Language Processing for Clinical Data (descriptions can be found in Chapter 3). The features used to report the results are given in Table 7.12.
  Negative:  (1) N Gender, (2) N Number, (3) N SemanticClass, (4) N Mod,
             (6) N ContraSubjObj, (7) N i2b2Type, (8) N i2b2Quant,
             (9) N i2b2ConName
  Positive:  (10) StrMatch Npron, (11) StrMatch Pron, (12) Alias, (13) HeadMatch,
             (14) Nprn Prn, (17) ReflexivePrn, (21) Appositive, (23) i2b2PisA,
             (24) i2b2Abbr, (25) i2b2CatMatch, (26) i2b2PronPreference,
             (31) sentence distance, (32) compatible mention distance
  Weak:      (28) W Subject, (29) W Synonym, (30) W i2b2SubStr

Table 7.12: COPA Features for the 2011 i2b2/VA Shared Task (details in Chapter 5)

7.3.1 Data

For task 1A and task 1B – the ODIE corpus without and with concepts³ – a training set of 97 documents is released (including the Mayo and Pittsburgh data sets). A total of 492 documents (including the Partner, Beth and Pittsburgh data sets) are used as training data for task 1C – the i2b2/VA corpus with concepts. In task 1A, our in-house mention tagger is integrated into the preprocessing components. For development purposes, we randomly split the training data into two parts with a ratio of 4 to 1. From the ODIE corpus, 78 documents are kept for training and 19 are used as development set. A split of 394/98 is used for the i2b2/VA corpus.

7.3.2 The Mention Tagger

For the i2b2 shared task, the semantic classes of mentions (e.g. persons and treatments) are evaluated together with the output coreference sets in task 1A. Our mention tagger makes use of the entity definitions extracted from the Unified Medical Language System (UMLS)⁴ for semantic class identification. Our mention tagger covers 84.9% of the syntactic heads of mentions with an accuracy of 62.2% on the ODIE corpus.

³ Concepts in the shared task refer to the given true mentions.
⁴ http://www.nlm.nih.gov/research/umls/

7.3.3 Evaluation Metrics

For coreference resolution there exists no evaluation metric that has been approved unanimously.
Hence the i2b2/VA/Cincinnati shared task adopts the approach taken by the CoNLL 2011 shared task to measure the final coreference performance, the unweighted average of the MUC, BCUBED and CEAF(E) evaluation metrics, denoted here as Overall. However, in contrast to the CoNLL evaluation, the i2b2/VA/Cincinnati shared task additionally evaluates mentions that do not participate in any coreference set, which results in inflated performance numbers (see the BCUBED numbers in Table 7.15 for an example). In addition, i2b2/VA/Cincinnati adopts the BLANC evaluation metric but does not include it in Overall. We report numbers according to the i2b2/VA/Cincinnati evaluation scripts for Task 1B and Task 1C (denoted as I2B2). For task 1A (with automatically detected mentions) we compute the evaluation metrics according to our own variants of BCUBED and CEAF (denoted as SYS), and the CoNLL variants of BCUBED and CEAF (denoted as CoNLL). Reporting our results for task 1A using the I2B2 metrics is meaningless, because the final i2b2/VA/Cincinnati evaluation script also evaluates the semantic classes of mentions, which we do not include in our output files. The final i2b2/VA/Cincinnati evaluation script changed during the final evaluation phase; the script released during the development phase does not evaluate the semantic classes. All evaluations in this section are conducted across semantic classes.

7.3.4 Results

COPA on the Development Data. COPA's results on the development sets for all three tasks are displayed in Table 7.13, Table 7.14, Table 7.15 and Table 7.16. The evaluation metrics (i.e. MUC, BCUBED, CEAF(E), Overall as the unweighted average of the three, and additionally BLANC) are calculated with the scripts provided by the shared task.
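The Overall rows in the tables below are just this unweighted average of the three F1 values; e.g. with COPA's CoNLL testing numbers from Table 7.10:

```python
def overall_score(muc_f1, bcubed_f1, ceaf_e_f1):
    """Unweighted average of the MUC, BCUBED and CEAF(E) F1 scores,
    used as Overall by both CoNLL 2011 and i2b2/VA/Cincinnati."""
    return (muc_f1 + bcubed_f1 + ceaf_e_f1) / 3

score = overall_score(57.80, 67.66, 41.67)  # Table 7.10 F1 values -> 55.71
```

Note that BLANC, although reported, is deliberately not part of this average.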
task 1A (SYS)    R     P     F1
MUC            88.9  61.8  72.9
BCUBED         83.0  90.0  86.4
CEAF           78.5  63.6  70.2
Overall                    76.5

Table 7.13: COPA's Results on the ODIE Development Set w/o Concepts (Task 1A) Using SYS Evaluation Metrics

task 1A (CoNLL)  R     P     F1
MUC            88.9  61.8  72.9
BCUBED         82.5  94.4  88.0
CEAF           78.5  48.2  59.7
Overall                    73.6

Table 7.14: COPA's Results on the ODIE Development Set w/o Concepts (Task 1A) Using CoNLL Evaluation Metrics

task 1B (I2B2)   R     P     F1
MUC            88.6  79.1  82.7
BCUBED         88.5  93.0  90.7
CEAF           71.5  62.2  66.5
(BLANC         80.5  95.8  86.6)
Overall                    80.0

Table 7.15: COPA's Results on the ODIE Development Set with Concepts (Task 1B) Using I2B2 Evaluation Metrics

task 1C (I2B2)   R     P     F1
MUC            80.8  84.9  82.8
BCUBED         95.6  96.1  95.8
CEAF           88.8  86.3  87.6
(BLANC         93.3  97.2  95.2)
Overall                    88.7

Table 7.16: COPA's Results on the i2b2/VA Development Set with Concepts (Task 1C) Using I2B2 Evaluation Metrics

COPA on the Testing Data. Our final performances on the testing data for Task 1B (overall F1 of 0.806) and Task 1C (overall F1 of 0.888) are similar to our results on the development sets (see Table 7.15 and Table 7.16). Our testing results are slightly worse than the results of the top performing system for Task 1C, and are not significantly different from the top results for Task 1B (Uzuner et al., 2012). This indicates that our system is competitive in the medical domain. However, our results on the testing data of Task 1A are much worse than on the development data, because the final evaluation script (I2B2) also evaluates the semantic classes of mentions, which we did not include in our output files. It can be seen from Table 7.17 that the SYS metrics give similar numbers on the Task 1A testing data as on the Task 1A development data, which are the best SYS performances in the shared task.
task 1A (SYS)         R     P     F1    F1 max  F1 med
Exact and Partial   .760  .648  .696    .696    .690
Exact               .783  .707  .730    .730    .703

Table 7.17: COPA's Results (in bold) on the ODIE Testing Set w/o Concepts (Task 1A) Using SYS Evaluation Metrics

task 1A (I2B2)        R     P     F1    F1 max  F1 med
Exact and Partial   .617  .423  .417    .657    .624
Exact               .765  .568  .630    .675    .634

Table 7.18: COPA's Results (in bold) on the ODIE Testing Set w/o Concepts (Task 1A) Using I2B2 Evaluation Metrics

task 1B (I2B2)        R     P     F1    F1 max  F1 med
Overall             .850  .773  .806    .827    .800

Table 7.19: COPA's Results (in bold) on the ODIE Testing Set with Concepts (Task 1B) Using I2B2 Evaluation Metrics

task 1C (I2B2)        R     P     F1    F1 max  F1 med
Overall             .894  .882  .888    .915    .859

Table 7.20: COPA's Results (in bold) on the i2b2/VA Testing Set with Concepts (Task 1C) Using I2B2 Evaluation Metrics

Medical Domain Knowledge. As mentioned in Chapter 5, the UMLS thesaurus and the MetaMap API are used to equip COPA with medical domain knowledge. Features (7) N i2b2Type, (9) N i2b2ConName, (23) i2b2PisA and (24) i2b2Abbr are left out in Table 7.21 to illustrate the influence of domain knowledge.

task 1C (I2B2)    w/o KnowledgeFeats       w KnowledgeFeats
                  R     P     F1           R     P     F1
MUC             .807  .821  .814         .808  .849  .828
BCUBED          .959  .953  .956         .956  .961  .958
CEAF            .859  .867  .863         .888  .863  .876
Overall                     .878                     .887

Table 7.21: COPA's Results on the i2b2/VA Development Set with Concepts (Task 1C), with and without Knowledge Features, Using I2B2 Evaluation Metrics (bold indicates significant improvement in F1 measure over the column w/o KnowledgeFeats, according to a paired t-test with p < 0.005)

By accessing domain knowledge, COPA manages to capture coreference relations which purely linguistic features cannot capture. For example, the mention {neurolysis} is correctly resolved to {the procedure} due to the contribution of the IsA relation.
Because the version of the evaluation metrics used by the shared task is overwhelmed by unresolved singletons (in particular BCUBED), the contribution of the knowledge features appears smaller than it actually is. The same comparison is conducted with the SYS metrics in Table 7.22, which shows a bigger improvement from the knowledge features.

task 1C (SYS)     w/o KnowledgeFeats       w KnowledgeFeats
                  R     P     F1           R     P     F1
MUC             .807  .821  .814         .808  .849  .828
BCUBED          .750  .849  .797         .752  .883  .813
CEAF            .786  .731  .757         .792  .750  .770
Overall                     .787                     .804

Table 7.22: COPA's Results on the i2b2/VA Development Set with Concepts (Task 1C), with and without Knowledge Features, Using SYS Evaluation Metrics (bold indicates significant improvement in F1 measure over the column w/o KnowledgeFeats, according to a paired t-test with p < 0.005)

7.3.5 Discussion

By participating in the i2b2 shared task, we are able to demonstrate the domain adaptation ability of the COPA model. With the system mention setting and the SYS metrics (see Table 7.17), COPA generates the best performance. In the true mention setting, COPA is ranked in the second group (Uzuner et al., 2012). From our experience in the i2b2 shared task, we confirm that it is easy to adapt the COPA model to new domains. Feature engineering is easy due to the overlapping hyperedges, and the learning phase can be done cheaply with a small portion of the training documents.

7.4 Error Analysis

7.4.1 COPA Errors for News Articles

Mention Detection Errors. As described in Section 4.3.1, our mention detection is based on automatically extracted information, such as syntactic parse trees and basic NP chunks. Since no minimum-span information is provided in the OntoNotes data (in contrast to the previous standard corpus, ACE), exact mention-boundary detection is required.
Many of the spurious mentions in our system are generated due to mismatches of starting or ending punctuation, and the OntoNotes annotation is not consistent in this regard either. The mention detection F-score of COPA is 67.40, whereas the best system in the CoNLL shared task has an F-score of 70.94.

Our current mention detector does not extract verb phrases, and therefore misses all the Event mentions in the OntoNotes corpus. Besides the fact that the current COPA does not resolve any event coreference, our mention detector also performs weakly in extracting date mentions. As a result, the system outputs several spurious coreference sets, for instance a set containing the September from the mention 15th September. Moreover, idiomatic expression identification needs to be included too, which should help to avoid detecting spurious mentions such as {God} in the phrase {for God's sake}.

Resolution Errors. A big portion of the recall loss in our system is due to the lack of world knowledge. For example, COPA does not resolve the mention {the Europe station} correctly into the entity Radio Free Europe, because the system does not know that the entity is a station. Some more difficult coreference cases in the OntoNotes data might require a reasoning mechanism: to connect the mention {the victim} with the mention {the groom's brother}, the event that the brother was killed needs to be interpreted by the system. We also observed in the experiments that the resolution of {it} mentions is quite inaccurate. Although our mention detector discards pleonastic pronouns, many are still left and introduce wrong coreference sets. Since {it} mentions do not contain enough information by themselves, more features exploring their local syntax are necessary.
7.4.2 COPA Errors for Clinical Reports

The data sets adopted in the i2b2/VA shared task contain semi-structured reports describing clinically relevant information about patients. Therefore some data-specific coreference chains can be derived easily, such as in the case of "{Patient} name: {XXX}", where the patient name is given explicitly. Pronouns in these data sets are not as ambiguous as they are in news articles. Each report centers on the patient, who accounts for most of the third person pronouns, while most singular first person pronouns refer to the doctors who write the reports. Definite noun phrases are not used frequently in the i2b2/VA data sets. Instead, variations of medical terms and expanded descriptions of entities appear frequently, which are difficult to detect without domain-dependent knowledge resources.

Mention Detection Errors. The mention detection in task 1A has been a challenge for us, as the annotated mentions are not always the largest noun phrase spans (which is usually the case in coreference annotations). What is annotated is rather a meaningful medical usage. For instance, the phrase {appendix 8.0 x 0.5 cm} is a mention while {135 pulse rate} is not.

Resolution Errors. COPA has difficulties deciding whether the difference between the modifiers of the mention {chest pain} and the mention {back pain} is essential enough to separate them from each other. This requires the knowledge that {back} and {chest} are both parts of the body, but different ones. We attempt to handle this problem by including the medical concept names the mentions refer to (see feature (9)). However, including even deeper knowledge would be beneficial.

7.5 Experiments on the Training Data Size

We conducted a series of runs with different amounts of training data, shown in Figure 7.1.
The curve derived from the i2b2/VA/Cincinnati corpus using the I2B2 metrics is tagged "i2b2_trsize", while the curve using our SYS metrics is tagged "i2b2_trsize,sys". Because of the skewed evaluation metrics adopted in the i2b2/VA/Cincinnati task (see Section 7.3.3), the curve "i2b2_trsize" shows only a small drop in performance (i.e. four percent F-measure) when only two training documents are used. When we apply our own version of the evaluation metrics, which is not as influenced by singletons (see Chapter 6), the drop on the curve "i2b2_trsize,sys" is more pronounced. However, even with this evaluation measure we can see that only little training data is needed for our system to reach its top performance.

[Figure 7.1: COPA's Results with Different Sizes of the Training Data. The plot "Training Size Experiment on i2b2 dev and CoNLL dev" shows the overall F1 of COPA against the number of training documents used, with the curves i2b2_trsize, i2b2_trsize,sys and conll_trsize,sys.]

In order to check whether the task of coreference resolution is easier in the clinical domain than in the news domain, we perform the same experiment on the CoNLL shared task development data using our own evaluation metrics, the curve of which is tagged "conll_trsize,sys". Here we see a slight increase when using more than 20 training documents, though even here we reach top performance with only about 100 training documents (out of more than 1,800 original ones). The overall lower numbers can be partially explained by the automatically tagged mentions and partially by the difficulty of the news domain (due to more occurrences of pronouns and more diverse entity types). However, in both domains our system needs only very little training data to achieve competitive performance.
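The training-size runs can be sketched as a simple loop; `score_on_dev` is a hypothetical stand-in for training COPA's hyperedge weights on a document sample and scoring on development data:

```python
import random

def learning_curve(train_docs, score_on_dev, sizes=(2, 5, 10, 20, 50, 100)):
    """Train on growing random samples of documents and record the
    resulting development score, yielding curves like Figure 7.1."""
    random.seed(0)  # reproducible samples
    curve = []
    for n in sizes:
        if n > len(train_docs):
            break
        sample = random.sample(train_docs, n)
        curve.append((n, score_on_dev(sample)))
    return curve
```

In practice one would average several random samples per size to smooth the curve.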
7.6 Experiments on the k Model

We proposed two partitioning algorithms in this thesis: the R2 partitioner, which partitions the hypergraphs in an iterative manner, and the flatK partitioner, which attempts to overcome the hierarchical limitation of the R2 partitioner by deriving the clusters in one step. The flatK partitioner assumes the number of clusters to be known beforehand, and our proposed k model in Chapter 4 addresses this issue via preference modeling.

The effect of singleton entities. It is no trivial matter to predict the number of entities (i.e. clusters) during end-to-end coreference processing, when noise is involved in the graphs to be partitioned. System mentions which do not participate in any coreference set appear as singleton entities in the graphs, which dramatically changes the distribution of the number of entities. Figure 7.2 compares the distributions of the number of entities per 100 mentions with and without singleton entities. The figures on the left side plot the frequencies of different k's without singleton entities, while the ones on the right include singleton entities. The upper two figures are for the MUC6 data set and the lower two are for the ACE 2002 corpus.
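The statistic plotted in Figure 7.2 (entities per 100 mentions, with or without singletons) can be computed as follows, with each document given as its list of coreference sets:

```python
from collections import Counter

def k_distribution(documents, include_singletons=True):
    """Relative frequencies of the (rounded) number of entities per
    100 mentions; dropping singleton sets mimics the true-mention
    setting without singleton entities."""
    values = []
    for sets in documents:
        if not include_singletons:
            sets = [s for s in sets if len(s) > 1]
        n_mentions = sum(len(s) for s in sets)
        if n_mentions:
            values.append(round(100 * len(sets) / n_mentions))
    return {k: count / len(values) for k, count in Counter(values).items()}
```

Comparing the two settings on the same corpus makes the noise injected by singleton system mentions directly visible.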
[Figure 7.2: The Distributions of k With and Without Singleton Entities. Four panels plot the frequency of the number of entities per 100 mentions: without singletons (left) and with singletons (right), for the MUC6 data set (top) and the ACE 2002 corpus (bottom).]

It can be seen that when using system mentions (i.e. the settings with singleton entities), the distributions of the number of entities contain a lot of noise compared with the true mention setting without singletons. Such noisy distributions make the prediction of k difficult to approach with regression methods. This motivates our proposed preference-based k model, which does not estimate the intrinsic distribution of k, but attempts to optimize the application F-score directly.

The Performance of Our Proposed k Model. With the set of features described in Section 4.3.4, Table 7.23 gives the performance for the classification step of our proposed k model. The true and false classes correspond to the decisions which prefer the first or the second partitioning. Since the upper bound of k is determined by simply counting the number of different mention strings, we obtain an approximately 1:6 ratio of positive to negative instances. The much bigger number of negative instances explains the low F-score the false class achieves.
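A toy sketch of this preference-based selection: candidate partitionings (one per candidate k, with k bounded by the number of distinct mention strings) are compared pairwise, and the trained preference classifier, here a hypothetical `prefer_first`, keeps the winner:

```python
def k_upper_bound(mention_strings):
    """Upper bound on k: the number of different mention strings."""
    return len(set(mention_strings))

def select_partitioning(partitionings, prefer_first):
    """Tournament over candidate partitionings: rather than regressing
    k directly, keep whichever partitioning the pairwise preference
    classifier prefers, yielding the final k implicitly."""
    best = partitionings[0]
    for candidate in partitionings[1:]:
        if not prefer_first(best, candidate):
            best = candidate
    return best
```

The point of the scheme is that the classifier can be trained to optimize the downstream coreference F-score rather than to model the noisy distribution of k itself.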
Although the classification performance does not directly correlate with the final coreference results, we observe empirically that improving the classification step boosts COPA's resolution results correspondingly.

Class    R      P      F
false  0.271  0.428  0.332
true   0.759  0.611  0.677

Table 7.23: k Model's Classification Performance on the CoNLL Development Data

Table 7.24 illustrates the performance of our proposed partitioning algorithms on the CoNLL development data and on the ACE 2004 development data. With the current set of k model features, the flatK partitioner does not show superiority over the R2 partitioner. However, it is potentially useful for incorporating global set-level information, such as the number of entities and the relations between entities. The numbers with bestK suggest the upper-bound performance of the flatK partitioner; the bestK setting chooses the k's which achieve the best coreference performance.

CoNLL              R2                    flatK                 flatK (bestK)
             R     P     F          R     P     F          R     P     F
MUC        59.99 61.82 60.89      60.04 60.99 60.51      60.51 61.97 61.23
B³sys      67.78 73.29 70.43      68.23 71.94 70.03      68.6  73.28 70.86
CEAFsys    46.72 44.93 45.81      45.97 45.02 45.49      46.86 45.42 46.13

ACE04              R2                    flatK                 flatK (bestK)
             R     P     F          R     P     F          R     P     F
MUC        63.3  70.9  66.9       63.5  70.8  67.0       61.8  78.8  69.3
B³sys      70.9  81.0  75.6       71.0  81.0  75.7       68.8  86.2  76.5
CEAFsys    71.8  67.4  69.6       71.8  67.5  69.6       71.9  69.3  70.6

Table 7.24: COPA R2 vs. flatK (with α⋆ = 0.07; bold indicates significant improvement in F-score over the others according to a paired t-test with p < 0.05)

7.7 Summary

In this chapter, our proposed model COPA is evaluated in various settings. For the model comparisons, we do not include the graph partitioning algorithm proposed by Nicolae & Nicolae (2006) as a baseline system, because our adopted baseline Bengtson & Roth (2008) is claimed to produce better performance than the previous systems.
We compare with the state-of-the-art systems developed after Bengtson & Roth (2008) using the CoNLL 2011 shared task setup.

COPA vs. Pairwise Models. By comparing COPA with two pairwise models in a strict manner (i.e. leaving only the models different), we show that the performance gains of our graph partitioning model come from the use of full contexts and the direct optimization of coreference sets. From the comparison experiments conducted on several corpora and with different evaluation metrics, we conclude that our global model consistently outperforms the pairwise methods.

COPA vs. the State-of-the-art. The CoNLL 2011 shared task allows us to compare our system with the state-of-the-art systems on the OntoNotes corpus, which is a big and well-annotated collection of documents. COPA participates with the R2 partitioner and performs competitively with only a limited amount of training documents (coming in second in the open track, and also belonging to the second group in the closed track). COPA works stably on different types of documents, such as news articles and speech transcripts, and incorporating new features is simple since the learning process is very lightweight.

COPA's Domain Adaptation & Weakly Supervised COPA. In order to further test the robustness of COPA, we also run experiments on a data set of clinical reports. The flatK partitioner is used in this setting, and the encouraging performance shows that COPA can be adapted easily to new domains by incorporating some domain-specific knowledge. In Section 7.5, more extensive experiments illustrate the weakly supervised nature of the COPA model. Our hypergraph model is shown to be stable with respect to the amount of training data. For the clinical data set, we need as little as five percent of the training data to achieve competitive performance.
This makes COPA a good choice when coreference resolution needs to be applied to new domains or new languages.

Our Proposed k Model. We analyze our proposed k model, which is designed to assist the flatK partitioner, in Section 7.6. We show statistics on the number of entities within documents and provide experimental numbers to show the current status of the model.

Graph models cannot deal well with positional information, such as the distance between mentions or the sequential ordering of mentions in a document. We implement distance information as weights on hyperedges, which results in decent performance. However, this is limited to pairwise relations and thus does not exploit the power of the high-degree relations available in COPA. We expect further improvements once we manage to include positional information directly. An error analysis reveals that there are some cluster-level inconsistencies in the COPA output, such as a cluster with the three mentions [Bill Clinton], [Clinton] and [Hillary Clinton], where [Bill Clinton] and [Hillary Clinton] are incompatible with each other. Enforcing consistency would require a global strategy to respect the constraints during the partitioning phase. We also explore constrained clustering algorithms in COPA, a field which has been very active recently (Basu et al., 2009). Constrained clustering methods should allow us to make use of negative information from the cluster-level perspective (see Chapter 8 for details).

Chapter 8

The Constrained COPA

The coreference resolution task is to cluster mentions into sets so that all mentions in one set refer to the same entity. COPA represents documents as hypergraphs, with relational features as hyperedges. Upon the hypergraphs, the system resorts to graph partitioning techniques to generate the final coreference sets.
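Cluster-level inconsistencies of the Clinton kind can be detected mechanically; a small sketch of a Cannot-Link consistency check over output clusters:

```python
def violated_constraints(clusters, cannot_links):
    """Return the Cannot-Link pairs that ended up in the same output
    cluster, e.g. [Bill Clinton] and [Hillary Clinton] merged through
    [Clinton] by implicit transitive closure."""
    cluster_of = {m: i for i, cluster in enumerate(clusters) for m in cluster}
    return [(a, b) for a, b in cannot_links
            if a in cluster_of and cluster_of[a] == cluster_of.get(b)]
```

Such a check only diagnoses violations after the fact; preventing them during partitioning is the subject of the constrained model below.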
The partitioning can be significantly improved using supervision in the form of pairwise constraints, e.g. pairs of mentions which are known to be in the same coreference set (Must-Link constraints) or in different ones (Cannot-Link constraints). The constraints provide top-down guidance to improve the output partitioning. While it is straightforward to interpret Must-Link constraints as highly weighted edges, there is no trivial way to include negative relations (i.e. Cannot-Link constraints) in a graph representation. Directly adding negative edges to a graph results in an NP-hard problem for the standard graph partitioning algorithms, although it can be addressed by specific algorithms such as correlation clustering (Bansal et al., 2002).

In this chapter, we include Cannot-Link constraints within the hypergraph partitioning framework of COPA without changing the already-adopted spectral clustering algorithms. The constrained COPA applies constrained data clustering algorithms to the vector representations in the spectral space, which are generated during the spectral clustering procedure. In this way, consistent partitions are found by both respecting the constraints and optimizing the normalized cut. From the supervision point of view, this work of including constraints can be viewed as a first step towards a better learning model for COPA. However, pairwise constraints only provide limited pairwise guidance. Improvements are expected from further exploring the learning phase of COPA.

Enforcing Transitivity in Coreference Resolution. In this chapter, we aim to show that including Cannot-Link constraints is helpful for the task of coreference resolution. In our hypergraph representation, the weight of a hyperedge indicates how close its incident vertices are to each other with respect to the corresponding relation. Vertices without edges between them can still be clustered into the same coreference set due to the transitive closure implicitly performed during the clustering process. Therefore, without any means of enforcing the constraints, inconsistent clusters can be derived. For example, when a mention [Bill Clinton] is connected with a mention [Clinton] in a graph, and at the same time a similarly weighted edge connects the mention [Clinton] and a mention [Hillary Clinton], the mentions [Bill Clinton] and [Hillary Clinton] end up in the same cluster despite the negative relation between them (e.g. different person names indicate different entities).

There have been attempts to enforce transitivity in coreference resolution, for instance, by imposing constraints on integer linear programming (ILP) (Finkel & Manning, 2008) or by disallowing inconsistent assignments during the optimization of graphical models (the second model in McCallum & Wellner (2005)). In contrast, we work on including constraints in graph partitioning algorithms, in order to generate more consistent coreference sets. We experiment with both artificially clean constraints and automatically generated ones. The experiments on clean constraints show significant improvements from applying our proposed constrained partitioning algorithm. However, our experimental results with generated constraints are mostly negative, due to the low coverage of the proposed constraints. Detailed discussions of the current problems and future work are also provided.

Previous efforts on including constraints in the coreference resolution task are introduced in Section 8.1.1, and the existing general-purpose constrained clustering algorithms are reviewed in Section 8.1.2. We describe our proposed algorithm in Section 8.3, and empirically analyze the performance of the constrained COPA in Section 8.5.
8.1 Background

8.1.1 Enforcing Transitivity in Coreference Resolution

It has been observed that two-step coreference systems (i.e. conducting a classification step and a clustering step) tend to generate inconsistent coreference sets. Since the negative predictions from the classification step are ignored, the transitivity of the coreference relation is not enforced explicitly in the clustering step.

Constrained Clustering Methods. Cardie & Wagstaff (1999) include constraints in their distance metric to modify the edge weights between mentions, and perform graph clustering algorithms on the modified graphs afterward. Building upon Cardie & Wagstaff's system, Wagstaff (2002) attempts to apply constrained clustering algorithms directly to the task (see her Chapter 5). To illustrate the contributions of the constraints, Wagstaff only compares against a system that does not use constraint information at all. For instance, the gender agreement indicator is excluded from the feature set of the baseline system. We argue that constraints can be straightforwardly incorporated into standard feature sets, and that simply excluding constraint information leads to a very low performance of the baseline system (see column 1 of Table 5.5 in Wagstaff (2002)).

Constrained ILP Models. Klenner (2007) and Finkel & Manning (2008) impose transitivity constraints on the integer linear programming (ILP) optimization to cluster the pairwise classification decisions into sets. With the constrained COPA, we enforce transitivity with one-step clustering algorithms. We also do not suffer from the expensive computational complexity of ILP models.

Constrained Probabilistic Models. McCallum & Wellner (2005) optimize the conditional probability of the global entity assignment, by casting the proposed graphical model as an equivalent graph partitioning problem — the correlation clustering problem (Bansal et al., 2002).
Correlation clustering operates on pairwise relations between data points, to derive partitions which respect the relations as much as possible. Since negative edges are allowed in such graphs, cluster-level consistency is taken care of directly. McCallum & Wellner use fully connected graphs with all mentions as vertices. We believe that the coreference relation can be represented in much sparser graphs, such as the ones adopted by COPA (see Chapter 4). Moreover, only a small number of negative relations between mentions need to be considered as constraints, rather than exhaustively using many trivial ones (i.e. the negative relations between mentions which are not likely to be clustered into the same set at all). In this thesis, we propose to guide the graph clustering algorithm to generate more consistent partitions using the selected Cannot-Link constraints.

Sapena et al. (2010) use a constraint-based approach (i.e. relaxation labeling) for coreference resolution with learned constraints. It is shown that the proposed model outperforms an ILP algorithm which enforces transitivity constraints. The work is conceptually similar to the constrained COPA, except that we focus on the standard graph-clustering setup.

Entity-mention Models. Entity-mention models (Luo et al., 2004; Yang et al., 2008; Culotta et al., 2007) maintain entity-level consistency through their incremental manner of processing. As the entities grow, entity-level information accumulates, and within-entity consistency is thereby maintained. Despite their improved expressiveness, entity-mention models have not yet yielded particularly encouraging results (Ng, 2010), possibly due to severe error propagation.

8.1.2 Literature on Constrained Clustering

Due to the unsupervised nature of clustering algorithms, the obtained clusters may not necessarily be consistent with the domain knowledge of interest.
For instance, in the image segmentation task, while one expects to cluster portraits of persons by gender, it is still possible to obtain clusters separating portraits with and without glasses. Constrained clustering allows one to specify prior (domain) information about clusters to guide the clustering process, in order to avoid creating spurious partitions.

Constrained Data Clustering. Most previous efforts of including constraints in clustering algorithms have been on data which can be represented as vectors. Wagstaff & Cardie (2000) propose to modify the standard k-means algorithm (MacQueen, 1967) to make sure that no constraint is violated while assigning data points to clusters. Basu et al. (2002) use annotated data points to form k-means's initial clusters and to constrain the following assignments. Instead of modifying the assignment method of k-means, one can also learn distance metrics from pairwise constraints (Bar-Hillel et al., 2003; Klein et al., 2002; Xing et al., 2003). Basu et al. (2004) propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs). Recently, this area has greatly expanded to include algorithms that leverage additional domain knowledge for the purpose of clustering (Basu et al., 2009).

Constrained Graph Clustering. For tasks where relations are of greater interest than the data points themselves (e.g. the coreference resolution task, which focuses on identifying the coreference relation) or where data vectors are not directly available, graph clustering is more appropriate than data clustering techniques. There is only little work on constrained graph clustering. Kamvar et al. (2003) modify the similarities between the constrained data items and then apply classifiers in the spectral space, so that spectral clustering is transformed into spectral classification.
Our proposed constrained COPA shares the spirit of making use of the data representation in the spectral space, but we do not apply classification steps. Kulis et al. (2005) construct appropriate kernels including constraint penalties, with which kernel k-means (Dhillon et al., 2004) can be applied to iteratively optimize the corresponding objective functions. There are also attempts to combine pairwise constraints with the normalized cut directly, but only with Must-Link constraints (Yu & Shi, 2004) or only for two-class problems (Coleman et al., 2008). In the constrained COPA, we combine a simple constrained data clustering algorithm (Wagstaff & Cardie, 2000) with our hypergraph spectral clustering algorithms (see Chapter 4) via the spectral embedding. With our constrained clustering algorithm, we avoid modifying the constructed graphs or changing the objective functions of the original partitioning algorithms.

8.2 Inconsistency Analysis on Output Coreference Sets

Before introducing our proposal of the constrained COPA, we first provide examples of inconsistent coreference sets generated by the basic COPA (Chapter 4). Since we only focus on pairwise Cannot-Link constraints in this chapter, the inconsistent sets are defined as the ones containing at least one pair of mentions which do not corefer. By illustrating the spurious coreference set examples, we motivate the proposal of the constrained COPA. The analysis in this section is conducted on the OntoNotes development set (see Section 3.3), and COPA's CoNLL evaluation numbers are given in Table 8.1.

              MUC     BCUBED   CEAF(E)   overall
     R       60.87    68.76     46.18
     P       61.92    72.57     45.15
     F1      61.39    70.61     45.66     59.22

Table 8.1: COPA R2 partitioner's results on the OntoNotes development set using the CoNLL metrics

Frequency of Inconsistent Clusters.
We collect the output coreference sets where there is at least one pair of mentions belonging to different entities. The inconsistencies are only measured between the mentions which are not twinless (mentions which are not aligned with true mentions are called twinless (Stoyanov et al., 2009)), so that their ground-truth annotations are available and the effect of mention detection is not taken into account. From Table 8.2, it can be seen that around 1/6 of the output clusters from the basic version of COPA contain inconsistent mentions, occurring in half of the documents.

     Overall Output Clusters    3097 (in 202 documents)
     Inconsistent Clusters       484 (in 102 documents)

Table 8.2: Inconsistent Output Clusters from the COPA R2 partitioner on the OntoNotes Development Set

Although only a small portion of the output clusters contain inconsistencies, we believe that the problem will become more severe when more relational features are included and when the graph structure becomes richer. Since the negative relations are taken as negative features in COPA (see Chapter 5), the violated ones in the output result from the partitioning phase only. Our objective here is to guide the partitioning algorithm with cluster-level information. It is worth noting that although the Cannot-Link constraints adopted in this chapter are pairwise, the consistencies are enforced on the cluster level.

Inconsistent Cluster Examples. With the inconsistent cluster examples, we aim to illustrate how they are generated via the transitive closure automatically performed during the partitioning procedure. In the examples, the subscripts of the square brackets (i.e. []) indicate the true entity assignments and those of the curly brackets (i.e. {}) give the system output. In Example (1), the mention {[He]} is wrongly cut away from the entity JUSTICE ANTONIN SCALIA, and is grouped with the LAURANCE TRIBE entity, whose name indicates female gender.
This mistake is generated via the connection between the mentions {[He]} and {[Tribe]}. It shows that solely activating a negative feature between {[He]} and {[Laurance Tribe, Gore's attorney]} does not prevent this inconsistent cluster in the output. A better partitioning can be expected for this example when the cluster-level gender agreement constraint is respected.

Example (1): {[Laurance Tribe, Gore's attorney]1}1, said the state court did nothing illegal. {[Justice Antonin Scalia]2}2 also pressed {[Tribe]1}1. {[He]2}1 said the state court relied on the Florida Constitution to draft its decision.

In Example (2), both entities ANY ECONOMIC THEORY and AN ECONOMIC THEORY are only active locally (i.e. in their own sentences). However, they are mistakenly linked together via the definite expression {[the theory]}. Since indefinite noun phrases most likely introduce new entities, the connections between {[an economic theory]} and its preceding mentions should be forbidden. This can be easily interpreted as a Cannot-Link constraint.

Example (2): For example your uncle, using {[any economic theory]1}1, the probability that {[it]1}1 will be accurate is virtually 0. So whenever you discuss {[an economic theory]2}1 with someone, the response would be: My uncle isn't like that, so {[the theory]2}1 is baloney.

In Example (3), the mention {[him]} is clustered together with the mention {[He]}. This violates Principle B of the binding theory (see Section 2.1). When the principle is respected, the resolution of the mention {[He]} can be indicated by the observation that the entity RUSSIAN FOREIGN MINISTER IGOR IVANOV is more salient (i.e. in the subject position of the sentence) than the entity KOSTUNICA in this context.

Example (3): {[Russian Foreign Minister Igor Ivanov]1}1 congratulated {[Kostunica]2}2 on {[his]2}2 election victory.
{[He]1}1 also gave {[him]2}1 a letter from Russian President Vladimir Putin.

The examples introduced in this section convey that simply preventing links between non-coreferent mentions, as suggested by the negative features, does not ensure within-cluster consistency in the output. The examples also indicate that the partitioning algorithms should be improved with the guidance of linguistic knowledge. In this chapter, we focus on guidance information in the form of Cannot-Link constraints, and address the problem by proposing a constrained hypergraph partitioning algorithm.

8.3 Our Proposal — the Constrained COPA

In this section, we propose to combine constrained data clustering algorithms with our hypergraph spectral clustering algorithms via the spectral embedding. The proposed method avoids changing the objective function of the adopted hypergraph clustering algorithms. It also avoids propagating the constraints on the originally constructed hypergraphs. Our proposal makes it feasible to apply different constrained data clustering algorithms within the spectral graph clustering framework. A simple constrained data clustering algorithm, COP-KMeans, is introduced in Section 8.3.1, and our variant of COP-KMeans in Section 8.3.2. Section 8.3.3 describes our proposal of combining the modified COP-KMeans with COPA via the spectral embedding, in order to tackle the constrained hypergraph clustering problem.

8.3.1 Constrained Data Clustering — COP-KMeans

The standard k-means algorithm (MacQueen, 1967) iteratively assigns data points to their closest clusters, and converges when there are no more changes in the cluster assignments. The k-means algorithm solely depends on the intrinsic distributions of the given data sets. Wagstaff & Cardie (2000) provide a modified version of the k-means algorithm which makes use of background knowledge expressed as pairwise constraints.
Their proposed variant COP-KMeans respects the pairwise constraints during the cluster assignment process. The algorithm disallows assignments which violate constraints, therefore resulting in consistent partitions. Two types of pairwise constraints are widely adopted and serve as the input to COP-KMeans:

• A Must-Link constraint suggests that the given pair of data points should belong to the same cluster.
• A Cannot-Link constraint suggests that the given pair of data points should not belong to the same cluster.

Algorithm 6 gives the details of COP-KMeans. Lines 4 and 5 of the algorithm contain the modifications COP-KMeans makes to the standard k-means algorithm. Instead of assigning a data point to the closest cluster, COP-KMeans checks for constraint violations first. Only the clusters which do not violate any given constraints are considered in the assignment.

Algorithm 6 COP-KMeans Algorithm (single iteration) (Wagstaff & Cardie, 2000)
 1: input: data set D, must-link constraints Con= ⊆ D × D, cannot-link constraints Con≠ ⊆ D × D
 2: Let C1 ... Ck be the initial cluster centers
 3: for each point di in D do
 4:   Assign di to the closest cluster Cj such that ViolateConstraints(di, Cj, Con=, Con≠) is false
 5:   If no such cluster exists, fail (return ∅)
 6: end for
 7: for each cluster Ci do
 8:   Update the center of Ci by averaging all of the points dj that are assigned to Ci
 9: end for
10: return the partitioned clusters C1 ... Ck

The ViolateConstraints function in Algorithm 7 shows that the pairwise constraints are enforced in a hard manner in COP-KMeans. No partitioning output is generated when no single assignment respects all given constraints (Line 5 of Algorithm 6).
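A minimal executable sketch of the constrained assignment step (Algorithms 6 and 7) may help make this concrete. The function names and the 1-D toy data below are our own assumptions for illustration, not Wagstaff & Cardie's implementation.

```python
import numpy as np

def violates(d, cluster_id, assign, must, cannot):
    """Algorithm 7: True iff placing point d in cluster cluster_id
    breaks a Must-Link (partner placed elsewhere) or a Cannot-Link
    (partner placed in the same cluster), given assignments so far."""
    for a, b in must:
        o = b if a == d else a if b == d else None
        if o is not None and o in assign and assign[o] != cluster_id:
            return True
    for a, b in cannot:
        o = b if a == d else a if b == d else None
        if o is not None and assign.get(o) == cluster_id:
            return True
    return False

def cop_kmeans_iter(X, centers, must=(), cannot=()):
    """One COP-KMeans pass (Algorithm 6): assign each point to the
    closest center that violates no constraint; fail (None) if none."""
    assign = {}
    for d, x in enumerate(X):
        order = np.argsort([np.linalg.norm(x - c) for c in centers])
        for j in order:
            if not violates(d, int(j), assign, must, cannot):
                assign[d] = int(j)
                break
        else:
            return None  # no consistent assignment exists
    return assign

X = np.array([[0.0], [0.1], [5.0], [5.1]])
centers = np.array([[0.0], [5.0]])
# Without constraints, points 0,1 join cluster 0 and points 2,3 cluster 1:
print(cop_kmeans_iter(X, centers))                    # {0: 0, 1: 0, 2: 1, 3: 1}
# A Cannot-Link between points 0 and 1 forces them apart:
print(cop_kmeans_iter(X, centers, cannot=[(0, 1)]))   # {0: 0, 1: 1, 2: 1, 3: 1}
```

Note how the second call reproduces the hard enforcement described above: point 1 is pushed to the second-closest cluster rather than being allowed to violate the constraint.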
Algorithm 7 ViolateConstraints Function (Wagstaff & Cardie, 2000)
 1: input: data point d, cluster C, must-link constraints Con= ⊆ D × D, cannot-link constraints Con≠ ⊆ D × D
 2: for each (d, d=) ∈ Con= do
 3:   if d= ∉ C then
 4:     return true
 5:   end if
 6: end for
 7: for each (d, d≠) ∈ Con≠ do
 8:   if d≠ ∈ C then
 9:     return true
10:   end if
11: end for
12: return false

8.3.2 Our Variant of COP-KMeans

Since COPA is an end-to-end system which works in a noisy environment, enforcing constraints in a hard way, as COP-KMeans does, can be problematic. We propose a variant of COP-KMeans which instead minimizes the number of violated constraints. The proposed VD-KMeans is given in Algorithm 8, with the modification in Line 4 replacing the ViolateConstraints function with the ViolationDegree function (see Algorithm 9). ViolationDegree counts the number of violated Cannot-Link constraints incurred by assigning a data point to a cluster, and VD-KMeans simply chooses the cluster with the smallest violation degree, falling back to the closest cluster when the violation degrees are equal.

Algorithm 8 VD-KMeans Algorithm (single iteration)
 1: input: data set D, cannot-link constraints Con≠ ⊆ D × D
 2: Let C1 ... Ck be the initial cluster centers
 3: for each point di in D do
 4:   Assign di to the cluster Cj with the smallest ViolationDegree(di, Cj, Con≠)
 5:   For clusters with the same violation degree, choose the closest one
 6: end for
 7: for each cluster Ci do
 8:   Update the center of Ci by averaging all of the points dj that are assigned to Ci
 9: end for
10: return the partitioned clusters C1 ... Ck

Algorithm 9 ViolationDegree Function
 1: input: data point d, cluster C, cannot-link constraints Con≠ ⊆ D × D
 2: for each (d, d≠) ∈ Con≠ do
 3:   if d≠ ∈ C then
 4:     Increase the violation degree: vdCnt++
 5:   end if
 6: end for
 7: return vdCnt

We only consider Cannot-Link constraints in the constrained COPA, as Must-Link constraints can be straightforwardly incorporated as highly weighted hyperedges in our hypergraph models.

8.3.3 Constrained Hypergraph Spectral Clustering

Hypergraph-based spectral clustering has been introduced in Section 4.2.2. In short, spectral clustering reduces the data dimensionality by using the eigenvectors of the graph Laplacians. The resulting vector representation of the data set is the spectral embedding. For convenience, we start by revisiting some basic notation.

Hypergraph Normalized Cut. When the normalized cut (Ncut (Shi & Malik, 2000)) is adapted to hypergraphs (Zhou et al., 2007), it preserves the intuition that a good partitioning cuts as few hyperedges as possible while leaving the resulting partitions as dense as possible. The hypergraph Ncut for a k-partitioning P_k is defined by

    Ncut(P_k) = \sum_{1 \le i \le k} \frac{vol \, \partial V_i}{vol \, V_i}    (8.1)

where P_k = {V_i | V = V_1 ∪ V_2 ∪ ··· ∪ V_k} and V_i ∩ V_j = ∅ for all 1 ≤ i, j ≤ k with i ≠ j. The volume vol V_i gives the within-cluster density of the vertex set V_i. The volume of the hyperedge boundary, vol ∂V_i, measures the hyperedges to be cut in order to derive V_i as a cluster. The objective of our partitioning algorithm is therefore to minimize Equation 8.1.

The Spectral Embedding. The Ncut value can be minimized using a relaxation approach, which approximates discrete cluster memberships with continuous real numbers. The approximation can be obtained by solving the eigenproblem of the hypergraph Laplacian:

    L = I − D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}    (8.2)

Let (λ_i, v_i), i = 1, . . .
, n, be the eigenvalues and the associated eigenvectors of L, where 0 ≤ λ_1 ≤ ··· ≤ λ_n and ‖v_i‖ = 1. The continuous solution to the Ncut minimization is then provided by a new low-dimensional data representation X:

    X = (v_1, ···, v_k)    (8.3)

where X is called the k-th order spectral embedding of the graph. It has been shown that k generally equals the number of clusters (Ng et al., 2001). A standard data clustering algorithm, such as k-means, can then be applied to cluster the graph nodes in the new space.

Applying Constrained Data Clustering Algorithms to the Spectral Embedding. Figure 8.1 illustrates our proposal of the constrained spectral graph clustering algorithm. The Cannot-Link constraints are extracted from the graph to be partitioned, and are imposed on the generated spectral embedding. Since the spectral embedding transforms the original graph into a vector representation of its vertices, constrained data clustering algorithms can be directly applied.

[Figure 8.1: Illustration of Constrained Spectral Graph Clustering — the graph's vertices are mapped to the spectral embedding, on which constrained data clustering with Cannot-Link constraints produces the output subgraphs.]

8.3.4 Constrained COPA Partitioners

COPA implements a hierarchical multi-class partitioner, the R2 partitioner, which recursively bi-partitions the hypergraph until a stopping criterion (i.e. α∗) is reached (see Section 4.3.3.1). We propose to apply constraints at each recursion of the R2 partitioner. The resulting ConR2 partitioner is outlined in Algorithm 10. The ConR2 partitioner continues to bi-partition when the Ncut value is smaller than α∗ or when fewer constraints are violated compared with the input hypergraph (i.e. Line 8). The current bi-partition is not accepted when the constraint violations do not become fewer after partitioning (i.e. Line 11). VD-KMeans is used as the data clustering algorithm, taking the spectral embedding and Cannot-Link constraints as input.
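The pipeline of Section 8.3.3 can be sketched end-to-end on a toy hypergraph: build the normalized hypergraph Laplacian of Equation 8.2, take its k smallest eigenvectors as the spectral embedding (Equation 8.3), and run a VD-KMeans-style assignment on it. This is a simplified single-pass illustration under our own naming, not COPA's implementation (in particular, COPA's R2 partitioner recursively bi-partitions rather than clustering once).

```python
import numpy as np

def spectral_embedding(H, w, k):
    """k-th order spectral embedding of a hypergraph with incidence
    matrix H (|V| x |E|) and hyperedge weights w, via the Laplacian
    L = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} (Equation 8.2)."""
    dv = H @ w                          # vertex degrees d(v)
    de = H.sum(axis=0)                  # hyperedge degrees delta(e)
    Dv_is = np.diag(1.0 / np.sqrt(dv))
    L = np.eye(H.shape[0]) - Dv_is @ H @ np.diag(w / de) @ H.T @ Dv_is
    _, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    return vecs[:, :k]                  # eigenvectors of the k smallest

def vd_assign(X, centers, cannot=()):
    """VD-KMeans-style assignment (Algorithms 8 and 9): pick the
    cluster with the smallest Cannot-Link violation degree, breaking
    ties by distance, given the assignments made so far."""
    assign = {}
    for d, x in enumerate(X):
        def degree(j):  # violated Cannot-Links if d joins cluster j
            return sum(1 for a, b in cannot
                       if (a == d and assign.get(b) == j)
                       or (b == d and assign.get(a) == j))
        dist = [float(np.linalg.norm(x - c)) for c in centers]
        assign[d] = min(range(len(centers)),
                        key=lambda j: (degree(j), dist[j]))
    return assign

# Two disjoint hyperedges {0,1} and {2,3}; the embedding separates them.
H = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
X = spectral_embedding(H, np.array([1., 1.]), k=2)
centers = [X[0], X[2]]
print(vd_assign(X, centers))                    # {0: 0, 1: 0, 2: 1, 3: 1}
print(vd_assign(X, centers, cannot=[(0, 1)]))   # {0: 0, 1: 1, 2: 1, 3: 1}
```

Unlike COP-KMeans, the second call still returns a partition even though the constraint contradicts the graph structure: the violation is minimized rather than treated as a hard failure.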
Algorithm 10 ConR2 partitioner
 1: input: target hypergraph HG, Cannot-Link constraints CN, α∗
 2: Count the violated constraints VioCnt for the input HG
 3: Solve for the 2nd-order spectral embedding SE
 4: Generate two sub-HGs using VD-KMeans(SE, CN)
 5: Count the violated constraints VioCnt1, VioCnt2 for the two sub-HGs
 6: if min_i(Ncut_i) < α∗ OR both VioCnt_i's are smaller than VioCnt then
 7:   for each sub-HG do
 8:     Bi-partition the sub-HG with the R2 partitioner
 9:   end for
10: else
11:   if any VioCnt_i is bigger than or equal to VioCnt then
12:     Output the input HG
13:   else
14:     Output the current sub-HGs
15:   end if
16: end if
17: output: partitioned HG

The R2 partitioner optimizes the bi-partition at each recursion step. However, due to its hierarchical nature, it is not guaranteed that the final output clusters are globally optimal. To overcome this problem, we experiment with the flatK partitioner (see Algorithm 2) as well.² However, the ConflatK partitioner is not covered in this chapter.

² With the constrained ConflatK partitioner, k clusters are output simultaneously. The VD-KMeans algorithm is again applied to the k-th order spectral embedding of the input hypergraph, and directly outputs the final clusters. The model used to predict k is introduced in Section 4.3.4.

8.4 Cannot-Link Constraints for Coreference Resolution

The Difference Between Negative Features and Cannot-Link Constraints. In this section, we describe the Cannot-Link constraints proposed for coreference resolution. The Cannot-Link constraints are negative relations between pairs of mentions, and are at the same time taken as negative features too. Negative features in COPA prevent hyperedges from being built during the graph construction phase, while the Cannot-Link constraints guide the partitioners in the inference procedure.
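To make the distinction concrete, a Cannot-Link constraint can be thought of as a predicate over a mention pair; the sketch below illustrates this with a gender-agreement check in the spirit of the CN Gender constraint. The Mention class and its attributes are assumptions for this example, not COPA's data structures.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    text: str
    gender: Optional[str] = None  # "m", "f", "n", or None if unknown

def cn_gender(m1, m2):
    """Cannot-Link when both genders are known and disagree."""
    return (m1.gender is not None and m2.gender is not None
            and m1.gender != m2.gender)

def cannot_links(mentions, predicates):
    """Collect Cannot-Link pairs (as index pairs) over all mention
    pairs for which any predicate fires."""
    return [(i, j)
            for i in range(len(mentions))
            for j in range(i + 1, len(mentions))
            if any(p(mentions[i], mentions[j]) for p in predicates)]

doc = [Mention("Hillary Clinton", "f"),
       Mention("he", "m"),
       Mention("the senator")]          # gender unknown: no constraint
print(cannot_links(doc, [cn_gender]))   # [(0, 1)]
```

The same pair list can then serve double duty, as described above: as negative features blocking hyperedge construction, and as Cannot-Link constraints handed to the constrained partitioner.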
Duplicating the constraints as negative features enables us to analyze the contributions which come solely from the constrained clustering algorithm.

(1) CN Gender
– Two mentions do not agree in gender.
– For instance, the mentions [Hillary Clinton] and [he] should not be clustered into one set due to their incompatible gender.

(2) CN ContraMod
– Two mentions have the same syntactic heads, and the anaphor has a modifier which does not occur in the antecedent or which contradicts the modifiers of the antecedent.
– For instance, a Cannot-Link constraint is built between [1,000 coal rail cars] and [the 1,450 coal rail cars], as the two mentions contain different quantitative modifiers.

(3) CN ContraGPE
– Two mentions realizing different GPEs should not be in one set.
– For instance, a negative relation exists between the mentions [Syria] and [Lebanon] because they are different countries. A gazetteer consisting of lists of country names and city names is consulted to compute this constraint.

(4) CN ContraSubjObj
– Two mentions are in the subject and object positions of a non-copular verb, and the anaphor is not a possessive pronoun.
– Consider the text "[John] talks to [him]", where the mention [John] should not be coreferent with the pronoun [him]. The dependency tree is used to identify the verbs on which the mentions depend. This constraint is derived from Principle B of the binding theory (Section 2.1.2).

(5) CN Span
– A mention spanning another one cannot be linked to it, except for RoleAppositive cases.
– Consider the embedded mentions [[his] brother]: the two should not be clustered together.

(6) CN ContraPerson
– Two person mentions with different names cannot be linked.
– For instance, the mention [Mr. Wright] should not be coreferent with the mention [Mr. Valenti] due to the different family names of the two person entities.

The Cleanness of the Proposed Constraints.
Table 8.3 analyzes the cleanness of the proposed constraints. The statistics correspond to the frequencies with which the constraints hold on the OntoNotes training data. The negative signs in the table indicate that the Cannot-Link constraints are negative relations between mentions.

     Constraints              Statistics
     (1) CN Gender              -0.993
     (2) CN ContraMod           -0.980
     (3) CN ContraGPE           -0.992
     (4) CN ContraSubjObj       -0.997
     (5) CN Span                -0.996
     (6) CN ContraPerson        -0.961

Table 8.3: The Cleanness of the Cannot-Link Constraints on the OntoNotes Training Set

8.5 Experiments on the Constrained COPA

Experimental Settings. In this section, we experiment with the proposed constrained COPA. The numbers are reported on the OntoNotes development set, using the unweighted average of MUC, BCUBED and CEAF(E) (i.e. the final score of the CoNLL 2011 shared task). The setting of COPA using the R2 partitioner is denoted R2; the setting R2+N Feats additionally includes the Cannot-Link constraints as negative features. The baseline system PostR2 encodes the standard k-means algorithm and keeps bi-partitioning until there are no more violated constraints. ConR2 corresponds to the constrained COPA proposed in this chapter. In Section 8.5.1, we first experiment with clean constraints generated from the ground-truth annotations. Such an upper-bound setting allows us to evaluate the proposed method while excluding the effect of the constraint generation phase. The automatically generated constraints are tested in Section 8.5.2, where the constrained COPA performs in a fully automatic manner.

8.5.1 Experiments with Artificial Clean Constraints

The Generation of Clean Constraints. The clean constraints are only generated for the mentions which can be aligned with true mentions. In this way, the noise brought by the twinless mentions is still kept; otherwise, building clean constraints for all mentions would directly remove the spurious ones.
There are a total of 144, 858 clean constraints generated for the OntoNotes development set. ConR2 vs. Baselines. Table 8.4 gives the performance of our proposed constrained COPA with clean constraints. The difference between PostR2 and ConR2 is that PostR2 only uses the constraints as the stopping criterion for the recursive partitioning, but ConR2 actually guides the partitioning inference with the constraints. MUC BCUBED CEAF(E) overall R R2 P F 60.85 68.68 46.19 61.93 72.59 45.13 61.39 70.58 45.66 59.21 R2+N Feats R P F 61.81 69.6 47.85 64.06 76.28 45.72 62.92 72.78 46.76 60.82 R PostR2 P F 59.6 67.62 49.47 64.67 78.58 44.8 62.03 72.69 47.02 R ConR2 P F 62.66 69.8 50.03 67.6 80.0 45.49 65.03 74.55 47.65 60.58 Table 8.4: ConR2 vs. Baselines with Clean Constraints on the OntoNotes Development Set (bold indicates significant improvement in F-score over PostR2 according to a paired-t test with p < 0.05) The improvement ConR2 achieves compared with the setting R2+N Feats demonstrates the contribution which is solely from the proposed algorithm. The precision of all metrics (except for CEAF(E)) are improved by using the constrained clustering algorithm. This is not surprising given the fact that Cannot-Link constraints are applied to prevent spurious linkages. Gains on recall are observed too. Since constraints participate in the partitioning decisions when using ConR2, the recall improvements suggest that the corrections on some mentions (which are involved in the constraints) also improve the resolutions of others. The baseline system PostR2 greedily partitions the clusters which violate constraints, without incorporating constraint information into the partitioning decisions. The PostR2 results also produce higher precision (except for the CEAF(E) metric), but suffer from a bigger loss in recall. This confirms again that the constraints need to be enforced on the cluster level during the partitioning inference. 
ConR2 with Randomly Sampled Constraints. Figure 8.2 plots the performance curves of ConR2 with an increasing number of Cannot-Link constraints. The constraints used here are randomly sampled from the full set of clean constraints introduced previously. It is worth noting that all the original clean constraints are included as negative features throughout the experiments; only the ones used as Cannot-Link constraints differ in size. Therefore the leftmost points in all three plots correspond to the performance of the R2+N Feats model.

[Figure 8.2 consists of three plots, showing MUC, BCUBED and CEAF(E) F-measure (%) against the percentage of applied constraints.]

Figure 8.2: ConR2 Performance with Increasing Size of Clean Constraints

Figure 8.2 shows that ConR2 only outperforms R2+N Feats when more than 80% of the constraints (around 115,880) are used. Smaller sets of constraints generate worse performance compared with the R2+N Feats system, which does not use constraints at all. A possible explanation is that many constraints help to generate balanced clusters, while a few can easily skew the ConR2 partitioner. This demonstrates a drawback of the proposed algorithm: enforcing the constraints has a higher priority than deriving a good partitioning.

We conduct another group of experiments by adding noise constraints, as shown in Figure 8.3. Noise constraints are randomly sampled, and are added on top of the full set of the clean constraints. The straight lines in all plots indicate the performance of the baseline R2+N Feats.
ConR2’s performance drops below the baseline soon after about 10% noise constraints are included, and keeps decreasing quickly.

[Figure 8.3 consists of three plots, showing MUC, BCUBED and CEAF(E) F-measure (%) for ConR2 and R2+N Feats against the ratio of noise constraints to clean constraints.]

Figure 8.3: ConR2 Performance with the Increasing Size of Noise Constraints

In this section, we experiment with artificial Cannot-Link constraints using the proposed constrained COPA. We analyze the influence of the size of the applied constraints and of the amount of noise constraints (i.e. incorrect constraints) involved. Significant improvement is achieved when a big enough set of constraints is provided and when the set consists of less than 10% spurious ones. The experiments on the randomly sampled clean constraints suggest a reasonable recall range for designing the real constraints, and the experiments on the noise constraints hint at a proper precision range. In the following section, experiments with the automatically generated constraints (i.e. the real constraints) are provided.

8.5.2 Experiments with Automatically Generated Constraints

ConR2 vs. R2+N Feats. Table 8.5 shows the results of ConR2 using the Cannot-Link constraints proposed in Section 8.4. Since the constraints are already included as negative features in the basic COPA, the R2 performance in Table 8.4 is the same as the baseline performance in Table 8.5 (i.e. R2+N Feats).
          R2+N Feats            ConR2
          R     P     F         R     P     F
MUC       60.85 61.93 61.39     59.58 61.77 60.66
BCUBED    68.68 72.59 70.58     67.57 73.22 70.28
CEAF(E)   46.19 45.13 45.66     46.60 44.47 45.51
overall               59.21                 58.82

Table 8.5: ConR2 vs. R2+N Feats with Automatically Generated Constraints on the OntoNotes Development Set

From the statistics provided in Table 8.3, it can be seen that more than 90% of our automatically generated constraints are correct. The previous section demonstrated this to be a good proportion for improving upon R2+N Feats. However, ConR2 yields worse results than the baseline system. This can be partially explained by the small size of the set of applied constraints, which is 12,555 for the entire development set. The contributions of the proposed constraints are illustrated in Table 8.6, ordered in accordance with the cleanness of the constraints. Increases in precision are observed for both MUC and BCUBED, but a bigger loss in recall consistently occurs. The current constrained COPA unfortunately generates negative results. A detailed inspection shows that several inconsistent output clusters (see Table 8.2) are not covered by the proposed constraints. For instance, (2) CN ContraMod does not capture the negative relation between the mentions [China’s Red Cross Society] and [the international Red Cross Organization]. Since the current constraints target high precision, more high-recall ones should be developed.
                        MUC                   BCUBED                CEAF(E)
                        R     P     F         R     P     F         R     P     F
(4) CN ContraSubjObj    60.22 61.73 60.97     68.19 72.80 70.42     46.26 44.79 45.51
+ (5) CN Span           60.22 61.78 60.99     68.15 72.86 70.42     46.33 44.81 45.56
+ (1) CN Gender         59.93 61.76 60.83     67.87 73.01 70.35     46.42 44.63 45.51
+ (3) CN ContraGPE      59.85 61.70 60.76     67.78 72.95 70.27     46.44 44.63 45.52
+ (2) CN ContraMod      59.69 61.74 60.70     67.68 73.10 70.28     46.53 44.53 45.51
+ (6) CN ContraPerson   59.58 61.77 60.66     67.57 73.22 70.28     46.60 44.47 45.51

Table 8.6: The Contributions of the Proposed Cannot-Link Constraints

Solved Example by ConR2. Although ConR2 does not generate promising results yet, we now show an example which is solved by applying the constrained clustering algorithm. Figure 8.4 shows the output clusters of the basic version of COPA, where the entity PRESIDENT SLOBODAN MILOSEVIC is mistakenly mixed with the entity PRESIDENT PUTIN. This happens because both persons are male presidents, and they are linked together via other mentions such as [the president] and [he].

Figure 8.4: Example Output Clusters Using the Basic COPA

By applying the constrained COPA, it can be seen from Figure 8.5 that the two entities are correctly resolved thanks to the constraint (6) CN ContraPerson.

[Figure 8.5 shows two separate output clusters, labeled President Slobodan Milosevic and President Putin.]

Figure 8.5: Example Output Clusters Using the Constrained COPA

8.6 Summary

Incorporating Constraints into Coreference Resolution. In this chapter, we consider a general problem for the clustering field. Due to the transitive closure which is implicitly computed during the clustering phase, counter-intuitive clusters can be derived. This is also an issue for the coreference resolution task when the coreference sets are generated by clustering models. For instance, the mention [a Norwegian Transport Ship] is clustered together with a preceding mention [The damaged ship] via another mention [the ship] which appears later in the document.
However, the indefinite article "a" strongly indicates that the mention [a Norwegian Transport Ship] is not anaphoric. Such information can be interpreted as pairwise constraints: Must-Link requires two mentions to be in one cluster, and Cannot-Link forbids it. In order to generate consistent coreference sets, there has been previous work on enforcing transitivity for coreference resolution (e.g. Finkel & Manning (2008)) and on applying correlation clustering to incorporate negative edges in graphs (e.g. McCallum & Wellner (2005)). In this thesis, we focus on incorporating the pairwise constraints within the graph spectral clustering framework.

Our Proposal: Constrained COPA. In this chapter, we extend the basic version of COPA in order to guide the partitioning algorithms with pairwise constraints. Since the Must-Link constraints can be straightforwardly included as strong edges in a graph model, we only deal with Cannot-Links for now. We propose to combine constrained data clustering algorithms with hypergraph spectral clustering algorithms via the spectral embedding. In this way, we address the constrained graph clustering problem without changing the clustering objective function or modifying the originally constructed graph structures. We conduct experiments with the constrained COPA on both the artificial clean constraints and the automatically generated ones. The experiments on clean constraints allow us to study the effect of the size of the constraint set and of the proportion of noise on the proposed algorithm. Although the improvement achieved by using the clean constraints is significant, our results on the automatically generated ones are unfortunately negative. The possible reason is that the current Cannot-Link constraints do not have enough coverage on the data set.
Testing with constraints of such small coverage does not convey the effectiveness of the algorithm, especially when the number of inconsistent clusters to be solved is not very big in the first place.

Future Work. Since the number of inconsistent clusters will grow when the graph structures become richer, the importance of providing prior information to guide the clustering algorithms remains. Our proposed method provides a way to address the problem with relatively little effort spent on adapting the original clustering algorithms. The next step for us is to include more constraints in order to explore the potential of the constrained COPA. We currently exclude negative relations such as semantic class agreement and number agreement, to avoid too much noise. However, the experiments with clean constraints suggest that up to 10% noise is tolerable, a bound which both of these relations satisfy. So it will be reasonable to include more high-recall constraints in the future.

Chapter 9

Conclusions

Natural Language Processing (NLP) tasks process texts automatically on the syntactic, semantic and pragmatic levels, targeting full text understanding. Coreference resolution has been one of the most fundamental NLP tasks for decades; it links the referring expressions of the same entities into sets. From a pragmatic point of view, a text can be considered as a collection of entities and the relations between them. Resolving the referring expressions therefore enables us to identify the entities in a document. Furthermore, the local contexts of the different occurrences of an entity are implicitly merged via the coreference relation built between the referring expressions. This makes it easier to extract the relations between entities from their enlarged context. In the introduction of this thesis, we interpret the coreference relation as a high-dimensional relation, which can be derived from multiple basic relations (e.g.
string similarity and semantic relatedness). Unlike the previous methods, which collapse the basic relations before the inference step, we aim to maintain the basic relations until the final inference procedure. In order to do so, we propose a hypergraph model to represent a document as shown in Figure 9.1 (a).

Figure 9.1: COPA Example: Processing Illustration

The thesis presents our proposed coreference system COPA, an end-to-end hypergraph-partitioning-based model. Upon the hypergraph representation of documents, partitioning algorithms are proposed to derive the coreference sets as shown in Figure 9.1 (b). By making use of the graph partitioning technique, COPA is able to generate the coreference sets in one step by considering all the relations encoded in the hypergraph together. In contrast to the local coreference models, our system performs the inference procedure in a global manner; and unlike the probabilistic global methods, our partitioning algorithms do not involve sophisticated probability estimations but achieve more competitive performance. In this chapter we summarize the main contributions of our work and point out possible future research directions.

9.1 Main Contributions

In this thesis, we address four important questions concerning coreference resolution modeling and end-to-end coreference system design.

Representing the High-dimensional Coreference Relation. COPA represents the mentions as vertices in the hypergraph model, and connects them with weighted hyperedges which are directly derived from the basic relations (i.e. features). Since this allows for multiple hyperedges between mentions, the basic relations are incorporated into the hypergraphs in an overlapping manner.
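The representation just described can be sketched as a small data structure. This is a minimal illustrative sketch, not the thesis implementation: the class, the feature names and the weights are placeholders, chosen only to show how several hyperedges can overlap on the same mentions and how one hyperedge can span more than two vertices.

```python
# Sketch of COPA's hypergraph representation: mentions are vertices,
# and each relational-feature instance contributes one weighted
# hyperedge over the set of mentions it connects.

class Hypergraph:
    def __init__(self):
        self.vertices = set()
        self.hyperedges = []  # (feature_name, weight, frozenset of mentions)

    def add_hyperedge(self, feature, weight, mentions):
        self.vertices.update(mentions)
        self.hyperedges.append((feature, weight, frozenset(mentions)))

    def edges_between(self, a, b):
        """All hyperedges covering both mentions: the high-dimensional
        relation between a and b is kept until the inference step."""
        return [f for f, _, ms in self.hyperedges if a in ms and b in ms]

hg = Hypergraph()
# A single hyperedge may connect more than two mentions at once.
hg.add_hyperedge("StrMatch", 0.8, {"Barack Obama", "Obama", "President Obama"})
hg.add_hyperedge("Alias", 0.6, {"Obama", "Barack Obama"})
print(hg.edges_between("Obama", "Barack Obama"))  # → ['StrMatch', 'Alias']
```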
The hypergraph provides us with a way to make the coreference decisions only during the inference phase, in contrast to the previous work which combines the basic relations into the coreference relation already during the graph construction phase (i.e. the representation phase). We propose to categorize the coreference features into three types. The negative features prevent hyperedges from being built between mentions, indicating non-coreferential relations. The positive features are used to construct the hypergraphs; they are mainly the strong indicators of the coreference relation. The weak features enrich the hypergraph structures by providing many weak hyperedges which do not strongly correlate with the coreference relation but are still informative. The feature categorization is important for applying graph models in end-to-end systems, making them less sensitive to noise and making it easier to incorporate more features.

Inferring the Coreference Sets Globally. The coreference resolution task is to derive the coreference sets from a collection of mentions. We argue that coreference models should not only analyze the relations between mentions but also consider the relations between different coreference sets. The hypergraph partitioning algorithms adopted in COPA manage to optimize the output coreference sets directly instead of only making the best decisions for mention pairs. Moreover, in our model resolving one mention depends on the resolutions of all the others, which makes COPA a global method. In this thesis, we also explore a constrained version of COPA. We demonstrate the importance of enforcing transitivity in the coreference resolution task and propose to address the problem within the constrained graph clustering framework. The idea of our method is to combine the constrained data clustering algorithms with the spectral graph clustering ones via the spectral embedding.
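The combination idea can be sketched as a constrained assignment step operating on rows of a spectral embedding, in the spirit of COP-KMeans (Wagstaff & Cardie, 2000). Everything below is illustrative: the 2-D "embedding" coordinates, the centroids and the mention names are made up, and the real system embeds mentions via the hypergraph spectra rather than hand-set points.

```python
# Sketch: assign each embedded mention to the nearest centroid, but fall
# back to the next closest one when the nearest cluster already hosts a
# Cannot-Link partner of the mention.

def constrained_assign(points, centroids, cannot_links):
    assignment = {}
    for m, (x, y) in points.items():
        # Centroids ordered by squared distance to this mention's embedding.
        order = sorted(range(len(centroids)),
                       key=lambda c: (x - centroids[c][0]) ** 2
                                     + (y - centroids[c][1]) ** 2)
        for c in order:
            members = {n for n, cl in assignment.items() if cl == c}
            clash = any(m in pair and (set(pair) - {m}) <= members
                        for pair in cannot_links)
            if not clash:
                assignment[m] = c
                break
        # If every centroid clashes, the mention stays unassigned
        # (plain COP-KMeans would fail in that case).
    return assignment

points = {"Milosevic": (0.0, 0.0), "he": (0.1, 0.0), "Putin": (0.2, 0.0)}
centroids = [(0.0, 0.0), (1.0, 0.0)]
print(constrained_assign(points, centroids, [("Milosevic", "Putin")]))
# → {'Milosevic': 0, 'he': 0, 'Putin': 1}: Putin is pushed to the second
#   cluster although the first centroid is closer.
```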
Due to the low coverage of the automatically generated constraints, our experimental results are mostly negative so far. However, the clean (artificial) constraints show promising improvements from the proposed algorithm. We leave the work on incorporating the generated constraints into COPA as a future research direction.

Evaluating the End-to-end Coreference Systems. In this thesis, we report the problems of the existing coreference evaluation metrics when they are applied to end-to-end system output. In order to evaluate the coreference task in a realistic setting, we propose two variants of the evaluation metrics B3 and CEAF. Our variants are empirically shown to evaluate the noisy coreference output in an adequate way. Appropriate evaluation metrics are essential especially when the coreference systems optimize with respect to the final coreference output.

Learning Cheaply. Due to the overlapping manner of the hyperedges, COPA only needs to learn the weights for the basic relations instead of a high-dimensional combination of them. It requires only a few training documents to collect the simple statistics for the weights of the basic relations, so COPA is considered a weakly supervised system. The experiments also confirm that COPA achieves competitive results with a small training set. This makes COPA a good candidate when moving to a different domain or a different language where not enough ground truth annotation is available.

9.2 Future Work

In this section, we highlight a couple of possible future research directions which should be worth investigating.

More Coreference Features. Due to the well-defined hypergraph representation and the feature categorization strategy in COPA, it requires little effort to incorporate relational features. The current version of COPA only adopts a standard set of coreference features, and it should be further improved by designing more linguistic- and world-knowledge features.
For instance, weak features enable us to include (a large amount of) noisy relations extracted from the Internet, such as word associations. Besides building relations between mentions, it will also be interesting to explore the relations between mention contexts. For instance, mentions participating in the same event in the same roles, or having the same relations with the same (another) entity, are likely to be coreferent with each other. In brief, more features will help to generate hypergraphs with richer structures, and therefore better partitions should be produced on such hypergraphs.

Learning to Partition. The learning scheme currently adopted by COPA only collects simple statistics about the basic relations. The constrained COPA can be viewed as a first step towards a better learning of our hypergraph-partitioning-based model. However, it should be worthwhile to find a learning algorithm which can directly optimize the hyperedge weights with respect to the partitioning criterion (i.e. the NCut value). In general, a learning procedure consistent with the inference procedure should be able to make the most of the training data.

Graph-partitioning-based Entity Model. Although the hyperedges in COPA are able to represent sets of multiple mentions, we have not yet modeled entities explicitly. Enabling properties on hyperedges may be able to capture entity-level information, and such information can be propagated to mentions and vice versa via the edge-vertex incidences. Incrementally or iteratively partitioning the hypergraphs can be another way to model entities. Entities derived from previous runs or iterations should help with later partitionings.

Application to Other Languages and Domains. COPA has lately been tested on different languages, such as Chinese. It performed stably, borrowing some of the language-independent features from the English implementation, such as head match.
As discussed in the thesis already, the proposed system performs competitively across different domains too. In the future, it will be interesting to apply COPA to other languages and domains where hardly any annotation for coreference resolution is available. In such cases, training on similar languages or relying more on the weak Internet features may all contribute.

List of Figures

1.1 Example (3): Coreference Resolution in MMAX
1.2 Example (3): Coreference Relation is High-Dimensional (part 1)
1.3 Example (3): Coreference Relation is High-Dimensional (part 2)
1.4 COPA Example: Processing Illustration
2.1 Luo's Bell Tree Method (Luo et al., 2004)
2.2 Nicolae and Nicolae's Best-cut Method (Nicolae & Nicolae, 2006)
2.3 Sapena's Thesis Hypergraph Representation (Sapena, 2012)
4.1 COPA Model Illustration
4.2 COPA Example: Processing Illustration
4.3 An Example for the Hypergraph Notation
4.4 Illustration of Spectral Graph Clustering
4.5 Illustration of COPA System Flow
4.6 Illustration of the Spectral Embedding
4.7 Illustration of the Post-processing for Pronouns
6.1 The MUC Score Illustration
6.2 The B3 Algorithm Illustration
6.3 The CEAF Alignment Illustration
6.4 Artificial Setting B3 Variants
6.5 Artificial Setting CEAF Variants
7.1 COPA's Results with Different Sizes of the Training Data
7.2 The Distributions of k With and Without Singleton Entities
8.1 Illustration of Constrained Spectral Graph Clustering
8.2 ConR2 Performance with Increasing Size of Clean Constraints
8.3 ConR2 Performance with the Increasing Size of Noise Constraints
8.4 Example Output Clusters Using the Basic COPA
8.5 Example Output Clusters Using the Constrained COPA
9.1 COPA Example: Processing Illustration

List of Tables

4.1 COPA Example: Texts
4.2 Hyperedge Weight Examples for ACE 2004 Data
5.1 Positive Feature Weights on OntoNotes Data
5.2 Weak Feature Weights on OntoNotes Data
5.3 Feature Weights on I2B2 Data
5.4 Negative Feature Statistics on OntoNotes Data
6.1 Problems of B30
6.2 Problems of B3all (1)
6.3 Problems of B3all (2)
6.4 Analysis of B3sys 1
6.5 Analysis of B3sys 2
6.6 Analysis of B3sys 3
6.7 Analysis of B3sys 4
6.8 Problems of CEAForig
6.9 Problems of CEAFr&n
6.10 Problems of φ4(⋆, ⋆)
6.11 Mention Taggers on ACE2004 Data
6.12 Realistic Setting MUC
6.13 Realistic Setting B3 Variants
6.14 Realistic Setting CEAF Variants
6.15 Realistic Setting B30 vs. B3sys
7.1 COPA Features for Comparing with SOON (details in Chapter 5)
7.2 SOON vs. COPA R2 (SOON features, system mentions, bold indicates significant improvement in F-score over SOON according to a paired-t test with p < 0.05)
7.3 Reproduced Numbers of B&R
7.4 Baselines on the ACE 2004 Testing Data
7.5 COPA Features for Comparing with B&R (details in Chapter 5)
7.6 B&R vs. COPA R2 (B&R features, COPA's system mentions)
7.7 COPA Features for the CoNLL 2011 Shared Task (details in Chapter 5)
7.8 COPA's Mention Tagger Performance on the CoNLL testing set
7.9 COPA's results on the CoNLL development set
7.10 COPA's results on the CoNLL testing set
7.11 Overall Results on the CoNLL testing set
7.12 COPA Features for the 2011 i2b2/VA Shared Task (details in Chapter 5)
7.13 COPA's Results on the ODIE Development Set w/o Concepts (Task 1A) Using SYS Evaluation Metrics
7.14 COPA's Results on the ODIE Development Set w/o Concepts (Task 1A) Using CoNLL Evaluation Metrics
7.15 COPA's Results on the ODIE Development Set with Concepts (Task 1B) Using I2B2 Evaluation Metrics
7.16 COPA's Results on the i2b2/VA Development Set with Concepts (Task 1C) Using I2B2 Evaluation Metrics
7.17 COPA's Results (in bold) on the ODIE Testing Set w/o Concepts (Task 1A) Using SYS Evaluation Metrics
7.18 COPA's Results (in bold) on the ODIE Testing Set w/o Concepts (Task 1A) Using I2B2 Evaluation Metrics
7.19 COPA's Results (in bold) on the ODIE Testing Set with Concepts (Task 1B) Using I2B2 Evaluation Metrics
7.20 COPA's Results (in bold) on the i2b2/VA Testing Set with Concepts (Task 1C) Using I2B2 Evaluation Metrics
7.21 COPA's Results on the i2b2/VA Development Set with Concepts (Task 1C), with and without Knowledge Features, Using I2B2 Evaluation Metrics (bold indicates significant improvement in F1 measure over the column w/o KnowledgeFeats, according to a paired-t test with p < 0.005)
7.22 COPA's Results on the i2b2/VA Development Set with Concepts (Task 1C), with and without Knowledge Features, Using SYS Evaluation Metrics (bold indicates significant improvement in F1 measure over the column w/o KnowledgeFeats, according to a paired-t test with p < 0.005)
7.23 k Model's Classification Performance on the CoNLL Development Data
7.24 COPA R2 vs. flatK (with alpha* = 0.07, bold indicates significant improvement in F-score over the others according to a paired-t test with p < 0.05)
8.1 COPA R2 partitioner's results on the OntoNotes development set using CoNLL metrics
8.2 Inconsistent Output Clusters from COPA R2 partitioner on the OntoNotes Development Set
8.3 The Cleanness of the Cannot-Link Constraints on the OntoNotes Training Set
8.4 ConR2 vs. Baselines with Clean Constraints on the OntoNotes Development Set (bold indicates significant improvement in F-score over PostR2 according to a paired-t test with p < 0.05)
8.5 ConR2 vs. R2+N Feats with Automatically Generated Constraints on the OntoNotes Development Set
8.6 The Contributions of the Proposed Cannot-Link Constraints

List of Algorithms

1 R2 partitioner
2 flatK partitioner
3 k model outline
4 B3sys
5 CEAFsys
6 COP-KMeans Algorithm (single iteration) (Wagstaff & Cardie, 2000)
7 ViolateConstraints Function Algorithm (Wagstaff & Cardie, 2000)
8 VD-KMeans Algorithm (single iteration)
9 ViolationDegree Function Algorithm
10 ConR2 partitioner
50 51 53 78 85 125 125 126 126 129 150 LIST OF ALGORITHMS Bibliography Agarwal, Sameer, Jonwoo Lim, Lihi Zelnik-Manor, Pietro Perona, David Kriegman & Serge Belongie (2005). Beyond pairwise clustering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, pp. 838–845. Aone, Chinatsu & Scott W. Bennett (1995). Evaluating automated and manual acquisition of anaphora resolution strategies. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, Mass., 26–30 June 1995, pp. 122–129. Aronson, Alan R. (2001). Effective mapping of biomedical texts to the UMLS metathesaurus: The MetaMap program. In In Proceedings of the AMIA Symposium 2001, pp. 17–21. Bagga, Amit & Breck Baldwin (1998). Algorithms for scoring coreference chains. In Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain, 28–30 May 1998, pp. 563–566. Bansal, Mohit & Dan Klein (2012). Coreference semantics from web features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea, 8–14 July 2012, pp. 389–398. Bansal, Nikhil, Avrim Blum & Shuchi Chawla (2002). Correlational clustering. In The proceeding of the 43rd annual symposium on foundations of computer science (FOCS), pp. 238–247. Bar-Hillel, Aharon, Tomer Hertz, Noam Shental & Daphna Weinshall (2003). Learning distance functions using equivalence relations. In Proceeding of 20th International Conference on Machine Learning. Basu, Sugato, Arindam Banerjee & Raymond J Mooney (2002). Semi-supervised clustering by seeding. In Proceedings of International Conference on Machine Learning, pp. 27–34. Basu, Sugato, Mikhail Bilenko & Raymond J Mooney (2004). A probabilistic framework for semi-supervised clustering. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), pp. 59–68. 152 BIBLIOGRAPHY Basu, Sugato, Ian Davidson & Kiri L. 
Wagstaff (Eds.) (2009). Constrained Clustering: Advances in Algorithms, Theory, and Applications. Boca Raton, Flo.: CRC Press. Ben-David, Shai, Ulrike von Luxburg & David Pal (2006). A sober look at clustering stability. In Proceedings or the 19th Annual Conference on Learning Theory, pp. 5–19. Berlin: Springer. Bengtson, Eric & Dan Roth (2008). Understanding the value of features for coreference resolution. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Waikiki, Honolulu, Hawaii, 25-27 October 2008, pp. 294–303. Blum, Avrim & Tom Mitchell (1998). Combining labeled and unlabeled data with CoTraining. In Proceedings of the 11th Annual Conference on Learning Theory, Madison, Wisc., 24–26 July, 1998, pp. 92–100. Brennan, Susan E., Marilyn W. Friedman & Carl J. Pollard (1987). A centering approach to pronouns. In Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics, Stanford, Cal., 6–9 July 1987, pp. 155–162. Cai, Jie, Éva Mújdricza-Maydt, Yufang Hou & Michael Strube (2011a). Weakly supervised graph-based coreference resolution for clinical data. In Proceedings of the 5th i2b2 Shared Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data, Washington,D.C. Cai, Jie, Éva Mújdricza-Maydt & Michael Strube (2011b). Unrestricted coreference resolution via global hypergraph partitioning. In Proceedings of the 15th Conference on Computational Natural Language Learning, Portland, Oreg., 23–24 June 2011. Cai, Jie & Michael Strube (2010a). End-to-end coreference resolution via hypergraph partitioning. In Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China, 23–27 August 2010, pp. 143–151. Cai, Jie & Michael Strube (2010b). Evaluation metrics for end-to-end coreference resolution systems. 
In Proceedings of the SIGdial 2010 Conference: The 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Tokyo, Japan, 24–25 September 2010, pp. 28–36.
Cardie, Claire & Kiri Wagstaff (1999). Noun phrase coreference as clustering. In Proceedings of the 1999 SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, Md., 21–22 June 1999, pp. 82–89.
Charniak, Eugene & Mark Johnson (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Mich., 25–30 June 2005, pp. 173–180.
Chinchor, Nancy (2001). Message Understanding Conference (MUC) 7. LDC2001T02, Philadelphia, Penn.: Linguistic Data Consortium.
Chinchor, Nancy & Beth Sundheim (2003). Message Understanding Conference (MUC) 6. LDC2003T13, Philadelphia, Penn.: Linguistic Data Consortium.
Chomsky, Noam (1981). Lectures on Government and Binding. Dordrecht: Foris.
Chomsky, Noam (1995). The Minimalist Program. Cambridge, Mass.: MIT Press.
Chung, Fan R.K. (1997). Spectral Graph Theory. Providence, R.I.: American Mathematical Society.
Coleman, Tom, James Saunderson & Anthony Wirth (2008). Spectral clustering with inconsistent advice. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008, pp. 152–159.
Culotta, Aron, Michael Wick & Andrew McCallum (2007). First-order probabilistic models for coreference resolution. In Proceedings of Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Rochester, N.Y., 22–27 April 2007, pp. 81–88.
Daumé III, Hal & Daniel Marcu (2005). A large-scale exploration of effective global features for a joint entity detection and tracking model.
In Proceedings of the Human Language Technology Conference and the 2005 Conference on Empirical Methods in Natural Language Processing, Vancouver, B.C., Canada, 6–8 October 2005, pp. 97–104.
Denis, Pascal & Jason Baldridge (2007). A ranking approach to pronoun resolution. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 6–12 January 2007, pp. 1588–1593.
Denis, Pascal & Jason Baldridge (2009). Global joint models for coreference resolution and named entity classification. Procesamiento del Lenguaje Natural, 42:87–96.
Dhillon, Inderjit S., Yuqiang Guan & Brian Kulis (2004). Kernel k-means: Spectral clustering and normalized cuts. In Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 551–556.
Fahrni, Angela, Vivi Nastase & Michael Strube (2012). HITS’ cross-lingual entity linking system at TAC 2011: One model for all languages. In Proceedings of the Text Analysis Conference, National Institute of Standards and Technology, Gaithersburg, Maryland, USA, 14–15 November 2011.
Fellbaum, Christiane (Ed.) (1998). WordNet: An Electronic Lexical Database. Cambridge, Mass.: MIT Press.
Finkel, Jenny Rose, Trond Grenager & Christopher Manning (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, Mich., 25–30 June 2005, pp. 363–370.
Finkel, Jenny Rose & Christopher Manning (2008). Enforcing transitivity in coreference resolution. In Companion Volume to the Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 15–20 June 2008, pp. 45–48.
Frank, Anette, Thomas Bögel, Oliver Hellwig & Nils Reiter (2012). Semantic annotation for the digital humanities – using Markov Logic Networks for annotation consistency control.
Linguistic Issues in Language Technology, 7(8):1–21.
Grosz, Barbara J., Aravind K. Joshi & Scott Weinstein (1995). Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225.
Grosz, Barbara J. & Candace L. Sidner (1986). Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.
Haghighi, Aria & Dan Klein (2007). Unsupervised coreference resolution in a nonparametric Bayesian model. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, 23–30 June 2007, pp. 848–855.
Haghighi, Aria & Dan Klein (2009). Simple coreference resolution with rich syntactic and semantic features. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August 2009, pp. 1152–1161.
Hahn, Udo & Michael Strube (1997). Centering in-the-large: Computing referential discourse segments. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and of the 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, 7–12 July 1997, pp. 104–111.
Halkidi, Maria, Yannis Batistakis & Michalis Vazirgiannis (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2/3):107–145.
Hobbs, Jerry R. (1978). Resolving pronominal references. Lingua, 44:311–338.
Jain, Anil K., M.N. Murty & P.J. Flynn (1999). Data clustering: A review. ACM Computing Surveys, 31(3):264–323.
Joshi, Aravind K. & Steve Kuhn (1979). Centered logic: The role of entity centered sentence representation in natural language inferencing. In Proceedings of the 6th International Joint Conference on Artificial Intelligence, Tokyo, Japan, 20–23 August 1979, pp. 435–439.
Joshi, Aravind K. & Scott Weinstein (1981). Control of inference: Role of some aspects of discourse structure – centering.
In Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, B.C., Canada, 24–28 August 1981, pp. 385–387.
Kamvar, Sepandar D., Dan Klein & Christopher D. Manning (2003). Spectral learning. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 9–15 August 2003, pp. 561–566.
Klein, Dan, Sepandar D. Kamvar & Christopher D. Manning (2002). From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In Proceedings of the 19th International Conference on Machine Learning.
Klenner, Manfred (2007). Enforcing consistency on coreference sets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria, 27–29 September 2007, pp. 323–328.
Kobdani, Hamidreza, Hinrich Schütze, Michael Schiehlen & Hans Kamp (2011). Bootstrapping coreference resolution using word associations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oreg., USA, 19–24 June 2011.
Kudoh, Taku & Yuji Matsumoto (2000). Use of Support Vector Machines for chunk identification. In Proceedings of the 4th Conference on Computational Natural Language Learning, Lisbon, Portugal, 13–14 September 2000, pp. 142–144.
Kulis, Brian, Sugato Basu, Inderjit Dhillon & Raymond Mooney (2005). Semi-supervised graph clustering: A kernel approach. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005, pp. 457–464.
Lang, Jun, Bing Qin, Ting Liu & Sheng Li (2009). Unsupervised coreference resolution with hypergraph partitioning. Computer and Information Science, 2:55–63.
Lappin, Shalom & Herbert J. Leass (1994). An algorithm for pronominal anaphora resolution. Computational Linguistics, 20(4):535–561.
Lee, Heeyoung, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu & Dan Jurafsky (2011).
Stanford’s multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the 15th Conference on Computational Natural Language Learning, Portland, Oreg., 23–24 June 2011, pp. 28–34.
Luo, Xiaoqiang (2005). On coreference resolution performance metrics. In Proceedings of the Human Language Technology Conference and the 2005 Conference on Empirical Methods in Natural Language Processing, Vancouver, B.C., Canada, 6–8 October 2005, pp. 25–32.
Luo, Xiaoqiang, Abe Ittycheriah, Hongyan Jing, Nanda Kambhatla & Salim Roukos (2004). A mention-synchronous coreference resolution algorithm based on the Bell Tree. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 21–26 July 2004, pp. 136–143.
MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297.
McCallum, Andrew & Ben Wellner (2005). Conditional models of identity uncertainty with application to noun coreference. In Lawrence K. Saul, Yair Weiss & Léon Bottou (Eds.), Advances in Neural Information Processing Systems 17, pp. 905–912. Cambridge, Mass.: MIT Press.
McCarthy, Joseph F. & Wendy G. Lehnert (1995). Using decision trees for coreference resolution. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montréal, Canada, 20–25 August 1995, pp. 1050–1055.
McCord, Michael C. (1989). Slot grammar: A system for simpler construction of practical natural language grammars. In Natural Language and Logic’89, pp. 118–145.
Milligan, Glenn W. & Martha C. Cooper (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2):159–179.
Mitchell, Alexis, Stephanie Strassel, Shudong Huang & Ramez Zakhary (2004). ACE 2004 Multilingual Training Corpus. LDC2005T09, Philadelphia, Penn.: Linguistic Data Consortium.
Mitchell, Alexis, Stephanie Strassel, Mark Przybocki, JK Davis, George Doddington, Ralph Grishman, Adam Meyers, Ada Brunstain, Lisa Ferro & Beth Sundheim (2002). ACE-2 Version 1.0. LDC2003T11, Philadelphia, Penn.: Linguistic Data Consortium.
Mitchell, Alexis, Stephanie Strassel, Mark Przybocki, JK Davis, George Doddington, Ralph Grishman, Adam Meyers, Ada Brunstain, Lisa Ferro & Beth Sundheim (2003). TIDES Extraction (ACE) 2003 Multilingual Training Data. LDC2004T09, Philadelphia, Penn.: Linguistic Data Consortium.
Mitkov, Ruslan (2002). Anaphora Resolution. London, U.K.: Longman.
Müller, Christoph, Stefan Rapp & Michael Strube (2002). Applying Co-Training to reference resolution. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Penn., 7–12 July 2002, pp. 352–359.
Müller, Christoph & Michael Strube (2006). Multi-level annotation of linguistic data with MMAX2. In Sabine Braun, Kurt Kohn & Joybrato Mukherjee (Eds.), Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, pp. 197–214. Frankfurt a.M., Germany: Peter Lang.
Ng, Andrew Y., Michael I. Jordan & Yair Weiss (2002). On spectral clustering: Analysis and an algorithm. In T.G. Dietterich, S. Becker & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (NIPS 2001), pp. 849–856. Cambridge, Mass.: MIT Press.
Ng, Vincent (2008). Unsupervised models for coreference resolution. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Waikiki, Honolulu, Hawaii, 25–27 October 2008, pp. 640–649.
Ng, Vincent (2010). Supervised noun phrase coreference research: The first fifteen years. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010, pp. 1396–1411.
Ng, Vincent & Claire Cardie (2002). Improving machine learning approaches to coreference resolution.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Penn., 7–12 July 2002, pp. 104–111.
Ng, Vincent & Claire Cardie (2003). Weakly supervised natural language learning without redundant views. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, Canada, 27 May – 1 June 2003, pp. 173–180.
Nicolae, Cristina & Gabriel Nicolae (2006). BestCut: A graph algorithm for coreference resolution. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, 22–23 July 2006, pp. 275–283.
Nigam, Kamal, Andrew Kachites McCallum, Sebastian Thrun & Tom Mitchell (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134.
NIST (2004). The ACE evaluation plan: Evaluation of the recognition of ACE entities, ACE relations and ACE events. http://www.itl.nist.gov/iad/mig//tests/ace/2004/doc/ace04evalplan-v7.pdf.
Pierce, David & Claire Cardie (2001). Limitations of Co-Training for natural language learning from large datasets. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, Pittsburgh, Penn., 3–4 June 2001, pp. 1–9.
Ponzetto, Simone Paolo & Michael Strube (2006). Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, New York, N.Y., 4–9 June 2006, pp. 192–199.
Poon, Hoifung & Pedro Domingos (2008). Joint unsupervised coreference resolution with Markov Logic. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Waikiki, Honolulu, Hawaii, 25–27 October 2008, pp. 650–659.
Pradhan, Sameer, Lance Ramshaw, Mitchell Marcus, Martha Palmer, Ralph Weischedel & Nianwen Xue (2011).
CoNLL-2011 Shared Task: Modeling unrestricted coreference in OntoNotes. In Proceedings of the Shared Task of the 15th Conference on Computational Natural Language Learning, Portland, Oreg., 23–24 June 2011.
Quinlan, J. Ross (1993). C4.5: Programs for Machine Learning. San Mateo, Cal.: Morgan Kaufmann.
Raghavan, Preethi, Eric Fosler-Lussier & Albert M. Lai (2012). Exploring semi-supervised coreference resolution of medical concepts using semantic and temporal features. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 731–741.
Rahman, Altaf & Vincent Ng (2009). Supervised models for coreference resolution. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August 2009, pp. 968–977.
Rahman, Altaf & Vincent Ng (2011). Coreference resolution with world knowledge. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oreg., USA, 19–24 June 2011, pp. 814–824.
Rand, William M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.
Recasens, Marta & Marta Vila (2010). On paraphrase and coreference. Computational Linguistics, 36(4):639–647.
Sapena, Emili (2012). A constraint-based hypergraph partitioning approach to coreference resolution, (Ph.D. thesis). Universitat Politècnica de Catalunya.
Sapena, Emili, Lluís Padró & Jordi Turmo (2010). A global relaxation labeling approach to coreference resolution. In Proceedings of Coling 2010: Poster Volume, Beijing, China, 23–27 August 2010, pp. 1086–1094.
Shi, Jianbo & Jitendra Malik (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.
Song, Yang, Jing Jiang, Wayne Xin Zhao, Sujian Li & Houfeng Wang (2012). Joint learning for coreference resolution with Markov Logic.
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea, 8–14 July 2012.
Soon, Wee Meng, Hwee Tou Ng & Daniel Chung Yong Lim (2001). A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
Stoer, Mechthild & Frank Wagner (1997). A simple min-cut algorithm. Journal of the ACM, 44(4):585–591.
Stoyanov, Veselin, Nathan Gilbert, Claire Cardie & Ellen Riloff (2009). Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing, Singapore, 2–7 August 2009, pp. 656–664.
Strube, Michael (1998). Never look back: An alternative to centering. In Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montréal, Québec, Canada, 10–14 August 1998, Vol. 2, pp. 1251–1257.
Strube, Michael & Udo Hahn (1999). Functional centering: Grounding referential coherence in information structure. Computational Linguistics, 25(3):309–344.
Tibshirani, Robert, Guenther Walther & Trevor Hastie (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.
Toutanova, Kristina, Dan Klein, Christopher D. Manning & Yoram Singer (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, Canada, 27 May – 1 June 2003, pp. 252–259.
Uzuner, Özlem, Andreea Bodnari, Shuying Shen, Tyler Forbush, John Pestian & Brett R. South (2012).
Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association. Published online first: 24 February 2012. doi:10.1136/amiajnl-2011-000784.
Versley, Yannick, Simone Paolo Ponzetto, Massimo Poesio, Vladimir Eidelman, Alan Jern, Jason Smith, Xiaofeng Yang & Alessandro Moschitti (2008). BART: A modular toolkit for coreference resolution. In Companion Volume to the Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, 15–20 June 2008, pp. 9–12.
Vilain, Marc, John Burger, John Aberdeen, Dennis Connolly & Lynette Hirschman (1995). A model-theoretic coreference scoring scheme. In Proceedings of the 6th Message Understanding Conference (MUC-6), pp. 45–52. San Mateo, Cal.: Morgan Kaufmann.
von Luxburg, Ulrike (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416.
von Luxburg, Ulrike (2010). Clustering stability: An overview. Foundations and Trends in Machine Learning, 2(3):235–274.
Wagstaff, Kiri (2002). Intelligent clustering with instance-level constraints, (Ph.D. thesis). Cornell University. Chapter 5: constrained clustering applied to coreference resolution.
Wagstaff, Kiri & Claire Cardie (2000). Clustering with instance-level constraints. In Proceedings of the 17th International Conference on Machine Learning, Palo Alto, Cal., 29 June – 2 July 2000, pp. 1103–1110.
Walker, Marilyn A. (1998). Centering, anaphora resolution, and discourse structure. In M.A. Walker, A.K. Joshi & E.F. Prince (Eds.), Centering Theory in Discourse, pp. 401–435. Oxford, U.K.: Oxford University Press.
Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed ElBachouti, Robert Belvin & Ann Houston (2011). OntoNotes Release 4.0. LDC2011T03, Philadelphia, Penn.: Linguistic Data Consortium.
Witten, Ian H. & Eibe Frank (2005).
Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). San Francisco, Cal.: Morgan Kaufmann.
Wu, Zhenyu & R. Leahy (1993). An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11):1101–1113.
Xing, Eric P., Andrew Y. Ng, Michael I. Jordan & Stuart Russell (2003). Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS 2003).
Yang, Xiaofeng, Jian Su, Jun Lang, Chew Lim Tan, Ting Liu & Sheng Li (2008). An entity-mention model for coreference resolution with Inductive Logic Programming. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Columbus, Ohio, 15–20 June 2008, pp. 843–851.
Yang, Xiaofeng, Jian Su & Chew Lim Tan (2005). A twin-candidate model of coreference resolution with non-anaphor identification capability. In Proceedings of the 2nd International Joint Conference on Natural Language Processing, Jeju Island, South Korea, 11–13 October 2005, pp. 719–730.
Yu, Stella X. & Jianbo Shi (2004). Segmentation given partial grouping constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2).
Zhou, Dengyong, Jiayuan Huang & Bernhard Schölkopf (2007). Learning with hypergraphs: Clustering, classification, and embedding. In B. Schölkopf, J. Platt & T. Hofmann (Eds.), Neural Information Processing Systems 19: Proceedings of the 2006 Conference, pp. 1601–1608. Cambridge, Mass.: MIT Press.
Zien, Jason Y., Martine Schlag & Pak K. Chan (1999). Multi-level spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18:1389–1399.
