Towards Robust Linguistic Analysis Using OntoNotes

Sameer Pradhan (1), Alessandro Moschitti (2,3), Nianwen Xue (4), Hwee Tou Ng (5),
Anders Björkelund (6), Olga Uryupina (2), Yuchen Zhang (4) and Zhi Zhong (5)

(1) Boston Children's Hospital and Harvard Medical School, Boston, MA 02115, USA
(2) University of Trento, 38123 Povo (TN), Italy
(3) QCRI, Qatar Foundation, 5825 Doha, Qatar
(4) Brandeis University, Waltham, MA 02453, USA
(5) National University of Singapore, Singapore 117417
(6) University of Stuttgart, 70174 Stuttgart, Germany
Abstract

Large-scale linguistically annotated corpora have played a crucial role in advancing the state of the art of key natural language technologies such as syntactic, semantic and discourse analyzers, and they serve as training data as well as evaluation benchmarks. Until now, however, most of the evaluation has been done on monolithic corpora such as the Penn Treebank and the Proposition Bank. As a result, it is still unclear how the state-of-the-art analyzers perform in general on data from a variety of genres or domains. The completion of the OntoNotes corpus, a large-scale, multi-genre, multilingual corpus manually annotated with syntactic, semantic and discourse information, makes it possible to perform such an evaluation. This paper presents an analysis of the performance of publicly available, state-of-the-art tools on all layers and languages in the OntoNotes v5.0 corpus. This should set the benchmark for future development of various NLP components in syntax and semantics, and possibly encourage research towards an integrated system that makes use of the various layers jointly to improve overall performance.
1 Introduction

Roughly a million words of text from the Wall Street Journal newswire (WSJ), circa 1989, have had a significant impact on research in the language processing community, especially in the areas of syntax and (shallow) semantics, owing to the seminal Penn Treebank project, which first selected this text for annotation. Taking advantage of this solid syntactic foundation, later researchers who wanted to annotate semantic phenomena on a relatively large scale also used it as the basis of their annotation. For example, the Proposition Bank (Palmer et al., 2005), the BBN Named Entity and Pronoun Coreference corpus (Weischedel and Brunstein, 2005), the Penn Discourse Treebank (Prasad et al., 2008), and many other annotation projects all annotate the same underlying body of text. It was also converted to dependency structures and other syntactic formalisms such as CCG (Hockenmaier and Steedman, 2002) and LTAG (Shen et al., 2008), thereby creating an even bigger impact through these additional syntactic resources. The most recent of these efforts is the OntoNotes corpus (Weischedel et al., 2011). However, unlike the previous extensions of the Treebank, in addition to using roughly a third of the same WSJ subcorpus, OntoNotes added several other genres and covers two other languages, Chinese and Arabic: portions of the Chinese Treebank (Xue et al., 2005) and the Arabic Treebank (Maamouri and Bies, 2004) have been used to sample the genres of text that they represent.

One of the current hurdles in language processing is the problem of domain, or genre, adaptation. Although genre and domain are popular terms, their definitions are still vague. In OntoNotes, "genre" means a type of source: newswire (NW), broadcast news (BN), broadcast conversation (BC), magazine (MZ), telephone conversation (TC), web data (WB) or pivot text (PT). Changes in the entity and event profiles across source types, and even within the same source over time, as expressed by surface lexical forms, account for much of the drop in performance of models trained on one source and tested on another, because these surface cues are precisely what statistical models rely upon.
                 Parse              Proposition                      Sense                            Name               Coreference
Language         Docs      Words    Docs    Verb Prop.  Noun Prop.   Docs    Verb Sense  Noun Sense   Docs     Words     Docs            Words
English          7,967[1]  2.6M     6,124   300K        18K          12K     173K        120K         3,637    2.0M      2,384 (3,493)   1.7M
Chinese          2,002     1.0M     1,861   148K        7K           1,573   83K         1K           1,911    988K      1,729 (2,280)   950K
Arabic             599     402K       599   30K         -              310   4.3K        8.7K           446    298K        447 (447)     300K

Table 1: Coverage for each layer in the OntoNotes v5.0 corpus, by number of documents, words, and some other attributes. The numbers in parentheses are the total number of parts in the documents.

[1] A portion of the English data in the OntoNotes corpus is a selected set of sentences that were annotated for parse and word sense information. These sentences are placed in documents of their own, so the document counts for the English parse layer are inflated by about 3,655 documents and for the word sense layer by about 8,797 documents.
Large-scale corpora annotated with multiple layers of linguistic information exist in various languages, but they typically consist of a single source or collection. The Brown corpus, which consists of multiple genres, has usually been used to investigate issues of genre sensitivity, but it is relatively small and does not include any informal genres such as web data. Very seldom before OntoNotes has the exact same set of phenomena been annotated on a broad cross-section of the same language. The OntoNotes corpus thus provides an opportunity for studying the genre effect on different syntactic, semantic and discourse analyzers.
Parts of the OntoNotes corpus have been used for various shared tasks organized by the language processing community. The word sense layer was the subject of prediction in two SemEval-2007 tasks, and the coreference layer was the subject of prediction in the SemEval-2010 [2] (Recasens et al., 2010), CoNLL-2011 and CoNLL-2012 shared tasks (Pradhan et al., 2011; Pradhan et al., 2012). The CoNLL-2012 shared task provided predicted information to the participants; however, it did not include a few layers, such as the named entities for Chinese and Arabic and the propositions for Arabic, and, for better comparison of the English data with the CoNLL-2011 task, a smaller OntoNotes v4.0 portion of the English parses and propositions was used for training.

[2] A small portion, 125K words in English, was used for this evaluation.
This paper is a first attempt at presenting a coherent high-level picture of the performance of
various publicly available state-of-the-art tools on
all the layers of OntoNotes in all three languages,
so as to pave the way for further explorations in
the area of syntax and semantics processing.
The possible avenues for exploratory studies
on various fronts are enormous. However, given
space considerations, in this paper, we will restrict our presentation of the performance on all
layers of annotation in the data by using a stratified cross-section of the corpus for training, development, and testing. The paper is organized
as follows: Section 2 gives an overview of the
OntoNotes corpus. Section 3 explains the parameters of the evaluation and the various underlying
assumptions. Section 4 presents the experimental
results and discussion, and Section 5 concludes the
paper.
2 OntoNotes Corpus
The OntoNotes project has created a large-scale corpus of accurate and integrated annotation of multiple layers of syntactic, semantic and discourse information in text. The English portion comprises roughly 1.7M words and the Chinese portion roughly 1M words of newswire, magazine articles, broadcast news, broadcast conversations, web data and conversational speech [3]. The Arabic portion is smaller, comprising 300K words of newswire articles. This rich, integrated annotation covering many layers aims at facilitating the development of richer, cross-layer models and enabling better automatic semantic analysis. The corpus is tagged with syntactic trees, propositions for most verb and some noun instances, partial verb and noun word senses, coreference, and named entities. Table 1 gives an overview of the number of documents that have been annotated in the entire OntoNotes corpus.

[3] These numbers are for the portion that has all layers of annotation. The word count for each layer is given in Table 1.
2.1 Layers of Annotation
This section provides a very concise overview of the various layers of annotation in OntoNotes. For a more detailed description, the reader is referred to Weischedel et al. (2011) and the documentation accompanying the v5.0 release [4].

[4] For all the layers of data used in this study, the OntoNotes v4.99 pre-release that was used for the CoNLL-2012 shared task is identical to the v5.0 release.
2.1.1 Syntax
This represents the layer of syntactic annotation
based on revised guidelines for the Penn Treebank (Marcus et al., 1993; Babko-Malaya et al.,
2006), the Chinese Treebank (Xue et al., 2005)
and the Arabic Treebank (Maamouri and Bies,
2004). There were two updates made to the parse
trees as part of the OntoNotes project: i) the introduction of NML phrases, in the English portion,
to mark nominal sub-constituents of flat NPs that
do not follow the default right-branching structure,
and ii) re-tokenization of hyphenated tokens into
multiple tokens in English and Chinese. The Arabic Treebank on the other hand was also significantly revised in an effort to increase consistency.
2.1.2 Word Sense
Coarse-grained word senses are tagged for the most frequent polysemous verbs and nouns, in order to maximize token coverage. The word sense granularity is tailored to achieve very high inter-annotator agreement, as demonstrated by Palmer et al. (2007). These senses are defined in the sense inventory files. For English and Arabic, the sense inventories (and frame files) are defined separately for each part of speech that is realized by the lemma in the text. For Chinese, however, the sense inventories (and frame files) are defined per lemma, independent of the part of speech realized in the text.
2.1.3 Proposition
The propositions in OntoNotes are PropBank-style semantic roles for English, Chinese and Arabic. Most English verbs and a few nouns were annotated using the revised guidelines for the English PropBank (Babko-Malaya et al., 2006) as part of the OntoNotes effort. Some enhancements were made to the English PropBank and Treebank to make them synchronize better with each other; one of the outcomes of this effort was that two types of LINKs, representing pragmatic coreference (LINK-PCR) and selectional preferences (LINK-SLC), were added to the original PropBank (Palmer et al., 2005). More details can be found in the addendum to the PropBank guidelines [5] in the OntoNotes v5.0 release. Part-of-speech-agnostic Chinese PropBank guidelines (Xue and Palmer, 2009) were used to annotate the most frequent lemmas in Chinese. Many verbs and some nouns and adjectives were annotated using the revised Arabic PropBank guidelines (Palmer et al., 2008; Zaghouani et al., 2010).

[5] doc/propbank/english-propbank.pdf
2.1.4 Named Entities

The corpus was tagged with a set of 18 well-defined proper named entity types that have been tested extensively for inter-annotator agreement by Weischedel and Brunstein (2005).

2.1.5 Coreference

This layer captures general anaphoric coreference that covers entities and events, not limited to noun phrases or a limited set of entity types (Pradhan et al., 2007). It considers all pronouns (PRP, PRP$), noun phrases (NP) and heads of verb phrases (VP) as potential mentions. Unlike English, Chinese and Arabic have dropped subjects and objects, which were also considered during coreference annotation [6]. The mentions formed by these dropped pronouns total roughly 11% for both Chinese and Arabic. Coreference is the only document-level phenomenon in OntoNotes. Some of the documents in the corpus — especially the ones in the broadcast conversation, web data and telephone conversation genres — are very long, which prohibited efficient annotation in their entirety. These were split into smaller parts, and each part is considered a separate document for the sake of coreference evaluation.

[6] As we will see later, these are not used during the task.

3 Evaluation Setting

Given the scope of the corpus and the multitude of settings in which one can run evaluations, we had to restrict this study to a relatively focused subset. There is already ample evidence of models trained on WSJ doing poorly on non-WSJ data for parsing (Gildea, 2001; McClosky et al., 2006), semantic role labeling (Carreras and Màrquez, 2005; Pradhan et al., 2008), word sense disambiguation (Escudero et al., 2000), and named entities. The phenomenon of coreference is somewhat of an outlier: the winning system in the CoNLL-2011 shared task was completely rule-based and not directly trained on the OntoNotes corpus. Given this overwhelming evidence, we decided not to focus on potentially complex cross-genre evaluations. Instead, we evaluate the performance on each layer of annotation using an appropriately selected, stratified training, development and test set, so as to facilitate future studies.

3.1 Training, Development and Test Partitions

In this section we briefly discuss the logic behind the partitioning of the data into training, development and test sets. Before we do that, it helps to know that, given the range and peculiarities of the layers of annotation and the presence of various resource and technical constraints, not all documents in the corpus are annotated with all layers of information, and token-centric phenomena (such as the word senses and propositions of predicates) were not annotated with 100% coverage. Most of the proposition annotation in English and Arabic is for verb predicates, with a few nouns annotated in English and some adjectives in Arabic. In Chinese, the selection is part-of-speech agnostic and is based on the lemmas that can be considered predicates. Some documents in the corpora are actually snippets from larger documents, and have been annotated for a combination of parse, propositions, word sense and names, but not coreference. If one considers each layer independently, then an ideal partitioning scheme would create a separate partition for each layer such that it maximizes the number of examples that can be extracted for that layer from the corpus. The upside is that one would get as much data as there is to train on and to estimate the performance of each layer across the entire corpus. The downside is that this might cover various
cross sections of the documents in the corpus,
and would not provide a clean picture when looking at the collective performance for all the layers. The documents that are annotated with coreference correspond to the intersection of all annotations. These are the documents that have also
been annotated with all the other layers of information. The amount of data we can get together
in such a test set is big enough to be representative. Therefore, we decided that it would be
ideal to choose a portion of these documents as
the test collection for all layers. An additional advantage is that it is the exact same test set used
in the CoNLL-2012 shared task, and so in a way
is already a standard. On the training and development side however, one can still imagine using
all possible information for training models for a
particular layer, and that is what we decided to
do. The training and development data is generated by providing all documents with all available
layers of annotation for input, however, the test
set is generated by providing as input to the algorithm the set of documents in the corpus that have
been annotated for coreference. This algorithm
tries to reuse previously established partitions for
English, i.e., the WSJ portion. Unfortunately, in
the case of Chinese and Arabic, either the historical partitions were not in the selection used for
OntoNotes, or were partially overlapping with the
ones created using this scheme, and/or had a very
small portion of OntoNotes covered in the test set.
Therefore, we decided to create a fresh partition
for the Chinese and Arabic data. Note, however, that these test sets also match the ones used in the CoNLL-2012 evaluation. The algorithm for selecting the training, development and test partitions is described on the CoNLL-2012 shared task webpage, along with the lists of training, development and test document IDs [7].
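To make the partition mechanics concrete, the short sketch below reads such ID lists and assigns documents to splits. The directory layout and file names here are hypothetical stand-ins for the lists distributed on the shared task page, not a documented interface.

    import os

    def load_ids(path):
        # One document ID per line; ignore blank lines.
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    # Hypothetical local copies of the downloaded ID lists for one language.
    splits = {name: load_ids(os.path.join("ids", "english", "coref", name + ".id"))
              for name in ("train", "development", "test")}

    def split_of(document_id):
        # Return the partition a document belongs to, or None if it is unused.
        for name, ids in splits.items():
            if document_id in ids:
                return name
        return None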
3.2 Assumptions
Next we had to decide on a set of assumptions
to use while designing the experiments to measure the automatic prediction accuracy for each of
the layers. Since some of these decisions affect
more than one layer of annotation, we will describe these in this section instead of in the section
where we discuss the experiment with a particular
layer of annotation.
[7] http://conll.cemantix.org/2012/download/ids/. For each language there are two sub-directories: "all" contains more general lists, which include documents that had at least one layer of annotation, and "coref" contains lists of documents that have coreference annotation. The former were used to generate training, development and test sets for layers other than coreference, and the latter were used to generate the training/development/test sets for the coreference layer used in the CoNLL-2012 shared task.
Word Segmentation  The three languages that we are evaluating come from quite different language families. Arabic has a complex morphology, English has limited morphology, and Chinese has very little morphology. English word segmentation amounts to rule-based tokenization and is close to perfect. In the case of Chinese and Arabic, although tokenization/segmentation is not as good as for English, the accuracies are in the high 90s. Given this, we decided to use gold Treebank segmentation for all languages. In the case of Chinese, the words themselves are lemmas, whereas in English lemmas can be predicted with very high accuracy. For Arabic, written text is by default unvocalized, and lemmatization is a complex process that we considered out of the scope of this study, so we decided to use correct, gold-standard lemmas, along with the correct vocalized version of the tokens.

Traces and Function Tags  Treebank traces have hardly played a role in mainstream parser and semantic role labeling evaluation. Function tags have received similar treatment in the parsing community, and though they are important, there is also a significant information overlap between them and the proposition structure provided by the PropBank layer. Whereas in English most traces represent syntactic phenomena such as movement and raising, in Chinese and Arabic they can also represent dropped subjects and objects. This subset of traces directly affects the coreference layer, since, unlike in English, traces in Chinese and Arabic (*pro* and *, respectively) are legitimate targets of mentions and are considered for coreference annotation in OntoNotes. Recovering traces in text is a hard problem, and the most recently reported numbers in the literature for Chinese are around an F-score of 50 (Yang and Xue, 2010; Cai et al., 2011). For Arabic there have not been many studies on recovering these; a study by Gabbard (2010) shows that they can be recovered with an F-score of 55 with automatic parses and roughly 65 with gold parses. Considering the low prediction accuracy for these tokens and their relatively low frequency, we decided to consider the prediction of traces in trees out of the scope of this study. In other words, we removed the manually identified traces and function tags from the Treebanks of all three languages, in all three partitions (training, development and test). This meant removing any and all dependent annotation in layers such as PropBank and coreference: in the case of PropBank these are the argument-bearing traces, whereas in coreference these are the mentions formed by the elided subjects/objects.
Disfluencies  One thing that needs to be dealt with in conversational data is the presence of disfluencies (restarts, etc.). In the English parses of OntoNotes, disfluencies are marked using a special EDITED phrase tag [8], as was the case in the Switchboard Treebank. Computing the accuracy of identifying disfluencies is also out of the scope of this study. Given the frequency of disfluencies and the performance with which one can identify them automatically [9], a probable processing pipeline would filter them out before parsing. We decided to remove them using oracle information available in the English Treebank, and the coreference chains were remapped to trees without disfluencies. Owing to various technical constraints, we decided to retain the disfluencies in the Chinese data.

[8] There is another phrase type, EMBED, in the telephone conversation genre which is similar to the EDITED phrase type; it sometimes identifies insertions, but sometimes contains logical continuations of phrases by different speakers, so we decided not to remove it from the data.
[9] A study by Charniak and Johnson (2001) shows that one can identify and remove edits from transcribed conversational speech with an F-score of about 78, with roughly 95 precision and 67 recall.
Layer             English   Chinese   Arabic
Segmentation         •          •        •
Lemma                ◦          —        •
Parse                ◦          ◦        ◦ [10]
Proposition          ◦          ◦        ◦
Predicate Frame      ◦          ◦        ◦
Word Sense           ◦          ◦        ◦
Named Entities       ◦          ◦        ◦
Coreference          ◦          ◦        ◦
Speaker              •          •        —
Number               ◦          ×        ×
Gender               ◦          ×        ×

Table 2: Status of layers used during prediction of other layers. A "•" indicates gold annotation, a "◦" indicates predicted annotation, a "×" indicates the absence of a predicted layer, and a "—" indicates that the layer is not applicable to the language.

[10] The predicted parts of speech for Arabic are a mapped-down version of the richer gold tags present in the Treebank.
The predicted annotation layers that are input to downstream models were produced by NLP processors learned with n-fold cross-validation on the training data. This way, each of the n chunks of training data is annotated by a model that was not trained on it, avoiding dependencies between the predicted layers and the data used to train the downstream models.
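A minimal sketch of this jackknifing scheme follows. The train_tagger and annotate callables are placeholders for whichever layer-specific learner is being jackknifed; they are assumptions for illustration, not the actual tools evaluated in this paper.

    from sklearn.model_selection import KFold

    def jackknife_annotate(documents, train_tagger, annotate, n_folds=10):
        # Produce predicted annotations for the training data itself:
        # each document is annotated by a model that never saw it during
        # training, so downstream layers learn from realistic, imperfect input.
        predictions = [None] * len(documents)
        for train_idx, held_out_idx in KFold(n_splits=n_folds).split(documents):
            model = train_tagger([documents[i] for i in train_idx])
            for i in held_out_idx:
                predictions[i] = annotate(model, documents[i])
        return predictions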
Spoken Genre  Given the scope of this study, we make another significant assumption: for the spoken genres (BC, BN and TC) we use the manual transcriptions rather than the output of a speech recognizer, as would be the case in the real world. The performance on the various layers for these genres is therefore artificially inflated, and this should be taken into account when analyzing the results. Not many studies have previously reported on syntactic and semantic analysis of spoken genres; Favre et al. (2010) report performance on the English subset of an earlier version of OntoNotes.

Discourse  The corpus contains information on the speaker for the broadcast conversation, broadcast news and telephone conversation data, and on the writer for the web data. This information provides an important clue for correctly linking anaphoric pronouns with the right antecedents. It could be deduced automatically, but that is also not within the scope of our study. Therefore, we decided to provide gold, instead of predicted, speaker data both during training and testing. Table 2 lists the status of the layers.
4 Experiments

In this section, we report on experiments carried out using all available data in the training set for training models for a particular layer, and using the CoNLL-2012 test set as the test set.

4.1 Syntax
Predicted parse trees for English were produced using the Charniak parser [11] (Charniak and Johnson, 2005). Some additional tag types used in
the OntoNotes trees were added to the parser’s
tagset, including the nominal (NML) tag, and the
rules used to determine head words were extended
correspondingly. Chinese and Arabic parses were
generated using the Berkeley parser (Petrov and
Klein, 2007). In the case of Arabic, the parsing community uses a mapping from rich Arabic
part of speech tags to Penn-style part of speech
tags. We used the mapping that is included with
the Arabic Treebank. The predicted parses for
the training portion of the data were generated using 10-fold (5-fold for Arabic) cross-validation. For testing, we used a model trained on the entire training portion. Table 3 shows the precision, recall and F1 scores of the retrained parsers on the CoNLL-2012 test set, along with the part-of-speech accuracies (POS), computed using the standard evalb scorer.
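The bracket scores in Table 3 are produced by evalb; the simplified sketch below illustrates the underlying computation, labeled-bracket precision, recall and F1 over constituent spans, ignoring evalb's special handling of punctuation and other parametrized details.

    from collections import Counter

    def bracket_prf(gold_trees, pred_trees):
        # Each tree is given as a list of (label, start, end) constituents;
        # duplicates are counted, as evalb does.
        matched = gold_total = pred_total = 0
        for gold, pred in zip(gold_trees, pred_trees):
            g, p = Counter(gold), Counter(pred)
            matched += sum((g & p).values())
            gold_total += sum(g.values())
            pred_total += sum(p.values())
        precision = matched / pred_total if pred_total else 0.0
        recall = matched / gold_total if gold_total else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1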
The performance on the PT genre for English is the highest among the English genres, possibly because the underlying text consists of professional, clean translations with mostly shorter sentences. The MZ and NW genres, both of which contain well-edited text, share similar scores, and there is a gap of a few points between these and the other genres. As for Chinese, the performance on MZ is the highest, followed by BN; surprisingly, the WB genre has a similar score, and the others are close behind, except for TC. As expected, the Arabic parser performance is the lowest among the three languages.

[11] http://bllip.cs.brown.edu/download/reranking-parserAug06.tar.gz
Language   Genre     Sentences    POS      P       R       F
English    BC           2,211    97.33   86.36   86.11   86.23
           BN           1,357    97.32   87.61   87.03   87.32
           MZ             780    96.58   89.90   89.49   89.70
           NW           2,327    97.15   87.68   87.25   87.47
           TC           1,366    96.11   85.09   84.13   84.60
           WB           1,787    96.03   85.46   85.26   85.36
           PT           1,869    98.77   95.29   94.66   94.98
           Overall     11,697    97.09   88.08   87.65   87.87
Chinese    BC             885    94.79   80.17   79.35   79.76
           BN             929    93.85   83.49   80.13   81.78
           MZ             451    97.06   88.48   83.85   86.10
           NW             481    94.07   82.26   77.28   79.69
           TC             968    92.22   71.90   69.19   70.52
           WB             758    92.37   82.57   78.92   80.70
           Overall      4,472    94.12   82.23   78.93   80.55
Arabic     NW           1,003    94.12   74.71   75.67   75.19

Table 3: Parser performance on the CoNLL-2012 test set.

Language   Genre        P      R      F      A
English    BC         81.2   81.3   81.2     -
           BN         82.0   81.5   81.7     -
           MZ         79.1   78.8   79.0     -
           NW         85.7   85.7   85.7     -
           WB         77.5   77.6   77.5     -
           Overall    82.5   82.5   82.5     -
           Nouns      83.4   83.1   83.2     -
           Verbs      81.8   81.9   81.8     -
Chinese    BC           -      -      -    80.5
           BN           -      -      -    85.4
           MZ           -      -      -    82.4
           NW           -      -      -    89.1
           Overall      -      -      -    84.3
Arabic     NW         75.9   75.2   75.6     -
           Nouns      79.2   77.7   78.4     -
           Verbs      68.8   69.5   69.1     -

Table 4: Word sense performance on the CoNLL-2012 test set.
4.2 Word Sense
We used the IMS (It Makes Sense) word sense tagger [12] (Zhong and Ng, 2010). IMS was trained on all the word sense data present in the training portion of the OntoNotes corpus, using cross-validated predictions on the input layers similar
to the proposition tagger. During testing, for English and Arabic, IMS must first use the automatic POS information to identify the nouns and
verbs in the test data, and then assign senses to
the automatically identified nouns and verbs. In
the case of Arabic, IMS uses gold lemmas. Since
automatic POS tagging is not perfect, IMS does
not always output a sense to all word tokens that
need to be sense tagged due to wrongly predicted
POS tags. As such, recall is not the same as precision on the English and Arabic test data. For
Chinese the measure of performance is just the
accuracy since the senses are defined per lemma
rather than per part of speech. Since we provide
gold word segmentation, IMS attempts to sense
tag all correctly segmented Chinese words, so recall and precision are the same and so is the F1 score. Table 4 shows the performance of this classifier aggregated over both the verbs and nouns
in the CoNLL-2012 test set and an overall score
split by nouns and verbs for English and Arabic. For both nouns and verbs in English, the
F1 -score is over 80%. The performance on English nouns is slightly higher than English verbs.
Comparing to the other two languages, the performance on Arabic is relatively lower, especially the
performance on Arabic verbs, whose F1 -score is
less than 70%. For English, genres PT and TC,
and for Chinese genres TC and WB, no gold standard senses were available, and so their accuracies
could not be computed. Previously, Zhong et al.
(2008) reported the word sense performance on
the Wall Street Journal portion of an earlier version of OntoNotes, but the results are not directly comparable.

[12] http://www.comp.nus.edu.sg/~nlp/sw/IMS_v0.9.2.1.tar.gz
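Under the scoring convention described above, precision is taken over the tokens that the tagger actually attempted, while recall is taken over all tokens carrying a gold sense; the small sketch below makes the distinction explicit. The data structures are hypothetical, and this is not the official scorer.

    def sense_prf(gold_senses, predicted_senses):
        # gold_senses: dict token_id -> sense label.
        # predicted_senses may omit tokens whose predicted POS made them
        # ineligible for tagging, which is why precision and recall differ.
        attempted = len(predicted_senses)
        correct = sum(1 for tok, sense in predicted_senses.items()
                      if gold_senses.get(tok) == sense)
        precision = correct / attempted if attempted else 0.0
        recall = correct / len(gold_senses) if gold_senses else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1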
4.3 Proposition

The revised PropBank introduced two new link types, LINK-SLC and LINK-PCR. Since the community is not yet used to the new PropBank representation, which relies heavily on the trace structure of the Treebank that we decided to exclude, we unfold the LINKs back to their original representation as in the PropBank 1.0 release. We used ASSERT [15] (Pradhan et al., 2005) to predict the propositional structure for English. We made a small modification to ASSERT, replacing the TinySVM classifier with a CRF [16] in order to speed up training the model on all the data. The Chinese propositional structure was predicted with the Chinese semantic role labeler described in Xue (2008), retrained on the OntoNotes v5.0 data. The Arabic propositional structure was predicted using the system described in Diab et al. (2008).

[15] http://cemantix.org/assert.html
[16] http://leon.bottou.org/projects/sgd
Language   Genre    Frame ID   Total Sent.   Total Prop.   % Perfect Prop.       Argument ID + Class
                                                                                  P       R       F
English    BC         93.2        1,994         5,806          52.89           80.76   69.69   74.82
           BN         92.7        1,218         4,166          54.78           80.22   69.36   74.40
           MZ         90.8          740         2,655          50.77           79.13   67.78   73.02
           NW         92.8        2,122         6,930          46.45           79.80   66.80   72.72
           TC         91.8          837         1,718          49.94           79.85   72.35   75.91
           WB         90.7        1,139         2,751          42.86           80.51   69.06   74.35
           PT         96.6        1,208         2,849          67.53           89.35   84.43   86.82
           Overall    92.8        9,261        26,882          51.66           81.30   70.53   75.53
Chinese    BC         87.7          885         2,323          31.34           53.92   68.60   60.38
           BN         93.3          929         4,419          35.44           64.34   66.05   65.18
           MZ         92.3          451         2,620          31.68           65.04   65.40   65.22
           NW         96.6          481         2,210          27.33           69.28   55.74   61.78
           TC         82.2          968         1,622          32.74           48.70   59.12   53.41
           WB         87.8          758         1,761          35.21           62.35   68.87   65.45
           Overall    90.9        4,472        14,955          32.62           61.26   64.48   62.83
Arabic     NW         85.6        1,003         2,337          24.18           52.99   45.03   48.68

Table 5: Proposition and frameset disambiguation performance [14] on the CoNLL-2012 test set.

[14] The Frame ID column indicates the F-score for English and Arabic, and accuracy for Chinese, for the same reasons as for word sense.
Table 5 shows the detailed performance numbers [17]. The CoNLL-2005 scorer [18] was used to compute the scores. At first glance, the performance on the English newswire genre is much lower than what has been reported for WSJ Section 23. This can be attributed to several factors: i) the newswire in OntoNotes not only contains WSJ data, but also Xinhua news and some other newswire evaluation data; ii) the WSJ training and test portions in OntoNotes are a subset of the standard ones that have been used to report performance earlier; iii) the PropBank guidelines were significantly revised during the OntoNotes project in order to synchronize well with the Treebank; and iv) it includes propositions for be verbs, which were missing from the original PropBank. The newly added Pivot Text data (comprising the New Testament) shows very good performance. The Chinese and Arabic [19] accuracies are much worse. In addition to automatically predicting the arguments, we also trained the IMS system to tag PropBank frameset IDs.

[17] The number of sentences in this table is a subset of the ones in the table showing parser performance, since these are the sentences for which at least one predicate has been tagged with its arguments.
[18] http://www.lsi.upc.es/~srlconll/srl-eval.pl
[19] The system could not use the morphology features in Diab et al. (2008).
Language   Genre     Entity Count     P       R       F
English    BC            1,671      80.17   77.20   78.66
           BN            2,180      88.95   85.69   87.29
           MZ            1,161      82.74   82.17   82.45
           NW            4,679      86.79   84.25   85.50
           TC              362      74.09   61.60   67.27
           WB            1,133      77.72   68.05   72.56
           Overall      11,186      84.04   80.86   82.42
Chinese    BC              667      72.49   58.47   64.73
           BN            3,158      82.17   71.50   76.46
           NW            1,453      86.11   76.39   80.96
           MZ            1,043      65.16   56.66   60.62
           TC              200      48.00   60.00   53.33
           WB              886      80.60   51.13   62.57
           Overall       7,407      78.20   66.45   71.85
Arabic     NW            2,550      74.53   62.55   68.02

Table 6: Performance of the named entity recognizer on the CoNLL-2012 test set.
4.4 Named Entities

We retrained the Stanford named entity recognizer [20] (Finkel et al., 2005) on the OntoNotes data. Table 6 shows the performance details for all the languages across all 18 name types, broken down by genre. In English, BN has the highest performance, followed by the NW genre, and there is a significant drop from those to the TC and WB genres. A somewhat similar trend is observed in the Chinese data, with Arabic having the lowest scores. Since the Pivot Text portion (PT) of OntoNotes was not tagged with names, we could not compute the accuracy for that cross-section of the data. Previously, Finkel and Manning (2009) performed a joint estimation of named entity recognition and parsing; however, that was on an earlier version of the English portion of OntoNotes, using a different cross-section for training and testing, and is therefore not directly comparable.

[20] http://nlp.stanford.edu/software/CRF-NER.shtml
4.5 Coreference
The task is to automatically identify mentions of
entities and events in text and to link the coreferring mentions together to form entity/event chains.
The coreference decisions are made using automatically predicted information on other structural
and semantic layers including the parses, semantic roles, word senses, and named entities that
were produced in the earlier sections. Each part of a document that was split into multiple parts during coreference annotation was treated as a separate document.
We used the number and gender predictions generated by Bergsma and Lin (2006). Unfortunately, neither Arabic nor Chinese has comparable data available. Chinese, in particular, does not have number or gender inflections for nouns, but Baran and Xue (2011) look at a way to infer such information.
We trained the Björkelund and Farkas (2012) coreference system [21], which uses a combination of two pairwise resolvers: the first is an incremental chain-based resolution algorithm (Björkelund and Farkas, 2012), and the second is a best-first
resolver (Ng and Cardie, 2002). The two resolvers
are combined by stacking, i.e., the output of the
first resolver is used as features in the second one.
The system uses a large feature set tailored for
each language which, in addition to classic coreference features, includes both lexical and syntactic
information.
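The stacking arrangement can be sketched as follows: the first resolver is run over the document, and its pairwise decisions become additional features for the second resolver. This is only an illustration of the general idea, with a placeholder feature representation; it is not the actual Björkelund and Farkas (2012) implementation.

    def stacked_features(mention_pair, base_features, first_stage_links):
        # first_stage_links is the set of (antecedent, anaphor) pairs that the
        # incremental chain-based resolver decided to link.
        features = dict(base_features)
        features["first_stage_says_coreferent"] = mention_pair in first_stage_links
        return features

    # The second, best-first resolver is then trained on these augmented
    # features, so it can learn when to trust or override the first stage.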
Recently, it was discovered that there is possibly a bug in the official scorer used for the
CoNLL 2011/2012 and the SemEval 2010 coreference tasks. This relates to the mis-implementation
of the method proposed by Cai and Strube (2010) for scoring predicted mentions. This issue has also been reported recently by Recasens et al. (2013).
As of this writing, the BCUBED metric has been
fixed, and the correctness of the CEAFm , CEAFe
and BLANC metrics is being verified. We will
be updating the CoNLL shared task webpages [22]
with more detailed information and also release
the patched scripts as soon as they are available.
We will also re-generate the scores for previous
shared tasks, and the coreference layer in this paper and make them available along with the models and system outputs for other layers. Table
7 shows the performance of the system on the
CoNLL-2012 test set, broken down by genre. The same metrics that were used for the CoNLL-2012 shared task are computed, with the CoNLL column being the official CoNLL measure.

[21] http://www.ims.uni-stuttgart.de/~anders/coref.html
[22] http://conll.cemantix.org
PREDICTED MENTIONS
Language   Genre      MD      MUC     BCUBED   CEAFm   CEAFe   BLANC   CoNLL
English    BC        73.43   63.92    61.98    54.82   42.68   73.04   56.19
           BN        73.49   63.92    65.85    58.93   48.14   72.74   59.30
           MZ        71.86   64.94    71.38    64.03   50.68   78.87   62.33
           NW        68.54   60.20    65.11    57.54   45.10   73.72   56.80
           PT        86.95   79.09    68.33    65.52   50.83   77.74   66.08
           TC        80.81   76.78    71.35    65.41   45.44   82.45   64.52
           WB        74.43   66.86    61.43    54.76   42.05   73.54   56.78
           Overall   75.38   67.58    65.78    59.20   45.87   75.80   59.74
Chinese    BC        68.02   59.60    59.44    53.12   40.77   73.63   53.27
           BN        68.57   61.34    67.83    60.90   48.10   77.39   59.09
           MZ        55.55   48.89    58.83    55.63   46.04   74.25   51.25
           NW        89.19   80.71    73.64    76.30   70.89   82.56   75.08
           TC        77.72   73.59    71.65    64.30   48.52   83.14   64.59
           WB        72.61   65.79    62.32    56.71   43.67   77.45   57.26
           Overall   66.37   58.61    66.56    59.01   48.19   76.07   57.79
Arabic     NW        60.55   47.82    61.16    53.42   44.30   69.63   51.09

GOLD MENTIONS
Language   Genre      MD      MUC     BCUBED   CEAFm   CEAFe   BLANC   CoNLL
English    BC        85.63   76.09    68.70    61.73   49.87   76.24   64.89
           BN        82.11   73.56    71.52    63.67   52.29   75.70   65.79
           MZ        85.65   77.73    78.82    72.75   60.09   83.88   72.21
           NW        80.68   73.52    73.08    65.63   51.96   81.06   66.19
           PT        93.20   85.72    73.25    70.76   58.81   79.78   72.59
           TC        90.68   86.83    78.94    73.87   56.26   85.82   74.01
           WB        88.12   80.61    69.86    63.45   51.13   76.48   67.20
           Overall   86.16   78.70    72.67    66.32   53.23   79.22   68.20
Chinese    BC        84.88   76.34    69.89    62.02   49.29   76.89   65.17
           BN        80.97   74.89    76.88    68.91   55.56   81.94   69.11
           MZ        78.85   73.06    70.15    61.68   46.86   78.78   63.36
           NW        93.23   86.54    86.70    80.60   76.60   85.75   83.28
           TC        92.91   88.31    84.51    79.49   63.87   90.04   78.90
           WB        85.87   77.61    69.24    60.71   47.47   77.67   64.77
           Overall   83.47   76.85    76.30    68.30   56.61   81.56   69.92
Arabic     NW        76.43   60.81    67.29    59.50   49.32   74.61   59.14

Table 7: Performance of the coreference system on the CoNLL-2012 test set.
The varying results across genres mostly meet
our expectations. In English, the system does best
on TC and the PT genres. The text in the TC set
often involves long chains in which the speakers refer to themselves, which, given speaker information, are fairly easy to resolve. The PT section
includes many references to god (e.g. god and
the lord) which the lexicalized resolver is quite
good at picking up during training. The more difficult genres consist of texts where references to
many entities are interleaved in the discourse and
are as such harder to resolve correctly. For Chinese, the numbers on the TC genre are also quite
good, and the explanation above also holds here
— many mentions refer to either of the speakers. For Chinese the NW section displays by far
the highest scores, however, and the reason for
this is not clear to us. Not surprisingly, restricting
the set of mentions only to gold mentions gives
a large boost across all genres and all languages.
This shows that mention detection (MD) and singleton detection (which is not part of the annotation) remain a big source of errors for the coreference resolver. For these experiments we used
a combination of training and development data
for training — following the CoNLL-2012 shared
task specification. Leaving out the development
set has a very negligible effect on the CoNLL score for all the languages (0.14, 0.06 and 0.40 F-score for English, Chinese and Arabic, respectively). The effect on Arabic is the largest (0.40 F-score), most likely because of its much smaller size. To gauge
the performance improvement between 2011 and
2012 shared tasks, we performed a clean comparison of the best performing system and an earlier version of this system (Björkelund and Nugues, 2011) on the CoNLL-2011 test set, using the CoNLL-2011 training and development sets for training. The current system has a CoNLL score of 60.09 ((64.92 + 69.84 + 45.51) / 3) [23], as opposed to the 54.53 reported by Björkelund and Nugues (2011), and the 57.79 reported for the best
performing system of CoNLL-2011. One caveat
is that these score comparisons are done using the
earlier version (v4) of the CoNLL scorer. Nevertheless, it is encouraging to see that within a
short span of a year, there has been significant
improvement in system performance – partially
owing to cross-pollination of research generated
through the shared tasks.
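For reference, the official CoNLL score used throughout this section is simply the arithmetic mean of the MUC, B-cubed and CEAFe F-scores (see footnote 23); the few lines below reproduce the 60.09 figure quoted above.

    def conll_score(muc_f, bcubed_f, ceafe_f):
        # Official CoNLL coreference score: mean of the three F-scores.
        return (muc_f + bcubed_f + ceafe_f) / 3.0

    print(round(conll_score(64.92, 69.84, 45.51), 2))  # 60.09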
5 Conclusion
In this paper we reported work on finding a reasonable training, development and test split for
the various layers of annotation in the OntoNotes
v5.0 corpus, which consists of multiple genres in
three typologically very different languages. We
also presented the performance of publicly available, state-of-the-art algorithms on all the different
layers of the corpus for the different languages.
The trained models as well as their output will
be made publicly available [24] to serve as benchmarks for the language processing community. Training so many different NLP components is very
time-consuming, thus, we hope the work reported
here has lifted the burden of having to create reasonable baselines for researchers who wish to use
this corpus to evaluate their systems. We created
just one data split in training, development and test
set, covering a collection of genres for each layer
of annotation in each language in order to keep the
workload manageable. However, the results do not
discriminate the performance on individual genres: we believe such a setup is still a more realistic
gauge for the performance of the state-of-the-art
NLP components than a monolithic corpus such
as the Wall Street Journal section of the Penn Treebank. It can be used as a starting point for developing the next generation of NLP components that
are more robust and perform well on a multitude
of genres for a variety of different languages.
[23] (MUC + BCUBED + CEAFe) / 3
[24] http://cemantix.org
6 Acknowledgments
We gratefully acknowledge the support of the
Defense Advanced Research Projects Agency
(DARPA/IPTO) under the GALE program,
DARPA/CMO Contract No. HR0011-06-C-0022
for sponsoring the creation of the OntoNotes
corpus.
This work was partially supported
by grants R01LM10090 and U54LM008748
from the National Library of Medicine, and R01GM090187 from the National Institute of General Medical Sciences. We are indebted to
Slav Petrov for helping us to retrain his syntactic
parser for Arabic. Alessandro Moschitti and
Olga Uryupina have been partially funded by
the European Community’s Seventh Framework
Programme (FP7/2007-2013) under the grant
number 288024 (LiMoSINe).
The content
is solely the responsibility of the authors and
does not necessarily represent the official views
of the National Institutes of Health. Nianwen
Xue and Yuchen Zhang are supported in part
by DARPA via contract HR0011-11-C-0145
entitled “Linguistic Resources for Multilingual
Processing.”
References
Olga Babko-Malaya, Ann Bies, Ann Taylor, Szuting Yi,
Martha Palmer, Mitch Marcus, Seth Kulick, and Libin
Shen. 2006. Issues in synchronizing the English treebank
and propbank. In Workshop on Frontiers in Linguistically
Annotated Corpora 2006, July.
Elizabeth Baran and Nianwen Xue. 2011. Singular or plural?
exploiting parallel corpora for Chinese number prediction.
In Proceedings of Machine Translation Summit XIII, Xiamen, China.
Shane Bergsma and Dekang Lin. 2006. Bootstrapping pathbased pronoun resolution. In Proceedings of the 21st International Conference on Computational Linguistics and
44th Annual Meeting of the Association for Computational Linguistics, pages 33–40, Sydney, Australia, July.
Anders Björkelund and Richárd Farkas. 2012. Data-driven
multilingual coreference resolution using resolver stacking. In Joint Conference on EMNLP and CoNLL - Shared
Task, pages 49–55, Jeju Island, Korea, July. Association
for Computational Linguistics.
Anders Björkelund and Pierre Nugues. 2011. Exploring lexicalized features for coreference resolution. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 45–50, Portland, Oregon, USA, June. Association for Computational
Linguistics.
Jie Cai and Michael Strube. 2010. Evaluation metrics for
end-to-end coreference resolution systems. In Proceedings of the 11th Annual Meeting of the Special Interest
Group on Discourse and Dialogue, SIGDIAL ’10, pages
28–36.
Shu Cai, David Chiang, and Yoav Goldberg.
2011.
Language-independent parsing with empty elements. In
Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language Technologies, pages 212–216, Portland, Oregon, USA, June.
Association for Computational Linguistics.
Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL), Ann Arbor, MI, June.

Eugene Charniak and Mark Johnson. 2001. Edit detection and parsing for transcribed speech. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics, June.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, June.

Mona Diab, Alessandro Moschitti, and Daniele Pighin. 2008. Semantic role labeling systems for Arabic using kernel methods. In Proceedings of ACL-08: HLT, pages 798–806, Columbus, Ohio, June. Association for Computational Linguistics.

Gerard Escudero, Lluis Marquez, and German Rigau. 2000. An empirical study of the domain dependence of supervised word sense disambiguation systems. In 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 172–180, Hong Kong, China, October. Association for Computational Linguistics.

Benoit Favre, Bernd Bohnet, and D. Hakkani-Tur. 2010. Evaluation of semantic role labeling and dependency parsing of automatic speech recognition output. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5342–5345.
Jenny Rose Finkel and Christopher D. Manning. 2009. Joint
parsing and named entity recognition. In Proceedings of
Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association
for Computational Linguistics, pages 326–334, Boulder,
Colorado, June. Association for Computational Linguistics.
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association
for Computational Linguistics, page 363–370.
Ryan Gabbard. 2010. Null Element Restoration. Ph.D. thesis, University of Pennsylvania.
Daniel Gildea. 2001. Corpus variation and parser performance. In 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP), Pittsburgh, PA.
Julia Hockenmaier and Mark Steedman. 2002. Acquiring compact lexicalized grammars from a cleaner treebank. In Proceedings of the Third LREC Conference, page
1974–1981.
Mohamed Maamouri and Ann Bies. 2004. Developing an
Arabic treebank: Methods, guidelines, procedures, and
tools. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational Approaches to Arabic
Script-based Languages, pages 2–9, Geneva, Switzerland,
August 28th. COLING.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large annotated corpus
of English: The Penn treebank. Computational Linguistics, 19(2):313–330, June.
David McClosky, Eugene Charniak, and Mark Johnson.
2006. Effective self-training for parsing. In Proceedings
of the Human Language Technology Conference/North
American Chapter of the Association for Computational
Linguistics (HLT/NAACL), New York City, NY, June.
Vincent Ng and Claire Cardie. 2002. Improving machine
learning approaches to coreference resolution. In Proceedings of the Association for Computational Linguistics
(ACL-02), pages 104–111.
Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005.
The Proposition Bank: An annotated corpus of semantic
roles. Computational Linguistics, 31(1):71–106.
Martha Palmer, Hoa Trang Dang, and Christiane Fellbaum.
2007. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Journal of
Natural Language Engineering, 13(2).
Martha Palmer, Olga Babko-Malaya, Ann Bies, Mona Diab,
Mohammed Maamouri, Aous Mansouri, and Wajdi Zaghouani. 2008. A pilot Arabic propbank. In Proceedings
of the International Conference on Language Resources
and Evaluation (LREC), Marrakech, Morocco, May 2830.
Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proceedings of HLT-NAACL.
Sameer Pradhan, Kadri Hacioglu, Valerie Krugler, Wayne
Ward, James Martin, and Dan Jurafsky. 2005. Support
vector learning for semantic argument classification. Machine Learning, 60(1):11–39.
Sameer Pradhan, Lance Ramshaw, Ralph Weischedel, Jessica MacBride, and Linnea Micciulla. 2007. Unrestricted coreference: Identifying entities and events in
OntoNotes. In Proceedings of the IEEE International
Conference on Semantic Computing (ICSC), September
17-19.
Sameer Pradhan, Wayne Ward, and James H. Martin. 2008.
Towards robust semantic role labeling. Computational
Linguistics Special Issue on Semantic Role Labeling,
34(2).
Sameer Pradhan, Lance Ramshaw, Mitchell Marcus, Martha
Palmer, Ralph Weischedel, and Nianwen Xue. 2011.
CoNLL-2011 shared task: Modeling unrestricted coreference in OntoNotes. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning:
Shared Task, pages 1–27, Portland, Oregon, USA, June.
Association for Computational Linguistics.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga
Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared
task: Modeling multilingual unrestricted coreference in
OntoNotes. In Joint Conference on EMNLP and CoNLL Shared Task, pages 1–40, Jeju Island, Korea, July. Association for Computational Linguistics.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki,
Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008.
The Penn discourse treebank 2.0. In Proceedings of the
Sixth International Conference on Language Resources
and Evaluation (LREC’08), Marrakech, Morocco, May.
Marta Recasens, Lluı́s Màrquez, Emili Sapena, M. Antònia
Martı́, Mariona Taulé, Véronique Hoste, Massimo Poesio,
and Yannick Versley. 2010. Semeval-2010 task 1: Coreference resolution in multiple languages. In Proceedings of
the 5th International Workshop on Semantic Evaluation,
pages 1–8, Uppsala, Sweden, July.
Marta Recasens, Marie-Catherine de Marneffe, and Christopher Potts. 2013. The life and death of discourse entities: Identifying singleton mentions. In Proceedings of
the 2013 Conference of the North American Chapter of
the Association for Computational Linguistics: Human
Language Technologies, pages 627–633, Atlanta, Georgia, June. Association for Computational Linguistics.
Libin Shen, Lucas Champollion, and Aravind K. Joshi. 2008.
LTAG-spinal and the treebank. Language Resources and
Evaluation, 42(1):1–19, March.
Ralph Weischedel and Ada Brunstein. 2005. BBN pronoun coreference and entity type corpus LDC catalog no.:
LDC2005T33. BBN Technologies.
Ralph Weischedel, Eduard Hovy, Mitchell Marcus, Martha
Palmer, Robert Belvin, Sameer Pradhan, Lance Ramshaw,
and Nianwen Xue. 2011. OntoNotes: A large training corpus for enhanced processing. In Joseph Olive,
Caitlin Christianson, and John McCary, editors, Handbook of Natural Language Processing and Machine
Translation: DARPA Global Autonomous Language Exploitation. Springer.
Nianwen Xue and Martha Palmer. 2009. Adding semantic
roles to the Chinese Treebank. Natural Language Engineering, 15(1):143–172.
Nianwen Xue, Fei Xia, Fu dong Chiou, and Martha Palmer.
2005. The Penn Chinese TreeBank: phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.
Nianwen Xue. 2008. Labeling Chinese predicates with semantic roles. Computational Linguistics, 34(2):225–255.
Yaqin Yang and Nianwen Xue. 2010. Chasing the ghost:
recovering empty categories in the Chinese treebank.
In Proceedings of the 23rd International Conference on
Computational Linguistics (COLING), Beijing, China.
Wajdi Zaghouani, Mona Diab, Aous Mansouri, Sameer Pradhan, and Martha Palmer. 2010. The revised Arabic propbank. In Proceedings of the Fourth Linguistic Annotation
Workshop, pages 222–226, Uppsala, Sweden, July.
Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A widecoverage word sense disambiguation system for free text.
In Proceedings of the ACL 2010 System Demonstrations,
pages 78–83, Uppsala, Sweden.
Zhi Zhong, Hwee Tou Ng, and Yee Seng Chan. 2008.
Word sense disambiguation using OntoNotes: An empirical study. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pages 1002–
1010.