Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
ACL 2010
48th Annual Meeting of the
Association for Computational Linguistics
Proceedings of the Conference Short Papers
11-16 July 2010
Uppsala University
Uppsala, Sweden
Production and Manufacturing by
Taberg Media Group AB
Box 94, 562 02 Taberg
The Association for Computational Linguistics
Order copies of this and other ACL proceedings from:
Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]
Table of Contents
Paraphrase Lattice for Statistical Machine Translation
Takashi Onishi, Masao Utiyama and Eiichiro Sumita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
A Joint Rule Selection Model for Hierarchical Phrase-Based Translation
Lei Cui, Dongdong Zhang, Mu Li, Ming Zhou and Tiejun Zhao . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Learning Lexicalized Reordering Models from Reordering Graphs
Jinsong Su, Yang Liu, Yajuan Lv, Haitao Mi and Qun Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Filtering Syntactic Constraints for Statistical Machine Translation
Hailong Cao and Eiichiro Sumita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages
Bing Xiang, Yonggang Deng and Bowen Zhou . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Efficient Path Counting Transducers for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices
Graeme Blackwood, Adrià de Gispert and William Byrne . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
The Same-Head Heuristic for Coreference
Micha Elsner and Eugene Charniak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Authorship Attribution Using Probabilistic Context-Free Grammars
Sindhu Raghavan, Adriana Kovashka and Raymond Mooney . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
The Impact of Interpretation Problems on Tutorial Dialogue
Myroslava O. Dzikovska, Johanna D. Moore, Natalie Steinhauser and Gwendolyn Campbell . . . 43
The Prevalence of Descriptive Referring Expressions in News and Narrative
Raquel Hervas and Mark Finlayson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Preferences versus Adaptation during Referring Expression Generation
Martijn Goudbeek and Emiel Krahmer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Cognitively Plausible Models of Human Language Processing
Frank Keller. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60
The Manually Annotated Sub-Corpus: A Community Resource for and by the People
Nancy Ide, Collin Baker, Christiane Fellbaum and Rebecca Passonneau . . . . . . . . . . . . . . . . . . . . . . 68
Correcting Errors in a Treebank Based on Synchronous Tree Substitution Grammar
Yoshihide Kato and Shigeki Matsubara . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Evaluating Machine Translations Using mNCD
Marcus Dobrinkat, Tero Tapiovaara, Jaakko Väyrynen and Kimmo Kettunen . . . . . . . . . . . . . . . . . 80
Tackling Sparse Data Issue in Machine Translation Evaluation
Ondřej Bojar, Kamil Kos and David Mareček . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Exemplar-Based Models for Word Meaning in Context
Katrin Erk and Sebastian Pado . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A Structured Model for Joint Learning of Argument Roles and Predicate Senses
Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Semantics-Driven Shallow Parsing for Chinese Semantic Role Labeling
Weiwei Sun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Collocation Extraction beyond the Independence Assumption
Gerlof Bouma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Automatic Collocation Suggestion in Academic Writing
Jian-Cheng Wu, Yu-Chia Chang, Teruko Mitamura and Jason S. Chang . . . . . . . . . . . . . . . . . . . . . 115
Event-Based Hyperspace Analogue to Language for Query Expansion
Tingxu Yan, Tamsin Maxwell, Dawei Song, Yuexian Hou and Peng Zhang . . . . . . . . . . . . . . . . . . 120
Automatically Generating Term Frequency Induced Taxonomies
Karin Murthy, Tanveer A Faruquie, L Venkata Subramaniam, Hima Prasad K and Mukesh Mohania
Complexity Assumptions in Ontology Verbalisation
Richard Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Word Alignment with Synonym Regularization
Hiroyuki Shindo, Akinori Fujino and Masaaki Nagata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules
Zhiyang Wang, Yajuan Lv, Qun Liu and Young-Sook Hwang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Fixed Length Word Suffix for Factored Statistical Machine Translation
Narges Sharif Razavian and Stephan Vogel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure
Minwoo Jeong and Ivan Titov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Coreference Resolution with Reconcile
Veselin Stoyanov, Claire Cardie, Nathan Gilbert, Ellen Riloff, David Buttler and David Hysom 156
Predicate Argument Structure Analysis Using Transformation Based Learning
Hirotoshi Taira, Sanae Fujita and Masaaki Nagata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Improving Chinese Semantic Role Labeling with Rich Syntactic Features
Weiwei Sun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Balancing User Effort and Translation Error in Interactive Machine Translation via Confidence Measures
Jesús González Rubio, Daniel Ortiz Martı́nez and Francisco Casacuberta . . . . . . . . . . . . . . . . . . . . 173
Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for
Marine Carpuat, Yuval Marton and Nizar Habash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Learning Common Grammar from Multilingual Corpus
Tomoharu Iwata, Daichi Mochihashi and Hiroshi Sawada . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Tree-Based Deterministic Dependency Parsing — An Application to Nivre’s Method —
Kotaro Kitagawa and Kumiko Tanaka-Ishii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Sparsity in Dependency Grammar Induction
Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira and Ben Taskar . . . . . . . 194
Top-Down K-Best A* Parsing
Adam Pauls, Dan Klein and Chris Quirk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Simple Semi-Supervised Training of Part-Of-Speech Taggers
Anders Søgaard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging
Ashish Vaswani, Adam Pauls and David Chiang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
SVD and Clustering for Unsupervised POS Tagging
Michael Lamar, Yariv Maron, Mark Johnson and Elie Bienenstock . . . . . . . . . . . . . . . . . . . . . . . . . 215
Intelligent Selection of Language Model Training Data
Robert C. Moore and William Lewis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Blocked Inference in Bayesian Tree Substitution Grammars
Trevor Cohn and Phil Blunsom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .225
Online Generation of Locality Sensitive Hash Signatures
Benjamin Van Durme and Ashwin Lall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
Optimizing Question Answering Accuracy by Maximizing Log-Likelihood
Matthias H. Heie, Edward W. D. Whittaker and Sadaoki Furui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Generating Entailment Rules from FrameNet
Roni Ben Aharon, Idan Szpektor and Ido Dagan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Don’t ‘Have a Clue’? Unsupervised Co-Learning of Downward-Entailing Operators.
Cristian Danescu-Niculescu-Mizil and Lillian Lee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Vocabulary Choice as an Indicator of Perspective
Beata Beigman Klebanov, Eyal Beigman and Daniel Diermeier . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Cross Lingual Adaptation: An Experiment on Sentiment Classifications
Bin Wei and Christopher Pal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews
Niklas Jakob and Iryna Gurevych . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
Hierarchical Sequential Learning for Extracting Opinions and Their Attributes
Yejin Choi and Claire Cardie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm
Dong Yang, Paul Dixon and Sadaoki Furui . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Arabic Named Entity Recognition: Using Features Extracted from Noisy Data
Yassine Benajiba, Imed Zitouni, Mona Diab and Paolo Rosso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Extracting Sequences from the Web
Anthony Fader, Stephen Soderland and Oren Etzioni . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
An Entity-Level Approach to Information Extraction
Aria Haghighi and Dan Klein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
A Semi-Supervised Key Phrase Extraction Approach: Learning from Title Phrases through a Document Semantic Network
Decong Li, Sujian Li, Wenjie Li, Wei Wang and Weiguang Qu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Domain Adaptation of Maximum Entropy Language Models
Tanel Alumäe and Mikko Kurimo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Decision Detection Using Hierarchical Graphical Models
Trung H. Bui and Stanley Peters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Using Speech to Reply to SMS Messages While Driving: An In-Car Simulator User Study
Yun-Cheng Ju and Tim Paek . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Classification of Feedback Expressions in Multimodal Data
Costanza Navarretta and Patrizia Paggio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
Optimizing Informativeness and Readability for Sentiment Summarization
Hitoshi Nishikawa, Takaaki Hasegawa, Yoshihiro Matsuo and Genichiro Kikui . . . . . . . . . . . . . . 325
Last but Definitely Not Least: On the Role of the Last Sentence in Automatic Polarity-Classification
Israela Becker and Vered Aharonson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Automatically Generating Annotator Rationales to Improve Sentiment Classification
Ainur Yessenalina, Yejin Choi and Claire Cardie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
Simultaneous Tokenization and Part-Of-Speech Tagging for Arabic without a Morphological Analyzer
Seth Kulick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Hierarchical A* Parsing with Bridge Outside Scores
Adam Pauls and Dan Klein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Using Parse Features for Preposition Selection and Error Detection
Joel Tetreault, Jennifer Foster and Martin Chodorow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
Distributional Similarity vs. PU Learning for Entity Set Expansion
Xiao-Li Li, Lei Zhang, Bing Liu and See-Kiong Ng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Active Learning-Based Elicitation for Semi-Supervised Word Alignment
Vamshi Ambati, Stephan Vogel and Jaime Carbonell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
An Active Learning Approach to Finding Related Terms
David Vickrey, Oscar Kipersztok and Daphne Koller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Learning Better Data Representation Using Inference-Driven Metric Learning
Paramveer S. Dhillon, Partha Pratim Talukdar and Koby Crammer . . . . . . . . . . . . . . . . . . . . . . . . . . 377
Wrapping up a Summary: From Representation to Generation
Josef Steinberger, Marco Turchi, Mijail Kabadjov, Ralf Steinberger and Nello Cristianini . . . . . 382
Paraphrase Lattice for Statistical Machine Translation
Takashi Onishi and Masao Utiyama and Eiichiro Sumita
Language Translation Group, MASTAR Project
National Institute of Information and Communications Technology
3-5 Hikaridai, Keihanna Science City, Kyoto, 619-0289, JAPAN
Abstract
Lattice decoding in statistical machine translation (SMT) is useful in speech translation and in the translation of German because it can handle input ambiguities such as speech recognition ambiguities and German word segmentation ambiguities. We show that lattice decoding is also useful for handling input variations. Given an input sentence, we build a lattice which represents paraphrases of the input sentence. We call this a paraphrase lattice. Then, we give the paraphrase lattice as an input to the lattice decoder. The decoder selects the best path for decoding. Using these paraphrase lattices as inputs, we obtained significant gains in BLEU scores for the IWSLT and Europarl datasets.
1 Introduction
Lattice decoding in SMT is useful in speech translation and in the translation of German (Bertoldi et al., 2007; Dyer, 2009). In speech translation, by using lattices that represent not only the 1-best result but also other possibilities of speech recognition, we can take into account the ambiguities of speech recognition. Thus, the translation quality for lattice inputs is better than the quality for 1-best inputs.
In this paper, we show that lattice decoding is also useful for handling input variations. "Input variations" refers to the differences of input texts with the same meaning. For example, "Is there a beauty salon?" and "Is there a beauty parlor?" have the same meaning with variations in "beauty salon" and "beauty parlor". Since these variations are frequently found in natural language texts, a mismatch between the expressions in source sentences and the expressions in the training corpus leads to a decrease in translation quality. Therefore, we propose a novel method that can handle input variations using paraphrases and lattice decoding. In the proposed method, we regard a given source sentence as one of many variations (1-best). Given an input sentence, we build a paraphrase lattice which represents paraphrases of the input sentence. Then, we give the paraphrase lattice as an input to the Moses decoder (Koehn et al., 2007). Moses selects the best path for decoding. By using paraphrases of source sentences, we can translate expressions which are not found in a training corpus on the condition that paraphrases of them are found in the training corpus. Moreover, by using lattice decoding, we can employ the source-side language model as a decoding feature. Since this feature is affected by the source-side context, the decoder can choose a proper paraphrase and translate correctly.
This paper is organized as follows: Related works on lattice decoding and paraphrasing are presented in Section 2. The proposed method is described in Section 3. Experimental results for the IWSLT and Europarl datasets are presented in Section 4. Finally, the paper is concluded with a summary and a few directions for future work in Section 5.
2 Related Work
Lattice decoding has been used to handle ambiguities of preprocessing. Bertoldi et al. (2007) employed a confusion network, which is a kind of lattice and represents speech recognition hypotheses in speech translation. Dyer (2009) also employed a segmentation lattice, which represents ambiguities of compound word segmentation in German, Hungarian and Turkish translation. However, to the best of our knowledge, there is no work which employed a lattice representing paraphrases of an input sentence.
On the other hand, paraphrasing has been used to enrich the SMT model. Callison-Burch et al. (2006) and Marton et al. (2009) augmented the translation phrase table with paraphrases to translate unknown phrases. Bond et al. (2008) and Nakov (2008) augmented the training data by paraphrasing. However, there is no work which augments input sentences by paraphrasing and represents them in lattices.
Proceedings of the ACL 2010 Conference Short Papers, pages 1–5, Uppsala, Sweden, 11-16 July 2010. ©2010 Association for Computational Linguistics
3 Paraphrase Lattice for SMT
An overview of the proposed method is shown in Figure 1. In advance, we automatically acquire a paraphrase list from a parallel corpus. In order to acquire paraphrases of unknown phrases, this parallel corpus is different from the parallel corpus for training.
Given an input sentence, we build a lattice which represents paraphrases of the input sentence using the paraphrase list. We call this lattice a paraphrase lattice. Then, we give the paraphrase lattice to the lattice decoder.
Figure 1: Overview of the proposed method. (Diagram labels: Input sentence; Parallel Corpus (for paraphrase); Paraphrase Lattice; Parallel Corpus (for training); SMT model; Lattice Decoding; Output sentence.)
3.1 Acquiring the paraphrase list
We acquire a paraphrase list using Bannard and Callison-Burch (2005)'s method. Their idea is, if two different phrases $e_1$, $e_2$ in one language are aligned to the same phrase $c$ in another language, they are hypothesized to be paraphrases of each other. Our paraphrase list is acquired in the same way.
The procedure is as follows:
1. Build a phrase table.
Build a phrase table from a parallel corpus using standard SMT techniques.
2. Filter the phrase table by the sigtest-filter.
The phrase table built in 1 has many inappropriate phrase pairs. Therefore, we filter the phrase table and keep only appropriate phrase pairs using the sigtest-filter (Johnson et al., 2007).
3. Calculate the paraphrase probability.
Calculate the paraphrase probability $p(e_2|e_1)$ if $e_2$ is hypothesized to be a paraphrase of $e_1$.
\[ p(e_2|e_1) = \sum_{c} P(c|e_1) P(e_2|c) \]
where $P(\cdot|\cdot)$ is phrase translation probability.
4. Acquire a paraphrase pair.
Acquire $(e_1, e_2)$ as a paraphrase pair if $p(e_2|e_1) > p(e_1|e_1)$. The purpose of this threshold is to keep highly-accurate paraphrase pairs. In experiments, more than 80% of paraphrase pairs were eliminated by this threshold.
3.2 Building paraphrase lattice
An input sentence is paraphrased using the paraphrase list and transformed into a paraphrase lattice. The paraphrase lattice is a lattice which represents paraphrases of the input sentence. An example of a paraphrase lattice is shown in Figure 2. In this example, an input sentence is "is there a beauty salon ?". This paraphrase lattice contains two paraphrase pairs "beauty salon" = "beauty parlor" and "beauty salon" = "salon", and represents the following three sentences.
• is there a beauty salon ?
• is there a beauty parlor ?
• is there a salon ?
0 -- ("is", 1, 1, 1, 1)
1 -- ("there", 1, 1, 1, 1)
2 -- ("a", 1, 1, 1, 1)
3 -- ("beauty", 1, 1, 1, 2) ("beauty", 0.250, 1.172, 1, 1) ("salon", 0.133, 0.537, 0.367, 3)
4 -- ("parlor", 1, 1, 1, 2)
5 -- ("salon", 1, 1, 1, 1)
6 -- ("?", 1, 1, 1, 1)
Each entry is (token, paraphrase probability (p), language model score (l), paraphrase length (d), distance to the next node); the three middle values are the features for lattice decoding.
Figure 2: An example of a paraphrase lattice, which contains three features of (p, l, d).
In the paraphrase lattice, each node consists of a token, the distance to the next node and features for lattice decoding. We use the following four features for lattice decoding.
• Paraphrase probability (p)
A paraphrase probability $p(e_2|e_1)$ calculated when acquiring the paraphrase.
\[ h_p = p(e_2|e_1) \]
• Language model score (l)
A ratio between the language model probability of the paraphrased sentence (para) and that of the original sentence (orig).
\[ h_l = \frac{\mathrm{lm}(\mathrm{para})}{\mathrm{lm}(\mathrm{orig})} \]
• Normalized language model score (L)
A language model score where the language model probability is normalized by the sentence length. The sentence length is calculated as the number of tokens.
\[ h_L = \frac{\mathrm{LM}(\mathrm{para})}{\mathrm{LM}(\mathrm{orig})}, \quad \text{where } \mathrm{LM}(\mathrm{sent}) = \mathrm{lm}(\mathrm{sent})^{1/\mathrm{length}(\mathrm{sent})} \]
• Paraphrase length (d)
The difference between the original sentence length and the paraphrased sentence length.
\[ h_d = \exp(\mathrm{length}(\mathrm{para}) - \mathrm{length}(\mathrm{orig})) \]
The values of these features are calculated only if the node is the first node of the paraphrase, for example the second "beauty" and "salon" in line 3 of Figure 2. In other nodes, for example "parlor" in line 4 and original nodes, we use 1 as the values of features.
The features related to the language model, such as (l) and (L), are affected by the context of source sentences even if the same paraphrase pair is applied. As these features can penalize paraphrases which are not appropriate to the context, appropriate paraphrases are chosen and appropriate translations are output in lattice decoding. The features related to the sentence length, such as (L) and (d), are added to penalize the language model score in case the paraphrased sentence length is shorter than the original sentence length and the language model score is unreasonably low.
In experiments, we use four combinations of these features, (p), (p, l), (p, L) and (p, l, d).
3.3 Lattice decoding
We use Moses (Koehn et al., 2007) as a decoder for lattice decoding. Moses is an open source SMT system which allows lattice decoding. In lattice decoding, Moses selects the best path and the best translation according to features added in each node and other SMT features. The weights of these features are optimized using Minimum Error Rate Training (MERT) (Och, 2003).
4 Experiments
In order to evaluate the proposed method, we conducted English-to-Japanese and English-to-Chinese translation experiments using the IWSLT 2007 (Fordyce, 2007) dataset. This dataset contains EJ and EC parallel corpora for the travel domain and consists of 40k sentences for training and sets of about 500 sentences (dev1, dev2 and dev3) for development and testing. We used the dev1 set for parameter tuning, the dev2 set for choosing the setting of the proposed method, which is described below, and the dev3 set for testing.
The English-English paraphrase list was acquired from the EC corpus for EJ translation and 53K pairs were acquired. Similarly, 47K pairs were acquired from the EJ corpus for EC translation.
System                    EJ              EC
Moses (w/o Paraphrases)   38.98           25.11
CCB                       39.24 (+0.26)   26.14 (+1.03)
Proposed Method           40.34 (+1.36)   27.06 (+1.95)
Table 1: Experimental results for IWSLT (%BLEU).
4.1 Baseline
As baselines, we used Moses and Callison-Burch et al. (2006)'s method (hereafter CCB). In Moses, we used default settings without paraphrases. In CCB, we paraphrased the phrase table using the automatically acquired paraphrase list. Then, we augmented the phrase table with paraphrased phrases which were not found in the original phrase table. Moreover, we used an additional feature whose value was the paraphrase probability (p) if the entry was generated by paraphrasing and 1 otherwise. Weights of the feature and other features in SMT were optimized using MERT.
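As an illustration of the CCB-style baseline, the following minimal Python sketch first derives paraphrase pairs by pivoting, p(e2|e1) = Σ_c P(c|e1) P(e2|c), keeping a pair only if p(e2|e1) > p(e1|e1), and then adds phrase-table entries for source phrases missing from the table. The function names `pivot_paraphrases` and `augment_phrase_table`, the dictionary layout of the phrase table, and all toy probabilities are assumptions made for this example; they are not taken from the paper or from the Moses toolkit, and sigtest filtering is omitted.

```python
from collections import defaultdict


def pivot_paraphrases(p_c_given_e, p_e_given_c):
    """Hypothetical helper: pivot-based paraphrase probabilities.

    p_c_given_e[e1][c] is P(c|e1); p_e_given_c[c][e2] is P(e2|c).
    Returns {e1: {e2: p(e2|e1)}} for pairs with p(e2|e1) > p(e1|e1).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, pivots in p_c_given_e.items():
        for c, p_c in pivots.items():
            for e2, p_e2 in p_e_given_c.get(c, {}).items():
                scores[e1][e2] += p_c * p_e2  # sum over pivot phrases c

    kept = {}
    for e1, candidates in scores.items():
        self_score = candidates.get(e1, 0.0)  # p(e1|e1), the threshold
        good = {e2: p for e2, p in candidates.items()
                if e2 != e1 and p > self_score}
        if good:
            kept[e1] = good
    return kept


def augment_phrase_table(phrase_table, paraphrases):
    """Hypothetical helper: add entries for source phrases not in the table.

    phrase_table maps a source phrase to a list of (target, features).
    Every entry gets one extra feature: 1.0 for original entries and
    p(e2|e1) for entries generated by paraphrasing e1 as e2.
    """
    augmented = {src: [(tgt, feats + [1.0]) for tgt, feats in entries]
                 for src, entries in phrase_table.items()}
    for e1, candidates in paraphrases.items():
        if e1 in phrase_table:
            continue  # only phrases unknown to the original table are added
        for e2, prob in candidates.items():
            for tgt, feats in phrase_table.get(e2, []):
                augmented.setdefault(e1, []).append((tgt, feats + [prob]))
    return augmented


# Toy data (invented): "beauty salon" is unknown to the phrase table.
p_c_given_e = {"beauty salon": {"C1": 0.4, "C2": 0.6}}
p_e_given_c = {
    "C1": {"beauty salon": 0.5, "beauty parlor": 0.5},
    "C2": {"beauty salon": 0.1, "beauty parlor": 0.9},
}
phrase_table = {"beauty parlor": [("biyouin", [0.8])]}

paraphrases = pivot_paraphrases(p_c_given_e, p_e_given_c)
augmented = augment_phrase_table(phrase_table, paraphrases)
```

In this toy setting the unknown phrase "beauty salon" inherits the translation of "beauty parlor", and the extra feature lets MERT learn how much to trust entries that were produced by paraphrasing.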
4.2 Proposed method
In the proposed method, we conducted experiments with various settings for paraphrasing and lattice decoding. Then, we chose the best setting according to the result of the dev2 set.
4.2.1 Limitation of paraphrasing
As the paraphrase list was automatically acquired, there were many erroneous paraphrase pairs. Building paraphrase lattices from all of these pairs, including the erroneous ones, and decoding the lattices caused high computational complexity. Therefore, we limited the number of paraphrases per phrase and per sentence. The number of paraphrases per phrase was limited to three and the number of paraphrases per sentence was limited to twice the sentence length.
As a criterion for limiting the number of paraphrases, we use three features (p), (l) and (L), which are the same as the features described in Subsection 3.2. When building paraphrase lattices, we apply paraphrases in descending order of the value of the criterion.
4.2.2 Finding optimal settings
As previously mentioned, we have three choices for the criterion for building paraphrase lattices and four combinations of features for lattice decoding. Thus, there are 3 × 4 = 12 combinations of these settings. We conducted parameter tuning with the dev1 set for each setting and selected the setting which got the highest BLEU score for the dev2 set.
4.3 Results
The experimental results are shown in Table 1. We used the case-insensitive BLEU metric for evaluation. In EJ translation, the proposed method obtained the highest score of 40.34%, which achieved an absolute improvement of 1.36 BLEU points over Moses and 1.10 BLEU points over CCB. In EC translation, the proposed method also obtained the highest score of 27.06% and achieved an absolute improvement of 1.95 BLEU points over Moses and 0.92 BLEU points over CCB. As the relation of the three systems is Moses < CCB < Proposed Method, paraphrasing is useful for SMT, and using paraphrase lattices and lattice decoding is more useful than augmenting the phrase table. In Proposed Method, the criterion for building paraphrase lattices and the combination of features for lattice decoding were (p) and (p, L) in EJ translation and (L) and (p, l) in EC translation. Since features related to the source-side language model were chosen in each direction, using the source-side language model is useful for decoding paraphrase lattices.
We also tried a combination of Proposed Method and CCB, which is a method of decoding paraphrase lattices with an augmented phrase table. However, the result showed no significant improvements. This is because the proposed method includes the effect of augmenting the phrase table.
Moreover, we conducted German-English translation using the Europarl corpus (Koehn, 2005). We used the WMT08 dataset, which consists of 1M sentences for training and 2K sentences for development and testing. We acquired 5.3M pairs of German-German paraphrases from a 1M German-Spanish parallel corpus. We conducted experiments with various sizes of training corpus, using 10K, 20K, 40K, 80K, 160K and 1M sentences. Figure 3 shows that the proposed method consistently gets higher scores than Moses and CCB.
Figure 3: Effect of training corpus size. (Axes: corpus size (K) against BLEU score (%).)
5 Conclusion
This paper has proposed a novel method for transforming a source sentence into a paraphrase lattice and applying lattice decoding. Since our method can employ source-side language models as a decoding feature, the decoder can choose proper paraphrases and translate properly. The experimental results showed significant gains for the IWSLT and Europarl datasets. In the IWSLT dataset, we obtained 1.36 BLEU points over Moses in EJ translation and 1.95 BLEU points over Moses in EC translation. In the Europarl dataset, the proposed method consistently gets higher scores than the baselines.
In future work, we plan to apply this method with paraphrases derived from a massive corpus such as the Web corpus and apply this method to hierarchical phrase-based SMT.
References
Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL).
Nicola Bertoldi, Richard Zens, and Marcello Federico. 2007. Speech translation by confusion network decoding. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1297–1300.
Francis Bond, Eric Nichols, Darren Scott Appling, and Michael Paul. 2008. Improving Statistical Machine Translation by Paraphrasing the Training Data. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pages 150–157.
Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. In Proceedings of the Human Language Technology conference - North American chapter of the Association for Computational Linguistics (HLT-NAACL), pages 17–24.
Chris Dyer. 2009. Using a maximum entropy model to build segmentation lattices for MT. In Proceedings of the Human Language Technology conference - North American chapter of the Association for Computational Linguistics (HLT-NAACL).
Cameron S. Fordyce. 2007. Overview of the IWSLT 2007 Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pages 1–12.
J Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving Translation Quality by Discarding Most of the Phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967–975.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), pages 177–180.
Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit (MT Summit), pages 79–86.
Yuval Marton, Chris Callison-Burch, and Philip Resnik. 2009. Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 381–390.
Preslav Nakov. 2008. Improved Statistical Machine Translation Using Monolingual Paraphrases. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pages 338–342.
Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 160–167.
A Joint Rule Selection Model for Hierarchical Phrase-based Translation∗
Lei Cui† , Dongdong Zhang‡ , Mu Li‡ , Ming Zhou‡ , and Tiejun Zhao†
School of Computer Science and Technology
Harbin Institute of Technology, Harbin, China
Microsoft Research Asia, Beijing, China
In hierarchical phrase-based SMT systems, statistical models are integrated to guide the hierarchical rule selection for better translation performance. Previous work mainly focused on the selection of either the source side of a hierarchical rule or the target side of a hierarchical rule rather than considering both of them simultaneously. This paper presents a joint model to predict the selection of hierarchical rules. The proposed model is estimated based on four sub-models where the rich context knowledge from both source and target sides is leveraged. Our method can be easily incorporated into practical SMT systems with the log-linear model framework. The experimental results show that our method can yield significant improvements in performance.
The hierarchical phrase-based model has strong capabilities for expressing translation knowledge. It not only maintains the strength of phrase translation in traditional phrase-based models (Koehn et al., 2003; Xiong et al., 2006), but also characterizes complicated long-distance reordering, similar to syntax-based statistical machine translation (SMT) models (Yamada and Knight, 2001; Quirk et al., 2005; Galley et al., 2006; Liu et al., 2006; Marcu et al., 2006; Mi et al., 2008; Shen et al., 2008).
In hierarchical phrase-based SMT systems, due to the flexibility of rule matching, a huge number of hierarchical rules can be automatically learnt from a bilingual training corpus (Chiang, 2005). SMT decoders therefore face the challenge of proper rule selection for hypothesis generation, including both source-side rule selection and target-side rule selection, where the source-side rule determines which part of the source words is to be translated and the target-side rule provides one of the candidate translations of the source-side rule. Improper rule selections may result in poor translations.
There is some related work on hierarchical rule selection. In the original work (Chiang, 2005), the target-side rule selection is analogous to the model in traditional phrase-based SMT systems such as Pharaoh (Koehn et al., 2003). Extending this work, He et al. (2008) and Liu et al. (2008) integrate rich context information of non-terminals to predict the target-side rule selection. Unlike the above work, where the probability distribution of source-side rule selection is uniform, Setiawan et al. (2009) propose to select source-side rules based on the captured function words, which often play an important role in word reordering. There is also some work that involves richer contexts to guide the source-side rule selection: Marton and Resnik (2008) and Xiong et al. (2009) explore source syntactic information to reward rules that exactly match the syntactic structure or punish rules that cross it.
All the previous work mainly focused on either the source-side rule selection task or the target-side rule selection task rather than both of them together. The separation of these two tasks, however, weakens the high interrelation between them. In this paper, we propose to integrate both source-side and target-side rule selection in a unified model. The intuition is that the joint selection of source-side and target-side rules is more reliable as it conducts the search in a larger space than the single selection task does. It is expected that these two kinds of selection can help and affect each other, which may potentially lead to better hierarchical rule selections with a relative global optimum instead of a local optimum that might be reached in the previous methods. Our proposed joint probability model is factored into four sub-models that can be further classified into source-side and target-side rule selection models or context-based and context-free selection models. The context-based models explore rich context features from both source and target sides, including function words, part-of-speech (POS) tags, syntactic structure information and so on. Our model can be easily incorporated as an independent feature into practical hierarchical phrase-based systems with the log-linear model framework. The experimental results indicate that our method can improve the system performance significantly.
∗ This work was finished while the first author visited Microsoft Research Asia as an intern.
Proceedings of the ACL 2010 Conference Short Papers, pages 6–11,
Uppsala, Sweden, 11-16 July 2010. ©2010 Association for Computational Linguistics
Hierarchical Rule Selection Model
Following (Chiang, 2005), $\langle \alpha, \gamma \rangle$ is used to represent a synchronous context-free grammar (SCFG) rule extracted from the training corpus, where $\alpha$ and $\gamma$ are the source-side and target-side rules respectively. Let $C$ be the context of $\langle \alpha, \gamma \rangle$. Formally, our joint probability model of hierarchical rule selection is described as follows:
$$ P(\alpha, \gamma \mid C) = P(\alpha \mid C)\, P(\gamma \mid \alpha, C) $$
We decompose the joint probability model into two sub-models based on the Bayes formulation, where the first sub-model is the source-side rule selection model and the second is the target-side rule selection model.
For the source-side rule selection model, we further compute it by the interpolation of two sub-models:
$$ P(\alpha \mid C) = \theta P_s(\alpha) + (1 - \theta) P_s(\alpha \mid C) $$
where $P_s(\alpha)$ is the context-free source model (CFSM), $P_s(\alpha \mid C)$ is the context-based source model (CBSM), and $\theta$ is the interpolation weight that can be optimized over the development data.
CFSM is the probability of source-side rule selection, which can be estimated by the maximum likelihood estimation (MLE) method:
$$ P_s(\alpha) = \frac{\sum_{\gamma} \mathrm{Count}(\langle \alpha, \gamma \rangle)}{\mathrm{Count}(\alpha)} $$
where the numerator is the total count of bilingual rule pairs with the same source-side rule that are extracted based on the extraction algorithm in (Chiang, 2005), and the denominator is the total number of source-side rule patterns contained in the monolingual source side of the training corpus. CFSM is used to capture how likely the source-side rule is linguistically motivated or has a corresponding target-side counterpart.
CBSM can be naturally viewed as a classification problem where each distinct source-side rule is a single class. However, since the huge number of classes may cause a serious data sparseness problem and thereby degrade the classification accuracy, we approximate CBSM by a binary classification problem which can be solved by the maximum entropy (ME) approach (Berger et al., 1996) as follows:
$$ P_s(\alpha \mid C) \approx P_s(\upsilon \mid \alpha, C) = \frac{\exp\left[\sum_i \lambda_i h_i(\upsilon, \alpha, C)\right]}{\sum_{\upsilon'} \exp\left[\sum_i \lambda_i h_i(\upsilon', \alpha, C)\right]} $$
where $\upsilon \in \{0, 1\}$ indicates whether the source-side rule is applied during decoding ($\upsilon = 1$ when the source-side rule is applied, otherwise $\upsilon = 0$), $h_i$ is a feature function, and $\lambda_i$ is the weight of $h_i$. CBSM estimates the probability of the source-side rule being selected according to the rich context information coming from the surface strings and sub-phrases that will be reduced to non-terminals during decoding.
Analogously, we decompose the target-side rule selection model by interpolation:
$$ P(\gamma \mid \alpha, C) = \varphi P_t(\gamma) + (1 - \varphi) P_t(\gamma \mid \alpha, C) $$
where $P_t(\gamma)$ is the context-free target model (CFTM), $P_t(\gamma \mid \alpha, C)$ is the context-based target model (CBTM), and $\varphi$ is the interpolation weight that can be optimized over the development data. In a similar way, we compute CFTM by the MLE approach and estimate CBTM by the ME approach. CFTM computes how likely the target-side rule is linguistically motivated, while CBTM predicts how likely the target-side rule is applied according to the clues from the rich context information.
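The way the four sub-models combine can be illustrated with a minimal Python sketch. All function and argument names here (`binary_maxent_prob`, `interpolate`, `joint_rule_prob`) are hypothetical, and the sub-model probabilities are passed in as plain numbers rather than estimated from a corpus. The default weights are the values reported in the experiment setting.

```python
import math


def binary_maxent_prob(weights, feats_applied, feats_not_applied):
    """P(v=1 | rule, context) under a two-class maximum entropy model.

    weights maps a feature name to its lambda_i; the two feature dicts hold
    h_i values for v=1 and v=0 (a hypothetical feature representation).
    """
    score_1 = sum(weights.get(k, 0.0) * v for k, v in feats_applied.items())
    score_0 = sum(weights.get(k, 0.0) * v for k, v in feats_not_applied.items())
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(score_1, score_0)
    e1, e0 = math.exp(score_1 - m), math.exp(score_0 - m)
    return e1 / (e1 + e0)


def interpolate(context_free, context_based, weight):
    """weight * context-free model + (1 - weight) * context-based model."""
    return weight * context_free + (1.0 - weight) * context_based


def joint_rule_prob(p_cfsm, p_cbsm, p_cftm, p_cbtm, theta=0.75, phi=0.70):
    """P(alpha, gamma | C) = P(alpha | C) * P(gamma | alpha, C)."""
    p_source = interpolate(p_cfsm, p_cbsm, theta)  # source-side selection
    p_target = interpolate(p_cftm, p_cbtm, phi)    # target-side selection
    return p_source * p_target
```

For instance, with illustrative inputs `joint_rule_prob(0.2, 0.6, 0.1, 0.5)`, the source side interpolates to 0.30 and the target side to 0.22, so the joint probability is their product.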
Model Training of CBSM and CBTM
The acquisition of training instances
CBSM and CBTM are trained by the ME approach for binary classification, where a training instance consists of a label and the context related to SCFG rules. The context is divided into source context and target context. CBSM is trained only on the source context, while CBTM is trained over both the source and the target context. All the training instances are automatically constructed from the bilingual training corpus and have labels of either positive (i.e., $\upsilon = 1$) or negative (i.e., $\upsilon = 0$). This section explains how the training instances are constructed for the training of CBSM and CBTM.
Let $s$ and $t$ be the source sentence and target sentence, $W$ be the word alignment between them, $r_s$ be a source-side rule that pattern-matches a sub-phrase of $s$, $r_t$ be the target-side rule that pattern-matches a sub-phrase of $t$ and is aligned to $r_s$ based on $W$, and $C(r)$ be the context features related to the rule $r$, which will be explained in the following section.
For the training of CBSM, if the SCFG rule $\langle r_s, r_t \rangle$ can be extracted based on the rule extraction algorithm in (Chiang, 2005), $\langle \upsilon = 1, C(r_s) \rangle$ is constructed as a positive instance, otherwise $\langle \upsilon = 0, C(r_s) \rangle$ is constructed as a negative instance. For example in Figure 1(a), the context of the source-side rule "X1 hezuo" that pattern-matches the phrase "youhao hezuo" produces a positive instance, while the context of "X1 youhao" that pattern-matches the source phrase "de youhao" or "shuangfang de youhao" will produce a negative instance as there are no corresponding plausible target-side rules that can be extracted legally¹.
For the training of CBTM, given $r_s$, suppose there is an SCFG rule set $\{\langle r_s, r_t^k \rangle \mid 1 \le k \le n\}$ extracted from multiple distinct sentence pairs in the bilingual training corpus, among which we assume $\langle r_s, r_t^i \rangle$ is extracted from the sentence pair $\langle s, t \rangle$. Then, we construct $\langle \upsilon = 1, C(r_s), C(r_t^i) \rangle$ as a positive instance, while the elements in $\{\langle \upsilon = 0, C(r_s), C(r_t^j) \rangle \mid j \ne i \wedge 1 \le j \le n\}$ are viewed as negative instances since they fail to be applied to the translation from $s$ to $t$. For example in Figure 1(c), Rule (1) and Rule (2) are two different SCFG rules extracted from Figure 1(a) and Figure 1(b) respectively, where their source-side rules are the same. As Rule (1) cannot be applied to Figure 1(b) for the translation and Rule (2) cannot be applied to Figure 1(a) either, $\langle \upsilon = 1, C(r_s^a), C(r_t^a) \rangle$ and $\langle \upsilon = 1, C(r_s^b), C(r_t^b) \rangle$ are constructed as positive instances while $\langle \upsilon = 0, C(r_s^a), C(r_t^b) \rangle$ and $\langle \upsilon = 0, C(r_s^b), C(r_t^a) \rangle$ are viewed as negative instances. Note that this instance construction method may lead to a large quantity of negative instances and choke the training procedure. In practice, to limit the size of the training set, the negative instances constructed from low-frequency target-side rules are pruned.
Figure 1: Example of training instances in CBSM and CBTM.
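The CBTM instance construction can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout (`src_rule`, `tgt_rule`, `src_ctx`, `tgt_ctx`), the function name, and the `min_target_count` pruning threshold are hypothetical, and each competing target-side rule is represented by the context of its first observed occurrence, which is a simplification the text does not specify.

```python
from collections import Counter, defaultdict


def build_cbtm_instances(occurrences, min_target_count=2):
    """Build (label, source_context, target_context) training triples.

    occurrences: list of dicts with keys 'src_rule', 'tgt_rule',
    'src_ctx', 'tgt_ctx', one per rule extraction from a sentence pair.
    """
    pair_counts = Counter((o["src_rule"], o["tgt_rule"]) for o in occurrences)

    # One representative target context per (source rule, target rule) pair
    # (assumption: the first occurrence stands in for the rule).
    rep_tgt_ctx = {}
    targets_of = defaultdict(list)
    for o in occurrences:
        key = (o["src_rule"], o["tgt_rule"])
        if key not in rep_tgt_ctx:
            rep_tgt_ctx[key] = o["tgt_ctx"]
            targets_of[o["src_rule"]].append(o["tgt_rule"])

    instances = []
    for o in occurrences:
        # The rule pair actually extracted from this sentence pair is positive.
        instances.append((1, o["src_ctx"], o["tgt_ctx"]))
        for other in targets_of[o["src_rule"]]:
            if other == o["tgt_rule"]:
                continue
            if pair_counts[(o["src_rule"], other)] < min_target_count:
                continue  # prune negatives built from low-frequency target rules
            instances.append((0, o["src_ctx"], rep_tgt_ctx[(o["src_rule"], other)]))
    return instances
```

With two occurrences that share a source-side rule but have different target-side rules, and `min_target_count=1`, this yields two positive and two negative instances, mirroring the Rule (1) / Rule (2) example above.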
Context-based features for ME training
The ME approach has the merit of easily combining different features to predict the probability of each class. We incorporate the following informative context-based features into the ME-based model to train CBSM and CBTM. These features are carefully designed to reduce the data sparseness problem, and some of them are inspired by previous work (He et al., 2008; Gimpel and Smith, 2008; Marton and Resnik, 2008; Chiang et al., 2009; Setiawan et al., 2009; Shen et al., 2009; Xiong et al., 2009):
1. Function word features, which indicate whether the hierarchical source-side/target-side rule strings and sub-phrases covered by non-terminals contain function words that are often important clues of predicting syntactic
2. POS features, which are POS tags of the boundary source words covered by non-terminals.
3. Syntactic features, which are the constituent constraints of hierarchical source-side rules exactly matching or crossing syntactic subtrees.
4. Rule format features, which are non-terminal positions and orders in source-side/target-side rules. This feature interacts between source and target components since it shows whether the translation ordering is
5. Length features, which are the length of sub-phrases covered by source non-terminals.
¹ Because the aligned target words are not contiguous and "cooperation" is aligned to the word outside the source-side
Experiment setting
We implement a hierarchical phrase-based system similar to Hiero (Chiang, 2005) and evaluate our method on the Chinese-to-English translation task. Our bilingual training data comes from the FBIS corpus, which consists of around 160K sentence pairs where the source data is parsed by the Berkeley parser (Petrov and Klein, 2007). The ME training toolkit developed by Zhang (2006) is used to train our CBSM and CBTM. The number of constructed positive instances for both CBSM and CBTM is 4.68M, while the number of constructed negative instances is 3.74M and 3.03M respectively. Following (Setiawan et al., 2009), we identify function words as the 128 most frequent words in the corpus. The interpolation weights are set to θ = 0.75 and ϕ = 0.70. The 5-gram language model is trained over the English portion of the FBIS corpus plus the Xinhua portion of the Gigaword corpus. The development data is from the NIST 2005 evaluation data and the test data is from the NIST 2006 and NIST 2008 evaluation data. The evaluation metric is case-insensitive BLEU4 (Papineni et al., 2002). Statistical significance in BLEU score differences is tested by paired bootstrap re-sampling (Koehn, 2004).
Comparison with related work
Our baseline is the implemented Hiero-like SMT system where only the standard features are employed and the performance is state-of-the-art. We compare our method with the baseline and some typical approaches listed in Table 1, where XP+ denotes the approach in (Marton and Resnik, 2008) and TOFW (topological ordering of function words) stands for the method in (Setiawan et al., 2009). As the work of Xiong et al. (2009) is based on a phrasal SMT system with bracketing transduction grammar rules (Wu, 1997) and the work of Shen et al. (2009) is based on the string-to-dependency SMT model, we do not implement these two approaches because their models differ from ours. We also do not compare with the work of He et al. (2008), since integrating its numerous sub-models is less practical.
Table 1: Comparison results on NIST 2006 and NIST 2008. Our method is significantly better than the baseline, as well as the other two approaches (p < 0.01). [Table body not recoverable.]
As shown in Table 1, all the methods outperform the baseline because they have extra models to guide the hierarchical rule selection in some ways which might lead to better translation. Our method also performs better than the other two approaches, indicating that our method is more effective in the hierarchical rule selection as both source-side and target-side rules are selected together.
Effect of sub-models
Due to the space limitation, we analyze the effect of sub-models upon the system performance, rather than that of ME features, part of which have been investigated in previous related work.
Table 2: Sub-model effect upon the performance on NIST 2006 and NIST 2008. *: significantly better than baseline (p < 0.01). [Table body not recoverable.]
As shown in Table 2, when sub-models are integrated as independent features, the performance is improved compared to the baseline, which shows that each of the sub-models can improve the hierarchical rule selection. It is noticeable that the performance of the source-side rule selection model is comparable with that of the target-side rule selection model. Although CFSM and CFTM perform only slightly better than the others among the individual sub-models, the best performance is achieved when all the sub-models are integrated.
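Integrating sub-models "as independent features" in a log-linear framework amounts to adding one weighted log-probability term per sub-model to the hypothesis score. The sketch below is a generic illustration, not the system used in the experiments: the feature names, probabilities, and weights are invented, and in practice the weights would be tuned on development data.

```python
import math


def loglinear_score(feature_probs, weights):
    """Weighted sum of log feature probabilities; higher means preferred."""
    return sum(weights[name] * math.log(p) for name, p in feature_probs.items())


# Hypothetical hypothesis scored with one standard feature plus the four
# sub-models added as independent features.
weights = {"tm": 1.0, "cfsm": 0.5, "cbsm": 0.5, "cftm": 0.5, "cbtm": 0.5}
hypothesis = {"tm": 0.5, "cfsm": 0.25, "cbsm": 0.25, "cftm": 0.25, "cbtm": 0.25}
score = loglinear_score(hypothesis, weights)
```

Dropping a sub-model from `weights` and `hypothesis` corresponds to the ablation setting in which only some sub-models are integrated.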
Hierarchical rule selection is an important and complicated task for hierarchical phrase-based SMT systems. We propose a joint probability model for hierarchical rule selection, and the experimental results demonstrate its effectiveness.
In future work, we will explore more useful features and test our method on a large-scale training corpus. A challenge might arise when running the ME training toolkit over the large number of training instances that such a corpus produces.
We are especially grateful to the anonymous reviewers for their insightful comments. We also thank Hendra Setiawan, Yuval Marton, Chi-Ho Li, Shujie Liu and Nan Duan for helpful discussions.
Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1): pages 39-72.
David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proc. ACL, pages 263-270.
David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 New Features for Statistical Machine Translation. In Proc. HLT-NAACL, pages 218-226.
Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable Inference and Training of Context-Rich Syntactic Translation Models. In Proc. ACL-Coling, pages 961-968.
Kevin Gimpel and Noah A. Smith. 2008. Rich Source-Side Context for Statistical Machine Translation. In Proc. the Third Workshop on Statistical Machine Translation, pages 9-17.
Zhongjun He, Qun Liu, and Shouxun Lin. 2008. Improving Statistical Machine Translation using Lexicalized Rule Selection. In Proc. Coling, pages 321-328.
Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proc. EMNLP.
Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proc. HLT-NAACL, pages 127-133.
Qun Liu, Zhongjun He, Yang Liu, and Shouxun Lin. 2008. Maximum Entropy based Rule Selection Model for Syntax-based Statistical Machine Translation. In Proc. EMNLP, pages 89-97.
Yang Liu, Yun Huang, Qun Liu, and Shouxun Lin. 2007. Forest-to-String Statistical Translation Rules. In Proc. ACL, pages 704-711.
Yang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to-String Alignment Template for Statistical Machine Translation. In Proc. ACL-Coling, pages 609-616.
Daniel Marcu, Wei Wang, Abdessamad Echihabi, and Kevin Knight. 2006. SPMT: Statistical Machine Translation with Syntactified Target Language Phrases. In Proc. EMNLP, pages 44-52.
Yuval Marton and Philip Resnik. 2008. Soft Syntactic Constraints for Hierarchical Phrased-Based Translation. In Proc. ACL, pages 1003-1011.
Haitao Mi, Liang Huang, and Qun Liu. 2008. Forest-Based Translation. In Proc. ACL, pages 192-199.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proc. ACL, pages 311-318.
Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In Proc. HLT-NAACL, pages 404-411.
Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Dependency Treelet Translation: Syntactically Informed Phrasal SMT. In Proc. ACL, pages 271-279.
Hendra Setiawan, Min Yen Kan, Haizhou Li, and Philip Resnik. 2009. Topological Ordering of Function Words in Hierarchical Phrase-based Translation. In Proc. ACL, pages 324-332.
Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model. In Proc. ACL, pages 577-585.
Libin Shen, Jinxi Xu, Bing Zhang, Spyros Matsoukas, and Ralph Weischedel. 2009. Effective Use of Linguistic and Contextual Information for Statistical Machine Translation. In Proc. EMNLP, pages 72-80.
Dekai Wu. 1997. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23(3): pages 377-403.
Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum Entropy Based Phrase Reordering Model for Statistical Machine Translation. In Proc. ACL-Coling, pages 521-528.
Deyi Xiong, Min Zhang, Aiti Aw, and Haizhou Li. 2009. A Syntax-Driven Bracketing Model for Phrase-Based Translation. In Proc. ACL, pages
Kenji Yamada and Kevin Knight. 2001. A Syntax-based Statistical Translation Model. In Proc. ACL, pages 523-530.
Le Zhang. 2006. Maximum entropy modeling toolkit for python and c++. available at
Learning Lexicalized Reordering Models from Reordering Graphs
Jinsong Su, Yang Liu, Yajuan Lü, Haitao Mi, Qun Liu
Key Laboratory of Intelligent Information Processing
Institute of Computing Technology
Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190, China
Lexicalized reordering models play a crucial
role in phrase-based translation systems. They
are usually learned from the word-aligned
bilingual corpus by examining the reordering
relations of adjacent phrases. Instead of just
checking whether there is one phrase adjacent
to a given phrase, we argue that it is important
to take the number of adjacent phrases into
account for better estimations of reordering
models. We propose to use a structure named
reordering graph, which represents all phrase
segmentations of a sentence pair, to learn lexicalized reordering models efficiently. Experimental results on the NIST Chinese-English
test sets show that our approach significantly
outperforms the baseline method.
Figure 1: Occurrence of a swap with different numbers
of adjacent bilingual phrases: only one phrase in (a) and
three phrases in (b). Black squares denote word alignments and gray rectangles denote bilingual phrases. [s,t]
indicates the target-side span of bilingual phrase bp and
[u,v] represents the source-side span of bilingual phrase
Phrase-based translation systems (Koehn et al.,
2003; Och and Ney, 2004) have proven to be state-of-the-art, as shown by their translation performance in recent machine translation evaluations.
While excelling at memorizing local translation and
reordering, phrase-based systems have difficulties in
modeling permutations among phrases. As a result,
it is important to develop effective reordering models to capture such non-local reordering.
The early phrase-based paradigm (Koehn et al.,
2003) applies a simple distance-based distortion
penalty to model the phrase movements. More recently, many researchers have presented lexicalized
reordering models that take advantage of lexical
information to predict reordering (Tillmann, 2004;
Xiong et al., 2006; Zens and Ney, 2006; Koehn et
al., 2007; Galley and Manning, 2008). These models are learned from a word-aligned corpus to predict three orientations of a phrase pair with respect
to the previous bilingual phrase: monotone (M ),
swap (S), and discontinuous (D). Take the bilingual
phrase bp in Figure 1(a) for example. The word-based reordering model (Koehn et al., 2007) analyzes the word alignments at positions (s − 1, u − 1)
and (s − 1, v + 1). The orientation of bp is set
to D because the position (s − 1, v + 1) contains
no word alignment. The phrase-based reordering
model (Tillmann, 2004) determines the presence
of the adjacent bilingual phrase located in position
(s − 1, v + 1) and then treats the orientation of bp as
S. Given no constraint on maximum phrase length,
the hierarchical phrase reordering model (Galley and
Manning, 2008) also analyzes the adjacent bilingual
phrases for bp and identifies its orientation as S.
However, given a bilingual phrase, the above-mentioned models just consider the presence of an adjacent bilingual phrase rather than the number of adjacent bilingual phrases. See the examples in Figure 1 for illustration. In Figure 1(a), bp is in a swap order with only one bilingual phrase. In Figure 1(b), bp swaps with three bilingual phrases. Lexicalized reordering models do not distinguish different numbers of adjacent phrase pairs, and just give bp the same count in the swap orientation.
Proceedings of the ACL 2010 Conference Short Papers, pages 12–16,
Uppsala, Sweden, 11-16 July 2010. ©2010 Association for Computational Linguistics
Figure 2: (a) A parallel Chinese-English sentence pair and (b) its corresponding reordering graph. In (b), we denote each bilingual phrase with a rectangle, where the upper and bottom numbers in the brackets represent the source and target spans of this bilingual phrase respectively. M = monotone (solid lines), S = swap (dotted line), and D = discontinuous (segmented lines). The bilingual phrases marked in gray constitute a reordering example.
In this paper, we propose a novel method to better
estimate the reordering probabilities with the consideration of varying numbers of adjacent bilingual
phrases. Our method uses reordering graphs to represent all phrase segmentations of parallel sentence
pairs, and then gets the fractional counts of bilingual phrases for orientations from reordering graphs
in an inside-outside fashion. Experimental results
indicate that our method achieves significant improvements over the traditional lexicalized reordering model (Koehn et al., 2007).
This paper is organized as follows: in Section 2,
we first give a brief introduction to the traditional
lexicalized reordering model. Then we introduce
our method to estimate the reordering probabilities
from reordering graphs. The experimental results
are reported in Section 3. Finally, we end with a
conclusion and future work in Section 4.
Estimation of Reordering Probabilities
Based on Reordering Graph
In this section, we first describe the traditional lexicalized reordering model, and then illustrate how to
construct reordering graphs to estimate the reordering probabilities.
2.1 Lexicalized Reordering Model
Given a phrase pair $bp = (e_i, f_{a_i})$, where $a_i$ defines that the source phrase $f_{a_i}$ is aligned to the target phrase $e_i$, the traditional lexicalized reordering model computes the reordering count of bp in the orientation o based on the word alignments of boundary words. Specifically, the model collects bilingual phrases and distinguishes their orientations with respect to the previous bilingual phrase into three categories:
$$ o = \begin{cases} M & a_i - a_{i-1} = 1 \\ S & a_i - a_{i-1} = -1 \\ D & |a_i - a_{i-1}| \neq 1 \end{cases} \tag{1} $$
Using the relative-frequency approach, the reordering probability regarding bp is
$$ p(o \mid bp) = \frac{\mathrm{Count}(o, bp)}{\sum_{o'} \mathrm{Count}(o', bp)} \tag{2} $$
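The orientation rule and the relative-frequency estimate can be written down directly. The following Python sketch uses hypothetical function names and treats the alignment positions as plain integers. It only illustrates the two formulas above, not the extraction pipeline of any particular toolkit.

```python
def orientation(a_prev, a_cur):
    """Classify the orientation of the current phrase w.r.t. the previous one.

    a_prev and a_cur are the source-side positions aligned to the previous
    and current target phrases.
    """
    diff = a_cur - a_prev
    if diff == 1:
        return "M"  # monotone
    if diff == -1:
        return "S"  # swap
    return "D"      # discontinuous


def reordering_probs(counts):
    """Relative-frequency estimate p(o | bp) from per-orientation counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no counts observed for this phrase pair")
    return {o: counts.get(o, 0) / total for o in ("M", "S", "D")}
```

Because the counts may be fractional, the same `reordering_probs` function applies unchanged whether the counts come from the traditional model or from a reordering graph.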
2.2 Reordering Graph
For a parallel sentence pair, its reordering graph indicates all possible translation derivations consisting
of the extracted bilingual phrases. To construct a
reordering graph, we first extract bilingual phrases
following the method of (Och, 2003). Then, the adjacent
bilingual phrases are linked according to the target-side order. Some bilingual phrases, which have
no adjacent bilingual phrases because of maximum
length limitation, are linked to the nearest bilingual
phrases in the target-side order.
As shown in Figure 2(b), the reordering graph for the parallel sentence pair (Figure 2(a)) can be represented as an undirected graph, where each rectangle corresponds to a phrase pair, each link is the orientation relationship between adjacent bilingual phrases, and two distinguished rectangles $b_s$ and $b_e$ indicate the beginning and ending of the parallel sentence pair, respectively. With the reordering graph, we can obtain all reordering examples containing the given bilingual phrase. For example, the bilingual phrase ⟨zhengshi huitan, formal meetings⟩ (see Figure 2(a)), corresponding to the rectangle labeled with the source span [6,7] and the target span [4,5], is in a monotone order with one previous phrase and in a discontinuous order with two subsequent phrases (see Figure 2(b)).
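A simplified version of the graph construction can be sketched in Python. Several points are assumptions rather than details given in the text: phrases are represented as inclusive `(src_start, src_end, trg_start, trg_end)` tuples, two phrases are linked only when they are exactly adjacent on the target side, and the orientation of a link is decided from source-span adjacency. The special handling of phrases that have no adjacent phrase because of the maximum length limit is omitted, and the toy phrase list is invented.

```python
def build_reordering_graph(phrases):
    """Return edges (prev_index, next_index, orientation) between phrases.

    phrases: list of (src_start, src_end, trg_start, trg_end), inclusive.
    """
    edges = []
    for i, (ps, pe, _pts, pte) in enumerate(phrases):
        for j, (qs, qe, qts, _qte) in enumerate(phrases):
            if qts != pte + 1:
                continue  # q does not directly follow p on the target side
            if qs == pe + 1:
                o = "M"  # source of q directly follows source of p
            elif qe == ps - 1:
                o = "S"  # source of q directly precedes source of p
            else:
                o = "D"
            edges.append((i, j, o))
    return edges


# Toy example with sentinel phrases for sentence start and end.
toy_phrases = [
    (0, 0, 0, 0),  # 0: sentence start
    (1, 1, 2, 2),  # 1: source word 1 -> target word 2
    (2, 2, 1, 1),  # 2: source word 2 -> target word 1
    (1, 2, 1, 2),  # 3: both words as one phrase
    (3, 3, 3, 3),  # 4: sentence end
]
toy_edges = build_reordering_graph(toy_phrases)
```

In this toy graph the two single-word phrases are linked by a swap edge, while the two-word phrase connects the start and end nodes monotonically.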
src span    trg span
[0, 0]      [0, 0]
[1, 1]      [1, 1]
[1, 7]      [1, 7]
[4, 4]      [2, 2]
[4, 5]      [2, 3]
[4, 6]      [2, 4]
[4, 7]      [2, 5]
[2, 7]      [2, 7]
[5, 5]      [3, 3]
[6, 6]      [4, 4]
[6, 7]      [4, 5]
[7, 7]      [5, 5]
[2, 2]      [6, 6]
[2, 3]      [6, 7]
[3, 3]      [7, 7]
[8, 8]      [8, 8]
Table 1: The α and β values of the bilingual phrases shown in Figure 2.
2.3 Estimation of Reordering Probabilities
We estimate the reordering probabilities from reordering graphs. Given a parallel sentence pair, there are many translation derivations corresponding to different paths in its reordering graph. Assuming all derivations have a uniform probability, the fractional counts of bilingual phrases for orientations can be calculated with an algorithm in the inside-outside fashion.
Given a phrase pair bp in the reordering graph, we denote the number of paths from $b_s$ to bp by $\alpha(bp)$. It can be computed iteratively as $\alpha(bp) = \sum_{bp'} \alpha(bp')$, where $bp'$ is one of the previous bilingual phrases of bp and $\alpha(b_s) = 1$. In a similar way, the number of paths from $b_e$ to bp, denoted $\beta(bp)$, is simply $\beta(bp) = \sum_{bp''} \beta(bp'')$, where $bp''$ is one of the subsequent bilingual phrases of bp and $\beta(b_e) = 1$. Here, we show the α and β values of all bilingual phrases of Figure 2 in Table 1. In particular, for the reordering example consisting of the bilingual phrases $bp_1$ = ⟨jiang juxing, will hold⟩ and $bp_2$ = ⟨zhengshi huitan, formal meetings⟩, marked in gray in Figure 2, the α and β values can be calculated: $\alpha(bp_1) = 1$, $\beta(bp_2) = 1 + 1 = 2$, $\beta(b_s) = 8 + 1 = 9$.
Inspired by the parsing literature on pruning (Charniak and Johnson, 2005; Huang, 2008), the fractional count of $(o, bp', bp)$ is
$$ \mathrm{Count}(o, bp', bp) = \frac{\alpha(bp') \cdot \beta(bp)}{\beta(b_s)} \tag{3} $$
where the numerator indicates the number of paths containing the reordering example $(o, bp', bp)$ and the denominator is the total number of paths in the reordering graph. Continuing with the reordering example described above, we obtain its fractional count using formula (3): $\mathrm{Count}(M, bp_1, bp_2) = (1 \times 2)/9 = 2/9$.
Then, the fractional count of bp in the orientation o is calculated as described below:
$$ \mathrm{Count}(o, bp) = \sum_{bp'} \mathrm{Count}(o, bp', bp) \tag{4} $$
For example, we compute the fractional count of $bp_2$ in the monotone orientation by formula (4): $\mathrm{Count}(M, bp_2) = 2/9$.
As described in the lexicalized reordering model (Section 2.1), we apply formula (2) to calculate the final reordering probabilities.
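The path counting behind formulas (3) and (4) can be sketched in a few lines of Python. This is an illustration under assumptions: the graph is given as an explicit edge list over a node list that is already in target-side (topological) order, the function name and the toy graph are hypothetical, and exact fractions are used so the counts can be checked by hand.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_counts(nodes, edges, start, end):
    """Compute alpha, beta and fractional counts on a reordering graph.

    nodes: phrase ids in target-side order (start first, end last).
    edges: list of (prev, next, orientation) links.
    Returns (alpha, beta, counts) where counts maps (orientation, phrase)
    to its fractional count under uniformly weighted paths.
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for p, q, _o in edges:
        preds[q].append(p)
        succs[p].append(q)

    alpha = {start: 1}  # number of paths from the start node
    for n in nodes:
        if n != start:
            alpha[n] = sum(alpha[p] for p in preds[n])

    beta = {end: 1}  # number of paths to the end node
    for n in reversed(nodes):
        if n != end:
            beta[n] = sum(beta[q] for q in succs[n])

    total_paths = beta[start]
    counts = defaultdict(Fraction)
    for p, q, o in edges:
        # Formula (3), accumulated over previous phrases as in formula (4).
        counts[(o, q)] += Fraction(alpha[p] * beta[q], total_paths)
    return alpha, beta, dict(counts)


# Toy graph (not the one in Figure 2): three paths from "bs" to "be".
toy_nodes = ["bs", "A", "B", "C", "be"]
toy_links = [("bs", "A", "M"), ("bs", "B", "D"), ("A", "C", "M"),
             ("B", "C", "S"), ("C", "be", "M"), ("B", "be", "D")]
alpha, beta, counts = fractional_counts(toy_nodes, toy_links, "bs", "be")
```

The worked example in the text corresponds to evaluating `Fraction(1 * 2, 9)` for the link between the two gray phrases, which gives 2/9.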
We conduct experiments to investigate the effectiveness of our method on the msd-fe reordering model and the msd-bidirectional-fe reordering
model. These two models are widely applied in
phrase-based systems (Koehn et al., 2007). The msd-fe reordering model has three features, which represent the probabilities of bilingual phrases in three
orientations: monotone, swap, or discontinuous. If a
msd-bidirectional-fe model is used, then the number
of features doubles: one for each direction.
3.1 Experiment Setup
Two different sizes of training corpora are used in
our experiments: one is a small-scale corpus that
comes from FBIS corpus consisting of 239K bilingual sentence pairs, the other is a large-scale corpus
that includes 1.55M bilingual sentence pairs from
LDC. The 2002 NIST MT evaluation test data is
used as the development set and the 2003, 2004,
2005 NIST MT test data are the test sets. We choose MOSES¹ (Koehn et al., 2007) as the experimental decoder. GIZA++ (Och and Ney, 2003) and the "grow-diag-final-and" heuristic are used to generate a word-aligned corpus, from which we extract bilingual phrases with maximum length 7. We use the SRILM toolkit (Stolcke, 2002) to train a 4-gram language model on the Xinhua portion of Gigaword.
Except for the reordering probabilities, we use the same features in the comparative experiments. During decoding, we set ttable-limit = 20 and stack = 100, and perform minimum-error-rate training (Och, 2003) to tune various feature weights. The translation quality is evaluated by the case-insensitive BLEU-4 metric (Papineni et al., 2002). Finally, we conduct paired bootstrap sampling (Koehn, 2004) to test the significance of BLEU score differences.
3.2 Experimental Results
Table 2 shows the results of experiments with the
small training corpus. For the msd-fe model, the
BLEU scores by our method are 30.51, 32.78 and
29.50, achieving absolute improvements of 0.89,
0.66 and 0.62 on the three test sets, respectively. For
the msd-bidirectional-fe model, our method obtains
BLEU scores of 30.49, 32.73 and 29.24, with absolute improvements of 1.11, 0.73 and 0.60 over the
baseline method.
¹ The phrase-based lexical reordering model (Tillmann, 2004) is also closely related to our model. However, due to the limit of time and space, we only use the Moses-style reordering model (Koehn et al., 2007) as our baseline.
Table 2: Experimental results with the small-scale corpus. m-f: msd-fe reordering model. m-b-f: msd-bidirectional-fe reordering model. RG: probability estimation based on the Reordering Graph. * or **: significantly
better than baseline (p < 0.05 or p < 0.01).
Table 3: Experimental results with the large-scale corpus.
Table 3 shows the results of experiments with
the large training corpus. With
the msd-fe model, our method is superior to the baseline
method on every test set except MT-05.
The BLEU scores of our method are 32.44, 33.24
and 31.64, gains of 0.86, 0.85 and 0.15
on the three test sets, respectively. For the msd-bidirectional-fe model, the BLEU scores produced
by our approach are 33.29, 34.49 and 32.79 on the
three test sets, which are 0.86, 1.42 and 1.1 points higher
than the baseline method, respectively.
Conclusion and Future Work
In this paper, we propose a method to improve the
reordering model by considering the effect of the
number of adjacent bilingual phrases on the estimation of reordering probabilities. Experimental results on
NIST Chinese-to-English tasks demonstrate the effectiveness of our method.
Our method also applies to other lexicalized
reordering models. We plan to apply it
to more complex lexicalized reordering models, for
example, the hierarchical reordering model (Galley
and Manning, 2008) and the MEBTG reordering
model (Xiong et al., 2006). In addition, we will study how to further improve the reordering model by distinguishing
derivations with different probabilities.
The authors were supported by the National Natural Science Foundation of China, Contracts 60873167 and
60903138. We thank the anonymous reviewers for
their insightful comments. We are also grateful to
Hongmei Zhao and Shu Cai for their helpful feedback.
Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proc. of ACL 2005, pages 173–180.
Michel Galley and Christopher D. Manning. 2008. A
simple and effective hierarchical phrase reordering
model. In Proc. of EMNLP 2008, pages 848–856.
Liang Huang. 2008. Forest reranking: Discriminative
parsing with non-local features. In Proc. of ACL 2008,
pages 586–594.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based translation. In Proc.
of HLT-NAACL 2003, pages 127–133.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran, Richard
Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source
toolkit for statistical machine translation. In Proc. of
ACL 2007, Demonstration Session, pages 177–180.
Philipp Koehn. 2004. Statistical significance tests for
machine translation evaluation. In Proc. of EMNLP
2004, pages 388–395.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.
Franz Josef Och. 2003. Minimum error rate training in
statistical machine translation. In Proc. of ACL 2003,
pages 160–167.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL 2002,
pages 311–318.
Andreas Stolcke. 2002. SRILM - an extensible language
modeling toolkit. In Proc. of ICSLP 2002, pages 901–
Christoph Tillmann. 2004. A unigram orientation model
for statistical machine translation. In Proc. of HLT-NAACL 2004, Short Papers, pages 101–104.
Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum entropy based phrase reordering model for statistical machine translation. In Proc. of ACL 2006, pages
Richard Zens and Hermann Ney. 2006. Discriminative
reordering models for statistical machine translation.
In Proc. of Workshop on Statistical Machine Translation 2006, pages 521–528.
Filtering Syntactic Constraints for Statistical Machine Translation
Hailong Cao and Eiichiro Sumita
Language Translation Group, MASTAR Project
National Institute of Information and Communications Technology
3-5 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, Japan, 619-0289
{hlcao, eiichiro.sumita }
translation gets an extra credit if it respects the
parse tree but may incur a cost if it violates a
constituent boundary.
In this paper, we address this challenge from a
less explored direction. Rather than use all constraints offered by the parse trees, we propose
using them selectively. Based on parallel training
data, a classifier is built automatically to decide
whether a node in the parse trees should be used
as a reordering constraint or not. As a result, we
obtain a 0.8 BLEU point improvement over a full
constraint-based system.
Source language parse trees offer very useful
but imperfect reordering constraints for statistical machine translation. A lot of effort has
been made for soft applications of syntactic
constraints. We alternatively propose the selective use of syntactic constraints. A classifier
is built automatically to decide whether a node
in the parse trees should be used as a reordering constraint or not. Using this information
yields a 0.8 BLEU point improvement over a
full constraint-based system.
Reordering Constraints from Source
Parse Trees
In this section we briefly review a constraint-based system named IST-ITG (Imposing Source
Tree on Inversion Transduction Grammar; Yamamoto et al., 2008), upon which this work is based.
When using ITG constraints during decoding,
the source-side parse tree structure is not considered. The reordering process can be more tightly
constrained if constraints from the source parse
tree are integrated with the ITG constraints. IST-ITG constraints directly apply the source sentence
tree structure to generate the target with the
following constraint: the target sentence is obtained by rotating any node of the source sentence tree structure.
After parsing the source sentence, a bracketed
sentence is obtained by removing the node
syntactic labels; this bracketed sentence can then
be directly expressed as a tree structure. For
example1, the parse tree “(S1 (S (NP (DT This))
(VP (AUX is) (NP (DT a) (NN pen)))))” is
obtained from the source sentence “This is a
pen”, which consists of four words. By removing
In statistical machine translation (SMT), the
search problem is NP-hard if arbitrary reordering
is allowed (Knight, 1999). Therefore, we need to
restrict the possible reordering in an appropriate
way for both efficiency and translation quality.
The most widely used reordering constraints are
IBM constraints (Berger et al., 1996), ITG constraints (Wu, 1995) and syntactic constraints
(Yamada et al., 2000; Galley et al., 2004; Liu et
al., 2006; Marcu et al., 2006; Zollmann and
Venugopal, 2006; and numerous others). Syntactic constraints can be imposed from the source
side or target side. This work will focus on syntactic constraints from source parse trees.
Linguistic parse trees can provide very useful
reordering constraints for SMT. However, they
are far from perfect because of both parsing errors and the crossing of the constituents and formal phrases extracted from parallel training data.
The key challenge is how to take advantage of
the prior knowledge in the linguistic parse trees
without affecting the strengths of formal phrases.
Recent efforts attack this problem by using the
constraints softly (Cherry, 2008; Marton and
Resnik, 2008). In their methods, a candidate
We use English examples for the sake of readability.
Proceedings of the ACL 2010 Conference Short Papers, pages 17–21,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
the node syntactic labels, the bracketed sentence
“((This) ((is) ((a) (pen))))” is obtained. Such a
bracketed sentence can be used to produce
For example, for the source-side bracketed
tree “((f1 f2) (f3 f4))”, eight target sequences [e1,
e2, e3, e4], [e2, e1, e3, e4], [e1, e2, e4, e3], [e2,
e1, e4, e3], [e3, e4, e1, e2], [e3, e4, e2, e1], [e4,
e3, e1, e2], and [e4, e3, e2, e1] are possible. For
the source-side bracketed tree “(((f1 f2) f3) f4)”,
eight sequences [e1, e2, e3, e4], [e2, e1, e3, e4],
[e3, e1, e2, e4], [e3, e2, e1, e4], [e4, e1, e2, e3],
[e4, e2, e1, e3], [e4, e3, e1, e2], and [e4, e3, e2,
e1] are possible. When the source sentence tree
structure is a binary tree, the number of word
orderings is reduced to 2^(N-1), where N is the length
of the source sentence.
The parsing results sometimes do not produce
binary trees. In this case, some subtrees have
more than two child nodes. For a non-binary subtree, any reordering of child nodes is allowed.
For example, if a subtree has three child nodes,
six reorderings of the nodes are possible.
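The orderings allowed by these constraints can be enumerated with a short sketch. It assumes a bracketed tree is written as nested Python tuples whose leaves are source word positions; the function name `orderings` is hypothetical and is not part of IST-ITG itself.

```python
from itertools import permutations, product


def orderings(tree):
    """Return the set of leaf sequences reachable by reordering the
    children of every node (a swap for binary nodes, any permutation
    for nodes with more than two children)."""
    if not isinstance(tree, tuple):
        return {(tree,)}  # a single source word
    child_sets = [orderings(child) for child in tree]
    results = set()
    for perm in permutations(range(len(tree))):
        # Pick one ordering for each child, in the permuted child order.
        for combo in product(*(child_sets[i] for i in perm)):
            results.add(tuple(leaf for part in combo for leaf in part))
    return results


balanced = ((1, 2), (3, 4))     # "((f1 f2) (f3 f4))"
left_deep = (((1, 2), 3), 4)    # "(((f1 f2) f3) f4)"
flat = (1, 2, 3)                # one node with three children
```

For a binary tree over N words there are N-1 internal nodes, each either kept or swapped, which is where the 2^(N-1) count comes from.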
Otherwise the node is an interior node.
For example, in Figure 1, both node N1 and
node N3 are frontier nodes. Node N2 is an interior node because the source words f2, f3 and f4
are translated into e2, e3 and e4, which are not
contiguous in the target side.
Clearly, only frontier nodes should be used as
reordering constraints while interior nodes are
not suitable for this. However, little work has
been done on how to explicitly distinguish these
two kinds of nodes in the source parse trees. In
this section, we will explore building a classifier
which can label the nodes in the parse trees as
frontier nodes or interior nodes.
Learning to Classify Parse Tree Nodes
In IST-ITG and many other methods which use
syntactic constraints, all of the nodes in the parse
trees are utilized. Though many nodes in the
parse trees are useful, we would argue that some
nodes are not trustworthy. For example, if we
constrain the translation of “f1 f2 f3 f4” with
node N2 illustrated in Figure 1, then the word “e1”
will never be put in the middle of the other three
words. If we want to obtain the translation “e2 e1
e4 e3”, node N3 can offer a good constraint
while node N2 should be filtered out. In real corpora, cases such as node N2 are frequent enough
to be noticeable (see Fox (2002) or section 4.1 in
this paper).
Therefore, we use the definitions in Galley et
al. (2004) to classify the nodes in parse trees into
two types: frontier nodes and interior nodes.
Though the definitions were originally made for
target language parse trees, they can be straightforwardly applied to the source side. A node
which satisfies both of the following two conditions is referred to as a frontier node:
All the words covered by the node remain
contiguous after translation.
Figure 1: An example parse tree and alignments
Ideally, we would have a human-annotated corpus in which each sentence is parsed and each
node in the parse trees is labeled as a frontier
node or an interior node. But such a target language specific corpus is hard to come by, and
never in the quantity we would like.
Instead, we generate such a corpus automatically. We begin with a parallel corpus which will
be used to train our SMT model. In our case, it is
the FBIS Chinese-English corpus.
Firstly, the Chinese sentences are segmented,
POS tagged and parsed by the tools described in
Kruengkrai et al. (2009) and Cao et al. (2007),
both of which are trained on the Penn Chinese
Treebank 6.0.
Secondly, we use GIZA++ to align the sentences in both the Chinese-English and English-Chinese directions. We combine the alignments
using the “grow-diag-final-and” procedure provided with MOSES (Koehn et al., 2007). Because
there are many errors in the alignment, we remove a link if the alignment count is less than
three for the source or the target word. Additionally, we remove notoriously bad links in
All the words covered by the node can be
translated separately. That is to say, these
words do not share a translation with any
word outside the coverage of the node.
{de, le} × {the, a, an} following Fossum and
Knight (2008).
Thirdly, given the parse trees and the alignment information, we label each node as a frontier node or an interior node according to the
definition introduced in this section. Using the
labeled nodes as training data, we can build a
classifier. In theory, a broad class of machine
learning tools can be used; however, due to the
scale of the task (see section 4), we use
Pegasos, which is a very fast SVM solver
(Shalev-Shwartz et al., 2007).
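The automatic labeling step can be sketched as below. This is one possible reading of the two frontier-node conditions, not the authors' code: the function name `is_frontier`, the half-open span convention, and the decision to treat a node with no aligned words as interior are assumptions. The example alignment is also an assumption, reconstructed from the textual description of Figure 1 (f1 to f4 aligned to e1 to e4, with target order e2 e1 e4 e3).

```python
def is_frontier(span, links):
    """Label a parse-tree node from word alignments.

    span:  (start, end) source positions covered by the node, end exclusive.
    links: set of (source_position, target_position) alignment links.
    """
    start, end = span
    inside = {t for s, t in links if start <= s < end}
    outside = {t for s, t in links if not (start <= s < end)}
    if not inside:
        return False  # assumption: unaligned nodes are not used as constraints
    lo, hi = min(inside), max(inside)
    # A target word linked to an outside source word must not fall within
    # the target range of the node. This covers both a gap in the
    # translation and a translation shared with an outside word.
    return not any(lo <= t <= hi for t in outside)


# Target order "e2 e1 e4 e3": f1->pos 1, f2->pos 0, f3->pos 3, f4->pos 2.
links = {(0, 1), (1, 0), (2, 3), (3, 2)}
n1 = (0, 4)  # covers f1 f2 f3 f4
n2 = (1, 4)  # covers f2 f3 f4
n3 = (2, 4)  # covers f3 f4
```

With this alignment the sketch labels N1 and N3 as frontier nodes and N2 as an interior node, matching the discussion of Figure 1.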
MOSES decoder. Our decoder can operate on the
same principles as the MOSES decoder. Minimum error rate training (MERT) with respect to
BLEU score is used to tune the decoder’s parameters, and it is performed using the standard
technique of Och (2003). A lexical reordering
model was used in our experiments.
The translation model was created from the
FBIS corpus. We used a 5-gram language model
trained with modified Kneser-Ney smoothing.
The language model was trained on the target
side of the FBIS corpus and the Xinhua news in the GIGAWORD corpus. The development and test
sets are from the NIST MT08 evaluation campaign.
Table 1 shows the statistics of the corpora used
in our experiments.
For each node in the parse trees, we use the following feature templates:
• A context-free grammar rule which rewrites
the current node (in this and all the following
grammar-based features, a mark is used to
indicate which nonterminal is the current node)
• A context-free grammar rule which rewrites
the current node’s father
• The combination of the above two rules
• A lexicalized context-free grammar rule
which rewrites the current node
• A lexicalized context-free grammar rule
which rewrites the current node’s father
• Syntactic label, head word, and head POS
tag of the current node
• Syntactic label, head word, and head POS
tag of the current node’s left child
• Syntactic label, head word, and head POS
tag of the current node’s right child
• Syntactic label, head word, and head POS
tag of the current node’s left brother
• Syntactic label, head word, and head POS
tag of the current node’s right brother
• Syntactic label, head word, and head POS
tag of the current node’s father
• The leftmost word covered by the current
node and the word before it
• The rightmost word covered by the current
node and the word after it
Training set
Development set
Test set
Experiments on Nodes Classification
We extracted about 3.9 million example nodes
from the training data, i.e. the FBIS corpus.
There were 2.37 million frontier nodes and 1.59
million interior nodes in these examples, giving
rise to about 4.4 million features. To test the performance of our classifier, we simply used the last
ten thousand examples as a test set and the rest
as Pegasos training data. All the parameters in Pegasos were set to their default values. In
this way, the accuracy of the classifier was
Then we retrained our classifier by using all of
the examples. The nodes in the automatically
parsed NIST MT08 test set were labeled by the
classifier. As a result, 17,240 nodes were labeled
as frontier nodes and 5,736 nodes were labeled
as interior nodes.
Experiments on Chinese-English SMT
In order to confirm that it is advantageous to distinguish between frontier nodes and interior
nodes, we performed four translation experiments.
The first one was a typical beam search decoding without any syntactic constraints.
All the other three experiments were based on
the IST-ITG method which makes use of syntac-
Our SMT system is based on a fairly typical
phrase-based model (Finch and Sumita, 2008).
For the training of our SMT model, we use a
modified training toolkit adapted from the
Table 1: Corpora statistics
tic constraints. The difference between these
three experiments lies in what constraints are
used. In detail, the second one used all nodes
recognized by the parser; the third one only used
frontier nodes labeled by the classifier; the fourth
one only used interior nodes labeled by the classifier.
With the exception of the above differences,
all the other settings were the same in the four
experiments. Table 2 summarizes the SMT performance.
Syntactic Constraints
all nodes
frontier nodes
interior nodes
We would like to thank Taro Watanabe and
Andrew Finch for insightful discussions. We also
would like to thank the anonymous reviewers for
their constructive comments.
A.L. Berger, P.F. Brown, S.A.D. Pietra, V.J.D. Pietra,
J.R. Gillett, A.S. Kehler, and R.L. Mercer. 1996.
Language translation apparatus and method of using context-based translation models. United States
patent, patent number 5510981, April.
Hailong Cao, Yujie Zhang and Hitoshi Isahara. 2007. Empirical study on parsing Chinese based on Collins'
model. In PACLING.
Colin Cherry. 2008. Cohesive phrase-based decoding
for statistical machine translation. In ACL-HLT.
Table 2: Comparison of different constraints by
SMT quality
Andrew Finch and Eiichiro Sumita. 2008. Dynamic
model interpolation for statistical machine translation. In SMT Workshop.
Clearly, we obtain the best performance if we
constrain the search with only frontier nodes.
Using just frontier nodes yields a 0.8 BLEU point improvement over the baseline constraint-based
system which uses all the constraints.
On the other hand, constraints from interior
nodes result in the worst performance. This comparison shows it is necessary to explicitly distinguish nodes in the source parse trees when they
are used as reordering constraints.
The improvement over the system without
constraints is only modest. It may be too coarse
to use parse trees as hard constraints. We believe
a greater improvement can be expected if we apply our idea to finer-grained approaches that use
constraints softly (Marton and Resnik, 2008;
Cherry, 2008).
Victoria Fossum and Kevin Knight. 2008. Using bilingual Chinese-English word alignments to resolve PP attachment ambiguity in English. In
AMTA Student Workshop.
Heidi J. Fox. 2002. Phrasal cohesion and statistical
machine translation. In EMNLP.
Michel Galley, Mark Hopkins, Kevin Knight, and
Daniel Marcu. 2004. What's in a translation rule?
Kevin Knight. 1999. Decoding complexity in word
replacement translation models. Computational
Linguistics, 25(4):607–615.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine
Moran, Richard Zens, Chris Dyer, Ondrej Bojar,
Alexandra Constantin, Evan Herbst. 2007. Moses:
Open Source Toolkit for Statistical Machine Translation. In ACL demo and poster sessions.
Conclusion and Future Work
We propose a selective approach to syntactic
constraints during decoding. A classifier is built
automatically to decide whether a node in the
parse trees should be used as a reordering constraint or not. Preliminary results show that it is
not only advantageous but necessary to explicitly
distinguish between frontier nodes and interior nodes.
The idea of selecting syntactic constraints is
compatible with the idea of using constraints
softly; we plan to combine the two ideas and obtain further improvements in future work.
Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi
Kazama, Yiou Wang, Kentaro Torisawa and Hitoshi Isahara. 2009. An error-driven word-character
hybrid model for joint Chinese word segmentation
and POS tagging. In ACL-IJCNLP.
Yang Liu, Qun Liu, and Shouxun Lin. 2006. Tree-to-string alignment template for statistical machine
translation. In ACL-COLING.
Daniel Marcu, Wei Wang, Abdessamad Echihabi, and
Kevin Knight. 2006. SPMT: Statistical machine
translation with syntactified target language
phrases. In EMNLP.
Yuval Marton and Philip Resnik. 2008. Soft syntactic
constraints for hierarchical phrased-based translation. In ACL-HLT.
Franz Och. 2003. Minimum error rate training in statistical machine translation. In ACL.
Shai Shalev-Shwartz, Yoram Singer and Nathan Srebro. 2007. Pegasos: Primal estimated sub-gradient
solver for SVM. In ICML.
Dekai Wu. 1995. Stochastic inversion transduction
grammars with application to segmentation, bracketing, and alignment of parallel corpora. In IJCAI.
Kenji Yamada and Kevin Knight. 2000. A syntax-based statistical translation model. In ACL.
Hirofumi Yamamoto, Hideo Okuma and Eiichiro
Sumita. 2008. Imposing constraints from the
source tree on ITG constraints for SMT. In Workshop on syntax and structure in statistical translation.
Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In SMT Workshop, HLT-NAACL.
Diversify and Combine: Improving Word Alignment for Machine
Translation on Low-Resource Languages
Bing Xiang, Yonggang Deng, and Bowen Zhou
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
Most of the research on alignment combination
in the past has focused on how to combine the
alignments from two different directions, source-to-target and target-to-source. Usually people start
from the intersection of two sets of alignments,
and gradually add links in the union based on
certain heuristics, as in (Koehn et al., 2003), to
achieve a better balance compared to using either
intersection (high precision) or union (high recall).
In (Ayan and Dorr, 2006) a maximum entropy approach was proposed to combine multiple alignments based on a set of linguistic and alignment
features. A different approach was presented in
(Deng and Zhou, 2009), which again concentrated
on the combination of two sets of alignments, but
with a different criterion. It tries to maximize the
number of phrases that can be extracted in the
combined alignments. A greedy search method
was utilized and it achieved higher translation performance than the baseline.
More recently, an alignment selection approach
was proposed in (Huang, 2009), which computes confidence scores for each link and prunes
the links from multiple sets of alignments using
a hand-picked threshold. The alignments used
in that work were generated from different aligners (HMM, block model, and maximum entropy
model). In this work, we use soft voting with
weighted confidence scores, where the weights
can be tuned with a specific objective function.
There is no need for a pre-determined threshold
as used in (Huang, 2009). Also, we utilize various knowledge sources to enrich the alignments
instead of using different aligners. Our strategy is
to diversify and then combine in order to catch any
complementary information captured in the word
alignments for low-resource languages.
The rest of the paper is organized as follows.
We present a novel method to improve
word alignment quality and eventually the
translation performance by producing and
combining complementary word alignments for low-resource languages. Instead
of focusing on the improvement of a single
set of word alignments, we generate multiple sets of diversified alignments based
on different motivations, such as linguistic knowledge, morphology and heuristics. We demonstrate this approach on an
English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial
words. The combined alignment outperforms the baseline alignment, with significantly higher F-scores and better translation performance.
1 Introduction
Word alignment usually serves as the starting
point and foundation for a statistical machine
translation (SMT) system. It has received a significant amount of research over the years, notably in
(Brown et al., 1993; Ittycheriah and Roukos, 2005;
Fraser and Marcu, 2007; Hermjakob, 2009). They
all focused on the improvement of word alignment
models. In this work, we leverage existing aligners and generate multiple sets of word alignments
based on complementary information, then combine them to get the final alignment for phrase
training. The resource required for this approach
is little, compared to what is needed to build a reasonable discriminative alignment model, for example. This makes the approach especially appealing for SMT on low-resource languages.
Proceedings of the ACL 2010 Conference Short Papers, pages 22–26,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
We present three different sets of alignments in
Section 2 for an English-to-Pashto MT task. In
Section 3, we propose the alignment combination
algorithm. The experimental results are reported
in Section 4. We conclude the paper in Section 5.
We take an English-to-Pashto MT task as an example and create three sets of additional alignments
on top of the baseline alignment.
Figure 1: Alignment before/after VP-based reordering.
Syntactic Reordering
Pashto is a subject-object-verb (SOV) language,
which puts verbs after objects. People have proposed different syntactic rules to pre-reorder SOV
languages, either based on a constituent parse tree
(Drábek and Yarowsky, 2004; Wang et al., 2007)
or dependency parse tree (Xu et al., 2009). In
this work, we apply syntactic reordering for verb
phrases (VP) based on the English constituent
parse. The VP-based reordering rule we apply in
the work is:
1980), a widely applied algorithm to remove the
common morphological and inflexional endings
from words in English. For Pashto, we utilize
a morphological decomposition algorithm that has
been shown to be effective for Arabic speech
recognition (Xiang et al., 2006). We start from a
fixed set of affixes with 8 prefixes and 21 suffixes.
The prefixes and suffixes are stripped off from
the Pashto words under two constraints: (1)
longest matched affixes first; (2) the remaining stem
must be at least two characters long.
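The two stripping constraints can be illustrated with a small sketch. The affix strings below are made-up placeholders, not the paper's 8 prefixes and 21 suffixes, and the choices to strip at most one prefix and one suffix, prefix first, are assumptions of this example.

```python
def strip_affixes(word, prefixes, suffixes, min_stem=2):
    """Strip one prefix and one suffix, trying longer affixes first and
    never leaving a stem shorter than min_stem characters.
    Affixes are assumed to be non-empty strings."""
    for prefix in sorted(prefixes, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


# Hypothetical romanized affixes, for illustration only.
PREFIXES = {"wa", "w"}
SUFFIXES = {"una", "a"}
```

Here `strip_affixes("wakoruna", PREFIXES, SUFFIXES)` removes the longest matching prefix and suffix, while a two-character word is left untouched because any stripping would violate the minimum stem length.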
• VP(VB*, *) → VP(*, VB*)
where VB* represents VB, VBD, VBG, VBN,
VBP and VBZ.
In Figure 1, we show the reference alignment
between an English sentence and the corresponding Pashto translation, where E is the original English sentence, P is the Pashto sentence (in romanized text), and E′ is the English sentence after
reordering. As we can see, after the VP-based reordering, the alignment between the two sentences
becomes monotone, which makes it easier for the
aligner to get the alignment correct. During the
reordering of English sentences, we store the index changes for the English words. After getting
the alignment trained on the reordered English and
original Pashto sentence pairs, we map the English
words back to the original order, along with the
learned alignment links. In this way, the alignment is ready to be combined with the baseline
alignment and any other alternatives.
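The reordering rule and the mapping of links back to the original word order can be sketched as follows. The tree encoding (tuples of a label and a child list, with preterminals holding a word index) and all function names are assumptions made for this illustration; the example sentence and links are invented.

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def is_preterminal(node):
    # A preterminal is (POS tag, original word index).
    return isinstance(node[1], int)


def reorder_vp(node):
    """Apply VP(VB*, *) -> VP(*, VB*) recursively."""
    if is_preterminal(node):
        return node
    label, body = node
    children = [reorder_vp(child) for child in body]
    head = children[0]
    if (label == "VP" and len(children) > 1
            and is_preterminal(head) and head[0] in VERB_TAGS):
        children = children[1:] + [head]  # move the verb to the end
    return (label, children)


def leaf_indices(node):
    """Original word indices in the order they appear in the tree."""
    if is_preterminal(node):
        return [node[1]]
    return [i for child in node[1] for i in leaf_indices(child)]


def map_links_back(links, new_order):
    """Convert links on reordered positions to original positions."""
    return {(new_order[j], k) for j, k in links}


# "they help you": (S (NP (PRP they)) (VP (VBP help) (NP (PRP you))))
tree = ("S", [("NP", [("PRP", 0)]),
              ("VP", [("VBP", 1), ("NP", [("PRP", 2)])])])
new_order = leaf_indices(reorder_vp(tree))
```

In this example `new_order` records that the reordered sentence is "they you help", so a monotone alignment learned on the reordered text can be converted back to links on the original English positions.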
2 Diversified Word Alignments
2.3 Partial Word
For low-resource languages, we usually suffer
from the data sparsity issue. Recently, a simple
method was presented in (Chiang et al., 2009),
which keeps partial English and Urdu words in the
training data for alignment training. This is similar
to the stemming method, but is more heuristics-based, and does not rely on a set of available affixes. With the same motivation, we keep the first
4 characters of each English and Pashto word to
generate one more alternative for the word alignment.
3 Confidence-Based Alignment Combination
Now we describe the algorithm to combine multiple sets of word alignments based on weighted
confidence scores. Suppose $a_{ijk}$ is an alignment
link in the i-th set of alignments between the j-th
source word and k-th target word in sentence pair
(S, T). Similar to (Huang, 2009), we define the
confidence of $a_{ijk}$ as
Pashto is one of the morphologically rich languages. In addition to the linguistic knowledge applied in the syntactic reordering described above,
we also utilize morphological analysis by applying
stemming on both the English and Pashto sides.
For English, we use Porter stemming (Porter,
$$c(a_{ijk} \mid S, T) = \sqrt{q_{s2t}(a_{ijk} \mid S, T)\, q_{t2s}(a_{ijk} \mid T, S)}, \qquad (1)$$
where the source-to-target link posterior probability
$$q_{s2t}(a_{ijk} \mid S, T) = \frac{p_i(t_k \mid s_j)}{\sum_{k'=1}^{K} p_i(t_{k'} \mid s_j)}, \qquad (2)$$
apply grow-diagonal-final (gdf). The decoding
weights are optimized with minimum error rate
training (MERT) (Och, 2003) to maximize BLEU
scores (Papineni et al., 2002). There are 2028 sentences in the tuning set and 1019 sentences in the
test set, both with one reference. We use another
150 sentence pairs as a heldout hand-aligned set
to measure the word alignment quality. The three
sets of alignments described in Section 2 are generated on the same training data separately with
GIZA++ and enhanced by gdf as for the baseline
alignment. The English parse tree used for the
syntactic reordering was produced by a maximum
entropy based parser (Ratnaparkhi, 1997).
and the target-to-source link posterior probability
$q_{t2s}(a_{ijk} \mid T, S)$ is defined similarly. $p_i(t_k \mid s_j)$ is
the lexical translation probability between source
word $s_j$ and target word $t_k$ in the i-th set of alignments.
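A link confidence of this kind can be computed as in the sketch below. It takes the confidence to be the geometric mean of the two directional link posteriors, which is an assumption following Huang (2009); the function name `link_confidence` and the matrix layout of the lexical probabilities are also assumptions.

```python
import math


def link_confidence(p_s2t, p_t2s, j, k):
    """Confidence of the link between source word j and target word k.

    p_s2t[j][k]: lexical probability p(t_k | s_j) for each target word k.
    p_t2s[k][j]: lexical probability p(s_j | t_k) for each source word j.
    """
    # Normalise over all target words of the sentence.
    q_s2t = p_s2t[j][k] / sum(p_s2t[j])
    # Normalise over all source words of the sentence.
    q_t2s = p_t2s[k][j] / sum(p_t2s[k])
    return math.sqrt(q_s2t * q_t2s)


# Toy two-word sentence pair with invented probabilities.
p_s2t = [[0.6, 0.2], [0.1, 0.3]]
p_t2s = [[0.5, 0.5], [0.2, 0.6]]
```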
Our alignment combination algorithm is as follows.
1. Each candidate link $a_{jk}$ gets soft votes from
N sets of alignments via weighted confidence scores:
$$v(a_{jk} \mid S, T) = \sum_{i=1}^{N} w_i \, c(a_{ijk} \mid S, T), \qquad (3)$$
4.2 Improvement in Word Alignment
In Table 1 we show the precision, recall and F-score of each set of word alignments for the 150-sentence set. Using partial words provides the highest F-score among all individual alignments. The
F-score is 5% higher than for the baseline alignment. The VP-based reordering itself does not improve the F-score, which could be due to parse
errors on the conversational training data. We experiment with three options ($c_0$, $c_1$, $c_2$) when combining the baseline and reordering-based alignments. In $c_0$, the weights $w_i$ and confidence scores
$c(a_{ijk} \mid S, T)$ in Eq. (3) are all set to 1. In $c_1$,
we set the confidence scores to 1, while tuning the
weights with hill climbing to maximize the F-score on a hand-aligned tuning set. In $c_2$, we compute the confidence scores as in Eq. (1) and tune
the weights as in $c_1$. The numbers in Table 1 show
the effectiveness of having both weights and confidence scores during the combination.
Similarly, we combine the baseline with each
of the other sets of alignments using $c_2$. They
all result in significantly higher F-scores. We
also generate alignments on VP-reordered partial
words (X in Table 1) and compare B + X and
B + V + P. The better results with B + V + P
show the benefit of keeping the alignments as diversified as possible before the combination. Finally, we compare the proposed alignment combination $c_2$ with the heuristics-based method (gdf),
where the latter starts from the intersection of all 4
sets of alignments and then applies grow-diagonal-final (Koehn et al., 2003) based on the links in
the union. The proposed combination approach on
B + V + S + P results in close to 7% higher F-scores than the baseline and also 2% higher than
where the weight $w_i$ for each set of alignments
can be optimized under various criteria. In
this work, we tune it on a hand-aligned development set to maximize the alignment F-score.
2. All candidates are sorted by soft votes in descending order and evaluated sequentially. A
candidate link $a_{jk}$ is included if one of the
following is true:
• Neither $s_j$ nor $t_k$ is aligned so far;
• $s_j$ is not aligned and its left or right
neighboring word is aligned to $t_k$ so far;
• $t_k$ is not aligned and its left or right
neighboring word is aligned to $s_j$ so far.
3. Repeat scanning all candidate links until no
more links can be added.
In this way, those alignment links with higher
confidence scores have higher priority to be included in the combined alignment.
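The three steps above can be sketched as follows. This is an illustrative reading of the algorithm rather than the authors' implementation: the function name `combine_alignments`, the dictionary-based inputs, the default confidence of 1 for a link without a score, and the tie-breaking by link position are assumptions.

```python
def combine_alignments(link_sets, confidences, weights):
    """Combine N alignments of one sentence pair by soft voting.

    link_sets:   list of N sets of (j, k) links.
    confidences: list of N dicts mapping (j, k) to a confidence score.
    weights:     list of N weights, one per alignment set.
    """
    # Step 1: weighted soft votes for every candidate link.
    votes = {}
    for links, conf, weight in zip(link_sets, confidences, weights):
        for link in links:
            votes[link] = votes.get(link, 0.0) + weight * conf.get(link, 1.0)

    # Step 2: evaluate candidates in descending order of votes.
    candidates = sorted(votes, key=lambda link: (-votes[link], link))
    combined = set()
    added = True
    while added:  # Step 3: rescan until no more links can be added.
        added = False
        for j, k in candidates:
            if (j, k) in combined:
                continue
            src_free = all(j != j2 for j2, _ in combined)
            tgt_free = all(k != k2 for _, k2 in combined)
            if src_free and tgt_free:
                accept = True
            elif src_free:
                # A neighbour of s_j is already aligned to t_k.
                accept = (j - 1, k) in combined or (j + 1, k) in combined
            elif tgt_free:
                # A neighbour of t_k is already aligned to s_j.
                accept = (j, k - 1) in combined or (j, k + 1) in combined
            else:
                accept = False
            if accept:
                combined.add((j, k))
                added = True
    return combined


# Toy example with two alignment sets and invented confidence scores.
set_a = {(0, 0), (1, 1)}
set_b = {(0, 0), (1, 2), (2, 2), (3, 2)}
conf_a = {(0, 0): 0.9, (1, 1): 0.9}
conf_b = {(0, 0): 0.8, (1, 2): 0.5, (2, 2): 0.7, (3, 2): 0.3}
result = combine_alignments([set_a, set_b], [conf_a, conf_b], [1.0, 1.0])
```

In the toy example the link (1, 2) is rejected because both of its words are already aligned, while (3, 2) is accepted because the neighbouring source word 2 is aligned to the same target word.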
4 Experiments
Our training data contains around 70K English-Pashto sentence pairs released under the DARPA
TRANSTAC project, with about 900K words on
the English side. The baseline is a phrase-based
MT system similar to (Koehn et al., 2003). We
use GIZA++ (Och and Ney, 2000) to generate
the baseline alignment for each direction and then
gdf. We also notice that its higher F-score is
mainly due to the higher precision, which should
result from the consideration of confidence scores.
Table 1: Alignment precision, recall and F-score
(B: baseline; V: VP-based reordering; S: stemming; P: partial word; X: VP-reordered partial word).
Table 2: Improvement in BLEU scores (B: baseline; V: VP-based reordering; S: stemming; P: partial word; X: VP-reordered partial word).
both higher F-score and higher BLEU score. The
combination approach itself is not limited to any
specific alignment. It provides a general framework that can take advantage of as many alignments as possible, which could differ in preprocessing, alignment modeling, or any other aspect.
Improvement in MT Performance
In Table 2, we show the corresponding BLEU
scores on the test set for the systems built on each
set of word alignments in Table 1. Similar to the
observation from Table 1, c2 outperforms c0 and
c1 , and B + V + S + P with c2 outperforms
B + V + S + P with gdf. We also ran one experiment in which we concatenated all 4 sets of
alignments into one big set (shown as cat). Overall, the BLEU score with confidence-based combination was increased by 1 point compared to the
baseline, 0.6 compared to gdf, and 0.7 compared
to cat. All results are statistically significant with
p < 0.05 using the sign-test described in (Collins
et al., 2005).
This work was supported by the DARPA
TRANSTAC program. We would like to thank
Upendra Chaudhari, Sameer Maskey and Xiaoqiang Luo for providing useful resources and the
anonymous reviewers for their constructive comments.
Necip Fazil Ayan and Bonnie J. Dorr. 2006. A maximum entropy approach to combining word alignments. In Proc. HLT/NAACL, June.
5 Conclusions
Peter Brown, Vincent Della Pietra, Stephen Della
Pietra, and Robert Mercer. 1993. The mathematics
of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.
In this work, we have presented a word alignment
combination method that improves both the alignment quality and the translation performance. We
generated multiple sets of diversified alignments
based on linguistics, morphology, and heuristics, and demonstrated the effectiveness of combination on the English-to-Pashto translation task.
We showed that the combined alignment significantly outperforms the baseline alignment with
David Chiang, Kevin Knight, Samad Echihabi, et al.
2009. ISI/Language Weaver NIST 2009 systems. In
Presentation at NIST MT 2009 Workshop, August.
Michael Collins, Philipp Koehn, and Ivona Kučerová.
2005. Clause restructuring for statistical machine
translation. In Proc. of ACL, pages 531–540.
Yonggang Deng and Bowen Zhou. 2009. Optimizing
word alignment combination for phrase table training. In Proc. ACL, pages 229–232, August.
Elliott Franco Drábek and David Yarowsky. 2004. Improving bitext word alignments via syntax-based reordering of English. In Proc. ACL.
Alexander Fraser and Daniel Marcu. 2007. Getting the
structure right for word alignment: Leaf. In Proc. of
EMNLP, pages 51–60, June.
Ulf Hermjakob. 2009. Improved word alignment with
statistics and linguistic heuristics. In Proc. EMNLP,
pages 229–237, August.
Fei Huang. 2009. Confidence measure for word alignment. In Proc. ACL, pages 932–940, August.
Abraham Ittycheriah and Salim Roukos. 2005. A maximum entropy word aligner for Arabic-English machine translation. In Proc. of HLT/EMNLP, pages
89–96, October.
Philipp Koehn, Franz Och, and Daniel Marcu. 2003.
Statistical phrase-based translation.
In Proc.
Franz Josef Och and Hermann Ney. 2000. Improved
statistical alignment models. In Proc. of ACL, pages
440–447, Hong Kong, China, October.
Franz Josef Och. 2003. Minimum error rate training
in statistical machine translation. In Proc. of ACL,
pages 160–167.
Kishore Papineni, Salim Roukos, Todd Ward, and Weijing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL, pages
Martin Porter. 1980. An algorithm for suffix stripping.
In Program, volume 14, pages 130–137.
Adwait Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models.
In Proc. of EMNLP, pages 1–10.
Chao Wang, Michael Collins, and Philipp Koehn.
2007. Chinese syntactic reordering for statistical
machine translation. In Proc. EMNLP, pages 737–
Bing Xiang, Kham Nguyen, Long Nguyen, Richard
Schwartz, and John Makhoul. 2006. Morphological
decomposition for Arabic broadcast news transcription. In Proc. ICASSP.
Peng Xu, Jaeho Kang, Michael Ringgaard, and Franz
Och. 2009. Using a dependency parser to improve
SMT for subject-object-verb languages. In Proc.
NAACL/HLT, pages 245–253, June.
Efficient Path Counting Transducers for Minimum Bayes-Risk Decoding
of Statistical Machine Translation Lattices
Graeme Blackwood, Adrià de Gispert, William Byrne
Machine Intelligence Laboratory
Cambridge University Engineering Department
Trumpington Street, CB2 1PZ, U.K.
once. It is the efficient computation of these path
posterior n-gram probabilities that is the primary
focus of this paper. We will show how general
purpose WFST algorithms can be employed to efficiently compute p(u|E) for all u ∈ N .
Tromble et al. (2008) use Equation (1) as an
approximation to the general form of statistical
machine translation MBR decoder (Kumar and
Byrne, 2004):
This paper presents an efficient implementation of linearised lattice minimum
Bayes-risk decoding using weighted finite
state transducers. We introduce transducers to efficiently count lattice paths containing n-grams and use these to gather
the required statistics. We show that these
procedures can be implemented exactly
through simple transformations of word
sequences to sequences of n-grams. This
yields a novel implementation of lattice
minimum Bayes-risk decoding which is
fast and exact even for very large lattices.
$$\hat{E} = \operatorname*{argmin}_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} L(E, E')\, P(E|F)$$
The approximation replaces the sum over all paths
in the lattice by a sum over lattice n-grams. Even
though a lattice may have many n-grams, it is
possible to extract and enumerate them exactly
whereas this is often impossible for individual
paths. Therefore, while the Tromble et al. (2008)
linearisation of the gain function in the decision
rule is an approximation, Equation (1) can be computed exactly even over very large lattices. The
challenge is to do so efficiently.
If the quantity p(u|E) had the form of a conditional expected count
1 Introduction
This paper focuses on an exact implementation
of the linearised form of lattice minimum Bayes-risk (LMBR) decoding using general purpose
weighted finite state transducer (WFST) operations. The LMBR decision rule in Tromble et al.
(2008) has the form
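$$\hat{E} = \operatorname*{argmax}_{E' \in \mathcal{E}} \left\{ \theta_0 |E'| + \sum_{u \in \mathcal{N}} \theta_u \#_u(E')\, p(u|\mathcal{E}) \right\}, \qquad (1)$$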
where E is a lattice of translation hypotheses, N
is the set of all n-grams in the lattice (typically,
n = 1 . . . 4), and the parameters θ are constants
estimated on held-out data. The quantity p(u|E)
we refer to as the path posterior probability of the
n-gram u. This particular posterior is defined as
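$$p(u|\mathcal{E}) = p(\mathcal{E}_u|\mathcal{E}) = \sum_{E \in \mathcal{E}_u} P(E|F), \qquad (2)$$
$$c(u|\mathcal{E}) = \sum_{E \in \mathcal{E}} \#_u(E)\, P(E|F), \qquad (4)$$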
it could be computed efficiently using counting
transducers (Allauzen et al., 2003). The statistic c(u|E) counts the number of times an n-gram
occurs on each path, accumulating the weighted
count over all paths. By contrast, what is needed
by the approximation in Equation (1) is to identify all paths containing an n-gram and accumulate
their probabilities. The accumulation of probabilities at the path level, rather than the n-gram level,
makes the exact computation of p(u|E) hard.
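The difference between the two statistics is easy to see on a lattice small enough to enumerate. The following sketch assumes the lattice is given as an explicit list of (word sequence, probability) pairs, which is exactly what is infeasible for real lattices; the function names are hypothetical and serve only to illustrate the definitions of p(u|E) and c(u|E).

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngrams(words: Sequence[str], n: int) -> List[NGram]:
    """All n-grams of order n in a word sequence, in order of occurrence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def posterior_and_expected_count(
    paths: List[Tuple[Sequence[str], float]], n: int
) -> Tuple[Dict[NGram, float], Dict[NGram, float]]:
    """Return (p(u|E), c(u|E)) for all n-grams of order n over enumerated paths."""
    posterior: Dict[NGram, float] = defaultdict(float)
    expected: Dict[NGram, float] = defaultdict(float)
    for words, prob in paths:
        grams = ngrams(words, n)
        # c(u|E): every occurrence on the path contributes the path probability.
        for u in grams:
            expected[u] += prob
        # p(u|E): the path contributes once if it contains u at all.
        for u in set(grams):
            posterior[u] += prob
    return dict(posterior), dict(expected)


# Toy lattice with two paths; "a" occurs twice on the first path.
toy_paths = [(["a", "b", "a"], 0.5), (["a", "c"], 0.5)]
p, c = posterior_and_expected_count(toy_paths, 1)
```

Here every path contains the unigram "a", so its path posterior is 1.0, whereas its expected count is 1.5 because the first path contains it twice. The two quantities agree only for n-grams that never repeat on a path.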
Tromble et al. (2008) approach this problem by
building a separate word sequence acceptor for
each n-gram in N and intersecting this acceptor
where Eu = {E ∈ E : #u (E) > 0} is the subset of lattice paths containing the n-gram u at least
We omit an introduction to WFSTs for space reasons.
See Mohri et al. (2008) for details of the general purpose
WFST operations used in this paper.
Proceedings of the ACL 2010 Conference Short Papers, pages 27–32,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
quences to n-gram sequences of order n. Φn has a
similar form to the WFST implementation of an n-gram language model (Allauzen et al., 2003). Φn
includes for each n-gram u = w_1^n arcs of the form:
with the lattice to discard all paths that do not contain the n-gram; they then sum the probabilities of
all paths in the filtered lattice. We refer to this as
the sequential method, since p(u|E) is calculated
separately for each u in sequence.
Allauzen et al. (2010) introduce a transducer
for simultaneous calculation of p(u|E) for all unigrams u ∈ N1 in a lattice. This transducer is
effective for finding path posterior probabilities of
unigrams because there are relatively few unique
unigrams in the lattice. As we will show, however,
it is less efficient for higher-order n-grams.
Allauzen et al. (2010) use exact statistics for
the unigram path posterior probabilities in Equation (1), but use the conditional expected counts
of Equation (4) for higher-order n-grams. Their
hybrid MBR decoder has the form
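$$\hat{E} = \operatorname*{argmax}_{E' \in \mathcal{E}} \left\{ \theta_0 |E'| + \sum_{u \in \mathcal{N}: 1 \le |u| \le k} \theta_u \#_u(E')\, p(u|\mathcal{E}) + \sum_{u \in \mathcal{N}: k < |u| \le 4} \theta_u \#_u(E')\, c(u|\mathcal{E}) \right\}, \qquad (5)$$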
The n-gram lattice of order n is called En and is
found by composing E ◦ Φn , projecting on the output, removing ǫ-arcs, determinizing, and minimising. The construction of En is fast even for large
lattices and is memory efficient. En itself may
have more states than E due to the association of
distinct n-gram histories with states. However, the
counting transducer for unigrams is simpler than
the corresponding counting transducer for higher-order n-grams. As a result, counting unigrams in
En is easier than counting n-grams in E.
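The effect of the mapping can be illustrated on a single word sequence. The sketch below is not a transducer: it assumes a plain Python list in place of a lattice path and uses a hypothetical helper `map_to_ngram_symbols`, but it shows why counting becomes a unigram problem once every n-gram is a single symbol.

```python
from collections import Counter
from typing import List, Sequence


def map_to_ngram_symbols(words: Sequence[str], n: int) -> List[str]:
    """Rewrite a word sequence as a sequence of order-n n-gram symbols."""
    # Each n-gram becomes one atomic symbol (here, a space-joined string).
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = ["the", "cat", "sat", "down"]
bigram_symbols = map_to_ngram_symbols(sentence, 2)

# After the mapping, counting bigrams is ordinary unigram counting.
bigram_counts = Counter(bigram_symbols)
```

A sequence shorter than n maps to the empty sequence, which mirrors the fact that such a path contains no n-gram of that order.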
3 Efficient Path Counting
Associated with each En we have a transducer Ψn
which can be used to calculate the path posterior
probabilities p(u|E) for all u ∈ Nn . In Figures
1 and 2 we give two possible forms of Ψn that
can be used to compute path posterior probabilities
over n-grams u1,2 ∈ Nn for some n. No modification to the ρ-arc matching mechanism is required
even in counting higher-order n-grams since all n-grams are represented as individual symbols after
application of the mapping transducer Φn .
Transducer Ψ_n^L is used by Allauzen et al. (2010)
to compute the exact unigram contribution to the
conditional expected gain in Equation (5). For example, in counting paths that contain u1 , Ψ_n^L retains
the first occurrence of u1 and maps every
other symbol to ǫ. This ensures that in any path
containing a given u, only the first u is counted,
avoiding multiple counting of paths.
We introduce an alternative path counting transducer Ψ_n^R
that effectively deletes all symbols except the last occurrence of u on any path by ensuring that any paths in composition which count
earlier instances of u do not end in a final state.
Multiple counting is avoided by counting only the
last occurrence of each symbol u on a path.
We note that initial ǫ:ǫ arcs in Ψ_n^L effectively
create |Nn | copies of En in composition while
searching for the first occurrence of each u. Com-
u∈N :1≤|u|≤k
wn :u
θu #u (E )c(u|E) , (5)
where k determines the range of n-gram orders
at which the path posterior probabilities p(u|E)
of Equation (2) and conditional expected counts
c(u|E) of Equation (4) are used to compute the
expected gain. For k < 4, Equation (5) is thus
an approximation to the approximation. In many
cases it will be perfectly fine, depending on how
closely p(u|E) and c(u|E) agree for higher-order
n-grams. Experimentally, Allauzen et al. (2010)
find this approximation works well at k = 1 for
MBR decoding of statistical machine translation
lattices. However, there may be scenarios in which
p(u|E) and c(u|E) differ so that Equation (5) is no
longer useful in place of the original Tromble et
al. (2008) approximation.
In the following sections, we present an efficient
method for simultaneous calculation of p(u|E) for
n-grams of a fixed order. While other fast MBR
approximations are possible (Kumar et al., 2009),
we show how the exact path posterior probabilities
can be calculated and applied in the implementation of Equation (1) for efficient MBR decoding
over lattices.
2 N-gram Mapping Transducer
The special composition symbol σ matches any arc; ρ
matches any arc other than those with an explicit transition.
See the OpenFst documentation:
We make use of a trick to count higher-order ngrams. We build transducer Φn to map word se28
More than one final state may gather probabilities
for the same u; to compute p(u|E) these probabilities are added. The forward algorithm requires
that En ◦ Ψ_n^R be topologically sorted; although sorting can be slow, it is still quicker than log semiring
ǫ-removal and determinization.
The statistics gathered by the forward algorithm could also be gathered under the expectation
semiring (Eisner, 2002) with suitably defined features. We take the view that the full complexity of
that approach is not needed here, since only one
symbol is introduced per path and per exit state.
Unlike En ◦ Ψ_n^R , the composition En ◦ Ψ_n^L does
not segregate paths by u such that there is a direct association between final states and symbols.
The forward algorithm does not readily yield the
per-symbol probabilities, although an arc weight
vector indexed by symbols could be used to correctly aggregate the required statistics (Riley et al.,
2009). For large Nn this would be memory intensive. The association between final states and
symbols could also be found by label pushing, but
we find this slow for large En ◦ Ψn .
Figure 1: Path counting transducer Ψ_n^L matching
first (left-most) occurrence of each u ∈ Nn .
Figure 2: Path counting transducer Ψ_n^R matching
last (right-most) occurrence of each u ∈ Nn .
posing with ΨR
n creates a single copy of En while
searching for the last occurrence of u; we find this
to be much more efficient for large Nn .
Path posterior probabilities are calculated over
each En by composing with Ψn in the log semiring, projecting on the output, removing ǫ-arcs, determinizing, minimising, and pushing weights to
the initial state (Allauzen et al., 2010). Using either
Ψ_n^L or Ψ_n^R , the resulting counts acceptor is Xn .
It has a compact form with one arc from the start
state for each ui ∈ Nn :
ui / −log p(ui |E)
4 Efficient Decoder Implementation
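In contrast to Equation (5), we use the exact values
of p(u|E) for all u ∈ Nn at orders n = 1 . . . 4 to compute
$$\hat{E} = \operatorname*{argmax}_{E' \in \mathcal{E}} \left\{ \theta_0 |E'| + \sum_{n=1}^{4} g_n(E, E') \right\},$$
where $g_n(E, E') = \sum_{u \in \mathcal{N}_n} \theta_u \#_u(E')\, p(u|\mathcal{E})$ using the exact path posterior probabilities at each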
order. We make acceptors Ωn such that E ◦ Ωn
assigns order n partial gain gn (E, E ′ ) to all paths
E ∈ E. Ωn is derived from Φn directly by assigning arc weight θu ×p(u|E) to arcs with output label
u and then projecting on the input labels. For each
n-gram u = w_1^n in Nn , arcs of Ωn have the form:
Efficient Path Posterior Calculation
Although Xn has a convenient and elegant form,
it can be difficult to build for large Nn because
the composition En ◦ Ψn results in millions of
states and arcs. The log semiring ǫ-removal and
determinization required to sum the probabilities
of paths labelled with each u can be slow.
However, if we use the proposed Ψ_n^R , then each
path in En ◦ Ψ_n^R has only one non-ǫ output label u
and all paths leading to a given final state
share the same u. A modified forward algorithm
can be used to calculate p(u|E) without the costly
ǫ-removal and determinization. The modification
simply requires keeping track of which symbol
u is encountered along each path to a final state.
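The bookkeeping behind such a modified forward pass can be sketched without any WFST library. The code below is an illustrative sketch, not the implementation used in this paper: it assumes an acyclic lattice given as a list of (source, destination, label, probability) arcs, integer state IDs that are already in topological order, and ordinary probabilities rather than the log semiring. The function name is hypothetical. Each partial path either has not yet committed to a symbol (tag `None`) or has committed to one symbol as its last occurrence; a committed path is dropped if that symbol appears again.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Arc = Tuple[int, int, str, float]  # (source state, destination state, label, probability)


def last_occurrence_posteriors(
    arcs: List[Arc], finals: Iterable[int], start: int = 0
) -> Dict[str, float]:
    """Path posterior of each symbol, counting each path at the symbol's last occurrence."""
    outgoing: Dict[int, List[Tuple[int, str, float]]] = defaultdict(list)
    states = {start}
    for src, dst, label, prob in arcs:
        outgoing[src].append((dst, label, prob))
        states.update((src, dst))

    # forward[state][tag] = probability mass of partial paths reaching `state`
    # that have committed to `tag` (None means no symbol committed yet).
    forward: Dict[int, Dict[Optional[str], float]] = defaultdict(lambda: defaultdict(float))
    forward[start][None] = 1.0

    for state in sorted(states):  # assumption: numeric order is topological order
        for tag, mass in list(forward[state].items()):
            for dst, label, prob in outgoing[state]:
                if tag is None:
                    # Either skip this symbol, or commit to it as the last occurrence.
                    forward[dst][None] += mass * prob
                    forward[dst][label] += mass * prob
                elif label != tag:
                    # Committed paths survive only while the symbol does not recur.
                    forward[dst][tag] += mass * prob

    posteriors: Dict[str, float] = defaultdict(float)
    for state in finals:
        for tag, mass in forward[state].items():
            if tag is not None:
                # Several final states may hold mass for the same symbol; add them.
                posteriors[tag] += mass
    return dict(posteriors)


# Toy lattice with two paths, "a b a" and "a c a", each with probability 0.5.
toy_arcs = [(0, 1, "a", 1.0), (1, 2, "b", 0.5), (1, 2, "c", 0.5), (2, 3, "a", 1.0)]
toy_posteriors = last_occurrence_posteriors(toy_arcs, finals=[3])
```

Although "a" appears twice on both paths, each path is counted once for it, at its final occurrence.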
wn /θu × p(u|E )
To apply θ0 we make a copy of E, called E0 ,
with fixed weight θ0 on all arcs. The decoder is
formed as the composition E0 ◦ Ω1 ◦ Ω2 ◦ Ω3 ◦ Ω4
and Ê is extracted as the maximum cost string.
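For intuition, the same decision rule can be applied to an explicit list of hypotheses instead of composing acceptors. The following sketch assumes that the posteriors p(u|E) are already available in a dictionary keyed by n-gram tuples and that the factors θ depend only on the n-gram order; the function names and the numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def lmbr_gain(
    hyp: Sequence[str],
    posteriors: Dict[NGram, float],
    theta0: float,
    theta_by_order: Dict[int, float],
    max_order: int = 4,
) -> float:
    """theta0 * |E'| plus the sum over n-grams of theta_u * #u(E') * p(u|E)."""
    gain = theta0 * len(hyp)
    for n in range(1, max_order + 1):
        counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        for u, count in counts.items():
            gain += theta_by_order[n] * count * posteriors.get(u, 0.0)
    return gain


def lmbr_decode(
    hyps: List[Sequence[str]],
    posteriors: Dict[NGram, float],
    theta0: float,
    theta_by_order: Dict[int, float],
) -> Sequence[str]:
    """Return the hypothesis with the highest gain."""
    return max(hyps, key=lambda h: lmbr_gain(h, posteriors, theta0, theta_by_order))


# Hypothetical posteriors and factors for a two-hypothesis example.
toy_post = {("a",): 1.0, ("b",): 0.75, ("c",): 0.25, ("a", "b"): 0.75, ("a", "c"): 0.25}
toy_theta = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
best = lmbr_decode([["a", "b"], ["a", "c"]], toy_post, -0.5, toy_theta)
```

The composition of acceptors computes the same sums for every path in the lattice at once, which is what makes the transducer formulation practical when the hypotheses cannot be listed.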
5 Lattice Generation for LMBR
Lattice MBR decoding performance and efficiency is evaluated in the context of the NIST
Table 1: BLEU scores for Arabic→English maximum likelihood translation (ML), MBR decoding using
the hybrid decision rule of Equation (5) at 0 ≤ k ≤ 3, and regular linearised lattice MBR (LMBR).
Table 2: Time in seconds required for path posterior n-gram probability calculation and LMBR decoding
using sequential method and left-most (Ψ_n^L ) or right-most (Ψ_n^R ) counting transducer implementations.
Arabic→English machine translation task. The
development set mt0205tune is formed from the
odd numbered sentences of the NIST MT02–
MT05 testsets; the even numbered sentences form
the validation set mt0205test. Performance on
NIST MT08 newswire (mt08nw) and newsgroup
(mt08ng) data is also reported.
First-pass translation is performed using HiFST
(Iglesias et al., 2009), a hierarchical phrase-based
decoder. Word alignments are generated using
MTTK (Deng and Byrne, 2008) over 150M words
of parallel text for the constrained NIST MT08
Arabic→English track. In decoding, a Shallow-1 grammar with a single level of rule nesting is
used and no pruning is performed in generating
first-pass lattices (Iglesias et al., 2009).
The first-pass language model is a modified
Kneser-Ney (Kneser and Ney, 1995) 4-gram estimated over the English parallel text and an 881M
word subset of the GigaWord Third Edition (Graff
et al., 2007). Prior to LMBR, the lattices are
rescored with large stupid-backoff 5-gram language models (Brants et al., 2007) estimated over
more than 6 billion words of English text.
The n-gram factors θ0 , . . . , θ4 are set according
to Tromble et al. (2008) using unigram precision
p = 0.85 and average recall ratio r = 0.74. Our
translation decoder and MBR procedures are implemented using OpenFst (Allauzen et al., 2007).
6 LMBR Speed and Performance
Lattice MBR decoding performance is shown in
Table 1. Compared to the maximum likelihood
translation hypotheses (row ML), LMBR gives
gains of +0.8 to +1.0 BLEU for newswire data and
+0.5 BLEU for newsgroup data (row LMBR).
The other rows of Table 1 show the performance
of LMBR decoding using the hybrid decision rule
of Equation (5) for 0 ≤ k ≤ 3. When the conditional expected counts c(u|E) are used at all orders
(i.e. k = 0), the hybrid decoder BLEU scores are
considerably lower than even the ML scores. This
poor performance is because there are many unigrams u for which c(u|E) is much greater than
p(u|E). The consensus translation maximising the
conditional expected gain is then dominated by
unigram matches, significantly degrading LMBR
decoding performance. Table 1 shows that for
these lattices the hybrid decision rule is an accurate approximation to Equation (1) only when
k ≥ 2 and the exact contribution to the gain function is computed using the path posterior probabilities at orders n = 1 and n = 2.
We now analyse the efficiency of lattice MBR
decoding using the exact path posterior probabilities of Equation (2) at all orders. We note that
the sequential method and both simultaneous implementations using path counting transducers Ψ_n^L
and Ψ_n^R yield the same hypotheses (allowing for
numerical accuracy); they differ only in speed and
memory usage.
Posteriors Efficiency. Computation times for
the steps in LMBR are given in Table 2. In calculating path posterior n-gram probabilities p(u|E),
we find that the use of Ψ_n^L is more than twice
as slow as the sequential method. This is due to
the difficulty of counting higher-order n-grams in
large lattices. Ψ_n^L is effective for counting unigrams, however, since there are far fewer of them.
Using Ψ_n^R is almost twice as fast as the sequential
method. This speed difference is due to the simple forward algorithm. We also observe that for
higher-order n, the composition En ◦ Ψ_n^R requires
less memory and produces a smaller machine than
En ◦ Ψ_n^L . It is easier to count paths by the final
occurrence of a symbol than by the first.
Figure 3: Total time in seconds versus |N |.
criteria should be implemented exactly where possible, so that it is clear exactly what the system is
doing. For machine translation lattices, conflating the values of p(u|E) and c(u|E) for higher-order n-grams might not be a serious problem, but
in other scenarios, especially where symbol sequences are repeated multiple times on the same
path, it may be a poor approximation.
We note that since much of the time in calculation is spent dealing with ǫ-arcs that are ultimately
removed, an optimised composition algorithm that
skips over such redundant structure may lead to
further improvements in time efficiency.
Decoding Efficiency. Decoding times are significantly faster using Ωn than the sequential method;
average decoding time is around 0.1 seconds per
sentence. The total time required for lattice MBR
is dominated by the calculation of the path posterior n-gram probabilities, and this is a function of the number of n-grams in the lattice |N |.
For each sentence in mt0205tune, Figure 3 plots
the total LMBR time for the sequential method
(marked ‘o’) and for probabilities computed using
Ψ_n^R (marked ‘+’). This compares the two techniques on a sentence-by-sentence basis. As |N |
grows, the simultaneous path counting transducer
is found to be much more efficient.
This work was supported in part under the
GALE program of the Defense Advanced Research Projects Agency, Contract No. HR0011-06-C-0022.
7 Conclusion
Cyril Allauzen, Mehryar Mohri, and Brian Roark.
2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st
Meeting of the Association for Computational Linguistics, pages 557–564.
We have described an efficient and exact implementation of the linear approximation to LMBR
using general WFST operations. A simple transducer was used to map words to sequences of n-grams in order to simplify the extraction of higher-order statistics. We presented a counting transducer Ψ_n^R
that extracts the statistics required for
all n-grams of order n in a single composition and
allows path posterior probabilities to be computed
efficiently using a modified forward procedure.
We take the view that even approximate search
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: a
general and efficient weighted finite-state transducer
library. In Proceedings of the 9th International Conference on Implementation and Application of Automata, pages 11–23. Springer.
Cyril Allauzen, Shankar Kumar, Wolfgang Macherey,
Mehryar Mohri, and Michael Riley. 2010. Expected
sequence similarity maximization. In Human Language Technologies 2010: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles,
California, June.
Michael Riley, Cyril Allauzen, and Martin Jansche.
2009. OpenFst: An Open-Source, Weighted Finite-State Transducer Library and its Applications to
Speech and Language. In Proceedings of Human
Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, pages 9–10, Boulder, Colorado, May. Association for Computational Linguistics.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J.
Och, and Jeffrey Dean. 2007. Large language
models in machine translation. In Proceedings of
the 2007 Joint Conference on Empirical Methods
in Natural Language Processing and Computational
Natural Language Learning, pages 858–867.
Roy Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice Minimum Bayes-Risk decoding for statistical machine translation.
In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages
620–629, Honolulu, Hawaii, October. Association
for Computational Linguistics.
Yonggang Deng and William Byrne. 2008. HMM
word and phrase alignment for statistical machine
translation. IEEE Transactions on Audio, Speech,
and Language Processing, 16(3):494–507.
Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–8, Philadelphia, July.
David Graff, Junbo Kong, Ke Chen, and Kazuaki
Maeda. 2007. English Gigaword Third Edition.
Gonzalo Iglesias, Adrià de Gispert, Eduardo R. Banga,
and William Byrne. 2009. Hierarchical phrase-based translation with weighted finite state transducers. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, pages 433–441, Boulder, Colorado, June. Association for Computational Linguistics.
R. Kneser and H. Ney. 1995. Improved backing-off for
m-gram language modeling. In Acoustics, Speech,
and Signal Processing, pages 181–184.
Shankar Kumar and William Byrne. 2004. Minimum
Bayes-risk decoding for statistical machine translation. In Proceedings of Human Language Technologies: The 2004 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, pages 169–176.
Shankar Kumar, Wolfgang Macherey, Chris Dyer, and
Franz Och. 2009. Efficient minimum error rate
training and minimum bayes-risk decoding for translation hypergraphs and lattices. In Proceedings of
the Joint Conference of the 47th Annual Meeting of
the Association for Computational Linguistics and
the 4th International Joint Conference on Natural
Language Processing of the AFNLP, pages 163–
171, Suntec, Singapore, August. Association for
Computational Linguistics.
M. Mohri, F.C.N. Pereira, and M. Riley. 2008. Speech
recognition with weighted finite-state transducers.
Handbook on Speech Processing and Speech Communication.
The Same-head Heuristic for Coreference
Micha Elsner and Eugene Charniak
Brown Laboratory for Linguistic Information Processing (BLLIP)
Brown University
Providence, RI 02912
resolve (variant MUC score .82 on MUC-6) and
while those with partial matches are quite a bit
harder (.53), by far the worst performance is on
those without any match at all (.27). This effect
is magnified by most popular metrics for coreference, which reward finding links within large
clusters more than they punish proposing spurious links, making it hard to improve performance by linking conservatively. Systems that
use gold mention boundaries (the locations of NPs
marked by annotators) have even less need to
worry about same-head relationships, since most
NPs which disobey the conventional assumption
are not marked as mentions.
In this paper, we count how often same-head
pairs fail to corefer in the MUC-6 corpus, showing that gold mention detection hides most such
pairs, but more realistic detection finds large numbers. We also present an unsupervised generative model which learns to make certain same-head pairs non-coreferent. The model is based
on the idea that pronoun referents are likely to
be salient noun phrases in the discourse, so we
can learn about NP antecedents using pronominal antecedents as a starting point. Pronoun
anaphora, in turn, is learnable from raw data
(Cherry and Bergsma, 2005; Charniak and Elsner,
2009). Since our model links fewer NPs than the
baseline, it improves precision but decreases recall. This tradeoff is favorable for CEAF, but not
for b3 .
We investigate coreference relationships
between NPs with the same head noun.
It is relatively common in unsupervised
work to assume that such pairs are
coreferent, but this is not always true, especially if realistic mention detection is
used. We describe the distribution of non-coreferent same-head pairs in news text,
and present an unsupervised generative
model which learns not to link some same-head NPs using syntactic features, improving precision.
Full NP coreference, the task of discovering which
non-pronominal NPs in a discourse refer to the
same entity, is widely known to be challenging.
In practice, however, most work focuses on the
subtask of linking NPs with different head words.
Decisions involving NPs with the same head word
have not attracted nearly as much attention, and
many systems, especially unsupervised ones, operate under the assumption that all same-head
pairs corefer. This is by no means always the case:
there are several systematic exceptions to the rule.
In this paper, we show that these exceptions are
fairly common, and describe an unsupervised system which learns to distinguish them from coreferent same-head pairs.
There are several reasons why relatively little
attention has been paid to same-head pairs. Primarily, this is because they are a comparatively
easy subtask in a notoriously difficult area; Stoyanov et al. (2009) show that, among NPs headed
by common nouns, those which have an exact
match earlier in the document are the easiest to
Related work
Unsupervised systems specify the assumption of
same-head coreference in several ways: by as1
Gold mention detection means something slightly different in the ACE corpus, where the system input contains every
NP annotated with an entity type.
Proceedings of the ACL 2010 Conference Short Papers, pages 33–37,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
sumption (Haghighi and Klein, 2009), using
a head-prediction clause (Poon and Domingos,
2008), and using a sparse Dirichlet prior on word
emissions (Haghighi and Klein, 2007). (These
three systems, perhaps not coincidentally, use gold
mentions.) An exception is Ng (2008), who points
out that head identity is not an entirely reliable cue
and instead uses exact string match (minus determiners) for common NPs and an alias detection
system for proper NPs. This work uses mentions
extracted with an NP chunker. No specific results
are reported for same-head NPs. However, while
using exact string match raises precision, many
non-matching phrases are still coreferent, so this
approach cannot be considered a full solution to
the problem.
Supervised systems do better on the task, but
not perfectly. Recent work (Stoyanov et al., 2009)
attempts to determine the contributions of various
categories of NP to coreference scores, and shows
(as stated above) that common NPs which partially
match an earlier mention are not well resolved by
the state-of-the-art RECONCILE system, which
uses pairwise classification. They also show that
using gold mention boundaries makes the coreference task substantially easier, and argue that this
experimental setting is “rather unrealistic”.
tite matching between gold and proposed clusters,
then gives the percentage of entities whose gold
label and proposed label match. b3 gives more
weight to errors involving larger clusters (since
these lower scores for several mentions at once);
for mention CEAF, all mentions are weighted equally.
We annotate the data with the self-trained Charniak parser (McClosky et al., 2006), then extract
mentions using three different methods. The gold
mentions method takes only mentions marked by
annotators. The nps method takes all base noun
phrases detected by the parser. Finally, the nouns
method takes all nouns, even those that do not
head NPs; this method maximizes recall, since it
does not exclude prenominals in phrases like “a
Bush spokesman”. (High-precision models of the
internal structure of flat Penn Treebank-style NPs
were investigated by Vadas and Curran (2007).)
For each experimental setting, we show the number of mentions detected, and how many of them
are linked to some antecedent by the system.
The data is shown in Table 1. b3 shows a large
drop in precision when all same-head pairs are
linked; in fact, in the nps and nouns settings, only
about half the same-headed NPs are actually coreferent (864 real links, 1592 pairs for nps). This
demonstrates that non-coreferent same-head pairs
not only occur, but are actually rather common in
the dataset. The drop in precision is much less
obvious in the gold mentions setting, however;
most unlinked same-head pairs are not annotated
as mentions in the gold data, which is one reason
why systems run in this experimental setting can
afford to ignore them.
Improperly linking same-head pairs causes a
loss in precision, but scores are dominated by recall. Thus, reporting b3 helps to mask the impact
of these pairs when examining the final f-score.
We roughly characterize what sort of same-headed NPs are non-coreferent by hand-examining 100 randomly selected pairs. 39
pairs denoted different entities (“recent employees” vs “employees who have worked for longer”)
disambiguated by modifiers or sometimes by
discourse position. The next largest group (24)
consists of time and measure phrases like “ten
miles”. 12 pairs refer to parts or quantities
Descriptive study: MUC-6
We begin by examining how often non-same-head
pairs appear in the MUC-6 coreference dataset.
To do so, we compare two artificial coreference
systems: the link-all strategy links all, and only,
full (non-pronominal) NP pairs with the same head
which occur within 10 sentences of one another.
The oracle strategy links NP pairs with the same
head which occur within 10 sentences, but only if
they are actually coreferent (according to the gold
annotation).² The link-all system, in other words,
does what most existing unsupervised systems do
on the same-head subset of NPs, while the oracle
system performs perfectly.
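The two artificial systems can be sketched as follows. This is a hypothetical illustration under stated assumptions: mentions are plain dictionaries with `id`, `head` and `sent` fields, pronouns have already been filtered out, and "within 10 sentences" is read as a sentence distance of at most 10.

```python
from itertools import combinations

def same_head_pairs(mentions, window=10):
    """All mention pairs with identical heads at most `window` sentences apart."""
    pairs = []
    for a, b in combinations(mentions, 2):
        if a["head"] == b["head"] and abs(a["sent"] - b["sent"]) <= window:
            pairs.append((a["id"], b["id"]))
    return pairs

def link_all(mentions, window=10):
    """Baseline: link every same-head pair in the window."""
    return same_head_pairs(mentions, window)

def oracle(mentions, gold_entity, window=10):
    """Upper bound: keep only same-head pairs that are coreferent in the gold data."""
    return [(i, j) for i, j in same_head_pairs(mentions, window)
            if gold_entity[i] == gold_entity[j]]

# Hypothetical mentions; m2 shares a head with m1 and m3 but is a different entity.
mentions = [
    {"id": "m1", "head": "employees", "sent": 0},
    {"id": "m2", "head": "employees", "sent": 1},
    {"id": "m3", "head": "employees", "sent": 3},
    {"id": "m4", "head": "union", "sent": 3},
]
gold_entity = {"m1": "E1", "m2": "E2", "m3": "E1", "m4": "E3"}
baseline_links = link_all(mentions)
oracle_links = oracle(mentions, gold_entity)
```

The difference between the two link sets is exactly the set of non-coreferent same-head pairs that the baseline links in error.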
We compare our results to the gold standard using two metrics. b3 (Bagga and Baldwin, 1998)
is a standard metric which calculates a precision
and recall for each mention. The mention CEAF
(Luo, 2005) constructs a maximum-weight bipartite
² The choice of 10 sentences as the window size captures
most, but not all, of the available recall. Using nouns mention
detection, it misses 117 possible same-head links, or about
10%. However, precision drops further as the window size
³ This bias is exaggerated for systems which only link
same-head pairs, but continues to apply to real systems; for
instance, Haghighi and Klein (2009) report a b3 precision of 84
and recall of 67.
                 b3 pr   b3 rec
Gold mentions
  Oracle         100     32.3
  Link all       80.6    31.7
  System         93.7    22.1
nps
  Oracle         100     30.6
  Link all       67.2    29.5
  System         87.2    24.7
nouns
  Oracle         100     41.5
  Link all       56.6    40.9
  System         83.0    32.8
Table 1: Oracle, system and baseline scores on MUC-6 test data. Gold mentions leave little room
for improvement between baseline and oracle; detecting more mentions widens the gap between
them. With realistic mention detection, precision and CEAF scores improve over baselines, while recall
and f-scores drop.
relationship to a potential generator gj . These features, which we denote f (ni , gj , D), may depend
on their relative position in the document D, and
on any features of gj , since we have already generated its tree. However, we cannot extract features
from the subtree under ni , since we have yet to
generate it!
As usual for IBM models, we learn using EM,
and we need to start our alignment function off
with a good initial set of parameters. Since antecedents of NPs and pronouns (both salient NPs)
often occur in similar syntactic environments, we
use an alignment function for pronoun coreference as a starting point. This alignment can be
learned from raw data, making our approach unsupervised.
We take the pronoun model of Charniak and Elsner (2009)4 as our starting point. We re-express
it in the IBM framework, using a log-linear model
for our alignment. Then our alignment (parameterized by feature weights w) is:
(“members of...”), and 12 contained a generic
(“In a corporate campaign, a union tries...”). 9
contained an annotator error. The remaining 4
were mistakes involving proper noun phrases
headed by Inc. and other abbreviations; this case
is easy to handle, but apparently not the primary
cause of errors.
Our system is a version of the popular IBM model
2 for machine translation. To define our generative
model, we assume that the parse trees for the entire document D are given, except for the subtrees
with root nonterminal NP, denoted ni , which our
system will generate. These subtrees are related
by a hidden set of alignments, ai , which link each
NP to another NP (which we call a generator) appearing somewhere before it in the document, or
to a null antecedent. The set of potential generators G (which plays the same role as the source-language text in MT) is taken to be all the NPs
occurring within 10 sentences of the target, plus a
special null antecedent which plays the same role
as the null word in machine translation: it serves
as a dummy generator for NPs which are unrelated
to any real NP in G.
The generative process fills in all the NP nodes
in order, from left to right. This process ensures
that, when generating node ni , we have already
filled in all the NPs in the set G (since these all
precede ni ). When deciding on a generator for
NP ni , we can extract features characterizing its
$$p(a_i = j \mid G, D) \propto \exp\big(f(n_i, g_j, D) \cdot w\big)$$
The weights w are learned by gradient descent
on the log-likelihood. To use this model within
EM, we alternate an E-step where we calculate
the expected alignments E[ai = j], then an Mstep where we run gradient descent. (We have also
had some success with stepwise EM as in (Liang
and Klein, 2009), but this requires some tuning to
work properly.)
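As a rough illustration of the log-linear alignment, the sketch below normalizes exp(f · w) over the candidate generators of a single NP. The function name, the list-of-lists feature encoding and the toy feature values are assumptions; a full E-step would also fold in the translation probabilities, which are omitted here.

```python
import math

def alignment_distribution(feature_vectors, w):
    """Return p(a_i = j | G, D) for each candidate generator j.

    feature_vectors[j] stands in for f(n_i, g_j, D); w is the weight vector.
    """
    scores = [sum(f_k * w_k for f_k, w_k in zip(f, w)) for f in feature_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - top) for s in scores]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Hypothetical binary features for three candidate generators.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = alignment_distribution(features, [0.0, 0.0])
peaked = alignment_distribution(features, [2.0, 1.0])
```

With all weights at zero the distribution is uniform over the candidates; non-zero weights shift probability mass toward generators whose features score highly.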
As features, we take the same features as Charniak and Elsner (2009): sentence and word-count
distance between ni and gj , sentence position of
each, syntactic role of each, and head type of gj
(proper, common or pronoun). We add binary features for the nonterminal directly over gj (NP, VP,
PP, any S type, or other), the type of phrases modifying gj (proper nouns, phrasals (except QP and
PP), QP, PP-of, PP-other, other modifiers, or nothing), and the type of determiner of gj (possessive,
definite, indefinite, deictic, other, or nothing). We
designed this feature set to distinguish prominent
NPs in the discourse, and also to be able to detect
abstract or partitive phrases by examining modifiers and determiners.
To produce full NPs and learn same-head coreference, we focus on learning a good alignment
using the pronoun model as a starting point. For
translation, we use a trivial model, p(n_i | g_{a_i}) = 1
if the two have the same head, and 0 otherwise,
except for the null antecedent, which draws heads
from a multinomial distribution over words.
While we could learn an alignment and then
treat all generators as antecedents, so that only
NPs aligned to the null antecedent were not labeled coreferent, in practice this model would
align nearly all the same-head pairs. This is
true because many words are “bursty”; the probability of a second occurrence given the first is
higher than the a priori probability of occurrence
(Church, 2000). Therefore, our model is actually a
mixture of two IBM models, pT and pN, where pT
produces NPs with antecedents and pN produces
pairs that share a head, but are not coreferent. To
break the symmetry, we allow pT to use any parameters w, while pN uses a uniform alignment,
w ≡ 0 (the zero vector). We interpolate between these two models
with a constant λ, the single manually set parameter of our system, which we fixed at 0.9.
The full model, therefore, is:
ator (the largest term in either of the sums) is from
pT and is not the null antecedent are marked as
coreferent to the generator. Other NPs are marked
not coreferent.
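The decision rule can be sketched as below. This is one possible reading, not the authors' code: `t_terms[j]` and `n_terms[j]` are taken to be the j-th summands of the two mixture components for a single NP, the comparison is assumed to be made after weighting by λ and 1 − λ, and the null antecedent is assumed to sit at index 0.

```python
def mark_coreferent(t_terms, n_terms, lam=0.9, null_index=0):
    """Return the index of the chosen antecedent, or None if the NP is not linked.

    t_terms[j]: j-th summand of the coreferent component for this NP.
    n_terms[j]: j-th summand of the uniform (non-coreferent) component.
    """
    best = (-1.0, None, None)  # (weighted value, generator index, component)
    for j, (t, n) in enumerate(zip(t_terms, n_terms)):
        for component, value in (("T", lam * t), ("N", (1.0 - lam) * n)):
            if value > best[0]:
                best = (value, j, component)
    _, j, component = best
    # Link only if the winning term is a real generator from the coreferent component.
    if component == "T" and j != null_index:
        return j
    return None

linked = mark_coreferent([0.05, 0.6, 0.0], [0.0, 0.33, 0.0])
unlinked = mark_coreferent([0.0, 0.01], [0.0, 0.5])
null_case = mark_coreferent([0.9, 0.01], [0.0, 0.5])
```

An NP is left unlinked either when the uniform component explains it best or when the best generator is the null antecedent.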
Our results on the MUC-6 formal test set are
shown in Table 1. In all experimental settings,
the model improves precision over the baseline
while decreasing recall; that is, it misses some legitimate coreferent pairs while correctly excluding many of the spurious ones. Because of the
precision-recall tradeoff at which the systems operate, this results in reduced b3 and link F. However, for the nps and nouns settings, where the
parser is responsible for finding mentions, the
tradeoff is positive for the CEAF metrics. For instance, in the nps setting, it improves over baseline
by 57%.
As expected, the model does poorly in the gold
mentions setting, doing worse than baseline on
both metrics. Although it is possible to get very
high precision in this setting, the model is far too
conservative, linking less than half of the available
mentions to anything, when in fact about 60% of
them are coreferent. As we explain above, this experimental setting makes it mostly unnecessary to
worry about non-coreferent same-head pairs because the MUC-6 annotators don’t often mark
While same-head pairs are easier to resolve than
same-other pairs, they are still non-trivial and deserve further attention in coreference research. To
effectively measure their effect on performance,
researchers should report multiple metrics, since
under b3 the link-all heuristic is extremely difficult to beat. It is also important to report results
using a realistic mention detector as well as gold
$$p(n_i \mid G, D) = \lambda\, p_T(n_i \mid G, D) + (1 - \lambda)\, p_N(n_i \mid G, D)$$

$$p_T(n_i \mid G, D) = \frac{1}{Z} \sum_{j \in G} \exp\big(f(n_i, g_j, D) \cdot w\big) \times I\{\mathrm{head}(n_i) = \mathrm{head}(g_j)\}$$

$$p_N(n_i \mid G, D) = \sum_{j \in G} \frac{1}{|G|}\, I\{\mathrm{head}(n_i) = \mathrm{head}(g_j)\}$$

We thank Jean Carletta for the SWITCHBOARD
annotations, and Dan Jurafsky and eight anonymous reviewers for their comments and suggestions. This work was funded by a Google graduate
NPs for which the maximum-likelihood gener36
Amit Bagga and Breck Baldwin. 1998. Algorithms for
scoring coreference chains. In LREC Workshop on
Linguistics Coreference, pages 563–566.
Eugene Charniak and Micha Elsner. 2009. EM works
for pronoun anaphora resolution. In Proceedings of
EACL, Athens, Greece.
Colin Cherry and Shane Bergsma. 2005. An Expectation Maximization approach to pronoun resolution.
In Proceedings of CoNLL, pages 88–95, Ann Arbor,
Kenneth W. Church. 2000. Empirical estimates of
adaptation: the chance of two Noriegas is closer to
p/2 than p^2. In Proceedings of ACL, pages 180–186.
Aria Haghighi and Dan Klein. 2007. Unsupervised
coreference resolution in a nonparametric Bayesian
model. In Proceedings of ACL, pages 848–855.
Aria Haghighi and Dan Klein. 2009. Simple coreference resolution with rich syntactic and semantic features. In Proceedings of EMNLP, pages 1152–1161.
Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In HLT-NAACL.
Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of HLT-EMNLP,
pages 25–32, Morristown, NJ, USA. Association for
Computational Linguistics.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In
Proceedings of HLT-NAACL, pages 152–159.
Vincent Ng. 2008. Unsupervised models for coreference resolution. In Proceedings of EMNLP, pages
640–649, Honolulu, Hawaii. Association for Computational Linguistics.
Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with Markov Logic.
In Proceedings of EMNLP, pages 650–659, Honolulu, Hawaii, October. Association for Computational Linguistics.
Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and
Ellen Riloff. 2009. Conundrums in noun phrase
coreference resolution: Making sense of the stateof-the-art. In Proceedings of ACL-IJCNLP, pages
656–664, Suntec, Singapore, August. Association
for Computational Linguistics.
David Vadas and James Curran. 2007. Adding noun
phrase structure to the penn treebank. In Proceedings of ACL, pages 240–247, Prague, Czech Republic, June. Association for Computational Linguistics.
Authorship Attribution Using Probabilistic Context-Free Grammars
Sindhu Raghavan Adriana Kovashka Raymond Mooney
Department of Computer Science
The University of Texas at Austin
1 University Station C0500
Austin, TX 78712-0233, USA
(2008) use a combination of word-level statistics
and part-of-speech counts or n-grams. Baayen et
al. (1996) demonstrate that the use of syntactic
features from parse trees can improve the accuracy of authorship attribution. While there have
been several approaches proposed for authorship
attribution, it is not clear if the performance of one
is better than the other. Further, it is difficult to
compare the performance of these algorithms because they were primarily evaluated on different
datasets. For more information on the current state
of the art for authorship attribution, we refer the
reader to a detailed survey by Stamatatos (2009).
We further investigate the use of syntactic information by building complete models of each author’s syntax to distinguish between authors. Our
approach involves building a probabilistic context-free grammar (PCFG) for each author and using
this grammar as a language model for classification. Experiments on a variety of corpora including poetry and newspaper articles on a number of
topics demonstrate that our PCFG approach performs fairly well, but it only outperforms a bigram language model on a couple of datasets (e.g.
poetry). However, combining our approach with
other methods results in an ensemble that performs
the best on most datasets.
In this paper, we present a novel approach
for authorship attribution, the task of identifying the author of a document, using
probabilistic context-free grammars. Our
approach involves building a probabilistic
context-free grammar for each author and
using this grammar as a language model
for classification. We evaluate the performance of our method on a wide range of
datasets to demonstrate its efficacy.
Natural language processing allows us to build
language models, and these models can be used
to distinguish between languages. In the context of written text, such as newspaper articles or
short stories, the author’s style could be considered a distinct “language.” Authorship attribution,
also referred to as authorship identification or prediction, studies strategies for discriminating between the styles of different authors. These strategies have numerous applications, including settling disputes regarding the authorship of old and
historically important documents (Mosteller and
Wallace, 1984), automatic plagiarism detection,
determination of document authenticity in court
(Juola and Sofko, 2004), cyber crime investigation (Zheng et al., 2009), and forensics (Luyckx
and Daelemans, 2008).
The general approach to authorship attribution
is to extract a number of style markers from the
text and use these style markers as features to train
a classifier (Burrows, 1987; Binongo and Smith,
1999; Diederich et al., 2000; Holmes and Forsyth,
1995; Joachims, 1998; Mosteller and Wallace,
1984). These style markers could include the
frequencies of certain characters, function words,
phrases or sentences. Peng et al. (2003) build a
character-level n-gram model for each author. Stamatatos et al. (1999) and Luyckx and Daelemans
Authorship Attribution using PCFG
We now describe our approach to authorship attribution. Given a training set of documents from
different authors, we build a PCFG for each author
based on the documents they have written. Given
a test document, we parse it using each author’s
grammar and assign it to the author whose PCFG
produced the highest likelihood for the document.
In order to build a PCFG, a standard statistical
parser takes a corpus of parse trees of sentences
as training input. Since we do not have access to
authors’ documents annotated with parse trees,
we use a statistical parser trained on a generic
Proceedings of the ACL 2010 Conference Short Papers, pages 38–42,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
// In addition, we collected poems from the Project Gutenberg website (
Main_Page). We attempted to collect sets of
documents on a shared topic written by multiple
authors. This was done to ensure that the datasets
truly tested authorship attribution as opposed to
topic identification. However, since it is very difficult to find authors that write literary works on
the same topic, the Poetry dataset exhibits higher
topic variability than our news datasets. We had
5 different datasets in total – Football, Business,
Travel, Cricket, and Poetry. The number of authors in our datasets ranged from 3 to 6.
For each dataset, we split the documents into
training and test sets. Previous studies (Stamatatos
et al., 1999) have observed that having unequal
number of words per author in the training set
leads to poor performance for the authors with
fewer words. Therefore, we ensured that, in the
training set, the total number of words per author
was roughly the same. We would like to note that
we could have also selected the training set such
that the total number of sentences per author was
roughly the same. However, since we would like
to compare the performance of the PCFG-based
approach with a bag-of-words baseline, we decided to normalize the training set based on the
number of words, rather than sentences. For testing, we used 15 documents per author for datasets
with news articles and 5 or 10 documents per author for the Poetry dataset. More details about the
datasets can be found in Table 1.
corpus like the Wall Street Journal (WSJ) or
Brown corpus from the Penn Treebank (http:
to automatically annotate (i.e. treebank) the
training documents for each author. In our
experiments, we used the Stanford Parser (Klein
and Manning, 2003b; Klein and Manning,
2003a) and the OpenNLP sentence segmenter
Our approach is summarized below:
Input – A training set of documents labeled
with author names and a test set of documents with
unknown authors.
1. Train a statistical parser on a generic corpus
like the WSJ or Brown corpus.
2. Treebank each training document using the
parser trained in Step 1.
3. Train a PCFG Gi for each author Ai using the
treebanked documents for that author.
4. For each test document, compute its likelihood for each grammar Gi by multiplying the
probability of the top PCFG parse for each sentence.
5. For each test document, find the author Ai
whose grammar Gi results in the highest likelihood score.
Output – A label (author name) for each document in the test set.
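The steps above can be sketched in a few lines, under simplifying assumptions that differ from the actual pipeline: trees are nested tuples rather than parser output, test sentences are assumed to arrive already parsed (the approach described here instead re-parses each test sentence with every author's grammar), unseen rules receive an arbitrary floor probability, and log-probabilities are summed instead of multiplying probabilities. All function names are hypothetical.

```python
import math
from collections import Counter

def rules(tree):
    """Yield (lhs, rhs) for each internal node of a nested-tuple tree,
    e.g. ("S", ("NP", "she"), ("VP", ("V", "runs")))."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def train_pcfg(trees):
    """Relative-frequency estimate of rule probabilities (Step 3)."""
    counts = Counter(r for t in trees for r in rules(t))
    lhs_totals = Counter()
    for (lhs, _), c in counts.items():
        lhs_totals[lhs] += c
    return {r: c / lhs_totals[r[0]] for r, c in counts.items()}

def log_likelihood(grammar, trees, floor=1e-6):
    """Sum of log rule probabilities over all sentences (Step 4, in log space).
    The floor for unseen rules is an assumption of this sketch."""
    return sum(math.log(grammar.get(r, floor)) for t in trees for r in rules(t))

def attribute(grammars, test_trees):
    """Return the author whose grammar scores the document highest (Step 5)."""
    return max(grammars, key=lambda a: log_likelihood(grammars[a], test_trees))

# Hypothetical treebanked training data for two authors.
a_trees = [("S", ("NP", "she"), ("VP", ("V", "runs")))]
b_trees = [("S", ("VP", ("V", "run")), ("NP", "she"))]
grammars = {"A": train_pcfg(a_trees), "B": train_pcfg(b_trees)}
test_trees = [("S", ("NP", "he"), ("VP", ("V", "runs")))]
predicted_author = attribute(grammars, test_trees)
```

Working in log space avoids numerical underflow when a document contains many sentences, and leaves the ranking of authors unchanged.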
Experimental Comparison
This section describes experiments evaluating our
approach on several real-world datasets.
Dataset   # authors   # words/auth   # docs/auth   # sent/auth
Table 1: Statistics for the training datasets used in
our experiments. The numbers in columns 3, 4 and
5 are averages.
We collected a variety of documents with known
authors including news articles on a wide range of
topics and literary works like poetry. We downloaded all texts from the Internet and manually removed extraneous information as well as titles, author names, and chapter headings. We collected
several news articles from the New York Times
online journal (http://global.nytimes.
com/) on topics related to business, travel, and
football. We also collected news articles on
cricket from the ESPN cricinfo website (http:
We evaluated our approach to authorship prediction on the five datasets described above. For news
articles, we used the first 10 sections of the WSJ
corpus, which consists of annotated news articles
on finance, to build the initial statistical parser in
writing style at the syntactic level, it may not accurately capture lexical information. Since both syntactic and lexical information is presumably useful
in capturing the author’s overall writing style, we
also developed an ensemble using a PCFG model,
the bag-of-words MaxEnt classifier, and an n-gram language model. We linearly combined the
confidence scores assigned by each model to each
author, and used the combined score for the final
classification. We refer to this model as “PCFG-E”, where E stands for ensemble. We also developed another ensemble based on MaxEnt and
n-gram language models to demonstrate the contribution of the PCFG model to the overall performance of PCFG-E. For each dataset, we report
accuracy, the fraction of the test documents whose
authors were correctly identified.
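A minimal sketch of the linear combination and of the accuracy measure follows. The combination weights are not given in the text, so equal weights are assumed here, as is the premise that each model's confidence scores are on a comparable scale; the function names and the toy scores are hypothetical.

```python
def ensemble_predict(model_scores, weights):
    """Linearly combine per-author confidence scores and return the top author.

    model_scores: list of dicts, one per model, mapping author -> confidence.
    weights: one weight per model (equal weights are an assumption).
    """
    authors = model_scores[0].keys()
    combined = {a: sum(w * scores[a] for w, scores in zip(weights, model_scores))
                for a in authors}
    return max(combined, key=combined.get)

def accuracy(predicted, gold):
    """Fraction of test documents whose author was correctly identified."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Hypothetical confidence scores for one test document.
pcfg = {"A": 0.5, "B": 0.5}
maxent = {"A": 0.3, "B": 0.7}
bigram = {"A": 0.6, "B": 0.4}
label = ensemble_predict([pcfg, maxent, bigram], [1 / 3, 1 / 3, 1 / 3])
```

Dropping one model from `model_scores` (and its weight) gives the reduced ensemble used to isolate that model's contribution.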
Step 1. For Poetry, we used 7 sections of the
Brown corpus which consists of annotated documents from different areas of literature.
In the basic approach, we trained a PCFG model
for each author based solely on the documents
written by that author. However, since the number of documents per author is relatively low, this
leads to very sparse training data. Therefore, we
also augmented the training data by adding one,
two or three sections of the WSJ or Brown corpus
to each training set, and up-sampling (replicating)
the data from the original author. We refer to this
model as “PCFG-I”, where I stands for interpolation since this effectively exploits linear interpolation with the base corpus to smooth parameters.
Based on our preliminary experiments, we replicated the original data three or four times.
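To see why replication amounts to linear interpolation, consider relative-frequency estimation of a rule X → β. The notation below is introduced only for this sketch: c_A and c_B are rule counts in the author's treebanked documents and in the added background sections, N_A(X) and N_B(X) are the corresponding totals for rules with left-hand side X, and k is the replication factor.

```latex
\hat{p}(X \to \beta)
  = \frac{k\,c_A(X \to \beta) + c_B(X \to \beta)}{k\,N_A(X) + N_B(X)}
  = \mu_X \,\frac{c_A(X \to \beta)}{N_A(X)}
    + (1 - \mu_X)\,\frac{c_B(X \to \beta)}{N_B(X)},
\qquad
\mu_X = \frac{k\,N_A(X)}{k\,N_A(X) + N_B(X)}
```

Under this reading, a larger k or fewer background sections moves the weight toward the author's own estimate, and the effective weight differs by nonterminal because it depends on how often X occurs in each corpus.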
We compared the performance of our approach
to bag-of-words classification and n-gram language models. When using bag-of-words, one
generally removes commonly occurring “stop
words.” However, for the task of authorship prediction, we hypothesized that the frequency of
specific stop words could provide useful information about the author’s writing style. Preliminary experiments verified that eliminating stop
words degraded performance; therefore, we did
not remove them. We used the Maximum Entropy
(MaxEnt) and Naive Bayes classifiers in the MALLET software package (McCallum, 2002) as initial baselines. We surmised that a discriminative
classifier like MaxEnt might perform better than
a generative classifier like Naive Bayes. However, when sufficient training data is not available,
generative models are known to perform better
than discriminative models (Ng and Jordan, 2001).
Hence, we chose to compare our method to both
Naive Bayes and MaxEnt.
Results and Discussion
Table 2 shows the accuracy of authorship prediction on different datasets. For the n-gram models, we only report the results for the bigram
model with smoothing (Bigram-I) as it was the
best performing model for most datasets (except
for Cricket and Poetry). For the Cricket dataset,
the trigram-I model was the best performing n-gram model with an accuracy of 98.34%. Generally, a higher order n-gram model (n = 3 or higher)
performs poorly as it requires a fair amount of
smoothing due to the exponential increase in all
possible n-gram combinations. Hence, the superior performance of the trigram-I model on the
Cricket dataset was a surprising result. For the
Poetry dataset, the unigram-I model performed
best among the smoothed n-gram models at 81.8%
accuracy. This is unsurprising because as mentioned above, topic information is strongest in
the Poetry dataset, and it is captured well in the
unigram model. For bag-of-words methods, we
find that the generatively trained Naive Bayes
model (unigram language model) performs better than or equal to the discriminatively trained
MaxEnt model on most datasets (except for Business). This result is not surprising since our
datasets are limited in size, and generative models
tend to perform better than discriminative methods when there is very little training data available.
Amongst the different baseline models (MaxEnt,
Naive Bayes, Bigram-I), we find Bigram-I to be
the best performing model (except for Cricket and
Poetry). For both Cricket and Poetry, Naive Bayes
We also compared the performance of the
PCFG approach against n-gram language models.
Specifically, we tried unigram, bigram and trigram
models. We used the same background corpus
mixing method used for the PCFG-I model to effectively smooth the n-gram models. Since a generative model like Naive Bayes that uses n-gram
frequencies is equivalent to an n-gram language
model, we also used the Naive Bayes classifier in
MALLET to implement the n-gram models. Note
that a Naive-Bayes bag-of-words model is equivalent to a unigram language model.
While the PCFG model captures the author’s
Table 2: Accuracy in % for authorship prediction on different datasets. Bigram-I refers to the bigram
language model with smoothing. PCFG-E refers to the ensemble based on MaxEnt, Bigram-I, and
PCFG-I. MaxEnt+Bigram-I refers to the ensemble based on MaxEnt and Bigram-I.
is better than the best constituent model. Furthermore, for the Football, Cricket and Poetry datasets
this improvement is quite substantial. We now
find that the performance of some variant of PCFG
is always better than or equal to that of the best
baseline. While the basic PCFG model outperforms the baseline for the Football dataset, PCFG-E outperforms the best baseline for the Poetry
and Business datasets. For the Cricket and Travel
datasets, the performance of the PCFG-E model
equals that of the best baseline. In order to assess the statistical significance of any performance
difference between the best PCFG model and the
best baseline, we performed McNemar’s test,
a non-parametric test for binomial variables (Rosner, 2005). We found that the difference in the
performance of the two methods was not statistically significant at .05 significance level for any of
the datasets, probably due to the small number of
test samples.
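For readers who wish to reproduce this kind of comparison, the sketch below computes a two-sided exact (binomial) McNemar p-value from the discordant counts. The text does not say which variant of the test was used, so the exact form is an assumption, as are the function name and the example counts.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: documents the first model classified correctly and the second did not.
    c: documents the second model classified correctly and the first did not.
    Under the null hypothesis, the discordant documents split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts for two classifiers on a small test set.
p_value = mcnemar_exact(5, 1)
```

Only the documents on which the two models disagree enter the test, so with a few dozen test documents the discordant counts are small and even sizeable accuracy gaps can fail to reach significance.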
The performance of PCFG and PCFG-I is particularly impressive on the Football and Poetry
datasets. For the Football dataset, the basic PCFG
model is the best performing PCFG model and it
performs much better than other methods. It is surprising that smoothing using PCFG-I actually results in a drop in performance on this dataset. We
hypothesize that the authors in the Football dataset
may have very different syntactic writing styles
that are effectively captured by the basic PCFG
model. Smoothing the data apparently weakens
this signal, hence causing a drop in performance.
For Poetry, PCFG-I achieves much higher accuracy than the baselines. This is impressive given
the much looser syntactic structure of poetry compared to news articles, and it indicates the value of
syntactic information for distinguishing between
literary authors.
Finally, we consider the specific contribution of
the PCFG-I model towards the performance of
is the best performing baseline model. While discussing the performance of the PCFG model and
its variants, we consider the best performing baseline model.
We observe that the basic PCFG model and the
PCFG-I model do not usually outperform the best
baseline method (except for Football and Poetry,
as discussed below). For Football, the basic PCFG
model outperforms the best baseline, while for
Poetry, the PCFG-I model outperforms the best
baseline. Further, the performance of the basic
PCFG model is inferior to that of PCFG-I for most
datasets, likely due to the insufficient training data
used in the basic model. Ideally one would use
more training documents, but in many domains
it is impossible to obtain a large corpus of documents written by a single author. For example,
as Luyckx and Daelemans (2008) argue, in forensics one would like to identify the authorship of
documents based on a limited number of documents written by the author. Hence, we investigated smoothing techniques to improve the performance of the basic PCFG model. We found that
the interpolation approach resulted in a substantial improvement in the performance of the PCFG
model for all but the Football dataset (discussed
below). However, for some datasets, even this
improvement was not sufficient to outperform the
best baseline.
The results for PCFG and PCFG-I demonstrate that syntactic information alone is generally a bit less accurate than using n-grams. In order to utilize both syntactic and lexical information, we developed PCFG-E as described above.
We combined the best n-gram model (Bigram-I)
and PCFG model (PCFG-I) with MaxEnt to build
PCFG-E. For the Travel dataset, we find that the
performance of the PCFG-E model is equal to that
of the best constituent model (Bigram-I). For the
remaining datasets, the performance of PCFG-E
the PCFG-E ensemble. Based on comparing the
results for PCFG-E and MaxEnt+Bigram-I, we
find that there is a drop in performance for most
datasets when removing PCFG-I from the ensemble. This drop is quite substantial for the Football
and Poetry datasets. This indicates that PCFG-I
is contributing substantially to the performance of
PCFG-E. Thus, it further illustrates the importance of broader syntactic information for the task
of authorship attribution.
Conference on Machine Learning (ECML), pages
137–142, Berlin, Heidelberg. Springer-Verlag.
Patrick Juola and John Sofko. 2004. Proving and
Improving Authorship Attribution Technologies. In
Proceedings of Canadian Symposium for Text Analysis (CaSTA).
Dan Klein and Christopher D. Manning. 2003a. Accurate unlexicalized parsing. In Proceedings of the
41st Annual Meeting on Association for Computational Linguistics (ACL), pages 423–430, Morristown, NJ, USA. Association for Computational Linguistics.
Future Work and Conclusions
Dan Klein and Christopher D. Manning. 2003b. Fast
Exact Inference with a Factored Model for Natural
Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS), pages 3–10.
MIT Press.
In this paper, we have presented our ongoing work
on authorship attribution, describing a novel approach that uses probabilistic context-free grammars. We have demonstrated that both syntactic and lexical information are useful in effectively capturing authors’ overall writing style. To
this end, we have developed an ensemble approach that performs better than the baseline models on several datasets. An interesting extension
of our current approach is to consider discriminative training of PCFGs for each author. Finally,
we would like to compare the performance of our
method to other state-of-the-art approaches to authorship prediction.
Kim Luyckx and Walter Daelemans. 2008. Authorship Attribution and Verification with Many Authors
and Limited Data. In Proceedings of the 22nd International Conference on Computational Linguistics
(COLING), pages 513–520, August.
Andrew Kachites McCallum. 2002.
MALLET: A Machine Learning for Language Toolkit.
Frederick Mosteller and David L. Wallace. 1984. Applied Bayesian and Classical Inference: The Case of
the Federalist Papers. Springer-Verlag.
Andrew Y. Ng and Michael I. Jordan. 2001. On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14
(NIPS), pages 841–848.
Experiments were run on the Mastodon Cluster,
provided by NSF Grant EIA-0303609.
Fuchun Peng, Dale Schuurmans, Vlado Keselj, and
Shaojun Wang. 2003. Language Independent
Authorship Attribution using Character Level Language Models. In Proceedings of the 10th Conference of the European Chapter of the Association for
Computational Linguistics (EACL).
H. Baayen, H. van Halteren, and F. Tweedie. 1996.
Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Literary and
Linguistic Computing, 11(3):121–132, September.
Bernard Rosner. 2005. Fundamentals of Biostatistics.
Duxbury Press.
Binongo and Smith. 1999. A Study of Oscar Wilde’s
Writings. Journal of Applied Statistics, 26:781.
E. Stamatatos, N. Fakotakis, and G. Kokkinakis. 1999.
Automatic Authorship Attribution. In Proceedings
of the 9th Conference of the European Chapter of the
Association for Computational Linguistics (EACL),
pages 158–164, Morristown, NJ, USA. Association
for Computational Linguistics.
J Burrows. 1987. Word-patterns and Story-shapes:
The Statistical Analysis of Narrative Style.
Joachim Diederich, Jörg Kindermann, Edda Leopold,
and Gerhard Paass. 2000. Authorship Attribution with Support Vector Machines. Applied Intelligence, 19:2003.
E. Stamatatos. 2009. A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology,
D. I. Holmes and R. S. Forsyth. 1995. The Federalist Revisited: New Directions in Authorship Attribution. Literary and Linguistic Computing, 10:111–
Rong Zheng, Yi Qin, Zan Huang, and Hsinchun
Chen. 2009. Authorship Analysis in Cybercrime
Investigation. Lecture Notes in Computer Science,
Thorsten Joachims. 1998. Text categorization with
Support Vector Machines: Learning with many relevant features. In Proceedings of the 10th European
The impact of interpretation problems on tutorial dialogue
Myroslava O. Dzikovska and Johanna D. Moore
School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
Natalie Steinhauser and Gwendolyn Campbell
Naval Air Warfare Center Training Systems Division, Orlando, FL, USA
et al., 2004). However, limiting the range of possible input limits the contentful talk that the students are expected to produce, and therefore may
limit the overall effectiveness of the system.
Most of the existing tutoring systems that accept
unrestricted language input use classifiers based
on statistical text similarity measures to match
student answers to open-ended questions with
pre-authored anticipated answers (Graesser et al.,
1999; Jordan et al., 2004; McCarthy et al., 2008).
While such systems are robust to unexpected terminology, they provide only a very coarse-grained
assessment of student answers. Recent research
aims to develop methods that produce detailed
analyses of student input, including correct, incorrect and missing parts (Nielsen et al., 2008;
Dzikovska et al., 2008), because the more detailed
assessments can help tailor tutoring to the needs of
individual students.
While the detailed assessments of answers to
open-ended questions are intended to improve potential learning, they also increase the probability of misunderstandings, which negatively impact
tutoring and therefore negatively impact student
learning (Jordan et al., 2009). Thus, appropriate error recovery strategies are crucially important for tutorial dialogue applications. We describe
an evaluation of an implemented tutorial dialogue
system which aims to accept unrestricted student
input and limit misunderstandings by rejecting low
confidence interpretations and employing a range
of error recovery strategies depending on the cause
of interpretation failure.
By comparing two different system policies, we
demonstrate that with less restricted language input the rate of non-understanding errors impacts
both learning gain and user satisfaction, and that
problems arising from incorrect use of terminology have a particularly negative impact. A more
detailed analysis of the results indicates that, even
though we based our policy on an approach ef-
Supporting natural language input may
improve learning in intelligent tutoring
systems. However, interpretation errors
are unavoidable and require an effective
recovery policy. We describe an evaluation
of an error recovery policy in the BEETLE II tutorial dialogue system and discuss how different types of interpretation
problems affect learning gain and user satisfaction. In particular, the problems arising from student use of non-standard terminology appear to have negative consequences. We argue that existing strategies
for dealing with terminology problems are
insufficient and that improving such strategies is important in future ITS research.
There is a mounting body of evidence that student
self-explanation and contentful talk in human-human tutorial dialogue are correlated with increased learning gain (Chi et al., 1994; Purandare
and Litman, 2008; Litman et al., 2009). Thus,
computer tutors that understand student explanations have the potential to improve student learning (Graesser et al., 1999; Jordan et al., 2006;
Aleven et al., 2001; Dzikovska et al., 2008). However, understanding and correctly assessing the
student’s contributions is a difficult problem due
to the wide range of variation observed in student
input, and especially due to students’ sometimes
vague and incorrect use of domain terminology.
Many tutorial dialogue systems limit the range
of student input by asking short-answer questions.
This provides a measure of robustness, and previous evaluations of ASR in spoken tutorial dialogue
systems indicate that neither word error rate nor
concept error rate in such systems affect learning
gain (Litman and Forbes-Riley, 2005; Pon-Barry
Proceedings of the ACL 2010 Conference Short Papers, pages 43–48,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
word sense assigned to an ambiguous word, or an
incorrectly resolved referential expression.
Our approach to selecting an error recovery policy is to prefer non-understandings to misunderstandings. There is a known trade-off in spoken dialogue systems between allowing misunderstandings, i.e., cases in which a system accepts and
acts on an incorrect interpretation of an utterance,
and non-understandings, i.e., cases in which a system rejects an utterance as uninterpretable (Bohus and Rudnicky, 2005). Since misunderstandings on the part of a computer tutor are known
to negatively impact student learning, and since
in human-human tutorial dialogue the majority of
student responses using unexpected terminology
are classified as incorrect (Jordan et al., 2009),
it would be a reasonable approach for a tutorial
dialogue system to deal with potential interpretation problems by treating low-confidence interpretations as non-understandings and focusing on an
effective non-understanding recovery policy.¹
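The preference for non-understandings over misunderstandings can be sketched as a simple gate on each candidate interpretation. This is only an illustration: the paper says the system combines a parse quality score with a set of consistency checks, but the field names, the score range and the threshold value below are assumptions, not details of BEETLE II.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interpretation:
    # Hypothetical fields; the paper does not describe the actual data structures.
    parse_quality: float   # assumed to be a score in [0, 1] assigned by the parser
    checks_passed: bool    # assumed outcome of the consistency checks


def classify(interp: Optional[Interpretation], threshold: float = 0.5) -> str:
    """Treat any low-confidence interpretation as a non-understanding.

    The threshold of 0.5 is an arbitrary placeholder.
    """
    if interp is None:
        # No parse (or no usable fragmentary parse) was produced.
        return "non_understanding"
    if interp.parse_quality < threshold or not interp.checks_passed:
        # Reject rather than risk acting on a wrong interpretation.
        return "non_understanding"
    return "accept"
```

Raising the threshold trades misunderstandings for more non-understandings, which is the direction the policy described above prefers.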
We implemented two different policies for comparison. Our baseline policy does not attempt any
remediation or error recovery. All student utterances are passed through the standard interpretation pipeline, so that the results can be analyzed
later. However, the system does not attempt to address the student content. Instead, regardless of
the answer analysis, the system always uses a neutral acceptance and bottom out strategy, giving the
student the correct answer every time, e.g., “OK.
One way to phrase the correct answer is: the open
switch creates a gap in the circuit”. Thus, the students are never given any indication of whether
they have been understood or not.
The full policy acts differently depending on the
analysis of the student answer. For correct answers, it acknowledges the answer as correct and
optionally restates it (see (Dzikovska et al., 2008)
for details). For incorrect answers, it restates the
correct portion of the answer (if any) and provides
a hint to guide the student towards the completely
correct answer. If the student’s utterance cannot be
interpreted, the system responds with a help message indicating the cause of the problem together
with a hint. In both cases, after 3 unsuccessful attempts to address the problem the system uses the
bottom out strategy and gives away the answer.
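The two policies can be summarized as a small decision function. The sketch below is a reading of the description above rather than the system's implementation: the analysis labels, the move names and the way attempts are counted (the current failed attempt is included in the count) are assumptions made for illustration.

```python
MAX_ATTEMPTS = 3  # "after 3 unsuccessful attempts" in the text


def baseline_policy_move(analysis: str, failed_attempts: int) -> str:
    """Baseline: ignore the analysis and always give the answer away neutrally."""
    return "neutral_accept_and_bottom_out"


def full_policy_move(analysis: str, failed_attempts: int) -> str:
    """Full policy: choose a tutorial move from the answer analysis.

    `analysis` is one of "correct", "incorrect", "uninterpretable"
    (hypothetical labels). `failed_attempts` counts unsuccessful attempts
    at the current question, including the current one.
    """
    if analysis == "correct":
        return "accept_and_optionally_restate"
    if analysis not in ("incorrect", "uninterpretable"):
        raise ValueError(f"unknown analysis: {analysis!r}")
    if failed_attempts >= MAX_ATTEMPTS:
        # Give away the answer, flagging that it was wrong or not understood.
        return "bottom_out"
    if analysis == "incorrect":
        return "restate_correct_part_and_hint"
    return "targeted_help_and_hint"
```

For example, `full_policy_move("uninterpretable", 1)` yields a targeted help message with a hint, while the same analysis on the third failed attempt yields the bottom out move.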
fective in task-oriented dialogue (Hockey et al.,
2003), many of our strategies were not successful in improving learning gain. At the same time,
students appear to be aware that the system does
not fully understand them even if it accepts their
input without indicating that it is having interpretation problems, and this is reflected in decreased
user satisfaction. We argue that this indicates that
we need better strategies for dealing with terminology problems, and that accepting non-standard
terminology without explicitly addressing the difference in acceptable phrasing may not be sufficient for effective tutoring.
In Section 2 we describe our tutoring system,
and the two tutoring policies implemented for the
experiment. In Section 3 we present experimental results and an analysis of correlations between
different types of interpretation problems, learning
gain and user satisfaction. Finally, in Section 4 we
discuss the implications of our results for error recovery policies in tutorial dialogue systems.
Tutorial Dialogue System and Error
Recovery Policies
This work is based on evaluation of BEETLE II
(Dzikovska et al., 2010), a tutorial dialogue system which provides tutoring in basic electricity
and electronics. Students read pre-authored materials, experiment with a circuit simulator, and then
are asked to explain their observations. BEETLE II
uses a deep parser together with a domain-specific
diagnoser to process student input, and a deep generator to produce tutorial feedback automatically
depending on the current tutorial policy. It also
implements an error recovery policy to deal with
interpretation problems.
Students currently communicate with the system via a typed chat interface. While typing
removes the uncertainty and errors involved in
speech recognition, expected student answers are
considerably more complex and varied than in
a typical spoken dialogue system. Therefore, a
significant number of interpretation errors arise,
primarily during the semantic interpretation process. These errors can lead to non-understandings,
when the system cannot produce a syntactic parse
(or a reasonable fragmentary parse), or when it
does not know how to interpret an out-of-domain
word; and misunderstandings, where a system arrives at an incorrect interpretation, due to either
an incorrect attachment in the parse, an incorrect
While there is no confidence score from a speech recognizer, our system uses a combination of a parse quality score
assigned by the parser and a set of consistency checks to determine whether an interpretation is sufficiently reliable.
they were frustrated when the system said that it
did not understand them. However, some students
in BASE also mentioned that they sometimes were
not sure if the system’s answer was correcting a
problem with their answer, or simply phrasing it
in a different way.
We used mean frequency of non-interpretable
utterances (out of all student utterances in
each session) to evaluate the effectiveness of
the two different policies. On average, 14%
of utterances in both conditions resulted in
non-understandings.² The frequency of non-understandings was negatively correlated with
learning gain in FULL: r = −0.47, p < 0.005,
but not significantly correlated with learning gain
in BASE: r = −0.09, p = 0.59. However, in both
conditions the frequency of non-understandings
was negatively correlated with user satisfaction:
FULL r = −0.36, p = 0.03, BASE r = −0.4, p =
0.01. Thus, even though in BASE the system
did not indicate non-understanding, students were
negatively affected. That is, they were not satisfied with the policy that did not directly address
the interpretation problems. We discuss possible
reasons for this below.
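The measure used here, the per-session frequency of non-interpretable utterances and its correlation with an outcome score, can be sketched as follows. The paper reports r values but does not say how they were computed; the sketch assumes a standard Pearson correlation, and the function and argument names are illustrative. It does not compute p-values.

```python
import math
from typing import Sequence


def non_understanding_rate(rejected: Sequence[bool]) -> float:
    """Fraction of a session's utterances that the system could not interpret."""
    return sum(rejected) / len(rejected)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

In use, one rate would be computed per subject and then correlated, separately for each condition, with that subject's learning gain or averaged tutor score.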
We investigated the effect of different types of
interpretation errors using two criteria. First, we
checked whether the mean frequency of errors was
reduced between BASE and FULL for each individual strategy. The reduced frequency means that
the recovery strategy for this particular error type
is effective in reducing the error frequency. Second, we looked for the cases where the frequency
of a given error type is negatively correlated with
either learning gain or user satisfaction. This
provides evidence that such errors are negatively
impacting the learning process, and therefore improving recovery strategies for those error types is
likely to improve overall system effectiveness.
The results, shown in Table 1, indicate that the
majority of interpretation problems are not significantly correlated with learning gain. However, several types of problems appear to be
particularly significant, and are all related to
improper use of domain terminology. These
were irrelevant answer, no appr terms, selectional restriction failure and program error.
An irrelevant answer error occurs when the student makes a statement that uses domain termi-
The content of the bottom out is the same as in
the baseline, except that the full system indicates
clearly that the answer was incorrect or was not
understood, e.g., “Not quite. Here is the answer:
the open switch creates a gap in the circuit”.
The help messages are based on the TargetedHelp approach successfully used in spoken dialogue (Hockey et al., 2003), together with the error
classification we developed for tutorial dialogue
(Dzikovska et al., 2009). There are 9 different error types, each associated with a different targeted
help message. The goal of the help messages is to
give the student as much information as possible
as to why the system failed to understand them but
without giving away the answer.
In comparing the two policies, we would expect
that the students in both conditions would learn
something, but that the learning gain and user satisfaction would be affected by the difference in
policies. We hypothesized that students who receive feedback on their errors in the full condition
would learn more compared to those in the baseline condition.
We collected data from 76 subjects interacting
with the system. The subjects were randomly assigned to either the baseline (BASE) or the full
(FULL) policy condition. Each subject took a pretest, then worked through a lesson with the system,
and then took a post-test and filled in a user satisfaction survey. Each session lasted approximately
4 hours, with 232 student language turns in FULL
(SD = 25.6) and 156 in BASE (SD = 2.02). Additional time was taken by reading and interacting with the simulation environment. The students
had little prior knowledge of the domain. The survey consisted of 63 questions on the 5-point Likert scale covering the lesson content, the graphical
user interface, and tutor’s understanding and feedback. For purposes of this study, we are using an
averaged tutor score.
The average learning gain was 0.57 (SD =
0.23) in FULL, and 0.63 (SD = 0.26) in BASE.
There was no significant difference in learning
gain between conditions. Students liked BASE better: the average tutor evaluation score for FULL
was 2.56 out of 5 (SD = 0.65), compared to 3.32
(SD = 0.65) in BASE. These results are significantly different (t-test, p < 0.05). In informal
comments after the session many students said that
We do not know the percentage of misunderstandings or
concept error rate as yet. We are currently annotating the data
with the goal to evaluate interpretation correctness.
error types: irrelevant answer, no appr terms, selectional restr failure, program error, unknown word, disambiguation failure, no parse, partial interpretation, reference failure
mean freq. (std. dev), first column group: 0.008 (0.01), 0.005 (0.01), 0.032 (0.02), 0.002 (0.003), 0.023 (0.01), 0.013 (0.01), 0.019 (0.01), 0.004 (0.004), 0.012 (0.02), 0.134 (0.05)
mean freq. (std. dev), second column group: 0.012 (0.01), 0.003 (0.01), 0.040 (0.03), 0.003 (0.003), 0.024 (0.02), 0.007 (0.01), 0.004 (0.005), 0.017 (0.01), 0.139 (0.04)
satisfaction r: [values missing]
Table 1: Correlations between frequency of different error types and student learning gain and satisfaction. ** - correlation is significant with p < 0.05, * - with p <= 0.1.
nology but does not appear to answer the system’s
question directly. For example, the expected answer to “In circuit 1, which components are in a
closed path?” is “the bulb”. Some students misread the question and say “Circuit 1 is closed.” If
that happens, in FULL the system says “Sorry, this
isn’t the form of answer that I expected. I am looking for a component”, pointing out to the student
the kind of information it is looking for. The BASE
system for this error, and for all other errors discussed below, gives away the correct answer without indicating that there was a problem with interpreting the student’s utterance, e.g., “OK, the
correct answer is the bulb.”
The no appr terms error happens when the student is using terminology inappropriate for the lesson in general. Students are expected to learn to
explain everything in terms of connections and terminal states. For example, the expected answer to
“What is voltage?” is “the difference in states between two terminals”. If instead the student says
“Voltage is electricity”, FULL responds with “I am
sorry, I am having trouble understanding. I see no
domain concepts in your answer. Here’s a hint:
your answer should mention a terminal.” The motivation behind this strategy is that in general, it is
very difficult to reason about vaguely used domain
terminology. We had hoped that by telling the student that the content of their utterance is outside
the domain as understood by the system, and hinting at the correct terms to use, the system would
guide students towards a better answer.
Selectional restr failure errors are typically due
to incorrect terminology, when the students
phrased answers in a way that contradicted the system’s domain knowledge. For example, the system can reason about damaged bulbs and batteries, and open and closed paths. So if the student says “The path is damaged”, the FULL system would respond with “I am sorry, I am having
trouble understanding. Paths cannot be damaged.
Only bulbs and batteries can be damaged.”
Program errors were caused by faults in the underlying network software, but usually occurred
when the student was using extremely long and
complicated utterances.
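The targeted help messages quoted above can be viewed as templates keyed by error type. The sketch below reuses the wording of the three quoted messages; the underscore spellings of the error type names, the slot names and the `help_message` helper are hypothetical, and the paper does not say that BEETLE II generates its messages this way (it describes a deep generator).

```python
# Templates built from the example messages quoted in the text.
# Slot names such as {expected} and {hint} are illustrative assumptions.
HELP_TEMPLATES = {
    "irrelevant_answer": (
        "Sorry, this isn't the form of answer that I expected. "
        "I am looking for {expected}."
    ),
    "no_appr_terms": (
        "I am sorry, I am having trouble understanding. "
        "I see no domain concepts in your answer. "
        "Here's a hint: your answer should mention {hint}."
    ),
    "selectional_restr_failure": (
        "I am sorry, I am having trouble understanding. "
        "{subject} cannot be {predicate}. "
        "Only {allowed} can be {predicate}."
    ),
}


def help_message(error_type: str, **slots: str) -> str:
    """Fill the template for one error type; raises KeyError if it is unknown."""
    return HELP_TEMPLATES[error_type].format(**slots)
```

For instance, `help_message("no_appr_terms", hint="a terminal")` reproduces the message given for "Voltage is electricity". Only three of the nine error types are covered because the text quotes messages for only those three.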
Out of the four important error types described
above, only the strategy for irrelevant answer was
effective: the frequency of irrelevant answer errors is significantly higher in BASE (t-test, p <
0.05), and it is negatively correlated with learning
gain in BASE. The frequencies of other error types
did not significantly differ between conditions.
However, one other finding is particularly interesting: the frequency of no appr terms errors
is negatively correlated with user satisfaction in
BASE. This indicates that simply accepting the student’s answer when they are using incorrect terminology and exposing them to the correct answer is
not the best strategy, possibly because the students
are noticing the unexplained lack of alignment between their utterance and the system’s answer.
Discussion and Future Work
As discussed in Section 1, previous studies of
short-answer tutorial dialogue systems produced a
counter-intuitive result: measures of interpretation
accuracy were not correlated with learning gain.
With less restricted language, misunderstandings
negatively affected learning. Our study provides
further evidence that interpretation quality significantly affects learning gain in tutorial dialogue.
Moreover, while it has long been known that user
satisfaction is negatively correlated with interpretation error rates in spoken dialogue, this is the
first attempt to evaluate the impact of different
types of interpretation errors on task success and
usability of a tutoring system.
conceivably be accepted by a system using semantic similarity as a metric (e.g., using LSA with preauthored answers). However, our results also indicate that simply accepting the incorrect terminology may not be the best strategy. Users appear to
be sensitive when the system’s language does not
align with their terminology, as reflected in the decreased satisfaction ratings associated with higher
rates of incorrect terminology problems in BASE.
Moreover, prior analysis of human-human data
indicates that tutors use different restate strategies depending on the “quality” of the student answers, even if they are accepting them as correct
(Dzikovska et al., 2008). Together, these point at
an important unaddressed issue: existing systems
are often built on the assumption that only incorrect and missing parts of the student answer should
be remediated, and a wide range of terminology
should be accepted (Graesser et al., 1999; Jordan
et al., 2006). While it is obviously important for
the system to accept a range of different phrasings,
our analysis indicates that this may not be sufficient by itself, and students could potentially benefit from addressing the terminology issues with a
specifically devised strategy.
Finally, it could also be possible that some
differences between strategy effectiveness were
caused by incorrect error type classification. Manual examination of several dialogues suggests that
most of the errors are assigned to the appropriate type, though in some cases incorrect syntactic parses resulted in unexpected interpretation errors, causing the system to give a confusing help
message. These misclassifications appear to be
evenly split between different error types, though
a more formal evaluation is planned in the future. However, from our initial examination, we
believe that the differences in strategy effectiveness that we observed are due to the actual differences in the help messages. Therefore, designing
better prompts would be the key factor in improving learning and user satisfaction.
Our results demonstrate that different types of
errors may matter to a different degree. In our
system, all of the error types negatively correlated
with learning gain stem from the same underlying
problem: the use of incorrect or vague terminology by the student. With the exception of the irrelevant answer strategy, the targeted help strategies we implemented were not effective in reducing error frequency or improving learning gain.
Additional research is needed to understand why.
One possibility is that irrelevant answer was easier to remediate compared to other error types. It
usually happened in situations where there was a
clear expectation of the answer type (e.g., a list of
component names, a yes/no answer). Therefore,
it was easier to design an effective prompt. Help
messages for other error types were more frequent
when the expected answer was a complex sentence, and multiple possible ways of phrasing the
correct answer were acceptable. Therefore, it was
more difficult to formulate a prompt that would
clearly describe the problem in all contexts.
One way to improve the help messages may be
to have the system indicate more clearly when user
terminology is a problem. Our system apologized
each time there was a non-understanding, leading
students to believe that they may be answering correctly but the answer is not being understood. A
different approach would be to say something like
“I am sorry, you are not using the correct terminology in your answer. Here’s a hint: your answer
should mention a terminal”. Together with an appropriate mechanism to detect paraphrases of correct answers (as opposed to vague answers whose
correctness is difficult to determine), this approach
could be more beneficial in helping students learn.
We are considering implementing and evaluating
this as part of our future work.
This work has been supported in part by US Office
of Naval Research grants N000140810043 and
N0001410WX20278. We thank Katherine Harrison, Leanne Taylor, Charles Callaway, and Elaine
Farrow for help with setting up the system and
running the evaluation. We would like to thank
anonymous reviewers for their detailed feedback.
Some of the errors, in particular instances of
no appr terms and selectional restr failure, also
stemmed from unrecognized paraphrases with
non-standard terminology. Those answers could
Pamela Jordan, Diane Litman, Michael Lipschultz, and
Joanna Drummond. 2009. Evidence of misunderstandings in tutorial dialogue and their impact on
learning. In Proceedings of the 14th International
Conference on Artificial Intelligence in Education
(AIED), Brighton, UK, July.
V. Aleven, O. Popescu, and K. R. Koedinger. 2001.
Towards tutorial dialog to support self-explanation:
Adding natural language understanding to a cognitive tutor. In Proceedings of the 10th International
Conference on Artificial Intelligence in Education
(AIED ’01).
Diane Litman and Kate Forbes-Riley. 2005. Speech
recognition performance and learning in spoken dialogue tutoring. In Proceedings of EUROSPEECH-2005, page 1427.
Dan Bohus and Alexander Rudnicky. 2005. Sorry,
I didn’t catch that! - An investigation of non-understanding errors and recovery strategies. In
Proceedings of SIGdial-2005, Lisbon, Portugal.
Michelene T. H. Chi, Nicholas de Leeuw, Mei-Hung
Chiu, and Christian LaVancher. 1994. Eliciting
self-explanations improves understanding. Cognitive Science, 18(3):439–477.
Diane Litman, Johanna Moore, Myroslava Dzikovska,
and Elaine Farrow. 2009. Generalizing tutorial dialogue results. In Proceedings of 14th International
Conference on Artificial Intelligence in Education
(AIED), Brighton, UK, July.
Myroslava O. Dzikovska, Gwendolyn E. Campbell,
Charles B. Callaway, Natalie B. Steinhauser, Elaine
Farrow, Johanna D. Moore, Leslie A. Butler, and
Colin Matheson. 2008. Diagnosing natural language answers to support adaptive tutoring. In
Proceedings 21st International FLAIRS Conference,
Coconut Grove, Florida, May.
Philip M. McCarthy, Vasile Rus, Scott Crossley,
Arthur C. Graesser, and Danielle S. McNamara.
2008. Assessing forward-, reverse-, and average-entailment indices on natural language input from
the intelligent tutoring system, iSTART. In Proceedings of the 21st International FLAIRS conference,
pages 165–170.
Myroslava O. Dzikovska, Charles B. Callaway, Elaine
Farrow, Johanna D. Moore, Natalie B. Steinhauser,
and Gwendolyn C. Campbell. 2009. Dealing with
interpretation errors in tutorial dialogue. In Proceedings of SIGDIAL-09, London, UK, Sep.
Rodney D. Nielsen, Wayne Ward, and James H. Martin. 2008. Learning to assess low-level conceptual
understanding. In Proceedings 21st International
FLAIRS Conference, Coconut Grove, Florida, May.
Heather Pon-Barry, Brady Clark, Elizabeth Owen
Bratt, Karl Schultz, and Stanley Peters. 2004. Evaluating the effectiveness of SCoT: A spoken conversational tutor. In J. Mostow and P. Tedesco, editors,
Proceedings of the ITS 2004 Workshop on Dialog-based Intelligent Tutoring Systems, pages 23–32.
Myroslava O. Dzikovska, Johanna D. Moore, Natalie
Steinhauser, Gwendolyn Campbell, Elaine Farrow,
and Charles B. Callaway. 2010. Beetle II: a system for tutoring and computational linguistics experimentation. In Proceedings of ACL-2010 demo
Amruta Purandare and Diane Litman. 2008. Content-learning correlations in spoken tutoring dialogs at
word, turn and discourse levels. In Proceedings 21st
International FLAIRS Conference, Coconut Grove,
Florida, May.
A. C. Graesser, P. Wiemer-Hastings, P. Wiemer-Hastings, and R. Kreuz. 1999. Autotutor: A simulation of a human tutor. Cognitive Systems Research,
Beth Ann Hockey, Oliver Lemon, Ellen Campana,
Laura Hiatt, Gregory Aist, James Hieronymus,
Alexander Gruenstein, and John Dowding. 2003.
Targeted help for spoken dialogue systems: intelligent feedback improves naive users’ performance.
In Proceedings of the tenth conference on European
chapter of the Association for Computational Linguistics, pages 147–154, Morristown, NJ, USA.
Pamela W. Jordan, Maxim Makatchev, and Kurt VanLehn. 2004. Combining competing language understanding approaches in an intelligent tutoring system. In James C. Lester, Rosa Maria Vicari, and
Fábio Paraguaçu, editors, Intelligent Tutoring Systems, volume 3220 of Lecture Notes in Computer
Science, pages 346–357. Springer.
Pamela Jordan, Maxim Makatchev, Umarani Pappuswamy, Kurt VanLehn, and Patricia Albacete.
2006. A natural language tutorial dialogue system
for physics. In Proceedings of the 19th International
FLAIRS conference.
The Prevalence of Descriptive Referring Expressions
in News and Narrative
Raquel Hervás
Departamento de Ingeniería
del Software e Inteligencia Artificial
Universidad Complutense de Madrid
Madrid, 28040 Spain
Mark Alan Finlayson
Computer Science and
Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA, 02139 USA
tify their intended referent. Referring expressions, however, may be more than distinctive. It
is widely acknowledged that they can be used to
achieve multiple goals, above and beyond distinction. Here we focus on descriptive referring expressions, that is, referring expressions that are not
only distinctive, but provide additional information not required for identifying their intended referent. Consider the following text, in which some
of the referring expressions have been underlined:
Generating referring expressions is a key
step in Natural Language Generation. Researchers have focused almost exclusively
on generating distinctive referring expressions, that is, referring expressions that
uniquely identify their intended referent.
While undoubtedly one of their most important functions, referring expressions
can be more than distinctive. In particular,
descriptive referring expressions – those
that provide additional information not required for distinction – are critical to fluent, efficient, well-written text. We present
a corpus analysis in which approximately
one-fifth of 7,207 referring expressions in
24,422 words of news and narrative are descriptive. These data show that if we are
ever to fully master natural language generation, especially for the genres of news
and narrative, researchers will need to devote more attention to understanding how
to generate descriptive, and not just distinctive, referring expressions.
Once upon a time there was a man, who had
three daughters. They lived in a house and
their dresses were made of fabric.
While a bit strange, the text is perfectly well-formed. All the referring expressions are distinctive, in that we can properly identify the referents
of each expression. But the real text, the opening
lines to the folktale The Beauty and the Beast, is
actually much more lyrical:
Once upon a time there was a rich merchant,
who had three daughters. They lived in a
very fine house and their gowns were made
of the richest fabric sewn with jewels.
A Distinctive Focus
All the boldfaced portions – namely, the choice
of head nouns, the addition of adjectives, the use
of appositive phrases – serve to perform a descriptive function, and, importantly, are all unnecessary for distinction! In all of these cases, the author is using the referring expressions as a vehicle for communicating information about the referents. This descriptive information is sometimes
new, sometimes necessary for understanding the
text, and sometimes just for added flavor. But
when the expression is descriptive, as opposed to
distinctive, this additional information is not required for identifying the referent of the expression, and it is these sorts of referring expressions
that we will be concerned with here.
Generating referring expressions is a key step in
Natural Language Generation (NLG). From early
treatments in seminal papers by Appelt (1985)
and Reiter and Dale (1992) to the recent set
of Referring Expression Generation (REG) Challenges (Gatt et al., 2009) through different corpora
available for the community (Eugenio et al., 1998;
van Deemter et al., 2006; Viethen and Dale, 2008),
generating referring expressions has become one
of the most studied areas of NLG.
Researchers studying this area have, almost
without exception, focused exclusively on how
to generate distinctive referring expressions, that
is, referring expressions that unambiguously iden49
Proceedings of the ACL 2010 Conference Short Papers, pages 49–54,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
museum labels, Cheng et al. (2001) noted that descriptive information is often integrated into referring expressions using modifiers to the head noun.
To study this, and to allow our results to be more
closely compared with Cheng’s, we had our annotators split referring expressions into their constituents, portions called either nuclei or modifiers.
The nuclei were the portions of the referring expression that performed the ‘core’ referring function; the modifiers were those portions that could
be varied, syntactically speaking, independently of
the nuclei. Annotators then assigned a distinctive
or descriptive function to each constituent, rather
than the referring expression as a whole.
Normally, the nuclei corresponded to the head
of the noun phrase. In (1), the nucleus is the token
king, which we have here surrounded with square
brackets. The modifiers, surrounded by parentheses, are The and old.
Although these sorts of referring expression
have been mostly ignored by researchers in this
area¹, we show in this corpus study that descriptive expressions are in fact quite prevalent: nearly
one-fifth of referring expressions in news and narrative are descriptive. In particular, our data,
the trained judgments of native English speakers,
show that 18% of all distinctive referring expressions in news and 17% of those in narrative folktales are descriptive. With this as motivation, we
argue that descriptive referring expressions must
be studied more carefully, especially as the field
progresses from referring in a physical, immediate context (like that in the REG Challenges) to
generating more literary forms of text.
Corpus Annotation
This is a corpus study; our procedure was therefore to define our annotation guidelines (Section 2.1), select texts to annotate (2.2), create an
annotation tool for our annotators (2.3), and, finally, train annotators, have them annotate referring expressions’ constituents and function, and
then adjudicate the double-annotated texts into a
gold standard (2.4).
(1) (The) (old) [king] was wise.
Phrasal modifiers were marked as single modifiers, for example, in (2).
(2) (The) [roof] (of the house) collapsed.
It is significant that we had our annotators mark
and tag the nuclei of referring expressions. Cheng
and colleagues only mentioned the possibility that
additional information could be introduced in the
modifiers. However, O’Donnell et al. (1998) observed that often the choice of head noun can also
influence the function of a referring expression.
Consider (3), in which the word villain is used to
refer to the King.
2.1 Definitions
We wrote an annotation guide explaining the difference between distinctive and descriptive referring expressions. We used the guide when training annotators, and it was available to them while
annotating. With limited space here we can only
give an outline of what is contained in the guide;
for full details see (Finlayson and Hervás, 2010a).
Referring Expressions We defined referring
expressions as referential noun phrases and their
coreferential expressions, e.g., “John kissed Mary.
She blushed.”. This included referring expressions
to generics (e.g., “Lions are fierce”), dates, times,
and numbers, as well as events if they were referred to using a noun phrase. We included in each
referring expression all the determiners, quantifiers, adjectives, appositives, and prepositional
phrases that syntactically attached to that expression. When referring expressions were nested, all
the nested referring expressions were also marked
Nuclei vs. Modifiers In the only previous corpus study of descriptive referring expressions, on
(3) The King assumed the throne today.
I don’t trust (that) [villain] one bit.
The speaker could have merely used him to refer to the King–the choice of that particular head
noun villain gives us additional information about
the disposition of the speaker. Thus villain is descriptive.
Function: Distinctive vs. Descriptive As already noted, instead of tagging the whole referring expression, annotators tagged each constituent (nuclei and modifiers) as distinctive or descriptive.
The two main tests for determining descriptiveness were (a) if presence of the constituent was
unnecessary for identifying the referent, or (b) if
¹ With the exception of a small amount of work, discussed in Section 4.
program that, among other things, includes the
ability to annotate referring expressions and coreferential relationships. We added the ability to annotate nuclei, modifiers, and their functions by
writing a workbench “plugin” in Java that could
be installed in the application.
The Story Workbench is not yet available to the
public at large, being in a limited distribution beta
testing phase. The developers plan to release it as
free software within the next year. At that time,
we also plan to release our plugin as free, downloadable software.
the constituent was expressed using unusual or ostentatious word choice. If either was true, the constituent was considered descriptive; otherwise, it
was tagged as distinctive. In cases where the constituent was completely irrelevant to identifying
the referent, it was tagged as descriptive. For example, in the folktale The Princess and the Pea,
from which (1) was extracted, there is only one
king in the entire story. Thus, in that story, the
king is sufficient for identification, and therefore
the modifier old is descriptive. This points out the
importance of context in determining distinctiveness or descriptiveness; if there had been a roomful of kings, the tags on those modifiers would
have been reversed.
There is some question as to whether copular
predicates, such as the plumber in (4), are actually
referring expressions.
2.4 Annotation & Adjudication
The main task of the study was the annotation of
the constituents of each referring expression, as
well as the function (distinctive or descriptive) of
each constituent. The system generated a first pass
of constituent analysis, but did not mark functions.
We hired two native English annotators, neither of
whom had any linguistics background, who corrected these automatically-generated constituent
analyses, and tagged each constituent as descriptive or distinctive. Every text was annotated by
both annotators. Adjudication of the differences
was conducted by discussion between the two annotators; the second author moderated these discussions and settled irreconcilable disagreements.
We followed a “train-as-you-go” paradigm, where
there was no distinct training period, but rather
adjudication proceeded in step with annotation,
and annotators received feedback during those sessions.
We calculated two measures of inter-annotator
agreement: a kappa statistic and an f-measure,
shown in Table 1. All of our f-measures indicated
that annotators agreed almost perfectly on the location of referring expressions and their breakdown into constituents. These agreement calculations were performed on the annotators’ original
corrected texts.
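The two agreement measures mentioned above can be sketched as follows. The paper does not say which kappa variant was used, so this minimal sketch assumes Cohen's kappa for two annotators; the function names `cohen_kappa` and `f_measure` and the span representation are illustrative assumptions, not part of the study's tooling.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def f_measure(spans_a, spans_b):
    """F1 between two sets of spans, treating annotator A as the reference."""
    spans_a, spans_b = set(spans_a), set(spans_b)
    overlap = len(spans_a & spans_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(spans_b)
    recall = overlap / len(spans_a)
    return 2 * precision * recall / (precision + recall)
```

Here a span is assumed to be a hashable `(start, end)` token offset pair, and only exact matches count as agreement; a study could equally well score partial overlaps.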
All the kappa statistics were calculated for two
tags (nuclei vs. modifier for the constituents, and
distinctive vs. descriptive for the functions) over
both each token assigned to a nucleus or modifier
and each referring expression pair. Our kappas indicate moderate to good agreement, especially for
the folktales. These results are expected because
of the inherent subjectivity of language. During
the adjudication sessions it became clear that different people do not consider the same information
(4) John is the plumber.
Our annotators marked and tagged these constructions as normal referring expressions, but they
added an additional flag to identify them as copular predicates. We then excluded these constructions from our final analysis. Note that copular
predicates were treated differently from appositives: in appositives the predicate was included in
the referring expression, and in most cases (again,
depending on context) was marked descriptive
(e.g., John, the plumber, slept.).
2.2 Text Selection
Our corpus comprised 62 texts, all originally written in English, from two different genres, news
and folktales. We began with 30 folktales of different sizes, totaling 12,050 words. These texts
were used in a previous work on the influence of
dialogues on anaphora resolution algorithms (Aggarwal et al., 2009); they were assembled with an
eye toward including different styles, different authors, and different time periods. Following this,
we matched, approximately, the number of words
in the folktales by selecting 32 texts from the Wall
Street Journal section of the Penn Treebank (Marcus et al., 1993). These texts were selected at random from the first 200 texts in the corpus.
2.3 The Story Workbench
We used the Story Workbench application (Finlayson, 2008) to actually perform the annotation.
The Story Workbench is a semantic annotation
as obvious or descriptive for the same concepts,
and even the contexts deduced by each annotator
from the texts were sometimes substantially different.
Table 1: Inter-annotator agreement measures. Rows: Ref. Exp. (F1), Constituents (F1), Nuc./Mod. (κ), Const. Func. (κ), Ref. Exp. Func. (κ).
Table 3: Breakdown of Constituent Tags. Rows: Max. Nuc/Ref, Dist. Nuc., Desc. Nuc., Avg. Mod/Ref, Max. Mod/Ref, Dist. Mod., Desc. Mod.
are three. First is the general study of aggregation
in the process of referring expression generation.
Second and third are corpus studies by Cheng et al.
(2001) and Jordan (2000a) that bear on the prevalence of descriptive referring expressions.
The NLG subtask of aggregation can be used
to imbue referring expressions with a descriptive
function (Reiter and Dale, 2000, §5.3). There is a
specific kind of aggregation called embedding that
moves information from one clause to another inside the structure of a separate noun phrase. This
type of aggregation can be used to transform two
sentences such as “The princess lived in a castle.
She was pretty” into “The pretty princess lived in
a castle”. The adjective pretty, previously a copular predicate, becomes a descriptive modifier of
the reference to the princess, making the second
text more natural and fluent. This kind of aggregation is widely used by humans for making
the discourse more compact and efficient. In order to create NLG systems with this ability, we
must take into account the caveat, noted by Cheng
(1998), that any non-distinctive information in a
referring expression must not lead to confusion
about the distinctive function of the referring expression. This is by no means a trivial problem
– this sort of aggregation interferes with referring and coherence planning at both a local and
global level (Cheng and Mellish, 2000; Cheng et
al., 2001). It is clear, from the current state of the
art of NLG, that we have not yet obtained a deep
enough understanding of aggregation to enable us
to handle these interactions. More research on the
topic is needed.
Two previous corpus studies have looked at
the use of descriptive referring expressions. The
first showed explicitly that people craft descriptive referring expressions to accomplish different
Table 2 lists the primary results of the study. We
considered a referring expression descriptive if
any of its constituents were descriptive. Thus,
18% of the referring expressions in the corpus
added additional information beyond what was required to unambiguously identify their referent.
The results were similar in both genres.
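The rule stated above, that an expression counts as descriptive if any of its constituents is descriptive, can be written down directly. This is a minimal sketch; the tag strings and the function names `expression_function` and `descriptive_share` are assumptions made for illustration.

```python
def expression_function(constituent_tags):
    """Return 'descriptive' if any constituent is descriptive, else 'distinctive'."""
    return "descriptive" if "descriptive" in constituent_tags else "distinctive"


def descriptive_share(expressions):
    """Fraction of referring expressions with at least one descriptive constituent.

    `expressions` is a list of tag lists, one list per referring expression.
    """
    if not expressions:
        return 0.0
    hits = sum(expression_function(tags) == "descriptive" for tags in expressions)
    return hits / len(expressions)
```

Because a single descriptive constituent is enough, the expression-level share can only be equal to or higher than the share one would get by requiring every constituent to be descriptive.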
Table 2: Primary results. Rows: Ref. Exp., Dist. Ref. Exp., Desc. Ref. Exp., % Dist. Ref., % Desc. Ref.
Table 3 contains the percentages of descriptive
and distinctive tags broken down by constituent.
Like Cheng’s results, our analysis shows that descriptive referring expressions make up a significant fraction of all referring expressions. Although Cheng did not examine nuclei, our results
show that the use of descriptive nuclei is small but
not negligible.
4 Relation to the Field
Researchers working on generating referring expressions typically acknowledge that referring expressions can perform functions other than distinction. Despite this widespread acknowledgment,
researchers have, for the most part, explicitly ignored these functions. Exceptions to this trend
judicated into a gold-standard a corpus of 24,422
words. We marked all referring expressions,
coreferential relations, and referring expression
constituents, and tagged each constituent as having a descriptive or distinctive function. We wrote
an annotation guide and created software that allows the annotation of this information in free text.
The corpus and the guide are available on-line in a
permanent digital archive (Finlayson and Hervás,
2010a; Finlayson and Hervás, 2010b). The software will also be released in the same archive
when the Story Workbench annotation application
is released to the public. This corpus will be useful
for the automatic generation and analysis of both
descriptive and distinctive referring expressions.
Any kind of system intended to generate text as
humans do must take into account that identification is not the only function of referring expressions. Many analysis applications would benefit
from the automatic recognition of descriptive referring expressions.
Second, we demonstrated that descriptive referring expressions comprise a substantial fraction
(18%) of the referring expressions in news and
narrative. Along with museum descriptions, studied by Cheng, it seems that news and narrative are
genres where authors naturally use a large number of descriptive referring expressions. Given that
so little work has been done on descriptive referring expressions, this indicates that the field would
be well served by focusing more attention on this
goals. Jordan and colleagues (Jordan, 2000b; Jordan, 2000a) examined the use of referring expressions using the COCONUT corpus (Eugenio et
al., 1998). They tested how domain and discourse
goals can influence the content of non-pronominal
referring expressions in a dialogue context, checking whether or not a subject’s goals led them to include non-referring information in a referring expression. Their results are intriguing because they
point toward heretofore unexamined constraints,
utilities and expectations (possibly genre- or style-dependent) that may underlie the use of descriptive
information to perform different functions, and are
not yet captured by aggregation modules in particular or NLG systems in general.
In the other corpus study, which partially inspired this work, Cheng and colleagues analyzed
a set of museum descriptions, the GNOME corpus (Poesio, 2004), for the pragmatic functions of
referring expressions. They had three functions
in their study, in contrast to our two. Their first
function (marked by their uniq tag) was equivalent to our distinctive function. The other two
were specializations of our descriptive tag, where
they differentiated between additional information
that helped to understand the text (int), or additional information not necessary for understanding (attr). Despite their annotators seeming to
have trouble distinguishing between the latter two
tags, they did achieve good overall inter-annotator
agreement. They identified 1,863 modifiers to
referring expressions in their corpus, of which
47.3% fulfilled a descriptive (attr or int) function. This is supportive of our main assertion,
namely, that descriptive referring expressions, not
only crucial for efficient and fluent text, are actually a significant phenomenon. It is interesting, though, that Cheng’s fraction of descriptive
referring expressions was so much higher than ours
(47.3% versus our 18%). We attribute this substantial difference to genre, in that Cheng studied museum labels, in which the writer is space-constrained, having to pack a lot of information
into a small label. The issue bears further study,
and perhaps will lead to insights into differences
in writing style that may be attributed to author or
This work was supported in part by the Air
Force Office of Scientific Research under grant
number A9550-05-1-0321, as well as by the
Office of Naval Research under award number
N00014091059. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily
reflect the views of the Office of Naval Research.
This research is also partially funded by the Spanish Ministry of Education and Science (TIN2009-14659-C03-01) and Universidad Complutense de
Madrid (GR58/08). We also thank Whitman
Richards, Ozlem Uzuner, Peter Szolovits, Patrick
Winston, Pablo Gervás, and Mark Seifter for their
helpful comments and discussion, and thank our
annotators Saam Batmanghelidj and Geneva Trotter.
We make two contributions in this paper.
First, we assembled, double-annotated, and ad-
European Workshop on Natural Language Generation, pages 174–182, Morristown, NJ, USA. Association for Computational Linguistics.
Alaukik Aggarwal, Pablo Gervás, and Raquel Hervás.
2009. Measuring the influence of errors induced by
the presence of dialogues in reference clustering of
narrative text. In Proceedings of ICON-2009: 7th
International Conference on Natural Language Processing, India. Macmillan Publishers.
Pamela W. Jordan. 2000a. Can nominal expressions
achieve multiple goals?: an empirical study. In ACL
’00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 142–
149, Morristown, NJ, USA. Association for Computational Linguistics.
Douglas E. Appelt. 1985. Planning English referring
expressions. Artificial Intelligence, 26:1–33.
Pamela W. Jordan. 2000b. Influences on attribute selection in redescriptions: A corpus study. In Proceedings of CogSci2000, pages 250–255.
Hua Cheng and Chris Mellish. 2000. Capturing the interaction between aggregation and text planning in
two generation systems. In INLG ’00: First international conference on Natural Language Generation
2000, pages 186–193, Morristown, NJ, USA. Association for Computational Linguistics.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and
Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.
Hua Cheng, Massimo Poesio, Renate Henschel, and
Chris Mellish. 2001. Corpus-based NP modifier
generation. In NAACL ’01: Second meeting of
the North American Chapter of the Association for
Computational Linguistics on Language technologies 2001, pages 1–8, Morristown, NJ, USA. Association for Computational Linguistics.
Michael O’Donnell, Hua Cheng, and Janet Hitzeman. 1998. Integrating referring and informing in
NP planning. In Proceedings of COLING-ACL’98
Workshop on the Computational Treatment of Nominals, pages 46–56.
Massimo Poesio. 2004. Discourse annotation and
semantic annotation in the GNOME corpus. In
DiscAnnotation ’04: Proceedings of the 2004 ACL
Workshop on Discourse Annotation, pages 72–79,
Morristown, NJ, USA. Association for Computational Linguistics.
Hua Cheng. 1998. Embedding new information into
referring expressions. In ACL-36: Proceedings of
the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 1478–
1480, Morristown, NJ, USA. Association for Computational Linguistics.
Ehud Reiter and Robert Dale. 1992. A fast algorithm
for the generation of referring expressions. In Proceedings of the 14th conference on Computational
linguistics, Nantes, France.
Barbara Di Eugenio, Johanna D. Moore, Pamela W.
Jordan, and Richmond H. Thomason. 1998. An
empirical investigation of proposals in collaborative dialogues. In Proceedings of the 17th international conference on Computational linguistics,
pages 325–329, Morristown, NJ, USA. Association
for Computational Linguistics.
Ehud Reiter and Robert Dale. 2000. Building Natural
Language Generation Systems. Cambridge University Press.
Kees van Deemter, Ielka van der Sluis, and Albert Gatt.
2006. Building a semantically transparent corpus
for the generation of referring expressions. In Proceedings of the 4th International Conference on Natural Language Generation (Special Session on Data
Sharing and Evaluation), INLG-06.
Mark A. Finlayson and Raquel Hervás. 2010a. Annotation guide for the UCM/MIT indications, referring
expressions, and coreference corpus (UMIREC corpus). Technical Report MIT-CSAIL-TR-2010-025,
MIT Computer Science and Artificial Intelligence
Jette Viethen and Robert Dale. 2008. The use of spatial relations in referring expressions. In Proceedings of the 5th International Conference on Natural
Language Generation.
Mark A. Finlayson and Raquel Hervás.
and coreference corpus (UMIREC
Work product, MIT Computer Science and Artificial Intelligence Laboratory.
Mark A. Finlayson. 2008. Collecting semantics in
the wild: The Story Workbench. In Proceedings of
the AAAI Fall Symposium on Naturally-Inspired Artificial Intelligence, pages 46–53, Menlo Park, CA,
USA. AAAI Press.
Albert Gatt, Anja Belz, and Eric Kow. 2009. The
TUNA-REG challenge 2009: overview and evaluation results. In ENLG ’09: Proceedings of the 12th
Preferences versus Adaptation during Referring Expression Generation
Martijn Goudbeek
University of Tilburg
Tilburg, The Netherlands
Emiel Krahmer
University of Tilburg
Tilburg, The Netherlands
rithms rely on similar distinctions. The Graph-based algorithm (Krahmer et al., 2003), for example, searches for the cheapest description for
a target, and distinguishes cheap attributes (such
as color) from more expensive ones (orientation).
Realization of referring expressions has received
less attention, yet recent studies on the ordering of
modifiers (Shaw and Hatzivassiloglou, 1999; Malouf, 2000; Mitchell, 2009) also work from the assumption that some orderings (large red) are preferred over others (red large).
We argue that such preferences are less stable
when referring expressions are generated in interactive settings, as would be required for applications such as spoken dialogue systems or interactive virtual characters. In these cases, we hypothesize that, besides domain preferences, the referring expressions produced earlier in
the interaction are also important. It has been shown
that if one dialogue participant refers to a couch as
a sofa, the next speaker is more likely to use the
word sofa as well (Branigan et al., in press). This
kind of micro-planning or “lexical entrainment”
(Brennan and Clark, 1996) can be seen as a specific form of “alignment” (Pickering and Garrod,
2004) between speaker and addressee. Pickering
and Garrod argue that alignment may take place
on all levels of interaction, and indeed it has been
shown that participants also align their intonation
patterns and syntactic structures. However, as far
as we know, experimental evidence for alignment
on the level of content planning has never been
given, and neither have alignment effects in modifier orderings during realization been shown. With
a few notable exceptions, such as Buschmeier et
al. (2009) who study alignment in micro-planning,
and Janarthanam and Lemon (2009) who study
alignment in expertise levels, alignment has received little attention in NLG so far.
This paper is organized as follows. Experiment I studies the trade-off between adaptation
Current Referring Expression Generation
algorithms rely on domain dependent preferences for both content selection and linguistic realization. We present two experiments showing that human speakers may
opt for dispreferred properties and dispreferred modifier orderings when these were
salient in a preceding interaction (without
speakers being consciously aware of this).
We discuss the impact of these findings for
current generation algorithms.
The generation of referring expressions is a core
ingredient of most Natural Language Generation
(NLG) systems (Reiter and Dale, 2000; Mellish et
al., 2006). These systems usually approach Referring Expression Generation (REG) as a two-step
procedure, where first it is decided which properties to include (content selection), after which
the selected properties are turned into a natural
language referring expression (linguistic realization). The basic problem in both stages is one of
choice; there are many ways in which one could
refer to a target object and there are multiple ways
in which these could be realized in natural language. Typically, these choice problems are tackled by giving preference to some solutions over
others. For example, the Incremental Algorithm
(Dale and Reiter, 1995), one of the most widely
used REG algorithms, assumes that certain attributes are preferred over others, partly based on
evidence provided by Pechmann (1989); a chair
would first be described in terms of its color, and
only if this does not result in a unique characterization, other, less preferred attributes such as
orientation are tried. The Incremental Algorithm
is arguably unique in assuming a complete preference order of attributes, but other REG algo-
Proceedings of the ACL 2010 Conference Short Papers, pages 55–59,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
and preferences during content selection while Experiment II looks at this trade-off for modifier
orderings during realization. Both studies use a
novel interactive reference production paradigm,
applied to two domains – the Furniture and People
domains of the TUNA data-set (Gatt et al., 2007;
Koolen et al., 2009) – to see whether adaptation
may be domain dependent. Finally, we contrast
our findings with the performance of state-of-the-art REG algorithms, discussing how they could be
adapted so as to account for the new data, effectively adding plasticity to the generation process.
Experiment I
Experiment I studies what speakers do when referring to a target that can be distinguished in a
preferred (the blue fan) or a dispreferred way (the
left-facing fan), when in the prior context either
the first or the second variant was made salient.
Participants 26 students (2 male, mean age = 20
years, 11 months), all native speakers of Dutch
without hearing or speech problems, participated
for course credits.
Materials Target pictures were taken from the
TUNA corpus (Gatt et al., 2007) that has been
extensively used for REG evaluation. This corpus consists of two domains: one containing pictures of people (famous mathematicians), the other
containing furniture items in different colors depicted from different orientations. From previous
studies (Gatt et al., 2007; Koolen et al., 2009) it
is known that participants show a preference for
certain attributes: color in the Furniture domain
and glasses in the People domain, and disprefer
other attributes (orientation of a furniture piece
and wearing a tie, respectively).
Procedure Trials consisted of four turns in an interactive reference understanding and production
experiment: a prime, two fillers and the experimental description (see Figure 1). First, participants listened to a pre-recorded female voice referring to one of three objects and had to indicate which one was being referenced. In this subtask, references either used a preferred or a dispreferred attribute; both were distinguishing. Second, participants themselves described a filler picture, after which, third, they had to indicate which
filler picture was being described. The two filler
turns always concerned stimuli from the alternative domain and were intended to prevent too
direct a connection between the prime and the target. Fourth, participants described the target object, which could always be distinguished from its
distractors in a preferred (the blue fan) or a dispreferred (the left-facing fan) way. Note that attributes are primed, not values; a participant may
have heard front facing in the prime turn, while
the target has a different value for this attribute (cf.
Fig. 1).
Figure 1: The 4 tasks per trial. A furniture trial is
shown; people trials have an identical structure.
Figure 2: Proportions of preferred and dispreferred attributes in the Furniture domain.
Figure 3: Proportions of preferred and dispreferred attributes in the People domain.
For the two domains, there were 20 preferred
and 20 dispreferred trials, giving rise to 2 x (20 +
20) = 80 critical trials. These were presented in
counter-balanced blocks, and within blocks each
participant received a different random order. In
addition, there were 80 filler trials (each following
the same structure as outlined in Figure 1). During
debriefing, none of the participants indicated they
had been aware of the experiment’s purpose.
attribute, and hence will not use the dispreferred
attribute. This is not what we observe: our participants used the dispreferred attribute at a rate
significantly larger than zero when they had been
exposed to it three turns earlier ($t_{\mathrm{furniture}}(25) = 6.64$, $p < 0.01$; $t_{\mathrm{people}}(25) = 4.78$, $p < 0.01$). Additionally, they used the dispreferred attribute significantly more when they had previously heard
the dispreferred attribute rather than the preferred
attribute. This difference is especially marked
and significant in the Furniture domain ($t_{\mathrm{furniture}}(25) = 2.63$, $p < 0.01$; $t_{\mathrm{people}}(25) = 0.98$, $p < 0.34$), where participants opt for the dispreferred
attribute in 54% of the trials, more frequently than
they do for the preferred attribute (Fig. 2).
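The comparison against zero reported above has 25 degrees of freedom for 26 participants, which is what a one-sample t-test over per-participant proportions would give. The sketch below computes such a statistic under that assumption; the function name `one_sample_t` and the per-participant data layout are illustrative and are not taken from the paper.

```python
import math
import statistics


def one_sample_t(values, mu=0.0):
    """Return (t, degrees_of_freedom) for a one-sample t-test against `mu`.

    `values` is assumed to hold one proportion per participant, e.g. the
    share of dispreferred-prime trials in which the dispreferred attribute
    was used.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("t is undefined when all observations are identical")
    t = (mean - mu) / (sd / math.sqrt(n))
    return t, n - 1
```

The p-value would then be read from a t distribution with the returned degrees of freedom, for instance with a statistics package; that step is left out to keep the sketch dependency-free.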
We use the proportion of attribute alignment as
our dependent measure. Alignment occurs when
a participant uses the same attribute in the target
as occurred in the prime. This includes overspecified descriptions (Engelhardt et al., 2006; Arnold,
2008), where both the preferred and dispreferred
attributes were mentioned by participants. Overspecification occurred in 13% of the critical trials
(and these were evenly distributed over the experimental conditions).
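The dependent measure described above can be sketched as a simple proportion. In this minimal sketch each trial is assumed to be a pair of the primed attribute and the set of attributes the participant used for the target; the name `alignment_proportion` and this data layout are hypothetical.

```python
def alignment_proportion(trials):
    """Share of trials in which the primed attribute reappears in the target.

    `trials` is an iterable of (primed_attribute, attributes_used) pairs.
    Overspecified descriptions count as aligned as long as the primed
    attribute is among the attributes used.
    """
    trials = list(trials)
    if not trials:
        return 0.0
    aligned = sum(primed in used for primed, used in trials)
    return aligned / len(trials)
```

For example, a trial primed with orientation in which the participant says the blue left-facing fan counts as aligned, since orientation is mentioned alongside color.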
The use of the preferred and dispreferred attribute as a function of prime and domain is shown
in Figure 2 and Figure 3. In both domains, the
preferred attribute is used much more frequently
than the dispreferred attribute with the preferred
primes, which serves as a manipulation check. As
a test of our hypothesis that adaptation processes
play an important role in attribute selection for
referring expressions, we need to look at participants’ expressions with the dispreferred primes
(with the preferred primes, effects of adaptation
and of preferences cannot be teased apart). Current REG algorithms such as the Incremental Algorithm and the Graph-based algorithm predict
that participants will always opt for the preferred
Experiment II
Experiment II uses the same paradigm as
Experiment I to study whether speakers’ preferences for modifier orderings can be changed by
exposing them to dispreferred orderings.
Participants 28 students (ten male, mean age =
23 years, 2 months) participated for course
credits. All were native speakers of Dutch, without
hearing or speech problems. None had participated
in Experiment I.
Materials The materials were identical to those
used in Experiment I, except for their arrangement
in the critical trials. In these trials, the participants
could only identify the target picture using two attributes. In the Furniture domain these were color
and size, in the People domain these were having a
beard and wearing glasses. In the prime turn (Task
I, Fig. 1), these attributes were realized in a preferred way (“size first”: e.g., the big red sofa, or
“glasses first”: the bespectacled and bearded man)
or in a dispreferred way (“color first”: the red big
sofa, or “beard first”: the bearded and bespectacled
man). Google counts for the original Dutch modifier orderings reveal that the ratio of preferred to
dispreferred is on the order of 40:1 in the Furniture
domain and 3:1 in the People domain.
Figure 4: Proportions of preferred and dispreferred modifier orderings in the Furniture domain.
Figure 5: Proportions of preferred and dispreferred modifier orderings in the People domain.
Procedure As above.
preferred attribute or produce a dispreferred modifier ordering when they had previously been exposed to these attributes or orderings, without being aware of this. These findings fit in well with
the adaptation and alignment models proposed by
psycholinguists, but ours, as far as we know, is
the first experimental evidence of alignment in attribute selection and in modifier ordering. Interestingly, we found that effect sizes differ for the
different domains, indicating that the trade-off between preferences and adaptation is a gradual one,
also influenced by the a priori differences in preference (it is more difficult to make people say
something truly dispreferred than something more
marginally dispreferred).
To account for these findings, REG algorithms
that function in an interactive setting should be
made sensitive to the production of dialogue partners. For the Incremental Algorithm (Dale and Reiter, 1995), this could be achieved by augmenting
the list of preferred attributes with a list of “previously mentioned” attributes. The relative weighting of these two lists will be corpus dependent,
and can be estimated in a data-driven way. Alternatively, in the Graph-based algorithm (Krahmer
et al., 2003), costs of properties could be based
on two components: a relatively fixed domain
component (preferred is cheaper) and a flexible
interactive component (recently used is cheaper).
Which approach would work best is an open, empirical question, but either way this would constitute an important step towards interactive REG.
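The first proposal above, augmenting the preference list with previously mentioned attributes, can be sketched as follows. This is a simplified version of the Incremental Algorithm (it omits the special handling of the type attribute), and the re-ranking function `primed_order` with its `prime_weight` parameter is a hypothetical illustration of how the two lists might be combined, not a method proposed in the paper.

```python
def incremental_algorithm(target, distractors, preference_order):
    """Simplified Incremental Algorithm (after Dale and Reiter, 1995).

    `target` and each distractor are dicts mapping attribute -> value.
    Returns the selected (attribute, value) pairs, or None if the target
    cannot be distinguished from all distractors.
    """
    remaining = list(distractors)
    description = []
    for attr in preference_order:
        if not remaining:
            break
        if attr not in target:
            continue
        value = target[attr]
        # Keep the attribute only if it rules out at least one distractor.
        if any(d.get(attr) != value for d in remaining):
            description.append((attr, value))
            remaining = [d for d in remaining if d.get(attr) == value]
    return description if not remaining else None


def primed_order(domain_order, recent_attributes, prime_weight=1.0):
    """Re-rank attributes so that recently heard ones move forward.

    `prime_weight` (hypothetical) sets how strongly priming overrides the
    domain preference; 0.0 reproduces the original order.
    """
    def score(attr):
        bonus = prime_weight * len(domain_order) if attr in recent_attributes else 0.0
        return domain_order.index(attr) - bonus

    return sorted(domain_order, key=score)
```

With the default weight, a primed attribute always outranks an unprimed one; in practice the weight would be estimated from data, as the text notes. The Graph-based variant would instead add the same bonus as a cost reduction on recently used properties.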
We use the proportion of modifier ordering alignments as our dependent measure, where alignment
occurs when the participant’s ordering coincides
with the primed ordering. Figures 4 and 5 show the
use of the preferred and dispreferred modifier ordering per prime and domain. It can be seen that
in the preferred prime conditions, participants produce the expected orderings, more or less in accordance with the Google counts.
State-of-the-art realizers would always opt for
the most frequent ordering of a given pair of modifiers and hence would never predict the dispreferred orderings to occur. Still, the use of the dispreferred modifier ordering occurred significantly
more often than one would expect given this prediction, $t_{\mathrm{furniture}}[27] = 6.56$, $p < 0.01$ and $t_{\mathrm{people}}[27] = 9.55$, $p < 0.01$.
To test our hypotheses concerning adaptation, we looked at the dispreferred
realizations when speakers were exposed to dispreferred primes (compared to preferred primes).
In both domains this resulted in an increase of the
amount of dispreferred realizations, which was significant in the People domain ($t_{\mathrm{people}}[27] = 1.99$,
$p < 0.05$; $t_{\mathrm{furniture}}[25] = 2.63$, $p < 0.01$).
Current state-of-the-art REG algorithms often rest
upon the assumption that some attributes and some
realizations are preferred over others. The two experiments described in this paper show that this
assumption is incorrect, when references are produced in an interactive setting. In both experiments, speakers were more likely to select a dis-
The research reported in this paper forms part
of the VICI project “Bridging the gap between
psycholinguistics and Computational linguistics:
the case of referring expressions”, funded by the
Netherlands Organization for Scientific Research
(NWO grant 277-70-007).
Chris Mellish, Donia Scott, Lynn Cahill, Daniel Paiva,
Roger Evans, and Mike Reape. 2006. A reference architecture for natural language generation
systems. Natural Language Engineering, 12:1–34.
Jennifer Arnold.
Reference production: Production-internal and addressee-oriented
processes. Language and Cognitive Processes,
Margaret Mitchell. 2009. Class-based ordering of
prenominal modifiers. In ENLG ’09: Proceedings of
the 12th European Workshop on Natural Language
Generation, pages 50–57, Morristown, NJ, USA.
Association for Computational Linguistics.
Holly P. Branigan, Martin J. Pickering, Jamie Pearson,
and Janet F. McLean. in press. Linguistic alignment
between people and computers. Journal of Pragmatics, 23:1–2.
Thomas Pechmann. 1989. Incremental speech production and referential overspecification. Linguistics,
Susan E. Brennan and Herbert H. Clark. 1996. Conceptual pacts and lexical choice in conversation.
Journal of Experimental Psychology: Learning,
Memory, and Cognition, 22:1482–1493.
Martin Pickering and Simon Garrod. 2004. Towards
a mechanistic psychology of dialogue. Behavioural
and Brain Sciences, 27:169–226.
Hendrik Buschmeier, Kirsten Bergmann, and Stefan
Kopp. 2009. An alignment-capable microplanner for Natural Language Generation. In Proceedings of the 12th European Workshop on Natural
Language Generation (ENLG 2009), pages 82–89,
Athens, Greece, March. Association for Computational Linguistics.
Ehud Reiter and Robert Dale. 2000. Building Natural
Language Generation Systems. Cambridge University Press.
James Shaw and Vasileios Hatzivassiloglou. 1999. Ordering among premodifiers. In Proceedings of the
37th annual meeting of the Association for Computational Linguistics on Computational Linguistics,
pages 135–143.
Robert Dale and Ehud Reiter. 1995. Computational
interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science,
Paul E. Engelhardt, Karl G. Bailey, and Fernanda Ferreira. 2006. Do speakers and listeners observe the
Gricean maxim of quantity? Journal of Memory and
Language, 54(4):554–573.
Albert Gatt, Ielka van der Sluis, and Kees van Deemter.
2007. Evaluating algorithms for the generation of
referring expressions using a balanced corpus. In
Proceedings of the 11th European Workshop on Natural Language Generation.
Srinivasan Janarthanam and Oliver Lemon. 2009.
Learning lexical alignment policies for generating
referring expressions for spoken dialogue systems.
In Proceedings of the 12th European Workshop on
Natural Language Generation (ENLG 2009), pages
74–81, Athens, Greece, March. Association for
Computational Linguistics.
Ruud Koolen, Albert Gatt, Martijn Goudbeek, and
Emiel Krahmer. 2009. Need I say more? on factors
causing referential overspecification. In Proceedings of the PRE-CogSci 2009 Workshop on the Production of Referring Expressions: Bridging the Gap
Between Computational and Empirical Approaches
to Reference.
Emiel Krahmer, Sebastiaan van Erk, and André Verleg.
2003. Graph-based generation of referring expressions. Computational Linguistics, 29(1):53–72.
Robert Malouf. 2000. The order of prenominal adjectives in natural language generation. In Proceedings
of the 38th Annual Meeting of the Association for
Computational Linguistics, pages 85–92.
Cognitively Plausible Models of Human Language Processing
Frank Keller
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK
[email protected]
Abstract
We pose the development of cognitively plausible models of human language processing as a challenge for computational linguistics. Existing models can only deal with isolated phenomena (e.g., garden paths) on small, specifically selected data sets. The challenge is to build models that integrate multiple aspects of human language processing at the syntactic, semantic, and discourse level. Like human language processing, these models should be incremental, predictive, broad coverage, and robust to noise. This challenge can only be met if standardized data sets and evaluation measures are developed.
Introduction
In many respects, human language processing is the ultimate gold standard for computational linguistics. Humans understand and generate language with amazing speed and accuracy, they are able to deal with ambiguity and noise effortlessly and can adapt to new speakers, domains, and registers. Most surprisingly, they achieve this competency on the basis of limited training data (Hart and Risley, 1995), using learning algorithms that are largely unsupervised.
Given the impressive performance of humans as language processors, it seems natural to turn to psycholinguistics, the discipline that studies human language processing, as a source of information about the design of efficient language processing systems. Indeed, psycholinguists have uncovered an impressive array of relevant facts (reviewed in Section 2), but computational linguists are often not aware of this literature, and results about human language processing rarely inform the design, implementation, or evaluation of artificial language processing systems.
At the same time, research in psycholinguistics is often oblivious of work in computational linguistics (CL). To test their theories, psycholinguists construct computational models of human language processing, but these models often fall short of the engineering standards that are generally accepted in the CL community (e.g., broad coverage, robustness, efficiency): typical psycholinguistic models only deal with isolated phenomena and fail to scale to realistic data sets. A particular issue is evaluation, which is typically anecdotal, performed on a small set of handcrafted examples (see Section 3).
In this paper, we propose a challenge that requires the combination of research efforts in computational linguistics and psycholinguistics: the development of cognitively plausible models of human language processing. This task can be decomposed into a modeling challenge (building models that instantiate known properties of human language processing) and a data and evaluation challenge (accounting for experimental findings and evaluating against standardized data sets), which we will discuss in turn.
Modeling Challenge
Key Properties
The first part of the challenge is to develop a model
that instantiates key properties of human language
processing, as established by psycholinguistic experimentation (see Table 1 for an overview and
representative references).1 A striking property of
the human language processor is its efficiency and
robustness. For the vast majority of sentences, it
will effortlessly and rapidly deliver the correct
analysis, even in the face of noise and ungrammaticalities. There is considerable experimental evi-
1 Here and in the following, we will focus on sentence
processing, which is often regarded as a central aspect of
human language processing. A more comprehensive answer
to our modeling challenge should also include phonological
and morphological processing, semantic inference, discourse
processing, and other non-syntactic aspects of language processing. Furthermore, established results regarding the interface between language processing and non-linguistic cognition (e.g., the sensorimotor system) should ultimately be accounted for in a fully comprehensive model.
Proceedings of the ACL 2010 Conference Short Papers, pages 60–67,
Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics
Property                            Evidence
Efficiency and robustness           Ferreira et al. (2001); Sanford and Sturt (2002)
Broad coverage                      Crocker and Brants (2000)
Incrementality and connectedness    Tanenhaus et al. (1995); Sturt and Lombardo (2005)
Prediction                          Kamide et al. (2003); Staub and Clifton (2006)
Memory cost                         Gibson (1998); Vasishth and Lewis (2006)
[Model columns: Surp, Pred; cell values not recoverable]
Table 1: Key properties of human language processing and their instantiation in various models of sentence processing (see
Section 2 for details)
model properties).2
The earliest approaches were ranking-based
models (Rank), which make psycholinguistic predictions based on the ranking of the syntactic
analyses produced by a probabilistic parser. Jurafsky (1996) assumes that processing difficulty
is triggered if the correct analysis falls below a
certain probability threshold (i.e., is pruned by
the parser). Similarly, Crocker and Brants (2000)
assume that processing difficulty ensues if the
highest-ranked analysis changes from one word to
the next. Both approaches have been shown to successfully model garden path effects. Being based
on probabilistic parsing techniques, ranking-based
models generally achieve a broad coverage, but
their efficiency and robustness have not been evaluated. Also, they are not designed to capture syntactic prediction or memory effects (other than search
with a narrow beam in Brants and Crocker 2000).
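The two ranking-based linking assumptions can be made concrete with a toy sketch. It is not the implementation of either cited model: the function names, the use of a beam defined relative to the best analysis, and the probabilities for the two competing analyses are assumptions made here purely for illustration.

```python
def pruned_difficulty(analysis_probs, correct, beam_ratio=0.2):
    """Pruning-style flag: difficulty is predicted if the correct analysis
    falls below a threshold (here assumed to be relative to the best one)."""
    best = max(analysis_probs.values())
    return analysis_probs.get(correct, 0.0) < beam_ratio * best

def rerank_difficulty(prev_probs, curr_probs):
    """Reranking-style flag: difficulty is predicted if the highest-ranked
    analysis changes from one word to the next."""
    top_prev = max(prev_probs, key=prev_probs.get)
    top_curr = max(curr_probs, key=curr_probs.get)
    return top_prev != top_curr

# Invented probabilities for a garden-path sentence, before and after
# the disambiguating word.
before = {"main_verb": 0.92, "reduced_relative": 0.08}
after = {"main_verb": 0.0, "reduced_relative": 0.003}

pruned = pruned_difficulty(before, correct="reduced_relative")
reranked = rerank_difficulty(before, after)
```

Both flags are binary, which is one reason such models predict the presence of a garden path effect rather than graded word-by-word difficulty.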
The ranking-based approach has been generalized by surprisal models (Surp), which predict processing difficulty based on the change in
the probability distribution over possible analyses from one word to the next (Hale, 2001; Levy,
2008; Demberg and Keller, 2008a; Ferrara Boston
et al., 2008; Roark et al., 2009). These models
have been successful in accounting for a range of
experimental data, and they achieve broad coverage. They also instantiate a limited form of prediction, viz., they build up expectations about the next
word in the input. On the other hand, the efficiency
and robustness of these models have largely not
been evaluated, and memory costs are not modeled (again except for restrictions in beam size).
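As a brief illustration, the surprisal of a word is commonly defined as its negative log probability given the preceding words, which can be computed from the ratio of successive prefix probabilities. The sketch below assumes that a parser or language model has already supplied the prefix probabilities; the function name and the numbers are invented for the example.

```python
import math

def surprisal_from_prefix_probs(prefix_probs):
    """Return per-word surprisal in bits.

    prefix_probs[k] is P(w_1 ... w_k), with prefix_probs[0] = 1.0 for the
    empty prefix, so that surprisal(w_k) = log2(P(prefix k-1) / P(prefix k)).
    """
    return [math.log2(prev / curr)
            for prev, curr in zip(prefix_probs, prefix_probs[1:])]

# Invented prefix probabilities for a three-word sentence.
prefix_probs = [1.0, 0.5, 0.25, 0.03125]
surprisals = surprisal_from_prefix_probs(prefix_probs)
```

In this example the third word removes most of the remaining probability mass and therefore receives the highest surprisal; the per-word values sum to the negative log probability of the whole prefix.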
The prediction model (Pred) explicitly predicts
syntactic structure for upcoming words (Demberg
and Keller, 2008b, 2009), thus accounting for experimental results on predictive language processing. It also implements a strict form of incre-
dence that shallow processing strategies are used
to achieve this. The processor also achieves broad
coverage: it can deal with a wide variety of syntactic constructions, and is not restricted by the domain, register, or modality of the input.
Human language processing is also word-by-word incremental. There is strong evidence that
a new word is integrated as soon as it is available into the representation of the sentence thus
far. Readers and listeners experience differential
processing difficulty during this integration process, depending on the properties of the new word
and its relationship to the preceding context. There
is evidence that the processor instantiates a strict
form of incrementality by building only fully connected trees. Furthermore, the processor is able
to make predictions about upcoming material on
the basis of sentence prefixes. For instance, listeners can predict an upcoming post-verbal element
based on the semantics of the preceding verb. Or
they can make syntactic predictions, e.g., if they
encounter the word either, they predict an upcoming or and the type of complement that follows it.
Another key property of human language processing is the fact that it operates with limited
memory, and that structures in memory are subject
to decay and interference. In particular, the processor is known to incur a distance-based memory
cost: combining the head of a phrase with its syntactic dependents is more difficult the more dependents have to be integrated and the further away
they are. This integration process is also subject
to interference from similar items that have to be
held in memory at the same time.
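A deliberately simplified sketch of a distance-based integration cost is given below. It charges each dependency at the later of its two words, with a cost equal to the number of word positions separating head and dependent; this is an assumption made for illustration and does not reproduce any specific published cost metric, and the dependency analysis of the example sentence is likewise hypothetical.

```python
def integration_costs(n_words, dependencies):
    """Return a per-word cost list.

    dependencies is a list of (head, dependent) word positions (0-based).
    Each dependency is charged where it can first be completed, i.e. at the
    later word, and costs as much as the distance it spans. Words that
    complete several dependencies accumulate the costs.
    """
    costs = [0] * n_words
    for head, dep in dependencies:
        costs[max(head, dep)] += abs(head - dep)
    return costs

words = "The reporter who the senator attacked admitted the error".split()
# Hypothetical dependency analysis as (head, dependent) pairs.
dependencies = [(1, 0), (4, 3), (5, 4), (5, 2), (6, 1), (8, 7), (6, 8)]
costs = integration_costs(len(words), dependencies)
```

Under these assumptions the embedded verb "attacked" and the main verb "admitted" receive the highest costs, because they integrate dependents that lie several words back. Interference and decay are not modeled in this sketch.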
Current Models
The challenge is to develop a computational model
that captures the key properties of human language
processing outlined in the previous section. A
number of relevant models have been developed,
mostly based on probabilistic parsing techniques,
but none of them instantiates all the key properties discussed above (Table 1 gives an overview of
2 We will not distinguish between model and linking theory, i.e., the set of assumptions that links model quantities
to behavioral data (e.g., more probable structures are easier
to process). It is conceivable, for instance, that a stack-based
model is combined with a linking theory based on surprisal.
Factor                      Evidence
Word senses                 Roland and Jurafsky (2002)
Selectional restrictions    Garnsey et al. (1997); Pickering and Traxler (1998)
Thematic roles              McRae et al. (1998); Pickering et al. (2000)
Discourse reference         Altmann and Steedman (1988); Grodner and Gibson (2005)
Discourse coherence         Stewart et al. (2000); Kehler et al. (2008)
the development of language processing models
that combine syntactic processing with semantic
and discourse processing. So far, this challenge is
largely unmet: there are some examples of models
that integrate semantic processes such as thematic
role assignment into a parsing model (Narayanan
and Jurafsky, 2002; Padó et al., 2009). However,
other semantic factors are not accounted for by
these models, and incorporating non-lexical aspects of semantics into models of sentence processing is a challenge for ongoing research. Recently, Dubey (2010) has proposed an approach
that combines a probabilistic parser with a model
of co-reference and discourse inference based on
probabilistic logic. An alternative approach has
been taken by Pynte et al. (2008) and Mitchell
et al. (2010), who combine a vector-space model
of semantics (Landauer and Dumais, 1997) with a
syntactic parser and show that this results in predictions of processing difficulty that can be validated against an eye-tracking corpus.
Table 2: Semantic factors in human language processing
mentality by building fully connected trees. Memory costs are modeled directly as a distance-based
penalty that is incurred when a prediction has to be
verified later in the sentence. However, the current
implementation of the prediction model is neither
robust nor efficient, nor does it offer broad coverage.
Recently, a stack-based model (Stack) has been
proposed that imposes explicit, cognitively motivated memory constraints on the parser, in effect limiting the stack size available to the parser
(Schuler et al., 2010). This delivers robustness, efficiency, and broad coverage, but does not model
syntactic prediction. Unlike the other models discussed here, no psycholinguistic evaluation has
been conducted on the stack-based model, so its
cognitive plausibility is preliminary.
Acquisition and Crosslinguistics
All models of human language processing discussed so far rely on supervised training data. This
raises another aspect of the modeling challenge:
the human language processor is the product of
an acquisition process that is largely unsupervised
and has access to only limited training data: children aged 12–36 months are exposed to between
10 and 35 million words of input (Hart and Risley, 1995). The challenge therefore is to develop
a model of language acquisition that works with
such small training sets, while also giving rise to
a language processor that meets the key criteria
in Table 1. The CL community is in a good position to rise to this challenge, given the significant
progress in unsupervised parsing in recent years
(starting from Klein and Manning 2002). However, none of the existing unsupervised models has
been evaluated against psycholinguistic data sets,
and they are not designed to meet even basic psycholinguistic criteria such as incrementality.
A related modeling challenge is the development of processing models for languages other
than English. There is a growing body of experimental research investigating human language
processing in other languages, but virtually all existing psycholinguistic models only work for English (the only exceptions we are aware of are
Dubey et al.’s (2008) and Ferrara Boston et al.’s
Beyond Parsing
There is strong evidence that human language processing is driven by an interaction of syntactic, semantic, and discourse processes (see Table 2 for
an overview and references). Considerable experimental work has focused on the semantic properties of the verb of the sentence, and verb sense,
selectional restrictions, and thematic roles have all
been shown to interact with syntactic ambiguity
resolution. Another large body of research has elucidated the interaction of discourse processing and
syntactic processing. The most well-known effect
is probably that of referential context: syntactic
ambiguities can be resolved if a discourse context is provided that makes one of the syntactic
alternatives more plausible. For instance, in a context that provides two possible antecedents for a
noun phrase, the processor will prefer attaching a
PP or a relative clause such that it disambiguates
between the two antecedents; garden paths are reduced or disappear. Other results point to the importance of discourse coherence for sentence processing, an example being implicit causality.
The challenge facing researchers in computational and psycholinguistics therefore includes
(2008) parsing models for German). Again, the
CL community has made significant progress in
crosslinguistic parsing, especially using dependency grammar (Hajič, 2009), and psycholinguistic modeling could benefit from this in order to
meet the challenge of developing crosslinguistically valid models of human language processing.
ena can it account for) and its accuracy (how well
does it fit the behavioral data) can be assessed.
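One possible way of operationalizing coverage and accuracy on such a test set is sketched below. The choices made here are assumptions rather than an established procedure: accuracy is taken to be the Pearson correlation between model predictions and behavioral measures within each phenomenon, a phenomenon counts as covered if that correlation reaches a hypothetical threshold `min_r`, and the data are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def evaluate(test_set, min_r=0.3):
    """Return (coverage, per-phenomenon fit).

    test_set maps a phenomenon name to (model predictions, behavioral
    measures) for the items of that phenomenon.
    """
    fit = {name: pearson(pred, obs) for name, (pred, obs) in test_set.items()}
    covered = [name for name, r in fit.items() if r >= min_r]
    return len(covered) / len(fit), fit

# Invented model predictions and reading times (ms) for two phenomena.
test_set = {
    "garden_path": ([1.0, 2.0, 3.0, 4.0], [310.0, 330.0, 350.0, 370.0]),
    "memory_cost": ([1.0, 2.0, 3.0, 4.0], [400.0, 380.0, 390.0, 370.0]),
}
coverage, fit = evaluate(test_set)
```

A plain correlation ignores item and participant variation, so it should be read only as a placeholder for whatever evaluation measure a standardized test set would eventually prescribe.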
Experimental test sets should be complemented
by test sets based on corpus data. In order to assess the efficiency, robustness, and broad coverage of a model, a corpus of unrestricted, naturally
occurring text is required. The use of contextualized language data makes it possible to assess not
only syntactic models, but also models that capture
discourse effects. These corpora need to be annotated with behavioral measures, e.g., eye-tracking
or reading time data. Some relevant corpora have
already been constructed, see the overview in Table 3, and various authors have used them for
model evaluation (Demberg and Keller, 2008a;
Pynte et al., 2008; Frank, 2009; Ferrara Boston
et al., 2008; Patil et al., 2009; Roark et al., 2009;
Mitchell et al., 2010).
However, the usefulness of the psycholinguistic corpora in Table 3 is restricted by the absence
of gold-standard linguistic annotation (with the exception of the
French part of the Dundee corpus, which is syntactically annotated). This makes it difficult to test
the accuracy of the linguistic structures computed
by a model, and restricts evaluation to behavioral
predictions. The challenge is therefore to collect
a standardized test set of naturally occurring text
or speech enriched not only with behavioral variables, but also with syntactic and semantic annotation. Such a data set could for example be constructed by eye-tracking section 23 of the Penn
Treebank (which is also part of Propbank, and thus
has both syntactic and thematic role annotation).
In computational linguistics, the development
of new data sets is often stimulated by competitions in which systems are compared on a standardized task, using a data set specifically designed for the competition. Examples include the
CoNLL shared task, SemEval, or TREC in computational syntax, semantics, and discourse, respectively. A similar competition could be developed for computational psycholinguistics – maybe
along the lines of the model comparison challenges that are held at the International Conference
on Cognitive Modeling. These challenges provide
standardized task descriptions and data sets; participants can enter their cognitive models, which
are then compared using a pre-defined evaluation metric.3
Data and Evaluation Challenge
Test Sets
The second key challenge that needs to be addressed in order to develop cognitively plausible
models of human language processing concerns
test data and model evaluation. Here, the state of
the art in psycholinguistic modeling lags significantly behind standards in the CL community.
Most of the models discussed in Section 2 have not
been evaluated rigorously. The authors typically
describe their performance on a small set of handpicked examples; no attempts are made to test on
a range of items from the experimental literature
and determine model fit directly against behavioral
measures (e.g., reading times). This makes it very
hard to obtain a realistic estimate of how well the
models achieve their aim of capturing human language processing behavior.
We therefore suggest the development of standard test sets for psycholinguistic modeling, similar to what is commonplace for tasks in computational linguistics: parsers are evaluated against the
Penn Treebank, word sense disambiguation systems against the SemEval data sets, co-reference
systems against the Tipster or ACE corpora, etc.
Two types of test data are required for psycholinguistic modeling. The first type of test data consists of a collection of representative experimental
results. This collection should contain the actual
experimental materials (sentences or discourse
fragments) used in the experiments, together with
the behavioral measurements obtained (reading
times, eye-movement records, rating judgments,
etc.). The experiments included in this test set
would be chosen to cover a wide range of experimental phenomena, e.g., garden paths, syntactic complexity, memory effects, semantic and discourse factors. Such a test set will enable the standardized evaluation of psycholinguistic models by
comparing the model predictions (rankings, surprisal values, memory costs, etc.) against behavioral measures on a large set of items. This way
both the coverage of a model (how many phenom-
3 The ICCM 2009 challenge was the Dynamic Stock and
Flows Task; for more information see http://www.hss.
Corpus            Language           Method                Reference
Dundee Corpus     English, French                          Kennedy and Pynte (2005)
Potsdam Corpus                                             Kliegl et al. (2006)
MIT Corpus                           Self-paced reading    Bachrach (2008)
Table 3: Test corpora that have been used for psycholinguistic modeling of sentence processing; note that the Potsdam Corpus
consists of isolated sentences, rather than of continuous text
Behavioral and Neural Data
Further issues arise from the fact that we often want to compare model fit for multiple experiments (ideally without reparametrizing the models), and that various mutually dependent measures are used for evaluation, e.g., processing effort at the sentence, word, and character level. An
important open challenge is therefore to develop evaluation measures and associated statistical procedures that can deal with these problems.
As outlined in the previous section, a number of
authors have evaluated psycholinguistic models
against eye-tracking or reading time corpora. Part
of the data and evaluation challenge is to extend
this evaluation to neural data as provided by event-related potential (ERP) or brain imaging studies
(e.g., using functional magnetic resonance imaging, fMRI). Neural data sets are considerably more
complex than behavioral ones, and modeling them
is an important new task that the community is
only beginning to address. Some recent work has
evaluated models of word semantics against ERP
(Murphy et al., 2009) or fMRI data (Mitchell et al.,
2008).4 This is a very promising direction, and the
challenge is to extend this approach to the sentence
and discourse level (see Bachrach 2008). Again,
it will be necessary to develop standardized
test sets of both experimental data and corpus data.
In this paper, we discussed the modeling and
data/evaluation challenges involved in developing
cognitively plausible models of human language
processing. Developing computational models is
of scientific importance insofar as models are implemented theories: models of language processing allow us to test scientific hypotheses about the
cognitive processes that underpin language processing. This type of precise, formalized hypothesis testing is only possible if standardized data
sets and uniform evaluation procedures are available, as outlined in the present paper. Ultimately,
this approach enables qualitative and quantitative
comparisons between theories, and thus enhances
our understanding of a key aspect of human cognition, language processing.
There is also an applied side to the proposed
challenge. Once computational models of human
language processing are available, they can be
used to predict the difficulty that humans experience when processing text or speech. This is useful for a number of applications: for instance, natural language generation would benefit from being able to assess whether machine-generated text
or speech is easy to process. For text simplification (e.g., for children or impaired readers), such a
model is even more essential. It could also be used
to assess the readability of text, which is of interest
in educational applications (e.g., essay scoring). In
machine translation, evaluating the fluency of system output is crucial, and a model that predicts
processing difficulty could be used for this, or to
guide the choice between alternative translations,
and maybe even to inform human post-editing.
Evaluation Measures
We also anticipate that the availability of new test
data sets will facilitate the development of new
evaluation measures that specifically test the validity of psycholinguistic models. Established CL
evaluation measures such as Parseval are of limited use, as they can only test the linguistic, but not
the behavioral or neural predictions of a model.
So far, many authors have relied on qualitative evaluation: if a model predicts a difference
in (for instance) reading time between two types
of sentences where such a difference was also
found experimentally, then that counts as a successful test. In most cases, no quantitative evaluation is performed, as this would require modeling the reading times for individual items and individual participants. Suitable procedures for performing such tests do not currently exist; linear
mixed effects models (Baayen et al., 2008) provide a way of dealing with item and participant
variation, but crucially do not enable direct comparisons between models in terms of goodness of fit.
4 These data sets were released as part of the NAACL 2010 Workshop on Computational Neurolinguistics.
2008. Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research 2(1):1–12.
Altmann, Gerry T. M. and Mark J. Steedman.
1988. Interaction with context during human
sentence processing. Cognition 30(3):191–238.
Ferreira, Fernanda, Kiel Christianson, and Andrew Hollingworth. 2001. Misinterpretations of
garden-path sentences: Implications for models
of sentence processing and reanalysis. Journal
of Psycholinguistic Research 30(1):3–20.
Baayen, R. H., D. J. Davidson, and D. M. Bates.
2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of
Memory and Language to appear.
Bachrach, Asaf. 2008. Imaging Neural Correlates
of Syntactic Complexity in a Naturalistic Context. Ph.D. thesis, Massachusetts Institute of
Technology, Cambridge, MA.
Frank, Stefan L. 2009. Surprisal-based comparison between a symbolic and a connectionist
model of sentence processing. In Niels Taatgen and Hedderik van Rijn, editors, Proceedings of the 31st Annual Conference of the Cognitive Science Society. Cognitive Science Society, Amsterdam, pages 1139–1144.
Brants, Thorsten and Matthew W. Crocker. 2000.
Probabilistic parsing and psychological plausibility. In Proceedings of the 18th International Conference on Computational Linguistics. Saarbrücken/Luxembourg/Nancy, pages
Garnsey, Susan M., Neal J. Pearlmutter, Elisabeth M. Myers, and Melanie A. Lotocky. 1997.
The contributions of verb bias and plausibility
to the comprehension of temporarily ambiguous
sentences. Journal of Memory and Language
Crocker, Matthew W. and Thorsten Brants. 2000.
Wide-coverage probabilistic sentence processing.
Journal of Psycholinguistic Research
Gibson, Edward. 1998. Linguistic complexity:
locality of syntactic dependencies. Cognition
Demberg, Vera and Frank Keller. 2008a. Data
from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition 101(2):193–210.
Grodner, Dan and Edward Gibson. 2005. Consequences of the serial nature of linguistic input.
Cognitive Science 29:261–291.
Demberg, Vera and Frank Keller. 2008b. A psycholinguistically motivated version of TAG. In
Proceedings of the 9th International Workshop
on Tree Adjoining Grammars and Related Formalisms. Tübingen, pages 25–32.
Hajič, Jan, editor. 2009. Proceedings of the 13th
Conference on Computational Natural Language Learning: Shared Task. Association for
Computational Linguistics, Boulder, CO.
Demberg, Vera and Frank Keller. 2009. A computational model of prediction in human parsing: Unifying locality and surprisal effects. In
Niels Taatgen and Hedderik van Rijn, editors,
Proceedings of the 31st Annual Conference of
the Cognitive Science Society. Cognitive Science Society, Amsterdam, pages 1888–1893.
Hale, John. 2001. A probabilistic Earley parser as
a psycholinguistic model. In Proceedings of the
2nd Conference of the North American Chapter
of the Association for Computational Linguistics. Association for Computational Linguistics,
Pittsburgh, PA, volume 2, pages 159–166.
Dubey, Amit. 2010. The influence of discourse on
syntax: A psycholinguistic model of sentence
processing. In Proceedings of the 48th Annual
Meeting of the Association for Computational
Linguistics. Uppsala.
Hart, Betty and Todd R. Risley. 1995. Meaningful Differences in the Everyday Experience of
Young American Children. Paul H. Brookes,
Baltimore, MD.
Jurafsky, Daniel. 1996. A probabilistic model of
lexical and syntactic access and disambiguation. Cognitive Science 20(2):137–194.
Dubey, Amit, Frank Keller, and Patrick Sturt.
2008. A probabilistic corpus-based model of
syntactic parallelism. Cognition 109(3):326–
Kamide, Yuki, Gerry T. M. Altmann, and Sarah L.
Haywood. 2003. The time-course of prediction
in incremental sentence processing: Evidence
from anticipatory eye movements. Journal of
Memory and Language 49:133–156.
Ferrara Boston, Marisa, John Hale, Reinhold
Kliegl, Umesh Patil, and Shravan Vasishth.
Language Processing. Singapore, pages 619–
Kehler, Andrew, Laura Kertz, Hannah Rohde, and
Jeffrey L. Elman. 2008. Coherence and coreference revisited. Journal of Semantics 25(1):1–
Narayanan, Srini and Daniel Jurafsky. 2002. A
Bayesian model predicts human parse preference and reading time in sentence processing. In
Thomas G. Dietterich, Sue Becker, and Zoubin
Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press,
Cambridge, MA, pages 59–65.
Kennedy, Alan and Joel Pynte. 2005. Parafoveal-on-foveal effects in normal reading. Vision Research 45:153–168.
Padó, Ulrike, Matthew W. Crocker, and Frank
Keller. 2009. A probabilistic model of semantic
plausibility in sentence processing. Cognitive
Science 33(5):794–838.
Klein, Dan and Christopher Manning. 2002. A
generative constituent-context model for improved grammar induction. In Proceedings of
the 40th Annual Meeting of the Association for
Computational Linguistics. Philadelphia, pages
Patil, Umesh, Shravan Vasishth, and Reinhold
Kliegl. 2009. Compound effect of probabilistic disambiguation and memory retrievals on
sentence processing: Evidence from an eye-tracking corpus. In A. Howes, D. Peebles,
and R. Cooper, editors, Proceedings of 9th International Conference on Cognitive Modeling.
Kliegl, Reinhold, Antje Nuthmann, and Ralf Engbert. 2006. Tracking the mind during reading:
The influence of past, present, and future words
on fixation durations. Journal of Experimental
Psychology: General 135(1):12–35.
Landauer, Thomas K. and Susan T. Dumais. 1997.
A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction
and representation of knowledge. Psychological Review 104(2):211–240.
Pickering, Martin J. and Martin J. Traxler. 1998.
Plausibility and recovery from garden paths: An
eye-tracking study. Journal of Experimental
Psychology: Learning Memory and Cognition
Levy, Roger. 2008. Expectation-based syntactic
comprehension. Cognition 106(3):1126–1177.
Pickering, Martin J., Matthew J. Traxler, and
Matthew W. Crocker. 2000. Ambiguity resolution in sentence processing: Evidence against
frequency-based accounts. Journal of Memory
and Language 43(3):447–475.
McRae, Ken, Michael J. Spivey-Knowlton, and
Michael K. Tanenhaus. 1998. Modeling the influence of thematic fit (and other constraints)
in on-line sentence comprehension. Journal of
Memory and Language 38(3):283–312.
Pynte, Joel, Boris New, and Alan Kennedy. 2008.
On-line contextual influences during reading
normal text: A multiple-regression analysis. Vision Research 48(21):2172–2183.
Mitchell, Jeff, Mirella Lapata, Vera Demberg, and
Frank Keller. 2010. Syntactic and semantic factors in processing difficulty: An integrated measure. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala.
Roark, Brian, Asaf Bachrach, Carlos Cardenas,
and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures
for psycholinguistic modeling via incremental
top-down parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Singapore, pages 324–333.
Mitchell, Tom M., Svetlana V. Shinkareva, Andrew Carlson, Kai-Min Chang, Vicente L.
Malave, Robert A. Mason, and Marcel Adam
Just. 2008. Predicting human brain activity associated with the meanings of nouns. Science
Roland, Douglas and Daniel Jurafsky. 2002. Verb
sense and verb subcategorization probabilities.
In Paola Merlo and Suzanne Stevenson, editors,
The Lexical Basis of Sentence Processing: Formal, Computational, and Experimental Issues,
John Benjamins, Amsterdam, pages 325–346.
Murphy, Brian, Marco Baroni, and Massimo Poesio. 2009. EEG responds to conceptual stimuli
and corpus semantics. In Proceedings of the
Conference on Empirical Methods in Natural
Sanford, Anthony J. and Patrick Sturt. 2002.
Depth of processing in language comprehension: Not noticing the evidence. Trends in Cognitive Sciences 6:382–386.
Schuler, William, Samir AbdelRahman, Tim
Miller, and Lane Schwartz. 2010. Broad-coverage parsing using human-like memory constraints. Computational Linguistics
Staub, Adrian and Charles Clifton. 2006. Syntactic prediction in language comprehension: Evidence from either . . . or. Journal of Experimental Psychology: Learning, Memory, and Cognition 32:425–436.
Stewart, Andrew J., Martin J. Pickering, and Anthony J. Sanford. 2000. The time course of the
influence of implicit causality information: Focusing versus integration accounts. Journal of
Memory and Language 42(3):423–443.
Sturt, Patrick and Vincenzo Lombardo. 2005.
Processing coordinated structures: Incrementality and connectedness. Cognitive Science
Tanenhaus, Michael K., Michael J. Spivey-Knowlton, Kathleen M. Eberhard, and Julie C.
Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science 268:1632–1634.
Vasishth, Shravan and Richard L. Lewis. 2006.
Argument-head distance and processing complexity: Explaining both locality and antilocality effects. Language 82(4):767–794.
The Manually Annotated Sub-Corpus:
A Community Resource For and By the People
Nancy Ide
Department of Computer Science
Vassar College
Poughkeepsie, NY, USA
[email protected]
Collin Baker
International Computer Science Institute
Berkeley, California USA
[email protected]
Christiane Fellbaum
Princeton University
Princeton, New Jersey USA
[email protected]
Rebecca Passonneau
Columbia University
New York, New York USA
[email protected]
Proceedings of the ACL 2010 Conference Short Papers, pages 68–73,
Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics
Abstract
The Manually Annotated Sub-Corpus (MASC) project provides data and annotations to serve as the base for a community-wide annotation effort of a subset of the American National Corpus. The MASC infrastructure enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or ported to any of a variety of other formats. MASC includes data from a much wider variety of genres than existing multiply-annotated corpora of English, and the project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, community-based effort to create much needed language resources for NLP. This paper describes the MASC project, its corpus and annotations, and serves as a call for contributions of data and annotations from the language processing community.
Introduction
The need for corpora annotated for multiple phenomena across a variety of linguistic layers is keenly recognized in the computational linguistics community. Several multiply-annotated corpora exist, especially for Western European languages and for spoken data, but, interestingly, broad-based English language corpora with robust annotation for diverse linguistic phenomena are relatively rare. The most widely-used corpus of English, the British National Corpus, contains only part-of-speech annotation; and although it contains a wider range of annotation types, the fifteen million word Open American National Corpus annotations are largely unvalidated. The most well-known multiply-annotated and validated corpus of English is the one million word Wall Street Journal corpus known as the Penn Treebank (Marcus et al., 1993), which over the years has been fully or partially annotated for several phenomena over and above the original part-of-speech tagging and phrase structure annotation. The usability of these annotations is limited, however, by the fact that many of them were produced by independent projects using their own tools and formats, making it difficult to combine them in order to study their inter-relations. More recently, the OntoNotes project (Pradhan et al., 2007) released a one million word English corpus of newswire, broadcast news, and broadcast conversation that is annotated for Penn Treebank syntax, PropBank predicate argument structures, coreference, and named entities. OntoNotes comes closest to providing a corpus with multiple layers of annotation that can be analyzed as a unit via its representation of the annotations in a “normal form”. However, like the Wall Street Journal corpus, OntoNotes is limited in the range of genres it includes. It is also limited to only those annotations that may be produced by members of the OntoNotes project. In addition, use of the data and annotations with software other than the OntoNotes database API is not necessarily straightforward.
The sparseness of reliable multiply-annotated corpora can be attributed to several factors. The greatest obstacle is the high cost of manual production and validation of linguistic annotations. Furthermore, the production and annotation of corpora, even when they involve significant scientific research, often do not, per se, lead to publishable research results. It is therefore understandable that many research groups are unwilling to get involved in such a massive undertaking for relatively little reward.
[Table 1 row labels surviving extraction: Gov’t documents; Debate transcript; Court transcript; Travel guides]
The Manually Annotated Sub-Corpus (MASC) (Ide et al., 2008) project has been
established to address many of these obstacles
to the creation of large-scale, robust, multiply-annotated corpora. The project is providing
appropriate data and annotations to serve as the
base for a community-wide annotation effort,
together with an infrastructure that enables the
representation of internally-produced and contributed annotations in a single, usable format
that can then be analyzed as it is or ported to any
of a variety of other formats, thus enabling its
immediate use with many common annotation
platforms as well as off-the-shelf concordance
and analysis software. The MASC project’s aim is
to offset some of the high costs of producing high
quality linguistic annotations via a distribution of
effort, and to solve some of the usability problems
for annotations produced at different sites by
harmonizing their representation formats.
The MASC project provides a resource that is
significantly different from OntoNotes and similar corpora. It provides data from a much wider
variety of genres than existing multiply-annotated
corpora of English, and all of the data in the corpus are drawn from current American English so
as to be most useful for NLP applications. Perhaps most importantly, the MASC project is committed to a fully open model of distribution, without restriction, for all data and annotations. It is
also committed to incorporating diverse annotations contributed by the community, regardless of
format, into the corpus. As such, MASC is the
first large-scale, open, community-based effort to
create a much-needed language resource for NLP.
This paper describes the MASC project, its corpus
and annotations, and serves as a call for contributions of data and annotations from the language
processing community.
Table 1: MASC Composition (first 220K)
[Table 1 body lost in extraction; surviving column headers: No. texts, Total words]
MASC: The Corpus
MASC is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC)1. The OANC is a 15 million word (and growing) corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions.
Where licensing permits, data for inclusion in MASC is drawn from sources that have already been heavily annotated by others. So far, the first 80K increment of MASC data includes a 40K subset consisting of OANC data that has been previously annotated for PropBank predicate argument structures, Pittsburgh Opinion annotation (opinions, evaluations, sentiments, etc.), TimeML time and events2, and several other linguistic phenomena. It also includes a handful of small texts from the so-called Language Understanding (LU) Corpus3 that has been annotated by multiple groups for a wide variety of phenomena, including events and committed belief. All of the first 80K increment is annotated for Penn Treebank syntax. The second 120K increment includes 5.5K words of Wall Street Journal texts that have been annotated by several projects, including Penn Treebank, PropBank, Penn Discourse Treebank, TimeML, and the Pittsburgh Opinion project. The composition of the 220K portion of the corpus annotated so far is shown in Table 1. The remaining 280K of the corpus fills out the genres that are under-represented in the first portion and includes a few additional genres such as blogs and tweets.
The TimeML annotations of the data are not yet completed.
MASC contains about 2K words of the 10K LU corpus, eliminating non-English and translated LU texts as well as texts that are not free of usage and redistribution restrictions.
MASC Annotations
Annotations for a variety of linguistic phenomena, either manually produced or corrected from output of automatic annotation systems, are being added
to MASC data in increments of roughly 100K words. To date, validated or manually produced annotations for 222K words have been made available.
The MASC project is itself producing annotations for portions of the corpus for WordNet senses and FrameNet frames and frame elements. To derive maximal benefit from the semantic information provided by these resources, the entire corpus is also annotated and manually validated for shallow parses (noun and verb chunks) and named entities (person, location, organization, date and time). Several additional types of annotation have either been contracted by the MASC project or contributed from other sources. The 220K words of MASC I and II include seventeen different types of linguistic annotation4, shown in Table 2.
Table 2: Current MASC Annotations (* projected)
[Table 2 body lost in extraction; surviving column headers: Annotation type, No. texts, No. words; surviving row labels: Noun chunks, Verb chunks, Named entities, FrameNet frames, Penn Treebank, Committed belief]
All MASC annotations, whether contributed or produced in-house, are transduced to the Graph Annotation Framework (GrAF) (Ide and Suderman, 2007) defined by ISO TC37 SC4’s Linguistic Annotation Framework (LAF) (Ide and Romary, 2004). GrAF is an XML serialization of the LAF abstract model of annotations, which consists of a directed graph decorated with feature structures providing the annotation content. GrAF’s primary role is to serve as a “pivot” format for transducing among annotations represented in different formats. However, because the underlying data structure is a graph, the GrAF representation itself can serve as the basis for analysis via application of graph-analytic algorithms such as common subtree detection.
The layering of annotations over MASC texts dictates the use of a stand-off annotation representation format, in which each annotation is contained in a separate document linked to the primary data. Each text in the corpus is provided in UTF-8 character encoding in a separate file, which includes no annotation or markup of any kind. Each file is associated with a set of GrAF standoff files, one for each annotation type, containing the annotations for that text. In addition to the annotation types listed in Table 2, a document containing annotation for logical structure (titles, headings, sections, etc. down to the level of paragraph) is included. Each text is also associated with (1) a header document that provides appropriate metadata together with machine-processable information about associated annotations and interrelations among the annotation layers; and (2) a segmentation of the primary data into minimal regions, which enables the definition of different tokenizations over the text. Contributed annotations are also included in their original format, where available.
WordNet Sense Annotations
A focus of the MASC project is to provide corpus
evidence to support an effort to harmonize sense
distinctions in WordNet and FrameNet (Baker and
Fellbaum, 2009; Fellbaum and Baker, to appear).
The WordNet and FrameNet teams have selected
for this purpose 100 common polysemous words
whose senses they will study in detail, and the
MASC team is annotating occurrences of these
words in the MASC. As a first step, fifty occurrences of each word are annotated using the
WordNet 3.0 inventory and analyzed for problems in sense assignment, after which the WordNet team may make modifications to the inventory if needed. The revised inventory (which will
be released as part of WordNet 3.1) is then used to
annotate 1000 occurrences. Because of its small
size, MASC typically contains fewer than 1000 occurrences of a given word; the remaining occurrences are therefore drawn from the 15 million
words of the OANC. Furthermore, the FrameNet
team is also annotating one hundred of the 1000
sentences for each word with FrameNet frames
and frame elements, providing direct comparisons
of WordNet and FrameNet sense assignments in attested sentences.5
This includes WordNet sense annotations, which are not
listed in Table 2 because they are not applied to full texts; see
Section 3.1 for a description of the WordNet sense annotations in MASC.
For convenience, the annotated sentences are
provided as a stand-alone corpus, with the WordNet and FrameNet annotations represented in
standoff files. Each sentence in this corpus is
linked to its occurrence in the original text, so that
the context and other annotations associated with
the sentence may be retrieved.
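The linking described above can be pictured with a small Python sketch: the primary text carries no markup, and every annotation is stored separately as character offsets into it, so that a sentence can be mapped back to its context and to the other annotations that fall inside it. The class names, layer names, and sense labels below are hypothetical and are not taken from GrAF or the ANC tools.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a label plus character offsets into the text."""
    layer: str   # illustrative layer name, e.g. "sentence" or "sense"
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str


@dataclass
class Document:
    """Primary text with no markup; annotations are kept separately."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, layer, start, end, label):
        ann = Annotation(layer, start, end, label)
        self.annotations.append(ann)
        return ann

    def span(self, ann):
        """Recover the annotated string from the untouched primary text."""
        return self.text[ann.start:ann.end]

    def within(self, outer, layer=None):
        """Annotations lying inside `outer`, optionally limited to one layer."""
        return [
            a for a in self.annotations
            if a is not outer
            and outer.start <= a.start and a.end <= outer.end
            and (layer is None or a.layer == layer)
        ]


doc = Document("The bank raised rates. We sat on the bank.")
s1 = doc.add("sentence", 0, 22, "s1")
s2 = doc.add("sentence", 23, 42, "s2")
doc.add("sense", 4, 8, "bank%financial")   # made-up sense labels
doc.add("sense", 37, 41, "bank%river")

senses_in_s2 = doc.within(s2, layer="sense")
```

Because offsets always refer to the same unmodified text, layers produced by different groups can be added or dropped without touching one another.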
Automatically-produced annotations for sentence, token, part of speech, shallow parses (noun and verb chunks), and named entities (person, location, organization, date and time) are hand-validated by a team of students. Each annotation set is first corrected by one student, after which it is checked (and corrected where necessary) by a second student, and finally checked by both automatic extraction of the annotated data and a third pass over the annotations by a graduate student or senior researcher. We have performed inter-annotator agreement studies for shallow parses in order to establish the number of passes required to achieve near-100% accuracy.
Annotations produced by other projects and the FrameNet and Penn Treebank annotations produced specifically for MASC are semi-automatically and/or manually produced by those projects and subjected to their internal quality controls. No additional validation is performed by the ANC project.
The WordNet sense annotations are being used as a base for an extensive inter-annotator agreement study, which is described in detail in (Passonneau et al., 2009; Passonneau et al., 2010). All inter-annotator agreement data and statistics are published along with the sense tags. The release also includes documentation on the words annotated in each round, the sense labels for each word, the sentences for each word, and the annotator or annotators for each sense assignment to each word in context. For the multiply annotated data in rounds 2-4, we include raw tables for each word in the form expected by Ron Artstein’s calculate perl script6, so that the agreement numbers can be regenerated.
MASC Availability and Distribution
Like the OANC, MASC is distributed without license or other restrictions from the American National Corpus website7. It is also available from the Linguistic Data Consortium (LDC)8 for a nominal processing fee.
In addition to enabling download of the entire MASC, we provide a web application that allows users to select some or all parts of the corpus and choose among the available annotations via a web interface (Ide et al., 2010). Once generated, the corpus and annotation bundle is made available to the user for download. Thus, the MASC user need never deal directly with or see the underlying representation of the stand-off annotations, but gains all the advantages that representation offers. The following output formats are currently available:
1. in-line XML (XCES9), suitable for use with the BNC’s XAIRA search and access interface and other XML-aware software;
2. token / part of speech, a common input format for general-purpose concordance software such as MonoConc10, as well as the Natural Language Toolkit (NLTK) (Bird et al., 2009);
3. CONLL IOB format, used in the Conference on Natural Language Learning shared tasks.
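To make the third format concrete, the following Python sketch renders chunk spans over a token list as B-/I-/O tags, one token per line, using the variant in which every chunk begins with a B- tag. The function name and the input layout are assumptions made for illustration; this is not the ANC export code.

```python
def to_iob(tokens, chunks):
    """Render (word, POS) pairs plus chunk spans as 'word POS tag' lines.

    tokens: list of (word, pos) pairs
    chunks: list of (start, end, label) over token indices, end exclusive
    """
    tags = ["O"] * len(tokens)              # tokens outside any chunk
    for start, end, label in chunks:
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens inside it
    return [f"{word} {pos} {tag}" for (word, pos), tag in zip(tokens, tags)]


tokens = [("The", "DT"), ("corpus", "NN"), ("is", "VBZ"),
          ("growing", "VBG"), (".", ".")]
chunks = [(0, 2, "NP"), (2, 4, "VP")]
lines = to_iob(tokens, chunks)
print("\n".join(lines))
```

The sketch assumes that chunks do not overlap; nested annotations would need a richer representation than a single tag column.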
The ANC project provides an API for GrAF annotations that can be used to access and manipulate GrAF annotations directly from Java programs and render GrAF annotations in a format suitable for input to the open source GraphViz12 graph visualization application.13 Beyond this, the ANC project does not provide specific tools for use of the corpus, but rather provides the data in formats suitable for use with a variety of available applications, as described in section 4, together with means to import GrAF annotations into major annotation software platforms. In particular, the ANC project provides plugins for the General Architecture for Text Engineering (GATE) (Cunningham et al., 2002) to input and/or output annotations in GrAF format; a “CAS Consumer” to enable using GrAF annotations in the Unstructured Information Management Architecture (UIMA) (Ferrucci and Lally, 2004); and a corpus reader for importing MASC data and annotations into NLTK14.
Because the GrAF format is isomorphic to input to many graph-analytic tools, existing graph-analytic software can also be exploited to search and manipulate MASC annotations. Trivial merging of GrAF-based annotations involves simply combining the graphs for each annotation, after which graph minimization algorithms15 can be applied to collapse nodes with edges to common subgraphs to identify commonly annotated components. Graph-traversal and graph-coloring algorithms can also be applied in order to identify and generate statistics that could reveal interactions among linguistic phenomena that may have previously been difficult to observe. Other graph-analytic algorithms (including common sub-graph analysis, shortest paths, minimum spanning trees, connectedness, identification of articulation vertices, topological sort, graph partitioning, etc.) may also prove to be useful for mining information from a graph of annotations at multiple linguistic levels.
XML Corpus Encoding Standard,
Note that several MASC texts have been fully annotated for FrameNet frames and frame elements, in addition to the WordNet-tagged sentences.
Community Contributions
The ANC project solicits contributions of annotations of any kind, applied to any part or all of the MASC data. Annotations may be contributed in any format, either inline or standoff. All contributed annotations are ported to GrAF standoff format so that they may be used with other MASC annotations and rendered in the various formats the ANC tools generate. To accomplish this, the ANC project has developed a suite of internal tools and methods for automatically transducing other annotation formats to GrAF and for rapid adaptation of previously unseen formats.
[email protected] or uploaded via the ANC website16. The validity of annotations and supplemental documentation (if appropriate) are the responsibility of the contributor. MASC users may contribute evaluations and error reports for the various annotations on the ANC/MASC wiki17.
Contributions of unvalidated annotations for MASC and OANC data are also welcomed and are distributed separately. Contributions of unencumbered texts in any genre, including stories, papers, student essays, poetry, blogs, and email, are also solicited via the ANC web site and the ANC FaceBook page18, and may be uploaded at the contribution page cited above.
MASC is already the most richly annotated corpus of English available for widespread use. Because the MASC is an open resource that the community can continually enhance with additional annotations and modifications, the project serves as a model for community-wide resource development in the future. Past experience with corpora such as the Wall Street Journal shows that the community is eager to annotate available language data, and we anticipate even greater interest in MASC, which includes language data covering a range of genres that no existing resource provides. Therefore, we expect that as MASC evolves, more and more annotations will be contributed, thus creating a massive, inter-linked linguistic infrastructure for the study and processing of current American English in its many genres and varieties. In addition, by virtue of its WordNet and FrameNet annotations, MASC will be linked to parallel WordNets and FrameNets in languages other than English, thus creating a global resource for multi-lingual technologies, including machine translation.
The MASC project is supported by National
Science Foundation grant CRI-0708952. The
WordNet-FrameNet alignment work is supported
by NSF grant IIS 0705155.
Available in September, 2010.
Efficient algorithms for graph merging exist; see, e.g., (Habib et al., 2000).
Collin F. Baker and Christiane Fellbaum. 2009. WordNet and FrameNet as complementary resources for annotation. In Proceedings of the Third Linguistic Annotation Workshop, pages 125–129, Suntec, Singapore, August. Association for Computational Linguistics.
Rebecca Passonneau, Ansaf Salleb-Aouissi, Vikas Bhardwaj, and Nancy Ide. 2010. Word sense annotation of polysemous words by multiple annotators. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta.
Steven Bird, Ewan Klein, and Edward Loper.
2009. Natural Language Processing with Python.
O’Reilly Media, 1st edition.
Sameer S. Pradhan, Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph
Weischedel. 2007. OntoNotes: A unified relational
semantic representation. In ICSC ’07: Proceedings of the International Conference on Semantic
Computing, pages 517–526, Washington, DC, USA.
IEEE Computer Society.
Hamish Cunningham, Diana Maynard, Kalina
Bontcheva, and Valentin Tablan. 2002. GATE: A
framework and graphical development environment
for robust NLP tools and applications. In Proceedings
of ACL’02.
Christiane Fellbaum and Collin Baker. to appear.
Aligning verbs in WordNet and FrameNet. Linguistics.
David Ferrucci and Adam Lally. 2004. UIMA: An
architectural approach to unstructured information
processing in the corporate research environment.
Natural Language Engineering, 10(3-4):327–348.
Michel Habib, Christophe Paul, and Laurent Viennot.
2000. Partition refinement techniques: an interesting algorithmic tool kit. International Journal of
Foundations of Computer Science, 175.
Nancy Ide and Laurent Romary. 2004. International
standard for a linguistic annotation framework. Natural Language Engineering, 10(3-4):211–225.
Nancy Ide and Keith Suderman. 2007. GrAF: A graphbased format for linguistic annotations. In Proceedings of the Linguistic Annotation Workshop, pages
1–8, Prague, Czech Republic, June. Association for
Computational Linguistics.
Nancy Ide, Collin Baker, Christiane Fellbaum, Charles
Fillmore, and Rebecca Passonneau. 2008. MASC:
The Manually Annotated Sub-Corpus of American
English. In Proceedings of the Sixth International
Conference on Language Resources and Evaluation
(LREC), Marrakech, Morocco.
Nancy Ide, Keith Suderman, and Brian Simms. 2010.
ANC2Go: A web application for customized corpus creation. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta, May. European Language Resources Association.
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and
Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.
Rebecca J. Passonneau, Ansaf Salleb-Aouissi, and
Nancy Ide. 2009. Making sense of word sense
variation. In SEW ’09: Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, pages 2–9, Morristown, NJ, USA. Association for Computational Linguistics.
Correcting Errors in a Treebank Based on
Synchronous Tree Substitution Grammar
Yoshihide Kato1 and Shigeki Matsubara2
1Information Technology Center, Nagoya University
2Graduate School of Information Science, Nagoya University
Furo-cho, Chikusa-ku, Nagoya, 464-8601 Japan
[email protected]
This paper proposes a method of correcting annotation errors in a treebank. By using a synchronous grammar, the method
transforms parse trees containing annotation errors into the ones whose errors are
corrected. The synchronous grammar is
automatically induced from the treebank.
We report an experimental result of applying our method to the Penn Treebank. The
result demonstrates that our method corrects syntactic annotation errors with high
mation. By using an STSG, our method transforms parse trees containing errors into the ones
whose errors are corrected. The grammar is automatically induced from the treebank. To select
STSG rules which are useful for error correction,
we define a score function based on the occurrence
frequencies of the rules. An experimental result
shows that the selected rules archive high precision.
This paper is organized as follows: Section 2
gives an overview of previous work. Section 3 explains our method of correcting errors in a treebank. Section 4 reports an experimental result using the Penn Treebank.
2 Previous Work
Annotated corpora play an important role in the
fields such as theoretical linguistic researches or
the development of NLP systems. However, they
often contain annotation errors which are caused
by a manual or semi-manual mark-up process.
These errors are problematic for corpus-based researches.
To solve this problem, several error detection
and correction methods have been proposed so far
(Eskin, 2000; Nakagawa and Matsumoto, 2002;
Dickinson and Meurers, 2003a; Dickinson and
Meurers, 2003b; Ule and Simov, 2004; Murata
et al., 2005; Dickinson and Meurers, 2005; Boyd
et al., 2008). These methods detect corpus positions which are marked up incorrectly, and find
the correct labels (e.g. pos-tags) for those positions. However, the methods cannot correct errors
in structural annotation. This means that they are
insufficient to correct annotation errors in a treebank.
This paper proposes a method of correcting errors in structural annotation. Our method is based
on a synchronous grammar formalism, called synchronous tree substitution grammar (STSG) (Eisner, 2003), which defines a tree-to-tree transfor-
This section summarizes previous methods for
correcting errors in corpus annotation and discusses their problem.
Some research addresses the detection of errors in pos-annotation (Nakagawa and Matsumoto,
2002; Dickinson and Meurers, 2003a), syntactic
annotation (Dickinson and Meurers, 2003b; Ule
and Simov, 2004; Dickinson and Meurers, 2005),
and dependency annotation (Boyd et al., 2008).
These methods only detect corpus positions where
errors occur. It is unclear how we can correct the
Several methods can correct annotation errors
(Eskin, 2000; Murata et al., 2005). These methods are to correct tag-annotation errors, that is,
they simply suggest a candidate tag for each position where an error is detected. The methods
cannot correct syntactic annotation errors, because
syntactic annotation is structural. There is no approach to correct structural annotation errors.
To clarify the problem, let us consider an example. Figure 1 depicts two parse trees annotated according to the Penn Treebank annotation 1 . The
0 and *T* are null elements.
Proceedings of the ACL 2010 Conference Short Papers, pages 74–79,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
(a) incorrect parse tree
That ,
say -NONE0
good IN
say -NONE-
• a one-to-one alignment between nodes in the
elementary trees
For a tree pair ⟨t, t′ ⟩, the tree t and t′ are
called source and target, respectively. The nonterminal leaves of elementary trees are called frontier nodes. There exists a one-to-one alignment
between the frontier nodes in t and t′ . The rule
means that the structure which matches the source
elementary tree is transformed into the structure
which is represented by the target elementary tree.
Figure 2 shows an example of an STSG rule. The
subscripts indicate the alignment. This rule can
correct the errors in the parse tree (a) depicted in
Figure 1.
An STSG derives tree pairs. Any derivation
process starts with the pair of nodes labeled with
special symbols called start symbols. A derivation
proceeds in the following steps:
good IN
Figure 1: An example of a treebank error
parse tree (a) contains errors and the parse tree
(b) is the corrected version. In the parse tree (a),
the positions of the two subtrees (, ,) are erroneous. To correct the errors, we need to move the
subtrees to the positions which are directly dominated by the node PRN. This example demonstrates that we need a framework of transforming
tree structures to correct structural annotation errors.
Figure 2: An example of an STSG rule
[Figure 1, panel (b): correct parse tree]
1. Choose a pair of frontier nodes ⟨η, η′⟩ for which there exists an alignment.
2. Choose a rule ⟨t, t′⟩ s.t. label(η) = root(t) and label(η′) = root(t′), where label(η) is the label of η and root(t) is the root label of t.
3. Substitute t and t′ into η and η′, respectively.
3 Correcting Errors by Using Synchronous Grammar
To solve the problem described in Section 2, this
section proposes a method of correcting structural
annotation errors by using a synchronous tree substitution grammar (STSG) (Eisner, 2003). An
STSG defines a tree-to-tree transformation. Our
method induces an STSG which transforms parse
trees containing errors into the ones whose errors
are corrected.
Figure 3 shows a derivation process in an STSG.
In the rest of the paper, we focus on the rules
in which the source elementary tree is not identical to its target, since such identical rules cannot
contribute to error correction.
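The view of an STSG rule as a tree transducer can be illustrated with a small sketch. It is not the authors' implementation: the tree encoding, the function names `match`, `build` and `apply_rule`, and the example rule are assumptions made for illustration. The rule is modeled on the correction described for Figure 1 (moving the two commas so that PRN directly dominates them), and it is applied only at the root of the given tree.

```python
# A tree is (label, children); children is a list of trees or word strings.
# In a rule, a frontier node is written (label, k), where the integer k is
# its alignment index (the subscript used in the figures). Hypothetical encoding.

def match(pattern, tree, bindings):
    """Match a source elementary tree at the root of `tree`, recording the
    subtree found under each frontier node in `bindings`."""
    if isinstance(tree, str) or tree[0] != pattern[0]:
        return False
    body = pattern[1]
    if isinstance(body, int):          # frontier node: keep the whole subtree
        bindings[body] = tree
        return True
    if len(body) != len(tree[1]):
        return False
    for sub_pattern, child in zip(body, tree[1]):
        if isinstance(sub_pattern, str):
            if sub_pattern != child:
                return False
        elif not match(sub_pattern, child, bindings):
            return False
    return True


def build(pattern, bindings):
    """Instantiate a target elementary tree with the bound subtrees."""
    label, body = pattern
    if isinstance(body, int):
        return bindings[body]
    return (label, [c if isinstance(c, str) else build(c, bindings) for c in body])


def apply_rule(rule, tree):
    """Rewrite `tree` if the rule's source side matches at its root."""
    source, target = rule
    bindings = {}
    return build(target, bindings) if match(source, tree, bindings) else tree


# Illustrative rule: lift the two commas from under S to directly under PRN.
rule = (
    ("PRN", [("S", [(",", 1), ("NP", 2), ("VP", 3), (",", 4)])]),
    ("PRN", [(",", 1), ("S", [("NP", 2), ("VP", 3)]), (",", 4)]),
)
wrong = ("PRN", [("S", [(",", [","]), ("NP", ["they"]), ("VP", ["say"]), (",", [","])])])
fixed = apply_rule(rule, wrong)
```

A tree that the source side does not match is returned unchanged, which mirrors the remark that only rules whose source and target differ can contribute to correction.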
3.2 Inducing an STSG for Error Correction
This section describes a method of inducing an
STSG for error correction. The basic idea of
our method is similar to the method presented by
Dickinson and Meurers (2003b). Their method detects errors by seeking word sequences satisfying
the following conditions:
3.1 Synchronous Tree Substitution Grammar
First of all, we describe the STSG formalism. An
STSG defines a set of tree pairs. An STSG can be
treated as a tree transducer which takes a tree as
input and produces a tree as output. Each grammar
rule consists of the following elements:
• The word sequence occurs more than once in
the corpus.
• a pair of trees called elementary trees
Figure 3: A derivation process of tree pairs in an STSG
Figure 5: Another example of a parse tree containing a word sequence “, they say ,”
• Different syntactic labels are assigned to the
occurrences of the word sequence.
where yield(τ ) is the word sequence dominated
by τ .
Let us consider an example. If the parse trees
depicted in Figure 1 exist in the treebank T , the
pair of partial parse trees depicted in Figure 4 is
an element of P ara(T ). We also obtain this pair
in the case where there exists not the parse tree
(b) depicted in Figure 1 but the parse tree depicted
in Figure 5, which contains the word sequence “,
they say ,”.
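The construction of Para(T) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: trees are nested tuples `(label, child, ...)` with words as plain strings, a "partial parse tree" is taken to be a complete constituent, and the names `tree_yield`, `subtrees` and `pseudo_parallel_corpus` are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def tree_yield(tree):
    """Word sequence dominated by a node."""
    if isinstance(tree, str):
        return (tree,)
    return tuple(word for child in tree[1:] for word in tree_yield(child))


def subtrees(tree):
    """All constituents of a parse tree, including the tree itself."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)


def pseudo_parallel_corpus(treebank):
    """Ordered pairs of distinct subtrees sharing root label and yield."""
    groups = defaultdict(set)
    for sigma in treebank:
        for tau in subtrees(sigma):
            groups[(tau[0], tree_yield(tau))].add(tau)
    return {pair for group in groups.values() for pair in permutations(group, 2)}


tree_a = ("PRN", ("S", (",", ","), ("NP", "they"), ("VP", "say"), (",", ",")))
tree_b = ("PRN", (",", ","), ("S", ("NP", "they"), ("VP", "say")), (",", ","))
corpus = pseudo_parallel_corpus([tree_a, tree_b])
```

Both orderings of a pair are kept because, at this stage, nothing tells us which tree of the pair is the erroneous one.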
Unlike their method, our method seeks word sequences whose occurrences have different partial
parse trees. We call a collection of these word
sequences with partial parse trees a pseudo parallel corpus. Moreover, our method extracts STSG
rules which transform one partial tree into the other.
3.2.1 Constructing a Pseudo Parallel Corpus
3.2.2 Inducing a Grammar from a Pseudo
Parallel Corpus
Our method first constructs a pseudo parallel
corpus which represents a correspondence between parse trees containing errors and the ones
whose errors are corrected. The procedure is as
follows. Let T be the set of the parse trees occurring in the corpus. We write Sub(σ) for the
set of the partial parse trees included in the parse tree σ. A pseudo parallel corpus Para(T) is constructed as follows:
\[
Para(T) = \Bigl\{ \langle \tau, \tau' \rangle \;\Big|\; \tau, \tau' \in \bigcup_{\sigma \in T} Sub(\sigma) \,\wedge\, \tau \neq \tau' \,\wedge\, yield(\tau) = yield(\tau') \,\wedge\, root(\tau) = root(\tau') \Bigr\}
\]
Figure 4: An example of a partial parse tree pair in a pseudo parallel corpus
Our method induces an STSG from the pseudo
parallel corpus according to the method proposed
by Cohn and Lapata (2009). Cohn and Lapata’s
method can induce an STSG which represents a
correspondence in a parallel corpus. Their method
first determines an alignment of nodes between
pairs of trees in the parallel corpus and then extracts
STSG rules according to the alignments.
For partial parse trees τ and τ′, we define a node
alignment C(τ, τ′) as follows:
\[
C(\tau, \tau') = \Bigl\{ \langle \eta, \eta' \rangle \;\Big|\; \eta \in Node(\tau) \,\wedge\, \eta' \in Node(\tau') \,\wedge\, \eta \text{ is not the root of } \tau \,\wedge\, \eta' \text{ is not the root of } \tau' \,\wedge\, label(\eta) = label(\eta') \,\wedge\, yield(\eta) = yield(\eta') \Bigr\}
\]
where Node(τ) is the set of the nodes in τ, and
yield(η) is the word sequence dominated by η.
Figure 4 shows an example of a node alignment.
The subscripts indicate the alignment.
An STSG rule is extracted by deleting nodes in
a partial parse tree pair ⟨τ, τ′⟩ ∈ Para(T). The
procedure is as follows:
• For each ⟨η, η′⟩ ∈ C(τ, τ′), delete the descendants of η and η′.
For example, the rule shown in Figure 2 is extracted from the pair shown in Figure 4.
Figure 6: Examples of error correction rules induced from the Penn Treebank
measured the precision of the rules. The precision
is defined as follows:
\[
\text{precision} = \frac{\#\text{ of the positions where an error is corrected}}{\#\text{ of the positions to which some rule is applied}}
\]
3.3 Rule Selection
We manually checked whether each rule application corrected an error, because a corrected
version of the treebank does not exist². Furthermore, we
evaluated only the first 100 rules as ordered by
the score function described in Section 3.3, since
it is time-consuming and expensive to evaluate all
of the rules. These 100 rules were applied at 331
positions. The precision of the rules is 71.9%. We
also measured the precision of each rule individually: 70 rules
achieved 100% precision. These results demonstrate that our method can correct syntactic annotation errors with high precision. Moreover, 30
of these 70 rules transformed bracketed structures. This fact shows that the treebank contains
structural errors which cannot be dealt with by the
previous methods.
Figure 6 depicts examples of error correction
rules which achieved 100% precision. Rules (1),
(2) and (3) transform bracketed
structures, whereas Rule (4) simply replaces a node label. Rule (1) corrects an erroneous position of a
comma (see Figure 7 (a)). Rule (2) deletes a useless node NP in a subject position (see Figure 7
(b)). Rule (3) inserts a node NP (see Figure 7 (c)).
Rule (4) replaces a node label NP with the correct label PP (see Figure 7 (d)). These examples
demonstrate that our method can correct syntactic
annotation errors.
Figure 8 depicts an example where our method
detected an annotation error but could not correct
it. To correct the error, we need to attach the node
Some rules extracted by the procedure in Section
3.2 are not useful for error correction, since the
pseudo parallel corpus contains tree pairs whose
source tree is correct or whose target tree is incorrect. The rules which are extracted from such pairs
can be harmful. To select rules which are useful for error correction, we define a score function
which is based on the occurrence frequencies of
elementary trees in the treebank. The score function is defined as follows:
\[
Score(\langle t, t' \rangle) = \frac{f(t')}{f(t) + f(t')}
\]
where f(·) is the occurrence frequency in the treebank. The score function ranges from 0 to 1. We
assume that the occurrence frequency of an elementary tree matching incorrect parse trees is very
low. Under this assumption, Score(⟨t, t′⟩) is high when the source elementary tree t matches incorrect parse trees and
the target elementary tree t′ matches correct parse
trees. Therefore, STSG rules with high scores are
regarded as useful for error correction.
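As a small worked illustration of the score function, the sketch below ranks three candidate rules. The rule names and the frequency counts are invented for the example and do not come from the experiment; the function name `score` is likewise hypothetical.

```python
def score(f_source, f_target):
    """Score(<t, t'>) = f(t') / (f(t) + f(t')), with f the treebank frequency."""
    return f_target / (f_source + f_target)


# Hypothetical (f(t), f(t')) counts for three candidate rules.
candidates = {"rule_a": (2, 198), "rule_b": (50, 50), "rule_c": (90, 10)}

# Rules whose source side is rare and whose target side is common rank first.
ranked = sorted(candidates, key=lambda name: score(*candidates[name]), reverse=True)
```

Since both elementary trees of an extracted rule occur in the treebank at least once, the denominator is never zero.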
4 An Experiment
To evaluate the effectiveness of our method, we
conducted an experiment using the Penn Treebank
(Marcus et al., 1993).
We used 49208 sentences in Wall Street Journal
sections. We induced STSG rules by applying our
method to the corpus. We obtained 8776 rules. We
This also means that we cannot measure the recall of the
Figure 7: Examples of correcting syntactic annotation errors
Figure 8: An example where our method detected
an annotation error but could not correct it
SBAR under the node NP. We found that 22 of the
rule applications were of this type.
Figure 9 depicts a false positive example
where our method mistakenly transformed a correct syntactic structure. The score of the rule
is very high, since the source elementary tree
(TOP (NP NP VP .)) is less frequent. This
example shows that our method has a risk of
changing correct annotations of less frequent syntactic structures.
Figure 9: A false positive example where a correct
syntactic structure was mistakenly transformed
This paper proposes a method of correcting errors in a treebank by using a synchronous tree
substitution grammar. Our method constructs a
pseudo parallel corpus from the treebank and extracts STSG rules from the parallel corpus. The
experimental result demonstrates that we can obtain error correction rules with high precision.
In future work, we will explore a method of increasing the recall of error correction by constructing a wide-coverage STSG.
This research is partially supported by the Grant-in-Aid for Scientific Research (B) (No. 22300051)
of JSPS and by the Kayamori Foundation of Informational Science Advancement.
Adriane Boyd, Markus Dickinson, and Detmar Meurers. 2008. On detecting errors in dependency treebanks. Research on Language and Computation,
Trevor Cohn and Mirella Lapata. 2009. Sentence compression as tree transduction. Journal of Artificial
Intelligence Research, 34(1):637–674.
Markus Dickinson and Detmar Meurers. 2003a. Detecting errors in part-of-speech annotation. In Proceedings of the 10th Conference of the European
Chapter of the Association for Computational Linguistics, pages 107–114.
Markus Dickinson and Detmar Meurers. 2003b. Detecting inconsistencies in treebanks. In Proceedings
of the Second Workshop on Treebanks and Linguistic
Markus Dickinson and W. Detmar Meurers. 2005.
Prune diseased branches to get healthy trees! how
to find erroneous local trees in a treebank and why
it matters. In Proceedings of the 4th Workshop on
Treebanks and Linguistic Theories.
Jason Eisner. 2003. Learning non-isomorphic tree
mappings for machine translation. In Proceedings of
the 41st Annual Meeting of the Association for Computational Linguistics, Companion Volume, pages
Eleazar Eskin. 2000. Detecting errors within a corpus
using anomaly detection. In Proceedings of the 1st
North American chapter of the Association for Computational Linguistics Conference, pages 148–153.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large annotated
corpus of English: the Penn Treebank. Computational Linguistics, 19(2):310–330.
Masaki Murata, Masao Utiyama, Kiyotaka Uchimoto,
Hitoshi Isahara, and Qing Ma. 2005. Correction of
errors in a verb modality corpus for machine translation with a machine-learning method. ACM Transactions on Asian Language Information Processing,
Tetsuji Nakagawa and Yuji Matsumoto. 2002. Detecting errors in corpora using support vector machines.
In Proceedings of the 19th International Conference
on Computational Linguistics, pages 709–715.
Tylman Ule and Kiril Simov. 2004. Unexpected productions may well be errors. In Proceedings of 4th
International Conference on Language Resources
and Evaluation, pages 1795–1798.
Evaluating Machine Translations using mNCD
Marcus Dobrinkat and Tero Tapiovaara and Jaakko Väyrynen
Adaptive Informatics Research Centre
Aalto University School of Science and Technology
P.O. Box 15400, FI-00076 Aalto, Finland
Kimmo Kettunen
Kymenlaakso University of Applied Sciences
P.O. Box 9, FI-48401 Kotka, Finland
[email protected]
method. BADGER scores were directly compared
against the scores of METEOR and word error
rate (WER). The correlations between BADGER
and METEOR were low, and the correlations between
BADGER and WER were high. Kettunen (2009) uses
the NCD directly as an MT evaluation measure.
He showed with a small corpus of three language
pairs that NCD and METEOR 0.6 correlated for
translations of 10–12 MT systems. NCD was not
compared to human assessments of translations,
but correlations of NCD and METEOR scores
were very high for all the three language pairs.
Väyrynen et al. (2010) have extended the work
by including NCD in the ACL WMT08 evaluation
framework and showing that NCD is correlated
to human judgments. The NCD measure did not
match the performance of the state-of-the-art MT
evaluation measures in English, but it presented a
viable alternative to de facto standard BLEU (Papineni et al., 2001), which is simple and effective
but has been shown to have a number of drawbacks
(Callison-Burch et al., 2006).
Some recent advances in automatic MT evaluation have included non-binary matching between
compared items (Banerjee and Lavie, 2005; Agarwal and Lavie, 2008; Chan and Ng, 2009), which
is implicitly present in the string-based NCD measure. Our motivation is to investigate whether including additional language dependent resources
would improve the NCD measure. We experiment
with relaxed word matching using stemming and
a lexical database to allow lexical changes. These
additional modules attempt to make the reference
sentences more similar to the evaluated translations on the string level. We report an experiment
showing that document-level NCD and aggregated
NCD scores for individual sentences produce very
similar correlations to human judgments.
This paper introduces mNCD, a method
for automatic evaluation of machine translations. The measure is based on normalized compression distance (NCD), a
general information theoretic measure of
string similarity, and flexible word matching provided by stemming and synonyms.
The mNCD measure outperforms NCD in
system-level correlation to human judgments in English.
Automatic evaluation of machine translation (MT)
systems requires automated procedures to ensure consistency and efficient handling of large
amounts of data. In statistical MT systems, automatic evaluation of translations is essential for
parameter optimization and system development.
Human evaluation is too labor intensive, time consuming and expensive for daily evaluations. However, manual evaluation is important in the comparison of different MT systems and for the validation and development of automatic MT evaluation
measures, which try to model human assessments
of translations as closely as possible. Furthermore,
the ideal evaluation method would be language independent, fast to compute and simple.
Recently, normalized compression distance
(NCD) has been applied to the evaluation of
machine translations. NCD is a general information theoretic measure of string similarity, whereas most MT evaluation measures, e.g.,
BLEU and METEOR, are specifically constructed
for the task. Parker (2008) introduced BADGER, an MT evaluation measure that uses NCD
and a language independent word normalization
Proceedings of the ACL 2010 Conference Short Papers, pages 80–85,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
Variation in language leads to several acceptable translations for each source sentence, which
is why multiple reference translations are preferred in evaluation. Unfortunately, it is typical
to have only one reference translation. Paraphrasing techniques can produce additional translation
variants (Russo-Lassner et al., 2005; Kauchak and
Barzilay, 2006). These can be seen as new reference translations, similar to pseudo references (Ma
et al., 2007).
The proposed method, mNCD, works analogously to M-BLEU and M-TER, which use the
flexible word matching modules from METEOR
to find relaxed word-to-word alignments (Agarwal and Lavie, 2008). The modules are able to
align words even if they do not share the same
surface form, but instead have a common stem or
are synonyms of each other. A similarized translation reference is generated by replacing words in
the reference with their aligned counterparts from
the translation hypothesis. The NCD score is computed between the translations and the similarized
references to get the mNCD score.
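The similarization step can be sketched as below, using the first example of Table 1. The alignment dictionary is hand-written and merely stands in for what the METEOR exact, stem and synonym modules would produce; the function name `similarize` is an assumption of this sketch.

```python
def similarize(reference_tokens, alignment):
    """Replace aligned reference words by their counterparts in the hypothesis.

    `alignment` maps a reference token index to the aligned hypothesis word.
    """
    return [alignment.get(i, token) for i, token in enumerate(reference_tokens)]


reference = "There is no good way to halt gossip that has already begun to spread.".split()

# Hypothetical alignment: good -> effective, way -> means, halt -> stop.
alignment = {3: "effective", 4: "means", 6: "stop"}

similarized = " ".join(similarize(reference, alignment))
```

The NCD between the candidate translation and `similarized` would then give the mNCD score.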
Table 1 shows some hand-picked German–
English candidate translations along with a) the
reference translations including the 1-NCD score
to easily compare with METEOR and b) the similarized references including the mNCD score. For
comparison, the corresponding METEOR scores
without implicit relaxed matching are shown.
Figure 1: An example showing the compressed
sizes of two strings separately and concatenated.
Normalized Compression Distance
Normalized compression distance (NCD) is a similarity measure based on the idea that a string x is
similar to another string y when both share substrings. The description of y can reference shared
substrings in the known x without repetition, indicating shared information. Figure 1 shows an
example in which the compression of the concatenation of x and y results in a shorter output than
individual compressions of x and y.
The normalized compression distance, as defined by Cilibrasi and Vitanyi (2005), is given in
Equation 1, with C(x) as the length of the compression of x and C(x, y) as the length of the compression of the concatenation of x and y.
\[
NCD(x, y) = \frac{C(x, y) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \qquad (1)
\]
NCD computes the distance as a score closer to
one for very different strings and closer to zero for
more similar strings.
NCD is an approximation of the uncomputable
normalized information distance (NID), a general
measure for the similarity of two objects. NID
is based on the notion of Kolmogorov complexity K(x), a theoretical measure for the information content of a string x, defined as the length of the shortest
program for a universal Turing machine that prints x and stops
(Solomonoff, 1964). NCD approximates NID by
the use of a compressor C(x) that is an upper
bound of the Kolmogorov complexity K(x).
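A minimal sketch of the NCD computation, using Python's standard `bz2` module as the compressor (bz2 is one of the two compressors used in the experiments; PPMZ is not in the standard library). The function names and the test strings are assumptions of this sketch, and the random bytes only serve as an example of unrelated data.

```python
import bz2
import random


def compressed_size(data: bytes) -> int:
    """C(x): length in bytes of the bz2-compressed input."""
    return len(bz2.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


text = b"There is no good way to halt gossip that has already begun to spread. " * 20
noise = bytes(random.Random(0).randrange(256) for _ in range(len(text)))

same = ncd(text, text)        # identical strings: score near zero
different = ncd(text, noise)  # unrelated data: score near one
```

On very short inputs the fixed overhead of the compressor dominates, so the scores are most informative for longer texts.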
The proposed mNCD and the basic NCD measure
were evaluated by computing correlation to human judgments of translations. A high correlation
value between an MT evaluation measure and human judgments indicates that the measure is able
to evaluate translations in a more similar way to
Relaxed alignments with the METEOR modules exact, stem and synonym were created
for English for the computation of the mNCD
score. The synonym module was not available
with other target languages.
Normalized compression distance was not conceived with MT evaluation in mind, but rather it
is a general measure of string similarity. Implicit
non-binary matching with NCD is indicated by
preliminary experiments which show that NCD is
less sensitive to random changes on the character
level than, for instance, BLEU, which only counts
the exact matches between word n-grams. Thus
comparison of sentences at the character level
could account better for morphological changes.
Evaluation Data
The 2008 ACL Workshop on Statistical Machine
Translation (Callison-Burch et al., 2008) shared
task data includes translations from a total of 30
MT systems between English and five European
languages, as well as automatic and human translation evaluations for the translations. There are
several tasks, defined by the language pair and the
domain of translated text.

Candidate C / Reference R / Similarized Reference S
C  There is no effective means to stop a Tratsch, which was already included in the world.
R  There is no good way to halt gossip that has already begun to spread.
S  There is no effective means to stop gossip that has already begun to spread.

C  Crisis, not only in America
R  A Crisis Not Only in the U.S.
S  A Crisis not only in the America

C  Influence on the whole economy should not have this crisis.
R  Nevertheless, the crisis should not have influenced the entire economy.
S  Nevertheless, the crisis should not have Influence the entire economy.

C  Or the lost tight meeting will be discovered at the hands of a gentlemen?
R  Perhaps you see the pen you thought you lost lying on your colleague’s desk.
S  Perhaps you meeting the pen you thought you lost lying on your colleague’s desk.

Table 1: Example German–English translations showing the effect of relaxed matching in the 1-mNCD
score (for rows S) compared with METEOR using the exact module only, since the modules stem
and synonym are already used in the similarized reference. Replaced words are emphasized.

where for each system i, d_i is the difference between the rank derived from annotators’ input and
the rank obtained from the measure. From the annotators’ input, the n systems were ranked based
on the number of times each system’s output was
selected as the best translation divided by the number of times each system was part of a judgment.
We computed system-level correlations for
tasks with English, French, Spanish and German
as the target language¹.
The human judgments include three different
categories. The RANK category has human quality
rankings of five translations for one sentence from
different MT systems. The CONST category contains rankings for short phrases (constituents), and
the YES/NO category contains binary answers on whether a
short phrase is an acceptable translation or not.
For the translation tasks into English, the relaxed alignment using a stem module and the
synonym module affected 7.5 % of all words,
whereas only 5.1 % of the words were changed in
the tasks from English into the other languages.
The data was preprocessed in two different
ways. For NCD we kept the data as is, which we
called real casing (rc). Since the used METEOR
align module lowercases all text, we restored the
case information in mNCD by copying the correct
case from the reference translation to the similarized reference, based on METEOR’s alignment.
The other way was to lowercase all data (lc).
We compare mNCD against NCD and relate their
performance to other MT evaluation measures.
Block size effect on NCD scores
Väyrynen et al. (2010) computed NCD between a
set of candidate translations and references at the
same time regardless of the sentence alignments,
analogously to document comparison. We experimented with segmentation of the candidate translations into smaller blocks, which were individually evaluated with NCD and aggregated into a
single value with arithmetic mean. The resulting
system-level correlations between NCD and human judgments are shown in Figure 2 as a function
of the block size. The correlations are very similar with all block sizes, except for Spanish, where
smaller block size produces higher correlation. An
experiment with geometric mean produced similar
results. The reported results with mNCD use maximum block size, similar to Väyrynen et al. (2010).
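The block-wise evaluation can be sketched as follows. This is an illustration under assumptions: blocks are consecutive runs of `block_size` sentences, `bz2` is the compressor, both lists are non-empty and of equal length, and the names `ncd` and `block_ncd` are hypothetical.

```python
import bz2


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance with bz2 as the compressor."""
    size = lambda data: len(bz2.compress(data))
    cx, cy = size(x), size(y)
    return (size(x + y) - min(cx, cy)) / max(cx, cy)


def block_ncd(candidates, references, block_size):
    """Arithmetic mean of NCD over consecutive blocks of sentences."""
    scores = []
    for start in range(0, len(candidates), block_size):
        cand = "\n".join(candidates[start:start + block_size]).encode("utf-8")
        ref = "\n".join(references[start:start + block_size]).encode("utf-8")
        scores.append(ncd(cand, ref))
    return sum(scores) / len(scores)


candidates = ["Crisis, not only in America", "There is no effective means to stop gossip."]
references = ["A Crisis Not Only in the U.S.", "There is no good way to halt gossip."]

sentence_level = block_ncd(candidates, references, 1)
document_level = block_ncd(candidates, references, len(candidates))
```

With a block size of one the score is an average of sentence-level NCD values, and with the maximum block size it reduces to a single document-level comparison.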
System-level correlation
We follow the same evaluation methodology as in
Callison-Burch et al. (2008), which allows us to
measure how well MT evaluation measures correlate with human judgments on the system level.
Spearman’s rank correlation coefficient ρ was
calculated between each MT evaluation measure
and human judgment category using the simplified equation
\[
\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}
\]
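The simplified Spearman formula can be written out as a short function. It is a sketch that assumes n ≥ 2 systems and no tied ranks; the function name and the example rankings are invented for illustration.

```python
def spearman_rho(human_ranks, measure_ranks):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), with d_i the rank difference."""
    n = len(human_ranks)
    d_squared = sum((h - m) ** 2 for h, m in zip(human_ranks, measure_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Four hypothetical systems ranked by annotators and by an automatic measure.
agree = spearman_rho([1, 2, 3, 4], [1, 2, 3, 4])
swap = spearman_rho([1, 2, 3, 4], [2, 1, 3, 4])
reverse = spearman_rho([1, 2, 3, 4], [4, 3, 2, 1])
```

Identical rankings give 1, a fully reversed ranking gives -1, and swapping two adjacent systems out of four lowers the coefficient to 0.8.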
The English-Spanish news task was left out as most measures had negative correlation with human judgments.
[Figure 2 plot: system-level correlation with human judgements against block size in lines, with one curve per target language (into en, into de, into fr, into es)]
Figure 2: The block size has very little effect on
the correlation between NCD and human judgments. The right side corresponds to document
comparison and the left side to aggregated NCD
scores for sentences.
Table 2 shows the average system level correlation
of different NCD and mNCD variants for translations into English. The two compressors that
worked best in our experiments were PPMZ and
bz2. PPMZ is slower to compute but performs
slightly better compared to bz2, except for the
lowercased CONST category.
Table 2 shows that real casing improves RANK
correlation slightly throughout NCD and mNCD
variants, whereas it reduces correlation in the categories CONST and YES/NO as well as the mean.
The best mNCD (PPMZ rc) improves the best
NCD (PPMZ rc) method by 15% in the RANK
category. In the CONST category the best mNCD
(bz2 lc) improves the best NCD (bz2 lc) by 3.7%.
For the total average, the best mNCD (PPMZ rc)
improves the best NCD (bz2 lc) by 7.2%.
Table 3 shows the correlation results for the
RANK category by target language. As shown already in Table 2, mNCD clearly outperforms NCD
for English. Correlations for other languages show
mixed results and on average, mNCD gives lower
correlations than NCD.
mNCD against NCD
Table 3: mNCD versus NCD system correlation
RANK results with different parameters (the same
as in Table 2) for each target language. Higher
values are emphasized. Target languages DE, FR
and ES use only the stem module.
mNCD versus other methods
Table 4 presents the results for the selected mNCD
(PPMZ rc) and NCD (bz2 rc) variants along with
the correlations for other MT evaluation methods
from the WMT’08 data, based on the results in
Callison-Burch et al. (2008). The results are averages over language pairs into English, sorted
by RANK, which we consider the most significant category. Although mNCD correlation with
human evaluations improved over NCD, the ranking among other measures was not affected. Language- and task-specific results, not shown here, reveal very low mNCD and NCD correlations in the
Spanish-English news task, which significantly
Table 2: Mean system level correlations over
all translation tasks into English for variants of
mNCD and NCD. Higher values are emphasized.
Parameters are the compressor PPMZ or bz2 and
the preprocessing choice lowercasing (lc) or real
casing (rc).
Table 5: Average system-level correlations for the
RANK category from English for NCD, mNCD
and other MT evaluation measures.
similarized references. We believe there is potential for improvement in other languages as well if
synonym lexicons are available.
We have also extended the basic NCD measure
to scale between a document comparison measure and aggregated sentence-level measure. The
rather surprising result is that NCD produces quite
similar scores with all block sizes. The different
result with Spanish may be caused by differences
in the data or problems in the calculations.
Having used the same evaluation methodology as
Callison-Burch et al. (2008), we doubt
that it makes the most effective use of all the available human evaluations. The system-level correlation measure
rewards only the winner of each ranking of five different systems. If a system always scored second,
it would never be rewarded and would therefore be overly
penalized. In addition, the human knowledge that
gave the lower rankings is not exploited.
In future work with mNCD as an MT evaluation measure, we are planning to evaluate synonym dictionaries for other languages than English. The synonym module for English does
not distinguish between different senses of words.
Therefore, synonym lexicons found with statistical methods might provide a viable alternative
to manually constructed lexicons (Kauchak and
Barzilay, 2006).
Table 4: Average system-level correlations over
translation tasks into English for NCD, mNCD
and other MT evaluation measures
degrades the averages. Considering the mean of
the categories instead, mNCD’s correlation of .74
is third best together with ’posbleu’.
Table 5 shows the results from English. The table is shorter since many of the better MT measures use language specific linguistic resources
that are not easily available for languages other
than English. mNCD performs competitively only
for French, otherwise it falls behind NCD and
other methods as already shown earlier.
We have introduced a new MT evaluation measure, mNCD, which is based on normalized compression distance and METEOR’s relaxed alignment modules. The mNCD measure outperforms
NCD in English with all tested parameter combinations, whereas results with other target languages are unclear. The improved correlations
with mNCD did not change the position in the
RANK category of the MT evaluation measures in
the 2008 ACL WMT shared task.
The improvement in English was expected on
the grounds of the synonym module, and indicated
also by the larger number of affected words in the
Grazia Russo-Lassner, Jimmy Lin, and Philip Resnik.
2005. A paraphrase-based approach to machine
translation evaluation. Technical Report LAMP-TR-125/CS-TR-4754/UMIACS-TR-2005-57, University of Maryland, College Park.
Abhaya Agarwal and Alon Lavie. 2008. METEOR,
M-BLEU and M-TER: evaluation metrics for high-correlation with human rankings of machine translation output. In StatMT ’08: Proceedings of the
Third Workshop on Statistical Machine Translation,
pages 115–118, Morristown, NJ, USA. Association
for Computational Linguistics.
Ray Solomonoff. 1964. Formal theory of inductive
inference. Part I. Information and Control, 7(1):1–
Satanjeev Banerjee and Alon Lavie. 2005. METEOR:
An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June. Association for Computational
Jaakko J. Väyrynen, Tero Tapiovaara, Kimmo Kettunen, and Marcus Dobrinkat. 2010. Normalized
compression distance as an automatic MT evaluation metric. In Proceedings of MT 25 years on. To
Chris Callison-Burch, Miles Osborne, and Philipp
Koehn. 2006. Re-evaluating the role of BLEU
in machine translation research. In Proceedings of
EACL-2006, pages 249–256.
Chris Callison-Burch, Cameron Fordyce, Philipp
Koehn, Christoph Monz, and Josh Schroeder. 2008.
Further meta-evaluation of machine translation.
ACL Workshop on Statistical Machine Translation.
Yee Seng Chan and Hwee Tou Ng. 2009. MaxSim:
performance and effects of translation fluency. Machine Translation, 23(2-3):157–168.
Rudi Cilibrasi and Paul Vitanyi. 2005. Clustering
by compression. IEEE Transactions on Information
Theory, 51:1523–1545.
David Kauchak and Regina Barzilay. 2006. Paraphrasing for automatic evaluation. In Proceedings
of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics,
pages 455–462, Morristown, NJ, USA. Association
for Computational Linguistics.
Kimmo Kettunen. 2009. Packing it all up in search for
a language independent MT quality measure tool.
In Proceedings of LTC-09, 4th Language and Technology Conference, pages 280–284, Poznan.
Yanjun Ma, Nicolas Stroppa, and Andy Way. 2007.
Bootstrapping word alignment via word packing. In
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 304–
311, Prague, Czech Republic, June. Association for
Computational Linguistics.
K. Papineni, S. Roukos, T. Ward, and W. Zhu.
2001. BLEU: a method for automatic evaluation
of machine translation. Technical Report RC22176
(W0109-022), IBM Research Division, Thomas J.
Watson Research Center.
Steven Parker. 2008. BADGER: A new machine translation metric. In Metrics for Machine Translation
Challenge 2008, Waikiki, Hawai’i, October. AMTA.
Tackling Sparse Data Issue in Machine Translation Evaluation ∗
Ondřej Bojar, Kamil Kos, and David Mareček
Charles University in Prague, Institute of Formal and Applied Linguistics
{bojar,marecek}, [email protected]
We illustrate and explain problems of
n-gram-based machine translation (MT)
metrics (e.g. BLEU) when applied to
morphologically rich languages such as
Czech. A novel metric SemPOS based
on the deep-syntactic representation of the
sentence tackles the issue and retains the
performance for translation to English as
Figure 1: BLEU and human ranks of systems participating in the English-to-Czech WMT09 shared task.
Automatic metrics of machine translation (MT)
quality are vital for research progress at a fast
pace. Many automatic metrics of MT quality have
been proposed and evaluated in terms of correlation with human judgments, while various techniques of manual judging are being examined as
well; see e.g. MetricsMATR08 (Przybocki et al.,
2008), WMT08 and WMT09 (Callison-Burch et
al., 2008; Callison-Burch et al., 2009).
The contribution of this paper is twofold. Section 2 illustrates and explains severe problems of the
widely used BLEU metric (Papineni et al., 2002)
when applied to Czech as a representative of languages with rich morphology. We see this as an
instance of the sparse data problem well known
for MT itself: too much detail in the formal representation leads to low coverage of e.g. a translation dictionary. In MT evaluation, too much detail
leads to the lack of comparable parts of the hypothesis and the reference.
Section 3 introduces and evaluates some new
variations of SemPOS (Kos and Bojar, 2009), a
metric based on the deep syntactic representation
of the sentence that performs very well for Czech as
the target language. Aside from including dependency and n-gram relations in the scoring, we also
apply and evaluate SemPOS for English.
2 Problems of BLEU
BLEU (Papineni et al., 2002) is an established
language-independent MT metric. Its correlation
to human judgments was originally deemed high
(for English) but better correlating metrics (esp.
for other languages) were found later, usually employing language-specific tools, see e.g. Przybocki et al. (2008) or Callison-Burch et al. (2009).
The unbeaten advantage of BLEU is its simplicity.
Figure 1 illustrates a very low correlation to human judgments when translating to Czech. We
plot the official BLEU score against the rank established as the percentage of sentences where a
system ranked no worse than all its competitors
(Callison-Burch et al., 2009). The systems developed at Charles University (cu-) are described in
Bojar et al. (2009), uedin is a vanilla configuration
of Moses (Koehn et al., 2007) and the remaining
ones are commercial MT systems.
In a manual analysis, we identified the reasons
for the low correlation: BLEU is overly sensitive
to sequences and forms in the hypothesis matching
the reference translation. This focus goes directly
against the properties of Czech: relatively free
word order allows many permutations of words
and rich morphology renders many valid word
forms not confirmed by the reference.3 These
problems are to some extent mitigated if several
reference translations are available, but this is often not the case.
Figure 2 illustrates the problem of “sparse data”
in the reference. Due to the lexical and morphological variance of Czech, only a single word in
each hypothesis matches a word in the reference.
In the case of pctrans, the match is even a false
positive: “do” (to) is a preposition that should be
used for the “minus” phrase and not for the “end
of the day” phrase. In terms of BLEU, both hypotheses are equally poor but 90% of their tokens
were not evaluated.
Table 1 estimates the overall magnitude of this
issue: For 1-grams to 4-grams in 1640 instances
(different MT outputs and different annotators) of
200 sentences with manually flagged errors4, we
count how often the n-gram is confirmed by the
reference and how often it contains an error flag.
The suspicious cases are n-grams confirmed by
the reference but still containing a flag (false positives) and n-grams not confirmed despite containing no error flag (false negatives).
Fortunately, there are relatively few false positives in n-gram based metrics: 6.3% of unigrams
and far fewer higher n-grams.
The issue of false negatives is more serious and
confirms the problem of sparse data if only one
reference is available. 30 to 40% of n-grams do
not contain any error and yet they are not confirmed by the reference. This amounts to 34% of
running unigrams, giving enough space to differ in
human judgments and still remain unscored.
Table 1: n-grams confirmed by the reference and
containing error flags.
Figure 3 documents the issue across languages:
the lower the BLEU score itself (i.e. fewer confirmed n-grams), the lower the correlation to human judgments regardless of the target language
(WMT09 shared task, 2025 sentences per language).
Figure 4 illustrates the overestimation of scores
caused by too much attention to sequences of tokens. A phrase-based system like Moses (cu-bojar) can sometimes produce a long sequence of
tokens exactly as required by the reference, leading to a high BLEU score. The framed words
in the illustration are not confirmed by the reference, but the actual error in these words is very
severe for comprehension: nouns were used twice
instead of finite verbs, and a misleading translation of a preposition was chosen. The output by
pctrans preserves the meaning much better despite
not scoring in either of the finite verbs and producing far shorter confirmed sequences.
∗ This work has been supported by the grants EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003
of the Czech Republic), FP7-ICT-2009-4-247762 (Faust),
GA201/09/H057, GAUK 1163/2010, and MSM 0021620838.
We are grateful to the anonymous reviewers for further research suggestions.
Proceedings of the ACL 2010 Conference Short Papers, pages 86–91,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
3 Extensions of SemPOS
SemPOS (Kos and Bojar, 2009) is inspired by metrics based on overlapping of linguistic features in
the reference and in the translation (Giménez and
Márquez, 2007). It operates on so-called “tectogrammatical” (deep syntactic) representation of
the sentence (Sgall et al., 1986; Hajič et al., 2006),
formally a dependency tree that includes only autosemantic (content-bearing) words.5 SemPOS as
defined in Kos and Bojar (2009) disregards the
syntactic structure and uses the semantic part of
speech of the words (noun, verb, etc.). There are
19 fine-grained parts of speech. For each semantic
part of speech t, the overlapping O(t) is set to zero
if the part of speech does not occur in the reference
or the candidate set and otherwise it is computed
as given in Equation 1 below.
We use TectoMT (Žabokrtský and Bojar, 2008) for the linguistic pre-processing. While both our implementation of
SemPOS as well as TectoMT are in principle freely available, a stable public version has yet to be released. Our plans
include experiments with approximating the deep syntactic
analysis with a simple tagger, which would also decrease the
installation burden and computation costs, at the expense of
Condon et al. (2009) identify similar issues when evaluating translation to Arabic and employ rule-based normalization of MT output to improve the correlation. It is beyond
the scope of this paper to describe the rather different nature
of morphological richness in Czech, Arabic and also other
languages, e.g. German or Finnish.
The dataset with manually flagged errors is available at
Prague Stock Market falls to minus by the end of the trading day
pražská burza se ke konci obchodovánı́ propadla do minusu
praha stock market klesne k minus na konci obchodnı́ho dne
praha trh cenných papı́rů padá minus do konce obchodnı́ho dne
Figure 2: Sparse data in BLEU evaluation: Large chunks of hypotheses are not compared at all. Only a
single unigram in each hypothesis is confirmed in the reference.
Figure 3: BLEU correlates with its correlation to human judgments. BLEU scores around 0.1 predict
little about translation quality.
\[
O(t) = \frac{\sum_{i \in I} \sum_{w \in r_i \cap c_i} \min(\mathrm{cnt}(w, t, r_i), \mathrm{cnt}(w, t, c_i))}{\sum_{i \in I} \sum_{w \in r_i \cup c_i} \max(\mathrm{cnt}(w, t, r_i), \mathrm{cnt}(w, t, c_i))} \qquad (1)
\]
The semantic part of speech is denoted t; c_i
and r_i are the candidate and reference translations
of sentence i, and cnt(w, t, rc) is the number of
words w with type t in rc (the reference or the candidate). The matching is performed on the level of
lemmas, i.e. no morphological information is preserved in the words w. See Figure 5 for an example; the
sentence is the same as in Figure 4.
The final SemPOS score is obtained by macro-averaging over all parts of speech:
\[
\mathrm{SemPOS} = \frac{1}{|T|} \sum_{t \in T} O(t)
\]
where T is the set of all possible semantic parts
of speech types. (The degenerate case of blank
candidate and reference has SemPOS zero.)
3.1 Variations of SemPOS
This section describes our modifications of SemPOS. All methods are evaluated in Section 3.2.
Different Classification of Autosemantic
Words. SemPOS uses semantic parts of speech
to classify autosemantic words. The tectogrammatical layer offers also a feature called Functor
describing the relation of a word to its governor
similarly as semantic roles do. There are 67
functor types in total.
Using Functor instead of SemPOS increases the
number of word classes that independently require
a high overlap. For a contrast we also completely
remove the classification and use only one global
class (Void).
Deep Syntactic Relations in SemPOS. In
SemPOS, an autosemantic word of a class is confirmed if its lemma matches the reference. We utilize the dependency relations at the tectogrammatical layer to validate valence by refining the overlap and requiring also the lemma of 1) the parent
(denoted “par”), or 2) all the children regardless of
their order (denoted “sons”) to match.
Combining BLEU and SemPOS. One of the
major drawbacks of SemPOS is that it completely
ignores word order. This is too coarse even for
languages with relatively free word order like
Czech. Another issue is that it operates on lemmas
and it completely disregards correct word forms.
Thus, a weighted linear combination of SemPOS
and BLEU (computed on the surface representation of the sentence) should compensate for this.
For the purposes of the combination, we compute
BLEU only on unigrams up to fourgrams (denoted
BLEU1, ..., BLEU4) but including the brevity
penalty as usual. Here we try only a few weight
settings in the linear combination but given a held-out dataset, one could optimize the weights for the
best performance.
Congress yields: US government can pump 700 billion dollars into banks
kongres ustoupil : vláda usa může do bank napumpovat 700 miliard dolarů
kongres výnosy : vláda usa může čerpadlo 700 miliard dolarů v bankách
kongres vynášı́ : us vláda může čerpat 700 miliardu dolarů do bank
Figure 4: Too much focus on sequences in BLEU: pctrans’ output is better but does not score well.
BLEU gave credit to cu-bojar for 1, 3, 5 and 8 fourgrams, trigrams, bigrams and unigrams, resp., but
only for 0, 0, 1 and 8 n-grams produced by pctrans. Confirmed sequences of tokens are underlined and
important errors (not considered by BLEU) are framed.
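The n-gram credit described in the Figure 4 caption can be illustrated with a minimal sketch of clipped n-gram matching. This is not the official BLEU scorer: it only counts confirmed n-grams, without precision, geometric mean or brevity penalty. The function names are hypothetical, and the assignment of the second and third sentence of Figure 4 to cu-bojar and pctrans is an assumption that is consistent with the counts in the caption.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def confirmed_ngrams(hypothesis, reference, n):
    """Count hypothesis n-grams confirmed by the reference (clipped counts)."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    return sum(min(count, ref[gram]) for gram, count in hyp.items())

# Sentences from Figure 4 (assumed order: reference, cu-bojar, pctrans).
reference = "kongres ustoupil : vláda usa může do bank napumpovat 700 miliard dolarů".split()
cu_bojar = "kongres výnosy : vláda usa může čerpadlo 700 miliard dolarů v bankách".split()
pctrans = "kongres vynáší : us vláda může čerpat 700 miliardu dolarů do bank".split()

for name, hyp in [("cu-bojar", cu_bojar), ("pctrans", pctrans)]:
    # Confirmed fourgrams, trigrams, bigrams and unigrams, in that order.
    counts = [confirmed_ngrams(hyp, reference, n) for n in (4, 3, 2, 1)]
    print(name, counts)
```

Under these assumptions the sketch gives 1, 3, 5 and 8 confirmed n-grams for cu-bojar and 0, 0, 1 and 8 for pctrans, the same counts as quoted in the caption.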
kongres/n ustoupit/v :/n vláda/n usa/n banka/n napumpovat/v 700/n miliarda/n dolar/n
kongres/n výnos/n :/n vláda/n usa/n moci/v čerpadlo/n 700/n miliarda/n dolar/n banka/n
kongres/n vynášet/v :/n us/n vláda/n čerpat/v 700/n miliarda/n dolar/n banka/n
Figure 5: SemPOS evaluates the overlap of lemmas of autosemantic words given their semantic part of
speech (n, v, . . . ). Underlined words are confirmed by the reference.
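The overlap in Equation 1 and the macro-average over parts of speech can be sketched as follows. This is only an illustration, not the authors' implementation: the input is assumed to be already lemmatized and tagged in the `lemma/tag` format of Figure 5, the helper names are hypothetical, and the tag set is restricted to the two tags visible in the figure rather than the full inventory of semantic parts of speech.

```python
from collections import Counter

def parse(tagged):
    """Turn 'lemma/tag lemma/tag ...' into a Counter over (lemma, tag) pairs."""
    return Counter(tuple(item.rsplit("/", 1)) for item in tagged.split())

def overlap(tag, pairs):
    """O(t) for one tag, summed over (reference, candidate) sentence pairs."""
    num = den = 0
    for ref, cand in pairs:
        lemmas = {l for (l, t) in ref if t == tag} | {l for (l, t) in cand if t == tag}
        for lemma in lemmas:
            r, c = ref[(lemma, tag)], cand[(lemma, tag)]
            num += min(r, c)  # zero unless the lemma is in both sentences
            den += max(r, c)
    return num / den if den else 0.0

def sempos(pairs, tagset):
    """Macro-average of O(t) over the given tag set."""
    return sum(overlap(t, pairs) for t in tagset) / len(tagset)

# First two lines of Figure 5 (assumed to be the reference and one hypothesis).
reference = parse("kongres/n ustoupit/v :/n vláda/n usa/n banka/n napumpovat/v 700/n miliarda/n dolar/n")
candidate = parse("kongres/n výnos/n :/n vláda/n usa/n moci/v čerpadlo/n 700/n miliarda/n dolar/n banka/n")
pairs = [(reference, candidate)]
print(overlap("n", pairs), overlap("v", pairs), sempos(pairs, ("n", "v")))
```

Iterating over the union of lemmas is enough for both sums, because the minimum count is zero for any lemma missing from one side.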
SemPOS for English. The tectogrammatical
layer is being adapted for English (Cinková et al.,
2004; Hajič et al., 2009) and we are able to use the
available tools to obtain all SemPOS features for
English sentences as well.
3.2 Evaluation of SemPOS and Friends
We measured the metric performance on data used
in MetricsMATR08, WMT09 and WMT08. For
the evaluation of metric correlation with human
judgments at the system level, we used the Pearson
correlation coefficient ρ applied to ranks. In case
of a tie, the systems were assigned the average position. For example, if three systems achieved the
same highest score (thus occupying the positions
1, 2 and 3 when sorted by score), each of them
would obtain the average rank of 2 = (1+2+3)/3.
When correlating ranks (instead of exact scores)
and with this handling of ties, the Pearson coefficient is equivalent to Spearman’s rank correlation
coefficient.
The MetricsMATR08 human judgments include
preferences for pairs of MT systems saying which
one of the two systems is better, while the WMT08
and WMT09 data contain system scores (for up to
5 systems) on the scale 1 to 5 for a given sentence.
We assigned a human ranking to the systems based
on the percent of time that their translations were
judged to be better than or equal to the translations
of any other system in the manual evaluation. We
converted automatic metric scores to ranks.
Metrics’ performance for translation to English
and Czech was measured on the following testsets (the number of human judgments for a given
source language in brackets):
To English: MetricsMATR08 (cn+ar:
WMT08 News Articles (de: 199, fr: 251),
WMT08 Europarl (es: 190, fr: 183), WMT09
(cz: 320, de: 749, es: 484, fr: 786, hu: 287)
To Czech: WMT08 News Articles (en: 267),
WMT08 Commentary (en: 243), WMT09
(en: 1425)
The MetricsMATR08 testset contained 4 reference translations for each sentence whereas the remaining testsets had only one reference.
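The system-level correlation described above (Pearson's coefficient applied to ranks, with tied systems sharing the average position) can be sketched in plain Python. The function names are hypothetical and the sketch assumes that higher scores are better for both the metric and the human judgments; it is an illustration, not the evaluation script used for the reported results.

```python
from math import sqrt

def average_ranks(scores):
    """Rank scores from best (1) to worst; tied scores share the average position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next system has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + 1 + end + 1) / 2  # mean of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_correlation(metric_scores, human_scores):
    """Pearson on average ranks, i.e. Spearman's coefficient with tie handling."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))

# Three systems tied for the best score each receive rank 2.
print(average_ranks([0.9, 0.9, 0.9, 0.5]))
```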
Correlation coefficients for English are shown
in Table 2. The best metric is Voidpar closely followed by Voidsons . The explanation is that Void
compared to SemPOS or Functor does not lose
points by an erroneous assignment of the POS or
the functor, and that Voidpar profits from checking the dependency relations between autosemantic words. The combination of BLEU and SemPOS6 outperforms both individual metrics, but in
case of SemPOS only by a minimal difference.
Additionally, we confirm that 4-grams alone have
little discriminative power both when used as a
metric of their own (BLEU4 ) as well as in a linear combination with SemPOS.
The best metric for Czech (see Table 3) is a linear combination of SemPOS and 4-gram BLEU
closely followed by other SemPOS and BLEUn
combinations. We assume this is because BLEU4
can capture correctly translated fixed phrases,
which is positively reflected in human judgments.
Including BLEU1 in the combination favors translations with word forms as expected by the reference, thus making it possible to spot bad word forms. In
all cases, the linear combination puts more weight
on SemPOS. Given the negligible difference between SemPOS alone and the linear combinations,
we see that word forms are not the major issue for
humans interpreting the translation, most likely
because the systems so far often make more important errors. This is also confirmed by the observation that using BLEU alone is rather unreliable
for Czech and BLEU-1 (which judges unigrams
only) is even worse. Surprisingly BLEU-2 performed better than any other n-grams for reasons
that have yet to be examined. The error metrics
PER and TER showed the lowest correlation with
human judgments for translation to Czech.
6 For each n ∈ {1, 2, 3, 4}, we show only the best weight
setting for SemPOS and BLEUn.
Table 2: Average, best and worst system-level correlation coefficients for translation to English from
various source languages evaluated on 10 different testsets.
Table 3: System-level correlation coefficients for
English-to-Czech translation evaluated on 3 different testsets.
This paper documented problems of single-reference BLEU when applied to morphologically
rich languages such as Czech. BLEU suffers from
a sparse data problem, unable to judge the quality
of tokens not confirmed by the reference. This is
confirmed for other languages as well: the lower
the BLEU score the lower the correlation to human judgments.
We introduced a refinement of SemPOS, an
automatic metric of MT quality based on deep-syntactic representation of the sentence tackling
the sparse data issue. SemPOS was evaluated on
translation to Czech and to English, scoring better
than or comparable to many established metrics.
Ondřej Bojar, David Mareček, Václav Novák, Martin Popel, Jan Ptáček, Jan Rouš, and Zdeněk
Žabokrtský. 2009. English-Czech MT in 2008. In
Proceedings of the Fourth Workshop on Statistical
Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Chris Callison-Burch, Cameron Fordyce, Philipp
Koehn, Christof Monz, and Josh Schroeder. 2008.
Further meta-evaluation of machine translation. In
Proceedings of the Third Workshop on Statistical Machine Translation, pages 70–106, Columbus,
Ohio, June. Association for Computational Linguistics.
Chris Callison-Burch, Philipp Koehn, Christof Monz,
and Josh Schroeder. 2009. Findings of the 2009
workshop on statistical machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece. Association for
Computational Linguistics.
Silvie Cinková, Jan Hajič, Marie Mikulová, Lucie Mladová, Anja Nedolužko, Petr Pajas, Jarmila
Panevová, Jiřı́ Semecký, Jana Šindlerová, Josef
Toman, Zdeňka Urešová, and Zdeněk Žabokrtský.
2004. Annotation of English on the tectogrammatical level.
Technical Report TR-2006-35,
ÚFAL/CKL, Prague, Czech Republic, December.
Zdeněk Žabokrtský and Ondřej Bojar. 2008. TectoMT,
Developer’s Guide. Technical Report TR-2008-39,
Institute of Formal and Applied Linguistics, Faculty
of Mathematics and Physics, Charles University in
Prague, December.
Sherri Condon, Gregory A. Sanders, Dan Parvaz, Alan
Rubenstein, Christy Doran, John Aberdeen, and
Beatrice Oshika. 2009. Normalization for Automated Metrics: English and Arabic Speech Translation. In MT Summit XII.
Jesús Giménez and Lluı́s Márquez. 2007. Linguistic Features for Automatic Evaluation of Heterogenous MT Systems. In Proceedings of the Second
Workshop on Statistical Machine Translation, pages
256–264, Prague, June. Association for Computational Linguistics.
Jan Hajič, Silvie Cinková, Kristýna Čermáková, Lucie Mladová, Anja Nedolužko, Petr Pajas, Jiřı́ Semecký, Jana Šindlerová, Josef Toman, Kristýna
Tomšů, Matěj Korvas, Magdaléna Rysová, Kateřina
Veselovská, and Zdeněk Žabokrtský. 2009. Prague
English Dependency Treebank 1.0. Institute of Formal and Applied Linguistics, Charles University in
Prague, ISBN 978-80-904175-0-2, January.
Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr
Sgall, Petr Pajas, Jan Štěpánek, Jiřı́ Havelka,
Marie Mikulová, Zdeněk Žabokrtský, and Magda
Ševčı́ková Razı́mová. 2006. Prague Dependency
Treebank 2.0. LDC2006T01, ISBN: 1-58563-3704.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran,
Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra
Constantin, and Evan Herbst. 2007. Moses: Open
Source Toolkit for Statistical Machine Translation.
In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo
and Poster Sessions, pages 177–180, Prague, Czech
Republic, June. Association for Computational Linguistics.
Kamil Kos and Ondřej Bojar. 2009. Evaluation of Machine Translation Metrics for Czech as the Target
Language. Prague Bulletin of Mathematical Linguistics, 92.
Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: a Method for Automatic
Evaluation of Machine Translation. In ACL 2002,
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–
318, Philadelphia, Pennsylvania.
M. Przybocki, K. Peterson, and S. Bronsart. 2008. Official results of the NIST 2008 ”Metrics for MAchine TRanslation” Challenge (MetricsMATR08).
Petr Sgall, Eva Hajičová, and Jarmila Panevová. 1986.
The Meaning of the Sentence and Its Semantic
and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech Republic/Dordrecht,
Exemplar-Based Models for Word Meaning In Context
Katrin Erk
Sebastian Padó
Department of Linguistics
Institut für maschinelle Sprachverarbeitung
University of Texas at Austin
Stuttgart University
[email protected] [email protected]
This paper describes ongoing work on distributional models for word meaning in
context. We abandon the usual one-vector-per-word paradigm in favor of an exemplar
model that activates only relevant occurrences. On a paraphrasing task, we find
that a simple exemplar model outperforms
more complex state-of-the-art models.
Distributional models are a popular framework
for representing word meaning. They describe
a lemma through a high-dimensional vector that
records co-occurrence with context features over a
large corpus. Distributional models have been used
in many NLP analysis tasks (Salton et al., 1975;
McCarthy and Carroll, 2003), as
well as for cognitive modeling (Baroni and Lenci,
2009; Landauer and Dumais, 1997; McDonald and
Ramscar, 2001). Among their attractive properties
are their simplicity and versatility, as well as the
fact that they can be acquired from corpora in an
unsupervised manner.
Distributional models are also attractive as a
model of word meaning in context, since they do
not have to rely on fixed sets of dictionary senses
with their well-known problems (Kilgarriff, 1997;
McCarthy and Navigli, 2009). Also, they can
be used directly for testing paraphrase applicability (Szpektor et al., 2008), a task that has recently
become prominent in the context of textual entailment (Bar-Haim et al., 2007). However, polysemy
is a fundamental problem for distributional models.
Typically, distributional models compute a single
“type” vector for a target word, which contains co-occurrence counts for all the occurrences of the
target in a large corpus. If the target is polysemous, this vector mixes contextual features for all
the senses of the target. For example, among the
top 20 features for coach, we get match and team
(for the “trainer” sense) as well as driver and car
(for the “bus” sense). This problem has typically
been approached by modifying the type vector for
a target to better match a given context (Mitchell
and Lapata, 2008; Erk and Padó, 2008; Thater et
al., 2009).
In the terms of research on human concept representation, which often employs feature vector
representations, the use of type vectors can be understood as a prototype-based approach, which uses
a single vector per category. From this angle, computing prototypes throws away much interesting
distributional information. A rival class of models is that of exemplar models, which memorize
each seen instance of a category and perform categorization by comparing a new stimulus to each
remembered exemplar vector.
We can address the polysemy issue through an
exemplar model by simply removing all exemplars that are “not relevant” for the present context, or conversely activating only the relevant
ones. For the coach example, in the context of
a text about motorways, presumably an instance
like “The coach drove a steady 45 mph” would be
activated, while “The team lost all games since the
new coach arrived” would not.
In this paper, we present an exemplar-based distributional model for modeling word meaning in
context, applying the model to the task of deciding paraphrase applicability. With a very simple
vector representation and just using activation, we
outperform the state-of-the-art prototype models.
We perform an in-depth error analysis to identify
stable parameters for this class of models.
Related Work
Among distributional models of word meaning, there are
some approaches that address polysemy, either
by inducing a fixed clustering of contexts into
senses (Schütze, 1998) or by dynamically modifying a word’s type vector according to each given
sentence context (Landauer and Dumais, 1997;
Mitchell and Lapata, 2008; Erk and Padó, 2008;
Thater et al., 2009). Polysemy-aware approaches
also differ in their notion of context. Some use a
bag-of-words representation of words in the current sentence (Schütze, 1998; Landauer and Dumais, 1997), some make use of syntactic context (Mitchell and Lapata, 2008; Erk and Padó,
2008; Thater et al., 2009). The approach that we
present in the current paper computes a representation dynamically for each sentence context, using
a simple bag-of-words representation of context.
In cognitive science, prototype models predict
degree of category membership through similarity to a single prototype, while exemplar theory
represents a concept as a collection of all previously seen exemplars (Murphy, 2002). Griffiths et
al. (2007) found that the benefit of exemplars over
prototypes grows with the number of available exemplars. The problem of representing meaning in
context, which we consider in this paper, is closely
related to the problem of concept combination in
cognitive science, i.e., the derivation of representations for complex concepts (such as “metal spoon”)
given the representations of base concepts (“metal”
and “spoon”). While most approaches to concept
combination are based on prototype models, Voorspoels et al. (2009) show superior results for an
exemplar model based on exemplar activation.
In NLP, exemplar-based (memory-based) models have been applied to many problems (Daelemans et al., 1999). In the current paper, we use an
exemplar model for computing distributional representations for word meaning in context, using the
context to activate relevant exemplars. Comparing
representations of context, bag-of-words (BOW)
representations are more informative and noisier,
while syntax-based representations deliver sparser
and less noisy information. Following the hypothesis that richer, topical information is more suitable
for exemplar activation, we use BOW representations of sentential context in the current paper.
Sentential context | Paraphrases
After a fire extinguisher is used, it must always be returned for recharging and its use recorded. | bring back (3), take back (2), send back (1), give back (1)
We return to the young woman who is reading the Wrigley’s wrapping paper. | come back (3), revert (1), revisit (1), go (1)
Table 1: The Lexical Substitution (LexSub) dataset.
Proceedings of the ACL 2010 Conference Short Papers, pages 92–97,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
Exemplar Activation Models
We now present an exemplar-based model for
meaning in context. It assumes that each target
lemma is represented by a set of exemplars, where
an exemplar is a sentence in which the target occurs,
represented as a vector. We use lowercase letters
for individual exemplars (vectors), and uppercase
letters for sets of exemplars.
We model polysemy by activating relevant exemplars of a lemma E in a given sentence context
s. (Note that we use E to refer to both a lemma
and its exemplar set, and that s can be viewed as
just another exemplar vector.) In general, we define
activation of a set E by exemplar s as
\[
\mathrm{act}(E, s) = \{ e \in E \mid \mathrm{sim}(e, s) > \theta(E, s) \}
\]
where E is an exemplar set, s is the “point of comparison”, sim is some similarity measure such as
Cosine or Jaccard, and θ(E, s) is a threshold. Exemplars belong to the activated set if their similarity
to s exceeds θ(E, s).1 We explore two variants of
activation. In kNN activation, the k most similar exemplars to s are activated by setting θ to the
similarity of the k-th most similar exemplar. In
q-percentage activation, we activate the top q%
of E by setting θ to the (100-q)-th percentile of the
sim(e, s) distribution. Note that, while in the kNN
activation scheme the number of activated exemplars is the same for every lemma, this is not the
case for percentage activation: There, a more frequent lemma (i.e., a lemma with more exemplars)
will have more exemplars activated.
Exemplar activation for paraphrasing. A paraphrase is typically only applicable to a particular
sense of a target word. Table 1 illustrates this on
two examples from the Lexical Substitution (LexSub) dataset (McCarthy and Navigli, 2009), both
featuring the target return. The right column lists
appropriate paraphrases of return in each context
(given by human annotators).2 We apply the exemplar activation model to the task of predicting
paraphrase felicity: Given a target lemma T in a
particular sentential context s, and given a list of
potential paraphrases of T, the task is to predict
which of the paraphrases are applicable in s.
1 In principle, activation could be treated not just as binary
inclusion/exclusion, but also as a graded weighting scheme.
However, weighting schemes introduce a large number of
parameters, which we wanted to avoid.
2 Each annotator was allowed to give up to three paraphrases per target in context. As a consequence, the number
of gold paraphrases per target sentence varies.
Previous approaches (Mitchell and Lapata, 2008;
Erk and Padó, 2008; Erk and Padó, 2009; Thater
et al., 2009) have performed this task by modifying the type vector for T to the context s and then
comparing the resulting vector T′ to the type vector of a paraphrase candidate P. In our exemplar
setting, we select a contextually adequate subset
of contexts in which T has been observed, using
T′ = act(T, s) as a generalized representation of
meaning of target T in the context of s.
Previous approaches used all of P as a representation for a paraphrase candidate P. However,
P includes also irrelevant exemplars, while for a
paraphrase to be judged as good, it is sufficient that
one plausible reading exists. Therefore, we use
P′ = act(P, s) to represent the paraphrase.
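A minimal sketch of exemplar activation may help to make the two activation variants concrete. It assumes sparse bag-of-words vectors stored as Python dictionaries, Cosine as the similarity measure, and a direct top-k selection instead of an explicit threshold; the function names, the rounding in the percentage variant and the toy feature names are illustrative assumptions, not part of the model definition.

```python
from math import sqrt, ceil

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def activate_knn(exemplars, s, k):
    """kNN activation: keep the k exemplars most similar to the context s."""
    return sorted(exemplars, key=lambda e: cosine(e, s), reverse=True)[:k]

def activate_percentage(exemplars, s, q):
    """Percentage activation: keep the top q percent (assumed: at least one)."""
    k = max(1, ceil(len(exemplars) * q / 100))
    return activate_knn(exemplars, s, k)

def centroid(exemplars):
    """Average vector of a non-empty exemplar set."""
    total = {}
    for e in exemplars:
        for f, w in e.items():
            total[f] = total.get(f, 0.0) + w
    return {f: w / len(exemplars) for f, w in total.items()}

# Toy exemplars for "coach" (hypothetical features) and a motorway context.
bus_sense = {"drive-v": 1, "mph-n": 1}
trainer_sense = {"team-n": 1, "game-n": 1, "lose-v": 1}
context = {"motorway-n": 1, "drive-v": 1}
print(activate_knn([bus_sense, trainer_sense], context, 1))
```

An activated set can then be compared with another set through the Cosine of the two centroids, e.g. `cosine(centroid(A), centroid(B))`.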
kNN perc. kNN perc.
36.1 35.5 36.5 38.6
36.2 35.2 36.2 37.9
36.1 35.3 35.8 37.8
36.0 35.3 35.8 37.7
35.9 35.1 35.9 37.5
36.0 35.0 36.1 37.5
35.9 34.8 36.1 37.5
36.0 34.7 36.0 37.4
35.9 34.5 35.9 37.3
Table 2: Activation of T or P individually on the
full LexSub dataset (GAP evaluation)
sion (GAP), which interpolates the precision values
of top-n prediction lists for increasing n. Let G =
hq1 , . . . , qm i be the list of gold paraphrases with
gold weights hy1 , . . . , ym i. Let P = hp1 , . . . , pn i
be the list of model predictions as ranked by the
model, and let hx1 , . . . , xn i be the gold weights
associated with them (assume xi = 0 if pi 6∈ G),
where G ⊆ P . Let I(xi ) = 1P
if pi ∈ G, and zero
otherwise. We write xi = 1i ik=1 xk for the average gold weight of the first i model predictions,
and analogously yi . Then
Experimental Evaluation
Data. We evaluate our model on predicting paraphrases from the Lexical Substitution (LexSub)
dataset (McCarthy and Navigli, 2009). This dataset
consists of 2000 instances of 200 target words in
sentential contexts, with paraphrases for each target word instance generated by up to 6 participants.
Paraphrases are ranked by the number of annotators that chose them (cf. Table 1). Following Erk
and Padó (2008), we take the list of paraphrase candidates for a target as given (computed by pooling
all paraphrases that LexSub annotators proposed
for the target) and use the models to rank them for
any given sentence context.
As exemplars, we create bag-of-words cooccurrence vectors from the BNC. These vectors
represent instances of a target word by the other
words in the same sentence, lemmatized and POStagged, minus stop words. E.g., if the lemma
gnurge occurs twice in the BNC, once in the sentence “The dog will gnurge the other dog”, and
once in “The old windows gnurged”, the exemplar
set for gnurge contains the vectors [dog-n: 2, othera:1] and [old-a: 1, window-n: 1]. For exemplar
similarity, we use the standard Cosine similarity,
and for the similarity of two exemplar sets, the
Cosine of their centroids.
I(xi )xi
j=1 I(yj )yj
GAP (P, G) = Pm
Since the model may rank multiple paraphrases the
same, we average over 10 random permutations of
equally ranked paraphrases. We report mean GAP
over all items in the dataset.
Results and Discussion. We first computed two
models that activate either the paraphrase or the
target, but not both. Model 1, actT, activates only
the target, using the complete P as paraphrase, and
ranking paraphrases by sim(P, act(T, s)). Model
2, actP, activates only the paraphrase, using s as
the target word, ranking by sim(act(P, s), s).
The results for these models are shown in Table 2, with both kNN and percentage activation:
kNN activation with a parameter of 10 means that
the 10 closest neighbors were activated, while percentage with a parameter of 10 means that the closest 10% of the exemplars were used. Note first
that we computed a random baseline (last row)
with a GAP of 28.5. The second-to-last row (“no
activation”) shows two more informed baselines.
Evaluation. The model’s prediction for an item
is a list of paraphrases ranked by their predicted
goodness of fit. To evaluate them against a
weighted list of gold paraphrases, we follow Thater
et al. (2009) in using Generalized Average Precision (GAP).
[Table 3 axes: P activation (%) across columns, T activation (kNN) down rows]
The actT “no act” result (34.6) corresponds to a
prototype-based model that ranks paraphrase candidates by the distance between their type vectors
and the target’s type vector. Virtually all exemplar models outperform this prototype model. Note
also that both actT and actP show the best results
for small values of the activation parameter. This
indicates paraphrases can be judged on the basis
of a rather small number of exemplars. Nevertheless, actT and actP differ with regard to the details
of their optimal activation. For actT, a small absolute number of activated exemplars (here, 20)
works best, while actP yields the best results for
a small percentage of paraphrase exemplars. This
can be explained by the different functions played
by actT and actP (cf. Section 3): Activation of the
paraphrase must allow a guess about whether there
is a reasonable interpretation of P in the context s.
This appears to require a reasonably sized sample
from P. In contrast, target activation merely has to
counteract the sparsity of s, and activation of too
many exemplars from T leads to oversmoothing.
We obtained significances by computing 95%
and 99% confidence intervals with bootstrap resampling. As a rule of thumb, we find that 0.4%
difference in GAP corresponds to a significant difference at the 95% level, and 0.7% difference in
GAP to significance at the 99% level. The four
activation methods (i.e., columns in Table 2) are
significantly different from each other, with the exception of the pair actT/kNN and actP/kNN (n.s.),
so that we get the following order:
Table 3: Joint activation of P and T on the full
LexSub dataset (GAP evaluation)
we fix the actP activation level, we find comparatively large performance differences between the
T activation settings k=5 and k=10 (highly significant for 10% actP, and significant for 20% and
30% actP). On the other hand, when we fix the
actT activation level, changes in actP activation
generally have an insignificant impact.
Somewhat disappointingly, we are not able to
surpass the best result for actP alone. This indicates
that – at least in the current vector space – the
sparsity of s is less of a problem than the “dilution”
of s that we face when we represent the target
word by exemplars of T close to s. Note, however,
that the numerically worse performance of the best
actTP model is still not significantly different from
the best actP model.
Influence of POS and frequency. An analysis
of the results by target part-of-speech showed that
the globally optimal parameters also yield the best
results for individual POS, even though there are
substantial differences among POS. For actT, the
best results emerge for all POS with kNN activation
with k between 10 and 30. For k=20, we obtain a
GAP of 35.3 (verbs), 38.2 (nouns), and 35.1 (adjectives). For actP, the best parameter for all POS was
activation of 10%, with GAPs of 36.9 (verbs), 41.4
(nouns), and 37.5 (adjectives). Interestingly, the
results for actTP (verbs: 38.4, nouns: 40.6, adjectives: 36.9) are better than actP for verbs, but worse
for nouns and adjectives, which indicates that the
sparsity problem might be more prominent for verbs than for
the other POS. In all three models, we found a clear
effect of target and paraphrase frequency, with deteriorating performance for the highest-frequency
targets as well as for the lemmas with the highest
average paraphrase frequency.
actP/perc > actP/kNN ≈ actT/kNN > actT/perc
where > means “significantly outperforms”. In particular, the best method (actP/perc) outperforms
all other methods at p<0.01. Here, the best parameter setting (10% activation) is also significantly
better than the next-best one (20% activation). With
the exception of actT/perc, all activation methods
significantly outperform the best baseline (actP, no activation).
Based on these observations, we computed a
third model, actTP, that activates both T (by kNN)
and P (by percentage), ranking paraphrases by
sim(act(P, s), act(T, s)). Table 3 shows the results. We find the overall best model at a similar
location in parameter space as for actT and actP
(cf. Table 2), namely by setting the activation parameters to small values. The sensitivity of the
parameters changes considerably, though. When
Comparison to other models. Many of the
other models are syntax-based and are therefore
only applicable to a subset of the LexSub data.
We have re-evaluated our exemplar models on the
subsets we used in Erk and Padó (2008, EP08, 367
sult from activating a low absolute number of exemplars. Paraphrase representations are best activated
with a percentage-based threshold. Overall, we
found that paraphrase activation had a much larger
impact on performance than target activation, and
that drawing on target exemplars other than s to
represent the target meaning in context improved
over using s itself only for verbs (Tab. 3). This suggests the possibility of considering T’s activated
paraphrase candidates as the representation of T in
the context s, rather than some vector of T itself,
in the spirit of Kintsch (2001).
While it is encouraging that the best parameter
settings involved the activation of only a few exemplars, computation with exemplar models still requires the management of large numbers of vectors.
The computational overhead can be reduced by using data structures that cut down on the number
of vector comparisons, or by decreasing vector dimensionality (Gorman and Curran, 2006). We will
experiment with those methods to determine the
tradeoff of runtime and accuracy for this task.
Another area of future work is to move beyond
bag-of-words context: It is known from WSD
that syntactic and bag-of-words contexts provide
complementary information (Florian et al., 2002;
Szpektor et al., 2008), and we hope that they can be
integrated in a more sophisticated exemplar model.
Finally, we plan to explore task-based evaluations. Relation extraction and textual entailment
in particular are tasks where similar models have
been used before (Szpektor et al., 2008).
Acknowledgements. This work was supported
in part by National Science Foundation grant IIS-0845925, and by a Morris Memorial Grant from
the New York Community Trust.
Table 4: Comparison to other models on two subsets of LexSub (GAP evaluation)
datapoints) and Erk and Padó (2009, EP09, 100 datapoints). The second set was also used by Thater et
al. (2009, TDP09). The results in Table 4 compare
these models against our best previous exemplar
models and show that our models outperform these
models across the board. 3 Due to the small sizes
of these datasets, statistical significance is more
difficult to attain. On EP09, the differences among
our models are not significant, but the difference
between them and the original EP09 model is.4 On
EP08, all differences are significant except for actP
vs. actTP.
We note that both the EP08 and the EP09
datasets appear to be simpler to model than the
complete Lexical Substitution dataset, at least by
our exemplar-based models. This underscores an
old insight: namely, that direct syntactic neighbors,
such as arguments and modifiers, provide strong
clues as to word sense.
Conclusions and Outlook
This paper reports on work in progress on an exemplar activation model as an alternative to one-vector-per-word approaches to word meaning in
context. Exemplar activation is very effective in
handling polysemy, even with a very simple (and
sparse) bag-of-words vector representation. On
both the EP08 and EP09 datasets, our models surpass more complex prototype-based approaches
(Tab. 4). It is also noteworthy that the exemplar
activation models work best when few exemplars
are used, which bodes well for their efficiency.
We found that the best target representations re-
R. Bar-Haim, I. Dagan, I. Greental, and E. Shnarch.
2007. Semantic inference at the lexical-syntactic
level. In Proceedings of AAAI, pages 871–876, Vancouver, BC.
M. Baroni and A. Lenci. 2009. One distributional
memory, many semantic spaces. In Proceedings of
the EACL Workshop on Geometrical Models of Natural Language Semantics, Athens, Greece.
3 Since our models had the advantage of being tuned on
the dataset, we also report the range of results across the
parameters we tested. On the EP08 dataset, we obtained 33.1–
36.5 for actT; 33.3–38.0 for actP; 37.7–39.9 for actTP. On the
EP09 dataset, the numbers were 35.8–39.1 for actT; 38.1–39.9
for actP; 37.2–39.8 for actTP.
4 We did not have access to the TDP09 predictions to do
significance testing.
W. Daelemans, A. van den Bosch, and J. Zavrel. 1999.
Forgetting exceptions is harmful in language learning. Machine Learning, 34(1/3):11–43. Special Issue on Natural Language Learning.
K. Erk and S. Padó. 2008. A structured vector space
model for word meaning in context. In Proceedings
of EMNLP, pages 897–906, Honolulu, HI.
I. Szpektor, I. Dagan, R. Bar-Haim, and J. Goldberger.
2008. Contextual preferences. In Proceedings of
ACL, pages 683–691, Columbus, OH.
K. Erk and S. Padó. 2009. Paraphrase assessment in
structured vector space: Exploring parameters and
datasets. In Proceedings of the EACL Workshop on
Geometrical Models of Natural Language Semantics, Athens, Greece.
S. Thater, G. Dinu, and M. Pinkal. 2009. Ranking
paraphrases in context. In Proceedings of the ACL
Workshop on Applied Textual Inference, pages 44–
47, Singapore.
W. Voorspoels, W. Vanpaemel, and G. Storms. 2009.
The role of extensional information in conceptual
combination. In Proceedings of CogSci.
R. Florian, S. Cucerzan, C. Schafer, and D. Yarowsky.
2002. Combining classifiers for word sense disambiguation. Journal of Natural Language Engineering, 8(4):327–341.
J. Gorman and J. R. Curran. 2006. Scaling distributional similarity to large corpora. In Proceedings of
ACL, pages 361–368, Sydney.
T. Griffiths, K. Canini, A. Sanborn, and D. J. Navarro.
2007. Unifying rational models of categorization
via the hierarchical Dirichlet process. In Proceedings of CogSci, pages 323–328, Nashville, TN.
A. Kilgarriff. 1997. I don’t believe in word senses.
Computers and the Humanities, 31(2):91–113.
W. Kintsch. 2001. Predication. Cognitive Science,
T. Landauer and S. Dumais. 1997. A solution to Plato’s
problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.
D. McCarthy and J. Carroll. 2003. Disambiguating
nouns, verbs, and adjectives using automatically acquired selectional preferences. Computational Linguistics, 29(4):639–654.
D. McCarthy and R. Navigli. 2009. The English lexical substitution task. Language Resources and Evaluation, 43(2):139–159. Special Issue on Computational Semantic Analysis of Language: SemEval2007 and Beyond.
S. McDonald and M. Ramscar. 2001. Testing the distributional hypothesis: The influence of context on
judgements of semantic similarity. In Proceedings
of CogSci, pages 611–616.
J. Mitchell and M. Lapata. 2008. Vector-based models
of semantic composition. In Proceedings of ACL,
pages 236–244, Columbus, OH.
G. L. Murphy. 2002. The Big Book of Concepts. MIT
G. Salton, A. Wang, and C. Yang. 1975. A vector-space model for information retrieval. Journal of the
American Society for Information Science, 18:613–
H. Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–124.
A Structured Model for Joint Learning of
Argument Roles and Predicate Senses
Masayuki Asahara Yuji Matsumoto
Graduate School of Information Science
Nara Institute of Science and Technology
8916-5 Takayama, Ikoma,
Nara, 630-0192, Japan
{masayu-a, matsu}
Yotaro Watanabe
Graduate School of Information Sciences
Tohoku University
6-6-05, Aramaki Aza Aoba, Aoba-ku,
Sendai 980-8579, Japan
[email protected]
of core arguments in predicate-argument structure
analysis. They used argument sequences tied with
a predicate sense (e.g. AGENT-buy.01/Active-PATIENT) as a feature for the re-ranker of the
system, where predicate sense and argument role
candidates are generated by their pipelined architecture. They reported that incorporating this type
of feature provides a substantial gain in system performance.
The other factor is inter-dependencies between
a predicate sense and argument roles, which relate to selectional preference, and motivated us
to jointly identify a predicate sense and its argument roles. This type of dependency has been
explored by Riedel and Meza-Ruiz (2008; 2009b;
2009a), all of which use Markov Logic Networks
(MLN). These works use global formulae that
have atoms in terms of both a predicate sense and
each of its argument roles, so that the system identifies predicate senses and argument roles simultaneously.
Ideally, we want to capture both types of dependencies simultaneously. The former approaches
cannot explicitly include features that capture
inter-dependencies between a predicate sense and
its argument roles, although these are implicitly incorporated by re-ranking, where the most plausible assignment is selected from a small subset of
predicate and argument candidates that are generated independently. On the other hand, it is difficult to deal with core argument features in MLN:
because the number of core arguments varies with
the role assignments, this type of feature cannot
be expressed by a single formula.
Thompson et al. (2010) proposed a generative model that captures both predicate senses
and their argument roles. However, the first-order
Markov assumption of the model eliminates the ability to capture non-local dependencies among arguments. Also, generative models are in general
inferior to discriminatively trained linear or log-
In predicate-argument structure analysis,
it is important to capture non-local dependencies among arguments and interdependencies between the sense of a predicate and the semantic roles of its arguments. However, no existing approach explicitly handles both non-local dependencies and semantic dependencies between
predicates and arguments. In this paper we propose a structured model that
overcomes the limitation of existing approaches; the model captures both types of
dependencies simultaneously by introducing four types of factors including a global
factor type capturing non-local dependencies among arguments and a pairwise factor type capturing local dependencies between a predicate and an argument. In
experiments, the proposed model achieved
competitive results compared to the state-of-the-art systems without applying any
feature selection procedure.
1 Introduction
Predicate-argument structure analysis is a process
of assigning who does what to whom, where,
when, etc. for each predicate. Arguments of a
predicate are assigned particular semantic roles,
such as Agent, Theme, Patient, etc. Lately,
predicate-argument structure analysis has been regarded as a task of assigning semantic roles of
arguments as well as word senses of a predicate
(Surdeanu et al., 2008; Hajič et al., 2009).
Several researchers have paid much attention to
predicate-argument structure analysis, and the following two important factors have been shown.
Toutanova et al. (2008), Johansson and Nugues
(2008), and Björkelund et al. (2009) demonstrated
the importance of capturing non-local dependencies
Proceedings of the ACL 2010 Conference Short Papers, pages 98–102,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
2.1 Factors of the Model
We define four types of factors for the model.
Predicate Factor $F_P$ scores a sense of $p$, and
does not depend on any arguments. The score
function is defined by $F_P(x, p, A) = w \cdot \Phi_P(x, p)$.
Argument Factor $F_A$ scores a label assignment
of a particular argument $a \in A$. The score is determined independently of the predicate sense, and
is given by $F_A(x, p, a) = w \cdot \Phi_A(x, a)$.
Figure 1: Undirected graphical model representation of the structured model
Predicate-Argument Pairwise Factor $F_{PA}$ captures inter-dependencies between
a predicate sense and one of its argument roles.
The score function is defined as
$F_{PA}(x, p, a) = w \cdot \Phi_{PA}(x, p, a)$. The difference from $F_A$ is that $F_{PA}$ influences both
the predicate sense and the argument role. By
introducing this factor, the role label can be
influenced by the predicate sense, and vice versa.
linear models.
In this paper we propose a structured model
that overcomes limitations of the previous approaches. For the model, we introduce several
types of features including those that capture both
non-local dependencies of core arguments, and
inter-dependencies between a predicate sense and
its argument roles. By doing this, both tasks are
mutually influenced, and the model determines
the most plausible set of assignments of a predicate sense and its argument roles simultaneously.
We present an exact inference algorithm for the
model, and a large-margin learning algorithm that
can handle both local and global features.
Global Factor $F_G$ is introduced to capture the plausibility of the whole predicate-argument structure.
Like the other factors, the score function is defined as $F_G(x, p, A) = w \cdot \Phi_G(x, p, A)$. A possible feature that can be considered by this factor is the mutual dependencies among core arguments. For instance, if a predicate-argument structure has an agent (A0) followed by the predicate
and a patient (A1), we encode the structure as a
string A0-PRED-A1 and use it as a feature. This
type of feature indicates the plausibility of predicate-argument structures. Even if the highest-scoring
predicate-argument structure under the other factors
misses some core arguments, the global feature
pushes the model to fill in the missing arguments.
The numbers of factors for each factor type are:
1 each for $F_P$ and $F_G$, and $|A|$ each for $F_A$ and $F_{PA}$. By integrating all the factors, the score function becomes
$$s(p, A) = w \cdot \Phi_P(x, p) + w \cdot \Phi_G(x, p, A) + w \cdot \sum_{a \in A} \{\Phi_A(x, a) + \Phi_{PA}(x, p, a)\}.$$
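The four-factor score can be sketched in a few lines. This is only an illustration of the decomposition above: the toy feature templates, the `CORE_ROLES` inventory, and the input layout (a dict with `words` and the predicate index `pred`, arguments as `(index, role)` pairs) are assumptions made for the example, not the feature set of Table 1.

```python
CORE_ROLES = {"A0", "A1", "A2", "A3", "A4", "A5"}  # assumed core-role inventory

def dot(w, feats):
    """Inner product of a sparse weight dict with a sparse feature dict."""
    return sum(w.get(name, 0.0) * value for name, value in feats.items())

def phi_p(x, p):
    """Toy predicate-factor features: the sense alone."""
    return {f"sense={p}": 1.0}

def phi_a(x, a):
    """Toy argument-factor features: argument word and role, no sense."""
    idx, role = a
    return {f"word={x['words'][idx]}&role={role}": 1.0}

def phi_pa(x, p, a):
    """Toy pairwise features: conjunction of sense and role."""
    idx, role = a
    return {f"sense={p}&role={role}": 1.0}

def phi_g(x, p, A):
    """Global feature: core arguments and predicate in sentence order, e.g. A0-PRED-A1."""
    items = [(idx, role) for idx, role in A if role in CORE_ROLES]
    items.append((x["pred"], "PRED"))
    pattern = "-".join(label for _, label in sorted(items))
    return {f"global={pattern}": 1.0}

def score(w, x, p, A):
    """s(p, A): one predicate factor, one global factor, |A| argument and pairwise factors."""
    local = sum(dot(w, phi_a(x, a)) + dot(w, phi_pa(x, p, a)) for a in A)
    return dot(w, phi_p(x, p)) + dot(w, phi_g(x, p, A)) + local
```

For the sentence "John bought apples" with John as A0 and apples as A1, `phi_g` produces the single feature `global=A0-PRED-A1`.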
Figure 1 shows the graphical representation of our
proposed model. The node p corresponds to a
predicate, and the nodes a1 , ..., aN to arguments
of the predicate. Each node is assigned a particular predicate sense or an argument role label. The
black squares are factors which provide scores of
label assignments. In the model, the nodes for arguments depend on the predicate sense, and by influencing labels of a predicate sense and its argument roles, the most plausible label assignment of
the nodes is determined considering all factors.
In this work, we use linear models. Let $x$ be
the words in a sentence, $p$ be a sense of a predicate in
$x$, and $A = \{a_n\}_1^N$ be a set of possible role label
assignments for $x$. A predicate-argument structure
is represented by a pair of $p$ and $A$. We define
the score function for predicate-argument structures as
$$s(p, A) = \sum_{F_k \in F} F_k(x, p, A).$$
$F$ is the set of all the factors; $F_k(x, p, A)$ corresponds to a
particular factor in Figure 1 and gives a score to a
predicate or argument label assignment. Since we
use linear models, $F_k(x, p, A) = w \cdot \Phi_k(x, p, A)$.
2.2 Inference
The crucial point of the model is how to deal
with the global factor $F_G$, because enumerating
all possible assignments is too costly. A number of
methods have been proposed for using global
features in linear models (Daumé III
and Marcu, 2005; Kazama and Torisawa, 2007).
In this work, we use the approach proposed by
Kazama and Torisawa (2007). Although the approach was proposed for sequence labeling tasks, it
can be easily extended to our structured model.
That is, for each possible predicate sense $p$ of the
predicate, we generate N-best argument role assignments using the three local factors $F_P$, $F_A$ and
$F_{PA}$, then add the scores of the global factor $F_G$,
and finally select the argmax among them. In this case,
the argmax is selected from $|P_l| \cdot N$ candidates.
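The inference procedure just described can be sketched as follows: N-best role assignments are generated per sense from the local scores and then rescored with the global factor. The function names and the callback-style interface (`local_sense_score`, `local_arg_scores`, `global_score`) are hypothetical choices for this illustration. The sketch also assumes that, for a fixed sense, the local score decomposes additively over argument candidates, so a beam of width N over the arguments yields the N best local assignments.

```python
import heapq

def nbest_local(arg_scores, n):
    """Beam search over argument candidates.

    arg_scores: one dict per argument candidate, mapping role label -> local score
                (argument factor plus pairwise factor for a fixed sense).
    Returns up to n (score, roles) pairs, best first.
    """
    beam = [(0.0, ())]
    for scores in arg_scores:
        expanded = [(total + s, roles + (role,))
                    for total, roles in beam
                    for role, s in scores.items()]
        beam = heapq.nlargest(n, expanded, key=lambda item: item[0])
    return beam

def infer(senses, local_sense_score, local_arg_scores, global_score, n=64):
    """Pick the best (sense, roles) among the |senses| * n locally generated candidates."""
    best = None
    for p in senses:
        for local, roles in nbest_local(local_arg_scores(p), n):
            total = local_sense_score(p) + local + global_score(p, roles)
            if best is None or total > best[0]:
                best = (total, p, roles)
    return best
```

If the beam is too narrow, an assignment that the global factor would prefer may never reach the rescoring step, which is why the quality of the local N-best list matters.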
$l_{L+G}$ is the loss function for the case of using
both local and global features, corresponding to
the constraint (A), and $l_L$ is the loss function for
the case of using only local features, corresponding to the constraint (B), provided that (A) is satisfied.
2.4 The Role-less Argument Bias Problem
The fact that an argument candidate is not assigned any role (namely, it is assigned the label “NONE”) is unlikely to contribute to predicate sense disambiguation. However, it remains possible that “NONE” arguments are biased toward a particular predicate sense by $F_{PA}$
(i.e. $w \cdot \Phi_{PA}(x, sense_i, a_k = \text{“NONE”}) > w \cdot \Phi_{PA}(x, sense_j, a_k = \text{“NONE”})$).
In order to avoid this bias, we define a special sense label, $sense_{any}$, that is used to calculate the score for a predicate and a role-less
argument, regardless of the predicate’s sense.
We use the feature vector $\Phi_{PA}(x, sense_{any}, a_k)$
if $a_k$ = “NONE” and $\Phi_{PA}(x, sense_i, a_k)$ otherwise.
2.3 Learning the Model
For learning of the model, we borrow a fundamental idea of Kazama and Torisawa’s perceptron
learning algorithm. However, we use a more sophisticated online-learning algorithm based on the
Passive-Aggressive Algorithm (PA) (Crammer et
al., 2006).
For the sake of simplicity, we introduce some
notation. We denote a predicate-argument structure as $y = \langle p, A \rangle$, a local feature vector as
$\Phi_L(x, y) = \Phi_P(x, p) + \sum_{a \in A} \{\Phi_A(x, a) + \Phi_{PA}(x, p, a)\}$, a feature vector coupling both
local and global features as $\Phi_{L+G}(x, y) = \Phi_L(x, y) + \Phi_G(x, p, A)$, the argmax using $\Phi_{L+G}$
as $\hat{y}_{L+G}$, and the argmax using $\Phi_L$ as $\hat{y}_L$. Also, we
use a loss function $\rho(y, y')$, which is a cost function associated with $y$ and $y'$.
The margin perceptron learning proposed by
Kazama and Torisawa can be seen as an optimization with the following two constraints.
(A) $w \cdot \Phi_{L+G}(x, y) - w \cdot \Phi_{L+G}(x, \hat{y}_{L+G}) \geq \rho(y, \hat{y}_{L+G})$
3 Experiment
3.1 Experimental Settings
We use the CoNLL-2009 Shared Task dataset
(Hajič et al., 2009) for experiments. It is a
dataset for multi-lingual syntactic and semantic
dependency parsing 1 . In the SRL-only challenge
of the task, participants are required to identify
predicate-argument structures of only the specified
predicates. Therefore the problems to be solved
are predicate sense disambiguation and argument
role labeling. We use Semantic Labeled F1 for evaluation.
For generating N-bests, we used the beam search algorithm, and the number of N-bests was
set to N = 64. For learning of the joint model, the
loss function $\rho(y_t, y')$ of the Passive-Aggressive
Algorithm was set to the number of incorrect assignments of a predicate sense and its argument
roles. Also, the number of iterations of the model
used for testing was selected based on the performance on the development data.
Table 1 shows the features used for the structured model. The global features used for FG are
based on those used in (Toutanova et al., 2008;
Johansson and Nugues, 2008), and the features
1 The dataset consists of seven languages: Catalan, Chinese, Czech, English, German, Japanese and Spanish.
(B) $w \cdot \Phi_L(x, y) - w \cdot \Phi_L(x, \hat{y}_L) \geq \rho(y, \hat{y}_L)$
(A) is the constraint that ensures a sufficient
margin $\rho(y, \hat{y}_{L+G})$ between $y$ and $\hat{y}_{L+G}$. (B)
is the constraint that ensures a sufficient margin
$\rho(y, \hat{y}_L)$ between $y$ and $\hat{y}_L$. Constraint (B) is
necessary because, if we apply only (A), the algorithm does not guarantee a sufficient margin in
terms of local features, which leads to poor quality
in the N-best assignments. Kazama and Torisawa’s perceptron algorithm uses constant values
for the cost functions $\rho(y, \hat{y}_{L+G})$ and $\rho(y, \hat{y}_L)$.
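The proposed model is trained using the following optimization problem.
$$w_{new} = \arg\min_{w' \in \mathbb{R}^n} \frac{1}{2} \|w' - w\|^2 + C\xi$$
$$\text{s.t.} \quad \begin{cases} l_{L+G} \leq \xi,\ \xi \geq 0 & \text{if } \hat{y}_{L+G} \neq y \\ l_L \leq \xi,\ \xi \geq 0 & \text{if } \hat{y}_{L+G} = y \neq \hat{y}_L \end{cases}$$
$$l_{L+G} = w \cdot \Phi_{L+G}(x, \hat{y}_{L+G}) - w \cdot \Phi_{L+G}(x, y) + \rho(y, \hat{y}_{L+G})$$
$$l_L = w \cdot \Phi_L(x, \hat{y}_L) - w \cdot \Phi_L(x, y) + \rho(y, \hat{y}_L)$$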
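One way to read this optimization problem is as a Passive-Aggressive step applied to whichever constraint is violated. The sketch below assumes the standard PA-I closed-form solution from Crammer et al. (2006), with step size tau = min(C, loss / squared norm of the feature difference); the text does not spell out the update, so this is an assumption. The function names, the `(sense, roles)` encoding of structures, and the feature-function arguments `phi_lg` and `phi_l` are hypothetical.

```python
def dot(w, feats):
    """Inner product of a sparse weight dict with a sparse feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def hamming_cost(y, y_hat):
    """Cost rho: number of incorrect assignments (sense plus argument roles)."""
    (p, roles), (p_hat, roles_hat) = y, y_hat
    return int(p != p_hat) + sum(r != r_hat for r, r_hat in zip(roles, roles_hat))

def pa_update(w, phi_gold, phi_pred, cost, C):
    """One PA-I step toward satisfying w.phi_gold - w.phi_pred >= cost."""
    loss = dot(w, phi_pred) - dot(w, phi_gold) + cost
    keys = set(phi_gold) | set(phi_pred)
    diff = {k: phi_gold.get(k, 0.0) - phi_pred.get(k, 0.0) for k in keys}
    sq_norm = sum(v * v for v in diff.values())
    if loss <= 0 or sq_norm == 0:
        return dict(w)  # margin already satisfied, or nothing to update
    tau = min(C, loss / sq_norm)  # assumed PA-I closed form
    new_w = dict(w)
    for k, v in diff.items():
        new_w[k] = new_w.get(k, 0.0) + tau * v
    return new_w

def train_step(w, x, y, y_hat_lg, y_hat_l, phi_lg, phi_l, C=1.0):
    """Use the local+global loss if that argmax is wrong, else the local-only loss."""
    if y_hat_lg != y:
        return pa_update(w, phi_lg(x, y), phi_lg(x, y_hat_lg), hamming_cost(y, y_hat_lg), C)
    if y_hat_l != y:
        return pa_update(w, phi_l(x, y), phi_l(x, y_hat_l), hamming_cost(y, y_hat_l), C)
    return dict(w)
```

The second branch corresponds to the case in which the full model is already correct but the local model alone is not, so that the N-best lists produced by the local factors keep improving.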
Plemma of the predicate and predicate’s head, and ppos of the predicate
Dependency label between the predicate and predicate’s head
The concatenation of the dependency labels of the predicate’s dependents
Plemma and ppos of the predicate, the predicate’s head, the argument candidate, and the argument’s head
Plemma and ppos of the leftmost/rightmost dependent and leftmost/rightmost sibling
The dependency label of predicate, argument candidate and argument candidate’s dependent
The position of the argument candidate with respect to the predicate position in the dep. tree (e.g. CHILD)
The position of the head of the dependency relation with respect to the predicate position in the sentence
The left-to-right chain of the deplabels of the predicate’s dependents
Plemma, ppos and dependency label paths between the predicate and the argument candidates
The number of dependency edges between the predicate and the argument candidate
Plemma and plemma&ppos of the argument candidate
Dependency label path between the predicate and the argument candidates
The sequence of the predicate and the argument labels in the predicate-argument structure (e.g. A0-PRED-A1)
Whether the semantic roles defined in frames exist in the structure, (e.g. CONTAINS:A1)
The conjunction of the predicate sense and the frame information (e.g. wear.01&CONTAINS:A1)
Table 1: Features for the Structured Model
Table 2: Results on the CoNLL-2009 Shared Task dataset (Semantic Labeled F1).
Next, we compare our system with the top 3 systems in the CoNLL-2009 Shared Task. By incorporating both $F_{PA}$ and $F_G$, our joint model
achieved competitive results compared to the top 2
systems (Björkelund and Zhao), and achieved
better results than Meza-Ruiz’s system.2 The
systems by Björkelund and Zhao applied feature
selection algorithms in order to select the best set
of feature templates for each language, requiring
about 1 to 2 months to obtain the best feature set.
On the other hand, our system achieved results competitive with the top two systems, despite
the fact that we used the same feature templates
for all languages without applying any feature engineering procedure.
Table 3 shows the performances of predicate
sense disambiguation and argument role labeling
separately. In terms of sense disambiguation results, incorporating FP A and FG worked well. Although incorporating either of FP A and FG provided improvements of +0.13 and +0.18 on average, adding both factors provided improvements
of +0.50. We compared the predicate sense dis-
Table 3: Predicate sense disambiguation and argument role labeling results (average).
used for FP A are inspired by formulae used in
the MLN-based SRL systems, such as (Meza-Ruiz
and Riedel, 2009b). We used the same feature
templates for all languages.
3.2 Results
Table 2 shows the results of the experiments,
along with the results of the top 3 systems among the
CoNLL-2009 Shared Task participants in the SRL-only challenge.
By incorporating $F_{PA}$, we achieved performance improvements for all languages. This result
suggests that it is effective to capture local inter-dependencies between a predicate sense and one
of its argument roles. Comparing the results of
$F_P$+$F_A$ and $F_P$+$F_A$+$F_G$, incorporating $F_G$ also
contributed performance improvements for all languages; in particular, a substantial F1 improvement
of +1.88 was obtained for German.
2 The result of Meza-Ruiz for Czech is substantially worse
than the other systems because of inappropriate preprocessing for predicate sense disambiguation. Excluding Czech, the
average F1 value of the Meza-Ruiz system is 77.75, whereas that of our
system is 79.89.
ambiguation results of FP + FA and ALL with the
McNemar test, and the difference was statistically
significant (p < 0.01). This result suggests that
combination of these factors is effective for sense disambiguation.
As for argument role labeling results, incorporating $F_{PA}$ and $F_G$ contributed positively for all
languages. In particular, we obtained a substantial gain (+4.18) in German. By incorporating
$F_{PA}$, the system achieved an F1 improvement
of +0.54 on average. This result shows that capturing inter-dependencies between a predicate and
its arguments contributes to argument role labeling. By incorporating $F_G$, the system achieved a
substantial F1 improvement (+1.91).
Since both tasks improved by using all factors,
we can say that the proposed joint model succeeded in joint learning of predicate senses and
their argument roles.
Shalev-Shwartz, and Yoram Singer. 2006. Online
passive-aggressive algorithms. JMLR, 7:551–585.
Hal Daumé III and Daniel Marcu. 2005. Learning
as search optimization: Approximate large margin
methods for structured prediction. In ICML-2005.
Koen Deschacht and Marie-Francine Moens. 2009.
Semi-supervised semantic role labeling using the latent words language model. In EMNLP-2009.
Hagen Fürstenau and Mirella Lapata. 2009. Graph
alignment for semi-supervised semantic role labeling. In EMNLP-2009.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martı́, Lluı́s
Màrquez, Adam Meyers, Joakim Nivre, Sebastian
Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu,
Nianwen Xue, and Yi Zhang. 2009. The CoNLL2009 shared task: Syntactic and semantic dependencies in multiple languages. In CoNLL-2009, Boulder, Colorado, USA.
Richard Johansson and Pierre Nugues. 2008.
Dependency-based syntactic-semantic analysis
with propbank and nombank. In CoNLL-2008.
In this paper, we proposed a structured model that
captures both non-local dependencies between arguments, and inter-dependencies between a predicate sense and its argument roles. We designed
a linear structured model, and defined four types of factors: predicate factor, argument factor, predicate-argument pairwise factor and global factor. In the experiments, the proposed model achieved competitive results compared to the state-of-the-art systems without any feature engineering.
A further research direction we are investigating is the exploitation of unlabeled texts. Semi-supervised semantic role labeling methods have
been explored by Collobert and Weston (2008),
Deschacht and Moens (2009), and Fürstenau and Lapata (2009), and they have achieved successful
outcomes. However, we believe that there is still
room for further improvement.
Jun’Ichi Kazama and Kentaro Torisawa. 2007. A new
perceptron algorithm for sequence labeling with
non-local features. In EMNLP-CoNLL 2007.
Ivan Meza-Ruiz and Sebastian Riedel. 2009a. Jointly
identifying predicates, arguments and senses using
markov logic. In HLT/NAACL-2009.
Ivan Meza-Ruiz and Sebastian Riedel. 2009b. Multilingual semantic role labelling with markov logic.
In CoNLL-2009.
Sebastian Riedel and Ivan Meza-Ruiz. 2008. Collective semantic role labelling with markov logic. In
Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluı́s Màrquez, and Joakim Nivre. 2008. The
CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In CoNLL-2008.
Cynthia A. Thompson, Roger Levy, and Christopher D.
Manning. 2010. A generative model for semantic
role labeling. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics (to appear).
Kristina Toutanova, Aria Haghighi, and Christopher D.
Manning. 2008. A global joint model for semantic
role labeling. Computational Linguistics, 34(2).
Anders Björkelund, Love Hafdell, and Pierre Nugues.
2009. Multilingual semantic role labeling. In
Ronan Collobert and Jason Weston. 2008. A unified
architecture for natural language processing: Deep
neural networks with multitask learning. In ICML
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai
Semantics-Driven Shallow Parsing for Chinese Semantic Role Labeling
Weiwei Sun
Department of Computational Linguistics, Saarland University
German Research Center for Artificial Intelligence (DFKI)
D-66123, Saarbrücken, Germany
[email protected]
of a given sentence. Because of the semantic information it contains, we call it semantics-driven
shallow parsing. The key idea is to make basic
chunks as large as possible but not overlap with arguments. Additionally, we introduce several new
“path” features to express more structural information, which is important for SRL.
We present encouraging SRL results on Chinese
PropBank (CPB) data. With semantics-driven
shallow parsing, our SRL system achieves 76.10
F-measure, with gold segmentation and POS tagging. The performance further improves to 76.46
with the help of the new “path” features. These results represent significant improvements over the best
reported SRL performance (74.12) in the literature
(Sun et al., 2009).
One deficiency of current shallow parsing based Semantic Role Labeling (SRL)
methods is that syntactic chunks are too
small to effectively group words. To partially resolve this problem, we propose
semantics-driven shallow parsing, which
takes into account both syntactic structures and predicate-argument structures.
We also introduce several new “path” features to improve the shallow parsing based
SRL method. Experiments indicate that
our new method obtains a significant improvement over the best reported Chinese
SRL result.
In the last few years, there has been an increasing interest in Semantic Role Labeling (SRL) on
several languages, which consists of recognizing
arguments involved by predicates of a given sentence and labeling their semantic types. Both
full parsing based and shallow parsing based SRL
methods have been discussed for English and Chinese. In Chinese SRL, shallow parsing based
methods that cast SRL as the classification of
syntactic chunks into semantic labels have achieved
promising results. The performance reported in
(Sun et al., 2009) outperforms the best published
performance of full parsing based SRL systems.
Previously proposed shallow parsing takes into
account only syntactic information and basic
chunks are usually too small to group words into
argument candidates. This causes one main deficiency of Chinese SRL. To partially resolve this
problem, we propose a new shallow parsing. The
new chunk definition takes into account both syntactic structure and predicate-argument structures
Related Work
CPB is a project to add predicate-argument relations to the syntactic trees of the Chinese TreeBank (CTB). Similar to English PropBank, the arguments of a predicate are labeled with a contiguous sequence of integers, in the form of AN (N is
a natural number); the adjuncts are annotated as
such with the label AM followed by a secondary
tag that represents the semantic classification of
the adjunct. The assignment of argument labels
is illustrated in Figure 1, where the predicate is the
verb “提供/provide”. For example, the noun phrase
“保险公司/the insurance company” is labeled as
A0, meaning that it is the proto-Agent of “提供”.
Sun et al. (2009) explore the Chinese SRL problem on the basis of shallow syntactic information
at the level of phrase chunks. They present a semantic chunking method to resolve SRL on the basis
of shallow parsing. Their method casts SRL as
the classification of syntactic chunks with IOB2
representation for semantic roles (i.e. semantic
Proceedings of the ACL 2010 Conference Short Papers, pages 103–108,
Uppsala, Sweden, 11-16 July 2010. ©2010 Association for Computational Linguistics
Figure 1: An example from Chinese PropBank. The example sentence translates as “The insurance company has provided insurance services for the Sanxia Project.”
Joint learning of syntactic and semantic structures is another hot topic in dependency parsing
research. Some models have been well evaluated in CoNLL 2008 and 2009 shared tasks (Surdeanu et al., 2008; Hajič et al., 2009). The
CoNLL 2008/2009 shared tasks propose a unified
dependency-based formalism to model both syntactic dependencies and semantic roles for multiple languages. Several joint parsing models are
presented in the shared tasks. Our focus is different from the shared tasks. In this paper, we hope
to find better syntactic representation for semantic
role labeling.
chunks). Two labeling strategies are presented: 1)
directly tagging semantic chunks in one-stage, and
2) identifying argument boundaries as a chunking
task and labeling their semantic types as a classification task. On the basis of syntactic chunks,
they define semantic chunks, which neither overlap
nor embed, using the IOB2 representation. Syntactic
chunks outside a semantic chunk receive the tag O (Outside). For syntactic chunks forming a semantic chunk of
type A*, the first chunk receives the B-A* tag (Begin), and the remaining ones receive the tag I-A*
(Inside). An SRL system can then work directly
by using a sequence tagging technique. The shallow
chunk definition presented in (Chen et al., 2006)
is used in their experiments. The definition of syntactic and semantic chunks is illustrated in Figure 1.
For example, “保险公司/the insurance company”,
consisting of two nouns, is a noun phrase; in the
syntactic chunking stage, its two components “保
险” and “公司” should be labeled as B-NP and
I-NP. Because this phrase is the Agent of the predicate “提供/provide”, in the semantic chunking stage
it should be labeled as B-A0.
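The IOB2 encoding of semantic chunks over a sequence of syntactic chunks can be sketched as follows. This is only a minimal illustration, not the system of Sun et al. (2009): the function name `iob2_tags`, the English glosses, the chunk segmentation, and the A1 label for “insurance services” are assumptions made for the example; only the A0 and A2 labels are taken from the text.

```python
def iob2_tags(n_chunks, arguments):
    """Assign IOB2 semantic tags to a sequence of syntactic chunks.

    arguments: list of (first_chunk, last_chunk, role) with inclusive,
    non-overlapping chunk indices (hypothetical input format).
    """
    tags = ["O"] * n_chunks                 # chunks outside any argument
    for first, last, role in arguments:
        tags[first] = "B-" + role           # first chunk of the argument
        for k in range(first + 1, last + 1):
            tags[k] = "I-" + role           # remaining chunks of the argument
    return tags


# Toy chunk sequence loosely following Figure 1 (glosses are assumptions).
chunks = ["insurance company", "already", "for", "Sanxia", "Project",
          "provide", "insurance services"]
arguments = [(0, 0, "A0"), (2, 4, "A2"), (6, 6, "A1")]
tags = iob2_tags(len(chunks), arguments)
for chunk, tag in zip(chunks, tags):
    print(f"{chunk}\t{tag}")
```

Under this encoding, an argument that spans three one-word chunks requires three correct tag decisions.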
Semantics-Driven Shallow Parsing
There are two main jobs of semantic chunking: 1)
grouping words as argument candidates and 2) classifying semantic types of possible arguments. Previously proposed shallow parsing only considers
syntactic information, and basic chunks are usually too small to effectively group words. This
causes one main deficiency of semantic chunking.
For example, the argument “为三峡工程/for the Sanxia
Project” consists of three chunks, each of which
consists of only one word. To correctly recognize
this A2, the semantic chunker must correctly predict
three chunk labels. Small chunks also make the
important “path” feature sparse, since there are
more chunks between a target chunk and the predicate in focus. In this section, we introduce a new
chunk definition to improve shallow parsing based
SRL, which takes both syntactic and predicate-argument structures into account. The key idea
is to make syntactic chunks as large as possible
for semantic chunking. The formal definition is as follows.
Their experiments on CPB indicate that according to current state-of-the-art of Chinese parsing,
SRL systems on basis of full parsing do not perform better than systems based on shallow parsing.
They report the best SRL performance with gold
segmentation and POS tagging as inputs. This is
very different from English SRL. In English SRL,
previous work shows that full parsing, both constituency parsing and dependency parsing, is necessary.
Ding and Chang (2009) discuss semantic
chunking methods without any parsing information. Different from (Sun et al., 2009), their
method formulates SRL as the classification of
words with semantic chunks. Comparison of experimental results in their work shows that parsing
is necessary for Chinese SRL, and the semantic
chunking methods on the basis of shallow parsing
outperform the ones without any parsing.
Chunk Bracketing
Given a sentence $s = w_1, \ldots, w_n$, let $c[i:j]$
denote a constituent that is made up of the words
between $w_i$ and $w_j$ (including $w_i$ and $w_j$); let
$p_v = \{c[i:j] \mid c[i:j] \text{ is an argument of } v\}$
denote one predicate-argument structure, where $v$
is the predicate in focus. Given a syntactic tree
$T_s = \{c[i:j] \mid c[i:j] \text{ is a constituent of } s\}$, and
all of its argument structures $P_s = \{p_v \mid v \text{ is a verbal predicate in } s\}$, there is one and only one chunk
set $C = \{c[i:j]\}$ such that the four numbered conditions of this section hold.
Figure 2: An example for definition of semantics-driven chunks with IOB2 representation.
So we can still formulate our new shallow parsing
as an “IOB” sequence labeling problem.
Chunk Type
We introduce two types of chunks. The first is
simply the phrase type, such as NP or PP, of the current chunk. The column CHUNK 1 illustrates
this kind of chunk type definition. The second is
more complicated. Inspired by (Klein and Manning, 2003), we split one phrase type into several
subsymbols, which contain category information
of the current constituent’s parent. For example, an
NP immediately dominated by an S will be substituted by NP^S. This strategy greatly increases
the number of chunk types and makes it hard to
train chunking models. To shrink this number, we
use a linguistically motivated clustering of CTB phrasal types,
which was introduced in (Sun and Sui, 2009). The
column CHUNK 2 illustrates this definition. E.g.,
NP^S implicitly represents Subject while NP^VP
represents Object.
1. $\forall c[i:j] \in C$, $c[i:j] \in T_s$;
2. $\forall c[i:j] \in C$, $\forall c[i_v:j_v] \in \cup P_s$: $j < i_v$ or $i > j_v$ or $i_v \le i \le j \le j_v$;
3. $\forall c[i:j] \in C$, the parent of $c[i:j]$ does not satisfy condition 2;
4. $\forall C'$ satisfying the above conditions, $C' \subset C$.
The first condition guarantees that every chunk
is a constituent. The second condition means that
chunks do not overlap with arguments, and further
guarantees that semantic chunking can recover all
arguments with the last condition. The third condition makes new chunks as big as possible. The last
one makes sure that C contains all sub-components
of all arguments. Figure 2 illustrates our new chunk definition. For example, “中国/Chinese 税务/tax 部门/department” is a constituent of the current sentence, and is also an argument of “规定/stipulate”. If we take it as a chunk,
it does not conflict with any other arguments, so
it is a reasonable syntactic chunk. The phrase
“欠缴/owing 税款/tax payment” does
not overlap with the first, third and fourth propositions, but when labeling the predicate “欠缴” it is bigger than the argument “税款” (conflicting with condition 2), so it has to be separated into two
chunks. Note that the third condition also guarantees that the constituents in C do not overlap with
each other, since each one is as large as possible.
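The four conditions can be read as a top-down procedure: a constituent becomes a chunk as soon as it is compatible with every argument (condition 2); otherwise its children are examined. The sketch below is an illustration under stated assumptions, not the author's implementation: the tree encoding `(label, children)`, the function names, and the English toy sentence are all hypothetical. Single words always satisfy condition 2, so the recursion terminates.

```python
def is_compatible(i, j, arguments):
    """Condition 2: span [i, j] is disjoint from, or inside, every argument."""
    return all(j < a or i > b or (a <= i and j <= b) for a, b in arguments)


def annotate(tree, start=0):
    """Turn (label, children) into (label, i, j, kids) with inclusive word spans.

    A leaf is (POS, word) with a string as its second element.
    """
    label, body = tree
    if isinstance(body, str):
        return (label, start, start, [])
    kids, pos = [], start
    for child in body:
        node = annotate(child, pos)
        kids.append(node)
        pos = node[2] + 1
    return (label, start, pos - 1, kids)


def semantic_chunks(node, arguments):
    """Return the maximal constituents that satisfy condition 2, left to right."""
    label, i, j, kids = node
    if is_compatible(i, j, arguments):
        return [(i, j, label)]          # largest compatible constituent here
    chunks = []
    for kid in kids:                    # otherwise split into sub-constituents
        chunks.extend(semantic_chunks(kid, arguments))
    return chunks


# Hypothetical toy tree: "the company provides insurance services".
tree = ("S", [("NP", [("DT", "the"), ("NN", "company")]),
              ("VP", [("VV", "provides"),
                      ("NP", [("NN", "insurance"), ("NN", "services")])])])
root = annotate(tree)
# Arguments of one predicate: words 0-1 and words 3-4.
print(semantic_chunks(root, [(0, 1), (3, 4)]))
# A second predicate taking only word 4 as argument forces the last NP to split.
print(semantic_chunks(root, [(0, 1), (3, 4), (4, 4)]))
```

The second call mirrors the “欠缴 税款” case: a constituent larger than some argument it overlaps must be broken into smaller chunks.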
New Path Features
The Path feature is defined as a chain of base
phrases between the token and the predicate. At
both ends, the chain is terminated with the POS
tags of the predicate and the headword of the token. For example, the path feature of “保险公
司” in Figure 1 is “公司-ADVP-PP-NP-NP-VV”.
Among all features, the “path” feature contains
more structural information, which is very important for SRL. To better capture structural information, we introduce several new “path” features.
They include:
• NP|PP|VP path: only syntactic chunks
that take the tag NP, PP or VP are kept. This feature is computed, for example, for “中国税务部门/Chinese tax departments” when labeling the predicate “出境/leave the
country” in Figure 2.
• V|的 path: a sequential container of POS tags
of verbal words and “的”. This feature of “中国税务部门” is NP+VV+VV+的+VV+VP.
• O2POS path: if a word occupies a chunk
label O, use its POS in the path feature. This feature is likewise computed for “中国税务部门”.
Table 1: Shallow parsing performance. Rows: (Chen et al., 2006); Overall (C1); Bracketing (C1); Overall (C2); Bracketing (C2).
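The basic path feature and a filtered variant in the spirit of the NP|PP|VP path can be sketched as below. The function name, its arguments, and the chunk sequence for Figure 1 are assumptions for illustration; in particular, the exact output format of the new features is not fully specified in the text, so the filtered string shown here is only one plausible rendering.

```python
def path_feature(chunk_tags, target, predicate, head_word, predicate_pos, keep=None):
    """Chain of chunk tags between a target chunk and the predicate chunk.

    chunk_tags: phrase types of the syntactic chunks, in sentence order.
    target, predicate: chunk indices of the token and the predicate.
    keep: optional set of tags to retain (e.g. {"NP", "PP", "VP"}).
    """
    lo, hi = sorted((target, predicate))
    between = chunk_tags[lo + 1:hi]
    if target > predicate:
        between = between[::-1]          # always read from token to predicate
    if keep is not None:
        between = [tag for tag in between if tag in keep]
    return "-".join([head_word] + between + [predicate_pos])


# Chunk types assumed for the sentence in Figure 1.
chunk_tags = ["NP", "ADVP", "PP", "NP", "NP", "VP", "NP"]
print(path_feature(chunk_tags, 0, 5, "公司", "VV"))
print(path_feature(chunk_tags, 0, 5, "公司", "VV", keep={"NP", "PP", "VP"}))
```

The first call reproduces the example string given for “保险公司”; the second drops the ADVP chunk, which shortens the path and should make it less sparse.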
Syntactic Chunking Performance
Table 1 shows the performance of shallow syntactic parsing. Line Chen et al., 2006 is the chunking performance evaluated on the syntactic chunk definition proposed in (Chen et al., 2006). The second and third blocks present the chunking performance with the new semantics-driven shallow parsing. The second block shows the overall performance when the first kind of chunk type is used,
while the last block shows the performance when
the more complex chunk type definition is used.
For the semantics-driven parsing experiments, we
add the paths from the current word to the first verb before it and to the first verb after it as two new features. Line Bracketing
evaluates the word grouping ability of these two
kinds of chunks. In other words, detailed phrase
types are not considered. Because the two new
chunk definitions use the same chunk boundaries,
the fourth and sixth lines are comparable. There
is a clear decrease in performance from the traditional shallow
parsing (Chen et al., 2006) to ours. We think one
main reason is that syntactic chunks in our new
definition are larger than the traditional ones. An
interesting phenomenon is that although the second
kind of chunk type definition increases the complexity of the parsing job, it achieves better bracketing performance.
Experiments and Analysis
Experimental Setting
Experiments in previous work are mainly based
on CPB 1.0 and CTB 5.0. We use CoNLL-2005
shared task software to process CPB and CTB. To
facilitate comparison with previous work, we use
the same data setting as (Xue, 2008). Nearly
all previous research on Chinese SRL uses this setting, including (Ding and
Chang, 2008, 2009; Sun et al., 2009; Sun, 2010).
The data is divided into three parts: files from
chtb_081 to chtb_899 are used as the training set; files
from chtb_041 to chtb_080 as the development set;
files from chtb_001 to chtb_040, and chtb_900 to
chtb_931 as the test set. Both syntactic chunkers and
semantic chunkers are trained and evaluated by using the same data set. By using CPB and CTB, we
can extract gold standard semantics-driven shallow chunks according to our definition. We use
this kind of gold chunks automatically generated
from training data to train syntactic chunkers.
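The file ranges above can be written down as a small lookup. This is a convenience sketch only: the function name `split_of` and the assumption that file names have the form `chtb_NNN` are ours, not part of the described setup.

```python
def split_of(file_name):
    """Map a CTB file name such as 'chtb_081' to its data split."""
    file_id = int(file_name.split("_")[1])
    if 81 <= file_id <= 899:
        return "train"
    if 41 <= file_id <= 80:
        return "dev"
    if 1 <= file_id <= 40 or 900 <= file_id <= 931:
        return "test"
    return None                      # outside the ranges used in this setting


for name in ["chtb_001", "chtb_041", "chtb_081", "chtb_900"]:
    print(name, split_of(name))
```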
For both syntactic and semantic chunking, we
use a conditional random field (CRF) model. Crfsgd¹ is
used for experiments. Crfsgd provides a feature
template that defines a set of strong word and POS
features for syntactic chunking. We use this
feature template for shallow parsing. For
semantic chunking, we implement a one-stage shallow parsing based SRL system similar to the one described
in (Sun et al., 2009). There are two differences between our system and Sun et al.’s system. First,
our system uses the Start/End method to represent semantic chunks (Kudo and Matsumoto, 2001). Second, word formation features are not used.
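For readers unfamiliar with the Start/End representation, the following sketch converts IOB2 tags into a Start/End-style tag set with B (begin), I (inside), E (end), S (single) and O (outside). The exact tag inventory used in the experiments is not given in the text, so the tag names and the function `iob2_to_start_end` are assumptions.

```python
def iob2_to_start_end(tags):
    """Convert IOB2 tags to a Start/End style encoding (B, I, E, S, O)."""
    converted = []
    for k, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, role = tag.split("-", 1)
        next_tag = tags[k + 1] if k + 1 < len(tags) else "O"
        continues = next_tag == "I-" + role      # does the chunk go on?
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + role)
        else:
            converted.append(("I-" if continues else "E-") + role)
    return converted


iob2 = ["B-A0", "O", "B-A2", "I-A2", "I-A2", "O", "B-A1"]
print(iob2_to_start_end(iob2))
```

Marking the last element of each chunk explicitly gives the sequence model a direct signal for argument end boundaries.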
SRL Performance
Table 2 summarizes the SRL performance. Line
Sun et al., 2009 is the SRL performance reported
in (Sun et al., 2009). To the author’s knowledge,
this is the best published SRL result in the literature. Line SRL (Chen et al., 2006) is the SRL
performance of our system. These two systems
are both evaluated by using syntactic chunking defined in (Chen et al., 2006). From the first block
we can see that our semantic chunking system
reaches the state-of-the-art. The second and third
blocks in Table 2 present the performance with
new shallow parsing. Line SRL (C1) and SRL (C2)
show the overall performances with the first and
second chunk definition. The lines following are
the SRL performance when new “path” features
are added. We can see that new “path” features
are useful for semantic chunking.
Table 2 rows: (Sun et al., 2009); SRL [(Chen et al., 2006)]; SRL [C1]; + NP|PP|VP path; + V|的 path; + O2POS path; + All new path; SRL [C2]; + All new path.
Weiwei Ding and Baobao Chang. 2008. Improving Chinese semantic role classification with hierarchical feature selection strategy. In Proceedings of the EMNLP 2008, pages 324–
333. Association for Computational Linguistics, Honolulu, Hawaii.
Weiwei Ding and Baobao Chang. 2009. Fast semantic role labeling for Chinese based on semantic chunking. In ICCPOL ’09: Proceedings of the 22nd International Conference on
Computer Processing of Oriental Languages.
Language Technology for the Knowledgebased Economy, pages 79–90. Springer-Verlag,
Berlin, Heidelberg.
Table 2: SRL performance on the test data. Items
in the first column, SRL [(Chen et al., 2006)], SRL
[C1] and SRL [C2], respectively denote the SRL
systems based on shallow parsing defined in (Chen
et al., 2006) and Section 3.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia
Martı́, Lluı́s Màrquez, Adam Meyers, Joakim
Nivre, Sebastian Padó, Jan Štěpánek, Pavel
Straňák, Mihai Surdeanu, Nianwen Xue, and
Yi Zhang. 2009. The CoNLL-2009 shared task:
Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language
Learning (CoNLL-2009), June 4-5. Boulder,
Colorado, USA.
In this paper we propose a new syntactic shallow parsing for Chinese SRL. The new chunk
definition contains both syntactic structure and
predicate-argument structure information. To improve SRL, we also introduce several new “path”
features. Experimental results show that our new
chunk definition is more suitable for Chinese SRL.
It is still an open question what kind of syntactic
information is most important for Chinese SRL.
We suggest that our attempt at semantics-driven
shallow parsing is a possible way to better explore
this problem.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of
the 41st Annual Meeting of the Association for
Computational Linguistics, pages 423–430. Association for Computational Linguistics, Sapporo, Japan.
Taku Kudo and Yuji Matsumoto. 2001. Chunking
with support vector machines. In NAACL ’01:
Second meeting of the North American Chapter
of the Association for Computational Linguistics on Language technologies 2001, pages 1–
8. Association for Computational Linguistics,
Morristown, NJ, USA.
The author is funded by both the German Academic
Exchange Service (DAAD) and the German Research
Center for Artificial Intelligence (DFKI).
The author would like to thank the anonymous
reviewers for their helpful comments.
Weiwei Sun. 2010. Improving Chinese semantic
role labeling with rich features. In Proceedings
of the ACL 2010.
Weiwei Sun and Zhifang Sui. 2009. Chinese function tag labeling. In Proceedings of the 23rd
Pacific Asia Conference on Language, Information and Computation. Hong Kong.
Wenliang Chen, Yujie Zhang, and Hitoshi Isahara.
2006. An empirical study of Chinese chunking.
In Proceedings of the COLING/ACL 2006 Main
Conference Poster Sessions, pages 97–104. Association for Computational Linguistics, Sydney, Australia.
Weiwei Sun, Zhifang Sui, Meng Wang, and Xin
Wang. 2009. Chinese semantic role labeling
with shallow parsing. In Proceedings of the
2009 Conference on Empirical Methods in Natural Language Processing, pages 1475–1483.
Association for Computational Linguistics, Singapore.
Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluı́s Màrquez, and Joakim Nivre. 2008.
The CoNLL-2008 shared task on joint parsing of
syntactic and semantic dependencies. In CoNLL
2008: Proceedings of the Twelfth Conference
on Computational Natural Language Learning,
pages 159–177. Coling 2008 Organizing Committee, Manchester, England.
Nianwen Xue. 2008. Labeling Chinese predicates with semantic roles. Comput. Linguist.,
Collocation Extraction beyond the Independence Assumption
Gerlof Bouma
Universität Potsdam, Department Linguistik
Campus Golm, Haus 24/35
Karl-Liebknecht-Straße 24–25
14476 Potsdam, Germany
[email protected]
In this paper we start to explore two-part
collocation extraction association measures
that do not estimate expected probabilities on the basis of the independence assumption. We propose two new measures
based upon the well-known measures of
mutual information and pointwise mutual
information. Expected probabilities are derived from automatically trained Aggregate
Markov Models. On three collocation gold
standards, we find the new association measures vary in their effectiveness.
Collocation extraction typically proceeds by scoring collocation candidates with an association measure, where high scores are taken to indicate likely
collocationhood. Two well-known such measures
are pointwise mutual information (PMI) and mutual information (MI). In terms of observing a combination of words w1 , w2 , these are:
$$i(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}, \qquad (1)$$
$$I(w_1, w_2) = \sum_{x \in \{w_1, \neg w_1\}} \sum_{y \in \{w_2, \neg w_2\}} p(x, y)\, i(x, y). \qquad (2)$$
PMI (1) is the logged ratio of the observed bigramme probability and the expected bigramme
probability under independence of the two words
in the combination. MI (2) is the expected outcome
of PMI, and measures how much information of the
distribution of one word is contained in the distribution of the other. PMI was introduced into the collocation extraction field by Church and Hanks (1990).
Dunning (1993) proposed the use of the likelihoodratio test statistic, which is equivalent to MI up to
a constant factor.
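As a concrete reading of (1) and (2), the sketch below computes PMI and MI from the four cells of a bigramme contingency table using maximum-likelihood estimates. The function names and the toy counts are assumptions for illustration; empty cells are skipped in the MI sum, and the pair itself is assumed to have a non-zero count.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information, equation (1)."""
    return math.log(p_xy / (p_x * p_y))


def pmi_and_mi(n12, n1, n2, n):
    """PMI and MI for a word pair from counts.

    n12: count of (w1, w2); n1: count of w1 in first position;
    n2: count of w2 in second position; n: total number of pairs.
    """
    p1, p2 = n1 / n, n2 / n
    cells = {
        (True, True): n12,                     # (w1, w2)
        (True, False): n1 - n12,               # (w1, not w2)
        (False, True): n2 - n12,               # (not w1, w2)
        (False, False): n - n1 - n2 + n12,     # (not w1, not w2)
    }
    mi = 0.0
    for (x, y), count in cells.items():
        if count == 0:
            continue                           # 0 * log(0) is taken as 0
        p_xy = count / n
        p_x = p1 if x else 1 - p1
        p_y = p2 if y else 1 - p2
        mi += p_xy * pmi(p_xy, p_x, p_y)       # equation (2)
    return pmi(n12 / n, p1, p2), mi


print(pmi_and_mi(6, 20, 30, 100))    # counts exactly as expected under independence
print(pmi_and_mi(15, 20, 30, 100))   # pair observed more often than expected
```

With counts that match the independence expectation both scores are (numerically) zero; with an over-represented pair both become positive.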
Two aspects of (P)MI are worth highlighting.
First, the observed occurrence probability pobs is
compared to the expected occurrence probability
pexp . Secondly, the independence assumption underlies the estimation of pexp .
The first aspect is motivated by the observation that interesting combinations are often those
that are unexpectedly frequent. For instance, the
bigramme of the is uninteresting from a collocation extraction perspective, although it probably is
amongst the most frequent bigrammes for any English corpus. However, we can expect to frequently
observe the combination by mere chance, simply
because its parts are so frequent. Looking at pobs
and pexp together allows us to recognize these cases
(see Manning and Schütze (1999) and Evert (2007) for
more discussion).
The second aspect, the independence assumption in the estimation of pexp , is more problematic, however, even in the context of collocation
extraction. As Evert (2007, p42) notes, the assumption of “independence is extremely unrealistic,” because it ignores “a variety of syntactic, semantic
and lexical restrictions.” Consider an estimate for
pexp (the the). Under independence, this estimate
will be high, as the itself is very frequent. However,
with our knowledge of English syntax, we would
say pexp (the the) is low. The independence assumption leads to overestimated expectation and the the
will need to be very frequent for it to show up as a
likely collocation. A less contrived example of how
the independence assumption might mislead collocation extraction is when bigramme distribution is
influenced by compositional, non-collocational, semantic dependencies. Investigating adjective-noun
combinations in a corpus, we might find that beige
cloth gets a high PMI, whereas beige thought does
not. This does not make the former a collocation or
multiword unit. Rather, what we would measure is
the tendency to use colours with visible things and
not with abstract objects. Syntactic and semantic
Proceedings of the ACL 2010 Conference Short Papers, pages 109–114,
Uppsala, Sweden, 11-16 July 2010. ©2010 Association for Computational Linguistics
associations between words are real dependencies,
but they need not be collocational in nature. Because of the independence assumption, PMI and
MI measure these syntactic and semantic associations just as much as they measure collocational
association. In this paper, we therefore experimentally investigate the use of a more informed pexp in
the context of collocation extraction.
Aggregate Markov Models
To replace pexp under independence, one might
consider models with explicit linguistic information, such as a POS-tag bigramme model.
This would for instance give us a more realistic
pexp (the the). However, lexical semantic information is harder to incorporate. We might not know
exactly what factors are needed to estimate pexp
and even if we do, we might lack the resources
to train the resulting models. The only thing we
know about estimating pexp is that we need more
information than a unigramme model but less than
a bigramme model (as this would make pobs /pexp
uninformative). Therefore, we propose to use Aggregate Markov Models (Saul and Pereira, 1997;
Hofmann and Puzicha, 1998; Rooth et al., 1999;
Blitzer et al., 2005)¹ for the task of estimating pexp .
In an AMM, bigramme probability is not directly
modeled, but mediated by a hidden class variable c:
$$p_{\mathrm{amm}}(w_2 \mid w_1) = \sum_{c} p(c \mid w_1)\, p(w_2 \mid c). \qquad (3)$$
The number of classes in an AMM determines the
amount of dependency that can be captured. In the
case of just one class, AMM is equivalent to a unigramme model. AMMs become equivalent to the
full bigramme model when the number of classes
equals the size of the smallest of the vocabularies of the parts of the combination. Between these
two extremes, AMMs can capture syntactic, lexical,
semantic and even pragmatic dependencies.
AMMs can be trained with EM, using no more
information than one would need for ML bigramme
probability estimates. Specifications of the E- and
M-steps can be found in any of the four papers cited
above – here we follow Saul and Pereira (1997). At
each iteration, the model components are updated
according to:
$$p(c \mid w_1) \leftarrow \frac{\sum_{w} n(w_1, w)\, p(c \mid w_1, w)}{\sum_{w, c'} n(w_1, w)\, p(c' \mid w_1, w)}, \qquad (4)$$
$$p(w_2 \mid c) \leftarrow \frac{\sum_{w} n(w, w_2)\, p(c \mid w, w_2)}{\sum_{w, w'} n(w, w')\, p(c \mid w, w')}, \qquad (5)$$
where $n(w_1, w_2)$ are bigramme counts and the posterior probability of a hidden category $c$ is estimated by:
$$p(c \mid w_1, w_2) = \frac{p(c \mid w_1)\, p(w_2 \mid c)}{\sum_{c'} p(c' \mid w_1)\, p(w_2 \mid c')}. \qquad (6)$$
Successive updates converge to a local maximum
of the AMM’s log-likelihood.
¹ These authors use very similar models, but with differing
terminology and with different goals. The term AMM is used
in the first and fourth paper. In the second paper, the models
are referred to as Separable Mixture Models. Their use in
collocation extraction is to our knowledge novel.
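The EM updates can be sketched in a few lines of plain Python. This is a toy illustration under our own assumptions (dictionary-based counts, a fixed number of iterations, hypothetical function names `train_amm` and `amm_prob`), not the implementation used for the experiments; all counts are assumed to be positive.

```python
import random


def train_amm(counts, n_classes, n_iter=50, seed=0):
    """EM training of an Aggregate Markov Model.

    counts: dict mapping (w1, w2) to a positive bigramme count.
    Returns p(c|w1) as {w1: [prob per class]} and p(w2|c) as [ {w2: prob} per class ].
    """
    rng = random.Random(seed)
    left = sorted({w1 for w1, _ in counts})
    right = sorted({w2 for _, w2 in counts})
    classes = range(n_classes)

    def normalise(weights):
        total = sum(weights)
        return [w / total for w in weights]

    # Random initialisation of both model components.
    p_c_w1 = {w1: normalise([rng.random() + 0.1 for _ in classes]) for w1 in left}
    p_w2_c = [dict(zip(right, normalise([rng.random() + 0.1 for _ in right])))
              for _ in classes]

    for _ in range(n_iter):
        exp_c_w1 = {w1: [0.0] * n_classes for w1 in left}
        exp_w2_c = [dict.fromkeys(right, 0.0) for _ in classes]
        for (w1, w2), n in counts.items():
            # E-step: posterior over hidden classes for this pair.
            joint = [p_c_w1[w1][c] * p_w2_c[c][w2] for c in classes]
            z = sum(joint)
            for c in classes:
                expected = n * joint[c] / z
                exp_c_w1[w1][c] += expected
                exp_w2_c[c][w2] += expected
        # M-step: renormalise the expected counts.
        p_c_w1 = {w1: normalise(v) for w1, v in exp_c_w1.items()}
        new_p_w2_c = []
        for d in exp_w2_c:
            total = sum(d.values())
            new_p_w2_c.append({w2: v / total for w2, v in d.items()})
        p_w2_c = new_p_w2_c
    return p_c_w1, p_w2_c


def amm_prob(w2, w1, p_c_w1, p_w2_c):
    """p_amm(w2 | w1): sum over hidden classes of p(c|w1) * p(w2|c)."""
    return sum(p_c_w1[w1][c] * p_w2_c[c][w2] for c in range(len(p_w2_c)))


counts = {("a", "x"): 3, ("a", "y"): 1, ("b", "x"): 1, ("b", "y"): 3}
one_class = train_amm(counts, n_classes=1, n_iter=5)
print(amm_prob("x", "a", *one_class))   # unigramme estimate of p(x)
```

With a single class the model collapses to the unigramme distribution of the second word, which is the behaviour described for the one-class case.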
The definition of the counterparts to (P)MI without the independence assumption, the AMM-ratio
and AMM-divergence, is now straightforward:
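$$r_{\mathrm{amm}}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p_{\mathrm{amm}}(w_2 \mid w_1)}, \qquad (7)$$
$$d_{\mathrm{amm}}(w_1, w_2) = \sum_{x \in \{w_1, \neg w_1\}} \sum_{y \in \{w_2, \neg w_2\}} p(x, y)\, r_{\mathrm{amm}}(x, y). \qquad (8)$$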
The free parameter in these association measures is
the number of hidden classes in the AMM, that is,
the amount of dependency between the bigramme
parts used to estimate pexp . Note that AMM-ratio
and AMM-divergence with one hidden class are
equivalent to PMI and MI, respectively. It can be
expected that in different corpora and for different types of collocation, different settings of this
parameter are suitable.
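A small numeric sketch of the AMM-ratio (7) shows how the choice of expectation changes the score. The function name and the probabilities are made up for illustration; the AMM-divergence (8) is omitted because it additionally needs expectations for the negated events.

```python
import math


def amm_ratio(p_joint, p_w1, p_w2_given_w1_amm):
    """AMM-ratio, equation (7): observed joint vs. AMM-based expectation."""
    return math.log(p_joint / (p_w1 * p_w2_given_w1_amm))


p_joint, p_w1, p_w2 = 0.15, 0.2, 0.3

# One hidden class: p_amm(w2|w1) is the unigramme p(w2), so the score is PMI.
print(amm_ratio(p_joint, p_w1, p_w2))

# A very rich model: p_amm(w2|w1) equals the observed conditional p(w2|w1),
# so the expectation matches the observation and the score is (about) zero.
print(amm_ratio(p_joint, p_w1, p_joint / p_w1))
```

The two extremes bracket the behaviour of the measure: the more accurately the AMM predicts a pair, the lower its association score.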
Data and procedure
We apply AMM-ratio and AMM-divergence to
three collocation gold standards. The effectiveness
of association measures in collocation extraction is
measured by ranking collocation candidates by
the scores defined by the measures, and calculating average precision of these lists against the gold
standard annotation. We consider the newly proposed AMM-based measures for a varying number
of hidden categories. The new measures are compared against two baselines: ranking by frequency
(pobs ) and random ordering. Because AMM-ratio
and -divergence with one hidden class boil down
to PMI and MI (and thus log-likelihood ratio), the
evaluation contains an implicit comparison with
these canonical measures, too. However, the results will not be state-of-the-art: for the datasets
investigated below, there are more effective extraction methods based on supervised machine learning
(Pecina, 2008).
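Average precision over a ranked candidate list can be computed as below. The text does not spell out the exact variant used, so this sketch assumes the common definition (mean of the precision values at the ranks of the true positives); the function name and the toy ranking are ours.

```python
def average_precision(ranked_labels):
    """Mean precision at the ranks where a true collocation occurs.

    ranked_labels: gold labels (True for a collocation) of the candidates,
    ordered from highest to lowest association score.
    """
    hits, total = 0, 0.0
    for rank, is_true_positive in enumerate(ranked_labels, start=1):
        if is_true_positive:
            hits += 1
            total += hits / rank        # precision at this rank
    return total / hits if hits else 0.0


# Toy ranking: true collocations at ranks 1 and 3.
print(average_precision([True, False, True]))
```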
The first gold standard used is the German
adjective-noun dataset (Evert, 2008). It contains
1212 A-N pairs taken from a German newspaper
corpus. We consider three subtasks, depending on
how strictly we define true positives. We used the
bigramme frequency data included in the resource.
We assigned all types with a token count ≤5 to one
type, resulting in AMM training data of 10k As,
20k Ns and 446k A-N pair types.
The second gold standard consists of 5102 German PP-verb combinations, also sampled from
newspaper texts (Krenn, 2008). The data contains annotation for support verb constructions
(FVGs) and figurative expressions. This resource
also comes with its own frequency data. After frequency thresholding, AMMs are trained on 46k
PPs, 7.6k Vs, and 890k PP-V pair types.
Third and last is the English verb-particle construction (VPC) gold standard (Baldwin, 2008),
consisting of 3078 verb-particle pairs and annotation for transitive and intransitive idiomatic VPCs.
We extract frequency data from the BNC, following the methods described in Baldwin (2005). This
results in two slightly different datasets for the two
types of VPC. For the intransitive VPCs, we train
AMMs on 4.5k Vs, 35 particles, and 43k pair types.
For the transitive VPCs, we have 5k Vs, 35 particles and 54k pair types.
All our EM runs start with randomly initialized
model vectors. In Section 3.3 we discuss the impact
of model variation due to this random factor.
German A-N collocations The top slice in Table 1 shows results for the three subtasks of the
A-N dataset. We see that using AMM-based pexp
initially improves average precision, for each task
and for both the ratio and the divergence measure.
At their maxima, the informed measures outperform both baselines as well as PMI and MI/log-likelihood ratio (# classes=1). The AMM-ratio performs best for 16-class AMMs, the optimum for
AMM-divergence varies slightly.
It is likely that the drop in performance for the
larger AMM-based measures is due to the AMMs
learning the collocations themselves. That is, the
AMMs become rich enough to not only capture
the broadly applicative distributional influences of
syntax and semantics, but also provide accurate
pexp s for individual, distributionally deviant combinations – like collocations. An accurate pexp results
in a low association score.
One way of inspecting what kind of dependencies the AMMs pick up is to cluster the data with
them. Following Blitzer et al. (2005), we take the
200 most frequent adjectives and assign them to
the category that maximizes p(c|w1 ); likewise for
nouns and p(w2 |c). Four selected clusters (out of
16) are given in Table 2.² The esoteric class 1 contains ordinal numbers and nouns that one typically
uses with them, including references to temporal
concepts. Class 2 and 3 appear more semantically
motivated, roughly containing human and collective denoting nouns, respectively. Class 4 shows
a group of adjectives denoting colours and/or political affiliations and a less coherent set of nouns,
although the noun cluster can be understood if we
consider individual adjectives that are associated
with this class. Our informal impression from looking at clusters is that this is a common situation: as
a whole, a cluster cannot be easily characterized,
although for subsets or individual pairs, one can
get an intuition for why they are in the same class.
Unfortunately, we also see that some actual collocations are clustered in class 4, such as gelbe Karte
‘warning’ (lit.: ‘yellow card’) and dickes Auto ‘big
(lit.: fat) car’.
German PP-Verb collocations The second slice
in Table 1 shows that, for both subtypes of PP-V
collocation, better pexp-estimates lead to decreased
average precision. The most effective AMM-ratio
and -divergence measures are those equivalent to
(P)MI. Apparently, the better pexp s are detrimental to
the extraction of the type of collocations in this dataset.
The poor performance of PMI on these data –
clearly below frequency – has been noticed before
by Krenn and Evert (2001). A possible explanation
for the lack of improvement in the AMMs lies in
the relatively high performing frequency baselines.
The frequency baseline for FVGs is five times the
An anonymous reviewer rightly warns against sketching
an overly positive picture of the knowledge captured in the
AMMs by only presenting a few clusters. However, the clustering performed here is only secondary to our main goal
of improving collocation extraction. The model inspection
should thus not be taken as an evaluation of the quality of the
models as clustering models.
# classes
category 1
category 1–2
category 1–3
Table 1: Average precision for AMM-based association measures and baselines on three datasets.
1 dritt ‘third’, erst ‘first’, fünft ‘fifth’, halb ‘half’, kommend
‘next’, laufend ‘current’, letzt ‘last’, nah ‘near’, paar ‘pair’,
vergangen ‘last’, viert ‘fourth’, wenig ‘few’, zweit ‘second’
Jahr ‘year’, Klasse ‘class’, Linie ‘line’, Mal ‘time’, Monat
‘month’, Platz ‘place’, Rang ‘grade’, Runde ‘round’, Saison
‘season’, Satz ‘sentence’, Schritt ‘step’, Sitzung ‘session’, Sonntag ‘Sunday’, Spiel ‘game’, Stunde ‘hour’, Tag ‘day’, Woche
‘week’, Wochenende ‘weekend’
Besucher ‘visitor’, Bürger ‘citizens’, Deutsche ‘German’, Frau
‘woman’, Gast ‘guest’, Jugendliche ‘youth’, Kind ‘child’, Leute
‘people’, Mädchen ‘girl’, Mann ‘man’, Mensch ‘human’, Mitglied ‘member’
Betrieb ‘company’, Familie ‘family’, Firma ‘firm’, Gebiet
‘area’, Gesellschaft ‘society’, Land ‘country’, Mannschaft
‘team’, Markt ‘market’, Organisation ‘organisation’, Staat
‘state’, Stadtteil ‘city district’, System ‘system’, Team ‘team’,
Unternehmen ‘enterprise’, Verein ‘club’, Welt ‘world’
Auge ‘eye’, Auto ‘car’, Haar ‘hair’, Hand ‘hand’, Karte ‘card’,
Stimme ‘voice/vote’
2 aktiv ‘active’, alt ‘old’, ausländisch ‘foreign’, betroffen
‘concerned’, jung ‘young’, lebend ‘alive’, meist ‘most’,
unbekannt ‘unknown’, viel ‘many’
3 deutsch ‘German’, europäisch ‘European’, ganz ‘whole’,
gesamt ‘whole’, international ‘international’, national ‘national’, örtlich ‘local’, ostdeutsch ‘East-German’, privat
‘private’, rein ‘pure’, sogenannt ‘so-called’, sonstig ‘other’,
westlich ‘western’
4 blau ‘blue’, dick ‘fat’, gelb ‘yellow’, grün ‘green’, linke
‘left’, recht ‘right’, rot ‘red’, schwarz ‘black’, weiß ‘white’
Table 2: Selected adjective-noun clusters from a 16-class AMM.
random baseline, and MI does not outperform it by
much. Since the AMMs provide a better fit for the
more frequent pairs in the training data, they might
end up providing too good pexp -estimates for the
true collocations from the beginning.
Further investigation is needed to find out
whether this situation can be ameliorated and, if
not, whether we can systematically identify for
what kind of collocation extraction tasks using better pexp s is simply not a good idea.
English Verb-Particle constructions The last
gold standard is the English VPC dataset, shown
in the bottom slice of Table 1. We have only used
class-sizes up to 32, as there are only 35 particle
types. We can clearly see the effect of the largest
AMMs approaching the full bigramme model as
average precision here approaches the random baseline. The VPC extraction task shows a difference
between the two AMM-based measures: AMM-ratio does not improve at all, remaining below the
frequency baseline. AMM-divergence, however,
shows a slight decrease in precision first, but ends
up performing above the frequency baseline for the
8-class AMMs in both subtasks.
Table 3 shows four clusters of verbs and particles. The large first cluster contains verbs that
involve motion/displacement of the subject or object and associated particles, for instance walk
about or push away. Interestingly, the description
of the gold standard gives exactly such cases as
negatives, since they constitute compositional verb-particle constructions (Baldwin, 2008). Classes 2
and 3 show syntactic dependencies, which helps
1 break, bring, come, cut, drive, fall, get, go, lay, look, move, pass, push,
put, run, sit, throw, turn, voice, walk
2 accord, add, apply, give, happen, lead, listen, offer, pay, present, refer,
relate, return, rise, say, sell, send, speak, write
3 know, talk, tell, think
4 accompany, achieve, affect, cause, create, follow, hit, increase, issue,
mean, produce, replace, require, sign, support
across, ahead, along, around, away, back, backward, down, forward, into, over, through, together
astray, to
Table 3: Selected verb-particle clusters from an 8-class AMM on transitive data.
collocation extraction by decreasing the impact of
verb-preposition associations that are due to PP-selecting verbs. Class 4 shows a third type of distributional generalization: the verbs in this class are
all frequently used in the passive.
Variation due to local optima
We start each EM run with a random initialization of the model parameters. Since EM finds local
rather than global optima, each run may lead to
different AMMs, which in turn will affect AMM-based collocation extraction. To gain insight into
this variation, we have trained 40 16-class AMMs
on the A-N dataset. Table 4 gives five-point summaries of the average precision of the resulting
40 ‘association measures’. Performance varies considerably, spanning 2–3 percentage points in each
case. The models consistently outperform (P)MI in
Table 1, though.
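A five-point summary of this kind can be computed as in the following minimal sketch. The scores are made-up placeholders, not results from the paper, and the inclusive quartile method is an assumption, since the text does not say how the quartiles were obtained.

```python
import statistics

def five_point_summary(values):
    """Return (min, q1, median, q3, max) for a list of numbers."""
    # method="inclusive" treats the runs as the whole population (an assumption).
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

# Hypothetical average-precision scores from five EM runs (illustrative only).
runs = [0.40, 0.42, 0.41, 0.43, 0.44]
summary = five_point_summary(runs)
print(summary)
```

With 40 runs, as in the experiment described here, the same function would simply receive 40 scores instead of five.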
Several techniques might help to address this
variation. One might try to find a good fixed way of
initializing EM or to use EM variants that reduce
the impact of the initial state (Smith and Eisner,
2004, a.o.), so that a run with the same data and
the same number of classes will always learn (almost) the same model. On the assumption that an
average over several runs will vary less than individual runs, we have also constructed a combined
pexp by averaging over 40 pexp s. The last column
Table 4: Variation in average precision on A-N data over 40 EM runs (rows: cat 1, cat 1–2, cat 1–3; columns include min, q1, med, q3) and result of combining pexp s.
in Table 4 shows this combined estimator leads to
good extraction results.
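The combination step can be sketched as follows, assuming that "averaging" means the arithmetic mean of the per-run estimates for each candidate pair. The function name, the dictionary layout and the probabilities are hypothetical and serve only as an illustration.

```python
def combine_pexp(estimates):
    """Average several p_exp estimates pair by pair.

    `estimates` is a list of dicts, one per EM run, each mapping a
    (word1, word2) pair to that run's expected probability.
    """
    pairs = estimates[0].keys()
    return {
        pair: sum(run[pair] for run in estimates) / len(estimates)
        for pair in pairs
    }

# Two hypothetical runs over two candidate pairs (values are illustrative).
run_a = {("rot", "Auto"): 0.25, ("alt", "Mann"): 0.5}
run_b = {("rot", "Auto"): 0.75, ("alt", "Mann"): 0.5}
combined = combine_pexp([run_a, run_b])
print(combined)
```

The combined estimate can then be plugged into an association measure in place of a single run's estimate.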
In this paper, we have started to explore collocation
extraction beyond the assumption of independence.
We have introduced two new association measures
that do away with this assumption in the estimation of expected probabilities. The success of using
these association measures varies. It remains to be
investigated whether they can be improved more.
A possible obstacle in the adoption of AMMs in
collocation extraction is that we have not provided
any heuristic for setting the number of classes for
the AMMs. We hope to be able to look into this
question in future research. Luckily, for the A-N and
VPC data, the best models are not that large (in the
order of 8–32 classes), which means that model fitting is fast enough to experiment with different settings. In general, considering these smaller models
might suffice for tasks that have a fairly restricted
definition of collocation candidate, like the tasks
in our evaluation do. Because AMM fitting is unsupervised, selecting a class size is in this respect
no different from selecting a suitable association
measure from the canon of existing measures.
Future research into association measures that
are not based on the independence assumption will
also include considering different EM variants and
other automatically learnable models besides the
AMMs used in this paper. Finally, the idea of using an informed estimate of expected probability
in an association measure need not be confined
to (P)MI, as there are many other measures that
employ expected probabilities.
This research was carried out in the context of
the SFB 632 Information Structure, subproject D4:
Methoden zur interaktiven linguistischen Korpusanalyse von Informationsstruktur.
Lawrence Saul and Fernando Pereira. 1997. Aggregate and mixed-order markov models for statistical
language processing. In Proceedings of the Second
Conference on Empirical Methods in Natural Language Processing, pages 81–89.
Timothy Baldwin. 2005. The deep lexical acquisition
of English verb-particle constructions. Computer
Speech and Language, Special Issue on Multiword
Expressions, 19(4):398–414.
Timothy Baldwin. 2008. A resource for evaluating the
deep lexical acquisition of English verb-particle constructions. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 1–2, Marrakech.
John Blitzer, Amir Globerson, and Fernando Pereira.
2005. Distributed latent variable models of lexical
co-occurrences. In Tenth International Workshop on
Artificial Intelligence and Statistics.
Kenneth W. Church and Patrick Hanks. 1990. Word
association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
Stefan Evert. 2007. Corpora and collocations. Extended Manuscript of Chapter 58 of A. Lüdeling and
M. Kytö, 2008, Corpus Linguistics. An International
Handbook, Mouton de Gruyter, Berlin.
Stefan Evert. 2008. A lexicographic evaluation of German adjective-noun collocations. In Proceedings of
the LREC 2008 Workshop Towards a Shared Task
for Multiword Expressions (MWE 2008), pages 3–6, Marrakech.
Thomas Hofmann and Jan Puzicha. 1998. Statistical models for co-occurrence data. Technical report,
MIT. AI Memo 1625, CBCL Memo 159.
Brigitte Krenn and Stefan Evert. 2001. Can we do
better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL Workshop on Collocations, Toulouse.
Brigitte Krenn. 2008. Description of evaluation resource – German PP-verb data. In Proceedings of
the LREC 2008 Workshop Towards a Shared Task
for Multiword Expressions (MWE 2008), pages 7–
10, Marrakech.
Chris Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing.
MIT Press, Cambridge, MA.
Pavel Pecina. 2008. A machine learning approach to
multiword expression extraction. In Proceedings of
the LREC 2008 Workshop Towards a Shared Task
for Multiword Expressions (MWE 2008), pages 54–
57, Marrakech.
Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil. 1999. Inducing a semantically
annotated lexicon via EM-based clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park,
Noah A. Smith and Jason Eisner. 2004. Annealing techniques for unsupervised statistical language
learning. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics.
Automatic Collocation Suggestion in Academic Writing
Jian-Cheng Wu1
Yu-Chia Chang1,*
Teruko Mitamura2
National Tsing Hua University
Hsinchu, Taiwan
{wujc86, richtrf, jason.jschang}
In recent years, collocation has been
widely acknowledged as an essential
characteristic to distinguish native speakers from non-native speakers. Research
on academic writing has also shown that
collocations are not only common but
serve a particularly important discourse
function within the academic community.
In our study, we propose a machine
learning approach to implementing an
online collocation writing assistant. We
use a data-driven classifier to provide
collocation suggestions to improve word
choices, based on the result of classification. The system generates and ranks
suggestions to assist learners’ collocation
usages in their academic writing with satisfactory results.
The notion of collocation has been widely discussed in the field of language teaching for decades. It has been shown that collocation, a successive common usage of words in a chain, is
important in helping language learners achieve
native-like fluency. In the field of English for
Academic Purposes, more and more researchers
are also recognizing this important feature in
academic writing. It is often argued that collocation can influence the effectiveness of a piece of
writing and the lack of such knowledge might
cause cumulative loss of precision (Howarth,
Many researchers have discussed the function
of collocations in the highly conventionalized
and specialized writing used within academia.
Research also identified noticeable increases in
the quantity and quality of collocational usage by
Corresponding author: Yu-chia Chang (Email address:
[email protected])
Jason S. Chang1
Carnegie Mellon University
Pittsburgh, United States
[email protected]
native speakers (Howarth, 1998). Granger (1998)
reported that learners underuse native-like collocations and overuse atypical word combinations.
This disparity in collocation usage between native and non-native speakers is clear and should
receive more attention from the language technology community.
To tackle such word usage problems, traditional language technology often employs a database of learners' common errors that are
manually tagged by teachers or specialists (e.g.
Shei and Pain, 2000; Liu, 2002). Such systems
then identify errors via string or pattern matching and offer only pre-stored suggestions. Compiling the database is time-consuming, the database is not
easily maintained, and its usefulness is limited
by the manual collection of pre-stored suggestions. Therefore, it is beneficial if a system can
mainly use untagged data from a corpus containing correct language usage rather than error-tagged data from a learner corpus. A large corpus
of correct language usage is more readily available and useful than a small labeled corpus of
incorrect language usage.
For this suggestion task, the large corpus not
only provides us with a rich set of common collocations but also provides the context within
which these collocations appear. Intuitively, we
can take account of such context of collocation to
generate more suitable suggestions. Contextual
information in this sense often entails more linguistic clues to provide suggestions within sentences or paragraphs. However, the contextual
information is messy and complex and thus has
long been overlooked or ignored. To date, most
fashionable suggestion methods still rely upon
the linguistic components within collocations as
well as the linguistic relationship between misused words and their correct counterparts (Chang
et al., 2008; Liu, 2009).
In contrast to other research, we employ contextual information to automate suggestions for
verb-noun lexical collocation. Verb-noun collocations are recognized as presenting the most
Proceedings of the ACL 2010 Conference Short Papers, pages 115–119,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
challenge to students (Howarth, 1996; Liu,
2002). More specifically, in this preliminary
study we start by focusing on the word choice of
verbs in collocations, which are considered the
most difficult for learners to master (Liu,
2002; Chang et al., 2008). The experiment shows
that it is feasible to use machine learning methods to
automatically prompt learners with collocation
suggestions in academic writing.
Collocation Checking and Suggestion
This study aims to develop a web service, Collocation Inspector (shown in Figure 1), that accepts
sentences as input and generates the related candidates for learners.
In this paper, we focus on automatically providing academic collocation suggestions when
users are writing up their abstracts. After an abstract is submitted, the system extracts linguistic
features from the user’s text for the machine learning
model. By using a corpus of published academic
texts, we hope to match contextual linguistic
clues from users’ text to help elicit the most relevant suggestions. We now formally state the
problem that we are addressing:
Problem Statement: Given a sentence S written by a learner and a reference corpus RC, our
goal is to output a set of most probable suggestion candidates c1, c2, ... , cm. For this, we train a
classifier MC to map the context (represented as
feature set f1, f2, ..., fn) of each sentence in RC to
the collocations. At run-time, we predict these
collocations for S as suggestions.
Academic Collocation Checker Training Procedures
Sentence Parsing and Collocation Extraction:
We start by collecting a large number of abstracts from the Web to develop a reference corpus for collocation suggestion. We then
identify collocations in each sentence for
subsequent processing.
Collocation extraction is an essential step in
preprocessing data. We aim to extract only
collocations whose components have
a syntactic relationship with one another. However, this extraction task can be complicated.
Take the following scholarly sentence from the
reference corpus as an example (example (1)):
(1) We introduce a novel method
for learning to find documents
on the web.
Figure 1. The interface for the collocation suggestion
nsubj (introduce-2, We-1)
det (method-5, a-3)
amod (method-5, novel-4)
dobj (introduce-2, method-5)
prepc_for (introduce-2, learning-7)
aux (find-9, to-8)
Figure 2. Dependency parsing of Example (1)
Traditionally, through part-of-speech tagging,
we can obtain a tagged sentence as follows (example (2)). We can observe that the desired collocation “introduce method”, conforming to the
“VERB+NOUN” relationship, exists within the
sentence. However, the distance between these
two words is flexible rather than fixed, and the
patterns between them can vary tremendously, so
heuristically written patterns for extracting such
verb-noun pairs might not be effective. In
addition, some verbs and nouns occur near each other but
are separated by a clause boundary and thus have
no syntactic relation with one another (e.g. “propose model” in example (3)).
(2) We/PRP introduce/VB a/DT
novel/JJ method/NN for/IN
learning/VBG to/TO find/VB
documents/NNS on/IN the/DT
web/NN ./.
(3) We proposed that the web-based model would be more effective than corpus-based one.
A natural language parser can facilitate the extraction of the target type of collocations. Such
a parser is a program that works out the grammatical structure of sentences, for instance, by identifying which groups of words go together or which
word is the subject or object of a verb. In our
study, we take advantage of a dependency parser,
the Stanford Parser, which extracts typed dependencies for certain grammatical relations (shown in
Figure 2). Within the parsed sentence of example
(1), we can see that the extracted dependency
“dobj (introduce-2, method-5)” meets the criterion.
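As a minimal sketch of this extraction step, the snippet below reads typed dependencies in the textual form shown in Figure 2 and keeps only the direct-object relations. The function name and the regular expression are assumptions made for illustration; they are not part of the Stanford Parser or of the system described here.

```python
import re

# Matches a typed dependency such as "dobj (introduce-2, method-5)".
DEP_PATTERN = re.compile(r"(\w+)\s*\((\S+)-(\d+),\s*(\S+)-(\d+)\)")

def extract_verb_noun(dependencies):
    """Return (verb, noun) pairs for every direct-object dependency."""
    pairs = []
    for line in dependencies:
        match = DEP_PATTERN.match(line.strip())
        if match and match.group(1) == "dobj":
            # group(2) is the governor (verb), group(4) the dependent (noun)
            pairs.append((match.group(2), match.group(4)))
    return pairs

# The dependencies listed in Figure 2 for example (1).
figure2 = [
    "nsubj (introduce-2, We-1)",
    "det (method-5, a-3)",
    "amod (method-5, novel-4)",
    "dobj (introduce-2, method-5)",
    "prepc_for (introduce-2, learning-7)",
    "aux (find-9, to-8)",
]
collocations = extract_verb_noun(figure2)
print(collocations)
```

Relying on the dependency relation rather than on word distance is what lets the verb and noun be paired even when other words intervene.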
Table 1. Example sentences and class tags (collocations)
Example Sentence
Class tag
Using a Classifier for the Suggestion task: A
classifier is a function that takes a set of
attributes as input and provides a class tag
as output. The basic way to build a classifier is to derive a regression formula from a set
of tagged examples. The trained classifier
can then make predictions and assign a tag to any
input data.
The suggestion task in this study will be seen
as a classification problem. We treat the collocation extracted from each sentence as the class tag
(see examples in Table 1). Hopefully, the system
can learn the rules between tagged classes (i.e.
collocations) and example sentences (i.e. scholarly sentences) and can predict which collocation
is the most appropriate one given attributes extracted from the sentences.
Another advantage of using a classifier to
automate suggestion is to provide alternatives
with regard to the similar attributes shared by
sentences. In Table 1, we can observe that these
collocations exhibit a similar discourse function
and can thus become interchangeable in these
sentences. Therefore, based on the outputs along
with the probability from the classifier, we can
provide more than one adequate suggestion.
Example sentence: In this paper, we will describe a method of identifying the syntactic role of antecedents, which consists of two phases
Class tag: describe
Feature Selection for Machine Learning: In
the final stage of training, we build a statistical
machine-learning model. For our task, we can
use a supervised method to automatically learn
the relationship between collocations and example sentences.
We choose Maximum Entropy (ME) as our training algorithm to build a collocation suggestion
classifier. One advantage of an ME classifier is
that, in addition to assigning a classification, it can
provide the probability of each assignment. The
ME framework estimates probabilities based on
the principle of making as few assumptions as
possible beyond the constraints imposed. Such constraints are derived from the
training data, expressing relationships between
features and outcomes.
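To illustrate how an ME classifier yields a probability for each candidate class, the sketch below uses the standard log-linear form, in which the score of a class is the sum of the weights of the active features and the scores are normalised with an exponential. The text does not give this formula, so treat it as the usual textbook formulation rather than the authors' implementation. The feature weights are hypothetical.

```python
import math

def me_probabilities(active_features, weights, classes):
    """Conditional maximum-entropy distribution p(class | features).

    `weights` maps (feature, class) to a real-valued weight; pairs that
    are absent count as weight 0.
    """
    scores = {
        c: sum(weights.get((f, c), 0.0) for f in active_features)
        for c in classes
    }
    z = sum(math.exp(s) for s in scores.values())  # normalisation constant
    return {c: math.exp(s) / z for c, s in scores.items()}

# Hypothetical weights for two candidate verb collocates (illustrative only).
weights = {
    ("CN=method", "propose"): 1.0,
    ("CN=method", "introduce"): 1.0,
    ("UniV_L=we", "introduce"): 0.5,
}
probs = me_probabilities(
    ["CN=method", "UniV_L=we"], weights, ["propose", "introduce"]
)
ranked = sorted(probs, key=probs.get, reverse=True)
print(ranked, probs)
```

Because every class receives a probability, the candidates can be ranked, which is what allows more than one suggestion to be offered.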
Moreover, an effective feature selection can
increase the precision of machine learning. In our
study, we employ the contextual features which
Example sentence: We introduce a novel method for learning to find documents on the web.
Class tag: introduce
Example sentence: We presented a method of improving Japanese dependency parsing by using large-scale statistical information.
Class tag: present
Example sentence: In this paper, we suggest a method that automatically constructs an NE tagged corpus from the web to be used for learning of NER systems.
Class tag: suggest
consist of two elements, the head and the ngram
of context words:
Head: Each collocation comprises two parts,
collocate and head. For example, in a given verb-noun collocation, the verb is the collocate as well
as the target for which we provide suggestions;
the noun serves as the head of the collocation and
conveys the essential meaning of the collocation.
We use the head as a feature to condition the
classifier to generate candidates relevant to a
given head.
Ngram: We use the context words around the
target collocation by considering the corresponding unigram and bigram words within the sentence. Moreover, to ensure relevance, context
words that lie beyond the punctuation
marks enclosing the collocation in question are
excluded. We use the parsed sentence from the
previous step (example (2)) to show the extracted
context features (example (4)):
(4) CN=method UniV_L=we
UniV_R=a UniV_R=novel UniN_L=a
UniN_L=novel UniN_R=for
UniN_R=learn BiV_R=a_novel
BiN_L=a_novel BiN_R=for_learn
BiV_I=we_a BiN_I=novel_for
CN refers to the head within the collocation. Uni and Bi indicate the unigram and bigram context words within a window of size
two, respectively. V and N distinguish the contexts related
to the verb or the noun. The final letters L, R and I show the position of the words in context: L = left, R = right, and I = in
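The feature scheme of example (4) can be sketched as below. The function name and signature are hypothetical. The sketch assumes that the tokens are already lower-cased and lemmatised (for instance "learning" as "learn") and that the I position pairs the word immediately before the target with the word immediately after it, which is how the features in example (4) appear to be built. The punctuation-based filtering described above is not implemented.

```python
def context_features(tokens, verb_idx, noun_idx, window=2):
    """Build head and n-gram context features for a verb-noun collocation."""
    feats = ["CN=" + tokens[noun_idx]]  # the head noun
    for tag, idx in (("V", verb_idx), ("N", noun_idx)):
        left = tokens[max(0, idx - window):idx]
        right = tokens[idx + 1:idx + 1 + window]
        feats += [f"Uni{tag}_L={w}" for w in left]
        feats += [f"Uni{tag}_R={w}" for w in right]
        # Bigrams are pairs of consecutive words inside each window.
        feats += [f"Bi{tag}_L={a}_{b}" for a, b in zip(left, left[1:])]
        feats += [f"Bi{tag}_R={a}_{b}" for a, b in zip(right, right[1:])]
        # "I" bigram: the words directly before and after the target word.
        if idx - 1 >= 0 and idx + 1 < len(tokens):
            feats.append(f"Bi{tag}_I={tokens[idx - 1]}_{tokens[idx + 1]}")
    return feats

# Example (1), lower-cased and lemmatised by hand for this illustration.
tokens = ["we", "introduce", "a", "novel", "method", "for", "learn",
          "to", "find", "document", "on", "the", "web"]
features = context_features(tokens, verb_idx=1, noun_idx=4)
print(features)
```

For this input the function yields the same thirteen features as example (4), although in a different order.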
Automatic Collocation Suggestion at Run-time
After the ME classifier is automatically trained,
the model is used to find the best collocation
suggestion. Figure 3 shows the algorithm for producing suggestions for a given sentence. The
input is a learner’s sentence in an abstract, along
with an ME model trained from the reference corpus.
In Step (1) of the algorithm, we parse the sentence for data preprocessing. Based on the parser
output, we extract the collocation from the given
sentence and generate feature sets in Steps
(2) and (3). Then, in Step (4), with the
trained machine-learning model, we obtain a set
of likely collocates with their probabilities as predicted
by the ME model. In Step (5), SuggestionFilter
singles out the valid collocations and returns the
best collocation suggestion as output in Step (6).
For example, if a learner inputs a sentence like
Example (5), the features and output candidates
are shown in Table 2.
(5) There are many investigations about wireless network
communication, especially it is
important to add Internet
transfer calculation speeds.
From an online research database, CiteSeer, we
have collected a corpus of 20,306 unique abstracts, which contained 95,650 sentences. To
train a Maximum Entropy classifier, 46,255 collocations are extracted and 790 verbal collocates
are identified as tagged classes for collocation
suggestions. We tested the classifier on scholarly
sentences in place of authentic student writings
which were not available at the time of this pilot
study. We extracted 364 collocations among 600
randomly selected sentences as the held out test
data not overlapping with the training set. To
automate the evaluation, we blank out the verb
collocates within these sentences and treat these
verbs directly as the only correct suggestions in
question, although two or more suggestions may
be interchangeable or at least appropriate. In this
sense, our evaluation is an underestimate of the
performance of the proposed method.
Procedure CollocationSuggestion(sent, MEmodel)
(1) parsedSent = Parsing(sent)
(2) extractedColl = CollocationExtraction(parsedSent)
(3) features = AssignFeature(parsedSent)
(4) probCollection = MEprob(features, MEmodel)
(5) candidate = SuggestionFilter(probCollection)
(6) Return candidate
Figure 3. Collocation Suggestion at Run-time
Table 2. An example from a learner’s sentence (extracted collocation: add speed)
Table 3. MRR for different feature sets
Feature Sets Included In Classifier
Features of HEAD
Features of CONTEXT
Features of HEAD+CONTEXT
While evaluating the quality of the suggestions
provided by our system, we used the mean reciprocal rank (MRR) of the first relevant suggestion returned as the quality metric of the system output, so as to assess whether the suggestion list contains an answer and how far up the
list the answer is. Table 3 shows that the best MRR of
our prototype system is 0.518. The results indicate that on average users could easily find the answer (an exact reproduction of the blanked-out
collocate) within the first two to three ranked
suggestions. It is very likely that we would obtain a much
higher MRR value if we went through the
lists and evaluated each suggestion by hand.
Moreover, Table 3 shows that
contextual features are quite informative in comparison with the baseline feature set containing
merely the HEAD feature. In addition, the integrated
feature set of HEAD and CONTEXT together
achieves a more satisfactory suggestion result.
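For reference, the MRR used above can be computed as in this minimal sketch. The suggestion lists and gold answers are invented for illustration. A test item whose answer does not appear in the list contributes zero, which is the usual convention and is assumed here.

```python
def mean_reciprocal_rank(ranked_lists, answers):
    """Average of 1/rank of the correct answer over all test items."""
    total = 0.0
    for suggestions, answer in zip(ranked_lists, answers):
        if answer in suggestions:
            total += 1.0 / (suggestions.index(answer) + 1)
        # An answer missing from the list contributes 0.
    return total / len(answers)

# Three hypothetical test items: answer at rank 2, at rank 1, and missing.
suggestions = [
    ["propose", "introduce", "present"],
    ["make", "take"],
    ["do", "give"],
]
gold = ["introduce", "make", "take"]
mrr = mean_reciprocal_rank(suggestions, gold)
print(mrr)
```

An MRR of about 0.5 means that the correct collocate sits, on average in the reciprocal sense, around the second position of the list.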
Many avenues exist for future research that are
important for improving the proposed method.
For example, we need to carry out the experiment on authentic learners’ texts. We will conduct a user study to investigate whether our system would improve a learner’s writing in a real
setting. Additionally, adding classifier features
based on the translation of misused words in
learners’ text could be beneficial (Chang et al.,
2008). The translation can help to resolve prevalent collocation misuses influenced by a learner's
native language. Yet another direction of this
research is to investigate if our methodology is
applicable to other types of collocations, such as
AN and PN in addition to VN dealt with in this
In summary, we have presented an unsupervised method for suggesting collocations based
on a corpus of abstracts collected from the Web.
The method involves selecting features from the
reference corpus of the scholarly texts. Then a
classifier is automatically trained to determine
the most probable collocation candidates with
regard to the given context. The preliminary results show that it is beneficial to use classifiers
for identifying and ranking collocation suggestions based on the context features.
Y. Chang, J. Chang, H. Chen, and H. Liou. 2008. An
automatic collocation writing assistant for Taiwanese EFL learners: A case of corpus-based NLP
technology. Computer Assisted Language Learning, 21(3), pages 283-299.
S. Granger. 1998. Prefabricated patterns in advanced
EFL writing: collocations and formulae. In Cowie,
A. (ed.) Phraseology: theory, analysis and applications. Oxford University Press, Oxford, pages 145-160.
P. Howarth. 1996. Phraseology in English Academic
Writing. Tübingen: Max Niemeyer Verlag.
P. Howarth. 1998. The phraseology of learner’s academic writing. In Cowie, A. (ed.) Phraseology:
theory, analysis and applications. Oxford University Press, Oxford, pages 161-186.
D. Hawking and N. Craswell. 2002. Overview of the
TREC-2001 Web track. In Proceedings of the 10th
Text Retrieval Conference (TREC 2001), pages 25-31.
L. E. Liu. 2002. A corpus-based lexical semantic investigation of verb-noun miscollocations in Taiwan
learners’ English. Unpublished master’s thesis,
Tamkang University, Taipei, January.
A. L. Liu, D. Wible, and N. L. Tsao. 2009. Automated
suggestions for miscollocations. In Proceedings of
the Fourth Workshop on Innovative Use of NLP for
Building Educational Applications, pages 47-50.
C. C. Shei and H. Pain. 2000. An ESL writer’s collocational aid. Computer Assisted Language Learning, 13, pages 167-182.
Event-based Hyperspace Analogue to Language for Query Expansion
Tingxu Yan
Tianjin University
Tianjin, China
[email protected]
Dawei Song
Robert Gordon University
Aberdeen, United Kingdom
[email protected]
Tamsin Maxwell
University of Edinburgh
Edinburgh, United Kingdom
[email protected]
Yuexian Hou
Tianjin University
Tianjin, China
[email protected]
dependence language model for IR (Gao et al.,
2004), which incorporates linguistic relations between non-adjacent words while limiting the generation of meaningless phrases, and the Markov
Random Field (MRF) model, which captures short
and long range term dependencies (Metzler and
Croft, 2005; Metzler and Croft, 2007), consistently outperform a unigram language modelling approach but are closely approximated by
a bigram language model that uses no linguistic knowledge. Improving retrieval performance
through application of semantic and syntactic information beyond proximity and co-occurrence
features is a difficult task but remains a tantalising
Our approach is like that of Gao et al. (2004)
in that it considers semantic-syntactically determined relationships between words at the sentence
level, but allows words to have more than one
role, such as predicate and argument for different events, while link grammar (Sleator and Temperley, 1991) dictates that a word can only satisfy one connector in a disjunctive set. Compared
to the MRF model, our approach is unsupervised,
whereas MRFs require the training of parameters using relevance judgments that are often unavailable
in practical conditions.
Other work incorporating syntactic and linguistic information into IR includes early research by
Smeaton, O’Donnell and Kelledy (1995), who
employed tree structured analytics (TSAs) resembling dependency trees, the use of syntax to detect paraphrases for question answering (QA) (Lin
and Pantel, 2001), and semantic role labelling in
QA (Shen and Lapata, 2007).
Independent from IR, Pado and Lapata (2007)
proposed a general framework for the construction of a semantic space endowed with syntactic
Bag-of-words approaches to information
retrieval (IR) are effective but assume independence between words. The Hyperspace Analogue to Language (HAL)
is a cognitively motivated and validated
semantic space model that captures statistical dependencies between words by
considering their co-occurrences in a surrounding window of text. HAL has been
successfully applied to query expansion in
IR, but has several limitations, including
high processing cost and use of distributional statistics that do not exploit syntax. In this paper, we pursue two methods
for incorporating syntactic-semantic information from textual ‘events’ into HAL.
We build the HAL space directly from
events to investigate whether processing
costs can be reduced through more careful
definition of word co-occurrence, and improve the quality of the pseudo-relevance
feedback by applying event information
as a constraint during HAL construction.
Both methods significantly improve performance results in comparison with original HAL, and interpolation of HAL and
relevance model expansion outperforms
either method alone.
Peng Zhang
Robert Gordon University
Aberdeen, United Kingdom.
[email protected]
Despite its intuitive appeal, the incorporation of
linguistic and semantic word dependencies in IR
has not been shown to significantly improve over
a bigram language modeling approach (Song and
Croft, 1999) that encodes word dependencies assumed from mere syntactic adjacency. Both the
Proceedings of the ACL 2010 Conference Short Papers, pages 120–125,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
compatibility with human information processing.
Recently, they have also been applied in IR, such
as LSA for latent semantic indexing, and HAL for
query expansion. For the purpose of this paper, we
focus on HAL, which encodes word co-occurrence
information explicitly and thus can be applied to
query expansion in a straightforward way.
HAL is premised on context surrounding a word
providing important information about its meaning (Harris, 1968). To be specific, an L-size
sliding window moves across a large text corpus
word-by-word. Any two words in the same window are treated as co-occurring with each other
with a weight that is inversely proportional to their
separation distance in the text. By accumulating
co-occurrence information over a corpus, a word-by-word matrix is constructed, a simple illustration of which is given in Table 1. A single word is
represented by a row vector and a column vector
that capture the information before and after the
word, respectively. In some applications, direction sensitivity is ignored to obtain a single vector
representation of a word by adding corresponding
row and column vectors (Bai et al., 2005).
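The construction just described can be sketched as follows. Two points are assumptions rather than statements from the text: the window is taken to cover the L words preceding the current word, and the weight for a pair at distance d is taken to be L - d + 1, so that closer words receive larger weights. The function names are hypothetical.

```python
from collections import defaultdict

def build_hal(tokens, window=5):
    """Accumulate a HAL matrix: hal[word][preceding_word] -> weight."""
    hal = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i - d < 0:
                break
            # Assumed weighting: adjacent words get `window`, farthest get 1.
            hal[word][tokens[i - d]] += window - d + 1
    return hal

def combined_vector(hal, word):
    """Add the row and column vectors of `word`, ignoring direction."""
    vec = defaultdict(int)
    for other, weight in hal[word].items():  # row: words before `word`
        vec[other] += weight
    for other, row in hal.items():           # column: words after `word`
        if word in row:
            vec[other] += row[word]
    return dict(vec)

hal = build_hal("w1 w2 w3 w4 w5 w6".split(), window=5)
w3_vector = combined_vector(hal, "w3")
print(w3_vector)
```

Adding the row and the column gives the direction-insensitive representation mentioned above; keeping them separate preserves the distinction between preceding and following context.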
information. This was represented by an undirected graph, where nodes stood for words, dependency edges stood for syntactical relations, and
sequences of dependency edges formed paths that
were weighted for each target word. Our work is
in line with Pado and Lapata (2007) in constructing a semantic space with syntactic information,
but builds our space from events, states and attributions as defined linguistically by Bach (1986).
We call these simply events, and extract them automatically from predicate-argument structures and
a dependency parse. We will use this space to perform query expansion in IR, a task that aims to find
additional words related to original query terms,
such that an expanded query including these words
better expresses the information need. To our
knowledge, the notion of events has not been applied to query expansion before.
This paper will outline the original HAL algorithm which serves as our baseline, and the
event extraction process. We then propose two
methods to arm HAL with event information: direct construction of HAL from events (eHAL-1),
and treating events as constraints on HAL construction from the corpus (eHAL-2). Evaluation
will compare results using original HAL, eHAL-1 and eHAL-2 with a widely used unigram language model (LM) for IR and a state-of-the-art
query expansion method, namely the Relevance
Model (RM) (Lavrenko and Croft, 2001). We also
explore whether a complementary effect can be
achieved by combining HAL-based dependency
modelling with the unigram-based RM.
Table 1: A HAL space for the text “w1 w2 w3 w4
w5 w6 ” using a 5-word sliding window (L = 5).
HAL Construction
Semantic space models aim to capture the meanings of words using co-occurrence information
in a text corpus. Two examples are the Hyperspace Analogue to Language (HAL) (Lund and
Burgess, 1996), in which a word is represented
by a vector of other words co-occurring with it
in a sliding window, and Latent Semantic Analysis (LSA) (Deerwester, Dumais, Furnas, Landauer and Harshman, 1990; Landauer, Foltz and
Laham, 1998), in which a word is expressed as
a vector of documents (or any other syntactical units such as sentences) containing the word.
In these semantic spaces, vector-based representations facilitate measurement of similarities between words. Semantic space models have been
validated through various studies and demonstrate
HAL has been successfully applied to query expansion and can be incorporated into this task directly (Bai et al., 2005) or indirectly, as with the
Information Flow method based on HAL (Bruza
and Song, 2002). However, to date it has used
only statistical information from co-occurrence
patterns. We extend HAL to incorporate syntactic-semantic information.
3 Event Extraction
Prior to event extraction, predicates, arguments,
part of speech (POS) information and syntactic dependencies are annotated using the best-performing joint syntactic-semantic parser from
the CoNLL 2008 Shared Task (Johansson and
be built in a similar manner to the original HAL.
We ignore the parameter of window length (L)
and treat every event as a single window of length
equal to the number of words in the event. Every
pair of words in an event is considered to be co-occurrent with each other. The weight assigned to
the association between each pair is simply set to
one. With this scheme, all the events are traversed
and the event-based HAL is constructed.
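As a rough sketch of this scheme, each event can be treated as a bag of words whose pairs all receive weight one. The function name `build_ehal1` and the symmetric storage are assumptions made for illustration; symmetry is chosen here because word order within an event is not used.

```python
from collections import defaultdict
from itertools import combinations


def build_ehal1(events):
    """Build an event-based HAL: every pair of distinct words in the
    same event co-occurs with weight 1, regardless of distance."""
    hal = defaultdict(lambda: defaultdict(float))
    for event in events:
        for a, b in combinations(event, 2):
            if a == b:
                continue  # skip a word paired with its own repetition
            hal[a][b] += 1.0
            hal[b][a] += 1.0  # assumed symmetric: order is ignored
    return hal


events = [
    ["has", "Baghdad", "already", "facilities", "continue", "producing"],
    ["continue", "quantities", "producing", "massive"],
]
ehal = build_ehal1(events)
```

With these two events, `continue` and `producing` co-occur twice, while `has` and `massive` never share an event and so have no association.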
The advantage of this method is that it substantially reduces the processing time during HAL
construction because only events are involved and
there is no need to calculate weights per occurrence. Additional processing time is incurred in
semantic role labelling (SRL) during event identification. However, the naive approach to extraction might be simulated with a combination of less
costly chunking and dependency parsing, given
that the word ordering information available with
SRL is not utilised.
eHAL-1 combines syntactical and statistical information, but has a potential drawback in that
only events are used during construction so some
information existing in the co-occurrence patterns
of the original text may be lost. This motivates the
second method.
Nugues, 2008), trained on PropBank and NomBank data. The event extraction algorithm then
instantiates the template REL [modREL] Arg0
[modArg0] ...ArgN [modArgN], where REL is the
predicate relation (or root verb if no predicates
are identified), and Arg0...ArgN are its arguments.
Modifiers (mod) are identified by tracing from
predicate and argument heads along the dependency tree. All predicates are associated with at
least one event unless both Arg0 and Arg1 are not
identified, or the only argument is not a noun.
The algorithm checks for modifiers based on
POS tag¹, tracing up and down the dependency
tree, skipping over prepositions, coordinating conjunctions and words indicating apportionment,
such as ‘sample (of)’. However, to constrain output the search is limited to a depth of one (with
the exception of skipping). For example, given
the phrase ‘apples from the store nearby’ and an
argument head apples, the first dependent, store,
will be extracted but not nearby, which is the dependent of store. This can be detrimental when
encountering compound nouns but does focus on
core information. For verbs, modal dependents are
not included in output.
Available paths up and down the dependency
tree are followed until all branches are exhausted,
given the rules outlined above. Tracing can result in multiple extracted events for one predicate
and predicates may also appear as arguments in
a different event, or be part of argument phrases.
For this reason, events are constrained to cover
only detail appearing above subsequent predicates
in the tree, which simplifies the event structure.
For example, the sentence “Baghdad already has
the facilities to continue producing massive quantities of its own biological and chemical weapons”
results in the event output: (1) has Baghdad already facilities continue producing; (2) continue
quantities producing massive; (3) producing quantities massive weapons biological; (4) quantities
weapons biological massive.
4.2 eHAL-2: Event-Based Filtering
This method attempts to include more statistical
information in eHAL construction. The key idea
is to decide whether a text segment in a corpus
should be used for the HAL construction, based
on how much event information it covers. Given a
corpus of text and the events extracted from it, the
eHAL-2 method runs as follows:
1. Select the events of length M or more and
discard the others for efficiency;
2. Set an “inclusion criterion”, which decides if
a text segment, defined as a word sequence
within an L-size sliding window, contains an
event. For example, if 80% of the words in an
event are contained in a text segment, it could
be considered to “include” the event;
4 HAL With Events
4.1 eHAL-1: Construction From Events
3. Move across the whole corpus word-by-word
with an L-size sliding window. For each window, complete Steps 4-7;
Since events are extracted from documents, they
form a reduced text corpus from which HAL can
To be specific, the modifiers include negation, as well as
adverbs or particles for verbal heads, adjectives and nominal
modifiers for nominal heads, and verbal or nominal dependents of modifiers, provided modifiers are not also identified
as arguments elsewhere in the event.
4. For the current L-size text segment, check
whether it includes an event according to the
“inclusion criterion” (Step 2);
based LM smoothed using Dirichlet prior with µ
set to 1000 as appropriate for TREC style title
queries (Lavrenko, 2004). The top 50 returned
documents form the basis for all pseudo-relevance
feedback, with other parameters tuned separately
for the RM and HAL methods.
For each dataset, the number of feedback terms
for each method is selected optimally among 20,
40, 60, 80⁴ and the interpolation and smoothing
coefficient is set to be optimal in [0,1] with interval 0.1. For RM, we choose the first relevance
model in Lavrenko and Croft (2001) with the document model smoothing parameter optimally set
at 0.8. The number of feedback terms is fixed at
60 (for AP89 and WSJ9092) and 80 (for AP8889),
and interpolation between the query and relevance
models is set at 0.7 (for WSJ9092) and 0.9 (for
AP89 and AP8889). The HAL-based query expansion methods add the top 80 expansion terms
to the query with interpolation coefficient 0.9 for
WSJ9092 and 1 (that is, no interpolation) for AP89
and AP8889. The other HAL-based parameters
are set as follows: shortest event length M = 5,
for eHAL-2 the “inclusion criterion” is 75% of
words in an event, and for HAL and eHAL-2, window size L = 8. Top expansion terms are selected
according to the formula:
5. If an event is included in the current text
segment, check the following segments for
a consecutive sequence of segments that also
include this event. If the current segment includes more than one event, find the longest
sequence of related text segments. An illustration is given in Figure 1 in which dark
nodes stand for the words in a specific event
and an 80% inclusion criterion is used.
Segment K
Segment K+1
Segment K+2
Segment K+3
Figure 1: Consecutive segments for an event
6. Extract the full span of consecutive segments
just identified and go to the next available text
segment. Repeat Step 3;
7. When the scanning is done, construct HAL
using the original HAL method over all extracted sequences.
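The segment selection in Steps 1-6 can be sketched as follows. This is a simplified illustration under stated assumptions: the names `includes` and `select_spans` are hypothetical, events are treated as word lists, and after a span is extracted the scan resumes at the first window starting beyond that span (one possible reading of "the next available text segment").

```python
def includes(segment, event, threshold=0.8):
    """Inclusion criterion: at least `threshold` of the event's words
    appear in the text segment."""
    seg = set(segment)
    hits = sum(1 for w in event if w in seg)
    return hits / len(event) >= threshold


def select_spans(tokens, events, L, M=1, threshold=0.8):
    """Return the text spans covered by runs of consecutive L-size
    windows that include the same event."""
    events = [e for e in events if len(e) >= M]  # Step 1
    spans = []
    last_start = max(len(tokens) - L, 0)
    i = 0
    while i <= last_start:  # Step 3
        best_end = None
        for ev in events:
            if not includes(tokens[i:i + L], ev, threshold):  # Step 4
                continue
            j = i
            # Step 5: extend over following windows that include ev.
            while j + 1 <= last_start and includes(
                tokens[j + 1:j + 1 + L], ev, threshold
            ):
                j += 1
            if best_end is None or j > best_end:
                best_end = j  # keep the longest run
        if best_end is None:
            i += 1
        else:
            spans.append(tokens[i:best_end + L])  # Step 6
            i = best_end + L  # assumed: resume after the extracted span
    return spans


tokens = "a b c d e f g h i j".split()
spans = select_spans(tokens, [["c", "d"]], L=3, threshold=1.0)
```

The original HAL construction (Step 7) would then be run over each returned span rather than over the full text.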
With the guidance of event information, the procedure above keeps only those segments of text
that include at least one event and discards the rest.
It makes use of more statistical co-occurrence information than eHAL-1 by applying weights that
are inversely proportional to word separation distance. It
also alleviates the identified drawback of eHAL-1
by using the full text surrounding events. A tradeoff is that not all the events are included by the
selected text segments, and thus some syntactical
information may be lost. In addition, the parametric complexity and computational complexity are
also higher than eHAL-1.
$$P_{HAL}(t_j \mid \oplus q) = \frac{HAL(t_j \mid \oplus q)}{\sum_{t_i} HAL(t_i \mid \oplus q)}$$
where $HAL(t_j \mid \oplus q)$ is the weight of $t_j$ in the combined HAL vector $\oplus q$ (Bruza and Song, 2002)
of original query terms. Mean Average Precision
(MAP) is the performance indicator, and t-test (at
the level of 0.05) is performed to measure the statistical significance of results.
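The term-selection formula amounts to normalising the weights of the combined query vector and keeping the highest-scoring terms. A minimal sketch, assuming the combined vector is available as a dictionary from term to weight (the function name and the alphabetical tie-breaking are illustrative choices):

```python
def top_expansion_terms(hal_weights, k=80):
    """Normalise HAL weights into a distribution and return the k
    most probable expansion terms as (term, probability) pairs."""
    total = sum(hal_weights.values())
    if total == 0:
        return []
    probs = {t: w / total for t, w in hal_weights.items()}
    # Sort by descending probability; break ties alphabetically.
    return sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


ranked = top_expansion_terms({"a": 2.0, "b": 6.0, "c": 2.0}, k=2)
```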
Table 2 lists the experimental results⁵. It can
be observed that all the three HAL-based query
expansion methods improve performance over the
LM and both eHALs achieve better performance
than original HAL, indicating that the incorporation of event information is beneficial. In addition,
eHAL-2 leads to better performance than eHAL-1, suggesting that use of linguistic information as
a constraint on statistical processing, rather than
the focus of extraction, is a more effective strategy. The results are still short of those achieved
We empirically test whether our event-based
HALs perform better than the original HAL, and
standard LM and RM, using three TREC² collections: AP89 with Topics 1-50 (title field),
AP8889 with Topics 101-150 (title field) and
WSJ9092 with Topics 201-250 (description field).
All the collections are stemmed, and stop words
are removed, prior to retrieval using the Lemur
Toolkit Version 4.11³. Initial retrieval is identical for all models evaluated: KL-divergence
For RM, feedback terms were also tested on larger numbers up to 1000, but only comparable results were observed.
In Table 2, brackets show percent improvement of
eHALs / RM over HAL / eHAL-2 respectively and * and #
indicate the corresponding statistical significance.
TREC stands for the Text REtrieval Conference series
run by NIST. Please refer to for details.
Available at
Table 2: Performance (MAP) comparison of query
expansion using different HALs
Table 3: Performance (MAP) comparison of query
expansion using the combination of RM and term
with RM, but the gap is significantly reduced by
incorporating event information here, suggesting
this is a promising line of work. In addition, as
shown in (Bai et al., 2005), the Information Flow
method built upon the original HAL largely outperformed RM. We expect that eHAL would provide an even better basis for Information Flow, but
this possibility is yet to be explored.
As is known, RM is a pure unigram model while
HAL methods are dependency-based. They capture different information, hence it is natural to
consider if their strengths might complement each
other in a combined model. For this purpose, we
design the following two schemes:
the occurrence frequencies of individual words
into account, which is not well-captured by the
events. In contrast, the performance of Scheme 2
is more promising. The three methods outperform
the original RM in most cases, but the improvement is not significant, and there is
little difference between RM combined with
HAL and RM combined with the eHALs. This suggests that more
effective methods could be devised to complement
the unigram models with the syntactical and statistical dependency information.
1. Apply RM to the feedback documents (original RM), the events extracted from these
documents (eRM-1), and the text segments
around each event (eRM-2), where the three
sources are the same as used to produce HAL,
eHAL-1 and eHAL-2 respectively;
6 Conclusions
The application of original HAL to query expansion attempted to incorporate statistical word association information, but did not take into account the syntactical dependencies and had a
high processing cost. By utilising syntactic-semantic knowledge from event modelling of
pseudo-relevance feedback documents prior to
computing the HAL space, we showed that processing costs might be reduced through more careful selection of word co-occurrences and that performance may be enhanced by effectively improving the quality of pseudo-relevance feedback documents. Both methods improved over original
HAL query expansion. In addition, interpolation
of HAL and RM expansion improved results over
those achieved by either method alone.
2. Interpolate the expanded query model by
RM with the ones generated by each HAL,
represented by HAL+RM, eHAL-1+RM and
eHAL-2+RM. The interpolation coefficient is
again selected to achieve the optimal MAP.
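Scheme 2 can be illustrated as a linear interpolation of two expanded query models, with the coefficient chosen on a grid over [0, 1]. The sketch below is an assumption-laden illustration: `evaluate` is a hypothetical callback standing in for a retrieval run that returns MAP, and the grid step of 0.1 follows the tuning interval mentioned in the experimental setup.

```python
def interpolate(p_rm, p_hal, lam):
    """Mix two term distributions: lam * RM + (1 - lam) * HAL."""
    terms = set(p_rm) | set(p_hal)
    return {
        t: lam * p_rm.get(t, 0.0) + (1.0 - lam) * p_hal.get(t, 0.0)
        for t in terms
    }


def best_lambda(p_rm, p_hal, evaluate):
    """Pick the coefficient in {0.0, 0.1, ..., 1.0} whose interpolated
    model maximises `evaluate` (hypothetical: returns a score such as MAP)."""
    grid = [i / 10 for i in range(11)]
    return max(grid, key=lambda lam: evaluate(interpolate(p_rm, p_hal, lam)))


mixed = interpolate({"a": 1.0}, {"b": 1.0}, 0.7)
```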
The MAP comparison between the original RM
and these new models is shown in Table 3⁶. From the first three lines (Scheme 1), we
can observe that in most cases the performance
generally deteriorates when RM is directly run
over the events and the text segments. The event
information is more effective at expressing term dependencies, while the unigram RM ignores this information and only takes
This research is funded in part by the UK’s Engineering and Physical Sciences Research Council,
grant number: EP/F014708/2.
For rows in Table 3, brackets show percent difference
from original RM.
SIGIR conference on Research and development in
information retrieval, pp. 472–479, New York, NY,
Bach E. The Algebra of Events. 1986. Linguistics and
Philosophy, 9(1): pp. 5–16.
Metzler D. and Croft W. B. Latent Concept Expansion using Markov Random Fields. 2007. In: SIGIR
’07: Proceedings of the 30th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 311–318, ACM,
New York, NY, USA.
Bai J. and Song D. and Bruza P. and Nie J.-Y. and Cao
G. Query Expansion using Term Relationships in
Language Models for Information Retrieval 2005.
In: Proceedings of the 14th International ACM Conference on Information and Knowledge Management, pp. 688–695.
Pado S. and Lapata M. Dependency-Based Construction of Semantic Space Models. 2007. Computational Linguistics, 33: pp. 161–199.
Bruza P. and Song D. Inferring Query Models by Computing Information Flow. 2002. In: Proceedings of
the 11th International ACM Conference on Information and Knowledge Management, pp. 206–269.
Shen D. and Lapata M. Using Semantic Roles to Improve Question Answering. 2007. In: Proceedings
of the 2007 Joint Conference on Empirical Methods
in Natural Language Processing and Computational
Natural Language Learning, pp. 12–21.
Deerwester S., Dumais S., Furnas G., Landauer T. and
Harshman R. Indexing by latent semantic analysis.
1990. Journal of the American Society for Information Science, 41(6): pp. 391–407.
Sleator D. D. and Temperley D. Parsing English with
a Link Grammar. 1991. Technical Report CMU-CS-91-196, Department of Computer Science, Carnegie
Mellon University.
Gao J. and Nie J. and Wu G. and Cao G. Dependence
Language Model for Information Retrieval. 2004.
In: Proceedings of the 27th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 170–177.
Harris Z. 1968. Mathematical Structures of Language. Wiley, New York.
Smeaton A. F., O’Donnell R. and Kelledy F. Indexing
Structures Derived from Syntax in TREC-3: System
Description. 1995. In: The Third Text REtrieval
Conference (TREC-3), pp. 55–67.
Johansson R. and Nugues P.
Syntactic-semantic Analysis with PropBank and
NomBank. 2008. In: CoNLL ’08: Proceedings of
the Twelfth Conference on Computational Natural
Language Learning, pp. 183–187.
Song F. and Croft W. B. A General Language Model
for Information Retrieval. 1999. In: CIKM ’99:
Proceedings of the Eighth International Conference on Information and Knowledge Management,
pp. 316–321, New York, NY, USA, ACM.
Landauer T., Foltz P. and Laham D. Introduction to Latent Semantic Analysis. 1998. Discourse Processes,
25: pp. 259–284.
Lavrenko V. 2004. A Generative Theory of Relevance,
PhD thesis, University of Massachusetts, Amherst.
Lavrenko V. and Croft W. B. Relevance Based Language Models. 2001. In: SIGIR ’01: Proceedings
of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 120–127, New York, NY, USA,
2001. ACM.
Lin D. and Pantel P. DIRT - Discovery of Inference
Rules from Text. 2001. In: KDD ’01: Proceedings
of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
323–328, New York, NY, USA.
Lund K. and Burgess C. Producing High-dimensional
Semantic Spaces from Lexical Co-occurrence.
1996. Behavior Research Methods, Instruments &
Computers, 28: pp. 203–208. Prentice-Hall, Englewood Cliffs, NJ.
Metzler D. and Croft W. B. A Markov Random Field
Model for Term Dependencies. 2005. In: SIGIR ’05:
Proceedings of the 28th annual international ACM
Automatically Generating Term-frequency-induced Taxonomies
Karin Murthy
Tanveer A Faruquie
L Venkata Subramaniam
K Hima Prasad
Mukesh Mohania
IBM Research - India
and Subramaniam, 2006), even if those relationships do not explicitly appear in the text. Though
these methods tackle inconsistency by addressing
taxonomy deduction globally, the relationships extracted are often difficult to interpret by humans.
We show that for certain domains, the frequency
with which terms appear in a corpus on their own
and in conjunction with other terms induces a natural taxonomy. We formally define the concept
of a term-frequency-based taxonomy and show
its applicability for an example application. We
present an unsupervised method to generate such
a taxonomy from scratch and outline how domain-specific constraints can easily be integrated into
the generation process. An advantage of the new
method is that it can also be used to extend an existing taxonomy.
We evaluated our method on a large corpus of
real-life addresses. For addresses from emerging
geographies no standard postal address scheme
exists and our objective was to produce a postal
taxonomy that is useful in standardizing addresses
(Kothari et al., 2010). Specifically, the experiments were designed to investigate the effectiveness of our approach on noisy terms with lots of
variations. The results show that our method is
able to induce a taxonomy without using any kind
of lexical-semantic patterns.
We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based
taxonomy is useful for application domains where the frequency with which
terms occur on their own and in combination with other terms imposes a natural
term hierarchy. We highlight an application for our approach and demonstrate its
effectiveness and robustness in extracting
knowledge from real-world data.
Taxonomy deduction is an important task to understand and manage information. However, building
taxonomies manually for specific domains or data
sources is time consuming and expensive. Techniques to automatically deduce a taxonomy in an
unsupervised manner are thus indispensable. Automatic deduction of taxonomies consists of two
tasks: extracting relevant terms to represent concepts of the taxonomy and discovering relationships between concepts. For unstructured text, the
extraction of relevant terms relies on information
extraction methods (Etzioni et al., 2005).
The relationship extraction task can be classified into two categories. Approaches in the first
category use lexical-syntactic formulation to define patterns, either manually (Kozareva et al.,
2008) or automatically (Girju et al., 2006), and
apply those patterns to mine instances of the patterns. Though producing accurate results, these
approaches usually have low coverage for many
domains and suffer from the problem of inconsistency between terms when connecting the instances as chains to form a taxonomy. The second
category of approaches uses clustering to discover
terms and the relationships between them (Roy
Related Work
One approach for taxonomy deduction is to use
explicit expressions (Iwaska et al., 2000) or lexical and semantic patterns such as is a (Snow et al.,
2004), similar usage (Kozareva et al., 2008), synonyms and antonyms (Lin et al., 2003), purpose
(Cimiano and Wenderoth, 2007), and employed by
(Bunescu and Mooney, 2007) to extract and organize terms. The quality of extraction is often controlled using statistical measures (Pantel and Pennacchiotti, 2006) and external resources such as
wordnet (Girju et al., 2006). However, there are
Proceedings of the ACL 2010 Conference Short Papers, pages 126–131,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
domains (such as the one introduced in Section
3.2) where the text does not allow the derivation
of linguistic relations.
Supervised methods for taxonomy induction
provide training instances with global semantic information about concepts (Fleischman and
Hovy, 2002) and use bootstrapping to induce new
seeds to extract further patterns (Cimiano et al.,
2005). Semi-supervised approaches start with
known terms belonging to a category, construct
context vectors of classified terms, and associate
categories to previously unclassified terms depending on the similarity of their context (Tanev
and Magnini, 2006). However, providing training data and hand-crafted patterns can be tedious.
Moreover in some domains (such as the one presented in Section 3.2) it is not possible to construct
a context vector or determine the replacement fit.
Unsupervised methods use clustering of word-context vectors (Lin, 1998), co-occurrence (Yang
and Callan, 2008), and conjunction features (Caraballo, 1999) to discover implicit relationships.
However, these approaches do not perform well
for small corpora. Also, it is difficult to label the
obtained clusters which poses challenges for evaluation. To avoid these problems, incremental clustering approaches have been proposed (Yang and
Callan, 2009). Recently, lexical entailment has
been used where the term is assigned to a category if its occurrence in the corpus can be replaced
by the lexicalization of the category (Giuliano and
Gliozzo, 2008). In our method, terms are incrementally added to the taxonomy based on their
support and context.
Association rule mining (Agrawal and Srikant,
1994) discovers interesting relations between
terms, based on the frequency with which terms
appear together. However, the number of patterns
generated is often huge and constructing a taxonomy from all the patterns can be challenging.
In our approach, we employ similar concepts but
make taxonomy construction part of the relationship discovery process.
Figure 1: Part of an address taxonomy
Let C be a corpus of records r. Each record is
represented as a set of terms t. Let T = {t | t ∈
r ∧ r ∈ C} be the set of all terms of C. Let f(t)
denote the frequency of term t, that is, the number
of records in C that contain t. Let F(t, T⁺, T⁻)
denote the frequency of term t given a set of must-also-appear terms T⁺ and a set of cannot-also-appear terms T⁻:
$$F(t, T^{+}, T^{-}) = |\{ r \in C \mid t \in r \wedge \forall t' \in T^{+} : t' \in r \wedge \forall t' \in T^{-} : t' \notin r \}|.$$
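The conditional frequency can be written directly from this definition. The following is a small sketch, assuming each record is a Python set of terms; the function name and the toy corpus are illustrative only.

```python
def conditional_frequency(corpus, term, must=frozenset(), cannot=frozenset()):
    """Count records that contain `term`, contain every term in `must`
    (T+), and contain no term in `cannot` (T-)."""
    must, cannot = frozenset(must), frozenset(cannot)
    return sum(
        1
        for record in corpus
        if term in record and must <= record and not (cannot & record)
    )


corpus = [
    {"texas", "houston"},
    {"texas", "dallas"},
    {"vermont"},
    {"houston"},
]
```

For this toy corpus, `houston` occurs in two records overall, in one record together with `texas`, and in one record once records containing `texas` are excluded.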
A term-frequency-induced taxonomy (TFIT), is
an ordered tree over terms in T . For a node n in
the tree, n.t is the term at n, A(n) the ancestors of
n, and P (n) the predecessors of n.
A TFIT has a root node with the special term ⊥
and the conditional frequency ∞. The following
condition is true for any other node n:
∀t ∈ T, F (n.t, A(n), P (n)) ≥ F (t, A(n), P (n)).
That is, each node’s term has the highest conditional frequency in the context of the node’s ancestors and predecessors. Only terms with a conditional frequency above zero are added to a TFIT.
We show in Section 4 how a TFIT taxonomy
can be automatically induced from a given corpus.
But before that, we show that TFITs are useful in
practice and reflect a natural ordering of terms for
application domains where the concept hierarchy
is expressed through the frequency in which terms
Term-frequency-induced Taxonomies
Example Domain: Address Data
An address taxonomy is a key enabler for address
standardization. Figure 1 shows part of such an address taxonomy where the root contains the most
generic term and leaf-level nodes contain the most
specific terms. For emerging economies building
a standardized address taxonomy is a huge chal-
For some application domains, a taxonomy is induced by the frequency in which terms appear in a
corpus on their own and in combination with other
terms. We first introduce the problem formally and
then motivate it with an example application.
[Table 1 body not recoverable from the extracted text. Surviving cell values: "Part of address" entries house number, building name, city (taluk), ZIP code; category entries proper noun and 6 digit string.]
states), the conditional-frequency constraint introduced in Section 3.1 is enforced for each node in a
TFIT. The state ’Texas’, which contains ’Houston’ and is more frequent, is picked before ’Houston’. After ’Texas’ is
picked, it appears in the “cannot-also-appear” list
for all further siblings on the first level, thus giving
’Houston’ a conditional frequency of zero.
We show in Section 5 that an address taxonomy
can be inferred by generating a TFIT taxonomy.
Automatically Generating TFITs
We describe a basic algorithm to generate a TFIT
and then show extensions to adapt to different application domains.
Table 1: Example of a tokenized address
lenge. First, new areas and with it new addresses
constantly emerge. Second, there are very limited
conventions for specifying an address (Faruquie et
al., 2010). However, while many developing countries do not have a postal taxonomy, there is often
no lack of address data to learn a taxonomy from.
Column 2 of Table 1 shows an example of an
Indian address. Although Indian addresses tend to
follow the general principle that more specific information is mentioned earlier, there is no fixed order for different elements of an address. For example, the ZIP code of an address may be mentioned
before or after the state information and, although
ZIP code information is more specific than city information, it is generally mentioned later in the
address. Also, while ZIP codes often exist, their
use by people is very limited. Instead, people tend
to mention copious amounts of landmark information (see for example rows 4-6 in Table 1).
Taking all this into account, there is often not
enough structure available to automatically infer a
taxonomy purely based on the structural or semantic aspects of an address. However, for address
data, the general-to-specific concept hierarchy is
reflected in the frequency with which terms appear
on their own and together with other terms.
It mostly holds that f (s) > f (d) > f (c) >
f (z) where s is a state name, d is a district name,
c is a city name, and z is a ZIP code. However, sometimes the name of a large city may be
more frequent than the name of a small state. For
example, in a given corpus, the term ’Houston’
(a populous US city) may appear more frequently
than the term ’Vermont’ (a small US state). To
prevent ’Houston’ from being picked as a node at the first
level of the taxonomy (which should only contain
Base Algorithm
Algorithm 1 Algorithm for generating a TFIT.
// For initialization, T+ and T− are empty
// For initialization, l and w are zero
genTFIT(T+, T−, C, l, w)
  // Select the most frequent term
  tnext = tj such that F(tj, T+, T−) is maximal amongst all tj ∈ C;
  fnext = F(tnext, T+, T−);
  if fnext ≥ support then
    // Output node (tnext, l, w)
    // Generate child node
    genTFIT(T+ ∪ {tnext}, T−, C, l + 1, w)
    // Generate sibling node
    genTFIT(T+, T− ∪ {tnext}, C, l, w + 1)
  end if
To generate a TFIT taxonomy as defined in Section 3.1 we recursively pick the most frequent term
given previously chosen terms. The basic algorithm genTFIT is sketched out in Algorithm 1.
When genTFIT is called the first time, T+ and
T− are empty and both level l and width w are
zero. With each call of genTFIT a new node
n in the taxonomy is created with (t, l, w) where
t is the most frequent term given T+ and T−
and l and w capture the position in the taxonomy.
genTFIT is recursively called to generate a child
of n and a sibling for n.
The only input parameter required by our algorithm is support. Instead of adding all terms
with a conditional frequency above zero, we only
add terms with a conditional frequency equal to or
higher than support. The support parameter controls the precision of the resulting TFIT and also
the runtime of the algorithm. Increasing support
increases the precision but also lowers the recall.
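A runnable sketch of the recursion in Algorithm 1 is given below. It is an illustration under assumptions rather than the Java and DB2 implementation used in the evaluation: records are Python sets, terms already in T+ are excluded from the candidates (otherwise the term just chosen would always be the most frequent one in its own child call), and ties are broken alphabetically.

```python
from collections import Counter


def gen_tfit(corpus, support, must=frozenset(), cannot=frozenset(),
             level=0, width=0, out=None):
    """Recursively emit (term, level, width) nodes of a TFIT."""
    if out is None:
        out = []
    # Conditional frequencies of all candidate terms given T+ and T-.
    counts = Counter()
    for record in corpus:
        if must <= record and not (cannot & record):
            counts.update(record - must)  # assumed: skip terms in T+
    if not counts:
        return out
    # Most frequent term; alphabetical order breaks ties (assumption).
    term, freq = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if freq >= support:
        out.append((term, level, width))
        # Child: the chosen term must also appear.
        gen_tfit(corpus, support, must | {term}, cannot, level + 1, width, out)
        # Sibling: the chosen term cannot also appear.
        gen_tfit(corpus, support, must, cannot | {term}, level, width + 1, out)
    return out


corpus = [
    {"texas", "houston"}, {"texas", "houston"},
    {"texas", "dallas"}, {"texas", "austin"},
    {"vermont", "burlington"}, {"vermont", "burlington"},
    {"vermont", "montpelier"},
]
nodes = gen_tfit(corpus, support=2)
```

With `support=2`, the infrequent terms `dallas`, `austin` and `montpelier` fall below the threshold and are not added, which mirrors the precision and recall trade-off described above.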
Integrating Constraints
’Houston’ may become a node at the first level and
appear to be a state. Generally, such cases only appear at the far right of the taxonomy.
Structural as well as semantic constraints can easily be integrated into the TFIT generation.
We distinguish between taxonomy-level and
node-level structural constraints. For example,
limiting the depth of the taxonomy by introducing a maxLevel constraint and checking before
each recursive call if maxLevel is reached, is
a taxonomy-level constraint. A node-level constraint applies to each node and affects the way
the frequency of terms is determined.
For our example application, we introduce the
following node-level constraint: at each node we
only count terms that appear at specific positions
in records with respect to the current level of the
node. Specifically, we slide (or incrementally increase) a window over the address records starting from the end. For example, when picking the
term ’Washington’ as a state name, occurrences of
’Washington’ as city or street name are ignored.
Using a window instead of an exact position accounts for positional variability. Also, to accommodate varying amounts of landmark information
we length-normalize the position of terms. That is,
we divide all positions in an address by the average
length of an address (which is 10 for our 40 million addresses). Accordingly, we adjust the size of
the window and use increments of 0.1 for sliding
(or increasing) the window.
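One possible reading of this positional constraint is sketched below. The details are assumptions made for illustration: positions are counted from the end of the record, divided by the average address length, and a term is counted only if its normalized position falls inside a half-open window [lo, hi). The function names are hypothetical.

```python
def normalized_positions_from_end(record, avg_len=10.0):
    """Position of each term counted from the end of the record,
    divided by the average address length."""
    n = len(record)
    return [(n - 1 - i) / avg_len for i in range(n)]


def terms_in_window(record, lo, hi, avg_len=10.0):
    """Terms whose normalized position lies in [lo, hi)."""
    positions = normalized_positions_from_end(record, avg_len)
    return [t for t, p in zip(record, positions) if lo <= p < hi]


address = ["washington", "street", "seattle", "washington"]
state_candidates = terms_in_window(address, 0.0, 0.1)
```

In this toy address only the final occurrence of `washington` falls inside the first window, so its earlier occurrence as a street name would not be counted when state-level terms are picked. Widening the window to `[0.0, 0.2)` would also admit `seattle`.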
In addition to syntactical constraints, semantic
constraints can be integrated by classifying terms
for use when picking the next frequent term. In our
example application, markers tend to appear much
more often than any proper noun. For example,
the term ’Road’ appears in almost all addresses,
and might be picked up as the most frequent term
very early in the process. Thus, it is beneficial to
ignore marker terms during taxonomy generation
and add them in a post-processing step.
We present an evaluation of our approach for address data from an emerging economy. We implemented our algorithm in Java and store the records
in a DB2 database. We rely on the DB2 optimizer
to efficiently retrieve the next frequent term.
The results are based on 40 million Indian addresses. Each address record was given to us as
a single string and was first tokenized into a sequence of terms as shown in Table 1. In a second
step, we addressed spelling variations. There is no
fixed way of transliterating Indian alphabets to English and most Indian proper nouns have various
spellings in English. We used tools to detect synonyms with the same context to generate a list of
rules to map terms to a standard form (Lin, 1998).
For example, in Table 1 ’Maharashtra’ can also be
spelled ’Maharastra’. We also used a list of keywords to classify some terms as markers such as
’Road’ and ’Nagar’ shown in Table 1.
Our evaluation consists of two parts. First, we
show results for constructing a TFIT from scratch.
To evaluate the precision and recall we also retrieved post office addresses from India Post,
cleaned them, and organized them in a tree.
Second, we use our approach to enrich the existing hierarchy created from post office addresses
with additional area terms. To validate the result,
we also retrieved data about which area names appear within a ZIP code. We also verified whether
Google Maps shows an area on its map.
Taxonomy Generation
We generated a taxonomy O using all 40 million
addresses. We compare the terms assigned to
category levels district and taluk in O with the
tree P constructed from post office addresses.
Each district and taluk has at least one post office.
Thus P covers all districts and taluks and allows
us to test coverage and precision. We compute the
precision and recall for each category level CL as
Handling Noise
The approach we propose naturally handles noise
by ignoring it, unless the noise level exceeds the
support threshold. Misspelled terms are generally
infrequent and will as such not become part of
the taxonomy. The same applies to incorrect addresses. Incomplete addresses partially contribute
to the taxonomy and only cause a problem if the
same information is missing too often. For example, if more than support addresses with the
city ’Houston’ are missing the state ’Texas’, then
Administrative division in some South-Asian countries.
Recall %
Precision %
Chira Bazar
Dhobi Talao
Kalbadevi Road
Marine Drive
Marine Lines
Princess Street
Thakurdwar Road
Zaveri Bazar
Charni Road
Khadilkar Road
Khetwadi Road
Opera House
Prathna Samaj
Table 2: Precision and recall for categorizing
terms belonging to the state Maharashtra
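\[
\mathrm{Recall}_{CL} = \frac{\#\ \text{correct paths from root to } CL \text{ in } O}{\#\ \text{paths from root to } CL \text{ in } P}
\]
\[
\mathrm{Precision}_{CL} = \frac{\#\ \text{correct paths from root to } CL \text{ in } O}{\#\ \text{paths from root to } CL \text{ in } O}
\]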
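The two measures can be computed directly once both trees are flattened into root-to-node paths. The sketch below assumes that a path in O counts as correct exactly when the same path exists in P, and that paths are represented as tuples of terms; both choices are assumptions made for illustration.

```python
def precision_recall(generated_paths, reference_paths):
    """Compute (precision, recall) for one category level.

    generated_paths: root-to-node paths at that level in the generated taxonomy O.
    reference_paths: root-to-node paths at that level in the reference tree P.
    A generated path is treated as correct if it also occurs in P (assumption)."""
    generated = set(generated_paths)
    reference = set(reference_paths)
    correct = generated & reference
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall
```

Both measures share the same numerator; they differ only in whether the count is normalized by the size of O (precision) or of P (recall).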
Table 2 shows precision and recall for district
and taluk for the large state Maharashtra. Recall
is good for district. For taluk it is lower because a
major part of the data belongs to urban areas where
taluk information is missing. The precision seems
to be low but it has to be noted that in almost 75%
of the addresses either district or taluk information is missing or noisy. Given that, we were able
to recover a significant portion of the knowledge
We also examined a branch for a smaller state
(Kerala). Again, both districts and taluks appear
at the next level of the taxonomy. For a support
of 200 there are 19 entries in O of which all but
two appear in P as district or taluk. One entry is a
taluk that actually belongs to Maharashtra and one
entry is a name variation of a taluk in P . There
were not enough addresses to get a good coverage
of all districts and taluks.
Table 3: Areas found for ZIP code 400002 (top)
and 400004 (bottom)
tokenization process. 16 correct terms out of 18
terms results in a precision of 89%.
We also ran experiments to measure the coverage of area detection for Mumbai without using ZIP codes. Initializing our algorithm with
Maharashtra and Mumbai yielded over 100 areas with a support of 300 or more. However,
again the precision is low because quite a few of
those areas are actually taluk names.
Using a large number of addresses is necessary
to achieve good recall and precision.
Taxonomy Augmentation
In this paper, we presented a novel approach to
generate a taxonomy for data where terms exhibit an inherent frequency-based hierarchy. We
showed that term frequency can be used to generate a meaningful taxonomy from address records.
The presented approach can also be used to extend
an existing taxonomy which is a big advantage
for emerging countries where geographical areas
evolve continuously.
While we have evaluated our approach on address data, it is applicable to all data sources where
the inherent hierarchical structure is encoded in
the frequency with which terms appear on their
own and together with other terms. Preliminary
experiments on real-time analyst’s stock market
tips produced a taxonomy of (TV station, Analyst, Affiliation) with decent precision and recall.
We used P and ran our algorithm for each branch
in P to include area information. We focus our
evaluation on the city Mumbai. The recall is low
because many addresses do not mention a ZIP
code or use an incorrect ZIP code. However,
the precision is good implying that our approach
works even in the presence of large amounts of
Table 3 shows the results for ZIP code 400002
and 400004 for a support of 100. We get similar results for other ZIP codes. For each detected
area we compared whether the area is also listed
on, part of a post office name
(PO), or shown on Google Maps. All but four
areas found are confirmed by at least one of the
three external sources. Out of the unconfirmed
terms, Fanaswadi and Marine Drive seem to
be genuine area names, but we could not confirm
Dhakurdwar Road. The term th is due to our
See Live Market voices at: home.jsp
Language for Knowledge and Knowledge for Language, pages 335–345.
Rakesh Agrawal and Ramakrishnan Srikant. 1994.
Fast algorithms for mining association rules in large
databases. In Proceedings of the 20th International
Conference on Very Large Data Bases, pages 487–
Govind Kothari, Tanveer A Faruquie, L V Subramaniam, K H Prasad, and Mukesh Mohania. 2010.
Transfer of supervision for improved address standardization. In Proceedings of the 20th International Conference on Pattern Recognition.
Razvan C. Bunescu and Raymond J. Mooney. 2007.
Learning to extract relations from the web using
minimal supervision. In Proceedings of the 45th Annual Meeting of the Association of Computational
Linguistics, pages 576–583.
Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy.
2008. Semantic class learning from the web with
hyponym pattern linkage graphs. In Proceedings of
the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1048–1056.
Sharon A. Caraballo. 1999. Automatic construction
of a hypernym-labeled noun hierarchy from text. In
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pages 120–126.
Dekang Lin, Shaojun Zhao, Lijuan Qin, and Ming
Zhou. 2003. Identifying synonyms among distributionally similar words. In Proceedings of the 18th
International Joint Conference on Artificial Intelligence, pages 1492–1493.
Philipp Cimiano and Johanna Wenderoth. 2007. Automatic acquisition of ranked qualia structures from
the web. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 888–895.
Dekang Lin. 1998. Automatic retrieval and clustering
of similar words. In Proceedings of the 17th International Conference on Computational Linguistics,
pages 768–774.
Philipp Cimiano, Günter Ladwig, and Steffen Staab.
2005. Gimme’ the context: context-driven automatic semantic annotation with c-pankow. In Proceedings of the 14th International Conference on
World Wide Web, pages 332–341.
Patrick Pantel and Marco Pennacchiotti.
Espresso: leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 113–120.
Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland,
Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web:
an experimental study.
Artificial Intelligence,
Shourya Roy and L Venkata Subramaniam. 2006. Automatic generation of domain models for call centers from noisy transcriptions. In Proceedings of
the 21st International Conference on Computational
Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 737–
Tanveer A. Faruquie, K. Hima Prasad, L. Venkata
Subramaniam, Mukesh K. Mohania, Girish Venkatachaliah, Shrinivas Kulkarni, and Pramit Basu.
2010. Data cleansing as a transient service. In
Proceedings of the 26th International Conference on
Data Engineering, pages 1025–1036.
Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2004.
Learning syntactic patterns for automatic hypernym
discovery. In Advances in Neural Information Processing Systems, pages 1297–1304.
Michael Fleischman and Eduard Hovy. 2002. Fine
grained classification of named entities. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7.
Hristo Tanev and Bernardo Magnini. 2006. Weakly
supervised approaches for ontology population. In
Proceedings of the 11th Conference of the European
Chapter of the Association for Computational Linguistics, pages 3–7.
Roxana Girju, Adriana Badulescu, and Dan Moldovan.
2006. Automatic discovery of part-whole relations.
Computational Linguistics, 32(1):83–135.
Hui Yang and Jamie Callan. 2008. Learning the distance metric in a personal ontology. In Proceeding of the 2nd International Workshop on Ontologies and Information Systems for the Semantic Web,
pages 17–24.
Claudio Giuliano and Alfio Gliozzo. 2008. Instance-based ontology population exploiting named-entity
substitution. In Proceedings of the 22nd International Conference on Computational Linguistics,
pages 265–272.
Hui Yang and Jamie Callan. 2009. A metric-based
framework for automatic taxonomy induction. In
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International
Joint Conference on Natural Language Processing
of the AFNLP, pages 271–279.
Lucja M. Iwańska, Naveen Mata, and Kellyn Kruger.
2000. Fully automatic acquisition of taxonomic
knowledge from large corpora of texts. In Lucja M.
Iwańska and Stuart C. Shapiro, editors, Natural Language Processing and Knowledge Representation:
Complexity assumptions in ontology verbalisation
Richard Power
Department of Computing
Open University, UK
[email protected]
(Fuchs and Schwitter, 1995). The idea is to establish a mapping from a formal language to a
natural subset of English, so that any sentence
conforming to the Controlled Natural Language
(CNL) can be assigned a single interpretation in
the formal language and, conversely, any well-formed statement in the formal language can be
realised in the CNL. With the advent of OWL,
some of these CNLs were rapidly adapted to the
new opportunity: part of Attempto Controlled English (ACE) was mapped to OWL (Kaljurand and
Fuchs, 2007), and Processable English (PENG)
evolved to Sydney OWL Syntax (SOS) (Cregan et
al., 2007). In addition, new CNLs were developed
specifically for editing OWL ontologies, such as
Rabbit (Hart et al., 2008) and Controlled Language for Ontology Editing (CLOnE) (Funk et al.,
We describe the strategy currently pursued for verbalising OWL ontologies by
sentences in Controlled Natural Language
(i.e., combining generic rules for realising
logical patterns with ontology-specific lexicons for realising atomic terms for individuals, classes, and properties) and argue
that its success depends on assumptions
about the complexity of terms and axioms
in the ontology. We then show, through
analysis of a corpus of ontologies, that although these assumptions could in principle be violated, they are overwhelmingly
respected in practice by ontology developers.
In detail, these CNLs display some variations:
thus an inclusion relationship between the classes
Admiral and Sailor would be expressed by the
pattern ‘Admirals are a type of sailor’ in CLOnE,
‘Every admiral is a kind of sailor’ in Rabbit, and
‘Every admiral is a sailor’ in ACE and SOS. However, at the level of general strategy, all the CNLs
rely on the same set of assumptions concerning the
mapping from natural to formal language; for convenience we will refer to these assumptions as the
consensus model. In brief, the consensus model
assumes that when an ontology is verbalised in
natural language, axioms are expressed by sentences, and atomic terms are expressed by entries from the lexicon. Such a model may fail in
two ways: (1) an ontology might contain axioms
that cannot be described transparently by a sentence (for instance, because they contain complex
Boolean expressions that lead to structural ambiguity); (2) it might contain atomic terms for which
no suitable lexical entry can be found. In the remainder of this paper we first describe the consensus model in more detail, then show that although
Since OWL (Web Ontology Language) was
adopted as a standard in 2004, researchers have
sought ways of mediating between the (decidedly
cumbersome) raw code and the human users who
aspire to view or edit it. Among the solutions
that have been proposed are more readable coding
formats such as Manchester OWL Syntax (Horridge et al., 2006), and graphical interfaces such
as Protégé (Knublauch et al., 2004); more speculatively, several research groups have explored ways
of mapping between OWL and controlled English,
with the aim of presenting ontologies (both for
viewing and editing) in natural language (Schwitter and Tilbrook, 2004; Sun and Mellish, 2006;
Kaljurand and Fuchs, 2007; Hart et al., 2008). In
this paper we uncover and test some assumptions
on which this latter approach is based.
Historically, ontology verbalisation evolved
from a more general tradition (predating OWL
and the Semantic Web) that aimed to support
knowledge formation by automatic interpretation
of texts authored in Controlled Natural Languages
Proceedings of the ACL 2010 Conference Short Papers, pages 132–136,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
C ⊓ D          IntersectionOf(C D)
∃P.C           SomeValuesFrom(P C)
C ⊑ D          SubClassOf(C D)
a ∈ C          ClassAssertion(C a)
[a, b] ∈ P     PropertyAssertion(P a b)
we could replace atomic class A by a constructed
class, thus obtaining perhaps (A1 ⊓ A2) ⊓ B, and
so on ad infinitum. Moreover, since most axiom
patterns contain classes as constituents, they too
can become indefinitely complex.
This sketch of knowledge representation in
OWL illustrates the central distinction between logical functors (e.g., IntersectionOf,
SubClassOf), which belong to the W3C standard
(Motik et al., 2010), and atomic terms for individuals, classes and properties (e.g., Nelson,
Admiral, VictorOf). Perhaps the fundamental design decision of the Semantic Web is that all domain terms remain unstandardised, leaving ontology developers free to conceptualise the domain
in any way they see fit. In the consensus verbalisation model, this distinction is reflected by dividing linguistic resources into a generic grammar for
realising logical patterns, and an ontology-specific
lexicon for realising atomic terms.
Consider for instance C ⊑ D, the axiom pattern for class inclusion. This purely logical pattern
can often be mapped (following ACE and SOS) to
the sentence pattern ‘Every [C] is a [D]’, where C
and D will be realised by count nouns from the
lexicon if they are atomic, or further grammatical
rules if they are complex. The more specific pattern C ⊑ ∃P.D can be expressed better by a sentence pattern based on a verb frame (‘Every [C]
[P]s a [D]’). All these mappings depend entirely
on the OWL logical functors, and will work with
any lexicalisation of atomic terms that respects the
syntactic constraints of the grammar, to yield verbalisations such as the following (for axioms 1-3
Table 1: Common OWL expressions
in principle it is vulnerable to both the problems
just mentioned, in practice these problems almost
never arise.
Consensus model
Atomic terms in OWL (or any other language implementing description logic) are principally of
three kinds, denoting either individuals, classes
or properties. Individuals denote entities in the
domain, such as Horatio Nelson or the Battle of
Trafalgar; classes denote sets of entities, such as
people or battles; and properties denote relations
between individuals, such as the relation victor of
between a person and a battle.
From these basic terms, a wide range of complex expressions may be constructed for classes,
properties and axioms, of which some common
examples are shown in table 1. The upper part of
the table presents two class constructors (C and
D denote any classes; P denotes any property);
by combining them we could build the following
expression denoting the class of persons that command fleets:
Person ⊓ ∃CommanderOf.Fleet
The lower half of the table presents three axiom
patterns for making statements about classes and
individuals (a, b denote individuals); examples of
their usage are as follows:
1. Admiral ⊑ ∃CommanderOf.Fleet
2. Nelson ∈ Admiral
3. [Nelson, Trafalgar] ∈ VictorOf

1. Every admiral commands a fleet.
2. Nelson is an admiral.
3. Nelson is the victor of Trafalgar.
The CNLs we have cited are more sophisticated
than this, allowing a wider range of linguistic patterns (e.g., adjectives for classes), but the basic
assumptions are the same. The model provides
satisfactory verbalisations for the simple examples
considered so far, but what happens when the axioms and atomic terms become more complex?
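The division of labour in the consensus model, generic sentence patterns keyed on logical functors plus an ontology-specific lexicon for atomic terms, can be sketched as follows. The tuple encoding of axioms, the lexicon contents, and the choice to store property entries as already-inflected verb phrases are assumptions for illustration; none of the cited CNLs is implemented this way as far as the text states.

```python
# Ontology-specific lexicon (illustrative): atomic term -> surface form.
LEXICON = {
    "Admiral": "admiral", "Sailor": "sailor", "Fleet": "fleet",
    "CommanderOf": "commands", "VictorOf": "is the victor of",
    "Nelson": "Nelson", "Trafalgar": "Trafalgar",
}


def article(noun):
    """Pick 'a' or 'an' with a simple first-letter heuristic."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def verbalise(axiom, lexicon=LEXICON):
    """Generic grammar: one sentence pattern per axiom pattern.
    Only atomic arguments are handled, apart from an existential restriction
    as the second argument of SubClassOf."""
    functor = axiom[0]
    if functor == "SubClassOf":
        _, c, d = axiom
        if isinstance(d, tuple) and d[0] == "SomeValuesFrom":
            _, p, filler = d
            noun = lexicon[filler]
            return f"Every {lexicon[c]} {lexicon[p]} {article(noun)} {noun}."
        noun = lexicon[d]
        return f"Every {lexicon[c]} is {article(noun)} {noun}."
    if functor == "ClassAssertion":
        _, c, a = axiom
        noun = lexicon[c]
        return f"{lexicon[a]} is {article(noun)} {noun}."
    if functor == "PropertyAssertion":
        _, p, a, b = axiom
        return f"{lexicon[a]} {lexicon[p]} {lexicon[b]}."
    raise ValueError(f"no sentence pattern for functor {functor!r}")
```

The grammar part never inspects the meaning of a term: swapping in a different lexicon changes the words but not the sentence patterns, which is why overly long or unlexicalisable atomic terms are a separate failure mode from overly complex axioms.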
Note that since class expressions contain classes
as constituents, they can become indefinitely complex. For instance, given the intersection A ⊓ B
If data properties are used, there will also be terms for
data types and literals (e.g., numbers and strings), but for simplicity these are not considered here.
In description logic notation, the constructor C ⊓ D
forms the intersection of two classes and corresponds to
Boolean conjunction, while the existential restriction ∃P.C
forms the class of individuals having the relation P to
one or more members of class C. Thus Person ⊓
∃CommanderOf.Fleet denotes the set of individuals x such
that x is a person and x commands one or more fleets.
Complex terms and axioms
The distribution of content among axioms depends
to some extent on stylistic decisions by ontology developers, in particular with regard to axiom size.
This freedom is possible because description logics (including OWL) allow equivalent formulations using a large number of short
axioms at one extreme, and a small number of
long ones at the other. For many logical patterns,
rules can be stated for amalgamating or splitting
axioms while leaving overall content unchanged
(thus ensuring that exactly the same inferences are
drawn by a reasoning engine); such rules are often
used in reasoning algorithms. For instance, any set
of SubClassOf axioms can be amalgamated into
a single ‘metaconstraint’ (Horrocks, 1997) of the
form ⊤ ⊑ M, where ⊤ is the class containing
all individuals in the domain, and M is a class
to which any individual respecting the axiom set
must belong. Applying this transformation even
to only two axioms (verbalised by 1 and 2 below)
will yield an outcome (verbalised by 3) that strains
human comprehension:
Figure 1: Identifier content
can be verbalised transparently within the assumptions of the consensus model.
Empirical studies of usage
We have shown that OWL syntax will permit
atomic terms that cannot be lexicalised, and axioms that cannot be expressed clearly in a sentence. However, it remains possible that in practice, ontology developers use OWL in a constrained manner that favours verbalisation by the
consensus model. This could happen either because the relevant constraints are psychologically
intuitive to developers, or because they are somehow built into the editing tools that they use
(e.g., Protégé). To investigate this possibility,
we have carried out an exploratory study using a
corpus of 48 ontologies mostly downloaded from
the University of Manchester TONES repository
(TONES, 2010). The corpus covers ontologies of
varying expressivity and subject-matter, including
some well-known tutorial examples (pets, pizzas)
and topics of general interest (photography, travel,
heraldry, wine), as well as some highly technical
scientific material (mosquito anatomy, worm ontogeny, periodic table). Overall, our sample contains around 45,000 axioms and 25,000 atomic
Our first analysis concerns identifier length,
which we measure simply by counting the number of words in the identifying phrase. The program recovers the phrase by the following steps:
(1) read an identifier (or label if one is provided);
(2) strip off the namespace prefix; (3) segment the
resulting string into words. For the third step we
1. Every admiral is a sailor.
2. Every admiral commands a fleet.
3. Everything is (a) either a non-admiral or a sailor, and
(b) either a non-admiral or something that commands a
An example of axiom-splitting rules is found in
a computational complexity proof for the description logic EL+ (Baader et al., 2005), which requires class inclusion axioms to be rewritten to a
maximally simple ‘normal form’ permitting only
four patterns: A1 ⊑ A2, A1 ⊓ A2 ⊑ A3, A1 ⊑
∃P.A2, and ∃P.A1 ⊑ A2, where P and all An
are atomic terms. However, this simplification of
axiom structure can be achieved only by introducing new atomic terms. For example, to simplify
an axiom of the form A1 ⊑ ∃P.(A2 ⊓ A3), the
rewriting rules must introduce a new term A23 ≡
A2 ⊓ A3, through which the axiom may be rewritten as A1 ⊑ ∃P.A23 (along with some further axioms expressing the definition of A23); depending
on the expressions that they replace, the content of
such terms may become indefinitely complex.
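The rewriting step in this example can be sketched for the single case discussed, an intersection of two atomic classes nested inside an existential restriction. The tuple encoding, the naming scheme for the fresh term, and the use of a single EquivalentClasses axiom for its definition are simplifying assumptions; a full normalisation procedure would split that definition further.

```python
def flatten_existential(axiom):
    """Rewrite A1 ⊑ ∃P.(A2 ⊓ A3) as A1 ⊑ ∃P.A23 plus A23 ≡ A2 ⊓ A3.

    Axioms are nested tuples such as
    ("SubClassOf", "A1", ("SomeValuesFrom", "P", ("IntersectionOf", "A2", "A3"))).
    Any other axiom is returned unchanged. A2 and A3 are assumed to be atomic."""
    functor, sub, sup = axiom
    nested = (
        functor == "SubClassOf"
        and isinstance(sup, tuple) and sup[0] == "SomeValuesFrom"
        and isinstance(sup[2], tuple) and sup[2][0] == "IntersectionOf"
    )
    if not nested:
        return [axiom]
    _, prop, (_, left, right) = sup
    fresh = f"{left}_AND_{right}"  # hypothetical naming scheme for the new atomic term
    return [
        ("SubClassOf", sub, ("SomeValuesFrom", prop, fresh)),
        ("EquivalentClasses", fresh, ("IntersectionOf", left, right)),
    ]
```

The axioms that come out are structurally simpler, but the fresh term has no natural lexical entry: its name merely restates the expression it replaces, which is the cost referred to in the text.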
A trade-off therefore results. We can often find
rules for refactoring an overcomplex axiom by a
number of simpler ones, but only at the cost of introducing atomic terms for which no satisfactory
lexical realisation may exist. In principle, therefore, there is no guarantee that OWL ontologies
For an axiom set C1 ⊑ D1, C2 ⊑ D2, ..., M will be
(¬C1 ⊔ D1) ⊓ (¬C2 ⊔ D2) ..., where the class constructors ¬C (complement of C) and C ⊔ D (union of C and D)
correspond to Boolean negation and disjunction.
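As a worked illustration of this amalgamation, the following sketch builds the metaconstraint as a string from a list of class inclusion axioms. Representing each axiom as a pair of already-formatted class expressions is an assumption made to keep the example short.

```python
def metaconstraint(axioms):
    """Amalgamate class inclusion axioms C_i ⊑ D_i into a single axiom ⊤ ⊑ M,
    where M is the intersection of the unions (¬C_i ⊔ D_i).
    Each axiom is a (C, D) pair of strings."""
    conjuncts = [f"(¬{c} ⊔ {d})" for c, d in axioms]
    return "⊤ ⊑ " + " ⊓ ".join(conjuncts)
```

For the two axioms Admiral ⊑ Sailor and Admiral ⊑ ∃CommanderOf.Fleet this gives a single axiom whose right-hand side is a conjunction of two disjunctions, each containing a negation, which is why its verbalisation is hard to follow.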
Some ontology developers use ‘non-semantic’ identifiers
such as #000123, in which case the meaning of the identifier
is indicated in an annotation assertion linking the identifier to
a label.
CA ⊓ CA ⊑ ⊥
CA ⊑ ∃PA.CA
[I, I] ∈ PA
[I, L] ∈ DA
I ∈ CA
CA ≡ CA ⊓ ∃PA.CA
The preference for simple patterns was confirmed by an analysis of argument structure for the OWL functors (e.g., SubClassOf,
IntersectionOf) that take classes as arguments.
Overall, 85% of arguments were atomic terms
rather than complex class expressions. Interestingly, there was also a clear effect of argument position, with the first argument of a functor being
atomic rather than complex in as many as 99.4%
of cases.
Table 2: Axiom pattern frequencies
assume that word boundaries are marked either
by underline characters or by capital letters (e.g.,
battle_of_trafalgar, BattleOfTrafalgar), a
rule that holds (in our corpus) almost without exception. The analysis (figure 1) reveals that phrase
lengths are typically between one and four words
(this was true of over 95% of individuals, over
90% of classes, and over 98% of properties), as
in the following random selections:
Our results indicate that although in principle the
consensus model cannot guarantee transparent realisations, in practice these are almost always attainable, since ontology developers overwhelmingly favour terms and axioms with relatively simple content. In an analysis of around 50 ontologies
we have found that over 90% of axioms fit a mere
seven patterns (table 2); the following examples
show that each of these patterns can be verbalised
by a clear unambiguous sentence – provided, of
course, that no problems arise in lexicalising the
atomic terms:
Individuals: beaujolais region, beringer, blue
mountains, bondi beach
Classes: abi graph plot, amps block format, abattoir, abbey church
Properties: has activity, has address, has amino
acid, has aunt in law
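The three steps used to recover the identifying phrase (read the identifier, strip the namespace prefix, segment into words) can be sketched as below. The sketch assumes the namespace ends at the last '#' or '/' and that word boundaries are underscores or a lower-case-to-capital transition; labels and non-semantic identifiers are not handled, and the example IRI is made up.

```python
import re


def identifier_to_words(identifier):
    """Return the list of lower-cased words in an identifier's local name."""
    # Step 2: strip the namespace prefix (everything up to the last '#' or '/').
    local_name = re.split(r"[#/]", identifier)[-1]
    # Step 3a: underscores mark word boundaries.
    local_name = local_name.replace("_", " ")
    # Step 3b: a capital letter after a lower-case letter or digit marks a boundary.
    local_name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local_name)
    return local_name.lower().split()


def identifier_length(identifier):
    """Identifier length measured as the number of words in the phrase."""
    return len(identifier_to_words(identifier))
```

Both naming conventions yield the same phrase, so `battle_of_trafalgar` and `BattleOfTrafalgar` are each counted as three words.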
1. Every admiral is a sailor
2. No sailor is a landlubber
Our second analysis concerns axiom patterns,
which we obtain by replacing all atomic terms
with a symbol meaning either individual, class,
property, datatype or literal. Thus for example the
axioms Admiral ⊑ Sailor and Dog ⊑ Animal
are both reduced to the form CA ⊑ CA, where
the symbol CA means ‘any atomic class term’. In
this way we can count the frequencies of all the
logical patterns in the corpus, abstracting from the
domain-specific identifier names. The results (table 2) show an overwhelming focus on a small
number of simple logical patterns. Concerning class constructors, the most common by far
were intersection (C ⊓ C) and existential restriction (∃P.C); universal restriction (∀P.C) was relatively rare, so that for example the pattern CA ⊑
∀PA.CA occurred only 54 times (0.1%).
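The pattern abstraction used in this analysis can be sketched as a recursive replacement of atomic terms by their kind, followed by a frequency count. The tuple encoding of axioms and the explicit `kinds` table mapping each atomic term to a symbol such as CA or PA are assumptions for illustration; in practice the kind of a term would come from the ontology's declarations.

```python
from collections import Counter


def abstract(expr, kinds):
    """Replace every atomic term in a nested-tuple expression by its kind symbol.
    The first element of each tuple is a logical functor and is kept as is."""
    if isinstance(expr, str):
        return kinds[expr]
    functor, *args = expr
    return (functor, *(abstract(arg, kinds) for arg in args))


def pattern_frequencies(axioms, kinds):
    """Count how often each abstracted axiom pattern occurs."""
    return Counter(abstract(axiom, kinds) for axiom in axioms)
```

Axioms that differ only in their domain vocabulary collapse to the same key, so the resulting counts describe logical structure alone.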
3. Every admiral commands a fleet
4. Nelson is the victor of Trafalgar
5. Trafalgar is dated 1805
6. Nelson is an admiral
7. An admiral is defined as a person that commands a fleet
However, since identifiers containing 3-4 words
are fairly common (figure 1), we need to consider
whether these formulations will remain transparent when combined with more complex lexical entries. For instance, a travel ontology in our corpus contains an axiom (fitting pattern 4) which our
prototype verbalises as follows:
4’. West Yorkshire has as boundary the West
Yorkshire Greater Manchester Boundary Fragment
Most of these patterns have been explained already; the
others are disjoint classes (CA ⊓ CA ⊑ ⊥), equivalent classes
(CA ≡ CA ⊓ ∃PA.CA) and data property assertion ([I, L] ∈
DA). In the latter pattern, DA denotes a data property, which
differs from an object property (PA) in that it ranges over
literals (L) rather than individuals (I).
If C ⊑ ∃P.D means ‘Every admiral commands a fleet’,
C ⊑ ∀P.D will mean ‘Every admiral commands only fleets’
(this will remain true if some admirals do not command anything at all).
The lexical entries here are far from ideal: ‘has
as boundary’ is clumsy, and ‘the West Yorkshire
Greater Manchester Boundary Fragment’ has as
One explanation for this result could be that developers (or development tools) treat axioms as having a topiccomment structure, where the topic is usually the first argument; we intend to investigate this possibility in a further
many as six content words (and would benefit
from hyphens). We assess the sentence as ugly but
understandable, but to draw more definite conclusions one would need to perform a different kind
of empirical study using human readers.
Matthew Horridge, Nicholas Drummond, John Goodwin, Alan Rector, Robert Stevens, and Hai Wang.
2006. The Manchester OWL syntax. In OWL:
Experiences and Directions (OWLED’06), Athens,
Georgia. CEUR.
Ian Horrocks. 1997. Optimising Tableaux Decision
Procedures for Description Logics. Ph.D. thesis,
University of Manchester.
We conclude (a) that existing ontologies can be
mostly verbalised using the consensus model, and
(b) that an editing tool based on relatively simple
linguistic patterns would not inconvenience ontology developers, but merely enforce constraints
that they almost always respect anyway. These
conclusions are based on analysis of identifier and
axiom patterns in a corpus of ontologies; they need
to be complemented by studies showing that the
resulting verbalisations are understood by ontology developers and other users.
K. Kaljurand and N. Fuchs. 2007. Verbalizing OWL
in Attempto Controlled English. In Proceedings of
OWL: Experiences and Directions, Innsbruck, Austria.
Holger Knublauch, Ray W. Fergerson, Natalya Fridman Noy, and Mark A. Musen. 2004. The Protégé
OWL Plugin: An Open Development Environment
for Semantic Web Applications. In International Semantic Web Conference, pages 229–243.
Boris Motik, Peter F. Patel-Schneider, and Bijan Parsia. 2010. OWL 2 web ontology language:
Structural specification and functional-style syntax. 21st
April 2010.
The research described in this paper was undertaken as part of the SWAT project (Semantic Web Authoring Tool), which is supported by
the UK Engineering and Physical Sciences Research Council (EPSRC) grants G033579/1 (Open
University) and G032459/1 (University of Manchester). Thanks are due to the anonymous ACL reviewers and to colleagues on the SWAT project for
their comments and suggestions.
R. Schwitter and M. Tilbrook. 2004. Controlled natural language meets the semantic web. In Proceedings of the Australasian Language Technology
Workshop, pages 55–62, Macquarie University.
X. Sun and C. Mellish. 2006. Domain Independent
Sentence Generation from RDF Representations for
the Semantic Web. In Proceedings of the Combined
Workshop on Language-Enabled Educational Technology and Development and Evaluation of Robust
Spoken Dialogue Systems (ECAI06), Riva del Garda,
TONES. 2010. The TONES ontology repository.
Last accessed: 21st April 2010.
F. Baader, I. R. Horrocks, and U. Sattler. 2005. Description logics as ontology languages for the semantic web. Lecture Notes in Artificial Intelligence,
Anne Cregan, Rolf Schwitter, and Thomas Meyer.
2007. Sydney OWL Syntax - towards a Controlled
Natural Language Syntax for OWL 1.1. In OWLED.
Norbert Fuchs and Rolf Schwitter. 1995. Specifying
logic programs in controlled natural language. In
Adam Funk, Valentin Tablan, Kalina Bontcheva,
Hamish Cunningham, Brian Davis, and Siegfried
CLOnE: Controlled Language for Ontology Editing.
In 6th International and 2nd Asian Semantic Web Conference
(ISWC2007+ASWC2007), pages 141–154, November.
Glen Hart, Martina Johnson, and Catherine Dolbear.
2008. Rabbit: Developing a control natural language for authoring ontologies. In ESWC, pages
Word Alignment with Synonym Regularization
Hiroyuki Shindo, Akinori Fujino, and Masaaki Nagata
NTT Communication Science Laboratories, NTT Corp.
2-4 Hikaridai Seika-cho Soraku-gun Kyoto 619-0237 Japan
[email protected]
information works as a constraint in word alignment models and improves word alignment quality.
A large number of monolingual lexical semantic resources such as WordNet (Miller, 1995) have
been constructed in more than fifty languages
(Sagot and Fiser, 2008). They include word-level relations such as synonyms, hypernyms and
hyponyms. Synonym information is particularly
helpful for word alignment because we can expect a synonym to correspond to the same word
in a different language. In this paper, we explore a
method for using synonym information effectively
to improve word alignment quality.
In general, synonym relations are defined in
terms of word sense, not in terms of word form. In
other words, synonym relations are usually context or domain dependent. For instance, ‘head’
and ‘chief’ are synonyms in contexts referring to
working environment, while ‘head’ and ‘forefront’
are synonyms in contexts referring to physical positions. It is difficult, however, to imagine a context where ‘chief’ and ‘forefront’ are synonyms.
Therefore, simply replacing all occurrences of ‘chief’ and ‘forefront’ with
‘head’ can harm word alignment
accuracy, so we have to model either the context
or the senses of words.
We propose a novel method that incorporates
synonyms from monolingual resources in a bilingual word alignment model. We formulate a synonym pair generative model with a topic variable
and use this model as a regularization term with a
bilingual word alignment model. The topic variable in our synonym model is helpful for disambiguating the meanings of synonyms. We extend
HM-BiTAM, which is an HMM-based word alignment model with a latent topic, with a novel synonym pair generative model. We applied the proposed method to an English-French word alignment task and successfully improved the word
We present a novel framework for word
alignment that incorporates synonym
knowledge collected from monolingual
linguistic resources in a bilingual probabilistic model. Synonym information is
helpful for word alignment because we
can expect a synonym to correspond to
the same word in a different language.
We design a generative model for word
alignment that uses synonym information
as a regularization term. The experimental
results show that our proposed method
significantly improves word alignment
1 Introduction
Word alignment is an essential step in most phrase
and syntax based statistical machine translation
(SMT). It is an inference problem of word correspondences between different languages given
parallel sentence pairs. Accurate word alignment
can induce high quality phrase detection and translation probability, which leads to a significant improvement in SMT performance. Many word
alignment approaches based on generative models have been proposed and they learn from bilingual sentences in an unsupervised manner (Vogel et al., 1996; Och and Ney, 2003; Fraser and
Marcu, 2007).
One way to improve word alignment quality
is to add linguistic knowledge derived from a
monolingual corpus. This monolingual knowledge makes it easier to determine corresponding
words correctly. For instance, functional words
in one language tend to correspond to functional
words in another language (Deng and Gao, 2007),
and the syntactic dependency of words in each language can help the alignment process (Ma et al.,
2008). It has been shown that such grammatical
Proceedings of the ACL 2010 Conference Short Papers, pages 137–141,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
alignment quality.
2 Bilingual Word Alignment Model
In this section, we review a conventional generative word alignment model, HM-BiTAM (Zhao and Xing, 2008).
HM-BiTAM is a bilingual generative model with topic z, alignment a and topic weight vector θ as latent variables. Topic variables such as ‘science’ or ‘economy’ assigned to individual sentences help to disambiguate the meanings of words. HM-BiTAM assumes that the nth bilingual sentence pair, $(E_n, F_n)$, is generated under a given latent topic $z_n \in \{1, \ldots, k, \ldots, K\}$, where K is the number of latent topics. Let N be the number of sentence pairs, and $I_n$ and $J_n$ be the lengths of $E_n$ and $F_n$, respectively. In this framework, all of the bilingual sentence pairs $\{E, F\} = \{(E_n, F_n)\}_{n=1}^{N}$ are generated as follows.
1. $\theta \sim \mathrm{Dirichlet}(\alpha)$: sample topic-weight vector
2. For each sentence pair $(E_n, F_n)$
(a) $z_n \sim \mathrm{Multinomial}(\theta)$: sample the topic
(b) $e_{n,1:I_n} \mid z_n \sim p(E_n \mid z_n; \beta)$: sample English words from a monolingual unigram model given topic $z_n$
(c) For each position $j_n = 1, \ldots, J_n$
i. $a_{j_n} \sim p(a_{j_n} \mid a_{j_n - 1}; T)$: sample an alignment link $a_{j_n}$ from a first order Markov process
ii. $f_{j_n} \sim p(f_{j_n} \mid E_n, a_{j_n}, z_n; B)$: sample a target word $f_{j_n}$ given an aligned source word and topic
where alignment $a_{j_n} = i$ denotes source word $e_i$ and target word $f_{j_n}$ are aligned. α is a parameter over the topic weight vector θ, $\beta = \{\beta_{k,e}\}$ is the source word probability given the kth topic: $p(e \mid z = k)$. $B = \{B_{f,e,k}\}$ represents the word translation probability from e to f under the kth topic: $p(f \mid e, z = k)$. $T = \{T_{i,i'}\}$ is a state transition probability of a first order Markov process. Fig. 1 shows a graphical model of HM-BiTAM.
The total likelihood of bilingual sentence pairs {E, F} can be obtained by marginalizing out latent variables z, a and θ,
\[ p(F, E; \Psi) = \sum_{z} \sum_{a} \int p(F, E, z, a, \theta; \Psi) \, d\theta, \quad (1) \]
where Ψ = {α, β, T, B} is a parameter set. In this model, we can infer word alignment a by maximizing the likelihood above.
Figure 1: Graphical model of HM-BiTAM
3 Proposed Method
3.1 Synonym Pair Generative Model
We design a generative model for synonym pairs {f, f′} in language F, which assumes that the synonyms are collected from monolingual linguistic resources. We assume that each synonym pair (f, f′) is generated independently given the same ‘sense’ s. Under this assumption, the probability of synonym pair (f, f′) can be formulated as,
\[ p(f, f') \propto \sum_{s} p(f \mid s) \, p(f' \mid s) \, p(s). \quad (2) \]
We define a pair (e, k) as a representation of the sense s, where e and k are a word in a different language E and a latent topic, respectively. It has been shown that a word e in a different language is an appropriate representation of s in synonym modeling (Bannard and Callison-Burch, 2005). We assume that adding a latent topic k for the sense is very useful for disambiguating word meaning, and thus that (e, k) gives us a good approximation of s. Under this assumption, the synonym pair generative model can be defined as follows.
\[ p(\{f, f'\}; \tilde{\Psi}) \propto \prod_{(f, f')} \sum_{e,k} p(f \mid e, k; \tilde{\Psi}) \, p(f' \mid e, k; \tilde{\Psi}) \, p(e, k; \tilde{\Psi}), \quad (3) \]
where $\tilde{\Psi}$ is the parameter set of our model.
3.2 Word Alignment with Synonym
In this section, we extend the bilingual generative model (HM-BiTAM) with our synonym pair
model. Our expectation is that synonym pairs
4 Experiments
4.1 Experimental Setting
For an empirical evaluation of the proposed
method, we used a bilingual parallel corpus of
English-French Hansards (Mihalcea and Pedersen,
2003). The corpus consists of over 1 million sentence pairs, which include 447 manually word-aligned sentences. We selected 100 sentence pairs
randomly from the manually word-aligned sentences as development data for tuning the regularization weight ζ, and used the 347 remaining
sentence pairs as evaluation data. We also randomly selected 10k, 50k, and 100k sized sentence
pairs from the corpus as additional training data.
We ran the unsupervised training of our proposed
word alignment model on the additional training
data and the 347 sentence pairs of the evaluation
data. Note that manual word alignment of the
347 sentence pairs was not used for the unsupervised training. After the unsupervised training, we
evaluated the word alignment performance of our
proposed method by comparing the manual word
alignment of the 347 sentence pairs with the prediction provided by the trained model.
We collected English and French synonym pairs
from WordNet 2.1 (Miller, 1995) and WOLF 0.1.4
(Sagot and Fiser, 2008), respectively. WOLF is a
semantic resource constructed from the Princeton
WordNet and various multilingual resources. We
selected synonym pairs where both words were included in the bilingual training set.
We compared the word alignment performance
of our model with that of GIZA++ 1.03 (Vogel et al., 1996; Och and Ney, 2003), and HM-BiTAM (Zhao and Xing, 2008) implemented by
us. GIZA++ is an implementation of IBM Model
4 and HMM, and HM-BiTAM corresponds to ζ =
0 in eq. 7. We adopted K = 3 topics, following
the setting in (Zhao and Xing, 2006).
We trained the word alignment in two directions: English to French, and French to English.
The alignment results for both directions were refined with ‘GROW’ heuristics to yield high precision and high recall in accordance with previous
work (Och and Ney, 2003; Zhao and Xing, 2006).
We evaluated these results for precision, recall, F-measure and alignment error rate (AER), which
are standard metrics for word alignment accuracy
(Och and Ney, 2000).
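The four metrics named above can be computed directly from sets of alignment links. The sketch below assumes the sure/possible annotation convention and the AER definition of Och and Ney (2000); the function name `alignment_scores` and the toy link sets are illustrative and are not taken from the paper.

```python
def alignment_scores(predicted, sure, possible):
    """Precision, recall, F-measure and AER for one set of alignment links.

    predicted, sure, possible: sets of (source_index, target_index) links.
    Assumption: sure links also count as possible links.
    """
    possible = possible | sure
    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)

    precision = hits_possible / len(predicted)
    recall = hits_sure / len(sure)
    f_measure = 2 * precision * recall / (precision + recall)
    aer = 1.0 - (hits_sure + hits_possible) / (len(predicted) + len(sure))
    return precision, recall, f_measure, aer


# Toy example: two sure links, one possible link, three predicted links.
predicted = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1)}
possible = {(2, 2)}
p, r, f, aer = alignment_scores(predicted, sure, possible)
```

Because AER uses both sure and possible links, it is not simply one minus the F-measure.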
Figure 2: Graphical model of synonym pair generative process
correspond to the same word in a different language, thus they make it easy to infer accurate
word alignment. HM-BiTAM and the synonym
model share parameters in order to incorporate
monolingual synonym information into the bilingual word alignment model. This can be achieved via reparameterizing $\tilde{\Psi}$ in eq. 3 as,
\[ p(f \mid e, k; \tilde{\Psi}) \equiv p(f \mid e, k; B), \quad (4) \]
\[ p(e, k; \tilde{\Psi}) \equiv p(e \mid k; \beta) \, p(k; \alpha). \quad (5) \]
Overall, we re-define the synonym pair model
with the HM-BiTAM parameter set Ψ,
\[ p(\{f, f'\}; \Psi) \propto \prod_{(f, f')} \sum_{k,e} \frac{\alpha_k}{\sum_{k'} \alpha_{k'}} \, \beta_{k,e} \, B_{f,e,k} \, B_{f',e,k}. \quad (6) \]
Fig. 2 shows a graphical model of the synonym
pair generative process. We estimate the parameter values to maximize the likelihood of HM-BiTAM with respect to bilingual sentences and
that of the synonym model with respect to synonym pairs collected from monolingual resources.
Namely, the parameter estimate, $\hat{\Psi}$, is computed as
\[ \hat{\Psi} = \arg\max_{\Psi} \left\{ \log p(F, E; \Psi) + \zeta \log p(\{f, f'\}; \Psi) \right\}, \quad (7) \]
where ζ is a regularization weight that should be set for training. We can expect the second term of eq. 7 to constrain parameter set Ψ and avoid overfitting for the bilingual word alignment model. We resort to the variational EM approach (Bernardo et al., 2003) to infer $\hat{\Psi}$ following HM-BiTAM. We omit the parameter update equations due to lack of space.
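To make the regularized objective concrete, the following sketch evaluates the synonym term of eq. 6 (up to an additive constant) and adds it to a given bilingual log-likelihood as in eq. 7. It is only an illustration: the nested-list parameter layout, the function names and the toy values are assumptions, and the bilingual log-likelihood is passed in as a number rather than computed by variational EM.

```python
import math


def synonym_log_likelihood(pairs, alpha, beta, B):
    """Log of the right-hand side of eq. 6.

    pairs: list of (f, f2) word-index pairs
    alpha: alpha[k], Dirichlet parameter of topic k
    beta:  beta[k][e] = p(e | z = k)
    B:     B[f][e][k] = p(f | e, z = k)
    """
    alpha_sum = sum(alpha)
    total = 0.0
    for f, f2 in pairs:
        s = 0.0
        for k, a_k in enumerate(alpha):
            for e, beta_ke in enumerate(beta[k]):
                s += (a_k / alpha_sum) * beta_ke * B[f][e][k] * B[f2][e][k]
        total += math.log(s)
    return total


def regularized_objective(bilingual_ll, pairs, alpha, beta, B, zeta):
    """Value of the objective in eq. 7 for fixed parameters."""
    return bilingual_ll + zeta * synonym_log_likelihood(pairs, alpha, beta, B)


# Toy setting: 2 topics, 1 source word, 3 target words
# (0 = 'head', 1 = 'chief', 2 = 'forefront'); all numbers are made up.
alpha = [1.0, 1.0]
beta = [[1.0], [1.0]]
B = [[[0.5, 0.5]],   # 'head' is likely under both topics
     [[0.5, 0.0]],   # 'chief' only under topic 0
     [[0.0, 0.5]]]   # 'forefront' only under topic 1
ll_syn = synonym_log_likelihood([(0, 1)], alpha, beta, B)
objective = regularized_objective(-10.0, [(0, 1)], alpha, beta, B, zeta=2.0)
```

In this toy setting the pairs (‘head’, ‘chief’) and (‘head’, ‘forefront’) each receive probability mass from a different topic, which is the disambiguation effect the topic variable is meant to provide.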
Table 2: The number of vocabularies in the 10k, 50k and 100k data sets.

(a) 10k
                      Precision  Recall  F-measure  AER
GIZA++    standard    0.856      0.718   0.781      0.207
          with SRH    0.874      0.720   0.789      0.198
HM-BiTAM  standard    0.869      0.788   0.826      0.169
          with SRH    0.884      0.790   0.834      0.160
Proposed              0.941      0.808   0.870      0.123

(b) 50k
                      Precision  Recall  F-measure  AER
GIZA++    standard    0.905      0.770   0.832      0.156
          with SRH    0.903      0.759   0.825      0.164
HM-BiTAM  standard    0.901      0.814   0.855      0.140
          with SRH    0.899      0.808   0.853      0.145
Proposed              0.947      0.824   0.881      0.112
were replaced with the word ‘sick’. As shown in
Table 2, the number of vocabularies in the English
and French data sets decreased as a result of employing the SRH.
We show the performance of GIZA++ and HM-BiTAM with the SRH in the lines entitled “with SRH” in Table 1. GIZA++ and HM-BiTAM with the SRH slightly outperformed the standard GIZA++ and HM-BiTAM for the 10k and 100k data sets, but underperformed with the 50k data set. We assume that the SRH mitigated the overfitting of these models to low-frequency word pairs in bilingual sentences, and thus improved the word alignment performance. The SRH regards all of the different words coupled with the same word in the synonym pairs as synonyms. For instance, the words ‘head’, ‘chief’ and ‘forefront’ in the bilingual sentences are replaced with ‘chief’, since (‘head’, ‘chief’) and (‘head’, ‘forefront’) are synonyms. Obviously, (‘chief’, ‘forefront’) are not synonyms, and merging them is detrimental to word alignment.
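The over-merging behaviour of the SRH can be reproduced with a few lines of code. The sketch below is one possible reading of the heuristic, not the authors' implementation: it groups words transitively through shared synonym pairs and, as an assumption made only for this illustration, picks the lexicographically smallest word of each group as the representative.

```python
def build_representatives(synonym_pairs):
    """Map every word in the synonym pairs to one representative word."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path compression
            word = parent[word]
        return word

    for a, b in synonym_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the lexicographically smaller root becomes the representative.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep
    return {word: find(word) for word in parent}


def replace_synonyms(tokens, representatives):
    """Replace each token by its representative, if it has one."""
    return [representatives.get(tok, tok) for tok in tokens]


pairs = [("head", "chief"), ("head", "forefront")]
reps = build_representatives(pairs)
sentence = replace_synonyms("at the forefront of the line".split(), reps)
```

Here ‘forefront’ ends up replaced by ‘chief’ although the two are never listed as a synonym pair, which is exactly the failure case described above.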
The proposed method consistently outperformed GIZA++ and HM-BiTAM with the SRH
in 10k, 50k and 100k data sets in F-measure.
The synonym pair model in our proposed method
can automatically learn that (‘head’, ‘chief’) and
(‘head’, ‘forefront’) are individual synonyms with
different meanings by assigning these pairs to different topics. By sharing latent topics between
the synonym pair model and the word alignment
model, the synonym information incorporated in
the synonym pair model is used directly for training the word alignment model. The experimental results show that our proposed method was effective in improving the performance of the word
alignment model by using synonym pairs including such ambiguous synonym words.
Finally, we discuss the data set size used for unsupervised training. As shown in Table 1, using
a large number of additional sentence pairs improved the performance of all the models. In all
our experimental settings, all the additional sen-
(c) 100k
                      Precision  Recall  F-measure  AER
GIZA++    standard    0.925      0.791   0.853      0.136
          with SRH    0.934      0.803   0.864      0.126
HM-BiTAM  standard    0.898      0.851   0.874      0.124
          with SRH    0.909      0.860   0.879      0.114
Proposed              0.927      0.862   0.893      0.103

Table 1: Comparison of word alignment accuracy. The best results are indicated in bold type. The additional data set sizes are (a) 10k, (b) 50k, (c) 100k.
4.2 Results and Discussion
Table 1 shows the word alignment accuracy of the
three methods trained with 10k, 50k, and 100k additional sentence pairs. For all settings, our proposed method outperformed other conventional
methods. This result shows that synonym information is effective for improving word alignment
quality as we expected.
As mentioned in Sections 1 and 3.1, the main
idea of our proposed method is to introduce latent topics for modeling synonym pairs, and then
to utilize the synonym pair model for the regularization of word alignment models. We expect
the latent topics to be useful for modeling polysemous words included in synonym pairs and to
enable us to incorporate synonym information effectively into word alignment models. To confirm the effect of the synonym pair model with
latent topics, we also tested GIZA++ and HM-BiTAM with what we call Synonym Replacement
Heuristics (SRH), where all of the synonym pairs
in the bilingual training sentences were simply replaced with a representative word. For instance,
the words ‘sick’ and ‘ill’ in the bilingual sentences
tence pairs and the evaluation data were selected
from the Hansards data set. These experimental
results show that a larger number of sentence pairs
was more effective in improving word alignment
performance when the sentence pairs were collected from a homogeneous data source. However,
in practice, it might be difficult to collect a large
number of such homogeneous sentence pairs for
a specific target domain and language pair. One
direction for future work is to confirm the effect
of the proposed method when training the word
alignment model by using a large number of sentence pairs collected from various data sources including many topics for a specific language pair.
5 Conclusions and Future Work
We proposed a novel framework that incorporates synonyms from monolingual linguistic resources in a word alignment generative model. This approach utilizes both bilingual and monolingual synonym resources effectively for word alignment. Our proposed method uses a latent topic for bilingual sentences and monolingual synonym pairs, which is helpful in terms of word sense disambiguation. Our proposed method improved word alignment quality with both small and large data sets. Future work will involve examining the proposed method for different language pairs such as English-Chinese and English-Japanese and evaluating the impact of our proposed method on SMT performance. We will also apply our proposed method to larger data sets of multiple domains since we can expect a further improvement in word alignment accuracy if we use more bilingual sentences and more monolingual knowledge.
References
C. Bannard and C. Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 597–604. Association for Computational Linguistics Morristown, NJ, USA.
J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West. 2003. The variational bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics 7: Proceedings of the 7th Valencia International Meeting, June 2-6, 2002, page 453. Oxford University Press, USA.
Y. Deng and Y. Gao. 2007. Guiding statistical word alignment models with prior knowledge. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 1–8, Prague, Czech Republic, June. Association for Computational Linguistics.
A. Fraser and D. Marcu. 2007. Getting the structure right for word alignment: LEAF. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 51–60, Prague, Czech Republic, June. Association for Computational Linguistics.
Y. Ma, S. Ozdowska, Y. Sun, and A. Way. 2008. Improving word alignment using syntactic dependencies. In Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2), pages 69–77, Columbus, Ohio, June. Association for Computational Linguistics.
R. Mihalcea and T. Pedersen. 2003. An evaluation exercise for word alignment. In Proceedings of the HLT-NAACL 2003 Workshop on building and using parallel texts: data driven machine translation and beyond-Volume 3, page 10. Association for Computational Linguistics.
G. A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):41.
F. J. Och and H. Ney. 2000. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 440–447. Association for Computational Linguistics Morristown, NJ, USA.
F. J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
B. Sagot and D. Fiser. 2008. Building a free French wordnet from multilingual resources. In Proceedings of Ontolex.
S. Vogel, H. Ney, and C. Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics-Volume 2, pages 836–841. Association for Computational Linguistics Morristown, NJ, USA.
B. Zhao and E. P. Xing. 2006. BiTAM: Bilingual topic admixture models for word alignment. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, page 976. Association for Computational Linguistics.
B. Zhao and E. P. Xing. 2008. HM-BiTAM: Bilingual topic exploration, word alignment, and translation. In Advances in Neural Information Processing Systems 20, pages 1689–1696, Cambridge, MA. MIT Press.
Better Filtration and Augmentation for Hierarchical Phrase-Based
Translation Rules
Zhiyang Wang †, Yajuan Lü †, Qun Liu †, Young-Sook Hwang ‡
† Key Lab. of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, P.O. Box 2704, Beijing 100190, China
‡ HILab Convergence Technology Center, C&I Business, 11, Euljiro2-ga, Jung-gu, Seoul 100-999, Korea
[email protected]
[email protected]
This paper presents a novel filtration criterion to restrict the rule extraction for the hierarchical phrase-based translation model, where a bilingual but relaxed well-formed dependency restriction is used to filter out bad rules. Furthermore, a new feature which describes the regularity that the source/target dependency edge triggers the target/source word is also proposed. Experimental results show that the new criterion weeds out about 40% of the rules while improving translation performance, and that the new feature brings a further improvement over the baseline system, especially on a larger corpus.
1 Introduction
The hierarchical phrase-based (HPB) model (Chiang, 2005) is the state-of-the-art statistical machine translation (SMT) model. By looking for phrases that contain other phrases and replacing the subphrases with nonterminal symbols, it gets hierarchical rules. Hierarchical rules are more powerful than conventional phrases since they have better generalization capability and can capture long distance reordering. However, when the training corpus becomes larger, the number of rules will grow exponentially, which inevitably results in slow and memory-consuming decoding.
In this paper, we address the problem of reducing the hierarchical translation rule table by resorting to the dependency information of both languages. We only keep rules whose two sides are both relaxed-well-formed (RWF) dependency structures (see the definition in Section 3), and discard the others, which do not satisfy this constraint. In this way, about 40% of the rules are weeded out of the original rule table as bad rules, yet the performance is even better than that of the traditional HPB translation system.
Figure 1: Solid wire reveals the dependency relation pointing from the child to the parent. Target word e is triggered by the source word f and its head word f′, p(e|f → f′).
Based on the relaxed-well-formed dependency structure, we also introduce a new linguistic feature to enhance translation performance. In the traditional phrase-based SMT model, there are always lexical translation probabilities based on IBM model 1 (Brown et al., 1993), i.e. p(e|f), namely, the target word e is triggered by the source word f. Intuitively, however, the generation of e does not only involve f; it may sometimes also be triggered by other context words on the source side. Here we assume that the dependency edge (f → f′) of word f generates target word e (we call it head word trigger in Section 4). Therefore, two words in one language trigger one word in another, which provides a more sophisticated and better choice for the target word, as illustrated in Figure 1. Similarly, the dependency feature works well in the Chinese-to-English translation task, especially on a large corpus.
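The head word trigger probability p(e|f → f′) used in this paper is estimated by relative frequency and then averaged over alignment links in the same way as lexical weighting (Section 4). A minimal sketch of both steps follows; the function names, the event format and the toy words are hypothetical, and skipping unaligned target words is an assumption of this sketch rather than something the paper specifies.

```python
from collections import Counter


def estimate_trigger_probs(events):
    """Relative-frequency estimate of p(e | f -> f_head).

    events: iterable of (e, f, f_head) triples collected from aligned words.
    """
    joint = Counter()
    edge_total = Counter()
    for e, f, f_head in events:
        joint[(e, f, f_head)] += 1
        edge_total[(f, f_head)] += 1
    return {key: count / edge_total[key[1:]] for key, count in joint.items()}


def phrase_feature_score(tgt, src, src_heads, alignment, probs):
    """Average the trigger probabilities over the links of each target word.

    tgt, src:   lists of target / source words of the phrase pair
    src_heads:  src_heads[j] is the head word of src[j]
    alignment:  set of (j, i) links between src[j] and tgt[i]
    """
    score = 1.0
    for i, e in enumerate(tgt):
        links = [j for (j, i2) in alignment if i2 == i]
        if not links:
            continue  # assumption: unaligned target words are skipped
        total = sum(probs.get((e, src[j], src_heads[j]), 0.0) for j in links)
        score *= total / len(links)
    return score


events = [("house", "maison", "trouve"), ("house", "maison", "trouve"),
          ("home", "maison", "trouve"), ("home", "maison", "rentre")]
probs = estimate_trigger_probs(events)
score = phrase_feature_score(["house"], ["maison"], ["trouve"], {(0, 0)}, probs)
```

Conditioning on the edge rather than on the source word alone is what lets the two contexts of ‘maison’ in this toy data prefer different target words.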
2 Related Work
In the past, a significant number of techniques
have been presented to reduce the hierarchical rule
table. He et al. (2009) just used the key phrases
of source side to filter the rule table without taking
advantage of any linguistic information. Iglesias
et al. (2009) put rules into syntactic classes based
on the number of non-terminals and patterns, and
applied various filtration strategies to improve the
rule table quality. Shen et al. (2008) discarded
Proceedings of the ACL 2010 Conference Short Papers, pages 142–146,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
ure 2 shows an example of a dependency tree. In
this example, the word found is the root of the tree.
Shen et al. (2008) propose the well-formed dependency structure to filter the hierarchical rule table. A well-formed dependency structure could be
either a single-rooted dependency tree or a set of
sibling trees. Although most rules are discarded
with the constraint that the target side should be
well-formed, this filtration leads to degradation in
translation performance.
As an extension of the work of (Shen et
al., 2008), we introduce the so-called relaxedwell-formed dependency structure to filter the hierarchical rule table. Given a sentence S =
w1 w2 ...wn . Let d1 d2 ...dn represent the position of
parent word for each word. For example, d3 = 4
means that w3 depends on w4 . If wi is a root, we
define di = −1.
Definition. A dependency structure $w_i \ldots w_j$ is a relaxed-well-formed structure if and only if there is an $h \notin [i, j]$ such that all the words $w_i \ldots w_j$ depend directly or indirectly on $w_h$ or on −1 (in which case we define h = −1), and it satisfies the following conditions:
Figure 2: An example of dependency tree. The
corresponding plain sentence is The lovely girl
found a beautiful house.
most entries of the rule table by using the constraint that rules of the target-side are well-formed
(WF) dependency structure, but this filtering led to
degradation in translation performance. They obtained improvements by adding an additional dependency language model. The basic difference between our method and that of (Shen et al., 2008) is that we
keep rules whose two sides are both relaxed-well-formed dependency structures, not just the target
side. Besides, our system complexity is not increased because no additional language model is
The feature of head word trigger which we apply to the log-linear model is motivated by the
trigger-based approach (Hasan and Ney, 2009).
Hasan and Ney (2009) introduced a second word
to trigger the target word without considering any
linguistic information. Furthermore, since the second word can come from any part of the sentence,
there may be a prohibitively large number of parameters involved. Besides, He et al. (2008) built
a maximum entropy model which combines rich
context information for selecting translation rules
during decoding. However, as the size of the corpus increases, the maximum entropy model will
become larger. Similarly, in (Shen et al., 2009), a context language model is proposed for better rule selection. Taking the dependency edge as the condition, our approach is very different from previous approaches to exploring context information.
• $d_h \notin [i, j]$
• $\forall k \in [i, j]$, $d_k \in [i, j]$ or $d_k = h$
From the definition above, we can see that the relaxed-well-formed structure obviously covers the well-formed one. In this structure, we do not require that all the children of the sub-root be complete. Let us review the dependency tree in Figure 2 as an example. In addition to the well-formed structures, we can also extract girl found a beautiful house. Therefore, if the modifier The lovely changes to The cute, this rule still works.
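The two conditions of the definition translate into a short span check. The sketch below is an illustration only: the function name is hypothetical, and the parent positions used for the example sentence are an assumed parse (The and lovely attach to girl, a and beautiful attach to house, girl and house attach to the root found), since the figure itself is not reproduced here.

```python
def is_relaxed_well_formed(parents, i, j):
    """Check whether the span w_i..w_j is relaxed-well-formed.

    parents: 1-indexed list (parents[0] is unused); parents[k] is the
             position d_k of the parent of w_k, or -1 if w_k is the root.
    """
    # Heads lying outside the span; every word must share the same one (h).
    outside_heads = {parents[k] for k in range(i, j + 1)
                     if not (i <= parents[k] <= j)}
    if len(outside_heads) != 1:
        # In a valid tree a span always has at least one outside head,
        # so this branch means there are two or more.
        return False
    h = outside_heads.pop()
    if h == -1:
        return True  # the span hangs directly under the root marker
    # First condition: the parent of w_h must not be inside the span.
    return not (i <= parents[h] <= j)


# "The lovely girl found a beautiful house" (positions 1..7), assumed parse.
parents = [None, 3, 3, 4, -1, 7, 7, 4]
girl_to_house = is_relaxed_well_formed(parents, 3, 7)   # girl found a beautiful house
two_modifiers = is_relaxed_well_formed(parents, 5, 6)   # a beautiful
mixed_span = is_relaxed_well_formed(parents, 2, 5)      # lovely girl found a
```

The span girl found a beautiful house is accepted even though the children The and lovely of girl lie outside it, which is the relaxation relative to the well-formed structure.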
4 Head Word Trigger
Koehn et al. (2003) introduced the concept of lexical weighting to check how well the words of a phrase translate to each other. If source word f aligns with target word e, then according to IBM model 1 the lexical translation probability is p(e|f). However, in the sense of dependency relationship, we believe that the generation of the target word e is not only triggered by the aligned source word f, but also associated with f’s head word f′. Therefore, the lexical translation probability becomes p(e|f → f′), which of course
allows for a more fine-grained lexical choice of
3 Relaxed-well-formed Dependency
Dependency models have recently gained considerable interest in SMT (Ding and Palmer, 2005;
Quirk et al., 2005; Shen et al., 2008). Dependency tree can represent richer structural information. It reveals long-distance relation between
words and directly models the semantic structure
of a sentence without any constituent labels. Fig-
the target word. More specifically, the probability could be estimated by the maximum likelihood (MLE) approach,
\[ p(e \mid f \to f') = \frac{\mathrm{count}(e, f \to f')}{\sum_{e'} \mathrm{count}(e', f \to f')} \]
Given a phrase pair $(\bar{f}, \bar{e})$, a word alignment a, and the dependency relation of the source sentence $d_1^J$ (J is the length of the source sentence, I is the length of the target sentence), and given the lexical translation probability distribution p(e|f → f′), we compute the feature score of a phrase pair $(\bar{f}, \bar{e})$ as
\[ p(\bar{e} \mid \bar{f}, d_1^J, a) = \prod_{i=1}^{|\bar{e}|} \frac{1}{|\{j \mid (j, i) \in a\}|} \sum_{\forall (j,i) \in a} p(e_i \mid f_j \to f_{d_j}) \]
Having obtained $p(\bar{e} \mid \bar{f}, d_1^J, a)$, we can obtain $p(\bar{f} \mid \bar{e}, d_1^I, a)$ ($d_1^I$ represents the dependency relation of the target side) in a similar way. This new feature can be easily integrated into the log-linear model, just as lexical weighting is.
5 Experiments
In this section, we describe the experimental setting used in this work, and verify the effect of the relaxed-well-formed structure filtering and the new feature, head word trigger.
5.1 Experimental Setup
Experiments are carried out on the NIST Chinese-English translation task with two different sizes of training corpora.
• FBIS: We use the FBIS corpus as the first training corpus, which contains 239K sentence pairs with 6.9M Chinese words and 8.9M English words.
• GQ: This is manually selected from the LDC [2] corpora. GQ contains 1.5M sentence pairs with 41M Chinese words and 48M English words. In fact, FBIS is the subset of
[2] It consists of six LDC corpora: LDC2002E18, LDC2003E07, LDC2003E14, Hansards part of LDC2004T07, LDC2004T08, LDC2005T06.
For the language model, we use the SRI Language Modeling Toolkit (Stolcke, 2002) to train a 4-gram model on the first 1/3 of the Xinhua portion of the GIGAWORD corpus. We use the NIST 2002 MT evaluation test set as our development set, and the NIST 2004 and 2005 test sets as our blind test sets. We evaluate the translation quality using the case-insensitive BLEU metric (Papineni et al., 2002) without dropping OOV words, and the feature weights are tuned by minimum error rate training (Och, 2003).
In order to get the dependency relation of the training corpus, we re-implement a beam-search style monolingual dependency parser according to (Nivre and Scholz, 2004). Then we use the same method suggested in (Chiang, 2005) to extract SCFG grammar rules within the dependency constraint on both sides, except that unaligned words are allowed at the edge of phrases. Parameters of head word trigger are estimated as described in Section 4. As a default, the maximum initial phrase length is set to 10 and the maximum rule length of the source side is set to 5. Besides, we also re-implement the decoder of Hiero (Chiang, 2007) as our baseline. In fact, we just exploit the dependency structure during the rule extraction phase. Therefore, we don’t need to change the main decoding algorithm of the SMT system.
5.2 Results on FBIS Corpus
A series of experiments was done on the FBIS corpus. We first parse both languages with their respective monolingual dependency parsers, and then only retain the rules whose two sides are both in line with the constraint of dependency structure. In Table 1, the relaxed-well-formed structure filtered out 35% of the rule table and the well-formed discarded 74%. RWF extracts an additional 39% compared to WF, which can be seen as some kind of evidence that the rules we additionally get seem common in the sense of linguistics. Compared to (Shen et al., 2008), we just use the dependency structure to constrain rules, not to maintain the tree structures to guide decoding.
Table 2 shows the translation result on FBIS. We can see that the RWF structure constraint can improve translation quality substantially both on the development set and on the different test sets. On the Test04 task, it gains +0.86% BLEU, and +0.84% on Test05. Besides, we also used Shen et al. (2008)’s WF structure to filter both sides. Although it discards about 74% of the rule table, the over-all BLEU is decreased by 0.66%-0.78% on the test sets.
As for the feature of head word trigger, it does not seem to work on the FBIS corpus. On Test05, it gets the same score as the baseline, but lower than RWF filtering. This may be caused by the data sparseness problem, which results in inaccurate parameter estimation of the new feature.
Table 1: Rule table size with different constraint on FBIS. Here HPB refers to the baseline hierarchical phrase-based system, RWF means relaxed-well-formed constraint and WF represents the well-formed structure.
Table 2: Results of FBIS corpus. Here Tri means the feature of head word trigger on both sides. And we don’t test the new feature on Test04 because of the bad performance on development set. * or ** = significantly better than baseline (p < 0.05 or 0.01, respectively).
5.3 Result on GQ Corpus
In this part, we increased the size of the training corpus to check whether the feature of head word trigger works on a large corpus.
We get 152M rule entries from the GQ corpus according to (Chiang, 2007)’s extraction method. If we use the RWF structure to constrain both sides, the number of rules is 87M; about 43% of the rule entries are discarded. From Table 3, the new feature works well on two different test sets. The gain is +2.21% BLEU on Test04, and +1.33% on Test05. Compared to the result of the baseline, using only the RWF structure to filter performs the same as the baseline on Test05, and gains +0.99% on Test04.
6 Conclusions
This paper proposes a simple strategy to filter the hierarchical rule table, and introduces a new feature to enhance the translation performance. We employ the relaxed-well-formed dependency structure to constrain both sides of the rule, and about 40% of rules are discarded with improvement of the translation performance. In order to make full use of the dependency information, we assume that the target word e is triggered by the dependency edge of the corresponding source word f. This feature works well on a large parallel training corpus.
How to estimate the probability of head word trigger is very important. Here we only get the parameters in a generative way. In the future, we plan to exploit some discriminative approach to train the parameters of this feature, such as the EM algorithm (Hasan et al., 2008) or maximum entropy (He et al., 2008).
Besides, the quality of the parser is another factor affecting this method. As the next step, we will try to exploit bilingual knowledge to improve the monolingual parser, e.g. (Huang et al., 2009).
Acknowledgments
This work was partly supported by National Natural Science Foundation of China Contract 60873167. It was also funded by SK Telecom, Korea under the contract 4360002953. We show our special thanks to Wenbin Jiang and Shu Cai for their valuable suggestions. We also thank the anonymous reviewers for their insightful comments.
Peter F. Brown, Vincent J. Della Pietra, Stephen
A. Della Pietra, and Robert L. Mercer. 1993. The
mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 19(2):263–
Table 3: Results of GQ corpus. * or ** = significantly better than baseline (p < 0.05 or 0.01,
David Chiang. 2005. A hierarchical phrase-based
model for statistical machine translation. In ACL
’05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 263–
David Chiang. 2007. Hierarchical phrase-based translation. Comput. Linguist., 33(2):201–228.
Yuan Ding and Martha Palmer. 2005. Machine translation using probabilistic synchronous dependency
insertion grammars. In ACL ’05: Proceedings of the
43rd Annual Meeting on Association for Computational Linguistics, pages 541–548.
Saša Hasan and Hermann Ney. 2009. Comparison of
extended lexicon models in search and rescoring for
smt. In NAACL ’09: Proceedings of Human Language Technologies: The 2009 Annual Conference
of the North American Chapter of the Association
for Computational Linguistics, Companion Volume:
Short Papers, pages 17–20.
Franz Josef Och. 2003. Minimum error rate training
in statistical machine translation. In ACL ’03: Proceedings of the 41st Annual Meeting on Association
for Computational Linguistics, pages 160–167.
Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for
Computational Linguistics, pages 311–318.
Chris Quirk, Arul Menezes, and Colin Cherry. 2005.
Dependency treelet translation: syntactically informed phrasal smt. In ACL ’05: Proceedings of
the 43rd Annual Meeting on Association for Computational Linguistics, pages 271–279.
Libin Shen, Jinxi Xu, and Ralph Weischedel. 2008. A
new string-to-dependency machine translation algorithm with a target dependency language model. In
Proceedings of ACL-08: HLT, pages 577–585.
Saša Hasan, Juri Ganitkevitch, Hermann Ney, and
Jesús Andrés-Ferrer. 2008. Triplet lexicon models
for statistical machine translation. In EMNLP ’08:
Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 372–
Libin Shen, Jinxi Xu, Bing Zhang, Spyros Matsoukas,
and Ralph Weischedel. 2009. Effective use of linguistic and contextual information for statistical machine translation. In EMNLP ’09: Proceedings of
the 2009 Conference on Empirical Methods in Natural Language Processing, pages 72–80.
Zhongjun He, Qun Liu, and Shouxun Lin. 2008. Improving statistical machine translation using lexicalized rule selection. In COLING ’08: Proceedings
of the 22nd International Conference on Computational Linguistics, pages 321–328.
Andreas Stolcke. 2002. SRILM - an extensible language
modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), pages 901–904.
Zhongjun He, Yao Meng, Yajuan Lü, Hao Yu, and Qun
Liu. 2009. Reducing smt rule table with monolingual key phrase. In ACL-IJCNLP ’09: Proceedings
of the ACL-IJCNLP 2009 Conference Short Papers,
pages 121–124.
Liang Huang, Wenbin Jiang, and Qun Liu. 2009.
Bilingually-constrained (monolingual) shift-reduce
parsing. In EMNLP ’09: Proceedings of the 2009
Conference on Empirical Methods in Natural Language Processing, pages 1222–1231.
Gonzalo Iglesias, Adrià de Gispert, Eduardo R. Banga,
and William Byrne. 2009. Rule filtering by pattern
for efficient hierarchical translation. In EACL ’09:
Proceedings of the 12th Conference of the European
Chapter of the Association for Computational Linguistics, pages 380–388.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based translation. In
NAACL ’03: Proceedings of the 2003 Conference
of the North American Chapter of the Association
for Computational Linguistics on Human Language
Technology, pages 48–54.
Joakim Nivre and Mario Scholz. 2004. Deterministic dependency parsing of english text. In COLING
’04: Proceedings of the 20th international conference on Computational Linguistics, pages 64–70.
Fixed Length Word Suffix for Factored
Statistical Machine Translation
Narges Sharif Razavian
Stephan Vogel
School of Computer Science
Carnegie Mellon University
Pittsburgh, USA
[email protected]
School of Computer Science
Carnegie Mellon University
Pittsburgh, USA
[email protected]
Factored Statistical Machine Translation extends the Phrase Based SMT model by allowing each word to be a vector of factors.
Experiments have shown effectiveness of
many factors, including the Part of Speech
tags in improving the grammaticality of the
output. However, high quality part of
speech taggers are not available in open
domain for many languages. In this paper
we used fixed length word suffix as a new
factor in the Factored SMT, and were able
to achieve significant improvements in three
set of experiments: large NIST Arabic to
English system, medium WMT Spanish to
English system, and small TRANSTAC
English to Iraqi system.
Statistical Machine Translation (SMT) is currently the state-of-the-art approach to machine
translation, and phrase-based SMT is among the
top-performing approaches available today.
It is a purely lexical approach: it uses the
surface forms of the words in the parallel corpus
to generate translations and estimate probabilities. Syntactic
information can be incorporated into this framework in several ways. Source-side syntax-based reordering
as a preprocessing step, dependency-based reordering models, and cohesive decoding features are
among the many successful attempts to
integrate syntax into the translation
model. Factored translation modeling is another
way to achieve this goal. These models allow
each word to be represented as a vector of factors
rather than a single surface form. Factors
give each word a richer representation.
Any factor, such as word stem, gender, part of
speech, or tense, can be easily used in this
Previous work in factored translation modeling
has reported consistent improvements from Part
of Speech (POS) tags, morphology, gender, and
case factors (Koehn et al. 2007). In another work,
Birch et al. 2007 achieved improvements
using Combinatory Categorial Grammar (CCG)
supertag factors. Creating the factors is done as
a preprocessing step, and so far, most experiments have assumed the existence of external
tools for creating these factors (e.g., part of
speech taggers, CCG parsers, etc.). Unfortunately,
high quality language processing tools, especially for the open domain, are not available for most
While linguistically identifiable representations
(e.g., POS tags, CCG supertags, etc.) have been
very frequently used as factors in many applications including MT, simpler representations have
also been effective in achieving the same result
in other application areas. Grzymala-Busse and
Old 1997 and Dincer et al. 2008 were able to use
fixed length suffixes as features for training a
POS tagger. In another work, Saberi and Perrot
1999 showed that reversing middle chunks of
words while keeping the first and last parts intact
does not decrease listeners’ recognition ability.
This result is very relevant to Machine Translation, suggesting that inaccurate context, which is
usually modeled with n-gram language models,
can still be as effective as accurate surface forms.
Another study (Rawlinson 1976) confirms this
finding, this time in the textual domain, observing
that randomization of letters in the middle of
words has little or no effect on the ability of
skilled readers to understand the text. These results suggest that inexpensive representational factors which do not need unavailable tools
might also be worth investigating.
These results encouraged us to introduce simple, language independent factors for machine
translation. In this paper, following the work of
Grzymala-Busse et al., we used the fixed length suffix
as a word factor, to lower the perplexity of the
language model and have the factors roughly
function as part of speech tags, thus increasing
the grammaticality of the translation results. We
were able to obtain consistent, significant improvements over our baseline in three different experiments: a large NIST Arabic to English system,
a medium WMT Spanish to English system, and
a small TRANSTAC English to Iraqi system.
Proceedings of the ACL 2010 Conference Short Papers, pages 147–150,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
The rest of this paper is organized as follows. Section 2
briefly reviews Factored Translation Models.
Section 3 introduces our model,
Section 4 presents the experiments and the
analysis of the results, and Section 5 concludes the paper.
Factored Translation Model
Statistical Machine Translation uses a log-linear combination of a number of features to
compute the most probable hypothesis:
$$\hat{e} = \arg\max_e p(e \mid f) = \arg\max_e \exp \sum_{i=1}^{n} \lambda_i h_i(e, f)$$
In phrase-based SMT, assuming the source and
target phrase segmentation $\{(f_i, e_i)\}$, the most
important features include: the language model
feature $h_{lm}(e,f) = p_{lm}(e)$; the phrase translation
feature $h_t(e,f)$, defined as the product of translation
probabilities, lexical probabilities and phrase penalty; and the reordering probability $h_d(e,f)$,
usually defined as $\prod_{i=1}^{n} d(start_i, end_{i-1})$ over the
source phrase reordering events.
The Factored Translation Model, recently introduced by Koehn et al. (2007), allows words to
have a vector representation. The model can then
extend the definition of each of the features from
a uni-dimensional value to an arbitrary joint and
conditional combination of features. Phrase-based
SMT is in fact a special case of Factored
The factored features are defined as an extension of phrase translation features. The function
$\tau(f_j, e_j)$, which was defined for a phrase pair before, can now be extended as a log-linear combination $\sum_f \tau_f(f_{jf}, e_{jf})$. The model also allows for a
generation feature, defining the relationship between the final surface form and the target factors. Other
features include additional language model features over individual factors, and factored reordering features.
Figure 1 shows an example of a possible factored model.
Figure 1: An example of a Factored Translation and
Generation Model
In this particular model, words on both the source
and target side are represented as a vector of four
factors: surface form, lemma, part of speech
(POS) and morphology. The target phrase is
generated as follows: the source word lemma generates the target word lemma. The source word's part of
speech and morphology together generate the
target word's part of speech and morphology, and
from its lemma, part of speech and morphology
the surface form of the target word is finally generated. This model has yielded
higher BLEU scores as well as better grammatical coherence for English to German, English to Spanish, English to Czech, English to
Chinese, Chinese to English and German to English.
Fixed Length Suffix Factors for Factored Translation Modeling
Part of speech tagging, constituent and dependency parsing, and combinatory categorial grammar
supertagging are used extensively in most applications when syntactic representations are
needed. However, training these tools requires
medium size treebanks and tagged data, which
for most languages will not be available for a
while. On the other hand, many simple word
features, such as character n-grams, have in
fact proven to be comparably effective in
many applications.
Keikha et al. (2008) ran a text
classification experiment on noisy data and compared several word representations: surface
form, stemmed words, character n-grams, and
semantic relationships. They found that for noisy
and open domain text, character n-grams outperform the other representations when used for text
classification. In another work, Dincer et al.
(2008) showed that using fixed length word endings outperforms the whole word representation for
training a part of speech tagger for the Turkish language.
Based on this result, we proposed a suffix factored model for translation, which is shown in
Figure 2.
Figure 2: Suffix factored model. The source word determines the factor vector (target word, target word suffix), and each factor is associated with its own
language model (a word language model and a suffix language model).
Here plm-word is the n-gram language model
probability over the word surface sequence, with
the language model built from the surface forms.
Similarly, plm-suffix(esuffix) is the language model
probability over suffix sequences. p(eword-j &
esuffix-j|fj) and p(fj | eword-j & esuffix-j) are the translation
probabilities for each phrase pair i, used by
the decoder. These probabilities are estimated after
the phrase extraction step, which is based on the
grow-diag heuristic at this stage.
We used the Moses implementation of the factored
model for training the feature weights, and the SRI
toolkit for building n-gram language models. The
baseline for all systems was the Moses system with lexicalized reordering and SRI 5-gram
language models.
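The suffix factor itself needs no external tool. The following is a minimal sketch, not the authors' preprocessing script: the function names are hypothetical, and it assumes whitespace-tokenized input and the `|`-separated factor notation that Moses reads for factored corpora. It annotates each token with its last three characters and counts how much the vocabulary shrinks when only suffixes are kept.

```python
def suffix_factor(word, length=3):
    """Return the last `length` characters of `word` (the whole word if shorter)."""
    return word[-length:]


def to_factored(sentence, length=3):
    """Annotate each whitespace-separated token as surface|suffix."""
    return " ".join(f"{tok}|{suffix_factor(tok, length)}" for tok in sentence.split())


def vocabulary_sizes(sentences, length=3):
    """Return (number of distinct surface forms, number of distinct suffixes)."""
    words = {tok for s in sentences for tok in s.split()}
    suffixes = {suffix_factor(tok, length) for tok in words}
    return len(words), len(suffixes)


# Toy corpus, invented for illustration only.
corpus = [
    "the translators translated the translations",
    "she is translating and reordering",
]
factored = [to_factored(s) for s in corpus]
sizes = vocabulary_sizes(corpus)
```

On a real corpus the suffix vocabulary is much smaller than the surface vocabulary, which is what makes the suffix language model less sparse than the word language model.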
Medium System on Travel Domain:
Spanish to English
This system is the WMT08 system, trained on a corpus
of 1.2 million sentence pairs with an average sentence length of 27.9 words. As in the previous experiment, we defined the 3 character suffix of each
word as the second factor, and built the language model and reordering model on the joint
event of (surface, suffix) pairs. We built 5-gram
language models for each factor. The surface language model had
a vocabulary of about 97K distinct words, which was reduced to 8K in the
suffix corpus. Having defined the baseline, the
system results are as follows.
Small System from Dialog Domain:
English to Iraqi
This system was the TRANSTAC system, which
was built on about 650K sentence pairs with an
average sentence length of 5.9 words. After
choosing length 3 for suffixes, we built a new
parallel corpus, and SRI 5-gram language models
for each factor. The vocabulary size for the surface
form was 110K, whereas the word suffixes had
As you can see, this improvement is consistent
over multiple unseen datasets. Arabic cases and
numbers show up as the word suffix. Also, verb
numbers usually appear partly as word suffix and
in some cases as word prefix. Defining a language model over the word endings increases the
probability of sequences which have this case
and number agreement, favoring correct agreements over the incorrect ones.
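$$P(e \mid f) \sim p_{lm\text{-}word}(e_{word}) \cdot p_{lm\text{-}suffix}(e_{suffix}) \cdot \sum_{i=1}^{n} p(e_{word\text{-}j}\ \&\ e_{suffix\text{-}j} \mid f_j) \cdot \sum_{i=1}^{n} p(f_j \mid e_{word\text{-}j}\ \&\ e_{suffix\text{-}j})$$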
Experiments and Results
Table 1: BLEU score, English to Iraqi Transtac system, comparing Factored and Baseline systems.
Based on this model, the final probability of
the translation hypothesis will be the log linear
combination of phrase probabilities, reordering
model probabilities, and each of the language
models’ probabilities.
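To make the role of the extra language model concrete, the sketch below scores two candidate translations with a log-linear combination of four features. It is an illustration only: the feature names, the probabilities, and the unit weights are invented, and a real decoder tunes the weights and searches over many more hypotheses.

```python
import math


def loglinear_score(log_features, weights):
    """Weighted sum of log-domain feature values: sum_i lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in log_features.items())


def best_hypothesis(hypotheses, weights):
    """Return the text of the hypothesis with the highest log-linear score."""
    best = max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))
    return best["text"]


# Toy probabilities, invented for illustration only.
hypotheses = [
    {"text": "he walks home",
     "features": {"lm_word": math.log(0.020), "lm_suffix": math.log(0.30),
                  "phrase": math.log(0.10), "reordering": math.log(0.50)}},
    {"text": "he walk home",
     "features": {"lm_word": math.log(0.025), "lm_suffix": math.log(0.05),
                  "phrase": math.log(0.10), "reordering": math.log(0.50)}},
]

with_suffix_lm = {"lm_word": 1.0, "lm_suffix": 1.0, "phrase": 1.0, "reordering": 1.0}
without_suffix_lm = dict(with_suffix_lm, lm_suffix=0.0)
```

In this toy setting the word language model alone slightly prefers the ungrammatical candidate, while adding the suffix language model feature reverses the ranking.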
about 8K distinct words. Table 1 shows the result
(BLEU Score) of the system compared to the
Table 2: BLEU score, Spanish to English WMT system, comparing Factored and Baseline systems.
Here, we see an improvement with the suffix factors compared to the baseline system. Word endings in English are major indicators of a
word’s part of speech in the sentence. In fact, the
most common stemming algorithm, Porter’s
Stemmer, works by removing a word’s suffix.
Having a language model on these suffixes pushes their common patterns to the
top, so that more grammatically coherent
sentences achieve a better probability.
Large NIST 2009 System: Arabic to English
We used the NIST 2009 system as our baseline in
this experiment. The corpus had about 3.8 million sentence pairs, with an average sentence length
of 33.4 words. The baseline used the lexicalized reordering model. As before, we defined 3
character long word endings, and built 5-gram
SRI language models for each factor. The result
of this experiment is shown in Table 3.
Table 3: BLEU score, Arabic to English NIST 2009
system, comparing Factored and Baseline systems.
This result confirms the positive effect of the
suffix factors even on large systems. As mentioned before, we believe that this result is due to
the ability of the suffix to reduce the word to a
very simple but rough grammatical representation. Defining language models for this factor
pushes the decoder to prefer sentences with more
probable suffix sequences, which we believe
increases the grammaticality of the result. Future
error analysis will give us more insight into the
exact effect of this factor on the outcome.
improvements over the baseline. This result, obtained from the language independent and inexpensive factor, shows promising new
opportunities for all language pairs.
Birch, A., Osborne, M., and Koehn, P. CCG supertags
in factored statistical machine translation. Proceedings of the Second Workshop on Statistical Machine Translation, pages 9–16, Prague, Czech
Republic. Association for Computational Linguistics, 2007.
Dincer T., Karaoglan B. and Kisla T., A Suffix Based
Part-Of-Speech Tagger For Turkish, Fifth International Conference on Information Technology:
New Generations, 2008.
Grzymala-Busse J.W., Old L.J. A machine learning
experiment to determine part of speech from word-endings, Lecture Notes in Computer Science,
Communications Session 6B Learning and Discovery Systems, 1997.
Keikha M., Sharif Razavian N, Oroumchian F., and
Seyed Razi H., Document Representation and
Quality of Text: An Analysis, Chapter 12, Survey
of Text Mining II, Springer London, 2008.
Koehn Ph., Hoang H., Factored Translation Models,
Proceedings of 45th Annual Meeting of the Association for Computational Linguistics (ACL), 2007.
Rawlinson G. E., The significance of letter position in
word recognition, PhD Thesis, Psychology Department, University of Nottingham, Nottingham
UK, 1976.
Saberi K and Perrot D R, Cognitive restoration of
reversed speech, Nature (London) 1999.
In this paper we introduced a simple yet very
effective factor, the fixed length word suffix, for use
in Factored Translation Models. This simple factor has been shown to be effective as a rough
replacement for part of speech. We tested our
factors in three experiments: a small English to
Iraqi system, a medium sized Spanish
to English system, and a large NIST09 Arabic to
English system. We observed consistent and significant
Unsupervised Discourse Segmentation
of Documents with Inherently Parallel Structure
Minwoo Jeong and Ivan Titov
Saarland University
Saarbrücken, Germany
(e.g., (Hearst, 1994)). The most straightforward
approach would be to use a pipeline strategy,
where an existing segmentation algorithm finds
discourse boundaries of each part independently,
and then the segments are aligned. Or, conversely,
a sentence-alignment stage can be followed by a
segmentation stage. However, as we will see in our
experiments, these strategies may result in poor
segmentation and alignment quality.
To address this problem, we construct a nonparametric Bayesian model for joint segmentation and alignment of parallel parts. In comparison with the discussed pipeline approaches,
our method has two important advantages: (1) it
leverages the lexical cohesion phenomenon (Halliday and Hasan, 1976) in modeling the parallel parts of documents, and (2) ensures that the
effective number of segments can grow adaptively. Lexical cohesion is an idea that topically-coherent segments display compact lexical distributions (Hearst, 1994; Utiyama and Isahara, 2001;
Eisenstein and Barzilay, 2008). We hypothesize
that not only isolated fragments but also each
group of linked fragments displays a compact and
consistent lexical distribution, and our generative
model leverages this inter-part cohesion assumption.
In this paper, we consider the dataset of “English as a second language” (ESL) podcast1 , where
each episode consists of two parallel parts: a story
(an example monologue or dialogue) and an explanatory lecture discussing the meaning and usage of English expressions appearing in the story.
Fig. 1 presents an example episode, consisting of
two parallel parts, and their hidden topical relations.2 From the figure we may conclude that there
is a tendency of word repetition between each pair
of aligned segments, illustrating our hypothesis of
compactness of their joint distribution. Our goal is
Documents often have inherently parallel
structure: they may consist of a text and
commentaries, or an abstract and a body,
or parts presenting alternative views on
the same problem. Revealing relations between the parts by jointly segmenting and
predicting links between the segments,
would help to visualize such documents
and construct friendlier user interfaces. To
address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We
apply our method to the “English as a second language” podcast dataset where each
episode is composed of two parallel parts:
a story and an explanatory lecture. The
predicted topical links uncover hidden relations between the stories and the lectures. In this domain, our method achieves
competitive results, rivaling those of a previously proposed supervised technique.
Many documents consist of parts exhibiting a high
degree of parallelism: e.g., abstract and body of
academic publications, summaries and detailed
news stories, etc. This is especially common with
the emergence of the Web 2.0 technologies: many
texts on the web are now accompanied with comments and discussions. Segmentation of these parallel parts into coherent fragments and discovery
of hidden relations between them would facilitate
the development of better user interfaces and improve the performance of summarization and information retrieval systems.
Discourse segmentation of the documents composed of parallel parts is a novel and challenging problem, as previous research has mostly focused on the linear segmentation of isolated texts
Episode no. 232 post on Jan. 08, 2007.
Proceedings of the ACL 2010 Conference Short Papers, pages 151–155,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
I have a day job, but I recently started a
small business on the side.
I didn't know anything about accounting
and my friend, Roland, said that he would
give me some advice.
Roland: So, the reason that you need to
do your bookkeeping is so you can
manage your cash flow.
Lecture transcript
This podcast is all about business vocabulary related to accounting.
The title of the podcast is Business Bookkeeping. ...
The story begins by Magdalena saying that she has a day job.
A day job is your regular job that you work at from nine in the morning 'til five in the afternoon, for
She also has a small business on the side. ...
Magdalena continues by saying that she didn't know anything about accounting and her friend,
Roland, said he would give her some advice.
Accounting is the job of keeping correct records of the money you spend; it's very similar to
bookkeeping. ...
Roland begins by saying that the reason that you need to do your bookkeeping is so you can
manage your cash flow.
Cash flow, flow, means having enough money to run your business - to pay your bills. ...
Figure 1: An example episode of ESL podcast. Co-occurred words are represented in italic and underline.
to divide the lecture transcript into discourse units
and to align each unit to the related segment of the
story. Predicting these structures for the ESL podcast could be the first step in development of an
e-learning system and a podcast search engine for
ESL learners.
In this section we describe our model for discourse
segmentation of documents with inherently parallel structure. We start by clarifying our assumptions about their structure.
We assume that a document $x$ consists of $K$
parallel parts, that is, $x = \{x^{(k)}\}_{k=1:K}$, and
each part of the document consists of segments,
$x^{(k)} = \{s_i\}_{i=1:I}$. Note that the effective number of fragments $I$ is unknown. Each segment can
either be specific to this part (drawn from a part-specific
language model $\phi_i^{(k)}$) or correspond to
the entire document (drawn from a document-level
language model $\phi_i^{(doc)}$). For example, the first
and the second sentences of the lecture transcript
in Fig. 1 are part-specific, whereas other linked
sentences belong to the document-level segments.
The document-level language models define topical links between segments in different parts of
the document, whereas the part-specific language
models define the linear segmentation of the remaining unaligned text.
Each document-level language model corresponds to the set of aligned segments, at most one
segment per part. Similarly, each part-specific language model corresponds to a single segment of
the single corresponding part. Note that all the
documents are modeled independently, as we aim
not to discover collection-level topics (as e.g. in
(Blei et al., 2003)), but to perform joint discourse
segmentation and alignment.
Unlike (Eisenstein and Barzilay, 2008), we cannot make an assumption that the number of segments is known a-priori, as the effective number of
part-specific segments can vary significantly from
document to document, depending on their size
and structure. To tackle this problem, we use
Dirichlet processes (DP) (Ferguson, 1973) to de-
Related Work
Discourse segmentation has been an active area
of research (Hearst, 1994; Utiyama and Isahara,
2001; Galley et al., 2003; Malioutov and Barzilay,
2006). Our work extends the Bayesian segmentation model (Eisenstein and Barzilay, 2008) for isolated texts, to the problem of segmenting parallel
parts of documents.
The task of aligning each sentence of an abstract
to one or more sentences of the body has been
studied in the context of summarization (Marcu,
1999; Jing, 2002; Daumé and Marcu, 2004). Our
work is different in that we do not try to extract
the most relevant sentence but rather aim to find
coherent fragments with maximally overlapping
lexical distributions. Similarly, the query-focused
summarization (e.g., (Daumé and Marcu, 2006))
is also related but it focuses on sentence extraction
rather than on joint segmentation.
We are aware of only one previous work on joint
segmentation and alignment of multiple texts (Sun
et al., 2007) but their approach is based on similarity functions rather than on modeling lexical cohesion in the generative framework. Our application,
the analysis of the ESL podcast, was previously
studied in (Noh et al., 2010). They proposed a supervised method which is driven by pairwise classification decisions. The main drawback of their
approach is that it neglects the discourse structure
and the lexical cohesion phenomenon.
fine priors on the number of segments. We incorporate them in our model in a similar way as it
is done for the Latent Dirichlet Allocation (LDA)
by Yu et al. (2005). Unlike the standard LDA, the
topic proportions are chosen not from a Dirichlet
prior but from the marginal distribution GEM (α)
defined by the stick breaking construction (Sethuraman, 1994), where α is the concentration parameter of the underlying DP distribution. GEM (α)
defines a distribution of partitions of the unit interval into a countable number of parts.
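The stick-breaking construction can be sketched in a few lines. This is an illustrative truncated sampler, not the inference code used in the experiments: the function name and the truncation level are assumptions, and the last entry of the result holds the mass left over after truncation.

```python
import random


def stick_breaking(alpha, num_sticks, rng=None):
    """Truncated draw from GEM(alpha).

    Returns num_sticks weights followed by the remaining (unbroken) mass,
    so the returned list always sums to 1.
    """
    rng = rng or random.Random()
    weights, remaining = [], 1.0
    for _ in range(num_sticks):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


proportions = stick_breaking(alpha=1.0, num_sticks=10, rng=random.Random(0))
```

A small concentration parameter puts most of the mass on the first few pieces (few segments), while a large one spreads it over many pieces.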
The formal definition of our model is as follows:
• Draw the part-specific topic proportions $\beta^{(k)} \sim GEM(\alpha^{(k)})$ for $k \in \{1, \ldots, K\}$.
is the current segmentation and its type. The new
pair $(z', t')$ is accepted with the probability
$$\min\left(1, \frac{P(z', t', x)\, Q(z, t \mid z', t')}{P(z, t, x)\, Q(z', t' \mid z, t)}\right)$$
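A single accept/reject step can be sketched as follows, using the standard Metropolis-Hastings rule in which the probability of proposing the reverse move multiplies the probability of the new state. The function names and the `propose` interface are hypothetical, and log probabilities are used to avoid underflow; the joint probability and the moves of the actual model are not implemented here.

```python
import math
import random


def mh_accept_prob(log_p_new, log_p_old, log_q_forward, log_q_backward):
    """min(1, P(new) Q(old|new) / (P(old) Q(new|old))), computed in log space."""
    log_ratio = (log_p_new + log_q_backward) - (log_p_old + log_q_forward)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def mh_step(state, log_p, propose, rng):
    """One MH step.

    `propose(state, rng)` must return (new_state, log_q_forward, log_q_backward),
    where log_q_forward = log Q(new|old) and log_q_backward = log Q(old|new).
    """
    new_state, log_q_fwd, log_q_bwd = propose(state, rng)
    accept = mh_accept_prob(log_p(new_state), log_p(state), log_q_fwd, log_q_bwd)
    return new_state if rng.random() < accept else state
```

For a symmetric proposal the two Q terms cancel and the rule reduces to comparing the probabilities of the new and the current state.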
In order to implement the MH algorithm for our
model, we need to define the set of potential moves
(i.e. admissible changes from $(z, t)$ to $(z', t')$),
and the proposal distribution $Q$ over these moves.
If the actual number of segments is known and
only a linear discourse structure is acceptable, then
a single move, shift of the segment border (Fig.
2(a)), is sufficient (Eisenstein and Barzilay, 2008).
In our case, however, a more complex set of moves
is required.
We make two assumptions which are motivated by the problem considered in Section 5:
we assume that (1) we are given the number of
document-level segments and also that (2) the
aligned segments appear in the same order in each
part of the document. With these assumptions in
mind, we introduce two additional moves (Fig.
2(b) and (c)):
• Choose the part-specific language models $\phi_i^{(k)} \sim Dir(\gamma^{(k)})$ for $k \in \{1, \ldots, K\}$ and $i \in \{1, 2, \ldots\}$.
• For each part $k$ and each sentence $n$:
– Draw type $t_n \sim Unif(Doc, Part)$.
– If $t_n = Doc$: draw topic $z_n \sim \beta^{(doc)}$; generate words $x_n \sim Mult(\phi^{(doc)}_{z_n})$.
– Otherwise: draw topic $z_n \sim \beta^{(k)}$; generate
words $x_n \sim Mult(\phi^{(k)}_{z_n})$.
The priors γ (doc) , γ (k) , α(doc) and α(k) can be
estimated at learning time using non-informative
hyperpriors (as we do in our experiments), or set
manually to indicate preferences of segmentation
At inference time, we enforce each latent topic
zn to be assigned to a contiguous span of text,
assuming that coherent topics are not recurring
across the document (Halliday and Hasan, 1976).
It also reduces the search space and, consequently,
speeds up our sampling-based inference by reducing the time needed for Monte Carlo chains to
mix. In fact, this constraint can be integrated in the
model definition but it would significantly complicate the model description.
Figure 2: Three types of moves: (a) shift, (b) split
and (c) merge.
• Draw the document-level topic proportions $\beta^{(doc)} \sim GEM(\alpha^{(doc)})$.
• Choose the document-level language model $\phi_i^{(doc)} \sim Dir(\gamma^{(doc)})$ for $i \in \{1, 2, \ldots\}$.
• Split move: select a segment, and split it at
one of the spanned sentences; if the segment
was a document-level segment then one of
the fragments becomes the same documentlevel segment.
• Merge move: select a pair of adjacent segments where at least one of the segments is
part-specific, and merge them; if one of them
was a document-level segment then the new
segment has the same document-level topic.
All the moves are selected with the uniform probability, and the distance $c$ for the shift move is
drawn from the proposal distribution proportional
to $c^{-1/c_{max}}$. The moves are selected independently for each part.
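The shift-distance proposal can be sketched as below. The text only states that the distribution is proportional to $c^{-1/c_{max}}$; the support $c \in \{1, \ldots, c_{max}\}$ and the function names are assumptions made for this illustration.

```python
import random


def shift_distance_distribution(c_max):
    """Normalized probabilities proportional to c ** (-1 / c_max) for c = 1..c_max."""
    weights = [c ** (-1.0 / c_max) for c in range(1, c_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_shift_distance(c_max, rng):
    """Draw a shift distance c from the distribution above."""
    probs = shift_distance_distribution(c_max)
    return rng.choices(range(1, c_max + 1), weights=probs, k=1)[0]


probs = shift_distance_distribution(5)
draw = sample_shift_distance(5, random.Random(0))
```

Under this assumption short shifts are slightly more likely than long ones, and the preference flattens as $c_{max}$ grows.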
Although the above two assumptions are not
crucial as a simple modification to the set of moves
would support both introduction and deletion of
document-level fragments, this modification was
not necessary for our experiments.
As exact inference is intractable, we follow Eisenstein and Barzilay (2008) and instead use a
Metropolis-Hastings (MH) algorithm. At each
iteration of the MH algorithm, a new potential
alignment-segmentation pair $(z', t')$ is drawn from
a proposal distribution $Q(z', t' \mid z, t)$, where $(z, t)$
Dataset and setup
Dataset We apply our model to the ESL podcast
dataset (Noh et al., 2010) of 200 episodes, with
an average of 17 sentences per story and 80 sentences per lecture transcript. The gold standard
alignments assign each fragment of the story to a
segment of the lecture transcript. We can induce
segmentations at different levels of granularity on
both the story and the lecture side. However, given
that the segmentation of the story was obtained by
an automatic sentence splitter, there is no reason
to attempt to reproduce this segmentation. Therefore, for quantitative evaluation purposes we follow Noh et al. (2010) and restrict our model to
alignment structures which agree with the given
segmentation of the story. For all evaluations, we
apply a standard stemming algorithm and remove
common stop words.
Evaluation metrics To measure the quality of segmentation of the lecture transcript, we use two
standard metrics, Pk (Beeferman et al., 1999) and
WindowDiff (WD) (Pevzner and Hearst, 2002),
but both metrics disregard the alignment links (i.e.
the topic labels). Consequently, we also use the
macro-averaged F1 score on pairs of aligned spans,
which measures both the segmentation and alignment quality.
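For readers unfamiliar with the two segmentation metrics, the sketch below computes them from per-sentence segment labels. It follows the usual definitions (a sliding window of k sentences, with k set to half the mean reference segment length by default) and is not the evaluation script used for the reported numbers; the function names are hypothetical.

```python
def _boundaries(labels):
    """1 where a segment boundary follows position j, else 0."""
    return [int(labels[j] != labels[j + 1]) for j in range(len(labels) - 1)]


def default_window(reference):
    """Half the mean reference segment length, at least 1."""
    num_segments = sum(_boundaries(reference)) + 1
    return max(1, round(len(reference) / (2 * num_segments)))


def _window_errors(reference, hypothesis, k, disagree):
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    k = k or default_window(reference)
    n = len(reference) - k
    if n <= 0:
        raise ValueError("window is not shorter than the text")
    ref_b, hyp_b = _boundaries(reference), _boundaries(hypothesis)
    errors = sum(disagree(sum(ref_b[i:i + k]), sum(hyp_b[i:i + k])) for i in range(n))
    return errors / n


def pk(reference, hypothesis, k=None):
    """Pk: windows where only one segmentation separates the two end points."""
    return _window_errors(reference, hypothesis, k, lambda r, h: (r == 0) != (h == 0))


def window_diff(reference, hypothesis, k=None):
    """WindowDiff: windows where the number of boundaries differs."""
    return _window_errors(reference, hypothesis, k, lambda r, h: r != h)


reference = [0, 0, 0, 1, 1, 1, 2, 2, 2]
hypothesis = [0, 0, 1, 2, 2, 2, 3, 3, 3]
scores = (pk(reference, hypothesis, k=2), window_diff(reference, hypothesis, k=2))
```

Both scores are error rates, so lower is better. WindowDiff also penalizes windows in which both segmentations place a boundary but disagree on how many, which Pk ignores.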
Baseline Since there has been little previous research on this problem, we compare our results
against two straightforward unsupervised baselines. For the first baseline, we consider the
pairwise sentence alignment (SentAlign) based
on the unigram and bigram overlap. The second baseline is a pipeline approach (Pipeline),
where we first segment the lecture transcript with
BayesSeg (Eisenstein and Barzilay, 2008) and
then use the pairwise alignment to find their best
alignment to the segments of the story.
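The SentAlign baseline can be pictured with the short sketch below. The text only states that the alignment is based on unigram and bigram overlap; the Jaccard normalisation, the tokenised input format and the function names are assumptions made for illustration.

```python
def ngram_set(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(a, b):
    """Jaccard overlap of the unigrams and bigrams of two sentences (assumed measure)."""
    grams_a = ngram_set(a, 1) | ngram_set(a, 2)
    grams_b = ngram_set(b, 1) | ngram_set(b, 2)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def align(story, lecture):
    """For each story sentence, index of the lecture sentence with the highest overlap."""
    return [
        max(range(len(lecture)), key=lambda j: overlap(sentence, lecture[j]))
        for sentence in story
    ]
```

In the pipeline baseline the same kind of overlap score would be computed against whole lecture segments rather than single sentences.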
Our model We evaluate our joint model of segmentation and alignment both with and without
the split/merge moves. For the model without
these moves, we set the desired number of segments in the lecture to be equal to the actual number of segments in the story I. In this setting,
the moves can only adjust positions of the segment borders. For the model with the split/merge
moves, we start with the same number of segments
I but it can be increased or decreased during inference. For evaluation of our model, we run our
inference algorithm from five random states, and
take the 100,000th iteration of each chain as a sample. Results are the average over these five runs.
We also perform L-BFGS optimization to automatically adjust the non-informative hyperpriors
after every 1,000 iterations of sampling.
Table 1: Results on the ESL podcast dataset. For
all metrics, lower values are better (F1 is reported as 1 − F1).
Table 1 summarizes the obtained results. ‘Uniform’ denotes the minimal baseline which uniformly draws a random set of I spans for each lecture, and then aligns them to the segments of the
story preserving the linear order. Also, we consider two variants of the pipeline approach: segmenting the lecture into I and 2I + 1 segments, respectively.³ Our joint model substantially outperforms the baselines. The difference is statistically
significant at the level p < .01, as measured with
a paired t-test. The significant improvement over
the pipeline results demonstrates benefits of joint
modeling for the considered problem. Moreover,
additional benefits are obtained by using the DP
priors and the split/merge moves (the last line in
Table 1). Finally, our model significantly outperforms the previously proposed supervised model
(Noh et al., 2010): they report micro-averaged F1
score 0.698 while our best model achieves 0.778
with the same metric. This observation confirms
that lexical cohesion modeling is crucial for successful discourse analysis.
We studied the problem of joint discourse segmentation and alignment of documents with inherently
parallel structure and achieved favorable results on
the ESL podcast dataset, outperforming the cascaded baselines. Accurate prediction of these hidden relations would open interesting possibilities
for construction of friendlier user interfaces. One
example is an application which, given a user-selected fragment of the abstract, produces a summary from the aligned segment of the document
³ The use of the DP priors and the split/merge moves on
the first stage of the pipeline did not result in any improvement in accuracy.
Acknowledgements
The authors acknowledge the support of the Excellence Cluster on Multimodal Computing and Interaction (MMCI), and also thank Mikhail Kozhevnikov and the anonymous reviewers for their valuable comments, and Hyungjong Noh for providing their data.
References
Doug Beeferman, Adam Berger, and John Lafferty. 1999. Statistical models for text segmentation. Machine Learning, 34(1–3):177–210.
David M. Blei, Andrew Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–
Hal Daumé and Daniel Marcu. 2004. A phrase-based HMM approach to document/abstract alignment. In Proceedings of EMNLP, pages 137–144.
Hal Daumé and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of ACL, pages 305–312.
Jacob Eisenstein and Regina Barzilay. 2008. Bayesian unsupervised topic segmentation. In Proceedings of EMNLP, pages 334–343.
Thomas S. Ferguson. 1973. A Bayesian analysis of some non-parametric problems. Annals of Statistics,
Michel Galley, Kathleen R. McKeown, Eric Fosler-Lussier, and Hongyan Jing. 2003. Discourse segmentation of multi-party conversation. In Proceedings of ACL, pages 562–569.
M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman.
Marti Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of ACL, pages 9–16.
Hongyan Jing. 2002. Using hidden Markov modeling to decompose human-written summaries. Computational Linguistics, 28(4):527–543.
Igor Malioutov and Regina Barzilay. 2006. Minimum cut model for spoken lecture segmentation. In Proceedings of ACL, pages 25–32.
Daniel Marcu. 1999. The automatic construction of large-scale corpora for summarization research. In Proceedings of ACM SIGIR, pages 137–144.
Hyungjong Noh, Minwoo Jeong, Sungjin Lee, Jonghoon Lee, and Gary Geunbae Lee. 2010. Script-description pair extraction from text documents of English as second language podcast. In Proceedings of the 2nd International Conference on Computer Supported Education.
Lev Pevzner and Marti Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–
Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.
Bingjun Sun, Prasenjit Mitra, C. Lee Giles, John Yen, and Hongyuan Zha. 2007. Topic segmentation with shared topic detection and alignment of multiple documents. In Proceedings of ACM SIGIR, pages 199–206.
Masao Utiyama and Hitoshi Isahara. 2001. A statistical model for domain-independent text segmentation. In Proceedings of ACL, pages 491–498.
Kai Yu, Shipeng Yu, and Volker Tresp. 2005. Dirichlet enhanced latent semantic analysis. In Proceedings
Coreference Resolution with Reconcile
Veselin Stoyanov
Center for Language
and Speech Processing
Johns Hopkins Univ.
Baltimore, MD
Claire Cardie
Department of
Computer Science
Cornell University
Ithaca, NY
Nathan Gilbert
Ellen Riloff
School of Computing
University of Utah
Salt Lake City, UT
David Buttler
David Hysom
Lawrence Livermore
National Laboratory
Livermore, CA
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Proceedings of the ACL 2010 Conference Short Papers, pages 156–161, Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics
Abstract
Despite the existence of several noun phrase coreference resolution data sets as well as several formal evaluations on the task, it remains frustratingly difficult to compare results across different coreference resolution systems. This is due to the high cost of implementing a complete end-to-end coreference resolution system, which often forces researchers to substitute available gold-standard information in lieu of implementing a module that would compute that information. Unfortunately, this leads to inconsistent and often unrealistic evaluation scenarios. With the aim to facilitate consistent and realistic experimental evaluations in coreference resolution, we present Reconcile, an infrastructure for the development of learning-based noun phrase (NP) coreference resolution systems. Reconcile is designed to facilitate the rapid creation of coreference resolution systems, easy implementation of new feature sets and approaches to coreference resolution, and empirical evaluation of coreference resolvers across a variety of benchmark data sets and standard scoring metrics. We describe Reconcile and present experimental results showing that Reconcile can be used to create a coreference resolver that achieves performance comparable to state-of-the-art systems on six benchmark data sets.
Introduction
Noun phrase coreference resolution (or simply coreference resolution) is the problem of identifying all noun phrases (NPs) that refer to the same entity in a text. The problem of coreference resolution is fundamental in the field of natural language processing (NLP) because of its usefulness for other NLP tasks, as well as the theoretical interest in understanding the computational mechanisms involved in government, binding and linguistic reference.
Several formal evaluations have been conducted for the coreference resolution task (e.g., MUC-6 (1995), ACE NIST (2004)), and the data sets created for these evaluations have become standard benchmarks in the field (e.g., MUC and ACE data sets). However, it is still frustratingly difficult to compare results across different coreference resolution systems. Reported coreference resolution scores vary wildly across data sets, evaluation metrics, and system configurations.
We believe that one root cause of these disparities is the high cost of implementing an end-to-end coreference resolution system. Coreference resolution is a complex problem, and successful systems must tackle a variety of non-trivial subproblems that are central to the coreference task (e.g., mention/markable detection, anaphor identification) and that require substantial implementation efforts. As a result, many researchers exploit gold-standard annotations, when available, as a substitute for component technologies to solve these subproblems. For example, many published research results use gold standard annotations to identify NPs (substituting for mention/markable detection), to distinguish anaphoric NPs from non-anaphoric NPs (substituting for anaphoricity determination), to identify named entities (substituting for named entity recognition), and to identify the semantic types of NPs (substituting for semantic class identification). Unfortunately, the use of gold standard annotations for key/critical component technologies leads to an unrealistic evaluation setting, and makes it impossible to directly compare results against coreference resolvers that solve all of these subproblems from scratch.
Comparison of coreference resolvers is further hindered by the use of several competing (and non-trivial) evaluation measures, and data sets that have substantially different task definitions and annotation formats. Additionally, coreference resolution is a pervasive problem in NLP and many NLP applications could benefit from an effective coreference resolver that can be easily configured and customized.
To address these issues, we have created a platform for coreference resolution, called Reconcile, that can serve as a software infrastructure to support the creation of, experimentation with, and evaluation of coreference resolvers. Reconcile was designed with the following seven desiderata in mind:
• implement the basic underlying software architecture of contemporary state-of-the-art learning-based coreference resolution systems;
• support experimentation on most of the standard coreference resolution data sets;
• implement most popular coreference resolution scoring metrics;
• exhibit state-of-the-art coreference resolution performance (i.e., it can be configured to create a resolver that achieves performance close to the best reported results);
• can be easily extended with new methods and
• is relatively fast and easy to configure and
• has a set of pre-built resolvers that can be used as black-box coreference resolution systems.
While several other coreference resolution systems are publicly available (e.g., Poesio and Kabadjov (2004), Qiu et al. (2004) and Versley et al. (2008)), none meets all seven of these desiderata (see Related Work). Reconcile is a modular software platform that abstracts the basic architecture of most contemporary supervised learning-based coreference resolution systems (e.g., Soon et al. (2001), Ng and Cardie (2002), Bengtson and Roth (2008)) and achieves performance comparable to the state-of-the-art on several benchmark data sets. Additionally, Reconcile can be easily reconfigured to use different algorithms, features, preprocessing elements, evaluation settings and metrics.
In the rest of this paper, we review related work (Section 2), describe Reconcile’s organization and components (Section 3) and show experimental results for Reconcile on six data sets and two evaluation metrics (Section 4).
Related Work
Several coreference resolution systems are currently publicly available. JavaRap (Qiu et al., 2004) is an implementation of the Lappin and Leass’ (1994) Resolution of Anaphora Procedure (RAP). JavaRap resolves only pronouns and, thus, it is not directly comparable to Reconcile. GuiTaR (Poesio and Kabadjov, 2004) and BART (Versley et al., 2008) (which can be considered a successor of GuiTaR) are both modular systems that target the full coreference resolution task. As such, both systems come close to meeting the majority of the desiderata set forth in Section 1. BART, in particular, can be considered an alternative to Reconcile, although we believe that Reconcile’s approach is more flexible than BART’s. In addition, the architecture and system components of Reconcile (including a comprehensive set of features that draw on the expertise of state-of-the-art supervised learning approaches, such as Bengtson and Roth (2008)) result in performance closer to the state-of-the-art.
Coreference resolution has received much research attention, resulting in an array of approaches, algorithms and features. Reconcile is modeled after typical supervised learning approaches to coreference resolution (e.g. the architecture introduced by Soon et al. (2001)) because of the popularity and relatively good performance of these systems.
However, there have been other approaches to coreference resolution, including unsupervised and semi-supervised approaches (e.g. Haghighi and Klein (2007)), structured approaches (e.g. McCallum and Wellner (2004) and Finley and Joachims (2005)), competition approaches (e.g. Yang et al. (2003)) and a bell-tree search approach (Luo et al. (2004)). Most of these approaches rely on some notion of pairwise feature-based similarity and can be directly implemented in Reconcile.
System Description
Reconcile was designed to be a research testbed capable of implementing most current approaches to coreference resolution. Reconcile is written in Java, to be portable across platforms, and was designed to be easily reconfigurable with respect to subcomponents, feature sets, parameter settings,
Reconcile’s architecture is illustrated in Figure 1. For simplicity, Figure 1 shows Reconcile’s operation during the classification phase (i.e., assuming that a trained classifier is present).
Figure 1: The Reconcile classification architecture.
The basic architecture of the system includes five major steps. Starting with a corpus of documents together with a manually annotated coreference resolution answer key (only required during training), Reconcile performs the following steps, in order:
1. Preprocessing. All documents are passed through a series of (external) linguistic processors such as tokenizers, part-of-speech taggers, syntactic parsers, etc. These components produce annotations of the text. Table 1 lists the preprocessors currently interfaced in Reconcile. Note that Reconcile includes several in-house NP detectors that conform to the different data sets’ definitions of what constitutes an NP (e.g., MUC vs. ACE). All of the extractors utilize a syntactic parse of the text and the output of a Named Entity (NE) extractor, but extract different constructs as specified in the corresponding definition. The NP extractors successfully recognize about 95% of the NPs in the MUC and ACE gold standards.
2. Feature generation. Using annotations produced during preprocessing, Reconcile produces feature vectors for pairs of NPs. For example, a feature might denote whether the two NPs agree in number, or whether they have any words in common. Reconcile includes over 80 features, inspired by other successful coreference resolution systems such as Soon et al. (2001) and Ng and Cardie (2002).
3. Classification. Reconcile learns a classifier that operates on feature vectors representing pairs of NPs and it is trained to assign a score indicating the likelihood that the NPs in the pair are coreferent.
4. Clustering. A clustering algorithm consolidates the predictions output by the classifier and forms the final set of coreference clusters
5. Scoring. Finally, during testing Reconcile runs scoring algorithms that compare the chains produced by the system to the gold-standard chains in the answer key.
Some structured coreference resolution algorithms (e.g., McCallum and Wellner (2004) and Finley and Joachims (2005)) combine the classification and clustering steps above. Reconcile can easily accommodate this modification.
Component                 Systems
Sentence Splitter         UIUC (CC Group, 2009); OpenNLP (Baldridge, J., 2005)
Tokenizer                 OpenNLP (Baldridge, J., 2005)
POS Tagger                OpenNLP (Baldridge, J., 2005) + the two parsers below
Parser                    Stanford (Klein and Manning, 2003); Berkeley (Petrov and Klein, 2007)
Dep. parser               Stanford (Klein and Manning, 2003)
Named Entity Recognizer   OpenNLP (Baldridge, J., 2005); Stanford (Finkel et al., 2005)
NP Detector               in-house
Table 1: Preprocessing components available in Reconcile.
Each of the five steps above can invoke different components. Reconcile’s modularity makes it easy for new components to be implemented and existing ones to be removed or replaced. Reconcile’s standard distribution comes with a comprehensive set of implemented components; those available for steps 2–5 are shown in Table 2. Reconcile contains over 38,000 lines of original Java code. Only about 15% of the code is concerned with running existing components in the preprocessing step, while the rest deals with NP extraction, implementations of features, clustering algorithms and scorers. More details about Reconcile’s architecture and available components and features can be found in Stoyanov et al. (2010).
Step             Available modules
Classification   various learners in the Weka toolkit; libSVM (Chang and Lin, 2001); SVMlight (Joachims, 2002)
Clustering       Most Recent First
Scoring          MUC score (Vilain et al., 1995); B³ score (Bagga and Baldwin, 1998); CEAF score (Luo, 2005)
Table 2: Available implementations for different modules available in Reconcile.
Data Sets
Reconcile incorporates the six most commonly used coreference resolution data sets, two from the MUC conferences (MUC-6, 1995; MUC-7, 1997) and four from the ACE Program (NIST, 2004). For ACE, we incorporate only the newswire portion. When available, Reconcile employs the standard test/train split. Otherwise, we randomly split the data into a training and test set following a 70/30 ratio. Performance is evaluated according to the B³ and MUC scoring metrics.
The Reconcile2010 Configuration
Reconcile can be easily configured with different algorithms for markable detection, anaphoricity determination, feature extraction, etc., and run against several scoring metrics. For the purpose of this sample evaluation, we create only one particular instantiation of Reconcile, which we will call Reconcile2010 to differentiate it from the general platform. Reconcile2010 is configured using the following components:
1. Preprocessing
(a) Sentence Splitter: OpenNLP
(b) Tokenizer: OpenNLP
(c) POS Tagger: OpenNLP
(d) Parser: Berkeley
(e) Named Entity Recognizer: Stanford
2. Feature Set - A hand-selected subset of 60 out of the more than 80 features available. The features were selected to include most of the features from Soon et al. (2001), Ng and Cardie (2002) and Bengtson and Roth (2008).
3. Classifier - Averaged Perceptron
4. Clustering - Single-link - Positive decision threshold was tuned by cross validation of the training set.
Experimental Results
The first two rows of Table 3 show the performance of Reconcile2010. For all data sets, B³ scores are higher than MUC scores. The MUC score is highest for the MUC6 data set, while B³ scores are higher for the ACE data sets as compared to the MUC data sets.
Due to the difficulties outlined in Section 1, results for Reconcile presented here are directly comparable only to a limited number of scores reported in the literature. The bottom three rows of Table 3 list these comparable scores, which show that Reconcile2010 exhibits state-of-the-art performance for supervised learning-based coreference resolvers. A more detailed study of Reconcile-based coreference resolution systems in different evaluation scenarios can be found in Stoyanov et al. (2009).
Table 3: Scores for Reconcile on six data sets and scores for comparable coreference systems (Soon et al. (2001), Ng and Cardie (2002), Yang et al. (2003)).
Conclusions
Reconcile is a general architecture for coreference resolution that can be used to easily create various coreference resolvers. Reconcile provides broad support for experimentation in coreference resolution, including implementation of the basic architecture of contemporary state-of-the-art coreference systems and a variety of individual modules employed in these systems. Additionally, Reconcile handles all of the formatting and scoring peculiarities of the most widely used coreference resolution data sets (those created as part of the MUC and ACE conferences) and, thus, allows for easy implementation and evaluation across these data sets. We hope that Reconcile will support experimental research in coreference resolution and provide a state-of-the-art coreference resolver for both researchers and application developers. We believe that in this way Reconcile will facilitate meaningful and consistent comparisons of coreference resolution systems. The full Reconcile release is available for download at
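To make the clustering and scoring steps of the System Description concrete, the sketch below shows single-link clustering over pairwise coreference scores followed by B³ scoring. It is an illustration only and is not taken from Reconcile, which is written in Java: the function names, the mention representation and the toy scorer are hypothetical, and the B³ computation assumes that system and gold clusterings cover the same mention set.

```python
from itertools import combinations


def single_link_clusters(mentions, score, threshold):
    """Merge any two mentions whose pairwise score exceeds the threshold
    (single-link clustering, implemented with union-find)."""
    parent = {m: m for m in mentions}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    for a, b in combinations(mentions, 2):
        if score(a, b) > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for m in mentions:
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())


def b_cubed(system, gold):
    """B-cubed precision, recall and F1, averaged over mentions."""
    sys_of = {m: c for c in system for m in c}
    gold_of = {m: c for c in gold for m in c}
    mentions = list(gold_of)
    p = sum(len(sys_of[m] & gold_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    r = sum(len(sys_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def toy_score(a, b):
    """Hypothetical stand-in for a trained pairwise classifier: head-word match."""
    return 1.0 if a.split()[-1] == b.split()[-1] else 0.0
```

In a real resolver the scorer would be the trained classifier applied to the pair's feature vector, and the threshold would be tuned on held-out data, as in the Reconcile2010 configuration.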
Acknowledgments
This research was supported in part by the National Science Foundation under Grant # 0937060 to the Computing Research Association for the CIFellows Project, Lawrence Livermore National Laboratory subcontract B573245, Department of Homeland Security Grant N0014-07-1-0152, and Air Force Contract FA8750-09-C-0172 under the DARPA Machine Reading Program.
The authors would like to thank the anonymous reviewers for their useful comments.
References
A. Bagga and B. Baldwin. 1998. Algorithms for scoring coreference chains. In Linguistic Coreference Workshop at the Language Resources and Evaluation Conference.
Baldridge, J. The OpenNLP project.
E. Bengtson and D. Roth. 2008. Understanding the value of features for coreference resolution. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP).
CC Group. Sentence Segmentation Tool. cogcomp/atool.php?tkey=SS.
C. Chang and C. Lin. LIBSVM: a Library for Support Vector Machines. Available at
J. Finkel, T. Grenager, and C. Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL.
T. Finley and T. Joachims. 2005. Supervised clustering with support vector machines. In Proceedings of the Twenty-second International Conference on Machine Learning (ICML 2005).
A. Haghighi and D. Klein. 2007. Unsupervised Coreference Resolution in a Nonparametric Bayesian Model. In Proceedings of the 45th Annual Meeting of the ACL.
T. Joachims. 2002. SVMLight ,
D. Klein and C. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing (NIPS 2003).
S. Lappin and H. Leass. 1994. An algorithm for pronominal anaphora resolution. Computational Linguistics,
X. Luo, A. Ittycheriah, H. Jing, N. Kambhatla, and S. Roukos. 2004. A mention-synchronous coreference resolution algorithm based on the bell tree. In Proceedings of the 42nd Annual Meeting of the ACL.
X. Luo. 2005. On Coreference Resolution Performance Metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).
A. McCallum and B. Wellner. 2004. Conditional Models of Identity Uncertainty with Application to Noun Coreference. In Advances in Neural Information Processing (NIPS 2004).
MUC-6. 1995. Coreference Task Definition. In Proceedings of the Sixth Message Understanding Conference (MUC-6).
MUC-7. 1997. Coreference Task Definition. In Proceedings of the Seventh Message Understanding Conference
V. Ng and C. Cardie. 2002. Improving Machine Learning Approaches to Coreference Resolution. In Proceedings of the 40th Annual Meeting of the ACL.
NIST. 2004. The ACE Evaluation Plan. NIST.
S. Petrov and D. Klein. 2007. Improved Inference for Unlexicalized Parsing. In Proceedings of the Joint Meeting of the Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2007).
M. Poesio and M. Kabadjov. 2004. A general-purpose, off-the-shelf anaphora resolution module: implementation and preliminary evaluation. In Proceedings of the Language Resources and Evaluation Conference.
L. Qiu, M.-Y. Kan, and T.-S. Chua. 2004. A public reference implementation of the RAP anaphora resolution algorithm. In Proceedings of the Language Resources and Evaluation
W. Soon, H. Ng, and D. Lim. 2001. A Machine Learning Approach to Coreference of Noun Phrases. Computational Linguistics, 27(4):521–541.
V. Stoyanov, N. Gilbert, C. Cardie, and E. Riloff. 2009. Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of
V. Stoyanov, C. Cardie, N. Gilbert, E. Riloff, D. Buttler, and D. Hysom. 2010. Reconcile: A coreference resolution research platform. Technical report, Cornell University.
Y. Versley, S. Ponzetto, M. Poesio, V. Eidelman, A. Jern, J. Smith, X. Yang, and A. Moschitti. 2008. BART: A modular toolkit for coreference resolution. In Proceedings of the Language Resources and Evaluation Conference.
M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman. 1995. A Model-Theoretic Coreference Scoring Scheme. In Proceedings of the Sixth Message Understanding Conference (MUC-6).
X. Yang, G. Zhou, J. Su, and C. Tan. 2003. Coreference resolution using competition learning approach. In Proceedings of the 41st Annual Meeting of the ACL.
Predicate Argument Structure Analysis using Transformation-based Learning
Hirotoshi Taira
Sanae Fujita
Masaaki Nagata
NTT Communication Science Laboratories
2-4, Hikaridai, Seika-cho, Souraku-gun, Kyoto 619-0237, Japan
[email protected]
Abstract
Maintaining high annotation consistency in large corpora is crucial for statistical learning; however, such work is hard, especially for tasks containing semantic elements. This paper describes predicate argument structure analysis using transformation-based learning. An advantage of transformation-based learning is the readability of learned rules. A disadvantage is that the rule extraction procedure is time-consuming. We present incremental-based, transformation-based learning for semantic processing tasks. As an example, we deal with Japanese predicate argument analysis and show some tendencies of annotators for constructing a corpus with our method.
1 Introduction
Automatic predicate argument structure analysis (PAS) provides information of “who did what to whom” and is an important base tool for such various text processing tasks as machine translation, information extraction (Hirschman et al., 1999), question answering (Narayanan and Harabagiu, 2004; Shen and Lapata, 2007), and summarization (Melli et al., 2005). Most recent approaches to predicate argument structure analysis are statistical machine learning methods such as support vector machines (SVMs) (Pradhan et al., 2004). For predicate argument structure analysis, we have the following representative large corpora: FrameNet (Fillmore et al., 2001), PropBank (Palmer et al., 2005), and NomBank (Meyers et al., 2004) in English, the Chinese PropBank (Xue, 2008) in Chinese, the GDA Corpus (Hashida, 2005), Kyoto Text Corpus Ver.4.0 (Kawahara et al., 2002), and the NAIST Text Corpus (Iida et al., 2007) in Japanese.
The construction of such large corpora is strenuous and time-consuming. Additionally, maintaining high annotation consistency in such corpora is crucial for statistical learning; however, such work is hard, especially for tasks containing semantic elements. For example, in Japanese corpora, distinguishing true dative (or indirect object) arguments from time-type arguments is difficult because the arguments of both types are often accompanied by the ‘ni’ case marker.
A problem with such statistical learners as SVM is the lack of interpretability; if accuracy is low, we cannot identify the problems in the annotations.
We are focusing on transformation-based learning (TBL). An advantage of such learning methods is that we can easily interpret the learned model. The tasks in most previous research are such simple tagging tasks as part-of-speech tagging, insertion and deletion of parentheses in syntactic parsing, and chunking (Brill, 1995; Brill, 1993; Ramshaw and Marcus, 1995). Here we experiment with a complex task: Japanese PASs. TBL can be slow, so we proposed an incremental training method to speed up the training. We experimented with a Japanese PAS corpus with a graph-based TBL. From the experiments, we interrelated the annotation tendency on the dataset.
The rest of this paper is organized as follows. Section 2 describes Japanese predicate structure, our graph expression of it, and our improved method. The results of experiments using the NAIST Text Corpus, which is our target corpus, are reported in Section 3, and our conclusion is provided in Section 4.
2 Predicate argument structure and
graph transformation learning
First, we illustrate the structure of a Japanese sentence in Fig. 1. In Japanese, we can divide a sentence into bunsetsu phrases (BP). A BP usually
consists of one or more content words and zero,
Proceedings of the ACL 2010 Conference Short Papers, pages 162–167,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
Syntactic dependency between bunsetsus
Kare no
He ’s
mise de
tabe ta
okashi wa kinou
eat PAST snack TOP yesterday shop at
kat ta
buy PAST
Kareno tabeta okashiwa kinou misede katta.
The snack he ate is one I bought at the store yesterday.
BP: Bunsetsu phrase
CW: Content Word
FW: Functional Word
PRED: Predicate
ARG: Argument
Argument Types
Nom: Nominative
Acc: Accusative
Dat: Dative
Time: Time
Loc: Location
b) `Delete Pred Node’
c) `Add Edge’
a) `Add Pred Node’
d) `Delete Edge’
e) `Change Edge Label’
Figure 2: Transform types
paring the current graphs with the gold standard
graph structure in the training data, we find the different statuses of the nodes and edges among the
graphs. We extract such transformation rule candidates as ‘add node’ and ‘change edge label’ with
constraints, including ‘the corresponding BP includes a verb’ and ‘the argument candidate and the
predicate node have a syntactic dependency.’ The
extractions are executed based on the rule templates given in advance. Each extracted rule is
evaluated for the current graphs, and error reduction is calculated. The best rule for the reduction
is selected as a new rule and inserted at the bottom
of the current rule list. The new rule is applied to
the current graphs, which are transferred to other
graph structures. This procedure is iterated until
the total errors for the gold standard graphs become zero. When the process is completed, the
rule list is the final model. In the test phase, we iteratively transform nodes and edges in the graphs
mapped from the test data, based on rules in the
model like decision lists. The last graph after all
rule adaptations is the system output of the PAS.
Figure 1: Graph expression for PAS
one, or more than one functional words. Syntactic dependency between bunsetsu phrases can
be defined. Japanese dependency parsers such as
Cabocha (Kudo and Matsumoto, 2002) can extract
BPs and their dependencies with about 90% accuracy.
Since predicates and arguments in Japanese are
mainly annotated on the head content word in
each BP, we can treat BPs as candidates for
predicates or arguments. In our experiments, we
mapped each BP to an argument candidate node
of the graph. We also mapped each predicate to a
predicate node. Each predicate-argument relation
is identified by an edge between a predicate and an
argument, and the argument type is mapped to the
edge label. In our experiments below, we defined
five argument types: nominative (subjective), accusative (direct objective), dative (indirect objective), time, and location. We use five transformation types: a) add or b) delete a predicate node, c)
add or d) delete an edge between a predicate and
an argument node, and e) change a label (= an argument type) to another label (Fig. 2). We interpret
an edge labeled t between a predicate node and
an argument candidate node as meaning that the
predicate and the argument have a relationship of type t.
Transformation-based learning was proposed
by Brill (1995). Below we explain our learning strategy when we directly apply the learning
method to our graph expression of PASs. First, unstructured texts from the training data are input.
After pre-processing, each text is mapped to an
initial graph. In our experiments, the initial graph
has argument candidate nodes with corresponding
BPs and no predicate nodes or edges. Next, com
In this procedure, the calculation of error reduction is very time-consuming, because we have to
check many constraints from the candidate rules
for all training samples. The calculation order is
O(MN), where M is the number of articles and
N is the number of candidate rules. Additionally,
an edge rule usually has three types of constraints:
‘pred node constraint,’ ‘argument candidate node
constraint,’ and ‘relation constraint.’ The number of constraint combinations, and hence of extracted rules, is much
larger for edge rules than for node rules.
Ramshaw and Marcus (1994) proposed an efficient index-based
method for calculating error reduction. However,
in PAS tasks, we need to check the exclusiveness
of the argument types (for example, a predicate argument structure does not have two nominative arguments), and we cannot directly use the method.
Jijkoun and de Rijke (2007) used only candidate rules that occur in the current and gold standard graphs and
used SVM learning for constraint checks. This method is effective
for achieving high accuracy; however, it loses the
readability of the rules. This is contrary to our aim
of extracting readable rules.
Table 1: Data distribution (# of articles, # of sentences, # of predicates, # of arguments)
To reduce the calculations while maintaining
readability, we propose an incremental method
and describe its procedure below. In this procedure, we first build PAS graphs for only one article. After the total errors between the current and
gold standard graphs become zero in the article,
we proceed to the next article. For the next article,
we first apply the rules learned from the previous
article. After that, we extract new rules from the
two articles until the total errors for the articles become zero. We continue these processes until the
last article. Additionally, we count the number of
rule occurrences and only use the rule candidates
that occur more than once, because most rules that
occur only once harm the accuracy. We save the rarer
candidates and use them if their occurrence counts increase later.
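A minimal sketch of the incremental procedure, under simplifying assumptions: each article is reduced to a list of case markers with gold labels, a candidate rule is read directly off each current error, and occurrence counts are recomputed over the articles seen so far. All names (`incremental_learn`, `propose`, `min_count`) are hypothetical. Because rare candidates are filtered out, this toy stops when no usable candidate reduces the error rather than insisting on zero errors.

```python
from collections import Counter
from typing import List, Optional, Tuple

Article = Tuple[List[str], List[Optional[str]]]    # (case markers, gold labels)
Rule = Tuple[str, Optional[str], Optional[str]]    # (marker, old label, new label)


def run(rules: List[Rule], markers: List[str]) -> List[Optional[str]]:
    """Apply the rule list in order, starting from a state with no labels."""
    labels: List[Optional[str]] = [None] * len(markers)
    for marker, old, new in rules:
        labels = [new if m == marker and l == old else l for m, l in zip(markers, labels)]
    return labels


def total_errors(rules: List[Rule], articles: List[Article]) -> int:
    return sum(l != g for markers, gold in articles
               for l, g in zip(run(rules, markers), gold))


def propose(rules: List[Rule], articles: List[Article]) -> List[Rule]:
    """One candidate per current error: (marker, wrong label, gold label)."""
    return [(m, l, g) for markers, gold in articles
            for m, l, g in zip(markers, run(rules, markers), gold) if l != g]


def incremental_learn(articles: List[Article], min_count: int = 2) -> List[Rule]:
    rules: List[Rule] = []
    for n in range(1, len(articles) + 1):
        seen = articles[:n]              # previously learned rules are applied first
        while True:
            base = total_errors(rules, seen)
            counts = Counter(propose(rules, seen))
            usable = [r for r, c in counts.items() if c >= min_count]
            scored = [(total_errors(rules + [r], seen), r) for r in usable]
            if not scored or min(scored, key=lambda s: s[0])[0] >= base:
                break                    # nothing usable reduces the error
            rules.append(min(scored, key=lambda s: s[0])[1])
    return rules


articles = [(["ga", "wo"], ["NOM", "ACC"]),
            (["ga", "ni", "ga"], ["NOM", "DAT", "NOM"])]
print(incremental_learn(articles))
```

With the default `min_count=2`, only the rule for ‘ga’ is learned from these two toy articles, because the other candidates occur once; a candidate that is skipped early becomes usable as soon as later articles raise its count.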
Table 4: Total performances (F1-measure (%))
3.2 Results
Our incremental method takes an hour. In comparison, the original TBL cannot extract even one
rule in a day. The results of predicate and argument type predictions are shown in Table 4. Here,
‘Baseline’ is the baseline system that predicts the
BSs that contain verbs, adjectives, and da form
nouns (‘to be’ in English) as predicates and predicts argument types for BSs having a syntactic
dependency with a predicted predicate BS, based
on the following rules: 1) BSs containing nominative (ga) / accusative (wo) / dative (ni) case markers are predicted to be nominative, accusative, and
dative, respectively. 2) BSs containing a topic case
marker (wa) are predicted to be nominative. 3)
When the word sense category, from a Japanese ontology, of the head word in a BS belongs to a ‘time’
or ‘location’ category, the BS is predicted to be a
‘time’ or ‘location’ type argument, respectively. In precision, recall, and F1-measure alike, our system outperformed the baseline system.
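The three baseline rules for argument types can be written as a small lookup. This is a sketch only: the text does not say which rule wins when several apply (for example, a time expression marked with ‘ni’), so the ordering below (case markers first, then semantic category) is an assumption, and the function and table names are hypothetical.

```python
from typing import Optional

# Rules 1 and 2: case marker -> argument type.
CASE_MARKER_TO_TYPE = {"ga": "NOM", "wo": "ACC", "ni": "DAT", "wa": "NOM"}
# Rule 3: semantic category of the head word -> argument type.
SEM_CAT_TO_TYPE = {"time": "TIME", "location": "LOC"}


def baseline_argument_type(case_marker: Optional[str],
                           sem_cat: Optional[str],
                           depends_on_predicate: bool = True) -> Optional[str]:
    """Return the predicted argument type, or None when no rule fires."""
    if not depends_on_predicate:
        return None                      # only dependents of a predicted predicate are labeled
    if case_marker in CASE_MARKER_TO_TYPE:
        return CASE_MARKER_TO_TYPE[case_marker]   # assumed priority: case markers first
    return SEM_CAT_TO_TYPE.get(sem_cat)


print(baseline_argument_type("wa", None))
print(baseline_argument_type(None, "time"))
```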
Next, we show our system’s learning curve in
Fig. 3. The number of final rules was 68. The curve
indicates that the first twenty rules account for most of the performance. It also
shows that no overfitting happened. Next, we
show the performance for every argument type in
Table 5. ‘TBL,’ which stands for ‘transformation-based learning,’ is our system. In this table,
the performance of the dative and time types improved, even though they are difficult to distinguish. On the other hand, the performance of the
location type argument in our system is very low.
Our method learns rules so as to decrease the errors of
3.1 Experimental Settings
We used the articles in the NAIST Text Corpus version 1.4β (Iida et al., 2007) based on the
Mainichi Shinbun Corpus (Mainichi, 1995), which
were taken from news articles published in the
Japanese Mainichi Shinbun newspaper. We used
articles published on January 1st for training examples and on January 3rd for test examples.
Three original argument types are defined in the
NAIST Text Corpus: nominative (or subjective),
accusative (or direct object), and dative (or indirect object). To evaluate the difficult annotation cases, we also added annotations for the ‘time’
and ‘location’ types ourselves. We show the
dataset distribution in Table 1. As pre-processing, we extracted the
BP units and the dependencies among these BPs from
the dataset using Cabocha, a Japanese dependency
parser. After that, we applied
our incremental learning to the training data. We
used the constraint templates in Tables 2 and 3
for predicate nodes and edges, respectively, when extracting the
rule candidates.
Table 2: Predicate node constraint templates
Pred. Node Constraint Template
noun, verb, adjective, etc.
independent, attached word, etc.
pos1 & pos2 above two features combination
da form (copula)
word base form
Rule Example
Pred. Node Constraint
pos1=‘VERB’ & pos2=‘ANCILLARY WORD’
‘da form’
add pred node
del pred node
add pred node
add pred node
add pred node
Table 3: Edge constraint templates
Edge Constraint Template
Arg. Cand.
Pred. Node
FW (=func. ∗
dep(arg → pred)
dep(arg ← pred)
Rule Example
Edge Constraint
FW of Arg. =‘wa(TOP)’ & dep(arg → pred)
add NOM edge
add NOM edge
chg edge label
add NOM edge
dep(arg → pred)
FW of Pred. =‘na(ADNOMINAL)’ & dep(arg
← pred)
SemCat of Arg. = ‘TIME’ & dep(arg → pred)
passive form
dep(arg → pred)
FW of Arg. =‘ga(NOM) & Pred.: passive form
kform (= type
of inflected
Pred. SemCat
kform of Pred. = continuative ‘ta’ form
SemCat of Arg. = ‘HUMAN’ & Pred. SemCat
add NOM edge
all arguments, and the performance of the location
type argument is probably sacrificed for total error
reduction because the number of location type arguments is much smaller than the number of other
argument types (Table 1), and the improvement in
performance gained from learning for location type
arguments is relatively low. To confirm this, we
performed an experiment in which we gave the
rules of the baseline system to our system as initial
rules and subsequently performed our incremental learning. ‘Base + TBL’ shows the result of this experiment.
The performance for the location type argument
improved drastically. However, the total performance of the arguments was below the original
TBL. Moreover, the ‘Base + TBL’ performance
surpassed the baseline system. This indicates that
our system learned a reasonable model.
Finally, we show some interesting extracted
rules in Fig. 4. The first rule stands for an expression where the sentence ends with the performance of something, which is often seen in
Japanese newspaper articles. The second and third
rules represent that annotators of this dataset tend
to annotate the time type for arguments whose semantic category is time, even if the argument
looks like the dat. type, and tend to annotate the dat. type for arguments that have a dat.
type case marker.

Figure 3: Learning curves (x-axis: number of rules; y-axis: F1-measure (%))

Figure 4: Examples of extracted rules
Rule No.20: if BP contains the word ‘%’, Add Pred. Node. Example: 答えた 人は 87%で (answer-ed people-TOP 87%-be), ‘People who answered are 87%’
Rule No.15: if SemCat is ‘Time’, Change Edge Label Dat. → Time. Example: 7日に スタートする (7ka-ni staato-suru; 7th-DAT start-will), ‘will start on the 7th’
Rule No.16: if func. wd. is ‘DAT’ case, Change Edge Label (Time / Dat.)

Table 5: Results for every arg. type (F-measure); only the Time and Loc. columns are shown
              Time   Loc.
Baseline      51.5   38.0
TBL           59.6    1.7
Base + TBL    55.8   37.4

Eric Brill. 1993. Transformation-based error-driven
parsing. In Proc. of the Third International Workshop on Parsing Technologies.
Eric Brill. 1995. Transformation-based error-driven
learning and natural language processing: A case
study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.
Charles J. Fillmore, Charles Wooters, and Collin F.
Baker. 2001. Building a large lexical databank
which provides deep semantics. In Proc. of the Pacific Asian Conference on Language, Information
and Computation (PACLING).
We performed experiments for Japanese predicate
argument structure analysis using transformation-based learning and extracted rules that indicate the
tendencies annotators have. We presented an incremental procedure to speed up rule extraction.
The performance of PAS analysis improved, especially for the dative and time types, which are difficult
to distinguish. Moreover, when time expressions
are attached to the ‘ni’ case, the learned model
showed a tendency to annotate them as dative arguments in the corpus used. Our method has potential for dative predictions and for interpreting the
tendencies of annotator inconsistencies.
Kouichi Hashida. 2005. Global document annotation
(GDA) manual.
Lynette Hirschman, Patricia Robinson, Lisa
Hub-4 Event’99 general guidelines. projects/muc/.
Ryu Iida, Mamoru Komachi, Kentaro Inui, and Yuji
Matsumoto. 2007. Annotating a Japanese text corpus with predicate-argument and coreference relations. In Proc. of ACL 2007 Workshop on Linguistic
Annotation, pages 132–139.
Valentin Jijkoun and Maarten de Rijke. 2007. Learning to transform linguistic graphs. In Proc. of
the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing
(TextGraphs-2), pages 53–60. Association for Computational Linguistics.
We thank Kevin Duh for his valuable comments.
Daisuke Kawahara, Sadao Kurohashi, and Koichi
Construction of a Japanese
relevance-tagged corpus (in Japanese). Proc. of the
8th Annual Meeting of the Association for Natural
Language Processing, pages 495–498.
Taku Kudo and Yuji Matsumoto. 2002. Japanese
dependency analysis using cascaded chunking. In
Proc. of the 6th Conference on Natural Language
Learning 2002 (CoNLL 2002).
Mainichi. 1995. CD Mainichi Shinbun 94. Nichigai
Associates Co.
Gabor Melli, Yang Wang, Yudong Liu, Mehdi M.
Kashani, Zhongmin Shi, Baohua Gu, Anoop Sarkar,
and Fred Popowich.
Description of
SQUASH, the SFU question answering summary
handler for the DUC-2005 summarization task. In
Proc. of DUC 2005.
Adam Meyers, Ruth Reeves, Catherine Macleod,
Rachel Szekely, Veronika Zielinska, Brian Young,
and Ralph Grishman. 2004. The NomBank project:
An interim report. In Proc. of HLT-NAACL 2004
Workshop on Frontiers in Corpus Annotation.
Srini Narayanan and Sanda Harabagiu. 2004. Question answering based on semantic structures. In
Proc. of the 20th International Conference on Computational Linguistics (COLING).
M. Palmer, P. Kingsbury, and D. Gildea. 2005. The
proposition bank: An annotated corpus of semantic
roles. Computational Linguistics, 31(1):71–106.
Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James
Martin, and Dan Jurafsky. 2004. Shallow semantic
parsing using support vector machines. In Proc. of
the Human Language Technology Conference/North
American Chapter of the Association of Computational Linguistics HLT/NAACL 2004.
Lance Ramshaw and Mitchell Marcus. 1994. Exploring the statistical derivation of transformational rule
sequences for part-of-speech tagging. In The Balancing Act: Proc. of the ACL Workshop on Combining Symbolic and Statistical Approaches to Language.
Lance Ramshaw and Mitchell Marcus. 1995. Text
chunking using transformation-based learning. In
Proc. of the third workshop on very large corpora,
pages 82–94.
Dan Shen and Mirella Lapata. 2007. Using semantic roles to improve question answering. In
Proc. of the 2007 Joint Conference on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning
(EMNLP/CoNLL), pages 12–21.
Nianwen Xue. 2008. Labeling Chinese predicates
with semantic roles. Computational Linguistics,
Improving Chinese Semantic Role Labeling with Rich Syntactic Features
Weiwei Sun∗
Department of Computational Linguistics, Saarland University
German Research Center for Artificial Intelligence (DFKI)
D-66123, Saarbrücken, Germany
[email protected]
Developing features has been shown to be crucial to advancing the state-of-the-art in Semantic Role Labeling (SRL). To improve
Chinese SRL, we propose a set of additional features, some of which are designed to better capture structural information. Our system achieves 93.49 F-measure, a significant improvement over
the best reported performance, 92.0. We
are further concerned with the effect
of parsing in Chinese SRL. We empirically analyze the two-fold effect, grouping
words into constituents and providing syntactic information. We also give some preliminary linguistic explanations.
Previous work on Chinese Semantic Role Labeling (SRL) mainly focused on how to implement SRL methods which are successful on English. Similar to English, parsing is a standard
pre-processing step for Chinese SRL. Many features
are extracted to represent constituents in the input
parses (Sun and Jurafsky, 2004; Xue, 2008; Ding
and Chang, 2008). By using these features, semantic classifiers are trained to predict whether a
constituent fills a semantic role. Developing features that capture the right kind of information encoded in the input parses has been shown to be crucial
to advancing the state-of-the-art. Though there
has been some work on feature design in Chinese
SRL, information encoded in the syntactic trees is
not fully exploited and requires more research effort. In this paper, we propose a set of additional
features, some of which are designed to better capture structural information of sub-trees in a given
parse. With the help of these new features, our system achieves 93.49 F-measure with hand-crafted
parses. Comparison with the best reported result,
92.0 (Xue, 2008), shows that these features yield a
significant improvement of the state-of-the-art.
We further analyze the effect of syntactic parsing in Chinese SRL. The main effect of parsing
in SRL is two-fold. First, by grouping words into
constituents, parsing helps to find argument candidates. Second, parsers provide semantic classifiers
with plenty of syntactic information, not only to recognize arguments from all candidate constituents but
also to classify their detailed semantic types. We
empirically analyze each effect in turn. We also
give some preliminary linguistic explanations for
the phenomena.
Chinese SRL
The Chinese PropBank (CPB) is a semantic annotation for the syntactic trees of the Chinese TreeBank (CTB). The arguments of a predicate are labeled with a contiguous sequence of integers, in
the form of AN (N is a natural number); the adjuncts are annotated as such with the label AM
followed by a secondary tag that represents the semantic classification of the adjunct. The assignment of semantic roles is illustrated in Figure 1,
where the predicate is the verb “调查/investigate”.
E.g., the NP “事故原因/the cause of the accident”
is labeled as A1, meaning that it is the Patient.
In previous research, SRL methods that are successful on English are adopted to resolve Chinese
SRL (Sun and Jurafsky, 2004; Xue, 2008; Ding
and Chang, 2008, 2009; Sun et al., 2009; Sun,
2010). Xue (2008) produced complete and systematic research on full parsing based methods.
The work was partially completed while this author was
at Peking University.
Proceedings of the ACL 2010 Conference Short Papers, pages 168–172,
Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics
A majority of features used in our system are a
combination of features described in (Xue, 2008;
Ding and Chang, 2008) as well as the word formation and coarse frame features introduced in
(Sun et al., 2009) and the c-command thread features proposed in (Sun et al., 2008). We give
a brief description of features used in previous
work, but explain new features in detail. For
more information, readers can refer to the relevant
papers and our source code,² which is well commented. For convenience of illustration, we denote
a candidate constituent $c_k$ with a fixed context
$w_{i-1} [_{c_k} w_i \ldots w_h \ldots w_j] w_{j+1}$, where $w_h$ is the head
word of $c_k$, and denote the predicate in focus with
a context $w^v_{-2} w^v_{-1} w^v w^v_{+1} w^v_{+2}$, where $w^v$ is the
predicate in focus.
Figure 1: An example sentence: The police are
thoroughly investigating the cause of the accident.
Their method divided SRL into three sub-tasks: 1)
pruning with a heuristic rule, 2) Argument Identification (AI) to recognize arguments, and 3) Semantic Role Classification (SRC) to predict semantic types. The main two sub-tasks, AI and
SRC, are formulated as two classification problems. Ding and Chang (2008) divided SRC into
two sub-tasks in sequence: each argument is
first determined to be either a core argument or
an adjunct, and is then classified into fine-grained
categories. However, delicately designed features
are more important, and our experiments suggest
that by using rich features, a better SRC solver
can be directly trained without using a hierarchical
architecture. There are also some attempts at relaxing the necessity of using full syntactic parses,
and semantic chunking methods have been introduced (Sun et al., 2009; Sun, 2010; Ding and
Chang, 2009).
Baseline Features
The following features are introduced in previous
Chinese SRL systems. We use them as the baseline.
Word content of $w^v$, $w_h$, $w_i$, $w_j$ and $w_i$+$w_j$;
POS tag of $w^v$, $w_h$; subcategorization frame, verb
class of $w^v$; position, phrase type of $c_k$, path from $c_k$
to $w^v$ (from (Xue, 2008; Ding and Chang, 2008)).
First character, last character and word length
of $w^v$, first character+length, last character+word
length, first character+position, last character+position, coarse frame, frame+$w^v$, frame+left
character, frame+verb class, frame+$c_k$ (from (Sun
et al., 2009)).
Head word POS, head word of PP phrases, category of $c_k$’s left and right siblings, CFG rewrite
rule that expands $c_k$ and $c_k$’s parent (from (Ding
and Chang, 2008)).
Our System
We implement a three-stage (i.e. pruning, AI and
SRC) SRL system. In the pruning step, our system keeps all constituents (except punctuation)
that c-command¹ the current predicate in focus as argument candidates. In the AI step, many syntactic features are extracted to distinguish arguments
from non-arguments. In other words, a binary classifier is trained to classify each argument candidate
as either an argument or not. Finally, a multi-class
classifier is trained to label each argument recognized in the previous stage with a specific semantic
role label. In both AI and SRC, the main job is to
select strong syntactic features.
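The pruning step can be illustrated on a toy tree. The sketch below uses a common simplification, collecting the sisters of every node on the path from the predicate up to the root, which may differ from the exact c-command definition in (Sun et al., 2008). The `Node` class, the `candidates` function, and the tree shape for the example sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def __post_init__(self) -> None:
        for child in self.children:
            child.parent = self          # link children back to this node


def candidates(predicate: Node) -> List[Node]:
    """Sisters of each node on the path from the predicate to the root, minus punctuation."""
    found: List[Node] = []
    node = predicate
    while node.parent is not None:
        for sibling in node.parent.children:
            if sibling is not node and sibling.label != "PU":
                found.append(sibling)
        node = node.parent
    return found


# Assumed bracketing: (IP (NP) (VP (ADVP) (ADVP) (VP (VV) (NP (NN) (NN)))) (PU))
vv = Node("VV")
inner_vp = Node("VP", [vv, Node("NP", [Node("NN"), Node("NN")])])
root = Node("IP", [Node("NP"), Node("VP", [Node("ADVP"), Node("ADVP"), inner_vp]), Node("PU")])

print([c.label for c in candidates(vv)])
```

The surviving candidates would then be passed to the binary AI classifier, and the accepted ones to the multi-class SRC classifier.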
New Word Features
We introduce some new features which can be
extracted without syntactic structure. We denote
them as word features. They include:
Word content of $w^v_{-1}$, $w^v_{+1}$, $w_{i-1}$ and $w_{j+1}$;
POS tag of $w^v_{-1}$, $w^v_{+1}$, $w^v_{-2}$, $w^v_{+2}$, $w_{i-1}$, $w_i$, $w_j$,
$w_{j+1}$, $w_{i+2}$ and $w_{j-2}$.
Length of $c_k$: how many words there are in $c_k$.
Word before “LC”: If the POS of $w_j$ is “LC”
(localizer), we use $w_{j-1}$ and its POS tag as two
new features.
NT: Does $c_k$ contain a word with POS “NT”
(temporal noun)?
Combination features: $w_i$’s POS+$w_j$’s POS,
$w^v$+Position
¹ See (Sun et al., 2008) for a detailed definition.
et al., 2009; Sun, 2010). All parsing and SRL experiments use this data setting. To resolve classification problems, we use a linear SVM classifier, SVMlin³, along with the One-Vs-All approach for
multi-class classification. To evaluate SRL with
automatic parsing, we use a state-of-the-art parser,
the Bikel parser⁴ (Bikel, 2004). We use gold segmentation and POS as input to the Bikel parser and
use its parsing results as input to our SRL system.
The overall LP/LR/F performance of the Bikel parser
is 79.98%/82.95%/81.43%.
New Syntactic Features
Taking complex syntax trees as inputs, the classifiers should characterize their structural properties. We put forward a number of new features to
encode the structural information.
Category of $c_k$’s parent; head word and POS of
head word of parent, left sibling and right sibling
of $c_k$.
Lexicalized Rewrite rules: Conjunction of
rewrite rule and head word of its corresponding
RHS. These features of the candidate (lrw-c) and its
parent (lrw-p) are used. For example, the lrw-c feature of the NP “事故原因” in Figure 1 is
NP → NN + NN (原因).
Partial Path: Path from $c_k$ or $w^v$ to the lowest common ancestor of $c_k$ and $w^v$. One path feature, hence, is divided into a left path and a right path.
Clustered Path: We use the manually created
clusters (see (Sun and Sui, 2009)) of categories of
all nodes in the path (cpath) and right path.
C-commander thread between $c_k$ and $w^v$ (cct):
(proposed by (Sun et al., 2008)). For example, this
feature of the NP “警方” in Figure 1 is NP +
ADVP + ADVP + VV.
Head Trace: The sequential container of the
head down upon the phrase (from (Sun and Sui,
2009)). We design two kinds of traces (htr-p, htr-w): one uses the POS of the head word; the other uses
the head word itself. E.g., the head word of
事故原因 is “原因”; therefore these features of this
NP are NP↓NN and NP↓原因.
Combination features: verb class+$c_k$, $w_h$+$w^v$,
$w_h$+Position,
$w_h$+$w^v$+Position,
path+$w^v$,
$w_h$+right path, $w^v$+left path, frame+$w^v$+$w_h$,
and $w^v$+cct.
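As an illustration of the path and partial path features, the sketch below computes, for a candidate and a predicate, the two label sequences up to their lowest common ancestor and the full path built from them. The tree is given as a parent map with hypothetical node ids; the bracketing of the example sentence and the arrow notation for the full path are assumptions, and the text does not state which partial path is called "left" and which "right".

```python
from typing import Dict, List, Optional, Tuple


def ancestors(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """The node itself followed by its ancestors up to the root."""
    chain = [node]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain


def partial_paths(cand: str, pred: str, parent: Dict[str, Optional[str]],
                  label: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Labels from the candidate, and from the predicate, up to their lowest common ancestor."""
    up_c, up_p = ancestors(cand, parent), ancestors(pred, parent)
    lca = next(n for n in up_c if n in up_p)
    cand_part = up_c[: up_c.index(lca) + 1]
    pred_part = up_p[: up_p.index(lca) + 1]
    return [label[n] for n in cand_part], [label[n] for n in pred_part]


def full_path(cand: str, pred: str, parent: Dict[str, Optional[str]],
              label: Dict[str, str]) -> str:
    """Full path: up from the candidate to the common ancestor, then down to the predicate."""
    up, down = partial_paths(cand, pred, parent, label)
    return "↑".join(up) + "↓" + "↓".join(reversed(down[:-1]))


# Assumed tree: (IP0 (NP1) (VP2 (ADVP3) (ADVP4) (VP5 (VV6) (NP7))))
parent = {"IP0": None, "NP1": "IP0", "VP2": "IP0", "ADVP3": "VP2",
          "ADVP4": "VP2", "VP5": "VP2", "VV6": "VP5", "NP7": "VP5"}
label = {node: node.rstrip("0123456789") for node in parent}

print(full_path("NP1", "VV6", parent, label))
print(partial_paths("NP7", "VV6", parent, label))
```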
Overall Performance
Table 1 summarizes the precision, recall and F-measure of AI, SRC and the whole task (AI+SRC)
of our system respectively. The fourth line is
the best published SRC performance reported in
(Ding and Chang, 2008), and the sixth line is the
best SRL performance reported in (Xue, 2008).
Other lines show the performance of our system.
These results indicate a significant improvement
over previous systems due to the new features.
(Ding and Chang, 2008)
(Xue, 2008)
Table 1: SRL performance on the test data with
gold standard parses.
Two-fold Effect of Parsing in SRL
The effect of parsing in SRL is two-fold. On the
one hand, SRL systems should group words as argument candidates, which are also constituents in
a given sentence. Full parsing provides boundary information of all constituents. As arguments
should c-command the predicate, a full parser can
further prune a majority of useless constituents. In
other words, parsing can effectively supply SRL
with argument candidates. Unfortunately, it is
very hard to correctly produce full parses for Chinese text. On the other hand, given a constituent,
SRL systems should identify whether it is an argument and further predict detailed semantic types if
Experiments and Analysis
Experimental Setting
To facilitate comparison with previous work, we
use CPB 1.0 and CTB 5.0, the same data setting as (Xue, 2008). The data is divided into
three parts: files from 081 to 899 are used as the
training set; files from 041 to 080 as the development set; files from 001 to 040 and from 900 to 931
as the test set. Nearly all previous research on constituency-based SRL evaluation uses this setting,
including (Ding and Chang, 2008, 2009; Sun
The second block in Table 2 summarizes the SRC
performance with gold argument boundaries. Line
5 is the accuracy when word features are used;
Line 6 is the accuracy when additional syntactic
features are added; the last row is the accuracy
when the syntactic features used are extracted from
automatic parses (Bikel+Gold). We can see that,
unlike in AI, word features alone can train
reasonably good semantic classifiers. The comparison between Lines 5 and 7 suggests that, with
parsing errors, automatically parsed syntactic features
introduce noise into the semantic role classifiers.
Table 2: Classification performance on development data. In the Feat column, W means word
features; W+S means word and syntactic features.
it is an argument. For the two classification problems, parsing can provide complex syntactic information such as path features.
The Effect of Parsing in AI
In AI, full parsing is very important for both
grouping words and classification. Table 2 summarizes the relevant experimental results. Line 2 is the
AI performance when gold candidate boundaries
and word features are used; Line 3 is the performance with additional syntactic features. Line 4
shows the performance when using automatic parses
generated by the Bikel parser. We can see that: 1)
word features alone cannot train good classifiers to
identify arguments; 2) it is very easy to recognize
arguments with good enough syntactic parses; 3)
there is a severe performance decline when automatic parses are used. The third observation matches a
similar conclusion in English SRL. However, this
problem is much more serious in Chinese due to
the current state of the art in Chinese parsing.
Information theoretic criteria are popular criteria in variable selection (Guyon and Elisseeff, 2003). This paper uses empirical mutual
information between each variable and the target,
$I(X, Y) = \sum_{x \in X, y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$, to
roughly rank the importance of features. Table 3
shows the ten most useful features in AI. We can
see that the most important features are all based on
full parsing information. Nine of these top 10 useful features are our new features.
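The empirical mutual information used for this ranking can be computed directly from co-occurrence counts. The sketch below is a plain illustration of the formula, not the author's code; it uses the natural logarithm (the text does not state the base), and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical I(X, Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))     # counts of (feature value, target label) pairs
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())


# A feature that determines the label vs. one that is independent of it.
labels = ["ARG", "ARG", "NONE", "NONE"]
print(mutual_information(["a", "a", "b", "b"], labels))
print(mutual_information(["a", "b", "a", "b"], labels))
```

Features would then be sorted by this score, highest first, to obtain a ranking such as the one reported in the tables.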
wv cct
The Effect of Parsing in SRC
Why Word Features Are Effective for
‡ frame+w
h +w
h +w
† frame+w v
h +w +position
w +cct
† w +w
‡ w +Postion
Table 4: Top 10 useful features for SRC.
Table 4 shows the ten most useful features in
SRC. We can see that two of these ten features
are word features (denoted by †). That is, word
features play a more important role in SRC than
in AI. Though the other eight features are based
on full parsing, four of them (denoted by ‡) use
the head word, which can be well approximated
by word features, according to some language-specific properties. The head rules described in (Sun
and Jurafsky, 2004) are very popular in Chinese
parsing research, such as in (Duan et al., 2007;
Zhang and Clark, 2008). From these head rules,
we can see that head words of most phrases in
Chinese are located at the first or the last position.
We implement these rules on the Chinese TreeBank
and find that 84.12%⁵ of nodes realize their heads as
either their first or last word. Head position suggests that boundary words are a good approximation
of head word features. If head words are well
approximated by word features, then it is not strange
that the four features denoted by ‡ can be effectively represented by word features. Similar to the
feature effect in AI, most of the most useful features
in SRC are our new features.
wh +wv +Position
‡ w +w v
Table 3: Top 10 useful features for AI. ‡ means
word features.
⁵ This statistic excludes all empty categories in CTB.
Honglin Sun and Daniel Jurafsky. 2004. Shallow
semantic parsing of Chinese. In Daniel Marcu,
Susan Dumais and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings.
This paper proposes an additional set of features
to improve Chinese SRL. These new features yield
a significant improvement over the best published
performance. We further analyze the effect of
parsing in Chinese SRL, and linguistically explain
some phenomena. We found that (1) full syntactic
information plays an essential role only in AI and
that (2) due to the head word position distribution,
SRC is easy to resolve in Chinese SRL.
Weiwei Sun. 2010. Semantics-driven shallow
parsing for Chinese semantic role labeling. In
Proceedings of the ACL 2010.
Weiwei Sun and Zhifang Sui. 2009. Chinese function tag labeling. In Proceedings of the 23rd
Pacific Asia Conference on Language, Information and Computation. Hong Kong.
Weiwei Sun, Zhifang Sui, and Haifeng Wang.
2008. Prediction of maximal projection for semantic role labeling. In Proceedings of the
22nd International Conference on Computational Linguistics.
The author is funded both by German Academic
Exchange Service (DAAD) and German Research
Center for Artificial Intelligence (DFKI).
The author would like to thank the anonymous
reviewers for their helpful comments.
Weiwei Sun, Zhifang Sui, Meng Wang, and Xin
Wang. 2009. Chinese semantic role labeling
with shallow parsing. In Proceedings of the
2009 Conference on Empirical Methods in Natural Language Processing, pages 1475–1483.
Association for Computational Linguistics, Singapore.
Daniel M. Bikel. 2004. A distributional analysis
of a lexicalized statistical parsing model. In
Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 182–189. Association for Computational Linguistics, Barcelona,
Nianwen Xue. 2008. Labeling Chinese predicates with semantic roles. Comput. Linguist.,
Weiwei Ding and Baobao Chang. 2008. Improving Chinese semantic role classification with hierarchical feature selection strategy. In Proceedings of the EMNLP 2008, pages 324–
333. Association for Computational Linguistics, Honolulu, Hawaii.
Yue Zhang and Stephen Clark. 2008. A tale of two
parsers: Investigating and combining graphbased and transition-based dependency parsing.
In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571. Association for Computational Linguistics, Honolulu, Hawaii.
Weiwei Ding and Baobao Chang. 2009. Fast semantic role labeling for Chinese based on semantic chunking. In ICCPOL ’09: Proceedings of the 22nd International Conference on
Computer Processing of Oriental Languages.
Language Technology for the Knowledge-based Economy, pages 79–90. Springer-Verlag,
Berlin, Heidelberg.
Xiangyu Duan, Jun Zhao, and Bo Xu. 2007.
Probabilistic models for action-based Chinese
dependency parsing. In ECML ’07: Proceedings of the 18th European conference on
Machine Learning, pages 559–566. Springer-Verlag, Berlin, Heidelberg.
Isabelle Guyon and André Elisseeff. 2003. An
introduction to variable and feature selection. Journal of Machine Learning Research,
Balancing User Effort and Translation Error in Interactive Machine
Translation Via Confidence Measures
Jesús González-Rubio
Inst. Tec. de Informática
Univ. Politéc. de Valencia
46021 Valencia, Spain
[email protected]
Daniel Ortiz-Martínez
Dpto. de Sist Inf. y Comp.
Univ. Politéc. de Valencia
46021 Valencia, Spain
[email protected]
An implementation of the IMT framework was
performed in the TransType project (Foster et al.,
1997; Langlais et al., 2002) and further improved
within the TransType2 project (Esteban et al.,
2004; Barrachina et al., 2009).
IMT aims at reducing the effort and increasing the productivity of translators, while preserving high-quality translation. In this work, we integrate Confidence Measures (CMs) within the IMT
framework to further reduce the user effort. As
will be shown, our proposal makes it possible to balance the
ratio between user effort and final translation error.
This work deals with the application of
confidence measures within an interactive-predictive machine translation system in
order to reduce human effort. If a small
loss in translation quality can be tolerated
for the sake of efficiency, user effort can
be saved by interactively translating only
those initial translations which the confidence measure classifies as incorrect. We
apply confidence estimation as a way to
achieve a balance between user effort savings and final translation error. Empirical results show that our proposal makes it possible
to obtain almost perfect translations while
significantly reducing user effort.
1.1 Confidence Measures
Confidence estimation has been extensively studied for speech recognition. Only recently have researchers started to investigate CMs for MT (Gandrabur and Foster, 2003; Blatz et al., 2004; Ueffing
and Ney, 2007).
Different TransType-style MT systems use confidence information to improve translation prediction accuracy (Gandrabur and Foster, 2003; Ueffing and Ney, 2005). In this work, we propose a focus shift in which CMs are used to modify the interaction between the user and the system instead
of modifying the IMT translation predictions.
To compute CMs we have to select suitable confidence features and define a binary classifier. Typically, the classification is carried out depending
on whether the confidence value exceeds a given
threshold or not.
In Statistical Machine Translation (SMT), the translation is modelled as a decision process. For a given source string $f_1^J = f_1 \ldots f_j \ldots f_J$, we seek the target string $e_1^I = e_1 \ldots e_i \ldots e_I$ which maximises the posterior probability:
\[ \hat{e}_1^I = \operatorname*{argmax}_{I,\, e_1^I} Pr(e_1^I \mid f_1^J) . \]
Within the Interactive-predictive Machine
Translation (IMT) framework, a state-of-the-art
SMT system is employed in the following way:
For a given source sentence, the SMT system
fully automatically generates an initial translation.
A human translator checks this translation from
left to right, correcting the first error. The SMT
system then proposes a new extension, taking the
correct prefix $e_1^i = e_1 \ldots e_i$ into account. These steps are repeated until the whole input sentence has been correctly translated. In the resulting decision rule, we maximise over all possible extensions $e_{i+1}^I$ of $e_1^i$:
\[ \hat{e}_{i+1}^I = \operatorname*{argmax}_{I,\, e_{i+1}^I} Pr(e_{i+1}^I \mid e_1^i, f_1^J) . \]
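The prefix-constrained decision rule can be illustrated with a minimal sketch. It assumes the search space is a small explicit list of scored candidate translations; a real IMT system searches a word graph instead. Because every candidate consistent with the prefix shares that prefix, picking the candidate with the highest full-sentence score is equivalent to maximising over extensions. The function name, the candidate set and its scores are hypothetical.

```python
def best_extension(prefix, candidates):
    """Return the suffix of the most probable candidate that starts with `prefix`.

    `candidates` maps full translations (tuples of words) to scores
    proportional to Pr(e | f). Both are illustrative stand-ins for a decoder.
    """
    prefix = tuple(prefix)
    consistent = {e: p for e, p in candidates.items() if e[:len(prefix)] == prefix}
    if not consistent:
        return None  # a real system would extend its search instead of giving up
    best = max(consistent, key=consistent.get)
    return list(best[len(prefix):])


# Toy candidate set with made-up scores.
candidates = {
    ("the", "right", "to", "information"): 0.5,
    ("the", "right", "of", "access"): 0.3,
    ("a", "right", "of", "access"): 0.2,
}
initial = best_extension([], candidates)                         # fully automatic output
after_fix = best_extension(["the", "right", "of"], candidates)   # user corrected "to" -> "of"
```

With an empty prefix the rule reduces to ordinary SMT decoding; each user correction lengthens the prefix and narrows the set of admissible extensions.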
Francisco Casacuberta
Dpto. de Sist Inf. y Comp.
Univ. Politéc. de Valencia
46021 Valencia, Spain
[email protected]
2 IMT with Sentence CMs
In the conventional IMT scenario, a human translator and an SMT system collaborate in order to
obtain the translation the user has in mind. Once
the user has interactively translated the source sentences, the output translations are error-free. We
propose an alternative scenario where not all the
source sentences are interactively translated by the
user. Specifically, only those source sentences whose initial fully automatic translations are incorrect, according to some quality criterion, are interactively translated. We propose to use CMs as the quality criterion to classify those initial translations.
Proceedings of the ACL 2010 Conference Short Papers, pages 173–177,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
Our approach implies a modification of the
user-machine interaction protocol. For a given
source sentence, the SMT system generates an initial translation. Then, if the CM classifies this
translation as correct, we output it as our final
translation. On the contrary, if the initial translation is classified as incorrect, we perform a conventional IMT procedure, validating correct prefixes and generating new suffixes, until the sentence that the user has in mind is reached.
In our scenario, we allow the final translations
to be different from the ones the user has in mind.
This implies that the output may contain errors.
If a small loss in translation can be tolerated for
the sake of efficiency, user effort can be saved by
interactively translating only those sentences that
the CMs classify as incorrect.
It is worth noting that our proposal can be seen as a generalisation of the conventional IMT approach. By varying the value of the CM classification threshold, we can range from a fully automatic SMT system where all sentences are classified as correct to a conventional IMT system where all sentences are classified as incorrect.
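The modified interaction protocol can be sketched as a small routing function. This is an illustration under stated assumptions, not the authors' implementation: `smt_translate`, `confidence` and `interactive_translate` are hypothetical callables standing in for the SMT decoder, the sentence CM and the conventional IMT session, and confidence values are assumed to lie in (0, 1].

```python
def translate_with_cm(source, smt_translate, confidence, interactive_translate, tau_s):
    """Return (translation, was_interactive) for one source sentence."""
    initial = smt_translate(source)
    # Classified as correct: the automatic output becomes the final translation.
    if confidence(source, initial) > tau_s:
        return initial, False
    # Classified as incorrect: fall back to the conventional IMT procedure.
    return interactive_translate(source, initial), True


# Toy stand-ins (hypothetical): a fixed SMT output, a fixed confidence score,
# and a "user" who always reaches the intended translation.
smt = lambda src: "council conclusions on e-commerce"
conf = lambda src, hyp: 0.7
user = lambda src, hyp: "council conclusions on electronic commerce"

automatic = translate_with_cm("src", smt, conf, user, tau_s=0.5)
interactive = translate_with_cm("src", smt, conf, user, tau_s=0.9)
```

Under the stated assumption on the confidence range, a threshold of 0.0 accepts every automatic translation and a threshold of 1.0 sends every sentence to the user, which matches the two extremes described above.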
2.1 Selecting a CM for IMT
We compute sentence CMs by combining the scores given by a word CM based on the IBM model 1 (Brown et al., 1993), similar to the one described in (Blatz et al., 2004). We modified this word CM by replacing the average by the maximal lexicon probability, because the average is dominated by this maximum (Ueffing and Ney, 2005). We chose this word CM because it can be calculated very fast during search, which is crucial given the time constraints of IMT systems. Moreover, its performance is similar to that of other word CMs, as the results presented in (Blatz et al., 2003; Blatz et al., 2004) show. The word confidence value of word $e_i$, $c_w(e_i)$, is given by
\[ c_w(e_i) = \max_{0 \le j \le J} p(e_i \mid f_j) , \]
where $p(e_i \mid f_j)$ is the IBM model 1 lexicon probability, and $f_0$ is the empty source word.
From this word CM, we compute two sentence CMs which differ in the way the word confidence scores $c_w(e_i)$ are combined:
MEAN CM ($c_M(e_1^I)$) is computed as the geometric mean of the confidence scores of the words in the sentence:
\[ c_M(e_1^I) = \sqrt[I]{\prod_{i=1}^{I} c_w(e_i)} . \]
RATIO CM ($c_R(e_1^I)$) is computed as the percentage of words classified as correct in the sentence. A word is classified as correct if its confidence exceeds a word classification threshold $\tau_w$:
\[ c_R(e_1^I) = \frac{|\{e_i \,/\, c_w(e_i) > \tau_w\}|}{I} . \]
After computing the confidence value, each sentence is classified as either correct or incorrect, depending on whether or not its confidence value exceeds a sentence classification threshold $\tau_s$. If $\tau_s = 0.0$ then all the sentences will be classified as correct, whereas if $\tau_s = 1.0$ all the sentences will be classified as incorrect.
Table 1: Statistics of the Spanish–English EU corpora. K and M denote thousands and millions of elements respectively. [Columns: Spanish, English; rows report running words and trigram perplexity; cell values missing.]
3 Experimentation
The aim of the experimentation was to study the possible trade-off between saved user effort and translation error obtained when using sentence CMs within the IMT framework.
3.1 System evaluation
Figure 1: BLEU translation scores versus WSR for different values of the sentence classification threshold (τs) using the MEAN CM.
Figure 2: BLEU translation scores versus WSR for different values of the sentence classification threshold (τs) using the RATIO CM with τw = 0.4.
In this paper, we report our results as measured by Word Stroke Ratio (WSR) (Barrachina et al., 2009). WSR is used in the context of IMT to measure the effort required by the user to generate her translations. WSR is computed as the ratio between the number of word-strokes a user would need to achieve the translation she has in mind and the total number of words in the sentence. In this context, a word-stroke is interpreted as a single action, in which the user types a complete word, and is assumed to have constant cost.
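The WSR definition can be made concrete with a small simulation. The sketch below assumes a simulated user who always types the reference word at the first error, and a hypothetical `propose_suffix` callable that plays the role of the IMT system; trailing words beyond the reference are ignored for simplicity, and the reference is assumed to be non-empty.

```python
def word_stroke_ratio(reference, propose_suffix):
    """Simulate a left-to-right IMT session and return word-strokes / reference length."""
    prefix, strokes = [], 0
    while True:
        hypothesis = prefix + propose_suffix(prefix)
        # First position where the hypothesis disagrees with, or falls short of, the reference.
        error = next((i for i, word in enumerate(reference)
                      if i >= len(hypothesis) or hypothesis[i] != word), None)
        if error is None:
            break  # the hypothesis now covers the whole reference
        strokes += 1  # the user types one correct word
        prefix = reference[:error + 1]
    return strokes / len(reference)


# Toy "system" (hypothetical) that never adapts: it always proposes the tail
# of one fixed translation, whatever the validated prefix is.
fixed = "council conclusions on e-commerce and indirect taxation".split()
reference = "council conclusions on electronic commerce and indirect taxation".split()
wsr = word_stroke_ratio(reference, lambda prefix: fixed[len(prefix):])
```

A system whose first proposal already equals the reference yields a WSR of 0, while a system that never predicts the next word correctly yields a WSR of 1.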
Additionally, because our proposal allows differences between its output and the reference translation, we also present translation quality results in terms of BiLingual Evaluation Understudy (BLEU) (Papineni et al., 2002). BLEU computes a geometric mean of the precision of n-grams multiplied by a factor to penalise short sentences.
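As a rough illustration of this description, the following sketch computes a corpus-level BLEU score with a single reference per sentence, uniform weights over 1- to 4-grams and the usual brevity penalty. It is a simplified sketch, not the evaluation tool used in the experiments; the function names are illustrative, and the score is reported in the range [0, 1] rather than as BLEU points.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU for tokenised hypotheses with one reference each."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match, so the geometric mean is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: only output shorter than the reference is penalised.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return penalty * math.exp(log_precision)
```

For example, a hypothesis identical to its reference scores 1.0, and a hypothesis that is a correct but truncated prefix of the reference keeps perfect n-gram precision and is lowered only by the brevity penalty.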
3.2 Experimental Setup
Our experiments were carried out on the EU corpora (Barrachina et al., 2009). The EU corpora were extracted from the Bulletin of the European Union and are composed of sentences given in three different language pairs. Here, we focus on the Spanish–English part of the EU corpora. The corpus is divided into training, development and test sets. The main figures of the corpus can be seen in Table 1.
As a first step, we built an SMT system to translate from Spanish into English. This was done by means of the Thot toolkit (Ortiz et al., 2005), which is a complete system for building phrase-based SMT models. This toolkit involves the estimation, from the training set, of different statistical models, which are in turn combined in a log-linear fashion by adjusting a weight for each of them by means of the MERT (Och, 2003) procedure, optimising the BLEU score on the development set.
The IMT system which we have implemented relies on the use of word graphs (Ueffing et al., 2002) to efficiently compute the suffix for a given prefix. A word graph has to be generated for each sentence to be interactively translated. For this purpose, we used a multi-stack phrase-based decoder which will be distributed in the near future together with the Thot toolkit. We decided not to use the state-of-the-art Moses toolkit (Koehn et al., 2007) because preliminary experiments performed with it revealed that the decoder by Ortiz-Martínez et al. (2005) performs better in terms of WSR when used to generate word graphs for their use in IMT (Sanchis-Trilles et al., 2008). Moreover, the performance difference in regular SMT is negligible. The decoder was set to only consider monotonic translation, since in real IMT scenarios considering non-monotonic translation leads to excessive response time for the user.
Finally, the obtained word graphs were used within the IMT procedure to produce the reference translations in the test set, measuring WSR and BLEU.
3.3 Results
We carried out a series of experiments varying the
value of the sentence classification threshold τs
between 0.0 (equivalent to a fully automatic SMT
system) and 1.0 (equivalent to a conventional IMT
system), for both the MEAN and RATIO CMs.
For each threshold value, we calculated the effort
of the user in terms of WSR, and the translation
quality of the final output as measured by BLEU.
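The two sentence CMs and the threshold sweep can be sketched as follows. The word confidence values are made-up toy numbers rather than IBM model 1 probabilities, and the helper `fraction_interactive` is a hypothetical name; the sketch only shows how the share of sentences routed to the user changes with τs, not how WSR or BLEU were measured.

```python
import math


def mean_cm(word_conf):
    """MEAN CM: geometric mean of the word confidence scores."""
    if min(word_conf) <= 0.0:
        return 0.0
    return math.exp(sum(math.log(c) for c in word_conf) / len(word_conf))


def ratio_cm(word_conf, tau_w):
    """RATIO CM: fraction of words whose confidence exceeds tau_w."""
    return sum(c > tau_w for c in word_conf) / len(word_conf)


def fraction_interactive(sentence_conf, tau_s):
    """Share of sentences classified as incorrect (confidence does not exceed tau_s)."""
    return sum(c <= tau_s for c in sentence_conf) / len(sentence_conf)


# Toy corpus: one list of word confidences per sentence (illustrative values only).
corpus = [[0.9, 0.8, 0.7], [0.5, 0.2, 0.9], [0.3, 0.1, 0.5]]
ratio_scores = [ratio_cm(sentence, tau_w=0.4) for sentence in corpus]
sweep = [(t / 10, fraction_interactive(ratio_scores, t / 10)) for t in range(11)]
```

Each entry of `sweep` pairs a threshold with the fraction of sentences that would be translated interactively; in the experiments, every such operating point corresponds to one WSR value and one BLEU value.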
src-1 DECLARACIÓN (No 17) relativa al derecho de acceso a la información
ref-1 DECLARATION (No 17) on the right of access to information
tra-1 DECLARATION (No 17) on the right of access to information
src-2 Conclusiones del Consejo sobre el comercio electrónico y los impuestos indirectos.
ref-2 Council conclusions on electronic commerce and indirect taxation.
tra-2 Council conclusions on e-commerce and indirect taxation.
src-3 participación de los países candidatos en los programas comunitarios.
ref-3 participation of the applicant countries in Community programmes.
tra-3 countries’ involvement in Community programmes.
Example 1: Examples of initial fully automatically generated sentences classified as correct by the CMs.
Figure 1 shows WSR (WSR IMT-CM) and BLEU (BLEU IMT-CM) scores obtained by varying τs for the MEAN CM. Additionally, we also show the BLEU score (BLEU SMT) obtained by a fully automatic SMT system as translation quality baseline, and the WSR score (WSR IMT) obtained by a conventional IMT system as user effort baseline. This figure shows a continuous transition between the fully automatic SMT system and the conventional IMT system. This transition occurs when ranging τs between 0.0 and 0.6. This is an undesired effect, since for almost half of the possible values of τs there is no change in the behaviour of our proposed IMT system.
The RATIO CM confidence values depend on a word classification threshold τw. We have carried out experimentation ranging τw between 0.0 and 1.0 and found that this value can be used to solve the above-mentioned undesired effect for the MEAN CM. Specifically, by varying the value of τw we can stretch the interval in which the transition between the fully automatic SMT system and the conventional IMT system is produced, allowing us to obtain smoother transitions. Figure 2 shows WSR and BLEU scores for different values of the sentence classification threshold τs using τw = 0.4. We show results only for this value of τw due to paper space limitations and because τw = 0.4 produced the smoothest transition. According to Figure 2, using a sentence classification threshold value of 0.6 we obtain a relative WSR reduction of 20% and an almost perfect translation quality of 87 BLEU points.
It is worth noting that the final translations are compared with only one reference; therefore, the reported translation quality scores are clearly pessimistic. Better results are expected using a multi-reference corpus. Example 1 shows the source sentence (src), the reference translation (ref) and the final translation (tra) for three of the initial fully automatically generated translations that were classified as correct by our CMs, and thus were not interactively translated by the user. The first translation (tra-1) is identical to the corresponding reference translation (ref-1). The second translation (tra-2) corresponds to a correct translation of the source sentence (src-2) that is different from the corresponding reference (ref-2). Finally, the third translation (tra-3) is an example of a slightly incorrect translation.
4 Concluding Remarks
In this paper, we have presented a novel proposal
that introduces sentence CMs into an IMT system
to reduce user effort. Our proposal entails a modification of the user-machine interaction protocol that makes it possible to achieve a balance between the user effort and the final translation error.
We have carried out experimentation using two different sentence CMs. By varying the value of the sentence classification threshold, we can range from a fully automatic SMT system to a conventional IMT system. Empirical results show that our proposal makes it possible to obtain almost perfect translations while significantly reducing user effort.
Future research aims at the investigation of improved CMs to be integrated in our IMT system.
Work supported by the EC (FEDER/FSE) and
the Spanish MEC/MICINN under the MIPRCV
“Consolider Ingenio 2010” program (CSD2007-00018), the iTransDoc (TIN2006-15694-CO2-01)
and iTrans2 (TIN2009-14511) projects and the
FPU scholarship AP2006-00691. Also supported
by the Spanish MITyC under the
(TSI-020110-2009-439) project and by the Generalitat Valenciana under grant Prometeo/2009/014.
S. Barrachina, O. Bender, F. Casacuberta, J. Civera, E. Cubel, S. Khadivi, A. Lagarda, H. Ney, J. Tomás, and E. Vidal. 2009. Statistical approaches to computer-assisted translation. Computational Linguistics, 35(1):3–28.
J. Blatz, E. Fitzgerald, G. Foster, S. Gandrabur, C. Goutte, A. Kulesza, A. Sanchis, and N. Ueffing. 2003. Confidence estimation for machine translation.
J. Blatz, E. Fitzgerald, G. Foster, S. Gandrabur, C. Goutte, A. Kulesza, A. Sanchis, and N. Ueffing. 2004. Confidence estimation for machine translation. In Proc. COLING, page 315.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.
J. Esteban, J. Lorenzo, A. Valderrábanos, and G. Lapalme. 2004. Transtype2: an innovative computer-assisted translation system. In Proc. ACL, page 1.
G. Foster, P. Isabelle, and P. Plamondon. 1997. Target-text mediated interactive machine translation. Machine Translation, 12:12–175.
S. Gandrabur and G. Foster. 2003. Confidence estimation for text prediction. In Proc. CoNLL, pages
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. ACL, pages 177–180.
P. Langlais, G. Lapalme, and M. Loranger. 2002. Transtype: Development-evaluation cycles to boost translator’s productivity. Machine Translation,
F. J. Och. 2003. Minimum error rate training in statistical machine translation. In Proc. ACL, pages 160–
D. Ortiz, I. García-Varea, and F. Casacuberta. 2005. Thot: a toolkit to train phrase-based statistical translation models. In Proc. MT Summit, pages 141–148.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: a method for automatic evaluation of MT. In Proc. ACL, pages 311–318.
G. Sanchis-Trilles, D. Ortiz-Martínez, J. Civera, F. Casacuberta, E. Vidal, and H. Hoang. 2008. Improving interactive machine translation via mouse actions. In Proc. EMNLP, pages 25–27.
N. Ueffing and H. Ney. 2005. Application of word-level confidence measures in interactive statistical machine translation. In Proc. EAMT, pages 262–
N. Ueffing and H. Ney. 2007. Word-level confidence estimation for machine translation. Comput. Linguist., 33(1):9–40.
N. Ueffing, F.J. Och, and H. Ney. 2002. Generation of word graphs in statistical machine translation. In Proc. EMNLP, pages 156–163.
Improving Arabic-to-English Statistical Machine Translation
by Reordering Post-verbal Subjects for Alignment
Marine Carpuat Yuval Marton Nizar Habash
Columbia University
Center for Computational Learning Systems
475 Riverside Drive, New York, NY 10115
Proceedings of the ACL 2010 Conference Short Papers, pages 178–183,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.
Modern Standard Arabic (MSA) is a morphosyntactically complex language, with different phenomena from English, a fact that raises many interesting issues for natural language processing and Arabic-to-English statistical machine translation (SMT). While comprehensive Arabic preprocessing schemes have been widely adopted for handling Arabic morphology in SMT (e.g., Sadat and Habash (2006), Zollmann et al. (2006), Lee (2004)), syntactic issues have not received as much attention by comparison (Green et al. (2009), Crego and Habash (2008), Habash (2007)). Arabic verbal constructions are particularly challenging since subjects can occur in pre-verbal (SV), post-verbal (VS) or pro-dropped (“null subject”) constructions. As a result, training data for learning verbal construction translations is split between the different constructions and their patterns; and complex reordering schemas are needed in order to translate them into primarily pre-verbal subject languages (SVO) such as English.
These issues are particularly problematic in phrase-based SMT (Koehn et al., 2003). Standard phrase-based SMT systems memorize phrasal translations of verb and subject constructions as observed in the training bitext. They do not capture any generalizations between occurrences in VS and SV orders, even for the same verbs. In addition, their distance-based reordering models are not well suited to handling complex reordering operations which can include long distance dependencies, and may vary by context. Despite these limitations, phrase-based SMT systems have achieved competitive results in Arabic-to-English benchmark evaluations. However, error analysis shows that verbs are still often dropped or incorrectly translated, and subjects are split or garbled in translation. This suggests that better syntactic modeling should further improve SMT.
We attempt to get a better understanding of translation patterns for Arabic verb constructions, particularly VS constructions, by studying their occurrence and reordering patterns in a hand-aligned Arabic-English parallel treebank. Our analysis shows that VS reordering rules are not straightforward and that SMT should therefore benefit from direct modeling of Arabic verb subject translation. In order to detect VS constructions, we use our state-of-the-art Arabic dependency parser, which is essentially the CATIBEX baseline in our subsequent parsing work in Marton et al. (2010), and is further described there. We show that VS subjects and their exact boundaries are hard to identify accurately. Given the noise in VS detection, existing strategies for source-side reordering (e.g., Xia and McCord (2004), Collins et al. (2005), Wang et al. (2007)) or using dependency parses as cohesion constraints in decoding (e.g., Cherry (2008); Bach et al. (2009)) are not effective at this stage. While these approaches have been successful for language pairs such as German-English for which syntactic parsers are more developed and relevant reordering patterns might be less ambiguous, their impact potential on Arabic-English translation is still unclear.
In this work, we focus on VS constructions only, and propose a new strategy in order to benefit from their noisy detection: for the word alignment stage only, we reorder phrases detected as VS constructions into an SV order. Then, for phrase extraction, weight optimization and decoding, we use the original (non-reordered) text. This approach significantly improves both BLEU and TER on top of strong medium and large-scale phrase-based SMT baselines.
VS reordering in gold Arabic-English translation
We use the manually word-aligned parallel Arabic-English Treebank (LDC2009E82) to study how Arabic VS constructions are translated into English by humans. Given the gold Arabic syntactic parses and the manual Arabic-English word alignments, we can determine the gold reorderings for SV and VS constructions. We extract VS representations from the gold constituent parses by deterministic conversion to a simplified dependency structure, CATiB (Habash and Roth, 2009) (see Section 3). We then check whether the English translations of the Arabic verb and the Arabic subject occur in the same order as in Arabic (monotone) or not (inverted). Table 1 summarizes the reordering patterns for each category. As expected, 98% of Arabic SV are translated in a monotone order in English. For VS constructions, the picture is surprisingly more complex. The monotone VS translations are mostly explained by changes to passive voice or to non-verbal constructions (such as nominalization) in the English translation.
In addition, Table 1 shows that verb subjects occur more frequently in VS order (70%) than in SV order (30%). These numbers do not include pro-dropped (“null subject”) constructions.
Table 1: How are Arabic SV and VS translated in the manually word-aligned Arabic-English parallel treebank? We check whether V and S are translated in a “monotone” or “inverted” order for all VS and SV constructions. “Overlap” represents instances where translations of the Arabic verb and subject have some English words in common, and are neither monotone nor inverted.
gold reordering    all verbs    %
SV monotone        2588         98.2
SV inverted
SV overlap
SV total           2638         100
VS monotone        1700         27.3
VS inverted        4033         64.7
VS overlap
VS total           6235         100
Arabic VS construction detection
Even if the SMT system had perfect knowledge of VS reordering, it has to accurately detect VS constructions and their spans in order to apply the reordering correctly. For that purpose, we use our state-of-the-art parsing model, which is essentially the CATIBEX baseline model in Marton et al. (2010), and whose details we summarize next. We train a syntactic dependency parser, MaltParser v1.3 with the Nivre “eager” algorithm (Nivre, 2003; Nivre et al., 2006; Nivre, 2008) on the training portion of the Penn Arabic Treebank part 3 v3.1, hereafter PATB3 (Maamouri et al., 2008; Maamouri et al., 2009). The training / development split is the same as in Zitouni et al. (2006). We convert the PATB3 representation into the succinct CATiB format, with 8 dependency relations and 6 POS tags, which we then extend to a set of 44 tags using regular expressions of the basic POS and the normalized surface word form, similarly to Marton et al. (2010), following Habash and Roth (2009). We normalize Alif Maqsura to Ya, and Hamzated Alifs to bare Alif, as is commonly done in Arabic SMT.
For analysis purposes, we evaluate our subject and verb detection on the development part of PATB3 using gold POS tags. There are various ways to go about it. We argue that combined detection statistics of constructions of verbs and their subjects (VATS), for which we achieve an F-score of 74%, are more telling for the task at hand.² These scores take into account the spans of both the subject and the specific verb it belongs to, and potentially reorders with. We also provide statistics of VS detection separately (F-score 63%), since we only handle VS here. This low score can be explained by the difficulty in detecting the post-verbal subject’s end boundary, and the correct verb the subject belongs to. The SV construction scores are higher, presumably since the pre-verbal subject’s end is bounded by the verb it belongs to. See Table 2.
² We diverge from the CATiB representation in that a non-matrix subject of a pseudo verb (An and her sisters) is treated as a subject of the verb that is under the same pseudo verb. This treatment of said subjects is comparable to the PATB’s.
Although not directly comparable, our VS scores are similar to those of Green et al. (2009). Their VS detection technique with conditional random fields (CRF) is different from ours in bypassing full syntactic parsing, and in only detecting maximal (non-nested) subjects of verb-initial clauses. Additionally, they use a different training / test split of the PATB data (parts 1, 2 and 3). They report 65.9% precision and 61.3% F-score. Note that a closer score comparison should take into account their reported verb detection accuracy of 98.1%.
Table 2: Precision, Recall and F-scores for constructions of Arabic verbs and their subjects, evaluated on our development part of PATB3. [Rows: VATS (verbs & their subj.); VNS (verbs w/ null subj.); verbal subj. exc. null subj.; verbal subj. inc. null subj.; verbs with non-null subj.; SV or VS. Scores missing.]
Reordering Arabic VS for SMT word alignment
Based on these analyses, we propose a new method to help phrase-based SMT systems deal with Arabic-English word order differences due to VS constructions. As in related work on syntactic reordering by preprocessing, our method attempts to make Arabic and English word order closer to each other by reordering Arabic VS constructions into SV. However, unlike in previous work, the reordered Arabic sentences are used only for word alignment. Phrase translation extraction and decoding are performed on the original Arabic word order. Preliminary experiments on an earlier version of the large-scale SMT system described in Section 6 showed that forcing reordering of all VS constructions at training and test time does not have a consistent impact on translation quality: for instance, on the NIST MT08-NW test set, TER slightly improved from 44.34 to 44.03, while BLEU score decreased from 49.21 to 49.09.
Limiting reordering to alignment allows the system to be more robust and recover from incorrect changes introduced either by incorrect VS detection, or by incorrect reordering of a correctly detected VS. Given a parallel sentence (a, e), we proceed as follows:
1. automatically tag VS constructions in a
2. generate new sentence a′ = reorder(a) by reordering Arabic VS into SV
3. get word alignment wa′ on new sentence pair (a′, e)
4. using mapping from a to a′, get corresponding word alignment wa = unreorder(wa′) for the original sentence pair (a, e)
Experiment set-up
We use the open-source Moses toolkit (Koehn et al., 2007) to build two phrase-based SMT systems trained on two different data conditions:
• medium-scale: the bitext consists of 12M words on the Arabic side (LDC2007E103). The language model is trained on the English side of the large bitext.
• large-scale: the bitext consists of several newswire LDC corpora, and has 64M words on the Arabic side. The language model is trained on the English side of the bitext augmented with Gigaword data.
Except for this difference in training data, the two systems are identical. They use a standard phrase-based architecture. The parallel corpus is word-aligned using GIZA++ (Och and Ney, 2003), which sequentially learns word alignments for the IBM1, HMM, IBM3 and IBM4 models. The resulting alignments in both translation directions are intersected and augmented using the grow-diag-final-and heuristic (Koehn et al., 2007). Phrase translations of up to 10 words are extracted in the Moses phrase-table. We apply statistical significance tests to prune unreliable phrase-pairs and score remaining phrase-table entries (Chen et al., 2009). We use a 5-gram language model with modified Kneser-Ney smoothing. Feature weights are tuned to maximize BLEU on the NIST MT06 test set.
For all systems, the English data is tokenized using simple punctuation-based rules. The Arabic side is segmented according to the Arabic Treebank (PATB3) tokenization scheme (Maamouri et al., 2009) using the MADA+TOKAN morphological analyzer and tokenizer (Habash and Rambow, 2005). MADA-produced Arabic lemmas are used for word alignment.
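The four-step alignment-only reordering procedure (tag, reorder, align, unreorder) can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: VS constructions are given as non-overlapping token spans with the subject directly following the verb, the word aligner of step 3 is external and its output is simply passed in, and the names `reorder_vs` and `unreorder` are illustrative.

```python
def reorder_vs(tokens, vs_spans):
    """Move each detected subject in front of its verb (step 2).

    `vs_spans` holds hypothetical (verb_start, verb_end, subj_end) offsets, with the
    verb at tokens[verb_start:verb_end] and the subject at tokens[verb_end:subj_end].
    Returns the reordered tokens and, for each new position, the original index.
    """
    order = list(range(len(tokens)))
    for verb_start, verb_end, subj_end in vs_spans:
        order[verb_start:subj_end] = (list(range(verb_end, subj_end))
                                      + list(range(verb_start, verb_end)))
    return [tokens[i] for i in order], order


def unreorder(alignment, order):
    """Map links (reordered_source_pos, target_pos) back to original source positions (step 4)."""
    return sorted((order[src], tgt) for src, tgt in alignment)


# Toy sentence with placeholder tokens: one verb followed by a two-word subject.
arabic = ["V", "S1", "S2", "O"]
reordered, order = reorder_vs(arabic, [(0, 1, 3)])
# Step 3 would run a word aligner on (reordered, english); here a monotone
# alignment is assumed as its output.
links_on_reordered = [(0, 0), (1, 1), (2, 2), (3, 3)]
links_on_original = unreorder(links_on_reordered, order)
```

Because only the alignment links are mapped back, phrase extraction and decoding still see the original Arabic word order, so a wrongly detected span affects at most the links of that sentence.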
We evaluate translation quality using both BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) scores on three standard evaluation test sets from the NIST evaluations, which yield more than 4400 test sentences with 4 reference translations. On this large data set, our VS reordering method remarkably yields statistically significant improvements in BLEU and TER on the medium and large SMT systems at the 99% confidence level (Table 3).
Table 3: Evaluation on all test sets: on the total of 4432 test sentences, improvements are statistically significant at the 99% level using bootstrap resampling (Koehn, 2004). [Baseline scores missing.]
system             BLEU r4n4 (%)    TER (%)
medium baseline
+ VS reordering    44.65 (+0.30)    47.78 (-0.56)
large baseline
+ VS reordering    51.70 (+0.25)    42.21 (-0.24)
Results per test set are reported in Table 4. TER scores are improved in all 10 test configurations, and BLEU scores are improved in 8 out of the 10 configurations. Results on the MT08 test set show that improvements are obtained both on newswire and on web text as measured by TER (but not by BLEU score on the web section). It is worth noting that consistent improvements are obtained even on the large-scale system, and that both baselines are full-fledged systems, which include lexicalized reordering and large 5-gram language models.
Analysis shows that our VS reordering technique improves word alignment coverage (yielding 48k and 330k additional links on the medium and large scale systems respectively). This results in larger phrase-tables which improve translation quality.
Related work
To the best of our knowledge, the only other approach to detecting and using Arabic verb-subject constructions for SMT is that of Green et al. (2009) (see Section 3), which failed to improve Arabic-English SMT. In contrast with our reordering approach, they integrate subject span information as a log-linear model feature which encourages a phrase-based SMT decoder to use phrasal translations that do not break subject boundaries.
Syntactically motivated reordering for phrase-based SMT has been more successful on language pairs other than Arabic-English, perhaps due to more accurate parsers and less ambiguous reordering patterns than for Arabic VS. For instance, Collins et al. (2005) apply six manually defined transformations to German parse trees which improve German-English translation by 0.4 BLEU on the Europarl task. Xia and McCord (2004) learn reordering rules for French to English translations, which arguably presents less syntactic distortion than Arabic-English. Zhang et al. (2007) limit reordering to decoding for Chinese-English SMT using a lattice representation. Cherry (2008) uses dependency parses as cohesion constraints in decoding for French-English SMT.
For Arabic-English phrase-based SMT, the impact of syntactic reordering as preprocessing is less clear. Habash (2007) proposes to learn syntactic reordering rules targeting Arabic-English word order differences and integrates them as deterministic preprocessing. He reports improvements in BLEU compared to phrase-based SMT limited to monotonic decoding, but these improvements do not hold with distortion. Instead of applying reordering rules deterministically, Crego and Habash (2008) use a lattice input to represent alternate word orders, which improves an n-gram-based SMT system. But they do not model VS constructions explicitly.
Most previous syntax-aware word alignment models were specifically designed for syntax-based SMT systems. These models are often bootstrapped from existing word alignments, and could therefore benefit from our VS reordering approach. For instance, Fossum et al. (2008) report improvements ranging from 0.1 to 0.5 BLEU on Arabic translation by learning to delete alignment
Table 4: VS reordering improves BLEU and TER scores in almost all test conditions on 5 test sets, 2
metrics, and 2 MT systems
test set
medium baseline
+ VS reordering
large baseline
+ VS reordering
46.33 (+0.38)
52.63 (+0.33)
test set
medium baseline
+ VS reordering
large baseline
+ VS reordering
48.31 (-0.46)
42.95 (-0.38)
BLEU r4n4 (%)
45.03 (+0.09) 48.69 (+0.64)
52.34 (-0.11) 55.29 (+0.63)
TER (%)
46.10 (-0.35) 44.29 (-0.71)
40.40 (-0.02) 38.75 (-0.40)
45.06 (+0.20)
52.85 (+0.25)
31.96 (-0.09)
39.87 (+0.65)
47.11 (-0.63)
41.51 (-0.30)
57.30 (-0.72)
51.86 (-0.19)
links if they degrade their syntax-based translation
system. Departing from commonly-used alignment models, Hermjakob (2009) aligns Arabic and
English content words using pointwise mutual information, and in this process indirectly uses English sentences reordered into VS order to collect
cooccurrence counts. The approach outperforms
GIZA++ on a small-scale translation task, but the
impact of reordering alone is not evaluated.
0110. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and
do not necessarily reflect the views of DARPA.
Marine Carpuat, Yuval Marton, and Nizar Habash. 2010. Reordering matrix post-verbal subjects for arabic-to-english
smt. In Proceedings of the Conference Traitement Automatique des Langues Naturelles (TALN).
Nguyen Bach, Stephan Vogel, and Colin Cherry. 2009. Cohesive constraints in a beam search phrase-based decoder.
In Proceedings of the 10th Meeting of the North American
Chapter of the Association for Computational Linguistics,
Companion Volume: Short Papers, pages 1–4.
Conclusion and future work
We presented a novel method for improving overall SMT quality using a noisy syntactic parser: we
use these parses to reorder VS constructions into
SV for word alignment only. This approach increases word alignment coverage and significantly
improves BLEU and TER scores on two strong
SMT baselines.
In subsequent work, we show that matrix (main-clause) VS constructions are reordered much more
frequently than non-matrix VS, and that limiting reordering to matrix VS constructions for
word alignment further improves translation quality (Carpuat et al., 2010). In the future, we plan to
improve robustness to parsing errors by using not
just one, but multiple subject boundary hypotheses. We will also investigate the integration of VS
reordering in SMT decoding.
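The reordering step summarized in this conclusion can be illustrated with a minimal sketch. It assumes a parser has already supplied the verb position and the subject span as token indices; the function name, its parameters, and the English glosses in the usage note are hypothetical and are not taken from the paper. Mapping the resulting alignment links back to the original word order is not shown.

```python
from typing import List


def reorder_vs_to_sv(tokens: List[str], verb: int,
                     subj_start: int, subj_end: int) -> List[str]:
    """Move the post-verbal subject span in front of the verb.

    `verb` is the verb position and [subj_start, subj_end) is the subject
    span; both are assumed to come from a (possibly noisy) parser.
    """
    if not (0 <= verb < subj_start < subj_end <= len(tokens)):
        raise ValueError("expected a verb followed by a non-empty subject span")
    subject = tokens[subj_start:subj_end]
    # Prefix stays in place, the subject is hoisted, then the verb and any
    # material between verb and subject, then the unchanged remainder.
    return tokens[:verb] + subject + tokens[verb:subj_start] + tokens[subj_end:]
```

For example, calling it on the glossed tokens `["said", "the", "president", "that", "he", "agrees"]` with `verb=0`, `subj_start=1`, `subj_end=3` moves "the president" in front of "said".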
Boxing Chen, George Foster, and Roland Kuhn. 2009.
Phrase translation model enhanced with association based
features. In Proceedings of MT-Summit XII, Ottawa, Ontario, September.
Colin Cherry. 2008. Cohesive phrase-based decoding for
statistical machine translation. In Proceedings of the 46th
Annual Meeting of the Association for Computational Linguistics (ACL), pages 72–80, Columbus, Ohio, June.
Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005.
Clause restructuring for statistical machine translation. In
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 531–540,
Ann Arbor, MI, June.
Josep M. Crego and Nizar Habash. 2008. Using shallow syntax information to improve word alignment and reordering
for SMT. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 53–61, June.
Victoria Fossum, Kevin Knight, and Steven Abney. 2008.
Using syntax to improve word alignment precision for
syntax-based machine translation. In Proceedings of the
Third Workshop on Statistical Machine Translation, pages
The authors would like to thank Mona Diab, Owen Rambow, Ryan Roth, Kristen Parton and Joakim Nivre for helpful discussions and assistance. This material is based upon
work supported by the Defense Advanced Research Projects
Agency (DARPA) under GALE Contract No HR0011-08-C-
Spence Green, Conal Sathi, and Christopher D. Manning.
2009. NP subject detection in verb-initial Arabic clauses.
Joakim Nivre. 2003. An efficient algorithm for projective
dependency parsing. In Proceedings of the 8th International Conference on Parsing Technologies (IWPT), pages
149–160, Nancy, France.
In Proceedings of the Third Workshop on Computational
Approaches to Arabic Script-based Languages (CAASL3).
Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd
Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 573–580, Ann Arbor, Michigan,
Joakim Nivre. 2008. Algorithms for Deterministic Incremental Dependency Parsing. Computational Linguistics,
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–52.
Nizar Habash and Ryan Roth. 2009. CATiB: The Columbia
Arabic treebank. In Proceedings of the ACL-IJCNLP 2009
Conference Short Papers, pages 221–224, Suntec, Singapore, August. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
Zhu. 2002. BLEU: a method for automatic evaluation of
machine translation. In Proceedings of the 40th Annual
Meeting of the Association for Computational Linguistics.
Nizar Habash. 2007. Syntactic preprocessing for statistical machine translation. In Proceedings of the Machine
Translation Summit (MT-Summit), Copenhagen.
Fatiha Sadat and Nizar Habash. 2006. Combination of arabic
preprocessing schemes for statistical machine translation.
In Proceedings of the 21st International Conference on
Computational Linguistics and the 44th annual meeting of
the Association for Computational Linguistics, pages 1–8,
Morristown, NJ, USA.
Ulf Hermjakob. 2009. Improved word alignment with statistics and linguistic heuristics. In Proceedings of the 2009
Conference on Empirical Methods in Natural Language
Processing, pages 229–237, Singapore, August.
Philipp Koehn, Franz Och, and Daniel Marcu. 2003.
Statistical phrase-based translation. In Proceedings of
HLT/NAACL-2003, Edmonton, Canada, May.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea
Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of AMTA, pages 223–231, Boston, MA.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran, Richard
Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin,
and Evan Herbst. 2007. Moses: Open source toolkit for
statistical machine translation. In Annual Meeting of the
Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June.
Chao Wang, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Processing and
Computational Natural Language Learning (EMNLPCoNLL), pages 737–745.
Fei Xia and Michael McCord. 2004. Improving a statistical
mt system with automatically learned rewrite patterns. In
Proceedings of COLING 2004, pages 508–514, Geneva,
Switzerland, August.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004
Conference on Empirical Methods in Natural Language
Processing (EMNLP-2004), Barcelona, Spain, July.
Yuqi Zhang, Richard Zens, and Hermann Ney. 2007. Chunklevel reordering of source language sentences with automatically learned rules for statistical machine translation.
In Human Language Technology Conf. / North American
Chapter of the Assoc. for Computational Linguistics Annual Meeting, Rochester, NY, April.
Young-Suk Lee. 2004. Morphological analysis for statistical
machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, pages 57–
60, Boston, MA.
Mohamed Maamouri, Ann Bies, and Seth Kulick. 2008.
Enhancing the arabic treebank: a collaborative effort toward new annotation guidelines. In Proceedings of the
Sixth International Language Resources and Evaluation
(LREC’08), Marrakech, Morocco.
Imed Zitouni, Jeffrey S. Sorensen, and Ruhi Sarikaya. 2006.
Maximum Entropy Based Restoration of Arabic Diacritics. In Proceedings of COLING-ACL, the joint conference
of the International Committee on Computational Linguistics and the Association for Computational Linguistics,
pages 577–584, Sydney, Australia.
Mohamed Maamouri, Ann Bies, Seth Kulick, Fatma Gaddeche, Wigdan Mekki, Sondos Krouna, and Basma
Bouziri. 2009. The penn arabic treebank part 3 version
3.1. Linguistic Data Consortium LDC2008E22.
Andreas Zollmann, Ashish Venugopal, and Stephan Vogel.
2006. Bridging the inflection morphology gap for arabic statistical machine translation. In Proceedings of the
Human Language Technology Conference of the NAACL,
Companion Volume: Short Papers, pages 201–204, New
York City, USA.
Yuval Marton, Nizar Habash, and Owen Rambow. 2010. Improving arabic dependency parsing with lexical and inflectional morphological features. In Proceedings of the
11th Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL) workshop
on Statistical Parsing of Morphologically Rich Languages
(SPMRL), Los Angeles.
Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A Data-Driven Parser-Generator for Dependency
Parsing. In Proceedings of the Conference on Language
Resources and Evaluation (LREC).
Learning Common Grammar from Multilingual Corpus
Tomoharu Iwata
Daichi Mochihashi
Hiroshi Sawada
NTT Communication Science Laboratories
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto, Japan
In our scenario, we use probabilistic context-free grammars (PCFGs) as our monolingual grammar model. We assume that a PCFG for each
language is generated from a general model that
is common across languages, and each sentence
in multilingual corpora is generated from the language-dependent PCFG. The inference of the general model as well as the multilingual PCFGs can
be performed by using a variational method for
efficiency. Our approach is based on a Bayesian
multitask learning framework (Yu et al., 2005;
Daumé III, 2009). Hierarchical Bayesian modeling provides a natural way of obtaining a joint regularization for individual models by assuming that
the model parameters are drawn from a common
prior distribution (Yu et al., 2005).
We propose a corpus-based probabilistic framework to extract hidden common
syntax across languages from non-parallel
multilingual corpora in an unsupervised
fashion. For this purpose, we assume a
generative model for multilingual corpora,
where each sentence is generated from a
language-dependent probabilistic context-free grammar (PCFG), and these PCFGs
are generated from a prior grammar that
is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel
multilingual corpus of eleven languages
demonstrate the feasibility of the proposed
2 Related work
The unsupervised grammar induction task has
been extensively studied (Carroll and Charniak,
1992; Stolcke and Omohundro, 1994; Klein and
Manning, 2002; Klein and Manning, 2004; Liang
et al., 2007). Recently, models have been proposed that outperform PCFG in the grammar induction task (Klein and Manning, 2002; Klein and
Manning, 2004). We used PCFG as a first step
for capturing commonalities in syntax across languages because of its simplicity. The proposed
framework can be used for probabilistic grammar
models other than PCFG.
Grammar induction using bilingual parallel corpora has been studied mainly in machine translation research (Wu, 1997; Melamed, 2003; Eisner,
2003; Chiang, 2005; Blunsom et al., 2009; Snyder et al., 2009). These methods require sentence-aligned parallel data, which can be costly to obtain
and difficult to scale to many languages. On the
other hand, our model does not require sentences
to be aligned. Moreover, since the complexity of
our model increases linearly with the number of
languages, our model is easily applicable to cor-
1 Introduction
Languages share certain common properties (Pinker, 1994). For example, the word order
in most European languages is subject-verb-object
(SVO), and some words with similar forms are
used with similar meanings in different languages.
The reasons for these common properties can be
attributed to: 1) a common ancestor language,
2) borrowing from nearby languages, and 3) the
innate abilities of humans (Chomsky, 1965).
We assume hidden commonalities in syntax
across languages, and try to extract a common
grammar from non-parallel multilingual corpora.
For this purpose, we propose a generative model
for multilingual grammars that is learned in an
unsupervised fashion. There are some computational models for capturing commonalities at the
phoneme and word level (Oakes, 2000; Bouchard-Côté et al., 2008), but, as far as we know, no attempt has been made to extract commonalities at the
syntax level from non-parallel and non-annotated
multilingual corpora.
Proceedings of the ACL 2010 Conference Short Papers, pages 184–188,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
pora of more than two languages, as we will show
in the experiments. To our knowledge, the only
grammar induction work on non-parallel corpora
is (Cohen and Smith, 2009), but their method does
not model a common grammar, and requires prior
information such as part-of-speech tags. In contrast, our method does not require any such prior
Figure 1: Graphical model.
3 Proposed Method
3.1 Model
Let $X = \{X_l\}_{l \in L}$ be a non-parallel and non-annotated multilingual corpus, where $X_l$ is a set
of sentences in language $l$, and $L$ is a set of languages. The task is to learn multilingual PCFGs
$G = \{G_l\}_{l \in L}$ and a common grammar that generates these PCFGs. Here, $G_l = (K, W_l, \Phi_l)$
represents a PCFG of language $l$, where $K$ is a
set of nonterminals, $W_l$ is a set of terminals, and
$\Phi_l$ is a set of rule probabilities. Note that the set of
nonterminals $K$ is shared among languages, but
the set of terminals $W_l$ and the rule probabilities $\Phi_l$
are specific to the language. For simplicity, we
consider Chomsky normal form grammars, which
have two types of rules: emissions rewrite a nonterminal as a terminal, $A \to w$, and binary productions rewrite a nonterminal as two nonterminals, $A \to BC$, where $A, B, C \in K$ and $w \in W_l$.
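The definition above can be made concrete with a small data structure. This is only a sketch under stated assumptions: the class name `CNFGrammar` and its fields and methods are illustrative choices, not the authors' implementation. Nonterminals are integers `0..|K|-1` shared across languages, while the terminal list and all probabilities belong to a single language.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CNFGrammar:
    """PCFG in Chomsky normal form for one language (illustrative names)."""
    num_nonterminals: int                    # |K|, shared across languages
    terminals: List[str]                     # W_l, language specific
    theta: List[Tuple[float, float]]         # theta[A] = (P(emission), P(binary))
    phi: List[Dict[Tuple[int, int], float]]  # phi[A][(B, C)] = P(B C | A, binary)
    psi: List[Dict[str, float]]              # psi[A][w] = P(w | A, emission)

    def is_normalized(self, tol: float = 1e-9) -> bool:
        """Check that every per-nonterminal distribution sums to one."""
        for a in range(self.num_nonterminals):
            if abs(sum(self.theta[a]) - 1.0) > tol:
                return False
            if abs(sum(self.phi[a].values()) - 1.0) > tol:
                return False
            if abs(sum(self.psi[a].values()) - 1.0) > tol:
                return False
        return True

    def binary_rule_prob(self, a: int, b: int, c: int) -> float:
        """Probability of A -> B C: choose the binary rule type, then (B, C)."""
        return self.theta[a][1] * self.phi[a].get((b, c), 0.0)

    def emission_prob(self, a: int, w: str) -> float:
        """Probability of A -> w: choose the emission rule type, then w."""
        return self.theta[a][0] * self.psi[a].get(w, 0.0)
```

Storing the rule-type choice separately from the production and emission tables mirrors the two rule types of a Chomsky normal form grammar.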
The rule probabilities for each nonterminal
$A$ of PCFG $G_l$ in language $l$ consist of: 1)
$\theta_{lA} = \{\theta_{lAt}\}_{t \in \{0,1\}}$, where $\theta_{lA0}$ and $\theta_{lA1}$ represent the probabilities of choosing the emission rule
and the binary production rule, respectively, 2)
$\phi_{lA} = \{\phi_{lABC}\}_{B,C \in K}$, where $\phi_{lABC}$ represents the probability of nonterminal production
$A \to BC$, and 3) $\psi_{lA} = \{\psi_{lAw}\}_{w \in W_l}$, where
$\psi_{lAw}$ represents the probability of terminal emission
$A \to w$. Note that $\theta_{lA0} + \theta_{lA1} = 1$, $\theta_{lAt} \ge 0$,
$\sum_{B,C} \phi_{lABC} = 1$, $\phi_{lABC} \ge 0$,
$\sum_{w} \psi_{lAw} = 1$,
and $\psi_{lAw} \ge 0$. In the proposed model, multinomial parameters $\theta_{lA}$ and $\phi_{lA}$ are generated from
Dirichlet distributions that are common across languages: $\theta_{lA} \sim \mathrm{Dir}(\alpha^{\theta}_A)$ and $\phi_{lA} \sim \mathrm{Dir}(\alpha^{\phi}_A)$,
since we assume that languages share a common
syntax structure. $\alpha^{\theta}_A$ and $\alpha^{\phi}_A$ represent the parameters of a common grammar. We use the Dirichlet
prior because it is the conjugate prior for the multinomial distribution. In summary, the proposed
model assumes the following generative process
for a multilingual corpus,
1. For each nonterminal $A \in K$:
(a) For each rule type $t \in \{0, 1\}$:
i. Draw common rule type parameters
$\alpha^{\theta}_{At} \sim \mathrm{Gam}(a^{\theta}, b^{\theta})$
(b) For each nonterminal pair $(B, C)$:
i. Draw common production parameters
$\alpha^{\phi}_{ABC} \sim \mathrm{Gam}(a^{\phi}, b^{\phi})$
2. For each language $l \in L$:
(a) For each nonterminal $A \in K$:
i. Draw rule type parameters
$\theta_{lA} \sim \mathrm{Dir}(\alpha^{\theta}_A)$
ii. Draw binary production parameters
$\phi_{lA} \sim \mathrm{Dir}(\alpha^{\phi}_A)$
iii. Draw emission parameters
$\psi_{lA} \sim \mathrm{Dir}(\alpha^{\psi})$
(b) For each node $i$ in the parse tree:
i. Choose rule type
$t_{li} \sim \mathrm{Mult}(\theta_{l z_i})$
ii. If $t_{li} = 0$:
A. Emit terminal
$x_{li} \sim \mathrm{Mult}(\psi_{l z_i})$
iii. Otherwise:
A. Generate children nonterminals
$(z_{lL(i)}, z_{lR(i)}) \sim \mathrm{Mult}(\phi_{l z_i})$,
where $L(i)$ and $R(i)$ represent the left and right
children of node $i$. Figure 1 shows a graphical model representation of the proposed model,
where the shaded and unshaded nodes indicate observed and latent variables, respectively.
3.2 Inference
The inference of the proposed model can be efficiently computed using a variational Bayesian
method. We extend the variational method for
the monolingual PCFG learning of Kurihara and
Sato (2004) to multilingual corpora. The goal
is to estimate the posterior $p(Z, \Phi, \alpha \mid X)$, where $Z$
is a set of parse trees, $\Phi = \{\Phi_l\}_{l \in L}$ is a
set of language-dependent parameters, $\Phi_l = \{\theta_{lA}, \phi_{lA}, \psi_{lA}\}_{A \in K}$, and $\alpha = \{\alpha^{\theta}_A, \alpha^{\phi}_A\}_{A \in K}$
is a set of common parameters. In the variational
method, the posterior $p(Z, \Phi, \alpha \mid X)$ is approximated
by a tractable variational distribution $q(Z, \Phi, \alpha)$.
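Before the variational updates, it may help to make the generative process of Section 3.1 concrete. The sketch below samples common parameters, one language-specific grammar, and one sentence using only the Python standard library. Several choices are assumptions rather than details given in the text: `Gam(a, b)` is read as shape `a` and rate `b`, nonterminal 0 is used as the root, the pair `(B, C)` is encoded as the integer `B * K + C`, and `max_depth` is an added safeguard that forces emission so that sampling always terminates.

```python
import random
from typing import List, Tuple


def dirichlet(alpha: List[float], rng: random.Random) -> List[float]:
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_common(K: int, a: float, b: float, rng: random.Random):
    """Step 1: Gamma-distributed common parameters for every nonterminal A.

    Assumption: Gam(a, b) has shape a and rate b, i.e. scale 1 / b.
    """
    alpha_theta = [[rng.gammavariate(a, 1.0 / b) for _ in range(2)] for _ in range(K)]
    alpha_phi = [[rng.gammavariate(a, 1.0 / b) for _ in range(K * K)] for _ in range(K)]
    return alpha_theta, alpha_phi


def sample_language(alpha_theta, alpha_phi, vocab: List[str],
                    alpha_psi: float, rng: random.Random):
    """Step 2(a): language-specific multinomials drawn from the shared Dirichlets."""
    K = len(alpha_theta)
    theta = [dirichlet(alpha_theta[A], rng) for A in range(K)]
    phi = [dirichlet(alpha_phi[A], rng) for A in range(K)]
    psi = [dirichlet([alpha_psi] * len(vocab), rng) for _ in range(K)]
    return theta, phi, psi


def sample_sentence(theta, phi, psi, vocab: List[str], rng: random.Random,
                    root: int = 0, max_depth: int = 6) -> List[str]:
    """Step 2(b): expand parse-tree nodes top-down and return the terminals."""
    K = len(theta)

    def expand(A: int, depth: int) -> List[str]:
        # theta[A][0] is the probability of the emission rule type (t = 0).
        if depth >= max_depth or rng.random() < theta[A][0]:
            return [rng.choices(vocab, weights=psi[A])[0]]
        pair = rng.choices(range(K * K), weights=phi[A])[0]
        B, C = divmod(pair, K)
        return expand(B, depth + 1) + expand(C, depth + 1)

    return expand(root, 0)
```

Calling `sample_language` several times with the same `alpha_theta` and `alpha_phi` yields grammars for different languages that share a prior but have their own vocabularies and rule probabilities, which is the structure the model assumes.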
We use the following variational distribution,

$$q(Z, \Phi, \alpha) = \prod_{A} q(\alpha^{\theta}_A)\, q(\alpha^{\phi}_A) \prod_{l,d} q(z_{ld}) \prod_{l,A} q(\theta_{lA})\, q(\phi_{lA})\, q(\psi_{lA}), \tag{1}$$

where we assume that the hyperparameter distributions $q(\alpha^{\theta}_A)$ and
$q(\alpha^{\phi}_A)$ are degenerate, or $q(\alpha) = \delta_{\alpha^*}(\alpha)$, and
infer them by point estimation instead of distribution estimation. We find an approximate posterior
distribution that minimizes the Kullback-Leibler
divergence from the true posterior. The variational
distribution of the parse tree of the $d$th sentence in
language $l$ is obtained as follows,

$$q(z_{ld}) \propto \prod_{A \to BC} \left(\pi^{\theta}_{lA1}\, \pi^{\phi}_{lABC}\right)^{C(A \to BC;\, z_{ld}, l, d)} \prod_{A \to w} \left(\pi^{\theta}_{lA0}\, \pi^{\psi}_{lAw}\right)^{C(A \to w;\, z_{ld}, l, d)}, \tag{2}$$

where $C(r; z, l, d)$ is the count of rule $r$ that occurs in the $d$th sentence of language $l$ with parse
tree $z$. The multinomial weights are calculated as follows,

$$\pi^{\theta}_{lAt} = \exp\left(E_{q(\theta_{lA})}\left[\log \theta_{lAt}\right]\right), \tag{3}$$
$$\pi^{\phi}_{lABC} = \exp\left(E_{q(\phi_{lA})}\left[\log \phi_{lABC}\right]\right), \tag{4}$$
$$\pi^{\psi}_{lAw} = \exp\left(E_{q(\psi_{lA})}\left[\log \psi_{lAw}\right]\right). \tag{5}$$

The variational Dirichlet parameters for $q(\theta_{lA}) = \mathrm{Dir}(\gamma^{\theta}_{lA})$, $q(\phi_{lA}) = \mathrm{Dir}(\gamma^{\phi}_{lA})$, and $q(\psi_{lA}) = \mathrm{Dir}(\gamma^{\psi}_{lA})$ are obtained as follows,

$$\gamma^{\theta}_{lAt} = \alpha^{\theta}_{At} + \sum_{d, z_{ld}} q(z_{ld})\, C(A, t;\, z_{ld}, l, d), \tag{6}$$
$$\gamma^{\phi}_{lABC} = \alpha^{\phi}_{ABC} + \sum_{d, z_{ld}} q(z_{ld})\, C(A \to BC;\, z_{ld}, l, d), \tag{7}$$
$$\gamma^{\psi}_{lAw} = \alpha^{\psi} + \sum_{d, z_{ld}} q(z_{ld})\, C(A \to w;\, z_{ld}, l, d), \tag{8}$$

where $C(A, t; z, l, d)$ is the count of rule type $t$
that is selected in nonterminal $A$ in the $d$th sentence of language $l$ with parse tree $z$.
The common rule type parameter $\alpha^{\theta}_{At}$ that minimizes the KL divergence between the true posterior and the approximate posterior can be obtained by using the fixed-point iteration method
described in (Minka, 2000). The update rule is as follows,

$$\alpha^{\theta}_{At} \leftarrow \frac{a^{\theta} - 1 + \alpha^{\theta}_{At}\, L \left(\Psi\left(\sum_{t'} \alpha^{\theta}_{At'}\right) - \Psi\left(\alpha^{\theta}_{At}\right)\right)}{b^{\theta} + \sum_{l} \left(\Psi\left(\sum_{t'} \gamma^{\theta}_{lAt'}\right) - \Psi\left(\gamma^{\theta}_{lAt}\right)\right)}, \tag{9}$$

where $L$ is the number of languages, and $\Psi(x) = \frac{\partial \log \Gamma(x)}{\partial x}$
is the digamma function. Similarly, the
common production parameter $\alpha^{\phi}_{ABC}$
can be updated as follows,

$$\alpha^{\phi}_{ABC} \leftarrow \frac{a^{\phi} - 1 + \alpha^{\phi}_{ABC}\, L\, J_{ABC}}{b^{\phi} + \sum_{l} J_{lABC}}, \tag{10}$$

where $J_{ABC} = \Psi\left(\sum_{B',C'} \alpha^{\phi}_{AB'C'}\right) - \Psi\left(\alpha^{\phi}_{ABC}\right)$,
and $J_{lABC} = \Psi\left(\sum_{B',C'} \gamma^{\phi}_{lAB'C'}\right) - \Psi\left(\gamma^{\phi}_{lABC}\right)$.
Since factored variational distributions depend
on each other, an optimal approximated posterior
can be obtained by updating parameters by (2)-(10) alternately until convergence. The updating of language-dependent distributions by (2)-(8) is also described in (Kurihara and Sato, 2004;
Liang et al., 2007), while the updating of common
grammar parameters by (9) and (10) is new. The
inference can be carried out efficiently using the
inside-outside algorithm based on dynamic programming (Lari and Young, 1990).
After the inference, the probability of a common grammar rule $A \to BC$ is calculated by
$\hat{\phi}_{A \to BC} = \hat{\theta}_1 \hat{\phi}_{ABC}$, where $\hat{\theta}_1 = \alpha^{\theta}_1 / (\alpha^{\theta}_0 + \alpha^{\theta}_1)$
and $\hat{\phi}_{ABC} = \alpha^{\phi}_{ABC} / \sum_{B',C'} \alpha^{\phi}_{AB'C'}$ represent
the mean values of $\theta_{l1}$ and $\phi_{lABC}$, respectively.
4 Experimental results
We evaluated our method by employing the EuroParl corpus (Koehn, 2005). The corpus consists of the proceedings of the European Parliament in eleven western European languages: Danish (da), German (de), Greek (el), English (en),
Spanish (es), Finnish (fi), French (fr), Italian (it),
Dutch (nl), Portuguese (pt), and Swedish (sv), and
it contains roughly 1,500,000 sentences in each
language. We set the number of nonterminals at
$|K| = 20$, and omitted sentences with more than
ten words for tractability. We randomly sampled
100,000 sentences for each language, and analyzed them using our method. It should be noted
that our random samples are not sentence-aligned.
Figure 2 shows the most probable terminals of
emission for each language and nonterminal with
a high probability of selecting the emission rule.
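The data preparation described for the experiments (dropping sentences longer than ten words, then drawing 100,000 sentences per language) can be sketched as follows. Whitespace tokenization, the function name, and the fixed seed are assumptions made for illustration; the text does not say how sentences were tokenized or sampled.

```python
import random
from typing import Iterable, List


def sample_short_sentences(sentences: Iterable[str], max_words: int = 10,
                           sample_size: int = 100_000,
                           seed: int = 0) -> List[List[str]]:
    """Keep sentences of at most `max_words` tokens, then sample without replacement."""
    tokenized = (s.split() for s in sentences)
    short = [toks for toks in tokenized if 0 < len(toks) <= max_words]
    if len(short) <= sample_size:
        return short  # fewer candidates than requested: keep them all
    return random.Random(seed).sample(short, sample_size)
```

Running this independently for each language produces samples that are not sentence-aligned across languages, matching the non-parallel setting of the experiments.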
We named nonterminals by using grammatical categories after the inference. We can see that words
in the same grammatical category clustered across
languages as well as within a language. Figure 3 shows examples of inferred common grammar rules with high probabilities. Grammar rules
that seem to be common to European languages
have been extracted.

Figure 2 (panel labels): 2: verb and auxiliary verb (V); 5: noun (N); 7: subject (SBJ); 9: preposition (PR); 11: punctuation (.); 13: determiner (DT)
Figure 2: Probable terminals of emission for each
language and nonterminal.

0 → 16 11    (R → S .)
16 → 7 6     (S → SBJ VP)
6 → 2 12     (VP → V NP)
12 → 13 5    (NP → DT N)
15 → 17 19   (NP → NP N)
17 → 5 9     (NP → N PR)
15 → 13 5    (NP → DT N)
Figure 3: Examples of inferred common grammar rules in eleven languages, and their probabilities. Hand-provided annotations have the following meanings, R: root, S: sentence, NP: noun
phrase, VP: verb phrase, and others appear in Figure 2.

5 Discussion
We have proposed a Bayesian hierarchical PCFG
model for capturing commonalities at the syntax
level for non-parallel multilingual corpora. Although our results have been encouraging, a number of directions remain in which we must extend
our approach. First, we need to evaluate our model
quantitatively using corpora with a greater diversity of languages. Possible measures include
perplexity and machine translation scores. Second, we need to improve our model. For example, we can infer the number of nonterminals
with a nonparametric Bayesian model (Liang et
al., 2007), infer the model more robustly based
on a Markov chain Monte Carlo inference (Johnson et al., 2007), and use probabilistic grammar
models other than PCFGs. In our model, all the
multilingual grammars are generated from a general model. We can extend it hierarchically using
the coalescent (Kingman, 1982). That model may
help to infer an evolutionary tree of languages in
terms of grammatical structure without the etymological information that is generally used (Gray
and Atkinson, 2003). Finally, the proposed approach may help to indicate the presence of a universal grammar (Chomsky, 1965), or to find it.
Dan Klein and Christopher D. Manning. 2004. Corpusbased induction of syntactic structure: models of dependency and constituency. In ACL ’04: Proceedings of the
42nd Annual Meeting on Association for Computational
Linguistics, page 478, Morristown, NJ, USA. Association
for Computational Linguistics.
Phil Blunsom, Trevor Cohn, and Miles Osborne. 2009.
Bayesian synchronous grammar induction. In D. Koller,
D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21,
pages 161–168.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th
Machine Translation Summit, pages 79–86.
Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths,
and Dan Klein. 2008. A probabilistic approach to language change. In J.C. Platt, D. Koller, Y. Singer, and
S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 169–176, Cambridge, MA.
MIT Press.
Kenichi Kurihara and Taisuke Sato. 2004. An application of the variational Bayesian approach to probabilistic
context-free grammars. In International Joint Conference
on Natural Language Processing Workshop Beyond Shallow Analysis.
Glenn Carroll and Eugene Charniak. 1992. Two experiments
on learning probabilistic dependency grammars from corpora. In Working Notes of the Workshop StatisticallyBased NLP Techniques, pages 1–13. AAAI.
K. Lari and S.J. Young. 1990. The estimation of stochastic
context-free grammars using the inside-outside algorithm.
Computer Speech and Language, 4:35–56.
David Chiang. 2005. A hierarchical phrase-based model for
statistical machine translation. In ACL ’05: Proceedings
of the 43rd Annual Meeting on Association for Computational Linguistics, pages 263–270, Morristown, NJ, USA.
Association for Computational Linguistics.
Percy Liang, Slav Petrov, Michael I. Jordan, and Dan Klein.
2007. The infinite PCFG using hierarchical dirichlet processes. In EMNLP ’07: Proceedings of the Empirical
Methods on Natural Language Processing, pages 688–
Noam Chomsky. 1965. Aspects of the Theory of Syntax. MIT Press.
I. Dan Melamed. 2003. Multitext grammars and synchronous parsers. In NAACL ’03: Proceedings of the 2003
Conference of the North American Chapter of the Association for Computational Linguistics on Human Language
Technology, pages 79–86, Morristown, NJ, USA. Association for Computational Linguistics.
Shay B. Cohen and Noah A. Smith. 2009. Shared logistic
normal distributions for soft parameter tying in unsupervised grammar induction. In NAACL ’09: Proceedings of
Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association
for Computational Linguistics, pages 74–82, Morristown,
NJ, USA. Association for Computational Linguistics.
Thomas Minka. 2000. Estimating a Dirichlet distribution.
Technical report, M.I.T.
Michael P. Oakes. 2000. Computer estimation of vocabulary in a protolanguage from word lists in four daughter
languages. Journal of Quantitative Linguistics, 7(3):233–
Hal Daumé III. 2009. Bayesian multitask learning with latent hierarchies. In Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence
(UAI-09), pages 135–142, Corvallis, Oregon. AUAI Press.
Steven Pinker. 1994. The Language Instinct: How the Mind
Creates Language. HarperCollins, New York.
Jason Eisner. 2003. Learning non-isomorphic tree mappings
for machine translation. In ACL ’03: Proceedings of the
41st Annual Meeting on Association for Computational
Linguistics, pages 205–208, Morristown, NJ, USA. Association for Computational Linguistics.
Benjamin Snyder, Tahira Naseem, and Regina Barzilay.
2009. Unsupervised multilingual grammar induction. In
Proceedings of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP,
pages 73–81, Suntec, Singapore, August. Association for
Computational Linguistics.
Russell D. Gray and Quentin D. Atkinson. 2003. Languagetree divergence times support the Anatolian theory of
Indo-European origin.
Nature, 426(6965):435–439,
Andreas Stolcke and Stephen M. Omohundro. 1994. Inducing probabilistic grammars by Bayesian model merging. In ICGI ’94: Proceedings of the Second International
Colloquium on Grammatical Inference and Applications,
pages 106–118, London, UK. Springer-Verlag.
Mark Johnson, Thomas Griffiths, and Sharon Goldwater.
2007. Bayesian inference for PCFGs via Markov chain
Monte Carlo. In Human Language Technologies 2007:
The Conference of the North American Chapter of the
Association for Computational Linguistics; Proceedings
of the Main Conference, pages 139–146, Rochester, New
York, April. Association for Computational Linguistics.
Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Comput.
Linguist., 23(3):377–403.
J. F. C. Kingman. 1982. The coalescent. Stochastic Processes and their Applications, 13:235–248.
Kai Yu, Volker Tresp, and Anton Schwaighofer. 2005.
Learning gaussian processes from multiple tasks. In
ICML ’05: Proceedings of the 22nd International Conference on Machine Learning, pages 1012–1019, New York,
Dan Klein and Christopher D. Manning. 2002. A generative
constituent-context model for improved grammar induction. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages
128–135, Morristown, NJ, USA. Association for Computational Linguistics.
Tree-Based Deterministic Dependency Parsing
— An Application to Nivre’s Method —
Kotaro Kitagawa
Kumiko Tanaka-Ishii
Graduate School of Information Science and Technology,
The University of Tokyo
[email protected] [email protected]
idea. Instead of selecting a parsing action for
two words, as in Nivre’s model, our tree-based
model first chooses the most probable head candidate from among the trees through a tournament
and then decides the parsing action between two
Global-optimization parsing methods are another common approach (Eisner, 1996; McDonald et al., 2005). Koo et al. (2008) studied
semi-supervised learning with this approach. Hybrid systems have improved parsing by integrating outputs obtained from different parsing models (Zhang and Clark, 2008).
Our proposal can be situated among global-optimization parsing methods as follows. The proposed tree-based model is deterministic but takes a
step towards global optimization by widening the
search space to include all necessary words connected by previously judged head-dependent relations, thus achieving a higher accuracy yet largely
retaining the speed of deterministic parsing.
Nivre’s method was improved by enhancing deterministic dependency parsing
through application of a tree-based model.
The model considers all words necessary
for selection of parsing actions by including words in the form of trees. It chooses
the most probable head candidate from
among the trees and uses this candidate to
select a parsing action.
In an evaluation experiment using the
Penn Treebank (WSJ section), the proposed model achieved higher accuracy
than did previous deterministic models.
Although the proposed model’s worst-case
time complexity is $O(n^2)$, the experimental results demonstrated an average parsing time not much slower than $O(n)$.
1 Introduction
Deterministic parsing methods achieve both effective time complexity and accuracy not far from
those of the most accurate methods. One such
deterministic method is Nivre’s method, an incremental parsing method whose time complexity is
linear in the number of words (Nivre, 2003). Still,
deterministic methods can be improved. As a specific example, Nivre’s model greedily decides the
parsing action only from two words and their locally relational words, which can lead to errors.
In the field of Japanese dependency parsing,
Iwatate et al. (2008) proposed a tournament model
that takes all head candidates into account in judging dependency relations. This method assumes
backward parsing because the Japanese dependency structure has a head-final constraint, so that
any word’s head is located to its right.
Here, we propose a tree-based model, applicable to any projective language, which can be considered as a kind of generalization of Iwatate’s
2 Deterministic Dependency Parsing
2.1 Dependency Parsing
A dependency parser receives an input sentence
$x = w_1, w_2, \ldots, w_n$ and computes a dependency
graph $G = (W, A)$. The set of nodes $W = \{w_0, w_1, \ldots, w_n\}$ corresponds to the words of a
sentence, and the node $w_0$ is the root of $G$. $A$ is
the set of arcs $(w_i, w_j)$, each of which represents a
dependency relation where $w_i$ is the head and $w_j$
is the dependent.
In this paper, we assume that the resulting dependency graph for a sentence is well-formed and
projective (Nivre, 2008). G is well-formed if and
only if it satisfies the following three conditions of
being single-headed, acyclic, and rooted.
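The conditions named above can be checked mechanically. The following sketch represents a graph by the sentence length `n` and a set of `(head, dependent)` index pairs, with node 0 as the root. The exact reading of each condition (the root has no head, every node has at most one head, no directed cycles, and every node strictly between the endpoints of an arc is dominated by the arc's head) follows common usage and is an assumption here, since the text only names the conditions.

```python
from typing import Dict, List, Set, Tuple

Arc = Tuple[int, int]  # (head, dependent)


def is_well_formed(n: int, arcs: Set[Arc]) -> bool:
    """Rooted, single-headed and acyclic check over nodes 0..n."""
    heads: Dict[int, List[int]] = {}
    for h, d in arcs:
        heads.setdefault(d, []).append(h)
    if 0 in heads:                                   # rooted: w_0 has no head
        return False
    if any(len(hs) > 1 for hs in heads.values()):    # single-headed
        return False
    for start in range(1, n + 1):                    # acyclic: follow head links
        seen = set()
        node = start
        while node in heads:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node][0]
    return True


def is_projective(n: int, arcs: Set[Arc]) -> bool:
    """Projectivity check; call only on graphs that pass is_well_formed."""
    head = {d: h for h, d in arcs}

    def dominated_by(h: int, k: int) -> bool:
        # Walk up from k; True if h is reached before running out of heads.
        while k in head:
            k = head[k]
            if k == h:
                return True
        return False

    for h, d in arcs:
        lo, hi = min(h, d), max(h, d)
        if any(not dominated_by(h, k) for k in range(lo + 1, hi)):
            return False
    return True
```

Checking well-formedness first matters because the projectivity walk assumes that following head links always terminates.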
2.2 Nivre’s Method
An incremental dependency parsing algorithm
was first proposed by Covington (2001). After
Proceedings of the ACL 2010 Conference Short Papers, pages 189–193,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
Table 1: Transitions for Nivre’s method and the proposed method.
Method            Transition   Definition                                                                                   Precondition
Nivre’s method    Left-Arc     $(\sigma|w_i, w_j|\beta, A) \Rightarrow (\sigma, w_j|\beta, A \cup \{(w_j, w_i)\})$           $i \neq 0 \wedge \neg\exists w_k\, (w_k, w_i) \in A$
                  Right-Arc    $(\sigma|w_i, w_j|\beta, A) \Rightarrow (\sigma|w_i|w_j, \beta, A \cup \{(w_i, w_j)\})$
                  Reduce       $(\sigma|w_i, \beta, A) \Rightarrow (\sigma, \beta, A)$                                       $\exists w_k\, (w_k, w_i) \in A$
                  Shift        $(\sigma, w_j|\beta, A) \Rightarrow (\sigma|w_j, \beta, A)$
Proposed method   Left-Arc     $(\sigma|t_i, t_j|\beta, A) \Rightarrow (\sigma, t_j|\beta, A \cup \{(w_j, w_i)\})$           $i \neq 0$
                  Right-Arc    $(\sigma|t_i, t_j|\beta, A) \Rightarrow (\sigma|t_i, \beta, A \cup \{(mphc(t_i, t_j), w_j)\})$
                  Shift        $(\sigma, t_j|\beta, A) \Rightarrow (\sigma|t_j, \beta, A)$
studies taking data-driven approaches, by Kudo
and Matsumoto (2002), Yamada and Matsumoto
(2003), and Nivre (2003), the deterministic incremental parser was generalized to a state transition
system by Nivre (2008).
Nivre’s method applying an arc-eager algorithm
works by using a stack of words denoted as $\sigma$, for
a buffer $\beta$ initially containing the sentence $x$. Parsing is formulated as a quadruple $(S, T_s, s_{init}, S_t)$,
where each component is defined as follows:
Features Used for Selecting Reduce
The features used in (Nivre and Scholz, 2004) to
define a state transition are basically obtained from
the two target words wi and wj , and their related
words. These words are not sufficient to select Reduce, because this action means that wj has no dependency relation with any word in the stack.
When the classifier selects a transition, the resulting graph satisfies well-formedness and projectivity only under the preconditions listed in Table 1.
Even though the parsing seems to be formulated as
a four-class classifier problem, it is in fact formed
of two types of three-class classifiers.
Solving these problems and selecting a more
suitable dependency relation requires a parser that
considers more global dependency relations.
• S is a set of states, each of which is denoted as (σ, β, A) ∈ S.
• Ts is a set of transitions, and each element of Ts is a function ts : S → S.
• sinit = ([w0], [w1, . . . , wn], ∅) is the initial state.
• St is a set of terminal states.
Syntactic analysis generates a sequence of optimal
transitions ts provided by an oracle o : S → Ts ,
applied to a target consisting of the stack’s top element wi and the first element wj in the buffer. The
oracle is constructed as a classifier trained on treebank data. Each transition is defined in the upper
block of Table 1 and explained as follows:
3 Tree-Based Parsing Applied to Nivre’s Method
3.1 Overall Procedure
Tree-based parsing uses trees as the procedural elements instead of words. This allows enhancement of previously proposed deterministic models such as (Covington, 2001; Yamada and Matsumoto, 2003). In this paper, we show the application of tree-based parsing to Nivre’s method. The
parser is formulated as a state transition system
(S, Ts , sinit , St ), similarly to Nivre’s parser, but σ
and β for a state s = (σ, β, A) ∈ S denote a stack
of trees and a buffer of trees, respectively. A tree
ti ∈ T is defined as the tree rooted by the word wi ,
and the initial state is sinit = ([t0], [t1, . . . , tn], ∅),
which is formed from the input sentence x.
The state transitions Ts are decided through the
following two steps.
Left-Arc Make wj the head of wi and pop wi ,
where wi is located at the stack top (denoted
as σ|wi ), when the buffer head is wj (denoted
as wj |β).
Right-Arc Make wi the head of wj , and push wj .
Reduce Pop wi , located at the stack top.
Shift Push the word wj , located at the buffer head,
onto the stack top.
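The four word-based transitions described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the state layout (stack, buffer, arc set), the function names, and the use of integer word indices with 0 as the root are assumptions made for the example. The preconditions follow those listed in Table 1.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, int]                           # (head index, dependent index)
State = Tuple[List[int], List[int], Set[Arc]]   # (stack, buffer, arcs)


def initial_state(n: int) -> State:
    """s_init = ([w0], [w1..wn], empty arc set); index 0 is the root."""
    return [0], list(range(1, n + 1)), set()


def has_head(i: int, arcs: Set[Arc]) -> bool:
    """True if some arc already makes word i a dependent."""
    return any(dep == i for _, dep in arcs)


def left_arc(state: State) -> State:
    """Make wj (buffer head) the head of wi (stack top) and pop wi."""
    stack, buf, arcs = state
    i, j = stack[-1], buf[0]
    assert i != 0 and not has_head(i, arcs)     # precondition from Table 1
    return stack[:-1], buf, arcs | {(j, i)}


def right_arc(state: State) -> State:
    """Make wi (stack top) the head of wj (buffer head) and push wj."""
    stack, buf, arcs = state
    i, j = stack[-1], buf[0]
    return stack + [j], buf[1:], arcs | {(i, j)}


def reduce_(state: State) -> State:
    """Pop wi from the stack; allowed only if wi already has a head."""
    stack, buf, arcs = state
    assert has_head(stack[-1], arcs)            # precondition from Table 1
    return stack[:-1], buf, arcs


def shift(state: State) -> State:
    """Push the buffer head wj onto the stack."""
    stack, buf, arcs = state
    return stack + [buf[0]], buf[1:], arcs


# Hypothetical three-word sentence: 1 = "He", 2 = "saw", 3 = "her".
state = initial_state(3)
for transition in (shift, left_arc, right_arc, right_arc):
    state = transition(state)
print(sorted(state[2]))
```

In a real parser the sequence of transitions would come from the trained classifier rather than being fixed by hand as it is here.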
The method explained thus far has the following problems.
Locality of Parsing Action Selection
1. Select the most probable head candidate
(MPHC): For the tree ti located at the stack
top, search for and select the MPHC for wj ,
which is the root word of tj located at the
buffer head. This procedure is denoted as a
The dependency relations are greedily determined,
so when the transition Right-Arc adds a dependency arc (wi , wj ), a more probable head of wj
located in the stack is disregarded as a candidate.
Figure 1: Example of a tournament.
Figure 2: Example of the transition Right.
function mphc(ti , tj ), and its details are explained in §3.2.
2008). Since the Japanese language has the head-final property, the tournament model itself constitutes parsing, whereas for parsing a general projective language, the tournament model can only
be used as part of a parsing algorithm.
Figure 1 shows a tournament for the example
of “with,” where the word “watched” finally wins.
Although only the words on the left-hand side of
tree tj are searched, this does not mean that the
tree-based method considers only one side of a dependency relation. For example, when we apply
the tree-based parsing to Yamada’s method, the
search problems on both sides are solved.
To implement mphc(ti , tj ), a binary classifier
is built to judge which of two given words is more
appropriate as the head for another input word.
This classifier concerns three words, namely, the
two words l (left) and r (right) in ti , whose appropriateness as the head is compared for the dependent wj . All word pairs of l and r in ti are
compared repeatedly in a “tournament,” and the
survivor is regarded as the MPHC of wj .
The classifier is generated through learning of
training examples for all ti and wj pairs, each
of which generates examples comparing the true
head and other (inappropriate) heads in ti. Table 2 lists the features used in the classifier. Here,
lex(X) and pos(X) mean the surface form and part
of speech of X, respectively. X_left means the
dependents of X located on the left-hand side of
X, while X_right means those on the right. Also,
X_head means the head of X. The feature design
also covers the three words occurring after
wj, denoted as wj+1, wj+2, wj+3.
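The tournament described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not fix the order in which pairs are compared, so the sketch compares candidates sequentially from left to right, and the binary classifier is replaced by a hypothetical callable `prefer_right(l, r, dependent)` that returns True when the right word r is the better head.

```python
from typing import Callable, Dict, Sequence


def tournament(candidates: Sequence[int], dependent: int,
               prefer_right: Callable[[int, int, int], bool]) -> int:
    """Return the surviving head candidate (the MPHC) for `dependent`.

    `candidates` are word indices of the tree t_i in sentence order.
    Each match compares the current survivor l with the next word r.
    """
    if not candidates:
        raise ValueError("the tree must contain at least one word")
    survivor = candidates[0]
    for challenger in candidates[1:]:
        # The survivor is the left word l, the challenger the right word r.
        if prefer_right(survivor, challenger, dependent):
            survivor = challenger
    return survivor


# Toy stand-in for the trained classifier: made-up head scores per word.
toy_scores: Dict[int, float] = {2: 0.9, 3: 0.1, 6: 0.4}


def toy_prefer_right(l: int, r: int, dependent: int) -> bool:
    return toy_scores[r] > toy_scores[l]


print(tournament([2, 3, 6], dependent=7, prefer_right=toy_prefer_right))
```

With k candidate words the sketch runs k − 1 comparisons, which is the source of the extra per-transition cost relative to a purely word-based parser.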
2. Select a transition: Choose a transition, by using an oracle, from among the following three possibilities (explained in detail in §3.3):
Left-Arc Make wj the head of wi and pop ti, where ti is at the stack top (denoted as σ|ti, with the tail being σ), when the buffer head is tj (denoted as tj|β).
Right-Arc Make the MPHC the head of wj, and pop the MPHC.
Shift Push the tree tj located at the buffer head onto the stack top.
These transitions correspond to three possibilities
for the relation between ti and tj : (1) a word of ti
is a dependent of a word of tj ; (2) a word of tj is a
dependent of a word of ti ; or (3) the two trees are
not related.
The formulations of these transitions in the
lower block of Table 1 correspond to Nivre’s transitions of the same name, except that here a transition is applied to a tree. This enhancement from
words to trees allows removal of both the Reduce
transition and certain preconditions.
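The three tree-based transitions can be sketched in the same style. This is a hedged illustration that follows the formal definitions in the lower block of Table 1, under which Right-Arc removes tj from the buffer and leaves ti on the stack. Representing a tree by the index of its root word, the helper `words_in_tree`, and passing `mphc` as a callable are assumptions of the example.

```python
from typing import Callable, List, Set, Tuple

Arc = Tuple[int, int]                           # (head index, dependent index)
State = Tuple[List[int], List[int], Set[Arc]]   # (stack, buffer, arcs)
Mphc = Callable[[List[int], int], int]          # (words of t_i, w_j) -> head


def words_in_tree(root: int, arcs: Set[Arc]) -> List[int]:
    """All word indices of the tree rooted at `root`, in sentence order."""
    words, frontier = [root], [root]
    while frontier:
        head = frontier.pop()
        for h, d in arcs:
            if h == head:
                words.append(d)
                frontier.append(d)
    return sorted(words)


def tree_left_arc(state: State) -> State:
    """Make wj (root of t_j) the head of wi (root of t_i) and pop t_i."""
    stack, buf, arcs = state
    i, j = stack[-1], buf[0]
    assert i != 0                               # the single precondition
    return stack[:-1], buf, arcs | {(j, i)}


def tree_right_arc(state: State, mphc: Mphc) -> State:
    """Attach wj under the most probable head candidate found inside t_i."""
    stack, buf, arcs = state
    i, j = stack[-1], buf[0]
    head = mphc(words_in_tree(i, arcs), j)
    return stack, buf[1:], arcs | {(head, j)}


def tree_shift(state: State) -> State:
    """Push the tree t_j at the buffer head onto the stack."""
    stack, buf, arcs = state
    return stack + [buf[0]], buf[1:], arcs


def rightmost(candidates: List[int], dependent: int) -> int:
    """Toy stand-in for mphc: always pick the rightmost word of t_i."""
    return max(candidates)


# Hypothetical three-word sentence: 1 = "He", 2 = "saw", 3 = "her".
state: State = ([0], [1, 2, 3], set())
state = tree_shift(state)
state = tree_left_arc(state)
state = tree_right_arc(state, rightmost)   # attaches 2 under the root 0
state = tree_right_arc(state, rightmost)   # attaches 3 under 2, inside t_0
print(state)
```

In the last step the head of word 3 is found inside the tree rooted at 0 rather than at the stack top itself, which is why no Reduce transition is needed in this sketch.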
3.2 Selection of Most Probable Head Candidate
By using mphc(ti, tj), a word located far from wj
(the root of tj) can be selected as the head candidate in ti. This selection process decreases the
number of errors that result from greedy decisions
considering only a few candidates.
Various procedures can be considered for implementing mphc(ti , tj ). One way is to apply the
tournament procedure to the words in ti . The tournament procedure was originally introduced for
parsing methods in Japanese by (Iwatate et al.,
3.3 Transition Selection
A transition is selected by a three-class classifier
after deciding the MPHC, as explained in §3.1.
Table 2: Features used for a tournament.
pos(l), lex(l)
pos(l_head), pos(l_left), pos(l_right)
pos(r), lex(r)
pos(r_head), pos(r_left), pos(r_right)
pos(wj), lex(wj), pos(wj_left)
pos(wj+1), lex(wj+1), pos(wj+2), lex(wj+2)
pos(wj+3), lex(wj+3)
Table 3: Features used for a state transition.
pos(wi), lex(wi)
pos(wi_left), pos(wi_right), lex(wi_left), lex(wi_right)
pos(MPHC), lex(MPHC)
pos(MPHC_head), pos(MPHC_left), pos(MPHC_right)
lex(MPHC_head), lex(MPHC_left), lex(MPHC_right)
pos(wj), lex(wj), pos(wj_left), lex(wj_left)
pos(wj+1), lex(wj+1), pos(wj+2), lex(wj+2), pos(wj+3), lex(wj+3)
Table 1 lists the three transitions and one precondition. The transition Shift indicates that the target trees ti and tj have no dependency relations.
The transition Right-Arc indicates generation of
the dependent-head relation between wj and the
result of mphc(ti , tj ), i.e., the MPHC for wj . Figure 2 shows an example of this transition. The
transition Left-Arc indicates generation of the dependency relation in which wj is the head of wi .
While Right-Arc requires searching for the MPHC
in ti, this is not the case for Left-Arc.¹
The key to obtaining an accurate tree-based
parsing model is to extend the search space while
at the same time providing ways to narrow down
the space and find important information, such as
the MPHC, for proper judgment of transitions.
The three-class classifier is constructed as follows. The dependency relation between the target
trees is represented by the three words wi , MPHC,
and wj . Therefore, the features are designed to incorporate these words, their relational words, and
the three words next to wj . Table 3 lists the exact
set of features used in this work. Since this transition selection procedure presumes selection of the
MPHC, the result of mphc(ti , tj ) is also incorporated among the features.
¹ The head word of wi can only be wj without searching within tj, because the relations between the other words in tj and wi have already been inferred from the decisions made within previous transitions. If tj has a child wk that could become the head of wi under projectivity, this wk must be located between wi and wj. The fact that wk’s head is wj means that there were two phases before ti and tj (i.e., wi and wj) became the target:
• ti and tk became the target, and Shift was selected.
• tk and tj became the target, and Left-Arc was selected.
The first phase precisely indicates that wi and wk are unrelated.
4 Evaluation
4.1 Data and Experimental Setting
In our experimental evaluation, we used Yamada’s head rule to extract unlabeled dependencies from the Wall Street Journal section of the Penn Treebank. Sections 2-21 were used as the training data, and section 23 was used as the test data. This test data was used in several other previous works, enabling mutual comparison with the methods reported in those works.
The SVMlight package² was used to build the support vector machine classifiers. The binary classifier for MPHC selection and the three-class classifier for transition selection were built using a cubic polynomial kernel. The parsing speed was evaluated on a Core2Duo (2.53 GHz) machine.
4.2 Parsing Accuracy
We measured the ratio of words assigned correct heads to all words (accuracy), and the ratio of sentences with completely correct dependency graphs to all sentences (complete match). In the evaluation, we consistently excluded punctuation marks.
Table 4 compares our results for the proposed method with those reported in some previous works using equivalent training and test data. The first column lists the four previous methods and our method, while the second through fourth columns list the accuracy, complete match accuracy, and time complexity, respectively, for each method. Here, we obtained the scores for the previous works from the corresponding articles listed in the first column. Note that every method used different features, which depend on the method.
The proposed method achieved higher accuracy than did the previous deterministic models. Although the accuracy of our method did not reach that of McDonald and Pereira (2006), the scores were competitive even though our method is deterministic. These results show the capability of the tree-based approach in effectively extending the search space.
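The two measures defined in §4.2 can be computed as in the sketch below. It is an illustration rather than the evaluation script used in the paper: the input layout (one list of head indices per sentence plus a parallel list of punctuation flags) is an assumption, as is the choice to judge complete match on non-punctuation tokens only.

```python
from typing import List, Tuple


def evaluate(gold: List[List[int]], pred: List[List[int]],
             is_punct: List[List[bool]]) -> Tuple[float, float]:
    """Return (accuracy, complete match), skipping punctuation tokens.

    gold[s][k] and pred[s][k] are the head indices of token k in sentence s.
    """
    correct = total = complete = 0
    for g, p, punct in zip(gold, pred, is_punct):
        scored = [(gh, ph) for gh, ph, skip in zip(g, p, punct) if not skip]
        hits = sum(gh == ph for gh, ph in scored)
        correct += hits
        total += len(scored)
        complete += int(hits == len(scored))    # every scored head is right
    return correct / total, complete / len(gold)


# Two toy sentences; the last token of the first one is punctuation.
gold = [[2, 0, 2, 2], [0, 1]]
pred = [[2, 0, 1, 3], [0, 1]]
is_punct = [[False, False, False, True], [False, False]]
accuracy, complete_match = evaluate(gold, pred, is_punct)
print(accuracy, complete_match)
```

Here four of the five scored tokens receive the correct head and one of the two sentences is fully correct.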
4.3 Parsing Time
Such extension of the search space also concerns
the speed of the method. Here, we compare its
computational time with that of Nivre’s method.
We re-implemented Nivre’s method to use SVMs with a cubic polynomial kernel, similarly to our method. Figure 3 shows plots of the parsing times for all sentences in the test data. The average parsing time for our method was 8.9 sec, whereas that for Nivre’s method was 7.9 sec.
Table 4: Dependency parsing performance.
Method                        Time complexity   Learner
McDonald & Pereira (2006)     O(n^3)
McDonald et al. (2005)        O(n^3)
Yamada & Matsumoto (2003)     O(n^2)            support vector machine
Goldberg & Elhadad (2010)     O(n log n)        structured perceptron
Nivre (2004)                  O(n)              memory based learning
Proposed method               O(n^2)            support vector machine
Figure 3: Parsing time for sentences (panels: Proposed Method, Nivre’s Method; x-axis: length of input sentence; y-axis: parsing time [sec]).
Although the worst-case time complexity for
Nivre’s method is O(n) and that for our method is
O(n2 ), worst-case situations (e.g., all words having heads on their left) did not appear frequently.
This can be seen from the sparse appearance of the
upper bound in the second figure.
5 Conclusion
We have proposed a tree-based model that decides head-dependency relations between trees instead of between words. This extends the search space to obtain the best head for a word within a deterministic model. The tree-based idea is potentially applicable to various previous parsing methods; in this paper, we have applied it to enhance Nivre’s method.
Our tree-based model outperformed various deterministic parsing methods reported previously. Although the worst-case time complexity of our method is O(n^2), the average parsing time is not much slower than that of Nivre’s O(n) method.
References
Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pp. 957–961.
Michael A. Covington. 2001. A fundamental algorithm for dependency parsing. Proceedings of ACM, pp. 95–102.
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. Proceedings of COLING, pp. 340–345.
Yoav Goldberg and Michael Elhadad. 2010. An Efficient Algorithm for Easy-First Non-Directional Dependency Parsing. Proceedings of NAACL.
Masakazu Iwatate, Masayuki Asahara, and Yuji Matsumoto. 2008. Japanese dependency parsing using a tournament model. Proceedings of COLING, pp. 361–368.
Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. Proceedings of ACL, pp. 595–603.
Taku Kudo and Yuji Matsumoto. 2002. Japanese dependency analysis using cascaded chunking. Proceedings of CoNLL, pp. 63–69.
Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. Proceedings of ACL, pp. 91–98.
Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. Proceedings of the EACL, pp. 81–88.
Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. Proceedings of IWPT, pp. 149–160.
Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, vol. 34, num. 4, pp. 513–553.
Joakim Nivre and Mario Scholz. 2004. Deterministic dependency parsing of English text. Proceedings of COLING, pp. 64–70.
Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. Proceedings of IWPT, pp. 195–206.
Yue Zhang and Stephen Clark. 2008. A tale of two parsers: investigating and combining graph-based and transition-based dependency parsing using beam-search. Proceedings of EMNLP, pp. 562–571.
Sparsity in Dependency Grammar Induction
Jennifer Gillenwater and Kuzman Ganchev
João Graça
University of Pennsylvania
Philadelphia, PA, USA
Lisboa, Portugal
[email protected]
Fernando Pereira
Google Inc.
Mountain View, CA, USA
[email protected]
Ben Taskar
University of Pennsylvania
Philadelphia, PA, USA
[email protected]
Abstract
A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
Introduction
We investigate an unsupervised learning method for dependency parsing models that imposes sparsity biases on the dependency types. We assume a corpus annotated with POS tags, where the task is to induce a dependency model from the tags for corpus sentences. In this setting, the type of a dependency is defined as a pair: tag of the dependent (also known as the child), and tag of the head (also known as the parent). Given that POS tags are designed to convey information about grammatical relations, it is reasonable to assume that only some of the possible dependency types will be realized for a given language. For instance, in English it is ungrammatical for nouns to dominate verbs, adjectives to dominate adverbs, and determiners to dominate almost any part of speech. Thus, the realized dependency types should be a sparse subset of all possible types.
Previous work in unsupervised grammar induction has tried to achieve sparsity through priors. Liang et al. (2007), Finkel et al. (2007) and Johnson et al. (2007) proposed hierarchical Dirichlet process priors. Cohen et al. (2008) experimented with a discounting Dirichlet prior, which encourages a standard dependency parsing model (see Section 2) to limit the number of dependent types for each head type.
Our experiments show a more effective sparsity pattern is one that limits the total number of unique head-dependent tag pairs. This kind of sparsity bias avoids inducing competition between dependent types for each head type. We can achieve the desired bias with a constraint on model posteriors during learning, using the posterior regularization (PR) framework (Graça et al., 2007). Specifically, to implement PR we augment the maximum marginal likelihood objective of the dependency model with a term that penalizes head-dependent tag distributions that are too permissive.
Although not focused on sparsity, several other studies use soft parameter sharing to couple different types of dependencies. To this end, Cohen et al. (2008) and Cohen and Smith (2009) investigated logistic normal priors, and Headden III et al. (2009) used a backoff scheme. We compare to their results in Section 5.
The remainder of this paper is organized as follows. Sections 2 and 3 review the models and several previous approaches for learning them. Section 4 describes learning with PR. Section 5 describes experiments across 12 languages and Section 6 analyzes the results. For additional details on this work see Gillenwater et al. (2010).
Proceedings of the ACL 2010 Conference Short Papers, pages 194–199,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
Parsing Model
The models we use are based on the generative dependency model with valence (DMV) (Klein and Manning, 2004). For a sentence with tags x, the root POS r(x) is generated first. Then the model decides whether to generate a right dependent conditioned on the POS of the root and whether other right dependents have already been generated for this head. Upon deciding to generate a right dependent, the POS of the dependent is selected by conditioning on the head POS and the directionality. After stopping on the right, the root generates left dependents using the mirror reversal of this process. Once the root has generated all its dependents, the dependents generate their own dependents in the same manner.
Model Extensions
For better comparison with previous work we implemented three model extensions, borrowed from Headden III et al. (2009). The first extension alters the stopping probability by conditioning it not only on whether there are any dependents in a particular direction already, but also on how many such dependents there are. When we talk about models with maximum stop valency Vs = S, this means the model distinguishes S different cases: 0, 1, . . . , S − 2, and ≥ S − 1 dependents in a given direction. The basic DMV has Vs = 2.
The second model extension we implement is analogous to the first, but applies to dependent tag probabilities instead of stop probabilities. Again, we expand the conditioning such that the model considers how many other dependents were already generated in the same direction. When we talk about a model with maximum child valency Vc = C, this means we distinguish C different cases. The basic DMV has Vc = 1. Since this extension to the dependent probabilities dramatically increases model complexity, the third model extension we implement is to add a backoff for the dependent probabilities that does not condition on the identity of the parent POS (see Equation 2).
More formally, under the extended DMV the probability of a sentence with POS tags x and dependency tree y is given by:
$$p_\theta(\mathbf{x}, \mathbf{y}) = p_{root}(r(\mathbf{x})) \times \prod_{y \in \mathbf{y}} p_{stop}(false \mid y_p, y_d, y_{v_s})\, p_{child}(y_c \mid y_p, y_d, y_{v_c}) \times \prod_{x \in \mathbf{x}} p_{stop}(true \mid x, left, x_{v_l})\, p_{stop}(true \mid x, right, x_{v_r}) \quad (1)$$
where y is the dependency of yc on head yp in direction yd, and yvc, yvs, xvr, and xvl indicate valence. For the third model extension, the backoff to a probability not dependent on parent POS can be formally expressed as:
$$\lambda\, p_{child}(y_c \mid y_p, y_d, y_{v_c}) + (1 - \lambda)\, p_{child}(y_c \mid y_d, y_{v_c}) \quad (2)$$
for λ ∈ [0, 1]. We fix λ = 1/3, which is a crude approximation to the value learned by Headden III et al. (2009).
Previous Learning Approaches
In our experiments, we compare PR learning to standard expectation maximization (EM) and to Bayesian learning with a sparsity-inducing prior. The EM algorithm optimizes marginal likelihood $L(\theta) = \log \sum_{\mathbf{Y}} p_\theta(\mathbf{X}, \mathbf{Y})$, where X = {x1, . . . , xn} denotes the entire unlabeled corpus and Y = {y1, . . . , yn} denotes a set of corresponding parses for each sentence. Neal and Hinton (1998) view EM as block coordinate ascent on a function that lower-bounds L(θ). Starting from an initial parameter estimate θ0, the algorithm iterates two steps:
$$\text{E}: \; q^{t+1} = \arg\min_{q} KL(q(\mathbf{Y}) \,\|\, p_{\theta^t}(\mathbf{Y} \mid \mathbf{X})) \quad (3)$$
$$\text{M}: \; \theta^{t+1} = \arg\max_{\theta} E_{q^{t+1}}[\log p_\theta(\mathbf{X}, \mathbf{Y})] \quad (4)$$
Note that the E-step just sets q^{t+1}(Y) = p_{θ^t}(Y|X), since it is an unconstrained minimization of a KL-divergence. The PR method we present modifies the E-step by adding constraints.
Besides EM, we also compare to learning with
several Bayesian priors that have been applied to
the DMV. One such prior is the Dirichlet, whose
hyperparameter we will denote by α. For α < 0.5,
this prior encourages parameter sparsity. Cohen
et al. (2008) use this method with α = 0.25 for
training the DMV and achieve improvements over
basic EM. In this paper we will refer to our own
implementation of the Dirichlet prior as the “discounting Dirichlet” (DD) method. In addition to the Dirichlet, other types of priors have been applied, in particular logistic normal priors (LN) and shared logistic normal priors (SLN) (Cohen et al., 2008; Cohen and Smith, 2009). LN and SLN aim to tie parameters together. Essentially, this has a similar goal to sparsity-inducing methods in that it posits a more concise explanation for the grammar of a language. Headden III et al. (2009) also implement a sort of parameter tying for the E-DMV through learning a backoff distribution on child probabilities. We compare against results from all these methods.
Learning with Sparse Posteriors
We would like to penalize models that predict a large number of distinct dependency types. To enforce this penalty, we use the posterior regularization (PR) framework (Graça et al., 2007). PR is closely related to generalized expectation constraints (Mann and McCallum, 2007; Mann and McCallum, 2008; Bellare et al., 2009), and is also indirectly related to a Bayesian view of learning with constraints on posteriors (Liang et al., 2009). The PR framework uses constraints on posterior expectations to guide parameter estimation. Here, PR allows a natural and tractable representation of sparsity constraints based on edge type counts that cannot easily be encoded in model parameters. We use a version of PR where the desired bias is a penalty on the log likelihood (see Ganchev et al. (2010) for more details). For a distribution pθ, we define a penalty as the (generic) β-norm of expectations of some features φ:
$$\| E_{p_\theta}[\phi(\mathbf{X}, \mathbf{Y})] \|_\beta \quad (5)$$
For computational tractability, rather than penalizing the model’s posteriors directly, we use an auxiliary distribution q, and penalize the marginal log-likelihood of a model by the KL-divergence of pθ from q, plus the penalty term with respect to q. For a fixed set of model parameters θ the full PR penalty term is:
$$\min_{q} KL(q(\mathbf{Y}) \,\|\, p_\theta(\mathbf{Y} \mid \mathbf{X})) + \sigma\, \| E_q[\phi(\mathbf{X}, \mathbf{Y})] \|_\beta \quad (6)$$
where σ is the strength of the regularization. PR seeks to maximize L(θ) minus this penalty term. The resulting objective can be optimized by a variant of the EM (Dempster et al., 1977) algorithm used to optimize L(θ).
4.1 ℓ1/ℓ∞ Regularization
We now define precisely how to count dependency types. For each child tag c, let i range over an enumeration of all occurrences of c in the corpus, and let p be another tag. Let the indicator φcpi(X, Y) have value 1 if p is the parent tag of the ith occurrence of c, and value 0 otherwise. The number of unique dependency types is then:
$$\sum_{cp} \max_{i} \phi_{cpi}(\mathbf{X}, \mathbf{Y}) \quad (7)$$
Note there is an asymmetry in this count: occurrences of child type c are enumerated with i, but all occurrences of parent type p are or-ed in φcpi. That is, φcpi = 1 if any occurrence of p is the parent of the ith occurrence of c. We will refer to PR training with this constraint as PR-AS. Instead of counting pairs of a child token and a parent type, we can alternatively count pairs of a child token and a parent token by letting p range over all tokens rather than types. Then each potential dependency corresponds to a different indicator φcpij, and the penalty is symmetric with respect to parents and children. We will refer to PR training with this constraint as PR-S. Both approaches perform very well, so we report results for both.
Equation 7 can be viewed as a mixed-norm penalty on the features φcpi or φcpij: the sum corresponds to an ℓ1 norm and the max to an ℓ∞ norm. Thus, the quantity we want to minimize fits precisely into the PR penalty framework. Formally, to optimize the PR objective, we complete the following E-step:
$$\arg\min_{q} KL(q(\mathbf{Y}) \,\|\, p_\theta(\mathbf{Y} \mid \mathbf{X})) + \sigma \sum_{cp} \max_{i} E_q[\phi_{cpi}(\mathbf{X}, \mathbf{Y})] \quad (8)$$
which can equivalently be written as:
$$\min_{q(\mathbf{Y}),\, \xi_{cp}} KL(q(\mathbf{Y}) \,\|\, p_\theta(\mathbf{Y} \mid \mathbf{X})) + \sigma \sum_{cp} \xi_{cp} \quad \text{s.t.} \quad E_q[\phi_{cpi}(\mathbf{X}, \mathbf{Y})] \le \xi_{cp} \quad (9)$$
where ξcp corresponds to the maximum expectation of φ over all instances of c and p. Note that the projection problem can be solved efficiently in the dual (Ganchev et al., 2010).
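To make the ℓ1/ℓ∞ quantity concrete, the sketch below evaluates the PR-AS penalty term (the sum over child-parent tag pairs of the maximum expectation over child occurrences) for given posterior edge probabilities. It only evaluates the penalty and does not perform the projection. The input layout, with each sentence given as a tag list and a dictionary from (parent position, child position) to posterior probability, is an assumption of the example. It relies on each token having exactly one parent, so the expectation of the PR-AS indicator is the summed posterior mass of parents carrying tag p.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sentence = Tuple[List[str], Dict[Tuple[int, int], float]]


def l1_linf_penalty(corpus: List[Sentence]) -> float:
    """Sum over (child tag, parent tag) of the max expectation over tokens."""
    best: Dict[Tuple[str, str], float] = defaultdict(float)
    for tags, post in corpus:
        # E_q[phi_cpi] for each child token d and each parent tag p.
        per_token: Dict[Tuple[int, str], float] = defaultdict(float)
        for (h, d), prob in post.items():
            per_token[(d, tags[h])] += prob
        for (d, p), expectation in per_token.items():
            key = (tags[d], p)
            best[key] = max(best[key], expectation)   # the max over i
    return sum(best.values())                         # the sum over (c, p)


# Made-up posteriors for two toy sentences (root edges are left out).
corpus: List[Sentence] = [
    (["D", "N", "V"], {(1, 0): 0.9, (2, 0): 0.1, (2, 1): 0.7,
                       (0, 1): 0.3, (1, 2): 0.4, (0, 2): 0.1}),
    (["D", "N"], {(1, 0): 0.6, (0, 1): 0.2}),
]
print(l1_linf_penalty(corpus))
```

In this toy corpus the second sentence adds nothing to the penalty, because its dependency types already occur with higher posterior mass in the first sentence. Reusing an existing type is therefore free, whereas introducing a new one is not.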
We evaluate on 12 languages. Following the example of Smith and Eisner (2006), we strip punctuation from the sentences and keep only sentences of length ≤ 10. For simplicity, for all models we use the “harmonic” initializer from Klein
Learning Method
PR-S (σ = 140)
LN families
SLN TieV & N
PR-AS (σ = 140)
DD (α = 1, λ learned)
Table 1: Attachment accuracy results. Column 1: Vc Vs used for the E-DMV models. Column 3: Best PR result for each model, which is chosen by applying each of
the two types of constraints (PR-S and PR-AS) and trying
σ ∈ {80, 100, 120, 140, 160, 180}. Columns 4 & 5: Constraint type and σ that produced the values in column 3.
2 and 3 are taken from Cohen et al. (2008) and Cohen and
Smith (2009), and row 5 from Headden III et al. (2009).
complexity and regularization strength. However,
we feel the comparison is not so unfair as we perform only a very limited search of the model-σ
space. Specifically, the only values of σ we search
over are {80, 100, 120, 140, 160, 180}.
First, we consider the top three entries in Table 2, which are for the basic DMV. The first entry was generated using our implementation of
PR-S. The second two entries are logistic normal and shared logistic normal parameter tying results (Cohen et al., 2008; Cohen and Smith, 2009).
The PR-S result is the clear winner, especially as
length of test sentences increases. For the bottom two entries in the table, which are for the EDMV, the last entry is best, corresponding to using a DD prior with α = 1 (non-sparsifying), but
with a special “random pools” initialization and a
learned weight λ for the child backoff probability. The result for PR-AS is well within the variance range of this last entry, and thus we conjecture that combining PR-AS with random pools initialization and learned λ would likely produce the
best-performing model of all.
Results on English
We start by comparing English performance for
EM, PR, and DD. To find α for DD we searched
over five values: {0.01, 0.1, 0.25, 1}. We found
0.25 to be the best setting for the DMV, the same
as found by Cohen et al. (2008). DD achieves accuracy 46.4% with this α. For the E-DMV we
tested four model complexities with valencies Vc Vs of 2-1, 2-2, 3-3, and 4-4. DD’s best accuracy
was 53.6% with the 4-4 model at α = 0.1. A
comparison between EM and PR is shown in Table 1. PR-S generally performs better than the PRAS for English. Comparing PR-S to EM, we also
found PR-S is always better, independent of the
particular σ, with improvements ranging from 2%
to 17%. Note that in this work we do not perform
the PR projection at test time; we found it detrimental, probably due to a need to set the (corpussize-dependent) σ differently for the test set. We
also note that development likelihood and the best
setting for σ are not well-correlated, which unfortunately makes it hard to pick these parameters
without some supervision.
Table 2: Comparison with previous published results. Rows
and Manning (2004), which we refer to as K&M.
We always train for 100 iterations and evaluate
on the test set using Viterbi parses. Before evaluating, we smooth the resulting models by adding
e−10 to each learned parameter, merely to remove
the chance of zero probabilities for unseen events.
(We did not tune this as it should make very little
difference for final parses.) We score models by
their attachment accuracy — the fraction of words
assigned the correct parent.
≤ 10
≤ 20
65.0 (±5.7)
Results on Other Languages
Here we describe experiments on 11 additional
languages. For each we set σ and model complexity (DMV versus one of the four E-DMV experimented with previously) based on the best configuration found for English. This likely will not
result in the ideal parameters for all languages, but
provides a realistic test setting: a user has available a labeled corpus in one language, and would
like to induce grammars for many other languages.
Table 3 shows the performance for all models and
training procedures. We see that the sparsifying
methods tend to improve over EM most of the
time. For the basic DMV, average improvements
are 1.6% for DD, 6.0% for PR-S, and 7.5% for
PR-AS. PR-AS beats PR-S in 8 out of 12 cases,
Comparison with Previous Work
In this section we compare to previously published
unsupervised dependency parsing results for English. It might be argued that the comparison is
unfair since we do supervised selection of model
DD 0.25
PR-S 140
PR-AS 140
EM (3,3)
DD 0.1 (4,4)
PR-S 140 (3,3)
PR-AS 140 (4,4)
DMV Model
40.3 52.8
47.5 57.8
61.1 58.8
62.4 60.2
Extended Model
44.3 48.5
48.9 57.6
57.9 60.8
57.9 59.4
Table 3: Attachment accuracy results. The parameters used are the best settings found for English. Values for hyperparameters
(α or σ) are given after the method name. For the extended model (Vc , Vs ) are indicated in parentheses. En is the English Penn
Treebank (Marcus et al., 1993) and the other 11 languages are from the CoNLL X shared task: Bulgarian [Bg] (Simov et al.,
2002), Czech [Cz] (Bohomovà et al., 2001), German [De] (Brants et al., 2002), Danish [Dk] (Kromann et al., 2003), Spanish
[Es] (Civit and Martí, 2004), Japanese [Jp] (Kawata and Bartels, 2000), Dutch [Nl] (Van der Beek et al., 2002), Portuguese
[Pt] (Afonso et al., 2002), Swedish [Se] (Nilsson et al., 2005), Slovene [Sl] (Džeroski et al., 2006), and Turkish [Tr] (Oflazer et
al., 2003).
does not occur, it shifts the model parameters to
make nouns the parent of determiners instead of
the reverse. Then it does not have to pay the cost
of assigning a parent with a new tag to cover each
noun that doesn’t come with a determiner.
0.83 0.75
In this paper we presented a new method for unsupervised learning of dependency parsers. In contrast to previous approaches that constrain model
parameters, we constrain model posteriors. Our
approach consistently outperforms the standard
EM algorithm and a discounting Dirichlet prior.
We have several ideas for further improving our
constraints, such as: taking into account the directionality of the edges, using different regularization strengths for the root probabilities than for the
child probabilities, and working directly on word
types rather than on POS tags. In the future, we
would also like to try applying similar constraints
to the more complex task of joint induction of POS
tags and dependency parses.
Figure 1: Posterior edge probabilities for an example sentence from the Spanish test corpus. At the top are the gold
dependencies, the middle are EM posteriors, and bottom are
PR posteriors. Green indicates correct dependencies and red
indicates incorrect dependencies. The numbers on the edges
are the values of the posterior probabilities.
though the average increase is only 1.5%. PR-S
is also better than DD for 10 out of 12 languages.
If we instead consider these methods for the E-DMV, DD performs worse, just 1.4% better than
the E-DMV EM, while both PR-S and PR-AS continue to show substantial average improvements
over EM, 6.5% and 6.3%, respectively.
One common EM error that PR fixes in many languages is the directionality of the noun-determiner
relation. Figure 1 shows an example of a Spanish sentence where PR significantly outperforms
EM because of this. Sentences such as “Lleva
tiempo entenderlos” which has tags “main-verb
common-noun main-verb” (no determiner tag)
provide an explanation for PR’s improvement—
when PR sees that sometimes nouns can appear
without determiners but that the opposite situation
J. Gillenwater was supported by NSF-IGERT
K. Ganchev was supported by
ARO MURI SUBTLE W911NF-07-1-0216.
J. Graça was supported by FCT fellowship
SFRH/BD/27528/2006 and by FCT project CMU-PT/HuMach/0039/2008. B. Taskar was partly
supported by DARPA CSSG and ONR Young
Investigator Award N000141010746.
Y. Kawata and J. Bartels. 2000. Stylebook for the
Japanese Treebank in VERBMOBIL. Technical report, Eberhard-Karls-Universität Tübingen.
S. Afonso, E. Bick, R. Haber, and D. Santos. 2002.
Floresta Sinta(c)tica: a treebank for Portuguese. In
Proc. LREC.
D. Klein and C. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency
and constituency. In Proc. ACL.
K. Bellare, G. Druck, and A. McCallum. 2009. Alternating projections for learning with expectation
constraints. In Proc. UAI.
M.T. Kromann, L. Mikkelsen, and S.K. Lynge. 2003.
Danish Dependency Treebank. In Proc. TLT.
A. Bohomovà, J. Hajic, E. Hajicova, and B. Hladka.
2001. The prague dependency treebank: Three-level
annotation scenario. In Anne Abeillé, editor, Treebanks: Building and Using Syntactically Annotated
P. Liang, S. Petrov, M.I. Jordan, and D. Klein. 2007.
The infinite PCFG using hierarchical Dirichlet processes. In Proc. EMNLP.
P. Liang, M.I. Jordan, and D. Klein. 2009. Learning from measurements in exponential families. In
Proc. ICML.
S. Brants, S. Dipper, S. Hansen, W. Lezius, and
G. Smith. 2002. The TIGER treebank. In Proc.
Workshop on Treebanks and Linguistic Theories.
G. Mann and A. McCallum. 2007. Simple, robust,
scalable semi-supervised learning via expectation
regularization. In Proc. ICML.
M. Civit and M.A. Martí. 2004. Building cast3lb: A
Spanish Treebank. Research on Language & Computation.
G. Mann and A. McCallum. 2008. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Proc. ACL.
S.B. Cohen and N.A. Smith. 2009. The shared logistic
normal distribution for grammar induction. In Proc.
S.B. Cohen, K. Gimpel, and N.A. Smith. 2008. Logistic normal priors for unsupervised probabilistic
grammar induction. In Proc. NIPS.
M. Marcus, M. Marcinkiewicz, and B. Santorini.
1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977.
Maximum likelihood from incomplete data via the
EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38.
R. Neal and G. Hinton. 1998. A new view of the EM
algorithm that justifies incremental, sparse and other
variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. MIT Press.
S. Džeroski, T. Erjavec, N. Ledinek, P. Pajas,
Z. Žabokrtsky, and A. Žele. 2006. Towards a
Slovene dependency treebank. In Proc. LREC.
J. Nilsson, J. Hall, and J. Nivre. 2005. MAMBA meets
TIGER: Reconstructing a Swedish treebank from
antiquity. NODALIDA Special Session on Treebanks.
J. Finkel, T. Grenager, and C. Manning. 2007. The
infinite tree. In Proc. ACL.
K. Oflazer, B. Say, D.Z. Hakkani-Tür, and G. Tür.
2003. Building a Turkish treebank. Treebanks:
Building and Using Parsed Corpora.
K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar.
2010. Posterior regularization for structured latent
variable models. Journal of Machine Learning Research.
K. Simov, P. Osenova, M. Slavcheva, S. Kolkovska,
E. Balabanova, D. Doikoff, K. Ivanova, A. Simov,
E. Simov, and M. Kouylekov. 2002. Building a linguistically interpreted corpus of bulgarian: the bultreebank. In Proc. LREC.
J. Gillenwater, K. Ganchev, J. Graça, F. Pereira, and
B. Taskar. 2010. Posterior sparsity in unsupervised
dependency parsing. Technical report, MS-CIS-10-19, University of Pennsylvania.
N. Smith and J. Eisner. 2006. Annealing structural
bias in multilingual weighted grammar induction. In
Proc. ACL.
J. Graça, K. Ganchev, and B. Taskar. 2007. Expectation maximization and posterior constraints. In
Proc. NIPS.
L. Van der Beek, G. Bouma, R. Malouf, and G. Van Noord. 2002. The Alpino dependency treebank. Language and Computers.
W.P. Headden III, M. Johnson, and D. McClosky.
2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proc.
M. Johnson, T.L. Griffiths, and S. Goldwater. 2007.
Adaptor grammars: A framework for specifying
compositional nonparametric Bayesian models. In
Proc. NIPS.
Top-Down K-Best A∗ Parsing
Adam Pauls and Dan Klein
Computer Science Division
University of California at Berkeley
Chris Quirk
Microsoft Research
Redmond, WA, 98052
[email protected]
We propose a top-down algorithm for extracting k-best lists from a parser. Our
algorithm, TKA∗, is a variant of the k-best A∗ (KA∗) algorithm of Pauls and
Klein (2009). In contrast to KA∗ , which
performs an inside and outside pass before performing k-best extraction bottom
up, TKA∗ performs only the inside pass
before extracting k-best lists top down.
TKA∗ maintains the same optimality and
efficiency guarantees as KA∗, but is simpler both to specify and to implement.
Because our algorithm is very similar to KA∗ ,
which is in turn an extension of the (1-best) A∗
parsing algorithm of Klein and Manning (2003),
we first introduce notation and review those two
algorithms before presenting our new algorithm.
Assume we have a PCFG² G and an input sentence s_0 ... s_{n−1} of length n. The grammar G has
a set of symbols denoted by capital letters, including a distinguished goal (root) symbol G. Without loss of generality, we assume Chomsky normal form: each non-terminal rule r in G has the
form r = A → B C with weight wr. Edges
are labeled spans e = (A, i, j). Inside derivations of an edge (A, i, j) are trees with root nonterminal A, spanning s_i ... s_{j−1}. The weight (negative log-probability) of the best (minimum) inside
derivation for an edge e is called the Viterbi inside score β(e), and the weight of the best derivation of G → s_0 ... s_{i−1} A s_j ... s_{n−1} is called
the Viterbi outside score α(e). The goal of a k-best parsing algorithm is to compute the k best
(minimum weight) inside derivations of the edge
(G, 0, n).
We formulate the algorithms in this paper
in terms of prioritized weighted deduction rules
(Shieber et al., 1995; Nederhof, 2003). A prioritized weighted deduction rule has the form
Many situations call for a parser to return a k-best list of parses instead of a single best hypothesis.¹ Currently, there are two efficient approaches
known in the literature. The k-best algorithm of
Jiménez and Marzal (2000) and Huang and Chiang (2005), referred to hereafter as LAZY, operates by first performing an exhaustive Viterbi inside pass and then lazily extracting k-best lists in
a top-down manner. The k-best A∗ algorithm of
Pauls and Klein (2009), hereafter KA∗, computes
Viterbi inside and outside scores before extracting
k-best lists bottom-up.
Because these additional passes are only partial,
KA∗ can be significantly faster than LAZY, especially when a heuristic is used (Pauls and Klein,
2009). In this paper, we propose TKA∗, a top-down variant of KA∗ that, like LAZY, performs
only an inside pass before extracting k-best lists
top-down, but maintains the same optimality and
efficiency guarantees as KA∗ . This algorithm can
be seen as a generalization of the lattice k-best algorithm of Soong and Huang (1991) to parsing.
Because TKA∗ eliminates the outside pass from
KA∗ , TKA∗ is simpler both in implementation and
$$\phi_1 : w_1, \ldots, \phi_n : w_n \;\xrightarrow{\;p(w_1, \ldots, w_n)\;}\; \phi_0 : g(w_1, \ldots, w_n)$$
where φ1 , . . . , φn are the antecedent items of the
deduction rule and φ0 is the conclusion item. A
deduction rule states that, given the antecedents
φ1 , . . . , φn with weights w1 , . . . , wn , the conclusion φ0 can be formed with weight g(w1 , . . . , wn )
and priority p(w1 , . . . , wn ).
While we present the algorithm specialized to parsing
with a PCFG, this algorithm generalizes to a wide range of
See Huang and Chiang (2005) for a review.
Proceedings of the ACL 2010 Conference Short Papers, pages 200–204,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
Figure 2: (a) An outside derivation item before expansion at
the edge (VP, 1, 4). (b) A possible expansion of the item in
(a) using the rule VP → VP NN. Frontier edges are marked in boldface.
Figure 1: Representations of the different types of items
used in parsing. (a) An inside edge item I(VP, 2, 5). (b)
An outside edge item O(VP, 2, 5). (c) An inside derivation item: D(T VP , 2, 5). (d) An outside derivation item:
, 1, 2, {(N P, 2, n)}. The edges in boldface are frontier edges.
its weight plus a heuristic h(A, i, j). For consistent and admissible heuristics h(·), this deduction
rule guarantees that when an inside edge item is
removed from the agenda, its current weight is its
true Viterbi inside score.
The heuristic h controls the speed of the algorithm. It can be shown that an edge e satisfying
β(e) + h(A, i, j) > β(G, 0, n) will never be removed from the agenda, allowing some edges to
be safely pruned during parsing. The more closely
h(e) approximates the Viterbi outside cost α(e),
the more items are pruned.
These deduction rules are “executed” within
a generic agenda-driven algorithm, which constructs items in a prioritized fashion. The algorithm maintains an agenda (a priority queue of
items), as well as a chart of items already processed. The fundamental operation of the algorithm is to pop the highest priority item φ from the
agenda, put it into the chart with its current weight,
and apply deduction rules to form any items which
can be built by combining φ with items already
in the chart. When the resulting items are either
new or have a weight smaller than an item’s best
score so far, they are put on the agenda with priority given by p(·). Because all antecedents must
be constructed before a deduction rule is executed,
we sometimes refer to particular conclusion item
as “waiting” on another item before it can be built.
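The agenda-driven loop and the IN rule can be illustrated with a small sketch. This is not the authors' implementation: the grammar layout (`lexicon` keyed by (symbol, word), `rules` keyed by (A, B, C)), the function name `astar_parse`, and the default zero heuristic are all assumptions made for illustration. Weights are negative log-probabilities, so smaller is better and the agenda pops the smallest priority first.

```python
import heapq
from collections import defaultdict


def astar_parse(words, lexicon, rules, goal, h=lambda A, i, j: 0.0):
    """Agenda-driven 1-best search using only the IN rule.

    lexicon: {(A, word): weight}, rules: {(A, B, C): weight} for A -> B C
    (hypothetical data layout). Returns the Viterbi inside score of
    (goal, 0, n), or None if the sentence has no parse.
    """
    n = len(words)
    agenda = []                    # priority queue of (priority, weight, edge)
    best = {}                      # best weight seen so far for each edge
    chart = {}                     # edges already popped, with final weights
    by_start = defaultdict(list)   # chart edges indexed by start position
    by_end = defaultdict(list)     # chart edges indexed by end position

    def push(edge, w):
        # Enqueue only items that are new or improve on the best score so far.
        if edge not in chart and w < best.get(edge, float("inf")):
            best[edge] = w
            heapq.heappush(agenda, (w + h(*edge), w, edge))

    for i, word in enumerate(words):
        for (A, wd), w in lexicon.items():
            if wd == word:
                push((A, i, i + 1), w)

    while agenda:
        _, w, edge = heapq.heappop(agenda)
        if edge in chart:
            continue               # stale queue entry for a finished edge
        chart[edge] = w
        if edge == (goal, 0, n):
            return w
        A, i, j = edge
        by_start[i].append(edge)
        by_end[j].append(edge)
        # IN rule: combine the popped edge with adjacent edges in the chart.
        for (P, B, C), wr in rules.items():
            if B == A:             # popped edge is the left child
                for right in by_start[j]:
                    if right[0] == C:
                        push((P, i, right[2]), w + chart[right] + wr)
            if C == A:             # popped edge is the right child
                for left in by_end[i]:
                    if left[0] == B:
                        push((P, left[1], j), chart[left] + w + wr)
    return None
```

With the zero heuristic this behaves like uniform-cost search; passing a consistent, admissible `h` only changes the order in which edges are popped.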
The use of inside edge items in A∗ exploits the optimal substructure property of derivations – since
a best derivation of a larger edge is always composed of best derivations of smaller edges, it is
only necessary to compute the best way of building a particular inside edge item. When finding
k-best lists, this is no longer possible, since we are
interested in suboptimal derivations.
Thus, KA∗ , the k-best extension of A∗ , must
search not in the space of inside edge items,
but rather in the space of inside derivation items
D(T^A, i, j), which represent specific derivations
of the edge (A, i, j) using tree T^A. However, the
number of inside derivation items is exponential
in the length of the input sentence, and even with
a very accurate heuristic, running A∗ directly in
this space is not feasible.
Fortunately, Pauls and Klein (2009) show that
with a perfect heuristic, that is, h(e) = α(e) ∀e,
A∗ search on inside derivation items will only
remove items from the agenda that participate
in the true k-best lists (up to ties). In order
to compute this perfect heuristic, KA∗ makes
use of outside edge items O(A, i, j) which represent the many possible derivations of G →
A∗ parsing (Klein and Manning, 2003) is an algorithm for computing the 1-best parse of a sentence. A∗ operates on items called inside edge
items I(A, i, j), which represent the many possible inside derivations of an edge (A, i, j). Inside edge items are constructed according to the
IN deduction rule of Table 1. This deduction rule
constructs inside edge items in a bottom-up fashion, combining items representing smaller edges
I(B, i, k) and I(C, k, j) with a grammar rule r =
A → B C to form a larger item I(A, i, j). The
weight of a newly constructed item is given by the
sum of the weights of the antecedent items and
the grammar rule r, and its priority is given by
hypergraph search problems as shown in Klein and Manning
Rule      Antecedents                                                      Priority                      Conclusion
IN∗†:     I(B, i, l) : w1   I(C, l, j) : w2                                w1 + w2 + wr + h(A, i, j)     I(A, i, j) : w1 + w2 + wr
IN-D†:    O(A, i, j) : w1   D(T^B, i, l) : w2   D(T^C, l, j) : w3          w2 + w3 + wr + w1             D(T^A, i, j) : w2 + w3 + wr
OUT-L†:   O(A, i, j) : w1   I(B, i, l) : w2     I(C, l, j) : w3            w1 + w3 + wr + w2             O(B, i, l) : w1 + w3 + wr
OUT-R†:   O(A, i, j) : w1   I(B, i, l) : w2     I(C, l, j) : w3            w1 + w2 + wr + w3             O(C, l, j) : w1 + w2 + wr
OUT-D∗:   Q(T^G_A, i, j, F) : w1   I(B, i, l) : w2   I(C, l, j) : w3       w1 + wr + w2 + w3 + β(F)      Q(T^G_B, i, l, F_C) : w1 + wr
Table 1: The deduction rules used in this paper. Here, r is the rule A → B C. A superscript * indicates that the rule is used
in TKA∗, and a superscript † indicates that the rule is used in KA∗. In IN-D, the tree T^A is rooted at (A, i, j) and has children
T^B and T^C. In OUT-D, the tree T^G_B is the tree T^G_A extended at (A, i, j) with rule r, F_C is the list F with (C, l, j) prepended,
and β(F) is Σ_{e∈F} β(e). Whenever the left child I(B, i, l) of an application of OUT-D represents a terminal, the next edge is
removed from F and is used as the new point of expansion.
s1 . . . si A sj+1 . . . sn (see Figure 1(b)).
Outside items are built using the OUT-L and
OUT-R deduction rules shown in Table 1. OUT-L and OUT-R combine, in a top-down fashion, an
outside edge over a larger span and an inside edge
over a smaller span to form a new outside edge
over a smaller span. Because these rules make reference to inside edge items I(A, i, j), these items
must also be built using the IN deduction rules
from 1-best A∗ . Outside edge items must thus wait
until the necessary inside edge items have been
built. The outside pass is initialized with the item
O(G, 0, n) when the inside edge item I(G, 0, n) is
popped from the agenda.
Once we have started populating outside scores
using the outside deductions, we can initiate a
search on inside derivation items.3 These items
are built bottom-up using the IN-D deduction rule.
The crucial element of this rule is that derivation
items for a particular edge wait until the exact outside score of that edge has been computed. The algorithm terminates when k derivation items rooted
at (G, 0, n) have been popped from the agenda.
mal completion costs are Viterbi inside scores, and
we could forget the outside pass.
TKA∗ does exactly that. Inside edge items are
constructed in the same way as KA∗ , but once the
inside edge item I(G, 0, n) has been discovered,
TKA∗ begins building partial derivations from the
goal outwards. We replace the inside derivation
items of KA∗ with outside derivation items, which
represent trees rooted at the goal and expanding
downwards. These items bottom out in a list of
edges called the frontier edges. See Figure 1(d)
for a graphical representation. When a frontier
edge represents a single word in the input, i.e. is
of the form (si , i, i + 1), we say that edge is complete. An outside derivation can be expanded by
applying a rule to one of its incomplete frontier
edges; see Figure 2. In the same way that inside
derivation items wait on exact outside scores before being built, outside derivation items wait on
the inside edge items of all frontier edges before
they can be constructed.
Although building derivations top-down obviates the need for a 1-best outside pass, it raises a
new issue. When building derivations bottom-up,
the only way to expand a particular partial inside
derivation is to combine it with another partial inside derivation to build a bigger tree. In contrast,
an outside derivation item can be expanded anywhere along its frontier. Naively building derivations top-down would lead to a prohibitively large
number of expansion choices.
We solve this issue by always expanding the
left-most incomplete frontier edge of an outside
derivation item. We show the deduction rule
OUT-D which performs this deduction in Figure 1(d). We denote an outside derivation item as
Q(T^G_A, i, j, F), where T^G_A is a tree rooted at the
goal with left-most incomplete edge (A, i, j), and
F is the list of incomplete frontier edges excluding (A, i, j), ordered from left to right. Whenever
the application of this rule “completes” the left-
KA∗ efficiently explores the space of inside
derivation items because it waits for the exact
Viterbi outside cost before building each derivation item. However, these outside costs and associated deduction items are only auxiliary quantities used to guide the exploration of inside derivations: they allow KA∗ to prioritize currently constructed inside derivation items (i.e., constructed
derivations of the goal) by their optimal completion costs. Outside costs are thus only necessary
because we construct partial derivations bottomup; if we constructed partial derivations in a topdown fashion, all we would need to compute opti3
We stress that the order of computation is entirely specified by the deduction rules – we only speak about e.g. “initiating a search” as an appeal to intuition.
most edge, the next edge is removed from F and
is used as the new point of expansion. Once all
frontier edges are complete, the item represents a
correctly scored derivation of the goal, explored in
a pre-order traversal.
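The left-most expansion strategy can be sketched as a best-first search over outside derivation items. The sketch below is an illustration under several assumptions, not the authors' code: the inside pass is an exhaustive Viterbi dynamic program rather than A∗, edges of width 1 are treated as complete and contribute their lexical weight directly, and the names `viterbi_inside` and `tka_star_kbest` as well as the grammar layout are hypothetical. The priority of an item is its accumulated weight plus the Viterbi inside scores of all incomplete frontier edges, mirroring the OUT-D priority.

```python
import heapq
import itertools


def viterbi_inside(words, lexicon, rules):
    """Exhaustive Viterbi inside scores beta[(A, i, j)] (negative log-probs)."""
    n = len(words)
    beta = {}
    for i, word in enumerate(words):
        for (A, wd), w in lexicon.items():
            if wd == word:
                beta[(A, i, i + 1)] = min(w, beta.get((A, i, i + 1), float("inf")))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for l in range(i + 1, j):
                for (A, B, C), wr in rules.items():
                    if (B, i, l) in beta and (C, l, j) in beta:
                        s = beta[(B, i, l)] + beta[(C, l, j)] + wr
                        if s < beta.get((A, i, j), float("inf")):
                            beta[(A, i, j)] = s
    return beta


def tka_star_kbest(words, lexicon, rules, goal, k):
    """Return up to k (weight, expansions) pairs, best first.

    expansions lists ((A, B, C), split) in the order rules were applied,
    always at the left-most incomplete frontier edge.
    """
    beta = viterbi_inside(words, lexicon, rules)
    root = (goal, 0, len(words))
    if root not in beta:
        return []

    def close(edges, g):
        # Width-1 edges are complete: add their lexical weight and drop them.
        open_edges = []
        for e in edges:
            if e[2] - e[1] == 1:
                g += beta[e]
            else:
                open_edges.append(e)
        return g, tuple(open_edges)

    tie = itertools.count()        # tie-breaker so the heap never compares trees
    g0, f0 = close([root], 0.0)
    agenda = [(g0 + sum(beta[e] for e in f0), next(tie), g0, f0, ())]
    results = []
    while agenda and len(results) < k:
        _, _, g, frontier, deriv = heapq.heappop(agenda)
        if not frontier:           # all frontier edges complete
            results.append((g, deriv))
            continue
        (A, i, j), rest = frontier[0], frontier[1:]
        for l in range(i + 1, j):
            for (P, B, C), wr in rules.items():
                left, right = (B, i, l), (C, l, j)
                if P != A or left not in beta or right not in beta:
                    continue
                g2, new_open = close([left, right], g + wr)
                frontier2 = new_open + rest
                priority = g2 + sum(beta[e] for e in frontier2)
                step = ((P, B, C), l)
                heapq.heappush(agenda, (priority, next(tie), g2, frontier2, deriv + (step,)))
    return results
```

For brevity the derivation is stored as a copied tuple of expansions; this costs more memory than sharing structure between items.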
Although our algorithm eliminates the 1-best outside pass of KA∗, in practice, even for k = 10^4,
the 1-best inside pass remains the overwhelming
bottleneck (Pauls and Klein, 2009), and our modifications leave that pass unchanged.
However, we argue that our implementation is
simpler to specify and implement. In terms of deduction rules, our algorithm eliminates the 2 outside deduction rules and replaces the IN-D rule
with the OUT-D rule, bringing the total number
of rules from four to two.
The ease of specification translates directly into
ease of implementation. In particular, if high-quality heuristics are not available, it is often more
efficient to implement the 1-best inside pass as
an exhaustive dynamic program, as in Huang and
Chiang (2005). In this case, one would only need
to implement a single, agenda-based k-best extraction phase, instead of the 2 needed for KA∗ .
It should be clear that expanding the left-most incomplete frontier edge first eventually explores the
same set of derivations as expanding all frontier
edges simultaneously. The only worry in fixing
this canonical order is that we will somehow explore the Q items in an incorrect order, possibly
building some complete derivation Q′_C before a
more optimal complete derivation Q_C. However,
note that all items Q along the left-most construction of Q_C have priority equal to or better than any
less optimal complete derivation Q′_C. Therefore,
when Q′_C is enqueued, it will have lower priority
than all Q; Q′_C will therefore not be dequeued until all Q (and hence Q_C) have been built.
Furthermore, it can be shown that the top-down
expansion strategy maintains the same efficiency
and optimality guarantees as KA∗ for all item
types: for consistent heuristics h, the first k entirely complete outside derivation items are the
true k-best derivations (modulo ties), and only
derivation items which participate in those k-best
derivations will be removed from the queue (up to ties).
The contribution of this paper is theoretical, not
empirical. We have argued that TKA∗ is simpler
than KA∗, but we do not expect it to do any more
or less work than KA∗ , modulo grammar specific
optimizations. Therefore, we simply verify, like
KA∗ , that the additional work of extracting k-best
lists with TKA∗ is negligible compared to the time
spent building 1-best inside edges.
We examined the time spent building 100-best
lists for the same experimental setup as Pauls and
Klein (2009).4 On 100 sentences, our implementation of TKA∗ constructed 3.46 billion items, of
which about 2% were outside derivation items.
Our implementation of KA∗ constructed 3.41 billion edges, of which about 0.1% were outside edge
items or inside derivation items. In other words,
the cost of k-best extraction is dwarfed by the
1-best inside edge computation in both cases.
The reason for the slight performance advantage
of KA∗ is that our implementation of KA∗ uses
lazy optimizations discussed in Pauls and Klein
(2009), and while such optimizations could easily
be incorporated in TKA∗ , we have not yet done so
in our implementation.
Implementation Details
Building derivations bottom-up is convenient from
an indexing point of view: since larger derivations
are built from smaller ones, it is not necessary to
construct the larger derivation from scratch. Instead, one can simply construct a new tree whose
children point to the old trees, saving both memory and CPU time.
In order to keep the same efficiency when building trees top-down, a slightly different data structure is necessary. We represent top-down derivations as a lazy list of expansions. The top node
T^G_G is an empty list, and whenever we expand an
outside derivation item Q(T^G_A, i, j, F) with a rule
r = A → B C and split point l, the resulting
derivation T^G_B is a new list item with (r, l) as the
head data, and T^G_A as its tail. The tree can be reconstructed later by recursively reconstructing the
parent, and adding the edges (B, i, l) and (C, l, j)
as children of (A, i, j).
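A minimal sketch of such a list of expansions follows. The class name `Cell`, the helpers `expand` and `reconstruct`, and the convention that width-1 edges are complete are assumptions made for this example. Each expansion allocates one cell that shares its tail with the parent derivation, so sibling derivations reuse the same prefix. Reconstruction here replays the expansions iteratively, oldest first, at the left-most incomplete frontier edge, which is equivalent to the recursive description.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    rule: tuple                # (A, B, C), standing for A -> B C
    split: int                 # split point l
    tail: Optional["Cell"]     # derivation before this expansion; None is the empty list


def expand(deriv, rule, split):
    """Extend a derivation by one expansion in O(1), sharing the old list."""
    return Cell(rule, split, deriv)


def reconstruct(deriv, goal, n):
    """Rebuild the tree as a map from each expanded edge to its two children.

    Assumes expansions were always applied to the left-most incomplete
    frontier edge and that edges of width 1 are complete.
    """
    steps = []
    while deriv is not None:   # walk from newest to oldest expansion
        steps.append((deriv.rule, deriv.split))
        deriv = deriv.tail
    steps.reverse()

    children = {}
    frontier = [(goal, 0, n)]  # stack whose top is the left-most incomplete edge
    for (A, B, C), l in steps:
        label, i, j = frontier.pop()
        assert label == A and i < l < j, "expansion does not match the frontier"
        children[(A, i, j)] = ((B, i, l), (C, l, j))
        # Push the right child first so that the left child is expanded next.
        for edge in ((C, l, j), (B, i, l)):
            if edge[2] - edge[1] > 1:
                frontier.append(edge)
    return children
```

Two derivations that differ only in their last expansion share every earlier cell, which is what preserves the memory savings of the bottom-up representation.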
This setup used 3- and 6-round state-split grammars from
Petrov et al. (2006), the former used to compute a heuristic
for the latter, tested on sentences of length up to 25.
We have presented TKA∗ , a simplification to the
KA∗ algorithm. Our algorithm collapses the 1-best outside and bottom-up derivation passes of
KA∗ into a single, top-down pass without sacrificing efficiency or optimality. This reduces the
number of non-base-case deduction rules, making
TKA∗ easier both to specify and implement.
This project is funded in part by the NSF under
grant 0643742 and an NSERC Postgraduate Fellowship.
Liang Huang and David Chiang. 2005. Better k-best
parsing. In Proceedings of the International Workshop on Parsing Technologies (IWPT), pages 53–64.
Víctor M. Jiménez and Andrés Marzal. 2000. Computation of the n best parse trees for weighted and
stochastic context-free grammars. In Proceedings
of the Joint IAPR International Workshops on Advances in Pattern Recognition, pages 183–192, London, UK. Springer-Verlag.
Dan Klein and Christopher D. Manning. 2001. Parsing and hypergraphs. In Proceedings of the International Workshop on Parsing Technologies (IWPT),
pages 123–134.
Dan Klein and Christopher D. Manning. 2003. A*
parsing: Fast exact Viterbi parse selection. In
Proceedings of the Human Language Technology
Conference and the North American Association
for Computational Linguistics (HLT-NAACL), pages
Mark-Jan Nederhof. 2003. Weighted deductive parsing and Knuth’s algorithm. Computational Linguistics, 29(1):135–143.
Adam Pauls and Dan Klein. 2009. K-best A* parsing.
In Proceedings of the Association for Computational
Linguistics (ACL).
Slav Petrov, Leon Barrett, Romain Thibaux, and Dan
Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the
Association for Computational Linguistics (ACL).
Stuart M. Shieber, Yves Schabes, and Fernando C. N.
Pereira. 1995. Principles and implementation of
deductive parsing. Journal of Logic Programming,
Frank K. Soong and Eng-Fong Huang. 1991. A tree-trellis based fast search for finding the n best sentence hypotheses in continuous speech recognition.
In Proceedings of the Workshop on Speech and Natural Language.
Simple semi-supervised training of part-of-speech taggers
Anders Søgaard
Center for Language Technology
University of Copenhagen
[email protected]
Most attempts to train part-of-speech taggers on a mixture of labeled and unlabeled
data have failed. In this work stacked
learning is used to reduce tagging to a
classification task. This simplifies semi-supervised training considerably. Our
preferred semi-supervised method combines tri-training (Li and Zhou, 2005) and
disagreement-based co-training. On the
Wall Street Journal, we obtain an error reduction of 4.2% with SVMTool (Gimenez
and Marquez, 2004).
POS tagger with 4–5% error reduction. Finally,
Søgaard (2009) stacks a POS tagger on an unsupervised clustering algorithm trained on large
amounts of unlabeled data with mixed results.
This work applies a semi-supervised
learning method that is new to POS tagging, namely tri-training (Li and Zhou, 2005), and combines it with stacking on unsupervised clustering. It is shown that this method
can be used to improve a state-of-the-art POS tagger, SVMTool (Gimenez and Marquez, 2004). Finally, we introduce a variant of tri-training called
tri-training with disagreement, which seems to
perform equally well, but which imports much less
unlabeled data and is therefore more efficient.
Semi-supervised part-of-speech (POS) tagging is
relatively rare, and the main reason seems to be
that results have mostly been negative. Merialdo (1994), in a now famous negative result, attempted to improve HMM POS tagging by expectation maximization with unlabeled data. Clark
et al. (2003) reported positive results with little
labeled training data but negative results when
the amount of labeled training data increased; the
same seems to be the case in Wang et al. (2007)
who use co-training of two diverse POS taggers.
Huang et al. (2009) present positive results for
self-training a simple bigram POS tagger, but results are considerably below state-of-the-art.
Recently researchers have explored alternative
methods. Suzuki and Isozaki (2008) introduce
a semi-supervised extension of conditional random fields that combines supervised and unsupervised probability models by so-called MDF parameter estimation, which reduces error on Wall
Street Journal (WSJ) standard splits by about 7%
relative to their supervised baseline. Spoustova
et al. (2009) use a new pool of unlabeled data
tagged by an ensemble of state-of-the-art taggers
in every training step of an averaged perceptron
Tagging as classification
This section describes our dataset and our input
tagger. We also describe how stacking is used to
reduce POS tagging to a classification task. Finally, we introduce the supervised learning algorithms used in our experiments.
We use the POS-tagged WSJ from the Penn Treebank Release 3 (Marcus et al., 1993) with the
standard split: Sect. 0–18 is used for training,
Sect. 19–21 for development, and Sect. 22–24 for
testing. Since we need to train our classifiers on
material distinct from the training material for our
input POS tagger, we save Sect. 19 for training our
classifiers. Finally, we use the (untagged) Brown
corpus as our unlabeled data. The number of tokens we use for training, developing and testing
the classifiers, and the amount of unlabeled data
available to it, are thus:
Proceedings of the ACL 2010 Conference Short Papers, pages 205–208,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
The amount of unlabeled data available to our
classifiers is thus a bit more than 25 times the
amount of labeled data.
the-shelf implementations of supervised learning
algorithms. Specifically we have experimented
with support vector machines (SVMs), decision
trees, bagging and random forests. Tri-training,
explained below, is a semi-supervised learning
method which requires large amounts of data.
Consequently, we only used very fast learning algorithms in the context of tri-training. On the development section, decision trees performed better than bagging and random forests. The decision tree algorithm is the C4.5 algorithm first
introduced in Quinlan (1993). We used SVMs
with polynomial kernels of degree 2 to provide a
stronger stacking-only baseline.
Input tagger
In our experiments we use SVMTool (Gimenez
and Marquez, 2004) with model type 4 run incrementally in both directions. SVMTool has an accuracy of 97.15% on WSJ Sect. 22-24 with this
parameter setting. Gimenez and Marquez (2004)
report that SVMTool has an accuracy of 97.16%
with an optimized parameter setting.
Classifier input
The way classifiers are constructed in our experiments is very simple. We train SVMTool and an
unsupervised tagger, Unsupos (Biemann, 2006),
on our training sections and apply them to the development, test and unlabeled sections. The results are combined in tables that will be the input
of our classifiers. Here is an excerpt:1
Gold standard
This section first presents the tri-training algorithm originally proposed by Li and Zhou (2005)
and then considers a novel variant: tri-training
with disagreement.
Let L denote the labeled data and U the unlabeled data. Assume that three classifiers c1 , c2 , c3
(same learning algorithm) have been trained on
three bootstrap samples of L. In tri-training, an
unlabeled datapoint in U is now labeled for a classifier, say c1, if the other two classifiers, c2 and c3, agree on
its label. Two classifiers inform
the third. If the two classifiers agree on a labeling, there is a good chance that they are right.
The algorithm stops when the classifiers no longer
change. The three classifiers are combined by majority voting. Li and Zhou (2005) show that under certain conditions the increase in classification
noise rate is compensated by the amount of newly
labeled data points.
The most important condition is that the three
classifiers are diverse. If the three classifiers are
identical, tri-training degenerates to self-training.
Diversity is obtained in Li and Zhou (2005) by
training classifiers on bootstrap samples. In their
experiments, they consider classifiers based on the
C4.5 algorithm, BP neural networks and naive
Bayes classifiers. The algorithm is sketched
in a simplified form in Figure 1; see Li and
Zhou (2005) for all the details.
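The simplified procedure can be sketched in a few lines of Python. This is an illustration only: the toy learner `train_classifier` (most frequent label per input, with a global fallback), the `max_rounds` cap, and the stopping test (the newly labeled sets no longer change between rounds, which for a deterministic learner implies the classifiers no longer change) are assumptions, not the setup used in the experiments.

```python
import random
from collections import Counter, defaultdict


def train_classifier(sample):
    """Toy learner (assumption): most frequent label per input, global fallback."""
    per_x = defaultdict(Counter)
    overall = Counter()
    for x, y in sample:
        per_x[x][y] += 1
        overall[y] += 1
    table = {x: c.most_common(1)[0][0] for x, c in per_x.items()}
    default = overall.most_common(1)[0][0]
    return lambda x: table.get(x, default)


def tri_train(L, U, rng, max_rounds=10):
    """Tri-training: each classifier is retrained on L plus the unlabeled
    points on which the other two classifiers agree."""
    # Three classifiers trained on three bootstrap samples of L.
    classifiers = [train_classifier([rng.choice(L) for _ in L]) for _ in range(3)]
    previous = None
    for _ in range(max_rounds):
        labeled = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Label x for classifier i when the other two agree on it.
            Li = [(x, classifiers[j](x)) for x in U
                  if classifiers[j](x) == classifiers[k](x)]
            labeled.append(Li)
            classifiers[i] = train_classifier(L + Li)
        if labeled == previous:   # nothing changed since the last round
            break
        previous = labeled

    def vote(x):
        # Majority vote over the three classifiers.
        return Counter(c(x) for c in classifiers).most_common(1)[0][0]

    return vote
```

A call such as `tri_train(L, U, random.Random(0))` returns the majority-vote classifier; passing an explicit random generator keeps the bootstrap samples reproducible.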
Tri-training has to the best of our knowledge not
been applied to POS tagging before, but it has been
applied to other NLP classification tasks, incl. Chinese chunking (Chen et al., 2006) and question
classification (Nguyen et al., 2008).
Each row represents a word and lists the gold
standard POS tag, the predicted POS tag and the
word cluster selected by Unsupos. For example,
the first word is labeled ’DT’, which SVMTool
correctly predicts, and it belongs to cluster 17 of
about 500 word clusters. The first column is blank
in the table for the unlabeled section.
Generally, the idea is that a classifier will learn
to trust SVMTool in some cases, but that it may
also learn that if SVMTool predicts a certain tag
for some word cluster the correct label is another
tag. This way of combining taggers into a single
end classifier can be seen as a form of stacking
(Wolpert, 1992). It has the advantage that it reduces POS tagging to a classification task. This
may simplify semi-supervised learning considerably.
Learning algorithms
We assume some knowledge of supervised learning algorithms. Most of our experiments are implementations of wrapper methods that call off1
The numbers provided by Unsupos refer to clusters; ”*”
marks out-of-vocabulary words.
for i ∈ {1..3} do
  Si ← bootstrap sample(L)
  ci ← train classifier(Si)
end for
repeat
  for i ∈ {1..3} do
    Li ← ∅
    for x ∈ U do
      if cj(x) = ck(x) (j, k ≠ i) then
        Li ← Li ∪ {(x, cj(x))}
      end if
    end for
    ci ← train classifier(L ∪ Li)
  end for
until none of ci changes
apply majority vote over ci
Figure 1: Tri-training (Li and Zhou, 2005).
Tri-training with disagreement
We introduce a possible improvement of the tri-training algorithm. If we replace lines 9–10 of the algorithm in Figure 1 with the lines
if cj(x) = ck(x) ≠ ci(x) (j, k ≠ i) then
  Li ← Li ∪ {(x, cj(x))}
end if
two classifiers, say c1 and c2, only label a datapoint for the third classifier, c3, if c1 and c2 agree on its label, but c3 disagrees. The intuition is that we only want to strengthen a classifier in its weak points, and we want to avoid skewing our labeled data by easy data points. Finally, since tri-training with disagreement imports less unlabeled data, it is much more efficient than tri-training. To the best of our knowledge, no one has applied tri-training with disagreement to real-life classification tasks before.
Our results are presented in Figure 2. The stacking result was obtained by training an SVM on top of the predictions of SVMTool and the word clusters of Unsupos. SVMs performed better than decision trees, bagging and random forests on our development section, but improvements on test data were modest. Tri-training refers to the original algorithm sketched in Figure 1 with C4.5 as learning algorithm. Since tri-training degenerates to self-training if the three classifiers are trained on the same sample, we used our implementation of tri-training to obtain self-training results and validated our results by a simpler implementation. We varied poolsize to optimize self-training. Finally, we list results for a technique called co-forests (Li and Zhou, 2007), which is a recent alternative to tri-training presented by the same authors, and for tri-training with disagreement (tri-disagr). The p-values are computed using 10,000 stratified shuffles.
Tri-training and tri-training with disagreement gave the best results. Note that since tri-training leads to much better results than stacking alone, it is unlabeled data that gives us most of the improvement, not the stacking itself. The difference between tri-training and self-training is near-significant (p < 0.0150). It seems that tri-training with disagreement is a competitive technique in terms of accuracy. The main advantage of tri-training with disagreement compared to ordinary tri-training, however, is that it is very efficient. This is reflected by the average number of tokens in Li over the three learners in the worst round of
av. tokens in Li
Note also that self-training gave very good results. Self-training was, again, much slower than tri-training with disagreement since we had to train on a large pool of unlabeled data (but only once). Of course this is not a standard self-training set-up, but self-training informed by unsupervised word clusters.
Figure 2: Results on Wall Street Journal Sect. 22-24 with different semi-supervised methods.
Follow-up experiments
SVMTool is one of the most accurate POS taggers available. This means that the predictions that are added to the labeled data are of very high quality. To test if our semi-supervised learning methods were sensitive to the quality of the input taggers we repeated the self-training and tri-training experiments with a less competitive POS tagger, namely the maximum entropy-based POS tagger first described in (Ratnaparkhi, 1998) that comes with the maximum entropy library in (Zhang, 2004). Results are presented as the second line in Figure 2. Note that error reduction is much lower in this case.
This paper first shows how stacking can be used to reduce POS tagging to a classification task. This reduction seems to enable robust semi-supervised learning. The technique was used to improve the accuracy of a state-of-the-art POS tagger, namely SVMTool. Four semi-supervised learning methods were tested, incl. self-training, tri-training, co-forests and tri-training with disagreement. All methods increased the accuracy of SVMTool significantly. Error reduction on Wall Street Journal Sect. 22-24 was 4.2%, which is comparable to related work in the literature, e.g. Suzuki and Isozaki (2008) (7%) and Spoustova et al. (2009)
Chris Biemann. 2006. Unsupervised part-of-speech tagging employing efficient graph clustering. In COLING-ACL Student Session, Sydney, Australia.
Wenliang Chen, Yujie Zhang, and Hitoshi Isahara. 2006. Chinese chunking with tri-training learning. In Computer processing of oriental languages, pages 466–473. Springer, Berlin, Germany.
Stephen Clark, James Curran, and Mike Osborne. 2003. Bootstrapping POS taggers using unlabeled data. In CONLL, Edmonton, Canada.
Jesus Gimenez and Lluis Marquez. 2004. SVMTool: a general POS tagger generator based on support vector machines. In LREC, Lisbon, Portugal.
Zhongqiang Huang, Vladimir Eidelman, and Mary Harper. 2009. Improving a simple bigram HMM part-of-speech tagger by latent annotation and self-training. In NAACL-HLT, Boulder, CO.
Ming Li and Zhi-Hua Zhou. 2005. Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering,
Ming Li and Zhi-Hua Zhou. 2007. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Transactions on Systems, Man and Cybernetics, 37(6):1088–1098.
Mitchell Marcus, Mary Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.
Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics,
Tri Nguyen, Le Nguyen, and Akira Shimazu. 2008. Using semi-supervised learning for question classification. Journal of Natural Language Processing,
Ross Quinlan. 1993. Programs for machine learning. Morgan Kaufmann.
Adwait Ratnaparkhi. 1998. Maximum entropy models for natural language ambiguity resolution. Ph.D. thesis, University of Pennsylvania.
Anders Søgaard. 2009. Ensemble-based POS tagging of Italian. In IAAI-EVALITA, Reggio Emilia, Italy.
Drahomira Spoustova, Jan Hajic, Jan Raab, and Miroslav Spousta. 2009. Semi-supervised training for the averaged perceptron POS tagger. In EACL, Athens, Greece.
Jun Suzuki and Hideki Isozaki. 2008. Semi-supervised sequential labeling and segmentation using gigaword scale unlabeled data. In ACL, pages 665–673, Columbus, Ohio.
Wen Wang, Zhongqiang Huang, and Mary Harper. 2007. Semi-supervised learning for part-of-speech tagging of Mandarin transcribed speech. In ICASSP,
David Wolpert. 1992. Stacked generalization. Neural Networks, 5:241–259.
Le Zhang. 2004. Maximum entropy modeling toolkit for Python and C++. University of Edinburgh.
Efficient Optimization of an MDL-Inspired Objective Function for
Unsupervised Part-of-Speech Tagging
Ashish Vaswani¹, Adam Pauls², David Chiang¹
¹Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
²Computer Science Division
University of California at Berkeley
Soda Hall
Berkeley, CA 94720
[email protected]
The Minimum Description Length (MDL) principle is a method for model selection that trades off between the explanation of the data by the model and the complexity of the model itself. Inspired by the MDL principle, we develop an objective function for generative models that captures the description of the data by the model (log-likelihood) and the description of the model (model size). We also develop an efficient general search algorithm based on the MAP-EM framework to optimize this function. Since recent work has shown that minimizing the model size in a Hidden Markov Model for part-of-speech (POS) tagging leads to higher accuracies, we test our approach by applying it to this problem. The search algorithm involves a simple change to EM and achieves high POS tagging accuracies on both English and Italian data sets.
The Minimum Description Length (MDL) principle is a method for model selection that provides a generic solution to the overfitting problem (Barron et al., 1998). A formalization of Ockham's Razor, it says that the parameters are to be chosen that minimize the description length of the data given the model plus the description length of the model.
It has been successfully shown that minimizing the model size in a Hidden Markov Model (HMM) for part-of-speech (POS) tagging leads to higher accuracies than simply running the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Goldwater and Griffiths (2007) employ a Bayesian approach to POS tagging and use sparse Dirichlet priors to minimize model size. More recently, Ravi and Knight (2009) alternately minimize the model using an integer linear program and maximize likelihood using EM to achieve the highest accuracies on the task so far. However, in the latter approach, because there is no single objective function to optimize, it is not entirely clear how to generalize this technique to other problems. In this paper, inspired by the MDL principle, we develop an objective function for generative models that captures both the description of the data by the model (log-likelihood) and the description of the model (model size). By using a simple prior that encourages sparsity, we cast our problem as a search for the maximum a posteriori (MAP) hypothesis and present a variant of EM to approximately search for the minimum-description-length model. Applying our approach to the POS tagging problem, we obtain higher accuracies than both EM and Bayesian inference as reported by Goldwater and Griffiths (2007). On an Italian POS tagging task, we obtain even larger improvements. We find that our objective function correlates well with accuracy, suggesting that this technique might be useful for other problems.
Proceedings of the ACL 2010 Conference Short Papers, pages 209–214,
Uppsala, Sweden, 11-16 July 2010. © 2010
Association for Computational Linguistics
MAP EM with Sparse Priors
Objective function
In the unsupervised POS tagging task, we are given a word sequence w = w1, ..., wN and want to find the best tagging t = t1, ..., tN, where ti ∈ T, the tag vocabulary. We adopt the problem formulation of Merialdo (1994), in which we are given a dictionary of possible tags for each word.
We define a bigram HMM

$$P(\mathbf{w}, \mathbf{t} \mid \theta) = \prod_{i=1}^{N} P(w_i \mid t_i) \cdot P(t_i \mid t_{i-1}) \tag{1}$$

In maximum likelihood estimation, the goal is to find parameter estimates

$$\hat{\theta} = \arg\max_{\theta} \log P(\mathbf{w} \mid \theta) \tag{2}$$

$$\phantom{\hat{\theta}} = \arg\max_{\theta} \log \sum_{\mathbf{t}} P(\mathbf{w}, \mathbf{t} \mid \theta) \tag{3}$$
The EM algorithm can be used to find a solution. However, we would like to maximize likelihood and minimize the size of the model simultaneously. We define the size of a model as the number of non-zero probabilities in its parameter vector. Let θ1, ..., θn be the components of θ. We would like to find

$$\hat{\theta} = \arg\min_{\theta} \Bigl( -\log P(\mathbf{w} \mid \theta) + \alpha \lVert\theta\rVert_0 \Bigr) \tag{4}$$

where ‖θ‖0, called the L0 norm of θ, simply counts the number of non-zero parameters in θ. The hyperparameter α controls the tradeoff between likelihood maximization and model minimization. Note the similarity of this objective function with MDL's, where α would be the space (measured in nats) needed to describe one parameter of the model.
Unfortunately, minimization of the L0 norm is known to be NP-hard (Hyder and Mahata, 2009). It is not smooth, making it unamenable to gradient-based optimization algorithms. Therefore, we use a smoothed approximation,

$$\lVert\theta\rVert_0 \approx \sum_i \bigl( 1 - e^{-\theta_i/\beta} \bigr) \tag{5}$$

where 0 < β ≤ 1 (Mohimani et al., 2007). For smaller values of β, this closely approximates the desired function (Figure 1). Inverting signs and ignoring constant terms, our objective function is

$$\hat{\theta} = \arg\max_{\theta} \Bigl( \log P(\mathbf{w} \mid \theta) + \alpha \sum_i e^{-\theta_i/\beta} \Bigr) \tag{6}$$

Figure 1: Ideal model-size term and its approximations. (Vertical axis: function values.)
We can think of the approximate model size as a kind of prior:

$$P(\theta) = \frac{1}{Z} \exp\Bigl( \alpha \sum_i e^{-\theta_i/\beta} \Bigr) \tag{7}$$

$$\log P(\theta) = \alpha \cdot \sum_i e^{-\theta_i/\beta} - \log Z \tag{8}$$

where $Z = \int d\theta \, \exp\bigl( \alpha \sum_i e^{-\theta_i/\beta} \bigr)$ is a normalization constant. Then our goal is to find the maximum a posteriori parameter estimate, which we find using MAP-EM (Bishop, 2006):

$$\hat{\theta} = \arg\max_{\theta} \log P(\mathbf{w}, \theta) \tag{9}$$

$$\phantom{\hat{\theta}} = \arg\max_{\theta} \bigl( \log P(\mathbf{w} \mid \theta) + \log P(\theta) \bigr) \tag{10}$$

Substituting (8) into (10) and ignoring the constant term log Z, we get our objective function (6) again.
We can exercise finer control over the sparsity of the tag-bigram and channel probability distributions by using a different α for each:

$$\arg\max_{\theta} \Bigl( \log P(\mathbf{w} \mid \theta) + \alpha_c \sum_{w,t} e^{-P(w \mid t)/\beta} + \alpha_t \sum_{t,t'} e^{-P(t' \mid t)/\beta} \Bigr) \tag{11}$$

In our experiments, we set αc = 0 since previous work has shown that minimizing the number of tag n-gram parameters is more important (Ravi and Knight, 2009; Goldwater and Griffiths, 2007).
A common method for preferring smaller models is minimizing the L1 norm, $\sum_i |\theta_i|$. However, for a model which is a product of multinomial distributions, the L1 norm is a constant.

$$\sum_i |\theta_i| = \sum_i \theta_i = \sum_t \Bigl( \sum_w P(w \mid t) + \sum_{t'} P(t' \mid t) \Bigr) = 2|T|$$

Therefore, we cannot use the L1 norm as part of the size term as the result will be the same as the EM algorithm.
Parameter optimization
To optimize (11), we use MAP EM, which is an iterative search procedure. The E step is the same as in standard EM, which is to calculate $P(\mathbf{t} \mid \mathbf{w}, \theta^t)$, where the $\theta^t$ are the parameters in the current iteration t. The M step in iteration (t + 1) looks like

$$\theta^{t+1} = \arg\max_{\theta} \Bigl( E_{P(\mathbf{t} \mid \mathbf{w}, \theta^t)} \bigl[ \log P(\mathbf{w}, \mathbf{t} \mid \theta) \bigr] + \alpha_t \sum_{t,t'} e^{-P(t' \mid t)/\beta} \Bigr) \tag{12}$$

Let $C(t, w; \mathbf{t}, \mathbf{w})$ count the number of times the word w is tagged as t in $\mathbf{t}$, and $C(t, t'; \mathbf{t})$ the number of times the tag bigram (t, t′) appears in $\mathbf{t}$. We can rewrite the M step as

$$\theta^{t+1} = \arg\max_{\theta} \Bigl( \sum_t \sum_w E[C(t, w)] \log P(w \mid t) + \sum_t \sum_{t'} \bigl( E[C(t, t')] \log P(t' \mid t) + \alpha_t e^{-P(t' \mid t)/\beta} \bigr) \Bigr) \tag{13}$$

subject to the constraints $\sum_w P(w \mid t) = 1$ and $\sum_{t'} P(t' \mid t) = 1$. Note that we can optimize each term of both summations over t separately. For each t, the term

$$\sum_w E[C(t, w)] \log P(w \mid t) \tag{14}$$

is easily optimized as in EM: just let P(w | t) ∝ E[C(t, w)]. But the term

$$\sum_{t'} \bigl( E[C(t, t')] \log P(t' \mid t) + \alpha_t e^{-P(t' \mid t)/\beta} \bigr) \tag{15}$$

is trickier. This is a non-convex optimization problem for which we invoke a publicly available constrained optimization tool, ALGENCAN (Andreani et al., 2007). To carry out its optimization, ALGENCAN requires computation of the following in every iteration:
• Objective function, defined in equation (15). This is calculated in polynomial time using dynamic programming.
• Gradient of objective function:

$$\frac{\partial F}{\partial P(t' \mid t)} = \frac{E[C(t, t')]}{P(t' \mid t)} - \frac{\alpha_t}{\beta} e^{-P(t' \mid t)/\beta}$$

• Constraints: $g_t = \sum_{t'} P(t' \mid t) - 1 = 0$ for each tag t ∈ T. Also, we constrain P(t′ | t) to the interval [ε, 1].¹
• Gradient of equality constraints:

$$\frac{\partial g_t}{\partial P(t'' \mid t')} = \begin{cases} 1 & \text{if } t = t' \\ 0 & \text{otherwise} \end{cases}$$

• Hessian of objective function, which is not required but greatly speeds up the optimization:

$$\frac{\partial^2 F}{\partial P(t' \mid t)\, \partial P(t' \mid t)} = -\frac{E[C(t, t')]}{P(t' \mid t)^2} + \frac{\alpha_t}{\beta^2} e^{-P(t' \mid t)/\beta}$$

The other second-order partial derivatives are all zero, as are those of the equality constraints.
We perform this optimization for each instance of (15). These optimizations could easily be performed in parallel for greater scalability.
¹ We must have ε > 0 because of the log P(t′ | t) term in equation (15). It seems reasonable to set ε ≪ 1/N; in our experiments, we set ε = 10⁻⁷.
We carried out POS tagging experiments on English and Italian.
English POS tagging
To set the hyperparameters αt and β, we prepared three held-out sets H1, H2, and H3 from the Penn Treebank. Each Hi comprised about 24,000 words annotated with POS tags. We ran MAP-EM for 100 iterations, with uniform probability initialization, for a suite of hyperparameters and averaged their tagging accuracies over the three held-out sets. The results are presented in Table 2. We then picked the hyperparameter setting with the highest average accuracy. These were αt = 80, β = 0.05. We then ran MAP-EM again on the test data with these hyperparameters and achieved a tagging accuracy of 87.4% (see Table 1). This is higher than the 85.2% that Goldwater and Griffiths (2007) obtain using Bayesian methods for inferring both POS tags and hyperparameters. It is much higher than the 82.4% that standard EM achieves on the test set when run for 100 iterations.
Using αt = 80, β = 0.05, we ran multiple random restarts on the test set (see Figure 2). We find that the objective function correlates well with accuracy, and picking the point with the highest objective function value achieves 87.1% accuracy.
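As an illustration of the quantities a constrained optimizer needs for the per-tag subproblem in equation (15), the following Python sketch evaluates the smoothed L0 term of equation (5) together with the objective, its gradient and the diagonal of its Hessian. It is a sketch under stated assumptions only: the function names are hypothetical, `p[k]` stands for P(t′ | t) over the successors of one fixed tag t, `counts[k]` stands for the expected count E[C(t, t′)], and no call to ALGENCAN is made.

```python
import math

def smoothed_l0(theta, beta):
    """Smoothed L0 norm, equation (5): sum_i (1 - exp(-theta_i / beta))."""
    return sum(1.0 - math.exp(-p / beta) for p in theta)

def objective(p, counts, alpha_t, beta):
    """Equation (15) for one tag t: sum of E[C] log P + alpha_t exp(-P / beta)."""
    return sum(c * math.log(q) + alpha_t * math.exp(-q / beta)
               for q, c in zip(p, counts))

def gradient(p, counts, alpha_t, beta):
    """Partial derivative of (15) with respect to each P(t' | t)."""
    return [c / q - (alpha_t / beta) * math.exp(-q / beta)
            for q, c in zip(p, counts)]

def hessian_diag(p, counts, alpha_t, beta):
    """Diagonal second derivatives of (15); all off-diagonal entries are zero."""
    return [-c / q ** 2 + (alpha_t / beta ** 2) * math.exp(-q / beta)
            for q, c in zip(p, counts)]
```

Because every entry of `p` appears inside a logarithm, the inputs must be strictly positive, which is what the lower bound ε on P(t′ | t) guarantees.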
Table 1: MAP-EM with an L0 norm achieves higher tagging accuracy on English than Goldwater and Griffiths (2007) and much higher than standard EM.
system                             accuracy (%)
Standard EM                        82.4
  + random restarts                84.5
(Goldwater and Griffiths, 2007)    85.2
our approach                       87.4
  + random restarts                87.1
Table 2: Average accuracies over three held-out sets for English.
Figure 2: Tagging accuracy vs. objective function for 1152 random restarts of MAP-EM with smoothed L0 norm (αt = 80, β = 0.05; test set, 24,115 words). Horizontal axis: objective function value; vertical axis: tagging accuracy.
Table 3: MAP-EM with a smoothed L0 norm yields much smaller models than standard EM. Rows: maximum possible; EM, 100 iterations; MAP-EM, 100 iterations. Columns: zero parameters; bigram types.
We also carried out the same experiment with standard EM (Figure 3), where picking the point with the highest corpus probability achieves 84.5% accuracy.
We also measured the minimization effect of the sparse prior against that of standard EM. Since our method lower-bounds all the parameters by ε, we consider a parameter θi as a zero if θi ≤ ε. We also measured the number of unique tag bigram types in the Viterbi tagging of the word sequence. Table 3 shows that our method produces much smaller models than EM, and produces Viterbi taggings with many fewer tag-bigram types.
Figure 3: Tagging accuracy vs. likelihood for 1152 random restarts of standard EM (test set, 24,115 words). Horizontal axis: objective function value; vertical axis: tagging accuracy.
Italian POS tagging
We also carried out POS tagging experiments on an Italian corpus from the Italian Turin University Treebank (Bos et al., 2009). This test set comprises 21,878 words annotated with POS tags and a dictionary for each word type. Since this is all the available data, we could not tune the hyperparameters on a held-out data set. Using the hyperparameters tuned on English (αt = 80, β = 0.05), we obtained 89.7% tagging accuracy (see Table 4), which was a large improvement over 81.2% that standard EM achieved. When we tuned the hyperparameters on the test set, the best setting (αt = 120, β = 0.05) gave an accuracy of 90.28%.
Table 4: Accuracies on test set for Italian.
A variety of other techniques in the literature have been applied to this unsupervised POS tagging task. Smith and Eisner (2005) use conditional random fields with contrastive estimation to achieve 88.6% accuracy. Goldberg et al. (2008) provide a linguistically-informed starting point for EM to achieve 91.4% accuracy. More recently, Chiang et al. (2010) use Gibbs sampling for Bayesian inference along with automatic run selection and achieve 90.7%.
In this paper, our goal has been to investigate whether EM can be extended in a generic way to use an MDL-like objective function that simultaneously maximizes likelihood and minimizes model size. We have presented an efficient search procedure that optimizes this function for generative models and demonstrated that maximizing this function leads to improvement in tagging accuracy over standard EM. We infer the hyperparameters of our model using held-out data and achieve better accuracies than Goldwater and Griffiths (2007). We have also shown that the objective function correlates well with tagging accuracy, supporting the MDL principle. Our approach performs quite well on POS tagging for both English and Italian. We believe that, like EM, our method can benefit from more unlabeled data, and there is reason to hope that the success of these experiments will carry over to other tasks as well.
We would like to thank Sujith Ravi, Kevin Knight and Steve DeNeefe for their valuable input, and Jason Baldridge for directing us to the Italian POS data. This research was supported in part by DARPA contract HR0011-06-C-0022 under subcontract to BBN Technologies and DARPA contract HR0011-09-1-0028.
R. Andreani, E. G. Birgin, J. M. Martínez, and M. L.
Schuverdt. 2007. On Augmented Lagrangian methods with general lower-level constraints. SIAM
Journal on Optimization, 18:1286–1309.
A. Barron, J. Rissanen, and B. Yu. 1998. The minimum description length principle in coding and
modeling. IEEE Transactions on Information Theory, 44(6):2743–2760.
C. Bishop. 2006. Pattern Recognition and Machine
Learning. Springer.
J. Bos, C. Bosco, and A. Mazzei. 2009. Converting a
dependency treebank to a categorical grammar treebank for Italian. In Eighth International Workshop
on Treebanks and Linguistic Theories (TLT8).
D. Chiang, J. Graehl, K. Knight, A. Pauls, and S. Ravi.
2010. Bayesian inference for Finite-State transducers. In Proceedings of the North American Association of Computational Linguistics.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977.
Maximum likelihood from incomplete data via the
EM algorithm. Computational Linguistics, 39(4):1–
Y. Goldberg, M. Adler, and M. Elhadad. 2008. EM can
find pretty good HMM POS-taggers (when given a
good start). In Proceedings of the ACL.
S. Goldwater and T. L. Griffiths. 2007. A fully
Bayesian approach to unsupervised part-of-speech
tagging. In Proceedings of the ACL.
M. Hyder and K. Mahata. 2009. An approximate L0
norm minimization algorithm for compressed sensing. In Proceedings of the 2009 IEEE International
Conference on Acoustics, Speech and Signal Processing.
B. Merialdo. 1994. Tagging English text with a
probabilistic model. Computational Linguistics,
H. Mohimani, M. Babaie-Zadeh, and C. Jutten. 2007.
Fast sparse representation based on smoothed L0
norm. In Proceedings of the 7th International Conference on Independent Component Analysis and
Signal Separation (ICA2007).
S. Ravi and K. Knight. 2009. Minimized models for
unsupervised part-of-speech tagging. In Proceedings of ACL-IJCNLP.
N. Smith and J. Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data.
In Proceedings of the ACL.
SVD and Clustering for Unsupervised POS Tagging
Michael Lamar*
Division of Applied Mathematics
Brown University
Providence, RI, USA
[email protected]
Yariv Maron*
Gonda Brain Research Center
Bar-Ilan University
Ramat-Gan, Israel
[email protected]
Mark Johnson
Department of Computing
Faculty of Science
Macquarie University
Sydney, Australia
[email protected]
Elie Bienenstock
Division of Applied Mathematics
and Department of Neuroscience
Brown University
Providence, RI, USA
[email protected]
We revisit the algorithm of Schütze
(1995) for unsupervised part-of-speech
tagging. The algorithm uses reduced-rank
singular value decomposition followed
by clustering to extract latent features
from context distributions. As implemented here, it achieves state-of-the-art
tagging accuracy at considerably less cost
than more recent methods. It can also
produce a range of finer-grained taggings, with potential applications to various tasks.
While supervised approaches are able to solve
the part-of-speech (POS) tagging problem with
over 97% accuracy (Collins 2002; Toutanova et
al. 2003), unsupervised algorithms perform considerably less well. These models attempt to tag
text without resources such as an annotated corpus, a dictionary, etc. The use of singular value
decomposition (SVD) for this problem was introduced in Schütze (1995). Subsequently, a
number of methods for POS tagging without a
dictionary were examined, e.g., by Clark (2000),
Clark (2003), Haghighi and Klein (2006), Johnson (2007), Goldwater and Griffiths (2007), Gao
and Johnson (2008), and Graça et al. (2009).
The latter two, using Hidden Markov Models
(HMMs), exhibit the highest performances to
date for fully unsupervised POS tagging.
The revisited SVD-based approach presented
here, which we call “two-step SVD” or SVD2,
has four important characteristics. First, it
achieves state-of-the-art tagging accuracy.
Second, it requires drastically less computational
effort than the best currently available models.
Third, it demonstrates that state-of-the-art accuracy can be realized without disambiguation, i.e.,
without attempting to assign different tags to different tokens of the same type. Finally, with no
significant increase in computational cost, SVD2
can create much finer-grained labelings than typically produced by other algorithms. When combined with some minimal supervision in postprocessing, this makes the approach useful for
tagging languages that lack the resources required by fully supervised models.
Following the original work of Schütze (1995),
we begin by constructing a right context matrix,
R, and a left context matrix, L. Rij counts the
number of times in the corpus a token of word
type i is immediately followed by a token of
word type j. Similarly, Lij counts the number of
times a token of type i is preceded by a token of
type j. We truncate these matrices, including, in
the right and left contexts, only the w1 most frequent word types. The resulting L and R are of
dimension Ntypes×w1, where Ntypes is the number
of word types (spelling forms) in the corpus, and
w1 is set to 1000. (The full Ntypes × Ntypes context
matrices satisfy R = Lᵀ.)
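The construction of the two truncated context matrices can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB code: the function name `context_matrices` is hypothetical, word types are indexed by decreasing frequency (an assumption), and plain nested lists stand in for sparse matrices.

```python
from collections import Counter

def context_matrices(tokens, w1):
    """Return (types, L, R); columns are restricted to the w1 most frequent types."""
    freq = Counter(tokens)
    types = [w for w, _ in freq.most_common()]      # rows: all word types
    row = {w: i for i, w in enumerate(types)}
    col = {w: j for j, w in enumerate(types[:w1])}  # columns: frequent types only
    L = [[0] * len(col) for _ in types]
    R = [[0] * len(col) for _ in types]
    for prev, nxt in zip(tokens, tokens[1:]):
        if nxt in col:
            R[row[prev]][col[nxt]] += 1   # a token of prev is followed by nxt
        if prev in col:
            L[row[nxt]][col[prev]] += 1   # a token of nxt is preceded by prev
    return types, L, R
```

When `w1` equals the number of word types, no column is dropped and the two matrices are transposes of each other.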
* These authors contributed equally.
Proceedings of the ACL 2010 Conference Short Papers, pages 215–219,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
Next, both context matrices are factored using
singular value decomposition:
L = UL SL VLᵀ
R = UR SR VRᵀ.
The diagonal matrices SL and SR (each of rank
1000) are reduced down to rank r1 = 100 by replacing the 900 smallest singular values in each
matrix with zeros, yielding SL* and SR*. We then
form a pair of latent-descriptor matrices defined by:
L* = UL SL*
R* = UR SR*.
Row i in matrix L* (resp. R*) is the left (resp.
right) latent descriptor for word type i. We next
include a normalization step in which each row
in each of L* and R* is scaled to unit length,
yielding matrices L** and R**. Finally, we form a
single descriptor matrix D by concatenating these
matrices into D = [L** R**]. Row i in matrix D is
the complete latent descriptor for word type i;
this latent descriptor sits on the Cartesian product
of two 100-dimensional unit spheres, hereafter
the 2-sphere.
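The descriptor construction above (reduced-rank SVD, scaling by the singular values, row normalization, concatenation) can be sketched with NumPy. This is an illustrative sketch, not the authors' MATLAB implementation: the function name `latent_descriptors` is hypothetical, and the all-zero columns left by rank reduction are dropped, which is an assumption that leaves dot products between descriptors unchanged.

```python
import numpy as np

def latent_descriptors(L, R, rank):
    """Build D = [L** R**] from the left and right context count matrices."""
    def half(M):
        # Singular values are returned in decreasing order.
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        X = U[:, :rank] * s[:rank]          # rows of U S*, zero columns dropped
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.where(norms == 0, 1.0, norms)  # scale rows to unit length
    return np.hstack([half(L), half(R)])
```

Each row of the result consists of two unit-length halves, so the dot product of two rows is the sum of the cosines computed on the left and right parts.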
We next categorize these descriptors into
k1 = 500 groups, using a k-means clustering algorithm. Centroid initialization is done by placing
the k initial centroids on the descriptors of the k
most frequent words in the corpus. As the descriptors sit on the 2-sphere, we measure the
proximity of a descriptor to a centroid by the dot
product between them; this is equal to the sum of
the cosines of the angles—computed on the left
and right parts—between them. We update each
cluster’s centroid as the weighted average of its
constituents, the weight being the frequency of
the word type; the centroids are then scaled, so
they sit on the 2-sphere. Typically, only a few
dozen iterations are required for full convergence
of the clustering algorithm.
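The weighted k-means step on the 2-sphere might look like the following NumPy sketch. It rests on several assumptions that the text does not fix: rows of `D` are sorted by decreasing word frequency so that the first k rows are the initial centroids, each half of a centroid is rescaled to unit length separately, and the iteration cap and convergence test are arbitrary choices. The name `kmeans_2sphere` is hypothetical.

```python
import numpy as np

def kmeans_2sphere(D, freq, k, iters=50):
    """Cluster descriptors D (Ntypes x 2r, unit-length halves) with frequency weights."""
    r = D.shape[1] // 2
    C = D[:k].copy()                          # centroids start on the k most frequent types
    for _ in range(iters):
        # Proximity is the dot product: sum of the left and right cosines.
        assign = np.argmax(D @ C.T, axis=1)
        new_C = C.copy()
        for c in range(k):
            members = assign == c
            if members.any():
                m = (freq[members][:, None] * D[members]).sum(axis=0)
                for h in (slice(0, r), slice(r, 2 * r)):
                    n = np.linalg.norm(m[h])
                    if n > 0:
                        m[h] = m[h] / n       # put the centroid back on the 2-sphere
                new_C[c] = m
        if np.allclose(new_C, C):
            break
        C = new_C
    return assign, C
```

The frequency weights only affect the direction of each centroid, since the centroid is renormalized after averaging.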
We then apply a second pass of this entire
SVD-and-clustering procedure. In this second
pass, we use the k1 = 500 clusters from the first
iteration to assemble a new pair of context matrices. Now, Rij counts all the cluster-j (j=1… k1)
words to the right of word i, and Lij counts all the
cluster-j words to the left of word i. The new matrices L and R have dimension Ntypes × k1.
As in the first pass, we perform reduced-rank
SVD, this time down to rank r2 = 300, and we
again normalize the descriptors to unit length,
yielding a new pair of latent descriptor matrices
L** and R**. Finally, we concatenate L** and R**
into a single matrix of descriptors, and cluster
these descriptors into k2 groups, where k2 is the
desired number of induced tags. We use the same
weighted k-means algorithm as in the first pass,
again placing the k initial centroids on the descriptors of the k most frequent words in the corpus. The final tag of any token in the corpus is
the cluster number of its type.
Data and Evaluation
We ran the SVD2 algorithm described above on
the full Wall Street Journal part of the Penn
Treebank (1,173,766 tokens). Capitalization was
ignored, resulting in Ntypes = 43,766, with only a
minor effect on accuracy. Evaluation was done
against the POS-tag annotations of the 45-tag
PTB tagset (hereafter PTB45), and against the
Smith and Eisner (2005) coarse version of the
PTB tagset (hereafter PTB17). We selected the
three evaluation criteria of Gao and Johnson
(2008): M-to-1, 1-to-1, and VI. M-to-1 and 1-to-1 are the tagging accuracies under the best many-to-one map and the greedy one-to-one map respectively; VI is a map-free information-theoretic criterion, see Gao and Johnson (2008)
for details. Although we find M-to-1 to be the
most reliable criterion of the three, we include
the other two criteria for completeness.
In addition to the best M-to-1 map, we also
employ here, for large values of k2, a prototype-based M-to-1 map. To construct this map, we
first find, for each induced tag t, the word type
with which it co-occurs most frequently; we call
this word type the prototype of t. We then query
the annotated data for the most common gold tag
for each prototype, and we map induced tag t to
this gold tag. This prototype-based M-to-1 map
produces accuracy scores no greater—typically
lower—than the best M-to-1 map. We discuss
the value of this approach as a minimally-supervised post-processing step in Section 5.
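The two many-to-one maps described above can be sketched in Python as follows. The function names are hypothetical, and the inputs are assumed to be three parallel token-level lists (word, induced tag, gold tag); ties are broken by first occurrence, which the text does not specify.

```python
from collections import Counter, defaultdict

def best_many_to_one(induced, gold):
    """Map each induced tag to the gold tag it co-occurs with most often."""
    co = defaultdict(Counter)
    for t, g in zip(induced, gold):
        co[t][g] += 1
    return {t: c.most_common(1)[0][0] for t, c in co.items()}

def prototype_many_to_one(words, induced, gold):
    """Map each induced tag to the most common gold tag of its prototype word type."""
    by_tag, by_word = defaultdict(Counter), defaultdict(Counter)
    for w, t, g in zip(words, induced, gold):
        by_tag[t][w] += 1     # how often tag t co-occurs with word type w
        by_word[w][g] += 1    # gold tag distribution of word type w
    mapping = {}
    for t, c in by_tag.items():
        prototype = c.most_common(1)[0][0]
        mapping[t] = by_word[prototype].most_common(1)[0][0]
    return mapping

def accuracy(mapping, induced, gold):
    """Token-level accuracy of the induced tagging under a tag mapping."""
    return sum(mapping[t] == g for t, g in zip(induced, gold)) / len(gold)
```

The prototype-based map only needs the gold tag of one word type per induced tag, which is why it can serve as a minimally supervised post-processing step.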
Low-k performance. Here we present the performance of the SVD2 model when k2, the number of induced tags, is the same or roughly the
same as the number of tags in the gold standard—hence small. Table 1 compares the performance of SVD2 to other leading models. Following Gao and Johnson (2008), the number of
induced tags is 17 for PTB17 evaluation and 50
for PTB45 evaluation. Thus, with the exception
of Graça et al. (2009) who use 45 induced tags
for PTB45, the number of induced tags is the
same across each column of Table 1.
Table 1. Tagging accuracy under the best M-to-1 map, the greedy 1-to-1 map, and
VI, for the full PTB45 tagset and the reduced PTB17 tagset. HMM-EM, HMM-VB
and HMM-GS show the best results from Gao and Johnson (2008); HMM-Sparse(32)
and VEM (10^-1, 10^-1) show the best results from Graça et al. (2009).
The performance of SVD2 compares favorably to the HMM models. Note that SVD2 is a
deterministic algorithm. The table shows, in parentheses, the standard deviations reported in
Graça et al. (2009). For the sake of comparison
with Graça et al. (2009), we also note that, with
k2 = 45, SVD2 scores 0.659 on PTB45. The NVI
scores (Reichart and Rappoport 2009) corresponding to the VI scores for SVD2 are 0.938 for
PTB17 and 0.885 for PTB45. To examine the
sensitivity of the algorithm to its four parameters,
w1, r1, k1, and r2, we changed each of these parameters separately by a multiplicative factor of
either 0.5 or 2; in neither case did M-to-1 accuracy drop by more than 0.014.
This performance was achieved despite the
fact that the SVD2 tagger is mathematically
much simpler than the other models. Our MATLAB implementation of SVD2 takes only a few
minutes to run on a desktop computer, in contrast
to HMM training times of several hours or days
(Gao and Johnson 2008; Johnson 2007).
High-k performance. Not suffering from the
same computational limitations as other models,
SVD2 can easily accommodate high numbers of
induced tags, resulting in fine-grained labelings.
The value of this flexibility is discussed in the
next section. Figure 1 shows, as a function of k2,
the tagging accuracy of SVD2 under both the
best and the prototype-based M-to-1 maps (see
Section 3), for both the PTB45 and the PTB17
tagsets. The horizontal one-tag-per-word-type
line in each panel is the theoretical upper limit
for tagging accuracy in non-disambiguating
models (such as SVD2). This limit is the fraction
of all tokens in the corpus whose gold tag is the
most frequent for their type.
At the heart of the algorithm presented here is the reduced-rank SVD method of Schütze
(1995), which transforms bigram counts into latent descriptors. In view of the present work,
which achieves state-of-the-art performance when evaluation is done with the criteria now in
common use, Schütze's original work should rightly be praised as ahead of its time. The SVD2
model presented here differs from Schütze's work in many details of implementation, not all
of which are explicitly specified in Schütze (1995). In what follows, we discuss the features
of SVD2 that are most critical to its performance. Failure to incorporate any one of them
significantly reduces the performance of the algorithm (M-to-1 reduced by 0.04 to 0.08).
Figure 1. Performance of the SVD2 algorithm as a function of the number of induced
tags. Top: PTB45; bottom: PTB17. Each plot shows the tagging accuracy under the
best and the prototype-based M-to-1 maps, as well as the upper limit for
non-disambiguating taggers.
First, the reduced-rank left-singular vectors
(for the right and left context matrices) are
scaled, i.e., multiplied, by the singular values.
While the resulting descriptors, the rows of L*
and R*, live in a much lower-dimensional space
than the original context vectors, they are
mapped by an angle-preserving map (defined by
the matrices of right-singular vectors VL and VR)
into vectors in the original space. These mapped
vectors best approximate (in the least-squares
sense) the original context vectors; they have the
same geometric relationships as their equivalent
high-dimensional images, making them good
candidates for the role of word-type descriptors.
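The geometric claim above can be checked numerically. The following is a minimal sketch in Python with NumPy, not the authors' MATLAB implementation; the toy count matrix, the rank r = 3, and the function name `scaled_descriptors` are illustrative assumptions. It verifies that the low-dimensional scaled descriptors have the same pairwise inner products (and hence the same angles) as the rows of the rank-r approximation in the original space.

```python
import numpy as np

def scaled_descriptors(context_counts, r):
    """Return rank-r descriptors (left-singular vectors times singular values)
    together with the matching right-singular vectors."""
    u, s, vt = np.linalg.svd(context_counts, full_matrices=False)
    # Scale each retained left-singular vector by its singular value.
    descriptors = u[:, :r] * s[:r]
    return descriptors, vt[:r, :]

# Toy word-type-by-context count matrix (assumption: 6 types, 10 contexts).
rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(6, 10)).astype(float)
desc, vt_r = scaled_descriptors(counts, r=3)

# Mapping the descriptors through the right-singular vectors gives the
# rank-3 least-squares approximation of the original context vectors.
approx = desc @ vt_r

# The rows of vt_r are orthonormal, so the mapping preserves inner products.
same_geometry = np.allclose(desc @ desc.T, approx @ approx.T)
```

Without the scaling step, the rows of the unscaled left-singular matrix would not in general share this geometry with the approximated context vectors.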
A second important feature of the SVD2 algorithm is the unit-length normalization of the latent descriptors, along with the computation of
cluster centroids as the weighted averages of
their constituent vectors. Thanks to this combined device, rare words are treated equally to
frequent words regarding the length of their descriptor vectors, yet contribute less to the placement of centroids.
Finally, while the usual drawback of k-means clustering algorithms is the dependency of the
outcome on the initial—usually random—
placement of centroids, our initialization of the k
centroids as the descriptors of the k most frequent word types in the corpus makes the algorithm fully deterministic, and improves its performance substantially: M-to-1 PTB45 by 0.043,
M-to-1 PTB17 by 0.063.
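The last two features can be sketched together. The code below is a hedged illustration in Python with NumPy, not the SVD2 implementation: the function name `weighted_kmeans`, the use of squared Euclidean distance, the fixed iteration count, and the toy data are all assumptions. It normalizes descriptors to unit length, starts the centroids at the descriptors of the k most frequent word types, and updates each centroid as the frequency-weighted average of its members.

```python
import numpy as np

def weighted_kmeans(descriptors, freqs, k, n_iter=20):
    """k-means on unit-length rows; centroids are frequency-weighted means."""
    x = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    # Deterministic start: descriptors of the k most frequent word types.
    order = np.argsort(-freqs, kind="stable")
    centroids = x[order[:k]].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Assign each word type to its nearest centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = labels == j
            if members.any():
                # Rare types count less toward the centroid position.
                centroids[j] = np.average(
                    x[members], axis=0, weights=freqs[members]
                )
    return labels, centroids

# Toy descriptors for five word types and their corpus frequencies.
toy = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8], [5.0, 0.4]])
freqs = np.array([100.0, 5.0, 80.0, 3.0, 1.0])
labels, centroids = weighted_kmeans(toy, freqs, k=2)
```

Because nothing in the procedure is random, repeated runs on the same input return the same labeling.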
As noted in the Results section, SVD2 is fairly
robust to changes in all four parameters w1, r1, k1,
and r2. The values used here were obtained by a
coarse, greedy strategy, where each parameter
was optimized independently. It is worth noting
that dispensing with the second pass altogether,
i.e., clustering directly the latent descriptor vectors obtained in the first pass into the desired
number of induced tags, results in a drop of
Many-to-1 score of only 0.021 for the PTB45
tagset and 0.009 for the PTB17 tagset.
Disambiguation. An obvious limitation of
SVD2 is that it is a non-disambiguating tagger,
assigning the same label to all tokens of a type.
However, this limitation per se is unlikely to be
the main obstacle to the improvement of low-k
performance, since, as is well known, the theoretical upper limit for the tagging accuracy of
non-disambiguating models (shown in Fig. 1) is
much higher than the current state-of-the-art for
unsupervised taggers, whether disambiguating or not.
To further gain insight into how successful
current models are at disambiguating when they
have the power to do so, we examined a collection of HMM-VB runs (Gao and Johnson 2008)
and asked how the accuracy scores would change
if, after training was completed, the model were
forced to assign the same label to all tokens of
the same type. To answer this question, we determined, for each word type, the modal HMM
state, i.e., the state most frequently assigned by
the HMM to tokens of that type. We then relabeled all words with their modal label. The effect of thus eliminating the disambiguation capacity of the model was to slightly increase the
tagging accuracy under the best M-to-1 map for
every HMM-VB run (the average increase was
0.026 for PTB17, and 0.015 for PTB45). We
view this as a further indication that, in the current state of the art and with regard to tagging
accuracy, limiting oneself to non-disambiguating
models may not adversely affect performance.
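The relabeling test described above is easy to reproduce on any tagger output. The following minimal Python sketch uses hypothetical helper names (`modal_relabel`, `many_to_one_accuracy`) and a six-token toy corpus invented for illustration; it is not the evaluation code used for the HMM-VB runs.

```python
from collections import Counter, defaultdict

def modal_relabel(words, states):
    """Replace each token's state with the most frequent state of its type."""
    by_type = defaultdict(Counter)
    for w, s in zip(words, states):
        by_type[w][s] += 1
    modal = {w: c.most_common(1)[0][0] for w, c in by_type.items()}
    return [modal[w] for w in words]

def many_to_one_accuracy(states, gold):
    """Map each induced state to its most frequent gold tag, then score."""
    by_state = defaultdict(Counter)
    for s, g in zip(states, gold):
        by_state[s][g] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_state.values())
    return correct / len(gold)

# Toy data: the tagger splits "run" across two states.
words = ["the", "run", "the", "run", "run", "dog"]
gold = ["DT", "VB", "DT", "VB", "VB", "NN"]
states = [0, 1, 0, 2, 1, 2]

before = many_to_one_accuracy(states, gold)
after = many_to_one_accuracy(modal_relabel(words, states), gold)
```

In this toy case the accuracy rises from 5/6 to 1 after relabeling, which mirrors the direction of the effect reported for the HMM-VB runs; a model that disambiguated effectively would instead show a drop.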
To the contrary, this limitation may actually
benefit an approach such as SVD2. Indeed, on
difficult learning tasks, simpler models often behave better than more powerful ones (Geman et
al. 1992). HMMs are powerful since they can, in
theory, induce both a system of tags and a system
of contextual patterns that allow them to disambiguate word types in terms of these tags. However, carrying out both of these unsupervised
learning tasks at once is problematic in view of
the very large number of parameters to be estimated compared to the size of the training data.
The POS-tagging subtask of disambiguation
may then be construed as a challenge in its own
right: demonstrate effective disambiguation in an
unsupervised model. Specifically, show that tagging accuracy decreases when the model's disambiguation capacity is removed, by re-labeling
all tokens with their modal label, defined above.
We believe that the SVD2 algorithm presented
here could provide a launching pad for an approach that would successfully address the disambiguation challenge. It would do so by allowing a gradual and carefully controlled amount of
ambiguity into an initially non-disambiguating
model. This is left for future work.
Fine-grained labeling. An important feature of
the SVD2 algorithm is its ability to produce a
fine-grained labeling of the data, using a number
of clusters much larger than the number of tags
in a syntax-motivated POS-tag system. Such
fine-grained labelings can capture additional linguistic features. To achieve a fine-grained labeling, only the final clustering step in the SVD2
algorithm needs to be changed; the computational cost this entails is negligible. A high-quality
fine-grained labeling, such as achieved by the
SVD2 approach, may be of practical interest as
an input to various types of unsupervised grammar-induction algorithms (Headden et al. 2008).
This application is left for future work.
Prototype-based tagging. One potentially important practical application of a high-quality
fine-grained labeling is its use for languages
which lack any kind of annotated data. By first
applying the SVD2 algorithm, word types are
grouped together into a few hundred clusters.
Then, a prototype word is automatically extracted from each cluster. This produces, in a
completely unsupervised way, a list of only a
few hundred words that need to be hand-tagged
by an expert. The results shown in Fig. 1 indicate
that these prototype tags can then be used to tag
the entire corpus with only a minor decrease in
accuracy compared to the best M-to-1 map—the
construction of which requires a fully annotated
corpus. Fig. 1 also indicates that, with only a few
hundred prototypes, the gap left between the accuracy thus achieved and the upper bound for
non-disambiguating models is fairly small.
Alexander Clark. 2000. Inducing syntactic categories
by context distribution clustering. In The Fourth
Conference on Natural Language Learning.
Alexander Clark. 2003. Combining distributional and
morphological information for part of speech induction. In 10th Conference of the European Chapter of the Association for Computational Linguistics, pages 59–66.
Michael Collins. 2002. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of
the ACL-02 conference on Empirical methods in
natural language processing – Volume 10.
Jianfeng Gao and Mark Johnson. 2008. A comparison
of bayesian estimators for unsupervised Hidden
Markov Model POS taggers. In Proceedings of the
2008 Conference on Empirical Methods in Natural
Language Processing, pages 344–352.
Sharon Goldwater and Tom Griffiths. 2007. A fully
Bayesian approach to unsupervised part-of-speech
tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 744–751.
João V. Graça, Kuzman Ganchev, Ben Taskar, and
Fernando Pereira. 2009. Posterior vs. Parameter
Sparsity in Latent Variable Models. In Neural Information Processing Systems Conference (NIPS).
Aria Haghighi and Dan Klein. 2006. Prototype-driven
learning for sequence models. In Proceedings of
the Human Language Technology Conference of
the NAACL, Main Conference, pages 320–327,
New York City, USA, June. Association for Computational Linguistics.
William P. Headden, David McClosky, and Eugene
Charniak. 2008. Evaluating unsupervised part-of-speech tagging for grammar induction. In Proceedings of the International Conference on Computational Linguistics (COLING ’08).
Mark Johnson. 2007. Why doesn’t EM find good
HMM POS-taggers? In Proceedings of the 2007
Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural
Language Learning (EMNLP-CoNLL), pages 296–
Marina Meilă. 2003. Comparing clusterings by the
variation of information. In Bernhard Schölkopf
and Manfred K. Warmuth, editors, COLT 2003:
The Sixteenth Annual Conference on Learning
Theory, volume 2777 of Lecture Notes in Computer Science, pages 173–187. Springer.
Roi Reichart and Ari Rappoport. 2009. The NVI
Clustering Evaluation Measure. In Proceedings of
the Thirteenth Conference on Computational Natural Language Learning (CoNLL), pages 165–173.
Hinrich Schütze. 1995. Distributional part-of-speech
tagging. In Proceedings of the seventh conference
on European chapter of the Association for Computational Linguistics, pages 141–148.
Noah A. Smith and Jason Eisner. 2005. Contrastive
estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguistics (ACL’05), pages 354–362.
Kristina Toutanova, Dan Klein, Christopher D. Manning and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network.
In Proceedings of HLT-NAACL 2003, pages 252–259.
Stuart Geman, Elie Bienenstock and René Doursat.
1992. Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4 (1), pages 1–58.
Intelligent Selection of Language Model Training Data
Robert C. Moore William Lewis
Microsoft Research
Redmond, WA 98052, USA
We address the problem of selecting non-domain-specific language model training
data to build auxiliary language models for use in tasks such as machine translation.
Our approach is based on comparing the cross-entropy, according to domain-specific
and non-domain-specific language models, for each sentence of the text source used to
produce the latter language model. We show that this produces better language models,
trained on less data, than both random data selection and two other previously
proposed methods.
Statistical N-gram language models are widely used in applications that produce
natural-language text as output, particularly speech recognition and machine
translation. It seems to be a universal truth that output quality can always be
improved by using more language model training data, but only if the training data is
reasonably well-matched to the desired output. This presents a problem, because in
virtually any particular application the amount of in-domain data is limited.
Thus it has become standard practice to combine in-domain data with other data, either
by combining N-gram counts from in-domain and other data (usually weighting the counts
in some way), or building separate language models from different data sources,
interpolating the language model probabilities either linearly or log-linearly.
Log-linear interpolation is particularly popular in statistical machine translation
(e.g., Brants et al., 2007), because the interpolation weights can easily be
discriminatively trained to optimize an end-to-end translation objective function
(such as BLEU) by making the log probability according to each language model a
separate feature function in the overall translation model.
The normal practice when using multiple language models in machine translation seems
to be to train models on as much data as feasible from each source, and to depend on
feature weight optimization to down-weight the impact of data that is less
well-matched to the translation application. In this paper, however, we show that for
a data source that is not entirely in-domain, we can improve the match between the
language model from that data source and the desired application output by
intelligently selecting a subset of the available data as language model training
data. This not only produces a language model better matched to the domain of interest
(as measured in terms of perplexity on held-out in-domain data), but it reduces the
computational resources needed to exploit a large amount of non-domain-specific data,
since the resources needed to filter a large amount of data are much less (especially
in terms of memory) than those required to build a language model from all the data.
Approaches to the Problem
Our approach to the problem assumes that we have
enough in-domain data to train a reasonable in-domain language model, which we then use to
help score text segments from other data sources,
and we select segments based on a score cutoff optimized on held-out in-domain data.
We are aware of two comparable previous approaches. Lin et al. (1997) and Gao et al. (2002)
both used a method similar to ours, in which the
metric used to score text segments is their perplexity according to the in-domain language model.
The candidate text segments with perplexity less
than some threshold are selected.
The second previous approach does not explicitly make use of an in-domain language
model, but is still applicable to our scenario. Klakow (2000) estimates a unigram
language model from the entire non-domain-specific corpus to be selected from, and
scores each candidate text segment from that corpus by the change in the log
likelihood of the in-domain data according to the unigram model, if that segment were
removed from the corpus used to estimate the unigram model. Those segments whose
removal would decrease the log likelihood of the in-domain data more than some
threshold are selected.
Proceedings of the ACL 2010 Conference Short Papers, pages 220–224,
Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics
Our method is a fairly simple variant of scoring by perplexity according to an
in-domain language model. First, note that selecting segments based on a perplexity
threshold is equivalent to selecting based on a cross-entropy threshold. Perplexity and
cross-entropy are monotonically related, since the perplexity of a string s according
to a model M is simply b^{H_M(s)}, where H_M(s) is the cross-entropy of s according to
M and b is the base with respect to which the cross-entropy is measured (e.g., bits or
nats). However, instead of scoring text segments by perplexity or cross-entropy
according to the in-domain language model, we score them by the difference of the
cross-entropy of a text segment according to the in-domain language model and the
cross-entropy of the text segment according to a language model trained on a random
sample of the data source from which the text segment is drawn.
To state this formally, let I be an in-domain data set and N be a non-domain-specific
(or otherwise not entirely in-domain) data set. Let H_I(s) be the per-word
cross-entropy, according to a language model trained on I, of a text segment s drawn
from N. Let H_N(s) be the per-word cross-entropy of s according to a language model
trained on a random sample of N. We partition N into text segments (e.g., sentences),
and score the segments according to H_I(s) − H_N(s), selecting all text segments whose
score is less than a threshold T.
This method can be justified by reasoning similar to that used to derive methods for
training binary text classifiers without labeled negative examples (Denis et al.,
2002; Elkin and Noto, 2008). Let us imagine that our non-domain-specific corpus N
contains an in-domain subcorpus N_I, drawn from the same distribution as our in-domain
corpus I. Since N_I is statistically just like our in-domain data I, it would seem to
be a good candidate for the data that we want to extract from N. By a simple variant of
Bayes rule, the probability P(N_I | s, N) of a text segment s, drawn randomly from N,
being in N_I is given by
\[ P(N_I \mid s, N) = \frac{P(s \mid N_I, N)\, P(N_I \mid N)}{P(s \mid N)} \]
Since N_I is a subset of N, P(s | N_I, N) = P(s | N_I), and by our assumption about the
relationship of I and N_I, P(s | N_I) = P(s | I). Hence,
\[ P(N_I \mid s, N) = \frac{P(s \mid I)\, P(N_I \mid N)}{P(s \mid N)} \]
If we could estimate all the probabilities in the right-hand side of this equation, we
could use it to select text segments that have a high probability of being in N_I.
We can estimate P(s | I) and P(s | N) by training language models on I and a sample of
N, respectively. That leaves us only P(N_I | N) to estimate, but we really don’t care
what P(N_I | N) is, because knowing that would still leave us wondering what threshold
to set on P(N_I | s, N). We don’t care about classification accuracy; we care only
about the quality of the resulting language model, so we might as well just attempt to
find a threshold on P(s | I)/P(s | N) that optimizes the fit of the resulting language
model to held-out in-domain data.
Equivalently, we can work in the log domain with the quantity
log(P(s | I)) − log(P(s | N)). This gets us very close to working with the difference
in cross-entropies, because H_I(s) − H_N(s) is just a length-normalized version of
log(P(s | I)) − log(P(s | N)), with the sign reversed. The reason that we need to
normalize for length is that the value of log(P(s | I)) − log(P(s | N)) tends to
correlate very strongly with text segment length. If the candidate text segments vary
greatly in length (e.g., if we partition N into sentences), this correlation can be a
serious problem.
We estimated this effect on a 1000-sentence sample of our experimental data described
below, and found the correlation between sentence log probability difference and
sentence length to be r = −0.92, while the cross-entropy difference was almost
uncorrelated with sentence length (r = 0.04). Hence, using sentence probability ratios
or log probability differences as our scoring function would result in selecting a
disproportionate number of very short sentences. We tested this in an experiment not
described here in detail, and found it not to be significantly better as a selection
criterion than random selection.
Table 1: Corpus size statistics (sentence count and token count for each corpus, including Europarl train and Europarl test)
We have empirically evaluated our proposed
method for selecting data from a non-domain-specific source to model text in a specific domain.
For the in-domain corpus, we chose the English
side of the English-French parallel text from release v5 of the Europarl corpus (Koehn, 2005).
This consists of proceedings of the European Parliament from 1999 through 2009. We used the
text from 1999 through 2008 as in-domain training data, and we used the first 2000 sentences
from January 2009 as test data. For the non-domain-specific corpus, we used the LDC English Gigaword Third Edition (LDC Catalog No.:
We used a simple tokenization scheme on all
data, splitting on white space and on boundaries
between alphanumeric and nonalphanumeric (e.g.,
punctuation) characters. With this tokenization,
the sizes of our data sets in terms of sentences and
tokens are shown in Table 1. The token counts include added end-of-sentence tokens.
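The tokenization scheme just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tool: it treats only ASCII letters and digits as alphanumeric, keeps runs of adjacent punctuation together as one token, and uses `</s>` as a hypothetical name for the added end-of-sentence token.

```python
import re

# A token is either a run of alphanumerics or a run of non-space,
# non-alphanumeric characters; whitespace only separates tokens.
_TOKEN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]+")

def tokenize(line, eos="</s>"):
    """Split on white space and on alphanumeric/non-alphanumeric boundaries."""
    return _TOKEN.findall(line) + [eos]

tokens = tokenize("Mr. Smith's 3.5% rise!")
# tokens == ['Mr', '.', 'Smith', "'", 's', '3', '.', '5', '%', 'rise', '!', '</s>']
```

Counting the appended end-of-sentence token is what makes token totals exceed the raw word counts by one per sentence.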
To implement our data selection method we required one language model trained on the Europarl
training data and one trained on the Gigaword
data. To make these language models comparable,
and to show the feasibility of optimizing the fit to
the in-domain data without training a model on the
entire Gigaword corpus, we trained the Gigaword
language model for data selection on a random
sample of the Gigaword corpus of a similar size to
that of the Europarl training data: 1,874,051 sentences, 48,459,945 tokens.
To further increase the comparability of these
Europarl and Gigaword language models, we restricted the vocabulary of both models to the tokens appearing at least twice in the Europarl training data, treating all other tokens as instances of
<UNK>. With this vocabulary, 4-gram language
models were trained on both the Europarl training
data and the Gigaword random sample using backoff absolute discounting (Ney et al. 1994), with a
discount of 0.7 used for all N-gram lengths. The
discounted probability mass at the unigram level
was added to the probability of <UNK>. A count
cutoff of 2 occurrences was applied to the trigrams
and 4-grams in estimating these models.
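The vocabulary restriction can be illustrated as follows. This minimal Python sketch assumes tokenized sentences are lists of strings; the helper names `build_vocab` and `map_unk` and the toy sentences are hypothetical. It covers only the vocabulary step, not the 4-gram estimation or the discounting.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep tokens that occur at least min_count times in the in-domain data."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}

def map_unk(sentence, vocab, unk="<UNK>"):
    """Replace every out-of-vocabulary token with the unknown-word symbol."""
    return [tok if tok in vocab else unk for tok in sentence]

in_domain = [["the", "house", "is", "open"], ["the", "house", "votes"]]
vocab = build_vocab(in_domain)
mapped = map_unk(["the", "senate", "votes"], vocab)
# mapped == ['the', '<UNK>', '<UNK>']
```

Applying the same mapping to both training corpora is what puts the two selection models on a shared vocabulary.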
We computed the cross-entropy of each sentence in the Gigaword corpus according to both
models, and scored each sentence by the difference in cross-entropy, H_Ep(s) − H_Gw(s). We then
selected subsets of the Gigaword data corresponding to 8 cutoff points in the cross-entropy difference scores, and trained 4-gram models (again using absolute discounting with a discount of 0.7) on
each of these subsets and on the full Gigaword corpus. These language models were estimated without restricting the vocabulary or applying count
cutoffs, but the only parameters computed were
those needed to determine the perplexity of the
held-out Europarl test set, which saves a substantial amount of computation in determining the optimal selection threshold.
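The scoring and selection step can be sketched end to end on toy data. The code below is a hedged illustration only: it substitutes add-one smoothed unigram models for the 4-gram absolute-discounting models used in the experiments, trains the general-domain model on the whole toy corpus rather than on a random sample, and uses invented sentences, class names, and a threshold of 0.

```python
import math
from collections import Counter

class UnigramLM:
    """Add-one smoothed unigram model (a stand-in for the 4-gram models)."""

    def __init__(self, sentences, vocab):
        self.vocab = set(vocab) | {"<UNK>"}
        self.counts = Counter(
            tok if tok in self.vocab else "<UNK>"
            for sent in sentences for tok in sent
        )
        self.total = sum(self.counts.values()) + len(self.vocab)

    def cross_entropy(self, sentence):
        """Per-word cross-entropy of one sentence, in bits."""
        logp = 0.0
        for tok in sentence:
            tok = tok if tok in self.vocab else "<UNK>"
            logp += math.log2((self.counts[tok] + 1) / self.total)
        return -logp / len(sentence)

def select(candidates, lm_in, lm_gen, threshold):
    """Keep sentences whose cross-entropy difference is below the threshold."""
    return [
        s for s in candidates
        if lm_in.cross_entropy(s) - lm_gen.cross_entropy(s) < threshold
    ]

in_domain = [["the", "parliament", "votes"], ["the", "commission", "votes"]]
general = [["the", "team", "wins"], ["the", "parliament", "votes"],
           ["the", "team", "scores"]]
vocab = {tok for sent in in_domain for tok in sent}

lm_in = UnigramLM(in_domain, vocab)
lm_gen = UnigramLM(general, vocab)
selected = select(general, lm_in, lm_gen, threshold=0.0)
# selected == [['the', 'parliament', 'votes']]
```

In practice the threshold would be chosen, as in the text, by training a model on each candidate subset and measuring perplexity on held-out in-domain data.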
We compared our selection method to three
other methods. As a baseline, we trained language models on random subsets of the Gigaword
corpus of approximately equal size to the data
sets produced by the cutoffs we selected for the
cross-entropy difference scores. Next, we scored
all the Gigaword sentences by the cross-entropy
according to the Europarl-trained model alone.
As we noted above, this is equivalent to the in-domain perplexity scoring method used by Lin et
al. (1997) and Gao et al. (2002). Finally, we implemented Klakow’s (2000) method, scoring each
Gigaword sentence by removing it from the Gigaword corpus and computing the difference in the
log likelihood of the Europarl corpus according to
unigram models trained on the Gigaword corpus
with and without that sentence. With the latter two
methods, we chose cutoff points in the resulting
scores to produce data sets approximately equal in
size to those obtained using our selection method.
For all four selection methods, plots of test set perplexity vs. the number of training
data tokens selected are displayed in Figure 1. (Note that the training data token
counts are displayed on a logarithmic scale.) The test set perplexity for the language
model trained on the full Gigaword corpus is 135. As we might expect, reducing training
data by random sampling always increases perplexity. Selecting Gigaword sentences by
their cross-entropy according to the Europarl-trained model is effective in reducing
both test set perplexity and training corpus size, with an optimum perplexity of 124,
obtained with a model built from 36% of the Gigaword corpus. Klakow’s method is even
more effective, with an optimum perplexity of 111, obtained with a model built from 21%
of the Gigaword corpus. The cross-entropy difference selection method, however, is yet
more effective, with an optimum perplexity of 101, obtained with a model built from
less than 7% of the Gigaword corpus.
Figure 1: Test set perplexity vs. training set size (x-axis: billions of words of training data; y-axis: test-set perplexity; curves: random selection, in-domain cross-entropy scoring, Klakow's method, cross-entropy difference scoring)
The comparisons implied by Figure 1, however, are only approximate, because each
perplexity (even along the same curve) is computed with respect to a different
vocabulary, resulting in a different out-of-vocabulary (OOV) rate. OOV tokens in the
test data are excluded from the perplexity computation, so the perplexity measurements
are not strictly comparable.
Out of the 55566 test set tokens, the number of OOV tokens ranges from 418 (0.75%), for
the smallest training set based on in-domain cross-entropy scoring, to 20 (0.03%), for
training on the full Gigaword corpus. If we consider only the training sets that appear
to produce the lowest perplexity for each selection method, however, the spread of OOV
counts is much narrower, ranging from 53 (0.10%), for the best training set based on
cross-entropy difference scoring, to 20 (0.03%), for random selection.
To control for the difference in vocabulary, we estimated a modified 4-gram language
model for each selection method (other than random selection) using the training set
that appeared to produce the lowest perplexity for that selection method in our initial
experiments. In the modified language models, the unigram model based on the selected
training set is smoothed by absolute discounting, and backed off to an unsmoothed
unigram model based on the full Gigaword corpus. This produces language models that are
normalized over the same vocabulary as a model trained on the full Gigaword corpus;
thus the test set has the same OOVs for each model.
Table 2: Results adjusted for vocabulary coverage. Columns: Selection Method, Original LM PPL, Modified LM PPL. Rows: in-domain cross-entropy scoring, Klakow’s method, cross-entropy difference scoring.
Test set perplexity for each of these modified language models is compared to that of
the original version of the model in Table 2. It can be seen that adjusting the
vocabulary in this way, so that all models are based on the same vocabulary, yields
only very small changes in the measured test-set perplexity, and these differences are
much smaller than the differences between the different selection methods, whichever
way the vocabulary of the language models is determined.
The cross-entropy difference selection method introduced here seems to produce language
models that are both a better match to texts in a restricted domain, and require less
data for training, than any of the other data selection methods tested. This study is
preliminary, however, in that we have not yet shown improved end-to-end task
performance applying this approach, such as improved BLEU scores in a machine
translation task. However, we believe there is reason to be optimistic about this. When
a language model trained on non-domain-specific data is used in a statistical
translation model as a separate feature function (as is often the case), lower
perplexity on in-domain target language test data derived from reference translations
corresponds directly to assigning higher language model feature scores to those
reference translations, which should in turn lead to translation system output that
matches reference translations better.
Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large
language models in machine translation. In Proceedings of the Joint Conference on
Empirical Methods in Natural Language Processing and Computational Natural Language
Learning, June 28–30, Prague, Czech Republic, 858–867.
François Denis, Remi Gilleron, and Marc Tommasi. 2002. Text classification from
positive and unlabeled examples. In The 9th International Conference on Information
Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2002),
1927–1934.
Charles Elkin and Keith Noto. 2008. Learning classifiers from only positive and
unlabeled data. In KDD 2008, August 24–27, Las Vegas, Nevada, USA, 213–220.
Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified
approach to statistical language modeling for Chinese. ACM Transactions on Asian
Language Information Processing, 1(1):3–33.
Dietrich Klakow. 2000. Selecting articles from the language model training corpus. In
ICASSP 2000, June 5–9, Istanbul, Turkey, vol. 3, 1695–
Philipp Koehn. 2005. Europarl: a parallel corpus for statistical machine translation.
In MT Summit X, September 12–16, Phuket, Thailand,
Sung-Chien Lin, Chi-Lung Tsai, Lee-Feng Chien, Ker-Jiann Chen, and Lin-Shan Lee. 1997.
Chinese language model adaptation based on document classification and multiple
domain-specific language models. In EUROSPEECH-1997, 1463–1466.
Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring dependencies in
stochastic language modelling. Computer Speech and Language, 8:1–38.
Blocked Inference in Bayesian Tree Substitution Grammars
Trevor Cohn
Department of Computer Science
University of Sheffield
[email protected]
Phil Blunsom
Computing Laboratory
University of Oxford
[email protected]
Learning a tree substitution grammar is very challenging due to derivational
ambiguity. Our recent approach used a Bayesian non-parametric model to induce good
derivations from treebanked input (Cohn et al., 2009), biasing towards small grammars
composed of small generalisable productions. In this paper we present a novel training
method for the model using a blocked Metropolis-Hastings sampler in place of the
previous method’s local Gibbs sampler. The blocked sampler makes considerably larger
moves than the local sampler and consequently converges in less time. A core component
of the algorithm is a grammar transformation which represents an infinite tree
substitution grammar in a finite context free grammar. This enables efficient blocked
inference for training and also improves the parsing algorithm. Both algorithms are
shown to improve parsing accuracy.
Tree Substitution Grammar (TSG) is a compelling grammar formalism which allows
nonterminal rewrites in the form of trees, thereby enabling the modelling of complex
linguistic phenomena such as argument frames, lexical agreement and idiomatic phrases.
A fundamental problem with TSGs is that they are difficult to estimate, even in the
supervised scenario where treebanked data is available. This is because treebanks are
typically not annotated with their TSG derivations (how to decompose a tree into
elementary tree fragments); instead the derivation needs to be inferred.
In recent work we proposed a TSG model which infers an optimal decomposition under a
non-parametric Bayesian prior (Cohn et al., 2009). This used a Gibbs sampler for
training, which repeatedly samples for every node in every training tree a binary value
indicating whether the node is or is not a substitution point in the tree’s derivation.
Aggregated over the whole corpus, these values and the underlying trees specify the
weighted grammar. Local Gibbs samplers, although conceptually simple, suffer from slow
convergence (a.k.a. poor mixing). The sampler can easily get stuck because many locally
improbable decisions are required to escape from a locally optimal solution. This
problem manifests itself both locally to a sentence and globally over the training
sample. The net result is a sampler that is non-convergent, overly dependent on its
initialisation and cannot be said to be sampling from the posterior.
In this paper we present a blocked Metropolis-Hastings sampler for learning a TSG,
similar to Johnson et al. (2007). The sampler jointly updates all the substitution
variables in a tree, making much larger moves than the local single-variable sampler. A
critical issue when developing a Metropolis-Hastings sampler is choosing a suitable
proposal distribution, which must have the same support as the true distribution. For
our model the natural proposal distribution is a MAP point estimate; however, this
cannot be represented directly as it is infinitely large. To solve this problem we
develop a grammar transformation which can succinctly represent an infinite TSG in an
equivalent finite Context Free Grammar (CFG). The transformed grammar can be used as a
proposal distribution, from which samples can be drawn in polynomial time. Empirically,
the blocked sampler converges in fewer iterations and in less time than the local Gibbs
sampler. In addition, we also show how the transformed grammar can be used for parsing,
which yields theoretical and empirical improvements over our previous method which
truncated the grammar.
Proceedings of the ACL 2010 Conference Short Papers, pages 225–230,
Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics
A Tree Substitution Grammar (TSG; Bod et
al. (2003)) is a 4-tuple, G = (T, N, S, R), where
T is a set of terminal symbols, N is a set of nonterminal symbols, S ∈ N is the distinguished root
nonterminal and R is a set of productions (rules).
The productions take the form of tree fragments,
called elementary trees (ETs), in which each internal node is labelled with a nonterminal and each
leaf is labelled with either a terminal or a nonterminal. The frontier nonterminal nodes in each ET
form the sites into which other ETs can be substituted. A derivation creates a tree by recursive substitution starting with the root symbol and finishing when there are no remaining frontier nonterminals. Figure 1 (left) shows an example derivation where the arrows denote substitution. A Probabilistic Tree Substitution Grammar (PTSG) assigns a probability to each rule in the grammar,
where each production is assumed to be conditionally independent given its root nonterminal. A
derivation’s probability is the product of the probabilities of the rules therein.
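As a small illustration of the last point, the sketch below scores a derivation as the product of its rule probabilities. The toy grammar, its probabilities and the function name are hypothetical and chosen only for illustration; they are not taken from the model described here.

```python
from math import prod

# Hypothetical toy PTSG: (root nonterminal, elementary tree) -> rule probability.
# Probabilities of rules sharing a root nonterminal sum to one.
RULE_PROB = {
    ("S", "(S NP (VP (V saw) NP))"): 0.5,
    ("S", "(S NP VP)"): 0.5,
    ("NP", "(NP John)"): 0.4,
    ("NP", "(NP Mary)"): 0.6,
}


def derivation_probability(derivation):
    """Product of the probabilities of the elementary trees in a derivation."""
    return prod(RULE_PROB[rule] for rule in derivation)


# One derivation: the S tree with both frontier NPs filled by substitution.
derivation = [
    ("S", "(S NP (VP (V saw) NP))"),
    ("NP", "(NP John)"),
    ("NP", "(NP Mary)"),
]
p = derivation_probability(derivation)
```

Several derivations can yield the same tree, so the probability of a tree is the sum of this quantity over all of its derivations.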
In this work we employ the same nonparametric TSG model as Cohn et al. (2009), which we now summarise. The inference problem within this model is to identify the posterior distribution of the elementary trees e given whole trees t. The model is characterised by the use of a Dirichlet Process (DP) prior over the grammar. We define the distribution over elementary trees e with root nonterminal symbol c as
\[ G_c \mid \alpha_c, P_0 \sim \mathrm{DP}(\alpha_c, P_0(\cdot \mid c)) \]
\[ e \mid c \sim G_c \]
where P0(·|c) (the base distribution) is a distribution over the infinite space of trees rooted with c, and αc (the concentration parameter) controls the model’s tendency towards either reusing elementary trees or creating novel ones as each training instance is encountered.
Rather than representing the distribution Gc explicitly, we integrate over all possible values of Gc. The key result required for inference is that the conditional distribution of $e_i$, given $\mathbf{e}_{-i} = e_1 \ldots e_n \setminus e_i$ and the root category c, is:
\[ p(e_i \mid \mathbf{e}_{-i}, c, \alpha_c, P_0) = \frac{n^{-i}_{e_i,c}}{n^{-i}_{\cdot,c} + \alpha_c} + \frac{\alpha_c P_0(e_i \mid c)}{n^{-i}_{\cdot,c} + \alpha_c} \qquad (1) \]
where $n^{-i}_{e_i,c}$ is the number of times $e_i$ has been used to rewrite c in $\mathbf{e}_{-i}$, and $n^{-i}_{\cdot,c} = \sum_e n^{-i}_{e,c}$ is the total count of rewriting c. Henceforth we omit the −i sub-/super-script for brevity.
A primary consideration is the definition of P0. Each ei can be generated in one of two ways: by drawing from the base distribution, where the probability of any particular tree is proportional to αc P0(ei|c), or by drawing from a cache of previous expansions of c, where the probability of any particular expansion is proportional to the number of times that expansion has been used before. In Cohn et al. (2009) we presented base distributions that favour small elementary trees which we expect will generalise well to unseen data. In this work we show that if P0 is chosen such that it decomposes with the CFG rules contained within each elementary tree,¹ then we can use a novel dynamic programming algorithm to sample derivations without ever enumerating all the elementary trees in the grammar.
¹ Both choices of base distribution in Cohn et al. (2009) decompose into CFG rules. In this paper we focus on the better performing one, $P_0^C$, which combines a PCFG applied recursively with a stopping probability, s, at each node.
Figure 1: TSG derivation and its corresponding Gibbs state for the local sampler, where each node is marked with a binary variable denoting whether it is a substitution site.
The model was trained using a local Gibbs sampler (Geman and Geman, 1984), a Markov chain Monte Carlo (MCMC) method in which random variables are repeatedly sampled conditioned on the values of all other random variables in the model. To formulate the local sampler, we associate a binary variable with each non-root internal node of each tree in the training set, indicating whether that node is a substitution point or not (illustrated in Figure 1). The sampler then visits each node in a random schedule and resamples that node’s substitution variable, where the probabilities of the two different configurations are given by (1). Parsing was performed using a Metropolis-Hastings sampler to draw derivation samples for a string, from which the best tree was recovered. However the sampler used for parsing was biased
because it used as its proposal distribution a truncated grammar which excluded all but a handful of the unseen elementary trees. Consequently the proposal had smaller support than the true model, voiding the MCMC convergence proofs.
Grammar Transformation
We now present a blocked sampler using the Metropolis-Hastings (MH) algorithm to perform sentence-level inference, based on the work of Johnson et al. (2007) who presented a MH sampler for a Bayesian PCFG. This approach repeats the following steps for each sentence in the training set: 1) run the inside algorithm (Lari and Young, 1990) to calculate marginal expansion probabilities under a MAP approximation, 2) sample an analysis top-down and 3) accept or reject using a MH test to correct for differences between the MAP proposal and the true model. Though our model is similar to that of Johnson et al. (2007), we have an added complication: the MAP grammar cannot be estimated directly. This is a consequence of the base distribution having infinite support (assigning non-zero probability to infinitely many unseen tree fragments), which means the MAP has an infinite rule set. For example, if our base distribution licences the CFG production NP → NP PP then our TSG grammar will contain the infinite set of elementary trees NP → NP PP, NP → (NP NP PP) PP, NP → (NP (NP NP PP) PP) PP, . . . with decreasing but non-zero probability.
However, we can represent the infinite MAP using a grammar transformation inspired by Goodman (2003), which represents the MAP TSG in an equivalent finite PCFG.² Under the transformed PCFG inference is efficient, allowing its use as the proposal distribution in a blocked MH sampler. We represent the MAP using the grammar transformation in Table 1 which separates the ne,c and P0 terms in (1) into two separate CFGs, A and B. Grammar A has productions for every ET with ne,c ≥ 1 which are assigned unsmoothed probabilities: omitting the P0 term from (1).³ Grammar B has productions for every CFG production licensed under P0; its productions are denoted using primed (′) nonterminals. The rule c → c′ bridges from A to B, weighted by the smoothing term excluding P0, which is computed recursively via child productions. The remaining rules in grammar B correspond to every CFG production in the underlying PCFG base distribution, coupled with the binary decision whether or not nonterminal children should be substitution sites (frontier nonterminals). This choice affects the rule probability by including an s or 1 − s factor, and child substitution sites also function as a bridge back from grammar B to A. In this way there are often two equivalent paths to reach the same chart cell using the same elementary tree: via grammar A using observed TSG productions and via grammar B using P0 backoff; summing these yields the desired net probability.
For every ET, e, rewriting c with non-zero count:
    c → sign(e)        $\frac{n^{-}_{e,c}}{n^{-}_{\cdot,c} + \alpha_c}$
For every internal node $e_i$ in e with children $e_{i,1}, \ldots, e_{i,n}$:
    sign($e_i$) → sign($e_{i,1}$) . . . sign($e_{i,n}$)        1
For every nonterminal, c:
    c → c′        $\frac{\alpha_c}{n^{-}_{\cdot,c} + \alpha_c}$
For every pre-terminal CFG production, c → t:
    c′ → t        $P_{CFG}(c \to t)$
For every unary CFG production, c → a:
    c′ → a        $P_{CFG}(c \to a)\, s_a$
    c′ → a′       $P_{CFG}(c \to a)(1 - s_a)$
For every binary CFG production, c → ab:
    c′ → ab       $P_{CFG}(c \to ab)\, s_a s_b$
    c′ → ab′      $P_{CFG}(c \to ab)\, s_a (1 - s_b)$
    c′ → a′b      $P_{CFG}(c \to ab)(1 - s_a)\, s_b$
    c′ → a′b′     $P_{CFG}(c \to ab)(1 - s_a)(1 - s_b)$
Table 1: Grammar transformation rules to map a MAP TSG into a CFG. Production probabilities are shown to the right of each rule. The sign(e) function creates a unique string signature for an ET e (where the signature of a frontier node is itself) and $s_c$ is the Bernoulli probability of c being a substitution variable (and stopping the P0 recursion).
Figure 2 shows an example of the transformation of an elementary tree with non-zero count, ne,c ≥ 1, into the two types of CFG rules. Both parts are capable of parsing the string NP, saw, NP into an S, as illustrated in Figure 3; summing the probability of both analyses gives the model probability from (1). Note that although the probabilities exactly match the true model for a single elementary tree, the probability of derivations composed of many elementary trees may not match because the model’s caching behaviour has been suppressed, i.e., the counts, n, are not incremented during the course of a derivation.
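The sum over the two paths can be checked numerically. The following is a minimal sketch, assuming a toy PCFG, toy stopping probabilities and toy counts (all hypothetical), for the elementary tree (S NP (VP (V saw) NP)): the grammar A path contributes the cached term of (1), the grammar B path contributes the bridge weight times the base probability, and their sum equals (1) with the −i index omitted.

```python
# Hypothetical toy values, chosen only for illustration.
PCFG = {
    ("S", ("NP", "VP")): 0.9,
    ("VP", ("V", "NP")): 0.5,
    ("V", ("saw",)): 0.1,
}
STOP = {"NP": 0.8, "VP": 0.3, "V": 0.2}  # s_c: probability that c is a substitution site


def base_prob_saw_tree():
    """P0 of (S NP (VP (V saw) NP)): product of the primed (grammar B) rule scores."""
    s_prime = PCFG[("S", ("NP", "VP"))] * STOP["NP"] * (1 - STOP["VP"])
    vp_prime = PCFG[("VP", ("V", "NP"))] * (1 - STOP["V"]) * STOP["NP"]
    v_prime = PCFG[("V", ("saw",))]
    return s_prime * vp_prime * v_prime


def path_probabilities(n_e, n_total, alpha, p0):
    """Scores of the two analyses of one elementary tree in the transformed grammar."""
    path_a = n_e / (n_total + alpha)          # c -> sign(e): tree drawn from the cache
    path_b = alpha / (n_total + alpha) * p0   # c -> c', then base-distribution rules
    return path_a, path_b


def conditional(n_e, n_total, alpha, p0):
    """Equation (1) written as a single fraction."""
    return (n_e + alpha * p0) / (n_total + alpha)


p0 = base_prob_saw_tree()
path_a, path_b = path_probabilities(n_e=3, n_total=10, alpha=1.0, p0=p0)
total = path_a + path_b
```

With these toy numbers the cached path dominates, since the tree has already been used three times while its base probability is small.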
² Backoff DOP uses a similar packed representation to encode the set of smaller subtrees for a given elementary tree (Sima’an and Buratto, 2003), which are used to smooth its probability estimate.
³ The transform assumes inside inference. For Viterbi replace the probability for c → sign(e) with $\frac{n^{-}_{e,c} + \alpha_c P_0(e \mid c)}{n^{-}_{\cdot,c} + \alpha_c}$.
S → NP VP{V{saw},NP}        $\frac{n^{-}_{e,S}}{n^{-}_{\cdot,S} + \alpha_S}$
VP{V{saw},NP} → V{saw} NP        1
V{saw} → saw        1
--------
S → S′        $\frac{\alpha_S}{n^{-}_{\cdot,S} + \alpha_S}$
S′ → NP VP′        $P_{CFG}(S \to NP\ VP)\, s_{NP} (1 - s_{VP})$
VP′ → V′ NP        $P_{CFG}(VP \to V\ NP)(1 - s_V)\, s_{NP}$
V′ → saw        $P_{CFG}(V \to saw)$
Figure 2: Example of the transformed grammar for the ET (S NP (VP (V saw) NP)). Taking the product of the rule scores above the line yields the left term in (1), and the product of the scores below the line yields the right term.
Figure 3: Example trees under the grammar transform, which both encode the same TSG derivation from Figure 1. The left tree encodes that the S → NP (VP (V hates) NP) elementary tree was drawn from the cache, while for the right tree this same elementary tree was drawn from the base distribution (the left and right terms in (1), respectively).
For training we define the MH sampler as follows. First we estimate the MAP grammar over the derivations of the training corpus excluding the current tree, which we represent using the PCFG transformation. The next step is to sample derivations for a given tree, for which we use a constrained variant of the inside algorithm (Lari and Young, 1990). We must ensure that the TSG derivation produces the given tree, and therefore during inside inference we only consider spans that are constituents in the tree and are labelled with the correct nonterminal. Nonterminals are said to match their primed and signed counterparts, e.g., NP′ and NP{DT,NN{car}} both match NP. Under the tree constraints the time complexity of inside inference is linear in the length of the sentence. A derivation is then sampled from the inside chart using a top-down traversal (Johnson et al., 2007), and converted back into its equivalent TSG derivation. The derivation is scored with the true model and accepted or rejected using the MH test; accepted samples then replace the current derivation for the tree, and rejected samples leave the previous derivation unchanged. These steps are then repeated for another tree in the training set, and the process is then repeated over the full training set many times.
Parsing The grammar transform is not only useful for training, but also for parsing. To parse a sentence we sample a number of TSG derivations from the MAP which are then accepted or rejected into the full model using a MH step. The samples are obtained from the same transformed grammar, with the algorithm adapted to an unsupervised setting where parse trees are not available. For this we use the standard inside algorithm applied to the sentence, omitting the tree constraints, which has time complexity cubic in the length of the sentence. We then sample a derivation from the inside chart and perform the MH acceptance test. This setup is theoretically more appealing than our previous approach in which we truncated the approximation grammar to exclude most of the zero count rules (Cohn et al., 2009). We found that both the maximum probability derivation and tree were considerably worse than a tree constructed to maximise the expected number of correct CFG rules (MER), based on Goodman’s (2003) algorithm for maximising labelled recall. For this reason we use the MER parsing algorithm using sampled Monte Carlo estimates for the marginals over CFG rules at each sentence span.
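The accept/reject step used in both training and parsing can be sketched as follows. This is the textbook Metropolis-Hastings test for a proposal that does not depend on the current state, written in log space; the function and argument names are hypothetical, and how the scores themselves are computed is outside the scope of this sketch.

```python
import math
import random


def mh_accept(log_p_new, log_p_old, log_q_new, log_q_old, rng=random):
    """Metropolis-Hastings test for an independence proposal, in log space.

    log_p_*: log probability of a derivation under the true model.
    log_q_*: log probability of the same derivation under the proposal
             (here, the MAP grammar represented by the transformed PCFG).
    Accepts with probability min(1, p(new) q(old) / (p(old) q(new))).
    """
    log_ratio = (log_p_new - log_p_old) + (log_q_old - log_q_new)
    if log_ratio >= 0:
        return True
    return rng.random() < math.exp(log_ratio)
```

When the proposal closely matches the true model the two differences nearly cancel, the ratio is close to one, and almost every sample is accepted.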
We tested our model on the Penn treebank using
the same data setup as Cohn et al. (2009). Specifically, we used only section 2 for training and section 22 (devel) for reporting results. Our models
were all sampled for 5k iterations with hyperparameter inference for αc and sc ∀ c ∈ N, but in
contrast to our previous approach we did not use
annealing, which we did not find to help generalisation accuracy. The MH acceptance rates were
in excess of 99% across both training and parsing.
All results are averages over three runs.
For training the blocked MH sampler exhibits faster convergence than the local Gibbs sampler, as shown in Figure 4. Irrespective of the initialisation the blocked sampler finds higher likelihood states in many fewer iterations (the same trend continues until iteration 5k). To be fair, the blocked sampler is slower per iteration (roughly 50% worse) due to the higher overheads of the grammar transform and performing dynamic programming (despite nominal optimisation).⁴ Even after accounting for the time difference the blocked sampler is more effective than the local Gibbs sampler. Training likelihood is highly correlated with generalisation F1 (Pearson’s correlation coefficient of 0.95), and therefore improving the sampler convergence will have immediate effects on performance.
Figure 4: Training likelihood vs. iteration. Each sampling method was initialised with both minimal and maximal elementary trees. [Plot of log likelihood against iteration; series: Block maximal init, Block minimal init, Local minimal init, Local maximal init.]
Parsing results are shown in Table 2.⁵ The blocked sampler results in better generalisation F1 scores than the local Gibbs sampler, irrespective of the initialisation condition or parsing method used. The use of the grammar transform in parsing also yields better scores irrespective of the underlying model. Together these results strongly advocate the use of the grammar transform for inference in infinite TSGs.
Table 2: Development F1 scores using the truncated parsing algorithm and the novel grammar transform algorithm for four different training configurations. [Rows: Local minimal init, Local maximal init, Blocked minimal init, Blocked maximal init.]
We also trained the model on the standard Penn treebank training set (sections 2–21). We initialised the model with the final sample from a run on the small training set, and used the blocked sampler for 6500 iterations. Averaged over three runs, the test F1 (section 23) was 85.3, an improvement over our earlier 84.0 (Cohn et al., 2009) although still well below state-of-the-art parsers. We conjecture that the performance gap is due to the model using an overly simplistic treatment of unknown words, and also to further mixing problems with the sampler. For the full data set the counts are much larger in magnitude, which leads to stronger modes. The sampler has difficulty escaping such modes and therefore is slower to mix. One way to solve the mixing problem is for the sampler to make more global moves, e.g., with table label resampling (Johnson and Goldwater, 2009) or split-merge (Jain and Neal, 2000). Another way is to use a variational approximation instead of MCMC sampling (Wainwright and Jordan, 2008).
⁴ The speed difference diminishes with corpus size: on sections 2–22 the blocked sampler is only 19% slower per iteration than the local sampler.
⁵ Our baseline ‘Local maximal init’ slightly exceeds the previously reported score of 76.89% (Cohn et al., 2009).
We have demonstrated how our grammar transformation can implicitly represent an exponential space of tree fragments efficiently, allowing us to build a sampler with considerably better mixing properties than a local Gibbs sampler. The same technique was also shown to improve the parsing algorithm. These improvements are in no way limited to our particular choice of a TSG parsing model: many hierarchical Bayesian models have been proposed which would also permit similar optimised samplers. In particular, models which induce segmentations of complex structures stand to benefit from this work. Examples include the word segmentation model of Goldwater et al. (2006), for which it would be trivial to adapt our technique to develop a blocked sampler. Hierarchical Bayesian segmentation models have also become popular in statistical machine translation where there is a need to learn phrasal translation structures that can be decomposed at the word level (DeNero et al., 2008; Blunsom et al., 2009; Cohn and Blunsom, 2009). We envisage similar representations being applied to these models to improve their mixing properties.
A particularly interesting avenue for further research is to employ our blocked sampler for unsupervised grammar induction. While it is difficult to extend the local Gibbs sampler to the case where the tree is not observed, the dynamic program for our blocked sampler can be easily used for unsupervised inference by omitting the tree matching constraints.
Mark Johnson, Thomas Griffiths, and Sharon Goldwater. 2007. Bayesian inference for PCFGs via
Markov chain Monte Carlo. In Proceedings of
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, pages 139–146,
Rochester, NY, April.
Phil Blunsom, Trevor Cohn, Chris Dyer, and Miles Osborne. 2009. A Gibbs sampler for phrasal synchronous grammar induction. In Proceedings of the
Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP (ACL-IJCNLP), pages 782–790, Suntec, Singapore, August.
Karim Lari and Steve J. Young. 1990. The estimation of stochastic context-free grammars using
the inside-outside algorithm. Computer Speech and
Language, 4:35–56.
Rens Bod, Remko Scha, and Khalil Sima’an, editors.
2003. Data-oriented parsing. Center for the Study
of Language and Information - Studies in Computational Linguistics. University of Chicago Press.
Khalil Sima’an and Luciano Buratto. 2003. Backoff
parameter estimation for the dop model. In Nada
Lavrac, Dragan Gamberger, Ljupco Todorovski, and
Hendrik Blockeel, editors, ECML, volume 2837 of
Lecture Notes in Computer Science, pages 373–384.
Trevor Cohn and Phil Blunsom. 2009. A Bayesian
model of syntax-directed tree to string grammar induction. In Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 352–361, Singapore, August.
Martin J Wainwright and Michael I Jordan. 2008.
Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover,
Trevor Cohn, Sharon Goldwater, and Phil Blunsom. 2009. Inducing compact but accurate tree-substitution grammars. In Proceedings of Human
Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL),
pages 548–556, Boulder, Colorado, June.
John DeNero, Alexandre Bouchard-Côté, and Dan
Klein. 2008. Sampling alignment structure under
a Bayesian translation model. In Proceedings of
the 2008 Conference on Empirical Methods in Natural Language Processing, pages 314–323, Honolulu,
Hawaii, October.
Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian
restoration of images. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 6:721–741.
Sharon Goldwater, Thomas L. Griffiths, and Mark
Johnson. 2006. Contextual dependencies in unsupervised word segmentation. In Proceedings of
the 21st International Conference on Computational
Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 673–680,
Sydney, Australia, July.
Joshua Goodman. 2003. Efficient parsing of DOP with
PCFG-reductions. In Bod et al. (Bod et al., 2003),
chapter 8.
Sonia Jain and Radford M. Neal. 2000. A split-merge
Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics, 13:158–182.
Mark Johnson and Sharon Goldwater. 2009. Improving nonparameteric bayesian inference: experiments on unsupervised word segmentation with
adaptor grammars. In Proceedings of Human Language Technologies: The 2009 Annual Conference
of the North American Chapter of the Association for Computational Linguistics, pages 317–325,
Boulder, Colorado, June.
Online Generation of Locality Sensitive Hash Signatures
Benjamin Van Durme
Johns Hopkins University
Baltimore, MD 21211 USA
Ashwin Lall
College of Computing
Georgia Institute of Technology
Atlanta, GA 30332 USA
Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.
There has been a surge of interest in adapting results from the streaming algorithms community to problems in processing large text collections. The term streaming refers to a model where data is made available sequentially, and it is assumed that resource limitations preclude storing the entirety of the data for offline (batch) processing. Statistics of interest are approximated via online, randomized algorithms. Examples of text applications include: collecting approximate counts (Talbot, 2009; Van Durme and Lall, 2009a), finding top-n elements (Goyal et al., 2009), estimating term co-occurrence (Li et al., 2008), adaptive language modeling (Levenberg and Osborne, 2009), and building top-k ranklists based on pointwise mutual information (Van Durme and Lall, 2009b).
Here we revisit the work of Ravichandran et al. (2005) on building word similarity measures from large text collections by using the Locality Sensitive Hash (LSH) method of Charikar (2002). For the common case of feature updates being additive over a data stream (such as when tracking lexical co-occurrence), we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.
We envision this method being used in conjunction with dynamic clustering algorithms, for a variety of applications. For example, Petrovic et al. (2010) made use of LSH signatures generated over individual tweets, for the purpose of first story detection. Streaming LSH should allow for the clustering of Twitter authors, based on the tweets they generate, with signatures continually updated over the Twitter stream.
Locality Sensitive Hashing
We are concerned with computing the cosine similarity of feature vectors, defined for a pair of vectors $\vec{u}$ and $\vec{v}$ as the dot product normalized by their lengths:
\[ \text{cosine-similarity}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{|\vec{u}|\,|\vec{v}|} \]
This similarity is the cosine of the angle between these high-dimensional vectors and attains a value of one (i.e., cos(0)) when the vectors are parallel and zero (i.e., cos(π/2)) when orthogonal.
Proceedings of the ACL 2010 Conference Short Papers, pages 231–235,
Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics
Building on the seminal work of Indyk and Motwani (1998) on locality sensitive hashing (LSH), Charikar (2002) presented an LSH that maps high-dimensional vectors to a much smaller dimensional space while still preserving (cosine) similarity between vectors in the original space. The LSH algorithm computes a succinct signature of the feature set of the words in a corpus by computing d independent dot products of each feature vector $\vec{v}$ with a random unit vector $\vec{r}$, i.e., $\sum_i v_i r_i$, and retaining the sign of the d resulting products. Each entry of $\vec{r}$ is drawn from the distribution N(0, 1), the normal distribution with zero mean and unit variance. Charikar’s algorithm makes use of the fact (proved by Goemans and Williamson (1995) for an unrelated application) that the angle between any two vectors summarized in this fashion is proportional to the expected Hamming distance of their signature vectors. Hence, we can retain length d bit-signatures in the place of high dimensional feature vectors, while preserving the ability to (quickly) approximate cosine similarity in the original space.
Ravichandran et al. (2005) made use of this algorithm to reduce the computation in searching for similar nouns by first computing signatures for each noun and then computing similarity over the signatures rather than the original feature space.
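The relation between Hamming distance and angle can be illustrated with a short offline sketch. It assumes the standard estimate for sign random projections, in which the fraction of differing bits approximates the angle divided by π, so the cosine is estimated as cos(π · h / d); the dimensions, seed and function names below are arbitrary choices for illustration.

```python
import math
import random


def signature(vec, planes):
    """One bit per random vector: 1 if the dot product is positive, else 0."""
    return [1 if sum(v * r for v, r in zip(vec, plane)) > 0 else 0 for plane in planes]


def estimated_cosine(sig_a, sig_b):
    """cos(pi * h / d), where h is the Hamming distance between two d-bit signatures."""
    d = len(sig_a)
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * hamming / d)


def true_cosine(u, v):
    """Exact cosine similarity, for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
dim, d = 20, 512
planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(d)]
u = [rng.random() for _ in range(dim)]
v = [rng.random() for _ in range(dim)]
approx = estimated_cosine(signature(u, planes), signature(v, planes))
exact = true_cosine(u, v)
```

Increasing d tightens the estimate at the cost of longer signatures, which is the trade-off examined in the experiments.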
Algorithm 1
Parameters:
    m : size of pool
    d : number of bits (size of resultant signature)
    s : a random seed
    h1, ..., hd : hash functions mapping ⟨s, fi⟩ to {0, . . . , m−1}
Initialization:
    Initialize floating point array P[0, . . . , m − 1]
    Initialize H, a hashtable mapping words to floating point arrays of size d
    for i := 0 . . . m − 1 do
        P[i] := random sample from N(0, 1), using s as seed
Online update:
    for each word w in the stream do
        for each feature fi associated with w do
            for j := 1 . . . d do
                H[w][j] := H[w][j] + P[hj(s, fi)]
Signature computation:
    for each w ∈ H do
        for i := 1 . . . d do
            if H[w][i] > 0 then
                S[w][i] := 1
            else
                S[w][i] := 0
Streaming Algorithm
In this work, we focus on features that can be
maintained additively, such as raw frequencies.¹
Our streaming algorithm for this problem makes
use of the simple fact that the dot product of the
feature vector with random vectors is a linear operation. This permits us to replace the vi · ri operation by vi individual additions of ri , once for
each time the feature is encountered in the stream
(where vi is the frequency of a feature and ri is the
randomly chosen Gaussian-distributed value associated with this feature). The result of the final
computation is identical to the dot products computed by the algorithm of Charikar (2002), but
the processing can now be done online. A similar technique, for stable random projections, was
independently discussed by Li et al. (2008).
Since each feature may appear multiple times
in the stream, we need a consistent way to retrieve
the random values drawn from N (0, 1) associated
with it. To avoid the expense of computing and
storing these values explicitly, as is the norm, we
propose the use of a precomputed pool of random values drawn from this distribution that we
can then hash into. Hashing into a fixed pool ensures that the same feature will consistently be associated with the same value drawn from N (0, 1).
This introduces some weak dependence in the random vectors, but we will give some analysis showing that this should have very limited impact on
the cosine similarity computation, which we further support with experimental evidence (see Table 3).
Our algorithm traverses a stream of words and
maintains some state for each possible word that
it encounters (cf. Algorithm 1). In particular, the
state maintained for each word is a vector of floating point numbers of length d. Each element of the
vector holds the (partial) dot product of the feature
vector of the word with a random unit vector. Updating the state for a feature seen in the stream for
a given word simply involves incrementing each
position in the word’s vector by the random value
associated with the feature, accessed by hash functions h1 through hd . At any point in the stream,
the vector for each word can be processed (in time
O(d)) to create a signature computed by checking
the sign of each component of its vector.
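The procedure just described can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the class name, default sizes, and the use of BLAKE2b over (seed, j, feature) to stand in for the hash functions h1 through hd are all assumptions made for the example.

```python
import hashlib
import random
from collections import defaultdict


class StreamingLSH:
    """Online LSH signatures with a fixed pool of N(0, 1) values."""

    def __init__(self, d=64, m=10_000, seed=0):
        self.d, self.m, self.seed = d, m, seed
        rng = random.Random(seed)
        # Pool P of m values drawn once from N(0, 1).
        self.pool = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # H: per-word array of d running (partial) dot products.
        self.sums = defaultdict(lambda: [0.0] * d)

    def _slot(self, j, feature):
        # Assumption: h_j is simulated by hashing (seed, j, feature) with BLAKE2b.
        key = f"{self.seed}|{j}|{feature}".encode("utf-8")
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.m

    def update(self, word, features):
        """Add the pooled random value of each observed feature to each position."""
        row = self.sums[word]
        for feature in features:
            for j in range(self.d):
                row[j] += self.pool[self._slot(j, feature)]

    def signature(self, word):
        """d-bit signature: the sign of each running sum."""
        return [1 if x > 0 else 0 for x in self.sums[word]]
```

Because every update is an addition, seeing a feature k times contributes k times its pooled value, which is exactly the term that feature would contribute to the offline dot product.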
The update cost of the streaming algorithm, per word in the stream, is O(df), where d is the target signature size and f is the number of features associated with each word in the stream.² This results in an overall cost of O(ndf) for the streaming algorithm, where n is the length of the stream. The memory footprint of our algorithm is O(n′d + m), where n′ is the number of distinct words in the stream and m is the size of the pool of normally distributed values. In comparison, the original LSH algorithm computes signatures at a cost of O(nf + n′dF) updates and O(n′F + dF + n′d) memory, where F is the (large) number of unique features. Our algorithm is superior in terms of memory (because of the pooling trick), and has the benefit of supporting similarity queries online.
¹ Note that Ravichandran et al. (2005) used pointwise mutual information features, which are not additive since they require a global statistic to compute.
² For the bigram features used in § 4, f = 2.
Pooling Normally-distributed Values
We now discuss why it is possible to use a
fixed pool of random values instead of generating
unique ones for each feature. Let g be the c.d.f. of the distribution N(0, 1). It is easy to see that picking x ∈ (0, 1) uniformly results in $g^{-1}(x)$ being chosen with distribution N(0, 1). Now, if we select for our pool the values
\[ g^{-1}(1/m),\; g^{-1}(2/m),\; \ldots,\; g^{-1}(1 - 1/m), \]
for some sufficiently large m, then this is identical to sampling from N(0, 1) with the caveat that the accuracy of the sample is limited. More precisely, a value drawn from this pool deviates from the actual value by at most
\[ \max_i \{ g^{-1}((i+1)/m) - g^{-1}(i/m) \}. \]
By choosing m to be sufficiently large, we can
bound the error of the approximate sample from
a true sample (i.e., the loss in precision expressed
above) to be a small fraction (e.g., 1%) of the actual value. This would result in the same relative
error in the computation of the dot product (i.e.,
1%), which would almost never affect the sign of
the final value. Hence, pooling as above should
give results almost identical to the case where all
the random values were chosen independently. Finally, we make the observation that, for large m,
randomly choosing m values from N (0, 1) results
in a set of values that are distributed very similarly
to the pool described above. An interesting avenue
for future work is making this analysis more mathematically precise.
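The evenly spaced pool described above can be built directly from the inverse c.d.f. The sketch below uses `statistics.NormalDist` from the Python standard library for $g^{-1}$; the function names and the pool size are hypothetical choices for illustration.

```python
from statistics import NormalDist


def quantile_pool(m):
    """The values g^{-1}(1/m), g^{-1}(2/m), ..., g^{-1}(1 - 1/m) for N(0, 1)."""
    inv_cdf = NormalDist(mu=0.0, sigma=1.0).inv_cdf
    return [inv_cdf(i / m) for i in range(1, m)]


def largest_gap(pool):
    """Largest difference between neighbouring pool values."""
    return max(b - a for a, b in zip(pool, pool[1:]))


pool = quantile_pool(1000)
gap = largest_gap(pool)
```

Neighbouring values are closest together near zero and furthest apart in the tails, so the largest gap occurs at the two ends of the pool.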
Figure 1: Predicted versus actual cosine values for 50,000 pairs, using LSH signatures generated online, with d = 32 in Fig. 1(a) and d = 256 in Fig. 1(b).
Table 1: Mean absolute error when using signatures generated online (StreamingLSH), compared to offline (LSH).
Distributed The algorithm can be easily distributed across multiple machines in order to process different parts of a stream, or multiple different streams, in parallel, such as in the context of the MapReduce framework (Dean and Ghemawat, 2004). The underlying operation is a linear operator that is easily composed (i.e., via addition), and the randomness between machines can be tied based on a shared seed s. At any point in processing the stream(s), current results can be aggregated by summing the d-dimensional vectors for each word, from each machine.
Decay The algorithm can be extended to support temporal decay in the stream, where recent observations are given higher relative weight, by multiplying the current sums by a decay value (e.g., 0.9) on a regular interval (e.g., once an hour, once a day, once a week, etc.).
Similar to the experiments of Ravichandran et al. (2005), we evaluated the fidelity of signature generation in the context of calculating distributional similarity between words across a large text collection: in our case, articles taken from the NYTimes portion of the Gigaword corpus (Graff, 2003). The collection was processed as a stream, sentence by sentence, using bigram features. This gave a stream of 773,185,086 tokens, with 1,138,467 unique types. Given the number of types, this led to a (sparse) feature space with dimension on the order of 2.5 million.
After compiling signatures, fifty thousand
⟨x, y⟩ pairs of types were randomly sampled
by selecting x and y each independently, with
replacement, from those types with at least 10 tokens in the stream (310,327 types satisfied
this constraint). The true cosine value between
each such x and y was computed offline
and compared to the cosine similarity
predicted by the Hamming distance between the
signatures for x and y. Unless otherwise specified,
the random pool size was fixed at m = 10,000.
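The pieces described so far (additive online updates drawing on a shared pool of random values, merging across machines by summation, periodic decay, and cosine prediction from Hamming distance) can be sketched roughly as follows. This is a minimal illustration, not the implementation evaluated here: the class name `StreamingLSH`, the helper `pool_index`, and the use of a SHA-1 hash to pick pool slots are all assumptions made for the example. The cosine estimate uses the standard relation cos(π·h/d) for Hamming distance h between d-bit signatures.

```python
import hashlib
import math
import random
from collections import defaultdict


def pool_index(feature, bit, m, seed):
    """Hypothetical hash: map a (feature, bit) pair to one of m pool slots."""
    key = f"{seed}|{bit}|{feature}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % m


class StreamingLSH:
    """Keeps one d-dimensional running sum per word (assumed layout)."""

    def __init__(self, d=256, m=10_000, seed=0):
        self.d, self.m, self.seed = d, m, seed
        rng = random.Random(seed)  # shared seed ties randomness across machines
        self.pool = [rng.gauss(0.0, 1.0) for _ in range(m)]
        self.sums = defaultdict(lambda: [0.0] * d)

    def update(self, word, feature, weight=1.0):
        """Additive update for one observed (word, feature) event."""
        row = self.sums[word]
        for bit in range(self.d):
            row[bit] += weight * self.pool[pool_index(feature, bit, self.m, self.seed)]

    def merge(self, other):
        """Combine results from another machine by summing the vectors."""
        for word, row in other.sums.items():
            mine = self.sums[word]
            for bit, value in enumerate(row):
                mine[bit] += value

    def decay(self, factor=0.9):
        """Down-weight older observations by scaling the current sums."""
        for row in self.sums.values():
            for bit in range(self.d):
                row[bit] *= factor

    def signature(self, word):
        """One bit per dimension: the sign of the running sum."""
        return [1 if value >= 0.0 else 0 for value in self.sums[word]]


def predicted_cosine(sig_x, sig_y):
    """Estimate cosine from the Hamming distance between two signatures."""
    hamming = sum(a != b for a, b in zip(sig_x, sig_y))
    return math.cos(math.pi * hamming / len(sig_x))
```

Because the update is a plain sum, two instances built with the same seed and fed disjoint parts of a stream can be merged at any time, and scaling by a positive decay factor leaves the current signature bits unchanged.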
Figure 1 visually reaffirms the trade-off in LSH
between the number of bits and the accuracy of
cosine prediction across the range of cosine values. As the underlying vectors are strictly positive, the true cosine is restricted to [0, 1]. Figure 2
shows the absolute error between truth and prediction for a similar sample, measured using signatures of a variety of bit lengths. Here we see horizontal bands arising from truly orthogonal vectors
leading to step-wise absolute error values tracked
to Hamming distance.
Table 1 compares the online and batch LSH algorithms, giving the mean absolute error between
predicted and actual cosine values, computed for
the fifty-thousand element sample, using signatures of various lengths. These results confirm that
we achieve the same level of accuracy with online
updates as compared to the standard method.
Figure 3 shows that a pool size as low as m =
100 gives reasonable variation in random values,
and that m = 10,000 is sufficient. With a
standard 32-bit floating-point representation, this pool
occupies just 40 KBytes of memory, as compared to, e.g.,
the 2.5 GBytes required to store 256 random vectors each containing 2.5 million elements.
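The memory figures above follow from simple arithmetic, spelled out here as a short check. The only assumption is 4 bytes per 32-bit float; the variable names are illustrative.

```python
BYTES_PER_FLOAT32 = 4

# Pool of m = 10,000 random values.
pool_bytes = 10_000 * BYTES_PER_FLOAT32

# Explicit random matrix: 256 vectors, each with 2.5 million elements.
matrix_bytes = 256 * 2_500_000 * BYTES_PER_FLOAT32

print(pool_bytes)    # 40000 bytes, i.e. 40 KBytes
print(matrix_bytes)  # 2560000000 bytes, i.e. roughly 2.5 GBytes
```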
Table 2 is based on taking an example for each
of three part-of-speech categories, and reporting
the resultant top-5 words according to approximated cosine similarity. Depending on the intended application, these results indicate a range
of potentially sufficient signature lengths.
Figure 2: Absolute error between predicted and true cosine for a sample of pairs, when using signatures of length
log2(d) ∈ {4, 5, 6, 7, 8}, drawn with added jitter to avoid
Figure 3: Error versus pool size, when using d = 256. Axes: Pool Size (x), Mean Absolute Error (y).
We have shown that when updates to a feature vector are additive, it is possible to convert the offline
LSH signature generation method into a streaming algorithm. In addition to allowing for online querying of signatures, our approach leads to
space efficiencies, as it does not require the explicit representation of either the feature vectors
or the random matrix. Possibilities for future
work include the pairing of this method with algorithms for dynamic clustering, as well as exploring
algorithms for different distances (e.g., L2 ) and estimators (e.g., asymmetric estimators (Dong et al.,
cosine    Milan_.97, Madrid_.96, Stockholm_.96, Manila_.95, Moscow_.95
d = 16    ASHER_0, Champaign_0, MANS_0, NOBLE_0, come_0
d = 32    Prague_1, Vienna_1, suburban_1, synchronism_1, Copenhagen_2
d = 64    Frankfurt_4, Prague_4, Taszar_5, Brussels_6, Copenhagen_6
d = 128   Prague_12, Stockholm_12, Frankfurt_14, Madrid_14, Manila_14
d = 256   Stockholm_20, Milan_22, Madrid_24, Taipei_24, Frankfurt_25

cosine    during_.99, on_.98, beneath_.98, from_.98, onto_.97
d = 16    Across_0, Addressing_0, Addy_0, Against_0, Allmon_0
d = 32    aboard_0, mishandled_0, overlooking_0, Addressing_1, Rejecting_1
d = 64    Rejecting_2, beneath_2, during_2, from_3, hamstringing_3
d = 128   during_4, beneath_5, of_6, on_7, overlooking_7
d = 256   during_10, on_13, beneath_15, of_17, overlooking_17

cosine    deployed_.84, presented_.83, sacrificed_.82, held_.82, installed_.82
d = 16    Bustin_0, Diors_0, Draining_0, Kosses_0, UNA_0
d = 32    delivered_2, held_2, marks_2, seared_2, Ranked_3
d = 64    delivered_5, rendered_5, presented_6, displayed_7, exhibited_7
d = 128   held_18, rendered_18, presented_19, deployed_20, displayed_20
d = 256   presented_41, rendered_42, held_47, leased_47, reopened_47
David Graff. 2003. English Gigaword. Linguistic
Data Consortium, Philadelphia.
Piotr Indyk and Rajeev Motwani. 1998. Approximate
nearest neighbors: towards removing the curse of dimensionality. In Proceedings of STOC.
Abby Levenberg and Miles Osborne. 2009. Stream-based Randomised Language Models for SMT. In
Proceedings of EMNLP.
Ping Li, Kenneth W. Church, and Trevor J. Hastie.
2008. One Sketch For All: Theory and Application
of Conditional Random Sampling. In Advances in
Neural Information Processing Systems 21.
Sasa Petrovic, Miles Osborne, and Victor Lavrenko.
2010. Streaming First Story Detection with application to Twitter. In Proceedings of NAACL.
Deepak Ravichandran, Patrick Pantel, and Eduard
Hovy. 2005. Randomized Algorithms and NLP:
Using Locality Sensitive Hash Functions for High
Speed Noun Clustering. In Proceedings of ACL.
Table 2: Top-5 items based on true cosine (bold), then using
minimal Hamming distance, given in top-down order when
using signatures of length log2 (d) ∈ {4, 5, 6, 7, 8}. Ties broken lexicographically. Values given as subscripts.
David Talbot. 2009. Succinct approximate counting of
skewed data. In Proceedings of IJCAI.
Benjamin Van Durme and Ashwin Lall. 2009a. Probabilistic Counting with Randomized Storage. In Proceedings of IJCAI.
Thanks to Deepak Ravichandran, Miles Osborne,
Sasa Petrovic, Ken Church, Glen Coppersmith,
and the anonymous reviewers for their feedback.
This work began while the first author was at the
University of Rochester, funded by NSF grant IIS-1016735. The second author was supported in
part by NSF grant CNS-0905169, funded under
the American Recovery and Reinvestment Act of
Benjamin Van Durme and Ashwin Lall.
Streaming Pointwise Mutual Information. In Advances in Neural Information Processing Systems
Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings
of STOC.
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters.
In Proceedings of OSDI.
Wei Dong, Moses Charikar, and Kai Li. 2009. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In Proceedings of SIGIR.
Michel X. Goemans and David P. Williamson. 1995.
Improved approximation algorithms for maximum
cut and satisfiability problems using semidefinite
programming. JACM, 42:1115–1145.
Amit Goyal, Hal Daumé III, and Suresh Venkatasubramanian. 2009. Streaming for large scale NLP:
Language Modeling. In Proceedings of NAACL.
Optimizing Question Answering Accuracy by Maximizing Log-Likelihood
Matthias H. Heie, Edward W. D. Whittaker and Sadaoki Furui
Department of Computer Science
Tokyo Institute of Technology
Tokyo 152-8552, Japan
Abstract
In this paper we demonstrate that there
is a strong correlation between the Question Answering (QA) accuracy and the
log-likelihood of the answer typing component of our statistical QA model. We
exploit this observation in a clustering algorithm which optimizes QA accuracy by
maximizing the log-likelihood of a set of
question-and-answer pairs. Experimental
results show that we achieve better QA accuracy using the resulting clusters than by
using manually derived clusters.
1 Introduction
Question Answering (QA) distinguishes itself
from other information retrieval tasks in that the
system tries to return accurate answers to queries
posed in natural language. Factoid QA limits itself to questions that can usually be answered with
a few words. Typically factoid QA systems employ some form of question type analysis, so that
a question such as What is the capital of Japan?
will be answered with a geographical term. While
many QA systems use hand-crafted rules for this
task, such an approach is time-consuming and
does not generalize well to other languages. Machine learning methods have been proposed, such
as question classification using support vector machines (Zhang and Lee, 2003) and language modeling (Merkel and Klakow, 2007). In these approaches, question categories are predefined and a
classifier is trained on manually labeled data. This
is an example of supervised learning. In this paper we present an unsupervised method, where we
attempt to cluster question-and-answer (q-a) pairs
without any predefined question categories, hence
no manually class-labeled questions are used.
We use a statistical QA framework, described in
Section 2, where the system is trained with clusters
of q-a pairs. This framework was used in several
TREC evaluations where it placed in the top 10
of participating systems (Whittaker et al., 2006).
In Section 3 we show that answer accuracy is
strongly correlated with the log-likelihood of the
q-a pairs computed by this statistical model. In
Section 4 we propose an algorithm to cluster q-a
pairs by maximizing the log-likelihood of a disjoint set of q-a pairs. In Section 5 we evaluate the
QA accuracy by training the QA system with the
resulting clusters.
2 QA system
In our QA framework we choose to model only
the probability of an answer A given a question Q,
and assume that the answer A depends on two sets
of features: W = W(Q) and X = X(Q):
\[ P(A \mid Q) = P(A \mid W, X), \tag{1} \]
where W represents a set of |W | features describing the question-type part of Q such as who, when,
where, which, etc., and X is a set of features
which describes the “information-bearing” part of
Q, i.e. what the question is actually about and
what it refers to. For example, in the questions
Where is Mount Fuji? and How high is Mount
Fuji?, the question type features W differ, while
the information-bearing features X are identical.
Finding the best answer  involves a search over
all A for the one which maximizes the probability
of the above model, i.e.:
 = arg max P (A|W, X).
Given the correct probability distribution, this
will give us the optimal answer in a maximum
likelihood sense. Using Bayes’ rule, assuming
uniform P (A) and that W and X are independent of each other given A, in addition to ignoring
P (W, X) since it is independent of A, enables us
to rewrite Eq. (2) as
Proceedings of the ACL 2010 Conference Short Papers, pages 236–240,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
 = arg max P (A | X) · P (W | A).
{z } | {z }
A |
P (W | ceW ) is implemented as trigram language models with backoff smoothing using absolute
discounting (Huang et al., 2001).
Due to data sparsity, our set of example q-a
pairs cannot be expected to cover all the possible answers to questions that may ever be asked.
We therefore employ answer class modeling rather
than answer word modeling by expanding Eq. (4)
as follows:
f ilter
2.1 Retrieval Model
The retrieval model P (A|X) is essentially a language model which models the probability of an
answer sequence A given a set of informationbearing features X = {x1 , . . . , x|X| }. This set
is constructed by extracting single-word features
from Q that are not present in a stop-list of highfrequency words. The implementation of the retrieval model used for the experiments described
in this paper, models the proximity of A to features in X. It is not examined further here;
see (Whittaker et al., 2005) for more details.
P (W | A) =
PA |
P (W | ceW )·
P (ceA
| ka )P (ka | A),
where ka is a concrete class in the set of |KA |
answer classes KA . These classes are generated
using the Kneser-Ney clustering algorithm, com2.2 Filter Model
monly used for generating class definitions for
The question-type feature set W = {w1 , . . . , w|W | } class language models (Kneser and Ney, 1993).
is constructed by extracting n-tuples (n = 1, 2, . . .)
In this paper we restrict ourselves to singlesuch as where, in what and when were from the
word answers; see (Whittaker et al., 2005) for the
input question Q. We limit ourselves to extracting
modeling of multi-word answers. We estimate
single-word features. The 2522 most frequent
P (ceA | kA ) as
words in a collection of example questions are
f (kA , ceA )
considered in-vocabulary words; all other words
P (ceA | kA ) =
are out-of-vocabulary words, and substituted with
f (kA , cA )
Modeling the complex relationship between
W and A directly is non-trivial. We therewhere
fore introduce an intermediate variable CE =
δ(i ∈ kA )
{c1 , . . . , c|CE | }, representing a set of classes of
f (kA , ceA ) =
example q-a pairs. In order to construct these
|ceA |
classes, given a set E = {t1 , . . . , t|E| } of example q-a pairs, we define a mapping function
and δ(·) is a discrete indicator function which
f : E 7→ CE which maps each example q-a pair tj
equals 1 if its argument evaluates true and 0 if
for j = 1 . . . |E| into a particular class f (tj ) = ce .
Thus each class ce may be defined as the union of
P (ka | A) is estimated as
all component q-a features from each tj satisfy1
ing f (tj ) = ce . Hence each class ce constitutes a
P (ka | A) = P
cluster of q-a pairs. Finally, to facilitate modeling
δ(A ∈ j)
we say that W is conditionally independent of A
given ce so that,
3 The Relationship between Mean
Reciprocal Rank and Log-Likelihood
|CE |
P (W | A) =
We use Mean Reciprocal Rank (M RR) as our
metric when evaluating the QA accuracy on a set
of questions G = {g1 ...g|G| }:
P (W | ceW ) · P (ceA | A), (4)
where ceW and ceA refer to the subsets of questiontype features and example answers for the class ce ,
M RR =
i=1 1/Ri
where R_i is the rank of the highest ranking correct
candidate answer for g_i.
Given a set D = (d_1 ... d_|D|) of q-a pairs disjoint
from the q-a pairs in C_E, we can, using Eq. (5),
calculate the log-likelihood as
\[ LL = \sum_{d=1}^{|D|} \log P(W_d \mid A_d) = \sum_{d=1}^{|D|} \log \sum_{e=1}^{|C_E|} P(W_d \mid c_e^W) \cdot \sum_{a=1}^{|K_A|} P(c_e^A \mid k_a) P(k_a \mid A_d). \]
To examine the relationship between MRR and
LL, we randomly generate configurations C_E,
with a fixed cluster size of 4, and plot the resulting MRR and LL, computed on the same data set
D, as data points in a scatter plot, as seen in Figure 1. We find that LL and MRR are strongly
correlated, with a correlation coefficient ρ = 0.86.
This observation indicates that we should be
able to improve the answer accuracy of the QA
system by optimizing the LL of the filter model
in isolation, similar to how, in automatic speech
recognition, the LL of the language model can
be optimized in isolation to improve the speech
recognition accuracy (Huang et al., 2001).
Figure 1: MRR vs. LL (average per q-a pair) for
100 random cluster configurations (ρ = 0.86).
4 Clustering algorithm
Using the observation that LL is correlated with
MRR on the same data set, we expect that optimizing LL on a development set (LL_dev) will also
improve MRR on an evaluation set (MRR_eval).
Hence we propose the following greedy algorithm
to maximize LL_dev:
init: c_1 ∈ C_E contains all |E| training pairs
while improvement > threshold do
    best_LL_dev ← −∞
    for all j = 1 ... |E| do
        original_cluster ← f(t_j)
        Take t_j out of f(t_j)
        for e = −1, 1 ... |C_E|, |C_E| + 1 do
            Put t_j in c_e
            Calculate LL_dev
            if LL_dev > best_LL_dev then
                best_LL_dev ← LL_dev
                best_cluster ← e
                best_pair ← j
            end if
            Take t_j out of c_e
        end for
        Put t_j back in original_cluster
    end for
    Take t_best_pair out of f(t_best_pair)
    Put t_best_pair into c_best_cluster
end while
In this algorithm, c_{−1} indicates the set of training pairs outside the cluster configuration, thus not every
training pair will necessarily be included
in the final configuration. c_{|C_E|+1} refers to a new,
empty cluster, hence this algorithm automatically
finds the optimal number of clusters as well as the
optimal configuration of them.
5 Experiments
5.1 Experimental Setup
For our data sets, we restrict ourselves to questions
that start with who, when or where. Furthermore,
we only use q-a pairs which can be answered with
a single word. As training data we use questions
and answers from the Knowledge-Master collection¹. Development/evaluation questions are the
questions from TREC QA evaluations from TREC
2002 to TREC 2006, the answers to which are to
be retrieved from the AQUAINT corpus. In total
we have 2016 q-a pairs for training and 568 questions for development/evaluation. We are able to
retrieve the correct answer for 317 of the development/evaluation questions, thus the theoretical
upper bound for our experiments is an answer accuracy of MRR = 0.558.
Accuracy is evaluated using 5-fold (rotating)
cross-validation, where in each fold the TREC
QA data is partitioned into a development set of
4 years' data and an evaluation set of one year's
data. For each TREC question the top 50 documents from the AQUAINT corpus are retrieved
using Lucene². We use the QA system described
in Section 2 for QA evaluation. Our evaluation
metric is MRR_eval, and LL_dev is our optimization criterion, as motivated in Section 3.
Table 1: LL_eval (average per q-a pair) and
MRR_eval (over all held-out TREC years), and
number of clusters (median of the cross-evaluation
folds) for the various configurations.
Figure 2(a): Development set, 4 years' TREC (x-axis: # iterations).
Our baseline system uses manual clusters.
These clusters are obtained by putting all who q-a
pairs in one cluster, all when pairs in a second and
all where pairs in a third. We compare this baseline
with using clusters resulting from the algorithm
described in Section 4. We run this algorithm until
there are no further improvements in LL_dev. Two
other cluster configurations are also investigated:
all q-a pairs in one cluster (all-in-one), and each q-a
pair in its own cluster (one-in-each). The all-in-one
configuration is equivalent to not using the filter model, i.e. answer candidates are ranked solely
by the retrieval model. The one-in-each configuration was shown to perform well in the TREC 2006
QA evaluation (Whittaker et al., 2006), where it
ranked 9th among 27 participants on the factoid
QA task.
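The evaluation metric and the greedy search of Section 4 can be sketched as follows. This is only an illustration under stated assumptions: `ll_dev` is a caller-supplied scoring function standing in for the development-set log-likelihood, `toy_ll` is a made-up score used purely to exercise the loop (it is not the filter model), and treating questions with no correct answer as contributing 0 to MRR is an assumption consistent with the upper bound quoted above.

```python
import math


def mean_reciprocal_rank(ranks):
    """ranks[i] is R_i; None marks a question with no correct answer (assumed to add 0)."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)


def build_clusters(assign, pairs):
    """Group pairs by cluster id; id -1 means 'outside the configuration'."""
    groups = {}
    for j, e in enumerate(assign):
        if e != -1:
            groups.setdefault(e, []).append(pairs[j])
    return list(groups.values())


def greedy_cluster(pairs, ll_dev, threshold=1e-9):
    """Repeatedly apply the single best move of one pair until LL stops improving."""
    assign = [0] * len(pairs)                 # init: one cluster holds every pair
    current = ll_dev(build_clusters(assign, pairs))
    while True:
        best_ll, best_pair, best_cluster = -math.inf, None, None
        new_id = max(assign) + 1              # id of a fresh, empty cluster
        for j in range(len(pairs)):
            original = assign[j]
            for e in [-1, *range(new_id + 1)]:
                if e == original:
                    continue
                assign[j] = e                 # tentatively put t_j in c_e
                ll = ll_dev(build_clusters(assign, pairs))
                if ll > best_ll:
                    best_ll, best_pair, best_cluster = ll, j, e
            assign[j] = original              # put t_j back
        if best_pair is None or best_ll - current <= threshold:
            return build_clusters(assign, pairs), current
        assign[best_pair] = best_cluster      # commit the best move
        current = best_ll


def toy_ll(clusters):
    """Made-up score: rewards pure clusters, penalises mixing and cluster count."""
    score = 0.0
    for cluster in clusters:
        labels = [question_word for question_word, _ in cluster]
        majority = max(labels.count(label) for label in set(labels))
        score += majority - 2.0 * (len(cluster) - majority)
    return score - 0.5 * len(clusters)
```

Each pass evaluates the score once per candidate move, so the cost of one iteration grows with the number of pairs times the number of clusters.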
Figure 2(b): Evaluation set, 1 year's TREC (x-axis: # iterations).
Figure 2: MRR and LL (average per q-a pair)
vs. number of algorithm iterations for one cross-validation fold.
5.2 Results
In Table 1, we see that the manual clusters (baseline) achieve an MRR_eval of 0.262, while the
clusters resulting from the clustering algorithm
give an MRR_eval of 0.281, which is a relative
improvement of 7%. This improvement is statistically significant at the 0.01 level using the
Wilcoxon signed-rank test. The one-in-each cluster configuration achieves an MRR_eval of 0.263,
which is not a statistically significant improvement
over the baseline. The all-in-one cluster configuration (i.e. no filter model) has the lowest accuracy,
with an MRR_eval of 0.183.
6 Discussion
Manual inspection of the automatically derived
clusters showed that the algorithm had constructed
configurations where typically who, when and
where q-a pairs were put in separate clusters, as in
the manual configuration. However, in some cases
both who and where q-a pairs occurred in the same
cluster, so as to better answer questions like Who
won the World Cup?, where the answer could be a
country name.
As can be seen from Table 1, there are only 4
clusters in the automatic configuration, compared
to 2016 in the one-in-each configuration. Since
the computational complexity of the filter model
described in Section 2.2 is linear in the number of
clusters, a beneficial side effect of our clustering
procedure is a significant reduction in the computational requirement of the filter model.
In Figure 2 we plot LL and MRR for one of
the cross-validation folds over multiple iterations
(the while loop) of the clustering algorithm in Section 4. It can clearly be seen that the optimization
of LL_dev leads to improvement in MRR_eval, and
that LL_eval is also well correlated with MRR_eval.
7 Conclusions and Future Work
In this paper we have shown that the log-likelihood
of our statistical model is strongly correlated with
answer accuracy. Using this information, we have
clustered training q-a pairs by maximizing log-likelihood on a disjoint development set of q-a
pairs. The experiments show that with these clusters we achieve better QA accuracy than with
manually clustered training q-a pairs.
In future work we will extend the types of questions that we consider, and also allow for multi-word answers.
The authors wish to thank Dietrich Klakow for his
discussion at the concept stage of this work. The
anonymous reviewers are also thanked for their
constructive feedback.
[Huang et al.2001] Xuedong Huang, Alex Acero and
Hsiao-Wuen Hon. 2001. Spoken Language Processing. Prentice-Hall, Upper Saddle River, NJ,
[Kneser and Ney1993] Reinhard Kneser and Hermann
Ney. 1993. Improved Clustering Techniques for
Class-based Statistical Language Modelling. Proceedings of the European Conference on Speech
Communication and Technology (EUROSPEECH).
[Merkel and Klakow2007] Andreas Merkel and Dietrich Klakow. 2007. Language Model Based Query
Classification. Proceedings of the European Conference on Information Retrieval (ECIR).
[Whittaker et al.2005] Edward Whittaker, Sadaoki Furui and Dietrich Klakow. 2005. A Statistical Classification Approach to Question Answering using
Web Data. Proceedings of the International Conference on Cyberworlds.
[Whittaker et al.2006] Edward Whittaker, Josef Novak,
Pierre Chatain and Sadaoki Furui. 2006. TREC
2006 Question Answering Experiments at Tokyo Institute of Technology. Proceedings of The Fifteenth
Text REtrieval Conference (TREC).
[Zhang and Lee2003] Dell Zhang and Wee Sun Lee.
2003. Question Classification using Support Vector Machines. Proceedings of the Special Interest
Group on Information Retrieval (SIGIR).
Generating Entailment Rules from FrameNet
Idan Szpektor, Yahoo! Research, Haifa, Israel
Ido Dagan, Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Roni Ben Aharon, Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Abstract
Many NLP tasks need accurate knowledge for semantic inference. To this end,
mostly WordNet is utilized. Yet WordNet is limited, especially for inference between predicates. To help fill this gap,
we present an algorithm that generates
inference rules between predicates from
FrameNet. Our experiment shows that the
novel resource is effective and complements WordNet in terms of rule coverage.
Introduction
Many text understanding applications, such as
Question Answering (QA) and Information Extraction (IE), need to infer a target textual meaning from other texts. This need was proposed as a
generic semantic inference task under the Textual
Entailment (TE) paradigm (Dagan et al., 2006).
A fundamental component in semantic inference is the utilization of knowledge resources.
However, a major obstacle to improving semantic
inference performance is the lack of such knowledge (Bar-Haim et al., 2006; Giampiccolo et al.,
2007). We address one prominent type of inference knowledge known as entailment rules, focusing specifically on rules between predicates, such
as ‘cure X ⇒ X recover’.
We aim at highly accurate rule acquisition,
for which utilizing manually constructed sources
seems appropriate. The most widely used manual
resource is WordNet (Fellbaum, 1998). Yet it is incomplete for generating entailment rules between
predicates (Section 2.1). Hence, other manual resources should also be targeted.
In this work¹, we explore how FrameNet
(Baker et al., 1998) could be effectively used for
generating entailment rules between predicates.
FrameNet is a manually constructed database
based on Frame Semantics. It models the semantic
argument structure of predicates in terms of prototypical situations called frames.
Prior work utilized FrameNet’s argument mapping capabilities but took entailment relations
from other resources, namely WordNet. We
propose a novel method for generating entailment rules from FrameNet by detecting the entailment relations implied in FrameNet. We utilize
FrameNet’s annotated sentences and relations between frames to extract both the entailment relations and their argument mappings.
Our analysis shows that the rules generated by
our algorithm have a reasonable “per-rule” accuracy of about 70%². We tested the generated rule-set on an entailment testbed derived from an IE
benchmark and compared it both to WordNet and
to state-of-the-art rule generation from FrameNet.
Our experiment shows that our method outperforms prior work. In addition, our rule-set’s performance is comparable to WordNet and it is complementary to WordNet when uniting the two resources. Finally, additional analysis shows that
our rule-set accuracy is 90% in practical use.
Entailment Rules and their Acquisition
To generate entailment rules, two issues should
be addressed: a) identifying the lexical entailment
relations between predicates, e.g. ‘cure ⇒ recover’; b) mapping argument positions, e.g. ‘cure
X ⇒ X recover’. The main approach for generating highly accurate rule-sets is to use manually
constructed resources. To this end, most systems
mainly utilize WordNet (Fellbaum, 1998), being
the most prominent lexical resource with broad
coverage of predicates. Furthermore, some of its
relations capture types of entailment relations, including synonymy, hypernymy, morphologically-derived, entailment and cause.
Yet, WordNet is limited for entailment rule generation. First, many entailment relations, notably for the WordNet entailment and cause relation types, are missing, e.g. ‘elect ⇒ vote’.
Furthermore, WordNet does not include argument
mapping between related predicates. Thus, only
substitutable WordNet relations (synonymy and
hypernymy), for which argument positions are
preserved, could be used to generate entailment
rules. The other non-substitutable relations, e.g.
cause (‘kill ⇒ die’) and morphologically-derived
(‘meet.v ⇔ meeting.n’), cannot be used.
¹ The detailed description of our work can be found in
(Ben Aharon, 2010).
² The rule-set is available at: http://www.cs.biu.˜nlp/downloads
Proceedings of the ACL 2010 Conference Short Papers, pages 241–246,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
same frame or whose frames are related by one of
FrameNet’s inter-frame relations. Each candidate
pair is considered entailing if the two LUs are either synonyms or in a direct hypernymy relation in
WordNet (providing the vast majority of LexPar’s
relations), or if their related frames are connected
via the Perspective relation in FrameNet.
Then, argument mappings between each entailing LU pair are extracted based on the core FEs
that are shared between the two LUs. The syntactic positions of the shared FEs are taken from the
valence patterns of the LUs. A LexPar rule example is presented in Figure 3 (top part).
Since most of LexPar’s entailment relations
are based on WordNet’s relations, LexPar’s rules
could be viewed as an intersection of WordNet and
FrameNet lexical relations, accompanied with argument mappings taken from FrameNet.
FrameNet (Baker et al., 1998) is a knowledgebase of frames, describing prototypical situations.
Frames can be related to each other by inter-frame
relations, e.g. Inheritance, Precedence, Usage and
For each frame, several semantic roles are specified, called frame elements (FEs), denoting the
participants in the situation described. Each FE
may be labeled as core if it is central to the frame.
For example, some core FEs of the Commerce pay
frame are Buyer and Goods, while a non-core FE
is Place. Each FE may also be labeled with a semantic type, e.g. Sentient, Event, and Time.
A frame includes a list of predicates that can
evoke the described situation, called lexical units
(LUs). LUs are mainly verbs but may also be
nouns or adjectives. For example, the frame Commerce pay lists the LUs pay.v and payment.n.
Finally, FrameNet contains annotated sentences
that represent typical LU occurrences in texts.
Each annotation refers to one LU in a specific
frame and the FEs of the frame that occur in the
sentence. An example sentence is “IBuyer have to
pay the billsM oney ”. Each sentence is accompanied by a valence pattern, which provides, among
other info, grammatical functions of the core FEs
with respect to the LU. The valence pattern of the
above sentence is [(Buyer Subj), (Money Obj)].
Rule Extraction from FrameNet
The above prior work identified lexical entailment
relations mainly from WordNet, which limits the
use of FrameNet in two ways. First, some relations that appear in FrameNet are missed because
they do not appear in WordNet. Second, unlike
FrameNet, WordNet does not include argument
mappings for its relations. Thus, prior work for
rule generation considered only substitutable relations from WordNet (synonyms and hypernyms),
not utilizing FrameNet’s capability to map arguments of non-substitutable relations.
Our goal in this paper is to generate entailment rules solely from the information within
FrameNet. We present a novel algorithm for generating entailment rules from FrameNet, called
FRED (FrameNet Entailment-rule Derivation),
which operates in three steps: a) extracting templates for each LU; b) detecting lexical entailment
relations between pairs of LUs; c) generating entailment rules by mapping the arguments between
two LUs in each entailing pair.
Template Extraction
Many LUs in FrameNet are accompanied by annotated sentences (Section 2.2). From each sentence of a given LU, we extract one template for
each annotated FE in the sentence. Each template includes the LU, one argument corresponding to the target FE and their syntactic relation
in the sentence parse-tree. We focus on extracting unary templates, as they can describe any ar-
Using FrameNet for Semantic Inference
To the best of our knowledge, the only work that
utilized FrameNet for entailment rule generation
is LexPar (Coyne and Rambow, 2009). LexPar
first identifies lexical entailment relations by going over all LU pairs which are either in the
FrameNet groups LUs in frames and describes relations between frames. However, relations between LUs are not explicitly defined. We next describe how we automatically extract several types
of lexical entailment relations between LUs using
two approaches.
In the first approach, LUs in the same frame
that are morphological derivations of each other,
e.g. ‘negotiation.n’ and ‘negotiate.v’, are marked
as paraphrases. We take morphological derivation
information from the CATVAR database (Habash
and Dorr, 2003).
The second approach is based on our observation that some LUs express the prototypical situation that their frame describes, which we denote
dominant LUs. For example, the LU ‘recover’ is
dominant for the Recovery frame. We mark LUs
as dominant if they are morphologically derived
from the frame’s name.
Our assumption is that since dominant LUs express the frame’s generic meaning, their meaning
is likely to be entailed by the other LUs in this
frame. Consequently, we generate such lexical
rules between any dominant LU and any other LU
in a given frame, e.g. ‘heal ⇒ recover’ and ‘convalescence ⇒ recover’ for the Recovery frame.
In addition, we assume that if two frames are
related by some type of entailment relation, their
dominant LUs are also related by the same relation. Accordingly, we extract entailment relations
between dominant LUs of frames that are connected via the Inheritance, Cause and Perspective
relations, where Inheritance and Cause generate
directional entailment relations (e.g. ‘choose ⇒
decide’ and ‘cure ⇒ recover’, respectively) while
Perspective generates bidirectional paraphrase relations (e.g. ‘transfer ⇔ receive’).
Finally, we generate the transitive closure of
the set of lexical relations identified by the above
methods. For example, the combination of ‘sell ⇔
buy’ and ‘buy ⇒ get’ generates ‘sell ⇒ get’.
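The dominant-LU rules and the transitive-closure step can be illustrated with a small sketch. How the closure is actually computed is not specified in the text; the naive fixed-point loop below, the function names, and the representation of a rule as a (left, right) pair are all assumptions made for this example. A bidirectional paraphrase such as ‘sell ⇔ buy’ is stored as two directional pairs.

```python
def dominant_lu_rules(frame_lus, dominant):
    """Every other LU in the frame is taken to entail the dominant LU."""
    return {(lu, dominant) for lu in frame_lus if lu != dominant}


def transitive_closure(rules):
    """Add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(rules)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # Skip trivial self-rules such as 'sell => sell'.
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Example from the text: 'sell <=> buy' and 'buy => get' yield 'sell => get'.
rules = {("sell", "buy"), ("buy", "sell"), ("buy", "get")}
closure = transitive_closure(rules)

# Example from the text for the Recovery frame.
recovery_rules = dominant_lu_rules(["heal", "convalescence", "recover"], "recover")
```

The quadratic inner loop is adequate for a sketch; a graph-based reachability search would be the usual choice for a large rule set.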
Figure 1: Template extraction for a sentence containing the LU ‘arrest’.
gument mapping by decomposing templates with
several arguments into unary ones (Szpektor and
Dagan, 2008). Figure 1 exemplifies this process.
As a pre-parsing step, all FE phrases in a given
sentence are replaced by their related FE names,
excluding syntactic information such as prepositions or possessives (step (b) in Figure 1). Then,
the sentence is parsed using the Minipar dependency parser (Lin, 1998) (step (c)). Finally, a
path in the parse-tree is extracted between each FE
node and the node of the LU (step (d)). Each extracted path is converted into a template by replacing the FE node with an argument variable.
We simplify each extracted path by removing
nodes along the path that are not part of the syntactic relation between the LU and the FE, such
as conjunctions and other FE nodes. For example,
Identifying Lexical Entailment Relations
‘Authorities ←− enter −→ arrest’ is simplified
into ‘Authorities ←− arrest’.
Some templates originating from different annotated sentences share the same LU and syntactic
structure, but differ in their FEs. Usually, one of
these templates is incorrect, due to erroneous parse
Generating Entailment Rules
The final step in the FRED algorithm generates
lexical syntactic entailment rules from the extracted templates and lexical entailment relations.
For each identified lexical relation ‘left ⇒ right’
between two LUs, the set of FEs that are shared by
both LUs is collected. Then, for each shared FE,
we take the list of templates that connect this FE
(e.g. ‘Suspect ←− arrest’ is a correct template, in
contrast to ‘Charges ←− arrest’). We thus keep
only the most frequently annotated template out of
the identical templates, assuming it is the correct
in practice, as well as to compare its performance
to related resources. To this end, we follow the experimental setup presented in (Szpektor and Dagan, 2009), which utilized the ACE 2005 event
dataset3 as a testbed for entailment rule-sets. We
briefly describe this setup here.
The task is to extract argument mentions for
26 events, such as Sue and Attack, from the ACE
annotated corpus, using a given tested entailment
rule-set. Each event is represented by a set of
unary seed templates, one for each event argument. Some seed templates for Attack are ‘At-
Lexical Relation:
cure ⇒ recovery
Patient ←− cure (cure Patient)
Affliction ←− cure (cure of Affliction)
Patient ←− recovery (Patient’s recovery)
Patient ←− recovery (recovery of Patient)
Affliction ←from− recovery (recovery from Affliction)
Intra-LU Entailment Rules:
Patient ←− recovery ⇐⇒ Patient ←− recovery
Inter-LU Entailment Rules:
Patient ←− cure =⇒ Patient ←− recovery
Patient ←− cure =⇒ Patient ←− recovery
Affliction ←− cure =⇒ Affliction ←from− recovery
Figure 2: Some entailment rules generated for the
lexical relation ‘cure.v ⇒ recovery.n’.
No-Rules The system matches only the seed
templates directly, without any additional rules.
WordNet Rules are generated from WordNet
3.0, using only the synonymy and hypernymy relations (see Section 2.1). Transitive chaining of relations is allowed (Moldovan and Novischi, 2002).
to each of the LUs, denoted by T_left and T_right.
Finally, for each template pair, l ∈ T_left and r ∈ T_right, the rule ‘l ⇒ r’ is generated. In addition,
we generate paraphrase rules between the various
templates including the same FE and the same LU.
Figure 2 illustrates this process.
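The rule-generation step can be sketched as follows. This is an illustration, not the authors' implementation: templates are represented here by their surface glosses from Figure 2 rather than by dependency paths, and the dictionary layout and function names are assumptions made for the example.

```python
from itertools import combinations


def inter_lu_rules(templates, left, right):
    """Rules 'l => r' for every template pair that shares an FE.

    templates maps LU -> FE -> list of templates connecting them.
    """
    shared_fes = sorted(templates[left].keys() & templates[right].keys())
    return [
        (l, r)
        for fe in shared_fes
        for l in templates[left][fe]
        for r in templates[right][fe]
    ]


def intra_lu_paraphrases(templates, lu):
    """Unordered template pairs with the same LU and the same FE."""
    return [
        pair
        for fe in sorted(templates[lu])
        for pair in combinations(templates[lu][fe], 2)
    ]


# Templates for the lexical relation 'cure => recovery' (Figure 2).
templates = {
    "cure": {
        "Patient": ["cure Patient"],
        "Affliction": ["cure of Affliction"],
    },
    "recovery": {
        "Patient": ["Patient's recovery", "recovery of Patient"],
        "Affliction": ["recovery from Affliction"],
    },
}
inter = inter_lu_rules(templates, "cure", "recovery")
intra = intra_lu_paraphrases(templates, "recovery")
```

On this input the sketch yields three inter-LU rules and one intra-LU paraphrase, matching the counts shown in Figure 2.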
To improve rule quality, we filter out rules that
map FEs of adjunct-like semantic types, such as
Time and Location, since different templates of
such FEs may have different semantic meanings
LexPar Rules are generated from the publicly
available LexPar database. We generated unary
rules from each LexPar rule based on a manually
constructed mapping from FrameNet grammatical
functions to Minipar dependency relations. Figure 3 presents an example of this procedure.
FRED Rules are generated by our algorithm.
(e.g. ‘Time ←before− arrive’ vs. ‘Time ←after− arrive’).
Thus, it is hard to identify those template pairs that
correctly map these FEs for entailment.
We manually evaluated a random sample of 250
rules from the resulting rule-set, out of which we
judged 69% as correct.
Tested Configurations
We evaluated several rule-set configurations:
Table 1: Macro average Recall (R), Precision (P)
and F1 results for the tested configurations.
tacker←−attack’ and ‘attack−→Target’.
Argument mentions are found in the ACE corpus by matching either the seed templates or templates entailing them found in the tested rule-set.
We manually added for each event its relevant
WordNet synset-ids and FrameNet frame-ids, so
only rules fitting the event target meaning will be
extracted from the tested rule-sets.
FRED ∪ WordNet The union of the rule-sets of
FRED and WordNet.
Each configuration was tested on each ACE event.
We measured recall, precision and F1. Table 1
reports macro averages of the three measures over
the 26 ACE events.
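For concreteness, macro averaging over events can be sketched as below. The per-event counts are made-up placeholders and not results from the paper; the sketch assumes that F1 is computed per event and then averaged, which is one common reading of "macro average".

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one event from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(counts_per_event):
    """Average each measure over events, weighting events equally."""
    scores = [precision_recall_f1(*c) for c in counts_per_event.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical (tp, fp, fn) counts for two events.
counts = {"Attack": (8, 2, 8), "Sue": (1, 1, 3)}
macro_p, macro_r, macro_f1 = macro_average(counts)
```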
As expected, using No-Rules achieves the highest precision and the lowest recall compared to all
other configurations. When adding LexPar rules,
Application-based Evaluation
Experimental Setup
We would like to evaluate the overall utility of our
resource for NLP applications, assessing the correctness of the actual rule applications performed
LexPar rule:
Lexemes: arrest −→ apprehend
Valencies: [(Authorities Subj), (Suspect Obj), (Offense (for))] =⇒ [(Authorities Subj), (Suspect Obj), (Offense (in))]
Generated unary rules:
f or
X ←− arrest =⇒ X ←− apprehend , arrest −→ Y =⇒ apprehend −→ Y , arrest −→ Z =⇒ apprehend −→ Z
Figure 3: An example for generation of unary entailment rules from a LexPar rule.
Net, since FrameNet is a much smaller resource.
Yet, its rules are mostly complementary to those
from WordNet. This added value is demonstrated by the 19% recall increase for the union of
FRED and WordNet rule-sets compared to WordNet alone. FRED provides mainly argument mappings for non-substitutable WordNet relations, e.g.
‘attack.n on X ⇒ attack.v X’, but also lexical relations that are missing from WordNet, e.g. ‘ambush.v ⇒ attack.v’.
Overall, our experiment shows that the rule-base generated by FRED seems an appropriate complementary resource to the widely used
WordNet-based rules in semantic inference and
expansion over predicates. This suggestion is especially appealing since our rule-set performs well
even when a WSD module is not applied.
only a slight increase in recall is gained. This
shows that the subset of WordNet rules captured
by LexPar (Section 2.3) might be too small for the
ACE application setting.
When using all WordNet’s substitutable relations, a substantial relative increase in recall is
achieved (32%). Yet, precision decreases dramatically (relative decrease of 44%), causing an overall decrease in F1. Most errors are due to correct
WordNet rules whose LHS is ambiguous. Since
we do not apply a WSD module, these rules are
also incorrectly applied to other senses of the LHS.
While this phenomenon is common to all rule-sets,
WordNet suffers from it the most since it contains
many infrequent word senses.
Our main result is that using FRED’s rule-set,
recall increases significantly, a relative increase
of 27% compared to No-Rules, while precision
hardly decreases. Hence, overall F1 is the highest compared to all other configurations (a relative increase of 17% compared to No-Rules). The
improvement in F1 is statistically significant compared to all other configurations, according to the
two-sided Wilcoxon signed rank test at the level of
0.01 (Wilcoxon, 1945).
FRED performs significantly better than LexPar
in recall, precision and F1 (a relative increase
of 25%, 28% and 41%, respectively). For example,
LexPar hardly utilizes FrameNet’s argument mapping capabilities since most of its rules are based
on a sub-set of WordNet’s substitutable relations.
FRED’s precision is substantially higher than
WordNet’s. This mostly results from the fact
that FrameNet mainly contains common senses
of predicates while WordNet includes many rare
word senses, which, as noted above, harms precision when WSD is not applied. Error analysis
showed that only 7.5% of incorrect extractions are
due to erroneous rules in FRED, while the majority
of errors are due to sense mismatch or syntactic
matching errors of the seed templates or entailing
templates in texts.
FRED’s Recall is somewhat lower than Word-
We presented FRED, a novel algorithm for generating entailment rules solely from the information
contained in FrameNet. Our experiment showed
that FRED’s rules perform substantially better
than LexPar, the only prior rule-set derived from
FrameNet. In addition, FRED’s rule-set largely
complements the rules generated from WordNet
because it contains argument mappings between
non-substitutable predicates, which are missing
from WordNet, as well as lexical relations that are
not included in WordNet.
In future work we plan to investigate combining FrameNet and WordNet rule-sets in a transitive
manner, instead of their simple union.
This work was partially supported by the Rector’s research grant of Bar-Ilan University, the
PASCAL-2 Network of Excellence of the European Community FP7-ICT-2007-1-216886 and
the Israel Science Foundation grant 1112/08.
Collin Baker, Charles Fillmore, and John Lowe. 1998.
The berkeley framenet project. In Proceedings of
COLING-ACL, Montreal, Canada.
Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro,
Danilo Giampiccolo, Bernardo Magnini, and Idan
Szpektor. 2006. The second pascal recognising textual entailment challenge. In Second PASCAL Challenge Workshop for Recognizing Textual Entailment.
Roni Ben Aharon. 2010. Generating entailment rules
from framenet. Master’s thesis, Bar-Ilan University.
Robert Coyne and Owen Rambow. 2009. Lexpar: A
freely available english paraphrase lexicon automatically extracted from framenet. In Proceedings of
the Third IEEE International Conference on Semantic Computing.
Ido Dagan, Oren Glickman, and Bernardo Magnini.
2006. The pascal recognising textual entailment
challenge. In Lecture Notes in Computer Science,
volume 3944, pages 177–190.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge,
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan,
and Bill Dolan. 2007. The third pascal recognizing textual entailment challenge. In Proceedings of
the ACL-PASCAL Workshop on Textual Entailment
and Paraphrasing.
Nizar Habash and Bonnie Dorr. 2003. A categorial
variation database for english. In Proceedings of
the North American Association for Computational
Linguistics (NAACL ’03), pages 96–102, Edmonton,
Canada. Association for Computational Linguistics.
Dekang Lin. 1998. Dependency-based evaluation of
minipar. In Proceedings of the Workshop on Evaluation of Parsing Systems at LREC.
Dan Moldovan and Adrian Novischi. 2002. Lexical
chains for question answering. In Proceedings of
Idan Szpektor and Ido Dagan. 2008. Learning entailment rules for unary templates. In Proceedings
of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 849–856,
Manchester, UK, August.
Idan Szpektor and Ido Dagan. 2009. Augmenting
wordnet-based inference with argument mapping.
In Proceedings of the 2009 Workshop on Applied
Textual Inference, pages 27–35, Suntec, Singapore,
Frank Wilcoxon. 1945. Individual comparisons by
ranking methods. Biometrics Bulletin, 1(6):80–83.
Don’t ‘have a clue’?
Unsupervised co-learning of downward-entailing operators
Cristian Danescu-Niculescu-Mizil and Lillian Lee
Department of Computer Science, Cornell University
[email protected], [email protected]
argument of an upward-entailing operator by a superset (a more general version); in our case, the set
‘opium use’ was replaced by the superset ‘narcotic
Downward-entailing (DE) (also known as
downward monotonic or monotone decreasing)
operators violate this default inference rule: with
DE operators, reasoning instead goes from “sets to
subsets”. An example is the word ‘bans’:
Researchers in textual entailment have
begun to consider inferences involving
downward-entailing operators, an interesting and important class of lexical items
that change the way inferences are made.
Recent work proposed a method for learning English downward-entailing operators
that requires access to a high-quality collection of negative polarity items (NPIs).
However, English is one of the very few
languages for which such a list exists. We
propose the first approach that can be applied to the many languages for which
there is no pre-existing high-precision
database of NPIs. As a case study, we
apply our method to Romanian and show
that our method yields good results. Also,
we perform a cross-linguistic analysis that
suggests interesting connections to some
findings in linguistic typology.
‘The law bans opium use’
⇏ (⇐)
‘The law bans narcotic use’.
Although DE behavior represents an exception to
the default, DE operators are as a class rather common. They are also quite diverse in sense and
even part of speech. Some are simple negations,
such as ‘not’, but some other English DE operators are ‘without’, ‘reluctant to’, ‘to doubt’, and
‘to allow’.1 This variety makes them hard to extract automatically.
1 Introduction
Cristi: “Nicio” ... is that adjective you’ve mentioned.
Anca: A negative pronominal adjective.
Cristi: You mean there are people who analyze that kind of thing?
Anca: The Romanian Academy.
Cristi: They’re crazy.
(From the movie Police, adjective)
Downward-entailing operators are an interesting and varied class of lexical items that change
the default way of dealing with certain types of
inferences. They thus play an important role in
understanding natural language [6, 18–20, etc.].
We explain what downward entailing means by
first demonstrating the “default” behavior, which
is upward entailing. The word ‘observed’ is an
example upward-entailing operator: the statement
(i) ‘Witnesses observed opium use.’
implies
(ii) ‘Witnesses observed narcotic use.’
but not vice versa (we write i ⇒ (⇍) ii). That
is, the truth value is preserved if we replace the
Because DE operators violate the default “sets
to supersets” inference, identifying them can potentially improve performance in many NLP tasks.
Perhaps the most obvious such tasks are those involving textual entailment, such as question answering, information extraction, summarization,
and the evaluation of machine translation [4]. Researchers are in fact beginning to build textual-entailment systems that can handle inferences involving downward-entailing operators other than
simple negations, although these systems almost
all rely on small handcrafted lists of DE operators
[1–3, 15, 16].2 Other application areas are natural-language generation and human-computer interaction, since downward-entailing inferences induce
Some examples showing different constructions for analyzing these operators: ‘The defendant does not own a blue
car’ ⇏ (⇐) ‘The defendant does not own a car’; ‘They are
reluctant to tango’ ⇏ (⇐) ‘They are reluctant to dance’;
‘Police doubt Smith threatened Jones’ ⇏ (⇐) ‘Police doubt
Smith threatened Jones or Brown’; ‘You are allowed to use
Mastercard’ ⇏ (⇐) ‘You are allowed to use any credit card’.
The exception [2] employs the list automatically derived
by Danescu-Niculescu-Mizil, Lee, and Ducott [5], described
Proceedings of the ACL 2010 Conference Short Papers, pages 247–252,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
greater cognitive load than inferences in the opposite direction [8].
Most NLP systems for the applications mentioned above have only been deployed for a small
subset of languages. A key factor is the lack
of relevant resources for other languages. While
one approach would be to separately develop a
method to acquire such resources for each language individually, we instead aim to ameliorate
the resource-scarcity problem in the case of DE
operators wholesale: we propose a single unsupervised method that can extract DE operators in any
language for which raw text corpora exist.
poses, pseudo-NPIs suffice. Also, our preliminary work determined that one of the most famous co-learning algorithms, hubs and authorities
or HITS [11], is poorly suited to our problem.4
Contributions To begin with, we apply our algorithm to produce the first large list of DE operators for a language other than English. In our case
study on Romanian (§4), we achieve quite high
precisions at k (for example, iteration achieves a
precision at 30 of 87%).
Auxiliary experiments explore the effects of using a large but noisy NPI list, should one be available for the language in question. Intriguingly, we
find that co-learning new pseudo-NPIs provides
better results.
Finally (§5), we engage in some cross-linguistic
analysis based on the results of applying our algorithm to English. We find that there are some
suggestive connections with findings in linguistic
Overview of our work Our approach takes the
English-centric work of Danescu-Niculescu-Mizil
et al. [5] — DLD09 for short — as a starting point,
as they present the first and, until now, only algorithm for automatically extracting DE operators
from data. However, our work departs significantly from DLD09 in the following key respect.
DLD09 critically depends on access to a high-quality, carefully curated collection of negative
polarity items (NPIs) — lexical items such as
‘any’, ‘ever’, or the idiom ‘have a clue’ that tend
to occur only in negative environments (see §2
for more details). DLD09 use NPIs as signals of
the occurrence of downward-entailing operators.
However, almost every language other than English lacks a high-quality accessible NPI list.
To circumvent this problem, we introduce a
knowledge-lean co-learning approach. Our algorithm is initialized with a very small seed set
of NPIs (which we describe how to generate), and
then iterates between (a) discovering a set of DE
operators using a collection of pseudo-NPIs (a
concept we introduce) and (b) using the newly-acquired DE operators to detect new pseudo-NPIs.
Appendix available A more complete account
of our work and its implications can be found in a
version of this paper containing appendices, available at˜cristian/acl2010/.
DLD09: successes and challenges
In this section, we briefly summarize those aspects
of the DLD09 method that are important to understanding how our new co-learning method works.
DE operators and NPIs Acquiring DE operators is challenging because of the complete lack of
annotated data. DLD09’s insight was to make use
of negative polarity items (NPIs), which are words
or phrases that tend to occur only in negative contexts. The reason they did so is that Ladusaw’s hypothesis [7, 13] asserts that NPIs only occur within
the scope of DE operators. Figure 1 depicts examples involving the English NPIs ‘any’5 and ‘have
a clue’ (in the idiomatic sense) that illustrate this
relationship. Some other English NPIs are ‘ever’,
‘yet’ and ‘give a damn’.
Thus, NPIs can be treated as clues that a DE
operator might be present (although DE operators
may also occur without NPIs).
Why this isn’t obvious Although the algorithmic idea sketched above seems quite simple, it is
important to note that prior experiments in that
direction have not proved fruitful. Preliminary
work on learning (German) NPIs using a small
list of simple known DE operators did not yield
strong results [14]. Hoeksema [10] discusses why
NPIs might be hard to learn from data.3 We circumvent this problem because we are not interested in learning NPIs per se; rather, for our pur-
We explored three different edge-weighting schemes
based on co-occurrence frequencies and seed-set membership, but the results were extremely poor; HITS invariably
retrieved very frequent words.
The free-choice sense of ‘any’, as in ‘I can skim any paper in five minutes’, is a known exception.
In fact, humans can have trouble agreeing on NPI-hood;
for instance, Lichte and Soehn [14] mention doubts about
over half of Kürschner [12]’s 344 manually collected German
NPI ‘any’:
DE operator ‘not’ or ‘n’t’: ✓ We do n’t have any apples
DE operator ‘doubt’: ✓ I doubt they have any apples
no DE operator: × They have any apples
NPI ‘have a clue’, idiomatic sense:
DE operator ‘not’ or ‘n’t’: ✓ We do n’t have a clue
DE operator ‘doubt’: ✓ I doubt they have a clue
no DE operator: × They have a clue
Figure 1: Examples consistent with Ladusaw’s hypothesis that NPIs can only occur within the scope of
DE operators. A ✓ denotes an acceptable sentence; a × denotes an unacceptable sentence.
DLD09 algorithm Potential DE operators are
collected by extracting those words that appear in
an NPI’s context at least once.6 Then, the potential
DE operators x are ranked by
f (x) :=
database is not available, using Romanian as a
case study.
We used Rada Mihalcea’s corpus of ≈1.45 million
sentences of raw Romanian newswire articles.
Note that we cannot evaluate impact on textual
inference because, to our knowledge, no publicly
available textual-entailment system or evaluation
data for Romanian exists. We therefore examine
the system outputs directly to determine whether
the top-ranked items are actually DE operators or
not. Our evaluation metric is precision at k of a
given system’s ranked list of candidate DE operators; it is not possible to evaluate recall since no
list of Romanian DE operators exists (a problem
that is precisely the motivation for this paper).
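The evaluation metric can be sketched as below. The item names and labels are placeholders, and counting only items labeled “DE” as correct (so that “Hard” items count against precision) is an assumption of this sketch, not a statement of the paper's protocol.

```python
def precision_at_k(ranked, labels, k):
    """Fraction of the top-k ranked candidates labeled as DE operators.

    ranked: candidates in decreasing score order.
    labels: candidate -> one of "DE", "not DE", "Hard".
    """
    top_k = ranked[:k]
    correct = sum(1 for item in top_k if labels.get(item) == "DE")
    return correct / k


# Placeholder candidates with hypothetical annotations.
ranked = ["op_a", "op_b", "op_c", "op_d"]
labels = {"op_a": "DE", "op_b": "not DE", "op_c": "Hard", "op_d": "DE"}
p_at_2 = precision_at_k(ranked, labels, 2)
p_at_4 = precision_at_k(ranked, labels, 4)
```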
To evaluate the results, two native Romanian
speakers labeled the system outputs as being
“DE”, “not DE” or “Hard (to decide)”. The labeling protocol, which was somewhat complex
to prevent bias, is described in the externally-available appendices (§7.1). The complete system
output and annotations are publicly available at:˜cristian/acl2010/.
\[ f(x) := \frac{\text{fraction of NPI contexts that contain } x}{\text{relative frequency of } x \text{ in the corpus}} \]
which compares x’s probability of occurrence
conditioned on the appearance of an NPI with its
probability of occurrence overall.7
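A toy sketch of this ranking score, under several assumptions: sentences are pre-tokenized lists, NPIs are single tokens, the NPI context is the stretch to the left of the NPI back to the nearest comma, semi-colon or sentence start, and relative frequency is computed over tokens. The function names are illustrative, and filtering of well-known DE operators is omitted.

```python
from collections import Counter


def npi_context(tokens, i):
    """Tokens left of position i, back to the nearest ',' or ';'."""
    start = i
    while start > 0 and tokens[start - 1] not in {",", ";"}:
        start -= 1
    return tokens[start:i]


def dld09_scores(sentences, npis):
    """Score f(x) for every word seen in at least one NPI context."""
    word_counts = Counter(w for sent in sentences for w in sent)
    total = sum(word_counts.values())
    contexts = [
        set(npi_context(sent, i))
        for sent in sentences
        for i, w in enumerate(sent)
        if w in npis
    ]
    if not contexts:
        return {}
    in_context = Counter(w for ctx in contexts for w in ctx)
    return {
        w: (in_context[w] / len(contexts)) / (word_counts[w] / total)
        for w in in_context
    }


# Toy corpus built from the Figure 1 style examples.
corpus = [
    ["we", "do", "not", "have", "any", "apples"],
    ["they", "have", "apples"],
    ["i", "doubt", "they", "have", "any", "apples"],
]
scores = dld09_scores(corpus, {"any"})
```

In this toy corpus ‘not’ and ‘doubt’ receive higher scores than ‘have’, because ‘have’ is also frequent outside NPI contexts.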
The method just outlined requires access to a
list of NPIs. DLD09’s system used a subset of
John Lawler’s carefully curated and “moderately
complete” list of English NPIs.8 The resultant
rankings of candidate English DE operators were
judged to be of high quality.
The challenge in porting to other languages:
cluelessness Can the unsupervised approach of
DLD09 be successfully applied to languages other
than English? Unfortunately, for most other languages, it does not seem that large, high-quality
NPI lists are available.
One might wonder whether one can circumvent
the NPI-acquisition problem by simply translating
a known English NPI list into the target language.
However, NPI-hood need not be preserved under
translation [17]. Thus, for most languages, we
lack the critical clues that DLD09 depends on.
Data and evaluation paradigm
Generating a seed set
Even though, as discussed above, the translation
of an NPI need not be an NPI, a preliminary review of the literature indicates that in many languages, there is some NPI that can be translated
as ‘any’ or related forms like ‘anybody’. Thus,
with a small amount of effort, one can form a minimal NPI seed set for the DLD09 method by using an appropriate target-language translation of
‘any’. For Romanian, we used ‘vreo’ and ‘vreun’,
which are the feminine and masculine translations
of English ‘any’.
Getting a clue
In this section, we develop an iterative co-learning algorithm that can extract DE operators
in the many languages where a high-quality NPI
DLD09 policies: (a) “NPI context” was defined as the
part of the sentence to the left of the NPI up to the first
comma, semi-colon or beginning of sentence; (b) to encourage the discovery of new DE operators, those sentences containing one of a list of 10 well-known DE operators were discarded. For Romanian, we treated only negations (‘nu’ and
‘n-’) and questions as well-known environments.
DLD09 used an additional distilled score, but we found
that the distilled score performed worse on Romanian.
DLD09 using the Romanian seed set
We first check whether DLD09 with the two-item seed set described in §3.2 performs well on
Romanian. In fact, the results are fairly poor:
[Figure 2 plots not reproduced. Left panel y-axis: number of DE operators. Right panel y-axis: precision at k (in %).]
Figure 2: Left: Number of DE operators in the top k results returned by the co-learning method at each iteration.
Items labeled “Hard” are not included. Iteration 0 corresponds to DLD09 applied to {‘vreo’, ‘vreun’}. Curves for
k = 60 and 70 omitted for clarity. Right: Precisions at k for the results of the 9th iteration. The bar divisions are:
DE (blue/darkest/largest) and Hard (red/lighter, sometimes non-existent).
the right of a DE operator, up to the first comma,
semi-colon or end of sentence); these candidates x
are then ranked by
for example, the precision at 30 is below 50%.
(See blue/dark bars in figure 3 in the externally-available appendices for detailed results.)
This relatively unsatisfactory performance may
be a consequence of the very small size of the NPI
list employed, and may therefore indicate that it
would be fruitful to investigate automatically extending our list of clues.
\[ f_r(x) := \frac{\text{fraction of DE contexts that contain } x}{\text{relative frequency of } x \text{ in the corpus}} \]
Then, our co-learning algorithm consists of the
iteration of the following two steps:
• (DE learning) Apply DLD09 using a set N
of pseudo-NPIs to retrieve a list of candidate
DE operators ranked by f (defined in Section
2). Let D be the top n candidates in this list.
Main idea: a co-learning approach
Our main insight is that not only can NPIs be used
as clues for finding DE operators, as shown by
DLD09, but conversely, DE operators (if known)
can potentially be used to discover new NPI-like
clues, which we refer to as pseudo-NPIs (or pNPIs
for short). By “NPI-like” we mean, “serve as possible indicators of the presence of DE operators,
regardless of whether they are actually restricted
to negative contexts, as true NPIs are”. For example, in English newswire, the words ‘allegation’ or
‘rumor’ tend to occur mainly in DE contexts, like
‘ denied ’ or ‘ dismissed ’, even though they are
clearly not true NPIs (the sentence ‘I heard a rumor’ is fine). Given this insight, we approach the
problem using an iterative co-learning paradigm
that integrates the search for new DE operators
with a search for new pNPIs.
First, we describe an algorithm that is the “reverse” of DLD09 (henceforth rDLD), in that it retrieves and ranks pNPIs assuming a given list of
DE operators. Potential pNPIs are collected by extracting those words that appear in a DE context
(defined here, to avoid the problems of parsing or
scope determination, as the part of the sentence to
• (pNPI learning) Apply rDLD using the set D
to retrieve a list of pNPIs ranked by fr ; extend N with the top nr pNPIs in this list. Increment n.
Here, N is initialized with the NPI seed set. At
each iteration, we consider the output of the algorithm to be the ranked list of DE operators retrieved in the DE-learning step. In our experiments, we initialized n to 10 and set nr to 1.
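The two-step iteration above can be sketched as follows. This is a simplified illustration, not the authors' code: sentences are pre-tokenized lists, clues are single tokens, both scores use the same ratio with left contexts for DE learning and right contexts for pNPI learning, and skipping pNPIs that are already in N is an assumption of the sketch. All function names are hypothetical.

```python
from collections import Counter

BOUNDARIES = {",", ";"}


def context(tokens, i, side):
    """Left or right context of position i, up to ',' / ';' / sentence edge."""
    if side == "left":
        j = i
        while j > 0 and tokens[j - 1] not in BOUNDARIES:
            j -= 1
        return tokens[j:i]
    j = i + 1
    while j < len(tokens) and tokens[j] not in BOUNDARIES:
        j += 1
    return tokens[i + 1:j]


def rank(sentences, clues, side):
    """Rank words by (fraction of clue contexts containing the word)
    divided by (relative token frequency of the word)."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    ctxs = [set(context(s, i, side))
            for s in sentences for i, w in enumerate(s) if w in clues]
    if not ctxs:
        return []
    hits = Counter(w for c in ctxs for w in c)
    score = {w: (hits[w] / len(ctxs)) / (counts[w] / total) for w in hits}
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(score, key=lambda w: (-score[w], w))


def co_learn(sentences, seed_npis, iterations, n=10, n_r=1):
    """Alternate DE learning and pNPI learning; return the ranked DE
    list of every iteration and the final pseudo-NPI set N."""
    pnpis = set(seed_npis)
    outputs = []
    for _ in range(iterations):
        de_ranked = rank(sentences, pnpis, "left")      # DE learning
        outputs.append(de_ranked)
        top_de = set(de_ranked[:n])
        candidates = rank(sentences, top_de, "right")   # pNPI learning
        new = [w for w in candidates if w not in pnpis]
        pnpis.update(new[:n_r])
        n += 1
    return outputs, pnpis


toy = [
    ["we", "do", "not", "have", "any", "clue"],
    ["they", "deny", "any", "rumor"],
    ["they", "deny", "the", "rumor"],
    ["we", "have", "apples"],
    ["officials", "dismissed", "the", "rumor"],
]
outputs, pnpis = co_learn(toy, {"any"}, iterations=2, n=2, n_r=1)
```

The first element of `outputs` corresponds to iteration 0, that is, the plain NPI-based ranking applied to the seed set alone. A corpus this small only exercises the control flow; it says nothing about ranking quality.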
Romanian results
Our results show that there is indeed favorable
synergy between DE-operator and pNPI retrieval.
Figure 2 plots the number of correctly retrieved
DE operators in the top k outputs at each iteration.
The point at iteration 0 corresponds to a datapoint
already discussed above, namely, DLD09 applied
to the two ‘any’-translation NPIs. Clearly, we see
general substantial improvement over DLD09, although the increases level off in later iterations.
(Determining how to choose the optimal number
of iterations is a subject for future research.)
Additional experiments, described in the
externally-available appendices (§7.2), suggest
that pNPIs can even be more effective clues than
a noisy list of NPIs. (Thus, a larger seed set
does not necessarily mean better performance.)
pNPIs also have the advantage of being derivable
automatically, and might be worth investigating
from a linguistic perspective in their own right.
its own. In the other languages (including Romanian),10 no indefinite pronoun can serve as a sufficient seed. So, we expect our method to be viable for all languages; while the iterative discovery of pNPIs is not necessary (although neither is
it harmful) for the subset of languages for which a
sufficient seed exists, such as English, it is essential for the languages for which, like Romanian,
‘any’-equivalents do not suffice.
Using translation Another interesting question
is whether directly translating DE operators from
English is an alternative to our method. First, we
emphasize that there exists no complete list of English DE operators (the largest available collection is the one extracted by DLD09). Second, we
do not know whether DE operators in one language translate into DE operators in another language. Even if that were the case, and we somehow had access to ideal translations of DLD09’s
list, there would still be considerable value in using our method: 14 (39%) of our top 36 highest-ranked Romanian DE operators for iteration 9 do
not, according to the Romanian-speaking author,
have English equivalents appearing on DLD09’s
90-item list. Some examples are: ‘abţinut’ (abstained), ‘criticat’ (criticized) and ‘reacţionat’ (reacted). Therefore, a significant fraction of the
DE operators derived by our co-learning algorithm
would have been missed by the translation alternative even under ideal conditions.
Cross-linguistic analysis
Applying our algorithm to English: connections to linguistic typology So far, we have
made no assumptions about the language on which
our algorithm is applied. A valid question is, does
the quality of the results vary with choice of application language? In particular, what happens if we
run our algorithm on English?
Note that in some sense, this is a perverse question: the motivation behind our algorithm is the
non-existence of a high-quality list of NPIs for
the language in question, and English is essentially the only case that does not fit this description. On the other hand, the fact that DLD09 applied their method for extraction of DE operators
to English necessitates some form of comparison,
for the sake of experimental completeness.
We thus ran our algorithm on the English
BLLIP newswire corpus with seed set {‘any’} .
We observe that, surprisingly, the iterative addition of pNPIs has very little effect: the precisions
at k are good at the beginning and stay about the
same across iterations (for details see figure 5 in
the externally-available appendices). Thus, on
English, co-learning does not hurt performance,
which is good news; but unlike in Romanian, it
does not lead to improvements.
Why is English ‘any’ seemingly so “powerful”,
in contrast to Romanian, where iterating beyond
the initial ‘any’ translations leads to better results? Interestingly, findings from linguistic typology may shed some light on this issue. Haspelmath [9] compares the functions of indefinite pronouns in 40 languages. He shows that English is
one of the minority of languages (11 out of 40)9 in
which there exists an indefinite pronoun series that
occurs in all (Haspelmath’s) classes of DE contexts, and thus can constitute a sufficient seed on
We have introduced the first method for discovering downward-entailing operators that is universally applicable. Previous work on automatically
detecting DE operators assumed the existence of
a high-quality collection of NPIs, which renders it
inapplicable in most languages, where such a resource does not exist. We overcome this limitation by employing a novel co-learning approach,
and demonstrate its effectiveness on Romanian.
Also, we introduce the concept of pseudo-NPIs.
Auxiliary experiments described in the externally-available appendices show that pNPIs are actually
more effective seeds than a noisy “true” NPI list.
Finally, we noted some cross-linguistic differences in performance, and found an interesting
connection between these differences and Haspelmath’s [9] characterization of cross-linguistic variation in the occurrence of indefinite pronouns.
English, Ancash Quechua, Basque, Catalan, French,
Hindi/Urdu, Irish, Portuguese, Swahili, Swedish, Turkish.
Examples: Chinese, German, Italian, Polish, Serbian.
Acknowledgments We thank Tudor Marian for
serving as an annotator, Rada Mihalcea for access to the Romanian newswire corpus, and Claire
Cardie, Yejin Choi, Effi Georgala, Mark Liberman, Myle Ott, João Paula Muchado, Stephen Purpura, Mark Yatskar, Ainur Yessenalina, and the
anonymous reviewers for their helpful comments.
Supported by NSF grant IIS-0910664.
[9] Martin Haspelmath. Indefinite Pronouns.
Oxford University Press, 2001.
[10] Jack Hoeksema. Corpus study of negative
polarity items. IV-V Jornades de corpus linguistics 1996-1997, 1997. http://odur.let.rug.
[11] Jon Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of
the 9th ACM-SIAM Symposium on Discrete
Algorithms (SODA), pages 668–677, 1998.
Extended version in Journal of the ACM,
46:604–632, 1999.
[1] Roy Bar-Haim, Jonathan Berant, Ido Dagan, Iddo Greental, Shachar Mirkin, Eyal
Shnarch, and Idan Szpektor. Efficient semantic deduction and approximate matching over
compact parse forests. In Proceedings of the
Text Analysis Conference (TAC), 2008.
[12] Wilfried Kürschner. Studien zur Negation im
Deutschen. Narr, 1983.
[13] William A. Ladusaw. Polarity Sensitivity as
Inherent Scope Relations. Garland Press,
New York, 1980. Ph.D. thesis date 1979.
[2] Eric Breck. A simple system for detecting
non-entailment. In Proceedings of the Text
Analysis Conference (TAC), 2009.
[14] Timm Lichte and Jan-Philipp Soehn. The retrieval and classification of Negative Polarity Items using statistical profiles. In Sam
Featherston and Wolfgang Sternefeld, editors, Roots: Linguistics in Search of its Evidential Base, pages 249–266. Mouton de
Gruyter, 2007.
[3] Christos Christodoulopoulos. Creating a natural logic inference system with combinatory
categorial grammar. Master’s thesis, University of Edinburgh, 2008.
[4] Ido Dagan, Oren Glickman, and Bernardo
Magnini. The PASCAL Recognising Textual
Entailment challenge. In Machine Learning Challenges, Evaluating Predictive Uncertainty, Visual Object Classification and
Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop, pages 177–190. Springer, 2006.
[15] Bill MacCartney and Christopher D. Manning. Modeling semantic containment and
exclusion in natural language inference. In
Proceedings of COLING, pages 521–528,
[16] Rowan Nairn, Cleo Condoravdi, and Lauri
Karttunen. Computing relative polarity for
textual inference. In Proceedings of Inference in Computational Semantics (ICoS),
[5] Cristian Danescu-Niculescu-Mizil, Lillian
Lee, and Richard Ducott. Without a ‘doubt’?
Unsupervised discovery of downward-entailing operators. In Proceedings of
NAACL HLT, 2009.
[17] Frank Richter, Janina Radó, and Manfred
Sailer. Negative polarity items: Corpus
linguistics, semantics, and psycholinguistics: Day 2: Corpus linguistics. Tutorial
esslli/08/byday/day2/day2-part1.pdf, 2008.
[6] David Dowty. The role of negative polarity and concord marking in natural language
reasoning. In Mandy Harvey and Lynn Santelmann, editors, Proceedings of SALT IV,
pages 114–144, 1994.
[7] Gilles Fauconnier. Polarity and the scale
principle. In Proceedings of the Chicago Linguistic Society (CLS), pages 188–199, 1975.
Reprinted in Javier Gutierrez-Rexach (ed.),
Semantics: Critical Concepts in Linguistics,
[18] Vı́ctor Sánchez Valencia. Studies on natural
logic and categorial grammar. PhD thesis,
University of Amsterdam, 1991.
[19] Johan van Benthem. Essays in Logical Semantics. Reidel, Dordrecht, 1986.
[20] Ton van der Wouden. Negative contexts:
Collocation, polarity and multiple negation.
Routledge, 1997.
[8] Bart Geurts and Frans van der Slik. Monotonicity and processing load. Journal of Semantics, 22(1):97–117, 2005.
Vocabulary Choice as an Indicator of Perspective
Beata Beigman Klebanov, Eyal Beigman, Daniel Diermeier
Northwestern University and Washington University in St. Louis
beata,[email protected], [email protected]
one can talk about perspectives of people on two
sides of a conflict; this is not opposition or support for any particular proposal, but ideas about
a highly related cluster of issues, such as Israeli
and Palestinian perspectives on the conflict in all
its manifestations. Zooming out even further, one
can talk about perspectives due to certain life contingencies, such as being born and raised in a particular culture, region, religion, or political tradition, such perspectives manifesting themselves in
certain patterns of discourse on a wide variety of
issues, for example, views on political issues in the
Middle East from Arab vs Western observers.
In this article, we consider perspective at all
the four levels of abstraction. We apply the same
types of models to all, in order to discover any
common properties of perspective classification.
We contrast it with text categorization and with
opinion classification by employing models routinely used for such tasks. Specifically, we consider models that use term frequencies as features
(usually found to be superior for text categorization) and models that use term absence/presence
(usually found to be superior for opinion classification). We motivate our hypothesis that presence/absence features would be as good as or
better than frequencies, and test it experimentally.
Secondly, we investigate the question of feature
redundancy often observed in text categorization.
We establish the following characteristics of the task of perspective classification: (a) using term frequencies in a
document does not improve classification
achieved with absence/presence features;
(b) for datasets allowing the relevant comparisons, a small number of top features is
found to be as effective as the full feature
set and indispensable for the best achieved
performance, testifying to the existence
of perspective-specific keywords. We relate our findings to research on word frequency distributions and to discourse analytic studies of perspective.
We address the task of perspective classification.
Apart from the spatial sense not considered here,
perspective can refer to an agent’s role (doctor vs
patient in a dialogue), or be understood as “a particular way of thinking about something, especially one that is influenced by one’s beliefs or
experiences,” stressing the manifestation of one’s
broader perspective in some specific issue, or “the
state of one’s ideas, the facts known to one, etc.,
in having a meaningful interrelationship,” stressing the meaningful connectedness of one’s stances
and pronouncements on possibly different issues.1
Accordingly, one can talk about, say, opinion
on a particular proposed legislation on abortion
within pro-choice or pro-life perspectives; in this
case, perspective essentially boils down to opinion in a particular debate. Holding the issue constant but relaxing the requirement of a debate on a
specific document, we can consider writings from
pro- and con- perspective, in, for example, the
death penalty controversy over a course of a period
of time. Relaxing the issue specificity somewhat,
Vocabulary Selection
A line of inquiry going back at least to Zipf strives
to characterize word frequency distributions in
texts and corpora; see Baayen (2001) for a survey. One of the findings in this literature is that
a multinomial (called “urn model” by Baayen)
is not a good model for word frequency distributions. Among the many proposed remedies
(Baayen, 2001; Jansche, 2003; Baroni and Evert,
2007; Bhat and Sproat, 2009), we would like to
draw attention to the following insight articulated
Google English Dictionary,
Proceedings of the ACL 2010 Conference Short Papers, pages 253–257,
Uppsala, Sweden, 11-16 July 2010. 2010
Association for Computational Linguistics
by President Bush in 2003. We use data from
278 legislators, with 669 speeches in all. We
take only one speech per speaker per year; since
many serve multiple years, each speaker is represented with 1 to 5 speeches. We perform 10-fold
cross-validation splitting by speakers, so that all
speeches by the same speaker are assigned to the
same fold and testing is always inter-speaker.
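
The speaker-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `speaker_folds` and the greedy balancing of fold sizes are assumptions; the only property taken from the text is that all speeches by one speaker land in the same fold.

```python
from collections import defaultdict

def speaker_folds(speech_speakers, n_folds=10):
    """Assign each speech to a fold so that a speaker never spans two folds.

    speech_speakers: list where item i is the speaker id of speech i.
    Returns a list of fold indices, one per speech.
    """
    # Group speech indices by speaker.
    by_speaker = defaultdict(list)
    for idx, speaker in enumerate(speech_speakers):
        by_speaker[speaker].append(idx)

    # Greedy balancing (an assumption): handle speakers with the most
    # speeches first and put each into the currently smallest fold.
    fold_sizes = [0] * n_folds
    fold_of = [None] * len(speech_speakers)
    ordered = sorted(by_speaker, key=lambda s: (-len(by_speaker[s]), str(s)))
    for speaker in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for idx in by_speaker[speaker]:
            fold_of[idx] = target
        fold_sizes[target] += len(by_speaker[speaker])
    return fold_of

# Hypothetical speaker ids for seven speeches.
speakers = ["a", "b", "a", "c", "b", "d", "a"]
folds = speaker_folds(speakers, n_folds=3)
```

Because the fold is chosen per speaker rather than per speech, every test fold contains only speakers unseen during training.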
When deriving the label for perspective, it is important to differentiate between a particular legislation and a pro-choice / pro-life perspective.
A pro-choice person might still support the bill:
“I am pro-choice, but believe late-term abortions
are wrong. Abortion is a very personal decision
and a woman’s right to choose whether to terminate a pregnancy subject to the restrictions of
Roe v. Wade must be protected. In my judgment,
however, the use of this particular procedure cannot be justified.” (Rep. Shays, R-CT, 2003). To
avoid inconsistency between vote and perspective,
we use data from pro-choice and pro-life nongovernmental organizations, NARAL and NRLC,
that track legislators’ votes on abortion-related
bills, showing the percentage of times a legislator
supported the side the organization deems consistent with its perspective. We removed 22 legislators with a mixed record, that is, those who gave
20-60% support to one of the positions.2
Death Penalty (DP) blogs: We use the University
of Maryland Death Penalty Corpus (Greene and
Resnik, 2009) of 1085 texts from a number of pro- and anti-death penalty websites. We report 4-fold
cross-validation (DP-4) using the folds in Greene
and Resnik (2009), where training and testing data
come from different websites for each of the sides,
as well as 10-fold cross-validation performance on
the entire corpus, irrespective of the site.3
Bitter Lemons (BL): We use the GUEST part
of the BitterLemons corpus (Lin et al., 2006), containing 296 articles published in 2001-2005 by more than 200 different Israeli and Palestinian writers on issues related to the conflict.
Bitter Lemons International (BL-I): We collected 150 documents each by a different per-
most clearly in Jansche (2003). Estimation is improved if texts are construed as being generated by
two processes, one choosing which words would
appear at all in the text, and then, for words that
have been chosen to appear, how many times they
would in fact appear. Jansche (2003) describes a
two-stage generation process: (1) Toss a z-biased
coin; if it comes up heads, generate 0; if it comes
up tails, (2) generate according to F (θ), where
F (θ) is a negative binomial distribution and z is a
parameter controlling the extent of zero-inflation.
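
The two-stage process can be sketched as a sampler. This is a minimal sketch under stated assumptions: the function names are hypothetical, the negative binomial is parameterized by an integer number of successes `r` and a success probability `p` (the source only says F(θ) is a negative binomial), and the parameter values are illustrative.

```python
import random

def sample_negative_binomial(r, p, rng):
    """Number of failures before the r-th success (integer r, an assumption)."""
    failures = 0
    successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

def sample_word_count(z, r, p, rng):
    """Two-stage draw: with probability z emit 0, otherwise draw from F(theta)."""
    if rng.random() < z:
        # Stage 1 came up "heads": the word is not selected for this text.
        return 0
    # Stage 2: the word may appear; F(theta) decides how many times.
    return sample_negative_binomial(r, p, rng)

rng = random.Random(0)
counts = [sample_word_count(z=0.9, r=2, p=0.5, rng=rng) for _ in range(10000)]
zero_share = sum(c == 0 for c in counts) / len(counts)
```

Note that stage 2 can itself yield 0, so the share of zero counts exceeds z; with these illustrative parameters it is about 0.9 + 0.1 × 0.25 = 0.925 in expectation.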
The postulation of two separate processes is
effective for predicting word frequencies, but is
there any meaning to the two processes? The first
process of deciding on the vocabulary, or word
types, for the text – what is its function? Jansche
(2003) suggests that the zero-inflation component
takes care of the multitude of vocabulary words
that are not “on topic” for the given text, including
taboo words, technical jargon, proper names. This
implies that words that are chosen to appear are
all “on topic”. Indeed, text segmentation studies
show that tracing recurrence of words in a text
permits topical segmentation (Hearst, 1997; Hoey,
1991). Yet, if a person compares abortion to infanticide – are we content with describing this word
as being merely “on topic,” that is, having a certain
probability of occurrence once the topic of abortion comes up? In fact, it is only likely to occur
if the speaker holds a pro-life perspective, while a
pro-choicer would avoid this term.
We therefore hypothesize that the choice of vocabulary is not only a matter of topic but also
of perspective, while word recurrence has mainly
to do with the topical composition of the text.
Therefore, tracing word frequencies is not going to
be effective for perspective classification beyond
noting the mere presence/absence of words, differently from the findings in text categorization,
where frequency-based features usually do better
than boolean features for sufficiently large vocabulary sizes (McCallum and Nigam, 1998).
Partial Birth Abortion (PBA) debates: We use
transcripts of the debates on Partial Birth Abortion Ban Act on the floors of the US House and
Senate in 104-108 Congresses (1995-2003). Similar legislation was proposed multiple times, passed
the legislatures, and, after having initially been vetoed by President Clinton, was signed into law
Ratings are from: We further excluded data from Rep. James Moran, D-VA, as he
changed his vote over the years. For legislators rated by neither NRLC nor NARAL, we assumed the vote aligns with the
The 10-fold setting yields almost perfect performance
likely due to site-specific features beyond perspective per se,
hence we do not use this setting in subsequent experiments.
son from either Arab or Western perspectives
on Middle Eastern affairs in 2003-2009 from The
writers and interviewees on this site are usually
former diplomats or government officials, academics, journalists, media and political analysts.4
The specific issues cover a broad spectrum, including public life, politics, wars and conflicts, education, trade relations in and between countries like
Lebanon, Jordan, Iraq, Egypt, Yemen, Morocco,
Saudi Arabia, as well as their relations with the
US and members of the European Union.
NB - COUNT and SVM - NORMF for perspective classification; Pang et al. (2002) consider most and
Yu et al. (2008) all of the above for related tasks
of movie review and political party classification.
We use SVMlight (Joachims, 1999) for SVM and
WEKA toolkit (Witten and Frank, 2005; Hall et
al., 2009) for both versions of Naive Bayes. Parameter optimization for all SVM models is performed
using grid search on the training data separately
for each partition into train and test data.6
Table 2 summarizes the cross-validation results for
the four datasets discussed above. Notably, the
SVM - BOOL model is either the best or not significantly different from the best performing model,
although the competitors use more detailed textual
information, namely, the count of each word’s appearance in the text, either raw (NB - COUNT), normalized (SVM - NORMF), or combined with document frequency (SVM - TFIDF).
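
To make the four feature weightings concrete, the sketch below computes them for tokenized documents. The exact formulas are assumptions, since the text does not spell them out: normalized frequency is taken as count divided by document length, and tfidf as count times log(N/df). The function name `weight_features` is hypothetical.

```python
import math
from collections import Counter

def weight_features(docs):
    """docs: list of token lists. Returns per-document feature dicts per scheme."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents that contain each word.
    df = Counter(word for c in counts for word in c)
    out = {"bool": [], "count": [], "normf": [], "tfidf": []}
    for c in counts:
        length = sum(c.values())
        out["bool"].append({w: 1 for w in c})                     # presence only
        out["count"].append(dict(c))                              # raw counts
        out["normf"].append({w: k / length for w, k in c.items()})
        out["tfidf"].append(
            {w: k * math.log(n_docs / df[w]) for w, k in c.items()}
        )
    return out

# Two toy documents with hypothetical tokens.
docs = [["ban", "ban", "choice"], ["choice", "life"]]
features = weight_features(docs)
```

Only the boolean scheme discards how often a word recurs; the other three all carry count information in some form.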
We are interested in perspective manifestations
using common English vocabulary. To avoid the
possibility that artifacts such as names of senators
or states drive the classification, we use as features
words that contain only lowercase letters, possibly
hyphenated. No stemming is performed, and no
stopwords are excluded.5
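
A minimal sketch of this filter is given below. The tokenization is an assumption (whitespace splitting, then stripping surrounding punctuation); the source only states that features are words made of lowercase letters, possibly hyphenated, with no stemming and no stopword removal. The names `LOWER_WORD` and `extract_features` are hypothetical.

```python
import re
import string

# One or more lowercase runs joined by single hyphens, e.g. "late-term".
LOWER_WORD = re.compile(r"[a-z]+(?:-[a-z]+)*")

def extract_features(text):
    """Return the set of lowercase (possibly hyphenated) words in a text."""
    feats = set()
    for token in text.split():
        token = token.strip(string.punctuation)  # drop surrounding punctuation
        if LOWER_WORD.fullmatch(token):
            feats.add(token)
    return feats

example = extract_features("Senator Shays said late-term abortions are wrong.")
```

Capitalized tokens such as names of senators or states fail the pattern and are dropped, while function words are retained because no stopword list is applied.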
Table 1: Summary of corpora
9.8 K
10 K
25 K
Table 2: Classification accuracy. Scores significantly different from the best performance
(p2t <0.05 on paired t-test) are given an asterisk.
# CV folds
For generative models, we use two versions
of Naive Bayes models termed multi-variate
Bernoulli (here, NB - BOOL) and multinomial (here,
NB - COUNT ), respectively, in McCallum and
Nigam’s (1998) study of event models for text categorization. The first records presence/absence of a
word in a text, while the second records the number of occurrences. McCallum and Nigam (1998)
found NB - COUNT to do better than NB - BOOL for
sufficiently large vocabulary sizes for text categorization by topic. For discriminative models, we
use linear SVM, with presence-absence, normalized frequency, and tfidf feature weighting. Both
types of models are commonly used for text classification tasks. For example, Lin et al. (2006) use
We conclude that there is no evidence for the
relevance of the frequency composition of the
text for perspective classification, for all levels of
venue- and topic-control, from the tightest (PBA
debates) to the loosest (Western vs Arab authors
on Middle Eastern affairs). This result is a clear
indication that perspective classification is quite
different from text categorization by topic, where
count-based features usually perform better than
boolean features. On the other hand, we have not
Parameter c controlling the trade-off between errors
on training data and margin is optimized for all datasets,
with the grid c = {10^-6, 10^-5, ..., 10^5}. On the DP
data, parameter j controlling penalties for misclassification
of positive and negative cases is optimized as well (j =
{10^-2, 10^-1, ..., 10^2}), since datasets are unbalanced (for
example, there is a fold with 27%-73% split).
Here SVM - TFIDF is doing somewhat better than SVM - BOOL on one of the folds and much worse on two other folds;
paired t-test with just 4 pairs of observations does not detect
a significant difference.
We excluded Israeli, Turkish, Iranian, Pakistani writers
as not clearly representing either perspective.
We additionally removed words containing support, oppos, sustain, overrid from the PBA data, in order not to inflate the performance on perspective classification due to the
explicit reference to the upcoming vote.
observed that boolean features are reliably better
than count-based features, as reported for the sentiment classification task in the movie review domain (Pang et al., 2002).
We note the low performance on BL-I, which
could testify to a low degree of lexical consolidation in the Arab vs Western perspectives (more on
this below). It is also possible that the small size of
BL-I leads to overfitting and low accuracies. However, a PBA subset with only 151 items (only 2002
and 2003 speeches) is still 96% classifiable, so size
alone does not explain low BL-I performance.
in long-lasting controversies tend to consolidate
their vocabulary and signal their perspective with
certain stigma words and banner words, that is,
specific keywords used by a discourse community to implicate adversaries and to create sympathy with own perspective, respectively (Teubert,
2001). Thus, in abortion debates, using infanticide as a synonym for abortion is a pro-life stigma.
Note that this does not mean the rest of the features are not informative for classification, only
that they are redundant with respect to a small percentage of top weight features.
When N best features are eliminated, performance goes down significantly with even smaller
N for PBA and BL datasets. Thus, top features
are not only effective, they are also crucial for accurate classification, as their discrimination capacity is not replicated by any of the other vocabulary words. This finding is consistent with Lin
and Hauptmann’s (2006) study of perspective vs
topic classification: While topical differences between two corpora are manifested in differences in
distributions of a great many words, they observed
little perspective-based variation in distributions
of most words, apart from certain words that are
preferentially used by adherents of one or the other
perspective on the given topic.
Consolidation of perspective
We explore feature redundancy in perspective
classification. We first investigate retention of only
N best features, then elimination thereof. As a
proxy of feature quality, we use the weight assigned to the feature by the SVM - BOOL model
based on the training data. Thus, to get the performance with N best features, we take the N/2
highest and lowest weight features, for the positive and negative classes, respectively, and retrain
SVM - BOOL with these features only.8
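
The selection step can be sketched as follows, assuming the weights of a trained linear model are available as a dictionary from feature to weight. The function names and the example weights are hypothetical, and the retraining itself is not shown.

```python
def n_best_features(weights, n):
    """Return the N/2 lowest-weight and N/2 highest-weight features.

    weights: dict mapping feature -> weight from a linear model.
    Positive weights are read as the positive class, negative as the negative.
    """
    ranked = sorted(weights, key=weights.get)  # ascending by weight
    half = n // 2
    lowest = ranked[:half]
    highest = ranked[-half:] if half else []
    return set(lowest) | set(highest)

def without_n_best(weights, n):
    """Return the features that remain once the N best are eliminated."""
    best = n_best_features(weights, n)
    return {f for f in weights if f not in best}

# Illustrative weights only; not taken from the paper's models.
weights = {"infanticide": 2.1, "the": 0.01, "choice": -1.7,
           "of": -0.02, "baby": 1.2, "rights": -0.9}
kept = n_best_features(weights, 4)
rest = without_n_best(weights, 4)
```

The retention experiment retrains on `kept` only, while the elimination experiment retrains on `rest`; comparing both against the full feature set separates the sufficiency of the top features from their necessity.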
Table 3: Consolidation of perspective. Nbest
shows the smallest N and its proportion out of
all features for which the performance of SVM - BOOL with only the best N features is not significantly inferior (p1t >0.1) to that of the full
feature set. No-Nbest shows the largest number N for which a model without N best features is not significantly inferior to the full model.
N={50, 100, 150, . . . , 1000}; for DP and BL-I, additionally N={1050, 1100, ..., 1500}; for PBA, additionally N={10, 20, 30, 40}.
250 2.6%
500 4.9%
100 <1%
200 2.2%
For DP and BL-I datasets, the results seem
to suggest perspectives with more diffused keyword distribution (No-Nbest figures are higher).
We note, however, that feature redundancy experiments are confounded in these cases by either a
low power of the paired t-test with only 4 pairs
(DP) or by a high variance in performance among
the 10 folds (BL-I), both of which lead to numerically large discrepancy in performance that is not
deemed significant, making it easy to “match” the
full set performance with small-N best features as
well as without large-N best features. Better comparisons are needed in order to verify the hypothesis of low consolidation.
1250 5.2%
In future work, we plan to experiment with additional features. For example, Greene and Resnik
(2009) reported higher classification accuracies
for the DP-4 data using syntactic frames in which
a selected group of words appeared, rather than
mere presence/absence of the words. Another direction is exploring words as members of semantic fields – while word use might be insufficiently
consistent within a perspective, selection of a semantic domain might show better consistency.
We observe that it is generally sufficient to use
a small percentage of the available words to obtain the same classification accuracy as with the
full feature set, even in high-accuracy cases such
as PBA and BL. The effectiveness of a small
subset of features is consistent with the observation in the discourse analysis studies that rivals
We experimented with the mutual information based feature selection as well, with generally worse results.
Wolfgang Teubert. 2001. A Province of a Federal
Superstate, Ruled by an Unelected Bureaucracy –
Keywords of the Euro-Sceptic Discourse in Britain.
In Andreas Musolff, Colin Good, Petra Points, and
Ruth Wittlinger, editors, Attitudes towards Europe:
Language in the unification process, pages 45–86.
Ashgate Publishing Ltd, Hants, England.
Harald Baayen. 2001. Word frequency distributions.
Dordrecht: Kluwer.
Marco Baroni and Stefan Evert. 2007. Words
and Echoes: Assessing and Mitigating the NonRandomness Problem in Word Frequency Distribution Modeling. In Proceedings of the ACL, pages
904–911, Prague, Czech Republic.
Ian H. Witten and Eibe Frank. 2005. Data Mining:
Practical Machine Learning Tools and Techniques.
Morgan Kaufmann, 2nd edition.
Suma Bhat and Richard Sproat. 2009. Knowing the
Unseen: Estimating Vocabulary Size over Unseen
Samples. In Proceedings of the ACL, pages 109–
117, Suntec, Singapore, August.
Bei Yu, Stefan Kaufmann, and Daniel Diermeier.
2008. Classifying party affiliation from political
speech. Journal of Information Technology and Politics, 5(1):33–48.
Stephan Greene and Philip Resnik. 2009. More
than Words: Syntactic Packaging and Implicit Sentiment. In Proceedings of HLT-NAACL, pages 503–
511, Boulder, CO, June.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard
Pfahringer, Peter Reutemann, and Ian H. Witten.
2009. The WEKA data mining software: An update. SIGKDD Explorations, 11(1).
Marti Hearst. 1997. TextTiling: Segmenting Text into
Multi-Paragraph Subtopic Passages. Computational
Linguistics, 23(1):33–64.
Michael Hoey. 1991. Patterns of Lexis in Text. Oxford
University Press.
Martin Jansche. 2003. Parametric Models of Linguistic Count Data. In Proceedings of the ACL, pages
288–295, Sapporo, Japan, July.
Thorsten Joachims. 1999. Making large-scale SVM
learning practical. In B. Schölkopf, C. Burges, and
A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press.
Wei-Hao Lin and Alexander Hauptmann. 2006. Are
these documents written from different perspectives? A test of different perspectives based on statistical distribution divergence. In Proceedings of
the ACL, pages 1057–1064, Morristown, NJ, USA.
Wei-Hao Lin, Theresa Wilson, Janyce Wiebe, and
Alexander Hauptmann. 2006. Which side are you
on? Identifying perspectives at the document and
sentence levels. In Proceedings of CoNLL, pages
109–116, Morristown, NJ, USA.
Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for Naive Bayes text classification. In Proceedings of AAAI-98 Workshop
on Learning for Text Categorization, pages 41–48,
Madison, WI, July.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
2002. Thumbs up? Sentiment Classification using
Machine Learning Techniques. In Proceedings of
EMNLP, Philadelphia, PA, July.
Cross Lingual Adaptation: An Experiment on Sentiment Classifications
Bin Wei
University of Rochester
Rochester, NY, USA.
[email protected]
Christopher Pal
École Polytechnique de Montréal
Montréal, QC, Canada.
[email protected]
error introduced by the translator, Wan in (Wan,
2009) applied a co-training scheme. In this setting
classifiers are trained in both languages and the
two classifiers teach each other for the unlabeled
examples. The co-training approach manages to
boost the performance as it allows the text similarity in the target language to compete with the
“fake” similarity from the translated texts. However, the translated texts are still used as training
data and thus can potentially mislead the classifier.
As we are not really interested in predicting something on the language created by the translator,
but rather on the real one, it may be better to further diminish the role of the translated texts in the
learning process. Motivated by this observation,
we suggest here to view this problem as a special
case of domain adaptation: in the source domain,
we mainly observe English features, while in the
other domain mostly features from Chinese. The
problem we address is how to associate the features under a unified setting.
There has been a lot of work in domain adaptation
for NLP (Dai et al., 2007)(Jiang and Zhai, 2007)
and one suitable choice for our problem is the approach based on structural correspondence learning (SCL) as in (Blitzer et al., 2006) and (Blitzer
et al., 2007b). The key idea of SCL is to identify
low-dimensional representations that capture correspondence between features from both domains
(xs and xt in our case) by modeling their correlations with some special pivot features. The SCL
approach is a good fit for our problem as it performs knowledge transfer through identifying important features. In the cross-lingual setting, we
can restrict the translated texts by using them only
through the pivot features. We believe this form is
more robust to errors in the language produced by
the translator.
Adapting language resources and knowledge to
a n