ACTA UNIVERSITATIS UPSALIENSIS
Studia Linguistica Upsaliensia
15
Discourse in Statistical
Machine Translation
Christian Hardmeier
Dissertation presented at Uppsala University to be publicly examined in Universitetshuset,
Sal X, Uppsala, Saturday, 14 June 2014 at 10:15 for the degree of Doctor of Philosophy.
The examination will be conducted in English.
Faculty examiner: Dr. Lluís Màrquez (Qatar Computing Research Institute).
Abstract
Hardmeier, C. 2014. Discourse in Statistical Machine Translation. Studia Linguistica
Upsaliensia 15. 185 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-8963-2.
This thesis addresses the technical and linguistic aspects of discourse-level processing in
phrase-based statistical machine translation (SMT). Connected texts can have complex text-level linguistic dependencies across sentences that must be preserved in translation. However,
the models and algorithms of SMT are pervaded by locality assumptions. In a standard SMT
setup, no model has more complex dependencies than an n-gram model. The popular stack
decoding algorithm exploits this fact to implement efficient search with a dynamic
programming technique. This is a serious technical obstacle to discourse-level modelling in
SMT.
From a technical viewpoint, the main contribution of our work is the development of a
document-level decoder based on stochastic local search that translates a complete document
as a single unit. The decoder starts with an initial translation of the document, created
randomly or by running a stack decoder, and refines it with a sequence of elementary
operations. After each step, the current translation is scored by a set of feature models with
access to the full document context and its translation. We demonstrate the viability of this
decoding approach for different document-level models.
From a linguistic viewpoint, we focus on the problem of translating pronominal anaphora.
After investigating the properties and challenges of the pronoun translation task both
theoretically and by studying corpus data, a neural network model for cross-lingual pronoun
prediction is presented. This network jointly performs anaphora resolution and pronoun
prediction and is trained on bilingual corpus data only, with no need for manual coreference
annotations. The network is then integrated as a feature model in the document-level SMT
decoder and tested in an English–French SMT system. We show that the pronoun prediction
network model more adequately represents discourse-level dependencies for less frequent
pronouns than a simpler maximum entropy baseline with separate coreference resolution.
By creating a framework for experimenting with discourse-level features in SMT, this work
contributes to a long-term perspective that strives for more thorough modelling of complex
linguistic phenomena in translation. Our results on pronoun translation shed new light on a
challenging, but essential problem in machine translation that is as yet unsolved.
Keywords: Statistical machine translation, Discourse-level machine translation, Document
decoding, Local search, Pronominal anaphora, Pronoun translation, Neural networks
Christian Hardmeier, Uppsala University, Department of Linguistics and Philology,
Box 635, SE-75126 Uppsala, Sweden.
© Christian Hardmeier 2014
ISSN 1652-1366
ISBN 978-91-554-8963-2
urn:nbn:se:uu:diva-223798 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-223798)
Printed by Elanders Sverige AB, 2014
To Ursula
Contents
1 Introduction ........ 13
  1.1 Motivation and Goals ........ 13
  1.2 SMT and the Translation Process ........ 16
  1.3 Modelling Assumptions ........ 21
  1.4 MT Evaluation and Translation Quality ........ 22
  1.5 Relation to Published Work ........ 25

2 Research on Discourse and SMT ........ 27
  2.1 Discourse Structure and Document Structure ........ 27
  2.2 Cohesion, Coherence and Consistency ........ 28
    2.2.1 Corpus Studies ........ 28
    2.2.2 Cross-Sentence Language Models ........ 30
    2.2.3 Lexical Cohesion by Topic Modelling ........ 31
    2.2.4 Encouraging Lexical Consistency ........ 32
    2.2.5 Models of Cohesion and Coherence ........ 32
  2.3 Targeting Specific Discourse Phenomena ........ 33
    2.3.1 Pronominal Anaphora ........ 33
    2.3.2 Noun Phrase Definiteness ........ 36
    2.3.3 Verb Tense and Aspect ........ 37
    2.3.4 Discourse Connectives ........ 37
  2.4 Document-Level Decoding ........ 38
  2.5 Discourse-Aware MT Evaluation ........ 39
  2.6 Conclusion ........ 40

Part I: Algorithms for Document-Level SMT ........ 43

3 Discourse-Level Processing with Sentence-Level Tools ........ 45
  3.1 An Overview of Phrase-Based SMT ........ 45
  3.2 The Stack Decoding Algorithm ........ 47
  3.3 Two-Pass Decoding ........ 49
  3.4 Sentence-to-Sentence Information Propagation ........ 50
  3.5 Document-Level Optimisation by Output Rescoring ........ 53
  3.6 Conclusion ........ 54

4 Document-Level Decoding with Local Search ........ 56
  4.1 A Formal Model of Phrase-Based SMT ........ 56
  4.2 The Local Search Decoding Algorithm ........ 58
  4.3 State Initialisation ........ 61
  4.4 State Operations ........ 62
    4.4.1 Changing Phrase Translations ........ 63
    4.4.2 Changing Phrase Order ........ 63
    4.4.3 Resegmentation ........ 64
    4.4.4 Special Operations for Simulated Annealing ........ 64
  4.5 Efficiency Considerations ........ 65
  4.6 Experimental Results ........ 67
    4.6.1 Stability ........ 67
    4.6.2 Search Algorithm Parameters ........ 69
  4.7 Feature Weight Optimisation ........ 70
  4.8 Related Work ........ 71
  4.9 Conclusion ........ 72

5 Case Studies in Document-Level SMT ........ 73
  5.1 Translating Consistently: Modelling Lexical Cohesion ........ 73
    5.1.1 Translation Consistency in Different MT Systems ........ 74
    5.1.2 Word-Space Models for Lexical Cohesion ........ 75
    5.1.3 A Semantic Document Language Model ........ 76
  5.2 Translating for Special Target Groups: Improving Readability ........ 79
    5.2.1 Readability Metrics ........ 79
    5.2.2 Experiments ........ 81
  5.3 Conclusion ........ 85

Part II: Pronominal Anaphora in Translation ........ 87

6 Challenges for Anaphora Translation ........ 89
  6.1 Pronouns and Anaphora Resolution ........ 89
  6.2 Translating Pronominal Anaphora ........ 91
  6.3 A Study of Pronoun Translations in MT Output ........ 93
  6.4 Challenges for Pronoun Translation ........ 94
    6.4.1 Baseline SMT Performance ........ 96
    6.4.2 Anaphora Resolution Performance ........ 97
    6.4.3 Performance of Other External Components ........ 99
    6.4.4 Inadequate Evaluation ........ 100
    6.4.5 Error Propagation ........ 101
    6.4.6 Model Deficiencies ........ 102
  6.5 Conclusion ........ 103

7 A Word Dependency Model for Anaphoric Pronouns ........ 106
  7.1 Anaphoric Links as Word Dependencies ........ 106
  7.2 The Word Dependency Model ........ 108
  7.3 Evaluating Pronoun Translation ........ 109
  7.4 Experimental Results ........ 112
  7.5 Conclusion ........ 113

8 Cross-Lingual Pronoun Prediction ........ 116
  8.1 Task Setup ........ 116
  8.2 Data Sets and External Tools ........ 118
  8.3 Baseline Classifiers ........ 120
  8.4 Neural Network Classifier ........ 122
  8.5 Latent Anaphora Resolution ........ 127
  8.6 Further Improvements ........ 131
    8.6.1 Relaxing Markable Extraction ........ 131
    8.6.2 Adding Lexicon Knowledge ........ 132
    8.6.3 More Anaphoric Link Features ........ 133
  8.7 Conclusion ........ 134

9 Pronoun Prediction in SMT ........ 135
  9.1 Integrating the Anaphora Model into Docent ........ 135
  9.2 Weakening Prior Assumptions in the SMT Models ........ 136
  9.3 SMT Experiments ........ 140
    9.3.1 Baseline Systems ........ 140
    9.3.2 Document-Level Decoding with Anaphora Models ........ 141
    9.3.3 Test Corpora ........ 143
    9.3.4 Automatic Evaluation ........ 144
  9.4 Manual Pronoun Annotation ........ 145
    9.4.1 Annotation Task Description ........ 146
    9.4.2 Annotation Characteristics ........ 149
    9.4.3 Anaphora Model Evaluation ........ 150
    9.4.4 Agreement with Reference Translation ........ 153
  9.5 Conclusion ........ 156

10 Conclusions ........ 158
  10.1 Document-Level SMT ........ 158
  10.2 Pronominal Anaphora in SMT ........ 161
  10.3 Final Remarks ........ 166

Bibliography ........ 169
Acknowledgements
While working on this thesis, I received help and encouragement from many
people. First and foremost, credit is due to my advisors. I began my Ph. D.
studies at Fondazione Bruno Kessler (FBK) in Trento (Italy) in 2009 and completed them at Uppsala University after moving to Sweden in 2011. I had the
great privilege of working with excellent and very dependable advisors in
both places.
In Uppsala, Joakim Nivre and Jörg Tiedemann supported me with great
involvement and care while allowing me complete freedom to pursue my own
ideas. Having access to their combined competence and experience at every
one of our meetings was an absolutely invaluable asset. Joakim taught me
to unite visionary research goals with rigorous attention to detail, and Jörg
constantly contributed new ideas to improve my methods and references to
literature I did not know about. I benefited immensely from working with
the two of them together.
During my two years in Trento, I enjoyed the supervision of Marcello Federico. He did his best to make me feel welcome in Trento and provided me
with both equipment and opportunity to explore the skiing grounds on Monte
Bondone. Careful and systematic, he would always be ready to discuss implementation details and raw experimental results or complex proofs in the
derivation of statistical models. Much of what I know about the engineering
aspects of statistical machine translation, I learnt from him.
Among my colleagues at work, two stand out particularly. In Uppsala, Sara
Stymne discussed much of my work with me and freely contributed her advice. She acted as examiner at the mock defence preceding the submission
of my thesis, proofread the entire manuscript and helped me address weaknesses in my results and their presentation. I am greatly indebted to her for
her assistance in the final stages of preparing this thesis. In Trento, Arianna
Bisazza was an excellent colleague and a good friend to me. I often missed
her and our discussions on all kinds of linguistic and technical subjects after
leaving Italy.
Two of my colleagues at Uppsala University’s Department of Linguistics
and Philology, Marie Dubremetz and Mats Dahllöf, annotated French pronouns for me. Marie also advised me on other matters requiring the linguistic
competence of a native French speaker. Both my work and my social life in
Uppsala became more interesting and enjoyable thanks to Ali Basirat, Beáta
Megyesi, Eva Martínez, Eva Petterson, Evelina Andersson, Marco Kuhlmann,
Maryam Nourzaei, Matthias Zumpe, Mattias Nilsson, Miguel Ballesteros, Mojgan Seraji, Oscar Täckström, Per Starbäck, Reut Tsarfaty, Sebastian Schleussner,
Ute Bohnacker and Vera Wilhelmsen.
During my time in Trento, Nicola Bertoldi, Mauro Cettolo and Gabriele Musillo of the FBK machine translation group had their share in discussions related to the early stages of my work. Roldano Cattoni was very helpful and
patient with me when I used or abused the computing cluster at FBK.
My interest in statistical machine translation was first kindled by Martin
Volk, who supervised my M. A. thesis on machine translation for film subtitles.
He offered me much support even after I had taken up my Ph. D. studies in
Trento, not least by repeatedly welcoming me as a summertime visitor at the
University of Zurich and by contributing resources for the benefit of my research. During my stays in Zurich, I received much help with my experiments
from Rico Sennrich, Don Tuggener and Manfred Klenner.
Soon after I published my first paper on pronouns in statistical machine
translation, Bonnie Webber started taking a lively interest in my work and
shared bits and pieces of her outstanding knowledge about all things related
to discourse with me. Many times, her remarks made me gain a deeper understanding of the linguistic aspects of the phenomena I was dealing with.
My experiments used substantial computational resources. They were possible only because I had the opportunity to use two high-performance computing clusters in Oslo and Uppsala.1 I am indebted to Stephan Oepen for
permitting me to use a very generous part of his computing time allowance
on the Abel cluster and to the system administrators of Abel for letting me
overdraw my disk quota significantly while completing my thesis work.
When I arrived in Sweden in 2011, I was on my own and did not even have
a place to stay, but luckily I had a faithful friend in Stockholm. I was warmly
welcomed by Roland Engkvist, who offered me shelter in the maid’s chamber
of his flat on Kungsholmen until I could move to a more permanent place.
I am grateful to my parents, who inspired a scientific interest in me and
supported my academic career in various ways throughout my life, and to
my sister, to whom I owe much of what I know about translation studies and
who proofread large parts of this thesis. Last but not least, my life would not
be the same without Ursula, whose support and affection helped me through
all these years. Thank you for being with me!
Uppsala, April 2014
Christian Hardmeier
1 Computations were carried out on the Abel cluster, owned by the University of Oslo and the Norwegian metacenter for High Performance Computing (NOTUR) and operated by the Department for Research Computing at USIT, the University of Oslo IT department, under project nn9106k, as well as on resources provided by SNIC through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under project p2011020.
1. Introduction
Machine translation (MT) is the automatic translation of texts between natural languages by a computer system. Translation is a challenging task for
humans, and it is no less challenging for computers. High-quality translation
requires a thorough understanding of the source text and its intended function as well as good knowledge of the target language. In an MT system, this
process must be completely formalised, which is a daunting task since the
process is by no means completely understood. Statistical machine translation (SMT) addresses this challenge by analysing the output of human translators with statistical methods and extracting implicit information about the
translation process from corpora of translated texts. SMT has shown good
results for many language pairs and has had its share in the recent surge in
popularity of MT among the general public.
Notwithstanding their success in practical translation scenarios, the methods used in SMT are shaped far more by technical constraints than by linguistic concerns. To ensure computational efficiency and tractability, complex linguistic interrelations are sacrificed to crude independence assumptions. The performance level of current SMT systems bears an amazing testimony to the fact that most information in natural languages is encoded very
locally. Even though the context a typical SMT system considers is extremely
impoverished, a great deal of information is usually transferred successfully
into the target language. Nevertheless, human translators know that it is
not sufficient to translate groups of words or sentences in isolation if a coherent target text is desired. In this thesis, we study some of the limitations
of current SMT systems, in particular the implications of translating texts
as sentences in isolation, as SMT systems usually do. We explore ways to
overcome this limitation and investigate how cross-sentence, discourse-level
context can be exploited in automatic translation.
1.1 Motivation and Goals
The point of departure of our research is the observation of a discrepancy
between the fields of translation studies and machine translation. While it
might seem that there should be strong connections between the two research
areas, even a superficial look at the relevant literature quickly reveals that the
two fields are preoccupied with completely different problems. In translation
studies, much work has been devoted to defining and exploring the nature of
translation. It has been recognised since antiquity that word-by-word translation is generally inadequate and that a higher level of understanding is necessary to render a text adequately into another language. Confronted with the
accusation of having taken liberties with the texts he translated from Greek
into Latin, the fourth century church father and bible translator Jerome retorts:
Ego enim non solum fateor, sed libera voce profiteor, me in interpretatione
Graecorum, absque Scripturis sanctis, ubi et verborum ordo mysterium est, non
verbum e verbo, sed sensum exprimere de sensu. (Jerome, 1996)
For I myself not only admit but freely proclaim that in translating from the
Greek (except in the case of the holy scriptures where even the order of the
words is a mystery) I render sense for sense and not word for word. (Jerome,
1979)
Jerome defends his attitude by referring to the example of eminent writers of
Roman antiquity like Cicero and Horace. His distinction between word-by-word and sense-by-sense translation was fundamental for theoretical discussions of translation until the first half of the 20th century (Bassnett, 2011).
The 20th century saw the rise of translation studies as a scientific discipline in its own right. Translation research began to focus on more precise and
formal notions of translational equivalence such as the concept of dynamic
equivalence advocated by Nida and Taber (1969), which seeks the object of
equivalence at a pragmatic or functional level highly dependent on the message and intention of the source text and the reception of the target text. More
recent theories of translation go even further and dispute the concept of equivalence altogether (Snell-Hornby, 1995), focusing instead on the cultural and
social context and the intentionality of the production of both the original
source text and the translation. The question of equivalence at the level of
individual linguistic signs is an aspect of translator training (e. g., Baker, 2011,
Chapter 2), but it does not meet with much interest otherwise; while good dictionaries are essential also for the human translator, their creation is largely
the concern of lexicographers, not translation researchers.
The vast majority of the existing research on SMT, by contrast, is characterised by a happy disregard for the functional and pragmatic aspects of
language. Instead, it deals with far more fundamental concerns such as the
problem of generating grammatical word order in the target language. Much
of the SMT research literature is fairly technically-minded and is concerned
with finding more effective ways of applying existing statistical methods and
techniques to the MT task without spending too much thought on the effects
of using these methods on perceived translation quality.
Despite this discrepancy between how translation studies and SMT research approach the translation process, SMT has reached a point of maturity
that enables it to be used by professional users in productive environments.
We suggest that it now makes sense for SMT researchers to take a step beyond
what has been done traditionally and consider removing some of the restrictions that have been taken for granted in order to narrow the gap between
SMT and the world of professional translators. One obvious step to take is
the one from sentence-level translation to discourse. Most SMT research of
the last twenty years has limited the context considered when generating a
translation to that of the current sentence. While this restriction was adopted
for sound technical reasons, it is a strong obstacle to the study of higher-level
problems in SMT.
The standard models of SMT know very little about the linguistic structure
of a text. Instead, when generating a part of their output, they exhaustively
explore a context window around the current position, comparing translation
variants and output word permutations and selecting the option that seems
optimal given a set of models. To ensure tractability, the context window
that is explored in this way must be kept small. In practice, SMT considers
windows of no more than a handful of words. Once the context window
has been reduced to this size, even more efficiency can be gained by using
algorithms that specifically exploit the extreme locality of the context. This
is a core feature of all commonly used decoding algorithms in SMT.
The primary goal of our research is to find ways around the sentence-level
restriction in SMT and to explore how a larger context can be exploited to
improve the quality of automatic translation. This problem has two aspects,
both of which must be addressed to achieve an improvement in translation. If
we wish to exploit unlimited discourse context in our SMT systems, we must
develop frameworks, procedures and algorithms that are not encumbered by the
standard assumptions of sentence-level independence. This is the first major
research goal and the topic of the first part of this thesis. Our main contribution related to this goal is the development of a document-level decoding
algorithm for phrase-based SMT. We have released software implementing
this algorithm to the public in the form of our document-level phrase-based
SMT decoder Docent (Hardmeier et al., 2013a) to provide a framework for
the development of discourse-level SMT models for ourselves and other researchers.
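To make the search strategy concrete, it can be written down as a simple hill-climbing loop. The following is a minimal illustrative sketch in Python, not the actual Docent implementation; the function and parameter names are ours, and `operations` and `score` stand in for the state operations and feature models discussed in Chapter 4:

    import random

    def hill_climb(document, initial_state, operations, score, max_rejected=100):
        # `initial_state` is a complete draft translation of the document,
        # created randomly or by running a sentence-level stack decoder.
        # Each element of `operations` maps a document state to a slightly
        # modified copy; `score` evaluates a full document translation and
        # may consult unrestricted document-level context.
        state = initial_state
        best = score(document, state)
        rejected = 0
        while rejected < max_rejected:      # stop after too many failed steps
            candidate = random.choice(operations)(state)
            cand_score = score(document, candidate)
            if cand_score > best:           # greedy: accept improvements only
                state, best, rejected = candidate, cand_score, 0
            else:
                rejected += 1
        return state

A simulated annealing variant (cf. Section 4.4.4) would additionally accept some score-decreasing steps with a probability that shrinks over time; the sketch above shows only the greedy case.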
With this essential piece of infrastructure in place, the next step is to investigate what discourse-level linguistic phenomena can be useful for SMT, and
how to model them in an SMT system. We explore a few different translation
problems that can be tackled with the tools we have developed, but the field
is vast and much must be left to future work. The second part of this thesis is
devoted to the study of one specific discourse phenomenon, the problem of
pronominal anaphora. Pronominal anaphora is an intriguing object of study
in that it is a fairly simple problem for a human language user, to the point
that it might be considered uninteresting from the perspective of a human
translator, yet it has an obvious potential to improve the quality of machine
translation that has so far resisted all modelling attempts. Our contribution
related to this goal is the development of a cross-lingual pronoun prediction
model to deal with pronominal anaphora in translation and its integration
into our document-level SMT framework.
In the remainder of this introductory chapter, we address some loosely
connected theoretical points concerning the relation between SMT and translation theory, the modelling assumptions underlying our experimental work
and some considerations on the use of automatic evaluation methods. The
purpose of these sections is to acquaint the reader with the foundations, assumptions and, more likely than not, prejudices that have influenced our research. This chapter also includes a section detailing the relation between
this thesis and the corpus of previously published work on which it is based.
In Chapter 2, we give an overview of the existing research literature on discourse in SMT to draw a picture of the relevant background.
The rest of the thesis is structured into two parts corresponding closely,
but not exactly, to the two research goals outlined above. In the first part,
we deal with the technical challenges of increasing the size of the context
that feature models can take into account. We describe the solutions that
have been proposed for document-level processing in SMT, introduce our
new document-level decoding method and put it to the test with case studies
on two discourse-level problems related to controlling the target language
vocabulary used by the SMT system in different ways.
In the second part, we focus entirely on the translation of pronominal anaphora, a discourse-level problem that affects most SMT systems translating
longer contiguous texts and cannot be solved correctly without some form
of inference with access to document-level context. We discuss extensively
what challenges the task of translating pronouns presents and describe an
early approach to it. Then, we introduce a neural network classifier that
models pronoun prediction as a separate task which is independent from the
MT system. Finally, we conclude the second part by combining this classifier with the document decoding framework developed in the first part of
the thesis and incorporating it as a feature function in the document-level
decoder, uniting all the major contributions of our work in one single SMT
system.
1.2 SMT and the Translation Process
The major part of this thesis and of the research it is based on follows the
genre conventions of the SMT literature by adopting an engineering-oriented
stance towards the problems we investigate. Beginning with the existing
state of the art in SMT, which we have determined to be defective in certain aspects, we examine ways to capture some of these aspects with the proviso that all solutions must be realisable in the existing framework and can
be subjected to immediate experimental scrutiny. Before we engage in this
pursuit, let us consider some fundamental contrasts between human translation activities and MT to shed some light on why it is difficult to deal with
discourse-level text features in automatic translation.
The discourse-related limitations of SMT are to some extent technical and
have to do with the necessity to constrain the search space of the MT system to ensure that the decoding problem remains computationally tractable.
These aspects are discussed in some detail in Chapters 3 and 4 of this thesis.
In addition to the technical constraints, however, there are conceptual limitations that make it difficult for an SMT system to acquire discourse competence.
In translation studies, the last century has brought about an important
change of viewpoint, which has been named the cultural turn (Lefevere and
Bassnett, 1995; Snell-Hornby, 2010). Until the last decades of the 20th century,
translation was seen as an act of transcoding (“Umkodierung”), whereby elements of one linguistic sign vocabulary are substituted with signs of another
linguistic sign vocabulary (Koller, 1972, 69–70). The principal constraint in
this substitution is the concept of equivalence between the source language
input and the target language output:
Translating consists in reproducing in the receptor language the closest natural
equivalent of the source-language message, first in terms of meaning and
secondly in terms of style. (Nida and Taber, 1969, 12)
In the presentation of their theory of translation, Nida and Taber (1969, 12)
emphasise that the primary aim of translation must be “reproducing the message”, not the words of the source text. Their focus is on bible translation, so
the word “message” in their writings strongly connotes the message of the
gospel, but their theory is general enough to apply to other types of translation. According to them, translators “must strive for equivalence rather than
identity” (Nida and Taber, 1969, 12). They stress the importance of dynamic
equivalence, a concept of functional rather than formal equivalence that is
“defined in terms of the degree to which the receptors of the message in the
receptor language respond to it in substantially the same manner as the receptors in the source language” (Nida and Taber, 1969, 24). Koller (1972),
primarily interested in general literary translation rather than bible translation, adopts a similar position. Instead of highlighting the message of the
source text, he focuses on understandability and defines translation as the act
of making the target text receptor understand the source text (“Übersetzen
als Akt des Verstehbarmachens”; Koller, 1972, 67).
Equivalence as a purely linguistic concept has been criticised as deeply
problematic because it fails to recognise the contextual parameters of the act
of translating; it has even been called an “illusion” by Snell-Hornby (1995,
80), who also points out that the formal concept of equivalence “proved more
suitable at the level of the individual word than at the level of the text” (Snell-Hornby, 1995, 80). The term is still used in a recent textbook on translation,
but, as the author points out, merely “for the sake of convenience” and “because most translators are used to it rather than because it has any theoretical
status” (Baker, 2011, 5).
A key feature of more recent theoretical approaches to translation is their
emphasis on the communicative aspects of translation. The cultural turn of
the 1980s has been described to have “placed equivalence within a target-oriented framework concerned first and foremost with aspects of target cultures rather than with linguistic elements of source texts” (Leal, 2012, 43; her
emphasis). Translation is seen as a “communicative process which takes place
within a social context” (Hatim and Mason, 1990, 3). Instead of seeking for
the target language text that is most closely equivalent to the source language
input, the goal of translation is to perform an appropriate communicative act
in the target community, and the target text is just a means of achieving this
goal. Hatim and Mason (1990, 3) point out that doing so requires the study of
procedures to find out “which techniques produce which effects” in the source
and target community. According to them, texts are “the result of motivated
choice” (Hatim and Mason, 1990, 4; their emphasis). In the case of translation, the motivations of the producer of the source text, as decoded by the
translator, interact with the motivations of the translator him- or herself and
determine the choices made to produce the target text.
Interestingly enough, when defending the novel way of understanding
translation they promote, Lefevere and Bassnett (1995, 4) blame the shortcomings of previous theoretical approaches oriented towards linguistic equivalence on the influence of MT research and its demands for simple concepts
that are easy to capture formally. Whether or not this explanation is true,
it is striking how firmly even modern SMT techniques are rooted in traditional assumptions of translational equivalence and indeed how apt much of
the criticism against such theories of translation is when applied to current
standard methods in SMT.
The basis of all current SMT methods is the concept of word alignment,
which was formalised by Brown et al. (1990, 1993) in the form still used today.
Word alignments are objects of elaborate statistical and computational methods, but their linguistic meaning is defined simply by appealing to intuition:
For simple sentences, it is reasonable to think of the French translation of an
English sentence as being generated from the English sentence word by word.
Thus, in the sentence pair (Jean aime Marie|John loves Mary) we feel that John
produces Jean, loves produces aime, and Mary produces Marie. We say that a
word is aligned with the word that it produces. Thus John is aligned with Jean
in the pair that we just discussed. Of course, not all pairs of sentences are as
simple as this example. In the pair (Jean n’aime personne|John loves nobody),
we can again align John with Jean and loves with aime, but now, nobody aligns
with both n’ and personne. Sometimes, words in the English sentence of the pair
align with nothing in the French sentence, and similarly, occasionally words in
the French member of the pair do not appear to go with any of the words in
the English sentence. (Brown et al., 1990, 80–81)
While this may indeed seem “reasonable” for simple sentences, the authors
do not even try to elucidate the status or significance of word alignments in
more complex sentences, where the correspondence between source and target words is less intuitive than in the examples cited. In practical applications,
word alignments are essentially defined by what is found by the statistical
alignment models used, and the issue of interpreting them is evaded completely. Even in articles dealing with manual word alignment and word alignment evaluation, it is not necessarily addressed (e. g., Lambert et al., 2005).
While word alignments have been used in corpus studies aiming at a deeper
understanding of the processes involved in translation (e. g., by Merkel, 1999),
such efforts have had little impact on current practice in the SMT community.
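For concreteness, an alignment of this kind is simply a set of position pairs linking source and target tokens. The short sketch below is our own illustration of the two sentence pairs from the quotation; it is not tied to any particular alignment tool:

    # Word alignments as sets of (French position, English position) pairs.
    french = "Jean aime Marie".split()
    english = "John loves Mary".split()
    alignment = {(0, 0), (1, 1), (2, 2)}   # Jean-John, aime-loves, Marie-Mary

    # One English word may align with several French words:
    french2 = ["Jean", "n'", "aime", "personne"]
    english2 = "John loves nobody".split()
    alignment2 = {(0, 0), (2, 1), (1, 2), (3, 2)}  # nobody aligns with n' and personne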
The cross-linguistic relation defined by word alignments is a sort of translational equivalence relation. It maps linguistic elements of the source language to elements of the target language that are presumed to have the same
meaning, or convey the same message. The same is true of the phrase pairs
of phrase-based SMT (Koehn et al., 2003) and the synchronous context-free
grammar rules of hierarchical SMT (Chiang, 2007), which are usually created
from simple word alignments with mostly heuristic methods. None of these
approaches exploits any procedural knowledge about linguistic techniques
and their effects in the source and target community. Instead, it is assumed
that each source text has an equivalent target text, possibly dependent on
a set of context variables generally subsumed under the concept of domain,
and that this target text can be constructed compositionally in a bottom-up
fashion.
It is instructive to consider what type of translational equivalence can be
accomplished with an SMT system. Clearly, nothing in current state-of-the-art SMT explicitly encourages dynamic equivalence. To model dynamic equivalence, an MT system would have to understand the purpose or function of
the texts it translates, and there is no such knowledge in the existing models.
However, one of the strengths of modern SMT is that it is capable of capturing
correspondences that go beyond the simple word-by-word correspondences
typical of pure formal equivalence. Often, SMT output can create quite a convincing illusion of dynamic equivalence, so we may consider that we are not
doing justice to the SMT approach if we put it on the same level as simple literal translation. We know that real dynamic equivalence is beyond the scope
of SMT models. An important factor is that the choice between competing
translations suggested by the translation model in an SMT system is influenced to a large extent by the language model. The language model lacks
all knowledge of the source text, which rules out the possibility of selecting
target words as a function of the message or purpose of the input; it simply
selects output words based on what has been observed most frequently in target language texts. Thus, we could say that an SMT system strives to achieve
observational equivalence of the output with the input text.
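The locality of the language model can be made explicit. An n-gram model decomposes the probability of an output sentence e_1 ... e_m as

    p(e_1, \ldots, e_m) = \prod_{i=1}^{m} p(e_i \mid e_{i-n+1}, \ldots, e_{i-1}),

where each factor is conditioned only on the n − 1 preceding output words; the source text does not occur anywhere in the conditioning context.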
In SMT, the notion of a domain is used to encode knowledge about the
procedural aspects of translation referred to by Hatim and Mason (1990). Domain can be seen as a variable that all the probability distributions learnt by
an SMT system are implicitly conditioned on, and it is assumed that if the
domain of the system’s training data matches the domain to which it will be
applied, then the system will output contextually appropriate translations. If
there is a mismatch between the training domain and the test domain, the
performance of the system can be improved with domain adaptation techniques.
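Stated schematically (the notation here is ours and purely illustrative), every distribution p(e | f) the system estimates is effectively conditioned on a latent domain variable D,

    p(e \mid f) \;=\; p(e \mid f, D = d_{\mathrm{train}}),

so that contextually appropriate output can only be expected when the test data is drawn from the same domain as the training data.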
Although there is a great deal of literature on domain adaptation, few authors care to define exactly what a domain is. Frequently, a corpus of data
from a single source, or a collection of corpora from similar sources, is referred to as a domain, so that researchers will refer to the “News” domain (referring to diverse collections of news documents from one or more sources
such as news agencies or newspapers) or the “Europarl” domain (referring
to the collection of documents from the proceedings of the European parliament published in the Europarl corpus; Koehn, 2005) without investigating
the homogeneity of these data sources in more detail.
Koehn (2010, 53) briefly discusses the domain concept. He seems to use
the word as a synonym of “text type”, characterised by (at least) the dimensions of “modality” (spoken or written language) and “topic”. Bungum and
Gambäck (2011) present an interesting study of how the term is used in SMT
research and how it relates to similar concepts in cognitive linguistics. In
general, however, the term is used in a rather vague way and can encompass
a variety of corpus-level features connected with genre conventions or the
circumstances of text use. There is a clear tendency in current SMT to treat
all aspects of a text either as very local, n-gram-style features that can easily
be handled with the standard decoding algorithm or as corpus-level “domain”
features that can conveniently be taken care of at training time.
According to Hatim and Mason (1990), human text production in general
and translation in particular is a decision-making process involving a series
of motivated choices. This is true also of SMT, where a decoding algorithm
makes decisions based on some kind of formal utility measure parametrised
by statistical models. Even the manner of text production can be quite similar.
The most popular decoding algorithm for phrase-based SMT generates its
output in natural reading order, pausing briefly every few words to deliberate
on the next words to follow. This is precisely what a human translator might
do when writing down a translation.
The difference between the human translator and the SMT system lies
in the complexity of the decision-making process. Whenever it takes a decision, the SMT decoder has access to no more than a handful of words of
context. Additionally, some general text-level word choice preferences may
be inscribed in the models in the form of “domain adaptation”. By contrast,
when pondering what words to choose to continue the same sentence, a competent human translator will have read and constructed a mental model of
the whole text, will have talked to the commissioner of the translation about
the target audience and the purpose of the translation, will have done additional research on the contents of the input text, will have made a text-level
plan of the whole translation, will have mentally stored information used in
making earlier decisions and will have thought about how to translate key
passages in sentences to come. The context taken into consideration by the
human translator exceeds that exploited by current SMT systems by far and
includes knowledge about the whole document and its translation as well as
background knowledge external to the document.
Given the current state of the art, we cannot hope to emulate the mental
process of translation in its whole complexity, and we are far from formally
modelling translation as a purposeful activity. With the work presented in
this thesis, we strive to make a contribution towards removing the most basic
restrictions on the size of the decision context in SMT and capturing some
elementary discourse-level phenomena in translation with formal statistical
models.
1.3 Modelling Assumptions
In developing the work described in this thesis, we have been guided by a
set of assumptions that shaped the hypotheses we considered and explored.
While there are good reasons to embrace these assumptions, we should point
out that it is not a goal of this thesis to prove their validity, let alone their
superiority over any other set of assumptions that could have been made.
Rather, the principles outlined in the following paragraphs have a sort of axiomatic status in our work. They embody our endeavour to model linguistic
phenomena in the way we consider most appropriate from a theoretical point
of view rather than in the way that is most likely to result in quick gains,
and an aversion to the principle of minimal incremental improvement, whose
merit as a development strategy is undisputed, but which makes it difficult
to explore any radical changes.
As a starting point, the models we develop are data-driven. This is a fairly
uncontroversial assumption in the SMT community, even though it is not
uncommon in production systems to include some components based on explicitly formalised linguistic knowledge. In our work, we avoid the creation
of hard rules based on linguistic introspection. Instead, our goal is to use
linguistic intuition along with corpus studies to create models whose parameters can then be estimated from data. We believe that this type of model
is more versatile and has greater flexibility to deal with corpus data that may
not always match the educated human’s idea of grammaticality.
Taking our reliance on raw corpus data even further, we aim to develop
models that depend as little as possible on explicit annotations. Corpus annotation is another way to encode introspective linguistic knowledge. In
many subfields of natural language processing (NLP), it is common to enrich
corpus data with explicit annotations reflecting a phenomenon of interest and
then train statistical models on this data. This approach is usually considered
to be fully data-driven, since it relies on data sets sampled from real corpora,
reflecting the distribution of texts attested in everyday linguistic production.
Nevertheless, explicit annotation always imposes a certain underlying structure on a text, and it is difficult to ensure that the selected structure optimally reflects the information needed in a translation scenario. This is why we
have a preference for models that manipulate raw text data, even though we
do depart from this principle and use a part-of-speech tagger or an anaphora
resolution system trained on annotated data in some cases.
Rather than working with explicitly annotated data or proceeding in a completely unsupervised way, we attempt to use the information contained in
parallel bitexts instead. This is the one type of high-quality annotations that
is abundant in an SMT setting. Much of the parallel text included in typical
SMT training corpora is created by expert translators with high quality standards. It contains a wealth of information and is available in very large quantities compared to other types of annotations, but the translators creating the
bitexts were ignorant of how their texts would later be used in a computational setting. As a result, the annotations we have are completely unbiased
towards our own purposes. This makes the annotations potentially noisy and
difficult to use, but it also ensures that they are representative samples of distributions encountered in real-life translations, which should contribute to
the validity of the models we derive from them.
Finally, it has been a goal in our work to give preference to integrated approaches over pipeline solutions and to enable joint inference over multiple
steps wherever possible. While pipeline approaches make it easy to decompose a task into small manageable steps, they have a tendency towards developing complex dependencies between the individual steps and propagating
errors from one step to the next. This is why we implement document translation as a part of the core SMT decoding process (Chapter 4) rather than
performing inference on word lattices or n-best lists output by a standard decoder, and it is why we model anaphora resolution and pronoun prediction
jointly in a single neural network (Chapter 8).
1.4 MT Evaluation and Translation Quality
Nobody performing experiments on MT can evade the question of evaluation.
For practical reasons, MT quality is usually measured with automatic metrics such as BLEU (Papineni et al., 2002), which match word sequences in
the translated text against reference translations produced by human translators and assume that greater overlap is correlated with higher translation
quality. The inadequacy of metrics of this type is widely recognised and acknowledged, but few reasonable alternatives are available, and none of them
is generally accepted.
A key problem for the development of high-quality MT is the fact that the
very concept of translation quality is not well-defined. Human evaluation of
translations, the gold standard for all translation quality measurement, is a
highly non-trivial task in itself. A human translator who renders a text in another language makes a great number of choices to select appropriate words
in the target language. To some extent, these choices are guided by the wording of the input text, but they also depend on various extra-linguistic factors
such as the proposed use and target audience of the translation, cultural background knowledge of the communities for which the source and target texts
are written, language-specific genre conventions, economic considerations,
media-specific constraints, etc. A reasonable method to evaluate a translation is to make assumptions about such context factors and to discuss the
adequacy of the decisions taken by the translator in the light of the assumptions made.
This intellectual approach to translation criticism may work well for the
education of human translators, but it is defeated in MT research not only
by its extreme cost, but also by several other factors impairing its usefulness.
Essential evaluation parameters such as target audience and intended use are
often ill-defined in MT research. The markedly non-intellectual translation
process embodied in an SMT system and the sheer difficulty of exploiting the
insights gained by such a process render translation criticism unsuitable as
a tool for MT development. As a result, it is usually substituted by sampling
methods where humans are asked, e. g., to rank a number of translations (often single sentences with very little context) by quality. By measuring inter-annotator and intra-annotator agreement, the reliability of such methods can
be assessed to some extent, but it is next to impossible to prove their validity
since the precise evaluation criteria are often left to the evaluators’ intuition
(explicitly so, e. g., by Callison-Burch et al., 2012, 14). However, as Artstein
and Poesio (2008, 557) point out, agreement between evaluators does not entail validity because “[t]wo observers of the same event may well share the
same prejudice while still being objectively wrong.” Moreover, even if the
evaluators have objectively sound reasons to prefer one disfluent translation
over another, their judgements are influenced by effects of salience, and some errors escape punishment more easily than others even though they reflect equally fundamental problems of the MT system that generated them.
The development of automatic MT evaluation metrics is an object of ongoing research. For more than a decade, BLEU (Papineni et al., 2002) has been
the standard metric in MT research. BLEU considers the overlap of n-gram
sequences between the candidate translation and one or more reference translations. It consists of two components. The first represents n-gram precision
in the candidate translation, which is defined as the number of n-grams the
candidate shares with the reference divided by the total number of n-grams in
the candidate translation. This quantity is computed for 1-grams to 4-grams
and aggregated into a geometric mean. It is then multiplied with the second
component, a brevity penalty which assumes the function of a recall measure.
The brevity penalty punishes translations with a factor that decays exponentially with the length ratio between candidate and reference translation if the
candidate translation is shorter than the reference.
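As an illustration of these two components, here is a minimal single-reference BLEU sketch in Python. It follows the definition above (clipped n-gram precision, geometric mean, brevity penalty); production implementations additionally handle multiple references and apply smoothing, so this is an illustration of the definition rather than a replacement for standard tools.

import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference sentence-level BLEU (a simplified sketch)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # Clipped matches: each candidate n-gram is credited at most
        # as often as it occurs in the reference.
        matches = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if matches == 0 or total == 0:
            return 0.0  # real implementations smooth instead
        log_precisions.append(math.log(matches / total))
    geometric_mean = math.exp(sum(log_precisions) / max_n)
    c, r = len(candidate), len(reference)
    # Brevity penalty: decays exponentially if the candidate is
    # shorter than the reference, 1 otherwise.
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * geometric_mean

print(bleu("behind the house police found a large amount of drugs .".split(),
           "behind the house police found a large quantity of narcotics .".split()))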
BLEU has been used both for assessing the quality of MT systems and as
an objective function for automatic parameter tuning (Och, 2003). Significant
research efforts have been spent on improving BLEU scores. By its nature,
BLEU favours locally fluent MT output, and advances in n-gram language
modelling methods often have a large impact on BLEU. By contrast, long-range
dependencies are not captured, and discourse-level phenomena are reflected
much less reliably by the metric.
Since the introduction of BLEU, many other metrics have been proposed.
None of them has been able to replace BLEU as the standard metric, but some
of them have gained some popularity. Among the more popular alternatives,
we could mention NIST (Doddington, 2002), METEOR (Banerjee and Lavie,
2005; Denkowski and Lavie, 2011) and TER (Snover et al., 2006). While these
metrics address some of the shortcomings of BLEU, they do not add any specific support for discourse-level phenomena. Some discourse-level MT evaluation measures have recently been suggested (Giménez et al., 2010; Wong
et al., 2011; Wong and Kit, 2012; Guzmán et al., 2014; Joty et al., 2014), but
they have been developed and tested for English as a target language only,
whereas English is the source language in most of the experiments discussed
in this thesis.
MT evaluation is an interesting research problem in itself, but it is not a
focus of our work. However, in experimental work it cannot be avoided completely. Our stance on evaluation is to adopt standard evaluation measures
and, in particular, the BLEU score, while recognising their inadequacy. We
generally report BLEU scores for all experiments, but we do not necessarily
expect that they reliably reflect the quality of discourse-level features in the
translation. In Chapter 7, we introduce an automatic evaluation metric that
gauges the accuracy of pronoun translation more specifically than standard
evaluation measures do, but it suffers from many of the same shortcomings
as the existing methods. Currently, the only method that has some claim to
validity when it comes to measuring discourse-level features of translation
is a targeted human evaluation like the one we conduct for our SMT experiments in Chapter 9.
As a result of these considerations, we do not generally perform statistical hypothesis tests involving BLEU scores or similar metrics. Hypothesis
tests serve to prove that a difference between two observed measurements is
unlikely to be due to chance, suggesting that it reflects a substantive change
in the experimental outcome. However, since we have serious doubts about
whether the measurements we consider actually reveal the qualities we are
most interested in, this is immaterial for score differences small enough that their significance can be called into doubt. In any case, we cannot draw reliable conclusions from such differences, and labelling them as significant would confer a false sense of importance on them. We therefore report BLEU scores, following standard practice in the research community, but as we consider the validity of the scores to be a more serious concern than their significance, we do not attempt to prove significance formally.
1.5 Relation to Published Work
Much of the material contained in this thesis has been published previously,
primarily in the form of conference papers. The text of the published papers
was used, in updated and extended form, as the basis for various parts of the
thesis. In particular, the individual chapters are related to prior publications
as follows:
– An earlier version of the literature survey in Chapter 2 was published
as an article in the journal Discours (Hardmeier, 2012).
– The decoding procedure discussed in Section 3.4 was described in a
paper presented at the International Workshop on Spoken Language
Translation (IWSLT) in Paris, France, 2010 (Hardmeier and Federico,
2010).
– The document-level decoding algorithm proposed in Chapter 4 was published in a paper presented at the Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language
Learning (EMNLP-CoNLL) on Jeju Island, Korea, 2012 (Hardmeier et al.,
2012).
– Our software implementation of this algorithm, the Docent decoder,
was presented at the system demonstration session of the 51st Annual
Meeting of the Association for Computational Linguistics (ACL) in Sofia,
Bulgaria, 2013 (Hardmeier et al., 2013a). I implemented all the core functionality of the Docent decoder and wrote the larger part of the system
description paper, with the exception of a section on readability models
written by Sara Stymne.
– The results on document-level feature weight optimisation in Section 4.7
were published in a paper presented at the Workshop on Discourse in
Machine Translation (DiscoMT) in Sofia, Bulgaria, 2013 (Stymne et al.,
2013a). The experimental work leading to these results was carried out
by Sara Stymne, who also composed the text of the DiscoMT paper. I
participated in the discussions leading to the experiments and contributed some advice on practical issues related to the Docent decoder.
– A description of the semantic document language model described in Section 5.1.3 was included as a part of our paper at EMNLP-CoNLL 2012 (Hardmeier et al., 2012).
– The work on readability in Section 5.2 was published in a paper presented at the 19th Nordic Conference of Computational Linguistics (NODALIDA) in Oslo, Norway, 2013 (Stymne et al., 2013c). The experiments in this work were carried out by Sara Stymne, who also composed the text of the NODALIDA paper. I participated in the discussions leading to the experiments and contributed some advice on practical issues related to the Docent decoder.
– The corpus study on pronoun translation in Section 6.3 was a part of our paper at IWSLT 2010 (Hardmeier and Federico, 2010).
– An earlier version of the discussion of challenges in pronoun translation in Section 6.4 was included as a part of the Discours article referred to above (Hardmeier, 2012).
– The word dependency model and the pronoun evaluation metric described in Chapter 7 were included in our paper at IWSLT 2010 (Hardmeier and Federico, 2010).
– The cross-lingual pronoun prediction model in Chapter 8 was published as a paper at the Conference on Empirical Methods in Natural Language Processing (EMNLP) in Seattle, USA, 2013 (Hardmeier et al., 2013b).
– The SMT system with anaphora handling described in Chapter 9 was used in our submission to the shared task on English–French MT at the Ninth Workshop on Statistical Machine Translation in Baltimore, USA, 2014, and was discussed in a system description paper (Hardmeier et al., 2014). In the system description paper, I was responsible for the experimental work as well as the description of the English–French SMT system.
The vast majority of the results in this thesis are joint work with my advisors Joakim Nivre and Jörg Tiedemann (Uppsala University) and Marcello
Federico (Fondazione Bruno Kessler, Trento). Their contributions are not
marked separately. Except as mentioned otherwise above, the main responsibility for the complete scientific process from conception and experimentation to analysis and writing was mine.
2. Research on Discourse and SMT
The importance of discourse-level dependencies for translation has only recently attracted systematic attention in the SMT community. In a survey paper about discourse in SMT published only a short time ago, we pointed out
the “SMT community’s apparent lack of interest in discourse” (Hardmeier,
2012) and showed that most of the research on discourse-related problems in
SMT was conducted under different headings such as terminological consistency or domain adaptation. Since then, the number of papers explicitly concerned with discourse has grown, and there has even been an ACL workshop devoted
to this topic (DiscoMT 2013 in Sofia, Bulgaria). There are different strands of
research in the literature. One attempts to exploit the macroscopic structure
of the input texts to infer better translations. Some work is concerned with
different aspects of lexical cohesion, terminological consistency and word
choice. Other work deals with specific linguistic features that are governed
by discourse-level processes such as generation of anaphoric pronouns, translation of discourse connectives or verb tense selection. Yet another strand
addresses the technical challenges involved in processing document-level information and seeks to create a software infrastructure that straightforwardly
supports discourse-level translation. In this chapter, we review and discuss
the existing literature.
2.1 Discourse Structure and Document Structure
One of the earliest attempts to integrate discourse processing into SMT is also,
in a sense, one of the most ambitious. Several years before the phrase-based
(Koehn et al., 2003) and hierarchical (Chiang, 2007) approaches to SMT were
introduced, Marcu et al. (2000) suggested doing MT by rewriting discourse
structure trees. They compared the discourse structure of a small corpus of
Japanese and English parallel documents and concluded that “if one attempts
to translate Japanese into English on a sentence-by-sentence basis, it is likely
that the resulting text will be unnatural from a discourse perspective” (Marcu
et al., 2000, 12–13) because of significant structural differences at the sentence,
paragraph and text levels. They outline a discourse transfer model to rewrite
the discourse structure of an input text into a corresponding tree for the target
language. To our knowledge, this work has never been followed up after
its initial publication, and we are not aware of any actively developed SMT
system implemented along these lines.
In the more recent SMT literature, there is some work on exploiting text-level structure for specific text genres. Foster et al. (2010) perform local language model (LM) adaptations in a system translating Canadian parliamentary debates using metadata features that represent various aspects of document structure. Wäschle and Riezler (2012) apply a multi-task variant of
minimum error-rate training (Och, 2003) to fine-tune their models to different text sections in patent translation. Louis and Webber (2014) improve the
translation of biographical texts in Wikipedia with a cache LM influenced by
a topic model that can account for the blockwise topic shifts typical of this
text genre.
2.2 Cohesion, Coherence and Consistency
Lexical choice is a problem that has traditionally attracted much attention
in SMT research. Initially most studied from the points of view of language
modelling and domain adaptation, the effects of text-level features on word
choice have recently moved into focus. The linguistic key concepts are cohesion and coherence, two fundamental discourse properties that establish
“connectedness” in a text (Sanders and Pander Maat, 2006, 591). Cohesion is a
surface property of the text that is realised by explicit clues such as the use of
discourse markers or word repetition. It occurs whenever “the interpretation
of some element in the discourse is dependent on that of another” (Halliday
and Hasan, 1976, 4). Coherence, by contrast, is related to the connectedness
of the “mental representation of the text rather than of the text itself”. It is
created referentially, when different parts of a text refer to the same entities,
and relationally, by means of coherence relations such as Cause–Consequence
between different discourse segments (Sanders and Pander Maat, 2006, 592).
Another term that has sometimes been used by less linguistically oriented
researchers is that of lexical or terminological consistency. The underlying assumption is that the same concepts should be consistently referred to with
the same words in a translation. To what extent this principle holds in naturally occurring texts of different genres, and to what extent and in what ways
it is or should be enforced in SMT systems, is an object of ongoing research.
2.2.1 Corpus Studies
In computational word sense disambiguation software, it is common, and usually beneficial, to impose a one sense per discourse constraint (Gale et al., 1992)
and assume that all uses of a polysemous term in the same document denote
the same sense of that term. Carpuat (2009) investigates a similar one translation per discourse hypothesis that relates to translated texts, supposing that all
instances of the same term in a document should be translated in the same
way. By examining human reference translations for two English–French
SMT test sets, she indeed finds that 80 % of the French words are aligned to
no more than one English translation and 98 % to at most two translations,
after lemmatising both source and target. Looking at machine translations
of the same test sets, she observes that the regularity in word choice is even
stricter in SMT as a result of the generally low lexical variability of SMT output.
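Figures like these can be computed with a few lines of code. The following is a simplified sketch, not Carpuat's exact methodology; it assumes the input is given as lemmatised, word-aligned (source, target) lemma pairs for one document.

from collections import defaultdict

def consistency_profile(aligned_lemma_pairs):
    """For one document, the fraction of source-side lemmas translated
    with at most one and at most two distinct target lemmas. A
    simplified sketch, not Carpuat's exact methodology."""
    translations = defaultdict(set)
    for src, tgt in aligned_lemma_pairs:
        translations[src].add(tgt)
    if not translations:
        return 1.0, 1.0
    n = len(translations)
    at_most_one = sum(len(t) <= 1 for t in translations.values()) / n
    at_most_two = sum(len(t) <= 2 for t in translations.values()) / n
    return at_most_one, at_most_two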
These results suggest that there is not much to be gained by just enforcing consistent vocabulary choice in SMT, since the vocabulary is already
fairly consistent. In principle, it may be possible to improve SMT by using
whole-document context to select translations. However, a more recent study
by Carpuat and Simard (2012) shows that this may be more difficult than it
seems. In that study, the authors find consistency and translation quality to
be essentially uncorrelated or even negatively correlated in SMT output. In
particular, they show that machine-translated output tends to be more consistent when produced by systems trained on smaller corpora, indicating that
“consistency can signal a lack of coverage for new contexts” rather than being a sign of translation quality (Carpuat and Simard, 2012, 446). In a manual
analysis of post-edited MT output, they find that most lexical inconsistencies are symptoms of more fundamental problems such as outright semantic
translation errors or syntactic or stylistic problems, whereas the terminological inconsistencies typically found in imperfect human translations only
account for about 13–16 % of the inconsistent translations. These findings
are encouraging in the sense that, in the best case, a model improving MT
output consistency in the right way might help to fix some of the more fundamental errors as well, but the lack of positive correlation between measured
consistency and translation quality shows that it is important to enforce not
only consistent, but also correct translations, and that it may be necessary to
make use of additional information for good results.
The one translation per discourse hypothesis is tested again by Ture et al.
(2012), using a methodology based on forced decoding with a hierarchical
SMT system and examining the translations selected by human translators at
text positions where multiple options would have been available in the SMT
rule table. They find that the human translators indeed opt for consistent lexical choice in the majority of cases, but that some content words may be translated in more varied ways because of stylistic considerations. They propose a
set of cross-sentence feature functions rewarding translation rule reuse that
achieves significant improvements in Arabic–English and Chinese–English
translation tasks.
Another corpus study about lexical cohesion in MT output was published
by Voigt and Jurafsky (2012). They compare referential chains in a literary
text and a piece of news text in Chinese with their English translations generated by the on-line MT service Google Translate. In the source language,
both texts exhibit a similar number of entities, but the referential chains in the
literary text are denser, indicating stronger cohesion, and contain more pronouns. They find the MT system to be relatively successful at transferring
these chains to the target language. For the news text, the characteristics
of the referential chains in the output are similar to the statistics of human
translations; for the literary text, there is a slight tendency towards underexpression of cohesive devices.
In a study investigating lexical consistency in human translations and machine translations of texts in different genres, Guillou (2013) observes that
the lexical consistency of human translations varies across word classes. For
most of her texts, the consistency of noun translations is fairly high, but not
perfect. For verbs, there is greater variation. In particular, the most common verbs, those in the top 5 % when ordered by frequency, are translated much less consistently. Guillou therefore concludes that consistency is not invariably desirable and should be enforced only selectively. In machine-translated texts, she finds, in accordance with Carpuat and Simard (2012),
that the measured consistency is high on average, but this does not necessarily mean that the translations are correct. Disambiguation of polysemous
words is a serious problem for an SMT system, and document-level consistency is often insufficient as a predictor of translation quality. An important difference between human translations and machine translations is that
inconsistencies in the former often just represent different wordings of the
same notions, whereas incorrect word choices made by SMT systems can
completely distort the meaning of the translation and have a serious impact on its adequacy.
Beigman Klebanov and Flor (2013) examine the vocabulary distribution of
translated texts in terms of “associative texture”. The objective measure used
by their study is the “word association profile”, defined as the distribution of
pointwise mutual information between pairs of content word types in a text,
and the mean of this distribution, called “lexical tightness”. The authors find
that lexical tightness is systematically and significantly lower in texts that
were machine-translated into another language and back again than in the
original input texts. It is also lower in MT output than in human reference
translations, and it is lower in machine translations of lower quality than
in better machine translations, where translation quality is determined by
human evaluation.
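The lexical tightness measure can be sketched as follows; the probability estimates are assumed to come from a large background corpus, and the function names p_joint and p_marg are placeholders of ours, not part of the authors' published method.

import math
from itertools import combinations

def lexical_tightness(content_types, p_joint, p_marg):
    """Mean pointwise mutual information over all pairs of content
    word types in a text, a sketch of Beigman Klebanov and Flor's
    lexical tightness. p_joint(a, b) and p_marg(a) are placeholder
    functions returning co-occurrence and marginal probabilities
    estimated from a large background corpus."""
    pmi_values = []
    for a, b in combinations(sorted(set(content_types)), 2):
        joint = p_joint(a, b)
        if joint > 0:
            pmi_values.append(math.log(joint / (p_marg(a) * p_marg(b))))
    return sum(pmi_values) / len(pmi_values) if pmi_values else 0.0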
2.2.2 Cross-Sentence Language Models
One way to promote cohesive lexical choice across sentence boundaries is to
extend the scope of the language model history by propagating information
between sentences. Tiedemann (2010a,b) suggests using an exponentially decaying cache to carry over word preferences from one sentence to the next.
He demonstrates modest improvements with this approach with a corpus of
medical texts (Tiedemann, 2010a), while the same technique fails when applied to newswire text (Tiedemann, 2010b). One significant problem is that
the cache easily gets contaminated with noise, and that it can contribute to
the propagation of bad translations to the following sentences. More recently,
improvements have been demonstrated with a more sophisticated caching
technique that initialises the cache with statistics from similar documents
found with information retrieval methods and keeps the noise level in check
with the help of a topic model created with Latent Dirichlet Allocation (LDA;
Gong et al., 2011a). A cache model presented by Louis and Webber (2014) is
similar, but extends the topic model with the capacity to detect topic shifts
to account for the semi-structured nature of the texts translated (biographic
articles from Wikipedia).
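A minimal sketch of such a decaying cache is shown below, in the spirit of Tiedemann's model; the decay factor, the interpolation weight and the restriction to unigrams are illustrative choices of ours, not the published configuration.

from collections import defaultdict

class DecayingCacheLM:
    """Exponentially decaying unigram cache (a sketch in the spirit of
    Tiedemann, 2010a). Words from previously translated sentences raise
    the probability of the same words later in the document."""

    def __init__(self, decay=0.9, cache_weight=0.1):
        self.decay = decay                  # illustrative values
        self.cache_weight = cache_weight
        self.scores = defaultdict(float)

    def add_sentence(self, translated_tokens):
        for w in self.scores:               # age the existing entries
            self.scores[w] *= self.decay
        for w in translated_tokens:         # then add the new sentence
            self.scores[w] += 1.0

    def prob(self, word, background_prob):
        """Interpolate the normalised cache score with a background
        language model probability for the same word."""
        total = sum(self.scores.values())
        cache_prob = self.scores.get(word, 0.0) / total if total else 0.0
        return ((1 - self.cache_weight) * background_prob
                + self.cache_weight * cache_prob)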
As the requirements on translational consistency vary across word classes
(Guillou, 2013), it can make sense to create a model covering only the words
that are most likely to benefit from cohesion modelling. This is what
we have attempted to do with a cross-sentence semantic space n-gram model
over content words (Hardmeier et al., 2012). This model is described in more
detail in Section 5.1.3.
2.2.3 Lexical Cohesion by Topic Modelling
Some researchers have proposed methods based on Latent Semantic Analysis
(LSA) and LDA to achieve lexical cohesion under a topic model. Kim and
Khudanpur (2004) use cross-lingual LSA to perform domain adaptation of
language models in one language (assumed to suffer from sparse resources)
given adaptation data in another language. Zhao and Xing (2006) present an
approach to word alignment named BiTAM based on bilingual topic models,
which they then extend to cover SMT decoding as well (Zhao and Xing, 2008).
A similar technique based on a bilingual variant of LDA is used by Tam et al.
(2007) for adapting language models and phrase tables.
Simpler and more recent approaches include the one by Gong et al. (2010),
who adapt SMT phrase tables with monolingual LDA, and Ruiz and Federico (2011), who implicitly train bilingual LSA topic models by concatenating short pieces of text in both languages before training the model, and use
these topic models for language model adaptation. Gong et al. (2011b) use n-best rescoring to make the topic distribution for each document as similar as
possible to the corresponding distribution in the source document, achieving
a marginal improvement in a Chinese–English task. Eidelman et al. (2012)
adapt features in the phrase table based on an LDA topic model. They compare adaptation at the sentence level with per-document adaptation and find
that, while both approaches work, sentence-level adaptation gives marginally
better results on their Chinese–English tasks. Hasler et al. (2014) completely
integrate LDA with phrase table training by estimating phrase translation
probabilities with a bilingual LDA model which directly represents parallel
documents as bags of phrase pairs.
2.2.4 Encouraging Lexical Consistency
There have been several attempts directly aimed at improving the consistency of lexical choice in the MT output. Xiao et al. (2011) present a two-pass
decoding approach to enforce consistent translation of recurring terms in
a document in Chinese–English newswire translation. After the first pass,
they disambiguate terms with multiple translations by finding the dominant
translation in an n-best list. Then they filter the phrase table of the second
decoding pass to remove inconsistent translations. Their research is followed
up by the work by Ture et al. (2012) cited above, which realises improvements
for Chinese–English and Arabic–English by designing features to guide the
second-pass translation process instead of manipulating the phrase table as
Xiao et al. (2011) do.
Alexandrescu and Kirchhoff (2009) describe a graph-based learning approach to favour similar translations for similar input sentences by considering similarity both between training and test sentences and between pairs
of test sentences, which leads to large improvements for Italian–English and
Arabic–English SMT tasks.
Ma et al. (2011) argue that the consistency of translations can be improved
by constraining SMT output to be similar to sentences retrieved from a translation memory. However, their method does not explicitly enforce or encourage cross-sentence consistency. Instead, they rely entirely on the assumption
that the examples supplied by the translation memory will be more consistent
than what the SMT system would generate on its own.
2.2.5 Models of Cohesion and Coherence
Xiong et al. (2013b) describe a model that explicitly tries to capture the notion of lexical cohesion in Chinese–English SMT. Their basic model scans the
output of their MT system for lexical cohesion devices, which are pairs of target language words satisfying certain cohesive relations. The relations considered are identity (word repetition), synonymy or approximate synonymy
and hyponymy or hypernymy; they are detected with the help of WordNet
(Fellbaum, 1998). The authors show that significant gains in MT quality can
be realised just by rewarding the occurrence of such cohesion devices. Scoring them with more sensitive metrics based on conditional probability and
mutual information increases the gain. Similar effects can be achieved by
considering bilingual cohesion triggers formed by replacing the first of the two
words in a lexical cohesion device with the source language words aligned to
it (Ben et al., 2013).
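The relation tests underlying such models can be sketched with NLTK's WordNet interface. This is an illustration of the three relation types, not Xiong et al.'s implementation; in particular, only direct hypernym and hyponym links are checked here.

from nltk.corpus import wordnet as wn   # requires the NLTK WordNet data

def is_cohesion_device(w1, w2):
    """Test whether two lowercased, lemmatised content words form a
    lexical cohesion device in the sense used above: identity,
    (approximate) synonymy, or hyponymy/hypernymy. A sketch, not
    Xiong et al.'s implementation."""
    if w1 == w2:                                    # word repetition
        return True
    syn1, syn2 = wn.synsets(w1), wn.synsets(w2)
    if any(s in syn2 for s in syn1):                # shared synset
        return True
    for s in syn1:                                  # hyponymy/hypernymy
        related = set(s.hypernyms()) | set(s.hyponyms())
        if any(t in related for t in syn2):
            return True
    return False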
Instead of considering isolated word pairs, lexical cohesion can be modelled by looking at chains of words extending through the whole document.
Xiong et al. (2013a) start by identifying lexical chains in the source language
with a thesaurus-based algorithm (Galley and McKeown, 2003). Next, they
map the lexical chains into the target language with a set of maximum entropy classifiers predicting the best translation of a source word given both
its local context and the neighbouring words in the chain. Finally, they add
a feature model to their hierarchical SMT decoder to encourage it to adopt
the word choices predicted by the classifiers. This model improves translation quality substantially over the word pair models. In a variant of this
model, Xiong and Zhang (2013) use a Hidden Topic Markov Model (Gruber
et al., 2007) instead of the thesaurus-based lexical chain extractor to generate
chains of semantically related words.
2.3 Targeting Specific Discourse Phenomena
In contrast to the models described in the previous section, which are concerned with lexical cohesion and word choice in a quite general sense, there
have been recent efforts to develop models dealing with the realisation of
distinct types of cohesive relations. Often, such relations are specifically encoded with particular word classes. The problems that have been studied
include the correct translation of anaphoric pronouns, the generation of determiners in noun phrases, tense marking on verbs and the translation of
discourse connectives.
2.3.1 Pronominal Anaphora
Pronominal anaphora is the use of a pronoun to refer to an entity mentioned
earlier in the discourse, a phenomenon that occurs very frequently in most types of connected text. It will be the main topic of the second part of this thesis, where our own results are discussed in detail.
Usage and distribution of pronouns differ between languages (Russo et al.,
2011). When an anaphoric pronoun is translated into a language with gender
and number agreement, the correct form must be chosen according to the
gender and number of the translation of its antecedent. Corpus studies have
shown that this can be a problem for both statistical and rule-based MT systems, resulting in a potentially large number of mistranslated pronouns depending on language pair and text type (Hardmeier and Federico, 2010; Scherrer et al., 2011).
It was recognised years ago that parallel corpora may provide valuable information for the improvement of anaphora resolution systems, but there have not been many attempts to exploit this
insight. Harabagiu and Maiorano (2000) exploit parallel data in English and
Romanian to improve pronominal anaphora resolution by merging the output of anaphora resolvers for the individual languages with a set of simple
rules. Mitkov and Barbu (2003) pursue a similar approach for English and
French. They create a more elaborate set of handwritten rules to resolve conflicts between the output of the language-specific resolvers. Veselovská et al.
(2012) resolve different uses of the pronoun it in English–Czech data with
handwritten rules that benefit from both monolingual and bilingual features.
Other work has used word alignments to project coreference annotations
from one language to another with a view to training anaphora resolvers in
the target language (Postolache et al., 2006; de Souza and Orăsan, 2011). Rahman and Ng (2012) instead use MT to translate their test data into a language
for which they have an anaphora resolver and then project the annotations
back to the original language.
The converse problem, exploiting anaphora information for the improvement of SMT systems, was first addressed by Le Nagard and Koehn (2010).
They approach the translation of anaphoric pronouns in phrase-based SMT
by processing documents in two passes: The English input text is run through
a coreference resolver developed ad hoc by the authors, and translation is performed with a regular SMT system to obtain French translations of the antecedent noun phrases. Then the anaphoric pronouns of the English text are
annotated with the gender and number of the French translation of their antecedent and translated again with another MT system whose phrase tables are
annotated in the same way. This does not result in any noticeable increase
in translation quality, a fact which the authors put down to the insufficient
quality of their coreference resolution system. However, in a later application of the same approach to an English–Czech system, no clearly positive
results are obtained despite the use of data manually annotated for coreference (Guillou, 2011, 2012).
Engaging in the same task, Hardmeier and Federico (2010) create a one-pass system that directly incorporates the processing of coreference links
into the decoding step. This system is described in Chapter 7. Pronoun
coreference links are annotated with the BART anaphora resolution software
(Versley et al., 2008). We then add an extra feature to the decoder to model
the probability of a pronoun given its antecedent. Sentence-internal coreference links are handled completely within the SMT dynamic programming
algorithm. For links across sentence boundaries, the translation of the antecedent is extracted from the MT output after translating the sentence containing it, and it is held fixed when the referring pronoun is translated. In
that work, no improvement in BLEU score is achieved for English–German
translation, but a slight improvement is found with an evaluation metric targeted specifically to pronoun coreference. A subsequent attempt to apply the
same technique to the language pair English–French is largely unsuccessful
(Hardmeier et al., 2011). In later work, we model anaphoric relations discriminatively with neural network classifiers (Hardmeier et al., 2013b). This work
and its application to SMT is described and discussed in Chapters 8 and 9.
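As an illustration of the general idea behind such a feature (the model actually used is described in Chapter 7), a toy maximum-likelihood estimate of the probability of a pronoun given the head word of its antecedent's translation might look like this; the data format is an assumption of ours.

import math
from collections import Counter, defaultdict

class PronounAntecedentModel:
    """Toy maximum-likelihood model of p(pronoun | antecedent head),
    estimated from (antecedent_head, pronoun) pairs extracted from
    word-aligned data. An illustration only, not the model of
    Chapter 7."""

    def __init__(self, pairs):
        self.counts = defaultdict(Counter)
        for head, pronoun in pairs:
            self.counts[head][pronoun] += 1

    def log_prob(self, pronoun, head):
        c = self.counts[head]
        total = sum(c.values())
        if total == 0:
            return 0.0                 # neutral score for unseen heads
        vocab = len(c) + 1             # add-one smoothing
        return math.log((c[pronoun] + 1) / (total + vocab))

In a decoder, the value returned by log_prob would enter the overall model score as one more feature function.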
For the Czech language, there is a body of research in the TectoMT framework (Žabokrtský et al., 2008), which combines deep syntactic analysis with
statistical transfer methods. Novák (2011) investigates the performance of
the TectoMT system on translating the English pronoun it into Czech. He
presents an analysis of errors made by the MT system and finds that about
half of the occurrences of the pronoun it in his corpus are non-referring expletives or refer anaphorically to constituents that are not noun phrases. In
such cases, the obvious translation of it with a Czech neuter pronoun is most
often correct. The pronoun is also consistently translated with a Czech neuter
when it does have noun phrase (NP) reference, and a substantial proportion of these cases are wrong. Novák et al. (2013a) suggest using a discriminative classifier
with features derived from the tectogrammatical structure to predict the morphological features of translations of it. Even though their classifier beats an
uninformed baseline by a large margin, there is no effect on BLEU. Manual
evaluation shows that the changes with respect to the baseline correspond
to improvements somewhat more often than to degradations. In later work,
this approach is extended to reflexive pronouns (Novák et al., 2013b). For reflexives, the improvements in the manual evaluation are more consistent, but
the BLEU scores are still unaffected.
Russo et al. (2012a,b) address a somewhat different problem. They consider the generation of subject pronouns when translating from pro-drop languages into languages that require pronominal subjects to be realised explicitly, conducting a corpus study and examining the output of a rule-based and
a statistical MT system. Their work focuses on identifying where to insert
pronouns with the help of rule-based preprocessing and a statistical postprocessing step. They do not make any attempt to resolve pronominal anaphora
and resort to inserting majority class (masculine) pronouns whenever there is
an ambiguity. By doing so, they manage to improve the pronoun translation
accuracy of their rule-based translation system.
Taira et al. (2012) test the impact on phrase-based Japanese–English SMT of inserting explicit pronouns for implicit subjects and objects in the Japanese input. They
manually insert pronouns into Japanese source sentences in contexts where
they are not required by Japanese grammar, but would be required in a corresponding English sentence. After SMT into English, they observe only a
marginal improvement in BLEU score, but a larger gain with an ad hoc metric sensitive to this specific phenomenon.
Since automatic anaphora resolution is difficult and error-prone, it is of
great value for the development of anaphora-aware SMT systems to have
test corpora manually annotated for coreference. The standard corpora used
in anaphora resolution research are often insufficient because they are only
available in one language. Harabagiu and Maiorano (2000) mention translating some of the coreference-annotated training data of the Message Understanding Conferences (MUC; Grishman and Sundheim, 1996) into Romanian,
but we do not know if this translation is publicly available. Recently, coreference annotations have been added to a number of parallel corpora. These include the Prague Czech–English Dependency Treebank (PCEDT; Hajič et al.,
2006) with parallel text in English and Czech and the ParCor pronoun coreference corpus (Guillou et al., 2014) with parallel text in English and French as
well as English and German. The Copenhagen Dependency Treebank (Buch-Kromann et al., 2009) supposedly contains annotated parallel text for Danish
and English, Italian, Spanish and German, but it is unclear if and when these
annotations will be completed and released. An English–French data set released by Popescu-Belis et al. (2012b) contains a reduced form of pronoun
annotations, labelling the English pronouns with the word they correspond
to in the French text, but not actually marking their antecedents.
2.3.2 Noun Phrase Definiteness
Definiteness marking of noun phrases is a phenomenon governed by non-trivial language-specific discourse features that vary even among closely related languages. In some languages like Russian or Czech, noun phrases have
no overt morphological definiteness markers. When translating from these
languages into a target language like English that requires the use of definite or indefinite articles, a standard SMT system will not have the necessary
information to generate correctly distributed definite and indefinite articles.
Knight and Chander (1994) describe a statistical postediting system based
on decision tree classifiers for inserting English definite and indefinite articles into the output of rule-based MT systems. They also conduct an experiment with human informants to determine how much discourse information
is necessary to solve this task. When presented with isolated noun phrases
extracted from a corpus, their subjects decide correctly whether the noun phrase
was definite or indefinite in the corpus in around 80 % of the cases, clearly
exceeding the simple majority class accuracy of 67 %. When given access to
discourse context, the human annotators’ accuracy reaches 95 %. These figures give an indication of the upper bounds on accuracy that can be achieved
in such a task.
Tsvetkov et al. (2013) extend SMT systems for Russian–English and Czech–
English with a classifier to predict NP definiteness trained on sentence-level
lexical and morphosyntactic features. To make sure that the required phrases
are available to the MT system, they enrich their phrase tables with synthetic
phrase pairs containing unseen determiner-noun pairs. They demonstrate
that this technique improves BLEU scores with respect to a standard baseline
and that it compares favourably to a determiner insertion procedure at postprocessing time.
2.3.3 Verb Tense and Aspect
Gong et al. (2012b) present a cross-sentence model to control the generation
of correct verb tenses in the MT output. This is a problem that occurs in the
translation from Chinese to English because Chinese verbs are not morphologically marked for tense, whereas generating correct English output requires
selecting the right tense form. They use n-gram-like features on the target
side to model the English sequence of tenses, with two different models to
capture the sequence of verb tenses within a sentence and across sentences,
respectively. Their cross-sentence model is just a sequence model over the
tenses of the main verbs in each sentence. Sentences are processed in order,
and information about the tense of the main verb generated is passed on to
the following sentences so that the tense of the next verb can be conditioned
on this information. By applying this model, they achieve sizeable improvements in BLEU on a Chinese–English task.
One weakness of the n-gram tense model is that it only incorporates target
language information. Gong et al. (2012a) achieve additional improvements
by replacing the n-gram model with a support vector machine classifier exploiting both source language and target language features. Furthermore,
they expand the phrase table with synthetic entries to ensure that all required
verb forms are available to the SMT system.
Meyer et al. (2013) explore a related problem in English–French translation.
Owing to differences in the aspect marking systems of English and French,
an English simple past verb can correspond to an imparfait, passé simple or
passé composé form in French. A key property for predicting this distinction
is called narrativity. Meyer et al. (2013) train a classifier to predict the narrativity of English past tense verbs. They show that a small improvement
in BLEU scores and a beneficial effect in manual evaluation can be achieved
by integrating the narrativity feature in a factored phrase-based SMT system
(Koehn and Hoang, 2007).
2.3.4 Discourse Connectives
The translation of discourse connectives has recently been studied as a main
focus of the Swiss COMTIS project on text-level SMT (Popescu-Belis et al.,
2012a), which resulted in a number of publications on this topic. In a corpus study, Cartoni et al. (2011) compare parts of the Europarl multilingual
corpus (Koehn, 2005) that were originally written in French with other parts
translated into French from English, German, Italian and Spanish. They find
that the different subcorpora use fairly similar vocabulary in general, but that
discourse connectives have significantly different distributions depending on
the original source language of the text. They also notice that it is fairly common for translators to introduce discourse connectives not explicitly found
in the source language, and less common to leave out connectives present
in the source. Meyer et al. (2011b) contrast findings from a corpus study
based on manual annotation with results obtained from the exploration of
parallel corpora. Detailed results of the study are not contained in the published abstract. Meyer and Webber (2013) study the translations of discourse
connectives from English into French and German and find that up to 18 %
of explicit English discourse connectives have no direct correspondence in
French or German human translations, whereas machine translations much
more often include literal translations of connectives.
Independently of the COMTIS project, Becher (2011a,b) studies implicitation and explicitation of discourse connectives in a descriptive corpus study of business texts translated between German and English. He approaches these phenomena from the angle of translation studies rather than
natural language engineering and proposes explanations in terms of features
of the grammatical systems of the source and target language and in terms
of properties of the translation process.
Meyer et al. (2011a) and Meyer (2011) investigate automatic disambiguation of polysemous discourse connectives. They propose a “translation spotting” annotation scheme for corpus data that marks up words that can be
translated in different ways with their correct translation, which they call
“transpot”, instead of explicitly annotating linguistic features (Popescu-Belis
et al., 2012b; Cartoni et al., 2013). Disambiguating connectives with an automatic classifier before running a phrase-based SMT system results in small
improvements in translation quality for English–French (Meyer, 2011; Meyer
and Popescu-Belis, 2012; Meyer et al., 2012) and English–Czech (Meyer and
Poláková, 2013) according to some ad hoc evaluation criteria, even though
the BLEU scores are largely unaffected. Meyer et al. (2012) present a family
of automatic and semi-automatic evaluation scores called ACT to measure the
accuracy of discourse connective translation in order to obtain a more meaningful assessment of progress on this problem than what a general-purpose
measure like BLEU can deliver. These metrics are then further studied and
validated against human judgements for the language pairs English–French
and English–Arabic (Hajlaoui and Popescu-Belis, 2012, 2013).
2.4 Document-Level Decoding
In standard SMT systems, it is relatively difficult to exploit discourse-level
features because of the limitations of the decoding algorithm. Phrase-based
SMT decoders almost universally use a variant of the dynamic programming
beam search algorithm described by Koehn et al. (2003) for decoding. This
algorithm combines good search performance with high efficiency thanks
to a dynamic programming technique exploiting the locality of the models,
making it difficult or impossible to integrate models whose dependencies require considering a context larger than a window of five or six words. In
past research, this problem was addressed mostly by handling cross-sentence
dependencies in components outside the decoder, e. g., by decoding in two
passes (Le Nagard and Koehn, 2010; Xiao et al., 2011; Ture et al., 2012) or by
using a special decoder driver module to annotate the decoder’s input and
recover the required information from its output (Hardmeier and Federico,
2010; Gong et al., 2012b). More recently, we have presented a decoding algorithm (Hardmeier et al., 2012) and a decoder (Hardmeier et al., 2013a) based
on local search that permit the inclusion of cross-sentence feature functions
directly into the decoding process, opening up new ways to design discourse-wide models. The integration of document-level features into the SMT decoding process is a central topic of this thesis and will be studied in Chapters 3
and 4.
2.5 Discourse-Aware MT Evaluation
A recurring issue in all discourse-related MT work is the problem of evaluation. The most popular automatic MT evaluation measure, BLEU (Papineni
et al., 2002), calculates scores by measuring the overlap of low-order n-grams
(usually up to 4-grams) between the output of the MT system and one or
more reference translations. This score is insensitive to textual patterns that
extend beyond the size of the n-grams, and it favours systems relying on
strong n-gram models over other types of MT systems (Callison-Burch et al.,
2006). It has been pointed out by various authors (Le Nagard and Koehn,
2010; Hardmeier and Federico, 2010; Guillou, 2011; Meyer et al., 2012) that
this evaluation measure may not be adequate to guide research on specific
discourse-related problems, and more targeted evaluation scores have been
devised for the translation of pronominal anaphora (Hardmeier and Federico,
2010) and discourse connectives (Meyer et al., 2012; Hajlaoui and Popescu-Belis, 2012, 2013).
There has also been some effort to exploit discourse information to improve the evaluation of MT in general, independently of specific features in
the MT systems tested. Giménez et al. (2010) propose an MT evaluation metric based on Discourse Representation Theory (Kamp and Reyle, 1993), which
takes into account features like coreference relations and discourse relations
to assess the quality of MT output. Unfortunately, their metric does not have
a higher correlation with human quality judgements than standard sentence-level MT evaluation metrics in the MetricsMATR shared task (Callison-Burch
et al., 2010). However, in more recent work, a metric using tree kernels
(Collins and Duffy, 2002) over sentence-level discourse trees conforming to
Rhetorical Structure Theory (Mann and Thompson, 1988) is shown to achieve
a correlation approaching that of BLEU, and surpassing the current state of
the art when combined with other metrics (Guzmán et al., 2014; Joty et al.,
2014).
Wong et al. (2011) and Wong and Kit (2012) propose extending sentence-level evaluation metrics such as BLEU (Papineni et al., 2002), TER (Snover
et al., 2006) or METEOR (Banerjee and Lavie, 2005) with a component to measure lexical cohesion. For this purpose, they use measures of word repetition
in the text, after applying either just stemming or semantic relatedness clustering according to similarity in WordNet (Fellbaum, 1998). They claim that
there is a positive correlation between their lexical cohesion scores and human quality judgements, and that they can improve the correlation of BLEU
and TER, but not METEOR, by combining them with the cohesion scores. In
finding a positive correlation between lexical cohesion as measured by word
repetition in MT output and human quality judgements, their results seem to
be inconsistent with those of Carpuat and Simard (2012) discussed above, a
discrepancy that should be investigated further to pin down the role of lexical
cohesion in MT quality.
2.6 Conclusion
After an initial period during which SMT research, with very few exceptions
(Marcu et al., 2000), was almost entirely uninterested in discourse-level processing, discourse-level and document-level aspects of translation have recently gained substantial attention. In a number of corpus studies, important challenges have been identified by studying such phenomena as word
disambiguation (Carpuat, 2009; Carpuat and Simard, 2012; Ture et al., 2012),
lexical cohesion (Voigt and Jurafsky, 2012; Guillou, 2013; Beigman Klebanov
and Flor, 2013), pronominal anaphora (Hardmeier and Federico, 2010; Scherrer et al., 2011) or discourse connectives (Cartoni et al., 2011; Meyer et al.,
2011b). Other discourse problems such as tense and aspect marking on verbs
(Gong et al., 2012a,b; Meyer et al., 2013) and NP definiteness (Tsvetkov et al.,
2013) have been studied more experimentally. All of these were shown to be
highly relevant to translation quality, but in most cases it has been difficult
to obtain noticeable improvements in BLEU scores or other empirical measures of MT quality. The most sizeable gains reported in the literature are
for translation between English and Chinese (e. g., Gong et al., 2012a, with a
tense model or Xiong et al., 2013a, with a lexical cohesion model). This may
indicate that it is easier to achieve improvements if the distance between
the languages is greater because it is more difficult for a baseline system to
transfer information between dissimilar languages without the help of explicit models.
The most important reason for the limited success of existing discourse
models for SMT is certainly that the underlying processes are not sufficiently
understood for the creation of accurate models. The statistical approach to
MT, which avoids all commitment to specific linguistic theories for the benefit of corpus-based pattern matching techniques, has been tremendously
successful, but as we begin to feel the limitations of the simple assumptions
made in early SMT research, it becomes more and more difficult to extend
the models without theoretical guidance. We hope that the research activities now under way will sooner or later lead to an improved understanding of how different discourse processes affect translation, which will, in turn, enable the development of better models.
Another serious problem is how to evaluate SMT systems in a way that
places due weight on discourse aspects. Just as progress in MT research in
general was difficult to evaluate before the appearance of generally accepted
automatic metrics such as BLEU (Papineni et al., 2002), the shortcomings of
these automatic metrics when it comes to discourse make it difficult to assess progress in text-level MT. Complaints about the insensitivity of BLEU to
discourse-level phenomena, even in cases where manual evaluation does find
an improvement in the MT output, are common in the literature (e. g., Meyer
et al., 2012; Taira et al., 2012; Novák et al., 2013a). While the final evaluation
of an MT system can generally be done manually, the lack of good automatic
evaluation metrics capturing discourse properties deprives discourse-enabled
SMT systems of the possibility of automatically optimising model parameters toward translation quality. In sentence-level SMT, this is now a standard
procedure that often results in significant improvements (Och, 2003).
In sum, there remains much to be done in the field of discourse-level SMT,
even though there is considerably more research activity now than just a few
years ago. In the remainder of this thesis, we try to make a contribution
to two principal problems. In the first part, we investigate the interaction
between discourse-level models and the decoding process in SMT and present
a framework for document-level decoding that serves as a basis for further
experimentation. In the second part, we investigate pronominal anaphora
and the difficulties it poses for SMT.
Part I:
Algorithms for Document-Level SMT
3. Discourse-Level Processing with Sentence-Level Tools
In this chapter, we discuss the limitations of sentence-level SMT and some
ways to overcome them while still using the same tools. First, we explain the
principles of phrase-based SMT, the framework of all our experiments, and
study the stack decoding algorithm, the most popular decoding algorithm for
phrase-based SMT. We show how the stack decoder exploits model locality
to increase decoding performance and why it is difficult to use document-level features in combination with this algorithm. Then, we examine three
workarounds for the limitations of this algorithm and discuss their trade-offs, constraints, drawbacks and advantages.
3.1 An Overview of Phrase-Based SMT
There are a number of competing approaches to SMT, which differ in the way
they decompose the input sentence and transfer its individual components
into the target language. Some of the most influential are phrase-based SMT
(Koehn et al., 2003), hierarchical SMT (Chiang, 2007) and n-gram-based SMT
(Mariño et al., 2006). All of these approaches model translation at the sentence
level, and they have similar limitations when it comes to handling discourse
phenomena. In this thesis, we concentrate on phrase-based SMT, and we
shall not consider the other approaches any further. However, we expect all
of them to present similar challenges, and we imagine that the considerations
and solutions we propose are applicable to all forms of SMT, even though
the implementation details are liable to vary. In this section, we give a brief
overview of the aspects of phrase-based SMT that are relevant to our work.
For a more detailed introduction, the reader is referred to the SMT textbook
by Koehn (2010).
In the translation model of phrase-based SMT (Fig. 3.1), the input sentence
is segmented into a sequence of non-overlapping word sequences (upper line)
Bakom huset hittade polisen en stor mängd narkotika .
Behind the house police found a large quantity of narcotics .
Figure 3.1. Sentence translation in phrase-based SMT
that are called phrases, even though they have little to do with phrases in
the linguistic sense of the word. Each of the source language phrases in the
input is mapped into a corresponding target language phrase (lower line).
To account for differences in word order between the languages, the output
phrases can be generated in an order that differs from that of their corresponding input phrases, or reordered. Given a realistic translation model, this
procedure can generate an immense number of different hypotheses for an
input sentence. Each hypothesis is then assigned a score by the model, and
the goal is to find the translation that maximises the model score.
Modelling the quality of a hypothesis is difficult. It is easier to model different aspects that contribute to translation quality individually and combine
these partial models into an overall score. By doing so, we can make different independence assumptions tailored to the structure of the partial models.
The overall model score $f(s,t)$ of a target language output sentence $t$ translating a given source language input sentence $s$ is then computed as a linear combination of partial model scores, or feature functions, $h_k(s,t)$:

$$ f(s,t) = \sum_k \lambda_k h_k(s,t) \qquad (3.1) $$
Usually, the weights $\lambda_k$ of the partial models are optimised discriminatively
to maximise some automatic translation quality metric like BLEU (Papineni
et al., 2002) with an optimisation technique such as MERT (Och, 2003), PRO
(Hopkins and May, 2011) or MIRA (Chiang, 2012).
Unlike some other subfields of NLP such as syntactic parsing, where a
similar model decomposition is used almost exclusively with binary feature
functions indicating the presence or absence of a particular feature in the
hypothesis, in SMT it is common to view Eq. 3.1 as a log-linear model (Berger et al., 1996), following Och and Ney (2002), and to use feature functions
that represent log-transformed probability estimates. There are both historical and practical reasons for this. Early work on SMT (Brown et al., 1990, 1993) was
strongly influenced by the standard methods in automatic speech recognition (Jelinek, 1976) and adopted the noisy channel model (Shannon, 1948) as
its fundamental model. The noisy channel model corresponds to a log-linear
model with uniform weights. The fact that reliable discriminative weight estimation for a large number of features in SMT has long been a difficult problem is an additional reason for preferring models with few, but informative
features.
In principle, the partial models $h_k(s,t)$ can capture arbitrary features considered relevant to translation quality. There is a small set of models that are
present in virtually any phrase-based SMT system and that are considered
essential to achieve state-of-the-art performance. Usually, all phrase-based
SMT systems will contain at least some variant of the following three models:
• • ◦ • ◦ ◦ ◦ ◦ ◦
Bakom huset hittade polisen en stor mängd narkotika .
Behind the house police
Figure 3.2. Stack decoding progress after translating 3 phrase pairs
– Phrase translation model: The phrase translation model assigns a probability score to the translation of a single SMT phrase in the source
language into a given target language equivalent. It does not consider
any context beyond the phrase boundaries.
– Language model: The language model is an n-gram model that assigns
a probability to a target language word given a history of a bounded
number of target language words to its left. It does not look at the
input at all, and it only considers a limited number of context words for
any given word.
– Distortion model: The distortion model assigns a probability to the order of the phrases in the output. In its basic form, it simply penalises
differences in phrase order between the input and the output without
looking at any further context or even at the words inside the phrases.
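To make Eq. 3.1 and the role of these three models concrete, the following toy computation scores a single hypothesis; all feature values and weights are invented for illustration, and a real system would obtain them from trained models.

import math

def model_score(feature_values, weights):
    """Linear combination of feature functions (Eq. 3.1):
    f(s,t) = sum over k of lambda_k * h_k(s,t)."""
    return sum(weights[k] * h for k, h in feature_values.items())

# Hypothetical log-probability feature values for one hypothesis:
h = {
    "phrase_tm": math.log(0.4),    # phrase translation model
    "lm": math.log(0.001),         # n-gram language model
    "distortion": -2.0,            # distortion penalty (a log score)
}
weights = {"phrase_tm": 1.0, "lm": 0.7, "distortion": 0.3}

print(model_score(h, weights))     # the decoder maximises this quantity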
For decoding, it is important to notice that the use of context in these models
is extremely limited. The translation model does not consider any phrase
context at all. The basic distortion model only depends on the positions of
the input words translated by the current and the immediately preceding
phrase, and the language model depends on a bounded number of context
words. This sparse and highly structured dependency configuration has been
exploited to enable efficient decoding through dynamic programming.
3.2 The Stack Decoding Algorithm
The de facto standard algorithm for decoding phrase-based SMT models is a
dynamic programming (DP) beam search algorithm commonly called stack
decoding (Koehn et al., 2003). The stack decoding algorithm constructs a
translation step by step by starting with an empty translation and adding
words to it in target language word order while keeping track of which source
language words are already covered by a phrase pair in the current translation hypothesis. At each step, the algorithm considers possible translations
for input positions that are not yet covered and extends the state with another
phrase pair until the entire input is covered. Figure 3.2 shows an example sentence after processing three phrase pairs. The top row indicates which words
are covered. Next, the decoder will choose a new input phrase that covers one
or more of the uncovered input words and translate it into a phrase that will
be appended to the output after the word “police”.
In the stack decoder, incomplete hypotheses are grouped in stacks according to the number of input words they cover. Stacks are generic collections,
unrelated to the last-in first-out data structure of the same name. The stacks
are processed in order of ascending coverage count, beginning with the zero-coverage stack containing only the empty hypothesis and terminating with
the final stack containing hypotheses that cover the entire input. Hypotheses
on the individual stacks are expanded in order of descending score, and after
processing a given number of items on each stack, the remaining hypotheses
are ignored, or pruned. Because of pruning, stack decoding is a beam search
algorithm.
The efficiency of stack decoding is greatly increased by a dynamic programming technique called hypothesis recombination (Och et al., 2001) that
exploits the locality of the SMT models. Most of the complexity of a decoding algorithm is due to the fact that previously generated hypotheses must
be processed over and over again whenever the scores are updated to add a
new element. If dependencies are unrestricted, adding a new element may
have the effect that a hypothesis which previously seemed suboptimal suddenly becomes best because it matches the new element better than any of
the other hypotheses. This is why a large number of hypotheses must be
stored and reexamined at each expansion step. However, since none of the
basic models considers more than a few words of target language context,
the dependencies of a new decision are very restricted in reality. Assuming
a trigram language model, which considers a history of two words, all hypotheses that coincide in the last two words form an equivalence class from
the point of view of future decisions. For each of these classes, only the best
hypothesis need be retained; all others can be discarded without further ado
because there is no way in which they can lead to the best overall translation.
We say that they are recombined with the best hypothesis of their equivalence
class. The beneficial effect of recombination is that it allows the decoder to
explore a much larger part of the search space with the same stack size.
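The following Python sketch illustrates the recombination idea under a trigram language model. The Hypothesis representation and the recombination key are simplifying assumptions for exposition, not the actual data structures of Moses or any other decoder.

# Sketch of hypothesis recombination with a trigram LM. A real
# decoder also keys on the end position of the last phrase.
from collections import namedtuple

Hypothesis = namedtuple("Hypothesis", ["words", "coverage", "score"])

def recombine(hypotheses):
    """Keep only the best hypothesis per equivalence class."""
    best = {}
    for hyp in hypotheses:
        # With a trigram LM, only the last two output words can
        # influence the scores of future extensions.
        key = (tuple(hyp.words[-2:]), hyp.coverage)
        if key not in best or hyp.score > best[key].score:
            best[key] = hyp
    return list(best.values())

hyps = [Hypothesis(["behind", "the", "house"], frozenset({1, 2}), -4.2),
        Hypothesis(["back", "of", "the", "house"], frozenset({1, 2}), -5.0)]
# Only the better hypothesis survives: same coverage, same last two words.
print(recombine(hyps))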
Consider now a situation in which one of the models has dependencies
whose range substantially exceeds the history size of the n-gram model, as
will usually be the case for the discourse phenomena that we are interested
in. If the dependencies are long, but do not cross sentence boundaries, the
stack decoding algorithm can still accommodate them. However, while the
decoder generates the output between the two elements involved in the dependency, recombination will effectively be inhibited. As a result, the search
space becomes much larger, and, assuming the stacks are pruned to the same
size, the probability of making a search error will increase greatly. If the long-range dependencies cross sentence boundaries, the only way to handle this in
the stack decoding algorithm is by suppressing the sentence boundaries and
decoding the whole document, or a sufficiently large part of it to include all
the relevant dependencies, as if it were a single sentence. In this case, recombination will be inhibited almost completely, and the search space explosion
described above will be exacerbated. For this reason, it has been necessary
to find other ways to handle long-range dependencies in SMT decoding.
3.3 Two-Pass Decoding
Even though it is difficult to handle long-range dependencies in an SMT stack
decoder, especially if they cross sentence boundaries, it is possible to use an
unmodified sentence-level decoder to process certain discourse-level dependencies if decoding is carried out in two passes. This is the approach adopted
by Le Nagard and Koehn (2010) and subsequently Guillou (2011, 2012) for
their experiments with pronominal anaphora. It is also used to encourage
translation consistency by Xiao et al. (2011) and Ture et al. (2012).
A model of pronominal anaphora must account for the agreement relation between anaphoric pronouns and the noun phrases they refer to (see
Chapter 6 for an extended discussion). To transfer this relation into the target
language, agreement must be ensured between the translation of the antecedent noun phrase and the translation of the anaphoric pronoun. Since
both translations are generated by the SMT system, this implies modelling
long-range target side dependencies, potentially across sentence boundaries.
Le Nagard and Koehn (2010) address this problem by translating documents
in two steps. First, they generate a translation from English into French with
a normal SMT system without any knowledge of discourse or pronouns. Anaphoric links are resolved externally with a separate anaphora resolution system. When the first-pass translation is finished, the translations of the antecedents are recovered from the output, and the system looks up their gender.
For all instances of the pronouns it and they identified as anaphoric by the
anaphora resolution system, the gender of the translation is then marked on
the input token, creating synthetic tokens such as it-masculine, and the
text is translated again with an SMT system trained on this type of data.
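A minimal sketch of this second-pass input annotation follows. It assumes the anaphora resolver and the first-pass gender lookup have already produced a mapping from token positions to antecedent genders; all names here are hypothetical and do not reproduce the actual implementation of Le Nagard and Koehn (2010).

# Sketch of the second-pass input annotation for pronoun translation.
def annotate_pronouns(tokens, antecedent_genders):
    """Replace anaphoric 'it'/'they' with gender-marked tokens."""
    annotated = []
    for i, token in enumerate(tokens):
        gender = antecedent_genders.get(i)
        if token.lower() in ("it", "they") and gender is not None:
            annotated.append(f"{token.lower()}-{gender}")
        else:
            annotated.append(token)
    return annotated

tokens = "I bought a car because it was cheap .".split()
# 'car' was translated as the feminine 'voiture' in the first pass
print(annotate_pronouns(tokens, {5: "feminine"}))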
This decoding approach is simple and has the advantage that it does not
require any modifications to the existing software. Its drawback is mainly
that the two-step procedure enforces categorical, hard decisions that make it
difficult to create a coherent model of the problem as a whole. In particular,
in the anaphora translation approach described above, all antecedent translations get fixed after the first translation step, and the system manipulates the
anaphoric pronouns to encourage agreement. Formally, however, there is no
guarantee that the second-pass translation step will select the same translations for the antecedent, so it is perfectly possible that the system translates
the antecedent differently in the second pass and then enforces agreement
with a purely fictitious antecedent translation that does not correspond to
the final translation.
In the pronoun translation experiments published in the literature, this effect seems to be very small in practice. Guillou (2012), whose experiments on
English–Czech closely parallel the work on English–French described
by Le Nagard and Koehn (2010), remarks that only 3 out of 458 antecedents
were translated differently by her second-stage system. No corresponding
figures are available for the original system by Le Nagard and Koehn (2010).
Guillou (2012) highlights that she takes extra care at training time to minimise the differences between her first- and second-stage system by making
sure both systems are trained on exactly the same corpora and word alignments. The need to do this can make the training process fragile, but if it
is carefully ensured, then the two systems can reasonably be expected to
produce very similar output. Differences are most likely to occur when an
antecedent and an anaphoric pronoun (referring to this or a different antecedent) occur close together in the text. In such cases, the influence of the
n-gram model may trigger a different translation for the antecedent when
the pronoun is translated differently in the second pass. Even if this kind of
interference is rare with a simple pronoun model, it is much more likely to
happen if more discourse-level models are incorporated into the same system
using this approach.
Another limitation of the two-pass decoding approach is the directionality
of its dependencies. Necessarily, with this method the overall model divides
the relevant variables into two sets. One set (the antecedent translations) is
fixed unconditionally in the first decoding pass. The other set (the pronoun
translations) is assigned values in the second decoding pass with the possibility
of conditioning on the values of the variables in the first set. The variables
do not get optimised jointly, so there is no way in which the values of the
variables in the second set can influence the choices made for the first set. In
the case of pronominal anaphora, this is arguably the right way to model the
phenomenon: Pronouns should agree with their nominal antecedents, but it
is at least doubtful if the choice of a particular pronoun should ever induce a
subsequent choice of a compatible antecedent noun phrase. However, this is
not true of all kinds of discourse models. If the goal is to model, e. g., text cohesion by encouraging lexical consistency, it may well be advisable to optimise
over the whole text jointly and combine information from different parts of
the text rather than selecting the translation of the first word unconditionally
and conditioning the rest of the text on this choice.
3.4 Sentence-to-Sentence Information Propagation
If the cross-sentence dependencies of a model form a directed acyclic graph,
then it can be decoded with sentence-level tools without requiring two-pass
decoding. This type of dependency configuration can reasonably be posited
for models of pronominal anaphora. It is fairly safe to assume that cross-sentence anaphoric links always introduce a dependency of an element (a pronoun) in a later sentence on an element (an antecedent) in an earlier sentence. The reverse situation, cross-sentence pronominal cataphora, is not
impossible, but very uncommon in almost all text genres that are candidates
for machine translation, so it can be neglected without great risk to translation quality, ensuring that all dependencies can be resolved in document
order and no cycles occur.
The key to translating with cross-sentence dependencies is to decode each
sentence individually instead of feeding the document to the decoder as a
single batch. After each sentence has been translated, the information that
is needed for translating later sentences can be extracted and fed into the
decoder when it is time to do so. In the following paragraphs, we describe
an approach to the integration of pronominal anaphora into an SMT system
from our own work (Hardmeier and Federico, 2010). Gong et al. (2012b) use
a similar procedure for decoding with a cross-sentence verb tense model.
Our system has two main components: a decoder driver, which encapsulates the sentence-based Moses decoder (Koehn et al., 2007) and propagates
information between sentences, and a word dependency model, which injects information from previous sentences into the actual search process and
handles sentence-internal coreference links. The word dependency model
will be discussed in more detail in Chapter 7.
Figure 3.3 illustrates the workings of the decoder driver. Before the decoder is run, a sentence dependency graph (top right) is constructed based
on the output of a separate coreference resolution system, BART (Versley
et al., 2008). At the cross-sentence level, we only use anaphoric links. If there
happen to be any cataphoric links, they are disregarded to guarantee that the
sentence dependency graph is acyclic. Each sentence can contain pronominal
mentions that refer to a preceding sentence (backward dependencies, marked
r) as well as antecedent mentions that are referred to later (forward dependencies, marked a). The figure shows the state after translating sentences 1 and 2.
Sentences that have no backward dependencies, such as sentences 1 and 2 in
the example, and sentences whose backward dependencies have already been
resolved, such as sentences 3 and 5, are put on a queue that feeds the decoder.
After decoding, the translations of the antecedent mentions are recovered
from the decoder output with the help of the phrase alignments produced by
the decoder and the word alignments stored in the SMT phrase table. The
decoder driver extracts the words aligned to what has been identified as the
syntactic head of the antecedent mention and makes them available to the
referring sentences by encoding them in the decoder input as described in
the following section. Whenever all backward dependencies of a sentence
are satisfied, the sentence is put on the queue.
The implementation described here makes it possible to feed a large number of decoder processes in a multi-threaded setup. The decoder input queue is realised as a priority queue ordered by the number of forward dependencies of the sentences, in order to resolve as many dependencies as possible as early as possible and thus increase the throughput of the system. Since the sentences are not processed in order, a final ordering step restores the original document order. For a slightly less complex setup, the dependency graph and the decoder input queue can be dispensed with, and the sentences can simply be processed in document order to ensure that the information from earlier sentences is available when it is needed.

[Figure 3.3. Decoder driver for sentence-to-sentence information propagation: sentences with antecedent mentions (marked a) and referring mentions (marked r) move from the dependency-ordered queue to the decoder input, through a pool of parallel decoders and an antecedent extraction step, before an ordering step restores document order in the output.]
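The scheduling logic of the decoder driver can be sketched as follows. Sentence ids, the dependency counts and the translate callback are simplified stand-ins for the actual implementation around Moses, not its real interface.

# Sketch of the decoder driver's scheduling. backward_deps counts
# unresolved backward dependencies, forward_deps counts forward
# dependencies, and dependents maps a sentence to the sentences
# that refer back to it.
import heapq

def drive(sentences, backward_deps, forward_deps, dependents, translate):
    pending = dict(backward_deps)
    ready = [(-forward_deps[s], s) for s in sentences if pending[s] == 0]
    heapq.heapify(ready)  # many forward dependencies = high priority
    translations = {}
    while ready:
        _, sent = heapq.heappop(ready)
        translations[sent] = translate(sent)
        # antecedent extraction would happen here, exposing the
        # translated head words to the dependent sentences
        for dep in dependents.get(sent, []):
            pending[dep] -= 1
            if pending[dep] == 0:
                heapq.heappush(ready, (-forward_deps[dep], dep))
    return [translations[s] for s in sorted(sentences)]  # restore order

print(drive(sentences=[0, 1, 2],
            backward_deps={0: 0, 1: 0, 2: 1},  # sentence 2 refers back
            forward_deps={0: 1, 1: 0, 2: 0},   # ... to sentence 0
            dependents={0: [2]},
            translate=lambda s: f"<translation of sentence {s}>"))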
The main advantage of the information propagation approach over the
two-pass decoding procedure is that a single decoding pass is sufficient. This
makes the approach slightly more efficient, but it is also attractive theoretically because it eliminates the potential discrepancies between the first- and
the second-pass translation. In terms of dependency directionality, the constraints are the same. The information propagation approach requires the
cross-sentence dependencies to form a directed acyclic graph, and translation decisions get fixed greedily as this graph is traversed with no opportunity for joint optimisation. The granularity of the dependency graph is at
the sentence level; unlike two-pass decoding, the information propagation
approach does not deal with sentence-internal dependencies. For a pronominal anaphora model, this is a problem because both intrasentential and intersentential anaphoric links are very frequent in corpus data, at least in the
newswire genre (McEnery et al., 1997). In our work (Hardmeier and Federico,
2010), sentence-internal links are handled by the word dependency model in
the decoder (see Chapter 7).
To sum up, information propagation is a fast and reliable approach for
integrating discourse-level models into SMT if the dependency structure of
all of these models mainly consists of cross-sentence links and complies with
the constraints on the dependency graph imposed by the decoding procedure.
In practice, all dependencies in all cross-sentence models must be directed
and point in the same direction. If there are many sentence-internal dependencies, however, this approach will not help, and the usual constraints and
limitations of standard stack decoding apply.
3.5 Document-Level Optimisation by Output Rescoring
One way to use models with unlimited dependencies on other sentences in
combination with sentence-level SMT tools is to let the sentence-level system
produce a variety of different output proposals and then perform a second
search pass with the long-range models over the output variants suggested
by the first-pass system only. This is the approach chosen, e. g., by Gong et al.
(2011b) for integrating topic models into their SMT system. The search space
of the second-pass rescoring step can be given either as an n-best list or in
more compact form as a lattice representation of the part of the search space
explored by the first-pass decoder.
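As a sketch of the rescoring idea, the following Python fragment searches over combinations of first-pass candidates under an added document-level feature. The document_feature and score dictionaries are hypothetical; a real system would rescore n-best lists or lattices with approximate search rather than the exhaustive product shown here, which is exponential in the number of sentences.

# Sketch of second-pass rescoring. candidates[i] is the n-best list
# of sentence i, sentence_scores[i][hyp] the first-pass model score.
from itertools import product

def rescore(candidates, sentence_scores, document_feature, weight):
    best_doc, best_score = None, float("-inf")
    for doc in product(*candidates):  # exhaustive: for tiny inputs only
        score = sum(sentence_scores[i][hyp] for i, hyp in enumerate(doc))
        score += weight * document_feature(doc)
        if score > best_score:
            best_doc, best_score = doc, score
    return best_doc

# Toy document-level feature rewarding lexical consistency.
consistency = lambda doc: sum(1.0 for a, b in zip(doc, doc[1:])
                              if set(a.split()) & set(b.split()))
candidates = [["plant grows", "factory grows"],
              ["plant is green", "facility is green"]]
scores = [{h: 0.0 for h in nbest} for nbest in candidates]
print(rescore(candidates, scores, consistency, weight=1.0))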
The advantage of this method is that it does not impose any restrictions
at all on the models of the second-pass search, or on the number, type or orientation of any of the dependencies involved. It is possible to treat sentence-internal and cross-sentence dependencies in a uniform way. Moreover, the
dependencies need not be oriented at all; if the search algorithm used for the
second pass permits it, translations throughout the document can be optimised jointly and mutually influence each other. Since the search space of the
second-pass search is relatively small, all this can be done efficiently.
The small size of the second-pass search space, which enables efficient
search, is at the same time the main disadvantage of the rescoring approach.
The size of the search space of phrase-based SMT is roughly exponential in
the sentence length (Koehn, 2010, 161). By contrast, the number of complete
hypotheses output by the stack decoding algorithm is bounded by a constant,
the stack size. Therefore, the rescoring pass only gets to see an almost negligibly small subset of the search space. The construction of this subset with the stack decoding algorithm gives reason to hope that it may include some of the overall best translations, but since the first-pass decoder has no knowledge of the models to be included in the second pass, there is no formal guarantee that this holds even approximately under the second-pass models.
3.6 Conclusion
The stack decoding algorithm for phrase-based SMT cannot handle cross-sentence dependencies, and much of its efficiency is due to the fact that even sentence-internal dependencies are assumed to have very short ranges. Nevertheless, there are a number of ways to deal with discourse-level structure even in this framework. They all have different strengths and weaknesses. Two-pass decoding and sentence-to-sentence propagation are similar.
The former is a bit simpler and can potentially handle intrasentential dependencies, but there is a risk of inconsistencies, and the interaction between the
decoder and model is difficult to analyse and understand. Also, the modelling possibilities are limited to what can be achieved by manipulating the
translation model, unless specific models are implemented in the decoder, in
which case the method loses its appealing simplicity. Both approaches require directed dependencies and do not support joint optimisation over the
entire document. The n-best reranking method, by contrast, is unaffected by
most of these limitations, but it can only access an exponentially small part
of the entire search space. As a result, it is only suitable if there is reason to
suppose that the best translation under the final model is already among the
top candidates under the model the n-best lists are created with.
All of these techniques are most useful, and have been used almost exclusively, to integrate single models capturing specific features into the decoding process. With a greater number of cross-sentence features, or if the
cross-sentence features have complex dependencies, they quickly become
cumbersome and difficult to maintain. In the next chapter, we describe how
discourse-level models can be fully integrated into SMT decoding. Like any
other, our new approach has both advantages and drawbacks. Compared to
the methods described in this chapter, however, it has a rather different profile, which makes it particularly interesting for large-scale experimentation
with discourse models.
4. Document-Level Decoding with Local Search
In the previous chapter, we studied different ways of handling document-level features with the standard tools of sentence-based SMT. We found that
these approaches are limited in various ways and impose restrictions on the
dependency configuration of the feature models or on the search space that
can be explored. One of the goals of our work is to provide a framework
for experimentation with discourse-level features in SMT that is as flexible
as possible. It should be possible to experiment with different dependency
configurations and restrictions to find out what setup best meets the needs
of the modelling task. As far as possible, these constraints should not be
imposed as a necessity by the decoding algorithm.
In this chapter, we present an approach to phrase-based SMT decoding
where document-level features are completely integrated into the decoder
(Hardmeier et al., 2012). We have released a software implementation of this
approach, the Docent decoder, to the public (Hardmeier et al., 2013a). In order
to escape the constraints of dynamic programming beam search, we abandon
the stack decoding algorithm. Instead, we use a local search algorithm whose
internal state consists of a complete translation of an entire document. This
ensures that both the complete input document and a complete translation
hypothesis are available whenever a score must be computed, so there are
no restrictions placed on the dependencies of the feature models. Moreover,
unlike a rescoring solution, our decoder has access to the entire search space
of phrase-based SMT at least in principle, even though the vastness of the
search space and the presence of local score maxima make search difficult.
However, we show that our approach has reasonable performance in practice,
and that it can be initialised with standard stack decoding to increase the
chances of finding a good local maximum.
4.1 A Formal Model of Phrase-Based SMT
The phrase-based SMT model implemented by our decoder is exactly equivalent to the basic model of phrase-based SMT (Koehn et al., 2003), but it is
formalised in a way that matches the properties of our decoding algorithm.
The hypothesis space of our method is the same as that of sentence-level
phrase-based SMT. In particular, we assume that the input is segmented into
a number of sentences. The decoder emits exactly one output sentence for
each input sentence, and there is no mechanism to move information from
one sentence into another. This assumption makes the decoder more compatible with existing SMT software and evaluation methods. Strictly speaking,
however, it does not restrict its capabilities, since the entire document could
always be presented to the decoder as a single “sentence”.
Our decoder is based on local search, so its state at any time is a representation of a complete translation of the entire document. We decompose the
state of a document into the state of its sentences, and we define the overall
state S as a sequence of sentence states:
S = S 1S 2 . . . S N ,
(4.1)
where N is the number of sentences.
Let $i$ be the number of a sentence and $m_i$ the number of input tokens of this sentence, let $p$ and $q$ (with $1 \le p \le q \le m_i$) be positions in the input sentence, and let $[p;q]$ denote the set of positions from $p$ up to and including $q$. We say that $[p;q]$ precedes $[p';q']$, or $[p;q] \prec [p';q']$, if $q < p'$. Let $\Phi_i([p;q])$ be the set of translations for the source phrase covering exactly the positions $[p;q]$ in the input sentence $i$, as given by the phrase table. We call $A = \langle [p;q], \varphi \rangle$ an anchored phrase pair with coverage $C(A) = [p;q]$ if $\varphi \in \Phi_i([p;q])$ is a target phrase translating the source words at positions $[p;q]$. Then a sequence of $n_i$ anchored phrase pairs

    S_i = A_1 A_2 \ldots A_{n_i}    (4.2)

is a valid sentence state for sentence $i$ if the following two conditions hold:
1. The coverage sets $C(A_j)$ for $j$ in $1, \ldots, n_i$ are mutually disjoint, and
2. the anchored phrase pairs jointly cover the complete input sentence, or

    \bigcup_{j=1}^{n_i} C(A_j) = [1; m_i].    (4.3)
The MT output corresponding to a state is generated by iterating over the
anchored phrase pairs in the order in which they occur in the state and reading off the target phrases ϕ of each anchored phrase pair.
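The validity conditions of Eqs. 4.2 and 4.3 can be checked mechanically, as in the following sketch, which models an anchored phrase pair as a ((p, q), target_phrase) tuple; this is a simplification of the decoder's internal representation, used here only for illustration.

# Sketch of the validity conditions for a sentence state.
def is_valid_sentence_state(anchored_pairs, m_i):
    covered = set()
    for (p, q), _phrase in anchored_pairs:
        span = set(range(p, q + 1))
        if covered & span:       # condition 1: coverage sets disjoint
            return False
        covered |= span
    return covered == set(range(1, m_i + 1))  # condition 2: full cover

def read_output(anchored_pairs):
    """Read off the MT output in state order (may reorder the input)."""
    return " ".join(phrase for _cov, phrase in anchored_pairs)

state = [((4, 4), "police"), ((1, 2), "behind the house")]
print(is_valid_sentence_state(state, 4))  # False: position 3 uncovered
print(read_output(state))                 # 'police behind the house'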
Let $f(S)$ be a scoring function mapping a state $S$ to a real number. As usual in SMT, it is assumed that the scoring function can be decomposed into a linear combination of $K$ feature functions $h_k(S)$, each with a constant weight $\lambda_k$, so

    f(S) = \sum_{k=1}^{K} \lambda_k h_k(S).    (4.4)

The decoder searches for the state $\hat{S}$ with maximal score, such that

    \hat{S} = \arg\max_S f(S).    (4.5)
As a baseline, we implement a set of elementary feature functions compatible with the core features of the popular Moses SMT system (Koehn et al.,
2007). All of these work on the sentence level, so a document-level decoder
has no advantage if no discourse-level features are added. However, having
this set of baseline feature functions is essential as a starting point for further
development. In particular, our decoder has the following sentence-level feature functions:
1. Phrase translation scores including forward and backward conditional
probabilities and lexical weights (Koehn et al., 2003),
2. n-gram language model scores implemented with the KenLM toolkit
(Heafield, 2011),
3. a word penalty score,
4. a phrase penalty score,
5. a distortion model with geometric decay (Koehn et al., 2003), and
6. a feature indicating the number of times a given distortion limit is exceeded in the current state.
The baseline features are computed at the sentence level, and the document
score is just the sum over all sentence scores. In our experiments, the last
feature is used with very large negative fixed weight in order to limit the
gaps between the coverage sets of adjacent anchored phrase pairs to a maximum value. In DP search, the distortion limit is enforced directly by the
search algorithm to limit complexity. In our decoder, however, this restriction is not required, so we add it as one of the scoring models. In principle, its
weight could be determined automatically during feature weight optimisation (Stymne et al., 2013b).
4.2 The Local Search Decoding Algorithm
The decoding algorithm we use (Algorithm 1) is very simple. It starts with
a given initial document state. In the main loop, which extends from line 3
to line 12, it generates a successor state $S'$ for the current state $S$ by calling the function Neighbour, which non-deterministically applies one of the operations described in Section 4.4 to $S$. The score of the new state is compared to that of the previous one. If it meets a given acceptance criterion, $S'$ becomes the current state, else search continues from the previous state $S$. The main
loop is repeated until a maximum number of steps (step limit) is reached or
until a maximum number of moves are rejected in a row (rejection limit).
For the experiments in this chapter, we use the hill climbing acceptance
criterion, which simply accepts a new state if its score is higher than that of
the current state. It is defined as

    Accept(\alpha', \alpha) = \begin{cases} \text{true} & \text{if } \alpha' > \alpha \\ \text{false} & \text{otherwise.} \end{cases}    (4.6)
Algorithm 1 Decoding algorithm
Input: an initial document state S;
       search parameters maxsteps and maxrejected
Output: a modified document state
 1: nsteps ← 0
 2: nrejected ← 0
 3: while nsteps < maxsteps and nrejected < maxrejected do
 4:     S′ ← Neighbour(S)
 5:     if Accept(f(S′), f(S)) then
 6:         S ← S′
 7:         nrejected ← 0
 8:     else
 9:         nrejected ← nrejected + 1
10:     end if
11:     nsteps ← nsteps + 1
12: end while
13: return S
The hill climbing criterion guarantees that the score never decreases in the
course of decoding. However, it only permits state modifications that improve the score in a single step. Changes that require going through intermediate steps with lower scores, for instance to split up a phrase pair into
smaller units before modifying a part of it, are impossible.
A notable difference between our algorithm and other hill climbing algorithms previously used for SMT decoding (Germann et al., 2004; Langlais
et al., 2007; see Section 4.8) is its non-determinism. Earlier work on sentence-level decoding employed a steepest ascent strategy, which amounts to enumerating the complete neighbourhood of the current state as defined by the
state operations and selecting the next state to be the best state found in the
neighbourhood of the current one. Enumerating all neighbours of a given
state, costly as it is, has the advantage that it makes it easy to prove local optimality of a state by recognising that all possible successor states have lower
scores. It can be rather inefficient, since at every step only one modification
will be adopted; many of the modifications that are discarded will very likely
be generated anew in the next iteration.
As we extend the decoder to the document level, the size of the neighbourhood that would have to be explored in this way increases considerably. Moreover, the inefficiency of the steepest ascent approach potentially
increases as well. Very likely, a promising move in one sentence will remain
promising after a modification has been applied to another sentence, even
though this is not guaranteed to be true in the presence of document-level
models. We therefore adopt a first-choice hill climbing strategy that non-deterministically generates successor states and accepts the first one that
meets the acceptance criterion. This frees us from the necessity of generating the full set of successors for each state. On the downside, if the full
successor set is not known, it is no longer possible to prove local optimality
of a state, so we are forced to use a different condition for halting the search.
We use a combination of two limits: The step limit is a hard limit on the resources the user is willing to expend on the search problem. The value of the
rejection limit determines how much of the neighbourhood is searched for
better successors before a state is accepted as a solution; it is related to the
probability that a state returned as a solution is in fact locally optimal.
It is also possible to combine Algorithm 1 with an acceptance criterion other than that of Eq. 4.6. In particular, a criterion that sometimes accepts new states with lower scores than the current one may help the
decoder to reach better states that are only accessible through a sequence of
moves. In the Docent decoder, we also implement search by simulated annealing (Kirkpatrick et al., 1983) with the Metropolis-Hastings acceptance criterion
(Metropolis et al., 1953; Hastings, 1970). This is a stochastic criterion defined
as

    Accept(\alpha', \alpha) = \begin{cases} \text{true} & \text{with probability } A(\alpha', \alpha; T) \\ \text{false} & \text{with probability } 1 - A(\alpha', \alpha; T) \end{cases}    (4.7)

with an acceptance probability satisfying

    A(\alpha', \alpha; T) = \begin{cases} 1 & \text{if } \alpha' > \alpha \\ \exp\!\left(\dfrac{\alpha' - \alpha}{T}\right) & \text{otherwise.} \end{cases}    (4.8)
The temperature parameter T starts at a high value and is gradually reduced
according to some cooling schedule as decoding progresses. As T approaches
0, the Metropolis-Hastings criterion in Eq. 4.7 becomes equal to the hill climbing criterion in Eq. 4.6, and indeed, the Docent decoder implements hill climbing as a special case of simulated annealing.
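The two acceptance criteria are straightforward to express in code. The following sketch is a direct transcription of Eqs. 4.6 to 4.8, not the Docent implementation itself.

# Sketch of the hill climbing and Metropolis-Hastings criteria.
import math
import random

def accept_hill_climbing(new_score, old_score):
    return new_score > old_score

def accept_metropolis_hastings(new_score, old_score, temperature):
    if new_score > old_score:
        return True
    # downhill moves accepted with probability exp((a' - a) / T)
    return random.random() < math.exp((new_score - old_score) / temperature)

# As the temperature T approaches 0, the downhill probability vanishes
# and the criterion reduces to pure hill climbing.
print(accept_metropolis_hastings(-10.5, -10.0, temperature=1.0))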
The asymptotic behaviour of simulated annealing search depends on the
distribution of the transition probabilities from one state to the next. The
transition probabilities are determined by the interaction of the proposal distribution embodied in the Neighbour function, which generates new states
from the current one, and the acceptance distribution represented by the
Accept function. In our system, the proposal distribution is controlled by
the set of state operations described in Section 4.4 and their weights. If the
transition probabilities satisfy a condition called detailed balance, then simulated annealing is guaranteed to converge to a global optimum asymptotically
(Aarts et al., 1997). One way to meet this condition is to use the Metropolis-Hastings acceptance criterion in conjunction with a proposal distribution
that guarantees that all states can be reached from all other states through
a sequence of operations with nonzero probabilities and is symmetric, meaning that for all pairs of states $S$ and $S'$, the probability of proposing state $S'$ when in state $S$ is equal to the probability of proposing state $S$ when in state $S'$ (Aarts et al., 1997, Theorem 3). Our current set of state operations does not satisfy the symmetry condition, so we cannot be sure that our simulated annealing procedure converges to an optimal solution even asymptotically.¹
Empirically, the main difficulty with using simulated annealing instead of
hill climbing for SMT decoding is that it is very easy for the decoder to wander
off quickly to states with very bad scores from which it never finds its way
back to better solutions. We have not analysed this behaviour in detail, but it
seems likely that it is related to the irregularity of the proposal distribution
mentioned in the previous paragraph and could be remedied by designing
better proposal distributions. This, however, is a problem that we must leave
to future work. Instead, we control the simulated annealing search process
with some specific state operations that help the decoder return more easily
to good states it has visited before. These operations are described at the end
of Section 4.4.
4.3 State Initialisation
Before the local search decoding algorithm can be run, an initial state must
be generated. The closer the initial state is to an optimum, the less work remains to be done for the algorithm. If the algorithm is to be self-contained,
initialisation must be relatively uninformed and can only rely on some general prior assumptions about what might be a good initial guess. On the other
hand, if optimal results are sought, it pays off to invest some effort into
a good starting point. One way to do this is to run DP search first.
For uninformed initialisation, we implement a very simple procedure based
only on the observation that, at least when translating between the major
European languages, it is usually a good guess to keep the word order of
the output very similar to that of the input. We therefore create the initial
state by selecting, for each sentence in the document, a random sequence of
randomly segmented anchored phrase pairs covering the input sentence in
monotonic order, that is, such that for all pairs of adjacent anchored phrase
pairs $A_j$ and $A_{j+1}$, we have that $C(A_j) \prec C(A_{j+1})$.
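A sketch of this uninformed initialisation follows. It assumes a hypothetical phrase_table mapping every source span up to some maximum length to a non-empty list of candidate translations; a real phrase table would need fallbacks for unknown words.

# Sketch of uninformed monotonic state initialisation.
import random

def random_monotonic_state(m_i, phrase_table, max_len=4):
    """Cover positions 1..m_i left to right with random segments."""
    state, p = [], 1
    while p <= m_i:
        q = min(p + random.randint(0, max_len - 1), m_i)
        state.append(((p, q), random.choice(phrase_table[(p, q)])))
        p = q + 1
    return state

table = {(1, 1): ["behind"], (1, 2): ["behind the house"],
         (2, 2): ["the house"]}
print(random_monotonic_state(2, table, max_len=2))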
For initialisation with DP search, we first run the Moses decoder (Koehn
et al., 2007) to generate an initial state. Then we extract the best output hypothesis from the Moses search graph and interpret it as a sequence of anchored
phrase pairs. In Moses, we include a relaxed version of the models of the
document-level decoding pass, omitting all models with document-level dependencies. In the experiments of this thesis, we generally use a configuration as similar as possible to that of the document-level decoder with the same set of sentence-level models and the same feature weights.

¹ Technically, the conditions described are sufficient, but not necessary, for detailed balance. We do not expect detailed balance to obtain in our decoder, but we must defer a more rigorous analysis to the future.
4.4 State Operations
Given a document state $S$, the decoder uses a neighbourhood function called Neighbour to simulate a move in the state space. The neighbourhood function non-deterministically selects a type of state operation and a location in
the document to apply it to and returns the resulting new state. In practice,
operations are selected by drawing randomly from a categorical distribution
with configurable, fixed parameters. To allow the decoder to explore the entire search space, it must be possible to alter the phrase segmentation of the
input, the translations of the individual phrases as well as their output order.
By selecting a set of operations geared towards these three aspects we can
ensure that every possible document state can be reached from every other
state in a sequence of moves.
Designing operations for state transitions in local search for phrase-based
SMT is a problem that has been addressed in the literature (Langlais et al.,
2007; Arun et al., 2010). Our decoder’s first-choice hill climbing strategy never
enumerates the full neighbourhood of a state. We therefore place less emphasis than previous work on defining a compact neighbourhood, but allow
the decoder to make quite extensive changes to a state in a single step with
a certain probability. Otherwise our operations are similar to those used by
Arun et al. (2010).
All of our state operations except those described in Section 4.4.4 make
changes to a single sentence only. Each time it is called, the Neighbour function selects a sentence in the document with a probability proportional to
the number of input tokens in each sentence to ensure a fair distribution of
the decoder’s attention over the words in the document regardless of varying
sentence lengths.
To simplify notation in the description of the individual state operations, we write

    S_i \xrightarrow{\text{Neighbour}} S_i'    (4.9)

to signify that a state operation, when presented with a document state as in Eq. 4.1 and acting on sentence $i$, returns a new document state of

    S' = S_1 \ldots S_{i-1} \, S_i' \, S_{i+1} \ldots S_N.    (4.10)

Similarly,

    S_i : A_j^{j+h-1} \longrightarrow \tilde{A}_1^{h'}    (4.11)

is equivalent to

    S_i \longrightarrow A_1^{j-1} \, \tilde{A}_1^{h'} \, A_{j+h}^{n_i}    (4.12)

with

    A_j^{j+h} \equiv A_j \ldots A_{j+h}    (4.13)

and indicates that the operation returns a state in which a sequence of $h$ consecutive anchored phrase pairs has been replaced by another sequence of $h'$ anchored phrase pairs.
4.4.1 Changing Phrase Translations
The change-phrase-translation operation replaces the translation of one
single phrase with a random translation with the same coverage taken from
the phrase table. Formally, the operation selects an anchored phrase pair $A_j$ by drawing uniformly from the elements of $S_i$ and then draws a new translation $\varphi'$ uniformly from the set $\Phi_i(C(A_j))$. The new state is given by

    S_i : A_j \longrightarrow \langle C(A_j), \varphi' \rangle.    (4.14)
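On the tuple representation used in the earlier sketches, the operation can be expressed as follows, with phi standing in for the phrase table lookup $\Phi_i$; this is an illustrative sketch, not the Docent code.

# Sketch of the change-phrase-translation operation.
import random

def change_phrase_translation(sentence_state, phi):
    new_state = list(sentence_state)
    j = random.randrange(len(new_state))      # draw A_j uniformly
    coverage, _old = new_state[j]
    new_state[j] = (coverage, random.choice(phi(coverage)))
    return new_state

state = [((1, 2), "behind the house"), ((3, 3), "police")]
phi = lambda cov: {(1, 2): ["behind the house", "back of the house"],
                   (3, 3): ["police", "the police"]}[cov]
print(change_phrase_translation(state, phi))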
4.4.2 Changing Phrase Order
There are different useful ways to change the order of the output phrases.
Our basic phrase order operation, used in all experiments described in this
chapter, is called swap-phrases. It affects the output word order without
changing the phrase translations. It exchanges two sequences of anchored
phrase pairs of lengths $l_1$ and $l_2$, resulting in an output state of

    S_i : A_j^{j+l_1+h+l_2-1} \longrightarrow A_{j+l_1+h}^{j+l_1+h+l_2-1} \, A_{j+l_1}^{j+l_1+h-1} \, A_j^{j+l_1-1}    (4.15)

The start location $j$ is drawn uniformly from the eligible sentence positions; the swap range $h$ and the lengths $l_1$ and $l_2$ come from geometric distributions with configurable decays.
Another reasonable option is the move-phrases operation, which moves
a sequence of anchored phrase pairs either to the left or to the right without
requiring any other phrase pairs to make the corresponding opposite movement. The resulting output states are

    S_i : A_j^{j+h+l-1} \longrightarrow A_{j+l}^{j+h+l-1} \, A_j^{j+l-1}    (4.16)

for a right move and

    S_i : A_j^{j+h+l-1} \longrightarrow A_{j+h}^{j+h+l-1} \, A_j^{j+h-1}    (4.17)

for a left move. The move direction is selected randomly, and the start location $j$, the jump distance $h$ and the length $l$ are determined in the same way as for the swap-phrases operation. Left and right moves are equivalent, but the effects of the parameters of the distributions of $h$ and $l$ are exchanged.
4.4.3 Resegmentation
The most complex operation is resegment, which allows the decoder to alter
the segmentation of the source phrase. It takes a number of anchored phrase
pairs that form a contiguous block both in the input and in the output and
replaces them with a new set of phrase pairs covering the same span of the
input sentence. Formally,

    S_i : A_j^{j+h-1} \longrightarrow \tilde{A}_1^{h'}    (4.18)

such that

    \bigcup_{k=j}^{j+h-1} C(A_k) = \bigcup_{k=1}^{h'} C(\tilde{A}_k) = [p; q]    (4.19)

for some $p$ and $q$, where, for $k = 1, \ldots, h'$, we have that $\tilde{A}_k = \langle [p_k; q_k], \varphi_k \rangle$, all coverage sets $[p_k; q_k]$ are mutually disjoint and each $\varphi_k$ is randomly drawn from $\Phi_i([p_k; q_k])$. Regardless of the ordering of $A_j^{j+h-1}$, the resegment operation always generates a sequence of anchored phrase pairs in linear order, such that $C(\tilde{A}_k) \prec C(\tilde{A}_{k+1})$ for $k = 1, \ldots, h' - 1$.
As for the other operations, j is generated uniformly and h is drawn from
a geometric distribution with a decay parameter. The new segmentation is
generated by extending the sequence of anchored phrase pairs with random
elements starting at the next free position, proceeding from left to right until
the whole range [p; q] is covered.
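A sketch of the resegment operation on the same tuple representation follows. It glosses over the requirement that the replaced pairs form a contiguous block in both input and output, draws the block length from a geometric distribution via a simple Bernoulli loop, and again uses a hypothetical phi for the phrase table lookup.

# Sketch of the resegment operation.
import random

def draw_geometric(decay):
    n = 1
    while random.random() < decay:
        n += 1
    return n

def resegment(sentence_state, phi, length_decay=0.1, max_len=4):
    state = list(sentence_state)
    j = random.randrange(len(state))
    h = min(draw_geometric(length_decay), len(state) - j)
    block = state[j:j + h]
    p = min(cov[0] for cov, _ in block)
    q = max(cov[1] for cov, _ in block)
    new_pairs, pos = [], p          # new left-to-right segmentation
    while pos <= q:
        end = min(pos + random.randint(0, max_len - 1), q)
        new_pairs.append(((pos, end), random.choice(phi((pos, end)))))
        pos = end + 1
    return state[:j] + new_pairs + state[j + h:]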
4.4.4 Special Operations for Simulated Annealing
As discussed above (Section 4.2), combining the operations described so far
with the Metropolis-Hastings acceptance criterion instead of pure hill climbing often leads the decoder astray, making it abandon promising hypotheses
too easily and spend inordinate amounts of time on low-scoring parts of the
search space. To reduce this risk, we introduce two operations that make
simulated annealing behave more like hill climbing by frequently offering it
short cuts back to good states.
The restore-best operation quite simply keeps track of the best state encountered during the current decoding run and offers it to the decoder again
regardless of what the current state looks like. By its nature, it will always
be accepted. The more frequently this operation is invoked, the more the
search resembles hill climbing. If it is added to the proposal distribution with
a relatively low probability, simulated annealing will have the opportunity
to make excursions to lower-scoring states, but it will always be sent back
to the original hill climbing path at some point unless it manages to find a
better path in the meantime. Using this operation allows us to exploit some
of the flexibility of simulated annealing whilst preserving the reliability of
hill climbing.
The crossover operation bears some resemblance to the way a genetic
search algorithm generates hypotheses. Like restore-best, it keeps track
of the best state encountered. Instead of just going back to that state, it creates a new state which is a combination of the current state and the cached
best state. For each sentence in the new state, it stochastically selects the
corresponding sentence state from one of the two source states. The probability with which the better state is preferred is a parameter of the operation.
This operation makes it possible to restore the safer choices of the previously
best state for some sentences while allowing the current state arrived at by
simulated annealing to retain some of its features.
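The crossover operation reduces to a per-sentence choice between two document states; the following minimal sketch shows the idea, with the preference probability as its single parameter.

# Sketch of the crossover operation: for each sentence, take the
# cached best state's sentence state with probability p_best.
import random

def crossover(current_state, best_state, p_best=0.8):
    return [best if random.random() < p_best else cur
            for cur, best in zip(current_state, best_state)]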
When decoding with the hill climbing acceptance criterion, the current
state is necessarily always the best state encountered so far, so these two
operations would have no effect in the form described here. An operation
similar to crossover could certainly be defined for the hill climbing case as
well by selecting the second source state in some other way.
4.5 Efficiency Considerations
When implementing feature functions for the local search decoder, we have
to exercise some care to avoid recomputing scores for the whole document
at every iteration. To achieve this, the scores are computed completely only
once, at the beginning of the decoding run. In subsequent iterations, the scoring functions are presented with the scores of the previous iteration and a list of modifications produced by the state operation, a set of tuples $\langle i, r, s, \tilde{A}_1^{h'} \rangle$, each indicating that the document should be modified as described by

    S_i : A_r^s \longrightarrow \tilde{A}_1^{h'}.    (4.20)
If a feature function is decomposable in some way, as all the standard features
developed under the constraints of DP search are, it can then update the state
simply by subtracting and adding score components pertaining to the modified parts of the document. Feature functions can store
their own state information along with the document state to make sure the
required information is available. Thus, the framework makes it possible to
exploit decomposability for efficient scoring without imposing any particular
decomposition on the features as DP beam search does.
To make scoring even more efficient, scores are computed in two passes:
First, every feature function is asked to provide an upper bound on the score
that will be obtained for the new state. For any feature function that represents a log-transformed probability, 0 is a trivial upper bound, but in many
cases, it is possible to calculate much tighter upper bounds far more efficiently
than computing the exact feature value, e. g., by removing just a small number of terms related to words that are affected by a proposed state change in
a larger summation. If the upper bound fails to meet the acceptance criterion,
[Figure 4.1. Two-pass LM score computation with a trigram LM. In the sentence "⟨s⟩ I know exactly what you are doing . ⟨/s⟩" with per-word LM scores −1.2, −1.7, −2.2, −0.3, −0.8, −1.8, −1.8, −1.0, −0.03, the word "exactly" is replaced by "all about". Original score: −1.2 − 1.7 − 2.2 − 0.3 − 0.8 − 1.8 − 1.8 − 1.0 − 0.03 = −10.83. Upper bound: −10.83 + 2.2 + 0.3 + 0.8 = −7.53. New score: −7.53 − 2.1 − 0.6 − 1.3 − 1.2 = −12.73.]
the new state is discarded right away; if not, the full score is computed and
the acceptance criterion is tested again.
Among the basic models listed at the end of Section 4.1, this two-pass
strategy is only used for the n-gram LM, which requires fairly expensive
parameter lookups for scoring. The scores of all the other baseline models
are fully computed during the first scoring pass. The n-gram model is more
complex. Figure 4.1 illustrates how the LM implementation in the Docent
decoder proceeds to compute first an upper bound, then an updated score
as a word in the document state ("exactly") is replaced by a sequence of two other words ("all about"). In its state information, the n-gram model keeps
track of the LM score and LM library state for each word. The first scoring
pass then identifies the words whose LM scores are affected by the current
search step. This includes the words changed by the search operation as well
as the words whose history is modified. In our implementation, the range
of the history dependencies can be determined precisely by considering the
“valid state length” information provided by the KenLM language modelling
library (Heafield, 2011). In the first pass, the LM scores of the affected words
are subtracted from the total score. The model only looks up the new LM
scores for the affected words and updates the total score if the new search
state passes the first acceptance check. This two-pass scoring approach allows us to avoid language model lookups altogether for states that will be
rejected anyhow because of low scores from the other models, e. g., because
the distortion limit is violated.
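The two-pass scoring logic can be sketched as follows. The feature objects with upper_bound() and exact_score() methods are hypothetical abstractions of the decoder's scoring interface, and the sketch assumes non-negative feature weights so that the weighted bounds remain upper bounds.

# Sketch of the two-pass acceptance test: cheap upper bounds first,
# the exact score only if the bound survives the acceptance check.
def two_pass_accept(features, weights, old_score, accept):
    bound = sum(w * f.upper_bound() for f, w in zip(features, weights))
    if not accept(bound, old_score):
        return False      # rejected without expensive LM lookups
    exact = sum(w * f.exact_score() for f, w in zip(features, weights))
    return accept(exact, old_score)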
Model score updates become more complex and slower as the number of
dependencies of a model increases. While our decoding algorithm does not
impose any formal restrictions on the number or type of dependencies that
can be handled, there will be practical limits beyond which decoding becomes
unacceptably slow or the scoring code becomes very difficult to maintain.
However, these limits are fairly independent of the types of dependencies
handled by a model, which permits the exploration of more varied model
types than those handled by DP search.
4.6 Experimental Results
In this section, we present the results of a series of experiments with our document decoder. The goal of our experiments is to demonstrate the behaviour
of the decoder and characterise its response to changes in the fundamental
search parameters. In all experiments presented in this chapter, we use the
hill climbing acceptance criterion and the baseline set of sentence-level feature functions listed in Section 4.1. The search operations of the document decoder are change-phrase-translation with a weight of 0.8, swap-phrases
with a weight of 0.1 and a swap distance decay of 0.5, and resegment with a
weight of 0.1 and a resegmentation length decay of 0.1.
The SMT models for our experiments were created with a subset of the
training data for the English–French shared task at the WMT 2011 workshop
(Callison-Burch et al., 2011). The phrase table was trained on Europarl, News
commentary and UN data. To reduce the training data to a manageable size,
singleton phrase pairs were removed before the phrase scoring step. Significance-based filtering (Johnson et al., 2007) was applied to the resulting phrase
table, and all phrase pairs not ranking among the top 20 per source phrase
in terms of the conditional probability of the target phrase given the source
phrase were discarded. The language model was a 5-gram model with Kneser-Ney smoothing trained on the monolingual News corpus with IRSTLM (Federico et al., 2008). Feature weights were trained with Minimum Error-Rate
Training (MERT; Och, 2003) on the news-test2008 development set using the
DP beam search decoder and the MERT implementation of the Moses toolkit
(Koehn et al., 2007). Experimental results are reported for the newstest2009
test set, a corpus of 111 newswire documents totalling 2,525 sentences or
65,595 English input tokens.
4.6.1 Stability
An important difference between our decoder and the classical DP decoder as
well as previous work in SMT decoding with local search is that our decoder is
inherently non-deterministic. This implies that repeated runs of the decoder
with the same search parameters, input and models will not, in general, find
the same local maximum of the search space. The first empirical question we
ask is therefore how different the results are under repeated runs. The results
in this and the next section were obtained with the uninformed state initialisation described in Section 4.3, i. e., without running the DP beam search
decoder.
Figure 4.2 shows the results of 7 decoder runs with the models described
above, translating the newstest2009 test set, with a step limit of $2^{27} \approx 1.3 \cdot 10^8$
and a rejection limit of 100,000. The x -axis of both plots shows the number
of decoding steps on a logarithmic scale, so the number of steps is doubled
between two adjacent points on the same curve. In the left plot, the y-axis indicates the model score optimised by the decoder summed over all 2,525 sentences of the document. In the right plot, the case-sensitive BLEU score (Papineni et al., 2002) of the current decoder state against a reference translation is displayed.

[Figure 4.2. Score stability in repeated decoder runs]
As expected, the decoder achieves a considerable improvement of the initial state with diminishing returns as decoding continues. Between $2^8$ = 256 and $2^{14}$ = 16,384 steps, the score increases at a roughly logarithmic pace, then
the curve flattens out, which is partly due to the fact that decoding for some
documents stops after the maximum number of rejections has been reached.
The BLEU score curve shows a similar increase, from an initial score below
0.05 to a maximum of around 0.215. This is below the score of 0.2245 achieved
by the stack decoder with the same models. The lower score is not surprising
considering that our decoder approximates a more difficult search problem,
from which a number of strong independence assumptions have been lifted,
without, at the moment, having any stronger models at its disposal to exploit
this additional freedom for better translation.
In terms of stability, there are no dramatic differences between the decoder
runs. The small differences that exist are hardly discernible in the plots. The
model scores at the end of the decoding run range between −158767.9 and
−158716.9, a relative difference of only about 0.03 %. Final BLEU scores range
from 0.2141 to 0.2163, an interval that is not negligible, but comparable to
the variance observed when, e. g., feature weights from repeated MERT runs
are used with one and the same SMT system. Note that these results were
obtained with random state initialisation. With DP initialisation, score differences between repeated runs rarely exceed 0.02 absolute BLEU percentage
points, but the improvement achievable with the baseline feature models is
hardly any greater than this because the hypothesis found by the DP decoder
is nearly optimal already.
Overall, we conclude that the decoding results of our algorithm are reasonably stable despite the non-determinism inherent in the procedure. In the
remaining experiments of this chapter, the evaluation scores reported are calculated as the mean of three runs for each experiment.

[Figure 4.3. Search performance at different rejection limits]
4.6.2 Search Algorithm Parameters
The hill climbing algorithm we use has two parameters which govern the
trade-off between decoding time and the accuracy with which a local maximum is identified: The step limit stops the search process after a certain number of steps regardless of the search progress made or lack thereof. The rejection limit stops the search after a certain number of unsuccessful attempts
to make a step, when continued search does not seem to be promising. In
most of our experiments, we set the step limit to $2^{27} \approx 1.3 \cdot 10^8$ and the rejection limit to $10^5$. In practice, decoding terminates by reaching the rejection
limit for the vast majority of documents. We therefore examine the effect of
different rejection limits on the learning curves. The results are shown in
Fig. 4.3.
The results show that continued search does pay off to a certain extent. Indeed, the curve for rejection limit $10^7$ seems to indicate that the model score increases steadily, albeit more slowly, even after the curve has started to flatten out at $2^{14}$ = 16,384 steps. At a certain point, however, the probability of finding a good successor state drops rather sharply by about two orders of magnitude, as evidenced by the fact that a rejection limit of $10^6$ does not give a large improvement over one of $10^5$, while one of $10^7$ does, so searching the state neighbourhoods very thoroughly is rewarded. The continued model score improvement also results in an increase in BLEU scores, and with an average BLEU score of 0.221 the system with rejection limit $10^7$ is fairly close to the score of 0.2245 obtained by DP beam search.
Obviously, more exact search comes at a considerable cost: the time required to decode the test set explodes from 4 minutes at rejection limit $10^3$ to 224 minutes at rejection limit $10^5$ and 38 hours 45 minutes at limit $10^7$. The DP decoder takes 31 minutes for the same task. We conclude that the rejection limit of $10^5$ selected for our experiments, while technically suboptimal, realises a good trade-off between decoding time and accuracy.
4.7 Feature Weight Optimisation
As usual in SMT, our document-level decoder decomposes its objective function into a linear combination of partial models (Eq. 4.4). For the best possible translation quality, the feature weights $\lambda_k$ should be optimised on a held-out development set. Fortunately, some of the weight optimisation methods from sentence-based SMT can be applied at the document level, too. In particular, MERT (Och, 2003) often works reasonably well for document-level weight tuning with only minor changes.²
MERT is an optimisation procedure which finds a set of feature weights
directly optimising an automatic translation quality measure such as BLEU. It
works with a representation of the search space as a list of translation hypotheses. The SMT search space is too large to be represented exhaustively in this way, so in practice it is approximated with n-best lists. Since n-best lists
only cover an exponentially small subset of the search space and are strongly
biased, the resulting feature weights are not optimal for the entire space in
general. To find good weights, MERT is run repeatedly. After each MERT run,
the tuning set is translated again with the new feature weights to produce a
new n-best list, which is then added to the list of the previous iteration before MERT is called again. This procedure is typically repeated until the list
becomes stable and no new translations get added to the list in one iteration.
Adapting MERT to the document level requires two changes. The first concerns score computation. The data points considered by the MERT optimiser
now represent complete documents instead of single sentences because no
meaningful scores are available at the sentence level. Conceptually, this is
a very simple change, but it has the effect that the number of data points
for a given amount of tuning data becomes much lower. This may lead to
reduced stability, but Stymne et al. (2013a) find that the amount of data in
typical tuning sets is often sufficient to achieve useful results.
The second problem is the generation of n-best lists with a hill climbing
decoder. Since the hill climbing algorithm never accepts downhill moves, the
n-best output of this decoder will always consist of the last n accepted states.
As a result of their construction with the state operations described above,
these states will be very similar to each other, and the overall variety of the
n-best list will be much smaller than that produced by a stack decoder. Consequently, the MERT optimiser will see an even smaller and more biased part
of the search space, leading to bad feature weight estimates. The solution
proposed by Stymne et al. (2013a) is to replace the n-best lists with more general n-lists obtained by sampling at regular intervals during the optimisation
process. The optimal sampling conditions still need to be investigated more
precisely.
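The sampling step itself is straightforward; assuming the hill climbing decoder records its sequence of accepted states, a minimal sketch (with an illustrative interval) looks like this:

    def sample_n_list(accepted_states, interval=1000):
        # Build an n-list for MERT by sampling the decoder's accepted
        # states at regular intervals instead of keeping only the last
        # n states, which would be nearly identical to each other.
        return accepted_states[::interval]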
4.8 Related Work
Even though DP beam search in the form of stack decoding (Koehn et al.,
2003) has been the dominant approach to SMT decoding in recent years, methods based on local search have been explored at various times. For word-based SMT, greedy hill climbing techniques were advocated as a faster replacement for DP beam search (Germann et al., 2001; Germann, 2003; Germann et al., 2004), and a problem formulation specifically targeting word
reordering with an efficient word reordering algorithm has been proposed
(Eisner and Tromble, 2006).
A sentence-level local search decoder has been advanced as an alternative
to the stack decoding algorithm also for phrase-based SMT (Langlais et al.,
2007, 2008). That work anticipates many of the features found in our decoder,
including the use of local search to refine an initial hypothesis produced by
DP beam search. The possibility of using models that do not fit well into
the DP paradigm is mentioned and illustrated with the example of a reversed
n-gram language model, which the authors claim would be difficult to implement in a DP decoder. Similarly to the work by Germann et al. (2001), their
decoder is deterministic and explores the entire neighbourhood of a state in
order to identify the most promising step. Our main contribution with respect to the work by Langlais et al. (2007) is the introduction of the possibility of handling document-level models by lifting the assumption of sentence
independence. As a consequence, enumerating the entire neighbourhood becomes too expensive, which is why we resort to a “first-choice” strategy that
non-deterministically generates states and accepts the first one encountered
that meets the acceptance criterion.
More recently, Gibbs sampling has been proposed as a way to generate
samples from the posterior distribution of a phrase-based SMT decoder (Arun
et al., 2009, 2010), a process that resembles local search in its use of a set of
state-modifying operators to generate a sequence of decoder states. Where
local search seeks the best state attainable from a given initial state, Gibbs
sampling produces a representative sample from the posterior. Like all work
on SMT decoding that we know of, the Gibbs sampler presented by Arun
et al. (2010) assumes independence of sentences and considers the complete
neighbourhood of each state before taking a sample.
4.9 Conclusion
In this chapter, we have presented a document-level decoder for phrase-based
SMT. The decoder (Hardmeier et al., 2012, 2013a) uses a local search approach,
keeping a translation of the entire document as its internal state and continually generating new hypotheses by applying state-modifying operations to
the current state. New states are accepted or rejected according to an acceptance criterion that deterministically or stochastically favours states with
higher scores. Compared to the standard DP beam search algorithm, stack
decoding, our approach has the advantage of admitting unrestricted dependency configurations for the feature models. On the downside, our algorithm
explores a much larger search space than the stack decoder without profiting from the benefits of DP, so given models that are compatible with the
constraints of DP, the risk of search errors is much increased. However, we
have shown that our decoder on its own can generate translations whose
BLEU score is only about one point lower than that of the translations found
by Moses with the same models. Moreover, if we initialise the decoder with
Moses output and use the hill climbing acceptance criterion, we know with
certainty that only model error, not search errors, can make the final translations worse than those found by Moses.
Compared to the approaches to document-level SMT discussed in the previous chapter, our integrated document-level decoder has a number of advantages. The most important may well be its flexibility. While the sentence-based approaches all impose their specific restrictions on the models and
make it difficult to experiment freely with discourse-level models, our decoder has no such inherent restrictions. It gives the feature models access to
the entire document and permits joint optimisation of the feature functions
over the complete document without constraining the directionality of the
dependencies. It can accommodate any number of discourse-level features
without additional complications. Its search algorithm is less efficient than
DP beam search when it operates under the same constraints, but its performance does not suffer additionally from the presence of long-range dependencies. A sentence-based decoding procedure may be sufficient for some types
of document-level models, and it may even be more efficient in some specific
cases, but a document-level decoder provides an indispensable framework for
unfettered experimentation with discourse features in SMT.
5. Case Studies in Document-Level SMT
In this chapter, we look at how we can apply the document-level decoding
method of the previous chapter to control properties of the target language
vocabulary of an SMT system. First, we consider the problem of lexical cohesion and terminological consistency in MT. We describe the results of a small
corpus study and present a cross-sentence semantic language model based
on a vector space representation. Then, we discuss how discourse models
can be used to bias an SMT system towards certain types of vocabulary and
show some results with document-level features to improve text readability.
5.1 Translating Consistently:
Modelling Lexical Cohesion
Text cohesion, the property of linkedness in a text, is created not only by overt
devices such as discourse markers or anaphoric links, but is also reinforced by a
more general effect of the lexicon used in the text. On the one hand, different
sentences in a cohesive text will tend to be about the same things. On the
other hand, there will be patterns of word usage favouring the recurrence of
previously used words, synonyms and other semantically related words. This
aspect of cohesion is called lexical cohesion (Halliday and Hasan, 1976).
A somewhat related phenomenon in the context of translation is terminological consistency, which means that the same word will tend to be translated in the same way when it recurs in the text. Given a cohesive input text,
this will help preserve cohesion under translation. Under the slogan of one
translation per discourse, coined after the one sense per discourse hypothesis
from computational semantics (Gale et al., 1992), this assumption was tested
by Carpuat (2009) in a corpus study with both human translations and machine translations. She finds the hypothesis confirmed in the human translations in a corpus of English–French newswire. Perhaps more surprisingly,
the hypothesis also holds in machine translations of the same texts generated by a phrase-based SMT system, an observation she puts down to the
low variability of the SMT phrase tables. Of course, this result does not say
anything about whether or not the consistent translations of the SMT system
are correct.
We conclude that consistency of lexical choice is a property that cuts both
ways. It is clearly a desired property of translated texts in some sense, but it
may also be indicative of poor translation quality due to impoverished SMT
models. In the following sections, we try to shed some more light on this phenomenon. As a working hypothesis, we assume that SMT word choice could
be improved by exploiting the vocabulary used in the whole text to make
phrase selection consistent in the sense of lexical cohesion. In our experiments, we test whether this is actually the case under some operational models
of lexical cohesion.
5.1.1 Translation Consistency in Different MT Systems
For our experiments, we use the English–French test set of the 2010 MetricsMATR evaluation of MT evaluation metrics (Callison-Burch et al., 2010). The
test set contains source, reference and the output of 22 different MT systems for a corpus of newswire text. To generate automatic word alignments
between the source on the one hand and the reference and all candidate translations on the other hand, we concatenate the texts with the News commentary training corpus included in the WMT 2010 SMT shared task training data.
Then we run GIZA++ (Och and Ney, 2003) in both translation directions and
symmetrise the alignments with the grow-diag-final-and heuristic (Koehn
et al., 2003).
The translations are first scored with a simple word translation model
based on lexical weights, where the probability of a text is defined as follows:
L(T, S) = \log \prod_{s \in S} \frac{1}{|T_s|} \sum_{t \in T_s} p(t \mid s) \qquad (5.1)
where S and T are the source and target language texts, s and t are single
words and Tₛ is the set of target words aligned to a given source word. Unaligned words are considered to be aligned to a special null word. The probabilities p(t|s) are estimated as unsmoothed relative frequencies computed
over the text that is being scored. This score has the property of rewarding a
consistent translation in which the same words are always translated in the
same way.
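For concreteness, the following sketch computes this score, assuming the aligned text is given as a list of (source word, aligned target words) pairs in which unaligned words carry a special NULL token:

    import math
    from collections import Counter

    def lexical_consistency_score(aligned_pairs):
        # Eq. 5.1: log product over the source words of the average
        # probability p(t|s), with p(t|s) estimated as an unsmoothed
        # relative frequency over the very text that is being scored.
        link_counts = Counter()    # counts of (s, t) alignment links
        source_counts = Counter()  # total alignment links per source word
        for s, targets in aligned_pairs:
            for t in targets:
                link_counts[s, t] += 1
                source_counts[s] += 1
        total = 0.0
        for s, targets in aligned_pairs:
            p = sum(link_counts[s, t] / source_counts[s] for t in targets)
            total += math.log(p / len(targets))
        return total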
Figure 5.1. Lexical consistency vs. human MT evaluation for different MT systems

The results of this experiment are shown in Fig. 5.1, where the lexical consistency score described in the previous paragraph is plotted against the percentage of acceptable translations according to the human evaluation for the
WMT 2010 shared task (Callison-Burch et al., 2010). At the shared task, acceptability was determined with a two-stage procedure. In the first stage, the
evaluators were asked to postedit the MT output in groups of five consecutive
sentences to create fluent target language output, but without seeing either
the source language input or the reference translations. In the second stage,
they were asked to judge whether or not the postedited text was fully fluent
in the target language and equivalent in meaning to the input text.
Under the model of Eq. 5.1, the reference translation is a clear outlier, and
it obtains a lower score than any of the machine translations. This result
provides further evidence for the observation that SMT output uses fairly
consistent vocabulary (Carpuat, 2009), but makes it appear improbable that
a model of this kind can improve MT. Among the MT system outputs, there
is no clear correlation between the lexical consistency score and the percentage of acceptable translations. A closer look at the individual systems and
their system descriptions reveals that the differences in vocabulary consistency are due to other factors than just output quality; in particular, the size of
the training corpus used for translation model and language model training
has a large impact on the scores, smaller corpus size being correlated with
higher consistency and lower translation quality. The presence of this nuisance variable makes it difficult to compare lexical consistency across different
MT systems and confirms that excessive consistency may sometimes indicate poor translation quality, but it does not, of course, say anything about
the usefulness of a lexical consistency or cohesion model when the training
corpus size is kept fixed.
The conclusion that can be drawn from these experiments is that a model
focusing on translation consistency alone is unlikely to improve SMT quality. Successful modelling of text cohesion will almost certainly require some
source of semantic information.
5.1.2 Word-Space Models for Lexical Cohesion
In computational discourse modelling, word-space models generated by Latent Semantic Analysis (LSA) have been used to model the vocabulary consistency characteristic of lexical cohesion (Foltz et al., 1998; Beigman Klebanov et al., 2008; Gupta et al., 2008). By defining a lexical cohesion model
on the basis of a word-space model, the cohesion model can be semantically
anchored in a manner that is independent of the text to be scored. We hope
that this kind of model will be able to distinguish between true lexical cohesion and the delusive kind of consistency induced by the lack of variability
in the SMT phrase tables.
As a preliminary experiment, we test a simple word-space cluster cohesion measure on the data set described in the previous section. We build a
300-dimensional word space model on French Wikipedia data using the LSA
implementation found in the S-Space software package (Jurgens and Stevens,
2010). For each document in the translations produced by all MT systems as
well as the human reference translator, the word vectors wᵢ of all words are
looked up and averaged to determine the mean vector ŵ for each document.
Then, the score of an individual document is defined as the sum of squared
distances between the individual word vectors and the document mean vector:
D = \sum_i |w_i - \hat{w}|^2 \qquad (5.2)
The score of a test set is defined as the sum of the scores of all its documents.
Note that in this experiment, unlike the previous one, a low score indicates
high cohesion.
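In code, the score can be computed along the following lines, assuming the LSA vectors are available as a dictionary from words to numpy arrays; skipping words absent from the model is an assumption of this sketch:

    import numpy as np

    def cluster_cohesion_score(documents, word_vectors):
        # Eq. 5.2: for each document, sum the squared distances between
        # the individual word vectors and the document mean vector;
        # the test set score is the sum over all documents.
        total = 0.0
        for tokens in documents:
            vectors = [word_vectors[w] for w in tokens if w in word_vectors]
            if not vectors:
                continue
            mean = np.mean(vectors, axis=0)
            total += float(sum(np.sum((v - mean) ** 2) for v in vectors))
        return total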
Figure 5.2. LSA cluster cohesion vs. human MT evaluation for different MT systems

The results of this experiment are shown in Fig. 5.2. Unlike the simple consistency measure of the preceding section, according to which the reference
translation seems to be less consistent than the MT outputs, the LSA-based
measure judges that the reference, while being less of an outlier, is actually
more cohesive than most of the machine-translated texts. Unfortunately, the
diversity of the MT systems tested makes it difficult to draw more interesting
conclusions. In particular, the only MT output with a lower score than the
reference translation, indicating greater cohesion, comes from a shared task
submission for which no system description paper was published, so it is unclear what properties of the system may have contributed to this result. The
two submissions immediately following the reference translation in the score
ranking (Federmann et al., 2010; Zeman, 2010) supply evidence that training
corpus size has an effect on this score as well: These two systems use only a
relatively small subset of the training data provided for the shared task. The
system with the highest sum of squared distances is peculiar in a different
way: Rather than using the training data provided by the shared task organisers, it is trained on a large corpus of training data extracted from translation
memories of European Union translators (Jellinghaus et al., 2010).
5.1.3 A Semantic Document Language Model
We now present a model for lexical cohesion implemented in our document-level decoding framework. Our model rewards the use of semantically related
words in the translation output by the decoder, where semantic distance is
measured with a word space model based on Latent Semantic Analysis (LSA).
LSA has been applied with some success to semantic language modelling
in previous research (Coccaro and Jurafsky, 1998; Bellegarda, 2000; Wandmacher and Antoine, 2007). In SMT, it has mostly been used for domain adaptation (Kim and Khudanpur, 2004; Tam et al., 2007), or to measure sentence
similarities (Banchs and Costa-jussà, 2011).
The model we use is inspired by Bellegarda (2000). It is a Markov model,
similar to a standard n-gram model, and assigns to each content word a score
given a history of n preceding content words, where n = 30 in the experiments reported below. Content words are defined as tokens consisting exclusively of alphabetic characters not included in a stop word list originally developed for information
retrieval (Savoy, 1999).¹ Scoring relies on a 30-dimensional LSA word vector
space trained with the S-Space software (Jurgens and Stevens, 2010) on data
from the Europarl and News commentary corpora of the 2010 WMT shared
task. The score is defined based on the cosine similarity between the word
vector of the predicted word and the mean word vector of the words in the
history. Following Bellegarda (2000), we convert the similarity measure into
a probability by looking at the empirical distribution of similarities between
word vectors in the training set. The probability of a given similarity can then
be estimated as the proportion of training examples having a lower similarity
score than the target value.
The model is structurally different from a regular n-gram model in that
word vector n-grams are defined over content words occurring in the word
vector model only and can cross sentence boundaries. Stop words and tokens
containing non-alphabetic characters, which together amount to around 60 %
of the tokens, are scored by a different mechanism based on their relative
frequency (undiscounted unigram probability) in the training corpus.

¹ The stop word list was retrieved from http://members.unine.ch/jacques.savoy/clef/frenchST.txt (12 October 2011).
Table 5.1. Experimental results with a cross-sentence semantic language model

                      newstest2009     newstest2010     newstest2011
                      BLEU    NIST     BLEU    NIST     BLEU    NIST
DP search only        0.2256  6.513    0.2727  7.034    0.2494  7.170
DP + hill climbing    0.2260  6.518    0.2733  7.046    0.2497  7.169
with semantic LM      0.2271  6.549    0.2753  7.087    0.2490  7.199
In sum, the score produced by the semantic document LM has the following form:

h(w \mid h) = \begin{cases} p_{\text{unigr}}(w) & \text{if } w \text{ is a stop word,} \\ \alpha\, p_{\cos}(w \mid h) & \text{if } w \text{ is a known word,} \\ \epsilon & \text{if } w \text{ is an unknown word,} \end{cases} \qquad (5.3)
where α is the proportion of content words in the training corpus and ϵ is a
small fixed probability. It is integrated into the English–French SMT system
described in Section 4.6 as an extra feature function for the Docent decoder.
Its weight is selected by grid search over a number of values, comparing translation performance for the newstest2009 test set.
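A minimal sketch of this scoring function is shown below. The empirical similarity distribution is represented by a sorted list of training similarities, and details such as the stop word test or the choice of ϵ are assumptions of the sketch rather than properties of the actual feature function:

    import bisect
    import numpy as np

    def semantic_lm_score(word, history, word_vectors, train_sims,
                          unigram_probs, stop_words, alpha, eps=1e-6, n=30):
        # Eq. 5.3: stop words and non-alphabetic tokens are scored by
        # relative frequency, known content words by alpha * p_cos, and
        # unknown words receive a small fixed probability epsilon.
        # `history` is the sequence of preceding content words, which
        # may cross sentence boundaries.
        if word in stop_words or not word.isalpha():
            return unigram_probs.get(word, eps)
        if word not in word_vectors:
            return eps
        context = [word_vectors[w] for w in history[-n:] if w in word_vectors]
        if not context:
            return eps
        mean = np.mean(context, axis=0)
        sim = float(np.dot(word_vectors[word], mean) /
                    (np.linalg.norm(word_vectors[word]) * np.linalg.norm(mean)))
        # p_cos: proportion of training similarities below the target value
        p_cos = bisect.bisect_left(train_sims, sim) / len(train_sims)
        return alpha * p_cos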
In these experiments, we use DP beam search to initialise the state of our
local search decoder. Three results are presented (Table 5.1): The first table
row shows the baseline performance using DP beam search with standard
sentence-local features only. The scores in the second row result from running the hill climbing decoder with DP initialisation, but without adding any
models. A marginal increase in BLEU scores for all three test sets demonstrates that the hill climbing decoder manages to correct some of the search
errors made by the DP search. The last row contains the scores obtained by
adding in the semantic language model. Scores are presented for three publicly available test sets from recent WMT machine translation shared tasks, of
which one (newstest2009) was used to monitor progress during development
and select the final model.
Adding the semantic language model results in a small increase in NIST
scores (Doddington, 2002) for all three test sets as well as a small BLEU score
gain (Papineni et al., 2002) for two out of three corpora. We note that the NIST
score proves to react more sensitively to improvements due to the semantic
LM in all our experiments. This is reasonable because the model specifically
targets content words, which benefit from the information weighting done
by the NIST score. While the results we present do not constitute compelling
evidence in favour of our semantic LM in its current form, they do suggest
that this model could be improved to realise higher gains from cross-sentence
semantic information and demonstrate how the document-level decoder enables experimentation with models that would be much more difficult to integrate in DP beam search.
5.2 Translating for Special Target Groups:
Improving Readability
The experiments of the first half of this chapter were geared towards lexical
cohesion, a phenomenon present in all connected text and universally relevant in translation. Discourse-level modelling can also be used to create
output texts with certain specific properties that are desirable in a particular
translation task. As an example, we consider models that improve the readability of the target text by exerting an influence on the vocabulary preferred
by the SMT system and pushing it towards words and constructions that are
potentially easier to understand.² This form of simplifying translation can be
useful for special populations such as less proficient language users or dyslectic readers or simply for non-experts who want to grasp the main content
in a domain-specific text, e. g., from the legal or medical domain, written in
a foreign language.

² The results presented in this section are primarily the work of Sara Stymne (Stymne et al., 2013c), who carried out the experiments and composed an earlier version of the text on which this section is based.
Readability and text simplification have been widely studied in the field
of computational linguistics, and several metrics and approaches have been
proposed in the literature. Common readability metrics make use of global
text properties such as type/token ratios, lexical consistency and the proportion of long versus short words. Our goal is to incorporate these features in
an SMT system in order to combine text simplification and MT in a single
system.
Chall (1958) identifies four main factors with strong effects on the readability of a text. Mühlenbock and Johansson Kokkinakis (2009) propose four
corresponding quantitative indicators to measure them. Vocabulary load is
the difficulty of the vocabulary; the corresponding measure is the number of
words exceeding a certain length. Sentence structure is the syntactic complexity of the text and is measured by determining the average sentence length.
Idea density represents the conceptual difficulty of the text, with lexical variation as a quantitative measure. Finally, human interest indicates the degree
of abstractness of the text, and it is measured as the proportion of proper
nouns. The proposed metrics are all fairly crude approximations of the motivating text qualities, but they have the advantage of being easy to measure
and not requiring deep syntactic or semantic analysis.
5.2.1 Readability Metrics
The starting point for the work of Mühlenbock and Johansson Kokkinakis
(2009) is a well-known readability metric for Swedish called LIX (Läsbarhetsindex; Björnsson, 1968). In the terminology of Chall (1958), the LIX metric covers the vocabulary load and the sentence structure dimensions. It is
computed as a linear combination of the average sentence length and the
proportion of tokens longer than 6 characters, as in the following equation,
where C(x) is the count of x:
\mathrm{LIX} = \frac{C(\mathrm{tokens})}{C(\mathrm{sentences})} + 100 \cdot \frac{C(\mathrm{tokens} > 6~\mathrm{chars})}{C(\mathrm{tokens})} \qquad (5.4)
Average sentence length (ASL) is also useful as a standalone measure for
sentence structure complexity:
\mathrm{ASL} = \frac{C(\mathrm{tokens})}{C(\mathrm{sentences})} \qquad (5.5)
Since complicated concepts are frequently expressed with long compound
words in Swedish, Mühlenbock and Johansson Kokkinakis (2009) suggest
measuring the percentage of extralong words with 14 characters or more as
an additional indicator of high vocabulary load:
\mathrm{XLW} = \frac{C(\mathrm{tokens} \geq 14~\mathrm{chars})}{C(\mathrm{tokens})} \qquad (5.6)
Idea density could be measured with the type-token ratio:
\mathrm{TTR} = \frac{C(\mathrm{tokens})}{C(\mathrm{types})} \qquad (5.7)
In order to improve comparability across texts of different length, Hultman
and Westman (1977, 56) propose a related, but different measure of lexical
variation. They relate the vocabulary size or type count V of a text to its
token count N as
V = N^{\,2 - N^k} \qquad (5.8)
for some text-specific constant k and define their lexical variation metric
OVIX (ordvariationsindex) as the reciprocal of k . Solving for OVIX, we obtain:
\mathrm{OVIX} = \frac{\log N}{\log\left(2 - \frac{\log V}{\log N}\right)} = \frac{\log C(\mathrm{tokens})}{\log\left(2 - \frac{\log C(\mathrm{types})}{\log C(\mathrm{tokens})}\right)} \qquad (5.9)
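For illustration, the two central metrics can be computed from tokenised text as follows (sentence splitting and tokenisation are assumed to have been done beforehand):

    import math

    def lix(sentences):
        # Eq. 5.4: average sentence length plus 100 times the
        # proportion of tokens longer than six characters.
        tokens = [t for sentence in sentences for t in sentence]
        long_words = sum(1 for t in tokens if len(t) > 6)
        return len(tokens) / len(sentences) + 100.0 * long_words / len(tokens)

    def ovix(tokens):
        # Eq. 5.9: word variation index computed from the token count N
        # and the type count V (assumes 1 < V < N).
        n = len(tokens)
        v = len(set(tokens))
        return math.log(n) / math.log(2.0 - math.log(v) / math.log(n))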
Another indicator of idea density is the nominal ratio NR:
\mathrm{NR} = \frac{C(\mathrm{nouns}) + C(\mathrm{prepositions}) + C(\mathrm{participles})}{C(\mathrm{pronouns}) + C(\mathrm{adverbs}) + C(\mathrm{verbs})} \qquad (5.10)
Typical news texts can be expected to have an NR around 1. NR correlates
positively with formality and negatively with readability (Mühlenbock and
Johansson Kokkinakis, 2009).
The proportion of proper nouns PN is used as an indicator of human interest:
\mathrm{PN} = \frac{C(\mathrm{proper\ names})}{C(\mathrm{tokens})} \qquad (5.11)
Finally, Stymne et al. (2013c) suggest that consistent translation may contribute to readability and propose using a measure of translation consistency
based on an association metric called Q-score (Deléger et al., 2006). Q-score
measures the association strength of an aligned pair of items and can be used
either at the word level or at the level of SMT phrases. It is computed as the
token frequency of the aligned pair st divided by the sum of the total number
of pair types the source s and the target t individually occur in:
Q = \frac{C(st)}{N(s\,\cdot) + N(\cdot\,t)} \qquad (5.12)
Here, C(x) is the token frequency of x as above, N(x) is the number of
types matching a certain pattern, and the symbol · represents a wildcard
character. Intuitively, the Q-score rewards common phrase or word pairs
with consistent translations, whereas it penalises less frequent pairs whose
source and target elements also participate in many other pairs.
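Given a table of aligned pair frequencies, the Q-score of every pair can be computed as in the following sketch, where pair_counts maps (source, target) items, words or phrases, to their token frequencies:

    from collections import Counter

    def q_scores(pair_counts):
        # Eq. 5.12: token frequency of the pair st divided by the number
        # of pair types in which s and t occur (N(s·) and N(·t)).
        source_types = Counter()
        target_types = Counter()
        for s, t in pair_counts:
            source_types[s] += 1
            target_types[t] += 1
        return {(s, t): count / (source_types[s] + target_types[t])
                for (s, t), count in pair_counts.items()}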
5.2.2 Experiments
For our experiments, we have implemented a subset of the readability features discussed in the previous section as feature functions for the Docent
decoder described in Chapter 4. Some of the metrics can be evaluated at the
sentence level, whereas others are meaningful only at the document level.
The following features are implemented:
Sentence level:
  SL     Sentence length in words
  nLW    Number of long words (> 6 characters)
  nXLW   Number of extralong words (≥ 14 characters)

Document level:
  TTR    Type-token ratio (Eq. 5.7)
  OVIX   Word variation index (Eq. 5.9)
  Qw     Q-score, word level (Eq. 5.12)
  Qp     Q-score, phrase level (Eq. 5.12)
We evaluate our models on parliamentary texts from the Europarl corpus
(Koehn, 2005). This corpus contains both complex sentences and a great deal
of domain-specific terminology. Our system is trained on 1,488,322 sentences
of English–Swedish data. For evaluation, we extract 20 documents with a
total of 690 sentences from a held-out part of Europarl. A document is defined
as a complete contiguous sequence of utterances of one speaker. We exclude
documents that are shorter than 20 sentences or longer than 79 sentences.
Moses (Koehn et al., 2007) is used for training the translation model and
SRILM (Stolcke, 2002) for training the language model. We initialise our experiments with a Moses model that uses the standard features of a sentence-level phrase-based SMT system: a 5-gram language model, five translation model features, a distance-based reordering penalty and a word counter.
Table 5.2. Systems with single readability features

Feature    Weight    BLEU↑   NIST↑   LIX↓    ASL↓    OVIX↓   XLW↓    NR↓     PN↑
Reference  –         –       –       50.47   24.65   57.73   3.08    1.055   0.013
Baseline   –         0.243   6.12    51.17   25.01   56.88   2.63    1.062   0.015
OVIX       low       0.243   6.11    51.00   25.09   54.65   2.60    1.069   0.015
           medium    0.228   5.83    49.33   25.45   44.43   2.53    1.063   0.015
           high      0.144   4.41    46.59   29.09   31.65   1.82    0.941   0.013
TTR        low       0.243   6.12    51.04   25.11   55.25   2.60    1.070   0.015
           medium    0.225   5.75    49.86   26.19   45.31   2.44    1.080   0.014
           high      0.150   4.48    48.30   30.54   32.95   1.77    0.975   0.012
Qw         low       0.242   6.10    51.16   25.07   57.16   2.62    1.064   0.015
           medium    0.231   5.90    51.28   25.32   58.90   2.62    1.074   0.015
           high      0.165   4.93    50.92   26.14   60.61   2.63    1.101   0.016
Qp         low       0.243   6.12    51.16   24.99   56.94   2.65    1.061   0.015
           medium    0.229   5.99    49.79   24.14   54.75   2.62    1.060   0.015
           high      0.097   3.90    41.45   21.99   39.22   2.39    1.129   0.015
nLW        low       0.244   6.14    50.96   24.98   56.73   2.63    1.065   0.015
           medium    0.225   5.96    46.72   24.21   55.39   2.72    1.080   0.018
           high      0.106   4.11    30.27   22.18   45.41   1.78    0.899   0.023
nXLW       low       0.241   6.10    51.03   24.96   56.69   1.85    1.060   0.015
           medium    0.225   5.85    50.92   25.09   56.56   0.19    1.070   0.016
           high      0.224   5.84    50.97   25.12   56.55   0.19    1.068   0.016
SL         low       0.242   6.21    51.07   24.22   57.79   2.71    1.058   0.016
           medium    0.211   5.94    50.77   21.61   60.93   3.15    1.040   0.018
           high      0.150   4.38    50.77   18.46   65.37   3.72    1.072   0.021

↑ higher score is better, ↓ lower score is better
The
weights of these features are optimised using minimum error-rate training
(Och, 2003). We reuse the same weights in Docent. The weights of the
document-level features are not optimised automatically, not least because
we have no tuning set with reference translations optimised for readability.
Instead, we test three different settings with a low, a medium and a high
weight relative to the other components of the weight vector for each readability feature.
Owing to the lack of other resources, we perform automatic evaluation
against the standard reference translation contained in the Europarl corpus.
This translation is in no way simplified or optimised for readability. We report
figures for two standard MT evaluation scores, BLEU (Papineni et al., 2002)
and NIST (Doddington, 2002), as well as the readability metrics discussed in
the previous section.
Table 5.2 shows the results when we activate one readability feature at a
time using low, medium, and high weights for each feature. The baseline and
reference are quite similar with respect to readability, with some interesting differences, e. g., in the proportion of extra long words.
Table 5.3. Systems with combinations of readability features (medium weights)

                     BLEU↑   NIST↑   LIX↓    ASL↓    OVIX↓   XLW↓    NR↓     PN↑
Baseline             0.243   6.12    51.17   25.01   56.88   2.63    1.062   0.015
LIX (nLW+SL)         0.214   5.96    46.09   23.02   56.27   2.90    1.061   0.018
OVIX+SL              0.229   5.94    48.86   24.34   44.53   2.63    1.046   0.015
Qp+OVIX+nLW+SL       0.225   5.93    47.77   24.08   43.77   2.65    1.045   0.016
All features         0.235   6.04    49.29   24.34   47.80   1.98    1.046   0.015
As expected, giving
a high weight to a readability feature usually results in a sharp decrease in
MT quality with respect to the unsimplified reference translations, but it also
greatly affects the corresponding readability features. In some cases, turning
on the readability features results in extreme scores clearly indicative of overfitting. As an example, using the nLW feature with a high weight decreases
the LIX score by more than 20 points.
Using low or medium weights, by contrast, can give reasonable MT scores
as well as some improvements on several readability metrics. Unsurprisingly,
the features corresponding directly to a metric, like the nLW feature for the
LIX metric and the OVIX feature for the OVIX and TTR metrics, affect that
metric strongly. Several features also have an effect on other readability metrics. For instance, the OVIX and TTR features improve several metrics, but
cause an increase in sentence length, which is undesired. The effect of phrase-level Q-value is very different from that of word-level Q-value. On the phrase
level, it improves most metrics, while its effect on the readability metrics is
small when used on the word level.
In Table 5.3, we show results for some combinations of features, using medium weights. As expected, the effect on the readability metrics is more balanced in these cases. For the system with all features there are improvements
on all readability metrics, except for PN, which is on a par with the baseline.
The other systems that use some global feature also have a positive effect on
most readability metrics, while the LIX system that uses only local features
has little effect on OVIX and a negative effect on extra long words. For all
these systems, the decrease in MT quality is modest. This shows that the decoder with its document-level features manages to simplify translations with
respect to different aspects corresponding to vocabulary load, idea density,
and sentence structure, while maintaining reasonable translation quality.
We also performed a small human evaluation of 100 random non-identical
sentences from the baseline and the system using all readability features.3
For each sentence we rank the output on adequacy, how well the content is
translated, and readability, how easy to read the translations are. The results
are shown in Table 5.4.
³ 177 out of 690 sentences were identical.

Table 5.4. Human preference with respect to adequacy and readability

                        Preferred system
             Baseline    Equal    Readability (All)
Adequacy        51         33           16
Readability     33         29           38
The baseline produces a higher number of adequate translations than the system with readability features, but in many cases, adequacy is equal. For readability, there is a small advantage for the system with
readability features, which is consistent with the improvement on readability
metrics.
Overall, the output is often very similar with only a few words differing.
In some of the cases where the baseline is judged as having better adequacy,
the cause is a single changed word, which may be more common or shorter,
but has the wrong form or part of speech, so it does not fit into the context.
In other cases, some non-essential information is removed from the sentence,
which while making the translation less adequate, is actually what we want
to achieve. In some cases, however, the words removed from the translations
do contain essential information.
In Table 5.5, we show some sample translations in order to exemplify the types of operations our current system is able to perform. One type
of successful simplification is to remove words that are not crucial for the
meaning of the text. Many of the systems with readability constraints simplify the phrase the honourable Members, either by removing the adjective
and giving only ledamöterna ‘the members’, or even by using the pronoun
ni ‘you’. Another good simplification is the rendering of in such a way that,
which is translated quite literally in the baseline, as så att ‘so that’ by several
of the systems. There are also instances, however, where the changes lead
to a loss of information. Examples are handlingsplan ‘action plan’, which is
reduced to plan ‘plan’ by the nLW system, and 2003, which is missing in the
output of the OVIX and Qp systems. Often, different translations are chosen
for a word or phrase. Sometimes this leads to a simplification, as in the nLW
system, which uses the everyday expression bli klar ‘finish’ instead of the
more formal avsluta ‘finish’. In other cases, the translation options are of a
relatively similar degree of difficulty, such as vissa/en del/några ‘some’, all of
which are valid translations. In some cases the system with readability features prefers a translation with a different part of speech, as for uppmärksamhet ‘attention’, which is translated with the adjective uppmärksam ‘attentive’
by several systems. This leads to syntactic problems later on in the translation. In general, as can be expected of SMT, there are some problems with
fluency in all translations, but they tend to get worse in the systems with
high-weight readability features.
5.3 Conclusion
In sum, our experiments with the cross-sentence semantic models as well as
with the readability models suggest that it is possible to control some output
properties related to vocabulary choice with document-level features. Both
types of models are in need of improvement.
The semantic language model introduced in Section 5.1.3 is a demonstration of a model that would be difficult, if not impossible, to implement with
a sentence-level SMT decoder. It models n-gram-like sequences of content
words that can span several sentences and introduces undirected dependencies between words that could not easily be processed with the two-pass
or the sentence-to-sentence information propagation method. Of the techniques discussed in Chapter 3, only n-best rescoring could handle this kind
of model, but it would be limited to a greatly impoverished representation of
the search space.
The readability features we have explored show that our document-level
MT system is capable of enforcing specific vocabulary properties in the output texts. In their current form, they have a negative effect on adequacy, even
though they do improve the automatic readability scores. Overfitting to the
scores is an aspect that must be considered in future development. It may
also be necessary to reconsider the scores, most of which are just crude approximations of the relevant linguistic phenomena that are not necessarily
correlated with perceived readability when systematically optimised against.
Another problem of the readability experiments we performed is the lack
of relevant, simplified reference translations. This is a reflection of the fact
that joint translation and text simplification strains the limits of the equivalence perspective on translation adopted by SMT. By applying readability
models, the input text is retargeted to another audience, so the intentionality
of the translation act can no longer be ignored. Evaluating against standard Europarl translations makes it difficult to assess the quality of simplified
translations since any deviation from the reference will be penalised by the
automatic evaluation scores, even if it is correct and has the desired effect
of improving readability. Had we tried to tune the feature weights automatically, the same problem would have occured there, so the weights obtained
with MERT using a regular, unsimplified test set are almost certainly suboptimal for use with the readability models. A corpus of simplified texts might
also make it possible to adapt language models to simplified output, which in
turn might improve fluency. Nevertheless, our experiments demonstrate the
effectiveness of the decoding procedure introduced in Chapter 4.
Table 5.5. Examples of translation output from systems with readability features

Source: As the honourable Members know – some speakers have mentioned it – the European Council at Lisbon paid particular attention to promoting our efforts to implement risk capital in such a way that the action plan will be finished in 2003.

Baseline: Som de ärade ledamöterna vet – vissa talare har nämnt det – som Europeiska rådet i Lissabon ägnat särskild uppmärksamhet åt att främja våra ansträngningar att genomföra riskkapital på ett sådant sätt att handlingsplanen kommer att vara avslutat år 2003.

All (medium): Som ledamöterna vet – vissa talare har nämnt det – som Europeiska rådet i Lissabon särskilt uppmärksam på att främja våra insatser för att genomföra riskkapital så att handlingsplanen kommer att vara avslutat 2003.

LIX (medium): Som ledamöterna vet – vissa talare har nämnt det – Europeiska rådet i Lissabon lagt särskild vikt vid att främja våra ansträngningar att genomföra riskkapital så att handlingsplanen kommer att vara avslutat år 2003.

OVIX+SL (medium): Som ni vet – vissa talare har nämnt det – som Europeiska rådet i Lissabon särskilt uppmärksam på att främja våra ansträngningar att genomföra riskkapital så att handlingsplanen kommer att avslutas under 2003.

OVIX (high): Som ledamöter – en del talare har nämnt det – som Europeiska rådet i Lissabon särskilt uppmärksam på att stödja våra insatser för att genomföra av riskkapital, på så sätt att handlingsplanen kommer att vara avslutat i.

Qp (high): Som de ärade ledamöterna vet, som några talare har nämnt det rådet i Lissabon, ägnat särskild uppmärksamhet åt att vi för att genomföra riskerna i det att handlingsplanen kommer att avslutas med.

nLW (high): Som ni vet – vissa har sagt det – EU:s möte i Lissabon lagt särskild vikt vid vår för att genomföra risk i så att den plan att bli klar under 2003.

SL (high): Som ledamöterna vet vissa talare har nämnt – Europeiska rådet i Lissabon särskilt uppmärksammat främja våra ansträngningar att genomföra riskkapital så att handlingsplanen avslutas 2003.
Part II:
Pronominal Anaphora in Translation
6. Challenges for Anaphora Translation
In this chapter, we introduce the problem of translating pronominal anaphora,
which will be the main topic of the entire second part of this thesis. Pronominal anaphora is a specific discourse phenomenon that is ubiquitous in natural
language text and poses surprisingly hard problems to SMT systems. In many
languages, morphological features of anaphoric pronouns such as grammatical gender and number must agree with the corresponding features of their
antecedents. Generating the right forms of the pronouns requires target-side
dependencies because agreement depends on features in the linguistic system of the target language that do not necessarily map to properties of the
input text. However, even though it seems obvious that it must be possible
to improve SMT by considering anaphoric pronouns, both our own research
and that of others have shown that it is far more difficult to obtain gains in
translation quality than it might seem at first glance. We start by taking a
closer look at what pronominal anaphora actually is and by establishing that
it is in fact a problem for SMT. Then we discuss some of the difficulties that
arise in recent work on pronouns in SMT.
6.1 Pronouns and Anaphora Resolution
Anaphora is “a relation between two linguistic elements, in which the interpretation of one (called an anaphor) is in some way determined by the
interpretation of the other (called an antecedent)” (Huang, 2004). Pronominal
anaphora specifically refers to the case in which the anaphor is a pronoun
that should be interpreted as coreferring with something already mentioned.
In the following example, the pronoun them in the second sentence has the
same referent as and agrees morphologically with the noun phrase the Catholics in the first:
(6.1) The Catholics described the situation as “safe” and “protecting.” This
made them “relaxed and peaceful.” (newstest2009)¹

¹ Examples marked news-test2008 and newstest2009 are taken from the 2008 and 2009 test sets of the WMT shared tasks, respectively (Callison-Burch et al., 2009).
Prototypically, anaphoric pronouns refer to entities introduced into the discourse in the form of noun phrases. They can also refer to events or to parts
of the discourse itself, or to phenomena not explicitly mentioned, but somehow implied by the discourse. Such cases are sometimes subsumed under the
label event anaphora. If a referring pronoun precedes its antecedent instead
of following it, it is called cataphoric instead of anaphoric. Furthermore, some
uses of pronouns do not refer to a particular antecedent at all. For instance,
the expletive or pleonastic pronoun it in it is raining has a purely syntactic
function and is not anaphoric.
Anaphora resolution, the problem of identifying the antecedent of an anaphoric linguistic element, is a long-standing research problem in computational linguistics. Much research has been devoted to noun phrase coreference resolution, which is “the task of determining which NPs in a text or
dialogue refer to the same real-world entity” (Ng, 2010). For the purposes of
this thesis, the special case of pronominal anaphora resolution is most relevant. General-purpose automatic coreference resolution systems usually try
to resolve both pronominal and non-pronominal noun phrase coreference,
whereas event anaphora tends to be somewhat neglected (Pradhan et al.,
2011).
In many systems, automatic coreference resolution proceeds in two stages.
First, the system analyses the text to be annotated with the help of NLP tools
such as taggers or syntactic parsers and finds the noun phrases eligible for
inclusion in a coreference relation, called mentions or markables. Then, it
performs inference over the markables found to determine which of them
refer to the same extra-linguistic entities.
There are different ways to approach the coreference resolution task. Ng
(2010) distinguishes between mention-pair systems and entity-mention systems. The former try to decide, for each pair of mentions in the text, whether
or not they refer to the same entity. The latter construct an abstract representation of all the entities in the text and decide, for each mention, whether
or not it refers to a given entity.
Like MT, coreference resolution can be approached either with handwritten
rules (e. g., Lee et al., 2011) or with machine learning methods. Systems whose
core component is based on machine learning are often mention-pair systems
using an extension of a basic set of 12 features originally proposed by Soon
et al. (2001).
In many of the experiments contained in this thesis, we use the coreference system BART (Versley et al., 2008). BART is easily extensible and very
modular, which makes it an excellent platform for our experimental work.
Our version of BART is based on an official version released in 2010. Since
then, the development of the coreference resolution system has continued,
and it is likely that many features of our version do not correspond exactly
to more recent releases of BART. Therefore, results involving coreference resolution that we present in this thesis should not be construed as reflecting the
performance of current versions of BART.
Our version of BART has a mention-pair decoder with a set of features
based mostly on the elementary feature set of Soon et al. (2001) and later
work by Uryupina (2006). It has a mention detection pipeline that uses the
Morpha morphological analyser (Minnen et al., 2001), the Berkeley parser
(Petrov et al., 2006; Petrov and Klein, 2007) and the Stanford named entity
recogniser (Finkel et al., 2005). In the actual prediction component, the sentence containing the anaphoric pronoun and a limited number of sentences
immediately preceding it are searched for markables that can serve as potential antecedents for the anaphor. Among these markables, the most probable
candidate is selected with a maximum entropy ranker and returned.
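Schematically, this selection step can be pictured as follows; the sentence objects with their markables attribute, the feature extractor and the ranker are placeholders for BART's actual components rather than its real interface:

    def resolve_pronoun(pronoun, context_sentences, ranker, extract_features,
                        window=3):
        # Search the current sentence and a limited number of preceding
        # sentences for candidate antecedents and return the markable
        # ranked most probable by the maximum entropy model.
        candidates = []
        for sentence in context_sentences[-(window + 1):]:
            candidates.extend(sentence.markables)
        if not candidates:
            return None
        return max(candidates,
                   key=lambda m: ranker.score(extract_features(pronoun, m)))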
6.2 Translating Pronominal Anaphora
When translating a discourse containing pronouns into another language, an
MT system must decide how to render the input pronouns adequately in the
target language. The choices that must be made for pronouns are potentially
more difficult than when translating content words. To begin with, it is not
even clear that every pronoun in the input should be translated into a corresponding pronoun in the translation. Mitkov and Barbu (2003) compare pronoun use in the French translations of three technical texts written in English with pronoun use in the original texts. In their sample, the French translations contain almost 40 % more pronouns (390 instead of 281). For 241 pronouns, there
is a 1 : 1 correspondence between the languages, but 40 English pronouns
and a staggering 159 French pronouns have either no direct correspondence
or a corresponding full NP in the other language. Generalising these figures
is problematic because the sample is small and it covers a very specific text
type and only a single language pair and translation direction; furthermore,
it is not known if the translations were created by the same or by different
translators. In any case, the study clearly demonstrates that cross-lingual
differences in pronoun use are by no means a marginal phenomenon.
For content words, MT systems usually assume that each item in the source
text should be mapped into an equivalent item in the target language, possibly
as an element of a multi-word phrase or idiom. Since suppression of content
words would very likely entail a loss of information in the translation, this is
a reasonable assumption to make for the literal translation style typical of MT,
even though a human translator might sometimes opt for a less literal rendering of the input as a result of functional or pragmatic considerations. The use
of pronouns, in contrast, is much more dependent on the linguistic structure
and conventions of the target language, and it is by no means evident that an
anaphoric pronoun should always be translated with an anaphoric pronoun
even if fairly literal translation is sought for. For instance, when translating
into languages like Italian or Spanish which do not require overt subject pronouns, English subject pronouns must be left out systematically to create a
natural-sounding target text.
This is only one part of the problem, however. Even in the typical case,
when an input pronoun is translated into a corresponding target language
pronoun, complications arise because many languages require agreement
between the pronoun and its antecedent. The agreement relation must be
enforced in the target language by considering the relevant features such as
gender and number of the translation of the antecedent. Source language
information found in the input is not sufficient alone to choose the correct
pronoun. This is demonstrated by the following (contrived) example:
(6.2) a. The funeral of the Queen Mother will take place on Friday. It will
be broadcast live.
b. Les funérailles de la reine-mère auront lieu vendredi. Elles seront
retransmises en direct.
Here, the English antecedent, the funeral of the Queen Mother, requires a singular form for the anaphoric pronoun it. The French translation of the antecedent, les funérailles de la reine-mère, is feminine plural, so the corresponding anaphoric pronoun, elles, must be a feminine plural form too. Additionally, the French verbs are marked for plural in both sentences although the
English verbs are singular forms. Consider, however, that the translator could
have chosen to translate the word funeral with the perfectly correct French
word enterrement ‘burial’ instead:
(6.3) L’enterrement de la reine-mère aura lieu vendredi. Il sera retransmis
en direct.
Now, the antecedent NP is rendered as a masculine singular and correspondingly requires a masculine singular anaphoric pronoun and singular verb
forms.
Importantly, there is nothing in the English source text to predict the
gender of either the antecedent or the pronoun. English words do not have
grammatical gender, but even if they did, it would not necessarily be predictive of gender in another language. Number marking will often be consistent across languages because it is more tightly knit to circumstances in
the real world. Nonetheless, examples (6.2) and (6.3) show that discrepancies are possible for this feature as well. The only reliable predictor of the
morphological features of a translated anaphoric pronoun is the translation
of the antecedent, which, as the example illustrates, is to some extent at the
discretion of the translator, or the MT system.
Anaphora is a very common phenomenon found in almost all kinds of
texts. The anaphoric link can be local to the sentence, or it can cross sentence boundaries. In the first case, pronoun agreement may be dealt with
correctly by the local dependencies of the SMT language model, but this becomes increasingly unlikely as the distance between the referring pronoun
and its antecedent increases. The second, non-local case is not handled by
standard SMT models at all. It is worth pointing out that a pronoun may
well be translated correctly even without the benefit of a specific anaphora
model because the SMT system easily learns an unconditional distribution
over the pronouns in the training sets. In example (6.1), the plural pronoun
them would probably be rendered as sie by a naïve English–German SMT
system, which is very likely to be a good choice. When translating into a
language with gender-marked plural pronouns, however, selecting the right
pronoun is more difficult.
6.3 A Study of Pronoun Translations in MT Output
To show that pronominal anaphora is indeed a problem for SMT, we study the
performance of one of our SMT systems on personal pronouns. The sample
examined in our case study is drawn from the German–English corpus used
as a test set for the MT shared task at the EACL 2009 Workshop on Machine
Translation (Callison-Burch et al., 2009). The test set is composed of 111 newswire documents from various sources in German and English translations. In
the selected subset of 13 documents (219 sentences) we have identified all
cases of pronominal anaphora that could be resolved in the text. One of the
documents does not contain any such cases. For each anaphoric pronoun in
the German source text, we manually check whether or not it was translated
into English in an appropriate way by the phrase-based SMT system we submitted to the WMT 2010 shared task (Hardmeier et al., 2010). The system
uses 6-gram language models, allowing it to consider a relatively large local
context in translation, but it does not contain any specific components to
process sentence-wide or cross-sentence context.
In this sample, the MT system finds a suitable translation for anaphoric pronouns in about 61 % of the cases (Table 6.1). How well it performs is strongly
dependent on the type of pronoun: While it produces adequate output for
around 90 % of the demonstrative pronouns (dieser, dieses, etc.) and about
3 out of 4 masculine or neuter singular pronouns or plural pronouns, only
a third of the feminine pronouns are translated correctly. For pronouns of
polite address and reflexive pronouns, the system largely fails.
The reasons for these discrepancies can most likely be found in the differences between the pronominal systems of the source and the target languages.
The English system of pronouns distinguishes between human (he, she) and
non-human (it) referents in the singular. A gender distinction is made only
for humans. The German nominal system has three grammatical genders,
which do not correspond directly to biological sex and apply also to inanimate objects. They are distinguished in the singular forms of the pronouns.
Moreover, some German pronouns are highly ambiguous. Thus, the pronoun sie can be the form of the feminine singular, of the plural of any gender
or, when spelt Sie with an uppercase initial letter, of the polite form of address,
which is usually translated into an English second person you. The reflexive
pronoun sich is used for all genders and both numbers in the third person; it
frequently has no direct equivalent in the English sentence. In these ambiguous cases, the language model will try to disambiguate based on parts of the
context that were seen during training. If the local context is truly ambiguous, the results of the disambiguation will be essentially random. Generally,
the system will prefer the forms that were observed most frequently at training time. For instance, given the pronoun distribution in typical corpora of
newswire text and political speeches, it will tend to translate sie as a plural
pronoun even when it is a feminine singular in reality. As a result of these
factors, pronoun translation accuracy varies greatly from document to document according to the number and types of pronouns that occur.
Even though translation mistakes due to wrong pronoun choice generally
do not affect important content words, they can make the MT output hard to
understand, as in the following example from document 3 of our sample:
(6.4) a. Input: Der Strafgerichtshof in Truro erfuhr, dass er seine Stieftochter Stephanie Randle regelmässig fesselte, als sie zwischen fünf und
sieben Jahre als [recte: alt] war.
b. Reference translation: Truro Crown Court heard he regularly tied
up his step-daughter Stephanie Randle, when she was aged between
five and seven.
c. MT output: The Criminal Court in Truro was told it was his Stieftochter Stephanie Randle tied as they regularly between five and seven
years. (newstest2009)
There are several things wrong with this MT output, and bad pronoun choice
is clearly one of them, with the pronoun er referring to a male person translated as it and the pronoun sie referring to a female person translated as they.
To sum up, there is evidence that current phrase-based SMT cannot handle
pronoun choice adequately. Although our case study is limited to a single language pair and a single text genre, considering the models used in SMT, there
is no reason to suppose that the situation should be very different in other
cases. Stronger differences in pronoun systems and text with longer, more
complex sentences are likely to exacerbate the difficulties, whereas the problem will be easier to solve when the languages are close and the sentences
are simple and match the training corpus closely.
6.4 Challenges for Pronoun Translation
The results of the case study in the previous section indicate that better handling of pronominal anaphora may lead to observable improvements in translation quality. However, the attempts at explicit pronoun modelling for SMT
reported in the literature (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2011; Hardmeier et al., 2013b) suggest that the problem
is harder than it seems. Pronoun translation is a complex task, and solving
it correctly requires a number of steps, including identification of anaphoric
pronouns, correct translation of the parts of the discourse containing the antecedents, recognition of the anaphoric link to the right antecedent, extraction of relevant features from the antecedent, generation of the correct pronoun and its embedding in a correct translation of its context. Each of these steps is in itself non-trivial, and there is a substantial risk that noise introduced by errors in each part of the task accumulates and eradicates all useful information in the chain.

Table 6.1. Correct translations and total number of German anaphoric pronouns (correct/total) in a subset of the WMT 2009 test set.

Document          masc. sg.  fem. sg.  neuter sg.  plural  polite addr.  reflexive  demonstr.  pron.+prep.   total     %
 1  Aktualne.cz    1/ 1      –/ 1      –/ 1        –/ 1    –/–           –/ 2       1/ 1       –/2           2/  9    22 %
 2  Spiegel        –/ –      –/ –      –/ –        –/ –    –/–           –/ –       –/ –       –/–           –/  –     –
 3  BBC            5/ 8      6/23      1/ 2        –/ –    1/4           –/ 4       2/ 2       –/–          15/ 43    35 %
 4  BBC            9/11      1/ 2      2/ 2        –/ –    –/–           –/ –       –/ –       1/1          13/ 16    81 %
 5  Times          1/ 3      2/ 2      –/ –        7/10    –/–           –/ –       1/ 1       –/–          11/ 16    69 %
 6  ABC.es         7/13      –/ 1      1/ 1        3/ 3    –/–           1/ 1       2/ 3       –/–          14/ 22    64 %
 7  El Mundo       4/ 5      2/ 3      8/ 8        –/ –    –/–           –/ –       4/ 4       –/–          18/ 20    90 %
 8  Les Echos      2/ 3      –/ –      –/ –        –/ –    –/–           –/ –       –/ –       –/–           2/  3    67 %
 9  Le Devoir     16/19      2/ 8      4/ 4        2/ 2    –/–           1/ 2       3/ 3       –/–          28/ 38    74 %
10  hvg.hu         2/ 2      –/ –      1/ 4        4/ 4    –/–           1/ 2       2/ 2       –/–          10/ 14    71 %
11  nemzet.hu      –/ –      1/ 6      –/ –        –/ –    –/–           –/ –       –/ –       –/–           1/  6    17 %
12  Adnkronos      –/ –      2/ 2      –/ –        1/ 2    –/–           –/ –       2/ 3       –/–           5/  7    71 %
13  Corriere       2/ 3      –/ 1      –/ –        –/ –    –/–           –/ –       1/ 1       –/–           3/  5    60 %
total             49/68     16/49     17/22       17/22    1/4           3/11      18/20       1/3         122/199    61 %
                  72 %      33 %      77 %        77 %    25 %          27 %       90 %       33 %          61 %
Guillou (2012) discusses a number of reasons for the disappointing performance of SMT systems with anaphora handling. In particular, she identifies four main sources of error:
1. Identification of anaphoric vs. non-anaphoric pronouns,
2. Anaphora resolution,
3. Identification of the head of the antecedent noun phrases, from which
gender and number features are extracted,
4. Word and phrase alignment between source and target text.
While we largely agree with Guillou’s analysis of these problems, we believe
that the list should be extended. We have identified six principal factors that
present risks to pronoun-aware SMT systems and may help to explain the
failure of existing research to find solutions:
1. Baseline SMT performance,
2. Anaphora resolution performance,
3. Performance of other external components,
4. Inadequate evaluation,
5. Error propagation, and
6. Model deficiencies.
The sources of error listed by Guillou (2012) can be subsumed under these
headings. In the following sections, we examine these challenges in more
detail, beginning with risks external to the pronoun translation approaches
proper and continuing with deficiencies inherent in the methods that were
tested in the literature. From this discussion, we derive the insights that
shaped the key features of our recent work presented in the later chapters
of this thesis (Chapters 8 and 9; Hardmeier et al., 2013b).
6.4.1 Baseline SMT Performance
Models for anaphoric pronouns target a very specific linguistic phenomenon
by manipulating a small number of words in the output text. This can only be
successful if the translation as a whole is reasonably good; no pronoun translation model will achieve significant improvements if what the underlying
SMT system outputs without its help is mostly gibberish. It is well known
that some language pairs are much more difficult for SMT than others, for
instance because of word order differences or complex target language morphology. In other cases, out-of-vocabulary words in the input text may make
the translation unreliable. When this happens, there is not much that a pronoun model can do to improve the translation because it is too specifically
focused on a single phenomenon.
In our English–German system (Hardmeier and Federico, 2010), we experienced insufficient baseline performance as a major problem. Similarly, Guillou (2011) remarks that “[o]ne of the major difficulties that [human evaluators] encountered during the evaluation was in connection with evaluating
the translation of pronouns in sentences which exhibit poor syntactic structure.” This suggests that, at least in some cases, the translations output by
her English–Czech MT system were so poor as to render pronoun-specific
evaluation essentially meaningless.
By contrast, the output of state-of-the-art English–French SMT systems
is to a large extent intelligible if not perfect. It sometimes happens that the
SMT system garbles the syntax of a sentence, such as in the following examples, where the words of the input sentence are reordered in a manner
that completely distorts the meaning of the sentences:
(6.5) a. Input: We don’t have stewardesses, we’ve been against it from the
very beginning.
b. MT output: Nous n’avons pas, nous avons été hôtesses contre elle
dès le début. (newstest2009)
(6.6) a. Input: And this time, Hurston’s old neighbors saw her as a savior.
b. MT output: Et cette fois, l’ancienne Hurston voisins a vu son comme
un sauveur. (newstest2009)
In comparison to other language pairs, these cases are fairly rare, however,
and it is reasonable to assume that this was the case also for the anaphora-sensitive English–French systems described in the literature (Le Nagard and Koehn, 2010; Hardmeier et al., 2011). Generally, there is little that researchers interested in anaphora can do about this problem except work on an easier language pair while waiting for general SMT research to progress.
6.4.2 Anaphora Resolution Performance
Any MT system that attempts to model pronominal anaphora explicitly must
identify anaphoric links in the input in some way, be it by running a separate anaphora resolution component (Le Nagard and Koehn, 2010; Hardmeier
and Federico, 2010), by performing anaphora resolution jointly together with
pronoun prediction (Hardmeier et al., 2013b) or by relying on manual gold-standard annotations (Guillou, 2011). When many anaphoric links are resolved incorrectly, a model may degrade performance on average rather than improve it. To see why, consider that an SMT system with no explicit anaphora handling component will not emit pronouns randomly; rather, the system is likely to have a preference for the pronouns that are most frequent in
the training corpus. If the test set is homogeneous with the training data, this
may very well be the correct choice in many cases.
As an example, the SMT system used in the pronoun translation corpus study
described above (Section 6.3; Hardmeier et al., 2010) has a strong preference
for translating the ambiguous German pronoun sie as they or them rather
than she or her. In consequence, pronoun translation errors are very frequent in documents whose main character is female, whereas many other
documents are hardly affected. Clearly, this is a problem not only from a technical, but also from a gender-political point of view (Gendered Innovations,
2014). Overall, anaphora resolution is a difficult task in itself, and inadequate
performance of the coreference resolver has been advanced as an explanation
for disappointing experimental results in at least one study (Le Nagard and
Koehn, 2010).
Pronouns are notoriously difficult for anaphora resolution systems to resolve correctly when they do not refer to a noun phrase. On the one hand,
this applies to expletive pronouns such as it in it is raining, which are not
used anaphorically at all. Detecting expletives automatically is a hard problem. Le Nagard and Koehn (2010) implement a rule-based system for this task
(Paice and Husk, 1987), which performs surprisingly well for them at a precision and recall of 83 %; however, the same system has been shown to perform
considerably worse on different corpus data (Evans, 2001). One of the best
systems currently available, achieving high accuracy on a variety of test sets,
is the one by Bergsma and Yarowsky (2011).
Low recall for expletive classification means that a substantial part of the
expletive pronouns in a text will be incorrectly linked to an antecedent. As
an example, consider the following two sentences, where the version of the
BART coreference resolution system used by Hardmeier et al. (2011) incorrectly links the non-referring pronoun It in the second sentence to the word
it in the first and creates a coreference chain price – it – It:
(6.7) Napi’s basket suggested that this latter was a near impossibility, since
we found that the price was up by just a shade over 10 percent on last
year’s quite high base price, even where it was most expensive. It does
appear, though, that flour suppliers are in a stronger position than egg
producers, for they have managed to force their drastic price increases
onto the multinationals. (news-test2008)
On the other hand, pronouns may refer to an event expressed by a verb
phrase rather than to a noun phrase, as in the following example:
(6.8) He made a scandal out of it when the Prefecture ordered the dissolution
of the municipal council. (newstest2009)
This type of coreference is handled less consistently by current coreference
resolution systems (Pradhan et al., 2011), so pronouns with event anaphora
will often be resolved incorrectly as referring to a noun phrase. At the same
time, both expletives and event anaphora may be relatively easy for a naïve
SMT system to get right, since they are generally rendered with a small set of
common pronouns such as it in English or il, ça, cela in French. In such cases,
incorrect anaphora resolution greatly increases the risk of mistranslation.
6.4.3 Performance of Other External Components
Recognising and resolving pronominal anaphora in a document and transferring it into another language requires analysis at a relatively high level
of linguistic abstraction. Depending on the architecture of a specific system, a variety of external components may be used to perform certain steps
of this analysis. In addition to the potentially quite complex preprocessing
pipelines of their coreference resolution systems, existing systems (Le Nagard
and Koehn, 2010; Hardmeier and Federico, 2010) rely on external resources
to identify morphological features of potential antecedents and to align the
words of the source language to those of the target language. While these are
well-researched NLP tasks and good tools exist, their accuracy is not perfect,
and all errors will add to the level of noise present in the total system.
Tools for morphological analysis are language-specific and will not be
available for all languages in the same quality. Even for a language like French
that may well have one of the best collections of NLP tools after English, it
turns out to be surprisingly difficult to obtain a reliable morphological analyser that works well on all text types. A number of systems have been developed, but not all of them are publicly available, and not all perform adequately
on the MT test corpora. Both Le Nagard and Koehn (2010) and Hardmeier
and Federico (2010) use the Lefff full-form lexicon (Sagot et al., 2006). This
is an excellent resource with wide, but obviously not perfect, coverage, and
as a pure lexicon resource it contains multiple analyses of some ambiguous
word forms. To words not listed in the dictionary, Hardmeier and Federico
(2010) apply a small number of rule-based heuristics that improve coverage
somewhat. Still, the quantity of words with no or an incorrect analysis is not
negligible, and these words may provoke translation errors.
Cross-lingual word alignment is an essential step in the SMT training process. The development of statistical alignment methods stands at the very
beginning of this research field (Brown et al., 1990, 1993). The success of
SMT relies strongly on the accuracy of these methods, but also on the tolerance of subsequent training steps to the errors they make. When training
translation models for phrase-based SMT, word-to-word alignments are used
as the basis for an elaborate heuristic phrase extraction procedure (Och and
Ney, 2004) that extracts all phrase pairs consistent with a word alignment
according to certain criteria. This method copes very effectively with word
alignment errors by reducing the influence of individual alignment links. Frequently, phrase pairs will be identified correctly even though some word in
them is not aligned, or incorrectly aligned. A pronoun translation model cannot have the same kind of tolerance because it must consider the alignment
links of individual words. What is more, pronouns, the very words such a model is interested in, may be particularly prone to erroneous alignment links.
Since they are very common and are not translated strictly literally in many
cases, they have fairly high translation probabilities to all kinds of words
in the word alignment models. As a result, linking them to other common
nearby words often increases the alignment score even if the correspondence is not motivated linguistically. In the worst case, they may get aligned
to a totally unrelated pronoun in the other language, so that the pronoun
translation model enforces an incorrect translation for that pronoun.
6.4.4 Inadequate Evaluation
It is widely recognised that automatic evaluation of pronoun translation is
difficult and existing methods are unreliable (Le Nagard and Koehn, 2010;
Hardmeier and Federico, 2010; Guillou, 2011). Popular MT evaluation metrics
such as BLEU (Papineni et al., 2002) score the MT output by comparing it to
one or more reference translations. This approach is fraught with problems.
Since it is completely unspecific and assigns the same weight to any overlap
with the reference, it is not particularly sensitive to the type of improvements
targeted by a pronoun translation component, which affect only a few words
in a text.
Hardmeier and Federico (2010) address this shortcoming by using a precision/recall-based measure counting the overlap of pronoun translations in
the MT output and a reference translation (see Chapter 7 for details). Whilst
increasing the sensitivity to pronoun changes, this measure retains another
serious drawback of a reference-based pronoun evaluation in that it judges
correctness by comparing the translation of a pronoun in the MT output with
the translation found in a reference translation and assumes that they should
be the same. However, this assumption is flawed: It does not necessarily hold
if the MT system selects a different translation for the antecedent of the pronoun. If this is the case, the only meaningful way to check the correctness of
a pronoun is by finding out whether it agrees with the antecedent selected
by the system, even if the translation of the antecedent may be incorrect.
As Guillou (2011) remarks, the usefulness of an evaluation method that
checks pronouns against a reference translation also depends on the number
of inflectional forms for pronouns in the target language. If pronouns are
inflected for a large number of features in a given language, the probability
of matching a pronoun exactly with a noisy system is very low even if many
of its features are generated correctly, and it becomes difficult to measure
progress before perfection is achieved.
More relevant conclusions about the quality of pronoun translation could
be drawn by examining how the MT output renders the coreference chains
found in the input and checking the pronouns referring to the same entity
for consistency. The main difficulty here is that this makes the evaluation
dependent on coreference annotations for the source language, leading to
unreliable evaluation results when there are errors in the annotation. This
evaluation strategy was adopted by Guillou (2011) and worked well for her
since she had gold-standard coreference annotations for her test set. In the
absence of gold-standard annotations, reliable evaluation of pronoun translations seems difficult or impossible. Coreference-annotated parallel corpora
like the Prague Czech–English Dependency Treebank (Hajič et al., 2006) and
the recently developed ParCor corpus containing data for English–French
and English–German (Guillou et al., 2014) are essential resources for sound
evaluation of pronoun translations.
6.4.5 Error Propagation
In the definition cited above (Section 6.1), anaphora is defined as “a relation
between two linguistic elements, in which the interpretation of one (called an
anaphor) is in some way determined by the interpretation of the other (called
an antecedent)” (Huang, 2004). This definition focuses on the linguistic realisation of the anaphor and the antecedent, and it views anaphora as a pairwise relation between exactly two linguistic elements. This focus is shared
by other definitions of the terms anaphora (Bussmann, 1996) and anaphor
(Trask, 1993). In the case of nominal coreference and pronominal anaphora,
it could be argued, however, that the immediate relation holds not between
two linguistic elements, but between a linguistic element and an entity in
the real world, or the representation of an entity in the reader’s or listener’s
mind, which was presumably evoked by one or more linguistic elements in
the preceding discourse. It could also be argued that the anaphoric relation
holds between the anaphor and the set of all linguistic elements referring to
the same entities.
The formal representation of anaphoric links in a computational system
must commit to one of these views. In coreference resolution, it is common to encode anaphoric links as coreference classes, defined as the set of
all mentions in a document referring to the same entity. The extratextual,
non-linguistic nature of the entities is emphasised by the definition of NP
coreference resolution as “the task of determining which NPs in a text or dialogue refer to the same real-world entity” (Ng, 2010). This is consistent with
the last of the definitions mentioned above. In many practical implementations, however, the anaphoric link is represented primarily as a pairwise
relation between two noun phrases in the text, a view more compatible with
the encyclopaedic definitions referred to first. These pairwise links are then
usually converted into coreference classes for evaluation.
In an anaphora model for SMT, it is often easier to deal with pairwise
anaphoric links than with entire coreference classes, especially if one of the
sentence-based decoding procedures described in Chapter 3 is applied. To
some extent, this is also justified because the morphological agreement relation, with which anaphora models for SMT are mostly concerned, holds
between the anaphoric pronoun and the most recent, or possibly most salient, mention in the text, not between the pronoun and an abstract concept.
In the existing literature, mention-pair representations of anaphoric links are
practically universal (Le Nagard and Koehn, 2010; Hardmeier and Federico,
2010; Guillou, 2011). Conditioning the translation decision for an anaphoric
pronoun on the translation of a single antecedent NP creates a risk of error
propagation. This is particularly relevant if a coreference chain consists of
a sequence of pronouns. If the SMT system, triggered by some other factor
such as the n-gram model, mistranslates one of the pronouns in the chain,
this error can easily be propagated to all later elements of the chain. This
problem could be addressed by processing the coreference links so that links
pointing to an antecedent that is a pronoun are transitively extended until
a full NP is reached, but even in this case, the presence of a single incorrect
link may lead to false resolution and, consequently, false pronoun choice.
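To make this transitive extension concrete, the following minimal sketch (our illustration, not code from any of the systems cited) flattens pairwise pronoun-to-pronoun links until a full-NP antecedent is reached; the mention identifiers and the is_pronoun predicate are hypothetical stand-ins.

```python
def flatten_links(links, is_pronoun):
    """Extend each anaphoric link transitively until the antecedent is
    a full NP rather than another pronoun.

    links: dict mapping each anaphor mention id to its antecedent id.
    is_pronoun: predicate telling whether a mention id is a pronoun.
    """
    flattened = {}
    for anaphor, antecedent in links.items():
        seen = {anaphor}
        # Follow the chain of pairwise links; stop at a full NP
        # or when a cycle (a resolution error) is detected.
        while is_pronoun(antecedent) and antecedent in links:
            if antecedent in seen:
                break  # defective circular chain; keep the last link
            seen.add(antecedent)
            antecedent = links[antecedent]
        flattened[anaphor] = antecedent
    return flattened

# Hypothetical mention ids: a chain it_2 -> it_1 -> "the hospital".
links = {'it_2': 'it_1', 'it_1': 'the hospital'}
print(flatten_links(links, is_pronoun=lambda m: m.startswith('it')))
# {'it_2': 'the hospital', 'it_1': 'the hospital'}
```

As noted above, even this flattened representation remains vulnerable: a single incorrect link anywhere in the chain still redirects every later pronoun to the wrong antecedent.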
6.4.6 Model Deficiencies
Le Nagard and Koehn (2010) claim that “[their] method works in principle” and would yield improvements were it not for the poor performance of the coreference resolution system, and
Hardmeier and Federico (2010) report minor improvements for the pronoun
it in a pronoun-specific automatic evaluation with their method. However,
later work suggests that both methods are in need of refinement before they can deliver consistently useful results: performance remains unconvincing even when using gold-standard coreference annotations (Guillou, 2011), and the small improvements that have been realised do not carry over to another language pair (Hardmeier et al., 2011).
An interesting observation made by both Guillou (2011) and Hardmeier
et al. (2011) is that SMT systems with explicit pronoun handling tend to generate more pronouns than required. The reason for this need not be the same
for both systems. In particular, in the English–Czech system, one difference
between the languages is that Czech, unlike English, allows subject pronouns
to be left out when the subject can be inferred from the context. The observed
overgeneration effect may result from a reduced tendency of the second-pass
system with its more focused pronoun translation distributions to drop pronouns, word removal being an event not explicitly accounted for in the standard phrase-based SMT model.
In the experiments by Hardmeier et al. (2011), anaphoric links are modelled
by a bigram language model predicting pronouns given gender and number
of the antecedent. The vocabulary of the predicted words is restricted to pronominal forms. Other words are treated as “out of vocabulary” by the model
and penalised harshly. This leads to a strong preference for translating every
single pronoun as a pronoun, even when this is not an adequate translation,
e. g., when the coreference system mistakenly resolved a non-referential pronoun by linking it to an antecedent.
In sum, the existing pronoun models for SMT are clearly less than perfect,
and pronoun overgeneration is a problem that has been observed repeatedly
with different models. To improve the models, the reasons for this behaviour
should be examined more closely. It may be necessary to design an explicit
model for dropping pronouns or translating them with non-pronouns. As
pointed out earlier, research on anaphora resolution has tended to focus on the prototypical case of anaphora with a nominal antecedent, and non-referential pronouns and event anaphora pose harder challenges to current systems. The same preference for prototypical problem
instances can be observed in research on SMT pronoun models; in SMT,
however, the less frequent, non-prototypical cases may in fact be easier to
handle for a naïve system since, at least for target languages like French or
German, agreement patterns are much less complex than for nominal antecedents. Consequently, there is a substantial risk of degrading performance
by adding a pronoun model that mishandles these very categories.
6.5 Conclusion
In the previous section, we gave an overview of the main challenges that
an SMT system with an explicit pronoun model is faced with. The analysis
we presented is a result of our earlier work on pronoun translation, some
of which we present in the following chapter. The insights gained from this
work have influenced our more recent work on pronouns, which will be the
topic of the remainder of this thesis. Let us therefore recapitulate the challenges discussed above and consider the design decisions we have made to
cope with them.
The first factor we mentioned is baseline performance, which means the
performance of all components of the SMT system except the ones we are
interested in. What we can do here is select our baseline system so as to
maximise the effect of the model we want to test. For pronoun translation, it
seems important to choose a language pair with very good SMT performance
as it is almost impossible to improve on an underperforming MT system with
a pronoun model. At the same time, it is important that there should be an
interesting difference in pronoun systems between the source and the target
language.
For us, baseline performance was the main reason to give up language
pairs such as German–English and German–French, which we studied in
earlier work. Even though these language pairs are very interesting from
the point of view of pronoun translation, the word order differences between
German on one side and English and French on the other, as well as the
relatively complex morphology of German, make it difficult to train good
phrase-based SMT systems. Instead, we concentrate our efforts on the language pair English–French. This is a combination of two major European
languages with plenty of resources. Both languages have very simple noun
morphology, and their word order is very similar. At the same time, there is
an interesting difference between the French third person pronouns, which
follow a two-gender system that conflates biological and grammatical gender
for both animate and inanimate entities, and the English pronouns, which are
marked for animacy but do not have gender features on inanimate pronouns.
Many of the difficulties related to coreference resolution, morphological
analysis, error propagation and pronoun modelling in general are addressed
in our work on pronoun prediction described in Chapter 8. Our design decisions are guided by the modelling assumptions outlined in Section 1.3. One
of the most important consequences of our early experiments is that we try
to reduce our dependence on external tools and integrate as much of the task
as possible into our own system. To the maximum extent possible, we avoid
pipeline architectures in favour of tightly integrated components. Thus, the
neural network classifier we present in Chapter 8 combines pronoun prediction with anaphora resolution in a single network. Tight coupling permits
us to preserve the uncertainty of the individual steps; rather than resolving a
pronoun to a single antecedent, we propagate a set of antecedent candidates
with an associated probability distribution to the next step. Doing so should
also reduce the risk of error propagation a little by minimising the effect of
uncertain decisions, even if it does not solve its root cause and coreference
chains are still modelled as sequences of pairwise links.
Uniting different parts of the task in one system allows us to train the entire
system in one go for a single training criterion that matches the objective of
pronoun prediction for which the classifier will finally be used, and it ensures
that all parts of the system are trained on the same type of training data. The
alternative would often be to train components such as anaphora resolution
systems or morphological analysers on out-of-domain data because annotated training data for the target domain may not be available. It has been
shown at least for word sense disambiguation that matching the training objectives and data sets of an SMT system and its ancillary components can be
essential for success (Carpuat and Wu, 2007). We suggest that this may be a
factor for pronoun translation too.
Evaluation is the problem we contribute least to in this work. In the next
chapter, we briefly discuss a pronoun-specific evaluation metric that is based
on precision and recall of pronoun translation, but it is still unsatisfactory and
suffers from many of the same weaknesses as the existing, general evaluation
measures. In Chapter 9, we present a method for annotating and evaluating
pronoun translations in SMT output, which allows us to analyse the performance of our own anaphora model. Parallel coreference-annotated data for the
English–French language pair has only been developed very recently (Guillou et al., 2014) and was unavailable for most of the work contained in this
thesis. In our experiments in Chapter 9, we use this new resource as a source
of reliable anaphora annotations for our model, but the development of better
evaluation measures must be left to future work.
7. A Word Dependency Model for Anaphoric
Pronouns
In this chapter, we describe some of our early results on pronominal anaphora translation. We present a simple document-level word dependency
model for the Moses decoder and its application to pronominal anaphora for
the language pair English–German (Hardmeier and Federico, 2010). It represents one of the earliest attempts to integrate knowledge about pronominal
anaphora into the standard, sentence-level tools of phrase-based SMT. The
initial publication of this work was one of the very first papers that addressed
the problem of pronominal anaphora in SMT (together with Le Nagard and
Koehn, 2010). We also introduce an evaluation metric that specifically measures the accuracy of pronoun translation and is more sensitive to the effects
of our anaphora models on the MT output than standard automatic MT evaluation measures such as BLEU (Papineni et al., 2002).
To enable discourse-level information processing for our word dependency model in a sentence-level SMT framework, we apply the sentence-to-sentence information propagation approach described in Section 3.4. Anaphoric links are modelled as directed dependencies between word pairs consisting of a pronoun and its closest antecedent. Links are identified with the
help of an external coreference resolution system. Our model assigns a probability to the translation of a pronoun given the translation of its antecedent.
It handles both sentence-internal and cross-sentence anaphora.
7.1 Anaphoric Links as Word Dependencies
In general, the decision about which translation to emit in the target language for a given source pronoun cannot be taken based on local information alone.
In many languages, pronouns show complex patterns of agreement, and selecting the correct word form requires dependencies on potentially remote
words. German possessive pronouns, for instance, agree in gender and number with the possessor (determining the choice between sein, ihr, etc.) and
in gender, number and case with the possessed object (with a paradigmatic
choice between, e. g., sein, seine, seines, etc., if the possessor is masculine singular). While the possessed object occurs in the same noun phrase as the
pronoun and agreement can, at least in simpler cases, be enforced by an n-gram language model, the possessor can occur anywhere in the text, even in
a different sentence. Since a given input word can be translated with different words in the target language and the pronoun must agree with the word that was actually chosen, correct pronoun choice depends on a translation decision taken earlier by the MT system. Our model extends the SMT decoder with the capacity to handle dependencies between the translations of words regardless of their distance in the input. The relevant word pairs are identified by an external anaphora resolver, and the objective of the model is to promote morphological agreement between anaphoric pronouns and their antecedents.

[The same hospital]1 had had to contend with a similar infection early this year. [It]2→1 had discharged a patient admitted after a serious traffic accident. Shortly afterward, [it]3→2 had to re-admit the patient because of an MRSA infection, and [doctors]4 have been unable to perform surgery that would be vital to full recovery because [they]5→4 have been unable to get rid of the staph.

The same hospital had had to contend with a similar infection early this year .
It|*->neut_sg had discharged a patient admitted after a serious traffic accident .
Shortly afterward , it|*->neut_sg had to re-admit the patient because of an MRSA infection , and doctors|1-* have been unable to perform surgery that would be vital to full recovery because they|*-1 have been unable to get rid of the staph .

Figure 7.1. Coreference link annotation and decoder input
We use the open-source coreference resolution system BART (Versley et al.,
2008) to link pronouns to their antecedents in the text. The coreference resolution system was trained on the ACE02-npaper corpus (Mitchell et al., 2003)
and uses separate models for pronouns and non-pronouns in order to increase
pronoun-resolution performance. For each resolvable pronoun, the system
finds a link to an antecedent NP. Exactly one NP per pronoun is found, and
it is the closest NP preceding the pronoun that the anaphora resolver considers as coreferent with the pronoun. Our word dependency model handles
links between pairs of individual words, not syntactic phrases, so we identify
the syntactic head of the antecedent NP with the Collins head finder (Collins,
1999) and represent the anaphoric relation as a link between the anaphoric
pronoun and the syntactic head word of its antecedent NP. The output of the
coreference resolver is illustrated in the upper part of Fig. 7.1. Markable NPs
are enclosed in square brackets and their syntactic heads are highlighted in
bold face. After identifying direct anaphoric links, the coreference resolution
system proceeds to cluster mentions into coreference chains, but we do not
use this information in our experiments.
We integrate coreference information into an SMT system based on the
phrase-based Moses decoder (Koehn et al., 2007) in the form of a new model
which represents dependencies between pairs of target-language words produced by the MT system. The decoder driver encodes the links found by the
coreference resolver in the input passed to the SMT decoder. Pronouns and
their antecedents are marked as illustrated in the lower half of Fig. 7.1. Each
token is annotated with a pair of elements. The first part numbers the antecedents to which there is a reference in the same sentence. The second part
contains the number of the sentence-internal antecedent to which this word
refers, or a representation of the relevant features of the word itself, if it occurred in a previous sentence. Each part can be empty, in which case it is
filled with an asterisk.
To reduce vocabulary size and data sparseness, we map the antecedent
words to a tag representing their gender and number. In the example, the
word hospital in the first sentence, which is translated by the system into
the neuter singular word Krankenhaus (not shown), gets mapped to the tag
neut_sg in the input for sentence 2. Gender and number of German words
were annotated using the RFTagger (Schmid and Laws, 2008). The representation of the pronouns, by contrast, is fully lexicalised.
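The serialisation just described can be illustrated with a small sketch. The helper below is ours, and the exact formatting details are inferred from the annotations in Fig. 7.1 rather than taken from the decoder driver's actual code.

```python
def annotate_token(token, antecedent_ids=(), ref=None, prev_sentence=False):
    """Serialise one decoder input token in the format of Fig. 7.1.
    The first slot numbers this word if it is an antecedent referred to
    within the same sentence; the second slot names the antecedent this
    word refers to, either a sentence-internal number or, for antecedents
    in previous sentences, a gender/number tag prefixed with '>'."""
    first = ",".join(str(i) for i in antecedent_ids) if antecedent_ids else "*"
    if ref is None:
        second = "*"
    else:
        second = (">%s" % ref) if prev_sentence else str(ref)
    if first == "*" and second == "*":
        return token  # unannotated tokens are left unchanged
    return "%s|%s-%s" % (token, first, second)

# Examples reproducing the annotations in Fig. 7.1:
print(annotate_token("doctors", antecedent_ids=[1]))            # doctors|1-*
print(annotate_token("they", ref=1))                            # they|*-1
print(annotate_token("it", ref="neut_sg", prev_sentence=True))  # it|*->neut_sg
```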
7.2 The Word Dependency Model
The word dependency module is integrated as an additional feature function
in a standard SMT model (Eq. 3.1). It keeps track of pairs of source words (s_ant, s_pron) participating as antecedent and anaphor in a coreference link. Usually, the antecedent s_ant will be processed first; however, it is also possible for the anaphor s_pron to be encountered first, either because of a cataphoric link in the source sentence or, more likely, because of word reordering during decoding. When the second element of an antecedent–anaphor pair is translated, the word dependency module adds a score of the following form:

    p(T_{pron} \mid T_{ant}) = \max_{(t_{pron},\, t_{ant}) \in T_{pron} \times T_{ant}} p(t_{pron} \mid t_{ant})    (7.1)

where T_pron is the set of target words aligned to the source word s_pron and T_ant is the set of target words aligned to the source word s_ant in the decoder output.
Word alignments between decoder input and decoder output are constructed
based on the phrase-internal word alignments computed during SMT system
training.
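As an illustration of Eq. 7.1, the following sketch computes the word dependency score for one antecedent–anaphor pair; the probability table and the floor value standing in for the unknown-word penalty are our assumptions, not the actual implementation.

```python
def word_dependency_score(pron_alignments, ant_alignments, prob, floor=1e-6):
    """Eq. 7.1: the best probability p(t_pron | t_ant) over all pairs of
    target words aligned to the source pronoun and to its antecedent.
    `prob` maps (t_ant, t_pron) pairs to smoothed probabilities; `floor`
    stands in for the penalty assigned to unseen pairs."""
    scores = [prob.get((t_ant, t_pron), floor)
              for t_pron in pron_alignments
              for t_ant in ant_alignments]
    return max(scores) if scores else floor

# Example: the pronoun is aligned to {'sie'}, the antecedent to {'Regierung'}.
prob = {('Regierung', 'sie'): 0.62, ('Regierung', 'es'): 0.05}
print(word_dependency_score({'sie'}, {'Regierung'}, prob))  # 0.62
```

Taking the maximum over the aligned word pairs means the model rewards a hypothesis as soon as any aligned pronoun form agrees with any aligned antecedent word.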
Coreference links across sentence boundaries are handled by the decoder
driver module of Section 3.4. It reads the decoder output and extracts the
required information about antecedents occurring in previous sentences, encoding it in the input of the sentence containing the reference as described
above. In the cross-sentence case, the antecedent is not marked in the decoder input, but once it has been translated, its translation is silently extracted from the output, and the anaphor token is decorated directly with the
gender/number tag corresponding to the extracted word form. Cataphoric
links across sentence boundaries are not handled by the model.
In the DP search algorithm of a standard phrase-based SMT decoder, two
search paths can be recombined if one of them is provably superior to the
other under every possible continuation of the search (see Section 3.2). Since
our model introduces dependencies that can span large parts of the sentence,
care must be taken not to recombine hypotheses that could be ranked differently after including the word dependency scores. We therefore extend
the decoder search state to include, on the one hand, the set of antecedents
already processed and, on the other hand, the set of anaphors encountered
for which no antecedent has been seen yet. In either case, the translation
chosen by the decoder is stored along with the item. Hypotheses can only be
recombined if both of these sets match.
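A minimal sketch of what such an extended search state could look like, assuming a deliberately simplified hypothesis representation; the actual Moses data structures differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorefState:
    """Extra search state for the word dependency model: translations
    chosen for antecedents seen so far, and anaphors still waiting for
    their antecedent, both as frozen sets of (source_id, translation)."""
    antecedents: frozenset = frozenset()
    pending_anaphors: frozenset = frozenset()

def recombinable(a: CorefState, b: CorefState) -> bool:
    # Hypotheses may only be recombined if both components agree;
    # otherwise future word dependency scores could rank them differently.
    return a == b
```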
Training our word dependency model requires estimating the conditional probability distribution p(t_pron | t_ant) in Eq. 7.1. We do so by computing relative frequencies in a training corpus and applying standard language model
smoothing methods. Training examples are extracted from a parallel corpus in a way similar to the application of the model: The source-language
part of a word-aligned parallel corpus is annotated for coreference with the
BART software, then the antecedent and anaphor words are projected into
the target language using the word alignments and the corresponding pairs
of target-language antecedent and anaphor words are used as training examples. Apart from removing the need for an anaphora resolution system for
the target language, using the source language system for both the training
and testing stage has the advantage of greater consistency, but training the
model directly on coreference pairs extracted in the target language would
be a plausible alternative.
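The extraction of training examples can be sketched as follows, assuming coreference links and alignments are given as word-index pairs; this is an illustration of the procedure, not the code used in our experiments.

```python
def extract_training_pairs(coref_links, alignment):
    """Project source-language (antecedent, anaphor) word positions into
    the target language through the word alignment and return the
    corresponding target-word pairs as training examples.

    coref_links: iterable of (ant_src_idx, pron_src_idx) pairs.
    alignment: set of (src_idx, tgt_idx) alignment links.
    """
    tgt_of = {}
    for s, t in alignment:
        tgt_of.setdefault(s, []).append(t)
    pairs = []
    for ant, pron in coref_links:
        # An unaligned word contributes no training examples.
        for t_ant in tgt_of.get(ant, []):
            for t_pron in tgt_of.get(pron, []):
                pairs.append((t_ant, t_pron))
    return pairs

# One link 'hospital' (index 0) <- 'it' (index 4), aligned to target
# positions 1 and 3 respectively:
print(extract_training_pairs([(0, 4)], {(0, 1), (4, 3)}))  # [(1, 3)]
```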
Our model is trained on version 10 of the News commentary corpus from
the training data for the WMT 2010 shared task. The estimated probabilities
are smoothed using the Witten-Bell method (Witten and Bell, 1991). This
smoothing method does not make prior assumptions about the distribution
of n-grams in a text. It is therefore more suited for estimating the probabilities
of events not drawn directly as n-grams from a text than the improved Kneser-Ney method (Chen and Goodman, 1998) we use for smoothing our other n-gram models.
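For illustration, the interpolated Witten-Bell estimate can be computed as below; this is a sketch of the textbook formulation with a unigram lower-order model, not the exact estimator implemented in the language modelling toolkit we use.

```python
from collections import Counter, defaultdict

def witten_bell(pairs):
    """Interpolated Witten-Bell estimates p(pron | tag) from a list of
    (tag, pronoun) training pairs, with the unigram distribution over
    pronouns as the lower-order model."""
    bigram = Counter(pairs)
    context = Counter(tag for tag, _ in pairs)
    unigram = Counter(pron for _, pron in pairs)
    types = defaultdict(set)  # distinct pronoun types seen after each tag
    for tag, pron in pairs:
        types[tag].add(pron)
    total = len(pairs)

    def p(pron, tag):
        c, t = context[tag], len(types[tag])
        lower = unigram[pron] / total
        if c == 0:
            return lower  # unseen context: back off to the unigram
        # Seen events share mass c/(c+T); novel events get mass T/(c+T).
        return (bigram[(tag, pron)] + t * lower) / (c + t)

    return p

p = witten_bell([('neut_sg', 'es'), ('neut_sg', 'es'), ('neut_sg', 'er')])
print(round(p('es', 'neut_sg'), 3))  # 0.667
```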
7.3 Evaluating Pronoun Translation
Assessing the quality of pronoun translation in SMT output with standard
MT evaluation methods is problematic for several reasons. All widely used
automatic evaluation metrics for MT measure the similarity between a candidate translation and one or more reference translations. The quality of a
candidate translation is assumed to correlate with its similarity to the reference translations. Regardless of how similarity is defined, this can be no more
than an approximation because any source text generally admits of a large
variety of translations into a given target language.
The most popular automatic MT evaluation metric is certainly the BLEU
score (Papineni et al., 2002). It measures the similarity between a candidate
translation and a set of reference translations by looking at n-grams, usually
of length up to 4 words, and counting how large a proportion of the n-grams
in the candidate translation are found in the references too. When computing this n-gram precision quantity, BLEU uses clipped n-gram counts for the
candidate translation. Clipping the counts means that every n-gram in the
candidate translation is counted at most as often as the same n-gram occurs
in a single reference translation. It makes sure that the MT system cannot
inflate its score artificially by generating a great number of very common
words that are likely to occur in many references. This is the formal definition of the clipped counts, with c_C(N) being the count of n-gram N in the candidate translation and c_R(N) its count in a reference translation R:

    c_{clip}(N) = \min\bigl(c_C(N),\, \max_R c_R(N)\bigr)    (7.2)
Precision is calculated by summing up the clipped counts of all n-grams in
the candidate translation and dividing by the total number of n-grams in the
candidate. This quantity is multiplied by a brevity penalty that ensures that
the MT system cannot optimise precision by suppressing all words it is not
confident about. Essentially, the brevity penalty replaces a measure of recall.
It is used because it is not straightforward to define recall when there are
multiple reference translations.
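A small sketch of the clipped n-gram precision of Eq. 7.2 for a single n-gram order; the brevity penalty and the geometric mean over orders used by full BLEU are omitted here.

```python
from collections import Counter

def clipped_precision(candidate, references, n=1):
    """Modified n-gram precision with clipped counts (Eq. 7.2): each
    candidate n-gram counts at most as often as it appears in the most
    generous single reference translation."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

# The classic pathological case: repeating a common word buys nothing.
print(clipped_precision("the the the".split(), ["the cat".split()]))  # 0.333...
```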
For the evaluation of pronoun translation, BLEU has several important
drawbacks. One of them is its total lack of specificity. BLEU assigns the
same weight to any type of token, content word, function word, pronoun,
verb, conjunction and punctuation mark alike. We are specifically interested
in pronouns, but the BLEU score conflates pronouns with all kinds of other
words and gives us a figure that may have little to do with what we actually
want to measure.
Another limitation of BLEU is that it does not check whether an n-gram
in the candidate translation actually corresponds to the n-gram it is matched
with in the reference translation. In the case of content words, this may work
well enough. If both the candidate translation and a reference translation
contain the same highly informative and relatively rare word, the chances
that they correspond to each other are fairly good. For common function
words such as pronouns, however, the assumption breaks down. The fact
that two translations both contain the word it, or and, or a comma, says
little about their resemblance, unless the sentences are very short.
Finally, there is an even more serious issue with a similarity score like
BLEU that makes it unsuitable for evaluating pronoun translation. BLEU assumes that any overlap of the candidate translation with the reference translation is a sign of good quality, whereas any difference indicates poor quality. However, an anaphoric pronoun is correct only if it agrees with its antecedent. If the candidate translation renders the antecedent with an expression that does not match the reference, then the pronoun may have to be different, and the pronoun of the reference translation may in fact be incorrect. If, say, an antecedent that is masculine in the reference translation is rendered with a feminine NP in the candidate, a simple similarity score will behave inconsistently: it will assign a higher score to a translation referring to the feminine antecedent with a masculine pronoun than to one with the correct feminine pronoun, because the latter is penalised for two mismatches with the reference translation instead of one, despite being more grammatical.
We now present a simple method to measure the accuracy of pronoun
translations more directly. Compared to BLEU, our method addresses the
first two of the issues mentioned above by focusing specifically on pronouns,
ignoring other word classes, and by using word alignments to keep track of
the role of pronouns in a sentence to avoid conflating unrelated items as BLEU
does. Like BLEU, however, it matches the translations of pronouns against a
reference translation and does not solve the last problem we discussed.
We use a test corpus with a single reference translation. We construct word
alignments for the candidate translation and the reference translation by concatenating them with additional parallel training data, running the GIZA++
word aligner (Och and Ney, 2003) in both directions and symmetrising the
alignments as is usually done for SMT system training. We also produce
word alignments between the source text and the candidate translation by
considering the phrase-internal word alignments stored in the phrase table.
The basic idea of our metric is to count the number of pronouns translated
correctly. Doing so would require a 1 : 1 mapping from pronouns to their
translations. However, word alignments can link a word to zero, one or more
words, so we suggest using a measure based on precision and recall instead.
For every pronoun occurring in the source text, we obtain the set of aligned
target words in the reference and the candidate translation, R and C, respectively. Inspired by the BLEU score, we define the clipped count of a particular candidate word w as the number of times it occurs in the candidate set, limited by the number of times it occurs in the reference set:

    c_{clip}(w) = \min\bigl(c_C(w),\, c_R(w)\bigr)    (7.3)
We then consider the match count to be the sum of the clipped counts over
all words in the candidate translation aligned to pronouns in the source text,
which allows us to define precision and recall in the usual way:
    Precision = \frac{\sum_{w \in C} c_{clip}(w)}{|C|}        Recall = \frac{\sum_{w \in C} c_{clip}(w)}{|R|}    (7.4)
This measure can be applied either to obtain a comprehensive score for a
particular system on a test set or to compute detailed scores per pronoun
type to gain further insights into the workings of the model.
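The metric of Eqs. 7.3 and 7.4 can be sketched as follows, assuming the per-pronoun alignment sets have already been extracted from the word alignments.

```python
from collections import Counter

def pronoun_prf(pron_alignments):
    """Pronoun translation precision and recall (Eqs. 7.3-7.4).
    pron_alignments: list of (C, R) pairs, one per source pronoun, where
    C and R are the lists of target words aligned to that pronoun in the
    candidate and in the reference translation, respectively."""
    match = c_total = r_total = 0
    for cand, ref in pron_alignments:
        c, r = Counter(cand), Counter(ref)
        match += sum(min(n, r[w]) for w, n in c.items())
        c_total += sum(c.values())
        r_total += sum(r.values())
    precision = match / c_total if c_total else 0.0
    recall = match / r_total if r_total else 0.0
    return precision, recall

# One pronoun rendered as 'es' where the reference has 'er', and one
# matching the reference exactly:
print(pronoun_prf([(['es'], ['er']), (['sie'], ['sie'])]))  # (0.5, 0.5)
```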
For testing the significance of recall differences, we use a paired t-test. Pairing is done at the level of the set R, the individual target words aligned to pronouns in the reference translation. This method is not applicable to precision, as the sets C cannot be paired among different candidate translations.
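A sketch of this significance test using scipy; representing recall contributions as per-word match indicators and halving the two-tailed p-value for a one-tailed test are our simplifying assumptions.

```python
from scipy.stats import ttest_rel

def recall_significance(matches_base, matches_system):
    """Paired t-test on recall contributions.  The two lists hold, for
    every target word aligned to a pronoun in the reference translation
    (the set R), whether each system matched it (1) or not (0)."""
    res = ttest_rel(matches_system, matches_base)
    # Halve the two-tailed p-value for a one-tailed test of improvement.
    return res.statistic, res.pvalue / 2

base = [1, 0, 0, 1, 0, 1, 0, 0]  # illustrative indicator vectors
wdep = [1, 1, 0, 1, 0, 1, 1, 0]
print(recall_significance(base, wdep))
```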
7.4 Experimental Results
The baseline system for our experiments was built for the English–German
task of the ACL 2010 Workshop on Statistical Machine Translation. It is a
phrase-based SMT system based on the Moses decoder with phrase tables
trained on version 5 of the Europarl corpus and version 10 of the News commentary corpus and a 6-gram language model trained on the monolingual
News corpus provided by the workshop organisers. The language model is
estimated with modified Kneser-Ney smoothing (Chen and Goodman, 1998)
using the IRSTLM language modelling toolkit (Federico et al., 2008).
The feature weights are optimised by running MERT (Och, 2003) against
the news-test2008 development set for the baseline system. In order to minimise the influence of feature weight selection on the outcome of the experiments, we do not rerun MERT after adding the word dependency model. Instead, we reuse the baseline feature weights and conduct a grid search over a
set of possible values for the weight of the word dependency model, selecting
the setup that yields the best pronoun translation F-score on news-test2008. The
weight is set to 0.05 with the other 14 weights (7 distortion weights, 1 language model, 5 translation model weights and word penalty as in a baseline
Moses setup) normalised to sum to 1.
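The tuning protocol amounts to a one-dimensional grid search; in the following sketch, decode_and_score and the grid values are hypothetical stand-ins, not the settings actually used.

```python
def f_score(precision, recall):
    """Harmonic mean of pronoun translation precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def tune_weight(decode_and_score, grid=(0.0125, 0.025, 0.05, 0.1)):
    """Grid search over the word dependency model weight with all other
    feature weights fixed at their baseline values.  `decode_and_score`
    (hypothetical) translates the dev set with the given weight and
    returns the pronoun translation F-score."""
    return max(grid, key=decode_and_score)
```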
English–German is a relatively difficult language pair for SMT because of
pervasive differences in word order and very productive compounding processes in German. Our baseline system achieves a BLEU score of 0.1366 on the
newstest2009 test set. The best system submitted to WMT 2009 scores 0.148
on the same test set. Handling pronouns with a word dependency model has
no significant effect on the BLEU scores, which vary between 0.136 and 0.137
in all our experiments.
The pronoun-specific evaluation (Table 7.1) suggests that the SMT system
is very bad at translating pronouns in general. Most of the pronoun translations do not match the reference. For both test sets, adding the word dependency model results in a tiny improvement in precision and a small improvement in recall, which is however highly significant (p < .0005 in a one-tailed
t-test for both test sets).
Table 7.1. Pronoun translation precision and recall

                           news-test2008    newstest2009
Baseline
  Precision                0.333            0.428
  Recall                   0.302            0.388
  F1                       0.317            0.407
Word-dependency model
  Precision                0.338            0.430
  Recall                   0.316            0.399
  F1                       0.326            0.414

A closer look at the performance of the system on individual pronouns reveals that by far the largest part of the improvement stems from the pronoun it, which is translated significantly better by the enhanced system than
by the baseline. Recall for this pronoun improves from 0.210 to 0.271 for the
news-test2008 corpus (p < .0001, two-tailed t-test) and from 0.218 to 0.251 for the newstest2009 corpus (p < .005). The only other item with a significant improvement at a confidence level of 95 % is, surprisingly enough, the first-person pronoun I in the newstest2009 corpus (from 0.604 to 0.624, p < .05). In the news-test2008 corpus, the word dependency model has no effect whatsoever
on the word I, so it seems likely that this improvement is accidental.
By contrast, the improvement we obtain for the pronoun it, albeit slight,
is encouraging. While most other English pronouns such as he, she, they, etc.
are fairly unambiguous when translated into German and the ambiguity the
MT system is faced with will mostly concern case marking or the difficult
question whether or not a pronoun is to be translated as a pronoun at all,
translating it requires the system to determine the grammatical gender of
the German antecedent in order to choose the right pronoun. Similar problems occur in the opposite translation direction and in other language pairs,
e. g., when translating the highly ambiguous German pronoun sie into English, or when translating between two languages that have different systems
of grammatical gender. However, when applying our pronoun translation
model to the language pair English–French, we do not observe any improvement at all either in the BLEU score or in the pronoun-specific evaluation
score (Hardmeier et al., 2011).
7.5 Conclusion
Together with the two-pass approach by Le Nagard and Koehn (2010), the
word dependency model described in this chapter was one of the first attempts to model pronominal anaphora in statistical MT (Hardmeier and Federico, 2010). A key property shared by both of these early approaches is that
they try to make maximum use of existing tools and technologies and combine them for a new purpose while making as few changes to their inner
workings as possible. Our word dependency model is a straightforward extension of a standard sentence-level phrase-based SMT decoder, and most of
the document processing logic is implemented outside the decoder. For coreference resolution, we completely rely on an external tool. Even the word dependency model itself is trained with standard language modelling software.
Delegating most of the work to various external tools has the advantage of relative simplicity and can be implemented with limited effort. Unfortunately,
it turns out to be quite difficult to achieve translation quality gains in this
way.
Without going into much detail, we note that our word dependency model
and the SMT system it was tested with suffer from many of the issues discussed in Chapter 6. To begin with, the difficulty of creating a good baseline
system for translating from English into German makes it hard to achieve
strong results with a pronoun translation component. Even so, the fact that
we obtained no better results when we applied the same system to English–
French translation with a much stronger baseline (Hardmeier et al., 2011)
proves that this is not the only issue. The performance of the external coreference system and the quality of the gender and number annotations were
additional problems. While we did not conduct a formal evaluation of these
components, it was easy to see that there was a substantial level of noise in
these annotations.
The most serious shortcomings, however, can be found in the word dependency model itself. The score of this model is calculated as a simple conditional
probability that formally corresponds to a bigram language model score and
is computed with language modelling tools. The antecedent, the element the
probability is conditioned on, is represented as a gender/number tag, whereas
the anaphoric pronoun is represented as a lexical item. This setup is unsatisfactory for several reasons. The antecedent encoding contains very little
information. Hard decisions are made both when resolving the anaphoric
link and when annotating the antecedent with its gender/number tags. Both
types of annotations are subject to noise and errors, but the word dependency
model knows nothing about the confidence with which the decisions were
made. The word dependency model itself, on the other hand, is probabilistic
and trained on noisy data. Because of errors made during the preparation of
the training data, there will be a considerable number of training examples
with combinations of antecedent tags and pronouns that do not agree morphologically. As a result, a substantial part of the probability mass is spilt on
incorrect combinations that are mere artefacts of the training process.
Furthermore, in many cases source language pronouns are not aligned to
pronouns in the target language, so the model score will be calculated based
on a word that is not a pronoun at all. If the target language word has not
been seen aligned to an input pronoun during training, it will be treated as an
unknown word by the LM library and penalised strongly, promoting overgeneration of pronouns. This is an effect we observe in the translations output
by the system.
Finally, anaphoric links are represented as pairwise relations between an
anaphoric pronoun and its antecedent. The coreference resolution system
prefers to link the pronoun to its closest antecedent, even if the antecedent
is itself a pronoun. Pronoun-pronoun links are susceptible to errors because
there is little information in the two pronouns to guide the anaphora resolver.
As a result, a single incorrect link may introduce an error into a chain of
pronouns with the effect that all subsequent pronouns get translated incorrectly. A similar situation can occur even if all anaphoric links are resolved
correctly because of the stochastic nature of the word dependency model.
Since some probability estimates for non-agreeing tag/pronoun pairs may be
inflated as described in the preceding paragraph, errors may be stochastically
introduced in a pronoun chain and propagated onwards.
To sum up, the word dependency model presented in this chapter suffers
from a number of problems. It was one of the earliest attempts to model anaphora translation in SMT, and it has been useful: by identifying and studying its deficiencies, we have gained a better understanding of the difficulties hidden in the seemingly innocuous task of pronoun translation. These insights have been material to the development of the models presented in the
remainder of this thesis.
8. Cross-Lingual Pronoun Prediction
In the previous chapter, we discussed a simple word dependency model to
represent anaphoric links in phrase-based SMT and demonstrated that its effect on pronoun translation was minimal and insufficient from the point of
view of translation quality. We now leave aside the generation of translations
for a while. Instead, we focus on the automatic prediction of pronoun translations when the surrounding discourse and its translation are known and
cast pronoun translation as a classification task. Initial experiments with a
simple maximum entropy classifier quickly reveal that classification is made
difficult by the uneven distribution of personal pronouns. It is easy to achieve
moderately good overall performance just by frequently predicting the most
frequent classes, but this comes at the cost of very low recall for less frequent items such as the French feminine plural pronoun elles. A classifier
with such characteristics is unlikely to improve SMT quality because it exhibits the same bias as a baseline SMT system without any pronoun-specific
components. We propose a neural network classifier that achieves more consistent precision and recall and manages to make reasonable predictions for
all pronoun categories in many cases.
We then go on to extend our neural network architecture to include anaphoric links as latent variables. We demonstrate that our classifier, now with
its own source language anaphora resolver, can be trained successfully with
backpropagation. In this setup, we no longer use the machine learning component included in the external coreference resolution system (BART; Versley
et al., 2008) to predict anaphoric links. Instead, we rely on the additional information contained in our parallel training corpus to draw inferences about
anaphoric relations. Anaphora resolution is done by our neural network classifier and requires only some quantity of word-aligned parallel data for training, completely obviating the need for a coreference-annotated training set.
8.1 Task Setup
The overall setup of the pronoun prediction task is shown in Fig. 8.1. We
are given an English discourse containing a pronoun along with its French
translation and word alignments between the two languages, which in our
case were computed automatically using IBM model 4 (Brown et al., 1993) as
implemented by GIZA++ (Och and Ney, 2003) and word alignment symmetrisation with the grow-diag-final-and heuristic (Koehn et al., 2003). We
focus on the four English third-person subject pronouns he, she, it and they.

Figure 8.1. Task setup: "The latest version released in March is equipped with . . . It is sold at . . ." – "La dernière version lancée en mars est dotée de . . . • est vendue . . ."

Note that the pronoun it, unlike the other pronouns, can also be an object
pronoun, which adds a certain amount of noise to our data sets. The output
of the classifier is a multinomial distribution over six classes:
– Four classes corresponding to the four pronouns il, elle, ils and elles.
These are the masculine and feminine singular and plural forms of the
third person subject pronoun, respectively.
– One class corresponding to the impersonal pronoun ce or c’, which occurs in some very frequent constructions such as c’est ‘it is’. The elided
form c’ is used when the following word starts with a vowel. For the
purpose of our classifier, we treat it as identical to the full form.
– A sixth class other, which indicates that none of these pronouns was
used.
In general, a pronoun may be aligned to multiple words. In this case, a training example is counted as a positive example for a class if the target word
occurs among the words aligned to the pronoun, irrespective of the presence
of other aligned tokens.
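As an illustration, the following is a minimal sketch of this label assignment; the tokenisation and the (source index, target index) alignment format are assumptions made for the example, not the thesis's actual data structures:

```python
# Sketch of the class label extraction from word-aligned data.
CLASSES = ("ce", "elle", "elles", "il", "ils")

def pronoun_label(pronoun_index, alignment, target_tokens):
    aligned = {target_tokens[j].lower() for i, j in alignment if i == pronoun_index}
    if "c'" in aligned:          # the elided form c' is treated as identical to ce
        aligned.add("ce")
    for label in CLASSES:        # positive for a class if the pronoun occurs among
        if label in aligned:     # the aligned words, whatever else is aligned
            return label
    return "other"

# pronoun_label(0, [(0, 0), (0, 1)], ["qu'", "il"]) -> "il"
```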
This task setup resembles the problem that an SMT system must solve to
make informed choices when translating pronouns, but it avoids dealing with
automatically generated target language text and uses human-made translations as target language context instead. This could make the task both easier
and more difficult; easier, because the context can be relied on to be correctly
translated, and more difficult, because human translators frequently create
less literal translations than an SMT system would.
The features used in our classifier come from two different sources:
– Anaphora context features describe the source language pronoun and
its immediate context consisting of three words to its left and three
words to its right. They are encoded as vectors whose dimensionality
is equal to the source vocabulary size with a single non-zero component
indicating the word referred to (one-hot vectors).
– Antecedent features describe an antecedent candidate. Antecedent candidates are represented by the target language words aligned to the syntactic head of the source language markable noun phrase as identified
by the Collins head finder (Collins, 1999).
Figure 8.2. Antecedent feature aggregation: the candidate translated as elle is represented by the one-hot vector (0 0 1 0 0), the candidate la version by the average of the vectors for la and version, (0 .5 0 .5 0); weighting by the resolver probabilities p1 = .9 and p2 = .1 and summing yields the training example vector (0 .05 .9 .05 0).
The encoding of the antecedent features is illustrated in Fig. 8.2 for a training example with two antecedent candidates translated to elle and la version,
respectively. The target words are represented as one-hot vectors with the
dimensionality of the target language vocabulary. These vectors are then averaged to yield a single vector per antecedent candidate. Finally, the vectors
of all candidates for a given training example are weighted by the probabilities assigned to them by the anaphora resolver (p1 and p2 ) and summed to
yield a single vector per training example.
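The aggregation of Fig. 8.2 can be reproduced in a few lines of numpy; the five-word target vocabulary is a toy assumption for the example:

```python
import numpy as np

V = 5                                     # toy target vocabulary size
def one_hot(index):
    v = np.zeros(V)
    v[index] = 1.0
    return v

cand1 = one_hot(2)                        # candidate 1: "elle"
cand2 = (one_hot(1) + one_hot(3)) / 2.0   # candidate 2: "la version", averaged
p1, p2 = 0.9, 0.1                         # posteriors of the anaphora resolver
example_vector = p1 * cand1 + p2 * cand2  # weighted sum over candidates
print(example_vector)                     # [0.   0.05 0.9  0.05 0.  ]
```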
The different handling of anaphora context features and antecedent features is due to the fact that we always consider a constant number of context
words on the source side, whereas the number of antecedent word vectors to
be considered depends on the number of antecedent candidates and on the
number of target words aligned to the head word of each antecedent.
8.2 Data Sets and External Tools
We run experiments with two different test sets. The TED data set consists
of around 2.6 million tokens of lecture subtitles released in the WIT3 corpus
(Cettolo et al., 2012). We extract 71,131 training examples from this corpus.
The examples are randomly partitioned into a training set of 56,905 examples
and a validation set and a test set of 7,113 examples each. For the maximum
entropy classifiers described in the next section, another implementation of
the extraction procedure is used, which differs in some edge cases. It yields
71,052 examples, randomly partitioned into a training set of 63,228 examples
and a test set of 7,824 examples. The official WIT3 development and test sets
are not used in our classifier experiments because we want to reserve some
held-out data for MT experiments.
The News commentary data set is version 6 of the parallel News commentary corpus released as a part of the WMT 2011 training data. It contains around 2.8 million tokens of news text and yields 31,090 data points,
which are randomly split into 28,090 training examples and validation and
test sets of 1,500 examples each. The extraction procedure for maximum entropy classifiers extracts 31,017 data points, randomly split into 27,900 training examples and 3,117 test instances.
Table 8.1. Distribution of classes in the training data

                   TED                     News commentary
ce             6,901    16.3 %             1,312     6.4 %
elle           3,574     7.1 %             2,513    10.1 %
elles          1,581     3.0 %               995     3.9 %
il             8,645    17.1 %             5,865    26.5 %
ils            8,259    15.6 %             3,669    15.1 %
other         42,171    40.9 %            16,736    38.0 %
Total         71,131   100.0 %            31,090   100.0 %
Table 8.2. Percentages of French pronouns aligned to English pronouns

                       TED                              News commentary
            he     she      it    they         he     she      it    they
ce         1.0     1.1    15.3     1.6        1.0     0.6     6.3     1.6
elle       0.2    57.9     3.6     0.6        0.1    55.9    11.9     1.2
elles        –     0.1     0.2     8.4          –     0.4     0.5    10.5
il        54.6     0.4     9.7     2.1       55.1     1.4    18.8     2.9
ils        0.2       –     0.3    45.5        0.0       –     1.2    40.2
other     44.0    40.5    70.9    41.8       43.8    41.7    61.4    43.6
Total    100.0   100.0   100.0   100.0      100.0   100.0   100.0   100.0
The distribution of the classes in the two training sets is shown in Table 8.1.
One thing to note is the dominance of the other class, which pools together
such different phenomena as translations with other pronouns not in our
list (e. g., on or celui-ci) and translations with full noun phrases instead of
pronouns. Splitting this group into more meaningful subcategories is not
straightforward, and it is even unclear if it would benefit performance because less frequent categories may be used in more varied ways while training data becomes ever sparser.
Table 8.2 shows how the translations of each English source pronoun are distributed among the different class labels. Although the two corpora belong to fairly
different text genres, the distributions are similar. The most notable exceptions concern the translations of it. In the TED data, it is most frequently
aligned to ce, indicating that the c’est ‘it is’ construction is very common in
this corpus. The feminine elle is relatively infrequent. In the News corpus,
translations with il are much more common at the expense of ce. This probably reflects a difference in modality and formality between the two corpora,
the TED corpus being less formal and representing an oral genre. By contrast, the pronoun elle referring to feminine antecedents is more frequent as
a translation of it in the News commentary corpus.
Table 8.3. Majority class baseline results

                TED (Accuracy: 0.622)          News commentary (Accuracy: 0.555)
              P       R       F                P       R       F
ce            –   0.000       –                –   0.000       –
elle      0.579   0.536   0.557            0.559   0.111   0.185
elles         –   0.000       –                –   0.000       –
il        0.546   0.481   0.511            0.553   0.383   0.453
ils       0.455   0.985   0.622                –   0.000       –
other     0.556   0.881   0.682            0.709   0.711   0.710
The feature setup of all our classifiers requires the detection of potential
antecedents and the extraction of features pairing anaphoric pronouns with
antecedent candidates. Some of our experiments also rely on an external
anaphora resolution component. We use the open-source anaphora resolver
BART, which we also used in the experiments of the previous chapter, to
generate this information.
In all the experiments of this chapter, we use BART’s markable detection
and feature extraction machinery. In the experiments of the next two sections,
we also use BART to predict anaphoric links for pronouns. The model used
with BART is a maximum entropy ranker trained on the ACE02-npaper corpus (Mitchell et al., 2003). In order to obtain a probability distribution over
antecedent candidates rather than one-best predictions or coreference sets,
we have modified the ranking component with which BART resolves pronouns to normalise and output the scores assigned by the ranker to all candidates instead of picking the highest-scoring candidate. This is motivated
by the observation that the correct antecedent is often assigned a relatively
high score even if the single top-scoring candidate is incorrect. By preserving
the uncertainty of the anaphora resolver’s decision for the next steps in the
pipeline, the effect of incorrect decisions should be mitigated. A drawback of
this method, however, is that the BART model used was not trained in this
condition, so the resulting probabilities may not be well calibrated.
8.3 Baseline Classifiers
The easiest way to create a reasonable baseline for our pronoun prediction
task is to predict the majority class output for each source pronoun. This
means that we always predict il for he, elle for she and other for it. For they,
both ils and other are common in both corpora, but the optimal majority
class prediction is ils for the TED corpus and other for the News commentaries.
Table 8.4. Maximum entropy classifier results

                TED (Accuracy: 0.685)          News commentary (Accuracy: 0.576)
              P       R       F                P       R       F
ce        0.593   0.728   0.654            0.508   0.294   0.373
elle      0.798   0.523   0.632            0.530   0.312   0.393
elles     0.812   0.164   0.273            0.538   0.062   0.111
il        0.764   0.550   0.639            0.600   0.666   0.631
ils       0.632   0.949   0.759            0.593   0.769   0.670
other     0.724   0.692   0.708            0.564   0.609   0.586
Table 8.3 shows the results for these predictions. In this and all the
following tables, the label P corresponds to precision, R to recall and F to
balanced F-score, the harmonic mean of precision and recall. Since the distributions are heavily skewed, the overall accuracy of this classifier is well
over 50 % despite the number of output classes. The pronouns ce and elles,
as well as ils in the News commentary corpus, are minority choices for all
source pronouns, so they are never generated at all. In the TED corpus, there
are comparatively more personal pronouns referring to humans, so il and elle
are more frequently generated by he or she. This explains why the baseline
scores for these pronouns are higher.
As a more sophisticated baseline, we train a maximum entropy (ME) classifier with the MegaM software package (http://www.umiacs.umd.edu/~hal/megam/, 20 June 2013) using the features described in the
previous section and the anaphoric links found by BART. The results are
shown in Table 8.4. The F-scores are consistently over the majority class
baseline for all pronouns and both corpora. As before, the overall accuracy
is higher for the TED data than for the News commentary data. While precision is above 50 % in all categories and considerably higher in some, recall
varies widely.
The pronoun elles is particularly interesting. This is the feminine plural of
the third person subject pronoun, and it usually corresponds to the English
pronoun they, which is not marked for gender. In French, elles is a marked
choice which is only used if the antecedent is exclusively comprised of linguistic elements of feminine grammatical gender. The presence of a single
item with masculine gender in the antecedent will trigger the use of the
masculine plural pronoun ils instead. This distinction cannot be predicted
from the English source pronoun or its context; making correct predictions
requires knowledge about the antecedent of the pronoun. Moreover, elles is
an infrequent pronoun. There are only 1,909 occurrences of this pronoun
in the TED training data, and 1,077 in the News commentary training set.
Because of these special properties of the feminine plural class, we argue
that the performance of a classifier on elles is a good indicator of how well
it can represent relevant knowledge about pronominal anaphora as opposed
to overfitting to source contexts or acting on prior assumptions about class
frequencies.
In accordance with the general linguistic preference for ils, the classifier
tends to predict ils much more often than elles when encountering an English
plural pronoun. This is reflected in the fact that elles has much lower recall
than ils. Clearly, the classifier achieves a good part of its accuracy by making
majority choices without exploiting deeper knowledge about the antecedents
of pronouns.
An additional experiment with a subset of 27,900 training examples from
the TED data confirms that the difference between TED and News commentaries is not just an effect of training data size, but that TED data is genuinely
easier to predict than News commentaries. In the reduced data TED condition, the classifier achieves an accuracy of 0.673. Precision and recall of all
classifiers are much closer to the large-data TED condition than to the News
commentary experiments, except for elles, where we obtain an F-score of
0.072 (P 0.818, R 0.038), indicating that small training data size is a serious
problem for this low-frequency class.
8.4 Neural Network Classifier
In the previous section, we saw that a simple multiclass maximum entropy
classifier, while making correct predictions for much of the data set, has a
significant bias towards making majority class decisions, relying more on
prior assumptions about the frequency distribution of the classes than on
antecedent features when handling examples of less frequent classes. In order
to create a system that can be trained to rely more explicitly on antecedent
information, we have designed a neural network classifier for our task.
Artificial neural networks are networks of classifiers, usually organised
into layers, where the outputs of the classifiers in one layer are fed as inputs
to the classifiers of the next layer. The individual classifier cells map a vector
of inputs to a single output with a non-linear function parametrised with a
set of weights similar to the weights of a maximum entropy classifier. A dynamic programming
algorithm by the name of backpropagation (Rumelhart et al., 1986) allows
computing the gradients of an error function of the network outputs with
respect to all the weights in the network in polynomial time, so the network
can be trained efficiently with a variant of the gradient descent algorithm.
The main advantage of a neural network over a single classifier is that it
is capable of learning and representing latent variables. The classifiers in the
hidden layers of the network, whose outputs do not correspond directly to
122
network outputs, but are connected to the inputs of another layer of classifiers, can learn to recognise abstract features of the input data that are then
made available to the next layer. Since the gradients of the parameters of the
hidden layers are computed with backpropagation based on an error function
involving only the predictions of the final output layer, no supervision for the
intermediate abstract representation is required.
Neural networks have recently gained some popularity in natural language
processing. They have been applied to tasks such as language modelling (Bengio et al., 2003; Schwenk, 2007), translation modelling in statistical machine
translation (Le et al., 2012), as well as part-of-speech tagging, chunking, named
entity recognition and semantic role labelling (Collobert et al., 2011). In tasks
related to anaphora resolution, standard feed-forward neural networks have
been tested as a classifier in an anaphora resolution system (Stuckardt, 2007),
but the idea of using a neural network for cross-lingual pronoun prediction
is novel in our work.
In the case of our pronoun prediction network, the introduction of a hidden
layer should enable the classifier to learn abstract concepts such as gender
and number that are useful across multiple output categories, so that the
performance of sparsely represented classes can benefit from the training
examples of the more frequent classes. Additionally, as we shall see in Section 8.5, the neural network’s capacity for dealing with latent variables allows
us to represent the links between anaphoric pronouns and their antecedents
as latent variables, dispensing with the need for a separately trained anaphora
resolution system.
The overall structure of the network is shown in Fig. 8.3. As inputs, it takes
the same features that were available to the baseline ME classifier, based on
the source pronoun (P) with three words of context to its left (L1 to L3) and
three words to its right (R1 to R3) as well as the words aligned to the syntactic
head words of all possible antecedent candidates as found by BART (A). All
words are encoded as one-hot vectors whose dimensionality is equal to the
vocabulary size. If multiple words are aligned to the syntactic head of an
antecedent candidate, their word vectors are averaged with uniform weights.
The resulting vectors for each antecedent are then averaged with weights
defined by the posterior distribution of the anaphora resolver in BART (p1 to
p3 ; see also Fig. 8.2).
The network has two hidden layers. The first layer (E) maps the input word
vectors to a low-dimensional representation. In this layer, the embedding
weights for all the source language vectors (the pronoun and its 6 context
words) are tied, so if two words are the same, they are mapped to the same
lower-dimensional embedding regardless of their position relative to the pronoun. The embedding of the antecedent word vectors is independent, as these
word vectors represent target language words. The entire embedding layer is
then mapped to another hidden layer (H), which is in turn connected to a softmax output layer (S) with 6 outputs representing the classes ce, elle, elles, il,
ils and other.

Figure 8.3. Neural network for pronoun prediction: the source pronoun (P) and its context words (L3–L1, R3–R1), together with the antecedent word vectors (A) weighted by the resolver probabilities p1–p3, feed the embedding layer (E), which connects to the hidden layer (H) and the softmax output layer (S).

The softmax layer estimates a normalised probability distribution over the different outputs. The non-linearity of both hidden layers is the logistic sigmoid function, f(x) = 1/(1 + e^(−x)). We obtained similar results (not detailed here) with the hyperbolic tangent transfer function, f(x) = tanh x, and with rectified linear units, whose transfer function is f(x) = max(0, x).
In all experiments reported in this chapter, the dimensionality of the source
and target language word embeddings is 20, resulting in a total embedding
layer size of 160, and the size of the last hidden layer is equal to 50. These sizes
are very small. In experiments with larger layer sizes, we obtained similar,
but no better results.
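A minimal numpy sketch of the forward pass through this architecture may help clarify the layer structure; the vocabulary sizes are hypothetical, bias terms are omitted, and this is an illustration rather than the thesis's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V_src, V_tgt, d, h, n_classes = 10000, 12000, 20, 50, 6  # d = 20, H = 50 as in the text
rng = np.random.default_rng(0)
E_src = rng.normal(0.0, 0.1, (V_src, d))   # tied source embedding for P, L1-L3, R1-R3
E_ant = rng.normal(0.0, 0.1, (V_tgt, d))   # separate embedding for antecedent words
W_h = rng.normal(0.0, 0.1, (7 * d + d, h)) # embedding layer (160 units) -> H
W_s = rng.normal(0.0, 0.1, (h, n_classes)) # H -> softmax output layer S

def forward(source_ids, antecedent_vec):
    """source_ids: ids of the pronoun and its 6 context words;
    antecedent_vec: the aggregated antecedent vector of Fig. 8.2 (length V_tgt)."""
    embeddings = [sigmoid(E_src[i]) for i in source_ids]   # layer E, source side
    embeddings.append(sigmoid(antecedent_vec @ E_ant))     # layer E, antecedent side
    hidden = sigmoid(np.concatenate(embeddings) @ W_h)     # layer H
    return softmax(hidden @ W_s)  # distribution over ce, elle, elles, il, ils, other
```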
The neural network is trained with minibatch stochastic gradient descent
with backpropagated gradients using the rmsprop algorithm (Algorithm 2). Our training procedure is greatly inspired by a series of on-line lectures held by Geoffrey Hinton in 2012 (https://www.coursera.org/course/neuralnets, 10 September 2013).
The algorithm repeatedly samples a small set or minibatch M of training examples from the training corpus and computes the gradients G of the objective function with respect to the network parameters. It then tries to improve
the value of the objective function by applying a small correction to the parameter vector. The magnitude of the correction depends, among other things,
on the learning rate α , and its direction is a function of the gradients of the
current iteration and the gradients seen in previous iterations.
The objective function that we optimise for is cross-entropy, the standard
error function for neural networks with softmax output layers. For a single
training example, it is computed as

    E = − ∑_i t_i log y_i ,    (8.1)
where the sum is over the units of the output layer representing the output
classes, t_i is the target value found in the training set and y_i is the probability
assigned to this class by the neural network with the current weights. Fprop
and Bprop are functions implementing the forward and backward propagation passes through the network, respectively.
In contrast to standard gradient descent, rmsprop normalises the magnitude of the gradient components by dividing them by a root-mean-square
moving average accumulated in the vector R (lines 10 and 11). We find that
this leads to faster convergence. We also apply some other heuristics to improve the speed of convergence. In most cases, there is no principled justification for the numerical values of the parameters of these heuristics, but
they are fixed empirically to improve the observed time required to achieve
convergence when training our network or earlier versions of it.
– Momentum is used to even out gradient oscillations, so the direction of
the weight adjustment made in each iteration of the optimisation procedure, which is stored in the vector ∆, is equal to m times the direction
of the previous iteration plus the contribution of the current iteration
(line 12). The momentum parameter m is set to the constant 0.9 in all
our experiments.
– The global learning rate is multiplied with a gain factor Γi for each individual weight (line 12). Initially set to 1, the gain factor is increased
by adding 0.05 whenever the gradient of a weight has the same sign in
two subsequent minibatch iterations. When the gradient changes sign,
the gain factor is decreased by multiplying with 0.95 (lines 14–20).
– The global learning rate is adjusted according to training progress. Let
d be the number of times the training error decreased in the last 6
epochs. If d < 4, i. e., if the training error increased more than twice,
then the learning rate is decreased by 20 %. Otherwise, it is increased
by 5 % stochastically after each epoch with probability 0.3d/6. After
each adjustment, the learning rate is held constant for at least 6 epochs
(lines 22–31).
Good settings of the initial learning rate and the weight cost parameter (both
around 0.001 in most experiments), as well as other training parameters, were
found by manual experimentation. The initial learning rate is set to the
highest value that reliably leads to convergence. The weight cost parameter
is selected to minimise validation error. Generally, we train our networks for
300 epochs, which seems to be amply sufficient for the network to converge.
We compute the validation error on a held-out set of some 10 % of the training
data after each epoch and use the set of parameters that achieves the lowest
validation error for testing.
Algorithm 2 rmsprop neural network training algorithm
Input: training set T, learning rate α, number of epochs e, minibatch size b,
       momentum parameter m, start weights W, a validation set
Output: optimised weights
 1: for all weight components i do
 2:     R_i ← 1; Γ_i ← 1; Δ_i ← 0
 3: end for
 4: E_best ← ∞; c ← 0
 5: for i ← 1 to e do
 6:     for j ← 1 to Size(T)/b do
 7:         M ← b examples from T, sampled without replacement
 8:         y ← Fprop(M, W)
 9:         G ← Bprop(M, W, y)
10:         R ← 0.9 R + 0.1 G²
11:         G′ ← G / √R
12:         Δ ← m Δ − α Γ G′
13:         W ← W + Δ
14:         for all weight components i do
15:             if G_i has the same sign as for the last minibatch then
16:                 Γ_i ← Γ_i + 0.05
17:             else
18:                 Γ_i ← 0.95 Γ_i
19:             end if
20:         end for
21:     end for
22:     if c > 5 then
23:         d ← number of times training error decreased in the last 6 epochs
24:         if d < 4 then
25:             α ← 0.8 α
26:         else
27:             α ← 1.05 α with probability 0.3 d/6
28:         end if
29:         c ← 0
30:     end if
31:     c ← c + 1
32:     E_val ← error on validation set
33:     if E_val < E_best then
34:         W_best ← W
35:         E_best ← E_val
36:     end if
37: end for
38: return W_best

All vector operations are performed elementwise.
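The per-minibatch update (lines 10–20) translates directly into numpy; the following sketch mirrors Algorithm 2 but is not the thesis's actual training code:

```python
import numpy as np

def init_state(shape):
    """Line 2 of Algorithm 2: R <- 1, Gamma <- 1, Delta <- 0 per weight."""
    return {"R": np.ones(shape), "gain": np.ones(shape),
            "delta": np.zeros(shape), "prev_sign": np.zeros(shape)}

def rmsprop_step(W, grad, state, alpha, m=0.9):
    """One minibatch update; all operations are elementwise, as in the text."""
    state["R"] = 0.9 * state["R"] + 0.1 * grad ** 2                       # line 10
    g_norm = grad / np.sqrt(state["R"])                                   # line 11
    state["delta"] = m * state["delta"] - alpha * state["gain"] * g_norm  # line 12
    W = W + state["delta"]                                                # line 13
    # lines 14-20: per-weight gain factors based on gradient sign agreement
    same = np.sign(grad) == state["prev_sign"]
    state["gain"] = np.where(same, state["gain"] + 0.05, 0.95 * state["gain"])
    state["prev_sign"] = np.sign(grad)
    return W
```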
Table 8.5. Neural network classifier with pronouns resolved by BART

                TED (Accuracy: 0.700)          News commentary (Accuracy: 0.576)
              P       R       F                P       R       F
ce        0.634   0.747   0.686            0.477   0.344   0.400
elle      0.756   0.617   0.679            0.498   0.401   0.444
elles     0.679   0.319   0.434            0.565   0.116   0.193
il        0.719   0.591   0.649            0.655   0.626   0.640
ils       0.663   0.940   0.778            0.570   0.834   0.677
other     0.743   0.678   0.709            0.567   0.573   0.570
Since the source context features are very informative and it is comparatively more difficult to learn from the antecedents, the network sometimes
has a tendency to overfit to the source features and ignore the information
coming from the antecedents. This problem can be solved effectively by removing the source features from a part of the training material, forcing the
network to learn from the information contained in the antecedents. In all
experiments in this chapter, we zero out each individual source feature (input
layers P, L1 to L3 and R1 to R3) stochastically with a probability of 50 % every
time a training example is presented to the network. At test time, no information is zeroed out.
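A sketch of this zeroing step, assuming the seven source inputs are held in a list of vectors:

```python
import numpy as np

rng = np.random.default_rng()

def zero_source_features(source_vectors, p_drop=0.5, training=True):
    """Drop each source input layer (P, L1-L3, R1-R3) independently with
    probability p_drop during training; at test time, nothing is zeroed out."""
    if not training:
        return source_vectors
    return [np.zeros_like(v) if rng.random() < p_drop else v
            for v in source_vectors]
```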
Classification results with this network are shown in Table 8.5. The accuracy increases slightly for the TED test set and remains exactly the same
for the News commentary corpus. However, a closer look at the results for
individual classes reveals that the neural network makes better predictions
for almost all classes. In terms of F-score, the only class that becomes slightly
worse is the other class for the News commentary corpus because of lower
recall, indicating that the neural network classifier is less biased towards using the uninformative other category. Recall for elle and elles increases considerably, but especially for elles it is still quite low. For the TED data, the
increase in recall comes with some loss in precision, but the net effect on
F-score is clearly positive.
8.5 Latent Anaphora Resolution
Considering Fig. 8.1 again, we note that the bilingual setting of our classification task adds some information not available to the monolingual anaphora
resolver that can be helpful when determining the correct antecedent for a
given pronoun. Knowing the gender of the translation of a pronoun limits
the set of possible antecedents to those whose translation is morphologically
compatible with the target language pronoun.

Figure 8.4. Neural network with latent anaphora resolution: for each antecedent candidate, the anaphoric link features (T) are mapped through a hidden layer (U) to a softmax over all candidates (V); its output weights the antecedent word vectors (A), while the source inputs (P, L3–L1, R3–R1), embedding layer (E), hidden layer (H) and output layer (S) are as in Fig. 8.3.

Exploiting this fact and the
capacity of neural networks for learning hidden representations gives us the
possibility to treat the anaphoric links as latent variables, which allows us
to avoid the use of data manually annotated for coreference, in line with the
modelling assumptions we have chosen to adopt for this thesis (Section 1.3).
To achieve this, we extend the network with a component to predict the
probability of each antecedent candidate to be the correct antecedent (Fig. 8.4).
The extended network is identical to the previous version except for the upper
left part dealing with anaphoric link features. The only difference between
the two networks is the fact that anaphora resolution is now performed by a
part of our neural network itself instead of being done by an external module
and provided to the classifier as an input.
In this setup, we still use some parts of the BART toolkit to extract markables and compute features. However, we do not make use of the machine
learning component in BART that makes the actual predictions. Since this is
the only component trained on coreference-annotated data in a typical BART
configuration, no coreference annotations are used anywhere in our system
even though we continue to rely on the external anaphora resolver for preprocessing to avoid implementing our own markable and feature extractors
and to make comparison easier.
For each candidate markable identified by BART’s preprocessing pipeline,
the anaphora resolution model receives as input a link feature vector (T) describing relevant aspects of the antecedent candidate-anaphora pair. This
feature vector is generated by the feature extraction machinery in BART and
includes a standard set of features for coreference resolution. We use the
following feature extractors in BART, each of which can generate multiple
features:
– Anaphor mention type: Checks whether the anaphor is a proper name,
a noun phrase or a pronoun, and if so what type of pronoun.
– Gender match: Checks whether the anaphor and the antecedent agree
in gender.
– Number match: Checks whether the anaphor and the antecedent agree
in number.
– String match: Checks for a string match between the anaphor and the
antecedent.
– Alias feature: Checks for fuzzy matches between the anaphor and the
antecedent with the help of some heuristics (see Soon et al., 2001).
– Appositive position feature: Checks whether the anaphor could be an
apposition of the antecedent (see Soon et al., 2001).
– Semantic class: Encodes the semantic class of the anaphor (see Soon
et al., 2001).
– Semantic class match: Checks whether the semantic classes of the anaphor and the antecedent match.
– Binary distance features: Encode whether the anaphor and the antecedent are in the same or in adjacent sentences.
– First mention: Encodes whether the antecedent is the first mention in a
sentence.
Our baseline set of features is borrowed wholesale from a working coreference system. It is based on the elementary feature set of Soon et al. (2001)
with some additional features from work by Uryupina (2006). Many of the
features such as those indicating that the anaphor is a pronoun or that it is
not a named entity are not relevant to the pronoun prediction task. To ensure
that the features used by our network are exactly the same as those used by
BART, we do not manipulate the feature extractor list at this point. Instead,
we remove all features that assume constant values in the training set when
resolving antecedents for the set of pronouns we consider. Ultimately, we
are left with a basic set of 37 anaphoric link features that are fed as inputs
to our network. These features are exactly the same as those available to
the anaphora resolution classifier in the BART system used in the previous
section.
Each training example for our network can have an arbitrary number of
antecedent candidates, each of which is described by an antecedent word
vector (A) and by an anaphoric link vector (T). The anaphoric link features
are first mapped to a regular hidden layer with logistic sigmoid units (U).
The activations of the hidden units are then mapped to a single value, which
functions as an element in a softmax layer over all antecedent candidates (V).
This softmax layer assigns a probability to each antecedent candidate, which
we then use to compute a weighted average over the antecedent word vector,
replacing the probabilities p i in Fig. 8.2 and Fig. 8.3.
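The candidate scoring and weighting just described (layers T, U and V) amounts to the following sketch, with hypothetical weight shapes that are shared across all candidates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def resolve_and_aggregate(link_features, antecedent_vectors, W_u, w_v):
    """link_features: (n_candidates, 37) anaphoric link features T;
    antecedent_vectors: (n_candidates, V_tgt) aggregated aligned-word vectors A;
    W_u, w_v: weights shared between all antecedent candidates."""
    hidden = sigmoid(link_features @ W_u)   # layer U
    scores = hidden @ w_v                   # one scalar per candidate
    p = softmax(scores)                     # layer V: probabilities over candidates
    return p @ antecedent_vectors           # weighted antecedent vector for layer E
```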
At training time, the network’s anaphora resolution component is trained
in exactly the same way as the rest of the network. The error signal from
the embedding layer is backpropagated both to the weight matrix defining
the antecedent word embedding and to the anaphora resolution subnetwork.
Note that the number of weights in the network is the same for all training
examples even though the number of antecedent candidates varies because
all weights related to antecedent word features and anaphoric link features
are shared between all antecedent candidates.
One slightly uncommon feature of our neural network is that it contains
an internal softmax layer (V) to generate probabilities normalised over all
possible antecedent candidates. Moreover, weights are shared between all
antecedent candidates, so the inputs of our internal softmax layer share dependencies on the same weight variables. When computing derivatives with
backpropagation, these shared dependencies must be taken into account. In
particular, the outputs y_i of the antecedent resolution layer are the result of
a softmax applied to functions of some shared variables q_1, …, q_n:

    y_i = exp f_i(q_1, …, q_n) / ∑_k exp f_k(q_1, …, q_n)    (8.2)
The derivatives of any y_i with respect to a q_j, which can be any of the weights
in the anaphora resolution subnetwork, have dependencies on the derivatives
of the other softmax inputs with respect to q_j:

    ∂y_i/∂q_j = y_i ( ∂f_i(q_1, …, q_n)/∂q_j − ∑_k y_k ∂f_k(q_1, …, q_n)/∂q_j )    (8.3)
This makes the implementation of backpropagation for this part of the network somewhat more complicated, but it has no significant impact on training time.
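A small numeric check makes Eq. 8.3 concrete; the linear scoring functions f_k(q) = x_k · q below are a toy assumption standing in for the resolution subnetwork:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))      # 4 antecedent candidates, 3 link features each
q = rng.normal(size=3)           # shared weights q_1, ..., q_3

def y(q):
    f = X @ q                    # f_k(q) = x_k . q for every candidate k
    e = np.exp(f - f.max())
    return e / e.sum()

i, j = 0, 1
y0 = y(q)
analytic = y0[i] * (X[i, j] - np.sum(y0 * X[:, j]))   # Eq. 8.3
eps = 1e-6
q2 = q.copy()
q2[j] += eps
numeric = (y(q2)[i] - y0[i]) / eps                    # finite difference
print(analytic, numeric)         # the two values should agree closely
```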
Experimental results for this network are shown in Table 8.6. Compared
with Table 8.5, we note that the overall accuracy is only very slightly lower
for TED, and for the News commentaries it is actually better. When it comes
to F-scores, the performance for elles improves, while the effect on the other
classes is a bit more mixed. Even where it gets worse, the differences are
not dramatic considering that we have eliminated the manually annotated
coreference training set, a very knowledge-rich resource, from the training
process. This demonstrates that it is possible, in our classification task, to
obtain good results without using any data manually annotated for anaphora
and to rely entirely on unsupervised latent anaphora resolution.
Table 8.6. Neural network classifier with latent anaphora resolution

                TED (Accuracy: 0.696)          News commentary (Accuracy: 0.597)
              P       R       F                P       R       F
ce        0.618   0.722   0.666            0.419   0.368   0.392
elle      0.754   0.548   0.635            0.547   0.460   0.500
elles     0.737   0.340   0.465            0.539   0.135   0.215
il        0.718   0.629   0.670            0.623   0.719   0.667
ils       0.652   0.916   0.761            0.596   0.783   0.677
other     0.741   0.682   0.711            0.614   0.544   0.577
8.6 Further Improvements
The results presented in the preceding section represent a clear improvement
over the ME classifiers in Table 8.4, even though the overall accuracy increases only slightly. Not only does our neural network classifier achieve
better results on the classification task at hand without requiring an anaphora resolution classifier trained on manually annotated data, but it performs clearly better for the feminine categories that reflect minority choices
requiring knowledge about the antecedents. Nevertheless, the performance
is still not entirely satisfactory.
By subjecting the output of our classifier on a development set to a manual
error analysis, we found that a fairly large number of errors belong to two
error types: On the one hand, the preprocessing pipeline used to identify
antecedent candidates does not always include the correct antecedent in the
set presented to the neural network. Whenever this occurs, it is obvious that
the classifier cannot possibly find the correct antecedent. Out of 76 examples
of the category elles that had been mistakenly predicted as ils, we found that
43 suffered from this problem. In other classes, the problem seems to be somewhat less common, but it still exists. On the other hand, in many cases (23 out
of 76 for the category mentioned before) the anaphora resolution subnetwork
does assign the highest probability to an antecedent which, even if possibly
incorrect, belongs to the right gender/number group, but it still predicts an incorrect pronoun. This may indicate that the network has difficulties learning
a correct gender/number representation for all words in the vocabulary.
8.6.1 Relaxing Markable Extraction
The pipeline we use to extract potential antecedent candidates is borrowed
from the BART anaphora resolution toolkit. BART uses a syntactic parser
to identify noun phrases as markables. When extracting antecedent candidates for coreference prediction, it starts by considering a window consisting
of the sentence in which the anaphoric pronoun is located and the two immediately preceding sentences. Markables in this window are checked for
morphological compatibility in terms of gender and number with the anaphoric pronoun, and only compatible markables are extracted as antecedent
candidates. If no compatible markables are found in the initial window, the
window is successively enlarged one sentence at a time until at least one suitable markable is found.
Our error analysis shows that this procedure misses some relevant markables for at least two reasons. On the one hand, the initial three-sentence extraction window is too small. On the other hand, the morphological compatibility check incorrectly filters away some markables that should have been
considered as candidates. By contrast, the extraction procedure does extract
quite a number of first and second person noun phrases (I, we, you and their
oblique forms) in the TED talks, which are extremely unlikely to be the antecedent of a later occurrence of he, she, it or they. As a first step, we therefore
adjust the extraction criteria to our task by increasing the initial extraction
window to six sentences, excluding first and second person markables and
removing the morphological compatibility requirement. The compatibility
check is still used to control expansion of the extraction window, but it is no
longer applied to filter the extracted markables. This increases the accuracy
to 0.701 for TED and 0.602 for the News commentaries, while the performance for elles improves to F-scores of 0.531 (TED; P 0.690, R 0.432) and 0.304
(News commentaries; P 0.444, R 0.231), respectively. Note that these and all
the following results are not directly comparable to the ME baseline results
in Table 8.4, since they include modifications and improvements to the training data extraction procedure that might possibly lead to benefits in the ME
setting as well.
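The adjusted extraction procedure can be summarised in the following sketch; the Markable interface and the `compatible` predicate are hypothetical stand-ins for BART's preprocessing output:

```python
# Sketch of the relaxed antecedent candidate extraction.
def extract_candidates(markables, pronoun, compatible, init_window=6):
    """markables: markables preceding the pronoun in the document, each with
    .sentence and .person attributes; compatible: gender/number agreement check,
    used only to control window expansion, no longer to filter candidates."""
    window = init_window
    while True:
        cands = [m for m in markables
                 if pronoun.sentence - window < m.sentence <= pronoun.sentence
                 and m.person == 3]          # exclude 1st/2nd person markables
        if any(compatible(m, pronoun) for m in cands) or window > pronoun.sentence:
            return cands                     # unfiltered candidate set
        window += 1                          # enlarge one sentence at a time
```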
8.6.2 Adding Lexicon Knowledge
In order to make it easier for the classifier to identify the gender and number properties of infrequent words, we extend the word vectors with features
indicating possible morphological features for each word. In early experiments with ME classifiers, we found that our attempts to do proper gender
and number tagging in French text did not improve classification performance noticeably, presumably because the annotation was too noisy. In more
recent experiments, we just add features indicating all possible morphological
interpretations of each word, rather than trying to disambiguate them. To do
this, we look up the morphological annotations of the French words in the
Lefff dictionary (Sagot et al., 2006) and introduce a set of new binary features
to indicate whether a reading of a word with a particular set of morphosyntactic properties occurs in that dictionary. These binary features are then
added to the one-hot representation of the antecedent words. Doing so improves the classifier accuracy to 0.711 (TED) and 0.604 (News commentaries),
while the F-scores for elles reach 0.589 (TED; P 0.649, R 0.539) and 0.500 (News
commentaries; P 0.545, R 0.462), respectively.

Table 8.7. Final classifier results

                TED (Accuracy: 0.713)          News commentary (Accuracy: 0.626)
              P       R       F                P       R       F
ce        0.611   0.723   0.662            0.492   0.324   0.391
elle      0.749   0.596   0.664            0.526   0.439   0.478
elles     0.602   0.616   0.609            0.547   0.558   0.552
il        0.733   0.638   0.682            0.599   0.757   0.669
ils       0.710   0.884   0.788            0.671   0.878   0.761
other     0.760   0.704   0.731            0.681   0.526   0.594
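To make the lexicon lookup concrete, a sketch with an illustrative tag set (the Lefff's actual inventory is richer, and the dictionary object here is a hypothetical stand-in):

```python
# Sketch of the binary morphological lexicon features.
MORPH_TAGS = ["masc-sg", "fem-sg", "masc-pl", "fem-pl"]   # illustrative tag set

def lexicon_features(word, lefff):
    """lefff: dict from French word forms to their sets of possible readings.
    One binary feature per reading, without any disambiguation."""
    readings = lefff.get(word, set())
    return [1.0 if tag in readings else 0.0 for tag in MORPH_TAGS]

# lexicon_features("version", {"version": {"fem-sg"}}) -> [0.0, 1.0, 0.0, 0.0];
# these features are appended to the one-hot antecedent word vectors.
```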
8.6.3 More Anaphoric Link Features
Even though the modified antecedent candidate extraction with its larger context window and without the morphological filter results in better performance on both test sets, additional error analysis reveals that the classifier has
greater problems identifying the correct markable in this setting. One reason
for this may be that the baseline anaphoric link feature set described above
(Section 8.5) only includes two very rough binary distance features which
indicate whether or not the anaphora and the antecedent candidate occur in
the same or in immediately adjacent sentences. With the larger context window, this may be too unspecific. In our final experiment, we therefore enable
some additional features which are implemented in BART, but disabled in the
baseline system:
– Distance in number of markables
– Distance in number of sentences
– Sentence distance, log-transformed
– Distance in number of words
– Part of speech of head word
Most of these encode the distance between the anaphora and the antecedent
candidate in more precise ways. Complete results for this final system are
presented in Table 8.7.
Including these additional features leads to another slight increase in accuracy for both corpora, with similar or increased classifier F-scores for most
classes except elle in the News commentary experiment. In particular, we
should point out the performance of our classifier on the benchmark pronoun elles, which
suffered from extremely low recall in the first classifiers and approaches the
performance of the other classes, with nearly balanced precision and recall,
in this final system. Since elles is a low-frequency class and cannot be reliably predicted using source context alone, we interpret this as evidence
that our final neural network classifier has incorporated some relevant knowledge about pronominal anaphora that the baseline ME classifier and earlier
versions of our network have no access to. This is particularly remarkable
because no data manually annotated for coreference was used for training.
8.7 Conclusion
In this chapter, we have introduced cross-lingual pronoun prediction as an
independent natural language processing task. Even though it is not an end-to-end task, pronoun prediction is interesting for several reasons, not least
because of its relation to pronoun translation in SMT. We have shown that
pronoun prediction can be effectively modelled in a neural network architecture with relatively simple features. More importantly, we have demonstrated that the task can be exploited to train a classifier with a latent representation of anaphoric links. With parallel text as its only supervision, this
classifier achieves a level of performance that is similar to, if not better than,
that of a classifier using a regular anaphora resolution system trained with
manually annotated data.
9. Pronoun Prediction in SMT
The pronoun prediction model developed in the previous chapter maps pronoun translations to a probability score given information extracted from a
piece of bilingual context potentially covering multiple sentences. Such a
model can easily be integrated in the document-level decoding framework
presented in the first part of this thesis. This chapter concludes our experimental work by combining the different components we have developed into
one system, a document-level SMT system built around the Docent decoder
with a neural network model for pronoun prediction. We study some of the
difficulties that arise when the pronoun prediction model is used in an SMT
setting and investigate the output of the enhanced system with the help of
both automatic methods and a targeted manual evaluation experiment.
9.1 Integrating the Anaphora Model into Docent
Predicting the correct translation of an anaphoric pronoun has two parts,
identifying its antecedent in the source language and finding out what linguistic elements best represent the input pronoun in the target language,
also taking into account the translation of the antecedent. The neural network model of the previous chapter (Fig. 8.4, p. 128) incorporates anaphora
resolution and target element selection in a single neural network classifier.
It is trained with backpropagation on training examples extracted from wordaligned parallel text, and the anaphoric links are treated as latent variables.
The inputs of the neural network consist of anaphora context features and
anaphoric link features, which are extracted from the source language part
of the training and test examples, and antecedent features, which are extracted from the translation.
The SMT decoder takes source language material as input and generates a
translation. At decoding time, only the features depending on the translated
output are variable as the translation is generated and updated in the decoding process. Features derived from the input are fixed. In particular, the part
of the network that deals with anaphoric links (layers T, U and V in Fig. 8.4)
are independent of the translation and can be precomputed. Instead of implementing the anaphora resolution component of the network as a part of
the SMT decoder, we therefore run it as a preprocessing step and integrate
it into the coreference resolution toolkit BART (Versley et al., 2008), which
we also use to extract markables and anaphoric link features. In BART, the
neural network simply replaces the standard markable ranking component.
Before running the SMT decoder, we process the source file with this modified version of BART to extract markables and compute the probabilities of
network layer V as input for the translation step. Thus, the anaphora resolution subnetwork, which was united with the pronoun prediction classifier at
training time, is now again run separately at decoding time.
The remaining parts of the neural network are added as a feature function
to the Docent document-level SMT decoder. At the beginning of a decoding
run, the feature module identifies all relevant anaphoric pronouns in the document according to some filter criteria. In our English–French experiments,
we consider all occurrences of the English pronouns it and they. The module also identifies all the markables the anaphora resolver recognised as antecedent candidates for one of the target anaphors with a probability exceeding
some small threshold value. The purpose of the threshold is to avoid spending an inordinate amount of time on numerous low-probability candidates. It
is set to 0.01 in our experiments.
The markables passing these filters are stored in a data structure that links
anaphors to their antecedent candidates as well as antecedent markables to
potential anaphors. Then, the document is scored by extracting all necessary information from the markables and making a forward propagation pass
through the neural network. The feature score of a single anaphor is the logarithm of the probability assigned by the softmax output layer S to the pronoun
translation found in the current document state. These scores are summed
over the complete document and cached in the data structure representing
the anaphor.
Whenever the document state is modified, the decoder identifies the anaphoric pronouns and antecedent candidates affected by the modification. It
then recomputes and updates the scores of those anaphors which are affected
by the modification themselves or whose antecedent candidates are affected
by it.
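Structurally, the feature function behaves as in the following sketch; Docent itself is implemented in C++, and the anaphor and network interfaces shown here are hypothetical illustrations of the caching and update logic:

```python
import math

class PronounPredictionFeature:
    """Sketch of the document-level feature module, not Docent's actual API."""

    def __init__(self, anaphors, network):
        self.anaphors = anaphors    # each anaphor links to its candidate markables
        self.network = network      # forward pass through layers E, H, S
        self.scores = {}            # cached log-probability per anaphor

    def score_document(self, state):
        for a in self.anaphors:
            p = self.network.forward(a, state)       # distribution over pronouns
            self.scores[a] = math.log(p[a.translation(state)])
        return sum(self.scores.values())

    def update(self, state, modified):
        # rescore anaphors that are themselves affected by the modification
        # or whose antecedent candidates are affected by it
        for a in self.anaphors:
            if a.overlaps(modified) or any(c.overlaps(modified)
                                           for c in a.candidates):
                p = self.network.forward(a, state)
                self.scores[a] = math.log(p[a.translation(state)])
        return sum(self.scores.values())
```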
9.2 Weakening Prior Assumptions in the SMT Models
Even though the standard translation and language models of phrase-based
SMT do not model pronominal anaphora explicitly, they make strong prior
assumptions about how pronouns should be translated. Table 9.1 shows the
top ten translations of the single-word phrases it and they in a phrase table
created with the WMT 2014 English–French training data. The entries are
ordered by the geometric mean of the probability of the target phrase given
the source and that of the source phrase given the target, equivalent to a
log-linear combination with equal weights.
Table 9.1. Top ten translations of it and they in an English–French phrase table

it            p(t|s)   p(s|t)   p̄geom        they           p(t|s)   p(s|t)   p̄geom
il             0.258    0.379   0.312         ils             0.234    0.543   0.357
elle           0.100    0.409   0.202         elles           0.109    0.432   0.217
qu' il         0.043    0.137   0.077         qu' ils         0.081    0.296   0.155
c'             0.018    0.293   0.073         qu' elles       0.031    0.258   0.090
qu' elle       0.023    0.186   0.065         ils ont         0.016    0.233   0.060
cela           0.012    0.156   0.044         leur            0.039    0.030   0.034
lui            0.014    0.111   0.039         ceux-ci         0.007    0.160   0.033
celle-ci       0.005    0.168   0.030         celles-ci       0.005    0.131   0.025
on             0.013    0.058   0.028         , ils           0.008    0.082   0.025
, il           0.014    0.043   0.025         Ils             0.007    0.043   0.017

In both singular and plural, the obvious translation equivalents il and elle
or ils and elles top the lists. Two-word phrases with the conjunction que
follow closely, reflecting the fact that the English complementiser that can
frequently be omitted in places where French requires the use of que. In
the singular, the pronoun c’ of the construction c’est, translating into it is,
and the demonstrative pronoun cela also achieve high scores. All of this
is entirely unsurprising and intuitive, and the translations contained in the
phrase table supply translational equivalents for many frequent uses of the
English pronouns. While there is little semantic difference between the various translations, the correct choice between them is governed by all manner
of linguistic constraints, ranging from the syntactic relations that control the
choice between subject forms like il and object forms like lui to the discourse
mechanisms that may trigger the use of cela instead. The translation model
has no notion of these constraints, but it assigns vastly different scores to different translation alternatives based on their frequency in the training corpus.
Thus, all other things being equal, the decoder will always prefer masculine
translations over feminine ones, and given the choice between il est and c’est
as translations of it is, which is often largely a matter of style, the former
will be preferred. These preferences may be overridden by the immediately
surrounding context, which may induce the use of a multi-word phrase with
different top-scoring translations or cause the language model to favour
another translation, but none of these dependencies works over a longer distance than a handful of words.
Sometimes the n-gram language model has amazing ways of selecting pronouns without actually knowing anything about anaphora. Consider the following example:

(9.1) a. Input: It is necessary to say that the car insurance is something important, not only because it covers the driver over possible wrecks, but because it represents an important cost [. . .] (news-test2008)

b. Reference translation: Il faut dire que l'assurance automobile est quelque chose d'important, non seulement parce qu'elle couvre le conducteur face à d'éventuels sinistres, mais aussi parce qu'elle représente une importante dépense, [. . .]

c. Baseline MT output: Il est nécessaire de dire que l'assurance automobile est quelque chose d'important, non seulement parce qu'elle couvre le conducteur au possible des épaves, mais parce qu'il représente un coût important, [. . .]
The MT output is generated by a baseline Moses system trained on a substantial part of the WMT 2014 parallel training data which has a 6-gram language
model trained on news text from the News commentary corpus and the News
crawl corpus provided by the shared task organisers and the French Gigaword corpus from LDC. The system does not have any models specifically
dealing with pronominal anaphora. Nevertheless, the first instance of the
pronoun it referring to car insurance is correctly rendered with the French
feminine pronoun elle, although its French antecedent, the feminine noun
phrase l’assurance automobile, is far beyond the history of the 6-gram language model. It turns out that the n-gram context of the anaphoric pronoun
is highly predictive of the identity of its antecedent. The language model
training corpus contains the following two sentences, both of which overlap
with the test sentence in the 5-gram qu’elle couvre le conducteur:
(9.2) a. Paradoxalement, cette garantie n'est pas toujours incluse dans l'assurance auto bien qu'elle couvre le conducteur, qu'il soit propriétaire du véhicule ou non.

b. La réponse à ces questions tout autant que les garanties liées à l'« individuelle conducteur », qui n'est pas toujours incluse dans l'assurance auto bien qu'elle couvre le conducteur, permet de différencier deux contrats.
In both cases, the antecedent of elle is the noun phrase l’assurance auto, a
shortened form of the phrase l’assurance automobile of the previous example
with the same morphosyntactic features. No corresponding sentence with
the masculine pronoun il occurs in the training set. Far from demonstrating
any capability of handling anaphoric pronouns, our example illustrates the
n-gram model’s astonishing capacity for acquiring world knowledge. What
has really been learnt is the gender of the noun phrase that a pronoun in the
given context typically refers to rather than the gender of the antecedent it
specifically refers to in this example.
If the goal of running an SMT system is to create the best translations possible given the current state of the art, then it is a useful strategy to exploit
the skewness of the pronoun frequency distribution to make good, if uninformed, guesses in as many cases as possible. However, since our goal is to
develop better pronoun models, the effect of the frequency priors is undesired
because it distorts the real performance of the pronoun models. If the guesswork of the language and translation models leads to the right conclusion,
it may disguise mistakes of the anaphora model and generate correct output
in spite of it. Conversely, if the language and translation models impose a
more frequent pronoun choice despite better advice of the anaphora model,
spurious errors are introduced.
While this is a problem that arises because the translation and language
models incompetently interfere with the work of the anaphora model, the pronoun prediction model also interferes with some choices that the core SMT
models are better at solving. The pronoun classifier of the previous chapter
focuses on the French pronouns il, elle, ils and elles when they occur as translations of an English third-person subject pronoun. All other translations
of the English pronouns are lumped together into a single class other. In
concrete decoding situations, however, the class other as such never occurs;
instead, the model is confronted with the problem of distributing the probability mass reserved for this class to a large variety of different candidate
translations, similar to the way the n-gram language model must distribute
the total probability mass reserved for the class of unseen words to individual,
and possibly competing, instances of unseen words. Unless the model were
extended with some kind of language modelling capacity, this distribution
would be arbitrary because, having established that a candidate translation
does not contain any of the pronouns it knows about, the anaphora model
has no useful information to score it.
Luckily, both of these difficulties can be overcome at once by decoupling
the anaphora model from the translation and language models and letting
each model do what it knows most about. The key is to remove the other category from the pronoun prediction model and to remove information about
the identity of the pronouns it models from the language and translation models. We replace each occurrence of the pronouns il and elle that is aligned to
an English it and each occurrence of ils and elles that is aligned to an English
they by a placeholder while training the language model and the translation
model. Since the language model needs information about some features of
the pronouns to fit them correctly into the surrounding context, we use four
different placeholders for capitalised versus lowercased and singular versus
plural pronouns, respectively. Thus, we replace il by a placeholder called
lcpronoun-sg and Elles by a placeholder called ucpronoun-pl if they are
aligned to it or they, respectively. The scores of the translation model and
the language model are computed over the text with these placeholders. We
use two copies of the pronoun prediction model in the decoder, one to handle
singular pronouns and one to handle plural pronouns. If a target phrase contains a placeholder, separate hypotheses with all compatible pronouns are
generated and scored by the appropriate pronoun prediction model. If no
placeholder occurs in the translation, the pronoun predictors do not add any
score.
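A sketch of the placeholder substitution on word-aligned training data; the tokenisation and the (source index, target index) alignment format are assumptions made for the example:

```python
# Sketch of the pronoun placeholder substitution for model training.
SINGULAR, PLURAL = {"il", "elle"}, {"ils", "elles"}

def insert_placeholders(src_tokens, tgt_tokens, alignment):
    out = list(tgt_tokens)
    for i, j in alignment:
        src, tgt = src_tokens[i].lower(), tgt_tokens[j]
        if src == "it" and tgt.lower() in SINGULAR:
            number = "sg"
        elif src == "they" and tgt.lower() in PLURAL:
            number = "pl"
        else:
            continue                                 # leave other words untouched
        case = "uc" if tgt[0].isupper() else "lc"    # capitalised vs. lowercased
        out[j] = f"{case}pronoun-{number}"
    return out

# insert_placeholders(["it", "works"], ["Elle", "fonctionne"], [(0, 0), (1, 1)])
# -> ["ucpronoun-sg", "fonctionne"]
```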
With this model, the target language pronouns il, elle, ils and elles can
be generated in two ways. In the generation path we are primarily interested in, the translation model generates a pronoun placeholder. The translation model and the language model calculate their scores based on the placeholder. Then, a pronoun is generated from the placeholder, and the pronominal anaphora model calculates its score based on the pronoun. This happens
whenever the target pronoun is aligned to it or they in the source language. In
the second generation path, a pronoun is generated directly by the translation
model with a phrase pair in which the source pronouns it or they either do
not occur or are not aligned to the target pronoun. In this case, the translation
and language models get to see the pronoun itself instead of a placeholder,
and the pronoun prediction model is not active at all. Thus, the translation
model and the language model contain both pronoun placeholders and concrete instances of pronouns.
For the translation model, this does not pose any difficulties at training
time. Since the model is trained on word-aligned parallel text, it is easy to
check whether a given pronoun instance is aligned to one of the English
source pronouns and to insert a placeholder only if this is the case. For a language model trained on monolingual data, this will not work because there
is no English source text, so it is not trivial to find out whether or not a target language pronoun would be aligned to it or they in a hypothetical source
language text. To train a model with an approximately correct distribution of
pronouns and placeholders, we first create a 6-gram language model over the
target language side of a part of the translation model training corpus with
placeholders inserted according to the aligned source words. Next, we use
this model to insert placeholders into the actual training corpus by running
the Viterbi decoder for n-gram-based disambiguation included in the SRILM
language modelling toolkit (Stolcke et al., 2011). Finally, we train a 6-gram
language model on this artificially annotated training corpus and use it in our
SMT system.
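The alignment-based substitution used for the translation model training data is straightforward; the following sketch assumes Moses-style word alignments given as (source index, target index) pairs over whitespace-tokenised sentences (the function name is ours):

    SG = {"il": "lcpronoun-sg", "Il": "ucpronoun-sg",
          "elle": "lcpronoun-sg", "Elle": "ucpronoun-sg"}
    PL = {"ils": "lcpronoun-pl", "Ils": "ucpronoun-pl",
          "elles": "lcpronoun-pl", "Elles": "ucpronoun-pl"}

    def insert_placeholders(src_tokens, tgt_tokens, alignment):
        """Replace il/elle aligned to English 'it' and ils/elles aligned
        to 'they' by the placeholder encoding case and number."""
        out = list(tgt_tokens)
        for i, j in alignment:
            src = src_tokens[i].lower()
            if src == "it" and tgt_tokens[j] in SG:
                out[j] = SG[tgt_tokens[j]]
            elif src == "they" and tgt_tokens[j] in PL:
                out[j] = PL[tgt_tokens[j]]
        return out

For the monolingual language model corpus, where no such alignments exist, this deterministic substitution is replaced by the two-stage procedure with the SRILM disambiguation decoder described above.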
9.3 SMT Experiments
To test our anaphora model, we run a series of experiments integrating the
model into phrase-based English–French SMT systems for the two text types
we tested our classifiers on in the previous chapter. The systems incorporate
the document-level anaphora model in the local search decoder developed in
the first part of this thesis.
9.3.1 Baseline Systems
Decoding is done in two steps. First, we run a sentence-level phrase-based
SMT system with the Moses decoder (Koehn et al., 2007). The output of this
decoder is then used to initialise the Docent local search decoder described
in Chapter 4. At the same time, we use it as a baseline.
The fundamental setup of our baseline system for News data is loosely
based on the system submitted by Cho et al. (2013) to the WMT 2013 shared
task. Our phrase table is trained on data taken from the News commentary,
Europarl, UN, Common crawl and 10⁹ corpora. The first three of these
corpora were included integrally into the training set after filtering out
sentences of more than 80 words. The Common crawl and 10⁹ data sets were run
through an additional filtering step with an SVM classifier, closely following
Mediani et al. (2011). The phrase table of the baseline system is the same as
that of the document-level system and is created by reinserting pronouns in
a phrase table with placeholders as described in the previous section. At each
occurrence of the placeholders lcpronoun-sg, lcpronoun-pl, ucpronoun-sg and ucpronoun-pl, the applicable pronouns are inserted with equal probabilities. As a result, the choice between these pronouns is entirely left to the
language model in the baseline system.
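The reinsertion step for the baseline phrase table can be sketched as follows, under the assumption that "equal probabilities" means splitting the probability mass of a placeholder entry evenly among its pronoun variants (the real phrase table carries four scores and further fields; this sketch shows a single probability):

    from itertools import product

    PLACEHOLDER_PRONOUNS = {
        "lcpronoun-sg": ("il", "elle"), "ucpronoun-sg": ("Il", "Elle"),
        "lcpronoun-pl": ("ils", "elles"), "ucpronoun-pl": ("Ils", "Elles"),
    }

    def reinsert_pronouns(target_phrase, probability):
        """Expand a placeholder phrase-table entry into one concrete
        entry per pronoun combination, splitting the mass evenly."""
        slots = [PLACEHOLDER_PRONOUNS.get(t, (t,)) for t in target_phrase]
        variants = list(product(*slots))
        for variant in variants:
            yield " ".join(variant), probability / len(variants)

An entry like lcpronoun-sg dort with probability 0.5 thus becomes il dort and elle dort with probability 0.25 each, leaving the choice between them to the language model.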
The system includes three language models, a regular 6-gram model with
modified Kneser-Ney smoothing (Chen and Goodman, 1998) trained with
KenLM (Heafield, 2011), a 4-gram bilingual language model (Niehues et al.,
2011) with Kneser-Ney smoothing trained with KenLM and a 9-gram model
over Brown clusters (Brown et al., 1992) with Witten-Bell smoothing (Witten
and Bell, 1991) trained with SRILM (Stolcke et al., 2011). In addition to the
three language models, the baseline system uses the standard set of features
for phrase-based SMT with four phrase table scores, a phrase penalty, a word
penalty, an out-of-vocabulary penalty and a geometric distortion model. No
lexical reordering models are included.
The TED system is identical to the News system, but the TED parallel
training corpus from the WIT3 distribution (Cettolo et al., 2012) is added to
the translation model training set, and the monolingual French WIT3 training data is added to the LM corpus. The feature weights of the baseline systems are optimised with MERT (Och, 2003) against the newstest2011 and the
dev2010 development set for the News and the TED system, respectively.
9.3.2 Document-Level Decoding with Anaphora Models
In the document-level decoder, the anaphora model is added to the baseline
configuration in the form of two extra feature functions. Each of the feature
functions corresponds to a separate instance of the neural network classifier. One of them handles the singular pronoun it and makes a binary choice
between il and elle, and the other handles the plural pronoun they and makes
a binary choice between ils and elles. Examples where it is aligned to ils or
elles or where they is aligned to il or elle are not handled by the anaphora
model. The anaphora feature functions are only active if the input pronoun
it or they is aligned to a pronoun placeholder on the target side. If there is no
placeholder corresponding to a specific input pronoun, the anaphora models
do not contribute a score, and scoring is left to the translation and language
models.

Table 9.2. Neural network configurations and intrinsic performance

             Esrc  Eant    E    U    H      λ    err    acc
News
  singular     50    50  400   50  150  10⁻⁵  0.086  0.931
  plural       50    50  400   50  150  10⁻⁴  0.309  0.853
TED
  singular     20    20  160   20   50  10⁻⁵  0.131  0.964
  plural       50    50  400   50  150  10⁻⁶  0.434  0.751

Esrc: source embedding size   Eant: antecedent embedding size
E, U, H: total layer sizes   λ: ℓ2 weight penalty
err: validation error   acc: accuracy
The two neural networks are trained exactly as described in Chapter 8.
The network configurations and their intrinsic performance are shown in
Table 9.2. They were selected based on validation error after testing a small
number of different configurations. To create the training sets for the neural
networks, all applicable examples were extracted from the News commentary
corpus for the News system and from the TED corpus for the TED system.
From these examples, 10 % were held out as a validation set and another 10 %
as a test set. The remaining data points, around 7,000 to 8,000 per condition,
were combined with examples sampled from the 109 corpus to create training
sets of about 120,000 examples per text genre and source pronoun.
In the document-level decoder, the 6-gram LM of the baseline system is replaced with a pronoun placeholder LM as described in Section 9.2. Otherwise
the feature models are identical. In particular, the bilingual 4-gram LM of the
second pass is the same as that of the first pass and does not use placeholders. The same is true of the 9-gram cluster LM, but this makes no difference
because the pronouns corresponding to identical placeholders are assigned
to the same clusters by the Brown clustering algorithm.
An attempt to optimise the feature weights of the document-level system
including the anaphora models failed because document-level MERT against
the BLEU score showed no signs of convergence after 25 iterations. We suspect that this failure is due to problems with the sampling procedure that
generates the n-best lists for MERT (see Section 4.7). Instead of tuning the feature
weights automatically, we use the same set of weights as for the baseline
system and fix the weights of the two anaphora features manually and essentially arbitrarily. The anaphora model weights are set to 0.01 because values
of 0.001 and 0.1 result in an unreasonably small or large number of changes
in the test set translation. Since we have no reliable automatic performance
metric, we make no attempt at optimising the weights more carefully. While
our way of setting parameters based on test set performance without using
a separate development set is methodologically objectionable, we consider it
very unlikely that this crude method that considers only three different exponentially spaced parameter values and selects the best based on a superficial
impression results in a serious unfair advantage for our anaphora model.
To make the anaphora model as effective as possible, it is important for
the decoder to be able to change pronoun translations easily. In some cases,
a pronoun may be a part of a longer phrase, and it is difficult to alter the entire
phrase in a single step without making some accidental changes that cause
the modification to be rejected. To give the decoder a chance to make changes
in multiple steps, we employ the simulated annealing search algorithm instead of hill climbing. The search is started with a temperature of 1 and
follows a slow geometric decay cooling schedule, whereby the temperature
is multiplied by 0.99999 after each accepted step. The crossover operation
(with a weight of 0.2) and the restore-best operation (with a weight of
0.1) are used to keep the search from deviating too far from the hill climbing path. The remaining state operations are change-phrase-translation
(with weight 0.4), swap-phrases (with weight 0.2 and swap distance decay
0.5) and resegment (with weight 0.1 and phrase size decay 0.1).
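In outline, the resulting search loop looks as follows; this is a hedged sketch in which the state, the scoring function and the proposal function stand in for the corresponding components of the decoder:

    import math
    import random

    OPERATIONS = [("change-phrase-translation", 0.4),
                  ("swap-phrases", 0.2),
                  ("crossover", 0.2),
                  ("resegment", 0.1),
                  ("restore-best", 0.1)]

    def simulated_annealing(state, score, propose, steps):
        """Maximise score(state) under the Metropolis criterion with a
        slow geometric cooling schedule (factor 0.99999 per accepted
        step), as described in the text above."""
        names = [n for n, _ in OPERATIONS]
        weights = [w for _, w in OPERATIONS]
        temperature = 1.0
        current = score(state)
        best_state, best_score = state, current
        for _ in range(steps):
            operation = random.choices(names, weights=weights)[0]
            candidate = propose(state, operation)
            delta = score(candidate) - current
            # Accept improvements always; accept deteriorations with
            # probability exp(delta / temperature).
            if delta >= 0 or random.random() < math.exp(delta / temperature):
                state, current = candidate, current + delta
                temperature *= 0.99999  # cool after each accepted step
                if current > best_score:
                    best_state, best_score = state, current
        return best_state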
For the News corpus, the set of potential antecedents for each occurrence
of it or they is identified with an automatic markable extraction pipeline, and
each antecedent candidate is assigned a probability with the neural network
exactly as described in Chapter 8. For the TED corpus, we can do the same.
Thanks to the existence of the ParCor corpus (Guillou et al., 2014), however,
we also have gold-standard pronoun coreference annotations at our disposal.
We can therefore run the experiment in a “gold” condition, where we replace
the automatically extracted antecedents with the gold-standard information
from the ParCor corpus. In this condition, we mark up exactly one antecedent
candidate per anaphoric instance of it or they and assign it a probability of
1. Pronoun occurrences that are marked as non-anaphoric in ParCor are removed. The anaphora models in the “gold” condition are the same as those
in the “predicted” condition. In particular, no gold-standard information is
used for training the neural networks.
9.3.3 Test Corpora
For the TED system, the test corpus used in our experiments is the tst2010
test set as distributed in the WIT3 corpus. It is composed of 11 documents
comprising 1,664 segments in total.
In the WMT News test sets, pronouns are distributed very unevenly among
the documents. While they are abundant in some documents, others contain
very few pronouns or none at all (see Section 6.3 and Table 6.1, p. 95, for some
statistics).

Table 9.3. BLEU scores for SMT system with anaphora model

Corpus  Anaphora resolution   Baseline   524,288 steps   8,388,608 steps
News    predicted              0.2439       0.2440              –
TED     predicted              0.3086       0.3085           0.3086
        gold                   0.3086       0.3079           0.3080

To ensure that the phenomena we focus on are sufficiently covered
by the test set, we compile a new test set by combining suitable documents
from a number of existing test corpora. Our pronoun test corpus is extracted
from the newstest test sets released for the MT shared tasks at the 2008, 2009,
2010 and 2012 Workshops on Statistical Machine Translation (WMT). The
newstest2011 set is not included because we use it as a development set for
feature weight tuning. From these test sets, we extract all documents with at
least 5 sentences containing the pronouns it or they or an uppercase variant
of them. The resulting corpus contains 131 documents and 4,954 segments in
total. All the News results in this chapter refer to this corpus.
9.3.4 Automatic Evaluation
After the initial decoding run with Moses, we launch Docent with the full
set of features including the document-level models. For the TED system,
we run Docent for 2²³ = 8,388,608 steps. For the News system, decoding is
much slower because of the larger test set, so we interrupt decoding after
2¹⁹ = 524,288 steps. After these periods, 360 out of 4,954 segments in the
News test set (7.3 %), 122 out of 1,664 segments in the TED experiment with
predicted anaphora resolution (7.3 %) and 105 out of 1,664 segments in the
TED experiment with gold-standard anaphora resolution (6.3 %) have been
modified by the decoder. The slightly lower number of modifications in the
“gold” condition may be due to the fact that every anaphor only has a single
antecedent candidate in this condition, thus reducing the number of cross-sentence dependencies with respect to the “predicted” condition.
Table 9.3 shows the BLEU scores for these experiments. Clearly, the differences between the baselines and the document-level systems are very small.
For the News system and the TED system in the “predicted” condition, the
score difference is negligible. For the TED system in the “gold” condition,
the score drops by less than 0.1 BLEU points. This change does indicate that
the reference translation is matched slightly less closely, but it is too small to
permit any conclusions.
The automatic pronoun translation metric introduced in Section 7.3 slightly
decreases in both precision and recall for the News texts and for the TED texts
in the “gold” condition (Table 9.4).

Table 9.4. Pronoun evaluation scores for SMT system with anaphora model

                     P      R      F
News  Baseline     0.317  0.343  0.330
      predicted    0.313  0.338  0.325
TED   Baseline     0.451  0.444  0.447
      predicted    0.454  0.443  0.449
      gold         0.444  0.435  0.440

For the “predicted” condition of the TED
experiment, only recall decreases while precision improves a little, so that
the F-score actually increases, but by an entirely negligible amount. These
figures do not bode well for our experiments, but we should remember that
the automatic metric matches the translation of pronouns against the reference translations without considering the actual anaphoric relations, so its
validity is debatable. In sum, the evidence of the automatic scores is neutral
or slightly negative, but the negative effects are small and further investigation is warranted nevertheless.
9.4 Manual Pronoun Annotation
To evaluate the performance of our anaphora model in a more focused way,
we have developed a manual annotation protocol that allows us to collect
gold standard annotations of pronoun choice in machine-translated context.
Our annotation scheme generates information that can be used not only for
testing how well a given MT system translates pronouns, but also to gain
insights about the pronoun evaluation task as such by comparing this evaluation method with a similar method based on reference translations.
Because of the limited time and annotator resources that were available
for the manual evaluation, we only evaluated two SMT systems in this way,
the News system with predicted anaphoric links and the TED system with
gold-standard anaphora annotations. We decided to include one News and
one TED system to cover both text types we have systems for. Among the
two TED systems, we gave preference to the one with gold-standard annotations even though it differs from the News system in two essential variables
because we were unsure if the predicted anaphoric links were sufficiently
good and because we conjectured a priori that the anaphora model with goldstandard annotations was more likely to have a positive effect on SMT performance.
[Screenshot of the web-based annotation interface: up to five sentences of English source text, each followed by its French machine translation, with the English pronoun to be annotated and a placeholder XXX in the French output marked up; below, buttons for the pronoun options il, elle, ils, elles, ce and cela, the shortcut il/ce, the categories Other, Bad translation, Discussion required and Multiple options possible, a progress indicator and a copy of the annotation guidelines.]

Figure 9.1. Pronoun annotation interface

9.4.1 Annotation Task Description
The main difficulty in evaluating pronoun translations is finding out what
the correct translation of a given pronoun is. Usually, MT evaluation assumes
that greater overlap between the MT hypothesis and a human-generated
reference is a sign of better translation quality. When it comes to translating
pronouns, this assumption is problematic because a pronoun that matches
the reference translation may actually be less correct than another if the
context is different. To evaluate the pronoun translation correctly, we must
therefore find out how the pronoun should be represented in the target
language context of the translation, which is exceedingly difficult to do
automatically. A human language user, however, can make this decision
fairly quickly. The task requires no expert knowledge other than some
proficiency in the source and target language.
Fig. 9.1. Each example corresponds to one instance of an English pronoun
it or they. The annotators are presented with the sentence containing the
pronoun and some preceding context. Up to five sentences of context are included, but fewer if the example is close to the beginning of a document. For
all sentences, we also show a translation generated by the MT system to be
evaluated. In most examples, a placeholder is inserted in the MT output of
the last sentence containing the pronoun to be annotated. The placeholder
replaces any pronoun linked by word alignment to the English pronoun being annotated. As a French pronoun, we consider any word listed with a pronoun
part-of-speech tag (pro or any tag starting with cl) in the Lefff vocabulary
(Sagot et al., 2006). The annotators are then asked to identify the pronoun
that should be inserted into the French text instead of the placeholder to create the most fluent translation possible whilst preserving the meaning of the
English sentence as much as possible. If no French pronoun is aligned to the
English one, no placeholder is inserted, and the annotators are asked to find
out if a pronoun corresponding to the one marked up in the English source
should be inserted somewhere in the sentence.
The options available to the annotators include six very common French
pronouns and three additional categories to mark special cases. The six pronouns are the masculine and feminine singular and plural forms of the subject
pronouns, il, elle, ils and elles, as well as the pronoun ce of the c’est ‘it is’ construction and the frequently used demonstrative pronoun cela ‘this’. Among
these six pronouns, multiple choices are possible if the annotators consider
that several equally good completions are available. The three additional categories are named other, representing any pronoun other than the six just
mentioned, bad translation, indicating that the machine translation is so
bad that it cannot be meaningfully completed with a pronoun, and discuss
(called “Discussion required” in the on-line interface) to mark that the annotator is unsure how to handle a specific example. These three categories
cannot be combined with each other or any of the pronouns. The annotation interface is designed to permit annotating almost all examples with a
single mouse click. Multiple clicks are only necessary if an example should
be annotated with more than one pronoun. To allow for one-click annotation
in the relatively frequent case where both il est and c’est are acceptable, we
provide a special button named il/ce.
Since most of the MT output produced by our systems is not perfectly fluent
and the additional categories for special cases are very uninformative, we request the annotators to select a pronoun whenever reasonably possible and
ignore fluency problems as far as practicable. In particular, they are instructed to disregard any agreement violations that may arise when inserting pronouns for the placeholders. The detailed annotation guidelines are shown in
Fig. 9.2. They were shown to the annotators at the beginning of each annotation session and could always be consulted by scrolling down on the web
page with the annotation interface.
Annotation work of this type can be carried out fairly quickly at a speed
of about one example per minute. Our annotations were created by the author of this thesis, one of his advisors and two colleagues working at the
same department.1 One annotator is a native speaker of French, the others
are second-language speakers of French and native speakers of Germanic languages (Swedish or German).
1 We are indebted to Joakim Nivre, Marie Dubremetz and Mats Dahllöf for their help with the annotations.
For each example, you are presented with up to 5 sentences of English source text
and a corresponding French machine translation. In the last sentence, an English
pronoun is marked up in red, and (in most cases) the French translation contains a red
placeholder for a pronoun. You are asked to select a pronoun that fits in the context.
– Please select the pronoun that should be inserted in the French text instead
of the placeholder XXX to create the most fluent translation possible while
preserving the meaning of the English sentence as much as possible.
– If different, equally grammatical completions are available, select the appropriate checkboxes and click on “Multiple options possible”. The button “il/ce” is a
special shortcut for cases where these two options are possible.
– Select “Other” if the sentence should be completed with a pronoun not included
in the list.
– Select “Bad translation” if there is no way to create a grammatical and faithful
translation without making major changes to the surrounding text.
– Select “Discussion required” if you’re completely unsure what to do with a
particular example.
– Minor disfluencies (e. g., incorrect verb agreement or obviously missing words)
can be ignored. For instance, if the placeholder should be replaced with the
words c’est, just select “ce”.
– You should always try to select the pronoun that best agrees with the antecedent
in the machine translation, even if the antecedent is translated incorrectly, and
even if this forces you to violate the pronoun’s agreement with the immediately
surrounding words such as verbs, adjectives or participles. So if the antecedent
requires a plural form, but the placeholder occurs with a singular verb, you
should select the correct plural pronoun and ignore the agreement error.
– If the French translation doesn’t contain a placeholder, you should check if a
pronoun corresponding to the one marked up in the English source should be
inserted somewhere and indicate which if so.
– If the French translation doesn’t contain a placeholder, but it already includes
the correct pronoun (usually an object pronoun like le, la or les), you should
annotate the example as if there had been a placeholder instead of the pronoun
(i. e., click on “Other” in the case of an object pronoun).
– Prefer “Bad translation” over “Discussion required” if you’re unsure because
the translation is dodgy. Reserve “Discussion required” for cases where there is
a problem with the guidelines. And don’t spend too much thought about the
distinction between these two categories, if in doubt, pick the one that came to
mind first.
Figure 9.2. Guidelines for the pronoun annotation task
Table 9.5. Pronoun annotation agreement

                         Exact match              Overlap
Annotator            1     2     3     4      1     2     3     4
1                   50    40    37    30     50    44    40    30
2                   40    50    33    28     44    50    40    32
3                   37    33    50    24     40    40    50    26
4                   30    28    24    50     30    32    26    50
Off-diagonal mean 35.7  33.7  31.3  27.3   38.0  38.7  35.3  29.3
9.4.2 Annotation Characteristics
To test the annotation scheme, we collected annotations for a set of 50 examples, of which 24 are taken from the pronoun-enriched News test set and
26 come from the TED test set. The examples were sampled randomly from
the two test sets. In total they comprise 32 examples of it and 18 examples of
they (15 it and 9 they from News data and 17 it and 9 they from TED data).
This set was given to all annotators, so that each example was independently
annotated four times. The option to specify multiple pronouns was used very
sparingly by the annotators; only 7 out of 200 annotation records make use
of this possibility. 12 out of 50 examples (4 News, 8 TED) were labelled with
bad translation or discuss by at least one of the annotators. When we inspected these cases after the annotation was completed, we recognised that
there was very little difference in the way these two labels were used. The tag
discuss almost universally indicates some problem with the translation, and
in the following discussion, we draw no distinction between the two labels.
In a very small number of examples, the annotators missed a category none
to indicate that no pronoun was required. As this was very rare, we decided
not to modify the list of categories after creating the initial annotations and
use the other category for this purpose instead. In new annotation tasks,
however, we recommend adding such a category.
Table 9.5 shows the extent to which the annotators agree. The left part of
the table, labelled Exact match, contains the number of examples for which
the annotations agree exactly. In the right part, labelled Overlap, we only
require that there should be at least one option that both annotators consider
acceptable, regardless of whether one of them also admits other possibilities.
We compute inter-annotator agreement in terms of Krippendorff’s α (Krippendorff, 2004) and Scott’s π (Scott, 1955) with the software included in the
NLTK toolkit (Bird et al., 2009). Over all four annotators, we obtain α = 0.613
and π = 0.189, which suggests significant disagreement. It turns out that a
substantial part of the disagreement can be pinned down to a single annotator. If we do not consider the contributions of annotator 4, we reach much
better agreement scores of α = 0.742 and π = 0.679. Since it seems acceptable
to work with three annotators at this level of agreement and we lacked the
time for extensive annotator training and guideline revisions, we distribute the
examples of the following evaluations in roughly equal shares among annotators
1 to 3, two second-language speakers of French and one native speaker.

Table 9.6. Pronoun evaluation contingency table for 80 paired examples

                            with anaphora models
                      News                              TED
Baseline       –        +       O      B        –        +       O      B
–           11 (10)   3 (1)   0 (0)  1 (1)   11 (10)   0 (0)   1 (1)  0 (0)
+            1 (1)   19 (19)  0 (0)  0 (0)    2 (1)   22 (22)  0 (0)  0 (0)
O            1 (1)    0 (0)   2 (2)  0 (0)    0 (0)    0 (0)   4 (4)  0 (0)
B            0 (0)    0 (0)   1 (1)  1 (1)    0 (0)    0 (0)   0 (0)  0 (0)

The figures in parentheses indicate the number of cases with identical pronouns.
–: wrong pronoun  +: correct pronoun  O: labelled other
B: labelled bad translation or discuss
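Agreement coefficients of the kind reported above can be computed with the agreement module of the NLTK toolkit; the following sketch uses invented toy data in the (annotator, example, label) format the module expects:

    from nltk.metrics.agreement import AnnotationTask

    # Toy records, invented for illustration only.
    data = [("a1", "ex1", "il"),    ("a2", "ex1", "il"),
            ("a3", "ex1", "ce"),    ("a4", "ex1", "il"),
            ("a1", "ex2", "elles"), ("a2", "ex2", "elles"),
            ("a3", "ex2", "elles"), ("a4", "ex2", "ils")]

    task = AnnotationTask(data=data)
    print("Krippendorff's alpha:", task.alpha())
    print("Scott's pi:", task.pi())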
9.4.3 Anaphora Model Evaluation
In a first human evaluation round, we annotate a set of 80 example pairs randomly drawn in equal parts from the News and the TED test corpora.
Each pair consists of a translation created by the baseline SMT system and a
translation created by the SMT system with anaphora models, annotated following the guidelines outlined above. Depending on the annotations and the
pronoun translation generated by the SMT system, we classify each example
into one of four categories. For examples assigned one or more of the labels
il, elle, ils, elles, ce or cela by the human annotator, we determine whether
the MT system emitted a matching pronoun. Matching is performed case-insensitively. In addition to the six pronouns literally corresponding to the
class names, we consider c’ to be an instance of the class ce and ça to be an
instance of the class cela. If the translation of an example with a pronoun label is a match according to these criteria, we classify it as a positive example
(+), otherwise as a negative example (–). Examples labelled as other by the
human annotators are assigned to class O if the translation generated by the
MT system does not correspond to any of the pronoun categories, otherwise
to class –, and examples labelled as bad translation or discuss are assigned
to class B regardless of the MT output.
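The classification rule can be stated compactly in code; in this illustrative sketch, labels is the set of labels selected by the annotator and mt_pronoun is the pronoun emitted by the MT system, or None if it emitted none:

    PRONOUN_CLASSES = {"il": "il", "elle": "elle", "ils": "ils",
                       "elles": "elles", "ce": "ce", "c'": "ce",
                       "cela": "cela", "ça": "cela"}

    def classify(labels, mt_pronoun):
        # "-" stands for the negative class written as – in the text.
        if labels & {"bad translation", "discuss"}:
            return "B"
        mt_class = PRONOUN_CLASSES.get((mt_pronoun or "").lower())
        if labels & {"il", "elle", "ils", "elles", "ce", "cela"}:
            return "+" if mt_class in labels else "-"
        # Labelled other: class O unless the MT system emitted one of
        # the six pronoun classes after all.
        return "O" if mt_class is None else "-"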
Table 9.6 shows contingency tables indicating the classification of the example pairs in the baseline system and in the anaphora-enabled system. The
rows of the table correspond to the classes in the baseline output and the
columns to the classes in the output of the document-level system.

Table 9.7. Contingency table for 88 paired examples with different pronouns

                      with anaphora models
                  News                    TED
Baseline     –     +     O     B      –     +     O     B
–            9    19     0     2     11     9     2     0
+           11     4     0     0     12     5     0     0
O            0     0     0     0      0     0     1     0
B            1     1     0     0      1     0     0     0

–: wrong pronoun  +: correct pronoun  O: labelled other
B: labelled bad translation or discuss

There
are three factors that can cause an example to migrate from one category to
another and end up in an off-diagonal cell of the contingency table. Firstly,
the document-level decoder with its anaphora model can alter the translation of a pronoun. Secondly, it can modify the surrounding context or the
antecedent translations so that another pronoun becomes appropriate. Such
changes may be triggered by the anaphora model or by slight differences
between the language models used in the two passes. They could also occur when the local search decoder discovers and corrects search errors made
by the baseline decoder. Finally, an example may be assigned to a different
category because of inconsistencies in the manual annotation.
In Table 9.6, the anaphora model hardly seems to have any effect. For the
purposes of this evaluation, we are primarily interested in the positive and
negative examples in the upper left corner of the matrices. Here, only 4 of
34 News examples and 2 of 35 TED examples are categorised differently after
running the second-pass decoder. Comparing the pronoun translations in the
baseline output with those in the second-pass system output reveals that the
pronoun translations are identical in the vast majority of cases. This observation does not enable us to draw any definite conclusions about the behaviour
of the system because the correctness of the pronoun translation depends,
in addition to the pronoun itself, on the context and the antecedent translations. However, it does raise suspicion that the translations of the baseline
and of the anaphora-enabled system may be equivalent in many cases. To
evaluate our anaphora model, we need to know whether its effect is positive
or negative in those cases where there actually is an effect.
As an approximation to the examples influenced by the anaphora model,
we consider the subset of examples where the final translation after document-level decoding has a translation of the pronoun which is different from that
of the baseline. This is the case for 151 out of 1,457 News examples and 63
out of 566 TED examples, considering only examples where the source pronoun is aligned to a pronoun in the target language in both the baseline and
the document-level system. Pronoun comparisons are performed in a case-sensitive manner and only exact literal matches are considered equal because
anything else but a literal, case-sensitive match indicates a motivated choice
by the SMT system. From this subset, we consider a random sample of 47
News examples and 41 TED examples. The results are reported in Table 9.7.
In 60 of 88 example pairs (68.2 %), at least one of the two translations produced by the baseline and the document-level system matches the preference
of the annotators. Additionally, some items in the O class may be correct
as well. However, the number of items assigned to the classes O and B is
small, so we concentrate our discussion on the positive and negative pronoun
classes (– and +). In the News text genre, the number of examples migrating
from – to + (19 items) is distinctly larger than the number of examples moving in the opposite direction (11 items). However, the difference is not large
enough to be significant at a 90 % confidence level in Liddell’s test (Liddell,
1983). Surprisingly enough, in the TED data, where gold-standard coreference annotations are used, the anaphora model seems to cause some damage,
with 12 items going from + to – and only 9 from – to +. Needless to say, this
difference is far from being statistically significant.
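As a rough plausibility check (not Liddell's test itself, but the closely related exact sign test on the discordant pairs), the following SciPy snippet reproduces the picture of non-significance:

    from scipy.stats import binomtest

    # Discordant pairs from Table 9.7, two-sided sign test.
    print(binomtest(19, 19 + 11).pvalue)  # News: p ≈ 0.20
    print(binomtest(9, 9 + 12).pvalue)    # TED: p ≈ 0.66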
Considering the small sample size and the absence of statistical significance, we cannot rule out the possibility that the effects we observe are due to
random variations. Even so, the relatively large positive effect in the News
experiment attracts attention, and so does the unexpected negative outcome
of the TED experiment. In our opinion, both results deserve further investigation. Most importantly, the manual evaluation should be continued with
larger samples than the time constraints for this thesis have permitted us to
examine. This will allow us to test if the effects persist and become significant or if they must be dismissed as random. Additionally, assuming the effect
observed in the News experiment is confirmed, a similar evaluation should
be conducted for the “predicted” condition of the TED experiment to find out
if it bears more resemblance to the “predicted” condition of the News experiment or to the “gold” condition of the TED experiment. Both hypotheses are
possible.
On the one hand, the BLEU scores in Table 9.3 suggest a greater similarity
between the two “predicted” conditions than between the two conditions of
the TED experiment. This is a highly dubious indication because the score
differences are quite small and we have strong reasons to distrust BLEU as a
measure of pronoun translation accuracy. However, if the TED experiment
should prove more successful with predicted anaphora resolution than with
gold-standard annotations, this would be a very interesting result. The mismatch between training and testing conditions could be a possible explanation for such a finding. At training time, the distribution over antecedent
candidates encoded by the network’s V layer will generally have a fairly large
entropy because of the great uncertainty of the anaphora resolution process.
It is not impossible that the unexpected use of a very sharp distribution concentrating all its probability mass on a single item at testing time has unintended effects on the operation of the network, even if the distribution is known
to be correct.
On the other hand, even if the positive effect in the News experiment subsists at larger sample sizes, it may be more difficult to achieve comparable
performance for the text genre encountered in the TED talks. In the intrinsic
evaluations of Chapter 8, we found that the pronouns in the TED data are
easier to predict than those in the News data. However, this may well be due
to the fact that there is less entropy in the prior distribution of the pronouns,
as evidenced by the better accuracy of the majority class baseline (Table 8.3,
p. 120). Despite their superior overall performance, it is not clear that the TED
networks actually perform better when predicting more difficult edge cases.
However, the prior distribution of the pronouns should already be matched
well by the language model of the baseline SMT system, so there may be less
room for improvement in the TED experiment.
9.4.4 Agreement with Reference Translation
With the manual annotations created to evaluate the anaphora model and the
human reference translations of the test sets, we have two very different and
mutually independent types of gold-standard information on the translation
of pronouns in our test corpora. The reference translations indicate how a
text, including the pronouns it contains, can be translated in a correct manner.
We assume that the human translators producing the translations have a good
understanding of the source text and of the target language norms and create
high-quality output even if there is no one-to-one correspondence between
source language and target language elements, and even if target language
conventions dictate pronoun usage patterns that are not strictly consistent
with source language usage. However, the correctness of the reference translation can only be guaranteed for the translation as a whole. If only some bits
and pieces of a candidate translation tally with the reference translation while
other parts diverge, we cannot be sure that the total result will be acceptable.
The manual annotations, by contrast, are specifically created for a particular candidate translation generated by an MT system. Even if that candidate
translation as a whole is inferior to the reference translation, the pronoun
translation suggested by the manual annotations is more reliable than the
one suggested by the reference translation because it is consistent with the
context of the machine translation. From a theoretical point of view, the translation of a pronoun found in the reference text cannot be a valid solution in
the MT context other than by chance. However, in practice, reference translations are routinely used to calculate automatic quality scores for all parts of
MT output, including the translations of input pronouns. It is therefore pertinent to examine to what extent the translations of pronouns in a reference
translation corpus are useful to evaluate candidate translations produced by
an MT system.
To compare reference translations with manual annotations, we first create a set of pseudo-annotations based on the references. We generate word
alignments between the reference translations and the input texts by concatenating them with the parallel training corpus and running the same word
alignment procedure that we use for training the SMT system. Then we construct an annotation record for every occurrence of the pronouns it and they
in the input, setting the label in accordance with the target language element
aligned to the input pronoun. Again, c' is counted as an instance of ce
and ça is counted as an instance of cela. Comparisons are performed case-insensitively. No pseudo-annotation record is created if the input pronoun is
not aligned to a target language word in the reference translation.
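In outline, the extraction can be sketched as follows; is_pronoun is assumed to consult a pronoun list such as the Lefff-based one used for the manual annotation, and the skipped cases correspond to the footnote further below:

    PRONOUN_CLASSES = {"il": "il", "elle": "elle", "ils": "ils",
                       "elles": "elles", "ce": "ce", "c'": "ce",
                       "cela": "cela", "ça": "cela"}

    def pseudo_annotate(src_tokens, ref_tokens, alignment, is_pronoun):
        """One (position, label) record per aligned occurrence of
        'it'/'they' in the source; unaligned pronouns yield no record."""
        records = []
        for i, token in enumerate(src_tokens):
            if token.lower() not in ("it", "they"):
                continue
            aligned = [ref_tokens[j] for k, j in alignment if k == i]
            if not aligned:
                continue  # no record: pronoun unaligned in the reference
            pronouns = [w for w in aligned if is_pronoun(w)]
            if len(pronouns) > 1:
                continue  # skipped: aligned to several target pronouns
            if pronouns:
                label = PRONOUN_CLASSES.get(pronouns[0].lower(), "other")
            else:
                label = "other"  # aligned only to non-pronoun words
            records.append((i, label))
        return records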
The first thing to notice with these automatically generated pseudo-annotations is that many pronoun occurrences are not covered by them. Of
the 1,547 examples of it or they extracted from the News reference translation, 517 (33.4 %) are not aligned to a pronoun in the target language.2 In
the TED data, 245 of 735 examples (33.3 %) are not aligned to pronouns. A
superficial manual inspection of the data reveals that the word alignment is
usually correct in these cases. Most often, these are genuine examples of
translations where pronoun usage differs between the source and the target
language. This does not necessarily mean that it would be impossible to translate the input in a way that preserves the pronoun usage of the source language while respecting the target language norms, but it makes it impossible
to evaluate these examples with the information contained in the reference
translations and greatly reduces the usefulness of any evaluation scheme that
relies on reference translations for pronoun evaluation. This observation applies to the automatic pronoun evaluation metric of Section 7.3 as well as to
the pseudo-annotations considered here.
Because of the great number of examples for which simple pronoun correspondences cannot be extracted from the reference translation, the effective
sample size of the pronoun evaluation is reduced and it becomes more difficult to appraise the significance of an effect. Moreover, very likely the subset
of source pronouns that are not aligned to a target pronoun is not randomly
drawn from the total set of source pronouns, so ignoring it will bias the evaluation. Conversely, it could be argued that this subset will incorporate all
the examples for which a pronoun translation is linguistically impossible, so
it contains valuable information about how to translate those cases. This is
true, but since these will be examples where the reference translation deviates substantially from the wording of the input, the word alignment will be
unreliable, and the information will be far from trivial to exploit. Also, current
SMT models are highly unlikely to generate good output for these cases,
whereas they may well occasionally produce acceptable output for examples
where a direct pronominal translation is possible even if the creator of the
reference translation decided against using it.

2 The total number of examples extracted varies slightly across translations because we skip
examples when a source pronoun is aligned to more than one target pronoun. This occurs most
frequently when it is is rendered as il y a ‘there is’ and it is aligned to the pronouns il and y.

Table 9.8. Contingency table for 88 paired examples with different pronouns, evaluated
with pseudo-annotations

                      with anaphora models
                  News                    TED
Baseline     –     +     O     B      –     +     O     B
–           11    10     1     –     11     6     1     –
+            8     0     0     –     13     1     0     –
O            0     0     0     –      0     0     0     –
B            –     –     –     –      –     –     –     –

–: wrong pronoun  +: correct pronoun  O: labelled other
B: labelled bad translation or discuss
Table 9.8 shows the results obtained by evaluating the 88 example pairs
used in Table 9.7 with the pseudo-annotations generated from human reference translations. The category B is not used in this table because the pseudo-annotations never carry the labels bad translation or discuss. It turns out
that the pseudo-annotations are less likely to classify an example as correct
(+) than the human-made annotations specific to the MT system, which must
be considered as the gold standard in this comparison. The effect applies both
to the News and to the TED system. In both cases, the document-level system
is affected more strongly than the baseline. This may be an effect of chance,
but it would have led to an overly negative evaluation of the anaphora model
in this case.
In Table 9.9, the 176 manual annotations collected for the same 88 example pairs are pitted against the corresponding pseudo-annotations. Since
the manual annotations come in pairs, each pseudo-annotation occurs twice
in this table in combination with two different manual annotations. If we
consider only the first two rows, where there is either a clear match or a
clear mismatch with the manual annotation, the pseudo-annotation matches
the manual annotation in only 43 of 90 News cases (47.8 %). For 33 examples
(36.7 %), there is no pseudo-annotation, and in 14 cases (15.6 %), the pseudo-annotations flatly contradict the judgements of the human annotators. The
figures for the TED data are considerably better with 50 matches in 77 examples (64.9 %), 18 missing annotations (23.4 %) and 9 contradictions (11.7 %),
but even in the TED corpus, pseudo-annotations are either incorrect or missing for more than one third of the examples.
Table 9.9. Contingency table for manual annotations versus pseudo-annotations

                            pseudo-annotations
                      News                      TED
manual            –     +     O     ∅       –     +     O     ∅
annotations
–                29     4     0    18      32     3     0    11
+                10    14     0    15       6    18     0     7
O                 0     0     1     1       3     0     1     0
B                 2     0     0     0       1     0     0     0

–: wrong pronoun  +: correct pronoun  O: labelled other
B: labelled bad translation or discuss  ∅: no annotation
The results suggest that the pseudo-annotations, and very probably also
other reference-oriented measures such as our pronoun evaluation metric of
Section 7.3 and BLEU, misrepresent the correctness of anaphora translations
and will not do justice to improvements achieved by specific anaphora handling components. The severity of the problem is corpus-dependent, but it is
clearly present in both of the corpora we have examined. This finding confirms the theoretically motivated hypothesis that reference-oriented measures are insufficient to guide the development of systems modelling complex
target-side dependencies.
9.5 Conclusion
In the experiments in this chapter, we have tested the pronoun prediction
model developed in the previous chapter in practical SMT systems for two different text genres. While the model has no effect on the automatic evaluation
scores, manual evaluation of the News experiment with predicted anaphora
resolution reveals a mildly positive result in that the number of improvements
exceeds the number of regressions by a small margin. This result, however,
is modest and uncertain, and it is not borne out by the parallel experiment
on TED data with gold-standard anaphora resolution. Nevertheless, the SMT
implementation of the anaphora model and its subsequent evaluation have
yielded a number of interesting insights.
First of all, the experiments afford a new confirmation that the decoding
algorithm developed in the first part of this thesis is viable for practical use.
After examining the output of the document-level decoder, we have no reason
to suppose that the limited success of the anaphora model is due to the fact
that the decoder fails to improve the model scores. Rather, the shortcomings
we observe can be pinned down convincingly to difficulties of the task and
inadequacies of the feature models.
Another important insight is the recognition that it is mistaken to assume
a direct correspondence of pronouns across languages. This fact has not yet
been internalised sufficiently by the SMT community. Early work on pronouns in SMT, including our own, naïvely assumed that pronouns were anaphoric as a rule and that anaphoric pronouns, barring rare exceptions, could
be mapped directly onto corresponding target language pronouns (Le Nagard
and Koehn, 2010; Hardmeier and Federico, 2010). This is not what we find in
corpus data. Even though we recognised this problem before developing our
pronoun prediction model (Chapter 6), it turns out that the capacity of our
model to cope with it is still insufficient and that more sophisticated modelling will be required for an adequate solution.
The results of the manual evaluation of our anaphora model, while inconclusive, are intriguing. In the News experiment with predicted anaphoric
links, we observe a small improvement over the baseline. The improvement
is not statistically significant, but it is strong enough to nurture hope that
it will survive and prove significant when larger samples are studied. By
contrast, and quite contrary to what we originally expected, we find no improvement in an experiment with TED data and gold-standard coreference
annotations. We have advanced some speculations as to why this might be
the case, but only more empirical work can show to what extent they are
true.
Finally, the comparison of manual and automatic evaluations for our anaphora model has uncovered deficiencies in the automatic evaluation procedures that were already known in theory, but whose actual impact had not
been demonstrated empirically. Based on the results presented in this chapter,
we can state with some confidence that BLEU and other reference-oriented
evaluation measures are insufficient tools for the development of models of
pronominal anaphora and similar phenomena involving complex target-side
dependencies. Currently, we cannot suggest a better automatic evaluation
score for this purpose, but the manual evaluation protocol described above
permits the collection of targeted and more reliable annotations at a relatively
low cost.
10. Conclusions
In this thesis, we address discourse-level aspects of translation in phrase-based SMT from different points of view. Throughout our work, we have
been confronted with both technical and linguistic challenges. The technical
challenges are related to the independence assumptions made by existing
SMT solutions, and correspondingly, our first research goal has been to develop frameworks, procedures and algorithms that are not encumbered by the
standard assumptions of sentence-level independence. As a response to this
challenge, we have developed and explored a framework for document-level
decoding and released the Docent decoder (Hardmeier et al., 2013a). The linguistic challenges, on the other hand, are related to our second research goal,
to investigate what discourse-level linguistic phenomena can be useful for SMT,
and how to model them in an SMT system. We have studied different types of
discourse-level information, but our principal effort is dedicated to the problem of pronoun translation. We investigate the behaviour of pronouns under
translation, present a neural network model to predict the French translations of English pronouns and integrate this model into a phrase-based SMT
system. In this chapter, we recapitulate the findings of our thesis, discuss
the insights gained and contributions made and highlight some issues that
should be addressed in future work.
10.1 Document-Level SMT
In the first part of the thesis, we study the technical problems that we encounter when integrating document-level features into SMT decoding. We
show how the widely used stack decoding algorithm exploits locality assumptions to speed up decoding with a dynamic programming technique called
recombination, which makes it unsuitable for use with features that have
long-range dependencies. Decoding with such features requires special techniques to overcome the independence assumptions of the decoding algorithm.
We discuss three methods that enable us to combine document-level features with sentence-level decoding algorithms, by decoding in two passes,
by propagating information between sentences during a single decoding pass
or by running a second-pass search with a different algorithm over a subset
of the search space represented by the n-best output of a stack decoder. All
of these methods have been used in the literature. They trade off modelling
constraints, search space, ease of implementation and efficiency against each
other in different ways.
The core contribution of the first part of this thesis is the development of
a new local search decoder for phrase-based SMT at the document level. It
embodies a new approach to decoding with document-level models which
makes trade-offs that are very different from those of the existing methods.
The assumption of sentence independence is radically removed and the modelling constraints it causes, such as the dependency directionality constraint
in the information propagation approaches, are lifted. The search space accessible to the local search decoder is equal, at least in principle, to the full
search space of phrase-based SMT.
As regards ease of implementation, the decoding algorithm is geared towards complex document-level models. Simple models with local dependencies such as an n-gram language model can be considerably more complicated to implement in the document-level local search framework than in a DP
stack decoder because the stack decoder constructs its output in an order that
is particularly well suited to n-gram-style dependencies and all the required
information is readily available. The local search decoder, by contrast, gives
the programmer complete freedom to define the dependencies of the model,
and it is about as difficult to define a model with remote dependencies across
sentence boundaries as one with local dependencies only.
The most important trade-off made by the document-level decoder is that
of efficiency. By exploiting the locality of the models with dynamic programming, the traditional stack decoder manages to explore a comparatively large
part of the search space with relatively little effort, even though it still has
to resort to pruning to ensure polynomial runtime. The local search decoder
does not have this advantage and potentially spends much more time covering an equivalent part of the search space. It is important to remember,
however, that the stack decoder’s efficiency advantage is tightly coupled to
the locality of the models. It only exists in a condition in which the local
search decoder is not designed to be used. As soon as the locality constraints
on the models are softened and long-range dependencies are admitted in the
models, the DP technique in the stack decoder becomes less effective, and
its head start begins to vanish. If the dependencies are left completely unconstrained, DP is no longer applicable and the stack decoder will not necessarily
be more efficient than the local search decoder any more.
Fusing the efficiency of stack decoding with the versatility of document-level local search, we show that the local search decoder can be initialised
with a search state obtained from a stack decoder. In this setup, the DP search
of the first pass solves a relaxed version of the decoding problem from which
the constraints involving long-range dependencies have been omitted. While
there is no theoretical guarantee that the state found by DP search with the
relaxed models is a good starting point for the document-level search, it is
reasonable to assume that it is generally better than a random point in the
search space, especially if the overall model of the document-level search pass
is relatively similar to that of the DP search pass. We test this decoding setup
with different discourse-level models, including a semantic space language
model, a collection of readability models and a pronominal anaphora model.
We have not evaluated these experiments specifically for decoding performance, but clearly the decoder is capable of improving the model score and
even of overfitting to peculiarities of the models in all cases, and we find no
indications of fundamental problems with the search method in any of the
experiments. We conclude that local search with DP initialisation is a viable
solution for experimenting with discourse-level models in phrase-based SMT.
One of the principal benefits of having a decoder that admits unlimited
document-level dependencies, and our main motivation for creating and releasing this piece of software, is that it enables researchers to experiment
freely with discourse-level models without imposing technical restrictions
on the space of imaginable models from the beginning. The availability of
a document-level decoding framework should make it possible to test ideas
that would otherwise be abandoned in an early stage because the expected
cost of implementation is considered too high in relation to the probability
of success. Once a particular method has been demonstrated to work and
is ready to be incorporated into a production system, other techniques than
local search may prove more effective depending on the nature of the model.
In the work presented in this thesis, we show that the local search method
works for phrase-based SMT decoding, but we do not explore its parameters
very thoroughly, focusing instead on the development of discourse-level feature models. Now that a number of models have been developed, there are
many aspects of the search process that merit closer attention. The acceptance criterion of the local search algorithm suggests itself as a starting point.
Hill climbing reliably directs the search towards higher-scoring regions of
the search space, but theoretical considerations suggest that it may fail to
find optimal solutions because it requires a score improvement at each individual search step. However, some improvements may only be achievable if
the decoder is permitted to make one or more intermediate steps to states
with lower scores first, e. g., to split up a phrase pair into smaller pieces that
can be manipulated independently.
To enable the decoder in a principled way to explore search paths in which
the model scores do not increase monotonically, we can employ the stochastic
Metropolis-Hastings acceptance criterion and perform simulated annealing
instead of hill climbing. In initial experiments with simulated annealing not
reported in this thesis, there was evidence of significant problems. Even when
started in a relatively high-scoring state, the decoder would quickly abandon
promising regions of the search space and wander off towards very bad states
without finding its way back to any acceptable solutions within a reasonable
period of time.
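The acceptance criterion itself is easily stated: in place of the strict improvement test in the hill climbing loop, the following sketch (ours; the geometric cooling schedule and its parameters are invented for illustration) accepts deteriorations with a temperature-dependent probability.

    import math
    import random

    def anneal_accept(delta, temperature):
        # Metropolis-Hastings-style acceptance: improvements (delta >= 0)
        # are always taken, while a deterioration is accepted with
        # probability exp(delta / temperature), so the search may pass
        # through lower-scoring intermediate states.
        if delta >= 0:
            return True
        return random.random() < math.exp(delta / temperature)

    def temperature(step, t0=10.0, alpha=0.9999):
        # Geometric cooling: high temperatures early on permit long
        # excursions; as the temperature falls, the criterion approaches
        # plain hill climbing.
        return t0 * alpha ** step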
We surmise that these search problems are connected with the set of search
operations we use, and particularly with the fact that the combination of the
proposal distribution and the acceptance criterion of our decoder does not satisfy the elementary theoretical conditions guaranteeing convergence of the
simulated annealing procedure. However, by adding operations that tie the
decoder to the hill climbing path and limit the duration of excursions to lower-scoring regions of the search space, the effectiveness of simulated annealing
search can be greatly increased despite the persistence of the theoretical difficulties. In future work, the design and selection of search operations for
both hill climbing and simulated annealing and the interaction between the
proposal distribution and the search algorithm in simulated annealing should
be investigated more thoroughly and with greater focus on theoretical convergence results.
Another problem that urgently needs more attention is feature weight tuning. Stymne et al. (2013a) present an adaptation of the MERT algorithm (Och,
2003) to document-level decoding, also described in Section 4.7 of this thesis,
and show that it achieves useful results under certain circumstances. However, when we try to apply the same method to our system with pronominal
anaphora models in Chapter 9, MERT completely fails to converge. We conjecture that this failure is due to poor sampling parameters in the generation
of n-best lists, but owing to time constraints we could not study the problem more
closely.
Instead of using MERT, feature weight optimisation could be performed
with the PRO method (Hopkins and May, 2011) that estimates the weights as
parameters of a linear classifier trained to separate good states encountered
by the decoder from bad ones. Stymne et al. (2013a) do test PRO with the
sampling method they also use for MERT, but training data for PRO could
potentially also be collected by making the decoder search directly for a state
with optimal BLEU score or, preferably, some other measure more sensitive to discourse-level aspects of translation quality. This
option is currently being explored in ongoing work at Uppsala University.
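In outline, PRO reduces weight tuning to binary classification over sampled hypothesis pairs. The following sketch is ours: it assumes hypotheses holds (feature vector, metric score) pairs collected from decoder output, and it uses scikit-learn's logistic regression where Hopkins and May (2011) used a different learner.

    import random
    from sklearn.linear_model import LogisticRegression

    def pro_tune(hypotheses, n_pairs=5000):
        # hypotheses: list of (feature_vector, metric_score) pairs.
        # Each sampled pair yields a training example: the difference of
        # the two feature vectors, labelled by which hypothesis scores
        # better under the translation quality metric.
        X, y = [], []
        for _ in range(n_pairs):
            (f1, s1), (f2, s2) = random.sample(hypotheses, 2)
            if s1 == s2:
                continue
            X.append([a - b for a, b in zip(f1, f2)])
            y.append(1 if s1 > s2 else 0)
        # The weights of the linear separator serve as feature weights.
        clf = LogisticRegression(fit_intercept=False).fit(X, y)
        return clf.coef_[0]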
In sum, there are still a number of issues related to document decoding that
require further study, but even now, the decoding method we present has proved to be an enabling factor for a number of experiments with discourse-level models, including the work on anaphora in the second part of the thesis,
which demonstrates its usefulness at least as a research tool.
10.2 Pronominal Anaphora in SMT
In the second part of this thesis, we turn to the issue of pronominal anaphora.
We start by examining the translations of pronouns in German–English MT
output and verify that pronoun translation is, in fact, a problem for SMT. We
find that the adequacy of pronoun translations varies greatly across different
types of pronouns and, as a function of the prevalence of certain pronoun
types in the individual documents, across documents. The overall accuracy
is on the order of 60 % and considerably lower for some pronoun types affected by morphological syncretism with other, more frequent forms in the source language, such as feminine singular pronouns in German. Depending
on the contents of the documents translated, such pronouns may be rare, but
pervasive mistranslation of particular types of pronouns is vexatious for the
reader and may even create an appearance of disrespect, especially if there is
a noticeable gender bias in the way pronouns are translated (Gendered Innovations, 2014). We therefore conclude that pronoun translation is a problem
with some practical impact in current state-of-the-art SMT.
Having established this fact, we discuss a number of complications that
arise when modelling pronominal anaphora in an SMT system. The pronoun
translation task is complex and requires inference over information collected from a number of sources and produced by a variety of components, each of which suffers from uncertainty and is liable to add a certain amount of noise to the system. Since each of the individual components involves highly complex reasoning, the accumulated noise can easily drown out all useful information in the system.
After a brief description of an early approach to pronoun modelling in
SMT and its evaluation, we introduce a neural network classifier that models
cross-lingual pronoun prediction as a task in its own right, independently of
an MT system. In terms of raw accuracy, the neural network improves
over a simple maximum entropy classifier. However, the improvement is not
very large, presumably because the distribution of the pronouns in the data
is heavily skewed so that it is relatively easy to attain high accuracy just by
predicting the most frequent classes more frequently; for one of the two text
genres tested, the accuracy of the maximum entropy classifier is only marginally higher than that of a trivial majority choice baseline. Still, the neural
network has considerable advantages over the baseline because it delivers acceptable precision and recall for all output classes, whereas the baseline only
performs well for the more frequent target language pronouns. In particular, it greatly improves the prediction performance for the French feminine
plural pronoun elles. We use elles as an indicator of progress, because to predict this pronoun correctly, the classifier must exploit information from the
antecedents of the pronouns and cannot rely on unconditional frequency distributions and the immediate context of the pronouns alone.
An important feature of our neural network classifier is its capability to
model the links between anaphoric pronouns and their antecedents as latent
variables, eliminating the need for an external coreference resolution system
trained on manually annotated data. Instead, we extend the network with a
small number of extra layers to model the probability of anaphoric links given
a set of features prepared with the feature extraction machinery of the existing anaphora resolver. We then train these layers jointly with the pronoun
prediction layers by backpropagating the error gradients all the way from
the pronoun prediction network into the anaphoric link scoring component,
using unannotated parallel text as the only supervision. The fact that this approach works just as well as using the predictions of the external coreference
resolution system reveals that parallel corpora contain valuable information
about pronominal coreference that had never been exploited in SMT prior to
our work, and only to a small extent in coreference resolution research.
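Structurally, the latent-link idea can be rendered as in the sketch below. It is ours and deliberately anachronistic (written against the PyTorch toolkit, which postdates the implementation described in this thesis), and all dimensions and layer sizes are invented; the essential point is that the softmax over antecedent candidates receives no direct supervision, the only training signal being the loss on the predicted target-language pronoun.

    import torch
    import torch.nn as nn

    class LatentAnaphoraPronounPredictor(nn.Module):
        # Sketch only: link scores over antecedent candidates form a
        # latent softmax; its expectation feeds the pronoun classifier,
        # so backpropagating the pronoun loss also trains the link scorer.
        def __init__(self, ctx_dim, ant_dim, link_dim, n_pronouns, hidden=50):
            super().__init__()
            self.link_scorer = nn.Sequential(
                nn.Linear(link_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            self.classifier = nn.Sequential(
                nn.Linear(ctx_dim + ant_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, n_pronouns))

        def forward(self, context, ant_feats, ant_reprs):
            # context:   (ctx_dim,)               source-context encoding
            # ant_feats: (n_candidates, link_dim) coreference features
            # ant_reprs: (n_candidates, ant_dim)  candidate encodings
            weights = torch.softmax(self.link_scorer(ant_feats).squeeze(-1), dim=0)
            antecedent = weights @ ant_reprs      # soft latent antecedent
            return self.classifier(torch.cat([context, antecedent]))

    # The label comes from bilingual data alone: the French pronoun
    # aligned to the English one, e.g.
    #   loss = nn.functional.cross_entropy(logits.unsqueeze(0), target)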
We conclude our experimental work by incorporating the pronoun prediction neural network as a feature model into the document-level local search
decoder. In doing so, we tie together all the major contributions of this thesis.
We test the resulting system on two text types, news data and TED talks. For
the TED talks, we have access to a test set with manually created annotations
of pronominal coreference, which gives us the opportunity to examine the
performance of this system both with the latent anaphora resolution of the
neural network and with the gold-standard anaphoric links in the manually
annotated data set.
In terms of automatic quality measures, the anaphora model has very little
effect on the performance of the SMT systems. The BLEU score remains all
but unchanged for all systems, and our own automatic pronoun evaluation
metric is inconclusive as well. If anything, it is surprising that the TED system
with gold-standard anaphora resolution fares worse than the corresponding
system with predicted anaphoric links, but the score differences are far too
small to draw conclusions with any degree of confidence.
Since we are well aware that the existing automatic evaluation measures
are inherently unreliable when it comes to studying pronoun translation, we
conduct a simple and rapid manual evaluation of two of our systems with
a small number of annotators, which provides us with information on the
most adequate translation of pronouns in the actual context of MT output.
The evaluation yields very interesting, if somewhat inconclusive, results. We
observe an improvement in pronoun translation for the News corpus with
predicted anaphoric links, but not for the TED corpus with gold-standard
annotations. While neither of the results is statistically significant, the outcome for the News corpus is strong enough to inspire hope that significance
might be attained if a larger sample were examined. The negative result in
the TED experiment tallies with the marginally negative result of the automatic evaluation and raises the intriguing question of whether the difference, if
indeed there is a difference in substance, is due to the features of the two text
genres tested or to the fact that the neural network trained for unsupervised
anaphora resolution is confused by the presence of gold-standard annotation.
At present, all of this is mere speculation because the observed effects are
very modest and chance is a factor to be reckoned with considering the small
samples we have examined. Even so, we believe the results are interesting
enough to warrant further investigation in future work. Furthermore, although the work presented in this thesis has not led to a breakthrough in
terms of translation quality, it has shed some light on the difficulties involved
in translating anaphoric pronouns with an SMT system.
First of all, we must recognise that pronoun translation is more difficult
than it seems, and more difficult than has been acknowledged even by those
SMT researchers who have made an effort to solve it. The complications
discussed in Chapter 6 are confirmed anew by the experimental results of
Chapter 8 and Chapter 9, and despite being aware of many of the challenges
when designing these experiments, we have not been able to avert all the
problems they cause.
The existing research on pronouns in SMT has largely concentrated on
the problems of resolving pronominal anaphora, identifying the translation
of the antecedent and injecting the information gained through anaphora
resolution into an SMT system. These are essential steps without which we
cannot hope to solve the pronoun translation problem. Aside from relying on
the effects of chance and the skewness of pronoun distributions, there is no
way around the fact that correctly generating a pronoun like the French elles
requires information about the translation of its antecedent, and obtaining
this information is difficult and has justly been the object of some research
efforts. However, what has been underestimated so far is that pronoun translation is a challenging discourse problem even if we leave aside the problem
of coreference resolution completely, and that it is qualitatively different from
translating content words.
Different languages have different conventions of pronoun use, and the
translation of pronouns is subject to arbitrary effects of linguistic conventions
to a much greater extent than the translation of content words. Consider, by
way of example, the case of company names, which is relatively frequent
in news texts. In English, as in other languages, companies are frequently
introduced with their name:
(10.1) a. A perfidious embezzler. This is how the French banking giant Société Générale, the owner of the local Komerční banka (Commerce
Bank), labels its ex-employee Jerome Kerviel.
b. Un fraudeur dissimulateur. Ainsi désigne son ancien employé le
géant français la Société générale, propriétaire de la banque tchèque
Komerční banka. (news-test2008)
In English, it is then common to refer back to the company name using the
pronoun it. In French, by contrast, it is often more idiomatic to refer to the
company name with a full noun phrase first, although it is not strictly impossible to use a pronoun directly:
(10.2) a. On his account it has lost almost five billion Euro.
b. La banque a perdu à cause de lui près de cinq milliards d’euros.
(news-test2008)
The following example exhibits two completely different complications.
On the one hand, it uses a highlighting idiom that is specific to the English
language and must be rendered with other means in French. On the other
hand, an English subordinate clause is mapped into a construction involving
a present participle which does not require an explicit subject pronoun.
(10.3) a. But the thing about tryptamines is they cannot be taken orally because they’re denatured by an enzyme found naturally in the human
gut called monoamine oxidase.
b. Par contre les tryptamines ne peuvent pas être consommées par
voie orale étant dénaturé[e]s par une enzyme se trouvant de façon
naturelle dans l’intestin de l’homme : la monoamine-oxydase.1
Note that both instances of the English word they are regular anaphoric pronouns with a clearly defined antecedent, yet neither of these pronouns occurs in the French reference translation. Moreover, translating a subordinate
clause with a finite verb and a pronominal subject into a participle or gerund
without an overt subject is frequently possible in various language pairs, including when English is the target language.
There is evidence suggesting that cases like these are far more common
in bilingual corpus data than one might believe. In addition to the anecdotal examples we have presented here and elsewhere in this thesis,
the overwhelming predominance of the other class in the training data of
the neural networks presented in Chapter 8 (Table 8.2, p. 119) and the great
number of English pronouns not aligned to French pronouns in the pseudo-annotations of Section 9.4.4 indicate that it is fairly common for pronouns
not to be rendered literally in translation, even though those figures may incorporate other special cases such as incorrect word alignments as well.
Now it could be argued that the translators creating these reference translations take excessive liberties with the input text and that they should be
instructed to translate more literally at least when producing reference translations for SMT research. However, this argument is fallacious. By requesting
more literal translations, we would force the translators to translate “verbum e verbo” (word for word), in the manner recognised as inadequate by the church father Jerome as early as the fourth century (Jerome, 1996). A consistently more literal rendering would amount to word glossing, not translation, and it would
have a strongly negative impact at least on the idiomaticity, if not on the
fluency of the target language text. Moreover, creating artificially literal reference translations for SMT use could have a lasting negative impact on the
progress of MT research because evaluating against these references would
favour the overly literal translation style of existing models while penalising
more sophisticated systems that may be developed in the future.
1 This example is taken from the dev2010 test set of the WIT3 corpus (Cettolo et al., 2012).
Rather than artificially simplifying the reference data, the only sustainable,
if challenging, way to cope with these difficulties is to analyse the relevant
phenomena and attempt to model them adequately. We expect that future
approaches to pronoun translation in SMT will require extensive corpus analysis to study how pronouns of a given source language are rendered in a
given target language and create a classification of these instances. While
it may not be possible to explain all cases satisfactorily with the means currently at our disposal, much would be gained if we could identify with some
confidence which cases are amenable to handling with our existing models to
prevent the system from introducing spurious errors in the remaining cases.
10.3 Final Remarks
The recent work on discourse in SMT, and the difficulties we and others have
experienced when trying to improve MT with discourse models, reveal some
basic weaknesses of the SMT approach. Most current approaches to SMT
are founded on word alignments in the spirit of Brown et al. (1990). These
word alignments have no clear theoretical status. They are defined in terms
of statistical models whose parameters are estimated from co-occurrence statistics extracted from a training corpus. They mirror a concept of translational equivalence that we have termed observational equivalence, to distinguish it from the higher-level notion of dynamic equivalence and from its counterpart, formal equivalence, of which observational equivalence could be considered a special case.
Observational equivalence is strongly surface-oriented, and SMT has traditionally eschewed all abstract representations of meaning, mapping tokens
of the input directly into tokens of the output. This has worked well, demonstrating that much linguistic information is indeed accessible with surface-level processing. However, one problem with this approach is that the SMT
system often does not know exactly what it is doing. For instance, based
on observational evidence from the training corpus, an SMT system might
translate an active sentence in the input with a passive sentence in the output, or a personal construction in the source language with an impersonal
construction in the target language.
In English–French translation, this happens not infrequently, e. g., when
the phrase it requires is translated with the impersonal il faut ‘it is necessary’,
it being aligned to il. In this example, the English it is anaphoric, but the
French il is pleonastic. This translation may be perfectly adequate and idiomatic, as the training data suggests, but the problem is that the SMT system
has no control over what it is doing. Just copying bits and pieces of texts that
it encountered at training time, it does not know that a personal pronoun
is being mapped into an impersonal one in the example above, or that the
subject and object functions are exchanged in a sentence when it goes from
active to passive.
It is difficult to envisage consistently correct translation of discourse phenomena such as pronominal anaphora, or correct generation of the distribution of definite and indefinite noun phrases, if the MT system is not allowed to construct any abstract representation of the entities occurring in a text. In
some way, future SMT systems will have to make inferences about more abstract entities than surface words to create adequate translations. This could
be done with the help of a capacity for symbolic reasoning over some form
of abstract semantic representation (e. g., Banarescu et al., 2013), but it is not
clear that symbolic representations are, in fact, the most suitable approach.
Quite possibly, abstract information about texts could be represented in the
form of one or more hidden layers in a neural network or a similar latent-variable representation (e. g., along the lines of Kalchbrenner and Blunsom,
2013). Creating such a mechanism, and making it interface with the existing
surface-level processing facilities, is going to be a major research effort and
is unlikely to lead to improvements in BLEU score in the short term.
We began this thesis by drawing attention to a discrepancy between translation studies and SMT research, pointing out how the two fields are concerned with challenges at entirely different levels of abstraction. We showed
that the observational equivalence aimed at in SMT corresponds to a fairly
dated view of translation that misses out not only the cultural turn of the late
20th century in translation studies and the shift of viewpoint towards seeing
translation as a procedural phenomenon in a cultural and social context, but
even earlier developments of the concept of equivalence such as the functional notion of dynamic equivalence advocated by Nida and Taber (1969).
It is now time to reflect on what this thesis has contributed towards promoting a more up-to-date concept of translation in SMT research.
First of all, the limitations of our efforts must be clearly acknowledged.
None of our contributions will bring about a paradigm shift from a view of
SMT focusing on translational equivalence to a more process-oriented view
of translation, nor have we even attempted to do so. While it is important to
keep in mind that the underlying assumptions of current approaches to SMT
fall short of the insights prevalent in modern translation research, we believe
that it is appropriate, given the current state of the art, that SMT should rest
on a concept of equivalence and that matters related to the intentionality of
the source text and its translation and to their social or cultural context should
be regarded as external to the SMT translation process itself. If anything,
we should wish that the nature of this equivalence relation as well as the
concept of domain, which encodes many of those external factors, were put
on a firmer theoretical basis. This, however, has not been the subject of this
thesis.
What we have attempted to do is to free phrase-based SMT from its narrow focus on n-gram context and sentence independence, and to create
a framework in which modelling at a larger scale is possible without being
impeded by technical constraints from the very beginning. We consider that
this is an enabling factor to promote research on the translation of linguistic
phenomena on the text level, but also on aspects of SMT emanating from a
broader view of translational equivalence. In the applications treated in this
thesis, both can be found. Pronominal anaphora, which we devoted most effort to, is an example of an elementary linguistic phenomenon that requires
discourse-level processing for correct translation even if no more than mere
formal equivalence is called for. By contrast, the readability experiments
briefly discussed in Section 5.2 represent an effort that transcends even the
limits of dynamic equivalence by conferring on the translation an intention
not found in the source text and retargeting the text to a new audience.
In sum, notwithstanding the practical contributions we have made, the
foremost importance of this thesis is theoretical rather than practical. By
highlighting the fundamental limitations of one of the prevalent approaches
to SMT, by studying their impact on practical translations and by creating a
new framework relaxing the most stringent restrictions and demonstrating
its applicability to unresolved issues in MT, we hope to stimulate SMT research with a greater propensity for creating explanatory models of complex
textual relations.
Bibliography
Aarts, Emile H. L., Korst, Jan H. M. and van Laarhoven, Peter J. M. (1997).
Simulated annealing. In: Emile H. L. Aarts and Jan Karel Lenstra (eds.), Local Search
in Combinatorial Optimization, Wiley-Interscience series in discrete mathematics
and optimization, Chichester: Wiley, 91–120.
Alexandrescu, Andrei and Kirchhoff, Katrin (2009). Graph-based learning for
statistical machine translation. In: Proceedings of Human Language Technologies:
The 2009 Annual Conference of the North American Chapter of the Association for
Computational Linguistics, Boulder (Colorado, USA), 119–127.
Artstein, Ron and Poesio, Massimo (2008). Inter-coder agreement for computational
linguistics. Computational linguistics, 34 (4):555–596.
Arun, Abhishek, Dyer, Chris, Haddow, Barry, Blunsom, Phil, Lopez, Adam and
Koehn, Philipp (2009). Monte Carlo inference and maximization for phrase-based
translation. In: Proceedings of the Thirteenth Conference on Computational Natural
Language Learning (CoNLL-2009), Boulder (Colorado, USA), 102–110.
Arun, Abhishek, Haddow, Barry, Koehn, Philipp, Lopez, Adam, Dyer, Chris
and Blunsom, Phil (2010). Monte Carlo techniques for phrase-based translation.
Machine translation, 24 (2):103–121.
Baker, Mona (2011). In other words. A coursebook on translation. London: Routledge.
Second edition.
Banarescu, Laura, Bonial, Claire, Cai, Shu, Georgescu, Madalina, Griffitt,
Kira, Hermjakob, Ulf, Knight, Kevin, Koehn, Philipp, Palmer, Martha and
Schneider, Nathan (2013). Abstract meaning representation for sembanking.
In: Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with
Discourse, Sofia (Bulgaria), 178–186.
Banchs, Rafael E. and Costa-jussà, Marta R. (2011). A semantic feature for
statistical machine translation. In: Proceedings of the Fifth Workshop on Syntax,
Semantics and Structure in Statistical Translation, Portland (Oregon, USA), 126–134.
Banerjee, Satanjeev and Lavie, Alon (2005). METEOR: An automatic metric for
MT evaluation with improved correlation with human judgments. In: Proceedings
of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine
Translation and/or Summarization, Ann Arbor (Michigan, USA), 65–72.
Bassnett, Susan (2011). The translator as cross-cultural mediator. In: Kirsten
Malmkjær and Kevin Windle (eds.), The Oxford Handbook of Translation Studies,
Oxford: Oxford University Press, 94–107.
Becher, Viktor (2011a). Explicitation and implicitation in translation. A corpus-based
study of English–German and German–English translations of business texts. Ph. D.
thesis, Universität Hamburg.
Becher, Viktor (2011b). When and why do translators add connectives? A corpus-based study. Target: International Journal on Translation Studies, 23 (1):26–47.
Beigman Klebanov, Beata, Diermeier, Daniel and Beigman, Eyal (2008). Lexical
cohesion analysis of political speech. Political Analysis, 16 (4):447–463.
Beigman Klebanov, Beata and Flor, Michael (2013). Associative texture is lost in
translation. In: Proceedings of the Workshop on Discourse in Machine Translation,
Sofia (Bulgaria), 27–32.
Bellegarda, Jerome R. (2000). Exploiting latent semantic information in statistical
language modeling. Proceedings of the IEEE, 88 (8):1279–1296.
Ben, Guosheng, Xiong, Deyi, Teng, Zhiyang, Lü, Yajuan and Liu, Qun (2013).
Bilingual lexical cohesion trigger model for document-level machine translation.
In: Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), Sofia (Bulgaria), 382–386.
Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal and Janvin, Christian
(2003). A neural probabilistic language model. Journal of Machine Learning Research,
3:1137–1155.
Berger, Adam L., Della Pietra, Stephen A. and Della Pietra, Vincent J. (1996).
A maximum entropy approach to natural language processing. Computational
linguistics, 22 (1):39–72.
Bergsma, Shane and Yarowsky, David (2011). NADA: A robust system for non-referential pronoun detection. In: Proceedings of the 8th Discourse Anaphora and
Anaphor Resolution Colloquium, Faro (Portugal), Lecture Notes in Computer Science,
volume 7099, 12–23.
Bird, Steven, Loper, Edward and Klein, Ewan (2009). Natural Language Processing
with Python. Beijing: O’Reilly.
Björnsson, Carl-Hugo (1968). Läsbarhet. Stockholm: Liber.
Brown, Peter F., Cocke, John, Della Pietra, Stephen A., Della Pietra, Vincent J.,
Jelinek, Frederick, Lafferty, John D., Mercer, Robert L. and Roossin, Paul S.
(1990). A statistical approach to machine translation. Computational linguistics,
16 (2):79–85.
Brown, Peter F., Della Pietra, Stephen A., Della Pietra, Vincent J. and Mercer,
Robert L. (1993). The mathematics of statistical machine translation. Computational
linguistics, 19 (2):263–311.
Brown, Peter F., deSouza, Peter V., Mercer, Robert L., Della Pietra, Vincent J.
and Lai, Jenifer C. (1992). Class-based n-gram models of natural language. Computational linguistics, 18 (4):467–479.
Buch-Kromann, Matthias, Korzen, Iørn and Høeg Müller, Henrik (2009). Uncovering the ‘lost’ structure of translations with parallel treebanks. Copenhagen
Studies in Language, 38:199–224.
Bungum, Lars and Gambäck, Björn (2011). A survey of domain adaptation in
machine translation: Towards a refinement of domain space. In: Proceedings of the
India-Norway Workshop on Web Concepts and Technologies, Trondheim (Norway).
Bussmann, Hadumod (1996). Routledge dictionary of language and linguistics. Routledge Reference, London: Routledge.
Callison-Burch, Chris, Koehn, Philipp, Monz, Christof, Peterson, Kay, Przybocki, Mark and Zaidan, Omar (2010). Findings of the 2010 Joint Workshop on
Statistical Machine Translation and Metrics for Machine Translation. In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR,
Uppsala (Sweden), 17–53.
Callison-Burch, Chris, Koehn, Philipp, Monz, Christof, Post, Matt, Soricut,
Radu and Specia, Lucia (2012). Findings of the 2012 Workshop on Statistical
Machine Translation. In: Proceedings of the Seventh Workshop on Statistical Machine
Translation, Montréal (Canada), 10–51.
Callison-Burch, Chris, Koehn, Philipp, Monz, Christof and Schroeder, Josh
(2009). Findings of the 2009 Workshop on Statistical Machine Translation. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens (Greece),
1–28.
Callison-Burch, Chris, Koehn, Philipp, Monz, Christof and Zaidan, Omar (2011).
Findings of the 2011 Workshop on Statistical Machine Translation. In: Proceedings
of the Sixth Workshop on Statistical Machine Translation, Edinburgh (Scotland, UK),
22–64.
Callison-Burch, Chris, Osborne, Miles and Koehn, Philipp (2006). Re-evaluating
the role of BLEU in machine translation research. In: Proceedings of the 11th
Conference of the European Chapter of the Association for Computational Linguistics,
Trento (Italy).
Carpuat, Marine (2009). One translation per discourse. In: Proceedings of the
Workshop on Semantic Evaluations: Recent Achievements and Future Directions
(SEW-2009), Boulder (Colorado, USA), 19–27.
Carpuat, Marine and Simard, Michel (2012). The trouble with SMT consistency. In:
Proceedings of the Seventh Workshop on Statistical Machine Translation, Montréal
(Canada), 442–449.
Carpuat, Marine and Wu, Dekai (2007). Improving statistical machine translation
using word sense disambiguation. In: Proceedings of the 2007 Joint Conference
on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning, Prague (Czech Republic), 61–72.
Cartoni, Bruno, Zufferey, Sandrine and Meyer, Thomas (2013). Annotating the
meaning of discourse connectives by looking at their translation: The translation
spotting technique. Dialogue and Discourse, 4 (2):65–86.
Cartoni, Bruno, Zufferey, Sandrine, Meyer, Thomas and Popescu-Belis, Andrei
(2011). How comparable are parallel corpora? Measuring the distribution of general
vocabulary and connectives. In: Proceedings of the 4th Workshop on Building and
Using Comparable Corpora: Comparable Corpora and the Web, Portland (Oregon,
USA), 78–86.
Cettolo, Mauro, Girardi, Christian and Federico, Marcello (2012). WIT3: Web
inventory of transcribed and translated talks. In: Proceedings of the 16th Conference
of the European Association for Machine Translation (EAMT), Trento (Italy), 261–268.
Chall, Jeanne S. (1958). Readability: An appraisal of research and application. Columbus (Ohio): Bureau of Educational Research.
Chen, Stanley F. and Goodman, Joshua (1998). An empirical study of smoothing
techniques for language modeling. Technical Report, Computer Science Group,
Harvard University, Cambridge (Mass.).
Chiang, David (2007). Hierarchical phrase-based translation. Computational linguistics, 33 (2):201–228.
Chiang, David (2012). Hope and fear for discriminative training of statistical translation models. Journal of Machine Learning Research, 13:1159–1187.
Cho, Eunah, Ha, Thanh-Le, Mediani, Mohammed, Niehues, Jan, Herrmann,
Teresa, Slawik, Isabel and Waibel, Alex (2013). The Karlsruhe Institute of
Technology translation systems for the WMT 2013. In: Proceedings of the Eighth
Workshop on Statistical Machine Translation, Sofia (Bulgaria), 104–108.
Coccaro, Noah and Jurafsky, Daniel (1998). Towards better integration of semantic
predictors in statistical language modeling. In: Proceedings of the 5th International
Conference on Spoken Language Processing, Sydney (Australia).
Collins, Michael (1999). Head-Driven Statistical Models for Natural Language Parsing.
Ph. D. thesis, University of Pennsylvania.
Collins, Michael and Duffy, Nigel (2002). Convolution kernels for natural language.
In: Thomas G. Dietterich, Suzanna Becker and Zoubin Ghahramani (eds.), Advances
in Neural Information Processing Systems 14, Cambridge (Mass.): MIT Press, 625–632.
Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray and Kuksa, Pavel (2011). Natural language processing (almost) from
scratch. Journal of Machine Learning Research, 12:2461–2505.
Deléger, Louise, Merkel, Magnus and Zweigenbaum, Pierre (2006). Enriching
medical terminologies: An approach based on aligned corpora. In: Arie Hasman,
Reinhold Haux, Johan van der Lei, Etienne De Clercq and Francis H. Roger France
(eds.), Ubiquity: Technologies for Better Health in Aging Societies. Proceedings of
MIE2006, the 20th International Congress of the European Federation for Medical
Informatics, Maastricht (Netherlands), 747–752.
Denkowski, Michael and Lavie, Alon (2011). Meteor 1.3: Automatic metric for
reliable optimization and evaluation of machine translation systems. In: Proceedings
of the Sixth Workshop on Statistical Machine Translation, Edinburgh (Scotland, UK),
85–91.
Doddington, George (2002). Automatic evaluation of machine translation quality
using n-gram co-occurrence statistics. In: Proceedings of the Second International
Conference on Human Language Technology Research, San Diego (California, USA),
138–145.
Eidelman, Vladimir, Boyd-Graber, Jordan and Resnik, Philip (2012). Topic models
for dynamic translation model adaptation. In: Proceedings of the 50th Annual Meeting
of the Association for Computational Linguistics (Volume 2: Short Papers), Jeju Island
(Korea), 115–119.
Eisner, Jason and Tromble, Roy W. (2006). Local search with very large-scale
neighborhoods for optimal permutations in machine translation. In: Proceedings of
the HLT-NAACL Workshop on Computationally Hard Problems and Joint Inference in
Speech and Language Processing, New York City (New York, USA), 57–75.
Evans, Richard (2001). Applying machine learning toward an automatic classification
of it. Literary and Linguistic Computing, 16 (1):45–57.
Federico, Marcello, Bertoldi, Nicola and Cettolo, Mauro (2008). IRSTLM: An
open source toolkit for handling large scale language models. In: Interspeech 2008,
Brisbane (Australia), 1618–1621.
Federmann, Christian, Eisele, Andreas, Chen, Yu, Hunsicker, Sabine, Xu, Jia
and Uszkoreit, Hans (2010). Further experiments with shallow hybrid MT systems.
In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and
MetricsMATR, Uppsala (Sweden), 77–81.
Fellbaum, Christiane (1998). WordNet: An electronic lexical database. Cambridge
(Mass.): MIT Press.
Finkel, Jenny Rose, Grenager, Trond and Manning, Christopher (2005). Incorporating non-local information into information extraction systems by Gibbs sampling.
In: Proceedings of the 43rd Annual Meeting of the Association for Computational
Linguistics, Ann Arbor (Michigan, USA), 363–370.
Foltz, Peter W., Kintsch, Walter and Landauer, Thomas K. (1998). The measurement of textual coherence with Latent Semantic Analysis. Discourse Processes,
25 (2/3):285–307.
Foster, George, Isabelle, Pierre and Kuhn, Roland (2010). Translating structured
documents. In: Proceedings of AMTA 2010: the Ninth Conference of the Association
for Machine Translation in the Americas, Denver (Colorado, USA).
Gale, William A., Church, Kenneth W. and Yarowsky, David (1992). One sense
per discourse. In: Proceedings of Speech and Natural Language, Harriman (New
York, USA), 233–237.
Galley, Michel and McKeown, Kathleen (2003). Improving word sense disambiguation in lexical chaining. In: Proceedings of the 18th International Joint Conference
on Artificial Intelligence, San Francisco (California, USA), 1486–1488.
Gendered Innovations (2014). Machine translation: Analyzing gender. http://genderedinnovations.stanford.edu/case-studies/nlp.html (5 May 2014).
Germann, Ulrich (2003). Greedy decoding for statistical machine translation in
almost linear time. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics,
Edmonton (Canada).
Germann, Ulrich, Jahr, Michael, Knight, Kevin, Marcu, Daniel and Yamada,
Kenji (2001). Fast decoding and optimal decoding for machine translation. In:
Proceedings of 39th Annual Meeting of the Association for Computational Linguistics,
Toulouse (France), 228–235.
Germann, Ulrich, Jahr, Michael, Knight, Kevin, Marcu, Daniel and Yamada,
Kenji (2004). Fast and optimal decoding for machine translation. Artificial Intelligence, 154 (1–2):127–143.
Giménez, Jesús, Màrquez, Lluís, Comelles, Elisabet, Castellón, Irene and Arranz, Victoria (2010). Document-level automatic MT evaluation based on discourse representations. In: Proceedings of the Joint Fifth Workshop on Statistical
Machine Translation and MetricsMATR, Uppsala (Sweden), 333–338.
Gong, Zhengxian, Zhang, Min, Tan, Chew-lim and Zhou, Guodong (2012a).
Classifier-based tense model for SMT. In: Proceedings of COLING 2012: Posters,
Mumbai (India), 411–420.
Gong, Zhengxian, Zhang, Min, Tan, Chew Lim and Zhou, Guodong (2012b). N-gram-based tense models for statistical machine translation. In: Proceedings of the
2012 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning, Jeju Island (Korea), 276–285.
Gong, Zhengxian, Zhang, Min and Zhou, Guodong (2011a). Cache-based
document-level statistical machine translation. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh (Scotland,
UK), 909–919.
Gong, Zhengxian, Zhang, Yu and Zhou, Guodong (2010). Statistical machine
translation based on LDA. In: Proceedings of the 4th International Universal Communication Symposium, Beijing (China), 286–290.
Gong, Zhengxian, Zhou, Guodong and Li, Liangyou (2011b). Improve SMT with
source-side “topic-document” distributions. In: Proceedings of the 13th Machine
Translation Summit, Xiamen (China), 496–501.
Grishman, Ralph and Sundheim, Beth (1996). Message understanding conference – 6: A brief history. In: Proceedings of the 16th International Conference on
Computational Linguistics (COLING 1996), Copenhagen (Denmark), 466–471.
Gruber, Amit, Weiss, Yair and Rosen-zvi, Michal (2007). Hidden topic Markov
models. In: Marina Meila and Xiaotong Shen (eds.), Proceedings of the Eleventh
International Conference on Artificial Intelligence and Statistics (AISTATS-07), San
Juan (Puerto Rico, USA), volume 2, 163–170.
Guillou, Liane (2011). Improving Pronoun Translation for Statistical Machine Translation (SMT). Master’s thesis, University of Edinburgh, School of Informatics.
Guillou, Liane (2012). Improving pronoun translation for statistical machine translation. In: Proceedings of the Student Research Workshop at the 13th Conference of the
European Chapter of the Association for Computational Linguistics, Avignon (France),
1–10.
Guillou, Liane (2013). Analysing lexical consistency in translation. In: Proceedings
of the Workshop on Discourse in Machine Translation, Sofia (Bulgaria), 10–18.
Guillou, Liane, Hardmeier, Christian, Smith, Aaron, Tiedemann, Jörg and
Webber, Bonnie (2014). ParCor 1.0: A parallel pronoun-coreference corpus to support statistical MT. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavík (Iceland).
Gupta, Kamakhyn, Sadiq, Mohamed and Sridhar V (2008). Measuring lexical
cohesion in a document. In: Seventh Mexican International Conference on Artificial
Intelligence, Tuxtla Gutiérrez (Mexico), 48–52.
Guzmán, Francisco, Joty, Shafiq, Màrquez, Lluís and Nakov, Preslav (2014). Using discourse structure improves machine translation evaluation. In: Proceedings of
the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore
(Maryland, USA).
Hajič, Jan, Panevová, Jarmila, Hajičová, Eva, Sgall, Petr,
Pajas, Petr, Štěpánek, Jan, Havelka, Jiří and Mikulová, Marie (2006). Prague
Dependency Treebank 2.0. Philadelphia: Linguistic Data Consortium. LDC2006T01.
Hajlaoui, Najeh and Popescu-Belis, Andrei (2012). Translating English discourse
connectives into Arabic: a corpus-based analysis and an evaluation metric. In:
AMTA-2012: Fourth workshop on computational approaches to Arabic script-based
languages, San Diego (California, USA), 1–8.
Hajlaoui, Najeh and Popescu-Belis, Andrei (2013). Assessing the accuracy of
discourse connective translations: Validation of an automatic metric. In: Alexander
Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, Berlin:
Springer, Lecture Notes in Computer Science, volume 7817, 236–247.
Halliday, M. A. K. and Hasan, Ruqaiya (1976). Cohesion in English. English Language
Series, London: Longman.
Harabagiu, Sanda M. and Maiorano, Steven J. (2000). Multilingual coreference
resolution. In: Proceedings of the Sixth Conference on Applied Natural Language
Processing, Seattle (Washington, USA), 142–149.
Hardmeier, Christian (2012). Discourse in statistical machine translation: A survey
and a case study. Discours, 11.
Hardmeier, Christian, Bisazza, Arianna and Federico, Marcello (2010). FBK at
WMT 2010: Word lattices for morphological reduction and chunk-based reordering.
In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and
MetricsMATR, Uppsala (Sweden), 88–92.
Hardmeier, Christian and Federico, Marcello (2010). Modelling pronominal anaphora in statistical machine translation. In: Proceedings of the Seventh International
Workshop on Spoken Language Translation (IWSLT), Paris (France), 283–289.
Hardmeier, Christian, Nivre, Joakim and Tiedemann, Jörg (2012). Document-wide decoding for phrase-based statistical machine translation. In: Proceedings of
the 2012 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning, Jeju Island (Korea), 1179–1190.
Hardmeier, Christian, Stymne, Sara, Tiedemann, Jörg and Nivre, Joakim (2013a).
Docent: A document-level decoder for phrase-based statistical machine translation.
In: Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics: System Demonstrations, Sofia (Bulgaria), 193–198.
Hardmeier, Christian, Stymne, Sara, Tiedemann, Jörg, Smith, Aaron and Nivre,
Joakim (2014). Anaphora models and reordering for phrase-based SMT. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore (Maryland,
USA).
Hardmeier, Christian, Tiedemann, Jörg and Nivre, Joakim (2013b). Latent anaphora resolution for cross-lingual pronoun prediction. In: Proceedings of the 2013
Conference on Empirical Methods in Natural Language Processing, Seattle (Washington, USA), 380–391.
Hardmeier, Christian, Tiedemann, Jörg, Saers, Markus, Federico, Marcello
and Mathur, Prashant (2011). The Uppsala-FBK systems at WMT 2011. In:
Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh
(Scotland, UK), 372–378.
Hasler, Eva, Blunsom, Phil, Koehn, Philipp and Haddow, Barry (2014). Dynamic
topic adaptation for phrase-based MT. In: Proceedings of the 14th Conference of
the European Chapter of the Association for Computational Linguistics, Gothenburg
(Sweden), 328–337.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and
their applications. Biometrika, 57 (1):97–109.
Hatim, Basil and Mason, Ian (1990). Discourse and the Translator. Language in
Social Life Series, London: Longman.
Heafield, Kenneth (2011). KenLM: faster and smaller language model queries. In:
Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh
(Scotland, UK), 187–197.
Hopkins, Mark and May, Jonathan (2011). Tuning as ranking. In: Proceedings of the
2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh
(Scotland, UK), 1352–1362.
Huang, Yan (2004). Anaphora and the pragmatics-syntax interface. In: Laurence R.
Horn and Gregory Ward (eds.), The Handbook of Pragmatics, Malden (Mass.): Blackwell, 288–314.
Hultman, Tor G. and Westman, Margareta (1977). Gymnasistsvenska. Lund:
LiberLäromedel.
Jelinek, Frederick (1976). Continuous speech recognition by statistical methods.
Proceedings of the IEEE, 64 (4):532–557.
Jellinghaus, Michael, Poulis, Alexandros and Kolovratník, David (2010). Exodus – Exploring SMT for EU institutions. In: Proceedings of the Joint Fifth Workshop
on Statistical Machine Translation and MetricsMATR, Uppsala (Sweden), 110–114.
Jerome (1979). Letter LVII: To Pammachius on the best method of translating. In:
St. Jerome: Letters and Select Works, Grand Rapids: Eerdmans, A Select Library of
Nicene and Post-Nicene Fathers of the Christian Church, Second Series, volume VI,
112–119.
Jerome (1996). Epistola LVII: Ad Pammachium. De optimo genere interpretandi. In:
Sancti Eusebii Hieronymi Stridonensis presbyteri epistolae secundum ordinem temporum ad amussim digestae et in quatuor classes distributae, Alexandria: Chadwyck-Healey, Patrologia Latina Database, volume 22.
Johnson, Howard, Martin, Joel, Foster, George and Kuhn, Roland (2007). Improving translation quality by discarding most of the phrasetable. In: Proceedings
of the 2007 Joint Conference on Empirical Methods in Natural Language Processing
and Computational Natural Language Learning, Prague (Czech Republic), 967–975.
Joty, Shafiq, Guzmán, Francisco, Màrquez, Lluís and Nakov, Preslav (2014).
DiscoTK: Using discourse structure for machine translation evaluation. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore (Maryland,
USA).
Jurgens, David and Stevens, Keith (2010). The S-Space package: An open source
package for word space models. In: Proceedings of the ACL 2010 System Demonstrations, Uppsala (Sweden), 30–35.
Kalchbrenner, Nal and Blunsom, Phil (2013). Recurrent continuous translation
models. In: Proceedings of the 2013 Conference on Empirical Methods in Natural
Language Processing, Seattle (Washington, USA), 1700–1709.
Kamp, Hans and Reyle, Uwe (1993). From Discourse to Logic: An Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation
Theory. Dordrecht: Kluwer.
Kim, Woosung and Khudanpur, Sanjeev (2004). Cross-lingual latent semantic
analysis for language modeling. In: Proceedings of the IEEE international conference
on acoustics, speech, and signal processing (ICASSP), Montréal (Canada), volume 1,
257–260.
Kirkpatrick, S., Gelatt Jr., C. D. and Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220 (4598):671–680.
Knight, Kevin and Chander, Ishwar (1994). Automated postediting of documents.
In: Proceedings of the 12th National Conference on Artificial Intelligence (AAAI),
Seattle (Washington, USA), 779–784.
Koehn, Philipp (2005). Europarl: A corpus for statistical machine translation. In:
Proceedings of MT Summit X, Phuket (Thailand), 79–86.
Koehn, Philipp (2010). Statistical Machine Translation. Cambridge: Cambridge
University Press.
Koehn, Philipp and Hoang, Hieu (2007). Factored translation models. In: Conference
on Empirical Methods in Natural Language Processing, Prague (Czech Republic),
868–876.
Koehn, Philipp, Hoang, Hieu, Birch, Alexandra et al. (2007). Moses: Open source
toolkit for Statistical Machine Translation. In: Annual Meeting of the Association
for Computational Linguistics: Demonstration session, Prague (Czech Republic),
177–180.
Koehn, Philipp, Och, Franz Josef and Marcu, Daniel (2003). Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on Human Language Technology, Edmonton (Canada), 48–54.
Koller, Werner (1972). Grundprobleme der Übersetzungstheorie, unter besonderer
Berücksichtigung schwedisch-deutscher Übersetzungsfälle, Acta Universitatis Stockholmiensis. Stockholmer germanistische Forschungen, volume 9. Bern: Francke.
Krippendorff, Klaus (2004). Measuring the reliability of qualitative text analysis
data. Quality and Quantity, 38 (6):787–800.
Lambert, Patrik, Gispert, Adriá, Banchs, Rafael and Mariño, José B. (2005).
Guidelines for word alignment evaluation and manual alignment. Language Resources and Evaluation, 39 (4):267–285.
Langlais, Philippe, Patry, Alexandre and Gotti, Fabrizio (2007). A greedy decoder
for phrase-based statistical machine translation. In: TMI-2007: Proceedings of the
11th International Conference on Theoretical and Methodological Issues in Machine
Translation, Skövde (Sweden), 104–113.
Langlais, Philippe, Patry, Alexandre and Gotti, Fabrizio (2008). Recherche locale
pour la traduction statistique par segments. In: Actes de la 15e Conférence sur le
Traitement Automatique des Langues Naturelles, Avignon (France), 119–128.
Le, Hai-Son, Allauzen, Alexandre and Yvon, François (2012). Continuous space
translation models with neural networks. In: Proceedings of the 2012 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Montréal (Canada), 39–48.
Le Nagard, Ronan and Koehn, Philipp (2010). Aiding pronoun translation with
co-reference resolution. In: Proceedings of the Joint Fifth Workshop on Statistical
Machine Translation and MetricsMATR, Uppsala (Sweden), 252–261.
Leal, Alice (2012). Equivalence. In: Yves Gambier and Luc van Doorslaer (eds.),
Handbook of Translation Studies, Amsterdam: Benjamins, volume 3, 39–46.
Lee, Heeyoung, Peirsman, Yves, Chang, Angel, Chambers, Nathanael, Surdeanu,
Mihai and Jurafsky, Dan (2011). Stanford’s multi-pass sieve coreference resolution
system at the CoNLL-2011 shared task. In: Proceedings of the Fifteenth Conference
on Computational Natural Language Learning: Shared Task, Portland (Oregon, USA),
28–34.
Lefevere, André and Bassnett, Susan (1995). Introduction: Proust’s grandmother
and the thousand and one nights: The ‘cultural turn’ in translation studies. In:
Susan Bassnett and André Lefevere (eds.), Translation, History and Culture, London:
Cassell, 1–14.
Liddell, F. D. K. (1983). Simplified exact analysis of case-referent studies: Matched
pairs; dichotomous exposure. Journal of Epidemiology and Community Health,
37 (1):82–84.
Louis, Annie and Webber, Bonnie (2014). Structured and unstructured cache models
for SMT domain adaptation. In: Proceedings of the 14th Conference of the European
Chapter of the Association for Computational Linguistics, Gothenburg (Sweden),
155–163.
Ma, Yanjun, He, Yifan, Way, Andy and van Genabith, Josef (2011). Consistent translation using discriminative learning – A translation memory-inspired approach.
In: Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, Portland (Oregon, USA), 1239–1248.
Mann, William and Thompson, Sandra (1988). Rhetorical Structure Theory: Toward
a functional theory of text organization. Text, 8 (3):243–281.
Marcu, Daniel, Carlson, Lynn and Watanabe, Maki (2000). The automatic translation of discourse structures. In: Proceedings of the 1st Meeting of the North American
Chapter of the Association for Computational Linguistics, Seattle (Washington, USA),
9–17.
Mariño, José, Banchs, Rafael E., Crego, Josep M., de Gispert, Adrià, Lambert,
Patrik, Fonollosa, José A. R. and Costa-jussà, Marta R. (2006). N-gram-based
machine translation. Computational linguistics, 32 (4):527–549.
McEnery, Anthony, Tanaka, Izumi and Botley, Simon (1997). Corpus annotation
and reference resolution. In: Proceedings of the ACL Workshop on Operational Factors
in Practical, Robust Anaphora Resolution for Unrestricted Texts, Madrid (Spain), 67–74.
Mediani, Mohammed, Cho, Eunah, Niehues, Jan, Herrmann, Teresa and Waibel,
Alex (2011). The KIT English–French translation systems for IWSLT 2011. In:
Proceedings of the Eighth International Workshop on Spoken Language Translation
(IWSLT), San Francisco (California, USA), 73–78.
Merkel, Magnus (1999). Understanding and enhancing translation by parallel text
processing. Ph. D. thesis, Linköping University.
Metropolis, Nicholas, Rosenbluth, Arianna W., Rosenbluth, Marshall N.,
Teller, Augusta H. and Teller, Edward (1953). Equation of state calculations by
fast computing machines. Journal of Chemical Physics, 21 (6):1087–1092.
Meyer, Thomas (2011). Disambiguating temporal-contrastive connectives for machine translation. In: Proceedings of the ACL 2011 Student Session, Portland (Oregon,
USA), 46–51.
Meyer, Thomas, Grisot, Cristina and Popescu-Belis, Andrei (2013). Detecting
narrativity to improve English to French translation of simple past verbs. In:
Proceedings of the Workshop on Discourse in Machine Translation, Sofia (Bulgaria),
33–42.
Meyer, Thomas and Poláková, Lucie (2013). Machine translation with many manually labeled discourse connectives. In: Proceedings of the Workshop on Discourse in
Machine Translation, Sofia (Bulgaria), 43–50.
Meyer, Thomas and Popescu-Belis, Andrei (2012). Using sense-labeled discourse
connectives for statistical machine translation. In: Proceedings of the Joint Workshop
on Exploiting Synergies between Information Retrieval and Machine Translation
(ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), Avignon (France),
129–138.
Meyer, Thomas, Popescu-Belis, Andrei, Hajlaoui, Najeh and Gesmundo, Andrea
(2012). Machine translation of labeled discourse connectives. In: Proceedings of the
Tenth Biennial Conference of the Association for Machine Translation in the Americas
(AMTA), San Diego (California, USA).
Meyer, Thomas, Popescu-Belis, Andrei, Zufferey, Sandrine and Cartoni, Bruno
(2011a). Multilingual annotation and disambiguation of discourse connectives for
machine translation. In: Proceedings of the SIGDIAL 2011 Conference, Portland
(Oregon, USA), 194–203.
Meyer, Thomas, Roze, Charlotte, Cartoni, Bruno, Danlos, Laurence, Zufferey,
Sandrine and Popescu-Belis, Andrei (2011b). Disambiguating discourse connectives using parallel corpora. In: Proceedings of Corpus Linguistics, Birmingham
(England, UK).
Meyer, Thomas and Webber, Bonnie (2013). Implicitation of discourse connectives
in (machine) translation. In: Proceedings of the Workshop on Discourse in Machine
Translation, Sofia (Bulgaria), 19–26.
Minnen, Guido, Carroll, John and Pearce, Darren (2001). Applied morphological
processing of English. Natural Language Engineering, 7 (3):207–223.
Mitchell, Alexis, Strassel, Stephanie, Przybocki, Mark, Davis, J. K., Doddington, George, Grishman, Ralph, Meyers, Adam, Brunstein, Ada, Ferro, Lisa
and Sundheim, Beth (2003). ACE-2 Version 1.0. Philadelphia: Linguistic Data
Consortium. LDC2003T11.
Mitkov, Ruslan and Barbu, Catalina (2003). Using bilingual corpora to improve
pronoun resolution. Languages in Contrast, 4 (2):201–211.
Mühlenbock, Katarina and Johansson Kokkinakis, Sofie (2009). LIX 68 revisited:
An extended readability measure. In: Proceedings of Corpus Linguistics, Liverpool
(England, UK).
Ng, Vincent (2010). Supervised noun phrase coreference research: The first fifteen
years. In: Proceedings of the 48th Annual Meeting of the Association for Computational
Linguistics, Uppsala (Sweden), 1396–1411.
Nida, Eugene A. and Taber, Charles R. (1969). The theory and practice of translation,
Helps for translators, volume 8. Leiden: Brill.
Niehues, Jan, Herrmann, Teresa, Vogel, Stephan and Waibel, Alex (2011). Wider
context by using bilingual language models in machine translation. In: Proceedings
of the Sixth Workshop on Statistical Machine Translation, Edinburgh (Scotland, UK),
198–206.
Novák, Michal (2011). Utilization of anaphora in machine translation. In: Week
of Doctoral Students 2011 Proceedings of Contributed Papers, Part I, Prague (Czech
Republic), 155–160.
Novák, Michal, Nedoluzhko, Anna and Žabokrtský, Zdeněk (2013a). Translation
of “it” in a deep syntax framework. In: Proceedings of the Workshop on Discourse in
Machine Translation, Sofia (Bulgaria), 51–59.
Novák, Michal, Žabokrtský, Zdeněk and Nedoluzhko, Anna (2013b). Two case
studies on translating pronouns in a deep syntax framework. In: Proceedings of
the Sixth International Joint Conference on Natural Language Processing, Nagoya
(Japan), 1037–1041.
Och, Franz Josef (2003). Minimum error rate training in Statistical Machine Translation. In: Proceedings of the 41st Annual Meeting of the Association for Computational
Linguistics, Sapporo (Japan), 160–167.
Och, Franz Josef and Ney, Hermann (2002). Discriminative training and maximum entropy models for Statistical Machine Translation. In: Proceedings of the
40th Annual Meeting of the Association for Computational Linguistics, Philadelphia
(Pennsylvania, USA), 295–302.
Och, Franz Josef and Ney, Hermann (2003). A systematic comparison of various
statistical alignment models. Computational Linguistics, 29 (1):19–51.
Och, Franz Josef and Ney, Hermann (2004). The alignment template approach to
statistical machine translation. Computational Linguistics, 30 (4):417–449.
Och, Franz Josef, Ueffing, Nicola and Ney, Hermann (2001). An efficient A* search
algorithm for Statistical Machine Translation. In: Proceedings of the Data-Driven
Machine Translation Workshop at the 39th Annual Meeting of the Association for
Computational Linguistics (ACL), Toulouse (France), 55–62.
Paice, C. D. and Husk, G. D. (1987). Towards the automatic recognition of anaphoric features in English text: the impersonal pronoun “it”. Computer Speech and
Language, 2 (2):109–132.
Papineni, Kishore, Roukos, Salim, Ward, Todd and Zhu, Wei-Jing (2002). BLEU:
A method for automatic evaluation of machine translation. In: Proceedings of the
40th Annual Meeting of the Association for Computational Linguistics, Philadelphia
(Pennsylvania, USA), 311–318.
Petrov, Slav, Barrett, Leon, Thibaux, Romain and Klein, Dan (2006). Learning
accurate, compact, and interpretable tree annotation. In: Proceedings of the 21st
International Conference on Computational Linguistics and 44th Annual Meeting of
the Association for Computational Linguistics, Sydney (Australia), 433–440.
Petrov, Slav and Klein, Dan (2007). Improved inference for unlexicalized parsing.
In: Proceedings of Human Language Technologies 2007: The Conference of the North
American Chapter of the Association for Computational Linguistics, Rochester (New
York, USA), 404–411.
Popescu-Belis, Andrei, Cartoni, Bruno, Gesmundo, Andrea, Henderson, James,
Grisot, Cristina, Merlo, Paola, Meyer, Thomas, Moeschler, Jacques and
Zufferey, Sandrine (2012a). Improving MT coherence through text-level processing of input texts: The COMTIS project. In: Tralogy, Session 6 – Translation
and Natural Language Processing / Traduction et traitement automatique des langues
(TAL), Paris (France).
Popescu-Belis, Andrei, Meyer, Thomas, Liyanapathirana, Jeevanthi, Cartoni,
Bruno and Zufferey, Sandrine (2012b). Discourse-level annotation over Europarl
for machine translation: Connectives and pronouns. In: Proceedings of the Eighth
Language Resources and Evaluation Conference (LREC’12), Istanbul (Turkey).
Postolache, Oana, Cristea, Dan and Orăsan, Constantin (2006). Transferring
coreference chains through word alignment. In: Proceedings of the Fifth Language
Resources and Evaluation Conference (LREC-2006), Genoa (Italy), 889–892.
Pradhan, Sameer, Ramshaw, Lance, Marcus, Mitchell, Palmer, Martha,
Weischedel, Ralph and Xue, Nianwen (2011). CoNLL-2011 shared task: Modeling
unrestricted coreference in OntoNotes. In: Proceedings of the Fifteenth Conference
on Computational Natural Language Learning: Shared Task, Portland (Oregon, USA),
1–27.
Rahman, Altaf and Ng, Vincent (2012). Translation-based projection for multilingual coreference resolution. In: Proceedings of the 2012 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Montréal (Canada), 720–730.
Ruiz, Nick and Federico, Marcello (2011). Topic adaptation for lecture translation
through bilingual latent semantic models. In: Proceedings of the Sixth Workshop on
Statistical Machine Translation, Edinburgh (Scotland, UK), 294–302.
Rumelhart, David E., Hinton, Geoffrey E. and Williams, Ronald J. (1986). Learning representations by back-propagating errors. Nature, 323 (6088):533–536.
Russo, Lorenza, Loáiciga, Sharid and Gulati, Asheesh (2012a). Improving machine
translation of null subjects in Italian and Spanish. In: Proceedings of the Student
Research Workshop at the 13th Conference of the European Chapter of the Association
for Computational Linguistics, Avignon (France), 81–89.
Russo, Lorenza, Loáiciga, Sharid and Gulati, Asheesh (2012b). Italian and Spanish
null subjects. A case study evaluation in an MT perspective. In: Proceedings of the
Eighth Language Resources and Evaluation Conference (LREC’12), Istanbul (Turkey),
1779–1784.
Russo, Lorenza, Scherrer, Yves, Goldman, Jean-Philippe, Loáiciga, Sharid,
Nerima, Luka and Wehrli, Éric (2011). Étude inter-langues de la distribution et
des ambiguïtés syntaxiques des pronoms. In: Mathieu Lafourcade and Violaine
Prince (eds.), Actes de la 18e conférence sur le Traitement Automatique des Langues
Naturelles, Montpellier (France), volume 2, 279–284.
Sagot, Benoît, Clément, Lionel, Villemonte de La Clergerie, Éric and Boullier,
Pierre (2006). The Lefff 2 syntactic lexicon for French: architecture, acquisition,
use. In: Proceedings of the Fifth Language Resources and Evaluation Conference
(LREC-2006), Genoa (Italy), 1348–1351.
Sanders, Ted and Pander Maat, Henk (2006). Cohesion and coherence: Linguistic
approaches. In: Encyclopedia of language and linguistics, Elsevier, volume 2, 591–
595.
Savoy, Jacques (1999). A stemming procedure and stopword list for general French
corpora. Journal of the American Society for Information Science, 50 (10):944–952.
Scherrer, Yves, Russo, Lorenza, Goldman, Jean-Philippe, Loáiciga, Sharid,
Nerima, Luka and Wehrli, Éric (2011). La traduction automatique des pronoms.
Problèmes et perspectives. In: Mathieu Lafourcade and Violaine Prince (eds.),
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles,
Montpellier (France), volume 2.
Schmid, Helmut and Laws, Florian (2008). Estimation of conditional probabilities
with decision trees and an application to fine-grained POS tagging. In: Proceedings
of the 22nd International Conference on Computational Linguistics (COLING 2008),
Manchester (England, UK), 777–784.
Schwenk, Holger (2007). Continuous space language models. Computer Speech and
Language, 21 (3):492–518.
Scott, William A. (1955). Reliability of content analysis: The case of nominal scale
coding. Public Opinion Quarterly, 19 (3):321–325.
Shannon, Claude E. (1948). A mathematical theory of communication. Bell System
Technical Journal, 27 (3):379–423.
Snell-Hornby, Mary (1995). Linguistic transcoding or cultural transfer? A critique
of translation theory in Germany. In: Susan Bassnett and André Lefevere (eds.),
Translation, History and Culture, London: Cassell, 79–86.
Snell-Hornby, Mary (2010). The turns of translation studies. In: Handbook of
Translation Studies, Amsterdam: John Benjamins, volume 1, 366–370.
Snover, Matthew, Dorr, Bonnie, Schwartz, Richard, Micciulla, Linnea and
Makhoul, John (2006). A study of translation edit rate with targeted human
annotation. In: AMTA 2006: Proceedings of the 7th Conference of the Association for
Machine Translation in the Americas, “Visions for the Future of Machine Translation”,
Cambridge (Massachusetts, USA), 223–231.
Soon, Wee Meng, Ng, Hwee Tou and Lim, Daniel Chung Yong (2001). A machine learning approach to coreference resolution of noun phrases. Computational
Linguistics, 27 (4):521–544.
de Souza, José and Orăsan, Constantin (2011). Can projected chains in parallel
corpora help coreference resolution? In: Iris Hendrickx, Sobha Lalitha Devi,
António Branco and Ruslan Mitkov (eds.), Anaphora Processing and Applications,
Berlin: Springer, Lecture Notes in Computer Science, volume 7099, 59–69.
Stolcke, Andreas (2002). SRILM: An extensible language modeling toolkit. In:
Proceedings of the International Conference on Spoken Language Processing, Denver
(Colorado, USA).
Stolcke, Andreas, Zheng, Jing, Wang, Wen and Abrash, Victor (2011). SRILM
at sixteen: Update and outlook. In: Proceedings of the IEEE Automatic Speech
Recognition and Understanding Workshop, Waikoloa (Hawaii, USA).
Stuckardt, Roland (2007). Applying backpropagation networks to anaphor resolution. In: António Branco (ed.), Anaphora: Analysis, Algorithms and Applications.
6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007, Lagos
(Portugal), Lecture Notes in Artificial Intelligence, volume 4410, 107–124.
Stymne, Sara, Hardmeier, Christian, Tiedemann, Jörg and Nivre, Joakim (2013a).
Feature weight optimization for discourse-level SMT. In: Proceedings of the Workshop on Discourse in Machine Translation, Sofia (Bulgaria), 60–69.
Stymne, Sara, Hardmeier, Christian, Tiedemann, Jörg and Nivre, Joakim (2013b).
Tunable distortion limits and corpus cleaning for SMT. In: Proceedings of the Eighth
Workshop on Statistical Machine Translation, Sofia (Bulgaria), 225–231.
Stymne, Sara, Tiedemann, Jörg, Hardmeier, Christian and Nivre, Joakim (2013c).
Statistical machine translation with readability constraints. In: Stephan Oepen,
Kristin Hagen and Janne Bondi Johannessen (eds.), Proceedings of the 19th Nordic
Conference of Computational Linguistics (NODALIDA 2013), Oslo (Norway), 375–386.
Taira, Hirotoshi, Sudoh, Katsuhito and Nagata, Masaaki (2012). Zero pronoun
resolution can improve the quality of J-E translation. In: Proceedings of the Sixth
Workshop on Syntax, Semantics and Structure in Statistical Translation, Jeju Island
(Korea), 111–118.
Tam, Yik-Cheung, Lane, Ian and Schultz, Tanja (2007). Bilingual LSA-based
adaptation for statistical machine translation. Machine Translation, 21 (4):187–207.
Tiedemann, Jörg (2010a). Context adaptation in statistical machine translation
using models with exponentially decaying cache. In: Proceedings of the ACL 2010
Workshop on Domain Adaptation for Natural Language Processing (DANLP), Uppsala
(Sweden), 8–15.
Tiedemann, Jörg (2010b). To cache or not to cache? Experiments with adaptive
models in statistical machine translation. In: Proceedings of the ACL 2010 Joint Fifth
Workshop on Statistical Machine Translation and MetricsMATR, Uppsala (Sweden),
189–194.
Trask, R. L. (1993). A dictionary of grammatical terms in linguistics. London: Routledge.
Tsvetkov, Yulia, Dyer, Chris, Levin, Lori and Bhatia, Archna (2013). Generating
English determiners in phrase-based translation with synthetic translation options.
In: Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia
(Bulgaria), 271–280.
Ture, Ferhan, Oard, Douglas W. and Resnik, Philip (2012). Encouraging consistent
translation choices. In: Proceedings of the 2012 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Montréal (Canada), 417–426.
Uryupina, Olga (2006). Coreference resolution with and without linguistic knowledge. In: Proceedings of the Fifth Language Resources and Evaluation Conference
(LREC-2006), Genoa (Italy), 893–898.
Versley, Yannick, Ponzetto, Simone Paolo, Poesio, Massimo, Eidelman, Vladimir,
Jern, Alan, Smith, Jason, Yang, Xiaofeng and Moschitti, Alessandro (2008).
BART: A modular toolkit for coreference resolution. In: Proceedings of the ACL-08:
HLT Demo Session, Columbus (Ohio, USA), 9–12.
Veselovská, Kateřina, Ngụy Giang Linh and Novák, Michal (2012). Using Czech–English parallel corpora in automatic identification of “it”. In: Proceedings of the 5th
Workshop on Building and Using Comparable Corpora, Istanbul (Turkey), 112–120.
Voigt, Rob and Jurafsky, Dan (2012). Towards a literary machine translation: The
role of referential cohesion. In: Proceedings of the NAACL-HLT 2012 Workshop on
Computational Linguistics for Literature, Montréal (Canada), 18–25.
Wandmacher, Tonio and Antoine, Jean-Yves (2007). Methods to integrate a language model with semantic information for a word prediction component. In:
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning, Prague (Czech Republic),
506–513.
Wäschle, Katharina and Riezler, Stefan (2012). Structural and topical dimensions
in multi-task patent translation. In: Proceedings of the 13th Conference of the
European Chapter of the Association for Computational Linguistics, Avignon (France),
818–828.
Witten, Ian H. and Bell, Timothy C. (1991). The zero-frequency problem: Estimating
the probabilities of novel events in adaptive text compression. IEEE Transactions
on Information Theory, 37 (4):1085–1094.
Wong, Billy T. M. and Kit, Chunyu (2012). Extending machine translation evaluation
metrics with lexical cohesion to document level. In: Proceedings of the 2012 Joint
Conference on Empirical Methods in Natural Language Processing and Computational
Natural Language Learning, Jeju Island (Korea), 1060–1068.
Wong, Billy T. M., Pun, Cecilia F. K., Kit, Chunyu and Webster, Jonathan J.
(2011). Lexical cohesion for evaluation of machine translation at document level.
In: Proceedings of the 7th International Conference on Natural Language Processing
and Knowledge Engineering (NLP-KE), Tokushima (Japan), 238–242.
Xiao, Tong, Zhu, Jingbo, Yao, Shujie and Zhang, Hao (2011). Document-level
consistency verification in machine translation. In: MT Summit XIII: the Thirteenth
Machine Translation Summit, Xiamen (China), 131–138.
Xiong, Deyi, Ding, Yang, Zhang, Min and Tan, Chew Lim (2013a). Lexical chain
based cohesion models for document-level statistical machine translation. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
Seattle (Washington, USA), 1563–1573.
Xiong, Deyi, Ben, Guosheng, Zhang, Min, Lü, Yajuan and Liu, Qun (2013b). Modeling lexical cohesion for document-level machine translation. In: Proceedings of
the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing
(China), 2183–2189.
Xiong, Deyi and Zhang, Min (2013). A topic-based coherence model for statistical
machine translation. In: Proceedings of the Twenty-Seventh AAAI Conference on
Artificial Intelligence, Bellevue (Washington, USA), 977–983.
Žabokrtský, Zdeněk, Ptáček, Jan and Pajas, Petr (2008). TectoMT: Highly modular
MT system with tectogrammatics used as transfer layer. In: Proceedings of the Third
Workshop on Statistical Machine Translation, Columbus (Ohio, USA), 167–170.
Zeman, Daniel (2010). Hierarchical phrase-based MT at the Charles University for
the WMT 2010 shared task. In: Proceedings of the Joint Fifth Workshop on Statistical
Machine Translation and MetricsMATR, Uppsala (Sweden), 212–215.
Zhao, Bing and Xing, Eric P. (2006). BiTAM: Bilingual topic admixture models for
word alignment. In: Proceedings of the COLING/ACL 2006 Main Conference Poster
Sessions, Sydney (Australia), 969–976.
Zhao, Bing and Xing, Eric P. (2008). HM-BiTAM: bilingual topic exploration, word
alignment, and translation. In: J. C. Platt, D. Koller, Y. Singer and S. Roweis (eds.),
Advances in Neural Information Processing Systems 20, Cambridge (Mass.): MIT
Press, 1689–1696.
ACTA UNIVERSITATIS UPSALIENSIS
Studia Linguistica Upsaliensia
Editors: Joakim Nivre and Åke Viberg
1. Jörg Tiedemann, Recycling translations. Extraction of lexical data from parallel corpora and their application in natural language processing. 2003.
2. Agnes Edling, Abstraction and authority in textbooks. The textual paths towards specialized language. 2006.
3. Åsa af Geijerstam, Att skriva i naturorienterande ämnen i skolan. 2006.
4. Gustav Öquist, Evaluating Readability on Mobile Devices. 2006.
5. Jenny Wiksten Folkeryd, Writing with an Attitude. Appraisal and student texts in the school subject of Swedish. 2006.
6. Ingrid Björk, Relativizing linguistic relativity. Investigating underlying assumptions about language in the neo-Whorfian literature. 2008.
7. Joakim Nivre, Mats Dahllöf and Beáta Megyesi, Resourceful Language Technology. Festschrift in Honor of Anna Sågvall Hein. 2008.
8. Anju Saxena & Åke Viberg, Multilingualism. Proceedings of the 23rd Scandinavian Conference of Linguistics. 2009.
9. Markus Saers, Translation as Linear Transduction. Models and Algorithms for Efficient Learning in Statistical Machine Translation. 2011.
10. Ulrika Serrander, Bilingual lexical processing in single word production. Swedish learners of Spanish and the effects of L2 immersion. 2011.
11. Mattias Nilsson, Computational Models of Eye Movements in Reading: A Data-Driven Approach to the Eye-Mind Link. 2012.
12. Luying Wang, Second Language Acquisition of Mandarin Aspect Markers by Native Swedish Adults. 2012.
13. Farideh Okati, The Vowel Systems of Five Iranian Balochi Dialects. 2012.
14. Oscar Täckström, Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision. 2013.
15. Christian Hardmeier, Discourse in Statistical Machine Translation. 2014.