Findings of the 2009 Workshop on Statistical Machine Translation

Chris Callison-Burch, Johns Hopkins University (ccb@cs.jhu.edu)
Philipp Koehn, University of Edinburgh (pkoehn@inf.ed.ac.uk)
Christof Monz, University of Amsterdam (christof@science.uva.nl)
Josh Schroeder, University of Edinburgh (j.schroeder@ed.ac.uk)
Abstract
This paper presents the results of the
WMT09 shared tasks, which included a
translation task, a system combination
task, and an evaluation task. We conducted a large-scale manual evaluation of
87 machine translation systems and 22
system combination entries. We used the
ranking of these systems to measure how
strongly automatic metrics correlate with
human judgments of translation quality,
for more than 20 metrics. We present a
new evaluation technique whereby system
output is edited and judged for correctness.
1 Introduction

This paper [1] presents the results of the shared tasks of the 2009 EACL Workshop on Statistical Machine Translation, which builds on three previous workshops (Koehn and Monz, 2006; Callison-Burch et al., 2007; Callison-Burch et al., 2008). There were three shared tasks this year: a translation task between English and five other European languages, a task to combine the output of multiple machine translation systems, and a task to predict human judgments of translation quality using automatic evaluation metrics. The performance on each of these shared tasks was determined after a comprehensive human evaluation.

There were a number of differences between this year's workshop and last year's workshop:

• Larger training sets – In addition to annual increases in the Europarl corpus, we released a French-English parallel corpus verging on 1 billion words. We also provided large monolingual training sets for better language modeling of the news translation task.

• Reduced number of conditions – Previous workshops had many conditions: 10 language pairs, both in-domain and out-of-domain translation, and three types of manual evaluation. This year we eliminated the in-domain Europarl test set and defined sentence-level ranking as the primary type of manual evaluation.

• Editing to evaluate translation quality – Beyond ranking the output of translation systems, we evaluated translation quality by having people edit the output of systems. Later, we asked annotators to judge whether those edited translations were correct when shown the source and reference translation.

[1] This paper was corrected subsequent to publication. Tables 10 and 11 were incorrectly calculated due to a mismatch between the segment indices for the human rankings and the automatic metrics. This error resulted in many metrics appearing to perform only marginally above random. We have corrected these tables and the text that describes them in Sections 6.2 and 7. This corrected version was released on December 30, 2009.
The primary objectives of this workshop are to evaluate the state of the art in machine translation, to disseminate common test sets and public training data with published performance numbers, and to refine evaluation methodologies for machine translation. All of the data, translations, and human judgments produced for our workshop are publicly available. [2] We hope they form a valuable resource for research into statistical machine translation, system combination, and automatic evaluation of translation quality.

[2] http://statmt.org/WMT09/results.html
2 Overview of the shared translation and system combination tasks
The workshop examined translation between English and five other languages: German, Spanish,
French, Czech, and Hungarian. We created a test
set for each language pair by translating newspaper articles. We additionally provided training
data and a baseline system.
2.1 Test data
The test data for this year's task was created by hiring people to translate news articles that were drawn from a variety of sources during the period from the end of September to mid-October of 2008. A total of 136 articles were selected, in roughly equal amounts from a variety of Czech, English, French, German, Hungarian, Italian and Spanish news sites: [3]

Hungarian: hvg.hu (10), Napi (2), MNO (4), Népszabadság (4)

Czech: iHNed.cz (3), iDNES.cz (4), Lidovky.cz (3), aktuálně.cz (2), Novinky (1)

French: dernieresnouvelles (1), Le Figaro (2), Les Echos (4), Liberation (4), Le Devoir (9)

Spanish: ABC.es (11), El Mundo (12)

English: BBC (11), New York Times (6), Times of London (4)

German: Süddeutsche Zeitung (3), Frankfurter Allgemeine Zeitung (3), Spiegel (8), Welt (3)

Italian: ADN Kronos (5), Affari Italiani (2), ASCA (1), Corriere della Sera (4), Il Sole 24 ORE (1), Il Quotidiano (1), La Republica (8)

Note that Italian translation was not one of this year's official translation tasks.
The translations were created by the members of the EuroMatrix consortium, who hired a mix of professional and non-professional translators. All translators were fluent or native speakers of both languages. Although we made efforts to proofread all translations, many sentences still contain minor errors and disfluencies. All of the translations were done directly, and not via an intermediate language. For instance, each of the 20 Hungarian articles was translated directly into Czech, English, French, German, Italian and Spanish.
[3] For more details see the XML test files. The docid tag gives the source and the date for each document in the test set, and the origlang tag indicates the original source language.
The total cost of creating the test sets consisting
of roughly 80,000 words across 3027 sentences in
seven languages was approximately 31,700 euros
(around 39,800 dollars at current exchange rates,
or slightly more than $0.08/word).
Previous evaluations additionally used test sets drawn from the Europarl corpus. Our rationale for discontinuing the use of Europarl as a test set was that it overly biases the evaluation towards statistical systems that were trained on this particular domain, and that European Parliament proceedings are of less general interest than news stories. We focus on a
single task since the use of multiple test sets in the
past spread our resources too thin, especially in the
manual evaluation.
2.2 Training data
As in past years we provided parallel corpora to
train translation models, monolingual corpora to
train language models, and development sets to
tune parameters. Some statistics about the training materials are given in Figure 1.
10^9 word parallel corpus
To create the large French-English parallel corpus, we conducted a targeted web crawl of bilingual web sites. These sites came from a variety of
sources including the Canadian government, the
European Union, the United Nations, and other
international organizations. The crawl yielded on
the order of 40 million files, consisting of more
than 1TB of data. Pairs of translated documents
were identified using a set of simple heuristics to
transform French URLs into English URLs (for instance, by replacing fr with en). Documents that
matched were assumed to be translations of each
other.
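To make the document-pairing heuristic concrete, here is a minimal Python sketch. The substitution rules below (a /fr/ path component, a lang=fr query parameter, a _fr file suffix) are illustrative assumptions; the actual patterns used for the crawl are not described in detail here.

```python
import re

# Example substitution patterns (assumptions for illustration; the real crawl
# used a set of simple heuristics of this general form).
FR_EN_PATTERNS = [
    (re.compile(r"/fr/"), "/en/"),                   # path component: /fr/ -> /en/
    (re.compile(r"([?&]lang=)fr\b"), r"\1en"),       # query parameter: lang=fr -> lang=en
    (re.compile(r"_fr\.(html?|pdf)$"), r"_en.\1"),   # file suffix: _fr.html -> _en.html
]

def candidate_english_urls(french_url):
    """Generate candidate English URLs for a French URL by applying each rule."""
    candidates = []
    for pattern, replacement in FR_EN_PATTERNS:
        english_url, n = pattern.subn(replacement, french_url)
        if n > 0 and english_url != french_url:
            candidates.append(english_url)
    return candidates

def pair_documents(crawled_urls):
    """Pair French documents with English documents found in the same crawl.
    Documents whose URLs match under a rule are assumed to be translations."""
    url_set = set(crawled_urls)
    pairs = []
    for url in crawled_urls:
        for candidate in candidate_english_urls(url):
            if candidate in url_set:
                pairs.append((url, candidate))
                break
    return pairs

if __name__ == "__main__":
    crawl = [
        "http://example.org/fr/rapport.html",
        "http://example.org/en/rapport.html",
        "http://example.org/doc?id=7&lang=fr",
        "http://example.org/doc?id=7&lang=en",
    ]
    print(pair_documents(crawl))
```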
All HTML and PDF documents were converted
into plain text, which yielded 2 million French
files paired with their English equivalents. Text
files were split so that they contained one sentence per line and had markers between paragraphs. They were sentence-aligned in batches of
10,000 document pairs, using a sentence aligner
that incorporates IBM Model 1 probabilities in addition to sentence lengths (Moore, 2002). The
document-aligned corpus contained 220 million
segments with 2.9 billion words on the French side
and 215 million segments with 2.5 billion words
on the English side. After sentence alignment,
there were 177 million sentence pairs with 2.5 billion French words and 2.2 billion English words.
Europarl Training Corpus (foreign / English)
  Spanish ↔ English:  Sentences 1,411,589; Words 40,067,498 / 41,042,070; Distinct words 154,971 / 108,116
  French ↔ English:   Sentences 1,428,799; Words 44,692,992 / 40,067,498; Distinct words 129,166 / 107,733
  German ↔ English:   Sentences 1,418,115; Words 39,516,645 / 37,431,872; Distinct words 320,180 / 104,269

News Commentary Training Corpus (foreign / English)
  Spanish ↔ English:  Sentences 74,512; Words 2,052,186 / 1,799,312; Distinct words 56,578 / 41,592
  French ↔ English:   Sentences 64,223; Words 1,831,149 / 1,560,274; Distinct words 46,056 / 38,821
  German ↔ English:   Sentences 82,740; Words 2,051,369 / 1,977,200; Distinct words 92,313 / 43,383
  Czech ↔ English:    Sentences 79,930; Words 1,733,865 / 1,891,559; Distinct words 105,280 / 41,801

10^9 Word Parallel Corpus (French / English)
  French ↔ English:   Sentences 22,520,400; Words 811,203,407 / 668,412,817; Distinct words 2,738,882 / 2,861,836

Hunglish Training Corpus (Hungarian / English)
  Hungarian ↔ English: Sentences 1,517,584; Words 26,114,985 / 31,467,693; Distinct words 717,198 / 192,901

CzEng Training Corpus (Czech / English)
  Czech ↔ English:    Sentences 1,096,940; Words 15,336,783 / 17,909,979; Distinct words 339,683 / 129,176

Europarl Language Model Data
  English:   Sentences 1,658,841; Words 44,983,136; Distinct words 117,577
  Spanish:   Sentences 1,607,419; Words 45,382,287; Distinct words 162,604
  French:    Sentences 1,676,435; Words 50,577,097; Distinct words 138,621
  German:    Sentences 1,713,715; Words 41,457,414; Distinct words 348,197

News Language Model Data
  English:   Sentences 21,232,163; Words 504,094,159; Distinct words 1,141,895
  Spanish:   Sentences 1,626,538; Words 48,392,418; Distinct words 358,664
  French:    Sentences 6,722,485; Words 167,204,556; Distinct words 660,123
  German:    Sentences 10,193,376; Words 185,639,915; Distinct words 1,668,387
  Czech:     Sentences 5,116,211; Words 81,743,223; Distinct words 929,318
  Hungarian: Sentences 4,209,121; Words 86,538,513; Distinct words 1,313,578

News Test Set (2525 sentences)
  English:   Words 65,595; Distinct words 8,907
  Spanish:   Words 68,092; Distinct words 10,631
  French:    Words 72,554; Distinct words 10,609
  German:    Words 62,699; Distinct words 12,277
  Czech:     Words 55,389; Distinct words 15,387
  Hungarian: Words 54,464; Distinct words 16,167
  Italian:   Words 64,906; Distinct words 11,046

News System Combination Development Set (502 sentences)
  English:   Words 11,843; Distinct words 2,940
  Spanish:   Words 12,499; Distinct words 3,176
  French:    Words 12,988; Distinct words 3,202
  German:    Words 11,235; Distinct words 3,471
  Czech:     Words 9,997; Distinct words 4,121
  Hungarian: Words 9,628; Distinct words 4,133
  Italian:   Words 11,833; Distinct words 3,318

Figure 1: Statistics for the training and test sets used in the translation task. The number of words is based on the provided tokenizer and the number of distinct words is based on lowercased tokens.
The sentence-aligned corpus was cleaned to remove sentence pairs which consisted only of numbers or paragraph markers, or where the French
and English sentences were identical. The latter
step helped eliminate documents that were not
actually translated, which was necessary because
we did not perform language identification. After
cleaning, the parallel corpus contained 105 million
sentence pairs with 2 billion French words and 1.8
billion English words.
In addition to cleaning the sentence-aligned parallel corpus, we also de-duplicated the corpus, removing all sentence pairs that occurred more than
once in the parallel corpus. Many of the documents gathered in our web crawl were duplicates
or near duplicates, and a lot of the text is repeated,
as with web site navigation. We further eliminated sentence pairs that varied from previous sentences by only numbers, which helped eliminate
template web pages such as expense reports. We
used a Bloom Filter (Talbot and Osborne, 2007) to
do de-duplication, so it may have discarded more
sentence pairs than strictly necessary. After deduplication, the parallel corpus contained 28 million sentence pairs with 0.8 billion French words
and 0.7 billion English words.
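As a rough illustration of this de-duplication step, the sketch below keeps the first occurrence of each sentence pair using a small Bloom filter over digit-normalized keys. It is a simplified stand-in for the approach of Talbot and Osborne (2007) rather than the code used to build the released corpus, and the filter size, hash function, and normalization are assumptions.

```python
import hashlib
import re

class BloomFilter:
    """A minimal Bloom filter: membership tests may yield false positives,
    so some non-duplicate pairs may be discarded, as noted in the text."""
    def __init__(self, num_bits=1 << 24, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

def normalize(sentence):
    # Collapse digit runs so that pairs differing only in numbers
    # (e.g. templated expense reports) hash to the same key.
    return re.sub(r"\d+", "0", sentence.strip().lower())

def deduplicate(sentence_pairs):
    """Yield each (French, English) pair the first time its normalized form is seen."""
    seen = BloomFilter()
    for fr, en in sentence_pairs:
        key = normalize(fr) + "\t" + normalize(en)
        if key in seen:
            continue
        seen.add(key)
        yield fr, en
```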
Monolingual news corpora
We have crawled the news sources that were the
basis of our test sets (and a few more additional
sources) since August 2007. This allowed us to
assemble large corpora in the target domain to be
mainly used as training data for language modeling. We collected texts from the beginning of
our data collection period to one month before the
test set period, segmented these into sentences and
randomized the order of the sentences to obviate
copyright concerns.
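A minimal sketch of that preparation step, assuming a crude regular-expression sentence splitter (the actual segmenter used for the released corpora is not specified here):

```python
import random
import re

def split_sentences(text):
    # Crude sentence splitter used only for illustration; the released
    # corpora were produced with a proper segmenter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_shuffled_corpus(articles, seed=0):
    """Segment crawled articles into sentences and shuffle them, so that the
    original articles cannot be reconstructed from the released corpus."""
    sentences = []
    for article in articles:
        sentences.extend(split_sentences(article))
    random.Random(seed).shuffle(sentences)
    return sentences
```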
2.3 Baseline system
To lower the barrier of entry for newcomers to the
field, we provided Moses, an open source toolkit
for phrase-based statistical translation (Koehn et
al., 2007). The performance of this baseline system is similar to the best submissions in last year’s
shared task. Twelve participating groups used the
Moses toolkit for the development of their system.
2.4 Submitted systems
We received submissions from 22 groups from
20 institutions, as listed in Table 1, a similar
turnout to last year’s shared task. Of the 20
groups that participated with regular system submissions in last year’s shared task, 12 groups returned this year. A major hurdle for many was
a DARPA/GALE evaluation that occurred at the
same time as this shared task.
We also evaluated 7 commercial rule-based MT
systems, and Google’s online statistical machine
translation system. We note that Google did not
submit an entry itself. Its entry was created by
the WMT09 organizers using Google's online system. [4] In personal correspondence, Franz Och
clarified that the online system is different from
Google’s research system in that it runs at faster
speeds at the expense of somewhat lower translation quality. On the other hand, the training data
used by Google is unconstrained, which means
that it may have an advantage compared to the research systems evaluated in this workshop, since
they were trained using only the provided materials.
2.5 System combination
In total, we received 87 primary system submissions along with 42 secondary submissions. These
were made available to participants in the system combination shared task. Based on feedback
that we received on last year’s system combination task, we provided two additional resources to
participants:
• Development set: We reserved 25 articles to use as a dev set for system combination (details of the set are given in Figure 1). These were translated by all participating sites, and distributed to system combination participants along with reference translations.
• n-best translations: We requested n-best
lists from sites whose systems could produce
them. We received 25 100-best lists accompanying the primary system submissions, and
5 accompanying the secondary system submissions.
In addition to soliciting system combination entries for each of the language pairs, we treated system combination as a way of doing multi-source
translation, following Schroeder et al. (2009). For
the multi-source system combination task, we provided all 46 primary system submissions from any
language into English, along with an additional 32
secondary systems.
[4] http://translate.google.com
ID                 Participant
CMU-STATXFER       Carnegie Mellon University's statistical transfer system (Hanneman et al., 2009)
COLUMBIA           Columbia University (Carpuat, 2009)
CU-BOJAR           Charles University Bojar (Bojar et al., 2009)
CU-TECTOMT         Charles University Tectogrammatical MT (Bojar et al., 2009)
DCU                Dublin City University (Du et al., 2009)
EUROTRANXP         commercial MT provider from the Czech Republic
GENEVA             University of Geneva (Wehrli et al., 2009)
GOOGLE             Google's production system
JHU                Johns Hopkins University (Li et al., 2009)
JHU-TROMBLE        Johns Hopkins University Tromble (Eisner and Tromble, 2006)
LIMSI              LIMSI (Allauzen et al., 2009)
LIU                Linköping University (Holmqvist et al., 2009)
LIUM-SYSTRAN       University of Le Mans / Systran (Schwenk et al., 2009)
MORPHO             Morphologic (Novák, 2009)
NICT               National Institute of Information and Comm. Tech., Japan (Paul et al., 2009)
NUS                National University of Singapore (Nakov and Ng, 2009)
PCTRANS            commercial MT provider from the Czech Republic
RBMT 1-5           commercial systems from Lernout & Hauspie, Lingenio, Lucy, PROMT, SDL
RWTH               RWTH Aachen (Popovic et al., 2009)
STUTTGART          University of Stuttgart (Fraser, 2009)
SYSTRAN            Systran (Dugast et al., 2009)
TALP-UPC           Universitat Politecnica de Catalunya, Barcelona (R. Fonollosa et al., 2009)
UEDIN              University of Edinburgh (Koehn and Haddow, 2009)
UKA                University of Karlsruhe (Niehues et al., 2009)
UMD                University of Maryland (Dyer et al., 2009)
USAAR              University of Saarland (Federmann et al., 2009)

Table 1: Participants in the shared translation task. Not all groups participated in all language pairs.
ID                   Participant
BBN-COMBO            BBN system combination (Rosti et al., 2009)
CMU-COMBO            Carnegie Mellon University system combination (Heafield et al., 2009)
CMU-COMBO-HYPOSEL    CMU system comb. with hyp. selection (Hildebrand and Vogel, 2009)
DCU-COMBO            Dublin City University system combination
RWTH-COMBO           RWTH Aachen system combination (Leusch et al., 2009)
USAAR-COMBO          University of Saarland system combination (Chen et al., 2009)

Table 2: Participants in the system combination task.
Language Pair          Sentence Ranking    Edited Translations    Yes/No Judgments
German-English         3,736               1,271                  4,361
English-German         3,700               823                    3,854
Spanish-English        2,412               844                    2,599
English-Spanish        1,878               278                    837
French-English         3,920               1,145                  4,491
English-French         1,968               332                    1,331
Czech-English          1,590               565                    1,071
English-Czech          7,121               2,166                  9,460
Hungarian-English      1,426               554                    1,309
All-English            4,807               0                      0
Multisource-English    2,919               647                    2,184
Totals                 35,786              8,655                  31,524

Table 3: The number of items that were judged for each task during the manual evaluation.
Table 2 lists the six participants in the system
combination task.
3 Human evaluation
As with past workshops, we placed greater emphasis on the human evaluation than on the automatic evaluation metric scores. It is our contention
that automatic measures are an imperfect substitute for human assessment of translation quality.
Therefore, we define the manual evaluation to be
primary, and use the human judgments to validate
automatic metrics.
Manual evaluation is time consuming, and it requires a large effort to conduct it on the scale of
our workshop. We distributed the workload across
a number of people, including shared-task participants, interested volunteers, and a small number
of paid annotators. More than 160 people participated in the manual evaluation, with 100 people
putting in more than an hour’s worth of effort, and
30 putting in more than four hours. A collective
total of 479 hours of labor was invested.
We asked people to evaluate the systems’ output
in two different ways:
• Ranking translated sentences relative to each
other. This was our official determinant of
translation quality.
• Editing the output of systems without displaying the source or a reference translation,
and then later judging whether edited translations were correct.
The total number of judgments collected for the
different modes of annotation is given in Table 3.
In all cases, the output of the various translation systems was judged on an equal footing: the output of system combinations was judged alongside that of the individual systems, and the constrained and unconstrained systems were judged together.
3.1 Ranking translations of sentences
Ranking translations relative to each other is a reasonably intuitive task. We therefore kept the instructions simple:
Rank translations from Best to Worst relative to the other choices (ties are allowed).
In our manual evaluation, annotators were shown at most five translations at a time. For most language pairs there were more than five system submissions. We did not attempt to get a complete ordering over the systems, and instead relied on random selection and a reasonably large sample size to make the comparisons fair.
Relative ranking is our official evaluation metric. Individual systems and system combinations
are ranked based on how frequently they were
judged to be better than or equal to any other system. The results of this are reported in Section 4.
Appendix A provides detailed tables that contain
pairwise comparisons between systems.
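The score used to order systems, i.e. the proportion of pairwise comparisons in which a system was judged better than or equal to the other system, can be computed from the ranking judgments along the following lines. The tuple format is an assumption about how the judgments are stored, not the actual data format.

```python
from collections import defaultdict

def geq_others_scores(judgments):
    """judgments: iterable of (system_a, system_b, relation) tuples extracted
    from the 5-way ranking screens, where relation is '<', '=', or '>'
    describing how the annotator ranked system_a against system_b.
    Returns, for each system, the fraction of its pairwise comparisons in
    which it was ranked better than or equal to the other system."""
    wins_or_ties = defaultdict(int)
    comparisons = defaultdict(int)
    for sys_a, sys_b, relation in judgments:
        comparisons[sys_a] += 1
        comparisons[sys_b] += 1
        if relation in (">", "="):
            wins_or_ties[sys_a] += 1
        if relation in ("<", "="):
            wins_or_ties[sys_b] += 1
    return {s: wins_or_ties[s] / comparisons[s] for s in comparisons}

# Example: three pairwise judgments extracted from ranking screens.
example = [("GOOGLE", "UEDIN", ">"), ("GOOGLE", "RBMT1", "="), ("UEDIN", "RBMT1", "<")]
print(geq_others_scores(example))  # GOOGLE: 1.0, UEDIN: 0.0, RBMT1: 1.0
```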
3.2 Editing machine translation output
We experimented with a new type of evaluation
this year where we asked judges to edit the output
of MT systems. We did not show judges the reference translation, which makes our edit-based evaluation different from the Human-targeted Translation Error Rate (HTER) measure used in the DARPA GALE program (NIST, 2008). Rather than asking people to make the minimum number of changes to the MT output in order to capture the same meaning as the reference, we asked them to edit the translation to be as fluent as possible without seeing the reference. Our hope was that this
would reflect people’s understanding of the output.
The instructions that we gave our judges were
the following:
Correct the translation displayed, making it as fluent as possible. If no corrections are needed, select “No corrections
needed.” If you cannot understand the
sentence well enough to correct it, select
“Unable to correct.”
Each translated sentence was shown in isolation
without any additional context. A screenshot is
shown in Figure 2.
Since we wanted to prevent judges from seeing the reference before editing the translations,
we split the test set between the sentences used
in the ranking task and the editing task (because
they were being conducted concurrently). Moreover, annotators edited only a single system’s output for one source sentence to ensure that their understanding of it would not be influenced by another system’s output.
3.3 Judging the acceptability of edited output
Halfway through the manual evaluation period, we
stopped collecting edited translations, and instead
asked annotators to do the following:
Indicate whether the edited translations represent fully fluent and meaningequivalent alternatives to the reference
sentence. The reference is shown with
context, the actual sentence is bold.
In addition to edited translations, unedited items
that were either marked as acceptable or as incomprehensible were also shown. Judges gave a simple yes/no indication to each item. A screenshot is
shown in Figure 3.
3.4 Inter- and Intra-annotator agreement
In order to measure intra-annotator agreement 10% of the items were repeated and evaluated twice by each judge. In order to measure inter-annotator agreement 40% of the items were randomly drawn from a common pool that was shared across all annotators so that we would have items that were judged by multiple annotators.

INTER-ANNOTATOR AGREEMENT
  Evaluation type            P(A)   P(E)   K
  Sentence ranking           .549   .333   .323
  Yes/no to edited output    .774   .5     .549

INTRA-ANNOTATOR AGREEMENT
  Evaluation type            P(A)   P(E)   K
  Sentence ranking           .707   .333   .561
  Yes/no to edited output    .866   .5     .732

Table 4: Inter- and intra-annotator agreement for the two types of manual evaluation.
We measured pairwise agreement among annotators using the kappa coefficient (K), which is defined as

  K = (P(A) - P(E)) / (1 - P(E))

where P(A) is the proportion of times that the annotators agree, and P(E) is the proportion of time that they would agree by chance.
For inter-annotator agreement we calculated
P (A) for the yes/no judgments by examining all
items that were annotated by two or more annotators, and calculating the proportion of time they
assigned identical scores to the same items. For
the ranking tasks we calculated P (A) by examining all pairs of systems which had been judged by
two or more judges, and calculated the proportion
of time that they agreed that A > B, A = B, or
A < B. Intra-annotator agreement was computed
similarly, but we gathered items that were annotated on multiple occasions by a single annotator.
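A small sketch of this computation, assuming the judgments are stored as (item, annotator, label) triples (a hypothetical format); for simplicity it compares every pair of judgments on the same item, whereas the paper separates pairs from different annotators (inter-) from repeated judgments by the same annotator (intra-):

```python
from collections import defaultdict
from itertools import combinations

def pairwise_kappa(judgments, p_e):
    """judgments: iterable of (item_id, annotator, label) triples.
    p_e is the chance agreement: 0.5 for binary yes/no judgments and 1/3
    for the three-way ranking comparisons (<, =, >) used in the paper.
    Returns (P(A), K) where K = (P(A) - P(E)) / (1 - P(E))."""
    by_item = defaultdict(list)
    for item_id, annotator, label in judgments:
        by_item[item_id].append(label)
    agree = total = 0
    for labels in by_item.values():
        # Compare every pair of judgments made on the same item.
        for a, b in combinations(labels, 2):
            total += 1
            agree += int(a == b)
    p_a = agree / total if total else 0.0
    kappa = (p_a - p_e) / (1 - p_e)
    return p_a, kappa

example = [(1, "j1", "yes"), (1, "j2", "yes"), (2, "j1", "no"), (2, "j3", "yes")]
print(pairwise_kappa(example, p_e=0.5))  # P(A) = 0.5, K = 0.0
```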
Table 4 gives K values for inter-annotator and
intra-annotator agreement. These give an indication of how often different judges agree, and
how often single judges are consistent for repeated
judgments, respectively. The interpretation of
Kappa varies, but according to Landis and Koch
(1977), 0 − .2 is slight, .2 − .4 is fair, .4 − .6 is
moderate, .6 − .8 is substantial and the rest almost
perfect.
Based on these interpretations, the agreement for yes/no judgments is moderate for inter-annotator agreement and substantial for intra-annotator agreement, but the inter-annotator agreement for sentence-level ranking is only fair.
Figure 2: This screenshot shows an annotator editing the output of a machine translation system.
Figure 3: This screenshot shows an annotator judging the acceptability of edited translations.
Figure 4: The effect of discarding every annotator's initial judgments, up to the first 50 items (inter- and intra-annotator agreement plotted against the proportion of judgments retained).

Figure 5: The effect of removing annotators with the lowest agreement, disregarding up to 40 annotators (inter- and intra-annotator agreement plotted against the proportion of judgments retained).
We analyzed two possible strategies for improving inter-annotator agreement on the ranking task. First, we tried discarding initial judgments to give annotators a chance to learn how to perform the task. Second, we tried disregarding annotators who have very low agreement with others, by throwing away the judgments of the annotators with the lowest agreement.
Figures 4 and 5 show how the K values improve for intra- and inter-annotator agreement under these two strategies, and what percentage of
the judgments are retained as more annotators are
removed, or as the initial learning period is made
longer. It seems that the strategy of removing the
worst annotators is the best in terms of improving inter-annotator K, while retaining most of the
judgments. If we remove the 33 judges with the
worst agreement, we increase the inter-annotator
K from fair to moderate, and still retain 60% of
the data.
For the results presented in the rest of the paper,
we retain all judgments.
4 Translation task results
We used the results of the manual evaluation to
analyze the translation quality of the different systems that were submitted to the workshop. In our
analysis, we aimed to address the following questions:
• Which systems produced the best translation
quality for each language pair?
• Did the system combinations produce better
translations than individual systems?
• Which of the systems that used only the provided training materials produced the best
translation quality?
Table 6 shows the best individual systems. We define the best systems as those which had no other system that was statistically significantly better than them under the Sign Test at p ≤ 0.1. [5] Multiple systems are listed for many language pairs because it was not possible to draw a statistically significant difference between the systems. Commercial translation software (including Google, Systran, Morphologic, PCTrans, Eurotran XP, and anonymized RBMT providers) did well in each of the language pairs. Research systems that utilized only the provided data did as well as commercial vendors in half of the language pairs.

[5] In one case this definition meant that the system that was ranked the highest overall was not considered to be one of the best systems. For German-English translation RBMT 5 was ranked highest overall, but was statistically significantly worse than RBMT 2.
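The significance criterion used above (no other system better under the Sign Test at p ≤ 0.1) can be sketched as follows; the two-sided binomial sign test is standard, and discarding tied judgments is an assumption made for this illustration.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided binomial sign test on the sentences where annotators preferred
    one system over the other (ties discarded). Returns the p-value for the
    null hypothesis that both systems are equally likely to be preferred."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def is_among_best(system, systems, head_to_head, alpha=0.1):
    """A system is 'best' if no other system beats it significantly at p <= alpha.
    head_to_head[(a, b)] gives (times a was preferred, times b was preferred)."""
    for other in systems:
        if other == system:
            continue
        wins_other, wins_system = head_to_head[(other, system)]
        if wins_other > wins_system and sign_test_p_value(wins_other, wins_system) <= alpha:
            return False
    return True
```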
The table also lists the best systems among
those which used only the provided materials.
To make this determination, we excluded unconstrained systems that employed significant external resources. Specifically, we ruled out all of
the commercial systems, since Google has access
to significantly greater data sources for its statistical system, and since the commercial RBMT systems utilize knowledge sources not available to
other workshop participants. The remaining systems were research systems that employ statistical models. We were able to draw distinctions
between half of these for each of the language
pairs. There are some borderline cases, for instance LIMSI only used additional monolingual
training resources, and LIUM/Systran used additional translation dictionaries as well as additional
monolingual resources.
Table 5 summarizes the performance of the
system combination entries by listing the best
ranked combinations, and by indicating whether
they have a statistically significant difference with
the best individual systems. In general, system
combinations performed as well as the best individual systems, but not statistically significantly
better than them. Moreover, it was hard to draw
a distinction between the different system combination strategies themselves. There are a number
of possibilities as to why we failed to find significant differences:
• The number of judgments that we collected was not sufficient to find a difference. Although we collected several thousand judgments for each language pair, most pairs of systems were judged together fewer than 100 times.
• It is possible that the best performing individual systems were sufficiently better than
the other systems and that it is difficult to improve on them by combining them.
• Individual systems could have been weighted
incorrectly during the development stage,
which could happen if the automatic evaluation metric scores on the dev set did not strongly correlate with human judgments.

• The lack of distinction between different combinations could be due to the fact that there is significant overlap in the strategies that they employ.

Table 5: A comparison between the best system combinations and the best individual systems, listing for each language pair the best-ranked combination entries, the number of entries, and whether they differed significantly from the best individual systems. It was generally difficult to draw statistically significant differences between the two groups, and between the combinations themselves.
Improved system combination warrants further investigation. We would suggest collecting additional judgments, and doing oracle experiments
where the contributions of individual systems are
weighted according to human judgments of their
quality.
Understandability
Our hope is that judging the acceptability of edited
output as discussed in Section 3 gives some indication of how often a system’s output was understandable. Figure 6 gives the percentage of times
that each system’s edited output was judged to
be acceptable (the percentage also factors in instances when judges were unable to improve the
output because it was incomprehensible).
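One plausible way to compute these percentages from the judging data is sketched below; exactly how items that judges marked as incomprehensible enter the calculation is an assumption here (they count toward the total but never as acceptable), so the sketch is illustrative only.

```python
def acceptability_rate(items):
    """items: list of dicts with keys 'status' ('edited', 'no-corrections',
    'unable-to-correct') and, for judged items, 'acceptable' (True/False).
    Items the judge could not correct are counted as unacceptable, which is
    one reading of how Figure 6 'factors in' incomprehensible output."""
    total = acceptable = 0
    for item in items:
        total += 1
        if item["status"] == "unable-to-correct":
            continue  # counts toward the total but never as acceptable
        if item.get("acceptable"):
            acceptable += 1
    return acceptable / total if total else 0.0
```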
The edited output of the best performing systems under this evaluation model were
deemed acceptable around 50% of the time
for French-English, English-French, EnglishSpanish, German-English, and English-German.
For Spanish-English the edited output of the best
system was acceptable around 40% of the time, for
English-Czech it was 30% and for Czech-English
and Hungarian-English it was around 20%.
This style of manual evaluation is experimental
and should not be taken to be authoritative. Some
caveats about this measure:
• Editing translations without context is difficult, so the acceptability rate is probably an
underestimate of how understandable a system actually is.
• There are several sources of variance that are
difficult to control for: some people are better
at editing, and some sentences are more difficult to edit. Therefore, variance in the understandability of systems is difficult to pin
down.
• The acceptability measure does not strongly
correlate with the more established method of
ranking translations relative to each other for
all the language pairs. [6]

[6] The Spearman rank correlation coefficients for how the two types of manual evaluation rank systems are .67 for de-en, .67 for fr-en, .06 for es-en, .50 for cz-en, .36 for hu-en, .65 for en-de, .02 for en-fr, -.6 for en-es, and .94 for en-cz.
French–English (625–836 judgments per system)
  System              C?   ≥ others
  GOOGLE •            no   .76
  DCU ?               yes  .66
  LIMSI •             no   .65
  JHU ?               yes  .62
  UEDIN ?             yes  .61
  UKA                 yes  .61
  LIUM-SYSTRAN        no   .60
  RBMT 5              no   .59
  CMU-STATXFER ?      yes  .58
  RBMT 1              no   .56
  USAAR               no   .55
  RBMT 3              no   .54
  RWTH ?              yes  .52
  COLUMBIA            yes  .50
  RBMT 4              no   .47
  GENEVA              no   .34

English–French (422–517 judgments per system)
  System              C?   ≥ others
  LIUM-SYSTRAN •      no   .73
  GOOGLE •            no   .68
  UKA •?              yes  .66
  SYSTRAN •           no   .65
  RBMT 3 •            no   .65
  DCU •?              yes  .65
  LIMSI •             no   .64
  UEDIN ?             yes  .60
  RBMT 4              no   .59
  RWTH                yes  .58
  RBMT 5              no   .57
  RBMT 1              no   .54
  USAAR               no   .48
  GENEVA              no   .38

Hungarian–English (865–988 judgments per system)
  System              C?   ≥ others
  MORPHO •            no   .75
  UMD ?               yes  .66
  UEDIN               yes  .45

German–English (651–867 judgments per system)
  System              C?   ≥ others
  RBMT 5              no   .66
  USAAR •             no   .65
  GOOGLE •            no   .65
  RBMT 2 •            no   .64
  RBMT 3              no   .64
  RBMT 4              no   .62
  STUTTGART •?        yes  .61
  SYSTRAN •           no   .60
  UEDIN ?             yes  .59
  UKA ?               yes  .58
  UMD ?               yes  .56
  RBMT 1              no   .54
  LIU ?               yes  .50
  RWTH                yes  .50
  GENEVA              no   .33
  JHU-TROMBLE         yes  .13

English–German (977–1226 judgments per system)
  System              C?   ≥ others
  RBMT 2 •            no   .66
  RBMT 3 •            no   .64
  RBMT 5 •            no   .64
  USAAR               no   .58
  RBMT 4              no   .58
  RBMT 1              no   .57
  GOOGLE              no   .54
  UKA ?               yes  .54
  UEDIN ?             yes  .51
  LIU ?               yes  .49
  RWTH ?              yes  .48
  STUTTGART           yes  .43

Czech–English (1257–1263 judgments per system)
  System              C?   ≥ others
  GOOGLE •            no   .75
  UEDIN ?             yes  .57
  CU-BOJAR ?          yes  .51

Spanish–English (613–801 judgments per system)
  System              C?   ≥ others
  GOOGLE •            no   .70
  TALP-UPC •?         yes  .59
  UEDIN ?             yes  .56
  RBMT 1 •            no   .55
  RBMT 3 •            no   .55
  RBMT 5 •            no   .55
  RBMT 4 •            no   .53
  RWTH ?              yes  .51
  USAAR               no   .51
  NICT                yes  .37

English–Spanish (632–746 judgments per system)
  System              C?   ≥ others
  RBMT 3 •            no   .66
  UEDIN •?            yes  .66
  GOOGLE •            no   .65
  RBMT 5 •            no   .64
  RBMT 4              no   .61
  NUS ?               yes  .59
  TALP-UPC            yes  .58
  RWTH                yes  .51
  USAAR               no   .48
  RBMT 1              no   .25

English–Czech (4626–4784 judgments per system)
  System              C?   ≥ others
  PCTRANS •           no   .67
  EUROTRANXP •        no   .67
  GOOGLE              no   .66
  CU-BOJAR ?          yes  .61
  UEDIN               yes  .53
  CU-TECTOMT          yes  .48

Systems are listed in the order of how often their translations were ranked higher than or equal to any other system. Ties are broken by direct comparison.
C? indicates constrained condition, meaning only using the supplied training data and possibly standard monolingual linguistic tools (but no additional corpora).
• indicates a win in the category, meaning that no other system is statistically significantly better at p-level ≤ 0.1 in pairwise comparison.
? indicates a constrained win, no other constrained system is statistically better.
For all pairwise comparisons between systems, please check the appendix.

Table 6: Official results for the WMT09 translation task, based on the human evaluation (ranking translations relative to each other).
Please also note that the number of corrected translations per system is very low for some language pairs, as low as 23 corrected sentences per system for the language pair English–French. Given these low numbers, the numbers presented in Figure 6 should not be read as comparisons between systems, but rather viewed as indicating the state of machine translation for different language pairs.
5 Shared evaluation task overview
In addition to allowing us to analyze the translation quality of different systems, the data gathered during the manual evaluation is useful for
validating the automatic evaluation metrics. Last
year, NIST began running a similar “Metrics
for MAchine TRanslation” challenge (MetricsMATR), and presented their findings at a workshop at AMTA (Przybocki et al., 2008).
In this year’s shared task we evaluated a number
of different automatic metrics:
• Bleu (Papineni et al., 2002)—Bleu remains
the de facto standard in machine translation
evaluation. It calculates n-gram precision and
a brevity penalty, and can make use of multiple reference translations as a way of capturing some of the allowable variation in translation. We use a single reference translation in our experiments (see the sketch after this list).
• Meteor (Agarwal and Lavie, 2008)—Meteor
measures precision and recall for unigrams
and applies a fragmentation penalty. It uses
flexible word matching based on stemming
and WordNet-synonymy. meteor-ranking is
optimized for correlation with ranking judgments.
• Translation Error Rate (Snover et al.,
2006)—TER calculates the number of edits required to change a hypothesis translation into a reference translation. The possible edits in TER include insertion, deletion,
and substitution of single words, and an edit
which moves sequences of contiguous words.
Two variants of TER are also included: TERp
(Snover et al., 2009), a new version which introduces a number of different features, and
(Bleu − TER)/2, a combination of Bleu and
Translation Edit Rate.
• MaxSim (Chan and Ng, 2008)—MaxSim calculates a similarity score by comparing items in the translation against the reference. Unlike most metrics, which do strict matching, MaxSim computes a similarity score for non-identical items. The items are modeled as nodes in a bipartite graph, and a maximum-weight matching is found that matches each system item to at most one reference item.
• wcd6p4er (Leusch and Ney, 2008)—a measure based on cder with word-based substitution costs. Leusch and Ney (2008) also submitted two contrastive metrics: bleusp4114,
a modified version of BLEU-S (Lin and
Och, 2004), with tuned n-gram weights, and
bleusp, with constant weights. wcd6p4er
is an error measure and bleusp is a quality
score.
• RTE (Pado et al., 2009)—The RTE metric
follows a semantic approach which applies
recent work in rich textual entailment to the
problem of MT evaluation. Its predictions are
based on a regression model over a feature
set adapted from an entailment systems. The
features primarily model alignment quality
and (mis-)matches of syntactic and semantic
structures.
• ULC (Giménez and Màrquez, 2008)—ULC
is an arithmetic mean over other automatic
metrics. The set of metrics used include
Rouge, Meteor, measures of overlap between
constituent parses, dependency parses, semantic roles, and discourse representations.
The ULC metric had the strongest correlation
with human judgments in WMT08 (CallisonBurch et al., 2008).
• wpF and wpBleu (Popovic and Ney, 2009)—These metrics are based on words and part-of-speech sequences. wpF is an n-gram-based F-measure which takes into account both word n-grams and part-of-speech n-grams. wpBleu is a combination of the normal Bleu score and a part-of-speech-based Bleu score.
• SemPOS (Kos and Bojar, 2009)—The SemPOS metric computes overlapping words, as defined in (Giménez and Màrquez, 2007), with respect to their semantic part of speech. Moreover, it does not use the surface representation of words but their underlying forms obtained from the TectoMT framework.
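To make the Bleu description at the top of this list concrete, here is a minimal single-reference sketch with uniform weights up to 4-grams and no smoothing; it illustrates the formula rather than reproducing the scoring script used in the shared task.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Single-reference Bleu: clipped n-gram precisions combined by a geometric
    mean, multiplied by a brevity penalty. Inputs are pre-tokenized lists."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = max(1, sum(hyp_counts.values()))
        if overlap == 0:
            return 0.0  # unsmoothed Bleu is zero if any precision is zero
        log_precision_sum += math.log(overlap / total) / max_n
    brevity_penalty = 1.0 if len(hypothesis) > len(reference) else \
        math.exp(1 - len(reference) / max(1, len(hypothesis)))
    return brevity_penalty * math.exp(log_precision_sum)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the red mat".split()))  # ~0.67
```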
Figure 6: The percent of time that each system's edited output was judged to be an acceptable translation. These numbers also include judgments of the system's output when it was marked either incomprehensible or acceptable and left unedited. Note that the reference translation was edited alongside the system outputs. Error bars show one positive and one negative standard deviation for the systems in that language pair.
Table 7: The system-level correlation of the automatic evaluation metrics with the human judgments for translation into English. The metrics compared are ulc, maxsim, rte (absolute), meteor-rank, rte (pairwise), terp, meteor-0.6, meteor-0.7, bleu-ter/2, nist, wpF, ter, nist (cased), bleu, bleusp, bleusp4114, bleu (cased), wpbleu, and wcd6p4er, over de-en (21 systems), fr-en (21 systems), es-en (13 systems), cz-en (5 systems), hu-en (6 systems), and their average.

Table 8: The system-level correlation of the automatic evaluation metrics with the human judgments for translation out of English. The metrics compared are terp, ter, bleusp4114, bleusp, bleu, bleu (cased), bleu-ter/2, wcd6p4er, nist (cased), nist, wpF, and wpbleu, over en-fr (16 systems), en-de (13 systems), en-es (11 systems), en-cz (5 systems), and their average.

5.1 Measuring system-level correlation

We measured the correlation of the automatic metrics with the human judgments of translation quality at the system-level using Spearman's rank correlation coefficient ρ. We converted the raw scores assigned to each system into ranks. We assigned a human ranking to the systems based on the percent of time that their translations were judged to be better than or equal to the translations of any other system in the manual evaluation.

When there are no ties ρ can be calculated using the simplified equation:

  ρ = 1 - (6 Σ d_i²) / (n(n² - 1))

where d_i is the difference between the rank for system_i and n is the number of systems. The possible values of ρ range between 1 (where all systems are ranked in the same order) and -1 (where the systems are ranked in the reverse order). Thus an automatic evaluation metric with a higher absolute value for ρ is making predictions that are more similar to the human judgments than an automatic evaluation metric with a lower absolute ρ.

5.2 Measuring sentence-level consistency

Because the sentence-level judgments collected in the manual evaluation are relative judgments rather than absolute judgments, it is not possible for us to measure correlation at the sentence level in the same way that previous work has done (Kulesza and Shieber, 2004; Albrecht and Hwa, 2007a; Albrecht and Hwa, 2007b).

Rather than calculating a correlation coefficient at the sentence level, we instead ascertained how consistent the automatic metrics were with the human judgments. The way that we calculated consistency was the following: for every pairwise comparison of two systems on a single sentence by a person, we counted the automatic metric as being consistent if the relative scores were the same (i.e. the metric assigned a higher score to the higher ranked system). We divided this by the total number of pairwise comparisons to get a percentage. Because the systems generally assign real numbers as scores, we excluded pairs that the human annotators ranked as ties.
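The consistency computation described in Section 5.2 can be sketched as follows. The data layout (per-sentence metric scores and pairwise human preferences with ties removed) is an assumption for illustration, and higher-is-better scores are assumed, so error metrics such as TER or TERp would need their sign flipped first.

```python
def metric_consistency(metric_scores, human_preferences):
    """metric_scores[(system, sentence_id)] -> the metric's score for that output.
    human_preferences: iterable of (sentence_id, better_system, worse_system)
    pairwise judgments; pairs the annotators ranked as ties are excluded.
    Returns the fraction of pairs where the metric agrees with the human."""
    consistent = total = 0
    for sentence_id, better, worse in human_preferences:
        score_better = metric_scores[(better, sentence_id)]
        score_worse = metric_scores[(worse, sentence_id)]
        total += 1
        if score_better > score_worse:
            consistent += 1
    return consistent / total if total else 0.0

# Toy example with two sentences and two systems.
scores = {("A", 1): 0.7, ("B", 1): 0.5, ("A", 2): 0.2, ("B", 2): 0.6}
prefs = [(1, "A", "B"), (2, "A", "B")]    # humans preferred A on both sentences
print(metric_consistency(scores, prefs))  # 0.5: the metric agrees on sentence 1 only
```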
SemPOS            .4      BLEUtecto       .3
Meteor            .4      BLEU            .3
GTM(e=0.5)tecto   .4      NISTlemma       .1
GTM(e=0.5)lemma   .4      NIST            .1
GTM(e=0.5)        .4      BLEUlemma       .1
WERtecto          .3      WERlemma        -.1
TERtecto          .3      WER             -.1
PERtecto          .3      TERlemma        -.1
F-measuretecto    .3      TER             -.1
F-measurelemma    .3      PERlemma        -.1
F-measure         .3      PER             -.1
                          NISTtecto       -.3

Table 9: The system-level correlation for automatic metrics ranking five English-Czech systems.
6 Evaluation task results

6.1 System-level correlation
Table 7 shows the correlation of the automatic metrics when they rank systems that are translating into English. Note that TERp, TER and wcd6p4er are error metrics, so a negative correlation is better for them. The strength of correlation varied for the different language pairs. The automatic metrics were able to rank the French-English systems reasonably well, with correlation coefficients in the range of .8 to .9. In comparison, metrics performed worse for Hungarian-English, where half of the metrics had negative correlation. The ULC metric once again had the strongest correlation with human judgments of translation quality. This was followed closely by MaxSim and RTE, with Meteor and TERp doing respectably well in 4th and 5th place. Notably, Bleu and its variants were the worst performing metrics in this translation direction.

Table 8 shows the correlation for metrics which operated on languages other than English. Most of the best performing metrics that operate on English do not work for foreign languages, because they perform some linguistic analysis or rely on a resource like WordNet. For translation into foreign languages TERp was the best metric overall. The wpBleu and wpF metrics also did extremely well, performing the best in the language pairs that they were applied to. wpBleu and wpF were not applied to Czech because the authors of the metric did not have a Czech tagger. English-German proved to be the most problematic language pair to automatically evaluate, with all of the metrics having a negative correlation except wpBleu and TER.
Table 9 gives detailed results for how well variations on a number of automatic metrics do for the task of ranking five English-Czech systems. [7] These systems were submitted by Kos and Bojar (2009), and they investigate the effects of using Prague Dependency Treebank annotations during automatic evaluation. They linearized the Czech trees and evaluated either the lemmatized forms of the Czech (lemma) read off the trees or the tectogrammatical form which retained only lemmatized content words (tecto). The table also demonstrates that SemPOS, Meteor, and GTM perform better on Czech than many other metrics.

[7] PCTRANS was excluded from the English-Czech systems because its SGML file was malformed.
6.2 Sentence-level consistency

(This subsection was revised after publication.)

Tables 10 and 11 show the percent of times that the metrics' scores were consistent with human rankings of every pair of translated sentences. [8] Since we eliminated sentence pairs that were judged to be equal, the random baseline for this task is 50%. Some metrics failed to reach the baseline. This indicates that sentence-level evaluation of machine translation quality is very difficult.

ULC, RTE and maxsim again do the best overall for the into-English direction. They are followed closely by wcd6p4er and wpF, which considerably improve their performance over their system-level correlations.

We tried a variant on measuring sentence-level consistency. Instead of using the scores assigned to each individual sentence, we used the system-level score and applied it to every sentence that was produced by that system. These can be thought of as a metric's prior expectation about how a system should perform, based on its performance on the whole data set. Tables 12 and 13 show the results of using the system-level scores in place of the sentence-level scores. The "oracle" row shows the consistency of using the system-level human ranks that are given in Table 6. For TER, RTE (pairwise), wpbleu, and the meteor variants, using system-level scores as segment-level scores results in higher consistency with human judgments.

[8] Not all metrics entered into the sentence-level task.
7 Summary

(One conclusion was removed from this section subsequent to publication.)
Table 10: Sentence-level consistency of the automatic metrics with human judgments for translations into English. Italicized numbers do not beat the random-choice baseline. (This table was corrected after publication.)

Table 11: Sentence-level consistency of the automatic metrics with human judgments for translations out of English. Italicized numbers do not beat the random-choice baseline. (This table was corrected after publication.)

Table 12: Consistency of the automatic metrics when their system-level ranks are treated as sentence-level scores. The scores in red italics indicate cases where the system-level ranks outperform a metric's sentence-level ranks.

Table 13: Consistency of the automatic metrics when their system-level ranks are treated as sentence-level scores. The scores in red italics indicate cases where the system-level ranks outperform a metric's sentence-level ranks.
As in previous editions of this workshop, we carried out an extensive manual and automatic evaluation of machine translation performance for translating from European languages into English, and vice versa.
The number of participants remained stable compared to last year's WMT workshop, with 22 groups from 20 institutions participating in WMT09. This year's evaluation also included 7 commercial rule-based MT systems and Google's online statistical machine translation system.

Compared to previous years, we have simplified the evaluation conditions by removing the in-domain vs. out-of-domain distinction, focusing on news translations only. The main reason for this was to eliminate the advantage statistical systems have with respect to test data that are from the same domain as the training data.

Analogously to previous years, the main focus in comparing the quality of different approaches is on manual evaluation. Here, also, we reduced the number of dimensions with respect to which the different systems are compared, with sentence-level ranking as the primary type of manual evaluation. In addition to the direct quality judgments, we also evaluated translation quality by having people edit the output of systems and having assessors judge the correctness of the edited output. The degree to which users were able to edit the translations (without having access to the source sentence or reference translation) served as a measure of the overall comprehensibility of the translation.
Although the inter-annotator agreement in the sentence-ranking evaluation is only fair (as measured by the Kappa score), agreement can be improved by removing the first (up to 50) judgments of each assessor, i.e. by focusing on the judgments made once the assessors were more familiar with the task. Inter-annotator agreement on the correctness judgments of the edited translations was higher (moderate), which is probably due to the simpler evaluation criterion (binary judgments versus rankings). Inter-annotator agreement for both conditions can be increased further by removing the judges with the worst agreement. Intra-annotator agreement, on the other hand, was considerably higher, ranging between moderate and substantial.
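For reference, the Kappa score referred to above is the standard chance-corrected agreement statistic, K = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed proportion of agreeing judgments and P(E) the proportion expected by chance; the interpretation of the resulting values as fair, moderate, or substantial follows Landis and Koch (1977). The sketch below is one common way to estimate it from pairs of ranking labels; the three-way label set and the data layout are assumptions for illustration, not the workshop's exact bookkeeping.

```python
from collections import Counter

def kappa(paired_labels):
    """Chance-corrected agreement for pairs of judgments on the same items.

    paired_labels: list of (label_annotator1, label_annotator2) tuples, where
    each label is one of "A>B", "A=B", "A<B" (assumed label set).
    """
    n = len(paired_labels)
    p_agree = sum(a == b for a, b in paired_labels) / n

    # Chance agreement estimated from the pooled label distribution:
    # the probability that two labels drawn at random happen to coincide.
    counts = Counter(label for pair in paired_labels for label in pair)
    total = sum(counts.values())
    p_chance = sum((c / total) ** 2 for c in counts.values())

    return (p_agree - p_chance) / (1 - p_chance)
```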
In addition to the manual evaluation criteria, we applied a large number of automated metrics to see how well they correlate with the human judgments. There is considerable variation across the different metrics and the language pairs under consideration. As in WMT08, the ULC metric had the highest overall correlation with human judgments when translating into English, with MaxSim and RTE following closely behind. TERp and wpBleu were best when translating into other languages.
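System-level agreement between a metric and a human ranking is commonly summarized with Spearman's rank correlation coefficient. The following sketch is an illustration of that computation under the simplifying assumption of tie-free rankings; it is not necessarily the exact procedure behind the numbers reported here.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same systems.

    rank_a, rank_b: {system: rank} with ranks 1..n over the same systems and
    no ties (a simplifying assumption for this sketch).
    """
    systems = sorted(rank_a)
    n = len(systems)
    d_sq = sum((rank_a[s] - rank_b[s]) ** 2 for s in systems)
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))
```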
All data sets generated by this workshop, including the human judgments, system translations
and automatic scores, are publicly available for
other researchers to analyze.9
Acknowledgments
This work was supported in part by the EuroMatrix project funded by the European Commission
(6th Framework Programme), the GALE program
of the US Defense Advanced Research Projects
Agency, Contract No. HR0011-06-C-0022, and
the US National Science Foundation under grant
IIS-0713448.
We are grateful to Holger Schwenk and Preslav
Nakov for pointing out the potential bias in our
method for ranking systems when self-judgments
are excluded. We analyzed the results and found that this was not the case. We would like to thank
Maja Popovic for sharing thoughts about how to
improve the manual evaluation. Thanks to Cam
Fordyce for helping out with the manual evaluation again this year.
Thanks to Sebastian Pado for helping us work through the logic of segment-level scoring of automatic evaluation metrics. Thanks to Tero Tapiovaara for discovering the mismatch in segment indices between the metrics and the human scores, which resulted in the corrections to Tables 10 and 11.
References
Abhaya Agarwal and Alon Lavie. 2008. Meteor, M-BLEU and M-TER: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 115–118, Columbus, Ohio, June. Association for Computational Linguistics.
Joshua Albrecht and Rebecca Hwa. 2007a. A re-examination of machine learning approaches for sentence-level MT evaluation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007), Prague, Czech Republic.
9 http://www.statmt.org/wmt09/results.html
Joshua Albrecht and Rebecca Hwa. 2007b. Regression for sentence-level MT evaluation with pseudo
references. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics
(ACL-2007), Prague, Czech Republic.
Alexandre Allauzen, Josep Crego, Aurélien Max, and
François Yvon. 2009. LIMSI's statistical translation systems for WMT'09. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Ondřej Bojar, David Mareček, Václav Novák, Martin Popel, Jan Ptáček, Jan Rouš, and Zdeněk
Žabokrtský. 2009. English-Czech MT in 2008. In
Proceedings of the Fourth Workshop on Statistical
Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Chris Callison-Burch, Cameron Fordyce, Philipp
Koehn, Christof Monz, and Josh Schroeder. 2007.
(Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation (WMT07), Prague, Czech Republic.
Chris Callison-Burch, Cameron Fordyce, Philipp
Koehn, Christof Monz, and Josh Schroeder. 2008.
Further meta-evaluation of machine translation. In
Proceedings of the Third Workshop on Statistical
Machine Translation (WMT08), Columbus, Ohio.
Marine Carpuat. 2009. Toward using morphology
in French-English phrase-based SMT. In Proceedings of the Fourth Workshop on Statistical Machine
Translation, Athens, Greece, March. Association for
Computational Linguistics.
Yee Seng Chan and Hwee Tou Ng. 2008. An automatic
metric for machine translation evaluation based on
maximum similarity. In the Metrics-MATR Workshop of AMTA-2008, Honolulu, Hawaii.
Yu Chen, Michael Jellinghaus, Andreas Eisele,
Yi Zhang, Sabine Hunsicker, Silke Theison, Christian Federmann, and Hans Uszkoreit. 2009. Combining multi-engine translations with Moses. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Jinhua Du, Yifan He, Sergio Penkale, and Andy Way.
2009. MATREX: The DCU MT system for WMT
2009. In Proceedings of the Fourth Workshop on
Statistical Machine Translation, Athens, Greece,
March. Association for Computational Linguistics.
Loïc Dugast, Jean Senellart, and Philipp Koehn. 2009. Statistical post editing and dictionary extraction: Systran/Edinburgh submissions for ACL-WMT2009. In Proceedings of the Fourth Workshop
on Statistical Machine Translation, Athens, Greece,
March. Association for Computational Linguistics.
Chris Dyer, Hendra Setiawan, Yuval Marton, and
Philip Resnik. 2009. The University of Maryland statistical machine translation system for the
fourth workshop on machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Jason Eisner and Roy W. Tromble. 2006. Local
search with very large-scale neighborhoods for optimal permutations in machine translation. In Proceedings of the Human Language Technology Conference of the North American chapter of the Association for Computational Linguistics (HLT/NAACL2006), New York, New York.
Christian Federmann, Silke Theison, Andreas Eisele,
Hans Uszkoreit, Yu Chen, Michael Jellinghaus, and
Sabine Hunsicker. 2009. Translation combination using factored word substitution. In Proceedings of the Fourth Workshop on Statistical Machine
Translation, Athens, Greece, March. Association for
Computational Linguistics.
Alexander Fraser. 2009. Experiments in morphosyntactic processing for translating to and from German.
In Proceedings of the Fourth Workshop on Statistical
Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Jesús Giménez and Lluís Màrquez. 2007. Linguistic features for automatic evaluation of heterogeneous
MT systems. In Proceedings of ACL Workshop on
Machine Translation.
Jesús Giménez and Lluís Màrquez. 2008. A smorgasbord of features for automatic MT evaluation.
In Proceedings of the Third Workshop on Statistical
Machine Translation, pages 195–198.
Greg Hanneman, Vamshi Ambati, Jonathan H. Clark, Alok Parlikar, and Alon Lavie. 2009. An improved statistical transfer system for French-English machine translation. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Kenneth Heafield, Greg Hanneman, and Alon Lavie.
2009. Machine translation system combination
with flexible word ordering. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Almut Silja Hildebrand and Stephan Vogel. 2009.
CMU system combination for WMT’09. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Maria Holmqvist, Sara Stymne, Jody Foo, and Lars
Ahrenberg. 2009. Improving alignment for SMT by
reordering and augmenting the training corpus. In
Proceedings of the Fourth Workshop on Statistical
Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Philipp Koehn and Barry Haddow. 2009. Edinburgh’s submission to all tracks of the WMT2009
shared task with reordering and speed improvements
to Moses. In Proceedings of the Fourth Workshop
on Statistical Machine Translation, Athens, Greece,
March. Association for Computational Linguistics.
Philipp Koehn and Christof Monz. 2006. Manual and
automatic evaluation of machine translation between
European languages. In Proceedings of NAACL
2006 Workshop on Statistical Machine Translation,
New York, New York.
Philipp Koehn, Nicola Bertoldi, Ondrej Bojar, Chris
Callison-Burch, Alexandra Constantin, Brooke
Cowan, Chris Dyer, Marcello Federico, Evan
Herbst, Hieu Hoang, Christine Moran, Wade Shen,
and Richard Zens. 2007. Open source toolkit for
statistical machine translation: Factored translation
models and confusion network decoding. CLSP
Summer Workshop Final Report WS-2006, Johns
Hopkins University.
Kamil Kos and Ondřej Bojar. 2009. Evaluation of Machine Translation Metrics for Czech as the Target
Language. Prague Bulletin of Mathematical Linguistics, 92. In print.
Alex Kulesza and Stuart M. Shieber. 2004. A learning approach to improving sentence-level MT evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in
Machine Translation, Baltimore, MD, October 4–6.
J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data.
Biometrics, 33:159–174.
Gregor Leusch and Hermann Ney. 2008. BLEUSP,
PINVWER, CDER: Three improved MT evaluation
measures. In the Metrics-MATR Workshop of
AMTA-2008, Honolulu, Hawaii.
Gregor Leusch, Evgeny Matusov, and Hermann Ney.
2009. The RWTH system combination system for
WMT 2009. In Proceedings of the Fourth Workshop
on Statistical Machine Translation, Athens, Greece,
March. Association for Computational Linguistics.
Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri
Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz,
Wren Thornton, Jonathan Weese, and Omar Zaidan.
2009. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram
statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics
(ACL-2004), Barcelona, Spain.
Robert C. Moore. 2002. Fast and accurate sentence
alignment of bilingual corpora. In Proceedings of
the 5th Biennial Conference of the Association for
Machine Translation in the Americas (AMTA-2002),
Tiburon, California.
Preslav Nakov and Hwee Tou Ng. 2009. NUS
at WMT09: Domain adaptation experiments for
English-Spanish machine translation of news commentary text. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens,
Greece, March. Association for Computational Linguistics.
Jan Niehues, Teresa Herrmann, Muntsin Kolss, and
Alex Waibel. 2009. The Universität Karlsruhe
translation system for the EACL-WMT 2009. In
Proceedings of the Fourth Workshop on Statistical
Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
NIST. 2008. Evaluation plan for GALE go/no-go phase 3 / phase 3.5 translation evaluations. June 18, 2008.
Attila Novák. 2009. MorphoLogic's submission for
the WMT 2009 shared task. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Sebastian Pado, Michel Galley, Dan Jurafsky, and
Christopher D. Manning. 2009. Machine translation evaluation with textual entailment features. In
Proceedings of the Fourth Workshop on Statistical
Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002),
Philadelphia, Pennsylvania.
Michael Paul, Andrew Finch, and Eiichiro Sumita.
2009. NICT@WMT09: Model adaptation and
transliteration for Spanish-English SMT. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Maja Popovic and Hermann Ney. 2009. Syntax-oriented evaluation measures for machine translation output. In Proceedings of the Fourth Workshop
on Statistical Machine Translation, Athens, Greece,
March. Association for Computational Linguistics.
Maja Popovic, David Vilar, Daniel Stein, Evgeny Matusov, and Hermann Ney. 2009. The RWTH machine translation system for WMT 2009. In Proceedings of the Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Mark Przybocki, Kay Peterson, and Sebastien Bronsart. 2008. Official results of the NIST 2008 "Metrics for MAchine TRanslation" challenge (MetricsMATR08). http://nist.gov/speech/tests/metricsmatr/2008/results/.
José A. R. Fonollosa, Maxim Khalilov, Marta R. Costa-jussà, José B. Mariño, Carlos A. Henríquez Q.,
Adolfo Hernández H., and Rafael E. Banchs. 2009.
The TALP-UPC phrase-based translation system
for EACL-WMT 2009. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Antti-Veikko Rosti, Bing Zhang, Spyros Matsoukas,
and Richard Schwartz. 2009. Incremental hypothesis alignment with flexible matching for building confusion networks: BBN system description
for WMT09 system combination task. In Proceedings of the Fourth Workshop on Statistical Machine
Translation, Athens, Greece, March. Association for
Computational Linguistics.
Josh Schroeder, Trevor Cohn, and Philipp Koehn.
2009. Word lattices for multi-source translation.
In 12th Conference of the European Chapter of the
Association for Computational Linguistics (EACL2009), Athens, Greece.
Holger Schwenk, Sadaf Abdul Rauf, Loïc Barrault, and
Jean Senellart. 2009. SMT and SPE machine translation systems for WMT’09. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of
translation edit rate with targeted human annotation.
In Proceedings of the 7th Biennial Conference of the
Association for Machine Translation in the Americas (AMTA-2006), Cambridge, Massachusetts.
Matthew Snover, Nitin Madnani, Bonnie Dorr, and
Richard Schwartz. 2009. Fluency, adequacy,
or HTER? Exploring different human judgments
with a tunable MT metric. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
David Talbot and Miles Osborne. 2007. Smoothed
Bloom filter language models: Tera-scale LMs on
the cheap. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language
Learning (EMNLP-CoNLL), Prague, Czech Republic.
Eric Wehrli, Luka Nerima, and Yves Scherrer. 2009. Deep linguistic multilingual translation and bilingual dictionaries. In Proceedings of the
Fourth Workshop on Statistical Machine Translation, Athens, Greece, March. Association for Computational Linguistics.
A Pairwise system comparisons by human judges
Tables 14–24 show pairwise comparisons between systems for each language pair. The numbers in each of the tables' cells indicate the percentage of times that the system in that column was judged to be better than the system in that row. Bolding indicates the winner of the two systems. The difference between 100 and the sum of the complementary cells is the percentage of the time that the two systems were judged to be equal.
Because there were so many systems and data conditions, the significance of each pairwise comparison needs to be quantified. We applied the Sign Test to measure which comparisons indicate genuine differences (rather than differences that are attributable to chance). In the following tables, ⋆ indicates statistical significance at p ≤ 0.10, † indicates statistical significance at p ≤ 0.05, and ‡ indicates statistical significance at p ≤ 0.01, according to the Sign Test.
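As a rough illustration of the computations behind these tables (a sketch with hypothetical inputs, not the workshop's actual scripts), the cell percentages and a two-sided Sign Test over the non-tied judgments for a pair of systems could be obtained as follows:

```python
from math import comb

def cell_percentages(wins_col, wins_row, ties):
    """Percentages for one pair of systems in Tables 14-24: how often the
    column system beat the row system and vice versa; the remainder to 100
    is the percentage of judgments where the two were equal."""
    n = wins_col + wins_row + ties
    return 100 * wins_col / n, 100 * wins_row / n

def sign_test_p(wins_a, wins_b):
    """Exact two-sided Sign Test p-value for a pair of systems.

    wins_a / wins_b: counts of judgments preferring system A / system B;
    ties are discarded beforehand, as is usual for the Sign Test.
    """
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least as lopsided as the observed one under
    # the null hypothesis that each judgment is a fair coin flip.
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: system A preferred 60 times, system B 40 times.
# This yields roughly p = 0.057, i.e. significant at p <= 0.10 but not at p <= 0.05.
print(sign_test_p(60, 40))
```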
B Automatic scores
Tables 25 and 26 give the automatic scores for each of the systems.

[The bodies of Tables 14–26 (the pairwise comparison tables of Appendix A and the automatic score tables of Appendix B) were flattened beyond reconstruction in this extraction. The recoverable information is limited to the captions below and, for Tables 25 and 26, metric column labels such as BLEU, BLEU-cased, BLEUSP, BLEUSP4114, BLEU-TER, NIST, NIST-cased, TER, TERp, the METEOR variants, MaxSim, RTE, ULC, wpF, wpBleu, wcd6p4er, and the manual RANK.

Table 14: Sentence-level ranking for the WMT09 German-English News Task
Table 15: Sentence-level ranking for the WMT09 English-German News Task
Table 16: Sentence-level ranking for the WMT09 Spanish-English News Task
Table 17: Sentence-level ranking for the WMT09 English-Spanish News Task
Table 18: Sentence-level ranking for the WMT09 French-English News Task
Table 19: Sentence-level ranking for the WMT09 English-French News Task
Table 20: Sentence-level ranking for the WMT09 Czech-English News Task
Table 21: Sentence-level ranking for the WMT09 English-Czech News Task
Table 22: Sentence-level ranking for the WMT09 Hungarian-English News Task
Table 23: Sentence-level ranking for the WMT09 All-English News Task
Table 24: Sentence-level ranking for the WMT09 Multisource-English News Task
Table 25: Automatic evaluation metric scores for translations into English
Table 26: Automatic evaluation metric scores for translations out of English]