
【NLP】 Translationese: A Collection of Related Papers, 2015–2020

Contents

1. On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation, ACL 2020 [PDF] Abstract
2. On The Evaluation of Machine Translation Systems Trained With Back-Translation, ACL 2020 [PDF] Abstract
3. Tagged Back-translation Revisited: Why Does It Really Work?, ACL 2020 [PDF] Abstract
4. Translationese as a Language in “Multilingual” NMT, ACL 2020 [PDF] Abstract
5. How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech, ACL 2020 [PDF] Abstract
6. BLEU Might Be Guilty but References Are Not Innocent, EMNLP 2020 [PDF] Abstract
7. Translationese in Machine Translation Evaluation, EMNLP 2020 [PDF] Abstract
8. Translating Translationese: A Two-Step Approach to Unsupervised Machine Translation, ACL 2019 [PDF] Abstract
9. What Kind of Language Is Hard to Language-Model?, ACL 2019 [PDF] Abstract
10. APE at Scale and Its Implications on MT Evaluation Biases, ACL 2019 [PDF] Abstract
11. The Effect of Translationese in Machine Translation Test Sets, ACL 2019 [PDF] Abstract
12. Microsoft Translator at WMT 2019: Towards Large-Scale Document-Level Neural Machine Translation, ACL 2019 [PDF] Abstract
13. Analysing Coreference in Transformer Outputs, EMNLP 2019 [PDF] Abstract
14. Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation, EMNLP 2018 [PDF] Abstract
15. Translationese: Between Human and Machine Translation, COLING 2016 [PDF] Abstract
16. Information Density and Quality Estimation Features as Translationese Indicators for Human Translation Classification, NAACL 2016 [PDF] Abstract
17. Interpretese vs. Translationese: The Uniqueness of Human Strategies in Simultaneous Interpretation, NAACL 2016 [PDF] Abstract
18. Unsupervised Identification of Translationese, TACL 2015 [PDF] Abstract

Abstracts

1. On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation [PDF] Back to Contents
  ACL 2020.
  Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, Steffen Eger
Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish “translationese”, i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points.
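
A minimal sketch of the second remedy mentioned above, coupling a cross-lingual semantic-similarity metric with target-side language modeling. The LaBSE encoder, the GPT-2 language model, and the interpolation weight alpha are illustrative assumptions, not the M-BERT/LASER setup or the exact formulation used in the paper.

```python
# Hedged sketch: interpolate a cross-lingual similarity score with a target-side
# LM fluency score, as in remedy (2) above. LaBSE and GPT-2 are illustrative
# stand-ins, not the paper's M-BERT/LASER setup.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

embedder = SentenceTransformer("sentence-transformers/LaBSE")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def lm_score(hypothesis: str) -> float:
    """Average token log-probability of the hypothesis under the target-side LM."""
    ids = lm_tok(hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean negative log-likelihood per token
    return -loss.item()

def reference_free_score(source: str, hypothesis: str, alpha: float = 0.5) -> float:
    """Combine cross-lingual cosine similarity with target-side fluency."""
    src, hyp = embedder.encode([source, hypothesis], convert_to_tensor=True)
    similarity = util.cos_sim(src, hyp).item()
    return alpha * similarity + (1.0 - alpha) * lm_score(hypothesis)

print(reference_free_score("Das ist ein schwieriger Satz.", "This is a difficult sentence."))
```

In practice the two components live on different scales, so the weight would need tuning or the scores would need normalising before interpolation.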

2. On The Evaluation of Machine Translation Systems Trained With Back-Translation [PDF] Back to Contents
  ACL 2020.
  Sergey Edunov, Myle Ott, Marc’Aurelio Ranzato, Michael Auli
Back-translation is a widely used data augmentation technique which leverages target monolingual data. However, its effectiveness has been challenged since automatic metrics such as BLEU only show significant improvements for test examples where the source itself is a translation, or translationese. This is believed to be due to translationese inputs better matching the back-translated training data. In this work, we show that this conjecture is not empirically supported and that back-translation improves translation quality of both naturally occurring text as well as translationese according to professional human translators. We provide empirical evidence to support the view that back-translation is preferred by humans because it produces more fluent outputs. BLEU cannot capture human preferences because references are translationese when source sentences are natural text. We recommend complementing BLEU with a language model score to measure fluency.
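
The closing recommendation, reporting BLEU alongside a language-model fluency score, might look like the following sketch; sacrebleu, GPT-2, and the toy sentences are illustrative choices rather than the paper's actual setup.

```python
# Hedged sketch: report corpus BLEU alongside an LM-based fluency score.
# sacrebleu and GPT-2 are illustrative choices; the sentences are toy data.
import torch
import sacrebleu
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

def mean_neg_log_likelihood(sentences):
    """Lower is more fluent under the (target-side) language model."""
    losses = []
    for s in sentences:
        ids = tok(s, return_tensors="pt").input_ids
        with torch.no_grad():
            losses.append(lm(ids, labels=ids).loss.item())
    return sum(losses) / len(losses)

hypotheses = ["The cat sits on the mat.", "On the mat sits the cat."]
references = ["The cat is sitting on the mat.", "The cat sits on the mat."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}, mean NLL = {mean_neg_log_likelihood(hypotheses):.2f}")
```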

3. Tagged Back-translation Revisited: Why Does It Really Work? [PDF] Back to Contents
  ACL 2020.
  Benjamin Marie, Raphael Rubino, Atsushi Fujita
In this paper, we show that neural machine translation (NMT) systems trained on large back-translated data overfit some of the characteristics of machine-translated texts. Such NMT systems better translate human-produced translations, i.e., translationese, but may largely worsen the translation quality of original texts. Our analysis reveals that adding a simple tag to back-translations prevents this quality degradation and improves on average the overall translation quality by helping the NMT system to distinguish back-translated data from original parallel data during training. We also show that, in contrast to high-resource configurations, NMT systems trained in low-resource settings are much less vulnerable to overfit back-translations. We conclude that the back-translations in the training data should always be tagged especially when the origin of the text to be translated is unknown.
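
The tagging itself is a one-line change to data preparation. A minimal sketch, assuming a reserved <BT> token; the tag string and the data format are illustrative.

```python
# Hedged sketch of tagged back-translation data preparation: prepend a reserved
# tag to the synthetic source side of back-translated pairs so the NMT system
# can distinguish them from genuine parallel data. "<BT>" is an illustrative tag.
BT_TAG = "<BT>"

def tag_back_translations(synthetic_pairs, authentic_pairs):
    """Return a mixed training set in which only synthetic sources carry the tag."""
    tagged = [(f"{BT_TAG} {src}", tgt) for src, tgt in synthetic_pairs]
    return tagged + list(authentic_pairs)

authentic = [("Das ist ein Haus.", "This is a house.")]
synthetic = [("Das ist ein Auto .", "This is a car.")]  # source side produced by back-translation

for src, tgt in tag_back_translations(synthetic, authentic):
    print(src, "|||", tgt)
```

In practice the tag is usually registered as a special token so that the subword model does not split it.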

4. Translationese as a Language in “Multilingual” NMT [PDF] Back to Contents
  ACL 2020.
  Parker Riley, Isaac Caswell, Markus Freitag, David Grangier
Machine translation has an undesirable propensity to produce “translationese” artifacts, which can lead to higher BLEU scores while being liked less by human raters. Motivated by this, we model translationese and original (i.e. natural) text as separate languages in a multilingual model, and pose the question: can we perform zero-shot translation between original source text and original target text? There is no data with original source and original target, so we train a sentence-level classifier to distinguish translationese from original target text, and use this classifier to tag the training data for an NMT model. Using this technique we bias the model to produce more natural outputs at test time, yielding gains in human evaluation scores on both accuracy and fluency. Additionally, we demonstrate that it is possible to bias the model to produce translationese and game the BLEU score, increasing it while decreasing human-rated quality. We analyze these outputs using metrics measuring the degree of translationese, and present an analysis of the volatility of heuristic-based train-data tagging.
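
A rough sketch of the tagging pipeline described above: a sentence-level classifier separates translationese from original target text, and its prediction decides which training pairs receive a tag. The TF-IDF features, logistic-regression classifier, and tag string are illustrative assumptions, not the paper's actual components.

```python
# Hedged sketch: train a sentence-level translationese classifier and use it to
# tag NMT training data, roughly in the spirit of the approach above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labelled target-side sentences: 1 = translationese, 0 = original text (toy data).
train_sents = ["It is being said that the decision was taken.", "Honestly, I loved that film."]
train_labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression(max_iter=1000))
clf.fit(train_sents, train_labels)

def tag_training_pair(src, tgt):
    """Prepend a tag to the source when the target side looks like translationese."""
    if clf.predict([tgt])[0] == 1:
        return f"<translationese> {src}", tgt
    return src, tgt

print(tag_training_pair("Ein Beispielsatz.", "It is being said that the decision was taken."))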

5. How Human is Machine Translationese? Comparing Human and Machine Translations of Text and Speech [PDF] Back to Contents
  ACL 2020. The 17th International Conference on Spoken Language Translation
  Yuri Bizzoni, Tom S Juzek, Cristina España-Bonet, Koel Dutta Chowdhury, Josef van Genabith, Elke Teich
Translationese is a phenomenon present in human translations, simultaneous interpreting, and even machine translations. Some translationese features tend to appear in simultaneous interpreting with higher frequency than in human text translation, but the reasons for this are unclear. This study analyzes translationese patterns in translation, interpreting, and machine translation outputs in order to explore possible reasons. In our analysis we – (i) detail two non-invasive ways of detecting translationese and (ii) compare translationese across human and machine translations from text and speech. We find that machine translation shows traces of translationese, but does not reproduce the patterns found in human translation, offering support to the hypothesis that such patterns are due to the model (human vs machine) rather than to the data (written vs spoken).

6. BLEU Might Be Guilty but References Are Not Innocent [PDF] Back to Contents
  EMNLP 2020. Long Paper
  Markus Freitag, David Grangier, Isaac Caswell
The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that multi-reference BLEU does not improve the correlation for high quality output, and present an alternative multi-reference formulation that is more effective.
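
For reference, ordinary multi-reference BLEU over a standard and a paraphrased reference set can be computed with sacrebleu as sketched below. This is the conventional formulation the paper finds unhelpful for high-quality output, not the alternative formulation it proposes, and the sentences are toy data.

```python
# Hedged sketch: score the same hypotheses against a standard reference set and
# an additional paraphrased reference set with ordinary multi-reference BLEU.
import sacrebleu

hypotheses = ["The economy grew strongly last year."]
refs_standard = ["The economy grew strongly in the past year."]   # typical reference
refs_paraphrased = ["Last year saw strong economic growth."]      # paraphrased reference

single = sacrebleu.corpus_bleu(hypotheses, [refs_standard])
multi = sacrebleu.corpus_bleu(hypotheses, [refs_standard, refs_paraphrased])
print(f"single-reference BLEU = {single.score:.1f}")
print(f"multi-reference BLEU  = {multi.score:.1f}")
```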

7. Translationese in Machine Translation Evaluation [PDF] Back to Contents
  EMNLP 2020. Long Paper
  Yvette Graham, Barry Haddow, Philipp Koehn
The term translationese has been used to describe features of translated text, and in this paper, we provide detailed analysis of potential adverse effects of translationese on machine translation evaluation. Our analysis shows differences in conclusions drawn from evaluations that include translationese in test data compared to experiments that tested only with text originally composed in that language. For this reason we recommend that reverse-created test data be omitted from future machine translation test sets. In addition, we provide a re-evaluation of a past machine translation evaluation claiming human-parity of MT. One important issue not previously considered is statistical power of significance tests applied to comparison of human and machine translation. Since the very aim of past evaluations was investigation of ties between human and MT systems, power analysis is of particular importance, to avoid, for example, claims of human parity simply corresponding to Type II error resulting from the application of a low powered test. We provide detailed analysis of tests used in such evaluations to provide an indication of a suitable minimum sample size for future studies.
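
The power-analysis point can be made concrete with a small sketch that solves for the minimum sample size needed to detect a given effect at a target power; the effect size, significance level, and choice of test below are illustrative assumptions, not the values analysed in the paper.

```python
# Hedged sketch: a priori power analysis for a two-sample comparison of human
# and machine translation scores. Effect size, alpha, and power are illustrative.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.2,  # small expected difference (Cohen's d)
                                    alpha=0.05,
                                    power=0.80)
print(f"Minimum sample size per system: {n_per_group:.0f}")
```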

8. Translating Translationese: A Two-Step Approach to Unsupervised Machine Translation [PDF] Back to Contents
  ACL 2019.
  Nima Pourdamghani, Nada Aldarrab, Marjan Ghazvininejad, Kevin Knight, Jonathan May
Given a rough, word-by-word gloss of a source language sentence, target language natives can uncover the latent, fully-fluent rendering of the translation. In this work we explore this intuition by breaking translation into a two step process: generating a rough gloss by means of a dictionary and then ‘translating’ the resulting pseudo-translation, or ‘Translationese’ into a fully fluent translation. We build our Translationese decoder once from a mish-mash of parallel data that has the target language in common and then can build dictionaries on demand using unsupervised techniques, resulting in rapidly generated unsupervised neural MT systems for many source languages. We apply this process to 14 test languages, obtaining better or comparable translation results on high-resource languages than previously published unsupervised MT studies, and obtaining good quality results for low-resource languages that have never been used in an unsupervised MT scenario.
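
The first of the two steps, producing a rough word-by-word gloss from a bilingual dictionary, can be sketched as follows; the toy dictionary and the decision to copy unknown words through unchanged are illustrative assumptions, and the second step (the trained Translationese decoder) is not shown.

```python
# Hedged sketch of step one above: produce a word-by-word "Translationese" gloss
# from a bilingual dictionary; step two (the fluency decoder) is a trained NMT
# model and is not shown here.
toy_dictionary = {
    "das": "the", "ist": "is", "ein": "a", "kleines": "small", "haus": "house",
}

def gloss(sentence, dictionary):
    """Replace each source token with a dictionary translation, copying unknowns."""
    tokens = sentence.lower().split()
    return " ".join(dictionary.get(tok, tok) for tok in tokens)

pseudo_translation = gloss("Das ist ein kleines Haus", toy_dictionary)
print(pseudo_translation)  # "the is a small house" -> input to the Translationese decoder
```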

9. What Kind of Language Is Hard to Language-Model? [PDF] Back to Contents
  ACL 2019.
  Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner
How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that “translationese” is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.

10. APE at Scale and Its Implications on MT Evaluation Biases [PDF] Back to Contents
  ACL 2019. The Fourth Conference on Machine Translation (Volume 1: Research Papers)
  Markus Freitag, Isaac Caswell, Scott Roy
In this work, we train an Automatic Post-Editing (APE) model and use it to reveal biases in standard MT evaluation procedures. The goal of our APE model is to correct typical errors introduced by the translation process, and convert the “translationese” output into natural text. Our APE model is trained entirely on monolingual data that has been round-trip translated through English, to mimic errors that are similar to the ones introduced by NMT. We apply our model to the output of existing NMT systems, and demonstrate that, while the human-judged quality improves in all cases, BLEU scores drop with forward-translated test sets. We verify these results for the WMT18 English to German, WMT15 English to French, and WMT16 English to Romanian tasks. Furthermore, we selectively apply our APE model on the output of the top submissions of the most recent WMT evaluation campaigns. We see quality improvements on all tasks of up to 2.5 BLEU points.
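
A sketch of the round-trip data generation described above: target-language monolingual text is translated into English and back, and each round-tripped sentence is paired with its original as a (noisy input, clean output) example for APE training. The Marian checkpoints named below are illustrative stand-ins for whatever NMT systems one would actually use.

```python
# Hedged sketch: create APE training pairs by round-trip translating German
# monolingual text through English. The Marian checkpoints are illustrative.
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

de_en_tok, de_en = load("Helsinki-NLP/opus-mt-de-en")
en_de_tok, en_de = load("Helsinki-NLP/opus-mt-en-de")

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True)
    return tok.batch_decode(model.generate(**batch), skip_special_tokens=True)

monolingual_de = ["Die Wirtschaft ist im letzten Jahr stark gewachsen."]
english = translate(monolingual_de, de_en_tok, de_en)
round_trip_de = translate(english, en_de_tok, en_de)

# APE training pair: round-tripped "translationese" German -> original German.
ape_pairs = list(zip(round_trip_de, monolingual_de))
print(ape_pairs)
```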

11. The Effect of Translationese in Machine Translation Test Sets [PDF] Back to Contents
  ACL 2019. The Fourth Conference on Machine Translation (Volume 1: Research Papers)
  Mike Zhang, Antonio Toral
The effect of translationese has been studied in the field of machine translation (MT), mostly with respect to training data. We study in depth the effect of translationese on test data, using the test sets from the last three editions of WMT’s news shared task, containing 17 translation directions. We show evidence that (i) the use of translationese in test sets results in inflated human evaluation scores for MT systems; (ii) in some cases system rankings do change and (iii) the impact translationese has on a translation direction is inversely correlated to the translation quality attainable by state-of-the-art MT systems for that direction.

12. Microsoft Translator at WMT 2019: Towards Large-Scale Document-Level Neural Machine Translation [PDF] Back to Contents
  ACL 2019. The Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
  Marcin Junczys-Dowmunt
This paper describes the Microsoft Translator submissions to the WMT19 news translation shared task for English-German. Our main focus is document-level neural machine translation with deep transformer models. We start with strong sentence-level baselines, trained on large-scale data created via data-filtering and noisy back-translation and find that back-translation seems to mainly help with translationese input. We explore fine-tuning techniques, deeper models and different ensembling strategies to counter these effects. Using document boundaries present in the authentic and synthetic parallel data, we create sequences of up to 1000 subword segments and train transformer translation models. We experiment with data augmentation techniques for the smaller authentic data with document-boundaries and for larger authentic data without boundaries. We further explore multi-task training for the incorporation of document-level source language monolingual data via the BERT-objective on the encoder and two-pass decoding for combinations of sentence-level and document-level systems. Based on preliminary human evaluation results, evaluators strongly prefer the document-level systems over our comparable sentence-level system. The document-level systems also seem to score higher than the human references in source-based direct assessment.
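
The document-level data preparation mentioned above, concatenating consecutive sentences into sequences of up to 1000 subword segments without crossing document boundaries, can be sketched as follows; whitespace tokens are used as a crude stand-in for subwords, which is an assumption.

```python
# Hedged sketch: pack consecutive sentences of each document into training
# sequences of at most `max_len` segments, never crossing a document boundary.
# Whitespace tokens stand in for subword units here, which is an assumption.
def pack_documents(documents, max_len=1000):
    sequences = []
    for sentences in documents:             # one list of sentences per document
        current, length = [], 0
        for sent in sentences:
            n = len(sent.split())           # crude proxy for subword count
            if current and length + n > max_len:
                sequences.append(" ".join(current))
                current, length = [], 0
            current.append(sent)
            length += n
        if current:
            sequences.append(" ".join(current))
    return sequences

docs = [["First sentence of doc one.", "Second sentence."], ["Only sentence of doc two."]]
print(pack_documents(docs, max_len=8))
```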

13. Analysing Coreference in Transformer Outputs [PDF] Back to Contents
  EMNLP 2019. The Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)
  Ekaterina Lapshinova-Koltunski, Cristina España-Bonet, Josef van Genabith
We analyse coreference phenomena in three neural machine translation systems trained with different data settings with or without access to explicit intra- and cross-sentential anaphoric information. We compare system performance on two different genres: news and TED talks. To do this, we manually annotate (the possibly incorrect) coreference chains in the MT outputs and evaluate the coreference chain translations. We define an error typology that aims to go further than pronoun translation adequacy and includes types such as incorrect word selection or missing words. The features of coreference chains in automatic translations are also compared to those of the source texts and human translations. The analysis shows stronger potential translationese effects in machine translated outputs than in human translations.

14. Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation [PDF] Back to Contents
  EMNLP 2018. The Third Conference on Machine Translation: Research Papers
  Antonio Toral, Sheila Castilho, Ke Hu, Andy Way
We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was originally written, the translation proficiency of the evaluators, and the provision of inter-sentential context. If we consider only original source text (i.e. not translated from another language, or translationese), then we find evidence showing that human parity has not been achieved. We compare the judgments of professional translators against those of non-experts and discover that those of the experts result in higher inter-annotator agreement and better discrimination between human and machine translations. In addition, we analyse the human translations of the test set and identify important translation issues. Finally, based on these findings, we provide a set of recommendations for future human evaluations of MT.

15. Translationese: Between Human and Machine Translation [PDF] Back to Contents
  COLING 2016. Tutorial Abstracts
  Shuly Wintner
Translated texts, in any language, have unique characteristics that set them apart from texts originally written in the same language. Translation Studies is a research field that focuses on investigating these characteristics. Until recently, research in machine translation (MT) has been entirely divorced from translation studies. The main goal of this tutorial is to introduce some of the findings of translation studies to researchers interested mainly in machine translation, and to demonstrate that awareness to these findings can result in better, more accurate MT systems.

16. Information Density and Quality Estimation Features as Translationese Indicators for Human Translation Classification [PDF] Back to Contents
  NAACL 2016.
  Raphael Rubino, Ekaterina Lapshinova-Koltunski, Josef van Genabith


17. Interpretese vs. Translationese: The Uniqueness of Human Strategies in Simultaneous Interpretation [PDF] Back to Contents
  NAACL 2016.
  He He, Jordan Boyd-Graber, Hal Daumé III


18. Unsupervised Identification of Translationese [PDF] Back to Contents
  TACL 2015.
  Ella Rabinovich, Shuly Wintner
Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained on. We show that this is indeed the case, in a variety of evaluation scenarios. We then show that unsupervised classification is highly accurate on this task. We suggest a method for determining the correct labels of the clustering outcomes, and then use the labels for voting, improving the accuracy even further. Moreover, we suggest a simple method for clustering in the challenging case of mixed-domain datasets, in spite of the dominance of domain-related features over translation-related ones. The result is an effective, fully-unsupervised method for distinguishing between original and translated texts that can be applied to new domains with reasonable accuracy.
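
A rough sketch of the unsupervised setting described above: represent each text chunk with simple surface features and cluster into two groups. The character n-gram TF-IDF features and KMeans are illustrative assumptions; the paper's label-determination, voting, and mixed-domain clustering methods are not reproduced here.

```python
# Hedged sketch: cluster text chunks into two groups (original vs. translated)
# without labels. Features and clustering algorithm are illustrative choices.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

chunks = [
    "It goes without saying that the proposal was accepted by the committee.",
    "Honestly, I just grabbed a coffee and headed straight home.",
    "In what concerns the economy, measures have been undertaken by the government.",
    "We had a great time at the beach last weekend.",
]

features = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)).fit_transform(chunks)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# Deciding which cluster is "translated" is a separate step; the paper proposes
# a method for determining the correct labels and then uses them for voting.
for chunk, cluster in zip(chunks, clusters):
    print(cluster, chunk[:50])
```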

Note: this paper list was compiled using the AC paper search tool (AC论文搜索器)!