CL · May 6, 2019

A Large Parallel Corpus of Full-Text Scientific Articles

arXiv:1905.01852v11093 citations
Originality: Synthesis-oriented
AI Analysis

The corpus is a valuable resource for machine translation and NLP researchers working with scientific texts, though the contribution is incremental: it applies existing methods to a new dataset.

The authors address the lack of large parallel corpora for scientific articles by building a trilingual (English, Portuguese, Spanish) corpus from the Scielo database. Manual evaluation found an average of 98.8% correctly aligned sentences, and Moses systems trained on the corpus outperformed related work on scientific articles.

The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text contents are available in more than one language, making it a potential source of parallel corpora. In this article, we present the development of a parallel corpus from Scielo in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for all language pairs, and also for a subset of trilingual articles. We demonstrate the capabilities of our corpus by training a Statistical Machine Translation system (Moses) for each language pair, which outperformed related work on scientific articles. Sentence alignment was also manually evaluated, showing an average of 98.8% correctly aligned sentences across all languages. Our parallel corpus is freely available in the TMX format, with complementary information regarding article metadata.
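
Since the corpus is distributed in TMX, a reader may want to pull sentence pairs out of such a file. The following is a minimal sketch using only the Python standard library. It assumes the generic TMX layout (`tu` translation units containing `tuv` variants, each with a `seg`), and the inline `SAMPLE_TMX` string, the `read_pairs` helper, and the language codes `en`, `pt`, and `es` are hypothetical illustrations, not taken from the released files, whose exact attributes may differ.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit sample; not actual corpus content.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The results are shown in Table 1.</seg></tuv>
      <tuv xml:lang="pt"><seg>Os resultados são mostrados na Tabela 1.</seg></tuv>
      <tuv xml:lang="es"><seg>Los resultados se muestran en la Tabla 1.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Data were collected in 2015.</seg></tuv>
      <tuv xml:lang="es"><seg>Los datos se recogieron en 2015.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src, tgt):
    """Return (src, tgt) sentence pairs from every unit that has both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


if __name__ == "__main__":
    for en, es in read_pairs(SAMPLE_TMX, "en", "es"):
        print(en, "|||", es)
```

Units that lack one of the two requested languages are skipped, so the same function works on bilingual files and on the trilingual subset. For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, and `ET.iterparse` may be preferable for very large files.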

