CL | May 6, 2019

A Large Parallel Corpus of Full-Text Scientific Articles

Felipe Soares, Viviane Pereira Moreira, Karin Becker

arXiv:1905.01852v131.11093 citations

Originality: Synthesis-oriented

AI Analysis

The corpus is a valuable resource for machine translation and NLP researchers working with scientific texts, though the contribution is incremental: it applies existing methods to a new dataset.

The authors address the lack of large parallel corpora for scientific articles by building a trilingual (English, Portuguese, Spanish) corpus from the Scielo database. Manual evaluation found an average of 98.8% correctly aligned sentences, and Moses systems trained on the corpus outperformed related work on scientific articles.

The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text contents are available in more than one language, which makes it a potential source of parallel corpora. In this article, we present the development of a parallel corpus from Scielo in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for all language pairs, and also for a subset of trilingual articles. We demonstrate the capabilities of our corpus by training a Statistical Machine Translation system (Moses) for each language pair, which outperformed related work on scientific articles. Sentence alignment was also manually evaluated, with an average of 98.8% of sentences correctly aligned across all languages. Our parallel corpus is freely available in the TMX format, with complementary information regarding article metadata.
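
Since the corpus is distributed as TMX, a reader may want to pull sentence pairs out of it. The following is a minimal sketch, not the authors' tooling: it assumes the standard TMX layout (`tu` translation units containing `tuv` variants tagged with `xml:lang`, each holding a `seg`). The function name `read_tmx_pairs`, the language codes `en`, `pt`, `es`, and the sample sentences are all invented for illustration; the actual files may use different codes or carry extra metadata.

```python
import xml.etree.ElementTree as ET

# xml:lang belongs to the reserved XML namespace; ElementTree exposes it under this key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) sentence pairs from TMX markup held in a string.

    Hypothetical helper: translation units lacking either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Invented sample data, NOT taken from the corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The results were significant.</seg></tuv>
      <tuv xml:lang="pt"><seg>Os resultados foram significativos.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Samples were collected weekly.</seg></tuv>
      <tuv xml:lang="es"><seg>Las muestras se recolectaron semanalmente.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for src, tgt in read_tmx_pairs(SAMPLE_TMX, "en", "pt"):
        print(src, "|||", tgt)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`; for very large TMX files, a streaming parser such as `ET.iterparse` would likely be the better choice to keep memory use bounded.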
