CL · Apr 15, 2021

Bilingual Terminology Extraction from Comparable E-Commerce Corpora

arXiv:2104.07398v2
AI Analysis

The paper addresses the high cost and data scarcity of building terminology resources for e-commerce machine translation by leveraging abundant comparable data. The contribution is incremental, since it builds on existing cross-lingual pre-training methods.

The paper proposes a framework that extracts bilingual terminology for e-commerce machine translation from comparable corpora rather than scarce parallel data, and reports significantly better performance than strong baselines across several language pairs.

Bilingual terminologies are important machine translation resources in e-commerce, and they are usually either translated manually or extracted automatically from parallel data. Human translation is costly, and e-commerce parallel corpora are very scarce. Comparable data in different languages within the same commodity field, however, is abundant. In this paper, we propose a novel framework for extracting e-commerce bilingual terminologies from comparable data. Benefiting from cross-lingual pre-training on e-commerce data, our framework can make full use of the deep semantic relationship between a source-side terminology and a target-side sentence to extract the corresponding target terminology. Experimental results on various language pairs show that our approaches achieve significantly better performance than various strong baselines.
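To make the task concrete, the following is a minimal sketch of the input/output shape the abstract describes: given a source-side term and a target-side sentence, pick the target span that best matches. It is not the paper's model. The scoring function here is a toy lexicon overlap standing in for a cross-lingual pre-trained encoder, and the names `TOY_LEXICON`, `toy_score`, and `extract_target_term` are hypothetical, introduced only for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy lexicon (assumption): a stand-in for the semantic
# knowledge a cross-lingual pre-trained model would provide.
TOY_LEXICON = {
    "wireless": {"inalámbrico", "inalámbricos"},
    "headphones": {"auriculares"},
}


def toy_score(source_term: str, candidate: List[str]) -> float:
    """F1-style overlap between source words and candidate span words."""
    src_words = source_term.lower().split()
    cand = [w.lower() for w in candidate]
    if not src_words or not cand:
        return 0.0
    cand_set = set(cand)
    # All known translations of any word in the source term.
    translations = set().union(*(TOY_LEXICON.get(s, set()) for s in src_words))
    # Source words that have at least one translation inside the span.
    matched_src = sum(1 for s in src_words if TOY_LEXICON.get(s, set()) & cand_set)
    # Span words that translate some source word.
    matched_cand = sum(1 for w in cand if w in translations)
    recall = matched_src / len(src_words)
    precision = matched_cand / len(cand)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def extract_target_term(
    source_term: str,
    target_sentence: str,
    score: Callable[[str, List[str]], float] = toy_score,
    max_len: int = 4,
) -> Tuple[Optional[str], float]:
    """Enumerate target spans of up to max_len tokens and return the best one."""
    tokens = target_sentence.split()
    best_span: Optional[str] = None
    best_score = 0.0
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            s = score(source_term, tokens[i:j])
            if s > best_score:  # strict: keeps the earliest best span
                best_span, best_score = " ".join(tokens[i:j]), s
    return best_span, best_score


if __name__ == "__main__":
    print(extract_target_term(
        "wireless headphones",
        "Auriculares inalámbricos con cancelación de ruido",
    ))
```

In a realistic setting the `score` argument would be replaced by a learned model; the exhaustive span enumeration is only one simple way to frame extraction and is an assumption of this sketch.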

