CL · Dec 24, 2024

Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST)

Jiarui Liu, Iman Ouzzani, Wenkai Li, Lechen Zhang, Tianyue Ou, Houda Bouamor, Zhijing Jin, Mona Diab

arXiv:2412.18367v61.91 citations · h-index: 24 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses gaps in AI terminology resources to foster global inclusivity and collaboration for non-English speakers in AI research, though it is incremental as it builds on existing translation methods.

The authors tackle domain-specific AI terminology translation by creating GIST, a large-scale multilingual dataset of 5K terms translated into five languages. In crowdsourced evaluation, GIST's translations were rated more accurate than those in existing resources, and integrating the dataset into translation workflows improved BLEU and COMET scores.

The field of machine translation has achieved significant advancements, yet domain-specific terminology translation, particularly in AI, remains challenging. We introduce GIST, a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms are translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. The dataset's quality is benchmarked against existing resources, demonstrating superior translation accuracy through crowdsourced evaluation. GIST is integrated into translation workflows using post-translation refinement methods that require no retraining, where LLM prompting consistently improves BLEU and COMET scores. A web demonstration on the ACL Anthology platform highlights its practical application, showcasing improved accessibility for non-English speakers. This work aims to address critical gaps in AI terminology resources and to foster global inclusivity and collaboration in AI research. Our data is available at https://huggingface.co/datasets/Jerry999/multilingual-terminology
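The post-translation refinement idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, and the two glossary entries below are assumptions made for illustration, and the entries are common English-French renderings rather than rows taken from GIST. The sketch looks up which glossary terms occur in a source sentence and builds a prompt asking an LLM to revise a draft translation so that it uses the preferred terms; sending the prompt to a model is left out.

```python
import re

# Illustrative English -> French entries; NOT taken from the GIST dataset.
GLOSSARY = {
    "neural network": "réseau de neurones",
    "attention mechanism": "mécanisme d'attention",
}


def find_terms(source, glossary):
    """Return the glossary entries whose English term occurs in the source."""
    hits = {}
    for term, translation in glossary.items():
        # Whole-word, case-insensitive match; inflected forms are not handled.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, source, flags=re.IGNORECASE):
            hits[term] = translation
    return hits


def build_refinement_prompt(source, draft, glossary, target_lang):
    """Build a prompt asking an LLM to fix terminology in a draft translation.

    Returns None when no glossary term occurs in the source, so the draft
    can be kept unchanged without calling a model.
    """
    hits = find_terms(source, glossary)
    if not hits:
        return None
    term_lines = "\n".join(f"- {en} -> {tr}" for en, tr in hits.items())
    return (
        f"Revise the {target_lang} translation so that it uses the preferred "
        f"terminology. Change nothing else.\n\n"
        f"Source: {source}\n"
        f"Draft translation: {draft}\n"
        f"Preferred terms:\n{term_lines}\n"
    )


if __name__ == "__main__":
    prompt = build_refinement_prompt(
        "We train a neural network with an attention mechanism.",
        "Nous entraînons un réseau neuronal avec un mécanisme d'attention.",
        GLOSSARY,
        "French",
    )
    print(prompt)
```

Because the refinement happens after translation and only through prompting, the underlying translation system needs no retraining, which matches the property the abstract highlights. A real workflow would replace the hard-coded glossary with entries loaded from the published dataset.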
