CL | AI | Nov 8, 2024

Using Language Models to Disambiguate Lexical Choices in Translation

Josh Barua, Sanjay Subramanian, Kayo Yin, Alane Suhr

arXiv:2411.05781v113.824 citations | h-index: 3 | Has Code | EMNLP

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of cross-lingual concept variation in translation, a problem relevant to linguists and NLP practitioners, but it is incremental because it builds on existing LLM and dataset methods.

The paper tackles lexical selection in translation by creating DTAiLS, a dataset of 1,377 sentence pairs across nine languages, and evaluating recent LLMs and neural machine translation systems on it. GPT-4, the best-performing model, achieves 67 to 85% accuracy across languages. Providing weaker models with high-quality lexical rules improves their accuracy substantially, in some cases matching or exceeding GPT-4.

In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English. We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving from 67 to 85% accuracy across languages. Finally, we use language models to generate English rules describing target-language concept variations. Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases reaching or outperforming GPT-4.
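
The evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Example` class, `build_prompt`, `accuracy`, and the `first_option` stand-in for a model are all hypothetical names, and the rice example (Japanese "kome" for uncooked rice, "gohan" for cooked rice) is an illustrative case of concept variation rather than an item taken from DTAiLS.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Example:
    source: str            # English sentence containing the ambiguous concept
    candidates: List[str]  # target-language variations to choose from
    gold: str              # variation judged most appropriate for the context


def build_prompt(ex: Example, rules: Optional[Dict[str, str]] = None) -> str:
    """Format one lexical-selection query, optionally with English rules."""
    lines = [f"Sentence: {ex.source}"]
    if rules:
        lines.append("Rules:")
        for cand in ex.candidates:
            lines.append(f"- {cand}: {rules[cand]}")
    lines.append("Options: " + ", ".join(ex.candidates))
    lines.append("Answer with the single best option.")
    return "\n".join(lines)


def accuracy(
    examples: List[Example],
    choose: Callable[[str, List[str]], str],
    rules: Optional[Dict[str, str]] = None,
) -> float:
    """Fraction of examples where the chooser returns the gold variation."""
    correct = sum(
        choose(build_prompt(ex, rules), ex.candidates) == ex.gold
        for ex in examples
    )
    return correct / len(examples)


# Illustrative data (assumption, not from the dataset).
EXAMPLES = [
    Example("She washed the rice before cooking it.", ["kome", "gohan"], "kome"),
    Example("He ate a bowl of rice with dinner.", ["kome", "gohan"], "gohan"),
]
RULES = {
    "kome": "uncooked rice grains",
    "gohan": "cooked rice, or a meal",
}


def first_option(prompt: str, candidates: List[str]) -> str:
    """Trivial stand-in for a model: always picks the first candidate."""
    return candidates[0]


baseline = accuracy(EXAMPLES, first_option)
with_rules = accuracy(EXAMPLES, first_option, RULES)
```

In practice `choose` would wrap a call to an LLM or a translation system; comparing `accuracy(..., rules=None)` with `accuracy(..., rules=RULES)` for the same model mirrors the with-rules versus without-rules comparison the abstract reports.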
