CL · May 18, 2024

LexGen: Domain-aware Multilingual Lexicon Generation

Ayush Maheshwari, Atul Kumar Singh, Karthika NJ, Krishnakant Bhatt, Preethi Jyothi, Ganesh Ramakrishnan

arXiv:2405.11200v32.71 citations · h-index: 22 · Has Code · ACL

Originality: Incremental advance

AI Analysis

This work addresses the problem of generating domain-specific dictionaries for specialized fields such as medicine and engineering in low-resource languages. It is an incremental advance: it builds on prior bilingual lexical induction methods by adding domain awareness.

The paper tackles domain-specific lexicon generation for low-resource Indian languages by proposing a model whose domain-specific and domain-generic layers are selected through learnable routing. The model is evaluated in zero-shot and few-shot experiments on a new dataset of over 75K translation pairs across 6 languages and 8 domains.

Lexicon or dictionary generation across domains has the potential for societal impact, as it can enhance information accessibility for a diverse user base while preserving language identity. Prior work in the field primarily focuses on bilingual lexical induction, which deals with word alignments using mapping or corpora-based approaches. However, these approaches do not cater to domain-specific lexicon generation, which involves domain-specific terminology. This task becomes particularly important in specialized medical, engineering, and other technical domains, owing to the highly infrequent usage of the terms and the scarcity of data involving domain-specific terms, especially for low/mid-resource languages. In this paper, we propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. We also release a new benchmark dataset consisting of >75K translation pairs across 6 Indian languages spanning 8 diverse domains. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages. Additionally, we perform a post-hoc human evaluation on unseen languages. The source code and dataset are available at https://github.com/Atulkmrsingh/lexgen.
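
The abstract does not spell out how the routing works, so the following is only an illustrative sketch of one way a learnable router could mix a domain-generic layer with a domain-specific one. Everything here is an assumption made for illustration, not the paper's implementation: the `RoutedLayer` class, the sigmoid gate, the soft mixing rule, the random initialization, and the fallback to the generic path for an unseen domain are all hypothetical.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(dim_in, dim_out, rng):
    """Small random weights and zero biases (placeholder for trained values)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)]
               for _ in range(dim_out)]
    return weights, [0.0] * dim_out


class RoutedLayer:
    """Hypothetical block: one shared layer, one layer per domain, one gate."""

    def __init__(self, dim, domains, seed=0):
        rng = random.Random(seed)
        self.generic = init_layer(dim, dim, rng)
        self.specific = {d: init_layer(dim, dim, rng) for d in domains}
        # Router parameters; in a real model these would be learned jointly.
        self.gate = init_layer(dim, 1, rng)

    def route(self, x):
        """Gate value in (0, 1): the weight given to the domain-specific path."""
        logit = linear(*self.gate, x)[0]
        return 1.0 / (1.0 + math.exp(-logit))

    def forward(self, x, domain):
        generic_out = linear(*self.generic, x)
        if domain not in self.specific:
            # Assumed behavior for an unseen domain: use the generic path only.
            return generic_out
        g = self.route(x)
        specific_out = linear(*self.specific[domain], x)
        # Soft routing: convex combination of the two paths.
        return [g * s + (1.0 - g) * c
                for s, c in zip(specific_out, generic_out)]


layer = RoutedLayer(dim=4, domains=["medicine", "physics"])
x = [0.5, -1.0, 0.25, 2.0]
seen = layer.forward(x, "medicine")   # mixes specific and generic paths
unseen = layer.forward(x, "law")      # generic path only
```

A soft gate keeps the sketch differentiable end to end, which is one common reason to prefer it over a hard switch; whether the paper uses soft or hard routing is not stated in the abstract.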

