CL Feb 19, 2022

MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Amir Pouran Ben Veyseh, Nicole Meister, Seunghyun Yoon, Rajiv Jain, Franck Dernoncourt, Thien Huu Nguyen

arXiv:2202.09694v1 31.0584 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses a data scarcity problem for NLP researchers working on acronym extraction (AE) beyond English and beyond the scientific and biomedical domains, though the contribution is incremental because it builds on existing AE research.

The authors tackled the lack of annotated datasets for acronym extraction in non-English languages and non-scientific domains by creating a large-scale dataset with 27,200 sentences across 6 languages and 2 domains, revealing unique challenges in multilingual and multi-domain settings.

Acronym extraction (AE) is the task of identifying acronyms and their expanded forms in text, and it is necessary for various NLP applications. Despite major progress on this task in recent years, one limitation of existing AE research is that it is restricted to the English language and to certain domains (i.e., scientific and biomedical). As such, the challenges of AE in other languages and domains are largely unexplored. The lack of annotated datasets in multiple languages and domains has been a major obstacle to research in this area. To address this limitation, we propose a new dataset for multilingual, multi-domain AE. Specifically, 27,200 sentences in 6 typologically different languages and 2 domains, i.e., Legal and Scientific, are manually annotated for AE. Our extensive experiments on the proposed dataset show that AE in different languages and different learning settings has unique challenges, emphasizing the necessity of further research on multilingual and multi-domain AE.
