CL · Feb 3, 2023

GLADIS: A General and Large Acronym Disambiguation Benchmark

Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

arXiv:2302.01860v228.2269 citations · h-index: 65 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better acronym disambiguation tools across domains such as biomedical reports and scientific papers, though the contribution is incremental: it builds on existing benchmarks by scaling them up.

The authors tackle the problem of small, domain-specific acronym disambiguation benchmarks by constructing GLADIS, a larger and more general benchmark comprising a dictionary of 1.5M acronyms and 6.4M long forms, a pre-training corpus of 160M sentences, and three datasets covering the general, scientific, and biomedical domains. They demonstrate its value by pre-training AcroBERT on that corpus.

Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and values of our new benchmark.
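
To make the task concrete, the following is a minimal sketch of acronym disambiguation as dictionary lookup plus context scoring. It is not AcroBERT and not the GLADIS pipeline: it swaps in a simple word-overlap score, and the dictionary entries, keyword glosses, and function names (`CANDIDATES`, `tokenize`, `disambiguate`) are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy dictionary (NOT taken from GLADIS): each acronym maps to
# candidate long forms, and each long form has a hand-written keyword gloss.
CANDIDATES = {
    "AD": {
        "acronym disambiguation": "language text benchmark long form expansion",
        "Alzheimer's disease": "patient brain memory dementia clinical",
        "atopic dermatitis": "skin eczema itch rash patient",
    }
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(acronym, sentence):
    """Pick the long form whose gloss shares the most words with the sentence.

    This is a lexical-overlap stand-in for a learned scorer; a pre-trained
    model would replace the overlap count with a learned similarity.
    """
    context = tokenize(sentence)
    candidates = CANDIDATES[acronym]
    return max(candidates, key=lambda lf: len(context & tokenize(candidates[lf])))


biomedical = "The patient with AD showed progressive memory loss and dementia."
scientific = "We release a benchmark for AD that maps each short form in text to its expansion."
print(disambiguate("AD", biomedical))
print(disambiguate("AD", scientific))
```

The same acronym resolves to different long forms depending on the surrounding sentence, which is why a dictionary alone is not enough and a context-aware scorer is needed.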

