CL | Feb 3, 2023

GLADIS: A General and Large Acronym Disambiguation Benchmark

arXiv:2302.01860v2269 citations | h-index: 65
AI Analysis

This paper addresses the need for better acronym disambiguation tools across domains such as biomedical reports and scientific papers. The contribution is incremental, since it builds on existing benchmarks by scaling them up.

The authors address the problem of small, domain-specific acronym disambiguation benchmarks by constructing GLADIS, a larger and more general benchmark with 1.5M acronyms, 6.4M long forms, and 160M sentences, and demonstrate its value by pre-training AcroBERT on it.

Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and values of our new benchmark.
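
To make the task concrete, the sketch below shows acronym disambiguation in its simplest form: given an acronym, a sentence, and a dictionary of candidate long forms, pick the candidate whose cue words overlap most with the sentence. This is only a toy baseline for illustration, not AcroBERT or any method from the paper. The dictionary entries, cue words, and function names are all hypothetical and do not come from GLADIS.

```python
import re

# Hypothetical toy dictionary (NOT taken from GLADIS):
# acronym -> {candidate long form -> hand-picked cue words}.
TOY_DICTIONARY = {
    "AD": {
        "acronym disambiguation": {"acronym", "benchmark", "language", "long", "form"},
        "Alzheimer's disease": {"patient", "dementia", "brain", "clinical"},
    },
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(acronym, sentence, dictionary=TOY_DICTIONARY):
    """Return the long form whose cue words overlap most with the sentence.

    Returns None when the acronym is not in the dictionary. Ties are
    resolved in favour of the candidate listed first.
    """
    candidates = dictionary.get(acronym)
    if not candidates:
        return None
    context = tokenize(sentence)
    return max(candidates, key=lambda long_form: len(candidates[long_form] & context))


if __name__ == "__main__":
    print(disambiguate("AD", "The patient was diagnosed with AD after showing signs of dementia."))
    print(disambiguate("AD", "We evaluate AD models on a new benchmark of acronym long forms."))
```

A word-overlap baseline like this depends entirely on hand-written cue words, which is part of why a large dictionary and a pre-trained model are attractive for the general case.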

Code Implementations: 1 repo