CL · Apr 26, 2022

PLOD: An Abbreviation Detection Dataset for Scientific Documents

Leonardo Zilio, Hadeel Saadany, Prashant Sharma, Diptesh Kanojia, Constantin Orăsan

arXiv:2204.12061v231.0585 citations · h-index: 30 · Has Code

Originality: Synthesis-oriented

AI Analysis

PLOD provides a resource for improving NLP tasks such as machine translation and information retrieval, but the contribution is incremental because it focuses on dataset creation rather than novel methods.

The paper addresses the lack of publicly available datasets for training deep neural networks on abbreviation detection by introducing PLOD, a large-scale dataset with 160k+ automatically annotated segments. The best baseline models trained on it reach F1-scores of 0.92 for abbreviations and 0.89 for long forms.

The detection and extraction of abbreviations from unstructured texts can help to improve the performance of Natural Language Processing tasks, such as machine translation and information retrieval. However, in terms of publicly available datasets, there is not enough data for training deep neural network-based models to the point of generalising well over data. This paper presents PLOD, a large-scale dataset for abbreviation detection and extraction that contains 160k+ segments automatically annotated with abbreviations and their long forms. We performed manual validation over a set of instances and a complete automatic validation for this dataset. We then used it to generate several baseline models for detecting abbreviations and long forms. The best models achieved an F1-score of 0.92 for abbreviations and 0.89 for detecting their corresponding long forms. We release this dataset along with our code and all the models publicly at https://github.com/surrey-nlp/PLOD-AbbreviationDetection
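To make the reported F1-scores concrete, the following minimal sketch shows how a per-class, token-level F1 could be computed for one annotated segment. The tag names (`B-AC` for abbreviation tokens, `B-LF`/`I-LF` for long-form tokens, `O` for everything else), the example sentence, and the `token_f1` helper are all assumptions made for illustration. The paper's own evaluation protocol may differ, for example by scoring whole spans rather than individual tokens.

```python
def token_f1(gold, pred, label_group):
    """Return (precision, recall, F1) for tokens whose tag is in label_group.

    Simplifying assumption: any tag inside label_group counts as the same
    class, so predicting I-LF where the gold tag is B-LF is still a hit.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_in, p_in = g in label_group, p in label_group
        if g_in and p_in:
            tp += 1
        elif p_in:
            fp += 1
        elif g_in:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical segment with gold tags and an imperfect model prediction.
tokens = ["Natural", "Language", "Processing", "(", "NLP", ")", "is", "fun"]
gold = ["B-LF", "I-LF", "I-LF", "O", "B-AC", "O", "O", "O"]
pred = ["O", "B-LF", "I-LF", "O", "B-AC", "O", "B-AC", "O"]

ac_scores = token_f1(gold, pred, {"B-AC"})          # abbreviation class
lf_scores = token_f1(gold, pred, {"B-LF", "I-LF"})  # long-form class

print("abbreviation P/R/F1:", ac_scores)
print("long form    P/R/F1:", lf_scores)
```

In this toy segment the prediction misses the first long-form token and adds one spurious abbreviation, so the two classes get different scores, which is why the abstract reports them separately.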
