Improving astroBERT using Semantic Textual Similarity
This work addresses the need for better natural language processing tools in astronomy research, but it appears incremental, since it builds on existing language models with domain-specific adaptations.
The authors aim to enhance the NASA Astrophysics Data System (ADS) by releasing astroBERT, a language model tailored to astronomy papers. They state that it improves over existing public language models on astrophysics-specific tasks, though no concrete numbers are provided.
The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we:
- announce the first public release of the astroBERT language model;
- show how astroBERT improves over existing public language models on astrophysics-specific tasks;
- and detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.