CL IM Dec 1, 2021

Building astroBERT, a language model for Astronomy & Astrophysics

Felix Grezes, Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J. Kurtz, Golnaz Shapurian, Edwin Henneken, Carolyn S. Grant, Donna M. Thompson, Roman Chyla, Stephen McDonald, Timothy W. Hostetler, Matthew R. Templeton

arXiv:2112.00590v1 2.028 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for improved discoverability in astronomy research databases, though it appears incremental because it builds on existing BERT models.

The paper tackles semantic search in astronomy literature by developing astroBERT, a language model trained on NASA Astrophysics Data System publications. The model is meant to distinguish the senses of ambiguous terms such as 'Planck' without clarification from the user. Preliminary results are presented, but no concrete numbers are provided.

The existing search tools for exploring the NASA Astrophysics Data System (ADS) can be quite rich and empowering (e.g., similar and trending operators), but researchers are not yet allowed to fully leverage semantic search. For example, a query for "results from the Planck mission" should be able to distinguish between all the various meanings of Planck (person, mission, constant, institutions and more) without further clarification from the user. At ADS, we are applying modern machine learning and natural language processing techniques to our dataset of recent astronomy publications to train astroBERT, a deeply contextual language model based on research at Google. Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool. We present here our preliminary results and lessons learned.
