ClimateBert: A Pretrained Language Model for Climate-Related Text
This work addresses a limitation of modern NLP in processing climate-related text, which matters to researchers and practitioners in climate science and policy. The contribution is incremental, however, because it adapts an existing method to a new domain.
The authors tackle the problem that large pretrained language models perform poorly on climate-related text because of its niche language. They propose CLIMATEBERT, a model further pretrained on over 2 million climate-related paragraphs. It yields a 48% improvement on a masked language model objective and lowers error rates by 3.57% to 35.71% on downstream tasks.
In recent years, large pretrained language models (LMs) have revolutionized the field of natural language processing (NLP). However, while pretraining on general language has been shown to work very well for common language, it has been observed that niche language poses problems. In particular, climate-related texts include specific language that common LMs cannot represent accurately. We argue that this shortcoming of today's LMs limits the applicability of modern NLP to the broad field of text processing of climate-related texts. As a remedy, we propose CLIMATEBERT, a transformer-based language model that is further pretrained on over 2 million paragraphs of climate-related texts, crawled from various sources such as common news, research articles, and climate reporting of companies. We find that CLIMATEBERT leads to a 48% improvement on a masked language model objective which, in turn, leads to lowering error rates by 3.57% to 35.71% for various climate-related downstream tasks like text classification, sentiment analysis, and fact-checking.
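To make the reported percentages concrete, the sketch below shows how a relative error-rate reduction can be computed. It assumes the 3.57% to 35.71% figures are relative reductions against a baseline model, which the abstract does not state explicitly. The function name `relative_reduction` and the two error rates are hypothetical values chosen for illustration, not results from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical error rates, chosen only for illustration (not from the paper).
baseline_error = 0.28  # error rate of a general-purpose LM on some task
adapted_error = 0.27   # error rate after further domain pretraining

reduction = relative_reduction(baseline_error, adapted_error)
print(f"relative error-rate reduction: {reduction:.2%}")
```

Under this reading, a small absolute change in error rate can correspond to a modest relative reduction, so the relative and absolute figures should not be compared directly.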