CL
Jun 14, 2024

Bag of Lies: Robustness in Continuous Pre-training BERT

arXiv:2406.09967v1
1 citation
Originality: Incremental advance
AI Analysis

This paper addresses misinformation resilience in language models, a concern for both researchers and practitioners. Its contribution is incremental because it builds on existing pre-training methods.

The study investigated the robustness of continuous pre-training in BERT by manipulating input data with misinformation and nonsensical word order, finding that these adversarial methods did not degrade and sometimes improved performance on the Check-COVID fact-checking benchmark.

This study aims to acquire more insights into the continuous pre-training phase of BERT regarding entity knowledge, using the COVID-19 pandemic as a case study. Since the pandemic emerged after the last update of BERT's pre-training data, the model has little to no entity knowledge about COVID-19. Using continuous pre-training, we control what entity knowledge is available to the model. We compare the baseline BERT model with the further pre-trained variants on the fact-checking benchmark Check-COVID. To test the robustness of continuous pre-training, we experiment with several adversarial methods to manipulate the input data, such as training on misinformation and shuffling the word order until the input becomes nonsensical. Surprisingly, our findings reveal that these methods do not degrade, and sometimes even improve, the model's downstream performance. This suggests that continuous pre-training of BERT is robust against misinformation. Furthermore, we are releasing a new dataset, consisting of original texts from academic publications in the LitCovid repository and their AI-generated false counterparts.
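One of the manipulations mentioned above, shuffling the word order until the input becomes nonsensical, can be illustrated with a minimal sketch. The abstract does not say how the shuffling was implemented, so everything here is an assumption: the function name `shuffle_word_order`, the whitespace tokenization, the shuffle over the whole input rather than per sentence, and the example sentence are all hypothetical and not taken from the paper or its dataset.

```python
import random
from typing import Optional


def shuffle_word_order(text: str, seed: Optional[int] = None) -> str:
    """Return `text` with its whitespace-separated words in random order.

    Assumption: words are split on whitespace and shuffled across the
    whole input. The paper may have used a different granularity.
    """
    words = text.split()
    rng = random.Random(seed)  # local RNG so a given seed is reproducible
    rng.shuffle(words)         # in-place permutation of the word list
    return " ".join(words)


# Hypothetical example sentence, not drawn from the released dataset.
original = "COVID-19 is caused by the SARS-CoV-2 virus"
shuffled = shuffle_word_order(original, seed=0)
print(shuffled)
```

The shuffled text keeps exactly the same words as the original, so any entity mentions survive while the syntax is destroyed. That property is what makes this kind of manipulation useful for probing what continuous pre-training actually picks up.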
