CL · Apr 16, 2022

SimpleBERT: A Pre-trained Model That Learns to Generate Simple Words

arXiv:2204.07779v1 · 0.62 citations · h-index: 14

Originality: Incremental advance

AI Analysis

This work addresses the problem of text simplification for NLP applications, offering an incremental improvement by adapting BERT for this specific domain.

The authors address the lack of pre-trained models for text simplification with SimpleBERT, a continued pre-training method that masks only simple words so that the model learns to generate them. The paper reports state-of-the-art results on multiple datasets for lexical and sentence simplification.

Pre-trained models are now widely used in natural language processing tasks. However, in the specific field of text simplification, research on improving pre-trained models is still lacking. In this work, we propose a continued pre-training method for text simplification. Specifically, we propose a new masked language modeling (MLM) mechanism that masks only simple words rather than masking words at random, so that the model learns to generate simple words. We use a small-scale simple text dataset for continued pre-training and employ two methods to identify simple words in the texts. We choose BERT, a representative pre-trained model, and continue pre-training it with our proposed method. The resulting model, SimpleBERT, surpasses BERT on both lexical simplification and sentence simplification tasks and achieves state-of-the-art results on multiple datasets. Moreover, SimpleBERT can replace BERT in existing simplification models without modification.
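The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `mask_simple_words`, the `simple_vocab` lookup set, whitespace tokenization, and the `mask_prob` parameter are all assumptions made for illustration, and the abstract does not say which two methods the paper uses to identify simple words.

```python
import random

MASK = "[MASK]"  # placeholder token, matching BERT's convention


def mask_simple_words(tokens, simple_vocab, mask_prob=0.15, seed=None):
    """Mask only tokens found in simple_vocab (hypothetical lexicon lookup).

    Returns the masked token list and a parallel list of labels:
    the original token where a mask was applied, None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        # Unlike random MLM masking, complex words are never candidates.
        if tok.lower() in simple_vocab and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Assumed toy lexicon and sentence; mask_prob=1.0 masks every simple word.
simple_vocab = {"big", "help", "use"}
tokens = "Doctors use a big machine to help patients".split()
masked, labels = mask_simple_words(tokens, simple_vocab, mask_prob=1.0)
print(masked)
# ['Doctors', '[MASK]', 'a', '[MASK]', 'machine', 'to', '[MASK]', 'patients']
```

In a real continued pre-training setup the masked positions would feed a standard MLM loss, so the only change from ordinary BERT pre-training in this sketch is which positions are eligible for masking.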

