MiniRBT: A Two-stage Distilled Small Chinese Pre-trained Model
MiniRBT addresses the problem of large, slow pre-trained models for Chinese NLP practitioners by offering an efficient alternative, though the contribution is incremental because it builds on existing distillation techniques.
The paper tackles the lack of small Chinese pre-trained models by introducing MiniRBT, which retains 94% of RoBERTa's performance while running 6.8x faster on tasks such as machine reading comprehension and text classification.
In natural language processing, pre-trained language models have become essential infrastructure. However, these models are often large, slow at inference, and difficult to deploy. Moreover, most mainstream pre-trained models focus on English, and small Chinese pre-trained models remain understudied. In this paper, we introduce MiniRBT, a small Chinese pre-trained model that aims to advance research in Chinese natural language processing. MiniRBT employs a narrow and deep student model and incorporates whole word masking and two-stage distillation during pre-training, which makes it well suited to most downstream tasks. Our experiments on machine reading comprehension and text classification tasks show that MiniRBT achieves 94% of RoBERTa's performance while providing a 6.8x speedup, demonstrating its effectiveness and efficiency.
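Whole word masking, one of the pre-training ingredients named in the abstract, can be illustrated with a minimal sketch. It assumes the text has already been segmented into words and that each Chinese character is one token. The function name `whole_word_mask`, the `mask_prob` parameter, and the example sentence are hypothetical, and the usual 80/10/10 replacement scheme is left out for brevity. This is not MiniRBT's actual pre-training code.

```python
import random

MASK = "[MASK]"

def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask at the word level instead of the character level.

    words: list of pre-segmented words (assumption: segmentation is done
           upstream, and each character counts as one token).
    Returns (tokens, labels); labels hold the original character at masked
    positions and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    tokens, labels = [], []
    for word in words:
        chars = list(word)
        if rng.random() < mask_prob:
            # Every character of the selected word is masked together.
            tokens.extend([MASK] * len(chars))
            labels.extend(chars)
        else:
            tokens.extend(chars)
            labels.extend([None] * len(chars))
    return tokens, labels

if __name__ == "__main__":
    # Hypothetical segmented sentence: "use / language / model / to / predict".
    words = ["使用", "语言", "模型", "来", "预测"]
    tokens, labels = whole_word_mask(words, mask_prob=0.5, rng=random.Random(0))
    print(tokens)
    print(labels)
```

Because the masking decision is made once per word, a multi-character word such as 语言 is either fully hidden or fully visible, so the model cannot recover a masked character from the other half of the same word.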