CL · Nov 21, 2022

TCBERT: A Technical Report for Chinese Topic Classification BERT

arXiv:2211.11304v1 · 1 citation · h-index: 18 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses topic classification for Chinese NLP applications, but it is incremental as it builds on existing BERT variants and methods.

The authors tackle Chinese topic classification by adapting BERT through supervised continued pre-training that incorporates prompt-based learning and contrastive learning. The resulting TCBERT models are open-sourced and were pre-trained on around 2.1M Chinese data samples.

Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al., 2019), has been one of the base models for various NLP tasks due to its remarkable performance. Variants customized for different languages and tasks have been proposed to further improve performance. In this work, we investigate supervised continued pre-training (Gururangan et al., 2020) on BERT for the Chinese topic classification task. Specifically, we incorporate prompt-based learning and contrastive learning into the pre-training. To adapt to the task of Chinese topic classification, we collect around 2.1M Chinese data samples spanning various topics. The pre-trained Chinese Topic Classification BERTs (TCBERTs) with different parameter sizes are open-sourced at https://huggingface.co/IDEA-CCNL.
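The abstract names two ingredients, prompt-based learning and contrastive learning, without giving their exact form. The following is a minimal Python sketch of what each could look like, not the authors' implementation. The prompt template, the function names `build_prompt` and `supervised_contrastive_loss`, the temperature value, and the toy 2-dimensional embeddings are all assumptions made for illustration; the loss shown is a generic supervised contrastive loss over cosine similarities, which may differ from the formulation used for TCBERT.

```python
import math


def build_prompt(text, topic=None, mask_token="[MASK]"):
    """Wrap a text in a hypothetical topic prompt.

    When no topic is given, the label slot holds the mask token, so a masked
    language model could be asked to fill in the topic.
    """
    label = topic if topic is not None else mask_token
    return f"This is an article about {label}: {text}"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumed form, for illustration).

    For each anchor, samples sharing its label are positives and every other
    sample in the batch is a contrast candidate. Anchors with no positive
    are skipped.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every other sample in the batch.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings: the first two vectors are close, as are the last two.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
misaligned_loss = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy setup the loss is lower when the topic labels agree with the geometry of the embeddings (`aligned_loss`) than when they cut across it (`misaligned_loss`), which is the behaviour a contrastive objective is meant to encourage: texts on the same topic are pulled together and texts on different topics are pushed apart.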

