A Classification System Approach in Predicting Chinese Censorship
This work addresses the problem of automated censorship prediction for researchers and policymakers in China, but it is incremental as it applies existing methods to a specific dataset.
The paper tackles the prediction of censorship of Weibo posts in China by developing classification models, and finds that a fine-tuned BERT model outperforms the other methods on both macro-F1 and ROC-AUC.
This paper uses classifiers to predict whether a Weibo post will be censored on the Chinese internet. Through random sampling from the data of \citeauthor{Fu2021} and Chinese tokenization strategies, we constructed a cleaned dataset of Chinese phrases with binary censorship labels. Applying several probability-based information retrieval methods to this data, we derived four logistic regression models for classification. We also experimented with pre-trained transformers on the same classification task. After evaluating both the macro-F1 and ROC-AUC metrics, we conclude that the fine-tuned BERT model outperforms the other strategies.
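For readers unfamiliar with the two evaluation metrics named above, the following is a minimal sketch of how macro-F1 and ROC-AUC can be computed for a binary censored/uncensored task. It is an illustration only, not the evaluation code used in this work: the function names `macro_f1` and `roc_auc`, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for the example.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    f1_scores = []
    for cls in (0, 1):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    wins = sum(
        (1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = censored, 0 = not censored.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]  # assumed model probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"macro-F1: {macro_f1(y_true, y_pred):.3f}")
print(f"ROC-AUC:  {roc_auc(y_true, scores):.3f}")
```

Macro-F1 weights the censored and uncensored classes equally regardless of how many posts fall in each, while ROC-AUC depends only on how the scores rank the posts and not on any threshold, which is one reason the two metrics are often reported together.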