Creation of the Chinese Adaptive Policy Communication Corpus
This dataset supports downstream tasks and multilingual NLP research in policy communication, addressing a gap for researchers in computational social science and NLP.
The authors introduce CAPC-CG, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language. The corpus spans 1949 to 2023 and comprises 3.3 million paragraph-level units; inter-annotator agreement on directive labels reaches a Fleiss's kappa of 0.86.
We introduce CAPC-CG, the Chinese Adaptive Policy Communication (Central Government) Corpus, the first open dataset of Chinese policy directives annotated with a five-color taxonomy of clear and ambiguous language categories, building on Ang's theory of adaptive policy communication. Spanning 1949-2023, this corpus includes national laws, administrative regulations, and ministerial rules issued by China's top authorities. Each document is segmented into paragraphs, producing a total of 3.3 million units. Alongside the corpus, we release comprehensive metadata, a two-round labeling framework, and a gold-standard annotation set developed by expert and trained coders. Inter-annotator agreement achieves a Fleiss's kappa of κ = 0.86 on directive labels, indicating high reliability for supervised modeling. We provide baseline classification results with several large language models (LLMs), together with our annotation codebook, and describe patterns from the dataset. This release aims to support downstream tasks and multilingual NLP research in policy communication.
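For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Fleiss's kappa is computed from raw annotator labels. It is not the authors' evaluation code: the function name `fleiss_kappa`, the placeholder category names `c1` to `c5`, and the toy ratings are all assumptions made for illustration, and the sketch assumes every paragraph is labeled by the same number of coders.

```python
from collections import Counter
import math


def fleiss_kappa(ratings, categories):
    """Fleiss's kappa for a list of items, each rated by the same number of coders.

    ratings:    list of per-item label lists, e.g. [["c1", "c1", "c2"], ...]
    categories: every label that coders were allowed to use
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][cat] = number of coders who assigned `cat` to item i
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing coder pairs, averaged over items
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if math.isclose(p_expected, 1.0):
        # Only one category was ever used; kappa is undefined in that case
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical placeholder names for the five taxonomy categories
CATEGORIES = ["c1", "c2", "c3", "c4", "c5"]

# Toy ratings (invented, not drawn from the corpus): 4 paragraphs, 3 coders each
toy_ratings = [
    ["c1", "c1", "c1"],
    ["c2", "c2", "c3"],
    ["c4", "c4", "c4"],
    ["c5", "c5", "c5"],
]
toy_kappa = fleiss_kappa(toy_ratings, CATEGORIES)
```

Kappa equals 1 when all coders agree on every item and falls toward 0 as agreement approaches what the overall category proportions would produce by chance, which is why a value of 0.86 is read as high reliability.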