CL, LG | Dec 19, 2019

CJRC: A Reliable Human-Annotated Benchmark DataSet for Chinese Judicial Reading Comprehension

Xingyi Duan, Baoxin Wang, Ziyue Wang, Wentao Ma, Yiming Cui, Dayong Wu, Shijin Wang, Ting Liu, Tianxiang Huo, Zhen Hu, Heng Wang, Zhiyuan Liu

arXiv:1912.09156v1 | 79 citations

Originality: Synthesis-oriented

AI Analysis

This dataset targets element extraction from legal documents. Its contribution is incremental, since it applies existing reading comprehension methods to a new domain-specific dataset.

The authors introduce CJRC, a Chinese judicial reading comprehension dataset of approximately 10K documents and almost 50K expert-annotated questions, intended to support element extraction from legal documents. Baseline models based on BERT and BiDAF leave substantial room for improvement relative to human performance.

We present the Chinese judicial reading comprehension (CJRC) dataset, which contains approximately 10K documents and almost 50K questions with answers. The documents are judgment documents, and the questions are annotated by law experts. The CJRC dataset can help researchers extract elements using reading comprehension technology. Element extraction is an important task in the legal field. However, it is difficult to predefine the element types completely because of the diversity of document types and causes of action. By contrast, machine reading comprehension technology can quickly extract elements by answering various questions about a long document. We build two strong baseline models based on BERT and BiDAF. The experimental results show that there is still considerable room for improvement compared to human annotators.
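
To make the idea of extracting elements by answering questions concrete, here is a minimal sketch. It assumes a span-style answer format (each answer is a contiguous substring of the document), which the abstract does not specify. The `SpanQuestion` class, the `extract_elements` helper, and the toy English document are all hypothetical illustrations, not part of CJRC or its released format.

```python
from dataclasses import dataclass


@dataclass
class SpanQuestion:
    """One question whose answer is assumed to be a contiguous document span."""
    question: str
    answer_start: int  # character offset of the answer in the document
    answer_text: str


def extract_elements(document, questions):
    """Map each question to its answer span, checking the span against the text."""
    elements = {}
    for q in questions:
        end = q.answer_start + len(q.answer_text)
        span = document[q.answer_start:end]
        if span != q.answer_text:
            raise ValueError(f"Annotation does not match document: {q.question!r}")
        elements[q.question] = span
    return elements


# Hypothetical toy document and questions, invented for illustration only.
document = "The plaintiff lent the defendant 5000 yuan on 3 May 2015."
questions = [
    SpanQuestion("How much was lent?", document.index("5000 yuan"), "5000 yuan"),
    SpanQuestion("When was the loan made?", document.index("3 May 2015"), "3 May 2015"),
]

elements = extract_elements(document, questions)
```

In this framing, each question plays the role of an element type, so new element types can be added by writing new questions instead of extending a predefined schema.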
