The Cambridge Law Corpus: A Dataset for Legal AI Research
This dataset addresses the need for large-scale legal data in AI research, though the contribution is incremental: it focuses on data collection and benchmarking rather than novel methods.
The authors introduce the Cambridge Law Corpus (CLC), a dataset of over 250,000 UK court cases spanning the 16th to the 21st centuries, and provide benchmarks by training and evaluating case outcome extraction with GPT-3, GPT-4, and RoBERTa models on 638 expert-annotated cases.
We introduce the Cambridge Law Corpus (CLC), a dataset for legal AI research. It consists of over 250,000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and metadata. Together with the corpus, we provide case outcome annotations for 638 cases, produced by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.
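To make the benchmarking step concrete, here is a minimal sketch of how model predictions could be scored against expert outcome annotations. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not specify the label set, the annotation format, or the metrics, so the labels "allowed" and "dismissed", the one-label-per-case setup, and the choice of accuracy and macro-averaged F1 are all hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of cases where the predicted outcome matches the annotation."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # annotated label g was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical outcome labels for four cases (not taken from the corpus).
gold = ["allowed", "dismissed", "dismissed", "allowed"]
pred = ["allowed", "dismissed", "allowed", "allowed"]

print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Macro-averaging is used here so that a rare outcome counts as much as a common one. Whether that matches the paper's reported metrics is not stated in the abstract.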