CL · LG · Apr 9, 2023

RISC: Generating Realistic Synthetic Bilingual Insurance Contract

arXiv:2304.04212v1 · 2 citations · h-index: 12 · Has Code
Originality: Synthesis-oriented
AI Analysis

The paper addresses the lack of complex legal-document datasets for NLP and enables research on insurance-specific applications. The contribution is incremental, however, because the generated contracts build on existing regulatory forms.

The paper introduces RISC, a tool for generating realistic synthetic bilingual insurance contracts, and RISCBAC, a dataset of 10,000 unannotated French and English contracts based on Quebec regulations, to support NLP research in tasks like summarization and translation.

This paper presents RISC, an open-source Python package for data generation (https://github.com/GRAAL-Research/risc). RISC generates look-alike automobile insurance contracts, in French and English, based on the Quebec regulatory insurance form. Insurance contracts are 90 to 100 pages long and use legal and insurance-specific vocabulary that is complex for a layperson. Hence, they are a much more complex class of documents than those in traditional NLP corpora. We therefore introduce RISCBAC, a Realistic Insurance Synthetic Bilingual Automobile Contract dataset based on the mandatory Quebec car insurance contract. The dataset comprises 10,000 unannotated French and English insurance contracts. RISCBAC enables NLP research on unsupervised automatic summarisation, question answering, text simplification, machine translation and more. Moreover, it can be further automatically annotated as a dataset for supervised tasks such as NER.

Code Implementations: 1 repo
