CL · Mar 8

Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation

arXiv:2603.07825v1

Predicted impact: top 10% in CL · last 90 days
Originality: Highly original

AI Analysis

This work addresses the critical need for legally accurate and trustworthy automated advisory services for Quebec insurance consumers facing an 'advice gap' due to digitization and legislative changes.

This paper introduces AEPC-QA, a new benchmark of 807 multiple-choice questions from Quebec insurance regulations, to evaluate 51 Large Language Models (LLMs) in both closed-book and Retrieval-Augmented Generation (RAG) settings. The study found that inference-time reasoning significantly improves performance, RAG can boost weak models by over 35 percentage points but also cause context distraction, and generalist models outperform smaller, specialized French models.

The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.
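The closed-book versus RAG comparison in the abstract comes down to a per-model accuracy difference, measured in percentage points. The sketch below is one possible way to compute that difference and to flag models that RAG helps or hurts. It is not the authors' evaluation code. The names `ModelResult`, `rag_delta_pp`, and `classify`, the 5-point threshold, and the example counts are all illustrative assumptions; only the benchmark size of 807 questions comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """Correct-answer counts for one model (hypothetical structure)."""
    name: str
    closed_book_correct: int
    rag_correct: int
    total: int = 807  # AEPC-QA size stated in the abstract


def accuracy(correct: int, total: int) -> float:
    """Accuracy as a percentage of questions answered correctly."""
    return 100.0 * correct / total


def rag_delta_pp(result: ModelResult) -> float:
    """RAG accuracy minus closed-book accuracy, in percentage points."""
    return (accuracy(result.rag_correct, result.total)
            - accuracy(result.closed_book_correct, result.total))


def classify(result: ModelResult, threshold_pp: float = 5.0) -> str:
    """Label the effect of retrieval; the threshold is an arbitrary assumption."""
    delta = rag_delta_pp(result)
    if delta >= threshold_pp:
        return "rag_helps"
    if delta <= -threshold_pp:
        return "context_distraction"
    return "neutral"


# Invented counts for illustration only; these are not results from the paper.
examples = [
    ModelResult("hypothetical_weak_model", closed_book_correct=300, rag_correct=600),
    ModelResult("hypothetical_distracted_model", closed_book_correct=600, rag_correct=500),
]

for r in examples:
    print(f"{r.name}: {rag_delta_pp(r):+.1f} pp -> {classify(r)}")
```

Reporting the change in percentage points rather than as a relative improvement matches the abstract's phrasing ("over 35 percentage points") and keeps gains comparable between models with very different closed-book baselines.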
