CL | Oct 11, 2024

Measuring the Groundedness of Legal Question-Answering Systems

Dietrich Trautmann, Natalia Ostapuk, Quentin Grail, Adrian Alan Pol, Guglielmo Bonifazi, Shang Gao, Martin Gajek

arXiv:2410.08764v1 | 12.622 citations | h-index: 10 | NLLP

Originality: Incremental advance

AI Analysis

The paper addresses the need for reliable and trustworthy AI systems in high-stakes legal domains, though its contribution is an incremental improvement over existing detection methods.

This work tackled the problem of assessing the groundedness of AI-generated responses in legal question-answering by evaluating methods like similarity-based metrics and natural language inference models, with the best method achieving a macro-F1 score of 0.8.

In high-stakes domains like legal question-answering, the accuracy and trustworthiness of generative AI systems are of paramount importance. This work presents a comprehensive benchmark of various methods to assess the groundedness of AI-generated responses, aiming to significantly enhance their reliability. Our experiments include similarity-based metrics and natural language inference models to evaluate whether responses are well-founded in the given contexts. We also explore different prompting strategies for large language models to improve the detection of ungrounded responses. We validated the effectiveness of these methods using a newly created grounding classification corpus, designed specifically for legal queries and corresponding responses from retrieval-augmented prompting, focusing on their alignment with source material. Our results indicate potential in groundedness classification of generated responses, with the best method achieving a macro-F1 score of 0.8. Additionally, we evaluated the methods in terms of their latency to determine their suitability for real-world applications, as this step typically follows the generation process. This capability is essential for processes that may trigger additional manual verification or automated response regeneration. In summary, this study demonstrates the potential of various detection methods to improve the trustworthiness of generative AI in legal settings.
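The abstract reports its headline result as a macro-F1 score. As a reminder of what that metric measures, here is a minimal sketch in plain Python that computes it as the unweighted mean of per-class F1 scores. The label names and the toy data are hypothetical and are not taken from the paper's corpus; the paper's own evaluation code is not shown in this listing.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Return the unweighted mean of per-class F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    # Score every class that appears in either the gold or predicted labels.
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels, for illustration only.
gold = ["grounded", "grounded", "grounded", "ungrounded", "ungrounded", "ungrounded"]
pred = ["grounded", "grounded", "ungrounded", "ungrounded", "ungrounded", "grounded"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Because each class contributes equally regardless of how often it occurs, macro-F1 penalizes a detector that performs well only on the majority class, which matters when ungrounded responses are rare.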

