Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation
For researchers and practitioners in automated summarisation and political science, this work offers a more principled way to evaluate the faithfulness of debate summaries, addressing a known weakness of current automated metrics: their poor correlation with human judgements of consistency.
The paper addresses the challenge of evaluating whether LLM-generated summaries of parliamentary debates faithfully preserve argumentative content. It proposes a formal framework grounded in computational argumentation and demonstrates it on a case study of European Parliament debates and their LLM-generated summaries, focusing the evaluation on whether the reasoning for and against policy outcomes is preserved.
Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case study of debates from the European Parliament and associated LLM-driven summaries.
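To make the idea of argument structures grounded in contested proposals more concrete, the following is a minimal illustrative sketch and not the framework proposed in the paper, whose formal properties are not specified in this abstract. It assumes that arguments have already been extracted from both the debate and the summary, and that each one can be reduced to a claim label, the proposal it addresses, and a stance. The names `Argument` and `check_summary`, the two example properties (no unsupported arguments, and coverage of source arguments), and the exact-match comparison are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    """Hypothetical representation of one extracted argument."""
    claim: str     # short label for the argument's content
    proposal: str  # the contested proposal the argument addresses
    stance: str    # "support" or "oppose"


def check_summary(source_args, summary_args):
    """Compare summary arguments against source arguments.

    Returns a pair (unsupported, coverage):
      unsupported: summary arguments with no identical source argument,
                   e.g. a claim whose stance was flipped or invented
      coverage:    fraction of source arguments retained in the summary
    Exact matching on (claim, proposal, stance) is an assumption made
    for this sketch; a real system would need a softer notion of match.
    """
    source = set(source_args)
    summary = set(summary_args)
    unsupported = summary - source
    coverage = len(source & summary) / len(source) if source else 1.0
    return unsupported, coverage


# Toy debate: three arguments about two proposals (invented example data).
source = [
    Argument("reduces emissions", "P1", "support"),
    Argument("raises energy costs", "P1", "oppose"),
    Argument("protects jobs", "P2", "support"),
]

# Toy summary: keeps one argument, but flips the stance of another.
summary = [
    Argument("reduces emissions", "P1", "support"),
    Argument("raises energy costs", "P1", "support"),
]

unsupported, coverage = check_summary(source, summary)
print(unsupported)
print(coverage)
```

In this toy case the summary retains one of the three source arguments and contains one argument whose stance does not match the source, which is the kind of reasoning-level discrepancy that surface-overlap metrics can miss.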