CL AI LG

Jul 22, 2025

Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models

Armin Berger, Lars Hillebrand, David Leonhard, Tobias Deußer, Thiago Bell Felix de Oliveira, Tim Dilmaghani, Mohamed Khaled, Bernd Kliem, Rüdiger Loitz, Christian Bauckhage, Rafet Sifa

arXiv:2507.16642v19.62 citations

Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses the need for automated compliance verification in financial auditing. Its contribution is incremental: it compares existing LLMs on new data.

The paper evaluates large language models (LLMs) on custom datasets for verifying regulatory compliance in financial auditing. It finds that the open-source Llama-2 70B model excels at detecting non-compliance, while proprietary models such as GPT-4 perform best across a broad variety of scenarios, especially non-English contexts.

The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI's GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70 billion model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all its proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.
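The abstract equates detecting non-compliance with "true negative occurrences", which suggests that compliant passages are treated as the positive class and non-compliant ones as the negative class. Under that assumption, the quantity being compared can be sketched as a true negative rate. The function name and the labels below are hypothetical and purely illustrative; they are not taken from the paper or its datasets.

```python
def true_negative_rate(y_true, y_pred):
    """Share of non-compliant items (label 0) that the model also labels 0.

    Assumed label convention (not confirmed by the source):
    1 = compliant, 0 = non-compliant.
    """
    # Keep only the items whose gold label is non-compliant.
    negatives = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    if not negatives:
        raise ValueError("no non-compliant examples in y_true")
    # Count how many of those the model also flagged as non-compliant.
    true_negatives = sum(1 for _, p in negatives if p == 0)
    return true_negatives / len(negatives)


# Made-up gold labels and model predictions, for illustration only.
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
tnr = true_negative_rate(y_true, y_pred)
print(f"true negative rate: {tnr:.2f}")
```

A model can score well on this measure while doing worse on overall accuracy, which is consistent with the abstract reporting different winners for non-compliance detection and for the broader set of scenarios.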
