CL | AI | Jun 26, 2025

Large Language Models Acing Chartered Accountancy

Jatin Gupta, Akhil Sharma, Saransh Singhania, Mohammad Adnan, Sakshi Deo, Ali Imam Abidi, Keshav Gupta

arXiv:2506.21031v14.91 citations | h-index: 8 | SN Computer Science

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better evaluation of LLMs in the domain-specific financial context of Indian Chartered Accountancy, though it is incremental as it builds on existing benchmarking practices.

This paper evaluates how well large language models (LLMs) capture and apply domain-specific financial knowledge. It introduces CA-Ben, a benchmark based on Indian Chartered Accountancy exams, and finds that Claude 3.5 Sonnet and GPT-4o outperform the other models tested, especially in conceptual and legal reasoning, while challenges remain in numerical computation and legal interpretation.

Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning foundational, intermediate, and advanced CA curriculum stages. Six prominent LLMs, namely GPT-4o, Llama 3.3 70B, Llama 3.1 405B, Mistral Large, Claude 3.5 Sonnet, and Microsoft Phi 4, were evaluated using standardized protocols. Results indicate variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming the others, especially in conceptual and legal reasoning. Notable challenges emerged in numerical computations and legal interpretations. The findings highlight the strengths and limitations of current LLMs and suggest future improvements through hybrid reasoning and retrieval-augmented generation methods, particularly for quantitative analysis and accurate legal interpretation.
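
As a rough illustration of how a benchmark of this kind might be scored, the sketch below computes accuracy per curriculum stage. It is not the authors' protocol: the record layout, the assumption that each answer is a single option label, and the function name accuracy_by_stage are all hypothetical, and the sample records are made up for the example rather than taken from the paper.

```python
from collections import defaultdict


def accuracy_by_stage(records):
    """Return accuracy per curriculum stage.

    `records` is an iterable of (stage, predicted, gold) tuples.
    This layout is an assumption for illustration, not CA-Ben's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for stage, predicted, gold in records:
        total[stage] += 1
        # Compare option labels case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[stage] += 1
    return {stage: correct[stage] / total[stage] for stage in total}


# Made-up records for demonstration only; these are not results from the paper.
sample = [
    ("foundational", "A", "A"),
    ("foundational", "b", "B"),
    ("intermediate", "C", "D"),
    ("advanced", "A", "A"),
    ("advanced", "B", "C"),
]

scores = accuracy_by_stage(sample)
for stage, acc in scores.items():
    print(f"{stage}: {acc:.2f}")
```

Reporting accuracy per stage rather than one pooled figure keeps differences between the foundational, intermediate, and advanced levels visible, which matters when stages contain different numbers of questions.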
