Taxation Perspectives from Large Language Models: A Case Study on Additional Tax Penalties
This work addresses a gap in assessing LLMs for tax law applications. The contribution is incremental: it builds on existing legal-domain research by adding a new benchmark.
The study evaluated the capabilities of large language models (LLMs) in taxation by introducing PLAT, a benchmark for predicting the legitimacy of additional tax penalties. It found that baseline LLM performance was limited but improved with retrieval, self-reasoning, and multi-agent discussion.
How capable are large language models (LLMs) in the domain of taxation? Although numerous studies have explored the legal domain in general, research dedicated to taxation remains scarce. Moreover, the datasets used in these studies are either simplified, failing to reflect real-world complexities, or unavailable as open source. To address this gap, we introduce PLAT, a new benchmark designed to assess the ability of LLMs to predict the legitimacy of additional tax penalties. PLAT is constructed to evaluate LLMs' understanding of tax law, particularly in cases where resolving the issue requires more than just applying related statutes. Our experiments with six LLMs reveal that their baseline capabilities are limited, especially when dealing with conflicting issues that demand a comprehensive understanding. However, we found that this limitation can be mitigated by enabling retrieval, self-reasoning, and discussion among multiple agents with specific role assignments.