CL | AI | Feb 6, 2025

LLMs to Support a Domain Specific Knowledge Assistant

arXiv:2502.04095v1 | 1 citation
Originality: Incremental advance
AI Analysis

This paper addresses the problem of supporting companies with IFRS sustainability reporting through a domain-specific knowledge assistant. Its contribution is incremental: it applies existing LLM techniques to a new domain.

This work tackles the lack of publicly available question-answer datasets for sustainability reporting under IFRS by creating a synthetic dataset of 1,063 QA pairs using LLMs and developing two QA architectures. The RAG pipeline achieves 85.32% accuracy on single-industry questions and 72.15% on cross-industry questions, while the LLM-based pipeline achieves 93.45% and 80.30%, respectively, outperforming the baseline by up to 27.36 percentage points.

This work presents a custom approach to developing a domain-specific knowledge assistant for sustainability reporting using the International Financial Reporting Standards (IFRS). In this domain, there is no publicly available question-answer dataset, which has impeded the development of a high-quality chatbot to support companies with IFRS reporting. The two key contributions of this project therefore are:

(1) A high-quality synthetic question-answer (QA) dataset based on IFRS sustainability standards, created using a novel generation and evaluation pipeline leveraging Large Language Models (LLMs). This comprises 1,063 diverse QA pairs that address a wide spectrum of potential user queries in sustainability reporting. Various LLM-based techniques are employed to create the dataset, including chain-of-thought reasoning and few-shot prompting. A custom evaluation framework is developed to assess question and answer quality across multiple dimensions, including faithfulness, relevance, and domain specificity. The dataset achieves an average score of 8.16 out of 10 on these metrics.

(2) Two architectures for question-answering in the sustainability reporting domain: a RAG pipeline and a fully LLM-based pipeline. The architectures are developed by experimenting, fine-tuning, and training on the QA dataset. The final pipelines feature an LLM fine-tuned on domain-specific data and an industry classification component to improve the handling of complex queries. The RAG architecture achieves an accuracy of 85.32% on single-industry and 72.15% on cross-industry multiple-choice questions, outperforming the baseline approach by 4.67 and 19.21 percentage points, respectively. The LLM-based pipeline achieves an accuracy of 93.45% on single-industry and 80.30% on cross-industry multiple-choice questions, an improvement of 12.80 and 27.36 percentage points over the baseline, respectively.
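The reported accuracies and improvements can be cross-checked with simple arithmetic. The sketch below subtracts each stated gain from the corresponding accuracy to recover the baseline accuracy it implies. The dictionary layout and the function name `implied_baseline` are illustrative assumptions, not part of the paper; only the numbers come from the abstract.

```python
# Accuracy (%) and improvement over baseline (percentage points),
# as stated in the abstract. The structure and names are hypothetical.
reported = {
    "RAG": {"single": (85.32, 4.67), "cross": (72.15, 19.21)},
    "LLM-based": {"single": (93.45, 12.80), "cross": (80.30, 27.36)},
}


def implied_baseline(accuracy, gain):
    """Baseline accuracy implied by a reported accuracy and its gain."""
    return round(accuracy - gain, 2)


# Recover the implied baseline for every pipeline and question type.
baselines = {
    pipeline: {
        split: implied_baseline(acc, gain)
        for split, (acc, gain) in splits.items()
    }
    for pipeline, splits in reported.items()
}

for pipeline, splits in baselines.items():
    for split, value in splits.items():
        print(f"{pipeline:10s} {split:6s} implied baseline: {value:.2f}%")
```

Both pipelines imply the same baseline for each question type, which suggests the stated gains are measured against a single shared baseline and that the figures are internally consistent.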

