Constructing Synthetic Instruction Datasets for Improving Reasoning in Domain-Specific LLMs: A Case Study in the Japanese Financial Domain
This work addresses the problem of achieving both domain expertise and reasoning ability in LLMs for domain-specific applications such as finance. It is incremental, as it builds on existing synthetic data methods.
The study tackled the challenge of adapting LLMs to specific domains by proposing a method for constructing synthetic instruction data. Applied to the Japanese financial domain, the method produced a dataset of approximately 9.5 billion tokens with Chain-of-Thought reasoning traces, which resulted in performance improvements over baseline models on financial benchmarks.
In adapting LLMs to specific domains, achieving both domain expertise and reasoning ability remains an urgent challenge. This study proposes a general method for constructing high-quality synthetic instruction data for any domain, starting from domain-specific vocabulary. As a demonstration, we applied this method to the financial domain and constructed a large-scale instruction dataset totaling approximately 9.5 billion tokens with Chain-of-Thought reasoning traces. Evaluation results confirmed performance improvements over baseline models on financial benchmarks, demonstrating the effectiveness of our approach. We also report findings on the impact of reasoning trace length on performance and its limitations. Lastly, we open-source our models and datasets at https://huggingface.co/nri-ai.