CL AI LG | Jul 31, 2023

An Effective Data Creation Pipeline to Generate High-quality Financial Instruction Data for Large Language Model

Ziao Wang, Jianning Wang, Junda Wu, Xiaofeng Zhang

arXiv:2308.01415v10.51 citations | h-index: 25

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for domain-specific datasets to enhance AI models in the financial sector, though it is incremental as it builds on existing methods for data generation.

The paper tackles the problem of generating high-quality financial instruction data for fine-tuning large language models. It develops a data creation pipeline that combines ChatGPT with human expert feedback, producing a dataset of 103k multi-turn chats that improved model performance on financial tasks.

In the early era of large language models, generating a high-quality financial dataset is critical for fine-tuning a large language model on finance-related tasks. This paper therefore presents a carefully designed data creation pipeline for this purpose. In particular, we initiate a dialogue between an AI investor and a financial expert using ChatGPT and incorporate feedback from human financial experts to refine the dataset. The pipeline yields a robust instruction-tuning dataset comprising 103k multi-turn chats. Extensive experiments on this dataset evaluate the model's performance, using an external GPT-4 as the judge. The promising experimental results verify that our approach leads to significant advancements in generating accurate, relevant, and financial-style responses from AI models, thus providing a powerful tool for applications within the financial sector.
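
The investor and expert dialogue described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`simulate_chat`, `echo_model`), the role prompts, and the message layout are all hypothetical, and the chat model is replaced by a deterministic stand-in so the example runs without any API access. Human expert feedback and GPT-4 judging are not modeled here.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # hypothetical layout: {"role": ..., "content": ...}

# Hypothetical role prompts; the paper's actual prompts are not given here.
ROLE_PROMPTS = {
    "investor": "You are an investor asking a question about: {topic}",
    "expert": "You are a financial expert answering the investor about: {topic}",
}


def simulate_chat(
    topic: str,
    generate: Callable[[str, List[Message]], str],
    turns: int = 3,
) -> List[Message]:
    """Alternate investor and expert messages for `turns` rounds."""
    history: List[Message] = []
    for _ in range(turns):
        for role in ("investor", "expert"):
            prompt = ROLE_PROMPTS[role].format(topic=topic)
            # `generate` sees the prompt plus all earlier messages.
            reply = generate(prompt, history)
            history.append({"role": role, "content": reply})
    return history


def echo_model(prompt: str, history: List[Message]) -> str:
    """Stand-in for a chat model (assumption): returns a placeholder reply."""
    return f"[turn {len(history) + 1}] reply to: {prompt}"


chat = simulate_chat("bond duration risk", echo_model, turns=2)
for message in chat:
    print(message["role"], "->", message["content"])
```

In practice `echo_model` would be swapped for a call to a real chat model, and each generated chat would pass through expert review before entering the dataset.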

