CL, AI · Nov 2, 2023

Multi-dimensional data refining strategy for effective fine-tuning LLMs

Thanh Nguyen Ngoc, Quang Nhat Tran, Arthur Tang, Bao Nguyen, Thuy Nguyen, Thanh Pham

arXiv:2311.01049v1

Originality: Synthesis-oriented

AI Analysis

This work addresses data scarcity and linguistic diversity in fine-tuning LLMs for low-resource languages such as Vietnamese. It offers practical solutions but is incremental in nature.

The paper tackles the challenge of acquiring suitable data for fine-tuning large language models in Vietnamese by developing a multi-dimensional data refining strategy. The resulting fine-tuned model is reported to perform well at generating Vietnamese news articles from prompts.

Data is a cornerstone for fine-tuning large language models, yet acquiring suitable data remains challenging. The challenges include data scarcity, linguistic diversity, and domain-specific content. This paper presents lessons learned while crawling and refining data tailored for fine-tuning Vietnamese language models. Crafting such a dataset, while accounting for linguistic intricacies and striking a balance between inclusivity and accuracy, demands meticulous planning. Our paper presents a multidimensional strategy that includes leveraging existing English-language datasets and developing customized data-crawling scripts with the assistance of generative AI tools. A fine-tuned LLM for the Vietnamese language, produced using the resulting datasets, demonstrated good performance in generating Vietnamese news articles from prompts. The study offers practical solutions and guidance for future fine-tuning of models in languages like Vietnamese.
