Strategic Data Ordering: Enhancing Large Language Model Performance through Curriculum Learning
The approach addresses scalability challenges for LLM developers by offering a sustainable way to improve performance without increasing model size or dataset volume, though the reported gains are incremental.
The study addresses the computational resource demands of Large Language Model (LLM) training by proposing a curriculum learning strategy that orders training data from simple to complex, using criteria such as prompt length, attention scores, and loss values. Experiments with Mistral-7B and Gemma-7B showed slight performance improvements over random shuffling, with attention-based sorting generally yielding better results.
The rapid advancement of Large Language Models (LLMs) has improved text understanding and generation but places heavy demands on computational resources. This study proposes a curriculum learning-inspired, data-centric training strategy that begins with simpler tasks and progresses to more complex ones, using criteria such as prompt length, attention scores, and loss values to structure the training data. Experiments with Mistral-7B (Jiang et al., 2023) and Gemma-7B (Team et al., 2024) models demonstrate that curriculum learning slightly improves performance compared to traditional random data shuffling. Notably, sorting data by the proposed attention criteria generally led to better performance. This approach offers a sustainable method to enhance LLM performance without increasing model size or dataset volume, addressing scalability challenges in LLM training.
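The ordering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout (`prompt` and `response` keys), the function names, and the whitespace-based length measure are all assumptions made for illustration. Prompt length serves as the difficulty score here because it needs no model; an attention-based or loss-based score would be plugged in as a different `difficulty` callable, whose exact definition the text above does not specify.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical record layout: each training example is a dict with
# "prompt" and "response" strings. This is an assumption, not the paper's format.
Example = Dict[str, str]


def prompt_length(example: Example) -> float:
    """Difficulty proxy: number of whitespace-separated words in the prompt."""
    return float(len(example["prompt"].split()))


def curriculum_order(
    examples: Sequence[Example],
    difficulty: Callable[[Example], float],
) -> List[Example]:
    """Return a new list sorted from lowest to highest difficulty score.

    sorted() is stable, so examples with equal scores keep their input order.
    """
    return sorted(examples, key=difficulty)


def random_order(examples: Sequence[Example], seed: int = 0) -> List[Example]:
    """Baseline: return a shuffled copy, leaving the input untouched."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


data = [
    {"prompt": "Summarize the following three paragraphs in one sentence",
     "response": "A short summary."},
    {"prompt": "Translate: hello", "response": "hola"},
    {"prompt": "Add 2 and 3", "response": "5"},
]

ordered = curriculum_order(data, prompt_length)
baseline = random_order(data, seed=0)
```

With this toy data, `ordered` places the two-word prompt first and the eight-word prompt last, while `baseline` holds the same examples in a seeded random order. Either list would then be passed to the fine-tuning loop with shuffling disabled, so that the chosen order is what the model actually sees.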