CL, AI, LG · Sep 10, 2024

A Practice of Post-Training on Llama-3 70B with Optimal Selection of Additional Language Mixture Ratio

arXiv:2409.06624v3 · 2 citations · h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficiently adapting LLMs to new languages and domains, though it is incremental as it builds on existing CPT methods with systematic hyperparameter tuning.

The authors tackled the problem of optimizing hyper-parameters for continual pre-training of large language models to enhance Chinese ability. They found that careful selection of the additional language mixture ratio and the learning rate improved performance on Chinese benchmarks and in specific domains such as math and coding, and they deployed the resulting 70B model in a chat system.

Large Language Models (LLMs) often need to be Continual Pre-Trained (CPT) to acquire unfamiliar language skills or adapt to new domains. The huge training cost of CPT calls for a careful choice of key hyper-parameters such as the mixture ratio of the extra language or domain corpus. However, there is no systematic study that bridges the gap between the optimal mixture ratio and the actual model performance, or the gap between the experimental scaling law and the actual deployment at full model size. In this paper, we perform CPT on Llama-3 8B and 70B to enhance their Chinese ability. We study the optimal correlation between the Additional Language Mixture Ratio (ALMR) and the Learning Rate (LR) at the 8B size, which directly indicates the optimal experimental setup. Through careful choice of hyper-parameters and subsequent fine-tuning, the model's capability is improved not only on Chinese-related benchmarks but also in some specific domains including math, coding, and emotional intelligence. We deploy the final 70B version of the LLM in a real-life chat system, where it achieves satisfactory performance.
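
To make the mixture-ratio idea concrete, here is a minimal sketch of how a ratio like ALMR could steer which corpus each training document is drawn from. It assumes ALMR is the fraction of training documents taken from the additional-language corpus; the function names, the corpus labels "additional" and "original", and the example ratio are hypothetical illustrations, not the authors' pipeline or their reported optimum.

```python
import random


def sample_source(almr: float, rng: random.Random) -> str:
    """Choose the corpus for the next training document.

    Assumption: `almr` is the probability of drawing from the
    additional-language corpus (hypothetical label "additional").
    """
    if not 0.0 <= almr <= 1.0:
        raise ValueError("almr must be in [0, 1]")
    return "additional" if rng.random() < almr else "original"


def mixture_counts(almr: float, n_docs: int, seed: int = 0) -> dict:
    """Count how many of `n_docs` draws land in each corpus."""
    rng = random.Random(seed)
    counts = {"additional": 0, "original": 0}
    for _ in range(n_docs):
        counts[sample_source(almr, rng)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative ratio only; not a value taken from the paper.
    print(mixture_counts(almr=0.33, n_docs=10_000))
```

In a sweep, one would repeat the training run over a grid of (ALMR, LR) pairs on the smaller model and compare validation metrics; the sketch above covers only the sampling step.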

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
