CL · Aug 19, 2024

Refining Packing and Shuffling Strategies for Enhanced Performance in Generative Language Models

Yanbing Chen, Ruilin Wang, Zihao Yang, Lavender Yao Jiang, Eric Karl Oermann

arXiv:2408.09621v1 · 1.0 · h-index: 6

Originality: Synthesis-oriented

AI Analysis

This work addresses the trade-off between efficiency and performance in data preprocessing for language model training. It is incremental: it refines existing packing strategies rather than introducing a new paradigm.

The study addresses contextual incoherence in token packing and shuffling for language model training by investigating the optimal atom size. It finds that matching the atom size to the maximum sequence length (MSL) optimizes performance for both concatenation and padding. Padding yields lower final perplexity, but at the cost of more training steps and lower compute efficiency.

Packing and shuffling tokens is a common practice in training auto-regressive language models (LMs) to prevent overfitting and improve efficiency. Typically, documents are concatenated into chunks of maximum sequence length (MSL) and then shuffled. However, setting the atom size (the length of each data chunk that is randomly shuffled as a unit) to MSL may lead to contextual incoherence, because tokens from different documents are packed into the same chunk. An alternative approach is padding, another common data packing strategy, which avoids contextual incoherence by including only one document in each shuffled chunk. To optimize both packing strategies (concatenation vs. padding), we investigated the optimal atom size for shuffling and compared the performance and efficiency of the two methods. We found that matching atom size to MSL optimizes performance for both packing methods, and that padding yields lower final perplexity (higher performance) than concatenation at the cost of more training steps and lower compute efficiency. This trade-off informs the choice of packing method in training language models.
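The two packing strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `PAD_ID` and `EOS_ID` values, the choice to split long documents across several padded atoms, and the choice to drop the trailing remainder of the concatenated stream are all assumptions made for illustration.

```python
import random
from typing import List

PAD_ID = 0  # hypothetical padding token id
EOS_ID = 1  # hypothetical end-of-document token id


def pack_concat(docs: List[List[int]], atom_size: int) -> List[List[int]]:
    """Concatenate all documents into one stream, then cut fixed-size atoms.

    An atom may span a document boundary, which is the source of the
    contextual incoherence the abstract describes.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS_ID)
    n_atoms = len(stream) // atom_size  # assumption: drop the trailing remainder
    return [stream[i * atom_size:(i + 1) * atom_size] for i in range(n_atoms)]


def pack_pad(docs: List[List[int]], atom_size: int) -> List[List[int]]:
    """Give each atom tokens from a single document, padding the tail.

    Assumption: documents longer than atom_size are split across atoms.
    """
    atoms: List[List[int]] = []
    for doc in docs:
        for start in range(0, len(doc), atom_size):
            piece = doc[start:start + atom_size]
            atoms.append(piece + [PAD_ID] * (atom_size - len(piece)))
    return atoms


def shuffle_atoms(atoms: List[List[int]], seed: int = 0) -> List[List[int]]:
    """Return a shuffled copy; atoms are the unit of random shuffling."""
    shuffled = list(atoms)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def pad_fraction(atoms: List[List[int]]) -> float:
    """Fraction of positions occupied by padding (wasted compute)."""
    total = sum(len(atom) for atom in atoms)
    return sum(atom.count(PAD_ID) for atom in atoms) / total


if __name__ == "__main__":
    docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]]
    print(pack_concat(docs, atom_size=4))
    print(pack_pad(docs, atom_size=4))
    print(pad_fraction(pack_pad(docs, atom_size=4)))
```

In this toy setting, concatenation fills every position with real tokens but mixes documents within an atom, while padding keeps each atom to one document and spends part of every batch on pad tokens. That mirrors, in miniature, the efficiency side of the trade-off the abstract reports.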
