Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia
This work addresses the accuracy of numeric operations in LLMs for applications that require mathematical reasoning, though its contribution is incremental because it focuses on tokenization effects.
The study investigates how different numeral systems (base 10 versus base 100 or base 1000) affect the scaling behavior of large language models on numeric operations such as addition and multiplication. It finds that base 10 is consistently more data-efficient across training data scales and model sizes when training from scratch, while the systems reach similar performance under fine-tuning.
Although Large Language Models (LLMs) have shown remarkable abilities in mathematical reasoning, they still struggle to perform numeric operations such as addition and multiplication accurately. Different LLMs tokenize numbers in different ways, and the choice affects performance on numeric operations. Currently, there are two representative schemes: 1) tokenizing numbers into $1$-digit tokens, and 2) tokenizing numbers into tokens of $1\sim 3$ digits. The difference is roughly equivalent to using different numeral systems (namely base $10$ or base $10^{3}$). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scales and model sizes under from-scratch training settings, while different numeral systems have very similar fine-tuning performance. We attribute this to the higher token frequencies of a base $10$ system. Additionally, we reveal extrapolation behavior patterns on addition and multiplication. We identify that base $10^{2}$ and base $10^{3}$ systems struggle with token-level discernment and token-level operations. We also shed light on the mechanism learned by the models.
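To make the correspondence between tokenization and numeral systems concrete, the following minimal Python sketch splits a number into fixed-width digit groups and compares the result with its base-$10^{3}$ expansion. The helper names `chunk_digits` and `to_base` are hypothetical and are not taken from the paper or from any tokenizer library. The sketch also assumes right-aligned grouping, which is what makes the groups line up with base-$10^{k}$ digits; real tokenizers may group digits differently, which is one reason the equivalence is only rough.

```python
def chunk_digits(number: int, digits_per_token: int) -> list[str]:
    """Split a non-negative integer into right-aligned groups of decimal digits.

    Assumption: groups are aligned to the least significant digit, so only the
    leading group may be shorter than `digits_per_token`.
    """
    text = str(number)
    first = len(text) % digits_per_token  # width of the (possibly short) leading group
    tokens = [text[:first]] if first else []
    tokens += [
        text[i:i + digits_per_token]
        for i in range(first, len(text), digits_per_token)
    ]
    return tokens


def to_base(number: int, base: int) -> list[int]:
    """Return the digits of a non-negative integer in `base`, most significant first."""
    if number == 0:
        return [0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]


n = 1234567
print(chunk_digits(n, 1))  # ['1', '2', '3', '4', '5', '6', '7']
print(chunk_digits(n, 3))  # ['1', '234', '567']
print(to_base(n, 1000))    # [1, 234, 567]
```

Under this assumption, each 3-digit token plays the role of a single base-$10^{3}$ digit, so the same number becomes a shorter sequence drawn from a much larger symbol set. With far more distinct symbols, each one occurs less often in a fixed amount of training data, which is in line with the token-frequency explanation given in the abstract.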