CL AI LG Jan 30, 2023

Specializing Smaller Language Models towards Multi-Step Reasoning

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot

arXiv:2301.12726v1 28.7356 citations h-index: 56 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of making reasoning capabilities accessible in resource-constrained settings by specializing smaller models, though it is incremental as it builds on existing distillation and specialization techniques.

The paper tackles the problem of enabling smaller language models (≤11B parameters) to perform complex multi-step reasoning, typically an emergent ability of large models (≥100B parameters), by distilling knowledge from GPT-3.5 to T5 variants, resulting in improved performance on specialized tasks like math reasoning.

The surprising ability of Large Language Models (LLMs) to perform well on complex reasoning with only few-shot chain-of-thought prompts is believed to emerge only in very large-scale models (100+ billion parameters). We show that such abilities can, in fact, be distilled down from GPT-3.5 ($\ge$ 175B) to T5 variants ($\le$ 11B). We propose model specialization, which concentrates a model's ability on a target task. The hypothesis is that large models (commonly viewed as larger than 100B) have strong modeling power, but that power is spread across a large spectrum of tasks. Small models (commonly viewed as smaller than 10B) have limited capacity, but if we concentrate that capacity on a specific target task, the model can achieve decently improved performance. We use multi-step math reasoning as our testbed because it is a typical emergent ability. We show two important aspects of model abilities: (1) there exists a very complex balance/tradeoff between language models' multi-dimensional abilities; (2) by paying the price of decreased generic ability, we can clearly lift the scaling curve of models smaller than 10B towards a specialized multi-step math reasoning ability. We further give comprehensive discussions of important design choices for better generalization, including the tuning data format, the starting model checkpoint, and a new model selection method. We hope our practice and discoveries can serve as an important attempt towards specialized smaller models in the new research paradigm set by LLMs.
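The abstract does not spell out how the distillation data is assembled, so the following is only a minimal sketch of one common way to turn teacher chain-of-thought outputs into fine-tuning pairs for a smaller model. It is not the authors' pipeline. Everything in it is an assumption made for illustration: the `TeacherSample` type, the "The answer is ..." convention, the prompt template, and the choice to keep only rationales whose final answer matches the gold label.

```python
import re
from dataclasses import dataclass


@dataclass
class TeacherSample:
    """One teacher generation (hypothetical container, not from the paper)."""
    question: str
    rationale: str    # chain-of-thought text produced by the large model
    gold_answer: str  # reference answer from the dataset, as a string


# Assumed convention: the rationale states its result as "The answer is <number>".
ANSWER_PATTERN = re.compile(r"The answer is\s*(-?[\d,]*\.?\d+)")


def extract_answer(rationale):
    """Return the last stated numeric answer without commas, or None if absent."""
    matches = ANSWER_PATTERN.findall(rationale)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def build_pairs(samples):
    """Keep rationales whose final answer equals the gold answer (string match)
    and format them as (input_text, target_text) pairs for seq2seq tuning."""
    pairs = []
    for s in samples:
        if extract_answer(s.rationale) == s.gold_answer:
            prompt = f"Question: {s.question}\nLet's think step by step."
            pairs.append((prompt, s.rationale))
    return pairs


if __name__ == "__main__":
    demo = [
        TeacherSample("What is 5 + 7?", "5 + 7 = 12. The answer is 12.", "12"),
        TeacherSample("What is 2 * 3?", "2 * 3 = 5. The answer is 5.", "6"),
    ]
    for source, target in build_pairs(demo):
        print(source)
        print(target)
```

The resulting pairs could then be fed to any sequence-to-sequence fine-tuning loop for a T5-style model. The exact-string comparison is deliberately simple and would treat "12" and "12.0" as different answers.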
