CL May 23, 2023

PaD: Program-aided Distillation Can Teach Small Models Reasoning Better than Chain-of-thought Fine-tuning

Xuekai Zhu, Biqing Qi, Kaiyan Zhang, Xinwei Long, Zhouhan Lin, Bowen Zhou

arXiv:2305.13888v2 14.041 citations Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of deploying large language models in practical settings by improving reasoning distillation for smaller models, though it is incremental as it builds on existing distillation methods.

The paper tackles faulty reasoning in the synthetic chain-of-thought data used to distill large language models into small ones. It proposes Program-aided Distillation (PaD), which replaces chain-of-thought rationales with reasoning programs so that synthetic data can be checked for errors automatically and the small model can self-refine. The resulting small models outperform some LLMs (e.g., LLaMA-1 13B) and improve strongly over baselines while using fewer parameters and less data.

While large language models (LLMs) excel in various natural language processing tasks, their huge size and the inaccessibility of their parameters present challenges for practical deployment. Previous studies try to distill task-specific ability from LLMs into smaller models, using data synthesis and chain-of-thought (CoT) fine-tuning. However, synthetic CoT data often contains faulty reasoning, which degrades the quality of distillation, especially for reasoning capabilities. In this work, we propose Program-aided Distillation (PaD), which introduces reasoning programs to suppress the errors in distilled data and thus achieves better distillation quality for reasoning tasks. In PaD, we substitute a reasoning program for the CoT, allowing automated error checking of synthetic data. Further, through error injection and additional training, the small distilled model can iteratively self-refine its reasoning. Moreover, we conduct a step-wise beam search with step-by-step verification to acquire more exact reasoning chains. We evaluate PaD on arithmetic reasoning, symbolic reasoning, and general ability. Experimental results demonstrate that smaller models using PaD can not only outperform certain LLMs (e.g., LLaMA-1 13B) but also achieve strong improvement over baselines with a significantly smaller scale of parameters and data. The source code is publicly available at https://github.com/Xuekai-Zhu/pad.
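The automated error checking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `run_program` and `keep_sample`, the convention that each program stores its result in a variable called `answer`, and the toy samples are all assumptions made for illustration. The idea is only that a reasoning program, unlike a free-text CoT, can be executed, so a synthetic sample can be discarded when its program crashes or yields an answer that disagrees with the gold label.

```python
from typing import Optional


def run_program(program: str) -> Optional[float]:
    """Execute a candidate reasoning program and return its `answer` variable.

    Returns None if the program raises or never defines a numeric `answer`.
    The `answer` naming convention is an assumption of this sketch.
    """
    namespace: dict = {}
    try:
        # Empty builtins limit what the program can call; this is NOT a real
        # sandbox and should not be used on untrusted code as-is.
        exec(program, {"__builtins__": {}}, namespace)
    except Exception:
        return None
    value = namespace.get("answer")
    return float(value) if isinstance(value, (int, float)) else None


def keep_sample(program: str, gold: float, tol: float = 1e-6) -> bool:
    """Keep a synthetic sample only if its program runs and matches the gold answer."""
    result = run_program(program)
    return result is not None and abs(result - gold) <= tol


# Hypothetical synthetic samples: one correct, one with faulty reasoning,
# one that fails at runtime.
samples = [
    {"program": "apples = 3\nbought = 4\nanswer = apples + bought", "gold": 7},
    {"program": "apples = 3\nbought = 4\nanswer = apples * bought", "gold": 7},
    {"program": "answer = total / 0", "gold": 7},
]

filtered = [s for s in samples if keep_sample(s["program"], s["gold"])]
```

Only the first sample survives the filter. A comparable check on natural-language CoT would need a separate verifier model or human review, which is the practical advantage the abstract attributes to using programs.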
