CL, LG | Feb 22, 2024

Divide-or-Conquer? Which Part Should You Distill Your LLM?

Zhuofeng Wu, He Bai, Aonan Zhang, Jiatao Gu, VG Vinod Vydiswaran, Navdeep Jaitly, Yizhe Zhang

Apple

arXiv:2402.15000v3 | 16.826 citations | h-index: 50 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses high inference cost and limited generalization in LLM distillation for reasoning tasks. It offers an incremental improvement by identifying which phase of reasoning, decomposition or solving, is worth distilling into a smaller model.

The paper tackles the challenge of efficiently distilling reasoning capabilities from large language models (LLMs) by splitting each task into a decomposition phase and a solving phase. It finds that a distilled decomposition model generalizes well and reduces inference cost, whereas distilling the solving phase leads to a loss in performance.

Recent methods have demonstrated that Large Language Models (LLMs) can solve reasoning tasks better when they are encouraged to solve subtasks of the main task first. In this paper we devise a similar strategy that breaks down reasoning tasks into a problem decomposition phase and a problem solving phase and show that the strategy is able to outperform a single stage solution. Further, we hypothesize that the decomposition should be easier to distill into a smaller model compared to the problem solving because the latter requires large amounts of domain knowledge while the former only requires learning general problem solving strategies. We propose methods to distill these two capabilities and evaluate their impact on reasoning outcomes and inference cost. We find that we can distill the problem decomposition phase and at the same time achieve good generalization across tasks, datasets, and models. However, it is harder to distill the problem solving capability without losing performance and the resulting distilled model struggles with generalization. These results indicate that by using smaller, distilled problem decomposition models in combination with problem solving LLMs we can achieve reasoning with cost-efficient inference and local adaptation.
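The two-phase strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the function names (`decompose`, `solve`, `two_stage_answer`), and the stub models are all hypothetical, and each model is assumed to be any callable that maps a prompt string to a completion string. In practice the decomposer would be the small distilled model and the solver a larger LLM.

```python
from typing import Callable, List

# Assumption: a "model" is any function from a prompt string to a completion string.
LLM = Callable[[str], str]


def decompose(question: str, decomposer: LLM) -> List[str]:
    """Phase 1: ask the (small, distilled) decomposer for subquestions, one per line."""
    prompt = (
        "Break the question into simpler subquestions, one per line.\n"
        f"Question: {question}"
    )
    raw = decomposer(prompt)
    # Keep non-empty lines only; each line is treated as one subquestion.
    return [line.strip() for line in raw.splitlines() if line.strip()]


def solve(question: str, subquestions: List[str], solver: LLM) -> str:
    """Phase 2: ask the (large) solver to answer, guided by the subquestions."""
    guide = "\n".join(f"{i}. {q}" for i, q in enumerate(subquestions, 1))
    prompt = (
        f"Question: {question}\n"
        f"Subquestions:\n{guide}\n"
        "Answer the subquestions in order, then give the final answer."
    )
    return solver(prompt)


def two_stage_answer(question: str, decomposer: LLM, solver: LLM) -> str:
    """Run decomposition first, then solving."""
    return solve(question, decompose(question, decomposer), solver)


# Hypothetical stand-ins for real models, used only to make the sketch runnable.
def stub_decomposer(prompt: str) -> str:
    return "How many apples did Tom start with?\nHow many apples did Tom eat?\n"


def stub_solver(prompt: str) -> str:
    # Counts the numbered subquestion lines it was given.
    n = sum(1 for line in prompt.splitlines() if line[:1].isdigit())
    return f"answered {n} subquestions"


if __name__ == "__main__":
    question = "Tom has five apples and eats two. How many are left?"
    print(two_stage_answer(question, stub_decomposer, stub_solver))
```

Keeping the two phases behind separate callables mirrors the cost argument in the abstract: only the decomposer needs to be swapped for a smaller model, while the solver can remain a general-purpose LLM.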
