CLLGOct 16, 2023

CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference

arXiv:2310.10845v223 citationsh-index: 12
AI Analysis

This addresses the problem of deploying large language models in compute-constrained environments by making CoT reasoning more efficient.

The paper tackles the computational inefficiency of Chain-of-Thought reasoning in transformers by proposing CoTFormer, an architecture that mimics CoT at the token level to achieve accuracies close to larger models while reducing computation cost by 30-50% through adaptive token-wise depth allocation.

Scaling language models to larger and deeper sizes has led to significant boosts in performance. Even though the size of these models limits their application in compute-constrained environments, the race to continually develop ever larger and deeper foundational models is underway. At the same time -- regardless of the model size -- task-specific techniques continue to play a pivotal role in achieving optimal downstream performance. One of these techniques, called Chain-of-Thought (CoT), is particularly interesting since, as we point out in this work, it resembles employing a deeper transformer through re-applying the model multiple times. However, a key subtlety in computing the attention of past tokens differentiates CoT from simply applying the model several times. Based on this insight, we propose CoTFormer, a novel architecture which closely mimics CoT at the token level, allowing us to obtain significantly improved accuracies close to much larger models. While applying CoT introduces additional computation costs, we compensate for it by leveraging CoTFormer's special compatibility with token-wise variable depth. Through a compute adaptive model -- which automatically allocates the compute to tokens that need it most -- we show that it is possible to reduce the computation cost significantly without any reduction in accuracy, and with further compute cost reductions possible while maintaining a competitive accuracy.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes