TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination
TALE addresses efficiency bottlenecks in LLM deployment for users who need faster, more accurate models, though it builds incrementally on existing pruning techniques.
The paper tackles the problem of inefficient LLM inference by introducing TALE, a task-aware layer elimination algorithm that prunes transformer layers to optimize task-specific performance. The method consistently improves accuracy while reducing computational cost across 9 tasks and 5 models, with no retraining required.
In this paper, we introduce TALE (Task-Aware Layer Elimination), an inference-time algorithm that prunes entire transformer layers in an LLM by directly optimizing task-specific validation performance. We evaluate TALE on 9 tasks and 5 models (LLaMA 3.1 8B, Qwen 2.5 7B, Qwen 2.5 0.5B, Mistral 7B, and Lucie 7B) under both zero-shot and few-shot settings. Unlike prior approaches, TALE requires no retraining and consistently improves accuracy while reducing computational cost across all benchmarks. Furthermore, applying TALE during fine-tuning leads to additional performance gains. Finally, TALE gives users flexible control over the trade-off between accuracy and efficiency. Mutual information analysis shows that certain layers act as bottlenecks, degrading task-relevant representations. TALE's selective layer removal remedies this problem, producing smaller, faster, and more accurate models that are also quicker to fine-tune, while offering new insights into transformer interpretability.
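The abstract does not spell out the search procedure, so the following is only a minimal sketch of one way to prune layers by validation performance, assuming a greedy loop that removes one layer at a time. The names `greedy_layer_elimination`, `evaluate`, `tolerance`, and `toy_evaluate` are hypothetical and are not taken from the paper; `toy_evaluate` stands in for running a real model with the listed layers on a task's validation set.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_layer_elimination(
    num_layers: int,
    evaluate: Callable[[Sequence[int]], float],
    tolerance: float = 0.0,
) -> Tuple[List[int], float]:
    """Greedily remove layers guided by a validation score.

    `evaluate` (hypothetical) receives the indices of the layers that are
    kept and returns a validation score, higher being better. `tolerance`
    (an assumption) is how far the score may fall below the best score
    seen so far before the search stops.
    """
    kept = list(range(num_layers))
    current_score = evaluate(kept)
    best_score = current_score

    while len(kept) > 1:
        # Score every configuration obtained by dropping one remaining layer.
        candidates = [
            (evaluate([layer for layer in kept if layer != drop]), drop)
            for drop in kept
        ]
        score, drop = max(candidates)

        # Stop once even the best single removal hurts too much.
        if score < best_score - tolerance:
            break

        kept.remove(drop)
        current_score = score
        best_score = max(best_score, score)

    return kept, current_score


# Toy stand-in for a real validation run (purely illustrative):
# layers 2 and 5 hurt the task, every other layer helps.
HARMFUL_LAYERS = {2, 5}
TOTAL_LAYERS = 8


def toy_evaluate(kept_layers: Sequence[int]) -> float:
    removed = set(range(TOTAL_LAYERS)) - set(kept_layers)
    harmful_removed = len(removed & HARMFUL_LAYERS)
    useful_removed = len(removed - HARMFUL_LAYERS)
    return (50 + 10 * harmful_removed - 20 * useful_removed) / 100


kept, score = greedy_layer_elimination(TOTAL_LAYERS, toy_evaluate)
print(kept, score)
```

In this toy setup the loop drops the two harmful layers and then stops, because any further removal lowers the score. A positive `tolerance` would let the search continue past that point, which is one possible reading of the accuracy/efficiency control the abstract mentions.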