CL · AI · IT · Feb 27

Task-Centric Acceleration of Small-Language Models

Dor Tsur, Sharon Adar, Ran Levy
arXiv:2602.24174v1
Originality: Synthesis-oriented
AI Analysis

This work addresses efficiency challenges for SLMs in task-specific, high-volume applications, representing an incremental improvement with domain-specific impact.

The paper tackles the problem of accelerating small language models (SLMs) for task-specific applications in high-volume, low-latency settings by proposing TASC, a framework with two methods: TASC-ft for fine-tuning with enriched vocabulary and TASC-spec for inference-time speculative decoding. The results show consistent improvements in inference efficiency while maintaining task performance across multiple low output-variability generation tasks.

Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often employed in high-volume, low-latency settings, where efficiency is crucial. We propose TASC, Task-Adaptive Sequence Compression, a framework for SLM acceleration comprising two use cases. For SLM fine-tuning, we propose TASC-ft, which iteratively enriches the tokenizer vocabulary with high-frequency output n-grams and then fine-tunes the model to utilize the expanded vocabulary. For inference, we propose TASC-spec, a lightweight, training-free speculative decoding method that constructs an n-gram draft model from the task's output corpus, mixing task and context n-gram information. TASC-spec avoids any additional training and bypasses draft-target vocabulary alignment constraints. We demonstrate the effectiveness of both methods across multiple low output-variability generation tasks. Our methods show consistent improvements in inference efficiency while maintaining task performance.
