SpeLLM: Character-Level Multi-Head Decoding
SpeLLM targets two problems that LLM researchers and practitioners face: high computational cost and limited vocabulary support. It is an incremental improvement on existing methods.
The paper tackles a bottleneck in scaling LLM vocabularies: the output projection layer grows linearly with vocabulary size. It proposes SpeLLM, a method that decouples input and output vocabularies using character-level multi-head decoding. The resulting models achieve competitive performance on downstream tasks with a 5.1% average runtime reduction across models.
Scaling LLM vocabulary is often used to reduce input sequence length and alleviate attention's quadratic cost. Yet, current LLM architectures impose a critical bottleneck to this procedure: the output projection layer scales linearly with vocabulary size, rendering substantial expansion impractical. We propose SpeLLM, a method that decouples input and output vocabularies by predicting character-level strings through multiple output heads. In SpeLLM, each of the $k$ linear heads predicts a single character simultaneously, enabling the model to represent a much larger output space using smaller, independent linear heads. We present a self-distillation approach for converting a standard LLM to a SpeLLM. Our experiments with four pre-trained LLMs show their SpeLLM variants achieve competitive performance on downstream tasks while reducing runtime by 5.1% on average across models. Our approach provides a potential avenue for reducing LLM costs, while increasing support for underrepresented languages and domains.