CL AIMay 1

Component-Aware Self-Speculative Decoding in Hybrid Language Models

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

arXiv:2605.0110610.4

AI Analysis

For researchers and practitioners deploying hybrid LLMs, this work identifies which architectural patterns enable efficient self-speculative decoding, but the findings are incremental as they primarily analyze existing models rather than proposing a new method.

The paper introduces component-aware self-speculative decoding for hybrid language models, exploiting architectural heterogeneity to use SSM/linear-attention subgraphs as zero-cost internal drafts. It finds that parallel hybrids (e.g., Falcon-H1) achieve acceptance rates of 0.68 at draft length k=2, while sequential hybrids (e.g., Qwen3.5) achieve only 0.038, an 18x gap, and that perplexity degradation predicts speculative viability.

Speculative decoding accelerates autoregressive inference by drafting candidate tokens with a fast model and verifying them in parallel with the target. Self-speculative methods avoid the need for an external drafter but have been studied exclusively in homogeneous Transformer architectures. We introduce component-aware self-speculative decoding, the first method to exploit the internal architectural heterogeneity of hybrid language models, isolating the SSM/linear-attention subgraph as a zero-cost internal draft. We evaluate this on two architecturally distinct hybrid families: Falcon-H1 (parallel: Mamba-2 + attention per layer) and Qwen3.5 (sequential: interleaved linear and attention layers), with a pure Transformer control (Qwen2.5). Parallel hybrids achieve acceptance rates of alpha = 0.68 at draft length k=2 under greedy decoding, while sequential hybrids yield only alpha = 0.038 -- an 18x gap attributable to how each architecture integrates its components. The property is scale-invariant: Falcon-H1 at 3B reproduces the rates observed at 0.5B. We further show that perplexity degradation from a companion ablation study predicts speculative viability without running speculative decoding: a 3.15x ratio (Falcon) maps to alpha = 0.37 at k=4, while 81.96x (Qwen) maps to alpha = 0.019. For sequential hybrids, generic LayerSkip achieves 12x higher acceptance rates than the component-aware strategy. The composition pattern of hybrid models -- not merely the presence of alternative components -- determines whether component-level self-speculation is viable.

View on arXiv PDF

Similar