DND: Boosting Large Language Models with Dynamic Nested Depth
This work addresses inefficient token processing in LLMs. It is aimed at AI researchers and practitioners and offers an incremental improvement to existing models.
The paper aims to improve off-the-shelf large language models (LLMs) with Dynamic Nested Depth (DND), a method that selects critical tokens and reprocesses them to enhance performance. Reported results show gains of 1.88% for Qwen3-1.7B and 0.87% for Qwen3-30B-A3B on diverse benchmarks, with minimal increases in parameters and computation.
We introduce Dynamic Nested Depth (DND), a novel method that improves the performance of off-the-shelf LLMs by selecting critical tokens to reprocess in a nested-depth manner. Specifically, at the end of a given transformer layer, DND uses a router to identify the more critical tokens and feeds them back for an extra round of processing, effectively "reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router-controlling loss that makes token selection more distinguishable, and a threshold control scheme that keeps selection stable. We demonstrate the effectiveness of DND by integrating it directly into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal increase in parameters and computation.
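To make the select-and-reprocess idea concrete, the following is a minimal sketch and not the authors' implementation. It makes several assumptions the abstract does not state: the router is a sigmoid over a linear score, selection uses a fixed threshold, reprocessed outputs simply replace the first-pass outputs, and the layer is a toy token-wise map with no attention. The names `toy_layer`, `dnd_layer`, `router_weights`, and `threshold` are hypothetical, and the router-controlling loss and threshold control scheme are not modeled.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Sequence[Vector]], List[Vector]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def toy_layer(tokens: Sequence[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a transformer layer (token-wise, no attention)."""
    return [[0.5 * v + 1.0 for v in tok] for tok in tokens]


def dnd_layer(
    tokens: Sequence[Vector],
    layer: Layer,
    router_weights: Vector,
    threshold: float = 0.5,
) -> Tuple[List[Vector], List[int]]:
    """Run `layer` once, then run it again only on router-selected tokens."""
    # First pass: every token goes through the layer.
    hidden = layer(tokens)

    # Router (assumed form): sigmoid of a dot product with the layer output.
    scores = [
        sigmoid(sum(w * v for w, v in zip(router_weights, tok)))
        for tok in hidden
    ]

    # Tokens scoring above the threshold are treated as "critical".
    selected = [i for i, s in enumerate(scores) if s > threshold]

    # Second pass: only the selected tokens are fed back through the layer.
    if selected:
        reprocessed = layer([hidden[i] for i in selected])
        # Assumed merge rule: overwrite the first-pass output.
        for i, new_tok in zip(selected, reprocessed):
            hidden[i] = new_tok

    return hidden, selected
```

For example, calling `dnd_layer([[4.0, 4.0], [-4.0, -4.0]], toy_layer, [1.0, 1.0])` routes only the first token through the second pass, so the second token costs a single layer evaluation. In a real model the second pass would involve attention over some context and a learned merge, which this sketch leaves out.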