Accelerating Large Language Model Inference via Early-Exiting Algorithms
This work addresses the practical challenge of deploying large language models at reduced computational cost. Its contribution is incremental in that it builds on existing early-exiting methods.
The dissertation tackles the computational inefficiency of large language models in batched inference by co-designing adaptive algorithms and model architectures. It reports a new Pareto frontier between efficiency and performance, reached through parallel decoding, deep parameter sharing, and pretrained routers that assign a recursion depth to each token.
Large language models have achieved remarkable capabilities, but their practical deployment is hindered by significant computational costs. While adaptive computation methods like early-exiting promise to reduce these costs, they introduce a fundamental conflict: the per-token dynamism intended to save computation often creates system-level bottlenecks that can paradoxically reduce throughput in batched inference. This dissertation resolves this conflict by co-designing adaptive algorithms and model architectures to strike an optimal balance between dynamism and efficiency. To this end, our work first addresses critical sources of overhead in conventional early-exiting by proposing an efficient parallel decoding mechanism. We then show that deep parameter sharing provides an architectural foundation that not only yields compact, parameter-efficient models but also inherently mitigates the critical synchronization issues affecting dynamic inference. Finally, this work presents a unified framework where lightweight routers are pretrained to dynamically assign an optimal recursion depth for each token. This approach establishes a new Pareto frontier between efficiency and performance by effectively optimizing for both adaptive computation and parameter efficiency within a single model.
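To make the tension between per-token dynamism and batched throughput concrete, the following is a minimal toy sketch, not the dissertation's implementation. Every name in it (shared_block, toy_router, recursive_forward, and the two cost functions) is hypothetical, the "hidden state" is a single float rather than a vector, and the cost model simply counts block evaluations. It illustrates one reading of the abstract: a router picks a recursion depth per token, the same weight-shared block is reapplied at every step, and a batch that must wait for its deepest token pays for the maximum depth rather than the sum of the individual depths.

```python
from typing import List, Tuple


def shared_block(x: float) -> float:
    """Stand-in for one application of a weight-shared block (hypothetical)."""
    return 0.5 * x + 1.0


def toy_router(x: float, max_depth: int) -> int:
    """Hypothetical router: larger-magnitude inputs receive more recursion steps.

    A real router would be a small learned module; this rule is only an
    assumption chosen so that the example is deterministic.
    """
    return max(1, min(max_depth, int(abs(x)) + 1))


def recursive_forward(tokens: List[float], max_depth: int) -> Tuple[List[float], List[int]]:
    """Apply the same block repeatedly, stopping each token at its assigned depth."""
    depths = [toy_router(x, max_depth) for x in tokens]
    states = list(tokens)
    for step in range(max(depths)):
        # The block has the same parameters at every step, so all tokens that
        # are still active can be processed together regardless of their depth.
        states = [shared_block(s) if step < d else s for s, d in zip(states, depths)]
    return states, depths


def block_evals_adaptive(depths: List[int]) -> int:
    """Block evaluations if each token pays only for its own depth."""
    return sum(depths)


def block_evals_lockstep(depths: List[int]) -> int:
    """Block evaluations if every token in the batch runs to the deepest token's depth."""
    return max(depths) * len(depths)


if __name__ == "__main__":
    states, depths = recursive_forward([0.2, 1.5, 3.7, 0.9], max_depth=4)
    print("depths:", depths)
    print("adaptive evals:", block_evals_adaptive(depths))
    print("lockstep evals:", block_evals_lockstep(depths))
```

For the toy inputs above, the assigned depths are [1, 2, 4, 1], which gives 8 block evaluations under the adaptive count against 16 when the whole batch runs in lockstep to depth 4. The gap between those two numbers is the saving that per-token exiting promises and that naive batching can forfeit. The counts say nothing about real hardware throughput, which also depends on memory traffic and kernel launch overheads not modeled here.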