GR, CV, LG | Aug 18, 2025

MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration

arXiv:2508.12691v1 | 2 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses inference efficiency for video generation models, offering a flexible optimization that balances speed and quality, though it is incremental as it builds on existing caching methods.

The paper tackles the high computational cost and latency in video diffusion transformer models by proposing MixCache, a training-free caching framework that dynamically selects caching granularities, achieving up to 1.97x speedup while maintaining or improving generation quality.

Leveraging the Transformer architecture and the diffusion process, video DiT models have emerged as a dominant approach for high-quality video generation. However, their multi-step iterative denoising process incurs high computational cost and inference latency. Caching, a widely adopted optimization method in DiT models, leverages the redundancy in the diffusion process to skip computations at different granularities (e.g., step, cfg, block). Nevertheless, existing caching methods are limited to single-granularity strategies and struggle to balance generation quality and inference speed in a flexible manner. In this work, we propose MixCache, a training-free caching-based framework for efficient video DiT inference. It first distinguishes the interference and boundary between different caching strategies, and then introduces a context-aware cache triggering strategy to determine when caching should be enabled, along with an adaptive hybrid cache decision strategy for dynamically selecting the optimal caching granularity. Extensive experiments on diverse models demonstrate that MixCache can significantly accelerate video generation (e.g., 1.94$\times$ speedup on Wan 14B, 1.97$\times$ speedup on HunyuanVideo) while delivering both superior generation quality and inference efficiency compared to baseline methods.
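To make the idea of mixing caching granularities concrete, the following is a minimal toy sketch in Python. It is not the MixCache algorithm: the abstract does not specify the triggering or decision rules, so the class `ToyCachedModel`, the thresholds `step_tol` and `block_tol`, the `warmup` count, and the rule of reusing odd-indexed block residuals are all hypothetical choices made for illustration. The sketch covers only step-level and block-level reuse; cfg-level caching is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Vector = List[float]


def rel_change(a: Vector, b: Vector) -> float:
    """Relative L1 change of vector a with respect to reference b."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(abs(y) for y in b) or 1.0
    return num / den


@dataclass
class ToyCachedModel:
    """Toy stack of residual blocks with two cache granularities (hypothetical)."""
    blocks: List[Callable[[Vector], Vector]]
    step_tol: float = 0.02   # assumed: below this, reuse the whole previous output
    block_tol: float = 0.10  # assumed: below this, reuse some block residuals
    warmup: int = 2          # assumed: never cache during the first steps
    ref_input: Optional[Vector] = None    # input of the last (partly) computed step
    prev_output: Optional[Vector] = None
    block_cache: Dict[int, Vector] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def choose_mode(self, x: Vector, step: int) -> str:
        # Trigger: caching is disabled during warmup or without a reference.
        if step < self.warmup or self.ref_input is None:
            return "none"
        # Decision: smaller input change permits coarser reuse.
        change = rel_change(x, self.ref_input)
        if change < self.step_tol:
            return "step"
        if change < self.block_tol:
            return "block"
        return "none"

    def __call__(self, x: Vector, step: int) -> Vector:
        mode = self.choose_mode(x, step)
        self.log.append(mode)
        if mode == "step":
            # Skip the whole forward pass and reuse the previous output.
            return list(self.prev_output)
        h = list(x)
        for i, block in enumerate(self.blocks):
            if mode == "block" and i % 2 == 1 and i in self.block_cache:
                residual = self.block_cache[i]   # reuse cached residual
            else:
                residual = block(h)              # recompute and refresh cache
                self.block_cache[i] = residual
            h = [u + v for u, v in zip(h, residual)]
        self.ref_input = list(x)
        self.prev_output = h
        return list(h)


# Usage: four identical toy blocks and a hand-picked input sequence.
model = ToyCachedModel(blocks=[lambda h: [0.1 * v for v in h]] * 4)
inputs = [[1.0, 1.0], [1.0, 1.0], [1.01, 1.0], [1.1, 1.0], [2.0, 1.0]]
outputs = [model(x, step) for step, x in enumerate(inputs)]
print(model.log)
```

In this toy, the reference input is refreshed only when something is actually computed, so a long run of small changes cannot keep reusing a stale output indefinitely. A real system would have to choose its own change metric and thresholds per model.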

