CV · AI · Jan 25

VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

arXiv:2601.17868v1 · 1 citation · Has Code
Originality: Incremental advance
AI Analysis

The work addresses efficiency bottlenecks in video understanding, though it appears incremental because it builds on existing diffusion and attention mechanisms.

The paper targets the causal masking bias in autoregressive video LLMs, which hinders global spatiotemporal modeling. It proposes VidLaDA, a diffusion-based video LLM with bidirectional attention, and MARS-Cache, a caching framework that accelerates inference. MARS-Cache delivers over 12x speedup, and VidLaDA outperforms diffusion baselines while rivaling state-of-the-art autoregressive models.

Standard Autoregressive Video LLMs inevitably suffer from causal masking biases that hinder global spatiotemporal modeling, leading to suboptimal understanding efficiency. We propose VidLaDA, a Video LLM based on Diffusion Language Model utilizing bidirectional attention to capture bidirectional dependencies. To further tackle the inference bottleneck of diffusion decoding on massive video tokens, we introduce MARS-Cache. This framework accelerates inference by combining asynchronous visual cache refreshing with frame-wise chunk attention, effectively pruning redundancy while preserving global connectivity via anchor tokens. Extensive experiments show VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models (e.g., Qwen2.5-VL and LLaVA-Video), with MARS-Cache delivering over 12x speedup without compromising reasoning accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.
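
To make the attention patterns named in the abstract concrete, here is a minimal sketch in plain Python. It contrasts a causal mask with one possible reading of "frame-wise chunk attention" plus anchor tokens. The function names, the choice of the first token in each frame as the anchor, and the rule that anchors attend to and are attended by every token are all assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
def causal_mask(n):
    """mask[i][j] is True when query token i may attend to key token j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def chunk_mask_with_anchors(num_frames, tokens_per_frame, anchors_per_frame=1):
    """Bidirectional frame-wise chunk mask (illustrative assumption).

    Tokens attend freely, in both directions, within their own frame.
    The first `anchors_per_frame` tokens of each frame are treated as
    anchors: every token may attend to every anchor, and anchors may
    attend to every token, which keeps frames globally connected.
    """
    n = num_frames * tokens_per_frame
    frame = [i // tokens_per_frame for i in range(n)]
    is_anchor = [i % tokens_per_frame < anchors_per_frame for i in range(n)]
    return [
        [frame[i] == frame[j] or is_anchor[i] or is_anchor[j] for j in range(n)]
        for i in range(n)
    ]


# Toy example: 4 frames of 4 tokens each (hypothetical sizes).
mask = chunk_mask_with_anchors(num_frames=4, tokens_per_frame=4)
allowed_pairs = sum(sum(row) for row in mask)
```

In this toy layout, token 1 can attend to token 2 (same frame) and to token 4 (the anchor of the next frame), but not to token 5 (a non-anchor token in another frame). Unlike the causal mask, the chunk mask is symmetric, so information can flow from later frames to earlier ones.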

