Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
VPS addresses a practical problem for researchers and practitioners in video AI: it offers a memory-efficient and robust framework for improving temporal reasoning in VideoLLMs. The contribution is incremental, however, since it builds on existing parallel inference methods.
The paper tackles a bottleneck in Video Large Language Models (VideoLLMs): feeding in more frames to capture temporal detail raises computational cost and degrades performance as the context grows. It introduces Video Parallel Scaling (VPS), an inference-time method that aggregates the outputs of parallel streams, each processing a disjoint subset of the video's frames. The authors report consistent and significant improvements across models from 2B to 32B parameters on benchmarks such as Video-MME and EventHallusion.
Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model's perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
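The two steps the abstract describes, splitting frames into disjoint subsets and aggregating the per-stream output probabilities, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the interleaved strided sampling is an assumed way of producing disjoint subsets, and the equal-weight average is an assumed aggregation rule. The per-stream logits would come from separate VideoLLM forward passes, which are replaced here by plain lists.

```python
import math
from typing import List, Sequence


def disjoint_strided_subsets(num_frames: int, num_streams: int,
                             frames_per_stream: int) -> List[List[int]]:
    """Split a video into disjoint frame-index subsets, one per stream.

    Assumption: frames are sampled uniformly over the video and dealt out
    to the streams in an interleaved fashion, so every stream still spans
    the whole clip.
    """
    total = num_streams * frames_per_stream
    if num_frames < total:
        raise ValueError("not enough frames for disjoint subsets")
    # Uniformly spaced indices; distinct because num_frames >= total.
    indices = [(i * num_frames) // total for i in range(total)]
    # Stream j takes every num_streams-th index starting at offset j.
    return [indices[j::num_streams] for j in range(num_streams)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one stream's next-token logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def aggregate_stream_probs(stream_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average next-token probabilities across streams (assumed equal weights)."""
    probs = [softmax(logits) for logits in stream_logits]
    num_streams = len(probs)
    vocab_size = len(probs[0])
    return [sum(p[v] for p in probs) / num_streams for v in range(vocab_size)]


if __name__ == "__main__":
    # Four streams of eight frames each over a 64-frame video.
    subsets = disjoint_strided_subsets(num_frames=64, num_streams=4,
                                       frames_per_stream=8)
    print(subsets[0])

    # Toy logits over a two-token vocabulary from two streams.
    print(aggregate_stream_probs([[2.0, 0.0], [0.0, 0.0]]))
```

In this sketch each stream sees the same number of frames as a single pass would, so the context length per stream is unchanged while the aggregate covers several times as many frames. In an autoregressive decoder the averaging would be repeated at every generated token.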