Continuous Perception Matters: Diagnosing Temporal Integration Failures in Multimodal Models
The paper's results reveal a fundamental limitation in modern multimodal architectures, one that affects their robustness in real-world applications such as autonomous systems.
The authors address the inability of multimodal models to integrate visual information over time by introducing CP-Bench, a benchmark for continuous perception, and find that state-of-the-art models such as Qwen-3-VL and GPT-5 fail dramatically on its simple cube-counting task.
Continuous perception, the ability to integrate visual observations over time from a continuous stream, is essential for robust real-world understanding, yet remains largely untested in current multimodal models. We introduce CP-Bench, a minimal and fully controlled benchmark designed to isolate this capability using an extremely simple task: counting identical cubes in a synthetic scene while the camera moves and only reveals subsets of objects at any moment. Despite the simplicity of the setting, we find that state-of-the-art open-source and commercial models, including Qwen-3-VL, InternVL3, GPT-5, and Gemini-3-Pro, fail dramatically. A static-camera control variant confirms that the failure arises not from object recognition but from an inability to accumulate evidence across time. Further experiments show that higher sampling FPS, perception- or spatial-enhanced models, and finetuning with additional videos all fail to produce meaningful cross-temporal generalization. Our results reveal a fundamental limitation in modern multimodal architectures and training paradigms. CP-Bench provides a simple yet powerful diagnostic tool and establishes a clean testbed for developing models capable of genuine time-consistent visual reasoning.
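To make the counting setup concrete, here is a minimal toy sketch of why a moving camera forces evidence to be accumulated across time. It is not the CP-Bench generator or the authors' evaluation code: the 1D scene, the names `visible_ids` and `simulate`, and all positions and camera parameters are hypothetical choices for illustration. It also assumes ground-truth cube identities, which a model watching identical cubes does not receive and would have to infer from camera motion.

```python
from typing import List, Set, Tuple


def visible_ids(cube_positions: List[float], cam_center: float,
                half_width: float) -> Set[int]:
    """Indices of cubes inside the camera's field of view at one moment."""
    return {i for i, x in enumerate(cube_positions)
            if abs(x - cam_center) <= half_width}


def simulate(cube_positions: List[float], cam_centers: List[float],
             half_width: float) -> Tuple[List[int], int]:
    """Return the per-frame visible counts and the count accumulated over time."""
    frames = [visible_ids(cube_positions, c, half_width) for c in cam_centers]
    per_frame_counts = [len(f) for f in frames]
    seen: Set[int] = set()
    for f in frames:
        seen |= f  # integrate evidence across frames (hypothetical oracle identities)
    return per_frame_counts, len(seen)


# Hypothetical scene: six cubes on a line, camera panning left to right.
cube_positions = [0.0, 1.0, 2.0, 5.0, 6.0, 9.0]
cam_centers = [1.0, 3.0, 5.0, 7.0, 9.0]
half_width = 1.5

per_frame_counts, accumulated = simulate(cube_positions, cam_centers, half_width)
print("per-frame counts:", per_frame_counts)
print("max over frames :", max(per_frame_counts))  # undercounts the scene
print("sum over frames :", sum(per_frame_counts))  # double-counts re-seen cubes
print("accumulated     :", accumulated)            # matches the true total
```

In this toy scene no single frame shows every cube, so the largest per-frame count is too low, while summing the per-frame counts is too high because some cubes stay in view across consecutive frames. Only a count that tracks which cubes have already been seen recovers the true total, which is the capability the moving-camera setting is designed to probe.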