CV · Apr 28

Report of the 5th PVUW Challenge: Towards More Diverse Modalities in Pixel-Level Understanding

Chang Liu, Henghui Ding, Nikhila Ravi, Yunchao Wei, Shuting He, Song Bai, Philip Torr, Leilei Cao, Jinrong Zhang, Deshui Miao, Xusheng He, Dengxian Gong

arXiv:2604.26031 · 92.9

Predicted impact: top 12% in CV · last 90 days · Originality: Synthesis-oriented

AI Analysis

For researchers in video understanding, this challenge provides benchmarks and datasets for multimodal pixel-level tasks under unconstrained conditions.

The 2026 PVUW Challenge introduced three tracks for pixel-level video understanding, including a new audio-driven segmentation track, and analyzed top-performing multimodal solutions to advance robust video scene comprehension.

This report summarizes the objectives, datasets, and top-performing methodologies of the 2026 Pixel-level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state-of-the-art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS-Text track for localizing targets via motion-focused linguistic expressions; and the newly inaugurated MeViS-Audio track, which pioneers acoustic-driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting-edge, multimodal solutions submitted by participants, this report highlights the community's latest technical advancements and charts promising future directions for robust video scene comprehension.
