SD · Apr 20

APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

arXiv:2604.18665 · 90.8 · h-index: 6
Predicted impact: top 6% in SD · last 90 days · Originality: Synthesis-oriented
AI Analysis

For researchers working on video segmentation with audio queries, this work provides a practical pipeline that handles noisy ASR outputs and absent targets, but the approach is incremental, combining existing components (ASR, Omni, Sa2VA, SAM3) in a staged manner.

The authors propose a pipeline for Audio-aware Referring Video Object Segmentation (Ref-VOS) that adds speech transcription and visual existence verification stages to a standard Sa2VA-based system, achieving 1st place in the MeViS-Audio track. The pipeline includes an agentic refinement layer that improves segmentation quality by evaluating query reliability and temporal consistency.

This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the proposed system introduces two additional front-end stages: speech transcription and visual existence verification. Specifically, we first employ VibeVoice-ASR to convert long-form spoken input into a structured textual transcript. Since audio-derived queries are inherently noisy and may describe entities that are not visually present in the video, we then introduce an Omni-based judgment module to determine whether the transcribed target can be grounded in the visual content. If the target is judged to be absent, the pipeline terminates early and outputs all-zero masks. Otherwise, the transcript is transformed into a segmentation-oriented prompt and fed into Sa2VA to obtain a coarse mask trajectory over the full video. Importantly, this trajectory is treated as an initial semantic hypothesis rather than a final prediction. On top of it, an agentic refinement layer evaluates query reliability, temporal relevance, anchor quality, and potential error sources, and may invoke SAM3 to improve spatial boundary precision and temporal consistency. The resulting framework explicitly decomposes the MEVIS_Audio task into audio-to-text conversion, visual existence verification, coarse video segmentation, and agent-guided refinement. Such a staged design is substantially more appropriate for audio-conditioned Ref-VOS than directly sending noisy ASR outputs into a segmentation model.
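
The control flow described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `Stages` container, every callable in it, the `build_prompt` template, and the mask representation are hypothetical stand-ins for VibeVoice-ASR, the Omni-based judge, Sa2VA, the agentic layer, and SAM3, whose real interfaces are not given in the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Mask = List[List[int]]      # one binary H x W mask per frame (assumed representation)
Frames = Sequence[object]   # opaque frame objects; placeholder type


@dataclass
class Stages:
    """Hypothetical stage callables; none of these are real library APIs."""
    transcribe: Callable[[bytes], str]                   # stands in for VibeVoice-ASR
    target_exists: Callable[[Frames, str], bool]         # stands in for the Omni judge
    coarse_segment: Callable[[Frames, str], List[Mask]]  # stands in for Sa2VA
    needs_refinement: Callable[[str, List[Mask]], bool]  # stands in for the agent's decision
    refine: Callable[[Frames, List[Mask]], List[Mask]]   # stands in for SAM3


def zero_masks(num_frames: int, height: int, width: int) -> List[Mask]:
    """All-zero masks, returned when the target is judged absent."""
    return [[[0] * width for _ in range(height)] for _ in range(num_frames)]


def build_prompt(transcript: str) -> str:
    """Turn a transcript into a segmentation prompt (assumed template)."""
    return f"Please segment {transcript.strip()}."


def run_pipeline(audio: bytes, frames: Frames, height: int, width: int,
                 stages: Stages) -> List[Mask]:
    # Stage 1: audio-to-text conversion.
    transcript = stages.transcribe(audio)

    # Stage 2: visual existence verification with early termination.
    if not stages.target_exists(frames, transcript):
        return zero_masks(len(frames), height, width)

    # Stage 3: coarse mask trajectory, treated as an initial hypothesis.
    coarse = stages.coarse_segment(frames, build_prompt(transcript))

    # Stage 4: agent-guided refinement, invoked only when the agent asks for it.
    if stages.needs_refinement(transcript, coarse):
        return stages.refine(frames, coarse)
    return coarse
```

Passing the stages in as callables keeps the sketch independent of any particular model API: each one can be swapped for a real wrapper or a stub, and the early-exit branch can be exercised without loading a segmentation model.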

