Re-Prompting SAM 3 via Object Retrieval: 3rd of the 5th PVUW MOSE Track
This work is an incremental improvement in video object segmentation that addresses specific challenges of the MOSEv2 benchmark.
The paper tackles complex semi-supervised video object segmentation by developing an automatic re-prompting framework based on SAM 3 to improve robustness against target disappearance, reappearance, and distractors, achieving a J&F score of 51.17% and ranking 3rd in the MOSEv2 track.
This technical report explores the MOSEv2 track of the PVUW 2026 Challenge, which targets complex semi-supervised video object segmentation. Built on SAM 3, we develop an automatic re-prompting framework to improve robustness under target disappearance and reappearance, severe transformation, and strong same-category distractors. Our method first applies the SAM 3 detector to later frames to identify same-category object candidates, and then performs DINOv3-based object-level matching against a transformation-aware target feature pool to retrieve reliable target anchors. These anchors are injected back into the SAM 3 tracker together with the first-frame mask, enabling multi-anchor propagation rather than relying solely on the initial prompt. This simple design directly addresses several core challenges of MOSEv2. Our solution achieves a J&F of 51.17% on the test set, ranking 3rd in the MOSEv2 track.
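The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each detected candidate and each entry of the target feature pool has already been reduced to a single embedding vector (for example, a pooled DINOv3 feature), and it uses cosine similarity with a fixed acceptance threshold. The function names `l2_normalize` and `retrieve_anchor`, the max-over-pool scoring rule, and the threshold value `0.6` are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit length (eps guards against zero vectors)."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_anchor(candidate_feats, pool_feats, threshold=0.6):
    """Pick the candidate that best matches any entry of the target feature pool.

    candidate_feats: (N, D) embeddings of same-category detections in one frame.
    pool_feats:      (M, D) embeddings of the target under different appearances.
    Returns (index, score); index is None when no candidate reaches the
    (assumed) similarity threshold, i.e. the frame yields no anchor.
    """
    candidate_feats = np.asarray(candidate_feats, dtype=np.float64)
    if candidate_feats.shape[0] == 0:
        return None, 0.0

    cands = l2_normalize(candidate_feats)   # (N, D)
    pool = l2_normalize(pool_feats)         # (M, D)
    sim = cands @ pool.T                    # (N, M) cosine similarities

    # Score each candidate by its closest pool entry, so a match to any
    # stored appearance of the target counts.
    scores = sim.max(axis=1)
    best = int(np.argmax(scores))
    best_score = float(scores[best])
    if best_score < threshold:
        return None, best_score
    return best, best_score


# Toy example: two pool entries, two candidates (the second is a distractor).
pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidates = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
anchor_idx, anchor_score = retrieve_anchor(candidates, pool)
```

In this sketch, a frame whose best candidate falls below the threshold contributes no anchor, which mirrors the idea of re-prompting the tracker only with reliable matches; how the actual pool is built and how anchors are selected across frames is not specified here.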