CV · LG · Apr 16

CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

Inseok Jeon, Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Minseok Kang, Jungho Lee, Chaewon Park, Donghyeong Kim, Sangyoun Lee

arXiv:2604.14630 · 33.5 · h-index: 11

Predicted impact: top 83% in CV · last 90 days

Originality: Incremental advance

AI Analysis

For researchers in video object segmentation, this work provides a novel method that effectively integrates appearance and motion cues, outperforming existing approaches.

The paper introduces cross-modality token modulation for unsupervised video object segmentation, achieving state-of-the-art performance across all public benchmarks.

Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross-modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state-of-the-art performance across all public benchmarks, outperforming existing methods.
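The abstract's core idea, dense token connections across two modalities combined with token masking, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the relation transformer block, so the code below assumes a single-head dot-product self-attention with identity projections over the concatenated appearance and motion tokens, and assumes masking means randomly dropping tokens before attention. All function names (`joint_attention`, `mask_tokens`, `cross_modal_block`) and the `keep_ratio` parameter are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (assumption).

    Every token attends to every other token in the joint sequence, so
    information flows both within a modality and across modalities.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def mask_tokens(tokens, keep_ratio, rng):
    """Keep a random subset of tokens (at least one), preserving order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept], kept


def cross_modal_block(appearance, motion, keep_ratio=1.0, seed=0):
    """Mask each modality, then mix the surviving tokens jointly."""
    rng = random.Random(seed)
    app, app_idx = mask_tokens(appearance, keep_ratio, rng)
    mot, mot_idx = mask_tokens(motion, keep_ratio, rng)
    mixed = joint_attention(app + mot)  # dense intra- and inter-modal links
    return mixed[:len(app)], mixed[len(app):], app_idx, mot_idx


# Toy input: four 2-D tokens per modality (made-up values).
appearance = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
motion = [[0.0, 2.0], [2.0, 0.0], [1.0, -1.0], [-1.0, 1.0]]
app_out, mot_out, app_idx, mot_idx = cross_modal_block(
    appearance, motion, keep_ratio=0.5)
```

With `keep_ratio=0.5`, each modality keeps two of its four tokens, so attention runs over four tokens instead of eight. A real model would use learned projections, multiple heads, and stacked blocks; the sketch only shows how one joint attention pass lets each appearance token draw on motion tokens and vice versa.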
