CV · Oct 19, 2024

Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding

Yi Liu, Chengxin Li, Shoukun Xu, Jungong Han

arXiv:2410.14944v16.543 citations · h-index: 8 · Has Code · Int J Comput Vis

Originality: Incremental advance

AI Analysis

This work addresses multi-modal scene understanding for applications such as autonomous driving. It is rated incremental because it builds on existing fusion methods, although its routing approach is novel.

The paper tackles the challenge of multi-modal fusion for scene understanding by proposing a Part-Whole Relational Fusion (PWRF) framework, which uses Capsule Networks to route part-level modalities to a whole-level modality, achieving superior performance on tasks like multi-modal segmentation and salient object detection across several datasets.

Multi-modal fusion has played a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion, which is essential for real-world applications like autonomous driving, where visible, depth, event, LiDAR, and other modalities are used. Moreover, the few existing attempts at multi-modal fusion, e.g., simple concatenation, cross-modal attention, and token selection, fail to fully exploit the intrinsic shared and specific details of multiple modalities. To tackle this challenge, we propose a Part-Whole Relational Fusion (PWRF) framework. For the first time, this framework treats multi-modal fusion as part-whole relational fusion. It routes multiple individual part-level modalities to a fused whole-level modality using the part-whole relational routing ability of Capsule Networks (CapsNets). Through this part-whole routing, PWRF generates modal-shared and modal-specific semantics from the whole-level modal capsules and the routing coefficients, respectively. These modal-shared and modal-specific details can then be employed for multi-modal scene understanding, which in this paper includes synthetic multi-modal segmentation and visible-depth-thermal salient object detection. Experiments on several datasets demonstrate the superiority of the proposed PWRF framework for multi-modal scene understanding. The source code has been released at https://github.com/liuyi1989/PWRF.
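
To make the part-whole routing idea concrete, here is a minimal NumPy sketch of generic routing-by-agreement between capsules. It is not taken from the PWRF code and the paper's actual routing may differ; the function names `squash` and `route_parts_to_whole`, the vote tensor layout, and the three-iteration default are all assumptions made for illustration. Each part-level capsule (for example, one per modality) casts a vote for every whole-level capsule; the routing returns the whole-level capsules and the routing coefficients, the two quantities the abstract associates with modal-shared and modal-specific semantics.

```python
import numpy as np


def squash(s, axis=-1, eps=1e-8):
    """Capsule squashing: keeps the direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def route_parts_to_whole(votes, num_iters=3):
    """Generic routing-by-agreement (illustrative, not the PWRF routing).

    votes: array of shape (num_parts, num_wholes, dim), where votes[p, w]
           is the prediction of part capsule p for whole capsule w.
    Returns (wholes, coeffs) with shapes (num_wholes, dim) and
    (num_parts, num_wholes).
    """
    num_parts, num_wholes, _ = votes.shape
    logits = np.zeros((num_parts, num_wholes))
    for _ in range(num_iters):
        # Each part distributes its vote mass over the whole capsules.
        shifted = logits - logits.max(axis=1, keepdims=True)
        coeffs = np.exp(shifted)
        coeffs /= coeffs.sum(axis=1, keepdims=True)
        # Whole capsules are the squashed, coefficient-weighted vote sums.
        wholes = squash(np.einsum("pw,pwd->wd", coeffs, votes))
        # Parts whose votes agree with a whole capsule get routed to it more.
        logits = logits + np.einsum("pwd,wd->pw", votes, wholes)
    return wholes, coeffs
```

Under these assumptions, calling `route_parts_to_whole` on a random vote tensor of shape `(3, 4, 8)` (three hypothetical part-level modalities, four whole-level capsules of dimension 8) yields whole capsules of shape `(4, 8)` and a `(3, 4)` coefficient matrix whose rows each sum to one.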

