CV | AI | LG | Jul 15, 2025

MMOne: Representing Multiple Modalities in One Scene

arXiv:2507.11129v2 | 2 citations | h-index: 1 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of representing multiple modalities in scenes for applications in AI and robotics, though it appears incremental as it builds on existing multimodal representation methods.

The paper tackles modality conflicts in multimodal scene representation. It proposes MMOne, a framework that combines a modality modeling module with a multimodal decomposition mechanism to disentangle shared and modality-specific components. The reported experiments show improved representation capability for each modality.

Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multi-modal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code is available at https://github.com/Neal2020GitHub/MMOne.

Code Implementations: 1 repo