CVJan 1

OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning

arXiv:2601.00352v1h-index: 11
Originality Incremental advance
AI Analysis

This addresses domain generalization for embodied agents using multimodal sensors, but it is incremental as it builds on existing visual-tactile learning methods.

The paper tackles the problem of modality discrepancies and domain gaps in visual-tactile learning by proposing the OmniVaT framework for single domain generalization, achieving superior cross-domain generalization performance as demonstrated in experiments.

Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes