Synesthesia of Vehicles: Tactile Data Synthesis from Visual Inputs
This work addresses a critical safety issue for autonomous vehicles by enabling proactive tactile perception from visual data, offering a new approach to a known bottleneck in multi-modal fusion.
The paper addresses autonomous vehicles' lack of tactile perception for road-induced excitations. It proposes Synesthesia of Vehicles (SoV), a framework that predicts tactile excitations from visual inputs by combining a cross-modal spatiotemporal alignment method with a visual-tactile synesthetic generative model based on latent diffusion. In the reported experiments, the model outperforms existing models in temporal, frequency, and classification performance.
Autonomous vehicles (AVs) rely on multi-modal fusion for safety, but current visual and optical sensors fail to detect road-induced excitations, which are critical for a vehicle's dynamic control. Inspired by human synesthesia, we propose Synesthesia of Vehicles (SoV), a novel framework that predicts tactile excitations from visual inputs. We develop a cross-modal spatiotemporal alignment method to address the temporal and spatial disparities between the two modalities. Furthermore, we propose a visual-tactile synesthetic (VTSyn) generative model that uses latent diffusion for unsupervised, high-quality tactile data synthesis. Using a real-vehicle perception system, we collect a multi-modal dataset across diverse road and lighting conditions. Extensive experiments show that VTSyn outperforms existing models in temporal, frequency, and classification performance, enhancing AV safety through proactive tactile perception.
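The abstract does not spell out how the alignment is done, so the following is only a minimal sketch of the temporal side of the problem, not the authors' method. It assumes a camera that images the road a fixed distance ahead, a constant vehicle speed over that distance, and a uniformly sampled tactile signal; the function name `tactile_window` and all of its parameters are hypothetical and chosen for illustration.

```python
import numpy as np


def tactile_window(t_frame, lookahead_m, speed_mps, tactile, fs_hz, window_s):
    """Return the tactile samples that correspond to one camera frame.

    Hypothetical helper, not taken from the paper. Assumptions:
    - the frame captured at t_frame (seconds) shows road lookahead_m metres ahead,
    - the vehicle holds speed_mps (m/s) until it reaches that road patch,
    - tactile is a 1-D array sampled at fs_hz, with sample 0 at t = 0 s.
    """
    if speed_mps <= 0:
        raise ValueError("speed must be positive to compute an arrival time")

    # Time the vehicle needs to reach the road patch seen in the frame.
    delay_s = lookahead_m / speed_mps

    # Index of the first tactile sample at the arrival time, and window length.
    start = int(round((t_frame + delay_s) * fs_hz))
    length = int(round(window_s * fs_hz))

    # Return None when the window would run past the recorded signal.
    if start < 0 or start + length > len(tactile):
        return None
    return tactile[start:start + length]


# Usage with synthetic data: 10 s of tactile samples at 100 Hz.
signal = np.arange(1000)
window = tactile_window(t_frame=1.0, lookahead_m=10.0, speed_mps=5.0,
                        tactile=signal, fs_hz=100.0, window_s=0.5)
```

With these illustrative numbers the delay is 2 s, so the frame taken at 1 s is paired with the tactile samples starting at 3 s. A real system would also need to handle varying speed and the spatial mapping from image region to wheel path, which this sketch leaves out.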