CV · Sep 23, 2025

TriFusion-AE: Language-Guided Depth and LiDAR Fusion for Robust Point Cloud Processing

arXiv:2509.18743v1 · h-index: 1
Originality: Incremental advance
AI Analysis

The paper addresses robustness in perception for autonomous systems, but the advance is incremental: it builds on existing autoencoder frameworks and adds multimodal fusion.

The paper targets robust point cloud processing for autonomous driving and robotics. It proposes TriFusion-AE, a multimodal autoencoder that fuses text, depth, and LiDAR data. Gains are limited under mild perturbations, but under strong adversarial attacks and heavy noise the model reconstructs point clouds significantly more robustly than CNN-based autoencoders.

LiDAR-based perception is central to autonomous driving and robotics, yet raw point clouds remain highly vulnerable to noise, occlusion, and adversarial corruptions. Autoencoders offer a natural framework for denoising and reconstruction, but their performance degrades under challenging real-world conditions. In this work, we propose TriFusion-AE, a multimodal cross-attention autoencoder that integrates textual priors, monocular depth maps from multi-view images, and LiDAR point clouds to improve robustness. By aligning semantic cues from text, geometric (depth) features from images, and spatial structure from LiDAR, TriFusion-AE learns representations that are resilient to stochastic noise and adversarial perturbations. Interestingly, while showing limited gains under mild perturbations, our model achieves significantly more robust reconstruction under strong adversarial attacks and heavy noise, where CNN-based autoencoders collapse. We evaluate on the nuScenes-mini dataset to reflect realistic low-data deployment scenarios. Our multimodal fusion framework is designed to be model-agnostic, enabling seamless integration with any CNN-based point cloud autoencoder for joint representation learning.
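The abstract names cross-attention as the fusion mechanism but this summary does not specify the architecture. As a rough illustration only, the sketch below shows one way such fusion could work: LiDAR tokens act as queries and attend over the concatenated depth and text tokens, and the result is added back residually. The function names (`cross_attention`, `fuse`, `random_tokens`), the token counts, the feature width, and the omission of learned query/key/value projections are all assumptions made for this example, not details taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over the context rows.

    Simplification (assumption): context rows serve as both keys and values,
    and there are no learned projection matrices.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)])
    return out


def fuse(lidar_tokens, depth_tokens, text_tokens):
    """Hypothetical fusion step: LiDAR tokens query depth and text tokens,
    and the attended features are added back to the LiDAR tokens."""
    context = depth_tokens + text_tokens  # list concatenation along the token axis
    attended = cross_attention(lidar_tokens, context)
    return [[l + a for l, a in zip(lt, at)] for lt, at in zip(lidar_tokens, attended)]


def random_tokens(n, d, rng):
    """Stand-in for encoder outputs: n tokens of width d drawn from N(0, 1)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
lidar = random_tokens(6, 8, rng)  # e.g. 6 LiDAR feature tokens
depth = random_tokens(4, 8, rng)  # e.g. 4 depth-map feature tokens
text = random_tokens(3, 8, rng)   # e.g. 3 text-prior tokens
fused = fuse(lidar, depth, text)  # same shape as the LiDAR tokens: 6 x 8
```

Because the fused output keeps the shape of the LiDAR tokens, a step like this could in principle sit in front of an existing point cloud autoencoder without changing its input size, which is one plausible reading of the model-agnostic claim in the abstract.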

