IT, Mar 26

AMBER: An Adaptive Multimodal Mask Transformer for Beam Prediction with Missing Modalities

Chenyiming Wen, Binpu Shi, Min Li, Ming-Min Zhao, Min-Jian Zhao, Jiangzhou Wang

arXiv:2512.11331 | 39.7 | h-index: 17

AI Analysis

This work addresses a critical issue for reliable high-speed data transmission in vehicular networks and offers a robust solution to missing-modality challenges, though it is an incremental improvement over existing multimodal fusion methods.

The paper tackles the degradation of beam prediction in vehicular networks when sensor data is missing. It proposes AMBER, an adaptive multimodal mask transformer that maintains high accuracy and robustness in such scenarios, as demonstrated on the DeepSense6G dataset.

With the widespread adoption of millimeter-wave (mmWave) massive multi-input-multi-output (MIMO) in vehicular networks, accurate beam prediction and alignment have become critical for high-speed data transmission and reliable access. While traditional beam prediction approaches primarily rely on in-band beam training, recent advances have started to explore multimodal sensing to extract environmental semantics for enhanced prediction. However, the performance of existing multimodal fusion methods degrades significantly in real-world settings because they are vulnerable to missing data caused by sensor blockage, poor lighting, or GPS dropouts. To address this challenge, we propose AMBER (Adaptive multimodal Mask transformer for BEam pRediction), a novel end-to-end framework that processes temporal sequences of image, LiDAR, radar, and GPS data, while adaptively handling arbitrary missing-modality cases. AMBER introduces learnable modality tokens and a missing-modality-aware mask to prevent cross-modal noise propagation, along with a learnable fusion token and multihead attention to achieve robust modality-specific information distillation and feature-level fusion. Furthermore, a class-former-aided modality alignment (CMA) module and temporal-aware positional embedding are incorporated to preserve temporal coherence and ensure semantic alignment across modalities, facilitating the learning of modality-invariant and temporally consistent representations for beam prediction. Extensive experiments on the real-world DeepSense6G dataset demonstrate that AMBER significantly outperforms existing multimodal learning baselines. In particular, it maintains high beam prediction accuracy and robustness even under severe missing-modality scenarios, validating its effectiveness and practical applicability.
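The missing-modality-aware mask described in the abstract can be illustrated with a minimal sketch. This is not AMBER's implementation: it assumes a single attention head, no learned projections, and tokens that serve as both keys and values. The names `MODALITIES` and `masked_fusion` are hypothetical. The point it shows is that a masked (missing) modality receives an attention weight of exactly zero, so whatever sits in its slot cannot leak into the fused feature.

```python
import math

# Hypothetical modality order; the abstract lists image, LiDAR, radar, and GPS.
MODALITIES = ("image", "lidar", "radar", "gps")


def masked_fusion(fusion_query, modality_tokens, present):
    """Attend from one fusion query over per-modality tokens.

    Assumption: tokens act as both keys and values (no learned projections).
    Missing modalities get a score of -inf, so their softmax weight is 0.0.
    Returns (fused_vector, weights_by_modality).
    """
    dim = len(fusion_query)
    neg_inf = float("-inf")

    # Scaled dot-product scores, with the mask applied to missing modalities.
    scores = []
    for name in MODALITIES:
        if present.get(name, False):
            key = modality_tokens[name]
            dot = sum(q * k for q, k in zip(fusion_query, key))
            scores.append(dot / math.sqrt(dim))
        else:
            scores.append(neg_inf)

    if all(s == neg_inf for s in scores):
        raise ValueError("at least one modality must be present")

    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum over present modalities only; masked slots are never read.
    fused = [0.0] * dim
    for name, weight in zip(MODALITIES, weights):
        if not present.get(name, False):
            continue
        for i, value in enumerate(modality_tokens[name]):
            fused[i] += weight * value

    return fused, dict(zip(MODALITIES, weights))


if __name__ == "__main__":
    query = [1.0, 0.0]
    tokens = {
        "image": [1.0, 0.0],
        "lidar": [0.0, 1.0],
        "radar": [1.0, 1.0],
        "gps": [1e9, -1e9],  # stale or garbage reading from a GPS dropout
    }
    present = {"image": True, "lidar": True, "radar": True, "gps": False}
    fused, weights = masked_fusion(query, tokens, present)
    print(weights["gps"])  # the masked modality contributes nothing
```

In a full transformer the same effect is usually obtained by passing a boolean key mask to the attention layer. The explicit loop here only makes the zero-weight behavior easy to inspect.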
