Building 6G Radio Foundation Models with Transformer Architectures
This work addresses model adaptability in dynamic 6G network environments, a concern for wireless communication researchers. Its contribution is incremental: it applies existing transformer architectures to a new domain.
The authors propose a Vision Transformer (ViT) as a radio foundation model, pretrained with Masked Spectrogram Modeling (MSM). The model achieves competitive performance on two downstream tasks: spectrogram segmentation and Channel State Information (CSI)-based human activity sensing. On segmentation, it outperforms a four-times larger model trained from scratch while requiring less training time.
Foundation deep learning (DL) models are designed to learn general, robust and adaptable representations of their target modality, enabling finetuning across a range of downstream tasks. These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL). Foundation models have demonstrated better generalization than traditional supervised approaches, a critical requirement for wireless communications, where the dynamic environment demands model adaptability. In this work, we propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning. We introduce a Masked Spectrogram Modeling (MSM) approach to pretrain the ViT in a self-supervised fashion. We evaluate the ViT-based foundation model on two downstream tasks: Channel State Information (CSI)-based human activity sensing and spectrogram segmentation. Experimental results demonstrate performance competitive with supervised training while generalizing across diverse domains. Notably, the pretrained ViT model outperforms a four-times larger model trained from scratch on the spectrogram segmentation task while requiring significantly less training time, and it achieves competitive performance on the CSI-based human activity sensing task. This work demonstrates that a ViT pretrained with MSM is a promising technique for scalable foundation model development in future 6G networks.
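The abstract does not spell out how MSM is implemented, so the following is only a minimal sketch of the general idea, assuming a masked-autoencoder-style scheme: the spectrogram is cut into patches, a random subset is hidden, and the reconstruction loss is computed on the hidden patches only. The function names (`patchify`, `random_mask`, `masked_mse`), the 16x16 patch size, the 0.75 mask ratio, and the MSE loss are illustrative assumptions, not values taken from the paper, and the zero-valued `pred` array merely stands in for the output of a ViT encoder and decoder.

```python
import numpy as np


def patchify(spec, patch):
    """Split an (H, W) spectrogram into non-overlapping patch x patch tiles.

    Returns an array of shape (num_patches, patch * patch), in row-major tile order.
    """
    h, w = spec.shape
    if h % patch or w % patch:
        raise ValueError("spectrogram dimensions must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    tiles = spec.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(gh * gw, patch * patch)


def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask; True marks patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    hidden = rng.permutation(num_patches)[:num_masked]
    mask = np.zeros(num_patches, dtype=bool)
    mask[hidden] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked patches only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 64)).astype(np.float32)  # synthetic stand-in spectrogram
patches = patchify(spec, patch=16)                        # assumed patch size
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # assumed mask ratio
visible = patches[~mask]            # the only patches an encoder would receive
pred = np.zeros_like(patches)       # placeholder for a model's reconstruction
loss = masked_mse(pred, patches, mask)
```

Restricting the loss to masked patches means the model gets no credit for copying visible input, so it has to infer the hidden time-frequency structure from context. No labels are needed, which is what makes this kind of pretraining self-supervised.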