Vision Transformer Based User Equipment Positioning
This work addresses positioning accuracy for wireless communication systems, representing an incremental improvement over existing deep learning methods.
The paper tackles the problem of user equipment positioning using deep learning by proposing a Vision Transformer architecture that focuses on the Angle Delay Profile derived from Channel State Information. It achieves an RMSE of 0.55m indoors and 13.59m outdoors on the DeepMIMO dataset, outperforming state-of-the-art schemes by approximately 38%.
Recently, Deep Learning (DL) techniques have been used for User Equipment (UE) positioning. However, the key shortcomings of such models are that: i) they give equal attention to the entire input; ii) they are not well suited to non-sequential data, e.g., when only instantaneous Channel State Information (CSI) is available. In this context, we propose an attention-based Vision Transformer (ViT) architecture that focuses on the Angle Delay Profile (ADP) from the CSI matrix. Our approach, validated on the `DeepMIMO' and `ViWi' ray-tracing datasets, achieves a Root Mean Squared Error (RMSE) of 0.55m indoors and 13.59m outdoors in DeepMIMO, and 3.45m in ViWi's outdoor blockage scenario. The proposed scheme outperforms state-of-the-art schemes by $\sim$38\%. It also performs substantially better than the other approaches that we have considered in terms of the distribution of error distance.
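The two evaluation quantities named above, the RMSE and the distribution of error distance, can be illustrated with a minimal sketch. It assumes that the error distance is the Euclidean distance between the predicted and the true UE position, and that the RMSE is the square root of the mean squared error distance; the abstract does not spell out either definition. The function names and the sample coordinates are hypothetical and are not taken from the paper or its datasets.

```python
import math


def error_distances(predicted, actual):
    """Euclidean distance (metres) between each predicted and true position."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return [math.dist(p, a) for p, a in zip(predicted, actual)]


def rmse(distances):
    """Root mean squared error over per-sample error distances."""
    if not distances:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def empirical_cdf(distances, threshold):
    """Fraction of samples whose error distance is at most `threshold`."""
    if not distances:
        raise ValueError("need at least one sample")
    return sum(d <= threshold for d in distances) / len(distances)


# Hypothetical 2-D positions in metres (illustrative values only).
predicted = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
actual = [(0.0, 0.0), (0.0, 0.0), (1.0, 2.0)]

d = error_distances(predicted, actual)   # per-sample error distances
print(rmse(d))                           # single-number accuracy summary
print(empirical_cdf(d, 1.0))             # share of samples within 1 m
```

Evaluating `empirical_cdf` over a range of thresholds gives the error-distance distribution, which shows how often large errors occur rather than only their average magnitude.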