CV · Mar 8, 2025

Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models

Md Azim Khan, Aryya Gangopadhyay, Jianwu Wang, Robert F. Erbacher

arXiv:2503.06003v1 · h-index: 26 · 2025 International Conference on Advanced Machine Learning and Data Science (AMLDS)

Originality: Incremental advance

AI Analysis

This work addresses efficiency and robustness issues in VLMs for real-world applications like unmanned ground vehicles, though it appears incremental as it builds on existing methods like LoRA and frequency-domain techniques.

The researchers tackled the computational cost of vision-language models (VLMs) in real-time situational awareness by integrating frequency-domain transformations with low-rank adaptation (LoRA). On caption generation and VQA tasks with noisy data, the resulting model reports evaluation metrics comparable to state-of-the-art VLMs such as CLIP ViT-L/14 and SigLIP.

Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision-language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. However, these models often face computational challenges, especially when they must run efficiently in real environments. This research presents a novel VLM framework that leverages frequency-domain transformations and low-rank adaptation (LoRA) to enhance feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT) based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low-visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results demonstrate that our model achieves evaluation metrics comparable to state-of-the-art VLMs, such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further reveals that our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
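The abstract does not spell out the architecture, so the following is only a minimal NumPy sketch of one plausible reading of "DFT-based low-rank features while retaining pretrained spatial weights": a frozen weight matrix handles the spatial path, and a LoRA-style low-rank pair is applied to the magnitude spectrum of the input. The function names (`dft_lora_forward`, `add_gaussian_noise`), the use of the magnitude spectrum, the `alpha / r` scaling, and all shapes are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to an image with values in [0, 1].

    Hypothetical helper: the paper evaluates under varying Gaussian noise,
    but its exact noise protocol is not given in the abstract.
    """
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


def dft_lora_forward(x, W, A, B, alpha=1.0):
    """Frozen spatial projection plus a low-rank update on DFT features.

    x: (d_in,) input feature vector
    W: (d_out, d_in) pretrained weight, kept frozen
    A: (r, d_in) and B: (d_out, r) trainable low-rank factors
    Assumption: the low-rank branch sees the magnitude of the 1-D DFT of x.
    """
    r = A.shape[0]
    spatial = W @ x                                  # pretrained spatial path
    spectrum = np.abs(np.fft.fft(x, norm="ortho"))   # real-valued DFT features
    low_rank = B @ (A @ spectrum)                    # rank-r frequency path
    return spatial + (alpha / r) * low_rank


rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 4
x = rng.normal(size=d_in)
W = rng.normal(size=(d_out, d_in))
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))  # zero init, as is conventional for LoRA

y = dft_lora_forward(x, W, A, B)
print(y.shape)
```

With `B` initialized to zero, the frequency branch contributes nothing and the output equals the pretrained projection `W @ x`, which is the usual LoRA starting point: training only moves the model away from the pretrained behavior through the small `A` and `B` factors.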
