Efficient Adaptation For Remote Sensing Visual Grounding
This work addresses the underexplored problem of visual grounding in remote sensing, offering researchers and practitioners a practical and cost-efficient alternative to full model training. Its contribution is incremental, however, since it applies existing PEFT methods to a new domain.
The paper adapts pre-trained vision-language models to remote sensing visual grounding tasks using Parameter Efficient Fine Tuning (PEFT) techniques, achieving performance comparable to or surpassing state-of-the-art models while significantly reducing computational costs.
Adapting pre-trained models has become an effective strategy in artificial intelligence, offering a scalable and efficient alternative to training models from scratch. In remote sensing (RS), where visual grounding (VG) remains underexplored, this approach makes it possible to deploy powerful vision-language models for robust cross-modal understanding while significantly reducing computational overhead. To this end, we applied Parameter Efficient Fine Tuning (PEFT) techniques to adapt these models to RS-specific VG tasks. Specifically, we evaluated LoRA placement across different modules in Grounding DINO, and used BitFit and adapters to fine-tune the OFA foundation model, which was pre-trained on general-purpose VG datasets. This approach achieved performance comparable to or surpassing current state-of-the-art (SOTA) models while significantly reducing computational costs. This study highlights the potential of PEFT techniques to advance efficient and precise multi-modal analysis in RS, offering a practical and cost-effective alternative to full model training.
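To give a rough sense of why these PEFT techniques reduce cost, the sketch below counts trainable parameters for a single dense layer under full fine-tuning, LoRA, BitFit, and a bottleneck adapter. It is a minimal illustration only: the layer size, LoRA rank, and adapter bottleneck width are assumed values, not the configurations used in this work, and the helper names (`LinearShape`, `lora_params`, and so on) are hypothetical. The adapter is assumed to follow the common down-projection/up-projection bottleneck form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearShape:
    """Shape of one dense layer y = W x + b (hypothetical example layer)."""
    d_in: int
    d_out: int

    @property
    def weight_params(self) -> int:
        return self.d_in * self.d_out

    @property
    def bias_params(self) -> int:
        return self.d_out


def full_finetune_params(layer: LinearShape) -> int:
    # Full fine-tuning updates every weight and every bias.
    return layer.weight_params + layer.bias_params


def lora_params(layer: LinearShape, rank: int) -> int:
    # LoRA freezes W and learns a low-rank update B @ A,
    # with A of shape (rank, d_in) and B of shape (d_out, rank).
    return rank * layer.d_in + layer.d_out * rank


def bitfit_params(layer: LinearShape) -> int:
    # BitFit freezes W and updates only the bias vector.
    return layer.bias_params


def adapter_params(d_model: int, bottleneck: int) -> int:
    # Assumed bottleneck adapter: down-projection (d_model -> bottleneck)
    # and up-projection (bottleneck -> d_model), each with a bias.
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up


# Illustrative sizes only; not taken from the paper.
layer = LinearShape(d_in=768, d_out=768)
full = full_finetune_params(layer)
lora = lora_params(layer, rank=8)
bitfit = bitfit_params(layer)
adapter = adapter_params(d_model=768, bottleneck=64)

for name, count in [("full", full), ("LoRA r=8", lora),
                    ("BitFit", bitfit), ("adapter m=64", adapter)]:
    print(f"{name:>12}: {count:>7} trainable ({count / full:.2%} of full)")
```

Under these assumed sizes, each PEFT variant trains a small fraction of the parameters that full fine-tuning would update for the same layer. Where LoRA is attached (the placement question studied here for Grounding DINO) changes which layers contribute such low-rank updates, and therefore the total trainable budget.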