CV AI · Jan 7

Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection

Maxim Clouser, Kia Khezeli, John Kalantari

arXiv:2601.04381v1 1.5

Originality: Incremental advance

AI Analysis

This work addresses the challenge of enabling foundation models to support non-visible modalities like IR and SAR for safety-critical applications, offering an incremental improvement through few-shot adaptation.

The study adapts a flow-matching foundation model, pre-trained on RGB images, for cross-spectral translation to infrared (IR) and synthetic aperture radar (SAR) using only 100 paired images per domain. The resulting synthetic data improved object detection, including pedestrian detection on KAIST IR and infrastructure detection on M4-SAR. LPIPS computed on held-out pairs served as a proxy for downstream detection performance.

Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.
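The adapter-selection step described in the abstract (pick the LoRA configuration with the lowest held-out LPIPS, then check that lower LPIPS tracks higher mAP) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the adapter names, the LPIPS scores, and the mAP values below are hypothetical placeholders rather than results from the paper, and the use of Spearman rank correlation to quantify the "proxy" relationship is an assumption. In practice the LPIPS scores would come from a perceptual-metric implementation evaluated on the 50 held-out pairs.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def select_adapter(lpips_by_adapter):
    """Pick the adapter whose mean held-out LPIPS is lowest."""
    return min(lpips_by_adapter, key=lambda name: mean(lpips_by_adapter[name]))


# Hypothetical placeholder numbers for illustration only; NOT from the paper.
heldout_lpips = {
    "rank4_lr1e-4": [0.42, 0.40, 0.44],
    "rank16_lr1e-4": [0.31, 0.33, 0.29],
    "rank64_lr5e-5": [0.36, 0.35, 0.37],
}
downstream_map = {
    "rank4_lr1e-4": 0.48,
    "rank16_lr1e-4": 0.61,
    "rank64_lr5e-5": 0.55,
}

best = select_adapter(heldout_lpips)
names = list(heldout_lpips)
rho = spearman(
    [mean(heldout_lpips[n]) for n in names],
    [downstream_map[n] for n in names],
)
print(best, round(rho, 3))
```

A strongly negative correlation would be consistent with the abstract's claim that lower LPIPS predicts higher mAP. The practical benefit is that the LPIPS ranking needs only the held-out image pairs, so a detector does not have to be trained for every configuration in the hyperparameter grid.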
