CV | May 30

T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining

arXiv:2606.00673 | 48.9 | h-index: 14
AI Analysis

Enables vision-language models to understand thermal images for applications in low-light and adverse weather conditions, addressing a previously unsolved domain gap.

T-CLIP adapts CLIP for thermal imaging by introducing a physics-aware captioning pipeline (IR-Cap) and a decoupled dual-LoRA framework, achieving consistent improvements in cross-modal retrieval across three thermal benchmarks.

Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a key representational challenge in thermal imaging where global scene context and object-level heat signatures conflict when learned together in a single embedding space. To address these, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset providing complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines across three thermal benchmarks in cross-modal retrieval, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.
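The "decoupled dual-LoRA" idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say which CLIP layers are adapted, which ranks are used, or how the two branches are trained. The names `LoRALinear`, `scene_branch`, and `object_branch`, the dimensions, and the `rank`/`alpha` values are all hypothetical. The sketch only shows the general mechanism of two independent low-rank adapters sharing one frozen weight, so that updating one branch cannot disturb the other.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen; may be shared by several adapters
        # Trainable down-projection, small random init (standard LoRA practice).
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        # Trainable up-projection, zero init so the adapter starts as a no-op.
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


def l2_normalize(v):
    """Scale each row to unit length, as is usual before cosine-similarity retrieval."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


rng = np.random.default_rng(42)

# Hypothetical stand-in for a single frozen CLIP projection matrix.
frozen_w = rng.normal(size=(16, 32))

# Two independent adapters over the same frozen weight (assumed layout):
# one intended for scene-level captions, one for object-level captions.
scene_branch = LoRALinear(frozen_w, seed=1)
object_branch = LoRALinear(frozen_w, seed=2)

x = rng.normal(size=(5, 32))  # hypothetical batch of thermal image features
scene_emb = l2_normalize(scene_branch(x))
object_emb = l2_normalize(object_branch(x))
```

Because each branch owns its own `A` and `B`, a gradient step on the object-level adapter leaves the scene-level embedding untouched. That is one plausible reading of "independently adapts CLIP for scene-level and object-level thermal understanding".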
