CV · Jul 10, 2024

CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging

Raza Imam, Mohammed Talha Alam, Umaima Rahman, Mohsen Guizani, Fakhri Karray

arXiv:2407.07315v27.66 citations · h-index: 8

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of adapting general vision-language models to the astronomical domain. The contribution is incremental: it fine-tunes an existing model on new data.

The paper tackles the challenge of applying large vision-language models to astronomical imaging, where datasets are small. It introduces CosmoCLIP, a framework that fine-tunes the pre-trained CLIP model on SpaceNet images paired with BLIP-generated captions. The authors report superior generalization and significantly better results than CLIP in zero-shot classification and image-text retrieval.
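As a rough illustration of the two evaluation tasks named above, the sketch below shows how zero-shot classification and image-text retrieval are commonly done with a CLIP-style model once embeddings are available. It assumes image and text embeddings have already been produced by some encoder; the function names `zero_shot_classify` and `retrieve_top_k` are hypothetical and are not taken from the paper or from any CosmoCLIP code.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def zero_shot_classify(image_embs, class_text_embs):
    """Assign each image to the class whose text embedding is most similar.

    image_embs:      (n_images, d) array of image embeddings (assumed given).
    class_text_embs: (n_classes, d) array, one embedding per class prompt.
    Returns the predicted class index per image and the similarity matrix.
    """
    sims = l2_normalize(image_embs) @ l2_normalize(class_text_embs).T
    return sims.argmax(axis=1), sims


def retrieve_top_k(query_emb, candidate_embs, k=3):
    """Return indices of the k candidates most similar to the query."""
    sims = l2_normalize(candidate_embs) @ l2_normalize(query_emb)
    return np.argsort(-sims)[:k]


if __name__ == "__main__":
    # Toy 2-D embeddings; real embeddings would come from the encoders.
    images = np.array([[1.0, 0.1], [0.0, 2.0]])
    class_prompts = np.array([[3.0, 0.0], [0.0, 0.5]])
    preds, _ = zero_shot_classify(images, class_prompts)
    print(preds)
```

No class labels are used for training in this procedure; the class names enter only through their text embeddings, which is what makes the classification zero-shot.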

Existing vision-text contrastive learning models enhance representation transferability and support zero-shot prediction by matching paired image and caption embeddings while pushing unrelated pairs apart. However, astronomical image-label datasets are significantly smaller than the general image and label datasets available from the internet. We introduce CosmoCLIP, an astronomical image-text contrastive learning framework precisely fine-tuned on the pre-trained CLIP model using SpaceNet and BLIP-based captions. SpaceNet, attained via FLARE, constitutes ~13k optimally distributed images, while BLIP acts as a rich knowledge extractor. The rich semantics derived from the SpaceNet images and BLIP descriptions, when learned contrastively, enable CosmoCLIP to achieve superior generalization across various in-domain and out-of-domain tasks. Our results demonstrate that CosmoCLIP is a straightforward yet powerful framework, significantly outperforming CLIP in zero-shot classification and image-text retrieval tasks.
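The abstract describes contrastive learning as matching paired image and caption embeddings while pushing unrelated pairs apart. A minimal sketch of that idea is the symmetric cross-entropy loss popularized by CLIP, shown below in NumPy. The abstract does not give the exact loss or hyperparameters used for CosmoCLIP, so the function name `clip_style_contrastive_loss` and the fixed `temperature=0.07` are assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of image_embs and row i of text_embs are assumed to be a matching
    image-caption pair; every other combination in the batch is a negative.
    The temperature value is an assumption, not taken from the paper.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarities
    idx = np.arange(logits.shape[0])          # matching pairs lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    embs = np.eye(4)
    aligned = clip_style_contrastive_loss(embs, embs)
    shuffled = clip_style_contrastive_loss(embs, embs[[1, 2, 3, 0]])
    print(aligned < shuffled)
```

Minimizing this loss raises the similarity of each matching pair relative to all mismatched pairs in the batch, which is why correctly aligned embeddings score a lower loss than shuffled ones in the usage example.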
