CV Sep 28, 2024

Contrastive ground-level image and remote sensing pre-training improves representation learning for natural world imagery

Andy V. Huynh, Lauren E. Gillespie, Jael Lopez-Saucedo, Claire Tang, Rohan Sikand, Moisés Expósito-Alonso

arXiv:2409.19439v1 13.513 citations h-index: 19

Originality: Incremental advance

AI Analysis

This work addresses species recognition for ecological monitoring, but it is incremental as it builds on existing multimodal contrastive learning methods.

The paper aims to improve fine-grained species recognition by applying contrastive learning to multiple views of image data, which improves downstream classification performance even when one view is absent. It also introduces a dataset of over 3 million ground-level and aerial image pairs covering more than 6,000 plant taxa.

Multimodal image-text contrastive learning has shown that joint representations can be learned across modalities. Here, we show how leveraging multiple views of image data with contrastive learning can improve downstream fine-grained classification performance for species recognition, even when one view is absent. We propose ContRastive Image-remote Sensing Pre-training (CRISP), a new pre-training task for ground-level and aerial image representation learning of the natural world, and introduce Nature Multi-View (NMV), a dataset of natural world imagery including more than 3 million ground-level and aerial image pairs for over 6,000 plant taxa across the ecologically diverse state of California. The NMV dataset and accompanying material are available at hf.co/datasets/andyvhuynh/NatureMultiView.
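
The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to contrast paired views: a CLIP-style symmetric InfoNCE loss, in which each ground-level embedding should match its own aerial embedding more closely than any other in the batch. The function names, the temperature of 0.07, and the use of plain Python lists are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    top = max(row)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum_exp for x in row]


def symmetric_contrastive_loss(ground, aerial, temperature=0.07):
    """Hypothetical CLIP-style loss; ground[i] and aerial[i] are assumed to be a true pair.

    The temperature value is an assumption, not a setting reported in the abstract.
    """
    g = [l2_normalize(v) for v in ground]
    a = [l2_normalize(v) for v in aerial]
    n = len(g)

    # logits[i][j]: scaled cosine similarity of ground image i and aerial image j.
    logits = [
        [sum(x * y for x, y in zip(gi, aj)) / temperature for aj in a]
        for gi in g
    ]

    # Ground-to-aerial direction: the correct "class" for row i is column i.
    loss_g2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Aerial-to-ground direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_a2g = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_g2a + loss_a2g)


if __name__ == "__main__":
    ground = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(ground, [[1.0, 0.0], [0.0, 1.0]]))  # matched pairs
    print(symmetric_contrastive_loss(ground, [[0.0, 1.0], [1.0, 0.0]]))  # swapped pairs
```

With matched pairs the loss is close to zero, and with swapped pairs it is large. In a setup like this, the ground-level encoder can be used on its own after pre-training, which is consistent with the abstract's claim that classification improves even when the aerial view is absent.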
