PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
This work addresses the challenge of building image classifiers from limited labeled data in domains such as medical imaging and remote sensing. Its contribution is incremental, since it builds on existing in-context learning paradigms.
The paper tackles few-shot image classification in data-scarce domains by examining the role of image embeddings in in-context learning frameworks. It shows that the pretraining strategy strongly affects out-of-domain performance, and that PictSure consequently outperforms existing models on out-of-domain benchmarks.
Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that both training success and out-of-domain performance depend heavily on how the embedding models are pretrained. Consequently, PictSure outperforms existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.
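To make the ICL-based FSIC setting concrete, the sketch below classifies query images from a handful of labeled support images without any gradient updates. It is a simplified stand-in and not PictSure's architecture or API: the names `embed`, `classify_in_context`, `projection`, and `temperature` are hypothetical, the "encoder" is a fixed linear projection standing in for a pretrained visual encoder, and the learned in-context model is replaced by a softmax-weighted vote over support labels.

```python
import numpy as np


def embed(images: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Hypothetical frozen encoder: flatten, project, L2-normalise."""
    flat = images.reshape(len(images), -1)
    feats = flat @ projection
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)


def classify_in_context(support_emb, support_labels, query_emb, temperature=0.1):
    """Predict query labels from labeled support embeddings.

    Each query attends to the support set via a softmax over cosine
    similarities, then takes a weighted vote over the support labels.
    No parameters are updated, mirroring the "no gradient-based
    adaptation" property of the ICL setting.
    """
    classes = np.unique(support_labels)
    sims = query_emb @ support_emb.T                  # (n_query, n_support)
    weights = np.exp(sims / temperature)
    weights /= weights.sum(axis=1, keepdims=True)     # rows sum to 1
    one_hot = (support_labels[:, None] == classes[None, :]).astype(float)
    probs = weights @ one_hot                         # (n_query, n_classes)
    return classes[probs.argmax(axis=1)], probs


# Toy 4x4 "images": class 0 is bright on top, class 1 is bright on the bottom.
rng = np.random.default_rng(0)
top = np.zeros((4, 4))
top[:2] = 1.0
bottom = np.zeros((4, 4))
bottom[2:] = 1.0

support = np.stack([top, top, bottom, bottom]) + 0.05 * rng.normal(size=(4, 4, 4))
support_labels = np.array([0, 0, 1, 1])
query = np.stack([bottom, top]) + 0.05 * rng.normal(size=(2, 4, 4))

projection = np.eye(16)  # assumption: identity map in place of a pretrained encoder
pred, probs = classify_in_context(
    embed(support, projection), support_labels, embed(query, projection)
)
print(pred)
```

In this toy setup, prediction quality depends entirely on how well `embed` separates the classes. That is the dependence on the embedding model that the abstract places at the center of its analysis. A different pretrained encoder would be swapped in by replacing `embed`.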