What Makes Good Examples for Visual In-Context Learning?
This work addresses a key bottleneck in visual in-context learning and offers practical methods for researchers and practitioners, though the contribution is incremental in that it builds on existing concepts from NLP.
The paper tackles the problem of selecting effective in-context examples for visual in-context learning in large vision models, finding that performance is highly sensitive to example choice, and proposes unsupervised and supervised prompt retrieval methods that achieve non-trivial improvements over random selection.
Large-scale models trained on broad data have recently become the mainstream architecture in computer vision due to their strong generalization performance. In this paper, the main focus is on an emergent ability in large vision models, known as in-context learning, which allows inference on unseen tasks by conditioning on in-context examples (a.k.a. prompt) without updating the model parameters. This concept is well known in natural language processing but has only recently been studied for large vision models. We provide, for the first time, a comprehensive investigation of the impact of in-context examples in computer vision, and find that performance is highly sensitive to the choice of in-context examples. To overcome this problem, we propose a prompt retrieval framework to automate the selection of in-context examples. Specifically, we present (1) an unsupervised prompt retrieval method based on nearest example search using an off-the-shelf model, and (2) a supervised prompt retrieval method, which trains a neural network to choose examples that directly maximize in-context learning performance. The results demonstrate that our methods bring non-trivial improvements to visual in-context learning compared with the commonly used random selection.
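To make the unsupervised variant concrete, here is a minimal sketch of nearest example search over precomputed feature vectors. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the names `cosine_similarity` and `retrieve_prompt` are hypothetical. The feature vectors stand in for embeddings that an off-the-shelf model would produce; the toy values below are made up purely for demonstration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors.

    Assumption: cosine similarity is used as the nearest-example metric.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_prompt(
    query_feat: Sequence[float],
    candidate_feats: Sequence[Sequence[float]],
) -> int:
    """Return the index of the candidate example closest to the query.

    Hypothetical helper: features are assumed to be precomputed by some
    off-the-shelf image encoder, which is outside the scope of this sketch.
    """
    if not candidate_feats:
        raise ValueError("candidate_feats must not be empty")
    scores = [cosine_similarity(query_feat, c) for c in candidate_feats]
    # Pick the candidate with the highest similarity score.
    return max(range(len(scores)), key=scores.__getitem__)


# Toy features (made-up values) for three candidate in-context examples.
candidates = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query = [0.5, 0.9, 0.1]

best_index = retrieve_prompt(query, candidates)
print(best_index)
```

The selected example would then be paired with the query image to form the prompt, replacing a randomly drawn example. The supervised variant described above would keep the same retrieval step but learn the feature extractor so that similarity reflects downstream in-context learning performance.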