E-InMeMo: Enhanced Prompting for Visual In-Context Learning
This work addresses a key bottleneck in adapting in-context learning to computer vision, namely prompt quality, and offers a lightweight solution for researchers and practitioners. The contribution is incremental, since it builds on existing prompting methods.
The paper tackles poor prompt quality in visual in-context learning by proposing E-InMeMo, which adds learnable perturbations to the in-context pairs to optimize prompting. Compared to the baseline without learnable prompts, it improves mIoU by 7.99 for foreground segmentation and by 17.04 for single object detection.
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 for foreground segmentation and by 17.04 for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL. Code is publicly available at: https://github.com/Jackieam/E-InMeMo
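The core idea, a learnable perturbation applied to the in-context pair while the large model stays frozen, can be illustrated with a toy sketch. This is not the authors' implementation: the frozen model is replaced by a fixed random linear map, and the names `learn_perturbation`, `w_frozen`, `eps`, and `lr`, the box bound on the perturbation, and the use of projected gradient descent are all assumptions made for illustration.

```python
import numpy as np


def learn_perturbation(pairs, queries, targets, w_frozen, eps=0.1, lr=0.1, steps=200):
    """Learn one shared, bounded perturbation for the in-context pairs.

    pairs:    (N, P) flattened in-context pairs (input image + output image)
    queries:  (N, Q) flattened query images (never perturbed)
    targets:  (N, T) flattened ground-truth outputs for the queries
    w_frozen: (T, P + Q) weights of the toy frozen model (never updated)
    """
    p = pairs.shape[1]
    w_pair = w_frozen[:, :p]          # columns that see the in-context pair
    delta = np.zeros(p)               # the only trainable quantity
    losses = []
    for _ in range(steps):
        # Build the prompt: perturbed in-context pair followed by the query.
        canvas = np.concatenate([pairs + delta, queries], axis=1)
        residual = canvas @ w_frozen.T - targets
        losses.append(0.5 * np.mean(np.sum(residual ** 2, axis=1)))
        # Gradient of the mean squared error with respect to delta only.
        grad = np.mean(residual @ w_pair, axis=0)
        # Gradient step, then projection back into the box [-eps, eps].
        delta = np.clip(delta - lr * grad, -eps, eps)
    return delta, losses


# Synthetic data standing in for flattened 4x4 images (hypothetical sizes).
rng = np.random.default_rng(0)
n, d = 16, 16
pairs = rng.uniform(0.0, 1.0, size=(n, 2 * d))
queries = rng.uniform(0.0, 1.0, size=(n, d))
targets = rng.uniform(0.0, 1.0, size=(n, d))
w_frozen = rng.normal(scale=0.1, size=(d, 3 * d))
w_snapshot = w_frozen.copy()

delta, losses = learn_perturbation(pairs, queries, targets, w_frozen)
print(f"loss before: {losses[0]:.4f}, loss after: {losses[-1]:.4f}")
```

Only `delta` receives updates, which mirrors the lightweight nature of the approach described above: the number of trainable values is tied to the size of the prompt rather than to the size of the model.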