CV · Jul 1, 2025

Is Visual in-Context Learning for Compositional Medical Tasks within Reach?

Simon Reiß, Zdravko Marinov, Alexander Jaus, Constantin Seibold, M. Saquib Sarfraz, Erik Rodner, Rainer Stiefelhagen

arXiv:2507.00868v26.23 citations · h-index: 16

Originality: Incremental advance

AI Analysis

This work addresses the challenge of building flexible, adaptive vision pipelines for medical tasks. It is an incremental advance: it builds on existing in-context learning methods and shifts the focus to compositional task sequences.

The paper asks how a single model can handle multiple compositional medical tasks, including new ones, without retraining. It explores visual in-context learning with training on task sequences rather than individual tasks, and it introduces a synthetic compositional task generation engine that bootstraps task sequences from segmentation datasets to train models for complex tasks.

In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights especially for multi-modal medical task sequences but also highlights challenges that need to be addressed.
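To make the idea of bootstrapping task sequences from a segmentation dataset more concrete, here is a minimal sketch in Python. It is not the authors' engine: the abstract does not list the operations it composes, so the primitives (`flip`, `invert`), the helper names, and the toy label mask are all assumptions chosen for illustration. The sketch only shows the general pattern of turning one segmentation mask into a chain of intermediate targets.

```python
import random

# Hypothetical primitive steps. The operations used in the paper are not
# named in the abstract; these two are illustrative stand-ins.
def select_class(mask, label):
    """Binary mask for one class of a label map (the first step of a sequence)."""
    return [[1 if value == label else 0 for value in row] for row in mask]

def flip_horizontal(img):
    """Mirror a binary image left to right."""
    return [row[::-1] for row in img]

def invert(img):
    """Swap foreground and background of a binary image."""
    return [[1 - value for value in row] for row in img]

PRIMITIVES = {"flip": flip_horizontal, "invert": invert}

def sample_task_sequence(rng, length):
    """Draw a random sequence of primitive names (assumed sampling scheme)."""
    names = sorted(PRIMITIVES)
    return [rng.choice(names) for _ in range(length)]

def apply_sequence(mask, label, steps):
    """Return every intermediate result, starting with the binary class mask."""
    outputs = [select_class(mask, label)]
    for name in steps:
        outputs.append(PRIMITIVES[name](outputs[-1]))
    return outputs

# Toy 3x3 label map standing in for a segmentation annotation.
toy_mask = [
    [0, 1, 1],
    [0, 2, 1],
    [0, 0, 2],
]

rng = random.Random(0)
steps = sample_task_sequence(rng, 3)
outputs = apply_sequence(toy_mask, label=1, steps=steps)
fixed = apply_sequence(toy_mask, label=1, steps=["flip", "invert"])
```

Each element of `outputs` could serve as the target for one step of a compositional task, so a single annotated mask yields supervision for a whole sequence. How such sequences are presented to the in-context learner, and which masking objective is applied, is left open here.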
