CV | Feb 3

VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

Zhiwen Li, Zhongjie Duan, Jinyan Ye, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

arXiv:2602.03210v1 | 1.52 citations | h-index: 18 | Has Code

Originality: Incremental advance

AI Analysis

This work targets task heterogeneity in visual reasoning. It offers a unified approach, and its contribution is incremental: it adapts an existing pre-trained model rather than introducing a new one.

The paper tackles the challenge of replicating In-Context Learning (ICL) in computer vision by proposing VIRAL, a framework that applies visual analogy in a pre-trained Diffusion Transformer to handle diverse tasks. The authors report that it outperforms existing methods.

Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose VIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
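
The abstract mentions a Mixture-of-Experts LoRA on top of a frozen backbone but gives no implementation details. As a rough illustration of the general idea only, the sketch below shows a generic softmax-gated mixture of low-rank updates added to a frozen linear layer. The class name `MoELoRALinear`, the dense softmax router, the rank, the number of experts, and the zero initialization of the up-projection are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    e = [math.exp(x - top) for x in z]
    s = sum(e)
    return [x / s for x in e]


class MoELoRALinear:
    """Hypothetical layer: frozen weight W plus gated low-rank updates B_e A_e."""

    def __init__(self, d_in, d_out, rank=2, num_experts=3, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.W = rand(d_out, d_in)           # frozen base weight, never updated
        self.gate = rand(num_experts, d_in)  # router producing one logit per expert
        self.A = [rand(rank, d_in) for _ in range(num_experts)]  # down-projections
        # Up-projections start at zero, so the layer initially equals the base layer.
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(num_experts)]

    def route(self, x):
        """Return the mixing weight of each expert for input x."""
        return softmax(matvec(self.gate, x))

    def forward(self, x):
        y = matvec(self.W, x)
        for w, A, B in zip(self.route(x), self.A, self.B):
            delta = matvec(B, matvec(A, x))  # low-rank update of one expert
            y = [yi + w * di for yi, di in zip(y, delta)]
        return y


layer = MoELoRALinear(d_in=4, d_out=3)
x = [1.0, -0.5, 0.25, 2.0]
base = matvec(layer.W, x)
out = layer.forward(x)
```

In a setup like this, only the router and the per-expert `A` and `B` matrices would be trained. Because different inputs weight the experts differently, updates from dissimilar tasks can land on different experts, which is the intuition behind using such a mixture to reduce gradient interference.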
