GR, CV, MM | Mar 20, 2025

VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

arXiv:2503.16406v1 | 8 citations | h-index: 4 | CVPR
Originality: Incremental advance
AI Analysis

This work addresses a specific challenge in text-to-image generation, relevant to applications that require precise depiction of interactions. The advance is incremental because it builds on existing diffusion models.

The paper tackles the problem that text-to-image diffusion models struggle to depict human-object interactions accurately because they differentiate interaction words poorly. It proposes VerbDiff, which improves interaction understanding and generates high-quality images with accurate interactions, as demonstrated on the HICO-DET dataset.

Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.

Code Implementations: 1 repo
