CVApr 12

AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control

arXiv:2604.1045481.0h-index: 5
Predicted impact top 27% in CV · last 90 daysOriginality Incremental advance
AI Analysis

This work provides a benchmark and dataset for a previously unaddressed problem—affective image manipulation—enabling evaluation and improvement of models in this domain.

The authors introduce AIM-Bench, the first benchmark for affective image manipulation, and find that current models exhibit a positivity bias due to imbalanced training data. They propose a data engine to create a balanced dataset, AIM-40k, which improves overall performance by 9.15%.

Affective Image Manipulation (AIM) aims to evoke specific emotions through targeted editing. Current image editing benchmarks primarily focus on object-level modifications in general scenarios, lacking the fine-grained granularity to capture affective dimensions. To bridge this gap, we introduce the first benchmark designed for AIM termed AIM-Bench. This benchmark is built upon a dual-path affective modeling scheme that integrates the Mikels emotion taxonomy with the Valence-Arousal-Dominance framework, enabling high-level semantic and fine-grained continuous manipulation. Through a hierarchical human-in-the-loop workflow, we finally curate 800 high-quality samples covering 8 emotional categories and 5 editing types. To effectively assess performance, we also design a composite evaluation suite combining rule-based and model-based metrics to holistically assess instruction consistency, aesthetics, and emotional expressiveness. Extensive evaluations reveal that current editing models face significant challenges, most notably a prevalent positivity bias, which stemming from inherent imbalances in training data distribution. To tackle this, we propose a scalable data engine utilizing an inverse repainting strategy to construct AIM-40k, a balanced instruction-tuning dataset comprising 40k samples. Concretely, we enhance raw affective images via generative redrawing to establish high-fidelity ground truths, and synthesize input images with divergent emotions and paired precise instructions. Fine-tuning a baseline model on AIM-40k yields a 9.15% relative improvement in overall performance, demonstrating the effectiveness of our AIM-40k. Our data and related code will be made open soon.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes