CV
Aug 1, 2025

YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

Princeton
arXiv:2508.00728v1
9 citations
h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of controlling object counts in generated images, which matters to users of text-to-image systems. The work appears incremental, since it builds on existing counting and generation methods.

The paper tackles object counting for text-to-image generation by proposing YOLO-Count, a differentiable model that achieves state-of-the-art counting accuracy and enables precise quantity control.

We propose YOLO-Count, a differentiable open-vocabulary object counting model that both tackles general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the 'cardinality' map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.
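
To make the idea of gradient-based count guidance concrete, here is a toy sketch in plain Python. It is not the paper's implementation. It assumes, purely for illustration, that a predicted count can be read off as the sum of a per-pixel map (as in common density-style counters); the abstract does not say how the cardinality map is defined or aggregated. The function names `predicted_count`, `count_loss`, `guidance_step`, and `guide` are hypothetical.

```python
# Toy sketch of gradient-based count guidance.
# ASSUMPTION (not from the source): predicted count = sum of all map cells.

def predicted_count(count_map):
    """Sum every cell of a 2D map to obtain a scalar count estimate."""
    return sum(sum(row) for row in count_map)

def count_loss(count_map, target):
    """Squared error between the estimated count and the requested count."""
    return (predicted_count(count_map) - target) ** 2

def guidance_step(count_map, target, step_size):
    """One gradient-descent step on count_loss with respect to the map.

    Because the count is a plain sum, the derivative of the loss with
    respect to each cell is the same value: 2 * (count - target).
    """
    grad = 2.0 * (predicted_count(count_map) - target)
    return [[v - step_size * grad for v in row] for row in count_map]

def guide(count_map, target, step_size=0.01, steps=200):
    """Repeat guidance steps; stable here when step_size < 1 / (number of cells)."""
    for _ in range(steps):
        count_map = guidance_step(count_map, target, step_size)
    return count_map

# Usage: a 4x4 map whose cells sum to 8 is nudged toward a target count of 3.
demo_map = [[0.5] * 4 for _ in range(4)]
guided_map = guide(demo_map, target=3.0)
print(predicted_count(demo_map), predicted_count(guided_map))
```

In a real T2I pipeline the gradient would presumably flow back through the counting model into the generator's latents or embeddings rather than into the map itself; the sketch only shows why a differentiable count makes such an update possible.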

