CV | Dec 25, 2025

Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models

Antara Titikhsha, Om Kulkarni, Dharun Muthaiah

arXiv:2512.22272v1 | h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the semantic gap between human perception and generative models for text-to-image synthesis, offering incremental improvements in geometric control.

The paper addresses the failure of text-to-image diffusion models to follow geometric constraints. It uses a lightweight teacher model to inject human perception of shape into the generation process, improving semantic alignment by about 80 percent over unguided baselines.

Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-Σ. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.
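The gradient-injection idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear "teacher" (`W`, `target`), the shrinking `toy_denoise_step`, and the `scale` value are all hypothetical stand-ins, chosen only so the example runs with NumPy alone. It shows the general pattern of nudging a latent along the negative gradient of a teacher loss after each sampler step.

```python
import numpy as np

def teacher_loss_and_grad(z, W, target):
    """Toy stand-in for a shape teacher (hypothetical, not the paper's HPE model).

    The 'embedding' is the linear map W @ z; the loss is its squared
    distance to a target shape embedding. The gradient is analytic.
    """
    residual = W @ z - target
    loss = float(residual @ residual)
    grad = 2.0 * W.T @ residual
    return loss, grad

def toy_denoise_step(z, shrink=0.9):
    """Placeholder for one step of a frozen generator's sampler (assumption)."""
    return shrink * z

def guided_sample(z0, W, target, steps=50, scale=0.02, guide=True):
    """Run the toy sampler, optionally injecting teacher gradients at every step."""
    z = z0.copy()
    for _ in range(steps):
        z = toy_denoise_step(z)
        if guide:
            _, grad = teacher_loss_and_grad(z, W, target)
            z = z - scale * grad  # move the latent toward the teacher's target
    return z

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))     # hypothetical teacher weights
target = rng.normal(size=4)     # hypothetical target shape embedding
z0 = rng.normal(size=8)         # initial latent

z_unguided = guided_sample(z0, W, target, guide=False)
z_guided = guided_sample(z0, W, target, guide=True)

unguided_loss, _ = teacher_loss_and_grad(z_unguided, W, target)
guided_loss, _ = teacher_loss_and_grad(z_guided, W, target)
print(f"teacher loss without guidance: {unguided_loss:.3f}")
print(f"teacher loss with guidance:    {guided_loss:.3f}")
```

In a real pipeline the teacher would be a neural network and the gradient would come from automatic differentiation with respect to the latent. The analytic gradient here only keeps the sketch dependency-free.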
