CV · Sep 23, 2025

Prompt-Guided Dual Latent Steering for Inversion Problems

Yichen Wu, Xu Liu, Chenxuan Zhao, Xinyu Wu

arXiv:2509.18619v1 · 15.516 citations · h-index: 4 · DICTA

Originality: Incremental advance

AI Analysis

This work addresses semantic drift in image inversion, a problem relevant to computer vision researchers and practitioners, and offers an incremental improvement over existing methods.

The paper tackles the challenge of inverting corrupted images into the latent space of diffusion models, where single-latent methods often suffer from semantic drift. It introduces Prompt-Guided Dual Latent Steering (PDLS), a training-free framework that decomposes inversion into a structural path and a semantic path and combines them through an optimal control formulation. On FFHQ-1K and ImageNet-1K, the authors report more faithful reconstructions and better semantic alignment than single-latent baselines.

Inverting corrupted images into the latent space of diffusion models is challenging. Current methods, which encode an image into a single latent vector, struggle to balance structural fidelity with semantic accuracy, leading to reconstructions with semantic drift, such as blurred details or incorrect attributes. To overcome this, we introduce Prompt-Guided Dual Latent Steering (PDLS), a novel, training-free framework built upon Rectified Flow models for their stable inversion paths. PDLS decomposes the inversion process into two complementary streams: a structural path to preserve source integrity and a semantic path guided by a prompt. We formulate this dual guidance as an optimal control problem and derive a closed-form solution via a Linear Quadratic Regulator (LQR). This controller dynamically steers the generative trajectory at each step, preventing semantic drift while ensuring the preservation of fine detail without costly, per-image optimization. Extensive experiments on FFHQ-1K and ImageNet-1K under various inversion tasks, including Gaussian deblurring, motion deblurring, super-resolution and freeform inpainting, demonstrate that PDLS produces reconstructions that are both more faithful to the original image and better aligned with the semantic information than single-latent baselines.
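To make the idea of a closed-form LQR controller with two guidance terms concrete, here is a minimal toy sketch. It is not the paper's derivation or implementation. It assumes the simplest possible dynamics, dx/dt = u on [0, 1], with a quadratic control cost and two quadratic terminal penalties that pull the final state toward a "structural" target and a "semantic" target. The function names (`dual_lqr_control`, `steer`), the weights `lam_s` and `lam_p`, and the use of plain lists in place of image latents are all illustrative assumptions.

```python
def dual_lqr_control(x, t, structural_target, semantic_target, lam_s, lam_p):
    """Closed-form optimal control for the toy problem (an assumption, not PDLS itself):

        minimize  integral_t^1 0.5*|u|^2 ds
                  + 0.5*lam_s*|x(1) - a|^2 + 0.5*lam_p*|x(1) - b|^2
        subject to dx/ds = u

    where a is the structural target and b is the semantic target.
    Setting the gradient of the cost to zero gives
        u = (lam_s*(a - x) + lam_p*(b - x)) / (1 + (lam_s + lam_p)*(1 - t)).
    """
    denom = 1.0 + (lam_s + lam_p) * (1.0 - t)
    return [
        (lam_s * (a - xi) + lam_p * (b - xi)) / denom
        for xi, a, b in zip(x, structural_target, semantic_target)
    ]


def steer(x0, structural_target, semantic_target, lam_s, lam_p, steps=100):
    """Integrate dx/dt = u with explicit Euler steps, re-evaluating the
    feedback controller at every step."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        u = dual_lqr_control(x, t, structural_target, semantic_target, lam_s, lam_p)
        x = [xi + dt * ui for xi, ui in zip(x, u)]
    return x


if __name__ == "__main__":
    # Toy 2-D "latent": start at the origin, with two different targets.
    final = steer([0.0, 0.0], [1.0, 0.0], [3.0, 2.0], lam_s=1.0, lam_p=1.0)
    print(final)
```

In this toy setting the final state works out to (x0 + lam_s*a + lam_p*b) / (1 + lam_s + lam_p), a weighted compromise between the starting point and the two targets. Raising `lam_p` relative to `lam_s` pulls the result toward the semantic target, which is the kind of trade-off the abstract describes, although the actual PDLS controller operates on Rectified Flow trajectories and may differ substantially.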
