CV · AI · Mar 12

Coarse-Guided Visual Generation via Weighted h-Transform Sampling

arXiv:2603.12057v124.01 citations · h-index: 2
Predicted impact top 40% in CV · last 90 days
Originality Incremental advance
AI Analysis

The paper addresses the need for efficient, generalizable visual generation without paired data, though the advance is incremental because it builds on existing training-free diffusion methods.

The paper tackles coarse-guided visual generation, in which fine visual samples are synthesized from low-fidelity references. It proposes a training-free method that uses h-transform sampling with a noise-level-aware schedule to balance guidance and quality, and it reports effective results across diverse image and video tasks.

Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and by restricted generalization, since they rely on collected paired data. Accordingly, recent training-free works leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance against synthesis quality. To address these challenges, we propose a novel guidance method based on the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the drift term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
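To make the idea of a weighted drift term concrete, here is a minimal toy sketch in one dimension. It is not the paper's algorithm and involves no diffusion model: it uses plain Brownian motion, for which the h-transform that conditions the path to end at a target value is known in closed form (the Brownian bridge drift `(target - x) / (T - t)`). The function name `weighted_bridge_sample` and the parameter `weight_fn` are hypothetical, and the weight schedules shown are assumptions chosen only for illustration.

```python
import math
import random


def weighted_bridge_sample(target, weight_fn, n_steps=200, T=1.0, seed=0):
    """Euler-Maruyama simulation of dX = w(t) * (target - X) / (T - t) dt + dW.

    With w(t) = 1 the drift is the exact h-transform of Brownian motion
    conditioned to reach `target` at time T. A smaller w(t) weakens the
    steering term (illustrative assumption, not the paper's schedule).
    """
    rng = random.Random(seed)
    dt = T / n_steps
    x = 0.0
    for i in range(n_steps):
        t = i * dt
        # Gradient of log h for Brownian motion conditioned on X_T = target.
        guidance = (target - x) / (T - t)
        # Original dynamics (pure noise) plus the weighted guidance drift.
        x += weight_fn(t) * guidance * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


def mean_abs_error(weight_fn, target=3.0, n_runs=100):
    """Average distance between the final sample and the target."""
    errors = [
        abs(weighted_bridge_sample(target, weight_fn, seed=s) - target)
        for s in range(n_runs)
    ]
    return sum(errors) / n_runs


if __name__ == "__main__":
    print("full guidance:", mean_abs_error(lambda t: 1.0))
    print("no guidance:  ", mean_abs_error(lambda t: 0.0))
    print("ramped weight:", mean_abs_error(lambda t: t))  # hypothetical schedule
```

With the weight fixed at 1 the samples end close to the target, while a weight of 0 recovers the unguided process. In the paper's setting the guidance term is only approximate, which is the stated motivation for a schedule that reduces the weight where the approximation error is larger; the toy drift here is exact, so the schedule only illustrates the mechanism.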

