CVDec 8, 2023Code
MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion ModelKaiyu Song, Hanjiang Lai
Deep neural networks (DNNs) are vulnerable to adversarial perturbation, where an imperceptible perturbation is added to the image that can fool the DNNs. Diffusion-based adversarial purification focuses on using the diffusion model to generate a clean image against such adversarial attacks. Unfortunately, the generative process of the diffusion model is also inevitably affected by adversarial perturbation since the diffusion model is also a deep network where its input has adversarial perturbation. In this work, we propose MimicDiffusion, a new diffusion-based adversarial purification technique, that directly approximates the generative process of the diffusion model with the clean image as input. Concretely, we analyze the differences between the guided terms using the clean image and the adversarial sample. After that, we first implement MimicDiffusion based on Manhattan distance. Then, we propose two guidance to purify the adversarial perturbation and approximate the clean diffusion model. Extensive experiments on three image datasets including CIFAR-10, CIFAR-100, and ImageNet with three classifier backbones including WideResNet-70-16, WideResNet-28-10, and ResNet50 demonstrate that MimicDiffusion significantly performs better than the state-of-the-art baselines. On CIFAR-10, CIFAR-100, and ImageNet, it achieves 92.67\%, 61.35\%, and 61.53\% average robust accuracy, which are 18.49\%, 13.23\%, and 17.64\% higher, respectively. The code is available in the supplementary material.
CVNov 24, 2024Code
Enhancing Few-Shot Out-of-Distribution Detection with Gradient Aligned Context OptimizationBaoshun Tong, Kaiyu Song, Hanjiang Lai
Few-shot out-of-distribution (OOD) detection aims to detect OOD images from unseen classes with only a few labeled in-distribution (ID) images. To detect OOD images and classify ID samples, prior methods have been proposed by regarding the background regions of ID samples as the OOD knowledge and performing OOD regularization and ID classification optimization. However, the gradient conflict still exists between ID classification optimization and OOD regularization caused by biased recognition. To address this issue, we present Gradient Aligned Context Optimization (GaCoOp) to mitigate this gradient conflict. Specifically, we decompose the optimization gradient to identify the scenario when the conflict occurs. Then we alleviate the conflict in inner ID samples and optimize the prompts via leveraging gradient projection. Extensive experiments over the large-scale ImageNet OOD detection benchmark demonstrate that our GaCoOp can effectively mitigate the conflict and achieve great performance. Code will be available at https://github.com/BaoshunWq/ood-GaCoOp.
CVNov 24, 2024Code
Test-time Alignment-Enhanced Adapter for Vision-Language ModelsBaoshun Tong, Kaiyu Song, Hanjiang Lai
Test-time adaptation with pre-trained vision-language models (VLMs) has attracted increasing attention for tackling the issue of distribution shift during the test phase. While prior methods have shown effectiveness in addressing distribution shift by adjusting classification logits, they are not optimal due to keeping text features unchanged. To address this issue, we introduce a new approach called Test-time Alignment-Enhanced Adapter (TAEA), which trains an adapter with test samples to adjust text features during the test phase. We can enhance the text-to-image alignment prediction by utilizing an adapter to adapt text features. Furthermore, we also propose to adopt the negative cache from TDA as enhancement module, which further improves the performance of TAEA. Our approach outperforms the state-of-the-art TTA method of pre-trained VLMs by an average of 0.75% on the out-of-distribution benchmark and 2.5% on the cross-domain benchmark, with an acceptable training time. Code will be available at https://github.com/BaoshunWq/clip-TAEA.
CVOct 24, 2025Code
Topology Sculptor, Shape Refiner: Discrete Diffusion Model for High-Fidelity 3D Meshes GenerationKaiyu Song, Hanjiang Lai, Yaqing Zhang et al.
In this paper, we introduce Topology Sculptor, Shape Refiner (TSSR), a novel method for generating high-quality, artist-style 3D meshes based on Discrete Diffusion Models (DDMs). Our primary motivation for TSSR is to achieve highly accurate token prediction while enabling parallel generation, a significant advantage over sequential autoregressive methods. By allowing TSSR to "see" all mesh tokens concurrently, we unlock a new level of efficiency and control. We leverage this parallel generation capability through three key innovations: 1) Decoupled Training and Hybrid Inference, which distinctly separates the DDM-based generation into a topology sculpting stage and a subsequent shape refinement stage. This strategic decoupling enables TSSR to effectively capture both intricate local topology and overarching global shape. 2) An Improved Hourglass Architecture, featuring bidirectional attention enriched by face-vertex-sequence level Rotational Positional Embeddings (RoPE), thereby capturing richer contextual information across the mesh structure. 3) A novel Connection Loss, which acts as a topological constraint to further enhance the realism and fidelity of the generated meshes. Extensive experiments on complex datasets demonstrate that TSSR generates high-quality 3D artist-style meshes, capable of achieving up to 10,000 faces at a remarkable spatial resolution of $1024^3$. The code will be released at: https://github.com/psky1111/Tencent-TSSR.
CVDec 9, 2025
An Iteration-Free Fixed-Point Estimator for Diffusion InversionYifei Chen, Kaiyu Song, Yan Pan et al.
Diffusion inversion aims to recover the initial noise corresponding to a given image such that this noise can reconstruct the original image through the denoising diffusion process. The key component of diffusion inversion is to minimize errors at each inversion step, thereby mitigating cumulative inaccuracies. Recently, fixed-point iteration has emerged as a widely adopted approach to minimize reconstruction errors at each inversion step. However, it suffers from high computational costs due to its iterative nature and the complexity of hyperparameter selection. To address these issues, we propose an iteration-free fixed-point estimator for diffusion inversion. First, we derive an explicit expression of the fixed point from an ideal inversion step. Unfortunately, it inherently contains an unknown data prediction error. Building upon this, we introduce the error approximation, which uses the calculable error from the previous inversion step to approximate the unknown error at the current inversion step. This yields a calculable, approximate expression for the fixed point, which is an unbiased estimator characterized by low variance, as shown by our theoretical analysis. We evaluate reconstruction performance on two text-image datasets, NOCAPS and MS-COCO. Compared to DDIM inversion and other inversion methods based on the fixed-point iteration, our method achieves consistent and superior performance in reconstruction tasks without additional iterations or training.
CVNov 12, 2024
Leveraging Previous Steps: A Training-free Fast Solver for Flow DiffusionKaiyu Song, Hanjiang Lai
Flow diffusion models (FDMs) have recently shown potential in generation tasks due to the high generation quality. However, the current ordinary differential equation (ODE) solver for FDMs, e.g., the Euler solver, still suffers from slow generation since ODE solvers need many number function evaluations (NFE) to keep high-quality generation. In this paper, we propose a novel training-free flow-solver to reduce NFE while maintaining high-quality generation. The key insight for the flow-solver is to leverage the previous steps to reduce the NFE, where a cache is created to reuse these results from the previous steps. Specifically, the Taylor expansion is first used to approximate the ODE. To calculate the high-order derivatives of Taylor expansion, the flow-solver proposes to use the previous steps and a polynomial interpolation to approximate it, where the number of orders we could approximate equals the number of previous steps we cached. We also prove that the flow-solver has a more minor approximation error and faster generation speed. Experimental results on the CIFAR-10, CelebA-HQ, LSUN-Bedroom, LSUN-Church, ImageNet, and real text-to-image generation prove the efficiency of the flow-solver. Specifically, the flow-solver improves the FID-30K from 13.79 to 6.75, from 46.64 to 19.49 with $\text{NFE}=10$ on CIFAR-10 and LSUN-Church, respectively.
CVJun 26, 2025
Rethinking Oversaturation in Classifier-Free Guidance via Low FrequencyKaiyu Song, Hanjiang Lai
Classifier-free guidance (CFG) succeeds in condition diffusion models that use a guidance scale to balance the influence of conditional and unconditional terms. A high guidance scale is used to enhance the performance of the conditional term. However, the high guidance scale often results in oversaturation and unrealistic artifacts. In this paper, we introduce a new perspective based on low-frequency signals, identifying the accumulation of redundant information in these signals as the key factor behind oversaturation and unrealistic artifacts. Building on this insight, we propose low-frequency improved classifier-free guidance (LF-CFG) to mitigate these issues. Specifically, we introduce an adaptive threshold-based measurement to pinpoint the locations of redundant information. We determine a reasonable threshold by analyzing the change rate of low-frequency information between prior and current steps. We then apply a down-weight strategy to reduce the impact of redundant information in the low-frequency signals. Experimental results demonstrate that LF-CFG effectively alleviates oversaturation and unrealistic artifacts across various diffusion models, including Stable Diffusion-XL, Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL.
CVNov 12, 2024
Flow Matching Posterior Sampling: A Training-free Conditional Generation for Flow MatchingKaiyu Song, Hanjiang Lai, Yan Pan et al.
Training-free conditional generation based on flow matching aims to leverage pre-trained unconditional flow matching models to perform conditional generation without retraining. Recently, a successful training-free conditional generation approach incorporates conditions via posterior sampling, which relies on the availability of a score function in the unconditional diffusion model. However, flow matching models do not possess an explicit score function, rendering such a strategy inapplicable. Approximate posterior sampling for flow matching has been explored, but it is limited to linear inverse problems. In this paper, we propose Flow Matching-based Posterior Sampling (FMPS) to expand its application scope. We introduce a correction term by steering the velocity field. This correction term can be reformulated to incorporate a surrogate score function, thereby bridging the gap between flow matching models and score-based posterior sampling. Hence, FMPS enables the posterior sampling to be adjusted within the flow matching framework. Further, we propose two practical implementations of the correction mechanism: one aimed at improving generation quality, and the other focused on computational efficiency. Experimental results on diverse conditional generation tasks demonstrate that our method achieves superior generation quality compared to existing state-of-the-art approaches, validating the effectiveness and generality of FMPS.
CVApr 28, 2024
Improving Training-free Conditional Diffusion Model via Fisher InformationKaiyu Song, Hanjiang Lai
Training-free conditional diffusion models have received great attention in conditional image generation tasks. However, they require a computationally expensive conditional score estimator to let the intermediate results of each step in the reverse process toward the condition, which causes slow conditional generation. In this paper, we propose a novel Fisher information-based conditional diffusion (FICD) model to generate high-quality samples according to the condition. In particular, we further explore the conditional term from the perspective of Fisher information, where we show Fisher information can act as a weight to measure the informativeness of the condition in each generation step. According to this new perspective, we can control and gain more information along the conditional direction in the generation space. Thus, we propose the upper bound of the Fisher information to reformulate the conditional term, which increases the information gain and decreases the time cost. Experimental results also demonstrate that the proposed FICD can offer up to 2x speed-ups under the same sampling steps as most baselines. Meanwhile, FICD can improve the generation quality in various tasks compared to the baselines with a low computation cost.
LGDec 8, 2023
Two Simple Principles for Diffusion-Based Test-Time AdaptationKaiyu Song, Hanjiang Lai, Yan Pan et al.
Recently, diffusion-based test-time adaptations (TTA) have shown great advances, which leverage a diffusion model to map the images in the unknown test domain to the training domain. The unseen and diverse test domains make diffusion-based TTA an ill-posed problem. In this paper, we unravel two simple principles of the design tricks for diffusion-based methods. Intuitively, \textit{Principle 1} says semantic similarity preserving. We should preserve the semantic similarity between the original and generated test images. \textit{Principle 2} suggests minimal modifications. This principle enables the diffusion to map the test images to the training domain with minimal modifications of the test images. In particular, following the two principles, we propose our simple yet effective principle-guided diffusion-based test-time adaptation method (PDDA). Concretely, following Principle 1, we propose a semantic keeper, the method to preserve feature similarity, where the semantic keeper could filter the corruption introduced from the test domain, thus better preserving the semantics. Following Principle 2, we propose a modification keeper, where we introduce a regularization constraint into the generative process to minimize modifications to the test image. Meanwhile, there is a hidden conflict between the two principles. We further introduce the gradient-based view to unify the direction generated from two principles. Extensive experiments on CIFAR-10C, CIFAR-100C, ImageNet-W, and ImageNet-C with WideResNet-28-10, ResNet-50, Swin-T, and ConvNext-T demonstrate that PDDA significantly performs better than the complex state-of-the-art baselines. Specifically, PDDA achieves 2.4\% average accuracy improvements in ImageNet-C without any training process.