73.5 | CV | May 7
Making Reconstruction FID Predictive of Diffusion Generation FID
Tongda Xu, Mingwei He, Shady Abu-Hussein et al.
It is well known that the reconstruction FID (rFID) of a VAE is poorly correlated with the generation FID (gFID) of a latent diffusion model. We propose interpolated FID (iFID), a simple variant of rFID that exhibits a strong correlation with gFID. Specifically, for each dataset element, we retrieve its nearest neighbor in latent space, interpolate between their latent representations, decode the interpolated latent, and compute the FID between the decoded samples and the original dataset. We provide an intuitive explanation for why iFID correlates well with gFID, and why reconstruction metrics can be negatively correlated with gFID, by connecting iFID to recent results on diffusion generalization and hallucination. Theoretically, we show that iFID evaluates decoded interpolations aligned with the ridge set around which diffusion samples concentrate, thereby measuring a quantity closely related to diffusion sample quality. Empirically, iFID is the first metric shown to strongly correlate with diffusion gFID across diverse VAEs, achieving Pearson and Spearman correlations of approximately $0.85$. The project page is available at https://tongdaxu.github.io/pages/ifid.html.
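The sample-construction step described in this abstract (nearest-neighbor retrieval in latent space, interpolation, decoding) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the linear interpolation, the midpoint weight `alpha=0.5`, the flattened list-of-floats latents, and the `encode`/`decode` callables are all assumptions, since the abstract does not specify them. The final FID between the decoded samples and the original dataset is left to whatever FID tool is already in use.

```python
import math
from typing import Any, Callable, List, Sequence

Vector = List[float]


def nearest_neighbor_index(latents: Sequence[Vector], i: int) -> int:
    """Return the index of the latent closest to latents[i], excluding i itself.

    Assumption: Euclidean distance on flattened latents.
    """
    best_j, best_d = -1, math.inf
    for j, z in enumerate(latents):
        if j == i:
            continue
        d = math.dist(latents[i], z)
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def interpolate_with_neighbors(latents: Sequence[Vector], alpha: float = 0.5) -> List[Vector]:
    """Linearly interpolate each latent toward its nearest neighbor.

    Assumption: linear interpolation with weight alpha on the neighbor.
    """
    out = []
    for i, z in enumerate(latents):
        nn = latents[nearest_neighbor_index(latents, i)]
        out.append([(1.0 - alpha) * a + alpha * b for a, b in zip(z, nn)])
    return out


def ifid_samples(
    images: Sequence[Any],
    encode: Callable[[Any], Vector],  # hypothetical VAE encoder
    decode: Callable[[Vector], Any],  # hypothetical VAE decoder
    alpha: float = 0.5,
) -> List[Any]:
    """Decode the interpolated latents; FID is then computed against `images`."""
    latents = [encode(x) for x in images]
    return [decode(z) for z in interpolate_with_neighbors(latents, alpha)]
```

The brute-force neighbor search is quadratic in dataset size; for a real dataset an approximate nearest-neighbor index would likely be needed.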
CV | Nov 22, 2025
Versatile Recompression-Aware Perceptual Image Super-Resolution
Mingwei He, Tongda Xu, Xingtong Ge et al.
Perceptual image super-resolution (SR) methods restore degraded images and produce sharp outputs. In practice, those outputs are usually recompressed for storage and transmission. Ignoring recompression is suboptimal, as the downstream codec may introduce additional artifacts into the restored images. However, jointly optimizing SR and recompression is challenging, as codecs are not differentiable and vary in configuration. In this paper, we present Versatile Recompression-Aware Perceptual Super-Resolution (VRPSR), which makes existing perceptual SR aware of versatile compression. First, we formulate compression as conditional text-to-image generation and utilize a pre-trained diffusion model to build a generalizable codec simulator. Next, we propose a set of training techniques tailored for perceptual SR, including optimizing the simulator using perceptual targets and adopting slightly compressed images as the training target. Empirically, VRPSR saves more than 10% bitrate when built on Real-ESRGAN and S3Diff under H.264/H.265/H.266 compression. In addition, VRPSR facilitates joint optimization of the SR model and the post-processing model applied after recompression.
CV | Jan 14, 2021
DAIL: Dataset-Aware and Invariant Learning for Face Recognition
Gaoang Wang, Lin Chen, Tianqiang Liu et al.
To achieve good performance in face recognition, a large-scale training dataset is usually required. A simple yet effective way to improve recognition performance is to use as large a dataset as possible by combining multiple datasets during training. However, naively combining different datasets is problematic due to two major issues. First, the same person can appear in different datasets, leading to identity overlap between datasets. Naively treating the same person as different classes in different datasets during training will affect back-propagation and generate non-representative embeddings. On the other hand, manually cleaning labels may take formidable human effort, especially when there are millions of images and thousands of identities. Second, different datasets are collected in different situations and thus have different domain distributions. Naively combining datasets makes it difficult to learn domain-invariant embeddings across datasets. In this paper, we propose DAIL: Dataset-Aware and Invariant Learning to resolve the above-mentioned issues. To solve the first issue of identity overlap, we propose a dataset-aware loss for multi-dataset training that reduces the penalty when the same person appears in multiple datasets. This can be readily achieved with a modified softmax loss with a dataset-aware term. To solve the second issue, domain adaptation with gradient reversal layers is employed for dataset-invariant learning. The proposed approach not only achieves state-of-the-art results on several commonly used face recognition validation sets, including LFW, CFP-FP, and AgeDB-30, but also shows great benefit for practical use.
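One plausible reading of the "modified softmax loss with a dataset-aware term" is sketched below. The abstract does not give the formula, so the masking rule here is an assumption rather than the paper's definition: classes belonging to a different dataset than the sample are dropped from the softmax denominator, so an identity that may be duplicated in another dataset is never pushed away as a negative. The function name, arguments, and per-class dataset tags are all hypothetical.

```python
import math
from typing import Sequence


def dataset_aware_softmax_loss(
    logits: Sequence[float],
    label: int,
    class_dataset: Sequence[int],  # hypothetical: dataset id of each class
    sample_dataset: int,           # hypothetical: dataset id of this sample
) -> float:
    """Cross-entropy over only the classes from the sample's own dataset.

    Assumption: the dataset-aware term acts as a 0/1 mask on the softmax
    denominator. The target class is always kept.
    """
    kept = [
        k for k in range(len(logits))
        if class_dataset[k] == sample_dataset or k == label
    ]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits[k] for k in kept)
    log_denominator = m + math.log(sum(math.exp(logits[k] - m) for k in kept))
    return log_denominator - logits[label]
```

With this mask, a high logit on a class from another dataset (possibly the same person under a different label) no longer contributes to the loss, which matches the stated goal of reducing the penalty for overlapping identities. The gradient reversal component mentioned in the abstract is not shown, since it depends on an autograd framework.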