Feng Ding

CV
h-index15
25papers
438citations
Novelty53%
AI Score60

25 Papers

89.8IVMay 27
ChWDTA: Channel-wise Wavelet-Domain Transformer Attention and Entropy Modeling for Learned Image Compression

Haisheng Fu, Runyu Yang, Feng Ding et al.

State-of-the-art learned image compression (LIC) schemes are increasingly based on hybrid CNN-transformer architectures. To further improve rate-distortion performance, we introduce channel-wise wavelet transforms into both the transformer and entropy-coding components. First, we propose a channel-wise wavelet-domain transformer attention (ChWDTA) mechanism. ChWDTA keeps the efficient windowed spatial self-attention used in modern LIC backbones, but computes the Q/K/V projections on channel-wise wavelet-transformed features before mapping the attention output back with the inverse transform. The resulting Channel-wise Wavelet-Domain Transformer Block (ChWDTB) therefore preserves the spatial tokenization pattern of windowed attention while sparsifying the channel covariance seen by the attention projections. Second, in the entropy-coding stage, we introduce a channel-wise wavelet packet (ChWP) decomposition that produces four equal-sized subbands, which better fit channel-wise slice-based autoregressive entropy modeling. When each channel-wise subband is divided into two slices, we use eight slices for entropy coding. With this configuration, the proposed scheme obtains BD-rate reductions of -17.82%, -19.15%, and -22.56% on the Kodak, CLIC Professional Validation, and Tecnick test sets, respectively. Even when each channel-wise subband is coded as a single slice, the scheme still retains most of the coding gains with lower complexity. The results confirm the advantage of introducing wavelet transform in CNN-transformer-based LIC schemes.

IVFeb 25, 2023
JND-Based Perceptual Optimization For Learned Image Compression

Feng Ding, Jian Jin, Lili Meng et al.

Recently, learned image compression schemes have achieved remarkable improvements in image fidelity (e.g., PSNR and MS-SSIM) compared to conventional hybrid image coding ones due to their high-efficiency non-linear transform, end-to-end optimization frameworks, etc. However, few of them take the Just Noticeable Difference (JND) characteristic of the Human Visual System (HVS) into account and optimize learned image compression towards perceptual quality. To address this issue, a JND-based perceptual quality loss is proposed. Considering that the amounts of distortion in the compressed image at different training epochs under different Quantization Parameters (QPs) are different, we develop a distortion-aware adjustor. After combining them together, we can better assign the distortion in the compressed image with the guidance of JND to preserve the high perceptual quality. All these designs enable the proposed method to be flexibly applied to various learned image compression schemes with high scalability and plug-and-play advantages. Experimental results on the Kodak dataset demonstrate that the proposed method has led to better perceptual quality than the baseline model under the same bit rate.

MMJan 22, 2024Code
Detecting Multimedia Generated by Large AI Models: A Survey

Li Lin, Neeraj Gupta, Yue Zhang et al.

The rapid advancement of Large AI Models (LAIMs), particularly diffusion models and large language models, has marked a new era where AI-generated multimedia is increasingly integrated into various aspects of daily life. Although beneficial in numerous fields, this content presents significant risks, including potential misuse, societal disruptions, and ethical concerns. Consequently, detecting multimedia generated by LAIMs has become crucial, with a marked rise in related research. Despite this, there remains a notable gap in systematic surveys that focus specifically on detecting LAIM-generated multimedia. Addressing this, we provide the first survey to comprehensively cover existing research on detecting multimedia (such as text, images, videos, audio, and multimodal content) created by LAIMs. Specifically, we introduce a novel taxonomy for detection methods, categorized by media modality, and aligned with two perspectives: pure detection (aiming to enhance detection performance) and beyond detection (adding attributes like generalizability, robustness, and interpretability to detectors). Additionally, we have presented a brief overview of generation mechanisms, public datasets, online detection tools, and evaluation metrics to provide a valuable resource for researchers and practitioners in this field. Most importantly, we offer a focused analysis from a social media perspective to highlight their broader societal impact. Furthermore, we identify current challenges in detection and propose directions for future research that address unexplored, ongoing, and emerging issues in detecting multimedia generated by LAIMs. Our aim for this survey is to fill an academic gap and contribute to global AI security efforts, helping to ensure the integrity of information in the digital realm. The project link is https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey.

CVFeb 27, 2024Code
Preserving Fairness Generalization in Deepfake Detection

Li Lin, Xinan He, Yan Ju et al.

Although effective deepfake detection models have been developed in recent years, recent studies have revealed that these models can result in unfair performance disparities among demographic groups, such as race and gender. This can lead to particular groups facing unfair targeting or exclusion from detection, potentially allowing misclassified deepfakes to manipulate public opinion and undermine trust in the model. The existing method for addressing this problem is providing a fair loss function. It shows good fairness performance for intra-domain evaluation but does not maintain fairness for cross-domain testing. This highlights the significance of fairness generalization in the fight against deepfakes. In this work, we propose the first method to address the fairness generalization problem in deepfake detection by simultaneously considering features, loss, and optimization aspects. Our method employs disentanglement learning to extract demographic and domain-agnostic forgery features, fusing them to encourage fair learning across a flattened loss landscape. Extensive experiments on prominent deepfake datasets demonstrate our method's effectiveness, surpassing state-of-the-art approaches in preserving fairness during cross-domain deepfake detection. The code is available at https://github.com/Purdue-M2/Fairness-Generalization

CVJan 20Code
DiffFace-Edit: A Diffusion-Based Facial Dataset for Forgery-Semantic Driven Deepfake Detection Analysis

Feng Ding, Wenhui Yi, Xinan He et al.

Generative models now produce imperceptible, fine-grained manipulated faces, posing significant privacy risks. However, existing AI-generated face datasets generally lack focus on samples with fine-grained regional manipulations. Furthermore, no researchers have yet studied the real impact of splice attacks, which occur between real and manipulated samples, on detectors. We refer to these as detector-evasive samples. Based on this, we introduce the DiffFace-Edit dataset, which has the following advantages: 1) It contains over two million AI-generated fake images. 2) It features edits across eight facial regions (e.g., eyes, nose) and includes a richer variety of editing combinations, such as single-region and multi-region edits. Additionally, we specifically analyze the impact of detector-evasive samples on detection models. We conduct a comprehensive analysis of the dataset and propose a cross-domain evaluation that combines IMDL methods. Dataset will be available at https://github.com/ywh1093/DiffFace-Edit.

IVJun 18, 2023
GAN-based Image Compression with Improved RDO Process

Fanxin Xia, Jian Jin, Lili Meng et al.

GAN-based image compression schemes have shown remarkable progress lately due to their high perceptual quality at low bit rates. However, there are two main issues, including 1) the reconstructed image perceptual degeneration in color, texture, and structure as well as 2) the inaccurate entropy model. In this paper, we present a novel GAN-based image compression approach with improved rate-distortion optimization (RDO) process. To achieve this, we utilize the DISTS and MS-SSIM metrics to measure perceptual degeneration in color, texture, and structure. Besides, we absorb the discretized gaussian-laplacian-logistic mixture model (GLLMM) for entropy modeling to improve the accuracy in estimating the probability distributions of the latent representation. During the evaluation process, instead of evaluating the perceptual quality of the reconstructed image via IQA metrics, we directly conduct the Mean Opinion Score (MOS) experiment among different codecs, which fully reflects the actual perceptual results of humans. Experimental results demonstrate that the proposed method outperforms the existing GAN-based methods and the state-of-the-art hybrid codec (i.e., VVC).

MTRL-SCIDec 18, 2025
Artificial Intelligence-Enabled Holistic Design of Catalysts Tailored for Semiconducting Carbon Nanotube Growth

Liu Qian, Yue Li, Ying Xie et al.

Catalyst design is crucial for materials synthesis, especially for complex reaction networks. Strategies like collaborative catalytic systems and multifunctional catalysts are effective but face challenges at the nanoscale. Carbon nanotube synthesis contains complicated nanoscale catalytic reactions, thus achieving high-density, high-quality semiconducting CNTs demands innovative catalyst design. In this work, we present a holistic framework integrating machine learning into traditional catalyst design for semiconducting CNT synthesis. It combines knowledge-based insights with data-driven techniques. Three key components, including open-access electronic structure databases for precise physicochemical descriptors, pre-trained natural language processing-based embedding model for higher-level abstractions, and physical - driven predictive models based on experiment data, are utilized. Through this framework, a new method for selective semiconducting CNT synthesis via catalyst - mediated electron injection, tuned by light during growth, is proposed. 54 candidate catalysts are screened, and three with high potential are identified. High-throughput experiments validate the predictions, with semiconducting selectivity exceeding 91% and the FeTiO3 catalyst reaching 98.6%. This approach not only addresses semiconducting CNT synthesis but also offers a generalizable methodology for global catalyst design and nanomaterials synthesis, advancing materials science in precise control.

CVNov 22, 2024Code
FairAdapter: Detecting AI-generated Images with Improved Fairness

Feng Ding, Jun Zhang, Xinan He et al.

The high-quality, realistic images generated by generative models pose significant challenges for exposing them.So far, data-driven deep neural networks have been justified as the most efficient forensics tools for the challenges. However, they may be over-fitted to certain semantics, resulting in considerable inconsistency in detection performance across different contents of generated samples. It could be regarded as an issue of detection fairness. In this paper, we propose a novel framework named Fairadapter to tackle the issue. In comparison with existing state-of-the-art methods, our model achieves improved fairness performance. Our project: https://github.com/AppleDogDog/FairnessDetection

46.2NIApr 23
An Overlay Multicast Routing Method Based on Network Situational Awareness and Hierarchical Multi-Agent Reinforcement Learning

Miao Ye, Yanye Chen, Yong Wang et al.

Compared with IP multicast, Overlay Multicast (OM) offers better compatibility and flexible deployment in heterogeneous, cross-domain networks. However, traditional OM struggles to adapt to dynamic traffic due to unawareness of physical resource states, and existing reinforcement learning methods fail to decouple OM's tightly coupled multi-objective nature, leading to high complexity, slow convergence, and instability. To address this, we propose MA-DHRL-OM, a multi-agent deep hierarchical reinforcement learning approach. Using SDN's global view, it builds a traffic-aware model for OM path planning. The method decomposes OM tree construction into two stages via hierarchical agents, reducing action space and improving convergence stability. Multi-agent collaboration balances multi-objective optimization while enhancing scalability and adaptability. Experiments show MA-DHRL-OM outperforms existing methods in delay, bandwidth utilization, and packet loss, with more stable convergence and flexible routing.

CVJan 29
MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations

Xinan He, Kaiqing Lin, Yue Zhou et al.

With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real and cutting-edge high-fidelity fake is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel composition logic of consecutive adjacent frames residual in AI videos exhibits a structured and homogenous characteristic. We term this phenomenon `Manifold Projection Fluctuations' (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.

CVNov 13, 2025
Fairness-Aware Deepfake Detection: Leveraging Dual-Mechanism Optimization

Feng Ding, Wenhui Yi, Yunpeng Zhou et al.

Fairness is a core element in the trustworthy deployment of deepfake detection models, especially in the field of digital identity security. Biases in detection models toward different demographic groups, such as gender and race, may lead to systemic misjudgments, exacerbating the digital divide and social inequities. However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. Our proposed method innovatively integrates structural fairness decoupling and global distribution alignment: decoupling channels sensitive to demographic groups at the model architectural level, and subsequently reducing the distance between the overall sample distribution and the distributions corresponding to each demographic group at the feature level. Experimental results demonstrate that, compared with other methods, our framework improves both inter-group and intra-group fairness while maintaining overall detection accuracy across domains.

CVSep 18, 2025Code
LSTC-MDA: A Unified Framework for Long-Short Term Temporal Convolution and Mixed Data Augmentation in Skeleton-Based Action Recognition

Feng Ding, Haisheng Fu, Soroush Oraki et al.

Skeleton-based action recognition faces two longstanding challenges: the scarcity of labeled training samples and difficulty modeling short- and long-range temporal dependencies. To address these issues, we propose a unified framework, LSTC-MDA, which simultaneously improves temporal modeling and data diversity. We introduce a novel Long-Short Term Temporal Convolution (LSTC) module with parallel short- and long-term branches, these two feature branches are then aligned and fused adaptively using learned similarity weights to preserve critical long-range cues lost by conventional stride-2 temporal convolutions. We also extend Joint Mixing Data Augmentation (JMDA) with an Additive Mixup at the input level, diversifying training samples and restricting mixup operations to the same camera view to avoid distribution shifts. Ablation studies confirm each component contributes. LSTC-MDA achieves state-of-the-art results: 94.1% and 97.5% on NTU 60 (X-Sub and X-View), 90.4% and 92.0% on NTU 120 (X-Sub and X-Set),97.2% on NW-UCLA. Code: https://github.com/xiaobaoxia/LSTC-MDA.

AIMay 12, 2025Code
HALO: Half Life-Based Outdated Fact Filtering in Temporal Knowledge Graphs

Feng Ding, Tingting Wang, Yupeng Gao et al.

Outdated facts in temporal knowledge graphs (TKGs) result from exceeding the expiration date of facts, which negatively impact reasoning performance on TKGs. However, existing reasoning methods primarily focus on positive importance of historical facts, neglecting adverse effects of outdated facts. Besides, training on these outdated facts yields extra computational cost. To address these challenges, we propose an outdated fact filtering framework named HALO, which quantifies the temporal validity of historical facts by exploring the half-life theory to filter outdated facts in TKGs. HALO consists of three modules: the temporal fact attention module, the dynamic relation-aware encoder module, and the outdated fact filtering module. Firstly, the temporal fact attention module captures the evolution of historical facts over time to identify relevant facts. Secondly, the dynamic relation-aware encoder module is designed for efficiently predicting the half life of each fact. Finally, we construct a time decay function based on the half-life theory to quantify the temporal validity of facts and filter outdated facts. Experimental results show that HALO outperforms the state-of-the-art TKG reasoning methods on three public datasets, demonstrating its effectiveness in detecting and filtering outdated facts (Codes are available at https://github.com/yushuowiki/K-Half/tree/main ).

CVJun 14, 2024Code
Decoupling Forgery Semantics for Generalizable Deepfake Detection

Wei Ye, Xinan He, Feng Ding

In this paper, we propose a novel method for detecting DeepFakes, enhancing the generalization of detection through semantic decoupling. There are now multiple DeepFake forgery technologies that not only possess unique forgery semantics but may also share common forgery semantics. The unique forgery semantics and irrelevant content semantics may promote over-fitting and hamper generalization for DeepFake detectors. For our proposed method, after decoupling, the common forgery semantics could be extracted from DeepFakes, and subsequently be employed for developing the generalizability of DeepFake detectors. Also, to pursue additional generalizability, we designed an adaptive high-pass module and a two-stage training strategy to improve the independence of decoupled semantics. Evaluation on FF++, Celeb-DF, DFD, and DFDC datasets showcases our method's excellent detection and generalization performance. Code is available at: https://github.com/leaffeall/DFS-GDD.

CVNov 23, 2021Code
Self-Supervised Pre-Training for Transformer-Based Person Re-Identification

Hao Luo, Pichao Wang, Yi Xu et al.

Transformer-based supervised pre-training achieves great performance in person re-identification (ReID). However, due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset (e.g. ImageNet-21K) to boost the performance because of the strong data fitting ability of the transformer. To address this challenge, this work targets to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure, respectively. We first investigate self-supervised learning (SSL) methods with Vision Transformer (ViT) pretrained on unlabelled person images (the LUPerson dataset), and empirically find it significantly surpasses ImageNet supervised pre-training models on ReID tasks. To further reduce the domain gap and accelerate the pre-training, the Catastrophic Forgetting Score (CFS) is proposed to evaluate the gap between pre-training and fine-tuning data. Based on CFS, a subset is selected via sampling relevant data close to the down-stream ReID data and filtering irrelevant data from the pre-training dataset. For the model structure, a ReID-specific module named IBN-based convolution stem (ICS) is proposed to bridge the domain gap by learning more invariant features. Extensive experiments have been conducted to fine-tune the pre-training models under supervised learning, unsupervised domain adaptation (UDA), and unsupervised learning (USL) settings. We successfully downscale the LUPerson dataset to 50% with no performance degradation. Finally, we achieve state-of-the-art performance on Market-1501 and MSMT17. For example, our ViT-S/16 achieves 91.3%/89.9%/89.6% mAP accuracy on Market1501 for supervised/UDA/USL ReID. Codes and models will be released to https://github.com/michuanhaohao/TransReID-SSL.

86.2IVMay 10
ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality

Feng Ding, Haisheng Fu, Jie Liang et al.

We study full-reference image quality assessment from a machine-centric perspective, where images are evaluated by how well they preserve information for downstream models. We formulate machine-oriented quality as a latent machine utility and approximate it through pairwise predictive-consistency comparisons. To this end, we construct PCMP, a dataset of PSNR-matched distortion pairs labeled by consistency votes from multiple pretrained models. We further propose ML-CLIPSim, a differentiable quality metric built on a frozen CLIP visual encoder, which aggregates intermediate patch-token similarities and global image embeddings. Experiments on machine-preference benchmarks, human-IQA datasets, and learned image compression show that ML-CLIPSim better aligns with machine-oriented preferences than conventional fidelity and perceptual metrics, while remaining competitive for human quality prediction. Used as a compression distortion term, it improves rate--task trade-offs across multiple downstream tasks.

CVFeb 2
Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

Yue Zhou, Xinan He, Kaiqing Lin et al.

While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer from a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models , including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30\%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer from performance degradation under recapture and transmission, remain blind to VAE reconstruction and localized editing. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.

CVDec 17, 2023
Synthesizing Black-box Anti-forensics DeepFakes with High Visual Quality

Bing Fan, Shu Hu, Feng Ding

DeepFake, an AI technology for creating facial forgeries, has garnered global attention. Amid such circumstances, forensics researchers focus on developing defensive algorithms to counter these threats. In contrast, there are techniques developed for enhancing the aggressiveness of DeepFake, e.g., through anti-forensics attacks, to disrupt forensic detectors. However, such attacks often sacrifice image visual quality for improved undetectability. To address this issue, we propose a method to generate novel adversarial sharpening masks for launching black-box anti-forensics attacks. Unlike many existing arts, with such perturbations injected, DeepFakes could achieve high anti-forensics performance while exhibiting pleasant sharpening visual effects. After experimental evaluations, we prove that the proposed method could successfully disrupt the state-of-the-art DeepFake detectors. Besides, compared with the images processed by existing DeepFake anti-forensics methods, the visual qualities of anti-forensics DeepFakes rendered by the proposed method are significantly refined.

CVMar 8, 2025
VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models

Xinan He, Yue Zhou, Bing Fan et al.

Faces synthesized by diffusion models (DMs) with high-quality and controllable attributes pose a significant challenge for Deepfake detection. Most state-of-the-art detectors only yield a binary decision, incapable of forgery localization, attribution of forgery methods, and providing analysis on the cause of forgeries. In this work, we integrate Multimodal Large Language Models (MLLMs) within DM-based face forensics, and propose a fine-grained analysis triad framework called VLForgery, that can 1) predict falsified facial images; 2) locate the falsified face regions subjected to partial synthesis; and 3) attribute the synthesis with specific generators. To achieve the above goals, we introduce VLF (Visual Language Forensics), a novel and diverse synthesis face dataset designed to facilitate rich interactions between Visual and Language modalities in MLLMs. Additionally, we propose an extrinsic knowledge-guided description method, termed EkCot, which leverages knowledge from the image generation pipeline to enable MLLMs to quickly capture image content. Furthermore, we introduce a low-level vision comparison pipeline designed to identify differential features between real and fake that MLLMs can inherently understand. These features are then incorporated into EkCot, enhancing its ability to analyze forgeries in a structured manner, following the sequence of detection, localization, and attribution. Extensive experiments demonstrate that VLForgery outperforms other state-of-the-art forensic approaches in detection accuracy, with additional potential for falsified region localization and attribution analysis.

CVJun 1, 2025
Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection

Yue Zhou, Xinan He, KaiQing Lin et al.

Current AIGC detectors often achieve near-perfect accuracy on images produced by the same generator used for training but struggle to generalize to outputs from unseen generators. We trace this failure in part to latent prior bias: detectors learn shortcuts tied to patterns stemming from the initial noise vector rather than learning robust generative artifacts. To address this, we propose On-Manifold Adversarial Training (OMAT): by optimizing the initial latent noise of diffusion models under fixed conditioning, we generate on-manifold adversarial examples that remain on the generator's output manifold-unlike pixel-space attacks, which introduce off-manifold perturbations that the generator itself cannot reproduce and that can obscure the true discriminative artifacts. To test against state-of-the-art generative models, we introduce GenImage++, a test-only benchmark of outputs from advanced generators (Flux.1, SD3) with extended prompts and diverse styles. We apply our adversarial-training paradigm to ResNet50 and CLIP baselines and evaluate across existing AIGC forensic benchmarks and recent challenge datasets. Extensive experiments show that adversarially trained detectors significantly improve cross-generator performance without any network redesign. Our findings on latent-prior bias offer valuable insights for future dataset construction and detector evaluation, guiding the development of more robust and generalizable AIGC forensic methodologies.

CVJul 1, 2025
Customizable ROI-Based Deep Image Compression

Jian Jin, Fanxin Xia, Feng Ding et al.

Region of Interest (ROI)-based image compression optimizes bit allocation by prioritizing ROI for higher-quality reconstruction. However, as the users (including human clients and downstream machine tasks) become more diverse, ROI-based image compression needs to be customizable to support various preferences. For example, different users may define distinct ROI or require different quality trade-offs between ROI and non-ROI. Existing ROI-based image compression schemes predefine the ROI, making it unchangeable, and lack effective mechanisms to balance reconstruction quality between ROI and non-ROI. This work proposes a paradigm for customizable ROI-based deep image compression. First, we develop a Text-controlled Mask Acquisition (TMA) module, which allows users to easily customize their ROI for compression by just inputting the corresponding semantic \emph{text}. It makes the encoder controlled by text. Second, we design a Customizable Value Assign (CVA) mechanism, which masks the non-ROI with a changeable extent decided by users instead of a constant one to manage the reconstruction quality trade-off between ROI and non-ROI. Finally, we present a Latent Mask Attention (LMA) module, where the latent spatial prior of the mask and the latent Rate-Distortion Optimization (RDO) prior of the image are extracted and fused in the latent space, and further used to optimize the latent representation of the source image. Experimental results demonstrate that our proposed customizable ROI-based deep image compression paradigm effectively addresses the needs of customization for ROI definition and mask acquisition as well as the reconstruction quality trade-off management between the ROI and non-ROI.

CVSep 16, 2025
Brought a Gun to a Knife Fight: Modern VFM Baselines Outgun Specialized Detectors on In-the-Wild AI Image Detection

Yue Zhou, Xinan He, Kaiqing Lin et al.

While specialized detectors for AI-generated images excel on curated benchmarks, they fail catastrophically in real-world scenarios, as evidenced by their critically high false-negative rates on `in-the-wild' benchmarks. Instead of crafting another specialized `knife' for this problem, we bring a `gun' to the fight: a simple linear classifier on a modern Vision Foundation Model (VFM). Trained on identical data, this baseline decisively `outguns' bespoke detectors, boosting in-the-wild accuracy by a striking margin of over 20\%. Our analysis pinpoints the source of the VFM's `firepower': First, by probing text-image similarities, we find that recent VLMs (e.g., Perception Encoder, Meta CLIP2) have learned to align synthetic images with forgery-related concepts (e.g., `AI-generated'), unlike previous versions. Second, we speculate that this is due to data exposure, as both this alignment and overall accuracy plummet on a novel dataset scraped after the VFM's pre-training cut-off date, ensuring it was unseen during pre-training. Our findings yield two critical conclusions: 1) For the real-world `gunfight' of AI-generated image detection, the raw `firepower' of an updated VFM is far more effective than the `craftsmanship' of a static detector. 2) True generalization evaluation requires test data to be independent of the model's entire training history, including pre-training.

13.0CVApr 2
SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition

Soroush Oraki, Feng Ding, Jie Liang

Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.

CVMar 30, 2025
TraceMark-LDM: Authenticatable Watermarking for Latent Diffusion Models via Binary-Guided Rearrangement

Wenhao Luo, Zhangyi Shen, Ye Yao et al.

Image generation algorithms are increasingly integral to diverse aspects of human society, driven by their practical applications. However, insufficient oversight in artificial Intelligence generated content (AIGC) can facilitate the spread of malicious content and increase the risk of copyright infringement. Among the diverse range of image generation models, the Latent Diffusion Model (LDM) is currently the most widely used, dominating the majority of the Text-to-Image model market. Currently, most attribution methods for LDMs rely on directly embedding watermarks into the generated images or their intermediate noise, a practice that compromises both the quality and the robustness of the generated content. To address these limitations, we introduce TraceMark-LDM, an novel algorithm that integrates watermarking to attribute generated images while guaranteeing non-destructive performance. Unlike current methods, TraceMark-LDM leverages watermarks as guidance to rearrange random variables sampled from a Gaussian distribution. To mitigate potential deviations caused by inversion errors, the small absolute elements are grouped and rearranged. Additionally, we fine-tune the LDM encoder to enhance the robustness of the watermark. Experimental results show that images synthesized using TraceMark-LDM exhibit superior quality and attribution accuracy compared to state-of-the-art (SOTA) techniques. Notably, TraceMark-LDM demonstrates exceptional robustness against various common attack methods, consistently outperforming SOTA methods.

CVNov 26, 2024
Crack Detection in Infrastructure Using Transfer Learning, Spatial Attention, and Genetic Algorithm Optimization

Feng Ding

Crack detection plays a pivotal role in the maintenance and safety of infrastructure, including roads, bridges, and buildings, as timely identification of structural damage can prevent accidents and reduce costly repairs. Traditionally, manual inspection has been the norm, but it is labor-intensive, subjective, and hazardous. This paper introduces an advanced approach for crack detection in infrastructure using deep learning, leveraging transfer learning, spatial attention mechanisms, and genetic algorithm(GA) optimization. To address the challenge of the inaccessability of large amount of data, we employ ResNet50 as a pre-trained model, utilizing its strong feature extraction capabilities while reducing the need for extensive training datasets. We enhance the model with a spatial attention layer as well as a customized neural network which architecture was fine-tuned using GA. A comprehensive case study demonstrates the effectiveness of the proposed Attention-ResNet50-GA model, achieving a precision of 0.9967 and an F1 score of 0.9983, outperforming conventional methods. The results highlight the model's ability to accurately detect cracks in various conditions, making it highly suitable for real-world applications where large annotated datasets are scarce.