CVMay 27Code
REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation DetectionJun Zhou, Bingwen Hu, Yaxiong Wang et al.
Multimodal manipulation detection aims to simultaneously identify forged image--text pairs and localize tampered regions, yet existing methods typically rely on memorizing isolated artifacts and struggle with imperceptible manipulation traces or domain shifts. Inspired by human comparative reasoning, we reformulate this task as a reference-grounded verification problem, where authenticity is assessed by comparing a query against retrieved authentic evidence. We propose REVEAL Reference-Enabled Verification for Evidence Analysis and Localization), a framework explicitly designed for this comparative paradigm. To support this paradigm, we construct a large-scale reference library comprising 170K authentic news image--text pairs featuring over 40K public figures. Technically, REVEAL employs a difference-aware fusion mechanism to capture fine-grained discrepancies between the query and retrieved evidence. Furthermore, we introduce a task-decoupled Mixture-of-Experts (MoE) architecture to jointly execute instance-level detection and fine-grained grounding, effectively mitigating optimization conflicts between these heterogeneous objectives. Extensive experiments demonstrate that REVEAL significantly outperforms state-of-the-art methods, and notably enables \emph{training-free domain adaptation} by simply updating the reference library, offering a robust and practical solution for detecting evolving misinformation. Code is available at https://anonymous.4open.science/r/REVEAL-Reference-A006.
CVAug 13, 2023
CLE Diffusion: Controllable Light Enhancement Diffusion ModelYuyang Yin, Dejia Xu, Chuangchuang Tan et al.
Low light enhancement has gained increasing importance with the rapid development of visual creation and editing. However, most existing enhancement algorithms are designed to homogeneously increase the brightness of images to a pre-defined extent, limiting the user experience. To address this issue, we propose Controllable Light Enhancement Diffusion Model, dubbed CLE Diffusion, a novel diffusion framework to provide users with rich controllability. Built with a conditional diffusion model, we introduce an illumination embedding to let users control their desired brightness level. Additionally, we incorporate the Segment-Anything Model (SAM) to enable user-friendly region controllability, where users can click on objects to specify the regions they wish to enhance. Extensive experiments demonstrate that CLE Diffusion achieves competitive performance regarding quantitative metrics, qualitative results, and versatile controllability. Project page: https://yuyangyin.github.io/CLEDiffusion/
AIJan 23Code
LongCat-Flash-Thinking-2601 Technical ReportMeituan LongCat Team, Anchun Gui, Bei Li et al.
We introduce LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model with superior agentic reasoning capability. LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on a wide range of agentic benchmarks, including agentic search, agentic tool use, and tool-integrated reasoning. Beyond benchmark performance, the model demonstrates strong generalization to complex tool interactions and robust behavior under noisy real-world environments. Its advanced capability stems from a unified training framework that combines domain-parallel expert training with subsequent fusion, together with an end-to-end co-design of data construction, environments, algorithms, and infrastructure spanning from pre-training to post-training. In particular, the model's strong generalization capability in complex tool-use are driven by our in-depth exploration of environment scaling and principled task construction. To optimize long-tailed, skewed generation and multi-turn agentic interactions, and to enable stable training across over 10,000 environments spanning more than 20 domains, we systematically extend our asynchronous reinforcement learning framework, DORA, for stable and efficient large-scale multi-environment training. Furthermore, recognizing that real-world tasks are inherently noisy, we conduct a systematic analysis and decomposition of real-world noise patterns, and design targeted training procedures to explicitly incorporate such imperfections into the training process, resulting in improved robustness for real-world applications. To further enhance performance on complex reasoning tasks, we introduce a Heavy Thinking mode that enables effective test-time scaling by jointly expanding reasoning depth and width through intensive parallel thinking.
LGSep 29, 2022
Meta Knowledge Condensation for Federated LearningPing Liu, Xin Yu, Joey Tianyi Zhou
Existing federated learning paradigms usually extensively exchange distributed models at a central solver to achieve a more powerful model. However, this would incur severe communication burden between a server and multiple clients especially when data distributions are heterogeneous. As a result, current federated learning methods often require a large number of communication rounds in training. Unlike existing paradigms, we introduce an alternative perspective to significantly decrease the communication cost in federate learning. In this work, we first introduce a meta knowledge representation method that extracts meta knowledge from distributed clients. The extracted meta knowledge encodes essential information that can be used to improve the current model. As the training progresses, the contributions of training samples to a federated model also vary. Thus, we introduce a dynamic weight assignment mechanism that enables samples to contribute adaptively to the current model update. Then, informative meta knowledge from all active clients is sent to the server for model update. Training a model on the combined meta knowledge without exposing original data among different clients can significantly mitigate the heterogeneity issues. Moreover, to further ameliorate data heterogeneity, we also exchange meta knowledge among clients as conditional initialization for local meta knowledge extraction. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method. Remarkably, our method outperforms the state-of-the-art by a large margin (from $74.07\%$ to $92.95\%$) on MNIST with a restricted communication budget (i.e. 10 rounds).
IRMay 22
A Unified Structured Query Understanding Framework for Industrial Semantic SearchPing Liu, Qianqi Shen, Jianqiang Shen et al.
Query understanding in large-scale industrial search systems is typically implemented as a cascade of disparate, task-specific components. While individually optimizable, this fragmented architecture incurs high maintenance overhead and results in inconsistent behaviors, particularly for long-tail queries. In this work, we propose and deploy a unified structured query understanding system that consolidates these heterogeneous functions into a single Small Language Model (SLM) that performs schema-constrained generation. To address the data bottlenecks inherent in unified modeling, we introduce Query Illuminator, a dual-purpose framework serving as: (i) a teacher model for high-quality auto-annotation and distillation, and (ii) a surrogate judge for scalable evaluation where human labels are scarce. We validate this approach through extensive offline and online tests within LinkedIn's Job Search system. Furthermore, we demonstrate the framework's horizontal extensibility through a cross-domain case study on People Search. The results show improved user engagement and reduced operational costs, achieved while satisfying strict low-latency serving constraints on limited GPU resources.
CVApr 24, 2023
UTSGAN: Unseen Transition Suss GAN for Transition-Aware Image-to-image TranslationYaxin Shi, Xiaowei Zhou, Ping Liu et al.
In the field of Image-to-Image (I2I) translation, ensuring consistency between input images and their translated results is a key requirement for producing high-quality and desirable outputs. Previous I2I methods have relied on result consistency, which enforces consistency between the translated results and the ground truth output, to achieve this goal. However, result consistency is limited in its ability to handle complex and unseen attribute changes in translation tasks. To address this issue, we introduce a transition-aware approach to I2I translation, where the data translation mapping is explicitly parameterized with a transition variable, allowing for the modelling of unobserved translations triggered by unseen transitions. Furthermore, we propose the use of transition consistency, defined on the transition variable, to enable regularization of consistency on unobserved translations, which is omitted in previous works. Based on these insights, we present Unseen Transition Suss GAN (UTSGAN), a generative framework that constructs a manifold for the transition with a stochastic transition encoder and coherently regularizes and generalizes result consistency and transition consistency on both training and unobserved translations with tailor-designed constraints. Extensive experiments on four different I2I tasks performed on five different datasets demonstrate the efficacy of our proposed UTSGAN in performing consistent translations.
SPApr 8, 2022
Dynamic super-resolution in particle tracking problemsPing Liu, Habib Ammari
Particle tracking in biological imaging is concerned with reconstructing the trajectories, locations, or velocities of the targeting particles. The standard approach of particle tracking consists of two steps: first reconstructing statically the source locations in each time step, and second applying tracking techniques to obtain the trajectories and velocities. In contrast, the dynamic reconstruction seeks to simultaneously recover the source locations and velocities from all frames, which enjoys certain advantages. In this paper, we provide a rigorous mathematical analysis for the resolution limit of reconstructing source number, locations, and velocities by general dynamical reconstruction in particle tracking problems, by which we demonstrate the possibility of achieving super-resolution for the dynamic reconstruction. We show that when the location-velocity pairs of the particles are separated beyond certain distances (the resolution limits), the number of particles and the location-velocity pair can be stably recovered. The resolution limits are related to the cut-off frequency of the imaging system, signal-to-noise ratio, and the sparsity of the source. By these estimates, we also derive a stability result for a sparsity-promoting dynamic reconstruction. In addition, we further show that the reconstruction of velocities has a better resolution limit which improves constantly as the particles moving. This result is derived by an observation that the inherent cut-off frequency for the velocity recovery can be viewed as the total observation time multiplies the cut-off frequency of the imaging system, which may lead to a better resolution limit as compared to the one for each diffraction-limited frame. It is anticipated that this observation can inspire new reconstruction algorithms that improve the resolution of particle tracking in practice.
CVApr 25, 2023
Text-guided Eyeglasses Manipulation with Spatial ConstraintsJiacheng Wang, Ping Liu, Jingen Liu et al.
Virtual try-on of eyeglasses involves placing eyeglasses of different shapes and styles onto a face image without physically trying them on. While existing methods have shown impressive results, the variety of eyeglasses styles is limited and the interactions are not always intuitive or efficient. To address these limitations, we propose a Text-guided Eyeglasses Manipulation method that allows for control of the eyeglasses shape and style based on a binary mask and text, respectively. Specifically, we introduce a mask encoder to extract mask conditions and a modulation module that enables simultaneous injection of text and mask conditions. This design allows for fine-grained control of the eyeglasses' appearance based on both textual descriptions and spatial constraints. Our approach includes a disentangled mapper and a decoupling strategy that preserves irrelevant areas, resulting in better local editing. We employ a two-stage training scheme to handle the different convergence speeds of the various modality conditions, successfully controlling both the shape and style of eyeglasses. Extensive comparison experiments and ablation analyses demonstrate the effectiveness of our approach in achieving diverse eyeglasses styles while preserving irrelevant areas.
CVMay 25
Where Concept Erasure Should Occur: Concept-Layer Alignment in Text-to-Video Diffusion ModelsYiwei Xie, Ping Liu, Zheng Zhang
Text-to-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept-layer topological alignment, under which target concepts exhibit higher separability at certain representational depths. Outside these depths, concept and non-target signals remain strongly entangled, limiting the effectiveness of depth-specific erasure. This observation reframes concept erasure as the problem of identifying representational depths where concept-non-target separation naturally emerges. Motivated by this structural constraint, we introduce CLEAR, a separability-driven optimization framework for concept erasure that explicitly enforces concept-layer alignment. CLEAR operationalizes this principle by formulating layer selection as an optimization problem over concept-non-target separability, rather than relying on layer-agnostic or heuristic choices. To enable this, we introduce a separability-aware objective that favors layers exhibiting stronger concept-non-target separation. Experiments on large-scale text-to-video diffusion models demonstrate that enforcing concept--layer alignment leads to more precise concept suppression while preserving overall generative quality.
CVApr 18, 2023
Dual Stage Stylization Modulation for Domain Generalized Semantic SegmentationGabriel Tjio, Ping Liu, Chee-Keong Kwoh et al.
Obtaining sufficient labeled data for training deep models is often challenging in real-life applications. To address this issue, we propose a novel solution for single-source domain generalized semantic segmentation. Recent approaches have explored data diversity enhancement using hallucination techniques. However, excessive hallucination can degrade performance, particularly for imbalanced datasets. As shown in our experiments, minority classes are more susceptible to performance reduction due to hallucination compared to majority classes. To tackle this challenge, we introduce a dual-stage Feature Transform (dFT) layer within the Adversarial Semantic Hallucination+ (ASH+) framework. The ASH+ framework performs a dual-stage manipulation of hallucination strength. By leveraging semantic information for each pixel, our approach adaptively adjusts the pixel-wise hallucination strength, thus providing fine-grained control over hallucination. We validate the effectiveness of our proposed method through comprehensive experiments on publicly available semantic segmentation benchmark datasets (Cityscapes and SYNTHIA). Quantitative and qualitative comparisons demonstrate that our approach is competitive with state-of-the-art methods for the Cityscapes dataset and surpasses existing solutions for the SYNTHIA dataset. Code for our framework will be made readily available to the research community.
CVJul 3, 2023
Generating Reliable Pixel-Level Labels for Source Free Domain AdaptationGabriel Tjio, Ping Liu, Yawei Luo et al.
This work addresses the challenging domain adaptation setting in which knowledge from the labelled source domain dataset is available only from the pretrained black-box segmentation model. The pretrained model's predictions for the target domain images are noisy because of the distributional differences between the source domain data and the target domain data. Since the model's predictions serve as pseudo labels during self-training, the noise in the predictions impose an upper bound on model performance. Therefore, we propose a simple yet novel image translation workflow, ReGEN, to address this problem. ReGEN comprises an image-to-image translation network and a segmentation network. Our workflow generates target-like images using the noisy predictions from the original target domain images. These target-like images are semantically consistent with the noisy model predictions and therefore can be used to train the segmentation network. In addition to being semantically consistent with the predictions from the original target domain images, the generated target-like images are also stylistically similar to the target domain images. This allows us to leverage the stylistic differences between the target-like images and the target domain image as an additional source of supervision while training the segmentation model. We evaluate our model with two benchmark domain adaptation settings and demonstrate that our approach performs favourably relative to recent state-of-the-art work. The source code will be made available.
CVAug 13, 2024
Breaking Class Barriers: Efficient Dataset Distillation via Inter-Class Feature CompensatorXin Zhang, Jiawei Du, Ping Liu et al.
Dataset distillation has emerged as a technique aiming to condense informative features from large, natural datasets into a compact and synthetic form. While recent advancements have refined this technique, its performance is bottlenecked by the prevailing class-specific synthesis paradigm. Under this paradigm, synthetic data is optimized exclusively for a pre-assigned one-hot label, creating an implicit class barrier in feature condensation. This leads to inefficient utilization of the distillation budget and oversight of inter-class feature distributions, which ultimately limits the effectiveness and efficiency, as demonstrated in our analysis. To overcome these constraints, this paper presents the Inter-class Feature Compensator (INFER), an innovative distillation approach that transcends the class-specific data-label framework widely utilized in current dataset distillation methods. Specifically, INFER leverages a Universal Feature Compensator (UFC) to enhance feature integration across classes, enabling the generation of multiple additional synthetic instances from a single UFC input. This significantly improves the efficiency of the distillation budget. Moreover, INFER enriches inter-class interactions during the distillation, thereby enhancing the effectiveness and generalizability of the distilled data. By allowing for the linear interpolation of labels similar to those in the original dataset, INFER meticulously optimizes the synthetic data and dramatically reduces the size of soft labels in the synthetic dataset to almost zero, establishing a new benchmark for efficiency and effectiveness in dataset distillation. In practice, INFER demonstrates state-of-the-art performance across benchmark datasets. For instance, in the ipc = 50 setting on ImageNet-1k with the same compression level, it outperforms SRe2L by 34.5% using ResNet18.
CVNov 27, 2023
VehicleGAN: Pair-flexible Pose Guided Image Synthesis for Vehicle Re-identificationBaolu Li, Ping Liu, Lan Fu et al.
Vehicle Re-identification (Re-ID) has been broadly studied in the last decade; however, the different camera view angle leading to confused discrimination in the feature subspace for the vehicles of various poses, is still challenging for the Vehicle Re-ID models in the real world. To promote the Vehicle Re-ID models, this paper proposes to synthesize a large number of vehicle images in the target pose, whose idea is to project the vehicles of diverse poses into the unified target pose so as to enhance feature discrimination. Considering that the paired data of the same vehicles in different traffic surveillance cameras might be not available in the real world, we propose the first Pair-flexible Pose Guided Image Synthesis method for Vehicle Re-ID, named as VehicleGAN in this paper, which works for both supervised and unsupervised settings without the knowledge of geometric 3D models. Because of the feature distribution difference between real and synthetic data, simply training a traditional metric learning based Re-ID model with data-level fusion (i.e., data augmentation) is not satisfactory, therefore we propose a new Joint Metric Learning (JML) via effective feature-level fusion from both real and synthetic data. Intensive experimental results on the public VeRi-776 and VehicleID datasets prove the accuracy and effectiveness of our proposed VehicleGAN and JML.
CVMar 12, 2024Code
Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space LearningChuangchuang Tan, Yao Zhao, Shikui Wei et al.
This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images despite limited training data. Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries. However, the rapid advancements in synthesis technology have led to specific artifacts for each generation model. Consequently, these detectors have exhibited a lack of proficiency in learning the frequency domain and tend to overfit to the artifacts present in the training data, leading to suboptimal performance on unseen sources. To address this issue, we introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors. Our method forces the detector to continuously focus on high-frequency information, exploiting high-frequency representation of features across spatial and channel dimensions. Additionally, we incorporate a straightforward frequency domain learning module to learn source-agnostic features. It involves convolutional layers applied to both the phase spectrum and amplitude spectrum between the Fast Fourier Transform (FFT) and Inverse Fast Fourier Transform (iFFT). Extensive experimentation involving 17 GANs demonstrates the effectiveness of our proposed method, showcasing state-of-the-art performance (+9.8\%) while requiring fewer parameters. The code is available at {\cred \url{https://github.com/chuangchuangtan/FreqNet-DeepfakeDetection}}.
CVMar 11, 2024Code
Data-Independent Operator: A Training-Free Artifact Representation Extractor for Generalizable Deepfake DetectionChuangchuang Tan, Ping Liu, RenShuai Tao et al.
Recently, the proliferation of increasingly realistic synthetic images generated by various generative adversarial networks has increased the risk of misuse. Consequently, there is a pressing need to develop a generalizable detector for accurately recognizing fake images. The conventional methods rely on generating diverse training sources or large pretrained models. In this work, we show that, on the contrary, the small and training-free filter is sufficient to capture more general artifact representations. Due to its unbias towards both the training and test sources, we define it as Data-Independent Operator (DIO) to achieve appealing improvements on unseen sources. In our framework, handcrafted filters and the randomly-initialized convolutional layer can be used as the training-free artifact representations extractor with excellent results. With the data-independent operator of a popular classifier, such as Resnet50, one could already reach a new state-of-the-art without bells and whistles. We evaluate the effectiveness of the DIO on 33 generation models, even DALLE and Midjourney. Our detector achieves a remarkable improvement of $13.3\%$, establishing a new state-of-the-art performance. The DIO and its extension can serve as strong baselines for future methods. The code is available at \url{https://github.com/chuangchuangtan/Data-Independent-Operator}.
LGFeb 7, 2024Code
Two Trades is not Baffled: Condensing Graph via Crafting Rational Gradient MatchingTianle Zhang, Yuchen Zhang, Kun Wang et al. · pku
Training on large-scale graphs has achieved remarkable results in graph representation learning, but its cost and storage have raised growing concerns. As one of the most promising directions, graph condensation methods address these issues by employing gradient matching, aiming to condense the full graph into a more concise yet information-rich synthetic set. Though encouraging, these strategies primarily emphasize matching directions of the gradients, which leads to deviations in the training trajectories. Such deviations are further magnified by the differences between the condensation and evaluation phases, culminating in accumulated errors, which detrimentally affect the performance of the condensed graphs. In light of this, we propose a novel graph condensation method named \textbf{C}raf\textbf{T}ing \textbf{R}ationa\textbf{L} trajectory (\textbf{CTRL}), which offers an optimized starting point closer to the original dataset's feature distribution and a more refined strategy for gradient matching. Theoretically, CTRL can effectively neutralize the impact of accumulated errors on the performance of condensed graphs. We provide extensive experiments on various graph datasets and downstream tasks to support the effectiveness of CTRL. Code is released at https://github.com/NUS-HPC-AI-Lab/CTRL.
CVMar 23Code
PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion ModelsYiwei Xie, Zheng Zhang, Ping Liu
Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the \textit{reactivation potential} of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at https://github.com/YiweiXie/PRObingBasedEvaluation.
IRMay 15
Policy-Grounded Dynamic Facet Suggestions for Job SearchDan Xu, Baofen Zheng, Qianqi Shen et al.
Job seekers often initiate search with short, underspecified queries. At LinkedIn, over 80% of job-related queries contain three or fewer keywords, making accurate user intent inference and relevant job retrieval particularly challenging. We present dynamic facet suggestion (DFS), an interactive query refinement mechanism that facilitates intent disambiguation by surfacing personalized semantic attributes conditioned on the joint user-query context in real time. We propose a policy-grounded, retrieval-augmented ranking framework for facet suggestion, comprising offline taxonomy curation, embedding-based retrieval of top-K candidates, and distilled small language model (SLM) based candidate scoring. The system is optimized for real-time serving via pointwise single-token scoring with batching and prefix caching. Offline evaluation demonstrates high precision for generated suggestions, and online A/B tests show significant improvements in suggestion engagement and job search outcomes.
LGMar 27
Machine Unlearning under Retain-Forget EntanglementJingpu Cheng, Ping Liu, Qianxiao Li et al.
Forgetting a subset in machine unlearning is rarely an isolated task. Often, retained samples that are closely related to the forget set can be unintentionally affected, particularly when they share correlated features from pretraining or exhibit strong semantic similarities. To address this challenge, we propose a novel two-phase optimization framework specifically designed to handle such retai-forget entanglements. In the first phase, an augmented Lagrangian method increases the loss on the forget set while preserving accuracy on less-related retained samples. The second phase applies a gradient projection step, regularized by the Wasserstein-2 distance, to mitigate performance degradation on semantically related retained samples without compromising the unlearning objective. We validate our approach through comprehensive experiments on multiple unlearning tasks, standard benchmark datasets, and diverse neural architectures, demonstrating that it achieves effective and reliable unlearning while outperforming existing baselines in both accuracy retention and removal fidelity.
CVMay 31, 2025Code
SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion ModelsYule Zhu, Ping Liu, Zhedong Zheng et al.
Diffusion models have recently enabled precise and photorealistic facial editing across a wide range of semantic attributes. Beyond single-step modifications, a growing class of applications now demands the ability to analyze and track sequences of progressive edits, such as stepwise changes to hair, makeup, or accessories. However, sequential editing introduces significant challenges in edit attribution and detection robustness, further complicated by the lack of large-scale, finely annotated benchmarks tailored explicitly for this task. We introduce SEED, a large-scale Sequentially Edited facE Dataset constructed via state-of-the-art diffusion models. SEED contains over 90,000 facial images with one to four sequential attribute modifications, generated using diverse diffusion-based editing pipelines (LEdits, SDXL, SD3). Each image is annotated with detailed edit sequences, attribute masks, and prompts, facilitating research on sequential edit tracking, visual provenance analysis, and manipulation robustness assessment. To benchmark this task, we propose FAITH, a frequency-aware transformer-based model that incorporates high-frequency cues to enhance sensitivity to subtle sequential changes. Comprehensive experiments, including systematic comparisons of multiple frequency-domain methods, demonstrate the effectiveness of FAITH and the unique challenges posed by SEED. SEED offers a challenging and flexible resource for studying progressive diffusion-based edits at scale. Dataset and code will be publicly released at: https://github.com/Zeus1037/SEED.
CRApr 12
SEED: A Large-Scale Benchmark for Provenance Tracing in Sequential Deepfake Facial EditsMengieong Hoi, Zhedong Zheng, Ping Liu et al.
Deepfake content on social networks is increasingly produced through multiple \emph{sequential} edits to biometric data such as facial imagery. Consequently, the final appearance of an image often reflects a latent chain of operations rather than a single manipulation. Recovering these editing histories is essential for visual provenance analysis, misinformation auditing, and forensic or platform moderation workflows that must trace the origin and evolution of AI-generated media. However, existing datasets predominantly focus on single-step editing and overlook the cumulative artifacts introduced by realistic multi-step pipelines. To address this gap, we introduce Sequential Editing in Diffusion (\textbf{SEED}), a large-scale benchmark for sequential provenance tracing in facial imagery. SEED contains over 90K images constructed via one to four sequential attribute edits using diffusion-based editing pipelines, with fine-grained annotations including edit order, textual instructions, manipulation masks, and generation models. These metadata enable step-wise evidence analysis and support forgery detection, sequence prediction. To benchmark the challenges posed by SEED, we evaluate representative analysis strategies and observe that spatial-only approaches struggle under subtle and distributed diffusion artifacts, especially when such artifacts accumulate across multiple edits. Motivated by this observation, we further establish \textbf{FAITH}, a frequency-aware Transformer baseline that aggregates spatial and frequency-domain cues to identify and order latent editing events. Results show that high-frequency signals, particularly wavelet components, provide effective cues even under image degradation. Overall, SEED facilitates systematic study of sequential provenance tracing and evidence aggregation for trustworthy analysis of AI-generated visual content.
LGApr 11
Closed-Form Concept Erasure via Double ProjectionsChi Zhang, Jingpu Cheng, Zhixian Wang et al.
While modern generative models such as diffusion-based architectures have enabled impressive creative capabilities, they also raise important safety and ethical risks. These concerns have led to growing interest in concept erasure, the process of removing unwanted concepts from model representations. Existing approaches often achieve strong erasure performance but rely on iterative optimization and may inadvertently distort unrelated concepts. In this work, we present a simple yet principled alternative: a linear transformation framework that achieves concept erasure analytically, without any training. Our method adapts a pretrained model through two sequential, closed-form steps: first, computing a proxy projection of the target concept, and second, applying a constrained transformation within the left null space of known concept directions. This design yields a deterministic and geometrically interpretable procedure for safe, efficient, and theory-grounded concept removal. Across a wide range of experiments, including object and style erasure on multiple Stable Diffusion variants and the flow-matching model (FLUX), our approach matches or surpasses the performance of state-of-the-art methods while preserving non-target concepts more faithfully. Requiring only a few seconds to apply, it offers a lightweight and drop-in tool for controlled model editing, advancing the goal of safer and more responsible generative models.
AISep 25, 2024
Non-stationary BERT: Exploring Augmented IMU Data For Robust Human Activity RecognitionNing Sun, Yufei Wang, Yuwei Zhang et al.
Human Activity Recognition (HAR) has gained great attention from researchers due to the popularity of mobile devices and the need to observe users' daily activity data for better human-computer interaction. In this work, we collect a human activity recognition dataset called OPPOHAR consisting of phone IMU data. To facilitate the employment of HAR system in mobile phone and to achieve user-specific activity recognition, we propose a novel light-weight network called Non-stationary BERT with a two-stage training method. We also propose a simple yet effective data augmentation method to explore the deeper relationship between the accelerator and gyroscope data from the IMU. The network achieves the state-of-the-art performance testing on various activity recognition datasets and the data augmentation method demonstrates its wide applicability.
SESep 16, 2025Code
SCoGen: Scenario-Centric Graph-Based Synthesis of Real-World Code ProblemsXifeng Yao, Dongyu Lang, Wu Zhang et al.
Significant advancements have been made in the capabilities of code large language models, leading to their rapid adoption and application across a wide range of domains. However, their further advancements are often constrained by the scarcity of real-world coding problems. To bridge this gap, we propose a novel framework for synthesizing code problems that emulate authentic real-world scenarios. This framework systematically integrates domain knowledge, domain skills, and coding skills, all of which are meticulously extracted from real-world programming-related datasets, including Stack Overflow and Kaggle. The extracted elements serve as the foundational building blocks for constructing code problems. To align the generated problems with practical applications, application scenarios are also mined from the aforementioned datasets. These scenarios are then utilized to construct a scenario-centric graph that interconnects domain knowledge, domain skills, and coding skills. Based on this structured representation, a sampling strategy on the graph is designed, which effectively controls the generation of a code problem with complexity and diversity, reflects real-world challenges. Experimental results demonstrate that the proposed method consistently achieves superior performance over state-of-the-art open-source large language models of varying sizes and functionalities, including both coders and general-purpose models, across a diverse set of real-world benchmarks.
IVJul 1, 2025Code
Automated anatomy-based post-processing reduces false positives and improved interpretability of deep learning intracranial aneurysm detectionJisoo Kim, Chu-Hsuan Lin, Alberto Ceballos-Arroyo et al.
Introduction: Deep learning (DL) models can help detect intracranial aneurysms on CTA, but high false positive (FP) rates remain a barrier to clinical translation, despite improvement in model architectures and strategies like detection threshold tuning. We employed an automated, anatomy-based, heuristic-learning hybrid artery-vein segmentation post-processing method to further reduce FPs. Methods: Two DL models, CPM-Net and a deformable 3D convolutional neural network-transformer hybrid (3D-CNN-TR), were trained with 1,186 open-source CTAs (1,373 annotated aneurysms), and evaluated with 143 held-out private CTAs (218 annotated aneurysms). Brain, artery, vein, and cavernous venous sinus (CVS) segmentation masks were applied to remove possible FPs in the DL outputs that overlapped with: (1) brain mask; (2) vein mask; (3) vein more than artery masks; (4) brain plus vein mask; (5) brain plus vein more than artery masks. Results: CPM-Net yielded 139 true-positives (TP); 79 false-negative (FN); 126 FP. 3D-CNN-TR yielded 179 TP; 39 FN; 182 FP. FPs were commonly extracranial (CPM-Net 27.3%; 3D-CNN-TR 42.3%), venous (CPM-Net 56.3%; 3D-CNN-TR 29.1%), arterial (CPM-Net 11.9%; 3D-CNN-TR 53.3%), and non-vascular (CPM-Net 25.4%; 3D-CNN-TR 9.3%) structures. Method 5 performed best, reducing CPM-Net FP by 70.6% (89/126) and 3D-CNN-TR FP by 51.6% (94/182), without reducing TP, lowering the FP/case rate from 0.88 to 0.26 for CPM-NET, and from 1.27 to 0.62 for the 3D-CNN-TR. Conclusion: Anatomy-based, interpretable post-processing can improve DL-based aneurysm detection model performance. More broadly, automated, domain-informed, hybrid heuristic-learning processing holds promise for improving the performance and clinical acceptance of aneurysm detection models.
CVJun 11, 2024Code
Evolving from Single-modal to Multi-modal Facial Deepfake Detection: Progress and ChallengesPing Liu, Qiqi Tao, Joey Tianyi Zhou
As synthetic media, including video, audio, and text, become increasingly indistinguishable from real content, the risks of misinformation, identity fraud, and social manipulation escalate. This survey traces the evolution of deepfake detection from early single-modal methods to sophisticated multi-modal approaches that integrate audio-visual and text-visual cues. We present a structured taxonomy of detection techniques and analyze the transition from GAN-based to diffusion model-driven deepfakes, which introduce new challenges due to their heightened realism and robustness against detection. Unlike prior surveys that primarily focus on single-modal detection or earlier deepfake techniques, this work provides the most comprehensive study to date, encompassing the latest advancements in multi-modal deepfake detection, generalization challenges, proactive defense mechanisms, and emerging datasets specifically designed to support new interpretability and reasoning tasks. We further explore the role of Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) in strengthening detection robustness against increasingly sophisticated deepfake attacks. By systematically categorizing existing methods and identifying emerging research directions, this survey serves as a foundation for future advancements in combating AI-generated facial forgeries. A curated list of all related papers can be found at \href{https://github.com/qiqitao77/Comprehensive-Advances-in-Deepfake-Detection-Spanning-Diverse-Modalities}{https://github.com/qiqitao77/Awesome-Comprehensive-Deepfake-Detection}.
CVDec 16, 2024Code
CLIP-SR: Collaborative Linguistic and Image Processing for Super-ResolutionBingwen Hu, Heng Liu, Zhedong Zheng et al.
Convolutional Neural Networks (CNNs) have significantly advanced Image Super-Resolution (SR), yet most CNN-based methods rely solely on pixel-based transformations, often leading to artifacts and blurring, particularly under severe downsampling rates (\eg, 8$\times$ or 16$\times$). The recently developed text-guided SR approaches leverage textual descriptions to enhance their detail restoration capabilities but frequently struggle with effectively performing alignment, resulting in semantic inconsistencies. To address these challenges, we propose a multi-modal semantic enhancement framework that integrates textual semantics with visual features, effectively mitigating semantic mismatches and detail losses in highly degraded low-resolution (LR) images. Our method enables realistic, high-quality SR to be performed at large upscaling factors, with a maximum scaling ratio of 16$\times$. The framework integrates both text and image inputs using the prompt predictor, the Text-Image Fusion Block (TIFBlock), and the Iterative Refinement Module, leveraging Contrastive Language-Image Pretraining (CLIP) features to guide a progressive enhancement process with fine-grained alignment. This synergy produces high-resolution outputs with sharp textures and strong semantic coherence, even at substantial scaling factors. Extensive comparative experiments and ablation studies validate the effectiveness of our approach. Furthermore, by leveraging textual semantics, our method offers a degree of super-resolution editability, allowing for controlled enhancements while preserving semantic consistency. The code is available at https://github.com/hengliusky/CLIP-SR.
CVDec 16, 2023Code
Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake DetectionChuangchuang Tan, Huan Liu, Yao Zhao et al.
Recently, the proliferation of highly realistic synthetic images, facilitated through a variety of GANs and Diffusions, has significantly heightened the susceptibility to misuse. While the primary focus of deepfake detection has traditionally centered on the design of detection algorithms, an investigative inquiry into the generator architectures has remained conspicuously absent in recent years. This paper contributes to this lacuna by rethinking the architectures of CNN-based generators, thereby establishing a generalized representation of synthetic artifacts. Our findings illuminate that the up-sampling operator can, beyond frequency-based artifacts, produce generalized forgery artifacts. In particular, the local interdependence among image pixels caused by upsampling operators is significantly demonstrated in synthetic images generated by GAN or diffusion. Building upon this observation, we introduce the concept of Neighboring Pixel Relationships(NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations. A comprehensive analysis is conducted on an open-world dataset, comprising samples generated by \tft{28 distinct generative models}. This analysis culminates in the establishment of a novel state-of-the-art performance, showcasing a remarkable \tft{11.6\%} improvement over existing methods. The code is available at https://github.com/chuangchuangtan/NPR-DeepfakeDetection.
CVJun 8, 2021Code
Adversarial Semantic Hallucination for Domain Generalized Semantic SegmentationGabriel Tjio, Ping Liu, Joey Tianyi Zhou et al.
Convolutional neural networks typically perform poorly when the test (target domain) and training (source domain) data have significantly different distributions. While this problem can be mitigated by using the target domain data to align the source and target domain feature representations, the target domain data may be unavailable due to privacy concerns. Consequently, there is a need for methods that generalize well despite restricted access to target domain data during training. In this work, we propose an adversarial semantic hallucination approach (ASH), which combines a class-conditioned hallucination module and a semantic segmentation module. Since the segmentation performance varies across different classes, we design a semantic-conditioned style hallucination module to generate affine transformation parameters from semantic information in the segmentation probability maps of the source domain image. Unlike previous adaptation approaches, which treat all classes equally, ASH considers the class-wise differences. The segmentation module and the hallucination module compete adversarially, with the hallucination module generating increasingly "difficult" stylized images to challenge the segmentation module. In response, the segmentation module improves as it is trained with generated samples at an appropriate class-wise difficulty level. Our results on the Cityscapes and Mapillary benchmark datasets show that our method is competitive with state of the art work. Code is made available at https://github.com/gabriel-tjio/ASH.
CVDec 29, 2019Code
Very Long Natural Scenery Image Prediction by OutpaintingZongxin Yang, Jian Dong, Ping Liu et al.
Comparing to image inpainting, image outpainting receives less attention due to two challenges in it. The first challenge is how to keep the spatial and content consistency between generated images and original input. The second challenge is how to maintain high quality in generated results, especially for multi-step generations in which generated regions are spatially far away from the initial input. To solve the two problems, we devise some innovative modules, named Skip Horizontal Connection and Recurrent Content Transfer, and integrate them into our designed encoder-decoder structure. By this design, our network can generate highly realistic outpainting prediction effectively and efficiently. Other than that, our method can generate new images with very long sizes while keeping the same style and semantic content as the given input. To test the effectiveness of the proposed architecture, we collect a new scenery dataset with diverse, complicated natural scenes. The experimental results on this dataset have demonstrated the efficacy of our proposed network. The code and dataset are available from https://github.com/z-x-yang/NS-Outpainting.
CVNov 1, 2018Code
Filter Pruning via Geometric Median for Deep Convolutional Neural Networks AccelerationYang He, Ping Liu, Ziwei Wang et al.
Previous works utilized ''smaller-norm-less-important'' criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm deviation of the filters should be large; (2) the minimum norm of the filters should be small. To solve this problem, we propose a novel filter pruning method, namely Filter Pruning via Geometric Median (FPGM), to compress the model regardless of those two requirements. Unlike previous methods, FPGM compresses CNN models by pruning filters with redundancy, rather than those with ''relatively less'' importance. When applied to two image classification benchmarks, our method validates its usefulness and strengths. Notably, on CIFAR-10, FPGM reduces more than 52% FLOPs on ResNet-110 with even 2.69% relative accuracy improvement. Moreover, on ILSVRC-2012, FPGM reduces more than 42% FLOPs on ResNet-101 without top-5 accuracy drop, which has advanced the state-of-the-art. Code is publicly available on GitHub: https://github.com/he-y/filter-pruning-geometric-median
CVMay 8
Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision ModelsBincheng Peng, Guang Li, Ping Liu et al.
Dataset distillation compresses a large training set into a small synthetic set that preserves downstream training utility. While most existing methods target training networks from scratch, modern visual transfer learning often uses frozen pre-trained encoders followed by lightweight linear probing. Existing distillation methods for this setting either unroll iterative linear-probe updates with trajectory-based gradient matching, or rely on closed-form formulations originally designed for from-scratch training with neural-tangent-kernel (NTK) approximations. Neither route exploits the fact that frozen-feature linear probing admits a closed-form solution determined directly by the pre-trained features themselves, with no infinite-width approximation and no inner-loop trajectory. We propose Closed-Form Linear-Probe Dataset Distillation (CLP-DD), a bilevel formulation that computes the linear probe induced by the synthetic set with a sample-space kernel ridge solver. The synthetic images are then updated by evaluating this induced classifier on real features through a temperature-scaled softmax cross-entropy, where the classifier columns act as learned class anchors in feature space. We further show that the choice of outer objective is decisive: pairing the closed-form inner solver with a standard MSE outer loss substantially underperforms trajectory-based methods, while the discriminative outer loss closes most of the gap. On ImageNet-100 with four pre-trained backbones, CLP-DD substantially improves over LGM without DSA and approaches LGM with DSA at a fraction of the computational cost. On ImageNet-1K, CLP-DD matches or surpasses LGM with DSA on three of four backbones while running roughly $14\times$ faster and using less than one-eighth of the GPU memory.
LGFeb 17, 2024
LiGNN: Graph Neural Networks at LinkedInFedor Borisyuk, Shihai He, Yunbo Ouyang et al.
In this paper, we present LiGNN, a deployed large-scale Graph Neural Networks (GNNs) Framework. We share our insight on developing and deployment of GNNs at large scale at LinkedIn. We present a set of algorithmic improvements to the quality of GNN representation learning including temporal graph architectures with long term losses, effective cold start solutions via graph densification, ID embeddings and multi-hop neighbor sampling. We explain how we built and sped up by 7x our large-scale training on LinkedIn graphs with adaptive sampling of neighbors, grouping and slicing of training data batches, specialized shared-memory queue and local gradient optimization. We summarize our deployment lessons and learnings gathered from A/B test experiments. The techniques presented in this work have contributed to an approximate relative improvements of 1% of Job application hearing back rate, 2% Ads CTR lift, 0.5% of Feed engaged daily active users, 0.2% session lift and 0.1% weekly active user lift from people recommendation. We believe that this work can provide practical solutions and insights for engineers who are interested in applying Graph neural networks at large scale.
CVFeb 10, 2025
Grounding Creativity in Physics: A Brief Survey of Physical Priors in AIGCSiwei Meng, Yawei Luo, Ping Liu
Recent advancements in AI-generated content have significantly improved the realism of 3D and 4D generation. However, most existing methods prioritize appearance consistency while neglecting underlying physical principles, leading to artifacts such as unrealistic deformations, unstable dynamics, and implausible objects interactions. Incorporating physics priors into generative models has become a crucial research direction to enhance structural integrity and motion realism. This survey provides a review of physics-aware generative methods, systematically analyzing how physical constraints are integrated into 3D and 4D generation. First, we examine recent works in incorporating physical priors into static and dynamic 3D generation, categorizing methods based on representation types, including vision-based, NeRF-based, and Gaussian Splatting-based approaches. Second, we explore emerging techniques in 4D generation, focusing on methods that model temporal dynamics with physical simulations. Finally, we conduct a comparative analysis of major methods, highlighting their strengths, limitations, and suitability for different materials and motion dynamics. By presenting an in-depth analysis of physics-grounded AIGC, this survey aims to bridge the gap between generative models and physical realism, providing insights that inspire future research in physically consistent content generation.
CVFeb 8, 2025
The Evolution of Dataset Distillation: Toward Scalable and Generalizable SolutionsPing Liu, Jiawei Du
Dataset distillation, which condenses large-scale datasets into compact synthetic representations, has emerged as a critical solution for training modern deep learning models efficiently. While prior surveys focus on developments before 2023, this work comprehensively reviews recent advances, emphasizing scalability to large-scale datasets such as ImageNet-1K and ImageNet-21K. We categorize progress into a few key methodologies: trajectory matching, gradient matching, distribution matching, scalable generative approaches, and decoupling optimization mechanisms. As a comprehensive examination of recent dataset distillation advances, this survey highlights breakthrough innovations: the SRe2L framework for efficient and effective condensation, soft label strategies that significantly enhance model accuracy, and lossless distillation techniques that maximize compression while maintaining performance. Beyond these methodological advancements, we address critical challenges, including robustness against adversarial and backdoor attacks, effective handling of non-IID data distributions. Additionally, we explore emerging applications in video and audio processing, multi-modal learning, medical imaging, and scientific computing, highlighting its domain versatility. By offering extensive performance comparisons and actionable research directions, this survey equips researchers and practitioners with practical insights to advance efficient and generalizable dataset distillation, paving the way for future innovations.
LGFeb 20, 2024
LinkSAGE: Optimizing Job Matching Using Graph Neural NetworksPing Liu, Haichao Wei, Xiaochen Hou et al.
We present LinkSAGE, an innovative framework that integrates Graph Neural Networks (GNNs) into large-scale personalized job matching systems, designed to address the complex dynamics of LinkedIns extensive professional network. Our approach capitalizes on a novel job marketplace graph, the largest and most intricate of its kind in industry, with billions of nodes and edges. This graph is not merely extensive but also richly detailed, encompassing member and job nodes along with key attributes, thus creating an expansive and interwoven network. A key innovation in LinkSAGE is its training and serving methodology, which effectively combines inductive graph learning on a heterogeneous, evolving graph with an encoder-decoder GNN model. This methodology decouples the training of the GNN model from that of existing Deep Neural Nets (DNN) models, eliminating the need for frequent GNN retraining while maintaining up-to-date graph signals in near realtime, allowing for the effective integration of GNN insights through transfer learning. The subsequent nearline inference system serves the GNN encoder within a real-world setting, significantly reducing online latency and obviating the need for costly real-time GNN infrastructure. Validated across multiple online A/B tests in diverse product scenarios, LinkSAGE demonstrates marked improvements in member engagement, relevance matching, and member retention, confirming its generalizability and practical impact.
CVMar 20, 2024
DD-RobustBench: An Adversarial Robustness Benchmark for Dataset DistillationYifan Wu, Jiawei Du, Ping Liu et al.
Dataset distillation is an advanced technique aimed at compressing datasets into significantly smaller counterparts, while preserving formidable training performance. Significant efforts have been devoted to promote evaluation accuracy under limited compression ratio while overlooked the robustness of distilled dataset. In this work, we introduce a comprehensive benchmark that, to the best of our knowledge, is the most extensive to date for evaluating the adversarial robustness of distilled datasets in a unified way. Our benchmark significantly expands upon prior efforts by incorporating a wider range of dataset distillation methods, including the latest advancements such as TESLA and SRe2L, a diverse array of adversarial attack methods, and evaluations across a broader and more extensive collection of datasets such as ImageNet-1K. Moreover, we assessed the robustness of these distilled datasets against representative adversarial attack algorithms like PGD and AutoAttack, while exploring their resilience from a frequency perspective. We also discovered that incorporating distilled data into the training batches of the original dataset can yield to improvement of robustness.
CVJan 4, 2024
Unified Diffusion-Based Rigid and Non-Rigid Editing with Text and Image GuidanceJiacheng Wang, Ping Liu, Wei Xu
Existing text-to-image editing methods tend to excel either in rigid or non-rigid editing but encounter challenges when combining both, resulting in misaligned outputs with the provided text prompts. In addition, integrating reference images for control remains challenging. To address these issues, we present a versatile image editing framework capable of executing both rigid and non-rigid edits, guided by either textual prompts or reference images. We leverage a dual-path injection scheme to handle diverse editing scenarios and introduce an integrated self-attention mechanism for fusion of appearance and structural information. To mitigate potential visual artifacts, we further employ latent fusion techniques to adjust intermediate latents. Compared to previous work, our approach represents a significant advance in achieving precise and versatile image editing. Comprehensive experiments validate the efficacy of our method, showcasing competitive or superior results in text-based editing and appearance transfer tasks, encompassing both rigid and non-rigid settings.
LGJun 13, 2025
Fed-HeLLo: Efficient Federated Foundation Model Fine-Tuning with Heterogeneous LoRA AllocationZikai Zhang, Ping Liu, Jiahao Xu et al.
Federated Learning has recently been utilized to collaboratively fine-tune foundation models across multiple clients. Notably, federated low-rank adaptation LoRA-based fine-tuning methods have recently gained attention, which allows clients to fine-tune FMs with a small portion of trainable parameters locally. However, most existing methods do not account for the heterogeneous resources of clients or lack an effective local training strategy to maximize global fine-tuning performance under limited resources. In this work, we propose Fed-HeLLo, a novel federated LoRA-based fine-tuning framework that enables clients to collaboratively fine-tune an FM with different local trainable LoRA layers. To ensure its effectiveness, we develop several heterogeneous LoRA allocation (HLA) strategies that adaptively allocate local trainable LoRA layers based on clients' resource capabilities and the layer importance. Specifically, based on the dynamic layer importance, we design a Fisher Information Matrix score-based HLA that leverages dynamic gradient norm information. To better stabilize the training process, we consider the intrinsic importance of LoRA layers and design a Geometrically-Defined HLA strategy. It shapes the collective distribution of trainable LoRA layers into specific geometric patterns, such as Triangle, Inverted Triangle, Bottleneck, and Uniform. Moreover, we extend GD-HLA into a randomized version, named Randomized Geometrically-Defined HLA, for enhanced model accuracy with randomness. By co-designing the proposed HLA strategies, we incorporate both the dynamic and intrinsic layer importance into the design of our HLA strategy. We evaluate our approach on five datasets under diverse federated LoRA fine-tuning settings, covering three levels of data distribution from IID to extreme Non-IID. Results show that Fed-HeLLo with HLA strategies is both effective and efficient.
LGOct 14, 2024
Fed-pilot: Optimizing LoRA Allocation for Efficient Federated Fine-Tuning with Heterogeneous ClientsZikai Zhang, Rui Hu, Ping Liu et al.
Federated Learning enables the fine-tuning of foundation models (FMs) across distributed clients for specific tasks; however, its scalability is limited by the heterogeneity of client memory capacities. In this work, we propose Fed-pilot, a memory-efficient federated fine-tuning framework. It enables memory-constrained clients to participate in Low-Rank Adaptation (LoRA)-based fine-tuning by training only a subset of LoRA modules locally. Fed-pilot identifies the optimal selection of trainable LoRA modules as a knapsack optimization problem, maximizing model performance under memory constraints for each client. To mitigate inconsistencies arising from heterogeneous module allocations and Non-IID data, Fed-pilot employs a novel aggregation rule that dynamically compensates for under-updated layers. Extensive experiments on five diverse datasets across various heterogeneous data settings demonstrate Fed-pilot's effectiveness and efficiency compared to state-of-the-art methods. To the best of our knowledge, this is the first study on federated fine-tuning of FMs that integrates memory-constrained optimization. The code will be publicly available.
LGJul 13, 2025
A Scalable and Efficient Signal Integration System for Job MatchingPing Liu, Rajat Arora, Xiao Shi et al.
LinkedIn, one of the world's largest platforms for professional networking and job seeking, encounters various modeling challenges in building recommendation systems for its job matching product, including cold-start, filter bubbles, and biases affecting candidate-job matching. To address these, we developed the STAR (Signal Integration for Talent And Recruiters) system, leveraging the combined strengths of Large Language Models (LLMs) and Graph Neural Networks (GNNs). LLMs excel at understanding textual data, such as member profiles and job postings, while GNNs capture intricate relationships and mitigate cold-start issues through network effects. STAR integrates diverse signals by uniting LLM and GNN capabilities with industrial-scale paradigms including adaptive sampling and version management. It provides an end-to-end solution for developing and deploying embeddings in large-scale recommender systems. Our key contributions include a robust methodology for building embeddings in industrial applications, a scalable GNN-LLM integration for high-performing recommendations, and practical insights for real-world model deployment.
CVNov 28, 2024
RIGI: Rectifying Image-to-3D Generation Inconsistency via Uncertainty-aware LearningJiacheng Wang, Zhedong Zheng, Wei Xu et al.
Given a single image of a target object, image-to-3D generation aims to reconstruct its texture and geometric shape. Recent methods often utilize intermediate media, such as multi-view images or videos, to bridge the gap between input image and the 3D target, thereby guiding the generation of both shape and texture. However, inconsistencies in the generated multi-view snapshots frequently introduce noise and artifacts along object boundaries, undermining the 3D reconstruction process. To address this challenge, we leverage 3D Gaussian Splatting (3DGS) for 3D reconstruction, and explicitly integrate uncertainty-aware learning into the reconstruction process. By capturing the stochasticity between two Gaussian models, we estimate an uncertainty map, which is subsequently used for uncertainty-aware regularization to rectify the impact of inconsistencies. Specifically, we optimize both Gaussian models simultaneously, calculating the uncertainty map by evaluating the discrepancies between rendered images from identical viewpoints. Based on the uncertainty map, we apply adaptive pixel-wise loss weighting to regularize the models, reducing reconstruction intensity in high-uncertainty regions. This approach dynamically detects and mitigates conflicts in multi-view labels, leading to smoother results and effectively reducing artifacts. Extensive experiments show the effectiveness of our method in improving 3D generation quality by reducing inconsistencies and artifacts.
CVMay 22, 2025
PhyMAGIC: Physical Motion-Aware Generative Inference with Confidence-guided LLMSiwei Meng, Yawei Luo, Ping Liu
Recent advances in 3D content generation have amplified demand for dynamic models that are both visually realistic and physically consistent. However, state-of-the-art video diffusion models frequently produce implausible results such as momentum violations and object interpenetrations. Existing physics-aware approaches often rely on task-specific fine-tuning or supervised data, which limits their scalability and applicability. To address the challenge, we present PhyMAGIC, a training-free framework that generates physically consistent motion from a single image. PhyMAGIC integrates a pre-trained image-to-video diffusion model, confidence-guided reasoning via LLMs, and a differentiable physics simulator to produce 3D assets ready for downstream physical simulation without fine-tuning or manual supervision. By iteratively refining motion prompts using LLM-derived confidence scores and leveraging simulation feedback, PhyMAGIC steers generation toward physically consistent dynamics. Comprehensive experiments demonstrate that PhyMAGIC outperforms state-of-the-art video generators and physics-aware baselines, enhancing physical property inference and motion-text alignment while maintaining visual fidelity.
CVMay 22, 2025
Erased or Dormant? Rethinking Concept Erasure Through ReversibilityPing Liu, Chi Zhang
To what extent does concept erasure eliminate generative capacity in diffusion models? While prior evaluations have primarily focused on measuring concept suppression under specific textual prompts, we explore a complementary and fundamental question: do current concept erasure techniques genuinely remove the ability to generate targeted concepts, or do they merely achieve superficial, prompt-specific suppression? We systematically evaluate the robustness and reversibility of two representative concept erasure methods, Unified Concept Editing and Erased Stable Diffusion, by probing their ability to eliminate targeted generative behaviors in text-to-image models. These methods attempt to suppress undesired semantic concepts by modifying internal model parameters, either through targeted attention edits or model-level fine-tuning strategies. To rigorously assess whether these techniques truly erase generative capacity, we propose an instance-level evaluation strategy that employs lightweight fine-tuning to explicitly test the reactivation potential of erased concepts. Through quantitative metrics and qualitative analyses, we show that erased concepts often reemerge with substantial visual fidelity after minimal adaptation, indicating that current methods suppress latent generative representations without fully eliminating them. Our findings reveal critical limitations in existing concept erasure approaches and highlight the need for deeper, representation-level interventions and more rigorous evaluation standards to ensure genuine, irreversible removal of concepts from generative models.
CVNov 25, 2024
MICAS: Multi-grained In-Context Adaptive Sampling for 3D Point Cloud ProcessingFeifei Shao, Ping Liu, Zhao Wang et al.
Point cloud processing (PCP) encompasses tasks like reconstruction, denoising, registration, and segmentation, each often requiring specialized models to address unique task characteristics. While in-context learning (ICL) has shown promise across tasks by using a single model with task-specific demonstration prompts, its application to PCP reveals significant limitations. We identify inter-task and intra-task sensitivity issues in current ICL methods for PCP, which we attribute to inflexible sampling strategies lacking context adaptation at the point and prompt levels. To address these challenges, we propose MICAS, an advanced ICL framework featuring a multi-grained adaptive sampling mechanism tailored for PCP. MICAS introduces two core components: task-adaptive point sampling, which leverages inter-task cues for point-level sampling, and query-specific prompt sampling, which selects optimal prompts per query to mitigate intra-task sensitivity. To our knowledge, this is the first approach to introduce adaptive sampling tailored to the unique requirements of point clouds within an ICL framework. Extensive experiments show that MICAS not only efficiently handles various PCP tasks but also significantly outperforms existing methods. Notably, it achieves a remarkable $4.1\%$ improvement in the part segmentation task and delivers consistent gains across various PCP applications.
IRFeb 21, 2024
Learning to Retrieve for Job MatchingJianqiang Shen, Yuchin Juan, Shaobo Zhang et al.
Web-scale search systems typically tackle the scalability challenge with a two-step paradigm: retrieval and ranking. The retrieval step, also known as candidate selection, often involves extracting standardized entities, creating an inverted index, and performing term matching for retrieval. Such traditional methods require manual and time-consuming development of query models. In this paper, we discuss applying learning-to-retrieve technology to enhance LinkedIns job search and recommendation systems. In the realm of promoted jobs, the key objective is to improve the quality of applicants, thereby delivering value to recruiter customers. To achieve this, we leverage confirmed hire data to construct a graph that evaluates a seeker's qualification for a job, and utilize learned links for retrieval. Our learned model is easy to explain, debug, and adjust. On the other hand, the focus for organic jobs is to optimize seeker engagement. We accomplished this by training embeddings for personalized retrieval, fortified by a set of rules derived from the categorization of member feedback. In addition to a solution based on a conventional inverted index, we developed an on-GPU solution capable of supporting both KNN and term matching efficiently.
IRAug 19, 2025
Powering Job Search at Scale: LLM-Enhanced Query Understanding in Job Matching SystemsPing Liu, Jianqiang Shen, Qianqi Shen et al.
Query understanding is essential in modern relevance systems, where user queries are often short, ambiguous, and highly context-dependent. Traditional approaches often rely on multiple task-specific Named Entity Recognition models to extract structured facets as seen in job search applications. However, this fragmented architecture is brittle, expensive to maintain, and slow to adapt to evolving taxonomies and language patterns. In this paper, we introduce a unified query understanding framework powered by a Large Language Model (LLM), designed to address these limitations. Our approach jointly models the user query and contextual signals such as profile attributes to generate structured interpretations that drive more accurate and personalized recommendations. The framework improves relevance quality in online A/B testing while significantly reducing system complexity and operational overhead. The results demonstrate that our solution provides a scalable and adaptable foundation for query understanding in dynamic web applications.
CVMay 26, 2025
Erasing Concepts, Steering Generations: A Comprehensive Survey of Concept SuppressionYiwei Xie, Ping Liu, Zheng Zhang
Text-to-Image (T2I) models have demonstrated impressive capabilities in generating high-quality and diverse visual content from natural language prompts. However, uncontrolled reproduction of sensitive, copyrighted, or harmful imagery poses serious ethical, legal, and safety challenges. To address these concerns, the concept erasure paradigm has emerged as a promising direction, enabling the selective removal of specific semantic concepts from generative models while preserving their overall utility. This survey provides a comprehensive overview and in-depth synthesis of concept erasure techniques in T2I diffusion models. We systematically categorize existing approaches along three key dimensions: intervention level, which identifies specific model components targeted for concept removal; optimization structure, referring to the algorithmic strategies employed to achieve suppression; and semantic scope, concerning the complexity and nature of the concepts addressed. This multi-dimensional taxonomy enables clear, structured comparisons across diverse methodologies, highlighting fundamental trade-offs between erasure specificity, generalization, and computational complexity. We further discuss current evaluation benchmarks, standardized metrics, and practical datasets, emphasizing gaps that limit comprehensive assessment, particularly regarding robustness and practical effectiveness. Finally, we outline major challenges and promising future directions, including disentanglement of concept representations, adaptive and incremental erasure strategies, adversarial robustness, and new generative architectures. This survey aims to guide researchers toward safer, more ethically aligned generative models, providing foundational knowledge and actionable recommendations to advance responsible development in generative AI.
CVMar 18, 2025
Advances in 4D Generation: A SurveyQiaowei Miao, Kehan Li, Jinsheng Quan et al.
Generative artificial intelligence has recently progressed from static image and video synthesis to 3D content generation, culminating in the emergence of 4D generation-the task of synthesizing temporally coherent dynamic 3D assets guided by user input. As a burgeoning research frontier, 4D generation enables richer interactive and immersive experiences, with applications ranging from digital humans to autonomous driving. Despite rapid progress, the field lacks a unified understanding of 4D representations, generative frameworks, basic paradigms, and the core technical challenges it faces. This survey provides a systematic and in-depth review of the 4D generation landscape. To comprehensively characterize 4D generation, we first categorize fundamental 4D representations and outline associated techniques for 4D generation. We then present an in-depth analysis of representative generative pipelines based on conditions and representation methods. Subsequently, we discuss how motion and geometry priors are integrated into 4D outputs to ensure spatio-temporal consistency under various control schemes. From an application perspective, this paper summarizes 4D generation tasks in areas such as dynamic object/scene generation, digital human synthesis, editable 4D content, and embodied AI. Furthermore, we summarize and multi-dimensionally compare four basic paradigms for 4D generation: End-to-End, Generated-Data-Based, Implicit-Distillation-Based, and Explicit-Supervision-Based. Concluding our analysis, we highlight five key challenges-consistency, controllability, diversity, efficiency, and fidelity-and contextualize these with current approaches.By distilling recent advances and outlining open problems, this work offers a comprehensive and forward-looking perspective to guide future research in 4D generation.
CVMay 24, 2023
Counterfactual Co-occurring Learning for Bias Mitigation in Weakly-supervised Object LocalizationFeifei Shao, Yawei Luo, Lei Chen et al.
Contemporary weakly-supervised object localization (WSOL) methods have primarily focused on addressing the challenge of localizing the most discriminative region while largely overlooking the relatively less explored issue of biased activation -- incorrectly spotlighting co-occurring background with the foreground feature. In this paper, we conduct a thorough causal analysis to investigate the origins of biased activation. Based on our analysis, we attribute this phenomenon to the presence of co-occurring background confounders. Building upon this profound insight, we introduce a pioneering paradigm known as Counterfactual Co-occurring Learning (CCL), meticulously engendering counterfactual representations by adeptly disentangling the foreground from the co-occurring background elements. Furthermore, we propose an innovative network architecture known as Counterfactual-CAM. This architecture seamlessly incorporates a perturbation mechanism for counterfactual representations into the vanilla CAM-based model. By training the WSOL model with these perturbed representations, we guide the model to prioritize the consistent foreground content while concurrently reducing the influence of distracting co-occurring backgrounds. To the best of our knowledge, this study represents the initial exploration of this research direction. Our extensive experiments conducted across multiple benchmarks validate the effectiveness of the proposed Counterfactual-CAM in mitigating biased activation.