Fan Li

CV
h-index: 45
Papers: 87
Citations: 2,429
Novelty: 50%
AI Score: 60

87 Papers

CV Jun 4
ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

Dehong Kong, Lina Lei, Lingtao Zheng et al.

Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set -- establishing, medium, and close-up -- from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.
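As a rough illustration of the final training stage, the sketch below shows the group-relative advantage normalization used in standard Group Relative Policy Optimization: the rewards of several outputs sampled for the same input are centred and scaled within the group. It is a minimal sketch of the generic GRPO step only; the composite reward of GRPO-S is not specified in the abstract, so the reward values and the function name `group_relative_advantages` are illustrative assumptions.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit std.

    `rewards` holds one scalar reward per sampled output for the same input.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical composite rewards for four sampled crop sets of one image.
rewards = [0.2, 0.5, 0.9, 0.4]
adv = group_relative_advantages(rewards)
print(adv)
```

Outputs that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form the baseline.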

CV Jul 3, 2024 Code
DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Le Yang, Ziwei Zheng, Yizeng Han et al.

Recently proposed neural network-based Temporal Action Detection (TAD) models are inherently limited in extracting discriminative representations and in modeling action instances of various lengths from complex scenes, because they rely on shared-weight detection heads. Inspired by the successes of dynamic neural networks, in this paper we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations. Moreover, DFA helps to develop a Dynamic TAD head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters and learned receptive fields to better detect action instances with diverse ranges in videos. With the proposed encoder layer and DyHead, a new dynamic TAD model, DyFADet, achieves promising performance on a series of challenging TAD benchmarks, including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-Moment QueriesV1.0, and FineAction. Code is released at https://github.com/yangle15/DyFADet-pytorch.

CV Mar 27 Code
CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

Chonghuinan Wang, Zihan Chen, Yuxiang Wei et al.

Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.

CV May 18 Code
Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection

Yixing Yong, Jian Wang, Ming Lei et al.

Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometric shape of heat-blocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack effectiveness. In this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5). Code is available at https://github.com/Yongyx99/Fourier-shape-attack.
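To make the coefficient-to-mask mapping concrete, the sketch below builds a closed contour from a few Fourier coefficients and marks a pixel as inside when the contour winds around it a nonzero number of times. It is a hard, non-differentiable rasterization under assumed conventions (complex coefficients indexed by integer frequency, a sampled polygonal boundary), not the analytic differentiable mapping the paper describes; `fourier_contour` and `winding_mask` are hypothetical names.

```python
import numpy as np


def fourier_contour(coeffs, n_points=256):
    """Sample z(t) = sum_k c_k * exp(i*k*t) on [0, 2*pi) as complex points."""
    t = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    z = np.zeros(n_points, dtype=complex)
    for k, c in coeffs.items():
        z += c * np.exp(1j * k * t)
    return z


def winding_mask(z, xs, ys):
    """Boolean mask of shape (len(ys), len(xs)): True where the winding number is nonzero."""
    px, py = np.meshgrid(xs, ys)
    p = (px + 1j * py)[..., None]            # (H, W, 1)
    d = z[None, None, :] - p                 # vectors from pixel to contour points
    d_next = np.roll(d, -1, axis=-1)         # vectors to the next contour point
    # Signed angle subtended by each contour segment as seen from the pixel.
    angles = np.angle(d_next * np.conj(d))
    winding = angles.sum(axis=-1) / (2.0 * np.pi)
    return np.rint(winding) != 0


# A unit circle is the single coefficient c_1 = 1 (illustrative choice).
contour = fourier_contour({1: 1.0 + 0.0j})
xs = ys = np.linspace(-1.5, 1.5, 7)
mask = winding_mask(contour, xs, ys)
print(mask.astype(int))
```

Adding more coefficients (for example a small `c_{-2}` term) deforms the circle into lobed shapes while the mask computation stays unchanged.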

CV Nov 16, 2022
Person Text-Image Matching via Text-Feature Interpretability Embedding and External Attack Node Implantation

Fan Li, Hang Zhou, Huafeng Li et al.

Person text-image matching, also known as text-based person search, aims to retrieve images of specific pedestrians using text descriptions. Although person text-image matching has made great research progress, existing methods still face two challenges. First, the lack of interpretability of text features makes it challenging to effectively align them with their corresponding image features. Second, the same pedestrian image often corresponds to multiple different text descriptions, and a single text description can correspond to multiple different images of the same identity. The diversity of text descriptions and images makes it difficult for a network to extract robust features that match the two modalities. To address these problems, we propose a person text-image matching method by embedding text-feature interpretability and an external attack node. Specifically, we improve the interpretability of text features by providing them with semantic information consistent with the image features, so that the text is aligned with the features of the image regions it describes. To address the challenges posed by the diversity of text and the corresponding person images, we treat the feature variation caused by this diversity as perturbation information and propose a novel adversarial attack and defense method to handle it. In the model design, graph convolution is used as the basic framework for feature representation, and the adversarial attacks caused by text and image diversity on feature extraction are simulated by implanting an additional attack node in the graph convolution layer to improve the robustness of the model against text and image diversity. Extensive experiments demonstrate the effectiveness and superiority of the proposed text-pedestrian image matching method over existing methods. The source code of the method is published at

CL Feb 4
ERNIE 5.0 Technical Report

Haifeng Wang, Hua Wu, Tian Wu et al.

In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.

AI Sep 22, 2024
OStr-DARTS: Differentiable Neural Architecture Search based on Operation Strength

Le Yang, Ziwei Zheng, Yizeng Han et al.

Differentiable architecture search (DARTS) has emerged as a promising technique for effective neural architecture search, and it mainly contains two steps to find the high-performance architecture: First, the DARTS supernet that consists of mixed operations will be optimized via gradient descent. Second, the final architecture will be built by the selected operations that contribute the most to the supernet. Although DARTS improves the efficiency of NAS, it suffers from the well-known degeneration issue which can lead to deteriorating architectures. Existing works mainly attribute the degeneration issue to the failure of its supernet optimization, while little attention has been paid to the selection method. In this paper, we cease to apply the widely-used magnitude-based selection method and propose a novel criterion based on operation strength that estimates the importance of an operation by its effect on the final loss. We show that the degeneration issue can be effectively addressed by using the proposed criterion without any modification of supernet optimization, indicating that the magnitude-based selection method can be a critical reason for the instability of DARTS. The experiments on NAS-Bench-201 and DARTS search spaces show the effectiveness of our method.
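The contrast between magnitude-based selection and a loss-effect criterion can be illustrated on a toy mixed operation. In the sketch below, an operation's importance is approximated by how much the loss rises when that operation is masked out; this perturbation-style proxy, the three toy operations, and the architecture weights are all assumptions made for illustration and are not the paper's exact operation-strength definition.

```python
import numpy as np

# Toy data: the target function is 2*x, so the "double" op is the useful one.
x = np.linspace(-1.0, 1.0, 21)
target = 2.0 * x

# Candidate operations on one edge and their (hypothetical) learned weights.
names = ["identity", "zero", "double"]
ops = [lambda v: v, lambda v: np.zeros_like(v), lambda v: 2.0 * v]
alphas = np.array([0.2, 0.5, 0.3])


def loss(weights):
    """Mean squared error of the weighted mixture of operations."""
    mixed = sum(w * op(x) for w, op in zip(weights, ops))
    return float(np.mean((mixed - target) ** 2))


base = loss(alphas)
strength = []
for i in range(len(ops)):
    masked = alphas.copy()
    masked[i] = 0.0                       # remove operation i from the mixture
    strength.append(loss(masked) - base)  # loss increase caused by the removal

by_magnitude = names[int(np.argmax(alphas))]
by_strength = names[int(np.argmax(strength))]
print(by_magnitude, by_strength)
```

Here the largest weight belongs to the parameter-free `zero` operation, whereas masking `double` hurts the loss the most, which mirrors the degenerate selections that magnitude-based criteria are reported to produce.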

CV Dec 19, 2025
Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs

Zhaolin Cai, Huiyu Duan, Zitong Xu et al.

Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, which struggles to generalize to the long tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors since fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction representations, then design a lightweight learnable cognitive steering conduit (CSC) module to inject the fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that couples the language modeling loss with an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, achieving a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.

CR Apr 11
Impact of Intelligent Technologies on IoV Security: Integrating Edge Computing and AI

Awais Bilal, Kashif Sharif, Liehuang Zhu et al.

The rapid development and integration of intelligent technologies in the Internet of Vehicles (IoV) have revolutionized transportation systems by enhancing connectivity, automation, and safety. However, the complexity and connectivity of IoV networks also introduce security challenges, including data privacy concerns, cyber threats, and system vulnerabilities. This paper surveys the role of Edge Computing (EC), Machine Learning (ML), and Deep Learning (DL) in strengthening IoV security frameworks. It examines the synergy between these technologies, highlighting their individual capabilities and their collective impact on enhancing threat detection, response times, and adaptive security. Through real-world case studies and practical deployments, we demonstrate how EC, ML, and DL are currently improving security and operational efficiency in IoV systems. The paper also identifies key research gaps and future directions for further advancements in IoV security, including the need for scalable, privacy-preserving solutions and robust defense mechanisms against emerging cyber threats. By integrating EC, ML, and DL, this work lays the groundwork for developing adaptive, efficient, and resilient IoV security infrastructures capable of addressing evolving challenges in the transportation ecosystem.

CV Jul 5, 2024
Fine-grained Dynamic Network for Generic Event Boundary Detection

Ziwei Zheng, Lijun He, Le Yang et al.

Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, playing a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect various boundaries by the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more intelligent and reasonable way is to adaptively detect boundaries by considering their special properties. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. Besides, a multi-order difference detector is also proposed to ensure generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks, leading to obvious improvements in both performance and efficiency compared to the current state-of-the-art.

CR Nov 22, 2023
A Survey of Blockchain, Artificial Intelligence, and Edge Computing for Web 3.0

Jianjun Zhu, Fan Li, Jinyuan Chen

Web 3.0, as the third generation of the World Wide Web, aims to solve contemporary problems of trust, centralization, and data ownership. Driven by the latest advances in cutting-edge technologies, Web 3.0 is moving towards a more open, decentralized, intelligent, and interconnected network. However, increasingly widespread data breaches have raised awareness of online privacy and security of personal data. Additionally, since Web 3.0 is a sophisticated and complex convergence, the technical details behind it are not as clear as the characteristics it presents. In this survey, we conduct an in-depth exploration of Web 3.0 from the perspectives of blockchain, artificial intelligence, and edge computing. Specifically, we begin with summarizing the evolution of the Internet and providing an overview of these three key technological factors. Afterward, we provide a thorough analysis of each technology separately, including its relevance to Web 3.0, key technology components, and practical applications. We also propose decentralized storage and computing solutions by exploring the integration of technologies. Finally, we highlight the key challenges alongside potential research directions. Through the combination and mutual complementation of multiple technologies, Web 3.0 is expected to return more control and ownership of data and digital assets back to users.

CV Sep 26, 2023
InvKA: Gait Recognition via Invertible Koopman Autoencoder

Fan Li, Dong Liang, Jing Lian et al.

Most current gait recognition methods suffer from poor interpretability and high computational cost. To improve interpretability, we investigate gait features in the embedding space based on Koopman operator theory. The transition matrix in this space, namely the Koopman operator, captures complex kinematic features of gait cycles. The diagonal elements of the operator matrix can represent the overall motion trend, providing a physically meaningful descriptor. To reduce the computational cost of our algorithm, we use a reversible autoencoder to reduce the model size and eliminate convolutional layers to compress its depth, resulting in fewer floating-point operations. Experimental results on multiple datasets show that our method reduces computational cost to 1% of that of state-of-the-art methods while achieving a competitive recognition accuracy of 98% on non-occlusion datasets.
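The idea of a linear transition matrix in the embedding space can be sketched with a least-squares fit: given embedded states of one sequence, find the matrix K that best maps each state to the next, then read off its diagonal. This is a generic dynamic-mode-decomposition style estimate under the assumption that embeddings are already available as columns of a matrix; it does not reproduce the paper's invertible autoencoder, and `fit_koopman` is a hypothetical helper name.

```python
import numpy as np


def fit_koopman(Z):
    """Least-squares K such that Z[:, t+1] is approximately K @ Z[:, t].

    Z has shape (d, T): one d-dimensional embedded state per time step.
    """
    Z0, Z1 = Z[:, :-1], Z[:, 1:]
    return Z1 @ np.linalg.pinv(Z0)


# Synthetic "gait cycle": a planar rotation, standing in for real embeddings.
theta = 0.3
K_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
Z = np.empty((2, 20))
Z[:, 0] = [1.0, 0.0]
for t in range(19):
    Z[:, t + 1] = K_true @ Z[:, t]

K = fit_koopman(Z)
trend = np.diag(K)   # diagonal elements, used as a compact descriptor
print(K)
print(trend)
```

On this noise-free toy sequence the fit recovers the generating matrix, so the diagonal equals cos(theta) in both entries.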

IV Nov 15, 2025 Code
MTMed3D: A Multi-Task Transformer-Based Model for 3D Medical Imaging

Fan Li, Arun Iyengar, Lanyu Xu

In the field of medical imaging, AI-assisted techniques such as object detection, segmentation, and classification are widely employed to alleviate the workload of physicians and doctors. However, single-task models are predominantly used, overlooking the shared information across tasks. This oversight leads to inefficiencies in real-life applications. In this work, we propose MTMed3D, a novel end-to-end Multi-task Transformer-based model to address the limitations of single-task models by jointly performing 3D detection, segmentation, and classification in medical imaging. Our model uses a Transformer as the shared encoder to generate multi-scale features, followed by CNN-based task-specific decoders. The proposed framework was evaluated on the BraTS 2018 and 2019 datasets, achieving promising results across all three tasks, especially in detection, where our method achieves better results than prior works. Additionally, we compare our multi-task model with equivalent single-task variants trained separately. Our multi-task model significantly reduces computational costs and achieves faster inference speed while maintaining comparable performance to the single-task models, highlighting its efficiency advantage. To the best of our knowledge, this is the first work to leverage Transformers for multi-task learning that simultaneously covers detection, segmentation, and classification tasks in 3D medical imaging, presenting its potential to enhance diagnostic processes. The code is available at https://github.com/fanlimua/MTMed3D.git.

LG Jun 25, 2022
Envelope imbalanced ensemble model with deep sample learning and local-global structure consistency

Fan Li, Xiaoheng Zhang, Yongming Li et al.

The class imbalance problem is important and challenging. Ensemble approaches are widely used to tackle this problem because of their effectiveness. However, existing ensemble methods are always applied to the original samples without considering the structure information among them. This limitation prevents imbalanced learning from improving further. Besides, research shows that the structure information among samples includes local and global structure information. Based on the analysis above, an imbalanced ensemble algorithm with the deep sample pre-envelope network (DSEN) and local-global structure consistency mechanism (LGSCM) is proposed here to solve the problem. This algorithm can guarantee high-quality deep envelope samples by considering the local manifold and global structure information, which is helpful for imbalanced learning. First, the deep sample envelope pre-network (DSEN) is designed to mine structure information among samples. Then, the local manifold structure metric (LMSM) and global structure distribution metric (GSDM) are designed to construct LGSCM to enhance distribution consistency of interlayer samples. Next, the DSEN and LGSCM are put together to form the final deep sample envelope network (DSEN-LG). After that, base classifiers are applied to each layer of deep samples. Finally, the predictive results from the base classifiers are fused through a bagging ensemble learning mechanism. To demonstrate the effectiveness of the proposed method, forty-four public datasets and more than ten representative relevant algorithms are chosen for verification. The experimental results show that the algorithm is significantly better than other imbalanced ensemble algorithms.

CV Dec 19, 2025
HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection

Zhaolin Cai, Fan Li, Ziwei Zheng et al.

Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.

CV Feb 6
PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

Junxian Li, Kai Liu, Leyang Chen et al.

Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our lives, remains underexplored. Image generation and editing in computer-use tasks require capabilities like spatial reasoning and procedural understanding, and it is still unknown whether UMMs have the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve the goal of our evaluation, we focus on sub-tasks that frequently arise in daily life and require planning steps. Specifically, three new sub-tasks are designed: route planning, work diagramming, and web & UI displaying. We address the challenge of ensuring data quality by curating human-annotated questions and reference images and by applying a quality control process. To make the evaluation comprehensive and exact, a task-adaptive score, PlanScore, is proposed. The score helps assess the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.

CV Mar 30
ColorFLUX: A Structure-Color Decoupling Framework for Old Photo Colorization

Bingchen Li, Zhixin Wang, Fan Li et al.

Old photos preserve invaluable historical memories, making their restoration and colorization highly desirable. While existing restoration models can address some degradation issues like denoising and scratch removal, they often struggle with accurate colorization. This limitation arises from the unique degradation inherent in old photos, such as faded brightness and altered color hues, which are different from modern photo distributions, creating a substantial domain gap during colorization. In this paper, we propose a novel old photo colorization framework based on the generative diffusion model FLUX. Our approach introduces a structure-color decoupling strategy that separates structure preservation from color restoration, enabling accurate colorization of old photos while maintaining structural consistency. We further enhance the model with a progressive Direct Preference Optimization (Pro-DPO) strategy, which allows the model to learn subtle color preferences through coarse-to-fine transitions in color augmentation. Additionally, we address the limitations of text-based prompts by introducing visual semantic prompts, which extract fine-grained semantic information directly from old photos, helping to eliminate the color bias inherent in old photos. Experimental results on both synthetic and real datasets demonstrate that our approach outperforms existing state-of-the-art colorization methods, including closed-source commercial models, producing high-quality and vivid colorization.

CV Apr 30 Code
YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

Chenyang Wu, Lina Lei, Fan Li et al.

Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10 FPS, primarily due to dense computations over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE, You Only Select Essential Tokens, an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and a Diffusion Process Simulator (DiffSim) module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the masked regions, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5x speedup in 70% of cases while maintaining visual quality comparable to the baseline. Code is available at: https://github.com/Wucy0519/YOSE-CVPR26.
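The mask-aware saving comes from running the expensive computation only on selected tokens and writing the results back in place. The sketch below shows that gather, process, and scatter pattern for a single sample with NumPy; it is an assumption-laden illustration, not the paper's differentiable BVI operator, and it omits DiffSim entirely. The helper `process_masked_tokens` and the stand-in `fn` are hypothetical.

```python
import numpy as np


def process_masked_tokens(tokens, mask, fn):
    """Apply `fn` only to tokens where `mask` is True; leave the rest untouched.

    tokens: (N, D) array of token features, mask: (N,) boolean array.
    Returns the updated tokens and the number of tokens actually processed.
    """
    idx = np.flatnonzero(mask)          # indices of essential (masked) tokens
    out = tokens.copy()
    if idx.size:
        out[idx] = fn(tokens[idx])      # cost scales with idx.size, not N
    return out, int(idx.size)


tokens = np.arange(12, dtype=float).reshape(6, 2)
mask = np.array([False, True, False, False, True, False])
out, n_processed = process_masked_tokens(tokens, mask, lambda t: t * 10.0)
print(n_processed)
print(out)
```

Because the number of selected tokens differs between samples, a batched version has to handle variable lengths, which is the part the abstract attributes to BVI.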

LG Feb 17, 2023
A Probabilistic Generative Model for Tracking Multi-Knowledge Concept Mastery Probability

Hengyu Liu, Tiancheng Zhang, Fan Li et al.

Knowledge tracing aims to track students' knowledge status over time to predict students' future performance accurately. Markov chain-based knowledge tracking (MCKT) models can track knowledge concept mastery probability over time. However, as the number of tracked knowledge concepts increases, the time complexity of MCKT predicting student performance increases exponentially (also called the explaining away problem). In addition, the existing MCKT models only consider the relationship between students' knowledge status and problems when modeling students' responses but ignore the relationship between knowledge concepts in the same problem. To address these challenges, we propose an inTerpretable pRobAbilistiC gEnerative moDel (TRACED), which can track the mastery probabilities of students' numerous knowledge concepts over time. To solve the explaining away problem, we design Long Short-Term Memory (LSTM)-based networks to approximate the posterior distribution, predict students' future performance, and propose a heuristic algorithm to train the LSTMs and the probabilistic graphical model jointly. To better model students' exercise responses, we propose a logarithmic linear model with three interactive strategies, which models students' exercise responses by considering the relationship among students' knowledge status, knowledge concepts, and problems. We conduct experiments with four real-world datasets in three knowledge-driven tasks. The experimental results show that TRACED outperforms existing knowledge tracing methods in predicting students' future performance and can learn the relationship among students, knowledge concepts, and problems from students' exercise sequences. We also conduct several case studies. The case studies show that TRACED exhibits excellent interpretability and thus has the potential for personalized automatic feedback in the real-world educational environment.

LG Nov 30, 2022
Overlapping oriented imbalanced ensemble learning method based on projective clustering and stagewise hybrid sampling

Fan Li, Bo Wang, Pin Wang et al.

The challenge of imbalanced learning lies not only in the class imbalance problem but also in the more complex class overlapping problem. However, most existing algorithms focus mainly on the former, which limits further progress. To address this limitation, this paper proposes an ensemble learning algorithm based on dual clustering and stage-wise hybrid sampling (DCSHS). The DCSHS has three parts. First, we design a projection clustering combination framework (PCC) guided by the Davies-Bouldin clustering effectiveness index (DBI), which is used to obtain high-quality clusters and combine them to obtain a set of cross-complete subsets (CCS) with balanced classes and low overlapping. Second, according to the characteristics of subset classes, a stage-wise hybrid sampling algorithm is designed to realize the de-overlapping and balancing of subsets. Finally, a projective clustering transfer mapping mechanism (CTM) is constructed for all processed subsets by means of transfer learning, thereby reducing class overlapping and exploring the structure information of samples. The major advantage of our algorithm is that it can exploit the intersectionality of the CCS to realize the soft elimination of overlapping majority samples and learn as much information from overlapping samples as possible, thereby improving the handling of class overlapping while balancing classes. In the experimental section, more than 30 public datasets and over ten representative algorithms are chosen for verification. The experimental results show that DCSHS performs significantly better than the compared algorithms across various evaluation criteria.

LGOct 25, 2022
A new Stack Autoencoder: Neighbouring Sample Envelope Embedded Stack Autoencoder Ensemble Model

Chuanyan Zhou, Jie Ma, Fan Li et al.

Stack autoencoder (SAE), as a representative deep network, has unique and excellent performance in feature learning, and has received extensive attention from researchers. However, existing deep SAEs focus on original samples without considering the hierarchical structural information between samples. To address this limitation, this paper proposes a new SAE model, the neighbouring envelope embedded stack autoencoder ensemble (NE_ESAE). First, the neighbouring sample envelope learning mechanism (NSELM) is proposed for preprocessing the input of the SAE. NSELM constructs sample pairs by combining neighbouring samples. In addition, the NSELM constructs multilayer sample spaces by multilayer iterative mean clustering, which considers similar samples and generates layers of envelope samples with hierarchical structural information. Second, an embedded stack autoencoder (ESAE) is proposed and trained in each layer of the sample space to consider the original samples during training and in the network structure, thereby better capturing the relationship between original feature samples and deep feature samples. Third, feature reduction and base classification are performed on each layer of envelope samples, yielding classification results for every layer. Finally, the classification results of the layers of the envelope sample space are fused through the ensemble mechanism. In the experimental section, the proposed algorithm is validated on over ten representative public datasets. The results show that our method performs significantly better than existing traditional feature learning methods and representative deep autoencoders.

CVApr 8, 2024Code
MC$^2$: Multi-concept Guidance for Customized Multi-concept Generation

Jiaxiu Jiang, Yabo Zhang, Kailai Feng et al.

Customized text-to-image generation, which synthesizes images based on user-specified concepts, has made significant progress in handling individual concepts. However, when extended to multiple concepts, existing methods often struggle with properly integrating different models and avoiding the unintended blending of characteristics from distinct concepts. In this paper, we propose MC$^2$, a novel approach for multi-concept customization that enhances flexibility and fidelity through inference-time optimization. MC$^2$ enables the integration of multiple single-concept models with heterogeneous architectures. By adaptively refining attention weights between visual and textual tokens, our method ensures that image regions accurately correspond to their associated concepts while minimizing interference between concepts. Extensive experiments demonstrate that MC$^2$ outperforms training-based methods in terms of prompt-reference alignment. Furthermore, MC$^2$ can be seamlessly applied to text-to-image generation, providing robust compositional capabilities. To facilitate the evaluation of multi-concept customization, we also introduce a new benchmark, MC++. The code will be publicly available at https://github.com/JIANGJiaXiu/MC-2.

AIApr 2
AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows

Chuhan Qiao, Jinglai Zheng, Jie Huang et al.

Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs, treating generation as single-pass text completion, fail to satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. To address this, we propose AeroTherm-GPT, the first TPS-specialized LLM Agent, instantiated through a Constraint-Closed-Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co-resolution probabilities. This upstream-priority mechanism resolves multiple downstream violations per action, achieving a Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair. Evaluated on HyTPS-Bench and validated against external benchmarks, AeroTherm-GPT achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), a gain of +12.5 pp over the matched non-CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.

CVApr 14
Unlocking the Potential of Grounding DINO in Videos: Parameter-Efficient Adaptation for Limited-Data Spatial-Temporal Localization

Zanyi Wang, Fan Li, Dengyang Jiang et al.

Spatio-temporal video grounding (STVG) aims to localize queried objects within dynamic video segments. Prevailing fully-trained approaches are notoriously data-hungry. However, gathering large-scale STVG data is exceptionally challenging: dense frame-level bounding boxes and complex temporal language alignments are prohibitively expensive to annotate, especially for specialized video domains. Consequently, conventional models suffer from severe overfitting on these inherently limited datasets, while zero-shot foundational models lack the task-specific temporal awareness needed for precise localization. To resolve this small-data challenge, we introduce ST-GD, a data-efficient framework that adapts pre-trained 2D visual-language models (e.g., Grounding DINO) to video tasks. To avoid destroying pre-trained priors on small datasets, ST-GD keeps the base model frozen and strategically injects lightweight adapters (~10M trainable parameters) to instill spatio-temporal awareness, alongside a novel temporal decoder for boundary prediction. This design naturally counters data scarcity. Consequently, ST-GD excels in data-scarce scenarios, achieving highly competitive performance on the limited-scale HC-STVG v1/v2 benchmarks, while maintaining robust generalization on the VidSTG dataset. This validates ST-GD as a powerful paradigm for complex video understanding under strict small-data constraints.

CVJan 3, 2025Code
ACE: Anti-Editing Concept Erasure in Text-to-Image Models

Zihao Wang, Yuxiang Wei, Fan Li et al.

Recent advances in text-to-image diffusion models have significantly facilitated the generation of high-quality images, but have also raised concerns about the illegal creation of harmful content, such as copyrighted images. Existing concept erasure methods achieve superior results in preventing the production of erased concepts from prompts, but typically perform poorly in preventing undesired editing. To address this issue, we propose an Anti-Editing Concept Erasure (ACE) method, which not only erases the target concept during generation but also filters it out during editing. Specifically, we propose to inject the erasure guidance into both the conditional and the unconditional noise predictions, enabling the model to effectively prevent the creation of erased concepts during both editing and generation. Furthermore, a stochastic correction guidance is introduced during training to address the erosion of unrelated concepts. We conducted erasure editing experiments with representative editing methods (i.e., LEDITS++ and MasaCtrl) to erase IP characters, and the results indicate that our ACE effectively filters out target concepts in both types of edits. Additional experiments on erasing explicit concepts and artistic styles further demonstrate that our ACE performs favorably against state-of-the-art methods. Our code will be publicly available at https://github.com/120L020904/ACE.

CVMay 12
Fast Image Super-Resolution via Consistency Rectified Flow

Jiaqi Xu, Wenbo Li, Haoze Sun et al.

Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality.
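For orientation, the generic rectified-flow formulation that this abstract builds on can be written as below. This is a sketch of the standard construction only, not FlowSR's exact objective: the symbols $x_{\mathrm{LR}}$, $x_{\mathrm{HR}}$, $v_\theta$ are notation chosen here, and the LR image is assumed to have been brought to the same shape as the HR image (for example by upsampling or in a shared latent space).

```latex
% Straight-line path from the LR endpoint (t = 0) to the HR endpoint (t = 1)
x_t = (1 - t)\, x_{\mathrm{LR}} + t\, x_{\mathrm{HR}}, \qquad t \in [0, 1]

% Velocity-matching loss: the constant target velocity is x_HR - x_LR
\mathcal{L}(\theta) = \mathbb{E}_{t,\,(x_{\mathrm{LR}},\, x_{\mathrm{HR}})}
  \left\| v_\theta(x_t, t) - (x_{\mathrm{HR}} - x_{\mathrm{LR}}) \right\|^2

% Single-step inference: one Euler step of size 1 from t = 0
\hat{x}_{\mathrm{HR}} = x_{\mathrm{LR}} + v_\theta(x_{\mathrm{LR}}, 0)
```

The consistency learning and HR regularization described in the abstract can be read as additional constraints that make the single Euler step land accurately on the HR target.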

CVMar 2
InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

Yecong Wan, Fan Li, Chunwei Wang et al.

Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes. The key insight of InterCoG is to first perform object position reasoning solely within text that includes spatial relation details to explicitly deduce the location and identity of the edited target. It then conducts visual grounding via highlighting the editing targets with generated bounding boxes and masks in pixel space, and finally rewrites the editing description to specify the intended outcomes. To further facilitate this paradigm, we propose two auxiliary training modules: multimodal grounding reconstruction supervision and multimodal grounding reasoning alignment to enforce spatial localization accuracy and reasoning interpretability, respectively. We also construct GroundEdit-45K, a dataset comprising 45K grounding-oriented editing samples with detailed reasoning annotations, and GroundEdit-Bench for grounding-aware editing evaluation. Extensive experiments substantiate the superiority of our approach in highly precise edits under spatially intricate and multi-entity scenes.

CRMay 16
Universal Graph Backdoor Defense: A Feature-based Homophily Perspective

Mengting Pan, Fan Li, Chen Chen et al.

Graph neural networks (GNNs) have achieved remarkable success in relational learning. However, their vulnerability to graph backdoor attacks (GBAs) poses a significant barrier to broader adoption in high-stakes applications. Despite recent advances in graph backdoor defense (GBD), existing methods primarily focus on subgraph-based GBAs, relying on the assumption that poisoned target nodes are explicitly connected to subgraph triggers. Our empirical results reveal that such structure-centric approaches fail to defend against emerging feature-based GBAs that preserve graph topology. Therefore, in this paper, we study a novel problem of universal graph backdoor defense. First, we investigate the shared effects of both attack types from a feature-based homophily perspective, which characterizes local feature consistency between nodes and their neighborhoods. Thorough theoretical and empirical analyses demonstrate that, regardless of trigger mechanisms, backdoors induced by GBAs exhibit lower feature-based homophily than clean nodes, indicating a discrepancy in local feature similarity. Motivated by this insight, we propose to leverage node-level local feature consistency, modeled by a neighbor-aware reconstruction loss, to distinguish backdoors from clean nodes. Then, a robust training strategy is developed to eliminate trigger effects while reducing noise induced by detection uncertainty. Extensive experiments demonstrate that our framework significantly degrades the attack success rate and maintains competitive clean accuracy under both subgraph-based and feature-based attacks.
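To make the notion of feature-based homophily more concrete, here is a minimal sketch that scores each node by the cosine similarity between its own features and the mean features of its neighbours. This particular definition, the function names, and the toy graph are assumptions for illustration; the paper's actual measure and its neighbor-aware reconstruction loss may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def feature_homophily(features, neighbors):
    """Per-node similarity between a node's features and its neighbourhood mean."""
    scores = {}
    for node, nbrs in neighbors.items():
        if not nbrs:
            scores[node] = 0.0  # isolated node: no neighbourhood to compare with
            continue
        dim = len(features[node])
        mean = [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
        scores[node] = cosine(features[node], mean)
    return scores

# Hypothetical toy graph: node 3 carries features unlike its only neighbour.
features = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [1.0, 0.1], 3: [0.0, 1.0]}
neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
scores = feature_homophily(features, neighbors)
suspect = min(scores, key=scores.get)  # node with the lowest local consistency
print(scores, suspect)
```

Under this toy definition, a node whose features were altered by a trigger stands out with a low score even though the graph topology is unchanged, which mirrors the intuition stated in the abstract.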

CVSep 29, 2025Code
UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Ailing Zhang, Lina Lei, Dehong Kong et al.

Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.

LGJan 3, 2025Code
Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Ziwei Zheng, Junyao Zhao, Le Yang et al.

With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at https://github.com/Ziwei-Zheng/SAHs.

CVNov 6, 2025
DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification

Yujie Yang, Shuang Li, Jun Ye et al.

Video-based Visible-Infrared person re-identification (VVI-ReID) aims to retrieve the same pedestrian across visible and infrared modalities from video sequences. Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. Specifically, we introduce a Semantic-Aware Silhouette and Gait Learning (SASGL) model, which generates and enhances silhouette representations with general-purpose semantic priors from DINOv2 and jointly optimizes them with the ReID objective to achieve semantically enriched and task-adaptive gait feature learning. Furthermore, we develop a Progressive Bidirectional Multi-Granularity Enhancement (PBMGE) module, which progressively refines feature representations by enabling bidirectional interactions between gait and appearance streams across multiple spatial granularities, fully leveraging their complementarity to enhance global representations with rich local details and produce highly discriminative features. Extensive experiments on HITSZ-VCM and BUPT datasets demonstrate the superiority of our approach, significantly outperforming existing state-of-the-art methods.

CLApr 15, 2025Code
Dependency Structure Augmented Contextual Scoping Framework for Multimodal Aspect-Based Sentiment Analysis

Hao Liu, Lijun He, Jiaxi Liang et al.

Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (\textbf{D}ependency Structure \textbf{A}ugmented \textbf{Sco}ping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we design a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improves the model's perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges like SCP and MIM. Furthermore, we incorporate dependency trees as a syntactic branch combined with a semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise, thereby addressing the SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+2.3\% F1 and +3.5\% precision on Twitter2015). The source code is available at https://github.com/LHaoooo/DASCO.

IRMar 3, 2025Code
Composed Multi-modal Retrieval: A Survey of Approaches and Applications

Kun Zhang, Jingyu Li, Zhe Li et al.

The burgeoning volume of multi-modal data necessitates advanced retrieval paradigms beyond unimodal and cross-modal approaches. Composed Multi-modal Retrieval (CMR) emerges as a pivotal next-generation technology, enabling users to query images or videos by integrating a reference visual input with textual modifications, thereby achieving unprecedented flexibility and precision. This paper provides a comprehensive survey of CMR, covering its fundamental challenges, technical advancements, and applications. CMR is categorized into supervised, zero-shot, and semi-supervised learning paradigms. We discuss key research directions, including data construction, model architecture, and loss optimization in supervised CMR, as well as transformation frameworks and linear integration in zero-shot CMR, and semi-supervised CMR that leverages generated pseudo-triplets while addressing data noise/uncertainty. Additionally, we extensively survey the diverse application landscape of CMR, highlighting its transformative potential in e-commerce, social media, search engines, public security, etc. Seven high impact application scenarios are explored in detail with benchmark data sets and performance analysis. Finally, we further provide new potential research directions with the hope of inspiring exploration in other yet-to-be-explored fields. A curated list of works is available at: https://github.com/kkzhang95/Awesome-Composed-Multi-modal-Retrieval

LGFeb 25
C$^{2}$TC: A Training-Free Framework for Efficient Tabular Data Condensation

Sijia Xu, Fan Li, Xiaoyang Wang et al.

Tabular data is the primary data format in industrial relational databases, underpinning modern data analytics and decision-making. However, the increasing scale of tabular data poses significant computational and storage challenges to learning-based analytical systems. This highlights the need for data-efficient learning, which enables effective model training and generalization using substantially fewer samples. Dataset condensation (DC) has emerged as a promising data-centric paradigm that synthesizes small yet informative datasets to preserve data utility while reducing storage and training costs. However, existing DC methods are computationally intensive due to reliance on complex gradient-based optimization. Moreover, they often overlook key characteristics of tabular data, such as heterogeneous features and class imbalance. To address these limitations, we introduce C$^{2}$TC (Class-Adaptive Clustering for Tabular Condensation), the first training-free tabular dataset condensation framework that jointly optimizes class allocation and feature representation, enabling efficient and scalable condensation. Specifically, we reformulate the dataset condensation objective into a novel class-adaptive cluster allocation problem (CCAP), which eliminates costly training and integrates adaptive label allocation to handle class imbalance. To solve the NP-hard CCAP, we develop HFILS, a heuristic local search that alternates between soft allocation and class-wise clustering to efficiently obtain high-quality solutions. Moreover, a hybrid categorical feature encoding (HCFE) is proposed for semantics-preserving clustering of heterogeneous discrete attributes. Extensive experiments on 10 real-world datasets demonstrate that C$^{2}$TC improves efficiency by at least 2 orders of magnitude over state-of-the-art baselines, while achieving superior downstream performance.

AIFeb 3
Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration

Wei Dai, Haoyu Wang, Honghao Chang et al.

Vision Language Models (VLMs) typically assume complete modality input during inference. However, their effectiveness drops sharply when certain modalities are unavailable or incomplete. Current research primarily faces two dilemmas: Prompt-based methods struggle to restore missing yet indispensable features and impair generalization of VLMs. Imputation-based approaches, lacking effective guidance, are prone to generating semantically irrelevant noise. Restoring precise semantics while sustaining VLM generalization remains challenging. Therefore, we propose a general missing modality restoration strategy in this paper. We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment. Zero-shot evaluations across benchmark datasets demonstrate that our approach outperforms existing baseline methods. Extensive experiments and ablation studies confirm our model as a robust and scalable extension for VLMs in missing modality scenarios, ensuring reliability across diverse missing rates and environments. Our code and models will be publicly available.

CVNov 11, 2025
Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving

Jian Wang, Lijun He, Yixing Yong et al.

Modern autonomous driving (AD) systems leverage 3D object detection to perceive foreground objects in 3D environments for subsequent prediction and planning. Visual 3D detection based on RGB cameras provides a cost-effective solution compared to the LiDAR paradigm. While achieving promising detection accuracy, current deep neural network-based models remain highly susceptible to adversarial examples. The underlying safety concerns motivate us to investigate realistic adversarial attacks in AD scenarios. Previous work has demonstrated the feasibility of placing adversarial posters on the road surface to induce hallucinations in the detector. However, the unnatural appearance of the posters makes them easily noticeable by humans, and their fixed content can be readily targeted and defended. To address these limitations, we propose the AdvRoad to generate diverse road-style adversarial posters. The adversaries have naturalistic appearances resembling the road surface while compromising the detector to perceive non-existent objects at the attack locations. We employ a two-stage approach, termed Road-Style Adversary Generation and Scenario-Associated Adaptation, to maximize the attack effectiveness on the input scene while ensuring the natural appearance of the poster, allowing the attack to be carried out stealthily without drawing human attention. Extensive experiments show that AdvRoad generalizes well to different detectors, scenes, and spoofing locations. Moreover, physical attacks further demonstrate the practical threats in real-world environments.

LGMay 11
Anchor-guided Hypergraph Condensation with Dual-level Discrimination

Fan Li, Xiaoyang Wang, Chen Chen et al.

The increasing prevalence of large-scale hypergraphs poses significant computational challenges for hypergraph neural network (HNN) training. To address this, hypergraph condensation (HGC) distills large real hypergraphs into compact yet informative synthetic ones, beyond graph condensation (GC) methods limited to pairwise relations. However, existing HGC methods rely on decoupled training architectures, where structure generators are pre-trained on the original hypergraph but not jointly optimized with condensed features during refinement, resulting in misaligned structures that degrade downstream utility. Moreover, trajectory-based optimization incurs substantial computational overhead in refinement, limiting condensation efficiency. To tackle these issues, we propose \textbf{A}nchor-guided \textbf{H}yper\textbf{G}raph \textbf{C}ondensation with \textbf{D}ual-level \textbf{D}iscrimination (\textbf{AHGCDD}), which consists of three key components: (1) a node initialization module based on Heat Kernel PageRank (HKPR) to encode structural knowledge into feature semantics; (2) an anchor-guided hyperedge synthesis strategy for joint optimization of condensed features and structure; (3) a theoretically grounded dual-level discrimination objective for utility-preserving condensation without redundant HNN training. Extensive experiments demonstrate the superior effectiveness and efficiency of AHGCDD.

CVNov 7, 2025
Learning Fourier shapes to probe the geometric world of deep neural networks

Jian Wang, Yixing Yong, Haixia Bi et al.

While both shape and texture are fundamental to visual recognition, research on deep neural networks (DNNs) has predominantly focused on the latter, leaving their geometric understanding poorly probed. Here, we show: first, that optimized shapes can act as potent semantic carriers, generating high-confidence classifications from inputs defined purely by their geometry; second, that they are high-fidelity interpretability tools that precisely isolate a model's salient regions; and third, that they constitute a new, generalizable adversarial paradigm capable of deceiving downstream visual tasks. This is achieved through an end-to-end differentiable framework that unifies a powerful Fourier series to parameterize arbitrary shapes, a winding number-based mapping to translate them into the pixel grid required by DNNs, and signal energy constraints that enhance optimization efficiency while ensuring physically plausible shapes. Our work provides a versatile framework for probing the geometric world of DNNs and opens new frontiers for challenging and understanding machine perception.
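The two geometric ingredients named in this abstract, a Fourier-series curve and a winding-number test for mapping it onto pixels, can be sketched as follows. This is a plain, non-differentiable illustration under assumptions: the coefficient layout, the function names, and the example shape are made up here, and the paper's end-to-end differentiable mapping and energy constraints are not reproduced.

```python
import math

def fourier_curve(coeffs_x, coeffs_y, n_points=256):
    """Sample a closed curve whose x(t), y(t) are truncated Fourier series.

    coeffs_x / coeffs_y: lists of (a_k, b_k) pairs for harmonics k = 1..K,
    so that x(t) = sum_k a_k cos(k t) + b_k sin(k t), and likewise for y(t).
    """
    pts = []
    for i in range(n_points):
        t = 2.0 * math.pi * i / n_points
        x = sum(a * math.cos(k * t) + b * math.sin(k * t)
                for k, (a, b) in enumerate(coeffs_x, start=1))
        y = sum(a * math.cos(k * t) + b * math.sin(k * t)
                for k, (a, b) in enumerate(coeffs_y, start=1))
        pts.append((x, y))
    return pts

def winding_number(point, curve):
    """Number of counter-clockwise turns the closed polyline makes around point."""
    px, py = point
    total = 0.0
    n = len(curve)
    for i in range(n):
        x0, y0 = curve[i][0] - px, curve[i][1] - py
        x1, y1 = curve[(i + 1) % n][0] - px, curve[(i + 1) % n][1] - py
        # Signed angle subtended by this segment, from cross and dot products.
        total += math.atan2(x0 * y1 - x1 * y0, x0 * x1 + y0 * y1)
    return round(total / (2.0 * math.pi))

# Hypothetical shape: a unit circle perturbed by a second harmonic in x.
shape = fourier_curve([(1.0, 0.0), (0.2, 0.0)], [(0.0, 1.0)])
inside = winding_number((0.0, 0.0), shape)
outside = winding_number((2.0, 0.0), shape)
print(inside, outside)
```

Rasterizing a shape then amounts to evaluating the winding number at each pixel centre and filling pixels where it is nonzero; a differentiable variant would replace the rounding with a smooth function of the accumulated angle.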

LGAug 17, 2025Code
DHG-Bench: A Comprehensive Benchmark for Deep Hypergraph Learning

Fan Li, Xiaoyang Wang, Wenjie Zhang et al.

Deep graph models have achieved great success in network representation learning. However, their focus on pairwise relationships restricts their ability to learn pervasive higher-order interactions in real-world systems, which can be naturally modeled as hypergraphs. To tackle this issue, Hypergraph Neural Networks (HNNs) have garnered substantial attention in recent years. Despite the proposal of numerous HNNs, the absence of consistent experimental protocols and multi-dimensional empirical analysis impedes deeper understanding and further development of HNN research. While several toolkits for deep hypergraph learning (DHGL) have been introduced to facilitate algorithm evaluation, they provide only limited quantitative evaluation results and insufficient coverage of advanced algorithms, datasets, and benchmark tasks. To fill the gap, we introduce DHG-Bench, the first comprehensive benchmark for HNNs. Specifically, DHG-Bench systematically investigates the characteristics of HNNs in terms of four dimensions: effectiveness, efficiency, robustness, and fairness. We comprehensively evaluate 17 state-of-the-art HNN algorithms on 22 diverse datasets spanning node-, edge-, and graph-level tasks, under unified experimental settings. Extensive experiments reveal both the strengths and limitations of existing algorithms, offering valuable insights and directions for future research. Furthermore, to facilitate reproducible research, we have developed an easy-to-use library for training and evaluating different HNN methods. The DHG-Bench library is available at: https://github.com/Coco-Hut/DHG-Bench.

CVJul 14, 2025Code
RefSTAR: Blind Facial Image Restoration with Reference Selection, Transfer, and Reconstruction

Zhicun Yin, Junjie Chen, Ming Liu et al.

Blind facial image restoration is highly challenging due to unknown complex degradations and the sensitivity of humans to faces. Although existing methods introduce auxiliary information from generative priors or high-quality reference images, they still struggle with identity preservation problems, mainly due to improper feature introduction on detailed textures. In this paper, we focus on effectively incorporating appropriate features from high-quality reference images, presenting a novel blind facial image restoration method that considers reference selection, transfer, and reconstruction (RefSTAR). In terms of selection, we construct a reference selection (RefSel) module. For training the RefSel module, we construct a RefSel-HQ dataset through a mask generation pipeline, which contains annotating masks for 10,000 ground truth-reference pairs. As for the transfer, due to the trivial solution in vanilla cross-attention operations, a feature fusion paradigm is designed to force the features from the reference to be integrated. Finally, we propose a reference image reconstruction mechanism that further ensures the presence of reference image features in the output image. The cycle consistency loss is also redesigned in conjunction with the mask. Extensive experiments on various backbone models demonstrate superior performance, showing better identity preservation ability and reference feature transfer quality. Source code, dataset, and pre-trained models are available at https://github.com/yinzhicun/RefSTAR.

CVNov 10, 2020Code
STCNet: Spatio-Temporal Cross Network for Industrial Smoke Detection

Yichao Cao, Qingfei Tang, Xiaobo Lu et al.

Industrial smoke emissions present a serious threat to natural ecosystems and human health. Prior works have shown that using computer vision techniques to identify smoke is a low-cost and convenient method. However, industrial smoke detection is a challenging task because industrial emission particles often decay rapidly outside the stacks or facilities, and steam is very similar to smoke. To overcome these problems, a novel Spatio-Temporal Cross Network (STCNet) is proposed to recognize industrial smoke emissions. The proposed STCNet involves a spatial pathway to extract texture features and a temporal pathway to capture smoke motion information. We assume that the spatial and temporal pathways can guide each other. For example, the spatial path can easily recognize obvious interference such as trees and buildings, and the temporal path can highlight the obscure traces of smoke movement. Such mutual guidance should improve smoke detection performance. In addition, we design an efficient and concise spatio-temporal dual pyramid architecture to ensure better fusion of multi-scale spatiotemporal information. Finally, extensive experiments show that STCNet outperforms the best competitors by 6.2% on the challenging public RISE industrial smoke detection dataset. The code will be available at: https://github.com/Caoyichao/STCNet.

CVMay 8
SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

Yecong Wan, Fan Li, Mingwen Shao et al.

Generalizable novel view synthesis aims to render unseen views from uncalibrated input images without requiring per-scene optimization. Recent feed-forward approaches based on 3D Gaussian Splatting have achieved promising efficiency and rendering quality. However, most of them assign a fixed number of Gaussians to each pixel or voxel, ignoring the spatially varying complexity of real-world scenes. Such uniform allocation often wastes Gaussian primitives in smooth regions while providing insufficient capacity for fine structures, complex geometry, and high-frequency details. This motivates us to predict region-dependent primitive cardinalities rather than impose a fixed primitive budget everywhere, enabling a more expressive yet compact 3D scene representation. Therefore, we propose SplatWeaver, a generalizable novel view synthesis framework that is able to dynamically allocate Gaussian primitives over different regions in a feed-forward manner. Specifically, SplatWeaver introduces cardinality Gaussian experts and a pixel-level routing scheme, wherein each expert specializes in producing a specific number of primitives from 0 to M, and the routing scheme coordinates these experts to adaptively determine how many Gaussian primitives should be allocated to each spatial location. Moreover, SplatWeaver incorporates a high-frequency prior with attendant guidance module and routing regularization to stabilize expert selection and promote complexity-aware allocation. By leveraging high-frequency structural cues, the routing process is encouraged to assign more Gaussian primitives to fine structures, complex geometry, and textured regions, while suppressing redundant primitives in smooth areas. Extensive experiments across diverse scenarios show that SplatWeaver consistently outperforms state-of-the-art methods, delivering more faithful novel-view renderings with fewer Gaussian primitives.
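
The idea of letting a router pick how many primitives each pixel receives can be sketched as follows. This is a minimal illustration and not SplatWeaver's implementation: the hard top-1 choice, the function names `softmax` and `route_cardinalities`, and the toy logits are all assumptions, since the abstract does not specify how the routing scores are turned into a decision.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_cardinalities(router_logits):
    """Pick a primitive count per pixel.

    router_logits holds one list of M + 1 scores per pixel; index k stands for
    the (hypothetical) expert that emits exactly k primitives.
    """
    counts = []
    for logits in router_logits:
        probs = softmax(logits)
        counts.append(max(range(len(probs)), key=probs.__getitem__))
    return counts


# Three pixels, experts for 0..3 primitives (M = 3); values are made up.
logits = [
    [2.0, 0.1, -1.0, -2.0],   # smooth region: expert 0 scores highest
    [-1.0, 0.0, 0.5, 3.0],    # fine structure: expert 3 scores highest
    [0.0, 1.5, 0.2, -0.5],    # moderate detail: expert 1 scores highest
]
counts = route_cardinalities(logits)
total = sum(counts)
print(counts, total)
```

With these toy scores the three pixels receive 0, 3, and 1 primitives, 4 in total, whereas a fixed budget of M = 3 per pixel would spend 9.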

LGMar 12
CFD-HAR: User-controllable Privacy through Conditional Feature Disentanglement

Alex Gn, Fan Li, S Kuniyilh et al.

Modern wearable and mobile devices are equipped with inertial measurement units (IMUs). Human Activity Recognition (HAR) applications running on such devices use machine-learning-based, data-driven techniques that leverage such sensor data. However, sensor-data-driven HAR deployments face two critical challenges: protecting sensitive user information embedded in sensor data in accordance with users' privacy preferences and maintaining high recognition performance with limited labeled samples. This paper proposes a technique for user-controllable privacy through feature disentanglement-based representation learning at the granular level for dynamic privacy filtering. We also compare the efficacy of our technique against few-shot HAR using autoencoder-based representation learning. We analyze their architectural designs, learning objectives, privacy guarantees, data efficiency, and suitability for edge Internet of Things (IoT) deployment. Our study shows that CFD-based HAR provides explicit, tunable privacy protection controls by separating activity and sensitive attributes in the latent space, whereas autoencoder-based few-shot HAR offers superior label efficiency and lightweight adaptability but lacks inherent privacy safeguards. We further examine the security implications of both approaches in continual IoT settings, highlighting differences in susceptibility to representation leakage and embedding-level attacks. The analysis reveals that neither paradigm alone fully satisfies the emerging requirements of next-generation IoT HAR systems. We conclude by outlining research directions toward unified frameworks that jointly optimize privacy preservation, few-shot adaptability, and robustness for trustworthy IoT intelligence.

CVApr 21
KD-Judge: A Knowledge-Driven Automated Judge Framework for Functional Fitness Movements on Edge Devices

Shaibal Saha, Fan Li, Yunge Li et al.

Functional fitness movements are widely used in training, competition, and health-oriented exercise programs, yet consistently enforcing repetition (rep) standards remains challenging due to subjective human judgment, time constraints, and evolving rules. Existing AI-based approaches mainly rely on learned scoring or reference-based comparisons and lack explicit rule-based reasoning, limiting transparency and deterministic rep-level validation. To address these limitations, we propose KD-Judge, a novel knowledge-driven automated judging framework for functional fitness movements. It converts unstructured rulebook standards into executable, machine-readable representations using an LLM-based retrieval-augmented generation and chain-of-thought rule-structuring pipeline. The structured rules are then incorporated by a deterministic rule-based judging system with pose-guided kinematic reasoning to assess rep validity and temporal boundaries. To improve efficiency on edge devices, including a high-performance desktop and the resource-constrained Jetson AGX Xavier, we introduce a dual-strategy caching mechanism that can be selectively applied to reduce redundant and unnecessary computation. Experiments demonstrate reliable rule-structuring performance and accurate rep-level assessment, with judgment evaluation conducted on the CFRep dataset, achieving faster-than-real-time execution (real-time factor (RTF) < 1). When the proposed caching strategy is enabled, the system achieves up to 3.36x and 15.91x speedups on the resource-constrained edge device compared to the non-caching baseline for pre-recorded and live-streaming scenarios, respectively. These results show that KD-Judge enables transparent, efficient, and scalable rule-grounded rep-level analysis that can complement human judging in practice.
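
The two efficiency figures quoted above follow from simple ratios. The sketch below uses the common definitions (RTF as processing time divided by media duration, speedup as baseline time divided by optimized time); the function names and the timings are illustrative assumptions, not measurements from the paper.

```python
def real_time_factor(processing_seconds, media_seconds):
    """RTF = processing time / media duration; a value below 1 means faster than real time."""
    if media_seconds <= 0:
        raise ValueError("media_seconds must be positive")
    return processing_seconds / media_seconds


def speedup(baseline_seconds, optimized_seconds):
    """How many times faster the optimized run is than the baseline."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    return baseline_seconds / optimized_seconds


# Hypothetical timings: a 60 s clip judged in 45 s, and a 90 s run cut to 30 s by caching.
rtf = real_time_factor(45.0, 60.0)
gain = speedup(90.0, 30.0)
print(rtf, gain)
```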

CVApr 21
HP-Edit: A Human-Preference Post-Training Framework for Image Editing

Fan Li, Chonghuinan Wang, Lina Lei et al.

Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset that spans eight common tasks and balances common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer, an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.

CVApr 29
MemOVCD: Training-Free Open-Vocabulary Change Detection via Cross-Temporal Memory Reasoning and Global-Local Adaptive Rectification

Zuzheng Kuang, Honghao Chang, Boqiang Liang et al.

Open-vocabulary change detection aims to identify semantic changes in bi-temporal remote sensing images without predefined categories. Recent methods combine foundation models such as SAM, DINO and CLIP, but typically process each timestamp independently or interact only at the final comparison stage. Such paradigms suffer from insufficient temporal coupling during semantic reasoning, which limits their ability to distinguish genuine semantic changes from non-semantic appearance discrepancies. In addition, patch-dominant inference on high-resolution images often weakens global semantic continuity and produces fragmented change regions. To address these issues, we propose MemOVCD, a training-free open-vocabulary change detection framework based on cross-temporal memory reasoning and global-local adaptive rectification. Specifically, we reformulate bi-temporal change detection as a two-frame tracking problem and introduce weighted bidirectional propagation to aggregate semantic evidence from both temporal directions. To stabilize memory propagation across large temporal gaps, we construct histogram-aligned transition frames to smooth abrupt appearance changes. Moreover, a global-local adaptive rectification strategy adaptively fuses local and global-view predictions, improving spatial consistency while preserving fine-grained details. Experiments on five benchmarks demonstrate that MemOVCD achieves favorable performance on two change detection tasks, validating its effectiveness and generalization under diverse open-vocabulary settings.

CVOct 14, 2024
MagicEraser: Erasing Any Objects via Semantics-Aware Control

Fan Li, Zixiao Zhang, Yi Huang et al.

The traditional image inpainting task aims to restore corrupted regions by referencing the surrounding background and foreground. However, the object erasure task, which is in increasing demand, aims to erase objects and generate a harmonious background. Previous GAN-based inpainting methods struggle with intricate texture generation. Emerging diffusion model-based algorithms, such as Stable Diffusion Inpainting, exhibit the capability to generate novel content, but they often produce incongruent results at the locations of the erased objects and require high-quality text prompt inputs. To address these challenges, we introduce MagicEraser, a diffusion model-based framework tailored for the object erasure task. It consists of two phases: content initialization and controllable generation. In the latter phase, we develop two plug-and-play modules called prompt tuning and semantics-aware attention refocus. Additionally, we propose a data construction strategy that generates training data especially suitable for this task. MagicEraser achieves fine and effective control of content generation while mitigating undesired artifacts. Experimental results highlight a valuable advancement of our approach in the object erasure task.

LGFeb 15, 2024
Multi-Fidelity Methods for Optimization: A Survey

Ke Li, Fan Li

Real-world black-box optimization often involves time-consuming or costly experiments and simulations. Multi-fidelity optimization (MFO) stands out as a cost-effective strategy that balances high-fidelity accuracy with computational efficiency through a hierarchical fidelity approach. This survey presents a systematic exploration of MFO, underpinned by a novel text mining framework based on a pre-trained language model. We delve deep into the foundational principles and methodologies of MFO, focusing on three core components -- multi-fidelity surrogate models, fidelity management strategies, and optimization techniques. Additionally, this survey highlights the diverse applications of MFO across several key domains, including machine learning, engineering design optimization, and scientific discovery, showcasing the adaptability and effectiveness of MFO in tackling complex computational challenges. Furthermore, we also envision several emerging challenges and prospects in the MFO landscape, spanning scalability, the composition of lower fidelities, and the integration of human-in-the-loop approaches at the algorithmic level. We also address critical issues related to benchmarking and the advancement of open science within the MFO community. Overall, this survey aims to catalyze further research and foster collaborations in MFO, setting the stage for future innovations and breakthroughs in the field.

CVMar 26, 2024
MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors

He Zhang, Shenghao Ren, Haolei Yuan et al.

Foot contact is an important cue for human motion capture, understanding, and generation. Existing datasets tend to annotate dense foot contact using visual matching with thresholding or incorporating pressure signals. However, these approaches either suffer from low accuracy or are only designed for small-range and slow motion. There is still a lack of a vision-pressure multimodal dataset with large-range and fast human motion, as well as accurate and dense foot-contact annotation. To fill this gap, we propose a Multimodal MoCap Dataset with Vision and Pressure sensors, named MMVP. MMVP provides accurate and dense plantar pressure signals synchronized with RGBD observations, which is especially useful for plausible shape estimation, robust pose fitting without foot drifting, and accurate global translation tracking. To validate the dataset, we propose an RGBD-P SMPL fitting method and also a monocular-video-based baseline framework, VP-MoCap, for human motion capture. Experiments demonstrate that our RGBD-P SMPL fitting results significantly outperform pure visual motion capture. Moreover, VP-MoCap outperforms SOTA methods in foot-contact and global translation estimation accuracy. We believe the configuration of the dataset and the baseline frameworks will stimulate research in this direction and also provide a good reference for MoCap applications in various domains. Project page: https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/.

LGApr 18, 2024
Hypergraph Self-supervised Learning with Sampling-efficient Signals

Fan Li, Xiaoyang Wang, Dawei Cheng et al.

Self-supervised learning (SSL) provides a promising alternative for representation learning on hypergraphs without costly labels. However, existing hypergraph SSL models are mostly based on contrastive methods with the instance-level discrimination strategy, suffering from two significant limitations: (1) They select negative samples arbitrarily, which is unreliable in deciding similar and dissimilar pairs, causing training bias. (2) They often require a large number of negative samples, resulting in expensive computational costs. To address the above issues, we propose SE-HSSL, a hypergraph SSL framework with three sampling-efficient self-supervised signals. Specifically, we introduce two sampling-free objectives leveraging the canonical correlation analysis as the node-level and group-level self-supervised signals. Additionally, we develop a novel hierarchical membership-level contrast objective motivated by the cascading overlap relationship in hypergraphs, which can further reduce membership sampling bias and improve the efficiency of sample utilization. Through comprehensive experiments on 7 real-world hypergraphs, we demonstrate the superiority of our approach over the state-of-the-art method in terms of both effectiveness and efficiency.
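
A sampling-free, CCA-style objective of the kind mentioned above can be sketched without any negative samples. The form below (an invariance term between two standardized views plus a decorrelation penalty on each view) is a generic formulation from the CCA-based SSL literature and is an assumption here; the abstract does not give SE-HSSL's exact loss, and the function names and the weight `lam` are hypothetical.

```python
import math


def standardize(view):
    """Center each feature column and scale it to unit Euclidean norm."""
    n, d = len(view), len(view[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in view]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        scale = max(std, 1e-12) * math.sqrt(n)  # guard against constant columns
        for i in range(n):
            out[i][j] = (col[i] - mean) / scale
    return out


def decorrelation_penalty(z):
    """Squared Frobenius distance between Z^T Z and the identity matrix."""
    n, d = len(z), len(z[0])
    penalty = 0.0
    for a in range(d):
        for b in range(d):
            gram = sum(z[i][a] * z[i][b] for i in range(n))
            target = 1.0 if a == b else 0.0
            penalty += (gram - target) ** 2
    return penalty


def cca_style_loss(view1, view2, lam=0.5):
    """Invariance between views plus a decorrelation penalty; uses no negative samples."""
    z1, z2 = standardize(view1), standardize(view2)
    invariance = sum(
        (x - y) ** 2 for r1, r2 in zip(z1, z2) for x, y in zip(r1, r2)
    )
    return invariance + lam * (decorrelation_penalty(z1) + decorrelation_penalty(z2))


# Toy embeddings for 4 nodes with 2 features each.
decorrelated = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
collapsed = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
good = cca_style_loss(decorrelated, decorrelated)
bad = cca_style_loss(collapsed, collapsed)
print(good, bad)
```

Identical views with uncorrelated features give a loss near zero, while views whose two features are copies of each other are penalized, which is how such objectives discourage collapsed embeddings without contrasting against sampled negatives.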