CV Apr 12
AIM-Bench: Benchmarking and Improving Affective Image Manipulation via Fine-Grained Hierarchical Control
Shi Chen, Xuecheng Wu, Heli Sun et al.
Affective Image Manipulation (AIM) aims to evoke specific emotions through targeted editing. Current image editing benchmarks primarily focus on object-level modifications in general scenarios and lack the granularity needed to capture affective dimensions. To bridge this gap, we introduce AIM-Bench, the first benchmark designed for AIM. This benchmark is built upon a dual-path affective modeling scheme that integrates the Mikels emotion taxonomy with the Valence-Arousal-Dominance framework, enabling both high-level semantic and fine-grained continuous manipulation. Through a hierarchical human-in-the-loop workflow, we curate 800 high-quality samples covering 8 emotional categories and 5 editing types. To assess performance, we also design a composite evaluation suite that combines rule-based and model-based metrics to holistically measure instruction consistency, aesthetics, and emotional expressiveness. Extensive evaluations reveal that current editing models face significant challenges, most notably a prevalent positivity bias, which stems from inherent imbalances in the training data distribution. To tackle this, we propose a scalable data engine that uses an inverse repainting strategy to construct AIM-40k, a balanced instruction-tuning dataset comprising 40k samples. Concretely, we enhance raw affective images via generative redrawing to establish high-fidelity ground truths, and synthesize input images with divergent emotions together with paired precise instructions. Fine-tuning a baseline model on AIM-40k yields a 9.15% relative improvement in overall performance, demonstrating the effectiveness of AIM-40k. Our data and related code will be released soon.
CV Jan 2
RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for Rehabilitation
Junxiao Xue, Pavel Smirnov, Ziao Li et al.
We propose RePose, a real-time 3D human pose estimation and motion analysis method for rehabilitation training. It monitors and evaluates patients' motion in real time during rehabilitation, providing immediate feedback and guidance to help patients execute rehabilitation exercises correctly. First, we introduce a unified pipeline for end-to-end real-time human pose estimation and motion analysis using RGB video input from multiple cameras, which can be applied to rehabilitation training. The pipeline helps to monitor and correct patients' actions, thus aiding them in regaining muscle strength and motor functions. Second, we propose a fast tracking method for medical rehabilitation scenarios with multiple-person interference, which requires less than 1 ms to track a single frame. Additionally, we modify SmoothNet for real-time posture estimation, effectively reducing pose estimation errors and restoring the patient's true motion state while making it visually smoother. Finally, we use the Unity platform to monitor and evaluate patients' motion in real time during rehabilitation and to display muscle stress conditions, assisting patients with their rehabilitation training.
CV Sep 29, 2025 Code
Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis
Xuecheng Wu, Junxiao Xue, Xinyi Yin et al.
Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems, yet this field continues to suffer from limited data availability. In recent years, the self-supervised learning (SSL) technique of Masked Autoencoders (MAE) has gained momentum, with growing adaptations to audio-visual contexts. While scaling has proven essential for breakthroughs in general multi-modal learning domains, its specific impact on AVFA remains largely unexplored. Another core challenge in this field is capturing both intra- and inter-modal correlations through scalable audio-visual representations. To tackle these issues, we propose AVF-MAE++, a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA while enhancing cross-modal correlation modeling. Our framework introduces a novel dual masking strategy across audio and visual modalities and strengthens modality encoders with a more holistic design to better support scalable pre-training. Additionally, we present the Iterative Audio-Visual Correlation Learning Module, which improves correlation learning within the SSL paradigm and addresses the limitations of previous methods. To support smooth adaptation and reduce overfitting risks, we further introduce a progressive semantic injection strategy that organizes model training into three structured stages. Extensive experiments conducted on 17 datasets, covering three major AVFA tasks, demonstrate that AVF-MAE++ achieves consistent state-of-the-art performance across multiple benchmarks. Comprehensive ablation studies further highlight the importance of each proposed component and provide deeper insights into the design choices driving these improvements. Our code and models have been publicly released on GitHub.
CV Apr 20
AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning
Junxiao Xue, Quan Deng, Tingqi Hu et al.
Despite recent progress in multimodal large language models (MLLMs), reliable visual question answering in aerial scenes remains challenging. In such scenes, task-critical evidence is often carried by small objects, explicit quantities, coarse locations, and inter-object relations, whereas conventional dense visual-token representations are not well aligned with these structured semantics. To address this interface mismatch, we propose AeroRAG, a scene-graph-guided multimodal retrieval-augmented generation framework for visual question answering. The framework first converts an input image into structured visual knowledge, including object categories, quantities, spatial locations, and semantic relations, and then retrieves query-relevant semantic chunks to construct compact prompts for a text-based large language model. Rather than relying on direct reasoning over dense visual tokens, our method introduces a more explicit intermediate interface between perception and language reasoning. Experiments on the AUG aerial dataset and the general-domain VG-150 benchmark show consistent improvements over six strong MLLM baselines, with the largest gains observed in dense aerial scenes and relation-sensitive reasoning. We further evaluate the framework on VQAv2 to verify that the proposed interface remains compatible with standard visual reasoning settings. These results suggest that structured retrieval is a practical design direction for deployment-oriented and grounded visual reasoning systems.
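As a rough illustration of the "structured knowledge, then retrieval, then compact prompt" flow described above, the following minimal Python sketch serializes a toy scene graph into text chunks, ranks them by word overlap with the question, and assembles a prompt for a text-only LLM. Every name here (`scene_graph_to_chunks`, `retrieve`, `build_prompt`, the chunk wording, the overlap score) is a hypothetical stand-in chosen for illustration; the abstract does not specify how AeroRAG builds chunks or scores relevance.

```python
import re


def scene_graph_to_chunks(objects, relations):
    """Serialize a toy scene graph into short text chunks (hypothetical format)."""
    counts = {}
    for obj in objects:
        counts[obj["category"]] = counts.get(obj["category"], 0) + 1
    chunks = [f"count: {n} {category}" for category, n in sorted(counts.items())]
    chunks += [f"location: {o['category']} in the {o['region']}" for o in objects]
    chunks += [f"relation: {s} {p} {o}" for s, p, o in relations]
    return chunks


def tokenize(text):
    """Lowercase bag of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, chunks, top_k=3):
    """Rank chunks by word overlap with the query (a stand-in for a real retriever)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(c)), i, c) for i, c in enumerate(chunks)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, then original order
    return [c for score, _, c in scored[:top_k] if score > 0]


def build_prompt(query, chunks, top_k=3):
    """Assemble a compact text prompt from the retrieved chunks."""
    facts = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return f"Scene facts:\n{facts}\nQuestion: {query}\nAnswer:"


# Toy aerial scene (made-up data).
objects = [
    {"category": "car", "region": "top left"},
    {"category": "car", "region": "center"},
    {"category": "truck", "region": "bottom right"},
]
relations = [("car", "parked next to", "truck")]
chunks = scene_graph_to_chunks(objects, relations)
print(build_prompt("Which car is next to the truck?", chunks))
```

In this toy setup the relation chunk shares the most words with the question, so it is listed first in the prompt. A practical system would presumably replace the overlap score with a learned embedding retriever.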
CV Jan 1
Disentangling Hardness from Noise: An Uncertainty-Driven Model-Agnostic Framework for Long-Tailed Remote Sensing Classification
Chi Ding, Junxiao Xue, Xinyi Yin et al.
Long-tailed distributions are pervasive in remote sensing due to the inherently imbalanced occurrence of grounded objects. However, a critical challenge remains largely overlooked: disentangling hard tail samples from noisy, ambiguous ones. Conventional methods often indiscriminately emphasize all low-confidence samples, leading to overfitting on noisy data. To bridge this gap, building upon Evidential Deep Learning, we propose a model-agnostic uncertainty-aware framework termed DUAL, which dynamically disentangles prediction uncertainty into Epistemic Uncertainty (EU) and Aleatoric Uncertainty (AU). Specifically, we use EU as an indicator of sample scarcity to guide a reweighting strategy for hard-to-learn tail samples, while leveraging AU to quantify data ambiguity and drive an adaptive label smoothing mechanism that suppresses the impact of noise. Extensive experiments on multiple datasets across various backbones demonstrate the effectiveness and generalization of our framework, which surpasses strong baselines such as TGN and SADE. Ablation studies provide further insights into the crucial choices of our design.
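To make the reweighting and label-smoothing idea more concrete, here is a minimal Python sketch built on the standard Evidential Deep Learning setup, where non-negative per-class evidence `e_k` defines Dirichlet parameters `alpha_k = e_k + 1`. It uses two simple proxies that are assumptions of this sketch, not DUAL's definitions: vacuity `K / sum(alpha)` as a stand-in for epistemic uncertainty, and the normalized entropy of the expected class probabilities as a stand-in for data ambiguity. The mapping to a loss weight and a smoothing factor (`1 + vacuity`, `max_smoothing * ambiguity`) is likewise hypothetical.

```python
import math


def edl_uncertainties(evidence):
    """Return (expected probs, vacuity, ambiguity) for one sample's class evidence."""
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet parameters
    strength = sum(alpha)                    # Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    vacuity = num_classes / strength         # high when total evidence is scarce
    entropy = -sum(p * math.log(p) for p in probs)
    ambiguity = entropy / math.log(num_classes)  # normalized to [0, 1]
    return probs, vacuity, ambiguity


def sample_weight_and_smoothing(evidence, max_smoothing=0.2):
    """Hypothetical mapping: scarce evidence -> larger loss weight,
    ambiguous prediction -> stronger label smoothing."""
    _, vacuity, ambiguity = edl_uncertainties(evidence)
    weight = 1.0 + vacuity
    smoothing = max_smoothing * ambiguity
    return weight, smoothing


# Little evidence for any class: high vacuity, so the sample is up-weighted.
print(sample_weight_and_smoothing([0.0, 0.0, 0.0]))
# Plenty of evidence split between two classes: low vacuity, but still ambiguous.
print(sample_weight_and_smoothing([50.0, 50.0, 0.0]))
```

Note that the entropy proxy also rises when evidence is scarce, so these two quantities are not cleanly disentangled; a stricter decomposition would be needed to separate hard samples from noisy ones in the way the abstract describes.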
CV May 4
MooD: An Efficient VA-Driven Affective Image Editing Framework via Fine-Grained Semantic Control
Xinyi Yin, Yiduo Wang, Tingqi Hu et al.
Affective image editing (AIE) aims to edit visual content to evoke target emotions. However, existing methods often overlook inference efficiency and predominantly depend on discrete emotion representations, which limits their practical applicability and makes it difficult to capture complex and subtle human emotions. To address these gaps, we propose MooD, the first framework that directly leverages continuous Valence-Arousal (VA) values for fine-grained and efficient AIE. Specifically, we first introduce a VA-aware retrieval strategy to bridge vague affective values and concrete visual semantics. Building upon this, MooD integrates visual transfer and semantic guidance to achieve controllable AIE. Furthermore, we construct AffectSet, a VA-annotated dataset to support model optimization and evaluation. Extensive qualitative and quantitative experimental results demonstrate that MooD achieves superior performance in both affective controllability and visual fidelity while maintaining high efficiency. A series of ablation studies further reveals the crucial factors of our design. Our code and data will be made publicly available soon.
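One simple way to picture how a retrieval step can bridge continuous VA values and concrete visual semantics is a nearest-neighbor lookup in VA space. The Python sketch below is only an illustration under stated assumptions: the bank entries, the [-1, 1] value range, the Euclidean distance, and the function name `retrieve_by_va` are all hypothetical and are not taken from MooD or AffectSet.

```python
import math

# Hypothetical VA-annotated bank: (valence, arousal, visual description).
# Values are assumed to lie in [-1, 1]; the entries are made up.
BANK = [
    (0.8, 0.7, "fireworks over a crowded festival"),
    (0.7, -0.5, "calm lake at sunrise"),
    (-0.7, 0.8, "storm waves hitting a pier"),
    (-0.6, -0.6, "empty street in grey fog"),
]


def retrieve_by_va(valence, arousal, bank=BANK, top_k=1):
    """Return the descriptions whose VA annotation is closest to the target."""
    ranked = sorted(
        bank,
        key=lambda item: math.hypot(item[0] - valence, item[1] - arousal),
    )
    return [description for _, _, description in ranked[:top_k]]


# A pleasant, energetic target lands on the festival scene.
print(retrieve_by_va(0.9, 0.6))
# An unpleasant, low-energy target lands on the foggy street.
print(retrieve_by_va(-0.5, -0.5))
```

The retrieved descriptions could then serve as semantic guidance for an editing model; how MooD actually uses the retrieved content is not detailed in the abstract.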
CV Mar 31
FED-Bench: A Cross-Granular Benchmark for Disentangled Evaluation of Facial Expression Editing
Fengjian Xue, Xuecheng Wu, Heli Sun et al.
Facial expression image editing requires fine-grained control to strictly preserve human identity and background while precisely manipulating expression. However, existing editing benchmarks primarily focus on general scenarios, lacking high-quality facial images and corresponding editing instructions. Furthermore, current evaluation metrics exhibit systemic biases in this task, often favoring lazy editing or overfit editing. To bridge these gaps, we propose FED-Bench, a comprehensive benchmark featuring rigorous testing and an accurate evaluation suite. First, we carefully construct a benchmark of 747 triplets through a cascaded and scalable pipeline, each comprising an original image, an editing instruction, and a ground-truth image for precise evaluation. Second, we introduce FED-Score, a cross-granularity evaluation protocol that disentangles assessment into three dimensions: Alignment for verifying instruction following, Fidelity for testing image quality and identity preservation, and Relative Expression Gain for quantifying the magnitude of expression changes, effectively mitigating the aforementioned evaluation biases. Third, we benchmark 18 image editing models, revealing that current approaches struggle to simultaneously achieve high fidelity and accurate expression manipulation, with fine-grained instruction following identified as the primary bottleneck. Finally, leveraging the scalability of the introduced benchmark engine, we provide an in-the-wild facial training set of more than 20k samples and demonstrate its effectiveness by fine-tuning a baseline model that achieves significant performance gains. Our benchmark and related code will be made publicly available soon.
LG Dec 14, 2025
DARTs: A Dual-Path Robust Framework for Anomaly Detection in High-Dimensional Multivariate Time Series
Xuechun Liu, Heli Sun, Xuecheng Wu et al.
Multivariate time series anomaly detection (MTSAD) aims to accurately identify and localize complex abnormal patterns in large-scale industrial control systems. While existing approaches excel at recognizing distinct patterns in low-dimensional scenarios, they often fail to robustly capture long-range spatiotemporal dependencies when learning representations from high-dimensional noisy time series. To address these limitations, we propose DARTs, a robust long short-term dual-path framework with a window-aware spatiotemporal soft fusion mechanism, which can be decomposed into three complementary components. Specifically, in the short-term path, we introduce a Multi-View Sparse Graph Learner and a Diffusion Multi-Relation Graph Unit that collaborate to adaptively capture hierarchical discriminative short-term spatiotemporal patterns in high-noise time series. In the long-term path, we design a Multi-Scale Spatiotemporal Graph Constructor to model salient long-term dynamics within the high-dimensional representation space. Finally, a window-aware spatiotemporal soft-fusion mechanism is introduced to filter the residual noise while seamlessly integrating anomalous patterns. Extensive qualitative and quantitative experimental results across mainstream datasets demonstrate the superiority and robustness of the proposed DARTs. A series of ablation studies is also conducted to explore the crucial design factors of the proposed components. Our code and model will be made publicly available soon.
CV Nov 22, 2025
V2X-RECT: An Efficient V2X Trajectory Prediction Framework via Redundant Interaction Filtering and Tracking Error Correction
Xiangyan Kong, Xuecheng Wu, Xiongwei Zhao et al.
V2X prediction can alleviate perception incompleteness caused by limited line of sight by fusing trajectory data from infrastructure and vehicles, which is crucial to traffic safety and efficiency. However, in dense traffic scenarios, frequent identity switching of targets hinders cross-view association and fusion. Meanwhile, multi-source information tends to generate redundant interactions during the encoding stage, and traditional vehicle-centric encoding leads to large amounts of repetitive historical trajectory feature encoding, degrading real-time inference performance. To address these challenges, we propose V2X-RECT, a trajectory prediction framework designed for high-density environments. It enhances data association consistency, reduces redundant interactions, and reuses historical information to enable more efficient and accurate prediction. Specifically, we design a multi-source identity matching and correction module that leverages multi-view spatiotemporal relationships to achieve stable and consistent target association, mitigating the adverse effects of mismatches on trajectory encoding and cross-view feature fusion. We then introduce a traffic signal-guided interaction module, which encodes the trend of traffic light changes as features and exploits their role in constraining spatiotemporal passage rights to accurately filter key interacting vehicles, while capturing the dynamic impact of signal changes on interaction patterns. Furthermore, a local spatiotemporal coordinate encoding enables reusable features of historical trajectories and maps, supporting parallel decoding and significantly improving inference efficiency. Extensive experimental results on the V2X-Seq and V2X-Traj datasets demonstrate that V2X-RECT achieves significant improvements compared to SOTA methods, while also enhancing robustness and inference efficiency across diverse traffic densities.