Shunli Wang

CV
h-index27
19papers
524citations
Novelty48%
AI Score45

19 Papers

CVJul 26, 2023Code
AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception

Dingkang Yang, Shuai Huang, Zhi Xu et al.

Driver distraction has become a significant cause of severe traffic accidents over the past decade. Despite the growing development of vision-driven driver monitoring systems, the lack of comprehensive perception datasets restricts road safety and traffic security. In this paper, we present an AssIstive Driving pErception dataset (AIDE) that considers context information both inside and outside the vehicle in naturalistic scenarios. AIDE facilitates holistic driver monitoring through three distinctive characteristics, including multi-view settings of driver and scene, multi-modal annotations of face, body, posture, and gesture, and four pragmatic task designs for driving understanding. To thoroughly explore AIDE, we provide experimental benchmarks on three kinds of baseline frameworks via extensive methods. Moreover, two fusion strategies are introduced to give new insights into learning effective multi-stream/modal representations. We also systematically investigate the importance and rationality of the key components in AIDE and benchmarks. The project link is https://github.com/ydk122024/AIDE.

CVJul 16, 2022Code
CA-SpaceNet: Counterfactual Analysis for 6D Pose Estimation in Space

Shunli Wang, Shuaibing Wang, Bo Jiao et al.

Reliable and stable 6D pose estimation of uncooperative space objects plays an essential role in on-orbit servicing and debris removal missions. Considering that the pose estimator is sensitive to background interference, this paper proposes a counterfactual analysis framework named CASpaceNet to complete robust 6D pose estimation of the spaceborne targets under complicated background. Specifically, conventional methods are adopted to extract the features of the whole image in the factual case. In the counterfactual case, a non-existent image without the target but only the background is imagined. Side effect caused by background interference is reduced by counterfactual analysis, which leads to unbiased prediction in final results. In addition, we also carry out lowbit-width quantization for CA-SpaceNet and deploy part of the framework to a Processing-In-Memory (PIM) accelerator on FPGA. Qualitative and quantitative results demonstrate the effectiveness and efficiency of our proposed method. To our best knowledge, this paper applies causal inference and network quantization to the 6D pose estimation of space-borne targets for the first time. The code is available at https://github.com/Shunli-Wang/CA-SpaceNet.

CVMar 21, 2023
Context De-confounded Emotion Recognition

Dingkang Yang, Zhaoyu Chen, Yuzheng Wang et al.

Context-Aware Emotion Recognition (CAER) is a crucial and challenging task that aims to perceive the emotional states of the target person with contextual information. Recent approaches invariably focus on designing sophisticated architectures or mechanisms to extract seemingly meaningful representations from subjects and contexts. However, a long-overlooked issue is that a context bias in existing datasets leads to a significantly unbalanced distribution of emotional states among different context scenarios. Concretely, the harmful bias is a confounder that misleads existing models to learn spurious correlations based on conventional likelihood estimation, significantly limiting the models' performance. To tackle the issue, this paper provides a causality-based perspective to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task via a tailored causal graph. Then, we propose a Contextual Causal Intervention Module (CCIM) based on the backdoor adjustment to de-confound the confounder and exploit the true causal effect for model training. CCIM is plug-in and model-agnostic, which improves diverse state-of-the-art approaches by considerable margins. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our CCIM and the significance of causal insight.

CVApr 20, 2022
A Survey of Video-based Action Quality Assessment

Shunli Wang, Dingkang Yang, Peng Zhai et al.

Human action recognition and analysis have great demand and important application significance in video surveillance, video retrieval, and human-computer interaction. The task of human action quality evaluation requires the intelligent system to automatically and objectively evaluate the action completed by the human. The action quality assessment model can reduce the human and material resources spent in action evaluation and reduce subjectivity. In this paper, we provide a comprehensive survey of existing papers on video-based action quality assessment. Different from human action recognition, the application scenario of action quality assessment is relatively narrow. Most of the existing work focuses on sports and medical care. We first introduce the definition and challenges of human action quality assessment. Then we present the existing datasets and evaluation metrics. In addition, we summarized the methods of sports and medical care according to the model categories and publishing institutions according to the characteristics of the two fields. At the end, combined with recent work, the promising development direction in action quality assessment is discussed.

IVFeb 20, 2023
Towards Simultaneous Segmentation of Liver Tumors and Intrahepatic Vessels via Cross-attention Mechanism

Haopeng Kuang, Dingkang Yang, Shunli Wang et al.

Accurate visualization of liver tumors and their surrounding blood vessels is essential for noninvasive diagnosis and prognosis prediction of tumors. In medical image segmentation, there is still a lack of in-depth research on the simultaneous segmentation of liver tumors and peritumoral blood vessels. To this end, we collect the first liver tumor, and vessel segmentation benchmark datasets containing 52 portal vein phase computed tomography images with liver, liver tumor, and vessel annotations. In this case, we propose a 3D U-shaped Cross-Attention Network (UCA-Net) that utilizes a tailored cross-attention mechanism instead of the traditional skip connection to effectively model the encoder and decoder feature. Specifically, the UCA-Net uses a channel-wise cross-attention module to reduce the semantic gap between encoder and decoder and a slice-wise cross-attention module to enhance the contextual semantic learning ability among distinct slices. Experimental results show that the proposed UCA-Net can accurately segment 3D medical images and achieve state-of-the-art performance on the liver tumor and intrahepatic vessel segmentation task.

CVAug 17, 2024
MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation

Xiao Zhao, Xukun Zhang, Dingkang Yang et al.

Accurate and robust multimodal multi-task perception is crucial for modern autonomous driving systems. However, current multimodal perception research follows independent paradigms designed for specific perception tasks, leading to a lack of complementary learning among tasks and decreased performance in multi-task learning (MTL) due to joint training. In this paper, we propose MaskBEV, a masked attention-based MTL paradigm that unifies 3D object detection and bird's eye view (BEV) map segmentation. MaskBEV introduces a task-agnostic Transformer decoder to process these diverse tasks, enabling MTL to be completed in a unified decoder without requiring additional design of specific task heads. To fully exploit the complementary information between BEV map segmentation and 3D object detection tasks in BEV space, we propose spatial modulation and scene-level context aggregation strategies. These strategies consider the inherent dependencies between BEV segmentation and 3D detection, naturally boosting MTL performance. Extensive experiments on nuScenes dataset show that compared with previous state-of-the-art MTL methods, MaskBEV achieves 1.3 NDS improvement in 3D object detection and 2.7 mIoU improvement in BEV map segmentation, while also demonstrating slightly leading inference speed.

85.2CVApr 17
VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection

Hui Han, Shunli Wang, Yandan Zhao et al.

In Deepfake Detection (DFD) tasks, researchers proposed two types of MLLM-based methods: complementary combination with small DFD detectors, or static forgery knowledge injection. The lack of professional forgery knowledge hinders the performance of these DFD-MLLMs. To solve this, we deeply considered two insightful issues: How to provide high-quality associated forgery knowledge for MLLMs? AND How to endow MLLMs with critical reasoning abilities given noisy reference information? Notably, we attempted to address above two questions with preliminary answers by leveraging the combination of Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL). Through RAG and RL techniques, we propose the VRAG-DFD framework with accurate dynamic forgery knowledge retrieval and powerful critical reasoning capabilities. Specifically, in terms of data, we constructed two datasets with RAG: Forensic Knowledge Database (FKD) for DFD knowledge annotation, and Forensic Chain-of-Thought Dataset (F-CoT), for critical CoT construction. In terms of model training, we adopt a three-stage training method (Alignment->SFT->GRPO) to gradually cultivate the critical reasoning ability of the MLLM. In terms of performance, VRAG-DFD achieved SOTA and competitive performance on DFD generalization testing.

CVSep 21, 2023
CPR-Coach: Recognizing Composite Error Actions based on Single-class Training

Shunli Wang, Qing Yu, Shuaibing Wang et al.

The fine-grained medical action analysis task has received considerable attention from pattern recognition communities recently, but it faces the problems of data and algorithm shortage. Cardiopulmonary Resuscitation (CPR) is an essential skill in emergency treatment. Currently, the assessment of CPR skills mainly depends on dummies and trainers, leading to high training costs and low efficiency. For the first time, this paper constructs a vision-based system to complete error action recognition and skill assessment in CPR. Specifically, we define 13 types of single-error actions and 74 types of composite error actions during external cardiac compression and then develop a video dataset named CPR-Coach. By taking the CPR-Coach as a benchmark, this paper thoroughly investigates and compares the performance of existing action recognition models based on different data modalities. To solve the unavoidable Single-class Training & Multi-class Testing problem, we propose a humancognition-inspired framework named ImagineNet to improve the model's multierror recognition performance under restricted supervision. Extensive experiments verify the effectiveness of the framework. We hope this work could advance research toward fine-grained medical action analysis and skill assessment. The CPR-Coach dataset and the code of ImagineNet are publicly available on Github.

CVAug 4, 2024
Faster Diffusion Action Segmentation

Shuaibing Wang, Shunli Wang, Mingcheng Li et al.

Temporal Action Segmentation (TAS) is an essential task in video analysis, aiming to segment and classify continuous frames into distinct action segments. However, the ambiguous boundaries between actions pose a significant challenge for high-precision segmentation. Recent advances in diffusion models have demonstrated substantial success in TAS tasks due to their stable training process and high-quality generation capabilities. However, the heavy sampling steps required by diffusion models pose a substantial computational burden, limiting their practicality in real-time applications. Additionally, most related works utilize Transformer-based encoder architectures. Although these architectures excel at capturing long-range dependencies, they incur high computational costs and face feature-smoothing issues when processing long video sequences. To address these challenges, we propose EffiDiffAct, an efficient and high-performance TAS algorithm. Specifically, we develop a lightweight temporal feature encoder that reduces computational overhead and mitigates the rank collapse phenomenon associated with traditional self-attention mechanisms. Furthermore, we introduce an adaptive skip strategy that allows for dynamic adjustment of timestep lengths based on computed similarity metrics during inference, thereby further enhancing computational efficiency. Comprehensive experiments on the 50Salads, Breakfast, and GTEA datasets demonstrated the effectiveness of the proposed algorithm.

CVFeb 27, 2024Code
HandGCAT: Occlusion-Robust 3D Hand Mesh Reconstruction from Monocular Images

Shuaibing Wang, Shunli Wang, Dingkang Yang et al.

We propose a robust and accurate method for reconstructing 3D hand mesh from monocular images. This is a very challenging problem, as hands are often severely occluded by objects. Previous works often have disregarded 2D hand pose information, which contains hand prior knowledge that is strongly correlated with occluded regions. Thus, in this work, we propose a novel 3D hand mesh reconstruction network HandGCAT, that can fully exploit hand prior as compensation information to enhance occluded region features. Specifically, we designed the Knowledge-Guided Graph Convolution (KGC) module and the Cross-Attention Transformer (CAT) module. KGC extracts hand prior information from 2D hand pose by graph convolution. CAT fuses hand prior into occluded regions by considering their high correlation. Extensive experiments on popular datasets with challenging hand-object occlusions, such as HO3D v2, HO3D v3, and DexYCB demonstrate that our HandGCAT reaches state-of-the-art performance. The code is available at https://github.com/heartStrive/HandGCAT.

CVAug 27, 2024
Geometric Artifact Correction for Symmetric Multi-Linear Trajectory CT: Theory, Method, and Generalization

Zhisheng Wang, Yanxu Sun, Shangyu Li et al.

For extending CT field-of-view to perform non-destructive testing, the Symmetric Multi-Linear trajectory Computed Tomography (SMLCT) has been developed as a successful example of non-standard CT scanning modes. However, inevitable geometric errors can cause severe artifacts in the reconstructed images. The existing calibration method for SMLCT is both crude and inefficient. It involves reconstructing hundreds of images by exhaustively substituting each potential error, and then manually identifying the images with the fewest geometric artifacts to estimate the final geometric errors for calibration. In this paper, we comprehensively and efficiently address the challenging geometric artifacts in SMLCT, , and the corresponding works mainly involve theory, method, and generalization. In particular, after identifying sensitive parameters and conducting some theory analysis of geometric artifacts, we summarize several key properties between sensitive geometric parameters and artifact characteristics. Then, we further construct mathematical relationships that relate sensitive geometric errors to the pixel offsets of reconstruction images with artifact characteristics. To accurately extract pixel bias, we innovatively adapt the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) algorithm, commonly used in sound processing, for our image registration task for each paired symmetric LCT. This adaptation leads to the design of a highly efficient rigid translation registration method. Simulation and physical experiments have validated the excellent performance of this work. Additionally, our results demonstrate significant generalization to common rotated CT and a variant of SMLCT.

CVJun 17, 2024Code
CoMT: Chain-of-Medical-Thought Reduces Hallucination in Medical Report Generation

Yue Jiang, Jiawei Chen, Dingkang Yang et al.

Automatic medical report generation (MRG), which possesses significant research value as it can aid radiologists in clinical diagnosis and report composition, has garnered increasing attention. Despite recent progress, generating accurate reports remains arduous due to the requirement for precise clinical comprehension and disease diagnosis inference. Furthermore, owing to the limited accessibility of medical data and the imbalanced distribution of diseases, the underrepresentation of rare diseases in training data makes large-scale medical visual language models (LVLMs) prone to hallucinations, such as omissions or fabrications, severely undermining diagnostic performance and further intensifying the challenges for MRG in practice. In this study, to effectively mitigate hallucinations in medical report generation, we propose a chain-of-medical-thought approach (CoMT), which intends to imitate the cognitive process of human doctors by decomposing diagnostic procedures. The radiological features with different importance are structured into fine-grained medical thought chains to enhance the inferential ability during diagnosis, thereby alleviating hallucination problems and enhancing the diagnostic accuracy of MRG. The code and dataset have been released at https://github.com/FRENKIE-CHIANG/CoMT.

CVMar 9, 2024
Robust Emotion Recognition in Context Debiasing

Dingkang Yang, Kun Yang, Mingcheng Li et al.

Context-aware emotion recognition (CAER) has recently boosted the practical applications of affective computing techniques in unconstrained environments. Mainstream CAER methods invariably extract ensemble representations from diverse contexts and subject-centred characteristics to perceive the target person's emotional state. Despite advancements, the biggest challenge remains due to context bias interference. The harmful bias forces the models to rely on spurious correlations between background contexts and emotion labels in likelihood estimation, causing severe performance bottlenecks and confounding valuable context priors. In this paper, we propose a counterfactual emotion inference (CLEF) framework to address the above issue. Specifically, we first formulate a generalized causal graph to decouple the causal relationships among the variables in CAER. Following the causal graph, CLEF introduces a non-invasive context branch to capture the adverse direct effect caused by the context bias. During the inference, we eliminate the direct context effect from the total causal effect by comparing factual and counterfactual outcomes, resulting in bias mitigation and robust prediction. As a model-agnostic framework, CLEF can be readily integrated into existing methods, bringing consistent performance gains.

CLNov 5, 2024
Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning

Mingcheng Li, Dingkang Yang, Yang Liu et al.

Multimodal Sentiment Analysis (MSA) is an important research area that aims to understand and recognize human sentiment through multiple modalities. The complementary information provided by multimodal fusion promotes better sentiment analysis compared to utilizing only a single modality. Nevertheless, in real-world applications, many unavoidable factors may lead to situations of uncertain modality missing, thus hindering the effectiveness of multimodal modeling and degrading the model's performance. To this end, we propose a Hierarchical Representation Learning Framework (HRLF) for the MSA task under uncertain missing modalities. Specifically, we propose a fine-grained representation factorization module that sufficiently extracts valuable sentiment information by factorizing modality into sentiment-relevant and modality-specific representations through crossmodal translation and sentiment semantic reconstruction. Moreover, a hierarchical mutual information maximization mechanism is introduced to incrementally maximize the mutual information between multi-scale representations to align and reconstruct the high-level semantics in the representations. Ultimately, we propose a hierarchical adversarial learning mechanism that further aligns and adapts the latent distribution of sentiment-relevant representations to produce robust joint multimodal representations. Comprehensive experiments on three datasets demonstrate that HRLF significantly improves MSA performance under uncertain modality missing cases.

LGMay 1, 2025
Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics

Cong Xu, Wenbin Liang, Mo Yu et al.

The rapid scaling of models has led to prohibitively high training and fine-tuning costs. A major factor accounting for memory consumption is the widespread use of stateful optimizers (e.g., Adam), which maintain auxiliary information of even 2x the model size in order to achieve optimal convergence. We therefore present SOLO in this work to spawn a novel type of optimizer that requires an extremely light memory footprint. While previous efforts have achieved certain success in 8-bit or 4-bit cases, SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. This immense progress is due to the identification and resolution of two key challenges: the signal swamping problem in unsigned quantization that results in unchanged state dynamics, and the increased gradient variance in signed quantization that leads to incorrect descent directions. The theoretical analysis suggests a tailored logarithmic quantization for the former and a precision-specific momentum hyperparameter for the latter. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.

CVJun 14, 2024
Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

Jiawei Chen, Dingkang Yang, Tong Wu et al.

Large Vision Language Models (LVLMs) are increasingly integral to healthcare applications, including medical visual question answering and imaging report generation. While these models inherit the robust capabilities of foundational Large Language Models (LLMs), they also inherit susceptibility to hallucinations-a significant concern in high-stakes medical contexts where the margin for error is minimal. However, currently, there are no dedicated methods or benchmarks for hallucination detection and evaluation in the medical field. To bridge this gap, we introduce Med-HallMark, the first benchmark specifically designed for hallucination detection and evaluation within the medical multimodal domain. This benchmark provides multi-tasking hallucination support, multifaceted hallucination data, and hierarchical hallucination categorization. Furthermore, we propose the MediHall Score, a new medical evaluative metric designed to assess LVLMs' hallucinations through a hierarchical scoring system that considers the severity and type of hallucination, thereby enabling a granular assessment of potential clinical impacts. We also present MediHallDetector, a novel Medical LVLM engineered for precise hallucination detection, which employs multitask training for hallucination detection. Through extensive experimental evaluations, we establish baselines for popular LVLMs using our benchmark. The findings indicate that MediHall Score provides a more nuanced understanding of hallucination impacts compared to traditional metrics and demonstrate the enhanced performance of MediHallDetector. We hope this work can significantly improve the reliability of LVLMs in medical applications. All resources of this work will be released soon.

CVMay 31, 2023
Analytical reconstructions of full-scan multiple source-translation computed tomography under large field of views

Zhisheng Wang, Yue Liu, Shunli Wang et al.

This paper is to investigate the high-quality analytical reconstructions of multiple source-translation computed tomography (mSTCT) under an extended field of view (FOV). Under the larger FOVs, the previously proposed backprojection filtration (BPF) algorithms for mSTCT, including D-BPF and S-BPF (their differences are different derivate directions along the detector and source, respectively), make some errors and artifacts in the reconstructed images due to a backprojection weighting factor and the half-scan mode, which deviates from the intention of mSTCT imaging. In this paper, to achieve reconstruction with as little error as possible under the extremely extended FOV, we combine the full-scan mSTCT (F-mSTCT) geometry with the previous BPF algorithms to study the performance and derive a suitable redundancy-weighted function for F-mSTCT. The experimental results indicate FS-BPF can get high-quality, stable images under the extremely extended FOV of imaging a large object, though it requires more projections than FD-BPF. Finally, for different practical requirements in extending FOV imaging, we give suggestions on algorithm selection.

CVMay 30, 2023
BPF Algorithms for Multiple Source-Translation Computed Tomography Reconstruction

Zhisheng Wang, Haijun Yu, Yixing Huang et al.

Micro-computed tomography (micro-CT) is a widely used state-of-the-art instrument employed to study the morphological structures of objects in various fields. However, its small field-of-view (FOV) cannot meet the pressing demand for imaging relatively large objects at high spatial resolutions. Recently, we devised a novel scanning mode called multiple source translation CT (mSTCT) that effectively enlarges the FOV of the micro-CT and correspondingly developed a virtual projection-based filtered backprojection (V-FBP) algorithm for reconstruction. Although V-FBP skillfully solves the truncation problem in mSTCT, it requires densely sampled projections to arrive at high-resolution reconstruction, which reduces imaging efficiency. In this paper, we developed two backprojection-filtration (BPF)-based algorithms for mSTCT, i.e., S-BPF (derivatives along source) and D-BPF (derivatives along detector). D-BPF can achieve high-resolution reconstruction with fewer projections than V-FBP and S-BPF. Through simulated and real experiments conducted in this paper, we demonstrate that D-BPF can reduce source sampling by 75% compared with V-FBP at the same spatial resolution, which makes mSTCT more feasible in practice. Meanwhile, S-BPF can yield more stable results than D-BPF, which is similar to V-FBP.

CVJan 11, 2022
TSA-Net: Tube Self-Attention Network for Action Quality Assessment

Shunli Wang, Dingkang Yang, Peng Zhai et al.

In recent years, assessing action quality from videos has attracted growing attention in computer vision community and human computer interaction. Most existing approaches usually tackle this problem by directly migrating the model from action recognition tasks, which ignores the intrinsic differences within the feature map such as foreground and background information. To address this issue, we propose a Tube Self-Attention Network (TSA-Net) for action quality assessment (AQA). Specifically, we introduce a single object tracker into AQA and propose the Tube Self-Attention Module (TSA), which can efficiently generate rich spatio-temporal contextual information by adopting sparse feature interactions. The TSA module is embedded in existing video networks to form TSA-Net. Overall, our TSA-Net is with the following merits: 1) High computational efficiency, 2) High flexibility, and 3) The state-of-the art performance. Extensive experiments are conducted on popular action quality assessment datasets including AQA-7 and MTL-AQA. Besides, a dataset named Fall Recognition in Figure Skating (FR-FS) is proposed to explore the basic action assessment in the figure skating scene.