Zhifeng Wang

CV
h-index: 21
36 papers
418 citations
Novelty: 45%
AI Score: 55

36 Papers

CV · Jul 27, 2023 · Code
HTNet for micro-expression recognition

Zhifeng Wang, Kaihao Zhang, Wenhan Luo et al.

Facial expression is related to facial muscle contractions and different muscle movements correspond to different emotional states. For micro-expression recognition, the muscle movements are usually subtle, which has a negative impact on the performance of current facial emotion recognition algorithms. Most existing methods use self-attention mechanisms to capture relationships between tokens in a sequence, but they do not take into account the inherent spatial relationships between facial landmarks. This can result in sub-optimal performance on micro-expression recognition tasks. Therefore, learning to recognize facial muscle movements is a key challenge in the area of micro-expression recognition. In this paper, we propose a Hierarchical Transformer Network (HTNet) to identify critical areas of facial muscle movement. HTNet includes two major components: a transformer layer that leverages the local temporal features and an aggregation layer that extracts local and global semantic facial features. Specifically, HTNet divides the face into four different facial areas: left lip area, left eye area, right eye area and right lip area. The transformer layer is used to focus on representing local minor muscle movement with local self-attention in each area. The aggregation layer is used to learn the interactions between eye areas and lip areas. The experiments on four publicly available micro-expression datasets show that the proposed approach outperforms previous methods by a large margin. The codes and models are available at: https://github.com/wangzhifengharrison/HTNet

CV · Jul 8, 2022
Abs-CAM: A Gradient Optimization Interpretable Approach for Explanation of Convolutional Neural Networks

Chunyan Zeng, Kang Yan, Zhifeng Wang et al.

The black-box nature of Deep Neural Networks (DNNs) severely hinders their performance improvement and application in specific scenes. In recent years, class activation mapping-based methods have been widely used to interpret the internal decisions of models in computer vision tasks. However, when these methods use backpropagation to obtain gradients, they introduce noise into the saliency map and can even locate features that are irrelevant to decisions. In this paper, we propose an Absolute value Class Activation Mapping-based (Abs-CAM) method, which optimizes the gradients derived from the backpropagation and turns all of them into positive gradients to enhance the visual features of output neurons' activation and improve the localization ability of the saliency map. The framework of Abs-CAM is divided into two phases: generating the initial saliency map and generating the final saliency map. The first phase improves the localization ability of the saliency map by optimizing the gradient, and the second phase linearly combines the initial saliency map with the original image to enhance the semantic information of the saliency map. We conduct qualitative and quantitative evaluation of the proposed method, including Deletion, Insertion, and Pointing Game. The experimental results show that Abs-CAM clearly eliminates noise in the saliency map, better locates the features related to decisions, and is superior to previous methods in recognition and localization tasks.
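The first phase described above can be illustrated with a minimal sketch. It assumes Grad-CAM-style channel weights (the spatial mean of each channel's gradients) computed from absolute gradient values; the function name `abs_cam_initial_map` and the toy inputs are hypothetical, and the paper's exact formulation, including the second phase that combines the map with the original image, may differ.

```python
def abs_cam_initial_map(activations, gradients):
    """Weight each channel by the spatial mean of its absolute gradients.

    activations, gradients: lists of channels, each an H x W list of floats.
    Returns an H x W saliency map scaled so that its peak is 1.
    """
    height = len(activations[0])
    width = len(activations[0][0])
    saliency = [[0.0] * width for _ in range(height)]

    for act, grad in zip(activations, gradients):
        # Assumption: taking |gradient| makes every channel weight non-negative.
        weight = sum(abs(g) for row in grad for g in row) / (height * width)
        for i in range(height):
            for j in range(width):
                saliency[i][j] += weight * act[i][j]

    # ReLU, then scale by the peak value (skipped when the map is all zeros).
    saliency = [[max(v, 0.0) for v in row] for row in saliency]
    peak = max(max(row) for row in saliency)
    if peak > 0:
        saliency = [[v / peak for v in row] for row in saliency]
    return saliency


# Toy example: two 2x2 channels; the second has purely negative gradients.
acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
toy_map = abs_cam_initial_map(acts, grads)
```

In this toy case, signed mean-gradient weighting would give the second channel a negative weight, and the ReLU would remove its contribution; with absolute values it stays in the map.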

SD · Aug 25, 2022
Spatio-Temporal Representation Learning Enhanced Source Cell-phone Recognition from Speech Recordings

Chunyan Zeng, Shixiong Feng, Zhifeng Wang et al.

Existing source cell-phone recognition methods lack long-term feature characterization of the source device, resulting in an inaccurate representation of the source cell-phone related features, which leads to insufficient recognition accuracy. In this paper, we propose a source cell-phone recognition method based on spatio-temporal representation learning, which includes two main parts: extraction of sequential Gaussian mean matrix features and construction of a recognition model based on spatio-temporal representation learning. In the feature extraction part, based on the analysis of time-series representation of recording source signals, we extract sequential Gaussian mean matrix with long-term and short-term representation ability by using the sensitivity of Gaussian mixture model to data distribution. In the model construction part, we design a structured spatio-temporal representation learning network C3D-BiLSTM to fully characterize the spatio-temporal information, combine 3D convolutional network and bidirectional long short-term memory network for short-term spectral information and long-time fluctuation information representation learning, and achieve accurate recognition of cell-phones by fusing spatio-temporal feature information of recording source signals. The method achieves an average accuracy of 99.03% for the closed-set recognition of 45 cell-phones under the CCNU_Mobile dataset, and 98.18% in small sample size experiments, with recognition performance better than the existing state-of-the-art methods. The experimental results show that the method exhibits excellent recognition performance in multi-class cell-phone recognition.

CV · Mar 20, 2023
Learning Behavior Recognition in Smart Classroom with Multiple Students Based on YOLOv5

Zhifeng Wang, Jialong Yao, Chunyan Zeng et al.

Deep learning-based computer vision technology has grown stronger in recent years, and cross-fertilization using computer vision technology has been a popular direction in recent years. The use of computer vision technology to identify students' learning behavior in the classroom can reduce the workload of traditional teachers in supervising students in the classroom, and ensure greater accuracy and comprehensiveness. However, existing student learning behavior detection systems are unable to track and detect multiple targets precisely, and the accuracy of learning behavior recognition is not high enough to meet the existing needs for the accurate recognition of student behavior in the classroom. To solve this problem, we propose a YOLOv5s network structure based on you only look once (YOLO) algorithm to recognize and analyze students' classroom behavior in this paper. Firstly, the input images taken in the smart classroom are pre-processed. Then, the pre-processed image is fed into the designed YOLOv5 networks to extract deep features through convolutional layers, and the Squeeze-and-Excitation (SE) attention detection mechanism is applied to reduce the weight of background information in the recognition process. Finally, the extracted features are classified by the Feature Pyramid Networks (FPN) and Path Aggregation Network (PAN) structures. Multiple groups of experiments were performed to compare with traditional learning behavior recognition methods to validate the effectiveness of the proposed method. When compared with YOLOv4, the proposed method is able to improve the mAP performance by 11%.

CR · Mar 8, 2023
Graph Neural Networks Enhanced Smart Contract Vulnerability Detection of Educational Blockchain

Zhifeng Wang, Wanxuan Wu, Chunyan Zeng et al.

With the development of blockchain technology, more and more attention has been paid to the intersection of blockchain and education, and various educational evaluation systems and E-learning systems are developed based on blockchain technology. Among them, Ethereum smart contracts are favored by developers for their "event-triggered" mechanism for building education intelligent trading systems and intelligent learning platforms. However, due to the immutability of blockchain, published smart contracts cannot be modified, so problematic contracts cannot be fixed by modifying the code in the educational blockchain. In recent years, security incidents due to smart contract vulnerabilities have caused huge property losses, so the detection of smart contract vulnerabilities in educational blockchain has become a great challenge. To solve this problem, this paper proposes a graph neural network (GNN) based vulnerability detection method for smart contracts in educational blockchains. Firstly, the bytecodes are decompiled to get the opcodes. Secondly, the basic blocks are divided, and edges between the basic blocks are added according to the opcode execution logic. Then, the control flow graphs (CFG) are built. Finally, we design a GNN-based model for vulnerability detection. The experimental results show that the proposed method is effective for the vulnerability detection of smart contracts. Compared with the traditional approaches, it can get good results with fewer layers of the GCN model, which shows that the contract bytecode and GCN model are efficient in vulnerability detection.
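The basic-block and edge-building steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: opcodes are given as a flat list of EVM mnemonics with operands omitted, a block ends at a control-transfer or halting opcode, a `JUMPDEST` opens a new block, and only fall-through edges are resolved (jump-target edges would need stack analysis). The helper names `split_basic_blocks` and `fallthrough_edges` are hypothetical.

```python
TERMINATORS = {"JUMP", "JUMPI", "STOP", "RETURN", "REVERT", "INVALID", "SELFDESTRUCT"}
# Only a conditional jump lets execution continue into the next block.
FALLS_THROUGH = {"JUMPI"}


def split_basic_blocks(opcodes):
    """Split a flat opcode list into basic blocks."""
    blocks, current = [], []
    for op in opcodes:
        # A JUMPDEST marks a jump target, so it always opens a new block.
        if op == "JUMPDEST" and current:
            blocks.append(current)
            current = []
        current.append(op)
        if op in TERMINATORS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def fallthrough_edges(blocks):
    """Return (source, target) block-index pairs for fall-through control flow."""
    edges = []
    for idx in range(len(blocks) - 1):
        last = blocks[idx][-1]
        if last not in TERMINATORS or last in FALLS_THROUGH:
            edges.append((idx, idx + 1))
    return edges


# Toy opcode sequence (operands of PUSH1 omitted for brevity).
ops = ["PUSH1", "PUSH1", "JUMPI", "PUSH1", "STOP", "JUMPDEST", "PUSH1", "RETURN"]
blocks = split_basic_blocks(ops)
edges = fallthrough_edges(blocks)
```

The resulting blocks and edges form the node and edge sets of a control flow graph, which is the kind of structure a GNN-based detector would take as input.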

28.5 · CV · May 24
AstroRAG -- A Pagerank-Based Retrieval-Augmented Generation Pipeline for Question Answering in Astronomy

Zhifeng Wang, Jason Jingshi Li, Kaihao Zhang et al.

Large language models (LLMs) demonstrate strong performance in natural language processing but often generate factual errors when relying solely on parametric knowledge. Retrieval-Augmented Generation (RAG) mitigates these errors by grounding responses in external evidence, yet conventional retrieve-and-dump approaches frequently introduce irrelevant context that degrades answer quality. In this work, we present AstroRAG -- a PageRank-based retrieval-augmented generation (RAG) pipeline adapted for question answering in astronomy. The system performs token-aware chunking and per-instance, ephemeral indexing in Elasticsearch, then executes a two-stage retrieval: (i) Maximal Marginal Relevance (MMR) to obtain a small, diverse candidate set and (ii) a reader-driven PageRank (PR) re-ranking on a similarity graph to identify a compact, mutually supportive context under a strict token budget. Our design is training-free, privacy-preserving, and reproducible, as each instance is processed through transient indexing to prevent cross-task leakage. We evaluate the pipeline on the AstroQA benchmark for astronomy QA, and demonstrate competitive performance across all difficulty levels. In particular, the RAG-enhanced Mistral-7B achieves 79.49% accuracy and 79.49% F1-score, nearly doubling the performance of its non-RAG counterpart. These results highlight the effectiveness of disciplined retrieval and refinement in boosting domain-specific reasoning, establishing a robust foundation for extending RAG to other scientific fields.
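The two-stage retrieval can be sketched in a few lines. This is a generic illustration, not the AstroRAG pipeline itself: it assumes chunks are already embedded as small dense vectors, uses cosine similarity, a textbook greedy MMR with a hypothetical trade-off `lam=0.7`, and standard PageRank power iteration with damping 0.85 on a weighted similarity graph. The "reader-driven" variant, the token budget, and the Elasticsearch indexing from the abstract are not reproduced.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, chunks, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: trade relevance against redundancy."""
    selected, remaining = [], list(range(len(chunks)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, chunks[i])
            redundancy = max((cosine(chunks[i], chunks[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration on a weighted adjacency matrix (row i -> column j)."""
    n = len(weights)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1 - damping) / n] * n
        for i in range(n):
            out = sum(weights[i])
            for j in range(n):
                # Dangling nodes spread their rank uniformly.
                share = weights[i][j] / out if out else 1.0 / n
                new[j] += damping * ranks[i] * share
        ranks = new
    return ranks


# Stage 1: chunks 0 and 1 are duplicates, so MMR skips the second copy.
chunks = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
picked = mmr_select([1.0, 1.0], chunks, k=2)

# Stage 2: on a star-shaped similarity graph the hub node ranks highest.
star = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
ranks = pagerank(star)
```

In a real pipeline the second stage would run on the similarity graph of the MMR candidates, and the top-ranked chunks would be kept until the context budget is filled.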

CV · Nov 11, 2022
JSRNN: Joint Sampling and Reconstruction Neural Networks for High Quality Image Compressed Sensing

Chunyan Zeng, Jiaxiang Ye, Zhifeng Wang et al.

Most Deep Learning (DL) based Compressed Sensing (DCS) algorithms adopt a single neural network for signal reconstruction, and fail to jointly consider the influence of the sampling operation on reconstruction. In this paper, we propose a unified framework, which jointly considers the sampling and reconstruction process for image compressive sensing based on well-designed cascade neural networks. Two sub-networks, which are the sampling sub-network and the reconstruction sub-network, are included in the proposed framework. In the sampling sub-network, an adaptive fully connected layer instead of the traditional random matrix is used to mimic the sampling operator. In the reconstruction sub-network, a cascade network combining stacked denoising autoencoder (SDA) and convolutional neural network (CNN) is designed to reconstruct signals. The SDA is used to solve the signal mapping problem and the signals are initially reconstructed. Furthermore, CNN is used to fully recover the structure and texture features of the image to obtain better reconstruction performance. Extensive experiments show that this framework outperforms many other state-of-the-art methods, especially at low sampling rates.

IV · Sep 28, 2022
Image Compressed Sensing with Multi-scale Dilated Convolutional Neural Network

Zhifeng Wang, Zhenghui Wang, Chunyan Zeng et al.

Deep Learning (DL) based Compressed Sensing (CS) has been applied for better performance of image reconstruction than traditional CS methods. However, most existing DL methods utilize block-by-block measurement and each measurement block is restored separately, which introduces harmful blocking effects for reconstruction. Furthermore, the neuronal receptive fields of those methods are designed to be the same size in each layer, which can only collect single-scale spatial information and has a negative impact on the reconstruction process. This paper proposes a novel framework named Multi-scale Dilated Convolution Neural Network (MsDCNN) for CS measurement and reconstruction. During the measurement period, we directly obtain all measurements from a trained measurement network, which employs fully convolutional structures and is jointly trained with the reconstruction network from the input image. The image need not be cut into blocks, which effectively avoids the block effect. During the reconstruction period, we propose the Multi-scale Feature Extraction (MFE) architecture to imitate the human visual system to capture multi-scale features from the same feature map, which enhances the image feature extraction ability of the framework and improves the performance of image reconstruction. In the MFE, there are multiple parallel convolution channels to obtain multi-scale feature information. Then the multi-scale feature information is fused and the original image is reconstructed with high quality. Our experimental results show that the proposed method performs favorably against the state-of-the-art methods in terms of PSNR and SSIM.
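A small sketch may help show why parallel dilated convolutions collect information at several scales from the same feature map. A kernel of size k with dilation d spans k + (k - 1)(d - 1) input positions, so branches with different dilation rates see different neighbourhoods. The 1-D setting, the dilation rates 1, 2 and 3, and the function names are assumptions for illustration; the abstract does not state the MFE configuration.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span of input positions covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Valid (no padding) 1-D dilated cross-correlation."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(w * signal[start + t * dilation] for t, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


# Three parallel branches applied to the same input, one per dilation rate.
signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
branches = {d: dilated_conv1d(signal, kernel, d) for d in (1, 2, 3)}
```

With the same three-tap kernel, the branches compare samples 2, 4 and 6 positions apart, which is the multi-scale behaviour a fusion step can then combine.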

AI · Jul 7, 2022
UIILD: A Unified Interpretable Intelligent Learning Diagnosis Framework for Intelligent Tutoring Systems

Zhifeng Wang, Wenxing Yan, Chunyan Zeng et al.

Intelligent learning diagnosis is a critical engine of intelligent tutoring systems, which aims to estimate learners' current knowledge mastery status and predict their future learning performance. The significant challenge with traditional learning diagnosis methods is the inability to balance diagnostic accuracy and interpretability. Although the existing psychometric-based learning diagnosis methods provide some domain interpretation through cognitive parameters, they have insufficient modeling capability with a shallow structure for large-scale learning data. While the deep learning-based learning diagnosis methods have improved the accuracy of learning performance prediction, their inherent black-box properties lead to a lack of interpretability, making their results untrustworthy for educational applications. To settle the above problem, the proposed unified interpretable intelligent learning diagnosis (UIILD) framework, which benefits from the powerful representation learning ability of deep learning and the interpretability of psychometrics, achieves a better performance of learning prediction and provides interpretability from three aspects: cognitive parameters, learner-resource response network, and weights of self-attention mechanism. Within the proposed framework, this paper presents a two-channel learning diagnosis mechanism LDM-ID as well as a three-channel learning diagnosis mechanism LDM-HMI. Experiments on two real-world datasets and a simulation dataset show that our method has higher accuracy in predicting learners' performances compared with the state-of-the-art models, and can provide valuable educational interpretability for applications such as precise learning resource recommendation and personalized learning tutoring in intelligent tutoring systems.

CY · Jul 15, 2023
Knowledge Graph Enhanced Intelligent Tutoring System Based on Exercise Representativeness and Informativeness

Linqing Li, Zhifeng Wang

Presently, knowledge graph-based recommendation algorithms have garnered considerable attention among researchers. However, these algorithms solely consider knowledge graphs with single relationships and do not effectively model exercise-rich features, such as exercise representativeness and informativeness. Consequently, this paper proposes a framework, namely the Knowledge-Graph-Exercise Representativeness and Informativeness Framework, to address these two issues. The framework consists of four intricate components and a novel cognitive diagnosis model called the Neural Attentive cognitive diagnosis model. These components encompass the informativeness component, exercise representation component, knowledge importance component, and exercise representativeness component. The informativeness component evaluates the informational value of each question and identifies the candidate question set that exhibits the highest exercise informativeness. Furthermore, the skill embeddings are employed as input for the knowledge importance component. This component transforms a one-dimensional knowledge graph into a multi-dimensional one through four class relations and calculates skill importance weights based on novelty and popularity. Subsequently, the exercise representativeness component incorporates exercise weight knowledge coverage to select questions from the candidate question set for the tested question set. Lastly, the cognitive diagnosis model leverages exercise representation and skill importance weights to predict student performance on the test set and estimate their knowledge state. To evaluate the effectiveness of our selection strategy, extensive experiments were conducted on two publicly available educational datasets. The experimental results demonstrate that our framework can recommend appropriate exercises to students, leading to improved student performance.

LG · Apr 8, 2023
Knowledge Relation Rank Enhanced Heterogeneous Learning Interaction Modeling for Neural Graph Forgetting Knowledge Tracing

Linqing Li, Zhifeng Wang

Recently, knowledge tracing models have been applied in educational data mining, such as the Self-attention knowledge tracing model (SAKT), which models the relationship between exercises and knowledge concepts (KCs). However, relation modeling in traditional knowledge tracing models only considers the static question-knowledge relationship and knowledge-knowledge relationship and treats these relationships with equal importance. This kind of relation modeling can hardly avoid the influence of subjective labeling, and it considers the relationships between exercises and KCs, or between KCs and KCs, separately. In this work, a novel knowledge tracing model, named Knowledge Relation Rank Enhanced Heterogeneous Learning Interaction Modeling for Neural Graph Forgetting Knowledge Tracing (NGFKT), is proposed to reduce the impact of subjective labeling by calibrating the skill relation matrix and the Q-matrix, and to apply the Graph Convolutional Network (GCN) to model the heterogeneous interactions between students, exercises, and skills. Specifically, the skill relation matrix and Q-matrix are generated by the Knowledge Relation Importance Rank Calibration method (KRIRC). Then the calibrated skill relation matrix, Q-matrix, and the heterogeneous interactions are treated as the input of the GCN to generate the exercise embedding and skill embedding. Next, the exercise embedding, skill embedding, item difficulty, and contingency table are incorporated to generate an exercise relation matrix as the inputs of the Position-Relation-Forgetting attention mechanism. Finally, the Position-Relation-Forgetting attention mechanism is applied to make the predictions. Experiments are conducted on two public educational datasets and results indicate that the NGFKT model outperforms all baseline models in terms of AUC, ACC, and Performance Stability (PS).

CV · Mar 2, 2023
Photovoltaic Panel Defect Detection Based on Ghost Convolution with BottleneckCSP and Tiny Target Prediction Head Incorporating YOLOv5

Longlong Li, Zhifeng Wang, Tingting Zhang

Photovoltaic (PV) panel surface-defect detection technology is crucial for the PV industry to perform smart maintenance. Using computer vision technology to detect PV panel surface defects can ensure better accuracy while reducing the workload of traditional worker field inspections. However, multiple tiny defects on the PV panel surface and the high similarity between different defects make it challenging to accurately identify and detect such defects. This paper proposes an approach named Ghost convolution with BottleneckCSP and a tiny target prediction head incorporating YOLOv5 (GBH-YOLOv5) for PV panel defect detection. To ensure better accuracy on multiscale targets, the BottleneckCSP module is introduced to add a prediction head for tiny target detection to alleviate tiny defect misses, using Ghost convolution to improve the model inference speed and reduce the number of parameters. First, the original image is compressed and cropped to enlarge the defect size physically. Then, the processed images are input into GBH-YOLOv5, and the depth features are extracted through network processing based on Ghost convolution, the application of the BottleneckCSP module, and the prediction head of tiny targets. Finally, the extracted features are classified by a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN) structure. Meanwhile, we compare our method with state-of-the-art methods to verify the effectiveness of the proposed method. The proposed PV panel surface-defect detection network improves the mAP performance by at least 27.8%.

CV · Aug 1, 2023
Using Scene and Semantic Features for Multi-modal Emotion Recognition

Zhifeng Wang, Ramesh Sankaranarayana

Automatic emotion recognition is a hot topic with a wide range of applications. Much work has been done in the area of automatic emotion recognition in recent years. The focus has been mainly on using the characteristics of a person such as speech, facial expression and pose for this purpose. However, the processing of scene and semantic features for emotion recognition has had limited exploration. In this paper, we propose to use combined scene and semantic features, along with personal features, for multi-modal emotion recognition. Scene features will describe the environment or context in which the target person is operating. The semantic feature can include objects that are present in the environment, as well as their attributes and relationships with the target person. In addition, we use a modified EmbraceNet to extract features from the images, which is trained to learn both the body and pose features simultaneously. By fusing both body and pose features, the EmbraceNet can improve the accuracy and robustness of the model, particularly when dealing with partially missing data. This is because having both body and pose features provides a more complete representation of the subject in the images, which can help the model to make more accurate predictions even when some parts of the body are missing. We demonstrate the efficiency of our method on the benchmark EMOTIC dataset. We report an average precision of 40.39% across the 26 emotion categories, which is a 5% improvement over previous approaches.

LG · Nov 4, 2024 · Code
Leveraging Label Semantics and Meta-Label Refinement for Multi-Label Question Classification

Shi Dong, Xiaobei Niu, Rui Zhong et al.

Accurate annotation of educational resources is crucial for effective personalized learning and resource recommendation in online education. However, fine-grained knowledge labels often overlap or share similarities, making it difficult for existing multi-label classification methods to differentiate them. The label distribution imbalance due to sparsity of human annotations further intensifies these challenges. To address these issues, this paper introduces RR2QC, a novel Retrieval Reranking method to multi-label Question Classification by leveraging label semantics and meta-label refinement. First, RR2QC improves the pre-training strategy by utilizing semantic relationships within and across label groups. Second, it introduces a class center learning task to align questions with label semantics during downstream training. Finally, this method decomposes labels into meta-labels and uses a meta-label classifier to rerank the retrieved label sequences. In doing so, RR2QC enhances the understanding and prediction capability of long-tail labels by learning from meta-labels that frequently appear in other labels. Additionally, a mathematical LLM is used to generate solutions for questions, extracting latent information to further refine the model's insights. Experimental results show that RR2QC outperforms existing methods in Precision@K and F1 scores across multiple educational datasets, demonstrating its effectiveness for online education applications. The code and datasets are available at https://github.com/78Erii/RR2QC.

94.8 · IV · Mar 20
ReconMIL: Synergizing Latent Space Reconstruction with Bi-Stream Mamba for Whole Slide Image Analysis

Lubin Gan, Jing Zhang, Heng Zhang et al.

Whole slide image (WSI) analysis heavily relies on multiple instance learning (MIL). While recent methods benefit from large-scale foundation models and advanced sequence modeling to capture long-range dependencies, they still struggle with two critical issues. First, directly applying frozen, task-agnostic features often leads to suboptimal separability due to the domain gap with specific histological tasks. Second, relying solely on global aggregators can cause over-smoothing, where sparse but critical diagnostic signals are overshadowed by the dominant background context. In this paper, we present ReconMIL, a novel framework designed to bridge this domain gap and balance global-local feature aggregation. Our approach introduces a Latent Space Reconstruction module that adaptively projects generic features into a compact, task-specific manifold, improving boundary delineation. To prevent information dilution, we develop a bi-stream architecture combining a Mamba-based global stream for contextual priors and a CNN-based local stream to preserve subtle morphological anomalies. A scale-adaptive selection mechanism dynamically fuses these two streams, determining when to rely on overall architecture versus local saliency. Evaluations across multiple diagnostic and survival prediction benchmarks show that ReconMIL consistently outperforms current state-of-the-art methods, effectively localizing fine-grained diagnostic regions while suppressing background noise. Visualization results confirm the model's superior ability to localize diagnostic regions by effectively balancing global structure and local granularity.

IV · Jun 24, 2025 · Code
Angio-Diff: Learning a Self-Supervised Adversarial Diffusion Model for Angiographic Geometry Generation

Zhifeng Wang, Renjiao Yi, Xin Wen et al.

Vascular diseases pose a significant threat to human health, with X-ray angiography established as the gold standard for diagnosis, allowing for detailed observation of blood vessels. However, angiographic X-rays expose personnel and patients to higher radiation levels than non-angiographic X-rays, which is undesirable. Thus, modality translation from non-angiographic to angiographic X-rays is desirable. Data-driven deep approaches are hindered by the lack of paired large-scale X-ray angiography datasets. This makes high-quality vascular angiography synthesis crucial, yet it remains challenging. We find that current medical image synthesis primarily operates at pixel level and struggles to adapt to the complex geometric structure of blood vessels, resulting in unsatisfactory quality of blood vessel image synthesis, such as disconnections or unnatural curvatures. To overcome this issue, we propose a self-supervised method via diffusion models to transform non-angiographic X-rays into angiographic X-rays, mitigating data shortages for data-driven approaches. Our model comprises a diffusion model that learns the distribution of vascular data from diffusion latent, a generator for vessel synthesis, and a mask-based adversarial module. To enhance geometric accuracy, we propose a parametric vascular model to fit the shape and distribution of blood vessels. The proposed method contributes a pipeline and a synthetic dataset for X-ray angiography. We conducted extensive comparative and ablation experiments to evaluate Angio-Diff. The results demonstrate that our method achieves state-of-the-art performance in synthetic angiography image quality and more accurately synthesizes the geometric structure of blood vessels. The code is available at https://github.com/zfw-cv/AngioDiff.

20.4 · CV · Mar 12
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Zhongyu Xia, Yousen Tang, Yongtao Wang et al. · pku

4D radar-camera sensing configuration has gained increasing importance in autonomous driving. However, existing 3D object detection methods that fuse 4D Radar and camera data confront several challenges. First, their absolute depth estimation module is not robust and accurate enough, leading to inaccurate 3D localization. Second, the performance of their temporal fusion module will degrade dramatically or even fail when the ego vehicle's pose is missing or inaccurate. Third, for some small objects, the sparse radar point clouds may completely fail to reflect from their surfaces. In such cases, detection must rely solely on visual unimodal priors. To address these limitations, we propose R4Det, which enhances depth estimation quality via the Panoramic Depth Fusion module, enabling mutual reinforcement between absolute and relative depth. For temporal fusion, we design a Deformable Gated Temporal Fusion module that does not rely on the ego vehicle's pose. In addition, we built an Instance-Guided Dynamic Refinement module that extracts semantic prototypes from 2D instance guidance. Experiments show that R4Det achieves state-of-the-art 3D object detection results on the TJ4DRadSet and VoD datasets.

CV · Apr 21, 2024 · Code
Authentic Emotion Mapping: Benchmarking Facial Expressions in Real News

Qixuan Zhang, Zhifeng Wang, Yang Liu et al.

In this paper, we present a novel benchmark for Emotion Recognition using facial landmarks extracted from realistic news videos. Traditional methods relying on RGB images are resource-intensive, whereas our approach with Facial Landmark Emotion Recognition (FLER) offers a simplified yet effective alternative. By leveraging Graph Neural Networks (GNNs) to analyze the geometric and spatial relationships of facial landmarks, our method enhances the understanding and accuracy of emotion recognition. We discuss the advancements and challenges in deep learning techniques for emotion recognition, particularly focusing on Graph Neural Networks (GNNs) and Transformers. Our experimental results demonstrate the viability and potential of our dataset as a benchmark, setting a new direction for future research in emotion recognition technologies. The codes and models are at: https://github.com/wangzhifengharrison/benchmark_real_news

LG · Feb 15, 2023
DKT-STDRL: Spatial and Temporal Representation Learning Enhanced Deep Knowledge Tracing for Learning Performance Prediction

Liting Lyu, Zhifeng Wang, Haihong Yun et al.

Knowledge tracing (KT) serves as a primary part of intelligent education systems. Most current KTs either rely on expert judgments or only exploit a single network structure, which affects the full expression of learning features. To adequately mine features of students' learning process, Deep Knowledge Tracing Based on Spatial and Temporal Deep Representation Learning for Learning Performance Prediction (DKT-STDRL) is proposed in this paper. DKT-STDRL extracts spatial features from students' learning history sequence, and then further extracts temporal features to extract deeper hidden information. Specifically, firstly, the DKT-STDRL model uses CNN to extract the spatial feature information of students' exercise sequences. Then, the spatial features are connected with the original students' exercise features as joint learning features. Then, the joint features are input into the BiLSTM part. Finally, the BiLSTM part extracts the temporal features from the joint learning features to obtain the prediction information of whether the students answer correctly at the next time step. Experiments on the public education datasets ASSISTment2009, ASSISTment2015, Synthetic-5, ASSISTchall, and Statics2011 prove that DKT-STDRL can achieve better prediction effects than DKT and CKT.

CVOct 18, 2024Code
Variable Aperture Bokeh Rendering via Customized Focal Plane Guidance

Kang Chen, Shijun Yan, Aiwen Jiang et al.

Bokeh rendering is one of the most popular techniques in photography. It makes photographs visually appealing by drawing the viewer's attention to a particular area of the image. However, achieving a satisfactory bokeh effect is usually challenging: mobile cameras are constrained by their restricted optical systems, while expensive high-end DSLR lenses with large apertures would otherwise be needed. Therefore, many deep learning-based computational photography methods have been developed in recent years to mimic the bokeh effect. Nevertheless, most of these methods are limited to rendering the bokeh effect at a single fixed aperture. There is a lack of user-friendly bokeh rendering methods that provide precise focal plane control and customised bokeh generation. There is also a lack of authentic, realistic bokeh datasets that could promote bokeh learning on variable apertures. To address these two issues, in this paper we propose an effective controllable bokeh rendering method and contribute a Variable Aperture Bokeh Dataset (VABD). In the proposed method, the user can customize the focal plane to accurately locate the subjects of interest and select target aperture information for bokeh rendering. Experimental results on the public EBB! benchmark dataset and our constructed dataset VABD demonstrate that the customized focal plane together with the aperture prompt can bootstrap the model to simulate a realistic bokeh effect. The proposed method achieves competitive state-of-the-art performance with only 4.4M parameters, making it much lighter than mainstream computational bokeh models. The contributed dataset and source code will be released on GitHub: https://github.com/MoTong-AI-studio/VABM.

CVAug 8, 2024
LLDif: Diffusion Models for Low-light Emotion Recognition

Zhifeng Wang, Kaihao Zhang, Ramesh Sankaranarayana

This paper introduces LLDif, a novel diffusion-based facial expression recognition (FER) framework tailored for extremely low-light (LL) environments. Images captured under such conditions often suffer from low brightness and significantly reduced contrast, presenting challenges to conventional methods. These challenges include poor image quality that can significantly reduce the accuracy of emotion recognition. LLDif addresses these issues with a novel two-stage training process that combines a Label-aware CLIP (LA-CLIP), an embedding prior network (PNET), and a transformer-based network adept at handling the noise of low-light images. The first stage involves LA-CLIP generating a joint embedding prior distribution (EPD) to guide the LLformer in label recovery. In the second stage, the diffusion model (DM) refines the EPD inference, utilising the compactness of EPD for precise predictions. Experimental evaluations on various LL-FER datasets have shown that LLDif achieves competitive performance, underscoring its potential to enhance FER applications in challenging lighting conditions.

CVFeb 7, 2024
Dual-Path Coupled Image Deraining Network via Spatial-Frequency Interaction

Yuhong He, Aiwen Jiang, Lingfang Jiang et al.

Transformers have recently emerged as a significant force in the field of image deraining. Existing image deraining methods rely heavily on self-attention. Though they showcase impressive results, they tend to neglect critical frequency information, as self-attention is generally less adept at capturing high-frequency details. To overcome this shortcoming, we have developed an innovative Dual-Path Coupled Deraining Network (DPCNet) that integrates information from both spatial and frequency domains through a Spatial Feature Extraction Block (SFEBlock) and a Frequency Feature Extraction Block (FFEBlock). We have further introduced an effective Adaptive Fusion Module (AFM) for the dual-path feature aggregation. Extensive experiments on six public deraining benchmarks and downstream vision tasks have demonstrated that our proposed method not only outperforms the existing state-of-the-art deraining methods but also achieves visually pleasing results with excellent robustness on downstream vision tasks.

CVFeb 1, 2024
LRDif: Diffusion Models for Under-Display Camera Emotion Recognition

Zhifeng Wang, Kaihao Zhang, Ramesh Sankaranarayana

This study introduces LRDif, a novel diffusion-based framework designed specifically for facial expression recognition (FER) within the context of under-display cameras (UDC). To address the inherent challenges posed by UDC's image degradation, such as reduced sharpness and increased noise, LRDif employs a two-stage training strategy that integrates a condensed preliminary extraction network (FPEN) and an agile transformer network (UDCformer) to effectively identify emotion labels from UDC images. By harnessing the robust distribution mapping capabilities of Diffusion Models (DMs) and the spatial dependency modeling strength of transformers, LRDif effectively overcomes the obstacles of noise and distortion inherent in UDC environments. In comprehensive experiments on standard FER datasets including RAF-DB, KDEF, and FERPlus, LRDif demonstrates state-of-the-art performance, underscoring its potential in advancing FER applications. This work not only addresses a significant gap in the literature by tackling the UDC challenge in FER but also sets a new benchmark for future research in the field.

CVMar 17, 2025
VasTSD: Learning 3D Vascular Tree-state Space Diffusion Model for Angiography Synthesis

Zhifeng Wang, Renjiao Yi, Xin Wen et al.

Angiography imaging is a medical imaging technique that enhances the visibility of blood vessels within the body by using contrast agents. Angiographic images can effectively assist in the diagnosis of vascular diseases. However, contrast agents may bring extra radiation exposure which is harmful to patients with health risks. To mitigate these concerns, in this paper, we aim to automatically generate angiography from non-angiographic inputs, by leveraging and enhancing the inherent physical properties of vascular structures. Previous methods relying on 2D slice-based angiography synthesis struggle with maintaining continuity in 3D vascular structures and exhibit limited effectiveness across different imaging modalities. We propose VasTSD, a 3D vascular tree-state space diffusion model to synthesize angiography from 3D non-angiographic volumes, with a novel state space serialization approach that dynamically constructs vascular tree topologies, integrating these with a diffusion-based generative model to ensure the generation of anatomically continuous vasculature in 3D volumes. A pre-trained vision embedder is employed to construct vascular state space representations, enabling consistent modeling of vascular structures across multiple modalities. Extensive experiments on various angiographic datasets demonstrate the superiority of VasTSD over prior works, achieving enhanced continuity of blood vessels in synthesized angiography across multiple modalities and anatomical regions.

CVNov 5, 2024
CAD-NeRF: Learning NeRFs from Uncalibrated Few-view Images by CAD Model Retrieval

Xin Wen, Xuening Zhu, Renjiao Yi et al.

Reconstructing from multi-view images is a longstanding problem in 3D vision, where neural radiance fields (NeRFs) have shown great potential and produce realistic rendered images of novel views. Currently, most NeRF methods require accurate camera poses, a large number of input images, or both. Reconstructing a NeRF from few-view images without poses is challenging and highly ill-posed. To address this problem, we propose CAD-NeRF, a method that reconstructs from fewer than 10 images without any known poses. Specifically, we build a mini library of several CAD models from ShapeNet and render them from many random views. Given sparse-view input images, we run a model and pose retrieval from the library to get a model with a similar shape, which serves as the density supervision and pose initialization. Here we propose a multi-view pose retrieval method to avoid pose conflicts among views, which is a new and unseen problem in uncalibrated NeRF methods. Then, the geometry of the object is trained under the CAD guidance. The deformation of the density field and the camera poses are optimized jointly. Then texture and density are trained and fine-tuned as well. All training phases are self-supervised. Comprehensive evaluations on synthetic and real images show that CAD-NeRF successfully learns accurate densities with a large deformation from retrieved CAD models, showing its generalization ability.

CLApr 9, 2025
Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations

Zican Dong, Han Peng, Peiyu Liu et al.

Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts. However, the memory overhead of storing all experts remains a major limitation, especially in large-scale MoE models such as DeepSeek-R1 (671B). In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior we term few-shot expert localization: with only a few in-domain demonstrations, the model consistently activates a sparse and stable subset of experts on tasks within the same domain. Building on this observation, we propose a simple yet effective pruning framework, EASY-EP, that leverages a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation. The former evaluates the importance of each expert for the current token by considering the gating scores and L2 norm of the outputs of activated experts, while the latter assesses the contribution of tokens based on representation similarities before and after routed experts. Experiments on DeepSeek-R1 and DeepSeek-V3-0324 show that our method can achieve comparable performance and 2.99× throughput relative to the full model under the same memory budget while using only half the experts.
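
The output-aware importance idea described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the scoring rule here (gating score multiplied by the L2 norm of the expert output, summed over tokens) and the names `expert_importance` and `select_experts` are assumptions for illustration only; the token contribution weighting is omitted.

```python
import math
from collections import defaultdict


def l2_norm(vec):
    """Euclidean (L2) norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def expert_importance(token_records):
    """Accumulate a per-expert score over a few demonstration tokens.

    token_records: one list per token, holding tuples of
    (expert_id, gate_score, expert_output_vector) for the activated experts.
    Assumed scoring rule: gate_score * ||output||_2, summed over tokens.
    """
    scores = defaultdict(float)
    for activated in token_records:
        for expert_id, gate, output in activated:
            scores[expert_id] += gate * l2_norm(output)
    return dict(scores)


def select_experts(scores, keep):
    """Return the ids of the `keep` highest-scoring experts, sorted by id."""
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return sorted(ranked[:keep])


# Toy demonstration data: two tokens, each routed to two experts.
demo = [
    [(0, 0.7, [3.0, 4.0]), (1, 0.3, [1.0, 0.0])],
    [(0, 0.6, [0.0, 2.0]), (2, 0.4, [0.0, 1.0])],
]
scores = expert_importance(demo)
kept = select_experts(scores, keep=2)
```

On this toy input, experts 0 and 2 receive the highest scores and would be retained, while expert 1 would be pruned.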

CVApr 24, 2025
Visual and Textual Prompts in VLLMs for Enhancing Emotion Recognition

Zhifeng Wang, Qixuan Zhang, Peter Zhang et al.

Vision Large Language Models (VLLMs) exhibit promising potential for multi-modal understanding, yet their application to video-based emotion recognition remains limited by insufficient spatial and contextual awareness. Traditional approaches, which prioritize isolated facial features, often neglect critical non-verbal cues such as body language, environmental context, and social interactions, leading to reduced robustness in real-world scenarios. To address this gap, we propose Set-of-Vision-Text Prompting (SoVTP), a novel framework that enhances zero-shot emotion recognition by integrating spatial annotations (e.g., bounding boxes, facial landmarks), physiological signals (facial action units), and contextual cues (body posture, scene dynamics, others' emotions) into a unified prompting strategy. SoVTP preserves holistic scene information while enabling fine-grained analysis of facial muscle movements and interpersonal dynamics. Extensive experiments show that SoVTP achieves substantial improvements over existing visual prompting methods, demonstrating its effectiveness in enhancing VLLMs' video emotion recognition capabilities.

CVAug 30, 2025
SemaMIL: Semantic-Aware Multiple Instance Learning with Retrieval-Guided State Space Modeling for Whole Slide Images

Lubin Gan, Xiaoman Wu, Jing Zhang et al.

Multiple instance learning (MIL) has become the leading approach for extracting discriminative features from whole slide images (WSIs) in computational pathology. Attention-based MIL methods can identify key patches but tend to overlook contextual relationships. Transformer models are able to model interactions but require quadratic computational cost and are prone to overfitting. State space models (SSMs) offer linear complexity, yet shuffling patch order disrupts histological meaning and reduces interpretability. In this work, we introduce SemaMIL, which integrates Semantic Reordering (SR), an adaptive method that clusters and arranges semantically similar patches in sequence through a reversible permutation, with a Semantic-guided Retrieval State Space Module (SRSM) that chooses a representative subset of queries to adjust state space parameters for improved global modeling. Evaluation on four WSI subtype datasets shows that, compared to strong baselines, SemaMIL achieves state-of-the-art accuracy with fewer FLOPs and parameters.
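
The notion of a reversible permutation in Semantic Reordering can be sketched as follows. This is a simplified illustration, not the SemaMIL implementation: cluster assignments are assumed to be given, and the function names are hypothetical.

```python
def semantic_reorder(items, cluster_ids):
    """Group items with the same cluster id together, keeping original order
    within each cluster. Returns the reordered items and the permutation used,
    where perm[new_pos] is the original index of the item now at new_pos."""
    perm = sorted(range(len(items)), key=lambda i: (cluster_ids[i], i))
    reordered = [items[i] for i in perm]
    return reordered, perm


def restore_order(reordered, perm):
    """Invert the permutation so each item returns to its original position."""
    original = [None] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        original[old_pos] = reordered[new_pos]
    return original


# Toy example: four patches with assumed cluster labels.
patches = ["a", "b", "c", "d"]
clusters = [1, 0, 1, 0]
reordered, perm = semantic_reorder(patches, clusters)
restored = restore_order(reordered, perm)
```

Because the permutation is stored, any per-patch output computed on the reordered sequence can be mapped back to the original slide layout.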

CVOct 29, 2025
Mask-Robust Face Verification for Online Learning via YOLOv5 and Residual Networks

Zhifeng Wang, Minghui Wang, Chunyan Zeng et al.

In the contemporary landscape, the fusion of information technology and the rapid advancement of artificial intelligence have ushered school education into a transformative phase characterized by digitization and heightened intelligence. Concurrently, the global paradigm shift caused by the Covid-19 pandemic has catalyzed the evolution of e-learning, accentuating its significance. Amidst these developments, one pivotal facet of the online education paradigm that warrants attention is the authentication of identities within the digital learning sphere. Within this context, our study delves into a solution for online learning authentication, utilizing an enhanced convolutional neural network architecture, specifically the residual network model. By harnessing the power of deep learning, this technological approach aims to galvanize the ongoing progress of online education, while concurrently bolstering its security and stability. Such fortification is imperative in enabling online education to seamlessly align with the swift evolution of the educational landscape. This paper's focal proposition involves the deployment of the YOLOv5 network, meticulously trained on our proprietary dataset. This network is tasked with identifying individuals' faces culled from images captured by students' open online cameras. The resultant facial information is then channeled into the residual network to extract intricate features at a deeper level. Subsequently, a comparative analysis of Euclidean distances against students' face databases is performed, effectively ascertaining the identity of each student.
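
The final matching step, comparing Euclidean distances against a database of student faces, can be sketched as below. The embeddings, the threshold value, and the function names are hypothetical; in the described system the embeddings would come from the residual network applied to faces detected by YOLOv5.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(embedding, database, threshold):
    """Return (student_id, distance) for the nearest enrolled face.

    If the nearest distance exceeds `threshold`, the face is treated as
    unknown and (None, distance) is returned. The threshold is an assumed
    tuning parameter, not a value taken from the paper.
    """
    best_id, best_dist = None, float("inf")
    for student_id, reference in database.items():
        dist = euclidean(embedding, reference)
        if dist < best_dist:
            best_id, best_dist = student_id, dist
    if best_dist <= threshold:
        return best_id, best_dist
    return None, best_dist


# Toy database with made-up 3-dimensional embeddings.
enrolled = {"alice": [0.0, 0.0, 1.0], "bob": [1.0, 0.0, 0.0]}
match_id, match_dist = identify([0.9, 0.1, 0.0], enrolled, threshold=0.5)
unknown_id, unknown_dist = identify([5.0, 5.0, 5.0], enrolled, threshold=0.5)
```

Rejecting matches above a distance threshold is one common way to handle faces that are not enrolled; the abstract does not state how this case is handled.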

LGOct 29, 2025
Dynamically Weighted Momentum with Adaptive Step Sizes for Efficient Deep Network Training

Zhifeng Wang, Longlong Li, Chunyan Zeng

Within the current sphere of deep learning research, despite the extensive application of optimization algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam), there remains a pronounced inadequacy in their capability to address fluctuations in learning efficiency, meet the demands of complex models, and tackle non-convex optimization issues. These challenges primarily arise from the algorithms' limitations in handling complex data structures and models, for instance, difficulties in selecting an appropriate learning rate, avoiding local optima, and navigating through high-dimensional spaces. To address these issues, this paper introduces a novel optimization algorithm named DWMGrad. This algorithm, building on the foundations of traditional methods, incorporates a dynamic guidance mechanism reliant on historical data to dynamically update momentum and learning rates. This allows the optimizer to flexibly adjust its reliance on historical information, adapting to various training scenarios. This strategy not only enables the optimizer to better adapt to changing environments and task complexities but also, as validated through extensive experimentation, demonstrates DWMGrad's ability to achieve faster convergence rates and higher accuracies under a multitude of scenarios.

AIOct 27, 2025
TLCD: A Deep Transfer Learning Framework for Cross-Disciplinary Cognitive Diagnosis

Zhifeng Wang, Meixin Su, Yang Yang et al.

Driven by the dual principles of smart education and artificial intelligence technology, the online education model has rapidly emerged as an important component of the education industry. Cognitive diagnostic technology can utilize students' learning data and feedback information in educational evaluation to accurately assess their ability level at the knowledge level. However, while massive amounts of information provide abundant data resources, they also bring about complexity in feature extraction and scarcity of disciplinary data. In cross-disciplinary fields, traditional cognitive diagnostic methods still face many challenges. Given the differences in knowledge systems, cognitive structures, and data characteristics between different disciplines, this paper conducts in-depth research on neural network cognitive diagnosis and knowledge association neural network cognitive diagnosis, and proposes an innovative cross-disciplinary cognitive diagnosis method (TLCD). This method combines deep learning techniques and transfer learning strategies to enhance the performance of the model in the target discipline by utilizing the common features of the main discipline. The experimental results show that the cross-disciplinary cognitive diagnosis model based on deep learning performs better than the basic model in cross-disciplinary cognitive diagnosis tasks, and can more accurately evaluate students' learning situation.

CLOct 26, 2025
A Closed-Loop Personalized Learning Agent Integrating Neural Cognitive Diagnosis, Bounded-Ability Adaptive Testing, and LLM-Driven Feedback

Zhifeng Wang, Xinyue Zheng, Chunyan Zeng

As information technology advances, education is moving from one-size-fits-all instruction toward personalized learning. However, most methods handle modeling, item selection, and feedback in isolation rather than as a closed loop. This leads to coarse or opaque student models, assumption-bound adaptivity that ignores diagnostic posteriors, and generic, non-actionable feedback. To address these limitations, this paper presents an end-to-end personalized learning agent, EduLoop-Agent, which integrates a Neural Cognitive Diagnosis model (NCD), a Bounded-Ability Estimation Computerized Adaptive Testing strategy (BECAT), and large language models (LLMs). The NCD module provides fine-grained estimates of students' mastery at the knowledge-point level; BECAT dynamically selects subsequent items to maximize relevance and learning efficiency; and LLMs convert diagnostic signals into structured, actionable feedback. Together, these components form a closed-loop framework of "Diagnosis–Recommendation–Feedback." Experiments on the ASSISTments dataset show that the NCD module achieves strong performance on response prediction while yielding interpretable mastery assessments. The adaptive recommendation strategy improves item relevance and personalization, and the LLM-based feedback offers targeted study guidance aligned with identified weaknesses. Overall, the results indicate that the proposed design is effective and practically deployable, providing a feasible pathway to generating individualized learning trajectories in intelligent education.

CRJan 27, 2022
A Survey of PPG's Application in Authentication

Lin Li, Chao Chen, Lei Pan et al.

Biometric authentication prospered because of its convenient use and security. Early generations of biometric mechanisms suffer from spoofing attacks. Recently, unobservable physiological signals (e.g., Electroencephalogram, Photoplethysmogram, Electrocardiogram) as biometrics offer a potential remedy to this problem. In particular, Photoplethysmogram (PPG) measures the change in blood flow of the human body by an optical method. Clinically, researchers commonly use PPG signals to obtain patients' blood oxygen saturation, heart rate, and other information to assist in diagnosing heart-related diseases. Since PPG signals contain a wealth of individual cardiac information, researchers have begun to explore their potential in cyber security applications. The unique advantages (simple acquisition, difficult to steal, and live detection) of the PPG signal allow it to improve the security and usability of the authentication in various aspects. However, the research on PPG-based authentication is still in its infancy. The lack of systematization hinders new research in this field. We conduct a comprehensive study of PPG-based authentication and discuss these applications' limitations before pointing out future research directions.

OCDec 16, 2021
Analysis of Generalized Bregman Surrogate Algorithms for Nonsmooth Nonconvex Statistical Learning

Yiyuan She, Zhifeng Wang, Jiuwu Jin

Modern statistical applications often involve minimizing an objective function that may be nonsmooth and/or nonconvex. This paper focuses on a broad Bregman-surrogate algorithm framework including the local linear approximation, mirror descent, iterative thresholding, DC programming and many others as particular instances. The recharacterization via generalized Bregman functions enables us to construct suitable error measures and establish global convergence rates for nonconvex and nonsmooth objectives in possibly high dimensions. For sparse learning problems with a composite objective, under some regularity conditions, the obtained estimators as the surrogate's fixed points, though not necessarily local minimizers, enjoy provable statistical guarantees, and the sequence of iterates can be shown to approach the statistical truth within the desired accuracy geometrically fast. The paper also studies how to design adaptive momentum based accelerations without assuming convexity or smoothness by carefully controlling stepsize and relaxation parameters.

MEDec 15, 2021
Gaining Outlier Resistance with Progressive Quantiles: Fast Algorithms and Theoretical Studies

Yiyuan She, Zhifeng Wang, Jiahui Shen

Outliers widely occur in big-data applications and may severely affect statistical estimation and inference. In this paper, a framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming and includes explicit outlyingness parameters for all samples, which in turn facilitates computation, theory, and parameter tuning. To tackle the issues of nonconvexity and nonsmoothness, we develop scalable algorithms with implementation ease and guaranteed fast convergence. In particular, a new technique is proposed to alleviate the requirement on the starting point such that on regular datasets, the number of data resamplings can be substantially reduced. Based on combined statistical and computational treatments, we are able to perform nonasymptotic analysis beyond M-estimation. The obtained resistant estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low dimensions and high dimensions. Experiments in regression, classification, and neural networks show excellent performance of the proposed methodology at the occurrence of gross outliers.

STNov 17, 2014
Group Regularized Estimation under Structural Hierarchy

Yiyuan She, Zhifeng Wang, He Jiang

Variable selection for models including interactions between explanatory variables often needs to obey certain hierarchical constraints. The weak or strong structural hierarchy requires that the existence of an interaction term implies at least one or both associated main effects to be present in the model. Lately, this problem has attracted a lot of attention, but existing computational algorithms converge slowly even with a moderate number of predictors. Moreover, in contrast to the rich literature on ordinary variable selection, there is a lack of statistical theory to show reasonably low error rates of hierarchical variable selection. This work investigates a new class of estimators that make use of multiple group penalties to capture structural parsimony. We give the minimax lower bounds for strong and weak hierarchical variable selection and show that the proposed estimators enjoy sharp rate oracle inequalities. A general-purpose algorithm is developed with guaranteed convergence and global optimality. Simulations and real data experiments demonstrate the efficiency and efficacy of the proposed approach.
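
The weak and strong hierarchy constraints stated above can be made concrete with a small check. This sketch only tests whether a selected model obeys the constraint; it does not implement the group-penalized estimator, and the function name and variable labels are illustrative.

```python
def satisfies_hierarchy(main_effects, interactions, strong=True):
    """Check structural hierarchy for a selected model.

    main_effects: iterable of selected main-effect variable names.
    interactions: iterable of (j, k) pairs for selected interaction terms.
    Strong hierarchy: every interaction needs both main effects present.
    Weak hierarchy: every interaction needs at least one main effect present.
    """
    main = set(main_effects)
    for j, k in interactions:
        present = (j in main, k in main)
        ok = all(present) if strong else any(present)
        if not ok:
            return False
    return True


# A model with main effect x1 and interaction x1:x2.
selected_main = ["x1"]
selected_inter = [("x1", "x2")]
strong_ok = satisfies_hierarchy(selected_main, selected_inter, strong=True)
weak_ok = satisfies_hierarchy(selected_main, selected_inter, strong=False)
```

Here the model satisfies weak hierarchy, since x1 is present, but violates strong hierarchy because x2 is missing.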