Zhichao Zhang

CV
h-index 98
27 papers
570 citations
Novelty 43%
AI Score 55

27 Papers

CV · Mar 14, 2023 · Code
Subjective and Objective Quality Assessment for in-the-Wild Computer Graphics Images

Zicheng Zhang, Wei Sun, Yingjie Zhou et al.

Computer graphics images (CGIs) are artificially generated by means of computer programs and are widely perceived under various scenarios, such as games, streaming media, etc. In practice, the quality of CGIs consistently suffers from poor rendering during production, inevitable compression artifacts during the transmission of multimedia applications, and low aesthetic quality resulting from poor composition and design. However, few works have been dedicated to dealing with the challenge of computer graphics image quality assessment (CGIQA). Most image quality assessment (IQA) metrics are developed for natural scene images (NSIs) and validated on databases consisting of NSIs with synthetic distortions, which are not suitable for in-the-wild CGIs. To bridge the gap between evaluating the quality of NSIs and CGIs, we construct a large-scale in-the-wild CGIQA database consisting of 6,000 CGIs (CGIQA-6k) and carry out the subjective experiment in a well-controlled laboratory environment to obtain the accurate perceptual ratings of the CGIs. Then, we propose an effective deep learning-based no-reference (NR) IQA model by utilizing both distortion and aesthetic quality representation. Experimental results show that the proposed method outperforms all other state-of-the-art NR IQA methods on the constructed CGIQA-6k database and other CGIQA-related databases. The database is released at https://github.com/zzc-1998/CGIQA6K.

CV · Jul 31, 2024 · Code
Benchmarking Multi-dimensional AIGC Video Quality Assessment: A Dataset and Unified Model

Zhichao Zhang, Wei Sun, Xinyue Li et al.

In recent years, artificial intelligence (AI)-driven video generation has gained significant attention. Consequently, there is a growing need for accurate video quality assessment (VQA) metrics to evaluate the perceptual quality of AI-generated content (AIGC) videos and optimize video generation models. However, assessing the quality of AIGC videos remains a significant challenge because these videos often exhibit highly complex distortions, such as unnatural actions and irrational objects. To address this challenge, we systematically investigate the AIGC-VQA problem, considering both subjective and objective quality assessment perspectives. For the subjective perspective, we construct the Large-scale Generated Video Quality assessment (LGVQ) dataset, consisting of 2,808 AIGC videos generated by 6 video generation models using 468 carefully curated text prompts. We evaluate the perceptual quality of AIGC videos from three critical dimensions: spatial quality, temporal quality, and text-video alignment. For the objective perspective, we establish a benchmark for evaluating existing quality assessment metrics on the LGVQ dataset. Our findings show that current metrics perform poorly on this dataset, highlighting a gap in effective evaluation tools. To bridge this gap, we propose the Unify Generated Video Quality assessment (UGVQ) model, designed to accurately evaluate the multi-dimensional quality of AIGC videos. The UGVQ model integrates the visual and motion features of videos with the textual features of their corresponding prompts, forming a unified quality-aware feature representation tailored to AIGC videos. Experimental results demonstrate that UGVQ achieves state-of-the-art performance on the LGVQ dataset across all three quality dimensions. Both the LGVQ dataset and the UGVQ model are publicly available on https://github.com/zczhang-sjtu/UGVQ.git.

LG · May 27
Machine Learning methods for event classification and vertex reconstruction of the 12C + 12C reaction with the MATE-TPC

Minghui Zhang, Xiaobin Li, Jie Chen et al.

In modern nuclear physics experiments, identifying events of interest is challenging for nuclear reaction studies with the active target Time Projection Chamber (TPC). In this work, machine learning techniques are employed to analyze the complex data of the 12C + 12C fusion reaction from a TPC named MATE (multi-purpose active-target time projection chamber for nuclear experiments). Specifically, we successfully applied Residual Neural Network (ResNet-50, ResNet-34 and ResNet-18) and Visual Geometry Group (VGG-19) to classify elastic scattering and fusion reaction events from the 12C + 12C reaction. The classification results of the four models are nearly identical, with accuracies of approximately 97% for the simulated data and 90% for the experimental data. Moreover, these approaches successfully identify some events that are misclassified by traditional methods. These models are also applied to classify events from different fusion reaction channels, with classification accuracies of approximately 95% on simulated data. In addition, a Convolutional Neural Network (CNN) model is developed to reconstruct the reaction vertex, providing an alternative strategy for vertex reconstruction. These results indicate that machine learning techniques can effectively classify reaction events from different channels and reconstruct the reaction vertex, thereby paving the way for future analyses of complex nuclear reaction data.

CV · Jan 28
Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework

Xinyue Li, Zhichao Zhang, Zhiming Xu et al.

Recent multimodal large language models (MLLMs) have demonstrated strong capabilities in image quality assessment (IQA) tasks. However, adapting such large-scale models is computationally expensive and still relies on substantial Mean Opinion Score (MOS) annotations. We argue that for MLLM-based IQA, the core bottleneck lies not in the quality perception capacity of MLLMs, but in MOS scale calibration. Therefore, we propose LEAF, a Label-Efficient Image Quality Assessment Framework that distills perceptual quality priors from an MLLM teacher into a lightweight student regressor, enabling MOS calibration with minimal human supervision. Specifically, the teacher conducts dense supervision through point-wise judgments and pair-wise preferences, with an estimate of decision reliability. Guided by these signals, the student learns the teacher's quality perception patterns through joint distillation and is calibrated on a small MOS subset to align with human annotations. Experiments on both user-generated and AI-generated IQA benchmarks demonstrate that our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations, making lightweight IQA practical under limited annotation budgets.

CV · Feb 3
ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images

Xinyue Li, Zhiming Xu, Zhichao Zhang et al.

Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.

IV · Apr 17, 2024 · Code
NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment: Methods and Results

Xin Li, Kun Yuan, Yajing Pei et al.

This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions were submitted and evaluated on the collected dataset KVQ from a popular short-form video platform, i.e., the Kuaishou/Kwai platform. The KVQ database is divided into three parts: 2926 videos for training, 420 videos for validation, and 854 videos for testing. The purpose is to build new benchmarks and advance the development of S-UGC VQA. The competition had 200 participants, and 13 teams submitted valid solutions for the final testing phase. The proposed solutions achieved state-of-the-art performance for S-UGC VQA. The project can be found at https://github.com/lixinustc/KVQChallenge-CVPR-NTIRE2024.

CV · Apr 27, 2024 · Code
Large Multi-modality Model Assisted AI-Generated Image Quality Assessment

Puyi Wang, Wei Sun, Zicheng Zhang et al.

Traditional deep neural network (DNN)-based image quality assessment (IQA) models leverage convolutional neural networks (CNNs) or Transformers to learn the quality-aware feature representation, achieving commendable performance on natural scene images. However, when applied to AI-Generated images (AGIs), these DNN-based IQA models exhibit subpar performance. This situation is largely due to the semantic inaccuracies inherent in certain AGIs caused by the uncontrollable nature of the generation process. Thus, the capability to discern semantic content becomes crucial for assessing the quality of AGIs. Traditional DNN-based IQA models, constrained by limited parameter complexity and training data, struggle to capture complex fine-grained semantic features, making it challenging to grasp the existence and coherence of semantic content of the entire image. To address the shortfall in semantic content perception of current IQA models, we introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model, which utilizes semantically informed guidance to sense semantic information and extract semantic vectors through carefully designed text prompts. Moreover, it employs a mixture of experts (MoE) structure to dynamically integrate the semantic information with the quality-aware features extracted by traditional DNN-based IQA models. Comprehensive experiments conducted on two AI-generated content datasets, AIGCQA-20k and AGIQA-3k, show that MA-AGIQA achieves state-of-the-art performance and demonstrate its superior generalization capabilities in assessing the quality of AGIs. Code is available at https://github.com/wangpuyi/MA-AGIQA.

IV · May 14, 2024 · Code
Enhancing Blind Video Quality Assessment with Rich Quality-aware Features

Wei Sun, Haoning Wu, Zicheng Zhang et al.

In this paper, we present a simple but effective method to enhance blind video quality assessment (BVQA) models for social media videos. Motivated by previous research that leverages pre-trained features extracted from various computer vision models as the feature representation for BVQA, we further explore rich quality-aware features from pre-trained blind image quality assessment (BIQA) and BVQA models as auxiliary features to help the BVQA model handle the complex distortions and diverse content of social media videos. Specifically, we use SimpleVQA, a BVQA model that consists of a trainable Swin Transformer-B and a fixed SlowFast, as our base model. The Swin Transformer-B and SlowFast components are responsible for extracting spatial and motion features, respectively. Then, we extract three kinds of features from Q-Align, LIQE, and FAST-VQA to capture frame-level quality-aware features, frame-level quality-aware along with scene-specific features, and spatiotemporal quality-aware features, respectively. After concatenating these features, we employ a multi-layer perceptron (MLP) network to regress them into quality scores. Experimental results demonstrate that the proposed model achieves the best performance on three public social media VQA datasets. Moreover, the proposed model won first place in the CVPR NTIRE 2024 Short-form UGC Video Quality Assessment Challenge. The code is available at https://github.com/sunwei925/RQ-VQA.git.

CV · Nov 25, 2024 · Code
Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

Zhichao Zhang, Wei Sun, Xinyue Li et al.

AI-driven video generation techniques have made significant progress in recent years. However, AI-generated videos (AGVs) involving human activities often exhibit substantial visual and semantic distortions, hindering the practical application of video generation technologies in real-world scenarios. To address this challenge, we conduct a pioneering study on human activity AGV quality assessment, focusing on visual quality evaluation and the identification of semantic distortions. First, we construct the AI-Generated Human activity Video Quality Assessment (Human-AGVQA) dataset, consisting of 6,000 AGVs derived from 15 popular text-to-video (T2V) models using 400 text prompts that describe diverse human activities. We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. Based on Human-AGVQA, we benchmark the performance of T2V models and analyze their strengths and weaknesses in generating different categories of human activities. Second, we develop an objective evaluation metric, named AI-Generated Human activity Video Quality metric (GHVQ), to automatically analyze the quality of human activity AGVs. GHVQ systematically extracts human-focused quality features, AI-generated content-aware quality features, and temporal continuity features, making it a comprehensive and explainable quality metric for human activity AGVs. The extensive experimental results show that GHVQ outperforms existing quality metrics on the Human-AGVQA dataset by a large margin, demonstrating its efficacy in assessing the quality of human activity AGVs. The Human-AGVQA dataset and GHVQ metric will be released at https://github.com/zczhang-sjtu/GHVQ.git.

CV · Apr 14
DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

Xinyue Li, Shubo Xu, Zhichao Zhang et al.

Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupling perception and calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
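
The idea of keeping a base estimate fixed and learning only a residual correction can be illustrated with a minimal sketch. This is not the DPC-VQA implementation: the single scalar feature, the linear residual model, and the names `fit_residual` and `calibrated_score` are all assumptions made for illustration, standing in for the frozen MLLM output and the lightweight calibration branch.

```python
def fit_residual(base_scores, features, mos):
    """Fit residual = slope * feature + intercept by ordinary least squares.

    base_scores: scores from a frozen model (never updated here).
    features:    one hypothetical scalar feature per sample.
    mos:         human mean opinion scores for the small labeled subset.
    """
    residuals = [m - b for m, b in zip(mos, base_scores)]
    n = len(features)
    mean_x = sum(features) / n
    mean_r = sum(residuals) / n
    var_x = sum((x - mean_x) ** 2 for x in features)
    if var_x == 0:
        # Degenerate case: constant feature, fall back to a pure offset.
        return 0.0, mean_r
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(features, residuals))
    slope = cov / var_x
    intercept = mean_r - slope * mean_x
    return slope, intercept


def calibrated_score(base_score, feature, slope, intercept):
    """Final prediction = frozen base estimate + learned residual correction."""
    return base_score + slope * feature + intercept


if __name__ == "__main__":
    slope, intercept = fit_residual(
        base_scores=[3.0, 3.5, 4.0],
        features=[0.0, 1.0, 2.0],
        mos=[2.8, 3.8, 4.8],
    )
    print(calibrated_score(3.0, 1.0, slope, intercept))
```

Only the two residual parameters are fitted, which mirrors, in a very reduced form, why a calibration branch can be trained with few labels while the base model stays frozen.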

LG · May 3, 2022
Revisiting Communication-Efficient Federated Learning with Balanced Global and Local Updates

Zhigang Yan, Dong Li, Zhichao Zhang et al.

In federated learning (FL), a number of devices train their local models and upload the corresponding parameters or gradients to the base station (BS) to update the global model while protecting their data privacy. However, due to the limited computation and communication resources, the number of local trainings (a.k.a. local update) and that of aggregations (a.k.a. global update) need to be carefully chosen. In this paper, we investigate and analyze the optimal trade-off between the number of local trainings and that of global aggregations to speed up the convergence and enhance the prediction accuracy over the existing works. Our goal is to minimize the global loss function under both the delay and the energy consumption constraints. In order to make the optimization problem tractable, we derive a new and tight upper bound on the loss function, which allows us to obtain closed-form expressions for the number of local trainings and that of global aggregations. Simulation results show that our proposed scheme can achieve a better performance in terms of the prediction accuracy, and converge much faster than the baseline schemes.
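
The two quantities being traded off, the number of local trainings per round and the number of global aggregations, can be made concrete with a toy sketch. It assumes plain federated averaging on a one-dimensional quadratic loss per device; the function name `federated_average` and all parameter values are hypothetical, and the paper's closed-form choice of the two counts under delay and energy constraints is not reproduced here.

```python
def federated_average(targets, local_steps, global_rounds, lr=0.1):
    """Toy federated averaging on per-device losses (w - t)^2.

    targets:       one target value t per device (its private "data").
    local_steps:   gradient steps each device runs per round (local updates).
    global_rounds: number of aggregations at the server (global updates).
    """
    w_global = 0.0
    for _ in range(global_rounds):
        local_models = []
        for t in targets:
            w_local = w_global
            for _ in range(local_steps):
                # Gradient of (w - t)^2 with respect to w is 2 * (w - t).
                w_local -= lr * 2.0 * (w_local - t)
            local_models.append(w_local)
        # The server averages the uploaded local models.
        w_global = sum(local_models) / len(local_models)
    return w_global


if __name__ == "__main__":
    # Same total number of gradient steps, split differently.
    print(federated_average([1.0, 3.0], local_steps=1, global_rounds=10))
    print(federated_average([1.0, 3.0], local_steps=5, global_rounds=2))
```

In this toy setting both splits approach the minimizer of the average loss (the mean of the targets); with heterogeneous curvature or real communication costs the split matters, which is the trade-off the abstract describes.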

CV · Apr 4, 2024
AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment

Chunyi Li, Tengchuan Kou, Yixuan Gao et al.

With the rapid advancements in AI-Generated Content (AIGC), AI-Generated Images (AIGIs) have been widely applied in entertainment, education, and social media. However, due to the significant variance in quality among different AIGIs, there is an urgent need for models that consistently match human subjective ratings. To address this issue, we organized a challenge on AIGC quality assessment at NTIRE 2024 that extensively considers 15 popular generative models, utilizes dynamic hyper-parameters (including classifier-free guidance, iteration epochs, and output image resolution), and gathers subjective scores from 21 subjects that jointly consider perceptual quality and text-to-image alignment. This approach culminates in the creation of the largest fine-grained AIGI subjective quality database to date, with 20,000 AIGIs and 420,000 subjective ratings, known as AIGIQA-20K. Furthermore, we conduct benchmark experiments on this database to assess the correspondence between 16 mainstream AIGI quality models and human perception. We anticipate that this large-scale quality database will inspire robust quality indicators for AIGIs and propel the evolution of AIGC for vision. The database is released on https://www.modelscope.cn/datasets/lcysyzxdxc/AIGCQA-30K-Image.

CV · May 22, 2025
NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

Shuhao Han, Haotian Fan, Fangyuan Kong et al.

This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects: image-text alignment and image structural distortion detection, and is divided into the alignment track and the structure track. The alignment track uses the EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions were received in the development phase, and 507 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structure track uses the EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants have registered in the structure track. A total of 1,155 submissions were received in the development phase, and 487 submissions were received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on T2I model quality assessment.

CV · Apr 24, 2024
AIS 2024 Challenge on Video Quality Assessment of User-Generated Content: Methods and Results

Marcos V. Conde, Saman Zadtootaghaj, Nabajeet Barman et al.

This paper reviews the AIS 2024 Video Quality Assessment (VQA) Challenge, focused on User-Generated Content (UGC). The aim of this challenge is to gather deep learning-based methods capable of estimating the perceptual quality of UGC videos. The user-generated videos from the YouTube UGC Dataset include diverse content (sports, games, lyrics, anime, etc.), quality and resolutions. The proposed methods must process 30 FHD frames under 1 second. In the challenge, a total of 102 participants registered, and 15 submitted code and models. The performance of the top-5 submissions is reviewed and provided here as a survey of diverse deep models for efficient video quality assessment of user-generated content.

CV · Apr 30, 2025
AGHI-QA: A Subjective-Aligned Dataset and Metric for AI-Generated Human Images

Yunhao Li, Sijing Wu, Wei Sun et al.

The rapid development of text-to-image (T2I) generation approaches has attracted extensive interest in evaluating the quality of generated images, leading to the development of various quality assessment methods for general-purpose T2I outputs. However, existing image quality assessment (IQA) methods are limited to providing global quality scores, failing to deliver fine-grained perceptual evaluations for structurally complex subjects like humans, which is a critical challenge considering the frequent anatomical and textural distortions in AI-generated human images (AGHIs). To address this gap, we introduce AGHI-QA, the first large-scale benchmark specifically designed for quality assessment of AGHIs. The dataset comprises 4,000 images generated from 400 carefully crafted text prompts using 10 state-of-the-art T2I models. We conduct a systematic subjective study to collect multidimensional annotations, including perceptual quality scores, text-image correspondence scores, and visible and distorted body part labels. Based on AGHI-QA, we evaluate the strengths and weaknesses of current T2I methods in generating human images from multiple dimensions. Furthermore, we propose AGHI-Assessor, a novel quality metric that integrates the large multimodal model (LMM) with domain-specific human features for precise quality prediction and identification of visible and distorted body parts in AGHIs. Extensive experimental results demonstrate that AGHI-Assessor achieves state-of-the-art performance, significantly outperforming existing IQA methods in multidimensional quality assessment and surpassing leading LMMs in detecting structural distortions in AGHIs.

CV · Sep 11, 2025
VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results

Hanwei Zhu, Haoning Wu, Zicheng Zhang et al.

This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.

LG · Aug 4, 2025
Graph Embedding in the Graph Fractional Fourier Transform Domain

Changjie Sheng, Zhichao Zhang, Wei Yao

Spectral graph embedding plays a critical role in graph representation learning by generating low-dimensional vector representations from graph spectral information. However, the embedding space of traditional spectral embedding methods often exhibits limited expressiveness, failing to exhaustively capture latent structural features across alternative transform domains. To address this issue, we use the graph fractional Fourier transform to extend the existing state-of-the-art generalized frequency filtering embedding (GEFFE) into fractional domains, yielding the generalized fractional filtering embedding (GEFRFE), which enhances embedding informativeness via the graph fractional domain. The GEFRFE leverages graph fractional domain filtering and a nonlinear composition of eigenvector components derived from a fractionalized graph Laplacian. To dynamically determine the fractional order, two parallel strategies are introduced: search-based optimization and ResNet18-based adaptive learning. Extensive experiments on six benchmark datasets demonstrate that the GEFRFE captures richer structural features and significantly enhances classification performance. Notably, the proposed method retains computational complexity comparable to that of GEFFE approaches.

CL · Nov 1, 2024
ReverseNER: A Self-Generated Example-Driven Framework for Zero-Shot Named Entity Recognition with Large Language Models

Anbang Wang, Difei Mei, Zhichao Zhang et al.

This paper presents ReverseNER, a method aimed at overcoming the limitation of large language models (LLMs) in zero-shot named entity recognition (NER) tasks, arising from their reliance on pre-provided demonstrations. ReverseNER tackles this challenge by constructing a reliable example library composed of dozens of entity-labeled sentences, generated through the reverse process of NER. Specifically, while conventional NER methods label entities in a sentence, ReverseNER reverses the process by using an LLM to generate entities from their definitions and subsequently expand them into full sentences. During the entity expansion process, the LLM is guided to generate sentences by replicating the structures of a set of specific feature sentences, extracted from the task sentences by clustering. This expansion process produces dozens of entity-labeled task-relevant sentences. After constructing the example library, the method selects several semantically similar entity-labeled examples for each task sentence as references to facilitate the LLM's entity recognition. We also propose an entity-level self-consistency scoring mechanism to improve NER performance with LLMs. Experiments show that ReverseNER significantly outperforms other zero-shot NER methods with LLMs, marking a notable improvement in NER for domains without labeled data, while reducing computational resource consumption.

LG · Mar 1
DWAFM: Dynamic Weighted Graph Structure Embedding Integrated with Attention and Frequency-Domain MLPs for Traffic Forecasting

Sen Shi, Zhichao Zhang, Yangfan He

Accurate traffic prediction is a key task for intelligent transportation systems. The core difficulty lies in accurately modeling the complex spatial-temporal dependencies in traffic data. In recent years, improvements in network architecture have failed to bring significant performance enhancements, while embedding technology has shown great potential. However, existing embedding methods often ignore graph structure information or rely solely on static graph structures, making it difficult to effectively capture the dynamic associations between nodes that evolve over time. To address this issue, this letter proposes a novel dynamic weighted graph structure (DWGS) embedding method, which relies on a graph structure that reflects how the strength of dynamic associations between nodes changes over time. By first combining the DWGS embedding with the spatial-temporal adaptive embedding, as well as the temporal embedding and feature embedding, and then integrating attention and frequency-domain multi-layer perceptrons (MLPs), we design a novel traffic prediction model, termed the DWGS embedding integrated with attention and frequency-domain MLPs (DWAFM). Experiments on five real-world traffic datasets show that the DWAFM achieves better prediction performance than several state-of-the-art methods.

ML · Nov 20, 2025
Angular Graph Fractional Fourier Transform: Theory and Application

Feiyue Zhao, Yangfan He, Zhichao Zhang

Graph spectral representations are fundamental in graph signal processing, offering a rigorous framework for analyzing and processing graph-structured data. The graph fractional Fourier transform (GFRFT) extends the classical graph Fourier transform (GFT) with a fractional-order parameter, enabling flexible spectral analysis while preserving mathematical consistency. The angular graph Fourier transform (AGFT) introduces angular control via GFT eigenvector rotation; however, existing constructions fail to degenerate to the GFT at zero angle, which is a critical flaw that undermines theoretical consistency and interpretability. To resolve these complementary limitations - GFRFT's lack of angular regulation and AGFT's defective degeneracy - this study proposes an angular GFRFT (AGFRFT), a unified framework that integrates fractional-order and angular spectral analyses with theoretical rigor. A degeneracy-friendly rotation matrix family ensures exact GFT degeneration at zero angle, with two AGFRFT variants (I-AGFRFT and II-AGFRFT) defined accordingly. Rigorous theoretical analyses confirm their unitarity, invertibility, and smooth parameter dependence. Both support learnable joint parameterization of the angle and fractional order, enabling adaptive spectral processing for diverse graph signals. Extensive experiments on real-world data denoising, image denoising, and point cloud denoising demonstrate that AGFRFT outperforms GFRFT and AGFT in terms of spectral concentration, reconstruction quality, and controllable spectral manipulation, establishing a robust and flexible tool for integrated angular fractional spectral analysis in graph signal processing.
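
For orientation, the relationship between the GFT and the GFRFT mentioned above can be written out using the definition commonly found in the graph signal processing literature. The symbols below are assumptions introduced here and may differ from the paper's notation; the angular variants (I-AGFRFT and II-AGFRFT) are not reproduced because the abstract does not specify them.

```latex
% Assumed setup: a graph shift operator (for example, the graph Laplacian) \mathbf{L}
% with eigendecomposition \mathbf{L} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{-1},
% and a graph signal \mathbf{x}.
\begin{align}
  \mathbf{F} &= \mathbf{U}^{-1}, &
  \hat{\mathbf{x}} &= \mathbf{F}\mathbf{x}
  && \text{(GFT)} \\
  \mathbf{F} &= \mathbf{P}\mathbf{J}\mathbf{P}^{-1}, &
  \mathbf{F}^{\alpha} &= \mathbf{P}\mathbf{J}^{\alpha}\mathbf{P}^{-1}
  && \text{(GFRFT of fractional order } \alpha\text{)}
\end{align}
% Sanity checks: \alpha = 1 recovers the GFT, and \alpha = 0 gives the identity.
```

Under this definition, the degeneracy requirement described in the abstract amounts to asking that the angular transform reduce to $\mathbf{F}$ when the angle is zero and $\alpha = 1$.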

LG · Aug 16, 2025
A Novel Unified Extended Matrix for Graph Signal Processing: Theory and Application

Yunyan Zheng, Zhichao Zhang, Wei Yao

Graph signal processing has become an essential tool for analyzing data structured on irregular domains. While conventional graph shift operators (GSOs) are effective for certain tasks, they inherently lack flexibility in modeling dependencies between non-adjacent nodes, limiting their ability to represent complex graph structures. To address this limitation, this paper proposes the unified extended matrix (UEM) framework, which integrates the extended-adjacency matrix and the unified graph representation matrix through parametric design, so as to be able to flexibly adapt to different graph structures and reveal more graph signal information. Theoretical analysis of the UEM is conducted, demonstrating positive semi-definiteness and eigenvalue monotonicity under specific conditions. Then, we propose graph Fourier transform based on UEM (UEM-GFT), which can adaptively tune spectral properties to enhance signal processing performance. Experimental results on synthetic and real-world datasets demonstrate that the UEM-GFT outperforms existing GSO-based methods in anomaly detection tasks, achieving superior performance across varying network topologies.

CV · Aug 15, 2025
Hierarchical Graph Feature Enhancement with Adaptive Frequency Modulation for Visual Recognition

Feiyue Zhao, Zhichao Zhang

Convolutional neural networks (CNNs) have demonstrated strong performance in visual recognition tasks, but their inherent reliance on regular grid structures limits their capacity to model complex topological relationships and non-local semantics within images. To address this limitation, we propose the hierarchical graph feature enhancement (HGFE), a novel framework that integrates graph-based reasoning into CNNs to enhance both structural awareness and feature representation. HGFE builds two complementary levels of graph structures: intra-window graph convolution to capture local spatial dependencies and inter-window supernode interactions to model global semantic relationships. Moreover, we introduce an adaptive frequency modulation module that dynamically balances low-frequency and high-frequency signal propagation, preserving critical edge and texture information while mitigating over-smoothing. The proposed HGFE module is lightweight, end-to-end trainable, and can be seamlessly integrated into standard CNN backbone networks. Extensive experiments on CIFAR-100 (classification), PASCAL VOC, and VisDrone (detection), as well as CrackSeg and CarParts (segmentation), validate the effectiveness of the HGFE in improving structural representation and enhancing overall recognition performance.

LGJan 26, 2025
HMCGeo: IP Region Prediction Based on Hierarchical Multi-label Classification

Tianzi Zhao, Xinran Liu, Zhaoxin Zhang et al.

Fine-grained IP geolocation plays a critical role in applications such as location-based services and cybersecurity. Most existing fine-grained IP geolocation methods are regression-based; however, due to noise in the input data, these methods typically encounter kilometer-level prediction errors and provide incorrect region information for users. To address this issue, this paper proposes a novel hierarchical multi-label classification framework for IP region prediction, named HMCGeo. This framework treats IP geolocation as a hierarchical multi-label classification problem and employs residual connection-based feature extraction and attention prediction units to predict the target host region across multiple geographical granularities. Furthermore, we introduce probabilistic classification loss during training, combining it with hierarchical cross-entropy loss to form a composite loss function. This approach optimizes predictions by utilizing hierarchical constraints between regions at different granularities. IP region prediction experiments on the New York, Los Angeles, and Shanghai datasets demonstrate that HMCGeo achieves superior performance across all geographical granularities, significantly outperforming existing IP geolocation methods.

SDJul 12, 2020
Learning Frame Level Attention for Environmental Sound Classification

Zhichao Zhang, Shugong Xu, Shunqing Zhang et al.

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The classification performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model to focus on the semantically relevant frames and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend our convolutional RNN model with a frame-level attention mechanism to learn discriminative feature representations for ESC. We investigated the classification performance when using different attention scaling functions and applying attention at different layers. Experiments were conducted on ESC-50 and ESC-10 datasets. Experimental results demonstrated the effectiveness of the proposed method, and our method achieved state-of-the-art or competitive classification accuracy with lower computational complexity. We also visualized our attention results and observed that the proposed attention mechanism was able to lead the network to focus on the semantically relevant parts of environmental sounds.
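
As a rough illustration of frame-level attention pooling, the sketch below scores each frame feature vector, normalizes the scores with a softmax, and returns the attention-weighted sum of the frames. This is not the authors' implementation: the function name `attention_pool`, the linear scoring vector `w`, and the choice of softmax as the scaling function are all assumptions made for the example.

```python
import math

def attention_pool(frames, w):
    """Pool frame features into one vector using softmax attention weights.

    frames: list of per-frame feature vectors (lists of floats).
    w: hypothetical scoring vector; score of a frame is the dot product w . frame.
    """
    # One scalar relevance score per frame.
    scores = [sum(wi * fi for wi, fi in zip(w, frame)) for frame in frames]
    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of frame features.
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights

# Three frames with two features each; the third frame scores highest.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]], w=[1.0, 0.0])
```

Frames with low scores (for example, silent frames) receive small weights and contribute little to the pooled representation.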

SDAug 16, 2019
Sub-Spectrogram Segmentation for Environmental Sound Classification via Convolutional Recurrent Neural Network and Score Level Fusion

Tianhao Qiao, Shunqing Zhang, Zhichao Zhang et al.

Environmental Sound Classification (ESC) is an important and challenging problem, and feature representation is a critical and even decisive factor in ESC. Feature representation ability directly affects the accuracy of sound classification. Therefore, the ESC performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. In this paper, we propose a sub-spectrogram segmentation based ESC classification framework. In addition, we adopt the proposed Convolutional Recurrent Neural Network (CRNN) and score level fusion to jointly improve the classification accuracy. Extensive truncation schemes are evaluated to find the optimal number and the corresponding band ranges of sub-spectrograms. Based on the numerical experiments, the proposed framework can achieve 81.9% ESC classification accuracy on the public dataset ESC-50, which provides 9.1% accuracy improvement over traditional baseline schemes.
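
Score level fusion can be sketched as combining the per-class scores produced for each sub-spectrogram and picking the class with the highest fused score. The sketch below assumes a weighted average with equal weights by default; the function name `fuse_scores`, the weighting scheme, and the example numbers are hypothetical and are not taken from the paper.

```python
def fuse_scores(scores, weights=None):
    """Fuse per-band class scores by weighted averaging.

    scores: list of score vectors, one per sub-spectrogram classifier.
    weights: optional per-band weights; defaults to a plain average.
    Returns the fused score vector and the index of the predicted class.
    """
    n_bands = len(scores)
    if weights is None:
        weights = [1.0 / n_bands] * n_bands
    n_classes = len(scores[0])
    # Weighted sum of the bands' scores for each class.
    fused = [sum(w * s[c] for w, s in zip(weights, scores)) for c in range(n_classes)]
    # Predicted class is the one with the largest fused score.
    label = max(range(n_classes), key=fused.__getitem__)
    return fused, label

# Made-up scores from three sub-spectrogram classifiers over three classes.
fused, label = fuse_scores([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.7, 0.2]])
```

Here the first band alone would pick class 0, but the fused scores favor class 1 because the other two bands agree on it.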

SDJul 4, 2019
Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification

Zhichao Zhang, Shugong Xu, Tianhao Qiao et al.

Environmental sound classification (ESC) is a challenging problem due to the complexity of sounds. The ESC performance is heavily dependent on the effectiveness of representative features extracted from the environmental sounds. However, ESC often suffers from semantically irrelevant frames and silent frames. To deal with this, we employ a frame-level attention model to focus on the semantically relevant frames and salient frames. Specifically, we first propose a convolutional recurrent neural network to learn spectro-temporal features and temporal correlations. Then, we extend our convolutional RNN model with a frame-level attention mechanism to learn discriminative feature representations for ESC. Experiments were conducted on ESC-50 and ESC-10 datasets. Experimental results demonstrated the effectiveness of the proposed method, which achieved state-of-the-art performance in terms of classification accuracy.

SDAug 25, 2018
Deep Convolutional Neural Network with Mixup for Environmental Sound Classification

Zhichao Zhang, Shugong Xu, Shan Cao et al.

Environmental sound classification (ESC) is an important and challenging problem. In contrast to speech, sound events have a noise-like nature and may be produced by a wide variety of sources. In this paper, we propose to use a novel deep convolutional neural network for ESC tasks. Our network architecture uses stacked convolutional and pooling layers to extract high-level feature representations from spectrogram-like features. Furthermore, we apply mixup to ESC tasks and explore its impact on classification performance and feature distribution. Experiments were conducted on UrbanSound8K, ESC-50 and ESC-10 datasets. Our experimental results demonstrated that our ESC system achieved state-of-the-art performance (83.7%) on UrbanSound8K and competitive performance on ESC-50 and ESC-10.
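
For readers unfamiliar with mixup, the standard formulation blends two training examples and their one-hot labels with a coefficient drawn from a Beta(alpha, alpha) distribution. The sketch below shows that general recipe on flat feature lists; the function name `mixup`, the default `alpha=0.2`, and the toy inputs are assumptions for illustration, not values reported in the abstract.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) pairs with a Beta-distributed coefficient.

    alpha is a hypothetical default; rng can be a seeded random.Random for reproducibility.
    """
    # Mixing coefficient lam in [0, 1], drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # The same convex combination is applied to features and to labels.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Toy example: two samples from different classes, seeded for repeatability.
x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                  alpha=0.2, rng=random.Random(0))
```

The mixed label remains a valid distribution over classes, so it can be used directly with a cross-entropy loss on soft targets.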