Zicheng Zhang

CL
h-index: 21
12 papers
81 citations
Novelty: 45%
AI Score: 43

12 Papers

13.3 · IV · Sep 11, 2024 · Code
3DGCQA: A Quality Assessment Database for 3D AI-Generated Contents

Yingjie Zhou, Zicheng Zhang, Farong Wen et al.

Although 3D generated content (3DGC) offers advantages in reducing production costs and accelerating design timelines, its quality often falls short when compared to 3D professionally generated content. Common quality issues frequently affect 3DGC, highlighting the importance of timely and effective quality assessment. Such evaluations not only ensure a higher standard of 3DGCs for end-users but also provide critical insights for advancing generative technologies. To address existing gaps in this domain, this paper introduces a novel 3DGC quality assessment dataset, 3DGCQA, built using 7 representative Text-to-3D generation methods. During the dataset's construction, 50 fixed prompts are utilized to generate contents across all methods, resulting in the creation of 313 textured meshes that constitute the 3DGCQA dataset. The visualization intuitively reveals the presence of 6 common distortion categories in the generated 3DGCs. To further explore the quality of the 3DGCs, subjective quality assessment is conducted by evaluators, whose ratings reveal significant variation in quality across different generation methods. Additionally, several objective quality assessment algorithms are tested on the 3DGCQA dataset. The results expose limitations in the performance of existing algorithms and underscore the need for developing more specialized quality assessment methods. To provide a valuable resource for future research and development in 3D content generation and quality assessment, the dataset has been open-sourced at https://github.com/zyj-2000/3DGCQA.

3.9 · CV · Jul 31, 2023
MRA-GNN: Minutiae Relation-Aware Model over Graph Neural Network for Fingerprint Embedding

Yapeng Su, Tong Zhao, Zicheng Zhang

Deep learning has achieved remarkable results in fingerprint embedding, which plays a critical role in modern Automated Fingerprint Identification Systems. However, previous works, including CNN-based and Transformer-based approaches, fail to exploit nonstructural data such as the topology and correlation in fingerprints, which are essential to the identifiability and robustness of the embedding. To address this challenge, we propose a novel paradigm for fingerprint embedding, called Minutiae Relation-Aware model over Graph Neural Network (MRA-GNN). Our approach incorporates a GNN-based framework in fingerprint embedding to encode the topology and correlation of fingerprints into descriptive features, achieving fingerprint representation in the form of graph embedding. Specifically, we reinterpret fingerprint data and their relative connections as vertices and edges respectively, and introduce a minutia graph and a fingerprint graph to represent the topological relations and correlation structures of fingerprints. We equip MRA-GNN with a Topological relation Reasoning Module (TRM) and a Correlation-Aware Module (CAM) to learn the fingerprint embedding from these graphs. To tackle the over-smoothing problem in GNN models, we incorporate a Feed-Forward Module and graph residual connections into the proposed modules. The experimental results demonstrate that our approach outperforms state-of-the-art methods on various fingerprint datasets, indicating its effectiveness in exploiting nonstructural information of fingerprints.

4.1 · LG · Dec 1, 2025
On the Tension Between Optimality and Adversarial Robustness in Policy Optimization

Haoran Li, Jiayu Lv, Congying Han et al.

Optimality and adversarial robustness in deep reinforcement learning have long been regarded as conflicting goals. Nonetheless, recent theoretical insights presented in CAR suggest a potential alignment, raising the important question of how to realize this in practice. This paper first identifies a key gap between theory and practice by comparing standard policy optimization (SPO) and adversarially robust policy optimization (ARPO). Although they share theoretical consistency, a fundamental tension between robustness and optimality arises in practical policy gradient methods. SPO tends to converge to vulnerable first-order stationary policies (FOSPs) with strong natural performance, whereas ARPO typically favors more robust FOSPs at the expense of reduced returns. Furthermore, we attribute this tradeoff to the reshaping effect of the strongest adversary in ARPO, which significantly complicates the global landscape by inducing deceptive sticky FOSPs. This improves robustness but makes navigation more challenging. To alleviate this, we develop BARPO, a bilevel framework unifying SPO and ARPO by modulating adversary strength, thereby facilitating navigability while preserving global optima. Extensive empirical results demonstrate that BARPO consistently outperforms vanilla ARPO, providing a practical approach to reconcile theoretical and empirical performance.

19.4 · CL · Jan 20, 2025 · Code
Redundancy Principles for MLLMs Benchmarks

Zicheng Zhang, Xiangyu Zhao, Xinyu Fang et al. · pku

With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. This rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back, critically assess the current state of redundancy, and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively. The code is available at https://github.com/zzc-1998/Benchmark-Redundancy.
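One way to make cross-benchmark redundancy concrete is to check how similarly two benchmarks rank the same set of models. The sketch below assumes Spearman rank correlation as the redundancy measure; the abstract does not name the metric it uses, and the scores and function names here are made up for illustration.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores of five models on three benchmarks (not real data).
bench_a = [61.2, 55.0, 70.3, 48.9, 66.1]
bench_b = [58.0, 50.1, 72.5, 45.3, 63.0]  # ranks the models like bench_a
bench_c = [40.0, 62.5, 41.7, 59.8, 44.2]  # ranks the models differently

print("a vs b:", spearman(bench_a, bench_b))
print("a vs c:", spearman(bench_a, bench_c))
```

Under this assumption, a correlation near 1 suggests the second benchmark adds little ranking information beyond the first, while a low or negative value suggests it probes something different.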

10.9 · CL · Jun 1, 2025 · Code
Affordance Benchmark for MLLMs

Junying Wang, Wenzhe Li, Yalun Wu et al.

Affordance theory suggests that environments inherently provide action possibilities shaping perception and behavior. While Multimodal Large Language Models (MLLMs) achieve strong performance in vision-language tasks, their ability to perceive affordance, which is crucial for intuitive and safe interactions, remains underexplored. To address this, we introduce A4Bench, a novel benchmark designed to evaluate the affordance perception abilities of MLLMs across two dimensions: 1) Constitutive Affordance, assessing understanding of inherent object properties through 1,282 question-answer pairs spanning nine sub-disciplines, and 2) Transformative Affordance, probing dynamic and contextual nuances (e.g., misleading, time-dependent, cultural, or individual-specific affordance) with 718 challenging question-answer pairs. We evaluate 17 MLLMs (nine proprietary and eight open-source) and compare them to human performance. Results show that proprietary models generally outperform open-source ones, yet all models perform far below humans, especially in transformative affordance. Furthermore, even top-performing models, such as Gemini-2.0-Pro (18.05% overall exact match accuracy), significantly lag behind human performance (best: 85.34%, worst: 81.25%). These findings highlight critical gaps in environmental understanding of MLLMs and provide a foundation for advancing AI systems toward more robust, context-aware interactions.

1.5 · CV · Aug 29, 2023
A Consumer-tier based Visual-Brain Machine Interface for Augmented Reality Glasses Interactions

Yuying Jiang, Fan Bai, Zicheng Zhang et al.

Objective. Visual-Brain Machine Interface (V-BMI) has provided a novel interaction technique for the Augmented Reality (AR) industry. Several state-of-the-art works have demonstrated its high accuracy and real-time interaction capabilities. However, most studies employ EEG devices that are rigid and difficult to apply in real-life AR glasses application scenarios. Here we develop a consumer-tier V-BMI system specialized for AR glasses interactions. Approach. The developed system consists of wearable hardware that offers fast set-up, reliable recording, and a comfortable wearing experience tailored to AR glasses applications. Complementing this hardware, we have devised a software framework that facilitates real-time interactions within the system while accommodating a modular configuration to enhance scalability. Main results. The developed hardware weighs only 110 g and measures 120x85x23 mm, with 1 Tohm and a peak-to-peak voltage of less than 1.5 uV. A V-BMI-based Angry Birds game and an Internet of Things (IoT) AR application are designed, demonstrating the technology's merits of intuitive experience and efficient interaction. The real-time interaction accuracy is between 85% and 96% on commercial AR glasses (DTI of 2.24 s and ITR of 65 bits/min). Significance. Our study indicates that the developed system can provide an essential hardware-software framework for consumer-based V-BMI AR glasses. We also derive several pivotal design factors for a consumer-grade V-BMI-based AR system: 1) dynamic adaptation of stimulation patterns-classification methods via computer vision algorithms is necessary for AR glasses applications; and 2) algorithmic localization is needed to foster system stability and latency reduction.
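For readers unfamiliar with how an information transfer rate (ITR) relates accuracy and selection time, the sketch below uses the widely cited Wolpaw formula. This is an assumption: the abstract does not say which ITR definition was used, does not give the number of selectable targets, and the reading of the 2.24 s figure as time per selection is also assumed. The value `n_targets=8` is purely hypothetical.

```python
import math


def wolpaw_itr_bits_per_min(n_targets, accuracy, seconds_per_selection):
    """Wolpaw information transfer rate in bits per minute.

    bits/selection = log2(N) + P*log2(P) + (1-P)*log2((1-P)/(N-1))
    """
    if n_targets < 2:
        raise ValueError("need at least two targets")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_targets)
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_targets - 1))
    return bits * 60.0 / seconds_per_selection


# Accuracy range and 2.24 s come from the abstract; n_targets=8 is hypothetical.
for acc in (0.85, 0.96):
    itr = wolpaw_itr_bits_per_min(n_targets=8, accuracy=acc,
                                  seconds_per_selection=2.24)
    print(f"accuracy={acc:.2f}: {itr:.1f} bits/min")
```

Because the target count is a guess, the printed values are not expected to reproduce the reported 65 bits/min; the sketch only shows how accuracy and selection time trade off under this formula.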

7.2 · CL · Nov 18, 2024
MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis

Yingjie Zhou, Zicheng Zhang, Jiezhang Cao et al.

Artificial Intelligence (AI) has demonstrated significant capabilities in various fields, and in areas such as human-computer interaction (HCI), embodied intelligence, and the design and animation of virtual digital humans, both practitioners and users are increasingly concerned with AI's ability to understand and express emotion. Consequently, the question of whether AI can accurately interpret human emotions remains a critical challenge. To date, two primary classes of AI models have been involved in human emotion analysis: generative models and Multimodal Large Language Models (MLLMs). To assess the emotional capabilities of these two classes of models, this study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions, generated by 12 Text-to-Image (T2I) models. Unlike previous works, MEMO-Bench provides a framework for evaluating both T2I models and MLLMs in the context of sentiment analysis. Additionally, a progressive evaluation approach is employed, moving from coarse-grained to fine-grained metrics, to offer a more detailed and comprehensive assessment of the sentiment analysis capabilities of MLLMs. The experimental results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones. Meanwhile, although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy, particularly in fine-grained emotion analysis. The MEMO-Bench will be made publicly available to support further research in this area.

14.7 · CV · Nov 6, 2024 · Code
VQA$^2$: Visual Question Answering for Video Quality Assessment

Ziheng Jia, Zicheng Zhang, Jiaying Qian et al.

The advent and proliferation of large multi-modal models (LMMs) have introduced new paradigms to computer vision, transforming various tasks into a unified visual question answering framework. Video Quality Assessment (VQA), a classic field in low-level visual perception, focused initially on quantitative video quality scoring. However, driven by advances in LMMs, it is now progressing toward more holistic visual quality understanding tasks. Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation. Nevertheless, related work has not been explored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset - the first visual question answering instruction dataset that focuses on video quality assessment. This dataset consists of 3 subsets and covers various video types, containing 157,755 instruction question-answer pairs. Then, leveraging this foundation, we present the VQA2 series models. The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos. We conduct extensive experiments on video quality scoring and understanding tasks, and results demonstrate that the VQA2 series models achieve excellent performance in both tasks. Notably, our final model, the VQA2-Assistant, exceeds the renowned GPT-4o in visual quality understanding tasks while maintaining strong competitiveness in quality scoring tasks. Our work provides a foundation and feasible approach for integrating low-level video quality assessment and understanding with LMMs.

8.3 · CL · Jun 1, 2025
Improve MLLM Benchmark Efficiency through Interview

Farong Wen, Yijin Guo, Junying Wang et al.

The rapid development of Multimodal Large Language Models (MLLMs) has led to a wide range of MLLM applications, and a number of benchmark datasets have sprung up to assess MLLM abilities. However, full-coverage Q&A testing on large-scale data is resource-intensive and time-consuming. To address this issue, we propose the MLLM Interview (MITV) strategy, which aims to quickly obtain MLLM performance metrics by asking fewer questions. First, we construct the interview dataset, which is built on an existing MLLM assessment dataset by adding difficulty labels based on the performance of some typical MLLMs on this dataset. Second, we propose an MLLM Interview strategy, which obtains an initial estimate of the model's performance by quizzing it on a small number of topics and then continuously tries to test the model's limits. Extensive experiments show that the MITV strategy performs well on MLLM benchmark datasets and can assess a model's capability faster, using only a small number of questions and answers.

4.6 · LG · Mar 29, 2024
Graph Neural Aggregation-diffusion with Metastability

Kaiyuan Cui, Xinyan Wang, Zicheng Zhang et al.

Continuous graph neural models based on differential equations have expanded the architecture of graph neural networks (GNNs). Due to the connection between graph diffusion and message passing, diffusion-based models have been widely studied. However, diffusion naturally drives the system towards an equilibrium state, leading to issues like over-smoothing. To this end, we propose GRADE, inspired by graph aggregation-diffusion equations, which incorporates a delicate balance between nonlinear diffusion and aggregation induced by interaction potentials. The node representations obtained through aggregation-diffusion equations exhibit metastability, indicating that features can aggregate into multiple clusters. In addition, the dynamics within these clusters can persist for long time periods, offering the potential to alleviate over-smoothing effects. The nonlinear diffusion in our model generalizes existing diffusion-based models and establishes a connection with classical GNNs. We show that GRADE achieves competitive performance across various benchmarks and alleviates the over-smoothing issue in GNNs, as evidenced by the enhanced Dirichlet energy.
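To illustrate why Dirichlet energy is used as an over-smoothing indicator, the sketch below computes an unnormalized Dirichlet energy and applies one step of plain linear graph diffusion. It is not GRADE: the aggregation term and nonlinear diffusion from the paper are not reproduced, and the energy normalization, step size `tau`, and toy graph are all assumptions chosen for illustration.

```python
def dirichlet_energy(features, edges):
    """Unnormalized Dirichlet energy: sum over edges of ||x_i - x_j||^2."""
    total = 0.0
    for i, j in edges:
        total += sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
    return total


def diffusion_step(features, edges, tau):
    """One explicit Euler step of linear diffusion on an undirected graph.

    x_i <- x_i + tau * sum over neighbours j of (x_j - x_i)
    """
    updated = [list(x) for x in features]
    for i, j in edges:
        for d in range(len(features[i])):
            diff = features[j][d] - features[i][d]
            updated[i][d] += tau * diff
            updated[j][d] -= tau * diff
    return updated


# Toy path graph 0 - 1 - 2 with one feature per node (illustrative only).
edges = [(0, 1), (1, 2)]
x = [[0.0], [1.0], [3.0]]

for step in range(4):
    print(step, dirichlet_energy(x, edges))
    x = diffusion_step(x, edges, tau=0.25)
```

With a small enough step size, pure diffusion pulls neighbouring features together and the energy shrinks toward zero, which is the over-smoothing behaviour the abstract says the aggregation term is meant to counteract.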

14.4 · LG · Jul 9, 2025
The Primacy of Magnitude in Low-Rank Adaptation

Zicheng Zhang, Haoran Li, Yifeng Zhang et al.

Low-Rank Adaptation (LoRA) offers a parameter-efficient paradigm for tuning large models. While recent spectral initialization methods improve convergence and performance over the naive "Noise & Zeros" scheme, their extra computational and storage overhead undermines efficiency. In this paper, we establish update magnitude as the fundamental driver of LoRA performance and propose LoRAM, a magnitude-driven "Basis & Basis" initialization scheme that matches spectral methods without their inefficiencies. Our key contributions are threefold: (i) Magnitude of weight updates determines convergence. We prove low-rank structures intrinsically bound update magnitudes, unifying hyperparameter tuning in learning rate, scaling factor, and initialization as mechanisms to optimize magnitude regulation. (ii) Spectral initialization succeeds via magnitude amplification. We demystify that the presumed knowledge-driven benefit of the spectral component essentially arises from the boost in the weight update magnitude. (iii) A novel and compact initialization strategy, LoRAM, scales deterministic orthogonal bases using pretrained weight magnitudes to simulate spectral gains. Extensive experiments show that LoRAM serves as a strong baseline, retaining the full efficiency of LoRA while matching or outperforming spectral initialization across benchmarks.
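As background for the magnitude argument, the sketch below shows the standard LoRA update, delta_W = (alpha / r) * B A, with the conventional initialization where A is random and B is zero (what the abstract calls "Noise & Zeros"). It does not implement LoRAM, whose construction the abstract only summarizes. The matrix sizes, `std`, `alpha`, and helper names are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def frobenius_norm(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in m for v in row) ** 0.5


def lora_update(B, A, alpha):
    """Low-rank weight update (alpha / r) * B @ A, where r is the rank."""
    r = len(A)
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(B, A)]


def noise_and_zeros_init(d_out, d_in, r, std=0.02, seed=0):
    """Conventional LoRA init: A is Gaussian noise, B is all zeros."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return B, A


B, A = noise_and_zeros_init(d_out=4, d_in=6, r=2)
print("update norm at init:", frobenius_norm(lora_update(B, A, alpha=4.0)))

# Pretend one training step moved a single entry of B (illustrative only).
B[0][0] = 0.1
for alpha in (4.0, 8.0):
    print("alpha", alpha, "->", frobenius_norm(lora_update(B, A, alpha)))
```

Because B starts at zero, the update has zero magnitude at initialization, and afterwards its norm scales linearly with `alpha`. This is one simple view of why the abstract treats learning rate, scaling factor, and initialization as levers on update magnitude.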

4.1 · LG · May 7, 2025
Purity Law for Generalizable Neural TSP Solvers

Wenzhao Liu, Haoran Li, Congying Han et al.

Achieving generalization in neural approaches across different scales and distributions remains a significant challenge for the Traveling Salesman Problem (TSP). A key obstacle is that neural networks often fail to learn robust principles for identifying universal patterns and deriving optimal solutions from diverse instances. In this paper, we first uncover Purity Law (PuLa), a fundamental structural principle for optimal TSP solutions, defining that edge prevalence grows exponentially with the sparsity of surrounding vertices. Statistically validated across diverse instances, PuLa reveals a consistent bias toward local sparsity in global optima. Building on this insight, we propose Purity Policy Optimization (PUPO), a novel training paradigm that explicitly aligns characteristics of neural solutions with PuLa during the solution construction process to enhance generalization. Extensive experiments demonstrate that PUPO can be seamlessly integrated with popular neural solvers, significantly enhancing their generalization performance without incurring additional computational overhead during inference.