Yaqi Li

CV
h-index: 12
Papers: 14
Citations: 170
Novelty: 51%
AI Score: 59

14 Papers

CV · Oct 30, 2023 · Code
Leave No Stone Unturned: Mine Extra Knowledge for Imbalanced Facial Expression Recognition

Yuhang Zhang, Yaqi Li, Lixiong Qin et al.

Facial expression data is characterized by a significant imbalance, with most collected data showing happy or neutral expressions and fewer instances of fear or disgust. This imbalance poses challenges to facial expression recognition (FER) models, hindering their ability to fully understand various human emotional states. Existing FER methods typically report overall accuracy on highly imbalanced test sets but exhibit low performance in terms of the mean accuracy across all expression classes. In this paper, our aim is to address the imbalanced FER problem. Existing methods primarily focus on learning knowledge of minor classes solely from minor-class samples. However, we propose a novel approach to extract extra knowledge related to the minor classes from both major and minor class samples. Our motivation stems from the belief that FER resembles a distribution learning task, wherein a sample may contain information about multiple classes. For instance, a sample from the major class surprise might also contain useful features of the minor class fear. Inspired by that, we propose a novel method that leverages re-balanced attention maps to regularize the model, enabling it to extract transformation invariant information about the minor classes from all training samples. Additionally, we introduce re-balanced smooth labels to regulate the cross-entropy loss, guiding the model to pay more attention to the minor classes by utilizing the extra information regarding the label distribution of the imbalanced training data. Extensive experiments on different datasets and backbones show that the two proposed modules work together to regularize the model and achieve state-of-the-art performance under the imbalanced FER task. Code is available at https://github.com/zyh-uaiaaaa.
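The gap between overall accuracy and mean per-class accuracy described in the abstract can be illustrated with a minimal sketch. The function names and the toy labels below are hypothetical and are not taken from the paper's code; the example only shows why a model that ignores a minor class can still report high overall accuracy.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, so every class counts equally."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Hypothetical imbalanced test set: 8 "happy" samples and 2 "fear" samples.
y_true = ["happy"] * 8 + ["fear"] * 2
# A degenerate classifier that always predicts the major class.
y_pred = ["happy"] * 10

print(overall_accuracy(y_true, y_pred))     # 0.8
print(mean_class_accuracy(y_true, y_pred))  # 0.5
```

On this toy set the always-"happy" classifier scores 0.8 overall but only 0.5 in mean class accuracy, which is the kind of discrepancy the abstract refers to.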

60.3 · CV · Mar 25 · Code
Leave No Stone Unturned: Uncovering Holistic Audio-Visual Intrinsic Coherence for Deepfake Detection

Jielun Peng, Yabin Wang, Yaqi Li et al.

The rapid progress of generative AI has enabled hyper-realistic audio-visual deepfakes, intensifying threats to personal security and social trust. Most existing deepfake detectors rely either on uni-modal artifacts or audio-visual discrepancies, failing to jointly leverage both sources of information. Moreover, detectors that rely on generator-specific artifacts tend to exhibit degraded generalization when confronted with unseen forgeries. We argue that robust and generalizable detection should be grounded in intrinsic audio-visual coherence within and across modalities. Accordingly, we propose HAVIC, a Holistic Audio-Visual Intrinsic Coherence-based deepfake detector. HAVIC first learns priors of modality-specific structural coherence, inter-modal micro- and macro-coherence by pre-training on authentic videos. Based on the learned priors, HAVIC further performs holistic adaptive aggregation to dynamically fuse audio-visual features for deepfake detection. Additionally, we introduce HiFi-AVDF, a high-fidelity audio-visual deepfake dataset featuring both text-to-video and image-to-video forgeries from state-of-the-art commercial generators. Extensive experiments across several benchmarks demonstrate that HAVIC significantly outperforms existing state-of-the-art methods, achieving improvements of 9.39% AP and 9.37% AUC on the most challenging cross-dataset scenario. Our code and dataset are available at https://github.com/tuffy-studio/HAVIC.

CL · Mar 4, 2024 · Code
SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis

Hengxing Cai, Xiaochen Cai, Junhan Chang et al.

Recent breakthroughs in Large Language Models (LLMs) have revolutionized scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios requiring higher-level abilities beyond mere memorization and the handling of multimodal data. In response to this gap, we introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis. It aims to thoroughly assess the efficacy of LLMs by evaluating their capabilities in Memorization (L1), Comprehension (L2), and Analysis & Reasoning (L3). It encompasses a variety of tasks drawn from diverse scientific fields, including biology, chemistry, materials, and medicine. To ensure the reliability of SciAssess, rigorous quality control measures have been implemented, ensuring accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, highlighting their strengths and areas for improvement. We hope this evaluation supports the ongoing development of LLM applications in scientific literature analysis. SciAssess and its resources are available at https://github.com/sci-assess/SciAssess.

BM · May 20, 2024 · Code
Uni-Mol Docking V2: Towards Realistic and Accurate Binding Pose Prediction

Eric Alcaide, Zhifeng Gao, Guolin Ke et al.

In recent years, machine learning (ML) methods have emerged as promising alternatives for molecular docking, offering the potential for high accuracy without incurring prohibitive computational costs. However, recent studies have indicated that these ML models may overfit to quantitative metrics while neglecting the physical constraints inherent in the problem. In this work, we present Uni-Mol Docking V2, which demonstrates a remarkable improvement in performance, accurately predicting the binding poses of 77+% of ligands in the PoseBusters benchmark with an RMSD value of less than 2.0 Å, and 75+% passing all quality checks. This represents a significant increase from the 62% achieved by the previous Uni-Mol Docking model. Notably, our Uni-Mol Docking approach generates chemically accurate predictions, circumventing issues such as chirality inversions and steric clashes that have plagued previous ML models. Furthermore, we observe enhanced performance in terms of high-quality predictions (RMSD values of less than 1.0 Å and 1.5 Å) and physical soundness when Uni-Mol Docking is combined with more physics-based methods like Uni-Dock. Our results represent a significant advancement in the application of artificial intelligence for scientific research, adopting a holistic approach to ligand docking that is well-suited for industrial applications in virtual screening and drug design. The code, data and service for Uni-Mol Docking are publicly available for use and further development in https://github.com/dptech-corp/Uni-Mol.
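The RMSD-threshold success rate quoted in the abstract can be sketched as follows. This is a simplified illustration, not the benchmark's evaluation code: it assumes predicted and reference ligand atoms are listed in the same order and applies no symmetry correction, and the coordinates and RMSD values are made up.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z) atoms."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and of equal length")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(pred, ref)
    )
    return math.sqrt(squared / len(pred))


def success_rate(rmsds, threshold=2.0):
    """Fraction of poses whose RMSD (in angstroms) is strictly below the threshold."""
    return sum(r < threshold for r in rmsds) / len(rmsds)


# Hypothetical two-atom ligand, with every atom shifted by 1 angstrom along z.
ref_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(pred_pose, ref_pose))  # 1.0

# Hypothetical RMSD values for four docked ligands.
print(success_rate([1.0, 2.5, 0.4, 3.0]))  # 0.5
```

Passing `threshold=1.0` or `threshold=1.5` gives the stricter high-quality criteria mentioned in the abstract.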

LG · Feb 2, 2024 · Code
Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains

Jiale Zhao, Wanru Zhuang, Jia Song et al.

In recent years, there has been a surge in the development of 3D structure-based pre-trained protein models, representing a significant advancement over pre-trained protein language models in various downstream tasks. However, most existing structure-based pre-trained models primarily focus on the residue level, i.e., alpha carbon atoms, while ignoring other atoms like side chain atoms. We argue that modeling proteins at both residue and atom levels is important since the side chain atoms can also be crucial for numerous downstream tasks, for example, molecular docking. Nevertheless, we find that naively combining residue and atom information during pre-training typically fails. We identify a key reason as the information leakage caused by the inclusion of atom structure in the input, which renders residue-level pre-training tasks trivial and results in insufficiently expressive residue representations. To address this issue, we introduce a span mask pre-training strategy on 3D protein chains to learn meaningful representations of both residues and atoms. This leads to a simple yet effective approach to learning protein representation suitable for diverse downstream tasks. Extensive experimental results on binding site prediction and function prediction tasks demonstrate that our proposed pre-training approach significantly outperforms other methods. Our code will be made public.

CV · Dec 18, 2025
OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Yuxin Ray Song, Jinzhou Li, Rao Fu et al.

The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

65.1 · CV · Apr 27
Robust Deepfake Detection, NTIRE 2026 Challenge: Report

Benedikt Hopf, Radu Timofte, Chenfan Qu et al.

Robustness is a long-overlooked problem in deepfake detection. However, detection performance is nearly worthless in the real world if it suffers under exposure to even slight image degradation. In addition to weaker degradations that can accidentally occur in the image processing pipeline, there is another risk of malicious deepfakes that specifically introduce degradations, purposefully exploiting the detector's weaknesses in that regard. Here, we present an overview of the NTIRE 2026 Robust Deepfake Detection Challenge, which specifically addresses that problem. Participants were tasked with building a detector that would later be tested on an unknown test-set, which included both common and uncommon degradations of various strengths. With a total number of 337 participants and 57 submissions to the final leaderboard, the first edition of the challenge was well received. To ensure the reliability of the results, participants were given only 24h to complete the test run with no labels provided, limiting the possibility of training on the test data. Furthermore, the top solutions were scored on a private test-set to detect any such overfitting. This report presents the competition setting, dataset preparation, as well as details and performance of methods. Top methods rely on large foundation models, ensembles, and degradation training to combine generality and robustness.

87.4 · IT · Apr 23
Spatiotemporal 2-D Polar Codes over Non-Uniform MIMO Channels: A Reliability-Aware Construction Approach

Yaqi Li, Shuohan Zhang, Xiaohu You et al.

With the increasing demand for ultra-reliable and low-latency communication (URLLC), spatiotemporal two-dimensional (2-D) channel coding has received growing interest. By leveraging the spatial degrees of freedom in massive multiple-input multiple-output (MIMO) systems, it shortens the time-domain blocklength, thereby reducing latency and enhancing reliability. However, existing spatiotemporal coding schemes typically assume uniform reliability across spatial streams. This assumption does not hold in practical MIMO channels, where the underlying propagation environment generally leads to unequal spatial-eigenmode gains and reliabilities, making the conventional Gaussian-approximation-based construction for 2-D polar codes less effective. This paper investigates spatiotemporal 2-D polar coding over non-uniform MIMO channels, where the spatial domain exhibits inherently heterogeneous signal-to-noise ratios (SNRs). We propose a reciprocal channel approximation (RCA)-based reliability-aware 2-D polar coding framework that accurately characterizes such heterogeneous SNRs without relying on log-likelihood-ratio distribution assumptions. Simulation results demonstrate that the proposed RCA-based spatiotemporal 2-D polar coding scheme achieves clear performance gains and strong robustness, confirming its effectiveness in jointly exploiting temporal and spatial polarization for URLLC in practical MIMO systems.

RO · Mar 6
Dual-Agent Multiple-Model Reinforcement Learning for Event-Triggered Human-Robot Co-Adaptation in Decoupled Task Spaces

Yaqi Li, Zhengqi Han, Huifang Liu et al.

This paper presents a shared-control rehabilitation policy for a custom 6-degree-of-freedom (6-DoF) upper-limb robot that decomposes complex reaching tasks into decoupled spatial axes. The patient governs the primary reaching direction using binary commands, while the robot autonomously manages orthogonal corrective motions. Because traditional fixed-frequency control often induces trajectory oscillations due to variable inverse-kinematics execution times, an event-driven progression strategy is proposed. This architecture triggers subsequent control actions only when the end-effector enters an admission sphere centred on the immediate target waypoint, and was validated in a semi-virtual setup linking a physical pressure sensor to a MuJoCo simulation. To optimise human-robot co-adaptation safely and efficiently, this study introduces Dual Agent Multiple Model Reinforcement Learning (DAMMRL). This framework discretises decision characteristics: the human agent selects the admission sphere radius to reflect their inherent speed-accuracy trade-off, while the robot agent dynamically adjusts its 3D Cartesian step magnitudes to complement the user's cognitive state. Trained in simulation and deployed across mixed environments, this event-triggered DAMMRL approach effectively suppresses waypoint chatter, balances spatial precision with temporal efficiency, and significantly improves success rates in object acquisition tasks.
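The event-driven progression rule in the abstract (advance to the next waypoint only once the end-effector is inside the admission sphere) might look roughly like the sketch below. All names and numbers are hypothetical, and the radius is a fixed parameter here, whereas in the paper it is chosen by the human agent; this is not the authors' controller.

```python
import math


def inside_admission_sphere(position, waypoint, radius):
    """True when the end-effector lies within `radius` of the waypoint."""
    return math.dist(position, waypoint) <= radius


def follow_waypoints(positions, waypoints, radius):
    """Return the index of the active waypoint after each observed position.

    The index advances only on the event "end-effector entered the admission
    sphere of the current waypoint", not on a fixed-frequency clock tick.
    """
    active = 0
    history = []
    for pos in positions:
        if active < len(waypoints) and inside_admission_sphere(pos, waypoints[active], radius):
            active += 1  # event fired: move on to the next waypoint
        history.append(active)
    return history


# Hypothetical waypoints along the x-axis and a made-up end-effector trace.
waypoints = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
trace = [(0.5, 0.0, 0.0), (0.95, 0.0, 0.0), (1.5, 0.0, 0.0), (1.95, 0.0, 0.0)]

print(follow_waypoints(trace, waypoints, radius=0.1))  # [0, 1, 1, 2]
```

A larger radius makes the events fire earlier (faster but less precise progression), which is one way to read the speed-accuracy trade-off the abstract attributes to the radius choice.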

CL · Mar 15, 2024
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer

Hengxing Cai, Xiaochen Cai, Shuwen Yang et al.

In scientific research and its application, scientific literature analysis is crucial as it allows researchers to build on the work of others. However, the fast growth of scientific knowledge has led to a massive increase in scholarly articles, making in-depth literature analysis increasingly challenging and time-consuming. The emergence of Large Language Models (LLMs) has offered a new way to address this challenge. Known for their strong abilities in summarizing texts, LLMs are seen as a potential tool to improve the analysis of scientific literature. However, existing LLMs have their own limits. Scientific literature often includes a wide range of multimodal elements, such as tables, charts, and molecules, which are hard for text-focused LLMs to understand and analyze. This issue points to the urgent need for new solutions that can fully understand and analyze multimodal content in scientific literature. To answer this demand, we present Uni-SMART (Universal Science Multimodal Analysis and Research Transformer), an innovative model designed for in-depth understanding of multimodal scientific literature. Through rigorous quantitative evaluation across several domains, Uni-SMART demonstrates superior performance over other text-focused LLMs. Furthermore, our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts. These applications not only highlight Uni-SMART's adaptability but also its potential to revolutionize how we interact with scientific literature.

CV · Aug 25, 2025
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Yaqi Li, Peng Chen, Mingyang Han et al.

Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.

AI · Mar 10, 2025
Human Machine Co-Adaptation Model and Its Convergence Analysis

Steven W. Su, Yaqi Li, Kairui Guo et al.

The key to robot-assisted rehabilitation lies in the design of the human-machine interface, which must accommodate the needs of both patients and machines. Current interface designs primarily focus on machine control algorithms, often requiring patients to spend considerable time adapting. In this paper, we introduce a novel approach based on the Cooperative Adaptive Markov Decision Process (CAMDPs) model to address the fundamental aspects of the interactive learning process, offering theoretical insights and practical guidance. We establish sufficient conditions for the convergence of CAMDPs and ensure the uniqueness of Nash equilibrium points. Leveraging these conditions, we guarantee the system's convergence to a unique Nash equilibrium point. Furthermore, we explore scenarios with multiple Nash equilibrium points, devising strategies to adjust both Value Evaluation and Policy Improvement algorithms to enhance the likelihood of converging to the global minimal Nash equilibrium point. Through numerical experiments, we illustrate the effectiveness of the proposed conditions and algorithms, demonstrating their applicability and robustness in practical settings. The proposed conditions for convergence and the identification of a unique optimal Nash equilibrium contribute to the development of more effective adaptive systems for human users in robot-assisted rehabilitation.

AI · May 3, 2023
Human Machine Co-adaption Interface via Cooperation Markov Decision Process System

Kairui Guo, Adrian Cheng, Yaqi Li et al.

This paper aims to develop a new human-machine interface to improve rehabilitation performance from the perspective of both the user (patient) and the machine (robot) by introducing the co-adaption techniques via model-based reinforcement learning. Previous studies focus more on robot assistance, i.e., to improve the control strategy so as to fulfill the objective of Assist-As-Needed. In this study, we treat the full process of robot-assisted rehabilitation as a co-adaptive or mutual learning process and emphasize the adaptation of the user to the machine. To this end, we proposed a Co-adaptive MDPs (CaMDPs) model to quantify the learning rates based on cooperative multi-agent reinforcement learning (MARL) in the high abstraction layer of the systems. We proposed several approaches to cooperatively adjust the Policy Improvement among the two agents in the framework of Policy Iteration. Based on the proposed co-adaptive MDPs, the simulation study indicates the non-stationary problem can be mitigated using various proposed Policy Improvement approaches.

LG · May 17, 2021
Data Assimilation Predictive GAN (DA-PredGAN): applied to determine the spread of COVID-19

Vinicius L. S. Silva, Claire E. Heaney, Yaqi Li et al.

We propose the novel use of a generative adversarial network (GAN) (i) to make predictions in time (PredGAN) and (ii) to assimilate measurements (DA-PredGAN). In the latter case, we take advantage of the natural adjoint-like properties of generative models and the ability to simulate forwards and backwards in time. GANs have received much attention recently, after achieving excellent results for their generation of realistic-looking images. We wish to explore how this property translates to new applications in computational modelling and to exploit the adjoint-like properties for efficient data assimilation. To predict the spread of COVID-19 in an idealised town, we apply these methods to a compartmental model in epidemiology that is able to model space and time variations. To do this, the GAN is set within a reduced-order model (ROM), which uses a low-dimensional space for the spatial distribution of the simulation states. Then the GAN learns the evolution of the low-dimensional states over time. The results show that the proposed methods can accurately predict the evolution of the high-fidelity numerical simulation, and can efficiently assimilate observed data and determine the corresponding model parameters.