Yuxin Li

CV
h-index: 27
34 papers
504 citations
Novelty: 48%
AI Score: 58

34 Papers

BM · Aug 30, 2024 · Code
Technical Report of HelixFold3 for Biomolecular Structure Prediction

Lihang Liu, Shanzhuo Zhang, Yang Xue et al. · baidu

The AlphaFold series has transformed protein structure prediction with remarkable accuracy, often matching experimental methods. AlphaFold2, AlphaFold-Multimer, and the latest AlphaFold3 represent significant strides in predicting single protein chains, protein complexes, and biomolecular structures. While AlphaFold2 and AlphaFold-Multimer are open-sourced, facilitating rapid and reliable predictions, AlphaFold3 remains partially accessible through a limited online server and has not been open-sourced, restricting further development. To address these challenges, the PaddleHelix team is developing HelixFold3, aiming to replicate AlphaFold3's capabilities. Leveraging insights from previous models and extensive datasets, HelixFold3 achieves accuracy comparable to AlphaFold3 in predicting the structures of the conventional ligands, nucleic acids, and proteins. The initial release of HelixFold3 is available as open source on GitHub for academic research, promising to advance biomolecular research and accelerate discoveries. The latest version will be continuously updated on the HelixFold3 web server, providing both interactive visualization and API access.

CV · Aug 5, 2023 · Code
Where and How: Mitigating Confusion in Neural Radiance Fields from Sparse Inputs

Yanqi Bao, Yuxin Li, Jing Huo et al.

Neural Radiance Fields from Sparse input (NeRF-S) have shown great potential in synthesizing novel views with a limited number of observed viewpoints. However, due to the inherent limitations of sparse inputs and the gap between non-adjacent views, rendering results often suffer from over-fitting and foggy surfaces, a phenomenon we refer to as "CONFUSION" during volume rendering. In this paper, we analyze the root cause of this confusion and attribute it to two fundamental questions: "WHERE" and "HOW". To this end, we present a novel learning framework, WaH-NeRF, which effectively mitigates confusion by tackling the following challenges: (i) "WHERE" to Sample? in NeRF-S -- we introduce a Deformable Sampling strategy and a Weight-based Mutual Information Loss to address sample-position confusion arising from the limited number of viewpoints; and (ii) "HOW" to Predict? in NeRF-S -- we propose a Semi-Supervised NeRF learning Paradigm based on pose perturbation and a Pixel-Patch Correspondence Loss to alleviate prediction confusion caused by the disparity between training and testing viewpoints. By integrating our proposed modules and loss functions, WaH-NeRF outperforms previous methods under the NeRF-S setting. Code is available at https://github.com/bbbbby-99/WaH-NeRF.

CV · Aug 26, 2023 · Code
InsertNeRF: Instilling Generalizability into NeRF with HyperNet Modules

Yanqi Bao, Tianyu Ding, Jing Huo et al.

Generalizing Neural Radiance Fields (NeRF) to new scenes is a significant challenge that existing approaches struggle to address without extensive modifications to the vanilla NeRF framework. We introduce InsertNeRF, a method for INStilling gEneRalizabiliTy into NeRF. By utilizing multiple plug-and-play HyperNet modules, InsertNeRF dynamically tailors NeRF's weights to specific reference scenes, transforming multi-scale sampling-aware features into scene-specific representations. This novel design allows for more accurate and efficient representations of complex appearances and geometries. Experiments show that this method not only achieves superior generalization performance but also provides a flexible pathway for integration with other NeRF-like systems, even in sparse input settings. Code will be available at https://github.com/bbbbby-99/InsertNeRF.

CV · Dec 30, 2022
Delving into Semantic Scale Imbalance

Yanbiao Ma, Licheng Jiao, Fang Liu et al.

Model bias triggered by long-tailed data has been widely studied. However, measures based on the number of samples cannot explain three phenomena simultaneously: (1) Given enough data, the classification performance gain from additional samples is marginal. (2) Classification performance decays precipitously as the number of training samples decreases when there is insufficient data. (3) Models trained on sample-balanced datasets still have different biases for different classes. In this work, we define and quantify the semantic scale of classes, which is used to measure the feature diversity of classes. It is exciting to find experimentally that there is a marginal effect of semantic scale, which perfectly describes the first two phenomena. Further, a quantitative measurement of semantic scale imbalance is proposed, which can accurately reflect model bias on multiple datasets, even on sample-balanced data, revealing a novel perspective for the study of class imbalance. Due to the prevalence of semantic scale imbalance, we propose semantic-scale-balanced learning, including a general loss improvement scheme and a dynamic re-weighting training framework that overcomes the challenge of calculating semantic scales in real-time during iterations. Comprehensive experiments show that dynamic semantic-scale-balanced learning consistently enables the model to achieve superior performance on large-scale long-tailed and non-long-tailed natural and medical datasets, which is a good starting point for mitigating the prevalent but unnoticed model bias.

96.8 · CV · May 30
MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

Shengjun Zhang, Zhang Zhang, Simin Huang et al.

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative metrics and VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.

CV · Jul 24, 2024
3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities

Yanqi Bao, Tianyu Ding, Jing Huo et al.

3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussians through efficient training, and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives, including related tasks, technologies, challenges, and opportunities. The primary objective is to provide newcomers with a rapid understanding of the field and to assist researchers in methodically organizing existing technologies and challenges. Specifically, we delve into the optimization, application, and extension of 3DGS, categorizing them based on their focuses or motivations. Additionally, we summarize and classify nine types of technical modules and corresponding improvements identified in existing works. Based on these analyses, we further examine the common challenges and technologies across various tasks, proposing potential research opportunities.

CL · Jul 26, 2022
Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark

Zhenran Xu, Zifei Shan, Yuxin Li et al.

Modern Entity Linking (EL) systems entrench a popularity bias, yet there is no dataset focusing on tail and emerging entities in languages other than English. We present Hansel, a new benchmark in Chinese that fills the vacancy of non-English few-shot and zero-shot EL challenges. The test set of Hansel is human annotated and reviewed, created with a novel method for collecting zero-shot EL datasets. It covers 10K diverse documents in news, social media posts and other web articles, with Wikidata as its target Knowledge Base. We demonstrate that the existing state-of-the-art EL system performs poorly on Hansel (R@1 of 36.6% on Few-Shot). We then establish a strong baseline that scores an R@1 of 46.2% on Few-Shot and 76.6% on Zero-Shot on our dataset. We also show that our baseline achieves competitive results on the TAC-KBP2015 Chinese Entity Linking task.
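
The R@1 figures quoted above are recall-at-1 scores. As a rough illustration only (this is not the Hansel evaluation script, and the Wikidata-style IDs below are made up), recall@1 can be sketched as the fraction of mentions whose top-ranked candidate equals the gold entity:

```python
def recall_at_1(ranked_candidates, gold_ids):
    """Fraction of mentions whose top-ranked candidate is the gold entity."""
    if not gold_ids or len(ranked_candidates) != len(gold_ids):
        raise ValueError("need one non-empty, aligned candidate list per gold id")
    hits = sum(
        1
        for candidates, gold in zip(ranked_candidates, gold_ids)
        if candidates and candidates[0] == gold
    )
    return hits / len(gold_ids)


# Hypothetical system output: one ranked candidate list per mention.
predictions = [["Q1", "Q7"], ["Q5", "Q2"], ["Q3"], []]
gold = ["Q1", "Q2", "Q3", "Q9"]
score = recall_at_1(predictions, gold)
```

A mention with no candidates counts as a miss here, which is one common convention; other evaluations may exclude such mentions.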

CV · Dec 1, 2025 · Code
Rethinking Intracranial Aneurysm Vessel Segmentation: A Perspective from Computational Fluid Dynamics Applications

Feiyang Xiao, Yichi Zhang, Xigui Li et al.

The precise segmentation of intracranial aneurysms and their parent vessels (IA-Vessel) is a critical step for hemodynamic analyses, which mainly depends on computational fluid dynamics (CFD). However, current segmentation methods predominantly focus on image-based evaluation metrics, often neglecting their practical effectiveness in subsequent CFD applications. To address this deficiency, we present the Intracranial Aneurysm Vessel Segmentation (IAVS) dataset, the first comprehensive, multi-center collection comprising 641 3D MRA images with 587 annotations of aneurysms and IA-Vessels. In addition to image-mask pairs, the IAVS dataset includes detailed hemodynamic analysis outcomes, addressing the limitations of existing datasets that neglect topological integrity and CFD applicability. To facilitate the development and evaluation of clinically relevant techniques, we construct two evaluation benchmarks, covering global localization of aneurysms (Stage I) and fine-grained segmentation of IA-Vessel (Stage II), and develop a simple and effective two-stage framework, which can be used as an out-of-the-box method and strong baseline. For comprehensive evaluation of the applicability of segmentation results, we establish a standardized CFD applicability evaluation system that enables the automated and consistent conversion of segmentation masks into CFD models, offering an applicability-focused assessment of segmentation outcomes. The dataset, code, and model will be publicly available at https://github.com/AbsoluteResonance/IAVS.

55.1 · CV · May 23
Dual Prototype-Conditioned Diffusion Model for Scalable Multi-Class Unsupervised Anomaly Detection in Large Category Spaces

Yaoxuan Feng, Yuxin Li, Weijiang Lv et al.

Multi-class anomaly detection aims to build unified models across diverse product categories. However, as the number of categories grows, its performance often degrades due to increasingly complex and heterogeneous normal distributions. To address this challenge, we propose DPDiff-AD, a Dual Prototype-conditioned Diffusion model for large-scale multi-class Anomaly Detection. DPDiff-AD models heterogeneous normal distributions through complementary local and global prototypes. Local prototypes capture representative fine-grained structural patterns via nearest-prototype aggregation, while global prototypes regulate holistic feature geometry through optimal transport regularization. Together, these dual-scale representations define a structured normality space. This space is refined through diffusion-based reconstruction conditioned on both local and global prototypes via prototype-aware attention. By jointly leveraging dual prototypes during generation, DPDiff-AD achieves precise normality modeling, preserves structured separability as category cardinality grows, and enables scalable anomaly discrimination. Extensive experiments across five benchmarks demonstrate the effectiveness and scalability of DPDiff-AD. On the 160-category large-scale dataset, it improves image- and pixel-level AUROC by 5.3 and 2.9 points over the previous state-of-the-art method Dinomaly+, while maintaining stable performance as category cardinality increases.

96.2 · AS · Mar 18
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

Donghang Wu, Tianyu Zhang, Yuxin Li et al.

During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

CL · Jul 22, 2025 · Code
Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu et al.

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

IV · Jul 10, 2024
Multi-modal MRI Translation via Evidential Regression and Distribution Calibration

Jiyao Liu, Shangqi Gao, Yuxin Li et al.

Multi-modal Magnetic Resonance Imaging (MRI) translation leverages information from source MRI sequences to generate target modalities, enabling comprehensive diagnosis while overcoming the limitations of acquiring all sequences. While existing deep-learning-based multi-modal MRI translation methods have shown promising potential, they still face two key challenges: 1) lack of reliable uncertainty quantification for synthesized images, and 2) limited robustness when deployed across different medical centers. To address these challenges, we propose a novel framework that reformulates multi-modal MRI translation as a multi-modal evidential regression problem with distribution calibration. Our approach incorporates two key components: 1) an evidential regression module that estimates uncertainties from different source modalities and an explicit distribution mixture strategy for transparent multi-modal fusion, and 2) a distribution calibration mechanism that adapts to source-target mapping shifts to ensure consistent performance across different medical centers. Extensive experiments on three datasets from the BraTS2023 challenge demonstrate that our framework achieves superior performance and robustness across domains.

CL · Jan 1
DepFlow: Disentangled Speech Generation to Mitigate Semantic Bias in Depression Detection

Yuxin Li, Xiangyu Zhang, Yifei Li et al.

Speech is a scalable and non-invasive biomarker for early mental health screening. However, widely used depression datasets like DAIC-WOZ exhibit strong coupling between linguistic sentiment and diagnostic labels, encouraging models to learn semantic shortcuts. As a result, model robustness may be compromised in real-world scenarios, such as Camouflaged Depression, where individuals maintain socially positive or neutral language despite underlying depressive states. To mitigate this semantic bias, we propose DepFlow, a three-stage depression-conditioned text-to-speech framework. First, a Depression Acoustic Encoder learns speaker- and content-invariant depression embeddings through adversarial training, achieving effective disentanglement while preserving depression discriminability (ROC-AUC: 0.693). Second, a flow-matching TTS model with FiLM modulation injects these embeddings into synthesis, enabling control over depressive severity while preserving content and speaker identity. Third, a prototype-based severity mapping mechanism provides smooth and interpretable manipulation across the depression continuum. Using DepFlow, we construct a Camouflage Depression-oriented Augmentation (CDoA) dataset that pairs depressed acoustic patterns with positive/neutral content from a sentiment-stratified text bank, creating acoustic-semantic mismatches underrepresented in natural data. Evaluated across three depression detection architectures, CDoA improves macro-F1 by 9%, 12%, and 5%, respectively, consistently outperforming conventional augmentation strategies in depression detection. Beyond enhancing robustness, DepFlow provides a controllable synthesis platform for conversational systems and simulation-based evaluation, where real clinical data remains limited by ethical and coverage constraints.
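
For readers unfamiliar with FiLM modulation, the sketch below shows the generic feature-wise linear modulation operation: a conditioning vector is mapped to a per-feature scale (gamma) and shift (beta) that are applied to a hidden vector. It is a toy illustration in plain Python; the layer shapes, weights, and function names are assumptions and do not reflect DepFlow's actual architecture.

```python
def linear(weights, bias, x):
    """Dense layer on plain lists: returns W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(hidden, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each hidden feature using the conditioning vector."""
    gamma = linear(w_gamma, b_gamma, cond)  # per-feature scale
    beta = linear(w_beta, b_beta, cond)     # per-feature shift
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


# Hypothetical values: a 2-d hidden state and a 1-d conditioning embedding.
hidden = [1.0, 2.0]
cond = [0.5]
out = film(
    hidden,
    cond,
    w_gamma=[[2.0], [0.0]], b_gamma=[1.0, 1.0],
    w_beta=[[0.0], [4.0]], b_beta=[0.0, -1.0],
)
```

Here `cond` stands in for a conditioning embedding; changing it changes gamma and beta, and therefore how the same hidden features are scaled and shifted.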

CV · Nov 1, 2025
CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World

Yating Yu, Congqi Cao, Zhaoying Wang et al.

How far are deep models from real-world video anomaly understanding (VAU)? Current works typically emphasize detecting unexpected occurrences that deviate from normal patterns or comprehending anomalous events with interpretable descriptions. However, they exhibit only a superficial comprehension of real-world anomalies, with limited breadth in the complex principles and subtle context that distinguish anomalies from normalities, e.g., climbing cliffs with safety gear vs. without it. To this end, we introduce CueBench, the first-of-its-kind Benchmark devoted to Context-aware video anomalies within a Unified Evaluation framework. We comprehensively establish an event-centric hierarchical taxonomy that anchors two core event types: 14 conditional and 18 absolute anomaly events, defined by their refined semantics from diverse contexts across 174 scenes and 198 attributes. Based on this, we propose to unify and benchmark context-aware VAU with various challenging tasks across recognition, temporal grounding, detection, and anticipation. This also serves as a rigorous and fair probing evaluation suite for generative-discriminative as well as generalized-specialized vision-language models (VLMs). To address the challenges underlying CueBench, we further develop Cue-R1 based on R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in a unified generative manner. Extensive results on CueBench reveal that existing VLMs are still far from satisfactory real-world anomaly understanding, while our Cue-R1 surpasses these state-of-the-art approaches by over 24% on average.

IR · Feb 13, 2025 · Code
Bridging Jensen Gap for Max-Min Group Fairness Optimization in Recommendation

Chen Xu, Yuxin Li, Wenjie Wang et al.

Group max-min fairness (MMF) is commonly used in fairness-aware recommender systems (RS) as an optimization objective, as it aims to protect marginalized item groups and ensures a fair competition platform. However, our theoretical analysis indicates that integrating the MMF constraint violates the assumption of sample independence during optimization, causing the loss function to deviate from linear additivity. This nonlinearity introduces a Jensen gap between the model's convergence point and the optimal point if mini-batch sampling is applied. Both theoretical and empirical studies show that as the mini-batch size decreases and the group size increases, the Jensen gap will widen accordingly. Some methods using heuristic re-weighting or debiasing strategies have the potential to bridge the Jensen gap. However, they either lack theoretical guarantees or suffer from heavy computational costs. To overcome these limitations, we first theoretically demonstrate that the MMF-constrained objective can be essentially reformulated as a group-weighted optimization objective. Then we present an efficient and effective algorithm named FairDual, which utilizes a dual optimization technique to minimize the Jensen gap. Our theoretical analysis demonstrates that FairDual can achieve a sub-linear convergence rate to the globally optimal solution and that the Jensen gap can be well bounded under a mini-batch sampling strategy with random shuffle. Extensive experiments conducted using six large-scale RS backbone models on three publicly available datasets demonstrate that FairDual outperforms all baselines in terms of both accuracy and fairness. Our data and code are shared at https://github.com/XuChen0427/FairDual.
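
The Jensen gap described above can be seen in a toy setting. The sketch below is not the paper's formulation or the FairDual algorithm: it assumes a made-up dataset in which each sample contributes utility to exactly one of two groups, uses the minimum of the group means as the max-min objective, and compares the full-data value with its average over all mini-batches of a given size. Because the minimum is concave, the mini-batch average can only be lower, and in this example the gap grows as the batch shrinks.

```python
from itertools import combinations


def max_min_objective(batch):
    """Worst-off group's mean utility within a batch of per-group utility tuples."""
    n = len(batch)
    num_groups = len(batch[0])
    return min(sum(sample[g] for sample in batch) / n for g in range(num_groups))


def jensen_gap(data, batch_size):
    """Full-data objective minus its expectation over all size-b mini-batches."""
    full = max_min_objective(data)
    batches = list(combinations(data, batch_size))
    expected = sum(max_min_objective(b) for b in batches) / len(batches)
    return full - expected


# Hypothetical data: (utility for group A, utility for group B) per sample.
data = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
gaps = {b: jensen_gap(data, b) for b in (1, 2, 4)}
```

With this data the gap is 0.5 at batch size 1, about 0.167 at batch size 2, and 0 when the batch is the full dataset.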

AI · Oct 30, 2025
Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4

Yuxin Li, Minghao Liu, Ruida Wang et al.

We present Lean4PHYS, a comprehensive reasoning framework for college-level physics problems in Lean4. Lean4PHYS includes LeanPhysBench, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce PhysLib, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in Lean4PHYS, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our PhysLib can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our LeanPhysBench and the effectiveness of PhysLib. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.

LG · Aug 24, 2024
Disentangled Generative Graph Representation Learning

Xinyue Hu, Zhibin Duan, Xinyang Liu et al.

Recently, generative graph models have shown promising results in learning graph representations through self-supervised methods. However, most existing generative graph representation learning (GRL) approaches rely on random masking across the entire graph, which overlooks the entanglement of learned representations. This oversight results in non-robustness and a lack of explainability. Furthermore, disentangling the learned representations remains a significant challenge and has not been sufficiently explored in GRL research. Based on these insights, this paper introduces DiGGR (Disentangled Generative Graph Representation Learning), a self-supervised learning framework. DiGGR aims to learn latent disentangled factors and utilizes them to guide graph mask modeling, thereby enhancing the disentanglement of learned representations and enabling end-to-end joint learning. Extensive experiments on 11 public datasets for two different graph learning tasks demonstrate that DiGGR consistently outperforms many previous self-supervised methods, verifying the effectiveness of the proposed approach.

89.3 · IT · Mar 10
Do Ambient Backscatter Communication Receivers Require Low-Noise Amplifiers?

Xinyi Wang, Yuxin Li, Yinghui Ye et al.

In ambient backscatter communication (AmBC), strong direct interference from the ambient source poses a major challenge to reliable symbol detection. Although previous studies have shown that employing a low-noise amplifier (LNA) in conventional point-to-point communication improves symbol detection performance at low-to-moderate transmission power, it remains unclear whether this improvement also holds for AmBC. To answer this question, we investigate the symbol detection performance of an AmBC receiver that is equipped with an LNA and adopts energy detection (ED) to recover the tag's information. In particular, we first propose a new AmBC symbol detection framework that incorporates LNA parameters. On this basis, we derive the bit error rate (BER) of the ED and employ the deflection coefficient (DC) to demonstrate that the detection performance can be enhanced by the LNA at low-to-moderate ambient source transmission power. Then, we derive the near-optimal detection threshold to minimize the BER and propose a method to estimate the required parameters for this threshold by leveraging the tag's pilot symbols.
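
As background for the energy detection (ED) step mentioned above, here is a minimal generic sketch: the receiver averages the energy of the samples received during one tag symbol and compares it with a threshold. The sample values and threshold are made up, the mapping "higher energy means bit 1" is an assumption (in practice it depends on the channel), and the LNA model and the near-optimal threshold derived in the paper are not reproduced here.

```python
def mean_energy(samples):
    """Average received energy |y[n]|^2 over one tag-symbol period."""
    return sum(abs(y) ** 2 for y in samples) / len(samples)


def detect_bit(samples, threshold):
    """Energy detection: decide 1 if the mean energy exceeds the threshold, else 0."""
    return 1 if mean_energy(samples) > threshold else 0


# Hypothetical complex baseband samples for two tag states.
non_reflecting = [1 + 0j, -1 + 0j, 1j, -1j]      # mean energy 1.0
reflecting = [1.5 + 0j, -1.5 + 0j, 1.5j, -1.5j]  # mean energy 2.25
threshold = 1.6                                  # illustrative value only

bits = [detect_bit(non_reflecting, threshold), detect_bit(reflecting, threshold)]
```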

CV · Dec 25, 2023
IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models

Zhihao Chen, Bin Hu, Chuang Niu et al.

Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various tasks and attracted increasing interest as a natural language interface across many domains. Recently, large vision-language models (VLMs) like BLIP-2 and GPT-4, which learn rich vision-language correlation from image-text pairs, have been intensively investigated. However, despite these developments, the application of LLMs and VLMs in image quality assessment (IQA), particularly in medical imaging, remains to be explored, which is valuable for objective performance evaluation and potential supplement or even replacement of radiologists' opinions. To this end, this paper introduces IQAGPT, an innovative image quality assessment system integrating an image quality captioning VLM with ChatGPT for generating quality scores and textual reports. First, we build a CT-IQA dataset for training and evaluation, comprising 1,000 CT slices with diverse quality levels professionally annotated. To better leverage the capabilities of LLMs, we convert annotated quality scores into semantically rich text descriptions using a prompt template. Second, we fine-tune the image quality captioning VLM on the CT-IQA dataset to generate quality descriptions. The captioning model fuses the image and text features through cross-modal attention. Third, based on the quality descriptions, users can talk with ChatGPT to rate image quality scores or produce a radiological quality report. Our preliminary results demonstrate the feasibility of assessing image quality with large models. Remarkably, our IQAGPT outperforms GPT-4 and CLIP-IQA, as well as the multi-task classification and regression models that solely rely on images.

AI · Dec 4, 2023
Local-Global History-aware Contrastive Learning for Temporal Knowledge Graph Reasoning

Wei Chen, Huaiyu Wan, Yuting Wu et al. · pku

Temporal knowledge graphs (TKGs) have been identified as a promising approach to represent the dynamics of facts along the timeline. The extrapolation of TKG is to predict unknowable facts happening in the future, holding significant practical value across diverse fields. Most extrapolation studies in TKGs focus on modeling global historical fact repeating and cyclic patterns, as well as local historical adjacent fact evolution patterns, showing promising performance in predicting future unknown facts. Yet, existing methods still face two major challenges: (1) They usually neglect the importance of historical information in KG snapshots related to the queries when encoding the local and global historical information; (2) They exhibit weak anti-noise capabilities, which hinders their performance when the inputs are contaminated with noise. To this end, we propose a novel Local-global history-aware Contrastive Learning model (LogCL) for TKG reasoning, which adopts contrastive learning to better guide the fusion of local and global historical information and enhance the ability to resist interference. Specifically, for the first challenge, LogCL proposes an entity-aware attention mechanism applied to the local and global historical facts encoder, which captures the key historical information related to queries. For the latter issue, LogCL designs four historical query contrast patterns, effectively improving the robustness of the model. The experimental results on four benchmark datasets demonstrate that LogCL delivers better and more robust performance than the state-of-the-art baselines.

LG · Dec 25, 2023
Federated learning-outcome prediction with multi-layer privacy protection

Yupei Zhang, Yuxin Li, Yifei Wang et al.

Learning-outcome prediction (LOP) is a long-standing and critical problem in educational routes. Many studies have contributed to developing effective models, but they often suffer from data shortage and poor generalization across institutions due to privacy-protection issues. To this end, this study proposes a distributed grade prediction model, dubbed FecMap, by exploiting the federated learning (FL) framework that preserves the private data of local clients and communicates with others through a global generalized model. FecMap considers local subspace learning (LSL), which explicitly learns the local features against the global features, and multi-layer privacy protection (MPP), which hierarchically protects the private features, including model-shareable features and not-allowably shared features, to achieve client-specific classifiers of high performance on LOP per institution. FecMap is then trained iteratively, with all datasets distributed on clients, by training a local neural network composed of a global part, a local part, and a classification head in clients and averaging the global parts from clients on the server. To evaluate the FecMap model, we collected three higher-educational datasets of student academic records from engineering majors. Experimental results show that FecMap benefits from the proposed LSL and MPP and achieves steady performance on the task of LOP, compared with the state-of-the-art models. This study makes a fresh attempt at the use of federated learning in the learning-analytical task, potentially paving the way to facilitating personalized education with privacy protection.
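
The server-side step described above, averaging only the global parts while each client keeps its local part and classification head, can be sketched as follows. This is a simplified illustration under stated assumptions: parameters are flat lists of floats, the average is unweighted, and the dictionary keys and function names are hypothetical rather than taken from FecMap.

```python
def average_global_parts(clients):
    """Unweighted element-wise mean of the clients' global-part parameters."""
    n = len(clients)
    dim = len(clients[0]["global"])
    return [sum(c["global"][i] for c in clients) / n for i in range(dim)]


def federated_round(clients):
    """One communication round: average global parts and send them back.

    The "local" and "head" entries never leave the client and stay untouched.
    """
    averaged = average_global_parts(clients)
    for client in clients:
        client["global"] = list(averaged)
    return clients


# Hypothetical parameters for two institutions after local training.
clients = [
    {"global": [1.0, 2.0], "local": [0.1], "head": [5.0]},
    {"global": [3.0, 4.0], "local": [0.9], "head": [7.0]},
]
federated_round(clients)
```

After the round both clients share the same global part, while their local parts and heads remain client-specific.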

CL · Mar 5, 2025
MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving

Ruida Wang, Rui Pan, Yuxin Li et al.

Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proofs or perform tree search, but they fail to balance these tasks. We propose MA-LoT: Model-CollAboration Lean-based Long Chain-of-Thought, a comprehensive framework for Lean4 theorem proving to solve this issue. It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction using the model-collaboration method. We achieve this by structured interaction of the LLM and Lean4 verifier in Long CoT. To implement the framework, we propose the novel LoT-Transfer Learning training-inference pipeline, which gives LLMs the Long CoT thinking capability without special data annotation. Extensive experiments show that our framework achieves a 61.07% accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33.61%), single-model tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for a more insightful generation in a broader perspective.

AIMay 29, 2025
Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability

Ruida Wang, Yuxin Li, Yi R. Fung et al.

Enhancing the mathematical reasoning capabilities of LLMs has garnered significant attention in both the mathematical and computer science communities. Recent works have made substantial progress in both Natural Language (NL) reasoning and Formal Language (FL) reasoning by leveraging the potential of pure Reinforcement Learning (RL) methods on base models. However, RL approaches struggle to impart new capabilities not present in the base model, highlighting the need to integrate more knowledge like FL into NL math reasoning effectively. Yet, this integration is challenging due to inherent disparities in problem structure and reasoning format between NL and FL. To address these challenges, we introduce **NL-FL HybridReasoning (NFL-HR)**, an end-to-end framework designed to incorporate the FL expert into NL math problem-solving. To bridge the NL and FL input format gap, we propose the NL-FL Problem Alignment method, which reformulates the Question-Answering (QA) problems in NL as existence theorems in FL. Subsequently, the Mixed Problem Input technique we provide enables the FL reasoner to handle both QA and existence problems concurrently. Lastly, we mitigate the NL and FL output format gap in reasoning through an LLM-based Answer Extraction mechanism. Comprehensive experiments demonstrate that the NFL-HR framework achieves **89.80%** and **84.34%** accuracy rates on the MATH-500 and the AMC benchmarks, surpassing the NL baseline by **4.60%** and **4.82%**, respectively. Notably, some problems resolved by our framework remain unsolved by the NL baseline model even under a larger number of trials.
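As a hedged illustration of the NL-FL Problem Alignment idea, a QA question can be restated as an existence theorem whose witness is the answer to extract. The toy question and the theorem name below are hypothetical and are not taken from the paper.

```lean
-- NL question (hypothetical): "Which natural number x satisfies 2 * x + 3 = 7?"
-- FL restatement: an existence theorem; the witness 2 is the answer.
theorem exists_solution : ∃ x : Nat, 2 * x + 3 = 7 :=
  ⟨2, rfl⟩
```

In this form a formal prover only succeeds if it supplies a concrete witness, which is what an answer-extraction step would then read off.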

CVJan 5, 2024
Complementary Information Mutual Learning for Multimodality Medical Image Segmentation

Chuyun Shen, Wenhao Li, Haoqing Chen et al.

Radiologists must utilize multiple modal images for tumor segmentation and diagnosis due to the limitations of medical imaging and the diversity of tumor signals. This leads to the development of multimodal learning in segmentation. However, the redundancy among modalities creates challenges for existing subtraction-based joint learning methods, such as misjudging the importance of modalities, ignoring specific modal information, and increasing cognitive load. These thorny issues ultimately decrease segmentation accuracy and increase the risk of overfitting. This paper presents the complementary information mutual learning (CIML) framework, which can mathematically model and address the negative impact of inter-modal redundant information. CIML adopts the idea of addition and removes inter-modal redundant information through inductive bias-driven task decomposition and message passing-based redundancy filtering. CIML first decomposes the multimodal segmentation task into multiple subtasks based on expert prior knowledge, minimizing the information dependence between modalities. Furthermore, CIML introduces a scheme in which each modality can extract information from other modalities additively through message passing. To achieve non-redundancy of extracted information, the redundant filtering is transformed into complementary information learning inspired by the variational information bottleneck. The complementary information learning procedure can be efficiently solved by variational inference and cross-modal spatial attention. Numerical results from the verification task and standard benchmarks indicate that CIML efficiently removes redundant information between modalities, outperforming SOTA methods regarding validation accuracy and segmentation effect.

LGJul 24, 2025
Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator

YuXin Li, Felix Dangel, Derek Tam et al.

The diagonal of a model's Fisher Information Matrix (the "Fisher diagonal") has frequently been used as a way to measure parameter sensitivity. Typically, the Fisher diagonal is estimated via squared sampled gradients of the model's likelihood with respect to its parameters, averaged over a few hundred or thousand examples -- a process which incurs nontrivial computational costs. At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training. This paper therefore explores whether an approximation of the Fisher diagonal can be obtained "for free" by recycling the squared gradient accumulator that has already been computed over the course of training. Through a comprehensive set of experiments covering five applications of the Fisher diagonal, we demonstrate that the "Squisher" (SQUared gradient accumulator as an approximation of the FISHER) consistently performs similarly to the Fisher diagonal while outperforming baseline methods. Additionally, we clarify the exact differences between the Squisher and the Fisher diagonal and provide empirical quantification of their respective impact.
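The two quantities being compared here can be sketched side by side. This is a minimal illustration assuming plain Python lists for gradients and an Adam-style update without bias correction; the function names are hypothetical, and any rescaling the paper applies to the accumulator is not reproduced.

```python
from typing import List


def fisher_diagonal(per_example_grads: List[List[float]]) -> List[float]:
    """Conventional estimate: mean of squared per-example gradients."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def update_accumulator(v: List[float], grad: List[float], beta2: float = 0.999) -> List[float]:
    """Adam-style exponential moving average of the squared mini-batch gradient."""
    return [beta2 * vi + (1.0 - beta2) * gi ** 2 for vi, gi in zip(v, grad)]


# Toy run: the accumulator is a by-product of the optimizer steps,
# so reading it out afterwards costs no extra gradient computations.
v = [0.0, 0.0]
for grad in ([1.0, 0.0], [1.0, 2.0], [1.0, -2.0]):
    v = update_accumulator(v, grad, beta2=0.5)

fisher = fisher_diagonal([[1.0, 0.0], [1.0, 2.0]])
```

One visible difference in this sketch is that the accumulator squares gradients computed during training steps and weights recent steps more heavily, whereas the conventional estimate averages squared per-example gradients at a fixed set of parameters.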

50.3HCMar 13
Virtual reality for large-scale laboratories based on colorized point clouds: design and pedagogical impact

Lei Fan, Yuxin Li

Effective laboratory training is essential in engineering education, yet conventional on-site instruction is often constrained by time, accessibility, and safety considerations. To address these challenges, this study presents the design, implementation, and evaluation of a web-based virtual reality (WebVR) representation of a large-scale engineering laboratory constructed from massive colorized point cloud data. This study proposes a novel WebVR framework that integrates Unity and Potree for high-fidelity point-cloud visualization combined with advanced interactive capabilities in a browser-based virtual laboratory. It supports immersive first-person exploration, guided navigation, interactive hotspots conveying equipment and safety information, as well as emergency evacuation simulations. The usability, educational effectiveness, and overall acceptance of the virtual laboratory were evaluated through an anonymous questionnaire administered to students and laboratory staff. The results indicate overwhelmingly positive feedback, with all participants rating the system as "good" or "excellent" across all evaluation dimensions. Participants particularly emphasized the benefits of immersive exploration and self-directed learning. In addition, qualitative feedback was systematically analyzed to inform future enhancements of the virtual environment. Overall, the findings demonstrate that the WebVR-based virtual laboratory can effectively complement conventional on-site laboratory instruction, offering a scalable, accessible, and low-risk platform that enhances learning experiences in engineering education.

AINov 19, 2025
Step-Audio-R1 Technical Report

Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang et al.

Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.

CLNov 26, 2025
A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics

Yuxin Li, Lorraine Xu, Meng Fan Wang

We propose a novel study on authorship attribution for Chinese lyrics, a domain where clean, public datasets are sorely lacking. Our contributions are twofold: (1) we create a new, balanced dataset of Chinese lyrics spanning multiple genres, and (2) we develop and fine-tune a domain-specific model, comparing its performance against zero-shot inference using the DeepSeek LLM. We test two central hypotheses. First, we hypothesize that a fine-tuned model will outperform a zero-shot LLM baseline. Second, we hypothesize that performance is genre-dependent. Our experiments strongly confirm Hypothesis 2: structured genres (e.g. Folklore & Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g. Love & Romance). Hypothesis 1 receives only partial support: fine-tuning improves robustness and generalization in Test1 (real-world data and difficult genres), but offers limited or ambiguous gains in Test2, a smaller, synthetically-augmented set. We show that the design limitations of Test2 (e.g., label imbalance, shallow lexical differences, and narrow genre sampling) can obscure the true effectiveness of fine-tuning. Our work establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. We conclude with recommendations: enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and investigate domain-adaptive pretraining as a pathway for improved attribution performance.

CLOct 5, 2025
Hierarchical Self-Supervised Representation Learning for Depression Detection from Speech

Yuxin Li, Eng Siong Chng, Cuntai Guan

Speech-based depression detection (SDD) is a promising, non-invasive alternative to traditional clinical assessments. However, it remains limited by the difficulty of extracting meaningful features and capturing sparse, heterogeneous depressive cues over time. Pretrained self-supervised learning (SSL) models such as WavLM provide rich, multi-layer speech representations, yet most existing SDD methods rely only on the final layer or search for a single best-performing one. These approaches often overfit to specific datasets and fail to leverage the full hierarchical structure needed to detect subtle and persistent depression signals. To address this challenge, we propose HAREN-CTC, a novel architecture that integrates multi-layer SSL features using cross-attention within a multitask learning framework, combined with Connectionist Temporal Classification loss to handle sparse temporal supervision. HAREN-CTC comprises two key modules: a Hierarchical Adaptive Clustering module that reorganizes SSL features into complementary embeddings, and a Cross-Modal Fusion module that models inter-layer dependencies through cross-attention. The CTC objective enables alignment-aware training, allowing the model to track irregular temporal patterns of depressive speech cues. We evaluate HAREN-CTC under both an upper-bound setting with standard data splits and a generalization setting using five-fold cross-validation. The model achieves state-of-the-art macro F1-scores of 0.81 on DAIC-WOZ and 0.82 on MODMA, outperforming prior methods across both evaluation scenarios.

SIFeb 1, 2025
Multi-Order Hyperbolic Graph Convolution and Aggregated Attention for Social Event Detection

Yao Liu, Zhilan Liu, Tien Ping Tan et al.

Social event detection (SED) is a task focused on identifying specific real-world events and has broad applications across various domains. It is integral to many mobile applications with social features, including major platforms like Twitter, Weibo, and Facebook. By enabling the analysis of social events, SED provides valuable insights for businesses to understand consumer preferences and supports public services in handling emergencies and disaster management. Due to the hierarchical structure of event detection data, traditional approaches in Euclidean space often fall short in capturing the complexity of such relationships. While existing methods in both Euclidean and hyperbolic spaces have shown promising results, they tend to overlook multi-order relationships between events. To address these limitations, this paper introduces a novel framework, Multi-Order Hyperbolic Graph Convolution with Aggregated Attention (MOHGCAA), designed to enhance the performance of SED. Experimental results demonstrate significant improvements under both supervised and unsupervised settings. To further validate the effectiveness and robustness of the proposed framework, we conducted extensive evaluations across multiple datasets, confirming its superiority in tackling common challenges in social event detection.

CVNov 15, 2021
Interactive Medical Image Segmentation with Self-Adaptive Confidence Calibration

Wenhao Li, Qisen Xu, Chuyun Shen et al.

Medical image segmentation is one of the fundamental problems for artificial intelligence-based clinical decision systems. Current automatic medical image segmentation methods often fail to meet clinical requirements. As such, a series of interactive segmentation algorithms have been proposed to utilize expert correction information. However, existing methods suffer from failures to refine the segmentation after long-term interactions and from the cost of expert annotation, both of which hinder clinical applications. This paper proposes an interactive segmentation framework, called interactive MEdical segmentation with self-adaptive Confidence CAlibration (MECCA), by introducing corrective action evaluation, which combines action-based confidence learning and multi-agent reinforcement learning (MARL). The evaluation is established through a novel action-based confidence network, and the corrective actions are obtained from MARL. Based on the confidence information, a self-adaptive reward function is designed to provide more detailed feedback, and a simulated label generation mechanism is proposed on unsupervised data to reduce over-reliance on labeled data. Experimental results on various medical image datasets demonstrate the strong performance of the proposed algorithm.

IVOct 26, 2021
Deep Learning-based Segmentation of Cerebral Aneurysms in 3D TOF-MRA using Coarse-to-Fine Framework

Meng Chen, Chen Geng, Dongdong Wang et al.

BACKGROUND AND PURPOSE: Cerebral aneurysm is one of the most common cerebrovascular diseases, and SAH caused by its rupture has a very high mortality and disability rate. Existing automatic segmentation methods based on DLMs with the TOF-MRA modality cannot segment edge voxels well, so our goal is to achieve more accurate segmentation of cerebral aneurysms in 3D TOF-MRA with the help of DLMs. MATERIALS AND METHODS: In this research, we proposed an automatic segmentation framework for cerebral aneurysms in 3D TOF-MRA. The framework was composed of two segmentation networks ranging from coarse to fine. The coarse segmentation network, namely DeepMedic, completed the coarse segmentation of cerebral aneurysms, and the processed results were fed into the fine segmentation network, namely a dual-channel SE_3D U-Net trained with a weighted loss function, for fine segmentation. Images from ADAM2020 (n=113) were used for training and validation, and images from another center (n=45) were used for testing. The segmentation metrics we used include DSC, HD, and VS. RESULTS: The trained cerebral aneurysm segmentation model achieved a DSC of 0.75, an HD of 1.52, and a VS of 0.91 on the validation cohort. On the totally independent test cohort, our method achieved the highest DSC of 0.12, the lowest HD of 11.61, and the highest VS of 0.16 in comparison with state-of-the-art segmentation networks. CONCLUSIONS: The coarse-to-fine framework, which is composed of DeepMedic and a dual-channel SE_3D U-Net, can segment cerebral aneurysms in 3D TOF-MRA with superior accuracy.
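For readers unfamiliar with the first metric, the Dice similarity coefficient (DSC) of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). A minimal sketch follows, assuming flattened 0/1 voxel lists; returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for flattened binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_coefficient(pred, target)  # 2 overlapping voxels, 3 + 3 foreground voxels
```

Because small structures such as aneurysms contain few foreground voxels, a handful of misclassified edge voxels can move this score substantially.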

IVOct 26, 2021
An Automatic Detection Method Of Cerebral Aneurysms In Time-Of-Flight Magnetic Resonance Angiography Images Based On Attention 3D U-Net

Chen Geng, Meng Chen, Ruoyu Di et al.

Background: Subarachnoid hemorrhage caused by ruptured cerebral aneurysm often leads to fatal consequences. However, if the aneurysm can be found and treated during asymptomatic periods, the probability of rupture can be greatly reduced. At present, time-of-flight magnetic resonance angiography is one of the most commonly used non-invasive screening techniques for cerebral aneurysm, and the application of deep learning technology in aneurysm detection can effectively improve the screening effect of aneurysm. Existing studies have found that three-dimensional features play an important role in aneurysm detection, but they require a large amount of training data and have problems such as a high false positive rate. Methods: This paper proposed a novel method for aneurysm detection. First, a fully automatic cerebral artery segmentation algorithm without training data was used to extract the volume of interest, and then the 3D U-Net was improved by the 3D SENet module to establish an aneurysm detection model. Eventually a set of fully automated, end-to-end aneurysm detection methods have been formed. Results: A total of 231 magnetic resonance angiography image data were used in this study, among which 132 were training sets, 34 were internal test sets and 65 were external test sets. The presented method obtained 97.89% sensitivity in the five-fold cross-validation and obtained 91.0% sensitivity with 2.48 false positives/case in the detection of the external test sets. Conclusions: Compared with the results of our previous studies and other studies, the method in this paper achieves a very competitive sensitivity with less training data and maintains a low false positive rate. As the only method currently using 3D U-Net for aneurysm detection, it proves the feasibility and superior performance of this network in aneurysm detection, and also explores the potential of the channel attention mechanism in this task.

MTRL-SCIApr 20, 2021
Crystal structure prediction of materials with high symmetry using differential evolution

Wenhui Yang, Edirisuriya M. Dilanga Siriwardane, Rongzhi Dong et al.

Crystal structure determines the properties of materials. Given the crystal structure of a chemical substance, many physical and chemical properties can be predicted by first-principles calculations or machine learning models. Since it is relatively easy to generate a hypothetical chemically valid formula, crystal structure prediction becomes an important method for discovering new materials. In our previous work, we proposed a contact map-based crystal structure prediction method, which uses global optimization algorithms such as genetic algorithms to maximize the match between the contact map of the predicted structure and the contact map of the real crystal structure in order to search for the coordinates at the Wyckoff positions (WPs). However, when predicting crystal structures with high symmetry, we found that the global optimization algorithm has difficulty finding an effective combination of WPs that satisfies the chemical formula, mainly because the dimensionality of the contact map of the predicted crystal structure is inconsistent with that of the target crystal structure. This makes it challenging to predict the crystal structures of high-symmetry crystals. To solve this problem, we propose to use PyXtal to generate and filter random crystal structures with given symmetry constraints based on information such as chemical formulas and space groups. With the contact map as the optimization goal, we use differential evolution algorithms to search for non-special coordinates at the Wyckoff positions to realize the structure prediction of high-symmetry crystal materials. Our experimental results show that our proposed algorithm, CMCrystalHS, can effectively solve the problem of inconsistent contact map dimensions and predict crystal structures with high symmetry.
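The differential evolution search mentioned here can be illustrated with a generic DE/rand/1/bin loop over coordinates in [0, 1]. This is a sketch under assumptions, not CMCrystalHS: the squared-distance objective is a hypothetical stand-in for the contact-map mismatch, the parameter values are arbitrary, and out-of-range values are clipped rather than wrapped periodically.

```python
import random
from typing import Callable, List


def differential_evolution(
    objective: Callable[[List[float]], float],
    dim: int,
    pop_size: int = 20,
    generations: int = 100,
    f: float = 0.8,
    cr: float = 0.9,
    seed: int = 0,
) -> List[float]:
    """Minimize `objective` over [0, 1]^dim with DE/rand/1/bin; return the best vector."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    value = min(1.0, max(0.0, value))  # clip (a simplification)
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy selection: never accept a worse trial
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=lambda j: scores[j])
    return pop[best]


# Hypothetical target coordinates; a real objective would compare contact maps.
target = [0.25, 0.5, 0.75]


def mismatch(x: List[float]) -> float:
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


best = differential_evolution(mismatch, dim=3)
```

Greedy selection means each individual's score can only stay the same or improve from one generation to the next, which is what makes the method a practical black-box optimizer when the objective has no usable gradient.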