Bin Zhang

CV
h-index77
113papers
3,258citations
Novelty51%
AI Score60

113 Papers

LGFeb 20, 2023Code
Audit to Forget: A Unified Method to Revoke Patients' Private Data in Intelligent Healthcare

Juexiao Zhou, Haoyang Li, Xingyu Liao et al. · tsinghua

Revoking personal private data is one of the basic human rights, which has already been sheltered by several privacy-preserving laws in many countries. However, with the development of data science, machine learning and deep learning techniques, this right is usually neglected or violated as more and more patients' data are being collected and used for model training, especially in intelligent healthcare, thus making intelligent healthcare a sector where technology must meet the law, regulations, and privacy principles to ensure that the innovation is for the common good. In order to secure patients' right to be forgotten, we proposed a novel solution by using auditing to guide the forgetting process, where auditing means determining whether a dataset has been used to train the model and forgetting requires the information of a query dataset to be forgotten from the target model. We unified these two tasks by introducing a new approach called knowledge purification. To implement our solution, we developed AFS, a unified open-source software, which is able to evaluate and revoke patients' private data from pre-trained deep learning models. We demonstrated the generality of AFS by applying it to four tasks on different datasets with various data sizes and architectures of deep learning networks. The software is publicly available at \url{https://github.com/JoshuaChou2018/AFS}.

CVMar 19, 2023Code
CCTV-Gun: Benchmarking Handgun Detection in CCTV Images

Srikar Yellapragada, Zhenghong Li, Kevin Bhadresh Doshi et al.

Gun violence is a critical security problem, and it is imperative for the computer vision community to develop effective gun detection algorithms for real-world scenarios, particularly in Closed Circuit Television (CCTV) surveillance data. Despite significant progress in visual object detection, detecting guns in real-world CCTV images remains a challenging and under-explored task. Firearms, especially handguns, are typically very small in size, non-salient in appearance, and often severely occluded or indistinguishable from other small objects. Additionally, the lack of principled benchmarks and difficulty collecting relevant datasets further hinder algorithmic development. In this paper, we present a meticulously crafted and annotated benchmark, called \textbf{CCTV-Gun}, which addresses the challenges of detecting handguns in real-world CCTV images. Our contribution is three-fold. Firstly, we carefully select and analyze real-world CCTV images from three datasets, manually annotate handguns and their holders, and assign each image with relevant challenge factors such as blur and occlusion. Secondly, we propose a new cross-dataset evaluation protocol in addition to the standard intra-dataset protocol, which is vital for gun detection in practical settings. Finally, we comprehensively evaluate both classical and state-of-the-art object detection algorithms, providing an in-depth analysis of their generalizing abilities. The benchmark will facilitate further research and development on this topic and ultimately enhance security. Code, annotations, and trained models are available at https://github.com/srikarym/CCTV-Gun.

SDJul 4, 2024Code
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Keyu An, Qian Chen, Chong Deng et al.

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.

AIOct 6, 2023
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies

Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang et al. · microsoft-research

In the upcoming decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present DeepSpeed4Science initiative (deepspeed4science.ai) which aims to build unique capabilities through AI system technology innovations to help domain experts to unlock today's biggest science mysteries. By leveraging DeepSpeed's current technology pillars (training, inference and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research.

AINov 19, 2023Code
TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems

Yilun Kong, Jingqing Ruan, Yihong Chen et al.

Large Language Models (LLMs) have demonstrated proficiency in addressing tasks that necessitate a combination of task planning and the usage of external tools that require a blend of task planning and the utilization of external tools, such as APIs. However, real-world complex systems present three prevalent challenges concerning task planning and tool usage: (1) The real system usually has a vast array of APIs, so it is impossible to feed the descriptions of all APIs to the prompt of LLMs as the token length is limited; (2) the real system is designed for handling complex tasks, and the base LLMs can hardly plan a correct sub-task order and API-calling order for such tasks; (3) Similar semantics and functionalities among APIs in real systems create challenges for both LLMs and even humans in distinguishing between them. In response, this paper introduces a comprehensive framework aimed at enhancing the Task Planning and Tool Usage (TPTU) abilities of LLM-based agents operating within real-world systems. Our framework comprises three key components designed to address these challenges: (1) the API Retriever selects the most pertinent APIs for the user task among the extensive array available; (2) LLM Finetuner tunes a base LLM so that the finetuned LLM can be more capable for task planning and API calling; (3) the Demo Selector adaptively retrieves different demonstrations related to hard-to-distinguish APIs, which is further used for in-context learning to boost the final performance. We validate our methods using a real-world commercial system as well as an open-sourced academic dataset, and the outcomes clearly showcase the efficacy of each individual component as well as the integrated framework.

LGFeb 21, 2023Code
Weather2K: A Multivariate Spatio-Temporal Benchmark Dataset for Meteorological Forecasting Based on Real-Time Observation Data from Ground Weather Stations

Xun Zhu, Yutong Xiong, Ming Wu et al.

Weather forecasting is one of the cornerstones of meteorological work. In this paper, we present a new benchmark dataset named Weather2K, which aims to make up for the deficiencies of existing weather forecasting datasets in terms of real-time, reliability, and diversity, as well as the key bottleneck of data quality. To be specific, our Weather2K is featured from the following aspects: 1) Reliable and real-time data. The data is hourly collected from 2,130 ground weather stations covering an area of 6 million square kilometers. 2) Multivariate meteorological variables. 20 meteorological factors and 3 constants for position information are provided with a length of 40,896 time steps. 3) Applicable to diverse tasks. We conduct a set of baseline tests on time series forecasting and spatio-temporal forecasting. To the best of our knowledge, our Weather2K is the first attempt to tackle weather forecasting task by taking full advantage of the strengths of observation data from ground weather stations. Based on Weather2K, we further propose Meteorological Factors based Multi-Graph Convolution Network (MFMGCN), which can effectively construct the intrinsic correlation among geographic locations based on meteorological factors. Sufficient experiments show that MFMGCN improves both the forecasting performance and temporal robustness. We hope our Weather2K can significantly motivate researchers to develop efficient and accurate algorithms to advance the task of weather forecasting. The dataset can be available at https://github.com/bycnfz/weather2k/.

AIAug 20, 2024Code
QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning

Yilun Kong, Hangyu Mao, Qi Zhao et al.

Prompt engineering has demonstrated remarkable success in enhancing the performance of large language models (LLMs) across diverse tasks. However, most existing prompt optimization methods only focus on the task-level performance, overlooking the importance of query-preferred prompts, which leads to suboptimal performances. Additionally, these methods rely heavily on frequent interactions with LLMs to obtain feedback for guiding the optimization process, incurring substantial redundant interaction costs. In this paper, we introduce Query-dependent Prompt Optimization (QPO), which leverages multi-loop offline reinforcement learning to iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries, thus significantly improving the prompting effect on the large target LLM. We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks, thereby circumventing the expenses of online interactions. Furthermore, we continuously augment the offline dataset with the generated prompts in each loop, as the prompts from the fine-tuned model are supposed to outperform the source prompts in the original dataset. These iterative loops bootstrap the model towards generating optimal prompts. Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.

CVAug 31, 2022
MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition

Yunhao Wang, Huixin Sun, Xiaodi Wang et al.

Vision Transformer and its variants have demonstrated great potential in various computer vision tasks. But conventional vision transformers often focus on global dependency at a coarse level, which suffer from a learning challenge on global relationships and fine-grained representation at a token level. In this paper, we introduce Multi-scale Attention Fusion into transformer (MAFormer), which explores local aggregation and global feature extraction in a dual-stream framework for visual recognition. We develop a simple but effective module to explore the full potential of transformers for visual representation by learning fine-grained and coarse-grained features at a token level and dynamically fusing them. Our Multi-scale Attention Fusion (MAF) block consists of: i) a local window attention branch that learns short-range interactions within windows, aggregating fine-grained local features; ii) global feature extraction through a novel Global Learning with Down-sampling (GLD) operation to efficiently capture long-range context information within the whole image; iii) a fusion module that self-explores the integration of both features via attention. Our MAFormer achieves state-of-the-art performance on common vision tasks. In particular, MAFormer-L achieves 85.9$\%$ Top-1 accuracy on ImageNet, surpassing CSWin-B and LV-ViT-L by 1.7$\%$ and 0.6$\%$ respectively. On MSCOCO, MAFormer outperforms the prior art CSWin by 1.7$\%$ mAPs on object detection and 1.4$\%$ on instance segmentation with similar-sized parameters, demonstrating the potential to be a general backbone network.

AIAug 7, 2023
TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage

Jingqing Ruan, Yihong Chen, Bin Zhang et al.

With recent advancements in natural language processing, Large Language Models (LLMs) have emerged as powerful tools for various real-world applications. Despite their prowess, the intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks which necessitate a combination of task planning and the usage of external tools. In this paper, we first propose a structured framework tailored for LLM-based AI Agents and discuss the crucial capabilities necessary for tackling intricate problems. Within this framework, we design two distinct types of agents (i.e., one-step agent and sequential agent) to execute the inference process. Subsequently, we instantiate the framework using various LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks. By highlighting key findings and challenges, our goal is to provide a helpful resource for researchers and practitioners to leverage the power of LLMs in their AI applications. Our study emphasizes the substantial potential of these models, while also identifying areas that need more investigation and improvement.

CVMar 2Code
UETrack: A Unified and Efficient Framework for Single Object Tracking

Ben Kang, Jie Zhao, Xin Chen et al.

With growing real-world demands, efficient tracking has received increasing attention. However, most existing methods are limited to RGB inputs and struggle in multi-modal scenarios. Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, an efficient framework for single object tracking. UETrack demonstrates high practicality and versatility, efficiently handling multiple modalities including RGB, Depth, Thermal, Event, and Language, and addresses the gap in efficient multi-modal tracking. It introduces two key components: a Token-Pooling-based Mixture-of-Experts mechanism that enhances modeling capacity through feature aggregation and expert specialization, and a Target-aware Adaptive Distillation strategy that selectively performs distillation based on sample characteristics, reducing redundant supervision and improving performance. Extensive experiments on 12 benchmarks across 3 hardware platforms show that UETrack achieves a superior speed-accuracy trade-off compared to previous methods. For instance, UETrack-B achieves 69.2% AUC on LaSOT and runs at 163/56/60 FPS on GPU/CPU/AGX, demonstrating strong practicality and versatility. Code is available at https://github.com/kangben258/UETrack.

IVApr 21, 2023
SkinGPT-4: An Interactive Dermatology Diagnostic System with Visual Large Language Model

Juexiao Zhou, Xiaonan He, Liyuan Sun et al.

Skin and subcutaneous diseases rank high among the leading contributors to the global burden of nonfatal diseases, impacting a considerable portion of the population. Nonetheless, the field of dermatology diagnosis faces three significant hurdles. Firstly, there is a shortage of dermatologists accessible to diagnose patients, particularly in rural regions. Secondly, accurately interpreting skin disease images poses a considerable challenge. Lastly, generating patient-friendly diagnostic reports is usually a time-consuming and labor-intensive task for dermatologists. To tackle these challenges, we present SkinGPT-4, which is the world's first interactive dermatology diagnostic system powered by an advanced visual large language model. SkinGPT-4 leverages a fine-tuned version of MiniGPT-4, trained on an extensive collection of skin disease images (comprising 52,929 publicly available and proprietary images) along with clinical concepts and doctors' notes. We designed a two-step training process to allow SkinGPT to express medical features in skin disease images with natural language and make accurate diagnoses of the types of skin diseases. With SkinGPT-4, users could upload their own skin photos for diagnosis, and the system could autonomously evaluate the images, identifies the characteristics and categories of the skin conditions, performs in-depth analysis, and provides interactive treatment recommendations. Meanwhile, SkinGPT-4's local deployment capability and commitment to user privacy also render it an appealing choice for patients in search of a dependable and precise diagnosis of their skin ailments. To demonstrate the robustness of SkinGPT-4, we conducted quantitative evaluations on 150 real-life cases, which were independently reviewed by certified dermatologists, and showed that SkinGPT-4 could provide accurate diagnoses of skin diseases.

MAFeb 26Code
QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Yuanjun Li, Bin Zhang, Hao Chen et al.

Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.

AINov 23, 2023
Controlling Large Language Model-based Agents for Large-Scale Decision-Making: An Actor-Critic Approach

Bin Zhang, Hangyu Mao, Jingqing Ruan et al.

The remarkable progress in Large Language Models (LLMs) opens up new avenues for addressing planning and decision-making problems in Multi-Agent Systems (MAS). However, as the number of agents increases, the issues of hallucination in LLMs and coordination in MAS have become increasingly prominent. Additionally, the efficient utilization of tokens emerges as a critical consideration when employing LLMs to facilitate the interactions among a substantial number of agents. In this paper, we develop a modular framework called LLaMAC to mitigate these challenges. LLaMAC implements a value distribution encoding similar to that found in the human brain, utilizing internal and external feedback mechanisms to facilitate collaboration and iterative reasoning among its modules. Through evaluations involving system resource allocation and robot grid transportation, we demonstrate the considerable advantages afforded by our proposed approach.

AIOct 17, 2022
PTDE: Personalized Training with Distilled Execution for Multi-Agent Reinforcement Learning

Yiqun Chen, Hangyu Mao, Jiaxin Mao et al.

Centralized Training with Decentralized Execution (CTDE) has emerged as a widely adopted paradigm in multi-agent reinforcement learning, emphasizing the utilization of global information for learning an enhanced joint $Q$-function or centralized critic. In contrast, our investigation delves into harnessing global information to directly enhance individual $Q$-functions or individual actors. Notably, we discover that applying identical global information universally across all agents proves insufficient for optimal performance. Consequently, we advocate for the customization of global information tailored to each agent, creating agent-personalized global information to bolster overall performance. Furthermore, we introduce a novel paradigm named Personalized Training with Distilled Execution (PTDE), wherein agent-personalized global information is distilled into the agent's local information. This distilled information is then utilized during decentralized execution, resulting in minimal performance degradation. PTDE can be seamlessly integrated with state-of-the-art algorithms, leading to notable performance enhancements across diverse benchmarks, including the SMAC benchmark, Google Research Football (GRF) benchmark, and Learning to Rank (LTR) task.

MAAug 18, 2024
Beyond Local Views: Global State Inference with Diffusion Models for Cooperative Multi-Agent Reinforcement Learning

Zhiwei Xu, Hangyu Mao, Nianmin Zhang et al.

In partially observable multi-agent systems, agents typically only have access to local observations. This severely hinders their ability to make precise decisions, particularly during decentralized execution. To alleviate this problem and inspired by image outpainting, we propose State Inference with Diffusion Models (SIDIFF), which uses diffusion models to reconstruct the original global state based solely on local observations. SIDIFF consists of a state generator and a state extractor, which allow agents to choose suitable actions by considering both the reconstructed global state and local observations. In addition, SIDIFF can be effortlessly incorporated into current multi-agent reinforcement learning algorithms to improve their performance. Finally, we evaluated SIDIFF on different experimental platforms, including Multi-Agent Battle City (MABC), a novel and flexible multi-agent reinforcement learning environment we developed. SIDIFF achieved desirable results and outperformed other popular algorithms.

89.3AO-PHMay 27
Skillful high-resolution weather forecasting independent of physical models

Pengcheng Zhao, Siqi Xiang, Weixin Jin et al.

Accurate and timely weather forecasts are critical for high-impact decisions in modern society. Machine-learning-based weather prediction is emerging as an alternative for producing initial conditions, forecasts, and even both in end-to-end systems. These methods deliver predictions faster and often with higher skill than traditional numerical weather prediction (NWP). However, even end-to-end models typically rely on NWP-generated reanalyses for supervision, thereby inheriting the biases and resolution limitations of those NWPs, and limiting adaptation to settings where suitable reanalysis products are unavailable, infrequently updated, or expensive to produce. Here we introduce ObsCast, a regional system that generates both analysis and predictions, without using any NWP-derived data in either training or inference, while still achieving state-of-the-art performance in short-term high-resolution regional modeling. Over the contiguous United States and Europe, ObsCast outperforms operational NWP for near-surface variables through 18 h and produces skillful precipitation forecasts. It provides a simpler and more adaptable route to build and refine regional forecasting services directly from local observations, without the need to develop complex and costly traditional forecasting pipelines.

56.0NEMar 15Code
MorphSNN: Adaptive Graph Diffusion and Structural Plasticity for Spiking Neural Networks

Yongsheng Huang, Peibo Duan, Yujie Wu et al.

Spiking Neural Networks (SNNs) currently face a critical bottleneck: while individual neurons exhibit dynamic biological properties, their macro-scopic architectures remain confined within conventional connectivity patterns that are static and hierarchical. This discrepancy between neuron-level dynamics and network-level fixed connectivity eliminates critical brain-like lateral interactions, limiting adaptability in changing environments. To address this, we propose MorphSNN, a backbone framework inspired by biological non-synaptic diffusion and structural plasticity. Specifically, we introduce a Graph Diffusion (GD)mechanism to facilitate efficient undirected signal propagation, complementing the feedforward hierarchy. Furthermore, it incorporates a Spatio-Temporal Structural Plasticity (STSP) mechanism, endowing the network with the capability for instance-specific, dynamic topological reorganization, thereby overcoming the limitations of fixed topologies. Experiments demonstrate that MorphSNN achieves state-of-the-art accuracy on static and neuromorphic datasets; for instance, it reaches 83.35% accuracy on N-Caltech101 with only 5 timesteps. More importantly, its self-evolving topology functions as an intrinsic distribution fingerprint, enabling superior Out-of- Distribution (OOD) detection without auxiliary training. The code is available at anonymous.4open.science/r/MorphSNN-B0BC.

CHEM-PHMay 22, 2022
Contrastive Learning of Coarse-Grained Force Fields

Xinqiang Ding, Bin Zhang

Coarse-grained models have proven helpful for simulating complex systems over long timescales to provide molecular insights into various processes. Methodologies for systematic parameterization of the underlying energy function, or force field that describes the interactions among different components of the system are of great interest for ensuring simulation accuracy. We present a new method, potential contrasting, to enable efficient learning of force fields that can accurately reproduce the conformational distribution produced with all-atom simulations. Potential contrasting generalizes the noise contrastive estimation method with umbrella sampling to better learn the complex energy landscape of molecular systems. When applied to the Trp-cage protein, we found that the technique produces force fields that thoroughly capture the thermodynamics of the folding process despite the use of only $α$-Carbons in the coarse-grained model. We further showed that potential contrasting could be applied over large datasets that combine the conformational ensembles of many proteins to ensure the transferability of coarse-grained force fields. We anticipate potential contrasting to be a powerful tool for building general-purpose coarse-grained force fields.

MAApr 20, 2022
Mingling Foresight with Imagination: Model-Based Cooperative Multi-Agent Reinforcement Learning

Zhiwei Xu, Dapeng Li, Bin Zhang et al.

Recently, model-based agents have achieved better performance than model-free ones using the same computational budget and training time in single-agent environments. However, due to the complexity of multi-agent systems, it is tough to learn the model of the environment. The significant compounding error may hinder the learning process when model-based methods are applied to multi-agent tasks. This paper proposes an implicit model-based multi-agent reinforcement learning method based on value decomposition methods. Under this method, agents can interact with the learned virtual environment and evaluate the current state value according to imagined future states in the latent space, making agents have the foresight. Our approach can be applied to any multi-agent value decomposition method. The experimental results show that our method improves the sample efficiency in different partially observable Markov decision process domains.

92.9AIMay 26
UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Yiqun Chen, Wei Yang, Erhan Zhang et al.

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.

AIMar 7, 2022
Efficient Policy Generation in Multi-Agent Systems via Hypergraph Neural Network

Bin Zhang, Yunpeng Bai, Zhiwei Xu et al.

The application of deep reinforcement learning in multi-agent systems introduces extra challenges. In a scenario with numerous agents, one of the most important concerns currently being addressed is how to develop sufficient collaboration between diverse agents. To address this problem, we consider the form of agent interaction based on neighborhood and propose a multi-agent reinforcement learning (MARL) algorithm based on the actor-critic method, which can adaptively construct the hypergraph structure representing the agent interaction and further implement effective information extraction and representation learning through hypergraph convolution networks, leading to effective cooperation. Based on different hypergraph generation methods, we present two variants: Actor Hypergraph Convolutional Critic Network (HGAC) and Actor Attention Hypergraph Critic Network (ATT-HGAC). Experiments with different settings demonstrate the advantages of our approach over other existing methods.

LGMar 23, 2023
Physics Symbolic Learner for Discovering Ground-Motion Models Via NGA-West2 Database

Su Chen, Xianwei Liu, Lei Fu et al.

Ground-motion model (GMM) is the basis of many earthquake engineering studies. In this study, a novel physics-informed symbolic learner (PISL) method based on the Nest Generation Attenuation-West2 database is proposed to automatically discover mathematical equation operators as symbols. The sequential threshold ridge regression algorithm is utilized to distill a concise and interpretable explicit characterization of complex systems of ground motions. In addition to the basic variables retrieved from previous GMMs, the current PISL incorporates two a priori physical conditions, namely, distance and amplitude saturation. GMMs developed using the PISL, an empirical regression method (ERM), and an artificial neural network (ANN) are compared in terms of residuals and extrapolation based on obtained data of peak ground acceleration and velocity. The results show that the inter- and intra-event standard deviations of the three methods are similar. The functional form of the PISL is more concise than that of the ERM and ANN. The extrapolation capability of the PISL is more accurate than that of the ANN. The PISL-GMM used in this study provide a new paradigm of regression that considers both physical and data-driven machine learning and can be used to identify the implied physical relationships and prediction equations of ground motion variables in different regions.

MAJun 6, 2022
Consensus Learning for Cooperative Multi-Agent Reinforcement Learning

Zhiwei Xu, Bin Zhang, Dapeng Li et al.

Almost all multi-agent reinforcement learning algorithms without communication follow the principle of centralized training with decentralized execution. During centralized training, agents can be guided by the same signals, such as the global state. During decentralized execution, however, agents lack the shared signal. Inspired by viewpoint invariance and contrastive learning, we propose consensus learning for cooperative multi-agent reinforcement learning in this paper. Although based on local observations, different agents can infer the same consensus in discrete space. During decentralized execution, we feed the inferred consensus as an explicit input to the network of agents, thereby developing their spirit of cooperation. Our proposed method can be extended to various multi-agent reinforcement learning algorithms with small model changes. Moreover, we carry out them on some fully cooperative tasks and get convincing results.

GNSep 6, 2023
Automated Bioinformatics Analysis via AutoBA

Juexiao Zhou, Bin Zhang, Xiuying Chen et al.

With the fast-growing and evolving omics data, the demand for streamlined and adaptable tools to handle the analysis continues to grow. In response to this need, we introduce Auto Bioinformatics Analysis (AutoBA), an autonomous AI agent based on a large language model designed explicitly for conventional omics data analysis. AutoBA simplifies the analytical process by requiring minimal user input while delivering detailed step-by-step plans for various bioinformatics tasks. Through rigorous validation by expert bioinformaticians, AutoBA's robustness and adaptability are affirmed across a diverse range of omics analysis cases, including whole genome sequencing (WGS), RNA sequencing (RNA-seq), single-cell RNA-seq, ChIP-seq, and spatial transcriptomics. AutoBA's unique capacity to self-design analysis processes based on input data variations further underscores its versatility. Compared with online bioinformatic services, AutoBA deploys the analysis locally, preserving data privacy. Moreover, different from the predefined pipeline, AutoBA has adaptability in sync with emerging bioinformatics tools. Overall, AutoBA represents a convenient tool, offering robustness and adaptability for complex omics data analysis.

MAApr 28, 2023
From Explicit Communication to Tacit Cooperation:A Novel Paradigm for Cooperative MARL

Dapeng Li, Zhiwei Xu, Bin Zhang et al.

Centralized training with decentralized execution (CTDE) is a widely-used learning paradigm that has achieved significant success in complex tasks. However, partial observability issues and the absence of effectively shared signals between agents often limit its effectiveness in fostering cooperation. While communication can address this challenge, it simultaneously reduces the algorithm's practicality. Drawing inspiration from human team cooperative learning, we propose a novel paradigm that facilitates a gradual shift from explicit communication to tacit cooperation. In the initial training stage, we promote cooperation by sharing relevant information among agents and concurrently reconstructing this information using each agent's local trajectory. We then combine the explicitly communicated information with the reconstructed information to obtain mixed information. Throughout the training process, we progressively reduce the proportion of explicitly communicated information, facilitating a seamless transition to fully decentralized execution without communication. Experimental results in various scenarios demonstrate that the performance of our method without communication can approaches or even surpasses that of QMIX and communication-based methods.

MAApr 20, 2023
Inducing Stackelberg Equilibrium through Spatio-Temporal Sequential Decision-Making in Multi-Agent Reinforcement Learning

Bin Zhang, Lijuan Li, Zhiwei Xu et al.

In multi-agent reinforcement learning (MARL), self-interested agents attempt to establish equilibrium and achieve coordination depending on game structure. However, existing MARL approaches are mostly bound by the simultaneous actions of all agents in the Markov game (MG) framework, and few works consider the formation of equilibrium strategies via asynchronous action coordination. In view of the advantages of Stackelberg equilibrium (SE) over Nash equilibrium, we construct a spatio-temporal sequential decision-making structure derived from the MG and propose an N-level policy model based on a conditional hypernetwork shared by all agents. This approach allows for asymmetric training with symmetric execution, with each agent responding optimally conditioned on the decisions made by superior agents. Agents can learn heterogeneous SE policies while still maintaining parameter sharing, which leads to reduced cost for learning and storage and enhanced scalability as the number of agents increases. Experiments demonstrate that our method effectively converges to the SE policies in repeated matrix game scenarios, and performs admirably in immensely complex settings including cooperative tasks and mixed tasks.

LGOct 30, 2022
A Simple Hypergraph Kernel Convolution based on Discounted Markov Diffusion Process

Fuyang Li, Jiying Zhang, Xi Xiao et al.

Kernels on discrete structures evaluate pairwise similarities between objects which capture semantics and inherent topology information. Existing kernels on discrete structures are only developed by topology information(such as adjacency matrix of graphs), without considering original attributes of objects. This paper proposes a two-phase paradigm to aggregate comprehensive information on discrete structures leading to a Discount Markov Diffusion Learnable Kernel (DMDLK). Specifically, based on the underlying projection of DMDLK, we design a Simple Hypergraph Kernel Convolution (SHKC) for hidden representation of vertices. SHKC can adjust diffusion steps rather than stacking convolution layers to aggregate information from long-range neighborhoods which prevents over-smoothing issues of existing hypergraph convolutions. Moreover, we utilize the uniform stability bound theorem in transductive learning to analyze critical factors for the effectiveness and generalization ability of SHKC from a theoretical perspective. The experimental results on several benchmark datasets for node classification tasks verified the superior performance of SHKC over state-of-the-art methods.

LGNov 3, 2025Code
Gated Fusion Enhanced Multi-Scale Hierarchical Graph Convolutional Network for Stock Movement Prediction

Xiaosha Xue, Peibo Duan, Zhipeng Liu et al.

Accurately predicting stock market movements remains a formidable challenge due to the inherent volatility and complex interdependencies among stocks. Although multi-scale Graph Neural Networks (GNNs) hold potential for modeling these relationships, they frequently neglect two key points: the subtle intra-attribute patterns within each stock affecting inter-stock correlation, and the biased attention to coarse- and fine-grained features during multi-scale sampling. To overcome these challenges, we introduce MS-HGFN (Multi-Scale Hierarchical Graph Fusion Network). The model features a hierarchical GNN module that forms dynamic graphs by learning patterns from intra-attributes and features from inter-attributes over different time scales, thus comprehensively capturing spatio-temporal dependencies. Additionally, a top-down gating approach facilitates the integration of multi-scale spatio-temporal features, preserving critical coarse- and fine-grained features without too much interference. Experiments utilizing real-world datasets from U.S. and Chinese stock markets demonstrate that MS-HGFN outperforms both traditional and advanced models, yielding up to a 1.4% improvement in prediction accuracy and enhanced stability in return simulations. The code is available at https://anonymous.4open.science/r/MS-HGFN.

89.9CLMay 7Code
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

Jiacheng Xu, Heting Gao, Liufei Xie et al.

Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the first expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech-text paradigm that extends interleaved text-audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks, and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.

64.8IMApr 14
FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation

Bin Zhang, Yabiao Wang, Xiaoyao Xie et al.

The exponential growth of data from modern radio telescopes presents a significant challenge to traditional single-pulse search algorithms, which are computationally intensive and prone to high false-positive rates due to Radio Frequency Interference (RFI). In this work, we introduce FRTSearch, an end-to-end framework unifying the detection and physical characterization of Fast Radio Transients (FRTs). Leveraging the morphological universality of dispersive trajectories in time-frequency dynamic spectra, we reframe FRT detection as a pattern recognition problem governed by the cold plasma dispersion relation. To facilitate this, we constructed CRAFTS-FRT, a pixel-level annotated dataset derived from the Commensal Radio Astronomy FAST Survey (CRAFTS), comprising 2{,}392 instances across diverse source classes. This dataset enables the training of a Mask R-CNN model for precise trajectory segmentation. Coupled with our physics-driven IMPIC algorithm, the framework maps the geometric coordinates of segmented trajectories to directly infer the Dispersion Measure (DM) and Time of Arrival (ToA). Benchmarking on the FAST-FREX dataset shows that FRTSearch achieves a 98.0\% recall, competitive with exhaustive search methods, while reducing false positives by over 99.9\% compared to PRESTO and delivering a processing speedup of up to $13.9\times$. Furthermore, the framework demonstrates robust cross-facility generalization, detecting all 19 tested FRBs from the ASKAP survey without retraining. By shifting the paradigm from ``search-then-identify'' to ``detect-and-infer,'' FRTSearch provides a scalable, high-precision solution for real-time discovery in the era of petabyte-scale radio astronomy.

SDJan 22
Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He et al.

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamlessly integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmark (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.

CLSep 22, 2025Code
Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu et al. · pku

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

MAApr 25, 2023
SEA: A Spatially Explicit Architecture for Multi-Agent Reinforcement Learning

Dapeng Li, Zhiwei Xu, Bin Zhang et al.

Spatial information is essential in various fields. How to explicitly model according to the spatial location of agents is also very important for the multi-agent problem, especially when the number of agents is changing and the scale is enormous. Inspired by the point cloud task in computer vision, we propose a spatial information extraction structure for multi-agent reinforcement learning in this paper. Agents can effectively share the neighborhood and global information through a spatially encoder-decoder structure. Our method follows the centralized training with decentralized execution (CTDE) paradigm. In addition, our structure can be applied to various existing mainstream reinforcement learning algorithms with minor modifications and can deal with the problem with a variable number of agents. The experiments in several multi-agent scenarios show that the existing methods can get convincing results by adding our spatially explicit architecture.

MAFeb 4, 2023
Dual Self-Awareness Value Decomposition Framework without Individual Global Max for Cooperative Multi-Agent Reinforcement Learning

Zhiwei Xu, Bin Zhang, Dapeng Li et al.

Value decomposition methods have gained popularity in the field of cooperative multi-agent reinforcement learning. However, almost all existing methods follow the principle of Individual Global Max (IGM) or its variants, which limits their problem-solving capabilities. To address this, we propose a dual self-awareness value decomposition framework, inspired by the notion of dual self-awareness in psychology, that entirely rejects the IGM premise. Each agent consists of an ego policy for action selection and an alter ego value function to solve the credit assignment problem. The value function factorization can ignore the IGM assumption by utilizing an explicit search procedure. On the basis of the above, we also suggest a novel anti-ego exploration mechanism to avoid the algorithm becoming stuck in a local optimum. As the first fully IGM-free value decomposition method, our proposed framework achieves desirable performance in various cooperative tasks.

LGDec 26, 2023Code
PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning

Hangyu Mao, Rui Zhao, Ziyue Li et al.

Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work studies the former. Specifically, the Perception and Decision-making Interleaving Transformer (PDiT) network is proposed, which cascades two Transformers in a very natural way: the perceiving one focuses on \emph{the environmental perception} by processing the observation at the patch level, whereas the deciding one pays attention to \emph{the decision-making} by conditioning on the history of the desired returns, the perceiver's outputs, and the actions. Such a network design is generally applicable to a lot of deep RL settings, e.g., both the online and offline RL algorithms under environments with either image observations, proprioception observations, or hybrid image-language observations. Extensive experiments show that PDiT can not only achieve superior performance than strong baselines in different settings but also extract explainable feature representations. Our code is available at \url{https://github.com/maohangyu/PDiT}.

55.3MAMar 17
CoMAI: A Collaborative Multi-Agent Framework for Robust and Equitable Interview Evaluation

Gengxin Sun, Ruihao Yu, Liangyi Yin et al.

Ensuring robust and fair interview assessment remains a key challenge in AI-driven evaluation. This paper presents CoMAI, a general-purpose multi-agent interview framework designed for diverse assessment scenarios. In contrast to monolithic single-agent systems based on large language models (LLMs), CoMAI employs a modular task-decomposition architecture coordinated through a centralized finite-state machine. The system comprises four agents specialized in question generation, security, scoring, and summarization. These agents work collaboratively to provide multi-layered security defenses against prompt injection, support multidimensional evaluation with adaptive difficulty adjustment, and enable rubric-based structured scoring that reduces subjective bias. Experimental results demonstrate that CoMAI achieved 90.47% accuracy, 83.33% recall, and 84.41% candidate satisfaction. These results highlight CoMAI as a robust, fair, and interpretable paradigm for AI-driven interview assessment.

67.5MAMay 1
Breaking the Communication-Accuracy Trade-off: A Sparsified Information Diffusion Framework for Multi-Agent Collaborative Perception

Jirong Zha, Chenyu Zhao, Nan Zhou et al.

The growing relevance of multi-agent systems has drawn increasing focus on communication-efficient filters for collaborative perception to alleviate the system's communication burden. While the event-triggered (ET) mechanism can improve communication efficiency in collaborative state estimation, an inevitable trade-off exists between estimation accuracy and communication cost in ET filters. This paper proposes a fast and accurate ET diffusion-based filter for real-time multi-agent collaborative target tracking, aiming to reduce the system's data transmission without compromise in tracking performance. The proposed filter achieves improved tracking accuracy, reduced data transmission, and accelerated convergence using an error-minimized ET cubature information filter (CIF) for local estimation, and a correlation-aware diffusion strategy for global fusion. The experimental results confirm the scalability of the proposed EDC-CIF algorithm and demonstrate its efficacy in simultaneously reducing estimation error and computation time while significantly enhancing communication efficiency.

CVFeb 23
Monocular Mesh Recovery and Body Measurement of Female Saanen Goats

Bo Jin, Shichao Zhao, Jin Lyu et al.

The lactation performance of Saanen dairy goats, renowned for their high milk yield, is intrinsically linked to their body size, making accurate 3D body measurement essential for assessing milk production potential, yet existing reconstruction methods lack goat-specific authentic 3D data. To address this limitation, we establish the FemaleSaanenGoat dataset containing synchronized eight-view RGBD videos of 55 female Saanen goats (6-18 months). Using multi-view DynamicFusion, we fuse noisy, non-rigid point cloud sequences into high-fidelity 3D scans, overcoming challenges from irregular surfaces and rapid movement. Based on these scans, we develop SaanenGoat, a parametric 3D shape model specifically designed for female Saanen goats. This model features a refined template with 41 skeletal joints and enhanced udder representation, registered with our scan data. A comprehensive shape space constructed from 48 goats enables precise representation of diverse individual variations. With the help of SaanenGoat model, we get high-precision 3D reconstruction from single-view RGBD input, and achieve automated measurement of six critical body dimensions: body length, height, chest width, chest girth, hip width, and hip height. Experimental results demonstrate the superior accuracy of our method in both 3D reconstruction and body measurement, presenting a novel paradigm for large-scale 3D vision applications in precision livestock farming.

91.5CEMay 15
BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

Loka Li, Duzhen Zhang, Xingbo Du et al.

Large language model (LLM) agents are increasingly capable of automating components of machine learning development, yet existing biomedical benchmarks mainly focus on question answering, reasoning, and tool usage, or evaluate only narrow aspects of biomedical ML coding. We present BioXArena, a biomedical machine learning benchmark designed to evaluate whether agents can generate task-specific model training pipelines for heterogeneous and multi-modal biomedical datasets. BioXArena contains 76 end-to-end tasks across 9 domains, including sequence modeling, single-cell analysis, structural biology, network biology, chemical biology, perturbation dynamics, phenotype-disease modeling, biomedical imaging, and text-integrated learning. Each task is curated from primary biomedical sources into a unified evaluation framework with hidden labels, held-out graders, and biology-aware metrics normalized to a 0 to 1 scale. Agents are required to write executable code, train predictive models, and generate submissions for private test samples. Most tasks involve multiple input modalities, including tabular data, images, natural language, molecular sequences, omics matrices, and protein structures. We evaluate 11 agent configurations in a standardized 2-hour single-GPU environment. MLEvolve with Gemini-3.1-Pro achieves the highest average score of 0.666, followed by GPT-5.4 with 0.636, while no single agent consistently dominates across all domains. We additionally perform extensive ablation studies, robustness evaluations, scaling analyses, cost analyses, and failure-mode investigations to better understand how model backbones, agent scaffolds, inference budgets, and biomedical domains influence BioML coding performance. We will publicly release all benchmark tasks, graders, execution runners, leaderboard results, and agent trajectories.

CLMar 26, 2025
Qwen2.5-Omni Technical Report

Jin Xu, Zhifang Guo, Jinzheng He et al.

In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE(Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose \textbf{Thinker-Talker} architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

48.4LGApr 12
Schema-Adaptive Tabular Representation Learning with LLMs for Generalizable Multimodal Clinical Reasoning

Hongxi Mao, Wei Zhou, Mengting Jia et al.

Machine learning for tabular data remains constrained by poor schema generalization, a challenge rooted in the lack of semantic understanding of structured variables. This challenge is particularly acute in domains like clinical medicine, where electronic health record (EHR) schemas vary significantly. To solve this problem, we propose Schema-Adaptive Tabular Representation Learning, a novel method that leverages large language models (LLMs) to create transferable tabular embeddings. By transforming structured variables into semantic natural language statements and encoding them with a pretrained LLM, our approach enables zero-shot alignment across unseen schemas without manual feature engineering or retraining. We integrate our encoder into a multimodal framework for dementia diagnosis, combining tabular and MRI data. Experiments on NACC and ADNI datasets demonstrate state-of-the-art performance and successful zero-shot transfer to unseen schemas, significantly outperforming clinical baselines, including board-certified neurologists, in retrospective diagnostic tasks. These results validate our LLM-driven approach as a scalable, robust solution for heterogeneous real-world data, offering a pathway to extend LLM-based reasoning to structured domains.

CVOct 2, 2025Code
OpusAnimation: Code-Based Dynamic Chart Generation

Bozheng Li, Miao Yang, Zhenhan Chen et al. · utoronto

Dynamic Chart Generation (DCG) involves producing code-rendered animated visualizations as charts. While recent advances in multi-modal large language models (MLLMs) have significantly improved their capability on static chart generation and comprehension, MLLMs' potential for handling dynamic chart generation and understanding remains underexplored. To bridge this research gap, we introduce DCG-Bench (Dynamic Chart Generation Benchmark), the first benchmark evaluating MLLM's capability on dynamic chart generation tasks from three dimensions: Simple Text-to-Chart, Detailed Text-to-Chart, and Video-to-Chart tasks. We construct DCG-8K, a high-quality DCG dataset with annotations covering instruction-code-video triplets and QA pairs for both code and video evaluation. Based on DCG-8K, we explored a two-stage training recipe, proposing Joint-Code-Visual Reward for group relative policy optimization to construct expert MLLM Qwen2.5-VL-DCG-3B for the DCG task. Our benchmarking result reveals shortcomings of existing MLLMs in the visual-to-chart task, and our model beats the best open-sourced MLLM with an average 8.31% performance gain across three tasks, and shows on par performance against proprietary models with only 3B parameters, proving the effectiveness of our training recipe. Our code and dataset will be publicly available.

30.2NEApr 21
Pareto Optimization with Robust Evaluation for Noisy Subset Selection

Yiheng Xu, Danxuan Liu, Bin Zhang et al.

Subset selection is a fundamental problem in combinatorial optimization, which has a wide range of applications such as influence maximization and sparse regression. The goal is to select a subset of limited size from a ground set in order to maximize a given objective function. However, the evaluation of the objective function in real-world scenarios is often noisy. Previous algorithms, including the greedy algorithm and multi-objective evolutionary algorithms POSS and PONSS, either struggle in noisy environments or consume excessive computational resources. In this paper, we focus on the noisy subset selection problem with a cardinality constraint, where the evaluation of a subset is noisy. We propose a novel approach based on Pareto Optimization with Robust Evaluation for noisy subset selection (PORE), which maximizes a robust evaluation function and minimizes the subset size simultaneously. PORE can efficiently identify well-structured solutions and handle computational resources, addressing the limitations observed in PONSS. Our experiments, conducted on real-world datasets for influence maximization and sparse regression, demonstrate that PORE significantly outperforms previous methods, including the classical greedy algorithm, POSS, and PONSS. Further validation through ablation studies confirms the effectiveness of our robust evaluation function.

IVJul 1, 2025Code
Accurate and Efficient Fetal Birth Weight Estimation from 3D Ultrasound

Jian Wang, Qiongying Ni, Hongkui Yu et al.

Accurate fetal birth weight (FBW) estimation is essential for optimizing delivery decisions and reducing perinatal mortality. However, clinical methods for FBW estimation are inefficient, operator-dependent, and challenging to apply in cases of complex fetal anatomy. Existing deep learning methods are based on 2D standard ultrasound (US) images or videos that lack spatial information, limiting their prediction accuracy. In this study, we propose the first method for directly estimating FBW from 3D fetal US volumes. Our approach integrates a multi-scale feature fusion network (MFFN) and a synthetic sample-based learning framework (SSLF). The MFFN effectively extracts and fuses multi-scale features under sparse supervision by incorporating channel attention, spatial attention, and a ranking-based loss function. SSLF generates synthetic samples by simply combining fetal head and abdomen data from different fetuses, utilizing semi-supervised learning to improve prediction performance. Experimental results demonstrate that our method achieves superior performance, with a mean absolute error of $166.4\pm155.9$ $g$ and a mean absolute percentage error of $5.1\pm4.6$%, outperforming existing methods and approaching the accuracy of a senior doctor. Code is available at: https://github.com/Qioy-i/EFW.

NEMay 15, 2025Code
ILIF: Temporal Inhibitory Leaky Integrate-and-Fire Neuron for Overactivation in Spiking Neural Networks

Kai Sun, Peibo Duan, Levin Kuhlmann et al.

The Spiking Neural Network (SNN) has drawn increasing attention for its energy-efficient, event-driven processing and biological plausibility. To train SNNs via backpropagation, surrogate gradients are used to approximate the non-differentiable spike function, but they only maintain nonzero derivatives within a narrow range of membrane potentials near the firing threshold, referred to as the surrogate gradient support width gamma. We identify a major challenge, termed the dilemma of gamma: a relatively large gamma leads to overactivation, characterized by excessive neuron firing, which in turn increases energy consumption, whereas a small gamma causes vanishing gradients and weakens temporal dependencies. To address this, we propose a temporal Inhibitory Leaky Integrate-and-Fire (ILIF) neuron model, inspired by biological inhibitory mechanisms. This model incorporates interconnected inhibitory units for membrane potential and current, effectively mitigating overactivation while preserving gradient propagation. Theoretical analysis demonstrates ILIF effectiveness in overcoming the gamma dilemma, and extensive experiments on multiple datasets show that ILIF improves energy efficiency by reducing firing rates, stabilizes training, and enhances accuracy. The code is available at github.com/kaisun1/ILIF.

IVJun 26, 2024Code
Multi-modal Evidential Fusion Network for Trustworthy PET/CT Tumor Segmentation

Yuxuan Qi, Li Lin, Jiajun Wang et al.

Accurate tumor segmentation in PET/CT images is crucial for computer-aided cancer diagnosis and treatment. The primary challenge lies in effectively integrating the complementary information from PET and CT images. In clinical settings, the quality of PET and CT images often varies significantly, leading to uncertainty in the modality information extracted by networks. To address this challenge, we propose a novel Multi-modal Evidential Fusion Network (MEFN), which consists of two core stages: Cross-Modal Feature Learning (CFL) and Multi-modal Trustworthy Fusion (MTF). The CFL stage aligns features across different modalities and learns more robust feature representations, thereby alleviating the negative effects of domain gap. The MTF stage utilizes mutual attention mechanisms and an uncertainty calibrator to fuse modality features based on modality uncertainty and then fuse the segmentation results under the guidance of Dempster-Shafer Theory. Besides, a new uncertainty perceptual loss is introduced to force the model focusing on uncertain features and hence improve its ability to extract trusted modality information. Extensive comparative experiments are conducted on two publicly available PET/CT datasets to evaluate the performance of our proposed method whose results demonstrate that our MEFN significantly outperforms state-of-the-art methods with improvements of 3.10% and 3.23% in DSC scores on the AutoPET dataset and the Hecktor dataset, respectively. More importantly, our model can provide radiologists with credible uncertainty of the segmentation results for their decision in accepting or rejecting the automatic segmentation results, which is particularly important for clinical applications. Our code will be available at https://github.com/QPaws/MEFN.

CVSep 16, 2020Code
The 1st Tiny Object Detection Challenge:Methods and Results

Xuehui Yu, Zhenjun Han, Yuqi Gong et al.

The 1st Tiny Object Detection (TOD) Challenge aims to encourage research in developing novel and accurate methods for tiny object detection in images which have wide views, with a current focus on tiny person detection. The TinyPerson dataset was used for the TOD Challenge and is publicly released. It has 1610 images and 72651 box-levelannotations. Around 36 participating teams from the globe competed inthe 1st TOD Challenge. In this paper, we provide a brief summary of the1st TOD Challenge including brief introductions to the top three methods.The submission leaderboard will be reopened for researchers that areinterested in the TOD challenge. The benchmark dataset and other information can be found at: https://github.com/ucas-vg/TinyBenchmark.

IVAug 26, 2020Code
Better Than Reference In Low Light Image Enhancement: Conditional Re-Enhancement Networks

Yu Zhang, Xiaoguang Di, Bin Zhang et al.

Low light images suffer from severe noise, low brightness, low contrast, etc. In previous researches, many image enhancement methods have been proposed, but few methods can deal with these problems simultaneously. In this paper, to solve these problems simultaneously, we propose a low light image enhancement method that can combined with supervised learning and previous HSV (Hue, Saturation, Value) or Retinex model based image enhancement methods. First, we analyse the relationship between the HSV color space and the Retinex theory, and show that the V channel (V channel in HSV color space, equals the maximum channel in RGB color space) of the enhanced image can well represent the contrast and brightness enhancement process. Then, a data-driven conditional re-enhancement network (denoted as CRENet) is proposed. The network takes low light images as input and the enhanced V channel as condition, then it can re-enhance the contrast and brightness of the low light image and at the same time reduce noise and color distortion. It should be noted that during the training process, any paired images with different exposure time can be used for training, and there is no need to carefully select the supervised images which will save a lot. In addition, it takes less than 20 ms to process a color image with the resolution 400*600 on a 2080Ti GPU. Finally, some comparative experiments are implemented to prove the effectiveness of the method. The results show that the method proposed in this paper can significantly improve the quality of the enhanced image, and by combining with other image contrast enhancement methods, the final enhancement result can even be better than the reference image in contrast and brightness. (Code will be available at https://github.com/hitzhangyu/image-enhancement-with-denoise)

AIDec 29, 2025
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

Jiawei Chen, Xintian Shen, Lihao Zheng et al.

Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.

LGJan 15
We Need a More Robust Classifier: Dual Causal Learning Empowers Domain-Incremental Time Series Classification

Zhipeng Liu, Peibo Duan, Xuan Tang et al.

The World Wide Web thrives on intelligent services that rely on accurate time series classification, which has recently witnessed significant progress driven by advances in deep learning. However, existing studies face challenges in domain incremental learning. In this paper, we propose a lightweight and robust dual-causal disentanglement framework (DualCD) to enhance the robustness of models under domain incremental scenarios, which can be seamlessly integrated into time series classification models. Specifically, DualCD first introduces a temporal feature disentanglement module to capture class-causal features and spurious features. The causal features can offer sufficient predictive power to support the classifier in domain incremental learning settings. To accurately capture these causal features, we further design a dual-causal intervention mechanism to eliminate the influence of both intra-class and inter-class confounding features. This mechanism constructs variant samples by combining the current class's causal features with intra-class spurious features and with causal features from other classes. The causal intervention loss encourages the model to accurately predict the labels of these variant samples based solely on the causal features. Extensive experiments on multiple datasets and models demonstrate that DualCD effectively improves performance in domain incremental scenarios. We summarize our rich experiments into a comprehensive benchmark to facilitate research in domain incremental time series classification.