NI · Jan 2, 2023
Holistic Network Virtualization and Pervasive Network Intelligence for 6G
Xuemin, Shen, Jie Gao et al.
In this tutorial paper, we look into the evolution and prospect of network architecture and propose a novel conceptual architecture for the 6th generation (6G) networks. The proposed architecture has two key elements, i.e., holistic network virtualization and pervasive artificial intelligence (AI). The holistic network virtualization consists of network slicing and digital twin, from the aspects of service provision and service demand, respectively, to incorporate service-centric and user-centric networking. The pervasive network intelligence integrates AI into future networks from the perspectives of networking for AI and AI for networking, respectively. Building on holistic network virtualization and pervasive network intelligence, the proposed architecture can facilitate three types of interplay, i.e., the interplay between digital twin and network slicing paradigms, between model-driven and data-driven methods for network management, and between virtualization and AI, to maximize the flexibility, scalability, adaptivity, and intelligence for 6G networks. We also identify challenges and open issues related to the proposed architecture. By providing our vision, we aim to inspire further discussions and developments on the potential architecture of 6G.
BM · Aug 30, 2024 · Code
Technical Report of HelixFold3 for Biomolecular Structure Prediction
Lihang Liu, Shanzhuo Zhang, Yang Xue et al. · baidu
The AlphaFold series has transformed protein structure prediction with remarkable accuracy, often matching experimental methods. AlphaFold2, AlphaFold-Multimer, and the latest AlphaFold3 represent significant strides in predicting single protein chains, protein complexes, and biomolecular structures. While AlphaFold2 and AlphaFold-Multimer are open-sourced, facilitating rapid and reliable predictions, AlphaFold3 remains partially accessible through a limited online server and has not been open-sourced, restricting further development. To address these challenges, the PaddleHelix team is developing HelixFold3, aiming to replicate AlphaFold3's capabilities. Leveraging insights from previous models and extensive datasets, HelixFold3 achieves accuracy comparable to AlphaFold3 in predicting the structures of the conventional ligands, nucleic acids, and proteins. The initial release of HelixFold3 is available as open source on GitHub for academic research, promising to advance biomolecular research and accelerate discoveries. The latest version will be continuously updated on the HelixFold3 web server, providing both interactive visualization and API access.
CV · May 30
SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy
Jie Gao, Jie Ma, Kaihui Lin et al.
For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety interface between human instructions and physical flight. In human-scale urban airspace below 20 meters, thin geometry, occlusions, vegetation, and urban clutter define whether an aerial agent can safely enter the space ahead. However, existing UAV datasets mainly provide 2D annotations or 3D boxes, while driving-oriented occupancy benchmarks assume stable ground-level sensor rigs. Both miss the defining regime of low-altitude flight: a front-facing monocular camera observing occupied and free space from a moving aerial body with frame-wise changing 6-DoF pose and camera extrinsics. To bridge this gap, we introduce SkyShield, to the best of our knowledge the first front-view monocular semantic occupancy benchmark for urban UAV flight below 20 meters. Built on CARLA, SkyShield contains 36K front-view UAV samples across diverse urban scenes and weather conditions, pairing each image with frame-wise 6-DoF UAV pose, frame-wise dynamic camera geometry, UAV states, and front-frustum semantic occupancy labels. We further propose KAR-mIoU, a UAV-centric and dynamics-aware metric that re-weights voxel-level evaluation by kinematic reachability and time-to-collision, revealing safety-critical risks hidden by conventional mIoU. To tackle this challenging new setting, we provide SkyOcc, a geometry-first monocular baseline that integrates frame-wise UAV attitude into projection, fuses temporal occupancy features, and applies safety-prior optimization to preserve sparse collision-critical structures. Together, SkyShield, KAR-mIoU, and SkyOcc establish occupancy as a safety interface for low-altitude aerial autonomy. Code and dataset will be released publicly.
NI · Oct 6, 2022
Digital Twin-Empowered Network Planning for Multi-Tier Computing
Conghao Zhou, Jie Gao, Mushu Li et al.
In this paper, we design a resource management scheme to support stateful applications, which will be prevalent in 6G networks. Different from stateless applications, stateful applications require context data while executing computing tasks from user terminals (UTs). Using a multi-tier computing paradigm with servers deployed at the core network, gateways, and base stations to support stateful applications, we aim to optimize long-term resource reservation by jointly minimizing the usage of computing, storage, and communication resources and the cost of reconfiguring resource reservation. The coupling among different resources and the impact of UT mobility create challenges in resource management. To address the challenges, we develop digital twin (DT) empowered network planning with two elements, i.e., multi-resource reservation and resource reservation reconfiguration. First, DTs are designed for collecting UT status data, based on which UTs are grouped according to their mobility patterns. Second, an algorithm is proposed to customize resource reservation for different groups to satisfy their different resource demands. Last, a meta-learning-based approach is developed to reconfigure resource reservation for balancing the network resource usage and the reconfiguration cost. Simulation results demonstrate that the proposed DT-empowered network planning outperforms benchmark frameworks by using fewer resources and incurring lower reconfiguration costs.
AI · Aug 21, 2024 · Code
BearLLM: A Prior Knowledge-Enhanced Bearing Health Management Framework with Unified Vibration Signal Representation
Haotian Peng, Jiawei Liu, Jinsong Du et al.
We propose a bearing health management framework leveraging large language models (BearLLM), a novel multimodal model that unifies multiple bearing-related tasks by processing user prompts and vibration signals. Specifically, we introduce a prior knowledge-enhanced unified vibration signal representation to handle various working conditions across multiple datasets. This involves adaptively sampling the vibration signals based on the sampling rate of the sensor, incorporating the frequency domain to unify input dimensions, and using a fault-free reference signal as an auxiliary input. To extract features from vibration signals, we first train a fault classification network, then convert and align the extracted features into word embeddings, and finally concatenate these with text embeddings as input to an LLM. To evaluate the performance of the proposed method, we constructed the first large-scale multimodal bearing health management (MBHM) dataset, including paired vibration signals and textual descriptions. With our unified vibration signal representation, BearLLM using one set of pre-trained weights achieves state-of-the-art performance on nine publicly available fault diagnosis benchmarks, outperforming specific methods designed for individual datasets. We provide a dataset, our model, and code to inspire future research on building more capable industrial multimodal models at https://github.com/SIA-IDE/BearLLM.
HC · Apr 16, 2023
VISAR: A Human-AI Argumentative Writing Assistant with Visual Programming and Rapid Draft Prototyping
Zheng Zhang, Jie Gao, Ranjodh Singh Dhaliwal et al.
In argumentative writing, writers must brainstorm hierarchical writing goals, ensure the persuasiveness of their arguments, and revise and organize their plans through drafting. Recent advances in large language models (LLMs) have made interactive text generation through a chat interface (e.g., ChatGPT) possible. However, this approach often neglects implicit writing context and user intent, lacks support for user control and autonomy, and provides limited assistance for sensemaking and revising writing plans. To address these challenges, we introduce VISAR, an AI-enabled writing assistant system designed to help writers brainstorm and revise hierarchical goals within their writing context, organize argument structures through synchronized text editing and visual programming, and enhance persuasiveness with argumentation spark recommendations. VISAR allows users to explore, experiment with, and validate their writing plans using automatic draft prototyping. A controlled lab study confirmed the usability and effectiveness of VISAR in facilitating the argumentative writing planning process.
CG · Mar 14, 2022
On Cyclic Solutions to the Min-Max Latency Multi-Robot Patrolling Problem
Peyman Afshani, Mark de Berg, Kevin Buchin et al.
We consider the following surveillance problem: Given a set $P$ of $n$ sites in a metric space and a set of $k$ robots with the same maximum speed, compute a patrol schedule of minimum latency for the robots. Here a patrol schedule specifies for each robot an infinite sequence of sites to visit (in the given order) and the latency $L$ of a schedule is the maximum latency of any site, where the latency of a site $s$ is the supremum of the lengths of the time intervals between consecutive visits to $s$. When $k=1$ the problem is equivalent to the travelling salesman problem (TSP) and thus it is NP-hard. We have two main results. We consider cyclic solutions in which the set of sites must be partitioned into $\ell$ groups, for some $\ell \leq k$, and each group is assigned a subset of the robots that move along the travelling salesman tour of the group at equal distance from each other. Our first main result is that approximating the optimal latency of the class of cyclic solutions can be reduced to approximating the optimal travelling salesman tour on some input, with only a $1+\varepsilon$ factor loss in the approximation factor and an $O\left(\left( k/\varepsilon \right)^k\right)$ factor loss in the runtime, for any $\varepsilon >0$. Our second main result shows that an optimal cyclic solution is a $2(1-1/k)$-approximation of the overall optimal solution. Note that for $k=2$ this implies that an optimal cyclic solution is optimal overall. The results have a number of consequences. For the Euclidean version of the problem, for instance, combining our results with known results on Euclidean TSP yields a PTAS for approximating an optimal cyclic solution, and it yields a $(2(1-1/k)+\varepsilon)$-approximation of the optimal unrestricted solution. If the conjecture mentioned above is true, then our algorithm is actually a PTAS for the general problem in the Euclidean setting.
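The latency of a cyclic solution can be illustrated with a small sketch. It assumes the usual reading of the definition above: when $m$ robots move at the same speed $v$, equally spaced along a closed tour of length $T$, every site on that tour is revisited every $T/(mv)$ time units, and the latency of the whole schedule is the maximum over groups. The function names and the `(tour_length, robots)` input format are hypothetical and chosen only for illustration; computing the tours themselves is not shown.

```python
from fractions import Fraction


def group_latency(tour_length, robots, speed=1):
    """Latency of one group: `robots` robots equally spaced on a closed tour."""
    if robots < 1:
        raise ValueError("each group needs at least one robot")
    return Fraction(tour_length) / (robots * Fraction(speed))


def cyclic_latency(groups, speed=1):
    """Latency of a cyclic solution given (tour_length, robots) per group."""
    return max(group_latency(length, robots, speed) for length, robots in groups)


def cyclic_vs_optimal_factor(k):
    """The 2(1 - 1/k) approximation factor stated in the abstract."""
    return 2 * (1 - Fraction(1, k))


if __name__ == "__main__":
    # Two groups: three robots share a tour of length 12, one robot has a tour of length 10.
    print(cyclic_latency([(12, 3), (10, 1)]))
    print(cyclic_vs_optimal_factor(2), cyclic_vs_optimal_factor(4))
```

Exact fractions are used so that the factor for $k=2$ comes out as exactly 1, matching the remark that an optimal cyclic solution is optimal overall in that case.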
LG · Jul 10, 2022
TCR: A Transformer Based Deep Network for Predicting Cancer Drugs Response
Jie Gao, Jing Hu, Wanqing Sun et al.
Predicting clinical outcomes of anti-cancer drugs on a personalized basis is challenging in cancer treatment due to the heterogeneity of tumors. Traditional computational efforts have been made to model the drug response of individual samples described by their molecular profiles, yet overfitting occurs because of the high dimensionality of omics data, which keeps these models from clinical application. Recent research shows that deep learning is a promising approach to building drug response models by learning alignment patterns between drugs and samples. However, existing studies employed a simple feature fusion strategy and only considered the drug features as a whole representation, ignoring the substructure information that may play a vital role when aligning drugs and genes. In this paper, we propose TCR (Transformer based network for Cancer drug Response) to predict anti-cancer drug response. By utilizing an attention mechanism, TCR is able to efficiently learn the interactions between drug atoms/substructures and molecular signatures. Furthermore, a dual loss function and a cross sampling strategy were designed to improve the prediction power of TCR. We show that TCR outperformed all other methods under various data splitting strategies on all evaluation metrics (some with significant improvement). Extensive experiments demonstrate that TCR shows significantly improved generalization ability on independent in-vitro experiments and in-vivo real patient data. Our study highlights the prediction power of TCR and its potential value for cancer drug repurposing and precision oncology treatment.
NI · Nov 20, 2023
Digital Twin-Based User-Centric Edge Continual Learning in Integrated Sensing and Communication
Shisheng Hu, Jie Gao, Xinyu Huang et al.
In this paper, we propose a digital twin (DT)-based user-centric approach for processing sensing data in an integrated sensing and communication (ISAC) system with high accuracy and efficient resource utilization. The considered scenario involves an ISAC device with a lightweight deep neural network (DNN) and a mobile edge computing (MEC) server with a large DNN. After collecting sensing data, the ISAC device either processes the data locally or uploads them to the server for higher-accuracy data processing. To cope with data drifts, the server updates the lightweight DNN when necessary, referred to as continual learning. Our objective is to minimize the long-term average computation cost of the MEC server by optimizing two decisions, i.e., sensing data offloading and sensing data selection for the DNN update. A DT of the ISAC device is constructed to predict the impact of potential decisions on the long-term computation cost of the server, based on which the decisions are made with closed-form formulas. Experiments on executing DNN-based human motion recognition tasks are conducted to demonstrate the outstanding performance of the proposed DT-based approach in computation cost minimization.
SE · Jan 10, 2023
Understanding the Complexity and Its Impact on Testing in ML-Enabled Systems
Junming Cao, Bihuan Chen, Longjie Hu et al.
Machine learning (ML) enabled systems are emerging with recent breakthroughs in ML. A model-centric view is widely taken by the literature to focus only on the analysis of ML models. However, only a small body of work takes a system view that looks at how ML components work with the system and how they affect software engineering for ML-enabled systems. In this paper, we adopt this system view, and conduct a case study on Rasa 3.0, an industrial dialogue system that has been widely adopted by various companies around the world. Our goal is to characterize the complexity of such a large-scale ML-enabled system and to understand the impact of the complexity on testing. Our study reveals practical implications for software engineering for ML-enabled systems.
CL · Jan 9, 2023
Logically at Factify 2: A Multi-Modal Fact Checking System Based on Evidence Retrieval techniques and Transformer Encoder Architecture
Pim Jordi Verschuuren, Jie Gao, Adelize van Eeden et al.
In this paper, we present the Logically submissions to the De-Factify 2 challenge (DE-FACTIFY 2023) on Task 1, Multi-Modal Fact Checking. We describe our submissions to this challenge, including the evidence retrieval and selection techniques we explored, pre-trained cross-modal and unimodal models, and a cross-modal veracity model based on the well-established Transformer Encoder (TE) architecture, which relies heavily on the concept of self-attention. Exploratory analysis is also conducted on the Factify 2 data set, uncovering the salient multi-modal patterns and hypotheses that motivate the architecture proposed in this work. A series of preliminary experiments was run to investigate and benchmark different pre-trained embedding models, evidence retrieval settings, and thresholds. The final system, a standard two-stage evidence-based veracity detection system, yields a weighted avg. of 0.79 on both the validation set and the final blind test set for Task 1, which places 3rd among 9 participants on the leaderboard, a small margin behind the top-performing system.
LG · Sep 6, 2024
Approximating Metric Magnitude of Point Sets
Rayna Andreeva, James Ward, Primoz Skraba et al.
Metric magnitude is a measure of the "size" of point clouds with many desirable geometric properties. It has been adapted to various mathematical contexts and recent work suggests that it can enhance machine learning and optimization algorithms. But its usability is limited due to the computational cost when the dataset is large or when the computation must be carried out repeatedly (e.g. in model training). In this paper, we study the magnitude computation problem, and show efficient ways of approximating it. We show that it can be cast as a convex optimization problem, but not as a submodular optimization. The paper describes two new algorithms - an iterative approximation algorithm that converges fast and is accurate, and a subset selection method that makes the computation even faster. It has been previously proposed that magnitude of model sequences generated during stochastic gradient descent is correlated to generalization gap. Extension of this result using our more scalable algorithms shows that longer sequences in fact bear higher correlations. We also describe new applications of magnitude in machine learning - as an effective regularizer for neural network training, and as a novel clustering criterion.
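For reference, the quantity being approximated can be computed exactly on small inputs. The sketch below uses the standard definition of the magnitude of a finite metric space: build the similarity matrix $Z_{ij} = e^{-t\,d(x_i, x_j)}$, solve $Zw = \mathbf{1}$ for the weighting $w$, and sum its entries. It is a plain cubic-time baseline written with Gauss-Jordan elimination, not either of the two algorithms described in the abstract; the function name `magnitude` and the `scale` parameter name are illustrative choices, and Euclidean distance is assumed.

```python
import math


def magnitude(points, scale=1.0):
    """Exact magnitude of a finite Euclidean point set at the given scale."""
    n = len(points)
    # Augmented system [Z | 1] with Z[i][j] = exp(-scale * d(x_i, x_j)).
    a = [[math.exp(-scale * math.dist(p, q)) for q in points] + [1.0]
         for p in points]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("similarity matrix is singular (duplicate points?)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= factor * a[col][c]
    # The matrix is now diagonal; read off the weighting and sum it.
    return sum(a[i][n] / a[i][i] for i in range(n))


if __name__ == "__main__":
    print(magnitude([(0.0, 0.0)]))                    # a single point
    print(magnitude([(0.0,), (2.0,)]))                # two points at distance 2
    print(magnitude([(0,), (1,), (2,)], scale=50.0))  # well-separated points
```

Two sanity checks follow from the definition: a single point has magnitude 1, and two points at distance $d$ have magnitude $2/(1+e^{-d})$. At large scales the magnitude approaches the number of points, which is the sense in which it measures "size".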
SD · Nov 10, 2025 · Code
E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis
Zhisheng Zhang, Derui Wang, Yifan Mi et al.
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense techniques struggle to address the production large language model (LLM)-based speech synthesis. While previous studies have considered the protection for fine-tuning synthesizers, they assume manually annotated transcripts. Given the labor intensity of manual annotation, end-to-end (E2E) systems leveraging automatic speech recognition (ASR) to generate transcripts are becoming increasingly prevalent, e.g., voice cloning via commercial APIs. Therefore, this E2E speech synthesis also requires new security mechanisms. To tackle these challenges, we propose E2E-VGuard, a proactive defense framework for two emerging threats: (1) production LLM-based speech synthesis, and (2) the novel attack arising from ASR-driven E2E scenarios. Specifically, we employ the encoder ensemble with a feature extractor to protect timbre, while ASR-targeted adversarial examples disrupt pronunciation. Moreover, we incorporate the psychoacoustic model to ensure perturbative imperceptibility. For a comprehensive evaluation, we test 16 open-source synthesizers and 3 commercial APIs across Chinese and English datasets, confirming E2E-VGuard's effectiveness in timbre and pronunciation protection. Real-world deployment validation is also conducted. Our code and demo page are available at https://wxzyd123.github.io/e2e-vguard/.
AI · May 23
Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis
Zhimin Lin, Kun Cheng, Fan Bai et al.
Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, along with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each applying a distinct analytical perspective (Theory-Driven, Data-Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human-annotated codes. Across all settings, perspective-based, peer-debriefing refinement aligns more closely with human codes than a single-LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade-offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.
AS · Apr 20
NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
Yuan Xie, Jiaqi Song, Guang Qiu et al.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate the modality gap and improve parameter efficiency; introduce an iterative asynchronous SFT stage to preserve acoustic fidelity and constrain representation drift; and design an ASR-specialized reinforcement learning stage to further enhance recognition quality and robustness. We additionally incorporate a suite of production-oriented optimizations, including robustness under noisy and silent conditions, real-time streaming inference, and hotword customization via retrieval-augmented generation (RAG). Experiments show that NIM4-ASR achieves state-of-the-art performance on multiple public benchmarks with merely 2.3B parameters, while substantially outperforming larger-scale competitors on internal benchmarks -- particularly in entity-intensive real-world scenarios. NIM4-ASR further supports million-scale hotword customization via RAG with sub-millisecond retrieval latency, enabling efficient adaptation to emerging entities and personalized user requirements.
LG · Jul 12, 2024
RIO-CPD: A Riemannian Geometric Method for Correlation-aware Online Change Point Detection
Chengyuan Deng, Zhengzhang Chen, Xujiang Zhao et al.
Change point detection aims to identify abrupt shifts occurring at multiple points within a data sequence. This task becomes particularly challenging in the online setting, where different types of changes can occur, including shifts in both the marginal and joint distributions of the data. In this paper, we address these challenges by tracking the Riemannian geometry of correlation matrices, allowing Riemannian metrics to compute the geodesic distance as an accurate measure of correlation dynamics. We introduce Rio-CPD, a non-parametric, correlation-aware online change point detection framework that integrates the Riemannian geometry of the manifold of symmetric positive definite matrices with the cumulative sum (CUSUM) statistic for detecting change points. Rio-CPD employs a novel CUSUM design by computing the geodesic distance between current observations and the Fréchet mean of prior observations. With appropriate choices of Riemannian metrics, Rio-CPD offers a simple yet effective and computationally efficient algorithm. Experimental results on both synthetic and real-world datasets demonstrate that Rio-CPD outperforms existing methods on detection accuracy, average detection delay and efficiency.
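The CUSUM side of such a detector can be sketched independently of the geometry. The example below assumes the per-step distances (in the abstract, geodesic distances between the current observation and the Fréchet mean of prior observations) have already been computed and are passed in as plain numbers. It then applies a textbook one-sided CUSUM recursion, $S_t = \max(0, S_{t-1} + d_t - \text{drift})$, raising an alarm when $S_t$ exceeds a threshold. The `drift` and `threshold` parameters and the reset-after-alarm behaviour are illustrative assumptions, not the statistic defined in the paper.

```python
def cusum_alarms(distances, drift, threshold):
    """Return the indices at which a one-sided CUSUM statistic raises an alarm."""
    alarms = []
    statistic = 0.0
    for t, d in enumerate(distances):
        # Accumulate evidence only when the distance exceeds the allowed drift.
        statistic = max(0.0, statistic + d - drift)
        if statistic > threshold:
            alarms.append(t)
            statistic = 0.0  # restart detection after reporting a change
    return alarms


if __name__ == "__main__":
    # Small distances while the correlation structure is stable, larger ones after a shift.
    stream = [0.1] * 5 + [1.0] * 5
    print(cusum_alarms(stream, drift=0.3, threshold=1.5))
```

The gap between the first large distance and the alarm index corresponds to the detection delay that the abstract reports as one of its evaluation criteria.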
IV · Oct 13, 2023
Ultrasound Image Segmentation of Thyroid Nodule via Latent Semantic Feature Co-Registration
Xuewei Li, Yaqiao Zhu, Jie Gao et al.
Segmentation of nodules in thyroid ultrasound imaging plays a crucial role in the detection and treatment of thyroid cancer. However, owing to the diversity of scanner vendors and imaging protocols across hospitals, automatic segmentation models, which have already demonstrated expert-level accuracy in medical image segmentation, lose accuracy in clinically realistic environments because of their weak generalization. To address this issue, this paper proposes ASTN, a framework for thyroid nodule segmentation built on a new type of co-registration network. By extracting latent semantic information from the atlas and target images and using deep features to co-register nodules in thyroid ultrasound images, the framework preserves the integrity of anatomical structures and reduces the impact on segmentation of overall image differences caused by different devices. In addition, this paper provides an atlas selection algorithm to mitigate the difficulty of co-registration. Evaluation results on datasets collected from different devices show that the proposed method greatly improves model generalization while maintaining a high level of segmentation accuracy.
CV · Oct 7, 2022
IDPL: Intra-subdomain adaptation adversarial learning segmentation method based on Dynamic Pseudo Labels
Xuewei Li, Weilun Zhang, Jie Gao et al.
Unsupervised domain adaptation (UDA) has been applied to image semantic segmentation to address domain shift. However, for some difficult categories with poor recognition accuracy, the segmentation results are still not ideal. To this end, this paper proposes an intra-subdomain adaptation adversarial learning segmentation method based on Dynamic Pseudo Labels (IDPL). The whole process consists of three steps. First, an instance-level pseudo label dynamic generation module is proposed, which fuses the class matching information in global classes and local instances, thus adaptively generating the optimal threshold for each class and obtaining high-quality pseudo labels. Second, a subdomain classifier module based on instance confidence is constructed, which dynamically divides the target domain into easy and difficult subdomains according to the relative proportion of easy and difficult instances. Finally, a subdomain adversarial learning module based on self-attention is proposed. It uses multi-head self-attention to confront the easy and difficult subdomains at the class level with the help of the generated high-quality pseudo labels, so as to focus on mining the features of difficult categories in the high-entropy regions of target domain images. This promotes class-level conditional distribution alignment between the subdomains and improves the segmentation performance on difficult categories. Experimental results show that, for the difficult categories, IDPL performs significantly better than other recent mainstream methods.
LG · May 21
Mixture of Complementary Agents for Robust LLM Ensemble
Yichi Zhang, Kevin Lu, Yuang Zhang et al.
Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information and boosting performance. A foundational step in these pipelines is to feed the responses of several proposer LLMs into a summarizer LLM, which synthesizes a better answer. However, choosing which proposers to include is non-trivial. Existing approaches primarily focus either on accuracy (picking the strongest models) or diversity (ensuring variety), and often overlook the interactions among proposers and with the summarizer. We reframe proposer selection as a combinatorial selection problem akin to feature selection, where the value of an LLM lies in its complementarity with others. However, directly applying standard feature-selection algorithms is impractical in the LLM setting due to prohibitive time complexity. Motivated by this limitation, we explore an extensive range of computationally feasible, greedy-style selection algorithms that assess complementarity using a small labeled set. Our experiments validate complementarity as a guiding principle for proposer selection and identify methods that achieve the best performance-cost trade-offs in practice.
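To make the idea of complementarity-driven selection concrete, here is a minimal greedy sketch. It assumes a small labeled set on which each candidate proposer has already been scored, represented as the set of question indices it answers correctly, and it uses coverage (the number of questions at least one selected proposer gets right) as a stand-in for the value of a proposer set. Both the coverage proxy and the function name `greedy_select` are assumptions made for illustration; the abstract does not specify which scoring functions or greedy variants the authors evaluate.

```python
def greedy_select(correct, budget):
    """Greedily pick up to `budget` proposers by marginal coverage gain.

    `correct` maps a proposer name to the set of labeled questions it answers
    correctly. Ties are broken alphabetically so the result is deterministic.
    """
    candidates = dict(correct)
    selected, covered = [], set()
    while candidates and len(selected) < budget:
        best = max(sorted(candidates), key=lambda m: len(candidates[m] - covered))
        if not candidates[best] - covered:
            break  # no remaining proposer adds anything new
        selected.append(best)
        covered |= candidates.pop(best)
    return selected, covered


if __name__ == "__main__":
    scores = {
        "A": {0, 1, 2, 3},  # strong model
        "B": {0, 1, 2, 4},  # equally strong, but mostly redundant with A
        "C": {4, 5},        # weaker alone, complementary to A
    }
    print(greedy_select(scores, budget=2))
```

In this toy input the second pick is the weaker but complementary proposer C rather than the equally accurate B, which is the behaviour that distinguishes complementarity-based selection from picking the strongest models.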
SY · Mar 23
DQN Based Joint UAV Trajectory and Association Planning in NTN Assisted Networks
Afsoon Alidadi Shamsabadi, Cosmas Mwaba, Thomas Nugent et al.
Advanced Air Mobility (AAM) has emerged as a key pillar of next-generation transportation systems, encompassing a wide range of uncrewed aerial vehicle (UAV) applications. To enable AAM, maintaining reliable and efficient communication links between UAVs and control centers is essential. At the same time, the highly dynamic nature of wireless networks, combined with the limited onboard energy of UAVs, makes efficient trajectory planning and network association crucial. Existing terrestrial networks often fail to provide ubiquitous coverage due to frequent handovers and coverage gaps. To address these challenges, geostationary Earth orbit (GEO) satellites offer a promising complementary solution for extending UAV connectivity beyond terrestrial boundaries. This work proposes an integrated GEO terrestrial network architecture to ensure seamless UAV connectivity. Leveraging artificial intelligence (AI), a deep Q network (DQN) based algorithm is developed for joint UAV trajectory and association planning (JUTAP), aiming to minimize energy consumption, handover frequency, and disconnectivity. Simulation results validate the effectiveness of the proposed algorithm within the integrated GEO terrestrial framework.
AI · Apr 12
Enhancing Cross-Problem Vehicle Routing via Federated Learning
Xiangchi Meng, Jianan Zhou, Jie Gao et al.
Vehicle routing problems (VRPs) constitute a core optimization challenge in modern logistics and supply chain management. Recent neural combinatorial optimization (NCO) methods have demonstrated superior efficiency over some traditional algorithms. Although cross-problem learning serves as a primary NCO approach for solving general VRPs, current cross-problem learning paradigms still suffer from performance degradation and generalizability decay when transferring from simple VRP variants to those involving different and complex constraints. To strengthen these paradigms, this paper offers an innovative "Multi-problem Pre-train, then Single-problem Fine-tune" framework with Federated Learning (MPSF-FL). This framework exploits the common knowledge of a federated global model to foster efficient cross-problem knowledge sharing and transfer among local models for single-problem fine-tuning. In this way, local models effectively retain common VRP knowledge from the up-to-date global model, while being efficiently adapted to downstream VRPs with heterogeneous complex constraints. Experimental results demonstrate that our framework not only enhances performance on diverse VRPs, but also improves generalizability to unseen problems.
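The abstract does not say how the federated global model is aggregated from the local models. As a generic illustration of the federated step only, the sketch below uses federated averaging (FedAvg), a common choice in which the global parameters are the sample-weighted mean of the local parameters. Treating each model as a flat list of floats, and the function name `fed_avg`, are simplifications assumed here; this is not presented as the MPSF-FL aggregation rule.

```python
def fed_avg(local_params, num_samples):
    """Sample-weighted average of local parameter vectors (FedAvg-style)."""
    if not local_params or len(local_params) != len(num_samples):
        raise ValueError("need one sample count per local model")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(local_params[0])
    return [
        sum(params[i] * n for params, n in zip(local_params, num_samples)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Two local models (e.g. fine-tuned on different VRP variants) with 1 and 3 samples.
    print(fed_avg([[1.0, 2.0], [3.0, 6.0]], [1, 3]))
```

In a pre-train-then-fine-tune loop of the kind the abstract describes, a step like this would produce the global model that each local model starts from before adapting to its own VRP variant.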
LGJun 22, 2023
Efficient Partitioning Method of Large-Scale Public Safety Spatio-Temporal Data based on Information Loss ConstraintsJie Gao, Yawen Li, Zhe Xue et al.
The storage, management, and application of massive spatio-temporal data are widely applied in various practical scenarios, including public safety. However, due to the unique spatio-temporal distribution characteristics of real-world data, most existing methods have limitations in terms of the spatio-temporal proximity of data and load balancing in distributed storage. Therefore, this paper proposes an efficient partitioning method of large-scale public safety spatio-temporal data based on information loss constraints (IFL-LSTP). The IFL-LSTP model specifically targets large-scale spatio-temporal point data by combining the spatio-temporal partitioning module (STPM) with the graph partitioning module (GPM). This approach can significantly reduce the scale of data while maintaining the model's accuracy, in order to improve the partitioning efficiency. It can also ensure the load balancing of distributed storage while maintaining spatio-temporal proximity of the data partitioning results. This method provides a new solution for distributed storage of massive spatio-temporal data. The experimental results on multiple real-world datasets demonstrate the effectiveness and superiority of IFL-LSTP.
CGMar 23
Charting the Diameter Computation Landscape of Geometric Intersection Graphs in Three Dimensions and HigherTimothy M. Chan, Hsien-Chih Chang, Jie Gao et al.
Recent research on computing the diameter of geometric intersection graphs has made significant strides, primarily focusing on the 2D case where truly subquadratic-time algorithms were given for simple objects such as unit-disks and (axis-aligned) squares. However, in three or higher dimensions, there is no known truly subquadratic-time algorithm for any intersection graph of non-trivial objects, even basic ones such as unit balls or (axis-aligned) unit cubes. This was partially explained by the pioneering work of Bringmann et al. [SoCG '22] which gave several truly subquadratic lower bounds, notably for unit balls or unit cubes in 3D when the graph diameter $\Delta$ is at least $Ω(\log n)$, hinting at a pessimistic outlook for the complexity of the diameter problem in higher dimensions. In this paper, we substantially extend the landscape of diameter computation for objects in three and higher dimensions, giving a few positive results. Our highlighted findings include: - A truly subquadratic-time algorithm for deciding if the diameter of unit cubes in 3D is at most 3 (Diameter-3 hereafter), the first algorithm of its kind for objects in 3D or higher dimensions. Our algorithm is based on a novel connection to pseudolines, which is of independent interest. - A truly subquadratic time lower bound for Diameter-3 of unit balls in 3D under the Orthogonal Vector (OV) hypothesis, giving the first separation between unit balls and unit cubes in the small diameter regime. Previously, computing the diameter for both objects was known to be truly subquadratic hard when the diameter is $Ω(\log n)$. - A near-linear-time algorithm for Diameter-2 of unit cubes in 3D, generalizing the previous result for unit squares in 2D. - A truly subquadratic-time algorithm and lower bound for Diameter-2 and Diameter-3 of rectangular boxes (of arbitrary dimension and sizes), respectively.
CLMay 15
Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric StudyJie Gao, Yongan Yu, Junzhu Su et al.
Adaptive learning refers to educational technologies that track learners' learning progress and adapt the instructional process based on individual learners' learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.
AIMay 15
Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDPIgor Bogdanov, Chung-Horng Lung, Thomas Kunz et al.
Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
AIMay 15
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population BroadcastIgor Bogdanov, Chung-Horng Lung, Thomas Kunz et al.
Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7$\times$ over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below $-100$) to as low as $\sim$1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with $\sim$40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
LGFeb 3, 2024Code
Composite Active Learning: Towards Multi-Domain Active Learning with Theoretical GuaranteesGuang-Yuan Hao, Hengguan Huang, Haotian Wang et al.
Active learning (AL) aims to improve model performance within a fixed labeling budget by choosing the most informative data points to label. Existing AL focuses on the single-domain setting, where all data come from the same domain (e.g., the same dataset). However, many real-world tasks often involve multiple domains. For example, in visual recognition, it is often desirable to train an image classifier that works across different environments (e.g., different backgrounds), where images from each environment constitute one domain. Such a multi-domain AL setting is challenging for prior methods because they (1) ignore the similarity among different domains when assigning labeling budget and (2) fail to handle distribution shift of data across different domains. In this paper, we propose the first general method, dubbed composite active learning (CAL), for multi-domain AL. Our approach explicitly considers the domain-level and instance-level information in the problem; CAL first assigns domain-level budgets according to domain-level importance, which is estimated by optimizing an upper error bound that we develop; with the domain-level budgets, CAL then leverages a certain instance-level query strategy to select samples to label from each domain. Our theoretical analysis shows that our method achieves a better error bound compared to current AL methods. Our empirical results demonstrate that our approach significantly outperforms the state-of-the-art AL methods on both synthetic and real-world multi-domain datasets. Code is available at https://github.com/Wang-ML-Lab/multi-domain-active-learning.
CLOct 26, 2025Code
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified RepresentationCanxiang Yan, Chunxiang Jin, Dawei Huang et al.
Existing speech models suffer from competing requirements on token representations by understanding and generation tasks. This discrepancy in representation prevents speech language models from performing instruction-based free-form editing. To solve this challenge, we introduce a novel framework that unifies speech understanding, generation, and editing. The core of our unified model is a unified continuous speech tokenizer MingTok-Audio, the first continuous tokenizer to effectively integrate semantic and acoustic features, which makes it suitable for both understanding and generation tasks. Based on this unified continuous audio tokenizer, we developed the speech language model Ming-UniAudio, which achieved a balance between generation and understanding capabilities. Ming-UniAudio sets new state-of-the-art (SOTA) records on 8 out of 12 metrics on the ContextASR benchmark. Notably, for Chinese voice cloning, it achieves a highly competitive Seed-TTS-WER of 0.95. Leveraging this foundational model, we further trained a dedicated speech editing model Ming-UniAudio-Edit, the first speech language model that enables universal, free-form speech editing guided solely by natural language instructions, handling both semantic and acoustic modifications without timestamp condition. To rigorously assess the editing capability and establish a foundation for future research, we introduce Ming-Freeform-Audio-Edit, the first comprehensive benchmark tailored for instruction-based free-form speech editing, featuring diverse scenarios and evaluation dimensions spanning semantic correctness, acoustic quality, and instruction alignment. We open-sourced the continuous audio tokenizer, the unified foundational model, and the free-form instruction-based editing model to facilitate the development of unified audio understanding, generation, and manipulation.
LGJul 2, 2025Code
Test-Time Scaling with Reflective Generative ModelZixiao Wang, Yuxin Wang, Xiaorui Wang et al.
We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory predicting and scoring respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model, which can directly learn the high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, MetaStone-S1 is naturally suitable for test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Experiments demonstrate that our MetaStone-S1 achieves comparable performance to OpenAI o3-mini's series with only 32B parameter size. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
AIMay 13
How to Interpret Agent BehaviorJie Gao, Kaiser Sun, Jen-tse Huang et al.
Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in unstructured natural-language form, making it difficult for humans to interpret at scale. We introduce ACT*ONOMY (a combination of Action and Taxonomy), a taxonomy for describing and analyzing agent behavior at runtime. ACT*ONOMY has two components: (1) the taxonomy itself, developed through Grounded Theory and structured as a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories; and (2) an open repository that hosts the living taxonomy, provides an automated analysis pipeline that applies it to agent trajectory analysis, and defines an extension protocol for customization and growth. Our experiments show that ACT*ONOMY can compare behavioral profiles across agents and characterize a single agent's behavior across diverse trajectories, surfacing patterns indicative of failure modes. By providing a shared vocabulary, ACT*ONOMY helps researchers, agent designers, and end users interpret agent behavior more consistently, enabling better oversight and control.
CGMay 11
Charting the Diameter Computation Landscape on Intersection Graphs in the PlaneTimothy M. Chan, Hsien-Chih Chang, Jie Gao et al.
Computing the diameter of the intersection graphs of objects is a basic problem in computational geometry. Previous works showed that the complexity of computing the diameter mainly depends on the object types: for unit disks and squares in 2D, the problem is solvable in truly subquadratic time, while for other objects, including unit segments and equilateral triangles in 2D or unit balls and axis-parallel unit cubes in 3D, there is no truly subquadratic time algorithm under the Orthogonal Vector (OV) hypothesis. We undertake a comprehensive study of computing the diameter of geometric intersection graphs for various types of objects. We discover many new irregularities, showing that the landscape is extremely nuanced: the source of hardness is a combination of the object type, the true diameter value, and how the objects intersect with each other. Our highlighted results for the 2D case include: 1. The diameter of non-degenerate, axis-aligned line segments can be computed in truly subquadratic time. Previous hardness result for line segments applies only to degenerate instances. On the other hand, for the degenerate case, we show that a truly subquadratic time algorithm exists when the true diameter is constant. 2. An almost-linear-time algorithm for unit-square graphs of constant diameter. Previous algorithms rely on succinct representation assuming bounded VC-dimension; for such a strategy $Ω(n^{7/4})$ time is an inherent barrier. 3. An $\tilde{O}(n^{4/3})$-time algorithm to decide if the diameter of a unit-disk graph is at most 2. This improves upon the recent algorithm with running time $\tilde{O}(n^{2-1/9})$. 4. Deciding if the diameter of intersection graphs of fat triangles or line segments is at most 2 is truly subquadratic-hard under fine-grained complexity assumptions. Previous lower bounds only hold when deciding if diameter is at most 3.
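For orientation, the quadratic-time baseline that these subquadratic results are measured against can be sketched as follows. This is a minimal illustration, not any algorithm from the paper: it builds the unit-disk intersection graph explicitly and runs a BFS from every vertex. The function names and the convention that a unit disk has radius 1 (so two disks intersect when their centers are at distance at most 2) are assumptions made for this example.

```python
from collections import deque
from itertools import combinations
from math import dist


def build_unit_disk_graph(centers, radius=1.0):
    """Adjacency lists: two disks intersect when centers are within 2 * radius."""
    adj = {i: [] for i in range(len(centers))}
    for i, j in combinations(range(len(centers)), 2):
        if dist(centers[i], centers[j]) <= 2 * radius:
            adj[i].append(j)
            adj[j].append(i)
    return adj


def eccentricity(adj, source):
    """Largest BFS distance from source; infinity if some vertex is unreachable."""
    seen = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return max(seen.values()) if len(seen) == len(adj) else float("inf")


def diameter(adj):
    """Naive diameter of a non-empty graph: one BFS per vertex."""
    return max(eccentricity(adj, s) for s in adj)


def diameter_at_most(adj, k):
    """Decision version (e.g. k = 2), answered here via the naive diameter."""
    return diameter(adj) <= k
```

Four disks centered at `(0, 0)`, `(2, 0)`, `(4, 0)`, `(6, 0)` form a path, so `diameter` returns 3 and `diameter_at_most(adj, 2)` is false. Both the graph construction and the all-sources BFS take at least quadratic time in the worst case, which is the cost the truly subquadratic algorithms avoid.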
LGSep 14, 2024
Turbo your multi-modal classification with contrastive learningZhiyu Zhang, Da Liu, Shengqiang Liu et al.
Contrastive learning has become one of the most impressive approaches for multi-modal representation learning. However, previous multi-modal works mainly focused on cross-modal understanding, ignoring in-modal contrastive learning, which limits the representation of each modality. In this paper, we propose a novel contrastive learning strategy, called Turbo, to promote multi-modal understanding by joint in-modal and cross-modal contrastive learning. Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality. With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training. Finally, we combine the self-supervised Turbo with the supervised multi-modal classification and demonstrate its effectiveness on two audio-text classification tasks, where the state-of-the-art performance is achieved on a speech emotion recognition benchmark dataset.
AIFeb 2
Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial OptimizationXia Jiang, Jing Chen, Cong Zhang et al.
While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) -- searching high-dimensional solution spaces under hard constraints -- remains underexplored. To bridge the gap, we introduce NLCO, a Natural Language Combinatorial Optimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a language-described decision-making scenario, the model must output a discrete solution without writing code or calling external solvers. NLCO covers 43 CO problems and is organized using a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes, enabling fine-grained evaluation. We provide solver-annotated solutions and comprehensively evaluate LLMs by feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even if more tokens are used for reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.
CYAug 10, 2021Code
Enhancing Knowledge Tracing via Adversarial TrainingXiaopeng Guo, Zhijie Huang, Jie Gao et al.
We study the problem of knowledge tracing (KT) where the goal is to trace the students' knowledge mastery over time so as to make predictions on their future performance. Owing to the good representation capacity of deep neural networks (DNNs), recent advances on KT have increasingly concentrated on exploring DNNs to improve the performance of KT. However, we empirically reveal that DNN-based KT models may run the risk of overfitting, especially on small datasets, leading to limited generalization. In this paper, by leveraging the current advances in adversarial training (AT), we propose an efficient AT-based KT method (ATKT) to enhance the KT model's generalization and thus push the limit of KT. Specifically, we first construct adversarial perturbations and add them to the original interaction embeddings as adversarial examples. The original and adversarial examples are further used to jointly train the KT model, forcing it not only to be robust to the adversarial examples, but also to enhance the generalization over the original ones. To better implement AT, we then present an efficient attentive-LSTM model as KT backbone, where the key is a proposed knowledge hidden state attention module that adaptively aggregates information from previous knowledge hidden states while simultaneously highlighting the importance of current knowledge hidden state to make a more accurate prediction. Extensive experiments on four public benchmark datasets demonstrate that our ATKT achieves new state-of-the-art performance. Code is available at: https://github.com/xiaopengguo/ATKT.
CGMar 20
Locality Sensitive Hashing in Hyperbolic SpaceChengyuan Deng, Jie Gao, Kevin Lu et al.
For a metric space $(X, d)$, a family $\mathcal{H}$ of locality sensitive hash functions is called $(r, cr, p_1, p_2)$-sensitive if a randomly chosen function $h\in \mathcal{H}$ has probability at least $p_1$ (at most $p_2$) to map any $a, b\in X$ in the same hash bucket if $d(a, b)\leq r$ (or $d(a, b)\geq cr$). Locality Sensitive Hashing (LSH) is one of the most popular techniques for approximate nearest-neighbor search in high-dimensional spaces, and has been studied extensively for Hamming, Euclidean, and spherical geometries. An $(r, cr, p_1, p_2)$-sensitive hash function enables approximate nearest neighbor search (i.e., returning a point within distance $cr$ from a query $q$ if there exists a point within distance $r$ from $q$) with space $O(n^{1+\rho})$ and query time $O(n^{\rho})$ where $\rho=\frac{\log 1/p_1}{\log 1/p_2}$. But LSH for hyperbolic spaces $\mathbb{H}^d$ remains largely unexplored. In this work, we present the first LSH construction native to hyperbolic space. For the hyperbolic plane $(d=2)$, we show a construction achieving $\rho\leq 1/c$, based on the hyperplane rounding scheme. For general hyperbolic spaces $(d \geq 3)$, we use dimension reduction from $\mathbb{H}^d$ to $\mathbb{H}^2$ and the 2D hyperbolic LSH to get $\rho\leq 1.59/c$. On the lower bound side, we show that the lower bound on $\rho$ of Euclidean LSH extends to the hyperbolic setting via local isometry, therefore giving $\rho\geq 1/c^2$.
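The exponent in the abstract, the ratio of log(1/p_1) to log(1/p_2), is easy to evaluate numerically. The sketch below only illustrates that formula and the resulting space and query-time orders; it is not the hyperbolic construction from the paper, and the function names are hypothetical.

```python
from math import log


def lsh_exponent(p1, p2):
    """Exponent log(1/p1) / log(1/p2) of an (r, cr, p1, p2)-sensitive family."""
    if not (0.0 < p2 < p1 < 1.0):
        raise ValueError("expected 0 < p2 < p1 < 1")
    return log(1.0 / p1) / log(1.0 / p2)


def lsh_costs(n, rho):
    """Orders n^(1 + rho) for space and n^rho for query time, constants ignored."""
    return n ** (1.0 + rho), n ** rho
```

With `p1 = 0.5` and `p2 = 0.25` the exponent is 0.5, so a dataset of a million points would need on the order of a billion units of space and about a thousand operations per query, ignoring constants and logarithmic factors. A smaller exponent, such as the 1/c bound stated for the hyperbolic plane, pushes both costs toward linear space and constant query time as c grows.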
CYMar 26
To Use or Not to Use: Investigating Student Perceptions of Faculty Generative AI Usage in Higher EducationJie Gao, Jiayi Zhang, Dan Chen
While Generative AI (GenAI) has been rapidly integrated into higher education, existing research has primarily focused on regulating student use. As a result, student perspectives on faculty adoption of GenAI remain unexplored. In this study, we analyzed survey responses from 156 undergraduate and graduate students to examine their attitudes toward both student and faculty use of GenAI. We classified students into four groups based on their attitudes: GenAI Optimists, Student Support Group, Faculty Support Group, and Non-supporters. Findings show that 37% of participants do not support GenAI use by either students or faculty, while 31% support GenAI use in both contexts. We also conducted thematic analysis to understand participants' concerns about faculty GenAI usage. Results revealed that (1) a majority of students (79%) questioned the validity and reliability of GenAI-generated responses, and (2) 37% of students feared that faculty overreliance on GenAI created a "futile cycle" that might reduce faculty critical thinking. Our findings show that students expressed concerns about GenAI use by faculty in teaching and grading contexts, with pedagogical concerns being most prominent. These findings inform the future use of GenAI in teaching and learning in higher education.
BMApr 16, 2024
HelixFold-Multimer: Elevating Protein Complex Structure Prediction to New HeightsXiaomin Fang, Jie Gao, Jing Hu et al.
While monomer protein structure prediction tools boast impressive accuracy, the prediction of protein complex structures remains a daunting challenge in the field. This challenge is particularly pronounced in scenarios involving complexes with protein chains from different species, such as antigen-antibody interactions, where accuracy often falls short. Limited by the accuracy of complex prediction, tasks based on precise protein-protein interaction analysis also face obstacles. In this report, we highlight the ongoing advancements of our protein complex structure prediction model, HelixFold-Multimer, underscoring its enhanced performance. HelixFold-Multimer provides precise predictions for diverse protein complex structures, especially in therapeutic protein interactions. Notably, HelixFold-Multimer achieves remarkable success in antigen-antibody and peptide-protein structure prediction, greatly surpassing AlphaFold 3. HelixFold-Multimer is now available for public use on the PaddleHelix platform, offering both a general version and an antigen-antibody version. Researchers can conveniently access and utilize this service for their development needs.
LGNov 16, 2024
Neuc-MDS: Non-Euclidean Multidimensional Scaling Through Bilinear FormsChengyuan Deng, Jie Gao, Kevin Lu et al.
We introduce Non-Euclidean-MDS (Neuc-MDS), an extension of classical Multidimensional Scaling (MDS) that accommodates non-Euclidean and non-metric inputs. The main idea is to generalize the standard inner product to symmetric bilinear forms to utilize the negative eigenvalues of dissimilarity Gram matrices. Neuc-MDS efficiently optimizes the choice of (both positive and negative) eigenvalues of the dissimilarity Gram matrix to reduce STRESS, the sum of squared pairwise error. We provide an in-depth error analysis and proofs of the optimality in minimizing lower bounds of STRESS. We demonstrate Neuc-MDS's ability to address limitations of classical MDS raised by prior research, and test it on various synthetic and real-world datasets in comparison with both linear and non-linear dimension reduction methods.
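The idea of using negative eigenvalues through a symmetric bilinear form can be written out generically as below. This is the standard pseudo-Euclidean embedding, given as a sketch rather than the paper's exact formulation: the notation ($D$ for the matrix of squared dissimilarities, $C$ for the centering matrix, $S$ for the set of selected eigenvalue indices, $p$ and $q$ for the numbers of positive and negative eigenvalues kept) is assumed here, and the rule Neuc-MDS uses to choose $S$ is not reproduced.

```latex
% Assumed notation: D is the n-by-n matrix of squared dissimilarities.
\begin{align*}
C &= I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top},
  & B &= -\tfrac{1}{2}\, C D C = U \Lambda U^{\top}, \\
\langle x, y \rangle_{p,q} &= \sum_{i=1}^{p} x_i y_i \;-\; \sum_{j=p+1}^{p+q} x_j y_j,
  & x_a &= \bigl(\sqrt{|\lambda_k|}\; U_{ak}\bigr)_{k \in S}, \\
\langle x_a, x_b \rangle_{p,q} &= \sum_{k \in S} \lambda_k\, U_{ak} U_{bk} \;\approx\; B_{ab}.
\end{align*}
```

Coordinates from positive eigenvalues occupy the first $p$ positions and those from negative eigenvalues the last $q$. Classical MDS corresponds to $q = 0$ with $S$ holding the largest positive eigenvalues, which discards whatever part of $B$ the negative eigenvalues carry.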
ASApr 9
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMsYuan Xie, Jiaqi Song, Guang Qiu et al.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.
MAFeb 5, 2025
Position: Emergent Machina Sapiens Urge Rethinking Multi-Agent ParadigmsHepeng Li, Yuhong Liu, Jun Yan et al.
Artificial Intelligence (AI) agents capable of autonomous learning and independent decision-making hold great promise for addressing complex challenges across various critical infrastructure domains, including transportation, energy systems, and manufacturing. However, the surge in the design and deployment of AI systems, driven by various stakeholders with distinct and unaligned objectives, introduces a crucial challenge: How can uncoordinated AI systems coexist and evolve harmoniously in shared environments without creating chaos or compromising safety? To address this, we advocate for a fundamental rethinking of existing multi-agent frameworks, such as multi-agent systems and game theory, which are largely limited to predefined rules and static objective structures. We posit that AI agents should be empowered to adjust their objectives dynamically, make compromises, form coalitions, and safely compete or cooperate through evolving relationships and social feedback. Through two case studies in critical infrastructure applications, we call for a shift toward the emergent, self-organizing, and context-aware nature of these multi-agentic AI systems.
DSApr 10
Packing Compact Subgraphs with Applications to DistrictingHo-Lin Chen, Po-Yu Chou, Prathamesh Dharangutte et al.
Packing disjoint subgraphs in a given graph is a fundamental problem with many applications. Motivated by political districting, we focus on connected subgraphs that are compact (e.g., having constant radius from a single center vertex) and that satisfy additional composition requirements, such as a minimum population/weight threshold or balanced weight types (e.g., political affiliations). We aim to maximize coverage by disjoint districts that meet these requirements. In this work, we present new results that substantially improve the previously known bounds on balanced star districts for planar and minor-free graphs (Dharangutte et al. 2025). In particular, we improve the approximation factor from $O(\log n)$ to $O(1)$ for packing balanced star districts using the exact same algorithm, but with a refined analysis. We also extend the results beyond planar graphs to minor-free graphs and an even broader family of graphs of bounded expansion. Additionally, we obtain an $O(1)$ approximation for packing radius-$k$ districts (with a constant $k$) in planar and apex-minor-free graphs. In order to get a $(1+\varepsilon)$ approximation on the max coverage, we show that this can be achieved if we allow a slight relaxation of the balancedness parameters (by a factor that can be made arbitrarily close to $1$), for bounded radius-$k$ districts on planar and apex-minor-free graphs. We show that all of these results can also be obtained if we enforce a minimum weight threshold for each district as the composition requirement, rather than balancedness. We present various results on hardness and hardness of approximation for this variant, by graph and district types.
LGOct 22, 2025
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context ReasoningLing Team, Bin Han, Caizhi Tang et al.
In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging linghe, our self-developed high-performance FP8 operator library, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
ASAug 21, 2025
Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNetsChenlin Liu, Minghui Fang, Patrick Zhang et al.
Language Model (LM)-based Text-to-Speech (TTS) systems often generate hallucinated speech that deviates from input text. Existing mitigation strategies either demand excessive training resources or introduce significant inference latency. In this paper, we propose GFlOwNet-guided distribution AlignmenT (GOAT) for LM-based TTS, a post-training framework that mitigates hallucinations without relying on massive resources or inference cost. Specifically, we first conduct an uncertainty analysis, revealing a strong positive correlation between hallucination and model uncertainty. Based on this, we reformulate TTS generation as a trajectory flow optimization problem and introduce an enhanced Subtrajectory Balance objective together with a sharpened internal reward as target distribution. We further integrate reward temperature decay and learning rate optimization for stability and performance balance. Extensive experiments show that GOAT reduces character error rates by over 50% on challenging test cases and lowers uncertainty by up to 58%, demonstrating its strong generalization ability and effectiveness.
DSMay 30, 2025
Randomized Dimensionality Reduction for Euclidean Maximization and Diversity MeasuresJie Gao, Rajesh Jayaram, Benedikt Kolbe et al.
Randomized dimensionality reduction is a widely-used algorithmic technique for speeding up large-scale Euclidean optimization problems. In this paper, we study dimension reduction for a variety of maximization problems, including max-matching, max-spanning tree, max TSP, as well as various measures for dataset diversity. For these problems, we show that the effect of dimension reduction is intimately tied to the doubling dimension $λ_X$ of the underlying dataset $X$ -- a quantity measuring intrinsic dimensionality of point sets. Specifically, we prove that a target dimension of $O(λ_X)$ suffices to approximately preserve the value of any near-optimal solution, which we also show is necessary for some of these problems. This is in contrast to classical dimension reduction results, whose dependence increases with the dataset size $|X|$. We also provide empirical results validating the quality of solutions found in the projected space, as well as speedups due to dimensionality reduction.
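As a rough illustration of the idea in this abstract, the sketch below projects a point set with low intrinsic dimension through a scaled Gaussian random matrix and compares one simple diversity measure, the sum of pairwise distances, before and after projection. The function names, the synthetic data, the target dimension of 40, and the choice of diversity measure are all assumptions made for this example; the paper's own constructions and bounds are not reproduced here.

```python
import numpy as np

def gaussian_projection(X, target_dim, rng):
    """Project the rows of X to target_dim using a scaled Gaussian matrix."""
    d = X.shape[1]
    G = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
    return X @ G

def sum_pairwise_distances(X):
    """Sum of Euclidean distances over all unordered pairs of rows."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    dist = np.sqrt(np.maximum(d2, 0.0))  # clamp tiny negative round-off
    return dist[np.triu_indices(len(X), k=1)].sum()

rng = np.random.default_rng(0)
# Hypothetical data: 200 points lying on a 3-dimensional subspace of R^500.
latent = rng.standard_normal((200, 3))
embed = rng.standard_normal((3, 500))
X = latent @ embed

Y = gaussian_projection(X, 40, rng)
ratio = sum_pairwise_distances(Y) / sum_pairwise_distances(X)
print(f"projected / original diversity: {ratio:.3f}")
```

The printed ratio should sit near 1 for data like this, which is the qualitative behavior the abstract describes; the exact value depends on the random seed.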
LGMar 12, 2025
Understanding Endogenous Data Drift in Adaptive Models with Recourse-Seeking UsersBo-Yi Liu, Zhi-Xuan Liu, Kuan Lun Chen et al.
Deep learning models are widely used in decision-making and recommendation systems, where they typically rely on the assumption of a static data distribution between training and deployment. However, real-world deployment environments often violate this assumption. Users who receive negative outcomes may adapt their features to meet model criteria, i.e., take recourse actions. These adaptive behaviors create shifts in the data distribution, and when models are retrained on this shifted data, a feedback loop emerges: user behavior influences the model, and the updated model in turn reshapes future user behavior. Despite its importance, this bidirectional interaction between users and models has received limited attention. In this work, we develop a general framework to model user strategic behaviors and their interactions with decision-making systems under resource constraints and competitive dynamics. Both the theoretical and empirical analyses show that user recourse behavior tends to push logistic and MLP models toward increasingly higher decision standards, resulting in higher recourse costs and less reliable recourse actions over time. To mitigate these challenges, we propose two methods--Fair-top-k and Dynamic Continual Learning (DCL)--which significantly reduce recourse cost and improve model robustness. Our findings draw connections to economic theories, highlighting how algorithmic decision-making can unintentionally reinforce a higher standard and generate endogenous barriers to entry.
DSApr 22, 2025
On the Price of Differential Privacy for Hierarchical ClusteringChengyuan Deng, Jie Gao, Jalaj Upadhyay et al.
Hierarchical clustering is a fundamental unsupervised machine learning task with the aim of organizing data into a hierarchy of clusters. Many applications of hierarchical clustering involve sensitive user information, therefore motivating recent studies on differentially private hierarchical clustering under the rigorous framework of Dasgupta's objective. However, it has been shown that any privacy-preserving algorithm under edge-level differential privacy necessarily suffers a large error. To capture practical applications of this problem, we focus on the weight privacy model, where each edge of the input graph is at least unit weight. We present a novel algorithm in the weight privacy model that shows significantly better approximation than known impossibility results in the edge-level DP setting. In particular, our algorithm achieves $O(\log^{1.5}n/\varepsilon)$ multiplicative error for $\varepsilon$-DP and runs in polynomial time, where $n$ is the size of the input graph, and the cost is never worse than the optimal additive error in existing work. We complement our algorithm by showing that if the unit-weight constraint does not apply, the lower bound for weight-level DP hierarchical clustering is essentially the same as the edge-level DP, i.e., $Ω(n^2/\varepsilon)$ additive error. As a result, we also obtain a new lower bound of $\tilde{Ω}(1/\varepsilon)$ additive error for balanced sparsest cuts in the weight-level DP model, which may be of independent interest. Finally, we evaluate our algorithm on synthetic and real-world datasets. Our experimental results show that our algorithm performs well in terms of extra cost and has good scalability to large graphs.
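For readers unfamiliar with Dasgupta's objective mentioned above, the following minimal sketch computes it for a small weighted graph: each edge pays its weight times the number of leaves under the lowest tree node that contains both endpoints, so lower cost is better. The nested-tuple tree encoding, the function names, and the toy graph are assumptions for illustration only; no privacy mechanism from the paper is shown.

```python
def leaves(tree):
    """Return the leaf labels under a nested-tuple tree node."""
    if not isinstance(tree, tuple):
        return [tree]
    return [x for child in tree for x in leaves(child)]

def dasgupta_cost(tree, weights):
    """Dasgupta cost of a hierarchy.

    weights maps frozenset({u, v}) to the edge weight; missing pairs count as 0.
    """
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(c) for c in tree]
    n_here = sum(len(group) for group in child_leaves)
    cost = 0.0
    # Edges whose endpoints fall in different children are separated here,
    # so this node is their lowest common ancestor.
    for i in range(len(child_leaves)):
        for j in range(i + 1, len(child_leaves)):
            for u in child_leaves[i]:
                for v in child_leaves[j]:
                    cost += weights.get(frozenset((u, v)), 0.0) * n_here
    return cost + sum(dasgupta_cost(c, weights) for c in tree)

# Toy graph: two heavy edges a-b and c-d joined by a light edge b-c.
weights = {
    frozenset(("a", "b")): 3.0,
    frozenset(("c", "d")): 3.0,
    frozenset(("b", "c")): 1.0,
}
good = dasgupta_cost((("a", "b"), ("c", "d")), weights)  # keeps heavy edges low
bad = dasgupta_cost((("a", "c"), ("b", "d")), weights)   # cuts them at the root
print(good, bad)  # prints 16.0 28.0
```

The tree that merges the heavy edges first has the lower cost, which is the behavior the objective is designed to reward.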
BMDec 13, 2024
Precise Antigen-Antibody Structure Predictions Enhance Antibody Development with HelixFold-MultimerJie Gao, Jing Hu, Lihang Liu et al.
The accurate prediction of antigen-antibody structures is essential for advancing immunology and therapeutic development, as it helps elucidate molecular interactions that underlie immune responses. Despite recent progress with deep learning models like AlphaFold and RoseTTAFold, accurately modeling antigen-antibody complexes remains a challenge due to their unique evolutionary characteristics. HelixFold-Multimer, a specialized model developed for this purpose, builds on the framework of AlphaFold-Multimer and demonstrates improved precision for antigen-antibody structures. HelixFold-Multimer not only surpasses other models in accuracy but also provides essential insights into antibody development, enabling more precise identification of binding sites, improved interaction prediction, and enhanced design of therapeutic antibodies. These advances underscore HelixFold-Multimer's potential in supporting antibody research and therapeutic innovation.
IVNov 21, 2024
CP-UNet: Contour-based Probabilistic Model for Medical Ultrasound Images SegmentationRuiguo Yu, Yiyang Zhang, Yuan Tian et al.
Deep learning-based segmentation methods are widely utilized for detecting lesions in ultrasound images. Throughout the imaging procedure, the attenuation and scattering of ultrasound waves cause contour blurring and the formation of artifacts, limiting the clarity of the acquired ultrasound images. To overcome this challenge, we propose a contour-based probabilistic segmentation model CP-UNet, which guides the segmentation network to enhance its focus on contour during decoding. We design a novel down-sampling module to enable the contour probability distribution modeling and encoding stages to acquire global-local features. Furthermore, the Gaussian Mixture Model utilizes optimized features to model the contour distribution, capturing the uncertainty of lesion boundaries. Extensive experiments with several state-of-the-art deep learning segmentation methods on three ultrasound image datasets show that our method performs better on breast and thyroid lesion segmentation.
IVOct 12, 2024
Diabetic retinopathy image classification method based on GreenBen data augmentationYutong Liu, Jie Gao, Haijiang Zhu
For the diagnosis of diabetic retinopathy (DR) images, this paper proposes a classification method based on artificial intelligence. The core lies in a new data augmentation method, GreenBen, which first extracts the green channel grayscale image from the retinal image and then performs Ben enhancement. Considering that diabetic macular edema (DME) is a complication closely related to DR, this paper constructs a joint classification framework of DR and DME based on multi-task learning and an attention module, and uses GreenBen to enhance its data to reduce the difference of DR images and improve the accuracy of model classification. We conducted extensive experiments on three publicly available datasets, and our method achieved the best results. For GreenBen, whether based on the ResNet50 network or the Swin Transformer network, whether for individual classification or joint DME classification, compared with other data augmentation methods, GreenBen achieved stable and significant improvements in DR classification results, with an accuracy increase of 10%.
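The two-step preprocessing described in this abstract might look roughly like the sketch below. It assumes that "Ben enhancement" refers to the widely used fundus preprocessing that subtracts a Gaussian-blurred copy of the image and recenters the result at mid-gray (4 * image - 4 * blur + 128); the abstract does not state the formula, so the weights, the sigma value, and all function names here are assumptions rather than the authors' implementation.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about three standard deviations."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D array with reflect padding."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="reflect")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def green_ben(rgb, sigma=2.0):
    """Green-channel extraction followed by the assumed Ben-style enhancement."""
    green = rgb[..., 1].astype(np.float64)
    enhanced = 4.0 * green - 4.0 * gaussian_blur(green, sigma) + 128.0
    return np.clip(np.rint(enhanced), 0, 255).astype(np.uint8)

# Hypothetical input: a flat 32x32 RGB image, so the local contrast is zero.
flat = np.zeros((32, 32, 3), dtype=np.uint8)
flat[..., 1] = 100
out = green_ben(flat)
print(out.shape, out.min(), out.max())  # prints (32, 32) 128 128
```

On a flat image the blurred copy equals the input, so every output pixel lands on the mid-gray value of 128; on a real fundus image the same operation emphasizes local structure such as vessels and lesions.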
LGMar 26
Personalizing Mathematical Game-based Learning for Children: A Preliminary StudyJie Gao, Adam K. Dubé
Game-based learning (GBL) is widely adopted in mathematics education. It enhances learners' engagement and critical thinking throughout the mathematics learning process. However, enabling players to learn intrinsically through mathematical games still presents challenges. In particular, effective GBL systems require dozens of high-quality game levels and mechanisms to deliver them to appropriate players in a way that matches their learning abilities. To address this challenge, we propose a framework, guided by adaptive learning theory, that uses artificial intelligence (AI) techniques to build a classifier for player-generated levels. We collect 206 distinct game levels created by both experts and advanced players in Creative Mode, a new tool in a math game-based learning app, and develop a classifier to extract game features and predict valid game levels. The preliminary results show that the Random Forest model is the optimal classifier among the four machine learning classification models (k-nearest neighbors, decision trees, support vector machines, and random forests). This study provides insights into the development of GBL systems, highlighting the potential of integrating AI into the game-level design process to provide more personalized game levels for players.