Mengyu Li

CV
h-index34
20papers
383citations
Novelty50%
AI Score55

20 Papers

CLAug 8, 2024Code
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang et al.

Recent large language models (LLMs) advancements sparked a growing research interest in tool assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), based on a single turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox

IVJun 21, 2023Code
Attention Hybrid Variational Net for Accelerated MRI Reconstruction

Guoyao Shen, Boran Hao, Mengyu Li et al.

The application of compressed sensing (CS)-enabled data reconstruction for accelerating magnetic resonance imaging (MRI) remains a challenging problem. This is due to the fact that the information lost in k-space from the acceleration mask makes it difficult to reconstruct an image similar to the quality of a fully sampled image. Multiple deep learning-based structures have been proposed for MRI reconstruction using CS, both in the k-space and image domains as well as using unrolled optimization methods. However, the drawback of these structures is that they are not fully utilizing the information from both domains (k-space and image). Herein, we propose a deep learning-based attention hybrid variational network that performs learning in both the k-space and image domain. We evaluate our method on a well-known open-source MRI dataset and a clinical MRI dataset of patients diagnosed with strokes from our institution to demonstrate the performance of our network. In addition to quantitative evaluation, we undertook a blinded comparison of image quality across networks performed by a subspecialty trained radiologist. Overall, we demonstrate that our network achieves a superior performance among others under multiple reconstruction tasks.

LGJul 20, 2024Code
Meta-GPS++: Enhancing Graph Meta-Learning with Contrastive Learning and Self-Training

Yonghao Liu, Mengyu Li, Ximing Li et al.

Node classification is an essential problem in graph learning. However, many models typically obtain unsatisfactory performance when applied to few-shot scenarios. Some studies have attempted to combine meta-learning with graph neural networks to solve few-shot node classification on graphs. Despite their promising performance, some limitations remain. First, they employ the node encoding mechanism of homophilic graphs to learn node embeddings, even in heterophilic graphs. Second, existing models based on meta-learning ignore the interference of randomness in the learning process. Third, they are trained using only limited labeled nodes within the specific task, without explicitly utilizing numerous unlabeled nodes. Finally, they treat almost all sampled tasks equally without customizing them for their uniqueness. To address these issues, we propose a novel framework for few-shot node classification called Meta-GPS++. Specifically, we first adopt an efficient method to learn discriminative node representations on homophilic and heterophilic graphs. Then, we leverage a prototype-based approach to initialize parameters and contrastive learning for regularizing the distribution of node embeddings. Moreover, we apply self-training to extract valuable information from unlabeled nodes. Additionally, we adopt S$^2$ (scaling & shifting) transformation to learn transferable knowledge from diverse tasks. The results on real-world datasets show the superiority of Meta-GPS++. Our code is available here.

IVNov 16, 2023Code
Learning to Reconstruct Accelerated MRI Through K-space Cold Diffusion without Noise

Guoyao Shen, Mengyu Li, Chad W. Farris et al.

Deep learning-based MRI reconstruction models have achieved superior performance these days. Most recently, diffusion models have shown remarkable performance in image generation, in-painting, super-resolution, image editing and more. As a generalized diffusion model, cold diffusion further broadens the scope and considers models built around arbitrary image transformations such as blurring, down-sampling, etc. In this paper, we propose a k-space cold diffusion model that performs image degradation and restoration in k-space without the need for Gaussian noise. We provide comparisons with multiple deep learning-based MRI reconstruction models and perform tests on a well-known large open-source MRI dataset. Our results show that this novel way of performing degradation can generate high-quality reconstruction images for accelerated MRI.

LGMay 26, 2022
Efficient Approximation of Gromov-Wasserstein Distance Using Importance Sparsification

Mengyu Li, Jun Yu, Hongteng Xu et al.

As a valid metric of metric-measure spaces, Gromov-Wasserstein (GW) distance has shown the potential for matching problems of structured data like point clouds and graphs. However, its application in practice is limited due to the high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called \textsc{Spar-GW}, to approximate GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. The proposed \textsc{Spar-GW} method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $O(n^4)$ to $O(n^{2+δ})$ for an arbitrary small $δ>0$. Theoretically, the convergence and consistency of the proposed estimation for GW distance are established under mild regularity conditions. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our \textsc{Spar-GW} to state-of-the-art methods in both synthetic and real-world tasks.

MLJun 11, 2023
Importance Sparsification for Sinkhorn Algorithm

Mengyu Li, Jun Yu, Tao Li et al.

Sinkhorn algorithm has been used pervasively to approximate the solution to optimal transport (OT) and unbalanced optimal transport (UOT) problems. However, its practical application is limited due to the high computational complexity. To alleviate the computational burden, we propose a novel importance sparsification method, called Spar-Sink, to efficiently approximate entropy-regularized OT and UOT solutions. Specifically, our method employs natural upper bounds for unknown optimal transport plans to establish effective sampling probabilities, and constructs a sparse kernel matrix to accelerate Sinkhorn iterations, reducing the computational cost of each iteration from $O(n^2)$ to $\widetilde{O}(n)$ for a sample of size $n$. Theoretically, we show the proposed estimators for the regularized OT and UOT problems are consistent under mild regularity conditions. Experiments on various synthetic data demonstrate Spar-Sink outperforms mainstream competitors in terms of both estimation error and speed. A real-world echocardiogram data analysis shows Spar-Sink can effectively estimate and visualize cardiac cycles, from which one can identify heart failure and arrhythmia. To evaluate the numerical accuracy of cardiac cycle prediction, we consider the task of predicting the end-systole time point using the end-diastole one. Results show Spar-Sink performs as well as the classical Sinkhorn algorithm, requiring significantly less computational time.

CVFeb 5
A Hybrid CNN and ML Framework for Multi-modal Classification of Movement Disorders Using MRI and Brain Structural Features

Mengyu Li, Ingibjörg Kristjánsdóttir, Thilo van Eimeren et al.

Atypical Parkinsonian Disorders (APD), also known as Parkinson-plus syndrome, are a group of neurodegenerative diseases that include progressive supranuclear palsy (PSP) and multiple system atrophy (MSA). In the early stages, overlapping clinical features often lead to misdiagnosis as Parkinson's disease (PD). Identifying reliable imaging biomarkers for early differential diagnosis remains a critical challenge. In this study, we propose a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) techniques to classify APD subtypes versus PD and distinguish between the subtypes themselves: PSP vs. PD, MSA vs. PD, and PSP vs. MSA. The model leverages multi-modal input data, including T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their corresponding volumetric measurements. By integrating these complementary modalities, including image data, structural segmentation masks, and quantitative volume features, the hybrid approach achieved promising classification performance with area under the curve (AUC) scores of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results highlight the potential of combining spatial and structural information for robust subtype differentiation. In conclusion, this study demonstrates that fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes. The proposed approach may contribute to more reliable early-stage diagnosis, facilitating timely and targeted interventions in clinical practice.

CLNov 4, 2025
Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning

Bowen Jin, TJ Collins, Donghan Yu et al.

Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to have adapted behavior with different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.

CVAug 21, 2023
Regularization by Neural Style Transfer for MRI Field-Transfer Reconstruction with Limited Data

Guoyao Shen, Yancheng Zhu, Mengyu Li et al.

Recent advances in MRI reconstruction have demonstrated remarkable success through deep learning-based models. However, most existing methods rely heavily on large-scale, task-specific datasets, making reconstruction in data-limited settings a critical yet underexplored challenge. While regularization by denoising (RED) leverages denoisers as priors for reconstruction, we propose Regularization by Neural Style Transfer (RNST), a novel framework that integrates a neural style transfer (NST) engine with a denoiser to enable magnetic field-transfer reconstruction. RNST generates high-field-quality images from low-field inputs without requiring paired training data, leveraging style priors to address limited-data settings. Our experiment results demonstrate RNST's ability to reconstruct high-quality images across diverse anatomical planes (axial, coronal, sagittal) and noise levels, achieving superior clarity, contrast, and structural fidelity compared to lower-field references. Crucially, RNST maintains robustness even when style and content images lack exact alignment, broadening its applicability in clinical environments where precise reference matches are unavailable. By combining the strengths of NST and denoising, RNST offers a scalable, data-efficient solution for MRI field-transfer reconstruction, demonstrating significant potential for resource-limited settings.

CVJun 11, 2025Code
Gaussian Herding across Pens: An Optimal Transport Perspective on Global Gaussian Reduction for 3DGS

Tao Wang, Mengyu Li, Geduo Zeng et al.

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for radiance field rendering, but it typically requires millions of redundant Gaussian primitives, overwhelming memory and rendering budgets. Existing compaction approaches address this by pruning Gaussians based on heuristic importance scores, without global fidelity guarantee. To bridge this gap, we propose a novel optimal transport perspective that casts 3DGS compaction as global Gaussian mixture reduction. Specifically, we first minimize the composite transport divergence over a KD-tree partition to produce a compact geometric representation, and then decouple appearance from geometry by fine-tuning color and opacity attributes with far fewer Gaussian primitives. Experiments on benchmark datasets show that our method (i) yields negligible loss in rendering quality (PSNR, SSIM, LPIPS) compared to vanilla 3DGS with only 10% Gaussians; and (ii) consistently outperforms state-of-the-art 3DGS compaction techniques. Notably, our method is applicable to any stage of vanilla or accelerated 3DGS pipelines, providing an efficient and agnostic pathway to lightweight neural rendering. The code is publicly available at https://github.com/DrunkenPoet/GHAP

CLMay 21, 2024
Resolving Word Vagueness with Scenario-guided Adapter for Natural Language Inference

Yonghao Liu, Mengyu Li, Di Liang et al.

Natural Language Inference (NLI) is a crucial task in natural language processing that involves determining the relationship between two sentences, typically referred to as the premise and the hypothesis. However, traditional NLI models solely rely on the semantic information inherent in independent sentences and lack relevant situational visual information, which can hinder a complete understanding of the intended meaning of the sentences due to the ambiguity and vagueness of language. To address this challenge, we propose an innovative ScenaFuse adapter that simultaneously integrates large-scale pre-trained linguistic knowledge and relevant visual information for NLI tasks. Specifically, we first design an image-sentence interaction module to incorporate visuals into the attention mechanism of the pre-trained model, allowing the two modalities to interact comprehensively. Furthermore, we introduce an image-sentence fusion module that can adaptively integrate visual information from images and semantic information from sentences. By incorporating relevant visual information and leveraging linguistic knowledge, our approach bridges the gap between language and vision, leading to improved understanding and inference capabilities in NLI tasks. Extensive benchmark experiments demonstrate that our proposed ScenaFuse, a scenario-guided approach, consistently boosts NLI performance.

LGJul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025

Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.

CLMay 19, 2024
Simple-Sampling and Hard-Mixup with Prototypes to Rebalance Contrastive Learning for Text Classification

Mengyu Li, Yonghao Liu, Fausto Giunchiglia et al.

Text classification is a crucial and fundamental task in natural language processing. Compared with the previous learning paradigm of pre-training and fine-tuning by cross entropy loss, the recently proposed supervised contrastive learning approach has received tremendous attention due to its powerful feature learning capability and robustness. Although several studies have incorporated this technique for text classification, some limitations remain. First, many text datasets are imbalanced, and the learning mechanism of supervised contrastive learning is sensitive to data imbalance, which may harm the model performance. Moreover, these models leverage separate classification branch with cross entropy and supervised contrastive learning branch without explicit mutual guidance. To this end, we propose a novel model named SharpReCL for imbalanced text classification tasks. First, we obtain the prototype vector of each class in the balanced classification branch to act as a representation of each class. Then, by further explicitly leveraging the prototype vectors, we construct a proper and sufficient target sample set with the same size for each class to perform the supervised contrastive learning procedure. The empirical results show the effectiveness of our model, which even outperforms popular large language models across several datasets.

CLJan 16, 2025
Boosting Short Text Classification with Multi-Source Information Exploration and Dual-Level Contrastive Learning

Yonghao Liu, Mengyu Li, Wei Pang et al.

Short text classification, as a research subtopic in natural language processing, is more challenging due to its semantic sparsity and insufficient labeled samples in practical scenarios. We propose a novel model named MI-DELIGHT for short text classification in this work. Specifically, it first performs multi-source information (i.e., statistical information, linguistic information, and factual information) exploration to alleviate the sparsity issues. Then, the graph learning approach is adopted to learn the representation of short texts, which are presented in graph forms. Moreover, we introduce a dual-level (i.e., instance-level and cluster-level) contrastive learning auxiliary task to effectively capture different-grained contrastive information within massive unlabeled data. Meanwhile, previous models merely perform the main task and auxiliary tasks in parallel, without considering the relationship among tasks. Therefore, we introduce a hierarchical architecture to explicitly model the correlations between tasks. We conduct extensive experiments across various benchmark datasets, demonstrating that MI-DELIGHT significantly surpasses previous competitive models. It even outperforms popular large language models on several datasets.

CVDec 16, 2025
TUMTraf EMOT: Event-Based Multi-Object Tracking Dataset and Baseline for Traffic Scenarios

Mengyu Li, Xingcheng Zhou, Guang Chen et al.

In Intelligent Transportation Systems (ITS), multi-object tracking is primarily based on frame-based cameras. However, these cameras tend to perform poorly under dim lighting and high-speed motion conditions. Event cameras, characterized by low latency, high dynamic range and high temporal resolution, have considerable potential to mitigate these issues. Compared to frame-based vision, there are far fewer studies on event-based vision. To address this research gap, we introduce an initial pilot dataset tailored for event-based ITS, covering vehicle and pedestrian detection and tracking. We establish a tracking-by-detection benchmark with a specialized feature extractor based on this dataset, achieving excellent performance.

MESep 22, 2025
Core-elements Subsampling for Alternating Least Squares

Dunyao Xue, Mengyu Li, Cheng Meng et al.

In this paper, we propose a novel element-wise subset selection method for the alternating least squares (ALS) algorithm, focusing on low-rank matrix factorization involving matrices with missing values, as commonly encountered in recommender systems. While ALS is widely used for providing personalized recommendations based on user-item interaction data, its high computational cost, stemming from repeated regression operations, poses significant challenges for large-scale datasets. To enhance the efficiency of ALS, we propose a core-elements subsampling method that selects a representative subset of data and leverages sparse matrix operations to approximate ALS estimations efficiently. We establish theoretical guarantees for the approximation and convergence of the proposed approach, showing that it achieves similar accuracy with significantly reduced computational time compared to full-data ALS. Extensive simulations and real-world applications demonstrate the effectiveness of our method in various scenarios, emphasizing its potential in large-scale recommendation systems.

CVAug 7, 2025
Few-Shot Deployment of Pretrained MRI Transformers in Brain Imaging Tasks

Mengyu Li, Guoyao Shen, Chad W. Farris et al.

Machine learning using transformers has shown great potential in medical imaging, but its real-world applicability remains limited due to the scarcity of annotated data. In this study, we propose a practical framework for the few-shot deployment of pretrained MRI transformers in diverse brain imaging tasks. By utilizing the Masked Autoencoder (MAE) pretraining strategy on a large-scale, multi-cohort brain MRI dataset comprising over 31 million slices, we obtain highly transferable latent representations that generalize well across tasks and datasets. For high-level tasks such as classification, a frozen MAE encoder combined with a lightweight linear head achieves state-of-the-art accuracy in MRI sequence identification with minimal supervision. For low-level tasks such as segmentation, we propose MAE-FUnet, a hybrid architecture that fuses multiscale CNN features with pretrained MAE embeddings. This model consistently outperforms other strong baselines in both skull stripping and multi-class anatomical segmentation under data-limited conditions. With extensive quantitative and qualitative evaluations, our framework demonstrates efficiency, stability, and scalability, suggesting its suitability for low-resource clinical environments and broader neuroimaging applications.

IVMay 23, 2024
Magnetic Resonance Image Processing Transformer for General Accelerated Image Reconstruction

Guoyao Shen, Mengyu Li, Stephan Anderson et al.

Recent advancements in deep learning have enabled the development of generalizable models that achieve state-of-the-art performance across various imaging tasks. Vision Transformer (ViT)-based architectures, in particular, have demonstrated strong feature extraction capabilities when pre-trained on large-scale datasets. In this work, we introduce the Magnetic Resonance Image Processing Transformer (MR-IPT), a ViT-based framework designed to enhance the generalizability and robustness of accelerated MRI reconstruction. Unlike conventional deep learning models that require separate training for different acceleration factors, MR-IPT is pre-trained on a large-scale dataset encompassing multiple undersampling patterns and acceleration settings, enabling a unified reconstruction framework. By leveraging a shared transformer backbone, MR-IPT effectively learns universal feature representations, allowing it to generalize across diverse reconstruction tasks. Extensive experiments demonstrate that MR-IPT outperforms both CNN-based and existing transformer-based methods, achieving superior reconstruction quality across varying acceleration factors and sampling masks. Moreover, MR-IPT exhibits strong robustness, maintaining high performance even under unseen acquisition setups, highlighting its potential as a scalable and efficient solution for accelerated MRI. Our findings suggest that transformer-based general models can significantly advance MRI reconstruction, offering improved adaptability and stability compared to traditional deep learning approaches.

IVMar 14, 2024
Region-based U-net for accelerated training and enhanced precision in deep brain segmentation

Mengyu Li, Magnus Magnusson, Thilo van Eimeren et al.

Segmentation of brain structures on MRI is the primary step for further quantitative analysis of brain diseases. Manual segmentation is still considered the gold standard in terms of accuracy; however, such data is extremely time-consuming to generate. This paper presents a deep learning-based segmentation approach for 12 deep-brain structures, utilizing multiple region-based U-Nets. The brain is divided into three focal regions of interest that encompass the brainstem, the ventricular system, and the striatum. Next, three region-based U-nets are run in parallel to parcellate these larger structures into their respective four substructures. This approach not only greatly reduces the training and processing times but also significantly enhances the segmentation accuracy, compared to segmenting the entire MRI image at once. Our approach achieves remarkable accuracy with an average Dice Similarity Coefficient (DSC) of 0.901 and 95% Hausdorff Distance (HD95) of 1.155 mm. The method was compared with state-of-the-art segmentation approaches, demonstrating a high level of accuracy and robustness of the proposed method.

CVJul 22, 2019
Single Image based Head Pose Estimation with Spherical Parameterization and 3D Morphing

Hui Yuan, Mengyu Li, Junhui Hou et al.

Head pose estimation plays a vital role in various applications, e.g., driverassistance systems, human-computer interaction, virtual reality technology, and so on. We propose a novel geometry based algorithm for accurately estimating the head pose from a single 2D face image at a very low computational cost. Specifically, the rectangular coordinates of only four non-coplanar feature points from a predefined 3D facial model as well as the corresponding ones automatically/ manually extracted from a 2D face image are first normalized to exclude the effect of external factors (i.e., scale factor and translation parameters). Then, the four normalized 3D feature points are represented in spherical coordinates with reference to the uniquely determined sphere by themselves. Due to the spherical parameterization, the coordinates of feature points can then be morphed along all the three directions in the rectangular coordinates effectively. Finally, the rotation matrix indicating the head pose is obtained by minimizing the Euclidean distance between the normalized 2D feature points and the 2D re-projections of morphed 3D feature points. Comprehensive experimental results over two popular databases, i.e., Pointing'04 and Biwi Kinect, demonstrate that the proposed algorithm can estimate head poses with higher accuracy and lower run time than state-of-the-art geometry based methods. Even compared with start-of-the-art learning based methods or geometry based methods with additional depth information, our algorithm still produces comparable performance.