CLAug 12, 2024Code
Mutual Reasoning Makes Smaller LLMs Stronger Problem-SolversZhenting Qi, Mingyuan Ma, Jiahang Xu et al. · microsoft-research
This paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments the Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutual consistent, thus are more likely to be correct. Extensive experiments across five SLMs demonstrate rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at https://github.com/zhentingqi/rStar.
CVMar 12, 2023Code
Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language ModelsZangwei Zheng, Mingyuan Ma, Kai Wang et al. · berkeley
Continual learning (CL) can help pre-trained vision-language models efficiently adapt to new or under-trained data distributions without re-training. Nevertheless, during the continual training of the Contrastive Language-Image Pre-training (CLIP) model, we observe that the model's zero-shot transfer ability significantly degrades due to catastrophic forgetting. Existing CL methods can mitigate forgetting by replaying previous data. However, since the CLIP dataset is private, replay methods cannot access the pre-training dataset. In addition, replaying data of previously learned downstream tasks can enhance their performance but comes at the cost of sacrificing zero-shot performance. To address this challenge, we propose a novel method ZSCL to prevent zero-shot transfer degradation in the continual learning of vision-language models in both feature and parameter space. In the feature space, a reference dataset is introduced for distillation between the current and initial models. The reference dataset should have semantic diversity but no need to be labeled, seen in pre-training, or matched image-text pairs. In parameter space, we prevent a large parameter shift by averaging weights during the training. We propose a more challenging Multi-domain Task Incremental Learning (MTIL) benchmark to evaluate different methods, where tasks are from various domains instead of class-separated in a single dataset. Our method outperforms other methods in the traditional class-incremental learning setting and the MTIL by 9.7% average score. Our code locates at https://github.com/Thunderbeee/ZSCL.
95.3LGApr 14Code
Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic ReasoningAakshita Chandiramani, Aaron Blakeman, Abdullahi Olaoye et al. · amazon-science, cmu
We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.
LGApr 6, 2023
Hierarchical Graph Neural Network with Cross-Attention for Cross-Device User MatchingAli Taghibakhshi, Mingyuan Ma, Ashwath Aithal et al. · nvidia
Cross-device user matching is a critical problem in numerous domains, including advertising, recommender systems, and cybersecurity. It involves identifying and linking different devices belonging to the same person, utilizing sequence logs. Previous data mining techniques have struggled to address the long-range dependencies and higher-order connections between the logs. Recently, researchers have modeled this problem as a graph problem and proposed a two-tier graph contextual embedding (TGCE) neural network architecture, which outperforms previous methods. In this paper, we propose a novel hierarchical graph neural network architecture (HGNN), which has a more computationally efficient second level design than TGCE. Furthermore, we introduce a cross-attention (Cross-Att) mechanism in our model, which improves performance by 5% compared to the state-of-the-art TGCE method.
CVNov 28, 2022
PIDS: Joint Point Interaction-Dimension Search for 3D Point CloudTunhou Zhang, Mingyuan Ma, Feng Yan et al.
The interaction and dimension of points are two important axes in designing point operators to serve hierarchical 3D models. Yet, these two axes are heterogeneous and challenging to fully explore. Existing works craft point operator under a single axis and reuse the crafted operator in all parts of 3D models. This overlooks the opportunity to better combine point interactions and dimensions by exploiting varying geometry/density of 3D point clouds. In this work, we establish PIDS, a novel paradigm to jointly explore point interactions and point dimensions to serve semantic segmentation on point cloud data. We establish a large search space to jointly consider versatile point interactions and point dimensions. This supports point operators with various geometry/density considerations. The enlarged search space with heterogeneous search components calls for a better ranking of candidate models. To achieve this, we improve the search space exploration by leveraging predictor-based Neural Architecture Search (NAS), and enhance the quality of prediction by assigning unique encoding to heterogeneous search components based on their priors. We thoroughly evaluate the networks crafted by PIDS on two semantic segmentation benchmarks, showing ~1% mIOU improvement on SemanticKITTI and S3DIS over state-of-the-art 3D models.
89.7LGApr 2Code
ZEUS: Accelerating Diffusion Models with Only Second-Order PredictorYixiao Wang, Ting Jiang, Zishan Shao et al.
Denoising generative models deliver high-fidelity generation but remain bottlenecked by inference latency due to the many iterative denoiser calls required during sampling. Training-free acceleration methods reduce latency by either sparsifying the model architecture or shortening the sampling trajectory. Current training-free acceleration methods are more complex than necessary: higher-order predictors amplify error under aggressive speedups, and architectural modifications hinder deployment. Beyond 2x acceleration, step skipping creates structural scarcity -- at most one fresh evaluation per local window -- leaving the computed output and its backward difference as the only causally grounded information. Based on this, we propose ZEUS, an acceleration method that predicts reduced denoiser evaluations using a second-order predictor, and stabilizes aggressive consecutive skipping with an interleaved scheme that avoids back-to-back extrapolations. ZEUS adds essentially zero overhead, no feature caches, and no architectural modifications, and it is compatible with different backbones, prediction objectives, and solver choices. Across image and video generation, ZEUS consistently improves the speed-fidelity performance over recent training-free baselines, achieving up to 3.2x end-to-end speedup while maintaining perceptual quality. Our code is available at: https://github.com/Ting-Justin-Jiang/ZEUS.
LGSep 8, 2022
FADE: Enabling Federated Adversarial Training on Heterogeneous Resource-Constrained Edge DevicesMinxue Tang, Jianyi Zhang, Mingyuan Ma et al.
Federated adversarial training can effectively complement adversarial robustness into the privacy-preserving federated learning systems. However, the high demand for memory capacity and computing power makes large-scale federated adversarial training infeasible on resource-constrained edge devices. Few previous studies in federated adversarial training have tried to tackle both memory and computational constraints simultaneously. In this paper, we propose a new framework named Federated Adversarial Decoupled Learning (FADE) to enable AT on heterogeneous resource-constrained edge devices. FADE differentially decouples the entire model into small modules to fit into the resource budget of each device, and each device only needs to perform AT on a single module in each communication round. We also propose an auxiliary weight decay to alleviate objective inconsistency and achieve better accuracy-robustness balance in FADE. FADE offers theoretical guarantees for convergence and adversarial robustness, and our experimental results show that FADE can significantly reduce the consumption of memory and computing power while maintaining accuracy and robustness.
LGMay 25, 2025Code
CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language ModelsQinsi Wang, Hancheng Ye, Ming-Yu Chung et al.
Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions to enhance efficiency. Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: Do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we found that key neurons and tokens for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieved 5x FLOPs reduction and a 10x overall speedup. Code is released at https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main.
CVMar 2, 2025Code
Multi-Cali Anything: Dense Feature Multi-Frame Structure-from-Motion for Large-Scale Camera Array CalibrationJinjiang You, Hewei Wang, Yijie Li et al.
Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration method that refines intrinsics directly from scene data, eliminating the necessity for additional calibration captures. Our approach enhances traditional Structure-from-Motion (SfM) pipelines by introducing an extrinsics regularization term to progressively align estimated extrinsics with ground-truth values, a dense feature reprojection term to reduce keypoint errors by minimizing reprojection loss in the feature space, and an intrinsics variance term for joint optimization across multiple frames. Experiments on the Multiface dataset show that our method achieves nearly the same precision as dedicated calibration processes, and significantly enhances intrinsics and 3D reconstruction accuracy. Fully compatible with existing SfM pipelines, our method provides an efficient and practical plug-and-play solution for large-scale camera setups. Our code is publicly available at: https://github.com/YJJfish/Multi-Cali-Anything
MEFeb 23
Orthogonal Uplift Learning with Permutation-Invariant Representations for Combinatorial TreatmentsXinyan Su, Jiacan Gao, Mingyuan Ma et al.
We study uplift estimation for combinatorial treatments. Uplift measures the pure incremental causal effect of an intervention (e.g., sending a coupon or a marketing message) on user behavior, modeled as a conditional individual treatment effect. Many real-world interventions are combinatorial: a treatment is a policy that specifies context-dependent action distributions rather than a single atomic label. Although recent work considers structured treatments, most methods rely on categorical or opaque encodings, limiting robustness and generalization to rare or newly deployed policies. We propose an uplift estimation framework that aligns treatment representation with causal semantics. Each policy is represented by the mixture it induces over contextaction components and embedded via a permutation-invariant aggregation. This representation is integrated into an orthogonalized low-rank uplift model, extending Robinson-style decompositions to learned, vector-valued treatments. We show that the resulting estimator is expressive for policy-induced causal effects, orthogonally robust to nuisance estimation errors, and stable under small policy perturbations. Experiments on large-scale randomized platform data demonstrate improved uplift accuracy and stability in long-tailed policy regimes
CVMar 17, 2025Code
Training Video Foundation Models with NVIDIA NeMoZeeshan Patel, Ethan He, Parth Mannan et al.
Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
CLApr 2, 2024
Octopus: On-device language model for function calling of software APIsWei Chen, Zhiyuan Li, Mingyuan Ma
In the rapidly evolving domain of artificial intelligence, Large Language Models (LLMs) play a crucial role due to their advanced text processing and generation abilities. This study introduces a new strategy aimed at harnessing on-device LLMs in invoking software APIs. We meticulously compile a dataset derived from software API documentation and apply fine-tuning to LLMs with capacities of 2B, 3B and 7B parameters, specifically to enhance their proficiency in software API interactions. Our approach concentrates on refining the models' grasp of API structures and syntax, significantly enhancing the accuracy of API function calls. Additionally, we propose \textit{conditional masking} techniques to ensure outputs in the desired formats and reduce error rates while maintaining inference speeds. We also propose a novel benchmark designed to evaluate the effectiveness of LLMs in API interactions, establishing a foundation for subsequent research. Octopus, the fine-tuned model, is proved to have better performance than GPT-4 for the software APIs calling. This research aims to advance automated software development and API integration, representing substantial progress in aligning LLM capabilities with the demands of practical software engineering applications.
DCMay 6, 2025
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM ServingShan Yu, Jiarong Xing, Yifan Qiao et al.
Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and challenges for this task. The long-tail popularity of models and their long idle periods present opportunities to improve utilization through GPU sharing. However, existing GPU sharing systems lack the ability to adjust their resource allocation and sharing policies at runtime, making them ineffective at meeting latency service-level objectives (SLOs) under rapidly fluctuating workloads. This paper presents Prism, a multi-LLM serving system that unleashes the full potential of GPU sharing to achieve both cost efficiency and SLO attainment. At its core, Prism tackles a key limitation of existing systems$\unicode{x2014}$the lack of $\textit{cross-model memory coordination}$, which is essential for flexibly sharing GPU memory across models under dynamic workloads. Prism achieves this with two key designs. First, it supports on-demand memory allocation by dynamically mapping physical to virtual memory pages, allowing flexible memory redistribution among models that space- and time-share a GPU. Second, it improves memory efficiency through a two-level scheduling policy that dynamically adjusts sharing strategies based on models' runtime demands. Evaluations on real-world traces show that Prism achieves more than $2\times$ cost savings and $3.3\times$ SLO attainment compared to state-of-the-art systems.
LGMar 31, 2025
ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-TuningHuandong Chang, Zicheng Ma, Mingyuan Ma et al.
Low-Rank Adaptation (LoRA) has become a widely adopted technique for fine-tuning large-scale pre-trained models with minimal parameter updates. However, existing methods rely on fixed ranks or focus solely on either rank pruning or expansion, failing to adapt ranks dynamically to match the importance of different layers during training. In this work, we propose ElaLoRA, an adaptive low-rank adaptation framework that dynamically prunes and expands ranks based on gradient-derived importance scores. To the best of our knowledge, ElaLoRA is the first method that enables both rank pruning and expansion during fine-tuning. Experiments across multiple benchmarks demonstrate that ElaLoRA consistently outperforms existing PEFT methods across different parameter budgets. Furthermore, our studies validate that layers receiving higher rank allocations contribute more significantly to model performance, providing theoretical justification for our adaptive strategy. By introducing a principled and adaptive rank allocation mechanism, ElaLoRA offers a scalable and efficient fine-tuning solution, particularly suited for resource-constrained environments.
MAOct 14, 2025
KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent SystemsHancheng Ye, Zhengqi Gao, Mingyuan Ma et al.
Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context-including prior turns-must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples-termed anchors-that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.
SIAug 6, 2025
Quasi-Clique Discovery via Energy DiffusionYu Zhang, Yilong Luo, Mingyuan Ma et al.
Discovering quasi-cliques -- subgraphs whose edge density exceeds a given threshold -- is a fundamental task in graph mining with applications to web spam detection, fraud screening, and e-commerce recommendation. However, existing methods for quasi-clique discovery on large-scale web graphs are often sensitive to random seeds or lack of explicit edge-density guarantees, making the task challenging in practice. This paper presents EDQC, an energy diffusion-based method for quasi-clique discovery. EDQC first employs an adaptive energy diffusion process to generate an energy ranking that highlights structurally cohesive regions. Guided by this energy ranking, the algorithm identifies a high-quality subgraph by minimizing conductance, a standard measure from community detection. This subgraph is then refined to meet the specified density threshold. Extensive experiments on 75 real-world graphs show that EDQC finds larger quasi-cliques on most datasets, with consistently lower variance across runs and competitive runtime. To the best of our knowledge, EDQC is the first method to incorporate energy diffusion into quasi-clique discovery.
SPJan 10, 2021
Machine Learning for Electronic Design Automation: A SurveyGuyue Huang, Jingbo Hu, Yifan He et al.
With the down-scaling of CMOS technology, the design complexity of very large-scale integrated (VLSI) is increasing. Although the application of machine learning (ML) techniques in electronic design automation (EDA) can trace its history back to the 90s, the recent breakthrough of ML and the increasing complexity of EDA tasks have aroused more interests in incorporating ML to solve EDA tasks. In this paper, we present a comprehensive review of existing ML for EDA studies, organized following the EDA hierarchy.
LGJul 3, 2020
AEGCN: An Autoencoder-Constrained Graph Convolutional NetworkMingyuan Ma, Sen Na, Hongyu Wang
We propose a novel neural network architecture, called autoencoder-constrained graph convolutional network, to solve node classification task on graph domains. As suggested by its name, the core of this model is a convolutional network operating directly on graphs, whose hidden layers are constrained by an autoencoder. Comparing with vanilla graph convolutional networks, the autoencoder step is added to reduce the information loss brought by Laplacian smoothing. We consider applying our model on both homogeneous graphs and heterogeneous graphs. For homogeneous graphs, the autoencoder approximates to the adjacency matrix of the input graph by taking hidden layer representations as encoder and another one-layer graph convolutional network as decoder. For heterogeneous graphs, since there are multiple adjacency matrices corresponding to different types of edges, the autoencoder approximates to the feature matrix of the input graph instead, and changes the encoder to a particularly designed multi-channel pre-processing network with two layers. In both cases, the error occurred in the autoencoder approximation goes to the penalty term in the loss function. In extensive experiments on citation networks and other heterogeneous graphs, we demonstrate that adding autoencoder constraints significantly improves the performance of graph convolutional networks. Further, we notice that our technique can be applied on graph attention network to improve the performance as well. This reveals the wide applicability of the proposed autoencoder technique.
IRNov 30, 2018
The Graph-Based Behavior-Aware Recommendation for Interactive NewsMingyuan Ma, Sen Na, Hongyu Wang et al.
Interactive news recommendation has been launched and attracted much attention recently. In this scenario, user's behavior evolves from single click behavior to multiple behaviors including like, comment, share etc. However, most of the existing methods still use single click behavior as the unique criterion of judging user's preferences. Further, although heterogeneous graphs have been applied in different areas, a proper way to construct a heterogeneous graph for interactive news data with an appropriate learning mechanism on it is still desired. To address the above concerns, we propose a graph-based behavior-aware network, which simultaneously considers six different types of behaviors as well as user's demand on the news diversity. We have three main steps. First, we build an interaction behavior graph for multi-level and multi-category data. Second, we apply DeepWalk on the behavior graph to obtain entity semantics, then build a graph-based convolutional neural network called G-CNN to learn news representations, and an attention-based LSTM to learn behavior sequence representations. Third, we introduce core and coritivity features for the behavior graph, which measure the concentration degree of user's interests. These features affect the trade-off between accuracy and diversity of our personalized recommendation system. Taking these features into account, our system finally achieves recommending news to different users at their different levels of concentration degrees.
MLNov 14, 2017
Scalable Peaceman-Rachford Splitting Method with Proximal TermsSen Na, Mingyuan Ma, Mladen Kolar
Along with developing of Peaceman-Rachford Splittling Method (PRSM), many batch algorithms based on it have been studied very deeply. But almost no algorithm focused on the performance of stochastic version of PRSM. In this paper, we propose a new stochastic algorithm based on PRSM, prove its convergence rate in ergodic sense, and test its performance on both artificial and real data. We show that our proposed algorithm, Stochastic Scalable PRSM (SS-PRSM), enjoys the $O(1/K)$ convergence rate, which is the same as those newest stochastic algorithms that based on ADMM but faster than general Stochastic ADMM (which is $O(1/\sqrt{K})$). Our algorithm also owns wide flexibility, outperforms many state-of-the-art stochastic algorithms coming from ADMM, and has low memory cost in large-scale splitting optimization problems.