Reasoning Planning for Language ModelsBao Nguyen, Hieu Trung Nguyen, Ruifeng She et al.
Selecting an appropriate reasoning method for a given query remains a key challenge in language model generation. Existing approaches typically generate multiple candidate responses and use an aggregation strategy to select the output answer, often assuming that more candidate answers yield higher accuracy. We revisit this assumption through a rigorous theoretical analysis, deriving accuracy bounds for standard aggregation methods under fixed generation distributions and candidate sizes. Building on these insights, we introduce EPIC, an Ensemble Planning with Contrastive learning framework to learn a shared representation space that captures both model reasoning abilities and query-method compatibility. EPIC incorporates our probability bounds as a regularizer in a utility-driven optimization that balances accuracy and computational cost. Experiments on diverse mathematical reasoning tasks show that EPIC consistently selects optimal reasoning methods, improving accuracy while reducing computational overhead. Our code can be found at https://github.com/nguyenngocbaocmt02/EPIC.
BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem SolvingTeng Wang, Wing-Yin Yu, Zhenqi He et al.
LLMs exhibit advanced reasoning capabilities, offering the potential to transform natural language questions into mathematical models. However, existing open-source datasets in operations research domain lack detailed annotations of the modeling process, such as variable definitions, focusing solely on objective values, which hinders reinforcement learning applications. To address this, we release the StructuredOR dataset, annotated with comprehensive labels that capture the complete mathematical modeling process. We further propose BPP-Search, an algorithm that integrates reinforcement learning into a tree-of-thought structure using Beam search, a Process reward model, and a pairwise Preference algorithm. This approach enables efficient exploration of tree structures, avoiding exhaustive search while improving accuracy. Extensive experiments on StructuredOR, NL4OPT, and MAMO-ComplexLP datasets show that BPP-Search significantly outperforms state-of-the-art methods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency, enabling faster retrieval of correct solutions. The StructuredOR dataset is available on Huggingface https://huggingface.co/datasets/LLM4OR/StructuredOR and GitHub https://github.com/LLM4OR/StructuredOR.
14.7CLFeb 15, 2025
LoRE-Merging: Exploring Low-Rank Estimation For Large Language Model MergingZehua Liu, Han Wu, Yuxuan Yao et al.
While most current approaches rely on further training techniques, such as fine-tuning or reinforcement learning, to enhance model capacities, model merging stands out for its ability of improving models without requiring any additional training. In this paper, we propose a unified framework for model merging based on low-rank estimation of task vectors without the need for access to the base model, named \textsc{LoRE-Merging}. Our approach is motivated by the observation that task vectors from fine-tuned models frequently exhibit a limited number of dominant singular values, making low-rank estimations less prone to interference. We implement the method by formulating the merging problem as an optimization problem. Extensive empirical experiments demonstrate the effectiveness of our framework in mitigating interference and preserving task-specific information, thereby advancing the state-of-the-art performance in model merging techniques.
3.3DCFeb 14, 2025
Hybrid Offline-online Scheduling Method for Large Language Model Inference OptimizationBowen Pang, Kai Li, Ruifeng She et al.
With the development of large language models (LLMs), it has become increasingly important to optimize hardware usage and improve throughput. In this paper, we study the inference optimization of the serving system that deploys LLMs. To optimize system throughput and maximize hardware utilization, we formulate the inference optimization problem as a mixed-integer programming (MIP) model and propose a hybrid offline-online method as solution. The offline method improves large-scale inference systems by introducing a Minimizing Makespan Bin Packing Problem. We further provide a theoretical lower bound computation method. Then, we propose an online sorting and preemptive scheduling method to better utilize hardware. In the online iteration scheduling process, a Lagrangian method is applied to evaluate the cost efficiency of inserting prefill stages versus decode stages at each iteration and dynamically determine when to preempt decoding tasks and insert prefill tasks. Experiments using real-world data from the LLaMA-65B model and the GSM8K dataset demonstrate that system utilization improves from 80.2% to 89.1%, and the total inference time decreases from 201.00 to 190.58 seconds. A 100-cases study shows that our method consistently outperforms the baseline method and improves the utilization rate by 8.0% on average. Finally, we discuss potential future extensions, including stochastic modeling, reinforcement learning-based schedulers, and dynamic decision-making strategies for system throughput and hardware utilization.
14.4LGMar 29, 2025
MoLAE: Mixture of Latent Experts for Parameter-Efficient Language ModelsZehua Liu, Han Wu, Ruifeng She et al.
Mixture of Experts (MoE) has become a key architectural paradigm for efficiently scaling Large Language Models (LLMs) by selectively activating a subset of parameters for each input token. However, standard MoE architectures face significant challenges, including high memory consumption and communication overhead during distributed training. In this paper, we introduce Mixture of Latent Experts (MoLAE), a novel parameterization that addresses these limitations by reformulating expert operations through a shared projection into a lower-dimensional latent space, followed by expert-specific transformations. This factorized approach substantially reduces parameter count and computational requirements, particularly in existing LLMs where hidden dimensions significantly exceed MoE intermediate dimensions. We provide a rigorous mathematical framework for transforming pre-trained MoE models into MoLAE architecture, characterizing conditions for optimal factorization, and developing a systematic two-step algorithm for this conversion. Our comprehensive theoretical analysis demonstrates that MoLAE significantly improves efficiency across multiple dimensions while preserving model capabilities. Experimental results confirm that MoLAE achieves comparable performance to standard MoE with substantially reduced resource requirements.
7.1LGMar 12, 2025
Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming ApproachRuifeng She, Bowen Pang, Kai Li et al.
As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and pipeline-have been successfully implemented for popular neural networks on main-stream hardware, optimizing the distributed deployment schedule requires extensive expertise and manual effort. Further more, while existing frameworks with most simple chain-like structures, they struggle with complex non-linear architectures. Mixture-of-experts and multi-modal models feature intricate MIMO and branch-rich topologies that require fine-grained operator-level parallelization beyond the capabilities of existing frameworks. We propose formulating parallelism planning as a scheduling optimization problem using mixed-integer programming. We propose a bi-level solution framework balancing optimality with computational efficiency, automatically generating effective distributed plans that capture both the heterogeneous structure of modern neural networks and the underlying hardware constraints. In experiments comparing against expert-designed strategies like DeepSeek's DualPipe, our framework achieves comparable or superior performance, reducing computational bubbles by half under the same memory constraints. The framework's versatility extends beyond throughput optimization to incorporate hardware utilization maximization, memory capacity constraints, and other considerations or potential strategies. Such capabilities position our solution as both a valuable research tool for exploring optimal parallelization strategies and a practical industrial solution for large-scale AI deployment.