CLMay 29Code
Towards Efficient LLMs Annealing with Principled Sample SelectionYuanjian Xu, Jianing Hao, Wanbo Zhang et al.
The annealing phase is a pivotal convergence stage in LLM pre-training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally-Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance. For future research, code is available at https://github.com/xuyj233/Direct.
CLMay 29Code
D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM TrainingYuanjian Xu, Jianing Hao, Guang Zhang et al.
Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improves learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.
CEApr 19Code
BizCompass: Benchmarking the Reasoning Capabilities of LLMs in Business Knowledge and ApplicationsJianing Hao, Yuhe Wu, Yuanjian Xu et al.
Large language models (LLMs) hold great promise for business applications, yet business analysis remains inherently complex, demanding rigorous reasoning and the integration of diverse knowledge sources. Existing benchmarks typically target narrow tasks and thus leave a fundamental question unanswered: how can LLMs be reliably applied in business, and how are these applications grounded in underlying theoretical capabilities? To address this gap, we introduce BizCompass, a benchmark explicitly designed to connect theoretical foundations with practical business knowledge and applications. At the knowledge level, BizCompass covers four core domains--finance, economics, statistics, and operations management. At the application level, it structures tasks around three representative roles: the analyst, the trader, and the consultant. This dual-axis design not only exposes performance differences across realistic scenarios but also diagnoses which foundational capabilities enable or constrain success. We systematically evaluate both open-source and commercial LLMs, revealing how theoretical knowledge translates into practical performance in business. The results provide actionable insights for model selection and training optimization in real-world business contexts. All datasets and evaluation code are publicly released to support reproducibility and future research: https://bizcompass.dev.ypemc.com.
AIAug 19, 2024
LENS: Large Pre-trained Transformer for Exploring Financial Time Series RegularitiesYuanjian Xu, Anxian Liu, Jianing Hao et al.
Modeling large-scale time series has gained significant attention in recent years. However, its direct application in finance remains challenging due to substantial differences in data characteristics across domains. Specifically, financial systems feature inherent stochasticity and low signal-to-noise ratios, rendering traditional methods and pre-training approaches ineffective. This underscores the urgent need for a foundation model tailored to financial time series. To bridge this gap, we propose \textbf{LENS}, a pre-trained model for this domain. \textbf{LENS} effectively captures the complexity of financial stochastic systems through a carefully crafted model architecture and mitigates noise during pre-training by using an invertible embedding module. We provide a rigorous theoretical explanation of the model's effectiveness and validate its performance through extensive experiments. Pre-trained on a dataset comprising 100 billion financial observations, \textbf{LENS} achieves exceptional results across a wide range of critical downstream tasks. Moreover, our work offers practical insights into developing pre-trained time series models in high-noise environments, paving the way for further advancements in this pivotal research domain.
CLApr 9
Rethinking Data Mixing from the Perspective of Large Language ModelsYuanjian Xu, Tianze Sun, Changwei Xu et al.
Data mixing strategy is essential for large language model (LLM) training. Empirical evidence shows that inappropriate strategies can significantly reduce generalization. Although recent methods have improved empirical performance, several fundamental questions remain open: what constitutes a domain, whether human and model perceptions of domains are aligned, and how domain weighting influences generalization. We address these questions by establishing formal connections between gradient dynamics and domain distributions, offering a theoretical framework that clarifies the role of domains in training dynamics. Building on this analysis, we introduce DoGraph, a reweighting framework that formulates data scheduling as a graph-constrained optimization problem. Extensive experiments on GPT-2 models of varying scales demonstrate that DoGraph consistently achieves competitive performance.
LGDec 23, 2025
HGAN-SDEs: Learning Neural Stochastic Differential Equations with Hermite-Guided Adversarial TrainingYuanjian Xu, Yuan Shuai, Jianing Hao et al.
Neural Stochastic Differential Equations (Neural SDEs) provide a principled framework for modeling continuous-time stochastic processes and have been widely adopted in fields ranging from physics to finance. Recent advances suggest that Generative Adversarial Networks (GANs) offer a promising solution to learning the complex path distributions induced by SDEs. However, a critical bottleneck lies in designing a discriminator that faithfully captures temporal dependencies while remaining computationally efficient. Prior works have explored Neural Controlled Differential Equations (CDEs) as discriminators due to their ability to model continuous-time dynamics, but such architectures suffer from high computational costs and exacerbate the instability of adversarial training. To address these limitations, we introduce HGAN-SDEs, a novel GAN-based framework that leverages Neural Hermite functions to construct a structured and efficient discriminator. Hermite functions provide an expressive yet lightweight basis for approximating path-level dynamics, enabling both reduced runtime complexity and improved training stability. We establish the universal approximation property of our framework for a broad class of SDE-driven distributions and theoretically characterize its convergence behavior. Extensive empirical evaluations on synthetic and real-world systems demonstrate that HGAN-SDEs achieve superior sample quality and learning efficiency compared to existing generative models for SDEs
SEApr 29, 2025
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model EvaluationWenjing Yin, Tianze Sun, Yijiong Yu et al.
Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive evaluation framework that reflects real-world applications. To address these gaps, we introduce CoCo-Bench (Comprehensive Code Benchmark), designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review. These dimensions capture essential developer needs, ensuring a more systematic and representative evaluation. CoCo-Bench includes multiple programming languages and varying task difficulties, with rigorous manual review to ensure data quality and accuracy. Empirical results show that CoCo-Bench aligns with existing benchmarks while uncovering significant variations in model performance, effectively highlighting strengths and weaknesses. By offering a holistic and objective evaluation, CoCo-Bench provides valuable insights to guide future research and technological advancements in code-oriented LLMs, establishing a reliable benchmark for the field.