h-index34
19papers
374citations
Novelty48%
AI Score58

19 Papers

CLJun 4
YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

PSBC LLM Team, Huawei LLM Team, Ruihan Long et al.

Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.

LGOct 4, 2022
Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies

Rui Yuan, Simon S. Du, Robert M. Gower et al.

We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as inexact versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\tilde{\mathcal{O}}(1/ε^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size.

AIDec 21, 2022
Predict+Optimize Problem in Renewable Energy Scheduling

Christoph Bergmeir, Frits de Nijs, Evgenii Genov et al.

Predict+Optimize frameworks integrate forecasting and optimization to address real-world challenges such as renewable energy scheduling, where variability and uncertainty are critical factors. This paper benchmarks solutions from the IEEE-CIS Technical Challenge on Predict+Optimize for Renewable Energy Scheduling, focusing on forecasting renewable production and demand and optimizing energy cost. The competition attracted 49 participants in total. The top-ranked method employed stochastic optimization using LightGBM ensembles, and achieved at least a 2% reduction in energy costs compared to deterministic approaches, demonstrating that the most accurate point forecast does not necessarily guarantee the best performance in downstream optimization. The published data and problem setting establish a benchmark for further research into integrated forecasting-optimization methods for energy systems, highlighting the importance of considering forecast uncertainty in optimization models to achieve cost-effective and reliable energy management. The novelty of this work lies in its comprehensive evaluation of Predict+Optimize methodologies applied to a real-world renewable energy scheduling problem, providing insights into the scalability, generalizability, and effectiveness of the proposed solutions. Potential applications extend beyond energy systems to any domain requiring integrated forecasting and optimization, such as supply chain management, transportation planning, and financial portfolio optimization.

LGOct 24, 2022
Optimal activity and battery scheduling algorithm using load and solar generation forecasts

Yogesh Pipada Sunil Kumar, Rui Yuan, Nam Trong Dinh et al.

Energy usage optimal scheduling has attracted great attention in the power system community, where various methodologies have been proposed. However, in real-world applications, the optimal scheduling problems require reliable energy forecasting, which is scarcely discussed as a joint solution to the scheduling problem. The 5\textsuperscript{th} IEEE Computational Intelligence Society (IEEE-CIS) competition raised a practical problem of decreasing the electricity bill by scheduling building activities, where forecasting the solar energy generation and building consumption is a necessity. To solve this problem, we propose a technical sequence for tackling the solar PV and demand forecast and optimal scheduling problems, where solar generation prediction methods and an optimal university lectures scheduling algorithm are proposed.

MLJan 30, 2023
A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence

Carlo Alfano, Rui Yuan, Patrick Rebeschini

Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.

CLFeb 10
Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization

Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu

Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens - the filler phrase "Let me think" receives the same gradient update as the critical calculation "23 + 45 = 68." We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation, instead importance is estimated directly from the policy model's own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.

CLFeb 10
AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models

R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth et al.

Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.

CLNov 4, 2024Code
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

Xingwu Sun, Yanfeng Chen, Yiqing Huang et al. · tencent-ai

In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practice of Hunyuan-Large include large-scale synthetic data that is orders larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we also investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidances for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Codes: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large

LGFeb 4
Beyond KL Divergence: Policy Optimization with Flexible Bregman Divergences for LLM Reasoning

Rui Yuan, Mykola Khandoga, Vinay Kumar Sankarapu

Policy optimization methods like Group Relative Policy Optimization (GRPO) and its variants have achieved strong results on mathematical reasoning and code generation tasks. Despite extensive exploration of reward processing strategies and training dynamics, all existing group-based methods exclusively use KL divergence for policy regularization, leaving the choice of divergence function unexplored. We introduce Group-Based Mirror Policy Optimization (GBMPO), a framework that extends group-based policy optimization to flexible Bregman divergences, including hand-designed alternatives (L2 in probability space) and learned neural mirror maps. On GSM8K mathematical reasoning, hand-designed ProbL2-GRPO achieves 86.7% accuracy, improving +5.5 points over the Dr. GRPO baseline. On MBPP code generation, neural mirror maps reach 60.1-60.8% pass@1, with random initialization already capturing most of the benefit. While evolutionary strategies meta-learning provides marginal accuracy improvements, its primary value lies in variance reduction ($\pm$0.2 versus $\pm$0.6) and efficiency gains (15% shorter responses on MBPP), suggesting that random initialization of neural mirror maps is sufficient for most practical applications. These results establish divergence choice as a critical, previously unexplored design dimension in group-based policy optimization for LLM reasoning.

LGDec 3, 2025
The Initialization Determines Whether In-Context Learning Is Gradient Descent

Shifeng Xie, Rui Yuan, Simone Rossi et al.

In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. However, subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for ICL linear regression setup. Our experiments confirm this result and further observe that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widespread LLMs augmented with initial guess capabilities, and show that their performance is improved on a semantic similarity task.

CVNov 24, 2025Code
HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li et al.

We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.

CVApr 10
CAD 100K: A Comprehensive Multi-Task Dataset for Car Related Visual Anomaly Detection

Jiahua Pang, Ying Li, Dongpu Cao et al.

Multi-task visual anomaly detection is critical for car-related manufacturing quality assessment. However, existing methods remain task-specific, hindered by the absence of a unified benchmark for multi-task evaluation. To fill in this gap, We present the CAD Dataset, a large-scale and comprehensive benchmark designed for car-related multi-task visual anomaly detection. The dataset contains over 100 images crossing 7 vehicle domains and 3 tasks, providing models a comprehensive view for car-related anomaly detection. It is the first car-related anomaly dataset specialized for multi-task learning(MTL), while combining synthesis data augmentation for few-shot anomaly images. We implement a multi-task baseline and conduct extensive empirical studies. Results show MTL promotes task interaction and knowledge transfer, while also exposing challenging conflicts between tasks. The CAD dataset serves as a standardized platform to drive future advances in car-related multi-task visual anomaly detection.

CVJan 19, 2025
Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction

Quan Zhang, Yuxin Qi, Xi Tang et al.

Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels would interfere with the learning of fully-supervised detection head, leading to significant performance leakage. Issues with noisy labels include:(1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy label learning strategy to harness every potential useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to add different weights to the noisy labels to train more effectively. Our model outperforms the previous state-of-the-art method in detection accuracy and inference speed greatly upon the THUMOS14 and ActivityNet v1.2 benchmarks.

CVNov 13, 2024
Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models

Quan Zhang, Jinwei Fang, Rui Yuan et al.

Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of the Video Foundation Models (VFMs) and Large Language Models(LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLM to offer temporal action key semantics and complete semantic priors for conventional Weakly-supervised Temporal Action Localization (WTAL) methods. MLLM4WTAL facilitates the enhancement of WTAL by leveraging MLLM guidance. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). These modules work in tandem to effectively address prevalent issues like incomplete and over-complete outcomes common in WTAL methods. Rigorous experiments are conducted to validate the efficacy of our proposed approach in augmenting the performance of various heterogeneous WTAL models.

MLApr 22, 2025
From predictions to confidence intervals: an empirical study of conformal prediction methods for in-context learning

Zhe Huang, Simone Rossi, Rui Yuan et al.

Transformers have become a standard architecture in machine learning, demonstrating strong in-context learning (ICL) abilities that allow them to learn from the prompt at inference time. However, uncertainty quantification for ICL remains an open challenge, particularly in noisy regression tasks. This paper investigates whether ICL can be leveraged for distribution-free uncertainty estimation, proposing a method based on conformal prediction to construct prediction intervals with guaranteed coverage. While traditional conformal methods are computationally expensive due to repeated model fitting, we exploit ICL to efficiently generate confidence intervals in a single forward pass. Our empirical analysis compares this approach against ridge regression-based conformal methods, showing that conformal prediction with in-context learning (CP with ICL) achieves robust and scalable uncertainty estimates. Additionally, we evaluate its performance under distribution shifts and establish scaling laws to guide model training. These findings bridge ICL and conformal prediction, providing a theoretically grounded and new framework for uncertainty quantification in transformer-based models.

AINov 23, 2025
Wireless Power Transfer and Intent-Driven Network Optimization in AAVs-assisted IoT for 6G Sustainable Connectivity

Xiaoming He, Gaofeng Wang, Huajun Cui et al.

Autonomous Aerial Vehicle (AAV)-assisted Internet of Things (IoT) represents a collaborative architecture in which AAV allocate resources over 6G links to jointly enhance user-intent interpretation and overall network performance. Owing to this mutual dependence, improvements in intent inference and policy decisions on one component reinforce the efficiency of others, making highly reliable intent prediction and low-latency action execution essential. Although numerous approaches can model intent relationships, they encounter severe obstacles when scaling to high-dimensional action sequences and managing intensive on-board computation. We propose an Intent-Driven Framework for Autonomous Network Optimization comprising prediction and decision modules. First, implicit intent modeling is adopted to mitigate inaccuracies arising from ambiguous user expressions. For prediction, we introduce Hyperdimensional Transformer (HDT), which embeds data into a Hyperdimensional space via Hyperdimensional vector encoding and replaces standard matrix and attention operations with symbolic Hyperdimensional computations. For decision-making, where AAV must respond to user intent while planning trajectories, we design Double Actions based Multi-Agent Proximal Policy Optimization (DA-MAPPO). Building upon MAPPO, it samples actions through two independently parameterized networks and cascades the user-intent network into the trajectory network to maintain action dependencies. We evaluate our framework on a real IoT action dataset with authentic wireless data. Experimental results demonstrate that HDT and DA-MAPPO achieve superior performance across diverse scenarios.

LGApr 11, 2024
Enhancing Policy Gradient with the Polyak Step-Size Adaption

Yunxiang Li, Rui Yuan, Chen Fan et al.

Policy gradient is a widely utilized and foundational algorithm in the field of reinforcement learning (RL). Renowned for its convergence guarantees and stability compared to other RL algorithms, its practical application is often hindered by sensitivity to hyper-parameters, particularly the step-size. In this paper, we introduce the integration of the Polyak step-size in RL, which automatically adjusts the step-size without prior knowledge. To adapt this method to RL settings, we address several issues, including unknown f* in the Polyak step-size. Additionally, we showcase the performance of the Polyak step-size in RL through experiments, demonstrating faster convergence and the attainment of more stable policies.

LGSep 23, 2021
IRMAC: Interpretable Refined Motifs in Binary Classification for Smart Grid Applications

Rui Yuan, S. Ali Pourmousavi, Wen L. Soong et al.

Modern power systems are experiencing the challenge of high uncertainty with the increasing penetration of renewable energy resources and the electrification of heating systems. In this paradigm shift, understanding electricity users' demand is of utmost value to retailers, aggregators, and policymakers. However, behind-the-meter (BTM) equipment and appliances at the household level are unknown to the other stakeholders mainly due to privacy concerns and tight regulations. In this paper, we seek to identify residential consumers based on their BTM equipment, mainly rooftop photovoltaic (PV) systems and electric heating, using imported/purchased energy data from utility meters. To solve this problem with an interpretable, fast, secure, and maintainable solution, we propose an integrated method called Interpretable Refined Motifs And binary Classification (IRMAC). The proposed method comprises a novel shape-based pattern extraction technique, called Refined Motif (RM) discovery, and a single-neuron classifier. The first part extracts a sub-pattern from the long time series considering the frequency of occurrences, average dissimilarity, and time dynamics while emphasising specific times with annotated distances. The second part identifies users' types with linear complexity while preserving the transparency of the algorithms. With the real data from Australia and Denmark, the proposed method is tested and verified in identifying PV owners and electrical heating system users.

LGJul 23, 2021
A general sample complexity analysis of vanilla policy gradient

Rui Yuan, Robert M. Gower, Alessandro Lazaric

We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence and sample complexity guarantees for the vanilla policy gradient (PG). Our only assumptions are that the expected return is smooth w.r.t. the policy parameters, that its $H$-step truncated gradient is close to the exact gradient, and a certain ABC assumption. This assumption requires the second moment of the estimated gradient to be bounded by $A\geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full batch gradient and an additive constant $C \geq 0$, or any combination of aforementioned. We show that the ABC assumption is more general than the commonly used assumptions on the policy space to prove convergence to a stationary point. We provide a single convergence theorem that recovers the $\widetilde{\mathcal{O}}(ε^{-4})$ sample complexity of PG to a stationary point. Our results also affords greater flexibility in the choice of hyper parameters such as the step size and the batch size $m$, including the single trajectory case (i.e., $m=1$). When an additional relaxed weak gradient domination assumption is available, we establish a novel global optimum convergence theory of PG with $\widetilde{\mathcal{O}}(ε^{-3})$ sample complexity. We then instantiate our theorems in different settings, where we both recover existing results and obtain improved sample complexity, e.g., $\widetilde{\mathcal{O}}(ε^{-3})$ sample complexity for the convergence to the global optimum for Fisher-non-degenerated parametrized policies.