h-index32
23papers
638citations
Novelty56%
AI Score61

23 Papers

CLFeb 11
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang, Ang Li, Aobo Kong et al.

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.

CVFeb 14, 2025Code
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Guoqing Ma, Haoyang Huang, Kun Yan et al.

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.

CLFeb 17, 2025Code
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Ailin Huang, Boyong Wu, Bruce Wang et al.

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.

82.4LGMay 6
Demystifying Manifold Constraints in LLM Pre-training

Kang An, Jiaxiang Li, Donald Goldfarb et al.

The empirical success of large language model (LLM) pre-training relies heavily on heuristic stabilization techniques, such as explicit normalization layers and weight decay. While recent constrained optimization approaches that explicitly restrict weights may improve numerical stability and performance, the mechanism and motivation for adding constraints still remain elusive. This paper systematically demystifies the role of explicit manifold constraints in LLM pre-training. By introducing the Msign-Aligned Constrained Riemannian Optimizer (MACRO)-a provably convergent, single-loop optimization framework-our study disentangles weight regularization heuristics from interacting mechanisms like RMS normalization and decoupled weight decay. Theoretical analyses and comprehensive empirical evaluations reveal that manifold constraints independently bound forward activation scales and enforce stable rotational equilibrium, thereby subsuming the roles of these heuristic mechanisms. Evaluations on large-scale LLM architectures demonstrate that MACRO achieves highly competitive performance while rigorously preserving the theoretical guarantees of exact Riemannian optimization.

CLJul 22, 2025Code
Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu et al.

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

CLDec 23, 2025
Step-DeepResearch Technical Report

Chen Hu, Haikuo Du, Heng Wang et al.

As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.

LGMar 26, 2025Code
ASGO: Adaptive Structured Gradient Optimization

Kang An, Yuxing Liu, Rui Pan et al.

Training deep neural networks is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than by vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block diagonal. These structured properties are crucial for designing efficient optimization algorithms, but are not utilized by many current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By a fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. Based on this convergence theory, we further demonstrate that ASGO can benefit from low-rank gradients and block diagonal Hessians. We also discuss practical modifications of ASGO and empirically verify ASGO's effectiveness on language model tasks. Code is available at https://github.com/infinity-stars/ASGO.

CLMay 21, 2025Code
StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization

Ziliang Wang, Xuhui Zheng, Kang An et al.

Efficient multi-hop reasoning requires Large Language Models (LLMs) based agents to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but underperform on complex, multi-hop QA resulting from the sparse rewards from global signal only. To address this gap in existing research, we introduce StepSearch, a framework for search LLMs that trained with step-wise proximal policy optimization method. It consists of richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties to better guide each search step. We constructed a fine-grained question-answering dataset containing sub-question-level search trajectories based on open source datasets through a set of data pipeline method. On standard multi-hop QA benchmarks, it significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search with RL baselines using only 19k training data, demonstrating the effectiveness of fine-grained, stepwise supervision in optimizing deep search LLMs. Our code will be released on https://github.com/Zillwang/StepSearch.

CVAug 14, 2025Code
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale

NextStep Team, Chunrui Han, Guopeng Li et al. · tsinghua

Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.

CVMar 14, 2025Code
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

Haoyang Huang, Guoqing Ma, Nan Duan et al.

We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.

65.9LGApr 16
Lightweight Geometric Adaptation for Training Physics-Informed Neural Networks

Kang An, Chenhao Si, Shiqian Ma et al.

Physics-Informed Neural Networks (PINNs) often suffer from slow convergence, training instability, and reduced accuracy on challenging partial differential equations due to the anisotropic and rapidly varying geometry of their loss landscapes. We propose a lightweight curvature-aware optimization framework that augments existing first-order optimizers with an adaptive predictive correction based on secant information. Consecutive gradient differences are used as a cheap proxy for local geometric change, together with a step-normalized secant curvature indicator to control the correction strength. The framework is plug-and-play, computationally efficient, and broadly compatible with existing optimizers, without explicitly forming second-order matrices. Experiments on diverse PDE benchmarks show consistent improvements in convergence speed, training stability, and solution accuracy over standard optimizers and strong baselines, including on the high-dimensional heat equation, Gray--Scott system, Belousov--Zhabotinsky system, and 2D Kuramoto--Sivashinsky system.

CVMar 6, 2025Code
Geometry-Constrained Monocular Scale Estimation Using Semantic Segmentation for Dynamic Scenes

Hui Zhang, Zhiyang Wu, Qianqian Shangguan et al.

Monocular visual localization plays a pivotal role in advanced driver assistance systems and autonomous driving by estimating a vehicle's ego-motion from a single pinhole camera. Nevertheless, conventional monocular visual odometry encoun-ters challenges in scale estimation due to the absence of depth information during projection. Previous methodologies, whether rooted in physical constraints or deep learning paradigms, con-tend with issues related to computational complexity and the management of dynamic objects. This study extends our prior research, presenting innovative strategies for ego-motion estima-tion and the selection of ground points. Striving for a nuanced equilibrium between computational efficiency and precision, we propose a hybrid method that leverages the SegNeXt model for real-time applications, encompassing both ego-motion estimation and ground point selection. Our methodology incorporates dy-namic object masks to eliminate unstable features and employs ground plane masks for meticulous triangulation. Furthermore, we exploit Geometry-constraint to delineate road regions for scale recovery. The integration of this approach with the mo-nocular version of ORB-SLAM3 culminates in the accurate esti-mation of a road model, a pivotal component in our scale recov-ery process. Rigorous experiments, conducted on the KITTI da-taset, systematically compare our method with existing monocu-lar visual odometry algorithms and contemporary scale recovery methodologies. The results undeniably confirm the superior ef-fectiveness of our approach, surpassing state-of-the-art visual odometry algorithms. Our source code is available at https://git hub.com/bFr0zNq/MVOSegScale.

93.5ITMay 13
Grouped Annulus-Modulated Transceiver Is Almost Full DoF-Achieving for RIS-Assisted Symbiotic Radios Over Spatial-Correlated Channels

Ruo-Qi Sun, Jianfeng Shi, Yonggang Zhu et al.

This paper considers a RIS-assisted symbiotic communication system, where additional information is conveyed by the passive reconfigurable intelligent surface (RIS). In existing schemes, individual phase modulation is usually adopted at the RIS elements, which severely limits exploiting all extra multiplexing gains brought by the RIS. To address the issue, we propose a novel matrix decomposition algorithm that transforms the equivalent channel into a structured form while effectively suppressing the decomposition residual. Based on this, a novel transceiver architecture employing grouped annulus modulation (GAM) with a hexagonal-lattice-based constellation is developed, which is capable of achieving the full degrees of freedom (DoFs) when the decomposition algorithm performs as expected. Numerical results demonstrate that the proposed transceiver achieves much higher communication rates, thereby leading to higher spectral efficiency, compared to the conventional phase-only modulation scheme, while maintaining comparable error performance.

93.3SYMay 13
Port-Hamiltonian Systems with Dissipation Potential: Modelling and Trajectory Tracking Control

Jinjun Jia, Yuchen Liao, Kang An et al.

Port-Hamiltonian systems (PHS) and interconnection and damping assignment passivity-based control (IDA-PBC) have achieved broad success in modelling and stabilisation of physical systems. However, the absence of a dedicated scalar potential for the momentum channel forces any modification of the momentum-dependent dynamics to proceed indirectly through the interconnection and damping matrices, rendering the matching partial differential equation (PDE) difficult to solve and complicating extensions to trajectory tracking. This paper proposes a port-Hamiltonian system with dissipation potential (PHS-DP), in which the damping matrix is replaced by scalar convex dissipation potentials, providing independent scalar objects for the momentum and auxiliary state channels and restoring the variational symmetry between stored and dissipated energy. Building on this framework, Dual Potential Shaping Control (DPSC) achieves trajectory tracking by sequentially shaping the potential energy and dissipation potentials without modifying the interconnection structure. Contraction of the closed-loop cascade is established via a hierarchical contraction argument, and the matching condition is satisfied automatically for any admissible choice of shaped potentials, requiring no PDE to be solved. In contrast to existing PDE-free energy shaping approaches, which achieve this by abandoning the port-Hamiltonian closed-loop structure and sacrificing physical interpretability, the proposed framework preserves the interconnection structure and retains a transparent energy-based interpretation at every stage of the design. Validation on a magnetic levitation system demonstrates tracking performance comparable to timed IDA-PBC with substantially reduced design complexity.

CVJun 3, 2019Code
Automated Steel Bar Counting and Center Localization with Convolutional Neural Networks

Zhun Fan, Jiewei Lu, Benzhang Qiu et al.

Automated steel bar counting and center localization plays an important role in the factory automation of steel bars. Traditional methods only focus on steel bar counting and their performances are often limited by complex industrial environments. Convolutional neural network (CNN), which has great capability to deal with complex tasks in challenging environments, is applied in this work. A framework called CNN-DC is proposed to achieve automated steel bar counting and center localization simultaneously. The proposed framework CNN-DC first detects the candidate center points with a deep CNN. Then an effective clustering algorithm named as Distance Clustering(DC) is proposed to cluster the candidate center points and locate the true centers of steel bars. The proposed CNN-DC can achieve 99.26% accuracy for steel bar counting and 4.1% center offset for center localization on the established steel bar dataset, which demonstrates that the proposed CNN-DC can perform well on automated steel bar counting and center localization. Code is made publicly available at: https://github.com/BenzhangQiu/Steel-bar-Detection.

46.0AIApr 20
SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression

Shaowei Zhang, Faqiang Qian, Yan Chen et al.

Emotion Recognition in Conversation (ERC) has become a fundamental capability for large language models (LLMs) in human-centric interaction. Beyond accurate recognition, coherent emotional expression is also crucial, yet both are limited by the scarcity and static nature of high-quality annotated data. In this work, we propose SELF-EMO, a self-evolution framework grounded in the hypothesis that better emotion prediction leads to more consistent emotional responses. We introduce two auxiliary tasks, emotional understanding and emotional expression, and design a role-based self-play paradigm where the model acts as both an emotion recognizer and a dialogue responder. Through iterative interactions, the model generates diverse conversational trajectories, enabling scalable data generation. To ensure quality, we adopt a data flywheel mechanism that filters candidate predictions and responses using a smoothed IoU-based reward and feeds selected samples back for continuous self-improvement without external supervision. We further develop SELF-GRPO, a reinforcement learning algorithm that stabilizes optimization with multi-label alignment rewards and group-level consistency signals. Experiments on IEMOCAP, MELD, and EmoryNLP show that SELF-EMO achieves state-of-the-art performance, improving accuracy by +6.33% on Qwen3-4B and +8.54% on Qwen3-8B, demonstrating strong effectiveness and generalization.

CVDec 8, 2025
MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning

Xuhui Zheng, Kang An, Ziliang Wang et al.

Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.

LGJul 25, 2025
Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding

StepFun, Bin Wang, Bojun Wang et al.

Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under 50ms TPOT SLA (4K context, FP8, no MTP). It is higher than DeepSeek-V3's 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.

AISep 29, 2025
UniAPL: A Unified Adversarial Preference Learning Framework for Instruct-Following

FaQiang Qian, WeiKun Zhang, Ziliang Wang et al.

Shaping powerful LLMs to be beneficial and safe is central to AI alignment. We argue that post-training alignment is fundamentally a unified Preference Learning problem, involving two modalities: demonstrated preferences (e.g., Supervised Fine-Tuning, SFT) and comparative preferences (e.g., Reinforcement Learning, RL).The standard sequential pipeline-SFT followed by RL-is flawed due to a critical distributional mismatch: SFT uses static expert data, but as the policy evolves, its generation distribution drifts, making SFT knowledge brittle. Subsequent RL then explores without direct access to the rich, ground-truth knowledge in expert demonstrations, leading to inefficient, ungrounded updates. This separation prevents mutual regularization between data sources. To address this, we reframe alignment as a constrained optimization problem and propose Unified Adversarial Preference Learning (UniAPL),a novel framework that dynamically aligns the policy's distribution with the expert's. UniAPL implements a single-stage unified training objective, jointly learning from mixed batches of SFT and preference data. In every gradient step, dense expert demonstrations directly ground and regularize online exploration, inherently resolving distributional mismatch and maximizing data synergy.We evaluate UniAPL on instruction-following tasks using Qwen3-235B-Instruct-2507 as the teacher. Our models match or exceed strong GRPO baselines: +5.77% on Qwen3-0.6B (matching a 32B model) and +3.75% on Qwen3-4B,even outperforming the teacher. Analyses of response length and log-probability distributions confirm that UniAPL outputs closely mimic expert demonstrations, achieving both stronger performance and better behavioral alignment.

CVDec 17, 2025
Step-GUI Technical Report

Haolong Yan, Jia Wang, Xin Huang et al.

Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy with 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation with hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.

LGOct 8, 2025
AutoBalance: An Automatic Balancing Framework for Training Physics-Informed Neural Networks

Kang An, Chenhao Si, Ming Yan et al.

Physics-Informed Neural Networks (PINNs) provide a powerful and general framework for solving Partial Differential Equations (PDEs) by embedding physical laws into loss functions. However, training PINNs is notoriously difficult due to the need to balance multiple loss terms, such as PDE residuals and boundary conditions, which often have conflicting objectives and vastly different curvatures. Existing methods address this issue by manipulating gradients before optimization (a "pre-combine" strategy). We argue that this approach is fundamentally limited, as forcing a single optimizer to process gradients from spectrally heterogeneous loss landscapes disrupts its internal preconditioning. In this work, we introduce AutoBalance, a novel "post-combine" training paradigm. AutoBalance assigns an independent adaptive optimizer to each loss component and aggregates the resulting preconditioned updates afterwards. Extensive experiments on challenging PDE benchmarks show that AutoBalance consistently outperforms existing frameworks, achieving significant reductions in solution error, as measured by both the MSE and $L^{\infty}$ norms. Moreover, AutoBalance is orthogonal to and complementary with other popular PINN methodologies, amplifying their effectiveness on demanding benchmarks.

CLOct 1, 2025
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Ziliang Wang, Kang An, Xuhui Zheng et al.

While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art(SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.

LGOct 6, 2019
FIS-GAN: GAN with Flow-based Importance Sampling

Shiyu Yi, Donglin Zhan, Wenqing Zhang et al.

Generative Adversarial Networks (GAN) training process, in most cases, apply Uniform or Gaussian sampling methods in the latent space, which probably spends most of the computation on examples that can be properly handled and easy to generate. Theoretically, importance sampling speeds up stochastic optimization in supervised learning by prioritizing training examples. In this paper, we explore the possibility of adapting importance sampling into adversarial learning. We use importance sampling to replace Uniform and Gaussian sampling methods in the latent space and employ normalizing flow to approximate latent space posterior distribution by density estimation. Empirically, results on MNIST and Fashion-MNIST demonstrate that our method significantly accelerates GAN's optimization while retaining visual fidelity in generated samples.