h-index: 32
Papers: 16
Citations: 131
Novelty: 45%
AI Score: 53

16 Papers

CV · Jan 14 · Code
STEP3-VL-10B Technical Report

Ailin Huang, Chengyuan Yao, Chunrui Han et al.

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

CL · Feb 11
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Ailin Huang, Ang Li, Aobo Kong et al.

We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
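The interleaved 3:1 sliding-window/full attention layout can be pictured with a small sketch. This is only an illustration under stated assumptions, not the model's implementation: the function names `layer_pattern` and `causal_mask` are hypothetical, the ordering inside each group of four layers (three sliding-window layers followed by one full-attention layer) is assumed, and the window size is arbitrary.

```python
from typing import List, Optional


def layer_pattern(num_layers: int, swa_per_full: int = 3) -> List[str]:
    """Assumed layout: `swa_per_full` sliding-window layers, then one full layer."""
    period = swa_per_full + 1
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(num_layers)]


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    With window=None this is an ordinary causal mask; otherwise each query
    only sees the most recent `window` positions (itself included).
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count (query, key) pairs that are allowed to interact."""
    return sum(sum(row) for row in mask)


pattern = layer_pattern(8)
full_pairs = attended_pairs(causal_mask(6))
swa_pairs = attended_pairs(causal_mask(6, window=2))
```

The pair count of a sliding-window layer grows linearly with sequence length while that of a full layer grows quadratically, which is the usual motivation for keeping most layers windowed in long multi-round interactions.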

CV · Sep 21, 2023
DIOR: Dataset for Indoor-Outdoor Reidentification -- Long Range 3D/2D Skeleton Gait Collection Pipeline, Semi-Automated Gait Keypoint Labeling and Baseline Evaluation Methods

Yuyang Chen, Praveen Raj Masilamani, Bhavin Jawade et al.

In recent times, there has been increased interest in the identification and re-identification of people at long distances, such as from rooftop cameras, UAV cameras, street cams, and others. Such recognition needs to go beyond the face and use whole-body markers such as gait. However, datasets to train and test such recognition algorithms are scarce, and even fewer are labeled. This paper introduces DIOR, a framework for data collection and semi-automated annotation, and provides a dataset with 14 subjects and 1.649 million RGB frames with 3D/2D skeleton gait labels, including 200 thousand frames from a long-range camera. Our approach leverages advanced 3D computer vision techniques to attain pixel-level accuracy in indoor settings with motion capture systems. Additionally, for outdoor long-range settings, we remove the dependency on motion capture systems and adopt a low-cost, hybrid 3D computer vision and learning pipeline with only 4 low-cost RGB cameras, successfully achieving precise skeleton labeling on far-away subjects, even when their height is limited to a mere 20-25 pixels within an RGB frame. On publication, we will make our pipeline open for others to use.

GR · Apr 9, 2023
HyperINR: A Fast and Predictive Hypernetwork for Implicit Neural Representations via Knowledge Distillation

Qi Wu, David Bauer, Yuyang Chen et al.

Implicit Neural Representations (INRs) have recently exhibited immense potential in the field of scientific visualization for both data generation and visualization tasks. However, these representations often consist of large multi-layer perceptrons (MLPs), necessitating millions of operations for a single forward pass, consequently hindering interactive visual exploration. While reducing the size of the MLPs and employing efficient parametric encoding schemes can alleviate this issue, it compromises generalizability for unseen parameters, rendering it unsuitable for tasks such as temporal super-resolution. In this paper, we introduce HyperINR, a novel hypernetwork architecture capable of directly predicting the weights for a compact INR. By harnessing an ensemble of multiresolution hash encoding units in unison, the resulting INR attains state-of-the-art inference performance (up to 100x higher inference bandwidth) and can support interactive photo-realistic volume visualization. Additionally, by incorporating knowledge distillation, exceptional data and visualization generation quality is achieved, making our method valuable for real-time parameter exploration. We validate the effectiveness of the HyperINR architecture through a comprehensive ablation study. We showcase the versatility of HyperINR across three distinct scientific domains: novel view synthesis, temporal super-resolution of volume data, and volume rendering with dynamic global shadows. By simultaneously achieving efficiency and generalizability, HyperINR paves the way for applying INR in a wider array of scientific visualization applications.

CY · Dec 16, 2022
Twitter's Agenda-Setting Role: A Study of Twitter Strategy for Political Diversion

Yuyang Chen, Xiaoyu Cui, Yunjie Song et al.

This study verifies the effectiveness of Donald Trump's Twitter campaign in guiding agenda-setting and deflecting political risk, examines Trump's Twitter communication strategy, and explores the communication effects of his tweet content during the Covid-19 pandemic. We collected all tweets posted by Trump on the Twitter platform from January 1, 2020 to December 31, 2020. We used Ordinary Least Squares (OLS) regression analysis with a fixed effects model to test for the existence of the Twitter strategy. The correlation between the number of confirmed daily Covid-19 diagnoses and the number of particular thematic tweets was investigated using time series analysis. Empirical analysis revealed that the Twitter strategy was used to divert public attention from negative Covid-19 reports during the epidemic, and that it had a powerful political communication effect on Twitter. However, findings suggest that Trump did not use false claims to divert political risk and shape public opinion.

57.5 · MA · Apr 10
C$^2$T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic--Vehicle Coordination

Yuyang Chen, Kaiyan Zhao, Yiming Wang et al.

State-of-the-art (SOTA) urban traffic control increasingly employs Multi-Agent Reinforcement Learning (MARL) to coordinate Traffic Light Controllers (TLCs) and Connected Autonomous Vehicles (CAVs). However, the performance of these systems is fundamentally capped by their hand-crafted, myopic rewards (e.g., intersection pressure), which fail to capture high-level, human-centric goals like safety, flow stability, and comfort. To overcome this limitation, we introduce C2T, a novel framework that learns a common-sense coordination model from traffic-vehicle dynamics. C2T distills "common-sense" knowledge from a Large Language Model (LLM) into a learned intrinsic reward function. This new reward is then used to guide the coordination policy of a cooperative multi-intersection TLC MARL system on CityFlow-based multi-intersection benchmarks. Our framework significantly outperforms strong MARL baselines in traffic efficiency, safety, and an energy-related proxy. We further highlight C2T's flexibility in principle, allowing distinct "efficiency-focused" versus "safety-focused" policies by modifying the LLM prompt.

49.0 · RO · Apr 30
E$^2$DT: Efficient and Effective Decision Transformer with Experience-Aware Sampling for Robotic Manipulation

Kaiyan Zhao, Borong Zhang, Yiming Wang et al.

In reinforcement learning (RL) for robotic manipulation, the Decision Transformer (DT) has emerged as an effective framework for addressing long-horizon tasks. However, DT's performance depends heavily on the coverage of collected experiences. Without an active exploration mechanism, standard DT relies on uniform replay, which leads to poor sample efficiency, limited exploration, and reduced overall effectiveness. At the same time, while excessive exploration can help avoid local optima, it often delays policy convergence and leads to degraded efficiency. To address these limitations, we propose E$^2$DT, a DT-guided k-Determinantal Point Process sampling framework that enables the model to actively shape its own experience selection. Our framework is experience-aware, allowing E$^2$DT to be both efficient, by prioritizing sampling quality, such as high-return, high-uncertainty, and underrepresented trajectories, and effective, by ensuring diversity across trajectory windows to preserve policy optimality. Specifically, DT's internal latent embeddings measure diversity across trajectory windows, while quality is quantified through a composite metric that integrates return-to-go (RTG) quantiles, predictive uncertainty, and stage coverage based on inverse frequency. These two dimensions are integrated into a novel quality-diversity joint kernel that prioritizes the most informative experiences, thereby enabling learning that is both efficient and effective. We evaluate E$^2$DT on challenging robotic manipulation benchmarks in both simulation and real-robot settings. Results show that it consistently outperforms prior methods. These findings demonstrate that coupling policy learning with experience-aware sampling provides a principled path toward robust long-horizon robotic learning.
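The quality-diversity joint kernel described above can be sketched in the standard determinantal point process form, L[i][j] = q_i * S[i][j] * q_j, where q is a per-window quality score and S is a similarity between embeddings. The sketch below is a hedged illustration, not the paper's method: the quality scores are given directly rather than computed from RTG quantiles, uncertainty, and stage coverage; cosine similarity stands in for whatever similarity the authors use; and a simple greedy determinant maximization replaces true k-DPP sampling. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_kernel(quality: Sequence[float], embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Quality-diversity kernel: L[i][j] = q_i * cos(e_i, e_j) * q_j."""
    n = len(quality)
    return [
        [quality[i] * cosine(embeddings[i], embeddings[j]) * quality[j] for j in range(n)]
        for i in range(n)
    ]


def det(matrix: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def greedy_select(kernel: List[List[float]], k: int) -> List[int]:
    """Greedy stand-in for k-DPP sampling: add the item that maximizes det(L_S)."""
    chosen: List[int] = []
    for _ in range(k):
        best, best_det = -1, -1.0
        for i in range(len(kernel)):
            if i in chosen:
                continue
            idx = chosen + [i]
            d = det([[kernel[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        chosen.append(best)
    return chosen


# Hypothetical data: windows 0 and 1 are near-duplicates, window 2 is different.
quality = [1.0, 0.95, 0.6]
embeddings = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]
selected = greedy_select(build_kernel(quality, embeddings), k=2)
```

In this toy example the second pick is window 2 rather than the higher-quality window 1, because window 1 is almost collinear with the already chosen window 0 and contributes very little to the determinant.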

LG · Jul 25, 2025
Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding

StepFun, Bin Wang, Bojun Wang et al.

Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under 50ms TPOT SLA (4K context, FP8, no MTP). It is higher than DeepSeek-V3's 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.

LG · Oct 27, 2024
Efficient Diversity-based Experience Replay for Deep Reinforcement Learning

Kaiyan Zhao, Yiming Wang, Yuyang Chen et al.

Experience replay is widely used to improve learning efficiency in reinforcement learning by leveraging past experiences. However, existing experience replay methods, whether based on uniform or prioritized sampling, often suffer from low efficiency, particularly in real-world scenarios with high-dimensional state spaces. To address this limitation, we propose a novel approach, Efficient Diversity-based Experience Replay (EDER). EDER employs a determinantal point process to model the diversity between samples and prioritizes replay accordingly. To further enhance learning efficiency, we incorporate Cholesky decomposition for handling large state spaces in realistic environments. Additionally, rejection sampling is applied to select samples with higher diversity, thereby improving overall learning efficacy. Extensive experiments are conducted on robotic manipulation tasks in MuJoCo, Atari games, and realistic indoor environments in Habitat. The results demonstrate that our approach not only significantly improves learning efficiency but also achieves superior performance in high-dimensional, realistic environments.

MM · Sep 22, 2025
Mano Technical Report

Tianyu Fu, Anyang Su, Chenxu Zhao et al.

Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.

LG · May 17, 2025
HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class

James V. Roggeveen, Erik Y. Wang, Will Flintoft et al.

Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems that have exact analytical solutions or involve formal proofs, often overlooking approximation-based problems ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present HARDMath2, a dataset of 211 original problems covering the core topics in an introductory graduate applied math class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. This dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning skills of current LLMs. Importantly, students identified strategies to create increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only resulted in a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as we enter an age where state-of-the-art language models can solve many challenging problems across a wide domain of fields.

LG · Oct 16, 2024
Enhancing LLM Agents for Code Generation with Possibility and Pass-rate Prioritized Experience Replay

Yuyang Chen, Kaiyan Zhao, Yiming Wang et al.

Transformer-based Large Language Models (LLMs) for code generation usually rely on sampling and filtering pipelines. Because a single incorrect token can make a program fail, rewards in code generation are sparse, and transformer-based models sample many redundant programs until they find a correct one, leading to low efficiency. To overcome this challenge, we incorporate Experience Replay (ER) in the fine-tuning phase, where the programs produced are stored and replayed to give the LLM agent a chance to learn from past experiences. In the spirit of ER, we introduce a novel approach called the BTP pipeline, which consists of three phases: beam search sampling, testing, and prioritized experience replay. The approach makes use of failed programs collected by code models and replays programs with a high Possibility and Pass-rate Prioritized value (P2Value) from the replay buffer to improve efficiency. P2Value jointly considers the possibility of the transformer's output and the pass rate, and can make use of the otherwise wasted resources that arise because most programs collected by LLMs fail to pass any tests. We apply our approach to several LLMs, demonstrating that it enhances their performance in code generation tasks and surpasses existing baselines.
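The prioritized replay phase can be sketched as ranking stored programs by a score that mixes the model's output possibility with the test pass rate. The abstract does not give the P2Value formula, so the weighted sum below, the mixing weight `alpha`, and the names `Program`, `p2value`, and `replay_batch` are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class Program:
    source: str
    possibility: float  # model-assigned probability of the sampled program, in [0, 1]
    passed: int         # number of unit tests passed
    total: int          # number of unit tests run


def p2value(program: Program, alpha: float = 0.5) -> float:
    """Hypothetical P2Value: convex combination of possibility and pass rate."""
    pass_rate = program.passed / program.total if program.total else 0.0
    return alpha * program.possibility + (1.0 - alpha) * pass_rate


def replay_batch(buffer: List[Program], k: int, alpha: float = 0.5) -> List[Program]:
    """Return the k stored programs with the highest P2Value, best first."""
    return heapq.nlargest(k, buffer, key=lambda p: p2value(p, alpha))


# Hypothetical buffer: A is confident but fails every test, B is partially correct.
buffer = [
    Program("A", possibility=0.9, passed=0, total=4),
    Program("B", possibility=0.4, passed=3, total=4),
    Program("C", possibility=0.2, passed=1, total=4),
]
batch = replay_batch(buffer, k=2)
```

Under this assumed scoring, a program that fails every test can still be replayed when the model assigned it high possibility, which matches the stated goal of reusing failed programs instead of discarding them.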

CV · Nov 28, 2024
Dynamic Attention and Bi-directional Fusion for Safety Helmet Wearing Detection

Junwei Feng, Xueyan Fan, Yuyang Chen et al.

Ensuring construction site safety requires accurate and real-time detection of workers' safety helmet use, despite challenges posed by cluttered environments, densely populated work areas, and hard-to-detect small or overlapping objects caused by building obstructions. This paper proposes a novel algorithm for safety helmet wearing detection, incorporating a dynamic attention mechanism within the detection head to enhance multi-scale perception. The mechanism combines feature-level attention for scale adaptation, spatial attention for spatial localization, and channel attention for task-specific insights, improving small object detection without additional computational overhead. Furthermore, a two-way fusion strategy enables bidirectional information flow, refining feature fusion through adaptive multi-scale weighting, and enhancing recognition of occluded targets. Experimental results demonstrate a 1.7% improvement in mAP@[.5:.95] compared to the best baseline while reducing GFLOPs by 11.9% on larger sizes. The proposed method surpasses existing models, providing an efficient and practical solution for real-world construction safety monitoring.

DS · Jan 7, 2022
An Improved Mathematical Model of Sepsis: Modeling, Bifurcation Analysis, and Optimal Control Study for Complex Nonlinear Infectious Disease System

Yuyang Chen, Kaiming Bi, Chih-Hang J. Wu et al.

Sepsis is a life-threatening medical emergency, a major cause of death worldwide, and the second highest cause of mortality in the United States. Research on optimal control treatment or intervention strategies for the comprehensive sepsis system is key to reducing mortality. To this end, this paper first improves a complex nonlinear sepsis model proposed in our previous work. Bifurcation analyses are then conducted for each sepsis subsystem to study the model's behavior under certain system parameters. The bifurcation analysis results further indicate the necessity of control treatment and intervention therapy: without any control, under some parameter and initial system value settings, the system exhibits persistent inflammation over time. Therefore, we extend our improved nonlinear sepsis model into a sepsis optimal control model, and use effective biomarkers recommended in existing clinical practice as the optimization objective function to measure the development of sepsis. In addition, a Bayesian optimization algorithm combined with a recurrent neural network (the RNN-BO algorithm) is introduced to predict the optimal control strategy for the studied sepsis optimal control system. What distinguishes the RNN-BO algorithm from other optimization algorithms is that, given any new initial system value setting (the initial value is associated with the initial conditions of patients), it can quickly predict a corresponding time-series optimal control for any new sepsis patient based on historical optimal control data. To demonstrate the effectiveness and efficiency of the RNN-BO algorithm in solving the optimal control problem on the complex nonlinear sepsis system, numerical simulations compare it with other optimization algorithms.

LG · Jan 1, 2022
High-dimensional Bayesian Optimization Algorithm with Recurrent Neural Network for Disease Control Models in Time Series

Yuyang Chen, Kaiming Bi, Chih-Hang J. Wu et al.

The Bayesian optimization algorithm has become a promising approach for nonlinear global optimization problems and many machine learning applications. Over the past few years, improvements and enhancements have been proposed, and they have shown promising results in solving complex dynamic problems, such as systems of ordinary differential equations, where the objective functions are computationally expensive to evaluate. However, a straightforward implementation of the Bayesian optimization algorithm performs well only for optimization problems with 10-20 dimensions. This paper proposes a new high-dimensional Bayesian optimization algorithm combined with recurrent neural networks, which is expected to predict the optimal solution for global optimization problems with high-dimensional or time-series decision models. The proposed RNN-BO algorithm solves the optimal control problems in a lower-dimensional space and then uses the recurrent neural network to learn from historical optimal solution data and predict the optimal control strategy for any new initial system value setting. In addition, providing the optimal control strategy accurately and quickly is essential to control epidemic spread effectively and efficiently while minimizing the associated financial costs. Therefore, to verify the effectiveness of the proposed algorithm, computational experiments are carried out on a deterministic SEIR epidemic model and a stochastic SIS optimal control model. Finally, we also discuss the impact of the number of RNN layers and training epochs on the trade-off between solution quality and computational effort.
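For readers unfamiliar with the deterministic SEIR test problem, the sketch below simulates a textbook SEIR model with a constant control that scales down the transmission rate. It is a minimal illustration only: the parameter values, the forward Euler integrator, the form of the control term, and the function names are assumptions, not the paper's setup, and no optimization is performed here.

```python
from typing import Tuple

State = Tuple[float, float, float, float]  # (S, E, I, R)


def seir_step(state: State, beta: float, sigma: float, gamma: float,
              control: float, dt: float) -> State:
    """One forward Euler step; `control` in [0, 1] reduces transmission."""
    s, e, i, r = state
    n = s + e + i + r
    new_exposed = (1.0 - control) * beta * s * i / n * dt
    new_infectious = sigma * e * dt
    new_recovered = gamma * i * dt
    return (
        s - new_exposed,
        e + new_exposed - new_infectious,
        i + new_infectious - new_recovered,
        r + new_recovered,
    )


def simulate(initial: State, control: float, days: float = 160.0, dt: float = 0.1,
             beta: float = 0.5, sigma: float = 0.2, gamma: float = 0.1):
    """Return the final state and the peak number of infectious individuals."""
    state = initial
    peak = state[2]
    for _ in range(round(days / dt)):
        state = seir_step(state, beta, sigma, gamma, control, dt)
        peak = max(peak, state[2])
    return state, peak


initial = (999.0, 0.0, 1.0, 0.0)
final_free, peak_free = simulate(initial, control=0.0)
final_ctrl, peak_ctrl = simulate(initial, control=0.5)
```

An optimal control study would replace the constant `control` with a time series and search over it to trade infection burden against intervention cost; each candidate time series requires a full simulation, which is why the objective is expensive and the decision space is high-dimensional.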

LG · Aug 4, 2021
High dimensional Bayesian Optimization Algorithm for Complex System in Time Series

Yuyang Chen, Kaiming Bi, Chih-Hang J. Wu et al.

High-dimensional global optimization problems with time-series models have received much attention in engineering fields. Since it was proposed, Bayesian optimization has quickly become a popular and promising approach for solving global optimization problems. However, the standard Bayesian optimization algorithm is insufficient for finding the global optimal solution when the model is high-dimensional. Hence, this paper presents a novel high-dimensional Bayesian optimization algorithm that considers dimension reduction and different dimension fill-in strategies. Most existing literature on Bayesian optimization algorithms does not discuss sampling strategies for optimizing the acquisition function. This study proposes a new sampling method, based on both the multi-armed bandit and random search methods, for optimizing the acquisition function. In addition, based on the time-dependent or dimension-dependent characteristics of the model, the proposed algorithm can reduce the dimension evenly. Five different dimension fill-in strategies are then discussed and compared. Finally, to increase the final accuracy of the optimal solution, the proposed algorithm adds a local search based on a series of Adam-based steps at the final stage. Our computational experiments demonstrate that the proposed Bayesian optimization algorithm can achieve reasonable solutions with excellent performance for high-dimensional global optimization problems with a time-series optimal control model.