Yufeng Yuan

h-index3

6papers

108citations

Novelty43%

AI Score30

Ranked #138,244 of 194,257 authors (top 71%)#8,408 in AI (top 67%)

6 Papers

33.4CVJul 8

Infinite Worlds with Versatile Interactions

Zelin Gao, Qiuyu Wang, Jiapeng Zhu et al.

We present LingBot-World 2.0 (also known as LingBot-World-Infinity), an advanced iteration of LingBot-World featuring four distinct upgrades. (1) Our model achieves an unbounded interaction horizon while maintaining consistent output quality, benefiting from a carefully crafted causal pretraining paradigm. (2) Through distilling a real-time variant from the base model, our system guarantees rapid response time, sufficient to drive 720p video streams at 60 fps. (3) Compared to the previous version, this update introduces highly diverse interactive elements, comprising a broader spectrum of actions (e.g., attacking, archery, spell-casting, and shooting) alongside a richer variety of text-driven events. (4) We pioneer the integration of an agentic harness within the domain of world modeling, wherein a pilot agent is tasked with planning and executing character behaviors, while a director agent is responsible for synthesizing novel environmental elements as the scene progresses. Additionally, to facilitate a shared experience, we develop an interface that permits multiple players to simultaneously immerse themselves in this vivid world simulator. We pair our primary 14B model with a lightweight 1.3B counterpart, which supports effortless deployment on a single GPU.

48.5AIApr 7, 2025

VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

Yu Yue, Yufeng Yuan, Qiying Yu et al.

We present VAPO, Value-based Augmented Proximal Policy Optimization framework for reasoning models., a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.

16.6CVMar 9, 2020

Accurate Temporal Action Proposal Generation with Relation-Aware Pyramid Network

Jialin Gao, Zhixiang Shi, Jiani Li et al.

Accurate temporal action proposals play an important role in detecting actions from untrimmed videos. The existing approaches have difficulties in capturing global contextual information and simultaneously localizing actions with different durations. To this end, we propose a Relation-aware pyramid Network (RapNet) to generate highly accurate temporal action proposals. In RapNet, a novel relation-aware module is introduced to exploit bi-directional long-range relations between local features for context distilling. This embedded module enhances the RapNet in terms of its multi-granularity temporal proposal generation ability, given predefined anchor boxes. We further introduce a two-stage adjustment scheme to refine the proposal boundaries and measure their confidence in containing an action with snippet-level actionness. Extensive experiments on the challenging ActivityNet and THUMOS14 benchmarks demonstrate our RapNet generates superior accurate proposals over the existing state-of-the-art methods.

2.6CVAug 9, 2019

Relation-Aware Pyramid Network (RapNet) for temporal action proposal

Jialin Gao, Zhixiang Shi, Jiani Li et al.

In this technical report, we describe our solution to temporal action proposal (task 1) in ActivityNet Challenge 2019. First, we fine-tune a ResNet-50-C3D CNN on ActivityNet v1.3 based on Kinetics pretrained model to extract snippet-level video representations and then we design a Relation-Aware Pyramid Network (RapNet) to generate temporal multiscale proposals with confidence score. After that, we employ a two-stage snippet-level boundary adjustment scheme to re-rank the order of generated proposals. Ensemble methods are also been used to improve the performance of our solution, which helps us achieve 2nd place.

1.5LGDec 29, 2018

Meta Reinforcement Learning with Distribution of Exploration Parameters Learned by Evolution Strategies

Yiming Shen, Kehan Yang, Yufeng Yuan et al.

In this paper, we propose a novel meta-learning method in a reinforcement learning setting, based on evolution strategies (ES), exploration in parameter space and deterministic policy gradients. ES methods are easy to parallelize, which is desirable for modern training architectures; however, such methods typically require a huge number of samples for effective training. We use deterministic policy gradients during adaptation and other techniques to compensate for the sample-efficiency problem while maintaining the inherent scalability of ES methods. We demonstrate that our method achieves good results compared to gradient-based meta-learning in high-dimensional control tasks in the MuJoCo simulator. In addition, because of gradient-free methods in the meta-training phase, which do not need information about gradients and policies in adaptation training, we predict and confirm our algorithm performs better in tasks that need multi-step adaptation.

13.2AIDec 18, 2018

Continual Match Based Training in Pommerman: Technical Report

Peng Peng, Liang Pang, Yufeng Yuan et al.

Continual learning is the ability of agents to improve their capacities throughout multiple tasks continually. While recent works in the literature of continual learning mostly focused on developing either particular loss functions or specialized structures of neural network explaining the episodic memory or neural plasticity, we study continual learning from the perspective of the training mechanism. Specifically, we propose a COnitnual Match BAsed Training (COMBAT) framework for training a population of advantage-actor-critic (A2C) agents in Pommerman, a partially observable multi-agent environment with no communication. Following the COMBAT framework, we trained an agent, namely, Navocado, that won the title of the top 1 learning agent in the NeurIPS 2018 Pommerman Competition. Two critical features of our agent are worth mentioning. Firstly, our agent did not learn from any demonstrations. Secondly, our agent is highly reproducible. As a technical report, we articulate the design of state space, action space, reward, and most importantly, the COMBAT framework for our Pommerman agent. We show in the experiments that Pommerman is a perfect environment for studying continual learning, and the agent can improve its performance by continually learning new skills without forgetting the old ones. Finally, the result in the Pommerman Competition verifies the robustness of our agent when competing with various opponents.