12 Papers

CV · Sep 29, 2023
Forward Flow for Novel View Synthesis of Dynamic Scenes

Xiang Guo, Jiadai Sun, Yuchao Dai et al.

This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping. Existing methods often adopt a static NeRF to represent the canonical space, and render dynamic images at other time steps by mapping the sampled 3D points back to the canonical space with the learned backward flow field. However, this backward flow field is non-smooth and discontinuous, which makes it difficult to fit with commonly used smooth motion models. To address this problem, we propose to estimate the forward flow field and directly warp the canonical radiance field to other time steps. Such a forward flow field is smooth and continuous within the object region, which benefits motion model learning. To achieve this goal, we represent the canonical radiance field with voxel grids to enable efficient forward warping, and propose a differentiable warping process, including an average splatting operation and an inpaint network, to resolve the many-to-one and one-to-many mapping issues. Thorough experiments show that our method outperforms existing methods in both novel view rendering and motion modeling, demonstrating the effectiveness of our forward flow motion modeling. Project page: https://npucvr.github.io/ForwardFlowDNeRF
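The many-to-one and one-to-many issues mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it works on a 1D grid with nearest-cell splatting, and the function name `average_splat` and its parameters are hypothetical. Several source points that land in the same target cell are averaged (many-to-one), and cells that receive no point are left as `None`, which is the kind of hole an inpainting step would have to fill.

```python
import math
from collections import defaultdict


def average_splat(points, values, flow, grid_size):
    """Forward-warp 1D points by `flow` and average values landing in one cell.

    Hypothetical helper for illustration only: nearest-cell splatting on a
    1D grid, not the differentiable 3D voxel warping described in the paper.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, v, f in zip(points, values, flow):
        # Round the warped position to the nearest cell (half rounds up).
        cell = math.floor(x + f + 0.5)
        if 0 <= cell < grid_size:
            sums[cell] += v
            counts[cell] += 1
    # Cells that received no source point stay empty (None).
    return [
        sums[c] / counts[c] if counts[c] else None
        for c in range(grid_size)
    ]


warped = average_splat(
    points=[0, 1, 2, 3],
    values=[1.0, 3.0, 5.0, 7.0],
    flow=[1, 0, 0, 1],
    grid_size=5,
)
print(warped)  # [None, 2.0, 5.0, None, 7.0]
```

Here points 0 and 1 both land in cell 1, so their values are averaged, while cells 0 and 3 receive nothing and remain empty.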

CV · Jun 15, 2022
Neural Deformable Voxel Grid for Fast Optimization of Dynamic View Synthesis

Xiang Guo, Guanying Chen, Yuchao Dai et al.

Recently, Neural Radiance Fields (NeRF) have revolutionized the task of novel view synthesis (NVS) with their superior performance. In this paper, we address view synthesis of dynamic scenes. Extending methods for static scenes to dynamic scenes is not straightforward, as both the scene geometry and appearance change over time, especially under a monocular setup. Also, existing dynamic NeRF methods generally require a lengthy per-scene training procedure, where multi-layer perceptrons (MLPs) are fitted to model both motion and radiance. Built on top of recent advances in voxel-grid optimization, we propose a fast deformable radiance field method to handle dynamic scenes. Our method consists of two modules. The first module adopts a deformation grid to store 3D dynamic features, and a light-weight MLP for decoding the deformation that maps a 3D point in the observation space to the canonical space using the interpolated features. The second module contains a density and a color grid to model the geometry and density of the scene. The occlusion is explicitly modeled to further improve the rendering quality. Experimental results show that our method achieves comparable performance to D-NeRF using only 20 minutes for training, which is more than 70x faster than D-NeRF, clearly demonstrating the efficiency of our proposed method.

CL · Sep 6, 2023
GRASS: Unified Generation Model for Speech-to-Semantic Tasks

Aobo Xia, Shuyu Lei, Yushu Yang et al.

This paper explores the instruction fine-tuning technique for speech-to-semantic tasks by introducing a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data. We pre-train the model using large and diverse data, where instruction-speech pairs are constructed via a text-to-speech (TTS) system. Extensive experiments demonstrate that our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more, after fine-tuning. Furthermore, the proposed model achieves competitive performance in zero-shot and few-shot scenarios. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.

CV · Apr 9, 2024
3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis

Zhicheng Lu, Xiang Guo, Le Hui et al.

In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis. Existing neural radiance field (NeRF) based solutions learn the deformation in an implicit manner, which cannot incorporate 3D scene geometry. Therefore, the learned deformation is not necessarily geometrically coherent, which results in unsatisfactory dynamic view synthesis and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting has provided a new representation of the 3D scene, building upon which the 3D geometry can be exploited in learning the complex 3D deformation. Specifically, the scenes are represented as a collection of 3D Gaussians, where each 3D Gaussian is optimized to move and rotate over time to model the deformation. To enforce the 3D scene geometry constraint during deformation, we explicitly extract 3D geometry features and integrate them in learning the 3D deformation. In this way, our solution achieves 3D geometry-aware deformation modeling, which enables improved dynamic view synthesis and 3D dynamic reconstruction. Extensive experimental results on both synthetic and real datasets demonstrate the superiority of our solution, which achieves new state-of-the-art performance. The project is available at https://npucvr.github.io/GaGS/

CL · Jul 20, 2024
Seal: Advancing Speech Language Models to be Few-Shot Learners

Shuyu Lei, Lingen Liu, Jiaolong Yang et al.

Existing auto-regressive language models have demonstrated a remarkable capability to perform a new task with just a few examples in the prompt, without requiring any additional training. To extend this capability to a multi-modal setting (i.e., speech and language), this paper introduces the Seal model, an abbreviation for speech language model. It incorporates a novel alignment method, in which a Kullback-Leibler divergence loss is used to train a projector that bridges a frozen speech encoder with a frozen language model decoder. The resulting Seal model exhibits robust performance as a few-shot learner on two speech understanding tasks. Additionally, consistency experiments are conducted to validate its robustness on different pre-trained language models.

LG · Jan 7
ETR: Outcome-Guided Elastic Trust Regions for Policy Optimization

Shijie Zhang, Kevin Zhang, Zheyuan Gu et al.

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an important paradigm for unlocking reasoning capabilities in large language models, exemplified by the success of OpenAI o1 and DeepSeek-R1. Currently, Group Relative Policy Optimization (GRPO) stands as the dominant algorithm in this domain due to its stable training and critic-free efficiency. However, we argue that GRPO suffers from a structural limitation: it imposes a uniform, static trust region constraint across all samples. This design implicitly assumes signal homogeneity, a premise misaligned with the heterogeneous nature of outcome-driven learning, where advantage magnitudes and variances fluctuate significantly. Consequently, static constraints fail to fully exploit high-quality signals while insufficiently suppressing noise, often precipitating rapid entropy collapse. To address this, we propose Elastic Trust Regions (ETR), a dynamic mechanism that aligns optimization constraints with signal quality. ETR constructs a signal-aware landscape through dual-level elasticity: at the micro level, it scales clipping boundaries based on advantage magnitude to accelerate learning from high-confidence paths; at the macro level, it leverages group variance to implicitly allocate larger update budgets to tasks in the optimal learning zone. Extensive experiments on AIME and MATH benchmarks demonstrate that ETR consistently outperforms GRPO, achieving superior accuracy while effectively mitigating policy entropy degradation to ensure sustained exploration.
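As a rough illustration of the micro-level idea only, the sketch below contrasts a fixed clipping range with one that widens as the advantage magnitude grows. The abstract does not give ETR's actual scaling rule, so `elastic_eps` and its parameters (`base_eps`, `alpha`, `max_eps`) are hypothetical stand-ins. The group-relative advantage and the clipped surrogate follow the commonly described GRPO/PPO form, using the population standard deviation here as an assumption.

```python
import statistics


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group (commonly described GRPO form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption of this sketch
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def clipped_objective(ratio, advantage, eps):
    """PPO-style clipped surrogate for a single sample."""
    clipped_ratio = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def elastic_eps(advantage, base_eps=0.2, alpha=0.5, max_eps=0.4):
    """HYPOTHETICAL rule: widen the clip range with |advantage|, up to a cap.

    This is not the formula from the paper, only one simple way to make the
    trust region depend on advantage magnitude.
    """
    return min(base_eps * (1 + alpha * abs(advantage)), max_eps)


advantages = group_relative_advantages([1, 0, 1, 0])
print(advantages)  # [1.0, -1.0, 1.0, -1.0]

ratio = 1.5
static = clipped_objective(ratio, 1.0, eps=0.2)
elastic = clipped_objective(ratio, 1.0, eps=elastic_eps(1.0))
print(static < elastic)  # True: the wider range lets a larger update through
```

With a static range the update is capped at a ratio of 1.2, while the hypothetical elastic range allows a larger step for the same high-advantage sample.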

LG · Feb 10
Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning

Shijie Zhang, Xiang Guo, Rujun Guo et al.

Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to achieve AFRL. However, directly applying existing RL training often leads to mode collapse in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information-theoretic perspective, RL inherently minimizes the reverse KL divergence, which tends to seek probability peaks (mode-seeking) and is prone to "reward hacking." On the other hand, SFT minimizes the forward KL divergence, forcing the model to cover the data distribution (mode-covering) and effectively anchoring expert rules. Based on this insight, we propose a Mode-Balanced Optimization strategy, incorporating an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model, thereby reconciling reasoning depth with deployment latency.
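The asymmetry between forward and reverse KL divergence can be checked on a small discrete example. The distributions below are made up for illustration and are not from the paper: `p` is a two-peaked target, `q_peak` concentrates on a single peak, and `q_cover` spreads mass over everything. Forward KL, KL(p || q), prefers the covering candidate, while reverse KL, KL(q || p), prefers the peaked one.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; requires q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative distributions (assumptions of this sketch).
p = [0.49, 0.02, 0.49]        # target with two modes
q_peak = [0.96, 0.02, 0.02]   # commits to one mode
q_cover = [0.34, 0.32, 0.34]  # spreads mass across both modes

# Forward KL (mode-covering): the spread-out candidate scores better.
print(kl(p, q_cover) < kl(p, q_peak))  # True

# Reverse KL (mode-seeking): the single-peak candidate scores better.
print(kl(q_peak, p) < kl(q_cover, p))  # True
```

This is only a numerical illustration of the two divergences; it does not reproduce the paper's training objective or its SFT auxiliary loss.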

AI · Sep 29, 2025
CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

Shijie Zhang, Guohao Sun, Kevin Zhang et al.

Recently, online Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing methods typically treat all training samples uniformly, overlooking the vast differences in problem difficulty relative to the model's current capabilities. This uniform training strategy leads to inefficient exploration of problems the model has already mastered, while lacking effective guidance on the problems that challenge its abilities the most, limiting both learning efficiency and upper-bound performance. To address this, we propose CLPO (Curriculum-guided Learning for Policy Optimization), a novel algorithm that creates a dynamic pedagogical feedback loop within the policy optimization process. The core of CLPO leverages the model's own rollout performance to conduct real-time difficulty assessment, thereby constructing an Online Curriculum. This curriculum then guides an Adaptive Problem Restructuring mechanism, where the model acts as its own teacher: it diversifies medium-difficulty problems to promote generalization and simplifies challenging problems to make them more attainable. Our approach transforms the static training procedure into a dynamic process that co-evolves with the model's capabilities. Experiments show that CLPO achieves state-of-the-art performance across eight challenging mathematical and general reasoning benchmarks, with an average pass@1 improvement of 6.96% over other methods, demonstrating its potential for more efficiently training more capable reasoning models.
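One simple way to picture rollout-based difficulty assessment is to bucket each problem by the fraction of its rollouts that were verified correct. The sketch below is an assumption-laden illustration, not CLPO itself: the thresholds `easy_at` and `hard_at`, the bucket names, and the "keep" action for mastered problems are all hypothetical. Only the mapping of medium problems to diversification and hard problems to simplification comes from the abstract.

```python
def pass_rate(outcomes):
    """Fraction of rollouts for one problem that were verified correct."""
    return sum(outcomes) / len(outcomes)


def difficulty_bucket(outcomes, easy_at=0.8, hard_at=0.2):
    """Bucket a problem by rollout pass rate (thresholds are hypothetical)."""
    rate = pass_rate(outcomes)
    if rate >= easy_at:
        return "mastered"
    if rate <= hard_at:
        return "hard"
    return "medium"


def restructuring_action(bucket):
    """Map a bucket to an action; "keep" for mastered problems is an assumption."""
    return {"medium": "diversify", "hard": "simplify", "mastered": "keep"}[bucket]


rollouts = {
    "problem_a": [True, True, True, True],     # always solved
    "problem_b": [True, False, True, False],   # solved half the time
    "problem_c": [False, False, False, False]  # never solved
}
for name, outcomes in rollouts.items():
    bucket = difficulty_bucket(outcomes)
    print(name, bucket, restructuring_action(bucket))
```

Because the buckets are recomputed from fresh rollouts, a problem can move between them as the policy improves, which is the sense in which such a curriculum is online.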

HC · Feb 27, 2022
Roadway Design Matters: Variation in Bicyclists' Psycho-Physiological Responses in Different Urban Roadway Designs

Xiang Guo, Arash Tavakoli, Erin Robartes et al.

As a healthier and more sustainable way of mobility, cycling has been advocated by literature and policy. However, current trends in bicyclist crash fatalities suggest deficiencies in current roadway design in protecting these vulnerable road users. The lack of cycling data is a common challenge for studying bicyclists' safety, behavior, and comfort levels under different design contexts. To understand bicyclists' behavioral and physiological responses in an efficient and safe way, this study uses a bicycle simulator within an immersive virtual environment (IVE). Off-the-shelf sensors are utilized to evaluate bicyclists' cycling performance (speed and lane position) and physiological responses (eye tracking and heart rate (HR)). Participants bike in a simulated virtual environment modeled to scale from a real-world street with a shared bike lane (sharrow) to evaluate how the introduction of a bike lane and a protected bike lane with pylons may impact perceptions of safety, as well as behavioral and psycho-physiological responses. Results from 50 participants show that the protected bike lane design received the highest perceived safety rating and exhibited the lowest average cycling speed. Furthermore, both the bike lane and the protected bike lane scenarios show a less dispersed gaze distribution than the as-built sharrow scenario, reflecting a higher gaze focus among bicyclists on the biking task in the bike lane and protected bike lane scenarios, compared to when bicyclists share the right of way with vehicles. Additionally, heart rate change point results from the study suggest that creating dedicated zones for bicyclists (bike lanes or protected bike lanes) has the potential to reduce bicyclists' stress levels.

HC · Dec 6, 2021
ORCLSim: A System Architecture for Studying Bicyclist and Pedestrian Physiological Behavior Through Immersive Virtual Environments

Xiang Guo, Austin Angulo, Erin Robartes et al.

Injuries and fatalities for vulnerable road users, especially bicyclists and pedestrians, are on the rise. To better inform design for vulnerable road users, we need to conduct more studies to evaluate how bicyclist and pedestrian behavior and physiological states change in different roadway designs and contextual settings. Previous research highlights the advantages of Immersive Virtual Environments (IVE) in conducting bicyclist and pedestrian studies. These environments do not put participants at risk of getting injured, are low-cost compared to on-road or naturalistic studies, and allow researchers to fully control variables of interest. In this paper, we propose a framework, ORCLSim, to support human sensing techniques within IVE to evaluate bicyclist and pedestrian physiological and behavioral changes in different contextual settings. To showcase this framework, we present two case studies where we collect and analyze pilot data from five participants' physiological and behavioral responses in an IVE setting, representing real-world roadway segments and traffic conditions. Results from these case studies indicate that physiological data is sensitive to road environment changes and real-time events, especially changes in heart rate and gaze behavior. Additionally, our preliminary data indicates participants may respond differently to various roadway settings (e.g., intersections with or without traffic signal). By analyzing these changes, we can identify how participants' stress levels and cognitive load are impacted by the simulated surrounding environment. The ORCLSim system architecture can be further utilized for future studies in users' behavioral and physiological responses in different virtual reality settings.

CV · Oct 22, 2020
Novel View Synthesis from only a 6-DoF Camera Pose by Two-stage Networks

Xiang Guo, Bo Li, Yuchao Dai et al.

Novel view synthesis is a challenging problem in computer vision and robotics. Unlike existing works, which need reference images or 3D models of the scene to generate images under novel views, we propose a novel paradigm for this problem: we synthesize the novel view directly from only a 6-DoF camera pose. Although this setting is the most straightforward, few works address it. Our experiments demonstrate that, with a concise CNN, we can obtain a meaningful parametric model that reconstructs the correct scenery images from the 6-DoF pose alone. To this end, we propose a two-stage learning strategy, which consists of two consecutive CNNs: GenNet and RefineNet. GenNet generates a coarse image from a camera pose. RefineNet is a generative adversarial network that refines the coarse image. In this way, we decouple the geometric relationship between mapping and texture detail rendering. Extensive experiments conducted on public datasets demonstrate the effectiveness of our method. We believe this paradigm is of high research and application value and could be an important direction in novel view synthesis.

CV · Jul 30, 2018
Occluded Joints Recovery in 3D Human Pose Estimation based on Distance Matrix

Xiang Guo, Yuchao Dai

Despite recent progress in single-image 3D human pose estimation driven by convolutional neural networks, handling real scenarios such as highly occluded scenes remains challenging. In this paper, we propose to address the problem of single-image 3D human pose estimation with occluded measurements by exploiting the Euclidean distance matrix (EDM). Specifically, we present two approaches based on EDM, which can effectively handle occluded joints in 2D images. The first approach is based on 2D-to-2D distance matrix regression achieved by a simple CNN architecture. The second approach is based on sparse coding along with a learned over-complete dictionary. Experiments on the Human3.6M dataset show the excellent performance of these two approaches in recovering occluded observations and demonstrate the improvements in accuracy for 3D human pose estimation with occluded joints.
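For readers unfamiliar with the representation, a Euclidean distance matrix stores the pairwise distances between joints, so it is unchanged by rotating or translating the pose. The sketch below builds one from 2D joint coordinates and blanks out entries that involve an occluded joint. The joint coordinates and the helper names are made up for illustration, and the sketch shows only the data structure, not the regression or sparse-coding methods from the paper.

```python
import math


def euclidean_distance_matrix(joints):
    """Pairwise Euclidean distances between joint coordinates."""
    return [[math.dist(a, b) for b in joints] for a in joints]


def mask_occluded(edm, occluded):
    """Replace entries involving an occluded joint with None (hypothetical helper)."""
    return [
        [None if (i in occluded or j in occluded) else d for j, d in enumerate(row)]
        for i, row in enumerate(edm)
    ]


# Three illustrative 2D joints forming a 3-4-5 right triangle.
joints = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
edm = euclidean_distance_matrix(joints)
print(edm)  # [[0.0, 5.0, 3.0], [5.0, 0.0, 4.0], [3.0, 4.0, 0.0]]

# If joint 1 is occluded, its row and column become the unknowns to recover.
print(mask_occluded(edm, occluded={1}))
```

Recovering the masked row and column is the part that a learned model would have to supply.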