Jialian Li

LG
h-index19
11papers
266citations
Novelty48%
AI Score42

11 Papers

LGMar 13, 2022
Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

Jialian Li, Tongzheng Ren, Dong Yan et al. · tsinghua

In high-stake scenarios like medical treatment and auto-piloting, it's risky or even infeasible to collect online experimental data to train the agent. Simulation-based training can alleviate this issue, but may suffer from its inherent mismatches from the simulator and real environment. It is therefore imperative to utilize the simulator to learn a robust policy for the real-world deployment. In this work, we consider policy learning for Robust Markov Decision Processes (RMDP), where the agent tries to seek a robust policy with respect to unexpected perturbations on the environments. Specifically, we focus on the setting where the training environment can be characterized as a generative model and a constrained perturbation can be added to the model during testing. Our goal is to identify a near-optimal robust policy for the perturbed testing environment, which introduces additional technical difficulties as we need to simultaneously estimate the training environment uncertainty from samples and find the worst-case perturbation for testing. To solve this issue, we propose a generic method which formalizes the perturbation as an opponent to obtain a two-player zero-sum game, and further show that the Nash Equilibrium corresponds to the robust policy. We prove that, with a polynomial number of samples from the generative model, our algorithm can find a near-optimal robust policy with a high probability. Our method is able to deal with general perturbations under some mild assumptions and can also be extended to more complex problems like robust partial observable Markov decision process, thanks to the game-theoretical formulation.

CVMar 28, 2022
LiDARCap: Long-range Marker-less 3D Human Motion Capture with LiDAR Point Clouds

Jialian Li, Jingyi Zhang, Zhiyong Wang et al.

Existing motion capture datasets are largely short-range and cannot yet fit the need of long-range applications. We propose LiDARHuman26M, a new human motion capture dataset captured by LiDAR at a much longer range to overcome this limitation. Our dataset also includes the ground truth human motions acquired by the IMU system and the synchronous RGB images. We further present a strong baseline method, LiDARCap, for LiDAR point cloud human motion capture. Specifically, we first utilize PointNet++ to encode features of points and then employ the inverse kinematics solver and SMPL optimizer to regress the pose through aggregating the temporally encoded features hierarchically. Quantitative and qualitative experiments show that our method outperforms the techniques based only on RGB images. Ablation experiments demonstrate that our dataset is challenging and worthy of further research. Finally, the experiments on the KITTI Dataset and the Waymo Open Dataset show that our method can be generalized to different LiDAR sensor settings.

LGSep 18, 2024
Reward-Robust RLHF in LLMs

Yuzi Yan, Xingzhou Lou, Jialian Li et al.

As Large Language Models (LLMs) continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is increasingly seen as a key pathway toward achieving Artificial General Intelligence (AGI). However, the reliance on reward-model-based (RM-based) alignment methods introduces significant challenges due to the inherent instability and imperfections of Reward Models (RMs), which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach introduces a novel optimization objective that carefully balances performance and robustness by incorporating Bayesian Reward Model Ensembles (BRME) to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, ensuring more stable learning even with imperfect RMs. Empirical results demonstrate that our framework consistently outperforms baselines across diverse benchmarks, showing improved accuracy and long-term stability. We also provide a theoretical analysis, demonstrating that reward-robust RLHF approaches the stability of constant reward settings, which proves to be acceptable even in a stochastic-case analysis. Together, these contributions highlight the framework potential to enhance both the performance and stability of LLM alignment.

77.3AIMay 12
Engagement Process: Rethinking the Temporal Interface of Action and Observation

Jialian Li, Yuchen Cao, Junhong Liu et al.

Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation--action steps. To model such interactions, we propose \emph{Engagement Process} (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action--observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.

LGOct 12, 2024
Boosting Deductive Reasoning with Step Signals In RLHF

Jialian Li, Yipin Zhang, Wei Shen et al.

Logical reasoning is a crucial task for Large Language Models (LLMs), enabling them to tackle complex problems. Among reasoning tasks, multi-step reasoning poses a particular challenge. Grounded in the theory of formal logic, we have developed an automated method, Multi-step Deduction (MuseD), for deductive reasoning data. MuseD has allowed us to create training and testing datasets for multi-step reasoning. Our generation method enables control over the complexity of the generated instructions, facilitating training and evaluation of models across different difficulty levels. Through RLHF training, our training data has demonstrated significant improvements in logical capabilities for both in-domain of out-of-domain reasoning tasks. Additionally, we have conducted tests to assess the multi-step reasoning abilities of various models.

AIJun 11, 2024
3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Yuzi Yan, Yibo Miao, Jialian Li et al.

Aligning large language models (LLMs) with human preferences has gained significant attention, with Proximal Policy Optimization (PPO) as a standard yet computationally expensive method and Direct Preference Optimization (DPO) as a more efficient alternative. While DPO offers simplicity, it remains underutilized in state-of-the-art LLMs, suggesting potential limitations. In this work, we revisit DPO, analyzing its theoretical foundations and empirical performance to bridge this gap. We identify three key properties, termed 3D properties, that emerge from DPO's learning process: Drastic drop in rejected response likelihood, Degradation into response suppression, and Dispersion effect on unseen responses. We show that these issues arise from DPO's optimization dynamics, where the interaction between chosen and rejected response gradients leads to instability. Our findings are supported by experiments on both a controlled toy model and real-world LLM tasks, including mathematical problem-solving and instruction following. To address these challenges, we propose simple regularization techniques that improve training stability and performance. Additionally, we examine how preference data distribution impacts DPO's effectiveness, offering insights into how alignment models handle out-of-domain (OOD) data. Our work connects these observations to broader research and provides a theoretical explanation for DPO's limitations. We hope these insights will guide future advancements in reward-model-free preference learning, bringing it closer to reward-model-based approaches.

MLMar 25, 2021
Nearly Horizon-Free Offline Reinforcement Learning

Tongzheng Ren, Jialian Li, Bo Dai et al.

We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDP). For tabular MDP with $S$ states and $A$ actions, or linear MDP with anchor points and feature dimension $d$, given the collected $K$ episodes data with minimum visiting probability of (anchor) state-action pairs $d_m$, we obtain nearly horizon $H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by $1$. Specifically: 1. For offline policy evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} \right)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and does not have additional dependency on $\mathrm{poly}\left(H, S, A, d\right)$ in higher-order term. 2.For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} + \frac{\min(S, d)}{Kd_m}\right)$ sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a high-order term, improving upon the best known result by \cite{cui2020plug} that has additional $\mathrm{poly}\left(H, S, d\right)$ factors in the main term. To the best of our knowledge, these are the \emph{first} set of nearly horizon-free bounds for episodic time-homogeneous offline tabular MDP and linear MDP with anchor points. Central to our analysis is a simple yet effective recursion based method to bound a "total variance" term in the offline scenarios, which could be of individual interest.

CVMay 20, 2019
Fast Regularity-Constrained Plane Reconstruction

Yangbin Lin, Jialian Li, Cheng Wang et al.

Man-made environments typically comprise planar structures that exhibit numerous geometric relationships, such as parallelism, coplanarity, and orthogonality. Making full use of these relationships can considerably improve the robustness of algorithmic plane reconstruction of complex scenes. This research leverages a constraint model requiring minimal prior knowledge to implicitly establish relationships among planes. We introduce a method based on energy minimization to reconstruct the planes consistent with our constraint model. The proposed algorithm is efficient, easily to understand, and simple to implement. The experimental results show that our algorithm successfully reconstructs planes under high percentages of noise and outliers. This is superior to other state-of-the-art regularity-constrained plane reconstruction methods in terms of speed and robustness.

LGOct 10, 2018
Lazy-CFR: fast and near optimal regret minimization for extensive games with imperfect information

Yichi Zhou, Tongzheng Ren, Jialian Li et al.

Counterfactual regret minimization (CFR) is the most popular algorithm on solving two-player zero-sum extensive games with imperfect information and achieves state-of-the-art performance in practice. However, the performance of CFR is not fully understood, since empirical results on the regret are much better than the upper bound proved in \cite{zinkevich2008regret}. Another issue is that CFR has to traverse the whole game tree in each round, which is time-consuming in large scale games. In this paper, we present a novel technique, lazy update, which can avoid traversing the whole game tree in CFR, as well as a novel analysis on the regret of CFR with lazy update. Our analysis can also be applied to the vanilla CFR, resulting in a much tighter regret bound than that in \cite{zinkevich2008regret}. Inspired by lazy update, we further present a novel CFR variant, named Lazy-CFR. Compared to traversing $O(|\mathcal{I}|)$ information sets in vanilla CFR, Lazy-CFR needs only to traverse $O(\sqrt{|\mathcal{I}|})$ information sets per round while keeping the regret bound almost the same, where $\mathcal{I}$ is the class of all information sets. As a result, Lazy-CFR shows better convergence result compared with vanilla CFR. Experimental results consistently show that Lazy-CFR outperforms the vanilla CFR significantly.

CVJun 28, 2017
The YouTube-8M Kaggle Competition: Challenges and Methods

Haosheng Zou, Kun Xu, Jialian Li et al.

We took part in the YouTube-8M Video Understanding Challenge hosted on Kaggle, and achieved the 10th place within less than one month's time. In this paper, we present an extensive analysis and solution to the underlying machine-learning problem based on frame-level data, where major challenges are identified and corresponding preliminary methods are proposed. It's noteworthy that, with merely the proposed strategies and uniformly-averaging multi-crop ensemble was it sufficient for us to reach our ranking. We also report the methods we believe to be promising but didn't have enough time to train to convergence. We hope this paper could serve, to some extent, as a review and guideline of the YouTube-8M multi-label video classification benchmark, inspiring future attempts and research.

LGJun 14, 2016
Conditional Generative Moment-Matching Networks

Yong Ren, Jialian Li, Yucen Luo et al.

Maximum mean discrepancy (MMD) has been successfully applied to learn deep generative models for characterizing a joint distribution of variables via kernel mean embedding. In this paper, we present conditional generative moment- matching networks (CGMMN), which learn a conditional distribution given some input variables based on a conditional maximum mean discrepancy (CMMD) criterion. The learning is performed by stochastic gradient descent with the gradient calculated by back-propagation. We evaluate CGMMN on a wide range of tasks, including predictive modeling, contextual generation, and Bayesian dark knowledge, which distills knowledge from a Bayesian model by learning a relatively small CGMMN student network. Our results demonstrate competitive performance in all the tasks.