Haifeng Xu

4.6LGMay 18, 2024

Learning from Imperfect Human Feedback: a Tale from Corruption-Robust Dueling

Yuwei Cheng, Fan Yao, Xuefeng Liu et al.

This paper studies Learning from Imperfect Human Feedback (LIHF), addressing the potential irrationality or imperfect perception when learning from comparative human feedback. Building on evidences that human's imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit but under a restricted form of corruption: i.e., the corruption scale is decaying over time as $t^{ρ-1}$ for some "imperfection rate" $ρ\in [0, 1]$. With $T$ as the total number of iterations, we establish a regret lower bound of $ Ω(\max\{\sqrt{T}, T^ρ\}) $ for LIHF, even when $ρ$ is known. For the same setting, we develop the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret $\tilde{\mathcal{O}}(\max\{\sqrt{T}, T^ρ\})$. Core to our analysis is a novel framework for analyzing gradient-based algorithms for dueling bandit under corruption, and we demonstrate its general applicability by showing how this framework can be easily applied to obtain corruption-robust guarantees for other popular gradient-based dueling bandit algorithms. Our theoretical results are validated by extensive experiments.

4.5MLOct 18, 2025

Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence

Bingji Yi, Qiyuan Liu, Yuwei Cheng et al.

Synthetic data has been increasingly used to train frontier generative models. However, recent study raises key concerns that iteratively retraining a generative model on its self-generated synthetic data may keep deteriorating model performance, a phenomenon often coined model collapse. In this paper, we investigate ways to modify this synthetic retraining process to avoid model collapse, and even possibly help reverse the trend from collapse to improvement. Our key finding is that by injecting information through an external synthetic data verifier, whether a human or a better model, synthetic retraining will not cause model collapse. To develop principled understandings of the above insight, we situate our analysis in the foundational linear regression setting, showing that iterative retraining with verified synthetic data can yield near-term improvements but ultimately drives the parameter estimate to the verifier's "knowledge center" in the long run. Our theory hence predicts that, unless the verifier is perfectly reliable, the early gains will plateau and may even reverse. Indeed, these theoretical insights are further confirmed by our experiments on both linear regression as well as Variational Autoencoders (VAEs) trained on MNIST data.

Haifeng Xu

2 Papers