Tianlong Nan

h-index: 1

3 papers

4 citations

3 Papers

10.0 | LG | May 31

Efficient Exploration for Iterative Nash Preference Optimization

Tianlong Nan, Xiaopeng Li, Christian Kroer et al.

Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human preferences are cyclic, non-transitive, or otherwise not representable by a scalar reward. Nash Learning from Human Feedback (NLHF) addresses this limitation by modeling alignment as a preference game and targeting a Nash equilibrium rather than a reward maximizer. However, the learning-theoretic foundations of scalable NLHF remain limited. Existing regret guarantees rely on oracle-based methods that estimate a general preference model and solve KL-regularized minimax problems, while iterative NLHF methods directly optimize policy-level preference losses and are easier to implement but lack regret guarantees. We study online iterative NLHF under general preference models and identify exploration as the key obstacle. First, we show that standard iterative NLHF can suffer an exponential dependence on the KL-regularization parameter, revealing that implicit exploration through policy updates is insufficient for controlling regret. Second, we propose an explicitly exploratory iterative NLHF algorithm that combines SFT-based regularization with adversarial policy exploration. The resulting method retains the direct policy optimization structure of iterative NLHF, avoids explicit preference model estimation, and achieves an $O(\sqrt{T})$ regret bound without an exponential dependence on the KL-regularization parameter. We show that the regret can be improved to $O(\log(T))$ with access to a minimax oracle, clarifying the computational-statistical tradeoff in learning general preference games. Finally, we instantiate our method for LLM fine-tuning and evaluate it on Llama-3-8B-Instruct across multiple benchmarks, where explicit exploration yields consistent improvements over existing NLHF baselines.
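For orientation, the KL-regularized preference game mentioned in the abstract is often written in the NLHF literature roughly as follows. This is a sketch under assumed notation, not necessarily the paper's exact objective: $\mathcal{P}(\pi \succ \pi')$ denotes the probability that a response from $\pi$ is preferred to one from $\pi'$, $\tau > 0$ is the KL-regularization parameter, and $\pi_{\mathrm{ref}}$ is a reference (e.g. SFT) policy.

$$
J_\tau(\pi, \pi') = \mathcal{P}(\pi \succ \pi') - \tau\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) + \tau\,\mathrm{KL}(\pi' \,\|\, \pi_{\mathrm{ref}}),
\qquad
\pi^\star \in \arg\max_{\pi} \min_{\pi'} J_\tau(\pi, \pi').
$$

A small example of why a scalar reward can fail: with three responses $a, b, c$ and preferences $a \succ b$, $b \succ c$, $c \succ a$, any reward $r$ would need $r(a) > r(b) > r(c) > r(a)$, which is impossible, whereas the game above still has an equilibrium.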

9.2 | GT | Jul 1

Fully Distributed Tâtonnement for Chores Markets

Bhaskar Ray Chaudhury, Christian Kroer, Ruta Mehta et al.

We study price-adjustment dynamics for computing competitive equilibria (CE) in Fisher markets with chores. Unlike in classical goods markets, prices in chores markets are payments for taking on undesirable tasks, and natural excess-demand dynamics can fail; even the naïve analogue of Walrasian tâtonnement may diverge. Recent work of Chaudhury et al. [2025] overcomes this obstacle via relative tâtonnement, which subtracts the average excess-demand signal from the excess demand vector. This recovers convergence, but at the cost of coupling the price updates across all chores. This leaves open whether such global coupling is inherent, or whether convergent tâtonnement can be recovered through a genuinely local update in which each chore reacts only to its own excess demand. We answer this question affirmatively through multiplicative tâtonnement, a fully distributed dynamics in which each chore price is updated using only its current price and its own excess-demand signal. Although the update contains no explicit normalization term, Walras' law and the multiplicative form of the update implicitly preserve the relevant aggregate price geometry. We prove that multiplicative tâtonnement converges to a CE in any chores Fisher market with continuous, convex, and $1$-homogeneous (CCH) disutilities. For convex CES disutilities, we further prove an approximate-CE convergence rate with the same $O(1/\varepsilon^2)$ dependence as relative tâtonnement, but with improved dependence on problem constants. Experiments on real-world and simulated instances show that multiplicative tâtonnement is substantially faster in practice, often by an order of magnitude.
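The contrast between a globally coupled and a fully local price update can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the step size `eta`, the sign convention (a positive excess-demand signal lowers the payment), the additive form of the relative update, and the exponential form of the multiplicative update are not taken from the paper, whose exact update rules may differ.

```python
import math

def relative_step(prices, excess, eta):
    """Hypothetical relative update: subtract the average excess-demand
    signal before stepping, so every chore's update depends on all chores."""
    mean_excess = sum(excess) / len(excess)
    # Sign convention is an assumption: positive excess lowers the payment.
    return [p - eta * (z - mean_excess) for p, z in zip(prices, excess)]

def multiplicative_step(prices, excess, eta):
    """Hypothetical multiplicative update: each chore uses only its own
    current price and its own excess-demand signal."""
    # The exponential form is an assumption; it keeps prices positive.
    return [p * math.exp(-eta * z) for p, z in zip(prices, excess)]

prices = [1.0, 2.0, 0.5]
signal_a = [0.3, -0.1, 0.2]
signal_b = [0.3, 0.5, -0.9]  # same signal for chore 0, different elsewhere

# Chore 0's multiplicative update ignores the other chores' signals ...
local_a = multiplicative_step(prices, signal_a, 0.1)
local_b = multiplicative_step(prices, signal_b, 0.1)

# ... while its relative update changes through the shared average.
coupled_a = relative_step(prices, signal_a, 0.1)
coupled_b = relative_step(prices, signal_b, 0.1)
```

In this sketch, `local_a[0]` and `local_b[0]` coincide because the multiplicative step for chore 0 never reads the other chores' signals, whereas `coupled_a[0]` and `coupled_b[0]` differ because the average term ties the chores together.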

7.0 | GT | Jun 13

Competitive Equilibrium in Labor Economies through the Lens of Goods and Chores Fisher Markets

Bhaskar Ray Chaudhury, Christian Kroer, Ruta Mehta et al.

In this paper, we study a two-sided labor market that couples the classical Fisher market with goods and the Fisher market with bads into a single unified framework. In our model, users demand tasks in order to derive utility, while workers supply labor to perform these tasks in exchange for earnings. Each task thus plays a dual role: it is a good for the user side of the market and a chore for the worker side. Given prices for tasks, users choose utility-maximizing bundles subject to budgets, while workers choose disutility-minimizing task bundles subject to earning requirements; the resulting choices induce demand and supply endogenously for each task, and a CE corresponds to prices at which these coincide. We show that such markets are guaranteed to admit a CE in a very general setting, and the first and second welfare theorems hold for our labor market model. We next study the computation of equilibria under linear preferences. We show that, similar to the chores setting, equilibria correspond to KKT points of an Eisenberg-Gale-like non-convex program. Despite the non-convex characterization, we go on to show a set of surprisingly positive results. First, we show that there exists a polynomial-time combinatorial algorithm for computing CE, which relies on a natural Walrasian scheme for updating prices. In the "CEEI-like" case, this yields a strongly polynomial-time algorithm. We next show that our market admits a natural dual program, and this non-convex labor-market program admits a change of variables that transforms it into a linear program (albeit with irrational coefficients). Finally, leveraging this LP, we give yet another polynomial-time algorithm while deriving an approach for addressing the irrational coefficients in an efficient manner. We note that, even for goods-only linear Fisher markets, obtaining such an LP formulation remains open.