Luyang Zhang

CL
h-index3
7papers
8citations
Novelty42%
AI Score50

7 Papers

CLJun 2
Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

Luyang Zhang, Jingyan Li

Best-of-$N$ inference scaling (drawing $N$ candidate answers from a language model and returning the one a reward model ranks highest) improves accuracy by an amount that varies across models, but predicting that amount in advance currently requires running the procedure end-to-end. Prior work links cheap statistics of a model's sampled outputs and validation-set correctness (how often samples agree, how diverse they are, how confident the model is, and where correct samples appear) to model behavior, but does not isolate which of these form a stable, compact predictor of best-of-$N$ gain. We fit ridge predictors on features computed from a single labeled validation-set sampling pass, use bootstrap-Lasso as a stability analysis of the candidate feature set, and give a concentration analysis with an explicit linear-approximation residual. Across three base-model families, six post-training methods, and math and reasoning task domains, the stability analysis identifies a strict three-feature core spanning prompt-level agreement spread, label-assisted first-correct-sample position, and completion-length variance; a compact ridge predictor built from this core plus an entropy add-on reaches Spearman $ρ= 0.90$ with actual best-of-$N$ gain under a reward-model verifier. The intended use is labeled validation-set screening of candidate configurations before paying the full reward-model scoring cost.

CYMay 30
Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums?

Luyang Zhang, Yi-Yun Chu, Ramayya Krishnan

LLM agents are increasingly used in moderation-relevant public forum workflows, where their choices to answer, acknowledge, repair, or decline are routinely challenged by users, platforms, and regulators. The same agent often returns different responses on identical content, so any defense based on the agent's behavior cannot be reliably reproduced. The variation is structural. Four deployment choices typically invisible to the operator each shift the agent's response rate, and their combinations can produce substantially different interventions on the same forum posts. The four choices are (1) which model version is currently served, which can change between calls without notice; (2) the model's weight-release status (open-weight, with weights publicly downloadable, vs. closed-weight, with weights held by the provider); (3) which provider serves the request; and (4) which system-prompt policy is in force. Across LLMs spanning both open-weight and closed-weight families, we find that the previously reported tendency to decline more on visible than hidden challenges aligns with the open/closed weight boundary in our panel more than with access surface. Every closed-weight cell declines more on visible challenges; every open-weight cell reverses this or shows no gap. Auditable forum-agent governance requires awareness of all four choices, not just the model name, since each independently shifts behavior.

GTMar 30
An Economic Framework for Generative Engines: Advertising or Subscription?

Luyang Zhang, Cathy Jiao, Beibei Li et al.

Generative Engines (GEs) such as ChatGPT and Google's AI Overviews are rapidly reshaping search economics by delivering synthesized responses that allow users to bypass third-party websites, cutting those sites' advertising revenue. Yet this shift also leaves GEs facing their own monetization problem: whether to insert ads into synthesized responses or keep them ad-free to drive subscription conversions. In this paper, we introduce a dynamic framework to study this problem, which captures how query-level design choices shape user engagement, retention, and subscription conversion over time. Using this framework, we show that the optimal policy follows a cutoff rule: ads should only be shown to users only when the immediate ad payoff exceeds the long-term value of providing ad-free responses. This cutoff shifts toward with-ad responses when i) ad revenue is high or ii) users are less sensitive to ads, and toward ad-free responses when iii) subscription conversion becomes relatively more valuable. In addition, the presence of rival GEs shifts the optimal policy further toward ad-free responses, as ad-heavy monetization becomes less sustainable when users can freely switch to alternatives. Our findings reveal incentives for real-life generative engine providers to adopt designs that enhance user experience and long-term sustainability.

GTJan 31, 2025Code
Fairshare Data Pricing via Data Valuation for Large Language Models

Luyang Zhang, Cathy Jiao, Beibei Li et al.

Training data is the backbone of large language models (LLMs), yet today's data markets often operate under exploitative pricing -- sourcing data from marginalized groups with little pay or recognition. This paper introduces a theoretical framework for LLM data markets, modeling the strategic interactions between buyers (LLM builders) and sellers (human annotators). We begin with theoretical and empirical analysis showing how exploitative pricing drives high-quality sellers out of the market, degrading data quality and long-term model performance. Then we introduce fairshare, a pricing mechanism grounded in data valuation that quantifies each data's contribution. It aligns incentives by sustaining seller participation and optimizing utility for both buyers and sellers. Theoretically, we show that fairshare yields mutually optimal outcomes: maximizing long-term buyer utility and seller profit while sustaining market participation. Empirically when training open-source LLMs on complex NLP tasks, including math problems, medical diagnosis, and physical reasoning, fairshare boosts seller earnings and ensures a stable supply of high-quality data, while improving buyers' performance-per-dollar and long-term welfare. Our findings offer a concrete path toward fair, transparent, and economically sustainable data markets for LLM.

CYApr 1
Do Agents Repair When Challenged -- or Just Reply? Challenge, Repair, and Public Correction in a Deployed Agent Forum

Luyang Zhang, Yi-Yun Chu, Jialu Wang et al.

As large language model (LLM) agents are deployed in public interactive settings, a key question is whether their communities can sustain challenge, repair, and public correction, or merely produce norm-like language. We compare Moltbook, a live deployed agent forum, with five matched Reddit communities by tracing a three-step mechanism: whether discussions create threaded exchange, whether challenges elicit a response, and whether correction becomes visible to the wider thread. Relative to Reddit, Moltbook discussions are roughly ten times less threaded, leaving far fewer chances for challenge and response. When challenges do occur, the original author almost never returns (1.2% vs. 40.9% on Reddit), multi-turn continuation is nearly absent (0.1% vs. 38.5%), and we detect no repairs under a shared conservative protocol. A non-challenge baseline within Reddit suggests this gap is linked to challenge, not simply deeper threading. These results indicate that social alignment depends not only on producing norm-aware language, but on sustaining the interactional processes through which communities teach, enforce, and revise norms. This matters for safety, because correction is increasingly decentralized, and for fairness, because communities differ in how they expect participants to engage with challenge.

CVApr 10
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang et al.

Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.

CLSep 29, 2025
PET: Preference Evolution Tracking with LLM-Generated Explainable Distribution

Luyang Zhang, Jialu Wang, Shichao Zhu et al.

Understanding how user preference evolves over time is a fundamental challenge central to modern digital ecosystems, for which Large Language Models (LLMs) are an increasingly prominent and popular approach due to their ability to comprehend the rich semantic context within behavioral data. A common practice is to use LLMs to predict a user's next action by directly generating a ranked list of preferred items. Although effective for short-term prediction, the end-to-end generation paradigm inherently limits personalization. Its opaque decision-making process obscures holistic user profiling and exacerbates popularity bias. To address these limitations, we propose Preference Evolution Tracking (PET), a framework that reframes the task as inferring a dynamic probability distribution over a stable and interpretable lattice of preference clusters. By applying logit-probing and generative classification techniques, PET infers a user's preference as a probability distribution, enabling transparent preference learning. On public benchmarks (Yelp, MovieLens), PET improves ranking quality by up to 40% in NDCG over direct generation baselines. On a large-scale, real-world dataset from a short-video platform, it excels at ranking long-tail contents, significantly outperforming a SOTA production model by 7 times in the NDCG score. Ultimately, PET transforms the user profile model from direct preference list generation to a transparent distributional preference mapping, paving the way for more explainable, fair, and diverse personalization systems.