Cathy Jiao

CL
h-index: 3
10 papers
950 citations
Novelty: 35%
AI Score: 50

10 Papers

CL · May 25, 2022
InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning

Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh et al. · cmu

Instruction tuning is an emergent paradigm in NLP wherein natural language instructions are leveraged with language models to induce zero-shot performance on unseen tasks. Instructions have been shown to enable good performance on unseen tasks and datasets in both large and small language models. Dialogue is an especially interesting area to explore instruction tuning because dialogue systems perform multiple kinds of tasks related to language (e.g., natural language understanding and generation, domain-specific interaction), yet instruction tuning has not been systematically explored for dialogue-related tasks. We introduce InstructDial, an instruction tuning framework for dialogue, which consists of a repository of 48 diverse dialogue tasks in a unified text-to-text format created from 59 openly available dialogue datasets. Next, we explore cross-task generalization ability on models tuned on InstructDial across diverse dialogue tasks. Our analysis reveals that InstructDial enables good zero-shot performance on unseen datasets and tasks such as dialogue evaluation and intent detection, and even better performance in a few-shot setting. To ensure that models adhere to instructions, we introduce novel meta-tasks. We establish benchmark zero-shot and few-shot performance of models trained using the proposed framework on multiple dialogue tasks.

CL · Jan 27, 2023
Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Jessica Huynh, Cathy Jiao, Prakhar Gupta et al. · cmu

Language models have steadily increased in size over the past few years. They achieve a high level of performance on various natural language processing (NLP) tasks such as question answering and summarization. Large language models (LLMs) have been used for generation and can now output human-like text. As a result, other downstream dialog tasks can now harness the LLMs' language understanding capabilities. Dialog evaluation is one such task, and it is the task this paper explores. The paper concentrates on prompting with LLMs: BLOOM, OPT, GPT-3, Flan-T5, InstructDial and TNLGv2. It shows that the choice of datasets used for training a model affects both how well the model performs on a task and how the prompt should be structured. Specifically, the more diverse and relevant the group of datasets that a model is trained on, the better it performs at dialog evaluation. This paper also investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.

IR · Apr 9
Efficient Dataset Selection for Continual Adaptation of Generative Recommenders

Cathy Jiao, Juan Elenter, Praveen Ravichandran et al.

Recommendation systems must continuously adapt to evolving user behavior, yet the volume of data generated in large-scale streaming environments makes frequent full retraining impractical. This work investigates how targeted data selection can mitigate performance degradation caused by temporal distributional drift while maintaining scalability. We evaluate a range of representation choices and sampling strategies for curating small but informative subsets of user interaction data. Our results demonstrate that gradient-based representations, coupled with distribution-matching, improve downstream model performance, achieving training efficiency gains while preserving robustness to drift. These findings highlight data curation as a practical mechanism for scalable monitoring and adaptive model updates in production-scale recommendation systems.

HC · Aug 18, 2022
The DialPort tools

Jessica Huynh, Shikib Mehri, Cathy Jiao et al.

The DialPort project (http://dialport.org/), funded by the National Science Foundation (NSF), covers a group of tools and services that aim to fulfill the needs of the dialog research community. Over the course of six years, several offerings have been created, including the DialPort Portal and DialCrowd. This paper describes these contributions, which will be demoed at SIGDIAL, including implementation, prior studies, corresponding discoveries, and the locations at which the tools will remain freely available to the community going forward.

CY · May 6
Rigorous Interpretation Is a Form of Evaluation

Isabelle Lee, Emmy Liu, Cathy Jiao et al.

Current machine learning models are evaluated through behavioral snapshots, such as benchmark accuracies, win rates, and outcome-based metrics. Model explanations and evaluations, however, are fundamentally intertwined: understanding why a model produces a behavior can be as important as measuring what it produces. We argue that interpretability, if it can be trusted, can serve not merely as a diagnostic tool but as a richer and more principled form of model evaluation beyond surface-level performance metrics. We explore three ways interpretability can function evaluatively: (1) fixing problems by identifying the root causes of unwanted behavior, (2) detecting subtly faulty mechanisms that invalidate model outputs, and (3) predicting potential issues before they arise by fully understanding the model's weaknesses. To fulfill its evaluative potential, we argue that interpretability methods must generate claims that are falsifiable, reproducible, and predictive -- that is, interpretability must meet scientific standards.

GT · Mar 30
An Economic Framework for Generative Engines: Advertising or Subscription?

Luyang Zhang, Cathy Jiao, Beibei Li et al.

Generative Engines (GEs) such as ChatGPT and Google's AI Overviews are rapidly reshaping search economics by delivering synthesized responses that allow users to bypass third-party websites, cutting those sites' advertising revenue. Yet this shift also leaves GEs facing their own monetization problem: whether to insert ads into synthesized responses or keep them ad-free to drive subscription conversions. In this paper, we introduce a dynamic framework to study this problem, which captures how query-level design choices shape user engagement, retention, and subscription conversion over time. Using this framework, we show that the optimal policy follows a cutoff rule: ads should be shown to users only when the immediate ad payoff exceeds the long-term value of providing ad-free responses. This cutoff shifts toward with-ad responses when i) ad revenue is high or ii) users are less sensitive to ads, and toward ad-free responses when iii) subscription conversion becomes relatively more valuable. In addition, the presence of rival GEs shifts the optimal policy further toward ad-free responses, as ad-heavy monetization becomes less sustainable when users can freely switch to alternatives. Our findings reveal incentives for real-life generative engine providers to adopt designs that enhance user experience and long-term sustainability.
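
The cutoff rule described in this abstract can be illustrated with a toy sketch. This is not the paper's model: the `QueryContext` fields, the function names, and the product form used for the long-term ad-free value are all hypothetical assumptions, chosen only so that the three stated shifts (higher ad revenue, lower ad sensitivity, higher subscription value) move the decision in the directions the abstract describes.

```python
from dataclasses import dataclass


@dataclass
class QueryContext:
    """Hypothetical per-query quantities, in arbitrary units."""
    ad_revenue: float          # immediate payoff from showing an ad
    ad_sensitivity: float      # how strongly ads hurt future engagement (>= 0)
    subscription_value: float  # long-term value of a subscription conversion


def ad_free_value(ctx: QueryContext) -> float:
    # ASSUMPTION: a simple product stands in for the long-term value of an
    # ad-free response; the abstract does not give a functional form.
    return ctx.ad_sensitivity * ctx.subscription_value


def show_ad(ctx: QueryContext) -> bool:
    # Cutoff rule: show an ad only when the immediate ad payoff exceeds
    # the long-term value of providing an ad-free response.
    return ctx.ad_revenue > ad_free_value(ctx)


if __name__ == "__main__":
    baseline = QueryContext(ad_revenue=1.0, ad_sensitivity=0.5, subscription_value=1.0)
    valuable_subs = QueryContext(ad_revenue=1.0, ad_sensitivity=0.5, subscription_value=3.0)
    tolerant_users = QueryContext(ad_revenue=1.0, ad_sensitivity=0.1, subscription_value=3.0)
    for name, ctx in [("baseline", baseline),
                      ("valuable_subs", valuable_subs),
                      ("tolerant_users", tolerant_users)]:
        print(name, show_ad(ctx))
```

In this sketch, raising `subscription_value` flips the decision to ad-free, and lowering `ad_sensitivity` flips it back to showing the ad, which mirrors conditions ii) and iii) in the abstract under the stated assumptions.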

GT · Jan 31, 2025 · Code
Fairshare Data Pricing via Data Valuation for Large Language Models

Luyang Zhang, Cathy Jiao, Beibei Li et al.

Training data is the backbone of large language models (LLMs), yet today's data markets often operate under exploitative pricing -- sourcing data from marginalized groups with little pay or recognition. This paper introduces a theoretical framework for LLM data markets, modeling the strategic interactions between buyers (LLM builders) and sellers (human annotators). We begin with theoretical and empirical analysis showing how exploitative pricing drives high-quality sellers out of the market, degrading data quality and long-term model performance. Then we introduce fairshare, a pricing mechanism grounded in data valuation that quantifies each data point's contribution. It aligns incentives by sustaining seller participation and optimizing utility for both buyers and sellers. Theoretically, we show that fairshare yields mutually optimal outcomes: maximizing long-term buyer utility and seller profit while sustaining market participation. Empirically, when training open-source LLMs on complex NLP tasks, including math problems, medical diagnosis, and physical reasoning, fairshare boosts seller earnings and ensures a stable supply of high-quality data, while improving buyers' performance-per-dollar and long-term welfare. Our findings offer a concrete path toward fair, transparent, and economically sustainable data markets for LLMs.

CL · Jul 17, 2024
On the Feasibility of In-Context Probing for Data Attribution

Cathy Jiao, Gary Gao, Aditi Raghunathan et al.

Data attribution methods are used to measure the contribution of training data towards model outputs, and have several important applications in areas such as dataset curation and model interpretability. However, many standard data attribution methods, such as influence functions, utilize model gradients and are computationally expensive. In our paper, we show that in-context probing (ICP) -- prompting an LLM -- can serve as a fast proxy for gradient-based data attribution for data selection under conditions contingent on data similarity. We study this connection empirically on standard NLP tasks, and show that ICP and gradient-based data attribution are well-correlated in identifying influential training data for tasks that share a similar task type and content with the training data. Additionally, fine-tuning models on influential data selected by both methods achieves comparable downstream performance, further emphasizing their similarities. We also examine the connection between ICP and gradient-based data attribution using synthetic data on linear regression tasks. Our synthetic data experiments show results similar to those from NLP tasks, suggesting that this connection can be isolated in simpler settings, which offers a pathway to bridging their differences.

CL · Jul 12, 2025
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models

Cathy Jiao, Yijun Pan, Emily Xiao et al.

Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, and data valuation. However, there remain critical gaps in systematic LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications. DATE-LM measures attribution quality through three key tasks -- training data selection, toxicity/bias filtering, and factual attribution. Our benchmark is designed for ease of use, enabling researchers to configure and run large-scale evaluations across diverse tasks and LLM architectures. Furthermore, we use DATE-LM to conduct a large-scale evaluation of existing data attribution methods. Our findings show that no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design. Finally, we release a public leaderboard for quick comparison of methods and to facilitate community engagement, with the motivation that DATE-LM can serve as a foundation for future data attribution research in LLMs.

CV · Jun 25, 2024
ET tu, CLIP? Addressing Common Object Errors for Unseen Environments

Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh et al.

We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis suggests that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.