Dandan Qiao

2 Papers

CL · Jul 3, 2024
Regurgitative Training: The Value of Real Data in Training Large Language Models

Jinghui Zhang, Dandan Qiao, Mochen Yang et al.

What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. We evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The same performance loss of regurgitative training is observed on transformer models that we train from scratch. We find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data as compared to real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. First, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and then carry out an ordered training process where high-quality data are added before low-quality ones. Second, we combine data generated by multiple different LLMs (as an attempt to increase lexical diversity). Third, we train an AI detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in the order of resemblance to human-generated data. All three strategies can improve the performance of regurgitative training to some extent but are not always able to fully close the gap from training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot be easily substituted by synthetic, LLM-generated data.
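As a rough illustration of the second mechanism and the first mitigation strategy described above, the sketch below scores text by type-token ratio and orders samples so that higher-scoring ones come first. The abstract does not say which diversity or quality metric the authors use, so type-token ratio, whitespace tokenization, and the function names here are all assumptions for illustration only.

```python
from typing import Callable, List


def type_token_ratio(text: str) -> float:
    """Share of distinct tokens among all tokens.

    Assumption: lowercased whitespace tokenization, chosen only for
    simplicity; it is not the paper's stated metric.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def order_for_training(samples: List[str],
                       score: Callable[[str], float]) -> List[str]:
    """Return samples sorted from highest to lowest score.

    Hypothetical stand-in for an "ordered training" schedule in which
    higher-quality instances are added before lower-quality ones.
    """
    return sorted(samples, key=score, reverse=True)


if __name__ == "__main__":
    # Made-up example sentences, not data from the paper.
    samples = ["the cat sat on the mat", "the the the the", "a quick brown fox"]
    for s in order_for_training(samples, type_token_ratio):
        print(round(type_token_ratio(s), 2), s)
```

Any per-instance score could be passed in place of `type_token_ratio`, for example a classifier's estimate of how closely a sample resembles human-written text.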

AI · Dec 7, 2023
AI and Jobs: Has the Inflection Point Arrived? Evidence from an Online Labor Platform

Dandan Qiao, Huaxia Rui, Qian Xiong

The emergence of Large Language Models (LLMs) has renewed the debate on the important issue of "technology displacement". While prior research has investigated the effect of information technology in general on human labor from a macro perspective, this paper complements the literature by examining the impact of LLMs on freelancers from a micro perspective. Specifically, we leverage the release of ChatGPT to investigate how AI influences freelancers across different online labor markets (OLMs). Employing the Difference-in-Differences method, we discovered two distinct scenarios following ChatGPT's release: 1) the displacement effect of LLMs, featuring reduced work volume and earnings, as is exemplified by the translation & localization OLM; 2) the productivity effect of LLMs, featuring increased work volume and earnings, as is exemplified by the web development OLM. To shed light on the underlying mechanisms, we developed a Cournot-type competition model to highlight the existence of an inflection point for each occupation which separates the timeline of AI progress into a honeymoon phase and a substitution phase. Before AI performance crosses the inflection point, human labor benefits each time AI improves, resulting in the honeymoon phase. However, after AI performance crosses the inflection point, additional AI enhancement hurts human labor. Further analyzing the progression from ChatGPT 3.5 to 4.0, we found three effect scenarios (i.e., productivity to productivity, displacement to displacement, and productivity to displacement), consistent with the inflection point conjecture. Heterogeneous analyses reveal that U.S. web developers tend to benefit more from the release of ChatGPT compared to their counterparts in other regions, and somewhat surprisingly, experienced translators seem more likely to exit the market than less experienced translators after the release of ChatGPT.
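For readers unfamiliar with the Difference-in-Differences method named above, a minimal two-group, two-period sketch follows. It shows only the basic estimator: the change in the treated group minus the change in the control group. The paper's actual specification (controls, fixed effects, choice of comparison group) is not given in the abstract, and the function name and numbers are hypothetical.

```python
from statistics import mean
from typing import Sequence


def did_estimate(treated_pre: Sequence[float],
                 treated_post: Sequence[float],
                 control_pre: Sequence[float],
                 control_post: Sequence[float]) -> float:
    """Basic 2x2 difference-in-differences on group means.

    Subtracting the control group's change removes trends common to
    both groups, under the parallel-trends assumption.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


if __name__ == "__main__":
    # Made-up monthly job counts, for illustration only.
    effect = did_estimate(
        treated_pre=[10, 12], treated_post=[8, 8],
        control_pre=[10, 10], control_post=[11, 11],
    )
    print(effect)
```

A negative estimate in this toy setup would correspond to the displacement scenario, and a positive one to the productivity scenario.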