CLDec 19, 2022
Evaluating Human-Language Model InteractionMina Lee, Megha Srivastava, Amelia Hardy et al. · stanford
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.
LGOct 3, 2023
Can large language models provide useful feedback on research papers? A large-scale empirical analysisWeixin Liang, Yuhui Zhang, Hancheng Cao et al. · stanford
Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production and intricate knowledge specialization challenge the conventional scientific feedback mechanisms. High-quality peer reviews are increasingly difficult to obtain. Researchers who are more junior or from under-resourced settings have especially hard times getting timely feedback. With the breakthrough of large language models (LLM) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback on research manuscripts. However, the utility of LLM-generated feedback has not been systematically studied. To address this gap, we created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers. We evaluated the quality of GPT-4's feedback through two large-scale studies. We first quantitatively compared GPT-4's generated feedback with human peer reviewer feedback in 15 Nature family journals (3,096 papers in total) and the ICLR machine learning conference (1,709 papers). The overlap in the points raised by GPT-4 and by human reviewers (average overlap 30.85% for Nature journals, 39.23% for ICLR) is comparable to the overlap between two human reviewers (average overlap 28.58% for Nature journals, 35.25% for ICLR). The overlap between GPT-4 and human reviewers is larger for the weaker papers. We then conducted a prospective user study with 308 researchers from 110 US institutions in the field of AI and computational biology to understand how researchers perceive feedback generated by our GPT-4 system on their own papers. Overall, more than half (57.4%) of the users found GPT-4 generated feedback helpful/very helpful and 82.4% found it more beneficial than feedback from at least some human reviewers. While our findings show that LLM-generated feedback can help researchers, we also identify several limitations.
DLOct 4, 2023
The Rise of Open Science: Tracking the Evolution and Perceived Value of Data and Methods Link-Sharing PracticesHancheng Cao, Jesse Dodge, Kyle Lo et al. · allen-ai, cmu
In recent years, funding agencies and journals increasingly advocate for open science practices (e.g. data and method sharing) to improve the transparency, access, and reproducibility of science. However, quantifying these practices at scale has proven difficult. In this work, we leverage a large-scale dataset of 1.1M papers from arXiv that are representative of the fields of physics, math, and computer science to analyze the adoption of data and method link-sharing practices over time and their impact on article reception. To identify links to data and methods, we train a neural text classification model to automatically classify URL types based on contextual mentions in papers. We find evidence that the practice of link-sharing to methods and data is spreading as more papers include such URLs over time. Reproducibility efforts may also be spreading because the same links are being increasingly reused across papers (especially in computer science); and these links are increasingly concentrated within fewer web domains (e.g. Github) over time. Lastly, articles that share data and method links receive increased recognition in terms of citation count, with a stronger effect when the shared links are active (rather than defunct). Together, these findings demonstrate the increased spread and perceived value of data and method sharing practices in open science.
CYSep 26, 2023
User Experience Design Professionals' Perceptions of Generative Artificial IntelligenceJie Li, Hancheng Cao, Laura Lin et al.
Among creative professionals, Generative Artificial Intelligence (GenAI) has sparked excitement over its capabilities and fear over unanticipated consequences. How does GenAI impact User Experience Design (UXD) practice, and are fears warranted? We interviewed 20 UX Designers, with diverse experience and across companies (startups to large enterprises). We probed them to characterize their practices, and sample their attitudes, concerns, and expectations. We found that experienced designers are confident in their originality, creativity, and empathic skills, and find GenAI's role as assistive. They emphasized the unique human factors of "enjoyment" and "agency", where humans remain the arbiters of "AI alignment". However, skill degradation, job replacement, and creativity exhaustion can adversely impact junior designers. We discuss implications for human-GenAI collaboration, specifically copyright and ownership, human creativity and agency, and AI literacy and access. Through the lens of responsible and participatory AI, we contribute a deeper understanding of GenAI fears and opportunities for UXD.
HCAug 3, 2023
Comparing scalable strategies for generating numerical perspectivesHancheng Cao, Sofia Eleni Spatharioti, Daniel G. Goldstein et al.
Numerical perspectives help people understand extreme and unfamiliar numbers (e.g., \$330 billion is about \$1,000 per person in the United States). While research shows perspectives to be helpful, generating them at scale is challenging both because it is difficult to identify what makes some analogies more helpful than others, and because what is most helpful can vary based on the context in which a given number appears. Here we present and compare three policies for large-scale perspective generation: a rule-based approach, a crowdsourced system, and a model that uses Wikipedia data and semantic similarity (via BERT embeddings) to generate context-specific perspectives. We find that the combination of these three approaches dominates any single method, with different approaches excelling in different settings and users displaying heterogeneous preferences across approaches. We conclude by discussing our deployment of perspectives in a widely-used online word processor.
SOC-PHMay 22
Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field ExperimentBinglu Wang, Weixin Liang, Jiahui Xue et al.
Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this hidden but vital practice and reallocate scientific feedback, an essential yet scarce resource for knowledge production. In a global large-scale randomized field experiment, we delivered customized LLM-generated feedback for over 31,000 arXiv preprints across 150 fields and more than 45,000 researchers from 133 geographic regions. Relative to controls, authors who received feedback had a significantly higher likelihood of revising their manuscripts, corresponding to a 12.55% relative increase over the baseline revision rate. Exposure to AI feedback also increased authors' subsequent use of LLM tools in their future papers, suggesting longer-run shifts in scientific practice. These effects were strongest among authors from non-English-dominant research regions, manuscripts less embedded in the scholarly literature, and teams with lower h-indexes and earlier career stages, consistent with the idea that AI feedback may provide the greatest benefit where access to timely critique is otherwise limited. Together, these findings provide causal evidence that structured AI-based interventions can transform access to scientific feedback from a largely private advantage into a more widely distributed resource, with broader implications for productivity, equity, and capacity across the global research system.
CLMar 11, 2024
Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer ReviewsWeixin Liang, Zachary Izzo, Yaohui Zhang et al. · berkeley, stanford
We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
CLApr 1, 2024
Mapping the Increasing Use of LLMs in Scientific PapersWeixin Liang, Yaohui Zhang, Zhengxuan Wu et al. · berkeley, stanford
Scientific publishing lays the foundation of science by disseminating research findings, fostering collaboration, encouraging reproducibility, and ensuring that scientific knowledge is accessible, verifiable, and built upon over time. Recently, there has been immense speculation about how many people are using large language models (LLMs) like ChatGPT in their academic writing, and to what extent this tool might have an effect on global scientific practices. However, we lack a precise measure of the proportion of academic writing substantially modified or produced by LLMs. To address this gap, we conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the arXiv, bioRxiv, and Nature portfolio journals, using a population-level statistical framework to measure the prevalence of LLM-modified content over time. Our statistical estimation operates on the corpus level and is more robust than inference on individual instances. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers (up to 17.5%). In comparison, Mathematics papers and the Nature portfolio showed the least LLM modification (up to 6.3%). Moreover, at an aggregate level, our analysis reveals that higher levels of LLM-modification are associated with papers whose first authors post preprints more frequently, papers in more crowded research areas, and papers of shorter lengths. Our findings suggests that LLMs are being broadly used in scientific writings.
CLFeb 13, 2025
The Widespread Adoption of Large Language Model-Assisted Writing Across SocietyWeixin Liang, Yaohui Zhang, Mihai Codreanu et al. · stanford
The recent advances in large language models (LLMs) attracted significant public and policymaker interest in its adoption patterns. In this paper, we systematically analyze LLM-assisted writing across four domains-consumer complaints, corporate communications, job postings, and international organization press releases-from January 2022 to September 2024. Our dataset includes 687,241 consumer complaints, 537,413 corporate press releases, 304.3 million job postings, and 15,919 United Nations (UN) press releases. Using a robust population-level statistical framework, we find that LLM usage surged following the release of ChatGPT in November 2022. By late 2024, roughly 18% of financial consumer complaint text appears to be LLM-assisted, with adoption patterns spread broadly across regions and slightly higher in urban areas. For corporate press releases, up to 24% of the text is attributable to LLMs. In job postings, LLM-assisted writing accounts for just below 10% in small firms, and is even more common among younger firms. UN press releases also reflect this trend, with nearly 14% of content being generated or modified by LLMs. Although adoption climbed rapidly post-ChatGPT, growth appears to have stabilized by 2024, reflecting either saturation in LLM adoption or increasing subtlety of more advanced models. Our study shows the emergence of a new reality in which firms, consumers and even international organizations substantially rely on generative AI for communications.
CLJun 12, 2025
From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer ReviewYaohui Zhang, Haijing Zhang, Wenlong Ji et al.
The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows. Despite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process. In this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality. Our experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.
CLMay 21, 2025
Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the WildSheshera Mysore, Debarati Das, Hancheng Cao et al.
As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human-AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users' intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment.
MAFeb 1
Multi-Agent Teams Hold Experts BackAneesh Pappu, Batu El, Hancheng Cao et al.
Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.
HCSep 12, 2025
Vibe Coding for UX Design: Understanding UX Professionals' Perceptions of AI-Assisted Design and DevelopmentJie Li, Youyang Hou, Laura Lin et al.
Generative AI is reshaping UX design practices through "vibe coding," where UX professionals express intent in natural language and AI translates it into functional prototypes and code. Despite rapid adoption, little research has examined how vibe coding reconfigures UX workflows and collaboration. Drawing on interviews with 20 UX professionals across enterprises, startups, and academia, we show how vibe coding follows a four-stage workflow of ideation, AI generation, debugging, and review. This accelerates iteration, supports creativity, and lowers barriers to participation. However, professionals reported challenges of code unreliability, integration, and AI over-reliance. We find tensions between efficiency-driven prototyping ("intending the right design") and reflection ("designing the right intention"), introducing new asymmetries in trust, responsibility, and social stigma within teams. Through the lens of responsible human-AI collaboration for AI-assisted UX design and development, we contribute a deeper understanding of deskilling, ownership and disclosure, and creativity safeguarding in the age of vibe coding.
CYJan 28, 2021
Large Scale Analysis of Multitasking Behavior During Remote MeetingsHancheng Cao, Chia-Jung Lee, Shamsi Iqbal et al.
Virtual meetings are critical for remote work because of the need for synchronous collaboration in the absence of in-person interactions. In-meeting multitasking is closely linked to people's productivity and wellbeing. However, we currently have limited understanding of multitasking in remote meetings and its potential impact. In this paper, we present what we believe is the most comprehensive study of remote meeting multitasking behavior through an analysis of a large-scale telemetry dataset collected from February to May 2020 of U.S. Microsoft employees and a 715-person diary study. Our results demonstrate that intrinsic meeting characteristics such as size, length, time, and type, significantly correlate with the extent to which people multitask, and multitasking can lead to both positive and negative outcomes. Our findings suggest important best-practice guidelines for remote meetings (e.g., avoid important meetings in the morning) and design implications for productivity tools (e.g., support positive remote multitasking).
CYOct 31, 2020
You Recommend, I Buy: How and Why People Engage in Instant Messaging Based Social CommerceHancheng Cao, Zhilong Chen, Mengjie Cheng et al.
As an emerging business phenomenon especially in China, instant messaging (IM) based social commerce is growing increasingly popular, attracting hundreds of millions of users and is becoming one important way where people make everyday purchases. Such platforms embed shopping experiences within IM apps, e.g., WeChat, WhatsApp, where real-world friends post and recommend products from the platforms in IM group chats and quite often form lasting recommending/buying relationships. How and why do users engage in IM based social commerce? Do such platforms create novel experiences that are distinct from prior commerce? And do these platforms bring changes to user social lives and relationships? To shed light on these questions, we launched a qualitative study where we carried out semi-structured interviews on 12 instant messaging based social commerce users in China. We showed that IM based social commerce: 1) enables more reachable, cost-reducing, and immersive user shopping experience, 2) shapes user decision-making process in shopping through pre-existing social relationship, mutual trust, shared identity, and community norm, and 3) creates novel social interactions, which can contribute to new tie formation while maintaining existing social relationships. We demonstrate that all these unique aspects link closely to the characteristics of IM platforms, as well as the coupling of user social and economic lives under such business model. Our study provides important research and design implications for social commerce, and decentralized, trusted socio-technical systems in general.
CYOct 16, 2020
Understanding the Role of Intermediaries in Online Social E-commerce: An Exploratory Study of BeidianZhilong Chen, Hancheng Cao, Fengli Xu et al.
Social e-commerce, as a new form of social computing based marketing platforms, utilizes existing real-world social relationships for promotions and sales of products. It has been growing rapidly in recent years and attracted tens of millions of users in China. A key group of actors who enable market transactions on these platforms are intermediaries who connect producers with consumers by sharing information with and recommending products to their real-world social contacts. Despite their crucial role, the nature and behavior of these intermediaries on these social e-commerce platforms has not been systematically analyzed. Here we address this knowledge gap through a mixed method study. Leveraging 9 months' all-round behavior of about 40 million users on Beidian -- one of the largest social e-commerce sites in China, alongside with qualitative evidence from online forums and interviews, we examine characteristics of intermediaries, identify their behavioral patterns and uncover strategies and mechanisms that make successful intermediaries. We demonstrate that intermediaries on social e-commerce sites act as local trend detectors and "social grocers". Furthermore, successful intermediaries are highly dedicated whenever best sellers appear and broaden items for promotion. To the best of our knowledge, this paper presents the first large-scale analysis on the emerging role of intermediaries in social e-commerce platforms, which provides potential insights for the design and management of social computing marketing platforms.
CYOct 14, 2020
My Team Will Go On: Differentiating High and Low Viability Teams through Team InteractionHancheng Cao, Vivian Yang, Victor Chen et al.
Understanding team viability -- a team's capacity for sustained and future success -- is essential for building effective teams. In this study, we aggregate features drawn from the organizational behavior literature to train a viability classification model over a dataset of 669 10-minute text conversations of online teams. We train classifiers to identify teams at the top decile (most viable teams), 50th percentile (above a median split), and bottom decile (least viable teams), then characterize the attributes of teams at each of these viability levels. We find that a lasso regression model achieves an accuracy of .74--.92 AUC ROC under different thresholds of classifying viability scores. From these models, we identify the use of exclusive language such as `but' and `except', and the use of second person pronouns, as the most predictive features for detecting the most viable teams, suggesting that active engagement with others' ideas is a crucial signal of a viable team. Only a small fraction of the 10-minute discussion, as little as 70 seconds, is required for predicting the viability of team interaction. This work suggests opportunities for teams to assess, track, and visualize their own viability in real time as they collaborate.
CYOct 13, 2020
Will This Idea Spread Beyond Academia? Understanding Knowledge Transfer of Scientific Concepts across Text CorporaHancheng Cao, Mengjie Cheng, Zhepeng Cen et al.
What kind of basic research ideas are more likely to get applied in practice? There is a long line of research investigating patterns of knowledge transfer, but it generally focuses on documents as the unit of analysis and follow their transfer into practice for a specific scientific domain. Here we study translational research at the level of scientific concepts for all scientific fields. We do this through text mining and predictive modeling using three corpora: 38.6 million paper abstracts, 4 million patent documents, and 0.28 million clinical trials. We extract scientific concepts (i.e., phrases) from corpora as instantiations of "research ideas", create concept-level features as motivated by literature, and then follow the trajectories of over 450,000 new concepts (emerged from 1995-2014) to identify factors that lead only a small proportion of these ideas to be used in inventions and drug trials. Results from our analysis suggest several mechanisms that distinguish which scientific concept will be adopted in practice, and which will not. We also demonstrate that our derived features can be used to explain and predict knowledge transfer with high accuracy. Our work provides greater understanding of knowledge transfer for researchers, practitioners, and government agencies interested in encouraging translational research.
CYOct 4, 2020
Learning from Home: A Mixed-Methods Analysis of Live Streaming Based Remote Education Experience in Chinese Colleges During the COVID-19 PandemicZhilong Chen, Hancheng Cao, Yuting Deng et al.
The COVID-19 global pandemic and resulted lockdown policies have forced education in nearly every country to switch from a traditional co-located paradigm to a pure online 'distance learning from home' paradigm. Lying in the center of this learning paradigm shift is the emergence and wide adoption of distance communication tools and live streaming platforms for education. Here, we present a mixed-methods study on live streaming based education experience during the COVID-19 pandemic. We focus our analysis on Chinese higher education, carried out semi-structured interviews on 30 students, and 7 instructors from diverse colleges and disciplines, meanwhile launched a large-scale survey covering 6291 students and 1160 instructors in one leading Chinese university. Our study not only reveals important design guidelines and insights to better support current remote learning experience during the pandemic, but also provides valuable implications towards constructing future collaborative education supporting systems and experience after pandemic.
SIAug 15, 2019
When Your Friends Become Sellers: An Empirical Study of Social Commerce Site BeidianHancheng Cao, Zhilong Chen, Fengli Xu et al.
Past few years have witnessed the emergence and phenomenal success of strong-tie based social commerce. Embedded in social networking sites, these E-Commerce platforms transform ordinary people into sellers, where they advertise and sell products to their friends and family in online social networks. These sites can acquire millions of users within a short time, and are growing fast at an accelerated rate. However, little is known about how these social commerce develop as a blend of social relationship and economic transactions. In this paper we present the first measurement study on the full-scale data of Beidian, one of the fastest growing social commerce sites in China, which involves 11.8 million users. We first analyzed the topological structure of the Beidian platform and highlighted its decentralized nature. We then studied the site's rapid growth and its growth mechanism via invitation cascade. Finally, we investigated purchasing behavior on Beidian, where we focused on user proximity and loyalty, which contributes to the site's high conversion rate. As the consequences of interactions between strong ties and economic logics, emerging social commerce demonstrates significant property deviations from all known social networks and E-Commerce in terms of network structure, dynamics and user behavior. To the best of our knowledge, this work is the first quantitative study on the network characteristics and dynamics of emerging social commerce platforms.