HC · Apr 7, 2023
Generative Agents: Interactive Simulacra of Human Behavior
Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai et al. · stanford
Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture--observation, planning, and reflection--each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
AI · Nov 4, 2023
Levels of AGI for Operationalizing Progress on the Path to AGI
Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel et al. · deepmind
We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors. This framework introduces levels of AGI performance, generality, and autonomy, providing a common language to compare models, assess risks, and measure progress along the path to AGI. To develop our framework, we analyze existing definitions of AGI, and distill six principles that a useful ontology for AGI should satisfy. With these principles in mind, we propose "Levels of AGI" based on depth (performance) and breadth (generality) of capabilities, and reflect on how current systems fit into this ontology. We discuss the challenging requirements for future benchmarks that quantify the behavior and capabilities of AGI models against these levels. Finally, we discuss how these levels of AGI interact with deployment considerations such as autonomy and risk, and emphasize the importance of carefully selecting Human-AI Interaction paradigms for responsible and safe deployment of highly capable AI systems.
HC · Oct 23, 2023
Interactive AI Alignment: Specification, Process, and Evaluation Alignment
Michael Terry, Chinmay Kulkarni, Martin Wattenberg et al. · deepmind
Modern AI enables a high-level, declarative form of interaction: Users describe the intended outcome they wish an AI to produce, but do not actually create the outcome themselves. In contrast, in traditional user interfaces, users invoke specific operations to create the desired outcome. This paper revisits the basic input-output interaction cycle in light of this declarative style of interaction, and connects concepts in AI alignment to define three objectives for interactive alignment of AI: specification alignment (aligning on what to do), process alignment (aligning on how to do it), and evaluation alignment (assisting users in verifying and understanding what was produced). Using existing systems as examples, we show how these user-centered views of AI alignment can be used descriptively, prescriptively, and as an evaluative aid.
CY · Apr 4, 2023
Scientists' Perspectives on the Potential for Generative AI in their Fields
Meredith Ringel Morris
Generative AI models, including large language models and multimodal models that include text and other media, are on the cusp of transforming many aspects of modern life, including entertainment, education, civic life, the arts, and a range of professions. There is potential for Generative AI to have a substantive impact on the methods and pace of discovery for a range of scientific disciplines. We interviewed twenty scientists from a range of fields (including the physical, life, and social sciences) to gain insight into whether or how Generative AI technologies might add value to the practice of their respective disciplines, including not only ways in which AI might accelerate scientific discovery (i.e., research), but also other aspects of their profession, including the education of future scholars and the communication of scientific findings. In addition to identifying opportunities for Generative AI to augment scientists' current practices, we also asked participants to reflect on concerns about AI. These findings can help guide the responsible development of models and interfaces for scientific education, inquiry, and communication.
CL · May 21, 2022
Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics
Elisa Kreiss, Cynthia Bennett, Shayan Hooshmand et al.
Few images on the Web receive alt-text descriptions that would make them accessible to blind and low vision (BLV) users. Image-based NLG systems have progressed to the point where they can begin to address this persistent societal problem, but these systems will not be fully successful unless we evaluate them on metrics that guide their development correctly. Here, we argue against current referenceless metrics -- those that don't rely on human-generated ground-truth descriptions -- on the grounds that they do not align with the needs of BLV users. The fundamental shortcoming of these metrics is that they do not take context into account, whereas contextual information is highly valued by BLV users. To substantiate these claims, we present a study with BLV participants who rated descriptions along a variety of dimensions. An in-depth analysis reveals that the lack of context-awareness makes current referenceless metrics inadequate for advancing image accessibility. As a proof-of-concept, we provide a contextual version of the referenceless metric CLIPScore which begins to address the disconnect to the BLV data. An accessible HTML version of this paper is available at https://elisakreiss.github.io/contextual-description-evaluation/paper/reflessmetrics.html
AI · Apr 15, 2023
The Design Space of Generative Models
Meredith Ringel Morris, Carrie J. Cai, Jess Holbrook et al.
Card et al.'s classic paper "The Design Space of Input Devices" established the value of design spaces as a tool for HCI analysis and invention. We posit that developing design spaces for emerging pre-trained, generative AI models is necessary for supporting their integration into human-centered systems and practices. We explore what it means to develop an AI model design space by proposing two design spaces relating to generative AI models: the first considers how HCI can impact generative models (i.e., interfaces for models) and the second considers how generative models can impact HCI (i.e., models as an HCI prototyping material).
CL · May 8, 2022
Context-Aware Abbreviation Expansion Using Large Language Models
Shanqing Cai, Subhashini Venugopalan, Katrin Tomanek et al.
Motivated by the need for accelerating text entry in augmentative and alternative communication (AAC) for people with severe motor impairments, we propose a paradigm in which phrases are abbreviated aggressively as primarily word-initial letters. Our approach is to expand the abbreviations into full-phrase options by leveraging conversation context with the power of pretrained large language models (LLMs). Through zero-shot, few-shot, and fine-tuning experiments on four public conversation datasets, we show that for replies to the initial turn of a dialog, an LLM with 64B parameters is able to exactly expand over 70% of phrases with abbreviation length up to 10, leading to an effective keystroke saving rate of up to about 77% on these exact expansions. Including a small amount of context in the form of a single conversation turn more than doubles abbreviation expansion accuracies compared to having no context, an effect that is more pronounced for longer phrases. Additionally, the robustness of models against typo noise can be enhanced through fine-tuning on noisy data.
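The abbreviation scheme and the keystroke saving it implies can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the saving rate uses a simplified assumption of one keystroke per character, with no extra keystrokes for choosing among the expansion options.

```python
def abbreviate(phrase: str) -> str:
    """Abbreviate a phrase to its word-initial letters, lowercased.

    Hypothetical helper: punctuation handling is ignored for simplicity.
    """
    return "".join(word[0].lower() for word in phrase.split())


def keystroke_saving_rate(phrase: str, abbreviation: str) -> float:
    """Fraction of keystrokes saved when the abbreviation expands exactly.

    Simplified assumption: one keystroke per character, and selecting
    the correct expansion costs nothing.
    """
    if not phrase:
        raise ValueError("phrase must be non-empty")
    return 1.0 - len(abbreviation) / len(phrase)


phrase = "I will be there soon"      # 20 characters including spaces
abbrev = abbreviate(phrase)          # "iwbts", 5 characters
ksr = keystroke_saving_rate(phrase, abbrev)
print(abbrev, ksr)
```

Under these assumptions a five-word phrase of 20 characters typed as 5 letters saves 75% of keystrokes, which is in the same range as the rates the abstract reports for exact expansions.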
AI · Jun 1
VET: A Framework for Analyzing AI Discourse
Meredith Ringel Morris
Public discourse on AI has become polarized; exaggerated positions on AI in traditional and social media threaten the development of AI Literacy among the general public. In this article, I introduce the VET Framework, a method for categorizing AI discourse along the dimensions of valence, effectiveness, and trajectory. I show how this framework can be used to identify, compare, and critique prevalent narratives of AI Hype, AI Doom, AI Denial, and AI Normalcy. Using VET, I analyze how each of these four stances exaggerates some aspects of the current state and/or likely evolution of AI, and illustrate how the VET framework can serve as an AI Literacy tool by supporting the "vetting" of polarized AI discourse.
AI · May 27
Measuring Progress Toward AGI: A Cognitive Framework
Ryan Burnell, Yumeya Yamamori, Orhan Firat et al.
Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities. Drawing from decades of research in psychology, neuroscience, and cognitive science, we introduce a Cognitive Taxonomy that deconstructs general intelligence into 10 key cognitive faculties. We then propose a rigorous evaluation protocol in which a system's performance is measured across a suite of targeted, held-out cognitive tasks, generating a 'cognitive profile' that can be used to understand a system's strengths and weaknesses. We hope this framework will provide a practical roadmap and an initial step toward more rigorous, empirical evaluation of AGI.
AI · Nov 15, 2024
Generative Agent Simulations of 1,000 People
Joon Sung Park, Carolyn Q. Zou, Aaron Shaw et al.
The promise of human behavioral simulation--general-purpose computational agents that replicate human behavior across domains--could enable broad applications in policymaking and social science. We present a novel agent architecture that simulates the attitudes and behaviors of 1,052 real individuals--applying large language models to qualitative interviews about their lives, then measuring how well these agents replicate the attitudes and behaviors of the individuals that they represent. The generative agents replicate participants' responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later, and perform comparably in predicting personality traits and outcomes in experimental replications. Our architecture reduces accuracy biases across racial and ideological groups compared to agents given demographic descriptions. This work provides a foundation for new tools that can help investigate individual and collective behavior.
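The "85% as accurately" figure is a ratio: the agent's agreement with a participant's survey answers, divided by how well participants agree with their own answers two weeks later. A minimal sketch of that arithmetic follows; the function name and the example accuracies are hypothetical illustrations, not values taken from the paper.

```python
def normalized_accuracy(agent_accuracy: float, retest_accuracy: float) -> float:
    """Agent accuracy expressed relative to human test-retest accuracy.

    agent_accuracy:  fraction of survey items where the agent matches
                     the participant's original answers.
    retest_accuracy: fraction of items where the participant matches
                     their own original answers two weeks later.
    """
    if not 0.0 < retest_accuracy <= 1.0:
        raise ValueError("retest_accuracy must be in (0, 1]")
    return agent_accuracy / retest_accuracy


# Hypothetical numbers chosen only to illustrate the ratio.
ratio = normalized_accuracy(agent_accuracy=0.68, retest_accuracy=0.80)
print(f"{ratio:.2f}")
```

Normalizing this way treats a participant's own inconsistency as the ceiling, so a ratio near 1.0 would mean the agent predicts a person about as well as that person predicts themselves.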
CY · Jan 14, 2024
Generative Ghosts: Anticipating Benefits and Risks of AI Afterlives
Meredith Ringel Morris, Jed R. Brubaker
As AI systems quickly improve in both breadth and depth of performance, they lend themselves to creating increasingly powerful and realistic agents, including the possibility of agents modeled on specific people. We anticipate that within our lifetimes it may become common practice for people to create custom AI agents to interact with loved ones and/or the broader world after death; indeed, the past year has seen a boom in startups purporting to offer such services. We call these generative ghosts, since such agents will be capable of generating novel content rather than merely parroting content produced by their creator while living. In this paper, we reflect on the history of technologies for AI afterlives, including current early attempts by individual enthusiasts and startup companies to create generative ghosts. We then introduce a novel design space detailing potential implementations of generative ghosts, and use this analytic framework to ground discussion of the practical and ethical implications of various approaches to designing generative ghosts, including potential positive and negative impacts on individuals and society. Based on these considerations, we lay out a research agenda for the AI and HCI research communities to better understand the risk/benefit landscape of this novel technology so as to ultimately empower people who wish to create and interact with AI afterlives to do so in a beneficial manner.
CL · Feb 10, 2025
Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models
Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar et al.
The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
HC · Jun 13, 2024
Position: Towards Bidirectional Human-AI Alignment
Hua Shen, Tiffany Knearem, Reshmi Ghosh et al.
Recent advances in general-purpose AI underscore the urgent need to align AI systems with human goals and values. Yet, the lack of a clear, shared understanding of what constitutes "alignment" limits meaningful progress and cross-disciplinary collaboration. In this position paper, we argue that the research community should explicitly define and critically reflect on "alignment" to account for the bidirectional and dynamic relationship between humans and AI. Through a systematic review of over 400 papers spanning HCI, NLP, ML, and more, we examine how alignment is currently defined and operationalized. Building on this analysis, we introduce the Bidirectional Human-AI Alignment framework, which not only incorporates traditional efforts to align AI with human values but also introduces the critical, underexplored dimension of aligning humans with AI -- supporting cognitive, behavioral, and societal adaptation to rapidly advancing AI technologies. Our findings reveal significant gaps in current literature, especially in long-term interaction design, human value modeling, and mutual understanding. We conclude with three central challenges and actionable recommendations to guide future research toward more nuanced, reciprocal, and human-AI alignment approaches.
LG · Jun 6, 2024
Can Language Models Use Forecasting Strategies?
Sarah Pratt, Seth Blumberg, Pietro Kreitlon Carolino et al.
Advances in deep learning systems have allowed large models to match or surpass human accuracy on a number of skills such as image classification, basic programming, and standardized test taking. As the performance of the most capable models begins to saturate on tasks where humans already achieve high accuracy, it becomes necessary to benchmark models on increasingly complex abilities. One such task is forecasting the future outcome of events. In this work we describe experiments using a novel dataset of real-world events and associated human predictions, an evaluation metric to measure forecasting ability, and the accuracy of a number of different LLM-based forecasting designs on the provided dataset. Additionally, we analyze the performance of the LLM forecasters against human predictions and find that models still struggle to make accurate predictions about the future. Our follow-up experiments indicate this is likely due to models' tendency to guess that most events are unlikely to occur (which tends to be true for many prediction datasets, but does not reflect actual forecasting abilities). We reflect on next steps for developing a systematic and reliable approach to studying LLM forecasting.
CL · Jan 20, 2022
LaMDA: Language Models for Dialog Applications
Romal Thoppilan, Daniel De Freitas, Jamie Hall et al.
We present LaMDA: Language Models for Dialog Applications. LaMDA is a family of Transformer-based neural language models specialized for dialog, which have up to 137B parameters and are pre-trained on 1.56T words of public dialog data and web text. While model scaling alone can improve quality, it shows less improvement on safety and factual grounding. We demonstrate that fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements towards the two key challenges of safety and factual grounding. The first challenge, safety, involves ensuring that the model's responses are consistent with a set of human values, such as preventing harmful suggestions and unfair bias. We quantify safety using a metric based on an illustrative set of human values, and we find that filtering candidate responses using a LaMDA classifier fine-tuned with a small amount of crowdworker-annotated data offers a promising approach to improving model safety. The second challenge, factual grounding, involves enabling the model to consult external knowledge sources, such as an information retrieval system, a language translator, and a calculator. We quantify factuality using a groundedness metric, and we find that our approach enables the model to generate responses grounded in known sources, rather than responses that merely sound plausible. Finally, we explore the use of LaMDA in the domains of education and content recommendations, and analyze their helpfulness and role consistency.
CY · Mar 10, 2021
Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs
Solon Barocas, Anhong Guo, Ece Kamar et al.
Disaggregated evaluations of AI systems, in which system performance is assessed and reported separately for different groups of people, are conceptually simple. However, their design involves a variety of choices. Some of these choices influence the results that will be obtained, and thus the conclusions that can be drawn; others influence the impacts -- both beneficial and harmful -- that a disaggregated evaluation will have on people, including the people whose data is used to conduct the evaluation. We argue that a deeper understanding of these choices will enable researchers and practitioners to design careful and conclusive disaggregated evaluations. We also argue that better documentation of these choices, along with the underlying considerations and tradeoffs that have been made, will help others when interpreting an evaluation's results and conclusions.
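As a concrete illustration of what "assessed and reported separately for different groups" means, here is a minimal sketch that computes accuracy per group next to the aggregate figure. The record layout, group labels, and data are hypothetical; real evaluations involve the design choices the paper discusses (which groups, which metric, how much data per group).

```python
from collections import defaultdict


def disaggregated_accuracy(records):
    """Return accuracy per group from (group, y_true, y_pred) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation records: (group, true label, predicted label).
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 0),
]

per_group = disaggregated_accuracy(records)
overall = sum(int(t == p) for _, t, p in records) / len(records)
print(per_group, overall)
```

In this toy data the aggregate accuracy (4 of 6) hides the gap between group A (3 of 4) and group B (1 of 2), which is the kind of disparity a disaggregated report is meant to surface.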
CY · Mar 10, 2021
Understanding the Representation and Representativeness of Age in AI Data Sets
Joon Sung Park, Michael S. Bernstein, Robin N. Brewer et al.
A diverse representation of different demographic groups in AI training data sets is important in ensuring that the models will work for a large range of users. To this end, recent efforts in AI fairness and inclusion have advocated for creating AI data sets that are well-balanced across race, gender, socioeconomic status, and disability status. In this paper, we contribute to this line of work by focusing on the representation of age by asking whether older adults are represented proportionally to the population at large in AI data sets. We examine publicly-available information about 92 face data sets to understand how they codify age as a case study to investigate how the subjects' ages are recorded and whether older generations are represented. We find that older adults are very under-represented; five data sets in the study that explicitly documented the closed age intervals of their subjects included older adults (defined as older than 65 years), while only one included oldest-old adults (defined as older than 85 years). Additionally, we find that only 24 of the data sets include any age-related information in their documentation or metadata, and that there is no consistent method followed across these data sets to collect and record the subjects' ages. We recognize the unique difficulties in creating representative data sets in terms of age, but raise it as an important dimension that researchers and engineers interested in inclusive AI should consider.
HC · Aug 13, 2020
Social App Accessibility for Deaf Signers
Kelly Mack, Danielle Bragg, Meredith Ringel Morris et al.
Social media platforms support the sharing of written text, video, and audio. All of these formats may be inaccessible to people who are deaf or hard of hearing (DHH), particularly those who primarily communicate via sign language, people who we call Deaf signers. We study how Deaf signers engage with social platforms, focusing on how they share content and the barriers they face. We employ a mixed-methods approach involving seven in-depth interviews and a survey of a larger population (n = 60). We find that Deaf signers share the most in written English, despite their desire to share in sign language. We further identify key areas of difficulty in consuming content (e.g., lack of captions for spoken content in videos) and producing content (e.g., captioning signed videos, signing into a phone camera) on social media platforms. Our results both provide novel insights into social media use by Deaf signers and reinforce prior findings on DHH communication more generally, while revealing potential ways to make social media platforms more accessible to Deaf signers.
CV · Aug 22, 2019
Sign Language Recognition, Generation, and Translation: An Interdisciplinary Perspective
Danielle Bragg, Oscar Koller, Mary Bellard et al.
Developing successful sign language recognition, generation, and translation systems requires expertise in a wide range of fields, including computer vision, computer graphics, natural language processing, human-computer interaction, linguistics, and Deaf culture. Despite the need for deep interdisciplinary knowledge, existing research occurs in separate disciplinary silos, and tackles separate portions of the sign language processing pipeline. This leads to three key questions: 1) What does an interdisciplinary view of the current landscape reveal? 2) What are the biggest challenges facing the field? and 3) What are the calls to action for people working in the field? To help answer these questions, we brought together a diverse group of experts for a two-day workshop. This paper presents the results of that interdisciplinary workshop, providing key background that is often overlooked by computer scientists, a review of the state-of-the-art, a set of pressing challenges, and a call to action for the research community.
CY · Aug 21, 2019
AI and Accessibility: A Discussion of Ethical Considerations
Meredith Ringel Morris
According to the World Health Organization, more than one billion people worldwide have disabilities. The field of disability studies defines disability through a social lens; people are disabled to the extent that society creates accessibility barriers. AI technologies offer the possibility of removing many accessibility barriers; for example, computer vision might help people who are blind better sense the visual world, speech recognition and translation technologies might offer real time captioning for people who are hard of hearing, and new robotic systems might augment the capabilities of people with limited mobility. Considering the needs of users with disabilities can help technologists identify high-impact challenges whose solutions can advance the state of AI for all users; however, ethical challenges such as inclusivity, bias, privacy, error, expectation setting, simulated data, and social acceptability must be considered.
CY · Jul 4, 2019
Toward Fairness in AI for People with Disabilities: A Research Roadmap
Anhong Guo, Ece Kamar, Jennifer Wortman Vaughan et al.
AI technologies have the potential to dramatically impact the lives of people with disabilities (PWD). Indeed, improving the lives of PWD is a motivator for many state-of-the-art AI systems, such as automated speech recognition tools that can caption videos for people who are deaf and hard of hearing, or language prediction algorithms that can augment communication for people with speech or cognitive disabilities. However, widely deployed AI systems may not work properly for PWD, or worse, may actively discriminate against them. These considerations regarding fairness in AI for PWD have thus far received little attention. In this position paper, we identify potential areas of concern regarding how several AI technology categories may impact particular disability constituencies if care is not taken in their design, development, and testing. We intend for this risk assessment of how various classes of AI might interact with various classes of disability to provide a roadmap for future research that is needed to gather data, test these hypotheses, and build more inclusive algorithms.
CY · Jul 6, 2015
RIMES: Embedding Interactive Multimedia Exercises in Lecture Videos
Juho Kim, Elena L. Glassman, Andrés Monroy-Hernández et al.
Teachers in conventional classrooms often ask learners to express themselves and show their thought processes by speaking out loud, drawing on a whiteboard, or even using physical objects. Despite the pedagogical value of such activities, interactive exercises available in most online learning platforms are constrained to multiple-choice and short answer questions. We introduce RIMES, a system for easily authoring, recording, and reviewing interactive multimedia exercises embedded in lecture videos. With RIMES, teachers can prompt learners to record their responses to an activity using video, audio, and inking while watching lecture videos. Teachers can then review and interact with all the learners' responses in an aggregated gallery. We evaluated RIMES with 19 teachers and 25 students. Teachers created a diverse set of activities across multiple subjects that tested deep conceptual and procedural knowledge. Teachers found the exercises useful for capturing students' thought processes, identifying misconceptions, and engaging students with content.
CY · Jul 6, 2015
Mudslide: A Spatially Anchored Census of Student Confusion for Online Lecture Videos
Elena L. Glassman, Juho Kim, Andrés Monroy-Hernández et al.
Educators have developed an effective technique to get feedback after in-person lectures, called "muddy card." Students are given time to reflect and write the "muddiest" (least clear) point on an index card, to hand in as they leave class. This practice of assigning end-of-lecture reflection tasks to generate explicit student feedback is well suited for adaptation to the challenge of supporting feedback in online video lectures. We describe the design and evaluation of Mudslide, a prototype system that translates the practice of muddy cards into the realm of online lecture videos. Based on an in-lab study of students and teachers, we find that spatially contextualizing students' muddy point feedback with respect to particular lecture slides is advantageous to both students and teachers. We also reflect on further opportunities for enhancing this feedback method based on teachers' and students' experiences with our prototype.