CV May 28
SalsaAgent: A multimodal embodied language model for interactive dance generation
Payam Jome Yazdian, Zoe Stanley, Angelica Lim
Interaction between humanoids involves bidirectional and nonverbal reactivity, coordination and synchrony. Toward socially aware robots and interactive virtual agents, we present SalsaAgent, a language model that generates expressive, full-body salsa dance motions in reaction to a human leader and against a contextual music backdrop. We formulate interaction as nonverbal motion token passing, extending the vocabulary of a large language model (LLM) to process discrete motion tokens, pairwise relation tokens, and audio. Our contributions include new tokens for full-body and motion relations, LLM fine-tuning using automatically derived text descriptions of skeleton dynamics for token grounding, and a two-stage token-to-diffusion pipeline. Subjective and objective evaluations demonstrate the effectiveness of our approach in terms of motion quality, music and partner coordination, and consistent two-person spatial behavior, with significant improvements over baselines.
RO May 2, 2022
Data-driven emotional body language generation for social robotics
Mina Marmpena, Fernando Garcia, Angelica Lim et al.
In social robotics, endowing humanoid robots with the ability to generate bodily expressions of affect can improve human-robot interaction and collaboration, since humans attribute, and perhaps subconsciously anticipate, such traces to perceive an agent as engaging, trustworthy, and socially present. Robotic emotional body language needs to be believable, nuanced and relevant to the context. We implemented a deep learning data-driven framework that learns from a few hand-designed robotic bodily expressions and can generate numerous new ones of similar believability and lifelikeness. The framework uses the Conditional Variational Autoencoder model and a sampling approach based on the geometric properties of the model's latent space to condition the generative process on targeted levels of valence and arousal. The evaluation study found that the anthropomorphism and animacy of the generated expressions are not perceived differently from the hand-designed ones, and the emotional conditioning was adequately differentiable between most levels except the pairs of neutral-positive valence and low-medium arousal. Furthermore, an exploratory analysis of the results reveals a possible impact of the conditioning on the perceived dominance of the robot, as well as on the participants' attention.
CV Sep 22, 2023
Contextual Emotion Estimation from Image Captions
Vera Yang, Archita Srivastava, Yasaman Etesam et al.
Emotion estimation in images is a challenging task, typically using computer vision methods to directly estimate people's emotions using face, body pose and contextual cues. In this paper, we explore whether Large Language Models (LLMs) can support the contextual emotion estimation task, by first captioning images, then using an LLM for inference. First, we must understand: how well do LLMs perceive human emotions? And which parts of the information enable them to determine emotions? One initial challenge is to construct a caption that describes a person within a scene with information relevant for emotion perception. Towards this goal, we propose a set of natural language descriptors for faces, bodies, interactions, and environments. We use them to manually generate captions and emotion annotations for a subset of 331 images from the EMOTIC dataset. These captions offer an interpretable representation for emotion estimation, towards understanding how elements of a scene affect emotion perception in LLMs and beyond. Secondly, we test the capability of a large language model to infer an emotion from the resulting image captions. We find that GPT-3.5, specifically the text-davinci-003 model, provides surprisingly reasonable emotion predictions consistent with human annotations, but accuracy can depend on the emotion concept. Overall, the results suggest promise in the image captioning and LLM approach.
CV Aug 15, 2022
Towards Inclusive HRI: Using Sim2Real to Address Underrepresentation in Emotion Expression Recognition
Saba Akhyani, Mehryar Abbasi Boroujeni, Mo Chen et al.
Robots and artificial agents that interact with humans should be able to do so without bias and inequity, but facial perception systems have notoriously been found to work more poorly for certain groups of people than others. In our work, we aim to build a system that can perceive humans in a more transparent and inclusive manner. Specifically, we focus on dynamic expressions on the human face, which are difficult to collect for a broad set of people due to privacy concerns and the fact that faces are inherently identifiable. Furthermore, datasets collected from the Internet are not necessarily representative of the general population. We address this problem by offering a Sim2Real approach in which we use a suite of 3D simulated human models that enables us to create an auditable synthetic dataset covering 1) underrepresented facial expressions, outside of the six basic emotions, such as confusion; 2) ethnic or gender minority groups; and 3) a wide range of viewing angles from which a robot may encounter a human in the real world. By augmenting a small dynamic emotional expression dataset containing 123 samples with a synthetic dataset containing 4536 samples, we achieved an improvement in accuracy of 15% on our own dataset and 11% on an external benchmark dataset, compared to the performance of the same model architecture without synthetic training data. We also show that this additional step improves accuracy specifically for racial minorities when the architecture's feature extraction weights are trained from scratch.
RO May 10, 2022
Read the Room: Adapting a Robot's Voice to Ambient and Social Contexts
Paige Tuttosi, Emma Hughson, Akihiro Matsufuji et al.
How should a robot speak in a formal, quiet and dark, or a bright, lively and noisy environment? By designing robots to speak in a more social and ambient-appropriate manner we can improve perceived awareness and intelligence for these agents. We describe a process and results toward selecting robot voice styles for perceived social appropriateness and ambiance awareness. Understanding how humans adapt their voices in different acoustic settings can be challenging due to difficulties in voice capture in the wild. Our approach includes 3 steps: (a) Collecting and validating voice data interactions in virtual Zoom ambiances, (b) Exploring and clustering human vocal utterances to identify primary voice styles, and (c) Testing robot voice styles in recreated ambiances using projections, lighting and sound. We focus on food service scenarios as a proof-of-concept setting. We provide results using the Pepper robot's voice with different styles, towards robots that speak in a contextually appropriate and adaptive manner. Our results with N=120 participants provide evidence that the choice of voice style in different ambiances impacted a robot's perceived intelligence in several factors including: social appropriateness, comfort, awareness, human-likeness and competency.
RO Feb 12
Your Robot Will Feel You Now: Empathy in Robots and Embodied Agents
Angelica Lim, Ö. Nilay Yalçin
The fields of human-robot interaction (HRI) and embodied conversational agents (ECAs) have long studied how empathy could be implemented in machines. One of the major drivers has been the goal of giving multimodal social and emotional intelligence to these artificially intelligent agents, which interact with people through facial expressions, body, gesture, and speech. What empathic behaviors and models have these fields implemented by mimicking human and animal behavior? In what ways have they explored creating machine-specific analogies? This chapter aims to review the knowledge from these studies, towards applying the lessons learned to today's ubiquitous, language-based agents such as ChatGPT.
CV Oct 30, 2023
Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning
Yasaman Etesam, Özge Nilay Yalçın, Chuxuan Zhang et al.
The emotional theory of mind problem requires facial expressions, body pose, contextual information and implicit commonsense knowledge to reason about the person's emotion and its causes, making it currently one of the most difficult problems in affective computing. In this work, we propose multiple methods to incorporate emotional reasoning capabilities by constructing "narrative captions" relevant to emotion perception, which include contextual and physical signal descriptors that focus on "Who", "What", "Where" and "How" questions related to the image and the emotions of the individual. We propose two distinct ways to construct these captions using zero-shot classifiers (CLIP) and fine-tuning visual-language models (LLaVA) over human-generated descriptors. We further utilize these captions to guide the reasoning of language (GPT-4) and vision-language models (LLaVA, GPT-Vision). We evaluate the use of the resulting models in an image-to-language-to-emotion task. Our experiments showed that combining the "Fast" narrative descriptors and "Slow" reasoning of language models is a promising way to achieve emotional theory of mind.
CV Dec 19, 2023
MotionScript: Natural Language Descriptions for Expressive 3D Human Motions
Payam Jome Yazdian, Rachel Lagasse, Hamid Mohammadi et al.
We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. Unlike existing motion datasets that rely on broad action labels or generic captions, MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement including expressive actions (e.g., emotions, stylistic walking) and interactions beyond standard motion capture datasets. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models, enabling the synthesis of highly realistic and diverse human motions from text. By augmenting motion datasets with MotionScript captions, we demonstrate significant improvements in out-of-distribution motion generation, allowing large language models (LLMs) to generate motions that extend beyond existing data. Additionally, MotionScript opens new applications in animation, virtual human simulation, and robotics, providing an interpretable bridge between intuitive descriptions and motion synthesis. To the best of our knowledge, this is the first attempt to systematically translate 3D motion into structured natural language without requiring training data.
CV May 14, 2024
Contextual Emotion Recognition using Large Vision Language Models
Yasaman Etesam, Özge Nilay Yalçın, Chuxuan Zhang et al.
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
CL Apr 30, 2025
BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition
Paige Tuttösí, Mantaj Dhillon, Luna Sang et al.
Some speech recognition tasks, such as automatic speech recognition (ASR), are approaching or have reached human performance in many reported metrics. Yet, they continue to struggle in complex, real-world situations, such as with distanced speech. Previous challenges have released datasets to address the issue of distanced ASR; however, the focus remains primarily on distance, specifically relying on multi-microphone array systems. Here we present the B(asic) E(motion) R(andom phrase) S(hou)t(s) (BERSt) dataset. The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore includes at least 98 different acoustic environments. The data also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructions and being in a different room than the actor. This data is publicly available for use and can be used to evaluate a variety of speech recognition tasks, including: ASR, shout detection, and speech emotion recognition (SER). We provide initial benchmarks for ASR and SER tasks, and find that ASR degrades both with an increase in distance and shout level and shows varied performance depending on the intended emotion. Our results show that the BERSt dataset is challenging for both ASR and SER tasks and continued work is needed to improve the robustness of such systems for more accurate real-world use.
CL Mar 31
Covertly improving intelligibility with data-driven adaptations of speech timing
Paige Tuttösí, Angelica Lim, H. Henny Yeung et al.
Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advancements in machine-generated speech allowing more precise control of speech rate in order to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (e.g., the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across native L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show that this speech rate structure not only facilitates L2 listeners' comprehension of the target vowel contrast, but that native listeners also rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors. Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology to improve the accessibility of machine-generated speech, which can be extended to other aspects of speech comprehension and a wide variety of listeners and environments.
HC Mar 8
How Neurotypical and Autistic Children Interact Nonverbally with Anthropomorphic Agents in Open-Ended Tasks
Chuxuan Zhang, Bermet Burkanova, Lawrence H. Kim et al.
What nonverbal behaviors should a robot respond to? Understanding how children, both neurotypical and autistic, engage with embodied artificial agents is critical for developing inclusive and socially interactive systems. In this paper, we study "open-ended" unconstrained interactions with embodied agents, where little is known about how children behave nonverbally when given few instructions. We conducted a Wizard-of-Oz study in which children were invited to interact nonverbally with 6 different embodied virtual characters displayed on a television screen. We collected 563 (141 unique) nonverbal behaviors produced by children and compared the children's interaction patterns with those previously reported in an adult study. We also report the presence of repetitive face and hand movements, which should be considered in the development of nonverbally interactive artificial agents.
CL Nov 24, 2025
Gender Bias in Emotion Recognition by Large Language Models
Maureen Herbert, Katie Sun, Angelica Lim et al.
The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, "How does this person feel?". Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training-based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering.
LG Jul 25, 2025
Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and Benchmarks
Bermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang et al.
Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner's proficiency, using haptic signaling as a primary form of communication. While today's AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text: it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.
HC Jul 14, 2025
React to This (RTT): A Nonverbal Turing Test for Embodied AI
Chuxuan Zhang, Yasaman Etesam, Angelica Lim
We propose an approach to test embodied AI agents for interaction awareness and believability, particularly in scenarios where humans push them to their limits. Turing introduced the Imitation Game as a way to explore the question: "Can machines think?" The Total Turing Test later expanded this concept beyond purely verbal communication, incorporating perceptual and physical interaction. Building on this, we propose a new guiding question: "Can machines react?" and introduce the React to This (RTT) test for nonverbal behaviors, presenting results from an initial experiment.
SD Jun 29, 2025
You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties
Paige Tuttösí, H. Henny Yeung, Yue Wang et al.
We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouraging and respectful than overall slowed down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.
SD Jun 8, 2024
Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation
Paige Tuttösí, H. Henny Yeung, Yue Wang et al.
Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2s pre-target and a distal contrastive effect up to 1s before; and found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domain.
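As a rough illustration of the reverse-correlation idea described above, the sketch below estimates a perceptual kernel as the difference between the mean random perturbation on trials where a listener reported one vowel and the mean on trials where they reported the other. This is a generic textbook formulation, not the authors' analysis pipeline; the function name, the trial layout (one perturbation value per context window), and the toy data are all assumptions made for illustration.

```python
from typing import List, Sequence


def reverse_correlation_kernel(
    profiles: Sequence[Sequence[float]],
    responses: Sequence[bool],
) -> List[float]:
    """Estimate a first-order kernel from random stimulus perturbations.

    profiles  -- one row per trial; each row holds the random perturbation
                 (e.g. speech-rate change) applied in each context window.
    responses -- True if the listener reported vowel A on that trial,
                 False if they reported vowel B (hypothetical coding).
    """
    if len(profiles) != len(responses):
        raise ValueError("profiles and responses must have the same length")

    chose_a = [p for p, r in zip(profiles, responses) if r]
    chose_b = [p for p, r in zip(profiles, responses) if not r]
    if not chose_a or not chose_b:
        raise ValueError("need at least one trial for each response")

    n_windows = len(profiles[0])

    def column_mean(rows: Sequence[Sequence[float]], i: int) -> float:
        return sum(row[i] for row in rows) / len(rows)

    # Positive values: raising the parameter in this window biased
    # responses toward A; negative values: toward B.
    return [
        column_mean(chose_a, i) - column_mean(chose_b, i)
        for i in range(n_windows)
    ]


# Toy data: two context windows, four trials.
toy_profiles = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
toy_responses = [True, False, True, False]
kernel = reverse_correlation_kernel(toy_profiles, toy_responses)
```

In this toy case only the first window carries any weight, since the response always follows the sign of the first perturbation. A kernel with opposite signs in early and late windows would correspond to the distal contrastive and proximal congruent effects the abstract reports.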
CV Dec 10, 2021
The Many Faces of Anger: A Multicultural Video Dataset of Negative Emotions in the Wild (MFA-Wild)
Roya Javadi, Angelica Lim
The portrayal of negative emotions such as anger can vary widely between cultures and contexts, depending on the acceptability of expressing full-blown emotions rather than suppression to maintain harmony. The majority of emotional datasets collect data under the broad label "anger", but social signals can range from annoyed, contemptuous, angry, furious, hateful, and more. In this work, we curated the first in-the-wild multicultural video dataset of emotions, and deeply explored anger-related emotional expressions by asking culture-fluent annotators to label the videos with 6 labels and 13 emojis in a multi-label framework. We provide a baseline multi-label classifier on our dataset, and show how emojis can be effectively used as a language-agnostic tool for annotation.
RO Mar 7, 2021
Developing a Data-Driven Categorical Taxonomy of Emotional Expressions in Real World Human Robot Interactions
Ghazal Saheb Jam, Jimin Rhim, Angelica Lim
Emotions are reactions that can be expressed through a variety of social signals. For example, anger can be expressed through a scowl, narrowed eyes, a long stare, or many other expressions. This complexity is problematic when attempting to recognize a human's expression in a human-robot interaction: categorical emotion models used in HRI typically use only a few prototypical classes, and do not cover the wide array of expressions in the wild. We propose a data-driven method towards increasing the number of known emotion classes present in human-robot interactions, to 28 classes or more. The method includes the use of automatic segmentation of video streams into short (<10s) videos, and annotation using the large set of widely-understood emojis as categories. In this work, we showcase our initial results using a large in-the-wild HRI dataset (UE-HRI), with 61 clips randomly sampled from the dataset, labeled with 28 different emojis. In particular, our results showed that the "skeptical" emoji was a common expression in our dataset, which is not often considered in typical emotion taxonomies. This is the first step in developing a rich taxonomy of emotional expressions that can be used in the future as labels for training machine learning models, towards more accurate perception of humans by robots.
CV Oct 28, 2020
SFU-Store-Nav: A Multimodal Dataset for Indoor Human Navigation
Zhitian Zhang, Jimin Rhim, Taher Ahmadi et al.
This article describes a dataset collected in a set of experiments that involves human participants and a robot. The set of experiments was conducted in the computing science robotics lab at Simon Fraser University, Burnaby, BC, Canada, and its aim is to gather data containing common gestures, movements, and other behaviours that may indicate humans' navigational intent relevant for autonomous robot navigation. The experiment simulates a shopping scenario where human participants come in to pick up items from their shopping list and interact with a Pepper robot that is programmed to help the human participant. We collected visual data and motion capture data from 108 human participants. The visual data contains live recordings of the experiments and the motion capture data contains the position and orientation of the human participants in world coordinates. This dataset could be valuable for researchers in the robotics, machine learning and computer vision communities.
HC Aug 30, 2019
The OMG-Empathy Dataset: Evaluating the Impact of Affective Behavior in Storytelling
Pablo Barros, Nikhil Churamani, Angelica Lim et al.
Processing human affective behavior is important for developing intelligent agents that interact with humans in complex interaction scenarios. A large number of current approaches that address this problem focus on classifying emotion expressions by grouping them into known categories. Such strategies neglect, among other aspects, the impact of the affective responses from an individual on their interaction partner, thus ignoring how people empathize towards each other. This is also reflected in the datasets used to train models for affective processing tasks. Most of the recent datasets, in particular the ones which capture natural interactions ("in-the-wild" datasets), are designed, collected, and annotated based on the recognition of displayed affective reactions, ignoring how these displayed or expressed emotions are perceived. In this paper, we propose a novel dataset composed of dyadic interactions designed, collected and annotated with a focus on measuring the affective impact that eight different stories have on the listener. Each video of the dataset contains around 5 minutes of interaction where a speaker tells a story to a listener. After each interaction, the listener annotated, using a valence scale, how the story impacted their affective state, reflecting how they empathized with the speaker as well as the story. We also propose different evaluation protocols and a baseline that encourages participation in the advancement of the field of artificial empathy and emotion contagion.
RO Feb 7, 2019
Commodifying Pointing in HRI: Simple and Fast Pointing Gesture Detection from RGB-D Images
Bita Azari, Angelica Lim, Richard T. Vaughan
We present and characterize a simple method for detecting pointing gestures suitable for human-robot interaction applications using a commodity RGB-D camera. We exploit a state-of-the-art Deep CNN-based detector to find hands and faces in RGB images, then examine the corresponding depth channel pixels to obtain full 3D pointing vectors. We test several methods of estimating the hand end-point of the pointing vector. The system runs at better than 30 Hz on commodity hardware, exceeding the frame rate of typical RGB-D sensors. An estimate of the absolute pointing accuracy is found empirically by comparison with ground-truth data from a VICON motion-capture system, and the useful interaction volume is established. Finally, we show an end-to-end test where a robot estimates where the pointing vector intersects the ground plane, and report the accuracy obtained. We provide source code as a ROS node, with the intention of contributing a commodity implementation of this common component in HRI systems.
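The final step mentioned above, finding where a pointing vector meets the ground plane, reduces to a ray-plane intersection. The sketch below shows one way this could look, assuming the ray starts at a 3D face position and passes through a 3D hand position, and that the ground is the plane z = ground_z in the robot's frame. The function name, the face-to-hand ray choice, and the coordinate convention are illustrative assumptions, not details taken from the released ROS node.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def pointing_ground_intersection(
    face: Vec3, hand: Vec3, ground_z: float = 0.0
) -> Optional[Vec3]:
    """Intersect the ray from `face` through `hand` with the plane z = ground_z.

    Returns the intersection point, or None when the ray is parallel to
    the ground or points away from it.
    """
    dx = hand[0] - face[0]
    dy = hand[1] - face[1]
    dz = hand[2] - face[2]

    if abs(dz) < 1e-9:
        return None  # ray runs parallel to the ground plane

    # Ray parameter at which the z coordinate reaches the ground plane.
    t = (ground_z - face[2]) / dz
    if t <= 0.0:
        return None  # the plane lies behind the pointing direction

    return (face[0] + t * dx, face[1] + t * dy, ground_z)


# Face 2.0 m above the floor, hand 0.5 m forward and 0.5 m lower.
target = pointing_ground_intersection((0.0, 0.0, 2.0), (0.5, 0.0, 1.5))
```

With these toy coordinates the ray drops 0.5 m for every 0.5 m of forward travel, so it reaches the floor 2.0 m in front of the person. In practice the face and hand positions would come from the detector and depth channel, and would need transforming into a frame where the ground plane is known.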