CVSep 22, 2023
Contextual Emotion Estimation from Image CaptionsVera Yang, Archita Srivastava, Yasaman Etesam et al.
Emotion estimation in images is a challenging task, typically using computer vision methods to directly estimate people's emotions using face, body pose and contextual cues. In this paper, we explore whether Large Language Models (LLMs) can support the contextual emotion estimation task, by first captioning images, then using an LLM for inference. First, we must understand: how well do LLMs perceive human emotions? And which parts of the information enable them to determine emotions? One initial challenge is to construct a caption that describes a person within a scene with information relevant for emotion perception. Towards this goal, we propose a set of natural language descriptors for faces, bodies, interactions, and environments. We use them to manually generate captions and emotion annotations for a subset of 331 images from the EMOTIC dataset. These captions offer an interpretable representation for emotion estimation, towards understanding how elements of a scene affect emotion perception in LLMs and beyond. Secondly, we test the capability of a large language model to infer an emotion from the resulting image captions. We find that GPT-3.5, specifically the text-davinci-003 model, provides surprisingly reasonable emotion predictions consistent with human annotations, but accuracy can depend on the emotion concept. Overall, the results suggest promise in the image captioning and LLM approach.
CVOct 30, 2023
Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic ReasoningYasaman Etesam, Özge Nilay Yalçın, Chuxuan Zhang et al.
The emotional theory of mind problem requires facial expressions, body pose, contextual information and implicit commonsense knowledge to reason about the person's emotion and its causes, making it currently one of the most difficult problems in affective computing. In this work, we propose multiple methods to incorporate the emotional reasoning capabilities by constructing "narrative captions" relevant to emotion perception, that includes contextual and physical signal descriptors that focuses on "Who", "What", "Where" and "How" questions related to the image and emotions of the individual. We propose two distinct ways to construct these captions using zero-shot classifiers (CLIP) and fine-tuning visual-language models (LLaVA) over human generated descriptors. We further utilize these captions to guide the reasoning of language (GPT-4) and vision-language models (LLaVa, GPT-Vision). We evaluate the use of the resulting models in an image-to-language-to-emotion task. Our experiments showed that combining the "Fast" narrative descriptors and "Slow" reasoning of language models is a promising way to achieve emotional theory of mind.
CVMay 14, 2024
Contextual Emotion Recognition using Large Vision Language ModelsYasaman Etesam, Özge Nilay Yalçın, Chuxuan Zhang et al.
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
81.8SIMar 31
Beyond Individual Mimicry: Constructing Human-Like Social network with Graph-Augmented LLM AgentsHaoran Bu, Litian Zhang, Chuxuan Zhang et al.
Driven by large language models (LLMs), social bot can autonomously engage in local interactions, whose human-like behaviors enable them to evade social bot detection. However, while these botnets exhibit realistic local social interactions, they fail to preserve human-like social network. This is because LLM-based bots are graph-unaware and cannot coordinate over global interactions, which makes those botnets vulnerable to graph neural network (GNN)-based detection. To address this limitation, we propose GraphMind, which equips LLM-driven social bots to explicitly learn and fit human-like social network structures. Building on this foundation, we further construct GraphMind-Botnet, a LLM-driven botnet designed to evaluate the performance of existing social bot detection algorithms. Experiments on datasets derived from GraphMind-Botnet show that both text-based and graph-based detection models show substantially degraded performance in distinguishing. Our results highlight the critical role of social link construction in LLM-driven social network generation, while exposing fundamental weaknesses in existing bot detection mechanisms.
HCMar 8
How Neurotypical and Autistic Children Interact Nonverbally with Anthropomorphic Agents in Open-Ended TasksChuxuan Zhang, Bermet Burkanova, Lawrence H. Kim et al.
What nonverbal behaviors should a robot respond to? Understanding how children-both neurotypical and autistic-engage with embodied artificial agents is critical for developing inclusive and socially interactive systems. In this paper, we study "open-ended" unconstrained interactions with embodied agents, where little is known about how children behave nonverbally when given few instructions. We conducted a Wizard-of-Oz study in which children were invited to interact nonverbally with 6 different embodied virtual characters displayed on a television screen. We collected 563 (141 unique) nonverbal behaviors produced by children and compare the childre's interaction patterns with those previously reported in an adult study. We also report the presence of repetitive face and hand movements, which should be considered in the development of nonverbally interactive artificial agents.
LGJul 25, 2025
Salsa as a Nonverbal Embodied Language -- The CoMPAS3D Dataset and BenchmarksBermet Burkanova, Payam Jome Yazdian, Chuxuan Zhang et al.
Imagine a humanoid that can safely and creatively dance with a human, adapting to its partner's proficiency, using haptic signaling as a primary form of communication. While today's AI systems excel at text or voice-based interaction with large language models, human communication extends far beyond text-it includes embodied movement, timing, and physical coordination. Modeling coupled interaction between two agents poses a formidable challenge: it is continuous, bidirectionally reactive, and shaped by individual variation. We present CoMPAS3D, the largest and most diverse motion capture dataset of improvised salsa dancing, designed as a challenging testbed for interactive, expressive humanoid AI. The dataset includes 3 hours of leader-follower salsa dances performed by 18 dancers spanning beginner, intermediate, and professional skill levels. For the first time, we provide fine-grained salsa expert annotations, covering over 2,800 move segments, including move types, combinations, execution errors and stylistic elements. We draw analogies between partner dance communication and natural language, evaluating CoMPAS3D on two benchmark tasks for synthetic humans that parallel key problems in spoken language and dialogue processing: leader or follower generation with proficiency levels (speaker or listener synthesis), and duet (conversation) generation. Towards a long-term goal of partner dance with humans, we release the dataset, annotations, and code, along with a multitask SalsaAgent model capable of performing all benchmark tasks, alongside additional baselines to encourage research in socially interactive embodied AI and creative, expressive humanoid motion generation.
HCJul 14, 2025
React to This (RTT): A Nonverbal Turing Test for Embodied AIChuxuan Zhang, Yasaman Etesam, Angelica Lim
We propose an approach to test embodied AI agents for interaction awareness and believability, particularly in scenarios where humans push them to their limits. Turing introduced the Imitation Game as a way to explore the question: "Can machines think?" The Total Turing Test later expanded this concept beyond purely verbal communication, incorporating perceptual and physical interaction. Building on this, we propose a new guiding question: "Can machines react?" and introduce the React to This (RTT) test for nonverbal behaviors, presenting results from an initial experiment.