Kaleen Shrestha

h-index: 2

Papers: 4

Citations: 28

Novelty: 30%

AI Score: 43

Ranked #52,308 of 194,257 authors (top 27%); #293 in HC (top 12%)
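
The "top 27%" figure is consistent with dividing the overall rank by the number of ranked authors. The short check below is only a sketch: the function name is hypothetical and the rounding rule is an assumption, since the page does not say how it rounds. The HC percentage is not checked because the size of the HC author pool is not given.

```python
def top_percent(rank: int, total: int) -> int:
    """Rank as a share of all ranked authors, as a whole-number percentage.

    Assumption: the site rounds to the nearest integer.
    """
    return round(100 * rank / total)


# Overall rank and author count as listed on the profile.
print(top_percent(52_308, 194_257))  # prints 27
```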

4 Papers

7.8 · RO · Jun 25, 2025 · Code

HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction

Zhonghao Shi, Enyu Zhao, Nathaniel Dennler et al.

Real-time human perception is crucial for effective human-robot interaction (HRI). Large vision-language models (VLMs) offer promising generalizable perceptual capabilities but often suffer from high latency, which negatively impacts user experience and limits VLM applicability in real-world scenarios. To systematically study VLM capabilities in human perception for HRI and performance-latency trade-offs, we introduce HRIBench, a visual question-answering (VQA) benchmark designed to evaluate VLMs across a diverse set of human perceptual tasks critical for HRI. HRIBench covers five key domains: (1) non-verbal cue understanding, (2) verbal instruction understanding, (3) human-robot object relationship understanding, (4) social navigation, and (5) person identification. To construct HRIBench, we collected data from real-world HRI environments to curate questions for non-verbal cue understanding, and leveraged publicly available datasets for the remaining four domains. We curated 200 VQA questions for each domain, resulting in a total of 1000 questions for HRIBench. We then conducted a comprehensive evaluation of both state-of-the-art closed-source and open-source VLMs (N=11) on HRIBench. Our results show that, despite their generalizability, current VLMs still struggle with core perceptual capabilities essential for HRI. Moreover, none of the models within our experiments demonstrated a satisfactory performance-latency trade-off suitable for real-time deployment, underscoring the need for future research on developing smaller, low-latency VLMs with improved human perception capabilities. HRIBench and our results can be found in this GitHub repository: https://github.com/interaction-lab/HRIBench.
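
The benchmark layout described in the abstract (five domains, 200 questions each, with accuracy weighed against latency) can be sketched as follows. This is a minimal illustration only: the `Response` record and the `summarize` function are hypothetical names, not part of the HRIBench repository, and the abstract does not specify how accuracy or latency were aggregated. Only the domain names and question counts come from the abstract.

```python
from dataclasses import dataclass
from statistics import mean

# Domain names and the per-domain question count are taken from the abstract.
DOMAINS = (
    "non-verbal cue understanding",
    "verbal instruction understanding",
    "human-robot object relationship understanding",
    "social navigation",
    "person identification",
)
QUESTIONS_PER_DOMAIN = 200
TOTAL_QUESTIONS = len(DOMAINS) * QUESTIONS_PER_DOMAIN  # 5 * 200 = 1000


@dataclass
class Response:
    """One model answer to one VQA question (hypothetical record layout)."""
    domain: str
    correct: bool
    latency_s: float  # wall-clock time to produce the answer, in seconds


def summarize(responses):
    """Return {domain: (accuracy, mean latency in seconds)}.

    Domains with no responses are omitted from the result.
    """
    summary = {}
    for domain in DOMAINS:
        rows = [r for r in responses if r.domain == domain]
        if rows:
            accuracy = mean(1.0 if r.correct else 0.0 for r in rows)
            latency = mean(r.latency_s for r in rows)
            summary[domain] = (accuracy, latency)
    return summary
```

Reporting accuracy and latency side by side per domain is one simple way to inspect the performance-latency trade-off the abstract refers to; other aggregations (for example, median latency) would be equally plausible.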

8.3 · HC · Apr 1, 2024

How Can Large Language Models Enable Better Socially Assistive Human-Robot Interaction: A Brief Survey

Zhonghao Shi, Ellen Landrum, Amy O'Connell et al.

Socially assistive robots (SARs) have shown great success in providing personalized cognitive-affective support for user populations with special needs such as older adults, children with autism spectrum disorder (ASD), and individuals with mental health challenges. The large body of work on SAR demonstrates its potential to provide at-home support that complements clinic-based interventions delivered by mental health professionals, making these interventions more effective and accessible. However, there are still several major technical challenges that hinder SAR-mediated interactions and interventions from reaching human-level social intelligence and efficacy. With the recent advances in large language models (LLMs), there is an increased potential for novel applications within the field of SAR that can significantly expand the current capabilities of SARs. However, incorporating LLMs introduces new risks and ethical concerns that have not yet been encountered and must be carefully addressed to safely deploy these more advanced systems. In this work, we aim to conduct a brief survey on the use of LLMs in SAR technologies, and discuss the potential and risks of applying LLMs to the following three major technical challenges of SAR: 1) natural language dialog; 2) multimodal understanding; 3) LLMs as robot policies.

8.2 · HC · Mar 6

Exploring Socially Assistive Peer Mediation Robots for Teaching Conflict Resolution to Elementary School Students

Kaleen Shrestha, Harish Dukkipati, Avni Hulyalkar et al.

In peer mediation--an approach to conflict resolution used in many K-12 schools in the United States--students help other students to resolve conflicts. For schools without peer mediation programs, socially assistive robots (SARs) may be able to provide an accessible option to practice peer mediation. We investigate how elementary school students react to a peer mediator role-play activity through an exploratory study with SARs. We conducted a small single-session between-subjects study with 12 participants. The study had two conditions, one with two robots acting as disputants, and the other without the robots and just the tablet. We found that a majority of students had positive feedback on the activity, with many students saying the peer mediation practice helped them feel better about themselves. Some said that the activity taught them how to help friends during conflict, indicating that the use of SARs for peer mediation practice is promising. We observed that participants had varying reading levels that impacted their ability to read and dictate the turns in the role-play script, an important consideration for future study design. Additionally, we found that some participants were more expressive while reading the script and throughout the activity. Although we did not find statistical differences in pre-/post-session self-perception and quiz performance between the robot and tablet conditions, we found strong correlations (p<0.05) between certain trait-related measures and learning-related measures in the robot condition, which can inform future study design for SARs for this and related contexts.

12.0 · CL · Sep 19, 2025

Evaluating Behavioral Alignment in Conflict Dialogue: A Multi-Dimensional Comparison of LLM Agents and Humans

Deuksin Kwon, Kaleen Shrestha, Bin Han et al.

Large Language Models (LLMs) are increasingly deployed in socially complex, interaction-driven tasks, yet their ability to mirror human behavior in emotionally and strategically complex contexts remains underexplored. This study assesses the behavioral alignment of personality-prompted LLMs in adversarial dispute resolution by simulating multi-turn conflict dialogues that incorporate negotiation. Each LLM is guided by a matched Five-Factor personality profile to control for individual variation and enhance realism. We evaluate alignment across three dimensions: linguistic style, emotional expression (e.g., anger dynamics), and strategic behavior. GPT-4.1 achieves the closest alignment with humans in linguistic style and emotional dynamics, while Claude-3.7-Sonnet best reflects strategic behavior. Nonetheless, substantial alignment gaps persist. Our findings establish a benchmark for alignment between LLMs and humans in socially complex interactions, underscoring both the promise and the limitations of personality conditioning in dialogue modeling.