9.7MMMar 10
TPIFM: A Task-Aware Model for Evaluating Perceptual Interaction Fluency in Remote AR CollaborationJiarun Song, Ninghao Wan, Fuzheng Yang et al.
Remote Collaborative Augmented Reality (RCAR) enables geographically distributed users to collaborate by integrating virtual and physical environments. However, because RCAR relies on real-time transmission, it is susceptible to delay and stalling impairments under constrained network conditions. Perceptual interaction fluency (PIF), defined as the perceived pace and responsiveness of collaboration, is influenced not only by physical network impairments but also by intrinsic task characteristics. These characteristics can be interpreted as the task-specific just-noticeable difference (JND), i.e., the maximal tolerable temporal responsiveness before PIF degrades. When the average response time (ART), measured as the mean time per operation from receiving collaborator feedback to initiating the next action, falls within the JND, PIF is generally sustained, whereas values exceeding it indicate disruption. Tasks differ in their JNDs, reflecting distinct temporal responsiveness demands and sensitivities to impairments. From the perspective of the Free Energy Principle (FEP), tasks with lower JNDs impose stricter temporal prediction demands, making PIF more vulnerable to impairments, whereas higher JNDs allow greater tolerance. On this basis, we classify RCAR tasks by JND and evaluate their PIF through controlled subjective experiments under delay, stalling, and hybrid conditions. Building on these findings, we propose the Task-Aware Perceptual Interaction Fluency Model (TPIFM). Experimental results show that TPIFM accurately assesses PIF under network impairments, providing guidance for adaptive RCAR design and user experience optimization under network constraints.
82.7HCMar 10
From Perception to Cognition: How Latency Affects Interaction Fluency and Social Presence in VR ConferencingJiarun Song, Ninghao Wan, FuZheng Yang et al.
Virtual reality (VR) conferencing has the potential to provide geographically dispersed users with an immersive environment, enabling rich social interactions and user experience using avatars. However, remote communication in VR inevitably introduces end-to-end (E2E) latency, which can significantly impact user experience. To clarify the impact of latency, we conducted subjective experiments to analyze how it influences interaction fluency from the perspective of quality perception and social presence from the perspective of social cognition, comparing VR conferencing with traditional video conferencing (VC). Specifically, interaction fluency emphasizes user perception of interaction pace and responsiveness and is assessed using Absolute Category Rating (ACR) method. In contrast, social presence focuses on the cognitive understanding of interaction, specifically whether individuals can comprehend the intentions, emotions, and behaviors expressed by others. It is primarily measured using the Networked Minds Social Presence Inventory (NMSPI). Building on this analysis, we further investigate the relationship between interaction fluency and social presence under different latency conditions to clarify the underlying perceptual and cognitive mechanisms. The findings from these subjective tests provide meaningful insights for optimizing the related systems, helping to improve interaction fluency and enhancing social presence in immersive virtual environments.
36.8HCMar 10
Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience PerspectiveNinghao Wan, Jiarun Song, Fuzheng Yang
In virtual reality (VR) educational scenarios, Pedagogical agents (PAs) enhance immersive learning through realistic appearances and interactive behaviors. However, most existing PAs rely on static speech and simple gestures. This limitation reduces their ability to dynamically adapt to the semantic context of instructional content. As a result, interactions often lack naturalness and effectiveness in the teaching process. To address this challenge, this study proposes a large language model (LLM)-driven multimodal expression generation method that constructs semantically sensitive prompts to generate coordinated speech and gesture instructions, enabling dynamic alignment between instructional semantics and multimodal expressive behaviors. A VR-based PA prototype was developed and evaluated through user experience-oriented subjective experiments. Results indicate that dynamically generated multimodal expressions significantly enhance learners' perceived learning effectiveness, engagement, and intention to use, while effectively alleviating feelings of fatigue and boredom during the learning process. Furthermore, the combined dynamic expression of speech and gestures notably enhances learners' perceptions of human-likeness and social presence. The findings provide new insights and design guidelines for building more immersive and naturally expressive intelligent PAs.