Shengdong Zhao

8papers

60citations

Novelty48%

AI Score47

Ranked #54,352 of 205,806 authors (top 26%)#310 in HC (top 11%)

8 Papers

71.4HCMay 16

Spatial Balancing: Harnessing Spatial Reasoning to Balance Scientific Exposition and Narrative Engagement in LLM-assisted Science Communication Writing

Kexue Fu, Jiaye Leng, Yawen Zhang et al.

Balancing scientific exposition and narrative engagement is a central challenge in science communication. To examine how to achieve balance, we conducted a formative study with four science communicators and a literature review of science communication practices, focusing on their workflows and strategies. These insights revealed how creators iteratively shift between exposition and engagement but often lack structured support. Building on this, we developed SpatialBalancing, a co-writing system that connects human spatial reasoning with the linguistic intelligence of large language models. The system visualizes revision trade-offs in a dual-axis space, where users select strategy-based labels to generate, compare, and refine versions during the revision process. This spatial externalization transforms revision into spatial navigation, enabling intentional iterations that balance scientific rigor with narrative appeal. In a within-subjects study (N=16), SpatialBalancing enhanced metacognitive reflection, flexibility, and creative exploration, demonstrating how coupling spatial reasoning with linguistic generation fosters monitoring in iterative science communication writing.

CLOct 19, 2023

GestureGPT: Toward Zero-Shot Free-Form Hand Gesture Understanding with Large Language Model Agents

Xin Zeng, Xiaoyu Wang, Tengxiang Zhang et al.

Existing gesture interfaces only work with a fixed set of gestures defined either by interface designers or by users themselves, which introduces learning or demonstration efforts that diminish their naturalness. Humans, on the other hand, understand free-form gestures by synthesizing the gesture, context, experience, and common sense. In this way, the user does not need to learn, demonstrate, or associate gestures. We introduce GestureGPT, a free-form hand gesture understanding framework that mimics human gesture understanding procedures to enable a natural free-form gestural interface. Our framework leverages multiple Large Language Model agents to manage and synthesize gesture and context information, then infers the interaction intent by associating the gesture with an interface function. More specifically, our triple-agent framework includes a Gesture Description Agent that automatically segments and formulates natural language descriptions of hand poses and movements based on hand landmark coordinates. The description is deciphered by a Gesture Inference Agent through self-reasoning and querying about the interaction context (e.g., interaction history, gaze data), which is managed by a Context Management Agent. Following iterative exchanges, the Gesture Inference Agent discerns the user's intent by grounding it to an interactive function. We validated our framework offline under two real-world scenarios: smart home control and online video streaming. The average zero-shot Top-1/Top-5 grounding accuracies are 44.79%/83.59% for smart home tasks and 37.50%/73.44% for video streaming tasks. We also provide an extensive discussion that includes rationale for model selection, generalizability, and future research directions for a practical system etc.

HCJul 22, 2024

TOM: A Development Platform For Wearable Intelligent Assistants

Nuwan Janaka, Shengdong Zhao, David Hsu et al.

Advanced digital assistants can significantly enhance task performance, reduce user burden, and provide personalized guidance to improve users' abilities. However, the development of such intelligent digital assistants presents a formidable challenge. To address this, we introduce TOM, a conceptual architecture and software platform (https://github.com/TOM-Platform) designed to support the development of intelligent wearable assistants that are contextually aware of both the user and the environment. This system was developed collaboratively with AR/MR researchers, HCI researchers, AI/Robotic researchers, and software developers, and it continues to evolve to meet the diverse requirements of these stakeholders. TOM facilitates the creation of intelligent assistive AR applications for daily activities and supports the recording and analysis of user interactions, integration of new devices, and the provision of assistance for various activities. Additionally, we showcase several proof-of-concept assistive services and discuss the challenges involved in developing such services.

HCMar 6

Hierarchical Resource Rationality Explains Human Reading Behavior

Yunpeng Bai, Xiaofu Jin, Shengdong Zhao et al.

Reading is a pervasive and cognitively demanding activity that underpins modern human culture. It is a prime instance of a class of tasks where eye movements are coordinated for the purpose of comprehension. Existing theories explain either eye movements or comprehension during reading, but the critical link between the two remains unclear. Here, we propose resource-rational optimization as a unifying principle governing adaptive reading behavior. Eye movements are selected to maximize expected comprehension while minimizing cognitive and temporal costs, organized hierarchically across nested time scales: fixation decisions support word recognition; sentence-level integration guides skipping and regression; and text-level comprehension goals shape memory construction and rereading. A computational implementation successfully replicates an unprecedented range of findings in human reading, from lexical effects to comprehension outcomes. Together, these results suggest that resource rationality provides a general mechanism for coordinating perception, memory, and action in knowledge-intensive human behaviors, offering a principled account of how complex cognitive skills adapt to limited resources.

HCAug 12, 2021Code

Exploring Head-based Mode-Switching in Virtual Reality

Rongkai Shi, Nan Zhu, Hai-Ning Liang et al.

Mode-switching supports multilevel operations using a limited number of input methods. In Virtual Reality (VR) head-mounted displays (HMD), common approaches for mode-switching use buttons, controllers, and users' hands. However, they are inefficient and challenging to do with tasks that require both hands (e.g., when users need to use two hands during drawing operations). Using head gestures for mode-switching can be an efficient and cost-effective way, allowing for a more continuous and smooth transition between modes. In this paper, we explore the use of head gestures for mode-switching especially in scenarios when both users' hands are performing tasks. We present a first user study that evaluated eight head gestures that could be suitable for VR HMD with a dual-hand line-drawing task. Results show that move forward, move backward, roll left, and roll right led to better performance and are preferred by participants. A second study integrating these four gestures in Tilt Brush, an open-source painting VR application, is conducted to further explore the applicability of these gestures and derive insights. Results show that Tilt Brush with head gestures allowed users to change modes with ease and led to improved interaction and user experience. The paper ends with a discussion on some design recommendations for using head-based mode-switching in VR HMD.

HCFeb 26

Simulation-based Optimization for Augmented Reading

Yunpeng Bai, Shengdong Zhao, Antti Oulasvirta

Augmented reading systems aim to adapt text presentation to improve comprehension and task performance, yet existing approaches rely heavily on heuristics, opaque data-driven models, or repeated human involvement in the design loop. We propose framing augmented reading as a simulation-based optimization problem grounded in resource-rational models of human reading. These models instantiate a simulated reader that allocates limited cognitive resources, such as attention, memory, and time under task demands, enabling systematic evaluation of text user interfaces. We introduce two complementary optimization pipelines: an offline approach that explores design alternatives using simulated readers, and an online approach that personalizes reading interfaces in real time using ongoing interaction data. Together, this perspective enables adaptive, explainable, and scalable augmented reading design without relying solely on human testing.

HCFeb 6, 2022

Visual Behaviors and Mobile Information Acquisition

Nuwan Janaka, Xinke Wu, Shan Zhang et al.

It is common for people to engage in information acquisition tasks while on the move. To understand how users' visual behaviors influence microlearning, a form of mobile information acquisition, we conducted a shadowing study with 8 participants and identified three common visual behaviors: 'glance', 'inspect', and 'drift'. We found that 'drift' best supports mobile information acquisition. We also identified four user-related factors that can influence the utilization of mobile information acquisition opportunities: situational awareness, switching costs, ongoing cognitive processes, and awareness of opportunities. We further examined how these user-related factors interplay with device-related factors through a technology probe with 20 participants using mobile phones and optical head-mounted displays (OHMDs). Results indicate that different device platforms significantly influence how mobile information acquisition opportunities are used: OHMDs can better support mobile information acquisition when visual attention is fragmented. OHMDs facilitate shorter visual switch-times between the task and surroundings, which reduces the mental barrier of task transition. Mobile phones, on the other hand, provide a more focused experience in more stable surroundings. Based on these findings, we discuss trade-offs and design implications for supporting information acquisition tasks on the move.

HCAug 16, 2020

From Lost to Found: Discover Missing UI Design Semantics through Recovering Missing Tags

Chunyang Chen, Sidong Feng, Zhengyang Liu et al.

Design sharing sites provide UI designers with a platform to share their works and also an opportunity to get inspiration from others' designs. To facilitate management and search of millions of UI design images, many design sharing sites adopt collaborative tagging systems by distributing the work of categorization to the community. However, designers often do not know how to properly tag one design image with compact textual description, resulting in unclear, incomplete, and inconsistent tags for uploaded examples which impede retrieval, according to our empirical study and interview with four professional designers. Based on a deep neural network, we introduce a novel approach for encoding both the visual and textual information to recover the missing tags for existing UI examples so that they can be more easily found by text queries. We achieve 82.72% accuracy in the tag prediction. Through a simulation test of 5 queries, our system on average returns hundreds more results than the default Dribbble search, leading to better relatedness, diversity and satisfaction.