Franklin Mingzhe Li

HC
h-index: 21
12 papers
207 citations
Novelty: 37%
AI Score: 43

12 Papers

HC · Oct 3, 2023
Selenite: Scaffolding Online Sensemaking with Comprehensive Overviews Elicited from Large Language Models

Michael Xieyang Liu, Tongshuang Wu, Tianying Chen et al.

Sensemaking in unfamiliar domains can be challenging, demanding considerable user effort to compare different options with respect to various criteria. Prior research and our formative study found that people would benefit from reading an overview of an information space upfront, including the criteria others previously found useful. However, existing sensemaking tools struggle with the "cold-start" problem -- it not only requires significant input from previous users to generate and share these overviews, but such overviews may also turn out to be biased and incomplete. In this work, we introduce a novel system, Selenite, which leverages Large Language Models (LLMs) as reasoning machines and knowledge retrievers to automatically produce a comprehensive overview of options and criteria to jumpstart users' sensemaking processes. Subsequently, Selenite also adapts as people use it, helping users find, read, and navigate unfamiliar information in a systematic yet personalized manner. Through three studies, we found that Selenite produced accurate and high-quality overviews reliably, significantly accelerated users' information processing, and effectively improved their overall comprehension and sensemaking experience.

CV · Mar 12 · Code
OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Xianjing Han, Bin Zhu, Shiqi Hu et al.

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.

HC · Mar 7, 2025
OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking

Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu et al.

Following recipes while cooking is an important but difficult task for visually impaired individuals. We developed OSCAR (Object Status Context Awareness for Recipes), a novel approach that provides recipe progress tracking and context-aware feedback on the completion of cooking tasks through tracking object statuses. OSCAR leverages both Large Language Models (LLMs) and Vision-Language Models (VLMs) to manipulate recipe steps, extract object status information, align visual frames with object status, and provide a cooking progress tracking log. We evaluated OSCAR's recipe following functionality using 173 YouTube cooking videos and 12 real-world non-visual cooking videos to demonstrate OSCAR's capability to track cooking steps and provide contextual guidance. Our results highlight the effectiveness of using object status to improve performance compared to the baseline by over 20% across different VLMs, and we present factors that impact prediction performance. Furthermore, we contribute a dataset of real-world non-visual cooking videos with step annotations as an evaluation benchmark.

HC · Feb 14, 2025
How Users Who are Blind or Low Vision Play Mobile Games: Perceptions, Challenges, and Strategies

Zihe Ran, Xiyu Li, Qing Xiao et al.

As blind and low-vision (BLV) players engage more deeply with games, accessibility features have become essential. While some research has explored tools and strategies to enhance game accessibility, the specific experiences of these players with mobile games remain underexamined. This study addresses this gap by investigating how BLV users experience mobile games with varying accessibility levels. Through interviews with 32 experienced BLV mobile players, we explore their perceptions, challenges, and strategies for engaging with mobile games. Our findings reveal that BLV players turn to mobile games to alleviate boredom, achieve a sense of accomplishment, and build social connections, but face barriers depending on the game's accessibility level. We also compare mobile games to other forms of gaming, highlighting the relative advantages of mobile games, such as the inherent accessibility of smartphones. This study contributes to understanding BLV mobile gaming experiences and provides insights for enhancing accessible mobile game design.

HC · Jul 5, 2025
More than One Step at a Time: Designing Procedural Feedback for Non-visual Makeup Routines

Franklin Mingzhe Li, Akihiko Oharazawa, Chloe Qingyu Zhu et al.

Makeup plays a vital role in self-expression, identity, and confidence - yet remains an underexplored domain for assistive technology, especially for people with vision impairments. While existing tools support isolated tasks such as color identification or product labeling, they rarely address the procedural complexity of makeup routines: coordinating step sequences, managing product placement, and assessing the final look with accessible feedback. To understand the real-world process, we conducted a contextual inquiry with 15 visually impaired makeup users, capturing real-time makeup application behaviors and their step-by-step information needs and assessment approaches. Our findings reveal embodied, tactile-first strategies; persistent challenges in blending, symmetry, and assessment; and a desire for honest, real-time, goal-aligned feedback. We also interviewed five professional makeup artists, who reviewed participant makeup videos and provided expert responses to participant-raised questions and assessment practices. We contribute a taxonomy of feedback needs in non-visual makeup, and outline design implications for future assistive systems - emphasizing hands-free, conversational interaction and context-aware, procedural support for expressive and independent beauty practices.

AI · Jul 4, 2025
Exploring Object Status Recognition for Recipe Progress Tracking in Non-Visual Cooking

Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu et al.

Cooking plays a vital role in everyday independence and well-being, yet remains challenging for people with vision impairments due to limited support for tracking progress and receiving contextual feedback. Object status - the condition or transformation of ingredients and tools - offers a promising but underexplored foundation for context-aware cooking support. In this paper, we present OSCAR (Object Status Context Awareness for Recipes), a technical pipeline that explores the use of object status recognition to enable recipe progress tracking in non-visual cooking. OSCAR integrates recipe parsing, object status extraction, visual alignment with cooking steps, and time-causal modeling to support real-time step tracking. We evaluate OSCAR on 173 instructional videos and a real-world dataset of 12 non-visual cooking sessions recorded by BLV individuals in their homes. Our results show that object status consistently improves step prediction accuracy across vision-language models, and reveal key factors that impact performance in real-world conditions, such as implicit tasks, camera placement, and lighting. We contribute the pipeline of context-aware recipe progress tracking, an annotated real-world non-visual cooking dataset, and design insights to guide future context-aware assistive cooking systems.

HC · Feb 23, 2022
Understanding How Older Adults Comprehend COVID-19 Interactive Visualizations via Think-Aloud Protocol

Mingming Fan, Yiwen Wang, Yuni Xie et al.

Older adults have been hit disproportionally hard by the COVID-19 pandemic. One critical way for older adults to minimize the negative impact of COVID-19 and future pandemics is to stay informed about its latest information, which has been increasingly presented through online interactive visualizations (e.g., live dashboards and websites). Thus, it is imperative to understand how older adults interact with and comprehend online COVID-19 interactive visualizations and what challenges they might encounter to make such visualizations more accessible to older adults. We adopted a user-centered approach by inviting older adults to interact with COVID-19 interactive visualizations while at the same time verbalizing their thought processes using a think-aloud protocol. By analyzing their think-aloud verbalizations, we identified four types of thought processes representing how older adults comprehended the visualizations and uncovered the challenges they encountered. Furthermore, we also identified the challenges they encountered with seven common types of interaction techniques adopted by the visualizations. Based on the findings, we present design guidelines for making interactive visualizations more accessible to older adults.

HC · Jan 26, 2022
An Exploration of Captioning Practices and Challenges of Individual Content Creators on YouTube for People with Hearing Impairments

Franklin Mingzhe Li, Cheng Lu, Zhicong Lu et al.

Deaf and Hard-of-Hearing (DHH) audiences have long complained about the caption quality of many online videos created by individual content creators on video-sharing platforms (e.g., YouTube). However, there is a lack of exploration of the practices, challenges, and perceptions of online video captions from the perspectives of both individual content creators and DHH audiences. In this work, we first explore DHH audiences' feedback on and reactions to YouTube video captions through interviews with 13 DHH individuals, and uncover DHH audiences' experiences, challenges, and perceptions on watching videos created by individual content creators (e.g., manually added caption tags could create additional confidence and trust in caption quality for DHH audiences). We then discover individual content creators' practices, challenges, and perceptions on captioning their videos (e.g., back-captioning problems) by conducting a YouTube video analysis with 189 captioning-related YouTube videos, followed by a survey with 62 individual content creators. Overall, our findings provide an in-depth understanding of captions generated by individual content creators and bridge the knowledge gap mutually between content creators and DHH audiences on captions.

HC · Jul 12, 2021
Non-Visual Cooking: Exploring Practices and Challenges of Meal Preparation by People with Visual Impairments

Franklin Mingzhe Li, Jamie Dorst, Peter Cederberg et al.

The reliance on vision for tasks related to cooking and eating healthy can present barriers to cooking for oneself and achieving proper nutrition. There has been little research exploring cooking practices and challenges faced by people with visual impairments. We present a content analysis of 122 YouTube videos to highlight the cooking practices of visually impaired people, and we describe detailed practices for 12 different cooking activities (e.g., cutting and chopping, measuring, testing food for doneness). Based on the cooking practices, we also conducted semi-structured interviews with 12 visually impaired people who have cooking experience and show existing challenges, concerns, and risks in cooking (e.g., tracking the status of tasks in progress, verifying whether things are peeled or cleaned thoroughly). We further discuss opportunities to support the current practices and improve the independence of people with visual impairments in cooking (e.g., zero-touch interactions for cooking). Overall, our findings provide guidance for future research exploring various assistive technologies to help people cook without relying on vision.

HC · May 31, 2021
ThumbTrak: Recognizing Micro-finger Poses Using a Ring with Proximity Sensing

Wei Sun, Franklin Mingzhe Li, Congshu Huang et al.

ThumbTrak is a novel wearable input device that recognizes 12 micro-finger poses in real-time. Poses are characterized by the thumb touching each of the 12 phalanges on the hand. It uses a thumb-ring, built with a flexible printed circuit board, which hosts nine proximity sensors. Each sensor measures the distance from the thumb to various parts of the palm or other fingers. ThumbTrak uses a support-vector-machine (SVM) model to classify finger poses based on distance measurements in real-time. A user study with ten participants showed that ThumbTrak could recognize 12 micro finger poses with an average accuracy of 93.6%. We also discuss potential opportunities and challenges in applying ThumbTrak in real-world applications.

HC · Feb 24, 2021
TeethTap: Recognizing Discrete Teeth Gestures Using Motion and Acoustic Sensing on an Earpiece

Wei Sun, Franklin Mingzhe Li, Benjamin Steeper et al.

Teeth gestures have become an alternative input modality for different situations and accessibility purposes. In this paper, we present TeethTap, a novel eyes-free and hands-free input technique, which can recognize up to 13 discrete teeth tapping gestures. TeethTap adopts a wearable 3D printed earpiece with an IMU sensor and a contact microphone behind both ears, which work in tandem to detect jaw movement and sound data, respectively. TeethTap uses a support vector machine to classify gestures from noise by fusing acoustic and motion data, and implements K-Nearest-Neighbor (KNN) with a Dynamic Time Warping (DTW) distance measurement using motion data for gesture classification. A user study with 11 participants demonstrated that TeethTap could recognize 13 gestures with a real-time classification accuracy of 90.9% in a laboratory environment. We further uncovered the accuracy differences on different teeth gestures when having sensors on single vs. both sides. Moreover, we explored the activation gesture under real-world environments, including eating, speaking, walking and jumping. Based on our findings, we further discussed potential applications and practical challenges of integrating TeethTap into future devices.
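
The KNN-with-DTW classification step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one-dimensional motion traces (real IMU data has several axes), an absolute-difference local cost, and hypothetical gesture labels and template values chosen only for illustration.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences.

    Assumption: the local cost is the absolute difference between samples.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn_classify(query, templates, k=1):
    """Label a query sequence by majority vote among its k DTW-nearest templates.

    templates: list of (label, sequence) pairs; labels here are hypothetical.
    """
    ranked = sorted(templates, key=lambda t: dtw_distance(query, t[1]))
    votes = {}
    for label, _ in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    # On a tie, the label seen first (i.e. the nearer template) wins
    return max(votes, key=votes.get)


# Hypothetical templates: made-up single-axis traces, not real sensor data
TEMPLATES = [
    ("tap_left", [0, 1, 3, 1, 0]),
    ("tap_right", [0, -1, -3, -1, 0]),
]

# A slower, time-stretched version of the "tap_left" trace
QUERY = [0, 0, 1, 3, 3, 1, 0]
PREDICTED = knn_classify(QUERY, TEMPLATES, k=1)
```

DTW is a natural fit for this kind of input because the same gesture performed faster or slower produces traces of different lengths, and the warping path absorbs that timing difference before the nearest-neighbor comparison.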

HC · Jan 22, 2021
"I Choose Assistive Devices That Save My Face" A Study on Perceptions of Accessibility and Assistive Technology Use Conducted in China

Franklin Mingzhe Li, Di Laura Chen, Mingming Fan et al.

Despite the potential benefits of assistive technologies (ATs) for people with various disabilities, only around 7% of Chinese people with disabilities have had an opportunity to use ATs. Even for those who have used ATs, the abandonment rate was high. Although China has the world's largest population with disabilities, prior research exploring how ATs are used and perceived, and why ATs are abandoned, has been conducted primarily in North America and Europe. In this paper, we present an interview study conducted in China with 26 people with various disabilities to understand their practices, challenges, perceptions, and misperceptions of using ATs. From the study, we learned about factors that influence AT adoption practices (e.g., misuse of accessible infrastructure, issues with replicating existing commercial ATs), challenges using ATs in social interactions (e.g., Chinese stigma), and misperceptions about ATs (e.g., ATs should overcome inaccessible social infrastructures). Informed by the findings, we derive a set of design considerations to bridge the existing gaps in AT design (e.g., manual vs. electronic ATs) and to improve ATs' social acceptability in China.