HCAI | May 6, 2024

OmniActions: Predicting Digital Actions in Response to Real-World Multimodal Sensory Inputs with LLMs

arXiv:2405.03901v1 | 34 citations | CHI
Originality: Incremental advance
AI Analysis

This work addresses the challenge of seamless interaction in pervasive augmented reality for users who are physically, cognitively, or socially occupied. The contribution is incremental: it applies existing LLM methods to a new application.

The paper aims to reduce the friction users face when acting on multimodal information in everyday scenarios by predicting digital follow-up actions from context. It uses a diary study to derive a design space and evaluates three LLM techniques, identifying in-context learning as the most effective; specific accuracy figures are not given in this summary.

The progression to "Pervasive Augmented Reality" envisions easy access to multimodal information continuously. However, in many everyday scenarios, users are occupied physically, cognitively or socially. This may increase the friction to act upon the multimodal information that users encounter in the world. To reduce such friction, future interactive interfaces should intelligently provide quick access to digital actions based on users' context. To explore the range of possible digital actions, we conducted a diary study that required participants to capture and share the media that they intended to perform actions on (e.g., images or audio), along with their desired actions and other contextual information. Using this data, we generated a holistic design space of digital follow-up actions that could be performed in response to different types of multimodal sensory inputs. We then designed OmniActions, a pipeline powered by large language models (LLMs) that processes multimodal sensory inputs and predicts follow-up actions on the target information grounded in the derived design space. Using the empirical data collected in the diary study, we performed quantitative evaluations on three variations of LLM techniques (intent classification, in-context learning and finetuning) and identified the most effective technique for our task. Additionally, as an instantiation of the pipeline, we developed an interactive prototype and reported preliminary user feedback about how people perceive and react to the action predictions and their errors.
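
As a rough illustration of what in-context learning for action prediction can look like, the sketch below builds a few-shot prompt from diary-style examples and maps a model's raw reply back to a known action label. It is not the authors' pipeline. The `Example` class, the `build_prompt` and `parse_action` helpers, the prompt wording, and the three action labels are all hypothetical placeholders, and the model call itself is left out.

```python
from dataclasses import dataclass


@dataclass
class Example:
    context: str  # text description of the sensory input (assumed format)
    action: str   # the follow-up action the participant wanted


# Hypothetical labels for illustration only; not the paper's design space.
ACTIONS = ["save", "share", "search"]


def build_prompt(examples, query_context):
    """Assemble a few-shot prompt: labeled examples, then the open query."""
    lines = ["Predict the follow-up action. Options: " + ", ".join(ACTIONS), ""]
    for ex in examples:
        lines.append(f"Context: {ex.context}")
        lines.append(f"Action: {ex.action}")
        lines.append("")
    lines.append(f"Context: {query_context}")
    lines.append("Action:")
    return "\n".join(lines)


def parse_action(completion):
    """Map raw model text to a known label, or None if unrecognized."""
    text = completion.strip().lower()
    if not text:
        return None
    word = text.split()[0].strip(".,!")
    return word if word in ACTIONS else None
```

In use, the string returned by `build_prompt` would be sent to whichever LLM is available, and `parse_action` would be applied to the reply. Returning `None` for unrecognized output lets the caller decide how to handle a prediction that falls outside the label set.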

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
