ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
This work addresses the challenge of enabling LLMs to operate physical devices in VR environments. The contribution is incremental: it applies existing LLM capabilities to a new application domain.
The paper asks whether Large Language Models (LLMs) can translate semantic actions into precise device manipulations for Virtual Reality (VR) games. It finds that top models such as Gemini-1.5-Pro show strong task decomposition but still lag behind humans in procedural reasoning and spatial understanding. Performance varies across games and improves with few-shot examples.
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces ComboBench, a benchmark that evaluates LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs (GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash) against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
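To make the task concrete, the sketch below shows one way a device manipulation sequence could be represented and compared against an annotated reference. It is illustrative only: the `Step` fields, the device and control names, the bow-drawing example, and the `sequence_similarity` score are all assumptions made here, not the benchmark's actual data schema or evaluation metric.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Step:
    """One hypothetical device manipulation step (field names are assumptions)."""
    device: str   # e.g. "right_controller" or "hmd"
    control: str  # e.g. "grip", "trigger", "pose"
    action: str   # e.g. "hold", "release", "pull_back"


def sequence_similarity(predicted, reference):
    """Order-aware similarity in [0, 1] between two step sequences.

    Uses difflib's ratio (2 * matched steps / total steps). This is a
    stand-in score chosen for illustration, not the paper's metric.
    """
    return SequenceMatcher(None, predicted, reference, autojunk=False).ratio()


# Hypothetical annotated reference for a semantic action such as "shoot an arrow".
reference = [
    Step("hmd", "gaze", "look_at_target"),
    Step("right_controller", "grip", "hold"),
    Step("right_controller", "pose", "pull_back"),
    Step("right_controller", "grip", "release"),
]

# Hypothetical model output that omits the initial aiming step.
predicted = [
    Step("right_controller", "grip", "hold"),
    Step("right_controller", "pose", "pull_back"),
    Step("right_controller", "grip", "release"),
]

score = sequence_similarity(predicted, reference)
print(f"similarity: {score:.3f}")
```

Under this toy score, a prediction that skips a step is penalized while the correctly ordered remaining steps still earn credit, which loosely mirrors the distinction the abstract draws between task decomposition and procedural reasoning.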