PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios
PhoStream addresses the challenge of real-time decision-making for mobile AI assistants, though the contribution is incremental: it benchmarks the problem rather than solving it.
The paper evaluates multimodal large language models as mobile assistants in continuous real-world streaming scenarios. It introduces PhoStream, the first mobile-centric streaming benchmark, with 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. On LLM-judged scores (0-100), models such as Gemini 3 Pro score above 80 on Instant and Backward tasks but drop to 16.40 on Forward tasks, largely because they respond before the required visual and audio cues appear.
Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/PhoStream.
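The early-response failure described above can be illustrated with a minimal sketch. It is not the paper's Online Inference Pipeline or its scoring code. It assumes each QA item carries three timestamps (when the question is asked, when the required cue appears, and when the model answers), and it assumes that a task counts as Forward when the cue arrives after the question, Backward when it arrives before, and Instant otherwise. The names `StreamQA`, `task_type`, `is_early`, `early_response_rate`, and the tolerance `tol` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StreamQA:
    """Hypothetical record for one streaming QA item (times in seconds)."""
    question_time: float  # when the question is asked in the stream
    cue_time: float       # when the required audio-visual cue appears
    response_time: float  # when the model emits its answer


def task_type(qa: StreamQA, tol: float = 0.5) -> str:
    """Assumed labelling: compare the cue time with the question time."""
    if qa.cue_time > qa.question_time + tol:
        return "forward"   # evidence lies in the future; the model must wait
    if qa.cue_time < qa.question_time - tol:
        return "backward"  # evidence already appeared earlier in the stream
    return "instant"       # evidence is available at question time


def is_early(qa: StreamQA) -> bool:
    """A response is early if it is emitted before the cue has appeared."""
    return qa.response_time < qa.cue_time


def early_response_rate(items: Iterable[StreamQA]) -> float:
    """Fraction of forward items that were answered before their cue."""
    forward = [qa for qa in items if task_type(qa) == "forward"]
    if not forward:
        return 0.0
    return sum(is_early(qa) for qa in forward) / len(forward)


if __name__ == "__main__":
    items = [
        StreamQA(question_time=10, cue_time=30, response_time=12),  # forward, early
        StreamQA(question_time=10, cue_time=30, response_time=31),  # forward, on time
        StreamQA(question_time=20, cue_time=5, response_time=21),   # backward
        StreamQA(question_time=15, cue_time=15, response_time=16),  # instant
    ]
    print(early_response_rate(items))
```

Under these assumptions, a model that answers every question immediately would never be early on Instant or Backward items but would be early on every Forward item, which is consistent with the asymmetry the abstract reports.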