Game-Time: Evaluating Temporal Dynamics in Spoken Language Models
This work addresses a critical gap in conversational AI by providing a systematic evaluation framework for temporal capabilities, though its contribution is incremental: it focuses on benchmarking rather than proposing new methods.
The paper evaluates the temporal dynamics of spoken language models (SLMs), such as timing and simultaneous speaking, by introducing the Game-Time Benchmark. It finds that while state-of-the-art models perform well on basic tasks, nearly all degrade substantially under temporal constraints.
Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity for temporal dynamics, including the ability to manage timing, tempo, and simultaneous speaking, remains a critical and unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction-following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally aware conversational AI. Demos and datasets are available on our project website: https://ga642381.github.io/Game-Time.
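
To make "tempo adherence" concrete, the sketch below shows one way such a temporal constraint could be checked. It is an illustration only, not the benchmark's actual metric: the function name `tempo_adherence`, the tolerance value, and the word-onset timestamps are all assumptions introduced here.

```python
from typing import List


def tempo_adherence(onsets: List[float], target_interval: float,
                    tolerance: float = 0.25) -> float:
    """Return the fraction of gaps between consecutive word onsets that fall
    within `tolerance` seconds of `target_interval`.

    Hypothetical scoring rule for illustration; not taken from the paper.
    """
    if len(onsets) < 2:
        # With fewer than two words there is no gap to measure.
        return 0.0
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    hits = sum(1 for gap in gaps if abs(gap - target_interval) <= tolerance)
    return hits / len(gaps)


# Made-up word onsets (in seconds) for a response that was asked to
# speak one word per second.
onsets = [0.0, 1.1, 1.9, 3.6, 4.5]
score = tempo_adherence(onsets, target_interval=1.0)
print(f"tempo adherence: {score:.2f}")
```

In practice the onsets would have to come from a forced aligner or a timestamped speech recognizer run on the model's audio output, and the tolerance would need to be calibrated against human speakers.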