MM · AI · May 8

MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

arXiv:2605.10966 · 20.5 · Has Code
Predicted impact: top 28% in MM (last 90 days) · Originality: Incremental advance
AI Analysis

For researchers developing terminal-based AI agents, this benchmark fills a gap: it evaluates agents on real-world workflows that operate directly on audio and video files.

The paper introduces MMTB, a benchmark of 105 tasks across 5 meta-categories for evaluating terminal agents on multimedia-file tasks, and proposes Terminus-MM, a harness that extends Terminus-KIRA with audio and video perception. The study shows how different forms of multimedia access affect task outcomes and which evidence agents rely on.

Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. Many real-world workflows, however, require practitioners to work directly with audio and video files. Working with such multimedia files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories in which terminal agents operate directly on audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and Terminus-MM support a controlled study of multimedia terminal agents, revealing how different forms of multimedia access shape task outcomes and determine which evidence agents rely on to construct executable terminal workflows. MMTB media and metadata are released at https://huggingface.co/datasets/mm-tbench/mmtb-media
