STEP: Structured Training and Evaluation Platform for benchmarking trajectory prediction models
This work addresses inconsistent benchmarking of trajectory prediction models, a problem that matters for safe autonomous driving. The contribution is incremental: it builds on existing efforts to improve evaluation frameworks.
The authors address the lack of standardized evaluation practices for trajectory prediction models in automated vehicles by introducing STEP, a benchmarking framework that unifies datasets and model interfaces. Their experiments reveal limitations in widely used testing procedures and vulnerabilities in state-of-the-art models.
While trajectory prediction plays a critical role in enabling safe and effective path planning in automated vehicles, standardized practices for evaluating such models remain underdeveloped. Recent efforts have aimed to unify dataset formats and model interfaces for easier comparisons, yet existing frameworks often fall short in supporting heterogeneous traffic scenarios and joint prediction models, or in providing user documentation. In this work, we introduce STEP, a new benchmarking framework that addresses these limitations by providing a unified interface for multiple datasets, enforcing consistent training and evaluation conditions, and supporting a wide range of prediction models. We demonstrate the capabilities of STEP in a number of experiments which reveal 1) the limitations of widely used testing procedures, 2) the importance of joint modeling of agents for better predictions of interactions, and 3) the vulnerability of current state-of-the-art models to both distribution shifts and targeted attacks by adversarial agents. With STEP, we aim to shift the focus from the ``leaderboard'' approach to deeper insights about model behavior and generalization in complex multi-agent settings.
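
To make the idea of a unified interface with consistent evaluation conditions concrete, the following is a minimal Python sketch. It is not STEP's actual API: the names `Scene`, `Predictor`, `ConstantVelocity`, and `average_displacement_error` are hypothetical and chosen only for illustration, and average displacement error is used here as an assumed example metric. The point is that every model sees the same scene format and is scored by the same function.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple

Point = Tuple[float, float]
Tracks = Dict[str, List[Point]]  # agent id -> sequence of (x, y) positions


@dataclass
class Scene:
    """Hypothetical common scene format shared by all datasets."""
    past: Tracks    # observed positions per agent
    future: Tracks  # ground-truth future positions per agent


class Predictor(Protocol):
    """Hypothetical common model interface: all agents are predicted in one call."""
    def predict(self, past: Tracks, horizon: int) -> Tracks: ...


class ConstantVelocity:
    """Simple baseline: extrapolate each agent's last observed step."""
    def predict(self, past: Tracks, horizon: int) -> Tracks:
        out: Tracks = {}
        for agent, track in past.items():
            (x0, y0), (x1, y1) = track[-2], track[-1]
            vx, vy = x1 - x0, y1 - y0
            out[agent] = [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]
        return out


def average_displacement_error(model: Predictor, scenes: List[Scene], horizon: int) -> float:
    """Mean Euclidean error over all scenes, agents, and predicted time steps."""
    total, count = 0.0, 0
    for scene in scenes:
        pred = model.predict(scene.past, horizon)
        for agent, truth in scene.future.items():
            for (px, py), (tx, ty) in zip(pred[agent], truth[:horizon]):
                total += math.hypot(px - tx, py - ty)
                count += 1
    return total / count
```

Under these assumptions, any model that implements `predict` can be passed to `average_displacement_error` unchanged, so differences in scores reflect the models rather than differences in preprocessing or metric code.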