Multimodal Motion Prediction with Stacked Transformers
This work addresses a safety-critical problem in autonomous driving, multimodal motion prediction, though the contribution appears incremental because it builds on existing transformer-based methods.
The paper tackles the challenge of predicting multiple plausible future trajectories for nearby vehicles in autonomous driving. It proposes a transformer framework called mmTransformer, which achieves state-of-the-art performance on the Argoverse dataset and substantially improves the diversity and accuracy of the predicted trajectories.
Predicting multiple plausible future trajectories of nearby vehicles is crucial for the safety of autonomous driving. Recent motion prediction approaches attempt to achieve such multimodal motion prediction by implicitly regularizing the feature or explicitly generating multiple candidate proposals. However, this remains challenging because the latent features may concentrate on the most frequent mode of the data, while the proposal-based methods depend largely on prior knowledge to generate and select the proposals. In this work, we propose a novel transformer framework for multimodal motion prediction, termed mmTransformer. A novel network architecture based on stacked transformers is designed to model the multimodality at the feature level with a set of fixed independent proposals. A region-based training strategy is then developed to induce the multimodality of the generated proposals. Experiments on the Argoverse dataset show that the proposed model achieves state-of-the-art performance on motion prediction, substantially improving the diversity and the accuracy of the predicted trajectories. Demo video and code are available at https://decisionforce.github.io/mmTransformer.
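The abstract does not spell out how the region-based training strategy works, so the following is only a minimal sketch of one plausible reading: each fixed proposal is tied to a spatial region, and only the proposals tied to the region containing the ground-truth endpoint contribute to the loss. The angular partition, the round-robin assignment of proposals to regions, and the function names `region_of` and `region_based_loss` are all assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math


def region_of(endpoint, num_regions):
    """Map a 2D endpoint to an angular sector index.

    ASSUMPTION: regions are equal angular sectors around the origin.
    The paper's actual partition may differ.
    """
    x, y = endpoint
    angle = math.atan2(y, x) % (2 * math.pi)  # wrap into [0, 2*pi)
    sector_width = 2 * math.pi / num_regions
    # Clamp guards against floating-point rounding at the upper boundary.
    return min(int(angle / sector_width), num_regions - 1)


def region_based_loss(proposal_endpoints, gt_endpoint, num_regions):
    """Mean endpoint error over proposals tied to the ground-truth region.

    ASSUMPTION: proposal i is tied to region i % num_regions.
    Proposals tied to other regions are ignored for this sample.
    """
    target = region_of(gt_endpoint, num_regions)
    selected = [
        p for i, p in enumerate(proposal_endpoints)
        if i % num_regions == target
    ]
    errors = [
        math.hypot(p[0] - gt_endpoint[0], p[1] - gt_endpoint[1])
        for p in selected
    ]
    return target, sum(errors) / len(errors)


# Six hypothetical proposal endpoints, four regions.
proposals = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0),
             (2.0, 0.5), (0.5, 2.0)]
region, loss = region_based_loss(proposals, gt_endpoint=(3.0, 1.0), num_regions=4)
print(region, round(loss, 3))
```

In this toy setup the ground-truth endpoint falls in the first sector, so only proposals 0 and 4 are penalized. Restricting the loss this way is one means of keeping different proposals specialized to different outcomes rather than all collapsing toward the most frequent mode.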