CV · AI · Mar 22

A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors

Gia-Bao Doan, Nam-Khoa Huynh, Minh-Nhat-Huy Ho, Khanh-Thanh-Khoa Nguyen, Thanh-Hai Le

arXiv:2603.21048 · 19.4 · h-index: 2

AI Analysis

This paper addresses the detection of hazardous driving behaviors for road safety applications such as traffic violation monitoring, though the contribution appears incremental, since it mainly combines existing components.

The paper tackles temporal action localization for distracted driver behaviors by proposing a two-stage transformer framework that combines VideoMAE feature extraction with an Augmented Self-Mask Attention detector and a Spatial Pyramid Pooling-Fast module, achieving up to 92.67% mAP. The backbones trade accuracy for compute: ViT-Giant reaches 88.09% Top-1 accuracy at 1584.06 GFLOPs/segment, while the lighter ViT-based variant reaches 82.55% at 101.85 GFLOPs/segment.
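To make the multi-scale idea behind the SPPF module concrete, here is a minimal pure-Python sketch of the pooling pattern. It assumes the module follows the commonly used SPPF design (three chained stride-1 max pools whose outputs are concatenated with the input), applied along the time axis. The paper's actual layer may differ, and the convolutions that normally surround the pooling are omitted. The function names `max_pool_1d` and `sppf_temporal` are hypothetical.

```python
def max_pool_1d(seq, kernel=5):
    """Stride-1 max pooling over time; windows are truncated at the edges."""
    half = kernel // 2
    n = len(seq)
    return [max(seq[max(0, t - half):min(n, t + half + 1)]) for t in range(n)]


def sppf_temporal(seq, kernel=5):
    """Assumed SPPF-style block on a 1D feature sequence (one scalar per step).

    Chaining three pools of size 5 gives effective windows of 5, 9 and 13
    time steps, so each output step sees the input at several temporal scales.
    """
    y1 = max_pool_1d(seq, kernel)   # effective window: 5 steps
    y2 = max_pool_1d(y1, kernel)    # effective window: 9 steps
    y3 = max_pool_1d(y2, kernel)    # effective window: 13 steps
    # Concatenate along the channel axis: 4 values per time step.
    return [list(channels) for channels in zip(seq, y1, y2, y3)]


# Toy input: a single spike of activity in the middle of 13 time steps.
features = [0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0]
multi_scale = sppf_temporal(features)
```

In this toy input, a time step far from the spike sees it only in the widest-window channel, which is the sense in which the block exposes short and long temporal context side by side.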

The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers stronger representations, with 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy at a significantly lower computational cost for fine-tuning (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
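The size of the capacity/efficiency trade-off can be read directly from the figures quoted in the abstract. The short calculation below is only an illustration of that arithmetic. The variable names are hypothetical and no numbers beyond those in the abstract are used.

```python
# Figures quoted in the abstract (per-segment compute and Top-1 test accuracy).
GIANT_GFLOPS = 1584.06   # ViT-Giant backbone
LIGHT_GFLOPS = 101.85    # lighter ViT-based variant
GIANT_TOP1 = 88.09       # percent
LIGHT_TOP1 = 82.55       # percent

# How many times more compute the Giant backbone needs per segment.
cost_ratio = GIANT_GFLOPS / LIGHT_GFLOPS

# Accuracy gained by paying that cost, in percentage points.
accuracy_gap = GIANT_TOP1 - LIGHT_TOP1

print(f"compute ratio: {cost_ratio:.2f}x")
print(f"accuracy gap:  {accuracy_gap:.2f} points")
```

By these figures, the Giant backbone costs roughly 15.6 times more compute per segment for a gain of about 5.5 percentage points in Top-1 accuracy.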
