CVHC | Nov 15, 2025

Cross-View Cross-Modal Unsupervised Domain Adaptation for Driver Monitoring System

arXiv:2511.12196v1 | h-index: 44
Originality: Incremental advance
AI Analysis

This work targets robust, scalable deployment of driver monitoring systems across diverse vehicle configurations. The advance is incremental because it combines existing techniques, namely contrastive learning and the information bottleneck.

The paper tackles driver distraction detection under cross-view and cross-modal domain shifts in real-time driver monitoring data. It reports an improvement of almost 50% in top-1 accuracy on RGB video data over a supervised contrastive learning-based cross-view method, and it outperforms unsupervised domain adaptation-only methods by up to 5%.

Driver distraction remains a leading cause of road traffic accidents, contributing to thousands of fatalities annually across the globe. While deep learning-based driver activity recognition methods have shown promise in detecting such distractions, their effectiveness in real-world deployments is hindered by two critical challenges: variations in camera viewpoint (cross-view) and domain shifts such as changes in sensor modality or environment. Existing methods typically address either cross-view generalization or unsupervised domain adaptation in isolation, leaving a gap in the robust and scalable deployment of models across diverse vehicle configurations. In this work, we propose a novel two-phase cross-view, cross-modal unsupervised domain adaptation framework that addresses these challenges jointly on real-time driver monitoring data. In the first phase, we learn view-invariant and action-discriminative features within a single modality using contrastive learning on multi-view data. In the second phase, we perform domain adaptation to a new modality using an information bottleneck loss, without requiring any labeled data from the new domain. We evaluate our approach using state-of-the-art video transformers (Video Swin, MViT) and the multimodal driver activity dataset Drive&Act, demonstrating that our joint framework improves top-1 accuracy on RGB video data by almost 50% compared to a supervised contrastive learning-based cross-view method, and outperforms unsupervised domain adaptation-only methods by up to 5%, using the same video transformer backbone.
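
To make the first phase more concrete, here is a minimal sketch of a supervised contrastive loss over multi-view clip embeddings. It is not the paper's implementation: the abstract does not give the exact loss, so the function name `multiview_supcon_loss`, the `temperature` value, and the choice to treat every clip with the same action label (from any camera view) as a positive are all assumptions made for illustration. The embeddings would come from a video backbone such as Video Swin or MViT; here they are plain NumPy arrays.

```python
import numpy as np


def multiview_supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of clip embeddings.

    Hypothetical helper, not taken from the paper.
    embeddings: (N, D) array, one row per clip (any camera view).
    labels:     (N,) action labels; clips sharing a label are positives,
                regardless of which view they were recorded from.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    n = z.shape[0]

    # L2-normalise so that the dot product is cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # An anchor is never compared with itself.
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable log-softmax over all other clips in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_denominator

    # Positives: same action label, different clip.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive partner in this batch

    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(per_anchor.mean())
```

With this formulation, the loss decreases as embeddings of the same action from different views move closer together and embeddings of different actions move apart, which is the view-invariant, action-discriminative property the first phase aims for. The second-phase information bottleneck loss is not sketched here because the abstract does not specify its form.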

