CV | Nov 17, 2025

View-aware Cross-modal Distillation for Multi-view Action Recognition

arXiv:2511.12870v1 | h-index: 7
Originality: Incremental advance
AI Analysis

The work addresses a real-world problem for systems with limited input modalities and annotations, but the advance is incremental because it builds on existing distillation techniques.

The paper tackles multi-view action recognition in partially overlapping settings, where actions are visible in only a subset of views. It proposes a framework that distills knowledge from a multi-modal teacher to a modality- and annotation-limited student. The student consistently outperforms competitive distillation methods and surpasses the teacher under limited conditions.

The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
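The View-aware Consistency idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the function names (`js_divergence`, `view_consistency_loss`), the choice of confidence weight (product of the two views' maximum class probabilities), and the weight-normalized average are all assumptions made for illustration. The sketch only shows the general shape of the technique: compute the Jensen-Shannon divergence between two views' predicted class distributions, and count it only for frames where a person is detected in both views.

```python
import math


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    # Mixture distribution m = (p + q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # KL(a || b); terms with a_i == 0 contribute nothing
        return sum(
            ai * math.log((ai + eps) / (bi + eps))
            for ai, bi in zip(a, b)
            if ai > 0
        )

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def view_consistency_loss(probs_a, probs_b, visible_a, visible_b):
    """Confidence-weighted JS divergence over co-visible frames.

    probs_a, probs_b: per-frame class distributions from two views.
    visible_a, visible_b: per-frame booleans, e.g. derived from
        human-detection masks (hypothetical input format).
    """
    total, weight_sum = 0.0, 0.0
    for p, q, va, vb in zip(probs_a, probs_b, visible_a, visible_b):
        if not (va and vb):
            continue  # skip frames where the action is not co-visible
        # Assumed weighting: product of each view's top class probability
        w = max(p) * max(q)
        total += w * js_divergence(p, q)
        weight_sum += w
    # Assumed normalization: weighted mean; 0.0 if no co-visible frames
    return total / weight_sum if weight_sum > 0 else 0.0
```

Under these assumptions, frames where both views agree contribute nothing to the loss, while confident disagreement on a co-visible frame is penalized most. Frames visible in only one view are ignored, which reflects the partially overlapping setting the abstract describes.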

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
