FedCVU: Federated Learning for Cross-View Video Understanding
FedCVU addresses privacy-preserving multi-camera video analysis for applications such as surveillance and autonomous systems, with incremental improvements in handling view heterogeneity.
The paper tackles the challenges of applying federated learning to cross-view video understanding, including non-IID client data, misaligned representations, and communication overhead. It proposes FedCVU, which improves unseen-view accuracy while maintaining seen-view performance and outperforms state-of-the-art FL baselines.
Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
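To make the interplay of VS-Norm and SLA more concrete, the following is a minimal sketch of one server aggregation round, not the authors' implementation. It assumes that "preserving normalization parameters" means each client keeps them local and excludes them from averaging, and that selective layer aggregation means only parameters under chosen name prefixes are communicated. The helper names (`is_norm_param`, `aggregate`, `apply_update`), the parameter naming convention, and the `backbone.` prefix are all hypothetical; the abstract does not specify how layers are selected.

```python
from typing import Dict, List

# A model state is sketched as a mapping from parameter name to a flat list of floats.
State = Dict[str, List[float]]


def is_norm_param(name: str) -> bool:
    # Assumption: normalization parameters can be recognized by name.
    return "norm" in name or "bn" in name


def aggregate(
    client_states: List[State],
    weights: List[float],
    shared_prefixes: List[str],
) -> State:
    """Weighted average over the communicated parameters only.

    Normalization parameters are skipped (assumed VS-Norm behaviour), and
    parameters outside `shared_prefixes` are skipped (assumed SLA behaviour).
    """
    total = sum(weights)
    names = [
        name
        for name in client_states[0]
        if not is_norm_param(name)
        and any(name.startswith(prefix) for prefix in shared_prefixes)
    ]
    update: State = {}
    for name in names:
        size = len(client_states[0][name])
        update[name] = [
            sum(w * state[name][i] for state, w in zip(client_states, weights)) / total
            for i in range(size)
        ]
    return update


def apply_update(local_state: State, update: State) -> State:
    # Overwrite only the aggregated entries; everything else stays client-specific.
    merged = dict(local_state)
    merged.update(update)
    return merged


# Two hypothetical camera clients, weighted by local sample count (1 and 3).
client_a = {"backbone.conv.weight": [1.0, 2.0], "backbone.bn.weight": [0.5], "head.fc.weight": [10.0]}
client_b = {"backbone.conv.weight": [3.0, 6.0], "backbone.bn.weight": [1.5], "head.fc.weight": [20.0]}

update = aggregate([client_a, client_b], [1.0, 3.0], ["backbone."])
merged_a = apply_update(client_a, update)
```

In this sketch only `backbone.conv.weight` is sent to and returned from the server, so the communicated payload shrinks with the number of excluded layers, while each client's normalization parameters and head remain tied to its own view.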