CV, Apr 2
Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion
Juncen Guo, Xiaoguang Zhu, Jingyi Wu et al.
Embodied perception systems that interact continuously in open physical spaces face severe distribution drift as their environments change. Existing domain-incremental perception methods, however, often rely on a domain ID known in advance at test time, which limits their practicality in unknown interaction scenarios. Models also tend to overfit to context-specific perceptual noise, which leads to poor generalization and catastrophic forgetting. To address these limitations, we propose a domain-ID-free and exemplar-free incremental learning framework for embodied multimedia systems that aims at robust continual adaptation to new environments. The method uses a disentangled representation mechanism to remove non-essential environmental style interference and to guide the model toward intrinsic semantic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use a weight fusion strategy to dynamically integrate knowledge of old and new environments in parameter space, so that the model adapts to the new distribution without storing historical data while retaining as much of its discriminative ability on old environments as possible. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a fully exemplar-free and domain-ID-free setting, and that it is more accurate than existing state-of-the-art methods.
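As a rough illustration of what integrating old and new knowledge "in the parameter space" can mean, the sketch below blends two parameter sets with a fixed convex combination. The abstract does not state the actual fusion rule, so the interpolation formula, the function name `fuse_weights`, and the scalar `alpha` are all assumptions made for illustration, not the authors' method.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def fuse_weights(old: Params, new: Params, alpha: float) -> Params:
    """Return alpha * old + (1 - alpha) * new for every parameter.

    alpha = 1.0 keeps the old model unchanged; alpha = 0.0 keeps only the
    newly trained model. This fixed interpolation is an assumption; the
    paper describes its fusion only as "dynamic".
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if old.keys() != new.keys():
        raise ValueError("both models must have the same parameter names")

    fused: Params = {}
    for name in old:
        if len(old[name]) != len(new[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        fused[name] = [
            alpha * o + (1.0 - alpha) * n
            for o, n in zip(old[name], new[name])
        ]
    return fused
```

Because only the two parameter sets are needed, a scheme of this kind requires no stored samples from earlier environments, which is the property the abstract emphasizes.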
CV, Nov 21, 2024
Privacy-Preserving Video Anomaly Detection: A Survey
Yang Liu, Siao Liu, Xiaoguang Zhu et al.
Video Anomaly Detection (VAD) aims to automatically analyze spatiotemporal patterns in surveillance videos collected from open spaces to detect anomalous events that may cause harm, such as fighting, stealing, and car accidents. However, vision-based surveillance systems such as closed-circuit television often capture personally identifiable information. The lack of transparency and interpretability in video transmission and usage raises public concerns about privacy and ethics, limiting the real-world application of VAD. Recently, researchers have addressed privacy concerns in VAD through systematic studies from the perspectives of data, features, and systems, making Privacy-Preserving Video Anomaly Detection (P2VAD) a hotspot in the AI community. However, current research in P2VAD is fragmented, and prior reviews have mostly focused on methods using RGB sequences, overlooking privacy leakage and appearance bias. To address this gap, this article is the first to systematically review the progress of P2VAD, defining its scope and providing an intuitive taxonomy. We outline the basic assumptions, learning frameworks, and optimization objectives of various approaches, analyzing their strengths, weaknesses, and potential correlations. Additionally, we provide open access to research resources such as benchmark datasets and available code. Finally, we discuss key challenges and future opportunities from the perspectives of AI development and P2VAD deployment, aiming to guide future work in the field.
CV, Mar 25, 2025
Adaptive Weighted Parameter Fusion with CLIP for Class-Incremental Learning
Juncen Guo, Xiaoguang Zhu, Liangyu Teng et al.
Class-Incremental Learning (CIL) enables a model to incrementally absorb knowledge from new classes and build a generic classifier across all previously encountered classes. When the model is optimized on new classes, knowledge of previous classes is inevitably erased, leading to catastrophic forgetting. Addressing this challenge requires a trade-off between retaining old knowledge and accommodating new information. However, this balancing process often sacrifices some information, which can partially degrade the model's ability to discriminate between classes. To tackle this issue, we design an adaptive weighted parameter fusion method with Contrastive Language-Image Pre-training (CLIP), which not only takes into account the variability of the data distributions of different tasks, but also retains as much of the effective information in the parameter matrix as possible. In addition, we introduce a balance factor that trades off data distribution alignment against distinguishability between adjacent tasks. Experimental results on several traditional benchmarks validate the superiority of the proposed method.
CV, Mar 24, 2025
CalFuse: Multi-Modal Continual Learning via Feature Calibration and Parameter Fusion
Juncen Guo, Siao Liu, Xiaoguang Zhu et al.
With the proliferation of multi-modal data in large-scale visual recognition systems, enabling models to continuously acquire knowledge from evolving data streams while preserving prior information has become increasingly critical. Class-Continual Learning (CCL) addresses this challenge by incrementally incorporating new class knowledge without revisiting historical data, making it essential for real-world big data applications. While traditional CCL methods rely solely on visual features, recent advances in Vision-Language Models (VLMs) such as CLIP demonstrate significant potential for CCL by leveraging pre-trained multi-modal knowledge. However, existing approaches face challenges in mitigating catastrophic forgetting while maintaining the cross-modal generalization capabilities of VLMs. To address these limitations, we propose CalFuse, a framework that synergizes feature Calibration with parameter Fusion to enable effective multi-modal knowledge integration in continual learning scenarios. CalFuse introduces a dynamic feature calibration mechanism that adaptively balances original CLIP visual representations with task-specific features, preserving the model's intrinsic cross-modal generalization while adapting to new classes. Concurrently, a QR decomposition-based parameter fusion strategy progressively integrates newly acquired knowledge with historical task parameters, maintaining equilibrium between learning new class representations and retaining prior knowledge across sequential tasks. Extensive experiments on benchmark datasets validate the effectiveness of our approach in large-scale multi-modal continual learning settings, demonstrating superior performance over state-of-the-art methods in both average accuracy and final task retention.
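The feature calibration idea, balancing original CLIP visual representations against task-specific features, can be pictured with a small sketch. The abstract gives no formula, so everything here is an assumption for illustration: the function name `calibrate`, the fixed scalar `gate` (the paper describes an adaptive mechanism), and the final L2 normalization, which is included because CLIP-style embeddings are typically compared by cosine similarity.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


def calibrate(clip_feat: List[float], task_feat: List[float], gate: float) -> List[float]:
    """Blend a frozen pre-trained feature with a task-specific one.

    gate = 0.0 returns the (normalized) original feature, preserving the
    pre-trained representation; gate = 1.0 returns the task-specific one.
    A fixed scalar gate is a simplifying assumption.
    """
    if len(clip_feat) != len(task_feat):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")

    mixed = [(1.0 - gate) * c + gate * t for c, t in zip(clip_feat, task_feat)]
    return l2_normalize(mixed)
```

A small gate keeps the output close to the pre-trained embedding, which is one simple way to protect cross-modal generalization while still letting new-task information through.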
NI, Feb 10, 2025
A Survey on Video Analytics in Cloud-Edge-Terminal Collaborative Systems
Linxiao Gong, Hao Yang, Gaoyun Fang et al.
The explosive growth of video data has driven the development of distributed video analytics in cloud-edge-terminal collaborative (CETC) systems, enabling efficient video processing, real-time inference, and privacy-preserving analysis. Among multiple advantages, CETC systems can distribute video processing tasks and enable adaptive analytics across cloud, edge, and terminal devices, leading to breakthroughs in video surveillance, autonomous driving, and smart cities. In this survey, we first analyze fundamental architectural components, including hierarchical, distributed, and hybrid frameworks, alongside edge computing platforms and resource management mechanisms. Building upon these foundations, edge-centric approaches emphasize on-device processing, edge-assisted offloading, and edge intelligence, while cloud-centric methods leverage powerful computational capabilities for complex video understanding and model training. Our investigation also covers hybrid video analytics incorporating adaptive task offloading and resource-aware scheduling techniques that optimize performance across the entire system. Beyond conventional approaches, recent advances in large language models and multimodal integration reveal both opportunities and challenges in platform scalability, data protection, and system reliability. Future directions also encompass explainable systems, efficient processing mechanisms, and advanced video analytics, offering valuable insights for researchers and practitioners in this dynamic field.
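To make "adaptive task offloading" concrete, the following sketch picks the tier (terminal, edge, or cloud) with the lowest estimated end-to-end latency for a single video analytics task. It is a toy cost model, not one taken from the survey: the `Tier` fields, the latency formula (compute time plus transfer time plus round-trip time), and all numbers in the usage note are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Tier:
    """Hypothetical description of one execution tier."""
    name: str
    compute_gflops: float  # sustained throughput in GFLOP/s
    uplink_mbps: float     # bandwidth from the camera to this tier; math.inf if local
    rtt_ms: float          # network round-trip time to this tier


def estimated_latency_ms(tier: Tier, task_gflop: float, payload_mbit: float) -> float:
    """Compute time + time to ship the input + round-trip time, in milliseconds."""
    compute_ms = task_gflop / tier.compute_gflops * 1000.0
    transfer_ms = payload_mbit / tier.uplink_mbps * 1000.0  # 0.0 when uplink is inf
    return compute_ms + transfer_ms + tier.rtt_ms


def choose_tier(tiers: Iterable[Tier], task_gflop: float, payload_mbit: float) -> Tier:
    """Return the tier with the smallest estimated latency for this task."""
    return min(tiers, key=lambda t: estimated_latency_ms(t, task_gflop, payload_mbit))


# Illustrative tiers; the figures are made up for the example.
TIERS = [
    Tier("terminal", compute_gflops=5.0, uplink_mbps=math.inf, rtt_ms=0.0),
    Tier("edge", compute_gflops=100.0, uplink_mbps=100.0, rtt_ms=5.0),
    Tier("cloud", compute_gflops=1000.0, uplink_mbps=20.0, rtt_ms=40.0),
]
```

With these made-up numbers, `choose_tier(TIERS, task_gflop=10.0, payload_mbit=8.0)` selects the edge tier, because the terminal is too slow to compute and the cloud is too slow to reach, while a light task such as `choose_tier(TIERS, task_gflop=0.1, payload_mbit=8.0)` stays on the terminal. A real scheduler would also weigh energy, queueing, accuracy, and privacy constraints, which this sketch ignores.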