CVJun 17, 2023
Multi-scale Spatial-temporal Interaction Network for Video Anomaly DetectionZhiyuan Ning, Zhangxun Li, Zhengliang Guo et al.
Video Anomaly Detection (VAD) is an essential yet challenging task in signal processing. Since certain anomalies cannot be detected by isolated analysis of either temporal or spatial information, the interaction between these two types of data is considered crucial for VAD. However, current dual-stream architectures either confine this integral interaction to the bottleneck of the autoencoder or introduce anomaly-irrelevant background pixels into the interactive process, hindering the accuracy of VAD. To address these deficiencies, we propose a Multi-scale Spatial-Temporal Interaction Network (MSTI-Net) for VAD. First, to prioritize the detection of moving objects in the scene and harmonize the substantial semantic discrepancies between the two types of data, we propose an Attention-based Spatial-Temporal Fusion Module (ASTFM) as a substitute for the conventional direct fusion. Furthermore, we inject multi-ASTFM-based connections that bridge the appearance and motion streams of the dual-stream network, thus fostering multi-scale spatial-temporal interaction. Finally, to bolster the delineation between normal and abnormal activities, our system records the regular information in a memory module. Experimental results on three benchmark datasets validate the effectiveness of our approach, which achieves AUCs of 96.8%, 87.6%, and 73.9% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.
42.4CVMar 19
Diffusion-Guided Semantic Consistency for Multimodal HeterogeneityJing Liu, Zhengliang Guo, Yan Wang et al.
Federated learning (FL) is severely challenged by non-independent and identically distributed (non-IID) client data, a problem that degrades global model performance, especially in multimodal perception settings. Conventional methods often fail to address the underlying semantic discrepancies between clients, leading to suboptimal performance for multimedia systems requiring robust perception. To overcome this, we introduce SemanticFL, a novel framework that leverages the rich semantic representations of pre-trained diffusion models to provide privacy-preserving guidance for local training. Our approach leverages multi-layer semantic representations from a pre-trained Stable Diffusion model (including VAE-encoded latents and U-Net hierarchical features) to create a shared latent space that aligns heterogeneous clients, facilitated by an efficient client-server architecture that offloads heavy computation to the server. A unified consistency mechanism, employing cross-modal contrastive learning, further stabilizes convergence. We conduct extensive experiments on benchmarks including CIFAR-10, CIFAR-100, and TinyImageNet under diverse heterogeneity scenarios. Our results demonstrate that SemanticFL surpasses existing federated learning approaches, achieving accuracy gains of up to 5.49% over FedAvg, validating its effectiveness in learning robust representations for heterogeneous and multimodal data for perception tasks.
62.4CVMar 23
Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly DetectionYang Liu, Boan Chen, Yuanyuan Meng et al.
As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, the need for privacy-preserving approaches to understand human activities in physical environments has become paramount. Video anomaly detection is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner, failing to capture the hierarchical nature of human activities composed of discrete semantic primitives and fine-grained kinematic details, which leads to reduced discriminability when anomalies manifest at different abstraction levels. In this regard, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow) that decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs vector quantized variational auto-encoder to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations. Extensive experiments on benchmarks (HR-ShanghaiTech & HR-UBnormal) demonstrate that MSG-Flow achieves state-of-the-art performance with 88.1% and 75.8% AUC respectively.