Zhaofeng Shi

CV
h-index: 16
4 papers
18 citations
Novelty: 36%
AI Score: 43

4 Papers

76.2 | CV | Mar 10 | Code
Test-time Ego-Exo-centric Adaptation for Action Anticipation via Multi-Label Prototype Growing and Dual-Clue Consistency

Zhaofeng Shi, Heqian Qiu, Lanxiao Wang et al.

Efficient adaptation between Egocentric (Ego) and Exocentric (Exo) views is crucial for applications such as human-robot cooperation. However, most existing Ego-Exo adaptation methods rely heavily on target-view data for training, which increases computational and data collection costs. In this paper, we make the first exploration of the Test-time Ego-Exo Adaptation for Action Anticipation (TE$^{2}$A$^{3}$) task, which aims to adjust a source-view-trained model online during test time to anticipate target-view actions. Existing Test-Time Adaptation (TTA) methods struggle with this task because of the multiple action candidates and the significant temporal-spatial inter-view gap. Hence, we propose a novel Dual-Clue enhanced Prototype Growing Network (DCPGN), which accumulates multi-label knowledge and integrates cross-modality clues for effective test-time Ego-Exo adaptation and action anticipation. Specifically, we propose a Multi-Label Prototype Growing Module (ML-PGM) to balance multiple positive classes via multi-label assignment and confidence-based reweighting for class-wise memory banks, which are updated by an entropy priority queue strategy. Then, the Dual-Clue Consistency Module (DCCM) introduces a lightweight narrator to generate textual clues indicating action progressions, which complement the visual clues containing various objects. Moreover, we constrain the inferred textual and visual logits to construct dual-clue consistency for temporally and spatially bridging Ego and Exo views. Extensive experiments on the newly proposed EgoMe-anti and the existing EgoExoLearn benchmarks show the effectiveness of our method, which outperforms related state-of-the-art methods by a large margin. Code is available at https://github.com/ZhaofengSHI/DCPGN.
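
The abstract describes class-wise memory banks updated by an entropy priority queue. The following is a minimal Python sketch of that general idea, not the authors' implementation. The class name `ClassWiseMemoryBank`, the `capacity` and `threshold` parameters, the use of binary entropy per class, and the unweighted mean prototype are all assumptions made for illustration. The confidence-based reweighting mentioned in the abstract is omitted.

```python
import heapq
import math
from collections import defaultdict


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a single per-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


class ClassWiseMemoryBank:
    """Hypothetical bank keeping the `capacity` lowest-entropy features per class."""

    def __init__(self, capacity: int = 3, threshold: float = 0.5):
        self.capacity = capacity
        self.threshold = threshold  # assumed multi-label assignment rule
        # class id -> heap of (-entropy, insertion index, feature)
        self.banks = defaultdict(list)
        self._counter = 0  # tie-breaker so feature lists are never compared

    def update(self, feature, probs):
        """Assign the feature to every class whose probability passes the threshold."""
        for cls, p in enumerate(probs):
            if p < self.threshold:
                continue
            item = (-binary_entropy(p), self._counter, feature)
            self._counter += 1
            heap = self.banks[cls]
            if len(heap) < self.capacity:
                heapq.heappush(heap, item)
            elif item[0] > heap[0][0]:
                # The heap root holds the highest-entropy entry; evict it
                # when the new sample is more confident (lower entropy).
                heapq.heapreplace(heap, item)

    def prototype(self, cls):
        """Unweighted mean of the stored features, or None if the bank is empty."""
        heap = self.banks.get(cls)
        if not heap:
            return None
        dim = len(heap[0][2])
        return [sum(entry[2][i] for entry in heap) / len(heap) for i in range(dim)]


bank = ClassWiseMemoryBank(capacity=2, threshold=0.5)
bank.update([1.0, 0.0], [0.9, 0.2])
bank.update([2.0, 4.0], [0.6, 0.7])
bank.update([5.0, 0.0], [0.99, 0.1])  # evicts the p=0.6 entry from class 0
print(bank.prototype(0), bank.prototype(1))
```

Storing negated entropy in Python's min-heap makes the least confident entry the cheapest one to find and evict, so each update costs O(log capacity) per assigned class.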

36.8 | LG | Mar 31
Causality-inspired Federated Learning for Dynamic Spatio-Temporal Graphs

Yuxuan Liu, Wenchao Xu, Haozhao Wang et al.

Federated Graph Learning (FGL) has emerged as a powerful paradigm for decentralized training of graph neural networks while preserving data privacy. However, existing FGL methods are predominantly designed for static graphs and rely on parameter averaging or distribution alignment, which implicitly assume that all features are equally transferable across clients, overlooking both the spatial and temporal heterogeneity and the presence of client-specific knowledge in real-world graphs. In this work, we identify that such assumptions create a vicious cycle of spurious representation entanglement, client-specific interference, and negative transfer, degrading generalization performance in Federated Learning over Dynamic Spatio-Temporal Graphs (FSTG). To address this issue, we propose a novel causality-inspired framework named SC-FSGL, which explicitly decouples transferable causal knowledge from client-specific noise through representation-level interventions. Specifically, we introduce a Conditional Separation Module that simulates soft interventions through client-conditioned masks, enabling the disentanglement of invariant spatio-temporal causal factors from spurious signals and mitigating representation entanglement caused by client heterogeneity. In addition, we propose a Causal Codebook that clusters causal prototypes and aligns local representations via contrastive learning, promoting cross-client consistency and facilitating knowledge sharing across diverse spatio-temporal patterns. Experiments on five diverse, heterogeneous Spatio-Temporal Graph (STG) datasets show that SC-FSGL outperforms state-of-the-art methods.
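
The client-conditioned mask in the abstract can be pictured as a soft gate that splits a representation into a shared part and a client-specific part. The sketch below is one possible reading, not the SC-FSGL implementation. The function `conditional_separation` and its `client_logits` argument are hypothetical names, and the sigmoid gate is an assumed choice.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def conditional_separation(representation, client_logits):
    """Split a representation with a soft, client-conditioned mask (illustrative only).

    `client_logits` stands in for whatever a client-conditioned network
    would output; here it is simply passed in as one logit per feature.
    """
    mask = [sigmoid(z) for z in client_logits]
    shared = [h * m for h, m in zip(representation, mask)]
    client_specific = [h * (1.0 - m) for h, m in zip(representation, mask)]
    return shared, client_specific


shared, client_specific = conditional_separation([2.0, 4.0], [0.0, 0.0])
print(shared, client_specific)
```

Because the two parts use complementary weights, they always sum back to the original representation, so the gate redistributes information rather than discarding it.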

CV | Jan 31, 2025 | Code
EgoMe: A New Dataset and Challenge for Following Me via Egocentric View in Real World

Heqian Qiu, Zhaofeng Shi, Lanxiao Wang et al.

In human imitation learning, the imitator typically takes the egocentric view as a benchmark, naturally transferring behaviors observed from an exocentric view to their own, which provides inspiration for researching how robots can more effectively imitate human behavior. However, current research primarily focuses on the basic alignment issues of ego-exo data from different cameras, rather than collecting data from the imitator's perspective, which is inconsistent with the high-level cognitive process. To advance this research, we introduce a novel large-scale egocentric dataset, called EgoMe, which aims to follow the process of human imitation learning via the imitator's egocentric view in the real world. Our dataset includes 7902 exo-ego video pairs (totaling 15804 videos) spanning diverse daily behaviors in various real-world scenarios. For each video pair, one video captures an exocentric view of the imitator observing the demonstrator's actions, while the other captures an egocentric view of the imitator subsequently following those actions. Notably, EgoMe uniquely incorporates exo-ego eye gaze, additional multi-modal IMU sensor data, and annotations at different levels to help establish correlations between the observing and imitating processes. We further provide a suite of challenging benchmarks for fully leveraging this data resource and promoting robot imitation learning research. Extensive analysis demonstrates significant advantages over existing datasets. Our EgoMe dataset and benchmarks are available at https://huggingface.co/datasets/HeqianQiu/EgoMe.

AS | Aug 1, 2021
A Survey on Audio Synthesis and Audio-Visual Multimodal Processing

Zhaofeng Shi

With the development of deep learning and artificial intelligence, audio synthesis plays a pivotal role in machine learning and shows strong applicability in industry. Meanwhile, researchers have dedicated significant effort to multimodal tasks such as audio-visual multimodal processing. In this paper, we conduct a survey on audio synthesis and audio-visual multimodal processing, which helps in understanding current research and future trends. This review focuses on text-to-speech (TTS), music generation, and some tasks that combine visual and acoustic information. The corresponding technical methods are comprehensively classified and introduced, and their future development trends are discussed. This survey can provide some guidance for researchers who are interested in areas such as audio synthesis and audio-visual multimodal processing.