Runzhong Zhang

CV
h-index49
3papers
18citations
Novelty48%
AI Score37

3 Papers

CVApr 29
HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

Runzhong Zhang, Suchen Wang, Yueqi Duan et al.

In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network to predict the action of each frame with the neighboring frames. However, this would result in ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we aim to exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information to distinguish ambiguous actions, where our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which automatically adjusts the parameters based on the HOI information of various videos on the fly. Extensive experiments on two widely-used datasets including Breakfast and 50Salads demonstrate the effectiveness of our method under different evaluation metrics.

CVFeb 6, 2025
Adaptive Margin Contrastive Learning for Ambiguity-aware 3D Semantic Segmentation

Yang Chen, Yueqi Duan, Runzhong Zhang et al.

In this paper, we propose an adaptive margin contrastive learning method for 3D point cloud semantic segmentation, namely AMContrast3D. Most existing methods use equally penalized objectives, which ignore per-point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we design adaptive objectives for individual points based on their ambiguity levels, aiming to ensure the correctness of low-ambiguity points while allowing mistakes for high-ambiguity points. Specifically, we first estimate ambiguities based on position embeddings. Then, we develop a margin generator to shift decision boundaries for contrastive feature embeddings, so margins are narrowed due to increasing ambiguities with even negative margins for extremely high-ambiguity points. Experimental results on large-scale datasets, S3DIS and ScanNet, demonstrate that our method outperforms state-of-the-art methods.

CVApr 21, 2024
Video sentence grounding with temporally global textual knowledge

Cai Chen, Runzhong Zhang, Jianjun Gao et al.

Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.