CV | AI | Sep 25, 2025

Temporal vs. Spatial: Comparing DINOv3 and V-JEPA2 Feature Representations for Video Action Analysis

arXiv:2509.21595v1 | 2 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

The paper gives researchers and practitioners empirical guidance for choosing a feature extraction method for video analysis.

This study compared DINOv3 and V-JEPA2 for video action recognition on the UCF Sports dataset, finding that DINOv3 achieved better clustering (Silhouette score: 0.31 vs 0.21) and discrimination (6.16x separation ratio) for pose-based actions, while V-JEPA2 offered more consistent reliability across all actions with lower variance (0.094 vs 0.288).

This study presents a comprehensive comparative analysis of two prominent self-supervised learning architectures for video action recognition: DINOv3, which processes frames independently through spatial feature extraction, and V-JEPA2, which employs joint temporal modeling across video sequences. We evaluate both approaches on the UCF Sports dataset, examining feature quality through multiple dimensions including classification accuracy, clustering performance, intra-class consistency, and inter-class discrimination. Our analysis reveals fundamental architectural trade-offs: DINOv3 achieves superior clustering performance (Silhouette score: 0.31 vs 0.21) and demonstrates exceptional discrimination capability (6.16x separation ratio) particularly for pose-identifiable actions, while V-JEPA2 exhibits consistent reliability across all action types with significantly lower performance variance (0.094 vs 0.288). Through action-specific evaluation, we identify that DINOv3's spatial processing architecture excels at static pose recognition but shows degraded performance on motion-dependent actions, whereas V-JEPA2's temporal modeling provides balanced representation quality across diverse action categories. These findings contribute to the understanding of architectural design choices in video analysis systems and provide empirical guidance for selecting appropriate feature extraction methods based on task requirements and reliability constraints.
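The two headline metrics in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code. The silhouette coefficient follows its standard definition. The abstract does not define the separation ratio, so the version below (mean distance between class centroids divided by mean distance of samples to their own centroid) is an assumption. The function names and the synthetic features are hypothetical stand-ins for real DINOv3 or V-JEPA2 embeddings.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def silhouette(features, labels):
    """Mean silhouette coefficient over all samples (standard definition)."""
    d = pairwise_dist(features, features)
    classes = np.unique(labels)
    scores = np.zeros(len(features))
    for i in range(len(features)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton class: silhouette is defined as 0
        # Mean distance to other members of the same class (self-distance is 0).
        a = d[i, same].sum() / (n_same - 1)
        # Mean distance to the nearest other class.
        b = min(d[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()


def separation_ratio(features, labels):
    """ASSUMED definition: mean inter-centroid distance / mean intra-class spread."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    intra = np.mean([
        np.linalg.norm(features[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(classes)
    ])
    centroid_d = pairwise_dist(centroids, centroids)
    inter = centroid_d[np.triu_indices(len(classes), k=1)].mean()
    return inter / intra


if __name__ == "__main__":
    # Synthetic stand-in for per-video embeddings: 3 classes, 20 videos each.
    rng = np.random.default_rng(0)
    feats = np.concatenate([rng.normal(loc=3.0 * k, size=(20, 8)) for k in range(3)])
    labs = np.repeat(np.arange(3), 20)
    print("silhouette:", silhouette(feats, labs))
    print("separation ratio:", separation_ratio(feats, labs))
```

Under this assumed definition, a higher separation ratio means class centroids sit far apart relative to the spread within each class, which matches how the abstract uses the number to describe discrimination capability.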

