CV | Apr 14

STGV: Spatio-Temporal Hash Encoding for Gaussian-based Video Representation

arXiv:2604.10910 | 43.7 | h-index: 15
AI Analysis

This work improves video representation quality for computer vision applications by addressing the entanglement of static and dynamic features in Gaussian splatting models.

STGV is a spatio-temporal hash encoding framework for Gaussian-based video representation that separates static and dynamic components. It reports a +0.98 PSNR improvement over prior Gaussian-based methods and competitive performance on downstream video tasks.
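To put the reported PSNR gain in context, the sketch below computes PSNR for a single 8-bit frame using the standard definition, 10 * log10(MAX^2 / MSE). The function name `psnr` and the `max_value` parameter are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol (for example, how per-frame values are averaged) is not stated here.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.98 dB on a single frame corresponds to an MSE that is roughly 20% lower (a factor of 10^(-0.098)).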

2D Gaussian Splatting (2DGS) has recently become a promising paradigm for high-quality video representation. However, existing methods predict the deformations of canonical Gaussian primitives from embeddings that are either content-agnostic or overlap spatial and temporal features, which entangles the static and dynamic components of a video and prevents their distinct properties from being modeled effectively. This results in inaccurate spatio-temporal deformation predictions and unsatisfactory representation quality. To address these problems, this paper proposes a Spatio-Temporal hash encoding framework for Gaussian-based Video representation (STGV). By decomposing video features into learnable 2D spatial and 3D temporal hash encodings, STGV facilitates the learning of motion patterns for dynamic components while preserving background details for static elements. In addition, we construct a more stable and consistent initial canonical Gaussian representation through a key-frame canonical initialization strategy, which prevents feature overlap and a structurally incoherent geometry representation. Experimental results demonstrate that our method attains better video representation quality (+0.98 PSNR) than other Gaussian-based methods and achieves competitive performance in downstream video tasks.
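The decomposition into a 2D spatial and a 3D temporal hash encoding can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: the Instant-NGP-style XOR-of-primes hash, the single resolution level, the nearest-vertex lookup without interpolation, the reading of "3D temporal" as a grid over (x, y, t), and all names (`hash_lookup`, `encode`, `res_xy`, `res_t`). In a real model the tables would be learnable parameters.

```python
import numpy as np

# Multipliers for a spatial hash (an Instant-NGP-style choice, assumed here).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_lookup(table, grid_coords):
    """Fetch one feature row per point by hashing integer grid coordinates."""
    coords = grid_coords.astype(np.uint64)
    h = np.zeros(coords.shape[0], dtype=np.uint64)
    for d in range(coords.shape[1]):
        h ^= coords[:, d] * PRIMES[d]  # uint64 arithmetic wraps on overflow
    idx = (h % np.uint64(table.shape[0])).astype(np.int64)
    return table[idx]

def encode(xy, t, spatial_table, temporal_table, res_xy=64, res_t=16):
    """Concatenate a time-independent and a time-dependent feature per point.

    xy: (N, 2) positions in [0, 1); t: (N,) timestamps in [0, 1).
    """
    gxy = np.floor(xy * res_xy).astype(np.int64)        # (N, 2) spatial cell
    gt = np.floor(t[:, None] * res_t).astype(np.int64)  # (N, 1) temporal cell
    static_feat = hash_lookup(spatial_table, gxy)       # depends on (x, y) only
    dynamic_feat = hash_lookup(temporal_table, np.concatenate([gxy, gt], axis=1))
    return np.concatenate([static_feat, dynamic_feat], axis=1)

rng = np.random.default_rng(0)
spatial_table = rng.normal(size=(2 ** 10, 4)).astype(np.float32)
temporal_table = rng.normal(size=(2 ** 12, 4)).astype(np.float32)
xy = rng.random((5, 2))
features = encode(xy, np.full(5, 0.5), spatial_table, temporal_table)
print(features.shape)
```

In this sketch the first four feature channels depend only on position, so they stay the same for a given point at every timestamp, while the last four can change with time. That is the sense in which static and dynamic content are kept in separate encodings.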
