Learning Transformation-Isomorphic Latent Space for Accurate Hand Pose Estimation
This work addresses accuracy issues in hand pose estimation for computer vision applications, representing an incremental improvement over existing methods.
The paper tackles hand pose estimation by proposing TI-Net, a network backbone that constructs a transformation-isomorphic latent space to capture compact, low-level features. On the DexYCB dataset, it reports a 10% improvement in PA-MPJPE over specialized state-of-the-art methods.
Vision-based regression tasks, such as hand pose estimation, have achieved higher accuracy and faster convergence through representation learning. However, existing representation learning methods often encounter two issues: (1) the features extracted from images sit at a high semantic level, which is inadequate for regressing low-level information, and (2) the extracted features include task-irrelevant information, which reduces their compactness and interferes with regression. To address these challenges, we propose TI-Net, a highly versatile visual Network backbone designed to construct a Transformation Isomorphic latent space. Specifically, we employ linear transformations to model geometric transformations in the latent space and ensure that TI-Net aligns them with those in the image space. As a result, the latent features capture compact, low-level information beneficial for pose estimation tasks. We evaluated TI-Net on the hand pose estimation task to demonstrate the network's superiority. On the DexYCB dataset, TI-Net achieved a 10% improvement in the PA-MPJPE metric compared to specialized state-of-the-art (SOTA) hand pose estimation methods. Our code will be released in the future.
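The alignment between image-space and latent-space transformations can be illustrated with a toy consistency check. The sketch below is not the TI-Net architecture or its training loss, neither of which the abstract specifies. It assumes a hand-written stand-in encoder (`centroid_encoder`, a hypothetical name) that maps an image to its intensity-weighted centroid, and a squared-error penalty (`isomorphism_loss`, also hypothetical) that compares encoding a transformed image with applying a linear map to the encoding of the original image.

```python
import numpy as np


def centroid_encoder(image):
    """Toy stand-in for an encoder (assumption, not TI-Net): returns the
    intensity-weighted centroid of a square image in centered (x, y) coordinates."""
    n = image.shape[0]
    center = (n - 1) / 2.0
    rows, cols = np.mgrid[0:n, 0:n]
    weights = image / image.sum()
    x = np.sum(weights * (cols - center))
    y = np.sum(weights * (rows - center))
    return np.array([x, y])


# Linear map in the 2-D latent space paired with np.rot90 (one quarter turn)
# in image space. With the coordinates above, a pixel at (x, y) moves to (y, -x).
LATENT_ROT90 = np.array([[0.0, 1.0],
                         [-1.0, 0.0]])


def isomorphism_loss(encoder, image, image_transform, latent_matrix):
    """Hypothetical consistency penalty: squared distance between
    encoder(T(image)) and latent_matrix @ encoder(image)."""
    z = encoder(image)
    z_transformed = encoder(image_transform(image))
    return float(np.sum((z_transformed - latent_matrix @ z) ** 2))


rng = np.random.default_rng(0)
image = rng.random((8, 8))

# Matching latent map: the two paths agree up to floating-point error.
loss = isomorphism_loss(centroid_encoder, image, np.rot90, LATENT_ROT90)

# Mismatched latent map (identity): the two paths disagree.
mismatch = isomorphism_loss(centroid_encoder, image, np.rot90, np.eye(2))

print(loss, mismatch)
```

For this hand-built encoder the penalty is numerically zero when the latent map matches the image rotation and strictly positive when it does not. In a learned setting, one would presumably minimize a penalty of this kind during training so that the latent space acquires the same structure, but the exact formulation used by TI-Net is not given in the abstract.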