Kai Zhai

h-index15
2papers

2 Papers

CVFeb 28, 2023
HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation

Kai Zhai, Qiang Nie, Bo Ouyang et al. · pku

2D-to-3D human pose lifting is fundamental for 3D human pose estimation (HPE), for which graph convolutional networks (GCNs) have proven inherently suitable for modeling the human skeletal topology. However, the current GCN-based 3D HPE methods update the node features by aggregating their neighbors' information without considering the interaction of joints in different joint synergies. Although some studies have proposed importing limb information to learn the movement patterns, the latent synergies among joints, such as maintaining balance are seldom investigated. We propose the Hop-wise GraphFormer with Intragroup Joint Refinement (HopFIR) architecture to tackle the 3D HPE problem. HopFIR mainly consists of a novel hop-wise GraphFormer (HGF) module and an intragroup joint refinement (IJR) module. The HGF module groups the joints by k-hop neighbors and applies a hopwise transformer-like attention mechanism to these groups to discover latent joint synergies. The IJR module leverages the prior limb information for peripheral joint refinement. Extensive experimental results show that HopFIR outperforms the SOTA methods by a large margin, with a mean per-joint position error (MPJPE) on the Human3.6M dataset of 32.67 mm. We also demonstrate that the state-of-the-art GCN-based methods can benefit from the proposed hop-wise attention mechanism with a significant improvement in performance: SemGCN and MGCN are improved by 8.9% and 4.5%, respectively.

CVNov 3, 2025
HGFreNet: Hop-hybrid GraphFomer for 3D Human Pose Estimation with Trajectory Consistency in Frequency Domain

Kai Zhai, Ziyan Huang, Qiang Nie et al.

2D-to-3D human pose lifting is a fundamental challenge for 3D human pose estimation in monocular video, where graph convolutional networks (GCNs) and attention mechanisms have proven to be inherently suitable for encoding the spatial-temporal correlations of skeletal joints. However, depth ambiguity and errors in 2D pose estimation lead to incoherence in the 3D trajectory. Previous studies have attempted to restrict jitters in the time domain, for instance, by constraining the differences between adjacent frames while neglecting the global spatial-temporal correlations of skeletal joint motion. To tackle this problem, we design HGFreNet, a novel GraphFormer architecture with hop-hybrid feature aggregation and 3D trajectory consistency in the frequency domain. Specifically, we propose a hop-hybrid graph attention (HGA) module and a Transformer encoder to model global joint spatial-temporal correlations. The HGA module groups all $k$-hop neighbors of a skeletal joint into a hybrid group to enlarge the receptive field and applies the attention mechanism to discover the latent correlations of these groups globally. We then exploit global temporal correlations by constraining trajectory consistency in the frequency domain. To provide 3D information for depth inference across frames and maintain coherence over time, a preliminary network is applied to estimate the 3D pose. Extensive experiments were conducted on two standard benchmark datasets: Human3.6M and MPI-INF-3DHP. The results demonstrate that the proposed HGFreNet outperforms state-of-the-art (SOTA) methods in terms of positional accuracy and temporal consistency.