Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting
For researchers in 3D hand pose estimation, this work provides evidence that adaptive attention mechanisms are more effective than fixed graph convolutions, suggesting a shift in inductive bias for skeleton-based lifting tasks.
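The paper challenges the use of fixed graph convolution for 2D-to-3D hand pose lifting, showing that standard multi-head self-attention outperforms parameter-matched GCN baselines on the FPHA benchmark. Against a GCN strengthened with multi-hop adjacency, self-attention reduces mean per-joint position error (MPJPE) from 12.36 mm to 10.09 mm, a relative reduction of about 18%.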
Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.
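The contrast between a hard adjacency constraint and a soft graph-distance prior can be illustrated with a minimal sketch. The abstract does not specify how the positional encoding is implemented, so everything below is an assumption made for illustration: the 21-joint layout and joint ordering in `hand_edges`, the additive per-hop bias on the attention logits, and the fixed `bias_per_hop` values (in a trained model these would presumably be learned). The function names are hypothetical and are not taken from the paper.

```python
import numpy as np
from collections import deque

NUM_JOINTS = 21  # assumed layout: wrist (index 0) plus 5 fingers with 4 joints each


def hand_edges():
    """Bone list for a hypothetical joint ordering (not the paper's)."""
    edges = []
    for finger in range(5):
        base = 1 + 4 * finger
        edges.append((0, base))  # wrist to first joint of the finger
        for k in range(3):
            edges.append((base + k, base + k + 1))  # chain along the finger
    return edges


def graph_distances(num_joints, edges):
    """All-pairs hop counts on the skeleton, via breadth-first search."""
    neighbors = [[] for _ in range(num_joints)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = np.full((num_joints, num_joints), -1, dtype=int)
    for src in range(num_joints):
        dist[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[src, v] < 0:  # not visited yet
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist


def softmax(x):
    """Row-wise softmax; -inf entries receive exactly zero weight."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def hard_adjacency_attention(scores, dist):
    """Hard constraint: a joint attends only to itself and its 1-hop neighbors."""
    return softmax(np.where(dist <= 1, scores, -np.inf))


def soft_distance_attention(scores, dist, bias_per_hop):
    """Soft prior: add a per-hop bias to the logits; every pair stays reachable."""
    return softmax(scores + bias_per_hop[dist])


dist = graph_distances(NUM_JOINTS, hand_edges())
rng = np.random.default_rng(0)
scores = rng.normal(size=(NUM_JOINTS, NUM_JOINTS))  # stand-in for scaled QK^T logits
bias_per_hop = -0.5 * np.arange(dist.max() + 1)     # illustrative values only
hard_weights = hard_adjacency_attention(scores, dist)
soft_weights = soft_distance_attention(scores, dist, bias_per_hop)
```

Under the hard mask, joints on different fingers never exchange information within a single layer. Under the soft bias, distant joints are down-weighted but remain reachable, which is one way to read the abstract's distinction between a structural prior and an adjacency constraint.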