Shibang Liu

3.6CVJan 19, 2025Code

Refinement Module based on Parse Graph for Human Pose Estimation

Shibang Liu, Xuemei Xie

Parse graphs have been widely used in Human Pose Estimation (HPE) to model the hierarchical structure and context relations of the human body. However, such methods often suffer from parameter redundancy. More importantly, they rely on predefined network structures, which limits their use in other methods. To address these issues, we propose a new context relation and hierarchical structure modeling module, RMPG (Refinement Module based on Parse Graph). RMPG adaptively refines feature maps through recursive top-down decomposition of feature maps and bottom-up composition of sub-node feature maps with context information. Through recursive hierarchical composition, RMPG fuses local details and global semantics into more structured feature representations, accompanied by context information, thereby improving the accuracy of joint inference. RMPG can be flexibly embedded as a plug-in into various mainstream HPE networks. Moreover, by supervising sub-node features map, RMPG learns the context relations and hierarchical structure between different body parts with fewer parameters. Extensive experiments show that RMPG improves performance across different architectures while effectively modeling hierarchical and context relations of the human body with fewer parameters. The RMPG code can be found at https://github.com/lushbng/RMPG.

3.6CVSep 9, 2025

Parse Graph-Based Visual-Language Interaction for Human Pose Estimation

Shibang Liu, Xuemei Xie, Guangming Shi

Parse graphs boost human pose estimation (HPE) by integrating context and hierarchies, yet prior work mostly focuses on single modality modeling, ignoring the potential of multimodal fusion. Notably, language offers rich HPE priors like spatial relations for occluded scenes, but existing visual-language fusion via global feature integration weakens occluded region responses and causes alignment and location failures. To address this issue, we propose Parse Graph-based Visual-Language interaction (PGVL) with a core novel Guided Module (GM). In PGVL, low-level nodes focus on local features, maximizing the maintenance of responses in occluded areas and high-level nodes integrate global features to infer occluded or invisible parts. GM enables high semantic nodes to guide the feature update of low semantic nodes that have undergone cross attention. It ensuring effective fusion of diverse information. PGVL includes top-down decomposition and bottom-up composition. In the first stage, modality specific parse graphs are constructed. Next stage. recursive bidirectional cross-attention is used, purified by GM. We also design network based on PGVL. The PGVL and our network is validated on major pose estimation datasets. We will release the code soon.

Shibang Liu

2 Papers