Hanchao Jia

CV
3 papers
23 citations
Novelty 70%
AI Score 35

3 Papers

CV | Aug 20, 2023 | Code
Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos

Haoyuan Li, Haoye Dong, Hanchao Jia et al.

Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of inter-person interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. Additionally, we propose a simple yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves on the state of the art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PA-MPJPE, and PVE metrics, respectively, while being 40% faster than recent video-based approaches. The released code can be found at https://github.com/Li-Hao-yuan/CoordFormer.
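For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of MPJPE (Mean Per Joint Position Error), computed as the average Euclidean distance between predicted and ground-truth 3D joints. It is not taken from the CoordFormer code; the function name `mpjpe` and the toy joint coordinates are hypothetical. PA-MPJPE applies the same formula after rigidly aligning the prediction to the ground truth (Procrustes alignment), which is omitted here.

```python
import math


def mpjpe(pred, gt):
    """Mean Per Joint Position Error: average Euclidean distance over joints.

    pred and gt are equal-length sequences of (x, y, z) joint positions.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy data (hypothetical, in millimetres): two joints with known offsets.
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 10.0)]

# Per-joint errors are 5 mm and 10 mm, so the mean is 7.5 mm.
print(mpjpe(pred, gt))
```

On a benchmark such as 3DPW the value would be averaged over all frames and people; lower is better.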

CV | Jul 8, 2024
HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels

Yingying Jiang, Hanchao Jia, Xiaobing Wang et al.

Composed Image Retrieval (CIR) aims to retrieve images given a query that combines an image with text. Current Zero-Shot CIR (ZS-CIR) methods try to solve CIR tasks without using expensive triplet-labeled training datasets. However, the gap between ZS-CIR and triplet-supervised CIR is still large. In this work, we propose Hybrid CIR (HyCIR), which uses synthetic labels to boost the performance of ZS-CIR. A new label synthesis pipeline for CIR (SynCir) is proposed, in which only unlabeled images are required. First, image pairs are extracted based on visual similarity. Second, query text is generated for each image pair using a vision-language model and an LLM. Third, the data is further filtered in language space based on semantic similarity. To improve ZS-CIR performance, we propose a hybrid training strategy that works with both ZS-CIR supervision and synthetic CIR triplets. Two kinds of contrastive learning are adopted. One uses a large-scale unlabeled image dataset to learn an image-to-text mapping with good generalization. The other uses synthetic CIR triplets to learn a better mapping for CIR tasks. Our approach achieves SOTA zero-shot performance on the common CIR benchmarks CIRR and CIRCO.
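The first SynCir step (extracting image pairs by visual similarity) could be sketched as follows. The abstract does not say how similarity is measured or thresholded, so everything here is an assumption for illustration: cosine similarity over precomputed image embeddings, a similarity band `[low, high]` that drops both unrelated images and near-duplicates, and the hypothetical names `cosine_similarity` and `mine_image_pairs`.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mine_image_pairs(embeddings, low=0.5, high=0.95):
    """Return (id_a, id_b, similarity) for pairs with similarity in [low, high].

    The band is an assumed heuristic: pairs below `low` are treated as
    unrelated, pairs above `high` as near-duplicates with nothing to describe.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if low <= sim <= high:
            pairs.append((id_a, id_b, sim))
    return pairs


# Toy 2-D embeddings (hypothetical); real ones would come from an image encoder.
embeddings = {
    "img_a": [1.0, 0.0],
    "img_b": [0.8, 0.6],
    "img_c": [0.0, 1.0],
    "img_d": [1.0, 0.0],  # exact duplicate of img_a, so it is excluded as a pair
}
pairs = mine_image_pairs(embeddings)
print([(a, b) for a, b, _ in pairs])
```

The all-pairs loop is quadratic in the number of images; at dataset scale an approximate nearest-neighbor index would be the usual substitute.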

CV | Jul 26, 2024
BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation

Peng Hao, Weilong Wang, Xiaobing Wang et al.

Scene Graph Generation (SGG) remains a challenging task due to its compositional property. Previous approaches improve prediction efficiency through end-to-end learning. However, these methods exhibit limited performance as they assume unidirectional conditioning between entities and predicates, which restricts effective information interaction. To address this limitation, we propose a novel bidirectional conditioning factorization in a semantic-aligned space for SGG, enabling efficient and generalizable interaction between entities and predicates. Specifically, we introduce an end-to-end scene graph generation model, the Bidirectional Conditioning Transformer (BCTR), to implement this factorization. BCTR consists of two key modules. First, the Bidirectional Conditioning Generator (BCG) performs multi-stage interactive feature augmentation between entities and predicates, enabling mutual enhancement between these predictions. Second, Random Feature Alignment (RFA) is presented to regularize the feature space by distilling multi-modal knowledge from pre-trained models. Within this regularized feature space, BCG can capture interaction patterns across diverse relationships during training, and the learned interaction patterns can generalize to unseen but semantically related relationships during inference. Extensive experiments on Visual Genome and Open Images V6 show that BCTR achieves state-of-the-art performance on both benchmarks.