Boyan Wan

CV
h-index6
4papers
21citations
Novelty64%
AI Score47

4 Papers

CVMar 18, 2023
SOCS: Semantically-aware Object Coordinate Space for Category-Level 6D Object Pose Estimation under Large Shape Variations

Boyan Wan, Yifei Shi, Kai Xu

Most learning-based approaches to category-level 6D pose estimation are design around normalized object coordinate space (NOCS). While being successful, NOCS-based methods become inaccurate and less robust when handling objects of a category containing significant intra-category shape variations. This is because the object coordinates induced by global and rigid alignment of objects are semantically incoherent, making the coordinate regression hard to learn and generalize. We propose Semantically-aware Object Coordinate Space (SOCS) built by warping-and-aligning the objects guided by a sparse set of keypoints with semantically meaningful correspondence. SOCS is semantically coherent: Any point on the surface of a object can be mapped to a semantically meaningful location in SOCS, allowing for accurate pose and size estimation under large shape variations. To learn effective coordinate regression to SOCS, we propose a novel multi-scale coordinate-based attention network. Evaluations demonstrate that our method is easy to train, well-generalizing for large intra-category shape variations and robust to inter-object occlusions.

CVFeb 23
Learning Positive-Incentive Point Sampling in Neural Implicit Fields for Object Pose Estimation

Yifei Shi, Boyan Wan, Xin Xu et al.

Learning neural implicit fields of 3D shapes is a rapidly emerging field that enables shape representation at arbitrary resolutions. Due to the flexibility, neural implicit fields have succeeded in many research areas, including shape reconstruction, novel view image synthesis, and more recently, object pose estimation. Neural implicit fields enable learning dense correspondences between the camera space and the object's canonical space-including unobserved regions in camera space-significantly boosting object pose estimation performance in challenging scenarios like highly occluded objects and novel shapes. Despite progress, predicting canonical coordinates for unobserved camera-space regions remains challenging due to the lack of direct observational signals. This necessitates heavy reliance on the model's generalization ability, resulting in high uncertainty. Consequently, densely sampling points across the entire camera space may yield inaccurate estimations that hinder the learning process and compromise performance. To alleviate this problem, we propose a method combining an SO(3)-equivariant convolutional implicit network and a positive-incentive point sampling (PIPS) strategy. The SO(3)-equivariant convolutional implicit network estimates point-level attributes with SO(3)-equivariance at arbitrary query locations, demonstrating superior performance compared to most existing baselines. The PIPS strategy dynamically determines sampling locations based on the input, thereby boosting the network's accuracy and training efficiency. Our method outperforms the state-of-the-art on three pose estimation datasets. Notably, it demonstrates significant improvements in challenging scenarios, such as objects captured with unseen pose, high occlusion, novel geometry, and severe noise.

CVJun 30, 2025
TextMesh4D: High-Quality Text-to-4D Mesh Generation

Sisi Dai, Xinxin Su, Boyan Wan et al.

Recent advancements in diffusion generative models significantly advanced image, video, and 3D content creation from user-provided text prompts. However, the challenging problem of dynamic 3D content generation (text-to-4D) with diffusion guidance remains largely unexplored. In this paper, we introduce TextMesh4D, a novel framework for high-quality text-to-4D generation. Our approach leverages per-face Jacobians as a differentiable mesh representation and decomposes 4D generation into two stages: static object creation and dynamic motion synthesis. We further propose a flexibility-rigidity regularization term to stabilize Jacobian optimization under video diffusion priors, ensuring robust geometric performance. Experiments demonstrate that TextMesh4D achieves state-of-the-art results in terms of temporal consistency, structural fidelity, and visual realism. Moreover, TextMesh4D operates with a low GPU memory overhead-requiring only a single 24GB GPU-offering a cost-effective yet high-quality solution for text-driven 4D mesh generation. The code will be released to facilitate future research in text-to-4D generation.

CLOct 16, 2020
Lexicon-constrained Copying Network for Chinese Abstractive Summarization

Boyan Wan, Mishal Sohail

Copy mechanism allows sequence-to-sequence models to choose words from the input and put them directly into the output, which is finding increasing use in abstractive summarization. However, since there is no explicit delimiter in Chinese sentences, most existing models for Chinese abstractive summarization can only perform character copy, resulting in inefficient. To solve this problem, we propose a lexicon-constrained copying network that models multi-granularity in both encoder and decoder. On the source side, words and characters are aggregated into the same input memory using a Transformerbased encoder. On the target side, the decoder can copy either a character or a multi-character word at each time step, and the decoding process is guided by a word-enhanced search algorithm that facilitates the parallel computation and encourages the model to copy more words. Moreover, we adopt a word selector to integrate keyword information. Experiments results on a Chinese social media dataset show that our model can work standalone or with the word selector. Both forms can outperform previous character-based models and achieve competitive performances.