Chengkun Li

CV
h-index22
21papers
449citations
Novelty47%
AI Score54

21 Papers

CVMar 12, 2023
Modular Quantization-Aware Training for 6D Object Pose Estimation

Saqib Javed, Chengkun Li, Andrew Price et al.

Edge applications, such as collaborative robotics and spacecraft rendezvous, demand efficient 6D object pose estimation on resource-constrained embedded platforms. Existing 6D pose estimation networks are often too large for such deployments, necessitating compression while maintaining reliable performance. To address this challenge, we introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy that exploits the modular structure of modern 6D pose estimation architectures. MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques. Our experiments showcase the generality of MQAT across datasets, architectures, and quantization algorithms. Remarkably, MQAT-trained quantized models achieve a significant accuracy boost (>7%) over the baseline full-precision network while reducing model size by a factor of 4x or more. Our project website is at: https://saqibjaved1.github.io/MQAT_/

75.3ROMay 15Code
MyoChallenge 2025: A New Benchmark for Human Athletic Intelligence

Cheryl Wang, Chun Kwang Tan, Balint K. Hodossy et al.

Athletic performance represents the pinnacle of human motor intelligence, demanding rapid choices, precise control, agility, and coordinated physical execution. Replicating this seamless combination of capabilities remains elusive in current artificial intelligence and robotic systems. Concurrently, understanding the biological mastery of these movements is hindered because complex muscle coordination is rarely measured in vivo due to the limitations of physical equipment. To bridge this fundamental gap in understanding, MyoChallenge at NeurIPS 2025 established a pioneering benchmark for motor control intelligence in sports, leveraging high-fidelity musculoskeletal models within physics simulation combined with machine learning-driven algorithms. The competition introduces two distinct tracks emphasizing either upper or lower limbs control: a table tennis rally task utilizing a biomechanic upper limb composed of an arm with a hand and a trunk; and a soccer penalty kick using a biomechanic model of legs and a trunk. Marking the fourth iteration of the MyoChallenge series, this event attracted almost 70 teams and over 560 submissions globally, uniting a diverse community ranging from physicians and neuroscientists to machine learning experts. The competition facilitated the development of several state-of-the-art control algorithms for a musculoskeletal system capable of sports agility, leveraging techniques such as physics-based motion planners, on-policy behaviour cloning, hierarchical planning, and muscle synergies. By integrating standardized tasks and physiologically realistic models into the open-source framework of MyoSuite, MyoChallenge'25 serves as a reproducible and reusable testbed to accelerate interdisciplinary research across machine learning, biomechanics, sports science, and neuroscience. Project page: https://www.myosuite.org//myochallenge/myochallenge-2025.

57.8ROMar 26Code
Towards Embodied AI with MuscleMimic: Unlocking full-body musculoskeletal motor learning at scale

Chengkun Li, Cheryl Wang, Bianca Ziliotto et al.

Learning motor control for muscle-driven musculoskeletal models is hindered by the computational cost of biomechanically accurate simulation and the scarcity of validated, open full-body models. Here we present MuscleMimic, an open-source framework for scalable motion imitation learning with physiologically realistic, muscle-actuated humanoids. MuscleMimic provides two validated musculoskeletal embodiments - a fixed-root upper-body model (126 muscles) for bimanual manipulation and a full-body model (416 muscles) for locomotion - together with a retargeting pipeline that maps SMPL-format motion capture data onto musculoskeletal structures while preserving kinematic and dynamic consistency. Leveraging massively parallel GPU simulation, the framework achieves order-of-magnitude training speedups over prior CPU-based approaches while maintaining comprehensive collision handling, enabling a single generalist policy to be trained on hundreds of diverse motions within days. The resulting policy faithfully reproduces a broad repertoire of human movements under full muscular control and can be fine-tuned to novel motions within hours. Biomechanical validation against experimental walking and running data demonstrates strong agreement in joint kinematics (mean correlation r = 0.90), while muscle activation analysis reveals both the promise and fundamental challenges of achieving physiological fidelity through kinematic imitation alone. By lowering the computational and data barriers to musculoskeletal simulation, MuscleMimic enables systematic model validation across diverse dynamic movements and broader participation in neuromuscular control research. Code, models, checkpoints, and retargeted datasets are available at: https://github.com/amathislab/musclemimic

CVApr 9, 2022
Robotic Surgery Remote Mentoring via AR with 3D Scene Streaming and Hand Interaction

Yonghao Long, Chengkun Li, Qi Dou

With the growing popularity of robotic surgery, education becomes increasingly important and urgently needed for the sake of patient safety. However, experienced surgeons have limited accessibility due to their busy clinical schedule or working in a distant city, thus can hardly provide sufficient education resources for novices. Remote mentoring, as an effective way, can help solve this problem, but traditional methods are limited to plain text, audio, or 2D video, which are not intuitive nor vivid. Augmented reality (AR), a thriving technique being widely used for various education scenarios, is promising to offer new possibilities of visual experience and interactive teaching. In this paper, we propose a novel AR-based robotic surgery remote mentoring system with efficient 3D scene visualization and natural 3D hand interaction. Using a head-mounted display (i.e., HoloLens), the mentor can remotely monitor the procedure streamed from the trainee's operation side. The mentor can also provide feedback directly with hand gestures, which is in-turn transmitted to the trainee and viewed in surgical console as guidance. We comprehensively validate the system on both real surgery stereo videos and ex-vivo scenarios of common robotic training tasks (i.e., peg-transfer and suturing). Promising results are demonstrated regarding the fidelity of streamed scene visualization, the accuracy of feedback with hand interaction, and the low-latency of each component in the entire remote mentoring system. This work showcases the feasibility of leveraging AR technology for reliable, flexible and low-cost solutions to robotic surgical education, and holds great potential for clinical applications.

LGSep 6, 2024
Amortized Bayesian Workflow

Chengkun Li, Aki Vehtari, Paul-Christian Bürkner et al.

Bayesian inference often faces a trade-off between computational speed and sampling accuracy. We propose an adaptive workflow that integrates rapid amortized inference with gold-standard MCMC techniques to achieve a favorable combination of both speed and accuracy when performing inference on many observed datasets. Our approach uses principled diagnostics to guide the choice of inference method for each dataset, moving along the Pareto front from fast amortized sampling via generative neural networks to slower but guaranteed-accurate MCMC when needed. By reusing computations across steps, our workflow synergizes amortized and MCMC-based inference. We demonstrate the effectiveness of this integrated approach on several synthetic and real-world problems with tens of thousands of datasets, showing efficiency gains while maintaining high posterior quality.

MLMar 16, 2023
PyVBMC: Efficient Bayesian inference in Python

Bobby Huggins, Chengkun Li, Marlon Tobaben et al.

PyVBMC is a Python implementation of the Variational Bayesian Monte Carlo (VBMC) algorithm for posterior and model inference for black-box computational models (Acerbi, 2018, 2020). VBMC is an approximate inference method designed for efficient parameter estimation and model assessment when model evaluations are mildly-to-very expensive (e.g., a second or more) and/or noisy. Specifically, VBMC computes: - a flexible (non-Gaussian) approximate posterior distribution of the model parameters, from which statistics and posterior samples can be easily extracted; - an approximation of the model evidence or marginal likelihood, a metric used for Bayesian model selection. PyVBMC can be applied to any computational or statistical model with up to roughly 10-15 continuous parameters, with the only requirement that the user can provide a Python function that computes the target log likelihood of the model, or an approximation thereof (e.g., an estimate of the likelihood obtained via simulation or Monte Carlo methods). PyVBMC is particularly effective when the model takes more than about a second per evaluation, with dramatic speed-ups of 1-2 orders of magnitude when compared to traditional approximate inference methods. Extensive benchmarks on both artificial test problems and a large number of real models from the computational sciences, particularly computational and cognitive neuroscience, show that VBMC generally - and often vastly - outperforms alternative methods for sample-efficient Bayesian inference, and is applicable to both exact and simulator-based models (Acerbi, 2018, 2019, 2020). PyVBMC brings this state-of-the-art inference algorithm to Python, along with an easy-to-use Pythonic interface for running the algorithm and manipulating and visualizing its results.

MLMar 9, 2023
Fast post-process Bayesian inference with Variational Sparse Bayesian Quadrature

Chengkun Li, Grégoire Clarté, Martin Jørgensen et al.

In applied Bayesian inference scenarios, users may have access to a large number of pre-existing model evaluations, for example from maximum-a-posteriori (MAP) optimization runs. However, traditional approximate inference techniques make little to no use of this available information. We propose the framework of post-process Bayesian inference as a means to obtain a quick posterior approximation from existing target density evaluations, with no further model calls. Within this framework, we introduce Variational Sparse Bayesian Quadrature (VSBQ), a method for post-process approximate inference for models with black-box and potentially noisy likelihoods. VSBQ reuses existing target density evaluations to build a sparse Gaussian process (GP) surrogate model of the log posterior density function. Subsequently, we leverage sparse-GP Bayesian quadrature combined with variational inference to achieve fast approximate posterior inference over the surrogate. We validate our method on challenging synthetic scenarios and real-world applications from computational neuroscience. The experiments show that VSBQ builds high-quality posterior approximations by post-processing existing optimization traces, with no further model evaluations.

MLApr 7, 2025Code
Stacking Variational Bayesian Monte Carlo

Francesco Silvestrin, Chengkun Li, Luigi Acerbi

Approximate Bayesian inference for models with computationally expensive, black-box likelihoods poses a significant challenge, especially when the posterior distribution is complex. Many inference methods struggle to explore the parameter space efficiently under a limited budget of likelihood evaluations. Variational Bayesian Monte Carlo (VBMC) is a sample-efficient method that addresses this by building a local surrogate model of the log-posterior. However, its conservative exploration strategy, while promoting stability, can cause it to miss important regions of the posterior, such as distinct modes or long tails. In this work, we introduce Stacking Variational Bayesian Monte Carlo (S-VBMC), a method that overcomes this limitation by constructing a robust, global posterior approximation from multiple independent VBMC runs. Our approach merges these local approximations through a principled and inexpensive post-processing step that leverages VBMC's mixture posterior representation and per-component evidence estimates. Crucially, S-VBMC requires no additional likelihood evaluations and is naturally parallelisable, fitting seamlessly into existing inference workflows. We demonstrate its effectiveness on two synthetic problems designed to challenge VBMC's exploration and two real-world applications from computational neuroscience, showing substantial improvements in posterior approximation quality across all cases. Our code is available as a Python package at https://github.com/acerbilab/svbmc.

LGAug 14, 2022
Model Generalization: A Sharpness Aware Optimization Perspective

Jozef Marus Coldenhoff, Chengkun Li, Yurui Zhu

Sharpness-Aware Minimization (SAM) and adaptive sharpness-aware minimization (ASAM) aim to improve the model generalization. And in this project, we proposed three experiments to valid their generalization from the sharpness aware perspective. And our experiments show that sharpness aware-based optimization techniques could help to provide models with strong generalization ability. Our experiments also show that ASAM could improve the generalization performance on un-normalized data, but further research is needed to confirm this.

CVNov 24, 2020Code
Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes

Yang You, Zelin Ye, Yujing Lou et al.

3D object detection has attracted much attention thanks to the advances in sensors and deep learning methods for point clouds. Current state-of-the-art methods like VoteNet regress direct offset towards object centers and box orientations with an additional Multi-Layer-Perceptron network. Both their offset and orientation predictions are not accurate due to the fundamental difficulty in rotation classification. In the work, we disentangle the direct offset into Local Canonical Coordinates (LCC), box scales and box orientations. Only LCC and box scales are regressed, while box orientations are generated by a canonical voting scheme. Finally, an LCC-aware back-projection checking algorithm iteratively cuts out bounding boxes from the generated vote maps, with the elimination of false positives. Our model achieves state-of-the-art performance on three standard real-world benchmarks: ScanNet, SceneNN and SUN RGB-D. Our code is available on https://github.com/qq456cvb/CanonicalVoting.

CVApr 20, 2020Code
Semantic Correspondence via 2D-3D-2D Cycle

Yang You, Chengkun Li, Yujing Lou et al.

Visual semantic correspondence is an important topic in computer vision and could help machine understand objects in our daily life. However, most previous methods directly train on correspondences in 2D images, which is end-to-end but loses plenty of information in 3D spaces. In this paper, we propose a new method on predicting semantic correspondences by leveraging it to 3D domain and then project corresponding 3D models back to 2D domain, with their semantic labels. Our method leverages the advantages in 3D vision and can explicitly reason about objects self-occlusion and visibility. We show that our method gives comparative and even superior results on standard semantic benchmarks. We also conduct thorough and detailed experiments to analyze our network components. The code and experiments are publicly available at https://github.com/qq456cvb/SemanticTransfer.

CVFeb 28, 2020Code
KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from Numerous Human Annotations

Yang You, Yujing Lou, Chengkun Li et al.

Detecting 3D objects keypoints is of great interest to the areas of both graphics and computer vision. There have been several 2D and 3D keypoint datasets aiming to address this problem in a data-driven way. These datasets, however, either lack scalability or bring ambiguity to the definition of keypoints. Therefore, we present KeypointNet: the first large-scale and diverse 3D keypoint dataset that contains 103,450 keypoints and 8,234 3D models from 16 object categories, by leveraging numerous human annotations. To handle the inconsistency between annotations from different people, we propose a novel method to aggregate these keypoints automatically, through minimization of a fidelity loss. Finally, ten state-of-the-art methods are benchmarked on our proposed dataset. Our code and data are available on https://github.com/qq456cvb/KeypointNet.

MLApr 15, 2025
Normalizing Flow Regression for Bayesian Inference with Offline Likelihood Evaluations

Chengkun Li, Bobby Huggins, Petrus Mikkola et al.

Bayesian inference with computationally expensive likelihood evaluations remains a significant challenge in many scientific domains. We propose normalizing flow regression (NFR), a novel offline inference method for approximating posterior distributions. Unlike traditional surrogate approaches that require additional sampling or inference steps, NFR directly yields a tractable posterior approximation through regression on existing log-density evaluations. We introduce training techniques specifically for flow regression, such as tailored priors and likelihood functions, to achieve robust posterior and model evidence estimation. We demonstrate NFR's effectiveness on synthetic benchmarks and real-world applications from neuroscience and biology, showing superior or comparable performance to existing methods. NFR represents a promising approach for Bayesian inference when standard methods are computationally prohibitive or existing model evaluations can be recycled.

CVFeb 8, 2024
InkSight: Offline-to-Online Handwriting Conversion by Teaching Vision-Language Models to Read and Write

Blagoj Mitrevski, Arina Rak, Julian Schnitzler et al.

Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in a vectorized form, known as digital ink. However, a substantial gap remains between this way of note-taking and traditional pen-and-paper note-taking, a practice that is still favored by a vast majority. Our work InkSight, aims to bridge the gap by empowering physical note-takers to effortlessly convert their work (offline handwriting) to digital ink (online handwriting), a process we refer to as derendering. Prior research on the topic has focused on the geometric properties of images, resulting in limited generalization beyond their training domains. Our approach combines reading and writing priors, allowing training a model in the absence of large amounts of paired samples, which are difficult to obtain. To our knowledge, this is the first work that effectively derenders handwritten text in arbitrary photos with diverse visual characteristics and backgrounds. Furthermore, it generalizes beyond its training domain into simple sketches. Our human evaluation reveals that 87% of the samples produced by our model on the challenging HierText dataset are considered as a valid tracing of the input image and 67% look like a pen trajectory traced by a human.

ROAug 25, 2025
Arnold: a generalist muscle transformer policy

Alberto Silvio Chiappa, Boshi An, Merkourios Simos et al.

Controlling high-dimensional and nonlinear musculoskeletal models of the human body is a foundational scientific challenge. Recent machine learning breakthroughs have heralded policies that master individual skills like reaching, object manipulation and locomotion in musculoskeletal systems with many degrees of freedom. However, these agents are merely "specialists", achieving high performance for a single skill. In this work, we develop Arnold, a generalist policy that masters multiple tasks and embodiments. Arnold combines behavior cloning and fine-tuning with PPO to achieve expert or super-expert performance in 14 challenging control tasks from dexterous object manipulation to locomotion. A key innovation is Arnold's sensorimotor vocabulary, a compositional representation of the semantics of heterogeneous sensory modalities, objectives, and actuators. Arnold leverages this vocabulary via a transformer architecture to deal with the variable observation and action spaces of each task. This framework supports efficient multi-task, multi-embodiment learning and facilitates rapid adaptation to novel tasks. Finally, we analyze Arnold to provide insights into biological motor control, corroborating recent findings on the limited transferability of muscle synergies across tasks.

CVJul 9, 2025
ClipGS: Clippable Gaussian Splatting for Interactive Cinematic Visualization of Volumetric Medical Data

Chengkun Li, Yuqi Tong, Kai Chen et al.

The visualization of volumetric medical data is crucial for enhancing diagnostic accuracy and improving surgical planning and education. Cinematic rendering techniques significantly enrich this process by providing high-quality visualizations that convey intricate anatomical details, thereby facilitating better understanding and decision-making in medical contexts. However, the high computing cost and low rendering speed limit the requirement of interactive visualization in practical applications. In this paper, we introduce ClipGS, an innovative Gaussian splatting framework with the clipping plane supported, for interactive cinematic visualization of volumetric medical data. To address the challenges posed by dynamic interactions, we propose a learnable truncation scheme that automatically adjusts the visibility of Gaussian primitives in response to the clipping plane. Besides, we also design an adaptive adjustment model to dynamically adjust the deformation of Gaussians and refine the rendering performance. We validate our method on five volumetric medical data (including CT and anatomical slice data), and reach an average 36.635 PSNR rendering quality with 156 FPS and 16.1 MB model size, outperforming state-of-the-art methods in rendering quality and efficiency.

CVNov 21, 2021
Understanding Pixel-level 2D Image Semantics with 3D Keypoint Knowledge Engine

Yang You, Chengkun Li, Yujing Lou et al.

Pixel-level 2D object semantic understanding is an important topic in computer vision and could help machine deeply understand objects (e.g. functionality and affordance) in our daily life. However, most previous methods directly train on correspondences in 2D images, which is end-to-end but loses plenty of information in 3D spaces. In this paper, we propose a new method on predicting image corresponding semantics in 3D domain and then projecting them back onto 2D images to achieve pixel-level understanding. In order to obtain reliable 3D semantic labels that are absent in current image datasets, we build a large scale keypoint knowledge engine called KeypointNet, which contains 103,450 keypoints and 8,234 3D models from 16 object categories. Our method leverages the advantages in 3D vision and can explicitly reason about objects self-occlusion and visibility. We show that our method gives comparative and even superior results on standard semantic benchmarks.

CVJun 8, 2020
Are We Hungry for 3D LiDAR Data for Semantic Segmentation? A Survey and Experimental Study

Biao Gao, Yancheng Pan, Chengkun Li et al.

3D semantic segmentation is a fundamental task for robotic and autonomous driving applications. Recent works have been focused on using deep learning techniques, whereas developing fine-annotated 3D LiDAR datasets is extremely labor intensive and requires professional skills. The performance limitation caused by insufficient datasets is called data hunger problem. This research provides a comprehensive survey and experimental study on the question: are we hungry for 3D LiDAR data for semantic segmentation? The studies are conducted at three levels. First, a broad review to the main 3D LiDAR datasets is conducted, followed by a statistical analysis on three representative datasets to gain an in-depth view on the datasets' size and diversity, which are the critical factors in learning deep models. Second, a systematic review to the state-of-the-art 3D semantic segmentation is conducted, followed by experiments and cross examinations of three representative deep learning methods to find out how the size and diversity of the datasets affect deep models' performance. Finally, a systematic survey to the existing efforts to solve the data hunger problem is conducted on both methodological and dataset's viewpoints, followed by an insightful discussion of remaining problems and open questions To the best of our knowledge, this is the first work to analyze the data hunger problem for 3D semantic segmentation using deep learning techniques that are addressed in the literature review, statistical analysis, and cross-dataset and cross-algorithm experiments. We share findings and discussions, which may lead to potential topics in future works.

ROFeb 21, 2020
SemanticPOSS: A Point Cloud Dataset with Large Quantity of Dynamic Instances

Yancheng Pan, Biao Gao, Jilin Mei et al.

3D semantic segmentation is one of the key tasks for autonomous driving system. Recently, deep learning models for 3D semantic segmentation task have been widely researched, but they usually require large amounts of training data. However, the present datasets for 3D semantic segmentation are lack of point-wise annotation, diversiform scenes and dynamic objects. In this paper, we propose the SemanticPOSS dataset, which contains 2988 various and complicated LiDAR scans with large quantity of dynamic instances. The data is collected in Peking University and uses the same data format as SemanticKITTI. In addition, we evaluate several typical 3D semantic segmentation models on our SemanticPOSS dataset. Experimental results show that SemanticPOSS can help to improve the prediction accuracy of dynamic objects as people, car in some degree. SemanticPOSS will be published at \url{www.poss.pku.edu.cn}.

CVDec 29, 2019
Human Correspondence Consensus for 3D Object Semantic Understanding

Yujing Lou, Yang You, Chengkun Li et al.

Semantic understanding of 3D objects is crucial in many applications such as object manipulation. However, it is hard to give a universal definition of point-level semantics that everyone would agree on. We observe that people have a consensus on semantic correspondences between two areas from different objects, but are less certain about the exact semantic meaning of each area. Therefore, we argue that by providing human labeled correspondences between different objects from the same category instead of explicit semantic labels, one can recover rich semantic information of an object. In this paper, we introduce a new dataset named CorresPondenceNet. Based on this dataset, we are able to learn dense semantic embeddings with a novel geodesic consistency loss. Accordingly, several state-of-the-art networks are evaluated on this correspondence benchmark. We further show that CorresPondenceNet could not only boost fine-grained understanding of heterogeneous objects but also cross-object registration and partial object matching.

CVDec 4, 2018
Estimating 6D Pose From Localizing Designated Surface Keypoints

Zelin Zhao, Gao Peng, Haoyu Wang et al.

In this paper, we present an accurate yet effective solution for 6D pose estimation from an RGB image. The core of our approach is that we first designate a set of surface points on target object model as keypoints and then train a keypoint detector (KPD) to localize them. Finally a PnP algorithm can recover the 6D pose according to the 2D-3D relationship of keypoints. Different from recent state-of-the-art CNN-based approaches that rely on a time-consuming post-processing procedure, our method can achieve competitive accuracy without any refinement after pose prediction. Meanwhile, we obtain a 30% relative improvement in terms of ADD accuracy among methods without using refinement. Moreover, we succeed in handling heavy occlusion by selecting the most confident keypoints to recover the 6D pose. For the sake of reproducibility, we will make our code and models publicly available soon.