54.2ROMar 22
Geometrically Plausible Object Pose Refinement using Differentiable SimulationAnil Zeybek, Rhys Newbury, Snehal Dikhale et al.
State-of-the-art object pose estimation methods are prone to generating geometrically infeasible pose hypotheses. This problem is prevalent in dexterous manipulation, where estimated poses often intersect with the robotic hand or are not lying on a support surface. We propose a multi-modal pose refinement approach that combines differentiable physics simulation, differentiable rendering and visuo-tactile sensing to optimize object poses for both spatial accuracy and physical consistency. Simulated experiments show that our approach reduces the intersection volume error between the object and robotic hand by 73\% when the initial estimate is accurate and by over 87\% under high initial uncertainty, significantly outperforming standard ICP-based baselines. Furthermore, the improvement in geometric plausibility is accompanied by a concurrent reduction in translation and orientation errors. Achieving pose estimation that is grounded in physical reality while remaining faithful to multi-modal sensor inputs is a critical step toward robust in-hand manipulation.
CVJul 28, 2022
HOB-CNN: Hallucination of Occluded Branches with a Convolutional Neural Network for 2D Fruit TreesZijue Chen, Keenan Granland, Rhys Newbury et al.
Orchard automation has attracted the attention of researchers recently due to the shortage of global labor force. To automate tasks in orchards such as pruning, thinning, and harvesting, a detailed understanding of the tree structure is required. However, occlusions from foliage and fruits can make it challenging to predict the position of occluded trunks and branches. This work proposes a regression-based deep learning model, Hallucination of Occluded Branch Convolutional Neural Network (HOB-CNN), for tree branch position prediction in varying occluded conditions. We formulate tree branch position prediction as a regression problem towards the horizontal locations of the branch along the vertical direction or vice versa. We present comparative experiments on Y-shaped trees with two state-of-the-art baselines, representing common approaches to the problem. Experiments show that HOB-CNN outperform the baselines at predicting branch position and shows robustness against varying levels of occlusion. We further validated HOB-CNN against two different types of 2D trees, and HOB-CNN shows generalization across different trees and robustness under different occluded conditions.
CVDec 3, 2025
KeyPointDiffuser: Unsupervised 3D Keypoint Learning via Latent Diffusion ModelsRhys Newbury, Juyan Zhang, Tin Tran et al.
Understanding and representing the structure of 3D objects in an unsupervised manner remains a core challenge in computer vision and graphics. Most existing unsupervised keypoint methods are not designed for unconditional generative settings, restricting their use in modern 3D generative pipelines; our formulation explicitly bridges this gap. We present an unsupervised framework for learning spatially structured 3D keypoints from point cloud data. These keypoints serve as a compact and interpretable representation that conditions an Elucidated Diffusion Model (EDM) to reconstruct the full shape. The learned keypoints exhibit repeatable spatial structure across object instances and support smooth interpolation in keypoint space, indicating that they capture geometric variation. Our method achieves strong performance across diverse object categories, yielding a 6 percentage-point improvement in keypoint consistency compared to prior approaches.
LGOct 11, 2024Code
Carefully Structured Compression: Efficiently Managing StarCraft II DataBryce Ferenczi, Rhys Newbury, Michael Burke et al.
Creation and storage of datasets are often overlooked input costs in machine learning, as many datasets are simple image label pairs or plain text. However, datasets with more complex structures, such as those from the real time strategy game StarCraft II, require more deliberate thought and strategy to reduce cost of ownership. We introduce a serialization framework for StarCraft II that reduces the cost of dataset creation and storage, as well as improving usage ergonomics. We benchmark against the most comparable existing dataset from \textit{AlphaStar-Unplugged} and highlight the benefit of our framework in terms of both the cost of creation and storage. We use our dataset to train deep learning models that exceed the performance of comparable models trained on other datasets. The dataset conversion and usage framework introduced is open source and can be used as a framework for datasets with similar characteristics such as digital twin simulations. Pre-converted StarCraft II tournament data is also available online.
LGAug 3, 2025
Why Heuristic Weighting Works: A Theoretical Analysis of Denoising Score MatchingJuyan Zhang, Rhys Newbury, Xinyang Zhang et al.
Score matching enables the estimation of the gradient of a data distribution, a key component in denoising diffusion models used to recover clean data from corrupted inputs. In prior work, a heuristic weighting function has been used for the denoising score matching loss without formal justification. In this work, we demonstrate that heteroskedasticity is an inherent property of the denoising score matching objective. This insight leads to a principled derivation of optimal weighting functions for generalized, arbitrary-order denoising score matching losses, without requiring assumptions about the noise distribution. Among these, the first-order formulation is especially relevant to diffusion models. We show that the widely used heuristical weighting function arises as a first-order Taylor approximation to the trace of the expected optimal weighting. We further provide theoretical and empirical comparisons, revealing that the heuristical weighting, despite its simplicity, can achieve lower variance than the optimal weighting with respect to parameter gradients, which can facilitate more stable and efficient training.
LGJul 31, 2025
INSPIRE-GNN: Intelligent Sensor Placement to Improve Sparse Bicycling Network Prediction via Reinforcement Learning Boosted Graph Neural NetworksMohit Gupta, Debjit Bhowmick, Rhys Newbury et al.
Accurate link-level bicycling volume estimation is essential for sustainable urban transportation planning. However, many cities face significant challenges of high data sparsity due to limited bicycling count sensor coverage. To address this issue, we propose INSPIRE-GNN, a novel Reinforcement Learning (RL)-boosted hybrid Graph Neural Network (GNN) framework designed to optimize sensor placement and improve link-level bicycling volume estimation in data-sparse environments. INSPIRE-GNN integrates Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) with a Deep Q-Network (DQN)-based RL agent, enabling a data-driven strategic selection of sensor locations to maximize estimation performance. Applied to Melbourne's bicycling network, comprising 15,933 road segments with sensor coverage on only 141 road segments (99% sparsity) - INSPIRE-GNN demonstrates significant improvements in volume estimation by strategically selecting additional sensor locations in deployments of 50, 100, 200 and 500 sensors. Our framework outperforms traditional heuristic methods for sensor placement such as betweenness centrality, closeness centrality, observed bicycling activity and random placement, across key metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Furthermore, our experiments benchmark INSPIRE-GNN against standard machine learning and deep learning models in the bicycle volume estimation performance, underscoring its effectiveness. Our proposed framework provides transport planners actionable insights to effectively expand sensor networks, optimize sensor placement and maximize volume estimation accuracy and reliability of bicycling data for informed transportation planning decisions.
ROFeb 25, 2022
Visibility Maximization Controller for Robotic ManipulationKerry He, Rhys Newbury, Tin Tran et al.
Occlusions caused by a robot's own body is a common problem for closed-loop control methods employed in eye-to-hand camera setups. We propose an optimization-based reactive controller that minimizes self-occlusions while achieving a desired goal pose. The approach allows coordinated control between the robot's base, arm and head by encoding the line-of-sight visibility to the target as a soft constraint along with other task-related constraints, and solving for feasible joint and base velocities. The generalizability of the approach is demonstrated in simulated and real-world experiments, on robots with fixed or mobile bases, with moving or fixed objects, and multiple objects. The experiments revealed a trade-off between occlusion rates and other task metrics. While a planning-based baseline achieved lower occlusion rates than the proposed controller, it came at the expense of highly inefficient paths and a significant drop in the task success. On the other hand, the proposed controller is shown to improve visibility to the line target object(s) without sacrificing too much from the task success and efficiency. Videos and code can be found at: rhys-newbury.github.io/projects/vmc/.
ROApr 15, 2021
Tabletop Object Rearrangement: Team ACRV's Entry to OCRTOCZheyu Zhang, Rhys Newbury, Kerry He et al.
Open Cloud Robot Table Organization Challenge (OCRTOC) is one of the most comprehensive cloud-based robotic manipulation competitions. It focuses on rearranging tabletop objects using vision as its primary sensing modality. In this extended abstract, we present our entry to the OCRTOC2020 and the key challenges the team has experienced.
ROApr 7, 2021
Demonstrating Cloth Folding to Robots: Design and Evaluation of a 2D and a 3D User InterfaceBenjamin Waymouth, Akansel Cosgun, Rhys Newbury et al.
An appropriate user interface to collect human demonstration data for deformable object manipulation has been mostly overlooked in the literature. We present an interaction design for demonstrating cloth folding to robots. Users choose pick and place points on the cloth and can preview a visualization of a simulated cloth before real-robot execution. Two interfaces are proposed: A 2D display-and-mouse interface where points are placed by clicking on an image of the cloth, and a 3D Augmented Reality interface where the chosen points are placed by hand gestures. We conduct a user study with 18 participants, in which each user completed two sequential folds to achieve a cloth goal shape. Results show that while both interfaces were acceptable, the 3D interface was found to be more suitable for understanding the task, and the 2D interface suitable for repetition. Results also found that fold previews improve three key metrics: task efficiency, the ability to predict the final shape of the cloth and overall user satisfaction.
ROMar 6, 2021
Visualizing Robot Intent for Object Handovers with Augmented RealityRhys Newbury, Akansel Cosgun, Tysha Crowley-Davis et al.
Humans are highly skilled in communicating their intent for when and where a handover would occur. However, even the state-of-the-art robotic implementations for handovers typically lack of such communication skills. This study investigates visualization of the robot's internal state and intent for Human-to-Robot Handovers using Augmented Reality. Specifically, we explore the use of visualized 3D models of the object and the robotic gripper to communicate the robot's estimation of where the object is and the pose in which the robot intends to grasp the object. We tested this design via a user study with 16 participants, in which each participant handed over a cube-shaped object to the robot 12 times. Results show communicating robot intent via augmented reality substantially improves the perceived experience of the users for handovers. Results also indicate that the effectiveness of augmented reality is even more pronounced for the perceived safety and fluency of the interaction when the robot makes errors in localizing the object.
CVOct 16, 2020
Minimizing Labeling Effort for Tree Skeleton Segmentation using an Automated Iterative Training MethodologyKeenan Granland, Rhys Newbury, David Ting et al.
Training of convolutional neural networks for semantic segmentation requires accurate pixel-wise labeling which requires large amounts of human effort. The human-in-the-loop method reduces labeling effort; however, it requires human intervention for each image. This paper describes a general iterative training methodology for semantic segmentation, Automating-the-Loop. This aims to replicate the manual adjustments of the human-in-the-loop method with an automated process, hence, drastically reducing labeling effort. Using the application of detecting partially occluded apple tree segmentation, we compare manually labeled annotations, self-training, human-in-the-loop, and Automating-the-Loop methods in both the quality of the trained convolutional neural networks, and the effort needed to create them. The convolutional neural network (U-Net) performance is analyzed using traditional metrics and a new metric, Complete Grid Scan, which promotes connectivity and low noise. It is shown that in our application, the new Automating-the-Loop method greatly reduces the labeling effort while producing comparable performance to both human-in-the-loop and complete manual labeling methods.
CVOct 14, 2020
Semantic Segmentation for Partially Occluded Apple Trees Based on Deep LearningZijue Chen, David Ting, Rhys Newbury et al.
Fruit tree pruning and fruit thinning require a powerful vision system that can provide high resolution segmentation of the fruit trees and their branches. However, recent works only consider the dormant season, where there are minimal occlusions on the branches or fit a polynomial curve to reconstruct branch shape and hence, losing information about branch thickness. In this work, we apply two state-of-the-art supervised learning models U-Net and DeepLabv3, and a conditional Generative Adversarial Network Pix2Pix (with and without the discriminator) to segment partially occluded 2D-open-V apple trees. Binary accuracy, Mean IoU, Boundary F1 score and Occluded branch recall were used to evaluate the performances of the models. DeepLabv3 outperforms the other models at Binary accuracy, Mean IoU and Boundary F1 score, but is surpassed by Pix2Pix (without discriminator) and U-Net in Occluded branch recall. We define two difficulty indices to quantify the difficulty of the task: (1) Occlusion Difficulty Index and (2) Depth Difficulty Index. We analyze the worst 10 images in both difficulty indices by means of Branch Recall and Occluded Branch Recall. U-Net outperforms the other two models in the current metrics. On the other hand, Pix2Pix (without discriminator) provides more information on branch paths, which are not reflected by the metrics. This highlights the need for more specific metrics on recovering occluded information. Furthermore, this shows the usefulness of image-transfer networks for hallucination behind occlusions. Future work is required to further enhance the models to recover more information from occlusions such that this technology can be applied to automating agricultural tasks in a commercial environment.
ROJun 2, 2020
Object-Independent Human-to-Robot Handovers using Real Time Robotic VisionPatrick Rosenberger, Akansel Cosgun, Rhys Newbury et al.
We present an approach for safe and object-independent human-to-robot handovers using real time robotic vision and manipulation. We aim for general applicability with a generic object detector, a fast grasp selection algorithm and by using a single gripper-mounted RGB-D camera, hence not relying on external sensors. The robot is controlled via visual servoing towards the object of interest. Putting a high emphasis on safety, we use two perception modules: human body part segmentation and hand/finger segmentation. Pixels that are deemed to belong to the human are filtered out from candidate grasp poses, hence ensuring that the robot safely picks the object without colliding with the human partner. The grasp selection and perception modules run concurrently in real-time, which allows monitoring of the progress. In experiments with 13 objects, the robot was able to successfully take the object from the human in 81.9% of the trials.
ROMay 2, 2020
Supportive Actions for Manipulation in Human-Robot Coworker TeamsShray Bansal, Rhys Newbury, Wesley Chan et al.
The increasing presence of robots alongside humans, such as in human-robot teams in manufacturing, gives rise to research questions about the kind of behaviors people prefer in their robot counterparts. We term actions that support interaction by reducing future interference with others as supportive robot actions and investigate their utility in a co-located manipulation scenario. We compare two robot modes in a shared table pick-and-place task: (1) Task-oriented: the robot only takes actions to further its own task objective and (2) Supportive: the robot sometimes prefers supportive actions to task-oriented ones when they reduce future goal-conflicts. Our experiments in simulation, using a simplified human model, reveal that supportive actions reduce the interference between agents, especially in more difficult tasks, but also cause the robot to take longer to complete the task. We implemented these modes on a physical robot in a user study where a human and a robot perform object placement on a shared table. Our results show that a supportive robot was perceived as a more favorable coworker by the human and also reduced interference with the human in the more difficult of two scenarios. However, it also took longer to complete the task highlighting an interesting trade-off between task-efficiency and human-preference that needs to be considered before designing robot behavior for close-proximity manipulation scenarios.
ROApr 1, 2020
Learning to Place Objects onto Flat Surfaces in Upright OrientationsRhys Newbury, Kerry He, Akansel Cosgun et al.
We study the problem of placing a grasped object on an empty flat surface in an upright orientation, such as placing a cup on its bottom rather than on its side. We aim to find the required object rotation such that when the gripper is opened after the object makes contact with the surface, the object would be stably placed in the upright orientation. We iteratively use two neural networks. At every iteration, we use a convolutional neural network to estimate the required object rotation, which is executed by the robot, and then a separate convolutional neural network to estimate the quality of a placement in its current orientation. Our approach places previously unseen objects in upright orientations with a success rate of 98.1% in free space and 90.3% with a simulated robotic arm, using a dataset of 50 everyday objects in simulation experiments. Real-world experiments were performed, which achieved an 88.0% success rate, which serves as a proof-of-concept for direct sim-to-real transfer.
ROApr 11, 2019
Learning to Take Good Pictures of People with a Robot PhotographerRhys Newbury, Akansel Cosgun, Mehmet Koseoglu et al.
We present a robotic system capable of navigating autonomously by following a line and taking good quality pictures of people. When a group of people are detected, the robot rotates towards them and then back to line while continuously taking pictures from different angles. Each picture is processed in the cloud where its quality is estimated in a two-stage algorithm. First, features such as the face orientation and likelihood of facial emotions are input to a fully connected neural network to assign a quality score to each face. Second, a representation is extracted by abstracting faces from the image and it is input to a to Convolutional Neural Network (CNN) to classify the quality of the overall picture. We collected a dataset in which a picture was labeled as good quality if subjects are well-positioned in the image and oriented towards the camera with a pleasant expression. Our approach detected the quality of pictures with 78.4% accuracy in this dataset and received a better mean user rating (3.71/5) than a heuristic method that uses photographic composition procedures in a study where 97 human judges rated each picture. A statistical analysis against the state-of-the-art verified the quality of the resulting pictures.