CV Mar 26, 2023
VisDA 2022 Challenge: Domain Adaptation for Industrial Waste Sorting
Dina Bashkirova, Samarth Mishra, Diala Lteif et al.
Label-efficient and reliable semantic segmentation is essential for many real-life applications, especially for industrial settings with high visual diversity, such as waste sorting. In industrial waste sorting, one of the biggest challenges is the extreme diversity of the input stream depending on factors like the location of the sorting facility, the equipment available in the facility, and the time of year, all of which significantly impact the composition and visual appearance of the waste stream. These changes in the data are called "visual domains", and label-efficient adaptation of models to such domains is needed for successful semantic segmentation of industrial waste. To test the abilities of computer vision models on this task, we present the VisDA 2022 Challenge on Domain Adaptation for Industrial Waste Sorting. Our challenge incorporates a fully-annotated waste sorting dataset, ZeroWaste, collected from two real material recovery facilities in different locations and seasons, as well as a novel procedurally generated synthetic waste sorting dataset, SynthWaste. In this competition, we aim to answer two questions: 1) can we leverage domain adaptation techniques to minimize the domain gap? and 2) can synthetic data augmentation improve performance on this task and help adapt to changing data distributions? The results of the competition show that industrial waste detection poses a real domain adaptation problem, that domain generalization techniques such as augmentations, ensembling, etc., improve the overall performance on the unlabeled target domain examples, and that leveraging synthetic data effectively remains an open problem. See https://ai.bu.edu/visda-2022/
RO May 1, 2021
ECNNs: Ensemble Learning Methods for Improving Planar Grasp Quality Estimation
Fadi Alladkani, James Akl, Berk Calli
We present an ensemble learning methodology that combines multiple existing robotic grasp synthesis algorithms and obtains a success rate significantly better than that of the individual algorithms. The methodology treats the grasping algorithms as "experts" providing grasp "opinions". An Ensemble Convolutional Neural Network (ECNN) is trained using a Mixture of Experts (MOE) model that integrates these opinions and determines the final grasping decision. The ECNN introduces minimal computational overhead, so the network can run virtually as fast as the slowest expert. We test this architecture using open-source algorithms from the literature, adopting GQCNN 4.0, GGCNN, and a custom variation of GGCNN as experts, and obtain a 6% increase in grasp success on the Cornell Dataset compared to the best-performing individual algorithm. The performance of the method is also demonstrated using a Franka Emika Panda arm.
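The idea of weighting expert "opinions" can be illustrated with a minimal sketch of a generic mixture-of-experts combination. This is not the authors' ECNN: in the paper the combination is learned by a network, whereas here the gate logits are simply passed in as numbers. The function names, the per-candidate score layout, and the softmax gate are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_expert_opinions(expert_scores, gate_logits):
    """Blend per-expert grasp quality scores with softmax gate weights.

    expert_scores[i][j] is the (hypothetical) quality score that expert i
    assigns to candidate grasp j. gate_logits[i] is the unnormalized trust
    placed in expert i; a trained gating network would produce these.
    Returns the index of the best candidate and the blended scores.
    """
    weights = softmax(gate_logits)
    n_candidates = len(expert_scores[0])
    combined = [
        sum(w * scores[j] for w, scores in zip(weights, expert_scores))
        for j in range(n_candidates)
    ]
    best = max(range(n_candidates), key=combined.__getitem__)
    return best, combined


# Two hypothetical experts scoring three candidate grasps.
scores = [[0.9, 0.2, 0.4],
          [0.1, 0.8, 0.5]]
best_if_trust_first, _ = combine_expert_opinions(scores, [2.0, 0.0])
best_if_trust_second, _ = combine_expert_opinions(scores, [0.0, 2.0])
```

Shifting the gate toward a different expert changes which candidate grasp wins, which is the behavior a mixture-of-experts model relies on.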
RO Apr 17, 2025
ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation
Hongyu Li, James Akl, Srinath Sridhar et al.
Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to the visual models, our approach overcomes some drastic failure modes while tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
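For readers unfamiliar with the ADD and ADD-S metrics cited above, the following is a minimal sketch of their commonly used definitions: ADD averages the distance between corresponding model points under the ground-truth and predicted poses, while ADD-S averages the distance from each point to its closest counterpart, which makes it tolerant of object symmetries. The helper names are hypothetical, the direction of the closest-point search is one common convention, and the AUC reported in the abstract (computed by sweeping a distance threshold) is not shown.

```python
import math


def transform(points, rotation, translation):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to 3D points."""
    return [
        tuple(
            sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_metric(points, r_gt, t_gt, r_pred, t_pred):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(points, r_gt, t_gt)
    pred = transform(points, r_pred, t_pred)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(points)


def add_s_metric(points, r_gt, t_gt, r_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its nearest predicted point."""
    gt = transform(points, r_gt, t_gt)
    pred = transform(points, r_pred, t_pred)
    return sum(min(math.dist(a, b) for b in pred) for a in gt) / len(points)


# A two-point "object" that is symmetric under a 180-degree turn about z.
model = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
half_turn_z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]
zero = (0.0, 0.0, 0.0)

add_value = add_metric(model, identity, zero, half_turn_z, zero)
add_s_value = add_s_metric(model, identity, zero, half_turn_z, zero)
```

In this toy case the predicted pose is visually indistinguishable from the ground truth, so ADD-S is zero while ADD penalizes the swapped point correspondence.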
CV Mar 13
Show, Don't Tell: Detecting Novel Objects by Watching Human Videos
James Akl, Jose Nicolas Avendano Arbelaez, James Barabas et al.
How can a robot quickly identify and recognize new objects shown to it during a human demonstration? Existing closed-set object detectors frequently fail at this because the objects are out-of-distribution. While open-set detectors (e.g., VLMs) sometimes succeed, they often require expensive and tedious human-in-the-loop prompt engineering to uniquely recognize novel object instances. In this paper, we present a self-supervised system that eliminates the need for tedious language descriptions and expensive prompt engineering by training a bespoke object detector on an automatically created dataset, supervised by the human demonstration itself. In our approach, "Show, Don't Tell," we show the detector the specific objects of interest during the demonstration, rather than telling the detector about these objects via complex language descriptions. By bypassing language altogether, this paradigm enables us to quickly train bespoke detectors tailored to the relevant objects observed in human task demonstrations. We develop an integrated on-robot system to deploy our "Show, Don't Tell" paradigm of automatic dataset creation and novel object detection on a real-world robot. Empirical results demonstrate that our pipeline significantly outperforms state-of-the-art detection and recognition methods for manipulated objects, leading to improved task completion for the robot.
CV Jun 4, 2021
ZeroWaste Dataset: Towards Deformable Object Segmentation in Cluttered Scenes
Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu et al.
Less than 35% of recyclable waste is actually being recycled in the US, which leads to increased soil and sea pollution and is one of the major concerns of environmental researchers as well as the general public. At the heart of the problem are the inefficiencies of the waste sorting process (separating paper, plastic, metal, glass, etc.) due to the extremely complex and cluttered nature of the waste stream. Recyclable waste detection poses a unique computer vision challenge as it requires detection of highly deformable and often translucent objects in cluttered scenes without the kind of context information usually present in human-centric datasets. This challenging computer vision task currently lacks suitable datasets or methods in the available literature. In this paper, we take a step towards computer-aided waste detection and present the first in-the-wild industrial-grade waste detection and segmentation dataset, ZeroWaste. We believe that ZeroWaste will catalyze research in object detection and semantic segmentation in extreme clutter as well as applications in the recycling domain. Our project page can be found at http://ai.bu.edu/zerowaste/.