Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks
The method provides a fast, generalizable tool for autonomous systems that interact with machinery, though the contribution is incremental in that it combines existing foundation models.
The paper tackles few-shot semantic segmentation of machinery parts by integrating the foundation models CLIPSeg and SAM with SuperPoint and a GCN. It achieves a J&F score of 92.2 on real data with 10 synthetic support samples and 71.5 on DAVIS 2017 with three support samples, with training taking under five minutes on consumer GPUs.
This paper proposes a novel approach to few-shot semantic segmentation for machinery with multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the foundation models CLIPSeg and Segment Anything Model (SAM) with the interest point detector SuperPoint and a graph convolutional network (GCN) to accurately segment machinery parts. Given 1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset depicting a truck-mounted loading crane, achieves effective segmentation across various levels of detail. Training times are kept under five minutes on consumer GPUs. The model generalizes robustly from synthetic to real data, achieving a $J\&F$ score of 92.2 on real data using 10 synthetic support samples. When benchmarked on the DAVIS 2017 dataset, it achieves a $J\&F$ score of 71.5 in semi-supervised video segmentation with three support samples. The method's fast training times and effective generalization to real data make it a valuable tool for autonomous systems interacting with machinery and infrastructure, and illustrate the potential of combined and orchestrated foundation models for few-shot segmentation tasks.
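For readers unfamiliar with the $J\&F$ metric quoted above, the following is a minimal sketch of how it is composed: $J$ is the region similarity (intersection over union of predicted and ground-truth masks), $F$ is a boundary F-measure, and $J\&F$ is their mean, usually reported on a 0 to 100 scale. The function names are illustrative and not taken from the paper. The boundary term here is a simplification that matches boundary pixels exactly, whereas the DAVIS benchmark matches boundaries within a small tolerance, so this sketch will generally give lower $F$ values than the official evaluation.

```python
import numpy as np

def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

def boundary(mask):
    """Foreground pixels that have at least one background 4-neighbour."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    # A pixel is interior if all four of its neighbours are foreground.
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def boundary_f(pred, gt):
    """Simplified contour accuracy F (assumption: exact pixel match, no tolerance)."""
    bp, bg = boundary(pred), boundary(gt)
    n_pred, n_gt = bp.sum(), bg.sum()
    if n_pred == 0 and n_gt == 0:
        return 1.0
    tp = np.logical_and(bp, bg).sum()
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gt if n_gt else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def j_and_f(pred, gt):
    """Mean of region similarity J and boundary accuracy F, in [0, 1]."""
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))

# Toy example: a 4x4 square and the same square shifted one pixel to the right.
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 3:7] = True
score = 100 * j_and_f(pred, gt)  # scale to 0-100 as in reported results
```

In a multi-part setting such as the one described, a score like this would typically be computed per part (and per frame for video) and then averaged.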