Better STEP, a format and dataset for boundary representation
This work addresses a bottleneck for researchers and industry users in CAD and machine learning: it makes boundary representation data readable without a CAD kernel, incrementally improving data accessibility and processing efficiency.
The paper tackles the limited accessibility and high licensing costs of CAD boundary representation data stored in STEP format. It introduces an open HDF5-based format and dataset, along with an open-source library, to ease integration into learning pipelines. Effectiveness is demonstrated by converting existing datasets and by running standard use cases.
Boundary representation (B-rep) generated from computer-aided design (CAD) is widely used in industry, with several large datasets available. However, the data in these datasets is stored in STEP format, which requires a CAD kernel to read and process. This severely limits their use in large learning pipelines, because the high cost of per-node licenses makes deployment on computing clusters impractical. This paper introduces an alternative representation of STEP files based on HDF5, an open, cross-platform format, together with a corresponding dataset and an open-source library to query and process it. Our Python package also provides standard functionality such as sampling, normals, and curvature to ease integration into existing pipelines. To demonstrate the effectiveness of our format, we converted the Fusion 360 dataset and the ABC dataset. We developed four standard use cases (normal estimation, denoising, surface reconstruction, and segmentation) to assess the integrity of the data and its compliance with the original STEP files.
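To make the notions of sampling and normals on a B-rep face concrete, the following is a minimal sketch in plain Python. It is not the API of the package described above: the function names (`cylinder_point`, `surface_normal`, `sample_patch`) and the choice of a unit cylinder as the example face are assumptions made purely for illustration. The sketch samples a regular UV grid on a parametric surface and estimates the unit normal at each sample from finite differences of the parametrization.

```python
import math


def cylinder_point(u, v, radius=1.0):
    """Hypothetical example face: cylinder around the z-axis (u = angle, v = height)."""
    return (radius * math.cos(u), radius * math.sin(u), v)


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in a))
    return tuple(c / n for c in a)


def surface_normal(surface, u, v, h=1e-5):
    """Unit normal as the normalized cross product of the two partial
    derivatives, each estimated with central finite differences."""
    du = tuple((a - b) / (2 * h) for a, b in zip(surface(u + h, v), surface(u - h, v)))
    dv = tuple((a - b) / (2 * h) for a, b in zip(surface(u, v + h), surface(u, v - h)))
    return normalize(cross(du, dv))


def sample_patch(surface, u_range, v_range, n_u, n_v):
    """Return (point, normal) pairs on a regular n_u x n_v grid over a UV rectangle."""
    samples = []
    for i in range(n_u):
        u = u_range[0] + (u_range[1] - u_range[0]) * i / (n_u - 1)
        for j in range(n_v):
            v = v_range[0] + (v_range[1] - v_range[0]) * j / (n_v - 1)
            samples.append((surface(u, v), surface_normal(surface, u, v)))
    return samples


# Half of a unit cylinder, height 2, sampled on an 8 x 4 grid.
samples = sample_patch(cylinder_point, (0.0, math.pi), (0.0, 2.0), 8, 4)
```

On a unit cylinder the outward normal at a point (x, y, z) is (x, y, 0), which gives a simple way to check the estimate. In a learning pipeline, such point and normal pairs would typically serve as ground truth for tasks like the normal estimation use case mentioned above.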