CV, Apr 30, 2025

Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection

Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu

arXiv:2504.21614v2, 2 citations, h-index: 1, Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses the lack of openly available data engines for iterative model improvement in Intelligent Transportation Systems (ITS), which benefits researchers and the open-source community. The contribution is incremental, since it builds on existing proprietary systems.

The paper tackles the challenge of selecting and labeling data for machine learning models, especially for detecting long-tail classes in Intelligent Transportation Systems. It presents the Mcity Data Engine, an open-source system whose modules cover the complete data-based development cycle. All code is publicly available.

With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine
