SIEVE: Towards Verifiable Certification for Code-datasets
The paper addresses the fragmented, costly quality assurance of code datasets used by code agents and in empirical software engineering, though the contribution appears incremental because it builds on existing dataset card concepts.
The paper tackles the problem of unverifiable quality guarantees in public code datasets by introducing SIEVE, a community-driven framework that replaces static dataset cards with machine-readable, verifiable certificates called Confidence Cards, which provide anytime-valid statistical bounds.
Code agents and empirical software engineering rely on public code datasets, yet these datasets lack verifiable quality guarantees. Static 'dataset cards' inform, but they are neither auditable nor backed by statistical guarantees, making it difficult to attest to dataset quality. Teams build isolated, ad-hoc cleaning pipelines, which fragments effort and raises cost. We present SIEVE, a community-driven framework that turns per-property checks into Confidence Cards: machine-readable, verifiable certificates with anytime-valid statistical bounds. We outline a research plan to bring SIEVE to maturity, replacing narrative cards with anytime-verifiable certification. This shift is expected to lower quality-assurance costs and increase trust in code datasets.
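To make "anytime-valid statistical bounds" concrete, the following is a minimal sketch, not SIEVE's actual method, which the abstract does not specify. It assumes a per-property check is a pass/fail test on items sampled independently and uniformly (with replacement) from the dataset. It uses a standard Hoeffding bound with a union bound over sample sizes, so the reported upper bound on the failure rate holds no matter when sampling stops. The names `hoeffding_radius`, `ConfidenceCard`, `certify`, and all card fields are hypothetical.

```python
import json
import math
from dataclasses import asdict, dataclass


def hoeffding_radius(n: int, delta: float) -> float:
    """Half-width that holds simultaneously for every n >= 1.

    Spends delta / (n * (n + 1)) of the error budget at step n; these
    terms sum to delta over all n, so a union bound gives time-uniform
    coverage of at least 1 - delta.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    delta_n = delta / (n * (n + 1))
    return math.sqrt(math.log(2.0 / delta_n) / (2.0 * n))


@dataclass
class ConfidenceCard:
    # Hypothetical certificate fields; the source does not define a schema.
    property_name: str
    samples_checked: int
    failures: int
    delta: float
    failure_rate_upper_bound: float


def certify(property_name, outcomes, delta=0.05):
    """Build a card from an iterable of booleans (True = item passed)."""
    n = 0
    failures = 0
    upper = 1.0  # a failure rate can never exceed 1
    for passed in outcomes:
        n += 1
        if not passed:
            failures += 1
        bound = failures / n + hoeffding_radius(n, delta)
        # The bounds hold jointly for all n, so their running minimum
        # is itself a valid upper bound.
        upper = min(upper, bound)
    return ConfidenceCard(property_name, n, failures, delta, upper)


# Usage: 1000 sampled items, of which 10 fail a hypothetical "compiles" check.
outcomes = [i % 100 != 0 for i in range(1000)]
card = certify("compiles", outcomes)
print(json.dumps(asdict(card), indent=2))  # machine-readable certificate
```

Because the bound is valid at every sample size, a verifier could in principle draw further samples and tighten the card without invalidating the guarantee. Tighter confidence sequences exist; the union-bound construction is chosen here only for simplicity.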