From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models
This work highlights a critical issue for software security practitioners: performance reported on a widely used benchmark is inflated. Its contribution to the broader problem of data leakage is incremental.
The study investigates data leakage in a benchmark dataset for AI-based secret detectors and shows that duplicated samples split across training and test sets inflate performance metrics, giving a misleading picture of real-world effectiveness.
Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize. We investigate duplication in a widely used benchmark dataset of hard-coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.
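The leakage mechanism described above can be illustrated with a minimal sketch. It is not the procedure used in the study. It assumes that samples are plain strings and that two samples count as duplicates when they are identical after whitespace is collapsed. The function names (`normalize`, `fingerprint`, `leaked_fraction`, `group_split`) are hypothetical and introduced only for this example. The sketch measures how many test samples also appear in the training set, and builds a split in which every group of duplicates stays on one side.

```python
import hashlib
import random
from collections import defaultdict


def normalize(sample: str) -> str:
    """Collapse runs of whitespace (assumed duplicate criterion, not the paper's)."""
    return " ".join(sample.split())


def fingerprint(sample: str) -> str:
    """Stable hash of the normalized sample, used as a duplicate-group key."""
    return hashlib.sha256(normalize(sample).encode("utf-8")).hexdigest()


def leaked_fraction(train, test):
    """Fraction of test samples whose fingerprint also occurs in the training set."""
    if not test:
        return 0.0
    train_fps = {fingerprint(s) for s in train}
    return sum(fingerprint(s) in train_fps for s in test) / len(test)


def group_split(samples, test_ratio=0.2, seed=0):
    """Split by duplicate group so that no group spans both train and test."""
    groups = defaultdict(list)
    for s in samples:
        groups[fingerprint(s)].append(s)

    # Shuffle group keys (not individual samples) with a fixed seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(len(keys) * test_ratio))
    test_keys = set(keys[:n_test])

    train = [s for k in keys if k not in test_keys for s in groups[k]]
    test = [s for k in keys if k in test_keys for s in groups[k]]
    return train, test


if __name__ == "__main__":
    data = [
        'api_key = "abc"',
        'api_key   =   "abc"',  # duplicate up to whitespace
        'token = "xyz"',
        "x = 1",
        "y = 2",
        "x = 1",  # exact duplicate
    ]
    train, test = group_split(data, test_ratio=0.25)
    print("leaked fraction after group split:", leaked_fraction(train, test))
```

Exact matching after whitespace normalization only catches the simplest duplicates. Highly similar samples, which the abstract also mentions, would need a near-duplicate measure such as token-level similarity, which this sketch does not attempt.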