Uncovering bias in the PlantVillage dataset
This work highlights a critical bias in a widely used agricultural dataset. The bias can make reported model performance misleading and can affect farmers and researchers who rely on accurate disease detection.
The study investigated bias in the PlantVillage dataset for plant disease detection by training a model on only 8 pixels taken from the image backgrounds. The model reached 49.0% accuracy, compared with 2.6% for random guessing, which shows that the dataset contains label-correlated noise that deep learning models can exploit.
We report our investigation into the use of the popular PlantVillage dataset for training deep-learning-based plant disease detection models. We trained a machine learning model using only 8 pixels from the PlantVillage image backgrounds. The model achieved 49.0% accuracy on the held-out test set, well above the random-guessing accuracy of 2.6%. This result indicates that the PlantVillage dataset contains noise correlated with the labels, and that deep learning models can easily exploit this bias to make predictions. Possible approaches to alleviate this problem are discussed.
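The background-pixel experiment described above can be illustrated with a minimal sketch. The text does not say which 8 pixels were used or which classifier was trained, so the layout here (4 corners plus 4 edge midpoints) and the helper names `border_pixel_coords`, `border_pixel_features`, and `chance_accuracy` are assumptions made for illustration only. The class count of 38 is likewise an assumption, chosen because 1/38 is consistent with the reported 2.6% chance level.

```python
from typing import List, Tuple


def border_pixel_coords(height: int, width: int) -> List[Tuple[int, int]]:
    """Return 8 (row, col) positions on the image border.

    ASSUMPTION: 4 corners + 4 edge midpoints. The source only says
    "8 pixels from the image backgrounds"; this layout is hypothetical.
    """
    top, bottom = 0, height - 1
    left, right = 0, width - 1
    mid_r, mid_c = height // 2, width // 2
    return [
        (top, left), (top, mid_c), (top, right),
        (mid_r, left), (mid_r, right),
        (bottom, left), (bottom, mid_c), (bottom, right),
    ]


def border_pixel_features(image) -> List[int]:
    """Flatten the 8 border pixels of an H x W x C image into one vector.

    `image` may be a nested list or any array-like that supports
    image[r][c] indexing and yields the channel values of that pixel.
    """
    height, width = len(image), len(image[0])
    features: List[int] = []
    for r, c in border_pixel_coords(height, width):
        features.extend(int(v) for v in image[r][c])
    return features


def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy of uniform random guessing over n_classes."""
    return 1.0 / n_classes


if __name__ == "__main__":
    # Toy 4 x 6 RGB "image" whose pixel values encode their own position.
    toy = [[[r, c, r + c] for c in range(6)] for r in range(4)]
    print(len(border_pixel_features(toy)))  # 8 pixels x 3 channels
    # ASSUMPTION: 38 classes, consistent with the reported 2.6% baseline.
    print(f"{100 * chance_accuracy(38):.1f}%")
```

Feature vectors built this way could be passed to any off-the-shelf classifier. If a model trained on them scores far above `chance_accuracy`, as the 49.0% result does, the labels must be partly predictable from the background alone, since these pixels are unlikely to carry information about the disease itself.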