Scalable Data Balancing for Unlabeled Satellite Imagery
This work addresses the challenge of handling data imbalance in large-scale unlabeled datasets such as NASA's Earth Imagery. The contribution is incremental, as it adapts existing balancing concepts to unlabeled contexts.
The paper tackles the problem of data imbalance in unlabeled satellite imagery by introducing an iterative method that uses image embeddings as a proxy for labels to balance data, resulting in increased overall accuracy.
Data imbalance is a ubiquitous problem in machine learning. In large-scale annotated datasets, imbalance is either mitigated manually, by undersampling frequent classes and oversampling rare ones, or planned for with imputation and augmentation techniques. In both cases, balancing requires labels; in other words, only annotated data can be balanced. Collecting fully annotated datasets is challenging, especially for large-scale satellite systems such as NASA's unlabeled 35 PB Earth Imagery dataset. Although this dataset is unlabeled, implicit properties of the data source, such as the distribution of land and water across the Earth's surface, let us hypothesize about its imbalance. We present a new iterative method to balance unlabeled data. Our method uses image embeddings as a proxy for image labels when balancing the data, and models trained on the balanced data achieve higher overall accuracy.
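The idea of using embeddings as a proxy for labels can be illustrated with a minimal sketch. The abstract does not specify how embeddings are grouped or how the iteration proceeds, so everything below is an assumption made for illustration: the embeddings are synthetic 2-D points standing in for "water" and "land" images, pseudo-labels come from a plain k-means written in NumPy, and balancing is a single round of undersampling each cluster to the size of the smallest one. The function names `kmeans` and `balance_by_pseudo_label` are hypothetical and are not taken from the paper.

```python
import numpy as np


def kmeans(embeddings, k, n_iter=20, seed=0):
    """Plain k-means; returns one pseudo-label (cluster index) per embedding."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (fancy indexing copies).
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center: (n, k).
        dists = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels


def balance_by_pseudo_label(embeddings, k=2, seed=0):
    """Undersample every non-empty cluster to the size of the smallest one.

    Assumption: cluster membership stands in for the missing class label.
    """
    labels = kmeans(embeddings, k, seed=seed)
    rng = np.random.default_rng(seed)
    counts = np.bincount(labels, minlength=k)
    cap = counts[counts > 0].min()
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) > 0:
            keep.append(rng.choice(idx, size=cap, replace=False))
    return np.sort(np.concatenate(keep)), labels


# Synthetic, imbalanced stand-in for image embeddings (hypothetical data):
# 900 points near (0, 0) and 100 points near (10, 10).
data_rng = np.random.default_rng(42)
embeddings = np.vstack([
    data_rng.normal(loc=0.0, scale=1.0, size=(900, 2)),
    data_rng.normal(loc=10.0, scale=1.0, size=(100, 2)),
])

keep, labels = balance_by_pseudo_label(embeddings, k=2)
kept_counts = np.bincount(labels[keep], minlength=2)
print("pseudo-label counts before:", np.bincount(labels, minlength=2))
print("pseudo-label counts after: ", kept_counts)
```

This sketch covers only one balancing round on fixed embeddings. It does not attempt to reproduce the iterative procedure the abstract refers to, and in practice the embeddings would come from an image encoder rather than a random generator.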