Training Neural Networks on Data Sources with Unknown Reliability
The work addresses degraded neural network performance caused by unreliable data sources in supervised learning. The contribution is incremental: it extends existing methods for noisy data by incorporating source labels.
The paper tackles the problem of training neural networks on data from multiple sources of unknown and varying reliability. It proposes a dynamic re-weighting strategy that adjusts the number of training steps per source according to that source's estimated reliability. This significantly improves model performance on mixtures of reliable and unreliable sources and maintains performance when all sources are reliable.
When data is generated by multiple sources, conventional training methods update models assuming equal reliability for each source and do not consider individual data quality. However, in many applications, sources vary in reliability, which can degrade the performance of a neural network. A key issue is that the quality of the data from individual sources is often not known during training. Previous methods for training models in the presence of noisy data do not make use of the additional information that the source label can provide. Focusing on supervised learning, we aim to train neural networks on each data source for a number of steps proportional to the source's estimated reliability, using a dynamic re-weighting strategy motivated by likelihood tempering. This allows training on all sources during the warm-up and reduces learning on less reliable sources during the final training stages, when models have been shown to overfit to noise. We show through diverse experiments that this can significantly improve model performance when models are trained on mixtures of reliable and unreliable data sources, and maintain performance when models are trained on reliable sources only.
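The re-weighting idea described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the update rule (comparing each source's loss to the mean across sources), the `rate` step size, and the use of a per-source weight in [0, 1] that scales the loss are all assumptions made for illustration. The only parts taken from the text are the warm-up phase, in which every source trains at full weight, and the tempering-style down-weighting of sources estimated to be less reliable.

```python
def update_source_weights(weights, source_losses, step, warmup_steps, rate=0.1):
    """Return updated per-source weights (temperatures) in [0, 1].

    Hypothetical rule: after warm-up, a source whose mean loss is above the
    average over all sources is treated as less reliable and its weight is
    lowered; otherwise its weight is raised. Weights are clipped to [0, 1].
    """
    if step < warmup_steps:
        # Warm-up: keep training on every source with its current weight.
        return dict(weights)
    mean_loss = sum(source_losses.values()) / len(source_losses)
    updated = {}
    for source, weight in weights.items():
        delta = rate if source_losses[source] <= mean_loss else -rate
        updated[source] = min(1.0, max(0.0, weight + delta))
    return updated


def tempered_batch_loss(example_losses, source_ids, weights):
    """Average of per-example losses, each scaled by its source's weight.

    Scaling a negative log-likelihood by a factor in [0, 1] is the usual form
    of likelihood tempering; a weight of 0 removes the source from the update.
    """
    scaled = [weights[s] * loss for loss, s in zip(example_losses, source_ids)]
    return sum(scaled) / len(scaled)


# Illustrative values only: two sources, one with a much higher loss.
weights = {"source_a": 1.0, "source_b": 1.0}
losses = {"source_a": 0.2, "source_b": 1.5}
during_warmup = update_source_weights(weights, losses, step=0, warmup_steps=5)
after_warmup = update_source_weights(weights, losses, step=10, warmup_steps=5)
batch_loss = tempered_batch_loss(
    [1.0, 2.0], ["source_a", "source_b"], {"source_a": 1.0, "source_b": 0.5}
)
```

In this sketch a lower weight shrinks the gradient contribution of a source rather than literally skipping its training steps. Sampling batches from each source with probability proportional to its weight would be an alternative way to realise "steps proportional to reliability" under the same assumptions.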