On Certifying and Improving Generalization to Unseen Domains
This work addresses unreliable evaluation in domain generalization (DG), a problem that directly affects machine learning practitioners. It offers a certification framework for assessing and improving model robustness. The contribution is incremental in that it builds on existing DG methods.
The paper tackles the challenge of evaluating and improving domain generalization (DG) methods, which aim to maintain performance on unseen domains. It shows that benchmark datasets may not reflect real-world variability. It then proposes a certification framework, based on distributionally robust optimization, that efficiently certifies worst-case performance, together with a training algorithm that provably improves it. Empirical results show a significant reduction in worst-case loss without a major performance drop on benchmarks.
Domain Generalization (DG) aims to learn models whose performance remains high on unseen domains encountered at test time by using data from multiple related source domains. Many existing DG algorithms reduce the divergence between source distributions in a representation space so that the unseen domain is potentially aligned close to the sources. This is motivated by analysis that explains generalization to unseen domains using a distributional distance (such as the Wasserstein distance) to the sources. However, due to the openness of the DG objective, it is challenging to evaluate DG algorithms comprehensively using a few benchmark datasets. In particular, we demonstrate that the accuracy of models trained with DG methods varies significantly across unseen domains generated from popular benchmark datasets. This highlights that the performance of DG methods on a few benchmark datasets may not be representative of their performance on unseen domains in the wild. To overcome this roadblock, we propose a universal certification framework based on distributionally robust optimization (DRO) that can efficiently certify the worst-case performance of any DG method. This enables a data-independent evaluation of a DG method that is complementary to empirical evaluation on benchmark datasets. Furthermore, we propose a training algorithm that can be used with any DG method to provably improve its certified performance. Our empirical evaluation demonstrates the effectiveness of our method at significantly improving the worst-case loss (i.e., reducing the risk of failure of these models in the wild) without incurring a significant performance drop on benchmark datasets.
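The certification idea can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a fixed linear classifier `w` on top of a representation `z`, a logistic loss, a squared Euclidean transport cost, and a Wasserstein ball of radius `rho` around the empirical source distribution. Under these assumptions, the standard Lagrangian dual of Wasserstein DRO gives, for any `gamma >= 0`, the upper bound `gamma * rho + mean_i sup_z [loss(z) - gamma * ||z - z_i||^2]` on the worst-case loss. All names here (`certify_worst_case_loss`, `gamma`, `rho`) are hypothetical.

```python
import numpy as np

def logistic_loss(w, z, y):
    # Per-sample logistic loss; z has shape (n, d), labels y are in {-1, +1}.
    return np.logaddexp(0.0, -y * (z @ w))

def logistic_loss_grad_z(w, z, y):
    # Gradient of the per-sample loss with respect to the representation z.
    margin = y * (z @ w)
    sigma = np.exp(-np.logaddexp(0.0, margin))  # stable sigmoid(-margin)
    return (-y * sigma)[:, None] * w[None, :]

def certify_worst_case_loss(w, z0, y, rho, gamma, steps=200):
    # Assumption: gamma > ||w||^2 / 8, so the inner problem is strongly
    # concave in z and gradient ascent with step 1 / (2 * gamma) converges.
    lr = 1.0 / (2.0 * gamma)
    z = z0.copy()
    for _ in range(steps):
        # Ascent on loss(z) - gamma * ||z - z0||^2, one problem per sample.
        grad = logistic_loss_grad_z(w, z, y) - 2.0 * gamma * (z - z0)
        z = z + lr * grad
    surrogate = logistic_loss(w, z, y) - gamma * np.sum((z - z0) ** 2, axis=1)
    # Dual upper bound on the worst-case loss within the Wasserstein ball.
    return gamma * rho + surrogate.mean()

# Illustrative synthetic data (hypothetical, not from the paper).
rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])
z0 = rng.normal(size=(64, 2))
y = np.where(z0 @ w > 0, 1.0, -1.0)

gamma = 1.0  # exceeds ||w||^2 / 8 = 0.625
empirical = logistic_loss(w, z0, y).mean()
certified = certify_worst_case_loss(w, z0, y, rho=0.1, gamma=gamma)
print(f"empirical loss: {empirical:.4f}, certified worst-case bound: {certified:.4f}")
```

In this sketch the bound is valid, up to the inner optimization error, only because `gamma` is large enough to make the inner maximization concave. For smaller `gamma`, gradient ascent may stop at a local maximum and the returned value is no longer a certificate. A tighter bound would also minimize over `gamma`, which is omitted here for brevity.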