CL

Aug 5, 2021

Exploring Out-of-Distribution Generalization in Text Classifiers Trained on Tobacco-3482 and RVL-CDIP

Stefan Larson, Navtej Singh, Saarthak Maheshwari, Shanti Stewart, Uma Krishnaswamy

arXiv:2108.02684v1

0.55 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the robustness of document analysis systems, but its contribution is incremental: it evaluates models trained on existing datasets rather than introducing new methods.

The paper investigates how text classifiers trained on the Tobacco-3482 and RVL-CDIP datasets generalize to out-of-distribution documents. It finds that models trained on the smaller Tobacco-3482 perform poorly, while those trained on the larger RVL-CDIP show smaller performance drops.

To be robust enough for widespread adoption, document analysis systems involving machine learning models must be able to respond correctly to inputs that fall outside of the data distribution that was used to generate the data on which the models were trained. This paper explores the ability of text classifiers trained on standard document classification datasets to generalize to out-of-distribution documents at inference time. We take the Tobacco-3482 and RVL-CDIP datasets as a starting point and generate new out-of-distribution evaluation datasets in order to analyze the generalization performance of models trained on these standard datasets. We find that models trained on the smaller Tobacco-3482 dataset perform poorly on our new out-of-distribution data, while text classification models trained on the larger RVL-CDIP exhibit smaller performance drops.
