Text Characterization Toolkit
This tool addresses the need for more thorough evaluation practices in NLP research to mitigate biases and spurious correlations in benchmarks.
The authors tackled the problem of superficial model evaluation in NLP by developing a toolkit that enables deeper analysis of datasets and model behavior, demonstrating its utility in predicting difficult examples and identifying biases across three domains.
In NLP, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, with little deeper analysis. We argue that deeper analysis of results should become the de facto standard when presenting new models or benchmarks, especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations. We present a tool that researchers can use to study properties of a dataset and the influence of those properties on their models' behaviour. Our Text Characterization Toolkit includes both an easy-to-use annotation tool and off-the-shelf scripts that can be used for specific analyses. We also present use cases from three different domains: we use the tool to predict which examples are difficult for given well-known trained models, and to identify (potentially harmful) biases and heuristics that are present in a dataset.
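To make the kind of analysis described above concrete, the following is a minimal sketch, not the toolkit's actual API. It assumes that each example is annotated with a few simple surface characteristics and that those characteristics are then correlated with a per-example model outcome (1 for correct, 0 for incorrect). The function names `characterize`, `pearson`, and `correlate_with_outcome`, and the three characteristics chosen, are hypothetical and serve only as an illustration.

```python
import math


def characterize(text):
    """Compute a few simple surface characteristics of a text (illustrative only)."""
    words = text.split()
    n = len(words)
    return {
        "word_count": n,
        "mean_word_length": sum(len(w) for w in words) / n if n else 0.0,
        # Ratio of distinct (lowercased) words to total words.
        "type_token_ratio": len({w.lower() for w in words}) / n if n else 0.0,
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlate_with_outcome(texts, outcomes):
    """Correlate each characteristic with a per-example outcome (e.g. 1 = correct)."""
    rows = [characterize(t) for t in texts]
    return {
        name: pearson([row[name] for row in rows], outcomes)
        for name in rows[0]
    }
```

A characteristic that correlates strongly with model correctness (for instance, accuracy dropping as `word_count` grows) would be a candidate for closer inspection, either as a source of difficulty or as a heuristic the model may be exploiting. Correlation alone does not establish either.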