CL · Oct 17, 2021

Quantifying the Task-Specific Information in Text-Based Classifications

Zining Zhu, Aparna Balagopalan, Marzyeh Ghassemi, Frank Rudzicz

arXiv:2110.08931v1 · 1.04 citations

Originality: Incremental advance

AI Analysis

This work helps researchers in NLP and machine learning assess dataset quality and model reliance on shortcuts. It is incremental, since it builds on existing concerns about dataset biases.

The paper develops an information-theoretic framework for quantifying task-specific information in text classification, that is, the linguistic knowledge required beyond superficial dataset shortcuts. It finds that classifying a sample in Multi-NLI requires about 0.4 nats more such information than in Quora Question Pair.

Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues, the "shortcuts" inherent in the datasets, do not contribute to the *task-specific information* (TSI) of the classification tasks. While it is essential to look at model performance, it is also important to understand the datasets. In this paper, we consider this question: apart from the information introduced by the shortcut features, how much task-specific information is required to classify a dataset? We formulate this quantity in an information-theoretic framework. While this quantity is hard to compute, we approximate it with a fast and stable method. TSI quantifies the amount of linguistic knowledge, modulo a set of predefined shortcuts, that contributes to classifying a sample from each dataset. This framework allows us to compare across datasets: apart from a set of "shortcut features", classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pair.
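To give a feel for the unit and the kind of quantity involved, here is a minimal Python sketch. It is not the authors' estimator, which the abstract does not specify. It assumes, purely for illustration, that the information beyond shortcuts can be approximated by the gap in average cross-entropy (in nats) between a shortcut-only classifier and a classifier that sees the full text. All function names and the toy probabilities are hypothetical.

```python
import math


def mean_cross_entropy_nats(probs, labels):
    """Average negative log-likelihood of the true labels.

    Uses the natural logarithm, so the result is in nats.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def information_beyond_shortcuts(shortcut_probs, full_probs, labels):
    """Illustrative proxy (an assumption, not the paper's method):
    cross-entropy of a shortcut-only model minus cross-entropy of a
    full-text model, per sample, in nats.
    """
    return (mean_cross_entropy_nats(shortcut_probs, labels)
            - mean_cross_entropy_nats(full_probs, labels))


def nats_to_bits(nats):
    """Convert nats to bits: 1 nat = 1 / ln(2) bits."""
    return nats / math.log(2)


# Toy predicted class probabilities for four samples (made-up numbers).
labels = [0, 1, 1, 0]
shortcut_probs = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6], [0.55, 0.45]]
full_probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.85, 0.15]]

gap = information_beyond_shortcuts(shortcut_probs, full_probs, labels)
print(f"toy gap: {gap:.3f} nats per sample")

# The 0.4 nats difference reported in the abstract, expressed in bits.
print(f"0.4 nats = {nats_to_bits(0.4):.3f} bits")
```

Under this conversion, the reported 0.4 nats per sample corresponds to roughly 0.58 bits.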
