Analyzing Text Representations under Tight Annotation Budgets: Measuring Structural Alignment
This work addresses the cost of text annotation faced by machine learning practitioners. Its contribution is incremental: it builds on prior findings that the choice of representation matters under tight annotation budgets.
The paper tackles the problem of training models under limited annotation budgets. It analyzes why the choice of data representation is key, proposes a metric that measures the structural alignment between a representation and a task, and shows that efficient representations (those that enable learning from few samples) induce a good alignment between latent input structure and class structure.
Annotating large collections of textual data can be time-consuming and expensive, so the ability to train models with limited annotation budgets is of great importance. In this context, it has been shown that under tight annotation budgets the choice of data representation is key. The goal of this paper is to better understand why this is so. To that end, we propose a metric that measures the extent to which a given representation is structurally aligned with a task. We conduct experiments on several text classification datasets, testing a variety of models and representations. Using our proposed metric, we show that an efficient representation for a task (i.e., one that enables learning from few samples) is a representation that induces a good alignment between latent input structure and class structure.
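To make the notion of alignment between latent input structure and class structure more concrete, the sketch below uses a simple stand-in: cluster the representation without looking at the labels, then check how pure the clusters are with respect to the classes. This is not the metric proposed in the paper, whose definition is not given in the abstract. Cluster purity over a plain k-means clustering is an assumption made here purely for illustration, and the helper names `kmeans` and `cluster_purity` as well as the synthetic data are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of X."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return assign


def cluster_purity(assign, labels):
    """Fraction of points that carry the majority label of their cluster."""
    correct = 0
    for j in np.unique(assign):
        correct += np.bincount(labels[assign == j]).max()
    return correct / len(labels)


# Synthetic illustration (assumed data, not from the paper): the same
# binary labels paired with two different "representations".
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
# Aligned: the classes occupy two well-separated regions of the space.
aligned = rng.normal(size=(200, 2)) + np.where(labels[:, None] == 0, -5.0, 5.0)
# Unaligned: the features carry no information about the class.
unaligned = rng.normal(size=(200, 2))

purity_aligned = cluster_purity(kmeans(aligned, k=2), labels)
purity_unaligned = cluster_purity(kmeans(unaligned, k=2), labels)
print(f"aligned: {purity_aligned:.2f}, unaligned: {purity_unaligned:.2f}")
```

Under this proxy, a representation whose unsupervised clusters already coincide with the classes scores close to 1, which is the situation in which a handful of labeled samples per cluster would suffice. A representation whose clusters cut across the classes scores near the majority-class rate.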