CL · LG · Nov 16, 2021

DataCLUE: A Benchmark Suite for Data-centric NLP

arXiv:2111.08647v2 · 17 citations · Has Code
Originality: Incremental advance
AI Analysis

DataCLUE addresses the need for standardized evaluation in data-centric NLP, which should make dataset improvements easier to compare. The advance is incremental, since it adapts existing data-centric concepts to NLP.

The authors address the lack of data-centric benchmarks in NLP by introducing DataCLUE, a benchmark suite whose baselines improve Macro-F1 by up to 5.7 percentage points.

Data-centric AI has recently proven to be more effective and higher-performing, while traditional model-centric AI delivers fewer and fewer benefits. It emphasizes improving the quality of datasets to achieve better model performance. This field has significant potential because of its practicality, and it is receiving more and more attention. However, we have not seen significant research progress in this field, especially in NLP. We propose DataCLUE, the first data-centric benchmark applied to the NLP field. We also provide three simple but effective baselines to foster research in this field (improving Macro-F1 by up to 5.7 percentage points). In addition, we conduct comprehensive experiments with human annotators and show the difficulty of DataCLUE. We also try an advanced method: the forgetting-informed bootstrapping label correction method. All the resources related to DataCLUE, including datasets, toolkit, leaderboard, and baselines, are available online at https://github.com/CLUEbenchmark/DataCLUE
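
The abstract reports its gains in Macro-F1, so a short sketch of how that metric is usually computed may help in reading the numbers. This is a minimal illustration, not code from the DataCLUE toolkit: the function name `macro_f1` and the toy labels are assumptions made up for the example. Macro-F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred.

    Hypothetical helper for illustration; not part of the DataCLUE toolkit.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gets a false positive
            fn[truth] += 1   # true class gets a false negative

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for this example only.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
score = macro_f1(y_true, y_pred)
print(f"Macro-F1: {score:.4f}")
```

On this reading, an improvement of 5.7 percentage points means the score rises by 0.057 on a 0 to 1 scale, for example from 0.700 to 0.757, rather than by 5.7% of its previous value.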

Code Implementations: 1 repo