CL AI | Apr 15, 2024

Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

Juhwan Choi, Jungmin Yun, Kyohoon Jin, YoungBin Kim

arXiv:2404.09682v3 | 15.928 citations | h-index: 8 | Has Code | EMNLP

Originality: Synthesis-oriented

AI Analysis

This paper addresses the cost and time inefficiencies of human annotation for dataset cleansing in multi-document summarization, though its contribution is incremental: it extends existing LLM-based annotation methods to a specific dataset.

The study tackles noisy data in existing datasets by using large language models (LLMs) for data cleansing. It applies chain-of-thought prompting and majority voting to identify and filter unrelated documents in the Multi-News dataset, producing an enhanced Multi-News+ dataset with improved quality and no human annotation.

The quality of the dataset is crucial for ensuring optimal performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during the construction process. Numerous attempts have been made to correct this issue through human annotators. However, hiring and managing human annotators is expensive and time-consuming. As an alternative, recent studies are exploring the use of large language models (LLMs) for data annotation. In this study, we present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset, which is widely used for the multi-document summarization task. Through our proposed cleansing method, we introduce an enhanced Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation efforts.
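The majority-voting step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Annotator` type, the function names, the strict-majority threshold, and the toy annotators are all assumptions made for illustration. In practice each annotator would wrap an LLM call with a chain-of-thought prompt, which is omitted here.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set

# Hypothetical annotator interface: given a summary and its source documents,
# return the indices of documents judged unrelated to the summary.
# A real annotator would wrap an LLM call with a chain-of-thought prompt.
Annotator = Callable[[str, Sequence[str]], Set[int]]


def majority_vote_unrelated(votes: Sequence[Set[int]], num_docs: int) -> Set[int]:
    """Return indices flagged as unrelated by a strict majority of votes."""
    # Each vote is a set, so one annotator counts at most once per document.
    counts = Counter(i for vote in votes for i in vote if 0 <= i < num_docs)
    threshold = len(votes) // 2 + 1  # assumed rule: strict majority
    return {i for i, c in counts.items() if c >= threshold}


def cleanse_example(
    summary: str, documents: Sequence[str], annotators: Sequence[Annotator]
) -> List[str]:
    """Drop the documents that most annotators consider unrelated."""
    votes = [annotate(summary, documents) for annotate in annotators]
    unrelated = majority_vote_unrelated(votes, len(documents))
    return [doc for i, doc in enumerate(documents) if i not in unrelated]


# Toy stand-ins for LLM annotators (purely illustrative).
def flags_second(summary: str, documents: Sequence[str]) -> Set[int]:
    return {1}


def flags_second_and_third(summary: str, documents: Sequence[str]) -> Set[int]:
    return {1, 2}


def flags_nothing(summary: str, documents: Sequence[str]) -> Set[int]:
    return set()


docs = ["relevant article", "unrelated advert", "background piece"]
cleaned = cleanse_example(
    "a summary", docs, [flags_second, flags_second_and_third, flags_nothing]
)
print(cleaned)
```

With three annotators, a document is removed only when at least two of them flag it, so a single spurious judgment does not discard a source document.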
