LG · Nov 13, 2025

ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

arXiv:2511.09833v1 · 2 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses the data annotation bottleneck for researchers and practitioners in fields such as NLP, CV, and multimodal understanding. It offers a significant efficiency improvement but is an incremental advance, since it builds on existing LLM-based annotation methods.

The paper tackles the expense and slowness of human data annotation for supervised learning. It proposes the ACT pipeline, which uses multimodal large language models as both annotators and judges that identify likely errors, so that human effort goes to the suspicious cases. Experiments show that it reduces the performance gap relative to fully human-annotated data to less than 2% on most benchmarks while saving up to 90% of human costs.

Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Annotation with Critical Thinking (ACT) data pipeline, in which LLMs serve not only as annotators but also as judges that critically identify potential errors. Human effort is then directed towards reviewing only the most "suspicious" cases, significantly improving human annotation efficiency. Our major contributions are as follows: (1) By leveraging multimodal LLMs (MLLMs), ACT is applicable to a wide range of domains, including natural language processing (NLP), computer vision (CV), and multimodal understanding. (2) Through empirical studies, we derive 7 insights on how to enhance annotation quality while efficiently reducing human cost, and we translate these findings into user-friendly guidelines. (3) We theoretically analyze how to modify the loss function so that models trained on ACT data achieve performance similar to that of models trained on fully human-annotated data. Our experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human costs.
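The annotate, judge, and review flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Item`, `act_route`, `annotate`, `judge`, `human_review`, and `budget` are hypothetical, the model calls are passed in as plain callables, and the sketch assumes the judge returns a single numeric suspicion score per sample and that humans review a fixed top fraction of samples.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Item:
    sample: str            # raw input (text here; could be an image path)
    machine_label: str     # label proposed by the annotator model
    suspicion: float       # judge score; higher means more likely wrong
    final_label: str = ""  # label kept after optional human review
    reviewed: bool = False  # True if a human looked at this sample


def act_route(
    samples: Sequence[str],
    annotate: Callable[[str], str],
    judge: Callable[[str, str], float],
    human_review: Callable[[str, str], str],
    budget: float,
) -> List[Item]:
    """Label every sample by machine, then send only the most suspicious
    fraction (`budget`, between 0 and 1) to a human reviewer.

    All callables are hypothetical stand-ins (assumption): `annotate` and
    `judge` would wrap model calls, `human_review` a human labeling step.
    """
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be between 0 and 1")

    # Step 1: machine annotation plus a judge score for each proposed label.
    items = []
    for sample in samples:
        label = annotate(sample)
        items.append(Item(sample, label, judge(sample, label)))

    # Step 2: rank by suspicion (descending) and pick the top fraction.
    n_review = round(budget * len(items))
    ranked = sorted(range(len(items)),
                    key=lambda i: items[i].suspicion, reverse=True)
    to_review = set(ranked[:n_review])

    # Step 3: humans correct the selected items; the rest keep machine labels.
    for i, item in enumerate(items):
        if i in to_review:
            item.final_label = human_review(item.sample, item.machine_label)
            item.reviewed = True
        else:
            item.final_label = item.machine_label
    return items
```

Under these assumptions, a `budget` of 0.1 would mean humans review 10% of the samples, which is one way to read the "up to 90%" cost saving reported above. How the paper actually scores and selects suspicious cases is not specified in this summary.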

