Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
For NLP researchers, this work consolidates recurring evaluation debates into a practical framework, but it is primarily a synthesis of existing ideas rather than a novel empirical contribution.
The paper develops a taxonomy of evaluation concerns in NLP by reviewing historical and contemporary critiques, and provides a structured checklist to support deliberate evaluation design and interpretation.
Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP), a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring positions and trade-offs within each area. We also discuss practical implications of the taxonomy, including a structured checklist to support more deliberate evaluation design and interpretation. By situating contemporary debates within their historical context, this work provides a consolidated reference for reasoning about evaluation practices.