NLG Evaluation: Past, Present, Future
For researchers and practitioners in NLG, this paper provides a historical perspective and future outlook on evaluation practices, though it is a review without novel empirical contributions.
This paper traces the evolution of NLG evaluation from 1990 to 2026, highlighting the shift from linguistics-based to machine learning-driven evaluation, and predicts that impact, qualitative, and safety evaluation will grow in importance.
Natural Language Generation (NLG) evaluation has changed dramatically since 1990 and will continue to evolve. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, most recently LLM-as-Judge. In particular, I expect impact, qualitative, and safety evaluation to become more important as large numbers of people routinely use NLG technology.