Towards Neural Language Evaluators
This work tackles the problem of improving evaluation metrics for text summarization. The contribution is incremental, in that it builds on existing methods.
The paper addresses limitations of BLEU and ROUGE for evaluating summaries by proposing criteria for good metrics and using Transformer-based language models to assess reference and hypothesis summaries, but does not report specific numerical results.
We review three limitations of BLEU and ROUGE, the most popular metrics used to assess reference summaries against hypothesis summaries, formulate criteria for how a good metric should behave, and propose concrete ways to use recent Transformer-based language models to assess reference summaries against hypothesis summaries.
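To make the motivation concrete, the sketch below illustrates one commonly cited weakness of n-gram overlap metrics. It is an illustration under stated assumptions, not necessarily one of the three limitations reviewed in the paper, and not the official BLEU or ROUGE implementation. The function `unigram_f1` is a hypothetical, simplified ROUGE-1-style score (whitespace tokens, no stemming), and the example sentences are invented for this sketch.

```python
from collections import Counter


def unigram_f1(reference: str, hypothesis: str) -> float:
    """Simplified ROUGE-1-style F1 (hypothetical helper, not the official metric).

    Tokens are lowercased whitespace-separated words; overlap is the clipped
    unigram count shared by reference and hypothesis.
    """
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & hyp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences, for illustration only.
reference = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"   # similar meaning, no shared words
scrambled = "mat the on sat cat the"        # same words, meaningless order

print(unigram_f1(reference, paraphrase))  # 0.0
print(unigram_f1(reference, scrambled))   # 1.0
```

Under these assumptions, a reasonable paraphrase scores zero while a scrambled word list scores perfectly, since a purely lexical score sees neither meaning nor, at the unigram level, word order. This is the kind of behavior that motivates using a language model as the evaluator.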