Neslihan Iskender

2 Papers

CL · Sep 16, 2021
Does Summary Evaluation Survive Translation to Other Languages?

Spencer Braun, Oleg Vasilyev, Neslihan Iskender et al.

The creation of a quality summarization dataset is an expensive, time-consuming effort, requiring the production and evaluation of summaries by both trained humans and machines. If such effort is made in one language, it would be beneficial to be able to use it in other languages without repeating human annotations. To investigate how much we can trust machine translation of such a dataset, we translate the English dataset SummEval to seven languages and compare performance across automatic evaluation measures. We explore equivalence testing as the appropriate statistical paradigm for evaluating correlations between human and automated scoring of summaries. While we find some potential for dataset reuse in languages similar to the source, most summary evaluation methods are not found to be statistically equivalent across translations.

CL · May 13, 2021
Towards Human-Free Automatic Quality Evaluation of German Summarization

Neslihan Iskender, Oleg Vasilyev, Tim Polzehl et al.

Evaluating large summarization corpora using humans has proven to be expensive from both the organizational and the financial perspective. Therefore, many automatic evaluation metrics have been developed to measure the summarization quality in a fast and reproducible way. However, most of the metrics still rely on humans and need gold standard summaries generated by linguistic experts. Since BLANC does not require golden summaries and supposedly can use any underlying language model, we consider its application to the evaluation of summarization in German. This work demonstrates how to adjust the BLANC metric to a language other than English. We compare BLANC scores with the crowd and expert ratings, as well as with commonly used automatic metrics on a German summarization data set. Our results show that BLANC in German is especially good in evaluating informativeness.