CL ML | Jul 31, 2017

Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

arXiv:1707.09861v1 | 1201 citations

Originality: Incremental advance

AI Analysis

This paper addresses a methodological problem facing researchers and practitioners in machine learning, particularly in sequence tagging: it highlights the need for more robust evaluation practices to avoid misleading comparisons. The advance is incremental, improving on existing evaluation frameworks.

The paper demonstrates that reporting a single performance score is insufficient for comparing non-deterministic sequence tagging systems, as random seed variations cause statistically significant differences (p < 10^-4) and up to 1 percentage point F1-score changes, affecting whether systems are perceived as state-of-the-art or mediocre. It proposes comparing score distributions from multiple executions and, based on evaluating 50,000 LSTM-networks, presents architectures that achieve superior performance and greater stability with respect to hyperparameters.

In this paper we show that reporting a single performance score is insufficient to compare non-deterministic approaches. We demonstrate for common sequence tagging tasks that the seed value for the random number generator can result in statistically significant (p < 10^-4) differences for state-of-the-art systems. For two recent systems for NER, we observe an absolute difference of one percentage point F1-score depending on the selected seed value, making these systems perceived either as state-of-the-art or mediocre. Instead of publishing and reporting single performance scores, we propose to compare score distributions based on multiple executions. Based on the evaluation of 50,000 LSTM-networks for five sequence tagging tasks, we present network architectures that both produce superior performance and are more stable with respect to the remaining hyperparameters.
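The proposal to compare score distributions rather than single scores can be illustrated with a minimal sketch. It assumes each system has been trained several times with different seeds and that one F1 score per run is available. The F1 values below are synthetic placeholders, not results from the paper. The function names `summarize` and `permutation_test` are hypothetical, and the two-sided permutation test on the difference of means is one possible choice of test, not necessarily the one the authors used.

```python
import random
import statistics


def summarize(scores):
    """Describe a score distribution instead of reporting one number."""
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


def permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of mean scores.

    Returns an estimated p-value for the null hypothesis that both
    sets of scores come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(scores_a) - statistics.fmean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled scores to the two systems.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Synthetic F1 scores standing in for 20 runs per system with different
    # seeds (assumed values for illustration only).
    rng = random.Random(42)
    system_a = [rng.gauss(90.5, 0.4) for _ in range(20)]
    system_b = [rng.gauss(90.6, 0.4) for _ in range(20)]

    print("System A:", summarize(system_a))
    print("System B:", summarize(system_b))
    print("Best single run A vs. worst single run B:",
          max(system_a), "vs.", min(system_b))
    print("Permutation test p-value:", permutation_test(system_a, system_b))
```

Comparing the best run of one system against the worst run of the other shows how a single-score comparison can favor either system depending on the seed, while the distribution summary and the test use all runs.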
