When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP
This paper addresses the risk of misleading findings in NLP research caused by overlooked software quality and proposes tools to improve coding practices, though its contribution is incremental in that it focuses on existing methods.
The paper tackles the problem of code correctness in NLP research by identifying and fixing three bugs in widely used implementations of the Conformer architecture. Its speech recognition and translation experiments across several languages show that buggy code can still yield good and reproducible results while leading to incorrect conclusions.
Despite its crucial role in research experiments, code correctness is often presumed only on the basis of the perceived quality of results. This assumption comes with the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on reproducibility should go hand in hand with an emphasis on software quality. We present a case study in which we identify and fix three bugs in widely used implementations of the state-of-the-art Conformer architecture. Through experiments on speech recognition and translation in various languages, we demonstrate that the presence of bugs does not prevent the achievement of good and reproducible results, which, however, can lead to incorrect conclusions that potentially misguide future research. As a countermeasure, we propose a Code-quality Checklist and release pangoliNN, a library dedicated to testing neural models, with the goal of promoting coding best practices and improving research software quality within the NLP community.
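To make the idea of testing a neural model concrete, the following is a minimal sketch of one kind of behavioral check such a test suite might contain: verifying that a layer's output for a sequence does not change when padding frames are appended to it. The abstract does not describe the three bugs or the pangoliNN API, so everything here is an assumption made for illustration: the choice of padding invariance as the property, the toy pooling layers, and the helper names `masked_mean_pool`, `buggy_mean_pool`, and `is_padding_invariant` are all hypothetical and are not taken from pangoliNN.

```python
import numpy as np


def buggy_mean_pool(x, lengths):
    """Hypothetical buggy layer: averages over all time steps, padding included."""
    return x.mean(axis=1)


def masked_mean_pool(x, lengths):
    """Hypothetical correct layer: averages only over the valid time steps."""
    steps = np.arange(x.shape[1])[None, :]              # shape (1, T)
    mask = (steps < lengths[:, None]).astype(x.dtype)   # shape (B, T)
    summed = (x * mask[:, :, None]).sum(axis=1)         # shape (B, D)
    return summed / lengths[:, None]


def is_padding_invariant(layer, seq, extra_pad=3, pad_value=1.0):
    """Return True if appending padding frames leaves the layer output unchanged.

    A non-zero pad value is used on purpose, so that a layer which reads
    the padded frames cannot pass by accident.
    """
    length = np.array([seq.shape[0]])
    alone = layer(seq[None], length)
    padding = np.full((extra_pad, seq.shape[1]), pad_value, dtype=seq.dtype)
    padded = np.concatenate([seq, padding], axis=0)
    with_pad = layer(padded[None], length)
    return bool(np.allclose(alone, with_pad))


rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, 4))  # 5 time steps, 4 features

print("masked layer is padding invariant:", is_padding_invariant(masked_mean_pool, sequence))
print("buggy layer is padding invariant:", is_padding_invariant(buggy_mean_pool, sequence))
```

Both toy layers return outputs of the right shape and plausible magnitude, which is the point of the sketch: only an explicit property check tells them apart, whereas inspecting result quality alone might not.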