Since the Scientific Literature Is Multilingual, Our Models Should Be Too
The work matters to researchers and practitioners in NLP and scientific domains who rely on accurate document representations across languages, though its contribution is incremental, since it builds on existing critiques of monolingual bias.
The paper tackles the problem of English-centric models in scientific NLP. It shows quantitatively that the literature is largely multilingual, argues that current models and benchmarks should reflect this diversity, and highlights that text-based models fail to create meaningful representations for non-English papers.
English has long been assumed the $\textit{lingua franca}$ of scientific research, and this notion is reflected in natural language processing (NLP) research on scientific document representation. In this position piece, we quantitatively show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity. We provide evidence that text-based models fail to create meaningful representations for non-English papers and highlight the negative user-facing impacts of using English-only models indiscriminately across a multilingual domain. We end with suggestions for the NLP community on how to improve performance on non-English documents.