Detecting Word Sense Disambiguation Biases in Machine Translation for Model-Agnostic Adversarial Attacks
This work helps NLP practitioners understand translation errors by exposing model vulnerabilities, but its contribution is incremental, since it builds on known issues with dataset artifacts.
The paper tackles word sense disambiguation errors in neural machine translation by identifying biases that stem from dataset artifacts and developing a method to predict the resulting errors. It shows that disambiguation robustness varies across domains and that different models trained on the same data are vulnerable to different attacks. It also introduces an adversarial attack strategy that minimally perturbs sentences to probe model vulnerabilities, and demonstrates the approach across multiple domains and model types.
Word sense disambiguation is a well-known source of translation errors in NMT. We posit that some of the incorrect disambiguation choices are due to models' over-reliance on dataset artifacts found in training data, specifically superficial word co-occurrences, rather than a deeper understanding of the source text. We introduce a method for the prediction of disambiguation errors based on statistical data properties, demonstrating its effectiveness across several domains and model types. Moreover, we develop a simple adversarial attack strategy that minimally perturbs sentences in order to elicit disambiguation errors to further probe the robustness of translation models. Our findings indicate that disambiguation robustness varies substantially between domains and that different models trained on the same data are vulnerable to different attacks.
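To make the idea of co-occurrence bias concrete, the following is a minimal sketch and not the paper's actual method. The abstract does not specify which statistics are used, so the scoring function, all function names, and the toy data here are assumptions chosen for illustration. The sketch counts how often each context word appears alongside each sense of an ambiguous word, then shows how inserting a single word strongly associated with the wrong sense can flip the sense that co-occurrence statistics alone would favour.

```python
import math
from collections import Counter, defaultdict


def sense_cooccurrence_counts(examples):
    """Count, per sense, how many training sentences contain each token.

    `examples` is an iterable of (tokens, sense) pairs for ONE ambiguous
    source word. The sense labels are hypothetical; in practice they would
    have to be derived from the translations in the parallel data.
    """
    counts = defaultdict(Counter)
    for tokens, sense in examples:
        for tok in set(tokens):
            counts[sense][tok] += 1
    return counts


def bias_score(counts, token, sense, smoothing=1.0):
    """Add-one smoothed share of the token's occurrences seen with `sense`.

    This is an assumed, simple statistic, not the one used in the paper.
    """
    total = sum(c[token] for c in counts.values())
    return (counts[sense][token] + smoothing) / (total + smoothing * len(counts))


def predicted_sense(counts, tokens):
    """Sense favoured by co-occurrence statistics alone (sum of log scores)."""
    scores = {
        sense: sum(math.log(bias_score(counts, t, sense)) for t in set(tokens))
        for sense in counts
    }
    return max(scores, key=scores.get)


# Toy, hand-made training data for the ambiguous word "bank" (an assumption).
examples = [
    (["river", "bank"], "shore"),
    (["grassy", "bank"], "shore"),
    (["bank", "money", "loan"], "institution"),
    (["bank", "money", "account"], "institution"),
    (["bank", "money"], "institution"),
]
counts = sense_cooccurrence_counts(examples)

original = ["grassy", "bank"]
# Minimal perturbation: insert one word associated with the other sense.
perturbed = ["grassy", "money", "bank"]

print(predicted_sense(counts, original))
print(predicted_sense(counts, perturbed))
```

A model that leans on such surface statistics would be expected to mistranslate the perturbed sentence even though a human reader may still resolve the sense from the rest of the context; whether a given translation model actually does so has to be checked against its output.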