Multilingual Coreference Resolution via Cycle-Consistent Machine Translation
This work addresses the lack of coreference resolution resources for low-resource languages, offering a practical solution for expanding NLP capabilities to under-served languages.
The paper proposes a pipeline that uses cycle-consistent machine translation to generate training data for coreference resolution in low-resource languages, achieving significant performance gains across four languages and enabling coreference resolution where no prior corpora existed.
Coreference resolution is a core NLP task with a broad range of downstream applications, e.g., machine translation, question answering, and document summarization. While the task is well studied in English, coreference resolution in other languages, especially low-resource ones, has received comparatively less attention. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate them into English and measure their similarity to the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.
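The cycle-consistency weighting described above can be illustrated with a minimal sketch. It assumes that sentence embeddings of the original and back-translated English samples are already available as plain float vectors (in the pipeline they would come from a BERT model, which is omitted here). The abstract does not specify how the similarity scores enter the loss, so the multiplicative per-sample weighting, the clipping of negative similarities to zero, and the normalization by the sum of weights are all assumptions, as are the function names `cosine_similarity`, `cycle_weights`, and `weighted_loss`.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cycle_weights(
    original_embs: Sequence[Sequence[float]],
    back_translated_embs: Sequence[Sequence[float]],
) -> List[float]:
    """One weight per sample: similarity of the original to its back-translation.

    Assumption: negative similarities are clipped to 0 so that a weight
    never flips the sign of a loss term.
    """
    return [
        max(0.0, cosine_similarity(o, b))
        for o, b in zip(original_embs, back_translated_embs)
    ]


def weighted_loss(per_sample_losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-sample losses (assumed form of the weighting)."""
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / total_weight


if __name__ == "__main__":
    # Toy embeddings: the first sample survives the translation cycle intact,
    # the second one drifts to an orthogonal vector.
    originals = [[1.0, 0.0], [1.0, 0.0]]
    back_translated = [[1.0, 0.0], [0.0, 1.0]]
    weights = cycle_weights(originals, back_translated)
    print(weights)
    print(weighted_loss([2.0, 10.0], weights))
```

Under this weighting, a sample whose back-translation drifts far from the original contributes little to the training signal, while a faithfully translated sample contributes almost fully.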