Practical Comparable Data Collection for Low-Resource Languages via Images
This work addresses data scarcity for low-resource language tasks such as machine translation, though the contribution is incremental because it builds on existing pivot-based methods.
The paper tackles the problem of collecting high-quality comparable training data for low-resource languages. It uses images as a pivot, gathering captions for each image in both languages independently. In English-Hindi human evaluations, 81.1% of the resulting pairs are acceptable translations and only 2.47% are not translations at all.
We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators. Our method involves using a carefully selected set of images as a pivot between the source and target languages by getting captions for such images in both languages independently. Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach by experimenting on two downstream tasks: machine translation and dictionary extraction. All code and data are available at https://github.com/madaan/PML4DC-Comparable-Data-Collection.
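
To make the pivoting idea concrete, the following is a minimal sketch of how captions collected independently in two languages could be joined on a shared image identifier. It is not taken from the released code: the function name `pair_captions_by_image`, the dictionary layout, the image IDs, and the example captions are all assumptions made for illustration, and pairing every source caption with every target caption for the same image is just one possible strategy.

```python
from itertools import product
from typing import Dict, List, Tuple


def pair_captions_by_image(
    src_captions: Dict[str, List[str]],
    tgt_captions: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """Pair captions that describe the same image (hypothetical helper).

    Each input maps an image ID to the captions written for that image by
    monolingual annotators in one language.
    """
    pairs = []
    # Only images captioned in both languages can serve as a pivot.
    shared_ids = sorted(src_captions.keys() & tgt_captions.keys())
    for image_id in shared_ids:
        # Assumption: every source caption is paired with every target caption.
        for src, tgt in product(src_captions[image_id], tgt_captions[image_id]):
            pairs.append((image_id, src, tgt))
    return pairs


# Made-up captions, for illustration only.
english = {
    "img_001": ["A dog runs on the beach."],
    "img_002": ["Two children play cricket."],
}
hindi = {
    "img_001": ["एक कुत्ता समुद्र तट पर दौड़ रहा है।"],
    "img_003": ["बाज़ार में एक आदमी फल बेच रहा है।"],
}

comparable_pairs = pair_captions_by_image(english, hindi)
for image_id, src, tgt in comparable_pairs:
    print(image_id, src, tgt, sep="\t")
```

Because the annotators never see each other's captions, the resulting pairs are comparable rather than strictly parallel, which is why a human evaluation of translation quality, as reported above, is needed.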