CL · May 12, 2022

On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data

Kabir Ahuja, Monojit Choudhury, Sandipan Dandapat

arXiv:2205.06350v2 · 31.8629 citations · h-index: 39

Originality: Synthesis-oriented

AI Analysis

The framework gives NLP practitioners a practical tool for making cost-effective data collection decisions in multilingual settings, though it is an incremental application of economic concepts to a specific domain problem.

The paper tackles the problem of optimizing cost-performance trade-offs in multilingual few-shot learning by introducing an economic framework to evaluate machine-translated versus manually-created labeled data. Their case study on TyDIQA-GoldP shows that if machine translation costs are non-zero, optimal performance at least cost always requires at least some manual data.

Borrowing ideas from production functions in micro-economics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case-study on the TyDIQA-GoldP dataset. One of the interesting conclusions of the study is that if the cost of machine translation is greater than zero, the optimal performance at least cost is always achieved with at least some or only manually-created data. To our knowledge, this is the first attempt towards extending the concept of production functions to study data collection strategies for training multilingual models, and can serve as a valuable tool for other similar cost vs data trade-offs in NLP.
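
The least-cost idea in the abstract can be illustrated with a small sketch. The abstract does not give the paper's functional form, so the `performance` function below is a hypothetical stand-in with diminishing returns in each data source. Its coefficients, the per-example costs, and the grid bounds are all invented for illustration and are not fitted to TyDIQA-GoldP. The sketch only shows the mechanics: among all mixes of translated and manual examples that reach a target score, pick the cheapest one.

```python
import math

def performance(t, m, base=40.0, gain_t=2.0, gain_m=3.0):
    """Hypothetical production function (NOT the paper's fitted model).

    t: number of machine-translated examples
    m: number of manually created examples
    Each source contributes with diminishing (logarithmic) returns.
    """
    return base + gain_t * math.log1p(t) + gain_m * math.log1p(m)

def least_cost_mix(target, cost_t, cost_m, max_examples=2000, step=50):
    """Grid-search the cheapest (t, m) whose predicted score reaches target.

    cost_t, cost_m: assumed per-example costs of translated and manual data.
    Returns (cost, t, m), or None if no grid point reaches the target.
    """
    best = None
    for t in range(0, max_examples + 1, step):
        for m in range(0, max_examples + 1, step):
            if performance(t, m) < target:
                continue  # this mix lies below the target isoquant
            cost = cost_t * t + cost_m * m
            if best is None or cost < best[0]:
                best = (cost, t, m)
    return best

if __name__ == "__main__":
    # Illustrative costs only: translation assumed 10x cheaper than annotation.
    print(least_cost_mix(target=60.0, cost_t=0.1, cost_m=1.0))
```

Because the function and costs are made up, the mix this returns says nothing about the paper's actual finding. To use the sketch for real, replace `performance` with a model fitted to measured results.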

