Richer Countries and Richer Representations
This work addresses representational harms in NLP models that disproportionately affect lower-GDP countries, a specific but critical fairness issue in AI.
The study investigated how low-frequency country names in training corpora lead to poorer semantic representations and lower prediction accuracy in embedding models. Because these performance discrepancies correlate with GDP, they perpetuate wealth inequalities. It found that countries such as Ghana are less likely to be correctly predicted in prompts such as 'The country producing the most cocoa is [MASK].'
We examine whether some countries are more richly represented in embedding space than others. We find that countries whose names occur with low frequency in training corpora are more likely to be tokenized into subwords, are less semantically distinct in embedding space, and are less likely to be correctly predicted: e.g., Ghana (the correct answer and in-vocabulary) is not predicted for "The country producing the most cocoa is [MASK]." Although these performance discrepancies and representational harms are due to frequency, we find that frequency is highly correlated with a country's GDP, thus perpetuating historic power and wealth inequalities. We analyze the effectiveness of mitigation strategies, recommend that researchers report training word frequencies, and recommend future work for the community to define and design representational guarantees.
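The recommendation to report training word frequencies, and the claim that frequency tracks GDP, can be illustrated with a minimal sketch. It is not the procedure used in this work. The function names (`count_mentions`, `average_ranks`, `spearman`), the fictional country names, and all numbers are hypothetical placeholders chosen for illustration. Spearman rank correlation is likewise an assumption here, used as one reasonable way to compare two heavily skewed quantities.

```python
import re
from collections import Counter


def count_mentions(corpus_lines, names):
    """Count case-insensitive whole-word mentions of each name."""
    counts = Counter({name: 0 for name in names})
    for line in corpus_lines:
        lowered = line.lower()
        for name in names:
            pattern = r"\b" + re.escape(name.lower()) + r"\b"
            counts[name] += len(re.findall(pattern, lowered))
    return dict(counts)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy corpus and GDP figures; fictional names, not real data.
toy_corpus = [
    "Freedonia signed a trade deal with Sylvania.",
    "Markets in Freedonia rallied; Freedonia exports rose.",
    "Sylvania and Freedonia held talks.",
    "Ruritania hosted a festival.",
]
toy_gdp = {"Freedonia": 300.0, "Sylvania": 200.0, "Ruritania": 100.0}

names = list(toy_gdp)
freqs = count_mentions(toy_corpus, names)
rho = spearman([freqs[n] for n in names], [toy_gdp[n] for n in names])
print(freqs, rho)
```

On a real corpus, the per-name counts are the quantity a frequency report would publish, and the rank correlation would be computed against an external GDP table. Whole-word regex matching is a simplification: multi-word names, aliases, and demonyms would need separate handling.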