CL · LG · Oct 23, 2023

Geographical Erasure in Language Generation

arXiv:2310.14777v1 · 134 citations · h-index: 11 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a fairness issue for users affected by biased language generation, but the advance is incremental: it builds on existing work on biases in LLMs.

The paper tackles geographical erasure in large language models (LLMs), where certain countries are underpredicted in generated language. It finds that erasure correlates with low frequencies of country mentions in the training data, and it mitigates erasure by finetuning with a custom objective.

Large language models (LLMs) encode vast amounts of world knowledge. However, since these models are trained on large swaths of internet data, they are at risk of inordinately capturing information about dominant groups. This imbalance can propagate into generated language. In this work, we study and operationalise a form of geographical erasure, wherein language models underpredict certain countries. We demonstrate consistent instances of erasure across a range of LLMs. We discover that erasure strongly correlates with low frequencies of country mentions in the training corpus. Lastly, we mitigate erasure by finetuning using a custom objective.
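
To make "underpredicting certain countries" concrete, here is a minimal sketch of one way such a measure could be operationalised. It is an illustration under stated assumptions, not necessarily the paper's definition: we assume a reference distribution over countries (for example, population shares), flag a country as erased when the model's probability falls more than a factor r below the reference, and score erasure with a KL-style sum over the flagged countries. The function names, the threshold r = 3, and the toy numbers are all hypothetical.

```python
import math

def erased_countries(p_model, p_true, r=3.0):
    """Countries whose model probability is more than r times below the reference.

    p_model and p_true map country name -> probability. The ratio test and
    the default r are assumptions made for this sketch.
    """
    return sorted(c for c in p_true if p_model.get(c, 0.0) * r < p_true[c])

def erasure_score(p_model, p_true, r=3.0, eps=1e-12):
    """KL-style sum of p_true * log(p_true / p_model) over erased countries.

    eps guards against a zero model probability.
    """
    return sum(
        p_true[c] * math.log(p_true[c] / max(p_model.get(c, 0.0), eps))
        for c in erased_countries(p_model, p_true, r)
    )

# Toy, made-up distributions over three placeholder countries.
p_true = {"A": 0.5, "B": 0.3, "C": 0.2}
p_model = {"A": 0.8, "B": 0.15, "C": 0.05}

print(erased_countries(p_model, p_true))  # only "C" is more than 3x underpredicted
print(erasure_score(p_model, p_true))
```

In this toy case country B is underpredicted by a factor of 2 and so is not flagged at r = 3, while C is underpredicted by a factor of 4 and is. A score of zero would mean no country falls below the threshold.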

Code Implementations: 1 repo
