CL, Jun 4

Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

Marco Antonio Stranisci, A Pranav, Rossana Damiano, Christian Hardmeier, Anne Lauscher

arXiv:2606.05936

AI Analysis

For NLP practitioners and fairness researchers, the paper documents how current content moderation systems cause epistemic erasure of marginalized groups, revealing a tension between automated and human judgment.

The paper audits four pretraining filters and three inference-time guardrails on Common Crawl sentences, finding that they significantly over-flag content mentioning marginalized groups (particularly transgender people, women, and Central Americans) while frequently failing to flag private information or explicit hate speech. Human annotators would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, which the authors read as evidence of epistemic erasure.

Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and additionally suppressed again at inference time.
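To make the two quantities in the abstract concrete, the following is a minimal sketch of how a per-group flag rate and a human retention rate could be computed for a blocklist-style filter. It is not the authors' pipeline: the blocklist, the group lexicon, the example sentences, and the function names (`is_flagged`, `flag_rate_by_group`, `human_retention_rate`) are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical blocklist and group lexicon (assumptions, not from the paper).
BLOCKLIST = {"sex", "sexual"}
GROUP_TERMS = {
    "transgender": "transgender people",
    "women": "women",
    "men": "men",
}

# Invented toy sentences, used only to exercise the functions below.
SENTENCES = [
    "Transgender people face barriers to sexual health care.",
    "Women organized a sexual health workshop.",
    "Women won the regional chess final.",
    "Men attended the town meeting.",
]


def tokens(sentence):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [t.strip(".,;:!?").lower() for t in sentence.split()]


def is_flagged(sentence):
    """Blocklist-style decision: flag if any token is on the blocklist."""
    return bool(set(tokens(sentence)) & BLOCKLIST)


def flag_rate_by_group(sentences):
    """Share of sentences mentioning each group that the filter flags."""
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for sentence in sentences:
        toks = set(tokens(sentence))
        flagged = is_flagged(sentence)
        for term, group in GROUP_TERMS.items():
            if term in toks:
                counts[group][1] += 1
                counts[group][0] += int(flagged)
    return {group: f / n for group, (f, n) in counts.items()}


def human_retention_rate(flagged_sentences, human_keep):
    """Share of flagged sentences that a human annotator would keep.

    `human_keep` maps each sentence to True (keep) or False (remove).
    """
    kept = sum(1 for s in flagged_sentences if human_keep[s])
    return kept / len(flagged_sentences)


if __name__ == "__main__":
    print(flag_rate_by_group(SENTENCES))
    flagged = [s for s in SENTENCES if is_flagged(s)]
    # Hypothetical annotations: the annotator keeps both flagged sentences.
    print(human_retention_rate(flagged, {s: True for s in flagged}))
```

In this toy setup both flagged sentences are benign health-related text, which mirrors the kind of disagreement between lexical cues and human judgment that the abstract describes. The paper's own filters, guardrails, and statistical tests are more involved than this exact-token match.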
