CY · HC · Apr 19

All Public Voices Are Equal, But Are Some More Equal Than Others to LLMs?

Sola Kim, Marco A. Janssen, Jieshu Wang, Ame Min-Venditti, Neha Karanjia, John M. Anderies

arXiv:2604.17247 · 69.9 · h-index: 53

Predicted impact top 17% in CY · last 90 days · Originality: Incremental advance

AI Analysis

For federal agencies using LLMs to process public comments, this study shows that socioeconomic signals such as occupation can bias summarization, a risk that current procurement frameworks do not evaluate.

This paper tests whether LLMs used by federal agencies treat public comments differently based on the commenter's demographic attribution. It finds that occupation (e.g., street vendor vs. financial analyst) consistently produces differential summaries, while race effects are inconsistent and gender effects are absent, highlighting a need for fairness evaluation in government AI procurement.

Federal agencies are increasingly deploying large language models (LLMs) to process public comments submitted during notice-and-comment rulemaking, the primary mechanism through which citizens influence federal regulation. Whether these systems treat all public input equally remains largely untested. Using a counterfactual design, we held comment content constant and varied only the commenter's demographic attribution -- race, gender, and socioeconomic status -- to test whether eight LLMs available for federal use produce differential summaries of identical comments. We processed 182 public comments across 32 identity conditions, generating over 106,000 summaries. Occupation was the only identity signal to produce consistent differential treatment: the same comment attributed to a street vendor, compared to a financial analyst, received a summary that preserved less of the original meaning, used simpler language, and shifted emotional tone. This pattern held across all names, prompts, models, and regulatory contexts tested. Race effects were inconsistent and appeared driven by specific name tokens rather than racial categories; gender effects were absent. Writing quality predicted summarization outcomes through argument substance rather than surface mechanics; experimentally injected spelling and grammar errors had negligible effects. The magnitude of occupation-based differential treatment varied by model provider, meaning that selecting a model implicitly selects a level of fairness -- a dimension that current procurement frameworks such as FedRAMP do not evaluate. These findings suggest that socioeconomic signals warrant attention in AI fairness assessments for government information systems, and that fairness benchmarks could be incorporated into existing federal IT procurement processes.
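The counterfactual design described in the abstract can be illustrated with a minimal sketch: the comment text is held fixed, only the attribution line varies, and outcomes are compared within the same comment. Everything below is an assumption made for illustration, not the authors' pipeline: the function names `build_variants` and `paired_gap`, the "Submitted by ..." attribution format, the placeholder names, and the idea of a single numeric summary score are all hypothetical.

```python
from itertools import product


def build_variants(comment, names, occupations):
    """Return one attributed copy of `comment` per (name, occupation) pair.

    The comment body is never altered; only the attribution header changes.
    The header wording is a hypothetical format, not the one used in the paper.
    """
    variants = []
    for name, occupation in product(names, occupations):
        header = f"Submitted by {name}, {occupation}."
        variants.append({
            "name": name,
            "occupation": occupation,
            "text": f"{header}\n\n{comment}",
        })
    return variants


def paired_gap(scores, cond_a, cond_b):
    """Mean within-comment difference in a summary score (cond_a minus cond_b).

    `scores` maps (comment_id, condition) -> float. Only comments scored
    under both conditions contribute, so content is held constant.
    """
    ids_a = {cid for cid, cond in scores if cond == cond_a}
    ids_b = {cid for cid, cond in scores if cond == cond_b}
    shared = sorted(ids_a & ids_b)
    if not shared:
        raise ValueError("no comment was scored under both conditions")
    diffs = [scores[(cid, cond_a)] - scores[(cid, cond_b)] for cid in shared]
    return sum(diffs) / len(diffs)


# Illustrative usage with placeholder names and made-up scores.
variants = build_variants(
    "The proposed rule would raise permit costs for small sellers.",
    names=["Name A", "Name B"],
    occupations=["street vendor", "financial analyst"],
)
example_scores = {
    (1, "street vendor"): 0.5, (1, "financial analyst"): 0.75,
    (2, "street vendor"): 0.25, (2, "financial analyst"): 0.5,
}
gap = paired_gap(example_scores, "street vendor", "financial analyst")
```

A negative `gap` here would mean the made-up score is lower under the first condition for the same comments. Pairing within a comment is what lets a difference be attributed to the identity signal rather than to the content.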
