CL LG Oct 16, 2023

Who Are All The Stochastic Parrots Imitating? They Should Tell Us!

arXiv:2310.10583v2126 citations, h-index: 6
Originality: Synthesis-oriented
AI Analysis

This paper addresses factual inaccuracies in LM outputs for users in critical settings, particularly in low-resource languages, but it is an opinion piece proposing a conceptual solution rather than an incremental technical advance.

The authors argue that current language models (LMs) are unreliable for critical use, especially in low-resource languages, and propose a novel strategy to enhance trustworthiness by enabling LMs to cite their training data sources for verifiability.

Both standalone language models (LMs) and LMs within downstream-task systems have been shown to generate statements which are factually untrue. This problem is especially severe for low-resource languages, where training data is scarce and of worse quality than for high-resource languages. In this opinion piece, we argue that LMs in their current state will never be fully trustworthy in critical settings and suggest a possible novel strategy to handle this issue: building LMs that can cite their sources, i.e., point a user to the parts of their training data that back up their outputs. We first discuss which current NLP tasks would or would not benefit from such models. We then highlight the expected benefits such models would bring, e.g., quick verifiability of statements. We end by outlining the individual tasks that would need to be solved on the way to developing LMs with the ability to cite. We hope to start a discussion about the field's current approach to building LMs, especially for low-resource languages, and the role of the training data in explaining model generations.
