Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?
This paper addresses the problem of mitigating social stereotypes in AI models, a concern for both researchers and practitioners. Its contribution is incremental, however, in that it builds on existing methods for quantifying stereotypes.
The authors investigate whether natural language interventions can alter a QA model's unethical behavior, specifically in the context of social stereotypes. They find that current models respond poorly to such interventions: in zero-shot evaluation the models barely change their behavior, and few-shot learning helps but remains inadequate, especially when tested for generalization.
Is it possible to use natural language to intervene in a model's behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a question-answering (QA) model's unethical behavior by communicating context-specific principles of ethics and equity to it. To this end, we build upon recent methods for quantifying a system's social stereotypes, augmenting them with different kinds of ethical interventions and the desired model behavior under such interventions. Our zero-shot evaluation finds that even today's powerful neural language models are extremely poor ethical-advice takers, that is, they respond surprisingly little to ethical interventions even though these interventions are stated as simple sentences. Few-shot learning improves model behavior but remains far from the desired outcome, especially when evaluated for various types of generalization. Our new task thus poses a novel language understanding challenge for the community.
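To make the task setup concrete, the following is a minimal sketch of how an intervention might be attached to a QA input and how a model's response to it could be measured. Everything here is an illustrative assumption rather than the authors' implementation: the `Scorer` type, the functions `with_intervention` and `bias_gap`, the placeholder names, and the stand-in `toy_scorer` are all hypothetical, and the score gap is only one simple way to express "preference for one subject over another".

```python
from typing import Callable

# Hypothetical scorer type: maps (context, question, candidate) to a score.
# A real QA model would be wrapped to fit this signature.
Scorer = Callable[[str, str, str], float]


def with_intervention(context: str, intervention: str) -> str:
    """Append a natural language intervention sentence to the context."""
    return f"{context} {intervention}".strip()


def bias_gap(score: Scorer, context: str, question: str,
             subj_a: str, subj_b: str) -> float:
    """Absolute difference between the scores given to the two subjects."""
    return abs(score(context, question, subj_a)
               - score(context, question, subj_b))


def toy_scorer(context: str, question: str, candidate: str) -> float:
    """Stand-in model (assumption): it ignores the context entirely,
    so no intervention can change its preference."""
    prior = {"Alex": 0.7, "Sam": 0.3}
    return prior[candidate]


context = "Alex and Sam applied for the same job."
question = "Who was a bad candidate?"
intervention = "Judging a person by group membership is unfair."

gap_before = bias_gap(toy_scorer, context, question, "Alex", "Sam")
gap_after = bias_gap(
    toy_scorer, with_intervention(context, intervention),
    question, "Alex", "Sam",
)
```

With this stand-in scorer, `gap_before` and `gap_after` are identical, which mirrors the failure mode the abstract describes: a model that does not respond to the intervention. Under an ethical intervention, a good advice taker would instead drive the gap toward zero.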