CL · CY · Jun 18, 2021

Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets

arXiv:2106.10328v2 · 268 citations
Originality: Highly original
AI Analysis

This paper addresses the problem of aligning language models with societal values to reduce harm and bias, offering a novel method for a known bottleneck in AI safety.

The paper tackles the problem of harmful and biased outputs from language models by proposing PALMS, an iterative fine-tuning process using values-targeted datasets, which significantly improves adherence to target values and reduces toxicity across various GPT-3 model sizes without compromising capability.

Language models can generate harmful and biased outputs and exhibit undesirable behavior according to a given cultural context. We propose a Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets, an iterative process to significantly change model behavior by crafting and fine-tuning on a dataset that reflects a predetermined set of target values. We evaluate our process using three metrics: quantitative metrics with human evaluations that score output adherence to a target value, toxicity scoring on outputs; and qualitative metrics analyzing the most common word associated with a given social category. Through each iteration, we add additional training dataset examples based on observed shortcomings from evaluations. PALMS performs significantly better on all metrics compared to baseline and control models for a broad range of GPT-3 language model sizes without compromising capability integrity. We find that the effectiveness of PALMS increases with model size. We show that significantly adjusting language model behavior is feasible with a small, hand-curated dataset.
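The qualitative metric mentioned in the abstract (the most common word associated with a given social category) can be illustrated with a simple frequency count. The sketch below is an assumption-laden simplification, not the paper's actual pipeline: the function name `top_words`, the tiny `STOPWORDS` set, and the sample texts are all hypothetical, and a real analysis would likely need proper linguistic filtering rather than a hand-written stopword list.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "is", "was", "were", "are",
             "and", "very", "she", "he", "they"}

def top_words(outputs_by_category, k=1):
    """Return the k most frequent non-stopword tokens per category.

    outputs_by_category maps a category label to a list of model
    outputs generated from prompts about that category.
    """
    result = {}
    for category, outputs in outputs_by_category.items():
        counts = Counter()
        for text in outputs:
            # Lowercase and keep alphabetic tokens (with apostrophes).
            tokens = re.findall(r"[a-z']+", text.lower())
            counts.update(t for t in tokens if t not in STOPWORDS)
        result[category] = [word for word, _ in counts.most_common(k)]
    return result

# Made-up sample outputs for two placeholder categories.
samples = {
    "group_a": ["They were kind and kind", "She is kind"],
    "group_b": ["He was quiet", "quiet and calm"],
}
print(top_words(samples))
```

Comparing these top words between a baseline model and a fine-tuned model is one rough way to see whether the associations for a category have shifted.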

