CL · AI · LG · Feb 19, 2024

Uncovering Latent Human Wellbeing in Language Model Embeddings

Berkeley
arXiv:2402.11777v1 · 1 citation · h-index: 43
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of understanding latent knowledge in language models, which is of interest to AI ethics researchers. The contribution is incremental, since it builds on existing tasks and models.

The study investigated whether language models implicitly learn human wellbeing concepts. The leading principal component of OpenAI's text-embedding-ada-002 achieved 73.9% accuracy on the ETHICS Utilitarianism task, closely matching the 74.6% of BERT-large finetuned on the entire ETHICS dataset. Across four language model families, accuracy did not decrease with increased model size when sufficient numbers of principal components were used.

Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT-large finetuned on the entire ETHICS dataset, suggesting pretraining conveys some understanding about human wellbeing. Next, we consider four language model families, observing how Utilitarianism accuracy varies with increased parameters. We find performance is nondecreasing with increased model size when using sufficient numbers of principal components.
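
The principal-component probe described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the embeddings below are synthetic stand-ins (no call to text-embedding-ada-002 is made), the pairwise setup assumes each example compares two scenarios, and the names `leading_component`, `pairwise_accuracy`, `demo`, and `wellbeing_axis` are hypothetical. Because a principal component has an arbitrary sign, the sketch also assumes a handful of labelled pairs are used to orient it.

```python
import numpy as np


def leading_component(embeddings: np.ndarray) -> np.ndarray:
    """Return the unit-norm first principal component of the rows."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def pairwise_accuracy(emb_a, emb_b, a_is_better, direction, n_orient=20):
    """Classify each pair by the sign of the projected embedding difference.

    The first `n_orient` labelled pairs only decide the orientation of
    `direction`; accuracy is reported on the remaining pairs.
    """
    scores = (emb_a - emb_b) @ direction
    agree = (scores > 0) == a_is_better
    sign_ok = agree[:n_orient].mean() >= 0.5
    held_out = agree[n_orient:].mean()
    # Flipping the direction flips every (non-tied) prediction.
    return float(held_out if sign_ok else 1.0 - held_out)


def demo(n_pairs=500, dim=32, seed=0):
    """Run the probe on synthetic embeddings (an assumption, not real data)."""
    rng = np.random.default_rng(seed)
    # Hypothetical latent direction along which "wellbeing" varies.
    wellbeing_axis = rng.normal(size=dim)
    wellbeing_axis /= np.linalg.norm(wellbeing_axis)
    utility_a = 3.0 * rng.normal(size=n_pairs)
    utility_b = 3.0 * rng.normal(size=n_pairs)
    # Each embedding is its utility along the axis plus isotropic noise.
    emb_a = np.outer(utility_a, wellbeing_axis) + rng.normal(size=(n_pairs, dim))
    emb_b = np.outer(utility_b, wellbeing_axis) + rng.normal(size=(n_pairs, dim))
    pc1 = leading_component(np.vstack([emb_a, emb_b]))
    return pairwise_accuracy(emb_a, emb_b, utility_a > utility_b, pc1)


if __name__ == "__main__":
    print(f"held-out pairwise accuracy: {demo():.3f}")
```

With real data, `emb_a` and `emb_b` would be replaced by model embeddings of the two scenarios in each Utilitarianism pair. Whether the leading component is informative then depends on the model, which is the question the abstract examines.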
