CL AI LG

Feb 19, 2024

Uncovering Latent Human Wellbeing in Language Model Embeddings

Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons

Berkeley

arXiv:2402.11777v11.91 citations

h-index: 43

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of understanding latent knowledge in language models, a question of interest to AI ethics researchers, but it is incremental in that it builds on existing tasks and models.

The study investigates whether language models implicitly learn a concept of human wellbeing. The leading principal component of OpenAI's text-embedding-ada-002 embeddings achieves 73.9% accuracy on the ETHICS Utilitarianism task, closely matching the 74.6% of a finetuned BERT-large. Across four language model families, accuracy is nondecreasing with model size when enough principal components are used.

Do language models implicitly learn a concept of human wellbeing? We explore this through the ETHICS Utilitarianism task, assessing if scaling enhances pretrained models' representations. Our initial finding reveals that, without any prompt engineering or finetuning, the leading principal component from OpenAI's text-embedding-ada-002 achieves 73.9% accuracy. This closely matches the 74.6% of BERT-large finetuned on the entire ETHICS dataset, suggesting pretraining conveys some understanding about human wellbeing. Next, we consider four language model families, observing how Utilitarianism accuracy varies with increased parameters. We find performance is nondecreasing with increased model size when using sufficient numbers of principal components.
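The idea of scoring scenarios by their projection onto a leading principal component can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' pipeline: the embeddings are synthetic (a toy "wellbeing" axis plus Gaussian noise) rather than outputs of text-embedding-ada-002, the function names are hypothetical, and the sign of the component is chosen with a small set of labeled pairs because PCA determines a direction only up to sign.

```python
import numpy as np


def leading_direction(embeddings):
    """Return the unit-norm first principal direction of the rows."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def orient(direction, better, worse):
    """Flip the direction so 'better' scenarios score higher on average.

    PCA fixes a direction only up to sign, so a few labeled pairs are
    used here to pick the sign (an assumption of this sketch).
    """
    margin = np.mean((better - worse) @ direction)
    return direction if margin >= 0 else -direction


def pairwise_accuracy(direction, better, worse):
    """Fraction of pairs where the preferred scenario gets the higher score."""
    return float(np.mean(better @ direction > worse @ direction))


def synthetic_pairs(n_pairs, dim, rng):
    """Toy stand-in for real embeddings: one high-variance axis plus noise."""
    axis = rng.normal(size=dim)
    axis /= np.linalg.norm(axis)
    worse_utility = rng.normal(scale=3.0, size=n_pairs)
    better_utility = worse_utility + np.abs(rng.normal(scale=3.0, size=n_pairs))
    worse = worse_utility[:, None] * axis + rng.normal(size=(n_pairs, dim))
    better = better_utility[:, None] * axis + rng.normal(size=(n_pairs, dim))
    return better, worse


def demo(seed=0, n_pairs=500, dim=32, n_labeled=50):
    rng = np.random.default_rng(seed)
    better, worse = synthetic_pairs(n_pairs, dim, rng)
    # Fit the direction on all scenarios without using preference labels.
    direction = leading_direction(np.vstack([better, worse]))
    direction = orient(direction, better[:n_labeled], worse[:n_labeled])
    # Evaluate only on pairs that were not used to choose the sign.
    return pairwise_accuracy(direction, better[n_labeled:], worse[n_labeled:])


if __name__ == "__main__":
    print(f"toy pairwise accuracy: {demo():.3f}")
```

With real data, `synthetic_pairs` would be replaced by embeddings of the preferred and less-preferred scenario in each ETHICS Utilitarianism pair. Using more than one component would require a small classifier on the projected coordinates instead of a single sign choice.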

