CL | AI | May 23, 2023

HumBEL: A Human-in-the-Loop Approach for Evaluating Demographic Factors of Language Models in Human-Machine Conversations

arXiv:2305.14195v3 | 106 citations
Originality: Incremental advance
AI Analysis

The paper addresses the need for demographic alignment in public-facing language models, though its contribution is incremental: it applies established clinical techniques to a new evaluation context.

The paper evaluates how large language models adapt to demographic factors such as age in human-machine conversations. It finds that GPT-3.5's capabilities vary widely by task: the model performs like humans aged 6-15 on inference tasks, outperforms a typical 21-year-old at memorization, and exhibits less than 50% of the tested pragmatic skills.

While demographic factors like age and gender change the way people talk, and in particular, the way people talk to machines, there is little investigation into how large pre-trained language models (LMs) can adapt to these changes. To remedy this gap, we consider how demographic factors in LM language skills can be measured to determine compatibility with a target demographic. We suggest clinical techniques from Speech Language Pathology, which has norms for acquisition of language skills in humans. We conduct evaluation with a domain expert (i.e., a clinically licensed speech language pathologist), and also propose automated techniques to complement clinical evaluation at scale. Empirically, we focus on age, finding LM capability varies widely depending on task: GPT-3.5 mimics the ability of humans ranging from age 6-15 at tasks requiring inference, and simultaneously outperforms a typical 21-year-old at memorization. GPT-3.5 also has trouble with social language use, exhibiting less than 50% of the tested pragmatic skills. Findings affirm the importance of considering demographic alignment and conversational goals when using LMs as public-facing tools. Code, data, and a package will be available.

Code Implementations: 1 repo
