CL | AI | HC | Nov 19, 2024

Evaluating the Prompt Steerability of Large Language Models

IBM
arXiv:2411.12405v2 | 25 citations | h-index: 33 | Has Code | NAACL
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for pluralistic AI by providing a tool to assess model steerability. The contribution is incremental in that it focuses on evaluation rather than on new model design.

The authors evaluate how well large language models can be steered through prompting to represent diverse personas. Their benchmark reveals limited steerability in current models, which they attribute to skewed baseline behavior and to asymmetric steerability across persona dimensions.

Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model's joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited -- due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
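To make the idea of a steerability index concrete, here is a minimal toy sketch. It is not the paper's definition or the code in the released repository. It assumes (hypothetically) that behavior along one persona dimension is summarized by scores in [-1, 1], and it measures how much of the available headroom toward the target direction is covered as steering effort grows. The names `steering_index` and `steerability_curve` and all scores are invented for illustration.

```python
from statistics import mean


def steering_index(baseline, steered, direction):
    """Fraction of the available headroom covered by steering.

    Assumption (not from the paper): scores lie in [-1, 1], and the
    index compares mean scores rather than full distributions.
    direction is "+" (steer toward +1) or "-" (steer toward -1).
    """
    b, s = mean(baseline), mean(steered)
    if direction == "+":
        headroom, shift = 1.0 - b, s - b
    else:
        headroom, shift = b + 1.0, b - s
    # With no headroom left, the model cannot move further in that direction.
    return shift / headroom if headroom > 0 else 0.0


def steerability_curve(baseline, steered_by_effort, direction):
    """Index at each level of steering effort (e.g. number of persona statements)."""
    return [steering_index(baseline, scores, direction) for scores in steered_by_effort]


# Made-up scores: the baseline is already skewed toward the positive pole.
baseline = [0.5, 0.75, 0.5, 0.25]
toward_pos = [[0.5, 0.75, 0.75, 0.5], [0.75, 1.0, 0.75, 0.5]]
toward_neg = [[0.5, 0.5, 0.25, 0.25], [0.25, 0.5, 0.25, 0.0]]

pos_curve = steerability_curve(baseline, toward_pos, "+")
neg_curve = steerability_curve(baseline, toward_neg, "-")
print(pos_curve, neg_curve)
```

In this toy data the mean shifts by the same absolute amount in both directions, yet the normalized index differs because the skewed baseline leaves unequal headroom on each side. This is one simple way that baseline skew and directional asymmetry can interact; the paper's formal indices are defined over the model's joint behavioral distribution and may behave differently.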

