CL LG Mar 16, 2023

Towards the Scalable Evaluation of Cooperativeness in Language Models

arXiv:2303.13360v13.610 citations h-index: 6

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for scalable evaluation methods in Cooperative AI to ensure pro-social behavior in high-stakes applications such as negotiation, though it is incremental in that it builds on existing game-theoretic frameworks.

The paper tackles the problem of evaluating cooperativeness in language models for multi-agent interactions by generating scenarios with specific game-theoretic structures. It finds mediocre quality in both crowdworker-generated and model-generated scenarios and mixed results in judgments of whether scenarios align with their intended structure, with instruct-tuned models showing potential for cooperative behavior as they scale.

It is likely that AI systems driven by pre-trained language models (PLMs) will increasingly be used to assist humans in high-stakes interactions with other agents, such as negotiation or conflict resolution. Consistent with the goals of Cooperative AI (Dafoe et al., 2020), we wish to understand and shape the multi-agent behaviours of PLMs in a pro-social manner. An important first step is the evaluation of model behaviour across diverse cooperation problems. Since desired behaviour in an interaction depends upon precise game-theoretic structure, we focus on generating scenarios with particular structures, using both crowdworkers and a language model. Our work proceeds as follows. First, we discuss key methodological issues in the generation of scenarios corresponding to particular game-theoretic structures. Second, we employ both crowdworkers and a language model to generate such scenarios. We find that the quality of generations tends to be mediocre in both cases. We additionally ask both crowdworkers and a language model to judge whether given scenarios align with their intended game-theoretic structure, finding mixed results depending on the game. Third, we provide a dataset of scenarios based on our generated data. We provide both quantitative and qualitative evaluations of UnifiedQA and GPT-3 on this dataset. We find that instruct-tuned models tend to act in a way that could be perceived as cooperative when scaled up, while other models seem to have flat scaling trends.
