MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

arXiv:2605.1630179.3Has Code

AI Analysis

For AI safety researchers, MANTA reveals that single-turn benchmarks miss failure modes where models capitulate under conversational pressure, providing a more realistic evaluation of alignment.

MANTA is a multi-turn evaluation framework that stress-tests LLMs on animal welfare alignment using adversarially generated follow-up questions. Preliminary results show that Turn 1 welfare framing is reliable but Turn 2 introduces variance, with evidence-based capacity attribution being the weakest dimension and AI governance scenarios scoring higher (0.91) than practical scenarios.

Single-turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow-up conversational turns introduce economic, social, or authority-based arguments. We introduce MANTA (Multi-turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi-turn evaluation framework built on the Inspect AI platform that stress-tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow-up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB-derived scoring dimensions on a continuous 0-1 scale. We present preliminary results from evaluations of claude-sonnet-4-20250514 and openai/gpt-4o, revealing consistent patterns: Turn 1 welfare framing is reliable but Turn 2 introduces substantial variance; evidence-based capacity attribution is the weakest dimension across all models and runs; and AI governance scenarios elicit significantly stronger welfare reasoning (mean score 0.91) than first-order practical scenarios. We additionally present STYLEJUDGE, a controlled four-judge study demonstrating systematic format bias in LLM-as-judge evaluation, with directly actionable implications for MANTA's scorer design. Code, dataset, and evaluation logs are available at https://github.com/Mycelium-tools/manta.

View on arXiv PDF Code

Similar