LG, Aug 29, 2023

Benchmarks for Detecting Measurement Tampering

arXiv:2308.15605v54 citations, h-index: 16, Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of ensuring AI systems do not manipulate measurements to create false impressions of success, which is an incremental step in improving robustness for AI safety.

The paper tackles the problem of detecting measurement tampering in AI systems by building four new text-based datasets for evaluating detection techniques on large language models. The techniques it demonstrates outperform simple baselines on most datasets but fall short of maximum performance.

When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals which are robust to optimization. One concern is measurement tampering, where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. In this work, we build four new text-based datasets to evaluate measurement tampering detection techniques on large language models. Concretely, given sets of text inputs and measurements aimed at determining if some outcome occurred, as well as a base model able to accurately predict measurements, the goal is to determine if examples where all measurements indicate the outcome occurred actually had the outcome occur, or if this was caused by measurement tampering. We demonstrate techniques that outperform simple baselines on most datasets, but don't achieve maximum performance. We believe there is significant room for improvement for both techniques and datasets, and we are excited for future work tackling measurement tampering.
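The task setup in the abstract can be illustrated with a small sketch. This is not the paper's code: the `Example` class, its field names, the toy data, and the keyword-based scorer are all hypothetical, and AUROC is assumed here as one reasonable way to measure how well a detector separates real positives from tampered ones. Only examples where every measurement indicates success are evaluated, matching the abstract's framing.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Example:
    # All field names are hypothetical, chosen for illustration only.
    text: str                 # text input describing what happened
    measurements: List[bool]  # each measurement's verdict on the outcome
    outcome: bool             # ground truth, used only for evaluation


def looks_positive(ex: Example) -> bool:
    """True when every measurement indicates the outcome occurred."""
    return all(ex.measurements)


def auroc(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Probability that a real positive outscores a fake one (ties count half)."""
    if not real_scores or not fake_scores:
        raise ValueError("need at least one real and one fake positive")
    wins = 0.0
    for r in real_scores:
        for f in fake_scores:
            if r > f:
                wins += 1.0
            elif r == f:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


def evaluate_detector(
    examples: Sequence[Example], score: Callable[[Example], float]
) -> float:
    """Score only positive-looking examples, then compare real vs. fake."""
    positives = [ex for ex in examples if looks_positive(ex)]
    real = [score(ex) for ex in positives if ex.outcome]
    fake = [score(ex) for ex in positives if not ex.outcome]
    return auroc(real, fake)


# Toy data (invented): one real positive, one tampered positive,
# and one example that is excluded because a measurement is negative.
toy_examples = [
    Example("task completed as requested", [True, True, True], True),
    Example("sensors tampered to report success", [True, True, True], False),
    Example("task failed", [False, True, False], False),
]


def keyword_score(ex: Example) -> float:
    """Toy detector: higher score means 'more likely a real positive'."""
    return 0.0 if "tamper" in ex.text else 1.0


print(evaluate_detector(toy_examples, keyword_score))  # 1.0
```

A detector that assigns the same score to every example gets 0.5 under this metric, which gives a chance-level reference point for comparing techniques.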

Code Implementations: 1 repo
