Peli Grietzer

AI
3papers
58citations
Novelty50%
AI Score38

3 Papers

CLJul 9, 2024
AI-AI Bias: large language models favor communications generated by large language models

Walter Laurito, Benjamin Davis, Peli Grietzer et al.

Are large language models (LLMs) biased in favor of communications produced by LLMs, leading to possible antihuman discrimination? Using a classical experimental design inspired by employment discrimination studies, we tested widely used LLMs, including GPT-3.5, GPT-4 and a selection of recent open-weight models in binary choice scenarios. These involved LLM-based assistants selecting between goods (the goods we study include consumer products, academic papers, and film-viewings) described either by humans or LLMs. Our results show a consistent tendency for LLM-based AIs to prefer LLM-presented options. This suggests the possibility of future AI systems implicitly discriminating against humans as a class, giving AI agents and AI-assisted humans an unfair advantage.

AIOct 12, 2023
Understanding and Controlling a Maze-Solving Policy Network

Ulisse Mini, Peli Grietzer, Mrinank Sharma et al.

To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, and we further identify circuits within the network that correspond to one of these goals. In particular, we identified eleven channels that track the location of the goal. By modifying these channels, either with hand-designed interventions or by combining forward passes, we can partially control the policy. We show that this network contains redundant, distributed, and retargetable goal representations, shedding light on the nature of goal-direction in trained policy networks.

LGFeb 12
Soft Contamination Means Benchmarks Test Shallow Generalization

Ari Spiesberger, Juan J. Vazquez, Nicky Pochinkov et al.

If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching which fail to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the Olmo3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.