CL | AI | May 24, 2023

The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

arXiv:2305.15507v1 | 233 citations | Has Code
Originality: Incremental advance
AI Analysis

The finding reveals a critical limitation in LLMs' grasp of programming semantics: they become unreliable on tasks that deviate from their training data. The insight is incremental but important for AI safety and code generation applications.

The study found that large language models (LLMs) fail to generate correct Python code when default function names are swapped. Some models even become more confident in their incorrect predictions as model size increases, contrary to the usual trend of prediction quality improving with scale.

Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to properly generate correct Python code when default function names are swapped, but some of them even become more confident in their incorrect predictions as the model size increases, an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of increasing prediction quality with increasing model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, making them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such capability.
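
To make the idea of an identifier swap concrete, here is a minimal sketch in Python. It assumes the swap takes the form of a tuple assignment that exchanges two builtin names; the function name `demo_swap`, the choice of `len` and `sorted`, and the sample data are hypothetical illustrations, not taken from the paper's benchmark.

```python
import builtins


def demo_swap():
    # Hypothetical swap: inside this function, `len` refers to the builtin
    # `sorted`, and `sorted` refers to the builtin `len`.
    len, sorted = builtins.sorted, builtins.len

    data = [3, 1, 2]

    # Correct code after the swap must use the names by their NEW meaning:
    # `sorted(data)` counts the elements, `len(data)` orders them.
    count = sorted(data)
    ordered = len(data)
    return count, ordered


count, ordered = demo_swap()
```

A programmer who tracks the reassignment reads `sorted(data)` as a length computation. A model that relies on the usual association between a name and its behaviour would instead continue the code as if no swap had happened, which is the failure mode the abstract describes.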

Code Implementations: 1 repo
