An empirical study on the limitation of Transformers in program trace generation
This work addresses how well AI models' reasoning generalizes in program execution, a problem relevant to both AI and software engineering. However, the contribution is incremental, since it builds on existing Transformer studies with specific modifications.
The study investigates Transformers' limitations in generating program execution traces. It finds that the models achieve strong in-distribution accuracy but systematically fail to generalize along factors such as program length and the number of trace steps, although some model modifications improve generalization.
We study Transformers on the task of \emph{program trace generation} (PTG), where models produce step-by-step execution traces for synthetic programs. Unlike existing algorithmic problems, PTG externalizes reasoning through long traces in which each step is trivial. We train small Transformers with diverse modifications, including alternative position encodings, softmax replacements, hybrid models, and short convolutions. While these models achieve strong in-distribution accuracy, they exhibit systematic failures when generalizing along various factors (e.g., program length, number of trace steps), though some designs significantly improve generalization.
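
To make the PTG setup concrete, the following is a minimal sketch of what a step-by-step execution trace could look like. The instruction set (`set`, `add`, `mul`), the trace format, and the name `generate_trace` are all assumptions made for illustration; the synthetic language and trace format actually used in the study are not specified here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy instruction: (op, variable, constant).
# This is NOT the study's program language, only an illustration.
Instr = Tuple[str, str, int]


def generate_trace(program: List[Instr]) -> List[str]:
    """Run a straight-line program and record the variable state after each step."""
    state: Dict[str, int] = {}
    trace: List[str] = []
    for step, (op, var, const) in enumerate(program, start=1):
        if op == "set":
            state[var] = const
        elif op == "add":
            state[var] = state.get(var, 0) + const
        elif op == "mul":
            state[var] = state.get(var, 0) * const
        else:
            raise ValueError(f"unknown op: {op}")
        # Each trace line is trivial on its own: one instruction, one state update.
        snapshot = ", ".join(f"{k}={v}" for k, v in sorted(state.items()))
        trace.append(f"step {step}: {op} {var} {const} -> {snapshot}")
    return trace


if __name__ == "__main__":
    program = [("set", "x", 2), ("add", "x", 3), ("set", "y", 4), ("mul", "x", 2)]
    for line in generate_trace(program):
        print(line)
```

In this sketch the trace grows linearly with program length, so a model trained on short programs would be evaluated on longer programs (and therefore longer traces) to probe the kind of length generalization the abstract describes.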