When Can Transformers Ground and Compose: Insights from Compositional Generalization Benchmarks
This work offers insights into compositional generalization that are relevant to researchers in AI and linguistics, but the contribution is incremental because it builds on existing benchmarks and methods.
The authors investigate whether transformers can ground and compose language in navigation tasks. They show that a simple transformer model outperforms specialized architectures on ReaSCAN and a modified version of gSCAN. On a new, simpler task, RefEx, they show that a single self-attention layer with a single head generalizes to novel combinations of object attributes.
Humans can reason compositionally whilst grounding language utterances to the real world. Recent benchmarks like ReaSCAN use navigation tasks grounded in a grid world to assess whether neural models exhibit similar capabilities. In this work, we present a simple transformer-based model that outperforms specialized architectures on ReaSCAN and a modified version of gSCAN. On analyzing the task, we find that identifying the target location in the grid world is the main challenge for the models. Furthermore, we show that a particular split in ReaSCAN, which tests depth generalization, is unfair. On an amended version of this split, we show that transformers can generalize to deeper input structures. Finally, we design a simpler grounded compositional generalization task, RefEx, to investigate how transformers reason compositionally. We show that a single self-attention layer with a single head generalizes to novel combinations of object attributes. Moreover, we derive a precise mathematical construction of the transformer's computations from the learned network. Overall, we provide valuable insights about the grounded compositional generalization task and the behaviour of transformers on it, which would be useful for researchers working in this area.
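To make the single-head claim concrete, the following is a minimal sketch of how one attention head could, in principle, pick out a target object from a command such as "red square". It is not the construction derived in the paper. The attribute vocabulary, the one-hot embeddings, the hand-set identity weight matrices, the `sharpness` parameter, and the function names are all assumptions chosen for illustration; nothing here is learned.

```python
import numpy as np

# Hypothetical attribute vocabulary (an assumption, not the RefEx vocabulary).
COLORS = ["red", "green", "blue"]
SHAPES = ["square", "circle", "cylinder"]


def embed(color, shape):
    """Concatenate a one-hot color vector with a one-hot shape vector."""
    vec = np.zeros(len(COLORS) + len(SHAPES))
    vec[COLORS.index(color)] = 1.0
    vec[len(COLORS) + SHAPES.index(shape)] = 1.0
    return vec


def single_head_attention(query, keys, values, w_q, w_k, w_v):
    """Scaled dot-product attention for one query vector and one head."""
    q = query @ w_q
    k = keys @ w_k
    v = values @ w_v
    scores = (k @ q) / np.sqrt(q.shape[-1])
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights, weights @ v


def locate_target(command, grid_objects, sharpness=10.0):
    """Return the index of the most-attended object and the attention weights.

    The weights are set by hand (identity maps, with the query scaled by
    `sharpness`), so each score is proportional to the number of attributes
    an object shares with the command.
    """
    dim = len(COLORS) + len(SHAPES)
    w_q = sharpness * np.eye(dim)
    w_k = np.eye(dim)
    w_v = np.eye(dim)
    query = embed(*command)
    keys = np.stack([embed(c, s) for c, s in grid_objects])
    weights, _ = single_head_attention(query, keys, keys, w_q, w_k, w_v)
    return int(np.argmax(weights)), weights


grid = [("red", "circle"), ("blue", "square"), ("red", "square"), ("green", "cylinder")]
index, weights = locate_target(("red", "square"), grid)
print(index, np.round(weights, 3))
```

Because the score is a sum of per-attribute matches, the object that matches both color and shape receives the highest weight for any color and shape pairing. This additive structure is one intuition for why a single head might handle unseen attribute combinations; whether the learned network behaves this way is a question for the paper's own analysis.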