Pretraining on Interactions for Learning Grounded Affordance Representations
This work addresses the gap between traditional semantic theories of lexical meaning and modern foundation models, and is aimed at researchers in grounded language learning. Its contribution is incremental in that it builds on existing simulation-based approaches.
The paper tackles the problem of integrating affordance information into language representations by training a neural network to predict object trajectories in simulated interactions. The resulting latent representations differentiate between both observed and unobserved affordances, and models trained on 3D simulations from the SPATIAL dataset outperform conventional 2D computer vision models trained on a similar task.
Lexical semantics and cognitive science point to affordances (i.e. the actions that objects support) as critical for understanding and representing nouns and verbs. However, study of these semantic features has not yet been integrated with the "foundation" models that currently dominate language representation research. We hypothesize that predictive modeling of object state over time will result in representations that encode object affordance information "for free". We train a neural network to predict objects' trajectories in a simulated interaction and show that our network's latent representations differentiate between both observed and unobserved affordances. We find that models trained using 3D simulations from our SPATIAL dataset outperform conventional 2D computer vision models trained on a similar task, and, on initial inspection, that differences between concepts correspond to expected features (e.g., roll entails rotation). Our results suggest a way in which modern deep learning approaches to grounded language learning can be integrated with traditional formal semantic notions of lexical representations.
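To make the idea of predictive modeling of object state concrete, the following is a minimal sketch, not the authors' architecture and not the SPATIAL data. It assumes a hypothetical toy state of [position, velocity, angle, angular velocity], a hypothetical rule that "rolling" objects rotate while "sliding" ones do not, and a one-hidden-layer network trained on next-state prediction. All function names and hyperparameters here are illustrative assumptions.

```python
import numpy as np

def simulate(rolls, steps=20, dt=0.1, rng=None):
    """Toy trajectory; state = [position, velocity, angle, angular velocity]."""
    if rng is None:
        rng = np.random.default_rng(0)
    v = rng.uniform(0.5, 1.5)
    # Assumption for illustration: rolling objects rotate, sliding ones do not.
    omega = v if rolls else 0.0
    t = np.arange(steps) * dt
    return np.stack(
        [v * t, np.full(steps, v), omega * t, np.full(steps, omega)], axis=1
    )

def make_dataset(n=200, seed=0):
    """Build (state, next state) pairs, alternating rolling and sliding objects."""
    rng = np.random.default_rng(seed)
    X, Y, labels = [], [], []
    for i in range(n):
        rolls = i % 2 == 0
        traj = simulate(rolls, rng=rng)
        X.append(traj[:-1])
        Y.append(traj[1:])
        labels += [rolls] * (len(traj) - 1)
    return np.concatenate(X), np.concatenate(Y), np.array(labels)

def latent(X, W1):
    """Hidden-layer activations, used here as the 'latent representation'."""
    return np.tanh(X @ W1)

def train(X, Y, hidden=8, lr=0.01, epochs=500, seed=0):
    """Gradient descent on mean squared next-state prediction error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (X.shape[1], hidden))
    W2 = rng.normal(0, 0.1, (hidden, Y.shape[1]))
    losses = []
    for _ in range(epochs):
        H = latent(X, W1)
        err = H @ W2 - Y
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the squared error, up to a constant factor.
        grad_W2 = H.T @ err / len(X)
        grad_H = (err @ W2.T) * (1 - H ** 2)
        grad_W1 = X.T @ grad_H / len(X)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2, losses

if __name__ == "__main__":
    X, Y, labels = make_dataset()
    W1, W2, losses = train(X, Y)
    Z = latent(X, W1)
    # Compare mean latent vectors of rolling vs. sliding states.
    gap = Z[labels].mean(axis=0) - Z[~labels].mean(axis=0)
    print("initial loss:", losses[0], "final loss:", losses[-1])
    print("latent class-mean gap norm:", np.linalg.norm(gap))
```

The network is only ever trained to predict the next state, never to classify "roll" versus "slide". Comparing the latent vectors of the two groups afterwards is a crude stand-in for the kind of probing the abstract describes. In this toy the rotation variables are given directly as input, so any separation is far easier to obtain than in a realistic 3D simulation.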