RO, AI · Apr 9

Exploring Temporal Representation in Neural Processes for Multimodal Action Prediction

Marco Gabriele Fedozzi, Yukie Nagai, Francesco Rea, Alessandra Sciutti

arXiv:2604.08418 · 4.7

AI Analysis

This work addresses action prediction for robotics, but it is incremental as it modifies an existing architecture to enhance temporal representation.

The paper tackles self-supervised multimodal action prediction in robotics. It finds that an existing model (DMBN) struggles to generalize to unseen sequences because of its poor temporal representation, and proposes a revised version (DMBN-PTE) that improves robustness in preliminary results.

Inspired by the human ability to understand and predict others, we study the applicability of Conditional Neural Processes (CNP) to the task of self-supervised multimodal action prediction in robotics. Following recent results regarding the ontogeny of the Mirror Neuron System (MNS), we focus on the preliminary objective of self-action prediction. We find a good MNS-inspired model in the existing Deep Modality Blending Network (DMBN), which can reconstruct the visuo-motor sensory signal during a partially observed action sequence by leveraging the probabilistic generation of CNP. After a qualitative and quantitative evaluation, we highlight its difficulties in generalizing to unseen action sequences and trace the cause to its internal representation of time. We therefore propose a revised version, termed DMBN-Positional Time Encoding (DMBN-PTE), that facilitates learning a more robust representation of temporal information, and we provide preliminary results on its effectiveness in expanding the applicability of the architecture. DMBN-PTE is a first step in the development of robotic systems that autonomously learn to forecast actions on longer time scales, refining their predictions with incoming observations.
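
The abstract does not say how DMBN-PTE encodes time, so the following is only an illustrative sketch of what a positional encoding of a time step can look like. It assumes a sinusoidal scheme in the style of Transformer positional encodings, applied to a time value normalized to [0, 1]. The function name, the `num_freqs` parameter, and the choice of frequencies are hypothetical and are not taken from the paper.

```python
import math


def positional_time_encoding(t, num_freqs=4):
    """Encode a scalar time t as a list of sinusoidal features.

    Assumptions (not from the paper): t is normalized to [0, 1] and the
    frequencies grow geometrically as 2**k * pi, as in common
    Transformer-style positional encodings.
    """
    features = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * t
        # Each frequency contributes a (sin, cos) pair.
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# A raw scalar time becomes a 2 * num_freqs dimensional vector.
print(positional_time_encoding(0.0, num_freqs=2))  # [0.0, 1.0, 0.0, 1.0]
```

In a CNP-style model, a vector like this could replace the raw scalar time attached to each observation and each query. Whether DMBN-PTE does so in this way is not stated in the abstract.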

