CV | AI | RO | Apr 5, 2016

The Curious Robot: Learning Visual Representations via Physical Interactions

arXiv:1604.01360v2 | 193 citations
AI Analysis

This work addresses the challenge of reducing reliance on massive labeled data for computer vision, offering a biologically-inspired approach that could benefit robotics and AI systems, though it is incremental in applying interaction-based learning to a specific robotic setup.

The paper tackles the problem of learning visual representations without relying on large labeled datasets. A Baxter robot physically interacts with objects (pushing, poking, grasping) to collect over 130K datapoints, which are used to train a ConvNet. The resulting network improves image classification compared to learning without external data and, in instance retrieval, outperforms an ImageNet-trained network on recall@1 by 3%.

What is the right supervisory signal to train visual representations? Current approaches in computer vision use category labels from datasets such as ImageNet to train ConvNets. However, in the case of biological agents, visual representation learning does not require millions of semantic labels. We argue that biological agents use physical interactions with the world to learn visual representations, unlike current vision systems, which just use passive observations (images and videos downloaded from the web). For example, babies push objects, poke them, put them in their mouths and throw them to learn representations. Towards this goal, we build one of the first systems on a Baxter platform that pushes, pokes, grasps and observes objects in a tabletop environment. It uses four different types of physical interactions to collect more than 130K datapoints, with each datapoint providing supervision to a shared ConvNet architecture, allowing us to learn visual representations. We show the quality of the learned representations by observing neuron activations and performing nearest neighbor retrieval on the learned representation. Quantitatively, we evaluate our learned ConvNet on image classification tasks and show improvements compared to learning without external data. Finally, on the task of instance retrieval, our network outperforms the ImageNet network on recall@1 by 3%.
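For readers unfamiliar with the retrieval metric mentioned above, the following is a minimal sketch of how recall@1 can be computed with nearest neighbor retrieval over feature vectors. It is not the paper's evaluation code: the function names `cosine_similarity` and `recall_at_1`, the use of cosine similarity as the distance measure, and the toy feature vectors and labels are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def recall_at_1(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of queries whose single nearest gallery item has the same instance id.

    Assumption: similarity is cosine similarity; the paper may use another measure.
    """
    if not query_feats:
        raise ValueError("at least one query is required")
    hits = 0
    for feat, true_id in zip(query_feats, query_ids):
        # Index of the most similar gallery feature (the top-1 neighbor).
        best = max(
            range(len(gallery_feats)),
            key=lambda i: cosine_similarity(feat, gallery_feats[i]),
        )
        if gallery_ids[best] == true_id:
            hits += 1
    return hits / len(query_feats)


# Toy data (hypothetical): 2-D "features" standing in for ConvNet activations.
gallery_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_ids = ["cup", "ball", "box"]
query_feats = [[0.9, 0.1], [0.1, 0.8], [1.0, 0.0]]
query_ids = ["cup", "ball", "box"]  # the third query is deliberately mismatched

score = recall_at_1(query_feats, query_ids, gallery_feats, gallery_ids)
print(f"recall@1 = {score:.3f}")
```

In this toy setup the first two queries retrieve the correct instance and the third does not, so two of three queries count as hits. In practice the feature vectors would come from a chosen layer of the trained network rather than being written by hand.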

