CVROMar 23, 2019

V2CNet: A Deep Learning Framework to Translate Videos to Commands for Robotic Manipulation

arXiv:1903.10869v114 citations
Originality Incremental advance
AI Analysis

This addresses the problem of automating command generation from videos for robotic applications, but it appears incremental as it builds on existing encoder-decoder and TCN methods.

The authors tackled the problem of translating demonstration videos into commands for robotic manipulation by proposing V2CNet, a deep learning framework with two branches for fine-grained understanding, which outperformed state-of-the-art methods by a substantial margin on a new large-scale dataset.

We propose V2CNet, a new deep learning framework to automatically translate the demonstration videos to commands that can be directly used in robotic applications. Our V2CNet has two branches and aims at understanding the demonstration video in a fine-grained manner. The first branch has the encoder-decoder architecture to encode the visual features and sequentially generate the output words as a command, while the second branch uses a Temporal Convolutional Network (TCN) to learn the fine-grained actions. By jointly training both branches, the network is able to model the sequential information of the command, while effectively encodes the fine-grained actions. The experimental results on our new large-scale dataset show that V2CNet outperforms recent state-of-the-art methods by a substantial margin, while its output can be applied in real robotic applications. The source code and trained models will be made available.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes