Fine-grained activity recognition for assembly videos
This work is relevant to researchers and practitioners in robotics and computer vision who need to recognize fine-grained assembly actions: it reports substantially lower recognition error than prior work on a task that requires detailed geometric reasoning.
This paper tackles the problem of recognizing assembly actions in videos, such as building furniture or toy block towers. The system achieves an average framewise accuracy of 70% and an average normalized edit distance of 10% on an IKEA furniture-assembly dataset. On a block-building dataset it achieves an average normalized edit distance of 23%, a 69% relative improvement over prior work.
In this paper we address the task of recognizing assembly actions as a structure (e.g., a piece of furniture or a toy block tower) is built up from a set of primitive objects. Recognizing the full range of assembly actions requires perception at a level of spatial detail that has not been attempted in the action recognition literature to date. We extend the fine-grained activity recognition setting to address the task of assembly action recognition in its full generality by unifying assembly actions and kinematic structures within a single framework. We use this framework to develop a general method for recognizing assembly actions from observation sequences, along with observation features that take advantage of a spatial assembly's special structure. Finally, we evaluate our method empirically on two application-driven data sources: (1) an IKEA furniture-assembly dataset and (2) a block-building dataset. On the first, our system recognizes assembly actions with an average framewise accuracy of 70% and an average normalized edit distance of 10%. On the second, which requires fine-grained geometric reasoning to distinguish between assemblies, our system attains an average normalized edit distance of 23%, a relative improvement of 69% over prior work.
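The two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' evaluation code and rests on several assumptions: per-frame labels are collapsed into segment-level sequences before the edit distance is computed, the distance is a unit-cost Levenshtein distance, and it is normalized by the length of the longer sequence. The abstract does not state the normalization, and all function names below are hypothetical.

```python
from itertools import groupby
from typing import Hashable, List, Sequence


def framewise_accuracy(pred: Sequence[Hashable], true: Sequence[Hashable]) -> float:
    """Fraction of frames whose predicted label matches the ground truth."""
    if len(pred) != len(true) or not true:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def collapse(frames: Sequence[Hashable]) -> List[Hashable]:
    """Merge runs of identical consecutive labels into one segment label."""
    return [label for label, _ in groupby(frames)]


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: Sequence[Hashable], true: Sequence[Hashable]) -> float:
    """Segment-level edit distance divided by the longer sequence length.

    Assumption: this normalization is one common convention, not
    necessarily the one used in the paper.
    """
    p, t = collapse(pred), collapse(true)
    denom = max(len(p), len(t))
    return edit_distance(p, t) / denom if denom else 0.0


if __name__ == "__main__":
    true = ["a", "a", "b", "b", "c", "c"]
    pred = ["a", "a", "a", "a", "c", "c"]
    print(framewise_accuracy(pred, true))
    print(normalized_edit_distance(pred, true))
```

In this toy example the prediction misses segment "b" entirely, so both metrics register the error. A prediction that only shifts a segment boundary by a frame would lower framewise accuracy while leaving the segment-level edit distance at zero, which is one reason to report the two measures together.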