CV · Apr 11, 2021

Deformable Capsules for Object Detection

Rodney Lalonde, Naji Khosravan, Ulas Bagci

arXiv:2104.05031v32.612 citations

Originality: Incremental advance

AI Analysis

The work addresses a bottleneck in computer vision: capsule networks have not scaled to large detection tasks. Enabling them to do so is an incremental but important advance for the field.

The paper scales capsule networks to object detection by introducing DeformCaps, which comprises a new capsule structure (SplitCaps) and a new dynamic routing algorithm (SE-Routing). On MS COCO it achieves results comparable to state-of-the-art one-stage CNN methods, with fewer false positives and better generalization to unusual poses.

Capsule networks promise significant benefits over convolutional networks by storing stronger internal representations and routing information based on the agreement between intermediate representations' projections. Despite this, their success has been limited to small-scale classification datasets due to their computationally expensive nature. Though memory efficient, convolutional capsules impose geometric constraints that fundamentally limit the ability of capsules to model the pose/deformation of objects. Further, they do not address the bigger memory concern of class-capsules scaling up to bigger tasks such as detection or large-scale classification. In this study, we introduce a new family of capsule networks, deformable capsules (DeformCaps), to address a very important problem in computer vision: object detection. We propose two new algorithms associated with our DeformCaps: a novel capsule structure (SplitCaps) and a novel dynamic routing algorithm (SE-Routing). Together they balance computational efficiency with the need to model a large number of objects and classes, something that has never been achieved with capsule networks before. We demonstrate that the proposed methods efficiently scale up to create the first-ever capsule network for object detection in the literature. Our proposed architecture is a one-stage detection framework, and it obtains results on MS COCO that are on par with state-of-the-art one-stage CNN-based methods, while producing fewer false-positive detections and generalizing to unusual poses/viewpoints of objects.
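
To make "routing information based on agreement" concrete, here is a minimal sketch of the standard dynamic routing-by-agreement procedure from the original capsule network literature (Sabour et al., 2017). It is not the paper's SE-Routing, whose details the abstract does not give. The function names, the `u_hat[i][j]` layout, and the default of three iterations are illustrative assumptions.

```python
import math

def squash(s, eps=1e-9):
    """Rescale vector s so its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * x for x in s]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_by_agreement(u_hat, num_iters=3):
    """Classic routing-by-agreement (assumed baseline, not SE-Routing).

    u_hat[i][j] is child capsule i's predicted pose vector for parent capsule j.
    Returns the parent output vectors and the final coupling coefficients.
    """
    n_child, n_parent = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    # Routing logits start at zero, so every parent is initially equally likely.
    b = [[0.0] * n_parent for _ in range(n_child)]
    for _ in range(num_iters):
        # Each child distributes its vote across parents.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_parent):
            # Parent input: coupling-weighted sum of the children's predictions.
            s = [sum(c[i][j] * u_hat[i][j][k] for i in range(n_child))
                 for k in range(dim)]
            v.append(squash(s))
        # Agreement (dot product) between prediction and parent output raises the logit.
        for i in range(n_child):
            for j in range(n_parent):
                b[i][j] += sum(a * p for a, p in zip(u_hat[i][j], v[j]))
    return v, [softmax(row) for row in b]

# Two children agree about parent 0 and contradict each other about parent 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
]
parents, coupling = route_by_agreement(u_hat)
```

In this toy input the children's predictions for parent 1 cancel, so routing shifts coupling toward parent 0. The cost of keeping a logit and a prediction vector for every child-parent pair is the kind of overhead the abstract describes as limiting capsules to small-scale tasks.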

