HindSight: A Graph-Based Vision Model Architecture For Representing Part-Whole Hierarchies
This work addresses the challenge of encoding hierarchical visual information for computer vision applications, though the contribution appears incremental because it builds on existing graph-based and self-supervised techniques.
The paper tackles the problem of representing part-whole hierarchies in images by proposing a graph-based model architecture that divides images into patches at different levels and uses a dynamic feature extraction module to learn rich graph representations. The result is a general-purpose vision encoder model that can be applied to various downstream tasks like image classification and object detection.
This paper presents a model architecture for encoding part-whole hierarchies in images in the form of a graph. The idea is to divide the image into patches at different levels and then treat all of these patches as nodes of a fully connected graph. A dynamic feature extraction module extracts feature representations from these patches in each graph iteration. This enables us to learn a rich graph representation of the image that captures its inherent part-whole hierarchical information. With suitable self-supervised training techniques, such a model can be trained as a general-purpose vision encoder, which can then be used for various vision-related downstream tasks (e.g., image classification, object detection, image captioning).
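To make the graph construction concrete, here is a minimal sketch of the patch-to-node step under stated assumptions. The abstract does not say how the levels are defined, so the sketch assumes non-overlapping square patches whose sizes tile the image exactly. The names `PatchNode`, `multilevel_patch_nodes`, and `fully_connected_edges` are hypothetical and do not come from the paper. The dynamic feature extraction module and the graph iterations are not specified here and are left out.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class PatchNode:
    """One graph node: a square patch located by its top-left corner."""
    level: int  # index into patch_sizes (assumed convention, not from the paper)
    top: int
    left: int
    size: int


def multilevel_patch_nodes(height, width, patch_sizes):
    """Tile the image with non-overlapping square patches at every given size."""
    nodes = []
    for level, size in enumerate(patch_sizes):
        if height % size or width % size:
            raise ValueError(
                f"patch size {size} does not tile a {height}x{width} image"
            )
        for top in range(0, height, size):
            for left in range(0, width, size):
                nodes.append(PatchNode(level, top, left, size))
    return nodes


def fully_connected_edges(num_nodes):
    """Return every directed edge (i, j) with i != j."""
    return list(permutations(range(num_nodes), 2))


# A 32x32 image split at three assumed levels: 8x8, 16x16, and the whole image.
nodes = multilevel_patch_nodes(32, 32, patch_sizes=(8, 16, 32))
edges = fully_connected_edges(len(nodes))
print(len(nodes), len(edges))  # 21 420
```

With these assumed sizes the graph has 16 + 4 + 1 = 21 nodes and 21 × 20 = 420 directed edges. The edge count grows quadratically with the number of patches, which is worth keeping in mind when choosing how fine the smallest level should be.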