LG CV | May 23, 2024

Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations

Mohammed Baharoon, Jonathan Klein, Dominik L. Michels

arXiv:2405.14239v32.6 | h-index: 6 | Has Code | Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work addresses a limitation of vision-language models that matters to computer vision practitioners: they struggle to learn localized features. It offers an incremental improvement over existing joint self- and weakly supervised methods.

The paper tackles the problem of learning general-purpose visual representations. It combines vision-language training with self-supervision to improve performance on dense prediction tasks such as segmentation and detection. The resulting framework, Harmony, is reported to significantly outperform baseline CLIP and to outperform other joint methods such as SLIP, MaskCLIP, and DetailCLIP across various downstream tasks.

Vision-language contrastive learning frameworks such as CLIP enable learning representations from natural language supervision and provide strong zero-shot classification capabilities. However, due to the nature of the supervisory signal in these paradigms, they lack the ability to learn localized features, leading to degraded performance on dense prediction tasks such as segmentation and detection. On the other hand, self-supervised learning methods have shown the ability to learn granular representations, complementing the high-level features in vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that can be generalized across different downstream vision tasks. Our framework is specifically designed to work on web-scraped data by not relying on negative examples in the self-supervised learning path and addressing the one-to-one correspondence issue using soft CLIP targets generated by an EMA model. Moreover, Harmony optimizes for five different objectives simultaneously, efficiently utilizing the supervision in each data example, making it even more suited to data-constrained settings. We comprehensively evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP and outperforms the previously leading joint self- and weakly supervised methods, SLIP, MaskCLIP, and DetailCLIP.
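The abstract mentions soft CLIP targets generated by an EMA model but gives no formulas. The sketch below is one plausible reading, not the authors' implementation: an exponential-moving-average (EMA) teacher tracks the student weights, and the teacher's image-to-text similarity distribution is blended with the usual one-hot targets. The function names, the blending weight `alpha`, the `temperature`, and the `momentum` value are all assumptions made for illustration.

```python
import math

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def softmax(row, temperature=1.0):
    scaled = [x / temperature for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_targets(teacher_sims, alpha=0.5, temperature=0.1):
    """Blend one-hot targets with the teacher's similarity distribution.

    alpha is a hypothetical mixing weight: 0 gives plain one-hot CLIP targets,
    1 gives targets taken entirely from the teacher.
    """
    targets = []
    for i, row in enumerate(teacher_sims):
        probs = softmax(row, temperature)
        targets.append([
            (1.0 - alpha) * (1.0 if j == i else 0.0) + alpha * p
            for j, p in enumerate(probs)
        ])
    return targets

def soft_cross_entropy(student_sims, targets, temperature=0.1):
    """Mean cross-entropy between soft targets and student predictions."""
    loss = 0.0
    for row, target in zip(student_sims, targets):
        probs = softmax(row, temperature)
        loss -= sum(t * math.log(p) for t, p in zip(target, probs))
    return loss / len(student_sims)

# Toy batch of 3 image-text pairs; entry [i][j] is the similarity
# between image i and caption j (made-up numbers).
teacher_sims = [[0.9, 0.7, 0.1], [0.6, 0.8, 0.2], [0.1, 0.2, 0.9]]
student_sims = [[0.8, 0.5, 0.2], [0.4, 0.9, 0.1], [0.2, 0.1, 0.7]]

targets = soft_targets(teacher_sims)
loss = soft_cross_entropy(student_sims, targets)
new_teacher = ema_update([1.0, 2.0], [0.0, 4.0], momentum=0.9)
```

In this reading, a caption that also fits another image in the batch receives some probability mass from the teacher instead of being treated as a pure negative, which is one way to relax the one-to-one correspondence assumption on noisy web data.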
