LG · ML · Jun 24, 2024

The Hidden Pitfalls of the Cosine Similarity Loss

arXiv:2406.16468v1 · 9 citations
Originality: Incremental advance
AI Analysis

The paper addresses a fundamental issue in self-supervised learning that is relevant to machine learning practitioners, though the advance is incremental because it builds on known loss functions.

The paper identifies that the gradient of the cosine similarity loss goes to zero in two specific settings: when a point has large magnitude, and when the two points lie on opposite ends of the latent space. It proves that optimizing the cosine similarity forces points to grow in magnitude, which makes the large-magnitude setting unavoidable in practice. It proposes cut-initialization, a simple change to network initialization that helps self-supervised learning methods converge faster.

We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.
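The two settings and the magnitude-growth claim can be checked numerically on toy vectors. The sketch below is not taken from the paper; it assumes plain NumPy, 2-D points, and a hand-written analytic gradient of cos(x, y) with respect to x, whose norm equals sin(θ)/‖x‖ for angle θ between the points. The helper names (`cosine_grad_wrt_x`, `ascent_step`) and the learning rate are illustrative choices.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def cosine_grad_wrt_x(x, y):
    """Analytic gradient of cos(x, y) with respect to x.

    d cos / dx = (y_hat - cos * x_hat) / ||x||, where x_hat and y_hat are
    the unit vectors along x and y.
    """
    x_norm = np.linalg.norm(x)
    x_hat = x / x_norm
    y_hat = y / np.linalg.norm(y)
    cos = float(x_hat @ y_hat)
    return (y_hat - cos * x_hat) / x_norm

y = np.array([1.0, 0.0])
x = np.array([1.0, 1.0])

# Setting (1): scaling x up shrinks the gradient norm proportionally.
for scale in (1.0, 10.0, 100.0):
    g = cosine_grad_wrt_x(scale * x, y)
    print(f"scale={scale:6.1f}  grad norm={np.linalg.norm(g):.6f}")

# Setting (2): as unit-norm x approaches the point opposite y (180 degrees),
# the gradient norm also shrinks toward zero.
for angle_deg in (90.0, 170.0, 179.9):
    theta = np.deg2rad(angle_deg)
    x_theta = np.array([np.cos(theta), np.sin(theta)])
    g = cosine_grad_wrt_x(x_theta, y)
    print(f"angle={angle_deg:6.1f}  grad norm={np.linalg.norm(g):.6f}")

# Magnitude growth: the gradient is orthogonal to x, so any nonzero step
# along it increases ||x|| (Pythagoras).
def ascent_step(x, y, lr=0.5):
    """One gradient-ascent step on cos(x, y); lr is an arbitrary choice."""
    return x + lr * cosine_grad_wrt_x(x, y)

x_new = ascent_step(x, y)
print("g . x          =", float(cosine_grad_wrt_x(x, y) @ x))
print("norm before    =", np.linalg.norm(x))
print("norm after     =", np.linalg.norm(x_new))
```

Because the gradient is always perpendicular to x, the squared norm after a step is the old squared norm plus the squared step length, so repeated updates can only push points outward, which in turn shrinks later gradients.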

