CV ML Sep 5, 2022

Design of the topology for contrastive visual-textual alignment

arXiv:2209.02127v21.42 citations h-index: 12 Has Code

Originality: Incremental advance

AI Analysis

This work addresses noise in large-scale training data for visual-textual alignment, offering a domain-specific improvement.

The paper tackles the problem of noisy training data in contrastive visual-textual alignment by analyzing the role of the softmax temperature from a topological perspective and proposing an alternative embedding topology based on an oblique manifold. This approach improves the zero-shot classification performance of baseline CLIP models by an average of 6.1%.

Cosine similarity is the common choice for measuring the distance between feature representations in contrastive visual-textual alignment learning. Empirically, however, a learnable softmax temperature parameter is required when learning on large-scale noisy training data. In this work, we first discuss the role of the softmax temperature in terms of the embedding space's topological properties. We argue that the softmax temperature is the key mechanism for contrastive learning on noisy training data: it acts as a scaling factor of the distance range (e.g. [-1, 1] for the cosine similarity), and its learned value indicates the level of noise in the training data. We then propose an alternative design of the topology for the embedding alignment. We use multiple class tokens in the transformer architecture and map the feature representations onto an oblique manifold endowed with the negative inner product as the distance function. With this configuration, we substantially improve the zero-shot classification performance of baseline CLIP models pre-trained on large-scale datasets, by an average of 6.1%.
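
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the embedding is split into m equal sub-vectors (standing in for the outputs of m class tokens), that each sub-vector is normalised to unit length so the whole lies on a product of m unit spheres (an oblique manifold), and that similarity is the summed inner product, whose negative serves as the distance. The function names and the `scale` parameter are hypothetical. With m = 1 the similarity reduces to cosine similarity with range [-1, 1]; with m sub-vectors the range widens to [-m, m], which is the kind of range scaling the abstract attributes to the softmax temperature.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def to_oblique(embedding, m):
    """Split an embedding into m equal chunks and normalise each chunk.

    Assumption: each chunk plays the role of one class-token output.
    """
    if len(embedding) % m != 0:
        raise ValueError("embedding length must be divisible by m")
    n = len(embedding) // m
    return [normalize(embedding[i * n:(i + 1) * n]) for i in range(m)]


def oblique_similarity(a, b):
    """Sum of per-chunk inner products; lies in [-m, m].

    The negative of this value is used as the distance.
    """
    return sum(
        sum(x * y for x, y in zip(chunk_a, chunk_b))
        for chunk_a, chunk_b in zip(a, b)
    )


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def match_probabilities(image, texts, m, scale=1.0):
    """Probability that the image matches each text embedding.

    `scale` is a hypothetical inverse-temperature factor on the logits.
    """
    image_point = to_oblique(image, m)
    logits = [
        scale * oblique_similarity(image_point, to_oblique(text, m))
        for text in texts
    ]
    return softmax(logits)


if __name__ == "__main__":
    image = [3.0, 4.0, 1.0, 0.0]
    texts = [[3.0, 4.0, 1.0, 0.0], [-3.0, -4.0, -1.0, 0.0]]
    print(match_probabilities(image, texts, m=2))
```

In this sketch, increasing m stretches the logit range without a learned temperature, while `scale` stretches it for a fixed m; which combination works in practice is an empirical question the paper addresses.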

