Cross-Modal Discrete Representation Learning
This work addresses the challenge of capturing fine-grained concepts across modalities such as video, text, and audio. The contribution is incremental in that it builds on existing representation learning methods.
The authors tackled the problem of learning fine-grained cross-modal representations by introducing a self-supervised framework with a discretized embedding space and a Cross-Modal Code Matching objective, resulting in improved performance on cross-modal retrieval tasks and the ability to localize objects/actions without direct supervision.
Recent advances in representation learning have demonstrated an ability to represent information from different modalities, such as video, text, and audio, in a single high-level embedding vector. In this work, we present a self-supervised learning framework that learns a representation capturing finer levels of granularity across modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space, created via vector quantization, that is shared across modalities. Beyond the shared embedding space, we propose a Cross-Modal Code Matching objective that forces the representations from different views (modalities) to have a similar distribution over the discrete embedding space, so that cross-modal object/action localization can be performed without direct supervision. In our experiments, we show that the proposed discretized multi-modal fine-grained representation (e.g., pixel/word/frame) can complement high-level summary representations (e.g., video/sentence/waveform) for improved performance on cross-modal retrieval tasks. We also observe that the discretized representation uses individual clusters to represent the same semantic concept across modalities.
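To make the two ingredients above concrete, the following is a minimal NumPy sketch of a codebook shared across modalities and of a loss that rewards similar code distributions. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact form of the Cross-Modal Code Matching objective, so a symmetric cross-entropy between per-sample code distributions is used as a stand-in. All names (`quantize`, `code_distribution`, `code_matching_loss`, `temperature`) and the toy codebook are hypothetical.

```python
import numpy as np


def squared_distances(features, codebook):
    """Squared Euclidean distance from each feature (T, D) to each code (K, D)."""
    diff = features[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=-1)  # shape (T, K)


def quantize(features, codebook):
    """Hard vector quantization: map each fine-grained feature to its nearest code."""
    indices = squared_distances(features, codebook).argmin(axis=1)
    return indices, codebook[indices]


def code_distribution(features, codebook, temperature=1.0):
    """Soft code usage for one sample: softmax over negative distances,
    averaged over the sample's fine-grained units (frames, words, pixels)."""
    logits = -squared_distances(features, codebook) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)  # shape (K,)


def code_matching_loss(p, q, eps=1e-8):
    """Assumed stand-in objective: symmetric cross-entropy between two
    code distributions. Lower means the two views use similar codes."""
    return -0.5 * (np.sum(p * np.log(q + eps)) + np.sum(q * np.log(p + eps)))


# Toy setup (assumption): 8 well-separated codes in an 8-dimensional space.
rng = np.random.default_rng(0)
codebook = 3.0 * np.eye(8)

# "Video" frames and "audio" frames of the same clip land near codes 0 and 1;
# audio from an unrelated clip lands near codes 5 and 6.
video = codebook[[0, 1]] + 0.01 * rng.standard_normal((2, 8))
audio_same = codebook[[0, 1]] + 0.01 * rng.standard_normal((2, 8))
audio_other = codebook[[5, 6]] + 0.01 * rng.standard_normal((2, 8))

video_codes, _ = quantize(video, codebook)
p_video = code_distribution(video, codebook)
loss_matched = code_matching_loss(p_video, code_distribution(audio_same, codebook))
loss_mismatched = code_matching_loss(p_video, code_distribution(audio_other, codebook))

print("video code indices:", video_codes)
print("matched loss:", loss_matched, "mismatched loss:", loss_mismatched)
```

In this toy setting the matched pair yields a lower loss than the mismatched pair, which is the behavior a code matching objective is meant to encourage. Because both modalities index the same codebook, the code indices assigned to individual frames or words can also be compared directly, which is one plausible way to read the localization claim.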