LG · MM · Oct 9, 2021

Embed Everything: A Method for Efficiently Co-Embedding Multi-Modal Spaces

arXiv:2110.04599v1
AI Analysis

This paper addresses the challenge of bridging different latent spaces for general AI systems, though it appears incremental because it builds on existing pretrained models and heterogeneous transfer learning (HTL) techniques.

The paper tackles the problem of expensive multi-modal network training by proposing a cost-effective heterogeneous transfer learning strategy for co-embedding multi-modal spaces, achieving success in a joint image-audio embedding task.

Any general artificial intelligence system must be able to interpret, operate on, and produce data in a multi-modal latent space that can represent audio, imagery, text, and more. In the last decade, deep neural networks have seen remarkable success in unimodal data distributions, while transfer learning techniques have seen a massive expansion of model reuse across related domains. However, training multi-modal networks from scratch remains expensive and elusive, while heterogeneous transfer learning (HTL) techniques remain relatively underdeveloped. In this paper, we propose a novel and cost-effective HTL strategy for co-embedding multi-modal spaces. Our method avoids cost inefficiencies by preprocessing embeddings using pretrained models for all components, without passing gradients through these models. We demonstrate the use of this system in a joint image-audio embedding task. Our method has wide-reaching applications, as successfully bridging the gap between different latent spaces could provide a framework for the promised "universal" embedding.
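
The abstract does not describe the bridging architecture itself, so the following is only a minimal sketch of the general idea it states: embeddings are computed once by frozen pretrained models, and only a small mapping between the cached embeddings is trained, so no gradients reach the encoders. The ridge-regression linear map, the synthetic arrays standing in for cached encoder outputs, and the names `fit_linear_bridge` and `cosine_sim` are all assumptions made for illustration, not the paper's method.

```python
import numpy as np

def fit_linear_bridge(src, dst, ridge=1e-3):
    """Fit W minimizing ||src @ W - dst||^2 + ridge * ||W||^2 (closed form)."""
    d = src.shape[1]
    gram = src.T @ src + ridge * np.eye(d)
    return np.linalg.solve(gram, src.T @ dst)

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
n_pairs, d_audio, d_image = 200, 16, 8

# Hypothetical stand-ins for embeddings cached from frozen pretrained
# audio and image encoders; a real pipeline would load these from disk.
audio_emb = rng.normal(size=(n_pairs, d_audio))
true_map = rng.normal(size=(d_audio, d_image))
image_emb = audio_emb @ true_map + 0.01 * rng.normal(size=(n_pairs, d_image))

# Only the small bridge is trained; the encoders are never touched.
W = fit_linear_bridge(audio_emb, image_emb)

# Retrieval check: does each projected audio vector land nearest its paired image?
sims = cosine_sim(audio_emb @ W, image_emb)
top1 = float(np.mean(np.argmax(sims, axis=1) == np.arange(n_pairs)))
print(f"top-1 retrieval accuracy on the synthetic pairs: {top1:.2f}")
```

Because the encoder outputs are cached, each training pass costs only as much as the small bridge, which is the source of the savings the abstract claims. A real system would likely need a nonlinear mapping and held-out evaluation rather than this synthetic, in-sample check.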

