LG · AI · CV · Oct 28, 2022

Multimodal Transformer for Parallel Concatenated Variational Autoencoders

arXiv:2210.16174v1 · 6 citations · h-index: 89
Originality: Synthesis-oriented
AI Analysis

This work addresses synthetic data generation for multimodal applications, but appears incremental as it builds on existing VAE and transformer methods.

The paper tackles cross-modal data generation by proposing a multimodal transformer that serves as the encoder for parallel concatenated variational autoencoders (PC-VAE). It feeds the transformer column stripes of the image instead of patches and trains the model with a new loss function based on interaction information. The abstract says experiments were performed but reports no quantitative results.

In this paper, we propose a multimodal transformer using a parallel concatenated architecture. Instead of patches, we use column stripes of the image's R, G, and B channels as the transformer input. The column stripes preserve the spatial relations of the original image. We combine the multimodal transformer with a variational autoencoder for synthetic cross-modal data generation. The multimodal transformer is designed using multiple compression matrices, and it serves as the encoders for Parallel Concatenated Variational AutoEncoders (PC-VAE). The PC-VAE consists of multiple encoders, one latent space, and two decoders. The encoders are based on random Gaussian matrices and require no training. We propose a new loss function based on the interaction information from partial information decomposition. The interaction information evaluates the input cross-modal information and the decoder output. The PC-VAE is trained by minimizing this loss function. Experiments are performed to validate the proposed multimodal transformer for PC-VAE.
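The column-stripe input and the training-free Gaussian encoders can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names `column_stripes` and `gaussian_encoder`, the stripe ordering (all columns of R, then G, then B), the output dimension, and the 1/sqrt(out_dim) scaling are hypothetical and are not taken from the paper, whose compression matrices may be constructed differently.

```python
import numpy as np


def column_stripes(image):
    """Split an (H, W, 3) image into 3*W column stripes of length H.

    Assumed ordering: all W columns of channel 0, then channel 1, then channel 2.
    """
    h, w, c = image.shape
    # Transpose to (C, W, H) so that each row of the result is one pixel column.
    return image.transpose(2, 1, 0).reshape(c * w, h)


def gaussian_encoder(stripes, out_dim, seed=0):
    """Compress each stripe with a fixed random Gaussian matrix (no training)."""
    rng = np.random.default_rng(seed)
    h = stripes.shape[1]
    # Hypothetical scaling choice; the entries are i.i.d. N(0, 1/out_dim).
    projection = rng.standard_normal((h, out_dim)) / np.sqrt(out_dim)
    return stripes @ projection


# Toy 4x5 RGB image with distinct values so the stripes are easy to inspect.
image = np.arange(4 * 5 * 3, dtype=np.float64).reshape(4, 5, 3)
stripes = column_stripes(image)               # shape (15, 4)
codes = gaussian_encoder(stripes, out_dim=2)  # shape (15, 2)
```

Each stripe is a full-height pixel column of one color channel, so vertical structure stays intact within a token, while the fixed seed makes the untrained projection reproducible.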
