Multimodal Transformer for Parallel Concatenated Variational Autoencoders
This work addresses synthetic data generation for multimodal applications, but appears incremental as it builds on existing VAE and transformer methods.
The paper tackles cross-modal data generation by proposing a multimodal transformer that serves as the encoder for parallel concatenated variational autoencoders (PC-VAE). It uses column stripes rather than patches as image input and introduces a new loss function based on interaction information. Experimental validation is claimed, but no results are specified.
In this paper, we propose a multimodal transformer using a parallel concatenated architecture. Instead of using patches, we use column stripes from the R, G, and B channels of an image as the transformer input. The column stripes preserve the spatial relations of the original image. We combine the multimodal transformer with a variational autoencoder for synthetic cross-modal data generation. The multimodal transformer is designed using multiple compression matrices, and it serves as the encoders for Parallel Concatenated Variational AutoEncoders (PC-VAE). The PC-VAE consists of multiple encoders, one latent space, and two decoders. The encoders are based on random Gaussian matrices and do not need any training. We propose a new loss function based on the interaction information from partial information decomposition. The interaction information evaluates the input cross-modal information and the decoder output. The PC-VAE is trained by minimizing the loss function. Experiments are performed to validate the proposed multimodal transformer for PC-VAE.
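To make the column-stripe input and the training-free encoders more concrete, the following is a minimal NumPy sketch, not the paper's implementation. The function names `column_stripes` and `gaussian_compress`, the parameters `out_dim` and `seed`, the channel-major token order, and the `1/sqrt(out_dim)` scaling are all assumptions made for illustration. The abstract does not say how the multiple compression matrices are arranged, so a single matrix stands in for them here.

```python
import numpy as np


def column_stripes(image):
    """Split an H x W x 3 image into one token per (channel, column).

    Returns an array of shape (3 * W, H): each row is one column stripe
    of height H taken from a single colour channel (hypothetical layout).
    """
    h, w, c = image.shape
    # (H, W, C) -> (C, W, H): channel first, then column index, then the
    # pixels running down that column.
    return image.transpose(2, 1, 0).reshape(c * w, h)


def gaussian_compress(tokens, out_dim, seed=0):
    """Project each stripe to out_dim values with a fixed random Gaussian matrix.

    The matrix is drawn once from a seeded generator and never trained.
    """
    rng = np.random.default_rng(seed)
    in_dim = tokens.shape[1]
    # Standard deviation 1/sqrt(out_dim) keeps squared norms unchanged
    # in expectation (an assumed, conventional choice).
    matrix = rng.normal(0.0, 1.0 / np.sqrt(out_dim), size=(in_dim, out_dim))
    return tokens @ matrix


# Toy 4 x 5 RGB image, so there are 3 * 5 = 15 stripes of height 4.
image = np.arange(4 * 5 * 3, dtype=np.float64).reshape(4, 5, 3)
stripes = column_stripes(image)
codes = gaussian_compress(stripes, out_dim=2)
```

Each row of `stripes` is a full-height column from one channel, so vertical neighbours stay within a single token, which is one way to read the claim that stripes preserve spatial relations. Because the projection matrix depends only on the seed, repeated calls produce the same codes without any training step.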