Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders
For machine learning researchers and practitioners, this work addresses a key limitation of multimodal VAE methods and offers an incremental improvement over existing aggregation techniques.
The paper tackles the problem of aggregating single-modality distributions in multimodal variational autoencoders (VAEs). It introduces the CoDE method, which avoids the independence assumption used in existing approaches and improves generative coherence, generative quality, and log-likelihood estimates. CoDE-VAE minimizes the generative quality gap as the number of modalities increases and achieves classification accuracy comparable to state-of-the-art models.
Multimodal learning with variational autoencoders (VAEs) requires estimating joint distributions to evaluate the evidence lower bound (ELBO). Current methods, the product and mixture of experts, aggregate single-modality distributions under an independence assumption adopted for simplicity, which is overoptimistic. This research introduces a novel methodology for aggregating single-modality distributions that exploits the principle of consensus of dependent experts (CoDE) and thereby circumvents this assumption. Using the CoDE method, we propose a novel ELBO that approximates the joint likelihood of the multimodal data by learning the contribution of each subset of modalities. The resulting CoDE-VAE model better balances the trade-off between generative coherence and generative quality and produces more precise log-likelihood estimates. CoDE-VAE further minimizes the generative quality gap as the number of modalities increases. In certain cases, it reaches a generative quality similar to that of unimodal VAEs, a desirable property that most current methods lack. Finally, the classification accuracy achieved by CoDE-VAE is comparable to that of state-of-the-art multimodal VAE models.
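To make the independence assumption concrete, the following is a minimal sketch of the standard product of Gaussian experts, one of the baseline aggregation schemes the abstract refers to. It is not the CoDE method. The function name `product_of_gaussian_experts` and the restriction to one-dimensional Gaussians are assumptions made for illustration only.

```python
from typing import List, Tuple


def product_of_gaussian_experts(
    means: List[float], variances: List[float]
) -> Tuple[float, float]:
    """Combine 1-D Gaussian experts N(mean_i, var_i) by multiplying densities.

    Illustrative baseline only (hypothetical helper, not the CoDE method):
    the product treats the experts as independent, so precisions simply add.
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equal length")
    if any(v <= 0.0 for v in variances):
        raise ValueError("variances must be positive")

    # Precision (inverse variance) of each expert.
    precisions = [1.0 / v for v in variances]

    # Under independence, the joint precision is the sum of precisions.
    joint_variance = 1.0 / sum(precisions)

    # The joint mean is the precision-weighted average of the expert means.
    joint_mean = joint_variance * sum(p * m for p, m in zip(precisions, means))
    return joint_mean, joint_variance


if __name__ == "__main__":
    # Two experts with equal confidence but different means.
    print(product_of_gaussian_experts([0.0, 2.0], [1.0, 1.0]))
```

Note that two identical experts N(0, 1) carry no more information than one, yet `product_of_gaussian_experts([0.0, 0.0], [1.0, 1.0])` reports a variance of 0.5. This is the kind of overconfidence that an independence assumption can introduce when the experts are in fact dependent.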