Supervised Vector Quantized Variational Autoencoder for Learning Interpretable Global Representations
This work addresses interpretable representation learning in deep generative models and is aimed at researchers in machine learning and bioinformatics; it appears incremental, since it builds on existing VQ-VAE methods by adding supervision.
The authors address the challenge of learning an interpretable global representation for each class of data by introducing the Supervised Vector Quantized Variational AutoEncoder (S-VQ-VAE), which integrates supervised and unsupervised learning to capture class-specific characteristics. They demonstrate the model on MNIST and on gene expression data from LINCS, where it reveals mechanism correlations between perturbagen classes, knowledge that is relevant to drug development.
Learning interpretable representations of data remains a central challenge in deep learning. When training a deep generative model, the observed data are often associated with categorical labels, and, alongside learning to regenerate data and simulate new data, learning an interpretable representation of each class of data is itself a way of acquiring knowledge. Here, we present a novel generative model, the Supervised Vector Quantized Variational AutoEncoder (S-VQ-VAE), which combines supervised and unsupervised learning to obtain a unique, interpretable global representation for each class of data. Compared with conventional generative models, our model has three key advantages: first, it is an integrative model that simultaneously learns a feature representation for each individual data point and a global representation for each class of data; second, the learning of global representations with embedding codes is guided by supervised information, which clearly defines the interpretation of each code; and third, the global representations capture crucial characteristics of different classes, revealing similarities and differences in the statistical structures underlying different groups of data. We evaluated the utility of S-VQ-VAE on a machine learning benchmark, the MNIST dataset, and on gene expression data from the Library of Integrated Network-Based Cellular Signatures (LINCS). We showed that S-VQ-VAE was able to learn the global genetic characteristics of samples perturbed by the same perturbagen class (PCL), and that it further revealed the mechanism correlations between PCLs. Such knowledge is crucial for promoting new drug development for complex diseases like cancer.
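The idea of label-guided embedding codes that serve as global class representations can be illustrated with a minimal NumPy sketch. It assumes, for illustration only, a codebook with one embedding code per class. The function names, the Euclidean lookup, the cosine comparison, and the toy values are all hypothetical and are not taken from the S-VQ-VAE implementation; the encoder, decoder, and training losses are omitted.

```python
import numpy as np

def nearest_code(z, codebook):
    """Unsupervised lookup: index of the closest code (Euclidean) for each row of z."""
    # z has shape (n, d); codebook has shape (k, d), assumed one row per class.
    sq_dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)

def supervised_quantize(labels, codebook):
    """Supervised lookup (assumption): map each sample to the code of its own class."""
    return codebook[labels]

def class_similarity(codebook):
    """Cosine similarity between class codes, a simple way to compare classes."""
    norms = np.linalg.norm(codebook, axis=1, keepdims=True)
    unit = codebook / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Toy values (hypothetical): three classes, two-dimensional codes.
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([[0.9, 0.1], [0.1, 0.8]])   # stand-ins for encoder outputs
labels = np.array([0, 1])                # stand-ins for class labels

predicted = nearest_code(z, codebook)
quantized = supervised_quantize(labels, codebook)
similarity = class_similarity(codebook)
```

In this sketch, `nearest_code` plays the role of assigning an unlabeled sample to a class, while `class_similarity` compares the learned class codes with one another, in the spirit of the comparison between PCLs described above.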