LG DATA-AN BM

Feb 18, 2019

Learning Compositional Representations of Interacting Systems with Restricted Boltzmann Machines: Comparative Study of Lattice Proteins

Jérôme Tubiana, Simona Cocco, Rémi Monasson

arXiv:1902.06495v15.429 citations

Originality: Incremental advance

AI Analysis

This work addresses the problem of learning compositional representations of interacting systems, such as protein sequences, and is aimed at researchers in computational biology and machine learning. It is an incremental advance: it builds on existing RBM methods and adds a comparative analysis.

The study investigates how the features that Restricted Boltzmann Machines (RBMs) learn from protein sequence data depend on the model's defining parameters. With adequate hidden-layer size and feature sparsity, RBMs enter a compositional phase in which sampled configurations are obtained by recombining the learned features. On synthetic lattice-protein data, for which ground-truth interactions are known, RBMs capture the underlying interactions better than the other representation learning algorithms compared, and are significantly more robust to sample size than deterministic methods such as PCA or ICA.

A Restricted Boltzmann Machine (RBM) is an unsupervised machine-learning bipartite graphical model that jointly learns a probability distribution over data and extracts their relevant statistical features. As such, RBM were recently proposed for characterizing the patterns of coevolution between amino acids in protein sequences and for designing new sequences. Here, we study how the nature of the features learned by RBM changes with its defining parameters, such as the dimensionality of the representations (size of the hidden layer) and the sparsity of the features. We show that for adequate values of these parameters, RBM operate in a so-called compositional phase in which visible configurations sampled from the RBM are obtained by recombining these features. We then compare the performance of RBM with other standard representation learning algorithms, including Principal or Independent Component Analysis, autoencoders (AE), variational auto-encoders (VAE), and their sparse variants. We show that RBM, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA. In addition, this stochastic mapping is not prescribed a priori as in VAE, but learned from data, which allows RBM to show good performance even with shallow architectures. All numerical results are illustrated on synthetic lattice-protein data, that share similar statistical features with real protein sequences, and for which ground-truth interactions are known.
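The stochastic mapping between data configurations and representations mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest textbook case, binary (Bernoulli) visible and hidden units with random untrained weights; the paper's own models for protein sequences may use different unit types and training settings. The class `TinyRBM` and all of its method names are hypothetical, chosen only for this illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


class TinyRBM:
    """Illustrative Bernoulli-Bernoulli RBM (an assumption, not the paper's model)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        # Weights couple every visible unit to every hidden unit (bipartite graph);
        # there are no visible-visible or hidden-hidden couplings.
        self.W = 0.1 * self.rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible biases
        self.c = np.zeros(n_hidden)   # hidden biases

    def hidden_probs(self, v):
        """P(h_j = 1 | v): hidden units are conditionally independent given v."""
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        """P(v_i = 1 | h): visible units are conditionally independent given h."""
        return sigmoid(h @ self.W.T + self.b)

    def sample_hidden(self, v):
        """Stochastic map from a data configuration to a representation."""
        p = self.hidden_probs(v)
        return (self.rng.random(p.shape) < p).astype(float)

    def sample_visible(self, h):
        """Stochastic map from a representation back to a data configuration."""
        p = self.visible_probs(h)
        return (self.rng.random(p.shape) < p).astype(float)

    def gibbs_step(self, v):
        """One alternating Gibbs step: v -> h -> v'."""
        h = self.sample_hidden(v)
        return self.sample_visible(h), h


if __name__ == "__main__":
    rbm = TinyRBM(n_visible=6, n_hidden=3)
    v0 = np.array([[1, 0, 1, 1, 0, 0]], dtype=float)
    v1, h = rbm.gibbs_step(v0)
    print("hidden sample:", h)
    print("resampled visible:", v1)
```

Because each layer is conditionally independent given the other, a whole layer is sampled in one vectorized call. By contrast, a deterministic method such as PCA maps each configuration to a single fixed representation.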
