LG Nov 9, 2024

Concept Bottleneck Language Models For protein design

arXiv:2411.06090v2, 21 citations, h-index: 10
Originality: Incremental advance
AI Analysis

This work addresses the need for interpretability and control in protein design for drug discovery, though it is incremental as it adapts an existing concept bottleneck approach to a new domain.

The authors tackled the problem of designing interpretable and controllable protein language models by introducing Concept Bottleneck Protein Language Models (CB-pLM), which achieve a 3 times larger change in desired concept values compared to baselines while maintaining comparable performance to traditional models.

We introduce Concept Bottleneck Protein Language Models (CB-pLM), a generative masked language model with a layer where each neuron corresponds to an interpretable concept. Our architecture offers three key benefits: i) Control: We can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in desired concept values compared to baselines. ii) Interpretability: A linear mapping between concept values and predicted tokens allows transparent analysis of the model's decision-making process. iii) Debugging: This transparency facilitates easy debugging of trained models. Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. While adaptable to any language model, we focus on masked protein language models due to their importance in drug discovery and the ability to validate our model's capabilities through real-world experiments and expert knowledge. We scale our CB-pLM from 24 million to 3 billion parameters, making them the largest Concept Bottleneck Models trained and the first capable of generative language modeling.
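The control and interpretability claims above both follow from one structural choice: token logits are a linear function of the concept values. The sketch below illustrates that idea only. It is a toy under stated assumptions, not the authors' implementation: the class name `ToyConceptBottleneckHead`, the weight names `W_concept` and `W_token`, the random untrained weights, and the tiny dimensions are all hypothetical, and the transformer encoder, training, and any other components of the actual CB-pLM architecture are omitted.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class ToyConceptBottleneckHead:
    """Hypothetical toy head: hidden state -> concept values -> token logits.

    Assumption: weights are random and untrained; a real model would learn them.
    """

    def __init__(self, hidden_dim, n_concepts, vocab_size, seed=0):
        rng = random.Random(seed)
        # One row per concept neuron.
        self.W_concept = [[rng.gauss(0, 1) for _ in range(hidden_dim)]
                          for _ in range(n_concepts)]
        # Linear map from concept values to token logits (one row per token).
        self.W_token = [[rng.gauss(0, 1) for _ in range(n_concepts)]
                        for _ in range(vocab_size)]

    def concepts(self, hidden):
        """Bottleneck layer: each output entry is one concept value."""
        return matvec(self.W_concept, hidden)

    def token_probs(self, concept_values):
        """Token distribution computed from concept values alone."""
        return softmax(matvec(self.W_token, concept_values))

    def intervene(self, concept_values, index, new_value):
        """Control: return a copy with one concept value overridden."""
        edited = list(concept_values)
        edited[index] = new_value
        return edited

    def contributions(self, concept_values, token):
        """Interpretability: per-concept share of one token's logit."""
        return [w * c for w, c in zip(self.W_token[token], concept_values)]


# Usage: raise concept 1 by 2.0 and compare the token distributions.
head = ToyConceptBottleneckHead(hidden_dim=4, n_concepts=3, vocab_size=5)
c = head.concepts([0.5, -1.0, 0.25, 2.0])
p_before = head.token_probs(c)
p_after = head.token_probs(head.intervene(c, index=1, new_value=c[1] + 2.0))
```

Because the concept-to-logit map is linear, raising concept `k` by `d` shifts the logit of token `t` by exactly `d * W_token[t][k]`, and the entries returned by `contributions` sum to that token's logit. That is the sense in which a linear mapping makes both the intervention and the attribution transparent in this toy setting.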

Code Implementations: 1 repo
