LG · AI · ML · Apr 17, 2024

Decomposing and Editing Predictions by Modeling Model Computation

MIT
arXiv:2404.11534v1 · 26 citations · h-index: 54 · Has Code · ICML
Originality Highly original
AI Analysis

This work addresses the interpretability and control of ML models for researchers and practitioners, offering a method to decompose predictions and edit models, though it builds on existing attribution techniques.

The paper asks how machine learning models transform inputs into predictions. It introduces component modeling, focuses on the special case of component attribution, and presents COAR, a scalable algorithm that estimates these attributions across models, datasets, and modalities. The resulting attributions enable model editing tasks such as fixing errors and improving robustness.

How does the internal computation of a machine learning model transform inputs into predictions? In this paper, we introduce a task called component modeling that aims to address this question. The goal of component modeling is to decompose an ML model's prediction in terms of its components -- simple functions (e.g., convolution filters, attention heads) that are the "building blocks" of model computation. We focus on a special case of this task, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions; we demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks, namely: fixing model errors, "forgetting" specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at https://github.com/MadryLab/modelcomponents.
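
To make "counterfactual impact of individual components" concrete, here is a minimal toy sketch in Python. It is not the authors' implementation and does not use the `modelcomponents` API. It assumes one simple way to estimate attributions: ablate random subsets of components, record the output, and fit a linear model from ablation masks to outputs. The names `toy_model` and `estimate_attributions`, the additive toy model, and all parameter values are hypothetical.

```python
import numpy as np


def toy_model(x, mask, weights):
    """Hypothetical 'model': the output is a sum of per-component outputs.

    mask[i] == 0 ablates component i (its contribution is zeroed out).
    """
    component_outputs = weights * np.tanh(x)  # shape: (n_components,)
    return float(np.sum(mask * component_outputs))


def estimate_attributions(x, weights, n_samples=2000, keep_prob=0.9, seed=0):
    """Estimate one attribution score per component for a single input x.

    Assumed procedure (a sketch, not the paper's code):
      1. sample random ablation masks,
      2. evaluate the model under each mask,
      3. regress outputs on masks with least squares.
    """
    rng = np.random.default_rng(seed)
    n_components = len(weights)

    # 1 = component kept, 0 = component ablated
    masks = (rng.random((n_samples, n_components)) < keep_prob).astype(float)
    outputs = np.array([toy_model(x, m, weights) for m in masks])

    # Fit outputs ~ masks @ theta + bias
    design = np.hstack([masks, np.ones((n_samples, 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]


if __name__ == "__main__":
    weights = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
    theta, bias = estimate_attributions(0.7, weights)
    # The component with the most negative score is the one whose
    # ablation is predicted to raise the output the most.
    print("attributions:", theta)
    print("most harmful component:", int(np.argmin(theta)))
```

Because the toy model is exactly additive in its components, the linear fit recovers each component's contribution exactly. For a real network the fitted scores would only approximate the effect of ablations. Under this sketch, an "edit" amounts to ablating the components whose scores point in the undesired direction.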

Code Implementations: 1 repo