LG · ML · Jan 26, 2024

Better Representations via Adversarial Training in Pre-Training: A Theoretical Perspective

arXiv:2401.15248v1 · AISTATS
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of ensuring adversarial robustness in pre-trained models for machine learning practitioners, offering incremental theoretical insights into an empirically observed phenomenon.

The paper provides theoretical justification for how adversarial training during pre-training leads to robust representations in downstream tasks, showing that feature purification in two-layer neural networks enables clean training to achieve adversarial robustness.

Pre-training is known to generate universal representations for downstream tasks in large-scale deep learning, such as large language models. Existing literature, e.g., \cite{kim2020adversarial}, empirically observes that downstream tasks can inherit the adversarial robustness of the pre-trained model. We provide theoretical justifications for this robustness inheritance phenomenon. Our theoretical results reveal that feature purification plays an important role in connecting the adversarial robustness of the pre-trained model and the downstream tasks in two-layer neural networks. Specifically, we show that (i) with adversarial training, each hidden node tends to pick only one (or a few) features; (ii) without adversarial training, the hidden nodes can be vulnerable to attacks. This observation is valid for both supervised pre-training and contrastive learning. With purified nodes, it turns out that clean training is enough to achieve adversarial robustness in downstream tasks.
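
To make the contrast between clean and adversarial training concrete, the following is a minimal sketch, not the paper's construction. It assumes a two-layer ReLU network with a fixed second layer, logistic loss, synthetic Gaussian data, and a one-step FGSM attack under an l-infinity budget `eps` standing in for the inner maximization. All names (`forward`, `loss_grads`, `fgsm`, `train`) and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def forward(W, a, X):
    """Two-layer ReLU network: out_i = sum_j a_j * relu(w_j . x_i)."""
    H = np.maximum(X @ W.T, 0.0)          # hidden activations, shape (n, m)
    return H, H @ a                       # outputs, shape (n,)

def loss_grads(W, a, X, y):
    """Mean logistic loss with gradients w.r.t. W and w.r.t. the inputs X."""
    H, out = forward(W, a, X)
    margin = y * out
    loss = np.mean(np.logaddexp(0.0, -margin))
    # d loss / d out = -y * sigmoid(-margin) / n, written in a stable form
    g_out = -y * 0.5 * (1.0 - np.tanh(margin / 2.0)) / len(y)
    G_pre = (g_out[:, None] * a[None, :]) * (H > 0)   # backprop through ReLU
    return loss, G_pre.T @ X, G_pre @ W   # loss, grad_W (m, d), grad_X (n, d)

def fgsm(W, a, X, y, eps):
    """One-step l_inf attack: move each input by eps along sign(grad_X)."""
    _, _, grad_X = loss_grads(W, a, X, y)
    return X + eps * np.sign(grad_X)

def train(X, y, eps=0.0, m=16, lr=0.5, steps=300, seed=0):
    """Gradient descent on W; eps > 0 trains on FGSM-perturbed inputs."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((m, X.shape[1]))
    # Assumption: second layer is fixed to +/- 1/m and only W is trained.
    a = np.where(np.arange(m) % 2 == 0, 1.0, -1.0) / m
    for _ in range(steps):
        X_in = fgsm(W, a, X, y, eps) if eps > 0 else X
        _, grad_W, _ = loss_grads(W, a, X_in, y)
        W -= lr * grad_W
    return W, a

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 10))
    y = np.where(X[:, 0] > 0, 1.0, -1.0)  # label depends on one coordinate
    for name, eps in [("clean", 0.0), ("adversarial", 0.1)]:
        W, a = train(X, y, eps=eps)
        X_adv = fgsm(W, a, X, y, 0.1)     # attack both models the same way
        clean_loss = loss_grads(W, a, X, y)[0]
        adv_loss = loss_grads(W, a, X_adv, y)[0]
        print(f"{name}: clean loss {clean_loss:.4f}, attacked loss {adv_loss:.4f}")
```

Inspecting the rows of `W` after each run is one informal way to see which input coordinates a hidden node relies on. This toy setup is not expected to reproduce the feature purification results stated above.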

