LG ML Sep 7, 2022

Blessing of Class Diversity in Pre-training

Princeton

arXiv:2209.03447v34.64 citations h-index: 49

Originality Highly original

AI Analysis

This paper provides a theoretical foundation for pre-training techniques in NLP, addressing a key problem for researchers and practitioners seeking to understand and optimize transfer learning.

The paper explains why pre-training improves sample efficiency in NLP. It proves that when the classes of the pre-training task are sufficiently diverse, the transfer learning excess risk on downstream tasks converges at a rate of O(1/(ν̃√n)), compared to O(1/√m) in standard supervised learning. Here ν̃ is the least singular value of the last linear layer in pre-training, n is the pre-training data size, and m is the downstream data size.

This paper presents a new statistical analysis aiming to explain the recent superior achievements of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show that the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in standard supervised learning. Here, $n$ is the number of pre-training data points and $m$ is the number of data points in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.
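To make the comparison of the two rates concrete, here is a minimal Python sketch that evaluates only the leading-order terms quoted in the abstract, $\frac{1}{\tilde{\nu}\sqrt{n}}$ and $\frac{1}{\sqrt{m}}$. It is an illustration under stated assumptions, not the paper's bound: constants and any lower-order terms hidden by the $O(\cdot)$ notation are dropped, the function names are hypothetical, and the numeric values are made up for illustration. Under those simplifications, the transfer rate is smaller than the supervised rate exactly when $n > m / \tilde{\nu}^2$.

```python
import math


def transfer_rate(nu_tilde: float, n: int) -> float:
    """Leading-order transfer rate 1 / (nu_tilde * sqrt(n)).

    Assumption: constants and lower-order terms in the O(.) bound are ignored.
    """
    if nu_tilde <= 0 or n <= 0:
        raise ValueError("nu_tilde and n must be positive")
    return 1.0 / (nu_tilde * math.sqrt(n))


def supervised_rate(m: int) -> float:
    """Leading-order supervised rate 1 / sqrt(m), constants ignored."""
    if m <= 0:
        raise ValueError("m must be positive")
    return 1.0 / math.sqrt(m)


def crossover_n(nu_tilde: float, m: int) -> float:
    """Pre-training size above which the transfer rate is the smaller one.

    Solving 1 / (nu_tilde * sqrt(n)) < 1 / sqrt(m) for n gives
    n > m / nu_tilde**2.
    """
    if nu_tilde <= 0 or m <= 0:
        raise ValueError("nu_tilde and m must be positive")
    return m / nu_tilde ** 2


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    nu_tilde, n, m = 0.1, 10**8, 10**4
    print("transfer rate:  ", transfer_rate(nu_tilde, n))
    print("supervised rate:", supervised_rate(m))
    print("crossover n:    ", crossover_n(nu_tilde, m))
```

The sketch shows why both conditions in the abstract matter: a small $\tilde{\nu}$ (low class diversity) pushes the crossover point $m / \tilde{\nu}^2$ up, so a much larger pre-training set is needed before the transfer rate wins.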
