CL · Jun 15, 2023

Bridging the Gap between Decision and Logits in Decision-based Knowledge Distillation for Pre-trained Language Models

Qinhong Zhou, Zonghan Yang, Peng Li, Yang Liu

Tsinghua

arXiv:2306.08909v1 · 26.4224 citations · h-index: 107 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of distilling knowledge from large models whose internals are inaccessible. The advance is incremental because it builds on existing decision-based distillation methods.

The paper tackles knowledge distillation for pre-trained language models when only teacher decisions (top-1 labels) are accessible. It proposes estimating the teacher's logits from decision distributions, and the resulting method significantly outperforms baselines on natural language understanding and machine reading comprehension datasets.

Conventional knowledge distillation (KD) methods require access to the internal information of teachers, e.g., logits. However, such information may not always be accessible for large pre-trained language models (PLMs). In this work, we focus on decision-based KD for PLMs, where only teacher decisions (i.e., top-1 labels) are accessible. Considering the information gap between logits and decisions, we propose a novel method to estimate logits from the decision distributions. Specifically, decision distributions can be both derived as a function of logits theoretically and estimated with test-time data augmentation empirically. By combining the theoretical and empirical estimations of the decision distributions together, the estimation of logits can be successfully reduced to a simple root-finding problem. Extensive experiments show that our method significantly outperforms strong baselines on both natural language understanding and machine reading comprehension datasets.
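The idea of recovering logits from decisions by root-finding can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a binary task and assumes that augmentation perturbs the teacher's logit gap with zero-mean Gaussian noise of known scale `sigma`, so that the theoretical probability of the teacher choosing class 1 is the normal CDF of `gap / sigma`. The function names, the noise model, and the use of bisection are all illustrative choices that do not come from the abstract.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def theoretical_decision_prob(gap, sigma=1.0):
    """P(teacher outputs class 1) under the ASSUMED model:
    perturbed gap = gap + N(0, sigma^2), decision = 1 iff perturbed gap > 0."""
    return normal_cdf(gap / sigma)


def empirical_decision_prob(teacher_decide, x, augment, n_samples=200, eps=1e-3):
    """Estimate P(class 1) by querying the black-box teacher on augmented inputs.
    teacher_decide returns only a top-1 label (0 or 1), never logits."""
    votes = sum(teacher_decide(augment(x)) for _ in range(n_samples))
    p_hat = votes / n_samples
    # Clip so the root stays finite when all votes agree.
    return min(max(p_hat, eps), 1.0 - eps)


def estimate_logit_gap(p_hat, sigma=1.0, lo=-10.0, hi=10.0, tol=1e-8):
    """Solve theoretical_decision_prob(gap) = p_hat for gap by bisection.
    The left-hand side is increasing in gap, so the root is unique."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if theoretical_decision_prob(mid, sigma) - p_hat > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Toy stand-in for a teacher: the "input" is its hidden logit gap,
    # and augmentation adds Gaussian noise to it (hypothetical setup).
    random.seed(0)
    hidden_gap = 0.8
    p_hat = empirical_decision_prob(
        teacher_decide=lambda z: 1 if z > 0 else 0,
        x=hidden_gap,
        augment=lambda z: z + random.gauss(0.0, 1.0),
        n_samples=2000,
    )
    print("empirical P(class 1):", p_hat)
    print("estimated logit gap:", estimate_logit_gap(p_hat))
```

In this toy setting the estimated gap could serve as a soft target for a student in place of the hard label. A multi-class version would need a different theoretical decision distribution, which this sketch does not attempt.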

