CL · Jun 15, 2023

Bridging the Gap between Decision and Logits in Decision-based Knowledge Distillation for Pre-trained Language Models

Tsinghua
arXiv:2306.08909v1 · 224 citations · h-index: 35
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of distilling knowledge from large models without access to their internals. The advance is incremental because it builds on existing decision-based distillation methods.

The paper tackles knowledge distillation for pre-trained language models when only the teacher's decisions (top-1 labels) are accessible. It proposes estimating logits from decision distributions, and the resulting method is reported to significantly outperform baselines on natural language understanding and machine reading comprehension datasets.

Conventional knowledge distillation (KD) methods require access to the internal information of teachers, e.g., logits. However, such information may not always be accessible for large pre-trained language models (PLMs). In this work, we focus on decision-based KD for PLMs, where only teacher decisions (i.e., top-1 labels) are accessible. Considering the information gap between logits and decisions, we propose a novel method to estimate logits from the decision distributions. Specifically, decision distributions can be both derived as a function of logits theoretically and estimated with test-time data augmentation empirically. By combining the theoretical and empirical estimations of the decision distributions together, the estimation of logits can be successfully reduced to a simple root-finding problem. Extensive experiments show that our method significantly outperforms strong baselines on both natural language understanding and machine reading comprehension datasets.
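The root-finding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binary task, and it assumes that test-time augmentation acts like zero-mean Gaussian noise of standard deviation `sigma` on the teacher's logit gap, so that the theoretical decision probability is Phi(gap / sigma). The function names, the noise model, and the bisection solver are all illustrative choices made here, not details taken from the paper.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def empirical_decision_prob(decisions: list[int], label: int = 1) -> float:
    """Fraction of augmented inputs for which the teacher returned `label`."""
    return sum(1 for d in decisions if d == label) / len(decisions)


def estimate_logit_gap(p_hat: float, sigma: float = 1.0, tol: float = 1e-8) -> float:
    """Solve normal_cdf(gap / sigma) = p_hat for gap by bisection.

    Assumption (not from the paper): the decision probability under
    augmentation equals Phi(gap / sigma) for a binary task.
    """
    # Clip so the root stays finite when every augmented decision agrees.
    eps = 1e-6
    p_hat = min(max(p_hat, eps), 1.0 - eps)

    lo, hi = -10.0 * sigma, 10.0 * sigma  # bracket that contains the root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if normal_cdf(mid / sigma) < p_hat:
            lo = mid  # CDF is increasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)


def soft_label(gap: float) -> float:
    """Turn the estimated gap into a class-1 probability for the student."""
    return 1.0 / (1.0 + math.exp(-gap))


# Hypothetical teacher decisions on eight augmented copies of one input.
decisions = [1, 1, 0, 1, 1, 1, 0, 1]
p_hat = empirical_decision_prob(decisions)
gap = estimate_logit_gap(p_hat)
student_target = soft_label(gap)
```

Under these assumptions, `student_target` is a soft label that a student could be trained against in place of the teacher's hard top-1 decision. Bisection is used only because the assumed decision probability is monotone in the gap; any bracketing root finder would serve.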

Code Implementations: 1 repo
