Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis
This work addresses the need for reliable uncertainty quantification in PLMs for safety-critical NLP applications and offers practical recommendations grounded in extensive empirical evidence, though its contribution is incremental in that it extends prior studies of narrower scope.
The paper tackles the problem of minimizing calibration error in pre-trained language model (PLM) prediction pipelines for safety-critical NLP applications. It conducts a large-scale empirical analysis across three classification tasks and under domain shift, and recommends a choice for each pipeline component (PLM, model size, uncertainty quantifier, and fine-tuning loss): ELECTRA, larger models, Temp Scaling, and Focal Loss.
Pre-trained language models (PLMs) have gained increasing popularity due to their compelling prediction performance in diverse natural language processing (NLP) tasks. When formulating a PLM-based prediction pipeline for NLP tasks, it is also crucial for the pipeline to minimize the calibration error, especially in safety-critical applications. That is, the pipeline should reliably indicate when we can trust its predictions. In particular, there are various considerations behind the pipeline: (1) the choice and (2) the size of PLM, (3) the choice of uncertainty quantifier, (4) the choice of fine-tuning loss, and many more. Although prior work has looked into some of these considerations, it usually draws conclusions from empirical studies of limited scope. A holistic analysis of how to compose a well-calibrated PLM-based prediction pipeline is still lacking. To fill this void, we compare a wide range of popular options for each consideration on three prevalent NLP classification tasks and under domain shift. Based on this analysis, we recommend the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs if possible, (3) use Temp Scaling as the uncertainty quantifier, and (4) use Focal Loss for fine-tuning.
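To make the quantities named above concrete, the following is a minimal sketch in plain Python of three of them: a binned expected calibration error, temperature scaling, and focal loss. It uses commonly cited definitions of these techniques, not necessarily the exact formulations or settings used in the paper. The function names, the 10-bin default, the grid search over temperatures, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of one logit vector after dividing the logits by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(labels)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature that minimizes validation NLL (simple grid search)."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # assumed grid: 0.05 to 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


def focal_loss(probs, label, gamma=2.0):
    """Focal loss for one example: -(1 - p_y)^gamma * log(p_y)."""
    p_y = probs[label]
    return -((1.0 - p_y) ** gamma) * math.log(p_y)


# Toy validation set (hypothetical): confident logits, one of four predictions wrong.
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 1, 1, 0]

ece_before = expected_calibration_error([softmax(z) for z in val_logits], val_labels)
T = fit_temperature(val_logits, val_labels)
ece_after = expected_calibration_error([softmax(z, T) for z in val_logits], val_labels)
print(f"T = {T:.2f}, ECE before = {ece_before:.3f}, ECE after = {ece_after:.3f}")
```

On this toy set the model is overconfident, so the fitted temperature exceeds 1 and the calibration error drops after rescaling. Setting gamma to 0 reduces the focal loss to ordinary cross-entropy, while larger values down-weight examples the model already predicts confidently. In a real pipeline the temperature would be fitted on held-out validation logits from the fine-tuned PLM rather than on hand-written values.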