CL, LG | Nov 18, 2021

How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task

arXiv:2111.09612v1664 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the fragility of language models, a practical concern for NLP practitioners, but it is incremental: it applies existing methods (SWA and CheckList) to a known problem.

The study investigated the robustness of ALBERT combined with Stochastic Weight Averaging (SWA) on sentiment analysis (SST-2), finding that SWA reduces error rates but models retain distinct biases as measured by CheckList criteria.

Despite their success, modern language models are fragile. Even small changes in their training pipeline can lead to unexpected results. We study this phenomenon by examining the robustness of ALBERT (arXiv:1909.11942) in combination with Stochastic Weight Averaging (SWA) (arXiv:1803.05407) -- a cheap way of ensembling -- on a sentiment analysis task (SST-2). In particular, we analyze SWA's stability via CheckList criteria (arXiv:2005.04118), examining the agreement on errors made by models differing only in their random seed. We hypothesize that SWA is more stable because it ensembles model snapshots taken along the gradient descent trajectory. We quantify stability by comparing the models' mistakes with Fleiss' Kappa (Fleiss, 1971) and overlap ratio scores. We find that SWA reduces error rates in general; yet the models still suffer from their own distinct biases (according to CheckList).
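The stability measurement described in the abstract can be illustrated with a small sketch. It treats each model (one per random seed) as a "rater" that marks every test example as correct (0) or wrong (1), then computes Fleiss' Kappa over those marks. The overlap ratio here is an assumption: the abstract does not define it, so this sketch uses the number of examples that all models get wrong divided by the number that any model gets wrong. The gold labels, predictions, and function names are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of items, each a list of labels (one per rater)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this item.
        squared = sum(c * c for c in counts.values())
        agreement_sum += (squared - n_raters) / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    expected = sum((t / total) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (observed - expected) / (1.0 - expected)


def error_overlap_ratio(error_sets):
    """Assumed definition: |errors shared by all models| / |errors of any model|."""
    union = set().union(*error_sets)
    if not union:
        return float("nan")  # no model made any error
    shared = set.intersection(*error_sets)
    return len(shared) / len(union)


# Hypothetical gold labels and predictions from three seeds (illustration only).
gold = [1, 0, 1, 1, 0]
predictions = [
    [1, 0, 0, 1, 0],  # seed A
    [1, 1, 0, 1, 0],  # seed B
    [1, 0, 0, 0, 0],  # seed C
]

# Per-example error indicators: one entry per model, 1 means the model was wrong.
error_marks = [
    [int(pred[i] != gold[i]) for pred in predictions] for i in range(len(gold))
]
# Per-model sets of misclassified example indices.
error_sets = [
    {i for i, label in enumerate(gold) if pred[i] != label} for pred in predictions
]

kappa = fleiss_kappa(error_marks)
overlap = error_overlap_ratio(error_sets)
print(f"Fleiss' Kappa on errors: {kappa:.3f}")
print(f"Error overlap ratio:     {overlap:.3f}")
```

A higher Kappa or overlap ratio means the seeds tend to fail on the same examples, which is one way to read "stability" in the sense the abstract uses; the paper's exact formulation may differ.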

Code Implementations: 1 repo
