Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning
This work addresses a technical bottleneck in end-to-end training of latent structure models for natural language processing: the non-differentiable argmax. It offers a unifying view of existing surrogate-gradient methods and practical guidance for choosing among them.
The paper tackles the problem of training latent structure models whose argmax operation blocks gradient flow. It analyzes surrogate gradients and shows that pulling back the downstream objective motivates existing methods such as the straight-through estimator (STE) and SPIGOT, and also leads to new algorithms. Empirical comparisons offer insights for practitioners and reveal failure cases.
Latent structure models are a powerful tool for modeling language data: they can mitigate the error propagation and annotation bottleneck in pipeline systems, while simultaneously uncovering linguistic insights about the data. One challenge with end-to-end training of these models is the argmax operation, whose gradient is zero almost everywhere. In this paper, we focus on surrogate gradients, a popular strategy to deal with this problem. We explore latent structure learning from the angle of pulling back the downstream learning objective. In this paradigm, we discover a principled motivation for both the straight-through estimator (STE) and the recently proposed SPIGOT, a variant of STE for structured models. Our perspective leads to new algorithms in the same family. We empirically compare the known and the novel pulled-back estimators against the popular alternatives, yielding new insight for practitioners and revealing intriguing failure cases.
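To make the two surrogate gradients named in the abstract concrete, here is a minimal NumPy sketch for the simplest, unstructured case, where the latent variable is a one-hot vector and the relevant polytope is the probability simplex. It is an illustration under stated assumptions, not the paper's implementation: the function names, the step size `eta`, and the sort-based simplex projection are choices made for this example. STE passes the downstream gradient through the argmax unchanged, while SPIGOT takes a projected gradient step from the argmax output and uses the displacement as the gradient with respect to the scores.

```python
import numpy as np

def one_hot_argmax(s):
    """Forward pass: one-hot vector selecting the highest score."""
    z = np.zeros_like(s, dtype=float)
    z[np.argmax(s)] = 1.0
    return z

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                       # scores in descending order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1] # last index kept in the support
    tau = (css[rho] - 1.0) / (rho + 1)         # threshold to subtract
    return np.maximum(v - tau, 0.0)

def ste_grad(s, grad_z):
    """STE surrogate: treat the argmax as the identity in the backward pass."""
    return np.array(grad_z, dtype=float)

def spigot_grad(s, grad_z, eta=1.0):
    """SPIGOT surrogate (unstructured case, an assumption of this sketch):
    step from z against the downstream gradient, project back onto the
    simplex, and return the displacement z - projection."""
    z = one_hot_argmax(s)
    return z - project_simplex(z - eta * np.asarray(grad_z, dtype=float))

# Usage: scores over three choices and a downstream gradient w.r.t. z.
s = np.array([2.0, 1.0, 0.5])
grad_z = np.array([-1.0, 0.0, 0.0])  # loss would decrease if z[0] grew further
print("STE:   ", ste_grad(s, grad_z))
print("SPIGOT:", spigot_grad(s, grad_z))
```

In this usage example the argmax already sits at the vertex the downstream loss prefers, so the projected step cannot move and SPIGOT returns a zero update, whereas STE still forwards the raw gradient. Because both `z` and its projection sum to one, the SPIGOT surrogate always sums to zero in this setting.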