Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models
This work addresses the problem of balancing efficiency and accuracy in NLP models, a concern for both researchers and practitioners. Its contribution is incremental, since it applies existing analysis techniques to a specific domain.
The paper tackles the trade-off between model size and performance in NLP by developing methods to measure and compare the two. It applies these methods to part-of-speech taggers across eight languages and finds that some classical taggers are size-performance optimal across languages, while deep models achieve the highest performance, though the most complex deep models are often not the ones that reach peak performance.
Improvements in machine learning-based NLP performance are often presented with bigger models and more complex code. This presents a trade-off: better scores come at the cost of larger tools, and bigger models tend to require more resources at training and inference time. We present multiple methods for measuring the size of a model, and for comparing this with the model's performance. In a case study over part-of-speech tagging, we then apply these techniques to taggers for eight languages and present a novel analysis identifying which taggers are size-performance optimal. Results indicate that some classical taggers place on the size-performance skyline across languages. Further, although the deep models have the highest performance for multiple scores, it is often not the most complex of these that reaches peak performance.
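To make the notion of a size-performance skyline concrete, the following is a minimal sketch, assuming the skyline is the usual Pareto front: a tagger is on it if no other tagger is at least as small and at least as accurate while being strictly better on one of the two. The names `Tagger`, `dominates`, and `skyline` are hypothetical, size is assumed to be measured in megabytes and performance as accuracy, and the toy entries are invented placeholders, not results from the paper.

```python
from typing import List, NamedTuple


class Tagger(NamedTuple):
    """Hypothetical record for one tagger; fields are illustrative assumptions."""
    name: str
    size_mb: float   # model size; smaller is better
    accuracy: float  # tagging accuracy; larger is better


def dominates(a: Tagger, b: Tagger) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.size_mb <= b.size_mb and a.accuracy >= b.accuracy
    strictly_better = a.size_mb < b.size_mb or a.accuracy > b.accuracy
    return no_worse and strictly_better


def skyline(taggers: List[Tagger]) -> List[Tagger]:
    """Return the taggers that no other tagger dominates (the Pareto front)."""
    return [t for t in taggers if not any(dominates(o, t) for o in taggers)]


# Invented placeholder values, for illustration only.
toy = [
    Tagger("A", size_mb=1.0, accuracy=0.90),
    Tagger("B", size_mb=5.0, accuracy=0.95),
    Tagger("C", size_mb=10.0, accuracy=0.94),  # larger and less accurate than B
    Tagger("D", size_mb=50.0, accuracy=0.97),
]

print([t.name for t in skyline(toy)])
```

In this toy data, C is dominated by B, so it falls off the skyline, while the large but most accurate D stays on it. The pairwise check is quadratic in the number of taggers, which is fine for the handful of systems a comparison like this typically covers.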