CL, AI
Apr 5, 2023

To Asymmetry and Beyond: Structured Pruning of Sequence to Sequence Models for Improved Inference Efficiency

arXiv:2304.02721v3222 citations
h-index: 24
Originality: Incremental advance
AI Analysis

This work addresses the difficulty of deploying summarization models in latency-sensitive or web-scale applications. It is an incremental advance because it builds on existing pruning techniques.

The paper studies structured pruning as a way to improve inference efficiency in sequence-to-sequence summarization models. It shows that asymmetric pruning, which shrinks the decoder more than the encoder, can cut inference latency by nearly 3x at a cost of about 1 point of Rouge-2.

Sequence-to-sequence language models can be used to produce abstractive summaries which are coherent, relevant, and concise. Still, model sizes can make deployment in latency-sensitive or web-scale implementations difficult. This paper studies the relationship between model size, structured pruning, inference efficiency, and summarization accuracy on widely used summarization datasets. We show that model accuracy is tied to the encoder size while inference efficiency is connected to the decoder. Using asymmetric pruning can lead to nearly 3x improvement in inference latency with ~1 point loss in Rouge-2. Moreover, we find both the average degradation and the role of asymmetry to be consistent across model sizes and variations in datasets.
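The claim that inference efficiency follows the decoder can be illustrated with a toy cost model. The sketch below is not the paper's measurement setup. It assumes that every layer costs the same, that the encoder runs once per input, and that the decoder runs once per generated token, and it ignores caching, attention cost, and batching. The function names and the layer and token counts are hypothetical, chosen only for illustration.

```python
def relative_latency(enc_layers: int, dec_layers: int, output_tokens: int) -> int:
    """Toy latency in 'layer passes' (an assumed unit, not a real measurement).

    The encoder processes the input once; the decoder is invoked once per
    generated token, so its depth is multiplied by the output length.
    """
    if enc_layers < 0 or dec_layers < 0 or output_tokens < 0:
        raise ValueError("layer and token counts must be non-negative")
    return enc_layers + dec_layers * output_tokens


def speedup(baseline: int, pruned: int) -> float:
    """Ratio of baseline cost to pruned cost (higher means faster)."""
    return baseline / pruned


# Illustrative numbers only: a 12+12 layer model generating 64 tokens.
OUTPUT_TOKENS = 64
full = relative_latency(12, 12, OUTPUT_TOKENS)
encoder_pruned = relative_latency(4, 12, OUTPUT_TOKENS)  # prune the encoder only
decoder_pruned = relative_latency(12, 4, OUTPUT_TOKENS)  # prune the decoder only

print(f"encoder-only pruning speedup: {speedup(full, encoder_pruned):.2f}x")
print(f"decoder-only pruning speedup: {speedup(full, decoder_pruned):.2f}x")
```

Under these assumptions, removing decoder layers reduces cost far more than removing the same number of encoder layers, because the decoder term scales with output length. The model says nothing about accuracy, which the abstract ties to encoder size.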

