Extreme compression of sentence-transformer ranker models: faster inference, longer battery life, and less storage on edge devices
This work addresses the challenge of deploying search systems on edge devices with limited computational resources, but it appears incremental, as it builds on existing distillation techniques.
The paper tackles the problem of compressing large transformer ranker models for edge devices by proposing two extensions to knowledge distillation: generating an optimal-size vocabulary and reducing the dimensionality of the teacher embeddings before distillation. The result is extremely compressed student models with lower memory and energy consumption, though no specific performance numbers are provided.
Modern search systems use several large ranker models with transformer architectures. These models require large computational resources and are not suitable for use on devices with limited computational resources. Knowledge distillation, in which a large teacher model transfers knowledge to a small student model, is a popular compression technique that can reduce the resource needs of such models. To drastically reduce memory requirements and energy consumption, we propose two extensions to a popular sentence-transformer distillation procedure: generation of an optimal-size vocabulary and reduction of the teachers' embedding dimensionality prior to distillation. We evaluate these extensions on two different types of ranker models. This results in extremely compressed student models, whose analysis on a test dataset shows the significance and utility of our proposed extensions.
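To make the two extensions concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes PCA as the dimensionality-reduction method (the abstract does not name one), a mean-squared-error objective between student and reduced teacher embeddings, and random vectors standing in for real teacher outputs. The function names (`fit_pca`, `reduce_embeddings`, `mse_distillation_loss`, `embedding_table_params`) and all sizes are hypothetical, chosen only for illustration.

```python
import numpy as np


def fit_pca(embeddings: np.ndarray, out_dim: int):
    """Fit PCA on teacher embeddings (assumed reduction method).

    Returns the mean vector and the top `out_dim` principal directions
    as a matrix of shape (out_dim, in_dim).
    """
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:out_dim]


def reduce_embeddings(embeddings: np.ndarray, mean: np.ndarray,
                      components: np.ndarray) -> np.ndarray:
    """Project embeddings onto the fitted principal directions."""
    return (embeddings - mean) @ components.T


def mse_distillation_loss(student_emb: np.ndarray,
                          teacher_emb: np.ndarray) -> float:
    """Mean squared error between student and (reduced) teacher embeddings."""
    return float(np.mean((student_emb - teacher_emb) ** 2))


def embedding_table_params(vocab_size: int, dim: int) -> int:
    """Number of parameters in a token-embedding table."""
    return vocab_size * dim


# Random stand-in for teacher sentence embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
teacher_emb = rng.normal(size=(256, 64))

# Reduce the teacher's embedding dimension before distillation,
# so the student can be trained to match a smaller target.
mean, components = fit_pca(teacher_emb, out_dim=16)
targets = reduce_embeddings(teacher_emb, mean, components)

# An untrained stand-in student output, just to exercise the loss.
student_emb = rng.normal(size=targets.shape)
loss = mse_distillation_loss(student_emb, targets)

# Hypothetical sizes showing why a smaller vocabulary and a smaller
# embedding dimension both shrink the embedding table.
large_table = embedding_table_params(vocab_size=30_000, dim=768)
small_table = embedding_table_params(vocab_size=5_000, dim=128)
```

Because the embedding table grows as vocabulary size times embedding dimension, shrinking either factor reduces its memory footprint proportionally. This is a plausible reading of why the two extensions target exactly these two quantities.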