CL, Apr 13, 2018

Pieces of Eight: 8-bit Neural Machine Translation

arXiv:1804.05038v1 32.11101 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses latency and cost issues for industry applications of machine translation, though it is incremental as it applies known quantization techniques to this domain.

The paper tackles the problem of reducing inference time and cloud hosting costs in neural machine translation by applying 8-bit quantization to models trained with 32-bit floating point values, and it reports significant speed improvements without accuracy degradation.

Neural machine translation has achieved levels of fluency and adequacy that would have been surprising a short time ago. Output quality is extremely relevant for industry purposes; however, it is equally important to produce results in the shortest time possible, mainly for latency-sensitive applications and to control cloud hosting costs. In this paper we show the effectiveness of translating with 8-bit quantization for models that have been trained using 32-bit floating point values. Results show that 8-bit translation makes a non-negligible impact in terms of speed with no degradation in accuracy and adequacy.
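As a rough illustration of what quantizing 32-bit floating point values to 8 bits involves, the sketch below maps a list of floats to signed 8-bit integers with a single shared scale and computes a dot product using integer accumulation. The abstract does not describe the paper's quantization scheme, so the symmetric per-tensor scaling used here is an assumption, and the function names (`quantize`, `dequantize`, `quantized_dot`) and the sample values are hypothetical.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale.

    Assumption: symmetric per-tensor quantization, chosen for simplicity.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid division by zero when every value is 0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from 8-bit integers."""
    return [q * scale for q in quantized]


def quantized_dot(a: List[float], b: List[float]) -> float:
    """Dot product computed on 8-bit values with integer accumulation."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    accumulator = sum(x * y for x, y in zip(qa, qb))  # pure integer arithmetic
    # A single float rescale at the end converts back to the original range.
    return accumulator * scale_a * scale_b


# Hypothetical weights and activations, for illustration only.
weights = [0.42, -1.30, 0.07, 0.95]
activations = [1.10, 0.25, -0.60, 0.80]

exact = sum(w * x for w, x in zip(weights, activations))
approx = quantized_dot(weights, activations)
print(f"float dot product: {exact:.4f}")
print(f"8-bit dot product: {approx:.4f}")
```

In this toy setting the two results differ only by a small rounding error. Any speed benefit in practice comes from hardware integer instructions and reduced memory traffic, which plain Python does not demonstrate.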
