CL | Aug 14, 2023

SOTASTREAM: A Streaming Approach to Machine Translation Training

Microsoft
arXiv:2308.07489v1133 citations | h-index: 40 | Has Code
Originality: Incremental advance
AI Analysis

SOTASTREAM addresses a data-handling bottleneck for machine translation researchers and developers by streamlining how training data is prepared and consumed. The advance is incremental: it improves existing practice rather than introducing a new paradigm.

The paper tackles the inefficiency and inflexibility of static data preprocessing in machine translation training. It introduces SOTASTREAM, a streaming approach that eliminates the separate preprocessing step and enables on-the-fly data manipulation. The authors report reduced training time, experiment-management complexity, and disk space, with no effect on model accuracy.

Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer. This preparation step is increasingly at odds with modern research and development practices because this process produces a static, unchangeable version of the training data, making common training-time needs difficult (e.g., subword sampling), time-consuming (preprocessing with large data can take days), expensive (e.g., disk space), and cumbersome (managing experiment combinatorics). We propose an alternative approach that separates the generation of data from the consumption of that data. In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data, which the trainer tensorizes and batches as it is consumed. Additionally, this data stream can be manipulated by a set of user-definable operators that provide on-the-fly modifications, such as data normalization, augmentation or filtering. We release an open-source toolkit, SOTASTREAM, that implements this approach: https://github.com/marian-nmt/sotastream. We show that it cuts training time, adds flexibility, reduces experiment management complexity, and reduces disk space, all without affecting the accuracy of the trained models.
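The separation of data generation from data consumption described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the SOTASTREAM API: every name below (`permutation_stream`, `lowercase_source`, `drop_long`, `pipeline`, `batches`) is a hypothetical stand-in chosen for illustration, and real tensorization is replaced by simple grouping of sentence pairs into lists.

```python
import itertools
import random
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)
Operator = Callable[[Iterator[Pair]], Iterator[Pair]]


def permutation_stream(data: Sequence[Pair], seed: int = 0) -> Iterator[Pair]:
    """Generator side: yield an endless stream in which each pass over
    the raw data is a fresh random permutation of it."""
    rng = random.Random(seed)
    while True:
        epoch = list(data)
        rng.shuffle(epoch)
        yield from epoch


def lowercase_source(stream: Iterator[Pair]) -> Iterator[Pair]:
    """Hypothetical normalization operator: lowercase the source side."""
    for src, tgt in stream:
        yield src.lower(), tgt


def drop_long(max_tokens: int) -> Operator:
    """Hypothetical filtering operator: skip pairs where either side
    has more than max_tokens whitespace-separated tokens."""
    def op(stream: Iterator[Pair]) -> Iterator[Pair]:
        for src, tgt in stream:
            if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
                yield src, tgt
    return op


def pipeline(stream: Iterator[Pair], operators: Sequence[Operator]) -> Iterator[Pair]:
    """Chain user-defined operators over the stream, applied lazily."""
    for op in operators:
        stream = op(stream)
    return stream


def batches(stream: Iterator[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Consumer side: group pairs into batches as they are consumed."""
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch


# Toy usage: the raw data stays untouched on disk or in memory;
# all modifications happen on the fly.
raw = [
    ("Hello world", "Hallo Welt"),
    ("Good morning", "Guten Morgen"),
    ("A very long sentence that gets filtered out", "Ein sehr langer Satz"),
]
stream = pipeline(permutation_stream(raw, seed=1), [lowercase_source, drop_long(4)])
first_batches = list(itertools.islice(batches(stream, 2), 3))
```

Because every stage is a lazy generator, changing an operator (for example, the length threshold) requires no re-preprocessing of the data, only a restart of the stream. One caveat of this sketch: if a filter rejects every pair in the raw data, the infinite stream never yields and the consumer blocks.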

Code Implementations: 1 repo
