CL | May 16, 2022

Directed Acyclic Transformer for Non-Autoregressive Machine Translation

Tsinghua
arXiv:2205.07459v1 | 83 citations | h-index: 22
AI Analysis

This work addresses the translation quality gap for users of non-autoregressive models, achieving competitive results with autoregressive Transformers without knowledge distillation, though it is incremental in the NAT domain.

The paper tackles non-autoregressive machine translation by proposing DA-Transformer, which organizes hidden states in a Directed Acyclic Graph (DAG) to capture token dependencies and represent multiple translations. On raw WMT training data it improves over previous NATs by about 3 BLEU on average.

Non-autoregressive Transformers (NATs) significantly reduce decoding latency by generating all tokens in parallel. However, such independent predictions prevent NATs from capturing the dependencies between tokens that are needed to generate multiple possible translations. In this paper, we propose the Directed Acyclic Transformer (DA-Transformer), which represents the hidden states in a Directed Acyclic Graph (DAG), where each path of the DAG corresponds to a specific translation. The whole DAG simultaneously captures multiple translations and facilitates fast predictions in a non-autoregressive fashion. Experiments on the raw training data of the WMT benchmark show that DA-Transformer substantially outperforms previous NATs by about 3 BLEU on average, making it the first NAT model to achieve results competitive with autoregressive Transformers without relying on knowledge distillation.
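The idea that each path of the DAG corresponds to a specific translation can be illustrated with a small toy sketch. This is not the authors' implementation: the four-vertex graph, the `tokens` list, the `trans` probabilities, and the helper names `enumerate_paths` and `greedy_path` are all hypothetical, chosen only to show how one graph can hold several translations at once and how a single path can be picked without autoregressive decoding.

```python
# Hypothetical toy DAG with 4 vertices; vertex 0 is the start, vertex 3 the end.
# tokens[i] is the token assumed to be predicted at vertex i (all in parallel).
tokens = ["I", "really", "like", "cats"]

# trans[i][j] is an assumed probability of stepping from vertex i to vertex j.
# Edges only go from a lower to a higher index, so the graph is acyclic.
trans = {
    0: {1: 0.4, 2: 0.6},
    1: {2: 1.0},
    2: {3: 1.0},
}


def enumerate_paths(trans, start, end):
    """Yield (path, probability) for every start-to-end path in the DAG."""
    stack = [([start], 1.0)]
    while stack:
        path, prob = stack.pop()
        last = path[-1]
        if last == end:
            yield path, prob
            continue
        for nxt, p in trans.get(last, {}).items():
            stack.append((path + [nxt], prob * p))


def greedy_path(trans, start, end):
    """Follow the most probable outgoing edge at each vertex."""
    path = [start]
    while path[-1] != end:
        successors = trans[path[-1]]
        path.append(max(successors, key=successors.get))
    return path


# Each path spells out one candidate translation.
translations = {
    " ".join(tokens[i] for i in path): prob
    for path, prob in enumerate_paths(trans, 0, 3)
}

# Pick one translation by following the most probable edges.
best = " ".join(tokens[i] for i in greedy_path(trans, 0, 3))

for sentence, prob in sorted(translations.items(), key=lambda kv: -kv[1]):
    print(f"{prob:.1f}  {sentence}")
print("greedy:", best)
```

In this toy graph the two paths yield "I like cats" and "I really like cats", so the same set of parallel token predictions supports more than one output. Enumerating every path is only practical for tiny graphs; it is used here purely to make the path-to-translation correspondence visible.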

Code Implementations: 1 repo