EM-Network: Oracle Guided Self-distillation for Sequence Learning
This work aims to improve the prediction accuracy of sequence models on tasks such as speech recognition and machine translation, building incrementally on existing methods.
The paper introduces EM-Network, a self-distillation method for sequence-to-sequence learning that uses oracle guidance derived from the target sequence. It improves over the best prior work on speech recognition and reports state-of-the-art results on the WMT'14 and IWSLT'14 machine translation benchmarks.
We introduce EM-Network, a novel self-distillation approach that effectively leverages target information for supervised sequence-to-sequence (seq2seq) learning. In contrast to conventional methods, it is trained with oracle guidance, which is derived from the target sequence. Since the oracle guidance compactly represents the target-side context that can assist the sequence model in solving the task, the EM-Network predicts more accurately than a model that uses only the source input. To allow the sequence model to inherit this capability, we propose a new self-distillation strategy in which the original sequence model benefits from the knowledge of the EM-Network within a single training stage. We conduct comprehensive experiments on two types of seq2seq models: connectionist temporal classification (CTC) for speech recognition and attention-based encoder-decoder (AED) for machine translation. Experimental results demonstrate that the EM-Network significantly advances the current state-of-the-art approaches, improving over the best prior work on speech recognition and establishing state-of-the-art performance on WMT'14 and IWSLT'14.
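As a rough illustration of the one-stage self-distillation idea, the following toy sketch combines a supervised loss for an oracle-guided model, a supervised loss for the plain sequence model, and a distillation term that pulls the sequence model toward the oracle-guided predictions. The exact objective is not given in the abstract, so the loss form, the weight `alpha`, and all function names here are assumptions made for illustration, not the authors' formulation. The logits are passed in directly rather than produced by real networks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(em_logits, seq_logits, targets, alpha=1.0):
    """Hypothetical joint loss, averaged over target positions.

    em_logits:  per-position logits of the oracle-guided model
                (assumed to have seen source input plus oracle guidance).
    seq_logits: per-position logits of the plain sequence model
                (source input only).
    alpha:      assumed weight of the distillation term.
    """
    total = 0.0
    for em_l, seq_l, y in zip(em_logits, seq_logits, targets):
        p_em = softmax(em_l)
        p_seq = softmax(seq_l)
        supervised = cross_entropy(p_em, y) + cross_entropy(p_seq, y)
        # Distillation: the oracle-guided distribution acts as the teacher.
        # This toy has no gradients; in a real framework one would likely
        # stop gradients through p_em here (an assumption, not from the source).
        distill = kl_divergence(p_em, p_seq)
        total += supervised + alpha * distill
    return total / len(targets)


if __name__ == "__main__":
    # Two target positions, three classes; values are made up.
    em = [[4.0, 0.5, 0.1], [0.2, 3.5, 0.3]]   # sharper, oracle-guided
    seq = [[1.5, 1.0, 0.8], [0.9, 1.4, 1.0]]  # flatter, source-only
    print(self_distillation_loss(em, seq, targets=[0, 1], alpha=0.5))
```

Because both supervised terms and the distillation term are summed into one loss, both models would be updated in the same training run, which is one plausible reading of "one-stage"; a two-stage alternative would first train the teacher and then distill from it.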