LG CL

May 31, 2021

Memory-Efficient Differentiable Transformer Architecture Search

Yuekai Zhao, Li Dong, Yelong Shen, Zhihua Zhang, Furu Wei, Weizhu Chen

arXiv:2105.14669v153.0720 citations

Originality: Incremental advance

AI Analysis

This work addresses the memory bottleneck that researchers and practitioners face when applying DARTS to Transformer architecture search, making the search more feasible for sequence-to-sequence tasks. The advance is incremental, since it builds on existing DARTS methods.

The paper tackles the high memory cost of applying Differentiable Architecture Search (DARTS) to Transformers. It proposes a multi-split reversible network with a backpropagation-with-reconstruction algorithm, which reduces memory usage and makes it possible to search with larger hidden sizes and more candidate operations. The searched architecture consistently outperforms standard Transformers on three WMT'14 datasets and compares favorably with big-size Evolved Transformers, while reducing search computation by an order of magnitude.

Differentiable architecture search (DARTS) has been successfully applied to many vision tasks. However, directly using DARTS for Transformers is memory-intensive, which renders the search process infeasible. To this end, we propose a multi-split reversible network and combine it with DARTS. Specifically, we devise a backpropagation-with-reconstruction algorithm so that we only need to store the last layer's outputs. Relieving the memory burden of DARTS allows us to search with a larger hidden size and more candidate operations. We evaluate the searched architecture on three sequence-to-sequence datasets, i.e., WMT'14 English-German, WMT'14 English-French, and WMT'14 English-Czech. Experimental results show that our network consistently outperforms standard Transformers across the tasks. Moreover, our method compares favorably with big-size Evolved Transformers, reducing search computation by an order of magnitude.
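The idea of storing only the last layer's outputs can be illustrated with a minimal sketch. It is not the paper's implementation: the update rule y_k = x_k + F_k(other splits), the toy sub-layer `make_sublayer`, and the names `forward` and `reconstruct` are all assumptions chosen for illustration, following the general pattern of reversible residual networks. The point is only that each layer's inputs can be recomputed from its outputs, so intermediate activations need not be kept in memory.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical sub-layer type: maps the other splits to an update for one split.
SubLayer = Callable[[Sequence[Vector]], Vector]


def make_sublayer(scale: float) -> SubLayer:
    """Toy stand-in for a searched operation (illustrative only)."""
    def f(others: Sequence[Vector]) -> Vector:
        dim = len(others[0])
        # Sum the other splits element-wise, then apply a nonlinearity.
        return [math.tanh(scale * sum(v[i] for v in others)) for i in range(dim)]
    return f


def forward(xs: Sequence[Vector], fs: Sequence[SubLayer]) -> List[Vector]:
    """Assumed rule: y_k = x_k + F_k(y_1..y_{k-1}, x_{k+1}..x_n), k = 1..n."""
    state = [list(x) for x in xs]
    for k, f in enumerate(fs):
        others = state[:k] + state[k + 1:]  # updated splits before k, raw after
        update = f(others)
        state[k] = [a + b for a, b in zip(state[k], update)]
    return state


def reconstruct(ys: Sequence[Vector], fs: Sequence[SubLayer]) -> List[Vector]:
    """Invert `forward` by undoing the updates in reverse order, k = n..1."""
    state = [list(y) for y in ys]
    for k in reversed(range(len(fs))):
        others = state[:k] + state[k + 1:]  # same arguments F_k saw in forward
        update = fs[k](others)
        state[k] = [a - b for a, b in zip(state[k], update)]
    return state


# Four stacked layers, each with three splits of width two.
layers = [[make_sublayer(0.5 + 0.1 * (l + k)) for k in range(3)] for l in range(4)]
x = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]

h = x
for fs in layers:
    h = forward(h, fs)       # no intermediate activations are stored
final_output = h

r = final_output
for fs in reversed(layers):
    r = reconstruct(r, fs)   # recover each layer's inputs from its outputs

max_err = max(abs(a - b) for xv, rv in zip(x, r) for a, b in zip(xv, rv))
```

In a real training loop, the reconstructed inputs of each layer would be used to compute that layer's gradients during the backward pass. Reconstruction is exact only up to floating-point rounding, which is why the sketch measures `max_err` rather than testing for equality.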
