LG · CL · May 31, 2021

Memory-Efficient Differentiable Transformer Architecture Search

arXiv:2105.14669v1720 citations
Originality: Incremental advance
AI Analysis

The paper addresses the memory bottleneck that researchers and practitioners face when using DARTS for Transformer architecture search, making the search more feasible for sequence-to-sequence tasks. The advance is incremental, since it builds on existing DARTS methods.

The paper tackles the high memory cost of applying Differentiable Architecture Search (DARTS) to Transformers. It proposes a multi-split reversible network with a backpropagation-with-reconstruction algorithm, which reduces memory usage and enables searching with larger hidden sizes and more candidate operations. The searched architecture consistently outperforms standard Transformers on three WMT'14 datasets and compares favorably with big-size Evolved Transformers, while reducing search computation by an order of magnitude.

Differentiable architecture search (DARTS) has been successfully applied to many vision tasks. However, directly using DARTS for Transformers is memory-intensive, which renders the search process infeasible. To this end, we propose a multi-split reversible network and combine it with DARTS. Specifically, we devise a backpropagation-with-reconstruction algorithm so that we only need to store the last layer's outputs. Relieving the memory burden for DARTS allows us to search with a larger hidden size and more candidate operations. We evaluate the searched architecture on three sequence-to-sequence datasets, i.e., WMT'14 English-German, WMT'14 English-French, and WMT'14 English-Czech. Experimental results show that our network consistently outperforms standard Transformers across the tasks. Moreover, our method compares favorably with big-size Evolved Transformers, reducing search computation by an order of magnitude.
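
The reconstruction idea in the abstract can be illustrated with a small sketch. The code below is not the paper's implementation: the additive coupling rule, the way the other splits are summed into a context, and the names `make_mixed_op`, `forward`, and `reconstruct` are all assumptions chosen for illustration. It shows only the general property that makes storing just the last layer's outputs possible: each split is updated by adding a function of the other splits, so the inputs can be recovered exactly by subtracting the same function in reverse order.

```python
import numpy as np


def make_mixed_op(rng, dim, num_ops=3):
    """Hypothetical DARTS-style mixed op: a softmax-weighted sum of candidate ops.

    Each candidate here is a toy tanh(linear) map; a real search space would
    hold attention, convolution, feed-forward layers, and so on.
    """
    weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(num_ops)]
    alpha = rng.standard_normal(num_ops)  # architecture parameters
    probs = np.exp(alpha - alpha.max())
    probs /= probs.sum()

    def op(context):
        return sum(p * np.tanh(context @ w) for p, w in zip(probs, weights))

    return op


def forward(splits, ops):
    """Update split k by adding op_k applied to the sum of all other splits.

    Assumed coupling: splits before k are already updated, splits after k
    still hold their input values. Requires at least two splits.
    """
    state = list(splits)
    for k, op in enumerate(ops):
        context = sum(state[:k] + state[k + 1:])
        state[k] = state[k] + op(context)
    return state


def reconstruct(outputs, ops):
    """Recover the layer inputs from its outputs by undoing updates in reverse."""
    state = list(outputs)
    for k in reversed(range(len(ops))):
        # At this point splits after k are restored and splits before k are
        # still outputs, which is exactly the context seen in the forward pass.
        context = sum(state[:k] + state[k + 1:])
        state[k] = state[k] - op_k(context) if False else state[k] - ops[k](context)
    return state


rng = np.random.default_rng(0)
num_splits, dim = 3, 8
inputs = [rng.standard_normal((4, dim)) for _ in range(num_splits)]
ops = [make_mixed_op(rng, dim) for _ in range(num_splits)]

outputs = forward(inputs, ops)
recovered = reconstruct(outputs, ops)
print(all(np.allclose(x, r) for x, r in zip(inputs, recovered)))
```

In a full backpropagation-with-reconstruction pass, each layer's inputs would be rebuilt this way from its outputs just before its gradients are computed, so intermediate activations need not be kept in memory. The sketch covers only the reconstruction step and omits gradient computation.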

