LG | CL | Dec 11, 2023

DYAD: A Descriptive Yet Abjuring Density efficient approximation to linear neural network layers

arXiv:2312.06881v1 | 1 citation | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses training speed and memory efficiency in large-scale models such as Transformers, though it is an incremental improvement over existing dense layers.

The paper tackles the computational inefficiency of linear neural network layers by proposing DYAD, a near-sparse matrix structure that approximates dense weight matrices. The method achieves competitive performance (≥90% of dense baselines) on benchmarks like BLIMP and GLUE while being 7-15% faster to train on GPU.

We devise, implement and performance-assess DYAD, a layer which can serve as a faster and more memory-efficient approximate replacement for linear layers (nn.Linear() in PyTorch). These layers appear in common subcomponents, such as the ff module of Transformers. DYAD is based on a bespoke near-sparse matrix structure which approximates the dense "weight" matrix W that matrix-multiplies the input in the typical realization of such a layer, a.k.a. DENSE. Our alternative near-sparse matrix structure is decomposable into a sum of 2 matrices permutable to a block-sparse counterpart. These can be represented as 3D tensors, which in unison allow a faster execution of matrix multiplication with the mini-batched input matrix X compared to DENSE (O(rows(W) x cols(W)) --> O(rows(W) x cols(W) / # of blocks)). As the crux of our experiments, we pretrain both DYAD and DENSE variants of 2 sizes of the OPT arch and 1 size of the Pythia arch, including at different token scales of the babyLM benchmark. We find DYAD to be competitive with DENSE (>= 90% of its performance) on zero-shot (e.g. BLIMP), few-shot (OPENLM) and finetuning (GLUE) benchmarks, while being >=7-15% faster to train on-GPU even at 125m scale, besides surfacing larger speedups at increasing scale and model width.
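To make the block-sparse idea in the abstract concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes one component is block-diagonal as stored, and the other becomes block-diagonal after a stride permutation of its rows and columns. That permutation, along with every function name below, is a hypothetical choice made for illustration. A single input vector stands in for the mini-batched X, and small integer weights keep the equality check exact.

```python
import random

def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix, stored as a 3D nested list
    blocks[b][i][j], by x while touching only the non-zero blocks."""
    n_in = len(blocks[0][0])
    y = []
    for b, block in enumerate(blocks):
        chunk = x[b * n_in:(b + 1) * n_in]
        for row in block:
            y.append(sum(w * v for w, v in zip(row, chunk)))
    return y

def stride_perm(n, n_blocks):
    """Hypothetical permutation (reshape + transpose of the index range);
    assumed here only for illustration, not taken from the paper."""
    size = n // n_blocks
    return [b * size + k for k in range(size) for b in range(n_blocks)]

def dyad_matvec(blocks1, blocks2, x):
    """Apply the sum of a block-diagonal matrix and a permuted block-diagonal one."""
    n_blocks = len(blocks1)
    y = block_diag_matvec(blocks1, x)
    p_in = stride_perm(len(x), n_blocks)
    p_out = stride_perm(len(y), n_blocks)
    z = block_diag_matvec(blocks2, [x[p] for p in p_in])
    for i, p in enumerate(p_out):
        y[p] += z[i]  # scatter the second component back to its original rows
    return y

def dyad_to_dense(blocks1, blocks2):
    """Expand both components into one dense matrix (used for checking only)."""
    n_blocks, n_out, n_in = len(blocks1), len(blocks1[0]), len(blocks1[0][0])
    rows, cols = n_blocks * n_out, n_blocks * n_in
    p_in, p_out = stride_perm(cols, n_blocks), stride_perm(rows, n_blocks)
    W = [[0] * cols for _ in range(rows)]
    for b in range(n_blocks):
        for i in range(n_out):
            for j in range(n_in):
                r, c = b * n_out + i, b * n_in + j
                W[r][c] += blocks1[b][i][j]
                W[p_out[r]][p_in[c]] += blocks2[b][i][j]
    return W

def random_blocks(rng, n_blocks, n_out, n_in):
    """Small integer weights so the comparison below is exact."""
    return [[[rng.randint(-3, 3) for _ in range(n_in)]
             for _ in range(n_out)] for _ in range(n_blocks)]

rng = random.Random(0)
n_blocks, n_out, n_in = 4, 3, 2  # equivalent dense W is 12 x 8
blocks1 = random_blocks(rng, n_blocks, n_out, n_in)
blocks2 = random_blocks(rng, n_blocks, n_out, n_in)
x = [rng.randint(-3, 3) for _ in range(n_blocks * n_in)]

W = dyad_to_dense(blocks1, blocks2)
y_dense = [sum(w * v for w, v in zip(row, x)) for row in W]
y_dyad = dyad_matvec(blocks1, blocks2, x)

dense_mults = len(W) * len(W[0])           # rows(W) x cols(W)
dyad_mults = 2 * n_blocks * n_out * n_in   # 2 x rows(W) x cols(W) / n_blocks
```

In this toy setting the structured product matches the dense one, while the multiply count drops from rows(W) x cols(W) to twice that quantity divided by the number of blocks. This is consistent with the O(rows(W) x cols(W) / # of blocks) scaling quoted above. A real implementation would presumably run the per-block products as a batched matrix multiply on GPU rather than Python loops.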

Code Implementations: 1 repo
