LG · Apr 22, 2024

Towards smaller, faster decoder-only transformers: Architectural variants and their implications

arXiv:2404.14462v4 · 4 citations · h-index: 2 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for more efficient and accessible LLMs for researchers and practitioners, though it is incremental as it builds on existing decoder-only transformer architectures.

The paper tackles the problem of reducing model sizes and training times for decoder-only transformers while maintaining performance, introducing three architectural variants (ParallelGPT, LinearGPT, ConvGPT) that achieve comparable language generation results with smaller models and faster training.

In recent times, research on Large Language Models (LLMs) has grown exponentially, predominantly focusing on models underpinned by the transformer architecture, as established by [1], and further developed through the decoder-only variations by [2]. Contemporary efforts in this field primarily aim to enhance model capabilities by scaling up both the architecture and the data volumes used during training. However, exploration into reducing these model sizes while preserving their efficacy remains scant. In this study, we introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt). These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes. We open-source the model weights and the complete codebase for these implementations for further research.

Code Implementations: 1 repo
