CL · AI · Jan 29

CoFrGeNet: Continued Fraction Architectures for Language Generation

Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy, Rahul Nair

arXiv:2601.21766v2 · 0.6 · h-index: 33

Originality: Incremental advance

AI Analysis

This paper addresses the computational cost and parameter efficiency of large-scale language model deployment. The advance appears incremental, since it modifies existing Transformer components rather than proposing a fundamentally new architecture.

The paper tackles the parameter inefficiency of Transformer architectures for language generation by introducing CoFrGeNets, a new function class based on continued fractions that replaces Multi-head Attention and Feed-Forward Networks with fewer parameters. Results show competitive or superior performance on downstream tasks with 1/2 to 2/3 the parameters and shorter pre-training time compared to GPT2-xl and Llama3.

Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets (Continued Fraction Generative Networks). We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than standard PyTorch-based gradients. Our components are plug-in replacements that require little change to the training or inference procedures already in place for Transformer-based models, which makes our approach easy to incorporate into large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B). We pre-train the former on OpenWebText and GneissWeb, and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning and text understanding tasks is competitive with, and sometimes even superior to, that of the original models, with $\frac{2}{3}$ to $\frac{1}{2}$ the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
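The abstract does not spell out the exact form of the function class, so the following is only a minimal sketch of the general idea of a continued fraction with input-dependent terms. It assumes the canonical form a_0 + 1/(a_1 + 1/(a_2 + ...)) in which each term a_k is a linear function of the input; the names `continued_fraction` and `linear_ladder`, the `eps` clamp, and the pure-Python setting are illustrative assumptions, not the authors' components or their custom gradients.

```python
from typing import Sequence


def continued_fraction(terms: Sequence[float], eps: float = 1e-6) -> float:
    """Evaluate a_0 + 1/(a_1 + 1/(a_2 + ... + 1/a_d)), innermost term first."""
    value = terms[-1]
    for a in reversed(terms[:-1]):
        # Assumption: keep the denominator away from zero so 1/value stays finite.
        if abs(value) < eps:
            value = eps if value >= 0 else -eps
        value = a + 1.0 / value
    return value


def linear_ladder(x: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  eps: float = 1e-6) -> float:
    """Hypothetical 'ladder': term k is the dot product of weights[k] with x."""
    terms = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    return continued_fraction(terms, eps)


if __name__ == "__main__":
    # Depth-2 ladder over a 2-dimensional input: terms are [1.0, 2.0].
    print(linear_ladder([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch a ladder of depth d over an n-dimensional input uses d * n weights, which gives a rough sense of how such a function class could be parameter-light; how the paper actually builds attention and feed-forward replacements from it is not stated in this listing.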
