CVApr 11
On The Application of Linear Attention in Multimodal TransformersArmin Gerami, Seyedehanita Madani, Ramani Duraiswami
Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention (LA) as a high-efficiency alternative within multimodal frameworks. By integrating LA, we reduce the computational overhead from quadratic to linear relative to sequence length while preserving competitive performance. We evaluate our approach across ViT-S/16, ViT-B/16, and ViT-L/16 architectures trained on the LAION-400M dataset, with validation focused on ImageNet-21K zero-shot accuracy. Our systematic evaluation demonstrates that Linear Attention not only yields significant computational savings but also adheres to the same scaling laws as standard softmax attention. These findings position Linear Attention as a robust, scalable solution for next-generation multimodal Transformers tasked with processing increasingly large and complex datasets.
LGFeb 12, 2024
FAST: Factorizable Attention for Speeding up TransformersArmin Gerami, Monte Hoover, Pranav S. Dulepet et al.
Motivated by the factorization inherent in the original fast multipole method and the improved fast Gauss transform we introduce a factorable form of attention that operates efficiently in high dimensions. This approach reduces the computational and memory complexity of the attention mechanism in transformers from $O(N^2)$ to $O(N)$. In comparison to previous attempts, our work presents a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification and incorporates the all-to-all relationship between tokens. We explore the properties of our new attention metric and conduct tests in various standard settings. Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
LGOct 24, 2025
Transformer Based Linear Attention with Optimized GPU Kernel ImplementationArmin Gerami, Ramani Duraiswami
The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is the introduction of the linear attention (LA) mechanisms, which offers a linear time complexity of $O(ND^2)$ and have demonstrated comparable accuracy to regular attention. However, LA in practice lags behind its theoretical efficiency. We propose a novel method for LA's forward and backward passes, along with a highly-optimized CUDA implementation. Our approach outperforms the state-of-the-art by 3.3 times in speed and reduces memory consumption by 3.6 times. We validate these improvements in both single-layer and end-to-end settings by training a 1.4 billion parameter language model, which demonstrates similar expressivity to regular attention on major reasoning benchmarks.
LGOct 1, 2025
Auditing Algorithmic Bias in Transformer-Based TradingArmin Gerami, Ramani Duraiswami
Transformer models have become increasingly popular in financial applications, yet their potential risk making and biases remain under-explored. The purpose of this work is to audit the reliance of the model on volatile data for decision-making, and quantify how the frequency of price movements affects the model's prediction confidence. We employ a transformer model for prediction, and introduce a metric based on Partial Information Decomposition (PID) to measure the influence of each asset on the model's decision making. Our analysis reveals two key observations: first, the model disregards data volatility entirely, and second, it is biased toward data with lower-frequency price movements.