Marko Karbevski

LG
3 papers
2 citations
Novelty 52%
AI Score 43

3 Papers

LG · Mar 11
Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

Marko Karbevski

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline, comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
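
As a rough illustration of the form $Q(X) = X + f_\theta(X)$, the sketch below builds a bottleneck MLP for a single token vector in plain Python. The hidden width $d/2$, the GELU activation, the bias terms, and the zero-initialized output layer are all assumptions made for this example; the abstract only states that $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. Under these assumptions the parameter count is $d^2 + 1.5d$, and a zero output layer makes $Q(x) = x$ exactly, which is one way to read the "identity prior" remark.

```python
import math
import random

def bottleneck_param_count(d, r):
    """Parameters of a d -> r -> d MLP with biases (assumed layout)."""
    return d * r + r + r * d + d

def gelu(z):
    # Exact GELU via the error function (activation choice is an assumption).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def nonlinear_query(x, W_in, b_in, W_out, b_out):
    """Q(x) = x + f_theta(x) for one token vector x."""
    h = [gelu(a + b) for a, b in zip(matvec(W_in, x), b_in)]
    f = [a + b for a, b in zip(matvec(W_out, h), b_out)]
    return [xi + fi for xi, fi in zip(x, f)]

# Parameter count at d = 768 with hidden width d/2 (hypothetical choice).
d = 768
count = bottleneck_param_count(d, d // 2)  # d*d + d + d//2

# Tiny forward pass: with a zero output layer, Q(x) reduces to x.
random.seed(0)
d_small, r_small = 4, 2
x = [0.3, -1.2, 0.7, 2.0]
W_in = [[random.gauss(0.0, 1.0) for _ in range(d_small)] for _ in range(r_small)]
b_in = [0.0] * r_small
W_out = [[0.0] * r_small for _ in range(d_small)]
b_out = [0.0] * d_small
q = nonlinear_query(x, W_in, b_in, W_out, b_out)
```

Here `count` compares directly with the $d^2$ parameters of a dense $W_Q$, and `q` equals `x` until the output layer of $f_\theta$ moves away from zero.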

LG · Apr 26
Can an MLP Absorb Its Own Skip Connection?

Antonij Mijoski, Marko Karbevski

We study when a skip connection around a single-hidden-layer MLP can be absorbed into a residual-free MLP of the same width. We first show that for any architecture whose skip branch is an invertible linear map (including Hyper-Connections and their manifold-constrained variants), the problem reduces to the identity skip case. For homogeneous activations of degree $k \neq 1$, such as ReLU$^2$ and ReGLU, absorption is unconditionally impossible by a degree argument. For gated activations whose gate is differentiable at the origin with $g(0) = 0$, including SwiGLU and GeGLU, a linearization argument gives the same conclusion. These impossibility results extend to arbitrary depth: a composition of $L$ residual blocks using such activations cannot be replicated by any composition of $L$ residual-free blocks of the same width. For ungated ReLU and GELU, the situation is richer. For generic weight matrices, absorption holds at the single-block level if and only if there exists an index set $S$ of size at least $d$ such that $W_{\mathrm{down}}[:,S]\,W_{\mathrm{up}}[S,:] = -I_d$. This condition is non-generic (it fails with probability one under continuous weight distributions), so skip-connected and residual-free MLPs of the same width represent generically disjoint function classes. Whether this disjointness persists for deep compositions of ReLU or GELU blocks remains open.
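
To see why a condition of the form $W_{\mathrm{down}}[:,S]\,W_{\mathrm{up}}[S,:] = -I_d$ can allow absorption for ReLU, one can use the identity $\mathrm{ReLU}(z) = z + \mathrm{ReLU}(-z)$: on the units in $S$ the linear part cancels the skip term, and flipping the sign of those rows of $W_{\mathrm{up}}$ gives a residual-free MLP with the same output. The sketch below checks this numerically on a hand-picked bias-free example with $d = 2$ and hidden width 3. The weights, the helper names, and the absence of biases are assumptions for illustration, not the authors' construction or proof.

```python
def relu(z):
    return max(z, 0.0)

def mlp(W_down, W_up, x):
    """Residual-free block: W_down @ relu(W_up @ x), no biases."""
    h = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_up]
    return [sum(w * hj for w, hj in zip(row, h)) for row in W_down]

def skip_mlp(W_down, W_up, x):
    """Skip-connected block: x + W_down @ relu(W_up @ x)."""
    return [xi + yi for xi, yi in zip(x, mlp(W_down, W_up, x))]

def condition_holds(W_down, W_up, S, d):
    """Check W_down[:, S] @ W_up[S, :] == -I_d entrywise."""
    for i in range(d):
        for k in range(d):
            entry = sum(W_down[i][j] * W_up[j][k] for j in S)
            if entry != (-1.0 if i == k else 0.0):
                return False
    return True

def absorb(W_up, S):
    """Negate the rows of W_up indexed by S; W_down is left unchanged."""
    return [[-w for w in row] if j in S else list(row)
            for j, row in enumerate(W_up)]

# Hand-picked weights: hidden units 0 and 1 satisfy the condition,
# hidden unit 2 is an arbitrary extra unit.
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]    # 3 x 2
W_down = [[-1.0, 0.0, 0.5], [0.0, -1.0, 2.0]]   # 2 x 3
S = [0, 1]

ok = condition_holds(W_down, W_up, S, d=2)
W_up_free = absorb(W_up, S)

points = [(-2.0, 3.0), (1.5, -0.5), (0.0, 4.0), (-1.0, -1.0)]
max_gap = max(
    abs(a - b)
    for x in points
    for a, b in zip(skip_mlp(W_down, W_up, x), mlp(W_down, W_up_free, x))
)
```

With these weights `ok` is true and `max_gap` is zero on the test points. For weights drawn from a continuous distribution the equality checked by `condition_holds` would fail almost surely, which matches the non-genericity statement in the abstract.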

LG · Oct 27, 2025
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers

Marko Karbevski, Antonij Mijoski

The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
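
The "over 8%" figure is consistent with a simple back-of-the-envelope count, sketched below. It assumes a standard block with four $d \times d$ attention matrices (query, key, value, output) and an MLP with 4x expansion, and it ignores biases and layer-norm parameters; the width 768 and depth 12 are the commonly cited GPT-3 small configuration and are also an assumption here, as is the function name. Under these assumptions each block has $12d^2$ weights, and dropping $W_Q$ removes $d^2$ of them, i.e. $1/12 \approx 8.3\%$.

```python
def block_params(d, mlp_ratio=4, with_query=True):
    """Weight count of one transformer block, ignoring biases and norms."""
    attn = (4 if with_query else 3) * d * d   # W_Q, W_K, W_V, W_O
    mlp = 2 * mlp_ratio * d * d               # up and down projections
    return attn + mlp

d, n_layers = 768, 12  # assumed GPT-3-small-like configuration
full = n_layers * block_params(d)
reduced = n_layers * block_params(d, with_query=False)
saving = 1 - reduced / full  # fraction of non-embedding weights removed

print(f"full={full:,} reduced={reduced:,} saving={saving:.2%}")
```

The ratio does not depend on `d` or `n_layers` in this simplified count; it changes only if the MLP expansion ratio or the attention layout differs from the assumed one.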