LG | AI | Nov 14, 2025

Virtual Width Networks

arXiv:2511.11238v2, 2 citations, h-index: 19
Originality: Highly original
AI Analysis

This work targets the cost of scaling large models. It offers machine learning practitioners a novel way to speed up training and reduce loss without significant computational overhead.

The paper asks how to obtain wider representations in neural networks without the quadratic compute cost of a larger hidden size. It introduces Virtual Width Networks (VWN), which decouple representational width from backbone width. In large-scale experiments, an 8-times expansion yields over 2 times faster optimization for next-token prediction and 3 times for next-2-token prediction.
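To make the "quadratic cost" concrete, the back-of-the-envelope sketch below counts weight-matrix multiply-accumulates per token in a standard Transformer layer and shows how that count grows when the hidden size itself is widened. The function names, the 4x feed-forward multiplier, and the example hidden size are illustrative assumptions, not values from the paper. Attention-score computation is ignored, and the sketch does not model VWN itself.

```python
def layer_macs_per_token(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate multiply-accumulates per token for one layer's weight matmuls.

    Assumes a standard Transformer layer (an assumption, not the paper's setup):
    four d x d attention projections (Q, K, V, output) and a two-matrix
    feed-forward block with inner width ffn_mult * d.
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * ffn_mult * d_model * d_model
    return attention + feed_forward


def widening_cost_ratio(d_model: int, r: int) -> float:
    """Cost of a layer with hidden size r * d_model relative to hidden size d_model."""
    return layer_macs_per_token(r * d_model) / layer_macs_per_token(d_model)


if __name__ == "__main__":
    # Hypothetical base hidden size; any value gives the same ratios.
    for r in (2, 4, 8):
        print(f"hidden size x{r}: layer cost x{widening_cost_ratio(1024, r):.0f}")
```

Under these assumptions, widening the hidden size by a factor r multiplies per-layer matmul cost by r squared, so a naive 8-times widening costs 64 times as much per layer. That is the overhead VWN aims to avoid by keeping the backbone width fixed.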

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
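One way to read the "approximately log-linear" relation is that loss reduction grows roughly linearly in the logarithm of the virtual-width factor, so each doubling of virtual width buys about the same additional reduction. The sketch below fits such a line by ordinary least squares. The function name `fit_log_linear` and the data points are hypothetical placeholders chosen to lie exactly on a line; they are not results from the paper.

```python
import math


def fit_log_linear(widths, loss_reductions):
    """Least-squares fit of loss_reduction ~ intercept + slope * log2(width).

    Returns (intercept, slope). The slope is the estimated extra loss
    reduction gained per doubling of the virtual-width factor.
    """
    xs = [math.log2(w) for w in widths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(loss_reductions) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, loss_reductions))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


if __name__ == "__main__":
    # Synthetic placeholder data, NOT measurements from the paper:
    # 0.01 loss reduction per doubling of virtual width.
    widths = [2, 4, 8]
    synthetic_reductions = [0.01 * math.log2(w) for w in widths]
    intercept, slope = fit_log_linear(widths, synthetic_reductions)
    print(f"intercept={intercept:.4f}, slope per doubling={slope:.4f}")
```

With real measurements in place of the placeholders, the residuals of this fit would indicate how closely the log-linear description holds.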

