LG AI Nov 14, 2025

Virtual Width Networks

Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chengyin Xu, Chi Zhang, Chong Hu, Daoguang Zan

arXiv:2511.11238v2 9.42 citations h-index: 19

Originality: Highly original

AI Analysis

This work targets the cost of scaling large models. It offers a way to gain the benefits of wider representations, namely faster training and lower loss, while keeping backbone compute nearly constant.

The paper asks how to obtain wider representations without the quadratic compute cost of increasing the hidden size. It introduces Virtual Width Networks (VWN), which decouple representational width from backbone width. In the large-scale experiments, an 8-times expansion speeds up optimization by more than 2 times for next-token prediction and 3 times for next-2-token prediction.

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
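The two quantitative ideas in the abstract, the quadratic cost of widening the hidden size and the approximately log-linear relation between virtual width and loss reduction, can be illustrated with a little arithmetic. The sketch below is not the paper's method or code: the hidden size `d`, the helper names, and the `slope` constant are hypothetical placeholders, and the per-layer cost model (one square `d x d` linear layer, 2 FLOPs per multiply-add) is a simplifying assumption.

```python
import math


def square_layer_flops(d: int) -> int:
    """FLOPs per token for one d x d linear layer (2 per multiply-add)."""
    return 2 * d * d


def log_linear_loss_reduction(r: float, slope: float) -> float:
    """Assumed log-linear form: loss reduction = slope * log2(r).

    r is the virtual-width expansion factor; slope is a placeholder
    constant, not a value reported in the paper.
    """
    return slope * math.log2(r)


d = 1024  # hypothetical backbone hidden size
r = 8     # expansion factor mentioned in the abstract

# Widening the hidden size itself by r multiplies this layer's cost by r**2.
cost_ratio = square_layer_flops(r * d) // square_layer_flops(d)

# Under a log-linear relation, each doubling of r adds the same reduction.
slope = 0.01  # placeholder only
gains = [log_linear_loss_reduction(2 ** k, slope) for k in range(4)]  # r = 1, 2, 4, 8
steps = [gains[i + 1] - gains[i] for i in range(3)]

print(f"cost ratio for {r}x hidden size: {cost_ratio}x")
print("loss-reduction increment per doubling:", steps)
```

Under these assumptions, an 8-times wider hidden size costs 64 times as much per square layer, which is the overhead VWN aims to avoid by expanding only the embedding space. The constant increment per doubling is what "log-linear" means here; the actual fitted coefficients are in the paper, not in this sketch.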
