CV · AI · Oct 21, 2025

ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters

arXiv:2510.18431v2 · h-index: 32 · IEEE Transactions on Image Processing
Originality: Incremental advance
AI Analysis

ScaleNet offers a cost-effective way to scale vision transformers, primarily benefiting researchers and practitioners in computer vision. The advance appears incremental, since it builds on existing pretrained models with adapter-like techniques.

The paper tackles the problem of computationally intensive training for scaling vision transformers by introducing ScaleNet, which expands pretrained models with minimal parameter increases through layer insertion and weight sharing with adjustment parameters. On ImageNet-1K, ScaleNet achieved a 7.42% accuracy improvement over training from scratch with a 2× depth-scaled DeiT-Base model while requiring only one-third of the training epochs.
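To make "minimal parameter increases" concrete, the following back-of-envelope count compares a naive 2× depth copy of a DeiT-Base-sized encoder with a weight-shared expansion. It is only a sketch under stated assumptions: the width (768), depth (12), and MLP ratio (4) are the standard DeiT-Base settings, but the adapter bottleneck size, the number of adapters per layer, and the choice to attach adapters only to the inserted layers are hypothetical values chosen for illustration, not figures from the paper. Embedding and classifier-head parameters are ignored.

```python
def block_params(dim, mlp_ratio=4):
    """Parameter count of one standard ViT encoder block (with biases)."""
    hidden = dim * mlp_ratio
    attn = dim * 3 * dim + 3 * dim   # fused QKV projection
    attn += dim * dim + dim          # attention output projection
    mlp = dim * hidden + hidden      # first MLP layer
    mlp += hidden * dim + dim        # second MLP layer
    norms = 2 * 2 * dim              # two LayerNorms (scale and shift)
    return attn + mlp + norms


def adapter_params(dim, bottleneck):
    """Parameter count of one bottleneck adapter (down- and up-projection with biases)."""
    return dim * bottleneck + bottleneck + bottleneck * dim + dim


DIM, DEPTH = 768, 12        # DeiT-Base width and depth
BOTTLENECK = 16             # hypothetical adapter width, not from the paper
ADAPTERS_PER_LAYER = 2      # hypothetical: one beside attention, one beside the MLP

base = DEPTH * block_params(DIM)

# Naive 2x depth: every inserted layer brings a full new set of weights.
naive_2x = 2 * base

# Weight-shared 2x depth: inserted layers reuse pretrained tensors,
# so only their adapters add parameters (an assumption of this sketch).
shared_2x = base + DEPTH * ADAPTERS_PER_LAYER * adapter_params(DIM, BOTTLENECK)

overhead = (shared_2x - base) / base

print(f"encoder blocks, original:      {base:,}")
print(f"encoder blocks, naive 2x:      {naive_2x:,}")
print(f"encoder blocks, shared 2x:     {shared_2x:,}")
print(f"relative overhead of sharing:  {overhead:.2%}")
```

Under these assumed adapter sizes the shared expansion adds well under one percent to the encoder's parameter count, whereas the naive copy doubles it. The actual overhead depends on the adapter design the authors use.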

Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet builds on existing pretrained models and facilitates rapid model expansion with a negligible increase in parameters. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, using layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2× depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for downstream vision tasks, as evidenced by validation on an object detection task.
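The combination of weight sharing and parallel adapters described in the abstract can be illustrated with a toy linear layer. This is a minimal sketch, not the authors' implementation: the class name `SharedLinearWithAdapter`, the bottleneck form of the adapter, the zero initialization of the up-projection, and the tiny dimensions are all assumptions made for illustration. The inserted layer holds a reference to the pretrained weight rather than a copy, and a small layer-specific adapter path is added in parallel.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class SharedLinearWithAdapter:
    """Hypothetical inserted layer: y = W_shared x + U (D x).

    W_shared is the pretrained tensor (shared by reference, not copied).
    D and U form a layer-specific bottleneck adapter applied in parallel.
    """

    def __init__(self, shared_weight, bottleneck, rng):
        dim_out, dim_in = len(shared_weight), len(shared_weight[0])
        self.weight = shared_weight                      # shared, no new storage
        self.down = rand_matrix(bottleneck, dim_in, rng)
        # Assumed zero init: the inserted layer starts out identical
        # to the pretrained layer it shares weights with.
        self.up = [[0.0] * bottleneck for _ in range(dim_out)]

    def __call__(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.up, matvec(self.down, x))
        return [b + d for b, d in zip(base, delta)]


rng = random.Random(0)
DIM, BOTTLENECK = 8, 2                                   # toy sizes for illustration

pretrained_weight = rand_matrix(DIM, DIM, rng)
inserted_layer = SharedLinearWithAdapter(pretrained_weight, BOTTLENECK, rng)

x = [rng.uniform(-1, 1) for _ in range(DIM)]
original_out = matvec(pretrained_weight, x)              # pretrained layer, no adapter
out_before = inserted_layer(x)
same_at_init = out_before == original_out

# Stand-in for a training update that touches only the adapter.
inserted_layer.up[0][0] = 0.5
out_after = inserted_layer(x)

print("shares the pretrained tensor:", inserted_layer.weight is pretrained_weight)
print("matches pretrained layer at init:", same_at_init)
print("differs after adapter update:", out_after != out_before)
```

In this sketch only `down` and `up` are new parameters, so the inserted layer can specialize without altering or duplicating the pretrained tensor. How ScaleNet actually places and initializes its adapters is described in the paper itself.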

