Inverse Depth Scaling From Most Layers Being Similar
This work addresses inefficient use of depth in LLMs and is aimed at AI researchers. It suggests that architectural innovations are needed, though the contribution is incremental because it builds on existing scaling-law studies.
The study quantifies how depth affects loss in large language models, finding that loss scales inversely with depth, probably because functionally similar layers perform ensemble averaging rather than compositional learning, a regime that is inefficient but robust.
Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find that loss is inversely proportional to depth in LLMs, probably because functionally similar layers reduce error through ensemble averaging rather than through compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and from target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
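
To make the ensemble-averaging intuition concrete, the following is a minimal toy sketch and not the analysis used in the study. It assumes that each "layer" produces the target plus independent, zero-mean Gaussian noise, and that the network output is the plain average of those layer outputs. Under these assumptions the mean squared error of the average falls as 1/depth. The function name `ensemble_mse` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def ensemble_mse(depth, n_trials=20000, noise_std=1.0, seed=0):
    """Mean squared error of the average of `depth` noisy estimates.

    Toy assumption: the target is 0 and every layer contributes the
    target plus independent Gaussian noise with standard deviation
    `noise_std`.
    """
    rng = np.random.default_rng(seed)
    # One row per trial, one column per layer.
    estimates = rng.normal(0.0, noise_std, size=(n_trials, depth))
    # Ensemble averaging: the output is the mean over layers.
    averaged = estimates.mean(axis=1)
    return float(np.mean(averaged ** 2))

depths = np.array([1, 2, 4, 8, 16, 32, 64])
losses = np.array([ensemble_mse(int(d)) for d in depths])

# Fit log(loss) = slope * log(depth) + intercept.
# A slope near -1 corresponds to loss inversely proportional to depth.
slope, intercept = np.polyfit(np.log(depths), np.log(losses), 1)
print(f"fitted scaling exponent: {slope:.3f}")
```

The fitted exponent should come out close to -1, which is the inverse-depth behavior described above. If the layers instead composed their computations, as in compositional learning, there would be no reason to expect this particular exponent; the sketch illustrates only the averaging regime.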