CL · AI · LG · Jan 10, 2023

Scaling Laws for Generative Mixed-Modal Language Models

UW
arXiv:2301.03728v1 · 153 citations · h-index: 116
Originality: Incremental advance
AI Analysis

This work provides insights for designing and training mixed-modal generative models, which are important for unified AI systems, though it is incremental in extending scaling laws to mixed modalities.

The authors investigated the scaling properties of generative mixed-modal language models, running over 250 experiments with models of up to 30 billion parameters. They report new scaling laws that unify the contributions of individual modalities and their interactions, and they test these laws by training a 30B speech-text model that significantly outperforms its unimodal counterparts.

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.
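
To make the "additive term" idea in the abstract concrete, here is a minimal sketch in Python. It assumes a Chinchilla-style parametric form for each uni-modal law (E + A / N^alpha + B / D^beta) and assumes the mixed-modal loss is the average of the two uni-modal losses plus an interaction term that depends on model size and total token count. The function names, the averaged baseline, the shape of the interaction term, and every coefficient value are illustrative assumptions, not values or formulas taken from this page; consult the paper for the fitted form.

```python
def unimodal_loss(n_params, n_tokens, E, A, alpha, B, beta):
    """Assumed Chinchilla-style uni-modal law: E + A / N**alpha + B / D**beta."""
    return E + A / n_params**alpha + B / n_tokens**beta


def mixed_modal_loss(n_params, tokens_i, tokens_j, uni_i, uni_j, inter):
    """Assumed mixed-modal law: averaged uni-modal losses plus an additive interaction term.

    `inter` holds hypothetical interaction coefficients:
      C     - maximum achievable synergy (lowers the loss)
      A, alpha - competition that shrinks as the model grows
      B, beta  - competition that shrinks as the combined data grows
    """
    baseline = 0.5 * (
        unimodal_loss(n_params, tokens_i, **uni_i)
        + unimodal_loss(n_params, tokens_j, **uni_j)
    )
    total_tokens = tokens_i + tokens_j
    interaction = (
        -inter["C"]
        + inter["A"] / n_params**inter["alpha"]
        + inter["B"] / total_tokens**inter["beta"]
    )
    return baseline + interaction


# Placeholder coefficients, chosen only to exercise the functions.
speech = dict(E=1.8, A=300.0, alpha=0.30, B=500.0, beta=0.30)
text = dict(E=1.6, A=400.0, alpha=0.34, B=410.0, beta=0.28)
interaction = dict(C=0.05, A=50.0, alpha=0.40, B=80.0, beta=0.35)

for n_params in (8e6, 1e9, 30e9):
    loss = mixed_modal_loss(n_params, 50e9, 50e9, speech, text, interaction)
    print(f"N={n_params:.0e}  predicted mixed-modal loss={loss:.4f}")
```

Under these assumptions, the interaction term is positive (net competition) for small models and small datasets, and approaches -C (net synergy) as both grow, which is one way to read the abstract's "synergy and competition due to data and model size".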
