LG
May 12, 2023

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

arXiv:2305.07185v2
170 citations
Originality: Highly original
AI Analysis

The paper addresses the challenge of efficient long-sequence modeling for applications such as high-resolution images and audio, and it presents a novel method rather than an incremental improvement.

The authors tackle the problem of scaling autoregressive transformers to long sequences such as images and audio. They propose Megabyte, a multi-scale decoder architecture that enables end-to-end modeling of sequences of over one million bytes. Megabyte performs competitively with subword models on long-context language modeling and achieves state-of-the-art density estimation on ImageNet.

Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding, unlocking better performance at reduced cost for both training and generation. Extensive experiments show that Megabyte allows byte-level models to perform competitively with subword models on long-context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
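
To make the patching idea concrete, here is a minimal sketch (not the authors' implementation) of splitting a byte sequence into fixed-size patches and comparing a rough count of attention query-key pairs. The function names, the zero-byte padding, and the cost model are assumptions for illustration: the count ignores causal masking, constants, and model dimensions, and simply tallies one global attention over patches plus one local attention inside each patch.

```python
def split_into_patches(data: bytes, patch_size: int, pad: int = 0) -> list[bytes]:
    """Split a byte sequence into fixed-size patches.

    The last patch is padded with `pad` (an assumed convention, not taken
    from the paper) so that every patch has the same length.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    remainder = len(data) % patch_size
    if remainder:
        data = data + bytes([pad]) * (patch_size - remainder)
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]


def full_attention_pairs(seq_len: int) -> int:
    """Query-key pairs when every position attends over the whole sequence."""
    return seq_len * seq_len


def multiscale_attention_pairs(seq_len: int, patch_size: int) -> int:
    """Rough pair count for a global-plus-local scheme.

    Global model: attention between patches, (T / P)^2 pairs.
    Local model: attention within each patch, (T / P) * P^2 pairs.
    """
    num_patches = -(-seq_len // patch_size)  # ceiling division
    global_pairs = num_patches * num_patches
    local_pairs = num_patches * patch_size * patch_size
    return global_pairs + local_pairs


if __name__ == "__main__":
    print(split_into_patches(b"hello world", 4))
    seq_len, patch_size = 1_000_000, 100
    print(full_attention_pairs(seq_len))
    print(multiscale_attention_pairs(seq_len, patch_size))
```

Under this simplified count, a one-million-byte sequence with a hypothetical patch size of 100 needs 2 × 10^8 pairs instead of 10^12, a 5000-fold reduction. That is the sense in which the abstract's patch-based attention is sub-quadratic.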

