CL AI LG · Jun 17, 2025

Essential-Web v1.0: 24T tokens of organized web data

Essential AI, Andrew Hojel, Michael Pust, Tim Romanski, Yash Vanjani, Ritvik Kapila, Mohit Parmar, Adarsh Chaluvaraju, Alok Tripathy, Anil Thomas, Ashish Tanwer, Darsh J Shah

arXiv:2506.14111v2 · 13.97 citations · h-index: 22 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the costly and inaccessible data pipelines that AI researchers and developers face, though it is an incremental advance because it builds on existing data curation methods.

The authors address the lack of massive, well-organized pre-training datasets for language models by introducing Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy. Filtering on these labels yields web-curated datasets that are competitive with the state of the art: math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%), and medical (+8.6%).

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
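To illustrate what "SQL-style filters" over taxonomy labels could look like, here is a minimal sketch using an in-memory SQLite table. The column names (`topic`, `format`, `complexity`, `quality`), the label values, the thresholds, and the helper `filter_documents` are all hypothetical and chosen for illustration only; the actual dataset schema and the filters used by the authors may differ, and the four columns here are a simplification of the twelve-category taxonomy.

```python
import sqlite3

# Toy rows: (doc_id, topic, format, complexity, quality).
# Hypothetical schema and values, not the real dataset's columns or labels.
ROWS = [
    ("doc-1", "mathematics", "tutorial", 4, 5),
    ("doc-2", "mathematics", "forum", 2, 2),
    ("doc-3", "medicine", "article", 3, 4),
    ("doc-4", "mathematics", "article", 3, 4),
]


def filter_documents(rows, topic, min_complexity, min_quality):
    """Return ids of documents matching a topic and minimum label scores."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "doc_id TEXT, topic TEXT, format TEXT, "
        "complexity INTEGER, quality INTEGER)"
    )
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?, ?, ?)", rows)
    # The curation step is a plain WHERE clause over the label columns.
    cursor = conn.execute(
        "SELECT doc_id FROM docs "
        "WHERE topic = ? AND complexity >= ? AND quality >= ? "
        "ORDER BY doc_id",
        (topic, min_complexity, min_quality),
    )
    ids = [row[0] for row in cursor.fetchall()]
    conn.close()
    return ids


math_ids = filter_documents(ROWS, "mathematics", 3, 4)
print(math_ids)  # ['doc-1', 'doc-4']
```

The point of the sketch is that once every document carries labels, building a domain subset reduces to a declarative query rather than a custom classifier pipeline.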

