CL, Feb 12

propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

arXiv:2602.12414v2, 1 citation, h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the need for more interpretable and flexible data filtering in LLM pretraining. The advance is incremental, since it builds on existing curation methods.

The paper addresses the limitations of single-score data curation for LLM pretraining by introducing propella-1, a family of small multilingual LLMs that annotate documents across 18 properties. The 4B model achieves higher agreement with a reference annotator than much larger general-purpose models, and the authors release a dataset of over three billion annotations.

Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.
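To illustrate why per-property annotations allow more flexible filtering than a single scalar score, here is a minimal sketch in Python. The property names (`content_quality`, `reasoning_depth`, `safety_flag`) and their allowed values are hypothetical stand-ins chosen for illustration. The actual propella-1 schema defines 18 properties and is not reproduced here.

```python
import json

# Hypothetical property names and allowed values (assumptions for illustration);
# the real propella-1 schema defines 18 properties across six categories.
SCHEMA = {
    "content_quality": {"low", "medium", "high"},
    "reasoning_depth": {"none", "shallow", "deep"},
    "safety_flag": {"safe", "unsafe"},
}


def parse_annotation(raw: str) -> dict:
    """Parse one JSON annotation and check it against SCHEMA."""
    ann = json.loads(raw)
    for key, allowed in SCHEMA.items():
        if key not in ann:
            raise ValueError(f"missing property: {key}")
        if ann[key] not in allowed:
            raise ValueError(f"invalid value for {key}: {ann[key]!r}")
    return ann


def keep(ann: dict, **required) -> bool:
    """Return True if every named property has one of the accepted values."""
    return all(ann[prop] in accepted for prop, accepted in required.items())


# Three made-up annotations, one per document.
raw_annotations = [
    '{"content_quality": "high", "reasoning_depth": "deep", "safety_flag": "safe"}',
    '{"content_quality": "high", "reasoning_depth": "none", "safety_flag": "safe"}',
    '{"content_quality": "low", "reasoning_depth": "deep", "safety_flag": "unsafe"}',
]
annotations = [parse_annotation(r) for r in raw_annotations]

# Two different filters over the same annotations, with no re-scoring needed.
reasoning_subset = [
    a for a in annotations
    if keep(a, reasoning_depth={"deep"}, safety_flag={"safe"})
]
high_quality_subset = [
    a for a in annotations if keep(a, content_quality={"high"})
]
print(len(reasoning_subset), len(high_quality_subset))
```

The same annotated corpus yields different subsets depending on which properties a downstream use case cares about. A single scalar score would need a new classifier for each such criterion.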

