CL | AI | IR | Mar 3, 2025

ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, Han Xiao

arXiv:2503.01151v1 | 12 citations | h-index: 10

Originality: Incremental advance

AI Analysis

The paper addresses efficient web content extraction for grounding large language models, but the advance is incremental because it builds on existing small language model approaches.

The paper tackles the extraction of clean Markdown or JSON from messy HTML. Its result is ReaderLM-v2, a 1.5B-parameter model that handles documents of up to 512K tokens and outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on the authors' curated benchmarks, with the strongest results on documents exceeding 100K tokens.

We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy, making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
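The abstract names a draft, refine, critique pipeline but gives no implementation details, so the following is only a minimal sketch of one possible reading of that loop. Every name here (`synthesize`, `Sample`, `max_rounds`, and the `toy_*` stand-ins) is hypothetical and not taken from the paper; in a real pipeline the three callables would presumably be model calls rather than regex helpers.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    """One accepted training pair (hypothetical structure)."""
    html: str
    markdown: str
    rounds: int  # number of refine/critique rounds needed


def synthesize(
    html: str,
    draft: Callable[[str], str],
    refine: Callable[[str, str], str],
    critique: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> Optional[Sample]:
    """Draft once, then refine until the critique accepts or rounds run out."""
    output = draft(html)
    for round_no in range(1, max_rounds + 1):
        output = refine(html, output)
        if critique(html, output):
            return Sample(html=html, markdown=output, rounds=round_no)
    return None  # rejected: the pair is dropped from the training set


# Toy stand-ins so the sketch runs without any model (assumptions, not the paper's stages).
def toy_draft(html: str) -> str:
    # Replace every tag with a space.
    return re.sub(r"<[^>]+>", " ", html)


def toy_refine(html: str, text: str) -> str:
    # Collapse runs of whitespace.
    return " ".join(text.split())


def toy_critique(html: str, text: str) -> bool:
    # Accept only non-empty output with no leftover tag openers.
    return bool(text) and "<" not in text


sample = synthesize(
    "<div><h1>Title</h1>\n<p>Body   text</p></div>",
    toy_draft,
    toy_refine,
    toy_critique,
)
rejected = synthesize("<p>x</p>", toy_draft, toy_refine, lambda h, t: False)
```

Returning `None` for rejected pairs keeps the filtering decision explicit: only outputs that pass the critique step would end up as training data under this reading.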
