Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization

arXiv:2602.06394v1
h-index: 3
Originality Highly original
AI Analysis

This work addresses the challenge of processing noisy data for foundation model pre-training and reports incremental improvements in specific domains such as genomics and finance.

The paper tackles the ineffectiveness of existing tokenization methods on noisy real-world corpora by introducing QA-Token, which incorporates data reliability into vocabulary construction. Reported improvements include a 6.7 percentage point F1 gain over BPE in genomics variant calling and a 30% Sharpe ratio improvement in finance.

Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements: genomics (6.7 percentage point F1 gain in variant calling over BPE), finance (30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.
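
The abstract does not give the merge rule itself, so the following is only a minimal sketch of the general idea of letting data reliability influence vocabulary construction. It swaps in a simple greedy heuristic (quality-weighted pair counts in a BPE-style merge step) for the paper's bilevel and reinforcement-learning formulation. The function names, the use of the minimum quality of the two tokens, and the `alpha` exponent are all assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def best_quality_weighted_pair(sequences, qualities, alpha=1.0):
    """Pick the adjacent token pair with the highest quality-weighted count.

    sequences: list of token lists.
    qualities: per-token reliability scores in [0, 1], same shape as sequences.
    alpha: hypothetical exponent; alpha=0 reduces to plain frequency counting.
    """
    scores = defaultdict(float)
    for tokens, quals in zip(sequences, qualities):
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            # Assumption: a pair is only as reliable as its weakest token.
            pair_quality = min(quals[i], quals[i + 1])
            scores[pair] += pair_quality ** alpha
    if not scores:
        return None
    # Ties are broken by the pair itself so the result is deterministic.
    return max(scores, key=lambda p: (scores[p], p))


def merge_pair(tokens, quals, pair):
    """Merge every non-overlapping occurrence of `pair`, left to right."""
    merged_tokens, merged_quals = [], []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged_tokens.append(tokens[i] + tokens[i + 1])
            # Assumption: the merged token inherits the lower quality score.
            merged_quals.append(min(quals[i], quals[i + 1]))
            i += 2
        else:
            merged_tokens.append(tokens[i])
            merged_quals.append(quals[i])
            i += 1
    return merged_tokens, merged_quals


# Toy read: "GT" occurs twice but with low quality, "AC" once with high quality.
read = list("GTGTAC")
quality = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9]

weighted_choice = best_quality_weighted_pair([read], [quality])
frequency_choice = best_quality_weighted_pair([read], [quality], alpha=0.0)
merged = merge_pair(read, quality, weighted_choice)
```

In this toy input, frequency alone would favour the noisy pair, while the quality-weighted score favours the reliable one. Because the weighting happens only while the vocabulary is built, a sketch like this adds no work at inference time, which is in line with the zero inference overhead the abstract claims.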

