LG | IR | Jun 19, 2025

Semantic Outlier Removal with Embedding Models and LLMs

arXiv:2506.16644v1 | 1 citation | h-index: 1 | ACL
Originality: Incremental advance
AI Analysis

The work addresses the need for robust, cost-effective text processing in multilingual settings, though the advance is incremental because it builds on existing embedding and nearest-neighbor search techniques.

The paper tackles the problem of removing extraneous content from text documents by introducing SORE, a method that uses multilingual sentence embeddings and nearest-neighbor search to achieve near-LLM precision at lower cost, as demonstrated in experiments on HTML datasets.

Modern text processing pipelines demand robust methods to remove extraneous content while preserving a document's core message. Traditional approaches such as HTML boilerplate extraction or keyword filters often fail in multilingual settings and struggle with context-sensitive nuances, whereas Large Language Models (LLMs) offer improved quality at high computational cost. We introduce SORE (Semantic Outlier Removal), a cost-effective, transparent method that leverages multilingual sentence embeddings and approximate nearest-neighbor search to identify and excise unwanted text segments. By first identifying core content via metadata embedding and then flagging segments that either closely match predefined outlier groups or deviate significantly from the core, SORE achieves near-LLM extraction precision at a fraction of the cost. Experiments on HTML datasets demonstrate that SORE outperforms structural methods and yields high precision in diverse scenarios. Our system is currently deployed in production, processing millions of documents daily across multiple languages while maintaining both efficiency and accuracy. To facilitate reproducibility and further research, we release our implementation and evaluation datasets.
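
The two-part flagging rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' released implementation: the function names, the thresholds (`match_threshold`, `core_threshold`), the use of cosine similarity, and the toy 3-dimensional vectors are all assumptions made for illustration. A real pipeline would obtain vectors from a multilingual sentence-embedding model and would likely replace the exhaustive loop below with an approximate nearest-neighbor index.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_outliers(
    segments: Sequence[Vector],
    core: Vector,
    outlier_groups: Sequence[Vector],
    match_threshold: float = 0.9,   # hypothetical value, not from the paper
    core_threshold: float = 0.3,    # hypothetical value, not from the paper
) -> List[bool]:
    """Return True for each segment that should be removed.

    A segment is flagged if it closely matches any predefined outlier
    group OR if it deviates significantly from the core-content vector.
    """
    flags = []
    for seg in segments:
        # Exhaustive nearest-neighbor search over outlier groups
        # (a stand-in for an approximate nearest-neighbor index).
        nearest = max((cosine(seg, g) for g in outlier_groups), default=0.0)
        matches_outlier = nearest >= match_threshold
        far_from_core = cosine(seg, core) < core_threshold
        flags.append(matches_outlier or far_from_core)
    return flags


# Toy vectors standing in for sentence embeddings (assumed, for illustration).
core = [1.0, 0.0, 0.0]                 # e.g. an embedding of title/metadata
outlier_groups = [[0.0, 1.0, 0.0]]     # e.g. a "cookie banner" group
segments = [
    [0.9, 0.1, 0.0],    # close to the core
    [0.05, 1.0, 0.0],   # close to an outlier group
    [0.0, 0.0, 1.0],    # unrelated to both
    [0.6, 0.6, 0.0],    # moderately related to the core
]
flags = flag_outliers(segments, core, outlier_groups)
kept = [i for i, flagged in enumerate(flags) if not flagged]
```

In this toy run the second segment is flagged because it matches an outlier group and the third because it is far from the core, so only segments 0 and 3 are kept. Keeping the two conditions as separate booleans also makes each removal decision traceable to a specific rule, which is in the spirit of the transparency the abstract claims.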
