CL, Sep 24, 2025

LOCA: Logical Chain Augmentation for Scientific Corpus Cleaning

You-Le Fang, Dong-Shan Jian, Xiang Li, Ce Meng, Ling-Shi Meng, Chen-Xu Yan, Zhi-Zhang Bian, Yan-Qing Ma

arXiv:2510.01249v1 | 2.71 citations | h-index: 2

Originality: Highly original

AI Analysis

The paper addresses the need for high-quality scientific corpora to improve the reliability of scientific AI, and offers a novel method for a known bottleneck: dataset cleaning.

The paper tackles the high error rates in scientific QA datasets that stem from logical leaps and implicit reasoning. It introduces LOCA, a framework that automatically cleans scientific corpora by completing missing logical steps and separating principles from derivations, reducing error rates from as high as 20% to below 2%.

While Large Language Models (LLMs) excel in general domains, their reliability often falls short in scientific problem-solving. The advancement of scientific AI depends on large-scale, high-quality corpora. However, existing scientific question-answering (QA) datasets suffer from high error rates, frequently resulting from logical leaps and implicit reasoning within the answers. To address this issue, we introduce LOCA (Logical Chain Augmentation), a novel framework for automatically cleaning scientific corpora, implemented through an augment-and-review loop. At its core, LOCA enhances raw answers by completing missing logical steps and explicitly separating the underlying scientific principle from its subsequent derivation. By applying LOCA to challenging scientific corpora, we demonstrate that it can automatically filter noisy datasets, typically reducing the error rate from as high as 20% to below 2%. LOCA provides a scalable and effective methodology for creating high-quality scientific corpora, paving the way for more reliable training and evaluation of scientific AI.
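The augment-and-review loop described in the abstract might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `QAItem`, `augment_and_review`, `clean_corpus`, the `max_rounds` parameter, and the rule of discarding items that never pass review are all hypothetical, and the toy `augment` and `review` functions stand in for what would presumably be model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QAItem:
    question: str
    answer: str


def augment_and_review(
    item: QAItem,
    augment: Callable[[QAItem], str],
    review: Callable[[QAItem], bool],
    max_rounds: int = 3,  # hypothetical retry budget, not from the paper
) -> Optional[QAItem]:
    """Augment the answer repeatedly; return it once review accepts, else None."""
    current = item
    for _ in range(max_rounds):
        current = QAItem(item.question, augment(current))
        if review(current):
            return current
    return None  # assumption: items that never pass review are filtered out


def clean_corpus(
    items: List[QAItem],
    augment: Callable[[QAItem], str],
    review: Callable[[QAItem], bool],
    max_rounds: int = 3,
) -> List[QAItem]:
    """Keep only the items whose augmented answers pass review."""
    kept = []
    for item in items:
        result = augment_and_review(item, augment, review, max_rounds)
        if result is not None:
            kept.append(result)
    return kept


# Toy stand-ins (hypothetical): a real system would call a model here.
def toy_augment(item: QAItem) -> str:
    # Separate a stated principle from the derivation, if not already done.
    if item.answer.startswith("Principle:"):
        return item.answer
    return f"Principle: (stated explicitly)\nDerivation: {item.answer}"


def toy_review(item: QAItem) -> bool:
    # Accept only answers with an explicit principle and no unresolved gap.
    return "Principle:" in item.answer and "??" not in item.answer


corpus = [QAItem("q1", "F = ma, so a = 2"), QAItem("q2", "a = ??")]
cleaned = clean_corpus(corpus, toy_augment, toy_review)
```

In this sketch the second item never passes review and is dropped, which mirrors the filtering behavior the abstract mentions; how LOCA actually decides when to stop or discard is not specified in this listing.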
