CL, Jan 1

Rule-Based Approaches to Atomic Sentence Extraction

Lineesha Kamana, Akshita Ananda Subramanian, Mehuli Ghosh, Suman Saha

arXiv:2601.00506v1

Originality: Synthesis-oriented

AI Analysis

The paper addresses interpretability gaps in atomic sentence extraction for information retrieval and reasoning systems, but it is incremental, building on existing rule-based approaches.

This study tackled the problem of atomic sentence extraction by analyzing how complex sentence structures affect rule-based methods, achieving moderate-to-high performance with ROUGE-1 F1 = 0.6714 and BERTScore F1 = 0.5898 on the WikiSplit dataset.

Natural language often combines multiple ideas into complex sentences. Atomic sentence extraction, the task of decomposing complex sentences into simpler sentences that each express a single idea, improves performance in information retrieval, question answering, and automated reasoning systems. Previous work has formalized the "split-and-rephrase" task and established evaluation metrics, and machine learning approaches using large language models have improved extraction accuracy. However, these methods lack interpretability and provide limited insight into which linguistic structures cause extraction failures. Although some studies have explored dependency-based extraction of subject-verb-object triples and clauses, no principled analysis has examined which specific clause structures and dependencies lead to extraction difficulties. This study addresses this gap by analyzing how complex sentence structures, including relative clauses, adverbial clauses, coordination patterns, and passive constructions, affect the performance of rule-based atomic sentence extraction. Using the WikiSplit dataset, we implemented dependency-based extraction rules in spaCy, generated 100 gold-standard atomic sentence sets, and evaluated performance using ROUGE and BERTScore. The system achieved ROUGE-1 F1 = 0.6714, ROUGE-2 F1 = 0.478, ROUGE-L F1 = 0.650, and BERTScore F1 = 0.5898, indicating moderate-to-high lexical, structural, and semantic alignment. Challenging structures included relative clauses, appositions, coordinated predicates, adverbial clauses, and passive constructions. Overall, rule-based extraction is reasonably accurate but sensitive to syntactic complexity.
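To make the reported ROUGE-1 F1 figure concrete, the following is a minimal sketch of how a unigram-overlap F1 can be computed between a predicted set of atomic sentences and a gold set. It is an illustration only: the whitespace tokenization, the lowercasing, the function name `rouge1_f1`, and the example sentences are all assumptions, and the abstract does not say which ROUGE implementation or preprocessing the authors used.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string.

    Simplified sketch: lowercases and splits on whitespace; standard ROUGE
    packages may apply different tokenization or stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, not taken from WikiSplit.
gold = "the museum opened in 1990 . it attracts many visitors ."
pred = "the museum opened in 1990 . the museum attracts many visitors ."
score = rouge1_f1(pred, gold)
print(f"ROUGE-1 F1 (sketch): {score:.4f}")
```

In this toy pair the prediction repeats the subject instead of using a pronoun, so precision drops slightly while recall stays high. Small lexical differences of this kind lower ROUGE scores even when the split is semantically acceptable.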
