CL, Jun 5

ReadingMachine: A Computational Methodology for Structured Corpus Reading and Large-Scale Synthesis

arXiv:2606.07753 (has code)

Originality: Incremental advance

AI Analysis

This work gives qualitative researchers and analysts a way to systematically synthesize large corpora while preserving traceability and disagreement. It is an incremental improvement over existing LLM-based analysis methods.

ReadingMachine introduces a computational methodology for structured corpus reading that uses LLMs to extract insights and generate thematic maps from large document collections, demonstrated on 152 industrial policy documents producing over 17,500 insights.

ReadingMachine is a computational methodology for structured corpus reading that uses large language models to perform bounded reading operations over entire document collections. Rather than relying on retrieval or recursive summarization, the approach decomposes analysis into inspectable stages including insight extraction, semantic clustering, theme generation, and iterative omission detection. By delaying irreversible compression and explicitly tracking intermediate representations, the method prioritizes coverage, traceability, and preservation of disagreement across large corpora. The system is demonstrated on a heterogeneous corpus of 152 industrial policy documents, producing more than 17,500 extracted insights and a structured thematic map. ReadingMachine is released as an open-source experimental framework for large-scale qualitative synthesis and corpus analysis.
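The staged design described above (insight extraction, clustering, theme generation, omission detection) can be illustrated with a toy sketch. This is not the ReadingMachine code or its API: every name here (`Insight`, `Theme`, `extract_insights`, `cluster`, `make_theme`, `find_omissions`, `read_corpus`) is hypothetical, the LLM calls are replaced by trivial placeholders (sentence splitting and word counts), and Jaccard token overlap stands in for semantic clustering. The first theming pass is deliberately lossy so that the omission check has something to recover.

```python
from collections import Counter
from dataclasses import dataclass, field

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on"}

@dataclass(frozen=True)
class Insight:
    doc_id: str  # source document, kept so every theme stays traceable
    text: str    # the extracted claim

@dataclass
class Theme:
    label: str
    insights: list = field(default_factory=list)

def tokens(text):
    # Lowercased content words of a claim.
    return {w.strip(".,;").lower() for w in text.split()} - STOPWORDS - {""}

def extract_insights(doc_id, text):
    # Placeholder for an LLM extraction call: one insight per sentence.
    return [Insight(doc_id, s.strip()) for s in text.split(".") if s.strip()]

def cluster(insights, threshold=0.2):
    # Greedy grouping by Jaccard overlap with each cluster's first member
    # (an assumed stand-in for embedding-based semantic clustering).
    clusters = []
    for ins in insights:
        t = tokens(ins.text)
        for members in clusters:
            union = t | tokens(members[0].text)
            if union and len(t & tokens(members[0].text)) / len(union) >= threshold:
                members.append(ins)
                break
        else:
            clusters.append([ins])
    return clusters

def make_theme(members):
    # Placeholder for LLM theme generation: label by the two most frequent
    # tokens, ties broken alphabetically so the label is deterministic.
    counts = Counter(w for ins in members for w in tokens(ins.text))
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
    return Theme(" / ".join(w for w, _ in top), list(members))

def find_omissions(insights, themes):
    # Insights that no theme currently accounts for.
    covered = {ins for th in themes for ins in th.insights}
    return [ins for ins in insights if ins not in covered]

def read_corpus(corpus, max_rounds=3):
    insights = [i for doc_id, text in corpus.items()
                for i in extract_insights(doc_id, text)]
    # Lossy first pass: only clusters with several members become themes.
    themes = [make_theme(c) for c in cluster(insights) if len(c) > 1]
    # Iterative omission detection: re-theme whatever was left out.
    for _ in range(max_rounds):
        missing = find_omissions(insights, themes)
        if not missing:
            break
        themes.extend(make_theme(c) for c in cluster(missing))
    return insights, themes

corpus = {
    "doc1": "Subsidies raise domestic chip output. Export controls slow technology diffusion.",
    "doc2": "Subsidies distort domestic chip markets. Tariffs protect infant industries.",
}
insights, themes = read_corpus(corpus)
for th in themes:
    print(th.label, [(i.doc_id, i.text) for i in th.insights])
```

In this toy run the first theme groups the two subsidy claims from different documents even though they disagree, and each claim keeps its `doc_id`, which is the kind of traceability and preserved disagreement the abstract emphasizes. Keeping the intermediate `insights` list alongside `themes` is what makes the omission check possible at all.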
