IR · CL · May 21

AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation

arXiv:2605.22923

AI Analysis

This work addresses the need for better knowledge-source preparation for RAG in STEM domains, but the contribution is incremental: it applies existing preprocessing techniques to one specific format.

The authors propose a preprocessing pipeline to convert LaTeX source code into Markdown and JSONL chunks for RAG, preserving structural and semantic information lost in PDF extraction. The approach resolves cross-references, interprets macros, and identifies exercises/examples to improve retrieval accuracy for mathematical and technical documents.

Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model context before answering. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.
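The preprocessing idea can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes section-level chunking, reads label numbers from `\newlabel` entries in the compiled `.aux` file, replaces `\ref` commands with those numbers, and emits one JSON object per chunk. The function names (`parse_aux`, `resolve_refs`, `chunk_sections`, `to_jsonl`), the chunk fields, and the sample input are all hypothetical, and macro expansion, exercise detection, and author annotations are left out.

```python
import json
import re

# Matches \newlabel{key}{{number}{page}...} entries in a .aux file.
# Extra fields added by packages such as hyperref are ignored.
NEWLABEL = re.compile(r"\\newlabel\{([^}]+)\}\{\{([^}]*)\}\{([^}]*)\}")
SECTION = re.compile(r"\\(section|subsection)\*?\{([^}]*)\}")
REF = re.compile(r"\\ref\{([^}]+)\}")
LABEL = re.compile(r"\\label\{([^}]+)\}")


def parse_aux(aux_text):
    """Map each label key to the number LaTeX assigned to it."""
    return {m.group(1): m.group(2) for m in NEWLABEL.finditer(aux_text)}


def resolve_refs(text, labels):
    """Replace ref commands with their numbers; unknown keys stay as-is."""
    return REF.sub(lambda m: labels.get(m.group(1), m.group(0)), text)


def chunk_sections(tex, labels):
    """Split the source at section commands and yield one dict per chunk."""
    matches = list(SECTION.finditer(tex))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(tex)
        body = tex[m.end():end]
        keys = LABEL.findall(body)          # keep labels as metadata
        body = LABEL.sub("", body)          # drop them from the text
        body = resolve_refs(body, labels).replace("~", " ").strip()
        level = "#" if m.group(1) == "section" else "##"
        yield {
            "title": m.group(2),
            "labels": keys,
            "markdown": f"{level} {m.group(2)}\n\n{body}",
        }


def to_jsonl(chunks):
    """Serialize chunks as JSON Lines, one record per line."""
    return "\n".join(json.dumps(c, ensure_ascii=False) for c in chunks)


# Hypothetical sample input, for illustration only.
AUX = r"""
\newlabel{sec:intro}{{1}{1}}
\newlabel{sec:method}{{2}{3}}
"""
TEX = r"""
\section{Introduction}\label{sec:intro}
The method is described in Section~\ref{sec:method}.
\section{Method}\label{sec:method}
See Section~\ref{sec:intro} for motivation.
"""

chunks = list(chunk_sections(TEX, parse_aux(AUX)))
print(to_jsonl(chunks))
```

Each output line is a self-contained record that could be embedded and indexed in a vector database. Keeping the original label keys as metadata is one possible design choice: it lets a retriever follow a resolved reference back to the chunk it points to.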