CL · AI · Jun 9, 2024

Seventeenth-Century Spanish American Notary Records for Fine-Tuning Spanish Large Language Models

arXiv:2406.05812v1 · 1 citation
Originality: Synthesis-oriented
AI Analysis

This paper provides a domain-specific resource for researchers and historians working with historical Spanish texts. The contribution is incremental, since it applies existing fine-tuning methods to a new dataset.

The paper addresses the problem of customizing Spanish large language models for historical text analysis by introducing a collection of 160+ pages of seventeenth-century Spanish American notary records for fine-tuning. The results show that Spanish models fine-tuned on this dataset outperform pre-trained Spanish models and ChatGPT-3.5/ChatGPT-4o on tasks such as classification and masked language modeling.

Large language models have gained tremendous popularity in domains such as e-commerce, finance, healthcare, and education. Fine-tuning is a common approach to customize an LLM on a domain-specific dataset for a desired downstream task. In this paper, we present a valuable resource for fine-tuning LLMs developed for the Spanish language to perform tasks such as classification, masked language modeling, and clustering. Our resource is a collection of handwritten notary records from the seventeenth century obtained from the National Archives of Argentina. The collection combines original images with transcribed text (and metadata) for 160+ pages handwritten nearly 400 years ago by two notaries, Estenban Agreda de Vergara and Nicolas de Valdivia y Brisuela. Through empirical evaluation, we demonstrate that our collection can be used to fine-tune Spanish LLMs for tasks such as classification and masked language modeling, and that the fine-tuned models can outperform pre-trained Spanish models and ChatGPT-3.5/ChatGPT-4o. Our collection will be invaluable for historical text analysis and is publicly available on GitHub.

Code Implementations: 1 repo
