CL | Mar 11, 2025

ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships

Johan R. Portela, Nicolás Perez, Rubén Manrique

arXiv:2503.08803v1 | 1 citation | h-index: 2

Originality: Synthesis-oriented

AI Analysis

ESNLIR addresses a gap for Spanish NLP researchers by providing a new dataset, but the contribution is incremental: it extends existing NLI work to a new language and genre focus.

The paper tackles the lack of Spanish datasets for Natural Language Inference (NLI) by creating ESNLIR, a multi-genre dataset with causal relationships, and shows that genre enrichment improves model generalization based on evaluations with BERT-family models.

Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is a core task in Natural Language Processing (NLP): it enables machines to identify semantic relationships between pieces of text. Although considerable work exists for English, efforts for Spanish remain relatively sparse. This paper therefore presents ESNLIR, a multi-genre Spanish dataset for NLI that specifically accounts for causal relationships. A preliminary baseline was built and evaluated using models from the BERT family. The findings indicate that enriching the dataset with more genres improves the model's ability to generalize. The code, notebooks, and complete datasets for these experiments are available at: https://zenodo.org/records/15002575. The dataset alone is available at: https://zenodo.org/records/15002371.
