CL | Apr 11, 2024

Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations

Stefan Bott, Horacio Saggion, Nelson Peréz Rojas, Martin Solis Salazar, Saul Calderon Ramirez

arXiv:2404.07814v2 | 13.224 citations | h-index: 15 | Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

Originality: Synthesis-oriented

AI Analysis

This work addresses data scarcity for researchers and developers working on lexical simplification in Catalan and Spanish. Its contribution is incremental, since it builds on existing methods for resource creation.

The paper tackles the lack of lexical simplification datasets for Catalan and Spanish by creating two novel datasets: the first such dataset for Catalan, and the first for Spanish with scalar difficulty ratings. It also analyzes their appropriateness and ethical dimensions.

Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents the description and analysis of two novel datasets for lexical simplification in Spanish and Catalan. The Catalan dataset is the first of its kind, and the Spanish dataset is a substantial addition to the sparse data available for automatic lexical simplification in Spanish. Specifically, it is the first dataset for Spanish that includes scalar ratings of how difficult lexical items are to understand. In addition, we present a detailed analysis assessing the appropriateness and ethical dimensions of the data for the lexical simplification task.
