CL, Jan 26, 2025

Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets

Eduard Barbu, Meeri-Ly Muru, Sten Marcus Malva

arXiv:2501.15624v1, 6.74 citations, h-index: 1, RANLP

Originality: Synthesis-oriented

AI Analysis

This work addresses text simplification for Estonian, a low-resource language. It provides a basis for further research but is incremental, since it applies existing methods to a new domain.

The study tackled Estonian text simplification by developing a custom dataset and comparing a neural machine translation model (OpenNMT) with a fine-tuned LLaMA model, finding that LLaMA outperformed OpenNMT in readability, grammaticality, and meaning preservation.

This study introduces an approach to Estonian text simplification using two model architectures: a neural machine translation model and a fine-tuned large language model (LLaMA). Given the limited resources for Estonian, we developed a new dataset, the Estonian Simplification Dataset, combining translated data and GPT-4.0-generated simplifications. We benchmarked OpenNMT, a neural machine translation model that frames text simplification as a translation task, and fine-tuned the LLaMA model on our dataset to tailor it specifically for Estonian simplification. Manual evaluations on the test set show that the LLaMA model consistently outperforms OpenNMT in readability, grammaticality, and meaning preservation. These findings underscore the potential of large language models for low-resource languages and provide a basis for further research in Estonian text simplification.
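
The abstract describes fine-tuning LLaMA on pairs of original and simplified sentences. As a rough illustration only, the sketch below shows one common way such pairs could be serialized into prompt/completion records for fine-tuning. The prompt wording, the field names `prompt` and `completion`, and the helper names `to_record` and `to_jsonl` are all assumptions made for this example; the abstract does not specify the authors' data format or prompt.

```python
import json

# Hypothetical prompt template; the prompt actually used in the paper
# is not stated in the abstract.
PROMPT_TEMPLATE = (
    "Simplify the following Estonian sentence.\n\n"
    "Sentence: {source}\n\n"
    "Simplified:"
)


def to_record(source: str, target: str) -> dict:
    """Turn one (complex, simple) sentence pair into a prompt/completion record."""
    return {
        "prompt": PROMPT_TEMPLATE.format(source=source.strip()),
        # Leading space separates the completion from the prompt's final colon.
        "completion": " " + target.strip(),
    }


def to_jsonl(pairs) -> str:
    """Serialize sentence pairs as JSON Lines, one record per line."""
    # ensure_ascii=False keeps Estonian letters (õ, ä, ö, ü) human-readable.
    return "\n".join(
        json.dumps(to_record(source, target), ensure_ascii=False)
        for source, target in pairs
    )


# Placeholder pair, not data from the Estonian Simplification Dataset.
pairs = [("<complex sentence placeholder>", "<simple sentence placeholder>")]
jsonl = to_jsonl(pairs)
print(jsonl)
```

A file in this shape can be loaded by most fine-tuning toolkits, but the exact schema a given toolkit expects should be checked against its documentation.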
