CL, AI | Oct 13, 2025

Investigating Large Language Models' Linguistic Abilities for Text Preprocessing

Marco Braga, Gian Carlo Milanese, Gabriella Pasi

arXiv:2510.11482v1 | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses the need for more context-aware preprocessing in NLP and offers a practical improvement for researchers and practitioners working with multilingual text data. The advance is incremental, since it applies existing LLMs to a known bottleneck.

The paper tackles text preprocessing in NLP by using Large Language Models to perform stopword removal, lemmatization, and stemming. The LLMs replicate the traditional methods with accuracies of up to 97%, 82%, and 74%, respectively, and classifiers trained on LLM-preprocessed text improve downstream text classification F1 scores by up to 6% compared to traditional preprocessing.

Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.
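To make the comparison in the abstract concrete, here is a minimal sketch of how agreement between a list-based stopword remover and an LLM-style output could be measured per token. It is not the authors' implementation: the stopword list, the prompt wording in `build_prompt`, the greedy alignment in `keep_decisions`, and the `agreement` metric are all assumptions made for illustration. The actual prompts and evaluation code are in the repository linked above.

```python
# Hypothetical, tiny English stopword list (NOT taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "is", "to", "in", "and"}


def traditional_stopword_removal(tokens):
    """List-based baseline: drop every token found in STOPWORDS, ignoring context."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def build_prompt(text, language):
    """Assumed prompt wording for an LLM; the authors' real prompts may differ."""
    return (
        f"Remove the stopwords from the following {language} text. "
        "Return only the remaining words, separated by spaces.\n\n"
        f"Text: {text}"
    )


def keep_decisions(tokens, kept):
    """Mark each input token True if it was retained, False if it was removed.

    `kept` is consumed left to right, so repeated words are matched in order.
    This greedy alignment is a simplification chosen for the sketch.
    """
    decisions = []
    i = 0
    for token in tokens:
        if i < len(kept) and kept[i] == token:
            decisions.append(True)
            i += 1
        else:
            decisions.append(False)
    return decisions


def agreement(tokens, reference_kept, candidate_kept):
    """Fraction of tokens on which both outputs make the same keep/remove choice."""
    if not tokens:
        return 1.0
    ref = keep_decisions(tokens, reference_kept)
    cand = keep_decisions(tokens, candidate_kept)
    matches = sum(r == c for r, c in zip(ref, cand))
    return matches / len(tokens)
```

For the tokens of "The cat is in the garden", the baseline keeps `["cat", "garden"]`. If a hypothetical LLM response kept `["cat", "in", "garden"]`, the two outputs would disagree on one of six tokens, so `agreement` would return 5/6. In practice the candidate list would come from sending `build_prompt(...)` to a model and splitting its reply on whitespace.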
