New Evaluation Paradigm for Lexical Simplification
This work addresses a gap in LS evaluation for LLMs. The contribution is incremental, since it adapts existing methods to new model capabilities.
The paper tackles the problem of evaluating lexical simplification (LS) methods, particularly large language models (LLMs) that simplify sentences directly. It proposes a new annotation method that builds an all-in-one LS dataset through human-machine collaboration, and it shows that multi-LLMs approaches significantly outperform existing baselines.
Lexical Simplification (LS) methods use a three-step pipeline: complex word identification, substitute generation, and substitute ranking, each with separate evaluation datasets. We find that large language models (LLMs) can simplify sentences directly with a single prompt, bypassing the traditional pipeline. However, existing LS datasets are not suitable for evaluating these LLM-generated simplified sentences, because they provide substitutes for a single complex word per sentence and do not identify all complex words in it. To address this gap, we propose a new annotation method for constructing an all-in-one LS dataset through human-machine collaboration. Automated methods generate a pool of potential substitutes, which human annotators then assess, suggesting further alternatives as needed. We also explore LLM-based methods with single prompts, in-context learning, and chain-of-thought techniques, and we introduce a multi-LLMs collaboration approach to simulate each step of the LS task. Experimental results demonstrate that LS based on multi-LLMs approaches significantly outperforms existing baselines.
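
To make the three-step pipeline concrete, the following is a minimal sketch of complex word identification, substitute generation, and substitute ranking. It is not the paper's method: the frequency table `WORD_FREQ`, the substitute table `SUBSTITUTES`, the threshold, and all function names are hypothetical stand-ins, and a real system would replace each step with a trained model or an LLM call.

```python
from typing import Dict, List

# Hypothetical toy resources (assumptions for illustration only).
WORD_FREQ: Dict[str, int] = {
    "the": 1000, "doctor": 300, "will": 800, "check": 400,
    "wound": 120, "examine": 60, "scrutinize": 5, "laceration": 3,
    "cut": 450,
}
SUBSTITUTES: Dict[str, List[str]] = {
    "scrutinize": ["examine", "check"],
    "laceration": ["wound", "cut"],
}


def identify_complex_words(tokens: List[str], threshold: int = 50) -> List[str]:
    """Step 1: flag tokens whose frequency falls below the threshold.

    Tokens missing from the table get frequency 0 and are flagged too.
    """
    return [t for t in tokens if WORD_FREQ.get(t, 0) < threshold]


def generate_substitutes(word: str) -> List[str]:
    """Step 2: look up candidate substitutes (empty list if none are known)."""
    return SUBSTITUTES.get(word, [])


def rank_substitutes(candidates: List[str]) -> List[str]:
    """Step 3: order candidates from most to least frequent (simplest first)."""
    return sorted(candidates, key=lambda w: WORD_FREQ.get(w, 0), reverse=True)


def simplify(sentence: str) -> str:
    """Run all three steps and replace every complex word in the sentence."""
    tokens = sentence.lower().split()
    complex_words = set(identify_complex_words(tokens))
    output = []
    for token in tokens:
        ranked = rank_substitutes(generate_substitutes(token))
        # Replace only flagged words that have at least one candidate.
        output.append(ranked[0] if token in complex_words and ranked else token)
    return " ".join(output)


if __name__ == "__main__":
    print(simplify("The doctor will scrutinize the laceration"))
    # the doctor will check the cut
```

The sketch replaces every complex word in the sentence. That is the case the abstract says existing datasets do not cover, since they annotate substitutes for only one complex word per sentence.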