Plain language adaptations of biomedical text using LLMs: Comparison of evaluation metrics
This work addresses health literacy by making biomedical information more accessible, though it is incremental as it compares existing LLM approaches on a known task.
The study applied Large Language Models to simplify biomedical texts for better health literacy, finding that gpt-4o-mini outperformed other methods, with G-Eval metrics aligning well with qualitative assessments.
This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset of plain language adaptations of biomedical abstracts, we developed and evaluated several approaches: a baseline approach using a prompt template, a two-agent approach, and a fine-tuning approach. We selected the OpenAI gpt-4o and gpt-4o-mini models as baselines for further research. We evaluated our approaches with quantitative metrics (Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, and G-Eval) and with a qualitative metric, namely 5-point Likert scales for simplicity, accuracy, completeness, and brevity. Results showed superior performance of gpt-4o-mini and underperformance of the fine-tuning approaches. G-Eval, an LLM-based quantitative metric, showed promising results, ranking the approaches similarly to the qualitative metric.
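As an illustration of the two readability metrics named above, the following is a minimal sketch of the standard Flesch-Kincaid grade level and SMOG Index formulas. It is not the evaluation code used in the study. The helper names (`count_syllables`, `split_text`) and the vowel-group syllable heuristic are assumptions made for this example; established readability libraries use more careful tokenization and syllable counting, so their scores may differ.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Hypothetical heuristic: count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def split_text(text: str):
    """Naive sentence and word splitting (an assumption, not a real tokenizer)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return sentences, words


def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences, words = split_text(text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def smog_index(text: str) -> float:
    """SMOG = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences, words = split_text(text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    plain = "The cat sat on the mat. The dog ran to the park."
    dense = ("Pharmacological interventions demonstrated "
             "statistically significant improvements.")
    for label, text in (("plain", plain), ("dense", dense)):
        print(label, flesch_kincaid_grade(text), smog_index(text))
```

Both scores approximate a school grade level, so a lower value indicates text that is easier to read. Neither metric checks whether the simplified text preserves the meaning of the source, which is why the study pairs them with SARI, BERTScore, G-Eval, and human ratings.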