CL | IR | Jan 25, 2025

MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models

Zhongpu Chen, Yinfeng Liu, Long Shi, Xingyan Chen, Yu Zhao, Fuji Ren

arXiv:2501.15000v2 | 12 citations | h-index: 8 | Has Code | WWW

Originality: Incremental advance

AI Analysis

This work addresses the need for better readability assessment of LLM outputs in chatbot applications. It is incremental because it focuses on one specific metric rather than a broader breakthrough.

The paper tackles the problem of evaluating Markdown awareness in large language models (LLMs) to improve readability in web chatbots. It introduces MDEval, a benchmark that achieves a Spearman correlation of 0.791 and 84.1% accuracy against human judgments, and shows that fine-tuning on the proposed dataset lets less performant open-source models reach performance comparable to GPT-4o's.

Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate readability from the view of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with human judgments, outperforming existing methods by a large margin. Extensive experimental results also show that through fine-tuning over our proposed dataset, less performant open-source models are able to achieve comparable performance to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open-sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark.
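The abstract reports agreement with human judgments as a Spearman correlation and an accuracy figure. As a rough illustration of what such numbers measure, the sketch below computes a Spearman rank correlation and a pairwise-ordering accuracy between a metric's scores and human ratings. This is not the authors' evaluation code: the function names and the scores are made up for illustration, and treating "accuracy" as pairwise agreement is an assumption, since the abstract does not define it.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def pairwise_accuracy(metric, human):
    """Fraction of item pairs (with distinct human scores) ordered the same way.

    Assumed definition for illustration; the paper may define accuracy differently.
    """
    agree = total = 0
    for i, j in combinations(range(len(human)), 2):
        if human[i] == human[j]:
            continue  # skip pairs the human raters did not distinguish
        total += 1
        if (metric[i] - metric[j]) * (human[i] - human[j]) > 0:
            agree += 1
    return agree / total


# Made-up scores for five responses (illustrative only, not from the paper).
metric_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_scores = [5, 2, 3, 1, 4]

rho = spearman(metric_scores, human_scores)
acc = pairwise_accuracy(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f}, pairwise accuracy = {acc:.3f}")
```

In this toy data the metric swaps the order of only one pair of responses relative to the human ratings, so both figures come out high. In practice a library routine such as `scipy.stats.spearmanr` would replace the hand-written ranking.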

