MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models
The work addresses the need for better readability assessment of LLM outputs in chatbot applications, though the contribution is incremental: it targets one specific metric rather than a broader breakthrough.
The paper tackles the problem of evaluating Markdown awareness in large language models (LLMs) to improve readability in web chatbots, introducing MDEval, a benchmark that achieves a Spearman correlation of 0.791 and 84.1% accuracy with human judgments, and shows that fine-tuning open-source models can match GPT-4o's performance.
Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there is a myriad of metrics for evaluating LLMs, they fail to evaluate readability from the perspective of output content structure. To this end, we focus on an overlooked yet important metric, Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, built on a dataset of 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with human judgments, outperforming existing methods by a large margin. Extensive experimental results also show that, through fine-tuning on our proposed dataset, less performant open-source models are able to achieve performance comparable to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open-sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark.
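
To make the two agreement figures concrete, the following is a minimal sketch of how a Spearman correlation and an accuracy against human judgments could be computed for any automatic metric. It is not taken from the MDEval code base. The abstract does not define how accuracy is measured, so the pairwise-agreement definition used here is an assumption, and all function names (`average_ranks`, `spearman`, `pairwise_accuracy`) are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


def pairwise_accuracy(metric_scores, human_scores):
    """Assumed definition: share of response pairs that the metric orders
    the same way as the human raters. Pairs tied in the human scores are skipped."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue
        metric_diff = metric_scores[i] - metric_scores[j]
        total += 1
        if human_diff * metric_diff > 0:
            agree += 1
    return agree / total


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    metric = [0.2, 0.9, 0.4, 0.7]
    human = [1, 5, 3, 2]
    print("Spearman:", spearman(metric, human))
    print("Pairwise accuracy:", pairwise_accuracy(metric, human))
```

Spearman correlation rewards a metric for ranking responses in the same order as human raters, regardless of the scale of its scores. That makes it a natural choice when the automatic score and the human rating use different ranges.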