IR | AI | Jun 21, 2024

Evaluating the Response Capabilities of Large Language Models (LLMs) on Historians' Questions

arXiv:2406.15173v1
Originality: Synthesis-oriented
AI Analysis

This work addresses how reliably LLMs handle historical queries from historians and researchers. It is incremental, however: it applies existing evaluation methods to a new domain-specific dataset.

The study evaluated the capabilities of ten large language models (LLMs) in providing reliable and relevant responses to historical questions in French, revealing significant shortcomings in accuracy, language treatment, verbosity, and consistency.

Large Language Models (LLMs) such as ChatGPT or Bard have revolutionized information retrieval and captivated audiences with their ability to generate custom responses in record time, regardless of the topic. In this article, we assess the capabilities of various LLMs in producing reliable, comprehensive, and sufficiently relevant responses about historical facts in French. To achieve this, we constructed a testbed comprising numerous history-related questions of varying types, themes, and levels of difficulty. Our evaluation of responses from ten selected LLMs reveals numerous shortcomings in both substance and form. Beyond an overall insufficient accuracy rate, we highlight uneven treatment of the French language, as well as issues related to verbosity and inconsistency in the responses provided by LLMs.

