CL · Feb 28, 2024

Multi-FAct: Assessing Factuality of Multilingual LLMs using FActScore

arXiv:2402.18045v3 · 18 citations · h-index: 7
AI Analysis

The paper addresses the lack of factuality evaluation for multilingual LLMs, which matters to researchers and developers building global AI applications, but the contribution is incremental because it extends an existing method to new contexts.

This paper tackles the problem of evaluating the factuality of long-form text generated by multilingual large language models (LLMs) across languages and geographic regions, by applying the FActScore pipeline to diverse languages and topics, resulting in systematic assessments and guidelines for multilingual factual evaluation.

Evaluating the factuality of long-form large language model (LLM)-generated text is an important challenge. Recently there has been a surge of interest in factuality evaluation for English, but little is known about the factuality evaluation of multilingual LLMs, especially when it comes to long-form generation. We introduce a simple pipeline for multilingual factuality evaluation by applying FActScore (Min et al., 2023) to diverse languages. In addition to evaluating multilingual factual generation, we evaluate the factual accuracy of long-form text generation on topics that reflect regional diversity. We also examine the feasibility of running the FActScore pipeline using non-English Wikipedia and provide comprehensive guidelines on multilingual factual evaluation for regionally diverse topics.
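
For orientation, the core quantity behind FActScore can be sketched in a few lines: a generation is split into atomic facts, each fact is checked against a knowledge source, and the score is the fraction of facts that are supported. The sketch below is only an illustration under stated assumptions, not the paper's pipeline. The function name `factscore`, the `is_supported` callback, and the toy knowledge set are hypothetical; in practice, fact decomposition and verification are done by models against a source such as Wikipedia.

```python
from typing import Callable, Iterable


def factscore(atomic_facts: Iterable[str],
              is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic facts judged as supported.

    `is_supported` is a hypothetical stand-in for the verification step
    (for example, retrieval plus a model judgment against Wikipedia).
    """
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("no atomic facts to score")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Toy knowledge source (assumption for illustration only).
knowledge = {
    "Nairobi is the capital of Kenya.",
    "Kenya is in East Africa.",
    "Swahili is spoken in Kenya.",
}

generated_facts = [
    "Nairobi is the capital of Kenya.",
    "Kenya is in East Africa.",
    "Swahili is spoken in Kenya.",
    "Kenya borders Brazil.",  # not in the knowledge source
]

score = factscore(generated_facts, lambda fact: fact in knowledge)
print(f"FActScore-style precision: {score:.2f}")
```

In a multilingual setting, the same computation would apply per language, with the knowledge source swapped (for example, a non-English Wikipedia), which is the feasibility question the abstract raises.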

Code Implementations: 1 repo
