RDF-Based Structured Quality Assessment Representation of Multilingual LLM Evaluations
This work addresses the challenge of evaluating LLM quality in multilingual settings where knowledge conflicts arise. Its contribution is incremental, since it applies a structured representation to a single domain.
The authors tackle the problem of systematically assessing multilingual LLM reliability in the presence of conflicting information by proposing an RDF-based framework. They demonstrate it in a fire safety domain experiment with 28 questions, which reveals patterns in context prioritization and language-specific performance.
Large Language Models (LLMs) increasingly serve as knowledge interfaces, yet systematically assessing their reliability in the presence of conflicting information remains difficult. We propose an RDF-based framework to assess multilingual LLM quality, focusing on knowledge conflicts. Our approach captures model responses across four distinct context conditions (complete, incomplete, conflicting, and no-context information) in German and English. This structured representation enables comprehensive analysis of knowledge leakage (where models favor training data over provided context), error detection, and multilingual consistency. We demonstrate the framework through a fire safety domain experiment, which reveals critical patterns in context prioritization and language-specific performance and shows that our vocabulary was sufficient to express every assessment facet encountered in the 28-question study.
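To make the idea of a structured representation concrete, the following is a minimal sketch of how one response per context condition and language might be recorded as RDF triples and then queried for knowledge leakage. It is an illustration under stated assumptions, not the authors' vocabulary: the namespace `http://example.org/llm-qa#`, the property names (`answersQuestion`, `contextCondition`, `showsKnowledgeLeakage`, and so on), the question identifier `q01`, and all values are hypothetical placeholders. The sketch uses only the Python standard library and emits simplified N-Triples.

```python
from itertools import product

# Hypothetical namespace and term names, chosen for illustration only.
# The abstract does not specify the framework's actual vocabulary.
EX = "http://example.org/llm-qa#"

# The four context conditions and two languages named in the abstract.
CONDITIONS = ("complete", "incomplete", "conflicting", "no_context")
LANGUAGES = ("de", "en")


def iri(local):
    """Wrap a local name in the (assumed) namespace as an N-Triples IRI."""
    return f"<{EX}{local}>"


def literal(value, lang=None):
    """Build a plain or language-tagged N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


def response_triples(question_id, condition, lang, answer, leaked):
    """Describe one model response as a list of (subject, predicate, object)."""
    subject = iri(f"response/{question_id}/{condition}/{lang}")
    return [
        (subject, iri("answersQuestion"), iri(f"question/{question_id}")),
        (subject, iri("contextCondition"), iri(f"condition/{condition}")),
        (subject, iri("language"), literal(lang)),
        (subject, iri("answerText"), literal(answer, lang)),
        # Simplification: the flag is a plain string literal, not xsd:boolean.
        (subject, iri("showsKnowledgeLeakage"), literal(str(leaked).lower())),
    ]


def to_ntriples(triples):
    """Serialize triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def leakage_by_language(triples):
    """Count responses flagged as knowledge leakage, grouped by language."""
    lang_of, leaked = {}, set()
    for s, p, o in triples:
        if p == iri("language"):
            lang_of[s] = o.strip('"')
        elif p == iri("showsKnowledgeLeakage") and o == '"true"':
            leaked.add(s)
    counts = {}
    for s in leaked:
        counts[lang_of[s]] = counts.get(lang_of[s], 0) + 1
    return counts


graph = []
for condition, lang in product(CONDITIONS, LANGUAGES):
    # Placeholder flag for illustration only, not a result from the study.
    leaked = condition == "conflicting" and lang == "en"
    graph += response_triples("q01", condition, lang, "placeholder answer", leaked)

print(to_ntriples(graph[:5]))
print(leakage_by_language(graph))
```

Because every response carries its condition and language as explicit triples, the same graph can be sliced by either dimension without changing the data model. A production setup would more likely use an RDF library and SPARQL queries than the hand-rolled grouping shown here.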