Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation
This work gives NLP researchers and practitioners a diagnostic tool for assessing information fidelity in generative tasks, though the contribution is incremental: it adapts an existing framework to new applications.
The paper tackles the problem of evaluating semantic fidelity in text-to-text generation, where traditional metrics such as BLEU and BERTScore are inadequate. It adapts the Cross-Examination Framework (CEF), which uses reference-free, multi-dimensional scoring to identify errors such as content omissions and factual contradictions. The framework is validated across translation, summarization, and clinical note generation, with strong correlation to human expert judgments.
Traditional metrics such as BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for reference-free, multi-dimensional evaluation by treating the source and the candidate as independent knowledge bases. CEF generates verifiable questions from each text and cross-examines them against the other text to derive three interpretable scores: Coverage, Conformity, and Consistency. Validated across translation, summarization, and clinical note generation, our framework identifies critical errors, such as content omissions and factual contradictions, that standard metrics miss. A key contribution is a systematic robustness analysis for selecting a stable judge model. Crucially, the strong correlation between our reference-free and with-reference modes validates CEF's reliability without gold references. Furthermore, human expert validation demonstrates that the questions CEF flags as mismatched align more strongly with meaning-altering semantic errors than with non-semantic errors, and that CEF is particularly effective at identifying entity-based and relational distortions.
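The cross-examination step can be illustrated with a minimal Python sketch. The abstract does not give the exact score formulas, so the definitions used here are assumptions made only for illustration: Coverage is taken as the share of source-derived questions the candidate can answer, Consistency as the share of candidate-derived questions the source can answer, and Conformity as the share of all cross-answers that do not contradict the text the question came from. The `Judge` callable, the `YES`/`NO`/`IDK` answer labels, and the `cross_examine` function are hypothetical names, not part of the paper.

```python
from typing import Callable, Dict, List

# Hypothetical answer labels: "YES" (confirmed), "NO" (contradicted),
# "IDK" (the text does not contain the information).
Answer = str
# Hypothetical judge interface: (text, question) -> answer label.
Judge = Callable[[str, str], Answer]


def fraction(answers: List[Answer], keep: Callable[[Answer], bool]) -> float:
    """Share of answers satisfying `keep`; 0.0 for an empty list."""
    return sum(1 for a in answers if keep(a)) / len(answers) if answers else 0.0


def cross_examine(
    source: str,
    candidate: str,
    source_questions: List[str],
    candidate_questions: List[str],
    judge: Judge,
) -> Dict[str, float]:
    """Answer each text's questions against the *other* text and aggregate.

    Assumes every question is phrased so that its own text answers "YES".
    The three formulas below are illustrative assumptions.
    """
    candidate_answers = [judge(candidate, q) for q in source_questions]
    source_answers = [judge(source, q) for q in candidate_questions]
    return {
        # Assumed: source content that the candidate still carries.
        "coverage": fraction(candidate_answers, lambda a: a != "IDK"),
        # Assumed: cross-answers that do not contradict the other text.
        "conformity": fraction(candidate_answers + source_answers, lambda a: a != "NO"),
        # Assumed: candidate content that the source supports or addresses.
        "consistency": fraction(source_answers, lambda a: a != "IDK"),
    }


# Toy judge backed by a lookup table, standing in for an LLM judge.
toy_table = {
    ("C", "q1"): "YES",
    ("C", "q2"): "IDK",  # candidate omits this source fact
    ("S", "q3"): "YES",
    ("S", "q4"): "NO",   # source contradicts this candidate claim
}


def toy_judge(text: str, question: str) -> Answer:
    return toy_table[(text, question)]


scores = cross_examine("S", "C", ["q1", "q2"], ["q3", "q4"], toy_judge)
print(scores)
```

In this toy run the omitted source fact lowers the assumed Coverage score and the contradicted candidate claim lowers the assumed Conformity score, which mirrors the two error types the abstract highlights: content omissions and factual contradictions. In practice the lookup table would be replaced by a judge model, whose stability is what the robustness analysis is meant to establish.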