CL, AI, CV | Nov 21, 2025

SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Shrikant Kendre, Austin Xu, Honglu Zhou, Michael Ryoo, Shafiq Joty, Juan Carlos Niebles

arXiv:2511.17432v1 | 2.7

Originality: Incremental advance

AI Analysis

This work provides a more accurate and lightweight evaluation tool for question-answering systems, which is important for researchers and developers in natural language processing and computer vision, though it is incremental as it builds on prior semantic metrics.

The paper tackles the problem of evaluating question-answering systems, addressing the limitations of existing metrics that either focus too heavily on lexical similarity or lack flexibility in balancing sentence-level and keyword-level semantics. It introduces SMILE, a composite metric that combines lexical and semantic components and correlates highly with human judgments across text, image, and video QA tasks.

Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
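The abstract names three ingredients (sentence-level semantics, keyword-level semantics, and keyword matching) but does not give the formula that combines them. The following is only a minimal sketch of how such a composite score could be assembled, not the authors' method. Everything in it is an assumption made for illustration: the function names (`bow_cosine`, `keyword_match`, `composite_score`), the weight `w_sentence`, the equal averaging of the two keyword terms, and the use of bag-of-words cosine similarity as a stand-in for a real embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors.

    ASSUMPTION: a toy stand-in for embedding similarity; a real
    semantic metric would use a sentence-embedding model here.
    """
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_match(prediction, keyword):
    """Return 1.0 if the keyword's tokens occur contiguously in the prediction, else 0.0."""
    p, k = tokenize(prediction), tokenize(keyword)
    if not k:
        return 0.0
    found = any(p[i:i + len(k)] == k for i in range(len(p) - len(k) + 1))
    return 1.0 if found else 0.0


def composite_score(prediction, reference_sentence, reference_keyword,
                    w_sentence=0.5, similarity=bow_cosine):
    """Hypothetical composite of sentence-level, keyword-level, and lexical scores.

    ASSUMPTION: the weighting and the averaging below are illustrative
    choices, not the combination rule used by SMILE.
    """
    sentence_sem = similarity(prediction, reference_sentence)
    keyword_sem = similarity(prediction, reference_keyword)
    lexical = keyword_match(prediction, reference_keyword)
    keyword_score = 0.5 * (keyword_sem + lexical)
    return w_sentence * sentence_sem + (1.0 - w_sentence) * keyword_score


if __name__ == "__main__":
    reference = "The cat is on the mat"
    print(composite_score("The cat is on the mat", reference, "mat"))
    print(composite_score("A dog runs outside", reference, "mat"))
```

Passing a different `similarity` callable is one way to swap the toy bag-of-words measure for an embedding-based one while keeping the lexical keyword check unchanged.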
