SAGE: A Realistic Benchmark for Semantic Understanding
This work addresses how AI researchers and practitioners evaluate semantic understanding. It offers a more realistic benchmark for real-world deployment, though the contribution is incremental because it builds on existing evaluation frameworks.
The authors respond to the need for more challenging evaluations of semantic understanding in language models by introducing SAGE, a benchmark that assesses embedding models and similarity metrics across five categories. The results reveal significant performance gaps and trade-offs. For example, OpenAI's text-embedding-3-large leads human preference alignment with a score of 0.682, yet on information sensitivity the classical Jaccard Similarity metric scores 0.905 against 0.794 for the best embedding model.
As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI's text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI's text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.
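To make the comparison between classical metrics and embedding-based similarity concrete, the following is a minimal sketch of the two kinds of score involved. It is not SAGE's implementation: the lowercase whitespace tokenization, the function names, and the treatment of empty inputs are all assumptions made for illustration, and the vectors passed to the cosine function stand in for whatever an embedding model would return.

```python
import math
from typing import Sequence


def jaccard_similarity(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |A & B| / |A | B|.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # assumed convention: two empty texts count as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector has no direction
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Two of three distinct tokens per text are shared, out of four overall.
    print(jaccard_similarity("the cat sat", "the cat ran"))
    # Hypothetical 2-d "embeddings" pointing in the same direction.
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Jaccard Similarity depends only on token overlap, so every inserted or deleted token changes the score directly. That property may help explain why such a metric can track information changes closely, while an embedding score compares dense vectors whose response to small edits is less predictable.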