CL · AI · LG · Feb 6

CORE: Comprehensive Ontological Relation Evaluation for Large Language Models

arXiv:2602.06446v1 · h-index: 5 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a critical gap in LLM evaluation and safety by identifying unrelatedness reasoning as an under-evaluated frontier. The advance is rated incremental because it builds on existing evaluation methods.

The paper evaluates whether large language models can distinguish meaningful semantic relations from genuine unrelatedness. State-of-the-art LLMs achieve 48.25-70.9% overall accuracy on the 203-question benchmark but only 0-41.35% on unrelated pairs, and accuracy falls to about 2% on the larger CORE 225K multiple-choice dataset.

Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen's Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the CORE 225K MCQs dataset, accuracy further drops to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
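The abstract reports Expected Calibration Error (ECE) separately for related and unrelated pairs. The following is a minimal sketch of how such a split could be computed, assuming the standard equal-width binned ECE with 10 bins; the abstract does not state the paper's binning scheme, and the function name and toy data are hypothetical illustrations, not the authors' code or results.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: the abstract does not give the bin count
) -> float:
    """Binned ECE: weighted mean of |avg confidence - accuracy| over bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical toy data: similar confidence on both subsets,
# but far lower accuracy on the unrelated pairs.
related_conf = [0.93, 0.94, 0.92, 0.93]
related_correct = [True, True, True, True]
unrelated_conf = [0.93, 0.92, 0.94, 0.93]
unrelated_correct = [False, True, False, False]

ece_related = expected_calibration_error(related_conf, related_correct)
ece_unrelated = expected_calibration_error(unrelated_conf, unrelated_correct)
print(f"ECE related:   {ece_related:.3f}")
print(f"ECE unrelated: {ece_unrelated:.3f}")
```

When confidence stays roughly constant while accuracy falls, the gap between the two grows, which is the pattern the abstract describes for unrelated pairs.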

