LG, Feb 21

From Human-Level AI Tales to AI Leveling Human Scales

arXiv:2602.18911v1
Originality: Incremental advance
AI Analysis

This work addresses the problem of incommensurate benchmarks, which affects AI researchers and developers who compare models against human performance. The advance is incremental because it builds on existing human test data and calibration methods.

The authors address misleading comparisons between AI models and "human level" by proposing a framework that calibrates AI performance on a common, human-anchored scale referenced to the whole world population. The result is a set of standardized scales for capabilities such as reasoning and comprehension.

Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the "world population" and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities, where each level should represent a probability of success of the whole world population on a logarithmic scale with base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, with the hypothesis that they condense rich information about human populations. We evaluate the quality of different mappings using group slicing and post-stratification. The new techniques allow for the recalibration and standardization of scales relative to the whole-world population.
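
To make the idea of a logarithmic, population-anchored level concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes, purely for illustration, that a level L corresponds to a whole-population success probability of B^(-L), and that post-stratification means reweighting per-group success rates by each group's share of the target population. The function names, the group labels, and all numbers are hypothetical.

```python
import math


def level_to_probability(level: float, base: float) -> float:
    """Assumed mapping (not confirmed by the source): p = base ** (-level)."""
    if base <= 1:
        raise ValueError("base must be greater than 1")
    return base ** (-level)


def probability_to_level(p: float, base: float) -> float:
    """Inverse of the assumed mapping: level = -log_base(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if base <= 1:
        raise ValueError("base must be greater than 1")
    return -math.log(p) / math.log(base)


def post_stratify(group_rates: dict, population_shares: dict) -> float:
    """Reweight per-group success rates by target-population shares.

    group_rates: success rate observed in each sampled group.
    population_shares: share of each group in the target population
    (normalized here, so the shares need not sum to 1).
    """
    total = sum(population_shares[g] for g in group_rates)
    weighted = sum(group_rates[g] * population_shares[g] for g in group_rates)
    return weighted / total


# Hypothetical example: a test sample over-represents group "a",
# which makes up only a quarter of the target population.
rates = {"a": 0.8, "b": 0.2}
shares = {"a": 0.25, "b": 0.75}
world_rate = post_stratify(rates, shares)
item_level = probability_to_level(world_rate, base=10.0)
print(f"reweighted success rate: {world_rate:.3f}, level: {item_level:.3f}")
```

Under this assumed mapping, an item that everyone solves sits at level 0, and each additional level divides the share of the population expected to succeed by B. The reweighting step shows why a baseline drawn from a narrow sample can overstate or understate "human level" relative to the whole population.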

