Benchmarking graph construction by large language models for coherence-driven inference
This work addresses the challenge of evaluating LLMs for coherence-driven inference, which could advance machine cognition. However, it appears incremental, since it focuses on benchmarking an existing method on new data.
The paper benchmarks the ability of large language models to reconstruct coherence graphs from natural language propositions. Reasoning-optimized models such as o1/3/4-mini achieve perfect reconstruction half of the time on sparse graphs.
We devise an algorithm to generate propositions that objectively instantiate graphs supporting coherence-driven inference. We also benchmark the ability of large language models (LLMs) to reconstruct coherence graphs from (a simple transformation of) propositions expressed in natural language, with promising results from a single prompt to reasoning-optimized LLMs. For example, o1/3/4-mini achieve perfect reconstruction half of the time on sparse graphs. Coherence-driven inference on consistency evaluations by LLMs may advance machine cognition capabilities.
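To make the terms concrete, the following is a minimal sketch rather than the paper's algorithm or benchmark metric, neither of which is specified here. It assumes a coherence graph is a signed graph whose edges mark pairs of propositions as consistent (+1) or inconsistent (-1). It also assumes coherence-driven inference means splitting the propositions into accepted and rejected sets so that as many constraints as possible are satisfied. The function names `coherence`, `best_partition`, and `is_perfect_reconstruction` are hypothetical, and the brute-force search is only practical for very small graphs.

```python
from itertools import product


def coherence(edges, assignment):
    """Count satisfied constraints (assumed formulation, not the paper's).

    edges: dict mapping frozenset({u, v}) -> +1 (consistent) or -1 (inconsistent).
    assignment: dict mapping proposition -> True (accepted) or False (rejected).
    A positive edge is satisfied when both ends are on the same side;
    a negative edge is satisfied when the ends are on opposite sides.
    """
    total = 0
    for pair, sign in edges.items():
        u, v = tuple(pair)
        same_side = assignment[u] == assignment[v]
        if (sign > 0 and same_side) or (sign < 0 and not same_side):
            total += 1
    return total


def best_partition(vertices, edges):
    """Exhaustively search all accept/reject assignments (exponential time)."""
    best_score, best_assignment = -1, None
    for bits in product([True, False], repeat=len(vertices)):
        assignment = dict(zip(vertices, bits))
        score = coherence(edges, assignment)
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_score, best_assignment


def is_perfect_reconstruction(true_edges, predicted_edges):
    """Hypothetical check: same edge set with the same signs."""
    return true_edges == predicted_edges


# Toy ground-truth graph: a and b agree, and both conflict with c.
vertices = ["a", "b", "c"]
true_edges = {
    frozenset({"a", "b"}): +1,
    frozenset({"b", "c"}): -1,
    frozenset({"a", "c"}): -1,
}
score, assignment = best_partition(vertices, true_edges)

# A hypothetical LLM output that gets one edge sign wrong.
predicted_edges = dict(true_edges)
predicted_edges[frozenset({"a", "c"})] = +1
```

In this toy case all three constraints can be satisfied by accepting `a` and `b` and rejecting `c`, while the predicted graph with one flipped sign does not count as a perfect reconstruction.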