AICLLONov 14, 2025

Do LLMs Really Struggle at NL-FOL Translation? Revealing their Strengths via a Novel Benchmarking Strategy

arXiv:2511.11816v11 citationsh-index: 10
Originality Incremental advance
AI Analysis

This addresses the challenge of accurately assessing LLMs' logical reasoning for applications like system verification, though it is incremental as it focuses on evaluation methodology rather than a new translation method.

The paper tackles the problem of evaluating Large Language Models (LLMs) for translating natural language to First-Order Logic (NL-FOL translation), revealing that existing benchmarks misrepresent their capabilities, and shows that state-of-the-art LLMs achieve strong performance with genuine logical understanding, while embedding-centric models perform worse.

Due to its expressiveness and unambiguous nature, First-Order Logic (FOL) is a powerful formalism for representing concepts expressed in natural language (NL). This is useful, e.g., for specifying and verifying desired system properties. While translating FOL into human-readable English is relatively straightforward, the inverse problem, converting NL to FOL (NL-FOL translation), has remained a longstanding challenge, for both humans and machines. Although the emergence of Large Language Models (LLMs) promised a breakthrough, recent literature provides contrasting results on their ability to perform NL-FOL translation. In this work, we provide a threefold contribution. First, we critically examine existing datasets and protocols for evaluating NL-FOL translation performance, revealing key limitations that may cause a misrepresentation of LLMs' actual capabilities. Second, to overcome these shortcomings, we propose a novel evaluation protocol explicitly designed to distinguish genuine semantic-level logical understanding from superficial pattern recognition, memorization, and dataset contamination. Third, using this new approach, we show that state-of-the-art, dialogue-oriented LLMs demonstrate strong NL-FOL translation skills and a genuine grasp of sentence-level logic, whereas embedding-centric models perform markedly worse.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes