CL · Jun 5, 2025

Are LLMs Stable Formal Logic Translators in Logical Reasoning Across Linguistically Diversified Texts?

arXiv:2506.04575v2 · 1 citation · h-index: 17 · Has Code
AI Analysis

This paper addresses a key weakness of LLM-based logical reasoning on real-world text with varied wording, offering an incremental improvement through a new benchmark and method.

The paper tackles the problem of LLM-based translators generating inconsistent symbolic representations for the same concept across different linguistic forms, which breaks logical coherence and reduces reasoning accuracy. It introduces the SoLT benchmark to evaluate this issue and proposes MenTaL, a method that improves symbol consistency and yields stable accuracy gains on SoLT.

Logical reasoning with large language models (LLMs) has received growing attention. One mainstream approach translates natural language into formal logic and then applies symbolic solvers for deduction. While effective in many tasks, these LLM-based translators often fail to generate consistent symbolic representations when the same concept appears in different linguistic forms. Such inconsistencies break logical coherence and lead to solver errors. However, most existing benchmarks lack this type of linguistic variation, which frequently occurs in real-world text, leaving the problem underexplored. To address this gap, we present SoLT, a benchmark that systematically rewrites reasoning datasets into diverse yet logically equivalent forms across multiple levels. Beyond evaluation, SoLT also provides a general method to enrich any dataset with linguistic diversity while preserving both meaning and logic. To further enhance the stability of LLM-based reasoning, we propose MenTaL, which explicitly guides models to build a concept-symbol mapping table during translation. By linking equivalent expressions to shared symbols, MenTaL maintains consistency and mitigates symbol drift. Experiments on SoLT demonstrate that LLMs indeed suffer from inconsistent symbol mapping under linguistic variation, leading to significant drops in reasoning accuracy. Meanwhile, applying MenTaL brings clear and stable performance improvements across diverse inputs. Overall, our findings reveal that overlooking linguistic diversity hides key weaknesses in LLM-based translators, and our work offers a step toward more reliable logical reasoning in varied real-world scenarios. Our code is available at https://github.com/wufeiwuwoshihua/LinguDiver.
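To make the idea of a concept-symbol mapping table concrete, here is a minimal sketch in Python. It is not the authors' implementation: MenTaL guides an LLM to build the table during translation, whereas this toy version uses a hand-written alias list as a stand-in. The names `ConceptSymbolTable`, `naive_symbol`, and `translate` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptSymbolTable:
    """Toy mapping table: equivalent expressions share one predicate symbol."""
    symbols: dict = field(default_factory=dict)  # canonical concept -> symbol
    aliases: dict = field(default_factory=dict)  # surface form -> canonical concept

    def register(self, concept, surface_forms=()):
        """Record a concept and the expressions treated as equivalent to it."""
        key = concept.lower()
        self.symbols.setdefault(key, naive_symbol(key))
        for form in surface_forms:
            self.aliases[form.lower()] = key
        return self.symbols[key]

    def symbol_for(self, expression):
        """Return the shared symbol; a new one is created only for unseen concepts."""
        key = expression.lower()
        return self.register(self.aliases.get(key, key))


def naive_symbol(expression):
    """Derive a symbol directly from the wording, e.g. 'in good spirits' -> 'InGoodSpirits'."""
    return expression.title().replace(" ", "")


def translate(facts, table=None):
    """Turn (subject, expression) pairs into atoms such as Happy(alice)."""
    atoms = []
    for subject, expression in facts:
        if table is not None:
            symbol = table.symbol_for(expression)
        else:
            symbol = naive_symbol(expression)
        atoms.append(f"{symbol}({subject.lower()})")
    return atoms


# Two wordings of the same concept (assumed equivalent for this example).
facts = [("Alice", "happy"), ("Alice", "joyful")]

# Without a table, each wording gets its own symbol (symbol drift).
drifted = translate(facts)

# With a table, both wordings map to the shared symbol.
table = ConceptSymbolTable()
table.register("happy", ["joyful", "in good spirits"])
consistent = translate(facts, table)
```

In this sketch `drifted` contains two different predicates for one concept, so a solver would treat them as unrelated, while `consistent` reuses a single predicate. The alias list is the part that, per the abstract, MenTaL has the model construct explicitly while translating.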

Code Implementations: 1 repo
