CL · Oct 13, 2025

A Theorem-Proving-Based Evaluation of Neural Semantic Parsing

Hayate Funakura, Hyunsoo Kim, Koji Mineshima

arXiv:2510.11225v1 · 1 citation · Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Originality: Incremental advance

AI Analysis

This work addresses an evaluation gap that matters to researchers and practitioners in natural language processing, particularly for reasoning-oriented applications. It is incremental in that it builds on existing methods such as automated theorem proving.

The paper examines how neural semantic parsers are evaluated and shows that graph-matching metrics such as Smatch often fail to capture logical equivalence. Models that perform well on these metrics frequently produce logically inadequate formulas, and normalizing the targets improves logical adequacy by up to 20% in some settings.

Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All code and data for our experiments are publicly available.
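
The bidirectional-entailment criterion described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: in place of a first-order theorem prover it uses a bounded finite-model search (all interpretations over domains of size 1 and 2), which can refute an entailment but cannot prove one in general. The tuple encoding of formulas and the names `entails_bounded`, `equivalent_bounded`, `gold`, `pred_a`, and `pred_b` are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical formula encoding (not from the paper): nested tuples such as
# ("exists", "x", ("and", ("pred", "dog", "x"), ("pred", "bark", "x"))).
# Predicate arguments are variables only; constants are not supported.

def predicates(f, acc=None):
    """Collect (name, arity) for every predicate symbol in formula f."""
    acc = set() if acc is None else acc
    op = f[0]
    if op == "pred":
        acc.add((f[1], len(f) - 2))
    elif op in ("forall", "exists"):
        predicates(f[2], acc)
    else:
        for sub in f[1:]:
            predicates(sub, acc)
    return acc

def holds(f, domain, interp, env):
    """Evaluate formula f in one finite model under variable assignment env."""
    op = f[0]
    if op == "pred":
        return tuple(env[v] for v in f[2:]) in interp[(f[1], len(f) - 2)]
    if op == "not":
        return not holds(f[1], domain, interp, env)
    if op == "and":
        return holds(f[1], domain, interp, env) and holds(f[2], domain, interp, env)
    if op == "or":
        return holds(f[1], domain, interp, env) or holds(f[2], domain, interp, env)
    if op == "imp":
        return (not holds(f[1], domain, interp, env)) or holds(f[2], domain, interp, env)
    if op in ("exists", "forall"):
        results = (holds(f[2], domain, interp, {**env, f[1]: d}) for d in domain)
        return any(results) if op == "exists" else all(results)
    raise ValueError(f"unknown operator: {op!r}")

def interpretations(signature, domain):
    """Yield every assignment of extensions to the predicate symbols."""
    signature = sorted(signature)
    spaces = []
    for _name, arity in signature:
        tuples = list(product(domain, repeat=arity))
        # Each subset of the possible argument tuples is one extension.
        spaces.append([
            frozenset(t for t, keep in zip(tuples, bits) if keep)
            for bits in product([False, True], repeat=len(tuples))
        ])
    for choice in product(*spaces):
        yield dict(zip(signature, choice))

def entails_bounded(premise, conclusion, max_size=2):
    """Return False if a countermodel with at most max_size elements exists."""
    signature = predicates(premise) | predicates(conclusion)
    for n in range(1, max_size + 1):
        domain = range(n)
        for interp in interpretations(signature, domain):
            if holds(premise, domain, interp, {}) and not holds(conclusion, domain, interp, {}):
                return False
    return True  # no small countermodel found; this is not a proof

def equivalent_bounded(gold, predicted, max_size=2):
    """Bidirectional check: gold entails predicted and predicted entails gold."""
    return (entails_bounded(gold, predicted, max_size)
            and entails_bounded(predicted, gold, max_size))

# Illustrative formulas for a sentence like "A dog barks".
gold = ("exists", "x", ("and", ("pred", "dog", "x"), ("pred", "bark", "x")))
# Different variable name and conjunct order, same meaning.
pred_a = ("exists", "y", ("and", ("pred", "bark", "y"), ("pred", "dog", "y")))
# Variable-binding error: the two predicates no longer share a variable.
pred_b = ("and", ("exists", "x", ("pred", "dog", "x")),
                 ("exists", "y", ("pred", "bark", "y")))

print(equivalent_bounded(gold, pred_a))
print(equivalent_bounded(gold, pred_b))
```

In this sketch `pred_a` differs from `gold` on the surface yet passes the check in both directions, while `pred_b` reuses exactly the same predicate symbols but fails because it is entailed by `gold` without entailing it. That asymmetry is the kind of binding error a surface-overlap score can miss; a real evaluation would hand both directions to a first-order prover rather than rely on a bounded search.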
