AI CL MA Feb 17, 2025

Integrating Expert Knowledge into Logical Programs via LLMs

Franciszek Górski, Oskar Wysocki, Marco Valentino, Andre Freitas

arXiv:2502.12275v2 9.63 citations h-index: 14 Has Code

Originality: Incremental advance

AI Analysis

ExKLoP provides a robust evaluation platform for selecting models for self-correcting systems, particularly in engineering domains, though as a benchmark of LLM performance it is an incremental advance.

The paper evaluates how effectively Large Language Models (LLMs) integrate expert knowledge into logical reasoning systems, for example in engineering validation tasks. It finds that most models generate almost perfectly syntactically correct code but vary in logical correctness and in their capacity for self-improvement.

This paper introduces ExKLoP, a novel framework designed to evaluate how effectively Large Language Models (LLMs) integrate expert knowledge into logical reasoning systems. This capability is especially valuable in engineering, where expert knowledge, such as manufacturer-recommended operational ranges, can be directly embedded into automated monitoring systems. By mirroring expert verification steps, tasks like range checking and constraint validation help ensure system safety and reliability. Our approach systematically evaluates LLM-generated logical rules, assessing both syntactic fluency and logical correctness in these critical validation tasks. We also explore the models' capacity for self-correction via an iterative feedback loop based on code execution outcomes. ExKLoP presents an extensible dataset comprising 130 engineering premises, 950 prompts, and corresponding validation points. It enables comprehensive benchmarking while allowing control over task complexity and scalability of experiments. We leverage a synthetic data creation methodology to conduct an extensive empirical evaluation on a diverse set of LLMs including Llama3, Gemma3, Codestral and QwenCoder. The results reveal that most models generate syntactically near-perfect code and perform strongly at translating expert knowledge into correct code. Their ability to implement logical rules correctly nonetheless varies, as does their capacity for self-improvement. Overall, ExKLoP serves as a robust evaluation platform that streamlines the selection of effective models for self-correcting systems while clearly delineating the types of errors encountered.
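
To make the evaluation idea in the abstract concrete, the following is a minimal Python sketch of a range-checking rule, a check that separates syntactic from logical correctness, and a feedback loop driven by execution outcomes. It is not the ExKLoP implementation: the premise (coolant temperature between 60 and 95 degrees C), the names `evaluate_rule`, `self_correct`, `stub_generate`, and `check`, and the stub that stands in for an LLM call are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Reading = Dict[str, float]

# Hypothetical validation points for an illustrative premise:
# "Coolant temperature must stay between 60 and 95 degrees C (inclusive)."
VALIDATION_POINTS: List[Tuple[Reading, bool]] = [
    ({"coolant_temp_c": 75.0}, True),
    ({"coolant_temp_c": 60.0}, True),    # boundary value, still valid
    ({"coolant_temp_c": 59.9}, False),
    ({"coolant_temp_c": 120.0}, False),
]

# Two candidate rules such as a model might emit (both hypothetical).
BUGGY = "def check(r):\n    return 60.0 < r['coolant_temp_c'] < 95.0\n"
FIXED = "def check(r):\n    return 60.0 <= r['coolant_temp_c'] <= 95.0\n"


def evaluate_rule(source: str, func_name: str,
                  points: List[Tuple[Reading, bool]]) -> Tuple[bool, bool, List[str]]:
    """Return (syntax_ok, logic_ok, feedback) for one candidate rule."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError as err:
        return False, False, [f"SyntaxError: {err}"]

    namespace: dict = {}
    try:
        exec(code, namespace)  # only run untrusted code inside a sandbox
    except Exception as err:
        return True, False, [f"Error while defining rule: {err!r}"]

    rule = namespace.get(func_name)
    if not callable(rule):
        return True, False, [f"Function {func_name!r} is not defined"]

    feedback: List[str] = []
    for reading, expected in points:
        try:
            actual = rule(reading)
        except Exception as err:
            feedback.append(f"{reading}: raised {err!r}")
            continue
        if actual != expected:
            feedback.append(f"{reading}: expected {expected}, got {actual}")
    return True, not feedback, feedback


def self_correct(generate: Callable[[List[str]], str], func_name: str,
                 points: List[Tuple[Reading, bool]],
                 max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Regenerate the rule until it passes every point or rounds run out."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        source = generate(feedback)
        syntax_ok, logic_ok, feedback = evaluate_rule(source, func_name, points)
        if syntax_ok and logic_ok:
            return source, round_no
    return None, max_rounds


def stub_generate(feedback: List[str]) -> str:
    """Stand-in for an LLM call: returns the fixed rule once it sees feedback."""
    return FIXED if feedback else BUGGY


if __name__ == "__main__":
    print(self_correct(stub_generate, "check", VALIDATION_POINTS))
```

In this sketch the first candidate parses cleanly but rejects the boundary value 60.0, so it counts as syntactically correct and logically wrong; the mismatch message is then passed back to the generator for the next round. A real setup would replace `stub_generate` with a model call and execute candidates in isolation.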

