CV, AI, MA
Nov 13, 2025

Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models

arXiv:2511.11751v1
h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses interpretability and hallucination issues in vision-language models, particularly for medical imaging and underrepresented datasets, though it is incremental because it builds on existing neurosymbolic frameworks.

The paper tackles the lack of interpretability and the hallucination of facts in vision-language models by introducing Concept-RuleNet, a multi-agent neurosymbolic system that grounds symbols in visual data and reasons with first-order rules. It reports an average 5% improvement over neurosymbolic baselines and up to 50% fewer hallucinated symbols in rules.

Modern vision-language models (VLMs) deliver impressive predictive accuracy yet offer little insight into 'why' a decision is reached, and they frequently hallucinate facts, particularly when encountering out-of-distribution data. Neurosymbolic frameworks address this by pairing black-box perception with interpretable symbolic reasoning, but current methods extract their symbols solely from task labels, leaving them weakly grounded in the underlying visual data. In this paper, we introduce Concept-RuleNet, a multi-agent system that reinstates visual grounding while retaining transparent reasoning. Specifically, a multimodal concept generator first mines discriminative visual concepts directly from a representative subset of training images. Next, these visual concepts are used to condition symbol discovery, anchoring the generated symbols in real image statistics and mitigating label bias. Subsequently, a large language model reasoner agent composes the symbols into executable first-order rules, yielding interpretable neurosymbolic rules. Finally, during inference, a vision verifier agent quantifies the degree of presence of each symbol and triggers rule execution in tandem with the outputs of black-box neural models, producing predictions with explicit reasoning pathways. Experiments on five benchmarks, including two challenging medical-imaging tasks and three underrepresented natural-image datasets, show that our system improves on state-of-the-art neurosymbolic baselines by an average of 5% while also reducing the occurrence of hallucinated symbols in rules by up to 50%.
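
To make the inference step in the abstract more concrete, here is a minimal sketch of how verifier scores could trigger rules and be blended with a black-box model's output. It is not the paper's implementation: the `Rule` class, the `rule_score` and `predict` functions, the use of `min` as a fuzzy AND, the `weight` blending parameter, and the chest X-ray symbols and numbers are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical conjunctive rule: if all symbols are present, predict `label`."""
    symbols: Tuple[str, ...]
    label: str


def rule_score(rule: Rule, presence: Dict[str, float]) -> float:
    """Fuzzy AND (assumed here to be `min`) over verifier scores in [0, 1].

    Symbols the verifier did not score are treated as absent (0.0).
    """
    return min(presence.get(symbol, 0.0) for symbol in rule.symbols)


def predict(
    rules: List[Rule],
    presence: Dict[str, float],
    neural_probs: Dict[str, float],
    weight: float = 0.5,
) -> Tuple[str, float, Optional[Rule]]:
    """Blend the strongest rule per label with the black-box probability.

    Returns the winning label, its blended score, and the rule that
    supports it (the explicit reasoning pathway), if any.
    """
    labels = set(neural_probs) | {rule.label for rule in rules}
    blended: Dict[str, float] = {}
    support: Dict[str, Optional[Rule]] = {}
    for label in labels:
        scored = [(rule_score(r, presence), r) for r in rules if r.label == label]
        best_score, best_rule = max(scored, key=lambda pair: pair[0], default=(0.0, None))
        blended[label] = weight * best_score + (1 - weight) * neural_probs.get(label, 0.0)
        support[label] = best_rule
    winner = max(blended, key=blended.get)
    return winner, blended[winner], support[winner]


# Illustrative inputs only; symbol names and scores are made up.
rules = [
    Rule(("opacity", "consolidation"), "pneumonia"),
    Rule(("clear_lung_fields",), "normal"),
]
presence = {"opacity": 0.9, "consolidation": 0.7, "clear_lung_fields": 0.2}
neural_probs = {"pneumonia": 0.6, "normal": 0.4}

winner, score, rule = predict(rules, presence, neural_probs)
print(winner, round(score, 2), rule.symbols if rule else None)
```

In this sketch the returned rule doubles as the explanation: a reader can see which grounded symbols fired and how strongly, alongside the blended score. A real system would likely use a richer rule language and a learned or calibrated combination rather than a fixed weight.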

