20.1 · AI · May 9 · Code
Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment
Fabio Rovai
We present Open Ontologies, an open-source ontology engineering system implemented in Rust that integrates LLM-driven construction with formal OWL reasoning and ontology alignment via the Model Context Protocol. Our primary finding is that stable 1-to-1 matching is the dominant factor in ontology alignment quality: on the OAEI Anatomy track, it achieves F1 = 0.832 (P = 0.963, R = 0.733), competitive with state-of-the-art systems and exceeding all in precision. Ablation across five weight configurations shows that signal weights are irrelevant when stable matching is applied (F1 varies by less than 0.004), while removing stable matching drops F1 to 0.728. On the Conference track, the same method achieves F1 = 0.438. On tool-augmented ontology interaction, we find a surprising result: an LLM reading a raw OWL file (F1 = 0.323) performs worse than the same LLM with no file at all (F1 = 0.431), while structured MCP tool access achieves F1 = 0.717. This demonstrates that tool structure provides a qualitatively different mode of access that the LLM cannot replicate by reading raw syntax. The system ships as a single binary under the MIT licence.
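To make the stable 1-to-1 matching step concrete, here is a minimal Python sketch of Gale-Shapley matching over a similarity matrix. It is not the system's Rust implementation: the function names, the `threshold` parameter, and the toy similarity scores are all assumptions for illustration. The `f1` helper only re-derives the reported Anatomy F1 from the stated precision and recall.

```python
def stable_match(sim, threshold=0.0):
    """Gale-Shapley over a similarity matrix; source rows propose to target columns.

    sim[i][j] is a hypothetical similarity between source entity i and
    target entity j. Returns a sorted list of (source, target) pairs.
    """
    n_src = len(sim)
    n_tgt = len(sim[0]) if sim else 0
    # Each source ranks all targets by descending similarity.
    prefs = [sorted(range(n_tgt), key=lambda j: -sim[i][j]) for i in range(n_src)]
    next_choice = [0] * n_src  # position in prefs[i] of the next proposal
    holder = {}                # target -> source currently matched to it
    free = list(range(n_src))
    while free:
        i = free.pop()
        while next_choice[i] < n_tgt:
            j = prefs[i][next_choice[i]]
            next_choice[i] += 1
            if sim[i][j] < threshold:
                break  # all remaining candidates are weaker; i stays unmatched
            rival = holder.get(j)
            if rival is None:
                holder[j] = i
                break
            if sim[i][j] > sim[rival][j]:
                holder[j] = i       # target j prefers i; rival becomes free again
                free.append(rival)
                break
    return sorted((i, j) for j, i in holder.items())


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy scores (made up): both sources like target 0 best.
sim = [[0.90, 0.80],
       [0.85, 0.10]]
matches = stable_match(sim)
filtered = stable_match(sim, threshold=0.5)
```

In this toy case an independent per-row argmax would map both sources to target 0, whereas the stable matching forces a 1-to-1 assignment. That is one plausible reading of why the matching step favours precision.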
20.6 · AI · May 15
Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
Fabio Rovai
We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.
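The idea of an append-only log that answers both explanatory and counterfactual queries through the same causal-ancestor traversal can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's runtime: the `Event` and `EventLog` names, the explicit `causes` field, and the single "remove one event" intervention are all hypothetical stand-ins for the structured intervention vocabulary the abstract mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    id: int
    triple: tuple        # (subject, predicate, object)
    causes: tuple = ()   # ids of events that directly caused this one


class EventLog:
    """Append-only log of typed triples with explicit causal links (toy model)."""

    def __init__(self):
        self._events = []

    def append(self, triple, causes=()):
        ev = Event(len(self._events), tuple(triple), tuple(causes))
        self._events.append(ev)
        return ev.id

    def ancestors(self, event_id):
        """Explanatory query: ids of all events causally upstream of event_id."""
        seen, stack = set(), list(self._events[event_id].causes)
        while stack:
            e = stack.pop()
            if e not in seen:
                seen.add(e)
                stack.extend(self._events[e].causes)
        return seen

    def fork_without(self, removed_id):
        """Counterfactual query: triples that survive if removed_id never happened.

        Reuses the same ancestor traversal: an event is dropped when the
        removed event is the event itself or one of its causal ancestors.
        """
        return [ev.triple for ev in self._events
                if ev.id != removed_id and removed_id not in self.ancestors(ev.id)]


log = EventLog()
a = log.append(("ball_a", "moves", "right"))
b = log.append(("ball_b", "moves", "left"))
c = log.append(("ball_a", "collides_with", "ball_b"), causes=(a, b))
d = log.append(("ball_b", "exits", "scene"), causes=(c,))

why_exit = log.ancestors(d)        # explanation of the exit
no_ball_a = log.fork_without(a)    # world in which ball_a never moved
```

The original log is never mutated; the fork is computed as a filtered view, which is one simple way to keep counterfactuals exact and inspectable.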
2.1 · AI · May 14
Saturating Scaling Laws for Equational Discovery: A Phenomenology of Growth Dynamics in Three Toy Substrates with Two Real-World Replications
Fabio Rovai
We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher-order list; n=592 trajectories), short-range substrate sizes fit a power-law N(t) proportional to t^b. Within each substrate b is architecture-sensitive (cross-validated R^2 approximately 0.82); the regression does not transfer across substrates (arith+bool to list yields R^2 approximately -0.84). A heuristic mean-field closure model predicts a saturating power-law dN/dt = K N^k exp(-mu N) of which the pure power-law is the short-range approximation. Three robustness checks: bootstrap intervals on (k, mu) are tight in 4/5 toy trajectories and degenerate in 1/5; out-of-sample forecasting on toy data (fit first 100 epochs, predict next 400) is won by pure power-law 5/5, indicating the toy trajectories do not reach saturation; on two real-world growth proxies the result splits. New Mathlib/*.lean file additions per month (mathlib4, 60 months, 9701 files) support the saturating form on OOS forecasting by approximately 7x over pure power-law; Coq mathcomp monthly commits (129 months, 3083 commits) favour pure power-law on both tests with mu collapsing to zero. The dynamics are substrate-conditional at two levels: within-substrate architecture-to-b regressions do not transfer, and the preferred functional family for N(t) itself (pure vs. saturating power-law) differs by substrate. We propose "saturating power-law growth with substrate-conditional (k, mu), observable when the substrate has reached its saturation regime" as a working framing.
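The relationship between the saturating form dN/dt = K N^k exp(-mu N) and its pure power-law limit can be checked numerically. The sketch below integrates the ODE with forward Euler and compares it with the closed form for mu = 0 and k < 1, which follows by separating variables: N(t) = (N0^(1-k) + (1-k) K t)^(1/(1-k)), so the long-run exponent is b = 1/(1-k). The parameter values, the integrator, and the function names are illustrative assumptions, not the paper's fitting procedure.

```python
import math


def integrate(K, k, mu, n0=1.0, t_end=100.0, dt=0.01):
    """Forward-Euler integration of dN/dt = K * N**k * exp(-mu * N)."""
    n = n0
    for _ in range(int(round(t_end / dt))):
        n += dt * K * n ** k * math.exp(-mu * n)
    return n


def pure_power_law(K, k, t, n0=1.0):
    """Closed-form solution for mu = 0 and k < 1."""
    return (n0 ** (1 - k) + (1 - k) * K * t) ** (1 / (1 - k))


# Illustrative parameters (assumed, not taken from the paper).
K, k = 1.0, 0.5
numeric_pure = integrate(K, k, mu=0.0)
closed_pure = pure_power_law(K, k, t=100.0)
numeric_saturating = integrate(K, k, mu=0.01)
```

With mu = 0 the numeric and closed-form values agree closely, while any mu > 0 damps growth as N increases. This is consistent with the abstract's point that trajectories which have not yet reached the saturation regime are hard to distinguish from a pure power law.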
31.2 · AI · May 9
CIVeX: Causal Intervention Verification for Language Agents
Fabio Rovai
A valid tool call is not necessarily a valid intervention. Tool-using language agents are guarded by schema validators, policy filters, provenance checks, state predictors, and self-verification, yet such safeguards do not certify that a state-changing action has an identifiable causal effect. In confounded workflows, the action that looks optimal in observational logs can reduce utility when executed. We introduce CIVeX, a causal intervention verifier that maps proposed actions to structural causal queries over a committed action-state graph, checks identifiability, and returns one of four auditable verdicts: EXECUTE, REJECT, EXPERIMENT, or ABSTAIN. Execution requires an assumption-scoped causal certificate carrying graph commitments, an identification argument, a one-sided lower confidence bound (LCB), provenance, and risk limits. On Causal-ToolBench (1,890 instances, 7 seeds), CIVeX yields zero observed false executions across moderate and adversarial confounding. Under adversarial confounding it reaches 84.9% accuracy and 81.1% of oracle utility (+2.23 vs +2.76) and is the only non-oracle method whose constrained utility under a zero-false-execution constraint exceeds the AlwaysAbstain floor. On IHDP and ZOZO Open Bandit (real production logs with uniform-random ground truth), CIVeX matches Oracle correct-execution within 0.1pp and cuts per-execute false-execution by >=50x over naive baselines. A chain-of-thought LLM verifier (Claude Opus, Sonnet) cuts false-execution by an order of magnitude over a terse baseline, yet under adversarial confounding Opus's utility falls to 74% of CIVeX's. Intervention identifiability, not action validity, is the missing primitive for reliable tool use.
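One way to picture a verifier that returns the four verdicts is a small decision function gated on identifiability and a one-sided lower confidence bound. The sketch below is hypothetical: the abstract names the verdicts and the LCB but does not give the decision rule, so the mapping used here (unidentifiable effects go to EXPERIMENT or ABSTAIN, an LCB above zero permits EXECUTE, an upper bound below zero gives REJECT) and the normal-approximation bound are assumptions, as are all function and parameter names.

```python
import math
from enum import Enum


class Verdict(Enum):
    EXECUTE = "EXECUTE"
    REJECT = "REJECT"
    EXPERIMENT = "EXPERIMENT"
    ABSTAIN = "ABSTAIN"


def one_sided_bounds(effects, z=1.645):
    """Normal-approximation lower and upper bounds on the mean effect.

    z = 1.645 corresponds to a one-sided 95% bound (an assumed risk level).
    """
    n = len(effects)
    mean = sum(effects) / n
    var = sum((x - mean) ** 2 for x in effects) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width


def verify(identifiable, effects, can_experiment, z=1.645):
    """Assumed decision rule; not the rule specified by CIVeX."""
    if not identifiable:
        # No identification argument: gather interventional data if possible.
        return Verdict.EXPERIMENT if can_experiment else Verdict.ABSTAIN
    if len(effects) < 2:
        return Verdict.ABSTAIN  # too little data to bound the effect
    lcb, ucb = one_sided_bounds(effects, z)
    if lcb > 0:
        return Verdict.EXECUTE  # effect is positive even at the lower bound
    if ucb < 0:
        return Verdict.REJECT   # effect is negative even at the upper bound
    return Verdict.ABSTAIN


# Made-up effect estimates for illustration only.
helpful = verify(True, [1.0, 1.2, 0.9, 1.1], can_experiment=False)
harmful = verify(True, [-1.0, -1.2, -0.9, -1.1], can_experiment=False)
confounded = verify(False, [1.0, 1.2, 0.9, 1.1], can_experiment=True)
unclear = verify(True, [1.0, -1.0, 0.5, -0.5], can_experiment=False)
```

The point of the sketch is the ordering of checks: identifiability is tested before any effect estimate is consulted, so observational data that looks favourable cannot by itself produce an EXECUTE verdict.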