AIOct 8, 2025

Verifying Memoryless Sequential Decision-making of Large Language Models

arXiv:2510.06756v11 citationsh-index: 4Has Code
Originality Incremental advance
AI Analysis

This work addresses the safety verification problem for LLM policies in sequential decision-making, providing a practical tool for researchers and practitioners, though it is incremental as it builds on existing MDP and verification methods.

The paper tackles the problem of verifying safety properties of large language model (LLM)-based policies in memoryless sequential decision-making tasks by introducing an automated tool that constructs reachable portions of a Markov decision process (MDP) from LLM actions and checks them with Storm, showing in experiments on grid world benchmarks that open source LLMs can be verified with deterministic seeding but generally underperform deep reinforcement learning baselines.

We introduce a tool for rigorous and automated verification of large language model (LLM)- based policies in memoryless sequential decision-making tasks. Given a Markov decision process (MDP) representing the sequential decision-making task, an LLM policy, and a safety requirement expressed as a PCTL formula, our approach incrementally constructs only the reachable portion of the MDP guided by the LLM's chosen actions. Each state is encoded as a natural language prompt, the LLM's response is parsed into an action, and reachable successor states by the policy are expanded. The resulting formal model is checked with Storm to determine whether the policy satisfies the specified safety property. In experiments on standard grid world benchmarks, we show that open source LLMs accessed via Ollama can be verified when deterministically seeded, but generally underperform deep reinforcement learning baselines. Our tool natively integrates with Ollama and supports PRISM-specified tasks, enabling continuous benchmarking in user-specified sequential decision-making tasks and laying a practical foundation for formally verifying increasingly capable LLMs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes