CLAIPLJan 26

LLMs versus the Halting Problem: Revisiting Program Termination Prediction

arXiv:2601.18987v14 citations
Originality Incremental advance
AI Analysis

This work addresses the undecidable halting problem for programmers and verification tool developers, showing incremental progress by applying LLMs to a known bottleneck in software verification.

The study evaluated large language models (LLMs) on predicting program termination for C programs from SV-Comp 2025, finding that models like GPT-5 and Claude Sonnet-4.5 performed close to top-ranked tools, but struggled with providing proofs and longer programs.

Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLMs performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes