Lukas Stepanek

2papers

2 Papers

20.4DCMay 31
Fail-Closed Lowering of Resident KV Claims onto LLM Serving Runtimes

Lukas Stepanek

LLM serving runtimes increasingly expose KV-cache primitives that resemble future-reuse controls: retention priority, TTL-like duration, host or storage offload, block events, active no-evict scheduling, and KV-aware routing. This paper argues that such primitives are weaker than accepted future-KV obligations. A runtime can expose priority, offload, events, and routing without accepting responsibility for a future reuse claim. We study ResidentClaim lowering: when a runtime primitive, trusted adapter, or patch can be treated as satisfying an accepted claim about future KV reuse. A conformant lowering must bind behavior to accepted claim identity, a materialization predicate, ordered lifecycle events, and claim-scoped outcomes. We contribute a fail-closed lowering relation, checker, descriptor format, and bad-lowering suite that classify runtime/mode mappings as native conformance, adapter-observational evidence, adapter-policy evidence under controlled pressure, approximation substrate, rejected mapping, or unknown evidence. The checker validates manually curated, anchored runtime descriptors against obligation bundles; it does not prove that unaudited runtime behavior is complete. Public TensorRT-LLM, SGLang/HiCache, and Dynamo expose strong substrates and selected adapter positives, but not native ResidentClaim conformance. The positive systems witness is a local patched vLLM connector/scheduler-boundary mechanism: claim metadata flows through real in-process offload/load behavior, and controlled same-claim restoration failure reaches vLLM's invalid-KV-load path and becomes an ordered claim-scoped fail-closed outcome. The result is a calibrated semantics boundary, not a production performance claim or a compatibility survey.

42.2DCMay 22
Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure

Lukas Stepanek

KV-cache reuse mechanisms increasingly expose priority, duration, offload, routing hints, scheduler modes, and event streams. These mechanisms help preserve reusable prefixes, but they do not by themselves define a portable contract for accepted future-reuse state when resident KV and active live KV cannot both fit. We introduce resident KV claims, a conformance contract that binds future-reuse intent to a materialization predicate, lifecycle state, active/resident feasibility outcome, and claim-level telemetry. In controlled vLLM allocator probes, a 60-block resident claim and a 70-block active prefill exceed an 80-block usable KV pool. Write no-admit prevents the active request from becoming future reusable state, but it still allows active allocation to evict residents from the shared pool. A minimal vLLM prototype shows that hard protected resident claims convert this failure mode into scheduler-visible active refusal with direct blocking-claim attribution. The result is not a production speedup or a new cache-replacement algorithm. It is a runtime contract that turns unreported resident loss into reconstructable active/resident arbitration. A companion MicroRuntime and vLLM litmus suite distinguish ordinary eviction, soft priority, write no-admit, accepted hard claims, materialization failure, demotion, expiry, active refusal, and trace-level outcome reconstruction.