NE · May 7

PRIMETIME: Limits of LLMs in Temporal Primitives

arXiv:2504.16155 · 5.74 citations · h-index: 12

Predicted impact: top 88% in NE · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and practitioners working on temporal reasoning in LLMs, this work provides a decompositional evaluation strategy and a synthetic generator that enables both diagnosis and remediation of primitive failures.

The paper introduces PRIMETIME, a synthetic generator for benchmarking and fine-tuning LLMs on temporal primitives (datetime parsing and arithmetic). It finds that these primitives are individually unreliable (accuracy ranging from near-zero to perfect) but fully learnable via fine-tuning, achieving frontier-level accuracy on composed tasks with small quantized LoRA transformers.

This paper introduces PRIMETIME, a synthetic generator that supports both benchmarking and fine-tuning of two primitive operations underlying temporal reasoning in Large Language Models (LLMs): parsing and arithmetic on datetimes. Existing temporal benchmarks assume simplified canonical datetime forms, conflate arithmetic, composition, and world knowledge into a single aggregate score, and offer no direct path to remediation. The first contribution is methodological: the PRIMETIME synthetic generator delivers non-conflated, uncontaminated, and unlimited datetime exemplars that enable a decompositional evaluation strategy for each primitive in isolation. The generator is extensible to support complex datetime tasks and is publicly released, alongside generated benchmarks. The second contribution is diagnostic: under this evaluation strategy, the primitives themselves prove individually unreliable, with per-primitive accuracy ranging from near-zero to perfect across models and prompting conditions. The third contribution is constructive: the same generator used for diagnosis also produces new training exemplars for fine-tuning, and the resulting models show that the primitives are fully learnable and the composed Event Planning task reaches frontier-level accuracy using small quantized LoRA transformers. The broader takeaway is that a single synthetic generator can serve both diagnosis and production-ready deployment. This methodological pattern may apply beyond temporal reasoning.
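To make the idea of a synthetic generator for temporal primitives concrete, here is a minimal sketch in Python. It is not the released PRIMETIME generator: the format list, the function names (`parsing_exemplar`, `arithmetic_exemplar`, `make_dataset`), the date range, and the record fields are all assumptions chosen for illustration. It shows how one seeded generator can emit isolated parsing and arithmetic exemplars whose ground truth is computed rather than hand-labeled.

```python
import random
from datetime import datetime, timedelta

# Hypothetical surface formats (assumption: the paper's actual format
# inventory is not listed in this abstract).
FORMATS = [
    "%Y-%m-%d %H:%M",
    "%d %B %Y, %I:%M %p",
    "%m/%d/%Y %H:%M",
    "%A, %b %d %Y %H:%M",
]


def random_datetime(rng):
    """Draw a minute-resolution datetime from an arbitrary 40-year window."""
    start = datetime(1990, 1, 1)
    return start + timedelta(minutes=rng.randrange(60 * 24 * 365 * 40))


def parsing_exemplar(rng):
    """Parsing primitive: map a non-canonical string to ISO 8601."""
    dt = random_datetime(rng)
    fmt = rng.choice(FORMATS)
    return {
        "task": "parse",
        "input": dt.strftime(fmt),
        "format": fmt,
        "answer": dt.isoformat(timespec="minutes"),
    }


def arithmetic_exemplar(rng):
    """Arithmetic primitive: add an offset to a canonical datetime."""
    dt = random_datetime(rng)
    days, hours, minutes = rng.randint(0, 400), rng.randint(0, 23), rng.randint(0, 59)
    result = dt + timedelta(days=days, hours=hours, minutes=minutes)
    return {
        "task": "add",
        "start": dt.isoformat(timespec="minutes"),
        "days": days,
        "hours": hours,
        "minutes": minutes,
        "answer": result.isoformat(timespec="minutes"),
    }


def make_dataset(seed, n):
    """Generate n exemplars per primitive; the same seed reproduces the set."""
    rng = random.Random(seed)
    return [f(rng) for _ in range(n) for f in (parsing_exemplar, arithmetic_exemplar)]


if __name__ == "__main__":
    for record in make_dataset(seed=0, n=2):
        print(record)
```

Because each record exercises exactly one primitive, accuracy can be scored per primitive instead of as a single aggregate. Drawing from a fresh seed yields exemplars that cannot have appeared in a fixed benchmark file, and the same function can supply either evaluation items or fine-tuning data.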
