CL · Jan 13

How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction

Yingjie He, Zhaolu Kang, Kehan Jiang, Qianyuan Zhang, Jiachen Qian, Chunlei Meng, Yujie Feng, Yuan Wang, Jiabao Dou, Aming Wu, Leqi Zheng, Pengxiang Zhao

arXiv:2601.08626v11.12 citations · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses the challenge of assessing structural robustness in LLMs, which is crucial for applications that require precise ordering. It is an incremental advance because it builds on existing evaluation methods.

The paper evaluates how well LLMs reconstruct internal structure from scrambled inputs. It introduces OrderProbe, a deterministic benchmark built on fixed four-character expressions in Chinese, Japanese, and Korean and scored by exact match, and finds that zero-shot recovery frequently falls below 35% across twelve widely used LLMs.

Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
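The exact-match setup described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`scrambled_orders`, `exact_match`, `recovery_rate`), the `restore` callable standing in for a model, and the two sample Chinese idioms are all assumptions made for illustration. The sketch only shows why a unique canonical order makes scoring deterministic.

```python
import itertools
import random


def scrambled_orders(expr: str) -> list[str]:
    """Return every distinct ordering of expr's characters except expr itself."""
    perms = {"".join(p) for p in itertools.permutations(expr)}
    perms.discard(expr)  # the canonical order is the answer, not a test input
    return sorted(perms)


def exact_match(prediction: str, canonical: str) -> bool:
    """Score a prediction as correct only if it equals the canonical order."""
    return prediction.strip() == canonical


def recovery_rate(restore, canonical_items, seed=0):
    """Fraction of items restored exactly from one random scramble each.

    `restore` is a hypothetical stand-in for a model call: it takes a
    scrambled string and returns the model's proposed ordering.
    """
    rng = random.Random(seed)
    hits = 0
    for canonical in canonical_items:
        candidates = scrambled_orders(canonical)
        if not candidates:
            # Happens only if all characters are identical.
            raise ValueError(f"no scrambled form exists for {canonical!r}")
        scrambled = rng.choice(candidates)
        hits += exact_match(restore(scrambled), canonical)
    return hits / len(canonical_items)


# Illustrative items only (assumed, not taken from the benchmark).
items = ["一石二鸟", "画蛇添足"]

# An oracle that looks up the canonical form by its sorted characters,
# and a baseline that returns the scrambled input unchanged.
lookup = {"".join(sorted(c)): c for c in items}
oracle = lambda s: lookup["".join(sorted(s))]
identity = lambda s: s

oracle_rate = recovery_rate(oracle, items)
identity_rate = recovery_rate(identity, items)
```

A four-character expression with four distinct characters has 23 scrambled forms, and because only one ordering counts as correct, no human or model judge is needed to score a response. The identity baseline always scores zero here, since the canonical order is never presented as input.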

