LG | AI | Oct 1, 2025

Microsaccade-Inspired Probing: Positional Encoding Perturbations Reveal LLM Misbehaviours

arXiv:2510.01288v1 | 1 citation | h-index: 52
Originality: Highly original
AI Analysis

This work addresses the challenge of identifying and mitigating undesirable behaviours in LLMs, a concern for both users and developers. It offers a novel, unsupervised detection method; its application of biological inspiration to AI probing is an incremental step.

The paper tackles the problem of detecting misbehaviours in large language models (LLMs). It proposes a probing method inspired by microsaccades that uses lightweight position encoding perturbations to elicit latent signals indicating failures in areas such as factuality, safety, toxicity, and backdoor attacks. Experiments show that the method is computationally efficient and effective across multiple state-of-the-art models.

We draw inspiration from microsaccades, tiny involuntary eye movements that reveal hidden dynamics of human perception, to propose an analogous probing method for large language models (LLMs). Just as microsaccades expose subtle but informative shifts in vision, we show that lightweight position encoding perturbations elicit latent signals that indicate model misbehaviour. Our method requires no fine-tuning or task-specific supervision, yet detects failures across diverse settings including factuality, safety, toxicity, and backdoor attacks. Experiments on multiple state-of-the-art LLMs demonstrate that these perturbation-based probes surface misbehaviours while remaining computationally efficient. These findings suggest that pretrained LLMs already encode the internal evidence needed to flag their own failures, and that microsaccade-inspired interventions provide a pathway for detecting and mitigating undesirable behaviours.
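To make the idea of a position encoding perturbation probe concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the perturbation scheme, the model internals, or the scoring rule, so everything here is an assumption for illustration: a toy single-head attention "model" with sinusoidal encodings and no causal mask, uniform jitter on the position indices, and a KL-divergence sensitivity score. All function names (`sinusoidal_encoding`, `toy_next_token_probs`, `perturbation_sensitivity`) are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(positions, dim):
    """Sinusoidal positional encoding at (possibly fractional) positions; dim must be even."""
    positions = np.asarray(positions, dtype=float)[:, None]   # (T, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = positions * freqs                                 # (T, dim/2)
    enc = np.zeros((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def toy_next_token_probs(token_emb, positions, w_q, w_k, w_v, w_out):
    """Toy stand-in for an LLM: one self-attention head, read out at the last position."""
    x = token_emb + sinusoidal_encoding(positions, token_emb.shape[1])
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))
    hidden = attn @ v
    return softmax(hidden[-1] @ w_out)

def kl_divergence(p, q, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def perturbation_sensitivity(token_emb, weights, scale=0.1, n_probes=8, seed=0):
    """Mean KL divergence between the unperturbed output and outputs under jittered positions.

    Assumption: a larger score is treated as a candidate signal worth inspecting;
    the paper's actual signal and decision rule are not given in the abstract.
    """
    rng = np.random.default_rng(seed)
    base_pos = np.arange(token_emb.shape[0], dtype=float)
    base = toy_next_token_probs(token_emb, base_pos, *weights)
    scores = []
    for _ in range(n_probes):
        jitter = rng.uniform(-scale, scale, size=base_pos.shape)  # the "microsaccade"
        probed = toy_next_token_probs(token_emb, base_pos + jitter, *weights)
        scores.append(kl_divergence(base, probed))
    return float(np.mean(scores))

# Usage with random toy weights (T tokens, D hidden dims, V vocabulary entries).
rng = np.random.default_rng(42)
T, D, V = 6, 16, 10
token_emb = rng.normal(size=(T, D))
weights = tuple(rng.normal(size=shape) / np.sqrt(D)
                for shape in [(D, D), (D, D), (D, D), (D, V)])
score = perturbation_sensitivity(token_emb, weights)
zero_score = perturbation_sensitivity(token_emb, weights, scale=0.0)
```

The probe needs only forward passes, which is consistent with the abstract's claim that no fine-tuning or task-specific supervision is required. With a real LLM, the analogous step would presumably perturb the model's position indices or encodings and compare outputs or internal activations, but that mapping is an assumption here.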
