CL · May 19

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Francis Kulumba, Guillaume Vimont, Laurent Romary, Florian Cafiero

arXiv:2605.19908 · 18.5

Predicted impact: top 84% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For researchers in authorship attribution and mechanistic interpretability, this work explains performance variability in encoder-based models and reveals how scoring mechanisms influence feature consolidation.

Authorship attribution models using the same encoder, data, and loss can vary four-fold in performance due to the scoring mechanism. The gap arises not from representation quality but from where the encoder consolidates authorship signal, with mean pooling forcing early consolidation and late interaction deferring it.

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.
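To make the two scoring mechanisms concrete, here is a minimal sketch in plain Python. It assumes that "late interaction" refers to a ColBERT-style MaxSim scorer, which the abstract does not state explicitly, and it averages the per-token maxima (rather than summing them) only to keep the two scores on a comparable scale. The function names and toy vectors are hypothetical and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool_score(tokens_a, tokens_b):
    """Cosine similarity between the averaged token vectors of two documents.

    All token vectors are collapsed into one document vector before any
    comparison takes place.
    """
    dim = len(tokens_a[0])
    mean_a = [sum(t[i] for t in tokens_a) / len(tokens_a) for i in range(dim)]
    mean_b = [sum(t[i] for t in tokens_b) / len(tokens_b) for i in range(dim)]
    return _dot(_normalize(mean_a), _normalize(mean_b))


def late_interaction_score(tokens_a, tokens_b):
    """Assumed MaxSim-style scorer: match each token of A to its most similar
    token in B, then average those maxima (averaging is a choice made here)."""
    a = [_normalize(t) for t in tokens_a]
    b = [_normalize(t) for t in tokens_b]
    return sum(max(_dot(x, y) for y in b) for x in a) / len(a)


# Toy 2-dimensional "token embeddings" (hypothetical values).
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [1.0, 0.0]]

pooled = mean_pool_score(doc_a, doc_b)
late = late_interaction_score(doc_a, doc_b)
```

In this sketch the pooled scorer sees only one vector per document, so every token contributes equally to the averaged representation, whereas the MaxSim-style scorer compares tokens individually and each token's contribution depends only on its best match. This is meant as an illustration of the structural difference between the scorers, not as a reproduction of the paper's models or of its gradient derivation.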
