LG · Jan 28

Structurally Human, Semantically Biased: Detecting LLM-Generated References with Embeddings and GNNs

Melika Mobini, Vincent Holst, Floriano Tori, Andres Algaba, Vincent Ginis

arXiv:2601.20704v1 · 2.71 citations · h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses the detection of AI-generated reference lists, a concern for researchers and publishers. It is an incremental advance because it builds on existing embedding and GNN methods.

The study asks whether LLM-generated bibliographies can be distinguished from human ones by analyzing citation graphs. Structural features alone barely separate the two (RF accuracy ≈ 0.60), whereas semantic embeddings make detection much easier: GNNs with embedding node features reach 93% test accuracy on GPT-4o vs. ground truth.

Large language models are increasingly used to curate bibliographies, raising the question: are their reference lists distinguishable from human ones? We build paired citation graphs, ground truth and GPT-4o-generated (from parametric knowledge), for 10,000 focal papers (≈ 275k references) from SciSciNet, and add a field-matched random baseline that preserves out-degree and field distributions while breaking latent structure. We compare (i) structure-only node features (degree/closeness/eigenvector centrality, clustering, edge count) with (ii) 3072-D title/abstract embeddings, using a random forest (RF) on graph-level aggregates and Graph Neural Networks (GNNs) with node features. Structure alone barely separates GPT from ground truth (RF accuracy ≈ 0.60) despite cleanly rejecting the random baseline (≈ 0.89 to 0.92). By contrast, embeddings sharply increase separability: RF on aggregated embeddings reaches ≈ 0.83, and GNNs with embedding node features achieve 93% test accuracy on GPT vs. ground truth. We show the robustness of our findings by replicating the pipeline with Claude Sonnet 4.5 and with multiple embedding models (OpenAI and SPECTER), with RF separability for ground truth vs. Claude ≈ 0.77 and clean rejection of the random baseline. Thus, LLM bibliographies, generated purely from parametric knowledge, closely mimic human citation topology but leave detectable semantic fingerprints; detection and debiasing should target content signals rather than global graph structure.
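To make the structure-only setup concrete, the following is a minimal Python sketch, not the authors' code. It turns one citation graph into a fixed-length vector of graph-level aggregates (the kind of input an RF could consume) and builds a random baseline that keeps each citing paper's out-degree. The function names, the toy graph, and the `candidate_pool` argument are hypothetical; closeness and eigenvector centrality are omitted for brevity, and field matching is assumed to happen when the candidate pool is assembled.

```python
import random
from statistics import mean


def undirected_neighbors(edges):
    """Build an undirected adjacency map from directed (citing, cited) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-citations in this sketch
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def graph_level_features(edges):
    """Aggregate node-level structure into one feature dict per graph."""
    adj = undirected_neighbors(edges)
    degrees = [len(adj[n]) for n in adj]
    clustering = [local_clustering(adj, n) for n in adj]
    return {
        "n_nodes": len(adj),
        "n_edges": sum(degrees) // 2,
        "mean_degree": mean(degrees),
        "max_degree": max(degrees),
        "mean_clustering": mean(clustering),
    }


def random_baseline(edges, candidate_pool, seed=0):
    """Rewire every citing paper to random candidates, keeping its out-degree.

    Assumes `edges` has no duplicates and that `candidate_pool` is already
    field-matched and large enough to sample from without replacement.
    """
    rng = random.Random(seed)
    refs_by_source = {}
    for u, v in edges:
        refs_by_source.setdefault(u, []).append(v)
    rewired = []
    for u, refs in refs_by_source.items():
        pool = [c for c in candidate_pool if c != u]
        rewired.extend((u, v) for v in rng.sample(pool, len(refs)))
    return rewired


# Toy graph: focal paper F cites A, B, C; A cites B; B cites C.
edges = [("F", "A"), ("F", "B"), ("F", "C"), ("A", "B"), ("B", "C")]
features = graph_level_features(edges)
baseline = random_baseline(edges, ["P1", "P2", "P3", "P4", "P5", "P6"])
print(features)
print(baseline)
```

In a full pipeline, one such feature vector per focal paper (for the ground-truth, LLM-generated, and random graphs) would be stacked into a matrix and passed to a classifier; the embedding variant would replace or extend these columns with pooled title/abstract embeddings.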
