NC · LG · May 13

SpikeProphecy: A Large-Scale Benchmark for Autoregressive Neural Population Forecasting

arXiv:2605.12992 · 63.2
AI Analysis

For computational neuroscientists and machine learning researchers, this benchmark provides a standardized evaluation framework that uncovers hidden structure in neural population forecasting performance.

The paper introduces SpikeProphecy, the first large-scale benchmark for autoregressive spike-count forecasting, with a population metric decomposition that separates temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. Applied to 105 Neuropixels sessions, it reveals a brain-region predictability ranking consistent across seven baselines and exposes evaluation floors and distillation limitations.

Neural population models, which predict the joint firing of many simultaneously recorded neurons forward in time, are typically evaluated by a single aggregate Pearson correlation $r$ between predicted and actual spike counts, a number that masks critical structure. We argue that how we evaluate spike forecasting matters as much as what we build, and introduce SpikeProphecy, the first large-scale benchmark for causal, autoregressive spike-count forecasting on real electrophysiology recordings. Our core contribution is a population metric decomposition that separates aggregate performance into temporal fidelity, spatial pattern accuracy, and magnitude-invariant alignment. The decomposition surfaces aspects of the underlying data that an aggregate scalar collapses together. We apply the protocol to 105 Neuropixels sessions (Steinmetz 2019 + IBL Repeated Site; ~89,800 neurons) with seven architecture baselines spanning four structural families: four SSMs (three diagonal and one non-diagonal), a Transformer, an LSTM, and a spiking network. The decomposition surfaces a brain-region predictability ranking that reproduces across all seven baselines and survives ANCOVA correction for firing-statistics constraints (region $\Delta R^2 = 0.018$ above the firing-statistics covariates). It also exposes a sub-Poisson evaluation floor where rigorous metrics combine with genuine biophysical constraints on regular spike trains, and yields a negative result on KL-on-output-rates distillation for ANN-to-SNN transfer in this Poisson count domain.
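To make the idea of a metric decomposition concrete, here is a minimal Python sketch of how a single aggregate $r$ can hide structure. The abstract does not give the paper's exact definitions, so the three components below are assumptions chosen for illustration: temporal fidelity is taken as the mean per-neuron Pearson correlation across time, spatial pattern accuracy as the mean per-time-bin Pearson correlation across neurons, and magnitude-invariant alignment as the mean per-time-bin cosine similarity. The function names `pearson_rows` and `decompose` are hypothetical and are not taken from the benchmark's code.

```python
import numpy as np

def pearson_rows(a, b, eps=1e-9):
    """Pearson r between matching rows of a and b.
    Rows where either side has (numerically) zero variance score 0."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    na = np.linalg.norm(a, axis=1)
    nb = np.linalg.norm(b, axis=1)
    ok = (na > eps) & (nb > eps)
    den = np.where(ok, na * nb, 1.0)  # avoid dividing by zero
    return np.where(ok, (a * b).sum(axis=1) / den, 0.0)

def decompose(pred, true, eps=1e-9):
    """pred, true: spike-count arrays of shape (T time bins, N neurons).
    Component definitions are illustrative assumptions, not the paper's."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    # Aggregate: one Pearson r over all (time, neuron) entries pooled together.
    aggregate = pearson_rows(pred.reshape(1, -1), true.reshape(1, -1))[0]
    # Temporal: correlate each neuron's time course, then average over neurons.
    temporal = pearson_rows(pred.T, true.T).mean()
    # Spatial: correlate the population pattern in each time bin, then average.
    spatial = pearson_rows(pred, true).mean()
    # Magnitude-invariant: cosine similarity of population vectors per time bin.
    n_pred = np.linalg.norm(pred, axis=1)
    n_true = np.linalg.norm(true, axis=1)
    ok = (n_pred > eps) & (n_true > eps)
    den = np.where(ok, n_pred * n_true, 1.0)
    cosine = np.where(ok, (pred * true).sum(axis=1) / den, 0.0).mean()
    return {"aggregate": float(aggregate), "temporal": float(temporal),
            "spatial": float(spatial), "cosine": float(cosine)}

# Synthetic example: neurons with very different mean rates, and a
# "forecaster" that only ever outputs each neuron's mean rate.
rng = np.random.default_rng(0)
rates = np.linspace(0.5, 20.0, 50)           # 50 neurons
true = rng.poisson(rates, size=(200, 50))    # 200 time bins of Poisson counts
pred = np.tile(rates, (200, 1))              # constant-in-time prediction
scores = decompose(pred, true)
print(scores)
```

In this synthetic case the constant-rate predictor earns a high aggregate and spatial score simply because neurons differ in mean firing rate, while its temporal score is exactly zero because it tracks no fluctuations at all. That is the kind of structure a single pooled correlation collapses together.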

