NC AI CV LG | Oct 31, 2025

ConnectomeBench: Can LLMs Proofread the Connectome?

Jeff Brown, Andrew Kirjner, Annika Vivekananthan, Ed Boyden

arXiv:2511.05542v1 | 2.31 citations | h-index: 4 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the labor-intensive proofreading bottleneck in connectomics research, offering a promising but incremental step toward automating scientific data validation.

The paper tackles the problem of automating neural connectome proofreading by introducing ConnectomeBench, a multimodal benchmark that evaluates LLMs on three critical tasks: segment type identification (52-82% balanced accuracy vs. 20-25% chance), split error correction (75-85% accuracy vs. 50% chance), and merge error detection. Results show that current models score well above chance on segment identification and split error correction but generally struggle on merge error identification, and that the best models still lag behind expert performance.

Connectomics, the mapping of neural connections in an organism's brain, currently requires extraordinary human effort to proofread the data collected from imaging and machine-learning-assisted segmentation. With the growing excitement around using AI agents to automate important scientific tasks, we explore whether current AI systems can perform multiple tasks necessary for data proofreading. We introduce ConnectomeBench, a multimodal benchmark evaluating large language model (LLM) capabilities in three critical proofreading tasks: segment type identification, split error correction, and merge error detection. Using expert-annotated data from two large open-source datasets, a cubic millimeter of mouse visual cortex and the complete Drosophila brain, we evaluate proprietary multimodal LLMs including Claude 3.7/4 Sonnet, o4-mini, GPT-4.1, and GPT-4o, as well as open-source models such as InternVL-3 and NVLM. Our results demonstrate that current models achieve surprisingly high performance in segment identification (52-82% balanced accuracy vs. 20-25% chance) and binary/multiple-choice split error correction (75-85% accuracy vs. 50% chance) while generally struggling on merge error identification tasks. Overall, while the best models still lag behind expert performance, they demonstrate promising capabilities that could eventually enable them to augment and potentially replace human proofreading in connectomics. Project page: https://github.com/jffbrwn2/ConnectomeBench. Dataset: https://huggingface.co/datasets/jeffbbrown2/ConnectomeBench/tree/main
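To make the "balanced accuracy vs. chance" figures concrete, here is a minimal sketch of the metric, assuming the standard definition (the mean of per-class recall). The abstract does not spell out the exact formula the authors used, and the labels below are hypothetical, not drawn from ConnectomeBench. Under this definition, chance for K classes is 1/K, so the quoted 20-25% chance levels would correspond to five or four classes.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true.

    Assumed (standard) definition; the benchmark's own scoring code
    may differ in details.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    correct = defaultdict(int)  # correctly predicted examples per true class
    total = defaultdict(int)    # number of examples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


def chance_level(num_classes):
    """Balanced accuracy expected from uniform random guessing."""
    return 1.0 / num_classes


# Hypothetical, imbalanced labels: a model that always answers "neuron".
y_true = ["neuron"] * 6 + ["glia"] * 2
y_pred = ["neuron"] * 8

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)

print(f"plain accuracy:    {plain:.2f}")
print(f"balanced accuracy: {balanced:.2f}")
print(f"chance (2 classes): {chance_level(2):.2f}")
```

In this toy case the always-"neuron" model reaches 0.75 plain accuracy but only 0.50 balanced accuracy, which equals chance for two classes. That is why balanced accuracy is the more informative number when segment types are unevenly represented.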
