Reproducible benchmarks and writeups on AI research tooling over the arXiv corpus.
Benchmark
A blind, reproducible head-to-head of a curated-corpus MCP against a general web deep-research pipeline — including where the built-in beat us.