Corpus MCP vs Claude Code deep-research: a benchmark

TL;DR. On three depth-shaped CS/AI/ML research questions, we ran two agents on the same prompt: one using Claude Code’s built-in deep-research (a web fan-out), one using Scholar Feed’s corpus MCP (600k arXiv papers + citation graph). A blind judge preferred the corpus report 3 out of 3 times. It did it for ~$0.63 vs ~$26.60 per question and surfaced 2.6× more real papers. The built-in is good. It runs an adversarial fact-check the corpus doesn’t. But for finding and correctly citing the research frontier, the corpus is cheaper, fresher, and better grounded. Below is the methodology, including where the built-in beat us. Everything is reproducible: the harness and raw data are public.

Why we ran this

Scholar Feed is an MCP server that gives a coding agent deep access to the arXiv corpus. Not just search: a citation graph, foundational-lineage lookup, a rising-work signal, and full-text extraction. The skeptical question we kept asking ourselves was why pay for this when the agent can already web-search arXiv for free?

Claude Code ships a built-in deep-research command, and it’s no toy. It fans out to ~100 sub-agents, runs parallel web searches, fetches sources, and adversarially verifies claims with a 3-vote panel before synthesizing. That’s the real competitor, so we put them head-to-head with the research tool as the only variable.

The setup (and how we kept it fair)

Identical prompt, both arms. Each question asks the agent to name the established approach, trace how the method evolved, and identify the 2025–2026 work that supersedes it, with every claim cited to an arXiv ID. The prompt is byte-identical across arms.
Both explicitly triggered. Arm DR runs /deep-research; Arm SF runs our /scholar-feed skill. We’re measuring output quality, not which one an agent picks.
Pure arms. DR gets web only (no MCP). SF gets the corpus only (no web). Same model (Claude Sonnet 4.6), same clean-room headless setup, fresh directory per run.
Three questions, chosen in fast-moving areas where recency matters: the best optimizer for small-LM pretraining, long-context attention, and KV-cache compression.
Blind judging. A separate Claude Opus judge scored both reports per question with arm labels stripped and A/B order randomized. It never knew which tool produced which.
External ID verification. We checked every cited arXiv ID against the live arXiv API and against the corpus, so "grounding" isn’t a vibe. It’s a count.

Result 1: Cost, ~42× cheaper and ~4× faster

This one is hard to argue with, because it’s counted mechanically:

Per question (mean)	Scholar Feed (corpus)	Built-in `deep-research` (web)
Cost	$0.63	$26.60
Wall-clock	214 s	890 s
How	1 agent, ~15 targeted tool calls	~100-agent fan-out

The built-in’s quality comes from brute parallelism: ~100 agents, dozens of fetches, and a big verification panel. That’s powerful, and it’s why each question costs ~$27 and burns a quarter-hour. The corpus loop reaches the answer with a single agent doing a handful of graph-aware calls: it searches, walks the foundational lineage, follows forward citations, then pulls full text on the few papers that matter. Same depth, two orders of magnitude less compute.

Result 2: A blind judge preferred the corpus 3/3

A blind Opus judge (no tools, labels stripped) scored each pair 1–5 on recency, depth of lineage, coverage, grounding, and correctness:

Question	Winner	Confidence	Corpus 2025–26 papers	Web 2025–26 papers
Best small-LM optimizer	Corpus	medium	13	8
Long-context attention	Corpus	high	12	3
KV-cache compression	Corpus	high	11	6

The corpus report swept every rubric dimension (mean 5.0/5.0/5.0/5.0/4.0 vs the web report’s 3.7/3.3/3.3/4.3/3.0). On long-context attention the judge wrote that the corpus report "identifies the actual frontier shift to natively-trained sparse attention (NSA) and hybrid linear+softmax architectures (Kimi Linear, Samba, Falcon-H1)," while the web report "omits the hybrid track entirely." The recency gap is structural. The corpus’s forward-citation graph walks from a canonical paper to the newest work that builds on it, which is how you reach 2026 papers a model can’t recall and a keyword search doesn’t rank.

Result 3: The corpus’s citations are correct by construction

Neither tool fabricated arXiv IDs; both came in at ~0% nonexistent (web 0/38, corpus 0/100+ checked). The subtler difference is in binding. Does the ID match the title and date it’s attached to?

The corpus arm copies the ID, title, and year straight out of a tool result, so it can’t misbind unless it mistypes. Across 100+ citations, zero binding errors. It even volunteered honesty notes like "Mamba (2312.00752) not indexed in this corpus" instead of bluffing.
The web arm reconstructs citations from memory and snippets, and it occasionally slips. It cited 2409.20325 as "Modular Duality in Deep Learning" (that ID is "Old Optimizer, New Norm"; Modular Duality is 2410.21265), and it dated a December-2025 paper "late 2024." The blind judge caught the date error on its own: "KQ-SVD called 'late 2024' but tagged arXiv:2512 = Dec 2025."

These are the errors a corpus tool structurally cannot make. If you’ve ever had an AI hand you a real-looking arXiv ID stuck on the wrong paper, that’s the failure mode this designs out.

Where the built-in won (the honest part)

The built-in isn’t bad, and two things it does are better than what we have:

Adversarial verification. Every built-in report ships a "confirmed vs killed claims" ledger. It caught overstated magnitudes the corpus report repeated uncritically: a Muon optimizer "2× efficiency" number and a KV-cache method’s "20% retention is enough" claim, both of which deserve the skepticism. The corpus arm has no equivalent fact-check and will echo a paper’s headline number as fact.
It works on any topic. The corpus only knows CS/AI/ML papers. For "research the history of the lithium supply chain," the web tool is the right call and ours is useless.

So this isn’t "ours is better, theirs is bad." The built-in spends ~40× more to brute-force breadth and verify magnitudes, which is a real strength. The corpus reaches a fresher, correctly-cited frontier far cheaper, because a citation graph is a better map of a research field than a web index. The best version of our product borrows the built-in’s idea and adds a verification pass of its own. That’s the obvious next step.

Honest limitations

n = 3 questions, one run each. Directionally strong and consistent, but not a large sample. We’re publishing the harness so you can run more.
The questions favor a corpus. Optimizers, attention, and KV-caches are the fast-moving, paper-dense areas where a citation graph helps most. On a less paper-gated question the quality gap would narrow, though the cost and grounding edges are structural and wouldn’t.
Single model, single setup. Claude Sonnet 4.6, headless, clean room.

Try it / reproduce it

Install the MCP (no account needed for the search tools):

npx scholar-feed-mcp init

The full harness, every per-question report, the blind-judge verdicts, and the arXiv ID checks are public. Swap in your own questions and rerun: github.com/YGao2005/scholar-feed-vs-deep-research. Scholar Feed MCP is open source (MIT) at github.com/YGao2005/scholar-feed-mcp; the shorter comparison writeup lives at scholarfeed.org/compare/deep-research-vs-scholar-feed.

If you’re doing literature review, prior-art search, or "what’s the latest on X" from inside your agent, the takeaway is this: a curated corpus with a citation graph gets you a fresher, correctly-cited answer for a fraction of the cost of a general web fan-out, as long as your question is about CS/AI/ML papers, which is what it’s for.