Benchmark

We ran our research MCP against Claude Code’s built-in deep-research. It cost 1/40th as much and cited 2.6× more of the actual frontier.

A blind, reproducible head-to-head of a curated-corpus MCP against a general web deep-research pipeline — including where the built-in beat us.

TL;DR. On three depth-shaped CS/AI/ML research questions, we ran two agents on the same prompt: one using Claude Code’s built-in deep-research (a web fan-out), one using Scholar Feed’s corpus MCP (600k arXiv papers + citation graph). A blind judge preferred the corpus report 3 out of 3 times. It did it for ~$0.63 vs ~$26.60 per question and surfaced 2.6× more real papers. The built-in is good. It runs an adversarial fact-check the corpus doesn’t. But for finding and correctly citing the research frontier, the corpus is cheaper, fresher, and better grounded. Below is the methodology, including where the built-in beat us. Everything is reproducible: the harness and raw data are public.

Why we ran this

Scholar Feed is an MCP server that gives a coding agent deep access to the arXiv corpus. Not just search: a citation graph, foundational-lineage lookup, a rising-work signal, and full-text extraction. The skeptical question we kept asking ourselves was why pay for this when the agent can already web-search arXiv for free?

Claude Code ships a built-in deep-research command, and it’s no toy. It fans out to ~100 sub-agents, runs parallel web searches, fetches sources, and adversarially verifies claims with a 3-vote panel before synthesizing. That’s the real competitor, so we put them head-to-head with the research tool as the only variable.

The setup (and how we kept it fair)

Result 1: Cost, ~42× cheaper and ~4× faster

This one is hard to argue with, because it’s counted mechanically:

Per question (mean)Scholar Feed (corpus)Built-in deep-research (web)
Cost$0.63$26.60
Wall-clock214 s890 s
How1 agent, ~15 targeted tool calls~100-agent fan-out

The built-in’s quality comes from brute parallelism: ~100 agents, dozens of fetches, and a big verification panel. That’s powerful, and it’s why each question costs ~$27 and burns a quarter-hour. The corpus loop reaches the answer with a single agent doing a handful of graph-aware calls: it searches, walks the foundational lineage, follows forward citations, then pulls full text on the few papers that matter. Same depth, two orders of magnitude less compute.

Result 2: A blind judge preferred the corpus 3/3

A blind Opus judge (no tools, labels stripped) scored each pair 1–5 on recency, depth of lineage, coverage, grounding, and correctness:

QuestionWinnerConfidenceCorpus 2025–26 papersWeb 2025–26 papers
Best small-LM optimizerCorpusmedium138
Long-context attentionCorpushigh123
KV-cache compressionCorpushigh116

The corpus report swept every rubric dimension (mean 5.0/5.0/5.0/5.0/4.0 vs the web report’s 3.7/3.3/3.3/4.3/3.0). On long-context attention the judge wrote that the corpus report "identifies the actual frontier shift to natively-trained sparse attention (NSA) and hybrid linear+softmax architectures (Kimi Linear, Samba, Falcon-H1)," while the web report "omits the hybrid track entirely." The recency gap is structural. The corpus’s forward-citation graph walks from a canonical paper to the newest work that builds on it, which is how you reach 2026 papers a model can’t recall and a keyword search doesn’t rank.

Result 3: The corpus’s citations are correct by construction

Neither tool fabricated arXiv IDs; both came in at ~0% nonexistent (web 0/38, corpus 0/100+ checked). The subtler difference is in binding. Does the ID match the title and date it’s attached to?

These are the errors a corpus tool structurally cannot make. If you’ve ever had an AI hand you a real-looking arXiv ID stuck on the wrong paper, that’s the failure mode this designs out.

Where the built-in won (the honest part)

The built-in isn’t bad, and two things it does are better than what we have:

  1. Adversarial verification. Every built-in report ships a "confirmed vs killed claims" ledger. It caught overstated magnitudes the corpus report repeated uncritically: a Muon optimizer "2× efficiency" number and a KV-cache method’s "20% retention is enough" claim, both of which deserve the skepticism. The corpus arm has no equivalent fact-check and will echo a paper’s headline number as fact.
  2. It works on any topic. The corpus only knows CS/AI/ML papers. For "research the history of the lithium supply chain," the web tool is the right call and ours is useless.

So this isn’t "ours is better, theirs is bad." The built-in spends ~40× more to brute-force breadth and verify magnitudes, which is a real strength. The corpus reaches a fresher, correctly-cited frontier far cheaper, because a citation graph is a better map of a research field than a web index. The best version of our product borrows the built-in’s idea and adds a verification pass of its own. That’s the obvious next step.

Honest limitations

Try it / reproduce it

Install the MCP (no account needed for the search tools):

npx scholar-feed-mcp init

The full harness, every per-question report, the blind-judge verdicts, and the arXiv ID checks are public. Swap in your own questions and rerun: github.com/YGao2005/scholar-feed-vs-deep-research. Scholar Feed MCP is open source (MIT) at github.com/YGao2005/scholar-feed-mcp; the shorter comparison writeup lives at scholarfeed.org/compare/deep-research-vs-scholar-feed.

If you’re doing literature review, prior-art search, or "what’s the latest on X" from inside your agent, the takeaway is this: a curated corpus with a citation graph gets you a fresher, correctly-cited answer for a fraction of the cost of a general web fan-out, as long as your question is about CS/AI/ML papers, which is what it’s for.

npx scholar-feed-mcp init

More on the developers page · shorter comparison.