The Question
On January 17th at 4AM, we'd been trying to run an evaluation on our Model Context Protocol server for three weeks. We tried the raw SWE-bench dataset with SWE-agent. We tried adding our MCP tool to OpenHands. Nothing gave us what we needed.
The question was simple: Does our MCP server actually make the agent better, or are we just burning context tokens?
Nobody had a good way to answer this. Every available coding benchmark measures the LLM, not the tools it uses. No existing tool let us isolate the performance impact of introducing an MCP server to an agent.
No one is rigorously testing their MCP servers. Developers ship blind, hoping their tools help. Users have no way to compare servers or verify claims.
This isn't a tooling gap. It's an evaluation gap. What you can't measure, you can't improve.
Background
The Model Context Protocol (MCP) lets developers expose tools and data to LLM-based agents through a standardized interface. Since its release, hundreds of MCP servers have appeared for code analysis, database access, web search, and more.
Evaluation lagged behind. SWE-bench (Jimenez et al., 2024) and HumanEval (Chen et al., 2021) evaluate model capabilities, not tool augmentation. ToolBench (Qin et al., 2023) evaluates tool-use broadly but doesn't answer the MCP-specific question: given a fixed model and agent, does adding this particular server improve task completion?
That gap motivated mcpbr.
Methodology
mcpbr is an open-source benchmark runner that isolates the variable of MCP tool augmentation. It spins up Docker containers for each SWE-bench task, injects a headless Claude Code instance with a controlled configuration, and measures performance with and without your MCP server.
- Paired comparison. Each task runs twice on the same commit: once with MCP, once without. Controls for task difficulty.
- SWE-bench Verified. Human-validated real-world GitHub issues with ground-truth patches.
- Automated infrastructure. Azure provider handles VM lifecycle per run.
- Full trace capture. Every tool call, failure, token count, and timing is logged.
We evaluated an MCP server providing static code analysis (two tools exposed to the agent). All runs used Claude Sonnet (claude-sonnet-4-20250514), 300-second timeout, 30-iteration max per task. 500 tasks, head-to-head. ~68 GPU-hours total.
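The paired design can be sketched in a few lines. This is an illustrative outline, not mcpbr's actual API: `run_task`, `TaskResult`, and `paired_run` are hypothetical names, and the real harness does the Docker and Claude Code work that the placeholder only describes.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    baseline_resolved: bool
    mcp_resolved: bool

def run_task(task_id: str, use_mcp: bool) -> bool:
    """Placeholder for the real harness, which would start a Docker
    container at the task's commit, run headless Claude Code with or
    without the MCP server attached, and evaluate the resulting patch."""
    raise NotImplementedError

def paired_run(task_ids, runner=run_task):
    # Same task, same commit, two conditions: this is what controls
    # for task difficulty in the paired comparison.
    return [
        TaskResult(t, runner(t, use_mcp=False), runner(t, use_mcp=True))
        for t in task_ids
    ]

def summarize(results):
    """Classify each task into one of the four paired-outcome cells."""
    cells = {"both_pass": 0, "only_mcp": 0, "only_baseline": 0, "both_fail": 0}
    for r in results:
        if r.baseline_resolved and r.mcp_resolved:
            cells["both_pass"] += 1
        elif r.mcp_resolved:
            cells["only_mcp"] += 1
        elif r.baseline_resolved:
            cells["only_baseline"] += 1
        else:
            cells["both_fail"] += 1
    return cells
```

The four cells produced by `summarize` are exactly the paired-outcome rows reported in the results below.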
Results
Resolution Rates (n=500)
| Metric | Baseline | With MCP | Change |
|---|---|---|---|
| Tasks resolved | 249/500 (49.8%) | 212/500 (42.4%) | -14.9% relative |

| Outcome | Tasks |
|---|---|
| Both conditions pass | 194 |
| Only MCP passes | 18 |
| Only baseline passes | 55 |
| Both conditions fail | 233 |
The baseline outperformed MCP on raw resolution rate. MCP helped solve 18 tasks the baseline couldn't, but hurt on 55 tasks.
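In a paired design, only the discordant cells (18 tasks only MCP solved, 55 tasks only baseline solved) carry information about which condition is better. McNemar's exact test on those two counts is a natural check; it was not part of the original analysis, but it needs nothing beyond the table above and the standard library:

```python
import math

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value. Under the null hypothesis, each
    of the b + c discordant tasks is equally likely to favor either
    condition, so the smaller count follows Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Discordant cells from the results table.
p = mcnemar_exact(18, 55)  # well below 0.001
```

An 18/55 split among 73 discordant tasks is far from the 50/50 expected under no effect, so the baseline's advantage is unlikely to be run-to-run noise alone, even though each task ran only once per condition.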
Efficiency Metrics (n=500)
| Metric | Baseline | With MCP | Change |
|---|---|---|---|
| Total cost | $242.12 | $205.30 | -15.2% |
| Cost per task | $0.48 | $0.41 | -14.6% |
| Cost per resolved task | $0.97 | $0.97 | 0% |
| Total tokens | 3,005,706 | 2,586,206 | -14.0% |
| Total tool calls | 22,745 | 13,129 | -42.3% |
| Avg runtime per task | 231s | 258s | +11.7% |
42% fewer tool calls. 14% fewer tokens. $37 saved. But 12% slower per task (MCP startup and API latency) and 15% fewer tasks resolved.
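The "Change" column is a plain relative change against the baseline; the table's deltas can be reproduced directly from the raw totals:

```python
def rel_change(baseline: float, with_mcp: float) -> float:
    """Relative change from baseline, as a percentage."""
    return (with_mcp - baseline) / baseline * 100

# Raw totals from the efficiency table.
assert round(rel_change(242.12, 205.30), 1) == -15.2       # total cost
assert round(rel_change(3_005_706, 2_586_206), 1) == -14.0 # total tokens
assert round(rel_change(22_745, 13_129), 1) == -42.3       # tool calls
assert round(rel_change(231, 258), 1) == 11.7              # avg runtime
```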
Tool Adoption
| Metric | Value |
|---|---|
| MCP-specific tool calls | 1,325 (10.1% of all MCP-condition calls) |
| Tasks where agent used MCP tools | 500/500 (100%) |
| MCP tool failure rate | 0% |
| Bash failure rate (MCP condition) | 46.3% |
100% adoption, 0% tool failure. The dominant failure source was Bash errors (missing dependencies in containers), not MCP.
Per-Repository Breakdown
| Repository | Tasks | Baseline | With MCP | Delta |
|---|---|---|---|---|
| django | 231 | 77 (33%) | 65 (28%) | -12 |
| sympy | 75 | 59 (79%) | 49 (65%) | -10 |
| sphinx-doc | 44 | 22 (50%) | 21 (48%) | -1 |
| matplotlib | 34 | 17 (50%) | 14 (41%) | -3 |
| scikit-learn | 32 | 23 (72%) | 21 (66%) | -2 |
| astropy | 22 | 11 (50%) | 11 (50%) | 0 |
| pydata/xarray | 22 | 14 (64%) | 13 (59%) | -1 |
| pytest | 19 | 14 (74%) | 9 (47%) | -5 |
| pylint | 10 | 3 (30%) | 4 (40%) | +1 |
| psf/requests | 8 | 7 (88%) | 4 (50%) | -3 |
The baseline won or tied on every repository except pylint, the sole case where MCP came out ahead.
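The per-repository comparison reduces to checking the resolved counts against each other; recomputing it from the table makes the pattern explicit:

```python
# (repository, tasks, baseline_resolved, mcp_resolved) from the table above.
REPOS = [
    ("django", 231, 77, 65),
    ("sympy", 75, 59, 49),
    ("sphinx-doc", 44, 22, 21),
    ("matplotlib", 34, 17, 14),
    ("scikit-learn", 32, 23, 21),
    ("astropy", 22, 11, 11),
    ("pydata/xarray", 22, 14, 13),
    ("pytest", 19, 14, 9),
    ("pylint", 10, 3, 4),
    ("psf/requests", 8, 7, 4),
]

def mcp_wins(repos):
    """Repositories where the MCP condition resolved strictly more tasks."""
    return [name for name, _, base, mcp in repos if mcp > base]

def ties(repos):
    """Repositories where both conditions resolved the same number."""
    return [name for name, _, base, mcp in repos if mcp == base]
```

Running `mcp_wins(REPOS)` returns only pylint, and `ties(REPOS)` returns only astropy; everywhere else the baseline leads.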
Analysis
The efficiency-resolution tradeoff
The MCP server made the agent more efficient but less effective. 42% fewer tool calls, 14% fewer tokens, 15% fewer tasks resolved.
The MCP tools changed the agent's exploration strategy. With code analysis available, the agent relied on it instead of doing its own searching. Faster and cheaper, but dependent on the MCP server's understanding. When that understanding was incomplete, the agent explored less and missed solutions that brute-force searching would have found.
In baseline, the agent made 22,745 tool calls — nearly 10,000 more than with MCP. Read, Grep, Bash: patiently reading files, searching for patterns, testing hypotheses. Expensive but thorough.
You're trading the agent's general-purpose exploration for your tool's opinionated shortcuts. When right, you save time and money. When wrong, you've narrowed the search space in a way that costs solutions.
MCP servers should be tested like APIs, not plugins
APIs have contracts. Plugins mostly need to avoid crashing. An MCP server needs to return the right shape of data and fulfill the implicit promise in its tool description.
100% adoption tells us the descriptions were well-calibrated. But adoption isn't helpfulness. The agent always reached for the tools. The question is whether what they returned actually helped.
Compare to the Bash tool's 46% failure rate. When tools fail, agents route around them. MCP tools occupy a middle ground: they can succeed technically (return valid data) while failing practically (returning data that leads the agent astray).
The effect varies by codebase
Pylint: helped. Django and sympy: hurt measurably. Astropy: neutral.
Static code analysis value depends on how a codebase is structured. Clear module boundaries and organized code are easier to analyze statically. Complex, dynamic patterns produce misleading analysis.
Benchmark per-repository, not just in aggregate. An overall resolution rate hides the variance that matters.
Limitations
- Non-determinism. Each task ran once per condition. Multiple repetitions would allow variance estimation and confidence intervals.
- Single model and agent. Claude Sonnet with Claude Code only. Different models or agents may interact with MCP tools differently.
- One MCP server. A single code analysis server. Says nothing about MCP servers in general.
- Runtime overhead not isolated. 12% slower, but we didn't decompose startup, latency, and interpretation time.
- No hyperparameter tuning. Default configurations throughout.
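Even with single runs, a rough binomial confidence interval can be attached to each overall resolution rate. This was not part of the original analysis, and it treats tasks as independent while ignoring the pairing (so it understates the power of the paired design), but it gives a sense of the sampling uncertainty at n=500:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion at ~95% confidence."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

baseline_ci = wilson_ci(249, 500)  # roughly (0.454, 0.542)
mcp_ci = wilson_ci(212, 500)       # roughly (0.381, 0.468)
```

The two intervals overlap, which is exactly why the paired discordant-cell analysis, rather than a comparison of marginal rates, is the right lens for this data.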
References
- Jimenez, C. E., et al. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024.
- Chen, M., et al. (2021). "Evaluating Large Language Models Trained on Code." arXiv:2107.03374.
- Qin, Y., et al. (2023). "ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs." ICLR 2024.
Published by the Supermodel Tools team. mcpbr has since evolved into matchspec, the evaluation layer of the MIST stack.
