The Question

By 4AM on January 17th, we'd been trying for three weeks to run an evaluation of our Model Context Protocol server. We tried the raw SWE-bench dataset with SWE-agent. We tried adding our MCP tool to OpenHands. Nothing gave us what we needed.

The question was simple: Does our MCP server actually make the agent better, or are we just burning context tokens?

Nobody had a good way to answer this. Every available coding benchmark measures the LLM, not the tools it uses. No existing tool let us isolate the performance impact of introducing an MCP server to an agent.

No one is rigorously testing their MCPs. Developers ship blind, hoping their tools help. Users have no way to compare servers or verify claims.

This isn't a tooling gap. It's an evaluation gap. What you can't measure, you can't improve.

Background

The Model Context Protocol (MCP) lets developers expose tools and data to LLM-based agents through a standardized interface. Since its release, hundreds of MCP servers have appeared for code analysis, database access, web search, and more.

Evaluation lagged behind. SWE-bench (Jimenez et al., 2024) and HumanEval (Chen et al., 2021) evaluate model capabilities, not tool augmentation. ToolBench (Qin et al., 2023) evaluates tool-use broadly but doesn't answer the MCP-specific question: given a fixed model and agent, does adding this particular server improve task completion?

That gap motivated mcpbr.

Methodology

mcpbr is an open-source benchmark runner that isolates the variable of MCP tool augmentation. It spins up Docker containers for each SWE-bench task, injects a headless Claude Code instance with a controlled configuration, and measures performance with and without your MCP server.

We evaluated an MCP server providing static code analysis (two tools exposed to the agent). All runs used Claude Sonnet (claude-sonnet-4-20250514) with a 300-second timeout and a 30-iteration cap per task. 500 tasks, head-to-head: roughly 68 hours of aggregate task runtime.

Results

Resolution Rates (n=500)

Metric            Baseline           With MCP           Change
Tasks resolved    249/500 (49.8%)    212/500 (42.4%)    -14.9% relative

Outcome                 Tasks
Both conditions pass    194
Only MCP passes         18
Only baseline passes    55
Both conditions fail    233

The baseline outperformed MCP on raw resolution rate. MCP helped solve 18 tasks the baseline couldn't, but hurt on 55 tasks.
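The four-way split fully determines the headline rates; as a sanity check:

```python
# Reconstruct the headline numbers from the 2x2 outcome table above.
both_pass, only_mcp, only_baseline, both_fail = 194, 18, 55, 233

n = both_pass + only_mcp + only_baseline + both_fail   # 500 tasks
baseline = both_pass + only_baseline                   # 249 resolved
mcp = both_pass + only_mcp                             # 212 resolved

rel = (mcp - baseline) / baseline
print(f"baseline {baseline/n:.1%}, mcp {mcp/n:.1%}, relative {rel:+.1%}")
# → baseline 49.8%, mcp 42.4%, relative -14.9%
```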

Efficiency Metrics (n=500)

Metric                  Baseline     With MCP     Change
Total cost              $242.12      $205.30      -15.2%
Cost per task           $0.48        $0.41        -15.2%
Cost per resolved task  $0.97        $0.97        0%
Total tokens            3,005,706    2,586,206    -14.0%
Total tool calls        22,745       13,129       -42.3%
Avg runtime per task    231s         258s         +11.7%

42% fewer tool calls. 14% fewer tokens. $37 saved. But 12% slower per task (MCP startup and API latency) and 15% fewer tasks resolved.

Tool Adoption

Metric                               Value
MCP-specific tool calls              1,325 (10.1% of all MCP-condition calls)
Tasks where agent used MCP tools     500/500 (100%)
MCP tool failure rate                0%
Bash failure rate (MCP condition)    46.3%

100% adoption, 0% tool failure. The dominant failure source was Bash errors (missing dependencies in containers), not MCP.
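Failure-rate accounting like the table above reduces to a small aggregation over per-call success records. A sketch (the `(tool, ok)` record format is illustrative, not mcpbr's actual log schema):

```python
from collections import defaultdict

def failure_rates(calls):
    """calls: iterable of (tool_name, succeeded) pairs from run transcripts.
    Returns {tool_name: failure_rate}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for tool, ok in calls:
        totals[tool] += 1
        if not ok:
            failures[tool] += 1
    return {tool: failures[tool] / totals[tool] for tool in totals}
```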

Per-Repository Breakdown

Repository        Tasks   Baseline    With MCP    Delta
django            231     77 (33%)    65 (28%)    -12
sympy             75      59 (79%)    49 (65%)    -10
sphinx-doc        44      22 (50%)    21 (48%)    -1
matplotlib        34      17 (50%)    14 (41%)    -3
scikit-learn      32      23 (72%)    21 (66%)    -2
astropy           22      11 (50%)    11 (50%)    0
pydata/xarray     22      14 (64%)    13 (59%)    -1
pytest            19      14 (74%)    9 (47%)     -5
pylint            10      3 (30%)     4 (40%)     +1
psf/requests      8       7 (88%)     4 (50%)     -3

Baseline won or tied on 11 of 12 repositories (the table shows the ten largest; the remaining two account for three tasks between them). Pylint was the sole exception.

Analysis

The efficiency-resolution tradeoff

The MCP server made the agent more efficient but less effective. 42% fewer tool calls, 14% fewer tokens, 15% fewer tasks resolved.

The MCP tools changed the agent's exploration strategy. With code analysis available, the agent relied on it instead of doing its own searching. Faster and cheaper, but dependent on the MCP server's understanding. When that understanding was incomplete, the agent explored less and missed solutions that brute-force searching would have found.

In the baseline condition, the agent made 22,745 tool calls, nearly 10,000 more than with MCP. Read, Grep, Bash: patiently reading files, searching for patterns, testing hypotheses. Expensive but thorough.

You're trading the agent's general-purpose exploration for your tool's opinionated shortcuts. When right, you save time and money. When wrong, you've narrowed the search space in a way that costs solutions.

MCP servers should be tested like APIs, not plugins

APIs have contracts. Plugins mostly need to avoid crashing. An MCP server needs to return the right shape of data and fulfill the implicit promise in its tool description.

100% adoption tells us the descriptions were well-calibrated. But adoption isn't helpfulness. The agent always reached for the tools. The question is whether what they returned actually helped.

Compare to the Bash tool's 46% failure rate. When tools fail, agents route around them. MCP tools occupy a middle ground: they can succeed technically (return valid data) while failing practically (returning data that leads the agent astray).
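An API-style contract test captures both halves of that middle ground: the technical contract (the response has the right shape) and the practical one (the response actually answers the question). A hypothetical sketch; `validate_symbol_lookup` and the response schema are illustrative, not any real server's API:

```python
def validate_symbol_lookup(response: dict) -> list[str]:
    """Return a list of contract violations (empty list = contract met)."""
    errors = []
    # Technical contract: the right shape of data.
    if not isinstance(response.get("definitions"), list):
        errors.append("definitions must be a list")
        return errors
    for d in response["definitions"]:
        for key in ("file", "line", "kind"):
            if key not in d:
                errors.append(f"definition missing '{key}'")
    # Practical contract: the implicit promise in the tool description.
    # An empty result for a symbol known to exist is technically valid
    # but practically a failure -- the kind a 0% failure rate can't see.
    if not response["definitions"]:
        errors.append("no definitions returned")
    return errors
```

Run tests like this against fixtures where the right answer is known; a server can pass every schema check and still flunk the practical assertions.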

The effect varies by codebase

Pylint: helped. Django and sympy: hurt measurably. Astropy: neutral.

Static code analysis value depends on how a codebase is structured. Clear module boundaries and organized code are easier to analyze statically. Complex, dynamic patterns produce misleading analysis.

Benchmark per-repository, not just in aggregate. An overall resolution rate hides the variance that matters.
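Per-repository breakdowns fall out of the same paired results; a sketch of the aggregation, with an illustrative record format:

```python
from collections import defaultdict

def per_repo_delta(results):
    """results: iterable of (repo, baseline_pass, mcp_pass) triples.
    Returns {repo: (baseline_resolved, mcp_resolved, delta)}."""
    tally = defaultdict(lambda: [0, 0])
    for repo, base_ok, mcp_ok in results:
        tally[repo][0] += base_ok   # booleans sum as 0/1
        tally[repo][1] += mcp_ok
    return {repo: (b, m, m - b) for repo, (b, m) in tally.items()}
```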

Limitations

These results reflect one MCP server (static code analysis, two tools), one model (claude-sonnet-4-20250514), and one benchmark (SWE-bench); a different server, model, or task mix could shift the tradeoff in either direction. The 300-second timeout may penalize the MCP condition, which averaged 27 seconds more per task. And the 46.3% Bash failure rate points to container environment issues that affect both conditions.

References

Chen, M., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
Jimenez, C., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024.
Qin, Y., et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv:2307.16789.

Published by the Supermodel Tools team. mcpbr has since evolved into matchspec, the evaluation layer of the MIST stack.