Impact analysis is strongest when the question is scoped.
In a large codebase, almost everything is connected if you walk far enough. Not literally everything, but close enough to make the naive version of impact analysis useless. Imports lead to imports, tests touch shared helpers, frameworks hide edges, generated files fan out, and a complete transitive walk eventually becomes a slow way to say "look at the repo."
That is not what an agent needs.
The unscoped version of the product tries to answer:
What does this change affect?
That question is too broad. It invites a giant list, and a giant list is not very useful to an agent.
It is also expensive. Every extra hop through the graph costs time, tokens, and attention. If you keep expanding until the graph is exhausted, you may get high recall, but you lose the thing that made impact analysis useful in the first place: a fast answer about where to start.
The useful version is more precise:
If I change this file, function, or diff, what should an agent inspect or run first?
That is the claim we tested.
Not a vague blast radius.
Not a dump of everything that might be related.
A ranked shortlist for a scoped change.
The Claim
The useful product claim is:
Supermodel gives agents a ranked map of likely validation and inspection targets for a scoped code change.
That wording matters.
An agent needs to spend fewer tool calls finding the right part of the repo. It needs to know which tests are likely to matter. It needs a starting map before it starts editing.
So the benchmark asks a very specific question:
Given the production files changed by a real PR, can Supermodel rank the corresponding validation files better than a simple path/name baseline?
That is small enough to measure.
It is also the engineering tradeoff the product has to make. We want enough graph traversal to find non-obvious validation files, but not so much traversal that every answer becomes a repo-wide search result.
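To make that tradeoff concrete, here is a minimal sketch of hop-limited graph expansion. It is illustrative only: the graph shape, hop limit, and result cap are assumptions, not Supermodel's actual traversal.

```ts
// Illustrative only: a hop-limited walk over a file reference graph.
// The graph shape, hop limit, and result cap are assumptions, not Supermodel's API.
type FileGraph = Map<string, string[]>; // file -> files it references

function boundedImpact(
  graph: FileGraph,
  changed: string[],
  maxHops = 2,
  maxResults = 25,
): string[] {
  const seen = new Set(changed);
  let frontier = [...changed];
  const out: string[] = [];

  for (let hop = 0; hop < maxHops && out.length < maxResults; hop++) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const neighbor of graph.get(file) ?? []) {
        if (seen.has(neighbor)) continue;
        seen.add(neighbor);
        next.push(neighbor);
        out.push(neighbor);
        if (out.length >= maxResults) return out; // stop before "look at the repo"
      }
    }
    frontier = next;
  }
  return out;
}
```

Where you put the hop limit and the cap is exactly the product decision described above: too tight and you miss non-obvious validation files, too loose and every answer is a repo-wide search result.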
Methodology
We evaluated 10 merged PRs from large public repositories. Every PR was merged after the training cutoff of the model we used for agent comparisons, so the benchmark does not depend on a model having memorized the patch.
The PRs were:
- Next.js #93417 - Fix streaming in draft mode for cache components - https://github.com/vercel/next.js/pull/93417
- VS Code #314217 - Fix tool_search bookkeeping when resuming from stateful marker - https://github.com/microsoft/vscode/pull/314217
- MUI #48472 - Fix incorrect role with slotProps.input - https://github.com/mui/material-ui/pull/48472
- Grafana #123935 - Alerting: fix pagination for ungrouped alert rules - https://github.com/grafana/grafana/pull/123935
- React #36047 - Fix FragmentInstance listener leak - https://github.com/facebook/react/pull/36047
- Angular #68512 - Ensure debounced async validators produce pending status during debounce - https://github.com/angular/angular/pull/68512
- Prisma #29512 - Surface unmapped driver errors as user-facing P2039 - https://github.com/prisma/prisma/pull/29512
- Payload #16465 - Stop workflows retrying forever when no retries are configured - https://github.com/payloadcms/payload/pull/16465
- Superset #39504 - Apply full transitive ancestor chain for dependent filters - https://github.com/apache/superset/pull/39504
- Terraform #38338 - Include provider local in generated resource config when set in import - https://github.com/hashicorp/terraform/pull/38338
For each PR, we labeled:
- the production files changed by the PR
- the validation files changed by the PR
Then we ran two rankers:
- A naive path/name baseline.
- Supermodel scoped validation ranking.
The baseline is intentionally dumb. It looks for validation files whose paths or names resemble the changed production files. It does not use graph structure.
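For concreteness, here is a sketch of what a path/name baseline like this might look like. It captures the idea (shared name stems, no graph), not the benchmark's exact script:

```ts
import { basename } from "node:path";

// Approximation of a path/name baseline: no graph structure, just name overlap.
// Treats "foo.test.ts", "foo.spec.tsx", or "foo_test.go" as related to "foo.*".
function stem(file: string): string {
  return basename(file).toLowerCase().split(".")[0].replace(/_(test|spec)$/, "");
}

function baselineCandidates(
  changedProduction: string[],
  repoValidationFiles: string[],
): string[] {
  const stems = new Set(changedProduction.map(stem));
  return repoValidationFiles.filter((file) => stems.has(stem(file)));
}
```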
Supermodel gets the scoped production change and returns ranked validation files. The expected validation files are not used during ranking. They are used only after the ranked list is produced, for scoring.
We scored file-level precision, recall, and F1.
We capped Supermodel at the top 9 validation files per case. That cap matters because otherwise any system can inflate recall by returning half the repo.
The cap is part of the benchmark because it mirrors the product constraint. An agent does not benefit from a theoretically complete list that is too large to act on. The ranking needs to be small enough to run first and broad enough to catch distant but relevant files.
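For clarity, capped file-level scoring reduces to a few lines. This is a sketch of the metric, not the benchmark script itself; the cap of 9 mirrors the constraint above:

```ts
// Sketch of capped file-level scoring for one case.
function scoreCase(ranked: string[], expected: string[], cap = 9) {
  const predicted = ranked.slice(0, cap);
  const correct = predicted.filter((file) => expected.includes(file)).length;
  const precision = predicted.length ? correct / predicted.length : 0;
  const recall = expected.length ? correct / expected.length : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { correct, predicted: predicted.length, precision, recall, f1 };
}
```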
The ranking command was:
node benchmark/agent-impact/run-real-impact-ranking.mjs \
--out-dir target/real-impact-ranking-current \
--scope replay-dirs
The ranking run was separate from the agent run. The ranking benchmark measures whether Supermodel can produce a useful shortlist. The agent run checks whether the replay harness is valid and whether a frontier agent can complete the task under controlled conditions.
Agent Replay Setup
For the agent replay, we used:
- agent: gpt-5.5
- runner: codex-cli 0.128.0
- container: supermodel-agent-impact-go:local
- go: go1.26.2 linux/arm64
- node: v24.15.0
Each arm ran in its own fresh checkout inside Docker. The repository was mounted at /workspace/repo. The prompt and optional impact context were mounted at /workspace/run.
The hidden reference diff was not mounted into the agent container. The prompt also explicitly blocked fetching, checking out, or applying the merged PR patch.
Representative control prompt:
You are repairing a real post-cutoff PR replay benchmark.
Repository: grafana/grafana
PR: #123935 Alerting: fix pagination for ungrouped alert rules
Merged at: 2026-05-04T19:46:34Z
The checkout is at the PR base commit with the PR validation files applied,
but the production fix has been withheld.
Rules:
- Make the verifier pass by fixing production behavior.
- Do not remove, weaken, or rewrite the validation file to hide the failure.
- Do not revert the benchmark baseline commit.
- Do not fetch, inspect, checkout, or apply the PR merge commit or patch.
- Keep the repair as small as possible.
Verifier:
- go test ./pkg/services/ngalert/store -run 'TestIntegration_ListAlertRulesByGroup/should_paginate_with_no-group_rule_group_filter' -count=1
No impact-analysis context is available. Find the affected production files yourself.
When finished, leave the repository in a passing state.
The impact-context arm used the same prompt, with the no-context line replaced by:
Upper-bound file-ranking context is available in IMPACT_ANALYSIS.md and impact-analysis.json.
Treat it as a starting map, then verify in code.
That file-ranking packet is not the headline product claim. It is an upper-bound check: if a correct file-ranking packet is present, can the agent use it without the benchmark leaking the actual patch?
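To make "starting map" concrete, here is a hypothetical shape for that packet. The field names are invented for illustration and are not the actual impact-analysis.json schema:

```ts
// Hypothetical shape for the impact-context packet. Field names are invented
// for illustration; the real impact-analysis.json schema may differ.
interface ImpactAnalysisPacket {
  changedProductionFiles: string[];
  rankedValidationFiles: Array<{
    path: string; // candidate validation file
    rank: number; // 1 = inspect or run first
  }>;
}
```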
Before trusting the agent runs, we ran a verifier preflight. The withheld production state had to fail. The reference production state had to pass.
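The gate itself is simple to state: run the verifier against both states and require opposite outcomes. A sketch, assuming the verifier is an ordinary shell command (the real check lives in the replay script below):

```ts
import { spawnSync } from "node:child_process";

// Sketch of the preflight gate: the withheld state must fail the verifier,
// the reference state must pass it, or the case is rejected.
function verifierExit(repoDir: string, cmd: string, args: string[]): number {
  return spawnSync(cmd, args, { cwd: repoDir, stdio: "inherit" }).status ?? 1;
}

function preflight(withheldDir: string, referenceDir: string, cmd: string, args: string[]) {
  const withheld = verifierExit(withheldDir, cmd, args); // expect non-zero
  const reference = verifierExit(referenceDir, cmd, args); // expect zero
  return { ok: withheld !== 0 && reference === 0, withheld, reference };
}
```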
The preflight command was:
node benchmark/agent-impact/run-post-cutoff-pr-replay.mjs \
--out-dir target/post-cutoff-pr-replay-current \
--preflight-only
| Case | Arm | Withheld production state (verifier exit) | Reference production state (verifier exit) |
|---|---|---|---|
| Grafana #123935 - https://github.com/grafana/grafana/pull/123935 | control | 1 | 0 |
| Grafana #123935 - https://github.com/grafana/grafana/pull/123935 | impact context | 1 | 0 |
| Terraform #38338 - https://github.com/hashicorp/terraform/pull/38338 | control | 1 | 0 |
| Terraform #38338 - https://github.com/hashicorp/terraform/pull/38338 | impact context | 1 | 0 |
Relevant log excerpts:
Grafana withheld state:
--- FAIL: TestIntegration_ListAlertRulesByGroup
--- FAIL: TestIntegration_ListAlertRulesByGroup/should_paginate_with_no-group_rule_group_filter
FAIL github.com/grafana/grafana/pkg/services/ngalert/store
Grafana reference state:
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.877s
Terraform withheld state:
--- FAIL: TestContext2Plan_importResourceConfigGenWithProviderLocalName
FAIL github.com/hashicorp/terraform/internal/terraform
Terraform reference state:
ok github.com/hashicorp/terraform/internal/terraform 0.011s
The Grafana agent replay also completed in both arms:
| Arm | Agent exit status | Verifier exit after agent | Changed-file F1 | Time | Tool calls | Input tokens |
|---|---|---|---|---|---|---|
| no impact context | 0 | 0 | 1.000 | 245s | 30 | 722,856 |
| with impact context | 0 | 0 | 1.000 | 303s | 35 | 1,270,987 |
That run is not the headline win. In Grafana, the verifier already points at a narrow area, so impact context was not faster. The point of including it is methodological: the harness ran inside Docker, the agent did not see the hidden patch, the broken state actually failed, the reference state actually passed, and the agent-produced patch passed the verifier.
Result
| Method | Precision | Recall | F1 | Correct / Expected | Total predicted |
|---|---|---|---|---|---|
| Baseline path/name matcher | 0.060 | 0.286 | 0.099 | 6 / 21 | 100 |
| Supermodel scoped ranking | 0.274 | 0.952 | 0.426 | 20 / 21 | 73 |
That is the short version.
Supermodel found 20 of the 21 labeled validation files; the baseline found 6 of 21. Supermodel also returned fewer total candidates: 73 versus the baseline's 100.
F1 moved from 0.099 to 0.426. That is a 4.3x improvement.
This is a ranking result. Scoped graph ranking found validation files that simple proximity missed.
Per-Repo Performance
| Repo / PR | Expected | Baseline F1 | Supermodel F1 | Supermodel Correct | Supermodel Candidates |
|---|---|---|---|---|---|
| Next.js #93417 - https://github.com/vercel/next.js/pull/93417 | 4 | 0.000 | 0.615 | 4 / 4 | 9 |
| VS Code #314217 - https://github.com/microsoft/vscode/pull/314217 | 1 | 0.182 | 0.333 | 1 / 1 | 5 |
| MUI #48472 - https://github.com/mui/material-ui/pull/48472 | 1 | 0.182 | 1.000 | 1 / 1 | 1 |
| Grafana #123935 - https://github.com/grafana/grafana/pull/123935 | 1 | 0.182 | 0.200 | 1 / 1 | 9 |
| React #36047 - https://github.com/facebook/react/pull/36047 | 1 | 0.000 | 0.200 | 1 / 1 | 9 |
| Angular #68512 - https://github.com/angular/angular/pull/68512 | 1 | 0.000 | 0.400 | 1 / 1 | 4 |
| Prisma #29512 - https://github.com/prisma/prisma/pull/29512 | 5 | 0.133 | 0.714 | 5 / 5 | 9 |
| Payload #16465 - https://github.com/payloadcms/payload/pull/16465 | 5 | 0.000 | 0.571 | 4 / 5 | 9 |
| Superset #39504 - https://github.com/apache/superset/pull/39504 | 1 | 0.182 | 0.200 | 1 / 1 | 9 |
| Terraform #38338 - https://github.com/hashicorp/terraform/pull/38338 | 1 | 0.182 | 0.200 | 1 / 1 | 9 |
The miss was in Payload: a generated type file with limited direct evidence from the scoped production diff.
That is useful to know. It tells us where the ranking still needs work.
What This Does Not Prove
This does not prove that Supermodel can predict every affected file.
It does not prove that all runtime behavior is captured.
It does not prove that agents always finish faster with graph context.
It proves a narrower thing: on this 10-repo benchmark, scoped validation ranking found far more of the labeled validation files than a path/name baseline.
That is the right kind of result. It is measurable, falsifiable, and limited to what the benchmark actually supports.
Objections
"Is this just cherry-picking?"
The honest answer: 10 cases is not enough to settle the question.
That is why the benchmark uses public repos, real merged PRs, fixed scoring, and a dumb baseline. The next step is not to declare victory. The next step is to keep adding repos and keep the same scoring rules.
"Did you use the PR answer key?"
Only for scoring.
The ranker gets the scoped production files. It does not get the expected validation files. After ranking, we compare its output against the validation files from the PR.
"Does post-cutoff matter here?"
It matters most for agent comparisons. We do not want an agent benchmark where the model can solve the task from memory.
For the ranking benchmark, the important point is simpler: the ranker is not using the PR patch or the expected validation files. It gets a scoped change and produces a ranked list. The answer key is only used after the fact.
"Why validation files instead of all impacted files?"
Because "all impacted files" is not well-defined enough for a first benchmark.
Validation files are concrete. A PR either added or changed a unit test, integration test, e2e case, generated validation file, or similar coverage file. That gives us a label we can score.
It is not the whole blast radius. It is the first useful slice.
"Does low precision make this useless?"
No, but it changes the product claim.
At 0.274 precision, this is not an exact answer. It is a run queue. The system is saying: "start here." That is valuable for an agent, especially in large repos, but it should be presented as ranked context, not certainty.
The precision problem is now the main engineering problem. We usually find the right file. We need to move it earlier and return fewer neighbors.
"Why not just run the tests?"
You should run the tests.
The problem is knowing which tests to run first when the repo is large, the suite is expensive, or the failure is not already localized. Impact analysis is most useful before the verifier has handed you the answer.
"Did the baseline really fail on those zero cases?"
Yes.
The zero-F1 baseline cases still produced candidates. They just produced the wrong candidates. Next.js, React, Angular, and Payload were legitimate misses, not harness failures.
"Did the agent run prove impact context helps?"
Not yet.
The Grafana replay proved the harness was valid and the agent could complete the task in both arms. It did not show a speedup from impact context, because the verifier already localized the failure well.
The agent benchmark that matters next is larger and more ambiguous: real PR-sized changes, less direct verifier output, expensive tests, and constrained agent budgets.
What Changed In The Product Claim
The first version of impact analysis was too broad. It mixed structural context, affected source files, and validation targets into one mental bucket.
The benchmark forced a cleaner shape:
- affected source files
- validation files to inspect or run
- broader architectural context
Those are different things.
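One way to keep them separate is to make them distinct fields in the output rather than one merged list. A sketch under assumed names, not the product's actual interface:

```ts
// Illustrative separation of the three buckets; names are assumptions,
// not the product's actual interface.
interface ScopedImpactReport {
  affectedSourceFiles: string[]; // code the change likely touches
  validationFiles: string[]; // ranked: inspect or run these first
  architecturalContext: string[]; // background reading, not a run queue
}
```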
This is also why scoping matters operationally. The graph can keep expanding. The product has to decide when to stop, which evidence is strong enough to rank, and which files belong in background context instead of the first run queue.
For agents, the most immediately useful output is often the second bucket:
These are the validation files most likely to matter.
Run or inspect these first.
That is what the current benchmark measures.
Where This Leaves Us
The useful claim is not:
Supermodel knows everything that will break.
The useful claim is:
For a scoped change, Supermodel can rank validation targets better than path/name matching, and that gives agents a better place to start.
That is enough to justify the next round.
The next work is straightforward:
- Add more repos.
- Add more languages.
- Separate validation ranking from source impact more clearly.
- Improve precision without giving up recall.
- Run larger agent A/B tests where search cost actually matters.
Impact analysis is a ranking problem. The job now is to make the ranking good enough that agents stop wasting time wandering through the wrong parts of the codebase.
