Impact analysis is strongest when the question is scoped.
In a large codebase, almost everything is connected if you walk far enough. Not literally everything, but close enough to make the naive version of impact analysis useless. Imports lead to imports, tests touch shared helpers, frameworks hide edges, generated files fan out, and a complete transitive walk eventually becomes a slow way to say "look at the repo."
That is not what an agent needs.
The unscoped version of the product tries to answer:
What does this change affect?
That question is too broad. It invites a giant list, and a giant list is not very useful to an agent.
It is also expensive. Every extra hop through the graph costs time, tokens, and attention. If you keep expanding until the graph is exhausted, you may get high recall, but you lose the thing that made impact analysis useful in the first place: a fast answer about where to start.
The useful version is more precise:
If I change this file, function, or diff, what should an agent inspect or run first?
That is the claim we tested.
Not a vague blast radius.
Not a dump of everything that might be related.
A ranked shortlist for a scoped change.
The Claim
The useful product claim is:
Supermodel gives agents a ranked map of likely validation and inspection targets for a scoped code change.
That wording matters.
An agent needs to spend fewer tool calls finding the right part of the repo. It needs to know which tests are likely to matter. It needs a starting map before it starts editing.
So the benchmark asks a very specific question:
Given the production files changed by a real PR, can Supermodel rank the corresponding validation files better than a simple path/name baseline?
That is small enough to measure.
It is also the engineering tradeoff the product has to make. We want enough graph traversal to find non-obvious validation files, but not so much traversal that every answer becomes a repo-wide search result.
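To make that tradeoff concrete, here is a minimal sketch of hop-limited graph expansion. It is illustrative only: the graph shape, hop limit, and result cap are assumptions, not Supermodel's actual traversal.

```ts
// Illustrative only: a hop-limited walk over a file reference graph.
// The graph shape, hop limit, and result cap are assumptions, not Supermodel's API.
type FileGraph = Map<string, string[]>; // file -> files it references

function boundedImpact(
  graph: FileGraph,
  changed: string[],
  maxHops = 2,
  maxResults = 25,
): string[] {
  const seen = new Set(changed);
  let frontier = [...changed];
  const out: string[] = [];

  for (let hop = 0; hop < maxHops && out.length < maxResults; hop++) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const neighbor of graph.get(file) ?? []) {
        if (seen.has(neighbor)) continue;
        seen.add(neighbor);
        next.push(neighbor);
        out.push(neighbor);
        if (out.length >= maxResults) return out; // stop before "look at the repo"
      }
    }
    frontier = next;
  }
  return out;
}
```

Where you put the hop limit and the cap is exactly the product decision described above: too tight and you miss non-obvious validation files, too loose and every answer is a repo-wide search result.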
Methodology
We evaluated 10 merged PRs from large public repositories. Every PR was merged after the training cutoff of the model we used for agent comparisons, so the benchmark does not depend on a model having memorized the patch.
The PRs were:
- Next.js #93417 - Fix streaming in draft mode for cache components - https://github.com/vercel/next.js/pull/93417
- VS Code #314217 - Fix tool_search bookkeeping when resuming from stateful marker - https://github.com/microsoft/vscode/pull/314217
- MUI #48472 - Fix incorrect role with slotProps.input - https://github.com/mui/material-ui/pull/48472
- Grafana #123935 - Alerting: fix pagination for ungrouped alert rules - https://github.com/grafana/grafana/pull/123935
- React #36047 - Fix FragmentInstance listener leak - https://github.com/facebook/react/pull/36047
- Angular #68512 - Ensure debounced async validators produce pending status during debounce - https://github.com/angular/angular/pull/68512
- Prisma #29512 - Surface unmapped driver errors as user-facing P2039 - https://github.com/prisma/prisma/pull/29512
- Payload #16465 - Stop workflows retrying forever when no retries are configured - https://github.com/payloadcms/payload/pull/16465
- Superset #39504 - Apply full transitive ancestor chain for dependent filters - https://github.com/apache/superset/pull/39504
- Terraform #38338 - Include provider local in generated resource config when set in import - https://github.com/hashicorp/terraform/pull/38338
For each PR, we labeled:
- the production files changed by the PR
- the validation files changed by the PR
Then we ran two rankers:
- A naive path/name baseline.
- Supermodel scoped validation ranking.
The baseline is intentionally dumb. It looks for validation files whose paths or names resemble the changed production files. It does not use graph structure.
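For concreteness, here is a sketch of what a path/name baseline like this might look like. It captures the idea (shared name stems, no graph), not the benchmark's exact script:

```ts
import { basename } from "node:path";

// Approximation of a path/name baseline: no graph structure, just name overlap.
// Treats "foo.test.ts", "foo.spec.tsx", or "foo_test.go" as related to "foo.*".
function stem(file: string): string {
  return basename(file).toLowerCase().split(".")[0].replace(/_(test|spec)$/, "");
}

function baselineCandidates(
  changedProduction: string[],
  repoValidationFiles: string[],
): string[] {
  const stems = new Set(changedProduction.map(stem));
  return repoValidationFiles.filter((file) => stems.has(stem(file)));
}
```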
Supermodel gets the scoped production change and returns ranked validation files. The expected validation files are not used during ranking. They are used only after the ranked list is produced, for scoring.
We scored file-level precision, recall, and F1.
We capped Supermodel at the top 9 validation files per case. That cap matters because otherwise any system can inflate recall by returning half the repo.
The cap is part of the benchmark because it mirrors the product constraint. An agent does not benefit from a theoretically complete list that is too large to act on. The ranking needs to be small enough to run first and broad enough to catch distant but relevant files.
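For clarity, capped file-level scoring reduces to a few lines. This is a sketch of the metric, not the benchmark script itself; the cap of 9 mirrors the constraint above:

```ts
// Sketch of capped file-level scoring for one case.
function scoreCase(ranked: string[], expected: string[], cap = 9) {
  const predicted = ranked.slice(0, cap);
  const correct = predicted.filter((file) => expected.includes(file)).length;
  const precision = predicted.length ? correct / predicted.length : 0;
  const recall = expected.length ? correct / expected.length : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { correct, predicted: predicted.length, precision, recall, f1 };
}
```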
The ranking command was:
node benchmark/agent-impact/run-real-impact-ranking.mjs \
--out-dir target/real-impact-ranking-current \
--scope replay-dirs
The ranking run was separate from the agent run. The ranking benchmark measures whether Supermodel can produce a useful shortlist. The agent run checks whether the replay harness is valid and whether a frontier agent can complete the task under controlled conditions.
Agent Replay Setup
For the agent replay, we used:
- agent: gpt-5.5
- runner: codex-cli 0.128.0
- container: supermodel-agent-impact-go:local
- go: go1.26.2 linux/arm64
- node: v24.15.0
Each arm ran in its own fresh checkout inside Docker. The repository was mounted at /workspace/repo. The prompt and optional impact context were mounted at /workspace/run.
The hidden reference diff was not mounted into the agent container. The prompt also explicitly blocked fetching, checking out, or applying the merged PR patch.
Representative control prompt:
You are repairing a real post-cutoff PR replay benchmark.
Repository: grafana/grafana
PR: #123935 Alerting: fix pagination for ungrouped alert rules
Merged at: 2026-05-04T19:46:34Z
The checkout is at the PR base commit with the PR validation files applied,
but the production fix has been withheld.
Rules:
- Make the verifier pass by fixing production behavior.
- Do not remove, weaken, or rewrite the validation file to hide the failure.
- Do not revert the benchmark baseline commit.
- Do not fetch, inspect, checkout, or apply the PR merge commit or patch.
- Keep the repair as small as possible.
Verifier:
- go test ./pkg/services/ngalert/store -run 'TestIntegration_ListAlertRulesByGroup/should_paginate_with_no-group_rule_group_filter' -count=1
No impact-analysis context is available. Find the affected production files yourself.
When finished, leave the repository in a passing state.
The impact-context arm used the same prompt, with the no-context line replaced by:
Upper-bound file-ranking context is available in IMPACT_ANALYSIS.md and impact-analysis.json.
Treat it as a starting map, then verify in code.
That file-ranking packet is not the headline product claim. It is an upper-bound check: if a correct file-ranking packet is present, can the agent use it without the benchmark leaking the actual patch?
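To make "starting map" concrete, here is a hypothetical shape for that packet. The field names are invented for illustration and are not the actual impact-analysis.json schema:

```ts
// Hypothetical shape for the impact-context packet. Field names are invented
// for illustration; the real impact-analysis.json schema may differ.
interface ImpactAnalysisPacket {
  changedProductionFiles: string[];
  rankedValidationFiles: Array<{
    path: string; // candidate validation file
    rank: number; // 1 = inspect or run first
  }>;
}
```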
Before trusting the agent runs, we ran a verifier preflight. The withheld production state had to fail. The reference production state had to pass.
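The gate itself is simple to state: run the verifier against both states and require opposite outcomes. A sketch, assuming the verifier is an ordinary shell command (the real check lives in the replay script below):

```ts
import { spawnSync } from "node:child_process";

// Sketch of the preflight gate: the withheld state must fail the verifier,
// the reference state must pass it, or the case is rejected.
function verifierExit(repoDir: string, cmd: string, args: string[]): number {
  return spawnSync(cmd, args, { cwd: repoDir, stdio: "inherit" }).status ?? 1;
}

function preflight(withheldDir: string, referenceDir: string, cmd: string, args: string[]) {
  const withheld = verifierExit(withheldDir, cmd, args); // expect non-zero
  const reference = verifierExit(referenceDir, cmd, args); // expect zero
  return { ok: withheld !== 0 && reference === 0, withheld, reference };
}
```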
The preflight command was:
node benchmark/agent-impact/run-post-cutoff-pr-replay.mjs \
--out-dir target/post-cutoff-pr-replay-current \
--preflight-only
| Case | Arm | Withheld production state (verifier exit) | Reference production state (verifier exit) |
|---|---|---|---|
| Grafana #123935 - https://github.com/grafana/grafana/pull/123935 | control | 1 | 0 |
| Grafana #123935 - https://github.com/grafana/grafana/pull/123935 | impact context | 1 | 0 |
| Terraform #38338 - https://github.com/hashicorp/terraform/pull/38338 | control | 1 | 0 |
| Terraform #38338 - https://github.com/hashicorp/terraform/pull/38338 | impact context | 1 | 0 |
Relevant log excerpts:
Grafana withheld state:
--- FAIL: TestIntegration_ListAlertRulesByGroup
--- FAIL: TestIntegration_ListAlertRulesByGroup/should_paginate_with_no-group_rule_group_filter
FAIL github.com/grafana/grafana/pkg/services/ngalert/store
Grafana reference state:
ok github.com/grafana/grafana/pkg/services/ngalert/store 1.877s
Terraform withheld state:
--- FAIL: TestContext2Plan_importResourceConfigGenWithProviderLocalName
FAIL github.com/hashicorp/terraform/internal/terraform
Terraform reference state:
ok github.com/hashicorp/terraform/internal/terraform 0.011s
The Grafana agent replay also completed in both arms:
| Arm | Agent exit status | Verifier exit after agent | Changed-file F1 | Time | Tool calls | Input tokens |
|---|---|---|---|---|---|---|
| no impact context | 0 | 0 | 1.000 | 245s | 30 | 722,856 |
| with impact context | 0 | 0 | 1.000 | 303s | 35 | 1,270,987 |
That run is not the headline win. In Grafana, the verifier already points at a narrow area, so impact context was not faster. The point of including it is methodological: the harness ran inside Docker, the agent did not see the hidden patch, the broken state actually failed, the reference state actually passed, and the agent-produced patch passed the verifier.
Result
| Method | Precision | Recall | F1 | Correct / Expected | Total predicted |
|---|---|---|---|---|---|
| Baseline path/name matcher | 0.060 | 0.286 | 0.099 | 6 / 21 | 100 |
| Supermodel scoped ranking | 0.274 | 0.952 | 0.426 | 20 / 21 | 73 |
That is the short version.
Supermodel found 20 of the 21 labeled validation files; the baseline found 6 of 21. Supermodel also returned fewer total candidates: 73 versus the baseline's 100.
F1 moved from 0.099 to 0.426. That is a 4.3x improvement.
This is a ranking result. Scoped graph ranking found validation files that simple proximity missed.
Per-Repo Performance
| Repo / PR | Expected | Baseline F1 | Supermodel F1 | Supermodel Correct | Supermodel Candidates |
|---|---|---|---|---|---|
| Next.js #93417 - https://github.com/vercel/next.js/pull/93417 | 4 | 0.000 | 0.615 | 4 / 4 | 9 |
| VS Code #314217 - https://github.com/microsoft/vscode/pull/314217 | 1 | 0.182 | 0.333 | 1 / 1 | 5 |
| MUI #48472 - https://github.com/mui/material-ui/pull/48472 | 1 | 0.182 | 1.000 | 1 / 1 | 1 |
| Grafana #123935 - https://github.com/grafana/grafana/pull/123935 | 1 | 0.182 | 0.200 | 1 / 1 | 9 |
| React #36047 - https://github.com/facebook/react/pull/36047 | 1 | 0.000 | 0.200 | 1 / 1 | 9 |
| Angular #68512 - https://github.com/angular/angular/pull/68512 | 1 | 0.000 | 0.400 | 1 / 1 | 4 |
| Prisma #29512 - https://github.com/prisma/prisma/pull/29512 | 5 | 0.133 | 0.714 | 5 / 5 | 9 |
| Payload #16465 - https://github.com/payloadcms/payload/pull/16465 | 5 | 0.000 | 0.571 | 4 / 5 | 9 |
| Superset #39504 - https://github.com/apache/superset/pull/39504 | 1 | 0.182 | 0.200 | 1 / 1 | 9 |
| Terraform #38338 - https://github.com/hashicorp/terraform/pull/38338 | 1 | 0.182 | 0.200 | 1 / 1 | 9 |
The miss was in Payload: a generated type file with limited direct evidence from the scoped production diff.
That is useful to know. It tells us where the ranking still needs work.
What This Does Not Prove
This does not prove that Supermodel can predict every affected file.
It does not prove that all runtime behavior is captured.
It does not prove that agents always finish faster with graph context.
It proves a narrower thing: on this 10-repo benchmark, scoped validation ranking found far more of the labeled validation files than a path/name baseline.
That is the right kind of result. It is measurable, falsifiable, and limited to what the benchmark actually supports.
Objections
"Is this just cherry-picking?"
The honest answer: 10 cases is not enough to settle the question.
That is why the benchmark uses public repos, real merged PRs, fixed scoring, and a dumb baseline. The next step is not to declare victory. The next step is to keep adding repos and keep the same scoring rules.
"Did you use the PR answer key?"
Only for scoring.
The ranker gets the scoped production files. It does not get the expected validation files. After ranking, we compare its output against the validation files from the PR.
"Does post-cutoff matter here?"
It matters most for agent comparisons. We do not want an agent benchmark where the model can solve the task from memory.
For the ranking benchmark, the important point is simpler: the ranker is not using the PR patch or the expected validation files. It gets a scoped change and produces a ranked list. The answer key is only used after the fact.
"Why validation files instead of all impacted files?"
Because "all impacted files" is not well-defined enough for a first benchmark.
Validation files are concrete. A PR either added or changed a unit test, integration test, e2e case, generated validation file, or similar coverage file. That gives us a label we can score.
It is not the whole blast radius. It is the first useful slice.
"Does low precision make this useless?"
No, but it changes the product claim.
At 0.274 precision, this is not an exact answer. It is a run queue. The system is saying: "start here." That is valuable for an agent, especially in large repos, but it should be presented as ranked context, not certainty.
The precision problem is now the main engineering problem. We usually find the right file. We need to move it earlier and return fewer neighbors.
"Why not just run the tests?"
You should run the tests.
The problem is knowing which tests to run first when the repo is large, the suite is expensive, or the failure is not already localized. Impact analysis is most useful before the verifier has handed you the answer.
"Did the baseline really fail on those zero cases?"
Yes.
The zero-F1 baseline cases still produced candidates. They just produced the wrong candidates. Next.js, React, Angular, and Payload were legitimate misses, not harness failures.
"Did the agent run prove impact context helps?"
Not yet.
The Grafana replay proved the harness was valid and the agent could complete the task in both arms. It did not show a speedup from impact context, because the verifier already localized the failure well.
The agent benchmark that matters next is larger and more ambiguous: real PR-sized changes, less direct verifier output, expensive tests, and constrained agent budgets.
What Changed In The Product Claim
The first version of impact analysis was too broad. It mixed structural context, affected source files, and validation targets into one mental bucket.
The benchmark forced a cleaner shape:
- affected source files
- validation files to inspect or run
- broader architectural context
Those are different things.
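One way to keep them separate is to make them distinct fields in the output rather than one merged list. A sketch under assumed names, not the product's actual interface:

```ts
// Illustrative separation of the three buckets; names are assumptions,
// not the product's actual interface.
interface ScopedImpactReport {
  affectedSourceFiles: string[]; // code the change likely touches
  validationFiles: string[]; // ranked: inspect or run these first
  architecturalContext: string[]; // background reading, not a run queue
}
```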
This is also why scoping matters operationally. The graph can keep expanding. The product has to decide when to stop, which evidence is strong enough to rank, and which files belong in background context instead of the first run queue.
For agents, the most immediately useful output is often the second bucket:
These are the validation files most likely to matter.
Run or inspect these first.
That is what the current benchmark measures.
Where This Leaves Us
The useful claim is not:
Supermodel knows everything that will break.
The useful claim is:
For a scoped change, Supermodel can rank validation targets better than path/name matching, and that gives agents a better place to start.
That is enough to justify the next round.
The next work is straightforward:
- Add more repos.
- Add more languages.
- Separate validation ranking from source impact more clearly.
- Improve precision without giving up recall.
- Run larger agent A/B tests where search cost actually matters.
Impact analysis is a ranking problem. The job now is to make the ranking good enough that agents stop wasting time wandering through the wrong parts of the codebase.
