By Grey Newell, CTO at Supermodel

For engineers evaluating async job architectures or considering similar trade-offs. Our API processes customer codebases with tree-sitter and LLMs to produce structural graphs. The original prototype took 10-15 minutes per synchronous request. We redesigned the entire API to scale asynchronously in one calendar week (implementation and initial production deployment). Here is how the architecture works, why we made each decision, and what we traded off.

Business challenge

Our prototype ran 2 runtimes (JVM + Node.js) in 1 Docker container, sharing a single synchronous HTTP request path.

A concrete failure scenario: a client sends POST /v1/graphs/call with a 50MB zip archive. 8 minutes into processing, the client's load balancer times out. The work is lost. The client retries. Now 2 copies of the job run simultaneously, consuming 2x compute, and the client has no way to deduplicate or retrieve the first result.

Design tenets

Four principles. The team shipped the redesign in one calendar week, so each principle had to be simple to implement.

Solution overview

Two independently deployed runtimes connected by a shared Postgres database (Citus) and Azure Blob Storage container.

flowchart LR
    subgraph client [Client]
        SDK["SupermodelClient\n(TypeScript SDK)"]
    end

    subgraph controlPlane ["Control Plane (Java / Spring Boot)"]
        Auth["API Key Validation\n+ Subscription Check"]
        JobCreate["Job Creation\n+ Idempotency"]
        Poll["Poll Handler\n(200 / 202)"]
    end

    subgraph sharedInfra ["Shared Infrastructure"]
        DB[("Postgres (Citus)\njobs table")]
        Blob[("Azure Blob\nzip payloads")]
    end

    subgraph dataPlane ["Data Plane (TypeScript / Node.js)"]
        Worker["Job Worker\n(poll loop)"]
        TreeSitter["Tree-sitter Parser"]
        LLM["LLM Service\n(OpenRouter)"]
    end

    SDK -->|"POST /v1/graphs/call\n+ Idempotency-Key"| Auth
    Auth --> JobCreate
    JobCreate -->|"INSERT status=pending"| DB
    JobCreate -->|"upload zip"| Blob
    SDK -->|"re-POST same key\n(poll)"| Poll
    Poll -->|"SELECT job"| DB

    Worker -->|"UPDATE SET status=processing\nFOR UPDATE SKIP LOCKED"| DB
    Worker -->|"download zip"| Blob
    Worker --> TreeSitter
    Worker --> LLM
    Worker -->|"UPDATE SET status=completed\nresult = jsonb"| DB

Why Postgres instead of Kafka, SQS, or Redis?

We use Azure Cosmos DB for PostgreSQL (Citus). Coordinator-only today (node_count=0). Distribution key user_id is set on 4 tables; workers can be added without schema migration. No cross-shard queries. The jobs table with FOR UPDATE SKIP LOCKED gives you exactly-once claiming, transactional state updates, and zero new infrastructure. Trade-offs: 1-30 second polling latency vs. sub-second push from a dedicated broker; table growth and VACUUM; no built-in backpressure. For our workload (minutes per job, moderate volume), acceptable.

Why no separate /jobs/{id}/status endpoint?

The client re-POSTs the same request with the same idempotency key. The server returns the existing job. Polling is submission. Trade-off: less discoverable than a dedicated status endpoint.
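The pattern can be sketched end to end in TypeScript (illustrative names, not the shipped SDK; the transport function is injected so the submit-is-poll loop stays visible):

```typescript
// Sketch of "polling is submission". The same POST, with the same Idempotency-Key,
// both submits the job and polls it: `post` is assumed to re-send an identical
// request on every call. Names and shapes here are illustrative.
type JobResponse = { status: string; result?: unknown; retryAfter?: number };

async function submitAndWait(
  post: () => Promise<JobResponse>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 90,
): Promise<unknown> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const r = await post(); // first call creates the job; later calls return the existing one
    if (r.status === "completed") return r.result;
    if (r.status === "failed") throw new Error("job failed");
    await sleep((r.retryAfter ?? 10) * 1000);
  }
  throw new Error("polling timed out");
}
```

Because the first call and every poll are literally the same request, the client needs no job ID bookkeeping; the idempotency key is the handle.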

Why blob storage for payloads?

Zip archives are too large for a database column. The control plane uploads to Azure Blob, stores the URL in the job row, the data plane downloads it, and the blob is deleted after processing.

The job lifecycle

A single request, start to finish.

Phase 1: Submission (control plane, ~50ms)

  1. API key validation (Caffeine cache, 10k entries, 5-min TTL) and HMAC verification with constant-time comparison.

  2. Subscription status check (Caffeine cache, 10k entries, 30-second TTL).

  3. JobService.getOrCreateJob() queries by (idempotency_key, user_id, api_key_id). If found: returns the existing job with 0 new work. If not found: computes SHA-256 of the zip, uploads to Azure Blob, inserts a row with status='pending' and blob_expires_at = now + 1 hour.

  4. Returns HTTP 202 Accepted with Retry-After: 10 (configurable per operation, default 10 seconds).

Input validation: Zip size limit 500MB (multipart). Path traversal in zips is blocked. No zip-bomb or nested-archive validation.
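The constant-time comparison in step 1 matters: naive string equality returns early at the first differing byte, which leaks timing information an attacker can use to forge signatures byte by byte. A TypeScript sketch of the technique (the real filter is Java/Spring; the function name and parameters here are illustrative, not the shipped code):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of constant-time HMAC-SHA256 signature verification.
function verifySignature(secret: string, payload: string, providedHex: string): boolean {
  const expected = createHmac("sha256", secret).update(payload).digest();
  const provided = Buffer.from(providedHex, "hex"); // invalid hex yields a short buffer
  // timingSafeEqual throws on length mismatch, so reject mismatched lengths up front.
  if (provided.length !== expected.length) return false;
  // Constant-time comparison: runtime does not depend on where the buffers differ.
  return timingSafeEqual(expected, provided);
}
```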

The idempotent job-creation logic, simplified below. JobRepository.save() catches the unique-constraint violation (SQLSTATE 23505) and returns null instead of throwing.

// JobService.getOrCreateJob (simplified)
Optional<Job> existing = jobRepository.findByIdempotencyKey(idempotencyKey, userId, apiKeyId);
if (existing.isPresent()) return existing.get();

blobConnector.uploadZip(jobId, fileBytes);
UUID savedId = jobRepository.save(...);

if (savedId == null) {
    // Unique constraint: concurrent request won. Delete our blob, return theirs.
    blobConnector.deleteZip(jobId);
    return jobRepository.findByIdempotencyKey(idempotencyKey, userId, apiKeyId).orElseThrow();
}
return jobRepository.findById(savedId).orElseThrow();

Two concurrent requests with the same key both upload and attempt INSERT. One wins. The loser catches the unique constraint violation, deletes its blob, returns the winner's job. If the loser's deleteZip() fails (e.g. blob timeout), the blob is orphaned (no job row). Azure lifecycle deletes it within 24 hours. No dedicated orphan-cleanup job.

Phase 2: Processing (data plane, seconds to minutes)

  1. JobWorkerService.pollLoop() polls every 1-30 seconds (adaptive: x2 backoff on failure, halves after 3 consecutive successes).

  2. findPendingJobs() atomically claims work:

UPDATE jobs SET status = 'processing', started_at = NOW()
WHERE id IN (
    SELECT id FROM jobs
    WHERE status = 'pending' AND blob_expires_at > NOW()
    ORDER BY created_at ASC LIMIT $1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;

Up to 4 jobs claimed per cycle (SUPERMODEL_JOB_CONCURRENCY). FOR UPDATE SKIP LOCKED means multiple replicas poll concurrently with zero contention; each claims different rows. Jobs cannot run longer than 30 minutes (zombie reaper). Blob TTL (60 min) only affects pending jobs; once claimed, the data plane deletes the blob on completion.

  3. Downloads zip from blob (3 retries, 1s initial delay, 10s max). Extracts via ZipHydratorService to a temp directory.

  4. Parses with tree-sitter. Calls LLMs via OpenRouter if needed.

  5. Writes the result: UPDATE jobs SET status = 'completed', result = $1::jsonb, blob_url = NULL, ... WHERE id = $3. DB write retries: 3 attempts, 500ms initial delay, 5s max, exponential backoff with 0-50% jitter.

  6. Deletes blob and temp directory in the finally block.

Phase 3: Retrieval (client polls, ~50ms per poll)

The client re-POSTs with the same Idempotency-Key. The control plane loads the job and branches:

if (job.isCompleted()) return ResponseEntity.ok(response.withResult(...));
if (job.isFailed()) return ResponseEntity.ok(response.withError(...));
return ResponseEntity.status(202).header("Retry-After", String.valueOf(retryAfter)).body(response);

completed = 200 OK with result. failed = 200 OK with error. Anything else = 202 Accepted with Retry-After.

The client SDK reads retryAfter, sleeps, re-posts:

let attempt = 0;
const deadline = Date.now() + timeoutMs; // 15-minute default
while (attempt++ < maxPollingAttempts && Date.now() < deadline) {
    const response = await apiCall();
    if (response.status === 'completed') return response.result;
    if (response.status === 'failed') throw new JobFailedError(...);
    await sleep((response.retryAfter ?? 10) * 1000);
}

SDK uses 15-minute default timeout, 90 max attempts, and falls back to 10s if retryAfter is missing. The caller sees none of this. They call client.generateCallGraph(file, { idempotencyKey }) and get a result.

Idempotency in detail

Idempotency-Key is required on every data-plane operation (graph requests), enforced at the control plane by ApiKeyAuthFilter. The SDK generates one via crypto.randomUUID().

The jobs table enforces uniqueness: UNIQUE(idempotency_key, user_id, api_key_id) (with partial indexes for NULL api_key_id in bearer-token auth). Same key + same user + same API key returns the existing job regardless of request content. We do not validate that the zip hash matches; first submission wins.

There is no /jobs/{id}/status endpoint. The client re-POSTs with the same idempotency key to poll (1 endpoint, 1 code path, 1 auth check per poll). Every poll re-validates the API key and subscription status. Revocation invalidates cache on the instance that processes the revoke; other replicas may serve cached entries for up to 5 minutes (cache TTL).
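The cache semantics are the crux of that revocation window. A minimal in-process TTL cache sketch in TypeScript (the real control plane uses Caffeine in Java; this illustrates only the TTL and size-bound behavior, with a naive eviction instead of Caffeine's policy):

```typescript
// Minimal TTL cache: entries expire lazily on read; a crude size bound evicts
// the oldest insertion. Names are illustrative, not the shipped code.
class TtlCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number, private maxEntries = 10_000) {}

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key); // lazy expiry on read
      return undefined;
    }
    return entry.value;
  }

  set(key: K, value: V): void {
    if (!this.entries.has(key) && this.entries.size >= this.maxEntries) {
      // Naive eviction: drop the oldest insertion (Map preserves insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

The TTL is exactly the staleness bound: a revoked key stays valid on replicas that cached it until its entry expires, which is why the 5-minute API-key TTL defines the revocation window.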

How we process code without retention

Where does your code go, and when is it deleted?

Customer source code is deleted from every storage layer after processing. 4 independent cleanup mechanisms cover crash scenarios. Worst-case retention: 60 minutes (blob hard TTL). Typical: seconds.

flowchart TD
    subgraph upload ["1. Upload (Control Plane)"]
        CP["Client uploads zip"]
        Blob["Blob Storage\n(TTL: 1 hour)"]
        DB_pending["jobs.blob_url = url\njobs.blob_expires_at = now + 1h"]
    end

    subgraph process ["2. Process (Data Plane)"]
        Download["Download zip from blob"]
        Extract["Extract to temp dir\n/__processing/repoId-uuid/"]
        Parse["Tree-sitter parse\n(in-memory graph)"]
    end

    subgraph cleanup ["3. Cleanup (immediate)"]
        MarkDone["markCompleted/markFailed\nblob_url = NULL"]
        DeleteBlob["deleteBlob: jobId.zip"]
        DeleteDisk["hydration.cleanup:\nfs.remove targetDir"]
    end

    subgraph expire ["4. Expiry (scheduled)"]
        JobCleanup["JobCleanupService\ndeletes expired rows"]
        ZombieReap["Zombie reaper\nmarks stuck jobs failed"]
        BlobExpiry["Expired pending jobs\nmarked failed"]
    end

    CP --> Blob --> DB_pending
    DB_pending --> Download --> Extract --> Parse
    Parse --> MarkDone
    MarkDone --> DeleteBlob
    MarkDone --> DeleteDisk
    DeleteBlob --> JobCleanup
    DeleteDisk --> ZombieReap
    ZombieReap --> BlobExpiry

Layer 1: Blob storage

The 60-minute TTL is enforced by application logic (blob_expires_at, expired-pending cleanup). Azure Blob lifecycle policy is a 24-hour safety net (staging/production); Azure does not support hour-level TTL natively. On upload, blob_expires_at = NOW() + 1 hour. On completion or failure, the data plane deletes the blob (3 retries, 1s initial delay, 10s max) and sets blob_url = NULL.

If the DB write fails, the blob is intentionally preserved so the zombie reaper can identify the orphaned job: if (statusRecorded) await deleteBlob(...); in the finally block.

Layer 2: Disk

Extracted files live in /__processing/{repoId}-{uuid}/. cleanup() runs fs.remove(targetDir) in the finally block on all paths. RETAIN_HYDRATED_REPOS exists only for local development; not set in any deployed environment.

Layer 3: Database

The result column contains JSONB structural metadata (file paths, function signatures, dependency edges, line numbers), not source code. File paths and function signatures may expose repo structure; we do not store source code. No application-level limit on JSONB result size; Postgres TOAST applies. Large graphs may impact WAL and backups.

Completed jobs: deleted after 24 hours. Failed jobs: deleted after 7 days. JobCleanupService runs hourly: @Scheduled(cron = "0 0 * * * ?") calls jobRepository.deleteExpired(). Scheduled jobs run on every control-plane replica. No distributed lock (e.g. ShedLock). Cleanup is idempotent; usage rollups use ON CONFLICT for deduplication.

Layer 4: Defense-in-depth

3 independent mechanisms catch anything the primary cleanup misses:

Zombie reaper (data plane, every poll cycle). Marks jobs stuck in processing for > 30 minutes as failed and sets blob_url = NULL:

UPDATE jobs SET status = 'failed', blob_url = NULL
WHERE status = 'processing' AND started_at <= $1;

Expired pending cleanup (control plane, every 60 seconds via @Scheduled(cron = "0 * * * * ?")). Marks pending jobs with blob_expires_at <= NOW() as failed.

Expired pending cleanup (data plane, every poll cycle). Runs the same query, independently, in case the control plane is down.

The data plane has 0 ingress (ingress_enabled = false in Azure Container Apps). It receives 0 external HTTP requests. It pulls work from the database, processes in memory, writes structural metadata back, and deletes all source code artifacts. Stateless by construction, not by policy.

Retention summary

| Artifact | Typical retention | Worst-case retention | Cleanup mechanism |
| --- | --- | --- | --- |
| Blob (customer zip) | Seconds (deleted on job completion) | 60 minutes (hard TTL) | deleteBlob() + expired pending cleanup |
| Extracted files on disk | Seconds (deleted in finally block) | Container lifetime (crash = container replaced) | hydration.cleanup() via fs.remove() |
| blob_url pointer in DB | Seconds (NULLed on completion/failure) | 30 minutes (zombie reaper threshold) | markCompleted/markFailed SQL |
| Job result (structural metadata, not source code) | 24 hours (completed) / 7 days (failed) | Same | JobCleanupService.cleanupExpiredJobs() hourly |
| Orphan/zombie jobs | 30 minutes | 60 minutes | Zombie reaper + expired pending cleanup (both planes) |

Failure modes and automated recovery

| Failure | Automated response | Recovery time |
| --- | --- | --- |
| Client disconnects mid-poll | Job continues processing. Client re-POSTs same key to retrieve result. | 0; job is unaffected |
| Data plane container crashes | Zombie reaper marks jobs in processing > 30 min as failed, clears blob_url. Container orchestrator (ACA) restarts the replica. Client may wait up to 30 minutes to learn of a failed job. | 30 minutes (zombie threshold) |
| markFailed DB write fails after crash | Blob is intentionally preserved. Zombie reaper catches the orphan on next poll cycle. | 30 minutes |
| Blob expires before processing starts | findPendingJobs skips jobs with blob_expires_at <= NOW(). Both planes independently mark them failed. | 60 minutes (blob TTL) |
| Postgres transient failure (connection reset, 53xxx) | Both planes retry: 3 attempts, 500ms initial delay, 5s max, exponential backoff with 0-50% jitter. | Seconds |
| Blob storage transient failure (timeout, 5xx) | 3 retries, 1s initial delay, 10s max. 404 (BlobNotFoundError) is not retried. | Seconds |
| Concurrent duplicate submissions | Unique constraint on (idempotency_key, user_id, api_key_id). Loser deletes its blob, returns winner's job. | 0; no duplicate work |
| API key revoked during processing | Job completes (work already started), but next poll re-validates the key and returns 401 Unauthorized. Revoked key cannot retrieve results. | Immediate |
| Control plane down | No new jobs, no polling. Data plane continues processing. | Until control plane recovers |

0 of these failure modes require manual intervention, and 0 result in customer code persisting beyond the cleanup window. That guarantee covers only the failures listed above: a Postgres outage that outlasts the retry window, an extended control-plane outage, or a schema migration may still require manual intervention. See Limitations.
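The retry policy used for transient Postgres and blob failures above can be sketched as one generic helper. Defaults here mirror the DB parameters (3 attempts, 500ms initial delay, 5s cap, 0-50% jitter); the function name and shape are illustrative, not the shipped code:

```typescript
// Retry with capped exponential backoff and 0-50% additive jitter.
async function withRetries<T>(
  fn: () => Promise<T>,
  attempts = 3,
  initialDelayMs = 500,
  maxDelayMs = 5_000,
): Promise<T> {
  let delayMs = initialDelayMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts) throw err; // out of attempts: surface the last error
      const jittered = delayMs * (1 + Math.random() * 0.5); // add 0-50% jitter
      await new Promise((resolve) => setTimeout(resolve, Math.min(jittered, maxDelayMs)));
      delayMs = Math.min(delayMs * 2, maxDelayMs); // exponential backoff, capped
    }
  }
}
```

The jitter is what keeps replicas from retrying in lockstep against a recovering database; without it, all workers that failed together retry together.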

Separation of concerns

| Responsibility | Control Plane (Java/Spring Boot) | Data Plane (TypeScript/Node.js) |
| --- | --- | --- |
| API key validation (HMAC + Caffeine cache) | Yes | No |
| Subscription enforcement (Stripe) | Yes | No |
| Usage metering and billing | Yes | No |
| Job creation and idempotency | Yes | No |
| HTTP ingress (public endpoint) | Yes (port 8080) | No (ingress_enabled = false) |
| Tree-sitter parsing | No | Yes |
| LLM calls (OpenRouter, Google AI) | No | Yes |
| Graph construction (in-memory) | No | Yes |
| Job claiming (FOR UPDATE SKIP LOCKED) | No | Yes |
| Blob download/deletion | No | Yes |

Shared interface: 1 Postgres database (jobs table) + 1 Azure Blob container (job-payloads, naming {jobId}.zip). That is the entire contract. 0 RPC calls. 0 shared code. 0 protobuf schemas. Each runtime has its own Dockerfile, CI pipeline, and Azure Container App.

Observability

Application Insights (control and data plane). Data plane emits job events (completed/failed), duration, success/failure counts, retries, poll interval. Structured JSON logging with correlation IDs. No OpenTelemetry; no Prometheus/CloudWatch. We do not publish formal SLOs. Control-plane target: sub-100ms P99 for auth and job creation.

Deployment

Single revision, 100% traffic. No blue/green or canary. In-flight jobs in a crashed container are failed by the zombie reaper after 30 minutes.

Why Java for the control plane?

The control plane does not process code. It validates keys, checks subscriptions, creates database rows, and returns HTTP responses. Java/Spring Boot is built for this.

We use OpenAPI-first code generation (same spec generates the TypeScript SDK), Spring Security filter chain (OAuth2, API key auth, CSRF, Stripe webhooks), @Scheduled cron jobs (4 tasks: expired cleanup, job deletion, usage reports), AOP for usage metering and scope checks, and Caffeine caches for API keys and subscription status. The data plane is TypeScript because tree-sitter ships Node.js bindings and the work is I/O-bound. We had a Spring Boot veteran on the team from day one. Each runtime handles what it is best at. 64 source files, ~8,200 lines (significant portion generated). Every request <100ms.

Scaling constraints and trade-offs

| Parameter | Default | Configurable via | Notes |
| --- | --- | --- | --- |
| Poll interval (base) | 1,000ms | SUPERMODEL_JOB_POLL_INTERVAL_MS | Minimum latency between job creation and pickup |
| Poll interval (max) | 30,000ms | SUPERMODEL_JOB_MAX_POLL_INTERVAL_MS | Reached after consecutive failures (x2 backoff) |
| Poll recovery | Halves after 3 successes | Hardcoded | Returns to base interval |
| Concurrency per replica | 4 jobs | SUPERMODEL_JOB_CONCURRENCY | More replicas = linear scale |
| Blob TTL | 1 hour | SUPERMODEL_JOB_TTL_BLOB_HOURS | Unprocessed jobs fail after this |
| Completed job TTL | 24 hours | SUPERMODEL_JOB_TTL_COMPLETED_HOURS | Client must retrieve results within this window |
| Failed job TTL | 7 days | SUPERMODEL_JOB_TTL_FAILED_DAYS | For debugging and support |
| Zombie threshold | 30 minutes | Hardcoded | Jobs in processing longer than this are failed |
| API key cache | 10,000 keys, 5-min TTL | Hardcoded | Caffeine in-process cache |
| Subscription cache | 10,000 entries, 30-sec TTL | Hardcoded | Caffeine in-process cache |
| DB retries | 3 attempts, 500ms-5s backoff | Hardcoded | Exponential + 0-50% jitter |
| Blob retries | 3 attempts, 1s-10s backoff | Hardcoded | Exponential + 0-50% jitter |
| Citus distribution | user_id on 4 tables | Schema-level | Coordinator-only (node_count=0); workers addable without migration |
| Client Retry-After | 10 seconds | Per-operation in application.properties | Configurable per graph type |

Trade-off. Polling adds 1-30 seconds of latency between job completion and client retrieval. For jobs that take minutes, negligible. Sub-second notification would require WebSocket or SSE.

Limitations and future work

Conclusion

| Metric | Before | After |
| --- | --- | --- |
| Peak concurrent jobs | 1 (synchronous) | N replicas x 4 jobs each (production: 2-10 data-plane replicas). Throughput depends on job duration. |
| Client connection requirement | Hold open 10-15 minutes | Single POST + periodic polls (~50ms each) |
| Duplicate work on retry | 100% (new job every time) | 0% (idempotency key deduplication) |
| Infrastructure components added | N/A | 0 (reused existing Postgres + Blob) |
| Message brokers | N/A | 0 |
| Customer code worst-case retention | Indefinite (container lifetime) | 60 minutes (blob TTL) |
| Customer code typical retention | Container lifetime | Seconds |
| Manual intervention for failures | Required | 0 for the failure modes listed above; see Limitations |
| Time to build | N/A | 1 calendar week (implementation and initial production deployment) |

What we would change


For questions about our architecture or API, contact engineers@supermodeltools.com or visit supermodeltools.com.