By Grey Newell, CTO at Supermodel
For engineers evaluating async job architectures or considering similar trade-offs. Our API processes customer codebases with tree-sitter and LLMs to produce structural graphs. The original prototype took 10-15 minutes per synchronous request. We redesigned the entire API to scale asynchronously in one calendar week (implementation and initial production deployment). Here is how the architecture works, why we made each decision, and what we traded off.
Business challenge
Our prototype ran 2 runtimes (JVM + Node.js) in 1 Docker container, sharing a single synchronous HTTP request path.
- 10-15 minute synchronous HTTP calls. The client had to hold the connection open the entire time.
- 0 idempotency. A retry spawned a second job, doubling compute cost.
- 0 separation between the auth hot path (sub-100ms) and the processing pipeline (minutes).
- 0 formal guarantees about when customer source code was deleted from the processing environment.
A concrete failure scenario: a client sends POST /v1/graphs/call with a 50MB zip archive. 8 minutes into processing, the client's load balancer times out. The work is lost. The client retries. Now 2 copies of the job run simultaneously, consuming 2x compute, and the client has no way to deduplicate or retrieve the first result.
Design tenets
Four principles. The team shipped the redesign in one calendar week, so each principle had to be simple to implement.
- Idempotency everywhere. The same Idempotency-Key + user_id + api_key_id maps to the same job, enforced by a unique constraint in Postgres. On concurrent duplicates, the loser hits the constraint, deletes its blob, and returns the winner's job. Zero wasted compute.
- Separation of compute profiles. Control plane (Java/Spring Boot): auth, billing, API key validation. CPU-bound, sub-100ms. Data plane (TypeScript/Node.js): tree-sitter parsing, LLM calls. I/O-bound, minutes. 0 shared code, 0 shared dependencies. 1 shared Postgres database, 1 shared blob container.
- Zero data retention. Customer source code is deleted from blob, disk, and database after processing. 4 independent cleanup mechanisms cover crash scenarios. Worst-case retention: 60 minutes (blob hard TTL). Details in how we process code without retention.
- Postgres as the single source of truth. 1 database as job queue, state machine, and result store. 0 message brokers. 0 Redis instances. Job claiming uses FOR UPDATE SKIP LOCKED.
Solution overview
Two independently deployed runtimes connected by a shared Postgres database (Citus) and Azure Blob Storage container.
flowchart LR
subgraph client [Client]
SDK["SupermodelClient\n(TypeScript SDK)"]
end
subgraph controlPlane ["Control Plane (Java / Spring Boot)"]
Auth["API Key Validation\n+ Subscription Check"]
JobCreate["Job Creation\n+ Idempotency"]
Poll["Poll Handler\n(200 / 202)"]
end
subgraph sharedInfra ["Shared Infrastructure"]
DB[("Postgres (Citus)\njobs table")]
Blob[("Azure Blob\nzip payloads")]
end
subgraph dataPlane ["Data Plane (TypeScript / Node.js)"]
Worker["Job Worker\n(poll loop)"]
TreeSitter["Tree-sitter Parser"]
LLM["LLM Service\n(OpenRouter)"]
end
SDK -->|"POST /v1/graphs/call\n+ Idempotency-Key"| Auth
Auth --> JobCreate
JobCreate -->|"INSERT status=pending"| DB
JobCreate -->|"upload zip"| Blob
SDK -->|"re-POST same key\n(poll)"| Poll
Poll -->|"SELECT job"| DB
Worker -->|"UPDATE SET status=processing\nFOR UPDATE SKIP LOCKED"| DB
Worker -->|"download zip"| Blob
Worker --> TreeSitter
Worker --> LLM
Worker -->|"UPDATE SET status=completed\nresult = jsonb"| DB
Why Postgres instead of Kafka, SQS, or Redis?
We use Azure Cosmos DB for PostgreSQL (Citus). Coordinator-only today (node_count=0). Distribution key user_id is set on 4 tables; workers can be added without schema migration. No cross-shard queries. The jobs table with FOR UPDATE SKIP LOCKED gives exclusive job claiming (each pending row is picked up by at most one worker), transactional state updates, and zero new infrastructure. Trade-offs: 1-30 second polling latency vs. sub-second push from a dedicated broker; table growth and VACUUM pressure; no built-in backpressure. For our workload (minutes per job, moderate volume), acceptable.
Why no separate /jobs/{id}/status endpoint?
The client re-POSTs the same request with the same idempotency key. The server returns the existing job. Polling is submission. Trade-off: less discoverable than a dedicated status endpoint.
Why blob storage for payloads?
Zip archives are too large for a database column. The control plane uploads to Azure Blob, stores the URL in the job row, the data plane downloads it, and the blob is deleted after processing.
The job lifecycle
A single request, start to finish.
Phase 1: Submission (control plane, ~50ms)
API key validation (Caffeine cache, 10k entries, 5-min TTL) and HMAC verification with constant-time comparison.
Subscription status check (Caffeine cache, 10k entries, 30-second TTL).
JobService.getOrCreateJob() queries by (idempotency_key, user_id, api_key_id). If found: returns the existing job with 0 new work. If not found: computes SHA-256 of the zip, uploads to Azure Blob, inserts a row with status='pending' and blob_expires_at = now + 1 hour. Returns HTTP 202 Accepted with Retry-After: 10 (configurable per operation, default 10 seconds).
Input validation: Zip size limit 500MB (multipart). Path traversal in zips is blocked. No zip-bomb or nested-archive validation.
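The path-traversal block mentioned above ("zip slip") reduces to a resolve-and-prefix check on each archive entry. A minimal sketch in Node terms; the helper name and paths are illustrative, not the production validator:

```typescript
import * as path from "node:path";

// Reject zip entries whose resolved path escapes the extraction root
// ("zip slip"). Illustrative helper, not the shipped validator.
function isSafeZipEntry(extractRoot: string, entryName: string): boolean {
  const root = path.resolve(extractRoot);
  const resolved = path.resolve(root, entryName);
  // The entry must land inside the root (or be the root itself).
  return resolved === root || resolved.startsWith(root + path.sep);
}
```

An entry like `../../etc/passwd` resolves outside the root and is rejected before extraction.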
The idempotent job creation logic. JobRepository.save() catches unique violation (23505) and returns null; no exception thrown.
// JobService.getOrCreateJob (simplified)
Optional<Job> existing = jobRepository.findByIdempotencyKey(idempotencyKey, userId, apiKeyId);
if (existing.isPresent()) return existing.get();
blobConnector.uploadZip(jobId, fileBytes);
UUID savedId = jobRepository.save(...);
if (savedId == null) {
// Unique constraint: concurrent request won. Delete our blob, return theirs.
blobConnector.deleteZip(jobId);
return jobRepository.findByIdempotencyKey(idempotencyKey, userId, apiKeyId).orElseThrow();
}
return jobRepository.findById(savedId).orElseThrow();
Two concurrent requests with the same key both upload and attempt INSERT. One wins. The loser catches the unique constraint violation, deletes its blob, returns the winner's job. If the loser's deleteZip() fails (e.g. blob timeout), the blob is orphaned (no job row). Azure lifecycle deletes it within 24 hours. No dedicated orphan-cleanup job.
Phase 2: Processing (data plane, seconds to minutes)
JobWorkerService.pollLoop() polls every 1-30 seconds (adaptive: x2 backoff on failure, halves after 3 consecutive successes). findPendingJobs() atomically claims work:
UPDATE jobs SET status = 'processing', started_at = NOW()
WHERE id IN (
SELECT id FROM jobs
WHERE status = 'pending' AND blob_expires_at > NOW()
ORDER BY created_at ASC LIMIT $1
FOR UPDATE SKIP LOCKED
)
RETURNING *;
Up to 4 jobs claimed per cycle (SUPERMODEL_JOB_CONCURRENCY). FOR UPDATE SKIP LOCKED means multiple replicas poll concurrently with zero contention; each claims different rows. Jobs cannot run longer than 30 minutes (zombie reaper). Blob TTL (60 min) only affects pending jobs; once claimed, the data plane deletes the blob on completion.
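The adaptive poll interval described above can be modeled as a pure transition on (interval, success streak). A sketch using the documented constants; the function and type names are illustrative, not the worker's actual code:

```typescript
// Adaptive poll interval: doubles on a failed cycle (capped at the max),
// halves after 3 consecutive successful cycles (floored at the base).
// Constants mirror SUPERMODEL_JOB_POLL_INTERVAL_MS and
// SUPERMODEL_JOB_MAX_POLL_INTERVAL_MS defaults.
const BASE_MS = 1_000;
const MAX_MS = 30_000;

interface PollState {
  intervalMs: number;
  successStreak: number;
}

function nextPollState(state: PollState, cycleOk: boolean): PollState {
  if (!cycleOk) {
    // Back off: x2, capped, and reset the success streak.
    return { intervalMs: Math.min(state.intervalMs * 2, MAX_MS), successStreak: 0 };
  }
  const streak = state.successStreak + 1;
  if (streak >= 3) {
    // Recover: halve toward the base interval.
    return { intervalMs: Math.max(Math.floor(state.intervalMs / 2), BASE_MS), successStreak: 0 };
  }
  return { ...state, successStreak: streak };
}
```

Making the transition a pure function keeps the backoff policy trivially unit-testable, independent of the poll loop's timers.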
Downloads the zip from blob storage (3 retries, 1s initial delay, 10s max). Extracts it via ZipHydratorService to a temp directory. Parses with tree-sitter. Calls LLMs via OpenRouter if needed.
Writes the result: UPDATE jobs SET status = 'completed', result = $1::jsonb, blob_url = NULL, ... WHERE id = $3. DB write retries: 3 attempts, 500ms initial delay, 5s max, exponential backoff with 0-50% jitter. Deletes the blob and temp directory in the finally block.
Phase 3: Retrieval (client polls, ~50ms per poll)
The client re-POSTs with the same Idempotency-Key. The control plane loads the job and branches:
if (job.isCompleted()) return ResponseEntity.ok(response.withResult(...));
if (job.isFailed()) return ResponseEntity.ok(response.withError(...));
return ResponseEntity.status(202).header("Retry-After", String.valueOf(retryAfter)).body(response);
completed = 200 OK with result. failed = 200 OK with error. Anything else = 202 Accepted with Retry-After.
The client SDK reads retryAfter, sleeps, re-posts:
// SDK poll loop (simplified); deadline enforces the 15-minute default timeout
while (attempt < maxPollingAttempts && Date.now() < deadline) {
  attempt++;
  const response = await apiCall();
  if (response.status === 'completed') return response.result;
  if (response.status === 'failed') throw new JobFailedError(...);
  await sleep((response.retryAfter ?? 10) * 1000); // server-driven backoff
}
SDK uses 15-minute default timeout, 90 max attempts, and falls back to 10s if retryAfter is missing. The caller sees none of this. They call client.generateCallGraph(file, { idempotencyKey }) and get a result.
Idempotency in detail
Idempotency-Key is required for all data plane requests, enforced by ApiKeyAuthFilter. The SDK generates one via crypto.randomUUID().
The jobs table enforces uniqueness: UNIQUE(idempotency_key, user_id, api_key_id) (with partial indexes for NULL api_key_id in bearer-token auth). Same key + same user + same API key returns the existing job regardless of request content. We do not validate that the zip hash matches; first submission wins.
There is no /jobs/{id}/status endpoint. The client re-POSTs with the same idempotency key to poll (1 endpoint, 1 code path, 1 auth check per poll). Every poll re-validates the API key and subscription status. Revocation invalidates cache on the instance that processes the revoke; other replicas may serve cached entries for up to 5 minutes (cache TTL).
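Each of those per-poll auth checks includes the constant-time HMAC comparison from Phase 1. The production check lives Java-side in ApiKeyAuthFilter; a sketch of the same idea in Node terms, with an illustrative hex tag format:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Constant-time verification of a hex-encoded HMAC-SHA256 tag.
// Illustrative sketch; the real scheme is implemented in the Java
// control plane, and the tag encoding here is an assumption.
function verifyHmacTag(secret: string, message: string, providedHex: string): boolean {
  const expected = createHmac("sha256", secret).update(message).digest();
  const provided = Buffer.from(providedHex, "hex");
  // timingSafeEqual throws on length mismatch, so gate on length first.
  return provided.length === expected.length && timingSafeEqual(provided, expected);
}
```

The point of timingSafeEqual is that comparison time does not depend on where the first differing byte is, which closes the timing side channel a naive `===` on hex strings would open.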
How we process code without retention
Where does your code go, and when is it deleted?
Customer source code is deleted from every storage layer after processing. 4 independent cleanup mechanisms cover crash scenarios. Worst-case retention: 60 minutes (blob hard TTL). Typical: seconds.
flowchart TD
subgraph upload ["1. Upload (Control Plane)"]
CP["Client uploads zip"]
Blob["Blob Storage\n(TTL: 1 hour)"]
DB_pending["jobs.blob_url = url\njobs.blob_expires_at = now + 1h"]
end
subgraph process ["2. Process (Data Plane)"]
Download["Download zip from blob"]
Extract["Extract to temp dir\n/__processing/repoId-uuid/"]
Parse["Tree-sitter parse\n(in-memory graph)"]
end
subgraph cleanup ["3. Cleanup (immediate)"]
MarkDone["markCompleted/markFailed\nblob_url = NULL"]
DeleteBlob["deleteBlob: jobId.zip"]
DeleteDisk["hydration.cleanup:\nfs.remove targetDir"]
end
subgraph expire ["4. Expiry (scheduled)"]
JobCleanup["JobCleanupService\ndeletes expired rows"]
ZombieReap["Zombie reaper\nmarks stuck jobs failed"]
BlobExpiry["Expired pending jobs\nmarked failed"]
end
CP --> Blob --> DB_pending
DB_pending --> Download --> Extract --> Parse
Parse --> MarkDone
MarkDone --> DeleteBlob
MarkDone --> DeleteDisk
DeleteBlob --> JobCleanup
DeleteDisk --> ZombieReap
ZombieReap --> BlobExpiry
Layer 1: Blob storage
The 60-minute TTL is enforced by application logic (blob_expires_at, expired-pending cleanup). Azure Blob lifecycle policy is a 24-hour safety net (staging/production); Azure does not support hour-level TTL natively. On upload, blob_expires_at = NOW() + 1 hour. On completion or failure, the data plane deletes the blob (3 retries, 1s initial delay, 10s max) and sets blob_url = NULL.
If the DB write fails, the blob is intentionally preserved so the zombie reaper can identify the orphaned job: if (statusRecorded) await deleteBlob(...); in the finally block.
Layer 2: Disk
Extracted files live in /__processing/{repoId}-{uuid}/. cleanup() runs fs.remove(targetDir) in the finally block on all paths. RETAIN_HYDRATED_REPOS exists only for local development; not set in any deployed environment.
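The ordering rules from Layers 1 and 2 (always remove the temp dir; delete the blob only once the terminal status is durably recorded) can be sketched as a small dependency-injected routine. All names here are illustrative stubs, not the production worker:

```typescript
// Illustrative sketch of the worker's cleanup discipline. If the status
// write throws, the blob survives so the zombie reaper can find the job.
interface CleanupOps {
  parse: (dir: string) => unknown;
  markCompleted: (jobId: string, result: unknown) => void; // may throw (DB down)
  removeDir: (dir: string) => void;
  deleteBlob: (jobId: string) => void;
}

function processWithCleanup(jobId: string, tempDir: string, ops: CleanupOps): void {
  let statusRecorded = false;
  try {
    const result = ops.parse(tempDir);
    ops.markCompleted(jobId, result);
    statusRecorded = true;
  } finally {
    ops.removeDir(tempDir);                    // disk cleanup on every path
    if (statusRecorded) ops.deleteBlob(jobId); // preserve blob if the DB write failed
  }
}
```

Injecting the side effects makes the crash-ordering property directly testable: simulate a failed status write and assert the blob deletion was skipped while the disk cleanup still ran.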
Layer 3: Database
The result column contains JSONB structural metadata (file paths, function signatures, dependency edges, line numbers), not source code. File paths and function signatures may expose repo structure; we do not store source code. No application-level limit on JSONB result size; Postgres TOAST applies. Large graphs may impact WAL and backups.
Completed jobs: deleted after 24 hours. Failed jobs: deleted after 7 days. JobCleanupService runs hourly: @Scheduled(cron = "0 0 * * * ?") calls jobRepository.deleteExpired(). Scheduled jobs run on every control-plane replica. No distributed lock (e.g. ShedLock). Cleanup is idempotent; usage rollups use ON CONFLICT for deduplication.
Layer 4: Defense-in-depth
3 independent mechanisms catch anything the primary cleanup misses:
Zombie reaper (data plane, every poll cycle). Marks jobs stuck in processing for > 30 minutes as failed, sets blob_url = NULL: UPDATE jobs SET status = 'failed', blob_url = NULL WHERE status = 'processing' AND started_at <= $1.
Expired pending cleanup (control plane, every 60 seconds via @Scheduled(cron = "0 * * * * ?")). Marks pending jobs with blob_expires_at <= NOW() as failed.
Expired pending cleanup (data plane, every poll cycle). Runs the same query, independently, in case the control plane is down.
The data plane has 0 ingress (ingress_enabled = false in Azure Container Apps). It receives 0 external HTTP requests. It pulls work from the database, processes in memory, writes structural metadata back, and deletes all source code artifacts. Stateless by construction, not by policy.
Retention summary
| Artifact | Typical retention | Worst-case retention | Cleanup mechanism |
|---|---|---|---|
| Blob (customer zip) | Seconds (deleted on job completion) | 60 minutes (hard TTL) | deleteBlob() + expired pending cleanup |
| Extracted files on disk | Seconds (deleted in finally block) | Container lifetime (crash = container replaced) | hydration.cleanup() via fs.remove() |
| blob_url pointer in DB | Seconds (NULLed on completion/failure) | 30 minutes (zombie reaper threshold) | markCompleted/markFailed SQL |
| Job result (structural metadata, not source code) | 24 hours (completed) / 7 days (failed) | Same | JobCleanupService.cleanupExpiredJobs() hourly |
| Orphan/zombie jobs | 30 minutes | 60 minutes | Zombie reaper + expired pending cleanup (both planes) |
Failure modes and automated recovery
| Failure | Automated response | Recovery time |
|---|---|---|
| Client disconnects mid-poll | Job continues processing. Client re-POSTs same key to retrieve result. | 0; job is unaffected |
| Data plane container crashes | Zombie reaper marks jobs in processing > 30 min as failed, clears blob_url. Container orchestrator (ACA) restarts the replica. Client may wait up to 30 minutes to learn of a failed job. | 30 minutes (zombie threshold) |
| markFailed DB write fails after crash | Blob is intentionally preserved. Zombie reaper catches the orphan on next poll cycle. | 30 minutes |
| Blob expires before processing starts | findPendingJobs skips jobs with blob_expires_at <= NOW(). Both planes independently mark them failed. | 60 minutes (blob TTL) |
| Postgres transient failure (connection reset, 53xxx) | Both planes retry: 3 attempts, 500ms initial delay, 5s max, exponential backoff with 0-50% jitter. | Seconds |
| Blob storage transient failure (timeout, 5xx) | 3 retries, 1s initial delay, 10s max. 404 (BlobNotFoundError) is not retried. | Seconds |
| Concurrent duplicate submissions | Unique constraint on (idempotency_key, user_id, api_key_id). Loser deletes its blob, returns winner's job. | 0; no duplicate work |
| API key revoked during processing | Job completes (work already started), but next poll re-validates the key and returns 401 Unauthorized. Revoked key cannot retrieve results. | Immediate |
| Control plane down | No new jobs, no polling. Data plane continues processing. | Until control plane recovers |
0 of these failure modes require manual intervention. 0 of them result in customer code persisting beyond the cleanup window. This applies to the failures listed; Postgres outage beyond retry window, control-plane outage, and schema migrations may require manual intervention. See Limitations.
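The retry policy that recurs in the table above (3 attempts, exponential backoff, 0-50% jitter) reduces to a one-line schedule. A sketch with an injectable rng for determinism; the function shape is illustrative, not the shipped retry helper:

```typescript
// Exponential backoff with 0-50% jitter, as used for DB and blob retries.
// DB:   initialMs = 500,  maxMs = 5_000
// Blob: initialMs = 1_000, maxMs = 10_000
function backoffDelayMs(
  attempt: number,                      // 0-based retry attempt
  initialMs: number,
  maxMs: number,
  rand: () => number = Math.random      // injectable for deterministic tests
): number {
  const base = Math.min(initialMs * 2 ** attempt, maxMs);
  return Math.round(base * (1 + rand() * 0.5)); // add 0-50% jitter
}
```

Jitter matters here because both planes retry against the same Postgres instance: without it, replicas that fail together retry together, producing synchronized load spikes.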
Separation of concerns
| Responsibility | Control Plane (Java/Spring Boot) | Data Plane (TypeScript/Node.js) |
|---|---|---|
| API key validation (HMAC + Caffeine cache) | Yes | No |
| Subscription enforcement (Stripe) | Yes | No |
| Usage metering and billing | Yes | No |
| Job creation and idempotency | Yes | No |
| HTTP ingress (public endpoint) | Yes (port 8080) | No (ingress_enabled = false) |
| Tree-sitter parsing | No | Yes |
| LLM calls (OpenRouter, Google AI) | No | Yes |
| Graph construction (in-memory) | No | Yes |
| Job claiming (FOR UPDATE SKIP LOCKED) | No | Yes |
| Blob download/deletion | No | Yes |
Shared interface: 1 Postgres database (jobs table) + 1 Azure Blob container (job-payloads, naming {jobId}.zip). That is the entire contract. 0 RPC calls. 0 shared code. 0 protobuf schemas. Each runtime has its own Dockerfile, CI pipeline, and Azure Container App.
Observability
Application Insights (control and data plane). Data plane emits job events (completed/failed), duration, success/failure counts, retries, poll interval. Structured JSON logging with correlation IDs. No OpenTelemetry; no Prometheus/CloudWatch. We do not publish formal SLOs. Control-plane target: sub-100ms P99 for auth and job creation.
Deployment
Single revision, 100% traffic. No blue/green or canary. In-flight jobs in a crashed container are failed by the zombie reaper after 30 minutes.
Why Java for the control plane?
The control plane does not process code. It validates keys, checks subscriptions, creates database rows, and returns HTTP responses. Java/Spring Boot is built for this.
We use OpenAPI-first code generation (same spec generates the TypeScript SDK), Spring Security filter chain (OAuth2, API key auth, CSRF, Stripe webhooks), @Scheduled cron jobs (4 tasks: expired cleanup, job deletion, usage reports), AOP for usage metering and scope checks, and Caffeine caches for API keys and subscription status. The data plane is TypeScript because tree-sitter ships Node.js bindings and the work is I/O-bound. We had a Spring Boot veteran on the team from day one. Each runtime handles what it is best at. 64 source files, ~8,200 lines (significant portion generated). Every request <100ms.
Scaling constraints and trade-offs
| Parameter | Default | Configurable via | Notes |
|---|---|---|---|
| Poll interval (base) | 1,000ms | SUPERMODEL_JOB_POLL_INTERVAL_MS | Minimum latency between job creation and pickup |
| Poll interval (max) | 30,000ms | SUPERMODEL_JOB_MAX_POLL_INTERVAL_MS | Reached after consecutive failures (x2 backoff) |
| Poll recovery | Halves after 3 successes | Hardcoded | Returns to base interval |
| Concurrency per replica | 4 jobs | SUPERMODEL_JOB_CONCURRENCY | More replicas = linear scale |
| Blob TTL | 1 hour | SUPERMODEL_JOB_TTL_BLOB_HOURS | Unprocessed jobs fail after this |
| Completed job TTL | 24 hours | SUPERMODEL_JOB_TTL_COMPLETED_HOURS | Client must retrieve results within this window |
| Failed job TTL | 7 days | SUPERMODEL_JOB_TTL_FAILED_DAYS | For debugging and support |
| Zombie threshold | 30 minutes | Hardcoded | Jobs in processing longer than this are failed |
| API key cache | 10,000 keys, 5-min TTL | Hardcoded | Caffeine in-process cache |
| Subscription cache | 10,000 entries, 30-sec TTL | Hardcoded | Caffeine in-process cache |
| DB retries | 3 attempts, 500ms-5s backoff | Hardcoded | Exponential + 0-50% jitter |
| Blob retries | 3 attempts, 1s-10s backoff | Hardcoded | Exponential + 0-50% jitter |
| Citus distribution | user_id on 4 tables | Schema-level | Coordinator-only (node_count=0); workers addable without migration |
| Client Retry-After | 10 seconds | Per-operation in application.properties | Configurable per graph type |
Trade-off. Polling adds 1-30 seconds of latency between job completion and client retrieval. For jobs that take minutes, negligible. Sub-second notification would require WebSocket or SSE.
Limitations and future work
- No per-API-key or per-user rate limiting on job creation or polling. OpenAPI declares rate-limit headers; enforcement is not implemented.
- No formal SLO/SLA published. Control-plane target: sub-100ms P99.
- Single Postgres instance (Citus coordinator). Failover is managed by Azure; we do not use read replicas.
- Scheduled jobs run on every control-plane replica. No distributed lock (e.g. ShedLock). Cleanup is idempotent; usage rollups use ON CONFLICT for deduplication.
- Orphan blobs (loser's deleteZip() failed) are cleaned by Azure lifecycle within 24 hours, not by a dedicated app job.
- API key revocation: other replicas may serve cached entries for up to 5 minutes (cache TTL).
- No zip-bomb or nested-archive validation.
- No blue/green or canary deployment.
Conclusion
| Metric | Before | After |
|---|---|---|
| Peak concurrent jobs | 1 (synchronous) | N replicas x 4 jobs each (production: 2-10 data-plane replicas). Throughput depends on job duration. |
| Client connection requirement | Hold open 10-15 minutes | Single POST + periodic polls (~50ms each) |
| Duplicate work on retry | 100% (new job every time) | 0% (idempotency key deduplication) |
| Infrastructure components added | N/A | 0 (reused existing Postgres + Blob) |
| Message brokers | N/A | 0 |
| Customer code worst-case retention | Indefinite (container lifetime) | 60 minutes (blob TTL) |
| Customer code typical retention | Container lifetime | Seconds |
| Manual intervention for failures | Required | 0 for the failure modes listed above; see Limitations. |
| Time to build | N/A | 1 calendar week (implementation and initial production deployment) |
What we would change
- A dedicated /jobs/{id} status endpoint would be more discoverable than the re-POST pattern. We traded discoverability for simplicity (1 endpoint, 1 code path, 1 auth check per poll).
- WebSocket or SSE for push-based progress. The 1-30 second polling latency is fine for our workloads but would not be acceptable for sub-second use cases.
- Citus worker nodes. We have the distribution key set (user_id on 4 tables) but run coordinator-only. This is insurance, not a current requirement.
For questions about our architecture or API, contact engineers@supermodeltools.com or visit supermodeltools.com.
