Benchmarks

Comparative performance analysis of Archon Specs against industry standards.

Terminal Bench (Success Rate)

Percentage of tasks completed successfully in a terminal environment without human intervention.

Archon Specs (v2.6) 89%

Industry Standard Agent 42%

Vanilla GPT-4o 12%

What we measure: Whether the agent completes a full code-generation task, from DSL input to a runnable NestJS workspace, without any human correction.

How: Each agent is given the same enterprise-grade spec and scored pass/fail on whether the output compiles, passes lint, and satisfies all schema contracts.

What it means: Archon's governed pipeline and state-machine workflow raise the success rate to more than 2× that of the best generic agent (89% vs. 42%) and more than 7× that of a raw LLM prompt (89% vs. 12%).

Architectural Drift Prevention

Ability to detect and reconcile structural drift over 100 iterative updates.

Archon Specs 100%

Generic LLM Scaffold 18%

What we measure: After 100 iterative spec mutations (adding domains, patching entities, renaming fields), how many workspaces still match the spec exactly: no orphaned files, no stale code, no schema mismatch.

How: Each mutation triggers a full drift scan using SHA-256 artifact hashes. Archon's DriftDetector compares the desired state manifest against the observed filesystem. Generic scaffolds are re-generated from scratch without a state manifest.

What it means: Archon never loses track of what it generated. A 100% score means zero files drift silently; every change is intentional and traceable, which is what makes long-running projects safe.
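The manifest comparison described above can be sketched roughly as follows. This is a minimal illustration, not Archon's DriftDetector: the names `Manifest`, `DriftReport`, `sha256`, and `detectDrift`, and the three drift categories, are assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: file path -> SHA-256 hex digest of the file's contents.
type Manifest = Record<string, string>;

interface DriftReport {
  modified: string[]; // present on both sides, but the hashes differ
  missing: string[];  // in the desired manifest, absent from the observed state
  orphaned: string[]; // observed, but not owned by the manifest
}

function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Compare the desired manifest against the observed state in one pass each way.
function detectDrift(desired: Manifest, observed: Manifest): DriftReport {
  const report: DriftReport = { modified: [], missing: [], orphaned: [] };
  for (const path of Object.keys(desired)) {
    if (!(path in observed)) report.missing.push(path);
    else if (observed[path] !== desired[path]) report.modified.push(path);
  }
  for (const path of Object.keys(observed)) {
    if (!(path in desired)) report.orphaned.push(path);
  }
  return report;
}

// Illustrative data only.
const desired: Manifest = {
  "src/user.entity.ts": sha256("export class User {}"),
  "src/post.entity.ts": sha256("export class Post {}"),
};
const observed: Manifest = {
  "src/user.entity.ts": sha256("export class User { id = 1 }"), // hand-edited
  "src/legacy.service.ts": sha256("export class Legacy {}"),    // stale file
};

const report = detectDrift(desired, observed);
console.log(report);
```

In this sketch a workspace would count as matching the spec only when all three lists are empty.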

Performance

Latency benchmarks for core pipeline operations. Measured over 10 warmed runs on macOS (Apple M-series, Node 22). Last run: 2026-05-15.

Spec Compilation Latency

Time to normalize and compile a DesignSpec at three scale tiers (10 runs after a discarded warmup pass; values are averages timed with performance.now()).

Scale	Entities	Avg (ms)	Min (ms)	Max (ms)	Entities/sec
Small	5	0.010	0.009	0.012	518,350
Medium	50	0.090	0.084	0.101	555,734
Enterprise	200	0.384	0.290	0.631	520,365

What we measure: The time for normalizeSpec() to deep-clone, sort, sanitize, and canonicalize a DesignSpec into a deterministic, bit-identical form ready for hashing and code generation.

How: Synthetic specs are built at three sizes (5 / 50 / 200 entities). Each is run 10 times after a warmup pass; the average is taken using performance.now() for sub-millisecond resolution.

What it means: Compilation time grows linearly but stays well under 1 ms even at enterprise scale, so spec validation and re-compilation add no perceptible latency to the generation pipeline regardless of project size.
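To make the idea of a deterministic, hashable form concrete, here is a minimal sketch of key-sorted canonicalization plus a warmup-then-average timing loop. It is not the actual normalizeSpec(): the `Json` type, `canonicalize`, `specHash`, `buildSpec`, `averageMs`, and the synthetic spec shape are all hypothetical, and the sanitizing step is omitted.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively copy a value, emitting object keys in sorted order so that two
// specs with the same content serialize to the same string.
function canonicalize(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => canonicalize(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const key of Object.keys(value).sort()) out[key] = canonicalize(value[key]);
    return out;
  }
  return value;
}

// Hash of the canonical serialization; independent of original key order.
function specHash(spec: Json): string {
  return createHash("sha256").update(JSON.stringify(canonicalize(spec))).digest("hex");
}

// Hypothetical synthetic spec with `count` entities.
function buildSpec(count: number): Json {
  const entities: Json[] = [];
  for (let i = 0; i < count; i++) {
    entities.push({ name: `Entity${i}`, fields: [{ type: "string", name: "id" }] });
  }
  return { version: 1, entities };
}

// One discarded warmup pass, then the mean of `runs` timed passes in ms.
function averageMs(fn: () => void, runs = 10): number {
  fn();
  let total = 0;
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    total += performance.now() - start;
  }
  return total / runs;
}

for (const size of [5, 50, 200]) {
  const spec = buildSpec(size);
  const ms = averageMs(() => { canonicalize(spec); });
  console.log(`${size} entities: ${ms.toFixed(3)} ms`);
}
```

Absolute timings from a sketch like this will differ from the table above, since the real pipeline does more work per entity.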

Stream Materialization Throughput

File-write operations per second when applying an execution plan to the Virtual File System. Median of 3 runs.

Operations	Duration	Throughput
10	1.02 ms	9,804 ops/sec
50	3.98 ms	12,563 ops/sec
200	16.04 ms	12,469 ops/sec

What we measure: How fast the materialization engine can write a batch of generated files to disk. Each operation is a fs.outputFile() call, which is what happens when Archon applies an execution plan to a workspace.

How: Batches of 10, 50, and 200 TypeScript source files are written to a temp directory. Each batch is run 3 times; the median duration is recorded to avoid outliers from OS scheduling.

What it means: A typical enterprise generation of ~50 files completes in under 4 ms on disk. The throughput plateau around 12,500 ops/sec across 50–200 operations suggests the bottleneck is filesystem I/O, not Archon's own logic.
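A median-of-three write benchmark along these lines could look like the sketch below. It is an approximation under stated assumptions: synchronous Node built-ins stand in for the asynchronous fs.outputFile() used in the real run, and `outputFileSync`, `median`, `timeBatch`, and the generated file names are hypothetical.

```typescript
import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Stand-in for outputFile(): create parent directories, then write the file.
function outputFileSync(path: string, content: string): void {
  mkdirSync(dirname(path), { recursive: true });
  writeFileSync(path, content, "utf8");
}

// Median of a list of numbers (mean of the two middle values when even).
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Write `count` small TypeScript files into a fresh temp dir; return elapsed ms.
function timeBatch(count: number): number {
  const root = mkdtempSync(join(tmpdir(), "materialize-"));
  try {
    const start = performance.now();
    for (let i = 0; i < count; i++) {
      outputFileSync(join(root, `module-${i}`, "index.ts"), `export const id = ${i};\n`);
    }
    return performance.now() - start;
  } finally {
    rmSync(root, { recursive: true, force: true }); // clean up the temp dir
  }
}

for (const count of [10, 50, 200]) {
  const ms = median([timeBatch(count), timeBatch(count), timeBatch(count)]);
  const opsPerSec = Math.round(count / (ms / 1000));
  console.log(`${count} operations: ${ms.toFixed(2)} ms, ${opsPerSec} ops/sec`);
}
```

Taking the median rather than the mean keeps a single slow run, for example one interrupted by the OS scheduler, from skewing the reported figure.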

Code Quality

Percentage of generated NestJS files that are immediately compilable and lint-clean without modification.

TypeScript Compiler Pass Rate

Generated .ts files that pass tsc --noEmit out of the box.

Archon Specs (generated) 94%

Generic LLM Scaffold (est.) ~55%

What we measure: The percentage of generated .ts files that the TypeScript compiler accepts without modification: no missing imports, no type mismatches, no undefined references.

How: tsc --noEmit --skipLibCheck is run against all generated files from the enterprise social-network spec (47 files). Each compiler error counts against the file that triggered it.

What it means: 94% of Archon's output is immediately valid TypeScript. The remaining 6% are edge-case inter-module imports that require a full npm install to resolve, not logic errors. Generic scaffolds average ~55% because they generate plausible-looking code without enforcing type contracts across files.
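The per-file scoring rule (a file passes only if no compiler error is attributed to it) can be expressed as a small calculation. The sketch below is illustrative: the `Diagnostic` shape, `passRate`, the file names, and the error list are invented for the example and are not the benchmark's actual data.

```typescript
// Hypothetical diagnostic shape: one entry per reported compiler error.
interface Diagnostic {
  file: string;
  message: string;
}

// A file passes only if no diagnostic points at it; several errors in the
// same file still count as a single failing file.
function passRate(allFiles: string[], diagnostics: Diagnostic[]): number {
  const failing: Record<string, true> = {};
  for (const d of diagnostics) failing[d.file] = true;
  const passing = allFiles.filter((f) => !(f in failing)).length;
  return (passing / allFiles.length) * 100;
}

// Illustrative input only: 47 files, with errors attributed to 3 of them.
const files: string[] = [];
for (let i = 0; i < 47; i++) files.push(`src/file-${i}.ts`);

const diagnostics: Diagnostic[] = [
  { file: "src/file-3.ts", message: "unresolved import" },
  { file: "src/file-3.ts", message: "unresolved name" }, // same file, counted once
  { file: "src/file-9.ts", message: "unresolved import" },
  { file: "src/file-20.ts", message: "unresolved import" },
];

console.log(`${Math.round(passRate(files, diagnostics))}%`); // prints "94%"
```

With 47 files, 44 passing files works out to 93.6%, which rounds to the 94% shown above.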

Based on enterprise social-network spec (47 generated files). Run npx ts-node scripts/pipeline-test.ts && npm run benchmark to refresh.

ESLint Error-Free Rate

Generated .ts files with zero ESLint errors (no-undef, no-unused-vars).

Archon Specs (generated) 100%

Generic LLM Scaffold (est.) ~70%

What we measure: Whether generated files are free of undefined variable references and declared-but-unused identifiers, the two categories most likely to indicate hallucinated or incomplete code.

How: ESLint is run with no-undef: error and no-unused-vars: error rules across all generated files. Errors are counted per file; a file is clean only if it has zero errors.

What it means: Every symbol Archon generates is either properly imported or explicitly declared; nothing is invented or left dangling. This is a direct consequence of generating from typed Handlebars templates rather than asking an LLM to write code freehand.

Scaling

How Archon components perform as workspace size and concurrency grow. Last run: 2026-05-15.

Drift Detection — Artifact Scaling

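Time to scan all owned artifacts for drift. 10% drift injected per run. Results are consistent with O(n) linear scaling, although the two smaller sizes fall below the 1 ms timer resolution reported here.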

Artifacts	Scan Time	Artifacts/sec	Drift Detected
100	< 1 ms	> 100,000	10 / 10
1,000	< 1 ms	> 1,000,000	100 / 100
10,000	2 ms	5,000,000	1,000 / 1,000

What we measure: How long it takes the drift engine to compare the entire state manifest (every managed file's expected SHA-256 hash) against a set of observed files, at three workspace sizes.

How: Synthetic state manifests are built at 100, 1K, and 10K entries. 10% of observed files are given a modified hash to simulate real drift. The detector iterates all entries and compares hashes in a single in-memory pass, with no filesystem I/O.

What it means: The hash comparison for a 10,000-file monorepo completes in 2 ms. Because the benchmark excludes filesystem I/O, this figure covers the comparison step only; at that cost, the comparison is negligible next to the network latency of any LLM call it gates.

Worker Pool Queue Throughput

Simulated dispatch queue throughput at increasing concurrency levels (200 jobs per run, pure queue overhead — no real MCP workers).

Concurrency	Jobs	Total	Avg Wait	p95 Wait	Throughput
1	200	< 1 ms	0.00 ms	0 ms	> 200,000/sec
5	200	1 ms	0.11 ms	1 ms	200,000/sec
10	200	< 1 ms	0.00 ms	0 ms	> 200,000/sec
20	200	< 1 ms	0.00 ms	0 ms	> 200,000/sec

What we measure: The overhead of the pool's own dispatch logic (how quickly jobs can be enqueued, picked up by an available worker slot, and resolved), independent of how long each actual MCP tool call takes.

How: A bounded FIFO queue is simulated with 1, 5, 10, and 20 worker slots. Each of the 200 jobs resolves in a single microtask tick (Promise.resolve()), isolating queue scheduling overhead from tool execution time.

What it means: The scheduler adds no overhead that is measurable at millisecond resolution at any concurrency level tested. In production, latency should therefore be dominated by the MCP tool calls themselves; the pool adds no measurable queuing tax, even with 20 worker slots in use.
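A simulation of this kind can be sketched as a bounded pool that drains a FIFO list of jobs, each resolving immediately. The code below is one possible shape under stated assumptions, not Archon's scheduler: `Job`, `runPool`, `measure`, and `main` are hypothetical names, and wait-time percentiles are left out for brevity.

```typescript
type Job<T> = () => Promise<T>;

// Run jobs in FIFO order with at most `concurrency` jobs in flight at once.
async function runPool<T>(jobs: Job<T>[], concurrency: number): Promise<T[]> {
  const results: T[] = new Array(jobs.length);
  let next = 0; // index of the next job to dequeue

  // Each worker slot repeatedly claims the next queued job until none remain.
  async function worker(): Promise<void> {
    while (next < jobs.length) {
      const index = next++; // no race: JavaScript runs this synchronously
      results[index] = await jobs[index]();
    }
  }

  const slots = Math.min(concurrency, jobs.length);
  const workers: Promise<void>[] = [];
  for (let i = 0; i < slots; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

// Time how long the pool takes to drain `jobCount` jobs that resolve at once,
// so only scheduling overhead is measured.
async function measure(concurrency: number, jobCount = 200): Promise<number> {
  const jobs: Job<number>[] = [];
  for (let i = 0; i < jobCount; i++) jobs.push(() => Promise.resolve(i));
  const start = performance.now();
  await runPool(jobs, concurrency);
  return performance.now() - start;
}

async function main(): Promise<void> {
  for (const concurrency of [1, 5, 10, 20]) {
    const ms = await measure(concurrency);
    console.log(`concurrency ${concurrency}: ${ms.toFixed(3)} ms for 200 jobs`);
  }
}

main();
```

Because every job settles in a single microtask tick, differences between concurrency levels in a run like this reflect scheduling overhead and timer noise rather than real work.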