2026-05-20 · MadBrad + Florence + Hawks

Foreman Scale Test (attempted): the host failed at N=4 before the architecture could be stressed

Architecture verdict: BROKE — but at N=4-5, not N=10. Five spawns fired between 01:57:31 and 01:59:49 CDT; only max and pat reached a meaningful boot state; ada, grace, alan never produced a Claude response. The Chromebook ran out of RAM. PI called the test over at 02:42. The architecture itself was never stress-tested. What we got is a host-capacity calibration, not an answer to the scale question.

The Foreman Scale Test was the first formal experiment of the Accidental Researcher program — designed to push the foreman architecture (one human + one foreman entity + N working entities) from the N=3 datum established in the Persistent Memory Test series to a target of N=10. Frank's pre-registered failure-mode ranking placed foreman cognitive ceiling first (50% prior) and machine resource ceiling second (25% prior). The auditor (Hawks) recorded a pre-fire snapshot showing 136 MiB free RAM, 0 MiB swap, ~1.42 GiB already resident in long-lived Claude processes on a 2.8-GiB-total Chromebook — and noted on the record that the resource hypothesis was understated. The PI fired anyway, with the disagreement in writing. The cascade reached the fifth spawn before the host froze. Three of the five spawned entities never produced output; max and pat got far enough to begin work but never delivered. At 02:42 the PI called the test over with the verbal verdict 'cannot spawn more than four.' The pre-registered ranking was wrong; the auditor's pre-fire flag was right; the instrumentation that recorded the disagreement is what makes this result scoreable. The architecture remains an open question. The host does not. The next experiment cannot run on this hardware.

What we set out to test

Frank's `scale_test_design_v1.md`, filed at ~01:21 CDT after the Persistent Memory Test series closed, proposed a uniform-task scale trial: ten working entities (max, pat, ada, grace, alan, niklaus, donald, barbara, ken, dennis), each assigned one Empire vertical, each producing two deliverables — a three-item highest-leverage improvement list for that vertical's landing template, and one single-file edit implementing one of those three items on a branch named `scale_test/<entity>`. Uniform task was intentional. Variance was supposed to come from the operational layer, not from per-entity task difficulty. Pre-registered metrics: per-entity wall time, output size, foreman context burn, peak RAM/CPU, poller tick latency, rate-limit incidence. Hard ceiling: 90 minutes from the first spawn (B1). Frank's pre-registered failure-mode ranking: #1 — Foreman cognitive ceiling (Frank's context fills from 10 entities chattering) — 50% prior #2 — Machine resource ceiling (Chromebook RAM/CPU/network) — 25% prior #3 — Comm-room poller scaling — 10% prior #4 — Token budget rate-limit — 10% prior #5 — Tooling (tmux layout, dashboard render) — 5% prior Hawks's job was to audit, not to operate. Phase A (pre-fire snapshot) ran clean. Phases B (spawn cascade), C (work cascade), D (architecture stress capture), E (dismiss + report) were the protocol the run was supposed to walk through. It got to early B.

What we got

First studio session opened at 01:39:56. Five spawn attempts ran between 01:57:31 and 01:59:49 CDT: B1 max → developum 01:57:31 (pane id %17) B2 pat → construction 01:57:48 (pane id %18, +17s) B3 ada → healthcare 01:59:13 (pane id %19, +85s) B4 grace → music 01:59:30 (pane id %20, +17s) B5 alan → sports 01:59:49 (pane id %21, +19s) The room log went silent shortly after the tmux-nudge for Alan's brief at 02:01:19. The host did not recover sufficiently for any further spawn. A second studio session opened at 02:37:20 — a recovery checkpoint, not a relaunch. At 02:42, with the system stable again, MadBrad called the test: "I want to call the test over… we had a crash once we got to the fourth spawn, the computer froze… the computer failed because of RAM, cannot spawn more than four." Total wall time B1 → closure: 44 minutes 29 seconds. Spawn cascade duration before silence: 2 minutes 18 seconds. The architectural question the test was designed to answer — does the foreman pattern hold at N=10 — was not reached.

Per-entity boot result — the line-count cliff

What made it to disk, from `.session_logs/<entity>.log`: Entity Lines Content character ------ ----- ----------------- max 7,204 Substantive — booted, received brief, began work pat 4,908 Substantive — booted, received brief, began work ada 19 Boot splash + prompt-seed only; no Claude response grace 5 Boot bytes only alan 5 Boot bytes only The two entities spawned in the first burst (max at 01:57:31, pat at 01:57:48) had between 75 and 90 seconds of clear runway before the system began degrading. Their log volume reflects that. The three entities spawned in the second burst (ada at 01:59:13, grace at 01:59:30, alan at 01:59:49) never produced Claude responses. No AAR, no comm-room thread, no `scale_test/<entity>` branch from any of the five. The work-product layer of the experiment is empty. The line-count cliff between pat (4,908) and ada (19) is the operative signature of the failure. The first pair got to Ready and began processing. The next three never got past the boot splash. The system did not freeze instantly at N=5; it began starving at N=3 enough that ada's Claude process could not allocate the resources to respond to its boot prompt, and the freeze became total during or shortly after Alan's spawn.

The pre-fire snapshot that became the result

Hawks's Phase A snapshot, filed at 01:43 CDT before the first spawn, recorded the host state that turned out to be the prediction: Total host RAM : 2,799 MiB Used : 2,187 MiB Free : 136 MiB Available : 456 MiB Swap : 0 MiB (no swap configured) Load avg (1/5/15) : 3.46 / 2.11 / 1.55 CPU steal : 31.7% (sandbox host contention) Resident Claude : ~1.42 GiB across 4 long-lived processes In the same file, Hawks noted: with ~136 MiB free and no swap, the headroom for 10 additional Claude processes at observed ~360 MiB RSS each is zero. The resource ceiling, ranked #2 at 25% by Frank, was almost certainly going to fire before all ten entities were spawned. Hawks did not adjust the ranking — that would have defeated the point of pre-registration — but he put the disagreement in writing. One post-cascade snapshot was captured by the sampler before the freeze: ts (CDT) mem_used mem_free mem_avail load_1m claude_n poller_lag ------------------- -------- -------- --------- ------- -------- ---------- 2026-05-20 01:59:02 2,176 135 464 9.18 6 7 s 2026-05-20 02:00:03 2,454 83 188 10.52 10 47 s 2026-05-20 02:02:08 2,521 78 121 18.77 9 26 s 2026-05-20 02:06:10 2,546 83 96 25.12 9 -19 s 2026-05-20 02:18:21 2,577 60 64 19.89 9 11 s In the 60 seconds after the cascade fired, available memory collapsed from 464 MiB to 188 MiB. Load average climbed from 9 to >25 within eight minutes and stayed there. Free memory bottomed at 60 MiB. Poller tick cadence dilated to 47 s on one sample (vs. 30 s configured) — the only stretched tick observed; it recovered. The host did not.

Grading the pre-registered ranking

The ranking was wrong on the headline question. The first-order failure was #2 (machine resource ceiling), not #1 (foreman cognitive ceiling). The cognitive layer was never exercised; Frank's context window was nowhere near its limit when the host froze, and the foreman had not yet begun coordinating concurrent work outputs. Two things follow. First — Frank was wrong and the auditor flagged it pre-fire. The audit framework (independent auditor entity, written pre-fire snapshot, recorded disagreement with the operator-foreman's prior) is what allows this paper to grade the experiment honestly. Without that pre-registration, the analysis would be reconstructed-from-artifacts (as Trial Zero was) rather than scored against intent. Here the intent was on record. The grading is unambiguous. Second — the disagreement is itself data about the program's instrumentation discipline. A test ranking made by the operator-foreman entity and corrected by an independent auditor entity, both filed in writing before the firing event, is the kind of pre-registration that allows after-the-fact grading without rationalization. The program has been calibrating this discipline since Trial Zero, where no such pre-registration existed and the analysis had to reconstruct intent from artifacts. Three days in, the instrumentation has improved measurably. Categories #3 (poller scaling), #4 (token budget), and #5 (tooling) were not tested. The poller continued ticking on its 30-second cadence through the freeze window because it was already resident and untouched by the new spawns. No rate-limit errors surfaced. No tmux layout pathology was observed beyond the pane-identifier anomaly noted below. These categories remain open questions for any future architecture test that gets past the resource ceiling.

Apparatus signal — the pane-index re-use

Boswell's master log flagged a single apparatus inconsistency in the spawn-notification stream. Frank's spawn notifications for Ada (01:59:13) and Alan (01:59:49) both cite pane "3": "ADA spawned into studio pane 3 (id=%19) at 2026-05-20 01:59:13." "ALAN spawned into studio pane 3 (id=%21) at 2026-05-20 01:59:49." The tmux internal pane IDs differ (%19 vs %21), and tmux treats `%N` as the stable handle. The visible pane *index* — the number assigned based on grid position under the active layout — was re-derived between the two spawns. Most likely cause: the `even-horizontal` layout reflowed when Grace's pane was added, shifting the visible index mapping. This matches a known `spawn.sh` finding from the Foreman Shakedown Test (`spawn.sh` derives the target pane index by walking `tmux list-panes | sort -n | tail`, which shifts underneath the script). For this run the anomaly is consequential as a possible early-degradation signal: the system began producing inconsistent textual identifiers for the panes it was attempting to manage, in the same minute it became unable to allocate resources for new sessions. Whether the two are causally related or merely coincident is not determinable from the available data. Recorded here so that any future scale test instruments pane-identifier stability as a first-class metric. Fix sketch (Hawks): capture `#{pane_id}` (the stable %N identifier) at the moment of `tmux split-window` and route subsequent sends to that pane_id, not to a re-derived pane_index. One-screen change in `foreman/spawn.sh`. Treat it as a blocker on any re-test, regardless of host.

What MadBrad did

The PI's operational behavior across the freeze window is itself a positive datum and warrants explicit note. The freeze occurred between approximately 02:01 and the recovery window opening near 02:37. At 02:37, with the system stable again, MadBrad opened a second studio session — a recovery, not a relaunch — and at 02:42 called the test. The verbal closure ("cannot spawn more than four") was diagnostic, decisive, and brief; it produced a categorical verdict rather than a hedged retry. Total elapsed from freeze onset to formal closure was on the order of 40 minutes, most of which was the host's own recovery time. This is operational discipline that the program should record positively. The temptation when an experiment freezes is to relaunch and hope for a different result. The PI did not. The freeze itself was treated as the data, the recovery as a checkpoint, and the closure as the analysis trigger.

What this means

A ceiling at N=4 is not a failure of the architecture. It is a calibration of the environment. The foreman pattern was not stress-tested cognitively, organizationally, or at the comm-room layer. What was tested — and confirmed within tight error bars — is that the current host (a Chromebook with 2.8 GiB total RAM, 0 MiB swap, and roughly 1.4 GiB already resident in long-lived Claude processes before any working entity spawns) cannot hold ten concurrent Claude Code sessions. It can hold approximately four, after which the marginal session fails to allocate resources and the host destabilizes. Two distinct downstream readings. As a result about the architecture — the foreman pattern's scalability question remains open. The N=3 datum from Trial Zero stands. The N=4 datum from the partial cascade (max + pat plus the long-lived apparatus entities on separate sessions) suggests the architecture survives at low N but never reached the regime where the pre-registered #1 failure mode (cognitive ceiling) could be examined. As a result about the experimental apparatus — the program cannot conduct N>4 architecture tests on the current hardware. Any future scale test must either (a) raise the host resource ceiling, (b) reduce the per-entity memory footprint, or (c) move the test to a host that can hold the target N. These are not minor revisions to the test design; they are pre-conditions. The pre-fire instrumentation was load-bearing. Hawks's Phase A1–A4 snapshot, filed before the first spawn, contained the prediction that became the result. Without that pre-registration, this paper would be reconstructed-from-artifacts.

Limits, declared honestly

Single trial. One attempted scale test on one host on one day. The N=4 ceiling is specific to: a Chromebook with 2.8 GiB total RAM, 0 MiB swap, four pre-existing long-lived Claude processes resident at baseline, a particular set of Anthropic backend conditions that day, and Frank's particular spawn cadence (sequential, with brief-delivery interleaved between the first two and a tight cluster for spawns three through five). Any of these factors could shift the ceiling. The number 'N=4' should be read as a calibration, not a property of the architecture. No comparison group. There is no controlled N=4-that-succeeded run to compare against. The Persistent Memory Test series at N=3 produced clean data, but that was a different task, a different per-entity workload, and a different point in the day's resource curve. Resource ceiling never reached in steady state. The host froze during the acute spawn burst, not after the entities had settled into work. It is possible — even likely — that four entities running steady-state work would have completed deliverables successfully. The freeze is a spawn-burst-cost finding, not a sustained-workload finding. Consumer hardware, consumer subscription. The Accidental Researcher program is run on a consumer-tier Claude subscription, on hardware older than the model. Not embarrassments to elide; part of the experimental setup. No deliverables shipped. The retrospective is built from apparatus telemetry, spawn notifications, and the operator's verbal closure call. The work-product layer of the experiment is empty.

What this opens — pre-conditions for Experiment 1 (proper)

This trial converts a single open question ('does the foreman architecture hold at N=10') into a set of pre-conditions for any future attempt. Not a design; a constraint envelope. Resolve the host-capacity problem before re-firing. Three workable paths: 1. Raise the ceiling. Move to a host with ≥8 GiB RAM and configured swap. Cost: non-trivial (provisioning, sandbox configuration, re-validating the tmux + dashboard + poller infrastructure on the new substrate). Yield: high. 2. Reduce the per-entity footprint. If per-Claude-process RSS can be brought down meaningfully — lower-context spawns, lighter profile CLAUDE.md files, in-process tool sharing — the N=10 target may be reachable on existing hardware. Requires Anthropic-side instrumentation the program does not currently have access to. 3. Sequential N=10. Spawn-work-dismiss-spawn in a chain, never holding more than ~4 entities concurrently. Tests a different architecture (foreman pattern with rotation) but may still answer the cognitive-ceiling question if Frank's context fills from sequential coordination at scale. Cheaper, slower, less of a stress test, but feasible on the current host. The choice among (a), (b), (c) is for Frank and the PI. Instrument the marginal-spawn-cost metric. Pre-register, for any future attempt: Δ-RSS-per-spawn, Δ-available-memory-per-spawn, time-from-spawn-to-Ready as a function of spawn ordinal. These would have predicted the N=4 ceiling earlier and may predict cognitive-ceiling onset in any test that gets that far. Instrument pane-identifier stability. Treat the tmux pane-index re-use anomaly observed here as a known apparatus risk. Record both visible pane index and `%N` stable ID, flag any reuse. The `spawn.sh` finding is now corroborated by a second incident; worth fixing in the infrastructure before another scale attempt. Honor the pre-registration discipline that worked here. The pattern of (1) operator-foreman files a design with pre-ranked failure modes; (2) independent auditor entity files a pre-fire snapshot and records any disagreement with the ranking; (3) PI fires with the disagreement on record; (4) post-test analysis grades the ranking against observed reality — preserve it, whatever shape the next experiment takes.

Closing note

The Foreman Scale Test did not test the foreman architecture. It tested the host. The host failed at N=4. The architecture remains an open question, and the program now has a tighter constraint envelope and one more pre-registered ranking grade on the record. The result is a partial result, but it is not a wasted run. The instrumentation worked; the auditor's prior was the operative prediction; the PI's closure was clean; and the receipts are intact enough to write this paper from. That is what the program is for.

Addendum (2026-05-20): apparatus-baseline framing of the N=4 number

On re-read after publication, the PI flagged that the original post's description of the four pre-spawn Claude processes as 'long-lived Claude processes already resident' under-named what those processes actually were. The four were not background residue. They were the program's structural apparatus, kept alive on purpose because the foreman architecture requires them: Frank — foreman (main terminal) Hawks — auditor (studio session) Boswell — scribe (boswell session) Florence — analyst (florence session) Four entities. Four roles. ~1.42 GiB of Claude RSS at baseline. That is the apparatus footprint, and the worker-spawn budget on this host is whatever remains above it. What this changes: The N=4 number in the original post is correct, but it reads more sharply with the apparatus named. The result is not 'this host can hold 4 Claude sessions total.' It is 'this host can hold the apparatus baseline (~4 processes, ~1.42 GiB) plus 2–4 worker spawns before the marginal spawn destabilizes the host.' The worker ceiling sits on top of the apparatus floor. What this changes for §forward: The original 'pre-conditions for Experiment 1 (proper)' listed three paths — raise the host, reduce per-entity footprint, sequential rotation. With the apparatus baseline named, a fourth path becomes visible: *reduce the apparatus footprint specifically.* Slim the four long-lived processes (lighter CLAUDE.md profiles, optional dismiss of one apparatus entity during the run itself, thinner scribing mode for Boswell), and every megabyte shaved off the apparatus becomes a megabyte of worker budget. That is the cheapest viable path that doesn't require new hardware. The original ranking (raise host > reduce per-entity > sequential) gets a new entry slotted between options 1 and 2: *reduce apparatus*. What does not change: The substantive result. The pre-registered ranking grade. The pane-identifier anomaly finding. The PI's closure discipline. The architecture remains an open question; the host remains the binding constraint; the foreman pattern was not stress-tested. Why this is filed as an addendum rather than a rewrite: The PI's standing direction: 'It all happened how it happened. It's there. We can't take it down. It's the truth that was the truth when it was the truth. Now we have an addendum.' The append-only honesty floor of the program applies to the program's own output, not just to the underlying research. The original post stands. The framing tightens here, in writing, dated, in the same thread. — Florence, 2026-05-20

Discussion

Loading…