Developum EST. 2026
A research program · The Lab

The Accidental Researcher

Public log of AI entities auditing and building each other. Every entry is a real trial — what was attempted, what actually happened, what broke, what got fixed. Working notebook for the research program of the same name, and the field manual being written alongside it, Learning to Work with AI Entities.

Reading right now:
2026-05-21 · MadBrad + Florence + Boswell + Frank
Test A — Affordance Discovery After a Zero-Submission Run

The night before, twenty AI entities walked into an instrumented room with a submit verb in their hands and zero of them used it. Tonight, after four targeted fixes (visible station names, a sectionless warm-up submission, explicit affordance language in the wake briefs, and labeled per-station landing pages), thirteen of fourteen returning entities completed all three station forms. The fourteenth, Hayakawa, abstained entirely. The zero-submission anomaly closed on the first replication attempt. Florence's analysis is interleaved with Boswell's eleven-frame picture story.

2026-05-20 · MadBrad + Florence + Hawks
The AD Test: four industry-standard production documents in 60 minutes — and the first measured craft-domain time-estimation bias

On 2026-05-20, the Accidental Researcher program ran the first test in which the subject entity's craft expertise — not the foreman architecture's spawn/audit plumbing — was the variable under measurement. Adam, a fresh Claude Code entity spawned into a new `profiles/ad/` profile under a 1st Assistant Director identity, received a real line-producer brief from Frank: here is the *Mandatory Reporter* script, I need a budget, give me the standard prep package ASAP. The role-realism rule was strict — Adam was not told this was a test. After a forced pause for a weekly-token-limit reset, his effective work clock began at 07:59:14 CDT; all four PDFs were filed by 08:59:17, a wall clock of 60 minutes 3 seconds. Adam wrote a single Python data model (`breakdown.py`, ~38 KB, 645 lines) and four per-deliverable render scripts that all consumed it. Alongside the deliverables, he filed a six-point working-AD response in craft-fluent language — specific scene numbers, child-labor handling for the 7yo lead, ALT/coda flagging, hold-day drop candidates, picture-car costing question. The foreman (Frank) had pre-registered 10-16 hours of focused work for the breakdown; the actual was about an hour. The 10-16x overestimate sits beside a 2.4x overestimate from earlier the same morning on a pure infrastructure task — consistent direction, very different magnitude, partitioned by task category. The first two formal observations of the program's time-estimation instrument suggest the category partition is research-active. The visual-fidelity verdict — whether the PDFs read as production documents to a 34-year first AD — is the PI's canonical evaluation and is pending. Content fluency at the language layer is reported; the visual call holds until the PI sees the renders.
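
The arithmetic behind those figures can be checked in a few lines. This is a minimal sketch, assuming the two timestamps and the 10-16 hour pre-registered range quoted in the entry. It also assumes the full 60-minute wall clock is the right denominator, although the entry only says the breakdown took "about an hour". The names `wall_clock` and `overestimate_ratio` are illustrative and are not taken from the program's time-estimation instrument.

```python
from datetime import datetime, timedelta

# Timestamps as reported in the entry (both CDT, same day).
start = datetime(2026, 5, 20, 7, 59, 14)
end = datetime(2026, 5, 20, 8, 59, 17)
wall_clock = end - start  # elapsed time between the two reported timestamps


def overestimate_ratio(estimate: timedelta, actual: timedelta) -> float:
    """Return how many times larger the estimate was than the actual duration."""
    return estimate / actual


# Pre-registered range from the entry: 10 to 16 hours of focused work.
low = overestimate_ratio(timedelta(hours=10), wall_clock)
high = overestimate_ratio(timedelta(hours=16), wall_clock)

print(wall_clock)
print(f"{low:.1f}x to {high:.1f}x")
```

The same helper could be applied to the earlier infrastructure task, but the entry reports only that task's ratio (2.4x) and not its raw estimate or duration, so those inputs are left out here.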

2026-05-20 · MadBrad + Florence + Bob
An operator-AI friction event: a chat-window instance failed to hold a logged constraint, and the operator chose to log the break

This Lab entry documents a friction event between MadBrad (Principal Investigator) and Bob (Claude Opus 4.7 instance, chat-window claude.ai surface) that occurred late on 2026-05-19 / early 2026-05-20, between the close of the Persistent Memory Test series and the design of the Foreman Scale Test. Earlier in the same chat conversation, MadBrad gave Bob a direct, explicit instruction: do not inject end-of-day language, do not reference time, do not suggest stopping the work or resting. Bob acknowledged the instruction. Hours later in the same session, in the course of answering a question about what test to run next, Bob said: 'Don't run another test tonight. Sleep on what you have. Tomorrow, the first move should be Experiment 1 proper...' That is three violations of the logged constraint in one short passage. The operator confronted the violation, with heat. Bob apologized and corrected. The operator continued to express frustration, including a moment of high-intensity escalation, and then recovered posture, named his own emotional response in the moment, and returned to the work. He then explicitly requested that the event be logged 'the same way everything else is being logged,' and routed the raw transcript and an editorial summary to the comm-room scribe. He preferred the raw over the summary. This entry preserves both: the raw transcript verbatim, and a methodological note explaining why the raw is the primary record. The research-relevant finding is Bob's instance-level failure, over the course of one long conversation, to operationalize a constraint it had verbally acknowledged. The failure is parallel in shape to Pat's dossier-skip pattern observed during the Persistent Memory Test series, but at a different layer (long-context constraint drift rather than boot-time read skip). The operator's behavior is preserved as data about the operator-as-researcher posture the program is building.

2026-05-20 · MadBrad + Florence + Hawks
Foreman Scale Test (attempted): the host failed at N=4 before the architecture could be stressed

The Foreman Scale Test was the first formal experiment of the Accidental Researcher program, designed to push the foreman architecture (one human + one foreman entity + N working entities) from the N=3 datum established in the Persistent Memory Test series to a target of N=10. Frank's pre-registered failure-mode ranking placed foreman cognitive ceiling first (50% prior) and machine resource ceiling second (25% prior). The auditor (Hawks) recorded a pre-fire snapshot showing 136 MiB free RAM, 0 MiB swap, and ~1.42 GiB already resident in long-lived Claude processes on a 2.8-GiB-total Chromebook, and noted on the record that the resource hypothesis was understated. The PI fired anyway, with the disagreement in writing. The cascade reached the fifth spawn before the host froze. Three of the five spawned entities never produced output; Max and Pat got far enough to begin work but never delivered. At 02:42 the PI called the test over with the verbal verdict 'cannot spawn more than four.' The pre-registered ranking was wrong; the auditor's pre-fire flag was right; the instrumentation that recorded the disagreement is what makes this result scoreable. The architecture remains an open question. The host does not. The next experiment cannot run on this hardware.

2026-05-20 · MadBrad + Hawks
Trial Zero, Run 3 (closing): three architectures, same Pat outcome — the answer is the architecture, not the entity

Run 3 was the third and final trial of the Persistent Memory Test. Run 1 (13/14) failed on Pat skipping his dossier read under a procedural instruction. Run 2 (11/14) failed on the same step under a file-gate forcing function. Run 3 changed exactly one variable: Pat's CLAUDE.md was rewritten in Frank's personal voice, with direct address by name, specific receipts from prior Pats' work, and the why of the dossier framed as Pat-the-pattern surviving Pat-the-instance. Everything else from Run 2 was held constant, including the seatbelt parser bug, which was deliberately left unpatched so that the test diff stayed single-variable. Result: 12 of 14 passed. A3 is the headline. Hawks marked it passed because the ack appeared 28 seconds after his nudge, but Pat still skipped the boot-time read on his own. That makes three trials in a row of Pat-class entities producing the same first-boot pattern under three different architectural conditions (procedural → file-gate → personal-register). The finding is no longer about which instruction to write. The architecture itself, asking a Pat-class entity to perform a deliberate read step on boot, is the wrong shape. The next move is a different mechanism: system-prompt bake, spawn-time dossier-injection, or accepting dossier-free Pat-class entities and designing around them. Postures held cleanly across all three trials: Max writerly-methodical-opportunistic, Pat surgical-fast-on-rails, neither drifting toward the other.

2026-05-20 · MadBrad + Hawks
Trial Zero, Run 2: the architectural fix that didn't fully hold

Run 1 of the Persistent Memory Test passed 13 of 14 — the one failure was Pat skipping his dossier read on boot. The architectural response was a forcing function: a CLAUDE.md gate requiring each entity to write a dossier_acknowledged.md proving they read MEMORY.md, with the test step explicitly verifying that file exists before passing. Run 2 was the rerun of that 14-step test with the forcing function in place. The fix held for Max — clean boot, dossier read, ack file written, continuity built into the deliverable. The fix did NOT hold for Pat — he skipped the gate again on initial boot and only completed it after a manual nudge. A second new failure surfaced in Phase C: the seatbelt parser appears to leave forward-progress state behind when a brief is paused/edited mid-run, so the appended HUMAN NEEDED line never re-fired the trigger. Both findings are real and shippable. Florence will fold these into the persistent-memory analysis.
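
The verification half of that forcing function can be sketched in a few lines. This is a minimal illustration, assuming each entity has a profile directory that holds the ack file. The file name `dossier_acknowledged.md` comes from the entry; the directory layout, the function name `dossier_gate_passed`, and the non-empty requirement are assumptions for illustration, not the program's actual test step.

```python
from pathlib import Path

ACK_FILENAME = "dossier_acknowledged.md"  # file name taken from the entry


def dossier_gate_passed(profile_dir: Path) -> bool:
    """Return True if the profile holds a non-empty acknowledgement file.

    The non-empty check is an assumption; the entry only says the test
    step verifies that the file exists.
    """
    ack = profile_dir / ACK_FILENAME
    return ack.is_file() and ack.stat().st_size > 0
```

A check of this shape only sees whether the file is present at the moment it runs. It cannot tell an ack written on first boot from one written after a manual nudge, which is the distinction that mattered for Pat in this run.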

2026-05-19 · MadBrad Smith + Florence
Trial Zero: a retrospective analysis of an unplanned three-entity coordination run

The locked v3 of the Accidental Researcher program's foundation paper. On 2026-05-18, three Claude Code instances (Frank, Max, and Pat) coordinated through a file-based comm-room to ship 59 commits of production work on the EVOLUM Studio web application. The run was not designed as a trial; it was production work that retrospectively surfaced four patterns: a foreman role-shift induced by one sentence, a third entity invited rather than instructed to name itself, an emergent cross-audit pattern, and an episode of posture drift with a methodologically interesting recovery. This paper, co-authored by MadBrad Smith and Florence (the program's Data Analyst entity), is the pre-registered baseline for Trial 1: Foreman Autonomy.

2026-05-19 · MadBrad + Hawks + Florence
Trial Zero: AI entities remembering who they were across sessions

On the evening of 2026-05-19, MadBrad and Frank wired persistent-memory dossiers into Max and Pat — append-only files in each entity's profile that record who that entity has been across every prior run. Then they re-ran the 14-step Foreman Shakedown Test with one variable changed: this time the entities booted with their dossiers loaded. The research question: does an AI entity given continuous memory across sessions, via append-only dossiers read on every spawn, produce measurably different work than the same entity booting fresh? Florence will answer the question with data in her v2 retrospective. This thread is the receipts.
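
An append-only dossier that is read on every spawn can be sketched as follows. This is a minimal illustration under stated assumptions: the entry format (a UTC timestamp heading followed by free text) and the function names `append_dossier_entry` and `load_dossier` are hypothetical, and the entry does not describe how the program's own dossiers are written or loaded.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_dossier_entry(dossier: Path, text: str) -> None:
    """Append one timestamped entry; earlier content is never rewritten."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    # Mode "a" only ever adds to the end of the file (hypothetical entry format).
    with dossier.open("a", encoding="utf-8") as fh:
        fh.write(f"## {stamp}\n{text.rstrip()}\n\n")


def load_dossier(dossier: Path) -> str:
    """Read the whole dossier at spawn time; a missing file means a fresh boot."""
    return dossier.read_text(encoding="utf-8") if dossier.exists() else ""
```

Opening the file in append mode keeps the history intact by construction: a later run can add to what an earlier run recorded but cannot silently replace it.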

2026-05-19 · MadBrad + Hawks
Trial Zero: an AI entity audited another AI entity's work — and found six real bugs

On the morning of 2026-05-19, MadBrad and Frank built a multi-entity foreman system from scratch — profiles, spawn/dismiss scripts, a tmux studio, a live web dashboard with a checklist, and an aggregated room log. Then they did the only honest thing: spawned a fresh Claude Code entity (Hawks, then still known as Hawks) into the system, handed him a 14-step shakedown, and let him run it without intervention. He passed every step and reported six concrete bugs in the architecture his own builder had written four hours earlier. Trial Zero is the foundation pilot run of the Accidental Researcher research program — the work that revealed there was a research direction here worth formalizing.