AetherJudge — Web edition

The finding

A candidate engine that won 100% of paired trials on the design corpus was refused promotion. On a corpus it had never seen, its true runtime cost was 2.8× higher than the design-corpus estimate. The candidate was archived; the discipline that refused it was inscribed as a permanent invariant.

The lesson

A signal that does not survive transposition to data it has not chosen is not a signal — it is a property of the data. A six-value verdict ladder, written before promotion, is what makes that distinction enforceable rather than aspirational.

For whom

Anyone shipping decisions from systems whose verdicts cannot be reproduced blind — hand-tuned heuristics, ML pipelines, custom compilers, distributed systems with emergent behaviour. See § 14 for the applicability checklist; the rest of the paper is the case study that produced it.

Abstract

This paper documents a six-week, AI-assisted exploration of combat AI for a closed-source commercial roguelite — Aethermancer, a Unity-based monster-taming game sold on Steam — built around a hand-written deterministic simulator (authored via AI pair-programming, not statistically learned) validated against the decompiled game binary. The tracker codebase exceeds 100,000 lines of TypeScript across 652 commits, of which the combat module alone accounts for roughly 70,000; 70% of that module sits in an auto-audit subsystem named AetherJudge, and the test suite exceeds the combat-AI module's own source by a factor of 1.13×. Over sixty numbered slices — single-hypothesis experiments closed by a written verdict drawn from a six-value ladder — were executed across the six weeks. At least twelve hypotheses were closed with a negative verdict before reaching production.

The central methodological contribution is the discipline of independent corpus replay before any engine promotion. On one occasion — slice 4.2ter, April 30, 2026 — this discipline prevented a candidate engine from being shipped despite a 100% paired win-rate on the design corpus by revealing that its true runtime cost on a freshly captured independent corpus was 2.8× the design-corpus estimate. The candidate was archived. The verdict was inscribed as an invariant.

The central technical result is negative: a Gumbel-AlphaZero MCTS variant without a learned value network plateaued below a beam search refined over six weeks of phased, AI-assisted tuning, suggesting that the bitter lesson holds for the pairing of search with learned representations, not for search alone.

We additionally formalize the path E doctrine — under which an author may explicitly suspend a cost gate on the production surface while preserving it on the benchmark surface — and argue that the methodology transfers cleanly to any setting where the system under audit is itself partially opaque.

Six-week, AI-assisted study · April 6 – May 18, 2026 · Single engineer with Claude Code

60+ numbered slices
12 negative verdicts before prod
100% paired wins — refused
2.8× hidden cost caught by replay
100k lines of TypeScript

Independent-corpus replay caught a 2.8× cost overrun that paired benchmarks had hidden — the candidate was archived despite a 100% paired win-rate.
A Gumbel-AlphaZero MCTS variant without a learned value network plateaued below a beam search refined over six weeks of AI-assisted tuning.
Methodology dominates raw signal: the discipline of writing verdicts before promotion beats strong but unaudited candidates.

§ 1 Introduction

On April 29, 2026, slice 4 of AetherJudge produced the strongest signal the project had ever recorded. A candidate engine — beam-rerank-v1 — won 100% of decided paired trials against the production beam on a corpus of 110 snapshots, with a defeats delta of minus seven. The trigger subset showed forty-eight wins and zero losses. A reasonable engineer would have promoted. The methodology forbade it.

Slice 4.1 decomposed the seventy-one wins forensically and found that fifteen percent of them rested on a comparison fallback known to be weak — scoreState ties broken by a sixth-tier rule. The verdict held at SERIOUS_SIGNAL, explicitly not PROMOTION_READY. Slice 4.2 then tried to compress the candidate's runtime cost — and discovered, in the process, that the project owned no infrastructure for capturing a second corpus on which to revalidate. The design corpus had been collected once, externally, on April 27. There was no recorder. There was no manifest format. There was no way to ask the only question that now mattered: does this signal survive on data the probe has never seen? The slice closed with verdict INCONCLUSIVE.

Slice 4.2bis built the missing infrastructure — a snapshot recorder, a manifest builder, sixteen tests — and then halted with EXECUTION_BLOCKED because the capture itself required a human running the game on Windows for one to two hours. The agent could not produce the corpus. The slice waited.

When slice 4.2ter finally replayed the candidate on fifty independently captured combats — two hundred and nine snapshots in total — the signal direction held. The candidate did not lose a single decided trial. The sign-test p-value reached 2.6 × 10⁻²⁶. But the runtime cost ratio jumped from 4.76× on the design corpus to 13.26× on the independent corpus — a 2.8× underestimation that would have made the candidate non-viable in production. The candidate was promoted to archive: code preserved as offline oracle, never to run live. The methodology that had refused to promote it earlier was inscribed as a permanent invariant of the project.
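The two numbers in the paragraph above can be reproduced with a few lines of TypeScript. This is a minimal sketch: the helper names costInflation and signTestPValue are hypothetical, and the sign test assumes undecided trials are dropped and the test is one-sided. With zero losses the p-value reduces to 2^-n for n decided trials.

```typescript
// Hypothetical helpers; only the ratios 4.76 and 13.26 come from the text.

/** How far the independent-corpus cost ratio exceeds the design-corpus estimate. */
function costInflation(designRatio: number, independentRatio: number): number {
  return independentRatio / designRatio;
}

/**
 * Exact one-sided sign test: probability of at least `wins` candidate wins
 * among wins + losses decided trials if every trial were a fair coin flip.
 * Assumes undecided trials were removed before the call.
 */
function signTestPValue(wins: number, losses: number): number {
  const n = wins + losses;
  let term = Math.pow(0.5, n); // P(exactly n wins)
  let total = 0;
  for (let k = n; k >= wins; k--) {
    total += term;
    term = (term * k) / (n - k + 1); // step from C(n, k) to C(n, k - 1)
  }
  return Math.min(1, total);
}

console.log(costInflation(4.76, 13.26).toFixed(2)); // inflation of the cost ratio
console.log(signTestPValue(5, 0)); // toy case: five wins, no losses
```

The inflation rounds to the 2.8× quoted above. The toy sign-test call only illustrates the shape of the computation, not the project's actual trial counts.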

This paper is about that methodology, the simulator it stands on, the engine search it organizes, and the ceiling it eventually let us measure honestly. It is also about a result we did not expect to write: that a search-based AlphaZero variant without a learned value network does not, on this terrain, beat a beam search refined over six weeks of phased, AI-assisted tuning. The two threads — methodology and negative result — are not independent. It is precisely because the methodology refused to let us round up our experiments that we were able to declare the ceiling at all.

The remainder of the paper is organized as follows. § 2 surveys adjacent literature on MLOps evaluation rigor, negative results in deep reinforcement learning, search-based game AI under bounded compute, the AlphaZero family, and differential testing: the five threads the paper sits between. § 3 introduces the game and the combat structure that makes the AI nontrivial. § 4 documents the simulator and the cross-check discipline against the decompiled binary that anchors every later verdict. § 5 surveys the engine registry and the production beam. § 6, the central section, formalizes the slice methodology: the ladder of six verdicts, the A→B→C distillation workflow, and the independent corpus replay procedure. § 7 returns to the 4.2 incident above in full chronological detail. §§ 8–11 cover frozen flags, the path E doctrine, the failed distillation experiment, and the measured ceiling. §§ 12–14 position the work intellectually, declare threats to validity, and propose transposable applications. § 15 concludes.

§ 2 Related work

This paper sits at the intersection of four threads that are rarely cited together, plus a fifth — differential testing — which supplies the strongest analogy for the discipline at the heart of the methodology. We position the work against each in turn, and read them as a single composite at the end.

Evaluation rigor in machine-learning systems

Sculley and colleagues (Sculley et al. 2015, NeurIPS; Breck et al. 2017, IEEE Big Data) catalogued the long-tail debt that accumulates when machine-learned components are promoted to production without test infrastructure equivalent to what is standard for non-ML code. Their ML Test Score remains, almost a decade later, the most explicit checklist for ML reliability, and the field has largely ignored it. The slice methodology proposed in § 6 is, in spirit, an attempt to operationalize a test-score discipline for a setting that is not, strictly speaking, machine learning: a hand-tuned beam search whose authors keep being tempted to call themselves heuristics designers rather than engineers shipping production decisions. The lesson is symmetric: any system whose verdicts the authors cannot reproduce blind is one accumulating Sculley-debt, ML or not.

Negative results and reproducibility in deep reinforcement learning

Henderson et al. (Henderson et al. 2018, AAAI) demonstrated that the apparent performance of canonical deep-RL algorithms collapses or inverts under seed variance, hyperparameter drift, and implementation idiosyncrasies. Engstrom et al. (Engstrom et al. 2020, ICLR) extended the result by showing that the algorithmic improvements claimed by published PPO variants are dominated by implementation-level choices invisible from the paper. Both argued, in effect, for what we here call independent corpus replay: a verdict is not a verdict until something the authors did not choose can reproduce it. The 4.2bis → 4.2ter sequence documented in § 7 is a single-engineer instantiation of that demand. Where Henderson and Engstrom called for seed independence and implementation independence, we call for corpus independence — but the structural argument is the same.

Search-based game AI without learned policies

The contemporary public-facing game-AI literature is dominated by deep RL, but a parallel community has continued to refine search and heuristic approaches under bounded compute. Foul Play and the broader Pokémon Showdown ecosystem have demonstrated that a carefully tuned minimax with hand-engineered evaluation can hold its own against learned policies on a game with combinatorial action spaces. The recent PokéChamp work (Karten et al. 2025, ICML) is the most direct neighbor of the present paper: an LLM-augmented search agent on a closed-source competitive game with deterministic rules but stochastic resolution — the same structural class as Aethermancer. We share the diagnosis that pure end-to-end RL is not the right tool for this problem class. We differ in that PokéChamp uses an LLM as a runtime decision aid, whereas the architecture documented here keeps the LLM out of the runtime loop and confines it to development-time slice authorship — a choice motivated as much by determinism and audit cost as by latency.

The AlphaZero family and its preconditions

The Gumbel-AlphaZero variant (Danihelka et al. 2022, ICLR) we instantiate in our MCTS archive promises sample-efficient policy improvement under tight simulation budgets — sixty-four simulations replacing the canonical eight hundred. The promise relies on the presence of a learned value network. In our case there is none: the value signal is a tanh of the hand-written scoreState. The archived MCTS plateau documented in § 11 is therefore consistent with the family's stated requirements, not a refutation of it. We note in passing that the adversarial-policy work of Wang et al. (Wang et al. 2023, ICML), in which simple strategies defeat superhuman Go agents, is the strongest published evidence that opaque policies carry exploitable blind spots — which is part of why this project never let the MCTS leave its archive even before the win-rate verdict was in.

Differential testing as the missing analogy

The closest methodological analogue to our simulator-versus-decompiled discipline is not from game AI at all — it comes from compiler validation. Yang et al. (Yang et al. 2011, PLDI), through Csmith, and Le, Afshari and Su (Le et al. 2014, PLDI), through Equivalence Modulo Inputs, established that the only credible oracle for a compiler is another compiler, executed differentially on the same input. Our hand-written simulator stands to the decompiled C# binary as a candidate compiler stands to GCC: each divergence reveals a bug in one of the two, and the cross-check is what makes the verdict trustworthy. To our knowledge, this is the first published work that explicitly frames game-engine reimplementation against a decompiled commercial binary as a differential-testing problem in the Csmith/EMI sense, though the underlying technique is presumably folk knowledge in MMO botting communities that do not publish, and structurally similar workflows exist around community reimplementations such as Pokémon Showdown or Sabberstone that have never been formalized under the differential-testing label.

Synthesis

Reading the five threads together produces an awkward composite that the literature does not currently name: a single-engineer project applying MLOps test discipline to a search-based system, validated against a differential oracle reverse-engineered from a commercial binary, while explicitly refusing the deep-RL promotion path on the basis of its own measured negative result. The present paper attempts to make that composition coherent and, by writing it down, transposable.

§ 3 The game and the combat

Aethermancer is a Unity-based roguelite released on Steam by a small studio. It crosses the floor-by-floor progression structure of Slay the Spire with the monster-capture mechanic of Pokémon: the player advances through a hand-crafted run, recruiting up to three monsters into an active party, leveling them, and customizing their skill slots between encounters. Each encounter is a turn-based combat against one to three enemy monsters, with the player's avatar — the eponymous Aethermancer — participating on the field as a fourth unit able to deploy a small inventory of consumables but no skills of its own. Combat is deterministic from the player's perspective for action selection but stochastic for resolution: critical hits, evasion rolls, and a handful of skill effects resolve through hidden RNG that the live binary draws but does not expose.

Video 1 — A 2×-speed excerpt from one combat encounter. Three allied monsters and the Aethermancer (player avatar) face enemy monsters in turn-based exchange; skills, consumables, Aether economy, and capture attempts compose the legal action space described below.

A combat in shape

Three allied monsters plus the Aethermancer face one to three enemies. Initiative determines turn order through a Speed statistic modified by Stagger effects. Each unit, on its turn, selects an action: a skill from its three or four equipped slots, a consumable from the player's sidecar, or a capture attempt on a corrupted enemy. Skills consume Aether, a collective player-side resource regenerated by generator skills and burned by spender skills, which the player must manage as a stock rather than as a per-unit pool. Buffs and debuffs carry durations measured in rounds. Combat ends in victory (all enemies dead), defeat (all allies dead or the Aethermancer downed), or timeout. Standard encounters last four to ten rounds; boss fights run eight to fifteen. The simulator caps rollouts at thirty rounds for tractability.

Branching factor

On a typical turn with three living allies, each with four skill slots and three targets, plus four consumables across three targets, the legal action space sits at roughly 50–80 atomic actions before any pruning. In practice the production beam deduplicates on action signature and prunes aggressively, and the effective round-0 candidates returned by Top-K extraction sit between two and five. The factor is wide enough that exhaustive enumeration to depth three is infeasible, narrow enough that depth two is tractable for any reasonable beam width, and ill-behaved enough at depth four to five that the forecast drift attribution documented in § 11 — approximately 18% of a 72.2% decision divergence between depth-3 and depth-5 beams — overwhelms the gain from looking further ahead.

Observability and determinism

The game state is fully observable through our BepInEx hook; the live capture is exhaustive. The only opacity is the per-resolution RNG the binary samples internally. The production simulator is rendered deterministic by an exactMode flag that disables the expected-value smoothing applied to incoming damage in stochastic mode. All AetherJudge verdicts — paired benchmarks, oracle rollouts, candidate evaluations — run in exactMode. This is a methodological choice with a cost: the AI is judged on its capacity to dominate the deterministic projection of the game, not on its expected behavior under the stochastic original. We accept this cost in exchange for verdict reproducibility, and we argue in § 11 that the residual gap between the two surfaces is dominated by simulator-fidelity drift rather than by exactMode itself.

§ 4 The simulator and the cross-check discipline

Every verdict in this paper rests on one infrastructural choice: the project owns a hand-written deterministic simulator of the game's combat, separate from the live binary, and validates its behavior against the binary differentially. Without this discipline the methodology of § 6 would have no oracle. With it, every decision the search makes can be re-played, audited, and compared bit-stable against a snapshot the live game produced. The cost is the duplication of a non-trivial combat engine in TypeScript; the return is the right to call any later experiment reproducible.

Why hand-write a simulator at all

Three alternatives were considered and rejected. Driving the live binary through the BepInEx hook would have been the most faithful oracle but is single-threaded, stochastic, and slow — unusable for any search that wants to expand a beam of 32 leaves per round. Lifting the decompiled C# into a headless evaluator would have eliminated the duplication but inherits the binary's stochastic resolution path and exposes the audit to whatever obfuscation IL2CPP introduced. A learned simulator — a value or transition model trained on captured traces — was excluded as a deliberate AetherJudge constraint: the methodology forbids opaque components on the promotion surface, and a learned simulator would obscure the very divergences the cross-check is designed to surface. The hand-written path is the only one that preserves both speed and inspectability.

exactMode and the determinism contract

The simulator runs in two regimes. In stochastic mode it applies the same expected-value smoothing the live binary uses for crit, evasion, and partial resolution effects; this is the natural mode for human play and live forecasting. In exactMode, that smoothing is disabled: behind a flag, each crit, evasion, and partial resolution outcome is held constant at its modal value, and the resolution path becomes a pure function of SimState and the action chosen. Every search, every bench, and every audit in this paper runs in exactMode. A given snapshot, replayed through the production beam, produces the same decision byte-for-byte across runs, machines, and time. This is what allows the slice methodology of § 6 to attribute a measured difference between two engines to the difference in the engines themselves, not to the difference in their random seeds.

The cost of exactMode is also explicit, and it is the one already accepted in § 3: the beam's choice is optimal against its own deterministic projection of the game, not against the expected behavior under the stochastic original. The argument that the residual gap between the two surfaces is dominated by simulator-fidelity drift rather than by exactMode itself is deferred to § 11.
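The two regimes can be pictured with a minimal TypeScript sketch for a single crit roll. The names HitSpec and resolveHitDamage are hypothetical, and reading "modal value" as "the more likely of crit and no-crit" is an assumption made for illustration; none of this is taken from the project's simulator.

```typescript
// Hypothetical shape of one damaging hit; field names are illustrative only.
interface HitSpec {
  baseDamage: number;
  critChance: number; // probability in [0, 1]
  critMultiplier: number; // 2 means a crit doubles the damage
}

function resolveHitDamage(hit: HitSpec, exactMode: boolean): number {
  if (exactMode) {
    // Pin the roll to its most likely outcome (assumed reading of "modal value").
    const critIsModal = hit.critChance > 0.5;
    return critIsModal ? hit.baseDamage * hit.critMultiplier : hit.baseDamage;
  }
  // Expected-value smoothing: average over the crit and no-crit branches.
  return hit.baseDamage * (1 + hit.critChance * (hit.critMultiplier - 1));
}

const sampleHit: HitSpec = { baseDamage: 100, critChance: 0.25, critMultiplier: 2 };
console.log(resolveHitDamage(sampleHit, false)); // smoothed regime
console.log(resolveHitDamage(sampleHit, true)); // exactMode regime
```

In this toy case the smoothed regime returns 125 and exactMode returns 100. Either way the result is a pure function of the inputs, which is the property the determinism contract relies on.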

The differential cross-check

The pattern is borrowed directly from the compiler differential testing tradition cited in § 2. Three artifacts converge on a single comparator. The live binary, instrumented by the BepInEx hook, emits a JSONL stream of round-by-round snapshots through test-ws-client.ts — a capture of what the game actually did. The decompiled C# source, extracted offline via dnSpy and ILSpy, plays the role of ground-truth oracle: not an executable, but a readable reference against which any disputed simulator behavior can be reconciled by inspection. The simulator itself, run in exactMode over the same pre-snapshot state and the same action choice, produces its own post-state. The three feed a differential comparator which classifies every observed divergence as a DRIFT or EXONERATES the simulator for that case.

The DRIFT taxonomy

A DRIFT verdict carries one of four severities, each tied to what the divergence threatens.

P0: Outcome flip. The simulator and the binary disagree on whether the unit dies, the combat ends, or the action lands. The most severe class; every further verdict drawn from the simulator on snapshots sharing the offending mechanic must be considered suspect until the drift is resolved.

P1: Magnitude divergence beyond tolerance. Both surfaces agree on the qualitative outcome but disagree on a numeric magnitude (damage dealt, shield consumed, Aether regenerated) by more than the calibrated tolerance for that mechanic.

P2: Trajectory divergence. The post-state is consistent within tolerance, but the path by which it was reached differs (intermediate triggers, buff application order, multi-target resolution sequence).

P3: Cosmetic. A divergence on a field that does not feed any later decision (animation timestamps, audio cue identifiers, transient UI flags). Logged for traceability, never blocks a verdict.

The four-tier ladder is not symmetric. P0 and P1 must be resolved before the simulator can be considered an oracle for the affected mechanic; P2 may be tolerated when the project judges that the simulator's trajectory is preferable to the binary's (a position the project has only occupied twice, both documented in their respective slice files); P3 is preserved purely as a forensic byproduct. In practice the project has closed substantially more drifts than it has tolerated, and the residual P0-P1 catalog has stabilized to a small set of mechanics around which the production beam is deliberately conservative — a conservativeness the cross-check made explicit rather than inferred.
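The asymmetry of the four tiers can be written down as a small gating rule. The sketch below is illustrative only: the DriftRecord shape, the function names, and the choice to treat an open P2 as blocking unless a slice explicitly tolerates it are assumptions, not the project's comparator.

```typescript
type DriftSeverity = "P0" | "P1" | "P2" | "P3";

// Hypothetical record shape; only the four severities come from the text.
interface DriftRecord {
  mechanic: string;
  severity: DriftSeverity;
  resolved: boolean;
  toleratedInSlice?: string; // slice file documenting an accepted P2
}

/** True when the drift prevents the simulator from serving as oracle for its mechanic. */
function blocksOracle(drift: DriftRecord): boolean {
  if (drift.resolved || drift.severity === "P3") return false; // P3 is logged only
  if (drift.severity === "P2") {
    // Assumption: an open P2 blocks unless a slice explicitly tolerates it.
    return drift.toleratedInSlice === undefined;
  }
  return true; // open P0 and P1 always block
}

/** Sorted, de-duplicated list of mechanics whose verdicts are currently suspect. */
function suspectMechanics(drifts: DriftRecord[]): string[] {
  const names = drifts.filter(blocksOracle).map((d) => d.mechanic);
  return names.filter((mechanic, i) => names.indexOf(mechanic) === i).sort();
}
```

Keeping the rule in one predicate makes the conservative set of mechanics a computed list rather than something inferred from memory.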

With the simulator anchored against the binary in this way, the section that follows can treat the engine registry as operating on a trustworthy substrate. Whatever the beam, the reranker, or the archived MCTS produce, the comparison between them is a comparison between policies, not between interpretations of the same scene.

§ 5 The engine registry and the production beam

Engines are first-class objects in the project. Each is a typed factory registered under a stable key, carrying an EngineKind field that tells the rest of the system how it may be used. The taxonomy is small — production for the engine the live advisor consults, baseline for a deterministic reference engine never intended for promotion (currently heuristic-baseline), candidate for an engine under active evaluation, archive for an engine that has delivered a real signal but is not viable in production (currently mcts-archive and, from a different direction, beam-rerank-v1), and rejected for an engine that was falsified at the paired bench (beam-antiwaste-v1, discussed below).

The rule that holds the taxonomy together is simple and inviolable: a rejected engine may never return to production under the same registry key. The code remains in tree as forensic record — the failure is preserved with its slice number embedded — but the key is burned. Any second attempt at the same idea must register a new key (e.g. beam-antiwaste-v2) and earn its way through a fresh paired bench. This rule is the structural counterpart of the methodology in § 6: it makes the cost of a failed promotion visible in the registry itself, and it prevents the slow drift by which a once-rejected idea quietly becomes production after enough subsequent slices have softened the institutional memory of why it was rejected in the first place.
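A minimal sketch of how a burned key could be enforced at registration time follows. The class and method names are hypothetical, and the sketch is stricter than the rule as stated: it refuses any re-registration of a rejected key under another kind, not only a return to production. It also keeps a single kind per key, so the dual-kind regime of beam-rerank-v1 is not modelled here.

```typescript
type EngineKind = "production" | "baseline" | "candidate" | "archive" | "rejected";

// Hypothetical registry entry; `slice` records where the kind was decided.
interface EngineEntry {
  key: string;
  kind: EngineKind;
  slice?: string;
}

class EngineRegistry {
  private readonly entries: { [key: string]: EngineEntry } = Object.create(null);

  register(entry: EngineEntry): void {
    const existing = this.entries[entry.key];
    if (existing !== undefined && existing.kind === "rejected" && entry.kind !== "rejected") {
      // The key is burned: a second attempt must come back under a new key.
      throw new Error(`registry key "${entry.key}" is burned (rejected in ${existing.slice ?? "an earlier slice"})`);
    }
    this.entries[entry.key] = entry;
  }

  kindOf(key: string): EngineKind | undefined {
    const entry = this.entries[key];
    return entry === undefined ? undefined : entry.kind;
  }
}

const registry = new EngineRegistry();
registry.register({ key: "beam-antiwaste-v1", kind: "rejected" });
registry.register({ key: "beam-antiwaste-v2", kind: "candidate" }); // fresh key, fresh bench
```

Trying to register beam-antiwaste-v1 again as production throws, which is the point: the failure stays visible in the registry rather than in anyone's memory.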

beam-rerank-v1 has been production on the live advisor runtime since slice 18 path E, but remains archive on the paired-bench surface. The dual-kind regime is intentional and is the subject of § 9. Mechanically, it is a wrapper that asks the beam for its Top-K=3 round-0 candidates, forces each as the first action, runs a counterfactual rollout under the beam, and returns the trial that wins the compareTrials chain.

Phases of the production beam

The production beam was built across eight phases over the project's six weeks. Each phase resolved a single architectural question and shipped a single mechanism. We summarize them in order — the labels are a thematic decomposition of the work, not a calendar timeline.

Phase A — Simulator calibration. Three sub-phases gated the simulator on truth against the live binary before any search was built on top: a PassiveConcealed evaluation gate, correct PassiveAetherShield timing, the exactMode flag that disables expected-value crit smoothing, and a three-tier validation harness running the simulator against captured combats at unit, integration, and replay scales.

Phase B — Buff and trigger coverage. Extended the simulator's effect tracking to the buff space: Force, Glory, Wholesome, TriggerBuff dispatch. Without correct buff propagation, leaf evaluation in any future search would be incoherent.

Phase C — Opponent modeling. Introduced EnemyFutureMode with two settings: preview_reuse (use the actual enemy intent preview the game shows at round 0) and predicted_policy (heuristic prediction at deeper rounds). Round 0 always uses the real preview; rounds 1 and 2 fall back to predictEnemyPreviews — a tier-based heuristic selecting enemy actions by kill-this-turn → lowest effective HP → paid-offensive → free-offensive, with a special signature for boss fights. The 18% forecast drift documented in § 11 is the measured cost of this predicted-policy regime.

Phase D — Variable-width beam with diversity. The architectural step. Beam width tapers (32 → 20 → 12) with depth, inner expansion (40 → 28 → 20 → 16) with breadth. Six diversity buckets — damage, sustain, poise, setup, summon, other — receive a ceiling of width/6 survivors each, preventing the beam from collapsing onto a single archetype. Lineage tracking ensures that a parent's classification informs its descendants' bucket attribution.
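The per-bucket ceiling can be sketched as a greedy pass over score-sorted leaves. This is an illustration under stated assumptions, not the production selection: the Leaf shape and selectSurvivors are hypothetical names, and rounding width/6 down (with a floor of one survivor) is a guess the text does not confirm.

```typescript
type Bucket = "damage" | "sustain" | "poise" | "setup" | "summon" | "other";

// Hypothetical leaf shape; the six bucket names are the ones listed in the text.
interface Leaf {
  id: string;
  bucket: Bucket;
  score: number;
}

/** Keep at most `width` leaves, best score first, with a per-bucket ceiling of width/6. */
function selectSurvivors(leaves: Leaf[], width: number): Leaf[] {
  const cap = Math.max(1, Math.floor(width / 6)); // assumed rounding rule
  const taken: Record<Bucket, number> = { damage: 0, sustain: 0, poise: 0, setup: 0, summon: 0, other: 0 };
  const sorted = leaves.slice().sort((a, b) => b.score - a.score);
  const survivors: Leaf[] = [];
  for (const leaf of sorted) {
    if (survivors.length >= width) break;
    if (taken[leaf.bucket] >= cap) continue; // bucket is full, skip this archetype
    taken[leaf.bucket] += 1;
    survivors.push(leaf);
  }
  return survivors;
}

// Assumed use: one call per depth with the tapering widths quoted above.
const WIDTH_BY_DEPTH = [32, 20, 12];
```

Under this rounding, a depth-2 beam of width 12 keeps at most two leaves per bucket, so eight strong damage lines cannot crowd out a single sustain line.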

Phase E — Contextual leaf evaluation. A refactor of scoreState introducing context-aware signals: contextualDamageValue (discounts damage on already-doomed enemies), dynamicAetherStockCap (penalizes hoarding when no use is imminent), contextualDotValueOnEnemy (values damage-over-time differently against shielded vs unshielded targets), and poiseProgressValue (rewards stagger setup). Approximately seventy hand-tuned weights, frozen since.

Phase F — Monte Carlo rollout tail. On the top-4 surviving leaves, the beam continues with a greedy ally policy and the predicted-policy enemy model for ten further rounds, producing a terminal score. A ROLLOUT_MARGIN_THRESHOLD of 6 gates the blend: if the leaf margin between rank 0 and rank 1 exceeds the threshold, the leaf score dominates (tactical clarity); otherwise the score blends as 0.4·leaf + 0.6·terminal (strategic tie-breaker).
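The blend gate above reduces to a two-branch function. The sketch below uses the threshold and weights quoted in the text; the function name and the reading of "dominates" as "the leaf score is used alone" are assumptions.

```typescript
const ROLLOUT_MARGIN_THRESHOLD = 6;

/**
 * Score used to rank a leaf after its rollout tail.
 * leafMargin is the leaf-score gap between rank 0 and rank 1.
 */
function blendedLeafScore(leafScore: number, terminalScore: number, leafMargin: number): number {
  if (leafMargin > ROLLOUT_MARGIN_THRESHOLD) {
    return leafScore; // tactical clarity: assumed to mean the leaf score is used alone
  }
  return 0.4 * leafScore + 0.6 * terminalScore; // strategic tie-breaker
}

console.log(blendedLeafScore(10, 20, 7)); // margin above threshold
console.log(blendedLeafScore(10, 20, 6)); // margin at threshold, blend applies
```

A margin exactly equal to the threshold falls on the blended side here, because the text says the margin must exceed the threshold for the leaf score to dominate.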

Phase G.1 — Dead-code excision. A housekeeping phase that removed an orphan evaluator.ts module to eliminate maintenance entropy. Trivial functionally, mentioned only because the project keeps a written record of removals as carefully as of additions.

Phase H — Consumables and Aethermancer capture. Added consumable plays as legal actions for the Aethermancer unit and capture attempts as legal actions against corrupted enemies. A sidecar with MAX_CONSUMABLES_PER_ROUND=1 keeps the search tractable: chaining multiple consumables in a single round is almost never tactically optimal, and allowing it makes the beam's branching factor explode. Phase H.1 was a follow-up review pass that adjusted the survivalSlackFactor range on capture reward.

The combined object — variable width over diversity buckets, lookahead 3, contextual leaf evaluation, predicted-policy opponent at rounds > 0, rollout tail on top-4, consumables and captures included — is the production beam. It is the champion that no other engine in the registry has been able to dethrone on a paired bench. § 11 returns to why.

The reranker and the rejected wrapper

Two engines wrap the production beam in interestingly opposite ways and deserve a closer look, because they pre-figure the methodological arguments of §§ 6, 7, and 9.

beam-rerank-v1 asks the beam for its Top-K=3 distinct round-0 candidates via computeTopKTurnsFromState. For each, it forces the candidate as the first action, lets the beam continue subsequent rounds normally, and obtains a complete TrialResult. The compareTrials chain — outcome → alliesAlive → enemiesAlive → enemiesRemainingHp → turnsPlayed → finalScore — selects the winner; if the rollouts fail to break a tie cleanly, the baseline beam top-1 wins the tiebreak to avoid flipping on noise. The model is not a neural network or a gradient-boosted ensemble. It is the beam itself, used as a secondary evaluator on its own outputs. This makes the reranker a policy-improvement step in the AlphaZero sense — simulation-backed reordering — without the AlphaZero value network. It is also why the reranker is deterministic and auditable: every label can be reproduced bit-stable by replaying the same rollouts.
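The comparison chain and the tie-to-baseline rule can be sketched as a lexicographic comparator. Everything beyond the order of the six criteria is an assumption: the TrialResult field types, the ranking of timeout between victory and defeat, the direction of each criterion (fewer enemies, less enemy HP, and fewer turns counted as better), and the names compareTrialsSketch and pickReranked.

```typescript
type TrialOutcome = "victory" | "timeout" | "defeat";

// Hypothetical shape; field names follow the chain quoted in the text.
interface TrialResult {
  outcome: TrialOutcome;
  alliesAlive: number;
  enemiesAlive: number;
  enemiesRemainingHp: number;
  turnsPlayed: number;
  finalScore: number;
}

const OUTCOME_RANK: Record<TrialOutcome, number> = { victory: 2, timeout: 1, defeat: 0 };

/** Positive if `a` is better than `b`, negative if worse, 0 if the chain cannot decide. */
function compareTrialsSketch(a: TrialResult, b: TrialResult): number {
  const chain: Array<(t: TrialResult) => number> = [
    (t) => OUTCOME_RANK[t.outcome],
    (t) => t.alliesAlive,
    (t) => -t.enemiesAlive, // assumed: fewer surviving enemies is better
    (t) => -t.enemiesRemainingHp, // assumed: less remaining enemy HP is better
    (t) => -t.turnsPlayed, // assumed: a shorter fight is better
    (t) => t.finalScore,
  ];
  for (const criterion of chain) {
    const diff = criterion(a) - criterion(b);
    if (diff !== 0) return diff;
  }
  return 0;
}

/** Index of the winning trial; index 0 is the beam's own top-1 and keeps every tie. */
function pickReranked(trials: TrialResult[]): number {
  let best = 0;
  for (let i = 1; i < trials.length; i++) {
    if (compareTrialsSketch(trials[i], trials[best]) > 0) best = i;
  }
  return best;
}
```

Because a challenger must be strictly better to displace the current best, an undecided comparison leaves the beam's top-1 in place, which matches the stated intent of not flipping on noise.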

beam-antiwaste-v1 attempted the inverse: rather than rerank Top-K candidates, it filtered out consumable plays classified as OVERUSE-Flasque or OVERUSE-Fouet by an audit pass. On the paired bench, the candidate posted a 10% global win-rate against the beam, with 1 win against 9 losses on the trigger subset. The forensic postmortem revealed that blocking Flasque-capture in already-lost defeats destroyed a survival-extending crowd-control effect: the AI was using the consumable as a stalling tool, not as a misplay. The slice produced one of the project's most-cited methodological invariants: diagnostic metrics are not promotion metrics. A regret count is not a verdict. The engine remained in the source tree as rejected, blocked from re-promotion under the same key.

The asymmetry of these two wrappers is informative. beam-rerank-v1 respects the beam's enumeration and improves the selection. beam-antiwaste-v1 overrode the beam's enumeration with hand-tuned filters and discovered that hand-tuned filters carry context the beam already had. The reranker treats the beam as a candidate generator; the antiwaste treated the beam as a candidate that needed correcting. The first survived. The second did not. § 6 generalizes the methodological principle that drove this distinction.

§ 6 The slice methodology

The methodology stands on a single sentence inscribed in the third slice of the project:

Diagnostic metrics are not promotion metrics.

Every other procedure in this section is a consequence of that statement, and every failure documented in subsequent sections is, in some form, the rediscovery that someone confused the two. A diagnostic metric measures something about the world. A regret count over a corpus, an OVERUSE rate per consumable, a forecast-drift percentage between predicted and ground-truth enemy actions — these are observations. They tell us where the system might be wrong. A promotion metric measures whether a candidate engine improves outcomes against an established baseline on the surface where it would ship. A defeat delta on paired benchmark, a paired win-rate decided through compareTrials, a cost ratio against the runtime budget — these are verdicts. They tell us whether a candidate is allowed to leave the candidate kind. Conflating the two destroys the project's ability to know whether anything got better. Most of the project's hardest-earned discipline is a procedural defense against that single confusion.

§ 6.1 — Two surfaces

The methodology partitions every AetherJudge artifact onto one of two surfaces with disjoint comparison logic.

The hypothesis-generating surface is where audits live. The regret lab, oracle case mining, score-stability probes, mechanic-axis refinement, and the forecast-drift investigations all sit here. Their job is to produce signals worth testing, not to render verdicts. A regret count of 21 MED+HIGH cases for the production beam and 29 for a candidate is a fact about how the two engines disagree with the oracle; it is not a comparison of their qualities. The two engines may disagree with the oracle on different cases. They may disagree on the same cases for opposite reasons. They may both be correct against a flawed oracle. The hypothesis-generating surface tells us where to look. It does not tell us what we found.

This is encoded as a hard CI gate. The audit and regret golden tests assert that each generated report contains the banner "HYPOTHESIS-GENERATING METRICS" and, in its triage section, the parenthetical "NOT a promotion gate"; both strings are checked independently. Six adjacent hypothesis-generating surfaces (oracle mining, sim forecast, score stability, rollout tail, forecast drift, outcome telemetry) assert the equivalent single banner "HYPOTHESIS-GENERATING ONLY" in their own golden tests. The rerank-bench reports go further: their golden test asserts that when both banners coexist in the same report, the PROMOTION-GATE banner appears before the gate table. Promotion follows diagnostics by methodology, and the order is checked. The separation between the two surfaces is enforced in source, not merely documented.

The promotion-gate surface is where verdicts live. The paired benchmark (policy-bench and the league it organizes) is the canonical instrument. It runs two engines against identical starting states, with identical seeds and identical exactMode settings, and aggregates the outcomes through the compareTrials chain: outcome → alliesAlive → enemiesAlive → enemiesRemainingHp → turnsPlayed → finalScore. A trial passed through compareTrials resolves to one of three values: candidate wins, baseline wins, or undecided. Aggregated across a corpus, the result is a paired win-rate (RAW), a strict win-rate (STRICT, which additionally counts as undecided any trial decided only by the finalScore fallback), and a defeats delta. These three numbers together constitute the verdict surface.
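A minimal sketch of how such a tiered comparison could look, written only from the chain order given above. The field types, the direction in which each tier counts as "better", and the names compareTrialsSketch and strictVerdict are assumptions made for illustration; the project's actual compareTrials may differ.

```typescript
// Hypothetical trial summary. Field names follow the chain in the text;
// the types and the "better" direction of each tier are assumptions.
type Outcome = "victory" | "defeat";

interface TrialResult {
  outcome: Outcome;
  alliesAlive: number;        // assumed: higher is better
  enemiesAlive: number;       // assumed: lower is better
  enemiesRemainingHp: number; // assumed: lower is better
  turnsPlayed: number;        // assumed: lower is better
  finalScore: number;         // assumed: higher is better (heuristic fallback)
}

type TrialVerdict = "candidate" | "baseline" | "undecided";

interface Comparison {
  verdict: TrialVerdict;
  decidedBy: keyof TrialResult | null; // the first tier that broke the tie
}

// Each tier returns >0 if the candidate is better, <0 if the baseline is, 0 on a tie.
const TIERS: Array<[keyof TrialResult, (c: TrialResult, b: TrialResult) => number]> = [
  ["outcome", (c, b) => Number(c.outcome === "victory") - Number(b.outcome === "victory")],
  ["alliesAlive", (c, b) => c.alliesAlive - b.alliesAlive],
  ["enemiesAlive", (c, b) => b.enemiesAlive - c.enemiesAlive],
  ["enemiesRemainingHp", (c, b) => b.enemiesRemainingHp - c.enemiesRemainingHp],
  ["turnsPlayed", (c, b) => b.turnsPlayed - c.turnsPlayed],
  ["finalScore", (c, b) => c.finalScore - b.finalScore],
];

function compareTrialsSketch(candidate: TrialResult, baseline: TrialResult): Comparison {
  for (const [tier, diff] of TIERS) {
    const d = diff(candidate, baseline);
    if (d !== 0) {
      return { verdict: d > 0 ? "candidate" : "baseline", decidedBy: tier };
    }
  }
  return { verdict: "undecided", decidedBy: null };
}

// STRICT reading: a trial decided only by the finalScore fallback counts as undecided.
function strictVerdict(c: Comparison): TrialVerdict {
  return c.decidedBy === "finalScore" ? "undecided" : c.verdict;
}
```

Recording which tier decided a trial is what makes a RAW versus STRICT split cheap to compute afterwards: the same comparison result can be read either way without re-running the trial.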

The two surfaces are not interchangeable. A signal on the hypothesis-generating surface is a question. A verdict on the promotion-gate surface is an answer. § 6.6 lists the procedural traps that follow from misreading one for the other.

§ 6.2 — The ladder of six verdicts

Every slice closes with a written verdict drawn from a six-value ladder. The values, in order from "evidence supports promotion" to "evidence does not support a conclusion at all":

OK: The candidate passes the paired bench against its target baseline on the design corpus, the verdict survives independent corpus replay, the cost gate is respected, and no auxiliary invariant — forecast drift, distinctness, comparator stability — is violated. Promotion to the relevant production surface is authorized. Paradigm: slice 18 path E (re-promotion of beam-rerank-v1 on the live advisor surface).
DRIFT: Conformity between simulator and ground truth — live capture or decompiled binary — is broken in a specific region. The format is reserved for fidelity audits, not engine evaluations. A DRIFT verdict carries a criticality sub-tag P0 through P3 by analogy with bug priority. Paradigm: slice 28 (Burn shield routing).
REJECT: The candidate is falsified by the paired bench: defeats delta non-negative, direction inverted, or trigger-subset loss. The engine is retained in the source tree as rejected with the slice number embedded in its module documentation. The re-promotion rule applies — a rejected engine cannot become production under the same key. Paradigm: slice 3 (beam-antiwaste-v1).
ARCHIVE: The candidate confirms a real signal on the paired bench but cannot pass a cost gate or another structural constraint. The engine is preserved as archive for its conceptual content — typically as offline oracle for distillation or as competition reference — but never executes live on the surface that archived it. Paradigm: the first archival of beam-rerank-v1 at slice 4.4, before its later re-promotion under path E.
SERIOUS_SIGNAL: The candidate shows a positive direction on the design corpus alone, with no independent confirmation. The verdict is explicitly transitory: it must be resolved into OK, ARCHIVE, REJECT, or INCONCLUSIVE through independent corpus replay before the candidate can shift kind. SERIOUS_SIGNAL is never a terminal state. Paradigm: slice 4 (original beam-rerank-v1).
INCONCLUSIVE: The methodology cannot render a verdict. The cause may be a corpus too small, a missing infrastructure, contradictory results across cells of a sweep, or — as in the case that taught the project the most — the simple absence of an independent corpus to replay against. INCONCLUSIVE is a refusal to launder uncertainty into a positive answer. Paradigm: slice 4.2 (INCONCLUSIVE_PENDING_USER_CAPTURE).

Three auxiliary tags compose with the core six. MAP marks a hypothesis-generating audit that produced a usable map of behavior — for instance, the oracle case-mining reports of slice 5. EXONERATE marks a suspect that an audit ruled out without reaching a full positive verdict: a layer of the system was scrutinized and emerged consistent with its specification, so the investigation moves on rather than producing an OK on something that was never the right hypothesis to begin with. NON_PROMOTABLE_COST attaches to a candidate whose signal is real and survives independent replay, but whose runtime overhead exceeds the 4× cost gate — the qualifier appended to SERIOUS_SIGNAL_CONFIRMED on slice 4.2ter, where it disqualified beam-rerank-v1 from live promotion.
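The ladder and the rule that SERIOUS_SIGNAL is transitory can be written down as types, which is one way to make the rule checkable rather than conventional. This is a sketch under stated assumptions: the verdict and tag names come from the text, while SliceVerdict, mustBeResolved, resolveSeriousSignal and authorizesPromotion are hypothetical names introduced here.

```typescript
// The six core verdicts, in the order the text lists them.
const CORE_VERDICTS = ["OK", "DRIFT", "REJECT", "ARCHIVE", "SERIOUS_SIGNAL", "INCONCLUSIVE"] as const;
type CoreVerdict = (typeof CORE_VERDICTS)[number];

// Auxiliary tags that compose with a core verdict.
type AuxTag = "MAP" | "EXONERATE" | "NON_PROMOTABLE_COST";

// Hypothetical record of how a slice closed.
interface SliceVerdict {
  slice: string;
  verdict: CoreVerdict;
  tags: AuxTag[];
}

// Per the text, SERIOUS_SIGNAL may only resolve into these four values.
const SERIOUS_SIGNAL_RESOLUTIONS: readonly CoreVerdict[] = ["OK", "ARCHIVE", "REJECT", "INCONCLUSIVE"];

// SERIOUS_SIGNAL is the one verdict the text declares never terminal.
function mustBeResolved(v: CoreVerdict): boolean {
  return v === "SERIOUS_SIGNAL";
}

function resolveSeriousSignal(prior: SliceVerdict, resolved: CoreVerdict): SliceVerdict {
  if (prior.verdict !== "SERIOUS_SIGNAL") {
    throw new Error("only SERIOUS_SIGNAL needs resolution");
  }
  if (SERIOUS_SIGNAL_RESOLUTIONS.indexOf(resolved) === -1) {
    throw new Error("SERIOUS_SIGNAL cannot resolve into " + resolved);
  }
  return { slice: prior.slice, verdict: resolved, tags: prior.tags.slice() };
}

// Only OK authorizes promotion; every other value leaves the engine kind unchanged.
function authorizesPromotion(v: SliceVerdict): boolean {
  return v.verdict === "OK";
}
```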

Slice 4.2, visited in chronological detail in § 7, is the textbook example of how the verdict ladder absorbs an unresolved methodological question. The slice tested two cost-compression strategies and confirmed that the leaf-margin gate fires too rarely to matter on the available corpus. That finding, "fails to compress cost" (confirmed), is a textbook conservative failure, and the slice could have been closed REJECT on the cost-compression attempt alone. Doing so would have buried the larger fact: that no second corpus existed against which any candidate could be replayed.

But while attempting the second compression strategy, the agent discovered that the project owned no infrastructure for capturing a second corpus. The design corpus existed because someone had once recorded combat snapshots externally on April 27, in a session that was not versioned and not reproducible. There was no recorder in the tracker. There was no manifest format. No CLI flag on test-ws-client.ts to begin a recording. No hook on session exit to materialize a corpus. Without these, the question "does this signal survive on data the wrapper has not seen?" could not be asked at all.

This discovery converted slice 4.2 into a methodological cliff. The paired bench had returned 100% on all four cells of the matrix. The candidate looked, on the data available, exactly as good as before. And the methodology was supposed to allow promotion under exactly those conditions, except for the independent-corpus check, which was now obviously impossible. The slice closed with verdict INCONCLUSIVE. The candidate remained at candidate.

Slice 4.2bis: Building the infrastructure to ask the next question

Slice 4.2bis was scoped narrowly: build the missing infrastructure. Three artifacts shipped:

snapshot-recorder.ts — a module that wraps the WebSocket stream from the BepInEx hook and writes one JSONL frame per turn into a directory.
CLI flags --record <dir> and --record-id <id> on test-ws-client.ts, allowing a recording session to be started from the command line.
A SIGINT/beforeExit hook that builds the corpus manifest automatically when the session terminates, recording the slice the capture was made for, the engine versions, the snapshot count, and the excludesCorpusIds field listing every prior corpus from which the new one is disjoint.

Sixteen tests shipped with the code: nine for the recorder, seven for the manifest builder. All passed in CI.

Then the slice halted. The capture itself required a human running the game on Windows, in Steam, for one to two hours of real gameplay. The agent could not produce the corpus. The slice closed with verdict EXECUTION_BLOCKED — an auxiliary tag indicating the next step requires action the methodology cannot produce algorithmically. The infrastructure shipped. The capture followed.

Slice 4.2ter: The replay that confirmed direction and killed promotion

The capture eventually arrived: 50 combats, 209 snapshots, taken from a separate session that traversed late-floor and boss encounters underrepresented in the design corpus. The manifest was committed as independent-rerank-4-2bis. manifestsAreDistinct(design, independent) passed. The matrix sweep ran for approximately four hours of compute.
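A sketch of what a distinctness check between two manifests might verify. Only excludesCorpusIds, the snapshot count, the engine versions and the originating slice are named in the text; the snapshotHashes field, the comparison rule, and the function names below are assumptions, not the project's manifestsAreDistinct.

```typescript
// Hypothetical manifest shape. snapshotHashes is an assumption: one stable
// fingerprint per recorded snapshot, so overlap can be detected by content.
interface CorpusManifest {
  corpusId: string;
  capturedForSlice: string;
  engineVersions: Record<string, string>;
  snapshotCount: number;
  snapshotHashes: string[];
  excludesCorpusIds: string[];
}

// Two corpora are treated as distinct when they have different ids
// and share no snapshot fingerprint.
function manifestsAreDistinctSketch(a: CorpusManifest, b: CorpusManifest): boolean {
  if (a.corpusId === b.corpusId) return false;
  const seen: Record<string, boolean> = Object.create(null);
  for (const h of a.snapshotHashes) seen[h] = true;
  for (const h of b.snapshotHashes) {
    if (seen[h] === true) return false;
  }
  return true;
}

// The later manifest should also declare the earlier corpus as excluded.
function declaresExclusion(later: CorpusManifest, earlier: CorpusManifest): boolean {
  return later.excludesCorpusIds.indexOf(earlier.corpusId) !== -1;
}
```

Checking content fingerprints rather than ids alone is the design choice that matters here: a corpus re-captured under a new id but containing recycled snapshots would otherwise pass as independent.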

The results split cleanly into two narratives, and the slice's verdict depended on which one was accepted as primary.

The direction held. On the independent corpus, the candidate did not lose a single decided trial across any of the four cells:

RAW paired win-rate: 100% of decided trials, 85 wins to 0 losses, 124 undecided.
STRICT paired win-rate: 100% of decided trials, 65 wins to 0 losses, 144 undecided.
Sign-test p-value on the RAW global: ≈ 2.6 × 10⁻²⁶.

There was no plausible reading in which the signal was a corpus artifact. The candidate genuinely beat the beam wherever both reached a decisive compareTrials outcome.
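The quoted p-value is consistent with a one-sided exact sign test on the 85 decided RAW trials, since 0.5 raised to the 85th power is about 2.58 × 10⁻²⁶ (a two-sided test would give twice that). The sketch below reproduces the arithmetic; treating the test as one-sided, and the function name, are assumptions rather than something the slice document states.

```typescript
// One-sided exact sign test: probability of observing at least `wins`
// successes in `wins + losses` fair coin flips. Undecided trials are dropped
// before calling. Suitable for modest n only (plain floating point).
function signTestOneSided(wins: number, losses: number): number {
  const n = wins + losses;
  let coeff = 1; // binomial coefficient C(n, k), starting at k = n
  let tail = 0;
  for (let k = n; k >= wins; k--) {
    tail += coeff;
    coeff = (coeff * k) / (n - k + 1); // C(n, k-1) = C(n, k) * k / (n - k + 1)
  }
  return tail * Math.pow(0.5, n);
}

// RAW global from the text: 85 wins, 0 losses (124 undecided excluded).
const rawP = signTestOneSided(85, 0); // about 2.58e-26
```

Note that the 124 undecided trials carry no weight in this test. The p-value says the direction is not noise among decided trials; it says nothing about how often a trial is decided at all.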

The cost did not hold. The same replay revealed that the runtime cost of beam-rerank-v1 on the independent corpus was structurally different from its cost on the design corpus:

Cost ratio: 4.76× on design, 13.26× on independent.
Triggered candidate mean latency: 856 ms; p95 1,506 ms; maximum observed 9.4 s per decision.
Signal magnitude diluted: defeats delta fell from −4 on design to −2 on independent; trigger rate from 28.2% to 20.6%.

The mechanical hypothesis advanced by the slice's postmortem was straightforward. The design corpus had been biased toward early- and mid-floor combats, which terminate in fewer rounds. The independent corpus, captured from a genuine progression run, included late-floor and corrupted encounters in which playCombatFromState must simulate substantially more rounds before reaching termination. The cost of the rerank is O(topK × rounds_per_rollout), and the rounds per rollout are structurally longer on the realistic corpus than on the corpus the wrapper had been benched against.

The verdict written into the slice document closed on a single sentence:

« Sans cette découverte, on aurait sur-estimé la viabilité runtime de beam-rerank-v1 d'un facteur ~3. »
In English: "Without this discovery, the runtime viability of beam-rerank-v1 would have been overestimated by a factor of approximately 3."
Slice 4.2ter postmortem, April 30, 2026

The candidate was tagged SERIOUS_SIGNAL_CONFIRMED composed with NON_PROMOTABLE_COST. The signal was real. The cost gate had failed. The candidate could not be promoted into production runtime on the bench surface. Slice 4.4 would later set its kind to archive.

Slice 4.3 & 4.4: The compression that failed, and the formal archival

Slice 4.3 attempted a final compression strategy: a dynamic skip that would bypass the rollout when intermediate beam confidence was already high enough to obviate it. On the independent corpus, the defeats delta moved from −2 (slow but successful) to 0 (no improvement) as soon as the skip aggressiveness was tuned to bring the cost ratio into the 4× range. The compression destroyed the signal. The slice closed REJECT.

Slice 4.4 formalized the disposition. The candidate engine beam-rerank-v1 was set to archive, with explicit module documentation that the code is preserved as offline oracle for the distillation pipeline (slices A, B, C of § 6.4) and explicitly never executes on the production runtime surface. The slice closed with verdict ARCHIVE. The chapter was procedurally complete.

Five invariants inscribed by the sequence

The chain 4 → 4.1 → 4.2 → 4.2bis → 4.2ter → 4.3 → 4.4 inscribed five invariants into the project. Each is now enforced at the level of source code or written process documentation, and each can be traced back to the specific slice that taught it.

Independent corpus replay is mandatory before any promotion. Inscribed by 4.2 (the impossibility of asking the question) and 4.2ter (the answer the question produced when finally posed).
Capture-side infrastructure is permanent. The recorder, the manifest builder, the distinctness test, and the --record flags are not optional tools — they are part of the tracker baseline. No future slice will be blocked on missing capture infrastructure.
Strict replay, zero tuning. The configuration the candidate carries on the design corpus is the configuration it carries on the independent corpus. No threshold widens, no cost gate loosens, no trigger condition adjusts after the replay results are seen.
Cost ratio is a first-order gate. Operationally, a candidate targeting production must hold a cost ratio at or below 4× the production engine on the independent corpus. This is the threshold against which 13.26× was measured non-viable.
SERIOUS_SIGNAL is never a terminal verdict. It resolves into OK, ARCHIVE, REJECT, or INCONCLUSIVE through independent replay. The intermediate state cannot ship code.
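Several of these invariants reduce to a mechanical pre-promotion check. The sketch below is one hedged way to express it: the 4× threshold and the NON_PROMOTABLE_COST tag come from the text, whereas the ReplayEvidence shape, the directional-signal condition, and the other blocker labels are hypothetical.

```typescript
// Hypothetical summary of one independent-corpus replay.
interface ReplayEvidence {
  corpusIsIndependent: boolean;    // e.g. the distinctness check passed
  configMatchesDesignRun: boolean; // strict replay, zero tuning
  decidedWins: number;
  decidedLosses: number;
  defeatsDelta: number;            // negative means fewer defeats than the baseline
  costRatio: number;               // candidate cost / production cost on this corpus
}

const COST_GATE = 4; // from the text: at or below 4x on the independent corpus

// Returns every reason promotion is blocked; an empty list means no blocker found.
function promotionBlockers(e: ReplayEvidence): string[] {
  const blockers: string[] = [];
  if (!e.corpusIsIndependent) blockers.push("NO_INDEPENDENT_REPLAY");
  if (!e.configMatchesDesignRun) blockers.push("TUNED_AFTER_REPLAY");
  if (e.decidedLosses >= e.decidedWins || e.defeatsDelta >= 0) {
    blockers.push("NO_DIRECTIONAL_SIGNAL");
  }
  if (e.costRatio > COST_GATE) blockers.push("NON_PROMOTABLE_COST");
  return blockers;
}

// Reference cell of slice 4.2ter, using the figures reported in the text.
const slice42ter: ReplayEvidence = {
  corpusIsIndependent: true,
  configMatchesDesignRun: true,
  decidedWins: 85,
  decidedLosses: 0,
  defeatsDelta: -2,
  costRatio: 13.26,
};
const blockers42ter = promotionBlockers(slice42ter); // ["NON_PROMOTABLE_COST"]
```

Returning the full list of blockers, rather than a boolean, keeps the written verdict honest: a candidate blocked only by cost is a different object from one blocked by a missing replay.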

Closing and forward link

The sequence above produced the strongest single result the project would record and refused to ship it. This is not the end of the story for beam-rerank-v1. § 9 returns to the engine in the context of slice 18, which re-promotes it onto the live advisor surface under an explicit doctrinal exception, the path E doctrine, under which the cost gate that was decisive in 4.2ter does not apply on a surface whose sole audience is the user. The doctrine, the asymmetry it introduces between the bench and the live surfaces, and the conditions under which it is admissible, are the subject of § 9. The fact that they were inscribable at all rested on the chain of decisions documented in this section.

§ 7 — The slice 4.2 incident, in chronological detail

The methodology of § 6 is a set of rules. § 7 is the slice sequence in which those rules were tested under maximum pressure and held. The events of April 29 and April 30, 2026 produced the strongest signal the project had ever recorded, came within one decision of shipping it, and ended in a refusal to ship. Every invariant that subsequent sections rely on was inscribed in those two days.

Slice 4: The probe and the first signal

On the morning of April 29, slice 4 closed with the SERIOUS_SIGNAL verdict. The probe under test was beam-rerank-v1: a wrapper that extracts the production beam's Top-K=3 round-0 candidates via computeTopKTurnsFromState, forces each as a first action, replays the rest of the combat under the beam itself, and selects the trial that wins the compareTrials chain. On design-corpus-2026-04-27 (110 snapshots drawn from 32 combats) the candidate posted 71 paired wins against zero for the beam, a 100% paired win-rate on decided trials, and a defeats delta of −7 (the beam lost 64 of the replayed combats, the candidate 57). On the trigger subset, the 48 snapshots on which the candidate's selection differed from the beam's, the candidate won 48 times to zero. A reasonable engineer would have proposed promotion.

The probe's report flagged two reservations. First, the runtime cost stood at 8.31× the beam, high but not yet disqualifying. Second, 17 of the 71 wins (24% of the wins, or 15% of the 110 snapshots) had been decided by the finalScore fallback step of compareTrials, the rule that consults the simulator's heuristic score when every earlier tiebreak ties. The fallback is documented and allowed, but it is qualitatively different from the tactical chain that decides the other 54. Slice 4 closed without promotion and listed its three open questions: does the signal survive multiple values of K? Does it survive after stripping the fallback wins? Does it survive on a corpus the probe has never seen?
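The wrapper described at the start of this subsection can be sketched generically. The three callbacks are hypothetical stand-ins for computeTopKTurnsFromState, the beam-driven replay, and a compareTrials-based comparison; their signatures are assumptions, and the sketch is not the project's beam-rerank-v1 source.

```typescript
// Injected dependencies, all hypothetical signatures.
interface RerankDeps<State, Action, Trial> {
  topKFirstActions: (state: State, k: number) => Action[];        // ranked, index 0 = the beam's own pick
  rolloutUnderBeam: (state: State, forcedFirst: Action) => Trial; // replay the rest of the combat under the beam
  challengerWins: (challenger: Trial, incumbent: Trial) => boolean;
}

interface RerankChoice<Action> {
  action: Action;
  triggered: boolean; // true when the selection differs from the beam's top choice
}

function rerankFirstAction<State, Action, Trial>(
  state: State,
  k: number,
  deps: RerankDeps<State, Action, Trial>
): RerankChoice<Action> {
  let best: { action: Action; trial: Trial; index: number } | null = null;
  let index = 0;
  // One full rollout per candidate: this loop is where the k-fold cost comes from.
  for (const action of deps.topKFirstActions(state, k)) {
    const trial = deps.rolloutUnderBeam(state, action);
    // Strict improvement only, so ties keep the earlier (beam-preferred) candidate.
    if (best === null || deps.challengerWins(trial, best.trial)) {
      best = { action, trial, index };
    }
    index++;
  }
  if (best === null) throw new Error("beam returned no candidates");
  return { action: best.action, triggered: best.index !== 0 };
}
```

Because ties keep the beam's own choice, the wrapper can only deviate from the beam when a rollout is strictly better under the comparison, which is what makes the trigger subset a meaningful unit of analysis.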

Slice 4.1: Hardening — sweep and forensic

Later that same day slice 4.1 answered the first two. A sweep across topK ∈ {2, 3, 5} produced SERIOUS_SIGNAL on every cell, with K=5 dominated by K=3 (identical signal, +10% cost) and K=2 a defensible cost-conscious alternative. A strict re-classification of the fallback wins as ties — rather than candidate wins — left 66% of the wins surviving in K=3 and a strict win-rate of 100% on decided trials. A forensic classifier inspected the fallback cases and found 94% of them to be legitimate_tie — trials in which the chain had correctly resolved every tactical step to equality and the fallback merely arbitrated equivalent outcomes. Zero cases were classified as comparator_gap, the failure mode that would have suggested compareTrials was missing a visible difference. The signal was not inflated.

The third question — corpus independence — could not be answered. A search of the repository confirmed that no snapshot had ever been held aside for validation. The 32 combats of the design corpus were the only ones in existence. The slice closed with the verdict SERIOUS_SIGNAL_CONFIRMED on what it could measure and the explicit annotation INDEPENDENT_CORPUS_NOT_AVAILABLE on what it could not.

Slice 4.2: Cost compression and a missing oracle

Slice 4.2 attempted two things at once: build the manifest infrastructure needed to hold a second corpus at all, and reduce the candidate's runtime cost through an early-exit gate (leaf-margin). On the design corpus, the cost-compression result was both disappointing and revealing. Disappointing, because the gate fired once in 110 snapshots. Revealing, because it showed that the wins the reranker was generating were not outcome flips that a margin-on-leaf-score gate could short-circuit; they were intra-outcome tactical tiebreaks, invisible to the gate by construction. The slice could legitimately have closed with another SERIOUS_SIGNAL on its 4 matrix cells. It closed instead with INCONCLUSIVE, because the methodological invariant the slice authored (no promotion without independent-corpus replay) could not be honored. The manifest infrastructure was now in place. The corpus was not.

Slice 4.2bis: The infrastructure for a corpus that did not exist

Slice 4.2bis built what was needed to capture a second corpus: a snapshot recorder in src/lib/combat/judge/capture/, a manifest builder, the wiring through test-ws-client.ts, and sixteen tests. Total: 1057 tests passing, zero changes to the production beam, zero promotions. The slice then halted with the verdict EXECUTION_BLOCKED, an explicit formal acknowledgement that the next step — running the game on Windows, playing 40 to 60 combats, recording the capture — required a human. The agent could not produce the corpus. The slice waited.

Capture: The human in the loop

Over roughly two hours that evening and into the early morning of April 30, the human played 50 combats end-to-end while the live client wrote the snapshots to disk. The capture sessions accumulated, the recorder appended, the manifest was regenerated from disk to reconcile multiple sessions into a single independent-rerank-4-2bis dataset: 50 combats, 209 snapshots, manifestsAreDistinct(design, independent) === true. The verification passed. The replay could now run.

Slice 4.2ter: The replay, and the cost the design corpus had hidden

Slice 4.2ter ran the same 4-cell matrix (topK ∈ {2, 3} × earlyExit ∈ {none, leaf-margin}) against the new 209-snapshot corpus, completing around 01:28 local on April 30. The directional signal held. Every cell posted 100% RAW win-rate and 100% STRICT win-rate on decided trials; the reference cell (topK=2 earlyExit=none) recorded 85 paired wins for zero against the beam, a sign-test p-value of ≈ 2.6 × 10⁻²⁶. The trigger subset delivered 43 wins for zero. The reranker was structurally beating the beam on data it had never seen.

And then the cost number returned. On the design corpus the ratio had been 4.76×; on the independent corpus the same cell measured 13.26× — a 2.8× multiplier on what the design measurement had estimated. The cause was structural: the design corpus had been collected from early-to-mid floor combats, biased toward short encounters; the independent corpus contained a true cross-section of a real run, with late-floor, elite, and boss encounters whose playCombatFromState rollouts simulate more rounds before terminating. The design corpus had been hiding the cost. The independent corpus revealed it. The verdict closed at SERIOUS_SIGNAL_CONFIRMED + NON_PROMOTABLE_COST — directional signal locked, cost gate violated on all four cells.

Slice 4.3: One last attempt, falsified at the bench

Slice 4.3 tried the most defensible cost-reduction available: skip the rerank entirely when the beam's internal margin between rank 0 and rank 1 exceeded a threshold (dynamic-skip by beam dominance). The threshold was chosen a priori from the forensic of slice 4.2bis, before the bench was run, to honor the methodology's prohibition on tuning after reading results. The candidate skipped 30% of the rollouts, reduced cost from 12.04× to 8.89× — a real −26% — and saved two fewer defeats than the unmodified reranker. The defeats delta regressed from −2 to 0. The minimum continuation gate failed. Verdict: REJECT — defeats regression. The forensic surfaced the structural reason: the beam's internal-score margin and the cases the reranker recovers are anti-correlated. The trials on which the rerank corrects a defeat are exactly the trials on which the beam is most internally confident — and therefore the trials a margin gate would skip. No threshold can separate the two without re-running the rollout, which defeats the cost optimization by construction.
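For concreteness, a dominance gate of the kind slice 4.3 describes might look like the sketch below. It shows the mechanism only; the text reports that this gate was rejected. The score representation, the function names and the example numbers are hypothetical.

```typescript
// `rankedScores` are the beam's internal scores for its ranked candidates,
// best first. `marginThreshold` is assumed to be fixed a priori, before any
// bench run, as the methodology requires.
function shouldSkipRerank(rankedScores: number[], marginThreshold: number): boolean {
  const first = rankedScores[0];
  const second = rankedScores[1];
  if (first === undefined || second === undefined) return true; // nothing to rerank
  return first - second > marginThreshold;
}

// Fraction of decisions on which the rollouts would be skipped.
function skipRate(corpus: number[][], marginThreshold: number): number {
  if (corpus.length === 0) return 0;
  const skipped = corpus.filter((scores) => shouldSkipRerank(scores, marginThreshold)).length;
  return skipped / corpus.length;
}

// Illustrative scores only, not project data.
const exampleRate = skipRate([[10, 2], [5, 4], [7, 6.5], [9]], 3); // 0.5
```

The gate reads only the beam's own confidence. If, as the forensic found, the decisions the reranker corrects are the ones where the beam is most confident, then this input cannot distinguish a safe skip from a lost correction, whatever threshold is chosen.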

Slice 4.4: Archive

Slice 4.4 transitioned beam-rerank-v1 from candidate to archive. The code remained in tree. The tests remained green. The slice document recorded why the engine would not be promoted to live runtime, why no local heuristic could reclaim its cost, and what the engine remained useful for offline — oracle case mining in slice 5+, where its forced-rollout selections become labels for examining the beam's mistakes after the fact. The archive was not a defeat. It was the methodology drawing a line the bench had taught it to draw.

The sequence inscribed five invariants the rest of the project obeys without exception. Independent-corpus replay is non-negotiable; no candidate may be promoted on the corpus it was tested against. Cost is a first-class promotion gate; a directional signal without an affordable runtime is an oracle, not a product. Heuristic cost-reductions must be paired-benched after a priori parameter choice; tuning after reading results is fitting to the corpus. A rejected engine key is burned; the next attempt registers a new key with a fresh slice ladder. Diagnostic and promotion surfaces remain separated; a forensic that explains a win is not a license to promote the winner. § 8 walks through the seven flags that have stayed frozen since.

§ 8 — Frozen flags and the discipline of the freeze

The project carries a small registry of frozen design parameters — values the methodology refuses to vary, even when their variation would be computationally cheap to test. Freezing is not a default state. It is a verdict: a value has been proven, by a specific slice, to interact with downstream invariants in ways that re-tuning would silently invalidate. Three reasons compose every freeze in the registry, and a value enters the registry only when all three apply.

Three reasons

First, anti-flat-fitting. A parameter that has been tuned against the design corpus carries that corpus's biases. If it is then re-tuned against an independent corpus, the result is a system fitted to the union of two corpora — neither of which is independent of the other for subsequent verification. The freeze interrupts this drift toward a moving target.

Second, historical reproducibility. Every slice in the project's history is dated and references specific parameter values in its reasoning. A retroactive change to a parameter would render prior verdict text incoherent against the new state of the world. The freeze is the technical mechanism that keeps the slice ladder readable in retrospect.

Third, STOP PROBING. There is a procedural fatigue cost to re-evaluating a value the project has already tested. Several flags in the registry carry explicit comments documenting which slice closed their evaluation, and the freeze cuts off the temptation to re-probe by accident. The lock-in is at least as procedural as it is technical.

The registry

Seven entries are currently frozen. The table below lists them with their value, rationale, and the slice that closed the freeze.

Flag	Value	Rationale	Closing slice
LOOKAHEAD_ROUNDS	3	Lookahead beyond three degrades performance because the 18% forecast-drift compounds faster than the added depth recovers. Verified on the ceiling slice.	Phase C / ceiling
survivalFix	false (OFF)	Enabling produces the tortoise pathology — extends defeats without converting any to victory. Measured and rejected.	Slice 6
damageDebtFix	false (OFF)	Enabling over-weights damage debt and causes premature aggressive blowouts on trials the system would have won anyway. Measured and rejected.	Slice 8
ROLLOUT_MARGIN_THRESHOLD	6	Below 6, leaf score dominates and the rollout adds noise; above 6, the terminal-score blend fails to break tactical ties. Calibrated empirically.	Phase F
predictEnemyPreviews	tier-based policy	The kill-this-turn → lowest-effective-HP → paid-offensive → free-offensive cascade for rounds beyond zero, with a boss-fight signature override. Any change to the tiering — even an apparently neutral one — re-attributes the 18% forecast drift measured in § 11 onto a different mechanic, which would invalidate the slice-9-through-13 ceiling investigation that depended on attributing it to this policy specifically.	Phase C
scoreState	contextual eval	Roughly seventy hand-tuned weights settled during Phase E. Re-tuning destabilizes the sixth-tier fallback of compareTrials, which would in turn invalidate every prior paired-bench verdict that fell back to it.	Phase E
comparator	compareTrials chain	The canonical ordering outcome → alliesAlive → enemiesAlive → enemiesRemainingHp → turnsPlayed → finalScore. Consolidated as a tested API by the initial paired-bench harness; its order is locked by golden tests in addition to the freeze, so re-ordering any tier breaks not only prior verdicts but also a CI assertion.	Slice 1
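One way a freeze on the scalar entries could be enforced in source is sketched below. Values and closing slices are copied from the table; the registry shape and the frozenViolations helper are hypothetical, and the three policy-valued entries (predictEnemyPreviews, scoreState, comparator) are left out because they are not single values.

```typescript
interface FrozenFlag {
  readonly value: number | boolean;
  readonly closingSlice: string;
}

// Object.freeze is shallow; the readonly typing covers the nested fields at compile time.
const FROZEN_SCALARS: Readonly<Record<string, FrozenFlag>> = Object.freeze({
  LOOKAHEAD_ROUNDS: { value: 3, closingSlice: "Phase C / ceiling" },
  survivalFix: { value: false, closingSlice: "Slice 6" },
  damageDebtFix: { value: false, closingSlice: "Slice 8" },
  ROLLOUT_MARGIN_THRESHOLD: { value: 6, closingSlice: "Phase F" },
});

// Golden-style check: report every flag whose live value drifted from the registry.
function frozenViolations(current: Record<string, number | boolean>): string[] {
  const violations: string[] = [];
  for (const flag of Object.keys(FROZEN_SCALARS)) {
    const frozen = FROZEN_SCALARS[flag];
    if (frozen === undefined) continue;
    if (current[flag] !== frozen.value) {
      violations.push(flag + " must stay " + String(frozen.value) + " (closed by " + frozen.closingSlice + ")");
    }
  }
  return violations;
}
```

Carrying the closing slice next to the value means a failing check points directly at the verdict that would have to be reopened, which is the procedural half of the freeze.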

What the registry reveals

What the freeze registry reveals epistemologically is not that the project has found optimal values. It is that the project has stopped looking for them. Each frozen value is the residue of a slice that satisfied itself the value was good enough and that the variance from improving it would be smaller than the variance from instability in everything downstream. The freeze is therefore a measurement of where the project has decided that further calibration is not worth the audit cost.

In a research culture that valorizes continuous tuning, this registry is a small piece of counter-culture: a written declaration that diminishing returns have set in on each listed axis, that the slice ladder will not entertain further probes against them, and that any future improvement to the system must come from new dimensions, not from re-tuning the old ones.

The next section documents the single explicit condition under which a frozen gate may, with documented justification, be suspended on one surface while continuing to operate on another.

§ 9 — The path E doctrine and the two-surface engine

§ 7 closed with the candidate engine beam-rerank-v1 archived for cost — 13.26× the production beam on the independent corpus, against a 4× cost gate. This section returns to the engine. It does so because a subsequent slice — slice 16a — fixed a damage-to-corruption mismatch in the simulator that had been silently inflating rerank rollout duration on corrupted-enemy snapshots, the same snapshots that dominate late-floor combat. After the fix, the cost ratio on a re-captured independent corpus dropped from 13.26× to 5.19×.

Still above the 4× gate. But close enough that the question changed.

§ 9.1 — The decision

The decision recorded in slice 18 is informal in tone and explicit in scope. Paraphrased (the project's internal language is more direct than the paper warrants), the position taken was this: on a surface whose sole audience is the author, the runtime budget gate is a methodological courtesy, not a methodological requirement. The benchmark surface continues to enforce the gate because the benchmark exists to serve third-party comparability, the author's own future self in a more constrained context, and any reader auditing the system after the fact. The live advisor surface, however, exists to support one user, the author playing one combat at a time on hardware the author controls, for whom the difference between a 200-millisecond response and a 1.5-second response is not the difference between viability and non-viability. It is the difference between fast and slow on a system the user already accepts.

The slice argued, and the verdict authorized, that under these surface-level conditions the author may explicitly suspend the cost gate on the live advisor surface only. The engine returned to that surface with verdict OK. The bench surface's verdict remained ARCHIVE. The same engine, in the same source file, became production on one surface and archive on the other.

§ 9.2 — The formal mechanism

The mechanism is small. The engine registry indexes engines by name; each surface resolves its production engine independently. The live advisor runtime — computeBestTurnFinality — imports beam-rerank-v1 and uses it as its decision function. The paired-bench harness — DEFAULT_ENGINE_SPECS — imports the bare beam and uses it as its production engine for that surface. Both imports succeed. Both surfaces are internally consistent: the live runtime documents its choice in its module header (slice 18 path E re-promotion), and the bench documents its choice in DEFAULT_ENGINE_SPECS (slice 4.4 archival, never reversed for the bench).

The dual-kind regime is therefore not a special tag on the engine. It is the natural consequence of allowing surfaces to disagree about an engine, combined with the requirement that each disagreement be inscribed by a slice. Path E is, mechanically, just "the bench did not re-promote what the live advisor did."
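The idea that a kind belongs to an (engine, surface) pair rather than to the engine alone can be sketched as a small binding table. The surface names, the "beam" key standing in for the bare production beam, and both helper functions are hypothetical; only the beam-rerank-v1 bindings and their inscribing slices come from the text.

```typescript
type EngineKind = "production" | "candidate" | "archive" | "rejected";
type Surface = "bench" | "live-advisor";

interface SurfaceBinding {
  surface: Surface;
  engine: string;
  kind: EngineKind;
  inscribedBy: string | null; // the slice that recorded the decision, when the text states it
}

const BINDINGS: SurfaceBinding[] = [
  { surface: "live-advisor", engine: "beam-rerank-v1", kind: "production", inscribedBy: "slice 18 (path E)" },
  { surface: "bench", engine: "beam-rerank-v1", kind: "archive", inscribedBy: "slice 4.4" },
  // "beam" is a placeholder key for the bare beam; its inscribing slice is not stated in the text.
  { surface: "bench", engine: "beam", kind: "production", inscribedBy: null },
];

function kindOn(engine: string, surface: Surface): EngineKind | null {
  for (const b of BINDINGS) {
    if (b.engine === engine && b.surface === surface) return b.kind;
  }
  return null;
}

// Each surface must resolve to exactly one production engine.
function productionEngineFor(surface: Surface): string {
  const hits = BINDINGS.filter((b) => b.surface === surface && b.kind === "production");
  const only = hits[0];
  if (hits.length !== 1 || only === undefined) {
    throw new Error("surface " + surface + " must have exactly one production engine");
  }
  return only.engine;
}
```

With this shape, the dual-kind state needs no special case: the same engine key simply appears in two rows that disagree, each traceable to its own slice.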

§ 9.3 — Admissibility conditions

The doctrinal innovation of path E is not the dual-kind state itself. It is the explicit listing of conditions under which a single-engineer project may operate this state without violating its own methodology. Drawing from the path E promotion slice and its subsequent doctrinal reconciliation, the doctrine can be summarized as five admissibility conditions — a systematization not inscribed verbatim in any single slice, but derivable from the engine-kind transition discipline and the path E reconciliation document together:

The surface on which the cost gate is suspended must have a single user audience whose constraints are known to the author.
The cost gate must remain enforced on the benchmark surface — non-negotiable.
The slice that promotes the engine onto the suspended-gate surface must name the surface, the engine, and the reason for the suspension, in writing.
The engine's behavior on the suspended-gate surface must remain auditable through the standard slice-ladder mechanisms. No special-case logic that bypasses paired comparison when the user requests it.
The bench surface verdict must remain stable. A future re-promotion onto the bench surface requires its own slice, with the cost gate met.

Closing

Path E formalizes a tension that single-engineer projects face routinely and that team projects often resolve through bureaucracy. The team can institutionally separate "what we ship" from "what I run for myself"; the single engineer must inscribe the separation explicitly, or it dissolves into ambient permission. The doctrine is not cheap in writing: the path E promotion slice itself runs to 126 lines; combined with its 2026-05-15 doctrinal reconciliation, the layer reaches nearly 400 lines; the full slice 18 family — pivot document, promotion slice, reconciliation — exceeds 590 lines of maintained prose. The disproportion is itself the point: little code is required for the decision, but considerable written discipline is required to keep it traceable.

§ 10 returns to a parallel attempt — also stemming from beam-rerank-v1's archival — to extract usable value from the engine through distillation rather than through doctrinal exception. Unlike path E, the distillation did not produce a promotable result.

§ 10 — The distillation experiment that did not generalize

The distillation pipeline of § 6.4 — A → B → C — was executed in earnest after path E re-promoted beam-rerank-v1 onto the live advisor surface. The motivation was simple. If the reranker is now production on one surface but archived on the other, can a distilled student of the reranker fit into both — capturing the reranker's decision quality at a cost the bench gate would accept? The answer the project produced was no. Across twelve cells of formulation × ablation, no student model both captured the reranker's training-time accuracy and held that accuracy on the independent corpus. The slice closed with verdict NO-GO across all twelve cells. § 10 documents what was tried, what the failure mode was, and what the failure revealed about the methodology beyond distillation.

§ 10.1 — Setup

The pipeline followed § 6.4 strictly. Slice A exported the design corpus under the production beam at Top-K=8, fingerprinted by contentHash, and froze the output. Slice B replayed every row under beam-rerank-v1 in counterfactual mode, attaching the oracle labels oracleBeatsBeamTop, oracleScoreDeltaVsBeamTop, oracleTopCandidateId, and oracleLabelConfidence. The B output was bit-stable across replays — verified by re-running and confirming contentHash identity — and frozen in its turn. The combined A+B output was the fixed input to slice C.
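
The bit-stability check described above can be sketched as follows. The four oracle label names come from the text; the `rowId` field, the row shape, and the toy hash are assumptions. In particular, `toyHash` is a stand-in chosen to keep the sketch dependency-free, and the project's actual contentHash algorithm is not specified here.

```typescript
// Oracle label fields are named in the text; rowId and the row shape are assumed.
interface LabeledRow {
  rowId: string;
  oracleBeatsBeamTop: boolean;
  oracleScoreDeltaVsBeamTop: number;
  oracleTopCandidateId: string;
  oracleLabelConfidence: number;
}

// Toy 32-bit rolling hash (djb2, xor variant) over UTF-16 code units.
// A stand-in only: a real pipeline would likely use a cryptographic hash.
function toyHash(text: string): string {
  let h = 5381;
  for (let i = 0; i < text.length; i++) {
    h = ((h * 33) ^ text.charCodeAt(i)) >>> 0;
  }
  return h.toString(16);
}

// Serialize fields in a fixed order so the hash does not depend on key order.
function contentHash(rows: LabeledRow[]): string {
  const canonical = JSON.stringify(
    rows.map((r) => [
      r.rowId,
      r.oracleBeatsBeamTop,
      r.oracleScoreDeltaVsBeamTop,
      r.oracleTopCandidateId,
      r.oracleLabelConfidence,
    ]),
  );
  return toyHash(canonical);
}

// Re-run the replay and confirm that every run produces the same hash.
function isBitStable(replay: () => LabeledRow[], runs: number = 2): boolean {
  const first = contentHash(replay());
  for (let i = 1; i < runs; i++) {
    if (contentHash(replay()) !== first) return false;
  }
  return true;
}

const frozen: LabeledRow[] = [
  {
    rowId: "row-0001",
    oracleBeatsBeamTop: true,
    oracleScoreDeltaVsBeamTop: 1.25,
    oracleTopCandidateId: "cand-3",
    oracleLabelConfidence: 0.9,
  },
];

console.log(isBitStable(() => frozen)); // true: identical rows hash identically
```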

Slice C trained twelve student formulations against the fixed A+B input. The matrix was formulation (six variants of feature subsetting, normalization, and pairwise vs pointwise loss) × ablation (two variants: include or exclude actorId from the feature set). All twelve cells shared the same full-batch gradient descent, zero initialization, iteration count, and anti-leakage CI tests. The only varying axes were formulation and the actorId inclusion flag.
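
As a minimal sketch, the twelve-cell matrix is the cross product of the two axes. The formulation names `form-1` to `form-6` are placeholders, since the six variants are not enumerated here, and the iteration count is a hypothetical value.

```typescript
interface Cell {
  formulation: string;
  includeActorId: boolean;
}

// Shared by every cell. The iteration count is a hypothetical value.
const SHARED_TRAINING = {
  optimizer: "full-batch gradient descent",
  initialWeight: 0,
  iterations: 500,
} as const;

// Placeholder names: the six formulations are not enumerated in the text.
const FORMULATIONS = ["form-1", "form-2", "form-3", "form-4", "form-5", "form-6"];

// Cross every formulation with both values of the actorId ablation flag.
function buildMatrix(formulations: string[]): Cell[] {
  const cells: Cell[] = [];
  for (const formulation of formulations) {
    for (const includeActorId of [true, false]) {
      cells.push({ formulation, includeActorId });
    }
  }
  return cells;
}

const cells = buildMatrix(FORMULATIONS);
console.log(cells.length, SHARED_TRAINING.optimizer); // 12 cells, one shared setup
```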

§ 10.2 — The train-design / holdout-independent gap

Every cell exhibited a consistent gap between training accuracy on the design corpus and held-out accuracy on the independent corpus. The strongest cell — pairwise loss with full feature inclusion — achieved 88.4% top-1 accuracy on the held-out portion of the design corpus, a +44.2-point improvement over the production beam's own top-1 prediction evaluated as a no-rerank student — the beam baseline against which the rerank labels of slice B are themselves defined. On the independent corpus, the same cell collapsed to 62.5% against a beam-baseline figure of 65.0% — a 26-point drop that brought the student 2.5 points below the beam baseline on the independent corpus. The other eleven cells showed weaker but qualitatively identical drops. No student generalized.
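
The arithmetic behind the strongest cell can be checked directly. The sketch below uses only the three figures quoted above; the type and function names are illustrative.

```typescript
// Accuracies are top-1 percentages.
interface CellAccuracy {
  designHeldOut: number;           // held-out portion of the design corpus
  independent: number;             // independent corpus
  beamBaselineIndependent: number; // beam baseline on the independent corpus
}

// Round to one decimal to absorb floating-point noise in the subtraction.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

function generalizationGap(a: CellAccuracy) {
  return {
    dropPoints: round1(a.designHeldOut - a.independent),
    marginVsBeamPoints: round1(a.independent - a.beamBaselineIndependent),
  };
}

const strongest = generalizationGap({
  designHeldOut: 88.4,
  independent: 62.5,
  beamBaselineIndependent: 65.0,
});

console.log(strongest); // { dropPoints: 25.9, marginVsBeamPoints: -2.5 }
```

The 25.9-point drop is the figure the text rounds to 26, and the negative margin is the 2.5 points by which the student falls below the beam baseline.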

§ 10.3 — The actorId revelation

The decisive finding was on the ablation axis. Cells that included actorId in the feature set fitted the design corpus better than cells that excluded it — by approximately three points. But on the independent corpus, actorId inclusion became catastrophic: 77.8% of the independent corpus's actorId values were out-of-vocabulary relative to the design corpus.

The students that had used actorId on the training set were, on the independent corpus, looking at features they had never seen during training and producing predictions that no longer carried the calibration the design corpus had taught them. The student had not learned the reranker's decision policy. It had learned the reranker's behavior on the specific monsters present in the design corpus.
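
An out-of-vocabulary rate of this kind can be computed with a few lines. This sketch assumes the rate is taken over distinct actorId values rather than over rows, which the text does not specify, and the identifiers in the example are toy values, not data from either corpus.

```typescript
// Share of distinct evaluation ids that never appear in the training ids.
function oovRate(trainIds: string[], evalIds: string[]): number {
  const vocabulary = new Set<string>(trainIds);
  const distinctEval = evalIds.filter((id, i) => evalIds.indexOf(id) === i);
  if (distinctEval.length === 0) return 0;
  const unseen = distinctEval.filter((id) => !vocabulary.has(id));
  return unseen.length / distinctEval.length;
}

// Toy identifiers: two of the nine distinct evaluation ids are known.
const designIds = ["actor-01", "actor-02", "actor-03"];
const independentIds = [
  "actor-02", "actor-03", "actor-10", "actor-11", "actor-12",
  "actor-13", "actor-14", "actor-15", "actor-16", "actor-02",
];

console.log(oovRate(designIds, independentIds).toFixed(3)); // "0.778"
```

Running such a check before training would flag the identity axis as disjoint before any student is fitted.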

§ 10.4 — What the failure revealed

The lesson, written into the slice C postmortem, is the methodological echo of slice 4.2ter applied to a different surface. The independent corpus had again caught a generalization claim that the design corpus alone would have certified. The first time, in § 7, the certification would have shipped a candidate whose true runtime cost was 2.8× the design-corpus estimate. The second time, here, it would have shipped a student model trained to recognize specific monster identities rather than the policy of the reranker it was meant to imitate.

The deeper observation is structural. The bottleneck of the distillation experiment was not feature richness — the fifty features of CandidateFeaturesV1 are expressive enough to encode any signal the reranker reasons over. The bottleneck was the disjointness of the corpora on the identity axis. The design corpus had taught the student to associate decision quality with specific actor instances. The independent corpus presented unfamiliar actor instances and exposed the substitution. Until the project can either capture an independent corpus whose actor distribution matches the design corpus's, or train students explicitly invariant to actor identity, distillation against beam-rerank-v1 cannot generalize on this terrain.

§ 10.5 — Disposition

Slice C closed NO-GO. The distillation pipeline was preserved in the source tree, with explicit module documentation describing the actorId failure mode and the conditions under which a future slice could re-open the question. beam-rerank-v1 remains archive on the bench surface and production on the live advisor surface — its dual-kind regime undisturbed by the failed distillation. The reranker's runtime budget on the live surface continues to be paid by the user's tolerance for 1.5-second responses, not by a generalizable distilled student.

This was the second time the project tried to extract systematic value from beam-rerank-v1, and the second time the independent corpus replay caught the attempt: the first time on cost, the second on out-of-distribution generalization. The path E doctrine of § 9 remains the only escape route the methodology has found.

§ 11 — The measured ceiling and what it bounds

The project's most useful single artifact is the ceiling report — a written document that names, on four axes, the maximum performance the system can reach without changing its assumptions. The ceiling is not a hypothetical bound. It was measured by audits whose verdicts are documented in specific slices, and it is the residue of those audits when read together.

§ 11.1 — Forecast drift

The first axis is the divergence between the simulator's predictEnemyPreviews — the tier-based heuristic the beam uses to model enemy actions at rounds 1 and 2 — and the ground truth captured from the live binary. Slice 11 measured this divergence at 18% on the design corpus: roughly one in five predicted enemy actions did not match what the binary actually played. The slice closed with verdict FORECAST_DEPTH_DEGRADATION_CONFIRMED.
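
A drift figure of this kind is a mismatch rate between two aligned sequences. The sketch below assumes enemy actions can be compared as string identifiers; the action names and the sample of fifty are toy values, not slice 11's data.

```typescript
// Fraction of positions where the predicted action differs from the observed one.
function forecastDrift(predicted: string[], actual: string[]): number {
  if (predicted.length !== actual.length) {
    throw new Error("predicted and actual must be aligned one to one");
  }
  if (predicted.length === 0) return 0;
  let mismatches = 0;
  for (let i = 0; i < predicted.length; i++) {
    if (predicted[i] !== actual[i]) mismatches++;
  }
  return mismatches / predicted.length;
}

// Toy sample: 50 predictions, 9 of which the binary contradicts.
const predicted: string[] = [];
const actual: string[] = [];
for (let i = 0; i < 50; i++) {
  predicted.push("attack");
  actual.push(i < 9 ? "guard" : "attack");
}

console.log(forecastDrift(predicted, actual)); // 0.18
```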

The forecast drift is the largest single attributable cause of a broader phenomenon: the decision divergence between the production beam at depth 3 and the same beam at depth 5, which slice 10 measured at 72.2% on the design corpus. Of that divergence, slice 11 attributes approximately 18% specifically to forecast drift. The remainder is distributed across rollout-tail value noise (exonerated by slice 13 as not producing outcome flips) and combinatorial interactions that the layered audits could not isolate. The implication for lookahead policy is clear in direction even where the decomposition is imprecise: the beam reasoning at depth 5 on this terrain operates against a state representation that diverges from its depth-3 self in the majority of cases, and the divergence is dominated by interaction effects the project has not been able to attack independently of the forecast drift that seeds them.

§ 11.2 — Damage debt as non-local signal

The second axis was inscribed by slice 8. Roughly two-thirds of recorded defeats follow trajectories in which the AI accumulates a damage deficit over three to four rounds before dying. The deficit is not localized to any single state. No state-only scoring function — including the seventy-weight contextual scoreState — can detect it cheaply, because the signal of impending defeat is the integral of damage taken minus damage dealt over the recent history, not a property of the current frame.
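
The non-locality is easy to see in code: the debt needs a history argument, which a state-only scorer does not have. This is a minimal sketch with assumed names and toy numbers. It is not the rejected damageDebtFix implementation.

```typescript
interface RoundLog {
  damageTaken: number;
  damageDealt: number;
}

// Sum of (damage taken - damage dealt) over the last `windowRounds` rounds.
function damageDebt(log: RoundLog[], windowRounds: number): number {
  const start = Math.max(0, log.length - windowRounds);
  let debt = 0;
  for (const round of log.slice(start)) {
    debt += round.damageTaken - round.damageDealt;
  }
  return debt;
}

// Toy combat log: a small early lead, then three losing rounds.
const combatLog: RoundLog[] = [
  { damageTaken: 12, damageDealt: 15 },
  { damageTaken: 20, damageDealt: 9 },
  { damageTaken: 18, damageDealt: 10 },
  { damageTaken: 22, damageDealt: 8 },
];

console.log(damageDebt(combatLog, 3)); // 33
console.log(damageDebt(combatLog, 4)); // 30
```

No single entry of the log carries the trend; only the running sum does.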

The damageDebtFix flag, tested in slice 8 and rejected, was an attempt to surface this debt as a leaf-level penalty. The attempt failed because over-weighting the debt destabilized the system in trials it would have won: the AI became prematurely aggressive on combats it was already going to win, and the wins it preserved were balanced out by the wins it lost. The signal is real. It is not learnable from leaf evaluation alone.

§ 11.3 — Distributed signal

The third axis is the most uncomfortable. Slices 12 and 13 jointly demonstrated that the reranker's improvement over the bare beam — which we know is real, on both design and independent corpora — cannot be reproduced by any local proxy that operates on a single state or a single rollout. The signal lives on the trajectory. Specifically, it lives in the way the reranker's first-action choice changes the distribution of states the beam encounters on rounds 1 through 3, and that distribution shift is what produces the win-rate delta.

This means the canonical AI improvement strategy on this terrain — distill the reranker into a learned local function — has a structural limit. § 10 documented that the distillation failed on out-of-distribution fragility. § 11.3 is the deeper reason it would have failed even on a corpus that controlled for actorId: the signal the student would need to learn is not, in any precise sense, a function of the state.

§ 11.4 — Lookahead extension

The fourth axis was measured by slice 9. The project tested lookahead at four and five rounds against the production beam's three-round configuration. Both extensions degraded performance: the forecast drift compounded faster than the additional depth recovered. The lookahead therefore stays at three. This axis is the simplest of the four because its cause and effect are the most direct: at the current forecast drift, the beam is at its optimal depth. Reducing the forecast drift would, by this logic, allow deeper lookahead, but only if the reduction holds on the independent corpus and only if it does not destabilize the contextual leaf evaluator's calibration.

§ 11.5 — The axes are not independent

The four axes do not operate independently of one another. Reducing forecast drift would lift the lookahead ceiling of § 11.4 but would not address the distributed-signal limit of § 11.3 or the non-local damage debt of § 11.2. A learned value network would help with the distributed signal but its absence is itself the reason mcts-archive plateaued. The four axes form a system in which improving along any one of them produces gains bounded by the others, which is why piecewise tuning of any single axis has not, in the project's history, produced a promotable candidate. The four axes are jointly the ceiling.

§ 11.6 — What the project did not try

A ceiling report that listed only what the project measured would be a partial ceiling. Three directions the project did not try are worth declaring honestly.

First, an F4-style search — large-scale random search over scoring-function variants, in the manner of FunSearch or AlphaEvolve — was not attempted. The methodology requires written verdict closure on each slice; an F4 search generates thousands of variants for which writing a verdict per result is prohibitive. The project does not currently know whether such a search would find a scoreState formulation that beats the seventy hand-tuned weights of Phase E. We suspect, but did not measure, that it would not. The absence of measurement is honest, not conclusive.

Second, a learned value network, the missing component of the archived mcts-archive engine, was not trained. Phases 2 (distillation against beam labels) and 3 (self-play) of the original AlphaZero plan were not executed after phase 1 plateaued. A value network might lift the ceiling on the search side substantially. The project does not currently know.

Third, the independent corpus was not re-captured in a session running entirely after slice 16a fixed the damage-to-corruption drift. The 5.19× cost ratio of § 9 was measured on a corpus captured after the fix, but we have not re-measured the win-rate signal of beam-rerank-v1 on a fully post-fix independent corpus. The direction is presumed to hold, because the fix narrowed simulator-binary divergence without changing decision policy, but the verification has not been performed.

§ 11.7 — The honest position

The ceiling exists as a written document not because the project finished — the slice ladder continues to issue verdicts — but because the methodology demands that the maximum the project has confidently reached be statable in writing. The ceiling is a snapshot, not a verdict on the system's potential. It is the residue of six weeks of slices on a specific terrain, against a specific binary, with the seven specific frozen flags of § 8.

§ 12 returns to what that residue means for the broader thesis of the paper: that a hand-tuned beam built under audit discipline can, on a closed-source commercial game, sit at a ceiling that an unguided MCTS variant has not been able to dethrone — and that this result is consistent with, rather than a refutation of, the AlphaZero family's stated preconditions.

§ 12 — Discussion: intellectual position

§ 11 closed with the residue of six weeks of slices on a specific terrain. § 12 asks what that residue means intellectually.

The principal empirical finding the paper produces is narrow. On the specific terrain of Aethermancer's combat — closed-source, turn-based, deterministic in exactMode, with a forecast drift attribution of approximately 18% on the design corpus — a Gumbel-AlphaZero MCTS variant without a learned value network plateaued below a beam search refined over six weeks of phased, AI-assisted tuning. This is the result. It is empirical, it is local to this terrain, and it has methodological consequences disproportionate to its modesty.

§ 12.1 — Sutton's bitter lesson, fairly restated

The bitter lesson, as Sutton stated it (Sutton 2019), claims that two general methods — search and learning — scale with computation in a way that hand-crafted task-specific knowledge does not, and that the history of AI research is the history of repeatedly underestimating this fact. Applied to game AI, the bitter lesson predicts that hand-engineered heuristics — leaf evaluators, opponent models, contextual scorers — will be outperformed by methods that exploit compute through search or learning, given enough of both.

The bitter lesson is not wrong, on this paper's evidence. But it is incomplete in its common phrasing. Specifically, it collapses two distinct claims into one:

1. Search scales with compute: a search algorithm using more simulations finds better moves.
2. Learning scales with compute: a learned model trained on more data and parameters produces better representations.

The strong reading of the bitter lesson is that both claims are true simultaneously and combine multiplicatively. The AlphaZero family is the canonical evidence: search and learning together, scaled together, defeat hand-tuned approaches on Go, chess, and shogi.

§ 12.2 — What the project measured

The project's measurement is specific and small. Search alone, without learning, does not scale through the combinatorial structure of Aethermancer's combat at a compute budget any single-engineer project can afford. The archived mcts-archive engine — 64 simulations, Gumbel-AlphaZero PUCT, value head equal to tanh(scoreState / 30) without a learned network — plateaued below 45% paired win-rate against the production beam.
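
The value head described above is a one-line function. The sketch below uses the formula from the text, tanh(scoreState / 30); the function names and the sample scores are illustrative.

```typescript
// Scale taken from the text: value = tanh(scoreState / 30).
const VALUE_SCALE = 30;

function valueHead(scoreState: number): number {
  return Math.tanh(scoreState / VALUE_SCALE);
}

// tanh is increasing, so ranking leaves by value reproduces the ranking
// by raw scoreState for scores that do not saturate to the same float.
function sameRanking(scores: number[]): boolean {
  const byScore = scores.slice().sort((a, b) => a - b);
  const byValue = scores.slice().sort((a, b) => valueHead(a) - valueHead(b));
  return byScore.every((score, i) => score === byValue[i]);
}

console.log(valueHead(0)); // 0
console.log(sameRanking([-45, 12, 3, 60, -8])); // true for these scores
```

Because the wrap only rescales the hand-tuned score into (-1, 1), the search ranks leaves in the same order as the evaluator it wraps.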

This is consistent with the AlphaZero family's stated requirements, not a refutation of it. Danihelka et al. were explicit: Gumbel-AlphaZero replaces the canonical 800 simulations with 64 because a learned value network shoulders the rest of the work. Substituting a hand-written scoreState produces exactly the result one would expect: the search runs, but it does not gain over the beam because the leaf signal is the same signal the beam already uses, and adding tree depth on a fixed leaf signal hits diminishing returns at the forecast-drift wall identified in § 11.

The decomposition matters. The bitter lesson holds for the pairing of search with learned representations. It does not, on this terrain at this budget, hold for search alone. The gains that come from search are real but exhaustible against a hand-tuned baseline above a forecast-drift wall. The other component, the learned representation, is precisely what mcts-archive lacked.

§ 12.3 — Chollet's caveat

Chollet's complementary position (Chollet 2019) argues that the value of a system's capacity to learn is bounded by its capacity to abstract — to recognize that two superficially different problems share a deeper structure — and that this capacity does not scale linearly with compute. The bitter lesson and Chollet's caveat are not in opposition. They describe different axes. Sutton describes what happens to specific systems as their compute budget grows. Chollet describes what those systems would need to count as intelligent in a stronger sense.

§ 12.4 — Where this paper sits between them

The combat AI documented here sits squarely in the regime Sutton describes: a deterministic, finite-state environment on which a search engine plus hand-tuned leaves outperforms a search engine with neither learned leaves nor learned representations. The result is consistent with the bitter lesson's narrow form (learning beats hand-tuning when the learning is properly supplied) and silent on its strong form (general methods always win, eventually). It is also too narrow to bear on Chollet's caveat at all: nothing in Aethermancer's combat requires abstraction across problem families, and nothing this paper measures could falsify or confirm that a more abstractive architecture would have done better.

The honest reading is that this work generalizes along the methodological axis, not the algorithmic one. The F4 family of future-work directions — programmatic search, learned value, transformer policy — would extend the algorithmic axis; none of them is launched here, and the ceiling report of § 11 explicitly defers them. What this paper offers, instead, is a discipline that survived its own strongest signal — the slice 4 sequence of § 7 — and that returned a coherent ladder of verdicts on every subsequent candidate. That discipline is what is portable.

The paper does not claim Aethermancer's combat AI is interesting in itself. It claims that the audit methodology that produced this AI's measured ceiling is interesting — and that the methodology generalizes to any project whose authors want to know what their system does when it leaves the corpus they trained it on. The F4 family, on this reading, does not displace the methodology. It will, when it ships beyond the academic surface, need its own version.

§ 13 — Threats to validity

Several structural features of the project bear on the validity of the conclusions it produces. We declare them explicitly here rather than absorb them into individual sections, because their effect is cumulative across the paper rather than localized. For each, we follow the same pattern: the threat itself, why it exists, what partially mitigates it, and what remains.

§ 13.1 — Single-engineer self-review

The project was developed and audited by a single practitioner. Every slice was authored, executed, and closed by the same person. Each verdict is therefore a self-assessment, not a peer-reviewed result. Standard methodological practice in experimental work requires independent verification by a reviewer who did not author the experiment; this paper offers no such verification.

What partially mitigates the threat: every slice is a written artifact that names its hypothesis, criteria, and methodology before the run, and its verdict after. The chain of reasoning is auditable by any future reader — including the author's future self in a different context — even without contemporaneous review. The independent corpus replay procedure of § 6.5 is in part a self-administered check against confirmation bias: the verdict the corpus produces is the verdict that closes the slice, and the author cannot post-hoc adjust the candidate to recover a failed result.

What remains: any systematic bias in the author's reasoning — for instance, an unrecognized preference for a particular type of verdict in ambiguous cases — propagates undetected. Peer review would catch this class of bias. The slice ladder does not.

§ 13.2 — Single-game scope

The methodology has been exercised against exactly one game. The project cannot claim the procedures of § 6 generalize to a second game without measuring them on a second game. The specific frozen flags of § 8 are calibrated for Aethermancer's combat structure; the 4× cost gate is calibrated against the production beam's runtime budget on this terrain. Other games would impose different ratios.

What partially mitigates the threat: the methodology is structured around generic abstractions. The slice ladder of § 6.2 makes no reference to Aethermancer-specific mechanics. The two-surface partition of § 6.1 generalizes to any system with diagnostic and promotion surfaces, which is most systems. The decompiled cross-check pattern of § 4 generalizes to any closed-source binary that can be decompiled into static source.

What remains: claims about specific values — the 4× cost gate, the 18% forecast drift attribution, the seventy-weight contextual evaluator — are local to this terrain. Other game-AI projects would have to discover their own values through analogous slice work.

§ 13.3 — ML-free as a voluntary constraint

The project deliberately excluded machine-learning components from the runtime path. No learned policy, no neural network, no gradient-trained representation. This choice produces a verdict surface that is deterministic, auditable, and reproducible — but it also means the project cannot claim to have explored the space of ML-augmented engine designs.

What partially mitigates the threat: the failed distillation experiment of § 10 is the strongest evidence the project has that learned representations on this terrain would have to overcome the same out-of-distribution fragility that defeated the student model. The mcts-archive plateau of § 12.2 is consistent with the AlphaZero family's stated requirement of a learned value network, which the project did not provide. Both observations bound the ML-free choice from one side: a model would have to address known generalization failures to lift the ceiling on this terrain.

What remains: the project does not know whether a value network trained with sufficient self-play would lift the ceiling. Phases 2 and 3 of the AlphaZero plan were not executed. The negative result on search alone does not extend to search + learning. § 12.2 stated this explicitly; § 13.3 declares it as a validity limit.

§ 13.4 — Path E is not scientifically reproducible

The path E doctrine of § 9 authorizes the author to suspend a cost gate on a single-user surface. This is a decision that depends on the surface's audience being the author. A reader cannot reproduce the path E verdict in any conventional sense — the surface's audience is, by construction, the author of the project, not a third party. The admissibility conditions of § 9.3 are intended to make this irreproducibility procedurally honest rather than methodologically hidden, but they do not convert it into a scientifically reproducible result.

What partially mitigates the threat: path E is declared as a doctrinal exception, not as a generalizable result. The paper does not claim path E should be applied to other projects; it claims only that some single-engineer projects face a tension that path E resolves, and that the resolution can be inscribed rather than left ambient.

What remains: the procedure is authorial. A reader who wishes to apply path E to their own project does so on their own surface, with their own conditions, and the verdict they reach cannot be compared with the verdict slice 18 reached. Path E is in the paper because it is part of the project's history. It is not in the paper as a procedure that scales beyond single-engineer practice.

§ 13.5 — Simulator validation is incident-driven, not exhaustive

The simulator-versus-decompiled cross-check discipline of § 4 identifies and resolves divergences when they are observed in live capture. The procedure is reactive: a divergence is opened as a DRIFT slice, the cross-check is performed, the simulator or capture is patched. Action handlers in the decompiled binary that have never produced an observable live drift remain untested by this procedure.

What partially mitigates the threat: the project's testing strategy operates at three scales — unit tests at the simulator function level, integration tests at the round-resolution level, and replay tests against captured combats. The replay tier in particular exercises the simulator against real distributions of inputs, biased toward gameplay patterns the corpora contain. The validation harness defines its tightest threshold at TIER 1 — the mechanical fidelity tier covering HP, shield, existence, and isDead — and targets ≥ 95% accuracy on that tier, per the project's Phase A3 specification. This is a fidelity threshold, not a combat-correctness measure: it certifies that the simulator tracks the binary on the observable state, not that the decisions made on top of that state are correct. Within the patterns the corpora visit, the discipline produces stable verdicts.
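
A TIER 1 check of this kind reduces to field-by-field comparison of simulated and observed states. The sketch below is a minimal reading: the field names `hp`, `shield`, and `exists` are assumed spellings of the fields the text lists, and exact per-state matching is an assumption about granularity.

```typescript
// The four mechanical fields the text lists for TIER 1.
interface Tier1State {
  hp: number;
  shield: number;
  exists: boolean;
  isDead: boolean;
}

const TIER1_TARGET = 0.95; // the >= 95% target quoted in the text

function tier1Equal(a: Tier1State, b: Tier1State): boolean {
  return a.hp === b.hp && a.shield === b.shield &&
    a.exists === b.exists && a.isDead === b.isDead;
}

// Share of aligned state pairs on which the simulator matches the binary.
function tier1Accuracy(simulated: Tier1State[], observed: Tier1State[]): number {
  if (simulated.length !== observed.length) {
    throw new Error("state lists must be aligned one to one");
  }
  if (simulated.length === 0) return 1;
  let matches = 0;
  for (let i = 0; i < simulated.length; i++) {
    if (tier1Equal(simulated[i], observed[i])) matches++;
  }
  return matches / simulated.length;
}

function meetsTier1(simulated: Tier1State[], observed: Tier1State[]): boolean {
  return tier1Accuracy(simulated, observed) >= TIER1_TARGET;
}
```

The check says nothing about whether the decisions taken on top of a matching state are good ones, which is the distinction the paragraph draws.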

What remains: mechanic regions never exercised by the corpora carry undiscovered drift potential. A future capture session that visits previously unvisited regions may reveal new DRIFT slices that the current ceiling does not anticipate. The discipline of § 4 will absorb them when they are observed. Until they are observed, they are invisible.

§ 13.6 — Synthesis

The five threats above are structural features of a single-engineer game-AI project pursuing audit-grade methodology against a closed-source binary. None are unique to this paper, and none can be eliminated without changing the structure of the project itself — additional engineers, additional games, a willingness to use ML, a willingness to ship to multiple users, an exhaustive validation infrastructure. Each would dilute the project's specific strengths in exchange for partial mitigation. The threats are declared, not solved.

§ 14 returns to what the project would do next given the threats above, and to the transposable components of the methodology that survive their declaration.

§ 14 — Future work and transposition

The ceiling reported in § 11 and the threats declared in § 13 define a frontier. Four directions cross that frontier in well-defined ways. We discuss each in turn. Before that, a practical checklist for readers asking whether the methodology applies to their own terrain.

The methodology, in one sentence

It forces an apparent signal — a benchmark won, a test passed, a paired win-rate of 100% on the data the author chose — to prove that it survives on data it has never been exposed to, before it can trigger a promotion.
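
In code, the sentence above amounts to a gate that ignores the design corpus when deciding. The sketch below is one hedged reading: the 4× cost gate is the figure quoted in § 13.2, while the 0.5 win-rate threshold, the type names, and the example numbers are assumptions.

```typescript
interface ReplayResult {
  pairedWinRate: number; // share of paired trials won, between 0 and 1
  costRatio: number;     // candidate runtime divided by production runtime
}

interface PromotionDecision {
  promote: boolean;
  costInflation: number | null; // independent cost ratio / design cost ratio
  reason: string;
}

const COST_GATE = 4; // the 4x cost gate quoted in the text

// The design result is diagnostic only: it never triggers promotion.
function decidePromotion(
  design: ReplayResult,
  independent: ReplayResult | null,
): PromotionDecision {
  if (independent === null) {
    return { promote: false, costInflation: null, reason: "no independent replay" };
  }
  const costInflation = independent.costRatio / design.costRatio;
  if (independent.costRatio > COST_GATE) {
    return { promote: false, costInflation, reason: "cost gate exceeded on independent corpus" };
  }
  if (independent.pairedWinRate <= 0.5) { // assumed threshold
    return { promote: false, costInflation, reason: "win-rate signal did not survive" };
  }
  return { promote: true, costInflation, reason: "signal survived independent replay" };
}

// Hypothetical numbers: a perfect design-corpus signal is still refused.
const decision = decidePromotion(
  { pairedWinRate: 1.0, costRatio: 2 },
  { pairedWinRate: 0.8, costRatio: 6 },
);
console.log(decision.promote, decision.reason);
```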

When it applies — three conditions

1. Your verdicts are not reproducible blind.

Someone else (or you, six months from now) cannot look at the same trial data and reach the same verdict without the context you currently carry in your head. This is Sculley-debt (Sculley et al. 2015), and it shows up in hand-tuned heuristics, ML pipelines whose promotion criteria drift, custom compilers, and distributed systems with emergent behaviour. If you can hand a peer the corpus and the rule and they reach the same verdict mechanically, you do not need this paper. If you cannot, you have the first condition.

2. You can build an independent oracle: executable, not a frozen snapshot.

A snapshotted corpus of "known-good outputs" catches regressions, not new bugs on inputs the snapshot never saw. What § 4 calls the differential cross-check requires an oracle that re-produces the result on inputs it has never been exposed to: a decompiled binary, a hand-written reference implementation, a deliberately simpler model. Csmith and EMI (Yang et al. 2011; Le et al. 2014) formalise this for compilers: the only credible compiler oracle is another compiler. The pattern transposes wherever authoring a deliberately simpler implementation of the same specification is cheaper than trusting the primary one blind.

3. You accept the cost of writing verdicts before promotion.

The six-value verdict ladder (§ 6.2) and the requirement that every slice close with a written verdict are heavier than a binary go/no-go. That weight is exactly what prevented the beam-rerank-v1 candidate from being promoted on its 100% paired-wins design-corpus signal when its true independent-corpus cost ratio was 2.8× the design-corpus estimate (§ 7). Without the ladder discipline, the candidate ships.
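
The executable-oracle condition can be illustrated with a generic differential check. Everything in this sketch is hypothetical: the damage rule, the planted bug, and the function names are toy stand-ins, not Aethermancer mechanics or the project's cross-check code.

```typescript
// Run both implementations on the same inputs and return the inputs on
// which they disagree.
function differentialCheck<I, O>(
  inputs: I[],
  primary: (input: I) => O,
  reference: (input: I) => O,
  same: (a: O, b: O) => boolean,
): I[] {
  return inputs.filter((input) => !same(primary(input), reference(input)));
}

interface Hit {
  damage: number;
  shield: number;
}

// Reference oracle: deliberately simple, clamps HP loss at zero.
function referenceHpLoss(hit: Hit): number {
  return Math.max(0, hit.damage - hit.shield);
}

// Primary implementation with a planted bug: no clamp, so HP loss can go negative.
function primaryHpLoss(hit: Hit): number {
  return hit.damage - hit.shield;
}

// Inputs neither implementation was tuned on.
const freshInputs: Hit[] = [
  { damage: 10, shield: 3 },
  { damage: 2, shield: 5 },
  { damage: 7, shield: 7 },
];

const divergences = differentialCheck(
  freshInputs, primaryHpLoss, referenceHpLoss, (a, b) => a === b,
);
console.log(divergences); // [ { damage: 2, shield: 5 } ]
```

A frozen snapshot containing only the first and third inputs would have passed both implementations; the executable oracle is what exposes the second.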

When this is not for you. If your tests can be re-played on the same inputs and yield the same verdict — classic regression testing on a stable spec — the machinery proposed here is overhead. The methodology is calibrated for settings where the system under audit is partially opaque, where the spec is what the code does, and where verdicts must survive independent replay before they bind a promotion decision.

§ 14.1 — A value network for scoreState

The most direct arrow from the measured ceiling is a learned value head replacing the tanh wrap of the hand-tuned scoreState that gates the archived mcts-archive engine. The pipeline of § 6.4 already supplies the dataset: slice A export, slice B oracle labeling under beam-rerank-v1, and a frozen contentHash that would let any candidate student train against a fixed input. The exact architecture is an open implementation choice: a small learned approximator over the CandidateFeaturesV1 schema, in whatever functional form best preserves determinism and bench reproducibility. The constraint is methodological, not architectural: whatever the learned function, it must remain auditable enough to fit the slice ladder. The independent corpus check of § 6.5 applies without modification, and the same procedure that defeated the distilled student of § 10 would catch any value head that learned the design corpus rather than the policy. A value head that survived would, in principle, lift the mcts-archive plateau of § 12.2 by replacing the leaf signal the search currently shares with the production beam.

§ 14.2 — F4 search under audit discipline

The deeper question — and the one this paper raised but did not answer — is whether F4-style program-space search can operate within slice-grade audit infrastructure. The constituent elements are not logically exclusive: both the methodology and F4 search ultimately rest on some notion of strict evaluation. How an F4 loop could couple to a slice-grade verdict surface without dissolving either the slice format or the F4 throughput remains an open question. The structural tensions identified in § 12.4 — verdict granularity that does not scale to ten thousand candidates per night, single-author audit capacity that cannot individually close each generated variant, and the path E doctrine's incompatibility with policies whose generation is itself a search — are the components a future hybrid would need to address. This paper does not.

§ 14.3 — Post-fix independent corpus re-capture

The cheapest unresolved direction is also the simplest: re-capture the independent corpus on a session that runs entirely after slice 16a's damage-to-corruption fix. The 5.19× cost ratio of § 9 was measured on a corpus that postdates the fix; the original 13.26× signal of slice 4.2ter was measured on a corpus that predates it. We do not currently know what the signal direction and magnitude look like on a fully post-fix corpus. The cost is approximately one capture session — one to two hours of human gameplay on Windows, per § 7.4 — plus a re-replay of the slice 18 path E verdict against the new manifest. The benefit is a clean update of the doctrinal record, and possibly a re-evaluation of whether beam-rerank-v1 now sits closer to or further from the 4× cost gate on the bench surface. This is the project's most visible technical debt.

§ 14.4 — Methodology beyond Aethermancer

The most ambitious direction is the transposition of the methodology to projects whose terrain shares the structural features that made AetherJudge possible: a closed-source binary that can be decompiled or otherwise observed, a deterministic projection of the system under audit, and a single-engineer or small-team author with the discipline to write slices. Three transposition targets are tractable in our view.

First, other turn-based games with closed-source engines. The cross-check pattern of § 4, the slice ladder of § 6.2, and the independent corpus replay of § 6.5 transpose almost mechanically to Slay the Spire, Hearthstone, or any roguelite-class game with a decompilable binary. The specific values would differ; the procedural structure would not.

Second, ML-augmented production systems that ship models alongside hand-tuned rules. The two-surface partition of § 6.1, the engine kind taxonomy of § 6.3, and the anti-patterns of § 6.6 transpose to MLOps without modification — regret count gates, OVERUSE-as-verdict, and post-replay tuning all have direct analogues in ML pipelines that conflate diagnostic metrics with promotion gates.

Third, F4 systems in their pre-deployment phase. The path E doctrine of § 9 may have analogues in research-only systems whose authors want to ship a private version while preserving methodological discipline on a public benchmark. The specific admissibility conditions would need to be re-derived for the F4 context, but the structural skeleton — explicit inscription of a surface-disjoint verdict — applies.

Across all three transposition targets, what would not transpose are the specific verdict ladder values, the frozen flags, the cost gate threshold, and the independent corpus manifest format. These are properties of the terrain. What would transpose is the discipline that produced them.

§ 14.5 — Synthesis

Future work has four arrows: a value head that lifts the search ceiling, an F4 hybrid that operates under audit, a re-capture that closes a known measurement gap, and a transposition that takes the methodology somewhere it has not yet been measured. The first three are technical. The fourth is the question this paper exists to make possible.

§ 15 — Conclusion

This paper has documented six weeks of single-engineer, AI-assisted work on an audit subsystem for a closed-source commercial roguelite. The central finding is that a hand-tuned beam search refined under written-verdict discipline produces decisions that an unguided Gumbel-AlphaZero variant cannot improve upon at the compute budget the project can afford. The result is consistent with the AlphaZero family's stated reliance on a learned value network, not a refutation of it. The bitter lesson holds for the pairing of search with learned representations; on this terrain, at this budget, with hand-tuned leaves, it does not hold for search alone.

The single moment that exemplifies the methodology — and the moment without which the rest of the paper is uninteresting — is the slice sequence from April 27 to April 30, 2026, that produced the strongest signal the project had ever recorded and refused to ship it. The candidate engine beam-rerank-v1 posted a 100% paired win-rate against the production beam on a corpus of 110 snapshots and was archived because that corpus was the only one available. The independent corpus that the slice 4.2bis infrastructure made it possible to capture revealed that the candidate's true runtime cost on data the wrapper had not seen was 2.8× the design-corpus estimate. The methodology declined to ship. The result was inscribed as an invariant. Every other procedure documented in this paper is, in some way, a generalization of what that single slice taught.

The remaining slices — the rejected wrappers of § 5, the failed distillation of § 10, the measured ceiling of § 11 — would not have a methodological home without the procedure 4.2ter inscribed. Each is a negative result honestly closed. Most of the project's hardest-earned discipline lives in those negative closures, not in the rare candidate that passes the gate.

What the methodology accumulates over time is not a record of successes. It is a corpus of decisions documented under uncertainty, with the conditions of each decision named in writing and the verdicts left undisturbed by subsequent events. This corpus is the project's defense against the most common failure mode of single-engineer work: the silent revision of past judgments to fit present preferences. The slice ladder will not let a verdict be retroactively warmed by what came after, and the engine kind taxonomy will not let a rejected candidate launder itself into production through edits. These are small disciplines, written in a few hundred lines of process documentation. They are also why anything in this paper can be said to have happened.

The paper closes here. The methodology continues to issue verdicts.