The Learned Manager · Swink AgentShore™

A small actor-critic network chooses the next play. It's the RL Agent. The LLM Agents are the muscle.

The shape of the thing

The whole decision-maker inside Swink AgentShore™ is a single, small actor-critic MLP. It's deliberately tiny. You can train it, you can inspect it, you can replay it. The expensive thing is the LLM Agent that runs the play it picks.
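To make "small" concrete, here is a minimal sketch of an actor-critic MLP in plain Python. It is not the Swink AgentShore™ implementation: the observation size (16), the hidden width (32), the single shared tanh layer, and every name in the block are assumptions for illustration. Only the 22 output slots come from the action list on this page.

```python
import math
import random

NUM_ACTIONS = 22  # 19 plays + 3 reserved slots, per the action list on this page


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) with small uniform random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def linear(layer, x):
    """Apply one dense layer: y = W x + b."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


class TinyActorCritic:
    """Shared hidden layer feeding a policy head (logits) and a value head (scalar)."""

    def __init__(self, obs_dim, hidden_dim=32, num_actions=NUM_ACTIONS, seed=0):
        rng = random.Random(seed)
        self.trunk = init_layer(obs_dim, hidden_dim, rng)       # assumed: one hidden layer
        self.actor = init_layer(hidden_dim, num_actions, rng)   # one logit per action slot
        self.critic = init_layer(hidden_dim, 1, rng)            # scalar value estimate

    def forward(self, obs):
        hidden = [math.tanh(h) for h in linear(self.trunk, obs)]
        logits = linear(self.actor, hidden)
        value = linear(self.critic, hidden)[0]
        return logits, value

    def num_parameters(self):
        layers = (self.trunk, self.actor, self.critic)
        return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


net = TinyActorCritic(obs_dim=16)
logits, value = net.forward([0.5] * 16)
```

At these assumed sizes the whole network is 1,303 parameters, small enough to read, log, and replay in full.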

The action space

The policy picks from nineteen plays, plus three reserved slots that are masked at runtime.

04 Issue Pickup

05 Code Review

07 Run QA

06 Merge PR

02 Write Plan

01 Unblock PR

12 Refine Tasks

08 Systematic Debug

16 Groom Backlog

00 Instantiate Agent

13 Cleanup

11 Reconcile State

09 Design Audit

03 End Agent

17 Seed Project

18 Calibrate Align

10 End Session

15 Take Break

19 Prune

14 Reserved

20 Reserved

21 Reserved
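One way to write the list above down as a data structure is an integer enum. The indices and play names are taken from this page. The identifier spellings and the `RESERVED` and `LIVE_PLAYS` helpers are assumptions for illustration.

```python
from enum import IntEnum


class Play(IntEnum):
    """Action indices as listed on this page; identifier names are illustrative."""
    INSTANTIATE_AGENT = 0
    UNBLOCK_PR = 1
    WRITE_PLAN = 2
    END_AGENT = 3
    ISSUE_PICKUP = 4
    CODE_REVIEW = 5
    MERGE_PR = 6
    RUN_QA = 7
    SYSTEMATIC_DEBUG = 8
    DESIGN_AUDIT = 9
    END_SESSION = 10
    RECONCILE_STATE = 11
    REFINE_TASKS = 12
    CLEANUP = 13
    RESERVED_14 = 14
    TAKE_BREAK = 15
    GROOM_BACKLOG = 16
    SEED_PROJECT = 17
    CALIBRATE_ALIGN = 18
    PRUNE = 19
    RESERVED_20 = 20
    RESERVED_21 = 21


# Reserved slots keep their position in the logit vector but are never eligible.
RESERVED = frozenset({Play.RESERVED_14, Play.RESERVED_20, Play.RESERVED_21})
LIVE_PLAYS = [p for p in Play if p not in RESERVED]
```

Keeping the reserved slots in the enum means the network's output size stays fixed at 22 even if a reserved slot is later given a play.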

Ineligible plays are masked, not penalized

The policy can't pick an ineligible play. No PR to review → Code Review is masked. Dollar or time cap reached → only End Session remains.

Every play has a precondition, and it's checked twice: once when building the action mask before the policy picks, and again at dispatch before the play actually runs. Belt and suspenders.
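The two checks can be sketched as follows. This is an assumption-laden illustration, not the product's mask catalog: the `state` keys (`budget_exhausted`, `open_prs`) and all function names are hypothetical, and only two preconditions from this page are modeled. Every other non-reserved play is treated as eligible for simplicity.

```python
import math

CODE_REVIEW, END_SESSION = 5, 10   # indices from the play list on this page
NUM_ACTIONS = 22
RESERVED = {14, 20, 21}


def build_mask(state):
    """Return a list of booleans; True means the play is eligible."""
    if state["budget_exhausted"]:
        # Cap reached: only End Session remains.
        return [i == END_SESSION for i in range(NUM_ACTIONS)]
    mask = [i not in RESERVED for i in range(NUM_ACTIONS)]
    if state["open_prs"] == 0:
        mask[CODE_REVIEW] = False  # no PR to review
    return mask


def masked_softmax(logits, mask):
    """Softmax over eligible actions only; masked actions get probability 0."""
    top = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def dispatch(action, state):
    """Second check: re-evaluate the precondition just before the play runs."""
    if not build_mask(state)[action]:
        raise RuntimeError(f"play {action} is no longer eligible")
    return f"running play {action}"
```

Because a masked action has probability exactly zero, the policy cannot sample it. The dispatch check covers the case where state changes between selection and execution.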

Irreversible mistakes (merging the wrong thing, letting an author review their own PR) are too costly to learn from negative reward, so safety lives in the hard-gate catalog rather than in the reward.

The reward, in one paragraph

Reward is how the policy learns what's worth doing. It's a weighted sum, kept inside reasonable bounds so no single play swamps the rest.

Pushing forward: shipping issues (issue throughput) and moving the project toward done (alignment delta).

Pulling back: spinning in circles (loop), inflating the backlog faster than work closes (inflation), and stretches with no measurable progress (stagnation).
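As a rough sketch of that weighted sum: the five term names below come from this page, but the weights, the bound, and the choice of clipping as the bounding method are placeholder assumptions, not the product's values.

```python
# Placeholder weights: positive terms push forward, negative terms pull back.
WEIGHTS = {
    "issue_throughput": 1.0,
    "alignment_delta": 1.0,
    "loop": -0.5,
    "inflation": -0.5,
    "stagnation": -0.5,
}
REWARD_BOUND = 2.0  # assumed symmetric clip so no single play swamps the rest


def play_reward(signals, weights=WEIGHTS, bound=REWARD_BOUND):
    """Weighted sum of per-play signals, clipped to [-bound, bound]."""
    raw = sum(weights[name] * signals.get(name, 0.0) for name in weights)
    return max(-bound, min(bound, raw))


shipped = play_reward({"issue_throughput": 1.0})           # one issue shipped
stuck = play_reward({"loop": 1.0, "stagnation": 1.0})      # looping with no progress
outlier = play_reward({"issue_throughput": 10.0})          # clipped at the bound
```

With these placeholder numbers, `shipped` is 1.0, `stuck` is -1.0, and `outlier` is held to 2.0 instead of 10.0.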

Why this beats vibes and Ralph loops

Decisions are measurable. Every play carries a reward, a value estimate, and a replay trail. Ask "why did we run review here?" and get an answer with numbers.

Stuck states are penalized. Loop, inflation, and stagnation get negative reward. This manager knows when to stop.

Budgets are first-class. Every session runs against a dollar cap the policy learns to spend well, plus an optional wall-clock time cap enforced as a deterministic backstop the policy never sees. The mask catalog ensures neither gets blown past accidentally.

Safety is in the mask. Anti-confirmation, merge readiness, budget exhaustion: architectural invariants in the executor, not lines in a system prompt.

Cross-framework review. Code review and QA run on a different LLM family than the author. The policy can't route around it. The mistake of grading yourself is structurally removed.

Policy is portable. The LLM Agent behind a play can change without touching the policy. Swap Claude for Codex, or for an open-weight model. The policy is the same.

Long-running, not chat-bound. Sessions run for hours, not in conversational turns. No context window to overflow. State lives in BEADS, GitHub, and the policy's replay trail.

Mutations are auditable. Every external write goes through one adapter with idempotency keys and audit records. You can prove what the system did, and what it didn't.

Updates are incremental. PPO updates every 16 completed plays. Bad updates roll back on invalid logits. The manager improves the way managers do: a little at a time.
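The update cadence and rollback can be sketched like this. The 16-play interval is from this page; everything else is an assumption for illustration. "Invalid logits" is read here as any non-finite value (NaN or infinity), and `update_fn`, `logits_fn`, and `probe_obs` are hypothetical stand-ins for the real PPO step, the policy forward pass, and a fixed test observation.

```python
import copy
import math

UPDATE_EVERY = 16  # from the text: one PPO update per 16 completed plays


def logits_valid(logits):
    """Assumed validity rule: every logit must be a finite number."""
    return all(math.isfinite(x) for x in logits)


class IncrementalTrainer:
    def __init__(self, params, update_fn, logits_fn, probe_obs):
        self.params = params
        self.update_fn = update_fn    # hypothetical: (params, batch) -> new params
        self.logits_fn = logits_fn    # hypothetical: (params, obs) -> list of logits
        self.probe_obs = probe_obs    # fixed observation used to sanity-check updates
        self.buffer = []
        self.rollbacks = 0

    def record(self, transition):
        """Store one completed play; return True only if an update was accepted."""
        self.buffer.append(transition)
        if len(self.buffer) < UPDATE_EVERY:
            return False
        batch, self.buffer = self.buffer, []
        # Update a copy so the current parameters survive a bad step untouched.
        candidate = self.update_fn(copy.deepcopy(self.params), batch)
        if logits_valid(self.logits_fn(candidate, self.probe_obs)):
            self.params = candidate
            return True
        self.rollbacks += 1  # keep the old parameters: this is the rollback
        return False
```

Updating a copy and swapping it in only after the check passes means a rollback is just "do nothing", which keeps the failure path simple.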
