A small actor-critic network chooses the next play. It's the RL Agent. The LLM Agents are the muscle.
The shape of the thing
The whole decision-maker inside Swink AgentShore is a single, small actor-critic MLP. It's deliberately tiny. You can train it, you can inspect it, you can replay it. The expensive thing is the LLM Agent that runs the play it picks.
Parameters: ~51K
Observation: 246 dim
Action head: 22 discrete
Update cadence: 16 plays
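One layout consistent with these figures, offered as a sketch rather than the actual architecture: a shared trunk of two 128-unit hidden layers feeding a 22-way actor head and a scalar critic head. The hidden width and the shared trunk are assumptions. The page gives only the observation size, the action count, and the approximate parameter total.

```python
# Only OBS_DIM and N_ACTIONS come from the figures above.
OBS_DIM = 246
N_ACTIONS = 22
HIDDEN = (128, 128)  # ASSUMPTION: hidden sizes are not stated in the source


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out


def actor_critic_params(obs_dim: int, hidden: tuple, n_actions: int) -> int:
    """Parameter count for a shared-trunk MLP with an actor head and a critic head."""
    total = 0
    width = obs_dim
    for h in hidden:
        total += dense_params(width, h)
        width = h
    total += dense_params(width, n_actions)  # actor head: one logit per action slot
    total += dense_params(width, 1)          # critic head: scalar value estimate
    return total


TOTAL_PARAMS = actor_critic_params(OBS_DIM, HIDDEN, N_ACTIONS)
print(TOTAL_PARAMS)  # 51095
```

Under these assumed sizes the count lands at 51,095, which matches the "~51K" figure. Other layouts could reach a similar total.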
The action space
Nineteen plays the policy picks from, plus three reserved slots that are masked at runtime.
04 Issue Pickup
05 Code Review
07 Run QA
06 Merge PR
02 Write Plan
01 Unblock PR
12 Refine Tasks
08 Systematic Debug
16 Groom Backlog
00 Instantiate Agent
13 Cleanup
11 Reconcile State
09 Design Audit
03 End Agent
17 Seed Project
18 Calibrate Align
10 End Session
15 Take Break
19 Prune
14 Reserved
20 Reserved
21 Reserved
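The list above maps naturally onto an integer enum whose values index the 22-way action head. This is a minimal sketch: the IDs and play names come from the list, while the identifier spellings, the `RESERVED` set, and `base_mask` are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class Play(IntEnum):
    """Action-head indices. Values are the IDs shown in the play list."""
    INSTANTIATE_AGENT = 0
    UNBLOCK_PR = 1
    WRITE_PLAN = 2
    END_AGENT = 3
    ISSUE_PICKUP = 4
    CODE_REVIEW = 5
    MERGE_PR = 6
    RUN_QA = 7
    SYSTEMATIC_DEBUG = 8
    DESIGN_AUDIT = 9
    END_SESSION = 10
    RECONCILE_STATE = 11
    REFINE_TASKS = 12
    CLEANUP = 13
    RESERVED_14 = 14
    TAKE_BREAK = 15
    GROOM_BACKLOG = 16
    SEED_PROJECT = 17
    CALIBRATE_ALIGN = 18
    PRUNE = 19
    RESERVED_20 = 20
    RESERVED_21 = 21


# Reserved slots are never selectable at runtime.
RESERVED = frozenset({Play.RESERVED_14, Play.RESERVED_20, Play.RESERVED_21})


def base_mask() -> list[bool]:
    """True for the nineteen real plays, False for the three reserved slots."""
    return [play not in RESERVED for play in Play]


print(len(Play), sum(base_mask()))  # 22 19
```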
Ineligible plays are masked, not penalized
The policy can't pick them. No PR to review → Code Review is masked. Budget exhausted → only End Session remains.
Every play has a precondition, and it's checked twice: once when building the action mask before the policy picks, and again at dispatch before the play actually runs. Belt and suspenders.
Irreversible mistakes (merging the wrong thing, letting an author review their own PR) cost too much to learn from negative reward. Safety lives in the hard-gate catalog.
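The two-stage check can be sketched as follows. This is an illustration under stated assumptions, not the executor's real code: the `State` fields, `is_eligible`, and the two precondition rules are hypothetical and cover only the examples named above (no PR to review, budget exhausted). Masked actions receive probability exactly zero, so the policy cannot sample them, and the same predicate runs again at dispatch.

```python
import math
import random
from dataclasses import dataclass

N_ACTIONS = 22
CODE_REVIEW, END_SESSION = 5, 10  # IDs from the play list
RESERVED = {14, 20, 21}


@dataclass
class State:
    open_prs: int       # hypothetical field
    budget_left: float  # hypothetical field


def is_eligible(action: int, s: State) -> bool:
    """Hypothetical precondition check covering only the rules named in the text."""
    if action in RESERVED:
        return False
    if s.budget_left <= 0:
        return action == END_SESSION
    if action == CODE_REVIEW:
        return s.open_prs > 0
    return True


def build_mask(s: State) -> list[bool]:
    return [is_eligible(a, s) for a in range(N_ACTIONS)]


def masked_probs(logits: list[float], mask: list[bool]) -> list[float]:
    """Softmax over eligible actions only; masked actions get probability 0."""
    top = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, mask)]
    z = sum(exps)
    return [e / z for e in exps]


def pick(logits: list[float], s: State, rng: random.Random) -> int:
    """First check: build the mask, then sample from the masked distribution."""
    probs = masked_probs(logits, build_mask(s))
    return rng.choices(range(N_ACTIONS), weights=probs, k=1)[0]


def dispatch(action: int, s: State) -> str:
    """Second check: the state may have changed since the mask was built."""
    if not is_eligible(action, s):
        raise PermissionError(f"play {action} is not eligible at dispatch")
    return f"running play {action}"
```

Because ineligible actions are removed from the distribution rather than punished afterwards, the policy never needs to experience an irreversible mistake in order to avoid it.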
The reward, in one paragraph
Reward is how the policy learns what's worth doing. It's a weighted sum, kept inside reasonable bounds so no single play swamps the rest.
Pushing forward: shipping issues (issue throughput) and moving the project toward done (alignment delta).
Pulling back: spinning in circles (loop), inflating the backlog faster than work closes (inflation), and stretches with no measurable progress (stagnation).
The bias is toward merging and project progress.
Rough magnitudes only, to show the shape of the bias. Real values are clipped and adjusted per project.
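The reward described above (a weighted sum of five signals, clipped to a bound) could look like the sketch below. The weights and the clip value are placeholders invented for illustration, not the project's real numbers. Only the five signal names and their signs come from the text.

```python
# PLACEHOLDER weights: positive for the two "pushing forward" signals,
# negative for the three "pulling back" signals. Magnitudes are invented.
WEIGHTS = {
    "throughput": 1.0,
    "alignment_delta": 1.0,
    "loop": -0.5,
    "inflation": -0.5,
    "stagnation": -0.5,
}
CLIP = 1.0  # ASSUMPTION: the real bound is not stated in the source


def reward(signals: dict[str, float],
           weights: dict[str, float] = WEIGHTS,
           clip: float = CLIP) -> float:
    """Weighted sum of the signals, clipped to [-clip, +clip]."""
    raw = sum(w * signals.get(name, 0.0) for name, w in weights.items())
    return max(-clip, min(clip, raw))


print(reward({"throughput": 1.0, "loop": 1.0}))  # 0.5
```

Clipping is what keeps one unusually large signal from dominating the batch. A play that closes five issues at once still contributes at most `CLIP`.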
Why this beats vibes and Ralph loops
Decisions are measurable
Every play carries a reward, a value estimate, and a replay trail. Ask "why did we run review here?" and get an answer with numbers.
Stuck states are penalized
Loop, inflation, and stagnation get negative reward. This one knows when to stop.
Budgets are first-class
Every session runs against a budget cap. The policy learns to spend it well, and the mask catalog ensures it can't blow past it accidentally.
Safety is in the mask
Anti-confirmation, merge readiness, budget exhaustion: architectural invariants in the executor, not lines in a system prompt.
Cross-framework review
Code review and QA run on a different LLM family than the author. The policy can't route around it. The mistake of grading yourself is structurally removed.
Policy is portable
The LLM Agent behind a play can change without touching the policy. Swap Claude for Codex, or for an open-weight model. The policy is the same.
Mutations are auditable
Every external write goes through one adapter with idempotency keys and audit records. You can prove what the system did, and what it didn't.
Updates are incremental
PPO updates every 16 completed plays. Bad updates roll back on invalid logits. The manager improves the way managers do: a little at a time.
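The update cadence and the rollback rule can be sketched like this. The PPO step itself is left abstract. `update_fn`, `logits_fn`, and `probe_obs` are hypothetical hooks standing in for the real optimizer, policy forward pass, and a set of sanity-check observations. "Invalid logits" is read here as "any non-finite value", which is an assumption.

```python
import copy
import math

UPDATE_EVERY = 16  # from the text: one update per 16 completed plays


class UpdateLoop:
    """Buffers completed plays; keeps an update only if the new logits are finite."""

    def __init__(self, params, update_fn, logits_fn, probe_obs):
        self.params = params        # any deep-copyable parameter container
        self.update_fn = update_fn  # hypothetical: (params, batch) -> new params
        self.logits_fn = logits_fn  # hypothetical: (params, obs) -> list of logits
        self.probe_obs = probe_obs  # observations used to sanity-check a candidate
        self.buffer = []
        self.rollbacks = 0

    def record(self, transition) -> bool:
        """Store one completed play; return True if an update was applied and kept."""
        self.buffer.append(transition)
        if len(self.buffer) < UPDATE_EVERY:
            return False
        batch, self.buffer = self.buffer, []
        # Update a copy so the current parameters stay untouched until validated.
        candidate = self.update_fn(copy.deepcopy(self.params), batch)
        if self._logits_valid(candidate):
            self.params = candidate
            return True
        self.rollbacks += 1  # discard the candidate; previous params remain live
        return False

    def _logits_valid(self, params) -> bool:
        return all(
            math.isfinite(x)
            for obs in self.probe_obs
            for x in self.logits_fn(params, obs)
        )
```

Validating on a copy means a rollback is simply "do nothing": the live policy is never in a half-updated state.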