The learned manager

The PPO policy

A small actor-critic network chooses the next play. It's the RL Agent. The LLM Agents are the muscle.

The shape of the thing

The whole decision-maker inside Swink AgentShore is a single, small actor-critic MLP. It's deliberately tiny. You can train it, you can inspect it, you can replay it. The expensive thing is the LLM Agent that runs the play it picks.

Parameters: ~51K
Observation: 246 dimensions
Action head: 22 discrete actions
Update cadence: every 16 plays
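
For a sense of scale, the sketch below counts the weights of one architecture that fits these figures. The shared trunk and the two 128-unit hidden layers are assumptions, chosen only because they land near the stated ~51K. The source does not give the layer sizes.

```python
OBS_DIM = 246        # observation size stated above
N_ACTIONS = 22       # action head size stated above
HIDDEN = (128, 128)  # assumption: layer widths are not given in the source


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def actor_critic_params(obs_dim: int, hidden: tuple, n_actions: int) -> int:
    """Parameter count for a shared trunk with an actor head and a critic head."""
    total = 0
    width = obs_dim
    for h in hidden:
        total += dense_params(width, h)
        width = h
    total += dense_params(width, n_actions)  # actor: one logit per action slot
    total += dense_params(width, 1)          # critic: one scalar value estimate
    return total


TOTAL = actor_critic_params(OBS_DIM, HIDDEN, N_ACTIONS)
print(TOTAL)  # 51095
```

Under these assumed widths the first layer alone accounts for about 60% of the weights, so the observation size drives the parameter count more than the action head does.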

The action space

The policy picks from nineteen plays. Three more slots are reserved and masked at runtime.

04 Issue Pickup
05 Code Review
07 Run QA
06 Merge PR
02 Write Plan
01 Unblock PR
12 Refine Tasks
08 Systematic Debug
16 Groom Backlog
00 Instantiate Agent
13 Cleanup
11 Reconcile State
09 Design Audit
03 End Agent
17 Seed Project
18 Calibrate Align
10 End Session
15 Take Break
19 Prune
14 Reserved
20 Reserved
21 Reserved
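
One way to pin these IDs down in code is an integer enum. The sketch below copies the IDs from the list above. The identifier names (`Play`, `RESERVED`, `SELECTABLE`) are illustrative and not taken from the Swink AgentShore codebase.

```python
from enum import IntEnum


class Play(IntEnum):
    """Action IDs as listed above; member names are illustrative."""
    INSTANTIATE_AGENT = 0
    UNBLOCK_PR = 1
    WRITE_PLAN = 2
    END_AGENT = 3
    ISSUE_PICKUP = 4
    CODE_REVIEW = 5
    MERGE_PR = 6
    RUN_QA = 7
    SYSTEMATIC_DEBUG = 8
    DESIGN_AUDIT = 9
    END_SESSION = 10
    RECONCILE_STATE = 11
    REFINE_TASKS = 12
    CLEANUP = 13
    RESERVED_14 = 14
    TAKE_BREAK = 15
    GROOM_BACKLOG = 16
    SEED_PROJECT = 17
    CALIBRATE_ALIGN = 18
    PRUNE = 19
    RESERVED_20 = 20
    RESERVED_21 = 21


# Reserved slots exist in the 22-wide action head but are always masked.
RESERVED = frozenset({Play.RESERVED_14, Play.RESERVED_20, Play.RESERVED_21})
SELECTABLE = [p for p in Play if p not in RESERVED]

print(len(Play), len(SELECTABLE))  # 22 19
```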

Ineligible plays are masked, not penalized

The policy can't pick them. No PR to review → Code Review is masked. Budget exhausted → only End Session remains.
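
A common way to implement this kind of masking is to set the logits of ineligible actions to negative infinity before the softmax, so their probability is exactly zero. The sketch below shows that technique in plain Python. It is a generic illustration, not the project's code, and `masked_probs` is a hypothetical name.

```python
import math


def masked_probs(logits: list, eligible: list) -> list:
    """Softmax over eligible actions only; ineligible actions get probability 0."""
    if not any(eligible):
        raise ValueError("at least one action must be eligible")
    masked = [x if ok else -math.inf for x, ok in zip(logits, eligible)]
    peak = max(masked)                           # subtract the max for stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Three toy actions; the middle one is ineligible.
probs = masked_probs([1.0, 2.0, 3.0], [True, False, True])

# Budget exhausted: a single eligible action takes all the probability mass.
only_end = masked_probs([0.3, -1.2, 0.7], [False, False, True])
```

Because a masked action has zero probability, it is never sampled and contributes no gradient, which is the sense in which it is masked rather than penalized.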

Every play has a precondition, and it's checked twice: once when building the action mask before the policy picks, and again at dispatch before the play actually runs. Belt and suspenders.
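
The double check can be sketched as one shared table of preconditions consulted in two places. Everything named here (`State`, `PRECONDITIONS`, `build_mask`, `dispatch`, and the three example rules) is hypothetical and only illustrates the pattern.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class State:
    """Hypothetical snapshot of the project at decision time."""
    open_prs: int
    budget_left: float


# Hypothetical preconditions, one per play.
PRECONDITIONS: Dict[str, Callable[[State], bool]] = {
    "code_review": lambda s: s.budget_left > 0 and s.open_prs > 0,
    "issue_pickup": lambda s: s.budget_left > 0,
    "end_session": lambda s: True,
}


def build_mask(state: State) -> Dict[str, bool]:
    """First check: decide which plays the policy is allowed to pick."""
    return {name: ok(state) for name, ok in PRECONDITIONS.items()}


def dispatch(play: str, state: State) -> str:
    """Second check: re-evaluate the precondition right before the play runs."""
    if not PRECONDITIONS[play](state):
        raise RuntimeError(f"precondition failed at dispatch: {play}")
    return f"ran {play}"
```

The second check matters because state can change between the moment the mask is built and the moment the play runs, for example if the only open PR is closed in between.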

Irreversible mistakes (merging the wrong thing, letting an author review their own PR) are too costly to learn by trial and negative reward. Safety lives in the hard-gate catalog instead.

The reward, in one paragraph

Reward is how the policy learns what's worth doing. It's a weighted sum, kept inside reasonable bounds so no single play swamps the rest.

Pushing forward: shipping issues (issue throughput) and moving the project toward done (alignment delta).

Pulling back: spinning in circles (loop), inflating the backlog faster than work closes (inflation), and stretches with no measurable progress (stagnation).

The bias is toward merging and project progress.
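
As a minimal sketch of a bounded weighted sum over the five signals named above: the weights and the clipping bounds below are invented for illustration and are not the project's defaults.

```python
# Hypothetical weights and bounds, for illustration only.
WEIGHTS = {
    "issue_throughput": 1.0,  # pushing forward
    "alignment_delta": 1.0,   # pushing forward
    "loop": -0.5,             # pulling back
    "inflation": -0.5,        # pulling back
    "stagnation": -1.0,       # pulling back
}
REWARD_MIN, REWARD_MAX = -2.0, 2.0


def reward(signals: dict) -> float:
    """Weighted sum of the signals, clipped to [REWARD_MIN, REWARD_MAX]."""
    raw = sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
    return max(REWARD_MIN, min(REWARD_MAX, raw))


shipped = reward({"issue_throughput": 1.0})          # 1.0
burst = reward({"issue_throughput": 5.0})            # clipped to 2.0
stuck = reward({"loop": 1.0, "stagnation": 1.0})     # -1.5
```

The clip is what keeps a single unusually productive or unproductive play from dominating a PPO batch.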

[Figure: Rough reward magnitudes by play. Horizontal diverging bar chart with an axis running from penalty through 0 to reward. Merge PR is the largest positive reward, Issue Pickup is a medium positive, and Code Review and Run QA are small positives. Loop, Inflation, and Stagnation are negative, with Stagnation the largest penalty. Defaults are illustrative; weights are configurable per project.]
Rough magnitudes only, to show the shape of the bias. Real values are clipped and adjusted per project.

Why this beats vibes and Ralph loops

Decisions are measurable: Every play carries a reward, a value estimate, and a replay trail. Ask "why did we run review here?" and get an answer with numbers.
Stuck states are penalized: Loop, inflation, and stagnation get negative reward. This one knows when to stop.
Budgets are first-class: Every session runs against a budget cap. The policy learns to spend it well, and the mask catalog ensures it can't blow past it accidentally.
Safety is in the mask: Anti-confirmation, merge readiness, budget exhaustion: architectural invariants in the executor, not lines in a system prompt.
Cross-framework review: Code review and QA run on a different LLM family than the author. The policy can't route around it. The mistake of grading yourself is structurally removed.
Policy is portable: The LLM Agent behind a play can change without touching the policy. Swap Claude for Codex, or for an open-weight model. The policy is the same.
Long-running, not chat-bound: Sessions run for hours, not in conversational turns. No context window to overflow. State lives in BEADS, GitHub, and the policy's replay trail.
Mutations are auditable: Every external write goes through one adapter with idempotency keys and audit records. You can prove what the system did, and what it didn't.
Updates are incremental: PPO updates every 16 completed plays. Bad updates roll back on invalid logits. The manager improves the way managers do: a little at a time.
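
The update cadence and rollback can be sketched as a small wrapper around the training step. This sketch reads "invalid logits" as NaN or infinite values, which is an assumption. `Manager`, `update_fn`, and `logits_fn` are hypothetical stand-ins for the real PPO code.

```python
import copy
import math

UPDATE_EVERY = 16  # cadence stated above


class Manager:
    """Hypothetical wrapper around a policy's parameters and its update step."""

    def __init__(self, params, update_fn, logits_fn):
        self.params = params
        self.update_fn = update_fn  # (params, batch) -> candidate params
        self.logits_fn = logits_fn  # (params) -> logits on a probe input
        self.buffer = []

    def record(self, transition) -> bool:
        """Store one completed play; return True if an update was accepted."""
        self.buffer.append(transition)
        if len(self.buffer) < UPDATE_EVERY:
            return False
        batch, self.buffer = self.buffer, []
        # Update a copy so the current parameters survive a bad step.
        candidate = self.update_fn(copy.deepcopy(self.params), batch)
        if all(math.isfinite(x) for x in self.logits_fn(candidate)):
            self.params = candidate
            return True
        return False  # roll back: keep the previous parameters


# Toy usage: "parameters" are one number, and the update adds 1 to it.
manager = Manager({"w": 0.0},
                  update_fn=lambda p, batch: {"w": p["w"] + 1.0},
                  logits_fn=lambda p: [p["w"]])
accepted = [manager.record(i) for i in range(UPDATE_EVERY)]
```

In the toy run the first fifteen calls only fill the buffer, and the sixteenth triggers one accepted update.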