1. Introduction
When a research idea is hard, the work that follows is rarely a single step. You read a pile of papers, you argue about the right formulation, you write code, the code is wrong, the baseline you compare against turns out to be broken, you fix it, and only after many loops do you get a number you trust. That is the normal shape of science, and it is exactly the shape most AI agents are bad at. They are tuned for short tasks, and they tend to either stop too early or hand you a clean looking table that never came from a real run.
We built Nadhi, a long-horizon research agent, and pointed it at the kidney exchange problem. Clearing an exchange with bounded length cycles was shown NP-hard in 2007, nineteen years ago, and the hardness has been an active frontier since. The version we care about is harder still: it asks for a matching that is at once optimal (O), fair across sensitization classes (F), robust to edge failure (R), and scalable (S). We call it the (O,F,R,S) question. This post is an honest account of how the agent attacked it, including the iterations that failed.
2. The long-horizon pipeline, written as math
Let Q be the seed question. Let A_t be the artifacts accumulated by iteration t (the environment, the trained policy, the evaluation logs) and let J_t be the journal. One iteration is a map that takes the question, the current hypothesis h_t, and the artifacts so far, and returns the next hypothesis, a structured result r_t, and updated artifacts:
The result r_t carries a verdict v_t that is one of produced, incomplete, or failed. After each iteration a Director maps the result to a control decision:
The loop we actually run is this:
2.1 The roundtable
Instead of a tournament that picks one winner and throws the rest away, a panel of N expert sub-agents each propose an approach, then argue over R rounds, objecting to and adopting parts of each other, producing refined proposals. A moderator synthesizes one consensus that carries the whole panel forward, and records the strongest dissent as a fallback:
2.2 The preflight gate
Each code producing stage runs its script c only if a static predicate passes and the file compiles. This is what stops the model from passing off a toy Q-table or a hardcoded score as a result:
The environment predicate requires importing gymnasium, calling the environment checker, printing a success marker, never calling simple-cycles without a length bound (it is exponential and hangs the check), and never making an unregistered environment id. The training predicate requires importing a real library, calling learn, and saving a model. The evaluation predicate requires loading the saved policy and computing every number in that same run.
2.3 The persistence guard
The Director sometimes emits a malformed decision, and the safe default was to stop, which is terrible across a long run. Let Answered(Q, J_t) be true only when the seed has a real measured result and the last iteration did not just reprove a textbook lemma. The effective decision is:
On the override we re-anchor the next hypothesis to Q itself, not to whatever easier proxy the last iteration drifted toward. In this run the guard fired at iterations four, six, and fifteen, and without it the campaign would have ended three times with nothing to show.
3. The kidney-exchange MDP
An instance is a directed compatibility graph G on n pairs. Each vertex carries a sensitization class in 0 to m-1 with m = 4, and we keep one class small and highly sensitized so fairness actually bites. Each edge realizes independently with probability p, default p = 0.7, and we allow cycles of length two and three only, so L = 3. The small training pools used n = 20 with at most K = 10 candidate cycles.
The action space is Discrete(K+1). Actions 0 to K-1 select from the currently valid cycles sorted by expected value, and action K is STOP. The single most useful trick is to map any invalid or out of bounds action to STOP. The environment can then never crash on a bad action, and the greedy baseline becomes trivially and flawlessly implemented as always picking action zero. The observation is a flat fixed length vector: the per-class counts of still unmatched vertices, a per-cycle histogram of how many vertices of each class sit in each candidate cycle, a K by K conflict matrix marking which candidate cycles share a vertex, and the availability flags.
3.1 Reward
When the policy commits a cycle c it receives its expected number of realized transplants:
This one term carries optimality and robustness at once, because |c| times p to the power |c| is exactly the expected transplants in cycle c that survive when each edge realizes with probability p. Committing a cycle removes its vertices and disables every overlapping candidate. At the end of the episode we add a fairness bonus, the worst-off matched fraction across the m classes, scaled by beta:
We found a large beta, around ten, is what finally pushed the agent to give up a myopic high value cycle in order to lift the worst-off class, which is the whole point of the fairness term.
3.2 Objective, how one scalar carries the four goals
A policy committing a sequence of disjoint cycles M = (c_1, ..., c_k) collects the episode return:
and the training objective is the expected return over the instance distribution D:
Scalability (S) is not in the reward, it is a property of the representation. The observation and the action set are fixed size in K and do not grow with n, so the same tiny policy runs on a larger pool. That is how one scalar return ends up carrying three of the four properties and the fourth is built into the encoding.
3.3 The learner
We optimize J with PPO, which maximizes the clipped surrogate:
rho_t is the probability ratio and Â_t is the advantage estimate. We use a small multilayer perceptron policy on the CPU, implemented with Stable-Baselines3 on a Gymnasium environment.
3.4 What counts as success
The whole point of the guard rails is to make the claim falsifiable, so we state it precisely. A learned policy pi empirically achieves (O,F,R) on a pool distribution at small scale if, over held-out seeds, its mean return strictly exceeds a correctly implemented greedy heuristic and the gap survives the seed spread:
This is deliberately weak. It is about achievability on small bounded pools, not a statement about the tractability of the problem in general. It is exactly the test our headline result passes.
4. Experiments and results
We ran two campaigns. The first was a single experiment to get the pipeline working. The second was the full long-horizon run of twenty-one iterations. We report both, because the contrast is itself a result about how easy it is to fool yourself with a bad baseline.
4.1 Pipeline validation run
PPO with an MLP policy, 200,000 timesteps on n = 20 pools, evaluated over five seeds (42, 100, 2023, 555, 999). The learning curve below climbs from about 1.2 mean return at ten thousand steps to a plateau around 3.7.
| Policy | Mean return |
|---|---|
| PPO agent | 3.69 ± 0.62 |
| Random | -1.39 ± 0.38 |
| Greedy (as evaluated then) | -12.98 ± 0.00 |

If you only read that table you would think the agent crushed the baselines. It did not. The greedy score of -12.98 is a red flag; a sensible heuristic should never score that badly. The greedy baseline in this run was selecting overlapping invalid cycles and accumulating penalties, it was not respecting the action mask. The agent looked great only because the comparison was unfair. The Director caught it, marked the run incomplete, and steered the next campaign to fix the baseline first.
4.2 The long-horizon run
Twenty-one iterations under a six hour budget. It was not a straight line. Early iterations fought with action masking, standard discrete PPO kept wasting probability mass on invalid actions, and the greedy baseline kept crashing or scoring negative. We tried a continuous priority score action space, we tried MaskablePPO, we tried denser fairness shaping. Several iterations failed the environment checker because cycle enumeration was non deterministic, and the fix was to strictly sort every enumerated cycle and drive all randomness through the environment generator. A few iterations produced a win that the Director correctly flagged as an artifact, the agent was only beating greedy because greedy was still allowed invalid moves.
The breakthrough was iteration eleven. We mapped every out of bounds action to STOP, which made the environment safe, and defined the greedy baseline as deterministically picking action zero on the strictly sorted valid cycles, which made it flawless. With a flawless greedy baseline there is nowhere to hide. We trained PPO for 100,000 timesteps and evaluated over three seeds (42, 100, 2023).
| Policy | Mean return (3 seeds) |
|---|---|
| PPO agent | 7.59 ± 0.46 |
| Greedy (flawless, action 0) | 6.59 |
| Random | 1.18 |
This is the result we stand behind. The agent lower bound, mean minus one standard deviation, is 7.13, which still strictly dominates the greedy mean of 6.59, so the win holds across the seed spread and not just on average. The agent learned a lookahead policy. It sometimes passes on the single highest value cycle now, because committing it would block two cycles that together cover the worst-off class and carry more expected realized transplants over the whole episode. A purely expected-value greedy heuristic cannot see that, it is myopic by construction, and that is the gap the learned policy exploits.
The full verdict trajectory across the twenty-one iterations was incomplete five times, then failed twice, incomplete twice, produced three times, then a noisy tail of failed and incomplete with a final produced. The Director stopped on convergence, not on budget. Iteration eleven is the one we report.
4.3 What we did not finish
This is empirical achievability on small pools, it is not a proof. Later iterations tried to cleanly separate the combined return into its individual O, F, and R parts, to scale to pools of size ten thousand, and to correlate performance with a structural parameter kappa(G) such as the treewidth of the three-cycle conflict graph. Those iterations hit determinism issues and observation-space incompatibilities and ran into the budget, so we do not report numbers for them, because reporting numbers we did not actually measure is the one thing the whole system exists to prevent. The tractability dichotomy, the exact structural condition under which all four properties are achievable together and a matching hardness result for when they are not, remains open and needs a separate proof-first run.
5. Discussion
The technical artifact is modest and honest, a policy that beats greedy by about one unit of expected realized transplants on tiny pools. The process is the more interesting part. A long-horizon agent left to its own devices will drift, find an easier proxy, declare victory on a broken baseline, and stop on the first parse error. Every one of those failure modes showed up in this run. What saved it was not a smarter base model, it was the guard rails, the staged pipeline that refuses toy code, the preflight that rejects fabricated metrics, the persistence guard, and a Director that recognized a too-good-to-be-true baseline and demanded it be fixed.
There is a lesson here that generalizes well beyond kidney exchange. A large part of doing empirical science correctly is making the comparison fair, and an agent rewarded for beating a baseline has a quiet incentive to beat a weak one. Building the environment so the baseline is flawless by construction removed that incentive, and in hindsight it was the single most important design decision in the whole campaign.
6. Reproduce it
The environment, the trained policy, and a single script that does both training and inference are public. The inference path uses the corrected, flawless greedy baseline.
References
- A. E. Roth, T. Sonmez, M. U. Unver. Kidney exchange. The Quarterly Journal of Economics, 119(2), 2004.
- D. J. Abraham, A. Blum, T. Sandholm. Clearing algorithms for barter exchange markets: enabling nationwide kidney exchanges. ACM EC, 2007.
- A. E. Roth. What have we learned from market design? The Economic Journal, 118(527), 2008.
- J. P. Dickerson, A. D. Procaccia, T. Sandholm. Failure-aware kidney exchange. ACM EC, 2013.
- H. Bidkhori, J. P. Dickerson, D. C. McElfresh, K. Ren. Kidney exchange with inhomogeneous edge existence uncertainty. UAI, 2020.
- D. C. McElfresh, J. P. Dickerson. Balancing lexicographic fairness and a utilitarian objective with application to kidney exchange. AAAI, 2018.
- G. Farnadi, W. St-Arnaud, B. Babaki, M. Carvalho. Individual fairness in kidney exchange programs. AAAI, 2021.
- C. Chang, A. Khare, D. Shmoys. Optimizing for fairness in generalized kidney exchange: theory and computations. arXiv, 2026.
- I. Bello, H. Pham, Q. V. Le, M. Norouzi, S. Bengio. Neural combinatorial optimization with reinforcement learning. arXiv:1611.09940, 2016.
- H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, L. Song. Learning combinatorial optimization algorithms over graphs. NeurIPS, 2017.
- T. D. Barrett, W. R. Clements, J. N. Foerster, A. I. Lvovsky. Exploratory combinatorial optimization with reinforcement learning. AAAI, 2020.
- C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, D. Ha. The AI Scientist: towards fully automated open-ended scientific discovery. arXiv:2408.06292, 2024.
- B. Romera-Paredes et al. Mathematical discoveries from program search with large language models. Nature, 625, 2024.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
- A. Raffin et al. Stable-Baselines3: reliable reinforcement learning implementations. JMLR, 22(268), 2021.
- M. Towers et al. Gymnasium: a standard interface for reinforcement learning environments. 2023.