Research log · long-horizon agent

Can an AI agent run a six-hour research campaign on its own, and come back with an honest result?

We pointed Nadhi, our long-horizon research agent, at a kidney exchange problem that has sat on the computational hardness frontier for nineteen years. After twenty-one iterations, several of which failed, it produced a learned policy that genuinely beats a flawless greedy baseline. This is the honest story of what happened, the numbers, and the open model you can download and rerun today.

By Nandakishor M, Convai Innovations

Learned policy score: 7.59 (mean over 3 seeds)

Flawless greedy baseline: 6.59 (best classical heuristic)

Autonomous iterations: 21 (failures included)

Unattended runtime: ~6 h (one research campaign)

The short version

Most AI agents are good at short tasks and then they stop. Real science is long, it loops, and it fails a lot. We wanted to see if an agent could carry a multi-hour campaign to a result that is real and not made up.
We gave it a hard, open problem in kidney exchange: find a matching that is at once optimal, fair, robust to failure, and scalable.
The agent built a real simulator, trained a policy, and after twenty-one iterations beat a correctly built greedy baseline: a mean score of 7.59 against 6.59 for greedy and 1.18 for random.
This is empirical evidence on small cases, not a mathematical proof. We say so plainly, and we released the model.

The problem, in plain words

Kidney exchange is a real and beautiful idea. Many patients have a friend or family member who is willing to donate a kidney but is not a medical match for them. Exchange programs pool these pairs and look for cycles (you give to me, I give to her, she gives to you) so that everyone in the cycle gets a compatible kidney. Cycles are kept short, usually two or three pairs, because every surgery in a cycle has to happen at the same time.
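To make the cycle idea concrete, here is a minimal sketch that lists every two-pair and three-pair cycle in a tiny pool. The `compatible` dictionary, the pair labels, and the function name `short_cycles` are illustrative assumptions, not part of the released code, and the brute-force enumeration only suits very small pools.

```python
from itertools import permutations

def short_cycles(compatible, max_len=3):
    """List directed cycles of length 2..max_len in a compatibility graph.

    compatible[a] is the set of pairs whose patient can receive a kidney
    from pair a's donor (an illustrative encoding, assumed here).
    """
    nodes = sorted(compatible)
    cycles = set()
    for length in range(2, max_len + 1):
        for combo in permutations(nodes, length):
            if combo[0] != min(combo):
                continue  # keep one canonical rotation of each cycle
            # every donor must be compatible with the next patient, wrapping around
            if all(combo[(i + 1) % length] in compatible[combo[i]]
                   for i in range(length)):
                cycles.add(combo)
    return sorted(cycles, key=lambda c: (len(c), c))

# Toy pool: A and B can swap; A -> B -> C -> A is a three-way exchange.
compatible = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"A"},
}
print(short_cycles(compatible))
```

The two cycles found here share pairs A and B, so only one of them can be committed. Choosing among overlapping cycles is what makes the clearing problem hard.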

Finding the best set of cycles is hard. Clearing the exchange with bounded-length cycles was shown to be NP-hard back in 2007, nineteen years ago, and the hardness has been an active research frontier ever since. Worse, real programs do not just want the most transplants. They also want the matching to be fair to highly sensitized patients who are very hard to match, robust to the fact that a planned match can fall through at the final cross-check, and able to run on a national pool. We packed all four of these into one question and kept it as the north star: can a matching be optimal, fair, robust, and scalable all at once?

What Nadhi is, and how it works

Nadhi is a long-horizon research agent that lives in a desktop application. The experiments and the papers it reads stay on the machine, and the reasoning is driven by a frontier model, Gemini 3.1 Pro with custom tool calling. What makes it more than a chatbot is the loop and the guardrails around it.

It reads first

It fans out and downloads a corpus of related papers, twenty-eight in this run, and indexes them so every step is grounded in the real literature.

A roundtable, not a contest

Several expert sub-agents each propose an approach, then argue with each other and converge on a single consensus. No single winner is picked; the whole panel feeds forward.

A staged pipeline

Each experiment runs as gated stages: design, build and check the environment, train with a real library, then evaluate. A preflight rejects toy code and fabricated numbers before anything runs.

It refuses to quit early

A Director decides after each step whether to refine, branch, or stop. A persistence guard overrides a premature stop and steers back to the real question.

The single most important rule is simple: every number has to come from a real run. No hardcoded scores, no mocked timings, no hand-typed tables. That sounds obvious, but it is exactly what most agents get wrong when they are left alone for hours.

How the agent turned the question into a game

The agent built a small simulator where a policy looks at the current pool and decides which cycle to commit next, or to stop. The reward it collects for committing a cycle is its expected number of realized transplants, which is the cycle length times the chance every edge in it survives. That one term carries both the optimal and the robust goals at once. At the end of an episode it adds a fairness bonus, the matched fraction of the worst-off patient class, so the policy is pushed to lift the group that a pure efficiency objective would leave behind.
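The reward described above can be sketched in a few lines. This is an illustration under stated assumptions, not the released environment: it treats edge failures as independent (so the survival chance of a cycle is the product of its edge probabilities), and the `fairness_weight` parameter, the class names, and all function names are hypothetical.

```python
from math import prod

def cycle_reward(edge_success_probs):
    """Expected realized transplants for one cycle.

    Cycle length times the chance that every edge survives, assuming
    independent edge failures (an assumption of this sketch).
    """
    return len(edge_success_probs) * prod(edge_success_probs)

def fairness_bonus(matched_by_class, total_by_class):
    """Matched fraction of the worst-off patient class."""
    fractions = [
        matched_by_class.get(cls, 0) / total
        for cls, total in total_by_class.items()
        if total > 0
    ]
    return min(fractions, default=0.0)

def episode_return(committed_cycles, matched_by_class, total_by_class,
                   fairness_weight=1.0):
    """Sum of per-cycle rewards plus the end-of-episode fairness bonus.

    fairness_weight is hypothetical; the source does not state how the
    bonus is scaled against the efficiency term.
    """
    efficiency = sum(cycle_reward(probs) for probs in committed_cycles)
    return efficiency + fairness_weight * fairness_bonus(
        matched_by_class, total_by_class)

example = episode_return(
    committed_cycles=[[0.9, 0.9], [0.8, 0.8, 0.8]],
    matched_by_class={"standard": 4, "highly_sensitized": 1},
    total_by_class={"standard": 6, "highly_sensitized": 4},
)
print(round(example, 3))
```

Because the per-cycle term multiplies length by survival probability, a long cycle with fragile edges can be worth less than a short reliable one. That is how a single number rewards both efficiency and robustness.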

One small design choice mattered more than anything else. The agent sorted the candidate cycles so that the very first action is always the greedy choice, and it mapped any invalid action to a safe stop. This meant the greedy baseline could be built flawlessly, simply by always picking action zero. With a flawless baseline there is nowhere to hide: any win has to be a real win.
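A minimal sketch of that action layout follows. The function names, the `score` function, and the choice to give an invalid action zero reward are assumptions made for illustration; the released environment may differ in all of them.

```python
def candidates(cycles, used, score):
    """Cycles that touch no already-matched pair, best-scoring first."""
    feasible = [c for c in cycles if not used & set(c)]
    return sorted(feasible, key=score, reverse=True)

def step(action, cycles, used, score):
    """Apply one action index and return (reward, used, done)."""
    ranked = candidates(cycles, used, score)
    if not 0 <= action < len(ranked):
        # Invalid index: stop safely instead of charging a penalty
        # (zero reward here is an assumption of this sketch).
        return 0.0, used, True
    chosen = ranked[action]
    return score(chosen), used | set(chosen), False

def greedy_episode(cycles, score):
    """Greedy baseline by construction: always take action 0."""
    used, total, done = frozenset(), 0.0, False
    while not done:
        reward, used, done = step(0, cycles, used, score)
        total += reward
    return total

def score(cycle, p=0.9):
    """Illustrative score: expected transplants with edge success rate p."""
    return len(cycle) * p ** len(cycle)

cycles = [("P1", "P2", "P3"), ("P1", "P4"), ("P2", "P5"), ("P6", "P7")]
print(greedy_episode(cycles, score))
```

Since the ranked list is rebuilt from the feasible cycles at every step, action 0 is valid whenever any cycle remains. The baseline therefore cannot take an invalid move, and it ends the episode only when nothing is left to commit.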

The result

After twenty-one iterations and about six hours, the learned policy beat the flawless greedy heuristic on the combined objective, measured over three seeds. The lower edge of the agent score, its mean minus one standard deviation (7.59 - 0.46 = 7.13), still sits above the greedy mean of 6.59, so the win holds across the spread and not just on average.

Policy	Mean score (3 seeds)
Learned PPO policy	7.59 ± 0.46
Greedy heuristic (flawless)	6.59
Random	1.18

The agent learned to look ahead. It sometimes passes on the single best cycle right now, because committing it would block two cycles that together cover the worst-off class and carry more expected transplants over the whole episode. A purely greedy heuristic cannot see that because it is short-sighted by design, and that is the gap the learned policy exploits.
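A toy instance shows how greedy can lose to looking ahead. Everything in this sketch is illustrative: the pool, the uniform edge success rate of 0.9, and the function names are assumptions. The exhaustive search stands in for "looking ahead" and is not the learned PPO policy, and the fairness bonus is left out to keep the arithmetic visible.

```python
from itertools import combinations

def expected_transplants(cycle, p=0.9):
    """Cycle length times the chance all edges survive (uniform p assumed)."""
    return len(cycle) * p ** len(cycle)

def greedy_total(cycles, score):
    """Repeatedly commit the best remaining cycle that is still feasible."""
    used, total = set(), 0.0
    for cycle in sorted(cycles, key=score, reverse=True):
        if not used & set(cycle):
            used |= set(cycle)
            total += score(cycle)
    return total

def best_disjoint_set(cycles, score):
    """Exhaustive search over pair-disjoint sets of cycles (tiny pools only)."""
    best, best_total = (), 0.0
    for k in range(1, len(cycles) + 1):
        for subset in combinations(cycles, k):
            pairs = [pair for cycle in subset for pair in cycle]
            if len(pairs) != len(set(pairs)):
                continue  # two cycles share a pair, so they cannot both run
            total = sum(score(cycle) for cycle in subset)
            if total > best_total:
                best, best_total = subset, total
    return best, best_total

# The three-way cycle overlaps both two-way swaps.
cycles = [("P1", "P2", "P3"), ("P1", "P4"), ("P2", "P5")]
print(greedy_total(cycles, expected_transplants))
print(best_disjoint_set(cycles, expected_transplants))
```

Here the three-way cycle is the best single move, at about 2.19 expected transplants, but committing it blocks both swaps, which together are worth about 3.24. Greedy takes the three-way cycle. A policy that values the rest of the episode skips it.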

Figure: learning curve of the PPO policy, showing mean return against training timesteps, rising and then plateauing. Every point is logged from the real run; nothing here is drawn by hand.

The most useful thing we learned: a broken baseline

An early run looked like a triumph: the agent scored 3.69 while the greedy baseline scored a terrible -12.98. If we had stopped there we would have published a lie. The greedy baseline was broken; it was taking invalid moves and piling up penalties, so the agent only looked good because the comparison was unfair. The Director caught it, marked the run incomplete, and steered the next campaign to fix the baseline first.

There is a lesson here that goes well beyond kidney exchange. An agent rewarded for beating a baseline has a quiet incentive to beat a weak one. Building the simulator so the baseline is flawless by construction removed that incentive, and in hindsight it was the single most important decision in the whole project.

What this is, and what it is not

This is empirical achievability on small pools. It is a learned policy that beats a real baseline, measured honestly. It is not a proof. We did not prove the tractability dichotomy: exactly which structural condition makes all four properties achievable together, with a matching hardness result for instances where that condition fails. That remains open and needs a separate, proof-first run. We would rather report a small honest result than a big fabricated one.

The technical artifact is modest: a policy that beats greedy by about one unit on tiny pools. The part we find more interesting is the process. A long-horizon agent left alone will drift, find an easier proxy, declare victory on a broken baseline, and stop on the first error. Every one of those failures showed up in this run. What saved it was not a bigger model but the discipline wrapped around it.

Run it yourself

The environment, the trained policy, and a single script that does both training and inference are public. The inference path uses the corrected, flawless greedy baseline, so you can reproduce a fair comparison directly.

# grab it from Hugging Face, then

pip install -r requirements.txt

python run.py infer --model policy.zip --seeds 42 100 2023

python run.py train --timesteps 200000

convaiinnovations/kidney-exchange-ppo

Want an agent that does this on your problem?

Nadhi reads the papers, runs the experiments, and reports only what it actually measured.