Self-Evaluation

The Recursiv Protocol: agents grade their own output and refine it before delivering

Overview

The name Recursiv is the product. Agents are not meant to one-shot an answer and hand it over. They are meant to produce an output, grade it honestly against the task, and if it falls short, fix the gaps and grade again. That loop is the Recursiv Protocol, and it is how an agent catches its own mistakes before a human ever sees them.

This page covers the self-evaluation loop, the protocol and simulator primitives, and the check-suite-and-gating product that is on the roadmap but not yet available.

The self-evaluation loop

After an agent produces code, a plan, an analysis, or any deliverable, it calls self_evaluate (available in MCP) before delivering. The agent scores its own output 1 to 10 against the task criteria and declares the recursion depth.

The loop converges on a simple rule:

  • Score 7 or higher, the output is good enough. Any learnings are written to persistent memory and the agent delivers.
  • Score below 7, the agent is told the gaps it identified and instructed to address them and re-evaluate.
  • Depth 3 reached, the loop stops. The agent reports the remaining gaps honestly instead of silently converging.

// MCP tool: self_evaluate
{
  "task_summary": "Add rate limiting to the upload endpoint",
  "output_summary": "Added a token-bucket limiter middleware with per-key buckets",
  "score": 6,
  "depth": 1,
  "gaps": ["No test for the 429 path", "Limit is hardcoded, not configurable"],
  "learnings": ["Endpoint limiters should read limits from project settings"]
}

A score of 6 at depth 1 returns the gaps and asks the agent to fix them and call again at depth 2. When the agent reaches a 7+ it converges and its learnings are saved. The depth cap (3) is a hard bound so the loop cannot run forever, and the 7-of-10 threshold is the convergence bar. Learnings are written to memory automatically, so the next run starts smarter.
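
The convergence rule above can be sketched as a small decision function. This is a minimal illustration, not the protocol's actual implementation: the names `decide`, `Decision`, `CONVERGENCE_SCORE`, and `MAX_DEPTH` are hypothetical, and the range check on `score` is an assumption based on the 1 to 10 scale described here. It also assumes that a passing score at depth 3 still counts as converged.

```typescript
// Hypothetical sketch of the convergence rule; none of these names come from the product.
type Decision = "deliver" | "refine" | "report_gaps";

const CONVERGENCE_SCORE = 7; // the 7-of-10 convergence bar
const MAX_DEPTH = 3;         // the hard depth cap

// Mirrors the fields of the self_evaluate payload shown above.
interface SelfEvaluation {
  task_summary: string;
  output_summary: string;
  score: number; // 1 to 10
  depth: number; // recursion depth, starting at 1
  gaps: string[];
  learnings: string[];
}

function decide(evaluation: SelfEvaluation): Decision {
  const { score, depth } = evaluation;
  // Assumption: scores outside 1..10 and depths below 1 are rejected.
  if (!Number.isFinite(score) || score < 1 || score > 10) {
    throw new RangeError("score must be between 1 and 10");
  }
  if (!Number.isInteger(depth) || depth < 1) {
    throw new RangeError("depth must be a positive integer");
  }
  // Good enough: learnings are saved and the agent delivers.
  if (score >= CONVERGENCE_SCORE) return "deliver";
  // Below the bar at the depth cap: stop and report the remaining gaps.
  if (depth >= MAX_DEPTH) return "report_gaps";
  // Below the bar with depth to spare: address the gaps and re-evaluate.
  return "refine";
}
```

With the example payload above (score 6, depth 1) this sketch returns "refine". The same score at depth 3 returns "report_gaps".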

This is deliberately a discipline imposed on the agent, not a black box. The point is that the agent commits to an honest score and either improves or reports what is still missing, rather than presenting a first draft as final.

Why this is governance

Self-evaluation is a govern primitive because it produces a record of the agent assessing its own work: what it set out to do, how it scored the result, what gaps it found, and what it learned. Combined with Audit & Observability, that means you can show not just what an agent did but how rigorously it checked itself before delivering.

Protocols and the simulator

Two related primitives support verification work.

r.protocols manages cross-protocol adapters and search. It lets a network pull and search content across external protocols (the source data an agent verifies or reasons over) and configure which protocols and search terms are enabled.

const { data: adapters } = await r.protocols.list();
const { data: status } = await r.protocols.status();

r.simulator drives simulated activity against an environment, which is how you exercise an agent or a network under controlled, repeatable load before trusting it with real traffic.

await r.simulator.start({ user_count: 50, interval_ms: 1000 });
const { data: status } = await r.simulator.status();
const { data: failures } = await r.simulator.failures(); // what broke under load
await r.simulator.stop();

The simulator’s failures view is the verification surface here: run a workload, then read what failed, rather than assuming the agent behaves under load.

Composing a verification flow today

There is no single one-call verification product yet. You build the equivalent from the primitives that exist: have the agent run the self_evaluate loop to grade and refine its output, exercise the agent against an r.simulator workload and read the failures, and record measured results as dispatcher outcomes (r.dispatcher.recordOutcome) so the before-and-after is captured. Together these give you a graded, exercised, and recorded deliverable.
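
One way to wire these steps together is sketched below. It is an illustration under stated assumptions, not a documented recipe. The interfaces describe only the calls this page shows. The payload passed to `r.dispatcher.recordOutcome` is an assumption, as is the idea that `failures()` returns its list under `data` as an array. `buildOutcome` and `exerciseAndRecord` are hypothetical helper names.

```typescript
// Minimal shapes for the calls used on this page; the real client may differ.
interface SimulatorLike {
  start(opts: { user_count: number; interval_ms: number }): Promise<unknown>;
  failures(): Promise<{ data: unknown[] }>; // assumption: data is an array
  stop(): Promise<unknown>;
}
interface DispatcherLike {
  // Assumption: the real recordOutcome payload is not documented on this page.
  recordOutcome(outcome: Record<string, unknown>): Promise<unknown>;
}
interface ClientLike {
  simulator: SimulatorLike;
  dispatcher: DispatcherLike;
}

// The parts of the final self_evaluate call that we want on record.
interface EvaluationSummary {
  score: number;
  depth: number;
  gaps: string[];
}

// Pure helper: combine the self-evaluation with what the simulator reported.
function buildOutcome(evaluation: EvaluationSummary, failureCount: number) {
  return {
    score: evaluation.score,
    depth: evaluation.depth,
    remaining_gaps: evaluation.gaps,
    simulator_failures: failureCount,
  };
}

// Run a workload for runMs milliseconds, read failures, then record the result.
async function exerciseAndRecord(
  r: ClientLike,
  evaluation: EvaluationSummary,
  runMs: number,
) {
  let failures: unknown[] = [];
  await r.simulator.start({ user_count: 50, interval_ms: 1000 });
  try {
    await new Promise<void>((resolve) => setTimeout(resolve, runMs));
    failures = (await r.simulator.failures()).data;
  } finally {
    // Always stop the simulator, even if reading failures throws.
    await r.simulator.stop();
  }
  const outcome = buildOutcome(evaluation, failures.length);
  await r.dispatcher.recordOutcome(outcome);
  return outcome;
}
```

Keeping `buildOutcome` separate from the network calls means the recorded shape can be checked without a running simulator.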

Roadmap: a dedicated Verification product (define a named suite of checks, run them against an agent’s output in one call, and gate a release or deploy on the result) is on the roadmap and is not yet generally available. Today the building blocks are the self_evaluate self-grading loop, the simulator, and dispatcher outcomes. There is no r.verify resource.

What this proves

Self-evaluation lets you show that an agent graded its own work against the task, addressed the gaps it found, and either cleared the convergence bar or reported honestly what was still missing. That is the difference between an agent that hopes it got the answer right and one that checks.