Reading Your Results

Your first nasde run finishes and prints a table, then writes a pile of files. Here’s how to read both.

When you start a run, NASDE echoes the configuration so you know exactly what’s about to execute — agent, variant, model, attempts, whether Opik and assessment are on:

[Figure: the nasde run startup banner showing agent, variant, model, attempts, and tracking configuration]

As trials complete, progress streams in; at the end NASDE prints a per-configuration summary table — one row per (agent, model, reasoning effort) group. Read it like this:

  • Trials — how many times that configuration ran (set by --attempts / -n). The sample size behind the mean.
  • Score — the normalized quality (0–1) as mean ±std. The ±std is the spread between attempts (the agent writes different code each run). A single attempt shows mean (n=1) rather than a fake ±0.00.
  • Tokens / Cost — total tokens and USD cost (see Token & Cost), with an inter-trial ±std once a group has 2+ trials.
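To make the Score column concrete, here is a minimal sketch of how a cell like that could be computed from per-attempt scores. It is an illustration only, not NASDE's own code: the function name `format_score` is hypothetical, and it assumes the sample standard deviation (NASDE may use a different estimator).

```python
from statistics import mean, stdev


def format_score(scores):
    """Format per-attempt scores as 'mean ±std', or 'mean (n=1)' for one attempt.

    Hypothetical helper: assumes the spread is the sample standard deviation.
    """
    if not scores:
        raise ValueError("need at least one score")
    m = mean(scores)
    if len(scores) == 1:
        # One attempt has no spread, so report the sample size instead of ±0.00.
        return f"{m:.2f} (n=1)"
    return f"{m:.2f} ±{stdev(scores):.2f}"


print(format_score([0.70, 0.80, 0.90]))  # 0.80 ±0.10
print(format_score([0.75]))              # 0.75 (n=1)
```

The single-attempt branch mirrors the table's behaviour: with one trial there is nothing to measure spread against, so the sample size is shown in its place.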

The headline question — is configuration A better than B? — is answered by comparing rows and their spreads: a 0.05 gap means little if each row wobbles by ±0.08.

Every run writes a timestamped job folder under jobs/ (gitignored). Inside, one folder per trial (<task>__<id>/), each containing:

| File | What it holds |
| --- | --- |
| assessment_summary.json | The representative result: per-dimension mean / std / range, plus economics |
| assessment_eval_<N>.json | Each individual reviewer pass (one per repetition) with full reasoning |
| result.json / config.json | The trial's reward, model, variant, and config |
| agent/trajectory.json | The agent's full tool-call trace |
| verifier/ | The rough-test output: reward.txt (0/1) and test-stdout.txt |
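The layout above is easy to walk with a short script. The sketch below collects the rough-test reward for every trial folder in a job. It assumes reward.txt holds a single number (0 or 1) and nothing else; the function name `read_rewards` and the trial folder names in the demo are hypothetical, and the demo builds a throwaway folder rather than reading a real job.

```python
import tempfile
from pathlib import Path


def read_rewards(job_dir):
    """Map each trial folder name to its rough-test reward, or None if absent."""
    rewards = {}
    for trial in sorted(Path(job_dir).iterdir()):
        if not trial.is_dir():
            continue
        reward_file = trial / "verifier" / "reward.txt"
        text = reward_file.read_text().strip() if reward_file.is_file() else ""
        # Assumption: the file contains just "0" or "1" (possibly "1.0").
        rewards[trial.name] = int(float(text)) if text else None
    return rewards


# Demo on a temporary folder that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    job = Path(tmp)
    for name, reward in [("task-a__001", "1"), ("task-a__002", "0")]:
        verifier = job / name / "verifier"
        verifier.mkdir(parents=True)
        (verifier / "reward.txt").write_text(reward + "\n")
    (job / "task-b__003").mkdir()  # trial with no verifier output
    demo = read_rewards(job)

print(demo)
```

To try it on a real run, pass the path of one timestamped job folder under jobs/ instead of the temporary directory.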

assessment_summary.json is the file you'll open most. It carries, per dimension, the mean, std (the spread between repeated reviews of the same code, i.e. judge noise), min, and max, plus the normalized total and the token/cost economics. Two distinct noise sources live in different places, and keeping them apart is the point:

  • Agent noise (the agent writes different code each run) → the ±std in the run summary table, across trials.
  • Judge noise (the reviewer scores the same code differently) → the per-trial std inside assessment_summary.json.

A gap you can trust is one that’s larger than both.
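As a rough sketch of that rule of thumb, the check below compares the gap between two configurations' mean scores against both noise levels. It is a reading aid under stated assumptions, not a statistical test and not part of NASDE: the function name `gap_is_trustworthy` and the example numbers are hypothetical.

```python
def gap_is_trustworthy(mean_a, mean_b, agent_std, judge_std):
    """Return True when the score gap exceeds both the agent and judge noise.

    agent_std: the ±std across trials from the run summary table.
    judge_std: the per-trial std from assessment_summary.json.
    """
    gap = abs(mean_a - mean_b)
    return gap > max(agent_std, judge_std)


# A 0.05 gap is swamped when rows wobble by ±0.08 across trials.
print(gap_is_trustworthy(0.70, 0.75, agent_std=0.08, judge_std=0.03))  # False

# A 0.20 gap clears both noise sources.
print(gap_is_trustworthy(0.60, 0.80, agent_std=0.08, judge_std=0.03))  # True
```

When the two configurations have different spreads, passing the larger of the two as `agent_std` keeps the check conservative.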

  • Opik dashboard — run with --with-opik and the scores flow to an experiment tracker for browsing and cross-run comparison. See Authentication & Opik.
  • Export the essence: jobs/ is heavy and gitignored. nasde results-export copies just the scores, metrics, patch, and trajectory into any plain directory you want to keep.
  • Compare models visually — the quality-vs-cost Pareto frontier and per-dimension radars in Benchmark Results.