Running & Configuring Runs

This guide covers the operational lifecycle of a run, in the order you typically hit it: point it at your code, scale it out, configure the two agents (the one under test and the reviewer), and keep the results.

You can build benchmarks from local (private) repositories by adding [nasde.source] to task.toml with a relative path:

[nasde.source]
git = "../.."
ref = "abc1234"

NASDE auto-generates the Docker environment — no custom Dockerfile needed. See examples/nasde-dev-skill/ for a complete example that tests nasde-toolkit itself. The full [nasde.source] reference is in Configuration.

By default, Harbor runs agents in local Docker containers. To scale horizontally, use a cloud sandbox provider: command execution moves to the cloud, so on your machine trials become I/O-bound rather than compute-bound. You can typically run far more trials in parallel than you have local CPU cores.
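The scaling claim above can be illustrated with a small, self-contained sketch. It does not call Harbor or any provider: `fake_remote_trial` is a hypothetical stand-in that only sleeps, mimicking a host that waits while a remote sandbox does the work. Under that assumption, running more trials at once shortens wall-clock time regardless of how many cores the host has.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_remote_trial(trial_id: int, wait_s: float = 0.05) -> int:
    """Hypothetical stand-in for a trial that executes in a cloud sandbox."""
    time.sleep(wait_s)  # the host is idle while the remote sandbox works
    return trial_id


def run_trials(n_trials: int, n_concurrent: int) -> float:
    """Run the fake trials and return the wall-clock time in seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        finished = list(pool.map(fake_remote_trial, range(n_trials)))
    assert finished == list(range(n_trials))
    return time.perf_counter() - start


sequential_s = run_trials(n_trials=8, n_concurrent=1)
parallel_s = run_trials(n_trials=8, n_concurrent=8)
print(f"1 at a time: {sequential_s:.2f}s, 8 at a time: {parallel_s:.2f}s")
```

With local Docker the same experiment would be limited by CPU and memory, which is why a high concurrency value such as `-n 32` is more realistic with a cloud provider.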

Supported providers (via Harbor):

| Provider | Flag value | API key env var |
|---|---|---|
| Docker (default) | docker | |
| Daytona | daytona | DAYTONA_API_KEY |
| Modal | modal | MODAL_TOKEN_ID + MODAL_TOKEN_SECRET |
| E2B | e2b | E2B_API_KEY |
| Runloop | runloop | RUNLOOP_API_KEY |
| GKE | gke | GCP credentials |

We recommend Daytona for its flexibility and scaling capabilities.

# Run with Daytona cloud sandbox
export DAYTONA_API_KEY=...
nasde run --variant vanilla --harbor-env daytona -C my-benchmark
# Or use the Harbor pass-through for full control
nasde harbor run --dataset my-benchmark@1.0 --agent claude-code --model claude-sonnet-4-6 --env daytona -n 32

The cloud sandbox provider affects only the Harbor trial execution (Stage 1). The assessment evaluation (Stage 2) always runs locally on the host machine. You can set a default provider in nasde.toml:

[defaults]
harbor_env = "daytona"

See the Harbor documentation for detailed provider configuration.

The agent under test is the one whose configuration you’re measuring — and that configuration is the whole point of a benchmark. A variant bundles everything that defines one agent setup. (For the exact file format, see Configuration → variant.toml.)

Instructions (CLAUDE.md / AGENTS.md / GEMINI.md)

The single most important knob: the system instructions you inject into the agent. Each family reads its own file, dropped into the variant directory and injected into the sandbox:

  • Claude Code → CLAUDE.md, injected at /app/CLAUDE.md
  • Codex → AGENTS.md, injected at /app/AGENTS.md
  • Gemini CLI → GEMINI.md, injected at /app/GEMINI.md

This is how you test “baseline vs. with my custom prompt” — two variants, same task, different instruction file.
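As a rough illustration of that two-variant setup, the sketch below scaffolds a baseline variant and a variant with custom instructions in a temporary directory. The `variants/<name>/` layout is inferred from the skills path used elsewhere in this guide; the variant name `guided`, the helper `scaffold_variant`, and the instruction text are made up for the example, and a real variant may also need a variant.toml.

```python
import tempfile
from pathlib import Path
from typing import Optional


def scaffold_variant(benchmark_dir: Path, name: str,
                     instructions: Optional[str]) -> Path:
    """Create variants/<name>/ and, if given, write a CLAUDE.md into it."""
    variant_dir = benchmark_dir / "variants" / name
    variant_dir.mkdir(parents=True, exist_ok=True)
    if instructions is not None:
        (variant_dir / "CLAUDE.md").write_text(instructions, encoding="utf-8")
    return variant_dir


# Hypothetical benchmark root in a throwaway location.
benchmark = Path(tempfile.mkdtemp()) / "my-benchmark"

# Baseline: no instruction file at all.
baseline = scaffold_variant(benchmark, "vanilla", None)

# Same task set, but with a custom instruction file.
guided = scaffold_variant(
    benchmark, "guided",
    "Run the test suite before declaring the task done.\n",
)

print(sorted(p.name for p in (benchmark / "variants").iterdir()))
```

Running the same task against both variants then isolates the effect of the instruction file, since nothing else differs between the two directories.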

There are two ways to give the agent Claude Code skills: drop them under variants/<v>/skills/<name>/ (copied in whole, including references/), or reference a skill by its source path with a [[skill]] entry in variant.toml (nothing is copied; the skill is staged from the source, optionally at a git ref). Codex and Gemini skills live under agents_skills/ and gemini_skills/. See Plugins & Skills for the full workflow.

Wire MCP servers the agent can call during the task. The cleanest path is a [nasde.plugin] declaration in task.toml, which ships a plugin’s skills and its MCP server into the sandbox in one line — see Benchmarking a plugin.

Set how hard the agent thinks with reasoning_effort in variant.toml, or override per run with nasde run --effort. Family defaults are not comparable, so set it deliberately when comparing agents — see Configuration → Reasoning effort.

Restrict a variant to specific tasks with tasks = [...] in variant.toml — useful when a skill is tuned to one repo’s conventions and would mislead elsewhere. See Scoping a variant.

The reviewer agent (assessment evaluator) is the mirror image of the agent under test: same axes (model, skills, MCP, system prompt), configured via the [evaluation] section in nasde.toml. By default it uses claude-opus-4-7 with read-only tools (Read, Glob, Grep).

By default, nasde uses the Claude Code CLI for assessment evaluation. You can switch to Codex:

[evaluation]
backend = "codex" # "claude" (default) | "codex"
model = "gpt-5.3-codex"

Both backends use your existing CLI authentication (subscription OAuth or API key) — no additional setup. The evaluator spawns the CLI as a subprocess, so you get the same billing treatment as interactive use. See examples/nasde-dev-skill/nasde.codex.toml for a ready-to-use Codex configuration.

Give the reviewer agent skills (e.g. a code review methodology) by creating a directory with SKILL.md files and pointing at it:

[evaluation]
skills_dir = "./evaluator_skills"

Skills are copied into the evaluator’s workspace and loaded via the CLI’s native auto-discovery (claude --add-dir <workspace>).

Add external analysis tools (linters, complexity analyzers) as MCP servers:

evaluator_mcp.json
{
  "mcpServers": {
    "code-analysis": {
      "type": "stdio",
      "command": "npx",
      "args": ["@some-org/code-analysis-mcp"]
    }
  }
}

nasde.toml
[evaluation]
mcp_config = "./evaluator_mcp.json"
allowed_tools = ["Read", "Glob", "Grep", "mcp__code-analysis__analyze"]

MCP tool names follow the mcp__<server>__<tool> convention. If you override allowed_tools, you must include the MCP tools explicitly.
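Because the MCP entries must be spelled out by hand, a small helper can reduce typos. The following is a minimal sketch of the naming convention only: `mcp_tool_name` and `build_allowed_tools` are hypothetical helpers, not part of nasde, and the server and tool names are taken from the example config above.

```python
from typing import Dict, List

# Default evaluator tools as listed in this guide.
DEFAULT_TOOLS = ["Read", "Glob", "Grep"]


def mcp_tool_name(server: str, tool: str) -> str:
    """Format a tool name using the mcp__<server>__<tool> convention."""
    return f"mcp__{server}__{tool}"


def build_allowed_tools(mcp_tools: Dict[str, List[str]]) -> List[str]:
    """Return the default tools plus one explicit entry per MCP tool."""
    allowed = list(DEFAULT_TOOLS)
    for server, tools in mcp_tools.items():
        allowed.extend(mcp_tool_name(server, tool) for tool in tools)
    return allowed


allowed_tools = build_allowed_tools({"code-analysis": ["analyze"]})
print(allowed_tools)
```

The resulting list can be pasted into the allowed_tools setting of the [evaluation] section.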

| Setting | Default | Purpose |
|---|---|---|
| backend | claude | Subprocess backend: claude or codex |
| model | claude-opus-4-7 | Evaluator model |
| dimensions_file | assessment_dimensions.json | Scoring dimensions file |
| max_turns | 60 | Max evaluator conversation turns (raise for DDD-rich workspaces with many small files) |
| allowed_tools | ["Read", "Glob", "Grep"] | Tool whitelist |
| mcp_config | | Path to MCP server config JSON |
| skills_dir | | Path to evaluator skills directory |
| append_system_prompt | | Extra system prompt text |
| include_trajectory | false | Include ATIF trajectory in evaluation |

When include_trajectory is enabled, the evaluator can read the agent’s full execution trajectory (agent/trajectory.json) — tool calls, timestamps, token usage, errors. This enables assessment dimensions that evaluate the agent’s process (efficiency, verification discipline, decision-making) alongside the final output. See examples/nasde-dev-skill for a working example with trajectory-aware dimensions.

By default a run’s output lives only in the local, gitignored jobs/ directory — and most of its weight is build junk (compiled binaries, .git checkouts) that’s useless for analysis. If you clear jobs/, the results are gone. nasde results-export copies just the essence of each trial into a plain destination directory so your results survive and travel:

nasde results-export jobs/2026-03-13__14-30-00 --to ~/Dropbox/nasde-results -C my-benchmark

The destination is any path you like — an iCloud or Dropbox folder, an external drive, or a git repo you commit yourself. NASDE just writes files there; it never talks to a cloud provider, so there’s nothing to authenticate. Each trial becomes one flat folder <job>__<trial>/ containing:

  • metrics.json — self-contained summary: timing, model, variant, task, reward, reasoning effort, token usage + USD cost (see Token & Cost)
  • assessment_eval_*.json — the reviewer’s per-dimension scores and reasoning (one file per repetition)
  • assessment_summary.json — per-dimension mean/std/range across repetitions (the representative result)
  • trajectory.json — the agent’s full tool-call trace, for post-hoc cost/process analysis
  • changes.patch — exactly what the agent changed (a code diff, not the multi-GB workspace)
  • verifier_stdout.txt, reward.txt — the rough-test output
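To make the "mean/std/range across repetitions" idea concrete, here is a minimal sketch of that aggregation. It is not nasde's implementation: the JSON schema of the evaluation files is not documented here, so the input is a plain list of per-repetition score dictionaries, the dimension names are invented, and the choice of sample standard deviation is an assumption.

```python
import statistics
from typing import Dict, List


def summarize(repetitions: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Aggregate per-dimension scores across evaluator repetitions."""
    summary = {}
    for dimension in repetitions[0]:
        scores = [rep[dimension] for rep in repetitions]
        summary[dimension] = {
            "mean": statistics.mean(scores),
            # Sample standard deviation needs at least two repetitions.
            "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "min": min(scores),
            "max": max(scores),
        }
    return summary


# Hypothetical scores from three repetitions of the reviewer.
repetitions = [
    {"correctness": 8.0, "readability": 6.0},
    {"correctness": 9.0, "readability": 6.0},
    {"correctness": 7.0, "readability": 9.0},
]
summary = summarize(repetitions)
for dimension, stats in summary.items():
    print(dimension, stats)
```

A wide range or high standard deviation on a dimension suggests the reviewer is inconsistent there, which is worth checking before comparing variants on it.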

You can pass several paths at once, mixing whole jobs and individual trials — NASDE figures out which is which. Re-running is safe: it merges (copying any evaluations added since the last export) and never re-touches the immutable trajectory or patch.
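Since every trial folder is flat and carries its own metrics.json, an exported directory is easy to analyze with a few lines of code. The sketch below builds a fake export in a temporary directory and computes the mean reward per variant. The folder names and the `variant` and `reward` keys are assumptions made for illustration; check a real metrics.json for the actual field names.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from statistics import mean


def load_metrics(export_dir: Path) -> list:
    """Load metrics.json from every <job>__<trial>/ folder in an export."""
    rows = []
    for metrics_path in sorted(export_dir.glob("*/metrics.json")):
        row = json.loads(metrics_path.read_text(encoding="utf-8"))
        row["trial_folder"] = metrics_path.parent.name
        rows.append(row)
    return rows


def mean_reward_by_variant(rows: list) -> dict:
    """Group rewards by variant and average them."""
    by_variant = defaultdict(list)
    for row in rows:
        by_variant[row["variant"]].append(row["reward"])
    return {variant: mean(rewards) for variant, rewards in by_variant.items()}


# Build a fake export so the sketch runs without nasde installed.
export_dir = Path(tempfile.mkdtemp())
fake_trials = {
    "job-a__trial-1": {"variant": "vanilla", "reward": 0.0},
    "job-a__trial-2": {"variant": "vanilla", "reward": 1.0},
    "job-b__trial-1": {"variant": "guided", "reward": 1.0},
}
for folder, metrics in fake_trials.items():
    (export_dir / folder).mkdir()
    (export_dir / folder / "metrics.json").write_text(
        json.dumps(metrics), encoding="utf-8"
    )

rows = load_metrics(export_dir)
rewards = mean_reward_by_variant(rows)
print(rewards)
```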