Running & Configuring Runs
This guide covers the operational lifecycle of a run, in the order you typically hit it: point it at your code, scale it out, configure the two agents (the one under test and the reviewer), and keep the results.
Running on a local repo
Section titled “Running on a local repo”You can build benchmarks from local (private) repositories by adding [nasde.source] to task.toml with a relative path:
[nasde.source]git = "../.."ref = "abc1234"NASDE auto-generates the Docker environment — no custom Dockerfile needed. See examples/nasde-dev-skill/ for a complete example that tests nasde-toolkit itself. The full [nasde.source] reference is in Configuration.
Cloud sandbox providers
Section titled “Cloud sandbox providers”By default, Harbor runs agents in local Docker containers. For horizontal scaling, you can use a cloud sandbox provider — this shifts command execution to the cloud, making trials I/O bounded rather than compute bounded. You can typically parallelize far above your local CPU count.
Supported providers (via Harbor):
| Provider | Flag value | API key env var |
|---|---|---|
| Docker (default) | docker | — |
| Daytona | daytona | DAYTONA_API_KEY |
| Modal | modal | MODAL_TOKEN_ID + MODAL_TOKEN_SECRET |
| E2B | e2b | E2B_API_KEY |
| Runloop | runloop | RUNLOOP_API_KEY |
| GKE | gke | GCP credentials |
We recommend Daytona for its flexibility and scaling capabilities.
# Run with Daytona cloud sandboxexport DAYTONA_API_KEY=...nasde run --variant vanilla --harbor-env daytona -C my-benchmark
# Or use the Harbor pass-through for full controlnasde harbor run --dataset my-benchmark@1.0 --agent claude-code --model claude-sonnet-4-6 --env daytona -n 32The cloud sandbox provider affects only the Harbor trial execution (Stage 1). The assessment evaluation (Stage 2) always runs locally on the host machine. You can set a default provider in nasde.toml:
[defaults]harbor_env = "daytona"See the Harbor documentation for detailed provider configuration.
Configuring the agent under test
Section titled “Configuring the agent under test”The agent under test is the one whose configuration you’re measuring — and that configuration is the whole point of a benchmark. A variant bundles everything that defines one agent setup. (For the exact file format, see Configuration → variant.toml.)
Instructions (CLAUDE.md / AGENTS.md / GEMINI.md)
Section titled “Instructions (CLAUDE.md / AGENTS.md / GEMINI.md)”The single most important knob: the system instructions you inject into the agent. Each family reads its own file, dropped into the variant directory and injected into the sandbox:
- Claude Code →
CLAUDE.md→/app/CLAUDE.md - Codex →
AGENTS.md→/app/AGENTS.md - Gemini CLI →
GEMINI.md→/app/GEMINI.md
This is how you test “baseline vs. with my custom prompt” — two variants, same task, different instruction file.
Skills
Section titled “Skills”Give the agent Claude Code skills two ways: drop them under variants/<v>/skills/<name>/ (copied in whole, including references/), or reference a skill from its source path with a [[skill]] entry in variant.toml (no copy — staged from the source at an optional git ref). Codex and Gemini skills live under agents_skills/ and gemini_skills/. See Plugins & Skills for the full workflow.
MCP servers
Section titled “MCP servers”Wire MCP servers the agent can call during the task. The cleanest path is a [nasde.plugin] declaration in task.toml, which ships a plugin’s skills and its MCP server into the sandbox in one line — see Benchmarking a plugin.
Reasoning effort
Section titled “Reasoning effort”Set how hard the agent thinks with reasoning_effort in variant.toml, or override per run with nasde run --effort. Family defaults are not comparable, so set it deliberately when comparing agents — see Configuration → Reasoning effort.
Variant scoping
Section titled “Variant scoping”Restrict a variant to specific tasks with tasks = [...] in variant.toml — useful when a skill is tuned to one repo’s conventions and would mislead elsewhere. See Scoping a variant.
Configuring the reviewer agent
Section titled “Configuring the reviewer agent”The reviewer agent (assessment evaluator) is the mirror image of the agent under test: same axes (model, skills, MCP, system prompt), configured via the [evaluation] section in nasde.toml. By default it uses claude-opus-4-7 with read-only tools (Read, Glob, Grep).
Backend and model
Section titled “Backend and model”By default, nasde uses the Claude Code CLI for assessment evaluation. You can switch to Codex:
[evaluation]backend = "codex" # "claude" (default) | "codex"model = "gpt-5.3-codex"Both backends use your existing CLI authentication (subscription OAuth or API key) — no additional setup. The evaluator spawns the CLI as a subprocess, so you get the same billing treatment as interactive use. See examples/nasde-dev-skill/nasde.codex.toml for a ready-to-use Codex configuration.
Skills
Section titled “Skills”Give the reviewer agent skills (e.g. a code review methodology) by creating a directory with SKILL.md files and pointing at it:
[evaluation]skills_dir = "./evaluator_skills"Skills are copied into the evaluator’s workspace and loaded via the CLI’s native auto-discovery (claude --add-dir <workspace>).
MCP servers
Section titled “MCP servers”Add external analysis tools (linters, complexity analyzers) as MCP servers:
{ "mcpServers": { "code-analysis": { "type": "stdio", "command": "npx", "args": ["@some-org/code-analysis-mcp"] } }}[evaluation]mcp_config = "./evaluator_mcp.json"allowed_tools = ["Read", "Glob", "Grep", "mcp__code-analysis__analyze"]MCP tool names follow the mcp__<server>__<tool> convention. If you override allowed_tools, you must include the MCP tools explicitly.
All options
Section titled “All options”| Setting | Default | Purpose |
|---|---|---|
backend | claude | Subprocess backend: claude or codex |
model | claude-opus-4-7 | Evaluator model |
dimensions_file | assessment_dimensions.json | Scoring dimensions file |
max_turns | 60 | Max evaluator conversation turns (raise for DDD-rich workspaces with many small files) |
allowed_tools | ["Read", "Glob", "Grep"] | Tool whitelist |
mcp_config | — | Path to MCP server config JSON |
skills_dir | — | Path to evaluator skills directory |
append_system_prompt | — | Extra system prompt text |
include_trajectory | false | Include ATIF trajectory in evaluation |
When include_trajectory is enabled, the evaluator can read the agent’s full execution trajectory (agent/trajectory.json) — tool calls, timestamps, token usage, errors. This enables assessment dimensions that evaluate the agent’s process (efficiency, verification discipline, decision-making) alongside the final output. See examples/nasde-dev-skill for a working example with trajectory-aware dimensions.
Exporting results
Section titled “Exporting results”By default a run’s output lives only in the local, gitignored jobs/ directory — and most of its weight is build junk (compiled binaries, .git checkouts) that’s useless for analysis. If you clear jobs/, the results are gone. nasde results-export copies just the essence of each trial into a plain destination directory so your results survive and travel:
nasde results-export jobs/2026-03-13__14-30-00 --to ~/Dropbox/nasde-results -C my-benchmarkThe destination is any path you like — an iCloud or Dropbox folder, an external drive, or a git repo you commit yourself. NASDE just writes files there; it never talks to a cloud provider, so there’s nothing to authenticate. Each trial becomes one flat folder <job>__<trial>/ containing:
metrics.json— self-contained summary: timing, model, variant, task, reward, reasoning effort, token usage + USD cost (see Token & Cost)assessment_eval_*.json— the reviewer’s per-dimension scores and reasoning (one file per repetition)assessment_summary.json— per-dimension mean/std/range across repetitions (the representative result)trajectory.json— the agent’s full tool-call trace, for post-hoc cost/process analysischanges.patch— exactly what the agent changed (a code diff, not the multi-GB workspace)verifier_stdout.txt,reward.txt— the rough-test output
You can pass several paths at once, mixing whole jobs and individual trials — NASDE figures out which is which. Re-running is safe: it merges (copying any evaluations added since the last export) and never re-touches the immutable trajectory or patch.