Configuration
A benchmark is a directory of plain files. This page covers the layout and the three configuration files: nasde.toml (project), variant.toml (per agent configuration), and task.toml (per task).
Quick reference: configuring a variant
For the impatient: a variant under test is a directory under variants/<name>/ with:
- variant.toml (required): agent type (claude/codex/gemini), model, optional reasoning_effort, optional tasks scope, optional [[skill]] references.
- CLAUDE.md / AGENTS.md / GEMINI.md: the agent's instructions (one file, matching the family).
- skills/ (optional): Claude Code skills copied in whole (Codex: agents_skills/, Gemini: gemini_skills/).
That’s the whole agent-under-test surface. The walkthrough — what each knob is for — is in Configuring the agent under test; the formats are below.
Project layout
my-benchmark/
  nasde.toml                     # Project configuration
  assessment_dimensions.json     # Scoring dimensions (shared across tasks)
  tasks/
    feature-a/
      task.toml                  # Task config (Harbor sections + [nasde.source] / [nasde.plugin])
      instruction.md             # Agent prompt
      assessment_criteria.md     # Per-task criteria for post-hoc evaluator
      environment/               # Optional: custom Dockerfile (else auto-generated from [nasde.source] / [nasde.plugin])
      tests/
        test.sh                  # Harbor verification script
  variants/
    vanilla/                     # Claude Code variant
      variant.toml               # agent = "claude", model = "claude-sonnet-4-6"
      CLAUDE.md                  # Agent system prompt (injected to /app/CLAUDE.md)
    guided/                      # Claude Code variant with skills
      variant.toml               # may also list [[skill]] entries (skill-by-reference)
      CLAUDE.md
      skills/                    # Claude skills (injected to /app/.claude/skills/, incl. references/)
        my-skill/
          SKILL.md
    codex-baseline/              # Codex variant
      variant.toml               # agent = "codex", model = "gpt-5.3-codex"
      AGENTS.md                  # Codex instructions (injected to /app/AGENTS.md)
    gemini-baseline/             # Gemini CLI variant
      variant.toml               # agent = "gemini", model = "google/gemini-3-flash-preview"
      GEMINI.md                  # Gemini instructions (injected to /app/GEMINI.md)
  evaluator_skills/              # Optional: skills for the evaluator agent
    code-review/
      SKILL.md
  evaluator_mcp.json             # Optional: MCP server config for evaluator
  jobs/                          # Trial output (gitignored)

Each agent family injects its instructions differently: Claude Code variants get CLAUDE.md → /app/CLAUDE.md, Codex variants AGENTS.md → /app/AGENTS.md, Gemini variants GEMINI.md → /app/GEMINI.md. Codex and Gemini skills live under agents_skills/ and gemini_skills/ respectively, and their SKILL.md must start with a --- YAML frontmatter line (see Plugins & Skills → how skills reach each agent).
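The frontmatter requirement for Codex and Gemini skills is easy to miss, so a quick check can help before a run. The following is a minimal sketch under stated assumptions: starts_with_frontmatter and SKILL_DIRS are hypothetical names, not NASDE APIs, and the check only looks at the first line rather than validating the YAML.

```python
# Skill directory inside a variant for each agent family, per the layout above.
SKILL_DIRS = {"claude": "skills", "codex": "agents_skills", "gemini": "gemini_skills"}


def starts_with_frontmatter(skill_md: str) -> bool:
    """True if the first line of a SKILL.md is the YAML frontmatter delimiter."""
    first_line = skill_md.split("\n", 1)[0]
    # rstrip() tolerates trailing spaces and Windows line endings.
    return first_line.rstrip() == "---"
```

A caller would read each SKILL.md under the directory named by SKILL_DIRS for the variant's agent and report any file for which the function returns False.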
What each task file does
Each file in a task feeds a different stage of the run:
flowchart LR
I["instruction.md"] --> AG["Agent under test"]
E["environment/<br/>Dockerfile"] --> AG
AG --> T["tests/test.sh"] --> R["Reward 0/1"]
AG --> RV["Reviewer agent"]
AC["assessment_criteria.md<br/>+ assessment_dimensions.json"] --> RV
RV --> S["Per-dimension scores"]
style RV fill:#c0392b,color:#fff
See Anatomy of a Benchmark for the conceptual walkthrough.
nasde.toml
Project-level configuration: defaults, the Docker base, and the reviewer ([evaluation]).

[project]
name = "my-benchmark"
version = "1.0.0"

[defaults]
variant = "vanilla"
# harbor_env = "daytona"                      # Optional: cloud sandbox provider (default: docker)

[docker]
base_image = "ubuntu:22.04"
build_commands = []

[evaluation]
backend = "claude"                            # "claude" (default) | "codex"
model = "claude-opus-4-7"
dimensions_file = "assessment_dimensions.json"
# eval_repetitions = 3                        # Judge evaluations per trial (default 3)
# max_turns = 60                              # Max evaluator conversation turns (default 60)
# allowed_tools = ["Read", "Glob", "Grep"]    # Override default tool whitelist
# mcp_config = "./evaluator_mcp.json"         # MCP server config for evaluator
# skills_dir = "./evaluator_skills"           # Skills directory for evaluator
# append_system_prompt = ""                   # Extra system prompt for evaluator
# include_trajectory = false                  # Include ATIF trajectory in evaluation

[reporting]
platform = "opik"
project_name = "my-benchmark"                 # Opik project name (defaults to [project] name)

The [evaluation] block is the reviewer agent's configuration. See Configuring the Reviewer Agent for what each option does.
variant.toml
Every variant directory must contain a variant.toml declaring the agent type and the model:

agent = "claude"              # "claude" | "codex" | "gemini"
model = "claude-sonnet-4-6"   # model appropriate for the agent family
reasoning_effort = "high"     # optional, see Reasoning effort below

If no harbor_config.json exists, one is auto-generated from the agent type.
Reasoning effort
How hard the model thinks is a setting you should choose deliberately, not leave to chance. Each agent family ships a different default level, and those defaults are not comparable: Codex's high is the top of its three levels, while Claude's high is only the middle of five (xhigh and max sit above it). Comparing two agents on their respective defaults silently compares different thinking budgets.
Set the effort explicitly with the optional reasoning_effort field in variant.toml, or override it for a single run with nasde run --effort. Priority is --effort > variant.toml reasoning_effort > Harbor’s family default (left unset means NASDE passes nothing and the family default applies). Typical levels (for reference — the exact set differs per model and changes over time): Claude low/medium/high/xhigh/max, Codex none/minimal/low/medium/high/xhigh, Gemini minimal/low/medium/high. NASDE does not police the value — it passes whatever you set straight to the agent, which is the source of truth and rejects an unknown level itself; this avoids a stale built-in list wrongly blocking a newly-valid level.
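The precedence described above can be sketched as a single function. resolve_effort is a hypothetical name used for illustration, not NASDE's actual implementation. In line with the text, it does no validation of the level.

```python
from typing import Optional


def resolve_effort(cli_effort: Optional[str], variant_config: dict) -> Optional[str]:
    """Apply the precedence: --effort, then variant.toml, then nothing."""
    if cli_effort is not None:
        return cli_effort
    # None means "pass nothing", so the agent family's own default applies.
    return variant_config.get("reasoning_effort")
```

For example, resolve_effort("xhigh", {"reasoning_effort": "high"}) yields "xhigh", while resolve_effort(None, {}) yields None and leaves the choice to the family default.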
The effort you set is stamped onto each trial (reasoning_effort in assessment_summary.json and metrics.json), and the nasde run cost table groups by (agent, model, effort) — a different effort is treated as a different configuration and never averaged in with another.
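Grouping by (agent, model, effort) means two trials that differ only in effort land in different rows. A minimal sketch of that grouping follows. The trial record shape is an assumption made for illustration: only reasoning_effort is a field name taken from the text, while agent, model, and cost_usd are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def group_costs(trials: list[dict]) -> dict[tuple, float]:
    """Average cost per (agent, model, effort) configuration (sketch)."""
    groups = defaultdict(list)
    for trial in trials:
        # A missing effort is its own group: the family default was used.
        key = (trial["agent"], trial["model"], trial.get("reasoning_effort"))
        groups[key].append(trial["cost_usd"])
    return {key: mean(costs) for key, costs in groups.items()}
```

Because effort is part of the key, a "high" trial is never averaged together with a "medium" trial of the same agent and model.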
Skill-by-reference ([[skill]]) and task scoping (tasks)
A variant can reference a skill by its source path instead of copying it in, and can be scoped to specific tasks. Both are covered with examples in Plugins & Skills:

agent = "claude"
model = "claude-sonnet-4-6"

tasks = ["csharp-anemic-to-rich-domain"]   # optional: restrict this variant to specific tasks

[[skill]]                                  # optional: stage a skill from its source path
path = "../../../src/plugins/my-plugin/skills/my-skill"
ref = "abc1234"

task.toml
task.toml is a single task config file shared with Harbor, which reads its standard sections ([task], [agent], [environment], [verifier], [metadata]) directly. NASDE-specific fields live under [nasde.*] and are ignored by Harbor.
Local repo source ([nasde.source])
Build benchmarks from local (private) repositories by adding [nasde.source] to task.toml. NASDE then auto-generates the Docker environment, so no custom Dockerfile is needed:

[nasde.source]
git = "../.."
ref = "abc1234"

Plugin source ([nasde.plugin])
Ship a local Claude Code plugin (skills + MCP server) into the sandbox with one declaration. Full walkthrough in Plugins & Skills:

[nasde.plugin]
path = "../../../src/plugins/my-plugin"    # dir containing .claude-plugin/plugin.json
ref = "abc1234"                            # optional git ref
install_root = "/opt/my-plugin"            # optional, default /opt/<plugin-name>
build = "bun install --frozen-lockfile"    # optional, run at image-build time

[nasde.plugin.env]                         # optional, exported in the MCP server wrapper
CLAUDE_PLUGIN_DATA = "/opt/my-plugin-data"