Most users need only nasde run; the other commands are for occasional use. The everyday commands come first, followed by the full reference.
# Scaffold a new benchmark project from scratch
nasde init my-benchmark
# Run the default variant
nasde run --variant vanilla -C my-benchmark
# Codex variant (model name is OpenAI-side)
nasde run --variant codex-baseline --model gpt-5.3-codex -C my-benchmark
# Gemini variant
nasde run --variant gemini-baseline --model google/gemini-3-flash-preview -C my-benchmark
# Run a single task with experiment tracking
nasde run --variant vanilla --tasks my-task -C my-benchmark --with-opik
# Skip the reviewer (rough tests only, faster)
nasde run --variant vanilla -C my-benchmark --without-eval
# Re-run the reviewer on an existing job (no re-execution)
nasde eval jobs/2026-03-13__14-30-00 --with-opik -C my-benchmark
# [Experimental] Back up the essential results so they don't live only in jobs/
nasde results-export jobs/2026-03-13__14-30-00 --to ~/Dropbox/nasde-results -C my-benchmark
# Publish a trial as a PR for human rubric calibration, then pull comments back
nasde calibrate publish jobs/2026-03-13__14-30-00/movie__abc -C my-benchmark
nasde calibrate pull-comments jobs/2026-03-13__14-30-00/movie__abc -C my-benchmark --json
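The per-variant commands above can be scripted when the same project is run against several agents. The following is a minimal sketch, not part of nasde: `run_variant`, `DRY_RUN`, and `PROJECT` are hypothetical names introduced here, and the variant/model pairs are copied from the examples above. With `DRY_RUN=1` (the default in this sketch) it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with nasde): run several variants in turn.
set -eu

DRY_RUN="${DRY_RUN:-1}"     # 1 = print the commands only, 0 = execute them
PROJECT="my-benchmark"      # project directory passed to -C

run_variant() {
  variant=$1
  model=${2:-}              # optional model override
  if [ -n "$model" ]; then
    set -- nasde run --variant "$variant" --model "$model" -C "$PROJECT"
  else
    set -- nasde run --variant "$variant" -C "$PROJECT"
  fi
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"               # show the command without running it
  else
    "$@"                    # run it; set -e stops the script on failure
  fi
}

run_variant vanilla
run_variant codex-baseline gpt-5.3-codex
run_variant gemini-baseline google/gemini-3-flash-preview
```

Setting `DRY_RUN=0` in the environment makes the same script execute the commands instead of printing them.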
Authentication is covered in Authentication & Opik. In short, either export an API key (ANTHROPIC_API_KEY, CODEX_API_KEY, or GEMINI_API_KEY) or rely on the OAuth subscription you are already logged into via claude / codex / gemini login.
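As a rough illustration of the two options, the sketch below reports which credential source would be available for a given agent. It is a hypothetical pre-flight helper, not a nasde feature: the function name `credential_source` and its output strings are invented here, the agent-to-variable mapping is assumed from the order of the names above, and it does not reflect how nasde itself resolves credentials when both are present.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of nasde).
# Prints "api-key:<VAR>" if the agent's API key variable is set and non-empty,
# otherwise "oauth-login" (meaning: rely on the CLI login you already have).
credential_source() {
  case "$1" in
    claude) key_name=ANTHROPIC_API_KEY; key_value=${ANTHROPIC_API_KEY:-} ;;
    codex)  key_name=CODEX_API_KEY;     key_value=${CODEX_API_KEY:-} ;;
    gemini) key_name=GEMINI_API_KEY;    key_value=${GEMINI_API_KEY:-} ;;
    *) echo "unknown agent: $1" >&2; return 1 ;;
  esac
  if [ -n "$key_value" ]; then
    echo "api-key:$key_name"
  else
    echo "oauth-login"
  fi
}

# Report the credential source for each agent
for agent in claude codex gemini; do
  echo "$agent -> $(credential_source "$agent")"
done
```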
| Command | Description |
| --- | --- |
| `nasde run` | Run benchmark: Harbor trial + assessment evaluation (default) |
| `nasde eval <JOB_DIR>` | Re-run assessment evaluation on an existing job |
| `nasde results-export <PATHS> --to <DIR>` | Copy trial artifact essence (scores, metrics, patch, trajectory) to a plain dir |
| `nasde calibrate publish <PATHS>` | Publish trial diffs + assessments as PRs/MRs for human rubric review |
| `nasde calibrate pull-comments <PATHS>` | Pull review comments back from the PRs/MRs (use `--json` for the orchestrator) |
| `nasde init [DIR]` | Scaffold a new evaluation project |
| `nasde install-skills` | Install bundled Claude Code authoring skills into `~/.claude/skills/` (or `./.claude/skills/` with `--scope project`) |
| Command | Description |
| --- | --- |
| `nasde harbor ...` | Full Harbor CLI (view, jobs resume, trials, datasets, etc.) |
| `nasde opik ...` | Opik CLI (configure, usage-report, export, etc.) |
| Flag | Description |
| --- | --- |
| `--variant` | Variant to run (defaults to config default) |
| `--all-variants` | Run every available variant (Cartesian product with tasks) |
| `--tasks` | Comma-separated task names to run |
| `--model` | Model override (e.g. `claude-sonnet-4-6`, `o3`, `google/gemini-3-flash-preview`) |
| `--effort` | Reasoning-effort override (overrides `variant.toml` `reasoning_effort`; see Configuration → Reasoning effort) |
| `--attempts`, `-n` | Independent agent attempts per task (Harbor `n_attempts`); the sample size behind the mean ±std |
| `--timeout` | Agent timeout in seconds |
| `--with-opik` | Enable Opik tracing |
| `--without-eval` | Skip assessment evaluation |
| `--eval-repetitions` | Judge evaluations per trial (default: from `nasde.toml` `[evaluation]`, fallback 3) |
| `--max-concurrent-eval` | Max concurrent assessment evaluations (default: 10) |
| `--harbor-env` | Harbor execution environment (`docker`, `daytona`, `modal`, `e2b`, `runloop`, `gke`) |
| `--job-suffix` | Custom suffix for the job directory name (default: random 6-char hex) |
| `--project-dir`, `-C` | Path to evaluation project |
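Because `--all-variants` takes the Cartesian product of variants and tasks, and `--attempts` and `--eval-repetitions` multiply on top of that, it can help to estimate the size of a run before launching it. The sketch below works through the arithmetic under the assumption that every variant/task pair gets the full number of attempts and that each attempt is one trial judged `--eval-repetitions` times. The helper names `trial_count` and `judge_eval_count` are hypothetical and not part of nasde.

```shell
#!/bin/sh
# Hypothetical helpers (not part of nasde): estimate how large a run will be.

# trials = variants x tasks x attempts
trial_count() {
  echo $(( $1 * $2 * $3 ))
}

# judge evaluations = trials x eval repetitions
judge_eval_count() {
  echo $(( $1 * $2 ))
}

# Example: 3 variants, 4 tasks, --attempts 5, --eval-repetitions 3
trials=$(trial_count 3 4 5)
evals=$(judge_eval_count "$trials" 3)
echo "trials=$trials judge_evaluations=$evals"
# prints: trials=60 judge_evaluations=180
```

With the default `--max-concurrent-eval` of 10, those 180 judge evaluations would run in at least 18 batches, which gives a rough sense of how much `--without-eval` saves on quick test runs.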