Quick Start

This page takes you from nothing installed to a scored benchmark built from your own git history.

Prerequisites

Python 3.12+
Docker (default) or a cloud sandbox provider — Harbor runs agents in isolated environments
uv — Package manager
npm — Required for Gemini CLI (@google/gemini-cli is installed automatically by Harbor)
Agent credentials (at least one):
- Claude Code: ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN
- OpenAI Codex: CODEX_API_KEY (API key) or codex login (ChatGPT subscription OAuth)
- Gemini CLI: GEMINI_API_KEY (API key), GOOGLE_API_KEY (Vertex AI), or gemini login (Google account OAuth)
Evaluator CLI — the assessment evaluator spawns the claude CLI by default (or codex if [evaluation] backend = "codex"). That CLI must be installed and authenticated (OAuth subscription or API key — whichever you already use interactively)

See Authentication & Opik for how to set up each agent’s credentials.
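Before installing, it can help to confirm which prerequisites are already in place. The following is a minimal sketch, not part of nasde: the helper names missing_tools and any_var_set are hypothetical, while the tool and variable names are taken from the list above. It assumes a POSIX shell.

```shell
# Print each command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed if at least one of the named environment variables is non-empty.
any_var_set() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            return 0
        fi
    done
    return 1
}

# docker is optional if you use a cloud sandbox provider; npm matters only
# for Gemini CLI; claude is the default evaluator CLI.
missing=$(missing_tools uv docker npm claude)
if [ -n "$missing" ]; then
    echo "not on PATH:" $missing
fi

if ! any_var_set ANTHROPIC_API_KEY CLAUDE_CODE_OAUTH_TOKEN CODEX_API_KEY \
                 GEMINI_API_KEY GOOGLE_API_KEY; then
    echo "no agent credential variable is set (fine if you rely on a CLI login)"
fi
```

A missing environment variable is not necessarily a problem: the OAuth logins (codex login, gemini login) do not need one, so treat the output as a hint rather than a verdict.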

Install the CLI

uv tool install nasde-toolkit --python 3.13
nasde --version

This installs the latest stable release from PyPI.

Prefer pipx, pip, or a from-source install? See the alternatives below.

Installation alternatives

# pipx: similar isolation to uv tool, popular in the Python community
pipx install nasde-toolkit --python 3.13

# Inside an existing virtual environment (3.12 or 3.13)
pip install nasde-toolkit

# Latest unreleased changes from main (for testing PRs and dev builds)
uv tool install git+https://github.com/NoesisVision/nasde-toolkit.git --python 3.13

# Local clone (for developing NASDE itself)
git clone git@github.com:NoesisVision/nasde-toolkit.git
cd nasde-toolkit
uv sync

Upgrading to the newest release:

uv tool upgrade nasde-toolkit       # if installed via uv tool
pipx upgrade nasde-toolkit          # if installed via pipx
pip install --upgrade nasde-toolkit # if installed via pip

nasde checks PyPI for newer releases on startup and prints a one-line notice on stderr when an upgrade is available (severity-tinted: patch / minor / major). Disable with NASDE_NO_UPDATE_CHECK=1 or CI=true.
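In scripts or shared shells where the notice is just noise, the check can be switched off for the whole session. A small sketch, using only the variable named above:

```shell
# Disable the startup update check for every nasde call in this shell session.
export NASDE_NO_UPDATE_CHECK=1
```

Adding the same line to your shell profile would make the setting persistent; whether you want that depends on how you prefer to learn about new releases.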

After installation, only nasde appears on PATH. Harbor and Opik are bundled as core dependencies. The reviewer agent spawns your already-installed claude or codex CLI as a subprocess (not bundled), so it reuses whatever authentication you’ve set up interactively. Check the installed version with nasde --version.

Install the authoring skills

nasde install-skills

This copies the bundled nasde-benchmark-* skills into ~/.claude/skills/ so they’re available in every Claude Code session. Use --scope project to install into the current project’s .claude/skills/ instead, or --force to overwrite after a nasde upgrade.
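To confirm where the skills ended up, you can list the matching entries in either location. This is a sketch with a hypothetical helper, list_nasde_skills; it relies only on the nasde-benchmark-* naming and the two directories mentioned above, and assumes nothing about what each skill contains.

```shell
# List nasde-benchmark-* entries in a skills directory (default: user scope).
list_nasde_skills() {
    skills_dir=${1:-"$HOME/.claude/skills"}
    for entry in "$skills_dir"/nasde-benchmark-*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        if [ -e "$entry" ]; then
            basename "$entry"
        fi
    done
}

list_nasde_skills                  # user scope: ~/.claude/skills/
list_nasde_skills .claude/skills   # project scope, after --scope project
```

An empty listing for a scope means no skills are installed there.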

Build your first benchmark from git history

Open your own project in Claude Code and say something like:

“Create a NASDE benchmark with a single task, based on a recent piece of work from this repo — a commit, a range of commits, or a merged PR.”

Start with one task. Point the skill at whatever unit of work feels self-contained in your workflow: a single commit, a range of commits, a merged MR/PR, or an issue that was closed by a set of commits. The nasde-benchmark-from-history skill proposes a candidate and generates one task directory containing instruction.md, a Dockerfile, test.sh, and a starter assessment_criteria.md. You review each file before it’s written.
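Once the skill has finished, a quick check can confirm that the four files are present. The sketch below uses a hypothetical helper, missing_task_files, and assumes the files sit directly inside the task directory; the generated layout may nest them differently, so adjust the paths if it does.

```shell
# Print each expected file that is missing from the given task directory.
# Assumption: all four files live at the top level of that directory.
missing_task_files() {
    task_dir=$1
    for name in instruction.md Dockerfile test.sh assessment_criteria.md; do
        if [ ! -f "$task_dir/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

Call it with the path to the generated task directory, for example missing_task_files path/to/task. It prints nothing when all four files exist.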

Run it

nasde run --all-variants -C path/to/generated-benchmark

--all-variants runs every variant the skill scaffolded, so you don’t need to know their names yet. If you’d rather burn fewer tokens on the first run, pick just one with --variant NAME — you can run the others later.
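The choice between the two flags can be sketched as a small dry-run helper. first_run_command is a hypothetical name, and combining --variant with -C is an assumption based on the command above. The helper only prints the command so you can inspect it before spending tokens.

```shell
# Print the nasde command for a first run: a single variant when a name is
# given, every variant otherwise. Nothing is executed.
first_run_command() {
    benchmark_dir=$1
    variant=${2:-}
    if [ -n "$variant" ]; then
        printf 'nasde run --variant %s -C %s\n' "$variant" "$benchmark_dir"
    else
        printf 'nasde run --all-variants -C %s\n' "$benchmark_dir"
    fi
}
```

For example, first_run_command path/to/generated-benchmark NAME prints the single-variant form, with NAME standing in for one of the variant names the skill scaffolded.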

Good to know

Start small. One task is enough to validate the loop end to end. Scale up once it works — more tasks only pay off after you’ve seen what a task looks like in practice.
Your subscription covers it. Runs use your existing claude / codex / gemini CLI auth, so a Claude Max or ChatGPT Plus subscription is enough to get going. API keys are supported too when you have them — see Authentication & Opik for the full picture.
More docs. See Use Cases for the end-to-end walkthrough and Benchmark Results for reference numbers.