Use Cases
UC1: Evaluating Your Agent Configuration Against Your Own Codebase
Persona
Tech lead or senior developer at a company that has invested in AI coding agent configuration: custom skills, CLAUDE.md files, prompting strategies, MCP servers. Works in one or a few company repositories (often a monorepo). Has access to the git history of real problems the team has already solved.
Problem
You’ve tuned how Claude Code operates in your codebase, but you have no way to measure whether the full configuration actually helps. Built-in eval tools like Skill Creator can test individual skills in isolation, but they can’t tell you whether your skills work well together, how your CLAUDE.md interacts with MCP server configurations, or how the same task set performs across different coding agents. Skill changes are a leap of faith: the new prompt might improve refactoring but break the agent’s ability to write tests. Without structured evaluation of the complete configuration, you can’t tell what’s improving and what’s regressing.
What NASDE enables
You turn real problems from your team’s history into repeatable benchmark tasks, then run different agent configurations against them: not just individual skills, but the full combination of CLAUDE.md, skills, and MCP servers. You can also compare results across different coding agents (Claude Code, Codex, Cursor, etc.) on the same task set. Results are multi-dimensional scores: not just “did it work?” but “how well did it work across code quality, architecture, testing, and whatever else matters to you?” Once the task set is established, it becomes a regression suite that you re-run every time the configuration changes.
Workflow
Phase 1: Build the benchmark (one-time setup)
Step 1: Identify source tasks from git history
The nasde-benchmark-from-history skill automates this step. Tell it a commit range or set of PR numbers — e.g., “create benchmark tasks from the last 20 commits on main” — and it scans diffs, filters for good candidates, and presents them for your approval. See the full skill description below.
You can also do this manually. Browse your team’s closed PRs, resolved issues, or notable commits. Look for changes that:
- Are self-contained (clear before/after state)
- Have existing tests or well-defined “done” criteria
- Represent the kind of work your agents handle (bug fixes, features, refactors)
Good candidates: 5–10 problems where you know what a good solution looks like — because your team already produced one.
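If you browse history manually, a small script can pre-filter commits before you read any diffs. The sketch below is one possible heuristic, not what the `nasde-benchmark-from-history` skill does internally: it parses the output of `git log --numstat --format='@@%h %s'` and keeps small commits that touch both test files and non-test files. The `@@` marker, the `max_files` threshold, and the "path contains `test`" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    sha: str
    subject: str
    files: list[str] = field(default_factory=list)


def parse_numstat_log(text: str, marker: str = "@@") -> list[Commit]:
    """Parse the output of: git log --numstat --format='@@%h %s'."""
    commits: list[Commit] = []
    for line in text.splitlines():
        if line.startswith(marker):
            # Header line: "<marker><short sha> <subject>"
            sha, _, subject = line[len(marker):].partition(" ")
            commits.append(Commit(sha=sha, subject=subject))
        elif line.strip() and commits:
            # numstat rows look like "<added>\t<deleted>\t<path>"
            parts = line.split("\t", 2)
            if len(parts) == 3:
                commits[-1].files.append(parts[2])
    return commits


def is_candidate(commit: Commit, max_files: int = 8) -> bool:
    """Heuristic (assumed): a small change touching tests and non-test code."""
    tests = [f for f in commit.files if "test" in f.lower()]
    code = [f for f in commit.files if "test" not in f.lower()]
    return bool(tests) and bool(code) and len(commit.files) <= max_files


# Hand-written sample in the same shape as the git output (not real history).
SAMPLE_LOG = (
    "@@a1b2c3d Fix rounding in invoice totals\n"
    "\n"
    "12\t3\tsrc/billing/invoice.py\n"
    "20\t0\ttests/test_invoice.py\n"
    "@@d4e5f6a Bump dependencies\n"
    "\n"
    "4\t4\trequirements.txt\n"
)

if __name__ == "__main__":
    for commit in parse_numstat_log(SAMPLE_LOG):
        if is_candidate(commit):
            print(commit.sha, commit.subject)
```

The filter only narrows the list. You still need to read each surviving commit to confirm it has a clear before/after state.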
Step 2: Create the benchmark project

    nasde init company-skills-eval
    cd company-skills-eval

Step 3: Define assessment dimensions
Choose 3–5 scoring dimensions that reflect what your team values. Examples for a DDD-oriented .NET team:
| Dimension | Max score | What it measures |
|---|---|---|
| `domain_modeling` | 30 | Correct use of aggregates, value objects, domain events |
| `architecture_compliance` | 25 | Follows team patterns (mediator, CQRS, repository) |
| `test_quality` | 25 | Meaningful tests, not just coverage |
| `code_clarity` | 20 | Readable, idiomatic, well-structured |
Write these to assessment_dimensions.json.
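The exact schema of `assessment_dimensions.json` is not shown here, so treat the following as a sketch under assumptions: it supposes a top-level `dimensions` list with `name`, `max_score`, and `description` keys, and it checks that the maximum scores add up to 100, as they do in the example table. Both the key names and the 100-point total are hypothetical; check the file that `nasde init` generates for the real shape.

```python
import json

# Dimensions copied from the example table above.
DIMENSIONS = [
    {"name": "domain_modeling", "max_score": 30,
     "description": "Correct use of aggregates, value objects, domain events"},
    {"name": "architecture_compliance", "max_score": 25,
     "description": "Follows team patterns (mediator, CQRS, repository)"},
    {"name": "test_quality", "max_score": 25,
     "description": "Meaningful tests, not just coverage"},
    {"name": "code_clarity", "max_score": 20,
     "description": "Readable, idiomatic, well-structured"},
]


def render_dimensions(dimensions: list[dict], expected_total: int = 100) -> str:
    """Validate the dimension list and return it as pretty-printed JSON."""
    names = [d["name"] for d in dimensions]
    if len(set(names)) != len(names):
        raise ValueError("dimension names must be unique")
    total = sum(d["max_score"] for d in dimensions)
    if total != expected_total:
        raise ValueError(f"max scores sum to {total}, expected {expected_total}")
    # Assumed layout: a single top-level "dimensions" key.
    return json.dumps({"dimensions": dimensions}, indent=2)


if __name__ == "__main__":
    print(render_dimensions(DIMENSIONS))
```

Validating before writing catches the most common slip, which is adjusting one weight and forgetting to rebalance the others.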
Step 4: Create tasks from your history
For each selected problem, create a task directory under tasks/. Each task needs:
| File | Purpose |
|---|---|
| `task.toml` | Git source (your repo at the commit before the fix), evaluation config |
| `instruction.md` | What the agent should do, derived from the original issue/PR description |
| `environment/Dockerfile` | Clones your repo at the right commit, installs dependencies |
| `tests/test.sh` | Verifies the solution works; often adapted from your existing tests |
| `assessment_criteria.md` | Rubric for the reviewer agent: what “good” looks like for this specific task |
The nasde-benchmark-creator skill walks you through this interactively. If you used nasde-benchmark-from-history in Step 1, task files are generated automatically — you review and edit each one before it’s written.
Step 5: Define variants
Create directories under variants/ for each configuration you want to compare:
| Variant | What it represents |
|---|---|
| `vanilla` | No custom skills (baseline Claude Code) |
| `current` | Your current production CLAUDE.md + skills |
| `proposed-v2` | The change you’re considering |
Each variant contains at minimum a CLAUDE.md file that gets injected into the agent’s sandbox.
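The variant layout is simple enough to scaffold by hand, but a short helper keeps names consistent when you compare many configurations. This is a minimal sketch that assumes only what is stated above: a `variants/<name>/` directory holding a `CLAUDE.md`. The function name and the choice to leave existing files untouched are illustrative, not part of NASDE.

```python
from pathlib import Path


def scaffold_variants(root: Path, claude_md_by_variant: dict[str, str]) -> list[Path]:
    """Create variants/<name>/CLAUDE.md for each entry; never overwrite."""
    paths: list[Path] = []
    for name, content in claude_md_by_variant.items():
        target = root / "variants" / name / "CLAUDE.md"
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.exists():
            # Only write new files so hand edits are preserved on re-run.
            target.write_text(content, encoding="utf-8")
        paths.append(target)
    return paths
```

For example, `scaffold_variants(Path("."), {"vanilla": "", "current": team_rules})` would create both directories, with an empty `CLAUDE.md` standing in for the baseline (`team_rules` is a placeholder for your own file contents).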
Phase 2: Run and compare

    # Run baseline
    nasde run --variant vanilla --with-opik

    # Run current configuration
    nasde run --variant current --with-opik

    # Run proposed change
    nasde run --variant proposed-v2 --with-opik

Results land in Opik. Compare variants across dimensions: see which configuration scores highest on domain modeling, which is best at test quality, and whether the proposed change helps or hurts.
Comparing across coding agents: Cross-agent comparison works through Harbor’s variant system. Each variant can point to a different agent implementation via harbor_config.json:

    variants/
      claude-code-v1/
        CLAUDE.md
        harbor_config.json   # import_path → Claude Code agent
      codex-v1/
        harbor_config.json   # import_path → Codex agent

Run each variant:

    nasde run --variant claude-code-v1 --with-opik
    nasde run --variant codex-v1 --with-opik

Both runs use the same tasks, dimensions, and assessment criteria; only the coding agent differs. This lets you answer “which agent produces better code for our problems?” with data, not anecdotes.
Phase 3: Regression testing (ongoing)
The task set in `tasks/` is now your regression suite. When someone proposes a skill change:
- Create a new variant directory with the proposed configuration
- Run the benchmark: `nasde run --variant proposed-change --with-opik`
- Compare scores against the `current` variant baseline
- If scores drop on any dimension, investigate before shipping
The task files are committed to the benchmark project repo — they’re stable, versioned, and shared across the team.
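The "investigate if any dimension drops" rule can be made mechanical once you have per-dimension scores for two variants. The sketch below assumes you have already exported those scores into plain dictionaries; how you pull them out of Opik is not covered here. The function name, the `tolerance` parameter, and the sample numbers are all illustrative rather than real results.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return {dimension: delta} where the candidate dropped by more than tolerance."""
    regressions: dict[str, float] = {}
    for dimension, base_score in baseline.items():
        if dimension not in candidate:
            raise KeyError(f"candidate run has no score for {dimension!r}")
        delta = candidate[dimension] - base_score
        if delta < -tolerance:
            regressions[dimension] = delta
    return regressions


# Made-up scores for illustration only.
BASELINE = {"domain_modeling": 22.0, "architecture_compliance": 18.5,
            "test_quality": 15.0, "code_clarity": 16.0}
CANDIDATE = {"domain_modeling": 24.0, "architecture_compliance": 18.5,
             "test_quality": 12.5, "code_clarity": 16.0}

if __name__ == "__main__":
    for dimension, delta in find_regressions(BASELINE, CANDIDATE).items():
        print(f"{dimension}: {delta:+.1f}")
```

A non-zero `tolerance` is worth considering because agent runs are not deterministic, so small drops may be noise rather than a real regression.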
What varies, what stays fixed
| Fixed | Varies |
|---|---|
| Source repository (your company repos) | Agent configurations (CLAUDE.md, skills, MCP servers) |
| Task set (frozen after Phase 1) | Agent models (Sonnet vs Opus) |
| Assessment dimensions and criteria | Coding agents (Claude Code, Codex, Cursor, etc. via Harbor) |
Current constraints
- NASDE supports local git repos and public remote repos. Private remote repos require local clones, which is not a practical limitation because you already have them.
- Task creation from git history is manual when using `nasde-benchmark-creator` alone. The `nasde-benchmark-from-history` skill automates this (see below).
Skill: nasde-benchmark-from-history
NASDE includes a dedicated skill that accelerates Phase 1 by mining git history for benchmark candidates. Instead of manually browsing PRs and writing task files from scratch, you point the skill at a commit range and it does the heavy lifting.
How to use it: Open your repository in Claude Code and describe what you want — e.g., “create benchmark tasks from the last 20 commits on main” or “turn PRs #45, #52, and #61 into evaluation tasks.” The skill activates automatically.
What it does:
- Scans the specified commits, reads diffs, and filters for good candidates (self-contained changes with clear before/after states)
- Presents a numbered list of candidates with metadata — files changed, difficulty estimate, whether tests exist
- For each candidate you approve, generates the full task directory: `task.toml`, `instruction.md`, `Dockerfile`, `test.sh`, `assessment_criteria.md`
- You review and edit each generated file before it’s written: the skill proposes, you decide
What it won’t do: It doesn’t generate instructions that leak the actual solution. The instruction describes the problem to solve (derived from the commit message and PR description), not the implementation (the diff). The agent must arrive at a solution independently.
Relationship to other skills: nasde-benchmark-from-history is an alternative entry point into the benchmark creation workflow. Where nasde-benchmark-creator starts from scratch (“what do you want to evaluate?”), nasde-benchmark-from-history starts from evidence (“here’s what your team already solved”). Both produce the same NASDE task structure.
See the full skill reference: .claude/skills/nasde-benchmark-from-history/SKILL.md
UC2: Building and Validating a Universal Skill
Persona
AI tooling developer or prompt engineer building a skill (or CLAUDE.md configuration) intended to work across many different codebases, languages, and team conventions. Not tied to one repository: the skill should generalize.
Problem
You’ve tested your skill on a handful of repos and it works. But you have no structured way to validate that it generalizes. Does it handle Python as well as TypeScript? Large monorepos as well as small libraries? Projects with extensive tests as well as those with none? Without a diverse, repeatable test suite, you’re shipping based on anecdotes.
What NASDE enables
A benchmark that spans multiple repositories, languages, and problem types. Define the test suite once, re-run it whenever the skill changes. Each run gives you per-task, per-dimension scores, so you can see exactly where the skill shines and where it struggles.
Workflow
Phase 1: Curate the benchmark
Step 1: Select diverse source repositories
The nasde-benchmark-from-public-repos skill automates this step. Describe the skill you’re building — e.g., “I’m building a refactoring skill that should work across Python, TypeScript, Go, and Rust” — and it builds a diversity matrix, suggests repos, and generates task scaffolding. See the full skill description below.
You can also curate manually. Pick public repos that test different axes of your skill’s capabilities. For a refactoring skill:
| Repo type | What it tests |
|---|---|
| Small Express.js API | Simple extraction, JS idioms |
| Large Django monolith | Complex refactoring in a big codebase |
| Rust CLI tool | Language-specific patterns, ownership model |
| React component library | Frontend patterns, component composition |
| Go microservice | Interface-driven design, Go conventions |
The key is diversity — each repo should stress a different aspect of the skill.
Step 2: Create tasks per repo
For each source repo, define 1–3 tasks that exercise your skill. Each task should be realistic and self-contained:
- Django project → “Extract the database layer into a repository pattern”
- React project → “Split the God component into focused components”
- Rust project → “Replace manual error handling with a Result type chain”
- Go project → “Extract the HTTP handler logic into a testable service layer”
Step 3: Define common assessment dimensions
Since the skill is universal, dimensions should be too:
| Dimension | Max score | What it measures |
|---|---|---|
| `correctness` | 30 | Does the refactoring preserve behavior? |
| `idiom_adherence` | 25 | Does it follow language-specific conventions? |
| `code_clarity` | 25 | Is the result cleaner than the input? |
| `scope_discipline` | 20 | Did it change only what was needed? |
Per-task assessment_criteria.md files adapt these dimensions to the specific repo and language context.
Step 4: Scaffold and verify

    nasde init universal-refactoring-skill
    # Add tasks, build Docker images, test verifiers
    nasde run --variant current-skill --tasks django-repo-extract --without-eval  # dry run

Phase 2: Evaluate and iterate

    # Run full benchmark with current skill version
    nasde run --variant v1 --with-opik

    # Make changes to the skill, run again
    nasde run --variant v2 --with-opik

Compare in Opik: did v2 improve Rust scores? Did it regress on Python? The per-task breakdown shows exactly which repos and problem types benefit from the change.
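Questions such as "did v2 improve Rust but regress on Python?" come down to grouping per-task score changes by language. The sketch below shows that grouping under stated assumptions: task totals are already available as dictionaries, and you maintain your own task-to-language mapping. Apart from `django-repo-extract`, the task names are hypothetical, and the scores are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_delta_by_group(
    v1: dict[str, float],
    v2: dict[str, float],
    group_of: dict[str, str],
) -> dict[str, float]:
    """Average (v2 - v1) score change per group, e.g. per language."""
    deltas: dict[str, list[float]] = defaultdict(list)
    for task, old_score in v1.items():
        deltas[group_of[task]].append(v2[task] - old_score)
    return {group: mean(values) for group, values in deltas.items()}


# Invented totals for illustration only.
V1 = {"django-repo-extract": 70, "flask-split-views": 60, "rust-cli-errors": 55}
V2 = {"django-repo-extract": 66, "flask-split-views": 62, "rust-cli-errors": 65}
LANGUAGE = {"django-repo-extract": "python", "flask-split-views": "python",
            "rust-cli-errors": "rust"}

if __name__ == "__main__":
    for language, delta in mean_delta_by_group(V1, V2, LANGUAGE).items():
        print(f"{language}: {delta:+}")
```

With one to three tasks per repo, a group mean rests on very few samples, so read it as a pointer to which tasks to inspect rather than as a verdict.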
Phase 3: Expand coverage
As you discover edge cases (the skill fails on monorepos, or struggles with codebases that have no tests), add new tasks to the benchmark. The suite grows over time, becoming a comprehensive validation of your skill’s capabilities.
What varies, what stays fixed
| Fixed | Varies |
|---|---|
| Task set (grows over time, but existing tasks don’t change) | Skill versions (different CLAUDE.md / skill configurations) |
| Assessment dimensions | Agent models (Sonnet vs Opus) |
| Source repos (diverse, public) | — |
Key difference from UC1
| | UC1: Company skill eval | UC2: Universal skill dev |
|---|---|---|
| Source repos | Your company repos (local/private) | Public repos (diverse) |
| Task origin | Derived from team’s real history | Crafted for cross-cutting diversity |
| Core question | “Do our skills help us?” | “Does this skill help everyone?” |
| Benchmark shape | Narrow, deep — your real problems | Broad, varied — many languages and styles |
| What varies | Multiple skill configs + agents | Versions of one skill under development |
Current constraints
- Curating tasks from diverse public repos is time-consuming: finding the right repos, understanding their structure, writing meaningful tasks.
- Each repo needs its own Dockerfile with potentially different base images, dependencies, and build steps.
- The `nasde-benchmark-from-public-repos` skill addresses both of these (see below).
Skill: nasde-benchmark-from-public-repos
NASDE includes a dedicated skill for curating diverse benchmark suites from public repositories. Instead of manually searching GitHub and scaffolding Dockerfiles for each language, you describe your skill and the tool guides the curation process.
How to use it: Open your benchmark project in Claude Code and describe the skill you’re building — e.g., “I’m building a refactoring skill that should work across Python, TypeScript, Go, and Rust.” The skill activates automatically.
What it does:
- Builds a diversity matrix based on your skill description — axes like language, project size, test coverage, architecture style — and presents it for your approval
- For each cell in the matrix, searches for and proposes public repositories with concrete task ideas
- For each repo+task pair you approve, generates the full task directory: `task.toml` (with pinned commit hash), `instruction.md`, language-appropriate `Dockerfile`, `test.sh`, `assessment_criteria.md`
- After all tasks are created, shows a coverage summary highlighting which matrix cells are filled and where gaps remain
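The diversity matrix and its coverage summary can be pictured as a cross product of axes with some cells filled. The sketch below illustrates the idea only; it is not the skill's implementation. The axis names, axis values, and task entries are hypothetical.

```python
from itertools import product


def coverage_gaps(
    axes: dict[str, list[str]],
    tasks: list[dict[str, str]],
) -> list[tuple[str, ...]]:
    """Return matrix cells (one value per axis) that no task covers yet."""
    axis_names = list(axes)
    filled = {tuple(task[name] for name in axis_names) for task in tasks}
    all_cells = product(*(axes[name] for name in axis_names))
    return [cell for cell in all_cells if cell not in filled]


# Hypothetical matrix: 4 languages x 2 project sizes = 8 cells.
AXES = {
    "language": ["python", "typescript", "go", "rust"],
    "size": ["small", "large"],
}
TASKS = [
    {"task": "django-repo-extract", "language": "python", "size": "large"},
    {"task": "rust-cli-errors", "language": "rust", "size": "small"},
    {"task": "go-service-layer", "language": "go", "size": "small"},
]

if __name__ == "__main__":
    for cell in coverage_gaps(AXES, TASKS):
        print("missing:", dict(zip(AXES, cell)))
```

Every added axis multiplies the number of cells, so a full matrix quickly becomes impractical. Deciding which gaps are acceptable is part of the approval step.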
Key design principle: Task instructions are written to be skill-agnostic. The instruction describes the raw problem; the skill being tested is injected via the variant’s CLAUDE.md. This means the same benchmark can test “with skill” vs “without skill” by simply switching variants.
Relationship to other skills: Like nasde-benchmark-from-history, this is an alternative entry point into the benchmark creation workflow — optimized for the “many repos, one skill” pattern (UC2) rather than the “one repo, many skills” pattern (UC1).
See the full skill reference: .claude/skills/nasde-benchmark-from-public-repos/SKILL.md
UC3: Benchmarking a Claude Code plugin or a single skill (by reference)
Persona
A plugin/skill author whose plugin under test bundles skills and an MCP server (e.g. a knowledge-graph server the skills call). They want to evaluate “with the plugin” vs “vanilla” without freezing a copy of the plugin into the benchmark.
Problem
Before `[nasde.plugin]`, exercising a plugin meant paying a multi-part tax: vendor a frozen snapshot of the entire plugin tree into the benchmark, hand-write a Dockerfile `COPY`, hand-write `[environment.mcp_servers]` with an env-export wrapper, and copy the plugin’s skills into each variant. The snapshot drifted from the live plugin and had to be refreshed by a documented manual procedure. A benchmark testing just one skill still had to copy that skill (and its `references/`) into `variants/<v>/skills/`.
What NASDE enables
Whole plugin, one declaration. In `task.toml`:

    [nasde.plugin]
    path = "../../../src/plugins/my-plugin"
    ref = "abc1234"  # reproducible: builds from this commit
    build = "bun install --frozen-lockfile"

    [nasde.plugin.env]
    CLAUDE_PLUGIN_DATA = "/opt/my-plugin-data"

nasde ships the plugin into the sandbox image (from a git worktree at `ref`), registers the plugin’s own skills for the agent (whole skill dir, including `references/`), and wires its MCP server into the task: no snapshot, no hand-wiring.
Single skill, by reference. When a variant only needs one skill, point at its source in variant.toml instead of copying it:

    agent = "claude"
    model = "claude-sonnet-4-6"

    [[skill]]
    path = "../../../src/plugins/my-plugin/skills/my-skill"
    ref = "abc1234"

The whole skill directory (including `references/`) is staged into the sandbox; nothing is copied into `variants/`.
What varies, what stays fixed
| Fixed | Varies |
|---|---|
| Task set, assessment criteria | with-plugin vs vanilla variant |
| Plugin source (referenced at a pinned `ref`) | Plugin/skill versions under development |
Current constraints
- The plugin must be a local directory containing `.claude-plugin/plugin.json` (the standard Claude Code plugin layout). Its MCP server is read from the plugin’s `.mcp.json`.
- A baked-not-installed plugin needs its MCP-server env set explicitly (`CLAUDE_PLUGIN_ROOT` / `CLAUDE_PLUGIN_DATA` / project dir). nasde supplies sensible defaults; override per plugin via `[nasde.plugin.env]`.
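The layout constraint is easy to verify before a run. The following is a minimal sketch that checks only the two files named above; the function name and messages are illustrative, and treating a missing `.mcp.json` as worth reporting is an assumption (a plugin without an MCP server may legitimately omit it).

```python
from pathlib import Path


def check_plugin_layout(plugin_dir: Path) -> list[str]:
    """Return human-readable issues with a local plugin directory."""
    if not plugin_dir.is_dir():
        return [f"{plugin_dir} is not a directory"]
    issues: list[str] = []
    if not (plugin_dir / ".claude-plugin" / "plugin.json").is_file():
        issues.append("missing .claude-plugin/plugin.json")
    if not (plugin_dir / ".mcp.json").is_file():
        issues.append("no .mcp.json found")
    return issues
```

Calling `check_plugin_layout(Path("../../../src/plugins/my-plugin"))` from the task directory returns an empty list when both files are present.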
See ADR-009 for the design.