A Real Task (DDD example)
Everything in How It Works is easier to grasp with a concrete example. Here is one benchmark task from the repo, examples/ddd-architectural-challenges/tasks/ddd-weather-discount, shown end to end: the agent’s instruction, the assessment criteria, and the resulting scores.
instruction.md — what the coding agent is asked to do
Task: implement a weather-based discount.
You are working on an e-commerce system built using Domain-Driven Design and hexagonal architecture (.NET 8, C#). Implement a discount that:
- Checks current weather in Warsaw via the Open-Meteo API.
- Applies a 10% discount when `precipitation > 0`.
- Must be extensible: more weather-based discounts (temperature, wind, UV, humidity) will follow and should plug in without rewrites.
Quality expectations: fit into the existing DDD architecture · handle API failures gracefully (do not break order processing) · write unit and integration tests · follow codebase conventions.
assessment_criteria.md — what the reviewer scores against (excerpt)
The criteria spell out what each score means for each dimension. Here is the full ladder for the Domain Modeling dimension. In this benchmark the author chose a 0–25 scale (the scale is entirely up to you: 0–5, 0–10, 0–100, named levels, pass/fail only, whatever fits):
| Score | Criteria |
|---|---|
| 0 | No domain types for weather — raw HTTP responses or primitives used directly in domain logic. |
| 10 | Domain types exist for weather, but they leak infrastructure concerns (JSON annotations, HTTP status codes). |
| 15 | Clean domain types (precipitation as a value object), but discount logic is not modeled as a domain service or policy. |
| 20 | Good domain modeling and discount as a domain service, but error handling uses infrastructure exceptions instead of domain-appropriate patterns. |
| 25 | Weather modeled as value objects · discount encapsulated in a domain service/policy · failures handled via domain patterns (Result type, domain exceptions, safe defaults) · domain layer has zero infrastructure dependencies. |
Key checks for the reviewer agent:
- Is there a port / interface for weather data in the domain layer?
- Does that port use domain types (not `HttpResponseMessage`, `JsonElement`)?
- Is the discount rule inside a domain service / policy, or living in the HTTP adapter?
- Are failure modes (API down) handled with domain-appropriate defaults?
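To make the top rung of the ladder concrete, here is a minimal sketch of the shape those checks look for: a value object, a domain-owned port, and a policy that falls back to a safe default. It is written in Python for brevity, whereas the benchmark task itself targets C#. Every name here (`Precipitation`, `WeatherPort`, `RainDiscountPolicy`, `FixedWeather`) is hypothetical and not taken from the benchmark codebase, and the millimetre unit is an assumption.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional, Protocol


@dataclass(frozen=True)
class Precipitation:
    """Value object: precipitation amount (assumed millimetres), never negative."""

    millimetres: Decimal

    def __post_init__(self) -> None:
        if self.millimetres < 0:
            raise ValueError("precipitation cannot be negative")

    def is_falling(self) -> bool:
        return self.millimetres > 0


class WeatherPort(Protocol):
    """Port owned by the domain layer. It speaks domain types only.

    Adapters return None when the weather is unavailable, so no HTTP or
    JSON detail ever crosses into the domain.
    """

    def current_precipitation(self, city: str) -> Optional[Precipitation]: ...


class RainDiscountPolicy:
    """Domain policy: 10% off while it is raining, otherwise no discount."""

    RATE = Decimal("0.10")

    def __init__(self, weather: WeatherPort, city: str = "Warsaw") -> None:
        self._weather = weather
        self._city = city

    def discount_for(self, order_total: Decimal) -> Decimal:
        reading = self._weather.current_precipitation(self._city)
        # Safe default: an unavailable reading means no discount,
        # so a weather outage never blocks order processing.
        if reading is None or not reading.is_falling():
            return Decimal("0")
        return order_total * self.RATE


class FixedWeather:
    """Hypothetical in-memory adapter, standing in for the real HTTP adapter."""

    def __init__(self, reading: Optional[Precipitation]) -> None:
        self._reading = reading

    def current_precipitation(self, city: str) -> Optional[Precipitation]:
        return self._reading


if __name__ == "__main__":
    rainy = RainDiscountPolicy(FixedWeather(Precipitation(Decimal("1.2"))))
    api_down = RainDiscountPolicy(FixedWeather(None))
    print(rainy.discount_for(Decimal("200")))     # 20.00
    print(api_down.discount_for(Decimal("200")))  # 0
```

In this sketch the real Open-Meteo adapter would live outside the domain layer and implement the same port, which is what the first two checks probe for.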
The full assessment covers four more dimensions the benchmark author picked for this task (Encapsulation · Architecture Compliance · Extensibility · Test Quality), each with its own ladder and checks. Another author would have chosen different dimensions or different scales for the same task.
Results — four agent configurations scored against the same criteria
| Variant | Pass | Domain (/25) | Encaps. (/20) | Arch. (/20) | Ext. (/15) | Tests (/20) | Total (/100) |
|---|---|---|---|---|---|---|---|
| claude-vanilla | 75% | 17.1 | 11.2 | 16.1 | 9.5 | 7.7 | 61.6 |
| claude-guided (with a DDD skill) | 75% | 17.4 | 12.4 | 16.6 | 10.0 | 8.7 | 65.1 |
| codex-vanilla | 89% | 18.8 | 13.8 | 16.8 | 11.4 | 8.7 | 69.4 |
| codex-guided (same skill) | 50% | 11.5 | 9.6 | 12.9 | 7.4 | 6.0 | 47.4 |
The insight: the same “DDD guidance” skill helps Claude a little (+3.5 points on the total) and badly hurts Codex (−22.0). The per-dimension breakdown pinpoints where Codex regresses most (domain modeling, encapsulation, extensibility), which would be invisible without this assessment. Skill optimization is agent-specific.
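The per-dimension effect of the skill can be recomputed directly from the results table. The sketch below hard-codes the table's per-dimension scores. `DIMENSIONS`, `SCORES`, and `skill_delta` are illustrative names, not part of NASDE.

```python
# Per-dimension scores copied from the results table above.
DIMENSIONS = ("Domain", "Encaps.", "Arch.", "Ext.", "Tests")
SCORES = {
    "claude-vanilla": (17.1, 11.2, 16.1, 9.5, 7.7),
    "claude-guided": (17.4, 12.4, 16.6, 10.0, 8.7),
    "codex-vanilla": (18.8, 13.8, 16.8, 11.4, 8.7),
    "codex-guided": (11.5, 9.6, 12.9, 7.4, 6.0),
}


def skill_delta(agent: str) -> dict[str, float]:
    """Return guided minus vanilla for each dimension, rounded to 0.1."""
    vanilla = SCORES[f"{agent}-vanilla"]
    guided = SCORES[f"{agent}-guided"]
    return {
        dim: round(g - v, 1)
        for dim, v, g in zip(DIMENSIONS, vanilla, guided)
    }


if __name__ == "__main__":
    for agent in ("claude", "codex"):
        deltas = skill_delta(agent)
        # List the largest regressions first.
        ranked = sorted(deltas.items(), key=lambda item: item[1])
        print(agent, ranked)
```

For Codex the three largest drops are Domain (−7.3), Encaps. (−4.2), and Ext. (−4.0), while every Claude dimension moves up by between 0.3 and 1.2. The Total column is taken from the table rather than re-summed here, since the displayed per-dimension values are rounded.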
Deep dive — does a public skill, and tuning it, actually help?
A separate study compared a public DDD skill (the tactical-ddd skill from ntcoding/claude-skillz) with its repo-tuned version, across four configurations and two deliberately different tasks: a feature on a clean DDD codebase and a legacy anemic→rich refactor. The headline: a repo-tuned skill measurably beats the bare model on both tasks (+0.12 on the clean feature, +0.05 on the legacy refactor; both are increments over vanilla and both clear our significance bar), and it also beats hand-written DDD hints. An off-the-shelf public skill, by contrast, helps only on the greenfield feature (+0.07); on the legacy refactor it does not beat the bare model at all. Two lessons generalize. First, judge per dimension rather than by one aggregate, because a real architecture gain can hide inside a flat average. Second, a skill present on disk is not a skill used, so verify that it activated.
→ Full tables, per-dimension radars, and token/time charts in Benchmark Results.
More benchmarks in the repo
- Refactoring katas (Java + Python): four classic refactorings scored on behavior preservation, clarity, technique, and scope discipline. Takeaway: a candidate “refactoring skill” didn’t move the score, so shipping it would have been based on vibes.
- Project-specific skill validation (NASDE’s own repo): one task pulled from NASDE’s git history, with four skill combinations tested. Takeaway: the testing-discipline skill alone raised the pass rate from 67% to 100%; the “full-stack, everything-on” variant scored worse than vanilla.
See Benchmark Results for the full tables and methodology, and Use Cases for the end-to-end walkthrough of building a benchmark like these yourself.