Skip to content

A Real Task (DDD example)

Everything in How It Works is easier to grasp on a concrete example. Here is one benchmark task from the repo — examples/ddd-architectural-challenges/tasks/ddd-weather-discount — shown end to end: the agent’s instruction, the assessment criteria, and the resulting scores.

instruction.md — what the coding agent is asked to do

Section titled “instruction.md — what the coding agent is asked to do”

Task — Implement a weather-based discount.

You are working on an e-commerce system built using Domain-Driven Design and hexagonal architecture (.NET 8, C#). Implement a discount that:

  • Checks current weather in Warsaw via the Open-Meteo API.
  • Applies a 10% discount when precipitation > 0.
  • Must be extensible: more weather-based discounts (temperature, wind, UV, humidity) will follow and should plug in without rewrites.

Quality expectations: fit into the existing DDD architecture · handle API failures gracefully (do not break order processing) · write unit and integration tests · follow codebase conventions.

assessment_criteria.md — what the reviewer scores against (excerpt)

Section titled “assessment_criteria.md — what the reviewer scores against (excerpt)”

The criteria spell out what each score means for each dimension. Here is the full ladder for the Domain Modeling dimension — in this benchmark the author chose a 0–25 scale (the scale is entirely up to you: 0–5, 0–10, 0–100, named levels, pass/fail only, whatever fits):

ScoreCriteria
0No domain types for weather — raw HTTP responses or primitives used directly in domain logic.
10Domain types exist for weather, but they leak infrastructure concerns (JSON annotations, HTTP status codes).
15Clean domain types (precipitation as a value object), but discount logic is not modeled as a domain service or policy.
20Good domain modeling and discount as a domain service, but error handling uses infrastructure exceptions instead of domain-appropriate patterns.
25Weather modeled as value objects · discount encapsulated in a domain service/policy · failures handled via domain patterns (Result type, domain exceptions, safe defaults) · domain layer has zero infrastructure dependencies.

Key checks for the reviewer agent:

  • Is there a port / interface for weather data in the domain layer?
  • Does that port use domain types (not HttpResponseMessage, JsonElement)?
  • Is the discount rule inside a domain service / policy, or living in the HTTP adapter?
  • Are failure modes (API down) handled with domain-appropriate defaults?

The full assessment covers four more dimensions the benchmark author picked for this task (Encapsulation · Architecture Compliance · Extensibility · Test Quality), each with its own ladder and checks. Another author would have chosen different dimensions or different scales for the same task.

Results — four agent configurations scored against the same criteria

Section titled “Results — four agent configurations scored against the same criteria”
VariantPassDomain (/25)Encaps. (/20)Arch. (/20)Ext. (/15)Tests (/20)Total (/100)
claude-vanilla75%17.111.216.19.57.761.6
claude-guided (with a DDD skill)75%17.412.416.610.08.765.1
codex-vanilla89%18.813.816.811.48.769.4
codex-guided (same skill)50%11.59.612.97.46.047.4

The insight: the same “DDD guidance” skill helps Claude a little (+3.5) and badly hurts Codex (-22). The per-dimension breakdown pinpoints where Codex regresses — domain modeling, encapsulation, extensibility — which would be invisible without this assessment. Skill optimization is agent-specific.

Deep dive — does a public skill, and tuning it, actually help?

Section titled “Deep dive — does a public skill, and tuning it, actually help?”

A separate study took a public DDD skill (the tactical-ddd skill from ntcoding/claude-skillz) and its repo-tuned version across four configurations on two deliberately different tasks — a feature on a clean DDD codebase and a legacy anemic→rich refactor. The headline: a repo-tuned skill measurably beats the bare model on both tasks (+0.12 on the clean feature, +0.05 on the legacy refactor — increment over vanilla, both clearing our significance bar), and it also beats hand-written DDD hints. But an off-the-shelf public skill helps only on the greenfield feature (+0.07) — on the legacy refactor it doesn’t beat the bare model at all. Two lessons that generalize: judge per dimension, not one aggregate (a real architecture gain can hide inside a flat average); and a skill present on disk is not a skill used — verify it activated.

→ Full tables, per-dimension radars, and token/time charts in Benchmark Results.

  • Refactoring katas (Java + Python) — four classic refactorings scored on behavior preservation, clarity, technique, scope discipline. Takeaway: a candidate “refactoring skill” didn’t move the score — shipping it would have been based on vibes.
  • Project-specific skill validation (NASDE’s own repo) — one task pulled from NASDE’s git history; four skill combinations tested. Takeaway: the testing-discipline skill alone raised pass rate from 67% → 100%; the “full-stack, everything-on” variant scored worse than vanilla.

See Benchmark Results for the full tables and methodology, and Use Cases for the end-to-end walkthrough of building a benchmark like these yourself.