Feb 28, 2026

Inside Claude Code skill-creator: A Quality Evaluation Architecture

TL;DR

Claude Code’s skill-creator has two evaluation systems: one for skill output quality and one for description trigger accuracy. The first is a multi-stage pipeline: parallel with/without-skill execution → automated grading → optional blind comparison → human review. The second is an automated loop that improves descriptions using a train/test split plus extended thinking. What makes it interesting is that this recursive structure of “evaluating LLM output with LLMs” embeds classical techniques from machine learning and software testing, such as overfitting prevention and bias removal.

Background: How Do You Evaluate Things Made by LLMs?

Claude Code Skills package knowledge and procedures for specific tasks. skill-creator is a meta-skill that automatically generates and improves these Skills, and internally it carries a serious quality evaluation pipeline.

How to measure the quality of LLM-generated artifacts and feed that into an improvement loop is a question being asked everywhere in engineering today. Reading through the skill-creator source (~/.claude/skills/skill-creator/) reveals one concrete answer.

skill-creator has two kinds of quality evaluation.

Skill output quality evaluation — comparing outputs with and without the skill applied
Description trigger accuracy evaluation — verifying that SKILL.md’s description field fires on appropriate queries and stays silent on unrelated ones

Let’s look at each.

1. Skill Output Quality Evaluation

Overall Flow

First, the big picture.

flowchart TD
    A["Create test cases"] --> B["Parallel execution<br>with/without skill"]
    B --> C["Grader Agent<br>auto-grading"]
    C --> D["Aggregation<br>benchmark.json"]
    D --> E["Analyzer<br>pattern analysis"]
    E --> F["Human review<br>HTML viewer"]
    F -->|"feedback"| G["Skill improvement"]
    G -->|"next iteration"| A

Each test case is run in parallel with and without the skill applied; the outputs are then auto-graded, aggregated, pattern-analyzed, and finally human-reviewed. The loop repeats to improve the skill.

Test Case Execution: The A/B Testing Foundation

For each test case, two sub-agents are launched in parallel.

with_skill: runs with the skill applied
without_skill (or old_skill): runs as the baseline

Outputs are saved under iteration-N/eval-ID/{with_skill,without_skill}/outputs/, and total_tokens and duration_ms at task completion are recorded in timing.json.
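To make the layout concrete, here is a minimal sketch of how one run's directory and timing.json could be written. The helper names (run_dir, save_timing), the literal "eval-" prefix, and the choice to put timing.json next to outputs/ are my assumptions for illustration, not taken from the skill-creator source.

```python
import json
from pathlib import Path


def run_dir(root: Path, iteration: int, eval_id: str, config: str) -> Path:
    """Return iteration-N/eval-ID/<config>/ under root (directory naming is assumed)."""
    return root / f"iteration-{iteration}" / f"eval-{eval_id}" / config


def save_timing(run: Path, total_tokens: int, duration_ms: int) -> Path:
    """Create outputs/ and write timing.json beside it (placement is assumed)."""
    (run / "outputs").mkdir(parents=True, exist_ok=True)
    path = run / "timing.json"
    payload = {"total_tokens": total_tokens, "duration_ms": duration_ms}
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Calling save_timing once for with_skill and once for without_skill gives two files per test case that a later aggregation step can diff.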

Notice that this isn’t just PASS/FAIL — token count and execution time are also recorded. If the skill improves quality but at 10x the token cost, that’s a practical problem. The mechanism for quantifying the quality–cost tradeoff is built in from the start.

Grader Agent: Critiquing Not Just Outputs but the Eval Itself

The Grader Agent (agents/grader.md) does automated grading, and on top of assertion PASS/FAIL judgments performs three additional quality checks.

Check	Content
Claims extraction & verification	Extracts implicit claims from the output, classifies them into fact/process/quality types, and verifies them
User Notes reference	Considers uncertainties or workarounds the executor recorded
Self-critique of the eval	Provides feedback like “this assertion is too weak” or “an important outcome isn’t covered”

The third, self-critique of the eval, is especially interesting. It’s a recursive structure in which the grader evaluates the quality of the very test it is applying. By having the Grader emit feedback like “this assertion has no discriminating power” or “a more important angle is missing,” the evaluation criteria themselves enter the improvement loop.

The output is saved as grading.json with four fields: expectations, summary, claims, and eval_feedback.
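As a rough picture of that file, the sketch below assembles a dictionary with the four top-level fields. Only the four field names come from the text above; everything nested inside them (passed, pass_rate, and so on) and the make_grading helper are hypothetical.

```python
def make_grading(expectations: list[dict], claims: list[dict], eval_feedback: list[str]) -> dict:
    """Assemble a grading.json-like dict; the inner schema is assumed, not the real one."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e["passed"])
    return {
        "expectations": expectations,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            "pass_rate": passed / total if total else 0.0,
        },
        "claims": claims,
        "eval_feedback": eval_feedback,
    }
```

Keeping eval_feedback in the same file as the verdicts means a weak assertion is flagged right next to the result it produced.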

Aggregation and Pattern Analysis

scripts/aggregate_benchmark.py computes mean ± stddev, min, and max across all runs and calculates the deltas (pass_rate, time_seconds, tokens) between with_skill and without_skill. The output is benchmark.json and benchmark.md.
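The aggregation itself is simple arithmetic. The following is a sketch of the idea, not the contents of aggregate_benchmark.py: the function names and the input shape (a dict of per-run lists keyed by metric) are assumptions.

```python
from statistics import mean, stdev

METRICS = ("pass_rate", "time_seconds", "tokens")


def summarize(values: list[float]) -> dict:
    """Mean, sample stddev, min, and max across runs (stddev is 0.0 for a single run)."""
    return {
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def delta(with_skill: dict, without_skill: dict) -> dict:
    """Difference of means (with_skill minus without_skill) for each metric."""
    return {m: mean(with_skill[m]) - mean(without_skill[m]) for m in METRICS}
```

A positive pass_rate delta alongside a large positive tokens delta is exactly the quality-versus-cost tradeoff the benchmark is meant to expose.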

That’s the quantitative aggregation. Next, the Analyzer Agent (agents/analyzer.md) detects patterns that statistics alone can’t surface.

Assertions that always pass under both configs → no discriminating power (meaningless as a test)
Assertions that always fail under both configs → outside the model’s capability, or the test itself is broken
High-variance evals → possibly flaky tests
Time/token tradeoffs → did quality improve at the cost of excessive resource use?

Automatic flaky-test detection is unglamorous but important. LLM outputs are fundamentally probabilistic, so the same prompt produces variable results. Measuring that variance, and comparing configurations by mean ± stddev rather than by a single run, is the foundation of this system.
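The Analyzer is an LLM agent, so there is no fixed rule set to quote. Still, the first three patterns can be expressed deterministically, which helps show what the agent is looking for. The function names and the 0.3 variance threshold below are assumptions chosen for illustration.

```python
from statistics import pstdev


def classify_assertion(with_results: list[bool], without_results: list[bool]) -> str:
    """Label one assertion by how it behaves across both configurations."""
    results = with_results + without_results
    if not results:
        raise ValueError("no runs recorded for this assertion")
    if all(results):
        return "non_discriminating"  # always passes: tells us nothing about the skill
    if not any(results):
        return "always_fails"  # beyond the model, or the test is broken
    return "discriminating"


def is_flaky(pass_rates: list[float], threshold: float = 0.3) -> bool:
    """Flag an eval whose pass rate varies widely between runs (threshold is assumed)."""
    return pstdev(pass_rates) > threshold
```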

As an option, blind comparison via agents/comparator.md and agents/analyzer.md is also provided.

Blind Comparator: presents two outputs as A/B, hiding which came from which skill
Performs rubric-based scoring (Content: accuracy/completeness/precision, Structure: organization/format/usability, each 1–5 points)
After picking a winner, the Post-hoc Analyzer examines “why it won” and generates improvement suggestions

Hiding “which side is the skill version” is a standard tactic for removing the confirmation bias that LLM-based evaluation tends to fall into. An LLM judge tends to give higher scores when told “this is the improved version,” so this countermeasure makes sense.
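The blinding step can be sketched in a few lines: shuffle the two outputs, show them to the judge only as A and B, and keep the mapping aside until a winner is picked. This is an illustration of the technique under my own naming (blind_pair, unblind), not the comparator's actual code.

```python
import random


def blind_pair(with_skill_output: str, without_skill_output: str, rng: random.Random):
    """Return (shown, key): shown goes to the judge, key stays hidden until scoring ends."""
    pair = [("with_skill", with_skill_output), ("without_skill", without_skill_output)]
    rng.shuffle(pair)  # random assignment so position carries no information
    shown = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return shown, key


def unblind(winner_label: str, key: dict) -> str:
    """Map the judge's pick (A or B) back to the configuration that produced it."""
    return key[winner_label]
```

Passing an explicit random.Random instance keeps the assignment reproducible when a comparison needs to be re-examined.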

Human Review: A Human Looks at It in the End

eval-viewer/generate_review.py generates an HTML viewer. It has two tabs: the Outputs tab shows per-test-case outputs, a feedback input field, and comparison with the previous output. The Benchmark tab shows quantitative summary statistics and notes from the analyzer.

Feedback is saved to feedback.json, and an empty feedback is treated as “no problems.” Rather than fully automating, the design explicitly carves out a point where human judgment is folded in — a pragmatic choice.
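The “empty means no problems” rule is easy to consume downstream. In the sketch below, the shape of feedback.json (a flat mapping from eval ID to free text) is an assumption I made for the example; the real file may be structured differently.

```python
import json


def needs_attention(feedback_json: str) -> list[str]:
    """Return the eval IDs whose feedback is non-empty; blank feedback means no problems."""
    feedback = json.loads(feedback_json)
    return [eval_id for eval_id, text in feedback.items() if text.strip()]
```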

2. Description Trigger Accuracy Evaluation

Why Description Matters

Claude Code Skills include a mechanism that automatically triggers them based on user queries. The accuracy of that trigger is determined by the description field in SKILL.md. If the description is vague, the skill won’t fire on relevant queries; if it’s too broad, it fires on irrelevant ones.

How Trigger Detection Works

scripts/run_eval.py is at the core. Here’s the processing flow.

sequenceDiagram
    participant Eval as run_eval.py
    participant CLI as claude -p
    participant Stream as Stream parser

    Eval->>CLI: Execute query (stream-json)
    CLI->>Stream: Stream events
    Stream->>Eval: Detect tool_use
    Note over Eval: Run each query 3 times
    Eval->>Eval: Compute trigger_rate
    Note over Eval: Trigger if above threshold 0.5

Each query is run 3 times by default to compute the trigger_rate. If it exceeds the threshold (default 0.5), it’s judged as a trigger. The design absorbs probabilistic noise via repeated execution.
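In code, the decision rule comes down to a ratio and a comparison. The sketch below assumes each run has already been reduced to a boolean (did the skill fire or not); the function names are mine, and the strict “greater than” follows the wording “exceeds the threshold.”

```python
def trigger_rate(fired: list[bool]) -> float:
    """Fraction of runs in which the skill fired."""
    return sum(fired) / len(fired)


def is_triggered(fired: list[bool], threshold: float = 0.5) -> bool:
    """A query counts as triggering when its rate exceeds the threshold."""
    return trigger_rate(fired) > threshold


def query_passes(fired: list[bool], should_trigger: bool, threshold: float = 0.5) -> bool:
    """The eval passes when observed behavior matches the expected label."""
    return is_triggered(fired, threshold) == should_trigger
```

With three runs and a 0.5 threshold, two fires out of three count as a trigger and one out of three does not, so a single noisy run cannot flip the verdict.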

Optimization Loop: Train/Test Split to Prevent Overfitting

scripts/run_loop.py runs the optimization loop.

flowchart TD
    A["eval set"] --> B["60/40 train/test split<br>(stratified)"]
    B --> C["Trigger evaluation"]
    C -->|"all pass on train"| D["Early termination"]
    C -->|"failures present"| E["Description improvement"]
    E --> F["Generate new description"]
    F --> C

The key design point here is that the best description is picked by test score. Even if performance improves on train, a candidate whose test score drops is not adopted. This is the most basic form of overfitting prevention in machine learning, and exactly the same problem occurs in LLM prompt optimization. A description over-fit to specific query examples becomes brittle on unseen queries.

The use of stratified splitting also matters: it keeps the should_trigger=true vs. false ratio balanced between train and test.
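Both ideas, the stratified 60/40 split and selection by test score, fit in a short sketch. This is not run_loop.py; the function names, the record fields other than should_trigger, and the rounding rule are assumptions.

```python
import random


def stratified_split(evals: list[dict], train_frac: float = 0.6, seed: int = 0):
    """Split positives and negatives separately so both sets keep the same label ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [e for e in evals if e["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train += group[:cut]
        test += group[cut:]
    return train, test


def pick_best(candidates: list[dict]) -> dict:
    """Choose the description with the highest held-out score, ignoring train score."""
    return max(candidates, key=lambda c: c["test_score"])
```

A candidate that scores 1.0 on train but 0.5 on test loses to one that scores 0.8 and 0.75, which is the behavior the loop relies on.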

Description Improvement: Generalizing with Extended Thinking

scripts/improve_description.py calls the Anthropic API directly to improve the description. It uses extended thinking (10,000-token budget) and includes the following in the prompt.

Current description
Failed triggers (should_trigger=true but didn’t fire)
False triggers (should_trigger=false but fired)
History of past attempts (but test scores are hidden)
Skill body

Note that test scores are excluded from the past-attempt history. This prevents the model from indirectly fitting to the test set.
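The sketch below shows one way the prompt could be assembled while keeping test results out of it. It deliberately stops before the API call; the prompt wording, the function names, and the convention that held-out fields start with "test_" are all hypothetical.

```python
def history_for_prompt(attempts: list[dict]) -> list[dict]:
    """Drop every field starting with test_ so held-out results never reach the model."""
    return [{k: v for k, v in a.items() if not k.startswith("test_")} for a in attempts]


def build_prompt(current: str, failed_triggers: list[str], false_triggers: list[str],
                 attempts: list[dict], skill_body: str) -> str:
    """Assemble the improvement prompt from the five inputs listed above."""
    lines = ["Current description:", current, "", "Failed triggers (should have fired):"]
    lines += [f"- {q}" for q in failed_triggers]
    lines += ["", "False triggers (should not have fired):"]
    lines += [f"- {q}" for q in false_triggers]
    lines += ["", "Past attempts:"]
    for a in history_for_prompt(attempts):
        lines.append(f"- {a['description']} (train score: {a['train_score']})")
    lines += ["", "Skill body:", skill_body]
    return "\n".join(lines)
```

Filtering at the point where history is serialized, rather than trusting each caller to omit the field, makes it harder to leak a test score by accident.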

The generated description has constraints imposed.

Stay between 100–200 words
Generalize to categories of intent rather than enumerating specific queries
If it exceeds 1024 characters, shorten it again via the API

The constraint “generalize to categories of intent rather than enumerating specific queries” is smart. Listing concrete examples like “fire when the user says ‘write a test’” can’t handle queries the model has never seen. Abstract category descriptions like “provide assistance related to test automation” generalize better to unknown queries.
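The two length constraints above can be checked mechanically before deciding whether another shortening pass is needed. This is a sketch with assumed names (check_description, MAX_CHARS); it counts words by whitespace splitting, which may not match how the real script counts them.

```python
MAX_CHARS = 1024  # character limit mentioned above
WORD_RANGE = (100, 200)  # target word range mentioned above


def check_description(text: str) -> dict:
    """Report word count and whether the description needs a shortening pass."""
    words = len(text.split())
    return {
        "word_count": words,
        "within_word_range": WORD_RANGE[0] <= words <= WORD_RANGE[1],
        "needs_shortening": len(text) > MAX_CHARS,
    }
```

Note that the two limits are independent: a description can sit inside the word range and still exceed the character limit if its words are long.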

Summary of Design Patterns

The design patterns that can be extracted from skill-creator’s evaluation architecture are organized below.

Aspect	Technique	Why it matters
Overfitting prevention	Train/test split, pick best by test score	Overfitting also happens in prompt optimization
Statistical reliability	Run each query multiple times, evaluate by mean ± stddev	LLM outputs are probabilistically noisy
Integrating human judgment	HTML viewer + feedback.json	Full automation has a low ceiling on quality
Bias removal	Blind comparison (present outputs as A/B, hide which one used the skill)	LLMs grade leniently when told “this is the improved version”
Eval quality management	Grader points out weaknesses in the assertions themselves	Good results from a bad test are meaningless
Improvement automation	Iteratively improve descriptions with extended thinking	Humans rewriting every time doesn’t scale
Cost measurement	Always record token count and execution time	Watch that quality gains don’t come with a cost explosion

These patterns generalize beyond skill-creator and apply to LLM-based system quality evaluation in general.

Closing Thoughts

What struck me looking at skill-creator’s evaluation architecture was how directly classical machine learning and software testing techniques apply to the new problem of “evaluating LLMs.” Train/test splits, blind comparison, flaky-test detection, meta-evaluation of test quality — none of these are novel concepts, but combined in the LLM context they form a strong quality assurance pipeline.

At the same time, what’s also striking is that human review remains an explicit step. The recursive structure of “using LLMs to evaluate LLMs” has fundamental limits, and the pragmatic acceptance that the final quality call is made by a human feels right.

Rather than aiming for full automation, clearly delineate what can be automated from what needs human judgment. This design philosophy probably applies to LLM-based tool development in general.
