Inside Claude Code skill-creator: A Quality Evaluation Architecture
TL;DR
Claude Code’s skill-creator has two evaluation systems: skill output quality and description trigger accuracy. The first is a multi-stage pipeline of parallel with/without-skill execution → automated grading → blind comparison → human review. The second is an automated loop that improves descriptions using a train/test split plus extended thinking. What makes it interesting is that this recursive structure of “evaluating LLM output with LLMs” embeds classical techniques from machine learning and software testing — overfitting prevention, bias removal, and so on.
Background: How Do You Evaluate Things Made by LLMs?
Claude Code Skills package knowledge and procedures for specific tasks. skill-creator is a meta-skill that automatically generates and improves these Skills, and internally it carries a serious quality evaluation pipeline.
How to measure the quality of LLM-generated artifacts and feed that into an improvement loop is a question being asked everywhere in engineering today. Reading through the skill-creator source (~/.claude/skills/skill-creator/) reveals one concrete answer.
skill-creator has two kinds of quality evaluation.
- Skill output quality evaluation — comparing outputs with and without the skill applied
- Description trigger accuracy evaluation — verifying that SKILL.md’s description field fires on appropriate queries and stays silent on unrelated ones
Let’s look at each.
1. Skill Output Quality Evaluation
Overall Flow
First, the big picture.
flowchart TD
A["Create test cases"] --> B["Parallel execution<br>with/without skill"]
B --> C["Grader Agent<br>auto-grading"]
C --> D["Aggregation<br>benchmark.json"]
D --> E["Analyzer<br>pattern analysis"]
E --> F["Human review<br>HTML viewer"]
F -->|"feedback"| G["Skill improvement"]
G -->|"next iteration"| A
For each test case, the skill is run in parallel with and without it applied, then the outputs are auto-graded, aggregated, pattern-analyzed, and finally human-reviewed. This loop is iterated to improve the skill.
Test Case Execution: The A/B Testing Foundation
For each test case, two sub-agents are launched in parallel.
- with_skill: runs with the skill applied
- without_skill (or old_skill): runs as the baseline
Outputs are saved under iteration-N/eval-ID/{with_skill,without_skill}/outputs/, and total_tokens and duration_ms at task completion are recorded in timing.json.
Notice that this isn’t just PASS/FAIL — token count and execution time are also recorded. If the skill improves quality but at 10x the token cost, that’s a practical problem. The mechanism for quantifying the quality–cost tradeoff is built in from the start.
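To make the layout and the cost measurement concrete, here is a minimal sketch. It assumes that timing.json sits next to the outputs/ folder and that the eval directory is named eval-<id>. Both are guesses based on the path pattern above, and run_dir, save_timing, and cost_ratio are hypothetical helper names, not functions from skill-creator.

```python
import json
import tempfile
from pathlib import Path


def run_dir(root: Path, iteration: int, eval_id: str, config: str) -> Path:
    """Build iteration-N/eval-ID/<config>/ (the eval-<id> naming is an assumption)."""
    return root / f"iteration-{iteration}" / f"eval-{eval_id}" / config


def save_timing(directory: Path, total_tokens: int, duration_ms: int) -> Path:
    """Write timing.json beside outputs/ (the exact location is an assumption)."""
    (directory / "outputs").mkdir(parents=True, exist_ok=True)
    path = directory / "timing.json"
    path.write_text(json.dumps({"total_tokens": total_tokens, "duration_ms": duration_ms}))
    return path


def cost_ratio(with_skill: dict, without_skill: dict) -> float:
    """How many times more tokens the skill run used than the baseline."""
    return with_skill["total_tokens"] / without_skill["total_tokens"]


# Illustrative numbers only, not measured results.
root = Path(tempfile.mkdtemp())
paths = {}
for config, tokens, ms in [("with_skill", 8400, 23000), ("without_skill", 5100, 15000)]:
    paths[config] = save_timing(run_dir(root, 1, "0", config), tokens, ms)

timings = {name: json.loads(p.read_text()) for name, p in paths.items()}
ratio = cost_ratio(timings["with_skill"], timings["without_skill"])
```

A ratio well above 1 is the signal to ask whether the quality gain justifies the extra spend.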
Grader Agent: Critiquing Not Just Outputs but the Eval Itself
The Grader Agent (agents/grader.md) does automated grading, and on top of assertion PASS/FAIL judgments performs three additional quality checks.
| Check | Content |
|---|---|
| Claims extraction & verification | Extracts implicit claims from the output, classifies them into fact/process/quality types, and verifies them |
| User Notes reference | Considers uncertainties or workarounds the executor recorded |
| Self-critique of the eval | Provides feedback like “this assertion is too weak” or “an important outcome isn’t covered” |
The third — self-critique of the eval — is especially interesting. It’s a recursive structure where the test evaluates its own quality. By having the Grader emit feedback like “this assertion has no discriminating power” or “a more important angle is missing,” the evaluation criteria themselves enter the improvement loop.
The output is saved as grading.json with four fields: expectations, summary, claims, and eval_feedback.
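As a rough picture of what such a file might contain, the sketch below fills in the four top-level fields with made-up content. Only the four field names come from the source. Every nested key (text, passed, evidence, type, verified, suggestions) is an assumption for illustration, and missing_fields is a hypothetical helper.

```python
import json

# Only the four top-level keys are taken from the article; the nested
# structure is an illustrative guess, not the real schema.
grading = {
    "expectations": [
        {"text": "Output is a valid CSV file", "passed": True,
         "evidence": "Header row plus 12 data rows"},
    ],
    "summary": {"passed": 1, "failed": 0, "total": 1},
    "claims": [
        {"claim": "All rows were deduplicated", "type": "process", "verified": False},
    ],
    "eval_feedback": {
        "suggestions": [
            "The CSV assertion passes for any non-empty file; check column names too.",
        ],
    },
}

REQUIRED_FIELDS = {"expectations", "summary", "claims", "eval_feedback"}


def missing_fields(doc: dict) -> set:
    """Return the required top-level fields that are absent from a grading document."""
    return REQUIRED_FIELDS - set(doc)


serialized = json.dumps(grading, indent=2)
```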
Aggregation and Pattern Analysis
scripts/aggregate_benchmark.py computes mean ± stddev, min, and max across all runs and calculates the deltas (pass_rate, time_seconds, tokens) between with_skill and without_skill. The output is benchmark.json and benchmark.md.
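The arithmetic involved is simple. The following is a sketch of the same kind of aggregation, not the contents of aggregate_benchmark.py: the input shape, the function names summarize and delta, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev


def summarize(values: list) -> dict:
    """Mean, sample standard deviation, min, and max for one metric."""
    return {
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def delta(runs: dict, metric: str) -> float:
    """Mean of the metric with the skill minus the mean without it."""
    with_mean = mean(r[metric] for r in runs["with_skill"])
    without_mean = mean(r[metric] for r in runs["without_skill"])
    return with_mean - without_mean


# Made-up runs for illustration.
runs = {
    "with_skill": [
        {"pass_rate": 1.0, "tokens": 8000},
        {"pass_rate": 0.75, "tokens": 9000},
        {"pass_rate": 1.0, "tokens": 10000},
    ],
    "without_skill": [
        {"pass_rate": 0.5, "tokens": 5000},
        {"pass_rate": 0.5, "tokens": 6000},
        {"pass_rate": 0.75, "tokens": 7000},
    ],
}

token_stats = summarize([r["tokens"] for r in runs["with_skill"]])
pass_rate_delta = delta(runs, "pass_rate")
token_delta = delta(runs, "tokens")
```

Here the skill raises the mean pass rate by about a third while costing 3,000 more tokens per run, which is exactly the kind of tradeoff the deltas are meant to expose.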
That’s the quantitative aggregation. Next, the Analyzer Agent (agents/analyzer.md) detects patterns that statistics alone can’t surface.
- Assertions that always pass under both configs → no discriminating power (meaningless as a test)
- Assertions that always fail under both configs → outside the model’s capability, or the test itself is broken
- High-variance evals → possibly flaky tests
- Time/token tradeoffs → did quality improve at the cost of excessive resource use?
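The Analyzer is an LLM agent driven by a prompt, so there is no such code in the source. Still, the first three patterns can be approximated with plain rules, which helps show what each one means. In this sketch the function names and the 0.25 variance threshold are assumptions.

```python
from statistics import pstdev


def classify_assertion(with_results: list, without_results: list) -> str:
    """Label an assertion by how its pass/fail results behave across both configs."""
    all_results = with_results + without_results
    if all(all_results):
        return "non_discriminating"  # passes everywhere, so it tells us nothing
    if not any(all_results):
        return "always_fails"  # beyond the model, or the test itself is broken
    return "discriminating"


def is_flaky(pass_rates: list, threshold: float = 0.25) -> bool:
    """Flag an eval whose pass rate swings widely between repeated runs."""
    return len(pass_rates) > 1 and pstdev(pass_rates) > threshold


labels = [
    classify_assertion([True, True], [True, True]),
    classify_assertion([False, False], [False, False]),
    classify_assertion([True, True], [False, True]),
]
```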
Automatic flaky-test detection is unglamorous but important. LLM outputs are inherently probabilistic, so the same prompt produces different results from run to run. Measuring that variance, and judging results by mean and spread rather than by any single run, is the foundation of this system.
Blind Comparison: Removing Bias
As an option, blind comparison via agents/comparator.md and agents/analyzer.md is also provided.
- Blind Comparator: presents two outputs as A/B, hiding which came from which skill
- Performs rubric-based scoring (Content: accuracy/completeness/precision, Structure: organization/format/usability, each 1–5 points)
- After picking a winner, the Post-hoc Analyzer examines “why it won” and generates improvement suggestions
Hiding which side is the skill version is a standard way to remove the confirmation bias that LLM-based evaluation is prone to. An LLM judge that is told “this is the improved version” tends to score that output higher, so this countermeasure makes sense.
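The blinding step itself can be sketched in a few lines. This is an illustration of the idea rather than the comparator's actual mechanism: blind_pair and reveal are hypothetical names, and the real agent works from a prompt.

```python
import random


def blind_pair(with_skill_output: str, without_skill_output: str, rng: random.Random):
    """Shuffle the two outputs and return (what the judge sees, the hidden key)."""
    entries = [("with_skill", with_skill_output), ("without_skill", without_skill_output)]
    rng.shuffle(entries)
    presented = {label: text for label, (_, text) in zip("AB", entries)}
    key = {label: source for label, (source, _) in zip("AB", entries)}
    return presented, key


def reveal(winner_label: str, key: dict) -> str:
    """Map the judge's pick (A or B) back to the configuration that produced it."""
    return key[winner_label]


presented, key = blind_pair("output from skill run", "output from baseline run", random.Random(0))
```

The judge only ever receives presented. The key stays with the harness until a winner has been chosen, at which point reveal maps the label back for the post-hoc analysis.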
Human Review: A Human Looks at It in the End
eval-viewer/generate_review.py generates an HTML viewer. It has two tabs: the Outputs tab shows per-test-case outputs, a feedback input field, and comparison with the previous output. The Benchmark tab shows quantitative summary statistics and notes from the analyzer.
Feedback is saved to feedback.json, and an empty feedback is treated as “no problems.” Rather than fully automating, the design explicitly carves out a point where human judgment is folded in — a pragmatic choice.
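The "empty means fine" rule is easy to express in code. The sketch below assumes a feedback.json shaped as a list of review entries with run_id and feedback keys. That schema is a guess for illustration, and needs_attention is a hypothetical helper.

```python
import json

# Hypothetical file contents; the real feedback.json schema may differ.
raw = json.dumps({
    "reviews": [
        {"run_id": "eval-0-with_skill", "feedback": ""},
        {"run_id": "eval-1-with_skill", "feedback": "Table is missing the totals row"},
        {"run_id": "eval-2-with_skill", "feedback": "   "},
    ]
})


def needs_attention(feedback_doc: dict) -> list:
    """Keep only reviews with real text; empty feedback is read as 'no problems'."""
    return [r for r in feedback_doc["reviews"] if r["feedback"].strip()]


flagged = needs_attention(json.loads(raw))
```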
2. Description Trigger Accuracy Evaluation
Why Description Matters
Claude Code Skills include a mechanism that automatically triggers them based on user queries. The accuracy of that trigger is determined by the description field in SKILL.md. If the description is vague, the skill won’t fire on relevant queries; if it’s too broad, it fires on irrelevant ones.
How Trigger Detection Works
scripts/run_eval.py is at the core. Here’s the processing flow.
sequenceDiagram
participant Eval as run_eval.py
participant CLI as claude -p
participant Stream as Stream parser
Eval->>CLI: Execute query (stream-json)
CLI->>Stream: Stream events
Stream->>Eval: Detect tool_use
Note over Eval: Run each query 3 times
Eval->>Eval: Compute trigger_rate
Note over Eval: Trigger if above threshold 0.5
Each query is run 3 times by default to compute the trigger_rate. If it exceeds the threshold (default 0.5), it’s judged as a trigger. The design absorbs probabilistic noise via repeated execution.
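The decision rule described here fits in a few lines. This is a sketch under the stated defaults (3 runs, threshold 0.5, strict "exceeds"), with hypothetical function names. It leaves out the part where run_eval.py actually invokes the CLI and parses the stream.

```python
def trigger_rate(fired: list) -> float:
    """Fraction of repeated runs in which the skill was invoked."""
    return sum(fired) / len(fired)


def is_triggered(fired: list, threshold: float = 0.5) -> bool:
    """A query counts as triggering only when the rate exceeds the threshold."""
    return trigger_rate(fired) > threshold


def query_passes(fired: list, should_trigger: bool, threshold: float = 0.5) -> bool:
    """A query passes when the observed behaviour matches the expectation."""
    return is_triggered(fired, threshold) == should_trigger


# Three runs each, as in the default configuration.
mostly_fired = [True, True, False]
rarely_fired = [False, True, False]
```

With three runs, two hits give a rate of about 0.67 and count as a trigger, while one hit gives about 0.33 and does not. A single stray run therefore cannot flip the verdict.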
Optimization Loop: Train/Test Split to Prevent Overfitting
scripts/run_loop.py runs the optimization loop.
flowchart TD
A["eval set"] --> B["60/40 train/test split<br>(stratified)"]
B --> C["Trigger evaluation"]
C -->|"all pass on train"| D["Early termination"]
C -->|"failures present"| E["Description improvement"]
E --> F["Generate new description"]
F --> C
The key design point here is that the best description is picked by test score. A candidate that improves on train but drops on test is not adopted. This is the basic defense against overfitting in machine learning, and exactly the same problem shows up in LLM prompt optimization: a description overfit to specific query examples becomes brittle on unseen queries.
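Selecting by test score can be sketched as follows. The Attempt class, pick_best, and the scores are illustrative assumptions, not code or data from run_loop.py.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    description: str
    train_score: float
    test_score: float


def pick_best(attempts: list) -> Attempt:
    """Choose by held-out test score; train score is deliberately ignored."""
    return max(attempts, key=lambda a: a.test_score)


attempts = [
    Attempt("v1: broad wording", train_score=0.70, test_score=0.75),
    Attempt("v2: lists every failing query", train_score=1.00, test_score=0.60),
    Attempt("v3: intent categories", train_score=0.90, test_score=0.85),
]
best = pick_best(attempts)
```

In this made-up history, v2 has a perfect train score but loses to v3, because memorizing the training queries did not carry over to the held-out ones.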
The use of stratified splitting also matters: it keeps the should_trigger=true vs. false ratio balanced between train and test.
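A stratified 60/40 split can be written with the standard library alone. This is a minimal sketch, assuming each eval entry is a dict with a should_trigger boolean. The function name, the seed, and the query field are assumptions, and the real script may split differently.

```python
import random


def stratified_split(eval_set: list, train_fraction: float = 0.6, seed: int = 42):
    """Split positives and negatives separately so both sides keep the same ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        group = [q for q in eval_set if q["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Ten positive and ten negative queries, purely illustrative.
eval_set = [{"query": f"q{i}", "should_trigger": i < 10} for i in range(20)]
train, test = stratified_split(eval_set)
```

Splitting each class on its own means the test set cannot end up with, say, only negative queries, which would make the test score meaningless for judging missed triggers.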
Description Improvement: Generalizing with Extended Thinking
scripts/improve_description.py calls the Anthropic API directly to improve the description. It uses extended thinking (10,000-token budget) and includes the following in the prompt.
- Current description
- Failed triggers (should_trigger=true but didn’t fire)
- False triggers (should_trigger=false but fired)
- History of past attempts (but test scores are hidden)
- Skill body
Note that test scores are excluded from the past-attempt history. This prevents the model from indirectly fitting to the test set.
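Hiding test scores comes down to filtering the history before it reaches the prompt. The sketch below shows one way to do that. The prompt wording, the test_ key prefix convention, and the function names are assumptions, and the actual API call is omitted.

```python
def visible_history(history: list) -> list:
    """Drop every test_* field so the model never sees held-out results."""
    return [{k: v for k, v in a.items() if not k.startswith("test_")} for a in history]


def build_prompt(current: str, failed_triggers: list, false_triggers: list,
                 history: list, skill_body: str) -> str:
    """Assemble the five ingredients listed above into one prompt string."""
    lines = ["Current description:", current, "", "Should have triggered but did not:"]
    lines += [f"- {q}" for q in failed_triggers]
    lines += ["", "Triggered but should not have:"]
    lines += [f"- {q}" for q in false_triggers]
    lines += ["", "Past attempts (train results only):"]
    lines += [f"- {a['description']} (train_score={a['train_score']})"
              for a in visible_history(history)]
    lines += ["", "Skill body:", skill_body]
    return "\n".join(lines)


history = [{"description": "Helps with unit tests", "train_score": 0.8, "test_score": 0.55}]
prompt = build_prompt(
    current="Helps with testing",
    failed_triggers=["add coverage for the parser"],
    false_triggers=["what time is it"],
    history=history,
    skill_body="(skill instructions go here)",
)
```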
The generated description has constraints imposed.
- Stay between 100–200 words
- Generalize to categories of intent rather than enumerating specific queries
- If it exceeds 1024 characters, shorten it again via the API
The constraint “generalize to categories of intent rather than enumerating specific queries” is smart. Listing concrete examples like “fire when the user says ‘write a test’” can’t handle queries the model has never seen. Abstract category descriptions like “provide assistance related to test automation” generalize better to unknown queries.
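The two mechanical constraints (word range and character cap) are easy to check in code. This sketch uses the limits quoted above. The function names are hypothetical, and whether the word range is enforced in code or only requested in the prompt is not something the source states.

```python
MAX_CHARS = 1024
MIN_WORDS, MAX_WORDS = 100, 200


def word_count_ok(description: str) -> bool:
    """True when the description falls inside the 100-200 word range."""
    return MIN_WORDS <= len(description.split()) <= MAX_WORDS


def needs_shortening(description: str) -> bool:
    """True when the description is over the character cap and must be shortened again."""
    return len(description) > MAX_CHARS


good = " ".join(["word"] * 150)   # 150 words, 749 characters
too_long = "x" * 1025             # one very long word
```

The third constraint, generalizing to categories of intent, is a matter of judgment and stays with the model.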
Summary of Design Patterns
The table below organizes the design patterns that can be extracted from skill-creator’s evaluation architecture.
| Aspect | Technique | Why it matters |
|---|---|---|
| Overfitting prevention | Train/test split, pick best by test score | Overfitting also happens in prompt optimization |
| Statistical reliability | Run each query multiple times, evaluate by mean ± stddev | LLM outputs are probabilistically noisy |
| Integrating human judgment | HTML viewer + feedback.json | Full automation has a low ceiling on quality |
| Bias removal | Blind comparison (hide which output came from the skill) | LLMs grade leniently when told “this is the improved version” |
| Eval quality management | Grader points out weaknesses in the assertions themselves | Good results from a bad test are meaningless |
| Improvement automation | Iteratively improve descriptions with extended thinking | Humans rewriting every time doesn’t scale |
| Cost measurement | Always record token count and execution time | Watch that quality gains don’t come with a cost explosion |
These patterns generalize beyond skill-creator and apply to LLM-based system quality evaluation in general.
Closing Thoughts
What struck me looking at skill-creator’s evaluation architecture was how directly classical machine learning and software testing techniques apply to the new problem of “evaluating LLMs.” Train/test splits, blind comparison, flaky-test detection, meta-evaluation of test quality — none of these are novel concepts, but combined in the LLM context they form a strong quality assurance pipeline.
At the same time, what’s also striking is that human review remains an explicit step. The recursive structure of “using LLMs to evaluate LLMs” has fundamental limits, and the pragmatic acceptance that the final quality call is made by a human feels right.
Rather than aiming for full automation, clearly delineate what can be automated from what needs human judgment. This design philosophy is a principle that probably applies to LLM-based tool development in general.