Testing and Iteration

Testing Approaches

Choose rigor based on skill visibility:

Manual testing — Run queries in Claude.ai, observe behavior. Fast iteration.
Scripted testing — Automate test cases in Claude Code for repeatable validation.
Programmatic testing — Build eval suites via skills API for systematic testing.

Pro tip: Iterate on a single challenging task until Claude succeeds, then extract the winning approach into the skill. Expand to multiple test cases afterward.

Three Testing Areas

1. Triggering Tests

Ensure the skill loads at the right times.

Should trigger	Should NOT trigger
"Help me set up a new ProjectHub workspace"	"What's the weather?"
"I need to create a project in ProjectHub"	"Help me write Python code"
"Initialize a ProjectHub project for Q4"	"Create a spreadsheet"

Debug: Ask Claude: "When would you use the [skill-name] skill?" — it quotes the description back.
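
The triggering table can be turned into a small repeatable check. The sketch below is illustrative only: `score_triggering` and `fake_did_trigger` are hypothetical names, and the keyword match is a stand-in for whatever mechanism you use to observe whether the skill actually loaded. It is not how Claude decides to load a skill.

```python
from typing import Callable

# Queries copied from the triggering table.
SHOULD_TRIGGER = [
    "Help me set up a new ProjectHub workspace",
    "I need to create a project in ProjectHub",
    "Initialize a ProjectHub project for Q4",
]
SHOULD_NOT_TRIGGER = [
    "What's the weather?",
    "Help me write Python code",
    "Create a spreadsheet",
]


def score_triggering(did_trigger: Callable[[str], bool]) -> dict:
    """Return the hit rate on relevant queries and the false-trigger rate on unrelated ones."""
    hits = sum(did_trigger(q) for q in SHOULD_TRIGGER)
    false_hits = sum(did_trigger(q) for q in SHOULD_NOT_TRIGGER)
    return {
        "trigger_rate": hits / len(SHOULD_TRIGGER),
        "false_trigger_rate": false_hits / len(SHOULD_NOT_TRIGGER),
    }


# Hypothetical stand-in: replace with a real observation of skill loading.
def fake_did_trigger(query: str) -> bool:
    return "projecthub" in query.lower()


result = score_triggering(fake_did_trigger)
print(result)
```

Tracking both rates matters because a broader description can raise the first rate while also raising the second.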

2. Functional Tests

Verify correct outputs:

Valid outputs generated
API/MCP calls succeed
Error handling works
Edge cases covered

3. Performance Comparison

Compare with and without skill:

Metric	Without Skill	With Skill
Messages needed	15 back-and-forth	2 clarifying questions
Failed API calls	3 retries	0
Tokens consumed	12,000	6,000
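
To make the comparison concrete, the figures in the table can be expressed as relative reductions. This is a minimal sketch that assumes each row can be treated as a plain count (the messages row mixes "back-and-forth" with "clarifying questions", so that comparison is approximate); `reduction` is a hypothetical helper name.

```python
def reduction(without_skill: float, with_skill: float) -> float:
    """Fraction by which a metric dropped once the skill was enabled."""
    return (without_skill - with_skill) / without_skill


# (without skill, with skill), taken from the comparison table.
metrics = {
    "messages": (15, 2),
    "failed_api_calls": (3, 0),
    "tokens": (12_000, 6_000),
}

reductions = {name: reduction(a, b) for name, (a, b) in metrics.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.0%} lower with the skill")
```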

Success Criteria

Quantitative

Skill triggers on ~90% of relevant queries (test 10-20 queries)
Completes workflow in fewer tool calls than without skill
0 failed API calls per workflow
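
The quantitative criteria above can be checked mechanically once the counts are collected. The sketch below assumes you record these numbers yourself; `RunStats`, its fields, and the example values are hypothetical, and the 0.9 default mirrors the approximate 90% target.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    relevant_queries: int          # how many relevant queries were tested
    triggered: int                 # how many of them loaded the skill
    tool_calls_with_skill: int
    tool_calls_without_skill: int
    failed_api_calls: int          # failures per workflow


def meets_quantitative_criteria(s: RunStats, min_trigger_rate: float = 0.9) -> dict:
    """Evaluate each quantitative criterion and report pass/fail per criterion."""
    return {
        "trigger_rate": s.triggered / s.relevant_queries >= min_trigger_rate,
        "fewer_tool_calls": s.tool_calls_with_skill < s.tool_calls_without_skill,
        "no_failed_api_calls": s.failed_api_calls == 0,
    }


# Illustrative numbers only, not measured results.
stats = RunStats(relevant_queries=20, triggered=19,
                 tool_calls_with_skill=4, tool_calls_without_skill=9,
                 failed_api_calls=0)
checks = meets_quantitative_criteria(stats)
print(checks)
```

Returning one flag per criterion, rather than a single boolean, shows which criterion to work on next.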

Qualitative

Users don't need to prompt Claude about next steps
Workflows complete without user correction
Consistent results across sessions
New users can accomplish task on first try

Iteration Signals

Undertriggering

Skill doesn't load when it should → add more trigger phrases/keywords to description
Users manually enabling it → description too vague

Overtriggering

Skill loads for unrelated queries → add negative triggers, be more specific
Users disabling it → clarify scope in description

Execution Issues

Inconsistent results → improve instructions, add validation scripts
API failures → add error handling, retry guidance
User corrections needed → make instructions more explicit
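
As one way to act on "add validation scripts", a skill could bundle a small checker that runs on the output before it is used. The sketch below is an assumption-heavy illustration: the payload shape and the `REQUIRED_FIELDS` names are hypothetical and do not describe a real ProjectHub schema.

```python
import json

# Hypothetical required fields for a project payload.
REQUIRED_FIELDS = {"name": str, "owner": str, "quarter": str}


def validate_project(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the payload passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


print(validate_project('{"name": "Q4 launch", "owner": "sam", "quarter": "Q4"}'))
print(validate_project('{"name": 3}'))
```

Returning explicit problem messages gives the instructions something concrete to reference when they say how to recover from a failed check.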

Iteration Workflow

Use skill on real tasks
Notice struggles, inefficiencies, token usage
Identify SKILL.md or resource updates needed
Implement changes
Test again with same scenarios
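
Re-testing with the same scenarios is easiest to read as a before/after comparison. The following sketch assumes each scenario has a recorded pass/fail result from both runs; `compare_runs` and the scenario names are hypothetical.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Classify each scenario present in both runs by how its pass/fail status changed."""
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [s for s in shared if not before[s] and after[s]],
        "regressed": [s for s in shared if before[s] and not after[s]],
        "still_failing": [s for s in shared if not before[s] and not after[s]],
    }


# Illustrative results only: True means the scenario passed.
before = {"new workspace": True, "Q4 project": False, "rename project": False}
after = {"new workspace": True, "Q4 project": True, "rename project": False}

report = compare_runs(before, after)
print(report)
```

Keeping the scenario set fixed between runs is what makes a "regressed" entry meaningful: it points at the change just made rather than at a new test.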
