79 lines
2.5 KiB
Markdown
79 lines
2.5 KiB
Markdown
# Testing and Iteration
|
|
|
|
## Testing Approaches
|
|
|
|
Choose rigor based on skill visibility:
|
|
- **Manual testing** — Run queries in Claude.ai, observe behavior. Fast iteration.
|
|
- **Scripted testing** — Automate test cases in Claude Code for repeatable validation.
|
|
- **Programmatic testing** — Build eval suites via skills API for systematic testing.
|
|
|
|
**Pro tip:** Iterate on a single challenging task until Claude succeeds, then extract the winning approach into the skill. Expand to multiple test cases after.
|
|
|
|
## Three Testing Areas
|
|
|
|
### 1. Triggering Tests
|
|
|
|
Ensure skill loads at right times.
|
|
|
|
| Should trigger | Should NOT trigger |
|
|
|---|---|
|
|
| "Help me set up a new ProjectHub workspace" | "What's the weather?" |
|
|
| "I need to create a project in ProjectHub" | "Help me write Python code" |
|
|
| "Initialize a ProjectHub project for Q4" | "Create a spreadsheet" |
|
|
|
|
**Debug:** Ask Claude: "When would you use the [skill-name] skill?" — it quotes the description back.
|
|
|
|
### 2. Functional Tests
|
|
|
|
Verify correct outputs:
|
|
- Valid outputs generated
|
|
- API/MCP calls succeed
|
|
- Error handling works
|
|
- Edge cases covered
|
|
|
|
### 3. Performance Comparison
|
|
|
|
Compare with and without skill:
|
|
|
|
| Metric | Without Skill | With Skill |
|
|
|---|---|---|
|
|
| Messages needed | 15 back-and-forth | 2 clarifying questions |
|
|
| Failed API calls | 3 retries | 0 |
|
|
| Tokens consumed | 12,000 | 6,000 |
|
|
|
|
## Success Criteria
|
|
|
|
### Quantitative
|
|
- Skill triggers on ~90% of relevant queries (test 10-20 queries)
|
|
- Completes workflow in fewer tool calls than without skill
|
|
- 0 failed API calls per workflow
|
|
|
|
### Qualitative
|
|
- Users don't need to prompt Claude about next steps
|
|
- Workflows complete without user correction
|
|
- Consistent results across sessions
|
|
- New users can accomplish task on first try
|
|
|
|
## Iteration Signals
|
|
|
|
### Undertriggering
|
|
- Skill doesn't load when it should → add more trigger phrases/keywords to description
|
|
- Users manually enabling it → description too vague
|
|
|
|
### Overtriggering
|
|
- Skill loads for unrelated queries → add negative triggers, be more specific
|
|
- Users disabling it → clarify scope in description
|
|
|
|
### Execution Issues
|
|
- Inconsistent results → improve instructions, add validation scripts
|
|
- API failures → add error handling, retry guidance
|
|
- User corrections needed → make instructions more explicit
|
|
|
|
## Iteration Workflow
|
|
|
|
1. Use skill on real tasks
|
|
2. Notice struggles, inefficiencies, token usage
|
|
3. Identify SKILL.md or resource updates needed
|
|
4. Implement changes
|
|
5. Test again with same scenarios
|