init
This commit is contained in:
@@ -0,0 +1,78 @@
|
||||
# Testing and Iteration
|
||||
|
||||
## Testing Approaches
|
||||
|
||||
Choose rigor based on skill visibility:
|
||||
- **Manual testing** — Run queries in Claude.ai, observe behavior. Fast iteration.
|
||||
- **Scripted testing** — Automate test cases in Claude Code for repeatable validation.
|
||||
- **Programmatic testing** — Build eval suites via skills API for systematic testing.
|
||||
|
||||
**Pro tip:** Iterate on a single challenging task until Claude succeeds, then extract the winning approach into the skill. Expand to multiple test cases after.
|
||||
|
||||
## Three Testing Areas
|
||||
|
||||
### 1. Triggering Tests
|
||||
|
||||
Ensure skill loads at right times.
|
||||
|
||||
| Should trigger | Should NOT trigger |
|
||||
|---|---|
|
||||
| "Help me set up a new ProjectHub workspace" | "What's the weather?" |
|
||||
| "I need to create a project in ProjectHub" | "Help me write Python code" |
|
||||
| "Initialize a ProjectHub project for Q4" | "Create a spreadsheet" |
|
||||
|
||||
**Debug:** Ask Claude: "When would you use the [skill-name] skill?" — it quotes the description back.
|
||||
|
||||
### 2. Functional Tests
|
||||
|
||||
Verify correct outputs:
|
||||
- Valid outputs generated
|
||||
- API/MCP calls succeed
|
||||
- Error handling works
|
||||
- Edge cases covered
|
||||
|
||||
### 3. Performance Comparison
|
||||
|
||||
Compare with and without skill:
|
||||
|
||||
| Metric | Without Skill | With Skill |
|
||||
|---|---|---|
|
||||
| Messages needed | 15 back-and-forth | 2 clarifying questions |
|
||||
| Failed API calls | 3 retries | 0 |
|
||||
| Tokens consumed | 12,000 | 6,000 |
|
||||
|
||||
## Success Criteria
|
||||
|
||||
### Quantitative
|
||||
- Skill triggers on ~90% of relevant queries (test 10-20 queries)
|
||||
- Completes workflow in fewer tool calls than without skill
|
||||
- 0 failed API calls per workflow
|
||||
|
||||
### Qualitative
|
||||
- Users don't need to prompt Claude about next steps
|
||||
- Workflows complete without user correction
|
||||
- Consistent results across sessions
|
||||
- New users can accomplish task on first try
|
||||
|
||||
## Iteration Signals
|
||||
|
||||
### Undertriggering
|
||||
- Skill doesn't load when it should → add more trigger phrases/keywords to description
|
||||
- Users manually enabling it → description too vague
|
||||
|
||||
### Overtriggering
|
||||
- Skill loads for unrelated queries → add negative triggers, be more specific
|
||||
- Users disabling it → clarify scope in description
|
||||
|
||||
### Execution Issues
|
||||
- Inconsistent results → improve instructions, add validation scripts
|
||||
- API failures → add error handling, retry guidance
|
||||
- User corrections needed → make instructions more explicit
|
||||
|
||||
## Iteration Workflow
|
||||
|
||||
1. Use skill on real tasks
|
||||
2. Notice struggles, inefficiencies, token usage
|
||||
3. Identify SKILL.md or resource updates needed
|
||||
4. Implement changes
|
||||
5. Test again with same scenarios
|
||||
Reference in New Issue
Block a user