Skill Creation Workflow
9-step process. Follow in order; skip only with clear justification.
Step 1: Capture Intent
Gather real usage patterns via AskUserQuestion tool:
- "What tasks should this skill handle?"
- "Give examples of how it would be used?"
- "What phrases should trigger this skill?"
- "What's the expected output format?"
- "Should we create test cases?" (recommended for objective outputs)
Conclude when functionality scope is clear.
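As an illustration, the gathered answers can be held in a structured record before moving on; this shape is a sketch mirroring the questions above, not part of the AskUserQuestion API:
```python
from dataclasses import dataclass, field

# Hypothetical intake record; field names are illustrative only.
@dataclass
class SkillIntent:
    tasks: list[str] = field(default_factory=list)
    usage_examples: list[str] = field(default_factory=list)
    trigger_phrases: list[str] = field(default_factory=list)
    output_format: str = ""
    wants_test_cases: bool = True  # recommended for objective outputs
```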
Step 2: Research
Activate /ck:docs-seeker and /ck:research skills. Research:
- Best practices & industry standards
- Existing CLI tools (`npx`, `bunx`, `pipx`) for reuse
- Workflows & case studies
- Edge cases & pitfalls
Use parallel WebFetch + Explore subagents for multiple URLs.
Write reports for next step.
Step 3: Plan Reusable Contents
Analyze each example:
- How to execute from scratch?
- Prefer existing CLI tools over custom code
- What scripts/references/assets enable repeated execution?
- Check skills catalog — avoid duplication, reuse existing
Patterns:
- Repeated code → `scripts/` (Python/Node.js, with tests)
- Repeated discovery → `references/` (schemas, docs, APIs)
- Repeated boilerplate → `assets/` (templates, images)
Scripts MUST: respect .env hierarchy, have tests, pass all tests.
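What respecting the .env hierarchy entails is not pinned down here; a minimal stdlib sketch, assuming files closer to the script's directory override those above them:
```python
import os
from pathlib import Path

def load_env_hierarchy(start: Path) -> None:
    """Load every .env from the filesystem root down to `start`.

    Files nearer `start` load last and therefore win. The precedence
    rule here is an assumption, not part of this skill's spec.
    """
    start = start.resolve()
    for directory in [*reversed(start.parents), start]:
        env_file = directory / ".env"
        if not env_file.is_file():
            continue
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ[key.strip()] = value.strip()
```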
Step 4: Initialize
For new skills, run init script:
```
scripts/init_skill.py <skill-name> --path <output-directory>
```
Creates: SKILL.md template, scripts/, references/, assets/ with examples.
Skip if skill already exists (go to Step 5).
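A hypothetical invocation (skill name and output path are placeholders):
```
scripts/init_skill.py pdf-form-filler --path ./skills
```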
Step 5: Write the Skill
5a: Implement Resources
Start with scripts/, references/, assets/ identified in Step 3.
Delete unused example files from initialization.
May require user input (brand assets, configs, etc.).
5b: Write SKILL.md
Writing style: Imperative/infinitive form. "To accomplish X, do Y."
Size: Under 300 lines. Move details to references/.
Answer these in SKILL.md:
- Purpose (2-3 sentences)
- When to use (trigger conditions)
- How to use (reference all bundled resources)
5c: Benchmark Optimization
MUST include for high Skillmark scores:
- Scope declaration — "This skill handles X. Does NOT handle Y."
- Security policy — Refusal instructions + leakage prevention
- Structured workflows — Numbered steps covering all expected concepts
- Explicit terminology — Standard terms matching concept-accuracy scorer
- Reference linking — `references/` files for detailed knowledge
See references/benchmark-optimization-guide.md for detailed patterns.
5d: Write Pushy Description
Description ≤1024 chars. Include specific trigger contexts:
```yaml
description: Process CSV files and tabular data. Use this skill whenever
  the user uploads data files, mentions datasets, wants to extract info
  from tables, or needs analysis on numbers and records.
```
See references/metadata-quality-criteria.md for examples.
Step 6: Test & Evaluate
6a: Create Test Cases
Write evals/evals.json with 2-3 realistic test prompts + assertions.
See references/eval-schemas.md for JSON format.
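The shape below is an assumed illustration only (the authoritative schema is in references/eval-schemas.md); prompt and assertion wording are invented:
```python
import json
from pathlib import Path

# Illustrative evals.json contents; field names are assumptions.
evals = [
    {
        "prompt": "Summarize the columns and row count of the attached sales.csv",
        "assertions": ["names every column", "reports the correct row count"],
    },
    {
        "prompt": "Extract all email addresses from this contact table",
        "assertions": ["returns only valid addresses", "preserves row order"],
    },
]
Path("evals").mkdir(exist_ok=True)
Path("evals/evals.json").write_text(json.dumps(evals, indent=2))
```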
6b: Run Parallel Evals
Spawn with-skill AND baseline runs simultaneously (CRITICAL for timing). Draft assertions while runs execute.
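One way to launch both variants at the same time, assuming a hypothetical `run_eval.sh` wrapper around the real harness:
```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# `./run_eval.sh --variant ...` is a placeholder; the point is that the
# with-skill and baseline runs start simultaneously.
def run(variant: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        ["./run_eval.sh", "--variant", variant],
        capture_output=True,
        text=True,
    )

with ThreadPoolExecutor(max_workers=2) as pool:
    with_skill, baseline = pool.map(run, ["with-skill", "baseline"])
```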
6c: Grade & Aggregate
- Grade outputs with grader agent (`agents/grader.md`)
- Aggregate results: `scripts/aggregate_benchmark.py`
- Launch viewer: `eval-viewer/generate_review.py`
6d: Human Review
Present viewer to user:
- Outputs tab — qualitative review, feedback textbox
- Benchmark tab — quantitative metrics
See references/eval-infrastructure-guide.md for details.
Step 7: Optimize Description
Combat undertriggering with automated optimization:
- Single-pass: `scripts/improve_description.py` — one iteration
- Iterative loop: `scripts/run_loop.py` — train/test split, convergence detection
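Conceptually, the iterative loop behaves like the sketch below; `rewrite` and `trigger_rate` are stand-ins, and the real logic lives in scripts/run_loop.py:
```python
import random

def rewrite(desc: str, train_cases: list[str]) -> str:
    return desc + " (revised)"      # stand-in for an LLM-driven rewrite

def trigger_rate(desc: str, cases: list[str]) -> float:
    return random.random()          # stand-in for real with-skill eval runs

cases = ["prompt A", "prompt B", "prompt C", "prompt D"]
random.shuffle(cases)
train, test = cases[:3], cases[3:]  # train/test split

desc, best = "initial description", 0.0
for _ in range(10):
    desc = rewrite(desc, train)     # revise against train-set failures
    score = trigger_rate(desc, test)
    if score <= best + 0.01:        # convergence: no gain on held-out cases
        break
    best = score
```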
Step 8: Package & Validate
```
scripts/package_skill.py <path/to/skill-folder>
```
Validates: frontmatter, naming, description, structure. Fix all errors, re-run until clean.
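A sketch of the kind of checks involved; only the 1024-character description cap comes from this document, and the naming rule is an assumption:
```python
import re

def validate_frontmatter(fm: dict) -> list[str]:
    errors = []
    if not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", fm.get("name", "")):
        errors.append("name: lowercase words separated by hyphens (assumed rule)")
    if not 0 < len(fm.get("description", "")) <= 1024:
        errors.append("description: required, at most 1024 characters")
    return errors
```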
Step 9: Iterate
- Read `feedback.json` from viewer
- Generalize from feedback — don't overfit to test examples
- Keep prompts lean — remove ineffective instructions
- Update SKILL.md or resources
- Re-test (return to Step 6)
- Scale test set to 5-10 cases for production skills
Benchmark iteration: Run skillmark CLI, review per-concept accuracy, fix gaps.