agent-skills-eval command
Files
After copying a command, replace the bracketed file names, directories, or parameters as needed.
Common examples
Run all evals in a skills directory
npx agent-skills-eval [./skills]
Compare runs with and without the skill, using specific target and judge models
npx agent-skills-eval [./skills] --target [gpt-4o-mini] --judge [gpt-4o-mini] --baseline
Use a YAML config file
npx agent-skills-eval --config [agent-skills-eval.yaml]
Generate a static HTML report with the baseline comparison
npx agent-skills-eval [./skills] --baseline --report
Limit concurrency and filter skills with include/exclude globs
npx agent-skills-eval [./skills] --concurrency [2] --include "[skills/translate*]" --exclude "[**/draft-*]"
Stream logs as JSONL and fail on SKILL.md validation errors
npx agent-skills-eval [./skills] --log-format [jsonl] --strict
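The flags in the examples above can be combined. The following is a minimal sketch of a CI-style invocation, assuming a flat workspace and JSONL logs are what the pipeline wants. `build_eval_cmd` is a hypothetical helper that is not part of agent-skills-eval; it only prints the command so it can be reviewed before anything is run.

```shell
# Sketch: assemble a CI-style invocation from flags documented on this page.
# build_eval_cmd is a hypothetical helper, not part of agent-skills-eval.
build_eval_cmd() {
  skills_dir="${1:-./skills}"                  # skills directory to evaluate
  workspace="${2:-./agent-skills-workspace}"   # documented default workspace
  # Print the command instead of executing it
  echo npx agent-skills-eval "$skills_dir" \
    --workspace "$workspace" --layout flat \
    --baseline --strict --log-format jsonl --report
}

# Preview the command for a custom workspace (./ci-results is an example path)
build_eval_cmd ./skills ./ci-results
```

Printing rather than executing keeps the sketch free of side effects; copy the printed line into the pipeline step once it looks right.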
Description
agent-skills-eval is a test harness for the agentskills.io specification. Each skill lives in a directory with a SKILL.md describing its purpose plus an evals/evals.json file enumerating prompts, attached files, and judge-graded assertions. The runner loads each eval, sends the prompt to the configured target model (optionally with the skill content injected), and asks the judge model to score the result against the declared assertions.
When \-\-baseline is set, every eval runs twice: once with the skill loaded into context and once without. Comparing the two scores measures whether the skill actually improves the model's output rather than just confirming it can solve the task on its own. Outputs, timing, token counts, tool calls, and grading rationales are persisted under the workspace so runs are reproducible and auditable.
The CLI is designed for both ad-hoc local iteration and CI pipelines. The iteration layout numbers each run, making before/after comparisons easy; the flat layout overwrites a single result tree. Logs can be rendered as colored progress for humans or streamed as JSONL for programmatic consumers, and a static HTML report can be produced for sharing without a server.
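The per-skill layout described above (a SKILL.md plus evals/evals.json) can be scaffolded as in the sketch below. The skill name `translate` and the helper `scaffold_skill` are hypothetical, and the file contents are placeholders only: the evals.json schema is not documented on this page, so the sketch writes an empty JSON object to be filled in according to the specification.

```shell
# Sketch: create the directory layout for one skill.
# scaffold_skill and the "translate" skill name are hypothetical.
scaffold_skill() {
  skill_dir="$1"
  mkdir -p "$skill_dir/evals"
  # SKILL.md describes the skill's purpose (placeholder text only)
  [ -f "$skill_dir/SKILL.md" ] || \
    printf '%s\n' "TODO: describe what this skill does" > "$skill_dir/SKILL.md"
  # evals.json lists prompts, attached files, and assertions; its schema is
  # not covered here, so only an empty placeholder object is written
  [ -f "$skill_dir/evals/evals.json" ] || \
    printf '{}\n' > "$skill_dir/evals/evals.json"
}

scaffold_skill ./skills/translate
find ./skills/translate -type f
```

Existing files are left untouched, so the helper can be re-run safely on a skill that is already partly written.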
Options
- \-\-config _file_: Load YAML configuration from _file_. CLI flags override values from the config.
- \-\-workspace _dir_: Output directory for results (default: ./agent-skills-workspace).
- \-\-baseline: Enable the with_skill vs without_skill comparison. Without it, only the with-skill run is performed.
- \-\-target _model_: Target model whose performance is being evaluated.
- \-\-judge _model_: Judge model used to grade target outputs.
- \-\-base-url _url_: API base URL (defaults to the OpenAI endpoint).
- \-\-api-key-env _VAR_: Environment variable that holds the API key (default: OPENAI_API_KEY).
- \-\-include _glob_: Run only skills whose path matches _glob_.
- \-\-exclude _glob_: Skip skills whose path matches _glob_.
- \-\-concurrency _N_: Number of parallel eval runs (default: 4).
- \-\-layout _mode_: Workspace layout: iteration (default, numbered run folders) or flat.
- \-\-strict: Fail when SKILL.md validation errors are detected.
- \-\-log-format _mode_: Output format: pretty, jsonl, or silent.
- \-\-report: Emit a static HTML report under the workspace.
- \-\-report-output _dir_: Override the directory where the report is written.
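As an illustration of the endpoint options above, the sketch below prints a command that points the runner at a non-default API endpoint. The URL `http://localhost:8000/v1` and the variable name `LOCAL_LLM_API_KEY` are hypothetical, and it is an assumption that the endpoint must speak an OpenAI-compatible API. Note that \-\-api-key-env takes the name of the environment variable, not the key itself.

```shell
# Sketch: build a command that overrides the API endpoint and key variable.
# endpoint_cmd, the URL, and LOCAL_LLM_API_KEY are hypothetical examples.
endpoint_cmd() {
  base_url="$1"   # API base URL to use instead of the default
  key_var="$2"    # NAME of the env var holding the key, never the key value
  echo npx agent-skills-eval ./skills \
    --base-url "$base_url" --api-key-env "$key_var"
}

# Print the command for review; the key stays in the environment
endpoint_cmd http://localhost:8000/v1 LOCAL_LLM_API_KEY
```

Passing only the variable name keeps the secret out of shell history and CI logs.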
FAQ
What is the agent-skills-eval command used for?
agent-skills-eval is a test harness for the agentskills.io specification. It runs each skill's evals against a target model, asks a judge model to score the output against the declared assertions, and, when \-\-baseline is set, repeats each eval without the skill so the two scores can be compared.
How do I run a basic agent-skills-eval example?
Run `npx agent-skills-eval [./skills]` in a terminal, replacing `./skills` with the path to your skills directory, then add flags as needed.
What does \-\-config _file_ do in agent-skills-eval?
It loads YAML configuration from _file_. CLI flags override values from the config.