agent-skills-eval Command: Examples, Options, and Usage

Common Examples

Run the evals for every skill in a directory:

npx agent-skills-eval [./skills]

Compare with-skill and without-skill runs using a specific target and judge model:

npx agent-skills-eval [./skills] --target [gpt-4o-mini] --judge [gpt-4o-mini] --baseline

Use settings from a YAML configuration file:

npx agent-skills-eval --config [agent-skills-eval.yaml]

Generate a static HTML report for a baseline comparison:

npx agent-skills-eval [./skills] --baseline --report

Limit concurrency and select skills with include and exclude globs:

npx agent-skills-eval [./skills] --concurrency [2] --include "[skills/translate*]" --exclude "[**/draft-*]"

Stream logs as JSONL and fail on SKILL.md validation errors:

npx agent-skills-eval [./skills] --log-format [jsonl] --strict
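
For programmatic consumers, the JSONL stream from the last example can be read line by line. The sketch below is a minimal illustration in Python. It assumes only that each non-empty line is one JSON object, and it makes no assumption about field names, which are not documented here. The helper name `parse_jsonl` is hypothetical and not part of the tool.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty JSONL line.

    Assumption: every event is a JSON object on its own line. Blank lines,
    invalid JSON, and non-object values are skipped rather than raising.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray non-JSON line mixed into the stream
        if isinstance(event, dict):
            yield event
```

One way to use it is to pipe the command's output into a script that calls `parse_jsonl(sys.stdin)` and inspects each event as it arrives.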

Description

agent-skills-eval is a test harness for the agentskills.io specification. Each skill lives in a directory with a SKILL.md describing its purpose plus an evals/evals.json file enumerating prompts, attached files, and judge-graded assertions. The runner loads each eval, sends the prompt to the configured target model, optionally with the skill content injected, and asks the judge model to score the result against the declared assertions.

When \-\-baseline is set, every eval runs twice: once with the skill loaded into context and once without. Comparing the two scores measures whether the skill actually improves the model's output rather than just confirming it can solve the task on its own. Outputs, timing, token counts, tool calls, and grading rationales are persisted under the workspace so runs are reproducible and auditable.

The CLI is designed for both ad-hoc local iteration and CI pipelines. The iteration layout numbers each run, making before/after comparisons easy; the flat layout overwrites a single result tree. Logs can be rendered as colored progress for humans or streamed as JSONL for programmatic consumers, and a static HTML report can be produced for sharing without a server.
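
The baseline comparison comes down to a difference between two sets of scores. The Python sketch below illustrates that arithmetic only; it is not the tool's own scoring code. The function name `skill_lift` and the idea of per-eval scores between 0 and 1 are assumptions made for the example.

```python
from statistics import mean
from typing import Sequence


def skill_lift(with_skill: Sequence[float], without_skill: Sequence[float]) -> float:
    """Return mean(with_skill) - mean(without_skill).

    Both sequences are hypothetical per-eval scores in [0, 1]. A positive
    result suggests the skill helped; zero or negative suggests the model
    did as well (or better) on its own.
    """
    if not with_skill or not without_skill:
        raise ValueError("both runs need at least one score")
    return mean(with_skill) - mean(without_skill)
```

For example, scores of 1.0 and 0.5 with the skill against 0.5 and 0.0 without it give a lift of 0.5.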

Options

\-\-config _file_: Load YAML configuration from _file_. CLI flags override values from the config.
\-\-workspace _dir_: Output directory for results (default: ./agent-skills-workspace).
\-\-baseline: Enable the with_skill vs without_skill comparison. Without it, only the with-skill run is performed.
\-\-target _model_: Target model whose performance is being evaluated.
\-\-judge _model_: Judge model used to grade target outputs.
\-\-base-url _url_: API base URL (defaults to the OpenAI endpoint).
\-\-api-key-env _VAR_: Environment variable that holds the API key (default: OPENAI_API_KEY).
\-\-include _glob_: Run only skills whose path matches _glob_.
\-\-exclude _glob_: Skip skills whose path matches _glob_.
\-\-concurrency _N_: Number of parallel eval runs (default: 4).
\-\-layout _mode_: Workspace layout: iteration (default, numbered run folders) or flat.
\-\-strict: Fail when SKILL.md validation errors are detected.
\-\-log-format _mode_: Output format: pretty, jsonl, or silent.
\-\-report: Emit a static HTML report under the workspace.
\-\-report-output _dir_: Override the directory where the report is written.
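
When the command is launched from a script, for instance in CI, it can help to assemble the argument list in one place. The following Python sketch uses only flags listed above; the function `build_command` and its keyword names are hypothetical, and the order in which flags are emitted is an arbitrary choice.

```python
from typing import List, Optional

# Values listed for --log-format in the options above.
LOG_FORMATS = {"pretty", "jsonl", "silent"}


def build_command(
    skills_dir: str,
    *,
    target: Optional[str] = None,
    judge: Optional[str] = None,
    baseline: bool = False,
    concurrency: Optional[int] = None,
    log_format: Optional[str] = None,
    strict: bool = False,
    report: bool = False,
) -> List[str]:
    """Assemble an argv list for `npx agent-skills-eval` (hypothetical helper)."""
    cmd = ["npx", "agent-skills-eval", skills_dir]
    if target is not None:
        cmd += ["--target", target]
    if judge is not None:
        cmd += ["--judge", judge]
    if baseline:
        cmd.append("--baseline")
    if concurrency is not None:
        cmd += ["--concurrency", str(concurrency)]
    if log_format is not None:
        if log_format not in LOG_FORMATS:
            raise ValueError(f"unknown log format: {log_format}")
        cmd += ["--log-format", log_format]
    if strict:
        cmd.append("--strict")
    if report:
        cmd.append("--report")
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd)`; passing a list rather than a single string avoids shell quoting problems with glob-like arguments.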

FAQ

What is the agent-skills-eval command used for?

It is a test harness for the agentskills.io specification. Each skill is a directory containing a SKILL.md and an evals/evals.json file; the runner sends each eval's prompt to a target model, has a judge model grade the output against the declared assertions, and, with \-\-baseline, repeats each eval without the skill to measure whether the skill actually improves the result.
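
Before a run, it can be useful to check which directories actually have the expected layout. This Python sketch is a pre-flight check under the assumption that a skill directory is any directory holding both SKILL.md and evals/evals.json; it does not reproduce how the runner itself discovers skills, and `find_skills` is a hypothetical name.

```python
from pathlib import Path
from typing import List


def find_skills(root: str) -> List[Path]:
    """Return directories under root that contain SKILL.md and evals/evals.json."""
    found = []
    for skill_md in sorted(Path(root).rglob("SKILL.md")):
        skill_dir = skill_md.parent
        # A directory without evals/evals.json has nothing to evaluate.
        if (skill_dir / "evals" / "evals.json").is_file():
            found.append(skill_dir)
    return found
```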

How do I run a basic agent-skills-eval example?

Run `npx agent-skills-eval [./skills]` in a terminal, replacing the bracketed placeholder with the path to your skills directory, then add flags such as \-\-baseline or \-\-report as needed.

What does \-\-config _file_ do in agent-skills-eval?

Load YAML configuration from _file_. CLI flags override values from the config.
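
The precedence rule can be pictured as a simple merge in which CLI values win. The Python sketch below illustrates that rule only; the key names in the usage example and the function `effective_settings` are hypothetical, since the config file's schema is not described here.

```python
from typing import Any, Dict, Mapping


def effective_settings(config: Mapping[str, Any], cli: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge settings so that CLI values override config-file values.

    A CLI value of None stands for "flag not given" and leaves the
    config value untouched.
    """
    merged = dict(config)
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged
```

With a config holding a concurrency of 4 and a CLI value of 2, the merged result uses 2, while any setting not given on the command line keeps its config value.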
