
Linux command

agent-skills-eval command

Files

After copying an example, replace the bracketed file names, directories, or parameters as needed.

Common examples

Run the evals for every skill in a directory

npx agent-skills-eval [./skills]

Compare with-skill and without-skill runs using a specific target and judge model

npx agent-skills-eval [./skills] --target [gpt-4o-mini] --judge [gpt-4o-mini] --baseline

Use a YAML configuration file

npx agent-skills-eval --config [agent-skills-eval.yaml]

Generate a static HTML report for a baseline comparison

npx agent-skills-eval [./skills] --baseline --report

Limit concurrency and filter skills by glob

npx agent-skills-eval [./skills] --concurrency [2] --include "[skills/translate*]" --exclude "[**/draft-*]"

Stream logs as JSONL and fail on SKILL.md validation errors

npx agent-skills-eval [./skills] --log-format [jsonl] --strict

Description

agent-skills-eval is a test harness for the agentskills.io specification. Each skill lives in a directory with a SKILL.md describing its purpose plus an evals/evals.json file enumerating prompts, attached files, and judge-graded assertions. The runner loads each eval, sends the prompt to the configured target model (optionally with the skill content injected), and asks the judge model to score the result against the declared assertions.

When `--baseline` is set, every eval runs twice: once with the skill loaded into context and once without. Comparing the two scores measures whether the skill actually improves the model's output rather than just confirming it can solve the task on its own. Outputs, timing, token counts, tool calls, and grading rationales are persisted under the workspace so runs are reproducible and auditable.

The CLI is designed for both ad-hoc local iteration and CI pipelines. The iteration layout numbers each run, making before/after comparisons easy; the flat layout overwrites a single result tree. Logs can be rendered as colored progress for humans or streamed as JSONL for programmatic consumers, and a static HTML report can be produced for sharing without a server.
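The idea behind the `--baseline` comparison can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the tool's implementation: the record fields (`with_skill`, `without_skill`), the 0-to-1 score scale, the sample values, and the function name `skill_uplift` are all hypothetical.

```python
from statistics import mean

# Hypothetical per-eval judge scores in [0, 1]. The real result files
# written under the workspace may use different field names and scales.
results = [
    {"eval": "translate-1", "with_skill": 0.9, "without_skill": 0.6},
    {"eval": "translate-2", "with_skill": 0.8, "without_skill": 0.8},
    {"eval": "summarize-1", "with_skill": 0.7, "without_skill": 0.9},
]


def skill_uplift(rows):
    """Return the mean of (with_skill - without_skill) across evals."""
    return mean(r["with_skill"] - r["without_skill"] for r in rows)


uplift = skill_uplift(results)
print(f"mean uplift: {uplift:+.2f}")
```

A clearly positive uplift suggests the skill helps; a value near zero suggests the target model solves the task about as well without it.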

Options

`--config` _file_
Load YAML configuration from _file_. CLI flags override values from the config.
`--workspace` _dir_
Output directory for results (default: ./agent-skills-workspace).
`--baseline`
Enable the with_skill vs without_skill comparison. Without it, only the with-skill run is performed.
`--target` _model_
Target model whose performance is being evaluated.
`--judge` _model_
Judge model used to grade target outputs.
`--base-url` _url_
API base URL (defaults to the OpenAI endpoint).
`--api-key-env` _VAR_
Environment variable that holds the API key (default: OPENAI_API_KEY).
`--include` _glob_
Run only skills whose path matches _glob_.
`--exclude` _glob_
Skip skills whose path matches _glob_.
`--concurrency` _N_
Number of parallel eval runs (default: 4).
`--layout` _mode_
Workspace layout: iteration (default, numbered run folders) or flat.
`--strict`
Fail when SKILL.md validation errors are detected.
`--log-format` _mode_
Output format: pretty, jsonl, or silent.
`--report`
Emit a static HTML report under the workspace.
`--report-output` _dir_
Override the directory where the report is written.
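
With `--log-format jsonl`, a script can consume the log stream one record at a time. The sketch below tallies records by a field and tolerates blank or non-JSON lines. The record schema is not documented here, so the `event` field and the sample records are hypothetical; inspect real output before relying on any field name.

```python
import json
from collections import Counter


def count_events(lines, key="event"):
    """Tally JSONL records by the value of `key`, skipping blank or invalid lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise mixed into the stream
        if isinstance(record, dict):
            # str() keeps the tally hashable even if the value is a list or dict
            counts[str(record.get(key, "<missing>"))] += 1
    return counts


# Hypothetical sample lines; real field names and values may differ.
sample = [
    '{"event": "eval_start", "skill": "translate"}',
    '{"event": "eval_end", "skill": "translate"}',
    "",
    "not json",
    '{"event": "eval_end", "skill": "summarize"}',
]
print(count_events(sample))
```

To process a live run, the same function could be given `sys.stdin` while piping `npx agent-skills-eval [./skills] --log-format jsonl` into the script, assuming the JSONL stream is written to standard output.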

FAQ

What is the agent-skills-eval command used for?

agent-skills-eval is a test harness for the agentskills.io specification. It runs each skill's evals against a target model, asks a judge model to grade the output against the declared assertions, and, with `--baseline`, compares runs with and without the skill loaded to show whether the skill actually improves the result.

How do I run a basic agent-skills-eval example?

Run `npx agent-skills-eval [./skills]` in a terminal, replacing the bracketed placeholder with the path to your skills directory, then add flags as needed.

What does `--config` _file_ do in agent-skills-eval?

Load YAML configuration from _file_. CLI flags override values from the config.