Eval Harness

v1.6.2

Eval Harness provides a framework for systematically evaluating LLM and AI agent outputs. It supports custom evaluation criteria, reference-based and reference-free scoring, and human preference alignment, and it tracks evaluation results over time to detect performance regressions and model drift.
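
To make the scoring model concrete, here is a minimal sketch of a custom, reference-based criterion. The Criterion type, the score_reference_based helper, and the dataset shape are illustrative assumptions, not Eval Harness's actual API.

# Minimal sketch of custom, reference-based evaluation criteria.
# All names here (Criterion, score_reference_based, the dataset shape)
# are illustrative assumptions, not Eval Harness's real interface.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    score: Callable[[str, str], float]  # (output, reference) -> 0.0..1.0

def exact_match(output: str, reference: str) -> float:
    return 1.0 if output.strip() == reference.strip() else 0.0

def token_overlap(output: str, reference: str) -> float:
    out, ref = set(output.split()), set(reference.split())
    return len(out & ref) / len(ref) if ref else 0.0

def score_reference_based(dataset: list[dict], criteria: list[Criterion]) -> dict:
    """Average each criterion over (output, reference) pairs."""
    totals = {c.name: 0.0 for c in criteria}
    for row in dataset:
        for c in criteria:
            totals[c.name] += c.score(row["output"], row["reference"])
    return {name: total / len(dataset) for name, total in totals.items()}

# Toy data, purely for illustration.
dataset = [
    {"output": "Paris is the capital of France.",
     "reference": "Paris is the capital of France."},
    {"output": "Berlin is in Germany.",
     "reference": "Berlin is the capital of Germany."},
]
print(score_reference_based(dataset, [Criterion("exact_match", exact_match),
                                      Criterion("token_overlap", token_overlap)]))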

13.7k installs · 90% success rate · by evalops
Score: 86 (Safe)

Demo: Eval Harness

$ claw run eval-harness

Running Eval Harness...

✓ Skill executed successfully

Output: Demo completed. Results saved to ./output/

This is a simulated demo output. In production, this would execute the skill in a sandboxed environment and show real-time results.
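
For a real run, the parameters listed in the manifest below would be supplied on the command line. The flag syntax here is an assumption inferred from the parameter names, not documented claw CLI usage:

$ claw run eval-harness --eval_dataset ./data/qa.jsonl --metrics accuracy,safety --model_endpoint https://api.example.com/v1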

Evaluation Scores

Safety: 94
Executability: 84
Completeness: 86
Maintainability: 88
Cost: 76
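
The description says Eval Harness tracks results over time to catch regressions; here is a minimal sketch of what that comparison could look like, using the scores above as a baseline. The run format and the 3-point threshold are assumptions, not the skill's actual policy.

# Sketch of the regression check implied by "tracks evaluation results
# over time". previous_run uses the scores shown above; current_run is
# toy data, and the 3-point threshold is an assumed policy.
previous_run = {"Safety": 94, "Executability": 84, "Completeness": 86,
                "Maintainability": 88, "Cost": 76}
current_run = {"Safety": 93, "Executability": 80, "Completeness": 87,
               "Maintainability": 88, "Cost": 74}

THRESHOLD = 3  # flag any metric that drops by 3+ points

regressions = {
    metric: (previous_run[metric], score)
    for metric, score in current_run.items()
    if previous_run[metric] - score >= THRESHOLD
}

for metric, (before, after) in sorted(regressions.items()):
    print(f"REGRESSION: {metric} dropped {before} -> {after}")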

Client Support

cli
web
claw123

Manifest

Triggers

evaluate model outputs
run eval suite
benchmark AI performance

Tools

openai
python

Parameters

eval_dataset: Path to the evaluation dataset
metrics: Comma-separated metrics to compute (accuracy | safety | fluency | relevance)
model_endpoint: API endpoint of the model to evaluate
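
A sketch of how these three parameters could drive an evaluation pass using the openai client from the Tools list, assuming the model under test exposes an OpenAI-compatible endpoint. The argument parsing, dataset fields, model name, and stub scorers are assumptions; only the parameter names come from the manifest.

# Hypothetical wiring of the manifest parameters. Assumes a JSONL
# dataset with "prompt" and "reference" fields and an OpenAI-compatible
# endpoint; the scorers are placeholders, not the skill's real metrics.
import argparse
import json
from openai import OpenAI

parser = argparse.ArgumentParser()
parser.add_argument("--eval_dataset", required=True)    # path to JSONL dataset
parser.add_argument("--metrics", default="accuracy")    # comma-separated metrics
parser.add_argument("--model_endpoint", required=True)  # model under test
args = parser.parse_args()

client = OpenAI(base_url=args.model_endpoint)  # API key read from OPENAI_API_KEY

def accuracy(output: str, row: dict) -> float:
    # Reference-based: exact match against the expected answer.
    return 1.0 if output.strip() == row["reference"].strip() else 0.0

def safety(output: str, row: dict) -> float:
    # Reference-free: naive keyword screen, a stand-in for a real classifier.
    return 0.0 if "DROP TABLE" in output else 1.0

SCORERS = {"accuracy": accuracy, "safety": safety}
metrics = [m.strip() for m in args.metrics.split(",")]

with open(args.eval_dataset) as f:
    rows = [json.loads(line) for line in f]

totals = {m: 0.0 for m in metrics}
for row in rows:
    reply = client.chat.completions.create(
        model="model-under-test",  # assumed model name at the endpoint
        messages=[{"role": "user", "content": row["prompt"]}],
    )
    output = reply.choices[0].message.content
    for m in metrics:
        totals[m] += SCORERS[m](output, row)

for m in metrics:
    print(f"{m}: {totals[m] / len(rows):.2f}")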

Metadata

Industry

AI

Role

Developer
Data

Capability

Analysis
Execution

Use Case

Reporting

Tech Stack

OpenAI
Python

Related Skills

Prompt Optimizer (88)
Agent Builder (90)
Test Generator (84)