Skill Evaluation
claw skill evaluate eval-harness
Score Breakdown
Safety: 94
Executability: 84
Completeness: 86
Maintainability: 88
Cost: 76
Check Results
Evaluation Checks
Overall Score: 86
dataset_validation: PASS
Eval datasets validated for format and completeness
metric_correctness: PASS
Metrics computed correctly against baselines
reproducibility: PASS
Results reproducible with fixed seeds
statistical_robustness: PASS
Confidence intervals provided
cost_tracking: PASS
API costs tracked per evaluation run
drift_detection: WARN
Alerting on score regressions requires threshold tuning
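
The drift_detection warning says the regression threshold still needs tuning. One possible way to set it, sketched below, is to derive the threshold from a confidence interval over baseline scores, computed with a fixed seed so the alert is reproducible. This is an illustration only, not the harness's actual implementation: the function names `bootstrap_ci` and `is_regression`, the `min_drop` margin, and the sample scores are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def is_regression(baseline_scores, new_score, min_drop=1.0, seed=0):
    """Flag a regression when new_score falls more than min_drop points
    below the lower bound of the baseline confidence interval.

    min_drop is a hypothetical tuning knob, not a harness setting.
    """
    lo, _ = bootstrap_ci(baseline_scores, seed=seed)
    return new_score < lo - min_drop


if __name__ == "__main__":
    baseline = [86, 85, 87, 86, 84, 88]  # made-up scores from earlier runs
    print(bootstrap_ci(baseline))
    print(is_regression(baseline, 70))
```

Raising `min_drop` makes the alert quieter, while lowering it catches smaller regressions at the cost of more false alarms. A wider baseline window narrows the interval and makes the check stricter.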