Skill Evaluation

claw skill evaluate eval-harness

Score Breakdown

Safety: 94
Executability: 84
Completeness: 86
Maintainability: 88
Cost: 76
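
The overall score of 86 reported below matches the unweighted mean of the five category scores (85.6) rounded to the nearest integer. The report does not say how the tool combines them, so the formula in this sketch is an assumption, and `overall_score` is a hypothetical helper rather than part of `claw`.

```python
# Category scores copied from the Score Breakdown above.
category_scores = {
    "Safety": 94,
    "Executability": 84,
    "Completeness": 86,
    "Maintainability": 88,
    "Cost": 76,
}


def overall_score(scores):
    """Assumed aggregation: unweighted mean, rounded to the nearest integer."""
    return round(sum(scores.values()) / len(scores))


print(overall_score(category_scores))
```

If the tool weights categories differently, the result could still round to 86, so this agreement is suggestive rather than conclusive.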

Check Results

Evaluation Checks

Overall Score: 86
dataset_validation: PASS

Eval datasets validated for format and completeness

metric_correctness: PASS

Metrics computed correctly against baselines

reproducibility: PASS

Results reproducible with fixed seeds

statistical_robustness: PASS

Confidence intervals provided

cost_tracking: PASS

API costs tracked per evaluation run

drift_detection: WARN

Alerting on score regressions requires threshold tuning
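
The reproducibility, statistical_robustness, and drift_detection checks above fit together, and the sketch below shows one way they might. It computes a seeded percentile-bootstrap confidence interval over per-example scores, then raises a regression alert only when the drop exceeds a tunable threshold and the new mean falls below the baseline interval. The function names, the `min_drop` default of 0.02, and the sample scores are all hypothetical. Nothing here reflects how eval-harness implements these checks.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    # A local RNG with a fixed seed makes the interval reproducible.
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def has_regressed(baseline_scores, current_scores, min_drop=0.02, seed=0):
    """Alert only if the drop exceeds `min_drop` AND leaves the baseline CI."""
    lower, _ = bootstrap_ci(baseline_scores, seed=seed)
    current_mean = statistics.fmean(current_scores)
    drop = statistics.fmean(baseline_scores) - current_mean
    return drop > min_drop and current_mean < lower


# Illustrative per-example scores (made up for this sketch).
baseline = [0.90, 0.80, 0.85, 0.95, 0.90, 0.88, 0.92, 0.87]
current = [0.50] * 8

print(bootstrap_ci(baseline))
print(has_regressed(baseline, current))
```

Requiring both conditions is one way to approach the threshold tuning the warning mentions. A fixed `min_drop` alone tends to fire on noise for small datasets, while the interval alone can flag drops too small to matter.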