Eval Harness provides a framework for systematically evaluating LLM and AI agent outputs. It supports custom evaluation criteria, reference-based and reference-free scoring, and human preference alignment, and it tracks evaluation results over time to detect performance regressions and model drift.
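To make the description concrete, here is a minimal sketch of that evaluation flow in Python. The names here (EvalCase, exact_match, length_penalty, run_suite) are illustrative assumptions, not Eval Harness's actual API: one metric needs a reference, one does not, and per-run averages are the values a harness would store to spot regressions over time.

```python
# Illustrative sketch only; class and function names are assumptions,
# not Eval Harness's confirmed interface.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalCase:
    prompt: str
    output: str                       # model output under evaluation
    reference: Optional[str] = None   # present only for reference-based scoring

def exact_match(case: EvalCase) -> float:
    """Reference-based metric: 1.0 if the output matches the reference."""
    if case.reference is None:
        raise ValueError("exact_match requires a reference")
    return float(case.output.strip() == case.reference.strip())

def length_penalty(case: EvalCase) -> float:
    """Reference-free metric: penalize outputs longer than 100 words."""
    return 1.0 if len(case.output.split()) <= 100 else 0.0

def run_suite(cases: list[EvalCase],
              metrics: dict[str, Callable[[EvalCase], float]]) -> dict[str, float]:
    """Average each metric over the dataset. Comparing these averages
    across runs is how regressions and drift would be detected."""
    return {name: sum(fn(c) for c in cases) / len(cases)
            for name, fn in metrics.items()}

cases = [
    EvalCase("2+2?", "4", reference="4"),
    EvalCase("Summarize the report.", "A short summary."),  # reference-free only
]
print(run_suite([c for c in cases if c.reference], {"exact_match": exact_match}))
print(run_suite(cases, {"length_penalty": length_penalty}))
```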
Triggers
- evaluate model outputs
- run eval suite
- benchmark AI performance
Parameters
- eval_dataset - Path to the evaluation dataset
- metrics - Comma-separated metrics: accuracy | safety | fluency | relevance
- model_endpoint - API endpoint of the model to evaluate
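A hypothetical invocation using these parameters might look like the sketch below. The payload shape, file path, endpoint URL, and the validate helper are assumptions for illustration, not the tool's confirmed interface; the point is that the metrics string is split and checked against the four supported values before a run is dispatched.

```python
# Hypothetical parameter payload; values are placeholders, not real paths
# or endpoints.
params = {
    "eval_dataset": "data/summarization_eval.jsonl",
    "metrics": "accuracy,fluency,relevance",
    "model_endpoint": "https://api.example.com/v1/generate",
}

def validate(params: dict) -> dict:
    """Split the comma-separated metrics string and reject unknown names."""
    allowed = {"accuracy", "safety", "fluency", "relevance"}
    requested = [m.strip() for m in params["metrics"].split(",")]
    unknown = set(requested) - allowed
    if unknown:
        raise ValueError(f"unsupported metrics: {sorted(unknown)}")
    return {**params, "metrics": requested}

print(validate(params))
```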