Up and running in 60 seconds
Install, initialize, check, and evaluate — four commands to your first AI skill evaluation.
Multi-dimensional scoring, pipeline evaluation, and analytics for SKILL.md files. Open source CLI + Web.
From multi-dimensional rubrics to full CI pipelines — md-evals gives you the tools to measure, track, and improve AI agent performance.
Evaluate skills across 7 quality dimensions with configurable YAML rubrics. Letter grades from S to F give instant clarity.
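A rubric might look like the sketch below. The filename `rubric.yaml`, the dimension names, and the `weight`/`grades` keys are illustrative assumptions, not md-evals' documented schema — run `md-evals init` to see the real field names.

```shell
# Hypothetical rubric sketch: weighted dimensions plus letter-grade cutoffs.
# All key names and numbers here are examples, not the official schema.
cat > rubric.yaml <<'EOF'
dimensions:
  clarity:
    weight: 0.2          # share of the overall weighted score
    description: Instructions are unambiguous and ordered.
  safety:
    weight: 0.3
    description: Skill refuses out-of-scope or harmful requests.
grades:                  # minimum weighted score for each letter grade
  S: 0.95
  A: 0.85
  F: 0.0
EOF
```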
Three-stage evaluation: Auditor analyzes, Target executes, Judge scores. Use a different model per stage so no model grades its own output.
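Assigning a model per stage could look like the following config sketch; the filename, key names, and model identifiers are all assumptions for illustration, not md-evals' documented configuration.

```shell
# Hypothetical per-stage model assignment. Separating the judge model from
# the target model keeps a model from scoring its own transcripts.
cat > eval-config.yaml <<'EOF'
pipeline:
  auditor:
    model: model-a       # reads SKILL.md and drafts probes
  target:
    model: model-b       # actually executes the skill
  judge:
    model: model-c       # scores transcripts against the rubric
EOF
```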
Built-in probes for dimensions, edge cases, compliance, and Gherkin scenarios. Extend with your own via Python entry_points.
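Packaging a custom probe amounts to exposing a callable through a Python entry point. The entry-point group name `md_evals.probes` and the probe's signature below are assumptions — check the plugin docs for the real group and interface.

```shell
# Minimal custom-probe package. Group name "md_evals.probes" and the
# probe's call signature are assumptions, not the documented contract.
mkdir -p my_probes
cat > my_probes/__init__.py <<'EOF'
def latency_probe(skill_text):
    """Hypothetical probe: flag skills with no timeout guidance."""
    return {"pass": "timeout" in skill_text.lower()}
EOF
cat > pyproject.toml <<'EOF'
[project]
name = "my-probes"
version = "0.1.0"

[project.entry-points."md_evals.probes"]
latency = "my_probes:latency_probe"
EOF
```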
Citation validation ensures the LLM references specific lines. Gherkin-like eval scenarios define precise Given/When/Then checks.
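A scenario in standard Gherkin style might read as below; the `.feature` filename and the citation step are illustrative, not md-evals' exact DSL.

```shell
# Given/When/Then scenario in ordinary Gherkin syntax. The final step
# sketches a citation check; exact md-evals step wording is an assumption.
cat > refund-policy.feature <<'EOF'
Feature: Refund skill answers cite the policy text
  Scenario: User asks about the refund window
    Given the skill file SKILL.md is loaded
    When the user asks "How long do I have to return an item?"
    Then the answer states a 30-day window
    And the answer cites the line that defines the window
EOF
```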
Run eval suites in CI with proper exit codes. Generate static HTML reports. Evaluate entire plugin directories at once.
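In CI, a nonzero exit code from the suite fails the job. The GitHub Actions sketch below uses only the commands shown in the quickstart; the workflow layout itself is an illustrative assumption.

```shell
# Minimal CI job sketch: md-evals' exit code fails the build on a failing
# suite. Only quickstart commands are used; flags beyond these are not assumed.
mkdir -p .github/workflows
cat > .github/workflows/skill-eval.yml <<'EOF'
name: skill-eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install md-evals
      - run: md-evals check SKILL.md     # fast pre-flight validation
      - run: md-evals run --pipeline     # nonzero exit fails the job
EOF
```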
Track score trends over time, monitor cost per model, and explore skills × dimensions with interactive heatmaps.
Pre-check skills, run full pipeline evaluations, execute test suites, and track analytics — all from your terminal. Perfect for CI/CD integration.
Install from PyPI
pip install md-evals
Create config files
md-evals init
Validate your skill file
md-evals check SKILL.md
Run the full pipeline
md-evals run --pipeline