Metrics
5 skills with this tag
affaan-m
Passed
Eval Harness
Eval Harness provides a formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles. It helps define expected behaviors before implementation, run evals continuously during development, and track pass/fail metrics for both capability and regression tests.
Testing, Evaluation, TDD +3
6332.2k
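The eval-driven loop this skill describes (write the expected behavior first, run evals continuously, track pass/fail) can be sketched in a few lines. This is a hypothetical illustration, not the skill's actual harness; `Eval`, `Harness`, and `slugify` are invented names.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical EDD sketch: each eval states an expected behavior
# before the implementation exists, tagged as capability or regression.
@dataclass
class Eval:
    name: str
    kind: str                   # "capability" or "regression"
    check: Callable[[], bool]   # expected behavior, written first

@dataclass
class Harness:
    evals: list = field(default_factory=list)

    def run(self) -> dict:
        results = {"pass": 0, "fail": 0}
        for e in self.evals:
            results["pass" if e.check() else "fail"] += 1
        return results

# The eval exists before slugify is "done"; it fails until the
# implementation satisfies it.
def slugify(s: str) -> str:
    return s.lower().replace(" ", "-")

h = Harness([Eval("slugify lowers and dashes", "capability",
                  lambda: slugify("Hello World") == "hello-world")])
print(h.run())  # {'pass': 1, 'fail': 0}
```

Running the same harness on every change is what turns the capability evals into regression tests.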
wshobson
Passed
prometheus-configuration
This skill helps you set up and configure Prometheus for comprehensive metric collection and monitoring. It provides configuration templates for scrape jobs, recording rules, alert rules, and Kubernetes service discovery, along with validation commands and best practices for production deployments.
Prometheus, Monitoring, Metrics +3
19327.0k
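A scrape-job template like the ones this skill generates might look as follows. This is a minimal sketch, not the skill's output; the job name, target address, and rule file name are placeholders.

```yaml
# Hypothetical minimal prometheus.yml: one scrape job plus a rule file.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alert_rules.yml          # assumption: alert/recording rules live here

scrape_configs:
  - job_name: my-api         # placeholder service name
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]   # placeholder target
```

A config like this can be validated before deployment with `promtool check config prometheus.yml`, which matches the "validation commands" the skill mentions.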
wshobson
Passed
grafana-dashboards
This skill helps you create production-ready Grafana dashboards for monitoring applications and infrastructure. It provides design patterns like RED and USE methods, JSON templates for common dashboard types (API, database, infrastructure), and provisioning examples using Terraform and Ansible.
Grafana, Monitoring, Dashboards +3
65527.0k
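The RED method the skill mentions (Rate, Errors, Duration) maps naturally onto generated dashboard JSON. A minimal sketch, assuming Prometheus-style metric names (`http_requests_total`, `http_request_duration_seconds_bucket`) that the real skill would parameterize:

```python
import json

# Hypothetical generator for a minimal Grafana dashboard JSON following
# the RED method. Metric names and PromQL expressions are assumptions.
RED_QUERIES = {
    "Rate":     'sum(rate(http_requests_total[5m]))',
    "Errors":   'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    "Duration": ('histogram_quantile(0.95, sum(rate('
                 'http_request_duration_seconds_bucket[5m])) by (le))'),
}

def red_dashboard(service: str) -> dict:
    panels = []
    for i, (title, expr) in enumerate(RED_QUERIES.items()):
        panels.append({
            "type": "timeseries",
            "title": f"{service} {title}",
            "targets": [{"expr": expr}],
            "gridPos": {"h": 8, "w": 8, "x": 8 * i, "y": 0},
        })
    return {"title": f"{service} RED overview", "panels": panels}

# Emit provisioning-ready JSON for a hypothetical service.
print(json.dumps(red_dashboard("checkout-api"), indent=2)[:60], "...")
```

Emitting dashboards as code like this is what makes the Terraform/Ansible provisioning examples in the skill reproducible.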
wshobson
Passed
llm-evaluation
This skill teaches comprehensive evaluation strategies for LLM applications, covering automated metrics (BLEU, ROUGE, BERTScore), human evaluation frameworks, LLM-as-Judge patterns using Claude, A/B testing with statistical analysis, and regression detection. It includes ready-to-use Python code examples and integrates with tools like LangSmith.
A/B Testing, Quality Assurance, LLM Evaluation +3
53727.0k
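Of the automated metrics listed, ROUGE-1 recall is simple enough to sketch with the standard library alone: it measures what fraction of the reference's unigrams appear in the candidate. This is an illustrative toy, not the skill's implementation (which would typically use a library such as `rouge-score`):

```python
from collections import Counter

# ROUGE-1 recall sketch: clipped unigram overlap divided by the
# number of unigrams in the reference.
def rouge1_recall(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

print(rouge1_recall("the cat sat on the mat",
                    "the cat is on the mat"))  # 0.8333... = 5/6
```

BLEU flips the direction (precision over the candidate, with a brevity penalty), and BERTScore replaces exact token matching with embedding similarity; the skill covers all three.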
NeoLabHQ
Passed
Agent Evaluation
A comprehensive evaluation framework for assessing Claude Code agents, commands, and skills. Provides LLM-as-Judge implementation patterns, multi-dimensional rubrics, bias mitigation techniques, and metrics for measuring agent quality across instruction following, completeness, tool efficiency, reasoning, and coherence.
Evaluation, Quality Assurance, LLM-as-Judge +3
529160
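A multi-dimensional rubric like the one this skill describes reduces to a weighted aggregation over per-dimension judge scores. A minimal sketch, where the dimension names come from the description but the weights and the 1-5 scale are assumptions; in a real pipeline each score would come from a judge-model call rather than being hard-coded:

```python
# Hypothetical rubric weights; in practice these would be tuned
# and each score produced by an LLM-as-Judge call.
RUBRIC_WEIGHTS = {
    "instruction_following": 0.30,
    "completeness":          0.25,
    "tool_efficiency":       0.15,
    "reasoning":             0.20,
    "coherence":             0.10,
}

def aggregate(scores: dict) -> float:
    """Weighted mean of per-dimension judge scores on a 1-5 scale."""
    assert set(scores) == set(RUBRIC_WEIGHTS), "score every dimension"
    return sum(RUBRIC_WEIGHTS[d] * s for d, s in scores.items())

example = {"instruction_following": 4, "completeness": 5,
           "tool_efficiency": 3, "reasoning": 4, "coherence": 5}
print(round(aggregate(example), 2))  # 4.2
```

Bias mitigation, which the skill also covers, typically happens upstream of this aggregation, e.g. by randomizing answer order in pairwise comparisons to counter position bias.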