
SOTA Metrics for Agent Behavior

Research-grounded evaluation purpose-built for long-running agents. Detect uncertainty, score trajectories, and catch regressions before your users do.


Why Evaluation

Know when your agent is right — and when it isn't.

Agents drift in ways accuracy alone can't reveal. PandaProbe scores behavior across whole trajectories so you can ship and improve with confidence.

SOTA Agent Metrics

Research-grounded evaluation metrics purpose-built for long-running agents — not just single prompt/response pairs.

Diagnose Uncertainty

The only platform with metrics that quantify and explain agent uncertainty — pinpoint exactly where agents become unstable.

Trace + Session

Score individual traces for granular quality signals, then evaluate entire sessions to catch patterns across a full workflow.

Explainable Scores

LLM-as-judge metrics return scores with human-readable explanations, so you know exactly why a trace scored the way it did.

Automated Monitoring

Schedule recurring eval runs against production traffic to catch behavioral drift and regressions before your users do.

Open & Managed

Apache-2.0 and provider-agnostic. On Cloud, PandaProbe runs the eval LLM and embedding models for you — no external API keys.

How It Works

From traces to behavior scores.

Three stages take you from raw traces to trace scores to agent-level behavior scores.

Tracing

Capture

Your instrumented agent records each run as a structured trace — the foundation every evaluation builds on.

Trace Evaluation

Score

Evals score each trace, either with an LLM-as-judge that returns explanations or with deterministic, embedding-based metrics.

Agent Evaluation

Diagnose

SOTA algorithms transform trace signals into agent scores — quantifying behavior and uncertainty.

Evaluation Metrics

Metrics for traces and agents.

Trace metrics rate individual runs across quality dimensions. Agent metrics build on them to quantify behavior and uncertainty over a full session.

Trace metrics

Score individual agent runs across nine quality dimensions.

task completion

How fully the agent accomplished the user's objective.

tool correctness

Whether the right tools were selected for the task.

argument correctness

Whether tool parameters matched what the task required.

step efficiency

How direct the execution path was, free of redundant steps.

confidence

How decisive and well-founded the agent's actions were.

plan adherence

How closely execution followed the agent's declared plan.

plan quality

How structurally sound and complete the plan was.

coherence

How well the output follows logically from the input.

loop detection

Whether the agent repeats itself across the trace.
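As a rough illustration of what a loop-detection signal can look like, here is a minimal sketch that flags steps repeating a recent tool call. It is not PandaProbe's implementation: the step schema (`tool`, `args`), the `loop_score` name, and the exact-match comparison are all assumptions made for the example, and a production metric may rely on an LLM judge or embeddings instead.

```python
import json


def loop_score(steps, window=3):
    """Fraction of steps that repeat a call seen in the previous `window` steps.

    `steps` is a hypothetical list of dicts with "tool" and "args" keys.
    0.0 means no repetition; values near 1.0 mean the agent is looping.
    """
    # Serialize args with sorted keys so equal calls compare equal.
    keys = [(s["tool"], json.dumps(s["args"], sort_keys=True)) for s in steps]
    repeats = 0
    for i, key in enumerate(keys):
        # Look back only over the most recent `window` steps.
        if key in keys[max(0, i - window):i]:
            repeats += 1
    return repeats / len(keys) if keys else 0.0
```

Exact matching keeps the check deterministic and cheap, but it misses near-duplicate calls, which is where a similarity-based comparison would help.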

Agent metrics

A session captures an agent's full lifecycle across its traces. Agent metrics quantify behavior and uncertainty over that whole session.

agent reliability

Tail-risk score. Each trace's risk is the maximum of its inverted quality signals, and the score blends risk over the worst 15% of traces, so one catastrophic trace can't hide behind good ones.

agent consistency

Variance score. Per-trace uncertainty is aggregated by root mean square (RMS) and multiplicatively amplified when failure signals compound, which penalizes spread, not just the mean.
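The two agent metrics can be sketched in a few lines of Python. This is an illustrative reading of the descriptions above, not the actual PandaProbe algorithms: the signal range (0 to 1, higher is better), the `tail_weight` blend with the overall mean, the `amplification` factor, and the rule that a trace "compounds" when two or more signals fail are all assumptions chosen for the example.

```python
import math


def trace_risk(signals):
    """Max-compose risk from quality signals in [0, 1] (1.0 = best)."""
    # Invert each signal so a low score becomes a high risk, then take
    # the maximum: the worst dimension dominates the trace's risk.
    return max(1.0 - s for s in signals.values())


def agent_reliability(traces, tail_fraction=0.15, tail_weight=0.7):
    """Tail-risk score: higher is more reliable.

    Assumption: the worst-tail mean is blended with the overall mean
    using `tail_weight`; the real blend is not specified in the source.
    """
    risks = sorted((trace_risk(t) for t in traces), reverse=True)
    tail_size = max(1, math.ceil(tail_fraction * len(risks)))
    tail_mean = sum(risks[:tail_size]) / tail_size
    overall_mean = sum(risks) / len(risks)
    blended = tail_weight * tail_mean + (1.0 - tail_weight) * overall_mean
    return 1.0 - blended


def agent_consistency(uncertainties, failure_counts, amplification=0.5):
    """Variance score: higher is more consistent.

    Assumption: a trace "compounds" when it has two or more failed
    signals, and the RMS is scaled up by the share of such traces.
    """
    rms = math.sqrt(sum(u * u for u in uncertainties) / len(uncertainties))
    compounding = sum(1 for c in failure_counts if c >= 2) / len(failure_counts)
    amplified = min(1.0, rms * (1.0 + amplification * compounding))
    return 1.0 - amplified
```

Under these assumptions, a session with nine good traces and one catastrophic trace scores well below what a plain average would suggest, and RMS ranks a session with one highly uncertain trace below a session with the same mean uncertainty spread evenly.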


Monitoring

Catch regressions before users do.

Schedule recurring eval runs against your production traffic. Spot behavioral drift and performance regressions the moment they appear.

Flexible scheduling

Run evals hourly, daily, or on a custom cron against live production traffic.

Sampling controls

Tune the sampling rate to balance eval coverage against volume and cost.

Regression alerts

Get notified the moment a metric regresses across agent versions.

Managed eval LLM

PandaProbe Cloud runs the judge and embedding models — no external API keys.

Single or suite

Monitor one targeted metric or run the full metric suite on every cycle.

Dashboard & API

Trigger runs and review results from the dashboard or programmatically.
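To make the sampling and regression ideas concrete, here is a minimal sketch of how they could work. It does not use the PandaProbe API: `should_sample`, `regressed`, the hash-based bucketing, and the fixed `tolerance` threshold are hypothetical choices for illustration only.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces, keyed on trace ID.

    Hashing the ID (instead of calling random()) means the same trace is
    always in or out of the sample, so repeated runs stay comparable.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def regressed(baseline_scores, candidate_scores, tolerance=0.05):
    """Flag a regression when the candidate mean falls more than
    `tolerance` below the baseline mean (a hypothetical alert rule)."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    return (baseline - candidate) > tolerance
```

A lower `rate` cuts eval volume and cost at the price of coverage, and a mean-difference threshold is the simplest possible alert rule; a real setup would likely also account for sample size and variance.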


Quick Start

Evaluate your agent's behavior in minutes.

Install the PandaProbe skill and run your first evals through natural language.

Step One

Install the skill


Run it in your terminal, or hand the onboarding prompt to your coding agent.

Step Two

Run evals on your traces

Ask your agent to score traces and surface where behavior drifts — no manual setup.

Q&A

Frequently asked questions

Everything you need to know about evaluating your agents with PandaProbe.

Get Started

Start evaluating your agents today.