PandaProbe

SOTA Metrics for Agent Behavior

Research-grounded evaluation purpose-built for long-running agents. Detect uncertainty, score trajectories, and catch regressions before your users do.

Trace Evaluation

9 built-in metrics for every trace.

Each trace is assessed across multiple quality dimensions. LLM-as-judge metrics use multi-stage pipelines that extract context, evaluate quality, and return structured 0–1 scores with human-readable explanations. Embedding metrics bypass LLM calls entirely for deterministic analysis.
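
As a rough illustration of what a "structured 0–1 score with a human-readable explanation" could look like in code, here is a minimal sketch. The `MetricResult` class, the `parse_judge_output` helper, and the raw `{"score": ..., "explanation": ...}` shape are all hypothetical names chosen for this example, not PandaProbe's actual API or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricResult:
    """Hypothetical container for one metric's result on one trace."""
    metric: str       # e.g. "task_completion"
    score: float      # normalized to the 0-1 range
    explanation: str  # human-readable rationale
    kind: str         # "llm_judge" or "embedding"

    def __post_init__(self):
        # Reject anything outside the documented 0-1 range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be in [0, 1], got {self.score}")


def parse_judge_output(metric: str, raw: dict) -> MetricResult:
    """Turn a judge's raw output (assumed shape) into a validated result.

    Clamping out-of-range scores is an assumption made for this sketch.
    """
    score = min(1.0, max(0.0, float(raw["score"])))
    explanation = str(raw.get("explanation", ""))
    return MetricResult(metric, score, explanation, "llm_judge")


result = parse_judge_output(
    "task_completion",
    {"score": 0.8, "explanation": "Objective met; one step was retried."},
)
print(result.metric, result.score, result.explanation)
```

Keeping the score and its explanation in one immutable record makes it straightforward to store trace-level results once and reuse them later.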

task_completion (LLM judge)

Did the agent accomplish the user's objective?

tool_correctness (LLM judge)

Were the right tools selected for the task?

argument_correctness (LLM judge)

Did tool parameters match what the task required?

step_efficiency (LLM judge)

Did the execution path avoid redundant steps?

confidence (LLM judge)

Were agent actions decisive and well-founded?

plan_adherence (LLM judge)

Did execution follow the agent's declared plan?

plan_quality (LLM judge)

Was the plan structurally sound and complete?

coherence (Embedding)

Does the output logically follow from the input?

loop_detection (Embedding)

Is the agent repeating itself across traces?
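
The two embedding metrics can be approximated with plain vector similarity. The sketch below is one common way to do it, not a description of PandaProbe's implementation: it assumes embeddings are already available as lists of floats, uses cosine similarity, and picks an arbitrary near-duplicate threshold of 0.95. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def coherence_score(input_vec, output_vec):
    """Similarity of output to input, with negative values clipped to 0."""
    return max(0.0, cosine(input_vec, output_vec))


def repeat_fraction(vectors, threshold=0.95):
    """Fraction of items that nearly duplicate some earlier item.

    The 0.95 threshold is an assumed value for this sketch.
    """
    if len(vectors) < 2:
        return 0.0
    repeats = 0
    for i in range(1, len(vectors)):
        if any(cosine(vectors[i], vectors[j]) >= threshold for j in range(i)):
            repeats += 1
    return repeats / (len(vectors) - 1)


steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
print(coherence_score([1.0, 0.0], [1.0, 0.0]))
print(repeat_fraction(steps))
```

Because nothing here calls a model at scoring time, the same vectors always produce the same numbers, which is the property the "deterministic analysis" claim relies on.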

Session Evaluation

Evaluate the full agent lifecycle.

Individual traces show what happened in a single request. Session evaluation reveals patterns only visible across an entire conversation or workflow — tool calls succeeding but being ignored, quality degrading over many steps, or an agent that handles 9 out of 10 requests well but catastrophically fails on the 10th.

Session scores are computed deterministically from precomputed trace-level signals — no additional LLM calls.

agent_reliability

Worst-case failure risk across the session. A single catastrophic trace significantly impacts the score — ideal for safety-critical applications.

agent_consistency

Overall stability across all traces using weighted RMS aggregation. Captures the spread of issues, not just the average — useful for quality-sensitive applications.

Signals feeding into session scores: confidence, loop_detection, tool_correctness, coherence — all precomputed at the trace level, reused for session aggregation.
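
To make the difference between the two session scores concrete, here is a minimal sketch of worst-case and weighted-RMS aggregation. The exact formulas PandaProbe uses are not stated on this page, so everything below is an assumption: each trace is reduced to a risk in the 0–1 range (the mean of `1 - score` over its signals, treating every signal as higher-is-better), reliability is taken as `1 - max(risk)`, and consistency as `1 - weighted RMS(risk)`.

```python
import math


def trace_risk(signals, weights=None):
    """Weighted mean of (1 - score) over a trace's signals (assumed formula)."""
    weights = weights or {name: 1.0 for name in signals}
    total = sum(weights[name] for name in signals)
    return sum(weights[name] * (1.0 - score) for name, score in signals.items()) / total


def agent_reliability(risks):
    """Worst-case view: the single riskiest trace determines the score."""
    if not risks:
        raise ValueError("at least one trace is required")
    return 1.0 - max(risks)


def agent_consistency(risks, trace_weights=None):
    """Weighted RMS view: large risks count more than in a plain average."""
    if not risks:
        raise ValueError("at least one trace is required")
    w = trace_weights or [1.0] * len(risks)
    rms = math.sqrt(sum(wi * r * r for wi, r in zip(w, risks)) / sum(w))
    return 1.0 - rms


# Nine clean traces and one catastrophic one.
risks = [0.0] * 9 + [1.0]
print(agent_reliability(risks))   # worst case dominates
print(agent_consistency(risks))   # penalized more than a plain mean would be
print(1.0 - sum(risks) / len(risks))  # plain-mean baseline for comparison
```

Under these assumptions, a plain average would rate the session at 0.9, the RMS-based score lands noticeably lower, and the worst-case score drops to 0. Both functions only read numbers that already exist per trace, so no further model calls are needed.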

Automated Monitoring

Catch regressions before users do.

Schedule recurring eval runs against your production traffic. Spot behavioral drift and performance regressions the moment they appear — not after a user complaint.

  • Schedule evals on any cadence — hourly, daily, or custom cron
  • Sampling rate controls to manage eval volume and cost
  • Alerts on metric regressions across agent versions
  • PandaProbe Cloud manages the eval LLM — no external API keys needed
  • Monitor individual metrics or run full metric suites
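
Two of the controls listed above, sampling rate and regression alerts, can be sketched in a few lines. This is an illustrative approach under stated assumptions rather than PandaProbe's actual mechanism: sampling is done by hashing a trace ID into the range [0, 1), and a regression is flagged when the mean score of a candidate version falls more than a chosen margin below the baseline. The function names and the 0.05 margin are hypothetical.

```python
import hashlib
from statistics import mean


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def regressed(baseline_scores, candidate_scores, max_drop=0.05):
    """True if the candidate's mean score dropped by more than `max_drop`."""
    return mean(baseline_scores) - mean(candidate_scores) > max_drop


trace_ids = [f"trace-{i}" for i in range(1000)]
sampled = [t for t in trace_ids if should_sample(t, 0.1)]
print(len(sampled))  # roughly 10% of traces; exact count depends on the hashes

print(regressed([0.90, 0.92, 0.88], [0.80, 0.79, 0.82]))
```

Hash-based sampling keeps the decision stable for a given trace across repeated runs, which makes scheduled evals easier to compare than random sampling would.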