Nine built-in metrics for every trace.
Each trace is assessed across multiple quality dimensions. LLM-as-judge metrics use multi-stage pipelines that extract context, evaluate quality, and return structured 0–1 scores with human-readable explanations. Embedding metrics bypass LLM calls entirely for deterministic analysis.
Did the agent accomplish the user's objective?
Were the right tools selected for the task?
Did tool parameters match what the task required?
Was the execution path free of redundant steps?
Were agent actions decisive and well-founded?
Did execution follow the agent's declared plan?
Was the plan structurally sound and complete?
Does the output logically follow from the input?
Is the agent repeating itself across traces?
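As a rough illustration of how an embedding-based check can work without any LLM call, the sketch below flags outputs that are near-duplicates of earlier ones using cosine similarity. It assumes the embeddings are already computed. The function names, the 0.95 threshold, and the "fraction of repeats" definition are hypothetical choices for this example, not PandaProbe's actual implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def repetition_fraction(embeddings, threshold=0.95):
    """Fraction of outputs that closely match some earlier output.

    `embeddings` is a list of precomputed vectors, one per trace output,
    in chronological order. The threshold is an illustrative assumption.
    """
    if len(embeddings) < 2:
        return 0.0
    repeats = 0
    for i in range(1, len(embeddings)):
        if any(
            cosine_similarity(embeddings[i], embeddings[j]) >= threshold
            for j in range(i)
        ):
            repeats += 1
    return repeats / (len(embeddings) - 1)

# Toy vectors: the second output duplicates the first, the third is unrelated.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_fraction = repetition_fraction(toy_embeddings)
```

Because the computation is pure arithmetic over fixed vectors, the same inputs always give the same result, which is the sense in which such a metric is deterministic.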
Evaluate the full agent lifecycle.
Individual traces show what happened in a single request. Session evaluation reveals patterns only visible across an entire conversation or workflow — tool calls succeeding but being ignored, quality degrading over many steps, or an agent that handles 9 out of 10 requests well but catastrophically fails on the 10th.
Session scores are computed deterministically from precomputed trace-level signals — no additional LLM calls.
Worst-case failure risk across the session. A single catastrophic trace significantly impacts the score — ideal for safety-critical applications.
Overall stability across all traces using weighted RMS aggregation. Captures the spread of issues, not just the average — useful for quality-sensitive applications.
Signals feeding into session scores: confidence, loop_detection, tool_correctness, coherence — all precomputed at the trace level, reused for session aggregation.
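To make the two aggregation styles concrete, here is a minimal sketch, assuming each trace has already been reduced to a single risk value between 0 (no issues) and 1 (failure). The function names, the risk scale, and the uniform default weights are assumptions made for illustration. The actual formulas and weights PandaProbe uses are not given here.

```python
import math

def worst_case_risk(trace_risks):
    """Highest per-trace risk in the session; one bad trace dominates."""
    if not trace_risks:
        raise ValueError("session has no traces")
    return max(trace_risks)

def weighted_rms(trace_risks, weights=None):
    """Weighted root-mean-square of per-trace risks.

    Squaring before averaging makes large values count more than they
    would in a plain mean, so the result reflects spread as well as level.
    """
    if not trace_risks:
        raise ValueError("session has no traces")
    if weights is None:
        weights = [1.0] * len(trace_risks)  # assumed default: equal weights
    if len(weights) != len(trace_risks):
        raise ValueError("weights and trace_risks must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_squares = sum(w * r * r for w, r in zip(weights, trace_risks))
    return math.sqrt(weighted_squares / total_weight)

# Nine mostly clean traces followed by one severe failure.
session_risks = [0.1] * 9 + [0.9]
session_mean = sum(session_risks) / len(session_risks)
session_worst = worst_case_risk(session_risks)
session_rms = weighted_rms(session_risks)
```

For this toy session the plain mean is 0.18, the RMS is about 0.3, and the worst case is 0.9. The mean hides the failing trace, the RMS is pulled toward it, and the worst-case value reports it directly. No LLM call is involved at this stage; both functions only read numbers that already exist.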
Catch regressions before users do.
Schedule recurring eval runs against your production traffic. Spot behavioral drift and performance regressions the moment they appear — not after a user complaint.
- Schedule evals on any cadence — hourly, daily, or custom cron
- Sampling rate controls to manage eval volume and cost
- Alerts on metric regressions across agent versions
- PandaProbe Cloud manages the eval LLM — no external API keys needed
- Monitor individual metrics or run full metric suites
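The sampling and regression-alert ideas in the list above can be sketched as follows. This is an illustrative outline only: `should_sample`, `is_regression`, the hash-based sampling scheme, and the tolerance value are hypothetical and do not describe PandaProbe's API or internals.

```python
import hashlib
from statistics import mean

def should_sample(trace_id, rate):
    """Decide deterministically whether a trace is included in an eval run.

    Hashing the trace ID gives a stable pseudo-random value in [0, 1),
    so the same trace is always either in or out for a given rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # 32-bit prefix scaled to [0, 1)
    return bucket < rate

def is_regression(baseline_scores, candidate_scores, tolerance=0.05):
    """Flag a regression when the candidate's mean score drops by more
    than `tolerance` below the baseline's mean score."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(candidate_scores) < mean(baseline_scores) - tolerance

# Example: metric scores from two agent versions (made-up numbers).
v1_scores = [0.9, 0.8]
v2_scores = [0.6, 0.7]
alert = is_regression(v1_scores, v2_scores)
```

A lower sampling rate evaluates fewer traces per run, which reduces eval volume and cost at the price of noisier averages. The tolerance plays the opposite role, keeping small fluctuations from raising alerts.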