06 · Trust + admin

Accuracy + evals

01 · Trust + admin

How we benchmark

Each engine has a held-out golden set of 50–150 hand-labeled examples. Promptfoo gates every prompt change in CI. Drift monitoring samples 1% of prod runs nightly.
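
As a rough illustration of the two checks described above, here is a minimal Python sketch. It is not Promptfoo's API and not our production code: the function names, the `GoldenExample` type, and the 0.90 pass threshold are all hypothetical, chosen only to show the shape of a golden-set gate and a 1% sampler.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Sequence

SAMPLE_RATE = 0.01   # 1% of prod runs, as stated above
MIN_ACCURACY = 0.90  # hypothetical threshold; the real gate value is not stated


def in_drift_sample(run_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select roughly `rate` of runs by hashing the run ID."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


@dataclass
class GoldenExample:
    input_text: str
    expected_label: str  # the hand-assigned label


def golden_accuracy(
    examples: Sequence[GoldenExample], predict: Callable[[str], str]
) -> float:
    """Fraction of golden examples where the prediction matches the label."""
    if not examples:
        raise ValueError("golden set is empty")
    correct = sum(1 for ex in examples if predict(ex.input_text) == ex.expected_label)
    return correct / len(examples)


def gate(
    examples: Sequence[GoldenExample],
    predict: Callable[[str], str],
    min_accuracy: float = MIN_ACCURACY,
) -> bool:
    """Return True if a prompt change may ship, False if CI should fail."""
    return golden_accuracy(examples, predict) >= min_accuracy
```

Hashing the run ID, rather than drawing a random number, means the same run is always either in or out of the sample, which makes a nightly job repeatable. That is a design choice of this sketch, not a statement about how the real sampler works.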

02 · Trust + admin

When we miss

We publish a quarterly public report (linked from /accuracy). It lists each engine's current accuracy and the failures we're working on.

★ · Still stuck?

Email support.

We reply within 11 minutes on Team and Business plans. support@trueleveler.com