Each engine has a held-out golden set of 50–150 hand-labeled examples. Promptfoo gates every prompt change in CI. Drift monitoring samples 1% of prod runs nightly.
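A minimal sketch of how the nightly 1% sample and an accuracy gate could work. This is an illustration under stated assumptions, not our production code or Promptfoo's API: the function names, the hash-based sampling, and the 0.02 tolerance are all hypothetical.

```python
import hashlib

SAMPLE_RATE = 0.01  # 1% of production runs, as described above


def in_drift_sample(run_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select roughly `rate` of runs by hashing the run ID.

    Hash-based sampling is an assumption made for this sketch; it keeps the
    choice reproducible, so the same run is always in or out of the sample.
    """
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32); scale it to [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that match the hand-labeled golden answers."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def gate_passes(candidate_acc: float, baseline_acc: float,
                tolerance: float = 0.02) -> bool:
    """Hypothetical gate: pass unless accuracy falls more than `tolerance`
    below the baseline measured on the golden set."""
    return candidate_acc >= baseline_acc - tolerance
```

In this sketch, a prompt change would be scored on the golden set with `accuracy`, then accepted or rejected by `gate_passes`; the nightly job would apply the same scoring to the runs selected by `in_drift_sample`.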
A quarterly public report, linked from /accuracy, lists each engine's current accuracy and the failures we're working on.
Replies within 11 minutes on Team and Business plans. support@trueleveler.com