Live eval · last run 4d ago
The agent grades itself.
The same RAG agent that powers the homepage chat runs against 89 hand-designed recruiter cases on every build. Pass/fail results for each case are visible below; failures aren't hidden, they're cited.
The same eval discipline I applied to the production T-Mobile copilot: 52 hand-designed pytest cases parametrized into over 400 distinct test invocations, with zero LLM-hallucination incidents in production since the suite landed. The dashboard below applies that discipline, in public, to this site's own agent. A parametrization layer is the next step.
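For a sense of what that parametrization layer could look like, here's a minimal pytest.mark.parametrize sketch. The answer stub, case wording, and expected phrases are hypothetical placeholders, not the actual suite; the point is the mechanism, where each tuple becomes its own reported test invocation, which is how a few dozen hand-written cases fan out into hundreds of runs.

```python
# Hedged sketch of a parametrized eval layer, not the real suite: the agent
# interface, case wording, and expected phrases below are all hypothetical.
import pytest


def answer(question: str) -> str:
    # Stand-in for the deployed RAG agent; the real test would call the
    # same entry point the homepage chat uses.
    raise NotImplementedError("wire this to the site's agent")


# Each case: (id, recruiter-style question, phrases the grounded answer
# must contain, phrases it must never produce).
CASES = [
    ("identity",     "What is your current role?",                ["Enidus"],    []),
    ("deep_dive",    "How does the AI Copilot retrieve context?", ["retrieval"], []),
    ("out_of_scope", "What is the candidate's home address?",     [],            ["address"]),
]


@pytest.mark.parametrize(
    "case_id,question,must_include,must_exclude",
    CASES,
    ids=[c[0] for c in CASES],
)
def test_agent_answer(case_id, question, must_include, must_exclude):
    reply = answer(question).lower()
    for phrase in must_include:
        assert phrase.lower() in reply, f"[{case_id}] missing: {phrase}"
    for phrase in must_exclude:
        assert phrase.lower() not in reply, f"[{case_id}] leaked: {phrase}"
```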
All questions: 88/89 (99% pass)
Mandatory: 38/38 (100% pass)
Nice-to-have: 37/37 (100% pass)
Out-of-scope: 13/14 (93% pass)
Per-question results
Identity & background
Current role / Enidus
AI Copilot deep-dive
Custom Reports & Dashboards deep-dive
Carrier API Gateway / BFF
ClaudeJob deep-dive
Denari RAG capstone
Weather pipeline / distributed systems coursework
Earlier roles
Education
Engineering judgment
AI engineering specifics
Behavioral / soft
Career goals
Hardball / probing
Out-of-scope sanity checks