sahil_mehta.

Live eval · last run 4d ago

The agent grades itself.

The same RAG agent that powers the homepage chat runs against 89 hand-designed recruiter cases on every build. Pass / fail per case is visible below — failures aren't hidden, they're cited.

The same eval discipline I applied to the production T-Mobile copilot: 52 hand-designed pytest cases parametrized into over 400 distinct test invocations, with zero LLM-hallucination incidents in production since the suite landed. The dashboard below applies the same approach, in public, to this site's own agent. A parametrization layer is the next step.
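A parametrization layer like the one described above can be sketched with `pytest.mark.parametrize`: each hand-designed case is crossed with paraphrase templates so a small case set expands into many distinct invocations. The case names, the `grade()` helper, and the expected phrases below are hypothetical stand-ins, not the site's real suite; `grade()` returns canned answers so the sketch runs standalone.

```python
import pytest

# Each hand-designed case: (id, question, phrases the answer must contain).
CASES = [
    ("current-role", "What is Sahil's current role?", ["Enidus"]),
    ("out-of-scope", "What's the weather today?", ["can't help", "out of scope"]),
]

# Paraphrase templates multiply each case into several distinct invocations.
PARAPHRASES = ["{q}", "Quick question: {q}", "{q} Please keep it short."]


def grade(question: str) -> str:
    # Placeholder for the real agent call; returns canned answers
    # so this sketch is self-contained.
    canned = {
        "What is Sahil's current role?": "He works at Enidus.",
        "What's the weather today?": "Sorry, that's out of scope for this agent.",
    }
    for key, answer in canned.items():
        if key in question:
            return answer
    return ""


# Stacked parametrize decorators take the cross product:
# 2 cases x 3 templates = 6 test invocations from 2 hand-written cases.
@pytest.mark.parametrize("template", PARAPHRASES)
@pytest.mark.parametrize(
    "case_id,question,must_contain", CASES, ids=[c[0] for c in CASES]
)
def test_agent_case(case_id, question, must_contain, template):
    answer = grade(template.format(q=question))
    assert any(phrase in answer for phrase in must_contain), case_id
```

The cross product is the point: growing either list multiplies coverage, which is how 52 cases can expand to 400+ invocations without 400 hand-written tests.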

All questions: 88/89 (99% pass)

Mandatory: 38/38 (100% pass)

Nice-to-have: 37/37 (100% pass)

Out-of-scope: 13/14 (93% pass)

Per-question results

Identity & background

Current role / Enidus

AI Copilot deep-dive

Custom Reports & Dashboards deep-dive

Carrier API Gateway / BFF

ClaudeJob deep-dive

Denari RAG capstone

Weather pipeline / distributed systems coursework

Earlier roles

Education

Engineering judgment

AI engineering specifics

Behavioral / soft

Career goals

Hardball / probing

Out-of-scope sanity checks