01 / project
AI Chatbot & Agentic Copilot
T-Mobile for Business · Enidus
The problem
Enidus operates Enterprise Portal — the customer-facing console that T-Mobile for Business resellers and account admins use to manage corporate phone fleets. The console works, but every meaningful action (suspending a line, swapping a SIM, upgrading a plan, ordering a device, restoring a cancelled service) lives behind 10+ click-through screens, multi-step wizards, and per-product business rules. For high-volume reseller users this is hours of friction per week, and friction creates escalations to engineering.
The brief I took on was to build a conversational AI copilot that lets users execute these multi-step transactions through natural language — without ever giving an LLM autonomous write access to a billing-grade production system. The constraints were non-negotiable: per-tenant data isolation, zero LLM-generated SQL, complete auditability of every action, and graceful behavior when the model hallucinates (because it will).
The platform context
The copilot doesn't live in isolation. It's one of three plugins I've shipped into Enterprise Portal alongside the Custom Reports & Dashboards plugin (self-serve analytics) and the Carrier API Gateway / BFF that fronts the actual T-Mobile carrier APIs underneath everything. Each plugin is an independent React module mounted into the portal shell with shared auth, RBAC, and tenant scoping; the AI copilot is the headline feature, but its safety guarantees only hold because the gateway and reporting plugins it composes against are themselves built on the same isolation primitives.
Architecture
End-to-end, a request walks four layers. A user sends a natural-language utterance from the chat UI. The classifier prompt routes it to one of 53 hand-designed intents — a closed set, never free-form. The planner (Claude or GPT-4 with native function-calling) selects which of 43 tools to invoke and emits arguments. Pydantic schemas validate every argument against live data — real device SKUs, valid BAN format, tenant-scoped account IDs — before the backend ever sees the call. Validated tool calls execute through parameterized SQL templates inside a session-scoped row-level-security context; the LLM never writes raw SQL and never picks the parameters. The response composer renders results back into chat plus an inline confirmation panel for any write-capable transaction.
The four-layer split is deliberate: each layer has a different failure mode and a different defense. A hallucinated intent gets caught by the closed set. A hallucinated tool argument gets caught by Pydantic. A malformed SQL parameter gets caught by row-level security. A wrong tool call that passes every validator still hits the stage-and-confirm panel before any side effect lands — the human is the final approver on writes. This is the architectural payoff of typed tool dispatch over freeform generation: the failure modes are observable and individually defensible, instead of one undifferentiated "the LLM did something wrong."
The sections below are the how for each layer — agent design covers the classifier and planner, multi-tenant SQL isolation covers the execution layer, and the knowledge layer covers retrieval. A hand-drawn architecture diagram for this section is on the way.
The agent design
Three layers, intentionally separated:
1. Intent classification — 53 intents. A natural-language utterance ("suspend line for John in marketing", "swap my CEO's SIM, his old one was lost") is first classified into one of those 53 user intents. The classifier itself is a small, fast prompt that returns a structured intent plus a confidence score; ambiguous cases route to clarification rather than guessing. The intent space was hand-designed against the actual telecom workflow taxonomy — it's not free-form NLU, it's a closed set the agent can never escape.
2. Tool dispatch — 43 Pydantic-typed handlers. Each intent maps to one or more tool handlers. Every tool has a Pydantic schema for its arguments and return shape, and those schemas are the same shape Claude and GPT-4 expect for native function-calling. The model picks a tool by name and emits arguments; Pydantic validates the arguments against the live schema (real device SKUs, valid BAN format, reseller-scoped account IDs) before anything reaches the backend. If validation fails, the tool call is rejected at the boundary and the agent is told to try again — the backend never sees a malformed call. (A trimmed-down sketch of this boundary follows the list.)
3. Execution — stage and confirm (write-capable tools only). Read-only tool calls — line counts, billing summaries, device-status lookups, account-context retrievals — execute directly and stream back into the chat; no confirmation step, no friction. State-mutating tool calls are different. The agent doesn't execute them; it composes a transaction (a single tool call or a multi-step plan: "validate device → reserve number → suspend old line → activate new line"), stages each step in an inline confirmation panel rendered alongside the chat, and waits for the user to click Confirm. Only then does the backend run the actual carrier API calls through the gateway plugin. The UX feels conversational, and the safety property is unambiguous: a hallucinated tool argument can't reach production on a write because a human is the gate.
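For concreteness, here is a minimal sketch of layers 1 and 2, assuming Pydantic v2. The intent names, the single tool schema, and the registry are illustrative stand-ins for the production catalog (53 intents, 43 tools), which lives behind NDA.

```python
# Illustrative sketch only: a trimmed-down intent set and one tool schema.
# The real system has 53 intents and 43 tools; all names here are hypothetical.
from enum import Enum
from typing import Literal

from pydantic import BaseModel, Field


class Intent(str, Enum):
    SUSPEND_LINE = "suspend_line"
    SWAP_SIM = "swap_sim"
    BILLING_SUMMARY = "billing_summary"
    CLARIFY = "clarify"            # fallback route when confidence is low


class IntentResult(BaseModel):
    """Structured classifier output: a closed set, never free-form text."""
    intent: Intent
    confidence: float = Field(ge=0.0, le=1.0)


class SuspendLineArgs(BaseModel):
    """Arguments the planner must emit to call the suspend_line tool."""
    ban: str = Field(pattern=r"^\d{9}$")      # surface format check; tenant-scope check happens deeper
    msisdn: str = Field(pattern=r"^\d{10}$")  # the line to suspend
    reason: Literal["lost_stolen", "customer_request", "non_payment"]


# Tool registry: name -> (argument schema, whether the tool mutates state).
TOOL_REGISTRY = {
    "suspend_line": (SuspendLineArgs, True),
    # ... 42 more tools in production
}


def validate_tool_call(tool_name: str, raw_args: dict):
    """Gate every model-emitted tool call at the schema boundary."""
    if tool_name not in TOOL_REGISTRY:
        raise LookupError(f"unknown tool: {tool_name}")   # the model cannot invent tools
    schema, mutates_state = TOOL_REGISTRY[tool_name]
    args = schema.model_validate(raw_args)                # ValidationError -> agent is told to retry
    # Writes are staged for human confirmation; reads execute directly.
    return args, ("stage_and_confirm" if mutates_state else "execute")
```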
Multi-tenant SQL isolation — three layers
The LLM has no path to a raw query. The isolation model is layered so that even if one layer fails, the others hold (a sketch of the first two layers follows the list):
- Layer 1: parameterized SQL templates only. No string concatenation anywhere in the codepath that touches user input. Every SQL operation is a template with bound parameters; the model picks the template by name (it's a tool), the parameters are Pydantic-validated, and the actual SQL execution is library code the model can't reach.
- Layer 2: session-scoped row-level security. PostgreSQL RLS policies are scoped to the authenticated tenant on every connection. Even if a query were somehow malformed to ignore the WHERE clause, RLS would block it at the database layer.
- Layer 3: 8-role RBAC. The 43 tools are partitioned across 8 RBAC roles; the session token determines which subset of tools the model can even see, much less invoke. A reseller agent can't call admin-only tools because those tools aren't in the agent's tool registry for that session.
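A minimal sketch of how the first two layers compose, assuming psycopg and RLS policies keyed off a session setting read with current_setting(). The template names, table names, and the app.tenant_id setting are illustrative, not the production schema.

```python
# Illustrative sketch: named SQL templates with bound parameters, executed
# inside a transaction whose tenant is pinned via set_config() so that
# row-level-security policies (e.g. USING tenant_id = current_setting('app.tenant_id'))
# apply to every statement. Table and column names are hypothetical.
import psycopg

SQL_TEMPLATES = {
    # The model picks a template *by name* (it's a tool); it never writes SQL.
    "suspend_line": """
        UPDATE lines
           SET status = 'suspended', suspended_reason = %(reason)s
         WHERE msisdn = %(msisdn)s
           AND ban = %(ban)s
    """,
    "line_count": "SELECT count(*) FROM lines WHERE ban = %(ban)s",
}


def run_template(conn: psycopg.Connection, tenant_id: str, template: str, params: dict):
    """Execute one named template with bound parameters under the tenant's RLS context."""
    with conn.transaction():
        with conn.cursor() as cur:
            # Pin the tenant for this transaction only (the `true` flag = local to the txn).
            cur.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
            cur.execute(SQL_TEMPLATES[template], params)   # parameters are always bound
            return cur.fetchall() if cur.description else cur.rowcount
```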
Knowledge layer — Qdrant, per-tenant
For retrieval-augmented responses (account history, prior interactions, product catalog context), the knowledge layer is a Qdrant vector store with per-tenant collections rather than a shared collection with a tenant filter. Filter-based isolation is a bug factory — one mistaken predicate and you're leaking. Separate collections fail closed: a misrouted query against a non-existent collection errors instead of silently returning another tenant's data. There are 4 collection domains (account context, product catalog, support runbooks, prior interactions), and a small upstream classifier over 6 retrieval intents decides which domain to hit before retrieval ever runs.
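A sketch of the fail-closed routing, assuming the qdrant-client search API; the collection-naming convention, domain names, and endpoint are illustrative.

```python
# Illustrative sketch: per-tenant, per-domain Qdrant collections.
# A misrouted tenant or domain produces a missing-collection error from the
# server instead of silently returning another tenant's vectors.
from qdrant_client import QdrantClient

DOMAINS = {"account_context", "product_catalog", "support_runbooks", "prior_interactions"}

client = QdrantClient(url="http://localhost:6333")  # placeholder endpoint


def retrieve(tenant_id: str, domain: str, query_vector: list[float], limit: int = 5):
    """Route retrieval to the tenant's own collection for the classified domain."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown retrieval domain: {domain}")
    collection = f"{tenant_id}__{domain}"   # isolation by collection, never by filter
    # If the collection does not exist (wrong tenant, wrong domain), Qdrant
    # errors and this call raises -- fail closed, no cross-tenant reads.
    return client.search(collection_name=collection, query_vector=query_vector, limit=limit)
```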
Evaluation
The eval suite is the most important non-obvious part of this system. It's 52 hand-designed pytest cases parametrized to over 400 distinct test invocations across three layers: intent classification (does the planner pick the right intent for an utterance?), planning correctness (does the right intent map to the right tool, with arguments that resolve against live data?), and execution validity (does the validated tool call produce a response the UI can render — and, for write-capable tools, a transaction the user can confirm before any side effect lands?). The cases are hand-written against real reseller workflows — every supported telecom action has at least one case, edge-case BAN formats and SIM-status transitions have dedicated cases, and the trickiest multi-step plans (port-in then suspend then activate) have explicit step-sequencing assertions.
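The shape of those parametrized cases, sketched under the assumption that the agent's classify and plan entrypoints are exposed to the suite as a fixture; the utterances, intent names, and tool names below are illustrative.

```python
# Illustrative sketch of the eval suite's shape: hand-written cases,
# parametrized so one case fans out into many invocations. The `agent`
# fixture and the intent/tool names are stand-ins for the real entrypoints.
import pytest

INTENT_CASES = [
    ("suspend the line for John in marketing", "suspend_line"),
    ("my CEO lost his SIM, swap it",           "swap_sim"),
    ("how many active lines do we have",       "line_count"),
]

PLAN_CASES = [
    # intent          expected tool    argument that must resolve against live data
    ("suspend_line",  "suspend_line",  {"ban": "123456789"}),
    ("swap_sim",      "swap_sim",      {"ban": "123456789"}),
]


@pytest.mark.parametrize("utterance,expected_intent", INTENT_CASES)
def test_intent_classification(agent, utterance, expected_intent):
    result = agent.classify(utterance)
    assert result.intent == expected_intent


@pytest.mark.parametrize("intent,expected_tool,expected_args", PLAN_CASES)
def test_planning_picks_right_tool(agent, intent, expected_tool, expected_args):
    call = agent.plan(intent)
    assert call.tool_name == expected_tool
    # Arguments must validate against live data (real SKUs, tenant-scoped BANs).
    for key, value in expected_args.items():
        assert getattr(call.args, key) == value
```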
The suite wasn't a polish item; it shaped the design. Early in development the cases caught a class of failure I hadn't expected to be common: the agent hallucinating tool arguments. Specifically, it would sometimes invent device SKUs that didn't exist in the catalog and malformed BANs (Billing Account Numbers that passed surface format checks but weren't real accounts in the tenant's scope). On a write-capable agent these would have been silent bugs that hit production and only surfaced as customer complaints.
The fix was structural, not prompt-engineering: Pydantic-validated tool schemas that look up SKUs against the live catalog and validate BANs against the session's tenant scope. The model still gets to pick tools and propose arguments, but invalid arguments fail at the boundary before the backend ever gets the call. Use the type system as a hallucination filter. It's a one-line change per tool that compounds across 43 tools and removes an entire failure class.
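What that boundary looks like in Pydantic v2, sketched with in-memory stand-ins for the live catalog and tenant-scope lookups; in production those checks hit the real catalog and the session's tenant scope.

```python
# Illustrative sketch of "the type system as a hallucination filter":
# field validators that check model-proposed arguments against live data.
# CATALOG_SKUS and the tenant_bans context are stand-ins for the real lookups.
from pydantic import BaseModel, Field, ValidationInfo, field_validator

CATALOG_SKUS = {"IPH15-PRO-256", "SGS24-128"}           # stand-in for the live device catalog


class OrderDeviceArgs(BaseModel):
    device_sku: str
    ban: str = Field(pattern=r"^\d{9}$")                # surface format check first

    @field_validator("device_sku")
    @classmethod
    def sku_must_exist(cls, v: str) -> str:
        if v not in CATALOG_SKUS:                       # hallucinated SKUs die here
            raise ValueError(f"unknown device SKU: {v}")
        return v

    @field_validator("ban")
    @classmethod
    def ban_must_be_in_tenant_scope(cls, v: str, info: ValidationInfo) -> str:
        tenant_bans = (info.context or {}).get("tenant_bans", set())
        if v not in tenant_bans:                        # well-formed but wrong-tenant BANs die here
            raise ValueError("BAN is not in this tenant's scope")
        return v


# The session's tenant scope is passed as validation context at the boundary:
args = OrderDeviceArgs.model_validate(
    {"device_sku": "IPH15-PRO-256", "ban": "123456789"},
    context={"tenant_bans": {"123456789"}},
)
```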
The production result: zero LLM-hallucination incidents since the suite landed. The suite runs on every PR; nothing merges with a regression in any layer. The eval set is the living spec — when the PM asks for a new intent, the eval cases come first and the implementation is secondary.
Key technical decisions
Four trade-offs that mattered. Each was chosen because the alternative had a failure mode that wouldn't surface until production.
- Tool-calling over freeform generation. The model picks tools by name and emits Pydantic-typed arguments rather than producing free-form responses or JSON blobs the backend has to parse. Rationale: free-form output drifts; typed tool dispatch fails closed at the schema boundary, and the failures we did see in development were exactly the ones the eval suite was built to catch — observable, individually defensible, fixable by tightening one schema.
- No dynamic SQL ever. Every database operation is a parameterized template with bound parameters; the model picks the template by name (it's a tool), and the actual SQL execution is library code the model can't reach. Rationale: the moment you let an LLM generate or even template SQL, every prompt becomes a SQL-injection surface, and no amount of prompt engineering closes that loop reliably. The cost is a fixed-set tool catalog instead of arbitrary queries; the benefit is a closed attack surface.
- Stage-and-confirm for writes, direct execution for reads. Write-capable tool calls compose into a staged transaction in an inline panel; the backend doesn't commit any side-effecting action until the user clicks Confirm. Read-only operations (line counts, billing summaries, account context lookups) execute directly with no confirmation step. Rationale: eval results made it obvious that "the model sometimes picks the wrong device variant" was a class of bug we'd never fully eliminate on writes — but we could route around it by making the user the final approver on side effects, while keeping reads frictionless. The workflow stayed conversational; safety on writes became a non-issue.
- Per-tenant Qdrant collections over a shared collection with a tenant filter. Each tenant gets its own collection; retrieval queries hit the right collection by session tenant, never by predicate. Rationale: filter-based isolation is a bug factory — one mistaken predicate and you're leaking — while separate collections fail closed: a misrouted query against a non-existent collection errors instead of silently returning another tenant's data.
What I'd do differently
I'd build first-class agent traces on day one instead of bolting them on after the eval suite was already grinding. An agent workflow has no stack trace — debugging "the agent picked the wrong intent" required reconstructing classifier output, retrieval ranking, and tool-registry filtering from disjoint logs, when one turn-record-per-request would have collapsed the whole investigation into a single page. Building the trace UI first would have shaved weeks off the eval-iteration loop. The lesson generalizes past this project: when you ship non-deterministic code, observability has to land before you start grinding on the model.
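The turn record I'd build first next time, sketched as one Pydantic model per request; the field names are hypothetical, but the point is that a single row captures everything the investigation needs.

```python
# Illustrative sketch of a first-class agent trace: one record per chat turn,
# written at the end of the request, queryable as a single page during debugging.
# Field names are hypothetical.
from datetime import datetime, timezone

from pydantic import BaseModel, Field


class RetrievalHit(BaseModel):
    collection: str
    score: float
    doc_id: str


class TurnRecord(BaseModel):
    turn_id: str
    tenant_id: str
    utterance: str
    # Layer 1: what the classifier saw and decided.
    intent: str
    intent_confidence: float
    # Knowledge layer: which collection was queried and what came back.
    retrieval_hits: list[RetrievalHit] = []
    # Layer 2: which tools the session's RBAC role exposed, and what the planner picked.
    visible_tools: list[str] = []
    tool_name: str | None = None
    tool_args: dict | None = None
    validation_error: str | None = None
    # Layer 3: what actually happened.
    outcome: str  # "executed" | "staged" | "confirmed" | "rejected" | "clarified"
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
```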
Outcome
Currently in pilot with 15 reseller tenants representing 25+ enterprise customers and 100+ daily portal users. Zero production incidents tied to LLM hallucination of tool arguments since the eval-driven hardening landed. The pilot was deliberately rolled out in waves — we validated the UX and the safety model on the first 5 tenants, then opened the gate to the next 10 once the eval suite had caught the failure classes that mattered. Adoption per pilot tenant has been steady; the most common follow-up request from pilot users has been more tools, which is the right shape of feedback.
Tech
FastAPI · Python · React 19 · TypeScript · Pydantic · Qdrant (per-tenant collections) · Claude / GPT-4 function-calling · PostgreSQL with row-level security · 8-role RBAC · parametrized pytest eval suite (52 hand-designed cases / 400+ test invocations) covering intent dispatch, planning, and execution.
Anonymized as "Fortune-100 telecom client" where appropriate. Full architecture and code lives behind enterprise NDA — happy to walk through architecture in interviews.