Denari RAG Capstone

Full-Stack RAG · UW Madison

TypeScript · TimescaleDB · Docker · S3 · OpenAI APIs

The Problem

Existing financial-document QA systems return one of two things: irrelevant boilerplate, or hallucinated answers that sound plausible and reference filings that don't say what the system claims. A senior analyst can spot both in seconds; for everyone else, the answers are worse than no answer at all.

The capstone goal: build a RAG system over a real corpus of financial filings where every answer is grounded in retrievable text and the system fails honestly when the corpus doesn't cover the question.

The Approach

A full-stack pipeline from ingestion to answer:

  1. Ingestion + storage. 22K+ documents normalized and chunked, stored in TimescaleDB hypertables for time-series-friendly queries. Docker Compose for local parity with prod; S3 for raw artifact storage. (Schema sketched after this list.)
  2. Embedding generation. 300K+ vectors generated through OpenAI's embedding API in batched parallel calls. Chunk size and overlap tuned against retrieval-quality benchmarks: too small and the model loses context, too large and the embeddings get mushy. Settled around 500 tokens with 50-token overlap after a sweep. (Batching sketched after this list.)
  3. Hybrid retrieval. BM25 for keyword precision (filing numbers, CUSIPs, ticker mentions) plus TF-IDF for sparse-term recall, fused with a semantic rerank on top. Pure-vector retrieval was the original plan; benchmarks showed it under-performed on the kind of "find filings that mention X" queries analysts actually ask.
  4. Re-ranking. Top-N from each retriever, rescored by a cross-encoder, deduplicated, returned with source metadata. (The fusion and re-ranking passes are sketched after this list.)
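
To make the storage step concrete, here is a minimal sketch of the chunk store, assuming node-postgres plus the timescaledb and pgvector extensions. The write-up names TimescaleDB hypertables but not the vector index, so the pgvector column, table name, and field names are illustrative, not the project's exact schema.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function createChunkStore(): Promise<void> {
  await pool.query(`CREATE EXTENSION IF NOT EXISTS timescaledb;`);
  await pool.query(`CREATE EXTENSION IF NOT EXISTS vector;`);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS filing_chunks (
      id         BIGSERIAL,
      filing_id  TEXT        NOT NULL,  -- source document key; the raw artifact lives in S3
      filed_at   TIMESTAMPTZ NOT NULL,  -- filing date, used as the hypertable time dimension
      chunk_text TEXT        NOT NULL,
      embedding  VECTOR(1536),          -- filled in by the embedding pass (step 2)
      PRIMARY KEY (id, filed_at)        -- hypertables need the time column in the key
    );
  `);
  // Partition by filing date so time-bounded queries only touch recent partitions.
  await pool.query(
    `SELECT create_hypertable('filing_chunks', 'filed_at', if_not_exists => TRUE);`
  );
}
```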
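
Step 2's batched calls look roughly like the sketch below, assuming the official openai Node SDK. The model name, batch size, and unbounded Promise.all are illustrative; a real run would cap concurrency to stay inside rate limits.

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const BATCH_SIZE = 128;      // illustrative; tune against rate limits and chunk length

export async function embedChunks(chunks: string[]): Promise<number[][]> {
  const batches: string[][] = [];
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    batches.push(chunks.slice(i, i + BATCH_SIZE));
  }
  // Each API call embeds a whole batch; the response preserves input order.
  const responses = await Promise.all(
    batches.map((input) =>
      openai.embeddings.create({ model: "text-embedding-3-small", input })
    )
  );
  return responses.flatMap((res) => res.data.map((d) => d.embedding));
}
```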
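
Steps 3 and 4 in one sketch: reciprocal rank fusion is one reasonable way to merge the BM25, TF-IDF, and vector result lists, and a cross-encoder pass rescores the pooled top-N. The write-up doesn't name the exact fusion rule, and crossEncoderScore is a hypothetical helper, so treat both as assumptions about the shape of the pipeline rather than its actual code.

```typescript
interface Hit {
  chunkId: string;
  text: string;
  source: string; // filing metadata returned alongside every answer
}

// Merge the ordered result lists from BM25, TF-IDF, and vector search.
// Reciprocal rank fusion: each hit earns 1 / (k + rank) from every list it appears in.
export function fuse(rankedLists: Hit[][], k = 60): Hit[] {
  const pooled = new Map<string, { hit: Hit; score: number }>();
  for (const list of rankedLists) {
    list.forEach((hit, rank) => {
      const entry = pooled.get(hit.chunkId) ?? { hit, score: 0 };
      entry.score += 1 / (k + rank + 1);
      pooled.set(hit.chunkId, entry); // the Map also deduplicates across retrievers
    });
  }
  return [...pooled.values()].sort((a, b) => b.score - a.score).map((e) => e.hit);
}

// Rescore the fused top-N with a cross-encoder that reads (query, chunk) jointly.
export async function rerank(
  fused: Hit[],
  query: string,
  crossEncoderScore: (query: string, text: string) => Promise<number>, // hypothetical
  topN = 20
): Promise<Hit[]> {
  const scored = await Promise.all(
    fused.slice(0, topN).map(async (hit) => ({
      hit,
      score: await crossEncoderScore(query, hit.text),
    }))
  );
  return scored.sort((a, b) => b.score - a.score).map((s) => s.hit);
}
```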

Engineering Decisions That Mattered

  • TimescaleDB hypertables instead of a generic vector DB. Most queries are "what did filings say about X in the last quarter", and the time-series partitioning paid for itself in query latency. Vector search lives alongside, indexed separately. (Query shape sketched after this list.)
  • Hybrid retrieval, not pure-vector. Pure semantic retrieval whiffed on exact-term queries; pure keyword whiffed on paraphrased ones. The fused stack got measurably better on both.
  • Chunk size as a benchmark target, not a guess. Ran a sweep of (chunk_size, overlap) pairs against a held-out QA set. The "obvious" answer (~256 tokens) lost to ~500 tokens. Worth the day to measure. (Sweep harness sketched after this list.)
  • Agile/Scrum delivery on a 4-month clock. 25+ production-grade features delivered across ingestion, embeddings, DB schema, and retrieval — not because the team loved process, but because cutting the scope every sprint kept the project from becoming an unfinished prototype.
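
The hypertable decision shows up most clearly in the query shape. A minimal sketch, reusing the filing_chunks table from the ingestion sketch above and assuming pgvector's <=> distance operator; the three-month window and LIMIT are illustrative.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Candidate chunks for a "what did filings say about X in the last quarter" question.
export async function lastQuarterCandidates(queryEmbedding: number[], limit = 50) {
  // The time predicate lets TimescaleDB exclude whole partitions before any
  // vector distance is computed; the surviving rows are ordered by similarity.
  const { rows } = await pool.query(
    `SELECT filing_id, filed_at, chunk_text
       FROM filing_chunks
      WHERE filed_at >= now() - interval '3 months'
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [JSON.stringify(queryEmbedding), limit]
  );
  return rows;
}
```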
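
The chunk-size sweep is conceptually a small grid search. In the sketch below, buildIndex and evaluateAccuracy are hypothetical stand-ins for the project's benchmark harness, and the grid values are illustrative apart from the winning ~500-token / 50-overlap region.

```typescript
interface QAPair {
  question: string;
  expectedSources: string[]; // chunks the answer should be grounded in
}

export async function sweepChunking(
  corpus: string[],
  heldOut: QAPair[],
  // Hypothetical helpers: rebuild the index at a given chunking, then score retrieval.
  buildIndex: (chunkSize: number, overlap: number, docs: string[]) => Promise<unknown>,
  evaluateAccuracy: (index: unknown, qa: QAPair[]) => Promise<number>
) {
  const grid: Array<[number, number]> = [
    [256, 25],
    [384, 50],
    [512, 50], // the ~500-token / 50-overlap region won the project's sweep
    [768, 100],
  ];
  let best = { chunkSize: 0, overlap: 0, accuracy: -Infinity };
  for (const [chunkSize, overlap] of grid) {
    const index = await buildIndex(chunkSize, overlap, corpus);
    const accuracy = await evaluateAccuracy(index, heldOut);
    if (accuracy > best.accuracy) best = { chunkSize, overlap, accuracy };
  }
  return best;
}
```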

Outcome

73% accuracy on the held-out QA benchmark. 40% query-latency reduction over the v1 baseline after retrieval-stack tuning. Capstone delivered on schedule with all 25+ planned features in.

Tech

TypeScript · TimescaleDB · Docker · AWS S3 · OpenAI APIs (embeddings + chat) · BM25 · TF-IDF

