05 / project
Real-Time Weather Pipeline
Distributed Systems · UW Madison
The problem
A 1300-station weather sensor network produces a continuous, write-heavy stream: every station emits readings on its own cadence, the data is append-only, and the analytical queries we cared about (rolling averages, anomaly detection, station-level trend lines) needed to run over hot data without blocking ingest. Three constraints I had to handle at once: high write throughput, time-series-shaped reads, and graceful behavior when a node goes down — sensors don't pause politely while you fix a partition.
This was a UW Madison distributed-systems course project. I treated it as if it had to actually run for a semester, not just demo for the grader.
The approach
I picked the stack to match the workload: Cassandra for storage (its log-structured merge tree is built for write-heavy time-series with predictable-key access patterns), Apache Spark for analytics jobs that ran over windows of hot writes, and a Python gRPC server as the ingest gateway. gRPC's binary framing and HTTP/2 multiplexing handled concurrent station-stream uploads without the per-request connection churn of HTTP/1.1 REST.
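A minimal sketch of what that ingest gateway can look like. Everything here is illustrative rather than the actual course code: the proto file, message fields, and service names (weather_pb2, WeatherIngest, Ack) are assumptions, as is write_reading, which stands in for the Cassandra write path sketched in the next paragraph's example.

```python
# Ingest gateway sketch. Assumes stubs generated by protoc from a
# hypothetical weather.proto along these lines:
#   message Reading { int32 station_id = 1; int64 ts_ms = 2; double temp_c = 3; }
#   message Ack     { int64 received = 1; }
#   service WeatherIngest { rpc StreamReadings(stream Reading) returns (Ack); }
from concurrent import futures

import grpc
import weather_pb2
import weather_pb2_grpc

from storage import write_reading  # hypothetical Cassandra write path (see below)


class WeatherIngestServicer(weather_pb2_grpc.WeatherIngestServicer):
    def StreamReadings(self, request_iterator, context):
        # Client-streaming RPC: each station holds one HTTP/2 stream open
        # and pushes readings on its own cadence.
        count = 0
        for reading in request_iterator:
            write_reading(reading.station_id, reading.ts_ms, reading.temp_c)
            count += 1
        return weather_pb2.Ack(received=count)


def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=32))
    weather_pb2_grpc.add_WeatherIngestServicer_to_server(WeatherIngestServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```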
The schema was the most important decision. I partitioned by (station_id, day_bucket) so each day's readings for a single station landed in one partition on one replica set. Reads for "the last 24h of station 47" became a single-partition fetch, and writes for a given station-day always routed to the same replicas instead of scattering across the cluster. Spark jobs read the same partitions in batch using the Cassandra connector with predicate pushdown.
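A sketch of that schema and write path, assuming the DataStax cassandra-driver. The keyspace, table, and column names are illustrative, and the 30-day TTL is just one plausible retention setting:

```python
# Day-bucketed schema and write path (illustrative names throughout).
from datetime import datetime, timezone

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# RF=3 to match the fault-tolerance setup described below.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS weather
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Composite partition key (station_id, day_bucket): one station-day per
# partition, newest readings first for cheap "last 24h" reads.
session.execute("""
    CREATE TABLE IF NOT EXISTS weather.readings (
        station_id int,
        day_bucket date,
        ts         timestamp,
        temp_c     double,
        PRIMARY KEY ((station_id, day_bucket), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# TTL makes retention trivial: expired rows age out at compaction time.
insert = session.prepare(
    "INSERT INTO weather.readings (station_id, day_bucket, ts, temp_c) "
    "VALUES (?, ?, ?, ?) USING TTL 2592000"  # 30-day retention
)


def write_reading(station_id: int, ts_ms: int, temp_c: float) -> None:
    ts = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    session.execute(insert, (station_id, ts.date(), ts, temp_c))
```

On the Spark side, partition-key predicates push down to Cassandra, so a station-day query reads one partition instead of scanning the table. This assumes the DataStax Spark Cassandra Connector on the classpath; hosts and values are again illustrative:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("weather-rollups")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Both predicates hit the partition key, so the connector pushes them
# down and Cassandra serves a single partition.
station_day = (spark.read
               .format("org.apache.spark.sql.cassandra")
               .options(keyspace="weather", table="readings")
               .load()
               .where("station_id = 47 AND day_bucket = '2023-11-01'"))

station_day.agg({"temp_c": "avg"}).show()
```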
Engineering decisions that mattered
- Cassandra over PostgreSQL/TimescaleDB. Tempting to reach for a hypertable, but with 1300 stations writing concurrently the LSM-tree append pattern wins on raw write throughput, and Cassandra's native sharding meant horizontal scaling didn't require a migration.
- gRPC over REST. Binary framing, HTTP/2 multiplexing, and client stubs code-generated from the protobuf definitions meant the ingest path was both faster and easier to evolve, with PyArrow's columnar format handling batched readings. REST would've been simpler to demo but worse under load.
- Day-bucketed partition key. Without bucketing, partitions would grow unbounded over time and Cassandra's wide-row performance would degrade. Bucketing by day kept partition size predictable and made TTL-based retention trivial.
- Fault-tolerance test as part of the deliverable. I deliberately killed nodes during writes to verify the replication factor + consistency level (RF=3, CL=QUORUM) actually held the line. Most coursework demos skip this; I wanted to see the system behave under stress, not just on the happy path.
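A minimal sketch of that check, assuming the cassandra-driver and the schema above. How nodes get killed (e.g. docker stop on a replica container) is outside the snippet, and the values written are placeholders:

```python
# QUORUM write probe to run while killing nodes (illustrative values).
from datetime import date, datetime, timezone

from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("weather")

write = SimpleStatement(
    "INSERT INTO readings (station_id, day_bucket, ts, temp_c) "
    "VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,  # 2 of 3 replicas must ack
)

# With RF=3, one dead node leaves QUORUM intact; losing a second replica
# for the partition should surface as Unavailable or WriteTimeout.
try:
    session.execute(write, (47, date.today(), datetime.now(timezone.utc), 21.5))
    print("write acknowledged by quorum")
except (Unavailable, WriteTimeout) as exc:
    print(f"quorum lost: {exc!r}")
```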
What it taught me that I still use
This project is the reason I think about partition keys before I think about indexes when I'm designing data layout for any production system today. It's also why I default to gRPC over REST when the consumer is another service rather than a browser. The hot/cold data separation idea — Spark reads vs. live ingest — directly informs how I'd build the next generation of the T-Mobile copilot's analytics layer.
Tech
Python · Cassandra · Apache Spark · gRPC · PyArrow · Docker · UW Madison Distributed Systems coursework, November 2023.
next project
AI Chatbot & Agentic Copilot
T-Mobile for Business · Enidus