05 / project
Real-Time Weather Pipeline
Distributed Systems · UW Madison
The problem
A 1300-station weather sensor network produces a continuous, write-heavy stream: every station emits readings on its own cadence, the data is append-only, and the analytical queries we cared about (rolling averages, anomaly detection, station-level trend lines) needed to run over hot data without blocking ingest. Three constraints I had to handle at once: high write throughput, time-series-shaped reads, and graceful behavior when a node goes down — sensors don't pause politely while you fix a partition.
This was a UW Madison distributed-systems course project. I treated it as if it had to actually run for a semester, not just demo for the grader.
The approach
I picked the stack to match the workload: Cassandra for storage (its log-structured merge tree is built for write-heavy time-series with predictable-key access patterns), Apache Spark for analytics jobs that ran over windows of hot writes, and a Python gRPC server as the ingest gateway. gRPC's binary framing and HTTP/2 multiplexing handled concurrent station-stream uploads without the per-request connection churn of HTTP/1.1 REST.
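A minimal sketch of what that ingest gateway can look like. Everything here is illustrative rather than the actual course code: the proto file, message fields, and service names (weather_pb2, WeatherIngest, Ack) are assumptions, as is write_reading, which stands in for the Cassandra write path sketched in the next paragraph's example.

```python
# Ingest gateway sketch. Assumes stubs generated by protoc from a
# hypothetical weather.proto along these lines:
#   message Reading { int32 station_id = 1; int64 ts_ms = 2; double temp_c = 3; }
#   message Ack     { int64 received = 1; }
#   service WeatherIngest { rpc StreamReadings(stream Reading) returns (Ack); }
from concurrent import futures

import grpc
import weather_pb2
import weather_pb2_grpc

from storage import write_reading  # hypothetical Cassandra write path (see below)


class WeatherIngestServicer(weather_pb2_grpc.WeatherIngestServicer):
    def StreamReadings(self, request_iterator, context):
        # Client-streaming RPC: each station holds one HTTP/2 stream open
        # and pushes readings on its own cadence.
        count = 0
        for reading in request_iterator:
            write_reading(reading.station_id, reading.ts_ms, reading.temp_c)
            count += 1
        return weather_pb2.Ack(received=count)


def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=32))
    weather_pb2_grpc.add_WeatherIngestServicer_to_server(WeatherIngestServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```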
The schema was the most important decision. I partitioned by (station_id, day_bucket) so each day's readings for a single station landed in one partition on one replica set. Reads for "the last 24h of station 47" became a single-partition fetch, and writes for a given station-day always routed to the same replicas instead of scattering across the cluster. Spark jobs read the same partitions in batch using the Cassandra connector with predicate pushdown.
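A sketch of that schema and write path, assuming the DataStax cassandra-driver. The keyspace, table, and column names are illustrative, and the 30-day TTL is just one plausible retention setting:

```python
# Day-bucketed schema and write path (illustrative names throughout).
from datetime import datetime, timezone

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

# RF=3 to match the fault-tolerance setup described below.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS weather
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Composite partition key (station_id, day_bucket): one station-day per
# partition, newest readings first for cheap "last 24h" reads.
session.execute("""
    CREATE TABLE IF NOT EXISTS weather.readings (
        station_id int,
        day_bucket date,
        ts         timestamp,
        temp_c     double,
        PRIMARY KEY ((station_id, day_bucket), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# TTL makes retention trivial: expired rows age out at compaction time.
insert = session.prepare(
    "INSERT INTO weather.readings (station_id, day_bucket, ts, temp_c) "
    "VALUES (?, ?, ?, ?) USING TTL 2592000"  # 30-day retention
)


def write_reading(station_id: int, ts_ms: int, temp_c: float) -> None:
    ts = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    session.execute(insert, (station_id, ts.date(), ts, temp_c))
```

On the Spark side, partition-key predicates push down to Cassandra, so a station-day query reads one partition instead of scanning the table. This assumes the DataStax Spark Cassandra Connector on the classpath; hosts and values are again illustrative:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("weather-rollups")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Both predicates hit the partition key, so the connector pushes them
# down and Cassandra serves a single partition.
station_day = (spark.read
               .format("org.apache.spark.sql.cassandra")
               .options(keyspace="weather", table="readings")
               .load()
               .where("station_id = 47 AND day_bucket = '2023-11-01'"))

station_day.agg({"temp_c": "avg"}).show()
```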
Engineering decisions that mattered
- Cassandra over PostgreSQL/TimescaleDB. Tempting to reach for a hypertable, but with 1300 stations writing concurrently the LSM-tree append pattern wins on raw write throughput, and Cassandra's native sharding meant horizontal scaling didn't require a migration.
- gRPC over REST. Binary framing, HTTP/2 multiplexing, and client stubs code-generated from the protobuf definitions meant the ingest path was both faster and easier to evolve, with PyArrow's columnar format handling batched readings. REST would've been simpler to demo but worse under load.
- Day-bucketed partition key. Without bucketing, partitions would grow unbounded over time and Cassandra's wide-row performance would degrade. Bucketing by day kept partition size predictable and made TTL-based retention trivial.
- Fault-tolerance test as part of the deliverable. I deliberately killed nodes during writes to verify the replication factor + consistency level (RF=3, CL=QUORUM) actually held the line. Most coursework demos skip this; I wanted to see the system behave under stress, not just on the happy path.
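A minimal sketch of that check, assuming the cassandra-driver and the schema above. How nodes get killed (e.g. docker stop on a replica container) is outside the snippet, and the values written are placeholders:

```python
# QUORUM write probe to run while killing nodes (illustrative values).
from datetime import date, datetime, timezone

from cassandra import ConsistencyLevel, Unavailable, WriteTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("weather")

write = SimpleStatement(
    "INSERT INTO readings (station_id, day_bucket, ts, temp_c) "
    "VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,  # 2 of 3 replicas must ack
)

# With RF=3, one dead node leaves QUORUM intact; losing a second replica
# for the partition should surface as Unavailable or WriteTimeout.
try:
    session.execute(write, (47, date.today(), datetime.now(timezone.utc), 21.5))
    print("write acknowledged by quorum")
except (Unavailable, WriteTimeout) as exc:
    print(f"quorum lost: {exc!r}")
```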
What it taught me that I still use
This project is the reason I think about partition keys before I think about indexes when I'm designing data layout for any production system today. It's also why I default to gRPC over REST when the consumer is another service rather than a browser. The hot/cold data separation idea — Spark reads vs. live ingest — directly informs how I'd build the next generation of the T-Mobile copilot's analytics layer.
Tech
Python · Cassandra · Apache Spark · gRPC · PyArrow · Docker · UW Madison Distributed Systems coursework, November 2023.
next project
AI Chatbot & Agentic Copilot
T-Mobile for Business · Enidus