Thurs Aug 15, 2024 5:16pm PST
Show HN: Denormalized – Embeddable Stream Processing in Rust and DataFusion
tl;dr we built an embeddable stream processing engine in Rust using Apache DataFusion; check us out at https://github.com/probably-nothing-labs/denormalized

Hey HN,

We’d like to showcase a very early version of our embeddable stream processing engine, Denormalized. The rise of DuckDB has made it abundantly clear that even for many terabyte-scale workloads, a single-node system beats the distributed query engines of the previous generation, such as Spark and Snowflake, on both performance and cost.

Many of the workloads DuckDB is now used for would have been considered “big data” a generation ago, but no longer. In streaming especially, this mismatch is even more acute. A streaming system is designed to incrementally process large amounts of data over time, and even at the upper end of the scale, productionized stream processing use cases rarely compute over more than tens of gigabytes of data at any given moment.

Even so, standard stream processing solutions such as Flink require spinning up a distributed JVM cluster just to process the simplest of event streams. To that end, we’re building Denormalized to be embeddable in your applications and to scale to hundreds of thousands of events per second, with a Flink-like dataflow API. We currently only support Rust, with Python and TypeScript bindings planned soon.
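
To give a sense of what “embeddable” means here: a pipeline is just a regular Rust binary, with no cluster to operate. Here’s rough pseudocode of what one looks like (the names Context, KafkaTopicBuilder, from_topic, window, and print_stream are illustrative placeholders, not the exact API; see the examples in the repo for real usage):

    use std::time::Duration;
    // Everything under `denormalized::` below is a placeholder for the real API.
    use denormalized::prelude::*;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let ctx = Context::new()?;

        // Point the engine at a Kafka topic (assumed builder names).
        let source = KafkaTopicBuilder::new("localhost:9092")
            .with_topic("clickstream")
            .build()?;

        // Flink-style dataflow, all inside this one process:
        // read -> 1-minute tumbling window -> count per user -> print.
        ctx.from_topic(source).await?
            .window(
                vec![col("user_id")],
                vec![count(col("event_id")).alias("clicks")],
                Duration::from_secs(60),
                None,
            )?
            .print_stream()
            .await?;

        Ok(())
    }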

We’re built on top of the DataFusion and Arrow ecosystems and currently support streaming joins as well as windowed aggregations on Kafka topics.
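
A streaming join follows the same dataflow shape, e.g. joining two Kafka topics on a key and aggregating over a window. Again, rough pseudocode; join_on and the window signature are illustrative assumptions rather than the exact API:

    use std::time::Duration;
    use denormalized::prelude::*; // placeholder names, as above

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let ctx = Context::new()?;

        let impressions = KafkaTopicBuilder::new("localhost:9092")
            .with_topic("impressions")
            .build()?;
        let clicks = KafkaTopicBuilder::new("localhost:9092")
            .with_topic("clicks")
            .build()?;

        // Join clicks to impressions on ad_id, then count clicks per ad
        // over a 5-minute tumbling window.
        ctx.from_topic(impressions).await?
            .join_on(
                ctx.from_topic(clicks).await?,
                col("impressions.ad_id").eq(col("clicks.ad_id")),
            )?
            .window(
                vec![col("impressions.ad_id")],
                vec![count(col("clicks.ad_id")).alias("click_count")],
                Duration::from_secs(300),
                None,
            )?
            .print_stream()
            .await?;

        Ok(())
    }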

Please check out our repo at: https://github.com/probably-nothing-labs/denormalized

We’d love to hear your feedback.
