Open-sourcing Streamling: the data processing runtime for App teams

Streamling is a Rust data processing runtime built on Apache Arrow and Apache DataFusion. Write plugins in Rust or WASM, define pipelines in YAML, and hit production in seconds. It's now open-source.

Jeff Ling

Chief Technology Officer

Today we are open-sourcing Streamling, the data processing runtime built in Rust that powers Goldsky Turbo, our data platform. We focused on making it start up fast, perform well, and stay simple to use and extend.

Some numbers to keep your attention:

Over 30x less compute: In a benchmarked Kafka-to-ClickHouse production workload (read from Kafka, enrich, write into ClickHouse), average CPU usage fell from 10+ cores on Flink to 0.3 cores on Streamling.
Over $1 million in savings per year: Moving our pipelines to Streamling cut Goldsky’s cloud costs by six figures a month.
Over 20x faster startup: Cold starts, from a topology change to data written in a destination, dropped from over two minutes on Flink to under 5 seconds, including image download, pod creation, and execution.

Streamling is built as a runtime which can take WASM or dynamically linked crate plugins. You tie it together with a YAML file to create a pipeline topology, which then runs. You can use it as a pipeline to write from one datastore to another efficiently, or as a processing runtime with your own complex business logic with data guarantees.

We wrote this to replace our existing pipeline system, which was powered by Apache Flink. Flink was doing the job, but structurally didn’t feel like the framework we could build the ultimate product with.

Streaming for fun and profit

Goldsky is a blockchain data and automation company. We have products that provide a safe and operational read-write layer to the world of blockchain. Our customers process billions of dollars in volume every day, and use our services to build APIs and analytics to power their products.

One of our most popular products was Mirror, which is probably one of the best Flink self-serve platforms out there. It allowed users to create pipelines from raw blockchain data, massaging it to create their own view of the blockchain, making a permanent and always-synced materialized view in their own database.

Users define a pipeline with a YAML file like this (a format we kept for the rewrite as well):

sources: # you can have any number of sources, transforms, and sinks
  raw.transactions:
    type: dataset
    name: transactions

transforms: 
  filtered_transactions: # you would typically do something more interesting
    type: sql
    primary_key: id
    sql: |
      SELECT *
      FROM raw.transactions
      WHERE address = 'some address'

sinks:
  pg.large_transactions:
    from: filtered_transactions
    type: postgres
    schema: public
    table: my_table
    primary_key: id
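One way to picture what a runtime has to do with a file like this is to treat it as a small graph and check that every `from` points at a node that exists. The sketch below is a hypothetical model in plain Rust, assuming the YAML has already been parsed; `Topology`, `dangling_sinks`, and `example_topology` are illustrative names, not Streamling internals.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory form of the YAML above: node names only.
pub struct Topology {
    pub sources: BTreeSet<String>,
    pub transforms: BTreeSet<String>,
    /// sink name -> the node named in its `from` field
    pub sinks: BTreeMap<String, String>,
}

impl Topology {
    /// Returns the sinks whose `from` does not name a known source or transform.
    pub fn dangling_sinks(&self) -> Vec<&str> {
        self.sinks
            .iter()
            .filter(|(_, from)| !self.sources.contains(*from) && !self.transforms.contains(*from))
            .map(|(name, _)| name.as_str())
            .collect()
    }
}

/// Builds the topology described by the example config.
pub fn example_topology() -> Topology {
    Topology {
        sources: BTreeSet::from(["raw.transactions".to_string()]),
        transforms: BTreeSet::from(["filtered_transactions".to_string()]),
        sinks: BTreeMap::from([(
            "pg.large_transactions".to_string(),
            "filtered_transactions".to_string(),
        )]),
    }
}
```

For the example config, `example_topology().dangling_sinks()` comes back empty; a sink whose `from` names a node that was never declared would show up in the list.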

Configs like this power some of the biggest teams in the blockchain ecosystem. Most people don’t know what’s going on in the backend. They make a YAML file and go.

Since launch, we’ve had 3000+ pipelines running concurrently. It handled both full historical backfills and real-time data in the same pipeline. It felt so much like a database that users would write disgustingly large joins in SQL queries, and they would just work, as long as they picked the XXL size: 40 Flink task managers running on instance-level SSDs. I was and still am extremely proud of and impressed by our team’s achievement, and by the traction (and profit) this product brought us.

Anyways, we rewrote it.

Why would we even

It was an extremely difficult decision. It started with a member of our team experimenting with a ‘lighter’ version of our streaming engine, meant to take data from Kafka, filter it, and send it directly to a webhook. We had a use-case to make tens of thousands of these, all with different transformations, filters, Kafka topics, and webhooks. We realized the startup time was incredibly fast and the resource usage was minuscule compared to our usual pipeline. A research spike showed that we had a clear line of sight to feature parity, and off we went.

There were many reasons outside of the obvious ‘Rust fast’ narrative. Flink was impossible for the non-streaming-core team to extend. The checkpointing system was working great but was impossible to inspect. And we were paying AWS a lot.

The one fact that made the decision for us was cold start speed. Our Flink pipelines would take minutes to start up after each topology change or update. Streamling starts sending data in under 5 seconds. Optimizing for fast cold starts means two key things:

Amazing developer iteration speed: You can run a full pipeline with all your transforms over and over again while developing on our cloud platform, inspecting data through every node using an SSE endpoint.
High availability: Fast cold starts mean that when we have an issue with an instance or a node, it restarts in a new node almost immediately. The workload rarely gets interrupted for more than a few seconds.

Yet another rewrite in Rust

I personally have spent a lot of time dissuading the teams I manage and advise from picking Rust for their projects. It’s always tempting as it’s legitimately so cool, but the cons outweigh the pros in many cases from a CTO perspective.

In the last two years, times have changed. The verbosity that makes it hard to read and write is less of an issue in our agent-driven age. The time-wasting strictness helps our virtual monkey developers ralph-loop their way into correct solutions, making it an ideal language to be lazy in if you’re incredibly thoughtful about the right constraints.

The RAD (Rust, Arrow, DataFusion) stack is a huge advantage. While Rust does allow us to think about memory reuse quite deliberately, the huge toolset given to us by Apache DataFusion and Arrow allows us to be almost reckless with how we think about data.

We can hold huge batches in memory, which keeps write throughput stable and makes sure we’re always constrained by external bottlenecks.

Not only that, our externally acquired data can be stored in the rawest form possible, making QA and data lineage simpler. More work goes into the pipeline instead of being preprocessed for speed. Remapping, filtering, and more take milliseconds.
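To make the column-at-a-time idea concrete, here is a minimal sketch of a filter over a toy columnar batch. It deliberately uses plain `Vec`s rather than Arrow's API, and the `Batch` type, its fields, and the `keep` helper are all hypothetical; a real `RecordBatch` with Arrow's compute kernels is far richer.

```rust
/// A toy columnar batch: one Vec per column, all the same length.
/// Illustration only; this is not Arrow's RecordBatch.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub address: Vec<String>,
    pub value: Vec<u64>,
}

impl Batch {
    /// Builds a boolean mask by scanning a single column.
    pub fn mask_address_eq(&self, wanted: &str) -> Vec<bool> {
        self.address.iter().map(|a| a == wanted).collect()
    }

    /// Keeps only the rows where the mask is true, column by column.
    pub fn filter(&self, mask: &[bool]) -> Batch {
        assert_eq!(mask.len(), self.address.len(), "mask must cover every row");
        Batch {
            address: keep(&self.address, mask),
            value: keep(&self.value, mask),
        }
    }
}

/// Copies the elements of one column whose mask entry is true.
fn keep<T: Clone>(column: &[T], mask: &[bool]) -> Vec<T> {
    column
        .iter()
        .zip(mask)
        .filter(|(_, m)| **m)
        .map(|(v, _)| v.clone())
        .collect()
}
```

The predicate touches only the column it needs, and the mask is then applied to every column in turn, which is the general shape of a vectorized filter.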

Giving up on joins

Streaming joins are where the complexity lives in any production data system. Matching two streaming data sources involves a lot of coordination, intermediary state, and complexity. The Flink documentation on joining two streams and the different edge cases is a book, and that is not an accident.

One controversial decision we made with Streamling was to not support stateful SQL. All state should be kept in external databases that have their own HA posture. We started embracing completely denormalized data blobs.

Our team is full of Flink/Java/Scala experts. They’ve run some of the largest streaming production systems out there at Shopify and Activision. We made it work for Mirror, but chose not to do it for Streamling given today’s data landscape.

Our position is that if your pipeline needs to do a streaming join, the problem is upstream of the pipeline.

Most of the data we process is event data, often nested in other data contexts. That data is naturally narrow and arrives carrying the keys it needs, because the producer knew what it was producing at the moment it produced it.

To save space, we normalize tables. Instead of one wide row, we make a few tables and have smaller and sometimes fewer rows. We save the data in one table and the context in another table. It saves space and processing takes less compute, unless you suddenly need to join the data back again.

Instead, with how cheap storage can get, and how fast we can process and unnest data from larger nested objects with Streamling, we now simply store raw data in fast, cheap storage, and process it inside each customer’s pipeline directly. A downside is tons of redundant data flowing through the pipeline. With Streamling, we process it fast and cheaply enough that the benefits of easier source-level QA and management and transform-level flexibility make it completely worth it.
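As a rough illustration of unnesting instead of joining, the sketch below flattens a denormalized record into one row per nested item, copying the parent context into each row. The `RawBlock`, `RawTx`, and `TxRow` shapes and their field names are assumptions made up for this example, not a Goldsky schema.

```rust
/// A raw, denormalized record: the block context travels with its transactions.
/// All field names here are hypothetical.
pub struct RawBlock {
    pub block_number: u64,
    pub timestamp: u64,
    pub transactions: Vec<RawTx>,
}

pub struct RawTx {
    pub hash: String,
    pub value: u64,
}

/// One flat output row per nested transaction, with the parent context copied in.
#[derive(Debug, PartialEq)]
pub struct TxRow {
    pub block_number: u64,
    pub timestamp: u64,
    pub hash: String,
    pub value: u64,
}

/// Unnests blocks into rows. No lookup table or join state is needed because
/// each block already carries everything its transactions refer to.
pub fn unnest(blocks: &[RawBlock]) -> Vec<TxRow> {
    blocks
        .iter()
        .flat_map(|b| {
            b.transactions.iter().map(move |tx| TxRow {
                block_number: b.block_number,
                timestamp: b.timestamp,
                hash: tx.hash.clone(),
                value: tx.value,
            })
        })
        .collect()
}
```

The trade-off is visible in the types: `block_number` and `timestamp` are repeated on every row, which is the redundancy described above, in exchange for never having to match two streams.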

As a consequence of removing the heavy intermediary state, we managed to achieve our goal: extremely fast startups.

A different approach to high availability

Flink-style high availability is a tax you pay because catching up from a failure is expensive. It can take minutes for a jobmanager/taskmanager combo to spin back up on an image update, or if a task manager goes down. Standby tasks, hot replicas, checkpoints frequently written to S3, and operator state that has to be rehydrated before the pipeline can resume are all design choices shaped by one assumption, which is that a restart is something you must avoid at almost any cost.

Our thesis is that if restarts take seconds, you do not need to avoid them.

Once you have thrown out the keyed state from the previous section, the streaming engine becomes effectively stateless, with nothing except Kafka offsets (or other source metadata) to rehydrate when it comes back up. A supervisor (like Kubernetes) can restart the binary, which reads its offset from a checkpoint stored in a database and is back to processing in [N] seconds. That is faster than most Flink jobs take to notice that a node has failed in the first place.
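The restart path can be sketched in a few lines. This is a simplified model under stated assumptions: `CheckpointStore` is an in-memory stand-in for the external database, the source is a slice instead of Kafka, and a checkpoint is written after every record, whereas a real engine would checkpoint in epochs. None of these names come from Streamling.

```rust
use std::collections::HashMap;

/// Stand-in for the external database that holds checkpoints.
#[derive(Default)]
pub struct CheckpointStore {
    offsets: HashMap<String, usize>,
}

impl CheckpointStore {
    /// Returns the next offset to read; a pipeline with no checkpoint starts at 0.
    pub fn load(&self, pipeline: &str) -> usize {
        self.offsets.get(pipeline).copied().unwrap_or(0)
    }

    pub fn save(&mut self, pipeline: &str, next_offset: usize) {
        self.offsets.insert(pipeline.to_string(), next_offset);
    }
}

/// Runs one "process lifetime": resume from the stored offset, handle up to
/// `budget` records, then stop (simulating a crash or a supervisor restart).
/// Returns the records delivered to the sink during this lifetime.
pub fn run_once(
    pipeline: &str,
    log: &[&str],
    store: &mut CheckpointStore,
    budget: usize,
) -> Vec<String> {
    let start = store.load(pipeline);
    let mut delivered = Vec::new();
    for (offset, record) in log.iter().enumerate().skip(start).take(budget) {
        delivered.push(record.to_uppercase()); // the "transform"
        store.save(pipeline, offset + 1); // checkpoint the next offset to read
    }
    delivered
}
```

Calling `run_once` twice against the same store models a crash and a restart: the second call picks up at the saved offset and replays only the gap, with no operator state to rebuild first.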

This makes it trivial for anyone to have a production setup using a single binary. If you put it under a k8s deployment and the machine goes down, it’ll automatically restart with maybe a few seconds of data lag, then replay the gap on its own.

You can still run multiple processes. For example, the Kafka source just uses consumer groups, so multiple processes will just rebalance as needed. Practically, rebalancing takes seconds anyways, so while multiple processes can increase the throughput linearly, it doesn’t necessarily save time.
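To show why adding processes spreads the work, here is a simplified round-robin assignment of partitions to group members. It is only a stand-in for the outcome of a consumer-group rebalance: Kafka's actual assignment strategies are configurable and more involved, and the `assign` function is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Assigns partitions 0..partitions to members round-robin.
/// A simplified stand-in for what a consumer-group rebalance produces.
pub fn assign(members: &[&str], partitions: u32) -> BTreeMap<String, Vec<u32>> {
    // Every member gets an entry, even if it ends up with no partitions.
    let mut out: BTreeMap<String, Vec<u32>> =
        members.iter().map(|m| (m.to_string(), Vec::new())).collect();
    if members.is_empty() {
        return out;
    }
    for p in 0..partitions {
        let owner = members[p as usize % members.len()];
        out.get_mut(owner).expect("member was inserted above").push(p);
    }
    out
}
```

With one member, that member owns every partition; when a second joins, re-running the assignment splits them, which is where the roughly linear throughput gain comes from.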

Plugins without compromise

Plugins changed how our engineering team works. Here is a whole (contrived and useless) plugin:

use streamling_plugin::{register_plugin_sink, SinkPlugin};
use streamling_core::{Result, PluginError, CheckpointEpoch};
use arrow::record_batch::RecordBatch;
use async_trait::async_trait;

pub struct SqsSink { /* ... */ }

#[async_trait]
impl SinkPlugin for SqsSink {
    async fn initialize(&self) -> Result<(), PluginError> {
        // open connection pool
        Ok(())
    }

    async fn process_batch(&self, data: RecordBatch) -> Result<(), PluginError> {
        // your sink logic — react to the incoming RecordBatch
        self.client.send_batch(data).await?;
        Ok(())
    }

    // Checkpoints ensure the pipeline's data guarantees.  
    async fn process_checkpoint_marker(&self, epoch: CheckpointEpoch)
        -> Result<(), PluginError> { /* warns you a checkpoint is coming */ Ok(()) }

    async fn process_checkpoint_finalizer(&self, epoch: CheckpointEpoch)
        -> Result<(), PluginError> { /* make sure all in-flight writes are committed  */ Ok(()) }
}

register_plugin_sink!("sqs", SqsSink);
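The two checkpoint hooks are easier to follow in a self-contained miniature. The sketch below uses a hypothetical synchronous trait, `MiniSink`, that only mirrors the shape of the example above; it is not Streamling's `SinkPlugin`, and the buffering strategy in `BufferedSink` is just one possible way a sink could honor the finalizer.

```rust
/// Hypothetical, synchronous miniature of a sink interface with checkpoint hooks.
/// It is not Streamling's trait; it only mirrors the shape of the example above.
pub trait MiniSink {
    fn process_batch(&mut self, rows: Vec<String>);
    fn process_checkpoint_marker(&mut self, epoch: u64);
    fn process_checkpoint_finalizer(&mut self, epoch: u64);
}

/// Buffers rows and only makes them durable when a checkpoint is finalized.
#[derive(Default)]
pub struct BufferedSink {
    pending: Vec<String>,
    pub committed: Vec<String>,
    pub last_epoch: Option<u64>,
    sealing: Option<u64>,
}

impl MiniSink for BufferedSink {
    fn process_batch(&mut self, rows: Vec<String>) {
        self.pending.extend(rows);
    }

    fn process_checkpoint_marker(&mut self, epoch: u64) {
        // A checkpoint is coming: remember which epoch is being sealed.
        self.sealing = Some(epoch);
    }

    fn process_checkpoint_finalizer(&mut self, epoch: u64) {
        // Commit everything in flight before acknowledging the epoch.
        if self.sealing == Some(epoch) {
            self.committed.append(&mut self.pending);
            self.last_epoch = Some(epoch);
            self.sealing = None;
        }
    }
}
```

In this sketch nothing counts as written until the finalizer for the announced epoch runs, so a crash before that point leaves `committed` untouched and the rows are simply replayed from the source.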

Every stream processor out there advertises itself as extensible, but in practice, extending one requires adding code to your own implementation of the stream processing framework. We wanted to take a Runtime approach instead of a Framework approach, meaning the only Streamling code you need to understand is the plugin interface.

A Streamling plugin, by contrast, is a Rust crate containing a single trait to implement, which gives you the ability to write row-based or even vectorized transforms. Running it is as simple as compiling the crate and giving Streamling the path to it.
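One plausible way for a runtime to connect a `type:` string in the pipeline file to plugin code is a registry of constructors keyed by name. The sketch below is an assumption about the general technique, not a description of how Streamling loads plugins; `Sink`, `Registry`, `StdoutSink`, and `make_stdout` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical minimal sink interface for this sketch.
pub trait Sink {
    fn name(&self) -> &'static str;
}

pub struct StdoutSink;

impl Sink for StdoutSink {
    fn name(&self) -> &'static str {
        "stdout"
    }
}

/// Maps a `type:` string from the pipeline file to a constructor.
#[derive(Default)]
pub struct Registry {
    constructors: HashMap<&'static str, fn() -> Box<dyn Sink>>,
}

impl Registry {
    /// Registers a constructor under the name used in the pipeline file.
    pub fn register(&mut self, kind: &'static str, ctor: fn() -> Box<dyn Sink>) {
        self.constructors.insert(kind, ctor);
    }

    /// Builds a sink for `kind`, or returns None if no plugin registered it.
    pub fn build(&self, kind: &str) -> Option<Box<dyn Sink>> {
        self.constructors.get(kind).map(|ctor| ctor())
    }
}

pub fn make_stdout() -> Box<dyn Sink> {
    Box::new(StdoutSink)
}
```

The runtime side only ever sees `Box<dyn Sink>`, so a plugin author needs to know the trait and the registration call and nothing else about the engine.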

Plugins pass data without copying. Normally, to pass data from one node to another in a pipeline, you have to read, deserialize, transform, serialize, and write. With Streamling, plugins can often write directly to the RecordBatch using Arrow/DataFusion, and manage any copying and transformations deliberately.
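The idea of handing data between nodes without copying can be shown with a reference-counted column. This is a toy analogy in plain Rust: Arrow arrays share their buffers in a broadly similar reference-counted way, but `SharedBatch`, `passthrough`, and `double` are hypothetical names, not Arrow or Streamling APIs.

```rust
use std::sync::Arc;

/// A toy batch whose column lives behind a reference-counted pointer.
#[derive(Clone)]
pub struct SharedBatch {
    pub values: Arc<Vec<u64>>,
}

/// A pass-through node: hands the batch on without touching the data.
pub fn passthrough(batch: &SharedBatch) -> SharedBatch {
    batch.clone() // clones the Arc (a reference-count bump), not the Vec
}

/// A node that must change values makes its copy explicitly.
pub fn double(batch: &SharedBatch) -> SharedBatch {
    SharedBatch {
        values: Arc::new(batch.values.iter().map(|v| v * 2).collect()),
    }
}
```

After `passthrough`, both batches point at the same allocation (`Arc::ptr_eq` on the two `values` fields returns true), while `double` allocates a new column. Copies happen only where the code asks for them.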

This is what makes Streamling plugins the right interface for AI agents to write code against. We have been internally generating Streamling plugins with Claude for months now. The output is under a hundred lines of Rust, and it compiles on the first try most of the time. The agent often runs the plugin against the engine multiple times on real data to tune it, and the result runs as fast as if you wrote it right into the engine itself.

With agents, everything can be easy, but it’s dangerously easy. Being able to isolate generated code as individual plugins that we can include at runtime allows us to have a strong level of safety and confidence. The isolated nature of each plugin allows us to ship fast and partner with our customers directly without crashing everyone else. In fact, our sales, product, and support teams have now all contributed working plugins to our codebase, often immediately in reaction to a customer ask.