Apr 13, 2023

Supercharging Subgraph Indexing with Firehose

How Firehose led to a 3x improvement in subgraph indexing speed

Paymahn Moghadasian

Software Engineer

Subgraphs provide an unprecedented developer experience for blockchain indexing by giving full-stack developers direct, granular control. With subgraphs, you can create a TypeScript-based indexer that is easy to grok. However, subgraph indexing can be slower than other methods because it relies on purely sequential, event-by-event crawling through the chain. This can lead to long development cycles (up to a month for bigger subgraphs!).

At Goldsky, we’re committed to making it easier and faster to get the data developers need so they can focus on building. We employ various methods, such as intelligent pre-caching and tiered RPC node balancing with both dedicated nodes and shared APIs.

In this blog post, we’ll dive into how Firehose, a files-based approach to blockchain data streaming, together with other proprietary enhancements, enabled ZORA to index their subgraph 3x faster.

Firehose is an open-source, files-based streaming layer built by StreamingFast that uses object storage instead of RPC^{note 1} (Remote Procedure Call) providers. RPC providers vary widely in reliability and can be expensive, so reducing the reliance on them for data indexing is a significant improvement to the developer experience. Object storage is battle-tested and far more reliable in high-throughput applications.

Firehose is a distributed system with the following components:

A reader node, which is a modified version of geth that emits a file representation (called a one-block file) of each block produced by a blockchain.
A merger node, which collects one-block files and merges every 100 of them into a single 100-block file.
An indexer node, which reads 100 of these 100-block files at a time and builds index files that each cover 10,000 blocks.
An object store (such as S3) for these files.
A relayer node which streams newly ingested blocks from the reader to the Firehose node.
A Firehose node which is the external interface to the entire system.
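To make the file sizes above concrete, here is a minimal sketch of the bundling arithmetic: which 100-block file and which 10,000-block index file a given block lands in. The bundle sizes come from the list above. The assumption that bundles are aligned to multiples of their size, the zero-padded file names, and the `.merged` and `.index` suffixes are all hypothetical and are not Firehose's actual on-disk layout.

```python
from typing import List

MERGED_BUNDLE_SIZE = 100    # blocks per merged file (from the text)
INDEX_BUNDLE_SIZE = 10_000  # blocks covered by one index file (from the text)


def bundle_start(block_num: int, bundle_size: int) -> int:
    """Return the first block number of the bundle containing block_num.

    Assumes bundles are aligned to multiples of bundle_size.
    """
    return block_num - (block_num % bundle_size)


def merged_file_name(block_num: int) -> str:
    # Hypothetical naming scheme: zero-padded first block of the bundle.
    return f"{bundle_start(block_num, MERGED_BUNDLE_SIZE):010d}.merged"


def index_file_name(block_num: int) -> str:
    # Same hypothetical scheme, at index-file granularity.
    return f"{bundle_start(block_num, INDEX_BUNDLE_SIZE):010d}.index"


def merged_files_for_index(index_start: int) -> List[str]:
    """List the merged files that a single index file summarizes."""
    return [
        merged_file_name(b)
        for b in range(index_start, index_start + INDEX_BUNDLE_SIZE, MERGED_BUNDLE_SIZE)
    ]
```

Under these assumptions, block 17,034,567 would live in the merged file that starts at block 17,034,500 and be summarized by the index file that starts at block 17,030,000, and each index file spans exactly 100 merged files.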

With Firehose, blockchain data is represented in flat files in an object store. When Firehose needs to increase scale, there’s no need to increase data duplication or worry about differences in sync state across instances of Firehose nodes. The Firehose node is simply a retrieval layer that fetches data from the object store (we’re oversimplifying a bit here but we’ll save the details for a future post).

This files-based approach allows Firehose to improve subgraph indexing performance in the following ways:

Index files enable Firehose to reduce the total number of blocks a subgraph has to fetch during indexing.
There is no need to add new RPC nodes when a given node is overloaded, which eliminates data duplication and load balancing across nodes that may vary in sync state.
Firehose is more reliable than RPC nodes, which reduces the total number of retries in the system.
Firehose is written as a streaming service, which means blocks can be pushed to the Graph Node, whereas RPCs require the Graph Node to poll.
Built-in intelligent caching.
Goldsky runs Firehose in the same AWS availability zone as the rest of our infrastructure, which reduces network latency.
Firehose is co-designed with The Graph, which means that Firehose will continue to evolve to specifically handle the workload of The Graph (a read-heavy streaming workload), whereas RPCs are built for more general blockchain use cases.
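The first point in the list above, index files reducing the number of blocks fetched, can be illustrated with a small sketch. The index shape used here (a mapping from each index file's starting block to the block numbers where a contract address has matching events) is a hypothetical simplification for illustration, not the real index file format.

```python
from typing import Dict, List, Set

INDEX_BUNDLE_SIZE = 10_000  # blocks covered by one index file (from the text)

# Hypothetical index shape:
# index start block -> {contract address -> block numbers with matching events}
Index = Dict[int, Dict[str, Set[int]]]


def blocks_to_fetch(index: Index, address: str, start: int, end: int) -> List[int]:
    """Return the blocks in [start, end] that need to be fetched for `address`."""
    wanted: List[int] = []
    first_bundle = start - (start % INDEX_BUNDLE_SIZE)
    for bundle in range(first_bundle, end + 1, INDEX_BUNDLE_SIZE):
        lo = max(start, bundle)
        hi = min(end, bundle + INDEX_BUNDLE_SIZE - 1)
        if bundle not in index:
            # No index file for this range yet: every block must be scanned.
            wanted.extend(range(lo, hi + 1))
            continue
        hits = index[bundle].get(address, set())
        wanted.extend(b for b in sorted(hits) if lo <= b <= hi)
    return wanted


example_index: Index = {
    0: {"0xabc": {12, 4_500}},
    10_000: {"0xabc": {10_001}, "0xdef": {15_000}},
}
needed = blocks_to_fetch(example_index, "0xabc", 0, 19_999)
```

In this toy example, a subgraph that only cares about the made-up address `0xabc` fetches 3 blocks instead of crawling all 20,000 in the range. Ranges without an index file fall back to a full scan.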

Finally, Firehose gracefully handles reorgs, which is a critical capability when working with blockchain data.
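To show why reorg handling matters to a consumer, here is a toy sketch of a stream in which each block arrives tagged as either newly applied or undone, and the consumer rolls back its state when a block is undone. The `BlockEvent` type, the `"new"` and `"undo"` labels, and the `TransferCounter` class are hypothetical names for illustration. This is not the Firehose API or Graph Node's actual logic.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class BlockEvent:
    step: str        # "new" or "undo" (hypothetical labels)
    number: int
    block_hash: str
    transfers: int   # toy payload: number of transfers in the block


class TransferCounter:
    """Toy consumer that keeps a running total and can roll blocks back."""

    def __init__(self) -> None:
        self.total = 0
        self.applied: Dict[str, int] = {}  # block hash -> transfers applied

    def handle(self, event: BlockEvent) -> None:
        if event.step == "new":
            self.applied[event.block_hash] = event.transfers
            self.total += event.transfers
        elif event.step == "undo":
            # Revert exactly what this block contributed, if it was applied.
            self.total -= self.applied.pop(event.block_hash, 0)
        else:
            raise ValueError(f"unknown step: {event.step}")


# A one-block reorg: block 101 (hash "b") is replaced by a sibling (hash "c").
stream = [
    BlockEvent("new", 100, "a", 3),
    BlockEvent("new", 101, "b", 2),
    BlockEvent("undo", 101, "b", 2),
    BlockEvent("new", 101, "c", 5),
]
counter = TransferCounter()
for event in stream:
    counter.handle(event)
```

After the stream is processed, the counter reflects only the canonical chain (blocks "a" and "c"), so the 2 transfers from the orphaned block are not counted.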

We’re very happy with how Firehose allowed us to deliver a 3x improvement in indexing speed for ZORA, and now all of our customers can reap the benefits of our Firehose implementation.

Notes

^ Note: Firehose is technically capable of replacing all of the RPC needs of the Graph Node except for eth_call.
