Storage options for stream processing apps

2 minute read

Storage in Faust

Faust provides a few options for state storage that we need to understand before we start building streaming applications in production. Here, we are going to review, in-memory and RocksDB state storage to better understand when to use which option.

Like most stream processing frameworks, faust keeps track of all of its state in a Kafka Topic dedicated for each stream. Doing so, keeping this changelog allows Faust to scale up from one node to thousands of processing nodes without skipping a beat.

How does this work? Well, similar to the way any Kafka producer works. As events happen in our stateful table, say a user makes an additional purchase, and we update the total value of their lifetime purchases, an event is emitted for that specific event. Essentially, anytime a change occurs, Faust emits an event to the changelog topic with the change details. Using this compact changelog topic, faust can re-create state on boot by reading through the topic before beginning to process messages.

Storing state in-memory

When a faust application starts and reads state from the changelog topic, or as it processes data during execution, it needs to store the current state of the application somewhere!! The first and default option is to simply store the state in memory. This means that an up-to-date copy of the state of a table is kept in application memory on every single node. While this is fast and simple to reason about, it has significant disadvantages. Every time the application restarts, it has to completely rebuild state from the Kafka changelog topic. What happens if we have 1 million events in our changelog topic?. It might take our application minutes or even hours to recover state and begin processing again. For many applications, that’s simply not acceptable. The point of stream-processing applications is typically to be fast and work in real-time. Additionally, what happens if the state of the table is too large to fit in memory? What do we do? Because of these significant disadvantages, it is not recommended to use in-memory storage for anything but your local test or development or when datasets are limited and recovery times are not critical.

Storing state in RocksDB

The second option for storing application state locally is to use RocksDB. RocksDB is a highly performant datastore that runs side-by-side with your stream processing application. As changes are made in your streaming tables, the state is stored in RocksDB, in addition to being sent to Kafka. The RocksDB instance simply stores the state of the world at that point in time on disk. If your application were to crash or restart for any reason, as soon as it recovers it will connect to RocksDB and instantly begin processing records once again. Because of these characteristics, it’s always recommended to use RocksDB in production. RocksDB is simple to use and requires no configuration on your part. You simply need to have a library installed in your machine to make use of it.

Conclusion

Use RocksDB in production and use in-memory only for local or development environments.