Challenges in building message broker solutions
1. Communication between services must be reliable
When one service wants to communicate with another service, it can’t always do so immediately. Failures happen and independent deployments might make a service unavailable for periods of time. For applications at scale, it’s not a matter of “if” or “when” a service becomes unavailable, but how often. To mitigate this problem, a best practice is to limit the amount of synchronous communication between services (i.e., directly calling the service’s APIs, for example by sending a HTTP(S) request) and instead prefer persisted channels whenever practical, so that services can then consume messages at their convenience. The two main paradigms for this type of asynchronous communication are event streams and message queues.
- Message queues are based on mutable lists and are sometimes consumed through tools that help implement common patterns. There are two main differences between message queues and event streams: Message queues use a “push” type of communication—a service pushes a new message to another service’s inbox whenever something new needs attention. Streams operate in the opposite way.
- Messages contain mutable state (e.g., number of retries) and, when successfully processed, they are deleted from the system. Stream events are immutable and the history, when trimmed, is often saved in cold storage.
Event streams are based on the log data type, which is extremely efficient at seeking through its history and appending new items to its end. These two properties make the immutable log both a great communication primitive, and an efficient way to store data.
Communicating through a stream is different than using a message queue. As mentioned previously, message queues are “push,” while streams are “pull.” In practice, this means that each service writes to its own stream and other services will optionally observe (i.e. “pull” from) it. This makes one-to-many communication much more efficient than with message queues.
Message queues work best when one service wants another one to perform an operation. In that situation, the second service’s message queue acts as a “request inbox.” When a service needs instead to publish an event (i.e., a message that is of interest to multiple services), the publishing service will need to push a message to the queue of each service interested in the event. In practice most tools (e.g. Enterprise Service Buses) can do that transparently, but generating and storing a separate message copy for each recipient remains inefficient.
Event streams outperform message queues in one-to-many communication patterns by inverting the protocol: only one copy of the original event exists, and whichever service wants to access it can seek through the event stream (i.e. the publishing service’s stream) at its own pace. Event streams have another practical advantage over message queues: you don’t need to specify the event-subscribers in advance. In message queues, the system needs to know to which queues to deliver a copy of the event, so if you add a new service later, it will receive only new events. With event streams, this problem doesn’t exist—a new service can even be made to go through the full event history, which is great for adding new analytics and still being able to compute them retroactively. This means that you don’t have to come up immediately with every metric you might need in the future. You can just track the ones you need now, and add more as you go because you know that you will still be able to see the full history even for the ones added at a later date.
2. Storage must be space efficient
Space efficiency is a welcome property for all communication channels that persist messages. For event streams, though, it’s fundamental, as they are often used for long-term information storage. (We mentioned above that immutable logs are fast at appending new entries and at seeking through history.)
Redis Streams is an implementation of the immutable log that uses radix trees as the underlying data structure. Each stream entry is identified by a timestamp and can contain an arbitrary set of field-value pairs. Entries of the same stream can have different fields, but Redis is able to compress multiple events in a row that share the same schema. This means that if your events have stable set of fields you won’t pay a storage price for each field name, letting you use longer and more descriptive key names without any downside.
As noted, streams can be trimmed to remove older entries and deleted history often gets preserved in an archival format. Another feature of Redis Streams is the ability to mark any mid-stream entry as “deleted” to help with compliance with regulations such as GDPR.
Scaling processing throughput
Event streams and message queues help cope with communication bursts. But another problem of direct API invocation is that services can get overwhelmed when traffic spikes. Asynchronous communication channels can act as a buffer, which helps smooth out spikes, but the processing throughput has to be robust enough to sustain normal traffic or the system will collapse and the buffer will need to grow indefinitely.
In Redis Streams it’s possible to increase the processing throughput by reading a stream through a consumer group. Readers that are part of the same consumer group see messages in a mutually exclusive fashion. Of course, a single stream can have multiple consumer groups. In practice, you’ll want to create a separate consumer group for every service, so that each service can then spin up multiple reader instances to increase parallelism as needed.
3. Messaging semantics must be clear
When communicating asynchronously, it’s fundamental to consider possible failure scenarios. A service instance might crash or lose connectivity while processing a message, for example. Because communication failures are inevitable, messaging systems divide themselves into two categories: at-most-once and at-least-once delivery. (Some messaging systems claim to offer exactly once delivery, but that is not the complete picture. In any reliable messaging system, messages will occasionally need to be delivered more than once in order to overcome failures. This is an unavoidable characteristic of communication over unreliable networks.)
To handle failures correctly, all services participating in the system must be able to perform idempotent message processing. ‘Idempotent’ means that the system’s state doesn’t change in the event of duplicate message delivery. Idempotence is typically achieved by applying any necessary state change and saving the last message processed atomically (e.g., in a transaction). This way, failure will never be left in an inconsistent state in the case of failure, and the reader will be able to tell if a given message was already processed or not by checking if the new message identifier precedes the last processed message.
Redis Streams, being a reliable streaming communication channel, is an at-least-once system. When reading a stream through a consumer group, Redis remembers which event was dispatched to which consumer. It’s the consumer’s duty then to properly acknowledge that the message was successfully processed. When a consumer dies, an event might remain stuck. To solve this problem consumer groups offer a way of inspecting the state of pending messages and, when necessary, to reassign an event to another consumer.
We noted above that transactions (and atomic operations) are the main way to achieve idempotency. To help with that, Redis Transactions and Lua scripting allow composition of multiple commands with all-or-nothing transactional semantics.
Redis Pub/Sub is an at-most-once messaging system that allows publishers to broadcast messages to one or more channels. More precisely, Redis Pub/Sub is designed for real-time communication between instances where low latency is of the utmost importance, and as such doesn’t feature any form of persistence or acknowledgment. The result is the leanest possible real-time messaging system, perfect for financial and gaming applications, where every millisecond matters.