We are now, simply, Redis
Active-Active Geo-Distribution allows you to place your Redis database cluster instances close to where your users are, no matter where they are. Placing read replicas closer to your users is obviously the right thing to do whenever possible, but for applications that are write-heavy that’s not enough—write operations will still be slow without an Active-Active setup. So how do you develop for an Active-Active cluster instead?
Conflict-free replicated data types (CRDTs) are a family of replicated data types that have a common set of properties that enable operations performed on them to always converge to a final state consistent among all replicas. To ensure that conflicts never happen and cause problems in your applications, operations on CRDTs have to respect a specific set of algebraic properties.
The way this works in Redis Enterprise, when you create a CRDT-enabled database, is that normal commands get swapped with an equivalent CRDT implementation. Let’s look at an example of a Redis data structure that has a CRDT equivalent, and the nuances that Active-Active distributed replication adds.
Redis Sets are similar to what your favorite programming language offers in the standard library, offering specialized operations to add and remove elements, check if an element is part of the set, and perform set intersection, union, and difference. Here’s how it looks like when calling actual Redis commands:
// Create a set:
> SADD fruits apple pear banana
// Test if it contains apple:
> SISMEMBER fruits apple
// Remove a fruit:
> SREM fruits banana
// Test if it’s still present:
> SISMEMBER fruits banana
Running the same commands in a CRDT-enabled Redis Enterprise database would display the same behavior, but what’s happening behind the scenes is completely different.
The key point is that in a CRDT database, all replica nodes can apply changes to the “fruits” key, independently from each other. This is what enables you to write truly geo-distributed applications. The downside is that now the “replication lag” is also experienced when writing to a value. Nevertheless, this approach offers great advantages for geo-distributed applications..
In a Redis Enterprise CRDT-enabled database, Sets operations work under a few additional rules, the two most important of which get applied when merging operations coming from different nodes:
The second rule is sometimes referred to as the “observed remove” rule, meaning that you can delete only items that you were able to observe when the command was issued.
A note on replication lag and consistency model
While eventually all replica nodes will converge to the same final state, in the short term a command sent to ReplicaEU may not have yet been propagated to ReplicaUS, for example. This situation, albeit normally brief, is the reason why CRDT data structures must hardcode conflict-resolution strategies. All replicas in the same CRDT database will constantly sync their state to provide as consistent a view of the dataset as possible, but keep in mind that CRDTs are also useful for ensuring high availability in case of network partitions.
This means that to implement better resilience in your system you will need to account for the possibility of prolonged moments where the system has not yet been not fully synchronized. This makes CRDTs a form of eventual consistency. This is usually described as strong eventual consistency because it’s much more efficient than the more common types of replication based on quorum quotas. (For more information, see the Redis page on Active-Active Geo-Distribution.)
More CRDT data structures
For a complete list of the CRDT data structures that Redis Enterprise supports, take a look at the official documentation.
What are the practical requirements for developing an Active-Active application?
Let’s look at the main development aspects that could raise questions.
One question is whether you need a special type of Redis client to interact with a CRDT database. The answer is no—any normal Redis client can connect to a CRDT database and execute commands directly. As mentioned earlier, the commands don’t change, it’s the underlying mechanics that do. The only thing that clients don’t offer out-of-the-box is the ability to connect to a different geo-distributed replica in case the one closest to the service instance becomes unavailable. We’re working on this issue, but in the meantime, in the event of a network split, you will have to decide at the application level when it’s appropriate to connect to a replica in a different region.
The advantage of Active-Active is the ability to share some state across service instances distributed globally, all while experiencing local latencies when manipulating that state. To fully exploit this capability, your services need to rely on CRDTs’ semantics as much as possible and as such avoid keeping internal state that would get lost (or become un-mergeable) in the event of failures.
That said, CRDTs can’t solve all problems because they hardcode specific merge rules that might not be appropriate for your particular problem. Counters are a typical example: CRDT counters are great but they can’t be used to model a bank account balance because merges could allow the counter to go negative—and there is no way at the application level to prevent this from happening. In other words, CRDTs are an efficient but nuanced form of eventual consistency that don’t apply properly to inherently transactional problems.
It might seem that testing an Active-Active application must be much harder than testing normal single-master ones. While CRDTs certainly are a complex component of your data model, their behavior is fully deterministic in terms of end results. The main situation you have to account for is when the cluster is partitioned. As mentioned above, it’s OK to continue sending updates to a replica that has been disconnected from the rest of the cluster, because the updates will eventually be merged successfully when the connection is reestablished. You just need to make sure that your service can still operate correctly when disconnected from the whole Active-Active database cluster.
In other words, the only non-obvious extra testing required on your part should be about how the application behaves in the event of a network partition.
CRDTs allow you to create geo-distributed applications that can offer local latencies to your entire user base, all the while rendering your entire application more resilient to failure. While not all problems can be solved by CRDTs, there is a vast space of improvements that most companies can benefit from. As an example, take a look at this recent blog post by Kyle Davis, where he shows how to implement a leaderboard using Sorted Sets—and hints at the benefits of a CRDT version.