Probably and No: Redis, RedisBloom, and Bloom Filters
Break the data matrix, and become a #RedisHero.
To put it simply, deduplication is any task that needs to ask the question, “Have we seen this piece of information before?” This delivers improved user experiences and engagement as well as more efficient data processing. Although most people think of deduplication as removing multiple copies of the same data in a database, in reality the need for deduplication happens in many common situations, leading to such questions as:
Sending messages to customers is an essential part of any business. Before sending an email or text to a given user, deduplication can determine if that user has seen that email or text before. If so, the system can send that user a different message or decide not to contact that user for a given time period (if at all).
Extract, transform, and load (ETL) is a common technique for working with large data sets such as event logs. Typically ETL is used to collate data from disparate systems so it can be analyzed. Adding deduplication to this process means the system doesn’t perform the same analysis multiple times on multiple copies of the same data.
When a user creates a new account, it’s important that their ID be unique. With deduplication, it’s trivial to determine if a given ID already exists.
It’s good to get an alert that something important has happened in a system. It’s bad to not get any alert at all, but it’s almost as bad to get 300 alerts for the same event. Or 30,000 alerts. Deduplication can stifle an alert that has occurred before, but the technique can be combined with a timer to remove recent duplicates, allowing the system to issue a given alert no more than once in a given time frame.
Fraud detection is a great use case for AI/ML processing. Unfortunately, AI/ML processing is expensive. Deduplication can examine a transaction and see if it is similar to a transaction that was approved in the past. If the new transaction is similar enough, the system can approve the transaction automatically without incurring the cost of fraud detection. (In this use case, a duplicate value is a good thing.)
Redis Enterprise provides two great technologies for deduplication: sets and Bloom filters.
A Redis set stores any given value only once. No matter how many times a value is added to a set, the set contains only one copy of that value. Redis sets are also optimized to answer the question, “Is this value in this set?” with extreme speed. Sets are also determinate in that the answers to the question are “Definitely” and “Definitely not.” Be aware, however, that a set can use a substantial amount of memory as the number of unique values in the set increases.
The RedisBloom module provides Bloom filters, a probabalistic algorithm useful for deduplication. Unlike a set, a Bloom filter only stores hashes for each value, not the value itself. That means a Bloom filter can take as little as 2% of the memory a set requires. And they are typically slightly faster than sets, which are themselves very fast. But there is a tradeoff for this speed and memory efficiency: a Bloom filter is not determinate. The answers to the question, “Have we seen this value before?” are “Definitely not” and “Probably so.” In other words, false positives are possible with a Bloom filter.
A major gaming company uses Bloom filters in Redis Enterprise to add an event-level deduplication layer to real-time data streams, enabling them to process more than 2 billion events per day. Compared to their original implementation, they now use roughly 90% less memory and they reduced latency from roughly 2 milliseconds to less than 0.25 milliseconds.
A set contains at most one copy of any given value. In other words, we can add the same value to a set a million times (call SADD pets dog over and over), but the value will only occur once in the set. With a set defined, we can use the SISMEMBER command to tell if a given value is in the set. If we create a set named pets and add the values dog, cat, and parrot to it, the value dog is a member of the set, while the value monkey is not:
redis> SADD pets dog cat parrot (integer) 3 redis> SISMEMBER pets dog (integer) 1 redis> SISMEMBER pets monkey (integer) 0 redis> SADD pets dog (integer) 0
Here’s how easy it is to set up and use a Bloom filter:
redis> BF.ADD users doug (integer) 1 redis> BF.ADD users lisa (integer) 1 redis> BF.EXISTS users doug (integer) 1 redis> BF.EXISTS users bob (integer) 0 redis> BF.ADD users lisa (integer) 0
We want to make sure that every user has a unique name. We add doug and lisa to the Bloom filter. (BF.ADD creates the filter if necessary.) If someone tries to create another account named doug, the Bloom filter will say that we probably have a user with that name already. On the other hand, if someone tries to create a new account with the username bob, the Bloom filter will tell us that it is definitely not an ID we’ve seen before, so it’s safe to create a new account with that name.