Try Redis Cloud Essentials for Only $5/Month!

Learn More

Fast, secure data deduplication solutions

Declutter your data to enhance user engagement and streamline systems with Redis Enterprise.

What is data deduplication?

Data deduplication, also known as Dedup, delivers improved user experiences and engagement as well as more efficient data processing. Although most people think of deduplication as removing multiple copies of the same data in a database, in reality the need for deduplication happens in many common situations, leading to such questions as: 

Use cases for deduplication

Engaging customers without repeating previous messages

Sending messages to customers is an essential part of any business. Before sending an email or text to a given user, deduplication can determine if that user has seen that email or text before. If so, the system can send that user a different message or decide not to contact that user for a given time period (if at all).

Illustration of deduped email messages
Illustration of deduped ETL process

ETL processing

Extract, transform, and load (ETL) is a common technique for working with large data sets such as event logs. Typically ETL is used to collate data from disparate systems so it can be analyzed. Adding deduplication to this process means the system doesn’t perform the same analysis multiple times on multiple copies of the same data.

Making sure user IDs are unique

When a user creates a new account, it’s important that their ID be unique. With deduplication, it’s trivial to determine if a given ID already exists.

Illustration of deduped user profiles
Illustration of deduped alerts

Filtering duplicate alerts

It’s good to get an alert that something important has happened in a system. It’s bad to not get any alert at all, but it’s almost as bad to get 300 alerts for the same event. Or 30,000 alerts. Deduplication can stifle an alert that has occurred before, but the technique can be combined with a timer to remove recent duplicates, allowing the system to issue a given alert no more than once in a given time frame.

Filtering transactions to avoid expensive fraud detection processing

Fraud detection is a great use case for AI/ML processing. Unfortunately, AI/ML processing is expensive. Deduplication can examine a transaction and see if it is similar to a transaction that was approved in the past. If the new transaction is similar enough, the system can approve the transaction automatically without incurring the cost of fraud detection. (In this use case, a duplicate value is a good thing.)

Illustration of deduped transactions

How Redis Enterprise meets the challenges of real-time deduplication

Redis Enterprise provides two great technologies for deduplication: sets and Bloom filters.

Redis Sets

Deduplication Illustration Redis Sets

A Redis set stores any given value only once. No matter how many times a value is added to a set, the set contains only one copy of that value. Redis sets are also optimized to answer the question, “Is this value in this set?” with extreme speed. Sets are also determinate in that the answers to the question are “Definitely” and “Definitely not.” Be aware, however, that a set can use a substantial amount of memory as the number of unique values in the set increases.

Probabilistic

Deduplication Illustration Bloom Filter

The Probabilistic module provides Bloom filters, a probabalistic algorithm useful for deduplication. Unlike a set, a Bloom filter only stores hashes for each value, not the value itself. That means a Bloom filter can take as little as 2% of the memory a set requires. And they are typically slightly faster than sets, which are themselves very fast. But there is a tradeoff for this speed and memory efficiency: a Bloom filter is not determinate. The answers to the question, “Have we seen this value before?” are “Definitely not” and “Probably so.” In other words, false positives are possible with a Bloom filter.

Customer success story

A major gaming company uses Bloom filters in Redis Enterprise to add an event-level deduplication layer to real-time data streams, enabling them to process more than 2 billion events per day. Compared to their original implementation, they now use roughly 90% less memory and they reduced latency from roughly 2 milliseconds to less than 0.25 milliseconds.

Top 3 challenges a deduplication solution must address

How to implement deduplication

Using sets

A set contains at most one copy of any given value. In other words, we can add the same value to a set a million times (call SADD pets dog over and over), but the value will only occur once in the set. With a set defined, we can use the SISMEMBER command to tell if a given value is in the set. If we create a set named pets and add the values dogcat, and parrot to it, the value dog is a member of the set, while the value monkey is not:

redis> SADD pets dog cat parrot
(integer) 3
redis> SISMEMBER pets dog
(integer) 1
redis> SISMEMBER pets monkey
(integer) 0
redis> SADD pets dog
(integer) 0

Using Bloom filters

Here’s how easy it is to set up and use a Bloom filter:

redis> BF.ADD users doug
(integer) 1
redis> BF.ADD users lisa
(integer) 1
redis> BF.EXISTS users doug
(integer) 1
redis> BF.EXISTS users bob
(integer) 0
redis> BF.ADD users lisa
(integer) 0

We want to make sure that every user has a unique name. We add doug and lisa to the Bloom filter. (BF.ADD creates the filter if necessary.) If someone tries to create another account named doug, the Bloom filter will say that we probably have a user with that name already. On the other hand, if someone tries to create a new account with the username bob, the Bloom filter will tell us that it is definitely not an ID we’ve seen before, so it’s safe to create a new account with that name.