Data deduplication, also known as dedup, improves user experience and engagement and makes data processing more efficient. Although most people think of deduplication as removing multiple copies of the same data from a database, the need for it arises in many common situations, leading to such questions as:
A set contains at most one copy of any given value. In other words, we can add the same value to a set a million times (call SADD pets dog over and over), but the value will occur only once in the set. With a set defined, we can use the SISMEMBER command to tell whether a given value is in the set. If we create a set named pets and add the values dog, cat, and parrot to it, the value dog is a member of the set, while the value monkey is not. Adding dog a second time returns 0 because no new member was added:
redis> SADD pets dog cat parrot
(integer) 3
redis> SISMEMBER pets dog
(integer) 1
redis> SISMEMBER pets monkey
(integer) 0
redis> SADD pets dog
(integer) 0
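To make the return values above concrete, here is a minimal Python sketch that mimics the same behavior with a built-in set. The helper names sadd and sismember and the store dictionary are hypothetical stand-ins invented for illustration; this is not a Redis client and it does not talk to a server.

```python
# Hypothetical in-memory stand-ins for SADD and SISMEMBER (not a Redis client).
def sadd(store, key, *values):
    """Add values to the set stored at key; return how many were newly added."""
    members = store.setdefault(key, set())
    before = len(members)
    members.update(values)
    return len(members) - before


def sismember(store, key, value):
    """Return 1 if value is in the set stored at key, otherwise 0."""
    return 1 if value in store.get(key, set()) else 0


store = {}
print(sadd(store, "pets", "dog", "cat", "parrot"))  # 3: all three values are new
print(sismember(store, "pets", "dog"))              # 1: dog is a member
print(sismember(store, "pets", "monkey"))           # 0: monkey is not
print(sadd(store, "pets", "dog"))                   # 0: dog was already there
```

The count returned by the add operation is what makes deduplication cheap: a result of 0 tells us we have seen the value before, without a separate lookup.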
Here’s how easy it is to set up and use a Bloom filter:
redis> BF.ADD users doug
(integer) 1
redis> BF.ADD users lisa
(integer) 1
redis> BF.EXISTS users doug
(integer) 1
redis> BF.EXISTS users bob
(integer) 0
redis> BF.ADD users lisa
(integer) 0
We want to make sure that every user has a unique name. We add doug and lisa to the Bloom filter. (BF.ADD creates the filter if necessary.) If someone tries to create another account named doug, the Bloom filter will say that we probably have a user with that name already. On the other hand, if someone tries to create a new account with the username bob, the Bloom filter will tell us that it is definitely not a name we’ve seen before, so it’s safe to create a new account with that name. The final BF.ADD for lisa returns 0 because the filter reports that the name is probably already present.
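Why "probably" for a hit but "definitely" for a miss? A rough Python sketch of the idea behind a Bloom filter shows where the asymmetry comes from. This is an illustrative toy, not how the Redis module is implemented: the class name TinyBloomFilter, the default sizes, and the use of SHA-256 to derive bit positions are all assumptions made for the example.

```python
import hashlib


class TinyBloomFilter:
    """Toy Bloom filter (illustrative only; sizes and hashing are assumptions)."""

    def __init__(self, size_bits=4096, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def exists(self, item):
        # 1 means "probably seen"; 0 means "definitely not seen".
        return 1 if all(self.bits[p] for p in self._positions(item)) else 0

    def add(self, item):
        # Return 1 if the item looked new, 0 if it was probably already present.
        looked_new = 0 if self.exists(item) else 1
        for p in self._positions(item):
            self.bits[p] = 1
        return looked_new


users = TinyBloomFilter()
print(users.add("doug"))     # 1: the filter was empty, so doug must be new
print(users.add("lisa"))     # almost certainly 1 with so few names stored
print(users.exists("doug"))  # 1: every bit for doug is set
print(users.exists("bob"))   # almost certainly 0; a false positive is possible
print(users.add("lisa"))     # 0: every bit for lisa is already set
```

A name that was added always has all of its bits set, so a 0 can be trusted completely. A 1 only means that all the bits happen to be set, which unrelated names could have caused, hence "probably".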