“Simplicity is the ultimate sophistication”—Leonardo da Vinci
“Most information is irrelevant and most effort is wasted, but only the expert knows what to ignore”—James Clear, Atomic Habits
You have a fancy data pipeline with lots of different systems. It looks sophisticated on the surface, but under the hood it is a complex mess. It might need a lot of plumbing to connect the pieces, constant monitoring, and a large team with specialized expertise to run, debug, and manage it. The more systems you use, the more places you duplicate your data and the greater the chance of it going out of sync or stale. Furthermore, since each of these subsystems is developed independently by a different company, their upgrades or bug fixes might break your pipeline and your data layer.
If you aren’t careful, you may end up with the following situation as depicted in the three-minute video below. I highly recommend you watch it before you proceed.
Complexity arises because even though each system might appear simple on its own, each one brings the following variables into your pipeline:
Variables such as data format, schema, and protocol add up to what’s called the “transformation overhead.” Other variables, like performance, durability, and scalability, add up to what’s called the “pipeline overhead.” Put together, these two overheads make up what’s known as the “impedance mismatch.” If we can measure that, we can calculate the complexity and use it to simplify our system. We’ll get to that in a bit.
Now, you might argue that your system, although it might appear complex, is actually the simplest system for your needs. But how can you prove that?
In other words, how do you really measure and tell if your data layer is truly simple or complex? And secondly, how can you estimate if your system will remain simple as you add more features? That is, if you add more features in your roadmap, do you also need to add more systems?
That’s where the “impedance mismatch test” comes in. But let’s first look at what an impedance mismatch is, and then we’ll get to the test itself.
The term originated in electrical engineering to explain the mismatch in electrical impedance, resulting in the loss of energy when energy is being transferred from point A to point B.
Simply said, it means that what you have doesn’t match what you need. To use it, you take what you currently have, transform it into what you need, and then use it. Hence there is a mismatch and an overhead associated with fixing the mismatch.
In our case, you have the data in some form or quantity, and you need to transform it before you can use it. The transformation might happen multiple times and might even involve multiple systems in between.
In the database world, the impedance mismatch happens for two reasons:
The goal of the test is to measure the complexity of the overall platform and whether the complexity grows or shrinks as you add more features in the future.
The test works by calculating the “transformation overhead” and the “pipeline overhead” and combining them into an “Impedance Mismatch Score” (IMS). This tells you whether your system is already complex relative to other systems, and whether that complexity grows over time as you add more features.
Here is the formula to calculate IMS:
The formula simply adds both types of overhead and then divides the sum by the number of features. This way, you get the total overhead per feature (i.e., the complexity score).
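Written out from the description above, the formula can be expressed as follows. This is a reconstruction from the prose rather than a copy of the original figure, and the term names are simply the ones used in this article:

```latex
\[
\mathrm{IMS} \;=\; \frac{\text{transformation overhead} + \text{pipeline overhead}}{\text{number of features}}
\]
```

A lower IMS means less overhead per feature. Because the number of features is the denominator, the score drops when a new feature can reuse systems that are already in the pipeline.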
To understand this better, let’s compare four simple data pipelines and calculate their scores. Let’s also imagine we are building a simple app in two phases, so that we can see how the IMS changes as we add more features over time.
Say you are getting millions of button-click events from mobile devices and you need an alert if there is any drop or spike. Additionally, you are considering this entire thing as a feature of your larger application.
Case 1: Say you just used an RDBMS to store these events, although the tables might not fit.
Case 2: Say you used Kafka to process these events and then stored them into the RDBMS.
Case 3: Say you used Kafka to process these events and then stored them into ksqlDB.
Case 4: Say you used Redis Streams to process these events and then stored them into RedisTimeSeries (both are part of Redis and work natively with Redis).
We compared four systems in this example and found that “Case 3” and “Case 4” are the simplest, each with an IMS of 1. At this point they are tied, but will they remain tied when we add more features?
Let’s add more features to our system and see how IMS holds up.
Let’s say you are building the same app but want to make sure the events come only from whitelisted IP addresses. Now you are adding a new feature.
Case 1: Say you just used an RDBMS to store these events, although the tables might not fit, and you used Redis or Memcached for IP whitelisting.
Case 2: Say you are using Redis + Kafka + RDBMS.
Case 3: Say you are using Redis + Kafka + ksqlDB.
Case 4: Say you are using Redis + Redis Streams + RedisTimeSeries.
When we added an additional feature,
So in our example, Case 4, which was tied for the lowest IMS at 1, actually improved as we added the new feature and ended up at 0.5.
Please note: If you add more or different features, Case 4 may not remain the simplest. But that’s the idea of the IMS score. Simply list all the features, compare different architectures, and see which one is the best for your use case.
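The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's calculator: the `Pipeline` class, the `ims` and `rank` helpers, and the overhead counts for `pipeline_a` and `pipeline_b` are all hypothetical and chosen only to show the arithmetic. How you count each overhead for a real system is up to you.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    """A candidate architecture with its two overhead counts (hypothetical model)."""
    name: str
    transformation_overhead: int  # e.g. data format, schema, protocol mismatches
    pipeline_overhead: int        # e.g. performance, durability, scalability mismatches


def ims(pipeline: Pipeline, num_features: int) -> float:
    """Impedance Mismatch Score: total overhead divided by the number of features."""
    if num_features <= 0:
        raise ValueError("num_features must be positive")
    total = pipeline.transformation_overhead + pipeline.pipeline_overhead
    return total / num_features


def rank(pipelines, num_features):
    """Return (score, name) pairs sorted from simplest to most complex."""
    return sorted((ims(p, num_features), p.name) for p in pipelines)


# Illustrative numbers only; they are not taken from the four cases above.
pipeline_a = Pipeline("pipeline_a", transformation_overhead=0, pipeline_overhead=1)
pipeline_b = Pipeline("pipeline_b", transformation_overhead=2, pipeline_overhead=2)

for features in (1, 2):
    print(features, rank([pipeline_a, pipeline_b], features))
```

With one feature, `pipeline_a` scores 1.0. With two features and no added overhead it scores 0.5, which is the same kind of drop described for Case 4, assuming the second feature reuses the systems already in place.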
To make it even simpler to use, we provide a calculator that you can implement in a simple spreadsheet to calculate the IMS.
Here is how you use it:
Data Pipeline 1
Data Pipeline 2
It is very easy to get carried away and build a complex data layer without thinking about the consequences. The IMS was created to help you make that decision consciously.
You can use the IMS score to easily compare and contrast multiple systems for your use case and see which one is really the best for your set of features. You can also validate if your system can hold up to feature expansions and continue to remain as simple as possible.