Fuzzy matching (FM), also known as approximate string matching, fuzzy name matching, or fuzzy string matching, is an artificial intelligence and machine learning technique that identifies similar, but not identical, elements in datasets. FM uses an algorithm to navigate between absolute rules to find duplicate strings, words, or entries that do not immediately share the same characteristics. Where typical search logic operates on a binary pattern (e.g., 0/1, yes/no, true/false), fuzzy string matching instead finds strings, entries, and text in datasets that fall between these definitive parameters, working with intermediate degrees of truth.
Approximate string matching finds likely matches even when certain words are misspelled, abbreviated, or omitted, a capability heavily used in search engines. The result is a match score, and since the technique identifies words, phrases, and strings that are not perfect matches, that score will be below 100%.
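To make the idea of a match score concrete, here is a minimal sketch using Python’s standard difflib module, which exposes a similarity ratio between 0.0 and 1.0 (the match_score helper name is illustrative, not part of any particular FM library):

```python
from difflib import SequenceMatcher

def match_score(a: str, b: str) -> float:
    """Return a similarity score between 0.0 (no match) and 1.0 (exact match)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A typo still produces a high score, but not a perfect 1.0
print(match_score("passport", "passaport"))  # close to, but below, 1.0
print(match_score("passport", "passport"))   # exactly 1.0
```

In practice, a threshold (say, 0.85) would decide whether two strings are treated as a match.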
It’s important to land on the right fuzzy matching algorithm to determine the similarity between one string and another. In one case, a single transposition might separate “trial” from “trail,” or a search for “passport” might need to match an existing string that reads “passaport” because of a typo. Of course, not every fuzzy logic case is a single-character matter: “Martin Luther Junior” is quite similar to “Martin Luther King, Jr.” Distances vary, and there are various fuzzy name matching algorithms to help bridge those gaps.
There are drawbacks to running a fuzzy logic search with loosely defined matching rules. A weak system increases the chance of false positives. To keep false positives to a bare minimum, or, ideally, eliminate them entirely, your approximate string matching system should be holistic: it needs to account for misspellings, abbreviations, name variations, regional spellings of certain names, shortened nicknames, acronyms, and many other variables.
Though there are a good many string matching algorithms to choose from when reconciling datasets, there isn’t a one-size-fits-all solution. Here are a few of the most reliable and frequently used string matching techniques in data science for finding approximate matches.
The Levenshtein Distance (LD) is a fuzzy matching technique that measures the distance between two strings, with the resulting number representing how far the two strings are from an exact match. The higher the Levenshtein edit distance, the further the two terms are from being identical.
For example, if you are measuring the distance between “Cristian” and “Christian,” you’d have a distance of 1 since you’d be one “h” away from an exact match. This term is oftentimes interchangeable with the term “edit distance.”
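The standard way to compute this distance is with dynamic programming. A minimal sketch in Python (the function name and structure are illustrative; production code would typically reach for an optimized library):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("Cristian", "Christian"))  # 1: insert the missing "h"
```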
Named for American mathematician Richard Hamming, the Hamming distance (HD) is quite similar to Levenshtein, except that it is primarily utilized in signal processing, whereas Levenshtein is more often used to calculate the distance between textual strings. This algorithm uses the ASCII (American Standard Code for Information Interchange) table to determine the binary code assigned to each letter in each string to calculate the distance score.
Take the textual strings “Corn” and “Cork.” If attempting to find the HD between these two, your answer would be a distance of 2, not 1, like you’d get with the Levenshtein algorithm. To get that score, you have to look at the binary assignation of each letter, one by one. Since the ASCII Binary Character Table assigns the code (01101110) for N and (01101011) for K, you’ll note that the difference between each letter’s code occurs in two locations, thus an HD of 2.
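That bit-level comparison can be sketched in a few lines of Python (bitwise_hamming is an illustrative name; the sketch assumes ASCII input and, as Hamming distance requires, equal-length strings):

```python
def bitwise_hamming(a: str, b: str) -> int:
    """Count the differing bits between the ASCII encodings of two
    equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    # XOR each pair of character codes; set bits mark positions that differ
    return sum(bin(ord(ca) ^ ord(cb)).count("1") for ca, cb in zip(a, b))

print(bitwise_hamming("Corn", "Cork"))  # 2: 'n' and 'k' differ in two bit positions
```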
This LD variant also finds the minimum number of operations needed to make two strings a direct match, using single-character operations like insertion, deletion, and substitution. Damerau-Levenshtein takes it one step further, however, by adding a fourth possible operation: the transposition of two adjacent characters.
String 1: Micheal
String 2: Michaela
Operation 1: transposition: swap characters “a” and “e”
Operation 2: insert “a” (end of string 2)
Distance = 2
Each operation has a cost of 1, so insertions, deletions, substitutions, and transpositions are all weighted equally.
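The example above can be sketched in Python using the optimal-string-alignment variant of Damerau-Levenshtein, which restricts transpositions to adjacent characters (every operation costing 1, as described):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal-string-alignment distance: insertions, deletions,
    substitutions, and adjacent transpositions, each with cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Transposition of two adjacent characters
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(damerau_levenshtein("Micheal", "Michaela"))  # 2: one swap + one insert
```

Plain Levenshtein would score this pair as 3, since it has no transposition operation and must spend two substitutions to swap “e” and “a.”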
FM’s use cases are vast, with many real-world applications, deduplication being one of the most popular among them. Imagine continuously serving the same digital ad to a user who has already reacted negatively to that ad and favorably to another. How would the user experience suffer if a financial institution’s fraud detection repeatedly flagged a transaction the user makes every week? It’s the use of approximate string matching that has allowed deduplication to streamline records within so many of our modern data systems.
When we launched Search and Query back in 2016, one of its key features was an auto-suggest engine with FM. Anyone who has ever surfed the web has seen auto-suggest in action on a search engine. Speaking of search engines, have you ever misspelled a word while conducting a Google search, but still got the results you were looking for? Google will actually serve up what it believes you meant to type out as the main query while providing an option for searching the word(s) as you typed them just below. In this way, fuzzy matching has helped shape how AI/ML has helped improve our most trusted search engines.
Research has found that human error accounts for a considerable amount of the duplication that occurs in record keeping and data management. An Online Research Journal study on Perspectives in Health Information Management found that duplicated medical records are not only common but dangerous and costly. The study, led by Beth Haenke Just, MBA, RHIA, FAHIMA, used a multisite dataset of 398,939 patient records and found that the majority of name field mismatches were the result of misspellings (54.14% in first names fields, 33.62% in last names fields, and 58.3% for middle names). Human error is often the biggest obstacle in data management and record linkage. FM has become an indispensable tool for joining imprecise datasets in the medical field, financial services, identifying social security fraud, and much more. In the end, FM has helped save modern enterprises countless man-hours on the often onerous and painstaking work of manual deduplication.
Other benefits of FM include:
Fuzzy Matching algorithms can be implemented in various programming languages like:
The implementations are similar, with each language comparing sets, matching patterns, and determining the distance from a perfect match.
With FM, reliability is never a surefire guarantee. False positives sometimes surface, which calls for manual error checking. It’s important to ask: does the cost of a few false positives outweigh the benefit of correctly matching far more data? If that cost is negligible, time spent manually checking for errors may not be time well spent. Matching the right algorithm and programming language to the use case is the best way to prevent errors when applying fuzzy logic to data matching.
Fuzzy matching with Redis has been available since Redis Stack v1.2.0, where it uses the LD algorithm. Learn more about the query syntax used for complex queries and how Redis applies FM as one of its core query features.