7.1.1 Basic search theory



    Searching for words in documents faster than scanning over them requires preprocessing the documents in advance. This preprocessing step is generally known as indexing, and the structures that we create are called inverted indexes. In the search world, inverted indexes are well known and are the underlying structure for almost every search engine that we’ve used on the internet. In a lot of ways, we can think of this process as producing something similar to the index at the end of this book. We create inverted indexes in Redis primarily because Redis has native structures that are ideally suited for inverted indexes: the SET and ZSET.¹

    More specifically, an inverted index of a collection of documents will take the words from each of the documents and create tables that say which documents contain what words. So if we have two documents, docA and docB, containing just the titles lord of the rings and lord of the dance, respectively, we’ll create a SET in Redis for lord that contains both docA and docB. This signifies that both docA and docB contain the word lord. The full inverted index for our two documents appears in figure 7.1.
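    Before involving Redis, the same structure can be sketched with plain Python sets. This toy example (using the docA/docB titles above, with naive whitespace splitting rather than the real tokenizer developed later) shows the word-to-documents mapping:

    ```python
    from collections import defaultdict

    # A toy in-memory inverted index: word -> set of document IDs.
    index = defaultdict(set)

    documents = {
        'docA': 'lord of the rings',
        'docB': 'lord of the dance',
    }

    for docid, title in documents.items():
        for word in title.split():
            index[word].add(docid)

    print(sorted(index['lord']))   # ['docA', 'docB'] - both titles contain "lord"
    ```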

    Figure 7.1 The inverted index for docA and docB

    Knowing that the ultimate result of our index operation is a collection of Redis SETs is helpful, but it’s also important to know how we get to those SETs.


    Figure 7.2 The process of tokenizing text into words, then removing stop words, as run on a paragraph from an early version of this section

    In order to construct our SETs of documents, we must first examine our documents for words. The process of extracting words from documents is known as parsing and tokenization; we’re producing a set of tokens (or words) that identify the document.

    There are many different ways of producing tokens. The methods used for web pages could be different from methods used for rows in a relational database, or from documents from a document store. We’ll keep it simple and consider words of alphabetic characters and apostrophes (‘) that are at least two characters long. This accounts for the majority of words in the English language, except for I and a, which we’ll ignore.
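    As a quick check of that rule, the sketch below runs this kind of regular expression over a sample sentence (the sentence is invented for illustration); note that I and a never match, because the pattern requires at least two characters:

    ```python
    import re

    # Runs of lowercase letters and apostrophes, at least two characters long.
    WORDS_RE = re.compile("[a-z']{2,}")

    sample = "I can't say a word about Frodo's ring"
    tokens = [match.group().strip("'") for match in WORDS_RE.finditer(sample.lower())]
    print(tokens)   # ["can't", 'say', 'word', 'about', "frodo's", 'ring']
    ```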

    One common addition to a tokenization process is the removal of words known as stop words. Stop words are words that occur so frequently in documents that they don’t offer a substantial amount of information, and searching for those words returns too many documents to be useful. By removing stop words, not only can we improve the performance of our searches, but we can also reduce the size of our index. Figure 7.2 shows the process of tokenization and stop word removal.

    One challenge in this process is coming up with a useful list of stop words. Every group of documents will have different statistics on what words are most common, which may or may not affect stop words. As part of listing 7.1, we include a list of stop words (fetched from http://www.textfixer.com/resources/), as well as functions to both tokenize and index a document, taking into consideration the stop words that we want to remove.

    Listing 7.1 Functions to tokenize and index a document

    import re

    # Predeclare our known stop words; these were fetched from
    # http://www.textfixer.com/resources/.
    STOP_WORDS = set('''able about across after all almost also am among
    an and any are as at be because been but by can cannot could dear did
    do does either else ever every for from get got had has have he her
    hers him his how however if in into is it its just least let like
    likely may me might most must my neither no nor not of off often on
    only or other our own rather said say says she should since so some
    than that the their them then there these they this tis to too twas us
    wants was we were what when where which while who whom why will with
    would yet you your'''.split())

    # A regular expression that extracts words as we defined them.
    WORDS_RE = re.compile("[a-z']{2,}")

    def tokenize(content):
        # The Python set of words that we have found in the document content.
        words = set()
        # Iterate over all of the words in the content.
        for match in WORDS_RE.finditer(content.lower()):
            # Strip any leading or trailing single-quote characters.
            word = match.group().strip("'")
            # Keep any words that are still at least two characters long.
            if len(word) >= 2:
                words.add(word)
        # Return the set of words that remain that are also not stop words.
        return words - STOP_WORDS

    def index_document(conn, docid, content):
        # Get the tokenized words for the content.
        words = tokenize(content)

        pipeline = conn.pipeline(True)
        # Add the document to the appropriate inverted index entries.
        for word in words:
            pipeline.sadd('idx:' + word, docid)
        # Return the number of unique non-stop words that were added
        # for the document.
        return len(pipeline.execute())


    If we were to run our earlier docA and docB examples through the updated tokenization and indexing step in listing 7.1, instead of having the five SETs corresponding to lord, of, the, rings, and dance, we’d only have lord, rings, and dance, because of and the are both stop words.
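    We can verify this behavior directly with the tokenize() function from listing 7.1; the snippet below is self-contained, so it inlines a cut-down stop word list (just of and the, which is enough for these two titles) instead of the full list:

    ```python
    import re

    STOP_WORDS = set('of the'.split())   # cut-down stop word list for this example
    WORDS_RE = re.compile("[a-z']{2,}")

    def tokenize(content):
        words = set()
        for match in WORDS_RE.finditer(content.lower()):
            word = match.group().strip("'")
            if len(word) >= 2:
                words.add(word)
        return words - STOP_WORDS

    print(tokenize('lord of the rings') == {'lord', 'rings'})   # True
    print(tokenize('lord of the dance') == {'lord', 'dance'})   # True
    ```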

    REMOVING A DOCUMENT FROM THE INDEX If you’re in a situation where your document may have its content changed over time, you’ll want to add functionality to automatically remove the document from the index prior to reindexing the item, or a method to intelligently update only those inverted indexes that the document should be added or removed from. A simple way of doing this would be to use the SET command to update a key with a JSON-encoded list of words that the document had been indexed under, along with a bit of code to un-index as necessary at the start of index_document().
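    As a rough sketch of that idea (not the book’s code), the snippet below stores a JSON-encoded word list per document and un-indexes before reindexing. The FakeRedis class and reindex_document() function are hypothetical: FakeRedis is an in-memory stand-in for the few Redis commands involved, so the example runs without a server:

    ```python
    import json

    class FakeRedis:
        """Hypothetical in-memory stand-in for the few Redis commands we need."""
        def __init__(self):
            self.strings = {}
            self.sets = {}
        def set(self, key, value):
            self.strings[key] = value
        def get(self, key):
            return self.strings.get(key)
        def sadd(self, key, member):
            self.sets.setdefault(key, set()).add(member)
        def srem(self, key, member):
            self.sets.get(key, set()).discard(member)

    def reindex_document(conn, docid, words):
        # Un-index: remove the document from every word it was previously
        # indexed under, recorded as a JSON-encoded word list.
        old = json.loads(conn.get('indexed:' + docid) or '[]')
        for word in old:
            conn.srem('idx:' + word, docid)
        # Index the new words and remember them for next time.
        for word in words:
            conn.sadd('idx:' + word, docid)
        conn.set('indexed:' + docid, json.dumps(sorted(words)))

    conn = FakeRedis()
    reindex_document(conn, 'docA', {'lord', 'rings'})
    reindex_document(conn, 'docA', {'lord', 'dance'})   # content changed
    print(conn.sets['idx:rings'])   # set() - docA was un-indexed from "rings"
    ```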

    Now that we have a way of generating inverted indexes for our knowledge base documents, let’s look at how we can search our documents.


    Searching the index for a single word is easy: we fetch the set of documents in that word’s SET. But what if we wanted to find documents that contained two or more words? We could fetch the SETs of documents for all of those words, and then try to find those documents that are in all of the SETs, but as we discussed in chapter 3, there are two commands in Redis that do this directly: SINTER and SINTERSTORE. Both commands will discover the items that are in all of the provided SETs and, for our example, will discover the SET of documents that contain all of the words.
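    The effect of SINTER on these word SETs can be previewed with Python’s own set intersection; the index entries here are invented for illustration:

    ```python
    # Toy inverted index entries, as Python sets standing in for Redis SETs.
    idx = {
        'lord':  {'docA', 'docB'},
        'rings': {'docA'},
        'dance': {'docB'},
    }

    # Equivalent of: SINTER idx:lord idx:rings
    result = idx['lord'] & idx['rings']
    print(result)   # {'docA'}
    ```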

    One of the amazing things about using inverted indexes with SET intersections is not so much what we can find (the documents we’re looking for), nor even how quickly it can find the results—it’s how much information the search completely ignores. When searching text the way a text editor does, a lot of useless data gets examined. But with inverted indexes, we already know which documents contain each individual word, so we only need to filter for the documents that contain all of the words we’re looking for.

    Sometimes we want to search for items with similar meanings and have them considered the same, which we call synonyms (at least in this context). To handle that situation, we could again fetch all of the document SETs for those words and find all of the unique documents, or we could use another built-in Redis operation: SUNION or SUNIONSTORE.

    Occasionally, there are times when we want to search for documents with certain words or phrases, but we want to remove documents that have other words. There are also Redis SET operations for that: SDIFF and SDIFFSTORE.
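    Both operations can be previewed the same way with Python sets (again with invented index entries): union gathers documents matching any synonym, and difference strips documents containing unwanted words:

    ```python
    # Toy inverted index entries, invented for illustration.
    idx = {
        'connect':    {'doc1', 'doc2'},
        'connection': {'doc3'},
        'proxy':      {'doc2'},
    }

    # Equivalent of: SUNION idx:connect idx:connection
    matches = idx['connect'] | idx['connection']

    # Equivalent of: SDIFF <matches> idx:proxy
    result = matches - idx['proxy']
    print(sorted(result))   # ['doc1', 'doc3']
    ```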

    With Redis SET operations and a bit of helper code, we can perform arbitrarily intricate word queries over our documents. Listing 7.2 provides a group of helper functions that will perform SET intersections, unions, and differences over the given words, storing them in temporary SETs with an expiration time that defaults to 30 seconds.

    Listing 7.2 SET intersection, union, and difference operation helpers

    import uuid

    def _set_common(conn, method, names, ttl=30, execute=True):
        # Create a new temporary identifier.
        id = str(uuid.uuid4())
        # Set up a transactional pipeline so that we have consistent
        # results for each individual call.
        pipeline = conn.pipeline(True) if execute else conn
        # Add the 'idx:' prefix to our terms.
        names = ['idx:' + name for name in names]
        # Set up the call for one of the operations.
        getattr(pipeline, method)('idx:' + id, *names)
        # Instruct Redis to expire the SET in the future.
        pipeline.expire('idx:' + id, ttl)
        if execute:
            # Actually execute the operation.
            pipeline.execute()
        # Return the ID for the caller to process the results.
        return id

    # Helper function to perform SET intersections.
    def intersect(conn, items, ttl=30, _execute=True):
        return _set_common(conn, 'sinterstore', items, ttl, _execute)

    # Helper function to perform SET unions.
    def union(conn, items, ttl=30, _execute=True):
        return _set_common(conn, 'sunionstore', items, ttl, _execute)

    # Helper function to perform SET differences.
    def difference(conn, items, ttl=30, _execute=True):
        return _set_common(conn, 'sdiffstore', items, ttl, _execute)


    Each of the intersect(), union(), and difference() functions calls another helper function that actually does all of the heavy lifting. This is because they all essentially do the same thing: set up the keys, make the appropriate SET call, update the expiration, and return the new SET’s ID. Another way of visualizing what happens when we perform the three different SET operations SINTER, SUNION, and SDIFF can be seen in figure 7.3, which shows the equivalent operations on Venn diagrams.

    This is everything necessary for programming the search engine, but what about parsing a search query?


    We almost have all of the pieces necessary to perform indexing and search. We have tokenization, indexing, and the basic functions for intersection, union, and differences. The remaining piece is to take a text query and turn it into a search operation. Like our earlier tokenization of documents, there are many ways to tokenize search queries. We’ll use a method that allows for searching for documents that contain all of the provided words, supporting both synonyms and unwanted words.

    A basic search will be about finding documents that contain all of the provided words. If we have just a list of words, that’s a simple intersect() call. In the case where we want to remove unwanted words, we’ll say that any word with a leading minus character (-) will be removed with difference(). To handle synonyms, we need a way of saying “This word is a synonym,” which we’ll denote by the use of the plus (+) character prefix on a word. If we see a plus character leading a word, it’s considered a synonym of the word that came just before it (skipping over any unwanted words), and we’ll group synonyms together to perform a union() prior to the higher-level intersect() call.

    Figure 7.3 The SET intersection, union, and difference calls as if they were operating on Venn diagrams

    Putting it all together, where + denotes synonyms and - denotes unwanted words, the next listing shows the code for parsing a query into a Python list of lists of words that should be considered the same, plus a list of words that are unwanted.

    Listing 7.3 A function for parsing a search query

    # Our regular expression for finding wanted, unwanted, and synonym words.
    QUERY_RE = re.compile("[+-]?[a-z']{2,}")

    def parse(query):
        # A unique set of unwanted words.
        unwanted = set()
        # Our final result of words that we're looking to intersect.
        all = []
        # The current unique set of words to consider as synonyms.
        current = set()

        # Iterate over all words in the search query.
        for match in QUERY_RE.finditer(query.lower()):
            # Discover +/- prefixes, if any.
            word = match.group()
            prefix = word[:1]
            if prefix in '+-':
                word = word[1:]
            else:
                prefix = None

            # Strip any leading or trailing single quotes, and skip
            # anything that's a stop word.
            word = word.strip("'")
            if len(word) < 2 or word in STOP_WORDS:
                continue

            # If the word is unwanted, add it to the unwanted set.
            if prefix == '-':
                unwanted.add(word)
                continue

            # Set up a new synonym set if we have no synonym prefix
            # and we already have words.
            if current and not prefix:
                all.append(list(current))
                current = set()
            # Add the current word to the current set.
            current.add(word)

        # Add any remaining words to the final intersection.
        if current:
            all.append(list(current))
        return all, list(unwanted)


    To give this parsing function a bit of exercise, let’s say that we wanted to search our knowledge base for chat connection issues. What we really want to search for is an article with connect, connection, disconnect, or disconnection, along with chat, but because we aren’t using a proxy, we want to skip any documents that include proxy or proxies. Here’s an example interaction that shows the query (formatted into groups for easier reading):

    >>> parse('''
    connect +connection +disconnect +disconnection
    chat
    -proxy -proxies''')
    ([['disconnection', 'connection', 'disconnect', 'connect'], ['chat']],
    ['proxies', 'proxy'])


    Our parse function properly extracted the synonyms for connect/disconnect, kept chat separate, and discovered our unwanted proxy and proxies. We aren’t going to be passing that parsed result around (except maybe for debugging as necessary); instead, we’ll call our parse() function as part of a parse_and_search() function that performs union()s over the individual synonym lists as necessary, intersect()s the final list, and removes the unwanted words with difference() as necessary. The full parse_and_search() function is shown in the next listing.

    Listing 7.4 A function to parse a query and search documents

    def parse_and_search(conn, query, ttl=30):
        # Parse the query.
        all, unwanted = parse(query)
        # If there are only stop words, we don't have a result.
        if not all:
            return None

        to_intersect = []
        # Iterate over each list of synonyms.
        for syn in all:
            if len(syn) > 1:
                # If the synonym list is more than one word long,
                # perform the union operation.
                to_intersect.append(union(conn, syn, ttl=ttl))
            else:
                # Otherwise, use the individual word directly.
                to_intersect.append(syn[0])

        if len(to_intersect) > 1:
            # If we have more than one word/result to intersect,
            # intersect them.
            intersect_result = intersect(conn, to_intersect, ttl=ttl)
        else:
            # Otherwise, use the individual word/result directly.
            intersect_result = to_intersect[0]

        # If we have any unwanted words, remove them from our earlier
        # result and return it.
        if unwanted:
            unwanted.insert(0, intersect_result)
            return difference(conn, unwanted, ttl=ttl)
        # Otherwise, return the intersection result.
        return intersect_result


    Like before, the final result will be the ID of a SET that includes the documents that match the parameters of our search. Assuming that Fake Garage Startup has properly indexed all of their documents using our earlier index_document() function, parse_and_search() will perform the requested search.
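    To see how the pieces fit together without a running Redis server, the sketch below replays a simplified version of the chat/proxy query against a tiny invented in-memory index, applying a union within each synonym group, an intersection across groups, and a difference for the unwanted word, mirroring what parse_and_search() asks Redis to do:

    ```python
    # Toy inverted index entries, invented for illustration.
    idx = {
        'connect':    {'doc1'},
        'disconnect': {'doc2'},
        'chat':       {'doc1', 'doc2', 'doc3'},
        'proxy':      {'doc2'},
    }

    all_groups = [['connect', 'disconnect'], ['chat']]   # shape of parse()'s first result
    unwanted = ['proxy']

    # Union within each synonym group (SUNIONSTORE), then intersect
    # across groups (SINTERSTORE).
    groups = [set().union(*(idx.get(w, set()) for w in syn)) for syn in all_groups]
    result = set.intersection(*groups)

    # Remove documents containing unwanted words (SDIFFSTORE).
    for word in unwanted:
        result -= idx.get(word, set())

    print(sorted(result))   # ['doc1']
    ```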

    We now have a method that’s able to search for documents with a given set of criteria. But ultimately, when there are a large number of documents, we want to see them in a specific order. To do that, we need to learn how to sort the results of our searches.

    ¹ Though SETs and ZSETs could be emulated with a properly structured table and unique index in a relational database, emulating a SET or ZSET intersection, union, or difference using SQL for more than a few terms is cumbersome.