9.2.1 HASHes

    One of the primary uses for HASHes is storing simple key/value pairs in a grouped fashion. Back in section 5.3, we developed a method of mapping IP addresses to locations around the world. In addition to a ZSET that mapped from IP addresses to city IDs, we used a single HASH that mapped from city IDs to information about the city itself. That HASH had more than 370,000 entries using the August 2012 version of the database, which we’ll now shard.

    To shard a HASH table, we need to choose a method of partitioning our data. Because HASHes themselves have keys, we can use those keys to decide which shard each entry belongs to. To partition the keys, we’ll generally calculate a hash function over the key itself to produce a number. Depending on how many keys we want to fit in a single shard and how many total keys we need to store, we’ll calculate the number of shards we need, and use that along with our hash value to determine which shard the data will be stored in.

    For numeric keys, we’ll assume that the keys will be more or less sequential and tightly packed, and will assign them to a shard ID based on their numeric key value (keeping numerically similar keys in the same shard). The next listing shows our function for calculating a new key for a sharded HASH, given the base key and the key to be stored in the HASH.

    Listing 9.7 A function to calculate a shard key from a base key and a secondary entry key
    import binascii                 # standard-library module that provides crc32()

    def shard_key(base, key, total_elements, shard_size):
    

    We’ll call the shard_key() function with a base HASH name, along with the key to be stored in the sharded HASH, the total number of expected elements, and the desired shard size.

       if isinstance(key, (int, long)) or key.isdigit():
    

    If the value is an integer or a string that looks like an integer, we’ll use it directly to calculate the shard ID.

          shard_id = int(str(key), 10) // shard_size
    

    For integers, we assume they are sequentially assigned IDs, so we can choose a shard ID based on the upper “bits” of the numeric ID itself. We also use an explicit base here (necessitating the str() call) so that a key of 010 turns into 10, and not 8.

       else:
    
     
          shards = 2 * total_elements // shard_size
    

    For non-integer keys, we first calculate the total number of shards desired, based on an expected total number of elements and desired shard size.

          shard_id = binascii.crc32(key) % shards
    

    When we know the number of shards, we compute the CRC32 checksum of the key and take it modulo the number of shards.

       return "%s:%s"%(base, shard_id)
    

    Finally, we combine the base key with the shard ID we calculated to determine the shard key.

    In our function, you’ll notice that for non-numeric keys we calculate a CRC32 checksum. We’re using CRC32 in this case because it returns a simple integer without additional work, is fast to calculate (much faster than MD5 or SHA1 hashes), and because it’ll work well enough for most situations.
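
    To get a feel for how shard_key() behaves, here’s a brief hedged example (not from the book’s listings) that assumes the function exactly as written above, a shard size of 1,024, and roughly 320,000 total elements (the same constants we’ll use later in listing 9.9):

     print(shard_key('cityid2city:', '12345', 320000, 1024))
     # The key looks numeric, so the shard ID is 12345 // 1024 = 12, and this
     # prints 'cityid2city::12' (a double colon, because the base passed in
     # already ends with ':').
     print(shard_key('cityid2city:', 'not-a-number', 320000, 1024))
     # The key isn't numeric, so there are 2 * 320000 // 1024 = 625 shards,
     # and the shard ID is binascii.crc32('not-a-number') % 625 (some value
     # between 0 and 624).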

    BEING CONSISTENT ABOUT total_elements AND shard_size

    When using non-numeric keys to shard on, you’ll notice that we use the total_elements value to calculate the total number of shards necessary, in addition to the shard_size that’s used for both numeric and non-numeric keys. These two pieces of information are necessary to keep the total number of shards down. If you were to change either of these numbers, the number of shards (and thus the shard that any piece of data goes to) would change. Whenever possible, you shouldn’t change either of these values; when you do change them, you should have a process for moving your data from the old shards to the new shards (this is generally known as resharding).
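
    As a concrete illustration of why those values need to stay fixed (a hedged sketch, not part of the book’s listings), consider what happens to a numeric key when shard_size changes:

     shard_key('cityid2city:', '5000', 320000, 1024)
     # With a shard size of 1024, key 5000 lands in shard 5000 // 1024 = 4,
     # giving 'cityid2city::4'.
     shard_key('cityid2city:', '5000', 320000, 512)
     # With a shard size of 512, the same key lands in shard 5000 // 512 = 9,
     # giving 'cityid2city::9', so readers using the new constants would look
     # in the wrong shard for data written with the old ones.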

    We’ll now use our shard_key() function to pick shards as part of two functions in the following listing that work like HSET and HGET on sharded HASHes.

    Listing 9.8 Sharded HSET and HGET functions
    def shard_hset(conn, base, key, value, total_elements, shard_size):
    
     
       shard = shard_key(base, key, total_elements, shard_size)
    

    Calculate the shard to store our value in.

       return conn.hset(shard, key, value)
    

    Set the value in the shard.

    def shard_hget(conn, base, key, total_elements, shard_size):
    
     
       shard = shard_key(base, key, total_elements, shard_size)
    

    Calculate the shard to fetch our value from.

       return conn.hget(shard, key)
    

    Get the value in the shard.

    Nothing too complicated there; we calculate the proper shard for the data and then either set or get the value in that shard. To update our earlier IP-address-to-city lookup calls, we only need to replace one call in each of two functions. The next listing shows just those parts of our earlier functions that need to be updated.

    Listing 9.9 Sharded IP lookup functions
    TOTAL_SIZE = 320000
    SHARD_SIZE = 1024
    

    We set the arguments for the sharded calls as global constants to ensure that we always pass the same information.

     
    def import_cities_to_redis(conn, filename):
       for row in csv.reader(open(filename)):
          ...
    
     
          shard_hset(conn, 'cityid2city:', city_id,
             json.dumps([city, region, country]),
             TOTAL_SIZE, SHARD_SIZE)
    

    To set the data, we need to pass the TOTAL_SIZE and SHARD_SIZE information, though in this case TOTAL_SIZE is unused because our IDs are numeric.

     
    def find_city_by_ip(conn, ip_address):
       ...
    
     
       data = shard_hget(conn, 'cityid2city:', city_id,
          TOTAL_SIZE, SHARD_SIZE)
    

    To fetch the data, we need to pass the same TOTAL_SIZE and SHARD_SIZE values, as we would for any sharded key.

       return json.loads(data)
    
     

    On a 64-bit machine, storing the single HASH of all of our cities takes up roughly 44 megabytes. But with these few small changes to shard our data, setting hash-max-ziplist-entries to 1024 and hash-max-ziplist-value to 256 (the longest city/country name in the list is a bit over 150 characters), the sharded HASHes together take up roughly 12 megabytes. That’s a 70% reduction in data size, which would allow us to store 3.5 times as much data as before. For shorter keys and values, you can potentially see even greater percentage savings (the per-entry overhead is larger relative to the actual data stored).
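
    If you’d rather apply those encoding settings at runtime than edit redis.conf, a minimal sketch using redis-py’s config_set() might look like the following (the option names are standard Redis configuration options; the local connection is assumed):

     import redis

     conn = redis.Redis()                                # assumes a Redis server on localhost
     conn.config_set('hash-max-ziplist-entries', 1024)   # HASHes with up to 1024 entries...
     conn.config_set('hash-max-ziplist-value', 256)      # ...and values up to 256 bytes keep
                                                         # the compact ziplist encoding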

    STORING STRINGS IN HASHES

    If you find yourself storing a lot of relatively short strings or numbers as plain STRING values with consistently named keys like namespace:id, you can store those values in sharded HASHes for significant memory reduction in some cases.
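
    For example, here’s a hedged sketch of what that substitution might look like for STRING keys named like user:12345, reusing shard_hset() and shard_hget() from listing 9.8 (the 'user:' namespace and the data variable are hypothetical, not from the book):

     # Instead of one STRING per key:
     #    conn.set('user:12345', data)
     #    data = conn.get('user:12345')
     # ...store the id portion as a field in a sharded HASH:
     shard_hset(conn, 'user:', '12345', data, TOTAL_SIZE, SHARD_SIZE)
     data = shard_hget(conn, 'user:', '12345', TOTAL_SIZE, SHARD_SIZE)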

    Exercise: Adding other operations

    As you saw, getting and setting values in a sharded HASH is easy. Can you add support for sharded HDEL, HINCRBY, and HINCRBYFLOAT operations?
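
    If you’d like a hint, the pattern is the same as in listing 9.8; one possible sketch (not the book’s solution) of sharded HDEL and HINCRBY follows, and a sharded HINCRBYFLOAT would look just like HINCRBY with conn.hincrbyfloat():

     def shard_hdel(conn, base, key, total_elements, shard_size):
        shard = shard_key(base, key, total_elements, shard_size)   # same shard calculation
        return conn.hdel(shard, key)                               # remove the field from its shard

     def shard_hincrby(conn, base, key, amount, total_elements, shard_size):
        shard = shard_key(base, key, total_elements, shard_size)
        return conn.hincrby(shard, key, amount)                    # increment within the shard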

    We’ve just finished sharding large HASHes to reduce their memory use. Next, you’ll learn how to shard SETs.