# Hash-Range Partitioning

## Better partitioning for distributed data

In distributed systems, it’s extraordinarily common to want to split a large dataset across some number of physical shards or partitions. This is commonly done by taking the key, hashing it, and then taking the hash modulo the number of partitions:

```
def partition(key, num_partitions):
    return hash(key) % num_partitions
```

Let’s call that *hash-modulus* partitioning for short. It’s simple, straightforward, and correct, and it’s used as the default partitioner in projects from Spark to Kafka to Hive. It’s also far from optimal, especially when you’re joining different data sets together.

Here’s a slightly different partitioning function, which I’ll call *hash-range* partitioning:

```
def partition(key, num_partitions):
    # hash_bits is the number of bits in hash()'s output
    return (hash(key) * num_partitions) >> hash_bits
```

This post explains how that function works and why you might prefer it.

## Why it works

So we can visualize things easily, let’s imagine we’re using a tiny 4-bit hash function. This gives us 2^4=16 different possible hash values, mapping to the integers from 0 to 15.

```
0                             15
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
```

Hash-modulus partitioning will take those 16 values and round-robin them across our partitions. If we have 3 partitions for our 16 possible hashes, we get:

```
0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
```

Hash-range partitioning, on the other hand, will split the hash space into ranges and assign a partition to each range. For our 3 partitions, that looks something like:

```
0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2
└───────────┴─────────┴─────────┘
```

Like hash-modulus partitioning, this divides the values “as fairly as possible” between partitions.
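
If you’d like to check these pictures yourself, here’s a minimal runnable sketch of both partitioners over the same toy 4-bit hash space. (The names `partition_modulus` and `partition_range` are mine, not from any particular framework; keys are assumed to already be hashed into 4-bit values.)

```
HASH_BITS = 4  # toy 4-bit hash space: values 0..15

def partition_modulus(h, num_partitions):
    return h % num_partitions

def partition_range(h, num_partitions):
    return (h * num_partitions) >> HASH_BITS

print([partition_modulus(h, 3) for h in range(16)])
# [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
print([partition_range(h, 3) for h in range(16)])
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```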

## Why it’s better

Almost every system that needs to partition data also needs to *repartition* data: taking an existing dataset with *n* partitions and turning it into a new dataset with *m* partitions. For example, Spark can only join two datasets with the same number of partitions… if you try to join two datasets with different partition counts, Spark will need to repartition at least one of them to make the counts match up before doing the join.

Let’s imagine we want to turn our 3-partition dataset into a 4-partition dataset, and compare the partition numbers hash-modulus partitioning assigns.

```
0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0
├─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┼─┤
0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
```

There’s absolutely no correlation here! Data from partition 0 in our 3-partition dataset is spread out over all 4 partitions in our new dataset.

Here’s how hash-range partitioning does it:

```
0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2
├───────┬───┴───┬─────┴─┬───────┤
0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
```

Here the correlation is quite strong: each of our output partitions maps to just one or two of our input partitions.

With hash-modulus partitioning, repartitioning is a “global” operation: each output partition depends on every input partition. With hash-range partitioning, repartitioning is “local” and has a much narrower set of dependencies. This can have really meaningful consequences for reliability and performance. For example, suppose we lose a machine that holds one of our output partitions. If we’re using hash-modulus partitioning, we’ll have to refetch the data from *all* our input partitions; with hash-range partitioning, we’ll only have to contact one or two.
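
You can measure this locality directly. Here’s a hedged sketch, again over a toy 4-bit hash space, that maps each output partition to the set of input partitions it reads from. (The helper name `deps` is my own invention.)

```
from collections import defaultdict

HASH_BITS = 4  # toy 4-bit hash space: values 0..15

def deps(partition, n_in, n_out):
    """For each output partition, the set of input partitions it reads."""
    d = defaultdict(set)
    for h in range(2 ** HASH_BITS):
        d[partition(h, n_out)].add(partition(h, n_in))
    return dict(d)

modulus = lambda h, n: h % n
hash_range = lambda h, n: (h * n) >> HASH_BITS

print(deps(modulus, 3, 4))     # every output partition reads all 3 inputs
print(deps(hash_range, 3, 4))  # each output partition reads only 1 or 2 inputs
```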

## What to do about it

Many data-processing tools, like Spark and Kafka, leave the partitioning algorithm configurable. Consider using hash-range partitioning for your data, even if you’re not going to take advantage of the easier repartitioning yet… changing the partitioning of some datasets (e.g. Kafka topics) can be a pain, so it’s easier to get it right up front. If you’re the *author* of some framework, consider making hash-range partitioning the default. (And please don’t hardcode hash-modulus partitioning!)

If you’re working on the JVM, note that typical `hashCode` implementations distribute data very badly, in a way that hash-modulus partitioning can help hide. If you’re adopting hash-range partitioning, you’ll want to make sure you’re using a halfway-decent hash algorithm as well. (Murmur3 is a good choice!)
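
To see why hash quality matters more for hash-range partitioning, here’s a sketch where the “hash” is just the identity function on small integers, which is exactly what Java’s `Integer.hashCode` does. (The 32-bit hash space and partition count of 4 are illustrative assumptions.)

```
HASH_BITS = 32  # a full 32-bit hash space, as on the JVM

def partition_modulus(h, n):
    return h % n

def partition_range(h, n):
    return (h * n) >> HASH_BITS

# Pretend hash(k) == k for small integer keys, which is exactly what
# Java's Integer.hashCode returns.
keys = range(100)
print({partition_modulus(k, 4) for k in keys})  # {0, 1, 2, 3}: evenly spread
print({partition_range(k, 4) for k in keys})    # {0}: everything piles into one partition
```

Hash-modulus partitioning accidentally rescues a clustered hash like this one; hash-range partitioning does not, so it needs a hash that actually spreads keys across the full space.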

To make your life easy, here’s a quick implementation in Java 8, which I hereby place in the public domain:

```
public static int partition(int hashCode, int numPartitions) {
    return (int) ((Integer.toUnsignedLong(hashCode) * numPartitions) >> 32);
}
```
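
One detail worth calling out: Java hash codes can be negative, which is why the implementation goes through `Integer.toUnsignedLong` before multiplying. Here’s a quick Python sketch of the same logic, with the unsigned reinterpretation done by masking:

```
def partition(hash_code, num_partitions):
    # Equivalent of Integer.toUnsignedLong: reinterpret the signed
    # 32-bit hash as an unsigned value before multiplying.
    unsigned = hash_code & 0xFFFFFFFF
    return (unsigned * num_partitions) >> 32

print(partition(0, 4))   # 0: the smallest hash lands in the first partition
print(partition(-1, 4))  # 3: hash 0xFFFFFFFF lands in the last partition
```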

If you end up finding this helpful, please let me know!