2 years ago

#53892

test-img

Lance

How to create a hashing function which buckets an array of integers into random buckets?

So I am doing something part theoretical and part practical, trying to find the balance of what is realistic. Currently I am thinking about a situation where we have 3 properties which specify a table for a record:

  1. org_id: organization grouping (int possible values == 2147483647 on postgres ~~ 2 billion)
  2. type_id: basic category of the record (int possible values)
  3. chunk_id: chunking of records into "int" chunks (~2 billion records per chunk)

The chunk_id would be a static/permanent value, but it maps to a dynamic "shard" url. Basically, the shard is the database instance which will house all records of a specific chunk, for a specific type and org. So given an org and type, if we are given a chunk, we know all those records in that chunk will be in a specific shard.

So you'd have a table, the shard_lookup table, which maps a (org_id, type_id, chunk_id) to a shard url/address. Pretty basic.

Anyways, the problem would arise if there were billions of orgs and billions of shards. There would be too many pieces of info to store in memory I think, too many records in the shard_lookup table to fit on one machine/instance. So you would have to subdivide the shard_lookup table into small chunks too! You would have to shard the shard lookup table :)

Well we can do something simpler for this case, since we have a hard limit on how many records we are allowing in the system, for now at least (another story). We can theoretically take the (org_id, type_id, chunk_id) tuple, and hash it to get an index, and then from that index we would be able to know what "shard lookup shard" to go to, which would contain the shard_lookup table that we know contains our tuple, and then we could look in that table to find the specific type table address. So hash -> shard lookup shard -> record shard.

So that's the back of the story. Question is, how can I take this tuple of 3 integers, and hash it to produce a random bucket? It's probably something simple, but I'm not seeing it. By random bucket, I mean each tuple always resolves to the same bucket, but the distribution is seemingly randomized.

Basically what is going in my head is something with modulo but I can't pinpoint what the solution would be yet. It seems something like:

if (org_id % 3 === 0 && type_id % 3 === 0 && chunk_id % 3 === 0) {
  return 3 // third bucket
} else if (org_id % 2 === 0 && type_id % 2 === 0 && chunk_id % 2 === 0) {
  return 2 // second bucket
} else {
  return 1 // first bucket
}

But really you'd want to test way more combinations than that, or else everything would fall in the 1 bucket pretty much. So how would you do this sort of thing in a dynamic sort of way?

Say we can tune the algorithm to having n possible buckets as well, so initially we start with just one bucket (everything is in one place), but later it is 2 buckets, then 3, etc.

algorithm

math

hash

integer

modulo

0 Answers

Your Answer

Accepted video resources