2 years ago
#53892

Lance
How to create a hashing function which buckets an array of integers into random buckets?
So I am doing something part theoretical and part practical, trying to find the balance of what is realistic. Currently I am thinking about a situation where we have 3 properties which specify a table for a record:
org_id
: organization grouping (int possible values ==2147483647
on postgres ~~ 2 billion)type_id
: basic category of the record (int possible values)chunk_id
: chunking of records into "int" chunks (~2 billion records per chunk)
The chunk_id
would be a static/permanent value, but it maps to a dynamic "shard" url. Basically, the shard is the database instance which will house all records of a specific chunk, for a specific type and org. So given an org and type, if we are given a chunk, we know all those records in that chunk will be in a specific shard.
So you'd have a table, the shard_lookup
table, which maps a (org_id
, type_id
, chunk_id
) to a shard url/address. Pretty basic.
Anyways, the problem would arise if there were billions of orgs and billions of shards. There would be too many pieces of info to store in memory I think, too many records in the shard_lookup
table to fit on one machine/instance. So you would have to subdivide the shard_lookup
table into small chunks too! You would have to shard the shard lookup table :)
Well we can do something simpler for this case, since we have a hard limit on how many records we are allowing in the system, for now at least (another story). We can theoretically take the (org_id
, type_id
, chunk_id
) tuple, and hash it to get an index, and then from that index we would be able to know what "shard lookup shard" to go to, which would contain the shard_lookup
table that we know contains our tuple, and then we could look in that table to find the specific type table address. So hash -> shard lookup shard -> record shard
.
So that's the back of the story. Question is, how can I take this tuple of 3 integers, and hash it to produce a random bucket? It's probably something simple, but I'm not seeing it. By random bucket, I mean each tuple always resolves to the same bucket, but the distribution is seemingly randomized.
Basically what is going in my head is something with modulo
but I can't pinpoint what the solution would be yet. It seems something like:
if (org_id % 3 === 0 && type_id % 3 === 0 && chunk_id % 3 === 0) {
return 3 // third bucket
} else if (org_id % 2 === 0 && type_id % 2 === 0 && chunk_id % 2 === 0) {
return 2 // second bucket
} else {
return 1 // first bucket
}
But really you'd want to test way more combinations than that, or else everything would fall in the 1 bucket pretty much. So how would you do this sort of thing in a dynamic sort of way?
Say we can tune the algorithm to having n possible buckets as well, so initially we start with just one bucket (everything is in one place), but later it is 2 buckets, then 3, etc.
algorithm
math
hash
integer
modulo
0 Answers
Your Answer