MSI Hashtable Lookup in Zig

1. Introduction

Chris Wellons has a neat blog post about a tiny hash table he likes to use. It's dripping with implicit C behaviours, and I wanted to see how simple it could look in Zig, which tends to be more verbose explicit.

The core of the table is the index lookup function, which generates the next index you should check for your value.

int32_t ht_lookup(uint64_t hash, int exp, int32_t idx)
{
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (64 - exp)) | 1;
    return (idx + step) & mask;
}

// The caller should initialise `idx` to the bottom 32 bits of `hash`. e.g:
int32_t idx = hash;
idx = ht_lookup(hash, exp, idx);

Exactly why this is doing what it's doing is not the focus of this post, for that see Chris' original article 1 , instead we'll be breaking down the behaviour of the function, both explicit and implicit, so that we can write it in Zig. This is not a flashy post, just going line-by-line and seeing how to rewrite it.

2. The Cast

int32_t idx = hash;

Here, we assign the signed 32 bit idx to the unsigned 64 bit hash. In C, if the actual numerical value of hash would fit an int32_t, it should just become that value. If it doesn't fit, then it's behaviour is "implementation defined".

Now, an ahead-of-time compiled C program cannot actually know what the value is, so instead they (mostly) do a cute trick. If you truncate hash, taking the low 32 bits, then that preserves the value if it would fit, and produces a nice and useful action in any case.

Some C programmers then rely on this behaviour, and hope their code isn't ran on an implementation that does something else.

3. The Mask

uint32_t mask = ((uint32_t)1 << exp) - 1;

Nothing weird here, but exp needs to be in the range [0, 31] (non-negative, and less than the width of a uint32_t), otherwise the behaviour is undefined.

4. The Shift

uint32_t step = (hash >> (64 - exp)) | 1;

First, to avoid undefined behaviour, exp must be such that 64 - exp is in the range [0, 63], so that means exp must (in addition to the restriction above be [1,31].

Second, the result of the expression, which is a uint64_t, gets assigned to a uint32_t. What happens here is well defined, it truncates and takes the bottom 32 bits of the expression.

5. The Return

return (idx + step) & mask;
  1. idx is promoted to a uint32_t, which is done by reinterpreting the bits2
  2. Adding idx and step could produce a value greather than a uint32_t can hold, in which case the addition wraps around3.
  3. Finally the result is cast to an int32_t due to the return type, which again is "implementation defined", but in practice the bits are just reinterpreted.

6. Converting to Zig

Now let's try and express this as literally as we can in Zig, and let the compiler guide us bit by bit. It's going to get a bit ugly, so at the end I'll show you what I think is a much clearer way to write it.

For our first attempt, let's try writing the first thing we can think of.

fn ht_lookup(hash: u64, exp: i32, idx: i32) i32 {
    const mask: u32 = (@as(u32, 1) << exp) - 1;
    const step: u32 = (hash >> (64 - exp)) | 1;
    return (idx + step) & mask;
}

Even if you don't know Zig, I think it's clear how this maps onto the original function. The only trouble is, it doesn't compile.

hashtable.zig:36:39: error: expected type 'u5', found 'i32'
    const mask: u32 = (@as(u32, 1) << exp) - 1;
                                      ^~~
hashtable.zig:36:39: note: unsigned 5-bit int cannot represent all possible signed 32-bit values

Zig has arbitrary bit-width integers, which is what that u5 type it's requesting is. The reason it wants a u5 is because Zig demands we use a type that cannot result in an invalid shift, in Zig this can be expressed by the type system, so they do. We could explicitly cast it with @intCast, but a better solution is to actually just take a u5.

fn ht_lookup(hash: u64, exp: u5, idx: i32) i32 {
    const mask: u32 = (@as(u32, 1) << exp) - 1;
    const step: u32 = (hash >> (64 - exp)) | 1;
    return (idx + step) & mask;
}

And compiling this we get a new error, we've moved down a line, progress!

hashtable.zig:60:33: error: type 'u5' cannot represent integer value '64'
    const step: u32 = (hash >> (64 - exp)) | 1;

Now that we've made exp a u5 we've earned ourselves some more trouble. The 64 in (64 - exp) has taken the type of exp, within which it doesn't fit.

The smallest type that 64 will fit in is a u7, so let's cast to that. And since we know from the previous error that the RHS of a shift needs to match the width of the LHS, we need to cast the whole thing back to a u6.

fn ht_lookup(hash: u64, exp: u5, idx: i32) i32 {
    const mask: u32 = (@as(u32, 1) << exp) - 1;
    const step: u32 = (hash >> @intCast(@as(u7, 64) - exp)) | 1;
    return (idx + step) & mask;
}

You might notice we have @intCast there instead of the @as we've used previously. This is because the result of the hash - exp expression could be bigger than would fit in a u6 (if exp was 0). @intCast() performs the cast, and in safe modes will panic if the value doesn't fit in the new type, by using it we're saying "We know exp won't be 0!"

And with this we move to the next error.

hashtable.zig:152:61: error: expected type 'u32', found 'u64'
    const step: u32 = (hash >> @intCast(@as(u7, 64) - exp)) | 1;
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~
hashtable.zig:152:61: note: unsigned 32-bit int cannot represent all possible unsigned 64-bit values

Right, shifting the 64bit hash value gives us a u64, but we're saying step should be a u32. This truncation we get "for free" in C, but as usual Zig requires us to explicitly opt into the truncation with @truncate

fn ht_lookup(hash: u64, exp: u5, idx: i32) i32 {
    const mask: u32 = (@as(u32, 1) << exp) - 1;
    const step: u32 = @truncate((hash >> @intCast(@as(u7, 64) - exp)) | 1);
    return (idx + step) & mask;
}

And next error please!

hashtable.zig:88:17: error: incompatible types: 'i32' and 'u32'
    return (idx + step) & mask;

Zig is very particular about what types can be added together4, so we'll have to cast one them. In the original, the bits of idx are reinterpreted as unsigned, we can do this in Zig with @bitCast5 , specifying what type we want to cast to with @as. And since the original asked for it to be returned as a 32 bit signed integer, we do the same with the actual final expression

fn ht_lookup(hash: u64, exp: u5, idx: i32) i32 {
    const mask: u32 = (@as(u32, 1) << exp) - 1;
    const step: u32 = @truncate((hash >> @as(u6, @intCast(@as(u7, 64) - exp))) | 1);
    return @as(i32, @bitCast((@as(u32, @bitCast(idx)) + step) & mask));
}

This compiles, but it's not quite right. Addition in Zig is checked for overflow in safe release modes, so the final addition of idx and step could panic. We can explicitly opt-into wrapping (which is what we want) with the wrapping addition operator +% Which gives us our final final listing.

fn ht_lookup(hash: u64, exp: u5, idx: i32) i32 {
    const mask: u32 = (@as(u32, 1) << exp) - 1;
    const step: u32 = @truncate((hash >> @as(u6, @intCast(@as(u7, 64) - exp))) | 1);
    return @as(i32, @bitCast((@as(u32, @bitCast(idx)) +% step) & mask));
}

This is quite a mess, but we can do a lot better by changing the types around. Things were signed because of the original author's predelection for signed subscripts, and C's willigness to do the "right thing" implicitly. We can be a bit easier on ourselves by changing idx and the return type to be unsigned.

fn ht_lookup(hash: u64, exp: u5, idx: u32) u32 {
    const mask = (@as(u32, 1) << exp) - 1;

    const shift = 63 - @as(u6, (exp - 1));
    const step: u32 = @truncate((hash >> shift) | 1);

    return (idx +% step) & mask;
}

I added an intermediate variable, and rejigged the calculation of the shift amount. exp still needs to be in a specific range for it to all work, but unless Zig gains arbitrary integer ranges6 this is the best we can do. It's still much more verbose than the original C, but I think it's an okay price to pay for the increased explicitness.

Think you can do better? Shoot me an email! mailto:joshua@joshuao.com

Footnotes:

1

And, if you're anything like me, spend a few hours reading Wikipedia and scratching your head!

2

The actual definition of this conversion is arithmetical, but as long as your implementation reprsents signed integers using twos complement, then a reinterpretation does the same thing. The C23 standard mandates twos complement representation, before that it was just so exceedingly common that people take it for granted

3

People often call this "overflow", but strictly for "standard C" with unsigned integers this is the wrong term. The standard states that computations on unsigned integers can "never overflow" specifically because they have this wrapping behaviour. Indeed, on many CPUs the OF flag will be set for such an addition, but even the definition of that flag makes it clear it is only relevant for signed addition, for example the Intel Reference Manual says:

Set if the integer result is too large a positive number or too small a negative number (excluding the sign-bit) to fit in the destination operand; cleared otherwise. This flag indicates an overflow condition for signed-integer (two’s complement) arithmetic

4

This is very arguably too picky, as in fact there are fewer "valid" additions of u32 and u32 than there are of u32 and i32, Zig might rethink these in the future.

5

Zig's signed ints, like C23's are defined as twos complement BTW.