Hashing versus JudyL
The figures below show JudyL compared to a commonly used hashing
algorithm. The hash algorithm is a simple static table sized to be a power
of two with a linear (linked list) collision chain. The table is a fixed size
of 1,048,576 buckets (2^20). Each bucket is a pointer to a chain. Each
chain element is 3 words (key, value, next pointer). The key, value and
next pointer are all 64-bit quantities; therefore, the hash memory used
should asymptotically approach 24 bytes per inserted data element.
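In outline, the data structures look like the following C sketch. The names are illustrative; the original benchmark code is not reproduced here.

    #include <stdint.h>

    #define TABLE_SIZE (1UL << 20)      /* 1,048,576 buckets, a power of two */

    typedef struct Element {
        uint64_t        key;            /* 64-bit key                        */
        uint64_t        value;          /* 64-bit value (usage count)        */
        struct Element *next;           /* next element in collision chain   */
    } Element;                          /* 3 words = 24 bytes per element    */

    static Element *table[TABLE_SIZE];  /* each bucket heads a chain         */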
To insert into the hash table, the key is hashed and then ANDed with
(table size - 1) to find a bucket. It is also common to size a hash table
to a prime number and use the mod function to compute a table index;
however, mod is very slow (~1 microsecond on the system used for these
measurements), so we used the power-of-two table method instead. Chains
are linearly searched for a match. If there is no match, a new chain
element is created. Finally, the data value is incremented; the data value
becomes a usage count used to verify the number of unique keys inserted.
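Continuing from the structures sketched above, the insert path looks
roughly like this; hash64() stands in for whatever mixing function the
benchmark used, which is an assumption, not the original code.

    #include <stdlib.h>

    static uint64_t hash64(uint64_t key)    /* an assumed mixing function */
    {
        key ^= key >> 33;
        key *= 0xff51afd7ed558ccdULL;
        key ^= key >> 33;
        return key;
    }

    static void insert(uint64_t key)
    {
        /* Power-of-two table: AND with (size - 1) instead of a slow mod. */
        Element **bucket = &table[hash64(key) & (TABLE_SIZE - 1)];
        Element  *e;

        /* Linear search of the collision chain for a match. */
        for (e = *bucket; e != NULL; e = e->next)
            if (e->key == key)
                break;

        if (e == NULL) {                /* no match: create a new element */
            e = malloc(sizeof(*e));     /* (error handling omitted)       */
            e->key   = key;
            e->value = 0;
            e->next  = *bucket;         /* link at the head of the chain  */
            *bucket  = e;
        }
        e->value++;                     /* value becomes a usage count    */
    }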
The hash table was seriously overloaded, by a factor of 40:1. Overloading
is very undesirable but all too common with hashing. The resulting
collision chains averaged 39.15 elements in length, close to the roughly
40 expected from a uniform hash at this load, showing that our hash
distribution was very good.
The hash table creation time is not shown in the data. Judy table
creation is an integral part of JudyLIns since a Judy array starts at zero
length and grows dynamically.
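For comparison, the equivalent JudyL insert-and-count loop needs no
sizing or setup step; this minimal sketch uses the JLI() convenience
macro (JudyLIns()) from <Judy.h>, with illustrative variable names.

    #include <Judy.h>

    Pvoid_t PJLArray = (Pvoid_t) NULL;  /* a JudyL array starts empty */

    void count_key(Word_t Key)
    {
        Word_t *PValue;

        JLI(PValue, PJLArray, Key);     /* insert Key, or find it if present */
        (*PValue)++;                    /* value serves as a usage count     */
    }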
The hashing code for this benchmark was all inlined. This is common
with hashing but not a practical option with Judy. We used mallopt(3C)
to speed up the allocation of each hash element. Measurements were done
on an HP 9000 J5000 workstation (440 MHz, 2 GB of memory) running
HP-UX 11i. The raw data is shown below.
NOTE The data was collected using random numbers as keys. Random keys
provide the best-case data for hashing and the worst-case data for Judy.
Judy's performance and memory usage will both improve with any data
clustering.