A while ago, I found myself quite confused by exactly how hash maps can claim to have such excellent performance.
I’ve since, to my knowledge, cleared up my understanding, so I thought I’d write up the misunderstanding I had and how I resolved it.
The one-sentence TL;DR of the entire post is this: The asymptotic cost bounds for hash map performance assume the sizes of all keys are bounded by some fixed constant.
A hash map is a data structure that maps keys to values. Its most important operations are:

- insert a value for a given key;
- get the value associated with a given key, if any;
- remove a key and its associated value.
For instance, in TypeScript:
// create a new, empty map
let m = new Map<string, number>();
// insert
m.set("foo", 150);
// get
let num = m.get("foo");
console.log(num); // ==> 150
// remove
m.delete("foo");
Roughly speaking, a hash map has two key components:

- a hash function, which maps a key to a number;
- a backing array, which stores the key-value pairs.
All of the major operations on a hash map involve a given key. Given that key, a hash map implementation will:

1. apply the hash function to the key to produce a hash value;
2. use that hash value to compute an index into the backing array;
3. perform the requested operation (insert, get, or remove) at that index.
There’s a lot of detail I’m glossing over here, but this is the general idea.
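To make this concrete, here is a minimal sketch of such a hash map in TypeScript. It is not how real implementations work in detail (there is no resizing, and the hash function is a toy), but it shows the "hash the key, index into the array, then search the bucket" structure. The class name ToyHashMap and all of its internals are invented for illustration.

```ts
// A deliberately simplified hash map: string keys, number values.
// Real implementations also handle resizing, load factors, better
// hash functions, and so on.
class ToyHashMap {
  // The backing array. Each slot ("bucket") holds the entries whose
  // hashes land on that index.
  private buckets: Array<Array<[string, number]>>;

  constructor(capacity: number = 16) {
    this.buckets = Array.from(
      { length: capacity },
      (): Array<[string, number]> => []
    );
  }

  // A toy hash function that mixes in every character of the key.
  private hash(key: string): number {
    let h = 0;
    for (let i = 0; i < key.length; i++) {
      h = (h * 31 + key.charCodeAt(i)) >>> 0;
    }
    return h;
  }

  set(key: string, value: number): void {
    const bucket = this.buckets[this.hash(key) % this.buckets.length];
    for (const entry of bucket) {
      if (entry[0] === key) {
        entry[1] = value; // key already present: overwrite the value
        return;
      }
    }
    bucket.push([key, value]);
  }

  get(key: string): number | undefined {
    const bucket = this.buckets[this.hash(key) % this.buckets.length];
    for (const [k, v] of bucket) {
      if (k === key) return v;
    }
    return undefined;
  }

  delete(key: string): void {
    const bucket = this.buckets[this.hash(key) % this.buckets.length];
    const i = bucket.findIndex(([k]) => k === key);
    if (i !== -1) bucket.splice(i, 1);
  }
}
```

The average-case bounds discussed below depend on the buckets staying short, which real implementations ensure by resizing the backing array as it fills up.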
The asymptotic cost bounds for the performance of the various hash map operations are quite impressive.
A hash map may execute a single one of the three core operations (insert, get, or remove):

- in O(1) (constant) time on average;
- in O(n) (linear) time in the worst case.
Indeed, hash maps are quite a ubiquitous and important data structure because of this. In practice, other data structures, even those with asymptotically better worst-case performance, such as search trees, are often eschewed for the all-powerful hash map.
This is partly due to cache locality (or the lack thereof) on modern CPUs. Search trees are often implemented with each node as a separate memory allocation, which leads to lots of pointer chasing.
But in any case, what confused me was the assertion that the hash map operations could possibly run in constant time at all. Specifically, I was concerned about how much time it would take to apply the hash function to a given key.
If the keys are, for example, strings, then any decent hash function for strings would have to traverse the entire string, incorporating information from each byte of the string to calculate the hash value for that string. This is an O(n) operation.
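For concreteness, here is one such hash function, a sketch in the spirit of the well-known FNV-1a hash (applied to UTF-16 code units rather than raw bytes). The exact constants are not the point; the point is that the loop runs once per character of the key.

```ts
// An FNV-1a-style hash over a string's UTF-16 code units.
// The loop body runs once per character, so the cost grows linearly
// with the length of the key, independent of any hash map.
function hashString(key: string): number {
  let h = 0x811c9dc5; // 32-bit FNV offset basis
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return h >>> 0; // interpret as an unsigned 32-bit integer
}
```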
How, then, can the average case for hash map operations be O(1) and only the worst case be O(n), given that we must always hash the key?
To begin to answer this question, we should first be more precise about what we mean by n.
Computer scientists and software engineers usually use the variable n to talk about various common cost bounds. From lowest to highest, some common cost bounds are:
Name | Big-O |
---|---|
Constant | O(1) |
Logarithmic | O(log n) |
Linear | O(n) |
Quadratic | O(n²) |
Cubic | O(n³) |
Exponential | O(2ⁿ) |
It’s not wrong to say that traversing every byte of a string is an O(n) operation. However, the key is that n here refers specifically to the number of bytes in that string.
By contrast, when we say that the common hash map operations each have worst-case performance O(n), the n here refers to the number of key-value pairs in the map.
So, we have two completely different things:

- the number of bytes in a given key;
- the number of key-value pairs in the map.
But both are, colloquially, “n”, since we often use n as much as possible when expressing the “sizes” of things.
Thus, I conflated the two in my mind, and got confused.
We are now ready to answer the question of how the hash map operations can execute in constant time on average.
The answer is, actually, the TL;DR at the top of this post, which I will include again here:
The asymptotic cost bounds for hash map performance assume the sizes of all keys are bounded by some fixed constant.
The way I like to think of this is that there is an implicit third component of a hash map, in addition to the hash function and backing array:

- a fixed constant that bounds the size of every key.
This implicit component is an invariant about the hash map. The invariant must be upheld in order for us to say that we may hash an arbitrary key in constant time.
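As a hypothetical illustration of the invariant: if every key we ever insert is at most, say, 64 characters long, then hashing any key takes at most 64 loop iterations, a constant that does not depend on how many entries the map holds. The MAX_KEY_LENGTH constant and setBounded helper below are invented for illustration; real hash map APIs do not ask you to declare such a bound, they simply assume one when stating their cost bounds.

```ts
// Hypothetical: make the "keys have bounded size" invariant explicit.
// Hashing a key of length <= MAX_KEY_LENGTH takes a bounded number of
// steps, i.e. O(1) with respect to the number of entries in the map.
const MAX_KEY_LENGTH = 64;

function setBounded(m: Map<string, number>, key: string, value: number): void {
  if (key.length > MAX_KEY_LENGTH) {
    throw new Error(`key longer than ${MAX_KEY_LENGTH} characters`);
  }
  m.set(key, value);
}
```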
This concept, in which we enforce a bound on the size of the keys to be able to achieve good performance, comes up in other computer science contexts.
One context in which this came up for me was implementing a dynamic programming algorithm. I wished to construct a DP table keyed on strings. However, a TA pointed out that since there was no constant bound on the size of the string keys I would be using, I could not claim that accessing the DP table would be a constant-time operation. I thus had to restructure my approach.
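I won't reproduce that assignment here, but the general shape of such a fix is common enough to sketch: instead of keying the table on substrings (whose lengths grow with the input), key it on small integer indices into the one input string. The sketch below is hypothetical and not the exact code from that course.

```ts
// Hypothetical sketch of the restructuring. A table keyed on
// substrings must hash arbitrarily long strings on every access:
//   const table = new Map<string, number>();
//   table.get(s.slice(i, j));
// Keying on the index pair (i, j) instead turns each access into a
// constant-time array lookup, with no hashing at all.
function makeTable(n: number): number[][] {
  // table[i][j] will hold the DP value for the substring s.slice(i, j).
  return Array.from({ length: n + 1 }, () => new Array<number>(n + 1).fill(0));
}
```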