Monday, June 9, 2014

Back To The Basics: The Hash Table

A few weeks ago I began a series entitled Back To The Basics. This series is intended to go over various software data structures that are helpful to understand as a software engineer. I started the series with the Binary Tree. The next data structure I'd like to take a look at in this series is the Hash Table.

Hash Table's are very important data structures to understand. At their most basic level they allow for associative array functionality with near constant time addition, deletion, and retrieval. The ability for a Hash Table to provide near constant time operations depends on how good the hashing algorithm is. The goal of the hashing algorithm is to avoid collisions where two different values create the same hash. When a collision is encountered the Hash Table strays farther away from being able to provide constant time operations.

The ability to provide near constant time operations makes the Hash Table ideal for use as a cache or an index as items can be inserted and retrieved with very little overhead.

While most modern programming languages and frameworks provide several optimized implementations of the Hash Table, let's take a look at what it would take to build a Hash Table from scratch which can handle adding, removing, and retrieving values.

Let's first start by defining our HashTable class. The most basic definition of a Hash Table will define the underlying array which will store our values and our constructor will initialize the store. The important thing to pay attention to here is the size of our initial array. We're going to initialize it to a size using a prime number because it will help us evenly distribute values within our backing store. The reason for this will become more apparent when we create our hashing function.

public class HashTable {
private Object[] values;
private final int tableSize = 1543;

public HashTable() {
this.values = new Object[tableSize];
}
}

Now that we have our initial class definition let's start by creating a method which returns the index we should use to store the value for a given key.

private int getKeyIndex(String key) {
    return key.hashCode() % this.values.length;
}

The getKeyIndex method is pretty simple. It first gets a hash code for the key and then returns an index location based on modding the hash code with the length of the array that holds our values. The reason we return a mod is so that we can make sure that the index returned is within the bounds of our array. And because we've used a prime number for our array length we can get better distribution of hashes in our array so that they're not clustered together.

Right now the getKeyIndex method is relatively boring and hides the magic of the actual hashing algorithm because our keys are strings and we're able to rely on the String classes hashCode method. But what if there was no hashCode method on String? How would we go about writing our own?

A good hashing algorithm will try to provide uniform distribution of values in order to avoid collisions which will increase the cost of the operation. The cost is increased because for every collision you have to resolve it somehow. The work it takes to resolve the collision is relative to the number of collisions for that key and the algorithm used to resolve the collisions. One way to resolve the collisions is to store the values in a Linked List and iterate through the list till we find the item with our hash code. This will cause our resolve algorithm to have an O(n) cost where n is the number of collisions for that key.

Now let's build our string hash algorithm. Because strings are made up of individual characters and those characters have an integer value we could create a hashing function that just sums the character values. For example:

private int hash(String key) {
    int hash = 0;
for(int index=0; index < key.length(); index++) {
hash += key.charAt(index);
}
    return hash;
}

This has a problem in that it's likely to cause collisions. For example let's say we wanted to hash the string "dog". We'd start our hash with an initial value of 0. Then we'd iterate through the characters in the string summing them as we go. For dog this would produce a hash of 314 ((d)100 + (o)111 + (g)103). Now let's say we wanted to create a hash of the string "cav". This would also create a hash of 314 ((c)99 + (a)97 + (v)118). This is a collision, which is something we want to avoid.

The good thing is that there's a pretty simple way to help avoid these collisions. If we change our algorithm slightly to include the use of a constant that the character value can be multiplied against then we can avoid these collisions. Notice that I multiply the constant by the current hash value during each iteration. This helps us make sure the hash is unique because the value used for a character is different depending on it's position within the string. This helps us decrease our likelihood of a collision.
  
private int hashCode(String key) {
    int constant = 31;
    int hash = 0;
for(int index=0; index < key.length(); index++) {
hash +=  (constant * hash) + key.charAt(index);
}
    return hash;
}

With our new algorithm dog will hash to 99644 and cav will hash to 98264. Because we've added a constant that is multiplied against the current hash value at each iteration we've increased the likelihood that the resulting hash code is unique.

Note: The above algorithm is loosely based on the hashCode method in the OpenJDK's String class. If you're curious as to why the constant value of 31 was chosen check out Effective Java by Joshua Bloch. This Java Bytes blog post goes into even more depth for the ultra-curious


No comments:

Post a Comment