
Monday, June 16, 2014

Back To The Basics: The Hash Table CRUD Operations

In my previous post Back To The Basics: The Hash Table I described the Hash Table data structure, provided a basic template for the class in Java, and walked through the hashing algorithm. In this post we'll take a look at the CRUD operations. Really we'll only talk about addition, retrieval, and deletion, because adding and updating in a Hash Table are the same operation.

Let's take a look at the add, get, and remove operations first in their simplest form.

private void add(String key, Object value) {
    int index = this.getKeyIndex(key);
    values[index] = value;
}

private Object get(String key) {
    int index = this.getKeyIndex(key);
    return values[index];
}

private Object remove(String key) {
    int index = this.getKeyIndex(key);
    Object value = values[index];
    values[index] = null;
    return value;
}

Looking at these three operations we can now see why a Hash Table in practice provides near constant time addition and retrieval operations. The only non-constant time operation performed is the generation of the hash code. But this operation can be optimized and its resulting value cached (if necessary), effectively making the add and get operations approach O(1).

What you may notice by looking at these operations is that they don't handle collisions. The add operation simply overwrites the existing value associated with the hash regardless of whether it's actually associated with the key. The get operation returns whatever value is stored at the index generated by the getKeyIndex method. And, like the add operation, the remove operation removes whatever is associated with the hash.

We can try to optimize our hashing algorithms as much as we want, but we'll never truly be able to avoid collisions 100% of the time. The reason is that we'd have to know all the possible items that can be hashed in order to create a truly perfect hashing algorithm. Since this is not really possible for a general purpose Hash Table, it's important that we add some code that can handle collisions when hashing the key.

The most straightforward way to handle hash collisions is to store the values at each index in a Linked List. Instead of setting the value associated with a particular key as the value in the backing array, the value will be set to a Linked List which contains all the key/value pairs that map to a particular hash. This will require us to add a little complexity to our Hash Table, but will allow us to avoid putting our Hash Table into an unexpected state when a collision occurs.

The first thing we need to do to handle collisions is to define an object that can represent a key/value pair that gets stored in the Linked List for each slot. Since we're using Java, we're going to create a class that implements Map.Entry<K,V>.

final class HashEntry<K,V> implements Map.Entry<K,V> {
    private final K key;
    private V value;

    public HashEntry(K key, V value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public K getKey() { return this.key; }

    @Override
    public V getValue() { return this.value; }

    @Override
    public V setValue(V newValue) {
        V oldValue = this.value;
        this.value = newValue;
        return oldValue;
    }
}
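The setValue method hands back the value that was previously stored, which comes in handy when updating an existing entry. A quick, hypothetical example of the class in use:

HashEntry<String, Object> entry = new HashEntry<String, Object>("dog", "Rex");
Object previous = entry.setValue("Fido");   // previous is "Rex", entry.getValue() is now "Fido"
System.out.println(entry.getKey());         // prints dog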

The next thing we need to do is update our Hash Table class definition to reflect the fact that each slot of our backing array now holds a Linked List of HashEntry key/value pairs instead of a single object. Our new class definition looks like:

public class HashTable {
    private LinkedList<HashEntry<String,Object>>[] values;
    private final int tableSize = 1543;

    public HashTable() {
        this.values = (LinkedList<HashEntry<String,Object>>[])new LinkedList[tableSize];
    }
}

Now we can update our add, get, and remove methods to support using a Linked List as the backing array. We're going to start with the get method because we'll define a helper method that the addition operation can also take advantage of.

Our new get method is fundamentally the same as our previous get method, with the only exception being that we've pushed the actual retrieval logic into its own method named findEntry. First it tries to find the item. If it finds it, it returns its value; otherwise it returns null. The new get method looks like:

private Object get(String key) {
    int index = this.getKeyIndex(key);
    HashEntry<String, Object> entry = this.findEntry(values[index], key);
    if(entry != null)
        return entry.getValue();
    return null;
}

The findEntry method simply iterates through the linked list attempting to find an existing item with the given key. If it finds one it returns it, otherwise it returns null:

private HashEntry<String, Object> findEntry(LinkedList<HashEntry<String, Object>> list, String key) {
    if(list == null)
        return null;

    Iterator<HashEntry<String, Object>> iterator = list.iterator();
    while(iterator.hasNext()) {
        HashEntry<String, Object> entry = iterator.next();
        if(entry.getKey().equals(key))
            return entry;
    }
    return null;
}

Our new add method is also the same as our previous add method with the exception of two things. First, we've added logic to create the Linked List if it doesn't already exist. Second, we've pushed the actual addition logic into its own addValueToList method. This allows the add method to maintain a single responsibility. The new add method looks like:

private void add(String key, Object value) {
    int index = this.getKeyIndex(key);
    LinkedList<HashEntry<String,Object>> list = this.values[index];
    if(list == null) {
        list = new LinkedList<HashEntry<String, Object>>();
        this.values[index] = list;
    }
    this.addValueToList(list, key, value);
}

The addValueToList method is not very complicated either. It tries to get a handle to the existing entry using the findEntry method we've already defined. If it does not find an entry it adds the new key/value pair to the list; otherwise it updates the existing entry's value. Here's what the addValueToList method looks like:

private void addValueToList(LinkedList<HashEntry<String, Object>> list, String key, Object value) {
    HashEntry<String, Object> entry = this.findEntry(list, key);
    if(entry == null) {
        list.add(new HashEntry<String, Object>(key, value));
    } else {
        entry.setValue(value);
    }
}

Finally, our new remove method also benefits from our findEntry method. It first tries to find the existing entry. If it finds it, it removes it from the list and returns the entry's value. Otherwise it returns null. The new remove method looks like:

private Object remove(String key) {
    int index = this.getKeyIndex(key);
    LinkedList<HashEntry<String, Object>> list = values[index];
    HashEntry<String, Object> entry = this.findEntry(list, key);

    if(list != null && entry != null) {
        list.remove(entry);
        return entry.getValue();
    }

    return null;
}
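To see the collision-handling versions in action, here's a brief usage sketch. It assumes the add, get, and remove methods are exposed publicly (the snippets above declare them private) and that the pieces shown in this post are assembled into a single HashTable class:

HashTable table = new HashTable();

table.add("dog", "Rex");                 // insert a new key/value pair
table.add("dog", "Fido");                // same key, so the existing entry's value is updated
System.out.println(table.get("dog"));    // prints Fido
System.out.println(table.get("cat"));    // prints null since the key was never added

Object removed = table.remove("dog");    // removes the entry and returns Fido
System.out.println(table.get("dog"));    // prints null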

Wrap Up

There are a few additional things to consider when creating or using Hash Tables. Concurrency can become an issue when the Hash Table is used in a multi-threaded environment; if it is, you'll need to make sure that the implementation you use is thread safe.
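For example, in the JDK one option is to skip hand-rolled locking entirely and use a thread-safe map implementation that already exists:

java.util.Map<String, Object> sharedTable = new java.util.concurrent.ConcurrentHashMap<String, Object>();
sharedTable.put("dog", "Rex");   // safe to call from multiple threads concurrently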

Another thing to consider when creating or using Hash Tables is that because the Hash Table relies on a fixed size array as its backing object store, it may eventually need to grow as more items are added. The cost of performing a resize operation is typically O(N) as each item in the existing array needs to be copied to a new, larger array. This can be done all at once or incrementally.
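Here's a minimal sketch of what an all-at-once resize might look like for the chained Hash Table we built above. This is hypothetical code, not part of the original posts, and it assumes the tableSize field is made non-final so it can be updated:

@SuppressWarnings("unchecked")
private void resize(int newSize) {
    LinkedList<HashEntry<String, Object>>[] oldValues = this.values;
    this.values = (LinkedList<HashEntry<String, Object>>[]) new LinkedList[newSize];
    this.tableSize = newSize;

    // Re-hash every existing entry into the new, larger array: an O(N) operation.
    for (LinkedList<HashEntry<String, Object>> bucket : oldValues) {
        if (bucket == null)
            continue;
        for (HashEntry<String, Object> entry : bucket) {
            this.add(entry.getKey(), entry.getValue());
        }
    }
}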

Lastly, you'll need to consider how you are going to handle null keys and/or null values; some implementations don't allow them.

Monday, June 9, 2014

Back To The Basics: The Hash Table

A few weeks ago I began a series entitled Back To The Basics. This series is intended to go over various software data structures that are helpful to understand as a software engineer. I started the series with the Binary Tree. The next data structure I'd like to take a look at in this series is the Hash Table.

Hash Tables are very important data structures to understand. At their most basic level they provide associative array functionality with near constant time addition, deletion, and retrieval. The ability of a Hash Table to provide near constant time operations depends on how good the hashing algorithm is. The goal of the hashing algorithm is to avoid collisions, where two different values produce the same hash. When a collision is encountered the Hash Table strays farther from being able to provide constant time operations.

The ability to provide near constant time operations makes the Hash Table ideal for use as a cache or an index as items can be inserted and retrieved with very little overhead.

While most modern programming languages and frameworks provide several optimized implementations of the Hash Table, let's take a look at what it would take to build a Hash Table from scratch which can handle adding, removing, and retrieving values.

Let's first start by defining our HashTable class. The most basic definition of a Hash Table defines the underlying array which will store our values, and our constructor will initialize that store. The important thing to pay attention to here is the size of the initial array. We're going to initialize it to a prime-number size because that helps us evenly distribute values within our backing store. The reason for this will become more apparent when we create our hashing function.

public class HashTable {
    private Object[] values;
    private final int tableSize = 1543;

    public HashTable() {
        this.values = new Object[tableSize];
    }
}

Now that we have our initial class definition let's start by creating a method which returns the index we should use to store the value for a given key.

private int getKeyIndex(String key) {
    // Strip the sign bit so a negative hash code can't produce a negative index.
    int hash = key.hashCode() & 0x7fffffff;
    return hash % this.values.length;
}

The getKeyIndex method is pretty simple. It first gets a hash code for the key and then returns an index based on modding the hash code with the length of the array that holds our values (stripping the sign bit first so a negative hash code can't push us out of bounds). The reason we return a mod is to make sure the index returned is within the bounds of our array. And because we've used a prime number for our array length we get better distribution of hashes in our array so that they're not clustered together.
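As a quick worked example (using values computed with the same algorithm described in the hashing discussion below), Java's hashCode for the string "dog" is 99644, and 99644 % 1543 is 892, so "dog" would land in slot 892 of our 1,543-element array:

int index = "dog".hashCode() % 1543;
System.out.println(index);   // prints 892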

Right now the getKeyIndex method is relatively boring and hides the magic of the actual hashing algorithm because our keys are strings and we're able to rely on the String class's hashCode method. But what if there were no hashCode method on String? How would we go about writing our own?

A good hashing algorithm will try to provide a uniform distribution of values in order to avoid collisions, which increase the cost of the operation. The cost is increased because every collision has to be resolved somehow. The work it takes to resolve a collision is relative to the number of collisions for that key and the algorithm used to resolve them. One way to resolve collisions is to store the values in a Linked List and iterate through the list until we find the item with our key. This gives our resolution algorithm an O(n) cost, where n is the number of collisions for that key.

Now let's build our string hash algorithm. Because strings are made up of individual characters and those characters have an integer value we could create a hashing function that just sums the character values. For example:

private int hash(String key) {
    int hash = 0;
    for(int index = 0; index < key.length(); index++) {
        hash += key.charAt(index);
    }
    return hash;
}

This has a problem in that it's likely to cause collisions. For example let's say we wanted to hash the string "dog". We'd start our hash with an initial value of 0. Then we'd iterate through the characters in the string summing them as we go. For dog this would produce a hash of 314 ((d)100 + (o)111 + (g)103). Now let's say we wanted to create a hash of the string "cav". This would also create a hash of 314 ((c)99 + (a)97 + (v)118). This is a collision, which is something we want to avoid.

The good thing is that there's a pretty simple way to help avoid these collisions. If we change our algorithm slightly so that the running hash is multiplied by a constant on each iteration, we can avoid them. Notice that I multiply the constant by the current hash value during each iteration. This helps make the hash unique because the value contributed by a character now depends on its position within the string, which decreases the likelihood of a collision.
  
private int hashCode(String key) {
    int constant = 31;
    int hash = 0;
    for(int index = 0; index < key.length(); index++) {
        hash = (constant * hash) + key.charAt(index);
    }
    return hash;
}

With our new algorithm dog will hash to 99644 and cav will hash to 98264. Because we've added a constant that is multiplied against the current hash value at each iteration we've increased the likelihood that the resulting hash code is unique.
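If you'd like to verify those numbers yourself, here's a small, self-contained sketch (the method names are just for this demo) that runs both algorithms side by side:

public class HashDemo {
    // The naive sum-of-characters hash from the first example.
    static int naiveHash(String key) {
        int hash = 0;
        for (int index = 0; index < key.length(); index++) {
            hash += key.charAt(index);
        }
        return hash;
    }

    // The improved hash that multiplies the running hash by a constant.
    static int betterHash(String key) {
        int constant = 31;
        int hash = 0;
        for (int index = 0; index < key.length(); index++) {
            hash = (constant * hash) + key.charAt(index);
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(naiveHash("dog") + " " + naiveHash("cav"));    // 314 314 -- a collision
        System.out.println(betterHash("dog") + " " + betterHash("cav"));  // 99644 98264 -- no collision
    }
}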

Note: The above algorithm is loosely based on the hashCode method in OpenJDK's String class. If you're curious as to why the constant value of 31 was chosen, check out Effective Java by Joshua Bloch. This Java Bytes blog post goes into even more depth for the ultra-curious.


Monday, June 2, 2014

Back To The Basics: The Binary Tree Deletion

In my previous post Back To The Basics: The Binary Tree Traversing I discussed several common algorithms for traversing a Binary Search Tree. In this post I’d like to turn our attention to one last common algorithm, delete.

The delete algorithm consists of three main scenarios. First, deleting a leaf node. Second, deleting a node with only one child. Last, deleting a node with two children.

The first scenario is the easiest to account for. Remember that a leaf node is a node which has no child nodes. So deleting a leaf node simply requires us to null out the reference to that node in its parent.


The second scenario, deleting a node with only one child, is also pretty straightforward. You simply change the pointer that points to the node being deleted so that it points to the deleted node's child.




The third scenario is the more complicated one. When deleting a node with two children you replace the node being deleted with either the minimum-value node from its right subtree or the maximum-value node from its left subtree, and then delete that replacement node from the subtree it came from.

 

Here is what that algorithm looks like.

public Tree<T> delete(T data) {
        return this.delete(this, data);
}

private Tree<T> delete(Tree<T> parent, T data) {
        if(parent == null)
                return parent;

        int comparison = data.compareTo(parent.data);
        if(comparison < 0)
                parent.left = this.delete(parent.left, data);
        else if(comparison > 0)
                parent.right = this.delete(parent.right, data);
        else {
                if(parent.right == null)
                        return parent.left;
                if(parent.left == null)
                        return parent.right;

                Tree<T> temp = parent;
                parent = temp.right.minTree();
                parent.right = this.deleteMin(temp.right);
                parent.left = temp.left;
        }

        return parent;
}

private Tree<T> deleteMin(Tree<T> parent) {
        if(parent.left == null)
                return parent.right;
        // Re-link the left subtree after removing its minimum node.
        parent.left = this.deleteMin(parent.left);
        return parent;
}
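One note: the delete method above also calls a minTree helper on the right subtree, but that helper isn't listed in the post. Here's a minimal sketch of what it presumably looks like (an assumption on my part, since the original code isn't shown); it simply follows left pointers down to the left-most node:

private Tree<T> minTree() {
        // The minimum node of a binary search tree is its left-most node.
        if(this.left == null)
                return this;
        return this.left.minTree();
}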

Monday, May 26, 2014

Back To The Basics: The Binary Tree Traversing

In my previous post Back To The Basics: The Binary Tree Insert and Find Operations I described the Binary Tree insert and find operations. In this post I'd like to go over several common algorithms for traversing a Binary Tree. Just like the insert and find operations, all of our traversal operations will use recursion.

The first three algorithms deal with returning the tree values in different orders. For this example we'll simply be printing those values out to System.out. The first algorithm allows us to print the Binary Tree's values in pre-order. Notice that in pre-order our recursive function prints the current node before recursing down its child nodes. Pre-order has us following the tree top down from the left side of the tree to the right.

public void traversePreOrder(Tree<T> parent) {
    if(parent == null)
        return;
    System.out.print(parent.data + " ");
    this.traversePreOrder(parent.left);
    this.traversePreOrder(parent.right);
}

The second algorithm allows us to print the Binary Tree's values in post-order. Notice that in post-order our recursive function prints the current node after recursing down its child nodes. Post-order has us following the tree bottom up from the left side of the tree to the right. In post-order the leaf nodes are printed before their parent nodes.

public void traversePostOrder(Tree<T> parent) {
    if(parent == null)
        return;
    this.traversePostOrder(parent.left);
    this.traversePostOrder(parent.right);
    System.out.print(parent.data + " ");
}

The third algorithm allows us to print the Binary Tree's values in-order. In-order, like its name suggests, prints the tree from the smallest value to the largest value. Notice that for in-order traversal we traverse down the left child nodes, then print the current node's value, and finally traverse the right child nodes. We print the current node's value after traversing the left child nodes and before traversing the right child nodes because the values on the left side of the tree are always smaller than the values on the right side. This is what allows us to print the tree in-order.

public void traverseInOrder(Tree<T> parent) {
    if(parent == null)
        return;
    this.traverseInOrder(parent.left);
    System.out.print(parent.data + " ");
    this.traverseInOrder(parent.right);
}

The fourth algorithm allows us to return the size of the tree. The size is determined by the number of items in the tree. This is the simplest of all the traversal algorithms. We simply start with a value of 1 to account for the current node and add the sizes of the left and right child nodes. The resulting value will be the number of nodes in the tree. Notice that we must check the left and the right nodes for null.

public int getSize() {
    int size = 1;
    if(this.left != null)
        size += this.left.getSize();
    if(this.right != null)
        size += this.right.getSize();
    return size;
}

The fifth algorithm allows us to return the max depth of the tree. The max depth of the tree is really just the height of the tree. The max depth is determined by getting the depth of the left side of the tree and of the right side and comparing them. For this algorithm, like the getSize one, we start with a value of 1 to account for the depth of the current level of the tree.

public int maxDepth() {
    int depth = 1;
    int leftDepth = (this.left == null) ? 0 : this.left.maxDepth();
    int rightDepth = (this.right == null) ? 0 : this.right.maxDepth();
    int childDepth = (leftDepth > rightDepth) ? leftDepth : rightDepth;
    return depth + childDepth;
}

The sixth algorithm allows us to return the minimum value in the tree. Unlike the previous five algorithms, we do not traverse through every node in the tree to find this value. The minimum value in the tree is contained in the left-most leaf node. So to find this value we simply need to travel down the left side of the tree until we no longer have any child nodes.

public T minValue() {
    if(this.left != null)
        return this.left.minValue();
    return this.data;
}

Finally, the seventh algorithm allows us to return the maximum value in the tree. This algorithm is just like finding the minimum value except this time we use the right side of the tree. To find the maximum value we simply need to travel down the right side of the tree until we no longer have any child nodes.

public T maxValue() {
    if(this.right != null)
        return this.right.maxValue();
    return this.data;
}
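To tie these together, here's a short usage sketch, assuming the Tree class and insert method covered in the other posts in this series:

Tree<Integer> tree = new Tree<Integer>(4);
tree.insert(2);
tree.insert(6);
tree.insert(1);
tree.insert(3);

tree.traversePreOrder(tree);            // prints: 4 2 1 3 6
tree.traversePostOrder(tree);           // prints: 1 3 2 6 4
tree.traverseInOrder(tree);             // prints: 1 2 3 4 6
System.out.println(tree.getSize());     // prints 5
System.out.println(tree.maxDepth());    // prints 3
System.out.println(tree.minValue());    // prints 1
System.out.println(tree.maxValue());    // prints 6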

Monday, May 19, 2014

Back To The Basics: The Binary Tree Insert and Find Operations

In my previous post Back To The Basics: The Binary Tree I described The Binary Tree data structure as well as provided a basic template for the class in Java. In this post we’ll take a look at the insert and find operations.

The insert and find operations on a Binary Search Tree are actually quite similar. They both recurse through the tree starting at a particular node (typically the root), using a method similar to a binary search to find the appropriate place for the node. Once that place has been determined, the find method returns the node if it exists and the insert method inserts the node into that position if it does not already exist. Both operations take time proportional to the height of the tree.

Let’s first take a look at the insert operation.

public void insert(T newData) {
        this.insert(this, newData);
}

private void insert(Tree<T> parent, T newData) {
        if(parent.data == null) {
                parent.data = newData;
                return;
        }

        if(newData.compareTo(parent.data) == 0)
                return;

        if(newData.compareTo(parent.data) < 0) {
                if(parent.left == null) parent.left = new Tree<T>();
                this.insert(parent.left, newData);
        } else {
                if(parent.right == null) parent.right = new Tree<T>();
                this.insert(parent.right, newData);
        }
}

Now let's take a look at the find operation.


public Tree<T> find(T dataToFind) {
        return this.find(this, dataToFind);
}

private Tree<T> find(Tree<T> parent, T dataToFind) {
        if(parent == null)
                return parent;

        if(parent.data.compareTo(dataToFind) == 0)
                return parent;

        if(dataToFind.compareTo(parent.data) < 0)
                return this.find(parent.left, dataToFind);

        return this.find(parent.right, dataToFind);
}

What you should notice is that these two operations are strikingly similar. They both take advantage of recursion. Both of the algorithms are made up of the same three parts.

The first part of the algorithm checks to see if the current node exists. In the case of the insert operation this is true if the data of the node is null.* If this is true we know that we can safely set the node's data and exit. In the find operation this is true if the node itself is null, in which case we didn't find what we're looking for. In either case this is an exit condition of the recursive calls.

The second part of the algorithm checks to see if the current node's data is equal to the data passed in to the method. In the case of the insert operation this signals that duplicate data is being inserted into the tree, so we can terminate the recursion since we know we no longer have any additional work to do. In the case of the find operation this signals that we've found the data we're looking for, so we can return the current node and exit our recursion.

The last part of the algorithm recurses to the next level of the tree. It decides which branch of the tree to recurse down based on whether the data passed in to the method is less than or greater than the current node's value. This is what allows the algorithm to have time proportional to the height of the tree. Because we only recurse down one child node, we're only ever acting on one node at each level of the tree.

* The insert operation needs to set the data on an existing node and cannot act on a null node. The reason for this is that we'd need a way to create the new node and assign it to the parent. Because we're using recursion this is difficult, but easily overcome if, before we recurse into the insert operation, we first make sure the node we're sending in IS NOT NULL.
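To close out, here's a brief usage sketch of the two operations together, assuming the Tree class from the previous post:

Tree<String> tree = new Tree<String>("m");
tree.insert("c");
tree.insert("t");

Tree<String> found = tree.find("t");     // returns the node holding "t"
Tree<String> missing = tree.find("z");   // returns null since "z" was never inserted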

Monday, May 12, 2014

Back To The Basics: The Binary Tree


I'd like to begin a new series here entitled Back To The Basics where I go over various software data structures that are helpful to understand. I'd like to start the series with a data structure that I've been asked about quite often in various interviews I've done during my career as a software developer. That data structure is the Binary Tree.

So what exactly is a Binary Tree? The basic concept of a Binary Tree is actually pretty simple. It's a tree of nodes. Each node consists of some piece of data and pointers to a left node and a right node. The first node is called the root node as it is at the root of the tree. The left and right nodes are called child nodes. Nodes that have children are also called internal nodes, and nodes that have no children are called leaf nodes (or external nodes).

What’s so special about a Binary Tree? You can do really cool things with them that allow you to optimize your interaction with the data that the tree contains.  For instance, you can create a Binary Search Tree which allows very fast insert and lookup of nodes.

What's the difference between a Binary Tree and a Binary Search Tree? A plain Binary Tree has no structure or rules around the organization of its child or leaf nodes. This fact makes it hard to optimize the tree operations. A Binary Search Tree adds some structure which makes insert and lookup of nodes very fast. The rules that a Binary Search Tree adds are that the left child node must contain a value that is less than its parent's value and the right child node must contain a value that is greater than its parent's value. So if I had a node with a value of 4 and I wanted to insert a node with a value of 3, it would go in the left child node. But if I wanted to insert a node with a value of 9, it would go in the right child node.

Binary Trees allow us to solve a wide variety of complex software problems, like making data searchable, representing hierarchical data, and routing algorithms.

So now that we know what a Binary Tree is, let's take a closer look at some common operations of a Binary Search Tree. For this example I'm going to write our Binary Search Tree in Java.

public class Tree<T extends Comparable<T>> {
    private Tree<T> left;
    private Tree<T> right;
    private T data;

    public T getData() { return this.data; }
    public Tree<T> getLeft() { return this.left; }
    public Tree<T> getRight() { return this.right; }

    public Tree() {
        this(null);
    }

    public Tree(T data) {
        this.left = null;
        this.right = null;
        this.data = data;
    }
}

The first thing you'll notice in the definition of our Tree is that I'm constraining the type parameter T to extend Comparable<T>. While this doesn't have anything to do with a Binary Tree per se, doing this allows us to create a tree that can handle more complex data types. Requiring T to implement Comparable<T> also allows us to keep the Tree simple and cohesive. It forces us to keep the comparison logic out of the tree, which would be a very poor place for it to begin with.

Taking a look at the definition of our tree, the first thing you'll notice is the three private members. Tree<T> left represents our pointer to the left child node. Tree<T> right represents our pointer to the right child node. Finally, T data represents the tree's data.

One other important thing to notice is that I've provided two constructors for this Tree. While we don't inherently need two constructors, there is a nice ease-of-use factor in having them. Having both an empty constructor and a parameterized constructor allows the consumer to initialize either an empty tree or one with data right off the bat, as the short example below shows.
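Here's what that looks like in practice:

Tree<Integer> emptyTree = new Tree<Integer>();    // empty tree; the root's data is set by the first insert
Tree<Integer> seededTree = new Tree<Integer>(4);  // tree whose root already holds the value 4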

In my next post I’ll take a look at the two most common operations on a Binary Tree, insert and find.