HASHING TECHNIQUE IN DATA STRUCTURE DEMYSTIFIED

Marshall Akpan
9 min readSep 14, 2018

--

Hashing technique is useful for searching. Before delving into hashing let us look at some common terminologies and other searching technique to facilitate a clearer understanding of the hashing technique:

1. Algorithm: An algorithm is a step by step procedure to solve a problem. In computer science, an algorithm can be defined as a sequence of unambiguous instructions used for solving a problem, which can be implemented (as a program) on a computer.

2. Performance Analysis: Performance analysis of an algorithm is the process of calculating space required by that algorithm and time required by that algorithm. Performance analysis helps us to select the best algorithm from multiple algorithms to solve a problem. Performance of an algorithm means predicting the resources which are required to an algorithm to perform its task.

Performance analysis of an algorithm is performed by using the following measures:

a. Space Complexity: it is the space required to complete the task of an algorithm. It includes program space and data space.

b. Time Complexity: Time required to complete the task of an algorithm. The time complexity of an algorithm is the total amount of time required by an algorithm to complete its execution considering only the input data.

3. Linear Search — O(n): The time complexity for linear search is of order O(n).

4. Binary Search — O(log n): The time complexity for Binary search is of order O(log n).It can be seen that Binary Search improves on Linear search with a faster running time. But we want a search method which takes a constant time of O(1) that does not depend on the size of the input. This is the requirement which introduces the Hashing technique.

Hashing is the process of indexing and retrieving element (data) in a Data-Structure to provide faster way of finding the element using the hash key. Here, hash key is a value which provides the index value where that actual data is likely to store in the Data Structure.

Hash key value is used to map the data with index in the hash table. And the hash key is generated for every data using a hash function. That means every entry in the hash table is based on the key value generated using a hash function.

Hash Table is defined as an array which maps a key (data) into the Data Structure with the help of hash function such that insertion, deletion and search operations can be performed with constant time complexity (i.e. O(1)).

Basic concept of hashing and hash table is shown in the following figure…

Hash Table

A. Simple Hash Function Using Remainders

A hash table is a collection of items which are stored in such a way as to make it easy to find them later. Each position of the hash table, often called a slot, can hold an item and is named by an integer value starting at 0. For example, we will have a slot named 0, a slot named 1, a slot named 2, and so on. Initially, the hash table contains no items so every slot is empty. We can implement a hash table by using a list with each element initialized to the special Python value None.

The Figure below shows a hash table of size n = 11. In other words, there are n slots in the table, named 0 through 10. The mapping between an item and the slot where that item belongs in the hash table is called the hash function.

HOW TO COMPUTE THE HASH TABLE

The hash function will take any item in the collection and return an integer in the range of slot names, between 0 and n − 1. Assume that we have the set of integer items 54, 26, 93, 17, 77, and 31. Our first hash function, sometimes referred to as the “remainder method,” simply takes an item and divides it by the table size, returning the remainder as its hash value (ℎ(item) = item%11).

Once the hash values have been computed, we can insert each item into the hash table at the designated position as shown in Figure below. Note that 6 of the 11 slots are now occupied.

Computed Hash Table Using Hash Function

Now when we want to search for an item, we simply use the hash function to compute the slot name for the item and then check the hash table to see if it is present. This searching operation is 𝑂 (1), since a constant amount of time is required to compute the hash value and then index the hash table at that location. If everything is where it should be, we have found a constant time search algorithm.

TACKLING COLLISIONS

Collision Resolution

The systematic method for placing the second item in the hash table when two items hash to the same slot is called collision resolution.

One method for resolving collisions looks into the hash table and tries to find another open slot to hold the item that caused the collision. A simple way to do this is to start at the original hash value position and then move in a sequential manner through the slots until we encounter the first slot that is empty. Note that we may need to go back to the first slot (circularly) to cover the entire hash table. This collision resolution process is referred to as open addressing in that it tries to find the next open slot or address in the hash table. By systematically visiting each slot one at a time, we are performing an open addressing technique called linear probing.

When we attempt to place 33 into slot 0, a collision occurs. Under linear probing, we look sequentially, slot by slot, until we find an open position. In this case, we find slot 1. Again, 55 should go in slot 0 but must be placed in slot 2 since it is the next open position. The final value of 20 hashes to slot 9. Since slot 9 is full, we begin to do linear probing. We visit slots 10, 0, 1, and 2, and finally find an empty slot at position 3.

Once we have built a hash table using open addressing and linear probing, it is essential that we utilize the same methods to search for items. Assume we want to look up the item 93. When we compute the hash value, we get 5. Looking in slot 5 reveals 93, and we can return True. What if we are looking for 20? Now the hash value is 9, and slot 9 is currently holding 31. We cannot simply return False since we know that there could have been collisions. We are now forced to do a sequential search, starting at position 10, looking until either we find the item 20 or we find an empty slot.

A disadvantage to linear probing is the tendency for clustering; items become clustered in the table. This means that if many collisions occur at the same hash value, a number of surrounding slots will be filled by the linear probing resolution this will have an impact on other items that are being inserted, as we saw when we tried to add the item 20 above.

One way to deal with clustering is to extend the linear probing technique so that instead of looking sequentially for the next open slot, we skip slots, thereby more evenly distributing the items that have caused collisions. This will potentially reduce the clustering that occurs.

We can do a “plus 3” probe. This means that once a collision occurs, we will look at every third slot until we find one that is empty.

The “plus 3” rehash can be defined as: rehash(pos) = (pos + 3)%size_of_table.

In general, rehash(pos) = (pos + skip)%size_of_table.

It is important to note that the size of the “skip” must be such that all the slots in the table will eventually be visited. Otherwise, part of the table will be unused. To ensure this, it is often suggested that the table size be a prime number. This is the reason I have been using 11 in my examples.

A variation of the linear probing idea is called quadratic probing. Instead of using a constant “skip” value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on.

An alternative method for handling the collision problem is to allow each slot to hold a reference to a collection (or chain) of items. Chaining allows many items to exist at the same location in the hash table. When collisions happen, the item is still placed in the proper slot of the hash table. You can study more on chaining here.

IMPLEMENTING THE HASH TABLE IN PYTHON

Hash Table is very efficient in Implementing the Map Abstract Data Type. We will use Python

The Map is just like a dictionary in python. The key is used to look up the associated data value.

It is better to use a hash table as described above since looking up an item in a hash table can approach 𝑂(1) performance unlike other Data Structures.

Map() Create a new, empty map. It returns an empty map collection.

• put(key,val) Add a new key-value pair to the map. If the key is already in the map then replace the old value with the new value.

• get(key) Given a key, return the value stored in the map or None otherwise.

• del Delete the key-value pair from the map using a statement of the form del map[key].

len() Return the number of key-value pairs stored in the map.

in Return True for a statement of the form key in map, if the given key is in the map, False otherwise.

Below we use two lists to create a HashTable class that implements the Map abstract data type. One list, called slots, will hold the key items and a parallel list, called data, will hold the data values.

Note that the initial size for the hash table has been chosen to be 11. Although this is arbitrary, it is important that the size be a prime number so that the collision resolution algorithm can be as efficient as possible.

class HashTable:
def __init__(self):
self.size = 11
self.slots = [None] * self.size
self.data = [None] * self.size
def put(self, key, data):
hash_value = self.hash_function(key,len(self.slots))

if self.slots[hash_value] == None:
self.slots[hash_value] = key
self.data[hash_value] = data
else:
if self.slots[hash_value] == key:
self.data[hash_value] = data #replace
else:
next_slot = self.rehash(hash_value, len(self.slots))
while self.slots[next_slot] != None and self.slots[next_slot] != key:
next_slot = self.rehash(next_slot, len(self.slots))

if self.slots[next_slot] == None:
self.slots[next_slot] = key
self.data[next_slot] = data
else:
self.data[next_slot] = data #replace
def hash_function(self, key, size):
return key % size
def rehash(self, old_hash, size):
return (old_hash + 1) % size

def get(self, key):
start_slot = self.hash_function(key, len(self.slots))
data = None
stop = False
found = False
position = start_slot
while self.slots[position] != None and not found and not stop:
if self.slots[position] == key:
found = True
data = self.data[position]
else:
position=self.rehash(position, len(self.slots))
if position == start_slot:
stop = True
return data
def delete(self, key):
hash_value = self.hash_function(key,len(self.slots))

if self.slots[hash_value] == key: #Item in original slot
self.slots[hash_value] = None
self.data[hash_value] = None
else:
#if key not found in original slot, do a search for it
next_slot = self.rehash(hash_value, len(self.slots))
while self.slots[next_slot] != key:
next_slot = self.rehash(hash_value, len(self.slots))
if self.slots[next_slot] == key:
self.slots[next_slot] = None
self.data[next_slot] = None
else:
self.slots[next_slot] = None

‘’’implementing the __len__ method’’’

def __len__(self):
count = 0
for value in self.slots:
if value != None:
count += 1
return count

def __delitem__(self, key):
return self.delete(key)
def __getitem__(self, key):
return self.get(key)
def __setitem__(self, key, data):
self.put(key, data)
h=HashTable()
h.put(54,”cat”)
h.put(26,”dog”)
h.put(93,”lion”)
h.put(17,”tiger”)
h.put(77,”bird”)
h.put(31,”cow”)
h.put(44,”goat”)
h.put(55,”pig”)
h.put(20,”chicken”)
print(h.slots)
print(f’The length of the list is: {len(h)}\nexcluding the “None”’)

CONCLUSION

Analysis of Hashing

The most important piece of information we need to analyze the use of a hash table is the load factor, 𝜆.

𝜆 = number_of_items/ table_size

Conceptually, if 𝜆 is small, then there is a lower chance of collisions, meaning that items are more likely to be in the slots where they belong. If 𝜆 is large, meaning that the table is filling up, then there are more and more collisions. This means that collision resolution is more difficult, requiring more comparisons to find an empty slot. With chaining, increased collisions means an increased number of items on each chain.

--

--

Marshall Akpan
Marshall Akpan

Written by Marshall Akpan

A Software Engineer, Proficient in JavaScript, React, TypeScript, Python, NodeJs, and RoR . I mentor and build web apps for businesses. Tw: @uimarshall.

No responses yet