What is Hashing?
Hashing is an important data structure technique, designed around a hash function that maps a given element to a particular key for faster access. Imagine having a set of elements/values you would like to not only store, but also access quickly. We could store such elements in a list and access them later via a linear search. However, a linear search takes O(n) time, where n is the number of elements, so storing more elements means a longer search. Rather than simply appending each element to our list, we can intelligently choose an index for each element, which can then be accessed immediately in O(1) time.
Basic Hashing Exercise
Take the simple example of storing the integers 1 to 10 in an array of length 10. We could store these integers in any order we like, but it is smarter to write a hash function that determines the index for each integer. Ideally, the integer 1 would be stored at index 0, the integer 2 at index 1, and so on, until the integer 10 is stored at index 9.
Choosing Hashing Functions
We might be tempted to think that our hash function for any integer x is simply: H(x) = x - 1. Now when x = 1, we place it at index 0, and when x = 10, we place it at index 9. But what happens when unexpected integers arrive, such as 11, 0, or -1? Our primitive hash function calculates a key/index that is out of bounds, because our array is only of size 10. Increasing the size of the array can help accommodate larger x values, but now our array keeps growing. We need a more intelligent hash function that does not depend on a lower and upper bound of the input, so that:
A) Hash function considers future input scale.
B) The Array size does not increase beyond necessity.
So why not use the following hash function: H(x) = x % 10? Notice that we are now taking the modulo by the size of our array. In this case, when x = 0, our key/index is 0. For x = 1, the resulting key/index is 1; for x = 2, it is 2. For x = 10, the resulting key/index is not 10 but 0, since there is no remainder. Even if our input is larger than 10, say x = 11, the resulting key/index wraps back around to 1. Throw in an even larger input such as x = 100, and the resulting key is still within bounds. For negative integers, however, the modulo result can itself be negative in many languages (in Java, -1 % 10 is -1), so we take the absolute value first:
H(x) = | x | % 10.
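As a minimal sketch in Java (the class and method names here are illustrative, not from any particular library), the absolute-value hash function can be written as:

```java
public class SimpleHash {
    static final int TABLE_SIZE = 10; // size of the backing array

    // Maps any integer, positive or negative, to an index in [0, TABLE_SIZE)
    static int hash(int x) {
        return Math.abs(x) % TABLE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(hash(10));  // 0 -- wraps around
        System.out.println(hash(11));  // 1
        System.out.println(hash(-1));  // 1 -- absolute value keeps it in bounds
    }
}
```

One caveat worth knowing: in Java, Math.abs(Integer.MIN_VALUE) overflows and stays negative, which is why production hash tables typically mask the sign bit rather than call Math.abs.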
Handling Collisions
Given that the array is of size 10 and we use the improved hash function, what happens when two integers map to the same key? Both 0 and 10 map to index 0, and each index can only hold one element. Having two elements map to the same key is known as a collision. We can store both 0 and 10 safely via chaining.
Chaining means having each cell of the hash table point to a linked list of elements that share the same hash key. In this case, both 0 and 10 can reside in the same cell as a linked list. However, this solution can compromise our fast search time. What if all the integers we store happen to map to index 0? We end up with one linked list of many values. When all N elements map to the same index, we can still calculate the index of any element in O(1) time, but finding the chosen element in the linked list takes O(N) time. That is the same as performing a linear search on an array, so this hash table is not as efficient as intended.
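To make chaining concrete, here is a minimal Java sketch (class and method names are illustrative) of a hash table whose cells are linked lists:

```java
import java.util.LinkedList;

// A minimal hash table that resolves collisions by chaining.
public class ChainedHashTable {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) buckets[i] = new LinkedList<>();
    }

    private int hash(int x) {
        return Math.abs(x) % buckets.length;
    }

    // O(1): append to the chain at the computed index
    public void insert(int x) {
        buckets[hash(x)].add(x);
    }

    // O(1) on average; O(N) worst case when every element lands in one chain
    public boolean contains(int x) {
        return buckets[hash(x)].contains(x);
    }
}
```

With a table of size 10, insert(0) and insert(10) both hash to bucket 0, yet both values coexist safely in that bucket’s chain.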
Advantages:
1) Simple to implement.
2) The hash table never fills up; we can always add more elements to a chain.
3) Less sensitive to the hash function or load factor.
4) Mostly used when the number and frequency of key insertions and deletions are unknown.
Disadvantages:
1) Cache performance of chaining is poor, since keys are stored in linked lists. Open addressing provides better cache performance because everything is stored in the same table.
2) Wasted space (some cells of the hash table are never used).
3) If a chain becomes long, search time can become O(N) in the worst case.
4) Uses extra space for the linked list nodes.
Hashing Performance
We can evaluate the performance of our hashing function and data structure by the following:
m = Size of hash table
n = Number of keys to be inserted in hash table
Load factor α = n/m
Expected time to search = O(1 + α)
Expected time to insert/delete = O(1 + α)
The time complexity of search, insertion, and deletion is O(1) if α is O(1).
The load factor represents the average number of keys placed in each index/cell.
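To see these quantities in action, here is a tiny Java sketch (names are illustrative) that computes the load factor:

```java
// Illustrative: how the load factor relates key count to table size.
public class LoadFactor {
    // α = n / m, where n = keys inserted and m = hash table size
    static double loadFactor(int n, int m) {
        return (double) n / m;
    }

    public static void main(String[] args) {
        // 20 keys spread over 10 buckets -> α = 2.0, so each chain holds
        // 2 keys on average and an expected search touches O(1 + 2) nodes.
        System.out.println(loadFactor(20, 10)); // 2.0
    }
}
```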
Where is Hashing used?
Microsoft Azure
Hashing does have a very practical use case, specifically for storing and accessing repeatedly used values. In fact, Azure Cache for Redis utilizes hashing to accomplish this. As a result, Azure Cache improves the performance and scalability of systems that rely heavily on backend data-stores. The cache temporarily copies frequently accessed data to a hash table. This fast storage is located in-memory with Azure Cache for Redis instead of being re-loaded from disk by a database. Application performance is improved by taking advantage of the low-latency, high-throughput performance of the Redis engine.
Loading Web Pages Faster
Most web pages are generated from templates with headers, footers, toolbars, menus, etc. They don’t actually change often and should not be generated dynamically. Using an in-memory cache, like Azure Cache for Redis, will give your web servers quick access to this type of static content compared to backend datastores. This pattern reduces processing time and server load that would be required to generate content dynamically, saving cloud users cost and execution time. This allows web servers to be more responsive, and can allow you to reduce the number of servers needed to handle loads.
Storing User Session
Storing too much in a cookie can have a negative impact on performance as the cookie size grows and is passed and validated with every request. A typical solution is to use the cookie as a key to query the data in a backend database. How can we reduce the number of queries made to a database? Using an in-memory cache, like Azure Cache for Redis, to associate information with a user is much faster than interacting with a full relational database.
Another common case is choosing to store an API response and other essential objects related to the user’s session. These can be stored in a persistent HashMap that is inserted into other classes via dependency injection. Developers can predetermine which Strings act as the key for each expected object, then get them from the HashMap instead of re-executing API calls.
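A hedged sketch of this pattern in Java (every class and method name here is hypothetical, not a real API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical session cache: names are illustrative, not from a real framework.
public class SessionCache {
    private final Map<String, String> cache = new HashMap<>();

    // Returns the cached response if present; otherwise performs the
    // (simulated) expensive call and stores the result for next time.
    public String getOrFetch(String key) {
        return cache.computeIfAbsent(key, k -> fetchFromApi(k));
    }

    private String fetchFromApi(String key) {
        // Placeholder for a real network call
        return "response-for-" + key;
    }
}
```

The second call to getOrFetch with the same key never reaches fetchFromApi; the value comes straight from the map.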
Database Management Systems
When we are forced to interact with a backend data store, in most cases the DBMS is also implementing hashing! Hashing is used to directly compute the location of desired data on the disk without using an index structure. Data is stored in data blocks (also called data buckets), whose addresses are generated by applying a hash function to the record’s key.
Here are the situations in a DBMS where you would apply the hashing method:
- For a huge database structure, it is tough to search all the index values through all their levels and then reach the destination data block to get the desired data. This is a case where linear search results in poor performance.
- The hashing method is used to index and retrieve items in a database, as it is faster to search for a specific item using the shorter hashed key than using its original value.
- Hashing is an ideal method to calculate the direct location of a data record on the disk without using index structure.
- It is also a helpful technique for implementing dictionaries.
Hashing for Justice: Microsoft’s PhotoDNA
In 2009, Microsoft partnered with Dartmouth College to develop PhotoDNA, a technology that aids in finding and removing known images of child exploitation. Today, PhotoDNA is used by organizations around the world and has assisted in the detection, disruption, and reporting of millions of child exploitation images.
How does PhotoDNA accomplish this complicated task?
PhotoDNA creates a unique digital signature (known as a “hash”) of an image which is then compared against signatures (hashes) of other photos to find copies of the same image. When matched with a database containing hashes of previously identified illegal images, PhotoDNA is an incredible tool to help detect, disrupt and report the distribution of child exploitation material. PhotoDNA is not facial recognition software and cannot be used to identify a person or object in an image. A PhotoDNA hash is not reversible, and therefore cannot be used to recreate an image.
Prior to PhotoDNA’s inception, law enforcement had to manually compare new images found online to pre-existing images and determine whether they were new or duplicates. Even if a known image has been cropped or discolored, PhotoDNA is capable of detecting the match.
Types of Hashing Data Structures
Now that we understand the use cases for hashing, we can consider different data structures that already do the heavy lifting of hashing for us.
HashMaps
The HashMap will map a given key to a given value, and allow for fast lookup later on. This allows the user to choose any key for their values, while keeping in mind that there is a 1:1 mapping relationship. It also takes care of constructing a hash function. We have the following basic operations:
get(key) -> returns the value associated with the given key (returns null if key was never added with a value)
add(key, value) -> stores the given key and value as a pair
However, one issue with the HashMap is iterating over each (key, value) pair will not guarantee any particular order.
For example, say we have a HashMap containing Character to Integer pairs. We can insert (‘b’, 5), (‘a’, 3), and (‘c’, 6) in that order. However, if we iterate through this map and print each pair, there is no guarantee that all 3 pairs would print in the order they were added. The HashMap gives no control over the ordering of pairs, only quick insertion and search.
The HashMap may internally end up looking like this:
HashMap {
‘a’ : 3
‘b’ : 5
‘c’ : 6
}
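In Java, for example, this behavior can be observed directly with java.util.HashMap:

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapDemo {
    public static void main(String[] args) {
        Map<Character, Integer> map = new HashMap<>();
        map.put('b', 5);
        map.put('a', 3);
        map.put('c', 6);

        System.out.println(map.get('a'));   // 3
        System.out.println(map.get('z'));   // null -- key was never added

        // Iteration order is NOT guaranteed to match insertion order
        for (Map.Entry<Character, Integer> e : map.entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```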
Linked HashMaps
Linked HashMaps take care of this ordering issue. When we iterate over the map and print each pair, the order is maintained relative to our insertions. This Map implementation is more favorable when ordering matters and iteration will be used later on. We can think of a Linked HashMap as being a combination of a Linked List, since it maintains ordering by insertion, and a HashMap, due to the quick insertion and search time.
Internally, our Linked HashMap will look like this:
Linked HashMap {
‘b’ : 5
‘a’ : 3
‘c’ : 6
}
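The same three insertions through java.util.LinkedHashMap preserve insertion order when iterating:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedHashMapDemo {
    public static void main(String[] args) {
        Map<Character, Integer> map = new LinkedHashMap<>();
        map.put('b', 5);
        map.put('a', 3);
        map.put('c', 6);

        // Iteration is guaranteed to follow insertion order: b, a, c
        for (Map.Entry<Character, Integer> e : map.entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```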
TreeMap
What if we are more picky about the ordering of pairs? We may decide that the pairs should be sorted by their keys in ascending order. In this case, a TreeMap is appropriate. We can decide how pairs are ordered (ascending/descending), and print each pair based on that criteria. However, we pay a cost in the time needed to keep the pairs sorted. Internally, the TreeMap uses a Red-Black Tree to maintain sorted order. Trees and sorting can be covered in another article. The main point is that inserting or searching one pair in a TreeMap takes O(log N) time; adding N elements/pairs therefore takes O(N log N) time.
The two previous Map implementations take O(1) time per insertion and search, and thus O(N) time to insert N elements, making them faster. Because of this difference, TreeMaps should only be used when sorted order is actually needed.
If we decided that our TreeMap will store pairs by keys in descending order, then it will look like this:
TreeMap {
‘c’ : 6
‘b’ : 5
‘a’ : 3
}
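In Java, passing Comparator.reverseOrder() to java.util.TreeMap produces exactly this descending key order:

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

public class TreeMapDemo {
    public static void main(String[] args) {
        // reverseOrder() sorts keys in descending order
        Map<Character, Integer> map = new TreeMap<>(Comparator.reverseOrder());
        map.put('b', 5);
        map.put('a', 3);
        map.put('c', 6);

        // Iteration yields c, b, a regardless of insertion order
        for (Map.Entry<Character, Integer> e : map.entrySet()) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```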
Concurrent HashMaps
Concurrent HashMaps behave similarly to regular HashMaps, while adding support for concurrency. If multiple threads read and write to a regular HashMap, they may modify the same bucket at the same time and corrupt its internal state; in older Java versions, a concurrent resize could even leave the table with a cycle, sending a later read into an infinite loop. More specifically, if Thread A puts data in index K while Thread B reads data from index K, the result is undefined. The ConcurrentHashMap allows multiple threads to access it at the same time, just like the HashMap, but each write locks only the bucket/index being operated on. No two threads can modify the same bucket at the same time, which eliminates these hazards.
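A small Java example using java.util.concurrent.ConcurrentHashMap; its merge() call is an atomic read-modify-write, so concurrent updates are never lost and no external locking is needed:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentDemo {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> hits = new ConcurrentHashMap<>();

        // Two threads incrementing the same key 1000 times each
        Runnable worker = () -> {
            for (int i = 0; i < 1000; i++) {
                hits.merge("page", 1, Integer::sum); // atomic increment
            }
        };
        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();

        System.out.println(hits.get("page")); // 2000 -- no lost updates
    }
}
```

With a plain HashMap, the same two threads would race on the bucket holding "page" and some increments could silently disappear.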
Now that we have a fundamental understanding of hashing data structures, let’s solve some commonly asked problems! Consider the Map & Set Implementations available in your language of choice and which types of Maps are most useful for each scenario. Sometimes a map may not be needed, but the concept of hashing is useful for gaining the optimal solution.
Exercises:
1) Duplicate Image Problem: You have thousands of images in your Pictures folder, some of which may be duplicates. After reading each image, write an algorithm that identifies which files are in fact duplicates.
2) Employee Ranking Search Problem: Given a list of Employees, each containing a name and rank, sort them by rank in ascending order for your company and provide a feature that lets users quickly look up an employee by rank.
3) Letter Frequency Problem: Given a string of lowercase characters, print each unique character followed by its frequency. The characters must appear in the same order they occur in the given string.
Input: str = “avanade”
Output: a3 v1 n1 d1 e1
4) Check Permutation Problem: Given two strings (containing only lowercase characters), determine whether they are valid permutations. A permutation is a way in which a set of things can be arranged. Both strings should have the exact same set of characters and the same frequency per character.
Example: "happy" and "paphy" are permutations of each other
since both contain the exact same set of characters in common...
1 h
1 a
1 y
2 p