A hash table is made up of two parts: an array (the actual table where the data to be searched is stored) and a mapping function, known as a hash function. The hash function is a mapping from the input space to the integer space that defines the indices of the array. In other words, the hash function provides a way for assigning numbers to the input data such that the data can then be stored at the array index corresponding to the assigned number.
- Hash tables are used to quickly store and retrieve data (or records).
- Records are stored in buckets using hash keys
- Hash keys are calculated by applying a hashing algorithm to a chosen value contained within the record. This chosen value must be a common value to all the records.
- Each bucket can have multiple records which are be organized in a particular order.
So a hash table is a data structure that stores key/value pairs and can be quickly searched by the key. Because insertion and removal are operations dependent on the speed of the search, they tend to be fast as well.
Handling the collisions
In the small number of cases, where multiple keys map to the same integer, then elements with different keys may be stored in the same “slot” of the hash table. It is clear that when the hash function is used to locate a potential match, it will be necessary to compare the key of that element with the search key. But there may be more than one element which should be stored in a single slot of the table. Various techniques are used to manage this problem:
- overflow areas
- using neighbouring slots (linear probing)
- quadratic probing
- random probing
- Chaining : Hangs an additional data structure off of the buckets. For example the bucket array becomes an array of link list. So to find an item we first go to the bucket then compare keys.. This is a popular method, and if link list is used the hash never fills up.
- Overflow areas: It will divide the pre-allocated table into two sections: the primary area to which keys are mapped and an area for collisions, normally termed the overflow area. When a collision occurs, a slot in the overflow area is used for the new element and a link from the primary slot established as in a chained system. This is essentially the same as chaining, except that the overflow area is pre-allocated and thus possibly faster to access. As with re-hashing, the maximum number of elements must be known in advance, but in this case, two parameters must be estimated: the optimum size of the primary and overflow areas.
- Re-hashing: Re-hashing schemes use a second hashing operation when there is a collision. If there is a further collision, we re-hash until an empty “slot” in the table is found. The re-hashing function can either be a new function or a re-application of the original one. As long as the functions are applied to a key in the same order, then a sought key can always be located. Build a second table twice as large as the original and rehash there all the keys of the original table. Rehashing is expensive operation, with running time O(N). However, once done, the new hash table will have good performance.
- Linear Probing:On a collision, look in the neighbouring slot in the table. It calculates the new address extremely quickly and may be extremely efficient. Distance between probes is constant (i.e. 1, when probe examines consequent slots)
- Quadratic Probing: Distance between probes increases by certain constant at each step (in this case distance to the first slot depends on step number quadratically). The simplest form of quadratic probing is really just adding consequent squares to the calculated position instead of linear 1, 2, 3.