Hash table: Difference between revisions

imported>WikiCleanerBot
m v2.05b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation)
imported>Comp.arch
mNo edit summary
 
Line 1: Line 1:
{{Short description|Associative array for storing key-value pairs}}
{{Short description|Associative array for storing key–value pairs}}
{{Distinguish|Hash list|Hash tree (disambiguation){{!}}Hash tree}}
{{Distinguish|Hash list|Hash tree (disambiguation){{!}}Hash tree}}
{{Redirect|Rehash|the ''South Park'' episode|Rehash (South Park)}}
{{Redirect|Rehash|the ''South Park'' episode|Rehash (South Park)}}
Line 36: Line 36:
In [[computer science]], a '''hash table''' is a [[data structure]] that implements an [[associative array]], also called a '''dictionary''' or simply '''map'''; an associative array is an [[abstract data type]] that maps [[Unique key|keys]] to [[Value (computer science)|values]].<ref name="ms">{{cite book |doi=10.1007/978-3-540-77978-0_4 |chapter=Hash Tables and Associative Arrays |title=Algorithms and Data Structures |first1=Kurt|last1=Mehlhorn |author1-link=Kurt Mehlhorn |first2=Peter |last2=Sanders |author2-link=Peter Sanders (computer scientist) |publisher=Springer |date=2008 |pages=81–98 |isbn=978-3-540-77977-3 |chapter-url=https://people.mpi-inf.mpg.de/~mehlhorn/ftp/Toolbox/HashTables.pdf}}</ref> A hash table uses a [[hash function]] to compute an ''index'', also called a ''hash code'', into an array of ''buckets'' or ''slots'', from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. A map implemented by a hash table is called a '''hash map'''.
In [[computer science]], a '''hash table''' is a [[data structure]] that implements an [[associative array]], also called a '''dictionary''' or simply '''map'''; an associative array is an [[abstract data type]] that maps [[Unique key|keys]] to [[Value (computer science)|values]].<ref name="ms">{{cite book |doi=10.1007/978-3-540-77978-0_4 |chapter=Hash Tables and Associative Arrays |title=Algorithms and Data Structures |first1=Kurt|last1=Mehlhorn |author1-link=Kurt Mehlhorn |first2=Peter |last2=Sanders |author2-link=Peter Sanders (computer scientist) |publisher=Springer |date=2008 |pages=81–98 |isbn=978-3-540-77977-3 |chapter-url=https://people.mpi-inf.mpg.de/~mehlhorn/ftp/Toolbox/HashTables.pdf}}</ref> A hash table uses a [[hash function]] to compute an ''index'', also called a ''hash code'', into an array of ''buckets'' or ''slots'', from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. A map implemented by a hash table is called a '''hash map'''.


Most hash table designs employ an [[Perfect hash function|imperfect hash function]]. [[Hash collision|Hash collisions]], where the hash function generates the same index for more than one key, therefore typically must be accommodated in some way.
Most hash table designs employ an [[Perfect hash function|imperfect hash function]]. [[Hash collision]]s, where the hash function generates the same index for more than one key, therefore typically must be accommodated in some way. Common strategies to handle hash collisions include chaining, which stores multiple elements in the same slot using linked lists, and open addressing, which searches for the next available slot according to a probing sequence.<ref name="knuth">{{cite book |author-last=Knuth |author-first=Donald E. |author-link=Donald E. Knuth |date=24 April 1998 |title=The Art of Computer Programming: Volume 3: Sorting and Searching |edition=2nd |publisher=[[Addison-Wesley Professional]] |url=https://dl.acm.org/doi/10.5555/280635 |isbn=978-0-201-89685-5}}</ref>


In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of [[name–value pair|key–value pairs]], at [[amortized analysis|amortized]] constant average cost per operation.<ref name="leiser">{{cite web |first=Charles E. |last=Leiserson |author-link=Charles E. Leiserson |url=http://videolectures.net/mit6046jf05_leiserson_lec13/ |title=Lecture 13: Amortized Algorithms, Table Doubling, Potential Method |archive-url=https://web.archive.org/web/20090807022046/http://videolectures.net/mit6046jf05_leiserson_lec13/ |archive-date=August 7, 2009 |work=course MIT 6.046J/18.410J Introduction to Algorithms |date=Fall 2005 |url-status=live}}</ref><ref name="knuth">{{cite book | first=Donald |last=Knuth |author1-link=Donald Knuth | title = The Art of Computer Programming | volume = 3: ''Sorting and Searching'' | edition = 2nd | publisher = Addison-Wesley | year = 1998 | isbn = 978-0-201-89685-5 | pages = 513–558 }}</ref><ref name="cormen">{{cite book |last1=Cormen |first1=Thomas H. |author1-link=Thomas H. Cormen |last2=Leiserson |first2=Charles E. |author2-link=Charles E. Leiserson |last3=Rivest |first3=Ronald L. |author3-link=Ronald L. Rivest |last4=Stein |first4=Clifford |author4-link=Clifford Stein | title = Introduction to Algorithms | publisher = MIT Press and McGraw-Hill | year= 2001 | isbn = 978-0-262-53196-2 | edition = 2nd | pages=[https://archive.org/details/introductiontoal00corm_691/page/n243 221]–252 | chapter = Chapter 11: Hash Tables |title-link=Introduction to Algorithms }}</ref>
In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of [[name–value pair|key–value pairs]], at [[amortized analysis|amortized]] constant average cost per operation.<ref name="leiser">{{cite web |first=Charles E. |last=Leiserson |author-link=Charles E. Leiserson |url=https://videolectures.net/mit6046jf05_leiserson_lec13/ |title=Lecture 13: Amortized Algorithms, Table Doubling, Potential Method |archive-url=https://web.archive.org/web/20090807022046/http://videolectures.net/mit6046jf05_leiserson_lec13/ |archive-date=August 7, 2009 |work=course MIT 6.046J/18.410J Introduction to Algorithms |date=Fall 2005 |url-status=live}}</ref><ref name="knuth" />{{rp|pages=513–558}}<ref name="cormen">{{cite book |last1=Cormen |first1=Thomas H. |author1-link=Thomas H. Cormen |last2=Leiserson |first2=Charles E. |author2-link=Charles E. Leiserson |last3=Rivest |first3=Ronald L. |author3-link=Ronald L. Rivest |last4=Stein |first4=Clifford |author4-link=Clifford Stein | title = Introduction to Algorithms | publisher = MIT Press and McGraw-Hill | year= 2001 | isbn = 978-0-262-53196-2 | edition = 2nd | pages=[https://archive.org/details/introductiontoal00corm_691/page/n243 221]–252 | chapter = Chapter 11: Hash Tables |title-link=Introduction to Algorithms }}</ref>


Hashing is an example of a [[space-time tradeoff]]. If [[computer memory|memory]] is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a [[binary search]] or [[linear search]] can be used to retrieve the element.{{r|algo1rob|p=458}}
Hashing is an example of a [[space–time tradeoff]]. If [[computer memory|memory]] is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a [[binary search]] or [[linear search]] can be used to retrieve the element.{{r|algo1rob|p=458}}


In many situations, hash tables turn out to be on average more efficient than [[search tree]]s or any other [[table (computing)|table]] lookup structure. For this reason, they are widely used in many kinds of computer [[software]], particularly for [[associative array]]s, [[database index]]ing, [[cache (computing)|caches]], and [[set (abstract data type)|sets]].
In many situations, hash tables turn out to be on average more efficient than [[search tree]]s or any other [[table (computing)|table]] lookup structure. For this reason, they are widely used in many kinds of computer [[software]], particularly for [[associative array]]s, [[database index]]ing, [[cache (computing)|caches]], and [[set (abstract data type)|sets]].<ref>{{cite book |author-last1=Silberschatz |author-first1=A. |author-last2=Korth |author-first2=H. F. |author-last3=Sudarshan |author-first3=S. |date=2020 |title=Database System Concepts |edition=7th |publisher=[[McGraw-Hill]]}}</ref> Many programming languages provide built-in or standard-library hash table structures, such as Python’s dictionaries, Java’s <code>HashMap</code>, and C++’s <code>unordered_map</code>, which hide the details of hashing from the programmer.<ref>{{cite book |author-last1=Goodrich |author-first1=M. T. |author-last2=Tamassia |author-first2=R. |author-last3=Goldwasser |author-first3=M. H. |date=2014 |title=Data Structures and Algorithms in Java |edition=6th |publisher=[[Wiley (publisher)|Wiley]]}}</ref>


==History==
==History==
The idea of hashing arose independently in different places. In January 1953, [[Hans Peter Luhn]] wrote an internal [[IBM]] memorandum that used hashing with chaining.  The first example of [[open addressing]] was proposed by A. D. Linh, building on Luhn's memorandum.<ref name="knuth"/>{{rp|p=547}} Around the same time, [[Gene Amdahl]], [[Elaine M. McGraw]], [[Nathaniel Rochester (computer scientist)|Nathaniel Rochester]], and [[Arthur Samuel (computer scientist)|Arthur Samuel]] of [[IBM Research]] implemented hashing for the [[IBM 701]] [[Assembly language#Assembler|assembler]].{{r|Konheim|p=124}} Open addressing with linear probing is credited to Amdahl, although [[Andrey Ershov]] independently had the same idea.<ref name="Konheim">{{cite book |doi=10.1002/9780470630617 |title=Hashing in Computer Science |date=2010 |last1=Konheim |first1=Alan G. |isbn=978-0-470-34473-6 }}</ref>{{rp|pp=124–125}} The term "open addressing" was coined by [[W. Wesley Peterson]] in his article which discusses the problem of search in large files.<ref name="hashhist">{{cite book |doi=10.1201/9781420035179 |title=Handbook of Data Structures and Applications |date=2004 |isbn=978-0-429-14701-2 |editor-last1=Mehta |editor-last2=Mehta |editor-last3=Sahni |editor-first1=Dinesh P. |editor-first2=Dinesh P. |editor-first3=Sartaj }}</ref>{{rp|p=15}}
The idea of hashing arose independently in different places. In January 1953, [[Hans Peter Luhn]] wrote an internal [[IBM]] memorandum that used hashing with chaining.  The first example of [[open addressing]] was proposed by A. D. Linh, building on Luhn's memorandum.<ref name="knuth"/>{{rp|p=547}} Around the same time, [[Gene Amdahl]], [[Elaine M. McGraw]], [[Nathaniel Rochester (computer scientist)|Nathaniel Rochester]], and [[Arthur Samuel (computer scientist)|Arthur Samuel]] of [[IBM Research]] implemented hashing for the [[IBM 701]] [[Assembly language#Assembler|assembler]].{{r|Konheim|p=124}} Open addressing with linear probing is credited to Amdahl, although [[Andrey Ershov]] independently had the same idea.<ref name="Konheim">{{cite book |doi=10.1002/9780470630617 |title=Hashing in Computer Science |date=2010 |last1=Konheim |first1=Alan G. |isbn=978-0-470-34473-6 }}</ref>{{rp|pp=124–125}} The term "open addressing" was coined by [[W. Wesley Peterson]] in his article which discusses the problem of search in large files.<ref name="hashhist">{{cite book |doi=10.1201/9781420035179 |title=Handbook of Data Structures and Applications |date=2004 |isbn=978-0-429-14701-2 |editor-last1=Mehta |editor-last2=Mehta |editor-last3=Sahni |editor-first1=Dinesh P. |editor-first2=Dinesh P. |editor-first3=Sartaj }}</ref>{{rp|p=15}}


The first [[Academic publishing|published]] work on hashing with chaining is credited to [[Arnold Dumey]], who discussed the idea of using remainder modulo a prime as a hash function.{{r|hashhist|p=15}} The word "hashing" was first published in an article by Robert Morris.{{r|Konheim|p=126}} A [[Analysis of algorithms|theoretical analysis]] of linear probing was submitted originally by Konheim and Weiss.{{r|hashhist|p=15}}
The first published work on hashing with chaining is credited to [[Arnold Dumey]], who discussed the idea of using remainder modulo a prime as a hash function.{{r|hashhist|p=15}} The word "hashing" was first published in an article by Robert Morris.{{r|Konheim|p=126}} A [[Analysis of algorithms|theoretical analysis]] of linear probing was submitted originally by Konheim and Weiss.{{r|hashhist|p=15}}


== Overview ==
== Overview ==
An [[associative array]] stores a [[Set (abstract data type)|set]] of (key, value) pairs and allows insertion, deletion, and lookup (search), with the constraint of [[unique key]]s. In the hash table implementation of associative arrays, an array <math>A</math> of length <math>m</math> is partially filled with <math>n</math> elements, where <math>m \ge n</math>. A key '''<math>x</math>''' is hashed using a hash function <math>h</math> to compute an index location <math>A[h(x)]</math> in the hash table, where <math>h(x) < m</math>. At this index, both the key and its associated value are stored. Storing the key alongside the value ensures that lookups can verify the key at the index to retrieve the correct value, even in the presence of collisions. Under reasonable assumptions, hash tables have better [[time complexity]] bounds on search, delete, and insert operations in comparison to [[self-balancing binary search tree]]s.{{r|hashhist|p=1}}
An [[associative array]] stores a [[Set (abstract data type)|set]] of (key, value) pairs and allows insertion, deletion, and lookup (search), with the constraint of [[unique key]]s. In the hash table implementation of associative arrays, an array <math>A</math> of length <math>m</math> is partially filled with <math>n</math> elements, where <math>m \ge n</math>. A key '''<math>x</math>''' is hashed using a hash function <math>h</math> to compute an index location <math>A[h(x)]</math> in the hash table, where <math>h(x) < m</math>. At this index, both the key and its associated value are stored. Storing the key alongside the value ensures that lookups can verify the key at the index to retrieve the correct value, even in the presence of collisions. The efficiency of a hash table depends on the load factor, defined as the ratio of the number of stored elements to the number of available slots, with lower load factors generally yielding faster operations.<ref>{{cite book |author-last1=Cormen |author-first1=T. H. |author-last2=Leiserson |author-first2=C. E. |author-last3=Rivest |author-first3=R. L. |author-link3=Ronald Rivest |author-last4=Stein |author-first4=C. |date=2009 |title=Introduction to Algorithms |edition=3rd |publisher=[[MIT Press]]}}</ref> Under reasonable assumptions, hash tables have better [[time complexity]] bounds on search, delete, and insert operations in comparison to [[self-balancing binary search tree]]s.{{r|hashhist|p=1}}


Hash tables are also commonly used to implement sets, by omitting the stored value for each key and merely tracking whether the key is present.{{r|hashhist|p=1}}
Hash tables are also commonly used to implement sets, by omitting the stored value for each key and merely tracking whether the key is present.{{r|hashhist|p=1}}


===Load factor===
=== Load factor ===
A ''load factor'' <math>\alpha</math> is a critical statistic of a hash table, and is defined as follows:<ref name="Cormen et al" />
A ''load factor'' <math>\alpha</math> is a critical statistic of a hash table, and is defined as follows:<ref name="Cormen et al" />
<math display="block">\text{load factor}\ (\alpha) = \frac{n}{m},</math>
<math display="block">\text{load factor}\ (\alpha) = \frac{n}{m},</math>
Line 61: Line 61:
* <math>m</math> is the number of buckets.
* <math>m</math> is the number of buckets.


The performance of the hash table deteriorates as the load factor <math>\alpha</math> increases.{{r|hashhist|p=2}}
The performance of the hash table deteriorates as the load factor <math>\alpha</math> increases.{{r|hashhist|p=2}} In the limit of large <math>m</math> and <math>n</math>, the number of keys in each bucket approximately follows a [[Poisson distribution]] with expectation <math>\lambda=\alpha</math> for an ideally random [[hash function]].
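As a rough illustration of the Poisson approximation, the following Python sketch computes the expected fraction of buckets that hold exactly <math>k</math> keys at a given load factor. The function name <code>bucket_occupancy</code> is hypothetical, and an ideally random hash function is assumed.

```python
import math


def bucket_occupancy(alpha: float, k: int) -> float:
    """Poisson probability that a bucket holds exactly k keys at load factor alpha."""
    return math.exp(-alpha) * alpha ** k / math.factorial(k)


# At load factor 1, a little over a third of the buckets are expected to be empty.
empty_fraction = bucket_occupancy(1.0, 0)
```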


The software typically ensures that the load factor <math>\alpha</math> remains below a certain constant, <math>\alpha_{\max}</math>. This helps maintain good performance. Therefore, a common approach is to resize or "rehash" the hash table whenever the load factor <math>\alpha</math> reaches <math>\alpha_{\max}</math>. Similarly the table may also be resized if the load factor drops below <math>\alpha_{\max}/4</math>.<ref name="cornell08">{{cite web|url=https://www.cs.cornell.edu/courses/cs312/2008sp/lectures/lec20.html|title=CS 312: Hash tables and amortized analysis|publisher=[[Cornell University]], Department of Computer Science|first=Andrew|last=Mayers|access-date=26 October 2021|year=2008|archive-url=https://web.archive.org/web/20210426052033/http://www.cs.cornell.edu/courses/cs312/2008sp/lectures/lec20.html|archive-date=26 April 2021|url-status=live|via=cs.cornell.edu}}</ref>
The software typically ensures that the load factor <math>\alpha</math> remains below a certain constant, <math>\alpha_{\max}</math>. This helps maintain good performance. Therefore, a common approach is to resize or "rehash" the hash table whenever the load factor <math>\alpha</math> reaches <math>\alpha_{\max}</math>. Similarly the table may also be resized if the load factor drops below <math>\alpha_{\max}/4</math>.<ref name="cornell08">{{cite web|url=https://www.cs.cornell.edu/courses/cs312/2008sp/lectures/lec20.html|title=CS 312: Hash tables and amortized analysis|publisher=[[Cornell University]], Department of Computer Science|first=Andrew|last=Mayers|access-date=26 October 2021|year=2008|archive-url=https://web.archive.org/web/20210426052033/http://www.cs.cornell.edu/courses/cs312/2008sp/lectures/lec20.html|archive-date=26 April 2021|url-status=live|via=cs.cornell.edu}}</ref>
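The resizing rule can be sketched in Python as follows. The function name <code>next_capacity</code>, the default <math>\alpha_{\max}</math> of 0.75, and the choice to double or halve the bucket count are assumptions of this sketch. The text above only states the two thresholds.

```python
def next_capacity(n: int, m: int, alpha_max: float = 0.75) -> int:
    """Return the bucket count to use for n stored entries and m current buckets.

    Doubling and halving are assumptions of this sketch; only the thresholds
    alpha_max and alpha_max / 4 come from the surrounding text.
    """
    alpha = n / m
    if alpha >= alpha_max:
        return m * 2      # grow: the load factor reached the upper bound
    if alpha < alpha_max / 4 and m > 1:
        return m // 2     # shrink: the table is mostly empty
    return m              # the load factor is in the acceptable range
```

With these thresholds, a table that has just been halved has a load factor below <math>\alpha_{\max}/2</math>, so a single subsequent insertion does not immediately trigger another resize.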


==== Load factor for separate chaining ====
==== Load factor for separate chaining ====
With separate chaining hash tables, each slot of the bucket array stores a pointer to a list or array of data.<ref name="plank" />
With separate chaining hash tables, each slot of the bucket array stores a pointer to a list or array of data.<ref name="plank" />


Line 74: Line 73:


==== Load factor for open addressing ====
==== Load factor for open addressing ====
With open addressing, each slot of the bucket array holds exactly one item. Therefore an open-addressed hash table cannot have a load factor greater than 1.<ref name="plank" >
With open addressing, each slot of the bucket array holds exactly one item. Therefore an open-addressed hash table cannot have a load factor greater than 1.<ref name="plank" >
James S. Plank and Brad Vander Zanden.
James S. Plank and Brad Vander Zanden.
[http://web.eecs.utk.edu/~bvanderz/teaching/cs140Sp15/Notes/Hashing/ "CS140 Lecture notes -- Hashing"].
[https://web.eecs.utk.edu/~bvanderz/teaching/cs140Sp15/Notes/Hashing/ "CS140 Lecture notes -- Hashing"].
</ref>
</ref>


Line 86: Line 84:
With open addressing, acceptable figures of max load factor <math>\alpha_{\max}</math> should range around 0.6 to 0.75.<ref>{{cite journal |last1=Maurer |first1=W. D. |last2=Lewis |first2=T. G. |title=Hash Table Methods |journal=ACM Computing Surveys |date=March 1975 |volume=7 |issue=1 |pages=5–19 |doi=10.1145/356643.356645 |s2cid=17874775 }}</ref>{{r|owo03|p=110}}
With open addressing, acceptable figures of max load factor <math>\alpha_{\max}</math> should range around 0.6 to 0.75.<ref>{{cite journal |last1=Maurer |first1=W. D. |last2=Lewis |first2=T. G. |title=Hash Table Methods |journal=ACM Computing Surveys |date=March 1975 |volume=7 |issue=1 |pages=5–19 |doi=10.1145/356643.356645 |s2cid=17874775 }}</ref>{{r|owo03|p=110}}


==Hash function==
== Hash function ==
 
A [[hash function]] <math>h : U \rightarrow \{0, ..., m-1\}</math> maps the universe <math>U</math> of keys to indices or slots within the table, that is, <math>h(x) \in \{0, ..., m-1\}</math> for <math>x \in U</math>. The conventional implementations of hash functions are based on the ''integer universe assumption'' that all elements of the table stem from the universe <math>U = \{0, ..., u - 1\}</math>, where the [[bit length]] of <math>u</math> is confined within the [[Word (computer architecture)|word size]] of a [[computer architecture]].{{r|hashhist|p=2}}
A [[hash function]] <math>h : U \rightarrow \{0, ..., m-1\}</math> maps the universe <math>U</math> of keys to indices or slots within the table, that is, <math>h(x) \in \{0, ..., m-1\}</math> for <math>x \in U</math>. The conventional implementations of hash functions are based on the ''integer universe assumption'' that all elements of the table stem from the universe <math>U = \{0, ..., u - 1\}</math>, where the [[bit length]] of <math>u</math> is confined within the [[Word (computer architecture)|word size]] of a [[computer architecture]].{{r|hashhist|p=2}}


A hash function <math>h</math> is said to be [[perfect hash function|perfect]] for a given set <math>S</math> if it is [[injective function|injective]] on <math>S</math>, that is, if each element <math>x \in S</math> maps to a different value in <math>\{0, ..., m-1\}</math>.<ref name="Yi06">{{cite conference | last1 = Lu | first1 = Yi | last2 = Prabhakar | first2 = Balaji | last3 = Bonomi | first3 = Flavio | doi = 10.1109/ISIT.2006.261567 | conference = 2006 IEEE International Symposium on Information Theory | pages = 2774–2778 | title = Perfect Hashing for Network Applications | year = 2006| isbn = 1-4244-0505-X | s2cid = 1494710 }}</ref><ref name="CHD">{{cite conference | last1 = Belazzougui | first1 = Djamal | last2 = Botelho | first2 = Fabiano C. | last3 = Dietzfelbinger | first3 = Martin | title = Hash, displace, and compress | url = http://cmph.sourceforge.net/papers/esa09.pdf | doi = 10.1007/978-3-642-04128-0_61 | location = Berlin | mr = 2557794 | pages = 682–693 | publisher = Springer | series = [[Lecture Notes in Computer Science]] | book-title = Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7–9, 2009, Proceedings | volume = 5757 | year = 2009| citeseerx = 10.1.1.568.130}}</ref> A perfect hash function can be created if all the keys are known ahead of time.<ref name="Yi06" />
A hash function <math>h</math> is said to be [[perfect hash function|perfect]] for a given set <math>S</math> if it is [[injective function|injective]] on <math>S</math>, that is, if each element <math>x \in S</math> maps to a different value in <math>\{0, ..., m-1\}</math>.<ref name="Yi06">{{cite conference | last1 = Lu | first1 = Yi | last2 = Prabhakar | first2 = Balaji | last3 = Bonomi | first3 = Flavio | doi = 10.1109/ISIT.2006.261567 | conference = 2006 IEEE International Symposium on Information Theory | pages = 2774–2778 | title = Perfect Hashing for Network Applications | year = 2006| isbn = 1-4244-0505-X | s2cid = 1494710 }}</ref><ref name="CHD">{{cite conference | last1 = Belazzougui | first1 = Djamal | last2 = Botelho | first2 = Fabiano C. | last3 = Dietzfelbinger | first3 = Martin | title = Hash, displace, and compress | url = https://cmph.sourceforge.net/papers/esa09.pdf | doi = 10.1007/978-3-642-04128-0_61 | location = Berlin | mr = 2557794 | pages = 682–693 | publisher = Springer | series = [[Lecture Notes in Computer Science]] | book-title = Algorithms—ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7–9, 2009, Proceedings | volume = 5757 | year = 2009| citeseerx = 10.1.1.568.130}}</ref> A perfect hash function can be created if all the keys are known ahead of time.<ref name="Yi06" />


=== Integer universe assumption ===
=== Integer universe assumption ===
Line 104: Line 101:
<math display="block">h(x) = \lfloor m \bigl((xA) \bmod 1\bigr) \rfloor</math>
<math display="block">h(x) = \lfloor m \bigl((xA) \bmod 1\bigr) \rfloor</math>
where <math>A</math> is a non-integer [[Real number|real-valued constant]] and <math>m</math> is the size of the table. An advantage of hashing by multiplication is that the value of <math>m</math> is not critical.{{r|hashhist|pp=2–3}} Although any value <math>A</math> produces a hash function, [[Donald Knuth]] suggests using the [[golden ratio]].{{r|hashhist|p=3}}
where <math>A</math> is a non-integer [[Real number|real-valued constant]] and <math>m</math> is the size of the table. An advantage of hashing by multiplication is that the value of <math>m</math> is not critical.{{r|hashhist|pp=2–3}} Although any value <math>A</math> produces a hash function, [[Donald Knuth]] suggests using the [[golden ratio]].{{r|hashhist|p=3}}
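A minimal floating-point sketch of hashing by multiplication in Python is given below. The name <code>multiplicative_hash</code> is hypothetical, and the constant uses the fractional part of the golden ratio, <math>(\sqrt{5}-1)/2</math>, as one possible reading of Knuth's suggestion.

```python
import math

# Fractional part of the golden ratio, (sqrt(5) - 1) / 2; an assumed choice of A.
A = (math.sqrt(5) - 1) / 2


def multiplicative_hash(x: int, m: int) -> int:
    """Compute floor(m * ((x * A) mod 1)) for a non-negative integer key x."""
    fractional = (x * A) % 1.0   # keep only the fractional part of x * A
    return math.floor(m * fractional)
```

Floating-point rounding limits this sketch to moderately sized keys. Practical implementations often use fixed-width integer arithmetic instead.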
==== String hashing ====
Commonly a string is used as a key to the hash function. Stroustrup<ref>
{{cite book | last = Stroustrup | first = Bjarne
| title = The C++ Programming Language Third Edition
| page=503 | publisher = Addison-Wesley
| location = Reading Massachusetts | date = 1997
| isbn = 0-201-88954-4
}}
</ref> describes a simple hash function in which an unsigned integer, initially zero, is repeatedly shifted left by one bit and then XORed with the integer value of the next character. The resulting hash value is then taken modulo the table size. If the left shift is not circular, the string length in characters should be at least eight less than the width of the unsigned integer in bits, so that no bits contributed by the earliest characters are shifted out. Another common way to hash a string to an integer is with a [[rolling hash|polynomial rolling hash function]].
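Both string hashing approaches can be sketched in Python as follows. The function names, the 32-bit word width, and the base 31 are assumptions of this sketch. Python integers are unbounded, so a mask emulates the fixed-width unsigned integer.

```python
def shift_xor_hash(key: str, table_size: int, word_bits: int = 32) -> int:
    """Shift-and-xor string hash in the style described above.

    word_bits emulates the width of the unsigned integer; the mask is an
    assumption of this sketch because Python integers do not overflow.
    """
    mask = (1 << word_bits) - 1
    h = 0
    for ch in key:
        h = ((h << 1) & mask) ^ ord(ch)   # non-circular left shift, then xor
    return h % table_size


def polynomial_hash(key: str, table_size: int, base: int = 31) -> int:
    """Polynomial string hash evaluated by Horner's rule, modulo the table size."""
    h = 0
    for ch in key:
        h = (h * base + ord(ch)) % table_size
    return h
```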


===Choosing a hash function===
===Choosing a hash function===
[[Uniform distribution (discrete)|Uniform distribution]] of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a [[Pearson's chi-squared test#Discrete uniform distribution|Pearson's chi-squared test]] for discrete uniform distributions.<ref name="chernoff">{{Cite journal | first=Karl |last=Pearson |author1-link=Karl Pearson | year = 1900 | title = On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling | journal = Philosophical Magazine |series=Series 5 | volume = 50 | number = 302 | pages = 157–175 | doi=10.1080/14786440009463897 |url=https://zenodo.org/record/1430618 }}</ref><ref name="plackett">{{Cite journal |first=Robin |last=Plackett |author1-link=Robin Plackett | year = 1983 | title =  Karl Pearson and the Chi-Squared Test | journal = International Statistical Review | volume = 51 | number = 1 | pages = 59–72 | doi=10.2307/1402731 |jstor=1402731 }}</ref>
[[Uniform distribution (discrete)|Uniform distribution]] of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a [[Pearson's chi-squared test#Discrete uniform distribution|Pearson's chi-squared test]] for discrete uniform distributions.<ref name="chernoff">{{Cite journal | first=Karl |last=Pearson |author1-link=Karl Pearson | year = 1900 | title = On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling | journal = Philosophical Magazine |series=Series 5 | volume = 50 | number = 302 | pages = 157–175 | doi=10.1080/14786440009463897 |url=https://zenodo.org/record/1430618 }}</ref><ref name="plackett">{{Cite journal |first=Robin |last=Plackett |author1-link=Robin Plackett | year = 1983 | title =  Karl Pearson and the Chi-Squared Test | journal = International Statistical Review | volume = 51 | number = 1 | pages = 59–72 | doi=10.2307/1402731 |jstor=1402731 }}</ref>
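An empirical uniformity check might be sketched in Python as follows. The function name <code>chi_squared_statistic</code> is hypothetical. The sketch only computes Pearson's statistic from observed bucket counts. Comparing the result against a chi-squared distribution with <math>m-1</math> degrees of freedom is left out.

```python
from collections import Counter


def chi_squared_statistic(hashes: list, m: int) -> float:
    """Pearson's chi-squared statistic for hash values in range(m) against a uniform distribution."""
    counts = Counter(hashes)
    expected = len(hashes) / m   # expected number of keys per bucket
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(m))
```

A statistic of zero means every bucket received exactly the expected number of keys. Larger values indicate a more uneven spread.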
Line 117: Line 123:
==Collision resolution==
==Collision resolution==
{{see also| 2-choice hashing}}
{{see also| 2-choice hashing}}
A search algorithm that uses hashing consists of two parts. The first part is computing a [[hash function]] that transforms the search key into an [[array index]]. Ideally, no two search keys hash to the same array index. However, this is not always the case, and it is impossible to guarantee for data that is not known in advance.<ref name="donald3">{{cite book|title=The Art of Computer Programming: Volume 3: Sorting and Searching|publisher= Addison-Wesley Professional |author=[[Donald E. Knuth]]|date=24 April 1998|url=https://dl.acm.org/doi/10.5555/280635|isbn=978-0-201-89685-5}}</ref>{{rp|p=515}} Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.<ref name="algo1rob">{{cite book|first1=Robert|last1=Sedgewick|first2=Kevin|last2=Wayne|url=https://algs4.cs.princeton.edu/|via=[[Princeton University]], Department of Computer Science|title=Algorithms|edition=4|volume=1|publisher= Addison-Wesley Professional |year=2011|author-link1=Robert Sedgewick (computer scientist)}}</ref>{{rp|p=458}}
A search algorithm that uses hashing consists of two parts. The first part is computing a [[hash function]] that transforms the search key into an [[array index]]. Ideally, no two search keys hash to the same array index. However, this is not always the case, and it is impossible to guarantee for data that is not known in advance.<ref name="knuth" />{{rp|p=515}} Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.<ref name="algo1rob">{{cite book|first1=Robert|last1=Sedgewick|first2=Kevin|last2=Wayne|url=https://algs4.cs.princeton.edu/|via=[[Princeton University]], Department of Computer Science|title=Algorithms|edition=4|volume=1|publisher= Addison-Wesley Professional |year=2011|author-link1=Robert Sedgewick (computer scientist)}}</ref>{{rp|p=458}}


===Separate chaining===
===Separate chaining===
[[File:Hash table 5 0 1 1 1 1 1 LL.svg|thumb|450px|right|Hash collision resolved by separate chaining]]
[[File:Hash table 5 0 1 1 1 1 1 LL.svg|thumb|450px|right|Hash collision resolved by separate chaining]]
[[File:Hash table 5 0 1 1 1 1 0 LL.svg|thumb|right|500px|Hash collision by separate chaining with head records in the bucket array.]]
[[File:Hash table 5 0 1 1 1 1 0 LL.svg|thumb|right|500px|Hash collision by separate chaining with head records in the bucket array]]


Separate chaining builds a [[linked list]] of [[key–value pair]]s for each array index. Colliding items are chained together in a single linked list, which can be traversed to find the item with a given search key.{{r|algo1rob|p=464}} Collision resolution through chaining with linked lists is a common way to implement hash tables. Let <math>T</math> and <math>x</math> be the hash table and the node respectively; the operations are as follows:<ref name="cormenalgo01">{{cite book|last1=Cormen |first1=Thomas H. |author1-link=Thomas H. Cormen|last2=Leiserson |first2=Charles E. |author2-link=Charles E. Leiserson|last3=Rivest |first3=Ronald L. |author3-link=Ronald L. Rivest|last4=Stein |first4=Clifford |author4-link=Clifford Stein| title = Introduction to Algorithms| publisher = [[Massachusetts Institute of Technology]]| year= 2001| isbn = 978-0-262-53196-2| edition = 2nd|chapter = Chapter 11: Hash Tables|title-link=Introduction to Algorithms }}</ref>{{rp|p=258}}
Separate chaining builds a [[linked list]] of [[key–value pair]]s for each array index. Colliding items are chained together in a single linked list, which can be traversed to find the item with a given search key.{{r|algo1rob|p=464}} Collision resolution through chaining with linked lists is a common way to implement hash tables. Let <math>T</math> and <math>x</math> be the hash table and the node respectively; the operations are as follows:<ref name="cormenalgo01">{{cite book|last1=Cormen |first1=Thomas H. |author1-link=Thomas H. Cormen|last2=Leiserson |first2=Charles E. |author2-link=Charles E. Leiserson|last3=Rivest |first3=Ronald L. |author3-link=Ronald L. Rivest|last4=Stein |first4=Clifford |author4-link=Clifford Stein| title = Introduction to Algorithms| publisher = [[Massachusetts Institute of Technology]]| year= 2001| isbn = 978-0-262-53196-2| edition = 2nd|chapter = Chapter 11: Hash Tables|title-link=Introduction to Algorithms }}</ref>{{rp|p=258}}
Line 134: Line 140:
   ''delete'' ''x'' ''from the linked list'' ''T''[''h''(''k'')]
   ''delete'' ''x'' ''from the linked list'' ''T''[''h''(''k'')]
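
The chained operations on <math>T</math> can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the cited textbook's code: the names <code>Node</code> and <code>ChainedHashTable</code> are hypothetical, Python's built-in <code>hash()</code> stands in for <math>h</math>, and the table is never resized.

```python
class Node:
    """One element x of a chain: a key-value pair plus a link to the next node."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedHashTable:
    """Hypothetical fixed-size table T whose slots hold the heads of linked lists."""

    def __init__(self, num_buckets=8):
        self.table = [None] * num_buckets

    def _h(self, key):
        # h(k): built-in hash reduced to a slot index (an assumption of this sketch).
        return hash(key) % len(self.table)

    def search(self, key):
        # Walk the chain T[h(k)] until the key is found or the chain ends.
        node = self.table[self._h(key)]
        while node is not None:
            if node.key == key:
                return node
            node = node.next
        return None

    def insert(self, key, value):
        node = self.search(key)
        if node is not None:
            node.value = value  # key already present: update in place
            return
        i = self._h(key)
        # Insert the new node at the head of the list T[h(k)].
        self.table[i] = Node(key, value, self.table[i])

    def delete(self, key):
        # Unlink the node with the given key from the list T[h(k)].
        i = self._h(key)
        prev, node = None, self.table[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.table[i] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False
```

With four buckets, the integer keys 1 and 5 land in the same slot in CPython, so both end up on one chain and remain individually retrievable.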


If the element is comparable either [[Sequence#Analysis|numerically]] or [[Lexicographic order|lexically]], and inserted into the list by maintaining the [[total order]], it results in faster termination of the unsuccessful searches.{{r|donald3|pp=520-521}}
If the element is comparable either [[Sequence#Analysis|numerically]] or [[Lexicographic order|lexically]], and inserted into the list by maintaining the [[total order]], it results in faster termination of the unsuccessful searches.{{r|knuth|pp=520-521}}


====Other data structures for separate chaining====
====Other data structures for separate chaining====
If the keys are [[total order|ordered]], it could be efficient to use "[[Optimal binary search tree|self-organizing]]" concepts such as using a [[self-balancing binary search tree]], through which the [[worst-case complexity|theoretical worst case]] could be brought down to <math>O(\log{n})</math>, although it introduces additional complexities.{{r|knuth|p=521}}


If the keys are [[total order|ordered]], it could be efficient to use "[[Optimal binary search tree|self-organizing]]" concepts such as using a [[self-balancing binary search tree]], through which the [[Worst-case complexity|theoretical worst case]] could be brought down to <math>O(\log{n})</math>, although it introduces additional complexities.{{r|donald3|p=521}}
In [[dynamic perfect hashing]], two-level hash tables are used to reduce the look-up complexity to be a guaranteed <math>O(1)</math> in the worst case. In this technique, the buckets of <math>k</math> entries are organized as [[Perfect hash function|perfect hash tables]] with <math>k^2</math> slots providing constant worst-case lookup time, and low amortized time for insertion.<ref>{{cite web |first1=Erik |last1=Demaine |first2=Jeff |last2=Lind |work=6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory |date=Spring 2003 |url=https://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf |title=Lecture 2 |access-date=2008-06-30 |url-status=live |archive-url=https://web.archive.org/web/20100615203901/http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf |archive-date=June 15, 2010 |df=mdy-all }}</ref> A study shows array-based separate chaining to be 97% more performant when compared to the standard linked list method under heavy load.{{r|nick05|p=99}}
 
In [[dynamic perfect hashing]], two-level hash tables are used to reduce the look-up complexity to be a guaranteed <math>O(1)</math> in the worst case. In this technique, the buckets of <math>k</math> entries are organized as [[Perfect hash function|perfect hash tables]] with <math>k^2</math> slots providing constant worst-case lookup time, and low amortized time for insertion.<ref>{{cite web |first1=Erik |last1=Demaine |first2=Jeff |last2=Lind |work=6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory |date=Spring 2003 |url=http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf |title=Lecture 2 |access-date=2008-06-30 |url-status=live |archive-url=https://web.archive.org/web/20100615203901/http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf |archive-date=June 15, 2010 |df=mdy-all }}</ref> A study shows array-based separate chaining to be 97% more performant when compared to the standard linked list method under heavy load.{{r|nick05|p=99}}


Techniques such as using a [[fusion tree]] for each bucket also result in constant time for all operations with high probability.<ref>{{cite journal | last = Willard | first = Dan E. | author-link = Dan Willard | doi = 10.1137/S0097539797322425 | issue = 3 | journal = [[SIAM Journal on Computing]] | mr = 1740562 | pages = 1030–1049 | title = Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree | volume = 29 | year = 2000}}.</ref>
Techniques such as using a [[fusion tree]] for each bucket also result in constant time for all operations with high probability.<ref>{{cite journal | last = Willard | first = Dan E. | author-link = Dan Willard | doi = 10.1137/S0097539797322425 | issue = 3 | journal = [[SIAM Journal on Computing]] | mr = 1740562 | pages = 1030–1049 | title = Examining computational geometry, van Emde Boas trees, and hashing from the perspective of the fusion tree | volume = 29 | year = 2000}}.</ref>


==== Caching and locality of reference ====
====Caching and locality of reference====
The linked list of a separate chaining implementation may not be [[Cache-oblivious algorithm|cache-conscious]] because of poor [[spatial locality]] ([[locality of reference]]) when the nodes of the linked list are scattered across memory; list traversal during insert and search may therefore entail [[CPU cache]] inefficiencies.<ref name="nick05">{{cite book |doi=10.1007/11575832_1 |chapter=Enhanced Byte Codes with Restricted Prefix Properties |title=String Processing and Information Retrieval |series=Lecture Notes in Computer Science |date=2005 |last1=Culpepper |first1=J. Shane |last2=Moffat |first2=Alistair |volume=3772 |pages=1–12 |isbn=978-3-540-29740-6 }}</ref>{{rp|p=91}}
The linked list of a separate chaining implementation may not be [[Cache-oblivious algorithm|cache-conscious]] because of poor [[spatial locality]] ([[locality of reference]]) when the nodes of the linked list are scattered across memory; list traversal during insert and search may therefore entail [[CPU cache]] inefficiencies.<ref name="nick05">{{cite book |doi=10.1007/11575832_1 |chapter=Enhanced Byte Codes with Restricted Prefix Properties |title=String Processing and Information Retrieval |series=Lecture Notes in Computer Science |date=2005 |last1=Culpepper |first1=J. Shane |last2=Moffat |first2=Alistair |volume=3772 |pages=1–12 |isbn=978-3-540-29740-6 }}</ref>{{rp|p=91}}


In [[cache-oblivious algorithm|cache-conscious variants]] of collision resolution through separate chaining, a [[dynamic array]], found to be more [[CPU cache|cache-friendly]], is used in place of the linked list or self-balancing binary search tree that is usually deployed, since the [[Memory management (operating systems)#Single contiguous allocation|contiguous allocation]] pattern of the array could be exploited by [[Cache prefetching|hardware-cache prefetchers]] (such as the [[translation lookaside buffer]]), resulting in reduced access time and memory consumption.<ref>{{cite journal |last1=Askitis |first1=Nikolas |last2=Sinha |first2=Ranjan |title=Engineering scalable, cache and space efficient tries for strings |journal=The VLDB Journal |date=October 2010 |volume=19 |issue=5 |pages=633–660 |doi=10.1007/s00778-010-0183-9 }}</ref><ref>{{Cite conference | title=Cache-conscious Collision Resolution in String Hash Tables | first1=Nikolas | last1=Askitis | first2=Justin | last2=Zobel |date=October 2005 | isbn=978-3-540-29740-6 | pages=91–102 | book-title=Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005) | doi=10.1007/11575832_11 | volume=3772/2005}}</ref><ref>{{Cite conference |title      = Fast and Compact Hash Tables for Integer Keys |first1      = Nikolas |last1      = Askitis |year        = 2009 |isbn        = 978-1-920682-72-9 |url        = http://crpit.com/confpapers/CRPITV91Askitis.pdf |pages      = 113–122 |book-title    = Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009) |volume      = 91 |url-status  = dead |archive-url  = https://web.archive.org/web/20110216180225/http://crpit.com/confpapers/CRPITV91Askitis.pdf |archive-date = February 16, 2011 |df          = mdy-all |access-date = June 13, 2010 }}</ref>
In [[cache-oblivious algorithm|cache-conscious variants]] of collision resolution through separate chaining, a [[dynamic array]], found to be more [[CPU cache|cache-friendly]], is used in place of the linked list or self-balancing binary search tree that is usually deployed, since the [[Memory management (operating systems)#Single contiguous allocation|contiguous allocation]] pattern of the array could be exploited by [[Cache prefetching|hardware-cache prefetchers]] (such as the [[translation lookaside buffer]]), resulting in reduced access time and memory consumption.<ref>{{cite journal |last1=Askitis |first1=Nikolas |last2=Sinha |first2=Ranjan |title=Engineering scalable, cache and space efficient tries for strings |journal=The VLDB Journal |date=October 2010 |volume=19 |issue=5 |pages=633–660 |doi=10.1007/s00778-010-0183-9 }}</ref><ref>{{Cite conference | title=Cache-conscious Collision Resolution in String Hash Tables | first1=Nikolas | last1=Askitis | first2=Justin | last2=Zobel |date=October 2005 | isbn=978-3-540-29740-6 | pages=91–102 | book-title=Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005) | doi=10.1007/11575832_11 | volume=3772/2005}}</ref><ref>{{Cite conference |title      = Fast and Compact Hash Tables for Integer Keys |first1      = Nikolas |last1      = Askitis |year        = 2009 |isbn        = 978-1-920682-72-9 |url        = https://crpit.com/confpapers/CRPITV91Askitis.pdf |pages      = 113–122 |book-title    = Proceedings of the 32nd Australasian Computer Science Conference (ACSC 2009) |volume      = 91 |url-status  = dead |archive-url  = https://web.archive.org/web/20110216180225/http://crpit.com/confpapers/CRPITV91Askitis.pdf |archive-date = February 16, 2011 |df          = mdy-all |access-date = June 13, 2010 }}</ref>


===Open addressing===
===Open addressing===
Line 163: Line 168:
The performance of open addressing may be slower compared to separate chaining since the probe sequence increases when the load factor <math>\alpha</math> approaches 1.<ref name="cornell08" />{{r|nick05|p=93}} The probing results in an [[infinite loop]] if the load factor reaches 1, in the case of a completely filled table.{{r|algo1rob|p=471}} The [[Average-case complexity|average cost]] of linear probing depends on the hash function's ability to [[Probability distribution|distribute]] the elements [[Continuous uniform distribution|uniformly]] throughout the table to avoid [[Cluster analysis|clustering]], since formation of clusters would result in increased search time.{{r|algo1rob|p=472}}
The performance of open addressing may be slower compared to separate chaining since the probe sequence increases when the load factor <math>\alpha</math> approaches 1.<ref name="cornell08" />{{r|nick05|p=93}} The probing results in an [[infinite loop]] if the load factor reaches 1, in the case of a completely filled table.{{r|algo1rob|p=471}} The [[Average-case complexity|average cost]] of linear probing depends on the hash function's ability to [[Probability distribution|distribute]] the elements [[Continuous uniform distribution|uniformly]] throughout the table to avoid [[Cluster analysis|clustering]], since formation of clusters would result in increased search time.{{r|algo1rob|p=472}}


==== Caching and locality of reference ====
====Caching and locality of reference====
Since the slots are located in successive locations, linear probing could lead to better utilization of [[CPU cache]] due to [[locality of reference]]s resulting in reduced [[memory latency]].<ref name="Cuckoo" />
Since the slots are located in successive locations, linear probing could lead to better utilization of [[CPU cache]] due to [[locality of reference]]s resulting in reduced [[memory latency]].<ref name="Cuckoo" />
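
Linear probing over successive slots can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the class name <code>LinearProbingTable</code> is hypothetical, <code>None</code> is assumed never to be used as a key, deletion and resizing are omitted, and the probe is capped at one pass over the table so that a completely filled table raises an error instead of looping forever.

```python
class LinearProbingTable:
    """Hypothetical open-addressing table that probes consecutive slots."""

    def __init__(self, capacity=8):
        self.keys = [None] * capacity    # None marks an empty slot
        self.values = [None] * capacity

    def _probe(self, key):
        # Visit each slot once, starting at the key's home slot and wrapping around.
        n = len(self.keys)
        start = hash(key) % n
        for step in range(n):
            yield (start + step) % n

    def insert(self, key, value):
        for i in self._probe(key):
            if self.keys[i] is None:
                self.keys[i], self.values[i] = key, value
                return
            if self.keys[i] == key:
                self.values[i] = value   # existing key: overwrite its value
                return
        # Every slot was occupied by another key (load factor of 1).
        raise RuntimeError("table is full")

    def search(self, key):
        for i in self._probe(key):
            if self.keys[i] is None:
                return None              # an empty slot ends the probe sequence
            if self.keys[i] == key:
                return self.values[i]
        return None
```

Keys that share a home slot occupy adjacent array positions, which is the access pattern the paragraph above associates with better cache utilization.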


Line 178: Line 183:


=====Hopscotch hashing=====
=====Hopscotch hashing=====
{{main| Hopscotch hashing}}
{{main|Hopscotch hashing}}


[[Hopscotch hashing]] is an open addressing based algorithm which combines the elements of [[cuckoo hashing]], [[linear probing]] and chaining through the notion of a ''neighbourhood'' of buckets: the subsequent buckets around any given occupied bucket, also called a "virtual" bucket.<ref name="nir08">{{cite book |doi=10.1007/978-3-540-87779-0_24 |chapter=Hopscotch Hashing |title=Distributed Computing |series=Lecture Notes in Computer Science |date=2008 |last1=Herlihy |first1=Maurice |last2=Shavit |first2=Nir |last3=Tzafrir |first3=Moran |volume=5218 |pages=350–364 |isbn=978-3-540-87778-3 }}</ref>{{rp|pp=351–352}} The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in [[Concurrent computing|concurrent settings]], and is thus well suited for implementing a resizable [[concurrent hash table]].{{r|nir08|p=350}} The neighbourhood characteristic of hopscotch hashing guarantees that the cost of finding the desired item in any given bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to move an item into its neighbourhood, possibly at the cost of displacing other items.{{r|nir08|p=352}}
[[Hopscotch hashing]] is an open addressing based algorithm which combines the elements of [[cuckoo hashing]], [[linear probing]] and chaining through the notion of a ''neighbourhood'' of buckets: the subsequent buckets around any given occupied bucket, also called a "virtual" bucket.<ref name="nir08">{{cite book |doi=10.1007/978-3-540-87779-0_24 |chapter=Hopscotch Hashing |title=Distributed Computing |series=Lecture Notes in Computer Science |date=2008 |last1=Herlihy |first1=Maurice |last2=Shavit |first2=Nir |last3=Tzafrir |first3=Moran |volume=5218 |pages=350–364 |isbn=978-3-540-87778-3 }}</ref>{{rp|pp=351–352}} The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in [[Concurrent computing|concurrent settings]], and is thus well suited for implementing a resizable [[concurrent hash table]].{{r|nir08|p=350}} The neighbourhood characteristic of hopscotch hashing guarantees that the cost of finding the desired item in any given bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to move an item into its neighbourhood, possibly at the cost of displacing other items.{{r|nir08|p=352}}
Line 185: Line 190:


=====Robin Hood hashing=====
=====Robin Hood hashing=====
Robin Hood hashing is an open addressing based collision resolution algorithm; collisions are resolved by favouring the displacement of the element that is farthest from its "home location" (the bucket to which the item was hashed), that is, the element with the longest ''probe sequence length'' (PSL).<ref name="waterloo86">{{cite book|title=Robin Hood Hashing|first=Pedro|last=Celis|publisher=[[University of Waterloo]], Dept. of Computer Science|year=1986|url=https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf |location=Ontario, Canada|isbn= 978-0-315-29700-5 |oclc= 14083698|archive-url=https://web.archive.org/web/20211101071032/https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf|archive-date=1 November 2021|access-date=2 November 2021|url-status=live}}</ref>{{rp|p=12}} Although Robin Hood hashing does not change the [[Computational complexity theory|theoretical search cost]], it significantly affects the [[variance]] of the [[Probability distribution|distribution]] of the items on the buckets,<ref>{{cite journal |last1=Poblete |first1=P. V. |last2=Viola |first2=A. |title=Analysis of Robin Hood and Other Hashing Algorithms Under the Random Probing Model, With and Without Deletions |journal=Combinatorics, Probability and Computing |date=July 2019 |volume=28 |issue=4 |pages=600–617 |doi=10.1017/S0963548318000408 |s2cid=125374363 }}</ref>{{rp|p=2}} i.e. 
dealing with [[Cluster analysis|cluster]] formation in the hash table.<ref name="cornell14">{{cite web|url=https://www.cs.cornell.edu/courses/cs3110/2014fa/lectures/13/lec13.html|title= Lecture 13: Hash tables|publisher=[[Cornell University]], Department of Computer Science|first=Michael|last=Clarkson|access-date=1 November 2021|year=2014|archive-url=https://web.archive.org/web/20211007011300/https://www.cs.cornell.edu/courses/cs3110/2014fa/lectures/13/lec13.html|archive-date=7 October 2021|url-status=live|via=cs.cornell.edu}}</ref> Each node within the hash table that uses Robin Hood hashing should be augmented to store an extra PSL value.<ref>{{cite web|publisher=[[Cornell University]], Department of Computer Science|url=https://www.cs.cornell.edu/courses/JavaAndDS/files/hashing_RobinHood.pdf|title=JavaHyperText and Data Structure: Robin Hood Hashing|access-date=2 November 2021|first=David|last=Gries|year=2017|archive-url=https://web.archive.org/web/20210426051503/http://www.cs.cornell.edu/courses/JavaAndDS/files/hashing_RobinHood.pdf|archive-date=26 April 2021|url-status=live|via=cs.cornell.edu}}</ref> Let <math>x</math> be the key to be inserted, <math>x{.}\text{psl}</math> be the (incremental) PSL of <math>x</math>, <math>T</math> be the hash table and <math>j</math> be the index; the insertion procedure is as follows:{{r|waterloo86|pp=12-13}}<ref name="indiana88">{{cite tech report|first=Pedro|last=Celis|date=28 March 1988| number=246|institution=[[Indiana University]], Department of Computer Science|location=Bloomington, Indiana| url=https://legacy.cs.indiana.edu/ftp/techreports/TR246.pdf|archive-url=https://web.archive.org/web/20211103013505/https://legacy.cs.indiana.edu/ftp/techreports/TR246.pdf|archive-date=3 November 2021|access-date=2 November 2021|url-status=live| title=External Robin Hood Hashing}}</ref>{{rp|p=5}}
Robin Hood hashing is an open addressing based collision resolution algorithm; collisions are resolved by favouring the displacement of the element that is farthest from its "home location" (the bucket to which the item was hashed), that is, the element with the longest ''probe sequence length'' (PSL).<ref name="waterloo86">{{cite book|title=Robin Hood Hashing|first=Pedro|last=Celis|publisher=[[University of Waterloo]], Dept. of Computer Science|year=1986|url=https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf |location=Ontario, Canada|isbn= 978-0-315-29700-5 |oclc= 14083698|archive-url=https://web.archive.org/web/20211101071032/https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf|archive-date=1 November 2021|access-date=2 November 2021|url-status=live}}</ref>{{rp|p=12}} Although Robin Hood hashing does not change the [[Computational complexity theory|theoretical search cost]], it significantly affects the [[variance]] of the [[Probability distribution|distribution]] of the items on the buckets,<ref>{{cite journal |last1=Poblete |first1=P. V. |last2=Viola |first2=A. |title=Analysis of Robin Hood and Other Hashing Algorithms Under the Random Probing Model, With and Without Deletions |journal=Combinatorics, Probability and Computing |date=July 2019 |volume=28 |issue=4 |pages=600–617 |doi=10.1017/S0963548318000408 |s2cid=125374363 |doi-access=free }}</ref>{{rp|p=2}} i.e. 
dealing with [[Cluster analysis|cluster]] formation in the hash table.<ref name="cornell14">{{cite web|url=https://www.cs.cornell.edu/courses/cs3110/2014fa/lectures/13/lec13.html|title= Lecture 13: Hash tables|publisher=[[Cornell University]], Department of Computer Science|first=Michael|last=Clarkson|access-date=1 November 2021|year=2014|archive-url=https://web.archive.org/web/20211007011300/https://www.cs.cornell.edu/courses/cs3110/2014fa/lectures/13/lec13.html|archive-date=7 October 2021|url-status=live|via=cs.cornell.edu}}</ref> Each node within the hash table that uses Robin Hood hashing should be augmented to store an extra PSL value.<ref>{{cite web|publisher=[[Cornell University]], Department of Computer Science|url=https://www.cs.cornell.edu/courses/JavaAndDS/files/hashing_RobinHood.pdf|title=JavaHyperText and Data Structure: Robin Hood Hashing|access-date=2 November 2021|first=David|last=Gries|year=2017|archive-url=https://web.archive.org/web/20210426051503/http://www.cs.cornell.edu/courses/JavaAndDS/files/hashing_RobinHood.pdf|archive-date=26 April 2021|url-status=live|via=cs.cornell.edu}}</ref> Let <math>x</math> be the key to be inserted, <math>x{.}\text{psl}</math> be the (incremental) PSL of <math>x</math>, <math>T</math> be the hash table and <math>j</math> be the index; the insertion procedure is as follows:{{r|waterloo86|pp=12-13}}<ref name="indiana88">{{cite tech report|first=Pedro|last=Celis|date=28 March 1988| number=246|institution=[[Indiana University]], Department of Computer Science|location=Bloomington, Indiana| url=https://legacy.cs.indiana.edu/ftp/techreports/TR246.pdf|archive-url=https://web.archive.org/web/20211103013505/https://legacy.cs.indiana.edu/ftp/techreports/TR246.pdf|archive-date=3 November 2021|access-date=2 November 2021|url-status=live| title=External Robin Hood Hashing}}</ref>{{rp|p=5}}
* If <math>x{.}\text{psl}\ \le\ T[j]{.}\text{psl}</math>: the iteration goes into the next bucket without attempting an external probe.
* If <math>x{.}\text{psl}\ \le\ T[j]{.}\text{psl}</math>: the iteration goes into the next bucket without attempting an external probe.
* If <math>x{.}\text{psl}\ >\ T[j]{.}\text{psl}</math>: insert the item <math>x</math> into the bucket <math>j</math>; swap <math>x</math> with <math>T[j]</math>—let it be <math>x'</math>; continue the probe from the <math>(j+1)</math>th bucket to insert <math>x'</math>; repeat the procedure until every element is inserted.
* If <math>x{.}\text{psl}\ >\ T[j]{.}\text{psl}</math>: insert the item <math>x</math> into the bucket <math>j</math>; swap <math>x</math> with <math>T[j]</math>—let it be <math>x'</math>; continue the probe from the <math>(j+1)</math>th bucket to insert <math>x'</math>; repeat the procedure until every element is inserted.
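
The two cases above can be sketched in Python as follows. This is a hedged illustration, not the cited reports' code: the function name <code>robin_hood_insert</code> is hypothetical, the table is a plain list whose empty buckets are <code>None</code> and whose occupied buckets hold <code>(key, psl)</code> tuples, keys are assumed to be distinct, and linear probing with Python's built-in <code>hash()</code> is assumed for the probe sequence.

```python
def robin_hood_insert(table, key):
    """Insert key into table, displacing entries that are closer to home."""
    n = len(table)
    if all(slot is not None for slot in table):
        raise RuntimeError("table is full")

    j = hash(key) % n      # home bucket of the incoming key
    entry = (key, 0)       # (key, psl): psl counts buckets probed so far

    while True:
        if table[j] is None:
            table[j] = entry           # empty bucket: place the entry and stop
            return
        if entry[1] > table[j][1]:
            # The incoming entry is farther from its home than the resident:
            # it takes bucket j, and the displaced resident continues probing.
            table[j], entry = entry, table[j]
        # Otherwise (psl <= resident's psl) simply move on to the next bucket.
        entry = (entry[0], entry[1] + 1)
        j = (j + 1) % n
```

In CPython, where small non-negative integers hash to themselves, inserting 0, 8, 1 and then 16 into an eight-bucket table displaces key 1 from bucket 2 to bucket 3, because 16 has by then probed farther from its home bucket than 1 had.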
Line 208: Line 213:
The performance of a hash table depends on the hash function's ability to generate [[Low-discrepancy sequence|quasi-random numbers]] (<math>\sigma</math>) for entries in the hash table, where <math>K</math>, <math>n</math> and <math>h(x)</math> denote the key, the number of buckets and the hash function, such that <math>\sigma\ =\ h(K)\ \%\ n</math>. If the hash function generates the same <math>\sigma</math> for distinct keys (<math>K_1 \ne K_2,\ h(K_1)\ =\ h(K_2)</math>), the result is a ''collision'', which is dealt with in a variety of ways. The constant time complexity (<math>O(1)</math>) of the operations in a hash table presupposes that the hash function does not generate colliding indices; thus, the performance of the hash table is [[Proportionality (mathematics)#Direct proportionality|directly proportional]] to the chosen hash function's ability to [[Statistical dispersion|disperse]] the indices.<ref name="dijk10">{{cite web|title=Analysing and Improving Hash Table Performance|first=Tom Van|last=Dijk|publisher=[[University of Twente]]|location=[[Netherlands]]|url=https://www.tvandijk.nl/pdf/bscthesis.pdf|access-date=31 December 2021|archive-url=https://web.archive.org/web/20211106094558/http://www.tvandijk.nl/pdf/bscthesis.pdf|archive-date=6 November 2021|url-status=live|year=2010}}</ref>{{rp|1}} However, construction of such a hash function is [[NP-hardness|practically infeasible]]; implementations therefore depend on [[Use case|case-specific]] [[#Collision resolution|collision resolution techniques]] to achieve higher performance.{{r|dijk10|p=2}}
The performance of a hash table depends on the hash function's ability to generate [[Low-discrepancy sequence|quasi-random numbers]] (<math>\sigma</math>) for entries in the hash table, where <math>K</math>, <math>n</math> and <math>h(x)</math> denote the key, the number of buckets and the hash function, such that <math>\sigma\ =\ h(K)\ \%\ n</math>. If the hash function generates the same <math>\sigma</math> for distinct keys (<math>K_1 \ne K_2,\ h(K_1)\ =\ h(K_2)</math>), the result is a ''collision'', which is dealt with in a variety of ways. The constant time complexity (<math>O(1)</math>) of the operations in a hash table presupposes that the hash function does not generate colliding indices; thus, the performance of the hash table is [[Proportionality (mathematics)#Direct proportionality|directly proportional]] to the chosen hash function's ability to [[Statistical dispersion|disperse]] the indices.<ref name="dijk10">{{cite web|title=Analysing and Improving Hash Table Performance|first=Tom Van|last=Dijk|publisher=[[University of Twente]]|location=[[Netherlands]]|url=https://www.tvandijk.nl/pdf/bscthesis.pdf|access-date=31 December 2021|archive-url=https://web.archive.org/web/20211106094558/http://www.tvandijk.nl/pdf/bscthesis.pdf|archive-date=6 November 2021|url-status=live|year=2010}}</ref>{{rp|1}} However, construction of such a hash function is [[NP-hardness|practically infeasible]]; implementations therefore depend on [[Use case|case-specific]] [[#Collision resolution|collision resolution techniques]] to achieve higher performance.{{r|dijk10|p=2}}
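
The index computation <math>\sigma\ =\ h(K)\ \%\ n</math> and the resulting collisions can be illustrated with a short Python sketch. The helper names <code>bucket_index</code> and <code>group_by_bucket</code> are hypothetical, and the hash function is passed in as a parameter so that the example does not depend on any particular built-in hash.

```python
from collections import defaultdict


def bucket_index(key, num_buckets, h=hash):
    """Return sigma = h(K) % n for the given key and number of buckets."""
    return h(key) % num_buckets


def group_by_bucket(keys, num_buckets, h=hash):
    """Group keys by bucket index; any group with several keys is a collision."""
    groups = defaultdict(list)
    for key in keys:
        groups[bucket_index(key, num_buckets, h)].append(key)
    return dict(groups)
```

For example, with an identity hash function and ten buckets, the distinct keys 12 and 22 both map to index 2 and therefore collide, while 7 maps to index 7 on its own.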


The best performance is obtained in the case that the has function distributes the elements of the universe uniformly, and the elements stored in the table are drawn at random from the universe. In this case, in hashing with chaining, the expected time for a successful
The best performance is obtained in the case that the hash function distributes the elements of the universe uniformly, and the elements stored in the table are drawn at random from the universe. In this case, in hashing with chaining, the expected time for a successful search is <math display=inline>1+\frac{\alpha}{2}+\Theta\left(\frac{1}{m}\right)</math>, and the expected time for an unsuccessful search is <math display=inline>e^{-\alpha}+\alpha+
search is <math>1+\frac{\alpha}{2}+\Theta(\frac{1}{m})</math>, and the expected time for an unsuccessful search is <math>e^{-\alpha}+\alpha+
\Theta\left(\frac{1}{m}\right)</math>.<ref>
\Theta(\frac{1}{m})</math>.<ref>
  {{cite book
  {{cite book
  | last1      = Baeza-Yates
  | last1      = Baeza-Yates
Line 221: Line 225:
  | publisher  = CRC Press
  | publisher  = CRC Press
  | year      = 1999
  | year      = 1999
  | pages      = 2-6
  | pages      = 2–6
  | isbn      = 0849326494
  | isbn      = 0849326494
   }}
   }}
Line 236: Line 240:
===Caches===
===Caches===
{{Main|Cache (computing) }}
{{Main|Cache (computing) }}
Hash tables can be used to implement [[cache (computing)|caches]], auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.<ref>{{cite journal |last1=Zhong |first1=Liang |last2=Zheng |first2=Xueqian |last3=Liu |first3=Yong |last4=Wang |first4=Mengting |last5=Cao |first5=Yang |title=Cache hit ratio maximization in device-to-device communications overlaying cellular networks |journal=China Communications |date=February 2020 |volume=17 |issue=2 |pages=232–238 |doi=10.23919/jcc.2020.02.018 |s2cid=212649328 }}</ref><ref>{{cite web|url=https://www.linuxjournal.com/article/7105|publisher=[[Linux Journal]]|access-date=16 April 2022|date=1 January 2004|title=Understanding Caching|first=James|last=Bottomley|url-status=live|archive-url=https://web.archive.org/web/20201204195114/https://www.linuxjournal.com/article/7105|archive-date=4 December 2020}}</ref>
Hash tables can be used to implement [[cache (computing)|caches]], auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.<ref>{{cite journal |last1=Zhong |first1=Liang |last2=Zheng |first2=Xueqian |last3=Liu |first3=Yong |last4=Wang |first4=Mengting |last5=Cao |first5=Yang |title=Cache hit ratio maximization in device-to-device communications overlaying cellular networks |journal=China Communications |date=February 2020 |volume=17 |issue=2 |pages=232–238 |doi=10.23919/jcc.2020.02.018 |bibcode=2020CComm..17b.232Z |s2cid=212649328 }}</ref><ref>{{cite web|url=https://www.linuxjournal.com/article/7105|publisher=[[Linux Journal]]|access-date=16 April 2022|date=1 January 2004|title=Understanding Caching|first=James|last=Bottomley|url-status=live|archive-url=https://web.archive.org/web/20201204195114/https://www.linuxjournal.com/article/7105|archive-date=4 December 2020}}</ref>
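
The overwrite-on-collision policy described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the class name <code>OverwritingCache</code> is hypothetical, each slot holds at most one key–value pair, and Python's built-in <code>hash()</code> selects the slot.

```python
class OverwritingCache:
    """Hypothetical cache in which a colliding insert evicts the old entry."""

    def __init__(self, num_slots=16):
        self.slots = [None] * num_slots

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        # No collision resolution: whatever occupied the slot is discarded.
        self.slots[self._index(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._index(key)]
        if entry is not None and entry[0] == key:
            return entry[1]   # cache hit
        return None           # cache miss: the caller falls back to the slower medium
```

Because a lost entry only costs a later fetch from the slower medium, discarding on collision keeps both operations at a single slot access.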


===Sets===
===Sets===
Line 249: Line 253:
Many programming languages provide hash table functionality, either as built-in associative arrays or as [[standard library]] modules.
Many programming languages provide hash table functionality, either as built-in associative arrays or as [[standard library]] modules.


* In [[JavaScript]], an "object" is a mutable collection of key-value pairs (called "properties"), where each key is either a string or a guaranteed-unique "symbol"; any other value, when used as a key, is first [[Type conversion|coerced]] to a string. Aside from the seven "primitive" data types, every value in JavaScript is an object.<ref>{{cite web |title=JavaScript data types and data structures - JavaScript {{!}} MDN |url=https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#objects |website=developer.mozilla.org |access-date=24 July 2022}}</ref> ECMAScript 2015 also added the <code>Map</code> data structure, which accepts arbitrary values as keys.<ref>{{Cite web |date=2023-06-20 |title=Map - JavaScript {{!}} MDN |url=https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map |access-date=2023-07-15 |website=developer.mozilla.org |language=en-US}}</ref>
* In [[JavaScript]], an "object" is a mutable collection of key–value pairs (called "properties"), where each key is either a string or a guaranteed-unique "symbol"; any other value, when used as a key, is first [[Type conversion|coerced]] to a string. Aside from the seven "primitive" data types, every value in JavaScript is an object.<ref>{{cite web |title=JavaScript data types and data structures - JavaScript {{!}} MDN |url=https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Data_structures#objects |website=developer.mozilla.org |access-date=24 July 2022}}</ref> ECMAScript 2015 also added the <code>Map</code> data structure, which accepts arbitrary values as keys.<ref>{{Cite web |date=2023-06-20 |title=Map - JavaScript {{!}} MDN |url=https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map |access-date=2023-07-15 |website=developer.mozilla.org |language=en-US}}</ref>
* [[C++11]] includes <code>[[unordered map (C++)|unordered_map]]</code> in its standard library for storing keys and values of [[Template (C++)|arbitrary types]].<ref>{{cite web|url=http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf|title=Programming language C++ - Technical Specification|access-date=8 February 2022|publisher=[[International Organization for Standardization]]|archive-url=https://web.archive.org/web/20220121061142/http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3690.pdf|archive-date=21 January 2022|pages=812–813}}</ref>
* [[C++11]] includes <code>[[unordered map (C++)|unordered_map]]</code> in its standard library for storing keys and values of [[Template (C++)|arbitrary types]].<ref>{{cite web|url=https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf|title=Programming language C++ - Technical Specification|access-date=8 February 2022|publisher=[[International Organization for Standardization]]|archive-url=https://web.archive.org/web/20220121061142/http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3690.pdf|archive-date=21 January 2022|pages=812–813}}</ref>
* [[Go (programming language)|Go]]'s built-in <code>map</code> implements a hash table in the form of a [[Primitive data type|type]].<ref>{{cite web|url=https://go.dev/ref/spec#Map_types|title=The Go Programming Language Specification|website=go.dev|access-date=January 1, 2023}}</ref>
* [[Go (programming language)|Go]]'s built-in <code>map</code> implements a map type in the form of a [[Primitive data type|type]], which is often (but not guaranteed to be) a hash table.<ref>{{cite web|url=https://go.dev/ref/spec#Map_types|title=The Go Programming Language Specification|website=go.dev|access-date=January 1, 2023}}</ref>
* [[Java (programming language)|Java]] programming language includes the <code>HashSet</code>, <code>HashMap</code>, <code>LinkedHashSet</code>, and <code>LinkedHashMap</code> [[Generics in Java|generic]] collections.<ref>{{cite web|url=https://docs.oracle.com/javase/tutorial/collections/implementations/index.html|title=Lesson: Implementations (The Java™ Tutorials > Collections)|website=docs.oracle.com|access-date=April 27, 2018|url-status=live|archive-url=https://web.archive.org/web/20170118041252/https://docs.oracle.com/javase/tutorial/collections/implementations/index.html|archive-date=January 18, 2017|df=mdy-all}}</ref>
* [[Java (programming language)|Java]] programming language includes the <code>HashSet</code>, <code>HashMap</code>, <code>LinkedHashSet</code>, and <code>LinkedHashMap</code> [[Generics in Java|generic]] collections.<ref>{{cite web|url=https://docs.oracle.com/javase/tutorial/collections/implementations/index.html|title=Lesson: Implementations (The Java™ Tutorials > Collections)|website=docs.oracle.com|access-date=April 27, 2018|url-status=live|archive-url=https://web.archive.org/web/20170118041252/https://docs.oracle.com/javase/tutorial/collections/implementations/index.html|archive-date=January 18, 2017|df=mdy-all}}</ref>
* [[Python (programming language)|Python]]'s built-in <code>dict</code> implements a hash table in the form of a [[Primitive data type|type]].<ref>{{cite journal|journal=[[Journal of Physics: Conference Series]]|first1=Juan|last1=Zhang|first2=Yunwei|last2=Jia|title=Redis rehash optimization based on machine learning|volume=1453|year=2020|issue=1 |page=3|doi=10.1088/1742-6596/1453/1/012048 |bibcode=2020JPhCS1453a2048Z |s2cid=215943738 |doi-access=free}}</ref>
* [[Python (programming language)|Python]]'s built-in <code>dict</code> implements a hash table in the form of a [[Primitive data type|type]].<ref>{{cite journal|journal=[[Journal of Physics: Conference Series]]|first1=Juan|last1=Zhang|first2=Yunwei|last2=Jia|title=Redis rehash optimization based on machine learning|volume=1453|year=2020|issue=1 |page=3|doi=10.1088/1742-6596/1453/1/012048 |bibcode=2020JPhCS1453a2048Z |s2cid=215943738 |doi-access=free}}</ref>
Line 272: Line 276:
* [[Stable hashing]]
* [[Stable hashing]]
* [[Succinct hash table]]
* [[Succinct hash table]]
* [[Hash function]]
{{div col end}}
{{div col end}}


Line 288: Line 293:
{{Wikibooks | Data Structures/Hash Tables}}
{{Wikibooks | Data Structures/Hash Tables}}
* [[NIST]] entry on [https://xlinux.nist.gov/dads/HTML/hashtab.html hash tables]
* [[NIST]] entry on [https://xlinux.nist.gov/dads/HTML/hashtab.html hash tables]
* [http://opendatastructures.org/versions/edition-0.1e/ods-java/5_Hash_Tables.html Open Data Structures – Chapter 5 – Hash Tables], [[Pat Morin]]
* [https://opendatastructures.org/versions/edition-0.1e/ods-java/5_Hash_Tables.html Open Data Structures – Chapter 5 – Hash Tables], [[Pat Morin]]
* [http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-7-hashing-hash-functions/ MIT's Introduction to Algorithms: Hashing 1] MIT OCW lecture Video
* [https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-7-hashing-hash-functions/ MIT's Introduction to Algorithms: Hashing 1] MIT OCW lecture Video
* [http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-8-universal-hashing-perfect-hashing/ MIT's Introduction to Algorithms: Hashing 2] MIT OCW lecture Video
* [https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-8-universal-hashing-perfect-hashing/ MIT's Introduction to Algorithms: Hashing 2] MIT OCW lecture Video


{{Data structures}}
{{Data structures}}

Latest revision as of 18:59, 17 November 2025


File:Hash table 3 1 1 0 1 0 0 SP.svg
A small phone book as a hash table

In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps keys to values.[1] A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. A map implemented by a hash table is called a hash map.

Most hash table designs employ an imperfect hash function. Hash collisions, where the hash function generates the same index for more than one key, therefore typically must be accommodated in some way. Common strategies to handle hash collisions include chaining, which stores multiple elements in the same slot using linked lists, and open addressing, which searches for the next available slot according to a probing sequence.[2]

In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key–value pairs, at amortized constant average cost per operation.[3][2][4]

Hashing is an example of a space–time tradeoff. If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element.

In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason, and because of their fast average-case performance, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.[5] Many programming languages provide built-in hash table structures, such as Python's dictionaries, Java's HashMap, and C++'s unordered_map, which abstract the complexity of hashing from the programmer.[6]

History

The idea of hashing arose independently in different places. In January 1953, Hans Peter Luhn wrote an internal IBM memorandum that used hashing with chaining. The first example of open addressing was proposed by A. D. Linh, building on Luhn's memorandum.[2] Around the same time, Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel of IBM Research implemented hashing for the IBM 701 assembler. Open addressing with linear probing is credited to Amdahl, although Andrey Ershov independently had the same idea.[7] The term "open addressing" was coined by W. Wesley Peterson in an article discussing the problem of search in large files.[8]

The first published work on hashing with chaining is credited to Arnold Dumey, who discussed the idea of using remainder modulo a prime as a hash function. The word "hashing" was first published in an article by Robert Morris. A theoretical analysis of linear probing was submitted originally by Konheim and Weiss.

Overview

An associative array stores a set of (key, value) pairs and allows insertion, deletion, and lookup (search), with the constraint of unique keys. In the hash table implementation of associative arrays, an array A of length m is partially filled with n elements, where m ≥ n. A key x is hashed using a hash function h to compute an index location A[h(x)] in the hash table, where h(x) < m. The efficiency of a hash table depends on the load factor, defined as the ratio of the number of stored elements to the number of available slots, with lower load factors generally yielding faster operations.[9] At this index, both the key and its associated value are stored. Storing the key alongside the value ensures that lookups can verify the key at the index to retrieve the correct value, even in the presence of collisions. Under reasonable assumptions, hash tables have better time complexity bounds on search, delete, and insert operations in comparison to self-balancing binary search trees.

Hash tables are also commonly used to implement sets, by omitting the stored value for each key and merely tracking whether the key is present.

Load factor

A load factor α is a critical statistic of a hash table, and is defined as follows:[10] load factor (α) = n/m, where

  • n is the number of entries occupied in the hash table.
  • m is the number of buckets.

The performance of the hash table deteriorates in relation to the load factor α. In the limit of large m and n, each bucket statistically has a Poisson distribution with expectation λ = α for an ideally random hash function.

The software typically ensures that the load factor α remains below a certain constant, α_max. This helps maintain good performance. Therefore, a common approach is to resize or "rehash" the hash table whenever the load factor α reaches α_max. Similarly, the table may also be resized if the load factor drops below α_max/4.[11]
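The resize policy described here can be sketched in a few lines of Python. This is only an illustration: the function names and the default threshold of 0.75 are assumptions made for the example, not part of any particular implementation.

```python
def load_factor(n: int, m: int) -> float:
    """Return alpha = n / m for n stored entries and m buckets."""
    return n / m


def resize_action(n: int, m: int, alpha_max: float = 0.75) -> str:
    """Decide whether a table should grow, shrink, or keep its size.

    alpha_max = 0.75 is an assumed threshold chosen for illustration.
    """
    alpha = load_factor(n, m)
    if alpha >= alpha_max:
        return "grow"      # load factor reached alpha_max
    if alpha < alpha_max / 4:
        return "shrink"    # load factor dropped below alpha_max / 4
    return "keep"


print(load_factor(6, 8))    # 0.75
print(resize_action(6, 8))  # grow
print(resize_action(1, 8))  # shrink (0.125 < 0.1875)
print(resize_action(3, 8))  # keep (0.375)
```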

Load factor for separate chaining

With separate chaining hash tables, each slot of the bucket array stores a pointer to a list or array of data.[12]

Separate chaining hash tables suffer gradually declining performance as the load factor grows, and there is no fixed point beyond which resizing is absolutely needed.[11]

With separate chaining, the value of α_max that gives best performance is typically between 1 and 3.[11]

Load factor for open addressing

With open addressing, each slot of the bucket array holds exactly one item. Therefore an open-addressed hash table cannot have a load factor greater than 1.[12]

The performance of open addressing degrades sharply as the load factor approaches 1.[11]

Therefore a hash table that uses open addressing must be resized or rehashed if the load factor α approaches 1.[11]

With open addressing, acceptable values of the maximum load factor α_max range from about 0.6 to 0.75.[13]

Hash function

A hash function h : U → {0, ..., m − 1} maps the universe U of keys to indices or slots within the table, that is, h(x) ∈ {0, ..., m − 1} for x ∈ U. The conventional implementations of hash functions are based on the integer universe assumption that all elements of the table stem from the universe U = {0, ..., u − 1}, where the bit length of u is confined within the word size of a computer architecture.

A hash function h is said to be perfect for a given set S if it is injective on S, that is, if each element x ∈ S maps to a different value in {0, ..., m − 1}.[14][15] A perfect hash function can be created if all the keys are known ahead of time.[14]

Integer universe assumption

The schemes of hashing used in integer universe assumption include hashing by division, hashing by multiplication, universal hashing, dynamic perfect hashing, and static perfect hashing. However, hashing by division is the commonly used scheme.[16]

Hashing by division

The scheme in hashing by division is as follows: h(x) = x mod m, where h(x) is the hash value of x ∈ S and m is the size of the table.
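As a small illustration, hashing by division is a single modulo operation. The table size of 13 and the sample keys below are arbitrary values chosen for the example.

```python
def hash_division(x: int, m: int) -> int:
    """Hashing by division: h(x) = x mod m."""
    return x % m


m = 13  # assumed table size for this example
keys = [10, 22, 31, 4, 15, 28]
print([hash_division(x, m) for x in keys])  # [10, 9, 5, 4, 2, 2]
```

Keys 15 and 28 land in the same slot because they differ by a multiple of m, which is the kind of collision the collision resolution schemes must handle.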

Hashing by multiplication

The scheme in hashing by multiplication is as follows: h(x) = ⌊m((xA) mod 1)⌋, where A is a non-integer real-valued constant and m is the size of the table. An advantage of hashing by multiplication is that the choice of m is not critical. Although any value A produces a hash function, Donald Knuth suggests using the golden ratio.
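A minimal floating-point sketch of this scheme follows. It assumes the constant A = (√5 − 1)/2, which has the same fractional part as the golden ratio and therefore yields the same value of (xA) mod 1 in exact arithmetic. Production implementations usually work with fixed-width integer arithmetic instead; this version only illustrates the formula.

```python
import math

# Assumed constant: (sqrt(5) - 1) / 2, about 0.618, the fractional part of the golden ratio.
A = (math.sqrt(5) - 1) / 2


def hash_multiplication(x: int, m: int, a: float = A) -> int:
    """Hashing by multiplication: h(x) = floor(m * ((x * a) mod 1))."""
    fractional = (x * a) % 1.0      # fractional part of x * a, in [0, 1)
    return math.floor(m * fractional)


m = 8  # the table size need not be prime for this scheme
print([hash_multiplication(x, m) for x in (1, 2, 3, 4)])  # [4, 1, 6, 3]
```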

String hashing

Commonly a string is used as a key to the hash function. Stroustrup[17] describes a simple hash function in which an unsigned integer that is initially zero is repeatedly left shifted one bit and then xor'ed with the integer value of the next character. This hash value is then taken modulo the table size. If the left shift is not circular, then the string length in characters should be at least eight less than the size of the unsigned integer in bits, so that the bits contributed by the earliest characters are not shifted out. Another common way to hash a string to an integer is with a polynomial rolling hash function.
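Both string hashes mentioned here can be sketched as follows. The 32-bit width, the use of UTF-8 bytes as character values, and the polynomial parameters (base 31, modulus 1,000,000,007) are assumptions made for the example rather than values taken from the cited source.

```python
def shift_xor_hash(s: str, m: int, bits: int = 32) -> int:
    """Shift-and-xor hash: shift left one bit, xor in the next character."""
    mask = (1 << bits) - 1            # emulate a fixed-width unsigned integer
    h = 0
    for byte in s.encode("utf-8"):
        h = ((h << 1) & mask) ^ byte  # non-circular shift: high bits are dropped
    return h % m


def polynomial_hash(s: str, m: int, base: int = 31,
                    modulus: int = 1_000_000_007) -> int:
    """Polynomial rolling hash evaluated with Horner's rule."""
    h = 0
    for byte in s.encode("utf-8"):
        h = (h * base + byte) % modulus
    return h % m


print(shift_xor_hash("ab", 1000))   # 160
print(polynomial_hash("ab", 1000))  # 105
```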

Choosing a hash function

Uniform distribution of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions.[18][19]
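As a rough sketch of such an empirical check, the Pearson chi-squared statistic can be computed directly from slot counts. The helper name and the two toy hash functions are hypothetical. Interpreting the statistic properly requires comparing it with a chi-squared distribution with m − 1 degrees of freedom, which is omitted here.

```python
from collections import Counter


def chi_squared_statistic(hash_values, m):
    """Pearson chi-squared statistic against a uniform distribution over m slots."""
    counts = Counter(hash_values)
    expected = len(hash_values) / m   # expected count per slot if uniform
    return sum((counts.get(slot, 0) - expected) ** 2 / expected
               for slot in range(m))


m = 8
uniform = [k % m for k in range(800)]        # every slot is hit exactly 100 times
skewed = [(k * 4) % m for k in range(800)]   # only slots 0 and 4 are ever used

print(chi_squared_statistic(uniform, m))  # 0.0
print(chi_squared_statistic(skewed, m))   # 2400.0
```

A value near zero indicates counts close to uniform, while a large value signals a skewed distribution.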

The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number.[20]

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior.[20][2]

K-independent hashing offers a way to prove that a certain hash function does not have bad key sets for a given type of hash table. A number of K-independence results are known for collision resolution schemes such as linear probing and cuckoo hashing. Since K-independence can prove a hash function works, one can then focus on finding the fastest possible such hash function.[21]

Collision resolution

A search algorithm that uses hashing consists of two parts. The first part is computing a hash function which transforms the search key into an array index. The ideal case is such that no two search keys hash to the same array index. However, this is not always the case, and it is impossible to guarantee for unseen data.[2] Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.[22]

Separate chaining

File:Hash table 5 0 1 1 1 1 1 LL.svg
Hash collision resolved by separate chaining
File:Hash table 5 0 1 1 1 1 0 LL.svg
Hash collision by separate chaining with head records in the bucket array

In separate chaining, a linked list of key–value pairs is built for each array index. Items that collide are chained together in a single linked list, which can be traversed to find the item with a given search key. Collision resolution through chaining with linked lists is a common method of implementing hash tables. Let T be the hash table and x the node with key k; the operations are as follows:[23]

Chained-Hash-Insert(T, k)
  insert x at the head of linked list T[h(k)]

Chained-Hash-Search(T, k)
  search for an element with key k in linked list T[h(k)]

Chained-Hash-Delete(T, k)
  delete x from the linked list T[h(k)]
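The three operations can be sketched in Python as below. This is an illustrative version rather than a reference implementation: each chain is a Python list standing in for a linked list, Python's built-in hash() plays the role of h, and the class and method names are made up for the example.

```python
class ChainedHashTable:
    """Hash table with separate chaining; each bucket holds a list of (key, value)."""

    def __init__(self, m: int = 8):
        self.buckets = [[] for _ in range(m)]

    def _chain(self, key):
        # T[h(k)]: the chain for the bucket the key hashes to.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                  # keys are unique: overwrite in place
                chain[i] = (key, value)
                return
        chain.insert(0, (key, value))     # insert at the head of the chain

    def search(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None                       # unsuccessful search

    def delete(self, key) -> bool:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


t = ChainedHashTable(m=8)
t.insert(1, "one")
t.insert(9, "nine")       # 9 % 8 == 1, so this collides with key 1
print(t.search(1))        # one
print(t.buckets[1])       # [(9, 'nine'), (1, 'one')]
```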

If the elements are comparable, either numerically or lexically, and are inserted into the list so that the total order is maintained, unsuccessful searches terminate faster.

Other data structures for separate chaining

If the keys are ordered, it could be efficient to use "self-organizing" concepts such as using a self-balancing binary search tree, through which the theoretical worst case could be brought down to O(log n), although it introduces additional complexities.

In dynamic perfect hashing, two-level hash tables are used to reduce the look-up complexity to a guaranteed O(1) in the worst case. In this technique, the buckets of k entries are organized as perfect hash tables with k² slots, providing constant worst-case lookup time and low amortized time for insertion.[24] A study shows array-based separate chaining to be 97% more performant when compared to the standard linked list method under heavy load.

Techniques such as using a fusion tree for each bucket also result in constant time for all operations with high probability.[25]

Caching and locality of reference

The linked list of a separate chaining implementation may not be cache-conscious because of poor spatial locality (locality of reference): when the nodes of the linked list are scattered across memory, list traversal during insert and search may entail CPU cache inefficiencies.[26]

In cache-conscious variants of collision resolution through separate chaining, a dynamic array, found to be more cache-friendly, is used in place of the linked list or self-balancing binary search tree that is usually deployed, since the contiguous allocation pattern of the array can be exploited by hardware cache prefetchers (such as the translation lookaside buffer), resulting in reduced access time and memory consumption.[27][28][29]

Open addressing


File:Hash table 5 0 1 1 1 1 0 SP.svg
Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", that had previously collided with "John Smith".
File:Hash table average insertion time.png
This graph compares the average number of CPU cache misses required to look up elements in large hash tables (far exceeding size of the cache) with chaining and linear probing. Linear probing performs better due to better locality of reference, though as the table gets full, its performance degrades drastically.

Open addressing is another collision resolution technique in which every entry record is stored in the bucket array itself, and the hash resolution is performed through probing. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates an unsuccessful search.[30]

Well-known probe sequences include:

  • Linear probing, in which the interval between probes is fixed (usually 1).[31]
  • Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the value given by the original hash computation.
  • Double hashing, in which the interval between probes is computed by a secondary hash function.
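The three probe sequences can be compared with a short sketch. The function names are hypothetical, and the quadratic variant uses the triangular-number polynomial (i² + i)/2, an assumed choice that happens to visit every slot when the table size is a power of two.

```python
def linear_probe(h: int, m: int, step: int = 1) -> list:
    """Slots visited by linear probing with a fixed interval."""
    return [(h + i * step) % m for i in range(m)]


def quadratic_probe(h: int, m: int) -> list:
    """Slots visited when the offset is the quadratic polynomial (i*i + i) / 2."""
    return [(h + (i * i + i) // 2) % m for i in range(m)]


def double_hash_probe(h1: int, h2: int, m: int) -> list:
    """Slots visited when the interval h2 comes from a secondary hash function."""
    return [(h1 + i * h2) % m for i in range(m)]


m = 8
print(linear_probe(3, m))          # [3, 4, 5, 6, 7, 0, 1, 2]
print(quadratic_probe(3, m))       # [3, 4, 6, 1, 5, 2, 0, 7]
print(double_hash_probe(3, 5, m))  # [3, 0, 5, 2, 7, 4, 1, 6]
```

In the double hashing case the interval 5 is coprime to the table size 8, which is why the sequence reaches every slot.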

Open addressing may be slower than separate chaining, since the probe sequence lengthens as the load factor α approaches 1.[11] Probing results in an infinite loop if the load factor reaches 1, that is, when the table is completely filled. The average cost of linear probing depends on the hash function's ability to distribute the elements uniformly throughout the table to avoid clustering, since formation of clusters would result in increased search time.
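A minimal open addressing table with linear probing might look like the following sketch. The class name is made up, Python's hash() stands in for the hash function, deletion is left out because it needs tombstones or re-insertion, and each probe is bounded to m steps so that a full table raises an error instead of looping forever.

```python
class LinearProbingTable:
    """Open addressing with linear probing (interval 1); no deletion support."""

    _EMPTY = object()   # sentinel marking an unoccupied slot

    def __init__(self, m: int = 8):
        self.slots = [self._EMPTY] * m
        self.n = 0      # number of stored entries

    def _probe(self, key):
        m = len(self.slots)
        start = hash(key) % m
        for i in range(m):              # visit each slot at most once
            yield (start + i) % m

    def insert(self, key, value) -> int:
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is self._EMPTY:
                self.slots[j] = (key, value)
                self.n += 1
                return j                # index where the entry was placed
            if slot[0] == key:
                self.slots[j] = (key, value)
                return j
        raise RuntimeError("table is full")

    def search(self, key):
        for j in self._probe(key):
            slot = self.slots[j]
            if slot is self._EMPTY:     # an empty slot ends an unsuccessful search
                return None
            if slot[0] == key:
                return slot[1]
        return None


t = LinearProbingTable(m=8)
print(t.insert(1, "a"))   # 1
print(t.insert(9, "b"))   # 2 (slot 1 is taken, so the next slot is used)
print(t.insert(17, "c"))  # 3
print(t.search(17))       # c
```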

Caching and locality of reference

Since the slots are located in successive locations, linear probing could lead to better utilization of CPU cache due to locality of reference, resulting in reduced memory latency.[31]

Other collision resolution techniques based on open addressing

Coalesced hashing

Coalesced hashing is a hybrid of both separate chaining and open addressing in which the buckets or nodes link within the table.[32] The algorithm is ideally suited for fixed memory allocation. The collision in coalesced hashing is resolved by identifying the largest-indexed empty slot on the hash table, then the colliding value is inserted into that slot. The bucket is also linked to the inserted node's slot which contains its colliding hash address.

Cuckoo hashing

Cuckoo hashing is a form of open addressing collision resolution technique which guarantees O(1) worst-case lookup complexity and constant amortized time for insertions. Collisions are resolved by maintaining two hash tables, each with its own hash function: the new item replaces whatever occupies the collided slot, and the displaced element is moved into the other hash table. The process continues until every key has its own spot in the empty buckets of the tables; if the procedure enters an infinite loop (which is identified by maintaining a threshold loop counter), both hash tables are rehashed with newer hash functions and the procedure continues.[33]
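The displacement loop can be sketched as follows. Several details are assumptions made for the example: the two hash functions are derived from Python's hash() with different integer salts, the loop threshold is 16, and the rebuild step doubles the table size in addition to switching to new salts.

```python
class CuckooHashTable:
    """Two-table cuckoo hashing; a lookup inspects at most two slots."""

    def __init__(self, m: int = 8, max_kicks: int = 16):
        self.m = m
        self.max_kicks = max_kicks          # threshold loop counter
        self.salts = [1, 2]                 # hypothetical seeds for the two hash functions
        self.tables = [[None] * m, [None] * m]

    def _index(self, which: int, key) -> int:
        return hash((self.salts[which], key)) % self.m

    def search(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        for which in (0, 1):                # update in place if the key exists
            j = self._index(which, key)
            entry = self.tables[which][j]
            if entry is not None and entry[0] == key:
                self.tables[which][j] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            j = self._index(which, entry[0])
            # Place the carried entry and pick up whatever occupied the slot.
            entry, self.tables[which][j] = self.tables[which][j], entry
            if entry is None:
                return
            which = 1 - which               # the displaced entry goes to the other table
        self._rehash(entry)

    def _rehash(self, pending):
        entries = [e for t in self.tables for e in t if e is not None]
        entries.append(pending)
        self.m *= 2                         # assumption: also grow while rehashing
        self.salts = [s + 2 for s in self.salts]
        self.tables = [[None] * self.m, [None] * self.m]
        for k, v in entries:
            self.insert(k, v)


t = CuckooHashTable()
for k in range(50):
    t.insert(k, k * k)
print(t.search(7))     # 49
print(t.search(1000))  # None
```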

Hopscotch hashing

Hopscotch hashing is an open addressing based algorithm which combines the elements of cuckoo hashing, linear probing and chaining through the notion of a neighbourhood of buckets: the subsequent buckets around any given occupied bucket, also called a "virtual" bucket.[34] The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in concurrent settings, and is thus well suited for implementing a resizable concurrent hash table. The neighbourhood characteristic of hopscotch hashing guarantees that the cost of finding the desired item in any given bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to bring an item into its neighbourhood, with a possible cost involved in displacing other items.

Each bucket within the hash table includes an additional "hop-information": an H-bit bit array indicating the relative distance of the item which was originally hashed into the current virtual bucket within H − 1 entries. Let k and B_k be the key to be inserted and the bucket to which the key is hashed, respectively; several cases are involved in the insertion procedure such that the neighbourhood property of the algorithm is maintained: if B_k is empty, the element is inserted, and the leftmost bit of the bitmap is set to 1; if not empty, linear probing is used to find an empty slot in the table, and the bitmap of the bucket is updated, followed by the insertion; if the empty slot is not within the range of the neighbourhood, i.e. H − 1, subsequent swap and hop-info bit array manipulation of each bucket is performed in accordance with its neighbourhood invariant properties.

Robin Hood hashing

Robin Hood hashing is an open addressing based collision resolution algorithm; collisions are resolved by favouring the displacement of the element that is farthest (that is, has the longest probe sequence length, PSL) from its "home location", i.e. the bucket to which the item was hashed.[35] Although Robin Hood hashing does not change the theoretical search cost, it significantly affects the variance of the distribution of the items on the buckets,[36] i.e. dealing with cluster formation in the hash table.[37] Each node within the hash table that uses Robin Hood hashing should be augmented to store an extra PSL value.[38] Let x be the key to be inserted, x.psl be the (incremental) PSL of x, T be the hash table and j be the index; the insertion procedure is as follows:[39]

  • If x.psl ≤ T[j].psl: the iteration goes into the next bucket without attempting an external probe.
  • If x.psl > T[j].psl: insert the item x into the bucket j, swapping x with T[j] so that the displaced item becomes the new x; continue the probe from the (j+1)th bucket to insert x; repeat the procedure until every element is inserted.
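The insertion rule can be sketched in Python as below. The sketch assumes that the key is not already present, represents each occupied slot as a [key, value, psl] list, and uses Python's hash() as the hash function; all of these are choices made for the example.

```python
def robin_hood_insert(table: list, key, value) -> None:
    """Insert into an open-addressed table whose slots are None or [key, value, psl].

    Assumes the key is not already stored in the table.
    """
    if None not in table:
        raise RuntimeError("table is full")
    m = len(table)
    entry = [key, value, 0]          # psl counts the probes made so far
    j = hash(key) % m
    while True:
        resident = table[j]
        if resident is None:
            table[j] = entry
            return
        if entry[2] > resident[2]:
            # The carried entry is farther from home: it takes the slot and
            # the displaced resident continues probing in its place.
            table[j], entry = entry, resident
        j = (j + 1) % m
        entry[2] += 1


table = [None] * 8
for k in (0, 1, 8):                  # 8 hashes to slot 0 and displaces key 1
    robin_hood_insert(table, k, str(k))
print(table[:3])  # [[0, '0', 0], [8, '8', 1], [1, '1', 1]]
```

Without the swap, key 8 would have ended two slots from home while key 1 stayed at distance zero; with it, both end one slot from home, which illustrates the reduced variance.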

Dynamic resizing

Repeated insertions cause the number of entries in a hash table to grow, which consequently increases the load factor; to maintain the amortized O(1) performance of the lookup and insertion operations, a hash table is dynamically resized and the items of the table are rehashed into the buckets of the new hash table,[11] since the items cannot simply be copied over: a different table size results in different hash values due to the modulo operation.[40] If a hash table becomes "too empty" after deleting some elements, resizing may be performed to avoid excessive memory usage.[41]

Resizing by moving all entries

Generally, a new hash table with a size double that of the original hash table gets allocated privately and every item in the original hash table gets moved to the newly allocated one by computing the hash values of the items followed by the insertion operation. Rehashing is simple, but computationally expensive.[42]
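A sketch of all-at-once rehashing for a chained table is shown below. The bucket layout (a list of lists of key–value tuples) and the function name are assumptions for the example. It also shows why entries cannot simply be copied bucket by bucket: the same keys land at different indices once the modulus changes.

```python
def rehash(buckets: list, new_size: int) -> list:
    """Allocate a new bucket array and move every entry into it."""
    new_buckets = [[] for _ in range(new_size)]
    for chain in buckets:
        for key, value in chain:
            # The index must be recomputed because it depends on the table size.
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets


old = [[] for _ in range(4)]
for key in (1, 5, 9, 13):                  # all four keys collide when m = 4
    old[hash(key) % 4].append((key, str(key)))

new = rehash(old, 2 * len(old))            # double the size
print([len(chain) for chain in old])       # [0, 4, 0, 0]
print([len(chain) for chain in new])       # [0, 2, 0, 0, 0, 2, 0, 0]
```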

Alternatives to all-at-once rehashing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually, both to avoid a storage blip (typically at 50% of the new table's size) during rehashing and to avoid memory fragmentation that triggers heap compaction due to deallocation of the large memory blocks used by the old hash table.[43] In such a case, the rehashing operation is done incrementally by extending the prior memory block allocated for the old hash table such that the buckets of the hash table remain unaltered. A common approach for amortized rehashing involves maintaining two hash functions h_old and h_new. The process of rehashing a bucket's items in accordance with the new hash function is termed cleaning, which is implemented through the command pattern by encapsulating operations such as Add(key), Get(key) and Delete(key) in a Lookup(key, command) wrapper such that each element in the bucket gets rehashed. The procedure is as follows:

  • Clean the Table[h_old(key)] bucket.
  • Clean the Table[h_new(key)] bucket.
  • The command gets executed.

Linear hashing

Linear hashing is an implementation of the hash table which enables the table to grow or shrink dynamically one bucket at a time.[44]

Performance

The performance of a hash table is dependent on the hash function's ability to generate quasi-random numbers (σ) for entries in the hash table, where K, n and h(x) denote the key, the number of buckets and the hash function, such that σ = h(K) % n. If the hash function generates the same σ for distinct keys (K1 ≠ K2, h(K1) = h(K2)), this results in a collision, which is dealt with in a variety of ways. The constant time complexity (O(1)) of the operations in a hash table presupposes that the hash function doesn't generate colliding indices; thus, the performance of the hash table depends directly on the chosen hash function's ability to disperse the indices.[45] However, construction of such a hash function is practically infeasible; implementations therefore depend on case-specific collision resolution techniques to achieve higher performance.

The best performance is obtained in the case that the hash function distributes the elements of the universe uniformly, and the elements stored in the table are drawn at random from the universe. In this case, in hashing with chaining, the expected time for a successful search is 1 + α/2 + Θ(1/m), and the expected time for an unsuccessful search is e^(−α) + α + Θ(1/m).[46]
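To give a feel for these expressions, the short sketch below evaluates the leading terms 1 + α/2 and e^(−α) + α for a few load factors. It ignores the Θ(1/m) term, which is an approximation that presumes a large table, and the function names are made up for the example.

```python
import math


def expected_successful(alpha: float) -> float:
    """Leading terms of the expected successful search time: 1 + alpha / 2."""
    return 1 + alpha / 2


def expected_unsuccessful(alpha: float) -> float:
    """Leading terms of the expected unsuccessful search time: e^(-alpha) + alpha."""
    return math.exp(-alpha) + alpha


for alpha in (0.5, 1.0, 2.0):
    print(alpha,
          round(expected_successful(alpha), 4),
          round(expected_unsuccessful(alpha), 4))
# 0.5 1.25 1.1065
# 1.0 1.5 1.3679
# 2.0 2.0 2.1353
```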

Applications

Associative arrays

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays.[47]

Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications.[48]

Caches

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries (usually erasing the old item that is currently stored in the table and overwriting it with the new item), so every item in the table has a unique hash value.[49][50]
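The overwrite-on-collision policy can be sketched with a fixed array of slots. The class name and the slot count are assumptions for the example, and Python's hash() stands in for the hash function.

```python
class OverwritingCache:
    """Fixed-size cache in which a colliding insert evicts the stored entry."""

    def __init__(self, m: int = 4):
        self.slots = [None] * m

    def put(self, key, value) -> None:
        # No probing and no chaining: the new entry replaces the old one.
        self.slots[hash(key) % len(self.slots)] = (key, value)

    def get(self, key):
        entry = self.slots[hash(key) % len(self.slots)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None     # cache miss: the caller falls back to the slower medium


cache = OverwritingCache(m=4)
cache.put(1, "a")
cache.put(5, "b")      # 5 % 4 == 1, so this evicts key 1
print(cache.get(1))    # None
print(cache.get(5))    # b
```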

Sets

Hash tables can be used in the implementation of the set data structure, which stores unique values without any particular order; a set is typically used to test the membership of a value in the collection, rather than for element retrieval.[51]

Transposition table

A transposition table is a complex hash table that stores information about each section that has been searched.[52]

Implementations

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules.

  • In JavaScript, an "object" is a mutable collection of key–value pairs (called "properties"), where each key is either a string or a guaranteed-unique "symbol"; any other value, when used as a key, is first coerced to a string. Aside from the seven "primitive" data types, every value in JavaScript is an object.[53] ECMAScript 2015 also added the Map data structure, which accepts arbitrary values as keys.[54]
  • C++11 includes unordered_map in its standard library for storing keys and values of arbitrary types.[55]
  • Go's built-in map implements a map type, which is often (but not guaranteed to be) a hash table.[56]
  • The Java programming language includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections.[57]
  • Python's built-in dict implements a hash table in the form of a type.[58]
  • Ruby's built-in Hash uses the open addressing model from Ruby 2.4 onwards.[59]
  • The Rust programming language includes HashMap and HashSet as part of the Rust Standard Library.[60]
  • The .NET standard library includes HashSet and Dictionary,[61][62] so it can be used from languages such as C# and VB.NET.[63]

References

  1. Script error: No such module "citation/CS1".
  2. a b c d e Script error: No such module "citation/CS1".
  3. Script error: No such module "citation/CS1".
  4. Script error: No such module "citation/CS1".
  5. Script error: No such module "citation/CS1".
  6. Script error: No such module "citation/CS1".
  7. Script error: No such module "citation/CS1".
  8. Script error: No such module "citation/CS1".
  9. Script error: No such module "citation/CS1".
  10. Cite error: Invalid <ref> tag; no text was provided for refs named Cormen et al
  11. a b c d e f g Script error: No such module "citation/CS1".
  12. a b James S. Plank and Brad Vander Zanden. "CS140 Lecture notes -- Hashing".
  13. Script error: No such module "Citation/CS1".
  14. a b Script error: No such module "citation/CS1".
  15. Script error: No such module "citation/CS1".
  16. Script error: No such module "Citation/CS1".
  17. Script error: No such module "citation/CS1".
  18. Script error: No such module "Citation/CS1".
  19. Script error: No such module "Citation/CS1".
  20. a b Script error: No such module "citation/CS1".
  21. Script error: No such module "Citation/CS1".
  22. Script error: No such module "citation/CS1".
  23. Script error: No such module "citation/CS1".
  24. Script error: No such module "citation/CS1".
  25. Script error: No such module "Citation/CS1"..
  26. Script error: No such module "citation/CS1".
  27. Script error: No such module "Citation/CS1".
  28. Script error: No such module "citation/CS1".
  29. Script error: No such module "citation/CS1".
  30. Script error: No such module "citation/CS1".
  31. a b Script error: No such module "citation/CS1".
  32. Script error: No such module "citation/CS1".
  33. Script error: No such module "citation/CS1".
  34. Script error: No such module "citation/CS1".
  35. Script error: No such module "citation/CS1".
  36. Script error: No such module "Citation/CS1".
  37. Script error: No such module "citation/CS1".
  38. Script error: No such module "citation/CS1".
  39. Script error: No such module "citation/CS1".
  40. Script error: No such module "citation/CS1".
  41. Script error: No such module "citation/CS1".
  42. Script error: No such module "citation/CS1".
  43. Script error: No such module "Citation/CS1".
  44. Script error: No such module "citation/CS1".
  45. Script error: No such module "citation/CS1".
  46. Script error: No such module "citation/CS1".
  47. Script error: No such module "citation/CS1"..
  48. Script error: No such module "citation/CS1".
  49. Script error: No such module "Citation/CS1".
  50. Script error: No such module "citation/CS1".
  51. Script error: No such module "citation/CS1".
  52. Script error: No such module "citation/CS1".
  53. Script error: No such module "citation/CS1".
  54. Script error: No such module "citation/CS1".
  55. Script error: No such module "citation/CS1".
  56. Script error: No such module "citation/CS1".
  57. Script error: No such module "citation/CS1".
  58. Script error: No such module "Citation/CS1".
  59. Script error: No such module "citation/CS1".
  60. Script error: No such module "citation/CS1".
  61. Script error: No such module "citation/CS1".
  62. Script error: No such module "citation/CS1".
  63. Script error: No such module "citation/CS1".