Fast Hashing With Strong Concentration Bounds | Awesome Similarity Search Papers

Fast Hashing With Strong Concentration Bounds

Anders Aamand, Jakob B. T. Knudsen, Mathias B. T. Knudsen, Peter M. R. Rasmussen, Mikkel Thorup · STOC '20: 52nd Annual ACM SIGACT Symposium on Theory of Computing · 2019

Previous work on tabulation hashing by Patrascu and Thorup from STOC’11 on simple tabulation and from SODA’13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of (c=O(1)) characters, e.g., a 64-bit key as (c=8) characters of 8-bits. The character domain (\Sigma) should be small enough that character tables of size (|\Sigma|) fit in fast cache. The schemes then use (O(1)) tables of this size, so the space of tabulation hashing is (O(|\Sigma|)). However, the concentration bounds by Patrascu and Thorup only apply if the expected sums are (\ll |\Sigma|). To see the problem, consider the very simple case where we use tabulation hashing to throw (n) balls into (m) bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if (n=m), for then the expected value is (1). However, if (m=2), as when tossing (n) unbiased coins, the expected value (n/2) is (\gg |\Sigma|) for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.

Explore more on:
Uncategorized
Similar Work
Loading…