Talk:Hash function/Archive 1


Overhaul of hash function and merging of hash algorithm

Hi Pfunk42! I just saw your major overhaul of hash function and "merging" of hash algorithm. Very nice work! You beat me to it. I put up those merging notices but never got around to noting on the talk pages that I was going to do that work in the coming days. But I am happy you did it since it saved me the work and you did it so well. I added a picture instead, hope you like it. --David Göthberg 02:31, 5 September 2005 (UTC)

Surjective?

The article says that hash functions are surjective. Is that true? They're certainly not injective, but my understanding is that "surjective" means that "All elements of the image domain are actually 'used' by the mapping," to quote a posting by Peter Simons from the monoton-devel mailing list. [1]

So a hash function could be surjective, but I don't think it has to be.

But this is so far from my areas of expertise that I don't want to just change it; I may be way offbase. --Elysdir 00:10, 17 December 2005 (UTC)

Basically, you want the outputs of your hash function to be uniformly distributed over the codomain of the function. If the function is not a surjection, there exist values in the codomain that are not in the range. If this is true, then the output is not, strictly speaking, uniformly distributed. However, if the size of the codomain is greater than the square root of the size of the domain, then many quite good hash functions will not be surjective (see birthday paradox). Anyway, it's complicated, but not really an issue anymore since i took out the reference to "surjective". --pfunk42 05:56, 2 January 2006 (UTC)
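A side note that may help later readers: for an idealized random function mapping m distinct inputs into n possible outputs, the expected fraction of outputs that are never produced is (1 - 1/n)^m, roughly e^(-m/n), so a perfectly good hash function whose codomain is large compared to its actual input set will miss most of its codomain. A tiny sketch of that arithmetic (illustration only; the numbers below are made-up examples, not from the article):

    #include <math.h>
    #include <stdio.h>

    /* Expected fraction of an n-value codomain that is never hit when m
       inputs are hashed by an idealized random function. */
    double expected_unused_fraction(double m, double n) {
        return exp(-m / n);   /* (1 - 1/n)^m is approximately e^(-m/n) for large n */
    }

    int main(void) {
        /* e.g. one million keys hashed to 32-bit values: about 99.98% of the
           2^32 possible outputs are never produced, so the function is nowhere
           near surjective even though it may hash perfectly well. */
        printf("%f\n", expected_unused_fraction(1e6, 4294967296.0));
        return 0;
    }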

Stock function?

What's a stock hash function? Isn't it the case that when an algorithm is published it becomes stock? What's significant about a stock function the way it is being described in this article? Just a little bit confused. --MegaHasher

Some hash functions are special-purpose. For example, they might only give nice, uniform output for the input they were designed to take. Or they might accept only input of some fixed size. These are not stock because they aren't widely reusable. Also, classes of hash functions or hash function designs are not themselves hash functions. For example, Pearson Hashing is not a stock hash function. Basically, what I'm familiar with being called a "stock hash function" is one which, from a mathematical perspective, is almost universally applicable and good. I say "from a mathematical perspective" because being a stock hash function doesn't mean its implementation is universally fast. --pfunk42 05:56, 2 January 2006 (UTC)

Randomization Function

Ok. maybe i should have discussed it before throwing it in there, but this is a widely used term for a hash function that is one-to-one. a Google search for "randomization function" reveals that most of the results use the term as i describe.

True, a bijective function whose domain equals its range represents a permutation, but calling any one-to-one hash function a "permutation" is highly misleading.

It might be worth creating a separate "Randomization Function" page (with redirect from "Mix Function"), but i'm not sure. Comments welcome.--pfunk42 06:12, 2 January 2006 (UTC)

IMHO, Wikipedia already has too many semi-related pages about random functions. In my mind, permutations used for hashing have less strength than permutations used for pseudo randomness. Even with lots of rounds these permutations probably would not be too good for pseudo randomness, because the period is too short when there are already fast RNGs with very long periods. --MegaHasher 06:33, 2 January 2006 (UTC)

Many uses of randomization functions, especially those which use them under that name, have no "period" requirement/concern. In fact, periodicity concerns might inappropriately narrow the space of possible functions. Other uses (such as pseudorandom number generation) might call for functions satisfying the requirements of a "randomization function" as well as other, stronger requirements. It would seem to me appropriate to disambiguate such classes of requirements in an article. Here's one thing i believe is a useful observation: a random function is an idealization of a hash function; a random permutation is an idealization of a randomization function. --pfunk42 13:08, 2 January 2006 (UTC)
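To make the "one-to-one hash" idea concrete for later readers, here is a minimal sketch (an illustration, not taken from any article) of a bijective 32-bit mix function: multiplication by an odd constant modulo 2^32 and xor-with-shift are each individually invertible, so their composition permutes the 32-bit values while still scrambling them thoroughly:

    #include <stdint.h>

    /* A bijective (one-to-one) 32-bit mix: every step is invertible, so
       distinct inputs always map to distinct outputs. The constants are
       essentially arbitrary; any odd multiplier keeps the step invertible. */
    uint32_t mix32(uint32_t x) {
        x ^= x >> 16;       /* xor with a shifted copy: invertible */
        x *= 0x7feb352dU;   /* odd multiplier: invertible mod 2^32 */
        x ^= x >> 15;
        x *= 0x846ca68bU;   /* odd multiplier: invertible mod 2^32 */
        x ^= x >> 16;
        return x;
    }

Because the mapping is a permutation of the 32-bit values, it behaves like the "randomization function" described above rather than like a many-to-one hash.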

Intro Edit

I admit perhaps the original intro is too technical. Hopefully my edit is more to the level of a beginner programmer. --MegaHasher 03:44, 25 January 2006 (UTC)

The example of catching a change in an article suffers from the defect that the check is only probabilistic. The material in the introduction should be rock solid.

I appreciate your effort, but even so the opening sentence basically says "a hash function is something used to create a hash value". I got to this page by searching for "hash value", which redirected me here. In other words, I still don't know what "hash" means, except with respect to meat dishes. I'm going to attempt to redress that. ShawnVW 18:52, 2 March 2006 (UTC)

ShawnVW: I like your new intro. --David Göthberg 01:04, 3 March 2006 (UTC)

I think the intro is good (concise, simple, accurate, and to the point). There is one small inaccuracy, I believe. The input to the hash function is not necessarily variable length. Here are a few definitions (from a Google search of "define hash function") that might be worth consideration as part of the intro:

Web definitions

A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. ... en.wikipedia.org/wiki/Hash_function


(Hash functions) take a large amount of data and create a small fingerprint based on that data. www.informit.com/guides/content.aspx


I like aspects of the first one, because it includes "possibly," and I like aspects of the second one because it is easy to picture and understand if someone is not familiar with this type of information. Maybe the best of both could be incorporated?

--Mark Nadon 8/13/2013

Intro

It seems the intro has become longer than the main text. Clean-up is again needed. -MegaHasher 23:50, 5 April 2006 (UTC)

The Intro is incorrect. A well designed hash function is not necessarily one-way nor unforgeable. A well designed hash function only needs to distribute input keys well across all buckets. You might be able to prove that a hash function that does a good job of spreading values across buckets is also one-way or non-forgeable, but those are not initial design constraints. There are "niche" applications in which one-way or non-forgeable attributes are important, but they are not design constraints for hash functions in general. Chuck Simmons 20:06, 14 April 2006 (UTC)

Reverse hash.

Say I have a hash (I know the hashing function and it has no salt), and I know my cleartext is between 12 and 18 bytes long and contains only ascii a-z \. 0-9 Would it be possible to write a function to generate all the possible cleartexts that fall under this particular category with this hash? Doodle77 03:24, 28 March 2006 (UTC)

In the case of non-crypto hash, yes. -MegaHasher 23:50, 5 April 2006 (UTC)
Actually, Doodle77 hasn't really asked the right question, and MegaHasher's answer is potentially misleading (though technically correct). Regardless of whether a hash function is cryptographically strong, one can write a function to generate all possible cleartexts within a finite range/set that generate a certain hash. The solution is as simple as iterating over the possibilities, hashing them, and comparing the result to the desired hash. This answers Doodle77's question the way he/she asked it; it was asked as a question of computability. Doodle77 was probably interested in whether the solution was efficiently computable, or practical, which is a question of computational complexity. With a space of inputs as large as you describe, efficiently generating matching cleartexts depends on the ability to "reverse" the hash function, which is (presumably; see One-way function) much more difficult for cryptographic hash functions.

--pfunk42 12:36, 23 April 2006 (UTC)
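To spell out the computability point with code: a brute-force search over a finite input set is trivial to write; the only question is whether it finishes in a useful amount of time. A rough sketch (the hash() declaration is a stand-in for whatever unsalted hash function is being inverted, not a real API):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-in for the actual (unsalted) hash function being attacked. */
    extern uint32_t hash(const char *s, size_t len);

    static const char ALPHABET[] = "abcdefghijklmnopqrstuvwxyz\\.0123456789";

    /* Enumerate every string of length len over ALPHABET and print those whose
       hash matches target. Pure computability: the running time is
       |ALPHABET|^len, which is hopeless in practice for len = 12..18. */
    static void search(char *buf, size_t pos, size_t len, uint32_t target) {
        if (pos == len) {
            if (hash(buf, len) == target)
                printf("match: %.*s\n", (int)len, buf);
            return;
        }
        for (const char *c = ALPHABET; *c != '\0'; c++) {
            buf[pos] = *c;
            search(buf, pos + 1, len, target);
        }
    }

    void find_preimages(uint32_t target) {
        char buf[18];
        for (size_t len = 12; len <= 18; len++)
            search(buf, 0, len, target);
    }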

Audio Identification NPOV

"Contrary to an assumption often voiced by people who don't follow the industry carefully,"

That seems a bit too opinionated to be in accord with WP:NPOV and WP:AWW, all the more so that it's uncited. I made the statement a bit more neutral and verifiable, but I still ask that a source be given for this claim, or otherwise it should be removed entirely. The claim regarding the noisy room is also uncited, so I'm requesting a source for that as well. --Ultimus 17:07, 1 July 2006 (UTC)

Hash table section

Two issues are in this section:

  • Cryptographic hash functions are too slow for hash table usage. Cryptographic hash functions are used to generate security hashes.
  • "too slow" is a brash generalization. some hash tables have records that live on disk, in which case you have virtually all the hashing time in the world. besides, i've seen many-a-hash table using MD5 as the hash function, which is (IIRC) only about 2-3 times slower than Jenkins. (Granted, i quickly pointed out they should use Jenkins instead... :) --pfunk42 21:25, 2 February 2006 (UTC)
  • MD5 is a CRYPTOGRAPHIC hash function, and is designed for a much different use than Jenkin's hash. I'm surprised to see it does that well. But I'd suggest that 2 or 3 times as slow is impractical for many uses. I've implemented many cryptographic hash functions (1 NIST validated), and I don't use them for anything other than cryptographic work... any more than I would use hash table algorithms for cryptographic use. Cryptographic hash functions are WAY too slow for any hash table use I do. Nahaj 17:32, 1 July 2006 (UTC)
  • A hash function that maps successive keys to successive output values probably has severe funneling, and is not a good hash function to begin with. --MegaHasher 06:31, 1 February 2006 (UTC)
  • another brash generalization. there are cases in which regularity in the input can be exploited to beat "random" hash functions. this is pretty well known thanks to work by Knuth, et al. but the downside is that the *wrong* set of inputs can be very bad... as i discuss in the article!  :) also, some very small hash tables (for example, in implementation of "dictionary"-heavy languages like python, smalltalk, etc.) use hash functions that are pretty bad in terms of collisions but work quite well because low load factor + linear probing (very fast search) + small size + frequent access (likely in CPU cache) mean the hash function must be obscenely fast. --pfunk42 21:25, 2 February 2006 (UTC)
  • (Knuth worked with non-cryptographic hash functions, and the rest of Pfunk42's reply sounds like uses of non-cryptographic hashes. MegaHasher's comment sounds like a sound objection to such a hash for cryptographic work.) Possibly we are talking apples and oranges here? Nahaj 17:32, 1 July 2006 (UTC)

Algorithms for Audio Identification

Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences.

This sentence is way too vague. Could someone please elucidate which algorithms and give some perspective on how they work? I assume that one can bring up a broader topic, "algorithms for analog signal identification", too, but I really have no knowledge about the topic. Poli 10:38, 12 Jun 2005 (UTC)

A Google search led me to

I suggest we move most of this information on audio identification to the acoustic fingerprint article. --75.37.227.177 14:55, 18 July 2007 (UTC)

Audio Identification Facts

Musicbrainz indeed is an open source system as the article describes; however Musicbrainz does not do audio hashing. It depends on a commercial system to do that—Trevor Caira 11:09, 29 December 2006 (UTC)

How do we know it isn't using human recognition[2]? --75.37.227.177 14:55, 18 July 2007 (UTC)

Merge with one-way security

The article claims that a merge with One-way security is being considered. As far as I can tell there is no discussion of that here and no proposed reason why the merge is a good idea. I don't see why that merge makes sense. So, I'm going to remove the merge tags. Feel free to add them back and explain why you think it would be a good idea. --SamHartman 21:22, 3 October 2007 (UTC)

Strongly and Weakly Collision resistant?

If, given a message x, it is computationally infeasible to find a message y not equal to x such that H(x) = H(y) then H is said to be a weakly collision-free hash function. On the other hand, a strongly collision-free hash function H is one for which it is computationally infeasible to find any two messages x and y such that H(x) = H(y).

This seems awkward and opaque to me...

The only semantic difference I can see in this text seems to suggest that for "strongly" collision free hash functions, even if message x = y, H(x) != H(y)! Which doesn't sound much like a function to me...

Looking around, I found (at http://www.cs.berkeley.edu/~jordan/courses/174-spring02/notes/note14.pdf) the following:

2. A hash function h(x) is said to be weakly collision-free if given a message x1 it is hard to find another message x2 such that h(x1) = h(x2).
3. A hash function h(x) is said to be strongly collision-free if it is hard to find any pair of messages x1, x2 such that h(x1) = h(x2).

This is still not too clear to me.

Weakly collision free seems to me to mean: Given any specific message x, it is computationally infeasible to find a message y such that H(x) = H(y), but if you wish to, you can craft two messages together (x and y), such that H(x) = H(y).

That is, if you're NOT a hacker trying to create a forged message (which is hard), but rather a social engineer trying to create TWO messages which can be interchanged to create plausible deniability (which is easier), you can do it with a Weakly collision free hashing function.

Strongly collision free seems to me to mean: Even though in a large domain of messages there will clearly be a large number of collisions, It is computationally infeasible to find ANY two messages x and y such that x != y and H(x) = H(y), even if you are trying to craft the two together.

Sigh... I'm still not sure how to phrase this in a way which is clear to the casual reader, even assuming that I have correctly interpreted this. Anyone else care to give it a try?

You interpret correctly. But the text makes sense to me as-is.
Maybe this is a clearer non-technical rendering: "Weak means it is hard to find a collision WITH A SPECIFIC message. Strong means it is hard to find ANY MESSAGES that collide." But I agree with the anonymous person above... it makes sense to me as is. [But it is a bit of a technical statement for the math phobic.] Nahaj 17:54, 1 July 2006 (UTC)

I agree with Nahaj. However, I think the distinction between "weakly collision-free hash function" vs. "strongly collision-free hash function" only matters when we are talking about cryptographic hash functions -- right? So I see no need to even mention that distinction in the hash function article -- please delegate it to the cryptographic hash function article.

I agree this is only an issue for cryptographic hashing. There are similar issues that come up in adversarial hashing. For example if someone can find a large group of messages that all fall into the same hash bucket they can make your performance suffer. However I think that collision resistance is enough to make something a cryptographic hash function. SamHartman 21:33, 3 October 2007 (UTC)

Hashing is for indexing (?)

I have restored the qualification "usually a single integer that may serve as an index into an array" for the result of a hash function. To my knowledge, this property is what distinguishes hash functions (in the proper sense of the term) from other related but distinct concepts such as checksums, crypto hash functions, fingerprints, etc.. For instance, a function that returns a 128-bit value cannot be called a hash function.

Perhaps some people would rather define "hash function" as a generic term that covers all those concepts. But then we would be left without a name for the original sense ('a function that is suitable for "hashing" scattered data into a table'). All the best, --Jorge Stolfi (talk) 18:41, 21 November 2008 (UTC)

Overlap/contradiction with Hash table section

This article seems to have some overlap in function and content with Hash_table#Choosing_a_good_hash_function. Also, the latter characterizes the most important test for a hash function as being the avalanche condition, which seems a bit dubious to me - could an expert take a look at this? Dcoetzee 10:49, 24 November 2008 (UTC)

Illustration needed

Image 1. A typical hash function at work

This article had an illustration that was actually showing a checksum or fingerprint function, rather than a hash function proper (see at right). The illustration has been copied to the checksum and fingerprint (computing) articles. Would someone provide an illustration for hashing in the proper sense? (I don't have any SVG editor at hand). Thanks, --Jorge Stolfi (talk) 19:06, 21 November 2008 (UTC)

I made the original png image that the svg to the right is based on. I see that you removed the image from the article. Why did you do that? What do you think is wrong with the image? That is, what do you think needs to be changed?
It does show an actual hash function at work, in this case it shows the first four bytes (the first 32 bit word) of output from SHA-1. This image has been tested on many beginners and it helped them understand what hash functions are. And it has been discussed with other crypto geeks and computer scientists I know, and this is the best image we have come up with so far.
The reason we chose to show just 4 bytes of output in this article is that for table lookup and other such tasks fast short hash sums are usually used. For cryptographic use hash sums are much longer, so in the corresponding image over at cryptographic hash function we show the same image but with the full SHA-1 output.
--David Göthberg (talk) 11:24, 25 November 2008 (UTC)
Image 2. Hash table
The problem is that 32 bits are too many for a hash function. Image 1 would be OK for a checksum, and indeed I have adapted it for the checksum article (with 32 bit output) and for the fingerprint (computing) article (with 64-bit output). Moreover collisions are almost unavoidable in a hash function, and should be shown; whereas Image 1 and its original caption suggest that they do not exist. Finally, using SHA-1 for ordinary hashing seems an overkill.
I propose that we use the same illustration in hash table (Image 2) but without the third column (the key/value pairs). I would do the adaptation myself, but I don't have the time now (my SVG editor is emacs ...8-( ). --Jorge Stolfi (talk) 02:21, 26 November 2008 (UTC)
Well, what size the output should have in the image can be debated. What hash size one uses depends very much on the use case, so it can be anything from a few bits to a lot of bits.
And the image is not meant to show SHA-1 specifically, I just mentioned that the values you see happen to be produced by SHA-1, thus it does represent the kind of output you might get from a hash function.
I think that adding hash collisions in the first image the readers see perhaps makes things a bit too complicated. The first image should show the basics. And the basics here are that most/many hash sums take input of varying size (thus the short "Fox" text compared to the longer texts) and still produce a fixed size output. And that very small changes in the text produce very different hash sums (thus the two long but similar inputs). The images you added over at checksum and fingerprint (computing) fail to illustrate that checksums and fingerprints usually can take varying length inputs.
If you want to explain collisions then I suggest you use a second image further down in the article. Since my experience when teaching is that people first have to understand the basics I just described above, before they understand why collisions occur. And note that not all hash functions have collisions. Since for some uses we use hash functions that have the same size input as output, and then we often design them so they don't have any collisions at all.
So you removed an image that explains the basics, but you did not replace it with anything else. Thus you did more harm than good. I say we put the image back. If/when there is a better image, then it should perhaps be replaced. But as I said, I think collisions should probably be shown in a second image instead.
--David Göthberg (talk) 20:44, 26 November 2008 (UTC)
  • Well, let me try again:
  • When the term "hash function" was introduced (and for decades thereafter), it meant exclusively the sort of function one uses to build hash tables: it generates a small integer (not a hex string), much smaller than 32 bits. Apart from the (very rare) "perfect hash functions", hash functions were understood to have collisions just as life has taxes; and the many hash table algorithms were basically ways to handle those unavoidable collisions. They had *no* connection to cryptography or trapdoor functions at all. In particular, it is *not* true that "very small changes in the text [should produce] very different hash sums". (And I have never heard them called "hash sums").
  • At the same time, but quite independently, people invented "checksum functions" to guard against *accidental* errors. They had very different properties from hash functions: were usually 16 or 32 bits, were *guaranteed* to be distinct under *small* differences, etc.. No one ever confused checksums with hash functions. (And no one called them "hash sums").
  • In the late 80's yet another concept became popular, that of "fingerprints"; sort of longish checksums, but designed so that there would be no collisions even among all the files in a large filesystem. This was to be true even for non-random inputs, *but not necessarily under malicious attack*. They typically were 64 bits long or so, and still did not need cryptographic techniques. A small change in the input had to generate a different output, but it could be a small one too. Fingerprints were even further removed from hash functions, and, again, no one confused the two. (And no one called them "hash sums").
  • Finally, in the 90s, yet another concept became popular: a function that was almost certainly collision-free, *even for malicious inputs*. These functions needed even more output bits (128, 256, etc.) and cryptographic techniques. For this goal, a small change in the input must necessarily cause a large change in the output. Unfortunately, the cryptographers chose to call them "cryptographic hash functions" or "hash sums". However, they are *not* a special class of hash functions; and are distinct in purpose and design from checksums. The three concepts are almost as unlike as bottle, bottle gourd, and bottleneck.
  • So this is the root of the problem: we have three very, very different concepts with confusingly similar names. Each deserves a separate article, with a warning about the distinction. Image 1 is neither a "hash function" nor a "cryptographic hash function"; it could be a checksum function except that using SHA-1 for that purpose is an overkill. I don't think that a wrong image is better than no image. Image 2, minus the right-hand column, shows a proper "hash function".
  • Hope it helps, --Jorge Stolfi (talk) 01:02, 27 November 2008 (UTC)

Who is Jenkins?

Quote: "See the external link "Hash Functions for Hash Table Lookup" below for a comparison of speed and quality of various hash functions, including Pearson Hashing and some fast, high quality hash functions by Jenkins himself." It would be informative to know who on earth this Jenkins is, before he is alluded to as 'the big man himself'. :) --Egregius 23:11, 29 Mar 2005 (UTC)

He published a hash algorithm in Dr. Dobbs. I'm not aware of anything on the topic that he has published in any more formal venue, nor have I found any articles that examine his hash formally. I would have expected that if it had been formally evaluated it would have been mentioned on his web page. Nahaj 17:48, 1 July 2006 (UTC)
I was just about to ask the same thing. Why link to his source code in particular? If it's an algorithm of note, it ought to be worth having its own page on Wikipedia for, to which this article can link. If not, why mention it? --Kate (talk) 11:13, 24 February 2009 (UTC)

Geometric hashing

The section Geometric hashing looks like a description of Vector quantization. Vector quantisation collects groups (not just pairs) of data points that are close together (in typically 2- or more dimension space) so they can be represented with small errors by a single vector e.g. the centroid of the data group. Lossy compression is achieved by describing the data set using the reduced set of vectors. An example application is image compression from 24-bit RGB color to an optimum indexed 256-colour palette. In contrast, hashing seeks to group data in its own space that is constructed with the intention that data points that are actually close should not be kept together. Cuddlyable3 (talk) 11:22, 19 December 2008 (UTC)

  • Vector quantization is a special case of geometric hashing, although the latter is a more recent and (afaik) independently developed concept. "Bucket grid" techniques were used in the 1980's in several computer graphics applications, such as mapping a mouse click to the nearest object on the screen. Indeed, 2- and 3-dimensional bucket grids have probably been used since computers were applied to surveying, cartography, etc.. The Hough transform is basically geometric hashing of straight lines. The term "geometric hashing" applied to general shapes (as opposed to points) has a definite origin, perhaps in the 1990's; I will try to find the reference. Jorge Stolfi (talk) 15:55, 8 March 2009 (UTC)

Trivial Hash Function

The example code given for a trivial hash function doesn't seem to be the best way to explain this; whilst it is a completely valid section of code, it is probably a bit obscure to anyone who doesn't know C and/or C++. An example in some form of pseudocode might make more sense? The parts that may be too obscure to someone unfamiliar with the language are

char*

declaring a pointer to one character within a string of text, the conditional

if(g=h&0xf0000000)

both changing the value of g and checking this value, and

*p!='\0'

catching the end of the string of text. --Elder pegasus (talk) 22:32, 6 March 2009 (UTC)

  • That code was temporarily removed because of the above problems (and also because of the overly optimistic but unsourced claim that it generates no collisions). There is now a generic pseudocode in the section "Hashing variable-length data", which presumably can be made equivalent to the deleted code by the proper choice of the F function. Jorge Stolfi (talk) 16:02, 8 March 2009 (UTC)
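For the record, the fragments quoted above (the char* walk, if(g=h&0xf0000000), and *p!='\0') look like the classic PJW/ELF string hash. A commented reconstruction along those lines, so readers can follow the discussion without the deleted code (this is a guess at the removed code's intent, not a copy of it):

    /* PJW/ELF-style string hash, reconstructed from the fragments discussed
       above; not necessarily identical to the code that was removed. */
    unsigned long elf_hash(const char *p) {
        unsigned long h = 0, g;
        while (*p != '\0') {                     /* walk the string to its terminator */
            h = (h << 4) + (unsigned char)*p++;  /* shift in the next character */
            g = h & 0xf0000000UL;                /* grab the top four bits ...  */
            if (g != 0)
                h ^= g >> 24;                    /* ... and fold them back down */
            h &= ~g;                             /* then clear the top bits     */
        }
        return h;                                /* caller reduces this modulo the table size */
    }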

Hash functions and "modulo the table size"

There seems to be some confusion about the "modulo the table size" bit. A hash function used for a hash table with N buckets should return an integer in 0..N-1. That is, the "mod" operation, if it is done at all, is technically inside the hash function, not outside it. The whole discussion about what makes a good hash function assumes this view. As said further on in the article, one way to get a hash function for a given size N is to take a hash function h with a much larger range (such as 2^32) and then compute h(k) mod N. But that is not the only way, and it may easily give a very poor hash function. If N is a power of two, the mod-N trick gives the low-order bits of h(k), which, for some popular hash functions, are known to be the worst ones. There are many other ways of reducing a large hash value to a smaller one, but they cannot be analyzed or recommended separately from the raw hash function. This is another reason to consider the mod-N operation a part of the hash function, and to avoid mentioning it outside of that section. All the best, --Jorge Stolfi (talk) 01:57, 18 April 2009 (UTC)
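As an aside for readers hitting this in the archive: one standard alternative to h(k) mod N when N is a power of two is multiplicative (Fibonacci) hashing, which takes the high-order bits of an odd-constant multiply instead of the low-order bits of h(k). A sketch, assuming 32-bit raw hash values and a table of size 2^k with 1 <= k <= 32:

    #include <stdint.h>

    /* Reduce a 32-bit raw hash value to an index into a table of size 2^k.
       Multiplying by an odd constant (here floor(2^32 / golden ratio), as in
       Knuth's multiplicative method) and keeping the HIGH bits lets every bit
       of h influence the index, unlike h % (1 << k), which keeps only the
       low-order bits. */
    static inline uint32_t fib_reduce(uint32_t h, unsigned k) {
        return (h * 2654435769u) >> (32 - k);
    }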

Different from fingerprint?

From the two pages it seem like hash function and fingerprint are the same thing. Could someone explain the difference? Thanks, — sligocki (talk) 18:15, 12 November 2009 (UTC)

For example, they both claim to aim to avoid collisions. — sligocki (talk) 18:18, 12 November 2009 (UTC)
Consider that a hash function that collides infrequently might be fine for use in a hash table, but it wouldn't be a good fingerprint. Fingerprints are the subset of hashes with particularly low collision rates and resistance to tampering. Phil Spectre (talk) 03:35, 13 November 2009 (UTC)

MurmurHash conflict

I'm writing to ask that editors interested in hashing consider taking a look at MurmurHash. Recently, a particular editor with a track record of aggressive and hasty deletionism started cutting the article down and removing key references, turning it into a stub. He removed the entire summary paragraph, repeatedly tried to remove the algorithm, and has taken out almost all of the links to ports in various languages. This lowers the value of the article and I'm obviously very unhappy about it, but I don't see further discussion with him as likely to be productive. Instead, I'd like to get more people involved so that a genuine consensus can be formed. 208.80.104.2 (talk) 13:53, 11 September 2009 (UTC)

I suggest you take a look at WP:CIVILITY before you throw around unfounded accusations such as "aggressive and hasty deletionism".
To uninvolved parties, I'm the editor that's being referred to here. I'm more than happy for a 3rd opinion on the MurmurHash article. Oli Filth(talk|contribs) 17:48, 11 September 2009 (UTC)
Anyone capable of clicking on your talk page can decide for themselves whether I am correct in saying you have a history of aggressive and hasty deletionism. False accusations of incivility are themselves a violation of WP:CIVILITY, so mind your own manners. 208.80.104.2 (talk) 18:12, 11 September 2009 (UTC)

I've inherited the article, which I've recreated under guidance from the admins who initially opposed it. It's back on the chopping block, so please chime in: Wikipedia:Articles for deletion/MurmurHash (2nd nomination). Phil Spectre (talk) 00:35, 15 November 2009 (UTC)

Universal class of hash functions

The article does not mention universal hash functions, which are used in applications such as Cuckoo Hashing. It would be nice if it did.

Dr Tom Conway (talk) —Preceding comment was added at 09:30, 31 March 2008 (UTC)


I agree. But I'm biased being one of the authors of Universal Hash Functions, Wegman, so I won't make the change myself. It's much more than Cuckoo Hashing though. Universal Hash Functions almost guarantee that for any data set the hash function will work as expected. That is to say the programmer doesn't have to worry about collisions. I would propose to replace this section:

Uniformity

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of collisions — pairs of inputs that are mapped to the same hash value — increases. Basically, if some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.

Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good randomizing function is usually good for hashing, but the converse need not be true.

Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.

In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have more than one or two records. (In an ideal "perfect hash function", no bucket should have more than one record; but a small number of collisions is virtually inevitable, even if n is much larger than m -- see the birthday paradox).

When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-square test.[3]


with something different.

Few Collisions

The odds of a pair of elements colliding under the hash function should be small. If x and y are different it's desirable that the probability of H(x) = H(y) be roughly one over the size of the range of the function. That is to say if you are hashing into a table, the odds of two elements colliding should be no more than one over the number of slots in the table. The cost of lookup in the best implementations of tables is roughly the number of elements that collide.

Unfortunately hash functions are many to one mappings. The indexes to the tables we use are smaller than the strings. In a cryptography application the authentication tag is typically much smaller than the string being authenticated. So by definition there is some set of inputs that is all mapped to a single element and those inputs by definition collide. A good hash function can require someone to think carefully about the kinds of values being hashed to make sure that there won't be too many inputs that will all map to the same value under the chosen function.

An alternative to careful thought is universal_hashing. A universal hash function has one or more parameters to the function. The values of those parameters are chosen randomly. A universal hash function has the property that for any set of inputs the vast majority of parameter values will distribute the set with few collisions. For example H_{a,b}(x) might equal ((ax+b) mod p) mod q, where p is a prime at least the size of the largest input, a and b are chosen less than p, and q is the number of elements in the table. There are many implementation tricks to make the calculation of such a function fast.



The current write-up refers to perfect hashing, which can only be used once the entire set of values to be hashed has been determined, and such functions tend to be too slow to compute for most examples. There is no reason to bother with a chi-square test when using universal functions; the odds of getting a high value on such a test are provably infinitesimal. —Preceding unsigned comment added by MarkWegman (talkcontribs) 23:00, 19 January 2010 (UTC)
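For readers who want to see the H_{a,b} construction above spelled out, here is a minimal sketch, assuming keys fit in 31 bits and taking p = 2^31 - 1 (a Mersenne prime larger than any key), so the intermediate products fit in 64-bit arithmetic; a and b would be drawn from a random source once, when the table is created:

    #include <stdint.h>

    #define P 2147483647ULL   /* p = 2^31 - 1, prime, larger than any 31-bit key */

    /* Carter-Wegman style universal hash: ((a*x + b) mod p) mod q.
       a is chosen uniformly at random from 1..p-1 and b from 0..p-1; q is the
       table size. For any fixed pair of distinct keys x != y, the probability
       over the choice of a and b that they collide is roughly 1/q. */
    uint32_t universal_hash(uint32_t x, uint64_t a, uint64_t b, uint32_t q) {
        return (uint32_t)(((a * x + b) % P) % q);
    }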

Mathematical definition

In the introduction it is written that a hash function is a mathematical function. Therefore I miss a proper mathematical definition of a hash function. For example something like:

A hash function is a polynomial time algorithm H which maps a bit string of arbitrary length to a bit string of a certain length l. In formula: H: {0,1}* → {0,1}^l.

Now you could distinguish certain kinds and properties of hash functions in a more formal way. For example like:

  • fixed length: fix the domain to bit strings of a given length, i.e. H: {0,1}^l' → {0,1}^l with l' > l
  • family of hash functions or also called keyed hash functions
  • strongly and weakly collision resistant, and an explanation of the difference between them
  • preimage resistance: for a given H(m) it is infeasible to find a corresponding m
  • second preimage resistance: for given m and H(m) it is infeasible to find an m' ≠ m s.t. H(m) = H(m')

I read above that someone would prefer to write such properties in the cryptographic hash function article. But the book I am actually reading (Introduction to Modern Cryptography by Katz and Lindell) describes cryptographic hash functions as functions which fulfill certain properties, but defines these properties for ('noncryptographic') hash functions in general. Anyhow, at the moment these things aren't described in detail in either Hash_function or Cryptographic_hash_function.

--Wohingenau (talk) 13:54, 26 March 2010 (UTC)

'Bernstein hash' link goes to French wikipedia

The link at the bottom of the page to the 'Bernstein hash' page goes to http://fr.wikipedia.org/wiki/Table_de_hachage#Fonction_de_Hachage , which is probably undesirable on the English version of Wikipedia. I did a quick search for 'Bernstein hash' on the English wikipedia and didn't find a specific page about the algorithm, otherwise I would have fixed the link myself. Either the link should be removed, or the content in the French wiki article should be copied over. 202.20.20.202 (talk) 05:15, 24 May 2010 (UTC)

Cutter numbers

Is there a connection between hashes and Cutter numbers (Library Science)? The aim of Cutters, as far as I can tell, is to map points from a large sparse space (eg names of people) to a small space (eg 3 decimal digits). That sounds like hashing. Cutters have the additional property that order is preserved - that is, if name X is alphabetically before name Y, then the Cutter number of X comes before the Cutter number of Y.

An issue that Cutters make use of is that points in the name space are not uniformly distributed. A practical way to generate Cutters for people's names is to use a phone book. Find the page with the name, or where the name would be if it is missing, and use the page number. This takes account of the non-uniform frequency distribution of names.

138.251.194.28 (talk) 14:43, 23 August 2010 (UTC)

Cutter codes are generated according to a set of rules (i.e., a hash function), although some human judgement is allowed to accommodate very common names; on the other hand, such judgement could be incorporated into the rules of one's Cutter algorithm. Cutter is not a hash function itself, but a kind of hash function, and different libraries may use different Cutter tables in generating their Cutter codes (hashes). Here's an example of a Cutter table: Example Cutter table

However, this Wikipedia page defines hash codes as being of "fixed length", which Cutter codes are not, yet gives no explanation why hash codes must be of fixed length.

Help understanding please...

How are any of these functions regarded as reliable? Take SHA-1, for example: with a message size of up to 2^64-1 bits, and a digest size of 160 bits, any given digest can have many tens of trillions of messages (roughly 2^57, assuming the algorithm yields high efficiency and even distribution). Compared to the fact that any given message yields exactly one of up to 2^160 digests, there's an incredible discrepancy (like, a billion orders of magnitude) and collisions seem inevitable under the circumstances of even very low-tax, common applications (those applications where message sizes are nowhere near the full possible length). I guess the question in short is: How long can messages be, before the risk of frequent collisions becomes a practical problem for the application? Something like message lengths of 2^8 bits appears to be the point at which collisions become absolutely guaranteed. For many cryptographic hash applications, it doesn't matter if you discover the actual message, if you discover any one of the many tens of trillions of messages which match the digest, you're ready to spoof. So, I am missing something. --76.102.243.117 (talk) 22:25, 7 June 2010 (UTC)

It's not how long the messages are, it's how many messages you have. If you have more than 2^m messages, where m=n/2 and n is the bit length of the hashes, then you've got at least a 50% chance that there's at least one collision somewhere.
At least, that's the case if the hash function is good, and there's nothing special about the messages; i.e. if the hashes are generated randomly enough that you can apply statistics to the problem.- Wolfkeeper 22:47, 7 June 2010 (UTC)
That's the Birthday paradox at work.- Wolfkeeper 22:48, 7 June 2010 (UTC)
OK, from there I found the link to Birthday attack... --76.102.243.117 (talk) 10:35, 25 August 2010 (UTC)
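To put numbers on the answer above: for an ideal n-bit hash and k messages, the probability of at least one collision somewhere is approximately 1 - e^(-k(k-1)/2^(n+1)), which crosses 50% near k ≈ 1.18 · 2^(n/2). A small check of that arithmetic (illustrative only):

    #include <math.h>
    #include <stdio.h>

    /* Approximate probability of at least one collision among k random
       messages under an ideal n-bit hash (the birthday bound). */
    double collision_probability(double k, int n) {
        return 1.0 - exp(-k * (k - 1.0) / (2.0 * pow(2.0, n)));
    }

    int main(void) {
        /* For a 160-bit digest such as SHA-1, 2^80 messages give roughly a 39%
           chance of some collision; a few times that many and a collision is
           essentially certain. Note that only the number of messages matters,
           not their lengths. */
        printf("%f\n", collision_probability(pow(2.0, 80), 160));
        return 0;
    }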

Collision-free hashing is lossless coding ?

A hash function that has no collisions is what we call a lossless coding. Cuddlyable3 (talk) 23:48, 26 November 2008 (UTC)

This is a poor analogy - lossless coding usually produces variable-size codings and usually takes advantage of context to determine the code for each symbol, and so isn't really a function at all. A hash function with no collisions is best called a perfect hash function. Dcoetzee 00:08, 27 November 2008 (UTC)
Agreed. Lossless coding is different from perfect hashing in many aspects. It may deserve a "See also", but not more. --Jorge Stolfi (talk) 01:02, 27 November 2008 (UTC)
It's a statement not an analogy. It does not say that lossless codings are hash functions. A hash function with no collisions can be reversed to recover the original data, hence lossless. Whether particular lossless codings should be called functions depending on whether they produce fixed-length codes and on whether they take advantage of context seems moot here. Cuddlyable3 (talk) 11:40, 19 December 2008 (UTC)
I misunderstood your phrasing as suggesting that perfect hash functions should be explained in terms of lossless coding and readers redirected to the article on lossless coding. Perfect hash functions can be used as (very poor) lossless codes; this deserves a link, but they're studied in very different contexts and understanding one will not allow you to automatically understand the other. Dcoetzee 18:49, 19 December 2008 (UTC)
I don't have any editing proposal about this yet. The article talks about perfect = injective hashes which to me seem equivalent to substitution codes, and not very useful (except when the substitution function is both changing and secret. That case is the one-time pad that gives excellent cryptography.) I suggest that the alphabetical ordering in a telephone directory of the names of people who in the real world are in an unrelated order is a combination of two hashes: (i) a very compressive hash with many collisions whereby names are sorted into 26 bins (by their first letters A, B, ... Z). (ii) a perfect hash that allocates exactly one bin (text line) to each person. Cuddlyable3 (talk) 20:59, 19 December 2008 (UTC)
If the hash function had no collisions in the first place (and recall that, since a hash function is deterministic, this would imply that the inputs are of bounded size and all distinct), why would you use a hash function in the first place? 121.73.17.17 (talk) 00:44, 6 December 2010 (UTC)

Must Hash Codes Be of "Fixed Length"?

This page defines hash codes as being of "fixed length", but offers no explanation of why uniform length is an inherent and necessary trait of hash codes. There are hashing systems (such as Cutter codes) that produce (slightly) varied hash-code lengths, but that satisfy all the functional requirements of a hashing system. I would argue that fixed or uniform length of hash codes is a typical characteristic, but not a required characteristic of hash codes. Anyone know any reason to the contrary? — Preceding unsigned comment added by 76.121.65.17 (talk) 17:28, 23 November 2012 (UTC)

Bloom filters and hashing

Yes, Bloom filters use hash functions, but this section isn't helpful. What does "a compact data structure that provides an enclosing approximation to a set of them" even mean? I'd recommend saying something that makes sense here, or deleting the section.

Paul Kube (talk) 22:54, 4 January 2013 (UTC)
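For anyone rewriting that section: the phrase is trying to say that a Bloom filter answers "possibly in the set" or "definitely not in the set" using only a bit array and a few hash functions; it never gives false negatives but may give false positives, and that superset-like behaviour is the "enclosing approximation". A minimal sketch (hash1 and hash2 are placeholders; any two reasonably independent hash functions would do, and real filters usually use more than two):

    #include <stdint.h>
    #include <string.h>

    #define BLOOM_BITS 8192   /* size of the bit array; illustrative value */

    typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

    /* Placeholder hash functions; any independent string hashes could be used. */
    extern uint32_t hash1(const char *key);
    extern uint32_t hash2(const char *key);

    static void set_bit(bloom_t *b, uint32_t i) {
        i %= BLOOM_BITS;
        b->bits[i / 8] |= (uint8_t)(1u << (i % 8));
    }

    static int get_bit(const bloom_t *b, uint32_t i) {
        i %= BLOOM_BITS;
        return (b->bits[i / 8] >> (i % 8)) & 1;
    }

    void bloom_init(bloom_t *b) { memset(b->bits, 0, sizeof b->bits); }

    void bloom_add(bloom_t *b, const char *key) {
        set_bit(b, hash1(key));
        set_bit(b, hash2(key));
    }

    /* Returns 0 only if the key was definitely never added (no false negatives);
       returns 1 if the key is possibly in the set, which is occasionally wrong
       (false positives). */
    int bloom_maybe_contains(const bloom_t *b, const char *key) {
        return get_bit(b, hash1(key)) && get_bit(b, hash2(key));
    }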

@"Perfect Hasing"

maybe we should mention that any perfect hash function cannot output a smaller data set than its input? :p

if the hashing function is "perfect", the output cannot be smaller than the input...Divinity76 (talk) 19:08, 19 July 2012 (UTC)


Perfect???: PCR-Hash to texts for Programm Codon Restrict IS rename failure by Polesinskij:

l:=(Length of bytes)

S:=(total plass of bites of all bytes)

U:=(total multiplicat for bites of all bytes)


with sqrt on n=N:

([S/l]=n to sqrt of U)=k

x:=([k(to float)]=n2 to sqrt of U)

total concatenat:l+k+x

second part with_2-sqrt:

beg {z:=l/2; int Y+1; while z>2;} end

beg {k:=2sqrt of U;Y-1; while Y>0;} end

x:=[k(to float)=n to sqrt of U]

total concatenat:l+k+x


— Preceding unsigned comment added by 95.153.189.81 (talk) 19:48, 1 November 2013 (UTC)

"data" is plural

The article uses singular, such as "the data is". — Preceding unsigned comment added by 193.11.93.250 (talk) 09:10, 5 November 2013 (UTC)

not necessarily. -- intgr [talk] 13:43, 5 November 2013 (UTC)

Uniformity of telephone numbers/license plates

Neither telephone numbers nor license plates have a uniform distribution over their number ranges/values. Therefore, using modulo arithmetic will not have the desired effect. 203.97.96.78 (talk) 23:07, 27 August 2008 (UTC)

The highest-order digits ("exchange code") are usually assigned by location. Therefore, a local business that hashes its customers' phones into a 1000-entry table by using the three high-order digits will find that most of the phones are mapped to a couple of table entries. On the other hand, the lowest-order digits of phone numbers are assigned in an essentially random way within each exchange. So, the modulo 1000 hash should spread the numbers fairly uniformly in the range [0..999] (which is what is needed). The same holds for a general table size N, up to a few thousand. Does this make sense? --Jorge Stolfi (talk) 19:00, 21 November 2008 (UTC)
It makes sense, but that claim requires citations. I've edited to request these. 192.12.184.6 (talk) 16:51, 8 November 2013 (UTC)
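A toy illustration of the point above (my own example, assuming 10-digit numbers stored as plain integers, not taken from the article): the low-order digits spread local numbers well, while the high-order digits funnel them into a few buckets:

    #include <stdint.h>

    /* Hash a 10-digit phone number, stored as an integer, into 1000 buckets. */
    uint32_t hash_low_digits(uint64_t phone) {
        return (uint32_t)(phone % 1000);              /* last three digits: close to random within an exchange */
    }

    uint32_t hash_high_digits(uint64_t phone) {
        return (uint32_t)((phone / 10000000) % 1000); /* leading digits: area/exchange code, nearly constant for local customers */
    }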

Checksums, fingerprints, and cryptographic hash functions are distinct

The article states that hash functions are the same thing as checksums, fingerprints, and cryptographic hash functions. Actually these concepts are related but distinct; they serve different purposes and have different requirements. Jorge Stolfi (talk) 19:58, 23 February 2008 (UTC)

I noticed that the article then goes on to say under the Descriptions section that hash tables are related to, but not the same as, checksums, contradicting itself in a space of a few paragraphs. Velociostrich (talk) 16:49, 18 November 2012 (UTC)

True! Anyone care to edit it lol, wikipedia is full of "this" these days. A hash is definitely an extension of the checksum idea, which is an extension of the parity bit idea. This is an emergent field though, so some people do not understand the history of the language; it waits for someone to take authority on the idea, but citations are preferred DarkShroom (talk) 10:16, 13 April 2014 (UTC)

Dubious source of CRC32 information

Bret Mulvey's analysis referenced in the article shows a CRC-32 implementation where the polynomial used is the most-significant-bit (msb)-first representation of one of the standard CRC-32 implementations. However, his table initialization code is for a lsb-first implementation. (Note that all 256 of his table values will have the most significant 4 bits unset.) His CRC-calculation code is correct for checking the CRC-32 in a protocol where the CRC-32 is appended to the message such that the CRC-32 of the message with the appended CRC-32 is a constant. However, it appears that he didn't append 4 null bytes to his keys (as is effectively done in protocol implementations by zero-ing out the 4 bytes left for the CRC at the end of the message buffer). Note that the worked-through 3-bit CRC example in the Wikipedia article on CRCs includes padding with 3 zero bits.

These two flaws (improperly mixing lsb-first and msb-first implementations, and failing to pad with 32 zero bits) together account for his poor results.

Kmag (talk) 05:48, 15 December 2013 (UTC)

@Kmag: The CRC-32 mapping is given in the article as an example of intermediate hashing. Although CRC-32 is not intended to be a hashing algorithm and its output length (4 bytes) is too small to avoid frequent collisions, it does give very different values for slightly different input by design. This is why it is used for error detection. A poor implementation does not mean the algorithm itself is weak. I think it is good that the source was removed because it creates a false impression. However, we don't really need to support the argument that CRC-32 produces very different results for slightly different input, because it is the main characteristic of the algorithm. Nxavar (talk) 10:54, 4 June 2014 (UTC)
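For reference, the lsb-first ("reflected") CRC-32 that the analysis seems to have been aiming at can be written without a lookup table at all, which avoids the table-initialization mismatch described above; a bit-at-a-time sketch using the reversed IEEE polynomial 0xEDB88320 (the standard construction, not taken from Mulvey's code):

    #include <stddef.h>
    #include <stdint.h>

    /* Bit-at-a-time, lsb-first (reflected) CRC-32 with the IEEE 802.3
       polynomial. Slower than a table-driven version, but there is no table
       to get wrong. */
    uint32_t crc32_ieee(const uint8_t *buf, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;           /* initial value */
        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
        return ~crc;                          /* final xor */
    }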

Functions versus algorithms

Every once in a while, someone confuses functions with algorithms. Functions are mappings, algorithms are abstractions of computer programs. Functions are concepts, algorithms are implementations. So saying that "A hash function is any algorithm that maps ..." has things entirely backwards. Correct would be: "A hash function is a function that maps ...", and "A hash algorithm is an algorithm that implements some hash function". Boute (talk) 13:45, 27 April 2014 (UTC)

It's easy to see where the confusion comes from, the word "function" can refer to a subroutine in a computer program (implementation of an algorithm), as well as a "mathematical" function, which is an abstract mapping. Maybe it's easier to just avoid the distinction in the first place? -- intgr [talk] 11:56, 4 June 2014 (UTC)
@Nxavar: recently added this to the lead: "A hashing algorithm evaluates a hash function repeatedly to produce a hash value for input data of arbitrary size". I think you're referring to the "round function" in cryptographic hashes, which is indeed executed multiple times. -- intgr [talk] 12:08, 4 June 2014 (UTC)
I am not referring to the round function. The sentence should be understood in the context of how the hash function was just described: it takes data of fixed size and some hash value as an argument. I chose not to elaborate on how, for data of arbitrary size, the hash function can be used to produce a hash value: a) the input must be broken into chunks that have the size supported by the hash function, and b) the hash function is evaluated for each chunk, with its output serving as the hash value for the next evaluation. This iterative procedure is the hashing algorithm, and terminates when all chunks have been processed, returning the last hash value. Checksums are produced the same way. Nxavar (talk) 14:22, 4 June 2014 (UTC)
There is nothing wrong in calling an algorithm with no side effects a function. Such algorithms are restricted to producing some output for given input, which matches perfectly the definition of a function. Hashing algorithms have no side effects. I suggest that we go back to referring to the hash function as the hashing algorithm. Nxavar (talk) 04:09, 5 June 2014 (UTC)
I forgot to mention that in order to produce same output for same input, the algorithm must not use any global variables, and must be deterministic. Both are true for hashing algorithms. Nxavar (talk) 04:19, 5 June 2014 (UTC)
I think I see now what you mean. So you say that "hash function" is a function of hash(prev_state, fixed_block) -> new_state and a "hash algorithm" applies that function to arbitrary-length inputs?
I think we can't reasonably make progress here by hand-waving and opinions, so please provide some sources supporting this definition. This does not match any definitions of "hash function" that I've seen. Most definitions say "arbitrary-length input, fixed-length output". Admittedly these are cryptography sources because I couldn't find solid definitions for non-cryptographic hashes. For example Handbook of Applied Cryptography (Menezes et al) defines it as:
A hash function (in the unrestricted sense) is a function h which has, as a minimum, the following two properties:
1. compression — h maps an input x of arbitrary finite bitlength, to an output h(x) of fixed bitlength n
Applied Cryptography (Schneier) chapter 18:
A one-way hash function, H(M), operates on an arbitrary-length pre-image message, M. It returns a fixed-length hash value, h.
Most cryptographic hashes today use Merkle–Damgård construction or similar, which is indeed such an iterated algorithm, but the resulting construction is usually called "hash function", while the internal function is a "one-way compression function" or just "compression function". -- intgr [talk] 10:41, 5 June 2014 (UTC)
From the pseudocode given in the article for SHA-1:
...
Process the message in successive 512-bit chunks:
break message into 512-bit chunks
for each chunk
...
The chunk size is the block size mentioned in comparison tables of cryptographic hash functions. I do not think there is a distinction between a hash function and a hashing algorithm in the literature, but small hashes are usually based on some arithmetic expression and you can say that you have a "normal" function and an algorithm that uses it. Nxavar (talk) 12:19, 5 June 2014 (UTC)
@Nxavar: I do not think there is a distinction between a hash function and a hashing algorithm in the literature
Are you familiar with verifiability policy? This sounds like original research and does not belong to Wikipedia.
As for SHA-1, I already addressed that in my last comment: SHA-1 is a hash based on the Merkle–Damgård construction. The inner function is called a "one-way compression function" and the resulting construction is a "hash function". The Handbook of Applied Cryptography also uses this terminology. -- intgr [talk] 14:22, 5 June 2014 (UTC)
You focus too much on cryptographic hash functions. This is not the only type of hash function. What is called hash function for small hashes is called compression function for large ones. Cryptographic hash functions are effectively hash functions that produce large hashes. The qualifier cryptographic just signifies the intended area of application. Anyway, I completely agree with not including such information in the lead. What about putting it to the "Description" section, using appropriate terminology? Nxavar (talk) 15:10, 5 June 2014 (UTC)
Cryptographic hash functions have higher standards for collision resistance, of course, but their basic algorithmic design is the same as for other hash functions. Nxavar (talk) 15:15, 5 June 2014 (UTC)

True, I brought up cryptographic hash functions because that's what I could get a solid definition for. Ignore the cryptography, I'm asking you to find a reliable source that supports your point of view. Wikipedia requires sources for contentious claims. If there's no source then Wikipedia shouldn't state this. -- intgr [talk] 16:34, 5 June 2014 (UTC)

Here are some sources:
Hash Functions
Selecting a hashing algorithm and the full article on the authors website.
They deal with small hashes. You can see that the hash functions are indeed referred to as hashing algorithms, and they use an "inner function" which is indeed an arithmetic expression. We now have references for the "outer function" - "inner function" concept, where the former feeds the input data in chunks to the latter, for both simple and cryptographic hash functions. Nxavar (talk) 18:03, 5 June 2014 (UTC)
I am not contesting the fact that you can split a hash function into an inner and outer part. I am talking specifically about the claim you are trying to make about the terms "hash function" and "hashing algorithm":
"A hashing algorithm evaluates a hash function repeatedly to produce a hash value for input data of arbitrary size"
Neither of these sources supports your distinction. In fact, they seem to contradict your claim. "Selecting a Hashing Algorithm" seems to use the two terms interchangeably and the other page describes the whole construction (including the loop) as "hash function". If I am mistaken, be more specific about how they support your claim. -- intgr [talk] 20:07, 5 June 2014 (UTC)
As I mentioned before, we can use appropriate terminology for the article. The terms are indeed used interchangeably, but when you try to use them together the obvious choice is to name the outer function the "hashing algorithm" and the inner function the "hashing function". This terminology, of course, is not compatible with cryptographic hash functions because different terms are used in that case, even though they mean the same thing. However, by putting this content in the Description we won't have to be terse and compress everything to a single sentence. We can give all necessary details for avoiding misinterpretations. Nxavar (talk) 04:12, 6 June 2014 (UTC)
Sorry, but I disagree that your proposed terminology is "obvious" (or even accurate). I don't think you should add it to the article at all, on the grounds that it's original research. -- intgr [talk] 13:17, 6 June 2014 (UTC)
No problem. We can show how "ordinary" hash functions have an inner loop with an arithmetic expression and then show that cryptographic hash functions have an inner loop with a compression function, and that both inner loops work on data of fixed size, and that this is a similarity of the two types of hash function. Nxavar (talk) 14:10, 6 June 2014 (UTC)
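For what it's worth, a minimal C sketch of that shared structure, using the FNV-1a arithmetic expression as the "ordinary" inner step (the constants are the standard FNV-1a ones; this is only an illustration, not proposed article text):

    #include <stdint.h>
    #include <stddef.h>

    /* Outer loop feeding fixed-size chunks (here, single bytes) of the
       input to an inner mixing step. In an "ordinary" hash function the
       inner step is an arithmetic expression, as below; in a
       Merkle-Damgard hash it would be a one-way compression function. */
    static uint32_t hash_outer(const unsigned char *data, size_t len)
    {
        uint32_t state = 2166136261u;       /* FNV-1a offset basis */
        for (size_t i = 0; i < len; i++) {  /* outer loop over chunks */
            state ^= data[i];               /* inner step: mix chunk in */
            state *= 16777619u;             /* FNV-1a prime */
        }
        return state;
    }

The point of the sketch is only that both kinds of hash function share the chunk-feeding outer loop; the inner step is what differs.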
Agreed. -- intgr [talk] 17:04, 6 June 2014 (UTC)

"A hash function that is used to search for similar...".

This citation is referencing a Wikipedia book and has no added value — Preceding unsigned comment added by 131.159.218.64 (talk) 13:06, 16 June 2015 (UTC)

Factual error pertaining to Unicode

in the case of Unicode characters, the table would have 17×2^16 = 1114112 entries.

As the first paragraph of Unicode notes, there are more than 2^16 different Unicode characters, so this sentence is wrong. For purposes of the article's example, it's sufficient (and sufficiently accurate) to change 2^16 to 2^32, reflecting the fixed character width used in UTF-32. (17 × 2^32 = 73,014,444,032 entries) Enoent (talk) 17:00, 5 July 2015 (UTC)

History?

What about the history of hash functions? Admittedly hashing has only been a vital part of information security systems since the Computer Age, but there must be some history of the mathematical foundations. — Preceding unsigned comment added by 202.163.119.34 (talk) 10:01, 11 October 2015 (UTC)

Hash functions weblink

Do you think this website is helpful and should be added in the External links section? http://tools.timodenk.com/?p=hash-function

--109.192.64.215 (talk) 13:56, 31 December 2015 (UTC)

how about 2 simple examples

They are not good (failing Spectral_test with very few domains), but they are small and famous (fast LCG-hash) cores.

   //[Linear congruential generator] hash, 4 domains in, 4 domains out.
   #define HASHSCALE4 vec4(.1031,.1030,.0973,.1099)
   vec4 hash44(vec4 p){p=fract(p*HASHSCALE4);//arbitrarily scaled seesaw modulo remainder, with bias to hide failing a [Spectral test] in 2d.
    p+=dot(p,p.wzxy+19.19);//dot product with domain-swiveled self
    return fract(p.zywx*(p.xxyz+p.yzzw));//seesaw modulo remainder of a linear polynomial with domain-swiveled selves.
   }//https://www.shadertoy.com/view/4djSRW
   //[Linear congruential generator] hash, 4 domains in, 4 domains out; does mod(x,17*17). Apparently intended for small integer types; does not even bother to scale or swivel domains.
   vec4 mod289(vec4 x){return x-floor(x*(1./289.))*289.;}//289=17*17;

Ollj (talk) 07:34, 8 September 2017 (UTC)

Article contains contradiction

The article says "A hash function is any function that can be used to map data of arbitrary size to data of fixed size.", then immediately states that checksums are not hash functions. However, checksums satisfy this definition. A perfunctory examination suggests that some of the others the article lists: "checksums, check digits, fingerprints, lossy compression, randomization functions, error-correcting codes, and ciphers", also fit the given definition of a hash function. — Preceding unsigned comment added by Ihearthonduras (talkcontribs) 04:20, 21 March 2018 (UTC)

robust noisy room hashing

"Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room." - very interesting, can we have a source - william sharkey

Passage no longer in article. Sbalfour (talk) 14:54, 20 September 2019 (UTC)

"The Hash Keeper database ..."

Since nobody knows what this is, what is the point of the sentence?

Deleted.Sbalfour (talk) 14:54, 20 September 2019 (UTC)

"The remaining characters of the string ..".

Given that the previous so-called sentence runs on and on without saying anything, what "remaining characters"?

I think this has been clarified. Sbalfour (talk) 15:29, 20 September 2019 (UTC)

The definition

The opening sentence says: A hash function is any function that can be used to map data of arbitrary size to fixed-size values. And it's cited, though the source is popular rather than scholarly. The definition is a common one, though I'd say it's aped from one source to another. In many cases, the data is already a fixed size. What then? Suppose I have 40,000,000,000 64-bit values, and need to collect some modest set of them meeting some criteria into a table which I need to index quickly to retrieve any particular one. It's ok if the table has empty slots or is even sparse, but must be many orders of magnitude smaller than 2^64, or even 10^10. I don't need a hash function, because the definition says so: my data is already fixed length, all 64-bits.

On the other hand suppose I define a function which always returns 1, regardless of the value or length of the key. According to the definition, that's a hash function. Not only is the hash value fixed length (one bit), but also a fixed value. You may say that it's merely the case that we've constructed a bad (very bad) hash function. But the function is so bad, that it no longer conforms to ordinary notions of what a hash function is. Suppose again that I define an algebraic function, which always returns one. I didn't specify what the effective domain of the function is. I input the value 1, and it returns 1, the square root of 1. It's a square root function by definition, but again does not conform to our ordinary notion of such a function.

For all their bountiful effusion, the editors have not clarified what the basic concepts of a hash function are, and the definition is symptomatic of the problem.

A hash function is a mapping. It does two things: first, it yields an index into the space of key values for each key, not necessarily a unique index; i.e. it distributes the key values in some fashion, the more random the better, over the key space. The key values may all be sequential, like soldiers' dog tag numbers, allocated from some arbitrary starting number serially. If we find a dog tag, and want to look up the soldier's name, how do we find that dog tag in a list? It won't do to try linear or even binary search of the list - the search space is enormous, and the list isn't necessarily constructed in a serially increasing order by dog tag number: someone at the end of recruiting took all the sign-up sheets in a messy stack and entered the tag numbers line by line into the computer. Now we need to construct a referenceable table. We can use the dog tag number itself, so that dog tag number 1,000,000 (the nth dog tag in the list) goes into slot number 1,000,000 in the table. So if a medic finds dog tag number 1,000,000 on the battlefield, he can index slot 1,000,000 in the table, and find that the soldier's name is Joe Blow. The problem is, there are billions, possibly trillions or more, possible dog tag numbers, and mapping them onto the computer system's disks either takes a lot of disks or is infeasible. So, second: a hash function maps a large state space of key indices into a manageable state space, such that collisions are minimized. Key indices which were distinguishable in the original key space may not be distinguishable in the reduced key space, i.e. there's a collision, and one of the dog tag indices will have to be mapped somewhere else in such a way that it is findable.

If key values were truly random, then we could take any n bits of the key and use them as an index into a table of size 2^n. The "hash" function could be just about anything that selects n bits. But the keys are not random, they're clustered serially, one cluster for each recruiting station. If the clusters are not broken up, when we attempt to map the indices into a reduced table, there will be many collisions, making a reduced table with little more order than a messy stack of papers. So the hash function cannot be just 'anything' - it has to break up the clusters and distribute the key indices over the key space, then map the indices into a practical-size table. When we combine both processes into a single operation, like modulo, we confuse ourselves about exactly what happened. (modulo, BTW, does NOT break up serial clusters of key values, so is a poor choice for hashing dog tag numbers, or any application where serial clustering may occur).
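To make the two separate steps concrete, here is a minimal C sketch (the multiplier is a commonly used 64-bit constant, roughly 2^64 divided by the golden ratio; any odd constant with well-mixed bits would serve, and the table size is assumed to be a power of two):

    #include <stdint.h>
    #include <stddef.h>

    /* Step 1: scramble - break up serial clusters by mixing the key's bits. */
    static uint64_t scramble(uint64_t key)
    {
        return key * 0x9E3779B97F4A7C15ULL;   /* odd multiplier, ~2^64/phi */
    }

    /* Step 2: reduce - map the scrambled value into a table of 2^m slots
       (0 < m < 64) by keeping the high bits, which the multiply mixes most. */
    static size_t reduce(uint64_t h, unsigned m)
    {
        return (size_t)(h >> (64 - m));
    }

    /* index = reduce(scramble(dog_tag), m). Plain dog_tag % table_size,
       by contrast, sends consecutive tag numbers to consecutive slots:
       the serial clusters survive intact. */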

All the rambling and roaming of the article fails to tell us about the one, and then about the other. And it doesn't have any information on collision resolution, or performance of the algorithms. Nor how external hashing, i.e. across physical interfaces like disks, drums, tapes, etc, is handled efficiently.

We need to start again, and we need to start at the top with a notional definition, cited or not.Sbalfour (talk) 16:54, 20 September 2019 (UTC)

Checksums (MD5) and cryptographic applications (SHA-1, etc)

This article is not for these. They have their own articles. I propose deleting them forthwith, since there's already a note at the bottom of the lead redirecting readers to them.Sbalfour (talk) 18:19, 20 September 2019 (UTC)

I'm going to be BOLD and just do it.Sbalfour (talk) 18:50, 20 September 2019 (UTC)

Multiplicative hashing section inscrutable

I have no idea what the given formula means: everything is undefined, including M, W, m, w, a, x, y, and h_a (how's that different from h?). I'm pretty sure it was copied from somewhere, and not the source stated, which uses different notation. So what's the real source? If we knew that, we'd probably have a copyvio.Sbalfour (talk) 20:05, 21 September 2019 (UTC)
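For reference, the usual textbook form of multiplicative hashing (Knuth's method), which is presumably what such notation abbreviates; a sketch assuming 64-bit words and a table of 2^m slots, not necessarily what the article's cited source intended:

    #include <stdint.h>
    #include <stddef.h>

    /* h_a(x) = (a * x mod 2^w) >> (w - m), with w = word width in bits
       (64 here), m = log2(table size), a = an odd w-bit multiplier.
       The "mod 2^w" is implicit in unsigned overflow of the multiply. */
    static size_t hash_mult(uint64_t x, uint64_t a, unsigned m)   /* 0 < m < 64 */
    {
        return (size_t)((a * x) >> (64 - m));
    }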

Article rambles

This article says much, but conveys little. It's like a survey with a few "how-to" tidbits thrown in, and little or no analysis of how these functions perform. For example, the three sections: Hashing uniformly distributed data; Hashing data with other distributions; Special-purpose hash functions; essentially say the same thing, that if we know something about the distribution or structure of the data, we can construct a better hash function. How we do that depends on what we know, and that's simply too wide an endeavor to detail here. Sbalfour (talk) 15:42, 20 September 2019 (UTC)

I've fixed a lot of this with focus on two things: hashing integers, and hashing variable length data. Sbalfour (talk) 18:58, 22 September 2019 (UTC)

Missing sections/info

There's nothing in the article about deletions, performance of the algorithms versus data, or analysis (i.e. theoretical performance, best case, average case, worst case, etc). The article is critically anemic. Sbalfour (talk) 19:00, 22 September 2019 (UTC)

proposed Description section

A hash function takes as input a key, which is associated with a datum or record and used to identify it to the data storage and retrieval application. The keys may be fixed length, like an integer, or variable length, like a name. In some cases, the key is the datum itself. The output is a hash code used to index a hash table holding the data or records, or pointers to them.

A hash function may be considered to perform three functions:

  • Convert variable length keys into fixed length (usually machine word length or less) values, by folding them by words or other units using a parity-preserving operator like ADD or XOR.
  • Scramble the bits of the key so that the resulting values are uniformly distributed over the key space.
  • Map the key values into ones less than or equal to the size of the table

A good hash function satisfies two basic properties: 1) it should be very fast to compute; 2) it should minimize collisions. Hash functions rely on favorable probability distributions for their effectiveness, reducing access time to nearly constant. High table loading factors, pathological key sets and poorly designed hash functions can result in access times approaching linear in the number of items in the table. Hash functions can be designed to give best worst-case performance,[Notes 1] good performance under high table loading factors, and in special cases, perfect (collisionless) mapping of keys into hash codes. Implementation is based on parity-preserving bit operations (XOR and ADD), multiplication, or division. Implementations include an adjunct collision-resolution method. Sbalfour (talk) 00:44, 23 September 2019 (UTC)
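As a concrete illustration of the three functions listed above, a deliberately simple C sketch; the constants are arbitrary but commonly used, and nothing here is proposed as article text:

    #include <stdint.h>
    #include <stddef.h>

    static size_t hash_key(const unsigned char *key, size_t len, size_t table_size)
    {
        uint64_t h = 0;

        /* 1. Fold the variable-length key into one machine word with XOR
              (ADD would do as well), one byte per position, wrapping. */
        for (size_t i = 0; i < len; i++)
            h ^= (uint64_t)key[i] << ((i % 8) * 8);

        /* 2. Scramble the bits so that similar keys spread apart. */
        h ^= h >> 33;
        h *= 0xFF51AFD7ED558CCDULL;   /* a 64-bit mixing multiplier */
        h ^= h >> 33;

        /* 3. Map the scrambled value into [0, table_size). */
        return (size_t)(h % table_size);
    }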

Hash tables subsection: Bad concept, and bad diction

The text says this: Typically, ... a hash function ... will map several different keys to the same index which could result in collisions. Not could result in a collision, but WILL result in a collision. And no, a hash function will NOT typically map several keys to the same index - it will typically map only ONE key to an index. If there are more than a few collisions, we'll usually choose another hash function. The whole section needs to be cleaned up for conceptual presentation as well as technical diction.Sbalfour (talk) 19:17, 20 September 2019 (UTC)

Then it says: Hash table implementations include a specific hash function—such as a Jenkins hash or Zobrist hashing—and a hash-table collision resolution scheme—such as coalesced hashing, cuckoo hashing, or hopscotch hashing. This is just mongering, or I don't know what. These are highly specialized hash functions and techniques. The most common hash function is modulo-based (even when hashing strings); the most common hash table structure is open addressing; and the most common collision resolution technique is linear probing. These are the simplest, and usually quite good enough; that's why. I think we ought to say these first, and relegate the others to a footnote, if we mention them at all. Sbalfour (talk) 21:30, 20 September 2019 (UTC)

I think the editor of that section had a singular and rather dogmatic idea of what hash functions and tables actually look like in practice.Sbalfour (talk) 18:40, 23 September 2019 (UTC)

I propose this:

Hash functions are used in conjunction with hash tables to store and retrieve data items or data records. The hash function translates the key associated with each datum or record into a hash code which is used to index the hash table. When an item is to be added to the table, the hash code may index an empty slot (also called a bucket), in which case the item is added to the table there. If the hash code indexes a full slot, some kind of collision resolution is required: the new item may be omitted (not added to the table), or replace the old item, or it can be added to the table in some other location by a specified procedure. That procedure depends on the structure of the hash table: In chained hashing, each slot is the head of a linked list or chain, and items that collide at the slot are added to the chain. Chains may be kept in random order and searched linearly, or in serial order, or as a self-ordering list by frequency to speed up access. In open address hashing, the table is probed starting from the occupied slot in a specified manner, usually by linear probing, quadratic probing, or double hashing until an open slot is located or the entire table is probed (overflow). Searching for the item follows the same procedure until the item is located, an open slot is found or the entire table has been searched (item not in table).
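A minimal C sketch of the open-addressing case in the proposed text, using linear probing; the table size and the hash function are placeholders chosen only for illustration:

    #include <stddef.h>
    #include <string.h>

    #define TABLE_SIZE 1024    /* illustrative table size */

    struct slot { const char *key; int value; int used; };
    static struct slot table[TABLE_SIZE];

    /* Placeholder hash function mapping a key into [0, TABLE_SIZE). */
    static size_t hash(const char *key)
    {
        size_t h = 0;
        while (*key)
            h = h * 31 + (unsigned char)*key++;
        return h % TABLE_SIZE;
    }

    /* Insert with linear probing: start at the hashed slot and step forward
       until an empty slot (or the same key, which is replaced) is found. */
    static int insert(const char *key, int value)
    {
        size_t start = hash(key);
        for (size_t n = 0; n < TABLE_SIZE; n++) {
            size_t j = (start + n) % TABLE_SIZE;
            if (!table[j].used || strcmp(table[j].key, key) == 0) {
                table[j].key = key;
                table[j].value = value;
                table[j].used = 1;
                return 1;              /* added or replaced */
            }
        }
        return 0;                      /* whole table probed: overflow */
    }

    /* Search follows the same probe sequence until the key is found, an
       empty slot is reached (item not in table), or the table is exhausted. */
    static int lookup(const char *key, int *value)
    {
        size_t start = hash(key);
        for (size_t n = 0; n < TABLE_SIZE; n++) {
            size_t j = (start + n) % TABLE_SIZE;
            if (!table[j].used)
                return 0;
            if (strcmp(table[j].key, key) == 0) {
                *value = table[j].value;
                return 1;
            }
        }
        return 0;
    }

Chained hashing would replace the probe loop with a per-slot linked list, as the proposed text describes.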

Programs, pseudocode and other elaborate examples

I've deleted numerous examples from the text for three reasons: programming language code is copyrighted unless the program itself is in the public domain; the examples were uncited and almost certainly original art that would otherwise be copyrightable to the creator; and, in the absence of all else, the encyclopedia is not a how-to manual: actual compilable code and elaborate pseudo-code, except when plainly necessary to clarify obscure text, is for programming textbooks, not the encyclopedia. Clearly, we can "invent" self-evident examples such as "1+1=2 is an example of integer arithmetic addition". The best course is to accurately and succinctly describe the algorithm, ... an exercise in technical diction, which is exactly what we want in the encyclopedia. Examples must not be original art, copyrighted, or uncited. Sbalfour (talk) 15:04, 24 September 2019 (UTC)

Hash function+Hash table = Hashing

What good does it do to speak of Hash function without Hash table? And vice versa. It's inane. There's no use for a hash function without a hash table (I'm not talking about crypto, checksums, etc., which have their own articles) or a hash table without a hash function. I propose combining both articles into one, titled Hashing. It's almost a direct insert - no need to merge sections, paragraphs, etc.Sbalfour (talk) 17:26, 24 September 2019 (UTC)

Articles such as Cryptographic hash function are separate, yes, but rather than reproducing the full description of a hash function they link back here. It seems to me that people arriving from such links would be confused if they arrived at an article that spent 50+% of its word count on indexing into tables. - MrOllie (talk) 17:54, 24 September 2019 (UTC)
Point taken. The page is or was serving two different functions: a general directory of hash functions, and a specific page (the only one) relating to hash tables and the information storage/retrieval application of hash functions. However, in content, the page when I found it was essentially an overview of hash functions, and could have been titled (in keeping with other wiki pages) Outline of hash functions. I've refocused the page, and it could possibly be renamed Hash function (data storage). We'd then create a disambig page Hash function, and cover crypto, error detection & correction, fingerprinting, etc.Sbalfour (talk) 22:06, 25 September 2019 (UTC)

Can the editor reconstruct an obtuse sentence into comprehensible English?

Article says (obtusely): "Use of hash functions relies on statistical properties of key and function interaction: worst-case behaviour is intolerably bad with a vanishingly small probability, and average-case behaviour can be nearly optimal (minimal collision).[1]" (FairNPOV (talk) 17:33, 11 January 2022 (UTC))