Information about Hash Function

A hash function [1] is a reproducible method of turning some kind of data into a (relatively) small number that may serve as a digital "fingerprint" of the data. The algorithm "chops and mixes" (i.e., substitutes or transposes) the data to create such fingerprints. The fingerprints are called hash sums, hash values, hash codes or simply hashes. (Note that hashes can also mean the hash functions.) Hash sums are commonly used as indices into hash tables or hash files. Cryptographic hash functions are used for various purposes in information security applications.

Properties

Hash functions are designed to be fast and to yield few hash collisions in expected input domains. In hash tables and data processing, collisions inhibit the distinguishing of data, making records more costly to find.

A hash function must be deterministic, i.e. if two hashes generated by some hash function are different, then the two inputs were different in some way.

Hash functions are usually not injective, i.e. the computed hash value may be the same for different input values. This is because it is usually a requirement that the hash value can be stored in fewer bits than the data being hashed. It is a design goal of hash functions to minimize the likelihood of such a hash collision occurring.

A desirable property of a hash function is the mixing property: a small change in the input (e.g. one bit) should cause a large change in the output (e.g. about half of the bits). This is called the avalanche effect.

Typical hash functions have an infinite domain, such as byte strings of arbitrary length, and a finite range, such as bit sequences of some fixed length. In certain cases, hash functions can be designed with one-to-one mapping between identically sized domain and range. [2] Hash functions that are one-to-one are also called permutations. Reversibility is achieved by using a series of reversible "mixing" operations on the function input.

Applications

Because of the variety of applications for hash functions (details below), they are often tailored to the application. For example, cryptographic hash functions assume the existence of an adversary who can deliberately try to find inputs with the same hash value. A well designed cryptographic hash function is a "one-way" operation: there is no practical way to calculate a particular data input that will result in a desired hash value, so it is also very difficult to forge. Functions intended for cryptographic hashing, such as MD5, are commonly used as stock hash functions.

Functions for error detection and correction focus on distinguishing cases in which data has been disturbed by random processes. When hash functions are used for checksums, the relatively small hash value can be used to verify that a data file of any size has not been altered.

Cryptography

Main article: Cryptographic hash function


A typical cryptographic one-way function is not one-to-one and makes an effective hash function; a typical cryptographic trapdoor function is one-to-one and makes an effective randomization function.

Hash tables

Main article: Hash table


Hash tables, a major application for hash functions, enable fast lookup of a data record given its key. (Note: Keys are not usually secret as in cryptography, but both are used to "unlock" or access information.) For example, keys in an English dictionary would be English words, and their associated records would contain definitions. In this case, the hash function must map alphabetic strings to indexes for the hash table's internal array.

The ideal for a hash table's hash function is to map each key to a unique index (see perfect hashing), because this guarantees access to each data record in the first probe into the table. However, this is often impossible or impractical.

Hash functions that are truly random with uniform output (including most cryptographic hash functions) are good in that, on average, only one or two probes will be needed (depending on the load factor). Perhaps as important is that excessive collision rates with random hash functions are highly improbable—if not computationally infeasible for an adversary. However, a small, predictable number of collisions is virtually inevitable (see birthday paradox).

In many cases, a heuristic hash function can yield many fewer collisions than a random hash function. Heuristic functions take advantage of regularities in likely sets of keys. For example, one could design a heuristic hash function such that file names such as FILE0000.CHK, FILE0001.CHK, FILE0002.CHK, etc. map to successive indices of the table, meaning that such sequences will not collide. Beating a random hash function on "good" sets of keys usually means performing much worse on "bad" sets of keys, which can arise naturally—not just through attacks. Bad performance of a hash table's hash function means that lookup can degrade to a costly linear search.

Aside from minimizing collisions, the hash function for a hash table should also be fast relative to the cost of retrieving a record in the table, as the goal of minimizing collisions is minimizing the time needed to retrieve a desired record. Consequently, the optimal balance of performance characteristics depends on the application.

A good performing hash function for use in typical hash tables is Bob Jenkins' LOOKUP3 hash function; an earlier algorithm was published in an article in Dr. Dobb's Journal. [3] The hash function performs well as long as there is no adversary, for it is trivially reversible and useless as a cryptographic hash function.

Error correction

Main article: Error correction and detection


Using a hash function to detect errors in transmission is straightforward. The hash function is computed for the data at the sender, and the value of this hash is sent with the data. The hash function is performed again at the receiving end, and if the hash values do not match, an error has occurred at some point during the transmission. This is called a redundancy check.

For error correction, a distribution of likely perturbations is assumed at least approximately. Perturbations to a string are then classified into large (improbable) and small (probable) errors. The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x efficiently if s is small. Such hash functions are known as error correction codes. Important sub-class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.

Identification and verification

Cryptographic grade hash functions are commonly used as integrity check values to identify files and/or verify their integrity. Some hash algorithms, notably MD5 are no longer recommended for new applications, and may not provide the necessary level of security desired. However they still may be useful as an error checking mechanism, where purposeful data tampering isn't a primary concern.

NIST distributes a software reference library, the National Software Reference Library, that utilises hash functions to identify files and map them to software products.

The first coordinated effort to develop and catalog hash values of "known to be good" computer files, known as HashKeeper, began in 1996 at the National Drug Intelligence Center.

Audio identification

Main articles: Acoustic fingerprint and Algorithm Alley
For audio identification [4] such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all items that would nonetheless be interpreted as identical to a human listener. Though they are not common, hashing algorithms do exist that are robust to these minor differences. Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room. One example of this in practical use is the service Shazam. Customers call a number and place their telephone near a speaker. The service then analyses the music, and compares it to known hash values in its database. The name of the music is sent to the user. An open source alternative to this service is MusicBrainz which creates a fingerprint for an audio file and matches it to its online community driven database.

Rabin-Karp string search algorithm

Rabin-Karp string search algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

Origins of the term

The term "hash" comes by way of analogy with its standard meaning in the physical world, to "chop and mix". Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953, and that Robert Morris used the term in a survey paper in CACM which elevated the term from technical jargon to formal terminology.[5]

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular division.

See also

Notes

  1. ^  In the remainder of this article, the term function is used to refer to algorithms as well as the functions they compute.

References

2. ^ "Integer Hash Function"
3. ^ Jenkins, Bob (September, 1997), Hash Functions, "Algorithm Alley", Dr. Dobb's Journal, <[1]
4. ^ "Robust Audio Hashing for Content Identification by Jaap Haitsma, Ton Kalker and Job Oostveen"
5. ^ Knuth, Donald (1973). The Art of Computer Programming, volume 3, Sorting and Searching, 506-542. 

External links

For other uses, see Data (disambiguation).


Debt, AIDS, Trade in Africa (or DATA) is a multinational non-government organization founded in January 2002 in London by U2's Bono along with Bobby Shriver and activists from the Jubilee 2000 Drop
..... Click the link for more information.
In mathematics, computing, linguistics, and related disciplines, an algorithm is a finite list of well-defined instructions for accomplishing some task that, given an initial state, will proceed through a well-defined series of successive states, eventually terminating in an
..... Click the link for more information.
In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number).
..... Click the link for more information.
In cryptography, a cryptographic hash function is a transformation that takes an input and returns a fixed-size string, which is called the hash value. Hash functions with this property are used for a variety of computational purposes, including cryptography.
..... Click the link for more information.
Information security means protecting information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction.[1] The terms information security
..... Click the link for more information.
In computer science, a hash collision is a situation that occurs when two distinct inputs into a hash function produce identical outputs.

All hash functions have potential collisions, though with a well-designed hash function, collisions should occur less often (compared
..... Click the link for more information.
In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number).
..... Click the link for more information.
Data processing is any computer process that converts data into information or knowledge. The processing is usually assumed to be automated and running on a computer. Because data are most useful when well-presented and actually informative
..... Click the link for more information.
In the context of a relational database, a row—also called a record or tuple—represents a single, implicitly structured data item in a table. In simple terms, a database table can be thought of as consisting of rows and columns or fields.
..... Click the link for more information.
In computer science, a deterministic algorithm is an algorithm which, in informal terms, behaves predictably. Given a particular input, it will always produce the same output, and the underlying machine will always pass through the same sequence of states.
..... Click the link for more information.
non-injective function.]] In mathematics, an injective function is a function which associates distinct arguments to distinct values. More precisely, a function f is said to be injective if it maps distinct x in the domain to distinct y
..... Click the link for more information.
In computer science, a hash collision is a situation that occurs when two distinct inputs into a hash function produce identical outputs.

All hash functions have potential collisions, though with a well-designed hash function, collisions should occur less often (compared
..... Click the link for more information.
avalanche effect refers to a desirable property of cryptographic algorithms, typically block ciphers and cryptographic hash functions. The avalanche effect is evident if, when an input is changed slightly (for example, flipping a single bit) the output changes significantly (eg,
..... Click the link for more information.
The word infinity comes from the Latin infinitas or "unboundedness." It refers to several distinct concepts (usually linked to the idea of "without end") which arise in philosophy, mathematics, and theology.
..... Click the link for more information.
domain is most often defined as the set of values, D for which a function is defined.[1] A function that has a domain N is said to be a function over N, where N is an arbitrary set.
..... Click the link for more information.
byte (pronounced /baɪt/) is a unit of measurement of information storage, most often consisting of eight bits. In many computer architectures it is a unit of memory addressing.
..... Click the link for more information.
string is an ordered sequence of symbols. These symbols are chosen from a predetermined set.

In programming, when stored in memory each symbol is represented using a numeric value.
..... Click the link for more information.
BIT is an acronym for:
  • Bannari amman Institute of Technology
  • Bangalore Institute of Technology
  • Beijing Institute of Technology
  • Benzisothiazolinone
  • Bilateral Investment Treaty
  • Bhilai Institute of Technology - Durg

..... Click the link for more information.
non-injective function.]] In mathematics, an injective function is a function which associates distinct arguments to distinct values. More precisely, a function f is said to be injective if it maps distinct x in the domain to distinct y
..... Click the link for more information.
non-injective function.]] In mathematics, an injective function is a function which associates distinct arguments to distinct values. More precisely, a function f is said to be injective if it maps distinct x in the domain to distinct y
..... Click the link for more information.
Permutation is the rearrangement of objects or symbols into distinguishable sequences. Each unique ordering is called a permutation.
..... Click the link for more information.
Reversibility can refer to
  • Reversible dynamics - a mathematical dynamical system, or physical laws of motion, for which time-reversed dynamics are well defined.
* A reversible diffusion — an example of a reversible stochastic process.

..... Click the link for more information.
In cryptography, a cryptographic hash function is a transformation that takes an input and returns a fixed-size string, which is called the hash value. Hash functions with this property are used for a variety of computational purposes, including cryptography.
..... Click the link for more information.
Adversary may refer to:

In religion
  • Satan, whose name means adversary in the Hebrew language.
In computer science:
  • Adversary (cryptography), malicious entity in cryptography whose aim is to prevent the users of the cryptosystem from

..... Click the link for more information.
Criminal law
Part of the common law series
Elements of crimes
Actus reus  · Causation  · Concurrence
Mens rea  · Intention (general)
Intention in English law  · Recklessness
..... Click the link for more information.
MD5

General
Ronald Rivest
April 1992
MD, MD2, MD3, MD4, MD5

Detail
128 bits

4

In cryptography, MD5 (Message-Digest algorithm 5) is a widely used cryptographic hash function with a 128-bit hash value.
..... Click the link for more information.
In cryptography, a cryptographic hash function is a transformation that takes an input and returns a fixed-size string, which is called the hash value. Hash functions with this property are used for a variety of computational purposes, including cryptography.
..... Click the link for more information.
Do one-way functions exist?
A one-way function is a function that is easy to compute but "hard to invert" (in the sense defined below). "Easy to compute" means that some algorithm can compute the function in polynomial time (in the input size).
..... Click the link for more information.
A trapdoor function is a function that is easy to compute in one direction, yet believed to be difficult to compute in the opposite direction (finding its inverse) without special information, called the "trapdoor". Trapdoor functions are widely used in cryptography.
..... Click the link for more information.
In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number).
..... Click the link for more information.


This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
Herod_Archelaus


page counter