Information about Prediction By Partial Matching
Prediction by Partial Matching (PPM) is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream.
Predictions are usually reduced to symbol rankings. The number of previous symbols, n, determines the order of the PPM model which is denoted as PPM(n). Unbounded variants where the context has no length limitations also exist and are denoted as PPM*. If no prediction can be made based on all n context symbols a prediction is attempted with just n-1 symbols. This process is repeated until a match is found or no more symbols remain in context. At that point a fixed prediction is made. This process is complementary to that followed by Dynamic Markov Compression (DMC) which builds up from a zero-order model.
Much of the work in optimizing a PPM model is handling inputs that have not already occurred in the input stream. The obvious way to handle them is to create a "never-seen" symbol which triggers the escape sequence. But what probability should be assigned to a symbol that has never been seen? This is called the zero-frequency problem. One variant assigns the "never-seen" symbol a fixed pseudocount of one. A variant called PPM-D increments the pseudocount of the "never-seen" symbol every time the "never-seen" symbol is used. (In other words, PPM-D estimates the probability of a new symbol as the ratio of the number of unique symbols to the total number of symbols observed).
PPM compression implementations vary greatly in other details. The actual symbol selection is usually recorded using arithmetic coding, though it is also possible to use Huffman encoding or even some type of dictionary coding technique. The underlying model used in most PPM algorithms can also be extended to predict multiple symbols. It is also possible to use non-Markov modeling to either replace or supplement Markov modeling. The symbol size is usually static, typically a single byte, which makes generic handling of any file format easy.
Published research on this family of algorithms can be found as far back as the mid-1980s. Software implementations were not popular until the early 1990s because PPM algorithms require a significant amount of RAM. Recent PPM implementations are among the best-performing lossless compression programs for natural language text.
Trying to improve PPM algorithms led to the PAQ series of data compression algorithms.
(See for formats and for codecs)
..... Click the link for more information.
..... Click the link for more information.
Predictions are usually reduced to symbol rankings. The number of previous symbols, n, determines the order of the PPM model which is denoted as PPM(n). Unbounded variants where the context has no length limitations also exist and are denoted as PPM*. If no prediction can be made based on all n context symbols a prediction is attempted with just n-1 symbols. This process is repeated until a match is found or no more symbols remain in context. At that point a fixed prediction is made. This process is complementary to that followed by Dynamic Markov Compression (DMC) which builds up from a zero-order model.
Much of the work in optimizing a PPM model is handling inputs that have not already occurred in the input stream. The obvious way to handle them is to create a "never-seen" symbol which triggers the escape sequence. But what probability should be assigned to a symbol that has never been seen? This is called the zero-frequency problem. One variant assigns the "never-seen" symbol a fixed pseudocount of one. A variant called PPM-D increments the pseudocount of the "never-seen" symbol every time the "never-seen" symbol is used. (In other words, PPM-D estimates the probability of a new symbol as the ratio of the number of unique symbols to the total number of symbols observed).
PPM compression implementations vary greatly in other details. The actual symbol selection is usually recorded using arithmetic coding, though it is also possible to use Huffman encoding or even some type of dictionary coding technique. The underlying model used in most PPM algorithms can also be extended to predict multiple symbols. It is also possible to use non-Markov modeling to either replace or supplement Markov modeling. The symbol size is usually static, typically a single byte, which makes generic handling of any file format easy.
Published research on this family of algorithms can be found as far back as the mid-1980s. Software implementations were not popular until the early 1990s because PPM algorithms require a significant amount of RAM. Recent PPM implementations are among the best-performing lossless compression programs for natural language text.
Trying to improve PPM algorithms led to the PAQ series of data compression algorithms.
References
- T. Bell, J. Cleary, and I. Witten (University of Waikato, New Zealand)(1984) Data compression using adaptive coding and partial string matching IEEE Transactions on Communications, Vol. 32 (4), p. 396-402.
- A. Moffat (1990) Implementing the PPM data compression scheme, IEEE Transactions on Communications, Vol. 38 (11), p. 1917-1921.
- J. Cleary, W. Teahan, and I. Witten (1995) Unbounded length contexts for PPM. In J.A. Storer and M. Cohn, editors, Proceedings DCC-95. IEEE Computer Society Press.
- C. Bloom (unknown) Solving the problems of context modeling.
- W. Teahan (unknown) Probability estimation for PPM.
External links
- Suite of PPM compressors with benchmarks
- BICOM, a bijective PPM compressor
- "Arithmetic Coding + Statistical Modeling = Data Compression", Part 2
| Lossless compression methods |
| ||||
|---|---|---|---|---|---|
| Audio compression methods |
| ||||
| Image compression methods |
| ||||
| Video compression |
| ||||
| Timeline of information theory, data compression, and error-correcting codes | |||||
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities.
..... Click the link for more information.
..... Click the link for more information.
data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an un-encoded representation would use through use of specific encoding schemes.
..... Click the link for more information.
..... Click the link for more information.
prediction is a statement or claim that a particular event will occur in the future in more certain terms than a forecast. The etymology of this word is Latin (from præ- "before" plus dicere "to say").
..... Click the link for more information.
..... Click the link for more information.
Theory
Entropy Complexity Redundancy Entropy encoding
Huffman Adaptive Huffman Arithmetic (Shannon-Fano Range) Golomb Exp-Golomb Universal (Elias Fibonacci) Dictionary
LZ77/78 LZW LZO DEFLATE LZMA LZX Others
RLE BWT PPM DMC
..... Click the link for more information.
Entropy Complexity Redundancy Entropy encoding
Huffman Adaptive Huffman Arithmetic (Shannon-Fano Range) Golomb Exp-Golomb Universal (Elias Fibonacci) Dictionary
LZ77/78 LZW LZO DEFLATE LZMA LZX Others
RLE BWT PPM DMC
..... Click the link for more information.
- This article refers to codes used as commands for computing devices. Escape sequence can also refer to a sequence of escape characters used in parsing source code.
..... Click the link for more information.
A pseudocount is a count added to observed data in order to change the probability in a model of those data, which is known not to be zero, to being negligible rather than being zero.
..... Click the link for more information.
..... Click the link for more information.
Arithmetic coding is a method for lossless data compression. Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code.
..... Click the link for more information.
..... Click the link for more information.
Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way
..... Click the link for more information.
..... Click the link for more information.
A dictionary coder, also sometimes known as a substitution coder, is any of a number of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the 'dictionary')
..... Click the link for more information.
..... Click the link for more information.
Dynamic RAM (DRAM) modules
Two 512 MB DRAM Modules
Connects to:
..... Click the link for more information.
Two 512 MB DRAM Modules
Connects to:
- PCB or motherboard via one of
..... Click the link for more information.
Lossless data compression is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. This can be contrasted to lossy data compression, which does not allow the exact original data to be reconstructed from the
..... Click the link for more information.
..... Click the link for more information.
In the philosophy of language, a natural language (or ordinary language) is a language that is spoken, written, or signed (visually or tactilely) by humans for general-purpose communication, as distinguished from formal languages (such as computer-programming
..... Click the link for more information.
..... Click the link for more information.
PAQ is a series of data compression archivers that have evolved through collaborative development to top rankings on several benchmarks measuring compression ratio (although at the expense of speed and memory usage).
..... Click the link for more information.
..... Click the link for more information.
data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an un-encoded representation would use through use of specific encoding schemes.
..... Click the link for more information.
..... Click the link for more information.
Lossless data compression is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. This can be contrasted to lossy data compression, which does not allow the exact original data to be reconstructed from the
..... Click the link for more information.
..... Click the link for more information.
Information theory is a branch of applied mathematics and engineering involving the quantification of information to find fundamental limits on compressing and reliably communicating data.
..... Click the link for more information.
..... Click the link for more information.
Shannon entropy or information entropy is a measure of the uncertainty associated with a random variable.
Shannon entropy quantifies the information contained in a piece of data: it is the minimum average message length, in bits (if using base-2 logarithms), that must
..... Click the link for more information.
Shannon entropy quantifies the information contained in a piece of data: it is the minimum average message length, in bits (if using base-2 logarithms), that must
..... Click the link for more information.
In computer science, the Kolmogorov complexity (also known as descriptive complexity, Kolmogorov-Chaitin complexity, stochastic complexity, algorithmic entropy, or program-size complexity
..... Click the link for more information.
..... Click the link for more information.
Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Informally, it is the amount of wasted "space" used to transmit certain data.
..... Click the link for more information.
..... Click the link for more information.
In information theory an entropy encoding is a lossless data compression scheme that is independent of the media’s specific characteristics.
One of the main types of entropy coding assigns codes to symbols so as to match code lengths with the probabilities of the
..... Click the link for more information.
One of the main types of entropy coding assigns codes to symbols so as to match code lengths with the probabilities of the
..... Click the link for more information.
Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way
..... Click the link for more information.
..... Click the link for more information.
Adaptive Huffman coding (also called Dynamic Huffman coding) is an adaptive coding technique based on Huffman coding, building the code as the symbols are being transmitted, having no initial knowledge of source distribution, that allows one-pass encoding and adaptation to
..... Click the link for more information.
..... Click the link for more information.
Arithmetic coding is a method for lossless data compression. Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code.
..... Click the link for more information.
..... Click the link for more information.
Range encoding is a form of arithmetic coding, a data compression method, that is believed to be free from arithmetic coding related patents. It is on this basis that interest in range encoding has arisen, particularly in the open source community.
..... Click the link for more information.
..... Click the link for more information.
..... Click the link for more information.
An Exponential-Golomb code (or just Exp-Golomb code) of order is a type of universal code, parameterized by a whole number . To encode a nonnegative integer in an order- exp-Golomb code, one can use the following method:
..... Click the link for more information.
..... Click the link for more information.
universal code for integers is a prefix code that maps the positive integers onto binary codewords, with the additional property that whatever the true probability distribution on integers, as long as the distribution is monotonic (i.e.
..... Click the link for more information.
..... Click the link for more information.
Elias gamma code is a universal code encoding positive integers. It is used most commonly when coding integers whose upper-bound cannot be determined beforehand.
To code a number:
..... Click the link for more information.
To code a number:
- Write it in binary.
..... Click the link for more information.
Fibonacci coding is a universal code which encodes positive integers into binary code words. All tokens end with "11" and have no "11" before the end.
The formula used to generate Fibonacci codes is:
where F(i) is the i-th Fibonacci number.
..... Click the link for more information.
The formula used to generate Fibonacci codes is:
where F(i) is the i-th Fibonacci number.
..... Click the link for more information.
A dictionary coder, also sometimes known as a substitution coder, is any of a number of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the 'dictionary')
..... Click the link for more information.
..... Click the link for more information.
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
Herod_Archelaus