Information about Maximum Entropy
The principle of maximum entropy is a method for analyzing the available information in order to determine a unique epistemic probability distribution. It states that the least biased distribution that encodes certain given information is that which maximizes the information entropy.
The principle was first expounded by E.T. Jaynes in 1957 when he introduced what is now known as Maximum entropy thermodynamics: an interpretation of the Gibbs algorithm of statistical mechanics. He suggested that thermodynamics, and in particular thermodynamic entropy, should be seen just as a particular application of a general tool of inference and information theory.
The maximum entropy principle is like other Bayesian methods in that it makes explicit use of prior information. This is an alternative to the methods of inference of classical statistics.
are statements of testable information.
Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.
Entropy maximization with no testable information takes place under a single constraint: the sum of the probabilities must be one. Under this constraint, the maximum entropy probability distribution is the uniform distribution,
The principle of maximum entropy can thus be seen as a generalization of the classical principle of indifference, also known as the principle of insufficient reason.
Furthermore, the probabilities must sum to one, giving the constraint
The probability distribution with maximum information entropy subject to these constraints is
with normalization constant determined by
(Interestingly, the Pitman-Koopman theorem states that the necessary and sufficient condition for a sampling distribution to admit sufficient statistics of bounded dimension is that it have the general form of a maximum entropy distribution.)
The λk parameters are Lagrange multipliers whose particular values are determined by the constraints according to
These m simultaneous equations do not generally possess a closed form solution, and are usually solved by numerical methods.
where m(x), which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that it is known; we will discuss it further after the solution equations are given.
Relative entropy is usually defined as the Kullback-Leibler divergence of m from p (although it is sometimes, confusingly, defined as the negative of this). The inference principle of minimizing this, due to Kullback, is known as the Principle of Minimum Discrimination Information.
We have some testable information I about a quantity x which takes values in some interval of the real numbers (all integrals below are over this interval). We express this information as m constraints on the expectations of the functions fk, i.e. we require our epistemic probability density function to satisfy
And of course, the probability density must integrate to one, giving the constraint
The probability density function with maximum Hc subject to these constraints is
with normalization constant determined by
As in the discrete case, the values of the λk parameters are determined by the constraints according to
The invariant measure function m(x) can be best understood by supposing that x is known to take values only in the bounded interval (a, b), and that no other information is given. Then the maximum entropy probability density function is
where A is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'. It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the principle of transformation groups or marginalization theory.
Refer to Cover and Thomas for excellent explanation of the ideas .
By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess; to choose one with a higher entropy would violate the constraints of the information we do possess. Thus the maximum entropy distribution is the only reasonable distribution.
Suppose an individual wishes to make an epistemic probability assignment among m mutually exclusive propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute N quanta of epistemic probability (each worth 1/N) at random among the m possibilities. (One might imagine that she will throw N balls into m buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. If not, she will reject it and try again. Otherwise, her assessment will be
where ni is the number of quanta that were assigned to the ith proposition.
Now, in order to reduce the 'graininess' of the epistemic probability assignment, it will be necessary to use quite a large number of quanta of epistemic probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, our protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution,
where
is sometimes known as the multiplicity of the outcome.
The most probable result is the one which maximizes the multiplicity W. Rather than maximizing W directly, our protagonist could equivalently maximize any monotonic increasing function of W. She decides to maximize
At this point, in order simplify the expression, our protagonist takes the limit as N → ∞, i.e. as the epistemic probability levels go from grainy discrete values to smooth continuous values. Using Stirling's approximation, she finds
All that remains for our protagonist to do is to maximize entropy under the constraints of her testable information. She has found that the maximum entropy distribution is the most probable of all "fair" random epistemic distributions, in the limit as the probability levels go from discrete to continuous.
The principle was first expounded by E.T. Jaynes in 1957 when he introduced what is now known as Maximum entropy thermodynamics: an interpretation of the Gibbs algorithm of statistical mechanics. He suggested that thermodynamics, and in particular thermodynamic entropy, should be seen just as a particular application of a general tool of inference and information theory.
The maximum entropy principle is like other Bayesian methods in that it makes explicit use of prior information. This is an alternative to the methods of inference of classical statistics.
Testable information
The principle of maximum entropy is useful only when applied to testable information. A piece of information is testable if it can be determined whether a given distribution is consistent with it. For example, the statements- The expectation of the variable x is 2.87
- p2 + p3 > 0.6
are statements of testable information.
Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.
Entropy maximization with no testable information takes place under a single constraint: the sum of the probabilities must be one. Under this constraint, the maximum entropy probability distribution is the uniform distribution,
The principle of maximum entropy can thus be seen as a generalization of the classical principle of indifference, also known as the principle of insufficient reason.
General solution for the maximum entropy distribution with linear constraints
Discrete case
We have some testable information I about a quantity x ∈ {x1, x2,..., xn}. We express this information as m constraints on the expectations of the functions fk; that is, we require our epistemic probability distribution to satisfyFurthermore, the probabilities must sum to one, giving the constraint
The probability distribution with maximum information entropy subject to these constraints is
with normalization constant determined by
(Interestingly, the Pitman-Koopman theorem states that the necessary and sufficient condition for a sampling distribution to admit sufficient statistics of bounded dimension is that it have the general form of a maximum entropy distribution.)
The λk parameters are Lagrange multipliers whose particular values are determined by the constraints according to
These m simultaneous equations do not generally possess a closed form solution, and are usually solved by numerical methods.
Continuous case
For continuous distributions, the simple definition of Shannon entropy ceases to be so useful (see differential entropy). Instead what is more useful is the relative entropy of the distribution, (Jaynes, 1963, 1968, 2003),where m(x), which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points. For now, we shall assume that it is known; we will discuss it further after the solution equations are given.
Relative entropy is usually defined as the Kullback-Leibler divergence of m from p (although it is sometimes, confusingly, defined as the negative of this). The inference principle of minimizing this, due to Kullback, is known as the Principle of Minimum Discrimination Information.
We have some testable information I about a quantity x which takes values in some interval of the real numbers (all integrals below are over this interval). We express this information as m constraints on the expectations of the functions fk, i.e. we require our epistemic probability density function to satisfy
And of course, the probability density must integrate to one, giving the constraint
The probability density function with maximum Hc subject to these constraints is
with normalization constant determined by
As in the discrete case, the values of the λk parameters are determined by the constraints according to
The invariant measure function m(x) can be best understood by supposing that x is known to take values only in the bounded interval (a, b), and that no other information is given. Then the maximum entropy probability density function is
where A is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'. It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the principle of transformation groups or marginalization theory.
Refer to Cover and Thomas for excellent explanation of the ideas .
Examples
For several examples of maximum entropy distributions, see the article on maximum entropy probability distributions.Justifications for the principle of maximum entropy
Proponents of the principle of maximum entropy justify its use in assigning epistemic probabilities in several ways, including the following two arguments. These arguments take the use of epistemic probability as given, and thus have no force if the concept of epistemic probability is itself under question.Information entropy as a measure of 'uninformativeness'
Consider a discrete epistemic probability distribution among m mutually exclusive propositions. The most informative distribution would occur when one of the propositions was known to be true. In that case, the information entropy would be equal to zero. The least informative distribution would occur when there is no reason to favor any one of the propositions over the others. In that case, the only reasonable probability distribution would be uniform, and then the information entropy would be equal to its maximum possible value, log m. The information entropy can therefore be seen as a numerical measure which describes how uninformative a particular probability distribution is from zero (completely informative) to log m (completely uninformative).By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess; to choose one with a higher entropy would violate the constraints of the information we do possess. Thus the maximum entropy distribution is the only reasonable distribution.
The Wallis derivation
The following argument is the result of a suggestion made by Graham Wallis to E. T. Jaynes in 1962 (Jaynes, 2003). It is essentially the same mathematical argument used for the Maxwell-Boltzmann statistics in statistical mechanics, although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed a priori, but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.Suppose an individual wishes to make an epistemic probability assignment among m mutually exclusive propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute N quanta of epistemic probability (each worth 1/N) at random among the m possibilities. (One might imagine that she will throw N balls into m buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. If not, she will reject it and try again. Otherwise, her assessment will be
where ni is the number of quanta that were assigned to the ith proposition.
Now, in order to reduce the 'graininess' of the epistemic probability assignment, it will be necessary to use quite a large number of quanta of epistemic probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, our protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the multinomial distribution,
where
is sometimes known as the multiplicity of the outcome.
The most probable result is the one which maximizes the multiplicity W. Rather than maximizing W directly, our protagonist could equivalently maximize any monotonic increasing function of W. She decides to maximize
At this point, in order simplify the expression, our protagonist takes the limit as N → ∞, i.e. as the epistemic probability levels go from grainy discrete values to smooth continuous values. Using Stirling's approximation, she finds
All that remains for our protagonist to do is to maximize entropy under the constraints of her testable information. She has found that the maximum entropy distribution is the most probable of all "fair" random epistemic distributions, in the limit as the probability levels go from discrete to continuous.
Compatibility with Bayes Rule
Recently, it has been shown that Bayes' Rule and the Principle of Maximum Entropy (MaxEnt) are completely compatible and can be seen as special cases of the Method of Maximum (relative) Entropy (ME). This method reproduces every aspect of orthodox Bayesian inference methods. In addition this new method opens the door to tackling problems that could not be addressed by either the MaxEnt or orthodox Bayesian methods individually.See also
References
- Jaynes, E.T., 1957, 'Information Theory and Statistical Mechanics', in Physical Review May 1957 Volume 106, #4 (pages 620-630).
- Jaynes, E. T., 1963, 'Information Theory and Statistical Mechanics', in Statistical Physics, K. Ford (ed.), Benjamin, New York, p. 181.
- Jaynes, E. T., 1968, 'Prior Probabilities', IEEE Trans. on Systems Science and Cybernetics, SSC-4, 227.
- Jaynes, E. T., 2003, Probability Theory: The Logic of Science, Cambridge University Press.
- Giffin, A. and Caticha, A., 2007, Updating Probabilities with Data and Moments
- Guiasu, S. and Shenitzer, A., 1985, 'The principle of maximum entropy', The Mathematical Intelligencer, 7(1).
- Kapur, J. N.; and Kesevan, H. K., 1992, Entropy optimization principles with applications, Boston: Academic Press. ISBN 0-12-397670-7
- Uffink, Jos, 1995, 'Can the Maximum Entropy Principle be explained as a consistency requirement?', Studies in History and Philosophy of Modern Physics 26B, 223-261.
External links
- An easy-to-read introduction to maximum entropy methods in the context of natural language processing is:
- Ratnaparkhi A. "A simple introduction to maximum entropy models for natural language processing" Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, 1997. Available here.
- This page contains pointers to various papers and software implementations of Maximum Entropy Model on the net.
Information is the result of processing, gathering, manipulating and organizing data in a way that adds to the knowledge of the receiver. In other words, it is the context in which data is taken.
..... Click the link for more information.
..... Click the link for more information.
Bayesian probability is an interpretation of the probability calculus which holds that the concept of probability can be defined as the degree to which a person (or community) believes that a proposition is true.
..... Click the link for more information.
..... Click the link for more information.
probability distribution that assigns a probability to every subset (more precisely every measurable subset) of its state space in such a way that the probability axioms are satisfied.
..... Click the link for more information.
..... Click the link for more information.
bias is a prejudice in a general or specific sense, usually in the sense for having a preference to one particular point of view or ideological perspective. However, one is generally only said to be biased
..... Click the link for more information.
..... Click the link for more information.
Shannon entropy or information entropy is a measure of the uncertainty associated with a random variable.
Shannon entropy quantifies the information contained in a piece of data: it is the minimum average message length, in bits (if using base-2 logarithms), that must
..... Click the link for more information.
Shannon entropy quantifies the information contained in a piece of data: it is the minimum average message length, in bits (if using base-2 logarithms), that must
..... Click the link for more information.
Edwin Thompson Jaynes (July 5, 1922 – April 30, 1998) was Wayman Crow Distinguished Professor of Physics at Washington University in St. Louis. He wrote extensively on statistical mechanics and on foundations of probability and statistical inference, initiating in 1957 the
..... Click the link for more information.
..... Click the link for more information.
19th century - 20th century - 21st century
1920s 1930s 1940s - 1950s - 1960s 1970s 1980s
1954 1955 1956 - 1957 - 1958 1959 1960
Year 1957 (MCMLVII
..... Click the link for more information.
1920s 1930s 1940s - 1950s - 1960s 1970s 1980s
1954 1955 1956 - 1957 - 1958 1959 1960
Year 1957 (MCMLVII
..... Click the link for more information.
In physics the Maximum entropy school of thermodynamics (or more colloquially, the MaxEnt school of thermodynamics), initiated with two papers published in the Physical Review by Edwin T.
..... Click the link for more information.
..... Click the link for more information.
In statistical mechanics, the Gibbs algorithm, first introduced by J. Willard Gibbs in 1878, is the injunction to choose a statistical ensemble (probability distribution) for the unknown microscopic state of a thermodynamic system by minimising the average log probability
..... Click the link for more information.
..... Click the link for more information.
Statistical mechanics is the application of probability theory, which includes mathematical tools for dealing with large populations, to the field of mechanics, which is concerned with the motion of particles or objects when subjected to a force.
..... Click the link for more information.
..... Click the link for more information.
Thermodynamics (from the Greek θερμη, therme, meaning "heat" and δυναμις, dynamis, meaning "power") is a branch of physics that studies the effects of changes in temperature, pressure, and volume on
..... Click the link for more information.
..... Click the link for more information.
Ice melting - a classic example of entropy increasing[1] described in 1862 by Rudolf Clausius as an increase in the disgregation of the molecules of the body of ice.
..... Click the link for more information.
..... Click the link for more information.
Inference is the act or process of deriving a conclusion based solely on what one already knows.
Inference is studied within several different fields.
..... Click the link for more information.
Inference is studied within several different fields.
- Human inference (i.e. how humans draw conclusions) is traditionally studied within the field of cognitive psychology.
..... Click the link for more information.
A prior probability is a marginal probability, interpreted as a description of what is known about a variable in the absence of some evidence. The posterior probability is then the conditional probability of the variable taking the evidence into account.
..... Click the link for more information.
..... Click the link for more information.
probability distribution that assigns a probability to every subset (more precisely every measurable subset) of its state space in such a way that the probability axioms are satisfied.
..... Click the link for more information.
..... Click the link for more information.
Shannon entropy or information entropy is a measure of the uncertainty associated with a random variable.
Shannon entropy quantifies the information contained in a piece of data: it is the minimum average message length, in bits (if using base-2 logarithms), that must
..... Click the link for more information.
Shannon entropy quantifies the information contained in a piece of data: it is the minimum average message length, in bits (if using base-2 logarithms), that must
..... Click the link for more information.
Lagrange multipliers, named after Joseph Louis Lagrange, is a method for finding the extrema of a function of several variables subject to one or more constraints: it is the basic tool in nonlinear constrained optimization.
..... Click the link for more information.
..... Click the link for more information.
Uniform distribution can refer to:
..... Click the link for more information.
- Uniform distribution (mathematics), probability distributions:
- Uniform distribution (continuous)
- Uniform distribution (discrete)
..... Click the link for more information.
The principle of indifference (also called principle of insufficient reason) is a rule for assigning epistemic probabilities. Suppose that there are n > 1 mutually exclusive and collectively exhaustive possibilities.
..... Click the link for more information.
..... Click the link for more information.
In probability and statistics, an exponential family is any class of probability distributions having a certain form. This special form is chosen for mathematical convenience, on account of some useful algebraic properties; as well as for generality, as exponential families are in
..... Click the link for more information.
..... Click the link for more information.
In statistics, sufficiency is the property possessed by a statistic, with respect to a parameter, "when no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter"[1]
..... Click the link for more information.
..... Click the link for more information.
In mathematics, an equation or system of equations is said to have a closed-form solution if, and only if, at least one solution can be expressed analytically in terms of a bounded number of certain "well-known" functions.
..... Click the link for more information.
..... Click the link for more information.
Numerical analysis is the study of algorithms for the problems of continuous mathematics (as distinguished from discrete mathematics).
One of the earliest mathematical writing is the Babylonian tablet YBC 7289, which gives a sexagesimal numerical approximation of ,
..... Click the link for more information.
One of the earliest mathematical writing is the Babylonian tablet YBC 7289, which gives a sexagesimal numerical approximation of ,
..... Click the link for more information.
Differential entropy (also referred to as continuous entropy) is a concept in information theory which tries to extend the idea of (Shannon) entropy, a measure of average surprisal of a random variable, to continuous probability distributions.
..... Click the link for more information.
..... Click the link for more information.
Edwin Thompson Jaynes (July 5, 1922 – April 30, 1998) was Wayman Crow Distinguished Professor of Physics at Washington University in St. Louis. He wrote extensively on statistical mechanics and on foundations of probability and statistical inference, initiating in 1957 the
..... Click the link for more information.
..... Click the link for more information.
In algebra, an interval is a set that contains every real number between two indicated numbers and may contain the two numbers themselves. Interval notation is the notation in which permitted values for a variable are expressed as ranging over a certain interval; "" is an
..... Click the link for more information.
..... Click the link for more information.
In mathematics, the real numbers may be described informally as numbers that can be given by an infinite decimal representation, such as 2.4871773339…. The real numbers include both rational numbers, such as 42 and −23/129, and irrational numbers, such as π and
..... Click the link for more information.
..... Click the link for more information.
The concept of a normalizing constant arises in probability theory and a variety of other areas of mathematics.
..... Click the link for more information.
Definition and examples
In probability theory, a normalizing constant..... Click the link for more information.
In algebra, an interval is a set that contains every real number between two indicated numbers and may contain the two numbers themselves. Interval notation is the notation in which permitted values for a variable are expressed as ranging over a certain interval; "" is an
..... Click the link for more information.
..... Click the link for more information.
Conditional probability is the probability of some event A, given the occurrence of some other event B. Conditional probability is written P(A|B), and is read "the probability of A, given B".
..... Click the link for more information.
..... Click the link for more information.
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
Herod_Archelaus
















