Information about Maximum Likelihood
Maximum likelihood estimation (MLE) is a popular statistical method for fitting a mathematical model to data. Modeling real-world data by maximum likelihood offers a way of tuning the free parameters of the model to provide an optimum fit.
The method was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922. It has widespread applications in various fields, including:
- linear models and generalized linear models are commonly fit by maximum likelihood.
- econometrics and hypothesis testing in medical research.
- time-delay of arrival (TDOA) in acoustic detection.
- data modeling in nuclear and particle physics.
- origin/destination and path-choice modeling in transport networks.
Loosely speaking, for a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them. In the case of the normal distribution this gives a unique solution, although in more complex problems this may not be the case.
Prerequisites
The following discussion assumes that readers are familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes they are familiar with standard basic techniques of maximizing continuous real-valued functions, such as using differentiation to find a function's maxima.
Principles
Consider a family of probability distributions parameterized by an unknown parameter $\theta$ (which could be vector-valued), associated with either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted as $f_\theta$. We draw a sample $x_1, x_2, \dots, x_n$ of n values from this distribution, and then using $f_\theta$ we compute the (multivariate) probability density associated with our observed data, $f_\theta(x_1, \dots, x_n)$.
As a function of θ with x1, ..., xn fixed, this is the likelihood function
\[ L(\theta) = f_\theta(x_1, \dots, x_n). \]
The method of maximum likelihood estimates θ by finding the value of θ that maximizes L(θ). This is the maximum likelihood estimator (MLE) of θ:
\[ \hat{\theta} = \arg\max_{\theta} L(\theta). \]
Commonly, one assumes that the data drawn from a particular distribution are independent, identically distributed (iid) with unknown parameters. This considerably simplifies the problem because the likelihood can then be written as a product of n univariate probability densities:
\[ L(\theta) = \prod_{i=1}^{n} f_\theta(x_i), \]
and since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum:
\[ \log L(\theta) = \sum_{i=1}^{n} \log f_\theta(x_i). \]
The maximum of this expression can then be found numerically using various optimization algorithms.
This contrasts with seeking an unbiased estimator of θ, which may not necessarily yield the MLE but which will yield a value that (on average) will neither tend to over-estimate nor under-estimate the true value of θ.
Note that the maximum likelihood estimator may not be unique, or indeed may not even exist.
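The numerical maximization described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it assumes an exponential model with an unknown rate (a model not discussed in this article), uses made-up observations, and picks golden-section search as one of many possible optimizers. The function names are hypothetical.

```python
import math

def log_likelihood(rate, data):
    """Log-likelihood of i.i.d. exponential data: sum of log f(x_i | rate)."""
    return sum(math.log(rate) - rate * x for x in data)

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d          # the maximum lies in [a, d]
        else:
            a = c          # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2

data = [0.8, 1.9, 0.3, 2.4, 1.1, 0.6]   # made-up observations (assumption)
rate_hat = golden_section_max(lambda r: log_likelihood(r, data), 1e-6, 10.0)
closed_form = len(data) / sum(data)      # known closed-form MLE for this model
print(rate_hat, closed_form)
```

For this particular model the maximizer is also available in closed form (n divided by the sum of the observations), which gives a convenient check on the numerical result. In problems without a closed form, only the numerical route remains.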
Properties
Functional invariance
The maximum likelihood estimator (MLE) of a parameter θ can be used to calculate the MLE of a function of the parameter. Specifically, if $\hat{\theta}$ is the MLE for θ, and if g is a one-to-one function, then the MLE for α = g(θ) is
\[ \hat{\alpha} = g(\hat{\theta}). \]
If g is not one-to-one, then $\hat{\alpha} = g(\hat{\theta})$ is the MLE of α = g(θ) only if the likelihood function is modified to be
\[ \bar{L}(\alpha) = \sup_{\theta \,:\, g(\theta) = \alpha} L(\theta). \]
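A small numerical check of the one-to-one case may help. The sketch below takes the coin data from the examples in this article (49 HEADS in 80 tosses) and assumes, purely for illustration, that the function of interest is the odds g(p) = p/(1-p). It compares g applied to the MLE of p with a brute-force maximization carried out directly over the odds. The grid and the function names are hypothetical choices.

```python
import math

t, n = 49, 80                      # successes and trials (coin data from the examples)

def log_lik_p(p):
    """Bernoulli log-likelihood in terms of p (constant term omitted)."""
    return t * math.log(p) + (n - t) * math.log(1 - p)

def log_lik_odds(a):
    """Same likelihood expressed through the inverse map p = a / (1 + a)."""
    return log_lik_p(a / (1 + a))

p_hat = t / n
odds_from_invariance = p_hat / (1 - p_hat)     # g(p_hat)

# Brute-force maximization directly over the odds parameter
grid = [k / 1000 for k in range(1, 10001)]     # 0.001, 0.002, ..., 10.000
odds_direct = max(grid, key=log_lik_odds)
print(odds_from_invariance, odds_direct)
```

The two values agree up to the grid spacing, as functional invariance predicts.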
Bias
The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution). If n is unknown, then the maximum-likelihood estimator of n is the value on the drawn ticket, even though the expectation is only (n+1)/2. In estimating the highest number n, we can only be certain that it is greater than or equal to the drawn ticket number.
Asymptotics
In many cases, estimation is performed using a set of independent identically distributed measurements. These may correspond to distinct elements from a random sample, repeated observations, etc. In such cases, it is of interest to determine the behavior of a given estimator as the number of measurements increases to infinity, referred to as asymptotic behaviour.
Under certain (fairly weak) regularity conditions, which are listed below, the MLE exhibits several characteristics which can be interpreted to mean that it is "asymptotically optimal". These characteristics include:
- The MLE is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity.
- The MLE is asymptotically efficient, i.e., it achieves the Cramér-Rao lower bound when the number of samples tends to infinity. This means that, asymptotically, no unbiased estimator has lower mean squared error than the MLE.
- The MLE is asymptotically normal. As the number of samples increases, the distribution of the MLE tends to the Gaussian distribution with mean θ and covariance matrix equal to the inverse of the Fisher information matrix.
The regularity conditions required to ensure this behavior are:
- The first and second derivatives of the log-likelihood function must be defined.
- The Fisher information matrix must not be zero.
While these asymptotic properties only become strictly true in the limit of infinite sample size, in practice they are often assumed to be approximately true, especially when the sample size is reasonably large. In particular, inference about the estimated parameters is often based on the asymptotic Gaussian distribution of the MLE.
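The asymptotic claims can be probed by simulation. The following sketch assumes a Bernoulli model with illustrative values p = 0.3, n = 200 and 4000 replications (none of which come from this article). It compares the empirical mean and variance of the MLE t/n with the asymptotic values, where the inverse Fisher information for n trials is p(1-p)/n, and it counts how often the estimate falls within 1.96 asymptotic standard deviations of the true value.

```python
import math
import random
import statistics

random.seed(0)                      # fixed seed so the run is repeatable
p_true, n, reps = 0.3, 200, 4000    # illustrative values (assumption)

def bernoulli_mle(n, p):
    """MLE t/n computed from n simulated Bernoulli(p) trials."""
    return sum(random.random() < p for _ in range(n)) / n

estimates = [bernoulli_mle(n, p_true) for _ in range(reps)]

mean_hat = statistics.fmean(estimates)
var_hat = statistics.variance(estimates)
var_asymptotic = p_true * (1 - p_true) / n   # inverse Fisher information for n trials

# Share of estimates within 1.96 asymptotic standard deviations of p_true
sd = math.sqrt(var_asymptotic)
coverage = sum(abs(e - p_true) <= 1.96 * sd for e in estimates) / reps
print(mean_hat, var_hat, var_asymptotic, coverage)
```

If the Gaussian approximation is reasonable at this sample size, the empirical variance should be close to the asymptotic one and the coverage should be near 95%. This is a sanity check on one model, not a proof of the general result.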
Examples
Discrete distribution, finite parameter space
Consider tossing an unfair coin 80 times (i.e., we sample something like x1=H, x2=T, ..., x80=T, and count the number of HEADS "H" observed). Call the probability of tossing a HEAD p, and the probability of tossing TAILS 1-p (so here p is θ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p=1/3, one which gives HEADS with probability p=1/2 and another which gives HEADS with probability p=2/3. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can calculate which coin has the largest likelihood, given the data that we observed. The likelihood function (defined above) takes one of three values:
\[ \Pr(49\,\mathrm{H} \mid p = 1/3) = \binom{80}{49} (1/3)^{49} (2/3)^{31} \approx 0.000, \]
\[ \Pr(49\,\mathrm{H} \mid p = 1/2) = \binom{80}{49} (1/2)^{49} (1/2)^{31} \approx 0.012, \]
\[ \Pr(49\,\mathrm{H} \mid p = 2/3) = \binom{80}{49} (2/3)^{49} (1/3)^{31} \approx 0.054. \]
We see that the likelihood is maximized when p=2/3, and so this is our maximum likelihood estimate for p.
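The comparison of the three coins is easy to reproduce. The sketch below evaluates the binomial likelihood of 49 HEADS in 80 tosses for each candidate value of p, using exact fractions before converting to floating point. The function and variable names are illustrative.

```python
from fractions import Fraction
from math import comb

heads, tosses = 49, 80
candidates = [Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)]

def likelihood(p):
    """Binomial probability of observing `heads` HEADS in `tosses` tosses."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

values = {p: float(likelihood(p)) for p in candidates}
p_hat = max(values, key=values.get)      # candidate with the largest likelihood

for p, v in values.items():
    print(f"p = {p}: L = {v:.3f}")
print("maximum likelihood estimate:", p_hat)
```

Because the parameter space is finite here, maximization is a direct comparison of three numbers and no calculus is needed.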
Discrete distribution, continuous parameter space
Now suppose we had only one coin but its p could have been any value 0 ≤ p ≤ 1. We must maximize the likelihood function
\[ L(p) = \binom{80}{49} p^{49} (1-p)^{31} \]
over all possible values 0 ≤ p ≤ 1.
One way to maximize this function is by differentiating with respect to p and setting to zero:
\[ 0 = \frac{d}{dp} \left[ \binom{80}{49} p^{49} (1-p)^{31} \right] \propto 49 p^{48} (1-p)^{31} - 31 p^{49} (1-p)^{30} = p^{48} (1-p)^{30} \left[ 49(1-p) - 31p \right], \]
which has solutions p=0, p=1, and p=49/80. The solution which maximizes the likelihood is clearly p=49/80 (since p=0 and p=1 result in a likelihood of zero). Thus we say the maximum likelihood estimator for p is 49/80.
[Figure: Likelihood of different proportion parameter values for a binomial process with t = 3 and n = 10; the ML estimate occurs at the mode, the peak (maximum) of the curve.]
This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli trials resulting in t 'successes'.
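The general result t / n can be checked numerically. The sketch below is an illustration, not part of the derivation: it evaluates the log-likelihood on a grid of interior points and confirms that the grid maximizer coincides with t / n, and that the derivative of the log-likelihood (the score) vanishes there. The grid resolution and function names are hypothetical choices.

```python
import math

def bernoulli_mle(t, n):
    """Closed-form maximum likelihood estimate of p: t successes in n trials."""
    return t / n

def log_likelihood(p, t, n):
    """Log-likelihood up to the constant log C(n, t), which does not depend on p."""
    return t * math.log(p) + (n - t) * math.log(1 - p)

def score(p, t, n):
    """Derivative of the log-likelihood with respect to p."""
    return t / p - (n - t) / (1 - p)

t, n = 49, 80
p_hat = bernoulli_mle(t, n)
grid = [k / 10000 for k in range(1, 10000)]          # interior points of (0, 1)
p_grid = max(grid, key=lambda p: log_likelihood(p, t, n))
print(p_hat, p_grid, score(p_hat, t, n))
```

The grid excludes the endpoints p=0 and p=1, where the logarithm is undefined and the likelihood is zero whenever both outcomes have been observed.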
Continuous distribution, continuous parameter space
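For the normal distribution $\mathcal{N}(\mu, \sigma^2)$, which has probability density function
\[ f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \]
the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is
\[ f(x_1, \dots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i-\mu)^2}{2\sigma^2} \right), \]
or more conveniently:
\[ f(x_1, \dots, x_n \mid \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2} \right), \]
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean.
This family of distributions has two parameters: θ=(μ,σ), so we maximize the likelihood $L(\mu, \sigma) = f(x_1, \dots, x_n \mid \mu, \sigma^2)$ over both parameters simultaneously, or if possible, individually.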
Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. [Note: the log-likelihood is closely related to information entropy and Fisher information.]
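Differentiating the log likelihood with respect to μ and equating to zero:
\[ 0 = \frac{\partial}{\partial\mu} \log\left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2} \right) \right) \]
\[ = \frac{\partial}{\partial\mu} \left( \frac{n}{2} \log\frac{1}{2\pi\sigma^2} - \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2} \right) = \frac{n(\bar{x}-\mu)}{\sigma^2}, \]
which is solved by
\[ \hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \]
This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,
\[ E[\hat{\mu}] = \mu, \]
which means that the maximum-likelihood estimator $\hat{\mu}$ is unbiased.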
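Similarly we differentiate the log likelihood with respect to σ and equate to zero:
\[ 0 = \frac{\partial}{\partial\sigma} \log\left( \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2} \right) \right) \]
\[ = \frac{\partial}{\partial\sigma} \left( \frac{n}{2} \log\frac{1}{2\pi\sigma^2} - \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2} \right) = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{\sigma^3}, \]
which is solved by
\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i-\hat{\mu})^2. \]
Inserting $\hat{\mu} = \bar{x}$ we obtain
\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i-\bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j. \]
To calculate the expectation value, write $x_i = \mu + \delta_i$, where the $\delta_i$ are independent with mean zero and variance $\sigma^2$. Because $x_i - \bar{x} = \delta_i - \bar{\delta}$, the same identity holds with δ in place of x:
\[ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \delta_i^2 - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \delta_i \delta_j. \]
Since $E[\delta_i \delta_j] = 0$ for i ≠ j, the double sum gives a nonzero contribution only if i=j. We obtain
\[ E[\hat{\sigma}^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2. \]
This means that the estimator $\hat{\sigma}^2$ is biased (however, $\hat{\sigma}^2$ is consistent).
Formally we say that the maximum likelihood estimator for $\theta = (\mu, \sigma^2)$ is:
\[ \hat{\theta} = (\hat{\mu}, \hat{\sigma}^2) = \left( \bar{x},\ \frac{1}{n}\sum_{i=1}^{n} (x_i-\bar{x})^2 \right). \]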
In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.
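The closed-form estimators for the normal distribution translate directly into code. The sketch below computes the sample mean and the 1/n variance estimate for a small made-up sample, then runs a Monte Carlo check of the bias of the variance estimator. The sample values, the seed and the simulation settings (μ = 0, σ = 2, n = 5, 20000 replications) are assumptions chosen for illustration.

```python
import random
import statistics

def normal_mle(xs):
    """Return (mu_hat, sigma2_hat): the sample mean and the 1/n variance estimate."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

# Small made-up sample (assumption)
xs = [2.1, 3.4, 1.9, 4.0, 2.6]
mu_hat, sigma2_hat = normal_mle(xs)
print(mu_hat, sigma2_hat)

# Monte Carlo check of the bias of the variance estimator
random.seed(1)                              # fixed seed so the run is repeatable
mu, sigma, n, reps = 0.0, 2.0, 5, 20000     # illustrative settings (assumption)
avg_sigma2_hat = statistics.fmean(
    normal_mle([random.gauss(mu, sigma) for _ in range(n)])[1]
    for _ in range(reps)
)
expected = (n - 1) / n * sigma**2           # theoretical mean of the 1/n estimator
print(avg_sigma2_hat, expected, sigma**2)
```

With n = 5 the average of the variance estimates should sit near (n-1)/n times the true variance rather than at the true variance itself, which is the bias derived above. The gap shrinks as n grows, in line with consistency.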
See also
- likelihood function, a description on what likelihood functions are.
- Delta method, a method for finding the distribution of functions of a maximum likelihood estimator.
- mean squared error, a measure of how 'good' an estimator of a distributional parameter is (be it the maximum likelihood estimator or some other estimator).
- The Rao–Blackwell theorem, a result which yields a process for finding the best possible unbiased estimator (in the sense of having minimal mean squared error). The MLE is often a good starting place for the process.
- sufficient statistic, a function of the data through which the MLE (if it exists and is unique) will depend on the data.
- generalized method of moments, a method related to maximum likelihood estimation.
- inferential statistics, for an alternative to the maximum likelihood estimate.
- Maximum a posteriori (MAP) estimator, for a contrast in the way to calculate estimators when prior knowledge is postulated.
- Method of moments (statistics), for another popular method for finding parameters of distributions.
References
- Kay, Steven M. (1993). Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, Ch. 7. ISBN 0-13-345711-7.
- A paper on the history of Maximum Likelihood: Aldrich, John (1997). "R.A. Fisher and the making of maximum likelihood 1912-1922". Statistical Science 12 (3): 162-176. DOI:10.1214/ss/1030037906.
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.














