Information about Mixture Model
In mathematics, the term mixture model is a model in which independent variables are fractions of a total.

Suppose researchers are trying to find the optimal mixture of ingredients for a fruit punch consisting of grape juice, mango juice, and pineapple juice. A mixture model is suitable here because the results of the taste tests will not depend on the amount of ingredients used to make the batch but rather on the fraction of each ingredient present in the punch. The components always sum to a whole, which a mixture model takes into account.
As another example, financial returns often behave differently in normal situations and during crisis times. A mixture model for return data seems reasonable.
In an indirect application of the mixture model we do not assume such a mechanism. The mixture model is simply used for its mathematical flexibilities. For example, a mixture of two normal distributions with different means may result in a density with two modes, which is not modeled by standard parametric distributions.
Suppose that the discrete random variable
is a mixture of
component discrete random variables
. Then, the probability mass function of
,
, is a weighted sum of its component distributions:
for some mixture proportions
where
.
The definition is the same for continuous random variables, except that the functions
are probability density functions.
:
where
and
be the class of all binomial distributions with
. Then a mixture of two members of
would have
and
. Clearly, given
and
, it is not possible to determine the above mixture model uniquely, as there are three parameters (
) to be determined.
be the class of all component distributions. Then the convex hull
of
defines the class of all finite mixture of distributions in
:
is said to be identifiable if all its members are unique, that is, given two members
and
in
, being mixtures of
distributions and
distributions respectively in
, we have
if and only if, first of all,
and secondly we can reorder the summations such that
and
for all
.
and we can sample from
, but we would like to determine the
and
values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.
It's common to think of probability mixture modeling as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.
's and
's). It is an iterative algorithm with two steps: an expectation step and a maximization step. are included in the SOCR demonstrations.
and distribution
, we compute a membership value
:
The mixing coefficients
are the means of the membership values over the
data points.
The component model parameters
are also calculated by expectation maximization using data points
that have been weighted using the membership values. For example, if
is a mean
With new estimates for
and the
's, we proceed back to the expectation step to recompute new membership values. The procedure is repeated until model parameters converge.
We'll use the example of a mixture of two Gaussian distributions to demonstrate how the method works. We start again with initial guessed parameters for our mixture model. Instead of computing partial memberships for each elemental distribution, we draw a membership value for each data point from a Bernoulli distribution (that is, it will be assigned to either the first or the second Gaussian). The Bernoulli parameter
is determined for each data point on the basis of one of the constituent distributions. Draws from the distribution generate membership associations for each data point. We can then use plug-in estimators as in the M step of EM to generate a new set of mixture model parameters, and return to the binomial draw step.
are points in high-dimensional
Euclidean space, and the hidden distributions are known to be log-concave (such as Gaussian distribution or Exponential distribution).
Spectral methods of learning mixture models are based on the use of Singular Value Decomposition of a matrix which contains data points. The idea is to consider the top
singular vectors, where
is the number of distributions we are trying to learn. The projection
of each data point to a linear subspace spanned by those vectors groups points originating from the same distribution
very close together, while points from different distributions stay far apart.
One distinctive feature of the spectral method is that it allows us to prove that if distributions satisfy certain separation condition (e.g. not too close), then the estimated mixture will be very close to the true one with high probability.
1. Generate random numbers from a Bernoulli distribution of size N
2. Generate N random numbers each from the I distributions Fi
3. Pick random number from the ith distribution if the corresponding Bernoulli random number is between pi-1 and pi
Examples
The normal distribution is plotted using different means and variances
As another example, financial returns often behave differently in normal situations and during crisis times. A mixture model for return data seems reasonable.
Direct and indirect applications of mixture models
The financial example above is one direct application of the mixture model, a situation in which we assume an underlying mechanism so that each observation belongs to one of some number of different sources or categories. This underlying mechanism may or may not, however, be observable. In this form of mixture, each of the sources is described by a component probability density function, and its mixture weight is the probability that an observation comes from this component.In an indirect application of the mixture model we do not assume such a mechanism. The mixture model is simply used for its mathematical flexibilities. For example, a mixture of two normal distributions with different means may result in a density with two modes, which is not modeled by standard parametric distributions.
Types of mixture model
Probability mixture model
In statistics, a probability mixture model is a probability distribution that is a convex combination of other probability distributions.Suppose that the discrete random variable
is a mixture of
component discrete random variables
. Then, the probability mass function of
,
, is a weighted sum of its component distributions:
for some mixture proportions
where
.
The definition is the same for continuous random variables, except that the functions
are probability density functions.
Parametric mixture model
In the parametric mixture model, the component distributions are from a parametric family, with unknown parameters
:
Continuous mixture
A continuous mixture is defined similarly:where
and
Identifiability
Identifiability refers to the existence of a unique characterization for any one of the models in the class being considered. Estimation procedure may not be well-defined and asymptotic theory may not hold if a model is not identifiable.Example
Let
be the class of all binomial distributions with
. Then a mixture of two members of
would have
and
. Clearly, given
and
, it is not possible to determine the above mixture model uniquely, as there are three parameters (
) to be determined.
Definition
Consider a mixture of parametric distributions of the same class. Letbe the class of all component distributions. Then the convex hull
of
defines the class of all finite mixture of distributions in
:
is said to be identifiable if all its members are unique, that is, given two members
and
in
, being mixtures of
distributions and
distributions respectively in
, we have
if and only if, first of all,
and secondly we can reorder the summations such that
and
for all
.
Common approaches for estimation in mixture models
Parametric mixture models are often used when we know the distribution
and we can sample from
, but we would like to determine the
and
values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.
It's common to think of probability mixture modeling as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.
Expectation maximization
The Expectation-maximization algorithm can be used to compute the parameters of a parametric mixture model distribution (the
's and
's). It is an iterative algorithm with two steps: an expectation step and a maximization step. are included in the SOCR demonstrations.
The expectation step
With initial guesses for the parameters of our mixture model, we compute "partial membership" of each data point in each constituent distribution. This is done by calculating expectation values for the membership variables of each data point. That is, for each data point
and distribution
, we compute a membership value
:
The maximization step
With our expectation values in hand for group membership, we can recompute plug-in estimates of our distribution parameters.The mixing coefficients
are the means of the membership values over the
data points.
The component model parameters
are also calculated by expectation maximization using data points
that have been weighted using the membership values. For example, if
is a mean
With new estimates for
and the
's, we proceed back to the expectation step to recompute new membership values. The procedure is repeated until model parameters converge.
Markov chain Monte Carlo
As an alternative to the EM algorithm, we can use posterior sampling as indicated by Bayes' theorem to deduce parameters in our mixture model. Once again we regard this as an incomplete data problem where membership of data points is our missing data. We resort to a method called Gibbs sampling which is once again a two-step iterative procedure.We'll use the example of a mixture of two Gaussian distributions to demonstrate how the method works. We start again with initial guessed parameters for our mixture model. Instead of computing partial memberships for each elemental distribution, we draw a membership value for each data point from a Bernoulli distribution (that is, it will be assigned to either the first or the second Gaussian). The Bernoulli parameter
is determined for each data point on the basis of one of the constituent distributions. Draws from the distribution generate membership associations for each data point. We can then use plug-in estimators as in the M step of EM to generate a new set of mixture model parameters, and return to the binomial draw step.
Spectral method
Some problems in mixture model estimation can be solved using spectral techniques. In particular it becomes useful if data points
are points in high-dimensional
Euclidean space, and the hidden distributions are known to be log-concave (such as Gaussian distribution or Exponential distribution).
Spectral methods of learning mixture models are based on the use of Singular Value Decomposition of a matrix which contains data points. The idea is to consider the top
singular vectors, where
is the number of distributions we are trying to learn. The projection
of each data point to a linear subspace spanned by those vectors groups points originating from the same distribution
very close together, while points from different distributions stay far apart.
One distinctive feature of the spectral method is that it allows us to prove that if distributions satisfy certain separation condition (e.g. not too close), then the estimated mixture will be very close to the true one with high probability.
Other methods
Other methods which guarantee accurate estimation have emerged in the last few years. Some of them can even provably learn mixtures of heavy-tailed distributions including those with infinite variance (see links to papers below). In this setting, EM based methods would not work, since the Expectation step would diverge due to presence of outliers.A simulation
To simulate a sample of size N that is a mixture of distributions Fi, i=1 to I, with probability pi each (sigma pi=1):1. Generate random numbers from a Bernoulli distribution of size N
2. Generate N random numbers each from the I distributions Fi
3. Pick random number from the ith distribution if the corresponding Bernoulli random number is between pi-1 and pi
Further reading
Books on mixture models
- Titterington, D., A. Smith, and U. Makov "Statistical Analysis of Finite Mixture Distributions," John Wiley & Sons (1985).
- G. McLachlan, D. Peel Finite Mixture Models, , Wiley (2000)
- Marin, J.M., Mengersen, K. and Robert, C.P. "Bayesian modelling and inference on mixtures of distributions". Handbook of Statistics 25, D. Dey and C.R. Rao (eds). Elsevier-Sciences (to appear). available as PDF
- Lindsay B.G., Mixture Models: Theory, Geometry, and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics Vol. 5, Institute of Mathematical Statistics, Hayward (1995).
Recent papers
- S. Dasgupta (1999). "Learning Mixtures of Gaussians". Proc. of Symposium on Foundations of Computer Science (FOCS).
- S. Dasgupta and L. Schulman (2000). "A Two-Round Variant of EM for Gaussian Mixtures". Conference in Uncertainty in Artificial Intelligence (UAI).
- S. Arora and R. Kannan (2001). "Learning Mixtures of Arbitrary Gaussians". Symposium on Theory of Computing (STOC).
- S. Vempala and G. Wang (2004). "A spectral algorithm for learning mixture models". Journal of Computer and System Sciences.
- T. Batu, S. Guha, and S. Kannan (2004). "Inferring Mixtures of Markov Chains". Conference on Learning Theory (COLT).
- R. Kannan, H. Salmasian, and Santosh Vempala (2005). "The Spectral Method for General Mixture Models". Conference on Learning Theory (COLT).
- D. Achlioptas and F. McSherry (2005). "On Spectral Learning of Mixtures of Distributions". Conference on Learning Theory (COLT).
- A. Dasgupta, J. Kleinberg, J. Hopcroft, and M.Sandler (2005). "On Learning Mixtures of Heavy-Tailed Distributions". Symposium on Foundations of Computer Science (FOCS).
- J. Feldman, R. O'Donnell and R. Servedio (2005). "Learning Mixtures of Product Distributions over Discrete Domains". Symposium on Foundations of Computer Science (FOCS).
See also
- Mixture density
- The
External links
- Mixture modelling page (and the Snob program for Minimum Message Length (MML) applied to finite mixture models), maintained by D.L. Dowe.
- PyMix - Python Mixture Package, algorithms and data structures for a broad variety of mixture model based data mining applications in Python
- em - A Python package for learning Gaussian Mixture Models with Expectation Maximization, currently packaged with SciPy
Mathematics (colloquially, maths or math) is the body of knowledge centered on such concepts as quantity, structure, space, and change, and also the academic discipline that studies them. Benjamin Peirce called it "the science that draws necessary conclusions".
..... Click the link for more information.
..... Click the link for more information.
normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family may be defined by two parameters, location and scale: the mean ("average",
..... Click the link for more information.
..... Click the link for more information.
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities.
..... Click the link for more information.
..... Click the link for more information.
probability distribution that assigns a probability to every subset (more precisely every measurable subset) of its state space in such a way that the probability axioms are satisfied.
..... Click the link for more information.
..... Click the link for more information.
convex combination is a linear combination of data points (which can be vectors, scalars, or more generally points in an affine space) where all coefficients are non-negative and sum up to 1.
..... Click the link for more information.
..... Click the link for more information.
discrete if it is characterized by a probability mass function. Thus, the distribution of a random variable X is discrete, and X is then called a discrete random variable, if
as u
..... Click the link for more information.
as u
..... Click the link for more information.
probability mass function (abbreviated pmf) is a function that gives the probability that a discrete random variable is exactly equal to some value. A probability mass function differs from a probability density function (abbreviated pdf
..... Click the link for more information.
..... Click the link for more information.
In probability theory, a probability distribution is called continuous if its cumulative distribution function is continuous. That is equivalent to saying that for random variables X with the distribution in question, Pr[X = a
..... Click the link for more information.
..... Click the link for more information.
In mathematics, a probability density function (pdf) is a function that represents a probability distribution in terms of integrals.
Formally, a probability distribution has density f, if f
..... Click the link for more information.
Formally, a probability distribution has density f, if f
..... Click the link for more information.
In mathematics, the convex hull or convex envelope for a set of points X in a real vector space V is the minimal convex set containing X.
..... Click the link for more information.
Intuitive picture
For planar objects, i.e...... Click the link for more information.
An expectation-maximization (EM) algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables.
..... Click the link for more information.
..... Click the link for more information.
In computational mathematics, an iterative method attempts to solve a problem (for example an equation or system of equations) by finding successive approximations to the solution starting from an initial guess.
..... Click the link for more information.
..... Click the link for more information.
expected value (or mathematical expectation, or mean) of a discrete random variable is the sum of the probability of each possible outcome of the experiment multiplied by the outcome value (or payoff).
..... Click the link for more information.
..... Click the link for more information.
In mathematics and statistics, the arithmetic mean (or simply the mean) of a list of numbers is the sum of all the members of the list divided by the number of items in the list. The arithmetic mean is what students are taught very early to call the "average".
..... Click the link for more information.
..... Click the link for more information.
This article or section may be confusing or unclear for some readers.
Please [improve the article] or discuss this issue on the talk page. This article has been tagged since September 2007.
..... Click the link for more information.
Please [improve the article] or discuss this issue on the talk page. This article has been tagged since September 2007.
..... Click the link for more information.
Monte Carlo methods are a widely used class of computational algorithms for simulating the behavior of various physical and mathematical systems, and for other computations.
..... Click the link for more information.
..... Click the link for more information.
Bayes' theorem (also known as Bayes' rule or Bayes' law) is a result in probability theory, which relates the conditional and marginal probability distributions of random variables.
..... Click the link for more information.
..... Click the link for more information.
In mathematics and physics, Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables. The purpose of such a sequence is to approximate the joint distribution (i.e.
..... Click the link for more information.
..... Click the link for more information.
normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family may be defined by two parameters, location and scale: the mean ("average",
..... Click the link for more information.
..... Click the link for more information.
Bernoulli distribution, named after Swiss scientist Jakob Bernoulli, is a discrete probability distribution, which takes value 1 with success probability and value 0 with failure probability .
..... Click the link for more information.
..... Click the link for more information.
Spectral music (or spectralism) is a musical genre or movement originating in France in the 1970s featuring the use of sound, including timbre, pitch, and rhythm of individual sounds, as a model for composition, most often using computer analysis of sound wave components and
..... Click the link for more information.
..... Click the link for more information.
Euclidean space. Most of this article is devoted to developing the modern language necessary for the conceptual leap to higher dimensions.
An essential property of a Euclidean space is its flatness. Other spaces exist in geometry that are not Euclidean.
..... Click the link for more information.
An essential property of a Euclidean space is its flatness. Other spaces exist in geometry that are not Euclidean.
..... Click the link for more information.
Log-concave may refer to:
..... Click the link for more information.
- Logarithmically concave function
- Logarithmically concave measure
..... Click the link for more information.
normal distribution, also called the Gaussian distribution, is an important family of continuous probability distributions, applicable in many fields. Each member of the family may be defined by two parameters, location and scale: the mean ("average",
..... Click the link for more information.
..... Click the link for more information.
exponential distributions are a class of continuous probability distribution. They are often used to model the time between independent events that happen at a constant average rate.
..... Click the link for more information.
..... Click the link for more information.
In linear algebra, the singular value decomposition (SVD) is an important factorization of a rectangular real or complex matrix, with several applications in signal processing and statistics.
..... Click the link for more information.
..... Click the link for more information.
In linear algebra, a Euclidean subspace (or subspace of Rn) is a set of vectors that is closed under addition and scalar multiplication. Geometrically, a subspace is a flat in n-dimensional Euclidean space that passes through the origin.
..... Click the link for more information.
..... Click the link for more information.
In mathematics, a proof is a demonstration that, assuming certain axioms, some statement is necessarily true. A proof is a logical argument, not an empirical one. That is, one must demonstrate that a proposition is true in all cases before it is considered a theorem of mathematics.
..... Click the link for more information.
..... Click the link for more information.
In probability theory, heavy-tailed distributions are probability distributions whose tails are not exponentially bounded:[1] that is, they have heavier tails than the exponential distribution.
..... Click the link for more information.
..... Click the link for more information.
variance of a random variable (or somewhat more precisely, of a probability distribution) is one measure of statistical dispersion, averaging the squared distance of its possible values from the expected value.
..... Click the link for more information.
..... Click the link for more information.
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
Herod_Archelaus










