Information about Sample Size

The sample size of a statistical sample is the number of observations (repeated measurements) that constitute it. It is typically denoted n and is a positive integer.

Different sample sizes generally lead to different precision of estimation. This can be seen in statistical results such as the law of large numbers and the central limit theorem. All else being equal, a larger sample size n leads to increased precision in estimates of various properties of the population.

A typical example would be when a statistician wishes to estimate the arithmetic mean of a continuous random variable (for example, the height of a person). Assuming a random sample with independent observations, and assuming the variability of the population (as measured by the standard deviation σ) is known, the standard error of the sample mean is given by the formula:

$$\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$

As n becomes large, this standard error becomes small: it shrinks in proportion to 1/√n. This leads to more sensitive hypothesis tests with greater statistical power and narrower confidence intervals.
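As a rough illustration of how the standard error σ/√n of the sample mean shrinks as n grows, the following Python sketch tabulates it for a few sample sizes. The value σ = 10 and the function name `standard_error` are assumptions chosen only for illustration.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean when the population sd sigma is known."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return sigma / math.sqrt(n)


sigma = 10.0  # assumed population standard deviation (illustrative only)
for n in (25, 100, 400, 1600):
    # Each fourfold increase in n halves the standard error.
    print(n, standard_error(sigma, n))
```

Because the standard error depends on 1/√n, halving it requires roughly four times as many observations.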

With more complicated sampling techniques, such as stratified sampling, the sample can often be split up into sub-samples. Typically, if there are k such sub-samples (from k different strata), then each of them will have a sample size n_i, i = 1, 2, ..., k. These n_i must satisfy n_1 + n_2 + ... + n_k = n (i.e. the total sample size is the sum of the sub-sample sizes). Selecting these n_i optimally can be done in various ways, using (for example) Neyman's optimal allocation.
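The text does not spell out Neyman's allocation. In its usual textbook form, each stratum receives a share of n proportional to N_i σ_i, where N_i is the stratum's population size and σ_i its standard deviation. The sketch below assumes that form. The function name, the example strata and the largest-remainder rounding step are illustrative assumptions, not part of the original article.

```python
import math


def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Split a total sample size n across strata in proportion to N_i * sigma_i.

    Largest-remainder rounding is used (an assumption of this sketch) so that
    the returned n_i are integers that sum exactly to n.
    """
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one stratum needs positive size and sd")
    exact = [n * w / total for w in weights]
    alloc = [math.floor(x) for x in exact]
    # Give the units lost to rounding down to the largest fractional parts.
    leftover = n - sum(alloc)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Hypothetical strata: population sizes and standard deviations.
print(neyman_allocation(100, [5000, 3000, 2000], [10.0, 20.0, 30.0]))
```

This sketch does not enforce n_i ≤ N_i or a minimum of one observation per stratum; a practical implementation would need to handle both.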

Further examples

Central limit theorem

The central limit theorem is a significant result which depends on sample size.

Estimating proportions

A typical statistical aim is to demonstrate with 95% confidence that the true value of a parameter is within a distance B of the estimate. B is a margin of error that decreases with increasing sample size n. The interval extending a distance B on either side of the estimate is referred to as a 95% confidence interval.

For example, a simple situation is estimating a proportion in a population. To do so, a statistician will estimate the bounds of a 95% confidence interval for an unknown proportion.

The rule of thumb for (a maximum or 'conservative') B for a proportion derives from the fact that the estimator of a proportion, p̂ = X/n (where X is the number of 'positive' observations), has a (scaled) binomial distribution and is also a form of sample mean (from a Bernoulli distribution on {0, 1}, which has a maximum variance of 0.25, attained at parameter p = 0.5). So the sample mean X/n has maximum variance 0.25/n. For sufficiently large n (usually this means that at least 10 positive and 10 negative responses have been observed), this distribution will be closely approximated by a normal distribution with the same mean and variance.

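Using this approximation, it can be shown that about 95% of this distribution's probability lies within 2 standard deviations of the mean. Because of this, an interval of the form

$$\left(\hat{p} - 2\sqrt{0.25/n},\ \hat{p} + 2\sqrt{0.25/n}\right)$$

will form a 95% confidence interval for the true proportion, where \hat{p} = X/n is the sample proportion.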
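A minimal Python sketch of the conservative interval p̂ ± 2√(0.25/n) follows. The function name `conservative_ci` and the example counts are hypothetical. Clipping the endpoints to [0, 1] is an extra assumption of the sketch, not something the text prescribes.

```python
import math
from typing import Tuple


def conservative_ci(x: int, n: int) -> Tuple[float, float]:
    """Approximate 95% interval for a proportion using the worst-case variance 0.25/n."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need n > 0 and 0 <= x <= n")
    p_hat = x / n
    half_width = 2.0 * math.sqrt(0.25 / n)  # algebraically equal to 1 / sqrt(n)
    # Clipping to [0, 1] is a convenience added in this sketch.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical poll: 220 'positive' responses out of 400.
print(conservative_ci(220, 400))
```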

If we require the sampling error ε to be no larger than some bound B, we can solve the equation

$$\varepsilon \approx B = 2\sqrt{0.25/n} = \frac{1}{\sqrt{n}}$$

to give us

$$n = \frac{1}{B^2}$$

So n = 100 corresponds to B = 10%, n = 400 to B = 5%, n = 1000 to B ≈ 3%, and n = 10000 to B = 1%. These figures are often quoted in news reports of opinion polls and other sample surveys.
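The correspondence between n and B quoted here follows from the conservative rule B = 1/√n, equivalently n = 1/B². The following Python sketch computes both directions. The function names are hypothetical, and the small tolerance subtracted before rounding up is an implementation assumption that guards against floating-point noise.

```python
import math


def margin_of_error(n: int) -> float:
    """Conservative 95% margin of error for a proportion: B = 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)


def required_sample_size(bound: float) -> int:
    """Smallest n with 1 / sqrt(n) <= bound, i.e. n >= 1 / bound**2."""
    if bound <= 0:
        raise ValueError("bound must be positive")
    # Subtract a tiny tolerance so float rounding does not push an exact
    # answer (e.g. 100) up to the next integer.
    return math.ceil(1.0 / bound**2 - 1e-9)


for n in (100, 400, 1000, 10000):
    print(n, margin_of_error(n))
```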

Extension to other cases

In general, if a population mean is estimated using the sample mean x̄ from n observations from a distribution with variance σ², then if n is large enough (typically > 30) the central limit theorem can be applied to obtain an approximate 95% confidence interval of the form

$$\left(\bar{x} - \frac{2\sigma}{\sqrt{n}},\ \bar{x} + \frac{2\sigma}{\sqrt{n}}\right)$$

If the sampling error ε is required to be no larger than bound B, as above, then

$$\frac{2\sigma}{\sqrt{n}} \le B \quad\Longrightarrow\quad n \ge \frac{4\sigma^2}{B^2}$$

Note that if the mean is to be estimated using P parameters that must first be estimated themselves from the same sample, then to preserve sufficient "degrees of freedom," the sample size should be at least n + P.
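For the mean, requiring the half-width 2σ/√n of the approximate 95% interval to be at most B gives n ≥ 4σ²/B². A short Python sketch of that calculation follows. The function name and the example values of σ and B are assumptions made for illustration.

```python
import math


def required_n_for_mean(sigma: float, bound: float) -> int:
    """Smallest n such that 2 * sigma / sqrt(n) <= bound."""
    if sigma <= 0 or bound <= 0:
        raise ValueError("sigma and bound must be positive")
    # The tiny tolerance guards against float rounding on exact answers.
    return math.ceil(4.0 * sigma**2 / bound**2 - 1e-9)


# Hypothetical example: sd of 10 units, desired margin of 2 units.
print(required_n_for_mean(10.0, 2.0))
```

The result should still be checked against the "large enough n" condition needed for the central limit theorem approximation.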

Required sample sizes for hypothesis tests

A common problem facing statisticians is calculating the sample size required to yield a certain power for a test, given a predetermined Type I error rate α. A typical example for this is as follows:

Let X_i, i = 1, 2, ..., n be independent observations taken from a normal distribution with mean μ and variance σ². Let us consider two hypotheses, a null hypothesis:

$$H_0 : \mu = 0$$

and an alternative hypothesis:

$$H_a : \mu = \mu^*$$

for some 'smallest significant difference' μ* >0. This is the smallest value for which we care about observing a difference. Now, if we wish to (1) reject H0 with a probability of at least 1-β when Ha is true (i.e. a power of 1-β), and (2) reject H0 with probability α when H0 is true, then we need the following:

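If z_α is the upper α percentage point of the standard normal distribution, then

$$\Pr\left(\bar{x} > z_\alpha \sigma/\sqrt{n} \mid H_0 \text{ true}\right) = \alpha$$

and so

'Reject H0 if our sample average (x̄) is more than z_α σ/√n'

is a decision rule which satisfies (2). (Note that this is a one-tailed test, since H0 is rejected only for large values of the sample average.)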

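Now we wish for this to happen with a probability of at least 1 − β when Ha is true. In this case, our sample average will come from a normal distribution with mean μ* and variance σ²/n. Therefore we require

$$\Pr\left(\bar{x} > z_\alpha \sigma/\sqrt{n} \mid H_a \text{ true}\right) \ge 1 - \beta$$

Through careful manipulation, this can be shown to happen when

$$n \ge \left(\frac{z_\alpha + \Phi^{-1}(1-\beta)}{\mu^*/\sigma}\right)^2$$

where Φ is the standard normal cumulative distribution function.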
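For the rule that rejects H0 when the sample average exceeds z_α σ/√n, the power requirement works out to n ≥ ((z_α + z_β) σ / μ*)², where z_β = Φ⁻¹(1 − β). The Python sketch below evaluates this using `statistics.NormalDist` from the standard library (Python 3.8 or later). The function names and the example values of μ*, σ, α and β are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def required_n_for_power(mu_star: float, sigma: float,
                         alpha: float = 0.05, beta: float = 0.20) -> int:
    """Smallest n giving power >= 1 - beta for the one-sided z-test at level alpha."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)  # upper alpha point
    z_beta = STD_NORMAL.inv_cdf(1.0 - beta)    # Phi^{-1}(1 - beta)
    return math.ceil(((z_alpha + z_beta) * sigma / mu_star) ** 2)


def power(n: int, mu_star: float, sigma: float, alpha: float = 0.05) -> float:
    """Probability of rejecting H0 when the true mean is mu_star."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_alpha - mu_star * math.sqrt(n) / sigma)


# Hypothetical example: detect a shift of half a standard deviation.
n = required_n_for_power(mu_star=0.5, sigma=1.0)
print(n, power(n, 0.5, 1.0))
```

Checking `power(n, ...)` against `power(n - 1, ...)` is a simple way to confirm that the returned n is the smallest one meeting the target.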



This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.