Information about Spearman's Rank Correlation Coefficient

In statistics, Spearman's rank correlation coefficient, named after Charles Spearman and often denoted by the Greek letter ρ (rho), is a non-parametric measure of correlation – that is, it assesses how well an arbitrary monotonic function could describe the relationship between two variables, without making any assumptions about the frequency distribution of the variables. Unlike the Pearson product-moment correlation coefficient, it does not require the assumption that the relationship between the variables is linear, nor does it require the variables to be measured on interval scales; it can be used for variables measured at the ordinal level.

In principle, ρ is simply a special case of the Pearson product-moment coefficient in which the data are converted to rankings before calculating the coefficient. In practice, however, a simpler procedure is normally used to calculate ρ. The raw scores are converted to ranks, and the differences d between the ranks of each observation on the two variables are calculated.

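If there are no tied ranks, i.e. no two observations share the same value of x or the same value of y, then ρ is given by:

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where:

$d_i$ = the difference between the ranks of corresponding values of x and y, and

$n$ = the number of pairs of values.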
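As a minimal sketch of the shortcut formula ρ = 1 − 6Σd²/(n(n² − 1)), the following Python function ranks both variables and applies it. The function name `spearman_rho_no_ties` and the choice to give rank 1 to the smallest value are assumptions made for illustration. The function rejects input containing ties, because the formula does not apply there.

```python
def spearman_rho_no_ties(x, y):
    """Spearman's rho via the shortcut formula; valid only without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("tied values present; use Pearson on ranks instead")

    def ranks(values):
        # Rank 1 goes to the smallest value, rank n to the largest.
        order = sorted(range(n), key=lambda i: values[i])
        result = [0] * n
        for position, index in enumerate(order, start=1):
            result[index] = position
        return result

    rank_x, rank_y = ranks(x), ranks(y)
    # d_i is the rank difference for observation i; only d_i squared is used.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))
```

Ranking from the largest value downward instead would give the same ρ, provided the same direction is used for both variables.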


If tied ranks exist, the classic Pearson correlation coefficient between the ranks has to be used instead of this formula. Equal values must all be assigned the same rank, which is the average of the positions they occupy in the sorted order of the values:

An Example of Averaging Ranks

Variable    Position in the descending order    Rank
0.8         5                                   5
1.2         4                                   (4 + 3) / 2 = 3.5
1.2         3                                   (4 + 3) / 2 = 3.5
2.3         2                                   2
18          1                                   1
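The averaging rule for tied values can be sketched in Python as follows. The function name `average_ranks` and its `descending` flag are assumptions for illustration. The flag only selects whether position 1 goes to the largest or the smallest value.

```python
def average_ranks(values, descending=False):
    """Rank values, giving each group of equal values the mean of its positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # The group occupies 1-based positions start+1 .. end+1; share their mean.
        mean_position = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_position
        start = end + 1
    return ranks


print(average_ranks([0.8, 1.2, 1.2, 2.3, 18], descending=True))
# [5.0, 3.5, 3.5, 2.0, 1.0]
```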


Spearman's rank correlation coefficient is equivalent to the Pearson correlation computed on ranks. The formula above is a shortcut for its product-moment form that assumes no ties. The product-moment form can be used in both tied and untied cases.
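The product-moment form can be sketched in Python by ranking each variable (averaging tied positions) and then computing an ordinary Pearson correlation on the ranks. The helper names `average_ranks`, `pearson` and `spearman_rho` are assumptions for illustration, and the quadratic-time ranking is chosen for brevity rather than speed.

```python
from math import sqrt


def average_ranks(values):
    # Rank = number of smaller values + mean 1-based offset within the tie group.
    return [sum(w < v for w in values) + (sum(w == v for w in values) + 1) / 2
            for v in values]


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman's rho as Pearson's r on average ranks; works with or without ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return pearson(average_ranks(x), average_ranks(y))
```

When there are no ties, this should agree with the shortcut formula up to floating-point rounding.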

A version of this correlation is called Spearman's rho. In this case ranks are calculated as above, but in the formula for Pearson's correlation the standard deviations are taken as if there were no ties.

Another popular method for computing rank correlation is the Kendall tau rank correlation coefficient.

Example

The raw data used in this example is shown below.
IQ     Hours of TV per week
106    7
86     0
100    27
101    50
99     28
103    29
97     20
113    12
112    6
110    17


The first step is to sort the data by the first column. Next, two more columns are created, holding the ranks of the first two columns. Values that are the same receive the mean of the ranks they would otherwise occupy (there are no such ties in this example). Then a column "d" is created to hold the differences between the two rank columns; only the magnitude of each difference matters, because it is squared in the next step. Finally, a column "d²" is created, which is just column d squared.

After doing this process with the example data you should end up with something like:

IQ (i)    Hours of TV per week (t)    rank (i)    rank (t)    d    d²
86        0                           1           1           0    0
97        20                          2           6           4    16
99        28                          3           8           5    25
100       27                          4           7           3    9
101       50                          5           10          5    25
103       29                          6           9           3    9
106       7                           7           3           4    16
110       17                          8           5           3    9
112       6                           9           2           7    49
113       12                          10          4           6    36


The values in the d² column can now be added to find $\sum d_i^2 = 194$. The value of n is 10. These values can now be substituted back into the equation,

$$\rho = 1 - \frac{6 \times 194}{10(10^2 - 1)}$$

which evaluates to ρ ≈ −0.1758. In the case of ties in the original values, this formula should not be used. Instead, the Pearson correlation coefficient should be calculated on the ranks (where ties are given ranks, as described above).
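The arithmetic of this example can be checked with a short Python sketch that uses the ten (IQ, hours of TV) pairs from the raw data. The variable names and the helper `ranks_no_ties` are assumptions for illustration, and the helper relies on the fact that neither column contains a repeated value.

```python
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv_hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]


def ranks_no_ties(values):
    # With no repeated values, a value's rank is its 1-based position when sorted.
    sorted_values = sorted(values)
    return [sorted_values.index(v) + 1 for v in values]


rank_iq = ranks_no_ties(iq)
rank_tv = ranks_no_ties(tv_hours)
d_squared = [(a - b) ** 2 for a, b in zip(rank_iq, rank_tv)]

n = len(iq)
sum_d2 = sum(d_squared)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2)         # 194
print(round(rho, 4))  # -0.1758
```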

Determining significance

The modern approach to testing whether an observed value of ρ is significantly different from zero (we will always have 1 ≥ ρ ≥ −1) is to calculate the probability that it would be greater than or equal to the observed ρ, given the null hypothesis, by using a permutation test. This approach is almost always superior to traditional methods, unless the data set is so large that computing power is not sufficient to generate permutations, or unless an algorithm for creating permutations that are logical under the null hypothesis is difficult to devise for the particular case (but usually these algorithms are straightforward).
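A one-sided permutation test of this kind might be sketched in Python as below. It assumes that, under the null hypothesis, every pairing of the y ranks with the x ranks is equally likely, and it samples random permutations rather than enumerating all of them. The function names, the default of 10,000 permutations, the fixed seed, the small comparison tolerance and the "+1" adjustment (which keeps the estimate away from exactly zero) are all assumptions for illustration.

```python
import random
from math import sqrt


def average_ranks(values):
    # Rank = number of smaller values + mean 1-based offset within the tie group.
    return [sum(w < v for w in values) + (sum(w == v for w in values) + 1) / 2
            for v in values]


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_a * var_b)


def permutation_p_value(x, y, n_permutations=10000, seed=0):
    """Estimate P(rho >= observed rho) under the null by shuffling the y ranks."""
    rng = random.Random(seed)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    observed = pearson(rank_x, rank_y)
    shuffled = list(rank_y)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # a random re-pairing of the y ranks with the x ranks
        # Tolerance guards against rounding when the two values are equal in theory.
        if pearson(rank_x, shuffled) >= observed - 1e-12:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (n_permutations + 1)
```

Shuffling the ranks rather than the raw values is equivalent here, because ranking does not depend on the order of the observations. For a two-sided test, one could compare absolute values instead.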

Although the permutation test is often trivial to perform for anyone with computing resources and programming experience, traditional methods for determining significance are still widely used. The most basic approach is to compare the observed ρ with published tables for various levels of significance. This is a simple solution if the significance only needs to be known within a certain range or less than a certain value, as long as tables are available that specify the desired ranges. A reference to such a table is given below. However, generating these tables is computationally intensive and complicated mathematical tricks have been used over the years to generate tables for larger and larger sample sizes, so it is not practical for most people to extend existing tables.

An alternative approach available for sufficiently large sample sizes is an approximation to the Student's t-distribution. For sample sizes above about 20, the variable

$$t = \frac{\rho}{\sqrt{(1 - \rho^2)/(n - 2)}}$$

approximately follows a Student's t-distribution with n − 2 degrees of freedom in the null case (zero correlation). In the non-null case (i.e. to test whether an observed ρ is significantly different from a theoretical value, or whether two observed ρs differ significantly) tests are much less powerful, though the t-distribution can again be used.
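The statistic t = ρ / sqrt((1 − ρ²)/(n − 2)) can be computed with a few lines of Python. The function name `spearman_t_statistic` and the sample values ρ = 0.5 and n = 27 are hypothetical, chosen only to illustrate the arithmetic; the result would then be compared against a t-distribution with n − 2 degrees of freedom.

```python
from math import sqrt


def spearman_t_statistic(rho, n):
    """t statistic for testing rho = 0, for use with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two pairs of values")
    if abs(rho) >= 1:
        raise ValueError("the statistic is undefined when |rho| is 1")
    return rho / sqrt((1 - rho ** 2) / (n - 2))


# Hypothetical inputs: rho = 0.5 observed on n = 27 pairs.
t = spearman_t_statistic(0.5, 27)
print(round(t, 3))  # 2.887
```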

A generalisation of the Spearman coefficient is useful in the situation where there are three or more conditions, a number of subjects are all observed in each of them, and we predict that the observations will have a particular order. For example, a number of subjects might each be given three trials at the same task, and we predict that performance will improve from trial to trial. A test of the significance of the trend between conditions in this situation was developed by E. B. Page and is usually referred to as Page's trend test for ordered alternatives.



This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.