Bivariate Distribution and Correlation

So far, we have confined ourselves to univariate distributions, i.e., distributions involving only one variable. However, in many real-life situations, we encounter cases where each observation is associated with two or more variables.

For example, if we measure the heights and weights of a group of persons, each observation consists of two related values—one for height and one for weight. Such a distribution involving two variables is called a bivariate distribution.

In a bivariate distribution, we are often interested in examining whether there exists any correlation or covariation between the two variables under study.

  • If a change in one variable is accompanied by a change in the other variable, the two variables are said to be correlated.

  • If the two variables tend to deviate in the same direction (i.e., an increase in one variable corresponds to an increase in the other, or a decrease corresponds to a decrease), the correlation is called direct or positive correlation.

  • On the other hand, if the two variables deviate in opposite directions (i.e., an increase in one variable corresponds to a decrease in the other, and vice versa), the correlation is called inverse or negative correlation.

Examples:

  • Positive correlation: (i) Heights and weights of individuals, (ii) Income and expenditure.

  • Negative correlation: (i) Price and demand of a commodity, (ii) Volume and pressure of a perfect gas.

Finally, correlation is said to be perfect if the deviation in one variable is always accompanied by a proportional and exact deviation in the other variable.

Scatter Diagram

A scatter diagram is the simplest method of representing bivariate data diagrammatically.

For a bivariate distribution (x_i, y_i),\; i = 1, 2, \dots, n, if the values of the variables X and Y are plotted along the horizontal axis (x-axis) and vertical axis (y-axis), respectively, in the Cartesian plane, the resulting diagram of points is known as a scatter diagram.

From a scatter diagram, we can form a fairly good—though approximate—idea about whether the two variables are correlated:

  • If the points lie close together, clustering about a line, we expect a high degree of correlation.

  • If the points are widely scattered, the correlation is expected to be weak.

However, this method is not very suitable when the number of observations is very large, as the diagram becomes too crowded to interpret effectively.
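In Python, a scatter diagram can be drawn with matplotlib; below is a minimal sketch using a small hypothetical height and weight data set (the numbers are invented purely for illustration):

import matplotlib.pyplot as plt

# Hypothetical bivariate data: heights (X) and weights (Y) of 8 persons
heights = [160, 162, 165, 167, 170, 172, 175, 178]   # cm
weights = [55, 58, 61, 63, 66, 70, 72, 76]           # kg

plt.scatter(heights, weights)       # one point per (x_i, y_i) pair
plt.xlabel("Height (cm)")           # X along the horizontal axis
plt.ylabel("Weight (kg)")           # Y along the vertical axis
plt.title("Scatter diagram")
plt.show()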

Karl Pearson’s Coefficient of Correlation

As a measure of the intensity or degree of linear relationship between two variables, Karl Pearson (1857–1936), a British biometrician, developed a formula known as the correlation coefficient.

The correlation coefficient between two random variables X and Y, usually denoted by r(X, Y) or simply r_{xy}, is a numerical measure of the linear relationship between them and is defined as:

r(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}

where:

  • \text{Cov}(X, Y) = covariance between X and Y

  • \sigma_X = standard deviation of X

  • \sigma_Y = standard deviation of Y



Let (x_i, y_i),\; i = 1, 2, \dots, n represent a bivariate distribution.

Covariance

The covariance between XX and YY is defined as:

\text{Cov}(X, Y) = E\Big[(X - E(X))(Y - E(Y))\Big]

For a sample, it is given by:

\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

where \bar{x} and \bar{y} are the sample means of X and Y.

Similarly, the variances are:

\sigma_X^2 = E[(X - E(X))^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\sigma_Y^2 = E[(Y - E(Y))^2] = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2

Karl Pearson’s Formula

The correlation coefficient between X and Y is defined as:

r(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}

This is also called the product-moment correlation coefficient because

\text{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big]

Alternative (Computational) Formula

For practical calculation, an equivalent formula is:

r(X, Y) = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\big(n \sum x_i^2 - (\sum x_i)^2\big)\big(n \sum y_i^2 - (\sum y_i)^2\big)}}

This avoids computing deviations separately and is useful for tabulated data.
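As an illustration, here is a small Python sketch of this computational formula; the data are the fathers' and sons' heights used in the worked example later in this section:

from math import sqrt

def pearson_r(x, y):
    # Computational form: no deviations from the means are needed
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

X = [65, 66, 67, 67, 68, 69, 70, 72]   # fathers' heights
Y = [67, 68, 65, 68, 72, 72, 69, 71]   # sons' heights
print(pearson_r(X, Y))                 # ~0.603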


Convenient Form of Covariance Formula

From the definition:

\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

Expanding:

\text{Cov}(X, Y) = \frac{1}{n} \left[ \sum x_i y_i - \bar{x} \sum y_i - \bar{y} \sum x_i + n \bar{x}\bar{y} \right]

Since \bar{x} = \frac{1}{n}\sum x_i and \bar{y} = \frac{1}{n}\sum y_i, this simplifies to:

\text{Cov}(X, Y) = \frac{1}{n} \sum x_i y_i - \bar{x}\bar{y}


Similarly, the Variances

\sigma_X^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2
\sigma_Y^2 = \frac{1}{n} \sum (y_i - \bar{y})^2 = \frac{1}{n}\sum y_i^2 - \bar{y}^2


✅ So the final computational forms are:

\text{Cov}(X, Y) = \frac{1}{n}\sum x_i y_i - \bar{x}\bar{y}
\sigma_X^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2, \quad \sigma_Y^2 = \frac{1}{n}\sum y_i^2 - \bar{y}^2
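These shortcut forms are easy to check numerically. A quick NumPy sketch (same heights data as the worked example below; np.var and np.cov with ddof=0 use the same 1/n convention as the text):

import numpy as np

x = np.array([65, 66, 67, 67, 68, 69, 70, 72], dtype=float)
y = np.array([67, 68, 65, 68, 72, 72, 69, 71], dtype=float)
n = len(x)

# Shortcut forms from the text
cov_xy = (x * y).sum() / n - x.mean() * y.mean()
var_x = (x ** 2).sum() / n - x.mean() ** 2
var_y = (y ** 2).sum() / n - y.mean() ** 2

# NumPy equivalents with the population (1/n) convention
assert np.isclose(cov_xy, np.cov(x, y, ddof=0)[0, 1])
assert np.isclose(var_x, np.var(x))    # np.var uses ddof=0 by default
assert np.isclose(var_y, np.var(y))
print(cov_xy, var_x, var_y)            # 3.0 4.5 5.5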


Limits of Correlation Coefficient

Using the Cauchy–Schwarz inequality, it can be shown that:

-1 \; \leq \; r(X, Y) \; \leq \; +1

  • If r = +1: perfect positive correlation.

  • If r = -1: perfect negative correlation.

  • If r = 0: no linear correlation.

Remarks on Karl Pearson’s Correlation Coefficient

The coefficient r(X, Y) provides a measure of the linear relationship between X and Y. For nonlinear relationships, however, it is not a suitable measure.

Sometimes, the covariance is denoted as:

\text{Cov}(X, Y) = \sigma_{XY}

Karl Pearson’s correlation coefficient is also called the product-moment correlation coefficient, since it is based on the expected product of deviations of the two variables from their respective means:

\text{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big]

Thus,

r(X, Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}

Example

Calculate the correlation coefficient for the following heights (in inches) of fathers (X) and their sons (Y):

Data

Fathers’ heights X: 65, 66, 67, 67, 68, 69, 70, 72
Sons’ heights Y: 67, 68, 65, 68, 72, 72, 69, 71

Useful totals

\begin{aligned} n&=8, \quad \sum X=544, \quad \sum Y=552,\\ \sum X^2&=37028, \quad \sum Y^2=38132, \quad \sum XY=37560. \end{aligned}

(So \bar X=\tfrac{544}{8}=68 and \bar Y=\tfrac{552}{8}=69.)



Pearson correlation (computational form)

r=\frac{n\sum XY-(\sum X)(\sum Y)}{\sqrt{\big(n\sum X^2-(\sum X)^2\big)\big(n\sum Y^2-(\sum Y)^2\big)}}.

Plug in the numbers (showing each step):

  • Numerator:

n\sum XY-(\sum X)(\sum Y) = 8\cdot 37560 - 544\cdot 552 = 300480 - 300288 = 192.
  • Denominator parts:

\begin{aligned} n\sum X^2-(\sum X)^2 &= 8\cdot 37028 - 544^2 = 296224 - 295936 = 288,\\ n\sum Y^2-(\sum Y)^2 &= 8\cdot 38132 - 552^2 = 305056 - 304704 = 352. \end{aligned}

So

\sqrt{288\cdot 352}=\sqrt{101376}\approx 318.396.
  • Therefore

r=\frac{192}{318.396}\approx 0.603.

Conclusion

\boxed{r \approx 0.603}

This indicates a moderate positive linear correlation between fathers’ and sons’ heights.

If you use the sample version with n-1 in the covariance and variances, those n-1 factors cancel, so you get the same computational formula for r.

Coding/assumed-means trick (useful with large numbers): set

u = \frac{x - A}{h}, \qquad v = \frac{y - B}{k}.

Then

r=\frac{n\sum uv-(\sum u)(\sum v)}{\sqrt{\big(n\sum u^2-(\sum u)^2\big)\big(n\sum v^2-(\sum v)^2\big)}},

which is numerically stable and faster; the shift and positive scale factors cancel in r. Here A and B are assumed (working) means, and h and k are positive scale constants.
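A quick Python check of this invariance; the values of A, B, h and k below are arbitrary choices (any working means and positive scales give the same r):

import numpy as np

x = np.array([65, 66, 67, 67, 68, 69, 70, 72], dtype=float)
y = np.array([67, 68, 65, 68, 72, 72, 69, 71], dtype=float)

A, h = 68, 2     # assumed mean and positive scale for x
B, k = 69, 4     # assumed mean and positive scale for y
u = (x - A) / h
v = (y - B) / k

r_xy = np.corrcoef(x, y)[0, 1]
r_uv = np.corrcoef(u, v)[0, 1]
assert np.isclose(r_xy, r_uv)    # shift and positive scale cancel in r
print(r_xy, r_uv)                # both ~0.603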



For the heights example above, the sums of deviations are \sum(x_i-\bar x)(y_i-\bar y)=24, \sum(x_i-\bar x)^2=36 and \sum(y_i-\bar y)^2=44 (the 1/n factors cancel), so

r(X,Y)=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y} =\frac{24}{\sqrt{36}\cdot\sqrt{44}}=\frac{24}{39.7995}\approx 0.603.

For the standardised variables U=(X-\bar X)/\sigma_X and V=(Y-\bar Y)/\sigma_Y:

\operatorname{Cov}(U,V)=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y}\approx 0.603,

and since \sigma_U=\sigma_V=1, we get r(U,V)=\operatorname{Cov}(U,V)\approx 0.603, which matches r(X,Y).
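A short NumPy sketch of this standardisation, confirming Cov(U, V) = r(X, Y) for the heights data:

import numpy as np

x = np.array([65, 66, 67, 67, 68, 69, 70, 72], dtype=float)
y = np.array([67, 68, 65, 68, 72, 72, 69, 71], dtype=float)

# Standardise: U = (X - mean)/sigma, V = (Y - mean)/sigma (population sigmas)
u = (x - x.mean()) / x.std()     # np.std uses ddof=0 by default
v = (y - y.mean()) / y.std()

cov_uv = (u * v).mean()          # Cov(U, V), since U and V have zero mean
print(cov_uv)                    # ~0.603 = r(X, Y)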

Example Problem

A computer, while calculating the correlation coefficient between two variables X and Y from 25 pairs of observations, obtained the following results:

n = 25,\quad \sum X = 125,\quad \sum X^2 = 650,\quad \sum Y = 100,\quad \sum Y^2 = 460,\quad \sum XY = 508

However, during checking, it was discovered that two pairs of values had been copied incorrectly. The pairs were entered as

(6, 14),\ (8, 6)

while the correct values were

(8, 12),\ (6, 8).

Obtain the correct value of the correlation coefficient.

Given (initial)

n = 25
\sum X = 125, \quad \sum X^2 = 650
\sum Y = 100, \quad \sum Y^2 = 460
\sum XY = 508

After subtracting the wrongly entered pairs and adding the correct ones, the corrected totals become:

Corrected totals (after fixing the mis-copied pairs)

\sum X = 125, \qquad \sum X^2 = 650 \quad(\text{unchanged})
\sum Y = 100, \qquad \sum Y^2 = 436 \quad(\text{changed})
\sum XY = 520 \quad(\text{changed})

(You can verify these by subtracting the wrong contributions and adding the correct ones.)

Compute means

\bar X=\frac{\sum X}{n}=\frac{125}{25}=5,\qquad \bar Y=\frac{\sum Y}{n}=\frac{100}{25}=4.

Use Pearson’s computational formula

r=\frac{n\sum XY-(\sum X)(\sum Y)}{\sqrt{\big(n\sum X^2-(\sum X)^2\big)\big(n\sum Y^2-(\sum Y)^2\big)}}.

Plug in the corrected numbers:

  • Numerator:

25\cdot 520 - 125\cdot 100 = 13000 - 12500 = 500.

  • Denominator parts:

25\cdot 650 - 125^2 = 16250 - 15625 = 625,
25\cdot 436 - 100^2 = 10900 - 10000 = 900.

So the denominator is \sqrt{625\cdot 900}=\sqrt{562500}=750.

  • Therefore

r=\frac{500}{750}=\frac{2}{3}\approx 0.6667\ (\approx 0.67).

Final answer

\boxed{r \approx 0.667}
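The correction can also be scripted. A minimal Python sketch that removes the wrongly copied pairs, adds the correct ones, and recomputes r:

from math import sqrt

n = 25
Sx, Sx2, Sy, Sy2, Sxy = 125, 650, 100, 460, 508   # reported (incorrect) totals

wrong = [(6, 14), (8, 6)]      # pairs as entered
correct = [(8, 12), (6, 8)]    # pairs as they should have been

for x, y in wrong:             # subtract the wrong contributions
    Sx -= x; Sx2 -= x * x; Sy -= y; Sy2 -= y * y; Sxy -= x * y
for x, y in correct:           # add the correct contributions
    Sx += x; Sx2 += x * x; Sy += y; Sy2 += y * y; Sxy += x * y

print(Sx, Sx2, Sy, Sy2, Sxy)   # 125 650 100 436 520

r = (n * Sxy - Sx * Sy) / sqrt((n * Sx2 - Sx ** 2) * (n * Sy2 - Sy ** 2))
print(r)                       # 0.666... = 2/3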

The whole set of cell frequencies then defines a bivariate frequency distribution. The column totals and row totals give the marginal distributions of X and Y. A particular row or column gives the conditional distribution of Y for given X, or of X for given Y, respectively.

Suppose bivariate data on X and Y are presented in a two-way frequency table with m classes of Y along the horizontal direction and n classes of X along the vertical direction. Let f_{ij} be the frequency of observations in the cell at row i and column j (the (i, j)-th cell).

The row-sum (the sum of frequencies in row i) is

r_i=\sum_{j=1}^m f_{ij},

and the column-sum (the sum of frequencies in column j) is

c_j=\sum_{i=1}^n f_{ij}.

The total number of observations is

N=\sum_{i=1}^n\sum_{j=1}^m f_{ij}=\sum_{i=1}^n r_i=\sum_{j=1}^m c_j.

A row (fixed i) gives the conditional distribution of Y given the i-th class of X; similarly, a column (fixed j) gives the conditional distribution of X given the j-th class of Y.


Notation for grouped (class) midpoints

Let the midpoint of the i-th X-class be x_i (i = 1, \dots, n) and the midpoint of the j-th Y-class be y_j (j = 1, \dots, m). Then treat each cell as representing f_{ij} observations at the point (x_i, y_j).


Marginal (grouped) distributions and sample totals

\begin{aligned} \Sigma f\,x &= \sum_{i=1}^n\sum_{j=1}^m f_{ij}\,x_i=\sum_{i=1}^n r_i x_i,\\[4pt] \Sigma f\,y &= \sum_{i=1}^n\sum_{j=1}^m f_{ij}\,y_j=\sum_{j=1}^m c_j y_j. \end{aligned}

Sample means (grouped)

\bar x = \frac{1}{N}\sum_{i=1}^n\sum_{j=1}^m f_{ij}x_i = \frac{1}{N}\sum_{i=1}^n r_i x_i, \qquad \bar y = \frac{1}{N}\sum_{i=1}^n\sum_{j=1}^m f_{ij}y_j = \frac{1}{N}\sum_{j=1}^m c_j y_j.


Grouped second moments, variances and covariance

\begin{aligned} \Sigma f\,x^2 &= \sum_{i=1}^n\sum_{j=1}^m f_{ij} x_i^2 = \sum_{i=1}^n r_i x_i^2,\\[4pt] \Sigma f\,y^2 &= \sum_{i=1}^n\sum_{j=1}^m f_{ij} y_j^2 = \sum_{j=1}^m c_j y_j^2,\\[4pt] \Sigma f\,xy &= \sum_{i=1}^n\sum_{j=1}^m f_{ij} x_i y_j. \end{aligned}

Grouped variances (population form, using 1/N):

\sigma_X^2=\frac{1}{N}\Big(\Sigma f\,x^2 - \frac{(\Sigma f\,x)^2}{N}\Big),\qquad \sigma_Y^2=\frac{1}{N}\Big(\Sigma f\,y^2 - \frac{(\Sigma f\,y)^2}{N}\Big).

Grouped covariance:

\operatorname{Cov}_g(X,Y)=\frac{1}{N}\Big(\Sigma f\,xy - \frac{(\Sigma f\,x)(\Sigma f\,y)}{N}\Big).

(If you prefer the “sample” denominator N-1, replace 1/N by 1/(N-1) consistently; the correlation formula below is unchanged because it uses ratios.)


Correlation coefficient (grouped)

r \;=\; \frac{\operatorname{Cov}_g(X,Y)}{\sigma_X\,\sigma_Y} \;=\; \frac{\,N\Sigma f\,xy - (\Sigma f\,x)(\Sigma f\,y)\,}{\sqrt{\big(N\Sigma f\,x^2 - (\Sigma f\,x)^2\big)\,\big(N\Sigma f\,y^2 - (\Sigma f\,y)^2\big)}}.

This r lies in [-1, 1]. Its sign indicates the direction of the linear association; its magnitude indicates its strength.
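A Python sketch of this grouped formula; the class midpoints and cell frequencies below are invented purely for illustration:

import numpy as np

# Hypothetical two-way table: rows = X-class midpoints, columns = Y-class midpoints
x_mid = np.array([10.0, 20.0, 30.0])       # n = 3 classes of X
y_mid = np.array([5.0, 15.0])              # m = 2 classes of Y
f = np.array([[4, 1],                      # f[i, j] = frequency in cell (i, j)
              [2, 3],
              [1, 4]], dtype=float)

N = f.sum()
r_i = f.sum(axis=1)                        # row sums
c_j = f.sum(axis=0)                        # column sums

Sfx = (r_i * x_mid).sum()
Sfy = (c_j * y_mid).sum()
Sfx2 = (r_i * x_mid ** 2).sum()
Sfy2 = (c_j * y_mid ** 2).sum()
Sfxy = (f * np.outer(x_mid, y_mid)).sum()  # sum of f_ij * x_i * y_j

r = (N * Sfxy - Sfx * Sfy) / np.sqrt((N * Sfx2 - Sfx ** 2) * (N * Sfy2 - Sfy ** 2))
print(r)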


Conditional distributions (for clarity)

Conditional probability (grouped) of Y = y_j given X in class i:

P(Y=y_j \mid X\ \text{in class }i)=\frac{f_{ij}}{r_i}\quad\text{(provided }r_i>0\text{).}

Similarly

P(X=x_i \mid Y\ \text{in class }j)=\frac{f_{ij}}{c_j}\quad\text{(provided }c_j>0\text{).}

Example:


A joint distribution table is given. Find the correlation coefficient between X and Y.

\begin{array}{c|c|c|c} & X=-1 & X=+1 & g(y) \\ \hline Y=0 & \tfrac{1}{8} & \tfrac{3}{8} & \tfrac{4}{8} \\ Y=1 & \tfrac{2}{8} & \tfrac{2}{8} & \tfrac{4}{8} \\ \hline p(x) & \tfrac{3}{8} & \tfrac{5}{8} & 1 \end{array}

Expectations

E(X)

E(X)=(-1)\cdot\tfrac{3}{8}+(+1)\cdot\tfrac{5}{8} = \frac{-3+5}{8} = \frac{2}{8}=\tfrac{1}{4}.

E(Y)

E(Y)=0\cdot\tfrac{4}{8}+1\cdot\tfrac{4}{8} = \tfrac{4}{8}=\tfrac{1}{2}.

Second moments

E(X^2)

Since X = \pm 1, X^2 = 1 always:

E(X^2) = 1^2 \cdot 1 = 1.

E(Y^2)

E(Y^2)=0^2\cdot\tfrac{4}{8}+1^2\cdot\tfrac{4}{8}=\tfrac{4}{8}=\tfrac{1}{2}.

Variances

\operatorname{Var}(X)=E(X^2)-[E(X)]^2=1-\Big(\tfrac{1}{4}\Big)^2=1-\tfrac{1}{16}=\tfrac{15}{16}.
\operatorname{Var}(Y)=E(Y^2)-[E(Y)]^2=\tfrac{1}{2}-\Big(\tfrac{1}{2}\Big)^2=\tfrac{1}{2}-\tfrac{1}{4}=\tfrac{1}{4}.

Covariance

E(XY)=(-1)(0)\cdot\tfrac{1}{8}+(+1)(0)\cdot\tfrac{3}{8}+(-1)(1)\cdot\tfrac{2}{8}+(+1)(1)\cdot\tfrac{2}{8}.

Simplify:

E(XY)=0+0-\tfrac{2}{8}+\tfrac{2}{8}=0.

So

\operatorname{Cov}(X,Y)=E(XY)-E(X)E(Y)=0-\Big(\tfrac{1}{4}\cdot\tfrac{1}{2}\Big)=-\tfrac{1}{8}.

Correlation coefficient

r=\frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}} =\frac{-\tfrac{1}{8}}{\sqrt{\tfrac{15}{16}\cdot\tfrac{1}{4}}}.

Denominator:

\sqrt{\tfrac{15}{16}\cdot\tfrac{1}{4}}=\sqrt{\tfrac{15}{64}}=\frac{\sqrt{15}}{8}.

So

r=\frac{-1/8}{\sqrt{15}/8}=-\frac{1}{\sqrt{15}}\approx -0.258.

Final Answer:
The correlation coefficient between X and Y is

\boxed{r=-\tfrac{1}{\sqrt{15}} \approx -0.258}

which indicates a weak negative correlation.
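The same answer can be obtained directly from the joint probability table; a NumPy sketch:

import numpy as np

# Joint pmf: rows are Y in {0, 1}, columns are X in {-1, +1}
x_vals = np.array([-1.0, 1.0])
y_vals = np.array([0.0, 1.0])
p = np.array([[1/8, 3/8],
              [2/8, 2/8]])                  # p[i, j] = P(Y = y_i, X = x_j)

px = p.sum(axis=0)                          # marginal of X: [3/8, 5/8]
py = p.sum(axis=1)                          # marginal of Y: [4/8, 4/8]

EX = (px * x_vals).sum()                    # 1/4
EY = (py * y_vals).sum()                    # 1/2
EXY = (p * np.outer(y_vals, x_vals)).sum()  # 0
VarX = (px * x_vals ** 2).sum() - EX ** 2   # 15/16
VarY = (py * y_vals ** 2).sum() - EY ** 2   # 1/4

r = (EXY - EX * EY) / np.sqrt(VarX * VarY)
print(r)                                    # -1/sqrt(15) ~ -0.258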

Spearman’s Rank Correlation 


1. Setup

We have n individuals ranked in two characteristics A and B.

  • Rank of the i-th individual in A: X_i

  • Rank of the i-th individual in B: Y_i

Since they are ranks:

X_i, Y_i \in \{1, 2, 3, \dots, n\}.

Let:

  • Mean of ranks:

    \bar{X} = \bar{Y} = \frac{1+2+3+\dots+n}{n}=\frac{n+1}{2}.
  • Variance of ranks:

    \sigma_X^2 = \sigma_Y^2 = \frac{1^2+2^2+\dots+n^2}{n} - \Big(\frac{n+1}{2}\Big)^2.

Recall the formula:

1^2+2^2+\dots+n^2 = \frac{n(n+1)(2n+1)}{6}.

So:

\sigma_X^2 = \frac{n(n+1)(2n+1)}{6n} - \frac{(n+1)^2}{4} = \frac{n^2-1}{12}.

Thus,

\sigma_X^2 = \sigma_Y^2 = \frac{n^2-1}{12}.
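A quick numerical check of this rank-variance result (np.var uses the same 1/n convention as here):

import numpy as np

for n in (5, 8, 10, 100):
    ranks = np.arange(1, n + 1)            # the ranks 1, 2, ..., n
    assert np.isclose(np.var(ranks), (n ** 2 - 1) / 12)
print("Var(ranks 1..n) = (n^2 - 1)/12")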


2. Differences of ranks

Define:

d_i = X_i - Y_i.

Clearly:

\sum_{i=1}^n d_i^2 = \sum (X_i - Y_i)^2.


3. Relation with covariance

Expanding, and using the fact that \bar{X}=\bar{Y} (so X_i - Y_i = (X_i-\bar{X})-(Y_i-\bar{Y})):

\sum d_i^2 = \sum (X_i - Y_i)^2 = \sum (X_i-\bar{X})^2 + \sum (Y_i-\bar{Y})^2 - 2\sum (X_i-\bar{X})(Y_i-\bar{Y}).

Divide by n:

\frac{1}{n}\sum d_i^2 = \sigma_X^2 + \sigma_Y^2 - 2\,\text{Cov}(X,Y).

Since \sigma_X^2=\sigma_Y^2 and \text{Cov}(X,Y) = \rho\,\sigma_X\sigma_Y = \rho\,\sigma_X^2,

\frac{1}{n}\sum d_i^2 = 2\sigma_X^2 - 2\rho\,\sigma_X^2,

where \rho is the rank correlation coefficient.

So:

\frac{1}{n}\sum d_i^2 = 2\sigma_X^2(1-\rho).


4. Substituting variance

We already know:

\sigma_X^2 = \frac{n^2-1}{12}.

Thus:

\frac{1}{n}\sum d_i^2 = 2\cdot \frac{n^2-1}{12}\,(1-\rho).

Simplify:

\frac{1}{n}\sum d_i^2 = \frac{n^2-1}{6}\,(1-\rho).


5. Final formula

Rearranging for \rho:

1-\rho = \frac{6}{n(n^2-1)}\sum d_i^2,
\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}.


This is Spearman’s Rank Correlation formula:

\boxed{r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}}

where

  • d_i = X_i - Y_i is the difference between the two ranks of the i-th individual,

  • n is the number of individuals.
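A minimal Python sketch of the formula (it assumes untied ranks, as in the derivation above):

def spearman_rho(rank_x, rank_y):
    # Spearman's rank correlation from two lists of (untied) ranks
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3, 4], [1, 2, 3, 4]))   #  1.0 (identical rankings)
print(spearman_rho([1, 2, 3, 4], [4, 3, 2, 1]))   # -1.0 (reversed rankings)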

Spearman’s coefficient is simply Pearson’s correlation applied to ranks rather than to the actual values. A small worked example:


Example

Suppose we have marks of 5 students in Math and Science:

Student   Math (X)   Science (Y)
A         85         93
B         70         65
C         90         89
D         60         60
E         75         80

Step 1: Assign ranks

Rank each score (highest = 1).

Student   Math (X)   Rank X   Science (Y)   Rank Y
A         85         2        93            1
B         70         4        65            4
C         90         1        89            2
D         60         5        60            5
E         75         3        80            3

Step 2: Find differences of ranks

d = \text{Rank X} - \text{Rank Y}, and compute d^2.

Student   Rank X   Rank Y   d    d^2
A         2        1        1    1
B         4        4        0    0
C         1        2        -1   1
D         5        5        0    0
E         3        3        0    0

\sum d^2 = 2

Step 3: Apply Spearman’s formula

r_s = 1 - \frac{6\sum d^2}{n(n^2-1)}

Here n = 5 and \sum d^2 = 2:

r_s = 1 - \frac{6(2)}{5(25-1)} = 1 - \frac{12}{120} = 1 - 0.1 = 0.9

Spearman’s rank correlation coefficient = 0.9
This shows a strong positive correlation between Math and Science marks.
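If SciPy is available, scipy.stats.spearmanr reproduces this value directly from the raw marks (it assigns the ranks internally):

from scipy.stats import spearmanr

math_marks = [85, 70, 90, 60, 75]
science_marks = [93, 65, 89, 60, 80]

rho, p_value = spearmanr(math_marks, science_marks)
print(rho)   # 0.9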

Example

Ten competitors in a musical test were ranked by three judges A, B and C in the following order:

Ranks by A: 1 6 5 10 3 2 4 9 7 8

Ranks by B: 3 5 8 4 7 10 2 1 6 9

Ranks by C: 2 4 9 8 1 3 10 5 7 6

Using the rank correlation method, discuss which pair of judges has the nearest approach to common likings in music.

Step 1: Write down the rankings

Competitor   A    B    C
1            1    3    2
2            6    5    4
3            5    8    9
4            10   4    8
5            3    7    1
6            2    10   3
7            4    2    10
8            9    1    5
9            7    6    7
10           8    9    6

Step 2: Compare A & B

Compute d = A - B and d^2.

Competitor   A    B    d    d^2
1            1    3    -2   4
2            6    5    1    1
3            5    8    -3   9
4            10   4    6    36
5            3    7    -4   16
6            2    10   -8   64
7            4    2    2    4
8            9    1    8    64
9            7    6    1    1
10           8    9    -1   1

\sum d^2 = 200
r_{AB} = 1 - \frac{6 \cdot 200}{10(10^2-1)} = 1 - \frac{1200}{990} = 1 - 1.212 = -0.212

So A & B have slight negative correlation.


Step 3: Compare A & C

Competitor   A    C    d    d^2
1            1    2    -1   1
2            6    4    2    4
3            5    9    -4   16
4            10   8    2    4
5            3    1    2    4
6            2    3    -1   1
7            4    10   -6   36
8            9    5    4    16
9            7    7    0    0
10           8    6    2    4

\sum d^2 = 86
r_{AC} = 1 - \frac{6 \cdot 86}{10(99)} = 1 - \frac{516}{990} = 1 - 0.521 = 0.479

So A & C have moderate positive correlation.


Step 4: Compare B & C

Competitor   B    C    d    d^2
1            3    2    1    1
2            5    4    1    1
3            8    9    -1   1
4            4    8    -4   16
5            7    1    6    36
6            10   3    7    49
7            2    10   -8   64
8            1    5    -4   16
9            6    7    -1   1
10           9    6    3    9

\sum d^2 = 194
r_{BC} = 1 - \frac{6 \cdot 194}{10(99)} = 1 - \frac{1164}{990} = 1 - 1.176 = -0.176

So B & C have slight negative correlation.


✅ Conclusion

  • r_{AB} = -0.212 (slight negative)

  • r_{AC} = 0.479 (moderate positive)

  • r_{BC} = -0.176 (slight negative)

👉 Judges A and C have the nearest approach to common likings in music.
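A short Python check of all three coefficients using the rank-difference formula (the spearman_rho helper from the earlier sketch is repeated here so the snippet is self-contained):

def spearman_rho(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

A = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
B = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
C = [2, 4, 9, 8, 1, 3, 10, 5, 7, 6]

for name, (p, q) in [("A & B", (A, B)), ("A & C", (A, C)), ("B & C", (B, C))]:
    print(name, round(spearman_rho(p, q), 3))
# A & B -0.212 | A & C 0.479 | B & C -0.176  ->  A and C agree most closely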

Remarks on Spearman’s Rank Correlation Coefficient

  1. Check for Numerical Accuracy

    • In calculations, the sum of rank differences should be zero:

      \sum d = \sum (x_i - y_i) = 0

      This provides a simple check for errors in numerical work.

  2. Relation with Pearson’s Correlation

    • Spearman’s rank correlation coefficient (ρ) is essentially Pearson’s correlation applied to ranks instead of actual data.

    • Therefore, it is interpreted in the same way as Karl Pearson’s correlation coefficient.

  3. Distribution-Free (Non-Parametric)

    • Pearson’s correlation assumes that the population is normally distributed.

    • When this assumption is not valid, we need a distribution-free measure, which does not depend on any population parameters.

    • Spearman’s ρ is such a measure, making it useful in non-parametric situations.

  4. Simplicity and Information Loss

    • Spearman’s formula is easier to understand and apply compared to Pearson’s formula.

    • However, using ranks instead of raw data results in loss of information.

    • Unless there are many tied ranks, Spearman’s coefficient is usually slightly lower than Pearson’s coefficient.

  5. Use with Qualitative Data and Extreme Observations

    • Spearman’s correlation is the only suitable method when dealing with qualitative characteristics (e.g., taste, preference, intelligence level) that cannot be measured numerically but can be ordered.

    • It can also be used when actual quantitative data are available.

    • When data include extreme observations (outliers), Spearman’s formula is often preferred over Pearson’s because it is less sensitive to extremes.

  6. Limitations

    • Spearman’s method is not practical for bivariate frequency distributions (correlation tables).

    • For large samples (n > 30), it is computationally heavy if ranks are not directly given. In such cases, Pearson’s formula is preferred unless ranking is necessary.


In short:

  • Spearman’s ρ is simple, non-parametric, and works well with ranks or qualitative data.

  • It is less accurate than Pearson’s in terms of information retention but more robust when assumptions (like normality) are not satisfied.


Python Code

import numpy as np
import pandas as pd

# Sample data: study hours vs exam scores of students
study_hours = [2, 3, 4, 5, 6, 7, 8, 9]
exam_scores = [50, 55, 60, 65, 70, 75, 80, 85]

# Convert to pandas DataFrame for easy handling
data = pd.DataFrame({
    "Study Hours": study_hours,
    "Exam Scores": exam_scores
})

print("Dataset:")
print(data)

# Mean (average) study hours and exam scores
mean_hours = np.mean(study_hours)
mean_scores = np.mean(exam_scores)
print("\nMean Study Hours:", mean_hours)
print("Mean Exam Scores:", mean_scores)

# Variance (spread of data); np.var defaults to the population (1/n) form
var_hours = np.var(study_hours)
var_scores = np.var(exam_scores)
print("\nVariance of Study Hours:", var_hours)
print("Variance of Exam Scores:", var_scores)

# Correlation (relationship between study hours and scores)
correlation = np.corrcoef(study_hours, exam_scores)[0, 1]
print("\nCorrelation between Study Hours and Exam Scores:", correlation)
