Bivariate Distribution and Correlation

So far, we have confined ourselves to univariate distributions, i.e., distributions involving only one variable. However, in many real-life situations, we encounter cases where each observation is associated with two or more variables.

For example, if we measure the heights and weights of a group of persons, each observation consists of two related values—one for height and one for weight. Such a distribution involving two variables is called a bivariate distribution.

In a bivariate distribution, we are often interested in examining whether there exists any correlation or covariation between the two variables under study.

  • If a change in one variable is accompanied by a change in the other variable, the two variables are said to be correlated.

  • If the two variables tend to deviate in the same direction (i.e., an increase in one variable corresponds to an increase in the other, or a decrease corresponds to a decrease), the correlation is called direct or positive correlation.

  • On the other hand, if the two variables deviate in opposite directions (i.e., an increase in one variable corresponds to a decrease in the other, and vice versa), the correlation is called inverse or negative correlation.

Examples:

  • Positive correlation: (i) Heights and weights of individuals, (ii) Income and expenditure.

  • Negative correlation: (i) Price and demand of a commodity, (ii) Volume and pressure of a perfect gas.

Finally, correlation is said to be perfect if the deviation in one variable is always accompanied by a proportional and exact deviation in the other variable.

Scatter Diagram

A scatter diagram is the simplest method of representing bivariate data diagrammatically.

For a bivariate distribution (x_i, y_i),\; i = 1, 2, \dots, n, if the values of the variables X and Y are plotted along the horizontal axis (x-axis) and vertical axis (y-axis), respectively, in the Cartesian plane, the resulting diagram of points is known as a scatter diagram.

From a scatter diagram, we can form a fairly good—though approximate—idea about whether the two variables are correlated:

  • If the points lie close together, clustering about a line, we expect a high degree of correlation.

  • If the points are widely scattered, the correlation is expected to be weak.

However, this method is not very suitable when the number of observations is very large, as the diagram becomes too crowded to interpret effectively.
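In Python, a scatter diagram can be drawn with matplotlib; below is a minimal sketch using a small hypothetical height and weight data set (the numbers are invented purely for illustration):

import matplotlib.pyplot as plt

# Hypothetical bivariate data: heights (X) and weights (Y) of 8 persons
heights = [160, 162, 165, 167, 170, 172, 175, 178]   # cm
weights = [55, 58, 61, 63, 66, 70, 72, 76]           # kg

plt.scatter(heights, weights)       # one point per (x_i, y_i) pair
plt.xlabel("Height (cm)")           # X along the horizontal axis
plt.ylabel("Weight (kg)")           # Y along the vertical axis
plt.title("Scatter diagram")
plt.show()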

Karl Pearson’s Coefficient of Correlation

As a measure of the intensity or degree of linear relationship between two variables, Karl Pearson (1857–1936), a British biometrician, developed a formula known as the correlation coefficient.

The correlation coefficient between two random variables X and Y, usually denoted by r(X, Y) or simply r_{xy}, is a numerical measure of the linear relationship between them and is defined as:

r(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}

where:

  • \text{Cov}(X, Y) = covariance between X and Y

  • \sigma_X = standard deviation of X

  • \sigma_Y = standard deviation of Y



Let (x_i, y_i),\; i = 1, 2, \dots, n represent a bivariate distribution.

Covariance

The covariance between XX and YY is defined as:

\text{Cov}(X, Y) = E\Big[(X - E(X))(Y - E(Y))\Big]

For a sample, it is given by:

\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

where \bar{x} and \bar{y} are the sample means of X and Y.

Similarly, the variances are:

\sigma_X^2 = E[(X - E(X))^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\sigma_Y^2 = E[(Y - E(Y))^2] = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2

Karl Pearson’s Formula

The correlation coefficient between X and Y is defined as:

r(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}

This is also called the product-moment correlation coefficient because

\text{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big]

Alternative (Computational) Formula

For practical calculation, an equivalent formula is:

r(X, Y) = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\big(n \sum x_i^2 - (\sum x_i)^2\big)\big(n \sum y_i^2 - (\sum y_i)^2\big)}}

This avoids computing deviations separately and is useful for tabulated data.
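As an illustration, here is a small Python sketch of this computational formula; the data are the fathers' and sons' heights used in the worked example later in this section:

from math import sqrt

def pearson_r(x, y):
    # Computational form: no deviations from the means are needed
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    syy = sum(yi * yi for yi in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    num = n * sxy - sx * sy
    den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

X = [65, 66, 67, 67, 68, 69, 70, 72]   # fathers' heights
Y = [67, 68, 65, 68, 72, 72, 69, 71]   # sons' heights
print(pearson_r(X, Y))                 # ~0.603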


Convenient Form of Covariance Formula

From the definition:

\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

Expanding:

\text{Cov}(X, Y) = \frac{1}{n} \left[ \sum x_i y_i - \bar{x} \sum y_i - \bar{y} \sum x_i + n \bar{x}\bar{y} \right]

Since \bar{x} = \frac{1}{n}\sum x_i and \bar{y} = \frac{1}{n}\sum y_i, this simplifies to:

\text{Cov}(X, Y) = \frac{1}{n} \sum x_i y_i - \bar{x}\bar{y}


Similarly, the Variances

\sigma_X^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2
\sigma_Y^2 = \frac{1}{n} \sum (y_i - \bar{y})^2 = \frac{1}{n}\sum y_i^2 - \bar{y}^2


✅ So the final computational forms are:

\text{Cov}(X, Y) = \frac{1}{n}\sum x_i y_i - \bar{x}\bar{y}
\sigma_X^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2, \quad \sigma_Y^2 = \frac{1}{n}\sum y_i^2 - \bar{y}^2
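These shortcut forms are easy to check numerically. A quick NumPy sketch (same heights data as the worked example below; np.var and np.cov with ddof=0 use the same 1/n convention as the text):

import numpy as np

x = np.array([65, 66, 67, 67, 68, 69, 70, 72], dtype=float)
y = np.array([67, 68, 65, 68, 72, 72, 69, 71], dtype=float)
n = len(x)

# Shortcut forms from the text
cov_xy = (x * y).sum() / n - x.mean() * y.mean()
var_x = (x ** 2).sum() / n - x.mean() ** 2
var_y = (y ** 2).sum() / n - y.mean() ** 2

# NumPy equivalents with the population (1/n) convention
assert np.isclose(cov_xy, np.cov(x, y, ddof=0)[0, 1])
assert np.isclose(var_x, np.var(x))    # np.var uses ddof=0 by default
assert np.isclose(var_y, np.var(y))
print(cov_xy, var_x, var_y)            # 3.0 4.5 5.5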


Limits of Correlation Coefficient

Using the Cauchy–Schwarz inequality, it can be shown that:

-1 \; \leq \; r(X, Y) \; \leq \; +1

  • If r = +1: perfect positive correlation.

  • If r = -1: perfect negative correlation.

  • If r = 0: no linear correlation.

Remarks on Karl Pearson’s Correlation Coefficient

The coefficient r(X, Y) provides a measure of the linear relationship between X and Y. For nonlinear relationships, however, it is not a suitable measure.

Sometimes, the covariance is denoted as:

\text{Cov}(X, Y) = \sigma_{XY}

Karl Pearson’s correlation coefficient is also called the product-moment correlation coefficient, since it is based on the expected product of deviations of the two variables from their respective means:

\text{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big]

Thus,

r(X, Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}

Example

Calculate the correlation coefficient for the following heights (in inches) of fathers (X) and their sons (Y):

Data

Fathers’ heights X: 65, 66, 67, 67, 68, 69, 70, 72
Sons’ heights Y: 67, 68, 65, 68, 72, 72, 69, 71

Useful totals

\begin{aligned} n&=8, \quad \sum X=544, \quad \sum Y=552,\\ \sum X^2&=37028, \quad \sum Y^2=38132, \quad \sum XY=37560. \end{aligned}

(So \bar X=\tfrac{544}{8}=68 and \bar Y=\tfrac{552}{8}=69.)



Pearson correlation (computational form)

r=\frac{n\sum XY-(\sum X)(\sum Y)}{\sqrt{\big(n\sum X^2-(\sum X)^2\big)\big(n\sum Y^2-(\sum Y)^2\big)}}.

Plug in the numbers (showing each step):

  • Numerator:

n\sum XY-(\sum X)(\sum Y) = 8\cdot 37560 - 544\cdot 552 = 300480 - 300288 = 192.
  • Denominator parts:

\begin{aligned} n\sum X^2-(\sum X)^2 &= 8\cdot 37028 - 544^2 = 296224 - 295936 = 288,\\ n\sum Y^2-(\sum Y)^2 &= 8\cdot 38132 - 552^2 = 305056 - 304704 = 352. \end{aligned}

So

\sqrt{288\cdot 352}=\sqrt{101376}\approx 318.396.
  • Therefore

r=\frac{192}{318.396}\approx 0.603.

Conclusion

\boxed{r \approx 0.603}

This indicates a moderate positive linear correlation between fathers’ and sons’ heights.

If you use the sample version with n-1 in the covariance and variances, those n-1 factors cancel, so you get the same computational formula for r.

Coding/assumed-means trick (useful with large numbers): set

u = \frac{x - A}{h}, \qquad v = \frac{y - B}{k}.

Then

r=\frac{n\sum uv-(\sum u)(\sum v)}{\sqrt{\big(n\sum u^2-(\sum u)^2\big)\big(n\sum v^2-(\sum v)^2\big)}},

which is numerically stable and faster; the shift and positive scale factors cancel in r. Here A and B are assumed (working) means, and h and k are positive scale constants.
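A quick Python check of this invariance; the values of A, B, h and k below are arbitrary choices (any working means and positive scales give the same r):

import numpy as np

x = np.array([65, 66, 67, 67, 68, 69, 70, 72], dtype=float)
y = np.array([67, 68, 65, 68, 72, 72, 69, 71], dtype=float)

A, h = 68, 2     # assumed mean and positive scale for x
B, k = 69, 4     # assumed mean and positive scale for y
u = (x - A) / h
v = (y - B) / k

r_xy = np.corrcoef(x, y)[0, 1]
r_uv = np.corrcoef(u, v)[0, 1]
assert np.isclose(r_xy, r_uv)    # shift and positive scale cancel in r
print(r_xy, r_uv)                # both ~0.603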



For the heights example above, the sums of deviations are \sum(x_i-\bar x)(y_i-\bar y)=24, \sum(x_i-\bar x)^2=36 and \sum(y_i-\bar y)^2=44 (the 1/n factors cancel), so

r(X,Y)=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y} =\frac{24}{\sqrt{36}\cdot\sqrt{44}}=\frac{24}{39.7995}\approx 0.603.

For the standardised variables U=(X-\bar X)/\sigma_X and V=(Y-\bar Y)/\sigma_Y:

\operatorname{Cov}(U,V)=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y}\approx 0.603,

and since \sigma_U=\sigma_V=1, we get r(U,V)=\operatorname{Cov}(U,V)\approx 0.603, which matches r(X,Y).
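A short NumPy sketch of this standardisation, confirming Cov(U, V) = r(X, Y) for the heights data:

import numpy as np

x = np.array([65, 66, 67, 67, 68, 69, 70, 72], dtype=float)
y = np.array([67, 68, 65, 68, 72, 72, 69, 71], dtype=float)

# Standardise: U = (X - mean)/sigma, V = (Y - mean)/sigma (population sigmas)
u = (x - x.mean()) / x.std()     # np.std uses ddof=0 by default
v = (y - y.mean()) / y.std()

cov_uv = (u * v).mean()          # Cov(U, V), since U and V have zero mean
print(cov_uv)                    # ~0.603 = r(X, Y)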

Example Problem

A computer, while calculating the correlation coefficient between two variables X and Y from 25 pairs of observations, obtained the following results:

n = 25,\quad \sum X = 125,\quad \sum X^2 = 650,\quad \sum Y = 100,\quad \sum Y^2 = 460,\quad \sum XY = 508

However, during checking, it was discovered that two pairs of values had been copied incorrectly. The pairs were entered as

(6, 14),\ (8, 6)

while the correct values were

(8, 12),\ (6, 8).

Obtain the correct value of the correlation coefficient.

Given (initial)

n = 25
\sum X = 125, \quad \sum X^2 = 650
\sum Y = 100, \quad \sum Y^2 = 460
\sum XY = 508

After subtracting the wrongly entered pairs and adding the correct ones, the corrected totals become:

Corrected totals (after fixing the mis-copied pairs)

\sum X = 125, \qquad \sum X^2 = 650 \quad(\text{unchanged})
\sum Y = 100, \qquad \sum Y^2 = 436 \quad(\text{changed})
\sum XY = 520 \quad(\text{changed})

(You can verify these by subtracting the wrong contributions and adding the correct ones.)

Compute means

\bar X=\frac{\sum X}{n}=\frac{125}{25}=5,\qquad \bar Y=\frac{\sum Y}{n}=\frac{100}{25}=4.

Use Pearson’s computational formula

r=\frac{n\sum XY-(\sum X)(\sum Y)}{\sqrt{\big(n\sum X^2-(\sum X)^2\big)\big(n\sum Y^2-(\sum Y)^2\big)}}.

Plug in the corrected numbers:

  • Numerator:

25\cdot 520 - 125\cdot 100 = 13000 - 12500 = 500.

  • Denominator parts:

25\cdot 650 - 125^2 = 16250 - 15625 = 625,
25\cdot 436 - 100^2 = 10900 - 10000 = 900.

So the denominator is \sqrt{625\cdot 900}=\sqrt{562500}=750.

  • Therefore

r=\frac{500}{750}=\frac{2}{3}\approx 0.6667\ (\approx 0.67).

Final answer

\boxed{r \approx 0.667}
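The correction can also be scripted. A minimal Python sketch that removes the wrongly copied pairs, adds the correct ones, and recomputes r:

from math import sqrt

n = 25
Sx, Sx2, Sy, Sy2, Sxy = 125, 650, 100, 460, 508   # reported (incorrect) totals

wrong = [(6, 14), (8, 6)]      # pairs as entered
correct = [(8, 12), (6, 8)]    # pairs as they should have been

for x, y in wrong:             # subtract the wrong contributions
    Sx -= x; Sx2 -= x * x; Sy -= y; Sy2 -= y * y; Sxy -= x * y
for x, y in correct:           # add the correct contributions
    Sx += x; Sx2 += x * x; Sy += y; Sy2 += y * y; Sxy += x * y

print(Sx, Sx2, Sy, Sy2, Sxy)   # 125 650 100 436 520

r = (n * Sxy - Sx * Sy) / sqrt((n * Sx2 - Sx ** 2) * (n * Sy2 - Sy ** 2))
print(r)                       # 0.666... = 2/3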

The whole set of cell frequencies then defines a bivariate frequency distribution. The column totals and row totals give the marginal distributions of X and Y. A particular row or column gives the conditional distribution of Y for given X, or of X for given Y, respectively.

Suppose bivariate data on X and Y are presented in a two-way frequency table with m classes of Y along the horizontal direction and n classes of X along the vertical direction. Let f_{ij} be the frequency of observations in the cell at row i and column j (the (i, j)-th cell).

The row-sum (the sum of frequencies in row i) is

r_i=\sum_{j=1}^m f_{ij},

and the column-sum (the sum of frequencies in column j) is

c_j=\sum_{i=1}^n f_{ij}.

The total number of observations is

N=\sum_{i=1}^n\sum_{j=1}^m f_{ij}=\sum_{i=1}^n r_i=\sum_{j=1}^m c_j.

A row (fixed i) gives the conditional distribution of Y given the i-th class of X; similarly, a column (fixed j) gives the conditional distribution of X given the j-th class of Y.


Notation for grouped (class) midpoints

Let the midpoint of the i-th X-class be x_i (i = 1, \dots, n) and the midpoint of the j-th Y-class be y_j (j = 1, \dots, m). Then treat each cell as representing f_{ij} observations at the point (x_i, y_j).


Marginal (grouped) distributions and sample totals

\begin{aligned} \Sigma f\,x &= \sum_{i=1}^n\sum_{j=1}^m f_{ij}\,x_i=\sum_{i=1}^n r_i x_i,\\[4pt] \Sigma f\,y &= \sum_{i=1}^n\sum_{j=1}^m f_{ij}\,y_j=\sum_{j=1}^m c_j y_j. \end{aligned}

Sample means (grouped)

\bar x = \frac{1}{N}\sum_{i=1}^n\sum_{j=1}^m f_{ij}x_i = \frac{1}{N}\sum_{i=1}^n r_i x_i, \qquad \bar y = \frac{1}{N}\sum_{i=1}^n\sum_{j=1}^m f_{ij}y_j = \frac{1}{N}\sum_{j=1}^m c_j y_j.


Grouped second moments, variances and covariance

\begin{aligned} \Sigma f\,x^2 &= \sum_{i=1}^n\sum_{j=1}^m f_{ij} x_i^2 = \sum_{i=1}^n r_i x_i^2,\\[4pt] \Sigma f\,y^2 &= \sum_{i=1}^n\sum_{j=1}^m f_{ij} y_j^2 = \sum_{j=1}^m c_j y_j^2,\\[4pt] \Sigma f\,xy &= \sum_{i=1}^n\sum_{j=1}^m f_{ij} x_i y_j. \end{aligned}

Grouped variances (population form, using 1/N):

\sigma_X^2=\frac{1}{N}\Big(\Sigma f\,x^2 - \frac{(\Sigma f\,x)^2}{N}\Big),\qquad \sigma_Y^2=\frac{1}{N}\Big(\Sigma f\,y^2 - \frac{(\Sigma f\,y)^2}{N}\Big).

Grouped covariance:

\operatorname{Cov}_g(X,Y)=\frac{1}{N}\Big(\Sigma f\,xy - \frac{(\Sigma f\,x)(\Sigma f\,y)}{N}\Big).

(If you prefer the “sample” denominator N-1, replace 1/N by 1/(N-1) consistently; the correlation formula below is unchanged because it uses ratios.)


Correlation coefficient (grouped)

r \;=\; \frac{\operatorname{Cov}_g(X,Y)}{\sigma_X\,\sigma_Y} \;=\; \frac{\,N\Sigma f\,xy - (\Sigma f\,x)(\Sigma f\,y)\,}{\sqrt{\big(N\Sigma f\,x^2 - (\Sigma f\,x)^2\big)\,\big(N\Sigma f\,y^2 - (\Sigma f\,y)^2\big)}}.

This r lies in [-1, 1]. Its sign indicates the direction of the linear association; its magnitude indicates its strength.
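A Python sketch of this grouped formula; the class midpoints and cell frequencies below are invented purely for illustration:

import numpy as np

# Hypothetical two-way table: rows = X-class midpoints, columns = Y-class midpoints
x_mid = np.array([10.0, 20.0, 30.0])       # n = 3 classes of X
y_mid = np.array([5.0, 15.0])              # m = 2 classes of Y
f = np.array([[4, 1],                      # f[i, j] = frequency in cell (i, j)
              [2, 3],
              [1, 4]], dtype=float)

N = f.sum()
r_i = f.sum(axis=1)                        # row sums
c_j = f.sum(axis=0)                        # column sums

Sfx = (r_i * x_mid).sum()
Sfy = (c_j * y_mid).sum()
Sfx2 = (r_i * x_mid ** 2).sum()
Sfy2 = (c_j * y_mid ** 2).sum()
Sfxy = (f * np.outer(x_mid, y_mid)).sum()  # sum of f_ij * x_i * y_j

r = (N * Sfxy - Sfx * Sfy) / np.sqrt((N * Sfx2 - Sfx ** 2) * (N * Sfy2 - Sfy ** 2))
print(r)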


Conditional distributions (for clarity)

Conditional probability (grouped) of Y = y_j given X in class i:

P(Y=y_j \mid X\ \text{in class }i)=\frac{f_{ij}}{r_i}\quad\text{(provided }r_i>0\text{).}

Similarly

P(X=x_i \mid Y\ \text{in class }j)=\frac{f_{ij}}{c_j}\quad\text{(provided }c_j>0\text{).}

Example:


A joint distribution table is given. Find the correlation coefficient between X and Y.

\begin{array}{c|c|c|c} & X=-1 & X=+1 & g(y) \\ \hline Y=0 & \tfrac{1}{8} & \tfrac{3}{8} & \tfrac{4}{8} \\ Y=1 & \tfrac{2}{8} & \tfrac{2}{8} & \tfrac{4}{8} \\ \hline p(x) & \tfrac{3}{8} & \tfrac{5}{8} & 1 \end{array}

Expectations

E(X)

E(X)=(-1)\cdot\tfrac{3}{8}+(+1)\cdot\tfrac{5}{8} = \frac{-3+5}{8} = \frac{2}{8}=\tfrac{1}{4}.

E(Y)

E(Y)=0\cdot\tfrac{4}{8}+1\cdot\tfrac{4}{8} = \tfrac{4}{8}=\tfrac{1}{2}.

Second moments

E(X^2)

Since X = \pm 1, X^2 = 1 always:

E(X^2) = 1^2 \cdot 1 = 1.

E(Y^2)

E(Y^2)=0^2\cdot\tfrac{4}{8}+1^2\cdot\tfrac{4}{8}=\tfrac{4}{8}=\tfrac{1}{2}.

Variances

\operatorname{Var}(X)=E(X^2)-[E(X)]^2=1-\Big(\tfrac{1}{4}\Big)^2=1-\tfrac{1}{16}=\tfrac{15}{16}.
\operatorname{Var}(Y)=E(Y^2)-[E(Y)]^2=\tfrac{1}{2}-\Big(\tfrac{1}{2}\Big)^2=\tfrac{1}{2}-\tfrac{1}{4}=\tfrac{1}{4}.

Covariance

E(XY)=(-1)(0)\cdot\tfrac{1}{8}+(+1)(0)\cdot\tfrac{3}{8}+(-1)(1)\cdot\tfrac{2}{8}+(+1)(1)\cdot\tfrac{2}{8}.

Simplify:

E(XY)=0+0-\tfrac{2}{8}+\tfrac{2}{8}=0.

So

\operatorname{Cov}(X,Y)=E(XY)-E(X)E(Y)=0-\Big(\tfrac{1}{4}\cdot\tfrac{1}{2}\Big)=-\tfrac{1}{8}.

Correlation coefficient

r=\frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}} =\frac{-\tfrac{1}{8}}{\sqrt{\tfrac{15}{16}\cdot\tfrac{1}{4}}}.

Denominator:

\sqrt{\tfrac{15}{16}\cdot\tfrac{1}{4}}=\sqrt{\tfrac{15}{64}}=\frac{\sqrt{15}}{8}.

So

r=\frac{-1/8}{\sqrt{15}/8}=-\frac{1}{\sqrt{15}}\approx -0.258.

Final Answer:
The correlation coefficient between X and Y is

\boxed{r=-\tfrac{1}{\sqrt{15}} \approx -0.258}

which indicates a weak negative correlation.
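The same answer can be obtained directly from the joint probability table; a NumPy sketch:

import numpy as np

# Joint pmf: rows are Y in {0, 1}, columns are X in {-1, +1}
x_vals = np.array([-1.0, 1.0])
y_vals = np.array([0.0, 1.0])
p = np.array([[1/8, 3/8],
              [2/8, 2/8]])                  # p[i, j] = P(Y = y_i, X = x_j)

px = p.sum(axis=0)                          # marginal of X: [3/8, 5/8]
py = p.sum(axis=1)                          # marginal of Y: [4/8, 4/8]

EX = (px * x_vals).sum()                    # 1/4
EY = (py * y_vals).sum()                    # 1/2
EXY = (p * np.outer(y_vals, x_vals)).sum()  # 0
VarX = (px * x_vals ** 2).sum() - EX ** 2   # 15/16
VarY = (py * y_vals ** 2).sum() - EY ** 2   # 1/4

r = (EXY - EX * EY) / np.sqrt(VarX * VarY)
print(r)                                    # -1/sqrt(15) ~ -0.258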

Spearman’s Rank Correlation 


1. Setup

We have n individuals ranked in two characteristics A and B.

  • Rank of the i-th individual in A: X_i

  • Rank of the i-th individual in B: Y_i

Since they are ranks:

X_i, Y_i \in \{1, 2, 3, \dots, n\}.

Let:

  • Mean of ranks:

    \bar{X} = \bar{Y} = \frac{1+2+3+\dots+n}{n}=\frac{n+1}{2}.
  • Variance of ranks:

    \sigma_X^2 = \sigma_Y^2 = \frac{1^2+2^2+\dots+n^2}{n} - \Big(\frac{n+1}{2}\Big)^2.

Recall the formula:

1^2+2^2+\dots+n^2 = \frac{n(n+1)(2n+1)}{6}.

So:

\sigma_X^2 = \frac{n(n+1)(2n+1)}{6n} - \frac{(n+1)^2}{4} = \frac{n^2-1}{12}.

Thus,

\sigma_X^2 = \sigma_Y^2 = \frac{n^2-1}{12}.
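A quick numerical check of this rank-variance result (np.var uses the same 1/n convention as here):

import numpy as np

for n in (5, 8, 10, 100):
    ranks = np.arange(1, n + 1)            # the ranks 1, 2, ..., n
    assert np.isclose(np.var(ranks), (n ** 2 - 1) / 12)
print("Var(ranks 1..n) = (n^2 - 1)/12")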


2. Differences of ranks

Define:

d_i = X_i - Y_i.

Clearly:

\sum_{i=1}^n d_i^2 = \sum (X_i - Y_i)^2.


3. Relation with covariance

Expanding, and using the fact that \bar{X}=\bar{Y} (so X_i - Y_i = (X_i-\bar{X})-(Y_i-\bar{Y})):

\sum d_i^2 = \sum (X_i - Y_i)^2 = \sum (X_i-\bar{X})^2 + \sum (Y_i-\bar{Y})^2 - 2\sum (X_i-\bar{X})(Y_i-\bar{Y}).

Divide by n:

\frac{1}{n}\sum d_i^2 = \sigma_X^2 + \sigma_Y^2 - 2\,\text{Cov}(X,Y).

Since \sigma_X^2=\sigma_Y^2 and \text{Cov}(X,Y) = \rho\,\sigma_X\sigma_Y = \rho\,\sigma_X^2,

\frac{1}{n}\sum d_i^2 = 2\sigma_X^2 - 2\rho\,\sigma_X^2,

where \rho is the rank correlation coefficient.

So:

\frac{1}{n}\sum d_i^2 = 2\sigma_X^2(1-\rho).


4. Substituting variance

We already know:

\sigma_X^2 = \frac{n^2-1}{12}.

Thus:

\frac{1}{n}\sum d_i^2 = 2\cdot \frac{n^2-1}{12}\,(1-\rho).

Simplify:

\frac{1}{n}\sum d_i^2 = \frac{n^2-1}{6}\,(1-\rho).


5. Final formula

Rearranging for \rho:

1-\rho = \frac{6}{n(n^2-1)}\sum d_i^2,
\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}.


This is Spearman’s Rank Correlation formula:

\boxed{r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}}

where

  • d_i = X_i - Y_i is the difference between the two ranks of the i-th individual,

  • n is the number of individuals.
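A minimal Python sketch of the formula (it assumes untied ranks, as in the derivation above):

def spearman_rho(rank_x, rank_y):
    # Spearman's rank correlation from two lists of (untied) ranks
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3, 4], [1, 2, 3, 4]))   #  1.0 (identical rankings)
print(spearman_rho([1, 2, 3, 4], [4, 3, 2, 1]))   # -1.0 (reversed rankings)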

Spearman’s coefficient is simply Pearson’s correlation applied to ranks rather than to the actual values. A small worked example:


Example

Suppose we have marks of 5 students in Math and Science:

Student   Math (X)   Science (Y)
A         85         93
B         70         65
C         90         89
D         60         60
E         75         80

Step 1: Assign ranks

Rank each score (highest = 1).

Student   Math (X)   Rank X   Science (Y)   Rank Y
A         85         2        93            1
B         70         4        65            4
C         90         1        89            2
D         60         5        60            5
E         75         3        80            3

Step 2: Find differences of ranks

d = \text{Rank X} - \text{Rank Y}, and compute d^2.

Student   Rank X   Rank Y   d    d^2
A         2        1        1    1
B         4        4        0    0
C         1        2        -1   1
D         5        5        0    0
E         3        3        0    0

\sum d^2 = 2

Step 3: Apply Spearman’s formula

r_s = 1 - \frac{6\sum d^2}{n(n^2-1)}

Here n = 5 and \sum d^2 = 2:

r_s = 1 - \frac{6(2)}{5(25-1)} = 1 - \frac{12}{120} = 1 - 0.1 = 0.9

Spearman’s rank correlation coefficient = 0.9
This shows a strong positive correlation between Math and Science marks.
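If SciPy is available, scipy.stats.spearmanr reproduces this value directly from the raw marks (it assigns the ranks internally):

from scipy.stats import spearmanr

math_marks = [85, 70, 90, 60, 75]
science_marks = [93, 65, 89, 60, 80]

rho, p_value = spearmanr(math_marks, science_marks)
print(rho)   # 0.9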

Example

Ten competitors in a musical test were ranked by three judges A, B and C in the following order:

Ranks by A: 1 6 5 10 3 2 4 9 7 8

Ranks by B: 3 5 8 4 7 10 2 1 6 9

Ranks by C: 2 4 9 8 1 3 10 5 7 6

Using the rank correlation method, discuss which pair of judges has the nearest approach to common likings in music.

Step 1: Write down the rankings

Competitor   A    B    C
1            1    3    2
2            6    5    4
3            5    8    9
4            10   4    8
5            3    7    1
6            2    10   3
7            4    2    10
8            9    1    5
9            7    6    7
10           8    9    6

Step 2: Compare A & B

Compute d = A - B and d^2.

Competitor   A    B    d    d^2
1            1    3    -2   4
2            6    5    1    1
3            5    8    -3   9
4            10   4    6    36
5            3    7    -4   16
6            2    10   -8   64
7            4    2    2    4
8            9    1    8    64
9            7    6    1    1
10           8    9    -1   1

\sum d^2 = 200
r_{AB} = 1 - \frac{6 \cdot 200}{10(10^2-1)} = 1 - \frac{1200}{990} = 1 - 1.212 = -0.212

So A & B have slight negative correlation.


Step 3: Compare A & C

Competitor   A    C    d    d^2
1            1    2    -1   1
2            6    4    2    4
3            5    9    -4   16
4            10   8    2    4
5            3    1    2    4
6            2    3    -1   1
7            4    10   -6   36
8            9    5    4    16
9            7    7    0    0
10           8    6    2    4

\sum d^2 = 86
r_{AC} = 1 - \frac{6 \cdot 86}{10(99)} = 1 - \frac{516}{990} = 1 - 0.521 = 0.479

So A & C have moderate positive correlation.


Step 4: Compare B & C

Competitor   B    C    d    d^2
1            3    2    1    1
2            5    4    1    1
3            8    9    -1   1
4            4    8    -4   16
5            7    1    6    36
6            10   3    7    49
7            2    10   -8   64
8            1    5    -4   16
9            6    7    -1   1
10           9    6    3    9

\sum d^2 = 194
r_{BC} = 1 - \frac{6 \cdot 194}{10(99)} = 1 - \frac{1164}{990} = 1 - 1.176 = -0.176

So B & C have slight negative correlation.


✅ Conclusion

  • r_{AB} = -0.212 (slight negative)

  • r_{AC} = 0.479 (moderate positive)

  • r_{BC} = -0.176 (slight negative)

👉 Judges A and C have the nearest approach to common likings in music.
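A short Python check of all three coefficients using the rank-difference formula (the spearman_rho helper from the earlier sketch is repeated here so the snippet is self-contained):

def spearman_rho(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

A = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
B = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
C = [2, 4, 9, 8, 1, 3, 10, 5, 7, 6]

for name, (p, q) in [("A & B", (A, B)), ("A & C", (A, C)), ("B & C", (B, C))]:
    print(name, round(spearman_rho(p, q), 3))
# A & B -0.212 | A & C 0.479 | B & C -0.176  ->  A and C agree most closely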

Remarks on Spearman’s Rank Correlation Coefficient

  1. Check for Numerical Accuracy

    • In calculations, the sum of rank differences should be zero:

      \sum d = \sum (x_i - y_i) = 0

      This provides a simple check for errors in numerical work.

  2. Relation with Pearson’s Correlation

    • Spearman’s rank correlation coefficient (ρ) is essentially Pearson’s correlation applied to ranks instead of actual data.

    • Therefore, it is interpreted in the same way as Karl Pearson’s correlation coefficient.

  3. Distribution-Free (Non-Parametric)

    • Pearson’s correlation assumes that the population is normally distributed.

    • When this assumption is not valid, we need a distribution-free measure, which does not depend on any population parameters.

    • Spearman’s ρ is such a measure, making it useful in non-parametric situations.

  4. Simplicity and Information Loss

    • Spearman’s formula is easier to understand and apply compared to Pearson’s formula.

    • However, using ranks instead of raw data results in loss of information.

    • Unless there are many tied ranks, Spearman’s coefficient is usually slightly lower than Pearson’s coefficient.

  5. Use with Qualitative Data and Extreme Observations

    • Spearman’s correlation is the only suitable method when dealing with qualitative characteristics (e.g., taste, preference, intelligence level) that cannot be measured numerically but can be ordered.

    • It can also be used when actual quantitative data are available.

    • When data include extreme observations (outliers), Spearman’s formula is often preferred over Pearson’s because it is less sensitive to extremes.

  6. Limitations

    • Spearman’s method is not practical for bivariate frequency distributions (correlation tables).

    • For large samples (n > 30), it is computationally heavy if ranks are not directly given. In such cases, Pearson’s formula is preferred unless ranking is necessary.


In short:

  • Spearman’s ρ is simple, non-parametric, and works well with ranks or qualitative data.

  • It is less accurate than Pearson’s in terms of information retention but more robust when assumptions (like normality) are not satisfied.


Python Code

import numpy as np
import pandas as pd

# Sample data: study hours vs exam scores of students
study_hours = [2, 3, 4, 5, 6, 7, 8, 9]
exam_scores = [50, 55, 60, 65, 70, 75, 80, 85]

# Convert to pandas DataFrame for easy handling
data = pd.DataFrame({
    "Study Hours": study_hours,
    "Exam Scores": exam_scores
})

print("Dataset:")
print(data)

# Mean (average) study hours and exam scores
mean_hours = np.mean(study_hours)
mean_scores = np.mean(exam_scores)
print("\nMean Study Hours:", mean_hours)
print("Mean Exam Scores:", mean_scores)

# Variance (spread of data); np.var defaults to the population (1/n) form
var_hours = np.var(study_hours)
var_scores = np.var(exam_scores)
print("\nVariance of Study Hours:", var_hours)
print("Variance of Exam Scores:", var_scores)

# Correlation (relationship between study hours and scores)
correlation = np.corrcoef(study_hours, exam_scores)[0, 1]
print("\nCorrelation between Study Hours and Exam Scores:", correlation)
