Bivariate Distribution and Correlation
So far, we have confined ourselves to univariate distributions, i.e., distributions involving only one variable. However, in many real-life situations, we encounter cases where each observation is associated with two or more variables.
For example, if we measure the heights and weights of a group of persons, each observation consists of two related values—one for height and one for weight. Such a distribution involving two variables is called a bivariate distribution.
In a bivariate distribution, we are often interested in examining whether there exists any correlation or covariation between the two variables under study.
- If a change in one variable is accompanied by a change in the other variable, the two variables are said to be correlated.
- If the two variables tend to deviate in the same direction (i.e., an increase in one variable corresponds to an increase in the other, or a decrease corresponds to a decrease), the correlation is called direct or positive correlation.
- On the other hand, if the two variables deviate in opposite directions (i.e., an increase in one variable corresponds to a decrease in the other, and vice versa), the correlation is called inverse or negative correlation.

Examples:
- Positive correlation: (i) heights and weights of individuals, (ii) income and expenditure.
- Negative correlation: (i) price and demand of a commodity, (ii) volume and pressure of a perfect gas.
Finally, correlation is said to be perfect if the deviation in one variable is always accompanied by a proportional and exact deviation in the other variable.
Scatter Diagram
A scatter diagram is the simplest method of representing bivariate data diagrammatically.
For a bivariate distribution (x_i, y_i), i = 1, 2, …, n, if the values of the variable X are plotted along the horizontal axis (x-axis) and those of Y along the vertical axis (y-axis) in the Cartesian plane, the resulting diagram of points is known as a scatter diagram.
From a scatter diagram, we can form a fairly good—though approximate—idea about whether the two variables are correlated:
- If the points lie close together, forming a dense cluster, we expect a high degree of correlation.
- If the points are widely scattered, the correlation is expected to be weak.

However, this method is not very suitable when the number of observations is very large, as the diagram becomes too crowded to interpret effectively.
Karl Pearson’s Coefficient of Correlation
As a measure of the intensity or degree of linear relationship between two variables, Karl Pearson (1857–1936), a British biometrician, developed a formula known as the correlation coefficient.
The correlation coefficient between two random variables X and Y, usually denoted by r(X, Y) or simply r, is a numerical measure of the linear relationship between them and is defined as:

r(X, Y) = Cov(X, Y) / (σ_X σ_Y)

where:
- Cov(X, Y) = covariance between X and Y
- σ_X = standard deviation of X
- σ_Y = standard deviation of Y
Let (x_i, y_i), i = 1, 2, …, n, represent a bivariate distribution.
Covariance
The covariance between X and Y is defined as:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

For a sample, it is given by:

Cov(X, Y) = (1/n) Σ (x_i − x̄)(y_i − ȳ)

where x̄ and ȳ are the sample means of X and Y.
Similarly, the variances are:

σ_X² = (1/n) Σ (x_i − x̄)²,  σ_Y² = (1/n) Σ (y_i − ȳ)²

Karl Pearson’s Formula
The correlation coefficient between X and Y is defined as:

r(X, Y) = Cov(X, Y) / (σ_X σ_Y) = Σ (x_i − x̄)(y_i − ȳ) / √[Σ (x_i − x̄)² · Σ (y_i − ȳ)²]

This is also called the product-moment correlation coefficient because it is based on the first product moment of X and Y about their respective means.
Alternative (Computational) Formula
For practical calculation, an equivalent formula is:

r = [n Σ x_i y_i − (Σ x_i)(Σ y_i)] / √{[n Σ x_i² − (Σ x_i)²] · [n Σ y_i² − (Σ y_i)²]}

This avoids computing deviations separately and is useful for tabulated data.
Convenient Form of Covariance Formula
From the definition:

Cov(X, Y) = (1/n) Σ (x_i − x̄)(y_i − ȳ)

Expanding:

Cov(X, Y) = (1/n) Σ x_i y_i − ȳ · (1/n) Σ x_i − x̄ · (1/n) Σ y_i + x̄ ȳ

Since (1/n) Σ x_i = x̄ and (1/n) Σ y_i = ȳ, this simplifies to:

Cov(X, Y) = (1/n) Σ x_i y_i − x̄ ȳ

Similarly, the variances reduce to:

σ_X² = (1/n) Σ x_i² − x̄²,  σ_Y² = (1/n) Σ y_i² − ȳ²

✅ So the final computational forms are:

Cov(X, Y) = (1/n) Σ x_i y_i − x̄ ȳ,  σ_X² = (1/n) Σ x_i² − x̄²,  σ_Y² = (1/n) Σ y_i² − ȳ²
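As a quick sanity check, the computational form can be coded directly; this is a minimal sketch, and the function name `pearson_r` is our own choice rather than anything from the text:

```python
from math import sqrt

def pearson_r(xs, ys):
    # Computational form of Pearson's r:
    # r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2)),
    # so no deviations from the means are computed explicitly.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = sqrt(n * sxx - sx * sx) * sqrt(n * syy - sy * sy)
    return num / den
```

For perfectly linear data such as Y = 2X the formula returns exactly 1, and for a perfect decreasing relation it returns −1, matching the limits discussed below.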
Limits of Correlation Coefficient
Using the Cauchy–Schwarz inequality, it can be shown that −1 ≤ r ≤ +1.
- If r = +1: perfect positive correlation.
- If r = −1: perfect negative correlation.
- If r = 0: no linear correlation.
Remarks on Karl Pearson’s Correlation Coefficient
The coefficient provides a measure of the linear relationship between and . For nonlinear relationships, however, it is not a suitable measure.
Sometimes the covariance is denoted by μ₁₁ = Cov(X, Y) = E[(X − E[X])(Y − E[Y])], the first product moment of X and Y about their respective means. Karl Pearson’s correlation coefficient is also called the product-moment correlation coefficient because it is based on this product moment.
Thus, r(X, Y) = μ₁₁ / (σ_X σ_Y).
Example
Calculate the correlation coefficient for the following heights (in inches) of fathers (X) and their sons (Y):
Data
Fathers’ heights (X): 65, 66, 67, 67, 68, 69, 70, 72
Sons’ heights (Y): 67, 68, 65, 68, 72, 72, 69, 71
Useful totals
n = 8, ΣX = 544, ΣY = 552
(So x̄ = 544/8 = 68 and ȳ = 552/8 = 69.)
With deviations x = X − 68 and y = Y − 69:

Σxy = 24,  Σx² = 36,  Σy² = 44

Pearson correlation (computational form)

r = Σxy / √(Σx² · Σy²)

Plug in the numbers (showing each step):
- Numerator: Σxy = 24
- Denominator parts: Σx² = 36 and Σy² = 44, so √(36 × 44) = √1584 ≈ 39.80
Therefore

r = 24 / 39.80 ≈ 0.603
Conclusion
The value r ≈ 0.603 indicates a moderate positive linear correlation between fathers’ and sons’ heights.
If you use the sample version with 1/(n − 1) in both the covariance and the standard deviations, the factors cancel, so you get the same computational formula for r. The computational form is numerically stable and faster, and shifts or positive rescalings of X and Y cancel in r.
For the standardised variables U = (X − x̄)/σ_X and V = (Y − ȳ)/σ_Y:

σ_U = σ_V = 1, and since Cov(U, V) = r(X, Y), we get

r(U, V) = Cov(U, V) ≈ 0.603. This matches r(X, Y).
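The invariance claimed above can be checked numerically on the same data; the helper name `r` is our own choice, and the sketch uses the deviation form of Pearson’s formula:

```python
from math import sqrt

# Fathers' and sons' heights from the example above.
X = [65, 66, 67, 67, 68, 69, 70, 72]
Y = [67, 68, 65, 68, 72, 72, 69, 71]

def r(xs, ys):
    # Deviation form: sum(xy) / sqrt(sum(x^2) * sum(y^2)).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Standardise: U = (X - mean)/sd, V = (Y - mean)/sd (population sd, divisor n).
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sx = sqrt(sum((x - mx) ** 2 for x in X) / n)
sy = sqrt(sum((y - my) ** 2 for y in Y) / n)
U = [(x - mx) / sx for x in X]
V = [(y - my) / sy for y in Y]
```

Running this, r(X, Y) and r(U, V) agree to floating-point precision, illustrating that r is unchanged by standardisation.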
Example Problem
A computer, while calculating the correlation coefficient between two variables X and Y from n pairs of observations, obtained the totals ΣX, ΣY, ΣX², ΣY² and ΣXY.
However, during checking, it was discovered that two pairs of values had been copied incorrectly.
Obtain the correct value of the correlation coefficient.
Correcting the totals
Each total is corrected by subtracting the contributions of the wrongly copied pairs and adding those of the correct pairs, e.g.

ΣX (corrected) = ΣX − Σ x(wrong) + Σ x(correct)

and similarly for ΣY, ΣX², ΣY² and ΣXY.
(You can verify the corrected totals by subtracting the wrong contributions and adding the correct ones.)
Compute means
x̄ = ΣX/n and ȳ = ΣY/n, using the corrected totals.
Use Pearson’s computational formula

r = [n ΣXY − (ΣX)(ΣY)] / √{[n ΣX² − (ΣX)²] · [n ΣY² − (ΣY)²]}

Plug in the corrected numbers:
- Numerator: n ΣXY − (ΣX)(ΣY)
- Denominator parts: n ΣX² − (ΣX)² and n ΣY² − (ΣY)²
Final answer
Substituting the corrected totals into the formula gives the corrected value of the correlation coefficient.
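Since the original numerical totals are not reproduced above, here is a sketch of the correction procedure on hypothetical figures: n = 25, the initial totals, the wrongly copied pairs (6, 14) and (8, 6), and their correct values (8, 12) and (6, 8) are all invented for illustration, not the problem’s actual data:

```python
from math import sqrt

# Hypothetical initial totals from n = 25 pairs (illustration only).
n = 25
Sx, Sy, Sxx, Syy, Sxy = 125, 100, 650, 460, 508

# Wrongly copied pairs and their correct values (hypothetical).
wrong   = [(6, 14), (8, 6)]
correct = [(8, 12), (6, 8)]

# Subtract the wrong contributions and add the correct ones, in every total.
for (x, y) in wrong:
    Sx -= x; Sy -= y; Sxx -= x * x; Syy -= y * y; Sxy -= x * y
for (x, y) in correct:
    Sx += x; Sy += y; Sxx += x * x; Syy += y * y; Sxy += x * y

# Pearson's computational formula on the corrected totals.
r = (n * Sxy - Sx * Sy) / (sqrt(n * Sxx - Sx ** 2) * sqrt(n * Syy - Sy ** 2))
```

The raw observations are never needed: only the five running totals are adjusted, after which the computational formula applies unchanged.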
The whole set of cell frequencies will then define a bivariate frequency distribution. The column totals and row totals give the marginal distributions of X and Y. A particular column or row is called the conditional distribution of Y for given X, or of X for given Y, respectively.
Suppose bivariate data on X and Y are presented in a two-way frequency table, with the classes of X in one direction and the classes of Y in the other. Let f_ij be the frequency of observations in the cell corresponding to the i-th class of X and the j-th class of Y (the (i, j)-th cell).
The marginal frequency of the i-th class of X is

f_i· = Σ_j f_ij

and the marginal frequency of the j-th class of Y is

f_·j = Σ_i f_ij

The total number of observations is

N = Σ_i Σ_j f_ij

Fixing i (one class of X) gives the conditional distribution of Y given that class of X; similarly, fixing j gives the conditional distribution of X given that class of Y.
Notation for grouped (class) midpoints
Let the midpoint of the i-th X-class be x_i and that of the j-th Y-class be y_j. Then treat each cell as representing f_ij observations at the point (x_i, y_j).
Marginal (grouped) distributions and sample totals

f_i· = Σ_j f_ij,  f_·j = Σ_i f_ij,  N = Σ_i Σ_j f_ij

Sample means (grouped)

x̄ = (1/N) Σ_i f_i· x_i,  ȳ = (1/N) Σ_j f_·j y_j

Grouped second moments, variances and covariance
Grouped variances (population form, using divisor N):

σ_X² = (1/N) Σ_i f_i· x_i² − x̄²,  σ_Y² = (1/N) Σ_j f_·j y_j² − ȳ²

Grouped covariance:

Cov(X, Y) = (1/N) Σ_i Σ_j f_ij x_i y_j − x̄ ȳ

(If you prefer the “sample” form, replace the divisor N by N − 1; the factors cancel in r.)
Correlation coefficient (grouped)

r = Cov(X, Y) / (σ_X σ_Y)

This lies in [−1, 1]. Its sign indicates the direction of linear association; its magnitude indicates its strength.
Conditional distributions (for clarity)
Conditional probability (grouped) of Y = y_j given that X lies in class i:

P(Y = y_j | X in class i) = f_ij / f_i·

Similarly, P(X = x_i | Y in class j) = f_ij / f_·j.
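The grouped formulas can be sketched in a few lines of code; the 3×2 table of midpoints and frequencies below is a made-up illustration, not data from the text:

```python
from math import sqrt

x_mid = [10, 20, 30]      # midpoints of the X-classes (illustrative)
y_mid = [5, 15]           # midpoints of the Y-classes (illustrative)
f = [[2, 1],              # f[i][j]: frequency of the (i, j)-th cell
     [3, 4],
     [1, 5]]

cells = [(i, j) for i in range(len(x_mid)) for j in range(len(y_mid))]
N = sum(f[i][j] for i, j in cells)

# Grouped means: weight each midpoint by its cell frequency.
mean_x = sum(f[i][j] * x_mid[i] for i, j in cells) / N
mean_y = sum(f[i][j] * y_mid[j] for i, j in cells) / N

# Grouped variances and covariance in the "convenient" (moment) form.
var_x = sum(f[i][j] * x_mid[i] ** 2 for i, j in cells) / N - mean_x ** 2
var_y = sum(f[i][j] * y_mid[j] ** 2 for i, j in cells) / N - mean_y ** 2
cov = sum(f[i][j] * x_mid[i] * y_mid[j] for i, j in cells) / N - mean_x * mean_y

r = cov / sqrt(var_x * var_y)
```

Only the cell frequencies and midpoints enter the computation, which is why this method suits correlation tables where raw observations are unavailable.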
Example:
A joint distribution table of X and Y is given. Find the correlation coefficient between X and Y.
Expectations
From the marginal distributions compute

E[X] = Σ x p_X(x),  E[Y] = Σ y p_Y(y)

Second moments
Compute

E[X²] = Σ x² p_X(x),  E[Y²] = Σ y² p_Y(y),  E[XY] = Σ_x Σ_y x y p(x, y)

Variances

Var(X) = E[X²] − (E[X])²,  Var(Y) = E[Y²] − (E[Y])²

Covariance

Cov(X, Y) = E[XY] − E[X] E[Y]

Correlation coefficient

r(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y))

✅ Final Answer:
Substituting the values from the table gives a small negative value of r, which indicates a weak negative correlation between X and Y.
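Since the joint table itself is not reproduced above, the expectation method can be illustrated on a hypothetical 2×2 joint pmf (the probabilities below are invented for illustration only):

```python
from math import sqrt

xs, ys = [0, 1], [0, 1]
# Hypothetical joint probabilities p(x, y); they sum to 1.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.3}

# Expectations and second moments over the joint pmf.
EX  = sum(x * p[(x, y)] for x in xs for y in ys)
EY  = sum(y * p[(x, y)] for x in xs for y in ys)
EXY = sum(x * y * p[(x, y)] for x in xs for y in ys)
EX2 = sum(x * x * p[(x, y)] for x in xs for y in ys)
EY2 = sum(y * y * p[(x, y)] for x in xs for y in ys)

# Cov(X, Y) = E[XY] - E[X]E[Y]; variances via E[X^2] - (E[X])^2.
cov = EXY - EX * EY
r = cov / sqrt((EX2 - EX ** 2) * (EY2 - EY ** 2))
```

Here the slight excess of probability on the diagonal cells produces a small positive r; shifting mass to the off-diagonal cells would make it negative.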
Spearman’s Rank Correlation
1. Setup
We have n individuals ranked according to two characteristics A and B.
- Rank of the i-th individual in A: x_i
- Rank of the i-th individual in B: y_i
Since they are ranks, each of (x_1, …, x_n) and (y_1, …, y_n) is a permutation of the numbers 1, 2, …, n.
Let:
- Mean of ranks:

x̄ = ȳ = (1 + 2 + ⋯ + n)/n = (n + 1)/2

- Variance of ranks:

σ_x² = (1/n) Σ x_i² − x̄²

Recall the formula:

Σ_{i=1}^{n} i² = n(n + 1)(2n + 1)/6

So:

σ_x² = (n + 1)(2n + 1)/6 − (n + 1)²/4 = (n + 1)(n − 1)/12 = (n² − 1)/12

Thus,

σ_x² = σ_y² = (n² − 1)/12
2. Differences of ranks
Define:

d_i = x_i − y_i,  i = 1, 2, …, n

Clearly:

Σ d_i = Σ x_i − Σ y_i = 0

3. Relation with covariance
Since x̄ = ȳ, we can write d_i = (x_i − x̄) − (y_i − ȳ). Expanding:

Σ d_i² = Σ (x_i − x̄)² + Σ (y_i − ȳ)² − 2 Σ (x_i − x̄)(y_i − ȳ)

Divide by n:

(1/n) Σ d_i² = σ_x² + σ_y² − 2 Cov(x, y)

Since Cov(x, y) = ρ σ_x σ_y,

where ρ = rank correlation coefficient.
So:

(1/n) Σ d_i² = σ_x² + σ_y² − 2 ρ σ_x σ_y = 2 σ²(1 − ρ),  where σ² = σ_x² = σ_y²

4. Substituting variance
We already know:

σ² = (n² − 1)/12

Thus:

(1/n) Σ d_i² = 2(1 − ρ)(n² − 1)/12

Simplify:

Σ d_i² = n(1 − ρ)(n² − 1)/6

5. Final formula
Rearranging for ρ:

ρ = 1 − 6 Σ d_i² / [n(n² − 1)]

✅ This is Spearman’s Rank Correlation formula:

ρ = 1 − 6 Σ d_i² / [n(n² − 1)]
where
- d_i is the difference between the two ranks of the i-th individual,
- n is the number of individuals.
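The final formula is easy to code directly; `spearman_rho` is our own name for this sketch, which assumes the two lists are untied ranks (permutations of 1…n):

```python
def spearman_rho(rank_x, rank_y):
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), assuming no tied ranks.
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

With identical rankings d_i = 0 for every i, so the formula returns exactly 1, as expected for perfect agreement.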
Spearman’s coefficient is computed from ranks, whereas Pearson’s correlation is based on the actual values. A small worked example of Spearman’s rank correlation:
Example
Suppose we have marks of 5 students in Math and Science:
Student | Math (X) | Science (Y) |
---|---|---|
A | 85 | 93 |
B | 70 | 65 |
C | 90 | 89 |
D | 60 | 60 |
E | 75 | 80 |
Step 1: Assign ranks
Rank each score (highest = 1).
Student | Math (X) | Rank X | Science (Y) | Rank Y |
---|---|---|---|---|
A | 85 | 2 | 93 | 1 |
B | 70 | 4 | 65 | 4 |
C | 90 | 1 | 89 | 2 |
D | 60 | 5 | 60 | 5 |
E | 75 | 3 | 80 | 3 |
Step 2: Find differences of ranks
d = Rank X − Rank Y, and compute d².
Student | Rank X | Rank Y | d | d² |
---|---|---|---|---|
A | 2 | 1 | 1 | 1 |
B | 4 | 4 | 0 | 0 |
C | 1 | 2 | -1 | 1 |
D | 5 | 5 | 0 | 0 |
E | 3 | 3 | 0 | 0 |
Step 3: Apply Spearman’s formula

ρ = 1 − 6 Σ d² / [n(n² − 1)]

Here Σ d² = 2 and n = 5:

ρ = 1 − (6 × 2)/(5 × 24) = 1 − 12/120 = 0.9

✅ Spearman’s rank correlation coefficient = 0.9
This shows a strong positive correlation between Math and Science marks.
Example
Ten competitors in a musical test were ranked by three judges A, B and C in the following order:

Ranks by A: 1 6 5 10 3 2 4 9 7 8
Ranks by B: 3 5 8 4 7 10 2 1 6 9
Ranks by C: 2 4 9 8 1 3 10 5 7 6

Using the rank correlation method, discuss which pair of judges has the nearest approach to common likings in music.
Step 1: Write down the rankings
Competitor | A | B | C |
---|---|---|---|
1 | 1 | 3 | 2 |
2 | 6 | 5 | 4 |
3 | 5 | 8 | 9 |
4 | 10 | 4 | 8 |
5 | 3 | 7 | 1 |
6 | 2 | 10 | 3 |
7 | 4 | 2 | 10 |
8 | 9 | 1 | 5 |
9 | 7 | 6 | 7 |
10 | 8 | 9 | 6 |
Step 2: Compare A & B
Compute d = Rank A − Rank B and d².
Competitor | A | B | d | d² |
---|---|---|---|---|
1 | 1 | 3 | -2 | 4 |
2 | 6 | 5 | 1 | 1 |
3 | 5 | 8 | -3 | 9 |
4 | 10 | 4 | 6 | 36 |
5 | 3 | 7 | -4 | 16 |
6 | 2 | 10 | -8 | 64 |
7 | 4 | 2 | 2 | 4 |
8 | 9 | 1 | 8 | 64 |
9 | 7 | 6 | 1 | 1 |
10 | 8 | 9 | -1 | 1 |
Σ d² = 4 + 1 + 9 + 36 + 16 + 64 + 4 + 64 + 1 + 1 = 200

ρ(A, B) = 1 − (6 × 200)/(10 × 99) = 1 − 1200/990 ≈ −0.21

So A & B have a slight negative correlation.
Step 3: Compare A & C
Competitor | A | C | d | d² |
---|---|---|---|---|
1 | 1 | 2 | -1 | 1 |
2 | 6 | 4 | 2 | 4 |
3 | 5 | 9 | -4 | 16 |
4 | 10 | 8 | 2 | 4 |
5 | 3 | 1 | 2 | 4 |
6 | 2 | 3 | -1 | 1 |
7 | 4 | 10 | -6 | 36 |
8 | 9 | 5 | 4 | 16 |
9 | 7 | 7 | 0 | 0 |
10 | 8 | 6 | 2 | 4 |
Σ d² = 1 + 4 + 16 + 4 + 4 + 1 + 36 + 16 + 0 + 4 = 86

ρ(A, C) = 1 − (6 × 86)/(10 × 99) = 1 − 516/990 ≈ 0.48

So A & C have a moderate positive correlation.
Step 4: Compare B & C
Competitor | B | C | d | d² |
---|---|---|---|---|
1 | 3 | 2 | 1 | 1 |
2 | 5 | 4 | 1 | 1 |
3 | 8 | 9 | -1 | 1 |
4 | 4 | 8 | -4 | 16 |
5 | 7 | 1 | 6 | 36 |
6 | 10 | 3 | 7 | 49 |
7 | 2 | 10 | -8 | 64 |
8 | 1 | 5 | -4 | 16 |
9 | 6 | 7 | -1 | 1 |
10 | 9 | 6 | 3 | 9 |
Σ d² = 1 + 1 + 1 + 16 + 36 + 49 + 64 + 16 + 1 + 9 = 194

ρ(B, C) = 1 − (6 × 194)/(10 × 99) = 1 − 1164/990 ≈ −0.18

So B & C have a slight negative correlation.
✅ Conclusion
- ρ(A, B) ≈ −0.21 (slight negative)
- ρ(A, C) ≈ 0.48 (moderate positive)
- ρ(B, C) ≈ −0.18 (slight negative)
👉 Since ρ(A, C) is the largest, judges A and C have the nearest approach to common likings in music.
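The three pairwise comparisons can be reproduced in a few lines; the rank lists are taken from the tables in Step 1, and `rho` is our own helper name:

```python
def rho(rx, ry):
    # Spearman's formula for untied ranks.
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

A = [1, 6, 5, 10, 3, 2, 4, 9, 7, 8]
B = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
C = [2, 4, 9, 8, 1, 3, 10, 5, 7, 6]

pairs = {"A-B": rho(A, B), "A-C": rho(A, C), "B-C": rho(B, C)}
closest = max(pairs, key=pairs.get)  # pair with the highest rank correlation
```

Computing all pairwise coefficients and taking the maximum mechanises the "nearest approach to common likings" comparison.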
Remarks on Spearman’s Rank Correlation Coefficient

1. Check for Numerical Accuracy
- In calculations, the sum of rank differences should be zero:

Σ d_i = 0

This provides a simple check for errors in numerical work.

2. Relation with Pearson’s Correlation
- Spearman’s rank correlation coefficient (ρ) is essentially Pearson’s correlation applied to ranks instead of actual data.
- Therefore, it is interpreted in the same way as Karl Pearson’s correlation coefficient.

3. Distribution-Free (Non-Parametric)
- Pearson’s correlation assumes that the population is normally distributed.
- When this assumption is not valid, we need a distribution-free measure, which does not depend on any population parameters.
- Spearman’s ρ is such a measure, making it useful in non-parametric situations.

4. Simplicity and Information Loss
- Spearman’s formula is easier to understand and apply than Pearson’s formula.
- However, using ranks instead of raw data results in a loss of information.
- Unless there are many tied ranks, Spearman’s coefficient is usually slightly lower than Pearson’s coefficient.

5. Use with Qualitative Data and Extreme Observations
- Spearman’s correlation is the only suitable method when dealing with qualitative characteristics (e.g., taste, preference, intelligence level) that cannot be measured numerically but can be ordered.
- It can also be used when actual quantitative data are available.
- When data include extreme observations (outliers), Spearman’s formula is often preferred over Pearson’s because it is less sensitive to extremes.

6. Limitations
- Spearman’s method is not practical for bivariate frequency distributions (correlation tables).
- For large samples (n > 30), it is computationally heavy if ranks are not directly given. In such cases, Pearson’s formula is preferred unless ranking is necessary.

✅ In short:
- Spearman’s ρ is simple, non-parametric, and works well with ranks or qualitative data.
- It is less accurate than Pearson’s in terms of information retention but more robust when assumptions (like normality) are not satisfied.