In an earlier blog post on the Pearson correlation coefficient, we discussed how it is used to measure the strength of the linear relationship between two variables (years of experience and salary).
Not all relationships between variables are linear, and Pearson correlation works best when the relationship follows a straight-line pattern.
When the relationship is not linear but still moves consistently in one direction, we use the Spearman correlation coefficient to capture that pattern.
To understand the Spearman correlation coefficient, let’s consider the fish market dataset.
This dataset includes physical attributes of each fish, such as:
- Weight – the weight of the fish in grams (this will be our target variable)
- Length1, Length2, Length3 – various length measurements (in cm)
- Height – the height of the fish (in cm)
- Width – the diagonal width of the fish body (in cm)
We need to predict the weight of the fish based on various length measurements, height and width.
This is the same example we used in an earlier blog to understand the math behind multiple linear regression, where we used only Height and Width as independent variables to derive the individual equations for the slopes and intercept.
Here we are trying to fit a multiple linear regression model, and we have five independent variables and one target variable.
Now let’s calculate the Pearson correlation coefficient between each independent variable and the target variable.
Code:

```python
import pandas as pd

# Load the Fish Market dataset
df = pd.read_csv("C:/Fish.csv")

# Drop the categorical 'Species' column
if 'Species' in df.columns:
    df_numeric = df.drop(columns=['Species'])
else:
    df_numeric = df.copy()

# Calculate Pearson correlation between each independent variable and the target (Weight)
target = 'Weight'
pearson_corr = df_numeric.corr(method='pearson')[target].drop(target)  # drop self-correlation
print(pearson_corr.sort_values(ascending=False))
```

The Pearson correlation coefficient between Weight and
- Length3 is 0.923044
- Length2 is 0.918618
- Length1 is 0.915712
- Width is 0.886507
- Height is 0.724345
Among all the variables, Height has the weakest Pearson correlation coefficient, and we might think that we should drop this variable before applying the multiple linear regression model.
But before that, is it correct to drop an independent variable based on the Pearson correlation coefficient alone?
No.
First, let’s look at the scatter plot between Height and Weight.

From the scatter plot we can observe that as height increases, weight also increases, but the relationship is not linear.
At smaller heights, the weight increases slowly. At larger heights, it increases more quickly.
Here the trend is non-linear but still monotonic, because it moves in one direction.
Since the Pearson correlation coefficient assumes a straight-line relationship (linearity), it gives a lower value here.
This is where the Spearman correlation coefficient comes in.
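To see the gap concretely, here is a small illustration on synthetic data (not the fish dataset): \( y = x^3 \) is perfectly monotonic but curved, so Pearson drops below 1 while Spearman, which only compares rank orderings, stays at exactly 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic, perfectly monotonic but non-linear data: y = x**3
x = np.arange(1, 21, dtype=float)
y = x ** 3

# Pearson is pulled below 1 by the curvature ...
print(f"Pearson:  {pearsonr(x, y)[0]:.4f}")
# ... while Spearman, which only looks at the rank order, is exactly 1
print(f"Spearman: {spearmanr(x, y)[0]:.4f}")
```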
Now let’s calculate the Spearman correlation coefficient between Height and Weight.
Code:

```python
import pandas as pd
from scipy.stats import spearmanr

# Load the dataset
df = pd.read_csv("C:/Fish.csv")

# Calculate Spearman correlation coefficient between Height and Weight
spearman_corr = spearmanr(df["Height"], df["Weight"])[0]
print(f"Spearman Correlation Coefficient: {spearman_corr:.4f}")
```

The Spearman correlation coefficient is 0.8586, which indicates a strong positive relationship between Height and Weight.
This means that as the height of the fish increases, the weight also tends to increase.
Earlier, we got a Pearson correlation coefficient of 0.72 between Height and Weight, which underestimates the actual relationship between these variables.
If we select features only based on the Pearson correlation and remove the Height feature, we might lose an important variable that actually has a strong relationship with the target, leading to less relevant predictions.
This is where the Spearman correlation coefficient helps, as it captures non-linear but monotonic trends.
By using the Spearman correlation, we can also decide the next steps, such as applying transformations like log or lag values or considering algorithms like decision trees or random forests that can handle both linear and non-linear relationships.
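As a hypothetical illustration of the transformation idea (synthetic data, not the fish measurements): an exponential trend is monotonic but curved on the raw scale, so Pearson underestimates it, while a log transform makes the relationship exactly linear.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic exponential growth: monotonic but strongly curved
x = np.arange(1, 16, dtype=float)
y = np.exp(0.5 * x)

# Pearson on the raw values is weakened by the curvature ...
print(f"Pearson(x, y):     {pearsonr(x, y)[0]:.4f}")
# ... but a log transform straightens the trend, since log(y) = 0.5 * x
print(f"Pearson(x, log y): {pearsonr(x, np.log(y))[0]:.4f}")
```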
Now that we have understood the significance of the Spearman correlation coefficient, it is time to understand the math behind it.
How is the Spearman correlation coefficient calculated in a way that it captures the relationship even when the data is non-linear and monotonic?
To understand this, let’s consider a 10-point sample from the dataset.

Now, we sort the values in ascending order in each column and then assign ranks.

Now that we have given ranks to both Height and Weight, we don’t keep them in the sorted order.
Each value needs to go back to its original place in the dataset so that every fish’s height rank is matched with its own weight rank.
We sort the columns only to assign ranks. After that, we place the ranks back in their original order and then calculate the Spearman correlation using these two sets of ranks.

Here, while assigning ranks after sorting the values in ascending order in the Weight column, we encountered a tie at ranks 5 and 6, so we assigned both values the average rank of 5.5.
Similarly, we found another tie across ranks 7, 8, 9, and 10, so we assigned all of them the average rank of 8.5.
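This average-rank rule for ties is exactly what pandas' `rank` method applies by default (`method='average'`). A small sketch with hypothetical weight values (not the actual sample) containing the same kinds of ties:

```python
import pandas as pd

# Hypothetical weights with ties (not the actual sample values)
weights = pd.Series([120, 150, 200, 250, 250, 300, 400, 400, 400, 400])

# method='average' (the default) assigns tied values the mean of the
# positions they occupy
ranks = weights.rank(method='average')
print(ranks.tolist())
# The two 250s share positions 4 and 5 -> both get 4.5;
# the four 400s share positions 7-10 -> all get 8.5
```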
Now, we calculate the Spearman correlation coefficient, which is actually the Pearson correlation applied to the ranks.
We already know the formula for calculating Pearson correlation coefficient.
\[
r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}
= \frac{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
{\sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]
\[
= \frac{\sum_{i=1}^{n} (X_i – \bar{X})(Y_i – \bar{Y})}
{\sqrt{\sum_{i=1}^{n} (X_i – \bar{X})^2} \cdot \sqrt{\sum_{i=1}^{n} (Y_i – \bar{Y})^2}}
\]
Now, the formula for Spearman correlation coefficient is:
\[
r_s =
\frac{
\sum_{i=1}^{n}
\underbrace{(R_{X_i} – \bar{R}_X)}_{\text{Rank deviation of } X_i}
\cdot
\underbrace{(R_{Y_i} – \bar{R}_Y)}_{\text{Rank deviation of } Y_i}
}{
\sqrt{
\sum_{i=1}^{n}
\underbrace{(R_{X_i} – \bar{R}_X)^2}_{\text{Squared rank deviations of } X}
}
\cdot
\sqrt{
\sum_{i=1}^{n}
\underbrace{(R_{Y_i} – \bar{R}_Y)^2}_{\text{Squared rank deviations of } Y}
}
}
\]
\[
\begin{aligned}
\text{Where:} \\
R_{X_i} & = \text{ rank of the } i^\text{th} \text{ value in variable } X \\
R_{Y_i} & = \text{ rank of the } i^\text{th} \text{ value in variable } Y \\
\bar{R}_X & = \text{ mean of all ranks in } X \\
\bar{R}_Y & = \text{ mean of all ranks in } Y
\end{aligned}
\]
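This "Pearson on ranks" definition translates directly into code. A minimal NumPy sketch (the helper name `spearman_from_ranks` is my own):

```python
import numpy as np

def spearman_from_ranks(rx, ry):
    """Pearson correlation applied to two rank vectors."""
    rx = np.asarray(rx, dtype=float)
    ry = np.asarray(ry, dtype=float)
    dx, dy = rx - rx.mean(), ry - ry.mean()   # rank deviations
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Quick check: identical orderings give a perfect +1
print(spearman_from_ranks([1, 2, 3, 4], [1, 2, 3, 4]))  # -> 1.0
```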
Now, let’s calculate the Spearman correlation coefficient for the sample data.
\[
\textbf{Step 1: Ranks from the original data}
\]
\[
\begin{array}{c|cccccccccc}
R_{x_i} & 3 & 1 & 2 & 5 & 8 & 4 & 7 & 9 & 10 & 6 \\[2pt]
R_{y_i} & 1 & 2 & 4 & 5.5 & 8.5 & 3 & 5.5 & 8.5 & 8.5 & 8.5
\end{array}
\]
\[
\textbf{Step 2: Formula of Spearman’s correlation (Pearson on ranks)}
\]
\[
\rho_s =
\frac{\sum_{i=1}^{n}\bigl(R_{x_i}-\bar{R_x}\bigr)\bigl(R_{y_i}-\bar{R_y}\bigr)}
{\sqrt{\sum_{i=1}^{n}\bigl(R_{x_i}-\bar{R_x}\bigr)^2} \;
\sqrt{\sum_{i=1}^{n}\bigl(R_{y_i}-\bar{R_y}\bigr)^2}},
\qquad n = 10
\]
\[
\textbf{Step 3: Mean of rank variables}
\]
\[
\bar{R_x} = \frac{3+1+2+5+8+4+7+9+10+6}{10} = \frac{55}{10} = 5.5
\]
\[
\bar{R_y} = \frac{1+2+4+5.5+8.5+3+5.5+8.5+8.5+8.5}{10}
= \frac{55}{10} = 5.5
\]
Note that because tied values receive the average of the ranks they occupy, the ranks still sum to \(1+2+\cdots+10 = 55\), so the mean rank is unchanged at 5.5.
\[
\textbf{Step 4: Deviations and cross-products}
\]
\[
\begin{array}{c|c|c|c}
i & R_{x_i}-\bar{R_x} & R_{y_i}-\bar{R_y} & (R_{x_i}-\bar{R_x})(R_{y_i}-\bar{R_y}) \\ \hline
1 & -2.5 & -4.5 & 11.25 \\
2 & -4.5 & -3.5 & 15.75 \\
3 & -3.5 & -1.5 & 5.25 \\
4 & -0.5 & 0 & 0 \\
5 & 2.5 & 3 & 7.5 \\
6 & -1.5 & -2.5 & 3.75 \\
7 & 1.5 & 0 & 0 \\
8 & 3.5 & 3 & 10.5 \\
9 & 4.5 & 3 & 13.5 \\
10 & 0.5 & 3 & 1.5
\end{array}
\]
\[
\sum (R_{x_i}-\bar{R_x})(R_{y_i}-\bar{R_y}) = 69.0
\]
\[
\textbf{Step 5: Sum of squares for each rank variable}
\]
\[
\sum (R_{x_i}-\bar{R_x})^2 = 82.5,
\qquad
\sum (R_{y_i}-\bar{R_y})^2 = 77.0
\]
\[
\textbf{Step 6: Substitute into the formula}
\]
\[
\rho_s
= \frac{69.0}{\sqrt{(82.5)(77.0)}}
= \frac{69.0}{79.7026}
= 0.8657
\]
\[
\textbf{Step 7: Interpretation}
\]
\[
\rho_s = 0.8657
\]
A value of \( \rho_s = 0.8657 \) shows a strong positive monotonic relationship between Height and Weight: as height increases, weight also tends to increase.
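As a sanity check, we can hand the two rank vectors from Step 1 directly to SciPy. `spearmanr` re-ranks its inputs, and ranking an already-averaged rank vector leaves it unchanged, so this is exactly Pearson applied to the ranks.

```python
import numpy as np
from scipy.stats import spearmanr

# Rank vectors from Step 1 (ties already averaged)
rx = np.array([3, 1, 2, 5, 8, 4, 7, 9, 10, 6], dtype=float)
ry = np.array([1, 2, 4, 5.5, 8.5, 3, 5.5, 8.5, 8.5, 8.5])

# Pearson applied to the ranks, via SciPy
rho = spearmanr(rx, ry)[0]
print(f"rho_s = {rho:.4f}")  # -> rho_s = 0.8657
```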
This is how we calculate the Spearman correlation coefficient.
We also have another formula to calculate the Spearman correlation coefficient, but it is used only when there are no tied ranks.
\[
\rho_s = 1 – \frac{6\sum d_i^2}{n(n^2 – 1)}
\]
where:
\[
\begin{aligned}
\rho_s & : \text{ Spearman correlation coefficient} \\[4pt]
d_i & : \text{ difference between the ranks of each observation, } (R_{x_i} – R_{y_i}) \\[4pt]
n & : \text{ total number of paired observations}
\end{aligned}
\]
If ties are present, the rank differences no longer represent the exact distances between positions, and we instead calculate ‘ρ’ using the ‘Pearson correlation on ranks’ formula.
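For untied ranks, the shortcut formula and the "Pearson on ranks" route give the same answer, which we can confirm on a tiny hypothetical example:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical untied rank pairs
rx = np.array([1, 2, 3, 4, 5])
ry = np.array([2, 1, 4, 3, 5])

# Shortcut formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
d = rx - ry
n = len(rx)
shortcut = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

# Both routes agree (approximately 0.8 here)
print(shortcut, spearmanr(rx, ry)[0])
```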
Dataset
The dataset used in this blog is the Fish Market dataset, which contains measurements of fish species sold in markets, including attributes like weight, height, and width.
It is publicly available on Kaggle and is licensed under the Creative Commons Zero (CC0 Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.
Spearman’s correlation coefficient helps us understand how two variables move together when the relationship is not perfectly linear.
By converting the data into ranks, it shows how well one variable increases as the other increases, capturing any upward or downward pattern.
It is very helpful when the data has outliers, is not normally distributed, or when the relationship is monotonic but curved.
I hope this post helped you see not just how to calculate the Spearman correlation coefficient, but also when to use it and why it is an important tool in data analysis.
Thanks for reading!


