Pandas Correlation

Correlation is a statistical concept that quantifies the degree to which two variables are related to each other.

In Pandas, correlation is calculated with the corr() method, which is available on both DataFrame and Series objects.

Let's look at an example.

import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}

df = pd.DataFrame(data)

# calculate correlation matrix
print(df.corr())

Output

                 Temperature  Ice_Cream_Sales
Temperature         1.000000         0.923401
Ice_Cream_Sales     0.923401         1.000000

In this example, we used the corr() method on the DataFrame df to calculate the correlation coefficients between the columns.

The output is a correlation matrix that displays the correlation coefficients between all pairs of columns in the dataframe. In this case, there are only two columns, so the matrix is 2x2.

Here, the correlation coefficient between Temperature and Ice_Cream_Sales is 0.923401, which is positive and close to 1. This indicates that ice cream sales tend to increase as the temperature increases.

The coefficient value of 1.000000 along the diagonal represents the correlation of each column with itself.
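
To see what corr() computes by default, here is a minimal sketch that recalculates the coefficient by hand using the standard Pearson formula (the covariance of the two columns divided by the product of their spreads). The variable names manual_r and pandas_r are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125],
})

x = df["Temperature"].to_numpy(dtype=float)
y = df["Ice_Cream_Sales"].to_numpy(dtype=float)

# deviations of each value from its column mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# the same value taken from the correlation matrix
pandas_r = df.corr().loc["Temperature", "Ice_Cream_Sales"]

print(round(manual_r, 6))
print(round(pandas_r, 6))
```

Both lines should print 0.923401, the same value shown in the matrix above.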


Positive and Negative Correlation

Positive correlation refers to a relationship between two variables where they both tend to change in the same direction. When one variable increases, the other variable also tends to increase, and when one variable decreases, the other variable also tends to decrease.

Figure: Positive correlation between temperature and ice cream sales

In the figure above, we can clearly see that ice cream sales increase with the increase in temperature. We can say that there is a positive correlation between temperature and ice cream sales.

Negative correlation, on the other hand, refers to a relationship between two variables where they tend to change in opposite directions. When one variable increases, the other variable tends to decrease, and vice versa.

Figure: Negative correlation between temperature and coffee sales

In the figure above, coffee sales decrease with increase in temperature. We can say that there is a negative correlation between temperature and coffee sales.
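
The sign of the coefficient tells us the direction of the relationship. As a small sketch, the code below uses the three complete temperature and coffee sales readings from the missing values example later in this tutorial and checks that the coefficient is negative.

```python
import pandas as pd

# three complete readings from the missing values example in this tutorial
df = pd.DataFrame({
    "Temperature": [22, 25, 30],
    "Coffee_Sales": [158, 145, 140],
})

r = df["Temperature"].corr(df["Coffee_Sales"])

# a negative coefficient means the columns tend to move in opposite directions
print(r < 0)
```

This prints True, since coffee sales fall as the temperature rises in this data.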


Example: Correlation Between Two Columns

Instead of computing the whole correlation matrix, we can calculate the correlation between two specific columns.

import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}

df = pd.DataFrame(data)

# calculate correlation coefficient
correlation = df["Temperature"].corr(df["Ice_Cream_Sales"])
print(correlation)

Output

0.9234007664064656

In this example, we calculated the correlation between the Temperature and Ice_Cream_Sales columns.

The syntax for doing so is:

df['column1'].corr(df['column2'])
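
Correlation is symmetric, so the order of the two columns should not matter, and the result should match the corresponding entry of the full correlation matrix. The following sketch checks both; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125],
})

# correlation computed in both directions
a = df["Temperature"].corr(df["Ice_Cream_Sales"])
b = df["Ice_Cream_Sales"].corr(df["Temperature"])

# the same pair looked up in the correlation matrix
m = df.corr().loc["Temperature", "Ice_Cream_Sales"]

print(abs(a - b) < 1e-12)
print(abs(a - m) < 1e-12)
```

Both checks print True.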

Example: Missing Values

A DataFrame may contain missing values (NaN). The corr() method excludes them automatically: a pair of values is skipped if either value in the pair is NaN.

import pandas as pd
import numpy as np

# create a dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Coffee_Sales": [158, 145, np.nan, np.nan, 140]
}

df = pd.DataFrame(data)

# calculate correlation between Temperature and Coffee_Sales
correlation1 = df["Temperature"].corr(df["Coffee_Sales"])

print("With NaN values")
print(df)
print(f"correlation = {correlation1}")
print()

# remove missing values
df.dropna(inplace=True)

# calculate correlation between Temperature and Coffee_Sales
correlation2 = df["Temperature"].corr(df["Coffee_Sales"])

print("Without NaN values")
print(df)
print(f"correlation = {correlation2}")
print()

Output

With NaN values
   Temperature  Coffee_Sales
0           22         158.0
1           25         145.0
2           32           NaN
3           28           NaN
4           30         140.0
correlation = -0.923177938058926

Without NaN values
   Temperature  Coffee_Sales
0           22         158.0
1           25         145.0
4           30         140.0
correlation = -0.923177938058926

Notice that the correlation value is the same before and after removing the NaN values. This means that the NaN values are completely ignored by corr().

We used np.nan from the NumPy library to represent the missing values.
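
Because missing values are skipped silently, a coefficient may end up being based on very few observations. The corr() method accepts a min_periods argument that sets the minimum number of complete pairs required for a result. The sketch below reuses the data from this example; the threshold of 4 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Temperature": [22, 25, 32, 28, 30],
    "Coffee_Sales": [158, 145, np.nan, np.nan, 140],
})

# three complete pairs are available, so the default call returns a number
default_r = df["Temperature"].corr(df["Coffee_Sales"])

# require at least four complete pairs; only three exist, so the result is NaN
strict_r = df["Temperature"].corr(df["Coffee_Sales"], min_periods=4)

print(default_r)
print(strict_r)
```

The first value is the same coefficient as before, while the second is nan because too few complete pairs are available.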


Correlation Methods in Pandas

We can calculate correlation using three different methods in Pandas:

  • Pearson Method (Default): evaluates the linear relationship between two continuous variables
  • Kendall Method: measures the ordinal association between two measured quantities
  • Spearman Method: evaluates the monotonic relationship between two continuous or ordinal variables

By default, corr() computes the Pearson correlation coefficient, which measures the linear relationship between two variables.


Example: Pearson, Kendall and Spearman Methods

import pandas as pd

# create dataframe
data = {
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125]
}

df = pd.DataFrame(data)

# calculate different correlation coefficients
pearson = df["Temperature"].corr(df["Ice_Cream_Sales"])
kendall = df["Temperature"].corr(df["Ice_Cream_Sales"], method="kendall")
spearman = df["Temperature"].corr(df["Ice_Cream_Sales"], method="spearman")

# display different correlation coefficients
print(f"Pearson's Coefficient: {pearson}")
print(f"Kendall's Coefficient: {kendall}")
print(f"Spearman's Coefficient: {spearman}")

Output

Pearson's Coefficient: 0.9234007664064656
Kendall's Coefficient: 0.7999999999999999
Spearman's Coefficient: 0.8999999999999998

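Here, all three coefficients are positive and fairly high, so every method agrees that ice cream sales tend to rise with temperature. The values differ because Pearson's coefficient is computed from the raw values, while Kendall's and Spearman's coefficients are computed from the ranks of the values, so the three numbers are not directly comparable with one another.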
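
Spearman's coefficient is Pearson's coefficient applied to the ranks of the values rather than the values themselves. As a rough check of this relationship, the sketch below ranks both columns with rank() and compares the two results. It uses the DataFrame version of corr(); the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [22, 25, 32, 28, 30],
    "Ice_Cream_Sales": [105, 120, 135, 130, 125],
})

# Spearman's coefficient taken from the correlation matrix
spearman = df.corr(method="spearman").loc["Temperature", "Ice_Cream_Sales"]

# replace each value with its rank within its column (1 = smallest)
ranks = df.rank()

# Pearson's coefficient computed on the ranks
pearson_on_ranks = ranks["Temperature"].corr(ranks["Ice_Cream_Sales"])

print(round(spearman, 6))
print(round(pearson_on_ranks, 6))
```

Both lines print 0.9, matching the Spearman value in the output above.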


Perfect, Good & Bad Correlation

We can interpret the correlation values as:

Perfect Correlation

A perfect positive correlation implies that for every increase in one variable, there is a proportionate increase in the other variable, indicated by a coefficient of +1.

A perfect negative correlation, represented by -1, signifies that every increase in one variable is matched by a proportionate decrease in the other.

Figure: Perfect negative correlation
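
A perfect correlation occurs when one variable is an exact linear function of the other. The sketch below builds two such variables from made-up numbers (the slopes and offsets are arbitrary) and rounds the results to hide tiny floating-point differences.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

up = 2 * x + 3      # rises by 2 for every unit increase in x
down = -2 * x + 3   # falls by 2 for every unit increase in x

print(round(x.corr(up), 6))
print(round(x.corr(down), 6))
```

This prints 1.0 for the increasing line and -1.0 for the decreasing line.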

Good Correlation

As a rule of thumb, a good correlation has a magnitude between about 0.5 and 0.9 (positive or negative). It generally indicates a strong relationship between the variables, though not a perfect one.

Figure: Good negative correlation

Bad Correlation

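A bad correlation is typically close to zero, indicating that there is little or no linear relationship between the two variables. It does not rule out other forms of dependence: two variables can be strongly related in a non-linear way and still have a Pearson coefficient near zero.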

Figure: Bad correlation
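
A coefficient near zero only tells us about linear relationships. As an illustration with made-up numbers, the sketch below uses a variable y that is completely determined by x (y is the square of x), yet the Pearson coefficient is zero because the relationship is symmetric rather than linear.

```python
import pandas as pd

x = pd.Series([-2, -1, 0, 1, 2])
y = x ** 2   # y depends entirely on x, but not linearly

r = x.corr(y)

# the coefficient is zero (up to floating-point error)
print(abs(r) < 1e-12)
```

This prints True, which is why it helps to plot the data before concluding that two variables are unrelated.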