Pandas Handling Missing Values

In Pandas, missing values, often represented as NaN (Not a Number), can cause problems during data processing and analysis. These gaps in data can lead to incorrect analysis and misleading conclusions.

Pandas provides a host of functions like dropna(), fillna() and combine_first() to handle missing values.

Let's consider the following DataFrame to illustrate various techniques for handling missing data:

import pandas as pd
import numpy as np

# create dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

print(df)

Output

     A    B    C  D
0  1.0  NaN  1.0  1
1  2.0  2.0  2.0  2
2  NaN  3.0  3.0  3
3  4.0  4.0  NaN  4
4  5.0  5.0  5.0  5

Here, we have used the NumPy library to generate NaN values in the DataFrame.
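Before handling missing values, it is often useful to count them. A quick sketch using the same DataFrame and the isna() method:

```python
import pandas as pd
import numpy as np

# create dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# count missing values in each column
print(df.isna().sum())

# count missing values in the whole DataFrame
print(df.isna().sum().sum())
```

Here, isna() returns a DataFrame of booleans (True where a value is missing), and sum() counts the True values column by column.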


Remove Rows Containing Missing Values

One straightforward way to handle missing values is by removing them. Since the data sets we deal with are often large, eliminating a few rows typically has minimal impact on the final outcome.

We use the dropna() function to remove rows containing at least one missing value. For example,

import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# remove rows with missing values
df.dropna(inplace=True)

print(df)

Output

     A    B    C  D
1  2.0  2.0  2.0  2
4  5.0  5.0  5.0  5

In this example, we removed all the rows containing NaN values using dropna(), which detects and drops every row with at least one missing value.

Here, inplace=True specifies that changes are to be made in the original DataFrame itself.
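The dropna() method also accepts parameters that control which rows are dropped. A short sketch using the same DataFrame, with the subset and how parameters:

```python
import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# drop a row only if column A is missing a value
dropped_a = df.dropna(subset=['A'])

# drop a row only if every value in it is missing
dropped_all = df.dropna(how='all')

print(dropped_a)
print(dropped_all)
```

Here, subset=['A'] ignores NaN values in other columns, so only row 2 is dropped. Since no row is entirely NaN, how='all' keeps all five rows.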


Replace Missing Values

Instead of deleting the entire row containing missing values, we can replace the missing values with a specified value using fillna().

Let's look at an example.

import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# replace missing values with 0
df.fillna(value=0, inplace=True)

print(df)

Output

     A    B    C  D
0  1.0  0.0  1.0  1
1  2.0  2.0  2.0  2
2  0.0  3.0  3.0  3
3  4.0  4.0  0.0  4
4  5.0  5.0  5.0  5

In this example, we replaced the NaN values with 0 using fillna().
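Missing values can also be filled from neighboring rows rather than with a constant. The ffill() method copies the last valid value forward, and bfill() copies the next valid value backward:

```python
import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# forward fill: copy the previous row's value into each gap
filled_forward = df.ffill()

# backward fill: copy the next row's value into each gap
filled_backward = df.bfill()

print(filled_forward)
print(filled_backward)
```

Note that ffill() cannot fill a NaN in the first row (there is no previous value to copy), just as bfill() cannot fill one in the last row.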


Replace Missing Values With Mean, Median and Mode

A more refined approach is to replace missing values with the mean, median, or mode of the remaining values in the column. This can give a more accurate representation than just replacing it with a default value.

We can use the fillna() function with aggregate functions to replace missing values with mean, median or mode.

Let's look at an example.

import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# replace missing values with mean
df['A'] = df['A'].fillna(value=df['A'].mean())

# replace missing values with median
df['B'] = df['B'].fillna(value=df['B'].median())

# replace missing values with mode
df['C'] = df['C'].fillna(value=df['C'].mode()[0])

print(df)

Output

     A    B    C  D
0  1.0  3.5  1.0  1
1  2.0  2.0  2.0  2
2  3.0  3.0  3.0  3
3  4.0  4.0  1.0  4
4  5.0  5.0  5.0  5

In this example, we replaced the missing values of A, B and C columns with their mean, median and mode respectively.

Here, mode()[0] returns the most frequent value. Since all the values in column C occur equally often, every value is a mode, and mode()[0] picks the first (smallest) one.
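Another common option for numeric data is interpolation, which estimates each missing value from its neighbors instead of a single column-wide statistic. A minimal sketch using interpolate() on a Series:

```python
import pandas as pd
import numpy as np

# series with gaps between known values
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# linear interpolation (the default) fills each gap with the
# value halfway between its neighbors
print(s.interpolate())
```

This works well when the data has a natural ordering, such as time series, where a gap is likely to lie between its neighboring values.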


Replace Values Using Another DataFrame

We can replace missing values in one DataFrame with corresponding values from another DataFrame using the fillna() method.

Let's look at an example.

import pandas as pd
import numpy as np

# create a dataframe with missing values
data1 = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df1 = pd.DataFrame(data1)

# create a DataFrame to fill the missing values from
data2 = {
    'A': [10, 20, 30, 40, 50],
    'B': [10, 20, 30, 40, 50],
    'C': [10, 20, 30, 40, 50],
    'D': [10, 20, 30, 40, 50]
}
df2 = pd.DataFrame(data2)

# replace missing values
df1.fillna(df2, inplace=True)

print(df1)

Output

      A     B     C  D
0   1.0  10.0   1.0  1
1   2.0   2.0   2.0  2
2  30.0   3.0   3.0  3
3   4.0   4.0  40.0  4
4   5.0   5.0   5.0  5

Here, we have two DataFrames, df1 and df2. The fillna() method replaces each missing value in df1 with the value at the same position in df2.
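The combine_first() method mentioned earlier does the same job from the other direction: it keeps the caller's values and falls back to the argument's values wherever the caller has NaN. A small sketch:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6]})
df2 = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})

# keep df1's values, filling gaps from df2
result = df1.combine_first(df2)
print(result)
```

Unlike fillna(), combine_first() also aligns on row and column labels, so it can patch values even when the two DataFrames have different shapes.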


Frequently Asked Questions

How to remove columns containing only NaN values?

We can remove the columns containing only NaN values by first selecting such columns using the isnull() and all() methods, and then dropping them.

Let's look at an example.

import pandas as pd
import numpy as np

# create a DataFrame
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, np.nan],
    'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# check which columns contain only NaN values
columns_with_nan = df.columns[df.isnull().all()]

# drop the columns containing only NaN values
df = df.drop(columns=columns_with_nan)

print(df)

Output

   A    B   D
0  1  5.0   9
1  2  6.0  10
2  3  NaN  11
3  4  NaN  12

Here,

  • df.columns[df.isnull().all()] - returns the columns (as an Index) in which every value is null
  • df.drop() - removes the specified columns

Since column C has NaN values only, it is removed.
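The same result can be achieved in one step with dropna(), which accepts an axis parameter to operate on columns instead of rows:

```python
import pandas as pd
import numpy as np

# create a DataFrame
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, np.nan],
    'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# drop columns (axis=1) in which every value is NaN
df = df.dropna(axis=1, how='all')
print(df)
```

Here, axis=1 tells dropna() to drop columns rather than rows, and how='all' drops a column only when all of its values are missing.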

How to remove columns containing NaN values that exceed a certain number?
import pandas as pd
import numpy as np

# create a DataFrame
data = {
    'A': [1, 2, 3, np.nan],
    'B': [5, 6, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, 7],
    'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# set the threshold for the maximum number of NaN values allowed
threshold = 2

# calculate the number of NaN values in each column
nan_counts = df.isnull().sum()

# remove columns that have more NaN values than the threshold
columns_to_drop = nan_counts[nan_counts > threshold].index
df = df.drop(columns=columns_to_drop)

print(df)

Output

     A    B   D
0  1.0  5.0   9
1  2.0  6.0  10
2  3.0  NaN  11
3  NaN  NaN  12

Here,

  • df.isnull().sum() - returns a Series containing the count of NaN values in each column
  • nan_counts[nan_counts > threshold].index - returns the labels of the columns whose NaN count exceeds the threshold value
  • df.drop() - removes the specified columns

The above code removes the columns containing more than two NaN values.
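The same threshold rule can be expressed directly with dropna() using its thresh parameter, which keeps a column only if it has at least that many non-NaN values. With 4 rows and a limit of 2 NaN values, that means requiring at least 2 non-NaN values:

```python
import pandas as pd
import numpy as np

# create a DataFrame
data = {
    'A': [1, 2, 3, np.nan],
    'B': [5, 6, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, 7],
    'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# keep a column only if it has at least 2 non-NaN values
# (equivalent here to dropping columns with more than 2 NaN values)
df = df.dropna(axis=1, thresh=2)
print(df)
```

Note that thresh counts *non-missing* values, so the cutoff is the number of rows minus the maximum number of NaN values allowed.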