Pandas Handling Missing Values

In Pandas, missing values, often represented as NaN (Not a Number), can cause problems during data processing and analysis. These gaps in data can lead to incorrect analysis and misleading conclusions.

Pandas provides a host of functions like dropna(), fillna() and combine_first() to handle missing values.

Let's consider the following DataFrame to illustrate various techniques for handling missing data:

import pandas as pd
import numpy as np

# create dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

print(df)

Output

     A    B    C  D
0  1.0  NaN  1.0  1
1  2.0  2.0  2.0  2
2  NaN  3.0  3.0  3
3  4.0  4.0  NaN  4
4  5.0  5.0  5.0  5

Here, we have used the NumPy library to generate NaN values in the DataFrame.
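Before handling missing values, it is often useful to count them. A quick sketch using the same DataFrame and the isna() method:

```python
import pandas as pd
import numpy as np

# create dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# count missing values in each column
print(df.isna().sum())

# count missing values in the whole DataFrame
print(df.isna().sum().sum())
```

Here, isna() returns a DataFrame of booleans (True where a value is missing), and sum() counts the True values column by column.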


Remove Rows Containing Missing Values

One straightforward way to handle missing values is by removing them. Since the data sets we deal with are often large, eliminating a few rows typically has minimal impact on the final outcome.

We use the dropna() function to remove rows containing at least one missing value. For example,

import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# remove rows with missing values
df.dropna(inplace=True)

print(df)

Output

     A    B    C  D
1  2.0  2.0  2.0  2
4  5.0  5.0  5.0  5

In this example, we removed all the rows containing NaN values using dropna(), which detects and drops every row with at least one missing value.

Here, inplace=True specifies that changes are to be made in the original DataFrame itself.
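The dropna() method also accepts parameters that control which rows are dropped. A short sketch using the same DataFrame, with the subset and how parameters:

```python
import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# drop a row only if column A is missing a value
dropped_a = df.dropna(subset=['A'])

# drop a row only if every value in it is missing
dropped_all = df.dropna(how='all')

print(dropped_a)
print(dropped_all)
```

Here, subset=['A'] ignores NaN values in other columns, so only row 2 is dropped. Since no row is entirely NaN, how='all' keeps all five rows.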


Replace Missing Values

Instead of deleting the entire row containing missing values, we can replace the missing values with a specified value using fillna().

Let's look at an example.

import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# replace missing values with 0
df.fillna(value=0, inplace=True)

print(df)

Output

     A    B    C  D
0  1.0  0.0  1.0  1
1  2.0  2.0  2.0  2
2  0.0  3.0  3.0  3
3  4.0  4.0  0.0  4
4  5.0  5.0  5.0  5

In this example, we replaced the NaN values with 0 using fillna().
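Missing values can also be filled from neighboring rows rather than with a constant. The ffill() method copies the last valid value forward, and bfill() copies the next valid value backward:

```python
import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# forward fill: copy the previous row's value into each gap
filled_forward = df.ffill()

# backward fill: copy the next row's value into each gap
filled_backward = df.bfill()

print(filled_forward)
print(filled_backward)
```

Note that ffill() cannot fill a NaN in the first row (there is no previous value to copy), just as bfill() cannot fill one in the last row.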


Replace Missing Values With Mean, Median and Mode

A more refined approach is to replace missing values with the mean, median, or mode of the remaining values in the column. This can give a more accurate representation than just replacing it with a default value.

We can use the fillna() function with aggregate functions to replace missing values with mean, median or mode.

Let's look at an example.

import pandas as pd
import numpy as np

# create a dataframe with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# replace missing values with mean
df['A'] = df['A'].fillna(value=df['A'].mean())

# replace missing values with median
df['B'] = df['B'].fillna(value=df['B'].median())

# replace missing values with mode
df['C'] = df['C'].fillna(value=df['C'].mode()[0])

print(df)

Output

     A    B    C  D
0  1.0  3.5  1.0  1
1  2.0  2.0  2.0  2
2  3.0  3.0  3.0  3
3  4.0  4.0  1.0  4
4  5.0  5.0  5.0  5

In this example, we replaced the missing values of A, B and C columns with their mean, median and mode respectively.

Here, mode()[0] returns the most frequent value. Since all the values in column C occur equally often, every value is a mode, and mode()[0] picks the first (smallest) one.
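Another common option for numeric data is interpolation, which estimates each missing value from its neighbors instead of a single column-wide statistic. A minimal sketch using interpolate() on a Series:

```python
import pandas as pd
import numpy as np

# series with gaps between known values
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# linear interpolation (the default) fills each gap with the
# value halfway between its neighbors
print(s.interpolate())
```

This works well when the data has a natural ordering, such as time series, where a gap is likely to lie between its neighboring values.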


Replace Values Using Another DataFrame

We can replace missing values in one DataFrame with corresponding values from another DataFrame using the fillna() method.

Let's look at an example.

import pandas as pd
import numpy as np

# create a dataframe with missing values
data1 = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, 5],
    'D': [1, 2, 3, 4, 5]
}
df1 = pd.DataFrame(data1)

# create a DataFrame to fill the missing values from
data2 = {
    'A': [10, 20, 30, 40, 50],
    'B': [10, 20, 30, 40, 50],
    'C': [10, 20, 30, 40, 50],
    'D': [10, 20, 30, 40, 50]
}
df2 = pd.DataFrame(data2)

# replace missing values
df1.fillna(df2, inplace=True)

print(df1)

Output

      A     B     C  D
0   1.0  10.0   1.0  1
1   2.0   2.0   2.0  2
2  30.0   3.0   3.0  3
3   4.0   4.0  40.0  4
4   5.0   5.0   5.0  5

Here, we have two DataFrames, df1 and df2. The fillna() method replaces each missing value in df1 with the value at the same position in df2.
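The combine_first() method mentioned earlier does the same job from the other direction: it keeps the caller's values and falls back to the argument's values wherever the caller has NaN. A small sketch:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6]})
df2 = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})

# keep df1's values, filling gaps from df2
result = df1.combine_first(df2)
print(result)
```

Unlike fillna(), combine_first() also aligns on row and column labels, so it can patch values even when the two DataFrames have different shapes.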


Frequently Asked Questions

How to remove columns containing only NaN values?

We can remove the columns containing only NaN values by first selecting such columns using the isnull() and all() methods, and then dropping them.

Let's look at an example.

import pandas as pd
import numpy as np

# create a DataFrame
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, np.nan],
    'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# check which columns contain only NaN values
columns_with_nan = df.columns[df.isnull().all()]

# drop the columns containing only NaN values
df = df.drop(columns=columns_with_nan)

print(df)

Output

   A    B   D
0  1  5.0   9
1  2  6.0  10
2  3  NaN  11
3  4  NaN  12

Here,

  • df.columns[df.isnull().all()] - returns the columns (as an Index) in which every value is null
  • df.drop() - removes the specified columns

Since column C has NaN values only, it is removed.
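The same result can be achieved in one step with dropna(), which accepts an axis parameter to operate on columns instead of rows:

```python
import pandas as pd
import numpy as np

# create a DataFrame
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, np.nan],
    'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# drop columns (axis=1) in which every value is NaN
df = df.dropna(axis=1, how='all')
print(df)
```

Here, axis=1 tells dropna() to drop columns rather than rows, and how='all' drops a column only when all of its values are missing.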

How to remove columns containing NaN values that exceed a certain number?
import pandas as pd
import numpy as np

# create a DataFrame
data = {
    'A': [1, 2, 3, np.nan],
    'B': [5, 6, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, 7],
    'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# set the threshold for the maximum number of NaN values allowed
threshold = 2

# calculate the number of NaN values in each column
nan_counts = df.isnull().sum()

# remove columns that have more NaN values than the threshold
columns_to_drop = nan_counts[nan_counts > threshold].index
df = df.drop(columns=columns_to_drop)

print(df)

Output

     A    B   D
0  1.0  5.0   9
1  2.0  6.0  10
2  3.0  NaN  11
3  NaN  NaN  12

Here,

  • df.isnull().sum() - returns a Series containing the count of NaN values in each column
  • nan_counts[nan_counts > threshold].index - returns the labels of the columns whose NaN count exceeds the threshold value
  • df.drop() - removes the specified columns

The above code removes the columns containing more than two NaN values.
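The same threshold rule can be expressed directly with dropna() using its thresh parameter, which keeps a column only if it has at least that many non-NaN values. With 4 rows and a limit of 2 NaN values, that means requiring at least 2 non-NaN values:

```python
import pandas as pd
import numpy as np

# create a DataFrame
data = {
    'A': [1, 2, 3, np.nan],
    'B': [5, 6, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan, 7],
    'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# keep a column only if it has at least 2 non-NaN values
# (equivalent here to dropping columns with more than 2 NaN values)
df = df.dropna(axis=1, thresh=2)
print(df)
```

Note that thresh counts *non-missing* values, so the cutoff is the number of rows minus the maximum number of NaN values allowed.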