Certification Courses

Created with over a decade of experience and thousands of feedback.

Learn Python

Learn HTML

Learn JavaScript

Learn SQL

Learn DSA

View all Courses on

Learn C

Learn C++

Learn Java

Pandas Data Cleaning

Data cleaning means fixing and organizing messy data. Pandas offers a wide range of tools and functions to help us clean and preprocess our data effectively.

Data cleaning often involves:

Dropping irrelevant columns.
Renaming column names to meaningful names.
Making data values consistent.
Replacing or filling in missing values.

Drop Rows With Missing Values

In Pandas, we can drop rows with missing values using the dropna() function. For example,

import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5]
}

df = pd.DataFrame(data)
print("Original Data:\n",df)
print()

# use dropna() to remove rows with any missing values
df_cleaned = df.dropna()

print("Cleaned Data:\n",df_cleaned)

Output

Original Data:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0

Cleaned Data:
    A    B    C
1  2.0  2.0  2.0
4  5.0  5.0  5.0

Here, we have used the dropna() method to remove rows with any missing values. The resulting DataFrame df_cleaned will only contain rows without any missing values

Fill Missing Values

To fill the missing values in Pandas, we use the fillna() function. For example,

import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5]
}

df = pd.DataFrame(data)

print("Original Data:\n", df)

# filling NaN values with 0
df.fillna(0, inplace=True)

print("\nData after filling NaN with 0:\n", df)

Output

Original Data:
    A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0
Data after filling NaN with 0:
   A    B    C
0  1.0  0.0  1.0
1  2.0  2.0  2.0
2  3.0  3.0  0.0
3  0.0  4.0  0.0
4  5.0  5.0  5.0

Here, we used data.fillna() to fill the missing values(NaN) in each column with 0.

Note: The inplace=True argument here means that the operation will modify the DataFrame directly, rather than returning a new DataFrame with the modifications.

Use Aggregate Functions to Fill Missing Values

Instead of filling with 0, we can also use aggregate functions to fill missing values.

Let's look at an example to fill missing values with the mean of each column.

import pandas as pd

# define a dictionary with sample data which includes some missing values
data = {
    'A': [1, 2, 3, None, 5],  
    'B': [None, 2, 3, 4, 5],  
    'C': [1, 2, None, None, 5]
}

df = pd.DataFrame(data)

print("Original Data:\n", df)

# filling NaN values with the mean of each column
df.fillna(df.mean(), inplace=True)

print("\nData after filling NaN with mean:\n", df)

Output

Original Data:
    A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  2.0
2  3.0  3.0  NaN
3  NaN  4.0  NaN
4  5.0  5.0  5.0
Data after filling NaN with mean:
     A    B         C
0  1.00  3.5  1.000000
1  2.00  2.0  2.000000
2  3.00  3.0  2.666667
3  2.75  4.0  2.666667
4  5.00  5.0  5.000000

Here, the df.mean() calculates the mean for each column, and the fillna() method then replaces NaN values in each column with the respective mean.

Handle Duplicates Values

In Pandas, to handle duplicate rows, we can use the duplicated() and the drop_duplicates() function.

duplicated() - to check for duplicates
drop_duplicates() - remove duplicate rows

import pandas as pd

# sample data
data = {
    'A': [1, 2, 2, 3, 3, 4],
    'B': [5, 6, 6, 7, 8, 8]
}
df = pd.DataFrame(data)

print("Original DataFrame:\n", df.to_string(index=False))

# detect duplicates
print("\nDuplicate Rows:\n", df[df.duplicated()].to_string(index=False))

# remove duplicates based on column 'A'
df.drop_duplicates(subset=['A'], keep='first', inplace=True)

print("\nDataFrame after removing duplicates based on column 'A':\n", df.to_string(index=False))

Output

Original DataFrame:
 A  B
 1  5
 2  6
 2  6
 3  7
 3  8
 4  8
Duplicate Rows:
 A  B
 2  6
DataFrame after removing duplicates based on column 'A':
 A  B
 1  5
 2  6
 3  7
 4  8

Hrere,

df[df.duplicated()] produces a boolean Series to identify duplicate rows.
df.drop_duplicates(subset=['A'], keep='first', inplace=True), removes duplicates based on column A, retaining only the first occurrence of each duplicate directly in the original DataFrame.

Rename Column Names to Meaningful Names

To rename column names to more meaningful names in Pandas, we can use the rename() function. For example,

import pandas as pd

# sample data
data = {
    'A': [25, 30, 35],
    'B': ['John', 'Doe', 'Smith'],
    'C': [50000, 60000, 70000]
}

df = pd.DataFrame(data)

# rename columns
df.rename(columns={'A': 'Age', 'B': 'Name', 'C': 'Salary'}, inplace=True)

print(df.to_string(index=False))

Output

Age  Name  Salary
  25  John   50000
  30   Doe   60000
  35 Smith   70000

Here, the columns of df are renamed from A, B, and C to more meaningful names Age, Name, and Salary respectively.

Introduction
Drop Rows With Missing Values
Fill Missing Values
Use Aggregate Functions to Fill Missing Values
Handle Duplicates Values
Rename Column Names to Meaningful Names

Previous Tutorial:

Pandas Concat

Next Tutorial:

Pandas Missing Values

Our premium learning platform, created with over a decade of experience and thousands of feedbacks.

Learn and improve your coding skills like never before.

Try Programiz PRO

Interactive Courses
Certificates
AI Help
2000+ Challenges

Popular Tutorials

Popular Examples

Reference Materials

Certification Courses

Learn Python practically
and Get Certified.

Popular Tutorials

Reference Materials

Popular Examples

Introduction

Dataframe Operations and Manipulations

Data Import and Export

Data Cleaning

Data Analysis and Aggregation

Data Visualization

Pandas Data Cleaning

Drop Rows With Missing Values

Fill Missing Values

Use Aggregate Functions to Fill Missing Values

Handle Duplicates Values

Rename Column Names to Meaningful Names

Table of Contents

Popular Tutorials

Popular Examples

Reference Materials

Certification Courses

Learn Python practically and Get Certified.

Popular Tutorials

Reference Materials

Popular Examples

Introduction

Dataframe Operations and Manipulations

Data Import and Export

Data Cleaning

Data Analysis and Aggregation

Data Visualization

Pandas Data Cleaning

Drop Rows With Missing Values

Fill Missing Values

Use Aggregate Functions to Fill Missing Values

Handle Duplicates Values

Rename Column Names to Meaningful Names

Table of Contents

Learn Python practically
and Get Certified.