Certification Courses

Created with over a decade of experience and thousands of feedback.

Learn Python

Learn HTML

Learn JavaScript

Learn SQL

Learn DSA

View all Courses on

Learn C

Learn C++

Learn Java

Pandas Handling Wrong Data

Sometimes, a dataset can have inaccurate entries due to reasons such as human errors during data input, sourcing data from unreliable places, etc.

This can significantly undermine the quality and reliability of the data analysis performed on it.

Let's take a DataFrame containing data about students of an all-boys elementary school.

      Name  Age Gender  Standard
0     John    8      M         3
1  Michael    9      M         4
2      Tom    7      M        12
3     Alex   80      F         3
4     Ryan  100      M         5

In the above DataFrame, we can see a few obvious mistakes like:

The ages of two students are listed as 80 and 100, which is too old for primary school students.
The gender of Alex is marked as F. Since this is an all-boys school, it is obviously a mistake.
Tom is listed as being in the 12th standard, which is simply not possible in an elementary school context.

We can handle such wrong data in the following ways:

Replace Individual Values
Replace Values Based on a Condition
Remove Wrong Values

Replace Individual Values

We can see that the value F for Gender column is an obvious mistake. Let's replace F with M to rectify the error.

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
    'Age': [8, 9, 7, 80, 100],
    'Gender': ['M', 'M', 'M', 'F', 'M'],
    'Standard': [3, 4, 12, 3, 5]
}
df = pd.DataFrame(data)

# replace F with M
df.loc[3, 'Gender'] = 'M'

print(df)

Output

      Name  Age Gender  Standard
0     John    8      M         3
1  Michael    9      M         4
2      Tom    7      M        12
3     Alex   80      M         3
4     Ryan  100      M         5

In this example, we replaced the wrong value with the right value using df.loc[].

While this method is effective for small datasets, it becomes tedious as the size of the dataset grows.

Replace Values Based on a Condition

In cases where the values need to satisfy a particular condition, we can iterate through the values to check if the condition is satisfied and to make changes accordingly.

For example, the maximum age of students in this elementary school is 14. But there are two students whose ages are 80 and 100. This looks like a typo where an additional zero has been added unintentionally.

Let's fix the error.

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
    'Age': [8, 9, 7, 80, 100],
    'Gender': ['M', 'M', 'M', 'M', 'M'],
    'Standard': [3, 4, 12, 3, 5]
}
df = pd.DataFrame(data)

# replace values based on conditions
for i  in df.index:
    age_val = df.loc[i, 'Age']
    if (age_val > 14) and (age_val%10 == 0):
        df.loc[i, 'Age'] = age_val/10

print(df)

Output

      Name  Age Gender  Standard
0     John    8      M         3
1  Michael    9      M         4
2      Tom    7      M        12
3     Alex    8      M         3
4     Ryan   10      M         5

In the above example, we replaced ages greater than 14 which are a multiple of 10 by removing a zero from the original values.

We checked if the age is a multiple of 10 because it is highly likely that a higher age which is a multiple of 10 is a typo.

Remove Wrong Values

Some values can't be corrected. For example, the value 12 in Standard doesn't make any sense as this is an elementary school. But we can't also correct it because we don't know the right standard.

So we need to remove this whole row and any other row with such values.

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan'],
    'Age': [8, 9, 7, 8, 10],
    'Gender': ['M', 'M', 'M', 'M', 'M'],
    'Standard': [3, 4, 12, 3, 5]
}
df = pd.DataFrame(data)

# remove mistaken values
for i in df.index:
    if df.loc[i,'Standard'] > 8:
        df.drop(i, inplace=True)

print(df)

Output

      Name  Age Gender  Standard
0     John    8      M         3
1  Michael    9      M         4
3     Alex    8      M         3
4     Ryan   10      M         5

In this example, we removed the row(s) containing values greater than 8 in the Standard column.

Introduction
Replace Individual Values
Replace Values Based on a Condition
Remove Wrong Values

Previous Tutorial:

Pandas Handling Wrong Format

Next Tutorial:

Pandas Get Dummies

Our premium learning platform, created with over a decade of experience and thousands of feedbacks.

Learn and improve your coding skills like never before.

Try Programiz PRO

Interactive Courses
Certificates
AI Help
2000+ Challenges

Popular Tutorials

Popular Examples

Reference Materials

Certification Courses

Become a certified Python
programmer.

Popular Tutorials

Reference Materials

Popular Examples

Introduction

Dataframe Operations and Manipulations

Data Import and Export

Data Cleaning

Data Analysis and Aggregation

Data Visualization

Pandas Handling Wrong Data

Replace Individual Values

Replace Values Based on a Condition

Remove Wrong Values

Table of Contents

Popular Tutorials

Popular Examples

Reference Materials

Certification Courses

Become a certified Python programmer.

Popular Tutorials

Reference Materials

Popular Examples

Introduction

Dataframe Operations and Manipulations

Data Import and Export

Data Cleaning

Data Analysis and Aggregation

Data Visualization

Pandas Handling Wrong Data

Replace Individual Values

Replace Values Based on a Condition

Remove Wrong Values

Table of Contents

Become a certified Python
programmer.