Pandas Handling Duplicate Values

In large datasets, we often encounter duplicate entries in tables. These duplicate entries can throw off our analysis and skew the results.

Pandas provides several methods to find and remove duplicate entries in DataFrames.


Find Duplicate Entries

We can find duplicate entries in a DataFrame using the duplicated() method. It returns True if a row is duplicated and returns False otherwise.

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# check for duplicate entries
print(df.duplicated())

Output

0    False
1    False
2     True
3     True
4    False
dtype: bool

In the above example, we checked for duplicate entries in df using the duplicated() method. It returned a boolean Series indicating whether each row is a duplicate of an earlier row.

Here, we got True in the third and the fourth rows because they are duplicates of the first and the second rows respectively.
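Since duplicated() returns a boolean Series, we can also use it to select the duplicate rows themselves with boolean indexing. Below is a minimal sketch that rebuilds the same df as above:

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# select only the rows flagged as duplicates
print(df[df.duplicated()])

This prints just the third and the fourth rows, the ones flagged True above.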


Example: Find Duplicates Based on Columns

By default, duplicated() considers all columns. To find duplicates based on certain columns only, we can pass them as a list to the subset argument of duplicated().

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# check for duplicate entries in columns Name and Age
print(df.duplicated(subset=['Name', 'Age']))

Output

0    False
1    False
2    False
3     True
4    False
dtype: bool

In this example, we checked for duplicate entries based on the Name and Age columns only.

If you look at the Name and Age columns, the fourth row is a duplicate of the second row. Hence, the boolean value of the fourth row is True in the output.
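duplicated() also accepts a keep argument. Passing keep=False marks every occurrence of a duplicate as True, not just the later ones. Here is a short sketch using the same data as above:

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# keep=False flags both the earlier and the later occurrences
print(df.duplicated(subset=['Name', 'Age'], keep=False))

Now both the second and the fourth rows are True, since the pair Anna, 24 appears twice.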


Remove Duplicate Entries

We can remove duplicate entries in Pandas using the drop_duplicates() method. For example,

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# remove duplicates
df.drop_duplicates(inplace=True)
print(df)

Output

   Name  Age         City
0  John   28     New York
1  Anna   24  Los Angeles
4  John   19      Chicago

In this example, we removed duplicate entries from df using drop_duplicates().

Here, inplace=True specifies that the changes are to be made in the original dataframe.

Notice that the drop_duplicates() method keeps the first occurrence of each duplicate and removes the later ones by default. Here, the first and the second rows are kept while the third and the fourth rows are removed.
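If we prefer not to modify the original dataframe, we can skip inplace=True and assign the returned copy instead. A small sketch of that alternative:

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# without inplace=True, drop_duplicates() returns a new DataFrame
deduped = df.drop_duplicates()
print(deduped)

The printed result is the same as above, but df itself is left unchanged.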

To keep the last entry, we can pass the keep='last' argument. For example,

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# remove duplicates, keep last entries
df.drop_duplicates(keep='last', inplace=True)
print(df)

Output

   Name  Age         City
2  John   28     New York
3  Anna   24  Los Angeles
4  John   19      Chicago
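In this example, drop_duplicates() kept the last occurrence of each duplicate, so the third, the fourth, and the fifth rows remain.

Like duplicated(), the drop_duplicates() method also accepts a subset argument, so we can remove rows that repeat only in selected columns. A minimal sketch, deduplicating on Name and Age:

import pandas as pd

# create dataframe
data = {
    'Name': ['John', 'Anna', 'Johnny', 'Anna', 'John'],
    'Age': [28, 24, 28, 24, 19],
    'City': ['New York', 'Las Vegas', 'New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# drop rows whose Name and Age match an earlier row
df.drop_duplicates(subset=['Name', 'Age'], inplace=True)
print(df)

Here, only the fourth row is removed, because its Name and Age match the second row even though the City values differ.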