The drop_duplicates() method in Pandas is used to drop duplicate rows from a DataFrame.
Example
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
'Age': [25, 30, 25, 35, 30]}
df = pd.DataFrame(data)
# drop duplicate rows based on all columns
result = df.drop_duplicates()
# display the result
print(result)
'''
Output
Name Age
0 Alice 25
1 Bob 30
3 Charlie 35
'''
drop_duplicates() Syntax
The syntax of the drop_duplicates() method in Pandas is:
df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
drop_duplicates() Arguments
The drop_duplicates() method takes the following arguments:
- subset (optional) - a column label or list of column labels to consider when identifying duplicates; by default, all columns are used
- keep (optional) - specifies which duplicates to keep ('first', 'last', or False)
- inplace (optional) - if True, modifies the original DataFrame in place; if False, returns a new DataFrame
- ignore_index (optional) - if True, resets the index of the resulting DataFrame to 0, 1, ..., n - 1
drop_duplicates() Return Value
The drop_duplicates() method in Pandas returns a new DataFrame with duplicate rows removed. If inplace=True is passed, it modifies the original DataFrame instead and returns None.
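The difference between the two return modes is easy to miss, so here is a minimal sketch of it. The small DataFrame and the variable names result and returned are made up for illustration.

```python
import pandas as pd

# hypothetical sample data with one repeated row
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]})

# default (inplace=False): a new DataFrame is returned, df is left unchanged
result = df.drop_duplicates()
print(len(df), len(result))   # 3 2

# inplace=True: df itself is modified and the call returns None
returned = df.drop_duplicates(inplace=True)
print(returned, len(df))      # None 2
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace the DataFrame with None.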
Example 1: Remove Duplicate Rows Across All Columns
import pandas as pd
# create a sample DataFrame with duplicate data
data = {
'Student_ID': [1, 2, 3, 2, 4, 1, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)
# drop duplicate rows based on all columns
# keeping the first occurrence
result = df.drop_duplicates()
# display the result
print(result)
Output
Student_ID Name Age
0 1 Alice 18
1 2 Bob 19
2 3 Charlie 20
4 4 David 21
6 5 Eve 22
In the above example, we have used the drop_duplicates() method to remove duplicate rows across all columns, keeping only the first occurrence of each unique row.
It removes the following duplicate rows:
- Row with Student_ID: 2, Name: Bob, Age: 19 (second occurrence of Bob, at index 3)
- Row with Student_ID: 1, Name: Alice, Age: 18 (second occurrence of Alice, at index 5)
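To see which rows would be removed before dropping them, the related duplicated() method can be used. The sketch below reuses the data from this example; the variable name mask is only illustrative.

```python
import pandas as pd

data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)

# True marks a row that repeats an earlier row (keep='first' is the default)
mask = df.duplicated()
print(mask.tolist())
# [False, False, False, True, False, True, False]

# the rows that drop_duplicates() would remove (index 3 and 5)
print(df[mask])

# filtering out the marked rows gives the same result as drop_duplicates()
print(df[~mask].equals(df.drop_duplicates()))   # True
```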
Example 2: Drop Duplicate Rows Based on Subset of Columns
import pandas as pd
# create a sample DataFrame with duplicate data
data = {
'Student_ID': [1, 2, 3, 2, 4, 1, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)
# drop duplicate rows based on a subset of columns ('Student_ID' and 'Name')
# keeping the first occurrence, and modify original DataFrame in place
df.drop_duplicates(subset=['Student_ID', 'Name'], inplace=True)
# display the result
print(df)
Output
Student_ID Name Age
0 1 Alice 18
1 2 Bob 19
2 3 Charlie 20
4 4 David 21
6 5 Eve 22
In this example, we have used the drop_duplicates() method with the subset parameter set to ['Student_ID', 'Name'].
This means that duplicates are identified and removed based on the combination of the Student_ID and Name columns.
Here, the inplace=True argument indicates that the original DataFrame df is modified in place, and no new DataFrame is returned.
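With the data above, every duplicate row matches in all three columns, so subset makes no visible difference. The following sketch uses a small, made-up DataFrame in which Bob appears twice with different ages, so that the effect of subset shows up.

```python
import pandas as pd

# hypothetical data: the two Bob rows differ only in Age
df = pd.DataFrame({
    'Student_ID': [1, 2, 2, 3],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [18, 19, 20, 20]
})

# comparing all columns: no two rows are identical, so nothing is dropped
all_cols = df.drop_duplicates()
print(len(all_cols))            # 4

# comparing only Student_ID and Name: the second Bob row is dropped
by_key = df.drop_duplicates(subset=['Student_ID', 'Name'])
print(len(by_key))              # 3
print(by_key['Age'].tolist())   # [18, 19, 20]
```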
Example 3: Use of keep argument in drop_duplicates()
The keep argument specifies which duplicate rows to keep. It can take one of the following values:
- 'first' - keep the first occurrence of each duplicated row (default behavior)
- 'last' - keep the last occurrence of each duplicated row
- False - drop every row that has a duplicate, keeping none of the occurrences
Let's look at an example.
import pandas as pd
# create a sample DataFrame with duplicate data
data = {
'Student_ID': [1, 2, 3, 2, 4, 1, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)
# keep the first occurrence of each duplicate
df_keep_first = df.drop_duplicates(keep='first')
print("Keep the first occurrence:")
print(df_keep_first)
print()
# keep the last occurrence of each duplicate
df_keep_last = df.drop_duplicates(keep='last')
print("\nKeep the last occurrence:")
print(df_keep_last)
print()
# remove all duplicates
df_remove_all = df.drop_duplicates(keep=False)
print("\nRemove all duplicates:")
print(df_remove_all)
Output
Keep the first occurrence:
Student_ID Name Age
0 1 Alice 18
1 2 Bob 19
2 3 Charlie 20
4 4 David 21
6 5 Eve 22
Keep the last occurrence:
Student_ID Name Age
2 3 Charlie 20
3 2 Bob 19
4 4 David 21
5 1 Alice 18
6 5 Eve 22
Remove all duplicates:
Student_ID Name Age
2 3 Charlie 20
4 4 David 21
6 5 Eve 22
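The keep argument can also be combined with subset, in which case it decides which of the conflicting rows survives. A minimal sketch, assuming a made-up DataFrame where the two Bob rows have different ages:

```python
import pandas as pd

# hypothetical data: the two Bob rows share a key but differ in Age
df = pd.DataFrame({
    'Student_ID': [1, 2, 2, 3],
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [18, 19, 20, 20]
})
key = ['Student_ID', 'Name']

first = df.drop_duplicates(subset=key, keep='first')
last = df.drop_duplicates(subset=key, keep='last')
neither = df.drop_duplicates(subset=key, keep=False)

print(first['Age'].tolist())      # [18, 19, 20] -> Bob's first row (Age 19) kept
print(last['Age'].tolist())       # [18, 20, 20] -> Bob's last row (Age 20) kept
print(neither['Name'].tolist())   # ['Alice', 'Charlie'] -> both Bob rows dropped
```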
Example 4: Reset Index for the Resulting DataFrame
import pandas as pd
# create a sample DataFrame with duplicate data
data = {
'Student_ID': [1, 2, 3, 2, 4, 1, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)
# set ignore_index to True
df_deduplicated_ignore_index = df.drop_duplicates(subset=['Student_ID', 'Name'], ignore_index=True)
print("With ignore_index=True:")
print(df_deduplicated_ignore_index)
print()
# set ignore_index to False (Default)
df_deduplicated_default_index = df.drop_duplicates(subset=['Student_ID', 'Name'])
print("\nWith ignore_index=False (Default):")
print(df_deduplicated_default_index)
Output
With ignore_index=True:
Student_ID Name Age
0 1 Alice 18
1 2 Bob 19
2 3 Charlie 20
3 4 David 21
4 5 Eve 22
With ignore_index=False (Default):
Student_ID Name Age
0 1 Alice 18
1 2 Bob 19
2 3 Charlie 20
4 4 David 21
6 5 Eve 22
Here,
- ignore_index=True results in a DataFrame whose index is reset to start from 0.
- ignore_index=False is the default behavior, which retains the original index labels of the DataFrame.
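As a point of comparison, a similar result can be obtained by dropping duplicates with the default index and then calling reset_index(drop=True). The sketch below checks this on the example data; the variable names a and b are only illustrative.

```python
import pandas as pd

data = {
    'Student_ID': [1, 2, 3, 2, 4, 1, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'Alice', 'Eve'],
    'Age': [18, 19, 20, 19, 21, 18, 22]
}
df = pd.DataFrame(data)

# reset the index as part of drop_duplicates()
a = df.drop_duplicates(ignore_index=True)

# keep the original labels first, then discard them in a second step
b = df.drop_duplicates().reset_index(drop=True)

print(a.index.tolist())   # [0, 1, 2, 3, 4]
print(b.index.tolist())   # [0, 1, 2, 3, 4]
```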