The duplicated() method in Pandas is used to mark duplicate rows based on column values.
Example
import pandas as pd
# sample DataFrame
data = {'A': [1, 2, 2],
'B': [4, 5, 5]}
df = pd.DataFrame(data)
# identify duplicate rows
duplicates = df.duplicated()
print(duplicates)
'''
Output
0 False
1 False
2 True
dtype: bool
'''
duplicated() Syntax
The syntax of the duplicated() method in Pandas is:
df.duplicated(subset=None, keep='first')
duplicated() Arguments
The duplicated() method has the following arguments:
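subset (optional): column label or sequence of labels to consider when identifying duplicates; by default, all columns are used
keep (optional): determines which duplicates (if any) to mark: 'first' (the default) marks every occurrence except the first, 'last' marks every occurrence except the last, and False marks every occurrence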
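To illustrate passing a sequence of labels to subset, here is a minimal sketch. The DataFrame and its column C are made up for illustration; the point is that rows can differ as a whole yet still count as duplicates when only some columns are compared.

```python
import pandas as pd

# hypothetical sample data: columns A and B repeat in the last two rows,
# but column C makes every full row unique
df = pd.DataFrame({'A': [1, 2, 2],
                   'B': [4, 5, 5],
                   'C': [7, 8, 9]})

# compare entire rows: no row repeats an earlier one
whole_row = df.duplicated()

# compare only columns A and B: the third row repeats the second
by_a_and_b = df.duplicated(subset=['A', 'B'])

print(whole_row)
print(by_a_and_b)
```

With all columns considered, no row is marked; restricting the comparison to A and B marks the third row.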
duplicated() Return Value
The duplicated() method returns a boolean Series indicating whether each row is a duplicate.
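Because the result is a boolean Series aligned with the DataFrame's index, it can be used as a mask. The following is a small sketch of that idea, reusing the sample data from the first example; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2],
                   'B': [4, 5, 5]})

# boolean Series: True where a row repeats an earlier row
mask = df.duplicated()

# True counts as 1, so the sum is the number of marked rows
num_duplicates = int(mask.sum())

# select only the rows marked as duplicates
duplicate_rows = df[mask]

# invert the mask to keep only the rows that are not marked
unique_rows = df[~mask]

print(num_duplicates)
print(duplicate_rows)
print(unique_rows)
```

Here one row is marked, so duplicate_rows contains the row at index 2 and unique_rows contains the rows at index 0 and 1.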
Example 1: Identifying Duplicates in a Specific Column
import pandas as pd
data = {'A': [1, 2, 2],
'B': [4, 5, 6]}
df = pd.DataFrame(data)
# identify duplicates in column 'A'
duplicates_in_A = df.duplicated(subset='A')
print(duplicates_in_A)
Output
0 False
1 False
2 True
dtype: bool
In this example, we identified duplicates based on column A using the subset='A' argument.
Here, the third row is marked True because its value in column A (2) already appears in the second row, even though the values in column B differ.
Example 2: Keeping Last Occurrences
import pandas as pd
data = {'A': [1, 2, 2, 2],
'B': [4, 5, 5, 5]}
df = pd.DataFrame(data)
# keep the last occurrence of the duplicate rows
last_occurrences = df.duplicated(keep='last')
print(last_occurrences)
Output
0 False
1 True
2 True
3 False
dtype: bool
In this example, we marked all duplicates as True except for the last occurrence using the keep='last' argument.
Here, there are three occurrences of the row values [2, 5]. The first two are marked True, whereas the last one is marked False.
Example 3: Marking All Duplicates
import pandas as pd
data = {'A': [1, 2, 2, 2],
'B': [4, 5, 5, 5]}
df = pd.DataFrame(data)
# mark all duplicates
all_duplicates = df.duplicated(keep=False)
print(all_duplicates)
Output
0 False
1 True
2 True
3 True
dtype: bool
In this example, we marked all duplicate rows as True using the keep=False argument.
Here, all three occurrences of the row values [2, 5] are marked True, including the first one.
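As a recap, the three values of keep can be compared side by side on the same data. This is only a sketch: the comparison DataFrame and its column names are made up for display purposes.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 2],
                   'B': [4, 5, 5, 5]})

# one column per keep setting, so the differences line up row by row
comparison = pd.DataFrame({
    "keep='first'": df.duplicated(keep='first'),
    "keep='last'": df.duplicated(keep='last'),
    "keep=False": df.duplicated(keep=False),
})

print(comparison)
```

The row at index 0 is never marked because its values occur only once. Among the three [2, 5] rows, keep='first' leaves the first unmarked, keep='last' leaves the last unmarked, and keep=False marks all of them.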