The get_dummies() method in Pandas converts categorical variables into dummy (indicator) variables.
Each category becomes a new column of binary values (1 or 0) indicating whether that category is present in the corresponding row of the original data.
Example
import pandas as pd
# create a Series
data = pd.Series(['A', 'B', 'A', 'C', 'B'])
# use get_dummies on the Series
dummies = pd.get_dummies(data)
print(dummies)
'''
Output
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
'''
get_dummies() Syntax
The syntax of the get_dummies() method in Pandas is:
get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, drop_first=False)
get_dummies() Arguments
The get_dummies() method takes the following arguments:
data - the input data to be transformed
prefix (optional) - string to prepend to the dummy column names
prefix_sep (optional) - separator placed between the prefix and the dummy column name
dummy_na (optional) - whether to add a column indicating NaNs; if False, NaNs are ignored
drop_first (optional) - whether to remove the first level
get_dummies() Return Value
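The get_dummies() method returns a DataFrame in which each unique value in the input becomes a separate column filled with binary values (1s and 0s), indicating the presence or absence of that value in each row of the original data. In pandas 2.0 and later the dummy columns default to the bool dtype, so they are displayed as True and False unless dtype=int is passed.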
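As a quick check of the return value, the sketch below inspects the type, column labels, and values of the result. It passes dtype=int, an argument of the full pandas signature that is not listed in the shortened syntax above, so the values are integers whatever default the installed pandas version uses.

```python
import pandas as pd

data = pd.Series(['A', 'B', 'A', 'C', 'B'])

# the result is a DataFrame even though the input is a Series
dummies = pd.get_dummies(data, dtype=int)

print(type(dummies).__name__)    # DataFrame
print(list(dummies.columns))     # ['A', 'B', 'C']
print(dummies.values.tolist())   # one row of 0s and 1s per input element
```

The column labels are the unique values of the input in sorted order, and the index of the result matches the index of the input.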
Example 1: Convert a Series Into Dummy Variables
import pandas as pd
# create a Series
data = pd.Series(['apple', 'orange', 'apple', 'banana'])
# use get_dummies() to convert the series into dummy variables
dummy_data = pd.get_dummies(data)
print(dummy_data)
Output
apple banana orange
0 1 0 0
1 0 0 1
2 1 0 0
3 0 1 0
In the above example, we have created the data Series with fruit names.
We then applied get_dummies() which creates a new DataFrame where each fruit name becomes a column.
For each row in the data Series, the corresponding column in the new DataFrame has a 1 if the fruit name was present in that row, and 0 otherwise.
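One way to confirm this row-by-row behaviour is to check that every row contains exactly one 1 and that the column holding it is the original label. This is a minimal sketch; using idxmax() for the reverse mapping is an illustrative choice here, not something get_dummies() does itself.

```python
import pandas as pd

data = pd.Series(['apple', 'orange', 'apple', 'banana'])

# dtype=int keeps the values as 1s and 0s on any pandas version
dummy_data = pd.get_dummies(data, dtype=int)

# every row contains exactly one 1
row_sums = dummy_data.sum(axis=1)
print(row_sums.tolist())    # [1, 1, 1, 1]

# the column that holds the 1 is the original fruit name
recovered = dummy_data.idxmax(axis=1)
print(recovered.tolist())   # ['apple', 'orange', 'apple', 'banana']
```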
Example 2: Apply get_dummies() With Prefix
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# create a DataFrame
df = pd.DataFrame(data)
# get dummies with a specified prefix
dummies = pd.get_dummies(df['Color'], prefix='Color')
print(dummies)
Output
Color_Blue Color_Green Color_Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
Here, we have passed the prefix='Color' argument to get_dummies(), so the new dummy variable columns are prefixed with Color_.
Hence, the resulting DataFrame contains columns Color_Blue, Color_Green, and Color_Red, representing the presence or absence of the respective color categories.
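When a whole DataFrame is passed instead of a single column, pandas uses each encoded column's own name as the prefix by default. The sketch below illustrates this; the Size and Price columns are hypothetical additions for the example and do not appear in the data above.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue'],
    'Size': ['S', 'M', 'S'],    # hypothetical second categorical column
    'Price': [10, 12, 9],       # hypothetical numeric column, left unchanged
})

# no prefix given: each string column's name becomes the prefix
encoded = pd.get_dummies(df, dtype=int)

print(list(encoded.columns))
# ['Price', 'Color_Blue', 'Color_Green', 'Color_Red', 'Size_M', 'Size_S']
```

Numeric columns such as Price pass through untouched, which is why only the string columns gain prefixed dummy columns.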
Example 3: Get Dummies With Specified Prefix and Prefix Separator
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# create a DataFrame
df = pd.DataFrame(data)
# get dummies with a specified prefix and prefix separator
dummies = pd.get_dummies(df['Color'], prefix='Color', prefix_sep='--')
print(dummies)
Output
Color--Blue Color--Green Color--Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
In this example, the prefix_sep='--' argument means that the prefix and the original category name will be separated by --.
So, for a color like Blue, the resulting column name in the dummies DataFrame would be Color--Blue and so on.
Example 4: Use dummy_na to Manage Missing Data
import pandas as pd
# sample data with a missing value
data = {'Color': ['Red', 'Green', 'Blue', None, 'Red']}
# create a DataFrame
df = pd.DataFrame(data)
# get dummies without considering NaN
dummies_without_nan = pd.get_dummies(df['Color'])
# get dummies considering NaN
dummies_with_nan = pd.get_dummies(df['Color'], dummy_na=True)
print("Dummies without NaN handling:\n", dummies_without_nan)
print("\nDummies with NaN handling:\n", dummies_with_nan)
Output
Dummies without NaN handling:
Blue Green Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 0 0
4 0 0 1
Dummies with NaN handling:
Blue Green Red NaN
0 0 0 1 0
1 0 1 0 0
2 1 0 0 0
3 0 0 0 1
4 0 0 1 0
Here,
get_dummies(df['Color']) - generates columns for Red, Green, and Blue, but gives no indication of the NaN value.
get_dummies(df['Color'], dummy_na=True) - generates the same columns and an additional one called NaN indicating where NaN values were present in the original data.
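The difference between the two calls can be checked by summing each row. In this minimal sketch, a row total of 0 marks a missing value that was silently ignored; the variable names are illustrative.

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue', None, 'Red'])

without_nan = pd.get_dummies(colors, dtype=int)
with_nan = pd.get_dummies(colors, dummy_na=True, dtype=int)

# without dummy_na, the row holding the missing value is all zeros
print(without_nan.sum(axis=1).tolist())   # [1, 1, 1, 0, 1]

# with dummy_na, every row contains exactly one 1
print(with_nan.sum(axis=1).tolist())      # [1, 1, 1, 1, 1]

# the extra column's label is NaN itself, so select it by position
nan_column = with_nan.iloc[:, -1]
print(nan_column.tolist())                # [0, 0, 0, 1, 0]
```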
Example 5: Drop the First Category With drop_first
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# creating a DataFrame
df = pd.DataFrame(data)
# getting dummies without dropping any columns
dummies_all = pd.get_dummies(df['Color'])
print("DataFrame with all dummy columns:")
print(dummies_all)
print("\n")
# getting dummies and dropping the first category column ('Blue' in this case)
dummies = pd.get_dummies(df['Color'], drop_first=True)
print("DataFrame after dropping 'Blue':")
print(dummies)
Output
DataFrame with all dummy columns:
Blue Green Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
DataFrame after dropping 'Blue':
Green Red
0 0 1
1 1 0
2 0 0
3 1 0
4 0 1
Here, the drop_first=True argument is passed to get_dummies() to indicate that the first category (Blue, the first in sorted order) should be dropped.
Hence, the resulting DataFrame contains only two columns, Green and Red. The category named Blue has no column of its own, but it is still identifiable: a row where both Green and Red are 0 corresponds to Blue.
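The dropped category is not lost; it is implied by the rows in which every remaining dummy column is 0. A minimal sketch of that check, with illustrative variable names:

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue', 'Green', 'Red'])

dummies = pd.get_dummies(colors, drop_first=True, dtype=int)
print(list(dummies.columns))      # ['Green', 'Red']

# rows where all remaining columns are 0 belong to the dropped category
is_dropped = dummies.sum(axis=1) == 0
print(colors[is_dropped].tolist())   # ['Blue']
```

Note that this only holds when the data has no missing values, since a NaN row would also be all zeros.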