Pandas Get Dummies

A dummy variable is a numerical representation used to encode categorical data.

A dummy variable takes only two values: 0 or 1.

For some data, each item can only belong to one category. For example, a car can be red or blue, but not both at the same time.

However, some data can belong to more than one category, such as a movie that is both action and comedy.

In both cases, the point of get_dummies() in Pandas is to convert these categories into 0s and 1s. Most statistical and machine learning libraries expect numeric input, so this encoding lets them work with categorical data.

In a dummy variable:

  1. The value 1 means the row belongs to the category.
  2. The value 0 means the row does not belong to the category.

In Pandas, we use the get_dummies() function to transform a categorical variable into one such 0/1 column per category.
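
For data where one item can belong to several categories at once (like the movie that is both action and comedy), one option is the Series.str.get_dummies() method, which splits each string on a separator before encoding it. The following is only a minimal sketch: the movie data and the '|' separator are assumptions made for illustration.

```python
import pandas as pd

# hypothetical data: each movie lists one or more genres separated by '|'
genres = pd.Series(['Action|Comedy', 'Comedy', 'Action'])

# split each string on '|' and create one indicator column per genre
genre_dummies = genres.str.get_dummies(sep='|')

print(genre_dummies)
```

The first row gets a 1 in both the Action and Comedy columns, which a single-label encoding could not express.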


Using get_dummies() on Pandas Series

To use get_dummies() on a Series, we pass the Series to the function. For example,

import pandas as pd

# create a Pandas Series
data = pd.Series(['A', 'B', 'A', 'C', 'B'])

# using get_dummies on the Series
# dtype=int gives 0/1 values (pandas 2.0 and later default to True/False)
dummies = pd.get_dummies(data, dtype=int)

print(dummies)

Output

   A  B  C
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1
4  0  1  0

In the above example, each column A, B, and C contains binary values (1 or 0) indicating the presence or absence of each category for each row in the data Series.
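
As a quick check that this encoding keeps all the information, the original labels can be recovered from the dummy columns. This is a sketch that assumes every row contains exactly one 1; it uses idxmax(), which returns the label of the column holding the largest value in each row.

```python
import pandas as pd

data = pd.Series(['A', 'B', 'A', 'C', 'B'])
dummies = pd.get_dummies(data, dtype=int)

# every row contains exactly one 1, so each row sums to 1
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())   # [1, 1, 1, 1, 1]

# the column that holds the 1 is the original category
recovered = dummies.idxmax(axis=1)
print(recovered.tolist())  # ['A', 'B', 'A', 'C', 'B']
```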


Use get_dummies() on a DataFrame Column

We can also apply get_dummies() to a single column of a DataFrame and then join the resulting dummy columns back onto the original DataFrame. For example,

import pandas as pd

# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}

# creating a DataFrame
df = pd.DataFrame(data)

# using get_dummies to convert the categorical column
# dtype=int gives 0/1 values (pandas 2.0 and later default to True/False)
dummies = pd.get_dummies(df['Color'], dtype=int)

# concatenating the dummies DataFrame with the original DataFrame
df = pd.concat([df, dummies], axis=1)

print(df)

Output

   Color  Blue  Green  Red
0    Red     0      0    1
1  Green     0      1    0
2   Blue     1      0    0
3  Green     0      1    0
4    Red     0      0    1

In this example, we have applied the get_dummies() function to the Color column of the df DataFrame.

This function converts the categorical values in the Color column into a set of binary indicator columns.

Since there are three unique colors (Red, Green, and Blue), three new columns are created, one per color.

The value in each of these columns is 1 if the row has that color and 0 otherwise.

Note: In pd.concat(), axis=1 joins the objects side by side, so the dummy columns are added to the right of the original columns. The default, axis=0, would instead stack them vertically as additional rows.
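
As an alternative to calling concat() ourselves, get_dummies() also accepts a whole DataFrame together with a columns argument. The sketch below shows this form; the Price column and its values are made up for illustration. Note that here the original Color column is replaced by the dummy columns, which are prefixed with the column name by default.

```python
import pandas as pd

# sample data; Price is a made-up numeric column added for illustration
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Price': [10, 12, 9, 11, 10]
})

# encode only the Color column; other columns pass through unchanged
encoded = pd.get_dummies(df, columns=['Color'], dtype=int)

print(encoded.columns.tolist())
# ['Price', 'Color_Blue', 'Color_Green', 'Color_Red']
```

This form is convenient when the original text column is no longer needed after encoding.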


Use of drop_first Inside get_dummies()

When creating dummy variables for a categorical column, we can pass drop_first=True to get_dummies() to leave out the column for the first category, so that k categories produce k-1 dummy columns.

Let's look at an example.

import pandas as pd

# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}

# creating a DataFrame
df = pd.DataFrame(data)

# getting dummies without dropping any columns
# dtype=int gives 0/1 values (pandas 2.0 and later default to True/False)
dummies_all = pd.get_dummies(df['Color'], dtype=int)

# concatenating the dummies DataFrame with the original DataFrame
df_all = pd.concat([df, dummies_all], axis=1)

print("DataFrame with all dummy columns:")
print(df_all)
print("\n")

# getting dummies and dropping the first category column
# ('Blue' here, because categories are sorted alphabetically)
dummies = pd.get_dummies(df['Color'], drop_first=True, dtype=int)

# concatenating the dummies DataFrame with the original DataFrame
df = pd.concat([df, dummies], axis=1)

print("DataFrame after dropping 'Blue':")
print(df)

Output

DataFrame with all dummy columns:
   Color  Blue  Green  Red
0    Red     0      0    1
1  Green     0      1    0
2   Blue     1      0    0
3  Green     0      1    0
4    Red     0      0    1


DataFrame after dropping 'Blue':
   Color  Green  Red
0    Red      0    1
1  Green      1    0
2   Blue      0    0
3  Green      1    0
4    Red      0    1

Here, the drop_first=True argument tells get_dummies() to drop the first category. The categories are ordered alphabetically, so Blue is the one that is dropped.

Hence the resulting DataFrame contains only two dummy columns, Green and Red. Blue has no column of its own, but no information is lost: a row is Blue when both Green and Red are 0 (row 2 in the output).
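
The dropped category can still be identified from the remaining columns. The following sketch finds the rows where every remaining dummy is 0 and confirms that they are the Blue rows; the variable name is_blue is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
dummies = pd.get_dummies(df['Color'], drop_first=True, dtype=int)

# a row belongs to the dropped category when all remaining dummies are 0
is_blue = dummies.sum(axis=1) == 0

print(df.loc[is_blue, 'Color'].tolist())  # ['Blue']
```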


Use of prefix Inside get_dummies()

We can use the prefix parameter inside the get_dummies() function to specify a prefix for the dummy variables created from a DataFrame column.

Let's look at an example.

import pandas as pd

# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}

# creating a DataFrame
df = pd.DataFrame(data)

# getting dummies with a specified prefix
# dtype=int gives 0/1 values (pandas 2.0 and later default to True/False)
dummies = pd.get_dummies(df['Color'], prefix='Color', dtype=int)

# concatenating the dummies DataFrame with the original DataFrame
df = pd.concat([df, dummies], axis=1)

print(df)

Output

   Color  Color_Blue  Color_Green  Color_Red
0    Red           0            0          1
1  Green           0            1          0
2   Blue           1            0          0
3  Green           0            1          0
4    Red           0            0          1

Here, we have passed the prefix='Color' argument to get_dummies(), so the new dummy variable columns are prefixed with Color_.

Hence, the resulting DataFrame contains columns Color_Blue, Color_Green, and Color_Red, representing the presence or absence of the respective color categories.
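
The underscore between the prefix and the category name comes from the prefix_sep parameter, whose default is '_'. As a small sketch, it can be changed to another separator; the '.' used below is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})

# use '.' instead of the default '_' between the prefix and the category
dummies = pd.get_dummies(df['Color'], prefix='Color', prefix_sep='.', dtype=int)

print(dummies.columns.tolist())
# ['Color.Blue', 'Color.Green', 'Color.Red']
```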

Note: To learn more, visit Pandas get_dummies().