Pandas crosstab()

The crosstab() function in Pandas allows us to create contingency tables, also known as cross-tabulations.

A contingency table helps us understand the relationship between two or more categorical variables within a dataset.

Example

import pandas as pd

# sample DataFrame
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# create a cross-tabulation of Gender and Smoker
cross_tab = pd.crosstab(df['Gender'], df['Smoker'])

print(cross_tab)

'''
Output

Smoker  No  Yes
Gender         
Female   2    0
Male     1    2
'''

crosstab() Syntax

The syntax of the crosstab() function in Pandas is:

pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

crosstab() Arguments

The crosstab() function has the following arguments:

  • index: the values to group by in the rows (an array-like, a Series, or a list of them)
  • columns: the values to group by in the columns (an array-like, a Series, or a list of them)
  • values (optional): an array of values to aggregate for each combination of index and columns; requires aggfunc
  • rownames (optional): a sequence of names for the row index levels
  • colnames (optional): a sequence of names for the column index levels
  • aggfunc (optional): the aggregation function to apply to values; requires values
  • margins (optional): whether to add row and column subtotals
  • margins_name (optional): the label used for the margin row and column (default 'All')
  • dropna (optional): whether to drop columns whose entries are all NaN
  • normalize (optional): whether to show proportions instead of counts; accepts True or 'all' (divide by the grand total), 'index' (divide each row by its total), or 'columns' (divide each column by its total).
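As a minimal sketch of the rownames and colnames arguments, we can pass plain NumPy arrays, which carry no names of their own, and label the axes ourselves. The variable names gender, smoker, and labeled are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# plain arrays have no name attribute, unlike DataFrame columns
gender = np.array(['Male', 'Female', 'Male', 'Female', 'Male'])
smoker = np.array(['Yes', 'No', 'Yes', 'No', 'No'])

# name the row and column axes explicitly
labeled = pd.crosstab(gender, smoker,
                      rownames=['Gender'], colnames=['Smoker'])

print(labeled)
print(labeled.index.name, labeled.columns.name)
```

Without rownames and colnames, Pandas falls back to generic axis names for unnamed inputs, so labeling them usually makes the table easier to read.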

crosstab() Return Value

The crosstab() function returns a DataFrame representing the cross-tabulation of the factors specified in index and columns.
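Because the result is an ordinary DataFrame, the usual DataFrame tools should work on it. The short sketch below reuses the Gender and Smoker data from the introductory example and looks up a single cell with loc.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

cross_tab = pd.crosstab(df['Gender'], df['Smoker'])

# the result is a regular DataFrame
print(type(cross_tab))

# look up one cell: the number of male smokers
male_smokers = cross_tab.loc['Male', 'Yes']
print(male_smokers)
```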


Example 1: Basic Cross-Tabulation

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Employed': ['Yes', 'Yes', 'Yes', 'Yes', 'No']}

df = pd.DataFrame(data)

# create a basic cross-tabulation of Gender and Employed
cross_tab = pd.crosstab(df['Gender'], df['Employed'])

print(cross_tab)

Output

Employed  No  Yes
Gender            
Female      0    2
Male        1    2

In this example, we created a basic cross-tabulation of Gender and Employed to understand the distribution of employed and unemployed people among genders.
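The same call can be extended to more than two variables by passing a list of columns as index. The sketch below adds a Smoker column to the data purely for illustration; it is not part of the example above.

```python
import pandas as pd

# Smoker is an extra column added here only for illustration
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Employed': ['Yes', 'Yes', 'Yes', 'Yes', 'No'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# a list of Series produces a row index with one level per variable
multi_tab = pd.crosstab([df['Gender'], df['Smoker']], df['Employed'])

print(multi_tab)
print(multi_tab.index.names)
```

Each row of the result is labeled by a (Gender, Smoker) pair, so a cell is addressed with a tuple, for example multi_tab.loc[('Male', 'Yes'), 'Yes'].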


Example 2: Margins in crosstab()

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# create a cross-tabulation with margins
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], margins=True, margins_name='Total')

print(cross_tab)

Output

Smoker  No  Yes  Total
Gender                
Female   2    0      2
Male     1    2      3
Total    3    2      5

In this example, we included row and column margins in the cross-tabulation to show the totals for each row and column.
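As a quick sanity check, the margin totals should agree with a direct count of each variable. The sketch below compares the Total column with value_counts(); the names row_totals and gender_counts are used only for this illustration.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

cross_tab = pd.crosstab(df['Gender'], df['Smoker'],
                        margins=True, margins_name='Total')

# row totals without the grand-total row
row_totals = cross_tab['Total'].drop('Total')

# the same counts computed directly from the Gender column
gender_counts = df['Gender'].value_counts()

print(row_totals.to_dict())
print(gender_counts.to_dict())

# the bottom-right cell is the number of rows in the DataFrame
print(cross_tab.loc['Total', 'Total'] == len(df))
```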


Example 3: Normalized Cross-Tabulation

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# create a normalized cross-tabulation of Gender and Smoker
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], normalize=True)

print(cross_tab)

Output

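Smoker   No  Yes
Gender          
Female  0.4  0.0
Male    0.2  0.4

In this example, we created a normalized cross-tabulation to show proportions instead of raw counts. With normalize=True, each count is divided by the total number of observations (5), so all the cells together sum to 1.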
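The normalize argument also accepts 'index' and 'columns' to compute proportions within each row or each column rather than over the whole table. A minimal sketch on the same data, with row_share and col_share as illustrative names:

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}

df = pd.DataFrame(data)

# each row sums to 1: share of smokers and non-smokers within a gender
row_share = pd.crosstab(df['Gender'], df['Smoker'], normalize='index')

# each column sums to 1: share of each gender within a smoking status
col_share = pd.crosstab(df['Gender'], df['Smoker'], normalize='columns')

print(row_share)
print(col_share)
```

Row-wise normalization answers "what fraction of males smoke?", while column-wise normalization answers "what fraction of smokers are male?".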


Example 4: Aggregate Functions with crosstab()

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
        'Age': [25, 30, 35, 40, 45]}

df = pd.DataFrame(data)

# create a cross-tabulation of Gender and Smoker with average Age as the aggregation
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], values=df['Age'], aggfunc='mean')

print(cross_tab)

Output

Smoker    No   Yes
Gender            
Female  35.0   NaN
Male    45.0  30.0

In this example, we used aggfunc='mean' to calculate the mean age of smokers and non-smokers for each gender. The Female/Yes cell is NaN because the data contains no female smokers.
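When values and aggfunc are supplied, crosstab() behaves much like a pivot table. As a rough comparison, the sketch below builds the same summary with DataFrame.pivot_table(); the names via_crosstab and via_pivot are used only for this illustration.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
        'Age': [25, 30, 35, 40, 45]}

df = pd.DataFrame(data)

# mean Age per Gender/Smoker combination using crosstab()
via_crosstab = pd.crosstab(df['Gender'], df['Smoker'],
                           values=df['Age'], aggfunc='mean')

# the same summary expressed as a pivot table
via_pivot = df.pivot_table(values='Age', index='Gender',
                           columns='Smoker', aggfunc='mean')

print(via_crosstab)
print(via_pivot)
```

crosstab() is convenient when the inputs are separate arrays or Series, while pivot_table() reads more naturally when all the columns already live in one DataFrame.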
