Pandas Pivot Table

The pivot_table() function in Pandas allows us to create a spreadsheet-style pivot table making it easier to group and analyze our data.

Working of pivot table operation in Pandas
Pivot Table Operation in Pandas

Let's look at an example.

import pandas as pd

# create a dataframe
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77]}
df = pd.DataFrame(data)

print("Original DataFrame\n", df)
print()

# pivot the dataframe pivot_df = df.pivot_table(index='Date', columns='City', values='Temperature')
print("Reshaped DataFrame\n", pivot_df)

Output

Original DataFrame
          Date         City  Temperature
0  2023-01-01     New York           32
1  2023-01-01  Los Angeles           75
2  2023-01-02     New York           30
3  2023-01-02  Los Angeles           77

Reshaped DataFrame
 City        Los Angeles  New York
Date                             
2023-01-01           75        32
2023-01-02           77        30

In this example, we reshaped the DataFrame with Date as index, City as columns and Temperature as values.

The pivot_df DataFrame is a multidimensional table that shows the temperature based on the city and the date.

Thus the pivot_table() operation reshapes the data to make it clearer for further analysis.


pivot_table() Syntax

The syntax of pivot_table() in Pandas is:

df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, dropna=True)

Here,

  • index: the column to use as row labels
  • columns: the column that will be reshaped as columns
  • values: the column(s) to use for the new DataFrame's values
  • aggfunc: the function to use for aggregation, defaulting to 'mean'
  • fill_value: value to replace missing values with
  • dropna: whether to exclude the columns whose entries are all NaN

Example: pivot_table() with Multiple Values

If we omit the values argument in pivot_table(), it selects all the remaining columns (besides the ones specified index and columns) as values for the pivot table.

import pandas as pd

# create a dataframe
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77],
        'Humidity': [80, 10, 85, 5]}

df = pd.DataFrame(data)

print('Original DataFrame')
print(df)
print()

# pivot the dataframe pivot_df = df.pivot_table(index='Date', columns='City')
print('Reshaped DataFrame') print(pivot_df)

Output

Original DataFrame
         Date         City  Temperature  Humidity
0  2023-01-01     New York           32        80
1  2023-01-01  Los Angeles           75        10
2  2023-01-02     New York           30        85
3  2023-01-02  Los Angeles           77         5

Reshaped DataFrame
              Humidity          Temperature         
City       Los Angeles New York Los Angeles New York
Date                                                
2023-01-01          10       80          75       32
2023-01-02           5       85          77       30

In this example, we created a pivot table for multiple values i.e. Temperature and Humidity.


pivot_table() With Aggregate Functions

We can use the pivot_table() method with different aggregate functions using the aggfunc parameter. We can set the value of aggfunc to functions such as 'sum', 'mean', 'count', 'max' or 'min'.

Let's see an example.

import pandas as pd

data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77, 33, 78],
        'Humidity': [80, 10, 85, 5, 81, 7]}

df = pd.DataFrame(data)

# calculate mean temperature for each city using pivot_table() mean_temperature = df.pivot_table(index='City', values='Temperature', aggfunc='mean')
print(mean_temperature)

Output

Temperature
City                    
Los Angeles    76.666667
New York       31.666667

In the above example, we calculated the mean temperature of each city using the aggfunc='mean' argument in pivot_table().


Pivot Table With MultiIndex

We can create a pivot table with MultiIndex using the pivot_table() function.

Let's look at an example.

import pandas as pd

# create a dataframe
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles','Delhi', 'Chennai', 'Delhi', 'Chennai'],
        'Country': ['USA', 'USA', 'USA', 'USA', 'India', 'India', 'India', 'India'],
        'Temperature': [32, 75, 30, 77, 75, 80, 78, 79]}
df = pd.DataFrame(data)

print("Original DataFrame\n", df)
print()

# create a pivot table with multiindex pivot_df = df.pivot_table(index=['Country', 'City'], columns='Date', values='Temperature')
print("Reshaped DataFrame\n", pivot_df)

Output

Original DataFrame
          Date         City Country  Temperature
0  2023-01-01     New York     USA           32
1  2023-01-01  Los Angeles     USA           75
2  2023-01-02     New York     USA           30
3  2023-01-02  Los Angeles     USA           77
4  2023-01-01        Delhi   India           75
5  2023-01-01      Chennai   India           80
6  2023-01-02        Delhi   India           78
7  2023-01-02      Chennai   India           79

Reshaped DataFrame
 Date                 2023-01-01  2023-01-02
Country City                               
India   Chennai              80          79
        Delhi                75          78
USA     Los Angeles          75          77
        New York             32          30

In this example, we created a pivot table with a MultiIndex by passing a list of columns as an index argument.

A MultiIndex contains multiple levels of indexes with columns linked to one another through a parent/relationship. Here, Country is the parent column and City is the child column.


Handle Missing Values With pivot_table()

Sometimes while reshaping data using pivot_table(), missing values may occur in the pivot table. Such missing values or NaN values can be handled in a pivot_table() operation using the arguments fill_value and dropna.

The dropna argument specifies whether to remove the columns whose entries are all NaN. The default value of dropna is True.

Let's look at an example.

import pandas as pd
import numpy as np

# Creating the DataFrame
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03', '2023-01-03'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago'],
        'Temperature': [32, 75, 30, 77, np.nan, 76, np.nan]}
df = pd.DataFrame(data)

# create a pivot table
pivot_df = df.pivot_table(index='Date', columns='City', values='Temperature')

print("\nDefault Pivot Table\n", pivot_df)

# create a pivot table with dropna=True pivot_df_dropna = df.pivot_table(index='Date', columns='City', values='Temperature', dropna=False)
print("\nPivot Table with dropna=False:\n", pivot_df_dropna)

Output

Default Pivot Table
 City        Los Angeles  New York
Date                             
2023-01-01         75.0      32.0
2023-01-02         77.0      30.0
2023-01-03         76.0       NaN

Pivot Table with dropna=False:
 City        Chicago  Los Angeles  New York
Date                                      
2023-01-01      NaN         75.0      32.0
2023-01-02      NaN         77.0      30.0
2023-01-03      NaN         76.0       NaN

In this example, we used the dropna function to determine the handling of columns with entirely NaN entries. By default, the dropna parameter is set to True, resulting in the automatic removal of the Chicago column.

Notice that the New York column is not dropped despite having one NaN value. This is because dropna removes the columns whose entries are all NaN.

The fill_value argument on the other hand replaces all the NaN values with a specified value. For example,

import pandas as pd
import numpy as np

# Creating the DataFrame
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, np.nan, 30, 77, np.nan, 76]}
df = pd.DataFrame(data)

# create a pivot table
pivot_df = df.pivot_table(index='Date', columns='City', values='Temperature')

print("\nDefault Pivot Table\n", pivot_df)

# create a pivot table with fill_value=0 pivot_df_dropna = df.pivot_table(index='Date', columns='City', values='Temperature', fill_value=0)
print("\nPivot Table with fill_value=0:\n", pivot_df_dropna)

Output

Default Pivot Table
 City        Los Angeles  New York
Date                             
2023-01-01          NaN      32.0
2023-01-02         77.0      30.0
2023-01-03         76.0       NaN

Pivot Table with fill_value=0:
 City        Los Angeles  New York
Date                             
2023-01-01            0        32
2023-01-02           77        30
2023-01-03           76         0

In this example, we replaced the NaN values with 0 using the fill_value=0 argument.


pivot() vs pivot_table()

The pivot() and pivot_table() functions perform similar operations but with few key differences.

Basis pivot() pivot_table()
Aggregation Does not allow aggregation of data. Allows aggregation (sum, mean, count, etc.).
Duplicate Index Cannot handle duplicate index values. Can handle duplicate index values.
MultiIndex Only accepts a single-level index. Accepts multi-level index for complex data.