The groupby()
method in Pandas is used for grouping rows based on some columns and then performing an aggregation function.
Example
import pandas as pd
# creating a dictionary
data = {
'City': ['NY', 'LA', 'NY', 'LA', 'NY'],
'Temperature': [55, 78, 56, 76, 54],
'Humidity': [65, 50, 60, 49, 63]
}
# convert dictionary to a dataframe named df
df = pd.DataFrame(data)
# group dataframe by City column and
# calculate the mean for each group
grouped = df.groupby('City').mean()
# display the grouped and aggregated data
print(grouped)
'''
Output
Temperature Humidity
City
LA 77.0 49.5
NY 55.0 62.666667
'''
groupby() Syntax
The syntax of the groupby()
method in Pandas is:
df.groupby(by=None, axis=0, level=None, sort=True, as_index=True, dropna=True)
groupby() Arguments
The groupby()
method takes following arguments:
by
- determines how to group the dataaxis
(optional) - specifies if we want to group by rows or columnslevel
(optional) - determines which level to use for groupingsort
(optional) - sorts the grouped dataas_index
(optional) - ifTrue
, group labels become indices; ifFalse
, they remain columnsdropna
(optional) - exclude null values ifTrue
.
groupby() Return Value
The groupby()
method returns an object which represents data grouped by a key(s).
Example 1: Grouping by a Single Column in Pandas
import pandas as pd
# create a dictionary containing the data
data = {'Genre': ['Fiction', 'Non-Fiction', 'Fiction', 'Non-Fiction', 'Fiction'],
'BooksSold': [150, 90, 80, 60, 200]}
# create a DataFrame using the data dictionary
df = pd.DataFrame(data)
# group the DataFrame by the Genre column and
# calculate the total number of books sold for each genre
grouped = df.groupby('Genre')['BooksSold'].sum()
# print the grouped data
print(grouped)
Output
Genre Fiction 430 Non-Fiction 150 Name: BooksSold, dtype: int64
In the above example, df.groupby('Genre')['BooksSold'].sum()
is used to group by a single column and calculate sum.
This line does the following:
df.groupby('Category')
- groups the df DataFrame by the unique values in theGenre
column.[BooksSold]
- specifies that we are interested in the'BooksSold'
column within each group..sum()
- calculates the sum of the'BookSold'
values for each group.
Example 2: Use of axis Argument in groupby()
import pandas as pd
# create a sample DataFrame
data = {
'A': [1, 2, 1, 2],
'B': [3, 4, 3, 4],
'C': [5, 6, 7, 8]
}
df = pd.DataFrame(data)
# group along rows (axis=0) based on column A
grouped_rows = df.groupby('A', axis=0)
# calculate the sum of each group
sum_rows = grouped_rows.sum()
print("Grouped along rows:")
print(sum_rows)
# group along columns (axis=1) based on index labels
grouped_cols = df.groupby(df.columns, axis=1)
# calculate the sum of each group
sum_cols = grouped_cols.sum()
print("\nGrouped along columns:")
print(sum_cols)
Output
Grouped along rows:
A B C
1 6 12
2 8 14
Grouped along columns:
A B C
0 1 3 5
1 2 4 6
2 1 3 7
3 2 4 8
In this example, we first create a DataFrame with columns A, B, and C.
We then use the groupby()
method to group the data along rows based on the A column and along columns based on the index labels (columns A, B, and C).
The axis
parameter is used to specify whether the grouping should be done along rows or columns.
Example 3: Use of level argument in groupby()
import pandas as pd
# create a DataFrame with a multi-level index
data = {'A': [1, 2, 3, 4, 5],
'B': [10, 20, 30, 40, 50]}
index = pd.MultiIndex.from_tuples([('Group1', 'A'), ('Group1', 'B'), ('Group2', 'A'), ('Group2', 'B'), ('Group3', 'A')],
names=['Group', 'Category'])
df = pd.DataFrame(data, index=index)
# display original DataFrame
print("Original DataFrame:")
print(df)
# group by the Group level and calculate the sum
grouped = df.groupby(level='Group').sum()
# display the result
print("\nGrouped by 'Group' level and summed:")
print(grouped)
Output
Original DataFrame:
A B
Group Category
Group1 A 1 10
B 2 20
Group2 A 3 30
B 4 40
Group3 A 5 50
Grouped by 'Group' level and summed:
A B
Group
Group1 3 30
Group2 7 70
Group3 5 50
Here, we create a DataFrame with a multi-level index where the levels are named Group
and Category
.
We then use the groupby()
method with the level
argument set to Group
to group the data by the Group
level and calculate the sum for each group.
As a result, we get a new DataFrame with the groups as the index and the sums of columns A and B for each group.
Example 4: Sort the Grouped Data
import pandas as pd
# create a DataFrame
data = {'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
'Value': [10, 15, 20, 25, 30, 35]}
df = pd.DataFrame(data)
# group by Category column and sort the grouped data
grouped = df.groupby('Category', sort=True)
# calculate the mean of Value for each group
result = grouped['Value'].mean()
print(result)
Output
Category A 20.0 B 25.0 Name: Value, dtype: float64
In this example, we use the groupby()
method to group the DataFrame by the Category column and specify sort=True
, which means the groups will be sorted alphabetically based on the Category values.
Example 5: Use of as_index Argument in groupby()
The as_index()
argument is used to specify whether grouping columns should be treated as index columns or not.
as_index=True
- grouped columns become the index of the resulting DataFrameas_index=False
- grouped columns remain as regular columns in the resulting DataFrame
Let's look at an example.
import pandas as pd
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
'Value': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data)
# group by Category and calculate the mean, with 'Category' as a regular column.
result1 = df.groupby('Category', as_index=False).mean()
print("When as_index=Flase:\n", result1)
print()
# group by Category and calculate the mean, with 'Category' becoming the index.
result2 = df.groupby('Category', as_index=True).mean()
print("\nWhen as_index=True:\n", result2)
Output
When as_index=Flase:
Category Value
0 A 18.333333
1 B 22.500000
When as_index=True:
Value
Category
A 18.333333
B 22.500000
Here,
- when
as_index=False
, the Category column is not set as the index of the resulting grouped DataFrame, and it remains a regular column - when
as_index=True
, the Category column becomes the index of the resulting grouped DataFrame
Example 6: Grouping by a Multiple Column in Pandas
import pandas as pd
# create a DataFrame with student data
data = {
'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
'Grade': ['A', 'B', 'A', 'A', 'B'],
'Score': [90, 85, 92, 88, 78],
'Attendance': [95, 98, 92, 97, 88]
}
df = pd.DataFrame(data)
# define the aggregate functions to be applied to the Score and Attendance columns
agg_functions = {
'Score': 'min', # Calculate the minimum Score
'Attendance': 'mean' # Calculate the mean Attendance
}
# group the DataFrame by Gender and Grade, then apply the aggregate functions
grouped = df.groupby(['Gender', 'Grade']).aggregate(agg_functions)
# print the resulting grouped DataFrame
print(grouped)
Output
Score Attendance
Gender Grade
Female A 88 97.0
B 85 98.0
Male A 90 93.5
B 78 88.0
In the above example,
- The DataFrame is grouped by Gender and Grade, creating multi-level row indices.
- The Score column shows the minimum score for each combination of Gender and Grade.
- The Attendance column shows the mean attendance for each combination of Gender and Grade.
Example 7: Use of dropna Argument in groupby()
The dropna
argument specifies how the grouping operation should handle rows with missing values in the columns by which you are grouping your data.
- When
dropna=True
- excludes rows with missing values in the grouping columns from the groups - When
dropna=False
- includes rows with missing values in the grouping columns in their own separate group
Let's look at an example.
import pandas as pd
data = {'Category': ['A', 'B', 'A', 'B', None],
'Value': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data)
# group by Category with dropna=True (default)
grouped_true = df.groupby('Category', dropna=True).mean()
print("With grouped_true: ")
print(grouped_true)
print()
# group by Category with dropna=False
grouped_false = df.groupby('Category', dropna=False).mean()
print("With grouped_false: ")
print(grouped_false)
Output
With grouped_true:
Value
Category
A 12.5
B 22.5
With grouped_false:
Value
Category
A 12.5
B 22.5
NaN 30.0
Here, rows with None
in the Category column are excluded from the grouping when dropna=True
.
However, a separate group is created for the rows with None
in the Category column when dropna=False
.