The cut() function in Pandas is used for segmenting and sorting data values into bins.
Example
import pandas as pd
# create a list of ages
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# define the bins - age ranges
bins = [18, 25, 35, 60, 100]
# use cut() to categorize each age into the defined bins
categories = pd.cut(ages, bins)
print(categories)
'''
Output
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
'''
cut() Syntax
The syntax of the cut() function in Pandas is:
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
cut() Arguments
The cut() function takes the following arguments:
x - the input array to be binned (must be one-dimensional)
bins - the criteria to bin by: an integer number of equal-width bins, a sequence of bin edges, or an IntervalIndex
right (optional) - indicates whether the bins include their rightmost edge
labels (optional) - specifies the labels for the returned bins
retbins (optional) - specifies whether to return the bin edges along with the result
precision (optional) - the precision at which to store and display the bin labels
include_lowest (optional) - whether the first interval should be left-inclusive
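As a minimal sketch of the two most common forms of the bins argument, the snippet below passes first an integer and then an explicit list of edges. The values list and the variable names are made up for illustration.

```python
import pandas as pd

values = [1, 7, 5, 4, 6, 3]  # illustrative data, not from the examples in this article

# an integer asks cut() for that many equal-width bins over the data range
equal_width = pd.cut(values, bins=3)

# a sequence gives the bin edges explicitly
explicit = pd.cut(values, bins=[0, 2, 4, 8])

print(equal_width.categories)
print(explicit.categories)
```

With an integer, cut() works out the edges from the minimum and maximum of the data, so the edges change whenever the data changes. Explicit edges stay fixed.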
cut() Return Value
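The cut() function returns a Categorical, a Series, or an array that identifies the bin each value in the input belongs to. A Series input gives a Series, and other array-like input gives a Categorical. With labels=False, the result holds integer bin indices instead of categories.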
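The following sketch checks the type of the returned object for a list input, a Series input, and a call with labels=False. The data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

ages = [20, 31, 45]
bins = [18, 25, 35, 60]

from_list = pd.cut(ages, bins)                # list in, Categorical out
from_series = pd.cut(pd.Series(ages), bins)   # Series in, Series out
as_codes = pd.cut(ages, bins, labels=False)   # integer bin indices in a NumPy array

print(type(from_list).__name__)
print(type(from_series).__name__)
print(type(as_codes).__name__, as_codes)
```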
Example 1: Categorizing Data Using cut()
import pandas as pd
# create a list of exam scores
scores = [88, 92, 75, 85, 78, 95, 64, 82, 90, 73, 67, 99]
# define the bins - grading ranges
bins = [0, 60, 70, 80, 90, 100]
# use cut() to categorize each score into the defined grading bins
grade_categories = pd.cut(scores, bins)
print(grade_categories)
Output
[(80, 90], (90, 100], (70, 80], (80, 90], (70, 80], ..., (80, 90], (80, 90], (70, 80], (60, 70], (90, 100]]
Length: 12
Categories (5, interval[int64, right]): [(0, 60] < (60, 70] < (70, 80] < (80, 90] < (90, 100]]
In the above example, we created a list named scores containing exam scores.
The bins define the grading ranges (0, 60], (60, 70], (70, 80], (80, 90], and (90, 100]. Each range excludes its left edge and includes its right edge, so a score of 90 falls in (80, 90].
Then we used pd.cut() to categorize each score into the corresponding grading bin.
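A common next step after binning is to count how many values landed in each bin. The sketch below is one possible way to do that with the same scores. Wrapping the result in a Series is a choice made here for illustration.

```python
import pandas as pd

scores = [88, 92, 75, 85, 78, 95, 64, 82, 90, 73, 67, 99]
bins = [0, 60, 70, 80, 90, 100]

# wrap the result in a Series so that Series.value_counts() is available
grade_categories = pd.Series(pd.cut(scores, bins))

# sort=False skips sorting by count; sort_index() lists the bins in order
counts = grade_categories.value_counts(sort=False).sort_index()
print(counts)
```

Bins that receive no values, such as (0, 60] here, still appear in the result with a count of 0.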
Example 2: Control Bin Boundaries Using right Argument in cut()
import pandas as pd
# create a list of data
data = [2, 4, 6, 8, 10]
# define the bins
bins = [0, 5, 10]
# use cut() with right=True (default)
categories_right_true = pd.cut(data, bins, right=True)
print("Bins closed on the right:")
print(categories_right_true)
print()
# use cut with right=False
categories_right_false = pd.cut(data, bins, right=False)
print("\nBins closed on the left:")
print(categories_right_false)
Output
Bins closed on the right:
[(0, 5], (0, 5], (5, 10], (5, 10], (5, 10]]
Categories (2, interval[int64, right]): [(0, 5] < (5, 10]]


Bins closed on the left:
[[0.0, 5.0), [0.0, 5.0), [5.0, 10.0), [5.0, 10.0), NaN]
Categories (2, interval[int64, left]): [[0, 5) < [5, 10)]
Here, with
right=True - the bins are (0, 5] and (5, 10], so the right edge of each bin (5 and 10) is included.
right=False - the bins are [0, 5) and [5, 10), so the left edge of each bin (0 and 5) is included. The value 10 then falls outside the last bin and becomes NaN.
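To make the edge behaviour concrete, this sketch bins only the values that sit exactly on a bin edge and prints the bin index assigned to each one. The name edge_values is illustrative.

```python
import pandas as pd

bins = [0, 5, 10]
edge_values = [0, 5, 10]  # values that sit exactly on a bin edge

closed_right = pd.cut(edge_values, bins, right=True)
closed_left = pd.cut(edge_values, bins, right=False)

# codes hold the bin index of each value; -1 marks a value outside every bin
print(list(closed_right.codes))  # [-1, 0, 1]
print(list(closed_left.codes))   # [0, 1, -1]
```

With right=True the lowest edge 0 is left out, and with right=False the highest edge 10 is left out.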
Example 3: Naming Bins in Pandas cut()
import pandas as pd
# create a list of data
data = [20, 35, 45, 60, 75, 90]
# define the bins
bins = [0, 25, 50, 75, 100]
# define custom labels for the bins
labels = ['Low', 'Medium', 'High', 'Very High']
# use cut() with custom labels
categories_with_labels = pd.cut(data, bins, labels=labels)
print(categories_with_labels)
Output
['Low', 'Medium', 'Medium', 'High', 'High', 'Very High']
Categories (4, object): ['Low' < 'Medium' < 'High' < 'Very High']
In this example, we defined a list of custom labels, Low, Medium, High, and Very High, one for each bin. The number of labels must match the number of bins, which is one fewer than the number of bin edges.
Then we used pd.cut() to categorize the data into bins and assign the custom labels to them.
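The labels returned by cut() are ordered by default, which is why the output shows Low < Medium < High < Very High. As a small sketch of what that ordering allows, the snippet below filters the data by comparing against a label. Holding the data in a Series is an assumption made here for convenience.

```python
import pandas as pd

data = pd.Series([20, 35, 45, 60, 75, 90])
bins = [0, 25, 50, 75, 100]
labels = ['Low', 'Medium', 'High', 'Very High']

levels = pd.cut(data, bins, labels=labels)

# ordered categories support comparisons against one of the labels
high_or_above = data[levels >= 'High']
print(high_or_above.tolist())  # [60, 75, 90]
```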
Example 4: Extract Bin Information Using retbins Argument in cut()
import pandas as pd
# create a list of data
data = [10, 15, 20, 25, 30, 35, 40]
# define the bins
bins = [0, 20, 40]
# use cut() with retbins=True
categories, bin_edges = pd.cut(data, bins, retbins=True)
print("Binned Categories:")
print(categories)
print("\nBin Edges:")
print(bin_edges)
Output
Binned Categories:
[(0, 20], (0, 20], (0, 20], (20, 40], (20, 40], (20, 40], (20, 40]]
Categories (2, interval[int64, right]): [(0, 20] < (20, 40]]

Bin Edges:
[ 0 20 40]
In the above example, we used pd.cut() with retbins=True, so it returns two things: the binned categories and the array of bin edges.
The categories variable contains the binned data (each element of data assigned to a bin).
The bin_edges variable contains the edges of the bins used in the process.
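The returned edges are most useful when cut() computed them itself and the same bins are needed again. The sketch below shows one way this might look. The training and incoming lists are hypothetical.

```python
import pandas as pd

training = [0, 10, 20, 30, 40]  # hypothetical data used to derive the bins
incoming = [5, 35]              # hypothetical new values to bin the same way

# let cut() compute four equal-width bins and hand back their edges
_, bin_edges = pd.cut(training, bins=4, retbins=True)

# reuse the same edges so the new values are binned consistently
incoming_bins = pd.cut(incoming, bins=bin_edges)
print(bin_edges)
print(list(incoming_bins.codes))
```

Passing the saved edges back in keeps the bins identical across calls, whereas calling pd.cut(incoming, bins=4) would derive new edges from the new data.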
Example 5: Specify the Precision of the Bin Labels
import pandas as pd
# create a list of floating-point data
data = [10.123, 15.456, 20.789, 25.012, 30.345, 35.678, 40.901]
# define the bins
bins = [0, 20, 40, 60]
# use cut() with precision=2
categories = pd.cut(data, bins, precision=2)
print("Binned Categories with Two Decimal Precision:")
print(categories)
Output
Binned Categories with Two Decimal Precision:
[(0, 20], (0, 20], (20, 40], (20, 40], (20, 40], (20, 40], (40, 60]]
Categories (3, interval[int64, right]): [(0, 20] < (20, 40] < (40, 60]]
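Here, we used pd.cut() with precision=2, which sets the number of decimal places used to store and display the bin labels. Because the bin edges in this example are whole numbers, the labels look the same as they would with the default precision=3. The setting matters mainly when cut() computes the edges itself, for example when bins is an integer.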
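To see precision change the labels, the edges need to be fractional. In this sketch, bins=2 lets cut() compute the edges from made-up data, and the same call is run at two precision settings.

```python
import pandas as pd

data = [1.234567, 2.345678, 3.456789, 4.567891]  # illustrative values

# bins=2 makes cut() compute the edges, which are not whole numbers here
coarse = pd.cut(data, bins=2, precision=1)
fine = pd.cut(data, bins=2, precision=4)

print(coarse.categories)
print(fine.categories)

# the bin each value is assigned to is the same in both cases
print(list(coarse.codes), list(fine.codes))
```

Only the labels differ between the two results. The assignment of values to bins does not depend on precision in this example.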
Example 6: Use of include_lowest Argument in cut()
import pandas as pd
# create a list of data
data = [20, 22, 24, 26, 28, 30]
# define the bins
bins = [20, 25, 30]
# use cut() with include_lowest=False (default)
categories_default = pd.cut(data, bins)
print("First bin exclusive of the lower edge:")
print(categories_default)
print()
# use cut() with include_lowest=True
categories_include_lowest = pd.cut(data, bins, include_lowest=True)
print("\nFirst bin inclusive of the lower edge:")
print(categories_include_lowest)
Output
First bin exclusive of the lower edge:
[NaN, (20.0, 25.0], (20.0, 25.0], (25.0, 30.0], (25.0, 30.0], (25.0, 30.0]]
Categories (2, interval[int64, right]): [(20, 25] < (25, 30]]


First bin inclusive of the lower edge:
[(19.999, 25.0], (19.999, 25.0], (19.999, 25.0], (25.0, 30.0], (25.0, 30.0], (25.0, 30.0]]
Categories (2, interval[float64, right]): [(19.999, 25.0] < (25.0, 30.0]]
In this example, with
include_lowest=False - the first bin (20, 25] does not include the lower edge 20. Thus, the value 20 is not placed in any bin, resulting in NaN.
include_lowest=True - the first bin includes the lower edge 20. Pandas displays it as (19.999, 25.0], with the left edge shifted slightly below 20 so that the interval stays closed on the right. Therefore, the value 20 is placed in the first bin, and there are no NaN values.
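Since values that fall outside every bin become NaN without any warning, it can help to count them after binning. The sketch below does this for both calls from the example. The variable names are illustrative.

```python
import pandas as pd

data = [20, 22, 24, 26, 28, 30]
bins = [20, 25, 30]

default = pd.cut(data, bins)
with_lowest = pd.cut(data, bins, include_lowest=True)

# isna() flags the values that did not land in any bin
unbinned_default = int(default.isna().sum())
unbinned_with_lowest = int(with_lowest.isna().sum())

print(unbinned_default)      # 1
print(unbinned_with_lowest)  # 0
```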