The sample() method in Pandas is used to randomly select a specified number of rows from a DataFrame.
Example
import pandas as pd
# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 22, 35, 28]}
df = pd.DataFrame(data)
# select 2 random rows from the DataFrame
sampled_rows = df.sample(n=2)
print(sampled_rows)
'''
Output
Name Age
4 Eva 28
0 Alice 25
'''
sample() Syntax
The syntax of the sample() method in Pandas is:
df.sample(n=None, frac=None, replace=False, weights=None, random_state=None)
sample() Arguments
The sample() method takes the following arguments:
n (optional) - specifies the number of random samples to select; cannot be combined with frac
frac (optional) - specifies the fraction of the DataFrame to sample (typically between 0 and 1); cannot be combined with n
replace (optional) - a boolean that determines if sampling should be with replacement or not
weights (optional) - allows assigning different probabilities to rows for weighted sampling
random_state (optional) - a seed (such as an integer) for controlling randomness, which makes the sample reproducible.
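As a quick illustration of how n and frac relate, here is a minimal sketch. The DataFrame is made up for this illustration and is not one of the examples on this page. In the pandas versions we are aware of, passing both arguments at once raises a ValueError, and calling sample() with neither returns a single row.

```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame({'x': range(10)})

by_count = df.sample(n=3)           # exactly 3 rows
by_fraction = df.sample(frac=0.5)   # half of the 10 rows
default = df.sample()               # neither n nor frac given

print(len(by_count), len(by_fraction), len(default))

# n and frac are alternatives, so combining them is rejected
try:
    df.sample(n=3, frac=0.5)
    both_allowed = True
except ValueError as err:
    both_allowed = False
    print("ValueError:", err)
```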
sample() Return Value
The sample() method returns a new DataFrame containing the randomly selected rows or columns from the original DataFrame.
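The sketch below checks this behavior on a small made-up DataFrame. It assumes the axis parameter of sample(), which is not listed in the syntax above, to sample columns instead of rows. It also shows that the original DataFrame is left unchanged and that sampled rows keep their original index labels.

```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 32, 28, 22, 30]})

rows = df.sample(n=2)           # 2 random rows, all columns
cols = df.sample(n=1, axis=1)   # 1 random column, all rows

print(rows.shape)   # (2, 2)
print(cols.shape)   # (5, 1)

# the original DataFrame is not modified
print(df.shape)     # (5, 2)

# sampled rows keep their original index labels; renumber them if needed
renumbered = rows.reset_index(drop=True)
print(list(renumbered.index))   # [0, 1]
```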
Example 1: Select Random Rows Using sample()
import pandas as pd
# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 28, 22, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)
# select 3 random rows
sampled_rows = df.sample(n=3)
print("Selected 3 random rows:")
print(sampled_rows)
Output
Selected 3 random rows:
Name Age City
3 David 22 Houston
4 Eve 30 Miami
1 Bob 32 Los Angeles
In the above example, we have used the sample() method with n=3 to randomly select 3 rows from the df DataFrame.
The sampled_rows variable contains those 3 randomly selected rows from df.
Note: Since sample() randomly selects rows, the output will be different each time we execute the code.
Example 2: Select Fraction of Rows Randomly
import pandas as pd
# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 28, 22, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)
# select 30% of the rows randomly
sampled_fraction = df.sample(frac=0.3)
print("Selected 30% of the rows randomly:")
print(sampled_fraction)
Output
Selected 30% of the rows randomly:
Name Age City
0 Alice 25 New York
4 Eve 30 Miami
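Here, inside sample(), we used the frac parameter with a value of 0.3 to randomly select about 30% of the rows from the df DataFrame.
Since 30% of 5 rows is 1.5, the number of rows is rounded to a whole number, which is why the output contains 2 rows.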
The sampled_fraction variable contains that random subset of rows.
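To see how frac maps to a row count, we can check the length of the result directly. The sketch below reuses the names and ages from this example and also shows a common use of frac that is not covered above: frac=1 returns every row exactly once in random order, which effectively shuffles the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 32, 28, 22, 30]})

# 0.3 * 5 = 1.5, so the result holds a whole number of rows
subset = df.sample(frac=0.3)
print(len(subset))

# frac=1 keeps all rows but in random order (a shuffle)
shuffled = df.sample(frac=1)
print(len(shuffled))
print(sorted(shuffled.index) == list(df.index))
```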
Example 3: Sample With Replacement in Pandas
Sampling with replacement means that the same row can be selected multiple times.
import pandas as pd
# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 28, 22, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)
# sample 5 random rows with replacement
sampled_with_replacement = df.sample(n=5, replace=True)
print("Sampled with replacement (allowing duplicates):")
print(sampled_with_replacement)
Output
Sampled with replacement (allowing duplicates):
Name Age City
1 Bob 32 Los Angeles
1 Bob 32 Los Angeles
0 Alice 25 New York
3 David 22 Houston
1 Bob 32 Los Angeles
In this example, we set replace=True when using the sample() method with n=5.
This allows the same row to be selected multiple times in the sampled output, effectively creating duplicates in the result.
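One practical consequence, sketched below on a made-up DataFrame, is that the sample size is limited by the number of rows only when sampling without replacement. In the pandas versions we are aware of, asking for more rows than exist with the default replace=False raises a ValueError, while replace=True allows it.

```python
import pandas as pd

# hypothetical data, for illustration only
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve']})

# without replacement, we cannot draw 8 rows from a 5-row DataFrame
try:
    df.sample(n=8)
    oversample_ok = True
except ValueError as err:
    oversample_ok = False
    print("ValueError:", err)

# with replacement, the sample may be larger than the DataFrame
big_sample = df.sample(n=8, replace=True)
print(len(big_sample))                   # 8

# 8 draws from 5 rows must repeat at least one row
print(big_sample.index.has_duplicates)   # True
```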
Example 4: Control Randomness With random_state Argument in sample()
import pandas as pd
# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 28, 22, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)
# select 3 random rows with a specific random seed for reproducibility
sampled_with_seed = df.sample(n=3, random_state=42)
print("Sampled with a specific random seed (for reproducibility):")
print(sampled_with_seed)
Output
Sampled with a specific random seed (for reproducibility):
Name Age City
1 Bob 32 Los Angeles
4 Eve 30 Miami
2 Charlie 28 Chicago
Here, we set random_state=42 when using the sample() method to sample 3 random rows.
Setting a specific random seed (in our case, 42) ensures that the same random sample is generated whenever we use this seed.
This is useful when we want to reproduce the same random sample in different runs of our code.
Note: The choice of 42 as the seed value is arbitrary; we can use any integer value we like.
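A simple way to confirm the reproducibility is to sample twice with the same seed and compare the results. The sketch below uses the DataFrame.equals() method for the comparison; the variable names are ours.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 32, 28, 22, 30]})

# two independent calls with the same seed
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)

# both calls select the same rows in the same order
print(first.equals(second))   # True
```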
Example 5: Perform Weighted Sampling for Biased Data Selection
import pandas as pd
# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 28, 22, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']}
df = pd.DataFrame(data)
# assign weights to each row
weights_list = [0.1, 0.2, 0.3, 0.2, 0.2]
# sample 2 random rows with weights
weighted_sample = df.sample(n=2, weights=weights_list)
print("Weighted sampling:")
print(weighted_sample)
Output
Weighted sampling:
Name Age City
2 Charlie 28 Chicago
4 Eve 30 Miami
In the above example, we have defined a list called weights_list, which contains one weight value for each row. These weights represent the probabilities of each row being selected during the weighted sampling. If the weights do not sum to 1, Pandas normalizes them so that they do.
Then we used sample() with n set to 2 and the weights parameter set to weights_list.
Here, we performed weighted sampling and selected 2 random rows from the df DataFrame, considering the specified weights.
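With only 2 rows drawn, the effect of the weights is hard to see. As a rough check, the sketch below draws many rows with replacement and compares how often each name appears against its weight. The sample size of 10000 and the seed are arbitrary choices of ours. It also shows that a weight of 0 excludes a row entirely.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
                   'Age': [25, 32, 28, 22, 30]})
weights_list = [0.1, 0.2, 0.3, 0.2, 0.2]

# draw many rows with replacement to estimate selection frequencies
draws = df.sample(n=10000, replace=True, weights=weights_list, random_state=0)
frequencies = draws['Name'].value_counts(normalize=True)
print(frequencies)   # each share should be close to the row's weight

# rows with weight 0 are never selected; these weights are normalized by pandas
no_alice_bob = df.sample(n=3, weights=[0, 0, 1, 1, 1])
print(sorted(no_alice_bob.index))   # [2, 3, 4]
```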