The join() method in Pandas allows us to combine two DataFrames based on their indexes.
Example
import pandas as pd
# dataFrames representing scores of two subjects for students
df1 = pd.DataFrame({'Math': [85, 90]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'Physics': [92, 88]}, index=['Alice', 'Bob'])
# join the DataFrames
result = df1.join(df2)
print(result)
'''
Output
Math Physics
Alice 85 92
Bob 90 88
'''
join() Syntax
The syntax of the join() method in Pandas is:
df.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
join() Arguments
The join() function takes following arguments:
other- DataFrame to be joinon(optional) - column to join on the index inotherhow(optional) - specifies how to join dataframes. Default isleftlsuffix&rsuffix(optional) - suffixes to use from left and right DataFrame columns when overlapping columns are encountered.sort(optional) - sort the join keys. Default is False.
join() Return Value
The join() method returns a new DataFrame that results from joining two (or more) dataframes.
Example1: Join Two DataFrames
import pandas as pd
# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])
# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])
# join the two DataFrames using the default left join
result = math_scores.join(physics_scores)
print(result)
Output
Math_Score Physics_Score
Alice 85 92.0
Bob 90 84.0
Charlie 78 79.0
David 88 NaN
In this example, math_scores and physics_scores are joined based on their student names (which are indices).
Here, Alice, Bob, and Charlie have both Math and Physics scores. However, David only has a Math score, thus, his Physics score is NaN.
And, Eva and Frank, who only appear in the Physics score list, are not included in the result because we used a default left join.
Note: If you were to use an outer join, Eva and Frank would also be included in the resulting DataFrame.
Example2: Perform Outer Join
import pandas as pd
# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])
# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])
# join the two DataFrames using an outer join
result = math_scores.join(physics_scores, how='outer')
print(result)
Output
Math_Score Physics_Score
Alice 85.0 92.0
Bob 90.0 84.0
Charlie 78.0 79.0
David 88.0 NaN
Eva NaN 88.0
Frank NaN 86.0
Here,
Alice,Bob, andCharliehave scores for bothMathandPhysics.- David only has a score for
Math. EvaandFrankonly have scores forPhysics.
The NaN values indicate the absence of scores in one of the subjects for the respective students.
Example 3: Perform Right Join
import pandas as pd
# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])
# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])
# join the two DataFrames using a right join
result = math_scores.join(physics_scores, how='right')
print(result)
Output
Math_Score Physics_Score
Alice 85.0 92
Bob 90.0 84
Eva NaN 88
Charlie 78.0 79
Frank NaN 86
Here,
- All unique indices from the right DataFrame physics_scores are included in the output.
- Only
Alice,Bob,Charlie, andEvahave matching scores in the left DataFrame math_scores, so theirMathscores are displayed. DavidandFrank, who don't have matchingMathscores, showNaNvalues in the Math_Score column.- All students have
Physicsscores, so there are noNaNvalues in the Physics_Score column.
Example 4: Setting a New Column as Index
import pandas as pd
# DataFrame representing Math scores of students
math_scores = pd.DataFrame({
'Student_Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Math_Score': [85, 90, 78, 88],
})
# DataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
'Student_Name': ['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'],
'Physics_Score': [92, 84, 88, 79, 86],
})
# join the two DataFrames using the 'Student_Name' column as the key
result = math_scores.join(physics_scores.set_index('Student_Name'), on='Student_Name')
print(result)
Output
Student_Name Math_Score Physics_Score
0 Alice 85 92.0
1 Bob 90 84.0
2 Charlie 78 79.0
3 David 88 NaN
Here, in the join() method, we specified on='Student_Name' to indicate that the join should be performed based on the Student_Name column.
Since the join is performed based on the Student_Name column, we can see the math and physics scores combined for each student.
Students Eva and Frank from the physics_scores DataFrame are not included in the result because they do not have matching Student_Name values in the math_scores DataFrame.
Example 5: Perform a Left Join With Suffixes
import pandas as pd
# create two DataFrames with overlapping column names
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 90, 78],
})
df2 = pd.DataFrame({
'ID': [2, 3, 4],
'Name': ['Bob', 'Charlie', 'David'],
'Score': [92, 84, 88],
})
# perform a left join with suffixes
result_with_suffixes = df1.join(df2.set_index('ID'), on='ID', how='left', lsuffix='_left', rsuffix='_right')
print("Result with suffixes:")
print(result_with_suffixes)
Output
Result with suffixes:
ID Name_left Score_left Name_right Score_right
0 1 Alice 85 NaN NaN
1 2 Bob 90 Bob 92.0
2 3 Charlie 78 Charlie 84.0
In the above example, the lsuffix='_left' and rsuffix='_right' parameters are used to specify custom suffixes for the columns from the left and right DataFrames in the result.
Hence, in the output we can see that the columns from the left DataFrame df1 have _left suffixes, and the columns from the right DataFrame df2 have _right suffixes to differentiate them.
The left join retains all rows from df1 and adds matching rows from df2.
Example 6: Sort the Join Keys
import pandas as pd
# create two DataFrames with overlapping column names
df1 = pd.DataFrame({
'ID': [3, 1, 2],
'Name': ['Charlie', 'Alice', 'Bob'],
'Score': [78, 85, 90],
})
df2 = pd.DataFrame({
'ID': [2, 4, 3],
'Grade': ['B', 'A', 'C'],
})
# perform a left join without specifying sort (default: sort=False)
result_without_sort = df1.join(df2.set_index('ID'), on='ID', how='left')
# perform a left join with sort=True
result_with_sort = df1.join(df2.set_index('ID'), on='ID', how='left', sort=True)
print("Result without sort:")
print(result_without_sort)
print("\nResult with sort:")
print(result_with_sort)
Output
Result without sort:
ID Name Score Grade
0 3 Charlie 78 C
1 1 Alice 85 NaN
2 2 Bob 90 B
Result with sort:
ID Name Score Grade
1 1 Alice 85 NaN
2 2 Bob 90 B
0 3 Charlie 78 C
Here, In the result with sorting sort=True, the result is sorted in ascending order based on the ID column.