The join()
method in Pandas allows us to combine two DataFrames based on their indexes.
Example
import pandas as pd
# dataFrames representing scores of two subjects for students
df1 = pd.DataFrame({'Math': [85, 90]}, index=['Alice', 'Bob'])
df2 = pd.DataFrame({'Physics': [92, 88]}, index=['Alice', 'Bob'])
# join the DataFrames
result = df1.join(df2)
print(result)
'''
Output
Math Physics
Alice 85 92
Bob 90 88
'''
join() Syntax
The syntax of the join()
method in Pandas is:
df.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
join() Arguments
The join()
function takes following arguments:
other
- DataFrame to be joinon
(optional) - column to join on the index inother
how
(optional) - specifies how to join dataframes. Default isleft
lsuffix
&rsuffix
(optional) - suffixes to use from left and right DataFrame columns when overlapping columns are encountered.sort
(optional) - sort the join keys. Default is False.
join() Return Value
The join()
method returns a new DataFrame that results from joining two (or more) dataframes.
Example1: Join Two DataFrames
import pandas as pd
# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])
# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])
# join the two DataFrames using the default left join
result = math_scores.join(physics_scores)
print(result)
Output
Math_Score Physics_Score Alice 85 92.0 Bob 90 84.0 Charlie 78 79.0 David 88 NaN
In this example, math_scores and physics_scores are joined based on their student names (which are indices).
Here, Alice
, Bob
, and Charlie
have both Math
and Physics
scores. However, David
only has a Math
score, thus, his Physics
score is NaN
.
And, Eva
and Frank
, who only appear in the Physics
score list, are not included in the result because we used a default left
join.
Note: If you were to use an outer
join, Eva
and Frank
would also be included in the resulting DataFrame.
Example2: Perform Outer Join
import pandas as pd
# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])
# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])
# join the two DataFrames using an outer join
result = math_scores.join(physics_scores, how='outer')
print(result)
Output
Math_Score Physics_Score Alice 85.0 92.0 Bob 90.0 84.0 Charlie 78.0 79.0 David 88.0 NaN Eva NaN 88.0 Frank NaN 86.0
Here,
Alice
,Bob
, andCharlie
have scores for bothMath
andPhysics
.- David only has a score for
Math
. Eva
andFrank
only have scores forPhysics
.
The NaN
values indicate the absence of scores in one of the subjects for the respective students.
Example 3: Perform Right Join
import pandas as pd
# dataFrame representing Math scores of students
math_scores = pd.DataFrame({
'Math_Score': [85, 90, 78, 88],
}, index=['Alice', 'Bob', 'Charlie', 'David'])
# dataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
'Physics_Score': [92, 84, 88, 79, 86],
}, index=['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'])
# join the two DataFrames using a right join
result = math_scores.join(physics_scores, how='right')
print(result)
Output
Math_Score Physics_Score
Alice 85.0 92
Bob 90.0 84
Eva NaN 88
Charlie 78.0 79
Frank NaN 86
Here,
- All unique indices from the right DataFrame physics_scores are included in the output.
- Only
Alice
,Bob
,Charlie
, andEva
have matching scores in the left DataFrame math_scores, so theirMath
scores are displayed. David
andFrank
, who don't have matchingMath
scores, showNaN
values in the Math_Score column.- All students have
Physics
scores, so there are noNaN
values in the Physics_Score column.
Example 4: Setting a New Column as Index
import pandas as pd
# DataFrame representing Math scores of students
math_scores = pd.DataFrame({
'Student_Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Math_Score': [85, 90, 78, 88],
})
# DataFrame representing Physics scores of students
physics_scores = pd.DataFrame({
'Student_Name': ['Alice', 'Bob', 'Eva', 'Charlie', 'Frank'],
'Physics_Score': [92, 84, 88, 79, 86],
})
# join the two DataFrames using the 'Student_Name' column as the key
result = math_scores.join(physics_scores.set_index('Student_Name'), on='Student_Name')
print(result)
Output
Student_Name Math_Score Physics_Score
0 Alice 85 92.0
1 Bob 90 84.0
2 Charlie 78 79.0
3 David 88 NaN
Here, in the join()
method, we specified on='Student_Name'
to indicate that the join should be performed based on the Student_Name
column.
Since the join is performed based on the Student_Name
column, we can see the math and physics scores combined for each student.
Students Eva
and Frank
from the physics_scores DataFrame are not included in the result because they do not have matching Student_Name
values in the math_scores DataFrame.
Example 5: Perform a Left Join With Suffixes
import pandas as pd
# create two DataFrames with overlapping column names
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 90, 78],
})
df2 = pd.DataFrame({
'ID': [2, 3, 4],
'Name': ['Bob', 'Charlie', 'David'],
'Score': [92, 84, 88],
})
# perform a left join with suffixes
result_with_suffixes = df1.join(df2.set_index('ID'), on='ID', how='left', lsuffix='_left', rsuffix='_right')
print("Result with suffixes:")
print(result_with_suffixes)
Output
Result with suffixes:
ID Name_left Score_left Name_right Score_right
0 1 Alice 85 NaN NaN
1 2 Bob 90 Bob 92.0
2 3 Charlie 78 Charlie 84.0
In the above example, the lsuffix='_left'
and rsuffix='_right'
parameters are used to specify custom suffixes for the columns from the left and right DataFrames in the result.
Hence, in the output we can see that the columns from the left DataFrame df1 have _left
suffixes, and the columns from the right DataFrame df2
have _right
suffixes to differentiate them.
The left join retains all rows from df1 and adds matching rows from df2.
Example 6: Sort the Join Keys
import pandas as pd
# create two DataFrames with overlapping column names
df1 = pd.DataFrame({
'ID': [3, 1, 2],
'Name': ['Charlie', 'Alice', 'Bob'],
'Score': [78, 85, 90],
})
df2 = pd.DataFrame({
'ID': [2, 4, 3],
'Grade': ['B', 'A', 'C'],
})
# perform a left join without specifying sort (default: sort=False)
result_without_sort = df1.join(df2.set_index('ID'), on='ID', how='left')
# perform a left join with sort=True
result_with_sort = df1.join(df2.set_index('ID'), on='ID', how='left', sort=True)
print("Result without sort:")
print(result_without_sort)
print("\nResult with sort:")
print(result_with_sort)
Output
Result without sort:
ID Name Score Grade
0 3 Charlie 78 C
1 1 Alice 85 NaN
2 2 Bob 90 B
Result with sort:
ID Name Score Grade
1 1 Alice 85 NaN
2 2 Bob 90 B
0 3 Charlie 78 C
Here, In the result with sorting sort=True
, the result is sorted in ascending order based on the ID
column.