The merge()
function in Pandas merges two DataFrames based on their indexes or a specified column.
Example
import pandas as pd
# sample DataFrames with meaningful keys
employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David']
})
salaries = pd.DataFrame({
'employee_id': [2, 3, 4, 5],
'salary': [60000, 75000, 80000, 55000]
})
# merge two dataframes
merged_df = pd.merge(employees, salaries)
print(merged_df)
'''
Output
employee_id name salary
0 2 Bob 60000
1 3 Charlie 75000
2 4 David 80000
'''
merge() Syntax
The syntax of the merge()
method in Pandas is:
pd.merge(left, right, on=None, how='inner', left_on=None, right_on=None, sort=False)
merge() Arguments
The merge()
function takes following arguments:
left
- specifies the left DataFrame to be mergedright
- specifies the right DataFrame to be mergedon
(optional) - specifies column(s) to join onhow
(optional) - specifies the type of join to performleft_on
(optional) - specifies column(s) from the left DataFrame to use as key(s) for mergingright_on
(optional) - specifies column(s) from the right DataFrame to use as key(s) for mergingsort
(optional) - ifTrue
, sort the result DataFrame by the join keys
merge() Return Value
The merge()
function returns a new DataFrame that represents the result of the merge operation.
Example: Merge DataFrames Based on Keys
In Pandas, we can merge DataFrames based on keys. For example,
import pandas as pd
# create two DataFrames
data1 = {
'StudentID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)
data2 = {
'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)
# merge the DataFrames based on CourseID key
merged_data = pd.merge(students, courses, left_on='CourseID', right_on='CourseID')
print(merged_data)
Output
StudentID Name CourseID CourseName
0 1 Alice C001 Math
1 3 Charlie C001 Math
2 2 Bob C002 English
3 4 David C003 History
4 5 Eve C004 Science
In the above example, we performed a merge operation on two DataFrames students and courses using the merge()
method with various arguments.
Here, we used CourseID
as the key for merging the DataFrames.
Specify Join Type Using The how Argument
We can specify the join type in the how
argument. Here are the 5 join types we can use in the merge()
method:
- Left Join
- Right Join
- Outer Join
- Inner Join (Default)
- Cross Join
Note: If values are not found in the DataFrames, it fills the space with NaN
.
Left Join Pandas DataFrames
A left join merges two DataFrames based on a common key and returns a new DataFrame that contains all rows from the left DataFrame and the matched rows from the right DataFrame.
Let's look at an example.
import pandas as pd
# create two DataFrames
data1 = {
'StudentID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)
data2 = {
'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)
# left merge dataframe
left_join_data = pd.merge(students, courses, on='CourseID', how='left')
print(left_join_data)
Output
StudentID Name CourseID CourseName
0 1 Alice C001 Math
1 2 Bob C002 English
2 3 Charlie C001 Math
3 4 David C003 History
4 5 Eve C004 Science
In the above example, we use the how='left'
parameter in the merge()
function to perform a left join on the CourseID
column.
Hence, all rows from the left DataFrame students are included in the merged DataFrame, and matching rows from the right DataFrame courses are added based on the CourseID
key.
Right Join
A right join is the opposite of a left join. It returns a new DataFrame that contains all rows from the right DataFrame and the matched rows from the left DataFrame. For example,
import pandas as pd
# create two DataFrames
data1 = {
'StudentID': ['S001', 'S002', 'S003', 'S004', 'S005'],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)
data2 = {
'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)
# right merge the dataframes
right_join_data = pd.merge(students, courses, on='CourseID', how='right')
print(right_join_data)
Output
StudentID Name CourseID CourseName
0 S001 Alice C001 Math
1 S003 Charlie C001 Math
2 S002 Bob C002 English
3 S004 David C003 History
4 S005 Eve C004 Science
5 NaN NaN C005 Art
Here, we use the how='right'
parameter in the merge()
function to perform a right join on the CourseID
column.
Hence, all rows from the right DataFrame courses are included in the merged DataFrame, and matching rows from the left DataFrame students are added based on the CourseID
key.
Inner Join
An inner join combines two DataFrames based on a common key and returns a new DataFrame that contains only rows that have matching values in both of the original DataFrames. For example,
import pandas as pd
# create two DataFrames
data1 = {
'StudentID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)
data2 = {
'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)
# perform an inner join on CourseID
inner_join_data = pd.merge(students, courses, on='CourseID', how='inner')
print(inner_join_data)
Output
StudentID Name CourseID CourseName
0 1 Alice C001 Math
1 3 Charlie C001 Math
2 2 Bob C002 English
3 4 David C003 History
4 5 Eve C004 Science
In this example, we use the how='inner'
parameter to perform an inner join on the CourseID
column.
So, only the rows with matching CourseID
values in both students and courses are included in the merged DataFrame.
Outer Join
An outer join combines two DataFrames based on a common key. Unlike an inner join, an outer join returns a new DataFrame that contains all rows from both original DataFrames.
Let's take a look at an example.
import pandas as pd
# create two DataFrames
data1 = {
'StudentID': ['S001', 'S002', 'S003', 'S004', 'S005'],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)
data2 = {
'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)
# perform an outer join on 'CourseID'
outer_join_data = pd.merge(students, courses, on='CourseID', how='outer')
print(outer_join_data)
Output
StudentID Name CourseID CourseName
0 S001 Alice C001 Math
1 S003 Charlie C001 Math
2 S002 Bob C002 English
3 S004 David C003 History
4 S005 Eve C004 Science
5 NaN NaN C005 Art
Here, we set how='outer'
parameter in merge()
to perform an outer join on the CourseID
column.
As a result, all rows from both the left DataFrame students and the right DataFrame courses are included in the merged DataFrame.
Cross Join
A cross join in Pandas creates the cartesian product of both DataFrames while preserving the order of the left DataFrame. For example,
import pandas as pd
# create two DataFrames
data1 = {
'StudentID': [1, 2, 3, 4, 5],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)
data2 = {
'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)
# merge the dataframes
df_merge = pd.merge(students, courses, how = 'cross')
print(df_merge)
Output
StudentID Name CourseID_x CourseID_y CourseName
0 1 Alice C001 C001 Math
1 1 Alice C001 C002 English
2 1 Alice C001 C003 History
3 1 Alice C001 C004 Science
4 1 Alice C001 C005 Art
5 2 Bob C002 C001 Math
6 2 Bob C002 C002 English
7 2 Bob C002 C003 History
8 2 Bob C002 C004 Science
9 2 Bob C002 C005 Art
10 3 Charlie C001 C001 Math
11 3 Charlie C001 C002 English
12 3 Charlie C001 C003 History
13 3 Charlie C001 C004 Science
14 3 Charlie C001 C005 Art
15 4 David C003 C001 Math
16 4 David C003 C002 English
17 4 David C003 C003 History
18 4 David C003 C004 Science
19 4 David C003 C005 Art
20 5 Eve C004 C001 Math
21 5 Eve C004 C002 English
22 5 Eve C004 C003 History
23 5 Eve C004 C004 Science
24 5 Eve C004 C005 Art
Here, the how='cross'
parameter inside merge()
performs a cross join between the students and courses DataFrames.
Hence, every row from the first DataFrame students is combined with every row from the second DataFrame courses, resulting in a Cartesian product of the two DataFrames.