Pandas merge()

The merge() function in Pandas merges two DataFrames based on their indexes or a specified column.

Example

import pandas as pd

# sample DataFrames with meaningful keys
employees = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

salaries = pd.DataFrame({
    'employee_id': [2, 3, 4, 5],
    'salary': [60000, 75000, 80000, 55000]
})

# merge two dataframes merged_df = pd.merge(employees, salaries)
print(merged_df) ''' Output employee_id name salary 0 2 Bob 60000 1 3 Charlie 75000 2 4 David 80000 '''

merge() Syntax

The syntax of the merge() method in Pandas is:

pd.merge(left, right, on=None, how='inner', left_on=None, right_on=None, sort=False)

merge() Arguments

The merge() function takes following arguments:

  • left - specifies the left DataFrame to be merged
  • right - specifies the right DataFrame to be merged
  • on (optional) - specifies column(s) to join on
  • how (optional) - specifies the type of join to perform
  • left_on (optional) - specifies column(s) from the left DataFrame to use as key(s) for merging
  • right_on (optional) - specifies column(s) from the right DataFrame to use as key(s) for merging
  • sort (optional) - if True, sort the result DataFrame by the join keys

merge() Return Value

The merge() function returns a new DataFrame that represents the result of the merge operation.


Example: Merge DataFrames Based on Keys

In Pandas, we can merge DataFrames based on keys. For example,

import pandas as pd

# create two DataFrames
data1 = {
    'StudentID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)

data2 = {
    'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
    'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)

# merge the DataFrames based on CourseID key merged_data = pd.merge(students, courses, left_on='CourseID', right_on='CourseID')
print(merged_data)

Output

         StudentID   Name    CourseID  CourseName
0               1    Alice   C001      Math
1               3    Charlie C001      Math
2               2    Bob     C002      English
3               4    David   C003      History
4               5    Eve     C004      Science

In the above example, we performed a merge operation on two DataFrames students and courses using the merge() method with various arguments.

Here, we used CourseID as the key for merging the DataFrames.


Specify Join Type Using The how Argument

We can specify the join type in the how argument. Here are the 5 join types we can use in the merge() method:

  • Left Join
  • Right Join
  • Outer Join
  • Inner Join (Default)
  • Cross Join

Note: If values are not found in the DataFrames, it fills the space with NaN.


Left Join Pandas DataFrames

A left join merges two DataFrames based on a common key and returns a new DataFrame that contains all rows from the left DataFrame and the matched rows from the right DataFrame.

Let's look at an example.

import pandas as pd

# create two DataFrames
data1 = {
    'StudentID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)

data2 = {
    'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
    'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)

# left merge dataframe left_join_data = pd.merge(students, courses, on='CourseID', how='left')
print(left_join_data)

Output

   StudentID     Name CourseID CourseName
0          1    Alice     C001       Math
1          2      Bob     C002    English
2          3  Charlie     C001       Math
3          4    David     C003    History
4          5      Eve     C004    Science

In the above example, we use the how='left' parameter in the merge() function to perform a left join on the CourseID column.

Hence, all rows from the left DataFrame students are included in the merged DataFrame, and matching rows from the right DataFrame courses are added based on the CourseID key.


Right Join

A right join is the opposite of a left join. It returns a new DataFrame that contains all rows from the right DataFrame and the matched rows from the left DataFrame. For example,

import pandas as pd

# create two DataFrames
data1 = {
    'StudentID': ['S001', 'S002', 'S003', 'S004', 'S005'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)

data2 = {
    'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
    'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)

# right merge the dataframes right_join_data = pd.merge(students, courses, on='CourseID', how='right')
print(right_join_data)

Output

  StudentID     Name CourseID CourseName
0      S001    Alice     C001       Math
1      S003  Charlie     C001       Math
2      S002      Bob     C002    English
3      S004    David     C003    History
4      S005      Eve     C004    Science
5       NaN      NaN     C005        Art

Here, we use the how='right' parameter in the merge() function to perform a right join on the CourseID column.

Hence, all rows from the right DataFrame courses are included in the merged DataFrame, and matching rows from the left DataFrame students are added based on the CourseID key.


Inner Join

An inner join combines two DataFrames based on a common key and returns a new DataFrame that contains only rows that have matching values in both of the original DataFrames. For example,

import pandas as pd

# create two DataFrames
data1 = {
    'StudentID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)

data2 = {
    'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
    'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)

# perform an inner join on CourseID inner_join_data = pd.merge(students, courses, on='CourseID', how='inner')
print(inner_join_data)

Output

     StudentID     Name     CourseID  CourseName
0          1        Alice     C001       Math
1          3      Charlie     C001       Math
2          2          Bob     C002       English
3          4       David      C003       History
4          5         Eve      C004       Science

In this example, we use the how='inner' parameter to perform an inner join on the CourseID column.

So, only the rows with matching CourseID values in both students and courses are included in the merged DataFrame.


Outer Join

An outer join combines two DataFrames based on a common key. Unlike an inner join, an outer join returns a new DataFrame that contains all rows from both original DataFrames.

Let's take a look at an example.

import pandas as pd

# create two DataFrames
data1 = {
    'StudentID': ['S001', 'S002', 'S003', 'S004', 'S005'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)

data2 = {
    'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
    'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)

# perform an outer join on 'CourseID' outer_join_data = pd.merge(students, courses, on='CourseID', how='outer')
print(outer_join_data)

Output

  StudentID     Name  CourseID CourseName
0      S001    Alice     C001       Math
1      S003  Charlie     C001       Math
2      S002      Bob     C002    English
3      S004    David     C003    History
4      S005      Eve     C004    Science
5       NaN      NaN     C005        Art

Here, we set how='outer' parameter in merge() to perform an outer join on the CourseID column.

As a result, all rows from both the left DataFrame students and the right DataFrame courses are included in the merged DataFrame.


Cross Join

A cross join in Pandas creates the cartesian product of both DataFrames while preserving the order of the left DataFrame. For example,

import pandas as pd

# create two DataFrames
data1 = {
    'StudentID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'CourseID': ['C001', 'C002', 'C001', 'C003', 'C004'],
}
students = pd.DataFrame(data1)

data2 = {
    'CourseID': ['C001', 'C002', 'C003', 'C004', 'C005'],
    'CourseName': ['Math', 'English', 'History', 'Science', 'Art'],
}
courses = pd.DataFrame(data2)

# merge the dataframes df_merge = pd.merge(students, courses, how = 'cross')
print(df_merge)

Output

    StudentID     Name  CourseID_x  CourseID_y CourseName
0           1    Alice       C001       C001       Math
1           1    Alice       C001       C002    English
2           1    Alice       C001       C003    History
3           1    Alice       C001       C004    Science
4           1    Alice       C001       C005        Art
5           2      Bob       C002       C001       Math
6           2      Bob       C002       C002    English
7           2      Bob       C002       C003    History
8           2      Bob       C002       C004    Science
9           2      Bob       C002       C005        Art
10          3  Charlie       C001       C001       Math
11          3  Charlie       C001       C002    English
12          3  Charlie       C001       C003    History
13          3  Charlie       C001       C004    Science
14          3  Charlie       C001       C005        Art
15          4    David       C003       C001       Math
16          4    David       C003       C002    English
17          4    David       C003       C003    History
18          4    David       C003       C004    Science
19          4    David       C003       C005        Art
20          5      Eve       C004       C001       Math
21          5      Eve       C004       C002    English
22          5      Eve       C004       C003    History
23          5      Eve       C004       C004    Science
24          5      Eve       C004       C005        Art

Here, the how='cross' parameter inside merge() performs a cross join between the students and courses DataFrames.

Hence, every row from the first DataFrame students is combined with every row from the second DataFrame courses, resulting in a Cartesian product of the two DataFrames.