2022-11-15

Merging, Concatenating, and Joining DataFrames in Pandas

Introduction

Data manipulation is an essential skill for data scientists and analysts, as it helps to prepare and clean data for further analysis. In Python, the pandas library is a flexible tool that simplifies data manipulation tasks. In this article, I will cover three essential techniques for combining DataFrames in pandas: merging, concatenating, and joining.

Merging DataFrames

Merging is a technique used to combine two DataFrames based on the values of their columns. The merge function in pandas allows you to merge DataFrames using various types of joins, such as inner, outer, left, and right. In this chapter, I will explore each of these join types and their applications.

Inner Merge

An inner merge, also known as an inner join, returns only the rows with matching keys in both DataFrames. It is the default merge type in pandas. To perform an inner merge, you can use the pd.merge() function:

python
import pandas as pd

# Sample DataFrames
data1 = {'key': ['A', 'B', 'C', 'D'],
         'value': [1, 2, 3, 4]}
data2 = {'key': ['B', 'D', 'E', 'F'],
         'value': [5, 6, 7, 8]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Inner Merge
merged_df = pd.merge(df1, df2, on='key', suffixes=('_df1', '_df2'))
print(merged_df)
  key  value_df1  value_df2
0   B          2          5
1   D          4          6

Outer Merge

An outer merge, also known as an outer join, returns all rows from both DataFrames, filling in missing values with NaN when a key is not present in one of the DataFrames. To perform an outer merge, you need to specify the how parameter as 'outer':

python
# Outer Merge
merged_df = pd.merge(df1, df2, on='key', how='outer', suffixes=('_df1', '_df2'))
print(merged_df)
  key  value_df1  value_df2
0   A        1.0        NaN
1   B        2.0        5.0
2   C        3.0        NaN
3   D        4.0        6.0
4   E        NaN        7.0
5   F        NaN        8.0

Left Merge

A left merge, or left join, returns all rows from the left DataFrame and the matched rows from the right DataFrame. If no match is found, NaN values are used. To perform a left merge, you need to specify the how parameter as 'left':

python
# Left Merge
merged_df = pd.merge(df1, df2, on='key', how='left', suffixes=('_df1', '_df2'))
print(merged_df)
  key  value_df1  value_df2
0   A          1        NaN
1   B          2        5.0
2   C          3        NaN
3   D          4        6.0

Right Merge

A right merge, or right join, returns all rows from the right DataFrame and the matched rows from the left DataFrame. If no match is found, NaN values are used. To perform a right merge, you need to specify the how parameter as 'right':

python
# Right Merge
merged_df = pd.merge(df1, df2, on='key', how='right', suffixes=('_df1', '_df2'))
print(merged_df)
  key  value_df1  value_df2
0   B        2.0          5
1   D        4.0          6
2   E        NaN          7
3   F        NaN          8

Concatenating DataFrames

Concatenation is the process of joining DataFrames end-to-end along a particular axis, either rows or columns. Unlike merging, concatenation does not rely on matching column values. Instead, it simply appends the DataFrames together. In this chapter, I will discuss how to concatenate DataFrames along rows and columns using pandas.

Concatenation Along Rows

To concatenate DataFrames along rows, you can use the pd.concat() function. By default, the function concatenates along rows (axis=0). Here's an example:

python
import pandas as pd

# Sample DataFrames
data1 = {'A': [1, 2, 3],
         'B': [4, 5, 6]}
data2 = {'A': [7, 8, 9],
         'B': [10, 11, 12]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Concatenation along rows
concatenated_df = pd.concat([df1, df2], ignore_index=True)
print(concatenated_df)
   A   B
0  1   4
1  2   5
2  3   6
3  7  10
4  8  11
5  9  12

In the example above, the ignore_index parameter is set to True, which reindexes the resulting DataFrame. If it's set to False, the original index values are preserved.

Concatenation Along Columns

To concatenate DataFrames along columns, you need to specify the axis parameter as 1 in the pd.concat() function:

python
# Sample DataFrames
data1 = {'A': [1, 2, 3],
         'B': [4, 5, 6]}
data2 = {'C': [7, 8, 9],
         'D': [10, 11, 12]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Concatenation along columns
concatenated_df = pd.concat([df1, df2], axis=1)
print(concatenated_df)
   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

In this example, the DataFrames are concatenated side-by-side along the columns, resulting in a wider DataFrame. Note that the number of rows in both DataFrames must be the same to concatenate along columns.

Joining DataFrames

Joining is another technique to combine DataFrames in pandas, similar to merging. However, joining specifically focuses on combining DataFrames based on their index values rather than matching column values. In this chapter, I will discuss the different types of joins available in pandas and how to perform them using the join() function.

Inner Join

An inner join returns rows from both DataFrames only when the index values match in both. To perform an inner join, you can use the join() function and set the how parameter to 'inner':

python
import pandas as pd

# Sample DataFrames
data1 = {'A': [1, 2, 3],
         'B': [4, 5, 6]}
data2 = {'C': [7, 8, 9],
         'D': [10, 11, 12]}

df1 = pd.DataFrame(data1, index=['X', 'Y', 'Z'])
df2 = pd.DataFrame(data2, index=['Y', 'Z', 'W'])

# Inner Join
joined_df = df1.join(df2, how='inner')
print(joined_df)
   A  B  C   D
Y  2  5  8  11
Z  3  6  9  12

Outer Join

An outer join returns all rows from both DataFrames, filling in missing values with NaN when an index value is not present in one of the DataFrames. To perform an outer join, set the how parameter to 'outer':

python
# Outer Join
joined_df = df1.join(df2, how='outer')
print(joined_df)
     A    B    C     D
W  NaN  NaN  9.0  12.0
X  1.0  4.0  NaN   NaN
Y  2.0  5.0  8.0  11.0
Z  3.0  6.0  9.0  12.0

Left Join

A left join returns all rows from the left DataFrame and the matched rows from the right DataFrame. If no match is found, NaN values are used. To perform a left join, set the how parameter to 'left':

python
# Left Join
joined_df = df1.join(df2, how='left')
print(joined_df)
   A  B    C     D
X  1  4  NaN   NaN
Y  2  5  8.0  11.0
Z  3  6  9.0  12.0

Right Join

A right join returns all rows from the right DataFrame and the matched rows from the left DataFrame. If no match is found, NaN values are used. To perform a right join, set the how parameter to 'right':

python
# Right Join
joined_df = df1.join(df2, how='right')
print(joined_df)
     A    B  C   D
Y  2.0  5.0  8  11
Z  3.0  6.0  9  12
W  NaN  NaN  9  12

Ryusei Kakujo

researchgatelinkedingithub

Focusing on data science for mobility

Bench Press 100kg!