2022-11-16

Data Filtering Techniques in Pandas

Introduction

In this article, I will explain various techniques for filtering data in Pandas. Filtering is a crucial operation in data analysis, allowing you to select specific data from your DataFrame based on certain conditions. By mastering these techniques, you can effectively analyze and manipulate your data.

Using Boolean Indexing

Boolean indexing is a straightforward and powerful technique to filter data in a Pandas DataFrame. It involves creating a boolean mask, a Series of True/False values with one entry per row, that represents the condition you want to filter by. For example, let's say you have a DataFrame named df containing information about different products, and you want to select the products with a price greater than 100.

python
price_filter = df['price'] > 100
filtered_df = df[price_filter]
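
To make the mask concrete, here is a minimal, self-contained sketch. The sample data is hypothetical; only the column names follow the examples in this article.

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the article's examples.
df = pd.DataFrame({
    "name": ["Laptop Pro", "Mouse", "Monitor", "Laptop Air"],
    "price": [1200, 25, 180, 950],
    "rating": [4.5, 4.1, 3.9, 4.7],
})

# The mask is a boolean Series aligned with df's index.
price_filter = df["price"] > 100
print(price_filter.tolist())  # [True, False, True, True]

# Indexing with the mask keeps only the rows where it is True.
filtered_df = df[price_filter]
print(filtered_df["name"].tolist())  # ['Laptop Pro', 'Monitor', 'Laptop Air']
```

Note that the result keeps the original index labels (0, 2, 3 here); call reset_index(drop=True) if you need a fresh 0..n-1 index.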

Using the query Method

The query method is another convenient way to filter data in a DataFrame. It takes the condition as a string, in which column names can be used directly, and returns the rows that meet it. Using the same example, you can select products with a price greater than 100 as follows:

python
filtered_df = df.query('price > 100')
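
A query string can also refer to Python variables by prefixing them with @. The sketch below assumes a small hypothetical DataFrame and a hypothetical variable named min_price.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "name": ["Laptop Pro", "Mouse", "Monitor"],
    "price": [1200, 25, 180],
})

min_price = 100  # hypothetical threshold defined outside the query string

# '@min_price' is looked up in the surrounding Python scope.
filtered_df = df.query("price > @min_price")
print(filtered_df["name"].tolist())  # ['Laptop Pro', 'Monitor']
```

This keeps thresholds out of the string itself, which is handy when the value comes from user input or configuration.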

Combining Filters

In some cases, you might want to apply multiple filters to your data. To combine filters, you can use the element-wise logical operators & (and), | (or), and ~ (not); Python's own and, or, and not keywords do not work on boolean masks. For example, let's select products with a price greater than 100 and a rating higher than 4:

python
price_filter = df['price'] > 100
rating_filter = df['rating'] > 4
filtered_df = df[price_filter & rating_filter]
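
When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because & and | bind more tightly than > or <=. A short sketch on hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "name": ["Laptop Pro", "Mouse", "Monitor", "Laptop Air"],
    "price": [1200, 25, 180, 950],
    "rating": [4.5, 4.1, 3.9, 4.7],
})

# AND: both conditions must hold.
expensive_and_good = df[(df["price"] > 100) & (df["rating"] > 4)]
print(expensive_and_good["name"].tolist())  # ['Laptop Pro', 'Laptop Air']

# OR: at least one condition must hold.
cheap_or_good = df[(df["price"] <= 100) | (df["rating"] > 4)]
print(cheap_or_good["name"].tolist())  # ['Laptop Pro', 'Mouse', 'Laptop Air']

# NOT: ~ inverts a mask.
not_expensive = df[~(df["price"] > 100)]
print(not_expensive["name"].tolist())  # ['Mouse']
```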

Filtering Based on String Patterns

Pandas also allows you to filter data based on string patterns. You can use the str accessor along with string methods such as contains, startswith, or endswith. For example, to select products with a name containing the word "Laptop":

python
# na=False treats missing names as non-matches, so the mask stays purely boolean
name_filter = df['name'].str.contains('Laptop', na=False)
filtered_df = df[name_filter]
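
String matching is case-sensitive by default, and missing values produce missing entries in the mask. The sketch below, on hypothetical sample data, shows the case and na parameters of str.contains and the na parameter of str.startswith.

```python
import pandas as pd

# Hypothetical sample data, including a missing name.
df = pd.DataFrame({"name": ["Laptop Pro", "gaming laptop", None, "Mouse"]})

# case=False ignores letter case; na=False counts missing names as non-matches.
contains_laptop = df["name"].str.contains("laptop", case=False, na=False)
print(contains_laptop.tolist())  # [True, True, False, False]

# startswith does a literal, case-sensitive prefix check.
starts_with_laptop = df["name"].str.startswith("Laptop", na=False)
print(starts_with_laptop.tolist())  # [True, False, False, False]

filtered_df = df[contains_laptop]
```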

Filtering with Date and Time Conditions

Pandas provides robust support for filtering based on date and time conditions. If you have a datetime column (dtype datetime64), you can compare it directly against date strings to filter by specific date ranges or time periods. For example, to select records created after January 1, 2022:

python
date_filter = df['created_at'] > '2022-01-01'
filtered_df = df[date_filter]

You can also use the between method, which includes both endpoints by default, to select data within a specific date range:

python
date_filter = df['created_at'].between('2022-01-01', '2022-12-31')
filtered_df = df[date_filter]
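
Two details are easy to miss here: the column must actually hold datetimes rather than strings, and a bare date such as '2022-12-31' is read as midnight at the start of that day. The sketch below uses hypothetical timestamps to show the difference between between and a half-open range.

```python
import pandas as pd

# Hypothetical sample data; timestamps start out as plain strings.
df = pd.DataFrame({
    "created_at": [
        "2021-12-31 23:59:00",
        "2022-01-01 00:00:00",
        "2022-12-31 15:30:00",
        "2023-01-01 00:00:00",
    ],
})

# Convert to datetime64 so comparisons are chronological, not alphabetical.
df["created_at"] = pd.to_datetime(df["created_at"])

# between() is inclusive, but the upper bound is 2022-12-31 00:00:00,
# so the record from 15:30 on December 31 is NOT selected.
with_between = df[df["created_at"].between("2022-01-01", "2022-12-31")]
print(len(with_between))  # 1

# A half-open range [start, end) captures the whole of 2022.
start, end = pd.Timestamp("2022-01-01"), pd.Timestamp("2023-01-01")
in_2022 = df[(df["created_at"] >= start) & (df["created_at"] < end)]
print(len(in_2022))  # 2
```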

Using the isin Method

The isin method is useful for filtering data based on a list of values. It checks if each element in a given column is present in the specified list and returns a boolean mask. For example, let's say you have a DataFrame named df with a column 'category' and you want to select products belonging to the categories 'Electronics' or 'Clothing':

python
category_filter = df['category'].isin(['Electronics', 'Clothing'])
filtered_df = df[category_filter]
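
The same mask can be inverted with ~ to exclude the listed values instead. A short sketch on hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "name": ["Laptop Pro", "T-Shirt", "Puzzle", "Monitor"],
    "category": ["Electronics", "Clothing", "Toys", "Electronics"],
})

wanted = ["Electronics", "Clothing"]

# Rows whose category is in the list.
in_wanted = df[df["category"].isin(wanted)]
print(in_wanted["name"].tolist())  # ['Laptop Pro', 'T-Shirt', 'Monitor']

# Rows whose category is NOT in the list.
not_in_wanted = df[~df["category"].isin(wanted)]
print(not_in_wanted["name"].tolist())  # ['Puzzle']
```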

Applying Custom Functions

Sometimes, you may need to apply a custom function to filter your data. Pandas provides the apply method, which allows you to apply a custom function along a specific axis (rows or columns) of your DataFrame. For example, let's create a custom filter to select products with a price-to-rating ratio greater than 20:

python
def price_to_rating_ratio(row):
    return row['price'] / row['rating'] > 20

ratio_filter = df.apply(price_to_rating_ratio, axis=1)
filtered_df = df[ratio_filter]
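
Because apply with axis=1 calls the Python function once per row, it tends to be slow on large DataFrames. When the condition can be written with column arithmetic, as it can for this ratio, a vectorized expression gives the same mask. A sketch on hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "price": [1200, 25, 180, 950],
    "rating": [4.5, 4.1, 3.9, 4.7],
})

# Vectorized: the division and comparison run on whole columns at once.
ratio_filter = (df["price"] / df["rating"]) > 20

# Row-wise equivalent using apply, shown only for comparison.
row_wise = df.apply(lambda row: row["price"] / row["rating"] > 20, axis=1)

print(ratio_filter.tolist())  # [True, False, True, True]
print(ratio_filter.tolist() == row_wise.tolist())  # True

filtered_df = df[ratio_filter]
```

Reserve apply for logic that genuinely cannot be expressed with column operations.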

Filtering with Regular Expressions

Pandas also supports filtering data using regular expressions. You can use the str accessor along with the match method, which tests whether the pattern matches at the beginning of each string. For example, let's select products with a name starting with 'Laptop':

python
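# str.match already anchors the pattern at the start of each string,
# so the leading '^' is optional; no import of the re module is needed
name_filter = df['name'].str.match(r'^Laptop')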
filtered_df = df[name_filter]
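
The str accessor offers three regex methods that differ only in where the pattern must match: match (at the start), contains (anywhere), and fullmatch (the entire string). The sketch below compares them on hypothetical sample names.

```python
import pandas as pd

# Hypothetical sample data for illustration.
names = pd.Series(["Laptop Pro", "Gaming Laptop", "Laptop", "Mouse"])

# match: the pattern must match at the beginning of the string.
print(names.str.match(r"Laptop").tolist())      # [True, False, True, False]

# contains: the pattern may match anywhere in the string.
print(names.str.contains(r"Laptop").tolist())   # [True, True, True, False]

# fullmatch: the pattern must match the whole string.
print(names.str.fullmatch(r"Laptop").tolist())  # [False, False, True, False]
```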

Filtering with Multi-Index DataFrames

In cases where you have a DataFrame with a multi-level index, you can use the xs method to filter data based on specific index levels. For example, let's say you have a multi-index DataFrame with 'category' and 'sub_category' as index levels, and you want to select all rows with 'Electronics' as the category:

python
filtered_df = df.xs('Electronics', level='category')
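
By default, xs removes the level you selected on from the result. Passing drop_level=False keeps it. The sketch below builds a small hypothetical multi-index DataFrame to show both behaviors.

```python
import pandas as pd

# Hypothetical multi-index DataFrame with the level names used above.
index = pd.MultiIndex.from_tuples(
    [
        ("Electronics", "Laptops"),
        ("Electronics", "Accessories"),
        ("Clothing", "Shirts"),
        ("Clothing", "Shoes"),
    ],
    names=["category", "sub_category"],
)
df = pd.DataFrame({"price": [1200, 25, 40, 60]}, index=index)

# Default: the 'category' level is dropped from the result's index.
dropped = df.xs("Electronics", level="category")
print(list(dropped.index.names))  # ['sub_category']

# drop_level=False keeps both index levels.
kept = df.xs("Electronics", level="category", drop_level=False)
print(list(kept.index.names))  # ['category', 'sub_category']
```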

Conditional Aggregation

Conditional aggregation allows you to apply aggregation functions to subsets of your data based on certain conditions. For example, let's calculate the average price of products with a rating higher than 4:

python
rating_filter = df['rating'] > 4
# Series.mean() gives the same result as np.mean here, without the extra import
average_price = df.loc[rating_filter, 'price'].mean()
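
The same idea extends to per-group results: filter first, then group and aggregate. The sketch below assumes hypothetical sample data with a 'category' column and computes the average price of highly rated products overall and per category.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Clothing", "Clothing"],
    "price": [1200, 200, 40, 60],
    "rating": [4.5, 3.9, 4.2, 4.8],
})

# Keep only products with a rating higher than 4.
high_rated = df[df["rating"] > 4]

# Overall average price of the filtered rows: (1200 + 40 + 60) / 3.
average_price = high_rated["price"].mean()

# Average price of the filtered rows within each category.
average_by_category = high_rated.groupby("category")["price"].mean()
print(average_by_category["Clothing"])     # 50.0
print(average_by_category["Electronics"])  # 1200.0
```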

Ryusei Kakujo

