2022-12-15

Pandas DataFrame Normalization

Normalization of DataFrame

Data normalization is the process of adjusting values measured on different scales to a common scale. This article shows how to normalize a pandas DataFrame using scikit-learn.

Min-Max Normalization

Min-Max Normalization is a technique that rescales each attribute (column) to the range [0, 1]. For every column, it subtracts the column's minimum value and then divides by the column's range (maximum minus minimum).

Here's how you can do it in Python:

python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# Create a scaler object
scaler = MinMaxScaler()

# Fit and transform the DataFrame
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

The normalized DataFrame (df_normalized) will look like this:

      A     B     C
0  0.00  0.00  0.00
1  0.25  0.25  0.25
2  0.50  0.50  0.50
3  0.75  0.75  0.75
4  1.00  1.00  1.00
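
To make the formula concrete, the following minimal sketch reproduces the scaler's output by hand with plain pandas arithmetic. The variable names manual and scaled are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# Min-max formula applied column by column: (x - min) / (max - min)
manual = (df - df.min()) / (df.max() - df.min())

# Reference result from scikit-learn, keeping column names and index
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df),
    columns=df.columns,
    index=df.index,
)

# Raises an AssertionError if the two results differ beyond float tolerance
pd.testing.assert_frame_equal(manual, scaled)
print(manual)
```

Both approaches work one column at a time, which is why columns with very different magnitudes end up with identical scaled values here.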

Standardization

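Standardization is a technique that transforms each attribute (column) so that it has a mean of 0 and a standard deviation of 1. For every column, it subtracts the column's mean and then divides by the column's standard deviation. StandardScaler uses the population standard deviation (ddof=0), not the sample standard deviation that pandas computes by default.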

Here's how you can do it in Python:

python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# Create a scaler object
scaler = StandardScaler()

# Fit and transform the DataFrame
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

The standardized DataFrame (df_normalized) will look like this:

          A         B         C
0 -1.414214 -1.414214 -1.414214
1 -0.707107 -0.707107 -0.707107
2  0.000000  0.000000  0.000000
3  0.707107  0.707107  0.707107
4  1.414214  1.414214  1.414214
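
As a cross-check, here is a minimal sketch that standardizes the same DataFrame by hand. It assumes the population standard deviation (ddof=0), which matches what StandardScaler divides by. The names manual, scaled, and sample_based are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

# z-score per column; ddof=0 selects the population standard deviation
manual = (df - df.mean()) / df.std(ddof=0)

scaled = pd.DataFrame(
    StandardScaler().fit_transform(df),
    columns=df.columns,
    index=df.index,
)

# Raises an AssertionError if the two results differ beyond float tolerance
pd.testing.assert_frame_equal(manual, scaled)

# pandas' default is ddof=1 (sample standard deviation), which gives
# slightly smaller magnitudes than StandardScaler
sample_based = (df - df.mean()) / df.std()
print(manual['A'].round(6).tolist())
print(sample_based['A'].round(6).tolist())
```

The difference between the two conventions shrinks as the number of rows grows, but on a five-row example like this one it is clearly visible.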

Separating Fit and Transform Processes

In some cases, it might be necessary to separate the fit and transform processes, especially when we need to apply the same scaling parameters to different datasets (e.g., training set and test set).

First, we call the fit method on the training set to compute the scaling parameters: the minimum and maximum of each column for min-max normalization, or the mean and standard deviation of each column for standardization. We then call the transform method to normalize both the training set and the test set with those parameters.

Here's an example in Python:

python
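import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder data for illustration; substitute your own training and test sets
df_train = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
df_test = pd.DataFrame({'A': [5], 'B': [50]})

# Create a scaler object
scaler = StandardScaler()

# Apply fit method to training data only
scaler.fit(df_train)

# Use transform method on both training and test data
df_train_normalized = pd.DataFrame(scaler.transform(df_train), columns=df_train.columns)
df_test_normalized = pd.DataFrame(scaler.transform(df_test), columns=df_test.columns)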

The fit method learns the parameters from the training data, and the transform method applies these parameters to normalize the data. This way, both training and test datasets are normalized with the same parameters, ensuring consistency in your machine learning pipeline.
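
The sketch below shows what the scaler actually stores after fit, and what happens when test data falls outside the range seen during training. It swaps in MinMaxScaler because the effect is easier to read in a [0, 1] range. The train and test values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical split, invented for this example
df_train = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
df_test = pd.DataFrame({'A': [5, 0], 'B': [50, 25]})

scaler = MinMaxScaler()
scaler.fit(df_train)  # learns per-column min and max from the training set only

# Parameters learned during fit, one entry per column
print(scaler.data_min_)
print(scaler.data_max_)

# The test set is scaled with the training parameters, so values that lie
# outside the training range map outside [0, 1]
df_test_scaled = pd.DataFrame(
    scaler.transform(df_test),
    columns=df_test.columns,
    index=df_test.index,
)
print(df_test_scaled)
```

Out-of-range values in the test output are expected in this setup. Refitting the scaler on the test set would hide them, but it would also mean the two sets are no longer scaled with the same parameters.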

Inverse Transformation: Returning to Original Values

After normalization, if you wish to convert your data back to its original form, you can use the inverse_transform method. This might be useful when you want to interpret your results in the original scale.

Here's how you can perform an inverse transformation:

python
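# Inverse transform the normalized data
# (scaler must be the same fitted object that produced df_normalized)
df_inverse = pd.DataFrame(scaler.inverse_transform(df_normalized), columns=df.columns)

After this operation, df_inverse holds the same values as the original DataFrame df, up to floating-point rounding. Note that the restored values are floats, even where the original columns contained integers.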
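
A self-contained round trip might look like the sketch below. It assumes the same fitted scaler object is used for both directions, and it compares values with a tolerance because the restored numbers are floating point.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

scaler = StandardScaler()
df_normalized = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)

# Undo the scaling with the same fitted scaler
df_inverse = pd.DataFrame(
    scaler.inverse_transform(df_normalized), columns=df.columns, index=df.index
)

# Compare with a tolerance rather than with ==
print(np.allclose(df_inverse, df))

# The original columns are integers; the restored ones are float64
print(df.dtypes.tolist())
print(df_inverse.dtypes.tolist())
```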

Normalization of Specific Columns

In some scenarios, we may need to normalize only certain columns in a DataFrame. This can be done by applying the scaler to those specific columns.

Here's how you can do this in Python:

python
# Create a scaler object
scaler = StandardScaler()

# Apply fit_transform to specific columns
df['A'] = scaler.fit_transform(df[['A']])

In this example, only column 'A' is normalized. Note the double brackets [['A']]: they select a one-column DataFrame rather than a Series, which matters because scikit-learn scalers expect 2D input (rows by columns).
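
The shape difference can be checked directly. The sketch below uses a small made-up DataFrame and shows that passing a 1D Series raises an error, while a 2D selection or a reshaped array is accepted.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

print(df['A'].shape)    # (5,)   -> Series, 1D
print(df[['A']].shape)  # (5, 1) -> DataFrame, 2D

scaler = StandardScaler()

# A 1D Series is rejected by the scaler's input validation
try:
    scaler.fit_transform(df['A'])
    raised = False
except ValueError:
    raised = True
print(raised)

# Equivalent 2D input built from the Series
col_2d = df['A'].to_numpy().reshape(-1, 1)
scaled = scaler.fit_transform(col_2d)
print(scaled.shape)
```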

For multiple columns, you can simply provide the list of column names:

python
df[['A', 'B']] = scaler.fit_transform(df[['A', 'B']])

This will normalize only 'A' and 'B' columns, and leave the rest of the DataFrame unchanged.
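
If you prefer not to overwrite the original DataFrame, one option is to wrap the assignment in a small helper that works on a copy. The function scale_columns below is a hypothetical helper written for this article, not part of pandas or scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_columns(df, columns, scaler):
    """Return a copy of df with the given columns scaled (hypothetical helper)."""
    out = df.copy()
    # Fit on the selected columns only; all other columns are left as they are
    out[columns] = scaler.fit_transform(df[columns])
    return out


df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
})

result = scale_columns(df, ['A', 'B'], StandardScaler())
print(result)

# The input DataFrame keeps its original values
print(df)
```

Returning a copy keeps the raw values available, which is convenient when you want to compare several scalers on the same data.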

Ryusei Kakujo

