What Are Normalization and Standardization?
Normalization and standardization are two popular feature scaling techniques used to address the challenges of working with data of different scales. Both techniques aim to transform the input features to a common scale, but they differ in their approach and assumptions.
Normalization techniques typically rescale the input features to a specific range, such as [0, 1] or [-1, 1]. This is achieved by scaling the data according to its minimum and maximum values, by using L1 or L2 normalization, or by applying transformations such as the log, Box-Cox, or Yeo-Johnson transformation. These techniques are best suited for data with a known or desired range and are particularly useful when dealing with non-Gaussian distributions.
Standardization techniques, on the other hand, transform the input features to have zero mean and unit variance. This is done by subtracting the mean and dividing by the standard deviation, effectively centering the distribution around zero. Standardization techniques, such as Z-score standardization, median and median absolute deviation (MAD) standardization, and robust scaling, are suitable for data with unknown distributions or when the assumption of a Gaussian distribution is reasonable.
Normalization Techniques
Min-Max Normalization
Min-Max normalization is a simple and widely used technique that scales the features of a dataset to a specific range, typically [0, 1]. The formula for Min-Max normalization is:
x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}
where x is the original feature value, x_{min} is the minimum value of the feature, and x_{max} is its maximum value.
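As a quick illustration, here is a minimal NumPy sketch of Min-Max normalization applied per feature to a small matrix (the values are made up for demonstration); scikit-learn's MinMaxScaler provides an equivalent fit/transform implementation.

```python
import numpy as np

# Toy feature matrix: each column is a feature on a different scale
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 800.0]])

# Min-Max normalization per feature: (x - min) / (max - min)
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_normalized = (X - x_min) / (x_max - x_min)

print(X_normalized)  # every column now lies in [0, 1]
```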
Pros
- Easy to understand and implement
- Suitable for data with a known or desired range
- Preserves the shape of the original distribution
Cons
- Sensitive to outliers, which can lead to the compression of the majority of the data in a small range
- Not suitable for data with an unknown or infinite range
L1 and L2 Normalization
L1 and L2 normalization are techniques that scale the data based on their L1 or L2 norms, respectively. The L1 norm is the sum of the absolute values of the feature vector, while the L2 norm is the square root of the sum of the squared values of the feature vector. The formulas for L1 and L2 normalization are:
- L1 Normalization: x_{normalized} = \frac{x}{||x||_1}
- L2 Normalization: x_{normalized} = \frac{x}{||x||_2}
where x is the feature vector, ||x||_1 is its L1 norm, and ||x||_2 is its L2 norm.
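As a minimal sketch (illustrative values), the following NumPy snippet normalizes each sample vector (row) by its own L1 or L2 norm; scikit-learn's Normalizer applies the same row-wise scaling.

```python
import numpy as np

# Two sample vectors (illustrative values)
X = np.array([[3.0, 4.0],
              [1.0, -2.0]])

# L1 normalization: divide each row by the sum of its absolute values
l1_norms = np.abs(X).sum(axis=1, keepdims=True)
X_l1 = X / l1_norms

# L2 normalization: divide each row by its Euclidean norm
l2_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_l2 = X / l2_norms

print(X_l1)  # absolute values in each row sum to 1
print(X_l2)  # each row has unit Euclidean length
```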
Pros
- Less sensitive to outliers compared to Min-Max normalization
- L1 normalization creates sparse feature vectors, which can be useful for feature selection and dimensionality reduction
- L2 normalization is invariant to the scale and rotation of the input data
Cons
- L1 normalization may not be suitable for data with a large number of zero values
- L2 normalization is sensitive to the presence of very large values in the data
Log Transformations
Log transformations are a type of normalization technique that applies a logarithmic function to the input data. This technique can be useful for reducing the impact of outliers and transforming data with a skewed distribution. The formula for log transformation is:
x_{transformed} = \log(x)
where x is the original (strictly positive) value; the natural logarithm is most common, but base 10 or base 2 can also be used.
Log transformations can be easily implemented using popular machine learning libraries or standard programming libraries, which usually provide built-in logarithmic functions.
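For example, here is a minimal NumPy sketch (illustrative values); np.log1p computes \log(1 + x), a common variant when the data contains zeros.

```python
import numpy as np

# Right-skewed, non-negative data (illustrative values)
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

# log(1 + x) keeps zeros valid while compressing large values
x_log1p = np.log1p(x)

# Plain natural log, valid only for strictly positive values
x_log = np.log(x[1:])

print(x_log1p)
print(x_log)
```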
Pros
- Reduces the impact of outliers
- Transforms skewed distributions to be more symmetric
- Stabilizes the variance of the data
- Can be applied to data with different scales
Cons
- Requires input data to be strictly positive (requires the addition of a constant if not)
- Can be sensitive to the choice of the logarithm base
- May not be suitable for data with a large number of zero values
Box-Cox Transformation
Box-Cox transformation is a family of normalization techniques that can be used to stabilize variance and make data more closely resemble a Gaussian distribution. The formula for the Box-Cox transformation is:
x_{transformed} = \begin{cases} \frac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(x) & \text{if } \lambda = 0 \end{cases}
where x is the original (strictly positive) value and \lambda is the transformation parameter.
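As a sketch, SciPy's scipy.stats.boxcox applies the transformation and, when \lambda is not supplied, estimates it by maximum likelihood (the data below is illustrative); scikit-learn's PowerTransformer(method='box-cox') offers the same transformation for feature matrices.

```python
import numpy as np
from scipy import stats

# Strictly positive, right-skewed data (illustrative values)
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 64.0])

# boxcox returns the transformed data together with the lambda
# estimated by maximum likelihood when lmbda is not specified
x_transformed, fitted_lambda = stats.boxcox(x)

print(fitted_lambda)
print(x_transformed)
```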
Pros
- Can stabilize variance and make data more closely resemble a Gaussian distribution
- Suitable for data with a skewed distribution
- The optimal value of \lambda can be found using well-established optimization techniques
Cons
- Requires input data to be strictly positive
- The transformation may be sensitive to the choice of \lambda
Yeo-Johnson Transformation
The Yeo-Johnson transformation is an extension of the Box-Cox transformation that can be applied to both positive and negative data. The formula for the Yeo-Johnson transformation is:
x_{transformed} = \begin{cases} \frac{(x + 1)^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0,\ x \geq 0 \\ \ln(x + 1) & \text{if } \lambda = 0,\ x \geq 0 \\ -\frac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda} & \text{if } \lambda \neq 2,\ x < 0 \\ -\ln(1 - x) & \text{if } \lambda = 2,\ x < 0 \end{cases}
where x is the original value and \lambda is the transformation parameter.
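A minimal sketch using scipy.stats.yeojohnson, which accepts negative values and estimates \lambda by maximum likelihood when it is not supplied (illustrative data); scikit-learn's PowerTransformer(method='yeo-johnson') provides the same transformation for feature matrices.

```python
import numpy as np
from scipy import stats

# Data containing negative, zero, and positive values (illustrative)
x = np.array([-5.0, -1.0, 0.0, 1.0, 3.0, 10.0, 40.0])

# yeojohnson returns the transformed data together with the lambda
# estimated by maximum likelihood when lmbda is not specified
x_transformed, fitted_lambda = stats.yeojohnson(x)

print(fitted_lambda)
print(x_transformed)
```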
Pros
- Can be applied to both positive and negative data
- Can stabilize variance and make data more closely resemble a Gaussian distribution
- The optimal value of \lambda can be found using well-established optimization techniques
Cons
- More complex than the Box-Cox transformation
- The transformation may be sensitive to the choice of \lambda
Standardization Techniques
Z-score Standardization
Z-score standardization, also known as standard score normalization, is a technique that transforms the input features to have zero mean and unit variance. This is done by subtracting the mean and dividing by the standard deviation. The formula for Z-score standardization is:
x_{standardized} = \frac{x - \mu}{\sigma}
where x is the original value, \mu is the mean of the feature, and \sigma is its standard deviation.
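A minimal NumPy sketch of Z-score standardization applied per feature (illustrative values); scikit-learn's StandardScaler implements the same transformation with a fit/transform interface.

```python
import numpy as np

# Toy feature matrix (illustrative values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 900.0]])

# Z-score standardization per feature: (x - mean) / std
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_standardized = (X - mu) / sigma

print(X_standardized.mean(axis=0))  # approximately 0 for each feature
print(X_standardized.std(axis=0))   # approximately 1 for each feature
```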
Pros
- Centers the distribution around zero and scales it to have unit variance
- Suitable for data with unknown distributions or when the assumption of a Gaussian distribution is reasonable
- Improves the performance and convergence of gradient-based optimization algorithms
Cons
- Sensitive to outliers, which can affect the mean and standard deviation
- Assumes that the data follows a Gaussian distribution
Median and Median Absolute Deviation (MAD) Standardization
Median and Median Absolute Deviation (MAD) standardization is an alternative to Z-score standardization that is more robust to outliers. Instead of using the mean and standard deviation, this technique uses the median and the median absolute deviation. The formula for MAD standardization is:
x_{standardized} = \frac{x - \text{median}(x)}{\text{MAD}(x)}
where \text{median}(x) is the median of the feature and \text{MAD}(x) = \text{median}(|x - \text{median}(x)|) is the median absolute deviation.
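A minimal NumPy sketch (illustrative values, including one outlier) that computes the MAD directly from its definition; note that some libraries additionally multiply the MAD by a consistency constant (about 1.4826) so that it matches the standard deviation for Gaussian data, which is omitted here for simplicity.

```python
import numpy as np

# One feature with an obvious outlier (illustrative values)
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 300.0])

# Median and median absolute deviation
median = np.median(x)
mad = np.median(np.abs(x - median))

# MAD standardization: (x - median) / MAD
x_standardized = (x - median) / mad

print(x_standardized)  # the bulk of the data keeps a sensible scale
```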
Pros
- More robust to outliers compared to Z-score standardization
- Suitable for data with non-Gaussian distributions or heavy-tailed distributions
Cons
- Assumes that the data is symmetric around the median
- May be computationally less efficient than Z-score standardization for large datasets, since medians are more expensive to compute than means
Robust Scaling
Robust scaling is a standardization technique that uses the interquartile range (IQR) to scale the data, making it more robust to outliers. The formula for robust scaling is:
x_{scaled} = \frac{x - \text{median}(x)}{\text{IQR}(x)}
where \text{median}(x) is the median of the feature and \text{IQR}(x) = Q_3 - Q_1 is the interquartile range, the difference between the 75th and 25th percentiles.
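A minimal NumPy sketch using the 25th and 75th percentiles (illustrative values); scikit-learn's RobustScaler implements the same idea with a fit/transform interface.

```python
import numpy as np

# One feature with an obvious outlier (illustrative values)
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 300.0])

# Median and interquartile range (IQR = Q3 - Q1)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Robust scaling: (x - median) / IQR
x_scaled = (x - median) / iqr

print(x_scaled)
```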
Pros
- More robust to outliers compared to Z-score standardization
- Suitable for data with non-Gaussian distributions or heavy-tailed distributions
- Uses the interquartile range, which is less sensitive to extreme values
Cons
- May not be suitable for data with a strong skew
- Assumes that the data is symmetric around the median