2022-07-03

Normalization and Standardization

What are Normalization and Standardization?

Normalization and standardization are two popular feature scaling techniques used to address the challenges of working with data of different scales. Both techniques aim to transform the input features to a common scale, but they differ in their approach and assumptions.

Normalization techniques typically rescale the input features to a specific range, such as [0, 1] or [-1, 1]. This can be achieved by rescaling the data according to its minimum and maximum values, by dividing by its L1 or L2 norm, or by applying transformations such as the log, Box-Cox, or Yeo-Johnson transformations. These techniques are best suited for data with a known or desired range and are particularly useful when dealing with non-Gaussian distributions.

Standardization techniques, on the other hand, transform the input features to have zero mean and unit variance. This is done by subtracting the mean and dividing by the standard deviation, effectively centering the distribution around zero. Standardization techniques, such as Z-score standardization, median and median absolute deviation (MAD) standardization, and robust scaling, are suitable for data with unknown distributions or when the assumption of a Gaussian distribution is reasonable.

Normalization Techniques

Min-Max Normalization

Min-Max normalization is a simple and widely used technique that scales the features of a dataset to a specific range, typically [0, 1]. The formula for Min-Max normalization is:

x_{normalized} = \frac{x - \min(x)}{\max(x) - \min(x)}

where x is the original value of a feature, and \min(x) and \max(x) are the minimum and maximum values of the feature, respectively. This technique is easily implemented in most programming languages and machine learning libraries.
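
As a concrete sketch, Min-Max normalization can be computed directly with NumPy or with scikit-learn's MinMaxScaler (the sample values below are purely illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])  # one feature column with illustrative values

# By hand: (x - min) / (max - min)
X_manual = (X - X.min()) / (X.max() - X.min())

# Equivalent result with scikit-learn (default feature_range is [0, 1])
X_sklearn = MinMaxScaler().fit_transform(X)

Both versions map the smallest value to 0 and the largest to 1.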

Pros

  • Easy to understand and implement
  • Suitable for data with a known or desired range
  • Preserves the shape of the original distribution, since it is a linear rescaling

Cons

  • Sensitive to outliers, which can lead to the compression of the majority of the data in a small range
  • Not suitable for data with an unknown or infinite range

L1 and L2 Normalization

L1 and L2 normalization are techniques that scale the data based on their L1 or L2 norms, respectively. The L1 norm is the sum of the absolute values of the feature vector, while the L2 norm is the square root of the sum of the squared values of the feature vector. The formulas for L1 and L2 normalization are:

  • L1 Normalization: x_{normalized} = \frac{x}{||x||_1}
  • L2 Normalization: x_{normalized} = \frac{x}{||x||_2}

where x is the original feature vector and ||x||_1 and ||x||_2 are the L1 and L2 norms of the feature vector, respectively. These techniques can be easily implemented using popular machine learning libraries.
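
As a sketch, both norms are available through scikit-learn's normalize helper, which rescales each row of a matrix by its own norm (the sample vectors below are illustrative):

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1.0, -2.0, 2.0],
              [4.0,  0.0, 3.0]])  # each row is treated as one feature vector

X_l1 = normalize(X, norm='l1')  # divide each row by the sum of its absolute values
X_l2 = normalize(X, norm='l2')  # divide each row by its Euclidean length

# Manual check of the L2 case for the first row
assert np.allclose(X_l2[0], X[0] / np.linalg.norm(X[0]))

Note that scikit-learn applies the scaling per sample (per row); transpose the matrix first if the intent is to normalize each feature column instead.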

Pros

  • Less sensitive to outliers compared to Min-Max normalization
  • L1 normalization creates sparse feature vectors, which can be useful for feature selection and dimensionality reduction
  • L2 normalization is invariant to the scale and rotation of the input data

Cons

  • L1 normalization may not be suitable for data with a large number of zero values
  • L2 normalization is sensitive to the presence of very large values in the data

Log Transformations

Log transformations are a type of normalization technique that applies a logarithmic function to the input data. This technique can be useful for reducing the impact of outliers and transforming data with a skewed distribution. The formula for log transformation is:

x_{normalized} = \log(x + k)

where x is the original value of a feature, and k is a small constant added to avoid taking the logarithm of zero. Common choices for the logarithmic function include the natural logarithm (base e), the common logarithm (base 10), and the binary logarithm (base 2).

Log transformations can be easily implemented using popular machine learning libraries or standard programming libraries, which usually provide built-in logarithmic functions.
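
A minimal sketch with NumPy, using k = 1 (the values below are illustrative):

import numpy as np

x = np.array([0.0, 1.0, 10.0, 1000.0])  # right-skewed illustrative feature

x_log = np.log1p(x)        # natural log of (x + 1), numerically stable near zero
x_log10 = np.log10(x + 1)  # the same idea with base 10, which only rescales the result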

Pros

  • Reduces the impact of outliers
  • Transforms skewed distributions to be more symmetric
  • Stabilizes the variance of the data
  • Can be applied to data with different scales

Cons

  • Requires the input data to be strictly positive (otherwise a small constant must be added first)
  • Can be sensitive to the choice of the logarithm base
  • May not be suitable for data with a large number of zero values

Box-Cox Transformation

Box-Cox transformation is a family of normalization techniques that can be used to stabilize variance and make data more closely resemble a Gaussian distribution. The formula for the Box-Cox transformation is:

x_{normalized} = \frac{x^\lambda - 1}{\lambda} if \lambda \neq 0

x_{normalized} = \log(x) if \lambda = 0

where x is the original value of a feature, and \lambda is a parameter that determines the power of the transformation. The optimal value of \lambda can be found using maximum likelihood estimation or other optimization techniques. Box-Cox transformation requires the input data to be strictly positive, so a constant may need to be added to the data before applying the transformation.
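
A minimal sketch using SciPy, which estimates \lambda by maximum likelihood when it is not supplied (the generated data below is illustrative):

import numpy as np
from scipy.stats import boxcox

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive, right-skewed data

x_boxcox, fitted_lambda = boxcox(x)  # returns the transformed data and the fitted lambda
print(f"estimated lambda: {fitted_lambda:.3f}")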

Pros

  • Can stabilize variance and make data more closely resemble a Gaussian distribution
  • Suitable for data with a skewed distribution
  • The optimal value of \lambda can be found using well-established optimization techniques

Cons

  • Requires input data to be strictly positive
  • The transformation may be sensitive to the choice of \lambda

Yeo-Johnson Transformation

The Yeo-Johnson transformation is an extension of the Box-Cox transformation that can be applied to both positive and negative data. The formula for the Yeo-Johnson transformation is:

x_{normalized} = \begin{cases} \frac{(x + 1)^\lambda - 1}{\lambda} & \text{if}\quad \lambda \neq 0 \quad\text{and}\quad x \geq 0 \\ \frac{-((-x + 1)^{2-\lambda} - 1)}{2 - \lambda} & \text{if}\quad \lambda \neq 2 \quad\text{and}\quad x < 0 \\ \log(x + 1) & \text{if}\quad \lambda = 0 \quad\text{and}\quad x \geq 0 \\ -\log(-x + 1) & \text{if}\quad \lambda = 2 \quad\text{and}\quad x < 0 \end{cases}

where x is the original value of a feature, and \lambda is a parameter that determines the power of the transformation. Similar to the Box-Cox transformation, the optimal value of \lambda can be found using maximum likelihood estimation or other optimization techniques.
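
A minimal sketch with scikit-learn's PowerTransformer, which fits \lambda by maximum likelihood (the mixed-sign values below are illustrative):

import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[-3.0], [-1.0], [0.0], [2.0], [10.0]])  # one feature with negative and positive values

# standardize=False keeps only the power transformation; the default (True) would
# also rescale the result to zero mean and unit variance afterwards
pt = PowerTransformer(method='yeo-johnson', standardize=False)
X_yj = pt.fit_transform(X)
print(pt.lambdas_)  # the estimated lambda for each feature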

Pros

  • Can be applied to both positive and negative data
  • Can stabilize variance and make data more closely resemble a Gaussian distribution
  • The optimal value of \lambda can be found using well-established optimization techniques

Cons

  • More complex than the Box-Cox transformation
  • The transformation may be sensitive to the choice of \lambda

Standardization Techniques

Z-score Standardization

Z-score standardization, also known as standard score normalization, is a technique that transforms the input features to have zero mean and unit variance. This is done by subtracting the mean and dividing by the standard deviation. The formula for Z-score standardization is:

x_{normalized} = \frac{x - \mu}{\sigma}

where x is the original value of a feature, \mu is the mean of the feature, and \sigma is the standard deviation of the feature. Z-score standardization can be easily implemented using popular machine learning libraries.
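
A minimal sketch with NumPy and scikit-learn's StandardScaler (the values below are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature column with illustrative values

# By hand: subtract the mean, divide by the standard deviation
X_manual = (X - X.mean()) / X.std()

# Equivalent with scikit-learn (which also uses the population standard deviation)
X_sklearn = StandardScaler().fit_transform(X)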

Pros

  • Centers the distribution around zero and scales it to have unit variance
  • Suitable for data with unknown distributions or when the assumption of a Gaussian distribution is reasonable
  • Improves the performance and convergence of gradient-based optimization algorithms

Cons

  • Sensitive to outliers, which can affect the mean and standard deviation
  • Assumes that the data follows a Gaussian distribution

Median and Median Absolute Deviation (MAD) Standardization

Median and Median Absolute Deviation (MAD) standardization is an alternative to Z-score standardization that is more robust to outliers. Instead of using the mean and standard deviation, this technique uses the median and the median absolute deviation. The formula for MAD standardization is:

x_{normalized} = \frac{x - \text{median}(x)}{\text{MAD}(x)}

where x is the original value of a feature, \text{median}(x) is the median of the feature, and \text{MAD}(x) is the median absolute deviation of the feature. MAD standardization can be implemented using popular machine learning libraries.
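
A minimal sketch with NumPy and SciPy (median_abs_deviation is available in recent SciPy versions; the values below are illustrative):

import numpy as np
from scipy.stats import median_abs_deviation

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # illustrative feature with one large outlier

median = np.median(x)
mad = median_abs_deviation(x)  # median of |x - median(x)|

x_mad = (x - median) / mad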

Pros

  • More robust to outliers compared to Z-score standardization
  • Suitable for data with non-Gaussian distributions or heavy-tailed distributions

Cons

  • Assumes that the data is symmetric around the median
  • Computing medians requires sorting or selection, so it can be slower than Z-score standardization on large datasets

Robust Scaling

Robust scaling is a standardization technique that centers the data on the median and scales it by the interquartile range (IQR), making it more robust to outliers. The formula for robust scaling is:

x_{normalized} = \frac{x - \text{median}(x)}{Q_3(x) - Q_1(x)}

where x is the original value of a feature, \text{median}(x) is the median of the feature, and Q_1(x) and Q_3(x) are its first quartile (25th percentile) and third quartile (75th percentile), respectively. Robust scaling can be implemented using popular machine learning libraries.
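
A minimal sketch with NumPy and scikit-learn's RobustScaler, which by default centers on the median and scales by the 25th to 75th percentile range (the values below are illustrative):

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # illustrative feature with one large outlier

# By hand: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(X, [25, 50, 75])
X_manual = (X - median) / (q3 - q1)

# Equivalent with scikit-learn
X_sklearn = RobustScaler().fit_transform(X)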

Pros

  • More robust to outliers compared to Z-score standardization
  • Suitable for data with non-Gaussian distributions or heavy-tailed distributions
  • Uses the interquartile range, which is less sensitive to extreme values

Cons

  • May not be suitable for data with a strong skew
  • Assumes that the data is symmetric around the median
